English Pages XXIX, 1182 [1158] Year 2021
Advances in Intelligent Systems and Computing 1166
Deepak Gupta · Ashish Khanna · Siddhartha Bhattacharyya · Aboul Ella Hassanien · Sameer Anand · Ajay Jaiswal Editors
International Conference on Innovative Computing and Communications Proceedings of ICICC 2020, Volume 2
Advances in Intelligent Systems and Computing Volume 1166
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Győr, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **
More information about this series at http://www.springer.com/series/11156
Editors
Deepak Gupta, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Ashish Khanna, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, Karnataka, India
Aboul Ella Hassanien, Department of Information Technology, Faculty of Computers and Information, Cairo University, Giza, Egypt
Sameer Anand, Department of Computer Science, Shaheed Sukhdev College of Business Studies, University of Delhi, Rohini, Delhi, India
Ajay Jaiswal, Department of Computer Science, Shaheed Sukhdev College of Business Studies, University of Delhi, Rohini, Delhi, India
ISSN 2194-5357
ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-981-15-5147-5
ISBN 978-981-15-5148-2 (eBook)
https://doi.org/10.1007/978-981-15-5148-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Dr. Deepak Gupta would like to dedicate this book to his father Sh. R. K. Gupta and his mother Smt. Geeta Gupta for their constant encouragement, to his family members including his wife, brothers, sisters and kids, and to his students, who are close to his heart.

Dr. Ashish Khanna would like to dedicate this book to his mentors Dr. A. K. Singh and Dr. Abhishek Swaroop for their constant encouragement and guidance, and to his family members including his mother, wife and kids. He would also like to dedicate this work to his (Late) father Sh. R. C. Khanna, with folded hands, for his constant blessings.

Prof. (Dr.) Siddhartha Bhattacharyya would like to dedicate this book to his father Late Ajit Kumar Bhattacharyya, his mother Late Hashi Bhattacharyya, his beloved wife Rashni and his colleagues Jayanta Biswas and Debabrata Samanta.

Prof. (Dr.) Aboul Ella Hassanien would like to dedicate this book to his wife Azza Hassan El-Saman.

Dr. Sameer Anand would like to dedicate this book to his Dada Prof. D. C. Choudhary, his beloved wife Shivanee and his son Shashwat.

Dr. Ajay Jaiswal would like to dedicate this book to his father Late Prof. U. C. Jaiswal, his mother Brajesh Jaiswal, his beloved wife Anjali, his daughter Prachii and his son Sakshaum.
ICICC-2020 Steering Committee Members
Patrons
Dr. Poonam Verma, Principal, SSCBS, University of Delhi
Prof. Dr. Pradip Kumar Jain, Director, National Institute of Technology Patna, India

General Chairs
Prof. Dr. Siddhartha Bhattacharyya, Christ University, Bengaluru
Dr. Prabhat Kumar, National Institute of Technology Patna, India

Honorary Chairs
Prof. Dr. Janusz Kacprzyk, FIEEE, Polish Academy of Sciences, Poland
Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava, Czech Republic

Conference Chairs
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt
Prof. Dr. Joel J. P. C. Rodrigues, National Institute of Telecommunications (Inatel), Brazil
Prof. Dr. R. K. Agrawal, Jawaharlal Nehru University, Delhi
Technical Program Chairs
Prof. Dr. Victor Hugo C. de Albuquerque, Universidade de Fortaleza, Brazil
Prof. Dr. A. K. Singh, National Institute of Technology, Kurukshetra
Prof. Dr. Anil K Ahlawat, KIET Group of Institutes, Ghaziabad

Editorial Chairs
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi
Dr. Arun Sharma, Indira Gandhi Delhi Technical University for Women, Delhi
Prerna Sharma, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi

Conveners
Dr. Ajay Jaiswal, SSCBS, University of Delhi
Dr. Sameer Anand, SSCBS, University of Delhi
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Gulshan Shrivastava, National Institute of Technology Patna, India

Publication Chairs
Prof. Dr. Neeraj Kumar, Thapar Institute of Engineering and Technology
Dr. Mohamed Elhoseny, University of North Texas
Dr. Hari Mohan Pandey, Edge Hill University, UK
Dr. Sahil Garg, École de technologie supérieure, Université du Québec, Montreal, Canada

Publicity Chairs
Dr. M. Tanveer, Indian Institute of Technology, Indore, India
Dr. Jafar A. Alzubi, Al-Balqa Applied University, Salt, Jordan
Dr. Hamid Reza Boveiri, Sama College, IAU, Shoushtar Branch, Shoushtar, Iran
Co-convener
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India

Organizing Chairs
Dr. Kumar Bijoy, SSCBS, University of Delhi
Dr. Rishi Ranjan Sahay, SSCBS, University of Delhi

Organizing Team
Dr. Gurjeet Kaur, SSCBS, University of Delhi
Dr. Aditya Khamparia, Lovely Professional University, Punjab, India
Dr. Abhimanyu Verma, SSCBS, University of Delhi
Dr. Onkar Singh, SSCBS, University of Delhi
Kalpna Sagar, KIET Group of Institutes, Ghaziabad
Preface
We are delighted to announce that Shaheed Sukhdev College of Business Studies, New Delhi, in association with the National Institute of Technology Patna and the University of Valladolid, Spain, hosted the eagerly awaited and much-coveted International Conference on Innovative Computing and Communication (ICICC-2020). The third edition of the conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, and received abstracts from more than 3,200 authors from different parts of the world. The committee of professionals dedicated to the conference strove to achieve a high-quality technical program with tracks on Innovative Computing, Innovative Communication Network and Security, and Internet of Things. All the tracks chosen for the conference are interrelated and are very popular among the present-day research community, so a great deal of research is under way in these tracks and their related sub-areas. As the name of the conference begins with the word ‘innovative’, it targeted out-of-the-box ideas, methodologies, applications, expositions, surveys and presentations that help to advance the current state of research. More than 800 full-length papers were received, with contributions focused on theoretical work, computer simulation-based research, and laboratory-scale experiments. Of these manuscripts, 196 papers have been included in the Springer proceedings after a thorough two-stage review and editing process. All the manuscripts submitted to ICICC-2020 were peer-reviewed by at least two independent reviewers, who were provided with a detailed review proforma. The comments from the reviewers were communicated to the authors, who incorporated the suggestions in their revised manuscripts. The recommendations from the two reviewers were taken into consideration when selecting a manuscript for inclusion in the proceedings.
The exhaustiveness of the review process is evident, given the large number of articles received addressing a wide range of research areas. The stringent review process ensured that each published manuscript met rigorous academic and scientific standards. It is an exalting experience to finally see these elite contributions materialize into two book volumes as the ICICC-2020 proceedings, published by Springer under the title International Conference on Innovative Computing and Communications.
The articles are organized into two volumes under broad categories covering machine learning, data mining, big data, networks, soft computing, and cloud computing, although, given the diverse areas of research reported, this might not always have been possible. ICICC-2020 invited six keynote speakers, eminent researchers in the field of computer science and engineering, from different parts of the world. In addition to the plenary sessions on each day of the conference, fifteen concurrent technical sessions were held every day to accommodate the oral presentation of around 195 accepted papers. The keynote speakers and session chairs for each of the concurrent sessions were leading researchers from the thematic area of the session. A technical exhibition was held during all 3 days of the conference, displaying the latest technologies, expositions, ideas and presentations. The delegates were provided with a book of extended abstracts so that they could quickly browse through the contents and participate in the presentations, and so that the work was accessible to a broad audience. The research part of the conference was organized in a total of 45 special sessions. These special sessions gave researchers working in specific areas the opportunity to present their results in a more focused environment. An international conference of such magnitude, and the release of the ICICC-2020 proceedings by Springer, has been the remarkable outcome of the untiring efforts of the entire organizing team. The success of an event undoubtedly involves the painstaking efforts of several contributors at different stages, driven by their devotion and sincerity. Fortunately, since the beginning of its journey, ICICC-2020 has received support and contributions from every corner. We thank all who have wished the best for ICICC-2020 and contributed by any means towards its success.
The edited proceedings volumes by Springer would not have been possible without the perseverance of all the steering, advisory and technical program committee members. The organizers of ICICC-2020 owe thanks to all the contributing authors for their interest and exceptional articles. We would also like to thank the authors of the papers for adhering to the time schedule and for incorporating the review comments. We wish to extend our heartfelt acknowledgment to the authors, peer-reviewers, committee members and production staff whose diligent work gave shape to the ICICC-2020 proceedings. We especially want to thank our dedicated team of peer-reviewers, who volunteered for the arduous and tedious step of quality checking and critiquing the submitted manuscripts. We wish to thank our faculty colleagues Mr. Moolchand Sharma and Ms. Prerna Sharma for extending their enormous assistance during the conference. The time they spent and the midnight oil they burnt are greatly appreciated, and we will ever remain indebted to them. The management, faculty, administrative and support staff of the college have always extended their services whenever needed, for which we remain thankful to them.
Lastly, we would like to thank Springer for accepting our proposal to publish the ICICC-2020 conference proceedings. The help received from Mr. Aninda Bose, the acquisition senior editor, during the process has been very useful.

Rohini, India
Ashish Khanna
Deepak Gupta
Organizers, ICICC-2020
About This Book
The International Conference on Innovative Computing and Communication (ICICC-2020) was held on 21–23 February at Shaheed Sukhdev College of Business Studies in association with the National Institute of Technology Patna and the University of Valladolid, Spain. The conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, and received papers from more than 3200 authors from different parts of the world. Only 195 papers were accepted and registered, an acceptance ratio of 24%, for publication in two volumes of Springer's prestigious Advances in Intelligent Systems and Computing (AISC) series. This volume includes a total of 98 papers.
Contents
A Dummy Location Generation Model for Location Privacy in Vehicular Ad hoc Networks . . . 1
    Bhawna Chaudhary and Karan Singh
Evaluating User Influence in Social Networks Using k-core . . . 11
    N. Govind and Rajendra Prasad Lal
Depression Anatomy Using Combinational Deep Neural Network . . . 19
    Apeksha Rustagi, Chinkit Manchanda, Nikhil Sharma, and Ila Kaushik
A Hybrid Cost-Effective Genetic and Firefly Algorithm for Workflow Scheduling in Cloud . . . 35
    Ishadeep Kaur and P. S. Mann
Flexible Dielectric Resonator Antenna Using Polydimethylsiloxane Substrate as Dielectric Resonator for Breast Cancer Diagnostics . . . 47
    Doondi Kumar Janapala and Moses Nesasudha
Machine Learning-Based Prototype for Restaurant Rating Prediction and Cuisine Selection . . . 57
    Kunal Bikram Dutta, Aman Sahu, Bharat Sharma, Siddharth S. Rautaray, and Manjusha Pandey
Deeper into Image Classification . . . 69
    Jatin Bindra, Bulla Rajesh, and Savita Ahlawat
Investigation of Ionospheric Total Electron Content (TEC) During Summer Months for Ionosphere Modeling in Indian Region Using Dual-Frequency NavIC System . . . 83
    Sharat Chandra Bhardwaj, Anurag Vidyarthi, B. S. Jassal, and A. K. Shukla
An Improved Terrain Profiling System with High-Precision Range Measurement Method for Underwater Surveyor Robot . . . 93
    Maneesha and Praveen Kant Pandey
Prediction of Diabetes Mellitus: Comparative Study of Various Machine Learning Models . . . 103
    Arooj Hussain and Sameena Naaz
Tracking of Soccer Players Using Optical Flow . . . 117
    Chetan G. Kagalagomb and Sunanda Dixit
Selection of Probabilistic Data Structures for SPV Wallet Filtering . . . 131
    Adeela Faridi and Farheen Siddiqui
Hybrid BF-PSO Algorithm for Automatic Voltage Regulator System . . . 145
    Hotha Uday Kiran and Sharad Kumar Tiwari
Malware Classification Using Multi-layer Perceptron Model . . . 155
    Jagsir Singh and Jaswinder Singh
Protocol Random Forest Model to Enhance the Effectiveness of Intrusion Detection Identification . . . 169
    Thet Thet Htwe and Nang Saing Moon Kham
Detecting User’s Spreading Influence Using Community Structure and Label Propagation . . . 179
    Sanjay Kumar, Khyati Grover, Lakshay Singhla, and Kshitij Jindal
Bagging- and Boosting-Based Latent Fingerprint Image Classification and Segmentation . . . 189
    Megha Chhabra, Manoj Kumar Shukla, and Kiran Kumar Ravulakollu
Selecting Social Robot by Understanding Human–Robot Interaction . . . 203
    Kiran Jot Singh, Divneet Singh Kapoor, and Balwinder Singh Sohi
Flooding and Forwarding Based on Efficient Routing Protocol . . . 215
    Preshi Godha, Swati Jadon, Anshi Patle, Ishu Gupta, Bharti Sharma, and Ashutosh Kumar Singh
Hardware-Based Parallelism Scheme for Image Steganography Speed up . . . 225
    Seyedeh Haleh Seyed Dizaji, Mina Zolfy Lighvan, and Ali Sadeghi
Predicting Group Size for Software Issues in an Open-Source Software Development Environment . . . 237
    Deepti Chopra and Arvinder Kaur
Wireless CNC Plotter for PCB Using Android Application . . . 247
    Ashish Kumar, Shubham Kumar, and Swati Shukla
Epileptic Seizure Detection Using Machine Learning Techniques . . . 255
    Sudesh Kumar, Rekh Ram Janghel, and Satya Prakash Sahu
Analysis of Minimum Support Price Prediction for Indian Crops Using Machine Learning and Numerical Methods . . . 267
    Sarthak Gupta, Akshara Agarwal, Paluck Deep, Saurabh Vaish, and Archana Purwar
A Robust Methodology for Creating Large Image Datasets Using a Universal Format . . . 279
    Gurpartap Singh, Sunil Agrawal, and B. S. Sohi
A Comparative Empirical Evaluation of Topic Modeling Techniques . . . 289
    Pooja Kherwa and Poonam Bansal
An Analysis of Machine Learning Techniques for Flood Mitigation . . . 299
    Vinay Dubey and Rahul Katarya
Link Prediction in Complex Network: Nature Inspired Gravitation Force Approach . . . 309
    Sanjay Kumar, Utkarsh Chaudhary, Rewant Kedia, and Tushar Singhal
Hridaya Kalp: A Prototype for Second Generation Chronic Heart Disease Detection and Classification . . . 321
    Harshit Anand, Abhishek Anand, Indrashis Das, Siddharth S. Rautaray, and Manjusha Pandey
Two-Stream Mid-Level Fusion Network for Human Activity Detection . . . 331
    Mainak Chakraborty, Alik Pramanick, and Sunita Vikrant Dhavale
Content Classification Using Active Learning Approach . . . 345
    Neha Bansal, Arun Sharma, and R. K. Singh
Analysis of NavIC Multipath Signal Sensitivity for Soil Moisture in Presence of Vegetation . . . 353
    Vivek Chamoli, Rishi Prakash, Anurag Vidyarthi, and Ananya Ray
Uncovering Employee Job Satisfaction Using Machine Learning: A Case Study of Om Logistics Ltd . . . 365
    Diksha Jain, Sandhya Makkar, Lokesh Jindal, and Mukta Gupta
Transaction Privacy Preservations for Blockchain Technology . . . 377
    Bharat Bhushan and Nikhil Sharma
EEG Artifact Removal Techniques: A Comparative Study . . . 395
    Mridu Sahu, Samrudhi Mohdiwale, Namrata Khoriya, Yogita Upadhyay, Anjali Verma, and Shikha Singh
Evolution of Time-Domain Feature for Classification of Two-Class Motor Imagery Data . . . 405
    Rahul Kumar, Mridu Sahu, and Samrudhi Mohdiwale
Finding Influential Spreaders in Weighted Networks Using Weighted-Hybrid Method . . . 415
    Sanjay Kumar, Yash Raghav, and Bhavya Nag
Word-Level Sign Language Gesture Prediction Under Different Conditions . . . 427
    Monika Arora, Priyanshu Mehta, Divyanshu Mittal, and Prachi Bajaj
Firefly Algorithm-Based Optimized Controller for Frequency Control of an Autonomous Multi-Microgrid . . . 437
    Kshetrimayum Millaner Singh, Sadhan Gope, and Nicky Pradhan
Abnormal Activity-Based Video Synopsis by Seam Carving for ATM Surveillance Applications . . . 453
    B. Yogameena and R. Janani
Behavioral Analysis from Online Data Using Temporal Graphs . . . 463
    Anam Iqbal and Farheen Siddiqui
Medical Data Analysis Using Machine Learning with KNN . . . 473
    Sabyasachi Mohanty, Astha Mishra, and Ankur Saxena
Insight to Model Clone’s Differentiation, Classification, and Visualization . . . 487
    Ritu Garg and R. K. Singh
Predicting Socio-economic Features for Indian States Using Satellite Imagery . . . 497
    Pooja Kherwa, Savita Ahlawat, Rishabh Sobti, Sonakshi Mathur, and Gunjan Mohan
Semantic Space Autoencoder for Cross-Modal Data Retrieval . . . 509
    Shaily Malik and Poonam Bansal
A Novel Approach to Classify Cardiac Arrhythmia Using Different Machine Learning Techniques . . . 517
    Parag Jain, C. S. Arjun Babu, Sahana Mohandoss, Nidhin Anisham, Shivakumar Gadade, A. Srinivas, and Rajasekar Mohan
Offline Handwritten Mathematical Expression Evaluator Using Convolutional Neural Network . . . 527
    Amit Choudhary, Savita Ahlawat, Harsh Gupta, Aniruddha Bhandari, Ankur Dhall, and Manish Kumar
An Empirical Study on Diabetes Mellitus Prediction Using Apriori Algorithm . . . 539
    Md. Tanvir Islam, M. Raihan, Fahmida Farzana, Promila Ghosh, and Shakil Ahmed Shaj
An Overview of Ultra-Wide Band Antennas for Detecting Early Stage of Breast Cancer . . . 551
    M. K. Anooradha, A. Amir Anton Jone, Anita Jones Mary Pushpa, V. Neethu Susan, and T. Beril Lynora
Single Image Haze Removal Using Hybrid Filtering Method . . . 561
    K. P. Senthilkumar and P. Sivakumar
An Optimized Multilayer Outlier Detection for Internet of Things (IoT) Network as Industry 4.0 Automation and Data Exchange . . . 571
    Adarsh Kumar and Deepak Kumar Sharma
Microscopic Image Noise Reduction Using Mathematical Morphology . . . 585
    Mangala Shetty and R. Balasubramani
A Decision-Based Multi-layered Outlier Detection System for Resource Constraint MANET . . . 595
    Adarsh Kumar and P. Srikanth
Orthonormal Wavelet Transform for Efficient Feature Extraction for Sensory-Motor Imagery Electroencephalogram Brain–Computer Interface . . . 611
    Poonam Chaudhary and Rashmi Agrawal
Performance of RPL Objective Functions Using FIT IoT Lab . . . 623
    Spoorthi P. Shetty and Udaya Kumar K. Shenoy
Predictive Analytics for Retail Store Chain . . . 631
    Sandhya Makkar, Arushi Sethi, and Shreya Jain
Object Identification in Satellite Imagery and Enhancement Using Generative Adversarial Networks . . . 643
    Pranav Pushkar, Lakshay Aggarwal, Mohammad Saad, Aditya Maheshwari, Harshit Awasthi, and Preeti Nagrath
Keyword Template Based Semi-supervised Topic Modelling in Tweets . . . 659
    Greeshma N. Gopal, Binsu C. Kovoor, and U. Mini
A Community Interaction-Based Routing Protocol for Opportunistic Networks . . . 667
    Deepak Kumar Sharma, Shrid Pant, and Rinky Dwivedi
Performance Analysis of the ML Prediction Models for the Detection of Sybil Accounts in an OSN . . . 681
    Ankita Kumari and Manu Sood
Exploring Feature Selection Technique in Detecting Sybil Accounts in a Social Network . . . 695
    Shradha Sharma and Manu Sood
Implementation of Ensemble-Based Prediction Model for Detecting Sybil Accounts in an OSN . . . 709
    Priyanka Roy and Manu Sood
Performance Analysis of Impact of Network Topologies on Different Controllers in SDN . . . 725
    Dharmender Kumar and Manu Sood
Bees Classifier Using Soft Computing Approaches . . . 737
    Abhilakshya Agarwal and Rahul Pradhan
Fuzzy Trust Based Secure Routing Protocol for Opportunistic Internet of Things . . . 749
    Nisha Kandhoul and S. K. Dhurandher
Student’s Performance Prediction Using Data Mining Technique Depending on Overall Academic Status and Environmental Attributes . . . 757
    Syeda Farjana Shetu, Mohd Saifuzzaman, Nazmun Nessa Moon, Sharmin Sultana, and Ridwanullah Yousuf
Evaluate and Predict Concentration of Particulate Matter (PM2.5) Using Machine Learning Approach . . . 771
    Shaon Hossain Sani, Akramkhan Rony, Fyruz Ibnat Karim, M. F. Mridha, and Md. Abdul Hamid
Retrieval of Frequent Itemset Using Improved Mining Algorithm in Hadoop . . . 787
    Sandhya Sandeep Waghere, PothuRaju RajaRajeswari, and Vithya Ganesan
Number Plate Recognition System for Vehicles Using Machine Learning Approach . . . 799
    Md. Amzad Hossain, Istiaque Ahmed Suvo, Amitabh Ray, Md. Ariful Islam Malik, and M. F. Mridha
The Model to Determine the Location and the Date by the Length of Shadow of Objects for Communication Networks . . . 815
    Renrui Zhang
CW-CAE: Pulmonary Nodule Detection from Imbalanced Dataset Using Class-Weighted Convolutional Autoencoder . . . 825
    Seba Susan, Dhaarna Sethi, and Kriti Arora
SORTIS: Sharing of Resources in Cloud Framework Using CloudSim Tool . . . 835
    Kushagra Gupta and Rahul Johari
Predicting Diabetes Using ML Classification Techniques . . . 845
    Geetika Vashisht, Ashish Kumar Jha, and Manisha Jailia
Er–Yb Co-doped Fibre Amplifier Performance Enhancement for Super-Dense WDM Applications . . . 855
    Anurupa Lubana, Sanmukh Kaur, and Yugnanda Malhotra
Seizure Detection from Intracranial Electroencephalography Recordings . . . 867
    Pranjal Naman, Satyarth Vats, Monarch Batra, Raunaq Bhalla, and Smriti Srivastava
Reader: Speech Synthesizer and Speech Recognizer . . . 877
    Mohammad Muzammil Khan and Anam Saiyeda
Comparing CNN Architectures for Gait Recognition Using Optical Flows . . . 887
    Sumit Sarin, Anirudh Chugh, Antriksh Mittal, and Smriti Srivastava
Digital Identity Management System Using Blockchain Technology . . . 895
    Ei Shwe Sin and Thinn Thu Naing
Enhancing Redundant Content Elimination Algorithm Using Processing Power of Multi-Core Architecture . . . 907
    Rahul Saxena and Monika Jain
Matched Filter Design Using Dynamic Histogram for Power Quality Events Detection . . . 921
    Manish Kumar Saini and Rajender Kumar Beniwal
Managing Human (Social) Capital in Medium to Large Companies Using Organizational Network Analysis: Monoplex Network Approach with the Application of Highly Interactive Visual Dashboards . . . 937
    Srečko Zajec, Leo Mrsic, and Robert Kopal
Gender and Age Estimation from Gait: A Review . . . 947
    Tawqeer Ul Islam, Lalit Kumar Awasthi, and Urvashi Garg
Parkinson’s Disease Detection Through Visual Deep Learning . . . 963
    Vasudev Awatramani and Deepak Gupta
Architecture and Framework Enabling Internet of Vehicles Towards Intelligent Transportation System
R. Manaswini, B. Saikrishna, and Nishu Gupta

Group Data Sharing and Auditing While Securing Sensitive Information
Shubham Singh and Deepti Aggarwal

Novel Umbrella 360 Cloud Seeding Based on Self-landing Reusable Hybrid Rocket
Satyabrat Shukla, Gautam Singh, Saikat Kumar Sarkar, and Purnima Lala Mehta

User Detection Using Cyclostationary Feature Detection in Cognitive Radio Networks with Various Detection Criteria
Budati Anil Kumar, V. Hima Bindu, and N. Swetha

Fuzzy-Based DBSCAN Algorithm to Elect Master Cluster Head and Enhance the Network Lifetime and Avoid Redundancy in Wireless Sensor Network
Tripti Sharma, Amar Mohapatra, and Geetam Tomar

Water Quality Evaluation Using Soft Computing Method
Shivam Bhardwaj, Deepak Gupta, and Ashish Khanna

Crowd Estimation of Real-Life Images with Different View-Points
Md Shah Fahad and Akshay Deepak

Scalable Machine Learning in C++ (CAMEL)
Moolchand Sharma, Anshuman Raina, Kashish Khullar, Harshit Khandelwal, and Saumye Mehrotra

Intelligent Gateway for Data-Centric Communication in Internet of Things
Rohit Raj, Akash Sinha, Prabhat Kumar, and M. P. Singh

A Critical Review: SANET and Other Variants of Ad Hoc Networks
Ekansh Chauhan, Manpreet Sirswal, Deepak Gupta, and Ashish Khanna

HealthStack–A Decentralized Medical Record Storage Application
Mayank Bansal, Kalpna Sagar, and Anil Ahlawat

AEECC-SEP: Ant-Based Energy Efficient Condensed Cluster Stable Election Protocol in Wireless Sensor Network
Tripti Sharma, Amar Mohapatra, and Geetam Tomar
Measurement and Modeling of DTCR Software Parameters Based on Intranet Wide Area Measurement System for Smart Grid Applications
Mohammad Kamrul Hasan, Musse Mohamud Ahmed, and Sherfriz Sherry Musa

Dynamic Load Modeling and Parameter Estimation of 132/275 KV Using PMU-Based Wide Area Measurement System
Musse Mohamud Ahmed, Mohammad Kamrul Hasanl, and Noor Shamilawani Farhana Yusoff

Enhanced Approach for Android Malware Detection
Gulshan Shrivastava and Prabhat Kumar

Author Index
About the Editors
Dr. Deepak Gupta is an eminent academician who balances lectures, research, publications, consultancy, community service, and Ph.D. and postdoctoral supervision. With 12 years of teaching experience and two years in industry, he focuses on rational and practical learning. He has contributed extensively to the literature in the fields of human–computer interaction, intelligent data analysis, nature-inspired computing, machine learning and soft computing. He has served as Editor-in-Chief, Guest Editor, and Associate Editor of SCI and various other reputed journals. He completed his postdoc at Inatel, Brazil, and his Ph.D. at Dr. APJ Abdul Kalam Technical University. He has authored/edited 33 books with national and international publishers (Elsevier, Springer, Wiley, Katson). He has published 105 scientific research publications in reputed international journals and conferences, including 53 in SCI-indexed journals of IEEE, Elsevier, Springer, Wiley and many more. He is the convener and organizer of the ‘ICICC’ Springer Conference Series.

Dr. Ashish Khanna has 16 years of expertise in teaching, entrepreneurship, and research and development. He received his Ph.D. degree from the National Institute of Technology, Kurukshetra, and his M.Tech. and B.Tech. from GGSIPU, Delhi. He completed his postdoc at the Internet of Things Lab at Inatel, Brazil, and the University of Valladolid, Spain. He has published around 45 SCI-indexed papers in IEEE Transactions, Springer, Elsevier, Wiley and many more reputed journals, with a cumulative impact factor above 100. He has around 100 research articles in top SCI/Scopus journals, conferences and book chapters. He is co-author of around 20 edited books and textbooks. His research interests include distributed systems, MANET, FANET, VANET, IoT, machine learning and many more. He is the originator of Bhavya Publications and Universal Innovator Lab.
Universal Innovator is actively involved in research, innovation, conferences, startup funding events and workshops. He has served the research field as a Keynote Speaker, Faculty Resource Person, Session Chair, Reviewer, and TPC member, and in postdoctoral supervision. He is the convener and organizer of the ICICC conference series. He is currently working at the Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, under GGSIPU, Delhi, India. He is also serving as Series Editor for the Elsevier and De Gruyter publishing houses.

Dr. Siddhartha Bhattacharyya is currently serving as a Professor in the Department of Computer Science and Engineering of Christ University, Bangalore. He is a co-author of 5 books and the co-editor of 50 books, and has more than 250 research publications in international journals and conference proceedings to his credit. He has two PCTs to his credit. He has been a member of the organizing and technical program committees of several national and international conferences. His research interests include hybrid intelligence, pattern recognition, multimedia data processing, social networks and quantum computing. He is also a certified Chartered Engineer of the Institution of Engineers (IEI), India. He is on the Board of Directors of the International Institute of Engineering and Technology (IETI), Hong Kong. He is a privileged inventor of NOKIA.

Dr. Aboul Ella Hassanien is the Founder and Head of the Egyptian Scientific Research Group (SRGE). He has more than 1000 scientific research papers published in prestigious international journals and over 50 books covering such diverse topics as data mining, medical images, intelligent systems, social networks and smart environments. Prof. Hassanien has won several awards, including the Best Researcher of the Youth Award of Astronomy and Geophysics of the National Research Institute, Academy of Scientific Research (Egypt, 1990). He was also granted the 2004 scientific excellence award in humanities by the University of Kuwait, and received the University Award for scientific superiority (Cairo University, 2013). He was also honored in Egypt as the best researcher at Cairo University in 2013.
He also received the Islamic Educational, Scientific and Cultural Organization (ISESCO) prize on Technology (2014) and the State Award for excellence in engineering sciences (2015). He was awarded the medal of Sciences and Arts of the first class by the President of the Arab Republic of Egypt in 2017. Professor Hassanien was awarded the international Scopus Award for meritorious research contribution in the field of computer science (2019).

Dr. Sameer Anand is currently working as an Assistant Professor in the Department of Computer Science at Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He received his M.Sc., M.Phil., and Ph.D. (Software Reliability) from the Department of Operational Research, University of Delhi. He is a recipient of the ‘Best Teacher Award’ (2012) instituted by the Directorate of Higher Education, Government of NCT, Delhi. His research interests include operational research, software reliability and machine learning. He has completed an Innovation project at the University of Delhi and has worked in different capacities in international conferences. Dr. Anand has published several papers in reputed journals such as IEEE Transactions on Reliability, International Journal of Production Research (Taylor & Francis), and International Journal of Performability Engineering. He is a member of the Society for Reliability Engineering, Quality and Operations Management. Dr. Sameer Anand has more than 16 years of teaching experience.

Dr. Ajay Jaiswal is currently serving as an Assistant Professor in the Department of Computer Science of Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He is co-editor of two books/journals and co-author of dozens of research publications in international journals and conference proceedings. His research interests include pattern recognition, image processing, and machine learning. He has completed, as Co-PI, an interdisciplinary project titled ‘Financial Inclusion-Issues and Challenges: An Empirical Study’, which was awarded by the University of Delhi. He obtained his master's degree from the University of Roorkee (now IIT Roorkee) and his Ph.D. from Jawaharlal Nehru University, Delhi. He is a recipient of the Best Teacher Award from the Government of NCT of Delhi. He has more than nineteen years of teaching experience.
A Dummy Location Generation Model for Location Privacy in Vehicular Ad hoc Networks

Bhawna Chaudhary and Karan Singh
Abstract Vehicular ad hoc networks are designed to tackle the problems that arise from the proliferation of vehicles in our society. However, most of their applications require access to the location of the vehicles participating in the network, which may lead to life-threatening situations. Hence, credentials, including location, must be shared with care. In this work, we propose generating dummy locations for vehicles. This method helps protect the location privacy of a vehicle by creating confusion in the network. This paper contributes a dummy location generation method based on evaluating the conditional probabilities of location and time pairings. We first describe the technique used by the adversary and then present our dummy location generation method, which is simple and more efficient than existing methods. Results support the validity of our proposed model.

Keywords Dummy location · Location privacy · Security · Anonymity · VANET
B. Chaudhary (B) · K. Singh
School of Computer Systems and Sciences, Jawaharlal Nehru University, Delhi, India
e-mail: [email protected]
K. Singh
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_1

1 Introduction

In recent times, traffic congestion has come to be considered one of the serious issues faced by the whole world. The problems that arise from the use of private transport are the increasing number of road accidents, additional expenses, and related dangers, as well as serious socioeconomic issues faced by modern society. To deal with these problems, a very promising technology has been developed, namely vehicular ad hoc networks (VANETs) [1]. Using this technology, vehicles equipped with an on-board unit (OBU) communication device can communicate with the help of roadside units (RSUs), i.e., the V2R architecture, or they can communicate directly through short-range communications by sending beacon messages to each other, i.e., the V2V architecture. Using these architectures, VANETs can offer a vast variety of applications. These applications fall into two classes: safety-related and non-safety-related applications. Safety-related applications include warning messages, cooperative driving, and traffic optimization, whereas non-safety-related applications include the exchange of entertainment messages [8].

The main purpose of vehicular communication is to enhance road safety. Safety applications are on-demand and require location-aware services to feed real-time information to their users. For this purpose, beacon messages are transmitted into the network every 10 ms and contain a lot of personal information, such as the timestamp of the vehicle, its identity, and some spatiotemporal information (i.e., speed, velocity, acceleration, etc.) [4]. This information helps a driver anticipate dangerous situations on the road early enough to respond. Still, malicious neighbor nodes can easily eavesdrop on the messages and then link them, based on the identity of the vehicle, to extract all the visited locations. This can compromise network privacy, as one vehicle is associated with only one driver [5, 8].

For the effective utilization of such a network, it is necessary to develop a set of elaborate protocols and carefully designed privacy mechanisms that make VANET applications feasible; that is, personal information of the driver, such as identity and most visited places, must be protected so that users cannot be traced illegally through vehicular communications. A compromised network not only affects one’s privacy but can also threaten one’s security [7]. A malicious node may spoof the information present in the beacons and misdirect the other nodes.
We consider an attack scenario in which an adversary gathers spatiotemporal user information, such as patterns of the user’s frequently visited locations (office address, residence address, and restaurants), and uses it to distinguish the actual location from the dummy locations (Fig. 1). In this work, we propose a methodology that uses a simple statistical method to produce dummy locations that withstand such an attack strategy. In summary, we define the attack model and the objective of our method as follows:

Attack Scenario: An attacker holds prior knowledge of the target node together with external spatiotemporal information, which enables a context-linking attack [4]. The adversary may use such information to try to identify the real location of the vehicle among the dummy locations.

Objective: We present a method to generate realistic dummy locations that remain untraceable under this attack scenario.

We propose a dummy generation method that carefully selects the dummy locations from the k highest-priority (most frequently visited) locations, obtained by computing conditional probabilities. Moreover, by adding a weight factor, we focus on locations that are considered more vulnerable with respect to spatiotemporal context. Experiments show that this approach works well in the given scenario. Our approach generates more realistic dummy locations by taking the time of actual events into account. It is simple enough to be used in real-time applications and obscures the actual location among the dummy locations more successfully than existing methods.
Fig. 1 Dummy location generation model by RSU
2 Related Work

The authors of [9] propose cryptographic mix zones, deploying a special RSU at places where traffic density is very high, such as crossroads and toll booths. Mix zones are anonymized regions of the network in which the identifiers of mobile nodes are changed to obscure the relationship between entry and exit events. Whenever a vehicle enters a cryptographic mix zone, the RSU assigns it a symmetric key. While the vehicle travels through the mix zone, every communicated message remains encrypted to protect the useful information in the message from the adversary. Vehicles in the mix zone send the symmetric key, along with the message, to vehicles that are outside the mix zone but within direct transmission range, so that those vehicles are also able to decrypt messages.

In [3, 12], which aim to change pseudonyms effectively and reduce the number of pseudonyms used in the network, results show that the synchronous pseudonym change algorithm is more efficient than the similar status algorithm, and that the similar status algorithm works better than the position algorithm. The three algorithms were simulated in the same environment using the vehicular mobility model STRAW (an integrated mobility model), which observes vehicular behavior and simplifies traffic control mechanisms.

The heuristics applied to optimize the pseudonyms in the network [11] reduce the communication required for procuring pseudonyms and the possibility of tracking at the time of procurement. This work asserts that the proposed heuristics, which update the pseudonym at a particular place and time when vehicle density is low, help maximize anonymity with minimum updating frequency.

CARAVAN [10] suggests that associating neighboring vehicles into groups reduces how often a vehicle must broadcast a message for V2I applications. Within a group, vehicles can be given an extended silent period, which in turn enhances their anonymity and achieves unlinkability. This solution assumes that VANETs have a registration authority (RA), which holds data on all the vehicles joining the network. Each vehicle also registers for the services of interest to it, and only the RA knows the association between the real identity and the pseudonyms allocated to a vehicle. An enhancement technique is also suggested that allows for the actual difference between RSUs and for transmission power control by vehicles.

In [6], privacy is preserved using dummies for the first time. In this approach, dummies use a query–response system with location-based services (LBS). A new privacy protocol known as PARROTS (Position Altered Requests Relayed Over Time and Space) is presented, suggesting that privacy can be preserved for location-based services and that users can be protected against LBS administrators in three cases: (a) when the LBS demands constant support from the network, (b) if the RA conflicts with the RSU, and (c) when the spatiotemporal information of a vehicle can be linked. Although this study does not compare network efficiencies, it does introduce a new method of protecting privacy.
In this work, we propose a dummy location generation technique, in which the attacker has prior knowledge of the target user’s profile and spatiotemporal information. Unlike other approaches, this approach considers the realistic scenario in which an attacker may collect the information about the vehicle and the owner from the social networking sites to learn the pattern of the target node.
3 Our Threat Model and Dummy Generation

3.1 Threat Model

A basic architecture for location-based services in VANETs consists of vehicular nodes with geographical positioning devices, RSUs, and service providers. Vehicles in the network communicate using beacon messages, and queries are computed and answered using the user's location coordinates. In our threat model, we consider an adversary to be a vehicular node that may set up communication with the target node. An attacker behaves according to its predefined protocol but tries to find out the real information of the target vehicle (the real location, in this case, is the last location updated at the nearest RSU). Of the different possible attack scenarios, we have chosen the fixed-position attack, in which the adversary observes a query set from the target node. Moreover, each node participating in the network shares its location every time it initiates communication. We also assume that the GPS installed in the vehicle is trustworthy and cannot be spoofed by the adversary. We concentrate on achieving location privacy by implementing a dummy generation technique whose decisions are authentic.
3.2 The Dummy Generation Scheme

To deal with the threat model described in the above section, a method is proposed that hides the exact location from the service provider by offering a set of fake locations, known as dummy locations or dummies, that also contains the exact location. The procedure works as follows:

a. The user’s vehicle is present at some location A.
b. The user communicates position data A, along with a set of fake locations, such as B, C, D, and E, to the RSU.
c. The RSU generates location values for all the locations from A to E and sends the message back to the user vehicle.
d. The user vehicle communicates using the values from the received set and thereby hides its actual location.

The actual location A cannot be exposed to any other vehicle or to other RSUs. Only the intended vehicle is aware of its exact location, whereas the RSU may not be. Thus, no entity other than the vehicle itself can distinguish the actual location of the vehicle from a pool of k defined locations (1 actual location and k-1 dummy locations). Therefore, this method can be used to preserve location privacy by establishing k-anonymity.

Else default R ← C endif endif return r ← random number from R
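Step b of the procedure above can be illustrated with a minimal sketch. The function name `build_location_set` and the shuffling step are assumptions made for illustration and are not taken from the paper; shuffling is used here only so that the position of an entry in the list does not reveal which one is real.

```python
import random


def build_location_set(real_location, dummy_locations, rng=None):
    """Return a shuffled list holding the real location and its dummies.

    Hypothetical helper: the result has k entries (1 real, k-1 dummies),
    which is the set a vehicle would send to the RSU in step b.
    """
    rng = rng or random.Random()
    # Drop duplicates and any dummy that coincides with the real location.
    dummies = [d for d in dict.fromkeys(dummy_locations) if d != real_location]
    locations = [real_location] + dummies
    # Shuffle so that list order carries no hint about the real entry.
    rng.shuffle(locations)
    return locations


# Example: real location A hidden among four dummies (k = 5).
request = build_location_set("A", ["B", "C", "D", "E"], random.Random(0))
```

With k locations in the set, an observer who has no further information can do no better than guess the real one with probability 1/k, which is the k-anonymity property the section describes.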
3.3 Our Attack Model

In our proposed attack model, the adversary can perform a context-linking attack, which assumes that the adversary is aware of the spatiotemporal information of the target vehicle. Using information posted online, an adversary may predict the location of its target vehicle. For example, a user commuting from home to the workplace generally follows the same route and stops at the same points. After observing such behavior of the target user, an adversary may collect the remaining information from different sources, such as social networking sites and the information exchanged in the beacons. The attacker may also learn which restaurants and other places the user visits frequently. This prior knowledge of locations poses an analytical challenge for the development of a dummy generation technique.
4 Our Proposed Work

In this section, we present a novel dummy generation method that takes into account prior knowledge about the user's vehicle and its whereabouts. This approach is based on the following objectives:

a. Generate dummy locations at places the target user frequently visits at a particular time.
b. Generate dummy locations at places that appear vulnerable from the target user’s perspective.

We address the first objective by calculating conditional probabilities and add a weighting scheme to fulfill the second.
4.1 Generating Frequent Visited Locations of Vehicles

Our work selects the dummy locations by calculating the conditional probability of a location given a time, which predicts the target vehicle’s behavior at a particular time of day. This is the probability that a user vehicle is at a specific location at a given time [13]:

$$P(\text{location} \mid \text{time}) = \frac{P(\text{location} \cap \text{time})}{P(\text{time})} \qquad (1)$$

Equation (1) is based on the joint probability of two events: the vehicle being at a given location and the time being a given time. We also initialize the count of every location/time pair to 1 to avoid zero probabilities. After calculating P(location | time) for all the probable locations at some specific time, we produce dummy locations at the locations with the highest probabilities. If two locations have equal values of P(location | time), then only P(location) is considered. This method identifies the probable locations of a user vehicle at any specific time.
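As a minimal sketch of Eq. (1), the following code estimates P(location | time) from logged (location, time slot) pairs, starts every pair at a count of 1 as described above, and ranks candidate dummy locations. The function names, the discrete time slots, and the log format are assumptions made for illustration and do not come from the paper.

```python
def conditional_location_probs(logs, locations, time_slots):
    """Estimate P(location | time) and P(location) from log entries.

    `logs` is an iterable of (location, time_slot) pairs. Every pair starts
    with a count of 1 so that no probability is zero.
    """
    counts = {(loc, t): 1 for loc in locations for t in time_slots}
    for loc, t in logs:
        counts[(loc, t)] += 1
    total = sum(counts.values())

    # Marginals P(time) and P(location) from the joint counts.
    p_time = {t: sum(counts[(loc, t)] for loc in locations) / total
              for t in time_slots}
    p_loc = {loc: sum(counts[(loc, t)] for t in time_slots) / total
             for loc in locations}

    # Eq. (1): P(location | time) = P(location and time) / P(time).
    p_cond = {(loc, t): (counts[(loc, t)] / total) / p_time[t]
              for loc in locations for t in time_slots}
    return p_cond, p_loc


def top_dummy_locations(p_cond, p_loc, time_slot, real_location, k):
    """Pick the k-1 most probable locations at `time_slot` as dummies.

    Ties in P(location | time) are broken by the larger P(location).
    """
    candidates = [loc for loc in p_loc if loc != real_location]
    candidates.sort(key=lambda loc: (p_cond[(loc, time_slot)], p_loc[loc]),
                    reverse=True)
    return candidates[:k - 1]
```

Given a few days of logs, `top_dummy_locations(p_cond, p_loc, "morning", "home", 3)` would return the two locations the vehicle is most often seen at in the morning, excluding its real location.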
4.2 Assessment of Vulnerable Time/Location Pairs

We assume that a few vulnerable locations are known to attackers, who may exploit them to tell the real location from the dummy locations. Generally, the schedule of a user is fixed. Our model generates dummy locations for the regularly visited location and time pairs to increase uncertainty for the attacker. To every vulnerable location and time pair, our model allocates a weight, known as the risk:

$$P(\text{dummy location}) = \frac{P(\text{location} \cap \text{time}) \cdot \text{risk}}{P(\text{time})} \qquad (2)$$

If the value of the risk is greater than 1, the location/time pair is vulnerable; if it is equal to 1, we consider the location/time pair to be under control. Thereafter, we place a dummy vehicle location at every possible vulnerable location.
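The weighting in Eq. (2) can be sketched as follows, assuming that P(location | time) has already been estimated and is passed in as a dictionary. The function names and the default risk of 1 for pairs without an explicit weight are illustrative assumptions, not part of the paper.

```python
def weighted_dummy_scores(p_cond, risk, time_slot, default_risk=1.0):
    """Score each location at `time_slot` as P(location | time) * risk.

    `p_cond` maps (location, time_slot) to P(location | time); `risk` maps
    the same kind of pair to its weight. Pairs missing from `risk` get
    `default_risk`, i.e. they are treated as under control.
    """
    scores = {}
    for (loc, t), p in p_cond.items():
        if t == time_slot:
            scores[loc] = p * risk.get((loc, t), default_risk)
    return scores


def vulnerable_pairs(risk):
    """Return the location/time pairs whose risk exceeds 1."""
    return sorted(pair for pair, r in risk.items() if r > 1)


# Example: "home" in the morning is rare but marked as vulnerable.
p_cond = {("office", "morning"): 0.5, ("cafe", "morning"): 0.3,
          ("home", "morning"): 0.2}
risk = {("home", "morning"): 3.0}
scores = weighted_dummy_scores(p_cond, risk, "morning")
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this example the weight lifts "home" above "office" in the ranking even though it is visited less often in the morning. Note that with risk greater than 1 the weighted value can exceed 1, so the sketch treats it as a ranking score rather than a normalized probability.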
5 Experimental Design and Evaluation

To determine the efficiency of the proposed model, we perform experiments based on real data observed from a vehicle’s user. The experimental study is designed to answer the following questions:

a. How robust against the attacker model are the dummy vehicle locations generated by calculating the conditional probabilities?
b. Does identifying vulnerable location and time pairs help produce more robust dummy locations?
5.1 Experimental Settings

We have built our own dataset of logged information (time and location) about a target vehicle in Jaipur city. The logs were collected by observing the target vehicle for 2 weeks, which yielded 198 log entries in the database. The locations cover approximately 45 well-known places in the city, of which only the 12 most visited locations are chosen for this study. From this dataset, we use 150 instances for training and analyze the proposed model on logs that cover at least 5 days.

Attacker’s Scenario: We assume that the attacker has collected information about the target vehicle beforehand.

Target vehicle’s information ($T_i$): Vehicle information can be found on the Internet. As a result, the vehicle owner’s name and address can be retrieved from the Internet, and by searching for the same identity on social networking sites, more information can be obtained, such as the addresses of the owner's office, home, cafes, and clinics.

Spatiotemporal information ($T_{si}$): We assume that the attacker has all the prior information about the cafes, restaurants, and other places. Additionally, the attacker knows from common sense that people generally go to the office in the morning hours (7–9 a.m.), visit a cafe in the evening (5–7 p.m.), and are at home at night. This pattern helps the attacker predict the location of a vehicle at any specific time.

Fig. 2 The performance of our approach according to different risk and k

Performance Measure: We measure the average probability that the attacker is able to figure out the exact location e of the target vehicle. As per our attack model described in Sect. 3.3, the attacker picks a candidate for the target's location at random from the given set of information E. The success probability of the attacker can be measured as

$$\text{Success probability} = \begin{cases} 1/|E| & \text{if } e \in E \\ 0 & \text{if } e \notin E \end{cases} \qquad (3)$$
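A minimal sketch of the performance measure in Eq. (3) is given below. The function names are hypothetical, and E is read as the set of candidate locations available to the attacker, so that 1/E is taken to mean one over the size of that set.

```python
def attacker_success_probability(real_location, candidate_set):
    """Eq. (3): 1/|E| if the real location e is in E, otherwise 0."""
    candidates = set(candidate_set)
    if real_location in candidates:
        # The attacker guesses uniformly at random among the candidates.
        return 1.0 / len(candidates)
    return 0.0


def average_success(trials):
    """Average Eq. (3) over (real_location, candidate_set) pairs."""
    trials = list(trials)
    if not trials:
        return 0.0
    total = sum(attacker_success_probability(e, E) for e, E in trials)
    return total / len(trials)
```

For instance, with the real location hidden among three dummies the function returns 0.25, which matches the 1/k bound of ideal k-anonymity for k = 4.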
5.2 Experimental Results

We evaluate our algorithm in comparison with previously suggested methods. In our base paper [2], dummy locations are generated using a cloaking technique, each location having circular probability. Another algorithm for dummy location services uses an entropy-based scheme to place the dummy locations on the road [14]. Moreover, we evaluate the results theoretically against the optimal k-anonymity algorithm, for which the probability of estimating the real location is 1/k. Figure 3 compares our work with the two methods described above. Results show that different levels of k-anonymity can be obtained by using different numbers of dummy locations. The graphs show that our approach results in a lower probability that the attacker identifies the real location of the vehicle, which indicates that the proposed model is safer than the other solutions. On average, the attacker succeeds 3.8% less often against our method than against the second baseline approach and 28.6% less often than against the first. Additionally, as the number of dummy locations increases, the efficacy of our algorithm also increases.

Fig. 3 Comparison of success probability of attacker and different k anonymity

Thereafter, we examine our weighting scheme for pairs of vulnerable time and location of the target vehicle. Such pairs can be found by examining the schedule of the vehicle thoroughly. We assign a risk value from 1 to 9 according to the vulnerability of the location. The reason is that, for a larger value of k, most of the locations are already included among the dummy locations. This approach can be applied efficiently to vehicular ad hoc networks, as it lowers the network establishment cost. If vehicles are able to set their routes in advance, our approach will generate much safer dummy locations and may provide a safer network.
6 Conclusion and Future Aspects

In this work, we have considered the strategy of an attacker who has prior knowledge of the target user’s vehicle and external background information. To address this serious threat, we define a dummy location generation algorithm for vehicles that efficiently places the dummy locations of vulnerable vehicles after calculating the conditional probabilities at a specific time. Furthermore, we show that location privacy can be achieved if the spatiotemporal information is taken into account. Experimental results show that our statistical method gives more effective results than existing methods. In future work, we plan to extend the proposed work to tackle random attacks and to implement it on a large map of the city.
References

1. R. Al-ani, B. Zhou, Q. Shi, A. Sagheer, A survey on secure safety applications in vanet, in 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (IEEE, 2018), pp. 1485–1490
2. M. Arif, G. Wang, T. Peng, Track me if you can? query based dual location privacy in vanets for v2v and v2i communication, in 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE) (2018), pp. 1091–1096
3. S. Bao, W. Hathal, H. Cruickshank, Z. Sun, P. Asuquo, A. Lei, A lightweight authentication and privacy-preserving scheme for vanets using tesla and bloom filters. ICT Express (2017)
4. S. Buchegger, T. Alpcan, Security games for vehicular networks, in 2008 46th Annual Allerton Conference on Communication, Control, and Computing (IEEE, 2008), pp. 244–251
5. G. Calandriello, P. Papadimitratos, J.-P. Hubaux, A. Lioy, Efficient and robust pseudonymous authentication in vanet, in Proceedings of the fourth ACM international workshop on Vehicular ad hoc networks (ACM, 2007), pp. 19–28
6. G. Corser, H. Fu, T. S. P. D. W. M. S. L. Y. Z, Privacy-by-decoy: protecting location privacy against collusion and deanonymization in vehicular location based services, in IEEE Intelligent Vehicles Symposium Proceedings (2014), pp. 1030–1036
7. M. Gupta, N.S. Chaudhari, Anonymous roaming authentication protocol for wireless network with backward unlinkability and natural revocation. Ann. Telecommun. 1–10 (2018)
8. H. Hartenstein, L. Laberteaux, A tutorial survey on vehicular ad hoc networks. IEEE Commun. Mag. 46(6), 164–171 (2008)
9. M. Raya, J.-P. Hubaux, Securing vehicular ad hoc networks. J. Comput. Secur. 15(1), 39–68 (2007)
10. K. Sampigethaya, L. Huang, M. Li, R. Poovendran, K. Matsuura, K. Sezaki, Caravan: providing location privacy to vanets. Defense Technical Information Center (2005)
11. K. Sharma, B.K. Chaurasia, S. Verma, G.S. Tomar, Token based trust computation in vanet. Int. J. Grid Distrib. Comput. 9(5), 313–320 (2016)
12. M. Wang, D. Liu, L. Zhu, Y. Xu, F. Wang, Lespp: lightweight and efficient strong privacy preserving authentication scheme for secure vanet communication. Computing 98(7), 685–708 (2016)
13. Z. Yan, P. Wang, W. Feng, A novel scheme of anonymous authentication on trust in pervasive social networking. Inf. Sci. 445, 79–96 (2018)
14. Q. Yang, A. Lim, R. X. Q. X, Location privacy protection in contention based forwarding for vanets, in IEEE Global Telecommunications Conference GLOBECOM (2010), pp. 1–5
Evaluating User Influence in Social Networks Using k-core N. Govind and Rajendra Prasad Lal
Abstract Given a social network with an influence propagation model, selecting a small subset of users to maximize the influence spread is known as the influence maximization problem. It has been shown that the influence maximization problem is NP-hard, and several approximation algorithms and heuristics have been proposed. In this work, we follow a graph-theoretic approach to find the initial spreaders, called seed nodes, such that the expected number of influenced users is maximized. It has been well established through a series of research works that a special subgraph called the k-core is very useful for finding the most influential users. A k-core subgraph H of a graph G is defined as a maximal induced subgraph in which every node has at least k neighbors in H. We apply a topology-based algorithm called Local Index Rank (LIR) on the k-core (for some fixed k) to select the seed nodes in a social network. The accuracy and efficiency of the proposed method have been established using two benchmark datasets from the SNAP (Stanford Network Analysis Project) database.
Keywords Influence maximization · Social network · k-core · Independent cascade
N. Govind (B) · R. P. Lal
University of Hyderabad, Hyderabad, India
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_2
1 Introduction
Social Network Analysis (SNA) is an active research area which has attracted researchers from academia as well as industry. Social networks like Facebook, Flickr, YouTube, and Twitter are extensive and very effective in propagating information and promoting products among their users in a very short time span. There are specific nodes in social networks, called influential nodes, which can propagate information to a large number of users quickly. Identifying such nodes in the social graph helps in controlling epidemic outbreaks, speeding up information propagation, advertising by e-commerce websites, and so on. This has attracted
scientists from various fields like economics, sociology, and computer science to study the influence spread in social networks. Specifically, identifying influential nodes (users) in social networks has recently been discussed and analyzed extensively in both academia and industry [2, 8, 13]. Inspired by viral marketing, the influence maximization problem was first studied algorithmically, from a data mining perspective, by Domingos and Richardson [3, 17]. Kempe et al. [8] were the first to formulate influence maximization as a stochastic optimization problem and established its NP-hardness. The problem can be defined as follows: given a social network G and a positive integer s (the number of influential spreaders), find a subset S of s influential users so that the total number of users influenced by them is maximized. These influential users, also called the seed set, are the initial adopters of an innovation or piece of information and propagate it to their neighbors in the network. The users influenced by these initial adopters in turn propagate the information or influence to their own neighbors. Several influence propagation models, such as Independent (weighted) Cascade and Linear Threshold (LT), have been proposed in the literature. As the influence maximization problem is NP-hard, various approximation algorithms [2, 6, 8, 10, 12] and heuristics [2, 8, 13, 20] have been proposed. Recently, some graph-theoretic approaches using special subgraphs like k-core, k-truss, etc., have also been proposed to find influential users in the social network. A k-core is a maximal induced subgraph in which all the nodes have degree at least k. The high connectivity of the nodes in a k-core makes it very useful for influence spreading [9].
The core decomposition of graphs has also been applied in the areas of community detection, event detection, text mining, etc. [14]. In this work, we apply heuristics such as degree discount and Local Index Rank (LIR) on the k-core of the social network to find a set of influential nodes. We use the Independent Cascade (IC) model to calculate the expected number of influenced users. The paper is organized as follows: Sect. 2 contains the problem definition, Sect. 3 covers existing work, Sect. 4 presents the proposed methods, Sect. 5 describes the datasets, and Sect. 6 reports experimental results, followed by the conclusion and future work.
2 Influence Maximization
Here, we look into the basic definitions and methods useful to study and solve the influence maximization problem.
Social Network: A social network is a graph denoted by G = (V, E), where V is a set of users, and E is a set of links between the users. Here, for our study, we consider undirected graphs.
Influence Maximization: The influence maximization problem can be defined as follows: given a social network G = (V, E) and a positive integer s, identify a subset of users S ⊂ V, |S| = s, such that the influence spread function f(S) is maximized [8]. The influence spread function f(S) of a seed set S is defined as the expected number of nodes that get influenced under a propagation or diffusion model. The IC model [4, 5, 8] and the LT model [7, 8, 18] are mainly used to stochastically model influence propagation, which is triggered in the network from the seed set already chosen.
In the IC model, a probability of influence $p_{uv}$ is assigned to every edge (u, v). If the node u is influenced at a given time t, then at time t + 1 it attempts to influence each not yet influenced neighbor v with probability $p_{uv}$; if it succeeds, v gets influenced. Every influenced node has a single chance to influence each of its neighbors. Once a node reaches the influenced state, it remains in that state. The process starts with an initial seed set and stops when no further nodes can be influenced.
In the LT model, the edge between a node u and each neighbor v is given a weight $b_{uv}$, and every node u is given a threshold $\theta_u$. The weights satisfy $\sum_{v \in N(u)} b_{uv} \le 1$, and node u gets influenced when $\sum_{v \in N(u),\ v\ \text{influenced}} b_{uv} \ge \theta_u$, where $\theta_u$ is the threshold of user u ∈ V, generated uniformly at random in the interval [0, 1], and N(u) denotes the neighbors of node u. The process continues until no further nodes can be influenced. Other diffusion models with variations are also available in the literature [23], and Kempe et al. [8] gave generalized versions of these two models.
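The IC process described above can be illustrated with a short Monte Carlo sketch. It is only an illustration under stated assumptions: the graph is an adjacency dictionary, a single probability p is used for every edge (as in the experimental setting of Sect. 6) instead of per-edge values $p_{uv}$, and the function names `simulate_ic` and `expected_spread` are hypothetical.

```python
import random
from collections import deque


def simulate_ic(adj, seeds, p, rng):
    """Run one Independent Cascade diffusion; return the set of influenced nodes."""
    influenced = set(seeds)
    frontier = deque(influenced)
    while frontier:
        u = frontier.popleft()
        # u gets a single chance to influence each neighbor that is not yet influenced
        for v in adj[u]:
            if v not in influenced and rng.random() < p:
                influenced.add(v)
                frontier.append(v)
    return influenced


def expected_spread(adj, seeds, p, runs=1000, seed=0):
    """Estimate f(S) as the average number of influenced nodes over `runs` cascades."""
    rng = random.Random(seed)  # fixed seed so that the estimate is reproducible
    total = sum(len(simulate_ic(adj, seeds, p, rng)) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    # Assumed toy graph: a path 0-1-2 plus an isolated node 3
    toy = {0: [1], 1: [0, 2], 2: [1], 3: []}
    print(expected_spread(toy, [0], p=0.05, runs=10000))
```

Processing nodes from a queue instead of in explicit time steps gives the same final influenced set in distribution, because each edge is still tried at most once from an influenced endpoint.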
3 Existing Work
Kempe et al. [8] were the first to formulate the influence maximization problem as an optimization problem under the IC and LT models. They proved its NP-hardness and proposed an approximation algorithm with an approximation ratio of 1 − 1/e − ε. To find efficient solutions to the influence maximization problem, various algorithms have been proposed in the literature. These can be mainly divided into two types, viz. greedy and heuristic algorithms.
Greedy algorithms: Kempe et al. [8] proved that the influence spread function is submodular and proposed a hill-climbing greedy algorithm. Leskovec et al. [10] presented the Cost-Effective Lazy Forward (CELF) algorithm, based on lazy evaluation of the objective function, which was 700 times faster than the former algorithm. Goyal et al. [6] proposed CELF++ by improving the CELF algorithm. Chen et al. [2] proposed greedy algorithms like NewGreedy and MixedGreedy. Greedy algorithms produce good seed sets but have prohibitively high time complexity.
Heuristic algorithms: Kempe et al. [8] proposed some heuristics like degree and degree centrality. Chen et al. [2] improved on degree centrality with the notion of degree discount. The idea is to discount the degree of a node if it has seed nodes as its neighbors. The discounted degree of node v is given by $d_v - 2t_v - (d_v - t_v)\,t_v\,p$, where $d_v$ denotes degree(v), $t_v$ is the number of seed nodes among the neighbors of v, and p is the propagation
probability. Wang et al. [21] proposed the generalized degree discount algorithm as an extension of degree discount. The idea is to modify the degree discount by considering two-hop neighbors. The generalized degree discount of a node v is given by $d_v - 2t_v - (d_v - t_v)\,t_v\,p + \frac{1}{2}\,t_v(t_v - 1)\,p - \sum_{w=1}^{d_v - t_v} t_w\,p$, where the sum runs over the $d_v - t_v$ non-seed neighbors w of v. Liu et al. [13] proposed a topology-based algorithm called LIR, which is based on node degree. Zhang et al. [22] proposed the VoteRank algorithm, which is based on the voting capacity of nodes and the average degree of the network. Each node votes for its neighbors, and the node with the most votes is chosen. The selected node does not participate in subsequent voting, and the voting ability of its neighbors is decreased in the next turn. Pal et al. [16] studied centrality-based heuristics and modeled a new centrality measure based on diffusion degree.
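As an illustration of the degree discount rule of Chen et al. [2] quoted above, the following minimal sketch selects seeds one at a time and updates the discounted degree $d_v - 2t_v - (d_v - t_v)\,t_v\,p$ of the neighbors of each new seed. It assumes an undirected graph given as an adjacency dictionary without self-loops or duplicate edges; the function name `degree_discount_seeds` is hypothetical, and the plain `max` scan is a simplification of the priority-queue implementation that a fast version would use.

```python
def degree_discount_seeds(adj, s, p):
    """Pick up to s seeds using dd_v = d_v - 2*t_v - (d_v - t_v)*t_v*p."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    discounted = dict(degree)          # dd_v, initially the plain degree d_v
    seed_nbrs = {v: 0 for v in adj}    # t_v: neighbors of v already chosen as seeds
    seeds, chosen = [], set()
    for _ in range(min(s, len(adj))):
        # Simplification: linear scan for the best remaining node instead of a heap
        u = max((v for v in adj if v not in chosen), key=discounted.get)
        seeds.append(u)
        chosen.add(u)
        for v in adj[u]:
            if v in chosen:
                continue
            seed_nbrs[v] += 1
            d, t = degree[v], seed_nbrs[v]
            discounted[v] = d - 2 * t - (d - t) * t * p
    return seeds


if __name__ == "__main__":
    # Assumed toy graph: hub 0 (degree 5), its neighbor 1 (degree 4), separate hub 9 (degree 3)
    toy = {
        0: [1, 2, 3, 4, 5], 1: [0, 6, 7, 8], 2: [0], 3: [0], 4: [0], 5: [0],
        6: [1], 7: [1], 8: [1], 9: [10, 11, 12], 10: [9], 11: [9], 12: [9],
    }
    print(degree_discount_seeds(toy, 2, 0.05))
```

In the toy graph, plain degree ranking would pick nodes 0 and 1, which are adjacent. After node 0 is chosen, the discounted degree of node 1 drops to 4 − 2 − 3 · 0.05 = 1.85, so the second seed is node 9 instead.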
4 Proposed Method
In this section, we discuss k-core subgraphs and our proposed methods. The concept of cores was introduced by Seidman [19] in 1983 to find cohesive subgroups of users in a given network. Cohesive subgroups are subsets of nodes or users with strong, direct, intense, or positive ties. Kitsak et al. [9] studied influence spread based on the k-core using an epidemic model and showed that the core of the network contains the influential spreaders. Malliaros et al. [15] studied influence spread based on the k-truss, which is a triangle-based (cycle of length 3) extension of the k-core.
A k-core is defined as a maximal subgraph $H = (V', E')$ induced from a graph G = (V, E) in which $\deg_H(v') \ge k\ \forall v' \in V'$, where k is a positive integer. A k-core subgraph can be obtained by decomposing a graph based on degree, i.e., the nodes whose degree is less than k, along with their incident edges, are deleted recursively until all remaining nodes have degree at least k. A linear time algorithm proposed by Batagelj et al. [1] can be employed to find the k-core of a graph.
Liu et al. [13] proposed a heuristic called LIR with the intuition of avoiding the “rich club effect,” i.e., avoiding adjacency between two high-degree nodes. The nodes selected by LIR have degree no lower than that of their neighbors, and most of the time they are not connected with each other. They proposed a ranking for each node v, the local index (LI), based on the degree of that node and its neighbors: LI(v) is the number of neighbors having higher degree than v. After computing LI for all the nodes, the nodes with LI = 0 are selected and sorted in descending order of their degree. Then the required number of nodes is chosen as the seed set from the sorted list. It can be observed that the chances of the nodes in the k-core having nonzero LI values are high. Hence, LIR applied on a graph G does not select an adequate number of nodes from the k-core of G.
It thus excludes some influential nodes of the k-core, which contradicts the fact that influential nodes mostly reside in the k-core [9]. Here, we apply LIR on the k-core of the graph to find the seed set. The intuition is to include those influential nodes which are excluded by LIR when it is applied on the original graph.
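The recursive deletion that defines the k-core can be sketched as follows. This is a simple queue-based peeling written for illustration, not the bin-sort implementation of Batagelj et al. [1]; the adjacency-dictionary input and the function name `k_core` are assumptions.

```python
from collections import deque


def k_core(adj, k):
    """Return the k-core of an undirected graph as an adjacency dict (possibly empty)."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    # Start with every node whose degree is already below k
    queue = deque(v for v, d in degree.items() if d < k)
    while queue:
        v = queue.popleft()
        if v in removed:
            continue
        removed.add(v)
        # Deleting v lowers the degree of its remaining neighbors,
        # which may push them below k as well
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] < k:
                    queue.append(w)
    return {v: [w for w in adj[v] if w not in removed]
            for v in adj if v not in removed}


if __name__ == "__main__":
    # Assumed toy graph: triangle 0-1-2 with a tail 2-3-4
    toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
    print(k_core(toy, 2))
```

In the toy graph, node 4 is removed first, which reduces node 3 to degree 1 so that it is removed as well; only the triangle survives as the 2-core.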
Our approach is accurate and scalable, as k-cores are relatively small in size and time-efficient to compute. Here, we also use the degree discount heuristic on the k-core to find the influential nodes. Our proposed algorithm is outlined in Algorithm 1. It takes the graph G = (V, E) and the seed set size s as input and produces the top-s influential nodes. Step 1 of Algorithm 1 computes the maximal k-core subgraph using the algorithm of Batagelj et al. [1]. In step 2, degree discount or LIR is applied on the k-core obtained in step 1. In step 3, the top-s nodes are selected based on their degree discount value or on their degree (among the nodes with LI = 0). Step 1 of Algorithm 1 has time complexity O(|E|). Steps 2 and 3 combined take O(s log |V| + |E|) for the degree discount heuristic and O(|E|) for LIR. So, the overall time complexity of Algorithm 1 is O(|E|) with LIR and O(s log |V| + |E|) with degree discount.
Algorithm 1: k-core-LIR / Degree Discount
Input: G(V, E), s
Output: top-s seeds (seed set)
1 Compute the maximal k-core of G
2 Apply LIR or Degree Discount heuristic on k-core subgraph
3 Select the set of s nodes.
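Steps 2 and 3 of Algorithm 1 with the LIR option might look like the sketch below. It assumes that the k-core is already available as an adjacency dictionary and that degrees are measured inside that subgraph, which the text does not state explicitly; the function name `lir_seeds` is hypothetical.

```python
def lir_seeds(core_adj, s):
    """Select up to s seeds: nodes with local index 0, highest degree first."""
    degree = {v: len(nbrs) for v, nbrs in core_adj.items()}
    # LI(v) = number of neighbors whose degree is strictly higher than that of v
    local_index = {
        v: sum(1 for w in nbrs if degree[w] > degree[v])
        for v, nbrs in core_adj.items()
    }
    candidates = [v for v in core_adj if local_index[v] == 0]
    # Stable sort: ties keep the dictionary order of the input
    candidates.sort(key=lambda v: degree[v], reverse=True)
    return candidates[:s]


if __name__ == "__main__":
    # Assumed toy subgraph: triangle 0-1-2, with node 0 also linked to hub 3
    toy = {
        0: [1, 2, 3], 1: [0, 2], 2: [0, 1],
        3: [0, 4, 5, 6, 7], 4: [3], 5: [3], 6: [3], 7: [3],
    }
    print(lir_seeds(toy, 2))
```

In this toy input only node 3 has LI = 0, so fewer than s seeds are returned. This is the shortfall of candidates discussed above for LIR on the full graph, and a practical implementation would need a fallback rule for it, which the text does not specify.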
5 Datasets
In this section, we describe the datasets. We use two benchmark datasets, ca-GrQc and ca-HepTh, from SNAP [11]. ca-GrQc is an undirected graph of the collaboration network of the arXiv General Relativity category, and ca-HepTh is also an undirected graph, representing the research collaboration of scientists who have co-authored papers in the High Energy Physics category. Some of the fundamental properties of these two networks are given in Table 1.
Table 1 Properties of the two datasets
Name | Nodes | Edges | Average clustering coefficient | Average degree | Type
ca-GrQc | 5242 | 14496 | 0.5296 | 5.5 | Undirected
ca-HepTh | 9877 | 25998 | 0.4717 | 5.2 | Undirected
6 Experimental Results
Here, we discuss the experiments and the results obtained. We compare the proposed methods with existing heuristics: degree [8], degree discount [2], generalized degree discount [21], and VoteRank [22]. The experimental settings are as follows. The experiments are run on an Intel(R) Xeon(R) CPU with 64 GB of main memory. We use the IC model to calculate the influence spread, running the experiment 10,000 times and taking the average. We set the influence probability p = 0.05. The total number of seeds, i.e., the size of the seed set, is 50. The p value for degree discount and generalized degree discount is also set to 0.05. For the construction of the k-core, we follow the algorithm proposed by Batagelj et al. [1], and we fix k = 3.
Figures 1 and 2 show the variation of the influence spread with the seed set size s. The degree heuristic shows less spread than the others on both datasets, and k-core-LIR shows the highest spread. k-core degree discount shows almost the same spread as other heuristics like degree discount, generalized degree discount, and VoteRank on both datasets. Generalized degree discount shows better spread than the degree heuristic on the ca-GrQc dataset, and it shows almost the same spread as degree discount, VoteRank, and k-core degree discount on the ca-HepTh dataset. A significant increase in spread with the k-core-LIR method can be seen on both the ca-GrQc and ca-HepTh datasets, and this method computes the seed set from the core time-efficiently. The results show that the proposed methods are more accurate and time-efficient.
Fig. 1 Number of influenced users versus seed set size on ca-GrQc dataset
Fig. 2 Number of influenced users versus seed set size on ca-HepTh dataset
7 Conclusion
In this work, we have proposed a graph-theoretic method to find an efficient solution to the problem of influence maximization in social networks. Our approach is based on applying topology-based algorithms such as LIR and degree discount on the k-core of the social graph. As the k-core is generally small in comparison to the social graph, our proposed methods are scalable. The experimental study on the two datasets ca-GrQc and ca-HepTh shows that our proposed method outputs highly influential seed nodes, resulting in a larger number of influenced users compared to other existing heuristics. Other types of subgraphs with special structures, like k-truss, k-clubs, k-clans, etc., can also be utilized to identify influential seed users in social networks. In the future, we plan to work with these structures, along with other influence propagation models such as the LT model and the SIR model, to find more efficient algorithms for the influence maximization problem.
References
1. V. Batagelj, M. Zaversnik, An O(m) algorithm for cores decomposition of networks (2003). arXiv preprint cs/0310049
2. W. Chen, Y. Wang, S. Yang, Efficient influence maximization in social networks, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2009), pp. 199–208
3. P. Domingos, M. Richardson, Mining the network value of customers, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2001), pp. 57–66
4. J. Goldenberg, B. Libai, E. Muller, Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark. Lett. 12(3), 211–223 (2001)
5. J. Goldenberg, B. Libai, E. Muller, Using complex systems analysis to advance marketing theory development: modeling heterogeneity effects on new product growth through stochastic cellular automata. Acad. Mark. Sci. Rev. 9(3), 1–18 (2001)
6. A. Goyal, W. Lu, L.V. Lakshmanan, Celf++: optimizing the greedy algorithm for influence maximization in social networks, in Proceedings of the 20th International Conference Companion on World Wide Web (ACM, 2011), pp. 47–48
7. M. Granovetter, Threshold models of collective behavior. Am. J. Sociol. 83(6), 1420–1443 (1978)
8. D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2003), pp. 137–146
9. M. Kitsak, L.K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H.E. Stanley, H.A. Makse, Identification of influential spreaders in complex networks. Nat. Phys. 6(11), 888 (2010)
10. J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, N. Glance, Cost-effective outbreak detection in networks, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2007), pp. 420–429
11. J. Leskovec, A. Krevl, SNAP datasets: Stanford large network dataset collection (2014). http://snap.stanford.edu/data
12. Y. Li, J. Fan, Y. Wang, K.L. Tan, Influence maximization on social graphs: a survey. IEEE Trans. Knowl. Data Eng. 30(10), 1852–1872 (2018)
13. D. Liu, Y. Jing, J. Zhao, W. Wang, G. Song, A fast and efficient algorithm for mining top-k nodes in complex networks. Sci. Rep. 7, 43330 (2017)
14. F.D. Malliaros, A.N. Papadopoulos, M. Vazirgiannis, Core decomposition in graphs: concepts, algorithms and applications, in EDBT (2016), pp. 720–721
15. F.D. Malliaros, M.E.G. Rossi, M. Vazirgiannis, Locating influential nodes in complex networks. Sci. Rep. 6, 19307 (2016)
16. S.K. Pal, S. Kundu, C. Murthy, Centrality measures, upper bound, and influence maximization in large scale directed social networks. Fundam. Inf. 130(3), 317–342 (2014)
17. M. Richardson, P. Domingos, Mining knowledge-sharing sites for viral marketing, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2002), pp. 61–70
18. T.C. Schelling, Micromotives and macrobehavior. WW Norton & Company (2006)
19. S.B. Seidman, Network structure and minimum degree. Soc. Netw. 5(3), 269–287 (1983)
20. A. Sheikhahmadi, M.A. Nematbakhsh, A. Shokrollahi, Improving detection of influential nodes in complex networks. Physica A 436, 833–845 (2015)
21. X. Wang, X. Zhang, C. Zhao, D. Yi, Maximizing the spread of influence via generalized degree discount. PloS one 11(10), e0164393 (2016)
22. J.X. Zhang, D.B. Chen, Q. Dong, Z.D. Zhao, Identifying a set of influential spreaders in complex networks. Sci. Rep. 6, 27823 (2016)
23. Y. Zheng, A survey: models, techniques and applications of influence maximization problem (2018)
Depression Anatomy Using Combinational Deep Neural Network Apeksha Rustagi, Chinkit Manchanda, Nikhil Sharma, and Ila Kaushik
Abstract Depression is a mood disorder that causes a persistent feeling of sadness and loss of interest in any activity. It is a leading cause of mental illness and has been linked to an increased risk of early death and an economic burden on a country. Traditional clinical analysis procedures are subjective, complex and need considerable involvement of professionals. The turn of the century saw incredible progress in using deep learning for medical diagnosis. However, predicting mental state can be remarkably hard. In this paper, we present a Combinational Deep Neural Network (CDNN) for automated depression detection from facial images and text data using a combination of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The prediction scores of the CNN and RNN models are then combined, and the level of depression is decided on the basis of predefined ranges of depression-level scores. Simulation outcomes based on real-field channel measurements show that the proposed model can predict depression with superior performance.
Keywords Depression · Artificial intelligence · Mental illness · Combinational deep neural network · CNN · RNN
A. Rustagi Bhagwan Parshuram Institute of Technology, Delhi, India e-mail: [email protected] C. Manchanda · N. Sharma (B) HMR Institute of Technology and Management, Delhi, India e-mail: [email protected] C. Manchanda e-mail: [email protected] I. Kaushik Krishna Institute of Engineering and Technology, Ghaziabad, Uttar Pradesh, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_3
1 Introduction
Mental illness is less painful than physical illness, but it is more common and also difficult to bear. Depression is a serious medical illness that negatively affects how a person feels and thinks. India is among the countries facing the most severe burden of mental illnesses, in terms of years of life lost due to ill health or death, adjusted for population size [1]. According to a WHO report, 6.5% of the population of India suffers from depression. General symptoms of depression are:
• Feelings of sadness and emptiness
• Loss of interest in most day-to-day activities and personal hobbies [2, 3]
• Anxiety, short temper and irritation
• Difficulty in concentrating, thinking and even making choices
Depression is a curable disease. If depression is detected at an early stage, the duration of treatment is shorter [4]. Regrettably, the rate of access to treatment is surprisingly low. There are effective measures and treatments for curing depression, but there is a severe shortage of mental health staff such as psychologists, doctors and psychiatrists in the country. The major problem is that nearly two-thirds of patients do not seek help. The reasons for not seeking treatment include not recognizing the symptoms of depression, fear of visiting a mental health specialist, the high cost of doctors and the shortage of doctors. Depression damages not only the mental health of a person but physical fitness too. It is associated with high blood pressure, back ache and diabetes [5]. Depression also raises the risk for heart patients by 67% and the risk of cancer by 50% [6]. This psychological illness also affects the whole family, peers and other connections in the form of anxiety and mental collapse.
This motivates us to help those who suffer and gives purpose to investing in depression anticipation and treatment. Keeping in mind the destructive effects of depression on people and on society, computer vision researchers have suggested approaches based on vocal and non-vocal data for precise assessment of a person’s depression level. Changes in audio and visual data have been exploited for automated, contact-free analysis of depressive behaviours. The graph in Fig. 1 ranks eight countries on the basis of their depressed population. China has the highest percentage of depressed population, 12.25%; that is, almost one-eighth of its population is suffering from depression. India ranks third with 6.5%; that is, 8.7 crore people out of its population of 133.92 crore suffer from depression [7]. Many machine learning algorithms which classify depression have already been proposed [8]. These approaches treat depression detection as a classification problem in which they differentiate between patients’ depression levels [9]. These algorithms face a dataset imbalance problem. Around 300 million people out of 7.7 billion people
Fig. 1 Depression ranks of countries
face depression, and this low frequency of depression in the general population leads to imbalanced datasets. Some authors have also proposed that the speech and voice features of depressed people differ from those of non-depressed, healthy people. Visual data convey important features like facial expressions, head pose, body movement and eye blinks, which also differ between depressed and non-depressed people. But models built on this information also fail at times because of data imbalance. Humans will probably always be better at understanding emotions than machines, but machines are also gaining experience based on their own assets. The motive is not a competition between humans and machines but to make machines learn from humans. Emotional Artificial Intelligence (EAI) is a subclass of artificial intelligence which deals with human emotions. It is an emerging research topic widely used for sentiment analysis involving human emotions. To increase access to mental health services, it is essential to use advanced technology and active measures. Moreover, people should be more conscious of their emotional and mental health. An effective and easily approachable depression detection system should be made available on a platform that is easily accessed by the majority of people, that is, the internet. Various surveys conducted worldwide report that Twitter is one of the most popular social networks in the world and that people use it to share thoughts, beliefs and feelings as well as their life events. Therefore, we chose Twitter as the platform for building the depression detection system. In summary, the goal of this research is to deliver a depression detection tool for Twitter, one of the most popular social networks, using Natural Language Processing (NLP), Convolutional Neural Network
(CNN) and Recurrent Neural Network (RNN) techniques for emotional analysis, and to construct the depression detection algorithm [10]. The problem of imbalanced datasets can be resolved by refining the datasets and bringing the classes to almost equal sizes. CNN is one of the most widely used techniques for image classification, with the desired number of convolutional layers. Twitter also provides a suitable dataset for text classification because of the limit on the number of characters allowed in a single tweet. RNN with Long Short-Term Memory (LSTM) and NLP can be used for the classification of texts. The combined result of both algorithms provides a better and reformed version of the emotional AI (EAI) algorithms used till now. In this paper, we apply EAI on tweet data which is divided into depressed and non-depressed emotion classes. An RNN with LSTM-based model is used for prediction on data obtained from the user. A speech-to-text library of Python is used to obtain vocal data from the user and convert it into text for prediction by the model. The refined image dataset is used to train a CNN-based model for classifying images into the two classes. The combined result from both models is used for predicting the emotional state and level of depression of the user.
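The last step, combining the two model outputs and mapping the result to a depression level, could be sketched as below. This is only a hypothetical illustration: the text does not specify the combination rule or the predefined score ranges, so the weighted average, the thresholds in `LEVELS` and the names `combine_scores` and `depression_level` are all assumptions.

```python
# Hypothetical score ranges (upper bound, label); the predefined ranges used by the
# authors are not given in this section.
LEVELS = [(0.33, "low"), (0.66, "medium"), (1.0, "high")]


def combine_scores(cnn_score, rnn_score, image_weight=0.5):
    """Weighted average of the image (CNN) and text (RNN) depression scores in [0, 1]."""
    for score in (cnn_score, rnn_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are expected to be probabilities in [0, 1]")
    return image_weight * cnn_score + (1.0 - image_weight) * rnn_score


def depression_level(score):
    """Map a combined score to the first range whose upper bound it does not exceed."""
    for upper, label in LEVELS:
        if score <= upper:
            return label
    raise ValueError("score is outside the expected range [0, 1]")


if __name__ == "__main__":
    combined = combine_scores(cnn_score=0.8, rnn_score=0.4)
    print(combined, depression_level(combined))
```

An equal weight for both models is the simplest choice; in practice the weight would be tuned on validation data.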
2 Literature Review
Healthcare is not just about how you are physically, but also about how well your mental state is. Many people who are not doing well psychologically tend to form patterns in their day-to-day activities, first among them being their choice of words, social media activities, searches, etc. The methods employed for assessment are personality inventories, psychological tests, clinical examination, brain scanning, etc.
Expert consultations: This practice is carried out by skilled mental health professionals. The consultant requires solid knowledge of the signs and indications of depression, along with observational skills. It is a talking treatment in which proficient professionals guide the patient in the right direction. This practice can also be led by other mental health experts, but the method is time consuming.
Gadgets like smartphones, laptops, etc., can be used to collect a user’s behavioural data, which reflects the mental condition of the user. Youngsters use various applications and perform activities which leave digital footprints that might offer signs of their psychological well-being. Specialists say likely signs include variations in writing speed, voice quality, word choice, etc. A wide range of work has observed user behaviour or mental state using data collected by smartphones, including detecting whether the user is depressed or not [11]. Depression has been predicted from mobile data using features extracted from GPS location tracking, SMS, Google searches, social media activities, etc. Earlier research used speech to observe and identify depression. The features of acoustic speech have recently been examined as possible signs of depression
in adults. The effects of depression reflected in the speech production system make speech a viable feature for depression detection. Cannizzaro et al. [12] studied the connection between depression and speech through statistical analysis of different speech factors. Acoustic speech has different variables, which include speaking rate (words per minute), percent pause time and pitch variation. Speaking rate and pitch variation were strongly related to depression. Besides the study of speech factors, there is research that studies writing for depression detection, considering the syntactic construction and semantic content of text written by an individual with depression. Different psychological theories link semantic factors to depression detection. Beck et al.’s [13] theory of depression proposes that individuals inclined to depression have depressive schemas, which result in seeing the world from a negative perspective, not appreciating anything and being isolated from everything. Once triggered, these schemas give rise to depression and to conflicting and traumatic behaviour. Munmund et al. [14] studied how effectively social media can be used to detect depression. Social media offer an opportunity to analyse social network data for users’ state of mind and thoughts, and to study their moods and attitudes when they interact via social media applications. The dependent variables in the data, such as social activities, sentiments, choice of words, etc., were fetched from Twitter. Tweets showing self-assessed depressive aspects help in recognizing depression beforehand and make it possible for parents, specialists and individuals to examine posts for linguistic cues that indicate deteriorating mental well-being. The model developed in this research achieved 70% prediction accuracy. Wang et al.
[15] presented the structure to generate probabilistic appearance outlines for video data to work on depression detection. To detect depression from the data from video, it initially detects significant facial landmarks to depict facial appearance variation and calculates the outline variations of regions defined by different landmarks which are further used to train the support vector machine classifier model. After that, Bayesian estimation scheme is applied to the facial data from the videos to generate probabilistic outline for facial landmarks. After examining the outlines for facial landmarks, outcome shows that there is difference between the expressions of depressed and not depressed individual. Zhu et al. [16] proposed a D-Convolutional Neural Network (DCNN)-based method for prediction of depression from the video data. DCNN is most commonly used for analysing the visual image data which achieved a superior result. The presented model comprises of two simultaneous CNNs: an expression DCNN to take out facial features and a dynamic DCNN to take out dynamic wave features by calculating the visual drift among a certain number following mounts and both predicting the scale of depression. At the last of their DCNN model, to merge the results of both CNNs (expression and dynamic), two completely coupled layers are implemented.
A. Rustagi et al.
3 Dataset
The facial image dataset was obtained by modifying the open-source FER2018 data. We had a training dataset, a test dataset (used as the validation dataset for our project) and a private dataset (the same size as the test dataset, used for evaluating prediction performance). The original dataset, in both its training and test parts, has six categories in total: 'Angry, Surprise, Happy, Sad, Disgust and Neutral'. Our research, however, requires only two categories, depressed and non-depressed. We therefore grouped four of the above categories into the non-depressed class and two into the depressed class. After further selection of images, we ended up with around 6000 images in each of the two classes. The text data consisted of tweets collected using the Twitter API. Around 10,000 tweets were collected and split into training and testing datasets in the ratio 80:20. Two lists of tokens (words) were compiled, one for each dataset. The training list consisted of words signifying depressive inclinations, such as 'depressed', 'suicide' and 'self-harm'. For the test dataset, random tweets were collected, including both positive and negative features.
Our Approach
Distribution learning is a framework that assigns a distribution label to an entity rather than a set of separate labels [17]. When a model learns the distribution over the label space for a sample, it captures the relative importance of each label in that space [18]. This method can therefore be used to improve a model's predictive accuracy. Distribution learning is widely used in problems such as emotion recognition [19] and age estimation [20]. We began by dividing our task into two sections. The first half comprises classification of facial expressions into the depressed or non-depressed class. The second half consists of classification of the text data received from the user into the same two classes [21]. The combined prediction from both sections is used to determine the user's level of depression among the three predefined levels.
4 Model Architecture
The first section of the model classifies facial images of a person into depressed and non-depressed classes. The CNN [22] is a well-known deep neural network architecture that specializes in image classification. It takes an image as input, assigns learnable weights and biases to various attributes of the image and learns to distinguish one from another [23]. The architecture of a ConvNet is analogous to the connectivity patterns of the human brain. The purpose of using a convolutional neural network is to make it easier to process images in
a simpler form that retains the important features, without losing features that are crucial for a good prediction [24]. As shown in Fig. 2, the input is a colour image of size (32, 32) from the training dataset, which is first converted to greyscale and passed into a convolutional neural network with three convolutional layers and fully connected layers, giving one of the two classes as output [25]. We start with a set of three convolutional layers, each followed by a max-pooling layer; the 'relu' activation function is used in all three convolutional layers, and the max-pooling layers use pool size (2, 2). The number of features captured by the convolutional layers is increased from 32 to 128, as such a hierarchical structure (with an increasing number of nodes per layer) is proposed to perform better for deep neural networks. Finally, the convolved output is flattened and passed through two dense layers to reach the output, where the 'Softmax' activation function is used for classification into the two classes. Table 1 shows the number of trainable parameters in each layer of the convolutional network. The model has a total of 37,218 trainable parameters and is trained on around 12,000 images for depression detection.
The second section of the model performs text classification. An RNN [26] with LSTM is used for this purpose. Depression is a state of mind that cannot be predicted from a single text from a person. Predicting depression requires keeping in mind the person's previous conversations and inputs [27]. Although we do not yet know exactly how the brain works, it is assumed to have a logic unit and a memory unit. Decisions are made on the basis of reasoning and experience. Hence, for algorithms to do the same, we provide them with memory. This is the purpose of using an RNN [28].
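To make the role of memory concrete, the following minimal sketch contrasts a stateless unit with a vanilla recurrent unit on two sequences that end with the same input. The weights (0.8 and 0.5) and the input sequences are arbitrary values chosen for illustration; they are not taken from the paper's model, which uses an LSTM rather than this simple recurrence.

```python
import math


def feedforward(x, w=0.8):
    """Stateless unit: the output depends only on the current input."""
    return math.tanh(w * x)


def rnn_step(x, h_prev, w_x=0.8, w_h=0.5):
    """Vanilla recurrent unit: mixes the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev)


def run_rnn(inputs):
    """Feed a sequence through the recurrent unit and return the final state."""
    h = 0.0
    for x in inputs:
        h = rnn_step(x, h)
    return h


# Two sequences that end with the same input but have different histories
# (illustrative values only).
seq_a = [1.0, 1.0, 0.5]
seq_b = [-1.0, -1.0, 0.5]

ff_a, ff_b = feedforward(seq_a[-1]), feedforward(seq_b[-1])
rnn_a, rnn_b = run_rnn(seq_a), run_rnn(seq_b)

print("feed-forward outputs equal:", ff_a == ff_b)
print("recurrent outputs differ:", rnn_a != rnn_b)
```

The feed-forward unit sees only the last input, so it cannot separate the two sequences, whereas the recurrent state carries the earlier inputs forward into the final output.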
A general feed-forward neural network memorizes what it learnt during training and generates outputs from that alone; an RNN, in addition to what it learnt in training, also remembers past inputs and uses them when generating outputs. For example, a vanilla feed-forward network learns what a '1' looks like and then uses that learning to classify all inputs, whereas an RNN classifies later inputs on
Fig. 2 CNN architecture for depression detection
Table 1 Summary for CNN model

Layer (type)       Output shape          Number of parameters
Conv2d_1           (None, 26, 26, 32)    896
Max_pooling2d_1    (None, 13, 13, 32)    0
Conv2d_2           (None, 11, 11, 32)    9248
Max_pooling2d_2    (None, 5, 5, 32)      0
Conv2d_3           (None, 3, 3, 64)      18496
Max_pooling2d_3    (None, 1, 1, 64)      0
Flatten_1          (None, 64)            0
Dense_1            (None, 128)           8320
Dense_2            (None, 2)             258
the basis of both its current knowledge (training) and its past knowledge (previous inputs). In a general feed-forward neural network, a fixed-size input vector is provided, processed and converted to a fixed-size output vector. When these transformations are applied to a series of input vectors to generate the output vectors, the network becomes a recurrent network that accepts inputs of varying size and achieves higher accuracy. In practice, however, RNNs suffer from two difficulties, the vanishing gradient problem and the exploding gradient problem, which make plain RNNs unfit for use [29]. This is where we use LSTMs. An LSTM introduces a memory unit called a 'cell' into the neural network. The decision is now made by considering the current input, the prior output and the prior memory. A new output is generated and the old memory is updated. Figure 3 motivates the use of an RNN with LSTM through an accuracy comparison with the other candidate networks. Figure 4 illustrates the working of a recurrent neural network, which uses the past outputs at time steps (t−1) and (t) for prediction at time step (t + 1). Once the datasets are received, the data is pre-processed, which includes removing duplicates, word tokenization, removing stop words and expanding contractions. All inputs to a neural network should be of the same length; therefore, the length of the longest sentence is stored. The words are converted into tokens, and sentences shorter than the maximum length are padded with the value '0' at the end. Next, an Embedding layer is added ahead of the LSTM. Embedding addresses the major problem of sparse input data by mapping the high-dimensional data to lower dimensions. The model is then compiled with the 'categorical_crossentropy' loss function and the 'adam' optimizer. Table 2 shows the number of trainable parameters in each layer of the recurrent neural network.
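The per-layer counts in Tables 1 and 2 can be cross-checked with the standard parameter formulas for convolutional, dense, embedding and LSTM layers. The sketch below assumes 3 × 3 kernels, three input channels in the first convolution and a vocabulary of 2000 tokens. None of these values is stated in the text; they are inferred here only because they reproduce the tabulated counts.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel 2D convolution."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry, no bias."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Three gates plus the cell candidate, each with input weights,
    recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


# Table 1 (CNN); kernel size 3 and 3 input channels are assumptions.
cnn_layers = [
    conv2d_params(3, 3, 32),    # Conv2d_1
    conv2d_params(3, 32, 32),   # Conv2d_2
    conv2d_params(3, 32, 64),   # Conv2d_3
    dense_params(64, 128),      # Dense_1
    dense_params(128, 2),       # Dense_2
]
cnn_total = sum(cnn_layers)

# Table 2 (RNN); the vocabulary size of 2000 is an assumption.
rnn_layers = [
    embedding_params(2000, 128),  # Embedding_1
    lstm_params(128, 196),        # Lstm_1
    dense_params(196, 2),         # Dense_3
]
rnn_total = sum(rnn_layers)

print(cnn_layers, cnn_total)
print(rnn_layers, rnn_total)
```

Pooling, flatten and dropout layers have no trainable parameters, which is why they contribute 0 in both tables.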
The text model has a total of 511,194 trainable parameters. In this paper, the captured image of the user is taken as input for the image prediction model, and the text obtained from the user as answers to the system's questionnaire is taken as input for the text prediction model. The prediction scores of the two models are averaged, and the level of depression is decided on
Fig. 3 Accuracy versus Epoch graph for text analysis
Fig. 4 Sequence of RNN
Table 2 Summary for RNN model

Layer (type)           Output shape         Number of parameters
Embedding_1            (None, 2573, 128)    256000
Spatial_dropout1d_1    (None, 2573, 128)    0
Lstm_1                 (None, 196)          254800
Dense_3                (None, 2)            394
Fig. 5 Comparison graph for three approaches
the basis of the predefined range of depression-level scores into which the predicted score falls. Figure 5 supports our approach over the previously proposed approaches to the problem, showing the accuracy obtained using only facial expressions (blue), only text (green) and combined text and face images (mustard).
Proposed Algorithm
1. Data collection: the csv file is converted to images, and images are selected on the basis of the factors mentioned earlier in the paper.
2. A CNN model is prepared with three convolutional layers; the images are converted to greyscale and the model is trained for 300 epochs. Here, a small matrix of numbers (a filter) is passed over the image, and the image is transformed on the basis of the filter values.
$$G[m, n] = (f * h)[m, n] = \sum_{j} \sum_{k} h[j, k]\, f[m - j, n - k] \tag{1}$$
The values of the resulting feature map are computed according to the expression above, where f is the input image and h is the kernel. The row and column indices of the resulting matrix are denoted by m and n, respectively. By the rules of convolution, the filter and the image must have the same number of channels. To apply several filters to an image, we convolve the image with each filter separately and stack the results into a single output. The following formula gives the resulting dimensions:

$$[n, n, n_C] * [f, f, n_C] = \left[ \left\lfloor \frac{n + 2P - f}{s} + 1 \right\rfloor, \left\lfloor \frac{n + 2P - f}{s} + 1 \right\rfloor, n_f \right] \tag{2}$$

Here, $n$ = image size, $f$ = filter size, $n_C$ = number of channels in the image, $P$ = padding used, $s$ = stride used and $n_f$ = number of filters.
3. An RNN model with LSTM is prepared for text-based predictions and trained for 25 epochs. The gates in an LSTM use sigmoid activation functions, whose outputs lie between 0 and 1; a value of 0 means the gate is closed and 1 means it is open. The equations for the LSTM gates are:

$$i_t = \sigma(w_i [h_{t-1}, x_t] + b_i) \tag{3}$$

This equation is used for storing information in the cell state.

$$f_t = \sigma(w_f [h_{t-1}, x_t] + b_f) \tag{4}$$

This equation is used for determining the information to be discarded from the cell state.

$$o_t = \sigma(w_o [h_{t-1}, x_t] + b_o) \tag{5}$$

This is the output gate, which provides the activation for the final output of the LSTM block. Here,
$i_t$: input gate
$f_t$: forget gate
$o_t$: output gate
$\sigma$: sigmoid function
$w_x$: weights for the neurons of gate $x$
$h_{t-1}$: output of the previous block at timestamp $t - 1$
$x_t$: input at timestamp $t$
$b_x$: biases for gate $x$
The final output is given by

$$h_t = o_t * \tanh(c_t)$$

where $c_t$ is the cell state (memory) at timestamp $t$.
4. Both models are saved with the '.h5' extension.
5. The inputs are collected from the user: the image is captured using OpenCV and the text through a questionnaire. Passing these to the respective models gives two outputs, which are label encoded and used to calculate the final output as follows:

$$p_f = \frac{p_i + p_t}{2} \tag{6}$$

where
$p_f$: final prediction
$p_i$: prediction from image
$p_t$: prediction from text
$p_f = 0$: no depression
$p_f = 1$: medium depression
$p_f = 0$: high depression
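The averaging step of Eq. (6) can be sketched as follows. The sketch assumes that each model's label-encoded output is 0 (non-depressed) or 1 (depressed), so that the average can only take the values 0, 0.5 and 1. The mapping of those three values to the levels none, medium and high is an assumption made for illustration and is not taken from the paper; the function and variable names are likewise hypothetical.

```python
def fuse(p_image: int, p_text: int) -> float:
    """Average the label-encoded outputs of the two models, as in Eq. (6)."""
    for p in (p_image, p_text):
        if p not in (0, 1):
            raise ValueError("expected a label-encoded prediction of 0 or 1")
    return (p_image + p_text) / 2


# Hypothetical mapping from the averaged score to a depression level.
# Assumption: 1 encodes the depressed class in both models.
LEVELS = {
    0.0: "no depression",
    0.5: "medium depression",
    1.0: "high depression",
}


def depression_level(p_image: int, p_text: int) -> str:
    """Look up the level for the fused score."""
    return LEVELS[fuse(p_image, p_text)]


for p_i, p_t in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(p_i, p_t, fuse(p_i, p_t), depression_level(p_i, p_t))
```

Under this assumption, the two models agreeing on the depressed class gives the highest level, and disagreement gives the intermediate level.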
5 Result
Figure 6 shows some of the questions asked by the machine to obtain the user's answers as input for the RNN-based text model. The result is shown in Fig. 7, with the image of the user along with the level of
Fig. 6 Question–Answer between machine and user
Fig. 7 Image captured with final predictions
depression predicted using both the models and the final calculated level of depression of the user.
6 Conclusion and Future Scope
Motivated by the imbalance in depression datasets, which reflects the uneven prevalence of depression in the population, we have prepared a model for automated depression detection that is easily accessible to most people. In this paper, a combined learning architecture of a CNN and an RNN is presented for automated depression detection. Although the model was trained by dividing the task into two sections, the prediction was made by combining the scores from both models. This division makes it possible to discover the connection between facial images, text data and depression levels, and leaves room to improve the accuracy of the model with expert suggestions in future. Supervised classification cannot reach human-level accuracy in predicting depression from text data alone; facial image data are therefore used to improve the results and accuracy of the model. Experiments on the public FER2018 data and Twitter datasets showed that our proposed method yields interesting results in comparison with related work in this field. This suggests that treating depression as a combination of visual and mental illness is significant. The amount of data used in this research is small, and further experiments can be done with more data. In future, we can also work on other features for depression detection, such as features extracted from speech. The model can also be developed further and converted into a smartphone application to increase its reach. Features for handling different stages of depression can be added to provide the needed help.
References 1. M.S. Neethu, Rajsree, Sentiment analysis in twitter using machine learning techniques, in Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT) (2013) 2. S. Meystre, P.J. Haug, Natural language processing to extract medical problems from electronic clinical documents: performance evaluation. J. Biomed. Inf. 39(6) (December 2006) 3. M. Desai, M.A. Mehta, Techniques for sentiment analysis of Twitter data: a comprehensive survey, in International Conference on Computing, Communication and Automation (ICCCA) (2016) 4. B.W. Conti D, The economic impact of depression in the workplace. J. Occup. Med. 36, 983–988 (1994) 5. M. H. Foundation, Physical health and mental health 6. T. Kongsuk, S. Supanya, K. Kenbubpha, S. Phimtra, S. Sukhawaha, J. Leejongpermpoon, Services for depression and suicide in Thailand. WHO South-East Asia. J. Public Health 6(1), 34–38 (2017) 7. A. Halfin, REPORTS depression: the benefits of early and appropriate treatment © Ascend Media. November, pp. 92–97 (2007) 8. L. Canzian, M. Musolesi. Trajectories of depression: unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis, in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (ACM, 2015) 9. R. LiKamWa et al., Moodscope: building a mood sensor from smartphone usage patterns, in Proceedings of ACM MobiSys (2013) 10. Yrr, zgr, et al., Context-awareness for mobile sensing: a survey and future directions. IEEE Commun. Surveys Tutor. 18(1), 68–93 (2016) 11. Saeb, Sohrab, et al., Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: an exploratory study. J. Med. Internet Res. 17(7) (2015) 12. M. Cannizzaro, B. Harel, N. Reilly, P. Chappell, P.J Snyder, Voice acoustical measurement of the severity of major depression. Brain Cogn. 56(1), 30–35 (2004) 13. A.T. 
Beck, Depression: Clinical, Experimental, and Theoretical Aspects (University of Pennsylvania Press, 1967) 14. M. De Choudhury, M. Gamon, Predicting Depression via Social Media. Proc. Seventh Int. AAAI Conf. Weblogs Soc. Media. 2, 128–137 (2013) 15. P. Wang, F. Barrett, E. Martin, M. Milonova, R.E. Gur, R.C. Gur, C. Kohler, R. Verma, Automated video-based facial expression analysis of neuropsychiatric disorders. J. Neurosci. Methods 168(1), 224–238 (2008) 16. Y. Zhu, Y. Shang, Z. Shao, G. Guo, Automated depression diagnosis based on deep networks to encode facial appearance and dynamics, in IEEE Transactions on Affective Computing (2017) 17. M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie, On the learnability of discrete distributions, in ACM Symposium on Theory of Computing (1994) 18. C. Manchanda, R. Rathi, N. Sharma, Traffic density investigation & road accident analysis in India using deep learning, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974528 19. X. Geng, C. Yin, Z. Zhou, Facial age estimation by learning from label distributions. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2401–2412 (2013) 20. Y. Zhou, H. Xue, X. Geng, Emotion distribution recognition from facial expressions, in ICM (2015) 21. M. Chakarverti, N. Sharma, R.R. Divivedi, Prediction analysis techniques of data mining: a review. SSRN Electron. J. (2019). https://doi.org/10.2139/ssrn.3350303 22. P. Ray, A. Chakrabarti, A mixed approach of deep learning method and rule-based method to improve aspect level sentiment analysis. Appl. Comput. Inf. (2019) 23. M. Grover, B. Verma, N. Sharma, I. Kaushik, Traffic control using V-2-V based method using reinforcement learning, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974540
24. L. Zhang, S. Wang, B. Liu, Deep learning for sentiment analysis: a survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8(4), e1253 (2018) 25. J. Deriu, M. Gonzenbach, F. Uzdilli, A. Lucchi, V. De Luca, M. Jaggi, SwissCheese at SemEval2016 Task 4: sentiment classification using an ensemble of convolutional neural networks with distant supervision, in Proceedings of the 10th International Workshop on Semantic Evaluation (2016), pp. 1124–1128 26. M. Harjani, M. Grover, N. Sharma, I. Kaushik, Analysis of various machine learning algorithm for cardiac pulse prediction, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974519 27. Y. Yin, S. Yangqiu, M. Zhang, NNEMBs at SemEval-2017 Task 4: neural twitter sentiment classification: a simple ensemble method with different embeddings, in Proceedings of the 11th International Workshop on Semantic Evaluation (2017), pp. 621–625 28. R. Tiwari, N. Sharma, I. Kaushik, A. Tiwari, B. Bhushan, Evolution of IoT & data analytics using deep learning, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974481 29. H. Pan, H. Han, S. Shan, X. Chen, Mean-variance loss for deep age estimation from a face, in CVPR (2018)
A Hybrid Cost-Effective Genetic and Firefly Algorithm for Workflow Scheduling in Cloud
Ishadeep Kaur and P. S. Mann
Abstract Cloud computing is emerging as a platform that delivers high-quality information over the Internet at very low cost. However, it still has numerous concerns that need attention, and workflow scheduling is among the most serious of them. In this paper, we propose a Hybrid Cost-Effective Genetic and Firefly Algorithm (CEFA) for workflow scheduling in cloud computing. In the existing approach, the number of iterations is very large, which increases the total execution cost and time; the proposed algorithm aims to reduce both. Performance is evaluated on scientific workflows, and the results show that the proposed algorithm performs better than the existing algorithm. Three parameters are used to compare the performance of the existing and proposed algorithms: (1) execution time, (2) execution cost and (3) termination delay.
Keywords Cloud computing · Genetic algorithm · Workflow scheduling · Firefly algorithm · Execution time · Execution cost · Termination delay
I. Kaur (B), Department of Computer Science and Engineering, DAVIET, Jalandhar, India, e-mail: [email protected]
P. S. Mann, Department of Information Technology, DAVIET, Jalandhar, India, e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_4
1 Introduction
Recent developments in cloud infrastructure are enabling providers to offer services in a fundamentally more scalable and organized way. Cloud computing is an emerging technology based on the pay-per-use model. It is a computing paradigm in which applications, data, bandwidth and IT services are provided over the Internet. The objective of cloud service providers is to use resources efficiently and achieve the maximum profit. The steady evolution of cloud computing in IT
has led to several definitions of cloud computing. The US National Institute of Standards and Technology (NIST) defines cloud computing as [1]: “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” More than a hundred million computing devices are connected to the Internet, and many of them use cloud computing services daily because the cloud provides a flexible and easy way to store and retrieve data and files [2]. Cloud computing is a promising technology that allows customers to pay only for what they need. It enables the hosting of pervasive applications from the consumer, scientific and business domains, and is advancing a utility-oriented IT service model to customers around the world. The rising cost of tuning and managing computer systems is driving businesses to outsource these services so that they can focus on their core activities. The features of cloud computing include on-demand self-service, broad network access, resource pooling and rapid elasticity. On-demand self-service means that clients (typically organizations) can request and manage their own computing resources. The term cloud computing combines two notions from computing technology: it is a pool of diverse resources and a hybrid, large-scale infrastructure that covers both the applications delivered over the Internet as services and the hardware and system software required to support those services.
2 Related Work
There are several studies of workflow scheduling in cloud computing. The author of [1] proposed a genetic algorithm approach for scheduling workflow applications that either minimizes cost while meeting the user's deadline constraint or minimizes execution time while meeting the user's budget. The algorithm splits the fitness function into two parts, cost fitness and time fitness, and solves the budget- and deadline-constrained optimization problems. The results show that the genetic algorithm is better at handling complex workflow structures. In [2], the author proposed a new heuristic algorithm for task scheduling that embeds a fast technique named Elitism Stepping into the genetic algorithm, with the objective of reducing the schedule length within an acceptable computational time. The algorithm sorts the tasks in order of execution according to their bottom level, which reduces the finish time. Compared with BGA, the proposed algorithm obtained a better schedule length (finish time), and the results show a significant improvement in computation time. The author of [3] surveyed the existing workflow scheduling algorithms in cloud computing and tabulated their parameters along with the tools used. The author concluded that existing workflow scheduling algorithms do not consider reliability and availability, so there is a need for a workflow scheduling algorithm that can improve
the availability and reliability in the cloud environment. The author of [4] presented a scheduling technique based on a relatively new swarm-based approach known as Cat Swarm Optimization (CSO). This technique improves on PSO in terms of speed of convergence. By using the seeking mode and tracing mode, the algorithm reduces wasted energy and obtains a solution in far fewer iterations. The author targeted minimization of the total cost, a minimum number of iterations and fair distribution of the workload, and showed that CSO gives better results than PSO in terms of execution time and computation time. The paper [5] presents a novel hybrid algorithm named ACO–FA, which combines Ant Colony Optimization (ACO) with the Firefly Algorithm (FA) to solve unconstrained optimization problems. The proposed algorithm combines the merits of both ACO and FA and is initialized with a set of random ants that roam through the search space. Owing to the simplicity of its procedure, it can handle complex problems of realistic dimensions. It can efficiently overcome the drawback of the classical ant colony algorithm, which is not suitable for continuous optimization. The author of [6] proposed MPQGA, which generates multiple priority queues using heuristic-based crossover and heuristic-based mutation operators in order to reduce the makespan. The author used an integer-string-coded genetic algorithm that employs roulette-wheel selection and elitism. It exploits the advantages of the HEFT heuristic to find a better result, in which the highest-priority task, determined by the upward rank, is mapped onto the processor that gives the smallest EFT. For the initial population, it produces a set of priority queues based on the downward rank and on a combination of level with upward and downward rank, and the remaining priority queues are chosen randomly. These three heuristic methods are used to generate good seeds that are spread uniformly over the entire feasible solution space, so that no region is left unexplored. The algorithm covers a larger search space than a deterministic algorithm without much additional cost. In [7], the author presented a deadline-constrained heuristic-based genetic algorithm for scheduling applications on the cloud that decreases the execution cost while meeting the deadline. Each task is assigned a priority using its bottom level and top level. The algorithm is compared with SGA under the same deadline constraint and pricing model, and the simulation results show that it has promising performance compared with SGA. Its performance is evaluated with synthetic workflows such as Montage, LIGO, Epigenomics and CyberShake. The author of [8] presented a hybrid approach for workflow scheduling in heterogeneous computing systems, which combines the benefits of a heuristic algorithm and a metaheuristic algorithm by modifying its genetic operators. In [9], the author suggested a genetic algorithm that works on multicore processors. The main objective of this algorithm is to reduce the makespan and raise the speed-up ratio. The Weight Sum Approach (WSA) is used to calculate the fitness function. The simulation results show that the suggested algorithm performs better than the current algorithm and appears to be efficient and effective in improving the overall performance of the computing system considerably. It uses the HEFT heuristic for the initial seed, as HEFT is better than other list-based heuristics in terms of robustness and makespan. The HEFT heuristic gives direction to the algorithm in improving
the performance; as a result, it converges faster than with a random initial population. It uses the direct representation method for chromosomes, and each chromosome consists of two parts. Elitism helps maintain quality by copying the best chromosomes from one iteration to the next. Two genetic operators, crossover and mutation, are used, which help optimize the fundamental objective (minimizing the makespan) in less time. It also optimizes load balancing during execution and produces a smaller makespan by rearranging tasks on a multicore processor. In [10], the author proposed the RTEAH algorithm, which improves performance by decreasing the makespan, waiting time and burst time through managing the load on the processor. First, the Round Robin algorithm is hybridized with the Throttled algorithm, which gives more flexibility than either algorithm alone; the result is then hybridized with the ESCE algorithm, which reduces the waiting and burst times. Then, to overcome the problem of index table updating, the algorithm is merged with the ABCO algorithm. The RTEAH algorithm thus performs better and also manages the load. In [11], a novel workflow scheduling method is introduced: a fuzzy dominance sort based heterogeneous earliest finish time (FDHEFT) algorithm, which merges the fuzzy dominance sort mechanism with list-based scheduling. The proposed algorithm performs better than the existing algorithm and also minimizes CPU runtime. The algorithm proposed in [12] is GAAPI, a hybridization of the genetic algorithm and ant colony optimization. The paper addresses the problem of searching for solutions among local and global optima, as well as the lack of advanced search capability, which hybridizing evolutionary algorithms can help to solve. The proposed algorithm maintains a balance between exploration and exploitation: the genetic algorithm works on the solution search and API speeds up its convergence, which increases the chance of faster convergence toward the global optimum. A comparison with PSO and GA shows that the proposed algorithm performs better. The author of [13] describes an IIoT-based health monitoring framework in which smartphones or desktops can continuously monitor a person's body via Bluetooth using ECG signals; if any disorder is detected, the information is sent securely to healthcare professionals, which helps to avoid preventable deaths. The service is integrated with the cloud for secure, safe and high-quality data transmission and to maintain the patient's privacy.
3 Proposed Approach
The main objective of the paper is to improve the results of the cost-effective genetic algorithm by hybridizing it with the firefly algorithm. PEFT-generated solutions are used as the initial population of the firefly algorithm, which optimizes them. The firefly-optimized solution is then provided to the genetic algorithm for further optimization, thereby giving better results in terms of termination delay, finish time, and execution cost. The PEFT algorithm is chosen because it is the first list-based heuristic that has outperformed HEFT, which was previously the best in terms of makespan and efficiency. The working of the PEFT algorithm is explained below.
Algorithm 1: PEFT
1. Calculate the values of the optimistic cost table (OCT) matrix according to Eq. (1). The matrix gives the cost to execute all the jobs.
   $$OCT(t_i, p_k) = \max_{t_j \in succ(t_i)} \left[ \min_{p_w \in P} \left\{ OCT(t_j, p_w) + w(t_j, p_w) + \bar{c}_{i,j} \right\} \right], \quad \bar{c}_{i,j} = 0 \text{ if } p_w = p_k \tag{1}$$
2. Compute the rank of every job ($rank_{OCT}$) from its OCT values using Eq. (2).
   $$rank_{OCT}(t_i) = \frac{\sum_{k=1}^{P} OCT(t_i, p_k)}{P} \tag{2}$$
3. Repeat until all the jobs are assigned to the desired resources.
   a. Calculate the optimistic earliest finish time (OEFT) using Eq. (3).
   $$O_{EFT}(t_i, p_j) = EFT(t_i, p_j) + OCT(t_i, p_j) \tag{3}$$
   b. Assign the job to the processor that gives the least OEFT.
4. Return the optimal solution.
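The OCT and rank computations of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `oct_table` and `rank_oct`, the dictionary-based DAG encoding, and the toy costs are all hypothetical, and exit tasks are assumed to have an OCT of zero.

```python
def oct_table(succ, w, c_bar, n_procs):
    """Return {task: [OCT(task, p) for each processor p]} following Eq. (1).

    succ  : dict task -> list of successor tasks (missing/empty for exit tasks)
    w     : dict task -> list of execution costs, one entry per processor
    c_bar : dict (task, successor) -> average communication cost
    """
    memo = {}

    def oct_value(t, p):
        if (t, p) not in memo:
            children = succ.get(t, [])
            if not children:
                memo[(t, p)] = 0.0  # assumption: exit tasks have OCT = 0
            else:
                memo[(t, p)] = max(
                    min(
                        # communication cost vanishes when both tasks share a processor
                        oct_value(s, q) + w[s][q] + (0.0 if q == p else c_bar[(t, s)])
                        for q in range(n_procs)
                    )
                    for s in children
                )
        return memo[(t, p)]

    return {t: [oct_value(t, p) for p in range(n_procs)] for t in w}


def rank_oct(table):
    """Average OCT over all processors, as in Eq. (2)."""
    return {t: sum(row) / len(row) for t, row in table.items()}


# Toy DAG with hypothetical costs: A -> B and A -> C on two processors.
succ = {"A": ["B", "C"]}
w = {"A": [2, 3], "B": [5, 1], "C": [6, 4]}
c_bar = {("A", "B"): 2, ("A", "C"): 1}

table = oct_table(succ, w, c_bar, n_procs=2)
ranks = rank_oct(table)
print(table["A"], ranks["A"])
```

Tasks would then be scheduled in decreasing order of `ranks`, each on the processor that minimizes EFT plus OCT as in Eq. (3).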
The optimal solution achieved from PEFT is used to seed the initial population of the firefly algorithm, i.e., this solution acts as the first solution in the firefly population and the other solutions are generated randomly. The fitness, which is the attractiveness of a firefly based on its light intensity, is then calculated, and the population of fireflies is updated based on that fitness.
Algorithm 2: Firefly Optimization
1. Initialize the firefly population using the prioritized solution of the PEFT algorithm. PEFT computes this solution by using the OCT table and the optimistic earliest finish time (OEFT).
2. Repeat steps a to c until the termination condition is met.
   a. Calculate the relative distance and attractiveness between the fireflies in the population.
   b. Update the light intensity of the fireflies as determined by the objective function.
   c. Order the fireflies and update their positions.
3. Return the best solution.
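The loop of Algorithm 2 can be sketched with the standard continuous firefly update, in which a firefly moves toward every brighter one with attractiveness beta0 * exp(-gamma * r^2) plus a small random step. The paper does not specify its schedule encoding or parameter values, so everything here is an assumption for illustration: the function name `firefly_minimize`, the parameters, and the sphere function used as a stand-in for the scheduling objective.

```python
import math
import random


def firefly_minimize(cost, seed_solution, n_fireflies=8, n_iter=30,
                     beta0=1.0, gamma=1.0, alpha=0.1, rng=None):
    """Minimise `cost`, seeding the population with one known solution."""
    rng = rng or random.Random(0)
    dim = len(seed_solution)
    # First firefly is the seed (e.g. a PEFT schedule); the rest are random.
    pop = [list(seed_solution)] + [
        [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(n_fireflies - 1)
    ]
    for _ in range(n_iter):
        costs = [cost(x) for x in pop]       # lower cost = brighter firefly
        new_pop = [list(x) for x in pop]
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if costs[j] < costs[i]:      # move i toward brighter j
                    r2 = sum((a - b) ** 2 for a, b in zip(new_pop[i], pop[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    new_pop[i] = [
                        a + beta * (b - a) + alpha * (rng.random() - 0.5)
                        for a, b in zip(new_pop[i], pop[j])
                    ]
        pop = new_pop                        # the brightest firefly never moves
    return min(pop, key=cost)


def sphere(x):
    """Stand-in objective; a real run would evaluate the schedule's makespan."""
    return sum(v * v for v in x)


seed = [0.5, 0.5]
best = firefly_minimize(sphere, seed)
print(sphere(seed), sphere(best))
```

Because the brightest firefly in each iteration is left in place, the best cost in the population can never get worse than that of the seed.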
The optimized solution of the firefly algorithm is then fed to the genetic algorithm to obtain the results in terms of termination delay, execution cost, and finish time of the schedule. This combination forms the proposed hybrid algorithm, CEGF.
Algorithm 3: Proposed CEGF
1. Create the first population by taking one chromosome from the PEFT algorithm and generating the remaining chromosomes randomly.
2. Optimize the population using the Firefly Algorithm.
3. Compute the fitness value of the optimized population from the Firefly Algorithm as the execution time of the solution.
4. Feed the optimized solution of the Firefly Algorithm to the Genetic Algorithm.
5. Select chromosomes randomly and apply the crossover and mutation operators of the Genetic Algorithm to produce the next generation.
6. Validate the resulting solution by checking the fitness function and add it to the new population.
7. The Genetic Algorithm produces the best-optimized solution.
8. Evaluate the performance parameters such as execution time, execution cost, and termination delay.
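Steps 4 to 7 of Algorithm 3 can be sketched as a small genetic algorithm that starts from a given schedule. This is an illustration under assumptions rather than the CEGF implementation: the chromosome is taken to be a task-to-VM assignment, the fitness ignores task dependencies, elitism is added so the seed is never lost, and the names `makespan` and `ga_refine` as well as the runtime matrix are hypothetical.

```python
import random


def makespan(chrom, runtime):
    """Largest total load on any VM (task dependencies ignored for brevity)."""
    load = {}
    for task, vm in enumerate(chrom):
        load[vm] = load.get(vm, 0.0) + runtime[task][vm]
    return max(load.values())


def ga_refine(seed, runtime, n_vms, pop_size=10, generations=40,
              p_mut=0.1, rng=None):
    """Refine a seed schedule with single-point crossover and mutation."""
    rng = rng or random.Random(1)
    n = len(seed)

    def fitness(chrom):
        return makespan(chrom, runtime)

    # One chromosome comes from the upstream optimiser, the rest are random.
    pop = [list(seed)] + [
        [rng.randrange(n_vms) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        nxt = [list(min(pop, key=fitness))]      # elitism (an assumption here)
        while len(nxt) < pop_size:
            a, b = rng.sample(pop, 2)            # random parent selection
            cut = rng.randrange(1, n)            # single-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.randrange(n_vms) if rng.random() < p_mut else g
                     for g in child]             # per-gene mutation
            nxt.append(child)
        pop = nxt
    return min(pop, key=fitness)


# Hypothetical runtimes: runtime[task][vm] for four tasks on two VMs.
runtime = [[3, 5], [2, 4], [6, 1], [4, 4]]
seed = [0, 0, 0, 0]                              # e.g. the firefly output
best = ga_refine(seed, runtime, n_vms=2)
print(makespan(seed, runtime), makespan(best, runtime))
```

In the full pipeline the seed would be the firefly-optimized schedule, and the fitness would be evaluated by the workflow simulator instead of this simplified load model.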
The flowchart of the proposed technique is shown as a block diagram in Fig. 1.
4 Result and Discussions
The proposed approach, CEGF (Cost-Effective Genetic and Firefly hybrid), has been simulated in Java using the NetBeans IDE with the WorkflowSim simulator. The results have been analyzed on various scientific workloads present in the WorkflowSim package, including Montage, CyberShake, and Epigenomics with varying numbers of tasks, and compared with the existing CEGA (Cost-Effective Genetic Algorithm) technique in terms of finish time, execution cost, and termination delay.
Fig. 1 Flowchart of the proposed algorithm [flowchart: apply the PEFT algorithm for initial population generation; firefly algorithm: initialize the firefly population with the PEFT prioritized schedule, move the fireflies and evaluate the light intensity; genetic algorithm: initialize the GA population with the firefly-optimized schedule, then selection, crossover, mutation, evaluate the new solutions, update solutions, next generation; output: optimized schedule]

4.1 Analysis in Terms of Finish Time
Finish time is the total execution time of a task, $t_i$, on the virtual machine of type $VM_k$ that has the least execution time among all the VM types available in the cloud. It is defined as
$$\text{Finish Time}(T_i) = \text{Start Time}(T_i) + \frac{\text{End Time}_{VM_k}}{1 - \text{variation}}$$
Figure 2 and Table 1 show the finish time of the proposed algorithm compared with the existing algorithm. The proposed algorithm performs better than the existing one. For the Montage 100 workload, the finish time is 98.08 ms for the proposed algorithm and 116.82 ms for the existing one. The same holds for the other workloads: the proposed technique achieves a lower finish time in every case.
Fig. 2 Comparison of finish time [bar chart "Finish Time Comparison"; series: GAFFA (Proposed) and GA (Existing)]

Table 1 Simulation results of proposed technique in terms of finish time
Scientific workflows    GAFFA (Proposed)    GA (Existing)
[Montage, 50]           36.71               59.21
[Montage, 100]          98.08               116.82
[CyberShake, 30]        280.01              482.94
[CyberShake, 50]        561.68              894.35
[CyberShake, 100]       1178.24             2290.21
[Epigenomics, 46]       4144.49             7550.89
[Epigenomics, 100]      5739.4              8554

4.2 Analysis in Terms of Execution Cost
The goal is to find a feasible schedule (S) for a given workflow that keeps the total execution cost low without exceeding the deadline (D) of the workflow. The execution cost can be calculated as
$$\text{Execution Cost} = \sum_{j=1}^{R} C_{r_j} \cdot \left( LFT_{r_j} - LST_{r_j} \right)$$
where R is the number of resources in the resource set, $C_{r_j}$ is the cost of resource $r_j$, LST is the lease start time, and LFT is the lease finish time. Figure 3 and Table 2 show the execution cost of the proposed algorithm compared with the existing algorithm. The proposed algorithm performs better than the existing one. For the Montage 100 workload, the execution cost is 22490 for the proposed algorithm and 154580 for the existing one. Similarly, for the other workloads, the proposed technique achieves a lower execution cost.
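As a worked illustration of the execution cost formula, the sum over leased resources can be computed directly. The function name `execution_cost`, the tuple layout, and the numbers are assumptions made for this sketch, with the cost term read as a price per unit of lease time.

```python
def execution_cost(leases):
    """Sum of cost rate times lease duration over all leased resources.

    leases: iterable of (cost_per_time_unit, lease_start, lease_finish).
    """
    return sum(rate * (lft - lst) for rate, lst, lft in leases)


# Two hypothetical VMs: one leased for 10 time units, one for 20.
leases = [(2.0, 0.0, 10.0), (0.5, 5.0, 25.0)]
total = execution_cost(leases)
print(total)
```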
Fig. 3 Comparison of execution cost [bar chart "Execution Cost Comparison"; series: GAFFA (Proposed) and GA (Existing)]

Table 2 Simulation results of proposed technique in terms of execution cost
Scientific workflows    GAFFA (Proposed)    GA (Existing)
[Montage, 50]           11420               45790
[Montage, 100]          22490               154580
[CyberShake, 30]        28950               113580
[CyberShake, 50]        30272               151890
[CyberShake, 100]       58952               235982
[Epigenomics, 46]       15800               198518
[Epigenomics, 100]      23974               237521
4.3 Analysis in Terms of Termination Delay
When a VM is leased, it takes time to initialize properly, and whenever computing resources are released, they take time to shut down. A longer resource acquisition time increases the total execution time, and a longer shutdown time increases the overall cost of the workflow. Figure 4 and Table 3 show the termination delay of the proposed algorithm compared with the existing algorithm. The proposed algorithm performs better than the existing one. For the Montage 100 workload, the termination delay is 308 ms for the proposed algorithm and 6156 ms for the existing one. Similarly, for the other workloads, the proposed technique achieves a lower termination delay.
Fig. 4 Termination delay comparison [bar chart "Termination Delay Comparison"; series: GAFFA (Proposed) and GA (Existing)]

Table 3 Simulation results of proposed technique in terms of termination delay
Scientific workflows    GAFFA (Proposed)    GA (Existing)
[Montage, 50]           238                 2016
[Montage, 100]          308                 6156
[CyberShake, 30]        418                 2895
[CyberShake, 50]        445                 3034
[CyberShake, 100]       127                 2987
[Epigenomics, 46]       284                 1945
[Epigenomics, 100]      487                 2259
5 Conclusion
The existing CEGA algorithm introduces a novel scheme for encoding, population initialization, crossover, and mutation operators of the Genetic Algorithm. It mostly focuses on minimizing the delay, finish time, and cost, and it reflects the characteristics of the cloud such as heterogeneity, on-demand resource provisioning, and the pay-as-you-go model. Simulation experiments conducted on four scientific workflows show that CEGA exhibits the highest hit rate for the deadline constraint. However, the existing approach requires a very large number of iterations, which increases the total execution cost and total execution time, and the proposed algorithm reduces both. The proposed algorithm is simulated in the WorkflowSim simulator using the NetBeans IDE, and the results are better than those of the existing procedure. Future work is to extend and enhance the proposed algorithm by using resource-aware and additional load balancing algorithms.
References
1. J. Yu, R. Buyya, Scheduling scientific workflow applications with deadline and budget constraints using genetic algorithms. Sci. Program 14(3–4), 217–230 (2006)
2. A. MasoudRahmani, M. Ali Vahedi, A novel task scheduling in multiprocessor systems with genetic algorithm by using elitism stepping method. INFOCOMP J. Comput. Sci. 7(2), 58–64 (2008)
3. A. Bala, I. Chana, A survey of various workflow scheduling algorithms in cloud environment, in Proceedings of the 2nd National Conference on Information and Communication Technology (NCICT) (2011)
4. Ciornei, E. Kyriakides, Hybrid ant colony-genetic algorithm (GAAPI) for global continuous optimization. IEEE Trans. Syst. Man, Cybern. B, Cybern. 42(1), 234–245 (2011)
5. A.A. El-Sawy, R.M. Rizk-Allah, E.M. Zaki, Hybridizing ant colony optimization with firefly algorithm for unconstrained optimization problems. Appl. Math. Comput. 224, 473–483 (2013)
6. S. Bilgaiyan, M. Das, S. Sagnika, Workflow scheduling in cloud computing environment using cat swarm optimization, in Proceedings of the 2014 IEEE International Advance Computing Conference (IACC) (IEEE, 2014)
7. J. Hu, K. Li, K. Li, Y. Xu, A genetic algorithm for task scheduling on heterogeneous computing systems using multiple priority queues. Inf. Sci. 270(6), 255–287 (2014)
8. A. Verma, S. Kaushal, Cost-time efficient scheduling plan for executing workflows in the cloud. J. Grid Comput. Springer 13(4), 495–506 (2015)
9. S.G. Ahmad, C.S. Liew, E.U. Munir, T.F. Ang, S.U. Khan, A hybrid genetic algorithm for optimization of scheduling workflow applications in heterogeneous computing systems. J. Parallel Distrib. Comput. 87, 80–90 (2016)
10. M.S. Hossain, G. Muhammad, Cloud-assisted Industrial Internet of Things (IIoT)-enabled framework for health monitoring. Comput. Netw. 101, 192–202 (2016)
11. A. Bose, P. Kuila, T. Biswas, A novel genetic algorithm based scheduling for multi-core systems, in 4th International Conference on Smart Innovations in Communication and Computational Sciences (SICCS), vol. 851 (Springer, 2018), pp. 1–10
12. G. Zhang, J. Sun, J. Zhou, S. Hu, T. Wei, X. Zhou, Minimizing cost and makespan for workflow scheduling in cloud using fuzzy dominance sort based HEFT. Future Gener. Comput. Syst. 93, 278–289 (2019)
13. S.R. Gundu, T. Anuradha, Improved hybrid algorithm approach based load balancing technique in cloud computing 9(2) Version 1 (2019)
Flexible Dielectric Resonator Antenna Using Polydimethylsiloxane Substrate as Dielectric Resonator for Breast Cancer Diagnostics Doondi Kumar Janapala and Moses Nesasudha
Abstract In this work, a flexible Dielectric Resonator Antenna (DRA) operating at 2.45 GHz is presented for breast cancer diagnosis. Polydimethylsiloxane (PDMS) is used as the Dielectric Resonator (DR). The proposed radiating element consists of concentric circular arcs formed in an inverse symmetrical manner on both sides of the microstrip feed line. A Defected Ground Structure (DGS) is used as the ground plane; it is formed by etching slots that create concentric square rings below the radiator. Four square-shaped PDMS slabs are used as the DR and placed below the slots of the DGS. The DRA is simulated for both flat and bent conditions and a comparative analysis is presented. The suitability of the antenna is verified by placing it near a female breast phantom model without and with cancer tumor tissue. The simulated Specific Absorption Rate (SAR) of the antenna on a skin model, a male human left arm phantom model, and a female breast model is evaluated and presented.
Keywords Polydimethylsiloxane (PDMS) · Flexible · Wearable · Specific absorption rate (SAR) · Dielectric resonator antenna · Breast cancer diagnosis
D. K. Janapala (B) · M. Nesasudha, Department of ECE, Karunya Institute of Technology and Sciences (Deemed to be University), Coimbatore 641114, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_5

1 Introduction
The development of flexible antennas has grown rapidly in recent years. This growth is driven by the need for new antennas that can adapt to daily-life monitoring applications such as health, entertainment, emergency response, surveillance, sensing, military use, and health care screening, diagnostics and treatment. Flexible antennas are very easy to mount on curved surfaces, which makes them more suitable for human
wearable applications. Over the years, several flexible antennas have been developed using different kinds of flexible dielectric spacers such as Rogers materials, Kapton, paper, cloth, polyimide, polyetherimide, polyethylene glycol and PDMS [1–5]. One of the main applications of these flexible devices is health monitoring, diagnosis and treatment, and several antennas have been developed over the years to detect, monitor and treat conditions in health care applications. While designing these wearable antennas, the main consideration is to understand the variations in the dielectric properties of human tissues across different age groups, sizes and genders. The measurement of the electrical characteristics of female breast tissues is explained in [6]. In vivo and ex vivo validation for breast tissues is presented in [7]. These validated measurements of the electrical properties of healthy and malignant tissues can help in developing antennas for diagnosis or treatment. Breast cancer is one of the deadliest cancers that women face. Early detection of breast cancer using antennas is an open research topic in which different kinds of antennas and detection methods have been investigated over the years. A review of electromagnetic techniques for the detection of breast cancer is presented in [8]. Nanomaterial-based sensors and wearable sensors can also be used for breast cancer detection [9]. The use of antennas in microwave imaging for breast cancer detection is presented in [10–12]. A five-port ring reflectometer probe system for in vitro breast tumor detection is implemented in [13]. Mm-wave skin cancer detection using a Vivaldi antenna is presented in [14]. A flexible microwave antenna is developed for chemo-thermotherapy of the breast in [15]. A flexible 16-antenna array is used to detect breast cancer in [16]. Similarly, a 4 × 4 array antenna is developed for 3D breast cancer detection in [17]. A comparison of wide-slot and stacked patch antennas for breast cancer detection is presented in [18].
In the current work, a compact 53 mm × 36 mm antenna is designed, and PDMS slabs of 1 mm thickness are used as the DR to develop a flexible DR antenna for wearable applications. The proposed antenna backed with the PDMS DR shows a significant decrease in SAR because the PDMS resonator minimizes leakage towards the human phantom body. The performance of the designed antenna is analyzed with different human body phantom models to validate its suitability for wearable applications. The antenna is analyzed for bending radii of 30, 40 and 50 mm. The proposed antenna is placed near the female breast phantom model, and the dielectric properties of the healthy and malignant tissues at 2.45 GHz are assigned to the phantom models. The variations between the cases without and with a tumor are presented for breast cancer detection using the E, H, and J field distributions.
2 Proposed DRA Antenna Design and Specifications
The proposed DRA geometry is presented in Fig. 1. Rogers RO3006, with a dielectric constant of 6.15 and a loss tangent of 0.0025, is used as the dielectric spacer. A transparent and flexible PDMS substrate is used as the DR; the pure PDMS layer is prepared without impurities. The PDMS dielectric constant is 2.7 and its loss tangent is 0.314. The dimensions of the DR antenna operating at 2.45 GHz are optimized using a parametric study, and the optimal dimensions are presented in Table 1.
2.1 Step by Step Implementation of Proposed Antenna
The proposed antenna is designed and simulated using ANSYS HFSS 19.2. The step-by-step implementation of the designed DRA is presented in Fig. 2. From Fig. 2, it can be seen that the antenna without the DR and DGS operates at 2.52 GHz (red curve), whereas adding the DGS tunes the antenna to operate at 2.42 GHz (blue curve). With the addition of the PDMS layers on the back side as the DR, the antenna operates at 2.45 GHz. It can also be seen that the addition of the DR improves the bandwidth (BW) and the reflection coefficient.

Fig. 1 Proposed DRA a Top view b Bottom view c Side view and d Diagonal view
Table 1 Optimal dimensions (in mm)
L = 53         L1 = 35.5      L2 = 1.3       L3 = 15.85
W = 36         W1 = 1.3       W2 = 14.2      W3 = 1.65
W4 = 5.675     W5 = 2.0625    W6 = 10.4875   r1 = 4.2625
r2 = 6.325     r3 = 9.075     r4 = 11.1375   r5 = 13.2
r6 = 14.85     r7 = 16.5      H1 = 1.27      H2 = 1
Fig. 2 Step by step implementation of DRA and the respective reflection coefficient versus frequency curves comparison
2.2 Performance of DRA Without Bending and With Bending
The designed DR antenna is bent onto a cylindrical shape with a radius of 30, 40 and 50 mm. The comparative analysis is presented with the help of the reflection coefficient curves in Fig. 3a and the radiation patterns in Fig. 3b, c. From Fig. 3a, it can be seen that the designed DRA maintains a reflection coefficient below −10 dB at 2.45 GHz in both flat and bent conditions. The simulated reflection coefficient values at 2.45 GHz for the flat condition and for bending radii of 30 mm, 40 mm, and 50 mm are −27.24 dB, −16.43 dB, −16.21 dB and −13.94 dB, respectively. In Fig. 3b, c, the radiation patterns for the flat and bent conditions show some shift: at phi = 0°, the pattern is broadened in the bent condition compared with the flat condition.
Fig. 3 (a) Reflection coefficient curves comparison for flat and flexible condition with different bending radii (b) radiation pattern flat condition and (c) radiation pattern bending condition (Ra = 30 mm)
3 Effects of Human Body on Designed DRA
The effect of the human body on the performance of the designed DRA is evaluated by analyzing the antenna in different conditions. The minimum distance between the antenna and the phantom model is maintained at 10 mm. Taking the safety of the human body into consideration, an input power of 100 mW is given to the antenna. The antenna is analyzed in the flat condition by placing it on top of a four-layered tissue model, which consists of skin followed by fat, muscle and bone. The dielectric properties of the respective tissues at 2.45 GHz are listed in Table 2. The position of the DRA in the flat condition on top of the phantom model and the resulting SAR are presented in Fig. 4. In a similar way, the bent antenna is placed near the female breast phantom model and the SAR analysis is carried out. Figure 5 illustrates the DRA position near the female breast phantom and the SAR analysis. The maximum average SAR values evaluated over a volume of 1 g of tissue, obtained from Figs. 4 and 5, are listed in Table 3. From Table 3, the maximum SAR value is 1.3867 W/kg. According to the FCC standard, which is followed in the US and India, the tolerable SAR limit is 1.6 W/kg averaged over 1 g of tissue.

Table 2 Dielectric properties of body tissue at 2.45 GHz
S.No    Tissue    Relative permittivity (εr)    Loss tangent    Conductivity (S/m)
1       Skin      42.853                        0.27255         1.5919
2       Fat       5.2801                        0.14524         0.10452
3       Muscle    52.729                        0.24194         1.7388
4       Bone      11.381                        0.2542          0.39431
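The three columns of Table 2 are linked by the standard relation tan δ = σ / (ω ε0 εr), so the tabulated conductivities can be cross-checked from the permittivity and loss tangent. The sketch below is only a consistency check, not part of the authors' simulation flow: the function name `conductivity` is illustrative, and the stray value 5.2801 in the extracted table is taken to be the relative permittivity of fat.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity in F/m


def conductivity(eps_r, loss_tangent, freq_hz):
    """Effective conductivity in S/m from sigma = omega * eps0 * eps_r * tan(delta)."""
    return 2.0 * math.pi * freq_hz * EPS0 * eps_r * loss_tangent


# (relative permittivity, loss tangent, tabulated conductivity in S/m) at 2.45 GHz
tissues = {
    "Skin": (42.853, 0.27255, 1.5919),
    "Fat": (5.2801, 0.14524, 0.10452),   # assumes 5.2801 is the fat permittivity
    "Muscle": (52.729, 0.24194, 1.7388),
    "Bone": (11.381, 0.2542, 0.39431),
}

for name, (eps_r, tan_d, sigma_table) in tissues.items():
    sigma = conductivity(eps_r, tan_d, 2.45e9)
    print(f"{name}: computed {sigma:.4f} S/m, tabulated {sigma_table} S/m")
```

The computed and tabulated conductivities agree to within rounding for all four tissues, which supports reading 5.2801 as the missing fat permittivity.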
Fig. 4 Position of the Antenna on layered phantom model and SAR analysis
Fig. 5 DRA bended (30 mm) condition placed near female breast phantom model
Table 3 SAR comparison
S.No    Condition                       SAR value (W/kg)
1       Flat (Layered phantom)          1.1041
2       Bend (Female breast phantom)    1.3867
4 Breast Cancer Detection Using Designed DRA
For breast cancer detection using antennas, the measuring setup consists of one transmitting antenna and one receiving antenna. The reflected wave is analyzed over time for the cases without and with a tumor. The difference between the received impulses gives information about the tumor. By changing the position of the antenna and by using the field distribution plots, the position and size of the tumor can be estimated. In the current work, the designed DRA is placed near the human breast phantom model and the changes in the E, H, and J field distributions are analyzed for the cases without and with a tumor. A cancer tumor of 3 mm radius is considered and positioned 2 mm under the skin of the phantom model, and the antenna is placed 10 mm away from the phantom model. The dielectric properties of the healthy and cancer-affected breast tissues at 2.45 GHz are listed in Table 4. The position of the tumor is shown in Fig. 6. The comparative analysis of the E, H and J field distributions without and with the tumor for the flat and bent antenna conditions is illustrated in Figs. 7 and 8.

Table 4 Breast tissue dielectric properties
                        Healthy tissue    Cancer tumor tissue
Dielectric constant     4.4401            55.2566
Conductivity (S/m)      0.1304            2.7015
Fig. 6 Female breast phantom model with cancer tumor
Comparing the E-field distributions without and with the tumor in Fig. 7a, c, the maximum E-field value without the tumor is 63.86 V/m, whereas with the tumor it increases to 74.26 V/m. For the current distribution in Fig. 7b, d, the maximum value is 102.539 A/m² without the tumor and 117.8865 A/m² with the tumor. Similarly, for the bent antenna condition, Fig. 8a, c shows a maximum E-field of 51.91 V/m without the tumor and 58.50 V/m with the tumor. From Fig. 8b, d, the maximum value of the current distribution over the volume is 101.74 A/m² without the tumor and 117.099 A/m² with the tumor. The presence of the tumor causes a significant anomaly in the field distribution and increases the E and J field values because of the change in the dielectric properties of the cancer-affected tissue. From Figs. 7 and 8, it can be seen that there is a significant change in the E, H and J field distributions between the without-tumor and with-tumor conditions.

Fig. 7 Field distribution comparison for antenna in flat condition without tumor: a E-field, b J-field; with tumor: c E-field, d J-field

Fig. 8 Field distribution comparison for antenna in bended (30 mm) condition without tumor: a E-field, b J-field; with tumor: c E-field, d J-field
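To put a number on the reported change, the relative increase of each peak field value can be computed from the figures quoted in the text. This is simple arithmetic on the reported maxima; the helper name `percent_increase` and the dictionary layout are illustrative only.

```python
def percent_increase(without_tumor, with_tumor):
    """Relative increase of a peak field value, in percent."""
    return 100.0 * (with_tumor - without_tumor) / without_tumor


# (peak without tumor, peak with tumor) as quoted in the text
peaks = {
    "flat, E (V/m)": (63.86, 74.26),
    "flat, J (A/m^2)": (102.539, 117.8865),
    "bent, E (V/m)": (51.91, 58.50),
    "bent, J (A/m^2)": (101.74, 117.099),
}

changes = {name: percent_increase(a, b) for name, (a, b) in peaks.items()}
for name, change in changes.items():
    print(f"{name}: +{change:.1f} %")
```

All four peak values rise by roughly 12 to 17 percent when the tumor is present.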
5 Conclusion
A flexible Dielectric Resonator Antenna is designed for 2.45 GHz wearable health care applications. The bending behavior of the designed antenna is verified by analyzing its performance for different bending conditions. From the results in Sect. 3, the designed antenna maintains a SAR below the standard limit of 1.6 W/kg; the maximum SAR value obtained is 1.3867 W/kg, which indicates that the DRA is a suitable candidate for wearable applications. From the results in Sect. 4, the suitability of the antenna for detecting breast cancer is verified by considering the female breast phantom model without and with a tumor. There is a significant change in the E, H, and J field distributions between the without-tumor and with-tumor cases for the antenna in both flat and bent conditions. The designed DRA can be used as a pair of transmitting and receiving antennas placed around a female breast model to detect the change in the received impulse and thereby the position and size of the tumor.
References
1. H.R. Khaleel, H.M. Al-Rizzo, D.G. Rucker, Compact polyimide-based antennas for flexible displays. J. Display Technol. 8(2), 91–97 (2012)
2. C.M. Dikmen, G. Cakir, S. Cimen, Ultra wide band crescent antenna with enhanced maximum gain, in 2017 20th International Symposium on Wireless Personal Multimedia Communications (WPMC) (2017). https://doi.org/10.1109/wpmc.2017.8301822
3. L. Xing, Y. Huang, Q. Xu, S. Alja’afreh, T. Liu, Complex permittivity of water-based liquids for liquid antennas. IEEE Antennas Wirel. Propag. Lett. 15, 1626–1629 (2016)
4. S. Ahmed, F.A. Tahir, A. Shamim, H.M. Cheema, A compact kapton-based inkjet-printed multiband antenna for flexible wireless devices. IEEE Antennas Wirel. Propag. Lett. 14, 1802–1805 (2015)
5. R.B.V.B. Simorangkir, A. Kiourti, K.P. Esselle, UWB wearable antenna with a full ground plane based on PDMS-embedded conductive fabric. IEEE Antennas Wirel. Propag. Lett. 17(3), 493–496 (2018)
6. T.-H. Kim, J.-K. Pack, Measurement of electrical characteristics of female breast tissues for the development of the breast cancer detector. Prog. Electromagn. Res. C 30, 189–199 (2012)
7. R.J. Halter, T. Zhou, P.M. Meaney, A. Hartov, R.J. Barth, K.M. Rosenkranz, W.A. Wells, C.A. Kogel, A. Borsic, E.J. Rizzo, K.D. Paulsen, The correlation of in vivo and ex vivo tissue dielectric properties to validate electromagnetic breast imaging: initial clinical experience. Physiol. Meas. 30(6), S121–S136 (2009)
8. M.M. Islam, M.R.I. Faruque, N. Misran, M.T. Islam, Detection of breast cancer using electromagnetic techniques: a review. Int. J. Appl. Electromagn. Mech. 51(3), 215–233 (2016)
9. S. Sugumaran, M.F. Jamlos, M.N. Ahmad, C.S. Bellan, D. Schreurs, Nano structured materials with plasmonic nano-biosensors for early cancer detection: a past and future prospect. Biosens. Bioelectron. 100, 361–373 (2018)
10. A. Vispa, L. Sani, M. Paoli, A. Bigotti, G. Raspa, N. Ghavami, G. Tiberi, UWB device for breast microwave imaging: phantom and clinical validations. Measurement (2019). https://doi.org/10.1016/j.measurement.2019.05.109
11. X. Guo, M.R. Casu, M. Graziano, M. Zamboni, Simulation and design of an UWB imaging system for breast cancer detection. Integr. VLSI J. 47(4), 548–559 (2014)
12. T. Gholipur, M. Nakhkash, Optimized matching liquid with wide-slot antenna for microwave breast imaging. AEU Int. J. Electron Commun. 85, 192–197 (2018)
13. C.Y. Lee, K.Y. You, Z. Abbas, K.Y. Lee, Y.S. Lee, E.M. Cheng, S-band five-port ring reflectometer-probe system for in vitro breast tumor detection. Int. J. RF Microwave Comput. Aided Eng. 28(3), e21198 (2017)
14. A. Mirbeik-Sabzevari, S. Li, E. Garay, H.-T. Nguyen, H. Wang, N. Tavassolian, Synthetic ultrahigh-resolution millimeter-wave imaging for skin cancer detection. IEEE Trans. Biomed. Eng. 1–1 (2018). https://doi.org/10.1109/tbme.2018.2837102
15. M. Asili, P. Chen, A.Z. Hood, A. Purser, R. Hulsey, L. Johnson, E. Topsakal, Flexible microwave antenna applicator for chemo-thermotherapy of the breast. IEEE Antennas Wirel. Propag. Lett. 14, 1778–1781 (2015)
16. H. Bahramiabarghouei, E. Porter, A. Santorelli, B. Gosselin, M. Popovic, L.A. Rusch, Flexible 16 antenna array for microwave breast cancer detection. IEEE Trans. Biomed. Eng. 62(10), 2516–2525 (2015)
17. T. Sugitani, S. Kubota, A. Toya, T. Kikkawa, Compact planar UWB antenna array for breast cancer detection, in Proceedings of the 2012 IEEE International Symposium on Antennas and Propagation (2012). https://doi.org/10.1109/aps.2012.6348794
18. D. Gibbins, M. Klemm, I.J. Craddock, J.A. Leendertz, A. Preece, R. Benjamin, A comparison of a wide-slot and a stacked patch antenna for the purpose of breast cancer detection. IEEE Trans. Antennas Propag. 58(3), 665–674 (2010)
Machine Learning-Based Prototype for Restaurant Rating Prediction and Cuisine Selection Kunal Bikram Dutta, Aman Sahu, Bharat Sharma, Siddharth S. Rautaray, and Manjusha Pandey
Abstract India is popular for its assorted multi-cuisine food prepared in a huge number of restaurants and hotel resorts, which is indicative of unity in diversity. The food chain industry and the restaurant business in India are very competitive, and a lack of research and knowledge about the competition usually leads to the failure of many such enterprises. The principal issues that continue to cause difficulties include high real estate expenses, escalating food costs, a fragmented supply chain, and over-licensing, and even after dealing with these the restaurateur does not know whether the business will grow or not. This project aims to solve this problem by analyzing ratings, reviews, cuisines, restaurant type, demand, online ordering service, table booking, and availability of the restaurant, training a machine learning model on these features, and predicting the rating of a new restaurant and how many positive and negative reviews should be expected. This research work uses Zomato data for the city of Bengaluru as an example to show how the model works and how it can help a restaurateur choose the location and cuisine that will give better ratings and reviews and make the business more profitable.
Keywords Cuisine · Random forest regressor · Ratings · Restaurants · Zomato
K. B. Dutta (B) · A. Sahu · B. Sharma · S. S. Rautaray · M. Pandey, School of Computer Engineering, Kalinga Institute of Industrial Technology (Deemed to be University), Bhubaneswar, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_6
1 Introduction
The restaurant industry is highly competitive in India, as restaurants from all over the world can be found here; from the United States to Japan to Russia, all types of cuisine are available. Delivery, dine-out, pubs, bars, drinks, buffets, desserts: whatever the type, India has it. Without a reliable selection of cuisine, a restaurant will most likely have difficulty standing out from the crowd. Setting up a new restaurant and gaining a competitive advantage requires a detailed study of the demographics of the surrounding area and the quality of existing competitors, and this field lacks technology-based analysis and solutions. The food industry is not yet saturated relative to demand, yet new eateries are opening every day, and it has become challenging for them to compete with already established restaurants. We therefore use the Zomato dataset for Bengaluru to showcase how our machine learning prototype can help a new restaurateur pick the menu, theme, cuisine, etc., based on an analysis of the demography of the locations and the ratings of the restaurants there, which can help to avoid high competition in the industry. In Bengaluru, most people depend mainly on restaurants as they do not have time to cook for themselves. With such an overwhelming demand for restaurants, it has become important to study the demography of a location, what kind of food is more popular in a locality, and where to eat for the best experience in that locality. Based on the existing restaurants and their ratings, our prototype can predict which cuisines will earn the best ratings, using features of the dataset such as restaurant type, votes, reviews, average cost, online ordering, and table booking facility [1].
2 Technologies Used
NumPy, short for Numerical Python, provides an efficient interface for storing and operating on dense data buffers. NumPy arrays resemble Python's built-in list type but are faster and less costly to store and operate on, and they remain efficient as the data grows in size. At its core is a powerful N-dimensional array object, designed so that it integrates easily with C/C++ and other code, together with mathematical functionality such as Fourier transforms. NumPy also works well alongside other modules widely used in machine learning, such as Pandas and Matplotlib. Pandas is another widely used machine learning package; it is written in Python and is built on top of NumPy.
Machine Learning-Based Prototype for Restaurant …
The basic advantage of Pandas is that it provides the DataFrame and Series objects. A DataFrame is, in simple terms, a multidimensional array with row and column labels that can hold heterogeneous types and missing data. Besides offering convenient storage for labelled data, Pandas implements powerful data operations familiar to users of database frameworks; for example, operations can be performed on the data grouped by other columns, and pivot tables can be created. Scikit-Learn is another very widely used Python library. It provides efficient versions of a large number of well-known algorithms, including Random Forests, k-means, Gradient Boosting, and SVM (support vector machine), and is designed to operate with the Python libraries NumPy and SciPy. Scikit-Learn is characterized by a clean, uniform, and streamlined API, together with helpful and complete online documentation. It also offers advanced functions such as boosting and bagging, feature selection, outlier detection, and noise rejection, as well as methods for model selection and validation such as cross-validation, hyperparameter tuning, and metrics. Matplotlib is a data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. It produces high-quality figures in a variety of formats and interactive environments across platforms. Line plots, histograms, bar charts, error charts, scatter plots, etc., can be produced with just a few lines of code. Seaborn is a data visualization library that provides an interface for drawing detailed and visually appealing graphics; it is built on Matplotlib and is compatible with Pandas. It aims to make visualization a central part of exploring and presenting data. Its functions operate on lists, DataFrames, or large arrays, and internally perform the statistical operations and mapping needed to produce informative plots [2].
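As a minimal sketch of the DataFrame features described above (labelled rows and columns, missing values, and pivot tables), the following uses a small made-up table. The column names and values are illustrative assumptions and are not taken from the Zomato dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names and values are illustrative only.
df = pd.DataFrame(
    {
        "location": ["A", "A", "B", "B"],
        "rest_type": ["Cafe", "Bar", "Cafe", "Bar"],
        "rate": [4.0, np.nan, 3.5, 4.5],  # one missing rating
    }
)

# A Series is a single labelled column; mean() skips missing values.
mean_rate = df["rate"].mean()

# Pivot table: mean rating for each (location, rest_type) pair.
pivot = df.pivot_table(
    index="location", columns="rest_type", values="rate", aggfunc="mean"
)
```

The same pattern extends to grouping by several columns at once; which columns are worth grouping by depends on the data at hand.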
3 State of the Art

| Article | Author | Year | Approach | Review |
|---|---|---|---|---|
| Restaurant Rating: Industrial Standard and Word-of-Mouth A Text Mining and Multidimensional Sentiment Analysis [3] | Yang Yu; Qiwei Gan [9–12] | 2015 | Sentiment analysis of reviews regarding aspects of food, decor, service, pricing, and special contexts to predict rating | Multidimensional sentiment analysis and text mining were applied. The paper is more theory-driven than data-driven |
| Prediction of star ratings from online reviews [4] | Ch. Sarath Chandra Reddy; K. Uday Kumar; J. Dheeraj Keshav; Bakshi Rohit Prasad; Sonali Agarwal [13–15] | 2017 | Several classifiers, such as Bag of Words with Multinomial NB, Bigram Multinomial NB, Trigram Multinomial NB, and Random Forest | Random Forest performed better than the other known classifiers. It is a good implementation of rating prediction, but it offers little of the help for new restaurateurs that we aim to provide |
| Multi-view Clustering in Collaborative Filtering Based Rating Prediction [26] | Chengcui Zhang; Ligaj Pradhan [16–18] | 2016 | To predict an unknown rating of a user for a restaurant, the cluster to which the user/restaurant belongs is found first, and then the average of the k-NNs from the user cluster gives the predicted user rating | Multi-view clustering produced better results; automatically comparing several known views and selecting a best or near-best set can improve user-item rating prediction. However, it only predicts the rating and cannot suggest a better cuisine for a location |
| Machine learning based class level prediction of restaurant reviews [5] | F. M. Takbir Hossain; Md. Ismail Hossain [19–21] | 2017 | The Scikit-learn library and the Natural Language Toolkit (NLTK) were used | The model aims to classify the reviews given by users as negative or positive. Sentiment analysis was done on the online reviews, but the problem of establishing new restaurants is not solved |
| Restaurant rating based on textual feedback [6] | Sanjukta Saha; A. K. Santra [22, 23] | 2017 | Collaborative filtering | Reviews are analysed to calculate the user ratings. There was no use of machine learning |
| Restaurant Recommendation System for User Preference and Services Based on Rating and Amenities [7] | R. M. Gomathi; S. P. Ajitha; T. G. Hari Satya Krishna; U. I. Harsha Pranay [24] | 2019 | NLP (Natural Language Processing) algorithms are used to identify the sentiments of the user comments | Sentiment analysis is performed on the reviews and user comments to recommend a hotel |
| Restaurant setup business analysis using yelp dataset [8] | Sindhu Hegde; Supriya Satyappanavar; Shankar Setty [25] | 2017 | Manual observation, kd tree | The data was analysed very well, but there was no use of machine learning |
4 Architecture Design
The Zomato Bengaluru dataset consists of 17 columns and 51717 rows. The columns are URL, name, address, book_table, online_order, votes, rate, phone, location, dish_liked, rest_type, approx_cost(for two people), cuisines, menu_item, reviews_list, listed_in(type), and listed_in(city). The ratings were stored as strings and were converted into floating-point values. The null values in the rating column were then filled with the mean value of the column. Rows containing null values in the remaining columns were dropped, as they were comparatively few. Following this, a layer of exploratory analysis was added to better understand the relations between the various columns and how they correlate with the “rating” column. After this, label encoding was applied to the “location,” “cuisine,” and “rest_type” (restaurant type) columns. With this, our data was cleaned and conditioned and was ready to be fed to the model. Figure 1 depicts the prototype of our model, i.e., how it works to predict the rating of restaurants that have not been rated yet. As we decided to go with the Random Forest algorithm, our model of choice was “RandomForestRegressor” from Scikit-Learn. It is an ensemble learner for regression built on decision trees. It works by constructing many decision trees at training time and outputting the mean prediction of the individual trees. Random forest builds multiple CART (Classification and Regression Trees) models from different samples and different subsets of the initial variables, and each model makes a prediction for each observation. The final prediction is a function of the individual predictions; it can simply be their mean.
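The averaging step described above can be checked with a short sketch. It fits a RandomForestRegressor on synthetic data (the features, target, and number of trees are assumptions made only for illustration, not the settings used in this work) and compares the forest's prediction with the mean of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data; purely illustrative, not the Zomato dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.5 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# n_estimators=25 is an arbitrary choice for this sketch.
forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# Predict with every fitted tree, then average the per-tree predictions.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# Prediction of the ensemble as a whole.
forest_pred = forest.predict(X[:5])
```

Here `manual_mean` and `forest_pred` agree up to floating-point rounding, which illustrates that the ensemble output is the mean of the tree outputs.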
Fig. 1 Proposed prototype
5 Implementation and Results
The Zomato Bengaluru dataset was loaded with the Pandas library. The dataset consisted of 17 columns and 51717 rows. The columns covered the URL of each restaurant, address, name, online order availability, table booking facility, rate, votes, phone number, location, restaurant type, list of liked dishes, cuisines, approximate average cost for two people, list of user reviews, menu items, and the restaurant listing type. First, we converted the “rate” column from string to float and filled the null values in this column with the column mean. Then, since phone number, URL, and address do not contribute to the overall rating of a restaurant, we dropped those columns. Exploratory data analysis (EDA) was then carried out on the dataset, using the Python libraries discussed earlier, to find relations among the columns. The EDA addressed the following questions: which are the top restaurant chains in Bengaluru; what percentage of restaurants lies in each location; what percentage belongs to each restaurant type; how many restaurants do not accept online orders or table bookings; what is the ratio between restaurants that do and do not provide table booking; whether votes differ between restaurants that do and do not accept online orders; which are the top restaurants by rating; how cost relates to rating; how restaurant type, location, and rating are distributed; which restaurant types are most common in Bengaluru; which cuisines are most popular; which dishes are most liked and which items appear most often on menus; and which cuisines are most common in each location. Every such relation among the columns (features) was extracted through pie charts, bar plots, box plots, histograms (with kernel density estimation), and scatter plots using the Python “seaborn” and “Matplotlib” libraries. The EDA shows that online ordering is associated with higher restaurant ratings (Fig. 3), whereas online table booking does not affect ratings much (Fig. 2). It is also clear that most of the restaurants are located in the top 10–15 locations (Fig. 4), with the rest scattered across other locations, and the most common restaurant types in the city can be observed as well (Fig. 5).
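Two of the EDA questions above, the share of restaurants per location and the relation between online ordering and rating, can be sketched with Pandas as follows. The toy rows, location names, and ratings are made up for illustration and do not reproduce the reported figures.

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative, not from the dataset.
df = pd.DataFrame(
    {
        "location": ["Area 1", "Area 1", "Area 2", "Area 1", "Area 3", "Area 2"],
        "online_order": ["Yes", "No", "Yes", "Yes", "No", "No"],
        "rate": [4.1, 3.6, 3.9, 4.0, 3.7, 3.5],
    }
)

# Percentage of restaurants in each location.
location_share = df["location"].value_counts(normalize=True) * 100

# Mean rating of restaurants that do and do not accept online orders.
mean_rate_by_order = df.groupby("online_order")["rate"].mean()
```

Series such as `location_share` could then be passed to a Matplotlib or seaborn plotting function to obtain charts of the kind shown in the figures.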
Fig. 2 Rating versus table booking
Fig. 3 Rating versus Online order
Fig. 4 Percentage of restaurants in that location
Fig. 5 Percentage type of restaurants
Since we are also trying to predict the best cuisines for a location from the ratings, we use the EDA to look at the most popular cuisines of the city (Fig. 6). We also examine the rating distribution and the cost distribution of the restaurants to decide how to proceed with feature selection and preprocessing. From Figs. 7 and 8, it is observed that most of the restaurants are rated between 3.5 and 4. It is also clear that restaurants with an average cost below 1000 have better ratings than the more expensive restaurants. Guided by the EDA, we preprocessed the data, filled in the missing values, and dropped the rows with unknown locations. Finally, based on the relations of the other columns with the rating, we selected the nine columns most highly correlated with the rating column to proceed with our model.
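The cleaning steps described in this section can be sketched as below. The sketch assumes, purely for illustration, that ratings are stored as strings such as "4.1/5" with occasional non-numeric entries; the helper `parse_rate`, the toy rows, and the placeholder values are hypothetical and not taken from the authors' code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the "x/5" rating format is an assumption.
raw = pd.DataFrame(
    {
        "rate": ["4.1/5", "3.8/5", None, "NEW", "3.2/5"],
        "votes": [775, 120, 0, 3, 40],
        "location": ["Area 1", "Area 2", "Area 1", None, "Area 3"],
        "phone": ["x"] * 5,  # placeholder values
    }
)


def parse_rate(value):
    """Return the numeric part of a rating string, or NaN if unparsable."""
    try:
        return float(str(value).split("/")[0])
    except ValueError:
        return np.nan


# Drop a column that does not contribute to the rating.
clean = raw.drop(columns=["phone"])

# Convert ratings to float and fill missing ones with the column mean.
clean["rate"] = clean["rate"].apply(parse_rate)
clean["rate"] = clean["rate"].fillna(clean["rate"].mean())

# Drop rows whose location is unknown.
clean = clean.dropna(subset=["location"])

# Correlation of the remaining numeric columns with the rating.
corr_with_rate = clean[["rate", "votes"]].corr()["rate"].drop("rate")
```

On the full dataset, a ranking of `corr_with_rate` over all numeric or encoded columns is one plausible way to pick the most rating-relevant features.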
Fig. 6 Most popular cuisines of Bengaluru
Fig. 7 Distribution of ratings
Fig. 8 Distribution of costs of all restaurants
Fig. 9 r2_score of different machine learning models
Before implementing the model, LabelEncoder from the Scikit-learn library was used to label-encode the location, rest_type, and cuisines columns. Before encoding, the rows with null values in these three columns were dropped, as they were relatively few. Then, using StandardScaler from Scikit-learn's preprocessing module (sklearn.preprocessing), we scaled the values of the dataset. The dataset was then split in the proportion 60:20:20 for training, validation, and testing, respectively, using train_test_split from sklearn.model_selection. The LinearRegression, DecisionTree, and RandomForestRegressor models were then trained on the training data and evaluated using r2_score from sklearn.metrics. The best model was RandomForestRegressor, which gave a very high r2_score without any hyperparameter tuning (Fig. 9).
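The pipeline above can be sketched end to end on synthetic data. Everything about the data (column values, the linear relation between votes and rating, the sample size) is an assumption for illustration, DecisionTreeRegressor is assumed to be the decision tree variant meant in the text, and the resulting scores do not reproduce Fig. 9.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned dataset (illustrative only).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame(
    {
        "location": rng.choice(["Area 1", "Area 2", "Area 3"], size=n),
        "rest_type": rng.choice(["Cafe", "Bar", "Quick Bites"], size=n),
        "cuisines": rng.choice(["North Indian", "Chinese", "Italian"], size=n),
        "votes": rng.integers(0, 1000, size=n),
    }
)
df["rate"] = 3.0 + 0.001 * df["votes"] + rng.normal(scale=0.1, size=n)

# Label-encode the categorical columns.
for column in ["location", "rest_type", "cuisines"]:
    df[column] = LabelEncoder().fit_transform(df[column])

# Scale the features. This follows the order described in the text;
# fitting the scaler on the training split only is a common alternative.
X = StandardScaler().fit_transform(df.drop(columns="rate"))
y = df["rate"].to_numpy()

# 60/20/20 split: hold out 40 %, then halve it into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
}

# Train each model and record its r2_score on the validation split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_val, model.predict(X_val))
```

The held-out test split (`X_test`, `y_test`) would be used only once, to report the score of the model chosen on the validation split.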
6 Conclusion
This project tries to predict the rating scores of new restaurants from their location, cuisine, approximate cost, and other factors; based on these predictions, the model will suggest the best-fit cuisine for a new restaurant business in a given location. This will be of great help to entrepreneurs in gaining some advantage at the beginning of the business. Here, the Zomato Bengaluru dataset has been explored to find out which features are essential for predicting the rating of a restaurant. The project uses a RandomForestRegressor to predict restaurant ratings from the input features, and the model can be trained to reach relatively high accuracy (as measured by the r2_score).
7 Future Works
Work is ongoing on moving from the predicted ratings to finding the best cuisine for a new restaurant in a given location. Further work could apply sentiment analysis to the customer reviews of the restaurants, which would further aid in predicting the rating of a restaurant to be established and thus help in finding the best cuisines for a particular location.
References
1. https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants
2. https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks
3. System Sciences (HICSS), Annual Hawaii International Conference on IEEE, Restaurant Rating: Industrial Standard and Word-of-Mouth A Text Mining and Multidimensional Sentiment Analysis
4. TENCON, IEEE Region 10 International Conference, Prediction of Star Ratings from Online Reviews
5. Humanitarian Technology Conference (R10-HTC), IEEE Region 10, Machine learning based class level prediction of restaurant reviews
6. International Conference on Microelectronic Devices, Circuits and Systems (ICMDCS), Restaurant rating based on textual feedback
7. International Conference on Computational Intelligence in Data Science (ICCIDS), Restaurant Recommendation System for User Preference and Services Based on Rating and Amenities
8. International Conference on Advances in Computing, Communications, and Informatics (ICACCI), Sentiment based Food Classification for Restaurant Business
9. American Automobile Association Approval Requirements and Diamond Rating Guidelines (AAA Publishing, Heathrow, FL, 2009)
10. C. Dellarocas, The digitization of word of mouth: promise and challenges of online feedback mechanisms. Manage. Sci. 49(10), 1407–1424 (2003)
11. N. Archak, A. Ghose, P.G. Ipeirotis, Deriving the pricing power of product features by mining consumer reviews. Manage. Sci. 57(8), 1485–1509 (2011)
12. W. Duan, B. Gu, A.B. Whinston, The dynamics of online word-of-mouth and product sales-an empirical investigation of the movie industry. J. Retail. 84(2), 233–242 (2008)
13. S. Aravindan, A. Ekbal, Feature extraction and opinion mining in online product reviews, in Information Technology (ICIT) 2014 International Conference on, pp. 94–99 (2014), December
14. Y. Mengqi, M. Xue, W. Ouyang, Restaurants Review Star Prediction for Yelp Dataset
15. G. Dubey, A. Rana, N.K. Shukla, User reviews data analysis using opinion mining on the web, in Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE) 2015 International Conference on, pp. 603–612 (2015), February
16. M. Sharma, S. Mann, A survey of recommender systems: approaches and limitations. Int. J. Innov. Eng. Technol. 2(2), 8–14 (2013)
17. S. Bickel, T. Scheffer, Multi-View Clustering, in Proceedings of IEEE International Conference on Data Mining (2004), pp. 19–26, November
18. X. He, M.-Y. Kan, P. Xie, X. Chen, Comment-based multi-view clustering of web 2.0 items, in Proceedings of the 23rd International Conference on World Wide Web (2014), pp. 771–782, April
19. B. Pang, L. Lee, S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing-Volume 10 (2002)
20. H. Minqing, B. Liu, Mining and summarizing customer reviews, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004)
21. G. Anindya, G. Ipeirotis, Designing novel review ranking systems: predicting the usefulness and impact of reviews, in Proceedings of the Ninth International Conference on Electronic Commerce (2007)
22. X. Lei, X. Qian, G. Zhao, Rating prediction based on social sentiment from textual reviews, in IEEE Transactions on Multimedia Manuscript Id: MM-006446, pp. 1–12
23. S. Prakash, A. Nazick, R. Panchendrarajan, A. Pemasiri, M. Brunthavan, S. Ranathunga, Categorizing food names in restaurant reviews, in IEEE (2016), pp. 1–5
24. Uzma Fasahte, Deeksha Gambhir, Mrunal Merulingkar, Aditi Monde, Amruta Pokhare, Hotel recommendation system. Imp. J. Interdiscip. Res. (IJIR) 3(11), 318–324 (2017)
25. H. Parsa, A. Gregory, M. Terry, Why do restaurants fail? Part iii: an analysis of macro and micro factors. Emerg. Asp. Redefin. Tour. Hosp. 1(1), 16–25 (2010)
26. 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Multi-view Clustering in Collaborative Filtering Based Rating Prediction
Deeper into Image Classification Jatin Bindra, Bulla Rajesh, and Savita Ahlawat
Abstract Recognizing images was a challenging task a few years ago. With the advancement of technology and the introduction of deeper neural networks, the problem of recognizing images has been solved to a large extent. Inspired by the performance of deep learning models in image classification, the present paper implements three techniques for image classification: a residual network, a convolutional neural network, and logistic regression. Neural networks have shown state-of-the-art results in image classification. In implementing these models, some modifications were made to build a deep residual network and a deep convolutional neural network. On testing, the ResNet model gave 98.49% accuracy on MNIST and 87.31% on Fashion MNIST. The CNN model gave 98.73% accuracy on MNIST and 87.38% on Fashion MNIST. Logistic regression gave 91.79% on MNIST and 83.74% on Fashion MNIST.
Keywords Deep neural networks · Residual network · Convolutional neural network · Logistic regression · MNIST · Fashion MNIST
J. Bindra · S. Ahlawat (B) Department of CSE, Maharaja Surajmal Institute of Technology, Delhi, India e-mail: [email protected] J. Bindra e-mail: [email protected] B. Rajesh Department of IT, Indian Institute of Information Technology, Allahabad, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_7
1 Introduction
Image classification involves a series of steps that are performed on an image to obtain a label for it. With the advancement of technology, images are generated and shared on a very large scale. Because of the increasing significance of image classification, developing deeper models and improving classification can have a major impact on computer
J. Bindra et al.
vision. One algorithm may perform better for one use case and another algorithm for a different use case. Therefore, it is important to compare various algorithms on more than one dataset. The best-known algorithms for image classification are based on deep learning, and deep learning algorithms are widely used in image classification problems. For instance, deep convolutional neural networks applied to the ImageNet dataset showed that a convolutional neural network is capable of achieving record-breaking results [1]. This paper demonstrates the implementation of a deep ResNet neural network, a convolutional neural network, and a regression model for image classification, and compares them with other state-of-the-art algorithms. The models were first tested on the MNIST dataset, which is popularly used in computer vision to compare state-of-the-art algorithms. The dataset consists of 70,000 images of handwritten digits from 0 to 9. Each image is grayscale with size 28 × 28. The MNIST dataset (Fig. 1) was introduced in 1998. At that time, good computing power was not widely available. With today's computing power, many algorithms achieve good accuracy on MNIST, and it remains widely used because of its simplicity. In April 2017, a Google Brain research scientist asked people in a tweet to move away from MNIST, as it is overused [2]. Even basic machine learning algorithms can achieve more than 90% classification accuracy on MNIST. For this reason, we also tested our models on the Fashion MNIST dataset. Fashion MNIST [3] was released in August 2017. Similar to MNIST, it consists of 70,000 grayscale images of size 28 × 28 (Fig. 2), of which 60,000 are used for training and the remaining 10,000 for testing. Fashion MNIST also consists of 10 classes. It contains shapes of some complicated wearable fashion items. The MNIST and Fashion MNIST datasets are so popular because they are widely available in most libraries and deep learning frameworks. In addition, different frameworks provide many helper functions for them.
Fig. 1 MNIST dataset of handwritten images from 0 to 9
Fig. 2 Fashion MNIST dataset consisting of wearable fashion items
The overall structure of the deep residual network implemented in this work consists of 12 layers with two jump connections. Four such structures are connected to each other to form a larger model. The CNN is used with 13 layers to form a deep neural network for classification. Logistic regression was used from the Sklearn library. These models were then tested on the MNIST and Fashion MNIST datasets by calculating the accuracy of each model on the two datasets.
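The logistic regression baseline and the accuracy calculation mentioned above can be sketched as follows. To keep the example small and free of downloads, it uses scikit-learn's bundled 8 × 8 digits data as a stand-in for MNIST; this substitution, the 80/20 split, and the `max_iter` value are assumptions for illustration, so the accuracy does not correspond to the figures reported in this paper.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled 8x8 digit images, used here only as a small stand-in for MNIST.
digits = load_digits()
X = digits.data / 16.0  # pixel values in this dataset range from 0 to 16
y = digits.target

# Hold out 20 % of the images for testing (an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Multiclass logistic regression; max_iter raised so the solver converges.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Accuracy: fraction of test images whose predicted label is correct.
accuracy = accuracy_score(y_test, clf.predict(X_test))
```

For the 28 × 28 MNIST and Fashion MNIST images, the same steps would apply after flattening each image to a 784-element vector and dividing by 255.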
2 Related Work
Deep learning forms the basis of modern image classification. Recent research has shown that deep residual networks can be used in a variety of applications, not limited to static image classification but also including detection of moving objects, analysis of surveillance recordings, and so on. For instance, Szegedy et al. [4] gave evidence that residual connections significantly accelerate the training of Inception networks. The research presented three new networks, which
include Inception-ResNet-v1, Inception-ResNet-v2, and Inception-v4. Ou et al. [5] proposed a ResNet-based structure to detect moving objects. The research used ResNet-18 with an encoder–decoder structure and supervised learning, in which the input consists of object frames along with the corresponding labels. With this structure, they showed better performance on the I2R and CDnet2014 datasets than other conventional algorithms. Lu et al. [6] proposed a DCR (Deep Coupled Residual) network consisting of two branch networks and a trunk network, and used it for low-resolution face recognition. Their experiments showed that the DCR model performs better on the LFW and SCface datasets than other state-of-the-art models. Jung et al. [7] used surveillance recordings for classification and localization. For classification, they used an 18-layer deep ResNet with the ReLU function; for localization, they used R-FCN with deep residual models for accurate results. They further showed that their model outperformed other state-of-the-art models in both classification and localization. Palvanov and Cho [8] implemented four models, namely a residual network, a capsule network, a convolutional neural network, and multinomial logistic regression, and tested them on the MNIST dataset in a real-time environment. Li and He [9] proposed an improved version of ResNet that uses adjustable shortcut connections. Their results showed improved accuracy compared to the classical ResNet: with a learning rate of 0.001, the improved ResNet had 2.85% higher accuracy on CIFAR-10 and 3.81% higher accuracy on CIFAR-100. Xia et al. [10] used SCNN and ResNet models combined with the SIFT-flow algorithm for kidney segmentation and showed improved segmentation accuracy. Zhang et al. [11] proposed a deep convolutional network for image denoising that uses residual learning to separate noise from the noisy observation. In their work, residual learning also sped up training and boosted denoising performance. Their results were favorable both quantitatively and qualitatively, and the run time with a GPU implementation also appeared promising.
CNN is widely used for image analysis and is considered one of the state-of-the-art algorithms for it. Shin et al. [12] studied three important factors, namely convolutional neural network architecture, dataset characteristics, and transfer learning, in application to computer-aided detection problems. Baldassarre et al. [13] proposed a model that combines a convolutional neural network with high-level features extracted from a pre-trained Inception-ResNet-v2 model. The authors successfully colorized high-level image components such as sky, sea, etc. Advances on the CNN model have also been made to improve accuracy, training time, or testing time. One such work is Fast R-CNN: Girshick [14] proposed the Fast Region-based Convolutional Network method for object detection, which improved not only detection accuracy but also training and testing speed. Huang et al. [15] introduced the Dense Convolutional Network (DenseNet). The layers in this network are
connected to every other layer in a feed-forward fashion. Testing a model on a single dataset may not establish how well it generalizes; thus, the authors evaluated the model on four popular datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet) and showed that DenseNets obtain significant improvements over other models. Gidaris and Komodakis [16] used a multi-region deep convolutional neural network to propose an object detection system and obtained 78.2% and 73.9% accuracy on the PASCAL VOC2007 and PASCAL VOC2012 challenges, respectively. Abadi et al. [17] described the TensorFlow interface and its implementation details. TensorFlow was built at Google and is widely used to solve artificial intelligence problems. Its APIs were made open source so that the community of developers and researchers around the globe can use them. He and Sun [18] presented an architecture that gave comparable accuracy on the ImageNet dataset while being 20% faster than “AlexNet.”
In the past, many neural-network-based models were built to classify images. Agarap [19] used CNN-Softmax and CNN-SVM to classify images from both the MNIST and Fashion MNIST datasets. Xiao et al. [3], who introduced Fashion MNIST, also tested the data with various state-of-the-art algorithms, including Decision tree classifier, Extra tree classifier, Gradient boosting, K-Neighbours, Linear SVC, Logistic regression, MLP, Passive Aggressive, Perceptron, Random Forest, SGD classifier, and SVC. Chen et al. [20] compared four neural networks on the MNIST dataset: deep residual networks, convolutional neural networks, the Dense Convolutional Network (DenseNet), and CapsNet, an improvement on CNN. The authors showed that CapsNet requires only a small amount of training data to achieve excellent accuracy. Seo and Shin [21] used the hierarchical structure of apparel classes for classification. The Hierarchical Convolutional Neural Networks (H-CNN) that they proposed were based on VGGNet.
3 Methodology
The objective of the methodology is to classify the MNIST and Fashion MNIST datasets using a deep ResNet neural network, a convolutional neural network, and a regression model. A block diagram of the methodology is shown in Fig. 3; it is the basic procedure followed for all three models. First, the data is imported to get the input. The data is imported directly from tensorflow.examples.tutorials and tensorflow.keras. It is then pre-processed before being fed to the model. The preprocessing step includes resizing the images and normalizing the pixel values by dividing the matrix by 255. The labels are then converted into categorical format. Finally, images from the training set are fed to the model in order to train it, and then images from the testing set are fed to obtain their output labels. The experiments were done on a system with the following tools and system configurations:
Fig. 3 Basic steps followed for each model
• Coding language: Python 3
• Development environment: Jupyter Notebook hosted in Google Colab, an interactive environment used to write and execute code. The Colab environment provided 12.72 GB of RAM, 48.97 GB of disk, and 14.73 GB of GPU memory.
• OS: Microsoft Windows 10, 2015
• Processor used: Intel(R) Core(TM) i3-2310M CPU @ 2.10 GHz
• Models used: (A) ResNet (deep residual network), (B) CNN (convolutional neural network), (C) logistic regression

A. ResNet
A residual neural network (ResNet) is a special type of deep learning model in which skip connections are present: the network has connections that jump over a few layers. This is useful for avoiding the vanishing gradient problem. Li and He [9] explained that introducing shortcut connections solves the problem of gradient fading; they supported this by simplifying ResNet and deriving backpropagation through it. He et al. [22] presented a residual learning framework to ease the training of networks that are substantially deeper than those used previously. Deep residual nets helped them win first place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. Szegedy et al. [23] showed that the introduction of residual connections leads to improved training speed for the Inception architecture. In this paper, the ResNet implemented is formed with two skip connections (Fig. 4). The residual network consists of 12 layers. The layers implemented in our ResNet are, in order: batch normalization, convolution, dropout, ReLU, a jump step in which the first layer is added to the output of ReLU, dropout, convolution, and batch normalization; finally, the output of the last convolution layer is added to the input layer. As layers are added, the weights learned by the initial layers come to play a negligible role in prediction because of the vanishing gradient problem. To overcome this, we first introduced a connection from the first layer to the output of the ReLU function. A few more layers were then added to make the network deeper. With the increased number of layers, adding another connection again helps with the vanishing gradient problem, so a connection was made from the input layer to the last convolutional layer to give the final output.
Fig. 4 Residual network implemented in TensorFlow
The ReLU operation is defined in Eq. 1:
\[ f(x) = \max(0, x) \tag{1} \]
Here, x is the input and f(x) is the output. From this equation, it can be seen that output is x if x ≥0 and output is 0 if x < 0. The training dataset of MNIST has a batch size of 32 and Fashion MNIST has a batch size of 1000. Before sending the training images to the model, first, it was passed through the convolutional layer and then through the batch normalization layer. After this, the images were passed through 4 residual networks one after the other linearly. The output obtained is passed through another convolutional layer and then the 4-Dimensional tensor obtained from the convolutional network is flattened into a 2-Dimensional tensor. Finally, it is passed through a dense layer to get the output. Mathematically, a residual block function is defined as: y = f (x, {Wi }) + x
(2)
Here in Eq. 2, y is the output vector, x is the input vector, and f(x, {W_i}) represents the mapping that is to be learned.

B. CNN
CNN is used as a baseline model in many image classification tasks. For instance, Johnson and Zhang [24] used CNN on text categorization to exploit the 1-D structure of text data for accurate prediction. Zeiler and Fergus [25] introduced a novel visualization technique that described the function of intermediate feature layers as well as the operation of the classifier. The visualizations helped them find model architectures that outperformed the ImageNet classification benchmark set by Krizhevsky et al. [1]. In this research, a 13-layer CNN is used for the classification of images. The input is pre-processed and fed into the model. The convolutional, max pooling, and dropout layers were applied three times in sequence before the flatten layer. Max pooling is done to downsample the image. For instance, applying max pooling to the matrix in Eq. 3 gives the output in Eq. 4.

$$X = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix} \tag{3}$$

$$\mathrm{MaxPooling}(X) = \begin{bmatrix} 6 & 8 \\ 14 & 16 \end{bmatrix} \tag{4}$$

Fig. 5 CNN model in Keras
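The building blocks in Eqs. 1 to 4 can be checked numerically with a minimal sketch in plain Python. The function names below are hypothetical, the residual mapping `f` is a toy stand-in for learned layers, and the pooling window is assumed to be 2×2 with stride 2, which is the setting consistent with Eq. 4. This is not the TensorFlow or Keras code used in the paper.

```python
def relu(x):
    """Eq. 1: f(x) = max(0, x), applied element-wise to a list."""
    return [max(0.0, v) for v in x]


def residual_block(x, f):
    """Eq. 2: y = f(x) + x, where f stands in for the learned mapping."""
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]


def max_pool_2x2(matrix):
    """2x2 max pooling with stride 2 on a 2-D list with even dimensions."""
    pooled = []
    for i in range(0, len(matrix), 2):
        row = []
        for j in range(0, len(matrix[0]), 2):
            window = (matrix[i][j], matrix[i][j + 1],
                      matrix[i + 1][j], matrix[i + 1][j + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


# Toy residual mapping (assumption): scale by 2, then apply ReLU.
toy_f = lambda v: relu([2.0 * t for t in v])

X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

print(relu([-2.0, 0.0, 3.5]))               # [0.0, 0.0, 3.5]
print(residual_block([1.0, -2.0], toy_f))   # [3.0, -2.0]
print(max_pool_2x2(X))                      # [[6, 8], [14, 16]]
```

The skip connection is visible in the second output: the negative input component survives in y even though the toy mapping sets it to zero.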
The output is then flattened and passed through three dense layers with 128, 50, and 10 units, respectively. Figure 5 depicts the CNN model in Keras.

C. Logistic regression
Logistic regression is a basic machine learning model used for classification. The logistic regression model was imported directly from the linear models in Sklearn [26]. Logistic regression has a sigmoidal curve. The sigmoid function is given in Eq. 5:

$$S(x) = \frac{1}{1 + e^{-x}} \tag{5}$$

This formula produces an "S-shaped" curve.
The images were pre-processed and the model was trained by calling the fit function, with the maximum number of iterations set to 2000. Finally, the testing dataset was fed into the trained model.
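As a small illustration of the sigmoid S(x) = 1 / (1 + e^(-x)) behind logistic regression, the sketch below evaluates it in plain Python. The function name and the numerically stable branching are assumptions made for this example; the paper itself relies on the Sklearn implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid S(x) = 1 / (1 + exp(-x))."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# The curve rises from near 0 to near 1 and passes through 0.5 at x = 0.
for x in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"S({x:+.1f}) = {sigmoid(x):.4f}")
```

The printed values trace the S shape: they stay close to 0 for large negative inputs, cross 0.5 at the origin, and saturate toward 1 for large positive inputs.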
4 Results and Discussion

After training the three models, we fed the testing data through them. For the MNIST dataset (Fig. 6), the highest accuracy was obtained from CNN, followed by ResNet and logistic regression. The accuracies of the CNN and ResNet models were very close to each other. For Fashion MNIST (Fig. 7), the highest accuracy was likewise obtained by CNN, followed by ResNet and logistic regression. The accuracy and testing configuration of the models (A) ResNet, (B) CNN, and (C) logistic regression are stated below. The models were also compared with previous work in the literature; the comparison is presented in Table 1. The performance of the ResNet, CNN, and logistic regression models implemented in this research is comparable with other implementations of similar models.

Fig. 6 ResNet, CNN, and logistic regression on MNIST dataset
Fig. 7 ResNet, Regression, and CNN on Fashion MNIST dataset
Table 1 Comparison of implementation of ResNet, CNN, and logistic regression with other models from the literature

Model                             MNIST dataset (%)   Fashion MNIST dataset (%)
LinearSVC [3]                     91.70               83.60
LogisticRegression [3]            91.70               84.20
CNN-SVM [19]                      99.04               90.72
ResNet [8]                        97.3                –
CNN [8]                           98.1                –
ResNet (this work)                98.49               87.31
Logistic regression (this work)   91.79               83.74
CNN (this work)                   98.73               87.38
A. ResNet
The ResNet model for the MNIST dataset was trained for 7 epochs with a batch size of 32. Testing was performed with a batch size of 1000. The accuracy of the ResNet model on the MNIST dataset was 98.49%. The same model was then used for Fashion MNIST; the only changes were in the number of epochs and the training batch size. For MNIST, increasing the number of epochs decreased the accuracy, which may be due to overfitting. For Fashion MNIST, the test accuracy reached 87.31% after the number of epochs was increased to 80, with a training batch size of 1000.

B. CNN
The CNN model described in Fig. 5 was implemented as a sequential model in Keras. For training, a batch size of 200 was used and the number of epochs was set to 100. The MNIST dataset gave an accuracy of 98.73% on the testing dataset. The Fashion MNIST dataset gave an accuracy of 87.38%, with the same number of epochs and batch size as for MNIST.

C. Logistic regression
The logistic regression was implemented using Sklearn. By default, the maximum number of iterations allowed for the solver to converge is 100. The iterations were set to 2000, since otherwise the model failed to converge. The MNIST dataset gave an accuracy of 91.79% on the testing dataset. The Fashion MNIST dataset gave an accuracy of 83.74% on the testing dataset, again with the iterations set to 2000.
5 Conclusion

This study provides insights into how deep learning models give higher accuracy for image classification. Comparing the models on two datasets, MNIST and Fashion MNIST, shows that the accuracies of ResNet and CNN are very close to each other. The comparison with other models in the literature shows how our tuning in the implementation of ResNet and CNN affects accuracy. The accuracy can be further increased by making the neural network deeper, training it with increased computing power, and increasing the size of the training dataset. The use of ResNet and other deep learning models can also be extended to many real-world applications using various datasets, such as image and time-series datasets. In the future, further advances can be made in deep neural networks to explore more combinations of features. Pre-processing techniques that help reduce the training time and increase accuracy can also be explored further.
References

1. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in 25th International Conference on Neural Information Processing Systems (ACM, Lake Tahoe, Nevada, 2012), vol. 1, p. 9
2. I. Goodfellow (2017). https://twitter.com/goodfellow_ian/status/852591106655043584?lang=en
3. H. Xiao et al., Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017). http://arxiv.org/abs/1708.07747
4. C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning (AAAI, 2016)
5. X. Ou, P. Yan, Y. Zhang, B. Tu, G. Zhang, J. Wu, W. Li, Moving object detection method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access 7, 108152–108160 (2019)
6. Z. Lu, X. Jiang, A.C. Kot, Deep coupled ResNet for low-resolution face recognition. IEEE Signal Process. Lett. 25, 526–530 (2018)
7. H. Jung, M. Choi, J. Jung, J. Lee, S. Kwon, W.Y. Jung, ResNet-based vehicle classification and localization in traffic surveillance systems, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017), pp. 934–940
8. A. Palvanov, Y.I. Cho, Comparisons of deep learning algorithms for MNIST in real-time environment. Int. J. Fuzzy Log. Intell. Syst. 18, 126–134 (2018). https://doi.org/10.5391/IJFIS.2018.18.2.126
9. B. Li, Y. He, An improved ResNet based on the adjustable shortcut connections. IEEE Access 6, 18967–18974 (2018)
10. K. Xia, H. Yin, Y. Zhang, Deep semantic segmentation of kidney and space-occupying lesion area based on SCNN and ResNet models combined with SIFT-Flow algorithm. J. Med. Syst. 43, 1–12 (2018)
11. K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26, 3142–3155 (2017)
12. H. Shin, H. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D.J. Mollura, R.M. Summers, Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016)
13. F. Baldassarre, D.G. Morín, L. Rodés-Guirao, Deep koalarization: image colorization using CNNs and Inception-ResNet-v2 (2017). http://arxiv.org/abs/1712.03400
14. R. Girshick, Fast R-CNN, in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 1440–1448
15. G. Huang, Z. Liu, K.Q. Weinberger, Densely connected convolutional networks, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 2261–2269
16. S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware CNN model, in 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 1134–1142
17. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I.J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, M. Levenberg, D. Mané, R. Monga, S. Moore, D.G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P.A. Tucker, V. Vanhoucke, V. Vasudevan, F.B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems (2015). http://arxiv.org/abs/1603.04467
18. K. He, J. Sun, Convolutional neural networks at constrained time cost, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014), pp. 5353–5360
19. A.F. Agarap, An architecture combining convolutional neural network (CNN) and support vector machine (SVM) for image classification (2017). http://arxiv.org/abs/1712.03541
20. F. Chen, N. Chen, H. Mao, H. Hu, Assessing four neural networks on handwritten digit recognition dataset (MNIST) (2018). http://arxiv.org/abs/1811.08278
21. Y. Seo, K. Shin, Hierarchical convolutional neural networks for fashion image classification. Expert Syst. Appl. 116, 328–339 (2019)
22. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, Las Vegas, NV, USA, 2016), p. 12
23. C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in AAAI'17 Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (ACM, San Francisco, California, USA, 2017), p. 12
24. R. Johnson, T. Zhang, Effective use of word order for text categorization, in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics, Denver, Colorado, 2015), p. 10
25. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in Computer Vision – ECCV 2014. Lecture Notes in Computer Science, vol. 8689, ed. by D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Springer, Cham, 2014)
26. Scikit-learn.org, scikit-learn: machine learning in Python, scikit-learn 0.22 documentation (2019). https://scikit-learn.org/stable/. Accessed 9 Dec 2019
Investigation of Ionospheric Total Electron Content (TEC) During Summer Months for Ionosphere Modeling in Indian Region Using Dual-Frequency NavIC System

Sharat Chandra Bhardwaj, Anurag Vidyarthi, B. S. Jassal, and A. K. Shukla

Abstract When signals from satellites propagate through the ionosphere, a delay is introduced by the Total Electron Content (TEC) between transmitter and receiver. The generation of TEC in the ionosphere depends primarily on solar activity (diurnal and seasonal). The ionospheric delay can cause a major degradation in the positional accuracy of a satellite navigation system. To estimate the ionospheric delay, the slant TEC (STEC) along the path between satellite and receiver is needed. For a single-frequency user, a modeled ionospheric vertical TEC (VTEC) at the Ionospheric Pierce Point (IPP) is converted into STEC for delay estimation. However, the behavior of TEC is highly dynamic in low-latitude and equatorial regions (such as the Indian region), and thus a conventional ionospheric model introduces additional error in positioning. The geostationary satellites of the NavIC (Navigation with Indian Constellation) system are uniquely suited to the investigation of ionospheric TEC and can facilitate ionospheric modeling applications. This paper deals with the estimation of accurate STEC and VTEC using dual-frequency NavIC code and carrier measurements, and with the investigation of their temporal variation for modeling applications.

Keywords Total Electron Content (TEC) · STEC · VTEC · Ionospheric delay · NavIC · Ionospheric modeling

S. C. Bhardwaj (B) · A. Vidyarthi · B. S. Jassal
Propagation Research Laboratory, Department of Electronics and Communication, Graphic Era (Deemed to be University), Dehradun, India
e-mail: [email protected]
A. Vidyarthi
e-mail: [email protected]
B. S. Jassal
e-mail: [email protected]
A. K. Shukla
Space Applications Center, Indian Space Research Organization, Ahmedabad, India
e-mail: [email protected]

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_8
1 Introduction

Satellite positioning is an essential service in modern public and military applications. To deliver positioning services in the Indian region, the NavIC (Navigation with Indian Constellation) system, formerly known as the Indian Regional Navigation Satellite System (IRNSS), has been designed as a constellation of seven satellites, three geostationary (GEO) and four geosynchronous (GSO). The NavIC system operates in two frequency bands, S1 (2492.028 MHz) and L5 (1176.45 MHz) [1]. The accuracy of the position determined from satellite signals is crucial for applications such as aircraft landing and guidance systems. Studies of glacier and tectonic plate movement need millimeter-level accuracy [2]. To achieve such accuracy, it is necessary to determine and eliminate all sources of error in satellite positioning. Various sources, such as the atmospheric layers (ionosphere, troposphere), satellite and receiver clock errors, multipath, and the earth's magnetic field, influence the satellite signal measurements and thereby lead to positional error [3]. Among these, the ionosphere is the major source and accounts for 90% of the positional error. When the signal passes through the ionosphere, its velocity changes and its path tends to bend due to the change in the refractive index of the ionosphere [4]. This phenomenon introduces a delay in the signal measured at the receiver and can cause positional errors of up to 100 m. The ionospheric delay depends on the Total Electron Content (TEC) between the satellite and the receiver and on the frequency of the signal. The electrons present in the ionosphere are mainly affected by solar radiation, geomagnetic storms, and lower atmosphere waves [5]. The estimation of this ionospheric delay (called the first-order error) needs the slant TEC (STEC) along the path between satellite and receiver.
Due to the design complexity and cost of dual-frequency systems, many modeled ionospheric data sources, such as coefficient-based models, are used to support ionospheric delay estimation in single-frequency receivers [6]. For a single-frequency user, a modeled ionospheric vertical TEC (VTEC) at the Ionospheric Pierce Point (IPP) is converted into STEC for delay estimation. Although the modeled VTEC works suitably for mid-latitude regions, where the ionosphere behaves smoothly, it can cause a significant positional error in low-latitude and equatorial regions such as India, owing to the dynamic and unpredictable behavior of the ionosphere [7]. To meet the challenge of ionospheric modeling and delay estimation in the Indian region, the ionospheric VTEC needs to be investigated as a function of diurnal and seasonal solar activity. This, in turn, requires accurate estimates of STEC and VTEC. The accurate STEC can be determined by taking the difference between dual-frequency measurements (i.e., code and carrier phase) [8]. Most of the error sources, such as the troposphere, multipath, and the satellite and receiver clocks, are frequency independent. By taking the difference of the measurements, all these effects are eliminated and only the effect of the frequency-dependent source (i.e., the ionosphere) remains [9].

A NavIC receiver is installed at Graphic Era Deemed to be University, Dehradun (lat. 31.26° N, long. 77.99° E), and data are being collected at the dual frequencies L5 and S1. A plot of the satellites observed at the receiver over 24 h (IST) is shown in Fig. 1. In the figure, PRN 2, 4, and 5 can be identified as geosynchronous (GSO) satellites and PRN 3, 6, and 7 as geostationary (GEO) satellites. The GEOs are uniquely suited to ionospheric studies over the Indian region. Because the IPP of a GEO is constant, the behavior of the ionospheric TEC as a function of diurnal and seasonal solar activity can be investigated more precisely than with the variable IPPs of GPS (Global Positioning System) satellites. Thus, in this paper, the investigation of STEC and VTEC is carried out using the GEO satellites (i.e., PRN 3, 6, 7). In Sects. 2 and 3, the estimation and analysis of STEC and VTEC are discussed.

Fig. 1 Observable NavIC satellites at the receiver (June 5, 2017); sky plot of azimuth and elevation for PRN 2–7
2 Estimation of STEC and VTEC

The STEC can be determined by taking the difference of dual-frequency code and/or carrier phase measurements. A typical STEC using satellite code measurements is given by [10]:

$$\mathrm{STEC}_{\rho} = \frac{1}{40.3}\,\frac{f_1^2 f_2^2}{f_1^2 - f_2^2}\,(\rho_2 - \rho_1) \tag{1}$$

where ρ1, ρ2 are the measured code ranges and f1, f2 are the satellite frequencies. For the NavIC frequencies, f1 (S1) = 2492.028 MHz and f2 (L5) = 1176.45 MHz, Eq. (1) can be written as [11]:

$$\mathrm{STEC}_{\rho} = 4.4192 \times 10^{16} \times (\rho_2 - \rho_1)\ \text{electrons/m}^2 \tag{2}$$

or, since 1 TECU = 10^16 electrons/m^2,

$$\mathrm{STEC}_{\rho} = 4.4192 \times (\rho_2 - \rho_1)\ [\text{TECU}] \tag{3}$$

Similarly, the STEC derived from the carrier phase range can be written as

$$\mathrm{STEC}_{\phi} = 4.4192 \times (\phi_1 - \phi_2)\ [\text{TECU}] \tag{4}$$
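The constant 4.4192 can be reproduced directly from the dual-frequency STEC formula and the NavIC frequencies. The following is a minimal sketch in plain Python; the function and variable names are hypothetical, and the 10 m range difference in the usage line is an illustrative value, not a measurement from the paper.

```python
F_S1_HZ = 2492.028e6   # NavIC S1 frequency (f1)
F_L5_HZ = 1176.45e6    # NavIC L5 frequency (f2)
K_IONO = 40.3          # ionospheric constant used in Eq. (1), m^3/s^2
TECU = 1.0e16          # 1 TECU = 1e16 electrons/m^2


def stec_coefficient(f1_hz: float, f2_hz: float) -> float:
    """Electrons/m^2 per metre of range difference: f1^2 f2^2 / (40.3 (f1^2 - f2^2))."""
    return (f1_hz ** 2 * f2_hz ** 2) / (K_IONO * (f1_hz ** 2 - f2_hz ** 2))


def stec_from_code(rho1_m: float, rho2_m: float) -> float:
    """Code-derived STEC in TECU from the S1 (rho1) and L5 (rho2) ranges in metres."""
    return stec_coefficient(F_S1_HZ, F_L5_HZ) * (rho2_m - rho1_m) / TECU


coeff_tecu_per_m = stec_coefficient(F_S1_HZ, F_L5_HZ) / TECU
print(f"coefficient: {coeff_tecu_per_m:.4f} TECU per metre")

# Illustrative only: an L5 code range 10 m longer than the S1 code range.
print(f"STEC: {stec_from_code(0.0, 10.0):.2f} TECU")
```

Because the ionospheric delay is larger at the lower frequency, the L5 range exceeds the S1 range and the difference ρ2 − ρ1 is positive, giving a positive STEC.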
2.1 Smoothing and True STEC

Compared with the code range, the carrier phase range is much more precise, although its use is not straightforward due to the presence of integer carrier cycle ambiguities. However, code and carrier phase measurements can be combined to improve the accuracy of STEC estimation [12]. A Hatch filter-based code-carrier leveling process has been used to determine the absolute STEC [13]. In Fig. 2, the code-derived STECρ (in blue) and the carrier-derived STECφ (in green) are shown. Due to the presence of integer carrier cycle ambiguity in the STEC estimation, STECφ is below zero (negative). Using STECρ, a leveling constant D (i.e., 105.8 for PRN 3) has been derived and added to STECφ to find the absolute STEC (red line). It can be observed that the leveled STEC overlaps and follows the mean variation of STECρ. The STEC still contains satellite and receiver Differential Instrumental Biases (DIBs). These biases arise from the difference in signal path delay at the two frequencies (i.e., L5 and S1). The initial satellite DIBs are provided by SAC, Ahmedabad, and the initial receiver DIB is estimated using the FRB method. The final DIBs are estimated using a Kalman filter. The biased and true STECs (hereafter called STEC) are shown in Fig. 3. After the removal of biases, the STEC can be used for VTEC estimation.
Fig. 2 Smoothing of STEC (STEC in TECU versus IST in hours for PRN 5; code-derived STEC, carrier-derived STEC, and leveled STEC; leveling by constant D = 105.8)

Fig. 3 Diurnal variation of true STEC, June 5, 2017 (true STEC in TECU versus IST in hours for PRN 2–7)
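The leveling step described in Sect. 2.1 can be illustrated with a simplified sketch: the constant D is taken as the mean difference between the noisy code-derived STEC and the precise but ambiguous carrier-derived STEC over an arc, and is then added to the carrier-derived series. This plain averaging is an assumption standing in for the Hatch filter-based process of [13], the function names are hypothetical, and the sample values are synthetic.

```python
from statistics import mean


def leveling_constant(stec_code, stec_phase):
    """Mean offset D between code-derived and carrier-derived STEC over one arc."""
    if not stec_code or len(stec_code) != len(stec_phase):
        raise ValueError("series must be non-empty and of equal length")
    return mean(c - p for c, p in zip(stec_code, stec_phase))


def level_phase_stec(stec_code, stec_phase):
    """Shift the carrier-derived STEC by D to obtain an absolute (leveled) STEC."""
    d = leveling_constant(stec_code, stec_phase)
    return [p + d for p in stec_phase]


# Synthetic arc (TECU): smooth carrier-derived values with an unknown offset,
# and code-derived values equal to the same signal plus zero-mean noise.
stec_phase = [-60.0, -58.0, -55.0, -57.0]
code_noise = [1.5, -2.0, 0.5, 0.0]
offset = 105.8
stec_code = [p + offset + n for p, n in zip(stec_phase, code_noise)]

D = leveling_constant(stec_code, stec_phase)
leveled = level_phase_stec(stec_code, stec_phase)
print(f"D is approximately {D:.1f} TECU")
print([round(v, 1) for v in leveled])
```

The leveled series keeps the low noise of the carrier-derived STEC while following the mean level of the code-derived STEC, which is the behavior described for the red curve in Fig. 2.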
2.2 Estimation of VTEC

For single-frequency users, the mapped or modeled VTEC is converted into STEC in order to calculate the ionospheric delay. Thus, the STEC estimated from dual-frequency measurements must be converted into VTEC for mapping or modeling in the Indian region. The VTEC can be obtained by projecting the slant path onto a vertical path, as shown in Fig. 4. The ionosphere is considered as a thin layer (the thin shell model) at an altitude of around 300–400 km above the earth's surface. The point at which the user-to-satellite line of sight intersects the effective height (centroid of mass) of the ionosphere shell is called the Ionospheric Pierce Point (IPP).

Fig. 4 Ionosphere thin shell model and location of the IPP (receiver at (φu, λu), IPP at (φIPP, λIPP) at height hIPP, earth radius RE, and the STEC and VTEC paths)

The STEC is converted into VTEC by multiplying it by an obliquity factor, given as [5, 14]:

$$\mathrm{VTEC} = \mathrm{STEC} \times \cos\left[\sin^{-1}\left(\frac{R_e \cos\theta}{R_e + h_{\max}}\right)\right] \tag{5}$$

where Re (radius of the Earth) = 6378 km, hmax = 350 km, and θ is the elevation angle at the receiver.

The diurnal variation of the estimated VTECs for all PRNs is shown in Fig. 5. Similar to STEC, an elevation-dependent variation in VTEC values has been observed for the GSO and GEO satellites. The VTECs of the GEOs follow the diurnal sun variation and peak at the same time as the STECs.

Fig. 5 Diurnal variation of VTEC, June 5, 2017 (true VTEC in TECU versus IST in hours for PRN 2–7)
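The thin shell mapping VTEC = STEC × cos[sin⁻¹(Re cos θ / (Re + hmax))] can be sketched in plain Python as follows, using Re = 6378 km and hmax = 350 km as stated in the text. The function names and the sample elevation angles and STEC value are illustrative assumptions.

```python
import math

R_E_KM = 6378.0     # radius of the Earth
H_MAX_KM = 350.0    # assumed height of the thin ionospheric shell


def obliquity_factor(elevation_deg: float) -> float:
    """cos(asin(Re * cos(theta) / (Re + h_max))) for elevation angle theta."""
    theta = math.radians(elevation_deg)
    return math.cos(math.asin(R_E_KM * math.cos(theta) / (R_E_KM + H_MAX_KM)))


def stec_to_vtec(stec_tecu: float, elevation_deg: float) -> float:
    """Project a slant TEC value onto the vertical at the pierce point."""
    return stec_tecu * obliquity_factor(elevation_deg)


# Illustrative only: the same 60 TECU slant value seen at different elevations.
for elev in (20.0, 40.0, 60.0, 90.0):
    print(f"elevation {elev:4.1f} deg -> VTEC {stec_to_vtec(60.0, elev):6.2f} TECU")
```

The factor approaches 1 at zenith and decreases at low elevation, where the slant path through the shell is longest. This is why satellites seen at lower elevation show higher STEC for a comparable vertical content.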
3 Analysis of STEC and VTEC

As discussed in Sect. 1, the investigation of STEC and VTEC is required in order to estimate the ionospheric delay precisely for the Indian region. It was also noted in Sect. 2 that GEO satellites are suitable for the investigation because their STEC and VTEC curves follow the diurnal sun variation. In this section, STEC and VTEC are investigated for 1 week in each of the summer months (June, July, and August 2017) for the GEO satellites (i.e., PRN 3, 6, 7). The estimated STEC and VTEC for June 5–10, 2017 are shown in Fig. 6. The curves are similar (±2 TECU) except for June 7 and 9, due to unknown solar variations. Thus, the mean STEC and VTEC have been calculated and plotted in Fig. 7 together with the standard deviation. The deviations are larger in the afternoon than at night and in the morning, due to the unpredictability of solar radiation. Nevertheless, the mean value helps to find a general trend for the month. Thus, the mean STEC and VTEC values for June, July, and August 2017 for the GEO satellites are shown in Fig. 8.

Fig. 6 Diurnal variation of (a) STEC and (b) VTEC for PRN 3, June 5–10, 2017

Fig. 7 Mean of hourly averaged (a) STEC and (b) VTEC with standard deviation (PRN 3)

Due to their lower elevation angles, the STECs for PRNs 6 and 7 are high (Fig. 8b, c) compared with PRN 3 (Fig. 8a). Although different peak values are observed for different PRNs, the trend of the curves, i.e., a sharp rise in the morning, a steep fall in the evening, and a constant level before sunrise, is similar. The monthly VTEC curves of the individual PRNs are almost similar (±2 TECU) except at the peak time (±4 TECU). Because PRNs 3, 6, and 7 are at different positions (Fig. 1), the corresponding IPPs are different, and thus a difference in VTEC peak values is expected. However, owing to the dependency of STEC on elevation, the VTEC values of PRNs 6 and 7 (Fig. 8e, f) are lower than those of PRN 3 (Fig. 8d). Hence, elevation-dependent VTEC modeling, along with latitude and longitude, is needed to overcome the effect of lower elevation angle STECs. From the observations, it is found that in the summer season the VTECs are quite stable and could be suitably used for ionospheric modeling applications.
Fig. 8 Diurnal variation of STEC and VTEC of (a), (d) PRN 3; (b), (e) PRN 6; (c), (f) PRN 7, respectively, for the months June, July, and August 2017 (STEC and VTEC in TECU versus IST in hours)
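The averaging used in Sect. 3, where a mean curve and its standard deviation are computed across several days for each time bin, can be sketched as follows. The function name is hypothetical, the sample standard deviation is an assumed choice since the text does not specify which estimator was used, and the VTEC values are synthetic.

```python
from statistics import mean, stdev


def binwise_mean_and_std(daily_series):
    """Return a list of (mean, sample std) per time bin across days.

    daily_series is a list of per-day lists, each with one value per time bin.
    """
    n_bins = len(daily_series[0])
    if any(len(day) != n_bins for day in daily_series):
        raise ValueError("every day must have the same number of time bins")
    stats = []
    for b in range(n_bins):
        values = [day[b] for day in daily_series]
        stats.append((mean(values), stdev(values)))
    return stats


# Synthetic VTEC values (TECU) for three days and two time bins,
# e.g. an early-morning bin and an afternoon bin.
days = [
    [40.0, 50.0],
    [42.0, 54.0],
    [44.0, 52.0],
]

stats = binwise_mean_and_std(days)
for b, (m, s) in enumerate(stats):
    print(f"bin {b}: mean = {m:.1f} TECU, std = {s:.1f} TECU")
```

Plotting the mean with the per-bin standard deviation as an error band gives curves of the kind shown in Fig. 7, with wider bands where the day-to-day variability is larger.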
4 Conclusion

The STEC and VTEC have been estimated using dual-frequency code and carrier phase data, and their diurnal variation for the summer months (June, July, and August 2017) has been investigated. The overall diurnal VTEC variation is found to be similar, with STEC-dependent peak values. It is also found that an elevation angle dependency is present along with the diurnal sun variation. Hence, it is suggested that elevation-dependent VTEC modeling, along with latitude and longitude, is needed for the Indian region.

Acknowledgments The authors would like to thank the Space Applications Center, Indian Space Research Organization (ISRO), Ahmedabad, for providing the necessary funds, instruments, and technical support for carrying out this research work under a sponsored research project.
References

1. IRNSS SIS ICD for SPS. ISRO-ISAC V 1.1 (2011)
2. C. Rizos, Principle and practice of GPS surveying, Monograph No. 17, School of Geomatic Engg., University of New South Wales, Sydney (1997)
3. J.G. Peter, GPS for Geodesy (Springer-Verlag, Berlin Heidelberg, 1998)
4. C.H. Papas, Theory of Electromagnetic Wave Propagation (McGraw-Hill, New York, 1988)
5. J.A. Klobuchar, Ionospheric time-delay algorithm for single-frequency GPS users. IEEE Trans. Aerosp. Electron. Syst. AES-23(3), 325–331 (1987)
6. N. Jakowski, C. Mayer, M.M. Hoque, V. Wilken, Total electron content models and their use in ionosphere monitoring. Radio Sci. 46, RS0D18 (2011)
7. P.R. Rao, K. Niranjan, D.S.V.V.D. Prasad, S.G. Krishna, G. Uma, On the validity of the ionospheric pierce point (IPP) altitude of 350 km in the Indian equatorial and low-latitude sector. Ann. Geophys. 24(8), 2159–2168 (2006)
8. S. Bassiri, G.A. Hajj, Modeling the global positioning system signal propagation through the ionosphere. Telecommunications and Data Acquisition Progress Report, NASA Jet Propulsion Laboratory, Caltech, Pasadena (1992), pp. 92–103
9. E.J. Petrie, M. Hernández-Pajares, P. Spalla, P. Moore, M.A. King, A review of higher order ionospheric refraction effects on dual frequency GPS. Surv. Geophys. 32(3), 197–253 (2011)
10. A.J. Manucci, B.A. Iijima, U.J. Lindqwister, X. Pi, L. Sparks, B.D. Wilson, GPS and ionosphere. URSI Reviews of Radio Science. Jet Propulsion Laboratory, Pasadena (1999)
11. S.C. Bhardwaj, A. Vidyarthi, B.S. Jassal, A.K. Shukla, Study of temporal variation of vertical TEC using NavIC data, in 2017 International Conference on Emerging Trends in Computing and Communication Technologies (ICETCCT) (IEEE, 2017)
12. D.E. Wells, N. Beck, D. Delikaraoglou, A. Kleusberg, E.J. Krakiwsky, G. Lachapelle, P. Vanicek, Guide to GPS Positioning (Canadian GPS Associates, Fredericton, 1986)
13. S. Sinha, R. Mathur, S.C. Bharadwaj, A. Vidyarthi, B.S. Jassal, A.K. Shukla, Estimation and smoothing of TEC from NavIC dual frequency data, in 2018 4th International Conference on Computing Communication and Automation (ICCCA) (IEEE, 2018)
14. M.S. Bagiya, H.P. Joshi, K.N. Iyer, M. Aggarwal, S. Ravindran, B.M. Pathan, TEC variations during low solar activity period (2005–2007) near the equatorial ionospheric anomaly crest region in India (2009)
An Improved Terrain Profiling System with High-Precision Range Measurement Method for Underwater Surveyor Robot

Maneesha and Praveen Kant Pandey

Abstract The paper presents an improved terrain profiling system with a high-precision range measurement method for an underwater surveyor robot. Extensive research has been carried out on terrain profiling in different scenarios; however, limited work has been performed for the underwater environment. In the present work, a surveyor robot has been designed using an ultrasonic range sensor for terrain profiling in an underwater environment. The dynamic nature of the underwater scenario adds significant noise to the acoustic signals, leading to inaccurate range measurements. The noise embedded in the data degrades the working of the surveyor robot and leads to uncertainty in terrain profiling, surveying, and navigation. Different digital signal filtering techniques are used to remove noise from the data, leading to a better estimate of the range data and an improved signal-to-noise ratio. Two sets of range measurement data in an underwater setup have been estimated using signal processing techniques. The results show that a low pass FIR filter improved the results compared with the Moving Average method; however, the high standard deviation of the results shows that FIR filtering is not adequate for accurate range measurement and therefore for faithful working of the underwater robot. The Kalman filter, being a recursive estimator, provides an optimal solution for estimation and data prediction tasks and is efficient in filtering noisy data from an input sensor. However, its filtering performance depends on the selection of the measurement noise covariance R used in the predictor-corrector model. In an actual underwater data streaming environment, it is very difficult to obtain the optimal value of R from a complex device configuration. In order to avoid a poor selection of R and to determine its best estimate directly from the sensor data, an analytical method using a Denoising Autoencoder is used in the present work. The results show that the Kalman filter method using the Denoising Autoencoder estimated the range data accurately. The terrain profile for an underwater test setup was generated by simultaneously recording the position of the robot and the elevation data filtered using the above method. The results are in good agreement with the actual terrain profile of the test setup.

Keywords Terrain profile · Elevation map · Ultrasonic range sensor · Underwater range measurement · Digital signal filter · Low pass FIR filter · Kalman filter · Denoising Autoencoder

Maneesha · P. K. Pandey (B)
Department of Electronics, Maharaja Agrasen College, University of Delhi, Delhi, India
e-mail: [email protected]

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_9
1 Introduction
The underwater milieu is attracting major interest with regard to the vast resources lying underneath the oceans. Scientists and researchers are continuously developing newer and better technologies to further uncover this largely undiscovered world for the benefit of the world at large. In this scenario, advancements in underwater robotics with better sensing technology are providing numerous opportunities to explore and harness the vast energy which lies amidst the water bodies. Underwater robots are highly dependent on their ability to sense and respond to their environment for their exploration activities. In recent years, various research efforts in the field of underwater robots have given rise to more focused, consistent, and reliable underwater robotic vehicles, thereby minimizing the need for human workers [1]. Currently, Remotely Operated Vehicles (ROVs) and Autonomous Underwater Vehicles (AUVs) provide sensor platforms for measuring underwater properties [2]. Both can operate in a previously unmapped environment with unpredictable disturbances and threats [3]. Salamonowicz and Arnold generated terrain profiles of Alaska and Antarctica using Seasat radar altimeter data and a data reduction method developed by the Geoscience Research Corporation. However, when compared with the terrain profile generated using Doppler measurements, an error of magnitude 50 feet was observed [4]. Florida Atlantic University's Ocean Engineering Department and the University of South Florida's Marine Science Department designed the long-range AUV "The Ocean Voyager II." It is used for coastal oceanography using the principle of light reflectance and absorption measurement while flying at a constant altitude [5].
Apart from these, the Internet of Underwater Things (IoUT) is being used as a worldwide network of smart interconnected underwater objects with a digital entity [6] to enable various practical applications, such as environmental monitoring, underwater exploration, and disaster prevention. For terrain profiling and surveillance, many types of range sensors based on electromagnetic waves, such as light or short radio waves, are available for estimating distance in air, but these sensors are not effective under water because electromagnetic waves are heavily attenuated beyond short distances [7, 8]. Sound waves, however, propagate easily in water. Hence, ultrasonic range sensors are preferred by the surveyor robot for range and depth measurement and for detecting and mapping landmarks under water. However, precision and update rate of the range measurement are the two major limitations of ultrasonic range sensors. Further, the dynamic nature of the underwater medium disrupts the signal quality as acoustic signals are actuated mechanically [9]. Thus, the data collected by ultrasonic range sensors has limited precision, as the sensors are prone to noise in dynamic environments with variation in temperature, turbidity, and strong underwater currents, leading to uncertainty in terrain profiling, surveying, and navigation. The noise embedded in the data collected by ultrasonic range sensors has a significant impact on the working of the surveyor robot. In order to avoid a non-convergent system, a new calibration is essential, obtained by filtering the inherent noise in the data to get a better estimate of the range data. The paper presents a comparative study of digital signal filtering techniques, including the Moving Average filter, Finite Impulse Response (FIR) filter, and Kalman filter, for removing noise from data obtained from the ultrasonic range sensor, leading to a better estimate of the range data and an improved signal-to-noise ratio. The filtered data was used to generate the terrain profile of an underwater test setup.
2 Design of Underwater Surveyor Robot
2.1 Hardware
In the present work, the underwater surveyor robot was designed using an ATmega328 microcontroller and was equipped with an ultrasonic range sensor, accelerometer, gyroscope, and a temperature sensor for generating the terrain profile. GY-85 BMP085, a 9-axis sensor module comprising a 3-axis gyroscope, 3-axis accelerometer, and 3-axis magnetic field sensor, is used as the inertial measurement system. The DYP-ME007Y-PWM waterproof ultrasonic sensor is used for underwater range measurement and terrain scanning. After power on, the ultrasonic sensor waits for the trigger signal. When it receives a trigger signal, it generates and transmits eight 40 kHz pulses and waits for the echo signal. A PWM pulse is generated according to the delay between the transmitted and the echo signal. The value of the distance can be deduced from the pulse width. If no echo is detected, the sensor generates a constant pulse width of about 35 ms. Two sets of range measurement data were obtained using the range sensor for estimating depth in an underwater setup. The raw data received from the sensor lacks precision due to the inherent characteristics of the sensor as well as the noisy underwater environment. Hence, it is imperative to estimate the accurate signal from the sensor data using signal processing techniques. The paper analyzes and compares the filtered range sensor data obtained using three different signal processing techniques, i.e., Moving Average filter, Finite Impulse Response filter, and Kalman filter. The terrain profile was generated by simultaneously recording the position of the robot using the inertial measurement system and the elevation data filtered using the improved Kalman filter method.
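The step of deducing distance from the PWM pulse width can be illustrated with a minimal sketch. It assumes that the pulse width equals the round-trip travel time of the acoustic burst and uses a nominal speed of sound in water of 1480 m/s; both are assumptions made for illustration, not values given in the paper, and the function and constant names are hypothetical.

```python
# Assumed nominal speed of sound in water (m/s); the real value varies with
# temperature, salinity and depth, and is not stated in the paper.
SPEED_OF_SOUND_WATER_M_S = 1480.0
# Pulse width reported by the sensor when no echo is detected (about 35 ms).
NO_ECHO_PULSE_S = 0.035


def pulse_width_to_distance_cm(pulse_width_s, speed_m_s=SPEED_OF_SOUND_WATER_M_S):
    """Convert a PWM pulse width (seconds) to a one-way range in cm.

    Returns None when the pulse width indicates that no echo was received.
    Assumes the pulse width equals the round-trip time of the burst.
    """
    if pulse_width_s >= NO_ECHO_PULSE_S:
        return None
    # Halve the round-trip path to get the one-way range, then convert m to cm.
    return pulse_width_s * speed_m_s / 2.0 * 100.0


if __name__ == "__main__":
    print(pulse_width_to_distance_cm(0.001))   # a 1 ms pulse
    print(pulse_width_to_distance_cm(0.035))   # no-echo case
```

Under these assumptions a 1 ms pulse corresponds to roughly 74 cm; a deployed system would calibrate the speed of sound against the measured water temperature.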
2.2 High-Precision Range Measurement Methods
2.2.1 Moving Average Filter Algorithm
The Moving Average filter or running-mean filter is one of the most commonly used filters in underwater environment [10]. It is used to filter the short-term ripples and emphasize longer term trends. The threshold between short-term and long-term depends on the parameters of the Moving Average filter based on the application. The filter output consists of filtered data sequence with the degree of smoothening, and associated loss of information from both the ends of the input data, depending on the number of filter weights [10]. Mathematical design of the filter is described in detail by Thomson and Emery.
$$z_{M+k} = \frac{1}{2M+1}\sum_{i=0}^{2M} x_{i+k}, \qquad w = \frac{1}{2M+1} \tag{1}$$
The above equation shows that the Moving Average filter is a moving rectangular window filter and consists of an odd number of 2M + 1 equal weights w which resembles a uniform probability density function. Two implementations of Moving Average filter with M = 1 and 2 are presented in the current work.
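The rectangular-window average of Eq. (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name and the sample values are hypothetical. The output is shorter than the input by 2M samples, which reflects the loss of information at both ends mentioned above.

```python
def moving_average(x, M):
    """Moving Average filter with 2M + 1 equal weights, following Eq. (1).

    Output sample k is the mean of x[k] ... x[k + 2M], so the result has
    len(x) - 2M samples (M samples are lost at each end of the input).
    """
    window = 2 * M + 1
    if len(x) < window:
        return []
    return [sum(x[k:k + window]) / window for k in range(len(x) - window + 1)]


if __name__ == "__main__":
    # Hypothetical range readings in cm, for illustration only.
    readings = [75.2, 75.6, 75.4, 75.9, 75.1, 75.5, 75.3]
    print(moving_average(readings, M=1))
    print(moving_average(readings, M=2))
```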
2.2.2 Finite Impulse Response (FIR) Filter Algorithm
A Finite Impulse Response filter is a filter whose impulse response is of finite time duration, i.e., it decays and settles to zero in finite time. An FIR filter uses difference equations to generate its output as a weighted sum of samples of the input signal. The output of an Nth order general linear FIR filter with impulse response $h_k$ is given by the following equation [11]
$$z_k = \sum_{m=0}^{N-1} h_m x_{k-m} = \sum_{m=0}^{N-1} h_{k-m} x_m \tag{3}$$
For an ideal low pass FIR filter, the impulse response $h_k$ is given by
$$h_k = \frac{\sin(k\omega_c)}{k\pi} = \frac{\omega_c}{\pi}\,\mathrm{sinc}(k\omega_c) \tag{4}$$
FIR filters with N = 10 and N = 20 are implemented to obtain the estimated signal for both sets of range data.
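A low pass FIR filter built from the truncated ideal response of Eq. (4) and applied through the weighted sum of Eq. (3) might look like the sketch below. The cutoff frequency, the centring of the taps, and the normalization to unit DC gain are assumptions made for illustration (the paper does not state them), and the function names are hypothetical.

```python
import math


def lowpass_fir_taps(N, omega_c):
    """Return N taps of a truncated ideal low pass response, Eq. (4).

    omega_c is the cutoff in radians per sample (assumed value, not from
    the paper). Taps are centred on (N - 1) / 2 and scaled to sum to 1 so
    that a constant depth passes through unchanged.
    """
    center = (N - 1) / 2.0
    taps = []
    for n in range(N):
        k = n - center
        if k == 0:
            taps.append(omega_c / math.pi)          # limit of sin(k*wc)/(k*pi)
        else:
            taps.append(math.sin(k * omega_c) / (k * math.pi))
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(x, h):
    """Apply z_k = sum_m h_m * x_{k-m} (Eq. 3) where all samples exist."""
    N = len(h)
    return [sum(h[m] * x[k - m] for m in range(N)) for k in range(N - 1, len(x))]


if __name__ == "__main__":
    h = lowpass_fir_taps(10, 0.2 * math.pi)
    noisy = [75.0 + (0.5 if i % 2 else -0.5) for i in range(30)]  # hypothetical data
    print(fir_filter(noisy, h)[:5])
```

Only outputs computed from a full set of N input samples are returned, so the first N − 1 samples of the record are dropped rather than padded.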
2.2.3 Kalman Filter
The Kalman filter uses a series of measurements observed over time, containing process and environment noise, and produces estimates of unknown variables which are more accurate and precise, by estimating a joint probability distribution over the variables for each timeframe [12]. Knowing the covariance matrices of the estimate and the incoming measurements, the filter fuses measurements and estimates, minimizing the variance of the resulting estimate [13]. The filter is also referred to as the Linear Quadratic Estimator (LQE). The Kalman model for the current problem, comprising the state difference and measurement equations for a linear dynamic system, is given by [14]
$$x_k = A x_{k-1} + B u_k + w_k \tag{5}$$
$$z_k = H x_k + v_k \tag{6}$$
where,

| Variable | Description | Dimension |
|---|---|---|
| x | System State Vector, $x_k \in R^n$ | n × 1 |
| u | System Control Vector | p × 1 |
| w | Process/Perturbation Noise Vector, $w_k \in R^n$ | n × 1 |
| z | Measurement Vector, $z_k \in R^m$ | m × 1 |
| v | Measurement Noise Vector, $v_k \in R^m$ | m × 1 |
| A | System State Matrix, $A \in R^{n \times n}$ | n × n |
| B | System Control Matrix, $B \in R^{n \times p}$ | n × p |
| H | Measurement Matrix, $H \in R^{m \times n}$ | m × n |
$w_k$ and $v_k$ are independent Gaussian white noise sequences which satisfy the following equations [15–17].
$$E\{w_k\} = 0; \quad E\{v_k\} = 0; \quad E\{w_k v_j^T\} = 0 \tag{7}$$
$$E\{w_k w_j^T\} = Q\,\delta_{kj}; \quad E\{v_k v_j^T\} = R\,\delta_{kj} \tag{8}$$
where the function E{X} represents the expectation (or mean) of X, Q is the covariance matrix of the Process Noise Vector $w_k$, R is the covariance matrix of the Measurement Noise Vector $v_k$, $\delta_{kj} = 1$ if k = j and $\delta_{kj} = 0$ if k ≠ j. The Kalman filter is a recursive filter and is a two-step process comprising Prediction and Correction (or Update). The first step, i.e., the Prediction phase, uses the state estimate from the previous time step to produce an estimate of the state at the current time step. In the Correction phase, the current prediction is combined with the current observation to obtain a more precise state estimate.
Prediction
$$\hat{x}_k^- = A\hat{x}_{k-1} + B u_k \tag{9}$$
$$P_k^- = A P_{k-1} A^T + Q \tag{10}$$
where $\hat{x}_k^-$ is the predicted state estimate and $P_k^-$ is the predicted error covariance. $P_k \in R^{n \times n}$ is the error covariance matrix, i.e., the covariance of the state error $e_k$ (difference between the estimated state value and the state). P may be defined [12] by the following equation.
$$E\{e_k e_j^T\} = P_k\,\delta_{kj} \tag{11}$$
where $\delta_{kj} = 1$ if k = j and $\delta_{kj} = 0$ if k ≠ j.
Correction/Update
$$K_k = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1} \tag{12}$$
$$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H\hat{x}_k^-\right) \tag{13}$$
$$P_k = (I - K_k H)\,P_k^- \tag{14}$$
where K is the Kalman gain, i.e., the relative weight given to the measurements and the current state estimate through the covariance matrices Q and R, and $\hat{x}_k$ is the corrected/updated state estimate. Although the Kalman filter is very efficient in filtering noisy data from an input sensor, its filtering performance depends on the input noise parameters. Selecting an optimal value of the measurement noise covariance R is important for effective filtering. In an actual underwater data streaming environment, it is very difficult to obtain the optimal value of R from a complex device configuration. With a poor choice of R, the accuracy of the Kalman filter is degraded. In order to avoid a poor selection of R and to determine its best estimate directly from the sensor data, an analytical method using a Denoising Autoencoder is used in the present work [18, 19].
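The prediction and correction steps of Eqs. (9)–(14) can be sketched for a single range reading. The sketch assumes a scalar, constant-depth model (A = 1, B = 0, H = 1), which is an illustrative simplification and not a model stated in the paper; Q, R, the initial values and the function name are hypothetical. R is passed in as a plain parameter: the Denoising Autoencoder used by the authors to estimate it is not reproduced here.

```python
def kalman_filter_1d(measurements, Q, R, x0, P0):
    """Scalar Kalman filter under the assumed model A = 1, B = 0, H = 1.

    Q and R are the process and measurement noise variances; x0 and P0 are
    the initial state estimate and error variance (all assumed inputs).
    Returns the corrected state estimate after each measurement.
    """
    x_hat, P = x0, P0
    estimates = []
    for z in measurements:
        # Prediction, Eqs. (9)-(10): the state is carried over, uncertainty grows.
        x_pred = x_hat
        P_pred = P + Q
        # Correction, Eqs. (12)-(14): blend prediction and measurement.
        K = P_pred / (P_pred + R)
        x_hat = x_pred + K * (z - x_pred)
        P = (1.0 - K) * P_pred
        estimates.append(x_hat)
    return estimates


if __name__ == "__main__":
    # Hypothetical noisy readings around 75 cm, for illustration only.
    readings = [75.3, 74.8, 75.6, 75.1, 74.9, 75.4, 75.0]
    print(kalman_filter_1d(readings, Q=1e-5, R=0.1, x0=readings[0], P0=1.0))
```

A larger R makes the gain K smaller, so the filter trusts its prediction more and smooths harder; a smaller R makes it follow the raw readings, which is why the choice of R matters for the results that follow.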
3 Result
To reduce the noise in the range measurement data and improve the signal-to-noise ratio, data for two depths were recorded using the ultrasonic range sensor. Data set I was measured for a depth of less than 100 cm, whereas data set II was taken for a depth greater than 200 cm.
Two implementations of the Moving Average filter with M = 1 and M = 2 were applied to data set I and data set II. The results are shown in Fig. 1. The mean value and standard deviation of the filtered data (data set I, M = 1) in Fig. 1Ia are 75.46 cm and 0.085, respectively. The standard deviation of the filtered data (data set I, M = 2) in Fig. 1Ib reduces to 0.076 while the mean remains the same. The mean value and standard deviation of the filtered data (data set II, M = 1) in Fig. 1IIa are 210.91 cm and 0.147, respectively. The standard deviation of the filtered data (data set II, M = 2) in Fig. 1IIb reduces to 0.117 while the mean remains the same.
Finite Impulse Response filters with order N equal to 10 and 20 were implemented and applied to both sets of sensor data. The results are shown in Fig. 2. The mean value and standard deviation of the filtered data (data set I, N = 10) in Fig. 2Ia are 75.38 cm and 1.330, respectively, while the mean value and standard deviation of the filtered data (data set I, N = 20) in Fig. 2Ib are 75.39 cm and 1.395, respectively. The mean value and standard deviation of the filtered data (data set II, N = 10) in Fig. 2IIa are 210.90 cm and 0.118, respectively. The standard deviation of the filtered data (data set II, N = 20) in Fig. 2IIb changes to 0.128 while the mean remains the same.
The Kalman filter was applied to data sets I and II and the results are shown in Fig. 3. The mean and standard deviation of the filtered data for data set I were found to be 75.41 cm and 0.059. The mean and standard deviation for data set II were 210.91 cm and 0.020, respectively.
Fig. 1 Filtering of range sensor data using Moving Average filter
Fig. 2 Filtering of range sensor data using FIR filter
Fig. 3 Filtering of range sensor data using Kalman filter
A test setup measuring 60 cm × 35 cm was designed and immersed in a water tank of height 90 cm. The setup is shown in Fig. 4. The range data and the position were captured simultaneously by the underwater surveyor robot. The range data was then filtered using the improved Kalman filter method described above. The terrain profile for the test setup was thus generated by the underwater surveyor robot. The results were logged on a computer for plotting the 2D surface plots. The terrain profile was plotted in MATLAB and is shown in Fig. 5.
Fig. 4 Underwater test setup
Fig. 5 Terrain profile for underwater test setup
4 Conclusion
The paper presents an improved terrain profiling system with a high-precision range measurement method for an underwater surveyor robot. A comparative study of digital signal filtering techniques, including the Moving Average filter, Finite Impulse Response (FIR) filter, and Kalman filter, is presented for removing noise from data obtained from the ultrasonic range sensor, leading to a better estimate of the range data and an improved signal-to-noise ratio. The results show that the Kalman filter method with Denoising Autoencoder estimated the range data accurately and was superior to the Moving Average filter and FIR filter methods in filtering accuracy. The improved Kalman filter method was used to generate the terrain profile of an underwater test setup. The results are in good agreement with the actual terrain profile of the test setup.
References
1. J. Yuh, Design control of autonomous underwater robots: a survey. Auton. Robot. 8(1), 7–24 (2000)
2. R.E. Thomson, W.J. Emery, Data acquisition and recording, data analysis methods in physical oceanography, 3rd edn. (Elsevier, 2014), pp. 1–186
3. J.G. Bellingham, K. Rajan, Robotics in remote and hostile environments. Science 318(5853), 1098–1102 (2007)
4. P.H. Salamonowicz, A.M. ASCE, G.C. Arnold, Terrain profiling using seasat radar altimeter, J. Surv. Eng. (1985). https://doi.org/10.1061/(asce)0733-9453(1985)111:2(140)
5. S.M. Smith, S.E. Dunn, The Ocean Voyager II: an AUV designed for coastal oceanography, Autonomous Underwater Vehicle Technology, AUV '94 (1994). http://doi.org/10.1109/AUV.1994.518618
6. J. Pascual, O. Sanjuán, J.M. Cueva, B.C. Pelayo, M. Álvarez, A. González, Modeling architecture for collaborative virtual objects based on services. J. Netw. Comput. Appl. 34(5), 1634–1647 (2011)
7. P.K. Pandey, Maneesha, S. Sharma, V. Kumar, S. Pandey, An intelligent terrain profiling embedded system for underwater applications, in Proceeding of International Conference on Computational Intelligence & Communication Technology (CICT) (2018) ISBN: 978-1-5386-0886-9, IEEE Xplore
8. P. Jonsson, I. Sillitoe, B. Dushaw, J. Heltne, Observing using sound and light – a short review of underwater acoustic and video-based methods. Ocean Sci. Discuss. 6, 819–870 (2009)
9. M.R. Arshad, Recent advancement in sensor technology for underwater applications. Indian J. Geo-Marine Sci. 38(3), 267–273 (2009)
10. R.E. Thomson, W.J. Emery, Digital Filters, Data Analysis Methods in Physical Oceanography, 3rd edn. (Elsevier, 2014), p. 607
11. F.J. Taylor, Digital Filters: Principles and Applications with MATLAB (IEEE-John Wiley & Sons, Inc., Publication, 2012)
12. J. Ou, S. Li, J. Zhang, C. Ding, Method and Evaluation Method of Ultra-Short-Load Forecasting in Power System, Data Science: 4th International Conference of Pioneering Computer Scientists, Engineers and Educators, Proceedings, Part 2 (2018)
13. T.D. Larsen, K.L. Hansen, N.A. Andersen, O. Ravn, Design of Kalman filters for mobile robots: Evaluation of the kinematic and odometric approach, in Proceedings of IEEE Conference on Control Applications, vol. 2 (1999)
14. R. Kalman, A new approach to linear filtering and prediction problems. J. Basic Eng. (1960)
15. X.F. Zhang, A.W. Heemink, J.C.H. Van Eijeren, Performance robustness analysis of Kalman filter for linear discrete-time systems under plant and noise uncertainty. Int. J. Syst. Sci. 26(2), 257–275 (1995)
16. G. Chen, J. Wang, L. Shieh, Interval Kalman filtering. IEEE Trans. Aerosp. Electron. Syst. 33(1), 250–259 (1997)
17. J. Xiong, C. Jauberthie, L. Trave-Massuyes, New computation aspects for the interval Kalman filtering, in 15th IFAC Workshop on Control Applications of Optimization (2012)
18. P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, in Proceedings of the 25th International Conference on Machine Learning (2008), pp. 1096–1103
19. P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in Proceedings of the ICML Workshop on Unsupervised and Transfer Learning (2011), pp. 37–49
Prediction of Diabetes Mellitus: Comparative Study of Various Machine Learning Models Arooj Hussain and Sameena Naaz
Abstract Diabetes is a common metabolic-cum-endocrine disorder in the world today. It is generally a chronic problem where either the pancreas does not produce an adequate quantity of Insulin, a hormone that regulates blood glucose level, or the body does not effectively utilize the produced Insulin. This review paper presents a comparison of various Machine Learning models in the detection of Diabetes Mellitus (Type-2 Diabetes). Selected papers published from 2010 to 2019 have been comparatively analyzed and conclusions were drawn. Various models that have been compared are Adaptive Neuro-Fuzzy Inference System (ANFIS), Deep Neural Network (DNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression, Decision Tree, Naive Bayes, K-Nearest Neighbours (KNN) and Random Forest. The two models which have outperformed all others in most studies taken into consideration are Random Forest and Naive Bayes. Other powerful mechanisms are SVM, ANN and ANFIS. The criteria chosen for comparison are accuracy and Matthew’s Correlation Coefficient (MCC). Keywords Machine Learning · Diabetes Mellitus · Random Forest · Artificial Neural Network (ANN) · Logistic Regression · Cross-validation · Percentage split
1 Introduction
Machine Learning (ML) can be defined as a subtype of Artificial Intelligence to solve real-world problems by "providing learning ability to computer without additional programming" [1]. As the development in ML increased, along with it increased the use of computers in medicine.
A. Hussain · S. Naaz (B) Department of Computer Science and Engineering, School of Engineering Sciences and Technology, Jamia Hamdard, New Delhi 110062, India e-mail: [email protected] A. Hussain e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_10
Diabetes Mellitus is a disease that affects a large population throughout the world and is becoming a huge challenge to tackle. According to data released by the International Diabetes Federation [2], in 2017 there were 425 million people suffering from Diabetes globally, a number expected to rise to 629 million by 2045. For the classification and prediction of the occurrence of Diabetes, various computational techniques have been developed and utilized. The use of Machine Learning techniques in prediction is proving to be very useful, as it is increasing the accuracy of diagnosis, reducing costs and increasing the rates of effective treatment. Patients with other diseases, such as breast cancer and brain tumours, can also benefit from Machine Learning, which detects anomalies by studying their scans, as shown by Naaz et al. [3], Kahksha et al. [4], Hassan et al. [5]. This review presents a comprehensive study of a number of Machine Learning models used for the identification of diabetes, comparing their performance and identifying the most suitable amongst them. In this survey, several ML models, viz. Adaptive Neuro-Fuzzy Inference System (ANFIS), Deep Neural Network (DNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression, Decision Tree, Naive Bayes, K-Nearest Neighbours (KNN) and Random Forest, have been compared. Finally, the outcomes of the previously conducted studies have been analyzed, which hopefully will help future advancement and research. The remainder of this paper is arranged as follows: Sect. 2 describes the problem statement and the aim of the study. Section 3 discusses all the techniques that have been used by researchers to predict Diabetes Mellitus in the selected papers, followed by the Validation Techniques used in Sect. 4. Section 5 summarizes the results that are obtained from each study.
Section 6 describes the cumulative results drawn from all the studies according to the method of validation used, and in Sect. 7, the conclusion from the entire exercise is provided. Finally, the challenges faced, future directions and limitations of this study are briefly discussed in Sect. 8.
2 Problem Statement This survey has been formulated to carry out a review of Machine Learning techniques/models and their application in classification/detection of Diabetes Mellitus. The aim of this survey is to compare these techniques and to conclude which is the most feasible model for achieving the highest accuracy in the prediction process. Going by the experience of scientists from the papers [6, 7] who have compared ML techniques in various research studies, the literature review was carried out by rigorously scanning and going through previously published papers in depth to obtain inferences about Diabetes prediction. Only those papers were selected that were published between 2010 and 2019.
3 Machine Learning Techniques The various techniques that have been compared in this review are briefly described below. Adaptive Neuro-Fuzzy Inference System (ANFIS) is a system that incorporates the basic principles of Neural Networks as well as Fuzzy Logic. It combines the parallel distributed processing with the learning ability of ANN and hence works as a hybrid. ANFIS is itself composed of two parts, the antecedent part and the conclusion part, which communicate with each other using some pre-set rules [8]. Architecturally it is generally made up of 5 layers [9]. An Artificial Neural Network (ANN) is a Machine Learning model that takes inspiration from the biological neural networks present inside the human brain. Analogous to the human nervous system, the basic building block of ANN is called a Neuron. Any Neural Network has a minimum of three layers, input layer, output layer and a hidden layer sandwiched in between them. An ANN featuring more than one hidden layer is called a Deep Neural Network (DNN). A Support Vector Machine (SVM) is a supervised Machine Learning method employed mostly for classification problems and rarely for regression problems, proposed by Platt et al. [10]. As a classifier, it is a discriminative classification method that segregates data points into two or more classes on the basis of certain parameters [11]. The best hyperplane is considered to be the one with the largest distance from the closest data point, known as margin, which decreases the chances of generalization error. Logistic Regression can be thought of as a simpler version of DNN in a way, deprived of any hidden layers [12]. The processing of data is done by using an activation function and a sigmoid function whose output will then be compared to 0.5 for the classification purpose. 
It takes the use of a sigmoid function for finding the probability of a class using the following rule, in case probability comes out to be ≥0.5, it is assumed to be belonging to class 1 and if probability xi−1 }
(4)
R. Kumar et al.
$$\{|x_i - x_{i+1}| \geq \varepsilon\} \ \text{and} \ \{|x_i - x_{i-1}| \geq \varepsilon\} \tag{5}$$
Waveform Length
Waveform length is the total length of the signal for each segment. This can be calculated as given in Eq. (6).
$$WL = \sum_{i=1}^{N} |x_i| \tag{6}$$
3 Classification
Two mental tasks, right hand and feet movement, are classified using ten different classifiers, including Naïve Bayes, Bayes net, AdaBoost, and several decision-based classifiers. The features extracted with the time-domain methods are applied as input to these classifiers in order to evaluate their performance and to understand their capability to separate the motor imagery tasks.
4 Results and Discussion
This section presents the classification results obtained with the time-domain features. Different classifiers are evaluated to judge the time-domain features, and the details are presented in the subsequent tables. In the current work, a motor imagery task is taken for analysis. The EEG task data are taken from BNCI Horizon 2020. In the dataset, signals are recorded through 15 channels for 14 subjects. In this work, only 3 subjects with the central channels (C3, CZ, and C4) are considered for analysis. Each subject has 3 runs for training and 5 runs for testing; every run comprises data, trial value, and class label. The two classes, right hand and feet movement, are treated as class 1 and class 2, respectively. Before pre-processing, the data are first segmented into 20 trials for each channel. In the next stage, four time-domain features (MAV, ZC, SSC, and WL) are calculated and combined into a feature set. On this feature set, 10 different classifiers are applied and the performance of each classifier is evaluated. The results show that for subject 1 (Table 1) AdaBoostM1 and Decision table show the highest, identical recognition capability at 56.25%, and the minimum accuracy is achieved by the Logistic classifier. For subject 2 (Table 2) the maximum accuracy of 59.375% is achieved by the Bayes net and IBK classifiers. Similarly, for subject 3 (Table 3) the maximum accuracy of 59.375% is achieved by Naive Bayes, SGD, IBK, and LWL.
Evolution of Time-Domain Feature for Classification …
Table 1 Performance of different classifiers on S01. Each cell lists the values for class 1, class 2, and the unlabeled third (summary) row of the original table, in that order.

| Classifier | TP rate | FP rate | Precision | Recall | F-measure | Accuracy |
|---|---|---|---|---|---|---|
| Bayes net | 0.375 / 0.5 / 0.438 | 0.5 / 0.625 / 0.563 | 0.429 / 0.444 / 0.437 | 0.375 / 0.5 / 0.438 | 0.4 / 0.471 / 0.435 | 43.75 |
| Naïve Bayes | 0.375 / 0.5 / 0.438 | 0.5 / 0.625 / 0.563 | 0.429 / 0.444 / 0.437 | 0.375 / 0.5 / 0.438 | 0.4 / 0.471 / 0.435 | 43.75 |
| Logistic | 0.313 / 0.375 / 0.344 | 0.625 / 0.688 / 0.656 | 0.333 / 0.353 / 0.343 | 0.313 / 0.375 / 0.344 | 0.323 / 0.364 / 0.364 | 34.375 |
| SGD | 0.25 / 0.563 / 0.406 | 0.438 / 0.75 / 0.594 | 0.364 / 0.429 / 0.396 | 0.25 / 0.563 / 0.406 | 0.296 / 0.486 / 0.391 | 40.625 |
| SMO | 0.313 / 0.5 / 0.46 | 0.5 / 0.688 / 0.594 | 0.389 / 0.421 / 0.403 | 0.313 / 0.5 / 0.406 | 0.345 / 0.357 / 0.401 | 40.625 |
| IBK | 0.625 / 0.25 / 0.438 | 0.75 / 0.375 / 0.563 | 0.455 / 0.4 / 0.427 | 0.625 / 0.25 / 0.438 | 0.526 / 0.308 / 0.417 | 43.75 |
| LWL | 0.5 / 0.5 / 0.5 | 0.5 / 0.5 / 0.5 | 0.5 / 0.5 / 0.5 | 0.5 / 0.5 / 0.5 | 0.5 / 0.5 / 0.5 | 50 |
| AdaboostM1 | 1 / 0.125 / 0.563 | 0.875 / 0 / 0.438 | 0.533 / 1 / 0.767 | 1 / 0.125 / 0.563 | 0.696 / 0.222 / 0.459 | 56.25 |
| Decision table | 0.75 / 0.375 / 0.563 | 0.625 / 0.25 / 0.438 | 0.545 / 0.6 / 0.573 | 0.75 / 0.375 / 0.563 | 0.632 / 0.462 / 0.547 | 56.25 |
| Random forest | 0.25 / 0.5 / 0.375 | 0.5 / 0.75 / 0.625 | 0.333 / 0.4 / 0.367 | 0.25 / 0.5 / 0.375 | 0.286 / 0.444 / 0.365 | 50 |
In terms of subjects, subject 2 (S02) and subject 3 (S03) delivered the highest accuracy, whereas subject 1 (S01) delivered the lowest accuracy for the same set of features. The best result is given by the Bayes net and Naive Bayes classifiers with an average classification accuracy of 53.125%, and the worst result is given by the Logistic classifier with an average classification accuracy of 43.75%. To examine the efficiency of the features, Table 4 compares the proposed approach with an existing approach on the same dataset. The table shows the classification performance for each of the three subjects. As shown in Table 4, the proposed method gives a better performance of 56.25% for subject 1 and 59.375% for subject 2
Table 2 Performance of different classifiers on S02. Each cell lists the values for class 1, class 2, and the unlabeled third (summary) row of the original table, in that order.

| Classifier | TP rate | FP rate | Precision | Recall | F-measure | Accuracy |
|---|---|---|---|---|---|---|
| Bayes net | 0.65 / 0.5 / 0.594 | 0.5 / 0.35 / 0.444 | 0.684 / 0.462 / 0.601 | 0.65 / 0.5 / 0.594 | 0.667 / 0.48 / 0.597 | 59.375 |
| Naïve Bayes | 0.6 / 0.5 / 0.563 | 0.5 / 0.4 / 0.463 | 0.667 / 0.429 / 0.577 | 0.6 / 0.5 / 0.563 | 0.632 / 0.462 / 0.568 | 56.25 |
| Logistic | 0.5 / 0.417 / 0.469 | 0.583 / 0.5 / 0.552 | 0.588 / 0.333 / 0.493 | 0.5 / 0.417 / 0.469 | 0.541 / 0.37 / 0.477 | 46.875 |
| SGD | 0.5 / 0.5 / 0.5 | 0.5 / 0.5 / 0.5 | 0.625 / 0.375 / 0.531 | 0.5 / 0.5 / 0.5 | 0.556 / 0.429 / 0.508 | 50 |
| SMO | 0.5 / 0.583 / 0.531 | 0.417 / 0.5 / 0.448 | 0.667 / 0.412 / 0.571 | 0.5 / 0.583 / 0.531 | 0.571 / 0.483 / 0.538 | 53.125 |
| IBK | 0.55 / 0.667 / 0.594 | 0.333 / 0.45 / 0.377 | 0.733 / 0.471 / 0.635 | 0.55 / 0.667 / 0.594 | 0.629 / 0.552 / 0.6 | 59.375 |
| LWL | 0.45 / 0.583 / 0.5 | 0.147 / 0.55 / 0.467 | 0.643 / 0.389 / 0.548 | 0.45 / 0.583 / 0.5 | 0.529 / 0.467 / 0.506 | 50 |
| AdaboostM1 | 0.15 / 0.917 / 0.438 | 0.083 / 0.85 / 0.371 | 0.75 / 0.393 / 0.616 | 0.15 / 0.917 / 0.438 | 0.25 / 0.55 / 0.363 | 43.75 |
| Decision table | 0.5 / 0.333 / 0.438 | 0.667 / 0.5 / 0.604 | 0.556 / 0.286 / 0.454 | 0.5 / 0.333 / 0.438 | 0.526 / 0.308 / 0.444 | 43.75 |
| Random forest | 0.1 / 1 / 0.438 | 0 / 0.9 / 0.338 | 1 / 0.4 / 0.775 | 0.1 / 1 / 0.438 | 0.182 / 0.571 / 0.328 | 43.75 |
and subject 3. Therefore, it can be stated that the proposed method performs better than the existing method already discussed in the literature survey section. The highest accuracy achieved is 59.375%, which is not enough to design a stable and reliable BCI. The low accuracy indicates that, regardless of the features' ability in classification, it is difficult to deal with the chaotic behavior of the EEG signal. In general, the best feature for designing a BCI is the one that gives the better accuracy.
Table 3 Performance of different classifiers on S03. Each cell lists the values for class 1, class 2, and the unlabeled third (summary) row of the original table, in that order.

| Classifier | TP rate | FP rate | Precision | Recall | F-measure | Accuracy |
|---|---|---|---|---|---|---|
| Bayes net | 0.5 / 0.667 / 0.563 | 0.333 / 0.5 / 0.396 | 0.714 / 0.444 / 0.613 | 0.5 / 0.667 / 0.568 | 0.588 / 0.533 / 0.568 | 56.25 |
| Naïve Bayes | 0.333 / 0.45 / 0.377 | 0.733 / 0.471 / 0.635 | 0.733 / 0.471 / 0.635 | 0.55 / 0.667 / 0.594 | 0.629 / 0.552 / 0.6 | 59.375 |
| Logistic | 0.35 / 0.75 / 0.5 | 0.25 / 0.65 / 0.4 | 0.7 / 0.409 / 0.591 | 0.35 / 0.75 / 0.5 | 0.467 / 0.529 / 0.49 | 50 |
| SGD | 0.4 / 0.917 / 0.594 | 0.083 / 0.6 / 0.277 | 0.889 / 0.478 / 0.735 | 0.4 / 0.917 / 0.594 | 0.552 / 0.629 / 0.581 | 59.375 |
| SMO | 0.4 / 0.917 / 0.594 | 0.083 / 0.6 / 0.277 | 0.889 / 0.478 / 0.735 | 0.4 / 0.917 / 0.581 | 0.552 / 0.629 / 0.581 | 59.375 |
| IBK | 0.5 / 0.75 / 0.594 | 0.25 / 0.5 / 0.344 | 0.769 / 0.474 / 0.658 | 0.5 / 0.75 / 0.594 | 0.606 / 0.581 / 0.597 | 59.375 |
| LWL | 0.4 / 0.917 / 0.594 | 0.083 / 0.6 / 0.277 | 0.889 / 0.478 / 0.735 | 0.4 / 0.917 / 0.594 | 0.552 / 0.629 / 0.581 | 59.375 |
| AdaboostM1 | 0.15 / 1 / 0.469 | 0 / 0.85 / 0.319 | 1 / 0.414 / 0.78 | 0.15 / 1 / 0.469 | 0.261 / 0.585 / 0.383 | 46.875 |
| Decision table | 0.4 / 0.583 / 0.469 | 0.417 / 0.6 / 0.485 | 0.615 / 0.368 / 0.523 | 0.4 / 0.583 / 0.469 | 0.485 / 0.452 / 0.472 | 43.75 |
| Random forest | 0.1 / 1 / 0.438 | 0 / 0.9 / 0.338 | 1 / 0.4 / 0.775 | 0.1 / 1 / 0.438 | 0.182 / 0.571 / 0.328 | 43.75 |

Table 4 Comparison with other methods

| Authors | Methods | S01 | S02 | S03 |
|---|---|---|---|---|
| Sahu et al. | DWT features | 54.375 | 57.500 | 51.250 |
| Proposed method | Time-domain features | 56.250 | 59.375 | 59.375 |
5 Conclusion
In this paper, four time-domain features were used to classify a two-class motor imagery task: right-hand and feet movement. The four features, MAV, ZC, SSC, and WL, were computed and used as input for classification. Ten classifiers were used to check the recognition ability of the features and to compare classifier performance. Classifier performance varies by subject. For subject S01, AdaboostM1 and Decision table show the highest accuracy; for subject S02, Bayes net and IBK achieve the highest accuracy; and Naive Bayes, SGD, IBK, and LWL perform well for subject S03. Across all subjects, Bayes net and Naive Bayes are the best of the 10 classifiers, giving 53.125% accuracy. The maximum accuracy lies between about 55 and 59%. The comparative analysis also shows that the time-domain features perform better, but there is still scope for future work on larger data sets, where performance could be improved with more varied motor imagery data.
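The four features named above can be sketched in a few lines of Python. The definitions follow common usage for mean absolute value (MAV), zero crossings (ZC), slope sign changes (SSC), and waveform length (WL); the function names and the optional `threshold` parameter are illustrative assumptions and are not taken from the paper.

```python
def mean_absolute_value(x):
    """MAV: average of the absolute sample values."""
    return sum(abs(v) for v in x) / len(x)


def zero_crossings(x, threshold=0.0):
    """ZC: number of sign changes between consecutive samples.

    `threshold` (an assumption) ignores crossings with a tiny amplitude step.
    """
    count = 0
    for a, b in zip(x, x[1:]):
        if a * b < 0 and abs(a - b) >= threshold:
            count += 1
    return count


def slope_sign_changes(x, threshold=0.0):
    """SSC: number of samples where the slope changes sign (local extrema)."""
    count = 0
    for prev, cur, nxt in zip(x, x[1:], x[2:]):
        is_extremum = (cur - prev) * (cur - nxt) > 0
        big_enough = abs(cur - prev) >= threshold or abs(cur - nxt) >= threshold
        if is_extremum and big_enough:
            count += 1
    return count


def waveform_length(x):
    """WL: cumulative absolute difference between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def time_domain_features(x):
    """Return the feature vector [MAV, ZC, SSC, WL] for one signal window."""
    return [
        mean_absolute_value(x),
        zero_crossings(x),
        slope_sign_changes(x),
        waveform_length(x),
    ]
```

In practice such a function would be applied to each EEG channel window, and the resulting vectors passed to the classifiers; the windowing itself is not shown here.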
References
1. J.R. Wolpaw, N. Birbaumer, D.J. McFarland, G. Pfurtscheller, T.M. Vaughan, Brain–computer interfaces for communication and control. 113(6), 767–791 (2002)
2. S.G. Mason, G.E. Birch, A general framework for brain-computer interface design. 11(1), 70–85 (2003)
3. M.X. Cohen, Analyzing neural time series data: theory and practice. MIT Press (2014)
4. S. Vaid, P. Singh, C. Kaur, EEG signal analysis for BCI interface: a review, in 2015 Fifth International Conference on Advanced Computing and Communication Technologies (IEEE, 2015), pp. 143–147
5. A. Khorshidtalab, M. Salami, M. Hamedi, Evaluation of time-domain features for motor imagery movements using FCM and SVM, in 2012 Ninth International Conference on Computer Science and Software Engineering (JCSSE) (IEEE, 2012), pp. 17–22
6. P. Geethanjali, Y.K. Mohan, J. Sen, Time domain feature extraction and classification of EEG data for brain computer interface, in 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (IEEE, 2012), pp. 1136–1139
7. R. Upadhyay, A. Manglick, D. Reddy, P. Padhy, P.J.C. Kankar, E. Engineering, Channel optimization and nonlinear feature extraction for Electroencephalogram signals classification. 45, 222–234 (2015)
8. A.S. Sankar, S.S. Nair, V.S. Dharan, P. Sankaran, Wavelet sub band entropy based feature extraction method for BCI. 46, 1476–1482 (2015)
9. Z. Liu, J. Sun, Y. Zhang, P. Rolfe, Sleep staging from the EEG signal using multi-domain feature extraction. 30, 86–97 (2016)
10. V. Harpale, V. Bairagi, An adaptive method for feature selection and extraction for classification of epileptic EEG signal in significant states (2018)
11. M. Sahu, S. Shukla, Impact of feature selection on EEG based motor imagery, in Information and Communication Technology for Competitive Strategies (Springer, 2019), pp. 749–762
12. G.U. Technology (2015) Two class motor imagery (002-2014). http://bnci-horizon-2020.eu/database/data-sets
13. G. Pfurtscheller, C. Neuper, Motor imagery and direct brain-computer communication. 89(7), 1123–1134 (2001)
Finding Influential Spreaders in Weighted Networks Using Weighted-Hybrid Method
Sanjay Kumar, Yash Raghav, and Bhavya Nag
Abstract Finding efficient influencers has attracted many researchers because of its advantages and the many ways in which it can be used. Many methods exist, but most are designed for unweighted networks, while numerous real-life networks are weighted. Finding influential users in weighted networks has numerous applications, such as influence maximization and rumour control. Many algorithms, such as weighted-degree, weighted-VoteRank, weighted-H-index, and entropy-based methods, have been used to rank the nodes in a weighted network according to their spreading capability. Our proposed method can be used on both weighted and unweighted networks to find strong influencers efficiently. The weighted-VoteRank and weighted-H-index methods take the local spreading capability of the nodes into account, while entropy considers both the local and the global influencing capability of the nodes. In this paper, we consider the advantages and drawbacks of the various methods and propose a weighted-hybrid method based on our observations. First, we improve the performance of the weighted-VoteRank and weighted-H-index methods; we then propose a weighted-hybrid method that combines our improved weighted-VoteRank, our improved weighted-H-index, and the entropy method. Simulations using an epidemic model, the Susceptible-Infected-Recovered (SIR) model, show that the proposed method produces better results than other standard methods.
Keywords Complex networks · Influence maximization · Node centrality · SIR model · Weighted-H-index · Weighted-VoteRank
S. Kumar (B) · Y. Raghav · B. Nag, Department of Computer Science and Engineering, Delhi Technological University, Shahbad Daulatpur, Main Bawana Road, Delhi 110042, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_37
1 Introduction
Most real-world networks, such as social networks, biological networks, collaboration networks, and circuit networks, are complex networks. These networks have a large number of nodes or users, and the interactions between nodes are complex. The evolution of complex networks has led to many useful applications such as influence maximization, node classification, viral marketing, link prediction, and information propagation. Influence maximization [1, 2] requires a strategic selection of influential individuals who are capable of spreading information by the ‘word-of-mouth’ analogy, after knowing the information source. The objective of influence maximization is therefore of great value in real-world systems: it dictates the choice of source spreaders that maximizes the final extent of spreading/propagation in the complex network. Determining influential spreaders in the propagation process has a number of applications, such as rumour control, virus transmission, and information dissemination. Many real-life networks, such as transportation networks and email networks, are weighted networks. Finding influential nodes in a weighted network is an active research topic, and many centralities have been proposed for this task. Weighted-degree [2] is the simplest weighted centrality. It ranks influencers by the product of the degree of a node and the average weight of its connections with its neighbours, and it captures only local influence. Closeness [3] and betweenness [4] are centralities that consider a node’s shortest paths to other nodes and are thus not very viable for large networks. Beyond these, many advanced methods have been developed. PageRank [5] uses the algorithm that was used to rank web pages, i.e. counting the citations or back-links of a page. In K-shell decomposition [6], a core number (Ks), not necessarily unique, is allotted to each node; it represents the location of that node in the successive layers (shells) of the network and thereby takes the global structure of the network into account. Many improved K-shell methods have also been proposed [7–11], such as Mixed Degree Decomposition, which aim at differentiating nodes with different influences. VoteRank [12] chooses spreaders by considering the voting ability of a node’s neighbours. Once chosen, a node cannot vote again. The main advantage of this method is that ‘far-off’ or relatively disconnected spreaders are chosen. In the H-index method [13], the ‘h’ value reflects the relative local influence of a node.
These existing methods are usually applied to unweighted networks, which consider only a single type of relation between all nodes. However, in real-world scenarios, the edges are weighted, depicting the extent of interaction. Thus, if a social network comprises more than one kind of relation between the individuals, weighted graphs are used. This relation between nodes can depict capacity or capability (unweighted) or duration or emotional intensity (weighted). Thus, the concept of weighted networks can be extended to many real-world networks. Traditional methods for unweighted networks do not consider the weights of edges, and thus many variants of the traditional methods are available, such as weighted-degree centrality [14], weighted betweenness centrality [2], weighted K-shell decomposition [15], weighted-H-index centrality [13], and weighted-VoteRank. In weighted K-shell decomposition, both the degree of the node and the weights of its links are considered by allocating each node a weighted degree k; the pruning process is similar to that of K-shell decomposition [15]. The weighted-H-index defines the edge weight as the product of the degrees of the connected nodes, while weighted-VoteRank changes the calculation of the voting score to include the strength of the links between two nodes. The recently proposed NCVoteRank method extends VoteRank by bringing the neighbourhood coreness value into voting [16]. Thus, by adapting the unweighted centralities to include the strengths of links, weighted centralities can be applied to real-world networks, and the effect of different kinds of links can be understood.
The contributions of our proposed method, weighted-hybrid, are as follows:
1. An improved version of the weighted-H-index contributes to the final score of a node, which helps us identify the efficient influencers in a social network.
2. An improved version of weighted-VoteRank also contributes to the final score of a node.
3. Entropy centrality lets us combine the global property with the local property captured by the other two.
The organization of this paper is as follows: Sect. 2 briefly presents related work. In Sect. 3, we present the information diffusion model, performance metrics, and datasets used in this work. The proposed method is described in Sect. 4. Section 5 summarizes our results and findings, and the paper is concluded in Sect. 6.
2 Related Works
Weighted-H-Index: In the weighted-H-index method, edge weights are defined to quantify the diffusion capacity of the network. The weight of an edge is defined as the product of the degrees of the two vertices it connects. For each vertex i connected to a vertex j, the weighted edge is decomposed into multiple weighted edges, equal in number to the degree of vertex j. After completing this procedure for each neighbour of vertex i, the H-index is calculated in the traditional manner, i.e. as the maximum value h such that node i has at least h neighbours with weights greater than or equal to h.
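The final step of this description, the h-index over a node's edge weights, can be sketched as follows. This is a minimal illustration only: it uses the degree-product edge weight described above, it deliberately omits the edge-decomposition step (whose details are not given here), and the names `h_index` and `degree_product_weights` are hypothetical.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ordered = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ordered, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h


def degree_product_weights(adj, node):
    """Edge weights of `node`, each defined as k_node * k_neighbour.

    `adj` maps every node to a list of its neighbours (undirected graph).
    """
    k_node = len(adj[node])
    return [k_node * len(adj[nbr]) for nbr in adj[node]]


# Small illustrative graph (not from the paper).
adj = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}
h_a = h_index(degree_product_weights(adj, "a"))
```

Here node `a` has edge weights 6, 6 and 3, so three edges have weight at least 3 and its h value is 3.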
Weighted-VoteRank: Sun et al. [17] proposed a weighted-VoteRank method that improves on VoteRank by taking into account not only the number of neighbours but also the weights of their relations with the current node. The method finds multiple influential spreaders in a weighted network. Each node v is assigned a tuple consisting of its voting score and voting ability, {s_v, va_v}, which is initialized to {0, 1}. At each step, a node votes for its directly connected neighbours according to its voting ability. The voting score in a weighted network is defined as the square root of the product of the number of neighbours and the sum, over the neighbours, of edge weight times voting ability, as shown in Eq. (1):
$$s_v = \sqrt{|\gamma(v)| \sum_{i \in \gamma(v)} va_i \, w_{v,i}} \tag{1}$$
Thus, for any given node v, three factors determine the voting score: the number of neighbour nodes of v, i.e. |γ(v)|; the voting ability of each neighbour i, i.e. va_i; and the edge weight between neighbour i and node v, i.e. w_{v,i}. Initially, each node has a voting ability equal to unity; after each round of voting, the neighbours of the selected node have their voting ability reduced by a constant value.
Entropy-based centrality: Qiao et al. [18] proposed an entropy-based centrality for weighted networks in which the total influencing power of a node is divided into two parts, local power and global power. The local power is obtained by combining the interaction frequency entropy, which indicates the accessibility of the node, and the structural entropy, which indicates the popularity and communication activity of the node. A complete network is deconstructed into smaller subnetworks, from which the required interaction and structural information is derived. This information, along with the information from the two-hop neighbours, forms the total power of a node.
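The voting procedure of weighted-VoteRank can be sketched as below, reading Eq. (1) as the square root of the neighbour count times the ability-weighted sum of edge weights. The graph representation (a dict of dicts), the function names, and the `discount` value are assumptions for illustration; the text above states only that a constant is subtracted from the neighbours' voting ability.

```python
import math


def voting_scores(adj, ability, excluded=()):
    """Voting score of every candidate node, following Eq. (1).

    `adj[v][i]` is the weight of edge (v, i); `ability[i]` is va_i.
    """
    scores = {}
    for v, nbrs in adj.items():
        if v in excluded:
            continue
        weighted_votes = sum(ability[i] * w for i, w in nbrs.items())
        scores[v] = math.sqrt(len(nbrs) * weighted_votes)
    return scores


def select_spreaders(adj, count, discount):
    """Greedily pick `count` spreaders, discounting abilities after each pick."""
    ability = {v: 1.0 for v in adj}  # every node starts with voting ability 1
    chosen = []
    for _ in range(count):
        scores = voting_scores(adj, ability, excluded=chosen)
        if not scores:
            break
        best = max(scores, key=scores.get)
        chosen.append(best)
        ability[best] = 0.0  # a chosen node no longer votes
        for nbr in adj[best]:
            # `discount` is a hypothetical constant, not a value from the paper
            ability[nbr] = max(0.0, ability[nbr] - discount)
    return chosen


# Small illustrative weighted graph (not from the paper).
graph = {
    "a": {"b": 2, "c": 1},
    "b": {"a": 2},
    "c": {"a": 1, "d": 4},
    "d": {"c": 4},
}
spreaders = select_spreaders(graph, count=2, discount=0.5)
```

The discount step is what pushes later picks away from the neighbourhood of earlier ones.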
3 Information Diffusion Model, Performance Metrics and Datasets
3.1 Information Diffusion Model
In this paper, the stochastic Susceptible-Infected-Recovered (SIR) model is used as the information diffusion model to assess the performance of our algorithm. This model divides network nodes into three categories: susceptible (S), infected (I), and recovered (R). Nodes in the susceptible state can receive information from their neighbours. The SIR model takes as input a list of spreaders, i.e. a subset of the network nodes, an infection probability (β), and a recovery probability (γ). Initially, all nodes are susceptible except the spreaders, which are in the infected state. At every step, each infected node infects each of its susceptible neighbours with probability β, and then enters the recovered state with probability γ. Once recovered, a node is immunized and cannot be infected again. Because the model is stochastic, the simulation was run 100 times and the results were averaged over the 100 runs.
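The simulation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is a dict mapping each node to its neighbours, the function names are hypothetical, and `max_steps` is a safety cap added here so the loop always terminates; none of these come from the paper.

```python
import random


def run_sir(adj, spreaders, beta, gamma, rng, max_steps=10_000):
    """Run one SIR simulation and return recovered nodes / total nodes."""
    infected = set(spreaders)
    recovered = set()
    for _ in range(max_steps):
        if not infected:
            break  # steady state: nobody left to spread
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = v not in infected and v not in recovered
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        # Each currently infected node recovers with probability gamma.
        newly_recovered = {u for u in infected if rng.random() < gamma}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return len(recovered) / len(adj)


def average_final_scale(adj, spreaders, beta, gamma, runs=100, seed=0):
    """Average the final infected scale over several independent runs."""
    rng = random.Random(seed)
    total = sum(run_sir(adj, spreaders, beta, gamma, rng) for _ in range(runs))
    return total / runs


# Tiny path graph a - b - c, used only for illustration.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```

With β = 1 and γ = 1 the infection sweeps the whole path, and with β = 0 only the initial spreader is ever infected, which gives two easy sanity checks for the loop.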
3.2 Performance Metrics
We judge the performance of our approach and of the other methods using the following metrics:
(1) Final infected scale (F(t_c)): The final infected scale is the ratio of the number of recovered nodes at the end of the SIR simulation to the total number of nodes in the network. Here, recovered nodes are those that were first infected and then recovered in the SIR model. A high value of F(t_c) means that the information or idea propagated by the influential spreaders has reached a large number of people in the social network. The final infected scale is calculated as
$$F(t_c) = \frac{n_{R(t_c)}}{n} \tag{2}$$
where n_{R(t_c)} is the number of recovered nodes when spreading reaches the steady state and n is the total number of nodes.
(2) Shortest path length (L_s): This metric evaluates the structural relation between each pair of selected spreaders. The shortest path length is calculated for each pair of spreaders and is an essential metric that reflects the location of the influential spreaders. A high value denotes that the spreaders are widely distributed in the network and hence can spread information to a more substantial portion of the network.
$$L_s = \frac{1}{|S|(|S|-1)} \sum_{u,v \in S,\, u \neq v} l_{u,v} \tag{3}$$
where l_{u,v} denotes the length of the shortest path from node u to node v and |S| is the total number of spreaders.
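Eq. (3) can be computed as in the sketch below. It assumes that edge weights act as edge lengths and uses Dijkstra's algorithm; whether the paper measures l_{u,v} in weighted length or in hops is not stated, so this choice, like the function names, is an assumption. Setting every weight to 1 gives hop counts instead.

```python
import heapq


def shortest_path_lengths(adj, source):
    """Dijkstra distances from `source`; `adj[u][v]` is the length of edge (u, v)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def average_spreader_distance(adj, spreaders):
    """L_s of Eq. (3): mean shortest-path length over ordered spreader pairs."""
    spreaders = list(spreaders)
    if len(spreaders) < 2:
        raise ValueError("at least two spreaders are required")
    total = 0.0
    for u in spreaders:
        dist = shortest_path_lengths(adj, u)
        for v in spreaders:
            if v == u:
                continue
            if v not in dist:
                raise ValueError(f"no path between {u!r} and {v!r}")
            total += dist[v]
    return total / (len(spreaders) * (len(spreaders) - 1))


# Weighted path a -1- b -2- c -3- d, used only for illustration.
chain = {
    "a": {"b": 1},
    "b": {"a": 1, "c": 2},
    "c": {"b": 2, "d": 3},
    "d": {"c": 3},
}
```

Raising an error for disconnected spreader pairs is one possible convention; skipping such pairs would be another.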
Table 1 Used dataset
| S. No. | Dataset name | Description | #Nodes | #Edges |
|---|---|---|---|---|
| 1 | Powergrid | An undirected weighted network containing information about the power grid of the Western States of the United States of America | 4941 | 6594 |
| 2 | Facebook-like social network | This undirected weighted dataset originates from an online community for students at the University of California, Irvine | 1899 | 20297 |
| 3 | US top 500 airport network | An undirected weighted network of the 500 busiest commercial airports in the United States | 500 | 28237 |
| 4 | Bitcoin+11 | This is a user–user trust/distrust undirected weighted network | 5881 | 35592 |
3.3 Datasets
We chose four real-life datasets to judge the performance of our proposed method for finding influencers in weighted networks. Table 1 lists the datasets with brief descriptions. These datasets are publicly available at https://toreopsahl.com/datasets/.
4 Proposed Method
In this section, we first present the improved weighted-H-index and improved weighted-VoteRank methods and then describe the proposed weighted-hybrid method, which combines three techniques: the improved weighted-H-index, the improved weighted-VoteRank method, and entropy centrality. Weighted-degree often gave the best results (about 70% of models showed such results), but it considers only the one-hop neighbours when determining the most significant spreaders. By the same logic, we decided to improve the traditional weighted-H-index and weighted-VoteRank methods by including information about the neighbours of the nodes in the formulas. The traditional weighted-H-index method evaluates a node’s spreading power according to the number of highly influential neighbours; however, it fails to account for the topological structure of the network, so on its own it is unable to give excellent results. It is nevertheless highly beneficial in real-world scenarios where a few links or some network information are missing, because it is not sensitive to small variations in degree. Keeping in mind its benefit in neutralizing the effect of missing links and data on the final output, we decided to include an enhanced version of the weighted-H-index in our hybrid. Weighted-VoteRank has a major advantage over the other methods: it protects the output from the rich-club phenomenon. While determining multiple spreaders, it discounts the voting ability of the selected spreader’s neighbours. Rather than choosing all the spreaders in one crowded area, which causes overlapping influence and neglects far-flung regions, this method tries to choose spreaders that are far from each other so as to maximize influence and reduce the chance of overlap. Entropy is a very useful method that considers the topological structure of the network through the indirect influence of a node in the form of its two-hop neighbours. Thus, entropy considers the global qualities of a node.
Improved weighted-H-index: Starting from the classical weighted-H-index, we introduce an effective weight that combines the strength of the link (the edge weight) with the spreading capacity of the link (the product of the degrees of its end nodes). The effective weight of an edge is defined as
$$w'_{ij} = w_{ij} + k_i \cdot k_j \tag{4}$$
where w_{ij} is the weight of the edge between nodes i and j, and k_i and k_j are the degrees of nodes i and j, respectively. We then follow the same procedure as before to find the effective H-index of a node, first decomposing each weighted edge into multiple weighted edges based on the degree of the neighbouring node. Finally, we apply the definition of the H-index: for a node n, it is the maximum value h such that n has at least h neighbours with weight greater than or equal to h.
Improved weighted-VoteRank: In the weighted-VoteRank method, we propose to modify the voting score (s_v) of node v presented in Eq. (1). The resulting equation is
$$s_v = \sqrt{|\gamma(v)| \sum_{i \in \gamma(v)} va_i \, w_{v,i} \, k_i} \tag{5}$$
Here, k_i is the degree of node i (a neighbour of node v). Thus, we also take into account the number of neighbours of i, the neighbour of node v. Considering nodes up to two hops away may therefore capture the spreading process better.
Weighted-Hybrid method: We propose our weighted-hybrid method as a combination of these three methods:
$$E_i = \frac{\alpha \cdot WH'_i + \beta \cdot WV'_i + \mu \cdot TP_i}{3} \tag{6}$$
where the effective influence E_i of a node i is the weighted sum of its improved weighted-H-index (WH'_i), improved weighted-VoteRank score (WV'_i), and total influencing power calculated through the entropy formula (TP_i), divided by 3, and α, β, μ are constants. Experimentally, we keep the values of α, β, and μ equal, i.e. α = β = μ.
Because the score combines three different methods, we normalized the final score using min-max normalization, Eq. (7), which scales the data to between 0 and 1:
$$y = \frac{x - \min}{\max - \min} \tag{7}$$
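The combination in Eq. (6) followed by the normalization in Eq. (7) can be sketched as below. The three input dictionaries stand for scores that have already been computed by the improved weighted-H-index, improved weighted-VoteRank, and entropy methods; the function names, the default value 1.0 for α, β, μ, and the handling of the case where all scores are equal are assumptions made for illustration.

```python
def min_max_normalize(values):
    """Scale the values of a dict to [0, 1] as in Eq. (7)."""
    lo = min(values.values())
    hi = max(values.values())
    if hi == lo:
        # All scores equal: map everything to 0 to avoid dividing by zero.
        return {node: 0.0 for node in values}
    return {node: (x - lo) / (hi - lo) for node, x in values.items()}


def hybrid_scores(wh, wv, tp, alpha=1.0, beta=1.0, mu=1.0):
    """Combine the three per-node scores as in Eq. (6), then normalize.

    `wh`, `wv`, `tp` map each node to its improved weighted-H-index,
    improved weighted-VoteRank score, and entropy-based total power.
    """
    raw = {
        node: (alpha * wh[node] + beta * wv[node] + mu * tp[node]) / 3
        for node in wh
    }
    return min_max_normalize(raw)


# Made-up component scores for three nodes, used only for illustration.
wh = {"a": 3, "b": 1, "c": 2}
wv = {"a": 3, "b": 0, "c": 3}
tp = {"a": 3, "b": 2, "c": 1}
ranking = sorted(hybrid_scores(wh, wv, tp).items(), key=lambda kv: -kv[1])
```

Sorting the normalized scores in descending order, as in the last line, gives the candidate spreaders in order of estimated influence.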
5 Results and Analysis
We worked on the four weighted-network datasets described in Sect. 3: Powergrid, Facebook-like Social Network, Bitcoin+11, and US Top-500 Airport Network. Because the SIR model is stochastic and its outcome varies between runs, we ran it 100 times and averaged the results to account for the different ways in which spreading can unfold. The value of β was taken to be 0.01, meaning that an infected node infects each of its susceptible neighbours with a probability of 1%. We compared the methods for choosing efficient influencers on the basis of the final infected scale F(t_c) versus time. The performance of our proposed method, W-Hybrid, was compared with that of other methods. The results show that our proposed method performs better under the final infected scale F(t_c) performance metric: at the same time step, weighted-hybrid infects a larger number of nodes than the other methods. We also noticed that decreasing the β value usually improved the results of our W-Hybrid method (Figs. 1, 2, 3 and 4).
Fig. 1 a, b F(t_c) versus time for the Powergrid data set with 10 and 20 initial spreaders
Fig. 2 a, b F(t_c) versus time for the Facebook-like social network data set with 10 and 20 initial spreaders
Fig. 3 a, b F(t_c) versus time for the Bitcoin data set with 10 and 20 initial spreaders and β = 0.01
Ls values: Table 2 lists the values of L_s for the various data sets, calculated using Eq. (3). As Table 2 shows, W-Hybrid yields the largest L_s on the Facebook-like social network, Bitcoin, and Powergrid data sets, so the selected spreaders are widely separated, which should lead to a better influence spread in the network.
Fig. 4 a, b F(t_c) versus time for the US Top-500 airport network data set with 10 and 20 initial spreaders and β = 0.01
6 Conclusions
In this manuscript, we have proposed a weighted-hybrid method to find the most effective influencers or spreaders in a given weighted network so that information can reach a large number of users in the system. The proposed weighted-hybrid method includes both the local and the global attributes of a node when determining its influence. Furthermore, weighted-hybrid is able to deal with issues such as the rich-club phenomenon and hidden or missing links in the network. The experiments indicate that our proposed technique is a better technique for finding multiple influencers in a weighted complex network.
Table 2 Ls values for all datasets with initial spreaders as 10 and β as 0.01
| Dataset | Degree | Closeness | Betweenness | Weighted-H-Index | Improved weighted-H-Index | Improved weighted-VoteRank | Entropy | W-Hybrid |
|---|---|---|---|---|---|---|---|---|
| US Top-500 airport network | 372.9 | 7.986 | 17.337 | 169.07 | 2905.9 | 1527.2 | 203.8 | 703.5 |
| Facebook-like social network | 4.433 | 2.241 | 2.494 | 3.322 | 4.055 | 5.013 | 4.185 | 5.364 |
| Bitcoin | 21.178 | 19.571 | 21.535 | 20.535 | 24.178 | 32.857 | 28.218 | 33.678 |
| Powergrid | 5.515 | 2.939 | 4.33 | 4.015 | 6.438 | 2.424 | 5.181 | 9.030 |
References
1. W. Chen, C. Wang, Y. Wang, Scalable influence maximization for prevalent viral marketing in large-scale social networks, in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2010), pp. 1029–1038
2. T. Opsahl, F. Agneessens, J. Skvoretz, Node centrality in weighted networks: generalizing degree and shortest paths. Soc. Netw. 32(3), 245–251 (2010)
3. Y. Du, C. Gao, X. Chen, Y. Hu, R. Sadiq, Y. Deng, A new closeness centrality measure via effective distance in complex networks. Chaos Interdiscip. J. Nonlinear Sci. 25(3), 033112 (2015)
4. D. Prountzos, K. Pingali, Betweenness centrality. ACM SIGPLAN Not. 48(8), 35 (2013)
5. S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
6. M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. Stanley, H. Makse, Identification of influential spreaders in complex networks. Nat. Phys. 6(11), 888–893 (2010)
7. Z. Liu, C. Jiang, J. Wang, H. Yu, The node importance in actual complex networks based on a multi-attribute ranking method. Knowl.-Based Syst. 84, 56–66 (2015)
8. A. Zareie, A. Sheikhahmadi, A hierarchical approach for influential node ranking in complex social networks. Expert Syst. Appl. 93, 200–211 (2018)
9. J. Bae, S. Kim, Identifying and ranking influential spreaders in complex networks by neighborhood coreness. Phys. A 395, 549–559 (2014)
10. Z. Wang, Y. Zhao, J. Xi, C. Du, Fast ranking influential nodes in complex networks using a k-shell iteration factor. Phys. A 461, 171–181 (2016)
11. Z. Wang, C. Du, J. Fan, Y. Xing, Ranking influential nodes in social networks based on node position and neighborhood. Neurocomputing 260, 466–477 (2017)
12. J. Zhang, D. Chen, Q. Dong, Z. Zhao, Erratum: Corrigendum: Identifying a set of influential spreaders in complex networks. Sci. Rep. 6(1) (2016)
13. L. Lü, T. Zhou, Q. Zhang, H. Stanley, The H-index of a network node and its relation to degree and coreness. Nat. Commun. 7(1) (2016)
14. A. Nikolaev, R. Razib, A. Kucheriya, On efficient use of entropy centrality for social network analysis and community detection. Soc. Netw. 40, 154–162 (2015)
15. A. Garas, F. Schweitzer, S. Havlin, A k-shell decomposition method for weighted networks. New J. Phys. 14(8), 083030 (2012)
16. S. Kumar, B.S. Panda, Identifying influential nodes in social networks: neighborhood coreness based voting approach. Phys. A 124215 (2020)
17. H.L. Sun, D.B. Chen, J.L. He, E. Chng, A voting approach to uncover multiple influential spreaders on weighted networks. Phys. A 519, 303–312 (2019)
18. T. Qiao, W. Shan, G. Yu, C. Liu, A novel entropy-based centrality approach for identifying vital nodes in weighted networks. Entropy 20(4), 261 (2018)
Word-Level Sign Language Gesture Prediction Under Different Conditions
Monika Arora, Priyanshu Mehta, Divyanshu Mittal, and Prachi Bajaj
Abstract Over 6% of the population suffers from hearing problems and relies on sign language to communicate with others and to express emotions through actions. It has been an onerous task for speech- and hearing-impaired people to make themselves understood, and thus it is necessary to build a system that can help anyone understand the gestures and generate their meaning. A system for sign language recognition can be a preliminary step towards better communication. We used the word-level Argentinian Sign Language (LSA) video dataset with 64 actions, which are shot under different lighting and with different subjects. Video data has both spatial and sequential attributes, so we used a deep convolutional neural network together with a recurrent neural network with LSTM units to incorporate both. We created two test cases, an indoor lighting environment with a single subject and a mix of indoor and outdoor conditions with multiple subjects, and achieved accuracies of 93.75% and 90.625%, respectively.
Keywords Sign language gesture prediction · Recurrent neural network · Convolutional neural network · LSTM
M. Arora · P. Mehta (B) · D. Mittal · P. Bajaj, Bhagwan Parshuram Institute of Technology, Delhi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_38
1 Introduction
Communication between speech- and hearing-impaired individuals and others is a challenge worldwide. This creates the need for a communication strategy that does not rely on verbal methods, namely sign language. Sign language uses gestures formed by hand shapes, orientations, and movements, along with facial expressions. These gestures play a significant role in sharing thoughts and help signers communicate with others. Generally, an ordinary person may not be proficient in sign language, so it is hard for them to comprehend such symbols. Also, it is not feasible to have a translator available at all times. To overcome this issue, many researchers have worked over a long period to develop technology-supported systems that effectively bridge the communication gap between the two groups. With this concern in mind, we have tried to develop a viable system that can recognize each sign language gesture proficiently and translate it into a readable form that the general public can understand. The practical application involves capturing gestures from a user, recognizing them against the data, and translating them into an understandable form. Input techniques follow one of two approaches: vision-based identification or hardware in the form of gloves with sensors. We have adopted a hybrid of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). It takes videos as input and derives frames from them for training and testing. The recordings are part of the sign language dataset used to build our proposed model. Just as there are various spoken languages around the world, there are also variants of sign language, such as Japanese Sign Language, Korean Sign Language, Indian Sign Language (ISL), and American Sign Language (ASL) [1].
Our chosen dataset is the Argentinian Sign Language (LSA) database, which includes videos of 10 non-expert subjects who executed each gesture five times for the distinct signs. Signs for 64 commonly used words, comprising verbs and nouns, were chosen for the LSA dataset. The dataset is a collection of 3200 recordings of various signs shot under different types of lighting [2].
2 Literature Review
Using different datasets according to their individual requirements, researchers have taken several approaches to developing models for sign language recognition. Some have worked on recognizing alphabets, while others have considered the identification of commonly used phrases or terms. Masood et al. [1] proposed a method of real-time sign language recognition from video sequences that uses a CNN and an RNN to train on the spatial and temporal features, respectively, of the Argentinian Sign Language (LSA) gestures.
Word-Level Sign Language Gesture Prediction Under Different Conditions
The model also implemented Long Short-Term Memory (LSTM) to bridge time intervals in noisy, incompressible input sequences, along with a pool layer approach. Masood et al. [3] used the finger-spelled letters of American Sign Language as the dataset to train a CNN model inspired by VGG19. With the aim of reducing the learning time considerably, a pretrained model was used to initialize the weights. Only a few epochs were sufficient to converge despite the very deep model. Tripathi et al. [4] proposed a gradient-based key frame extraction method for recognizing symbols from continuous gestures. Principal component analysis was applied to find patterns in the input data. Several distance metrics, such as cosine distance, Mahalanobis distance, and Euclidean distance, were used for gesture recognition. Euclidean distance and correlation gave the highest recognition rates among them. Pardeshi et al. [5] compared and analyzed deep learning architectures such as AlexNet and GoogLeNet for training on the images. To minimize the training period, the project used multicore processors and GPUs with the Parallel Computing Toolbox in MATLAB. In the method proposed by Ko et al. [6], the KETI dataset (Korean sign language) covers words and phrases that need to be expressed in an emergency situation, where physical disability may act as a severe hindrance. The recognition system extracts human keypoints from the hands, face, and other body parts using the OpenPose library. After vector normalization, they used stacked bidirectional GRUs to classify the standardized feature vectors. The system by Mali et al. [7] involves pre-processing in MATLAB, skin thresholding, and dilation and erosion before feature extraction using PCA. An SVM classifier is then applied for classification and analysis.
An overall accuracy of 95.31% was achieved. Singha and Das [8] proposed a stepwise method in which data is first acquired and pre-processed, then features are extracted, and finally classification is performed. Classification was done using Eigenvalue-weighted Euclidean distance. From continuous videos, the system could recognize 24 different alphabets of Indian Sign Language with an accuracy of 96%.
3 Proposed Methodology 3.1 Data Acquisition The dataset used for the framework is the Argentinian Sign Language (LSA) dataset, which was created with the objective of building a dictionary for LSA and training an automatic sign recognizer. It comprises videos in which 10 non-expert subjects executed each gesture five times for the 64 distinct signs. Signs were chosen among the most commonly used ones in the LSA dictionary, including verbs and
M. Arora et al.
Table 1 64 symbols of LSA
ID  Name         H   ID  Name         H   ID  Name         H   ID  Name         H
1   Opaque       R   17  Call         R   33  Hungry       R   49  Yogurt       B
2   Red          R   18  Skimmer      R   34  Map          B   50  Accept       B
3   Green        R   19  Bitter       R   35  Coin         B   51  Thanks       B
4   Yellow       R   20  Sweet milk   R   36  Music        B   52  Shut down    R
5   Bright       R   21  Milk         R   37  Ship         R   53  Appear       B
6   Light-blue   R   22  Water        R   38  None         R   54  To land      B
7   Colors       R   23  Food         R   39  Name         R   55  Catch        B
8   Red          R   24  Argentina    R   40  Patience     R   56  Help         B
9   Women        R   25  Uruguay      R   41  Perfume      R   57  Dance        B
10  Enemy        R   26  Country      R   42  Deaf         R   58  Bathe        B
11  Son          R   27  Last name    R   43  Trap         B   59  Buy          R
12  Man          R   28  Where        R   44  Rice         B   60  Copy         B
13  Away         R   29  Mock         B   45  Barbecue     B   61  Run          B
14  Drawer       R   30  Birthday     R   46  Candy        R   62  Realize      R
15  Born         R   31  Breakfast    B   47  Chewing gum  R   63  Give         B
16  Learn        R   32  Photo        B   48  Spaghetti    B   64  Find         R
nouns. The dataset is a collection of 3200 recordings of various signs shot under different types of lighting [2] (Table 1).
3.2 Pre-processing
3.2.1 Frame Extraction
Since it is difficult to train a model on videos directly, approximately 200 frames have been extracted from each video sequence and then used for training the model, thus increasing the dataset and playing a major role in improving the accuracy (Fig. 1).
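The paper does not say how the roughly 200 frames per video were selected. As a minimal sketch, assuming uniform sampling over the clip, the indices of the frames to keep could be computed as follows. The function name `sample_frame_indices` and the uniform-sampling rule are illustrative assumptions, not the authors' procedure.

```python
def sample_frame_indices(total_frames, target=200):
    """Return up to `target` evenly spaced frame indices in [0, total_frames).

    Assumption: frames are sampled uniformly; the paper only states that
    approximately 200 frames were extracted from each video.
    """
    if total_frames <= 0 or target <= 0:
        return []
    if total_frames <= target:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / target
    return [int(i * step) for i in range(target)]


if __name__ == "__main__":
    indices = sample_frame_indices(1000)
    print(len(indices), indices[:5])
```

The returned indices can then be used with any video reader to pull out the corresponding frames.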
3.2.2 Feature Extraction
The hands of the subjects are detected using the OpenCV library for Python. The background and other body parts act as noise and are not required for our recognition system, so they are removed. The background is made black and the image is converted to grayscale so that the color of the gloves is not involved in the learning of the model.
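The masking and grayscale step can be sketched in plain Python, assuming a hand mask is already available. Hand detection itself is not shown, and the function name `mask_and_gray` and the list-of-tuples image layout are hypothetical. The luma weights are the standard ITU-R BT.601 ones that OpenCV also uses for RGB-to-gray conversion.

```python
def mask_and_gray(image, hand_mask):
    """Black out non-hand pixels and convert the rest to grayscale.

    image:     2D list of (r, g, b) tuples with values in 0..255
    hand_mask: 2D list of booleans, True where a hand pixel was detected
               (assumed to come from a separate detection step)
    Returns a 2D list of integer gray levels.
    """
    gray = []
    for row_pixels, row_mask in zip(image, hand_mask):
        out_row = []
        for (r, g, b), is_hand in zip(row_pixels, row_mask):
            if is_hand:
                # BT.601 luma weights
                out_row.append(round(0.299 * r + 0.587 * g + 0.114 * b))
            else:
                out_row.append(0)  # background becomes black
        gray.append(out_row)
    return gray


if __name__ == "__main__":
    img = [[(255, 255, 255), (10, 20, 30)], [(255, 0, 0), (0, 0, 0)]]
    mask = [[True, False], [True, True]]
    print(mask_and_gray(img, mask))
```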
Fig. 1 Data flow model
3.3 Classification Videos as a dataset comprise both spatial and temporal features. To extract the spatial features, we have employed a Convolutional Neural Network, and to extract the temporal features by relating frames one after the other, we have employed a Recurrent Neural Network (Fig. 2).
3.3.1 Positional Feature Extraction Using CNN
A CNN can be used effectively for classifying images because of its ability to recognize relations and find patterns irrespective of translational or rotational changes in the images [9]. First, we grouped the frames into subfolders by symbol and passed each frame through TensorFlow's retrain script in order to generate the bottleneck corresponding to each frame. This is a form of transfer learning: a pre-trained neural network is reused, and the bottlenecks are its outputs just before the final layer. Specifically, we have extracted the spatial features from
Fig. 2 Frame extraction
video frames with a CNN by implementing the image recognition model Inception V3 of the TensorFlow library [10], which uses about 25 million parameters and about 5 billion operations to classify each image. Since only the final layer has been trained, training could be completed within feasible time and resources. The per-frame predictions are stored for training the sequence model.
3.3.2 Sequential Feature Extraction Using RNN
After the CNN is trained, its softmax-based predictions are passed to the RNN for the final prediction of the actual word related to each video sequence. Since the data are sequential, an RNN predicts the output by using the current input and the previous output recurrently [11]. Since plain RNNs struggle to learn long-term dependencies, we have utilized the Long Short-Term
Fig. 3 Process flow
Memory (LSTM) model [12]. The sequence of per-frame predictions from the CNN for each training video is then given to the RNN model for training on the temporal features. After this, the model is used for making predictions on the test data (Fig. 3).
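To make the recurrence concrete, the following is a minimal sketch of one step of a standard LSTM cell with input, forget, and output gates, reduced to a single hidden unit. It is not the authors' trained network: the function names, the single-unit simplification, and the parameter layout are assumptions chosen for readability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (illustrative only).

    x:      list of floats, e.g. the per-frame CNN prediction vector
    params: dict mapping gate name 'i', 'f', 'o', 'g' to a tuple
            (input_weights, recurrent_weight, bias); hypothetical layout
    Returns the new hidden state h and cell state c.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return sum(w * xi for w, xi in zip(w_x, x)) + w_h * h_prev + b

    i = sigmoid(pre_activation("i"))    # input gate
    f = sigmoid(pre_activation("f"))    # forget gate
    o = sigmoid(pre_activation("o"))    # output gate
    g = math.tanh(pre_activation("g"))  # candidate cell value
    c = f * c_prev + i * g              # cell state carries long-term memory
    h = o * math.tanh(c)
    return h, c


def run_sequence(frames, params):
    """Feed a sequence of per-frame vectors through the cell; return final h."""
    h, c = 0.0, 0.0
    for x in frames:
        h, c = lstm_step(x, h, c, params)
    return h
```

In practice a full layer with many units and a final classification layer over the 64 signs would be used; the sketch only shows how the cell state lets information persist across frames.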
4 Result Each video was split into 200 frames in order to retrieve the spatial characteristics. Predictions were made on each frame, and the RNN then produced the final predicted value from the frames arranged sequentially. We prepared one training set and two different testing conditions. In Case 1, we tested signs from the same set of subjects and environmental conditions as in training. In Case 2, the test set mixed different subjects under both artificial and natural lighting, and the subjects in the test data were not part of the training data. In Case 1, 120 out of 128 gestures were interpreted successfully, giving an accuracy of 93.75%. As expected, the accuracy dropped to 90.625% in Case 2, where 290 out of 320 videos were recognized correctly. In Fig. 4a, the gesture Copy has an accuracy below 45% because the left hand of the subject slightly overlaps his right hand during the course of action
Fig. 4 a Copy sign and b Breakfast sign
completion in the video for a fraction of a second. While this is less prominent in natural lighting, the overlapping frame is more prominent when the subject or illumination is changed, thus making the prediction slightly difficult. In contrast, the gesture in Fig. 4b has 100% accuracy. This gesture also uses both hands, but the course of the action is identical for both: the hands move upward from one position and return to the same position. Also, no overlapping occurs, so there is no ambiguity for the algorithm.
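The accuracies reported for Case 1 and Case 2 follow directly from the counts of correctly recognized videos. The helper name below is illustrative.

```python
def accuracy_percent(correct, total):
    """Percentage of correctly recognized videos."""
    return 100.0 * correct / total


if __name__ == "__main__":
    print(accuracy_percent(120, 128))  # Case 1 -> 93.75
    print(accuracy_percent(290, 320))  # Case 2 -> 90.625
```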
5 Conclusion Gestures and expressions are substantial in day-to-day communication, and their recognition by computers is an equally exciting and strenuous task. Our work presents a method that is able to interpret hand gestures from the Argentinian Sign Language (LSA) and examines the effect of different lighting conditions on the predicted result. We set up two test cases based on different subjects and illumination conditions. We attained accuracies of 93.75% for Case 1, where both train and test data had the same subjects under similar lighting, and 90.625% for Case 2, where the test data had a mixture of distinct subjects under different lighting. From our study we draw two main conclusions: first, a CNN combined with an RNN can be highly effective in treating video sequences; second, there are certain losses, such as losses in edge detection and frame mapping, when the subject or the environment changes without the model having been trained under these conditions.
References
1. S. Masood, A. Srivastava, H.C. Thuwal, M. Ahmad, Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. Intell. Eng. Inf. 623–632 (2018)
2. F. Ronchetti, F. Quiroga, C. Estrebou, L. Lanzarini, A. Rosete, LSA64: A Dataset of Argentinian Sign Language, in XXII Congreso Argentino de Ciencias de la Computación (CACIC) (2016)
3. S. Masood, H.C. Thuwal, A. Srivastava, American sign language character recognition using convolution neural network. Smart Comput. Inf. 403–412 (2018)
4. K. Tripathi, N. Baranwal, G.C. Nandi, Continuous Indian sign language gesture recognition and sentence formation. Procedia Comput. Sci. 54, 523–531 (2015)
5. K. Pardeshi, R. Sreemathy, A. Velapure, Recognition of Indian sign language alphabets for hearing and speech impaired people using deep learning, in Proceedings of International Conference on Communication and Information Processing (ICCIP) (2019)
6. S.-K. Ko, J.G. Son, H. Jung, Sign language recognition with recurrent neural network using human keypoint detection, in The 2018 Conference (2018)
7. D.G. Mali, N.S. Limkar, S.H. Mali, Indian sign language recognition using SVM classifier, in Proceedings of International Conference on Communication and Information Processing (ICCIP) (2019)
8. J. Singha, K. Das, Automatic Indian sign language recognition for continuous video sequence. ADBU J. Eng. Technol. 2, 0021105 (5 pp) (2015)
9. B. Garcia, S. Viesca, Real-time American sign language recognition with convolutional neural networks. Convolutional Neural Networks for Visual Recognition at Stanford University (2016)
10. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado et al., Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
11. H. Cooper, B. Holt, R. Bowden, Sign language recognition, in Visual Analysis of Humans (Springer, London, 2011), pp. 539–562
12. S. Hochreiter, J. Schmidhuber, Long short term memory. Neural Comput. 9(8), 1735–1780 (1997)
Firefly Algorithm-Based Optimized Controller for Frequency Control of an Autonomous Multi-Microgrid
Kshetrimayum Millaner Singh, Sadhan Gope, and Nicky Pradhan
Abstract This paper considers a mathematical model of a two-area microgrid based on renewable energy resources for the study of automatic generation control. Each microgrid consists of Solar Photovoltaic (SPV), Hydro, a Battery Energy Storage System (BESS), and Load; in addition, one area has a Bio Gas Turbine Generator (BGTG) and the other has a Biodiesel Engine Generator (BDEG). A Proportional-Integral (PI) controller is used as the frequency controller for this system. The BDEG, BGTG, and BESS are considered as instant Load Frequency Control (LFC) sources during a disturbance in the system frequency. Cuckoo Search (CS) and Firefly (FA) algorithms are used for tuning the gain values of the controllers. Finally, to validate the proposed approach, the system performance obtained by the firefly algorithm for the PI controller with random step load perturbation is compared with that of the CS algorithm.
Keywords Proportional-Integral · Solar photovoltaic · Biodiesel engine generator · Battery energy storage system · Bio gas turbine generator
K. M. Singh (corresponding author) · S. Gope · N. Pradhan, Electrical Engineering Department, Mizoram University, Aizawl, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_39
1 Introduction The electrical energy consumption of the world has continued to rise rapidly, at a rate faster than other forms of energy consumption. In recent times, the demand for energy to fuel the country's economic growth has only increased. The core idea is that the overall growth and development of any country is largely determined by the quantum of energy that country consumes. Furthermore, most of the energy for global consumption comes from conventional sources. For instance,
coal-fired power plants generate 72% of India's electricity. However, conventional sources of energy are rapidly running out. Their ever-increasing exploitation is also causing catastrophic and irreparable damage to the environment and accelerating global warming. In this scenario, the integration of Renewable Energy Sources (RESs) into the power grid is being promoted across the globe in order to ensure sustainable development and combat climate change. With the introduction of deregulation policy in the power system, distributed generators have the opportunity to promote microgrid systems in the power industry. However, integrating RESs with conventional energy sources is difficult and complex because of the uncertain nature of RESs, load variation, and the imbalance between load demand and generation. This problem can be overcome by load frequency control, as it is the primary control that regulates the system frequency and the active power of the system. In addition, energy storage is the most secure and efficient option to mitigate the difference between energy demand and supply. In the literature, the energy storage system and its effect on frequency deviation and ACE control of a multi-microgrid are analyzed in [1]. The Grasshopper Optimization Algorithm (GOA) is used for multi-area microgrid frequency control with a fuzzy PID controller [2]. Frequency deviation considering communication delay in a multi-microgrid is studied in [3]. The frequency response of a microgrid with a PID controller is studied considering renewable source uncertainties, and different optimization algorithms are compared, namely Cow Search Algorithm, Whale Optimization Algorithm (WOA), and Mosquito Flying Optimization (MFO) [4].
Different controllers for load frequency control, such as PI, PID, PD, ID, and PIFD, are compared with the help of Particle Swarm Optimization (PSO), the Grasshopper Optimization Algorithm (GOA), and the Genetic Algorithm (GA) [5]. In this paper, the PI controller is used to investigate the load frequency control of a multi-microgrid (two areas) connected by a tie-line. The parameters of the PI controller are tuned with the help of the CS and FA algorithms under load changes, i.e., under Step Load Perturbation (SLP). The results of the two algorithms are compared.
2 Overview of Multi-Microgrid 2.1 Multi-Microgrid Multi-Microgrids (MMGs) are considered a more advanced system level than a single microgrid and operate at the medium voltage (MV) level. An MMG comprises numerous low-voltage Microgrids (MGs) and Distributed Generator (DG) units connected to adjacent MV feeders. It can contain many controllable DG units and MGs. One benefit of multi-microgrids is that Demand Side Management (DSM) can be implemented, since the system requires a hierarchical control scheme. It also needs efficient control and management of the system [6].
2.2 Solar Photovoltaic Model A PV system consists of several cells connected in series and parallel to deliver the desired current and voltage. The V-I characteristic of a PV system is non-linear, and the output power of the PV array depends on load current and solar radiation. The PV system transfer function can be expressed as
$$G_{pv}(S) = \frac{K_{pv}}{1 + S T_{pv}} \qquad (1)$$
where $K_{pv}$ is the gain constant and $T_{pv}$ is the time constant [6].
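A first-order lag of this form has a closed-form unit-step response, $y(t) = K(1 - e^{-t/T})$. The sketch below evaluates it using the PV values from the paper's Appendix (gain 1, time constant 1.5) as defaults; the function name is illustrative. The same helper applies to the other first-order blocks in this section (governor and turbine).

```python
import math


def first_order_step(t, gain=1.0, time_const=1.5):
    """Unit-step response of K / (1 + S*T) at time t >= 0.

    Defaults follow the Appendix values quoted for the PV block
    (K_PV = 1, T_PV = 1.5).
    """
    return gain * (1.0 - math.exp(-t / time_const))


if __name__ == "__main__":
    for t in (0.0, 1.5, 3.0, 7.5):
        print(t, round(first_order_step(t), 4))
```

After one time constant the output reaches about 63% of its final value, and after five time constants it is within 1% of the gain.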
2.3 Biogas Turbine Generator (BGTG) Biogas is obtained from the decomposition of wastes and animal excreta. It can be used economically in a micro Gas Turbine Generator (BGTG) for power generation [7]. The output power of BGTG is directly proportional to stoke of the components in BGTG. The transfer function of BGTG is given as
$$G_{BGTG}(S) = \frac{1 + S X_C}{(1 + S Y_C)(1 + S b_B)} \cdot \frac{1 + S T_{CR}}{1 + S T_{BG}} \cdot \frac{1}{1 + S T_{BT}} \qquad (2)$$
2.4 Biodiesel Engine Generator (BDEG) Biofuel is extracted from crops through a process called transesterification. The fuel has the same chemical properties as diesel and can be used in conventional diesel generators [7]. The BDEG transfer function can be expressed as
$$G_{BDEG}(S) = \frac{K_{VA}}{1 + S T_{VA}} \cdot \frac{K_{BE}}{1 + S T_{BE}} \qquad (3)$$
2.5 Hydro Plant The thermal plant model is similar to the hydro plant model. The three main units of the hydro plant are the governor, the turbine, and the generator load. The speed governor transfer function can be represented as [8]
$$G_{HG}(S) = \frac{K_{HG}}{1 + S T_{HG}} \qquad (4)$$
where $K_{HG}$ is the gain constant and $T_{HG}$ is the time constant. The turbine unit can be represented by the transfer function given below [8]
$$G_{HT}(S) = \frac{K_{HT}}{1 + S T_{HT}} \qquad (5)$$
where $K_{HT}$ is the gain constant and $T_{HT}$ is the time constant of the turbine.
2.6 System Frequency Variation and Power Deviation To keep the power system operating stably, the total power generation should be controlled efficiently and dispatched suitably so as to meet the total load demand [1]. The total power generation ($P_T$) in the microgrid system is equal to the sum of all sources, i.e., solar photovoltaic power ($P_{PV}$), biodiesel engine generator power ($P_{BDEG}$), biogas turbine generator power ($P_{BGTG}$), hydro plant power ($P_H$), and the power of the energy storage system ($P_{ESS}$):
$$P_T = P_{PV} + P_{BDEG} + P_{BGTG} + P_H \pm P_{ESS} \qquad (6)$$
The deviation of power in the system is given by the total power generation ($P_T$) minus the power demand ($P_D$) as follows:
$$\Delta P_e = P_T - P_D \qquad (7)$$
The system frequency deviation is caused by changes in the total power deviation, and the frequency variation $\Delta\omega$ is given by
$$\Delta\omega = \frac{\Delta P_e}{K_{sys}} \qquad (8)$$
where $K_{sys}$ is the frequency characteristic constant. There is an inherent time delay between power deviation and frequency deviation; therefore, the ratio of frequency deviation to power deviation in per unit can be expressed by the following transfer function:
$$G_{sys}(S) = \frac{\Delta\omega}{\Delta P_e} = \frac{1}{K_{sys}(1 + S T_{sys})} = \frac{1}{D + M S} \qquad (9)$$
where $M$ denotes the equivalent inertia constant, $D$ denotes the system damping constant, and $T_{sys}$ is the system time constant.
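The relations above can be sketched numerically: the power deviation is generation minus demand, and the transfer function $1/(D + MS)$ corresponds to the differential equation $M\,\dot{\omega}_\Delta + D\,\omega_\Delta = P_{e,\Delta}$, integrated here with forward Euler. This is a minimal open-loop sketch with no controller, using the Appendix values $D_1 = 0.02$ and $M_1 = 0.8$; the function names, the step size, and the sample power values are assumptions.

```python
def power_deviation(p_pv, p_bdeg, p_bgtg, p_h, p_ess, p_demand):
    """Total generation minus demand.

    p_ess is signed: positive when storage discharges, negative when it charges.
    """
    p_total = p_pv + p_bdeg + p_bgtg + p_h + p_ess
    return p_total - p_demand


def simulate_frequency_deviation(delta_p, damping=0.02, inertia=0.8,
                                 t_end=400.0, dt=0.01):
    """Forward-Euler response of 1/(D + M*S) to a constant power deviation.

    Returns the frequency deviation at t_end (per unit). Open loop only;
    the paper's model additionally includes PI controllers and source dynamics.
    """
    dw = 0.0
    for _ in range(int(round(t_end / dt))):
        dw += dt * (delta_p - damping * dw) / inertia
    return dw


if __name__ == "__main__":
    dp = power_deviation(0.2, 0.3, 0.0, 0.4, 0.1, 0.99)  # hypothetical values
    print(dp, simulate_frequency_deviation(dp))
```

Without control, the deviation settles at the power deviation divided by D, with time constant M/D, which is why a frequency controller is needed to bring it back to zero.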
2.7 Interconnection of Proposed Multi-Microgrid with Tie-Line Standalone microgrids are interconnected through a tie-line so that the load demand can be supplied reliably, and each microgrid is considered to have its own control area. Tie-line bias control determines how a specific area responds to the system frequency deviation. The tie-line is used to exchange energy between microgrids when power generation and load demand are not equal and a frequency deviation occurs in that area [1]. The tie-line power deviation ($\Delta P_{tie}$) is given as
$$\Delta P_{tie} = P_s \left( \int \Delta\omega_1 \, dt - \int \Delta\omega_2 \, dt \right) \qquad (10)$$
where $\Delta\omega_1$ and $\Delta\omega_2$ are the frequency deviations of area-1 and area-2, respectively, and $P_s$ is the synchronizing power coefficient. The Laplace transform of the tie-line power deviation is given by (Fig. 1)
$$\frac{\Delta P_{tie}(S)}{\Delta\omega(S)} = \frac{P_s}{S} \qquad (11)$$
where $\Delta\omega = \Delta\omega_1 - \Delta\omega_2$.
Fig. 1 Proposed block diagram of the two-area multi-microgrid model
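The tie-line relation is an integral of the difference between the two area frequency deviations, scaled by the synchronizing coefficient. A minimal sketch using cumulative trapezoidal integration over sampled deviations is shown below. The function name, the sampling interval, and the sample data are assumptions; the default coefficient 1.5 is the Ps value listed in the Appendix.

```python
def tie_line_power(dw1, dw2, dt, p_sync=1.5):
    """Tie-line power deviation from sampled frequency deviations.

    dw1, dw2: equal-length lists of frequency deviations of area-1 and area-2
    dt:       sampling interval in seconds
    p_sync:   synchronizing power coefficient (Appendix value 1.5)
    Returns a list with the tie-line power deviation at each sample.
    """
    out = [0.0]
    acc = 0.0
    for k in range(1, len(dw1)):
        prev_diff = dw1[k - 1] - dw2[k - 1]
        curr_diff = dw1[k] - dw2[k]
        acc += 0.5 * (prev_diff + curr_diff) * dt  # trapezoidal rule
        out.append(p_sync * acc)
    return out


if __name__ == "__main__":
    # Hypothetical data: area-1 runs 0.01 pu fast for 4 s, area-2 is nominal.
    print(tie_line_power([0.01] * 5, [0.0] * 5, dt=1.0))
```

When both areas deviate identically the integrand is zero, so no power is exchanged over the tie-line.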
3 Algorithm A detailed overview of the adopted optimization techniques and the flowchart of the FA algorithm are given in Refs. [6, 9, 10], and the CS algorithm is described in Refs. [11, 12].
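Since the paper defers the algorithm details to the references, the following is only a minimal sketch of the commonly described firefly update rule: a less bright firefly moves toward a brighter one with attractiveness $\beta_0 e^{-\gamma r^2}$ plus a small random step. The population size, iteration count, parameter values, the decay of the random step, and the toy objective are all assumptions for illustration; in the paper the objective would be the ISE of the simulated system.

```python
import math
import random


def firefly_minimize(objective, dim, lower, upper, n_fireflies=15,
                     n_iter=60, alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Minimize `objective` over the box [lower, upper]^dim (illustrative)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lower, upper) for _ in range(dim)]
           for _ in range(n_fireflies)]
    cost = [objective(x) for x in pop]
    best_idx = min(range(n_fireflies), key=cost.__getitem__)
    best_x, best_f = list(pop[best_idx]), cost[best_idx]

    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # firefly j is brighter (lower cost)
                    r2 = sum((a - b) ** 2 for a, b in zip(pop[i], pop[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    # Move i toward j with a random perturbation, then clamp.
                    pop[i] = [
                        min(upper, max(lower,
                            a + beta * (b - a) + alpha * (rng.random() - 0.5)))
                        for a, b in zip(pop[i], pop[j])
                    ]
                    cost[i] = objective(pop[i])
                    if cost[i] < best_f:
                        best_x, best_f = list(pop[i]), cost[i]
        alpha *= 0.97  # gradually shrink the random step (assumed schedule)
    return best_x, best_f


if __name__ == "__main__":
    # Toy stand-in objective with its minimum at (1, 1).
    toy = lambda x: sum((v - 1.0) ** 2 for v in x)
    print(firefly_minimize(toy, dim=2, lower=0.0, upper=2.0))
```

For controller tuning, each firefly position would hold the candidate gain values, bounded by the search range, and the objective would run the system simulation and return its error index.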
4 Results and Discussion 4.1 PI Controller for Firefly Optimization The simulation results for the PI controller tuned by the firefly algorithm in the proposed two-area multi-microgrid model for the study of load frequency control are given as follows. Figure 2 shows the load variation in area-1 and the power generation response of Hydro, ESS, PV, and BDEG. Initially the load demand is at its nominal value; it then falls below nominal, i.e., 0.8 pu at t = 55 s and 0.73 pu at t = 60 s. It returns to nominal from t = 65 s to t = 85 s, after which the load demand increases, i.e., 1.2 pu at t = 90 s, 1.5 pu at t = 95 s, and 1.15 pu at t = 100 s, and then returns to the nominal value from t = 105 s to t = 120 s. Initially the sources are taken to give their normal step output. Up to around 5 s, most of the sources including ESS fluctuate and are then brought back to normal by LFC. At t = 50 s, BDEG generation begins to reduce gradually until t = 60 s, while the supply from
Fig. 2 Response of area-1 with PI controller by using FA
Fig. 3 Response of area-2 with PI controller by using FA
ESS starts to decrease until 55 s, then increases up to 60 s, after which it is reduced and returns to normal. Hydro generation, in contrast, has little effect on the system, as it acts as baseload generation. When the load demand increases at t = 85 s, BDEG generation increases and ESS also supplies power up to t = 100 s, with hydro contributing a small amount of power, and then everything returns to normal. Figure 3 shows the load variation in area-2 and the power generation response of Hydro, ESS, PV, and BGTG. Initially the load demand is at its nominal value; it then rises above nominal, i.e., 1.2 pu at t = 55 s and 1.15 pu at t = 60 s. It returns to nominal from t = 65 s to t = 80 s, after which the load demand is reduced, i.e., 0.8 pu at t = 85 s and 0.75 pu at t = 90 s, followed by 1.15 pu at t = 95 s, and then returns to the nominal value from t = 100 s to t = 120 s. Similarly, the sources are initially taken to give their normal step output. Up to around 5 s, most of the sources including ESS fluctuate and are then brought back to normal by LFC. When the load demand increases at t = 50 s, the ESS correspondingly starts to supply power to the system up to t = 60 s and then reduces its supply, while BGTG increases generation up to 60 s and then reduces it until 65 s. During 60–65 s the ESS supplies power again and then returns to nominal. At t = 80 s the load demand is reduced; both the ESS and BGTG reduce their supply up to 90 s, then increase it up to 100 s, after which they return to normal. Here also hydro acts as the baseload generation. Figures 4 and 5 show the frequency response of MG-1 and MG-2, respectively. Figure 6 shows the power deviation response between area-1 (MG-1) and area-2 (MG-2). These results show that the frequency deviations of both area-1 and area-2 return to normal after a disturbance caused by changing load demand
Fig. 4 Frequency response of area-1 with PI controller by using FA
Fig. 5 Frequency response of area-2 with PI controller by using FA
and that the tie-line power deviation appears during the disturbances and then returns to normal, which makes it clear that LFC controls both the frequency deviation and the exchange of power that takes place in the tie-line.
Fig. 6 Power deviation response of tie-line for area-1 and area-2
4.2 PI Controller for Cuckoo Search Optimization The simulation results for the PI controller tuned by Cuckoo Search optimization in the proposed two-area multi-microgrid model for the study of load frequency control are given as follows. Figure 7 shows the response of the load, ESS, hydro, BDEG, and PV in area-1. The load demand starts to decrease from t = 50 s, reaching 0.8 pu at t = 55 s and 0.73 pu at t = 65 s. The load demand increases from 85 s, i.e., 1.2 pu at t = 90 s, 1.5 pu at t = 95 s, and 1.15 pu at t = 100 s. The result shows that hydro acts as the baseload generator throughout. ESS and BDEG reduce their supply up to 55 s, then slowly increase it up to 65 s, reduce it up to 85 s, increase it again up to 95 s, and reduce it until 100 s before finally returning to normal. Hydro also adjusts its supply according to the load changes, but its effect is smaller than that of the other sources. Figure 8 shows the load variation in area-2 and the power generation response of Hydro, ESS, PV, and BGTG. Initially the load demand is at its nominal value; it then rises above nominal, i.e., 1.2 pu at t = 55 s and 1.15 pu at t = 60 s. It returns to nominal from t = 65 s to t = 80 s, after which the load demand is reduced, i.e., 0.8 pu at t = 85 s and 0.75 pu at t = 90 s, followed by 1.15 pu at t = 95 s, and then returns to the nominal value from t = 100 s to t = 120 s. Here ESS has more impact than the other sources: it starts to supply power from 55 s up to 60 s in response to the load, while BGTG and hydro also increase their supply during this period and then reduce it. At t = 85 s all sources reduce their supply up to 95 s, then increase it up to 100 s, and finally reduce it and return to normal.
Fig. 7 Response of area-1 with PI controller by using CS
Fig. 8 Response of area-2 with PI controller by using CS
Figure 9 shows the frequency response of MG-1 and MG-2. Figure 10 shows the power deviation response between area-1 (MG-1) and area-2 (MG-2). These results show that the frequency deviations of both area-1 and area-2 return to normal after a disturbance caused by changing load demand. They also show that the tie-line power deviation that appears during the disturbances returns to normal, which clearly
Fig. 9 Frequency response of area-1 and area-2 with PI controller by using CS
Fig. 10 Power deviation response of tie-line for area-1 and area-2
verifies that LFC controls both the frequency deviation and the exchange of power that takes place in the tie-line.
Table 1 Gain values of PI controller by Firefly Algorithm optimization
Controller     Kp      Ki
Controller-1   1.9833  1.9996
Controller-2   1.1865  1.1060
Controller-3   1.7418  1.9998
Controller-4   1.5865  1.9449
Controller-5   0.6189  1.9845
Controller-6   1.9925  1.9636
Pf1 = 1.9996
Pf2 = 1.9750
4.3 Comparison of Firefly Algorithm and Cuckoo Search Optimization The proposed model is simulated in MATLAB 2018a, with the Integral of Squared Error (ISE) used as the objective function optimized by the Firefly Algorithm and the Cuckoo Search Algorithm. The system is analyzed under Step Load Perturbation (SLP), and the response of the system with the PI controller tuned by the Firefly Algorithm is compared with that tuned by the Cuckoo Search Algorithm. The gain values of the PI controller for the Firefly Algorithm and Cuckoo Search are given in Tables 1 and 2, respectively. Figures 11, 12, and 13 depict the comparison of the algorithms. Figure 11 compares the frequency deviation of microgrid-1 with the PI controller tuned by the Firefly Algorithm and by the Cuckoo Search Algorithm. Similarly, Fig. 12 compares the frequency deviation of microgrid-2 and shows that FA is superior to CS. Figure 13 compares the tie-line response obtained with the FA and CS algorithms; here too FA gives better results than CS.
Table 2 Gain values of PI controller by Cuckoo Search Algorithm optimization
Controller     Kp                 Ki
Controller-1   1.581289934981527  1.275385914668063
Controller-2   1.981013131586057  0.785919954776326
Controller-3   0.899786331530040  2
Controller-4   0                  0.843885456792004
Controller-5   1.403856781582448  0.283325335077086
Controller-6   1.168895975320635  1.150072145237343
Pf1 = 1.150072145237343
Pf2 = 0.620301772980351
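To illustrate how an ISE objective ranks PI gains, as used in Sect. 4.3, the sketch below simulates a heavily simplified single-area loop: the plant $1/(D + MS)$ under a step load, with a PI controller acting directly on the frequency deviation. This is not the paper's two-area model with six controllers and source dynamics; the loop structure, the load step, the step size, and the example gains are all assumptions, and nothing here reproduces the paper's results.

```python
def ise_for_pi_gains(kp, ki, damping=0.02, inertia=0.8,
                     load_step=0.1, t_end=60.0, dt=0.001):
    """Integral of squared frequency deviation for a toy single-area loop.

    Plant:      inertia * d(dw)/dt + damping * dw = u - load_step
    Controller: u = -(kp * dw + ki * integral of dw)
    Integration uses forward Euler; all values are in per unit.
    """
    dw = 0.0      # frequency deviation
    integ = 0.0   # integrator state of the PI controller
    ise = 0.0
    for _ in range(int(round(t_end / dt))):
        u = -(kp * dw + ki * integ)
        ddw = (u - load_step - damping * dw) / inertia
        ise += dw * dw * dt
        integ += dw * dt
        dw += ddw * dt
    return ise


if __name__ == "__main__":
    # Hypothetical gain pairs, not taken from the paper's tables.
    for kp, ki in ((1.0, 1.0), (2.0, 2.0)):
        print(kp, ki, ise_for_pi_gains(kp, ki))
```

For this toy loop the ISE has the closed form $d^2 / (2 (D + K_p) K_i)$ for a load step $d$, so larger gains give a smaller ISE. An optimizer such as FA or CS would call a function like this as its objective, restricted to the allowed gain range.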
Fig. 11 Comparison of frequency response of area-1 with PI controller by using FA and CS
Fig. 12 Comparison of frequency response of area-2 with PI controller by using FA and CS
5 Conclusion In this paper, a multi-microgrid connected by a tie-line has been investigated by regulating the gains of the PI controllers embedded in the individual microgrid systems. The FA and CS algorithms have been used to generate optimal PI controller gains for the stability of the system under random step load perturbation. The simulation results show that the PI controller tuned with FA has a better control effect than the PI controller tuned with CS under step load perturbation.
Fig. 13 Comparison of power tie-line using FA and CS
Appendix
Symbol and abbreviation: Values
K_PV, T_PV (solar photovoltaic gain constant and time constant): 1, 1.5
K_ESS, T_ESS (energy storage system gain constant and time constant): −10, 0.1
K_VA, T_VA, K_BE, T_BE (biodiesel engine generator gain constants and time constants): 1, 0.05, 1, 0.5
X_C, Y_C, b_B, T_CR, T_BG, T_BT (biogas turbine generator gain constants and time constants): 0.6, 1, 0.05, 0.01, 0.23, 0.2
K_HG, T_HG, K_HT, T_HT (hydro turbine and governor gain constants and time constants): 1, 41.6, 1, 0.5
D1, D2, M1, M2, Ps (power system gain constants and tie-line gain constant): 0.02, 0.03, 0.8, 0.7, 1.5
B1, 1/R1, B12, B22, 1/R2, B2, 1/R22, 1/R12 (system droop gain constants and bias gain constants): 0.1866, 0.1666, 0.4366, 0.1966, 0.4466, 12.5, 25, 0.4168
References
1. A.H. Chowdhury, M. Asaduz-Zaman, Load frequency control of multi-microgrid using energy storage system, in IEEE International Conference on Electrical and Computer Engineering (2014), pp. 548–551
2. D.K. Lal, A.K. Barisal, M. Tripathy, Load frequency control of multi area interconnected microgrid power system using grasshopper optimization algorithm optimized fuzzy PID controller, in IEEE International Conference on Recent Advances on Engineering, Technology and Computational Sciences (2018), pp. 1–6
3. X. Wang et al., Load frequency control in multiple microgrids based on model predictive control with communication delay. J. Eng. 13, 1851–1856 (2017)
4. P. Srimannarayana, A. Bhattacharya, S. Sharma, Load frequency control of microgrid considering renewable source uncertainties, in IEEE International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC) (2018), pp. 419–423
5. A.K. Barik, D.C. Das, Expeditious frequency control of solar photovoltaic/biogas/biodiesel generator based isolated renewable microgrid using grasshopper optimization algorithm. IET Renew. Power Gener. 12(14), 1659–1667 (2018)
6. N.J. Gil, J.A.P. Lopes, Hierarchical frequency control scheme for islanded multi-microgrids operation, in IEEE International Conference on Lausanne Power Tech Lausanne (2007), pp. 473–478
7. D. Muthu, C. Venkatasubramanian, K. Ramakrishnan, J. Sasidhar, Production of biogas from wastes blended with cow dung for electricity generation-a case study, in IOP International Conference Series: Earth and Environmental Science, vol. 80, no. 1 (2017), pp. 1–8
8. C. Srinivasarathnam, C. Yammani, S. Maheswarapu, Multi-objective jaya algorithm for optimal scheduling of DGs in distribution system sectionalized into multi-microgrids. Smart Sci. 7(1), 59–78 (2019)
9. A.A. El-Fergany, M.A. El-Hameed, Efficient frequency controllers for autonomous two-area hybrid microgrid system using social-spider optimiser. IET Gener. Transm. Distrib. 11(3), 637–648 (2017)
10. X.-S. Yang, X. He, Firefly algorithm: recent advances and applications. Int. J. Swarm Intell. 1(1), 36–50 (2013)
11. R. Rajabioun, Cuckoo optimization algorithm. Appl. Soft Comput. 11(8), 5508–5518 (2011)
12. P.K. Ray, S.R. Mohanty, N. Kishor, Small-signal analysis of autonomous hybrid distributed generation systems in presence of ultra-capacitor and tie-line operation. J. Electric. Eng. 61(4), 205–214 (2010)
Abnormal Activity-Based Video Synopsis by Seam Carving for ATM Surveillance Applications B. Yogameena and R. Janani
Abstract Criminal activities have been increasing in ATM centers, but law enforcement authorities become aware of them only after the incident has occurred. Viewing the whole video sequence is tedious and slows down the investigation process. The abnormal activity addressed in the proposed work is stabbing. The Lucas-Kanade method of optical flow is used to analyze the stabbing action through the velocity and direction with which the knife moves. The analysis is further enhanced by facial expression recognition. The proposed method involves abnormal activity analysis followed by video synopsis. Video condensation by seam carving provides an effective solution. The idea of seam carving is to associate a reliable activity-aware cost with seams and recursively remove seams one at a time. The seam cost is equal to the sum of the costs of all pixels that make up a seam. To condense a given video, the ribbons with minimum cost are removed until a user-defined stopping criterion is met. The datasets are real-time ATM crimes involving stabbing action with a knife. The experimental outcomes show that the demonstrated framework gives an effective synopsis of the video based on abnormal activities, i.e., a reduction in duration and in the number of frames.
Keywords Faster R-CNN · Seam carving · Stabbing action · Video synopsis
B. Yogameena · R. Janani (B)
Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai, Tamil Nadu, India
e-mail: [email protected]
B. Yogameena e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_40

1 Introduction
Surveillance is the monitoring of a person's behavior and activities and the inference of information from them; it is also used for preventing crime. Video synopsis is a simultaneous presentation of events that allows hours of video footage to be reviewed in just minutes [1, 2]. Nowadays, ATM crime is increasing and involves abnormal activities such as threatening people with sharp objects, breaking ATM machines, changing posture, bending, and walking [3, 4]. Accordingly, an efficient approach is required to provide a synopsis video based on suspicious activities, which will be helpful in forensic analysis. Seam carving is a content-aware resizing algorithm. It works by establishing a number of seams (paths of least importance) in an image and automatically removing seams to reduce the size of the image or inserting seams to extend it. Both reduction and extension of the image size, in either direction, are achieved by removing or adding seams successively [5]. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative movement between the camera and the scene. The Lucas-Kanade method is a widely used differential method for estimating optical flow in computer vision. It assumes that the flow is essentially constant in a local neighborhood of the pixel under consideration and can provide an estimate of the motion of interesting features in successive images of a scene [6]. An entire image and a set of object proposals are the input to a Fast R-CNN network [4, 7]. The training set consists of images of hands holding a knife; the position of the knife in each training image is given in the ground-truth table so that knives at different angles and positions can be learned.
2 Related Work
Video synopsis is an effective tool that preserves the crucial activities of the original video while providing a short video presentation. Lu et al. [8] and Rav-Acha et al. [1] have focused on video synopsis, which is the most widely used approach for shortening lengthy videos. Lu et al. [8] combine a texture method with a Gaussian mixture model to detect a more compact foreground with shadow removal. A particle refine tracker is used to generate more fluent tubes for the synopsis video. This method efficiently concatenates many tube fragments belonging to one object activity. Avidan et al. [9] proposed seam carving, an algorithm that is also used for object removal and image content enhancement. Seam carving operates on pixels, and the amount of resizing is expressed as the number of seams to be removed or inserted. Chen et al. [10] introduced an algorithm in which sheets are carved incrementally from the video cube to shorten a video. The algorithm carves out sheets of lower importance until the desired video size is obtained. The main contribution of that system is the generalization of a video frame to a sheet in a video cube; the sheet can be found using a min-cut formulation. Li et al. [11] suggested a video condensation method in which ribbons are carved using dynamic programming to minimize an activity-aware cost function. Unlike these previous synopsis approaches, our method targets ATM surveillance and generates synopsis videos based on abnormal activity.
2.1 Problem Formulation and Motivation
The literature on abnormal activity-based video synopsis for ATM surveillance applications is scarce, and little work handles the challenges that arise when multiple objects move in different directions and at different speeds, when the chronological order of events changes, and in the presence of bad tube extraction, shadow, and occlusion. The motivation is twofold. (1) Criminal activities have been increasing in ATM centers, and law enforcement authorities often become aware of a crime only several hours after the incident. (2) The crime is investigated by watching the surveillance videos from the ATM centers. Watching the entire video sequence is time-consuming and slows down the investigation process.
2.2 Contribution and Objective
According to our survey of abnormal activity-based video synopsis, no work has so far been done on stabbing action-based video synopsis for ATM applications. The objectives of this work are: to produce a synopsis of a video of a sparse crowd involving stabbing action, which is applicable to forensic analysis and useful as evidence; and to develop an efficient algorithm for abnormal activity-based video synopsis in a sparse crowd, in which ribbons are carved out by minimizing an activity-aware cost function, for ATM surveillance applications.
3 Methodology
3.1 Foreground Segmentation Using Gaussian Mixture Model
Foreground segmentation is used as the primary step for detecting moving objects. The background is modeled and the foreground is detected by a Gaussian Mixture Model (GMM) [12]. A Gaussian Mixture Model is a weighted sum of the densities of M components, given by Eq. (1) as

\[ p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i) \tag{1} \]

where $x$ is a D-dimensional continuous-valued vector, $w_i$, $i = 1, \ldots, M$, are the mixture weights, and $g(x \mid \mu_i, \Sigma_i)$, $i = 1, \ldots, M$, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form given in Eq. (2):

\[ g(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^{T} \, \Sigma_i^{-1} \, (x - \mu_i) \right\} \tag{2} \]

where $\mu_i$ is the mean vector and $\Sigma_i$ is the covariance matrix.
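As an illustration of Eqs. (1) and (2), the following minimal Python sketch evaluates a mixture density for one sample. It assumes diagonal covariance matrices (so the determinant and the inverse reduce to products and divisions of per-dimension variances); the function names and the example parameters are hypothetical and are not taken from the paper.

```python
import math

def gaussian_density(x, mean, variances):
    """D-variate Gaussian density of Eq. (2), assuming a diagonal covariance.

    `variances` holds the diagonal entries of the covariance matrix.
    """
    d = len(x)
    det = math.prod(variances)  # determinant of a diagonal matrix
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variances))
    norm = (2 * math.pi) ** (d / 2) * math.sqrt(det)
    return math.exp(-0.5 * quad) / norm

def gmm_density(x, weights, means, variances):
    """Weighted sum of the M component densities, as in Eq. (1)."""
    return sum(
        w * gaussian_density(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )

# Hypothetical two-component model of a single pixel's gray level.
weights = [0.7, 0.3]
means = [[100.0], [180.0]]
variances = [[25.0], [100.0]]
likelihood = gmm_density([102.0], weights, means, variances)
```

In a background-subtraction setting, a pixel whose likelihood under the background components is low would be treated as foreground; the decision rule itself is not specified in the text and is left out of the sketch.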
3.2 Blob Detection and Labeling
Each blob is projected from the head plane to the ground plane; any discontinuity in a blob that represents a person is resolved by linking the discontinuous blobs that are enclosed by one bounding rectangle. For shorter people, too large a head-plane height may result in a zero intersected area. Conversely, setting a very low head-plane height can result in detecting shorter objects. A rectangular area is created for every blob by joining the opposite points C1 and C2. If this area is below the area threshold, the area is categorized separately. The blobs whose estimated area exceeds the threshold are grouped. The blob projection is then split into head plane and ground plane, and the parts are separated and labeled as
The area projected on head plane = C1 − L2
The area projected in ground plane = C2 − L1
Intersected area = C1 − C2
3.3 Hand with Knife Detection by Using R-CNN
At test time, R-CNN generates around 2000 category-independent region proposals for the given image, extracts a fixed-length feature vector from each proposal using a CNN, and classifies each region with category-specific linear SVMs. For example, when training a binary classifier (SVM) to detect a knife, the image region tightly enclosing the knife must be a positive example and a background region is a negative example. A region that only partially overlaps the knife is handled by thresholding the overlap; the threshold ranges from 0 to 0.5. The classification is between hand with knife and others.
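The overlap thresholding described above can be sketched as follows. This is an illustrative assumption rather than the authors' code: overlap is measured here as intersection over union (IoU), boxes are (x1, y1, x2, y2) tuples, and the threshold value 0.3 is a hypothetical choice inside the stated 0 to 0.5 range.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, knife_box, threshold=0.3):
    """Label a region proposal for SVM training.

    Proposals overlapping the ground-truth knife box by less than
    `threshold` are used as negatives; the rest are ignored, because the
    tight ground-truth box itself serves as the positive example.
    """
    return "negative" if iou(proposal, knife_box) < threshold else "ignore"

knife = (10, 10, 50, 30)                           # hypothetical ground truth
print(label_proposal((100, 100, 140, 120), knife))  # background region
print(label_proposal((12, 10, 52, 30), knife))      # heavy overlap
```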
3.4 Optical Flow by Lucas Kanade Method and Face Expression Recognition by Faster R-CNN
The Lucas-Kanade feature tracker is used to determine the movement of each sub-target. Optical flow is used to calculate the speed and direction of a moving object from one frame to the next. The Lucas-Kanade method of optical flow is used because it provides fast calculation and accurate time derivatives. Faster R-CNN is a system that consists of a region proposal network and Fast Regions with Convolutional Neural Network Features (Fast R-CNN). Following the multi-task loss of Fast R-CNN, the objective function is minimized. For an anchor box, the loss function is defined by Eq. (3):

\[ L(p_i, t_i) = L_{cls}(p_i, p_i^{*}) + \lambda \, p_i^{*} \, L_{reg}(t_i, t_i^{*}) \tag{3} \]

where $p_i$ is the predicted probability of anchor $i$ being an object, $p_i^{*}$ is the ground-truth label of the anchor, $t_i$ and $t_i^{*}$ are the predicted and ground-truth bounding-box coordinates, and $\lambda$ balances the classification and regression terms. Since Faster R-CNN is used, the Region Of Interest (ROI) of each image must be marked first. To improve the reliability of the experimental results, three different network depths are used to train and test on the data. Facial expression recognition enhances the stabbing action analysis.
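To make the optical-flow step concrete, here is a minimal sketch of the standard Lucas-Kanade least-squares solve for a single window, followed by the speed and direction of the resulting flow vector. It assumes the spatial and temporal derivatives of the window's pixels are already available as flat lists; the function names are hypothetical, and how the paper computes derivatives or chooses windows is not stated in the text.

```python
import math

def lucas_kanade_window(ix, iy, it):
    """Solve the 2x2 normal equations for one flow vector (u, v).

    ix, iy, it are per-pixel horizontal, vertical, and temporal derivatives
    inside one window. Returns None when the system is singular.
    """
    sxx = sum(a * a for a in ix)
    syy = sum(b * b for b in iy)
    sxy = sum(a * b for a, b in zip(ix, iy))
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return None  # flow is not recoverable in this window
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v

def speed_and_direction(u, v):
    """Magnitude (pixels per frame) and angle (degrees) of a flow vector."""
    return math.hypot(u, v), math.degrees(math.atan2(v, u))

# Synthetic derivatives that satisfy ix*u + iy*v + it = 0 for (u, v) = (2, -1).
ix = [1.0, 0.0, 1.0, 2.0]
iy = [0.0, 1.0, 1.0, 1.0]
it = [-2.0, 1.0, -1.0, -3.0]
flow = lucas_kanade_window(ix, iy, it)
```

A threshold on the returned speed could then separate fast, stabbing-like knife motion from slower motion; the threshold value would have to be chosen from data and is not given in the source.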
3.5 Video Synopsis by Seam Carving
The main goal of video condensation is to remove inactive pixels and produce a shorter video [13, 14]. A segment of $N$ consecutive video frames, each $W$ pixels wide and $H$ pixels tall, should end up as a new segment of $N'$ consecutive frames with $N \geq N'$. If $R$ denotes a vertical or horizontal ribbon, then the cost of the ribbon is given by

\[ C(R) = \sum_{(x, y, t) \in R} C(x, y, t) \tag{4} \]

\[ C(x, y, t) = I_x(x, y, t)^2 + I_y(x, y, t)^2 + I_t(x, y, t)^2 \tag{5} \]

where $C$ is a cost function defined for each pixel with coordinates $(x, y, t)$, and $I_x$, $I_y$, $I_t$ are local estimates of the horizontal, vertical, and temporal derivatives. A vertical or horizontal ribbon cannot span more than $M_\phi = (\phi \max(W, H), \phi + 1)$, where $W$ is the width, $H$ is the height, and $\phi$ is the flex parameter [15]. The experimental results of video synopsis by seam carving are shown in Table 2.
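The ribbon cost of Eq. (4) lends itself to dynamic programming. The sketch below is a simplified two-dimensional illustration, not the authors' implementation: it finds the minimum-cost connected seam through a small cost grid, letting the seam shift by at most `flex` columns between neighboring rows. The per-pixel cost is assumed to be the sum of squared derivatives, and all names are hypothetical.

```python
def pixel_cost(ix, iy, it):
    """Per-pixel cost, assuming a sum of squared derivatives as in Eq. (5)."""
    return ix ** 2 + iy ** 2 + it ** 2

def min_cost_seam(cost, flex=1):
    """Return (seam, total) for the cheapest top-to-bottom seam.

    cost[r][c] is the cost of the pixel in row r, column c. The seam picks
    one column per row and may move by at most `flex` columns per row.
    """
    rows, cols = len(cost), len(cost[0])
    total = [list(cost[0])]   # total[r][c]: best cost of a seam ending at (r, c)
    back = []                 # back[r - 1][c]: column chosen in row r - 1
    for r in range(1, rows):
        row_total, row_back = [], []
        for c in range(cols):
            lo, hi = max(0, c - flex), min(cols - 1, c + flex)
            best = min(range(lo, hi + 1), key=lambda k: total[r - 1][k])
            row_total.append(cost[r][c] + total[r - 1][best])
            row_back.append(best)
        total.append(row_total)
        back.append(row_back)
    c = min(range(cols), key=lambda k: total[-1][k])
    seam_total = total[-1][c]
    seam = [c]
    for r in range(rows - 2, -1, -1):  # walk the back-pointers upward
        c = back[r][c]
        seam.append(c)
    seam.reverse()
    return seam, seam_total

grid = [[9, 1, 9],
        [9, 9, 1],
        [9, 9, 1]]
seam, seam_total = min_cost_seam(grid, flex=1)
```

Removing the returned seam and repeating until a stopping criterion is met gives the recursive condensation described in the text; in the video case the grid columns would correspond to time, so each removed ribbon shortens the video.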
4 Results and Discussions
4.1 Experimental Results
Real-time dataset-3 consists of 1,79,220 frames; its location and a sample frame are shown in Table 1 and Fig. 1, respectively. The proposed algorithm is applied to the input video to provide a synopsis of the video based on stabbing action, which will be useful in forensic analysis to analyze the crime effectively. As Figs. 2, 3, and 4 show, the input video is modeled and the foreground objects are identified by the Gaussian Mixture Model (GMM); the detected foreground is then labeled. The individual with a knife in hand is detected by using the Regional Convolutional Neural Network. Finally, the individual with a knife in hand is marked with a red blob. Video synopsis is done by seam carving, which carves ribbons out according to the cost of the seams. The addition or removal of seams is based on the flex parameter, whose value ranges from 0.1 to 0.3 as shown in Table 3. A seam involving stabbing action has a high cost and a seam with no activity has a low cost. The seams are removed until the desired length of video is obtained.

Table 1 Real-time datasets
S.No | Dataset | Frame no | Location
1 | Real-time Dataset 1 | 198 | India (Bangalore)
2 | Real-time Dataset 2 | 64 | China
3* | Real-time Dataset-3 | 251 | Italy
Table 2 Video synopsis by seam carving
Input video | Flex value | Output video duration
Duration: 97 min | φ = 0.1 | 2 min
Duration: 97 min | φ = 0.2 | 3 min
Duration: 97 min | φ = 0.3 | 4 min
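The durations in Table 2 and the frame counts and condensation rates reported for the datasets are related by simple arithmetic, sketched below. The frame rate of 25 frames per second is an assumption (it is not stated in the text), chosen because it makes the 1,45,500 input frames of Real-time Dataset-1 correspond to the 97 min input duration; the function names are hypothetical.

```python
def duration_minutes(frames, fps=25.0):
    """Convert a frame count to minutes at an assumed frame rate."""
    return frames / fps / 60.0

def condensation_rate(input_frames, output_frames):
    """Ratio of input frames to output frames (reported as 1:rate)."""
    return input_frames / output_frames

input_frames = 145500                      # Real-time Dataset-1
for flex, output_frames in [(0.1, 3000), (0.2, 4500), (0.3, 6000)]:
    minutes = duration_minutes(output_frames)
    rate = condensation_rate(input_frames, output_frames)
    print(f"flex={flex}: {minutes:.0f} min, 1:{rate:.2f}")
```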
Fig. 1 Sample frame of real-time dataset-3
Fig. 2 Foreground detection by Gaussian mixture model
Fig. 3 Blob detection and labeling
Fig. 4 Individual's hand with knife detection using R-CNN

Table 3 Performance measure of video synopsis
S.No | Input video | No of input frames | Flex parameter | No of output frames | Condensation rate
1 | Real-time Dataset-1 | 1,45,500 | φ = 0.1 | 3000 | 1:48.5
1 | Real-time Dataset-1 | 1,45,500 | φ = 0.2 | 4500 | 1:32.3
1 | Real-time Dataset-1 | 1,45,500 | φ = 0.3 | 6000 | 1:24.25
2 | Real-time Dataset-2 | 20,025 | φ = 0.1 | 1620 | 1:12
2 | Real-time Dataset-2 | 20,025 | φ = 0.2 | 3240 | 1:6
2 | Real-time Dataset-2 | 20,025 | φ = 0.3 | 4050 | 1:5
3* | Real-time Dataset-3 | 1,79,220 | φ = 0.1 | 6240 | 1:28.72
3* | Real-time Dataset-3 | 1,79,220 | φ = 0.2 | 7800 | 1:22.98
3* | Real-time Dataset-3 | 1,79,220 | φ = 0.3 | 9360 | 1:19.15

5 Conclusion
An efficient algorithm is required to provide an abnormal activity-based video synopsis in order to make forensic analysis faster. The background is modeled and the foreground is segmented by a Gaussian mixture model. The individuals in the foreground are grouped into individual blobs. The individual with a knife in hand is detected by using the Regional Convolutional Neural Network. Motion vectors are estimated by Lucas-Kanade optical flow to determine the speed and direction with which the knife moves. The velocity of a knife used for destructive purposes will be higher than that of a knife used for constructive purposes. If a face is detected, Faster R-CNN is used to recognize the facial expression, which enhances recognition of the stabbing action. If no face is detected, processing passes directly to abnormal activity-based video synopsis, which is done by seam carving. Seam carving in the video retains the seams of frames involving stabbing action, based upon the cost function: the seams with low cost are eliminated to obtain a video synopsis. The performance of video synopsis is then measured by comparing the number of frames in the input and output videos. Consequently, the system provides an efficient abnormal activity-based video synopsis for ATM surveillance applications. Till now, there is a lack of work on obtaining video synopsis by both horizontal and vertical seam carving. In addition, other abnormal activities such as change of posture, walking, and bending can be addressed in future work.
References
1. A. Rav-Acha, Y. Pritch, S. Peleg, Making a long video short: dynamic video synopsis, in IEEE Conference on CVPR (Computer Vision and Pattern Recognition), December 2006, pp. 1–5
2. H.-C. Chen, P.-C. Chung, Online surveillance video synopsis, in Proceedings of IEEE International Conference on CVPR, May 2012, pp. 1843–1846
3. A. Glowacz, A. Dziech, M. Kmieć, Visual detection of knives in security applications using active appearance model. IEEE Trans. Image Process. 54, 703–712 (2015)
4. B. Yogameena, S. Veeralakshmi, E. Komagal, S. Raju, V. Abhaikumar, RVM-based human action classification in crowd through projection and star skeletonization. J. Image Video Process. (2009)
5. X. Ye, J. Yang, X. Sun, Foreground background separation from video clips via motion-assisted matrix restoration. IEEE Trans. Circ. Syst. Video Technol. 25(11), 1721–1734 (2015)
6. D. Patel, S. Upadhyay, Optical flow measurement using Lucas Kanade method. Int. J. Comput. Appl. 61(10), 6–10 (2013)
7. J. Li, J. Zhang, D. Zhang, J. Zhang, T. Li, Y. Xia, Q. Yan, L. Xun, Facial expression recognition with faster R-CNN. Int. Conf. Inf. Commun. Technol. 107(2017), 135–140 (2017)
8. M. Lu, Y. Wang, G. Pan, Generating fluent tubes in video synopsis, in Proceedings of IEEE International Conference on Pattern Recognition, May 2013, pp. 2292–2296
9. S. Avidan, A. Shamir, Seam carving for content-aware image resizing. ACM TOG (Transactions on Graphics) 26(3) (2008)
10. B. Chen, P. Sen, Video carving, in Eurographics Conference on Computational Photography and Image-Based Rendering (2008)
11. Z. Li, P. Ishwar, J. Konrad, Video condensation by ribbon carving. IEEE Trans. Image Process. 18(11), 2572–258 (2017)
12. V. Tiwari, D. Choudhary, V. Tiwari, Foreground segmentation using GMM combined temporal differencing, in International Conference on Computer, Communications and Electronics (2017)
13. K. Li, B. Yan, W. Wang, H. Gharavi, An effective video synopsis approach with seam carving. IEEE Signal Process. 23(1), 11–14 (2016)
14. P.S. Surafi, H.S. Mahesh, Surveillance video synopsis via scaling down moving objects. Int. J. Sci. Technol. Eng. 3(9), 298–302 (2017)
15. R. Furuta, T. Yamasaki, I. Tsubaki, Fast volume seam carving with multi-pass dynamic programming, in International Conference on Image Processing (2016), pp. 1818–1822
Behavioral Analysis from Online Data Using Temporal Graphs Anam Iqbal and Farheen Siddiqui
Abstract The Internet, and social media in particular, is nowadays the basis of human interaction, information exchange, and communication, which has resulted in prodigious data footprints. If prediction techniques are employed efficiently, this data can be put to appropriate use for deducing human behavior. In this work, we propose a methodology for collecting data from social media by assessing online user interactions, using time-varying attributed or temporal graphs. We first discuss temporal graphs and how the temporal and structural properties of users can be modeled using these evolving graphs to predict the personality type of the user. The online platforms whose datasets have been used for the deductions are Stack Exchange and Twitter. The secondary research question addressed in this paper is how temporal or time-varying features impact user behavior prediction. The graphs plotted using the provided datasets show the interactive behavior of users on different platforms.
Keywords Time-varying attributed graph · Social media data · Stack exchange · Facebook · Twitter · Data mining
A. Iqbal (B) · F. Siddiqui
Jamia Hamdard, New Delhi, India
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_41

1 Introduction
Over the last decade, the number of Internet users has grown to about 56% of the world population in 2019, up 9% from January 2018 [1]. With the dawn of new and developing technologies, human interaction has become largely dependent on social media, which has resulted in the generation of huge data footprints. For example, according to Domo's Data Never Sleeps survey, 0.0016 GB of data is created every second [2]. The data from all the social networking platforms is stored on large servers and can be utilized for data analytics. In this paper, we propose a methodology for utilizing the data that defines human interactions on social networks in order to deduce some attributes of user behavior. The type of information that can be extracted from social media data is still limited, but its use in the most appropriate scenario can give results better than expected. To gain an insight into such applications, most researchers have concentrated on deriving the features that impact user behavior. This study is a challenging task, as user behavior and the attributes related to it depend on temporal, spatial, and contextual factors [3, 4]. An appropriate understanding of the dynamics of a user's interaction with social media can have multiple applications in diverse fields; for example, for a WhatsApp user, the online and offline times at the start and end of the day, respectively, can help determine the user's sleep cycle. Over the course of our work, we have come across some fundamental arguments, which we pose as a research question: 'What methodology and its consequent implementation can allow data analysts to model, explore, extract, and then finally predict user behavior using social media data, by employing data mining techniques?'
2 Literature Review
A huge number of online platforms are used for communication, discussion, and information exchange. If the resulting data is put to use in prediction, it can serve various applications [5]. The authors in [6] proposed the communication model, in which the message travels from the sender to the receiver; it was extended in [7] by adding the feedback factor. Clients using social media are both senders and receivers at their respective ends, which emphasizes feedback more than the original message.
2.1 Prediction
Many researchers have studied various aspects of how users respond to activities taking place on the Internet, such as predicting purchasing activities [8], customer behavior [9], churn [10], users' loyalty [11], and identifying criminals and financial defaulters [12]. More advanced work incorporates complex analysis, such as predicting the temporal properties of a client and thereby providing an improved experience [13]. In [14], the analysts predicted users' cuisine choices from their check-ins; a limitation was that the computational cost was not evaluated. In [15], historical geographical data was used to predict the future locations of the user accurately. In [16], user activity time at home was predicted.
2.2 Modeling User Behavior
In our research, we classify models into graph-based models and dynamic-based models, on the basis of their core functionality and structure.

2.2.1 Dynamic Models
These models represent the behavior of objects with respect to time. The objects in our scenario are Internet users. A lot of work has been done on the analysis of dynamic networks [17, 18], and work on modeling time-varying attributes in large-scale network datasets [19, 20] emphasizes the use of temporal links to improve prediction models. To implement dynamic models of human behavior, [21] treats a human as analogous to a device with varied mental states. Each state is itself a dynamic process

\[ x_i = f_i(x, t) + \varepsilon(t) \tag{1} \]

where $x_i$ is the state vector at time $i$, $f$ is a function that models $x_i$, and $\varepsilon$ represents a noise process. In the case of multiple dynamic models, the probability of an $n$-dimensional observation $Y_z$, given the dynamics of the $k$th model, can be written as

\[ P^{(k)}\left(Y_z \mid X_z^{*}\right) = \frac{\exp\left( -\frac{1}{2} \, \tau_z^{(k)T} R^{-1} \tau_z^{(k)} \right)}{(2\pi)^{n/2} \, \mathrm{Det}(R)^{1/2}} \tag{2} \]

where $R$ is the covariance matrix and

\[ \tau_z^{(k)} = Y_z - f^{(i)}\left(X_z^{*(k)}, t\right) \tag{3} \]
2.3 Graph-Based Models
When graphs are used to determine user behavior, the nodes of the graph represent people and the edges represent the interactions between these nodes. These models are used to establish the structural properties of users. A graph is represented as

\[ G = (V, E) \tag{4} \]

where the nodes $V$ are the people and the edges $E$ represent the interactions between the people. These graphs can be used to learn the types of association between users, i.e., who follows whom, and between a user and the platform, i.e., how active a user is on Twitter. A simple graph model only has users and their associations, but a real-world scenario is much more complex, so the classical graph model needs to be extended. The result is a powerful statistical analysis of the data. One such extension gives each node associated attributes [22]; such graphs are called attributed graphs. An attributed graph is shown in Fig. 1. If nodes do not have attributes, a combination of graphs can also be used to simplify the task at hand, as shown in [23].

Fig. 1 An AMG with four nodes, u1-4, four edges, e1-4, and each node having three attributes
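A small attributed graph of the kind described above can be sketched with plain Python dictionaries. The node names u1 to u4 follow the figure caption; the attribute names, their values, and the edge directions are hypothetical, chosen only to show how structural queries (out-degree) and attribute queries can be combined.

```python
from collections import defaultdict

class AttributedGraph:
    """Directed graph G = (V, E) whose nodes carry attribute dictionaries."""

    def __init__(self):
        self.attributes = {}                # node -> {attribute: value}
        self.successors = defaultdict(set)  # node -> nodes it points to

    def add_node(self, node, **attrs):
        self.attributes[node] = dict(attrs)

    def add_edge(self, source, target):
        self.successors[source].add(target)

    def out_degree(self, node):
        return len(self.successors[node])

    def nodes_where(self, key, predicate):
        """Nodes whose attribute `key` satisfies `predicate`, sorted by name."""
        return sorted(
            n for n, attrs in self.attributes.items()
            if key in attrs and predicate(attrs[key])
        )

g = AttributedGraph()
g.add_node("u1", followers=120, posts=40, verified=False)
g.add_node("u2", followers=15, posts=3, verified=False)
g.add_node("u3", followers=900, posts=210, verified=True)
g.add_node("u4", followers=60, posts=12, verified=False)
for source, target in [("u1", "u2"), ("u1", "u3"), ("u3", "u4"), ("u4", "u1")]:
    g.add_edge(source, target)

popular = g.nodes_where("followers", lambda f: f > 100)
```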
3 Temporal Graphs
Temporal or time-varying graphs (TVGs) comprise a set of entities $X$, relationships between entities represented by edges $E$, and a variable $P$ for defining any other property, i.e., $E \subseteq X \times X \times P$. A single property can be defined over multiple entities. The relationships between entities have a lifetime, referred to as the timestamp, $\Gamma$, where $\Gamma \in T$. System dynamics can be represented by a TVG, $G$, i.e.,

\[ G = (X; E; \Gamma; \sigma; \xi) \tag{5} \]

where $\sigma : E \times \Gamma \to \{0, 1\}$ indicates whether an edge exists at a particular time, and $\xi : E \times \Gamma \to T$ indicates the time taken to traverse an edge. One very important deduction from the behavior of temporal graphs is that the final graph is a union of many temporal sub-graphs (each a static graph when time is held constant), i.e.,

\[ F(G) = SG_{(1)} \cup SG_{(2)} \cup \ldots \cup SG_{(k-1)} \cup SG_{(k)} \tag{6} \]

Fig. 2 Process flow of the proposed work
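The presence function σ and the union of static sub-graphs in Eq. (6) can be sketched as follows. Edge lifetimes are stored as lists of closed (start, end) intervals; this representation, the sample edges, and the function names are assumptions made for the illustration.

```python
def presence(lifetimes, edge, t):
    """sigma(e, t): 1 if `edge` exists at time t, otherwise 0."""
    return int(any(start <= t <= end for start, end in lifetimes.get(edge, [])))

def snapshot(lifetimes, t):
    """Static sub-graph at time t, returned as a set of edges."""
    return {edge for edge in lifetimes if presence(lifetimes, edge, t)}

def footprint(lifetimes, times):
    """Union of the static sub-graphs over the given time steps, as in Eq. (6)."""
    edges = set()
    for t in times:
        edges |= snapshot(lifetimes, t)
    return edges

# Hypothetical lifetimes: each edge maps to the intervals during which it exists.
lifetimes = {
    ("a", "b"): [(1, 2)],
    ("b", "c"): [(2, 3)],
    ("a", "c"): [(5, 5)],
}
early = footprint(lifetimes, range(1, 4))   # time steps 1, 2, 3
full = footprint(lifetimes, range(1, 6))    # time steps 1 through 5
```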
4 Temporal Graphs for Behavior Prediction
A temporal graph is one of the most suitable representations of a social network, because its edges fluctuate. The interactions between users also keep changing with time, due either to the spatial movement of users or to changes in the status of relationships between users, such as acquaintances, friends, and family, which are all dynamic relations. Following the research question raised above, the research process in this project is divided into the following steps:
1. Developing the model: The model used is the Time-Varying Graph, which describes structural and temporal features of the users.
2. Feature Extraction: This includes feature extraction and classification. Semantic and computational are the two classes of features that have been considered.
3. Prediction: Four algorithms that can be used are Extreme Gradient Boosting (XGB), Decision Trees (DT), Linear Regression (LR), and k-Nearest Neighbor (kNN). Evaluation metrics such as accuracy and time can be used to compare these with previous models.
4. Visualization: Finally, we propose visualization of the datasets using Case-Based Reasoning (CBR).
This paper puts forth the results of the prediction based on TVG.
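The prediction step listed above can be illustrated with the simplest of the four named algorithms, k-Nearest Neighbor. The sketch below is a plain-Python version written only for illustration; the two features (say, out-degree and messages per day) and the behavior labels are hypothetical and do not come from the paper's datasets.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)

# Hypothetical per-user features extracted from the graph.
train = [
    ((1.0, 1.0), "passive"), ((1.0, 2.0), "passive"), ((2.0, 1.0), "passive"),
    ((8.0, 9.0), "active"), ((9.0, 8.0), "active"), ((9.0, 9.0), "active"),
]
test = [((1.5, 1.5), "passive"), ((8.5, 8.5), "active")]
score = accuracy(train, test, k=3)
```

An odd k with two classes avoids tied votes; for more classes or real data, a library implementation with cross-validation would be the more usual choice.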
5 Retaining Temporal Features
The two datasets used are Stack Overflow and Twitter, each obtained from its official website. The layout algorithm used is the Fruchterman-Reingold algorithm. Dynamic network models can be developed using Python or R. We have chosen Python as the language and Anaconda as the platform for constructing the graph; the rest of the implementation is done in Python using the igraph package. Our model encapsulates the time-varying properties of social media data. The temporal attributes enable us to further divide the graph into sub-graphs. This is an efficient mechanism, as the features need to be computed only for the sub-graph, which reduces the computation time. As already noted, a social network is very similar to a graph, so we employ graph theory for the detailed study of social network traits. The Time-Varying Attributed Graph (TVAG) consists of objects with time-varying attributes. These attributes can be engineered to design the interaction and relationship graphs. When people interact on social media, relationships are established between them. To represent these, a relationship graph $RelG(t)$ is used, whose nodes and edges represent the users and their relationships, respectively. Each relationship $R(Ut_i(t))$ necessarily has two attributes: the source, $Ut_{SOURCE}$, and the target, $Ut_{TARGET}$. When people interact on social media, they do so through messages. To represent these, an interaction graph $INTG(t)$ is used, whose nodes and edges represent users and their messages, respectively.
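The idea of computing features only on a time-restricted sub-graph can be sketched without any graph library. In this illustration, each edge of the interaction graph is a (source, target, timestamp) tuple; the tuple layout, the sample messages, and the function names are assumptions, and the igraph-based implementation mentioned in the text is not reproduced here.

```python
from collections import Counter

def window_subgraph(messages, start, end):
    """Keep only the interaction edges with start <= timestamp < end."""
    return [m for m in messages if start <= m[2] < end]

def out_degree(messages):
    """Number of messages sent by each user in the given edge list."""
    return Counter(source for source, _target, _timestamp in messages)

# Hypothetical interaction edges: (source user, target user, timestamp).
messages = [
    ("u1", "u2", 1),
    ("u1", "u3", 2),
    ("u2", "u3", 5),
    ("u3", "u1", 9),
]
recent = window_subgraph(messages, 0, 5)
features = out_degree(recent)   # computed on the sub-graph only
```

Because the feature is computed over the windowed edge list rather than the full history, its cost grows with the size of the window, which is the saving in computation time that the text refers to.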
6 Results
6.1 Twitter
Twitter is a micro-blogging website that enables users to post tweets and follow other users. Tweets are generally associated with a '#' hashtag, which represents the trending topic. The data model consists of five tables: users, tweets, entities, entities in objects, and places. In the relationship graph, nodes are users and the edges show the relations between users. Twitter discards temporal variations; hence, relationships once established do not generally change. The interaction graph of Twitter, in contrast, is time-stamped. The tweets of users are broadcast to their followers, so the interaction graph is modeled between users and their tweets. Each tweet has an attribute that determines its type: mention, reply, or retweet. In the interaction graph, every edge has a lifetime, whereas in the relationship graph the timestamp runs from the time of the first tweet and follows every retweet. Figure 3 gives the interaction graph between 500 random users following a particular hashtag, together with the distribution of other users who are not following the hashtag. Figure 4 gives the relationship graph of the 500 most active users. These users interact (i.e., tweet and retweet) with each other frequently, as depicted by the density of the interaction graph.
Fig. 3 Interaction graph for twitter
Fig. 4 Relationship graph for twitter
7 Stack Exchange
Stack Exchange is a collection of about 128 question-and-answer websites on varied topics. Each of these sites follows a particular data model. The interactions here are queries, responses to these queries, and additional comments. The data model consists of six tables. To form the basis of these models, the interaction graph is modeled between users and their messages. Additional attributes here are Reputation, Views, and Message attributes. These are generally temporal and hence need an additional attribute that can define the time-bound changes that occur in the contents of the messages. In Stack Exchange, users cannot develop relations between themselves; hence, a relationship graph cannot be established. For our model, we have taken the questions and answers from July 2009 to October 2014 and developed an interaction graph for this scenario, which is given in Fig. 5. Users are represented by the nodes, while answer flow is represented by the edges. Another interaction graph is plotted, which shows the interaction of users following a particular subject of study and involved in answering the questions related to that topic (Fig. 6).

Fig. 5 Interaction graph for Stack Exchange
Fig. 6 Interaction graph for Stack Exchange
8 Conclusion
The research carried out in this paper is based on a primary research problem, namely, 'how can social media data be modeled in a way that allows human behavior to be captured, explored, and hence understood?' Graphs are one of the most suitable tools for representing and exploring social media datasets. Data from social networks such as Stack Overflow and Twitter can be modeled as a temporal model with hardly any loss of data and with a focus on the time-varying behavioral attributes of the users. To demonstrate the usage of the proposed model, the Twitter and Stack Overflow datasets were modeled. Moving forward, the structural features derived from this temporal model play a very important role in the feature extraction step of machine learning, reducing the effort needed to extract desired and useful features automatically.
A. Iqbal and F. Siddiqui
Medical Data Analysis Using Machine Learning with KNN Sabyasachi Mohanty, Astha Mishra, and Ankur Saxena
Abstract Machine learning (ML) has been used to develop diagnostic tools in medicine for decades. Huge progress has been made in this area; however, much more work is needed to make it suitable for real-time application in day-to-day life. As a part of data mining, ML learns from previously fed data to classify and cluster relevant information. The main problems arise from variation between individuals in large datasets and from the huge amount of unorganised data. We have used ML to find patterns in our dataset and to calculate the accuracy obtained on this data, in the hope that this serves as a stepping stone towards tools that can help in medical diagnosis and treatment in future. An efficient diagnostic tool would improve healthcare to a great extent. We have used a mixed dataset that includes both individuals in the early stages of a severe illness and individuals who are further along. We use libraries such as seaborn to construct a detailed map of the data. The fundamental factors considered in this dataset are age, gender, region of stay and blood group. The main goal is to compare the different parameters with each other and locate patterns within them. Keywords Medical diagnosis · Seaborn · Matplotlib · Data mining · KNN
S. Mohanty · A. Mishra · A. Saxena (B) Amity University, Noida, Uttar Pradesh, India e-mail: [email protected] S. Mohanty e-mail: [email protected] A. Mishra e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_42
1 Introduction Machine learning has helped science and technology make huge strides, including in medical data processing, and has had a significant impact on life science and medical research. Highlights include recent advances in the development of machine learning pipelines for statistical bioinformatics and their deployment in clinical diagnosis, prognosis and drug development [1]. Machine learning algorithms can also be trained to screen for complications in medical imaging data [2]. We used Google Trends to obtain data reflecting the interest people have shown in machine learning since 2014. It is based on the web searches made over this period, which are a good indicator of the popularity of any entity in this digital age [3]. From 2014 to 2019 there has been a consistent and large rise, which shows that the vast applications of ML are being realised and discovered by more and more people [4]. Machine learning has gradually spread across several areas of the medical industry, with the potential to revolutionise the whole industry [5]. Until a few years ago, medicine depended solely on heuristic approaches, in which knowledge is gathered through experience and self-learning, which is crucial in a healthcare environment [6]. The increasing amount of data, or big data, is the driver for the application of machine learning [7]. ML is a platform that can bring information from numerous sources into an integrated system that helps in decision-making, even for professionals [8].
1.1 Artificial Intelligence Artificial intelligence has been strongly focused on improving healthcare since the 1960s. In addition to building databases that store medical data, such as patient data, research libraries, and administrative and financial systems, research in artificial intelligence focuses on innovating techniques for better medical diagnosis [9]. For example, PubMed is a service of the US National Library of Medicine that includes over 16 million citations from journals for biomedical articles dating back to the 1950s.
1.2 Medical Diagnosis ML analyses structured data sets such as images, genetic data and EP data. In clinical applications, ML procedures attempt to cluster patients' traits or to infer the probability of disease outcomes. In molecular drug discovery and drug manufacturing, machine learning can be used for precision medicine, next-generation sequencing, nano-medicine, etc. [10]. For better treatments, we aim to develop improved algorithms, for example by combining existing treatment methods, such as precision cancer treatment, with machine learning technologies [11]. Machine learning models have been trained to screen patients. Screening models and algorithms have already been introduced for identifying tumours, diabetes, heart diseases, skin cancer, etc. The algorithms and ML models should have high precision and high sensitivity for the best evaluation and diagnosis of diseases or ailments [12] (Fig. 1). Machine learning tools can be put to various kinds of uses [13]. Figure 2 shows a heat map that has been used to analyse the Air Quality Index (AQI) of the entire city of Delhi over a month. This analysis was performed by the Data Intelligence Unit of India Today, a well-known news channel, on pollution statistics provided by the CPCB [14]. The CPCB is the Central Pollution Control Board, a statutory organisation under the Ministry of Environment, Forest and Climate Change. This is therefore publicly available data, which could easily have been overlooked had India Today not processed the raw statistics [15]. One glance at the heat map gives enough information about the state of air quality in the city [16]. The dark shades in the odd-even weeks show that the air was at its worst during this period, with an average AQI of 365. We are able to analyse the impact of a very popular government scheme without having to read and compare hundreds of numeric index values [17]. In this example, ML tools were used to analyse large-scale medically and environmentally relevant data for an area of 1,484 km² with a population of 1.9 crore (19 million). Their utility is considerable. Figure 3 goes a step further: it is a pollution calendar for the year 2019 [18, 19].
Fig. 1 Changes in people’s interest in machine learning over a period of 5 years
Fig. 2 Air quality index heat graph for January of 2016 (statistics during odd-even scheme implementation)
Fig. 3 AQI heat graph calendar for year of 2019
2 Methodology 2.1 About the Dataset We collected this dataset through a Google Form that we circulated among our college mates and friends. That is why most of the dataset contains the medical information of individuals aged 16–30. After receiving complete responses, we processed the dataset further: BMI was calculated from the height and weight of each individual, certain columns such as medical history and symptom diagnosis were converted to Boolean format, and ages were grouped into two-year age groups. This processing was instrumental to a correct, unambiguous presentation of the data and to the seamless execution of ML tools on it.
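The three preprocessing steps can be sketched as follows. This is a minimal, library-free illustration rather than the authors' code: the field names, the units (centimetres and kilograms), the yes/no answer format, and the choice to start the two-year bins at age 16 are all assumptions, and the two sample responses are made up.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms over height in metres squared."""
    height_m = height_cm / 100.0
    return round(weight_kg / (height_m ** 2), 1)

def to_bool(answer):
    """Map a yes/no style form answer to a Boolean."""
    return answer.strip().lower() in {"yes", "y", "true", "1"}

def age_group(age, start=16, width=2):
    """Label the two-year bin that contains `age`, e.g. 17 -> '16-17'."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical form responses.
responses = [
    {"age": 19, "height_cm": 170, "weight_kg": 65, "medical_history": "Yes"},
    {"age": 22, "height_cm": 158, "weight_kg": 50, "medical_history": "no"},
]

processed = [
    {
        "age_group": age_group(r["age"]),
        "bmi": bmi(r["weight_kg"], r["height_cm"]),
        "medical_history": to_bool(r["medical_history"]),
    }
    for r in responses
]
print(processed)
```

With pandas, the same steps would normally be written as column-wise operations on a DataFrame; plain functions are used here only to keep each step visible.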
2.2 Environment Setup Anaconda was installed to get the work started, as it makes installing libraries seamless; it was used with Python version 3.7. We used Jupyter Notebook as our development environment because it is widely used for machine learning, is user friendly and has a simple interface. It was well suited to our work because it displays graphs and data clearly.
2.3 Starting We used popular Python machine learning libraries such as scikit-learn (sklearn) in our work. The data was stored in Comma Separated Values (CSV) format. Before starting the analysis, we imported the libraries and their dependencies, which together cover data analysis, machine learning and data visualisation. Pandas, numpy, matplotlib and seaborn are a few of the major libraries. The dataset is shown in Fig. 4, and the complete process is summarised in Fig. 5. After installing Jupyter Notebook on our desktop, we used the libraries preinstalled with it for further work: numpy for mathematical operations, along with seaborn, sklearn, pandas and matplotlib. We then used the pandas library to import our dataset into Jupyter. We performed data visualisation using these library functions to help with the data analysis process. Finally, we used the KNN algorithm to classify the data (Fig. 6).
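Loading the CSV data and inspecting its first rows might look like the sketch below. The column names and values are hypothetical stand-ins, since the actual survey file is not reproduced in the paper; an in-memory string is used in place of a file path so that the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey export (real columns are not given).
csv_text = """age,gender,blood_group,bmi,medical_history
19,F,B+,22.5,True
22,M,O+,20.0,False
20,F,A+,24.1,False
18,M,B+,19.8,True
21,F,AB+,23.3,False
23,M,B+,26.0,True
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head()  # first 5 rows by default
group_counts = df["blood_group"].value_counts()  # rows per blood group
print(first_rows)
print(group_counts)
```

Counts such as `group_counts` are the numbers that a histogram of a single categorical column displays.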
Fig. 4 head() function of pandas shows the first 5 rows of the dataset
Fig. 5 The workflow
3 Results and Discussion 3.1 Data Relation with Respect to Gender Using Pairplot Function of Seaborn Library The above plot shows the different parameters of the dataset broken down by gender (male and female). The red dots represent females and the blue dots represent males. We can discern various patterns in these clusters of points plotted against the two axes (Fig. 7).
3.2 A Few Individual Parameters of the Dataset in the Form of Graphs or Histograms, Which Are Crucial for Data Analysis The above histogram gives an overview of the ages of the participants. It shows that most participants are between 18 and 21 years old, with far fewer in the older groups. This is because the survey was carried out in a higher educational institute with a predominantly young population. It helps us anticipate the kinds of illness that could be common in this group and what to expect in general in terms of medical histories (Fig. 8).
Fig. 6 pairplot was used to plot the histogram to show relations
The above histogram shows that the largest number of individuals have the B+ve blood group, which is also the most common blood group among people of the Indian subcontinent. This agreement speaks well for the accuracy of the dataset (Fig. 9). The survey form, circulated electronically through messaging apps and similar channels, received a large share of its responses from female individuals. This may indicate greater medical and physiological awareness among female participants (Fig. 10).
3.3 Comparison of Two Parameters Together The above histogram shows that females took more medications than males. In this sample, the graph suggests that females consume more medicines and fall ill more frequently than men (Fig. 11).
Fig. 7 Individual parameter study of age groups in form of histogram
Fig. 8 Individual parameter study of blood groups in form of histogram
Fig. 9 Individual parameter study of blood groups in form of histogram
Fig. 10 Parameter study of gender and medications in form of histogram
Fig. 11 Parameter study of gender and blood group in form of histogram
The relationships between the parameters are shown in Fig. 12.
Fig. 12 The graph shows the inter relationship of one parameter with each other by a float value
Fig. 13 The above histogram plots the number of males and females with respect to BMI and medical history
3.4 For Better Analysis We Did a 1 to 1 Comparison of Data The heat map gives information about the type of data collected by the survey. The darker shades of green mark the entries that were obtained in the standard form during data collection, while the lighter shades mark non-standard data that had to be processed further before visualisation and analysis. This allowed us to assess the data without having to read and compare hundreds of numeric values in the dataset (Fig. 13).
3.5 Finding the K Nearest Neighbour (KNN) Algorithm This supervised ML algorithm was used to solve a classification problem: distinguishing males and females on the basis of the BMI and medical history fields. Here, similar groups lie closer together and dissimilar ones lie relatively farther apart (Fig. 14). The distance between two points was calculated as the Euclidean (straight-line) distance, and we plotted the above graph to analyse the data further in terms of its accuracy.
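To make the mechanics concrete, the sketch below implements KNN with Euclidean distance in plain Python and reports the accuracy for a few values of K. It is an illustration only: the paper's own analysis used sklearn, and the toy data, the train/test split and all function names here are assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Hypothetical toy data: features are (BMI, medical history as 0/1),
# labels are gender. The point (19.6, 1) is deliberately noisy.
train_X = [(18.5, 0), (19.0, 0), (20.1, 1), (19.6, 1),
           (26.0, 1), (27.2, 0), (28.4, 1)]
train_y = ["F", "F", "F", "M", "M", "M", "M"]
test_X = [(19.5, 1), (27.0, 1)]
test_y = ["F", "M"]

# Odd k avoids tied votes when there are two classes.
scores = {k: accuracy(train_X, train_y, test_X, test_y, k) for k in (1, 3, 5)}
print(scores)
```

On this toy data, K = 1 is misled by the single noisy neighbour while larger K values vote it down, which is the kind of dependence on K that an accuracy-versus-K plot displays. In practice, features are usually rescaled before distances are computed; that step is omitted here.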
Fig. 14 The above graph shows the relation between the accuracy of data with K nearest values
4 Conclusion and Future Scope This model requires better, more robust entries that are accurate and curated. Since the diagnosis is not specific, it cannot be analysed with just a few parameters; more information is needed because multiple ailments differ from one another. The data should be 98% accurate to be acceptable for real-time diagnostic tool development. The model needs to be trained rigorously on the dataset to make the analysis more efficient. Future work may also involve deep learning and neural networks such as BERT, and other better algorithms, once an improved dataset has been formed. Acknowledgments We would like to express our deep sense of gratitude towards Amity Institute of Biotechnology and our families; without their support throughout the process, this paper would not have been accomplished.
References
1. I. Sharma, A. Agarwal, A. Saxena, S. Chandra, Development of a better study resource for genetic disorders through online platform. Int. J. Inf. Syst. Manag. Sci. 1(2), 252–258 (2018)
2. S. Mohagaonkara, A. Rawlani, P. Srivastavac, A. Saxena, HerbNet: intelligent knowledge discovery in MySQL database for acute ailments, in 4th International Conference on Computers and Management (ICCM) (ELSEVIER-SSRN, 2018), pp. 161–165. ISSN: 1556-5068
3. S. Shuklaa, A. Saxena, Python based drug designing for Alzheimer's disease, in 4th International Conference on Computers and Management (ICCM) (ELSEVIER-SSRN, 2018), pp. 20–24. ISSN: 1556-5068
4. A. Agarwal, A. Saxena, Comparing machine learning algorithms to predict diabetes in women and visualize factors affecting it the most - a step toward better healthcare for women, in International Conference on Innovative Computing and Communications (2019). https://doi.org/10.1007/978-981-15-1286-5_29
5. A. Saxena, N. Kaushik, A. Chaurasia, N. Kaushik, Predicting the outcome of an election results using sentiment analysis of machine learning, in International Conference on Innovative Computing and Communications (2019). https://doi.org/10.1007/978-981-15-1286-5_43
6. A. Saxena, S. Chandra, A. Grover, L. Anand, S. Jauhari, Genetic variance study in human on the basis of skin/eye/hair pigmentation using apache spark, in International Conference on Innovative Computing and Communications (2019). https://doi.org/10.1007/978-981-15-1286-5_31
7. V.V. Vijayan, C. Anjali, Prediction and diagnosis of diabetes mellitus - a machine learning approach, in 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS) (Trivandrum, 2015)
8. B. Sarvwar, V. Sharma, Intelligent Naive Bayes approach to diagnose diabetes type-2. Int. J. Comput. Appl. Issues Chall. Netw. Intell. Comput. Technol. (2012)
9. R. Motka, V. Parmar, Diabetes mellitus forecast using different data mining techniques, in IEEE International Conference on Computer and Communication Technology (ICCCT) (2013)
10. S. Sapna, A. Tamilarasi, M. Pravin, Implementation of genetic algorithm in predicting diabetes. Int. J. Comput. Sci. 9, 234–240
11. K. Savvas, N. Schizas Christos, Region based support vector machine algorithm for medical diagnosis on Pima Indian diabetes dataset, in IEEE Conference on Bioinformatics and Bioengineering (2012), pp. 139–144
12. A. Al Jarullah, Decision discovery for the diagnosis of Type II Diabetes, in IEEE Conference on Innovations in Information Technology (2011), pp. 303–307
13. D.M. Nirmala, B.S. Appavu alias, U.V. Swathi, An amalgam KNN to predict Diabetes Mellitus, in IEEE International Conference on Emerging Trends in Computing Communication and Nanotechnology (ICECCN) (2013), pp. 691–695
14. U. Poonam, H. Kaur, P. Patil, Improvement in prediction rate and accuracy of diabetic diagnosis system using fuzzy logic hybrid combination, in International Conference on Pervasive Computing (ICPC) (2015), pp. 1–4
15. S.S. Vinod Chandra, S. Anand Hareendran, Artificial Intelligence and Machine Learning (PHI learning Private Limited, Delhi, 2014), p. 110092
16. R. Bellazzi, B. Zupan, Predictive data mining in clinical medicine: current issues and guidelines. Int. J. Med. Informatics 77, 81–97 (2008)
17. A. Agarwal, A. Saxena, Malignant tumor detection using machine learning through scikit-learn. Int. J. Pure Appl. Math. 119(15), 2863–2874 (2018)
18. S. Saria, A.K. Rajani, J. Gould, D. Koller, A.A. Penn, Integration of early physiological responses predicts later illness severity in preterm infants. Sci. Trans. Med. 2, 48ra65 (2010)
19. D.C. Kale, D. Gong, Z. Che et al., An examination of multivariate time series hashing with applications to health care, in IEEE International Conference on Data Mining (ICDM) (2014), pp. 260–69
Insight to Model Clone’s Differentiation, Classification, and Visualization Ritu Garg and R. K. Singh
Abstract Model clones are model fragments, that is, sets of model elements clustered together through containment relationships, that are highly similar to one another. Because they propagate defects, they are harmful from a maintenance point of view. Work on model cloning is less mature than work on code cloning. To fill this gap, the authors identify the key areas regarding model clones, together with important concepts and their benefits, limitations, and findings, on the basis of the existing literature, which is helpful for further research. The study creates awareness of the attributes in which code and model clones differ. The classification of model clones is then refined and proposed on the basis of similarity and clustering strategy, followed by the techniques for detecting model clones, where the importance of hybrid clone detection is studied on the basis of the pros and cons of the other existing clone detection techniques. Recommendations are given regarding techniques for the visualization and reporting of detected model clones. Keywords Clone · Software quality · Modeling · Duplication · Refactoring
1 Introduction to Model Clones and Its Representation A model is an abstract representation of a system with respect to a context from a specific viewpoint [1, 2]. “Connected submodels, that are in structural equivalence to one another, up to a particular threshold represents Model Clones” [1]. Modeling promotes the transformation of real-world ideas into a clear design and a more maintainable system. In the initial phases of the Software Development Life Cycle (SDLC), the software system is modeled using UML, which is used to design the structure and behavior of the software. Unified Modeling Language (UML) is useful for modeling application systems, and Simulink for modeling embedded systems. The internal representation of UML models is stored textually in the form of XML files in a tree structure [3]. However, the physical representation by itself may not replace the conceptual view of the software, which takes the form of a graph. This is because the conceptual view represents semantic information along with structural information in the form of concepts and their features, while the physical representation represents structural information only. There are many literature surveys on code clone detection and its associated areas, but work on model clones still lags behind. A model element whose features are similar to those of other model elements of the same type may be termed a model clone. However, such similarity is vague in itself, as affirmed by Ira Baxter [4]. As a result, different authors use different bases of similarity, whether in terms of attributes of similarity, clone classification, similarity detection technique, or reporting of similar model elements. This study not only provides a consolidated and comprehensive view of model clones, their attributes, classification, detection, and visualization but also reports the major findings in these areas. This will help future authors understand these concepts and distinguish them from code clones in further research. The main contributions are as follows:
1. The attributes in which code and model clones are similar or different during clone detection are reported.
2. The different clone classifications used in the literature are discussed and then refined by the authors, to give a clear and concise idea of the interrelation between them when comparing various studies.
3. The different model clone detection techniques are presented with their pros and cons to show the need for a hybrid technique for better efficiency.
4. Various techniques for the visualization of clones are studied to highlight the need to combine textual and graphical approaches for better understanding and navigation of clones.
The authors have restricted the study to the concept of model clones. In this paper, Sect. 2 presents the attributes in which model and code clones are similar or different. Section 3 discusses the proposed classification of clones in models. The shift to a hybrid technique for model clone detection is discussed in Sect. 4. Section 5 deals with the different techniques used for the visualization of clones. Section 6 discusses the conclusions and future work for extending this study.
R. Garg (B) · R. K. Singh Indira Gandhi Delhi Technical University for Women, Delhi, India e-mail: [email protected] R. K. Singh e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_43
2 Similarities/Differences of Attributes in Which Model and Code Clones Differ In contrast to code clones (software clones that exist during the implementation phase of the SDLC), model clones differ in the ways shown in Table 1. The similarities between code and model clones lie in layout, color, and position among the elements, which are lines of code in the case of code clones and model elements in the case of model clones. Clone detection techniques use these attributes as the basis for comparing model fragments.
Table 1 Differentiation of code and model clones [3, 5–7]
Attribute | Code clone | Model clones (UML)
Information | Dependency information (implicit similarity) | Containment information along with dependencies among objects (explicit similarity)
Structure | Textual structure at file level | Tree-like structure at higher level (containment information), graph-like structure (dependency information), or a combination of both
Notes/Comments | Comments are not important from a structural or semantic viewpoint with respect to code | Notes/comments are important from a semantic viewpoint with respect to models
Identifiers | Identifiers are identified by names that are locally unique | Identifiers are identified by ids that are globally unique, while the name attribute of model elements is locally unique
Naming | All identifiers must have a name | All identifiers must have an id; the name is optional, which may result in loopholes
Coupling | Depends on the structure of function calls | Depends on the relationships between model elements along with the structure of objects
Renaming | Simple renaming in terms of variable, constant, or literal names | Blind renaming in terms of variable, constant, or literal names on the basis of the type of the block
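The Structure, Identifiers and Naming rows of Table 1 can be illustrated on a small XML-serialised model. The fragment below is a hypothetical, heavily simplified stand-in (real XMI exports carry namespaces and many more attributes), and the `walk` helper is introduced only for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified model file.
model_xml = """
<model id="m1" name="Shop">
  <class id="c1" name="Customer">
    <attribute id="a1" name="email"/>
    <attribute id="a2"/>
  </class>
  <class id="c2" name="Order">
    <attribute id="a3" name="email"/>
  </class>
</model>
"""

def walk(element, depth=0):
    """Yield (depth, tag, id, name) for every element in containment order."""
    yield depth, element.tag, element.get("id"), element.get("name")
    for child in element:
        yield from walk(child, depth + 1)

root = ET.fromstring(model_xml)
rows = list(walk(root))
ids = [row[2] for row in rows]
names = [row[3] for row in rows]

for depth, tag, ident, name in rows:
    print("  " * depth + f"{tag} id={ident} name={name}")
```

Every element carries an id that is unique across the whole file, while the name `email` recurs in two different classes and one attribute has no name at all, so a detector that compares ids would never report a match and one that relies on names alone would miss unnamed elements.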
3 Proposed Classification of Clones in Models The existing literature contains many classifications of model clones across a broad domain. Here, the authors present the most widely used classifications, which can be related to one another. This provides a common view of the classification of model clones for further research and helps in comparing clone detection techniques on the basis of various attributes for better efficiency.
3.1 Classification of Clones on the Basis of Clustering Effect [3, 7] On the basis of clustering effect, clones are classified as (1) primary clones and (2) secondary clones. Primary clones: clones based on the similarity of a fragment as a whole, consisting of all similar elements, or resulting from the merging of several clone fragments that each contain similar elements. Secondary clones: clones based on the similarity of a single indivisible element in a fragment.
3.2 Classification of Model Clone on Basis of Similarity as Per Störrle [3, 7]
Type A: Exact Model Clone. A model fragment that is identical in content apart from layout, secondary notation, and internal identifiers. Differences in position, color, spacing, text fonts, appearance and formatting, orientation, etc. are ignored. In code clones, comments play no important role because code is already at a concrete level of detail; in model clones, notes/comments play a vital role because the software is represented abstractly, and they can help to detect potential clones. In models, a globally unique identifier is associated with each model element, whereas in code uniqueness rests only on the naming conventions followed during coding. Because of this, model elements may be similar but not identical within a system.
Type B: Renamed Model Clone. A model fragment that is highly similar in content apart from changes to the names of elements, attributes, and parts, in addition to the variations allowed in Type A. It thus takes into account variations in the labels and their values for the model elements. It identifies similarities while allowing for developer-made variations in data types, access specifiers, or other meta-attributes of model elements, since these may change the scope and accessibility of a model element.
Type C: Modified Model Clone. A model fragment that is highly similar in content apart from the addition or removal of parts (sets of model elements forming a submodel) and reordering within the same hierarchical level, in addition to the variations allowed in Type B. Variations among submodels in which a model element is added, modified, or removed are ignored up to a certain threshold. This may lead to gapped clones, that is, the number of clones increases while the size of each clone decreases.
Type D: Semantic Model Clone. A model fragment that is only approximately similar in content, which may result from practices such as copying of model fragments, methods or constraints imposed by the languages, convergent development, or other processes. Equivalence is judged with respect to a unique normal form of the models [8]. Such fragments may nevertheless be exactly the same in meaning. Detection checks the behavior of the system, that is, whether the same inputs result in the same outputs. Such clones are very hard to detect because of the semantic nature of the abstract representation, which is why linking them to preconditions and postconditions increases the precision of clone detection techniques. Different authors have used Object Constraint Language (OCL) specifications of these pre- and postconditions to assess semantic similarity. In addition, maintaining a dictionary of synonyms helps identify similar model elements at a lower level of abstraction and so increases precision. Type D differs from the other types because it involves not only pairwise matching of structural content but also semantic transformations, which are very hard to detect.
Pairwise matching of model elements and attributes reports many exactly matching model elements as secondary clones. The focus, however, should be on detecting maximal matches as primary clones, whether exact or approximate, in order to identify the main areas to examine in more detail. To remove accidental duplications reported as secondary clones, there should be a threshold on the number of model elements in a reported primary clone.
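The difference between Type A and Type B can be made concrete with a small comparison routine. The sketch below is an assumption-laden illustration, not a published detection algorithm: model fragments are represented as nested dictionaries, and the keys `id`, `layout`, `name`, `kind`, `type` and `children` are hypothetical. Secondary notation is not modelled, and Types C and D are out of scope because they need thresholds and semantic information.

```python
def normalise(element, ignore):
    """Return a comparable view of `element` without the ignored keys."""
    own = tuple(sorted((k, v) for k, v in element.items()
                       if k not in ignore and k != "children"))
    kids = tuple(normalise(child, ignore)
                 for child in element.get("children", []))
    return own, kids

TYPE_A_IGNORE = {"id", "layout"}          # internal identifiers and layout
TYPE_B_IGNORE = TYPE_A_IGNORE | {"name"}  # additionally tolerate renaming

def clone_type(left, right):
    """Classify a pair of fragments as 'A', 'B', or None (no match here)."""
    if normalise(left, TYPE_A_IGNORE) == normalise(right, TYPE_A_IGNORE):
        return "A"
    if normalise(left, TYPE_B_IGNORE) == normalise(right, TYPE_B_IGNORE):
        return "B"
    return None

# Hypothetical fragments: one class with one attribute each.
customer = {"id": "c1", "kind": "class", "name": "Customer", "layout": (10, 20),
            "children": [{"id": "a1", "kind": "attribute",
                          "name": "email", "type": "String"}]}
customer_copy = {"id": "c7", "kind": "class", "name": "Customer", "layout": (300, 40),
                 "children": [{"id": "a9", "kind": "attribute",
                               "name": "email", "type": "String"}]}
client = {"id": "c8", "kind": "class", "name": "Client", "layout": (0, 0),
          "children": [{"id": "b1", "kind": "attribute",
                        "name": "mail", "type": "String"}]}
order = {"id": "c9", "kind": "class", "name": "Order", "layout": (0, 0),
         "children": [{"id": "b2", "kind": "attribute",
                       "name": "total", "type": "Float"}]}

print(clone_type(customer, customer_copy))  # same content, new ids and layout
print(clone_type(customer, client))         # same structure, renamed
print(clone_type(customer, order))          # attribute type differs
```

Comparing whole fragments, as done here, corresponds to looking for primary clones; matching single elements in isolation would report the secondary clones that the text above recommends filtering with a threshold.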
3.3 Classification of Model Clone on Basis of Similarity as Per Rattan [6, 9, 10]
Type-1: Model clones based on standard modeling or coding practices. These are repetitions of model elements within a model (fields in a class) due to programming or modeling practices (default identifiers in a serializable class).
Type-2: Model clones by purpose. These are repetitions in the nature of relationships (the overriding feature among a parent and all its child subclasses, or the realization relationship between an interface and its implementing classes, due to repetition of abstract operations).
Type-3: Model clones based on design practices. These are repetitions present among different model elements (classes) in the form of clusters of different sizes, due to unfinished design or any other reason.
The classification given by Rattan relates to class diagrams. The authors observed that all three types of clones mentioned by Rattan [6, 9, 10] are based on the same naming conventions for attributes/relationships, supported by minor changes in their meta-attributes. Therefore, they may represent exact model clones as defined by Störrle [3, 7]. Thus, the refined classification is as shown in Fig. 1.
Fig. 1 Refined/Proposed classification of model clones
4 Shift to Hybrid Technique for Model Clone Detection
Model clone detection is similar to code clone detection, with the difference that it is concerned with model elements and the meta-attributes of model elements with respect to models. The Model Clone Detection (MCD) techniques, whose pros and cons depict the shift to a hybrid technique for clone detection, are listed below:
1. Graph-based MCD
2. Token-based MCD
3. Tree-based MCD
4. Hierarchical textual MCD (special class of tree-based MCD)
5. Metric-based MCD
6. Semantic-based MCD
7. Feature-based MCD.
Earlier (especially before 2012), the techniques used for model clone detection were graph-based [1, 2, 11, 12]. Graph-based detection involves matching for sub-graph isomorphism, which is an NP-complete problem [4]. That makes it difficult and time consuming. The token-based technique is capable of detecting mainly T-1 and T-2 clones. Therefore, recall is low where these approaches are used [13]. In tree-based clone detection, structural clones are easy to detect, but any shift of a model element is difficult to detect [3, 5–7, 9, 10]. In the hierarchical textual approach, a lexical approach detects renaming within the tree structure; the tool Simone uses this for model clone detection [14]. The metric-based approaches compare model elements easily using metrics and provide better performance, with less detection time for model clones [15]. Semantic-based approaches rely on the behavior of the concepts used in the study, which requires transformations that are difficult to detect [8]. Feature-based clone detection identifies the features corresponding to the concepts and measures similarity on the basis of granularity (class, method, identifier) [16–18]. It may use machine-learning approaches to train the system so that it can later test for similarities. Due to the high complexity involved in such MCD, a heuristic is required to balance time and space complexity. To overcome these limitations, a combination of these MCD techniques in the form of a hybrid technique, chosen according to the type of software and the desired performance parameters, is a better choice for efficiency in terms of time and space complexity.
5 Different Techniques Used for Visualization of Clones
The clones are reported to the developer, or to whoever maintains the system, as follows:
1. Clone pairs
2. Clone class.
A clone pair represents exactly two sets of clone instances or components (sets of interrelated model elements) that are highly similar to each other. A clone class represents two or more sets of clone instances or components (sets of interrelated model elements) that are highly similar to one another. For a clone class having n clone instances, we may have n(n–1)/2 clone pairs, each with a different set of clone instances. The clone class cardinality represents the number of cloned instances of any fragment, including the original fragment, whereas the size of a clone class represents the number of nodes or model elements present in a clone instance. Models are created in an Integrated Development Environment (IDE), but visualization is complicated by the hierarchical layering in models, which spans multiple files and models. Therefore, there should be some mechanism to represent the clones either textually or graphically.
In textual representation, there is no link with the IDE; the model elements that are clones are simply referred to by different means, such as
1. Complete paths of model element names. The names of the blocks or lines that are clones are highlighted as clone pairs or clone classes [5, 15, 19]. Along with clone classes and clone instances, it should also provide the capability to explore clone instances in the same working environment [20].
In graphical representation, there is linkage with the IDE, where model elements that are clones are represented within the IDE itself. The following techniques are used [5]:
2. Using different coloring schemes for small systems.
3. Scatter plots to identify the models where major components, or highly risky components according to the Pareto principle, are cloned.
4. Using a matrix representation of clones.
In some cases, both textual and graphical representations may be used: the clones are represented in textual form, and the user is free to switch to graphical mode within the IDE itself.
5. Complete paths of element names which, on click, redirect to the visual representation of the clones in models with coloring schemes (used for large software systems) [21].
Such representations of clones are better than the individual textual or graphical approaches to visualization. Because it makes navigation and understanding of clones in software easier, this combined representation finds its major application in clone lineage and clone genealogies in software evolution. Clone lineage is represented by a Directed Acyclic Graph (DAG) that corresponds to a clone group's history of evolution across the various versions. Clone genealogies represent the relation in the form of co-change in the history of the revisions. They depict the effect on the other clone instances, on the basis of the co-change phenomenon, when a change occurs in one clone instance of a clone class. Since models are abstract in nature, they contain little information compared to code, due to which the precision of, and the time taken by, the clone detection process decrease.
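The relation between a clone class and its clone pairs stated in this section can be illustrated with a short sketch. The function name and the example instance names are hypothetical and serve only to show that a class with n instances yields n(n–1)/2 unordered pairs.

```python
from itertools import combinations


def clone_pairs(clone_class):
    """Return every unordered pair of clone instances in one clone class."""
    return list(combinations(clone_class, 2))


# Hypothetical clone class with four instances (names are illustrative only).
example_class = ["Order.total", "Invoice.total", "Cart.total", "Quote.total"]
pairs = clone_pairs(example_class)
n = len(example_class)

# The number of pairs matches the n(n-1)/2 formula from the text.
print(len(pairs), n * (n - 1) // 2)
```

A tool reporting clone classes rather than clone pairs therefore gives a more compact view, since the pair count grows quadratically with the class cardinality.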
6 Conclusion and Future Scope
This paper deals with model clones, highlighting the attributes in which code clones differ from model clones, such as information, structure, comments, identifiers, naming scheme, coupling, and renaming during MCD. The model clone classification from existing research is then refined by the authors on the basis of similarity and clustering strategy to give a consistent view. The pros and cons of the various MCD techniques point to the need for detecting clones using hybrid MCD to balance time and space complexity. The clones detected by an MCD technique, classified on the basis of various attributes, are then presented using model clone visualization techniques based on textual, graphical, and combined approaches, along with their linkage to the IDE. The combination of textual and graphical approaches provides better understanding and navigation during clone evolution. This study is useful for future researchers to understand model clones, their classification, and their visualization, and to enhance further research. Model cloning needs further exploration and refinement from the aspects of analysis, detection, and management during the evolution of software, which may help to improve the quality of industrial practices for developing and maintaining software. The effects of model attributes on clone detection and management still need exploration and validation through empirical studies.
References
1. F. Deissenboeck, H.B. Juergens, E.M. Pfaehler, B. Schaetz, Model clone detection in practice, in Proceedings of the 4th International Workshop on Software Clones (ACM, 2010), pp. 57–64
2. B.J. Muscedere, R. Hackman, D. Anbarnam, J.M. Atlee, I.J. Davis, M.W. Godfrey, Detecting feature-interaction symptoms in automotive software using lightweight analysis, in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (IEEE, 2019), pp. 175–185
3. M. Chochlov, M. English, J. Buckley, D. Ilie, M. Scanlon, Identifying feature clones: an industrial case study, in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (IEEE, 2019), pp. 544–548
4. D. Rattan, R. Bhatia, M. Singh, Model clone detection based on tree comparison, in India Conference (INDICON) (IEEE, 2012), pp. 1041–1046
5. B. Hummel, E. Juergens, D. Steidl, Index-based model clone detection, in Proceedings of the 5th International Workshop on Software Clones (ACM, 2011), pp. 21–27
6. H. Störrle, Towards clone detection in UML domain models. Softw. Syst. Model. 12(2), 307–329 (2013)
7. D. Rattan, R. Bhatia, M. Singh, Detecting high level similarities in source code and beyond. Int. J. Energy. Inf. Commun. 6(2), 1–16 (2015)
8. H. Störrle, Effective and efficient model clone detection, in Software, Services, and Systems (Springer International Publishing, 2015), pp. 440–457
9. M.H. Alalfi, J.R. Cordy, T.R. Dean, Analysis and clustering of model clones: an automotive industrial experience, in IEEE Conference on Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE) (IEEE. Software Evolution Week, 2014), pp. 375–378
10. E.J. Rapos, A. Stevenson, M.H. Alalfi, J.R. Cordy, SimNav: Simulink navigation of model clone classes, in International Working Conference on Source Code Analysis and Manipulation (SCAM) (IEEE, 2015), pp. 241–246
11. D. Rattan, M.G. Singh, R.G. Bhatia, Design and development of an efficient software clone detection technique. Doctoral dissertation (2015)
12. G. Mahajan, Software cloning in extreme programming environment. arXiv (2014), pp. 1906–1919
13. F. Deissenboeck, B. Hummel, E. Jürgens, B. Schätz, S. Wagner, J.F. Girard, S. Teuchert, Clone detection in automotive model-based development, in Software Engineering, 2008. ICSE'08. ACM/IEEE 30th International Conference (2008), pp. 603–612
14. R. Garg, R.K. Singh, Detecting model clones using design metrics, in International Conference on New Frontiers in Engineering, Science and Technology (2018), pp. 147–153
15. B. Al-Batran, B. Schätz, B. Hummel, Semantic clone detection for model-based development of embedded systems. Model Driven Eng. Lang. Syst. 258–272 (2011)
16. C.K. Roy, J.R. Cordy, A survey on software clone detection research. Queen's School Comput. TR 541(115), 64–68 (2007)
17. D. Rattan, R. Bhatia, M. Singh, Software clone detection: a systematic review. Inf. Softw. Technol. 55(7), 1165–1199 (2013)
18. N.H. Pham, H.A. Nguyen, T.T. Nguyen, J.M. Al-Kofahi, T.N. Nguyen, Complete and accurate clone detection in graph-based models, in Proceedings of the 31st International Conference on Software Engineering (IEEE Computer Society, 2009), pp. 276–286
19. I.D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, L. Bier, Clone detection using abstract syntax trees. In software maintenance, in Proceedings of International Conference (IEEE, 1998), pp. 368–377
20. S.K. Choudhary, M.A. Sindagi, M.V. Patel, U.S. Patent Application No. 15/637,684 (2019)
21. E.J. Rapos, A. Stevenson, M.H. Alalfi, J.R. Cordy, SimNav: Simulink navigation of model clone classes, in IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM) (2015), pp. 241–246
Predicting Socio-economic Features for Indian States Using Satellite Imagery
Pooja Kherwa, Savita Ahlawat, Rishabh Sobti, Sonakshi Mathur, and Gunjan Mohan
Abstract This paper presents a novel, accurate, inexpensive, and scalable method for estimating socio-economic features such as electricity availability, treated water, electronics (television, radio), communication mediums (mobile phone, landline phone), and vehicles (2/3/4 wheeler) from high-resolution daytime and nighttime satellite imagery. Our approach helps to track and target poverty and development in India and other developing countries.
Keywords Satellite images · Ridge regression · Stochastic Gradient Descent (SGD) · Machine learning · Convolutional neural network
1 Introduction
In developing countries like India, collecting information that is grounded in precise measurements of economic and development indicators on foot through the Census is difficult. The Census is error-prone and noisy because of the large variability in the data collection processes across the geography, and there is often no validation. Through our machine learning model, we attempt to reduce this effort, help move towards easy development, and provide more accurate results. Accurate measurements of the economic characteristics of populations critically influence both research and policy. Such measurements shape decisions by individual governments about how to allocate resources and provide the foundation for global efforts to understand and track progress toward improving human livelihood. Through our model, we were able to accomplish the following:
P. Kherwa · S. Ahlawat (B) · R. Sobti · S. Mathur · G. Mohan Maharaja Surajmal Institute of Technology, New Delhi 110058, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_44
• Prepared and analyzed the data for around 10,000 villages spread over two states of North India using Census 2011, village boundaries, and corresponding satellite imagery.
• Trained an eight-layer deep convolutional neural network-based model for direct regression of socio-economic features of the Census 2011 data on daytime image data.
• Used ridge regression on nighttime satellite images of the villages.
• Compared the regression scores obtained by both models with existing models, which used either the night or the daytime data for evaluation.
The two Indian states Punjab and Haryana are analyzed for the present research. For the first run, 2000 images were used for day and nighttime data.
2 Literature Survey
In 2011, the association between nightlight and GDP estimates for India at the district level was studied using non-linear regression techniques [1]. In 2013, high-resolution nightlight satellite images were used as a proxy for development using a regression model [2, 3]. In one such study, a machine learning tool with very high accuracy was developed to predict the socio-economic scenario using daylight images [4]. A global poverty map has been produced using a poverty index calculated by dividing the population count by the brightness of satellite-observed lighting. In another work, a deep Convolutional Neural Network (CNN) with daytime images is used to identify land use patterns [5]. The authors analyze land use patterns using advanced computer vision techniques, with ground truth labels obtained from surveys used for labeling. Predicting poverty is another such area in developing countries, where nighttime lighting is considered a rough estimate of the economic wealth of those countries [6].
3 Data Collection and Description
3.1 Census Data Vector
The asset model for the two north Indian states Punjab and Haryana was created [7]. The village-level data is indexed by ids and gives aggregated data on around 140 household characteristics. A dimension reduction technique is then used to reduce this 140-dimensional feature vector to 5 dimensions. The vector fields were electricity availability, treated water, electronics [television, radio], communication mediums [mobile phone, landline], and vehicle [2/3/4 wheeler].
3.2 Outlier Removal
The Census 2011 data has noise and a large number of errors for some villages. Outliers are values that do not match the general character of the dataset. To reject outliers, we compute the distribution of the Mahalanobis distance of all villages [8]. The threshold was set to 10% deviation from the central value. Results are given with and without outlier rejection.
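The outlier rejection step described above can be sketched as follows. This is a minimal illustration, assuming the village data is a NumPy array with one row per village and one column per asset feature. The function names and the quantile-based cutoff (`keep_fraction`) are assumptions made for the example, since the exact thresholding rule is not spelled out in the text.

```python
import numpy as np


def mahalanobis_distances(X):
    """Mahalanobis distance of every row of X from the sample mean."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = X - mu
    # Quadratic form diff_i^T * inv_cov * diff_i for each row i.
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))


def reject_outliers(X, keep_fraction=0.9):
    """Keep the rows whose distance lies below the chosen quantile (assumed rule)."""
    d = mahalanobis_distances(X)
    cutoff = np.quantile(d, keep_fraction)
    mask = d <= cutoff
    return X[mask], mask
```

The returned boolean mask can be reused to drop the same villages from the image sets, so that results with and without outlier rejection are computed on matching samples.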
3.3 Daytime Satellite Images
For geo-registration of the villages, the publicly available data in geospatial vector format created by the Government of India [9] is used. We acquire the daytime satellite images corresponding to the villages from Google Static Maps using the API provided by Google [10] (Google Static Maps API, 2017). A sample image is given in Fig. 1. The API call requires an API key, latitude, longitude, zoom level, etc. for a successful query. We set the zoom level to 15 and the size of the image to 640 × 640, which is roughly equivalent to 7 sq km of ground area.
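A request of the kind described above might be assembled as in the sketch below. It only builds the URL and performs no network call. The helper name, the placeholder key, and the sample coordinates are assumptions for illustration; `center`, `zoom`, `size`, `maptype`, and `key` are standard query parameters of the Static Maps service.

```python
from urllib.parse import urlencode

STATIC_MAPS_ENDPOINT = "https://maps.googleapis.com/maps/api/staticmap"


def static_map_url(lat, lon, api_key, zoom=15, size=640):
    """Build a Static Maps URL for a satellite tile centered on (lat, lon)."""
    params = {
        "center": f"{lat:.6f},{lon:.6f}",
        "zoom": zoom,
        "size": f"{size}x{size}",
        "maptype": "satellite",
        "key": api_key,
    }
    # Keep the comma in the center parameter unescaped for readability.
    return STATIC_MAPS_ENDPOINT + "?" + urlencode(params, safe=",")


# Placeholder coordinates and key, for illustration only.
url = static_map_url(30.7333, 76.7794, api_key="YOUR_API_KEY")
print(url)
```

In practice one URL would be generated per village centroid, and the downloaded image stored under the village id used in the census vector.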
Fig. 1 Daytime satellite image
Fig. 2 Nighttime satellite image
We experimented with single images centered on the village centroid, each covering a ground area of 7 sq km. For the first run, 6000 images were used as our initial set.
3.4 Nighttime Images
The nightlight data provided by the Defense Meteorological Satellite Program's Operational Linescan System [11] has been used in the present work. A sample nighttime satellite image is given in Fig. 2. The nightlight data is available in 30 arc-second grids. The nightlight map is a two-dimensional array of intensities. The nightlight image is then cropped into smaller images of 7 sq km each, centered at the geographical coordinates of the villages under consideration. Each nightlight image is indexed according to unique village ids.
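The cropping step can be sketched as below. The grid origin arguments (`top_lat`, `left_lon`), the function names, and the 3 × 3 window are assumptions for illustration: on a 30 arc-second grid each cell is a little under 1 km on a side, so a 3 × 3 window is only a rough stand-in for a 7 sq km patch.

```python
import math

import numpy as np

CELLS_PER_DEGREE = 120  # 3600 arc-seconds per degree / 30 arc-seconds per cell


def latlon_to_index(lat, lon, top_lat, left_lon):
    """Map a coordinate to (row, col) in a north-up grid with the given origin."""
    row = math.floor((top_lat - lat) * CELLS_PER_DEGREE)
    col = math.floor((lon - left_lon) * CELLS_PER_DEGREE)
    return row, col


def crop_patch(grid, row, col, half_size=1):
    """Cut a (2*half_size+1)-wide window around (row, col), clipped at the edges."""
    r0, r1 = max(row - half_size, 0), min(row + half_size + 1, grid.shape[0])
    c0, c1 = max(col - half_size, 0), min(col + half_size + 1, grid.shape[1])
    return grid[r0:r1, c0:c1]
```

Each cropped patch would then be stored under the corresponding village id, mirroring the indexing described in the text.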
4 Proposed Convolutional Neural Network Architecture
The Convolutional Neural Network (CNN) is the most widely used and powerful technique for analyzing high-resolution satellite images. Motivating prior work successfully carried out automatic road extraction from high-resolution satellite images [12] and used convolutional networks on satellite images to detect and classify roads, buildings, vegetation, etc. Since features like roads and buildings, and their quantitative analysis, form the basis of overall development and are concurrent with other socio-economic features, it can be concluded that a trained CNN is capable of predicting socio-economic features using satellite imagery [13, 14]. As we have a large image dataset at our disposal, it was feasible to train all the fully connected layers from scratch. The complete architecture of the model used is described in Fig. 3. The first five convolutional layers of the VGG CNN-S architecture are taken as they are, i.e., the weights are used unchanged and are not tuned while training the model. The last three fully connected layers are removed and trained from scratch, with the weights initialized from a Gaussian distribution with zero mean and standard deviation of 0.01.
Fig. 3 Architecture of CNN used
Table 1 Hyperparameters used for training
Hyperparameter | SGD with momentum | Adam optimizer
Learning policy | Step, with step size = 500 | Fixed
Learning rate | 0.000001 | 0.0001
Weight decay | 0.005 | –
Momentum 1 | 0.8 | 0.9
Momentum 2 | – | 0.999
Gamma | 0.2 | –
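To make the SGD column of Table 1 concrete, the sketch below shows how a step learning policy and a momentum update are commonly defined, with the table values as defaults. The function names are illustrative, and the update rule is the standard textbook form rather than a transcription of the authors' solver configuration.

```python
def step_learning_rate(iteration, base_lr=1e-6, gamma=0.2, step_size=500):
    """Step policy: multiply the base rate by gamma after every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


def sgd_momentum_update(weight, velocity, grad, lr, momentum=0.8, weight_decay=0.005):
    """One SGD-with-momentum step with L2 weight decay folded into the gradient."""
    grad = grad + weight_decay * weight
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


# With a step size of 500, the rate drops once within a 550-iteration run.
print(step_learning_rate(0), step_learning_rate(500))
```

Under these settings the first 500 iterations use the base rate and the remaining ones use one fifth of it.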
The pre-trained VGG CNN-S is trained on an input size of 224 × 224, but the use of higher-resolution images, with input dimensions of 640 × 640, mandates the removal and retraining of the fully connected layers. Moreover, the underlying task of the existing model is the classification of an image into one of a thousand classes, so another requirement was to change this underlying task to regression of the five target outputs that we need. The weight decay was changed to 0.005. The Caffe framework [15], a convolutional architecture for fast feature embedding, has been used for model specification and training. Since the existing Caffe model supports only classification, the Euclidean loss, which is essentially the L2 loss layer, has been used for regression; it is given by Eq. 1.
E = \frac{1}{2N} \sum_{n=1}^{N} \lVert \hat{y}_n - y_n \rVert_2^2    (1)
where E is the Euclidean loss, N is the total number of samples, \hat{y}_n and y_n are the predicted and target vectors of the nth sample, and \lVert \cdot \rVert is the Euclidean norm. We have used two optimizers. The first is Stochastic Gradient Descent with momentum, introduced by [16]. The second solver is the Adam optimizer, introduced by [4]. The hyperparameters used for both optimizers are presented in Table 1. All the images in the dataset are first added to a Lightning Memory-mapped Database (LMDB), and before being fed into the feedforward CNN, the mean of all the images is subtracted from each image to normalize the input features. Also, the target labels are scaled by a factor of 0.01, so that the range of the target vector changes from [0, 100] to [0, 1].
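The Euclidean loss and the two preprocessing steps mentioned above can be written out in a few lines of NumPy. This is only an illustrative sketch with hypothetical function names; it assumes predictions and targets are arrays of shape (N, 5) and that images are stacked along the first axis.

```python
import numpy as np


def euclidean_loss(pred, target):
    """E = 1/(2N) * sum over samples of the squared Euclidean distance."""
    diff = pred - target
    return float(np.sum(diff ** 2) / (2 * pred.shape[0]))


def scale_labels(labels_percent):
    """Map census percentages in [0, 100] to [0, 1] (factor 0.01)."""
    return labels_percent * 0.01


def subtract_mean_image(images):
    """Subtract the mean image, computed over the whole set, from each image."""
    return images - images.mean(axis=0, keepdims=True)
```

Because the labels are scaled to [0, 1], the reported loss values are on that scale as well and should be read accordingly.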
5 Regression for Night Data
On the nightlights data, which is essentially a two-dimensional matrix of light intensities, we applied ridge regression with the asset vector as the multidimensional target. Since the spatial distribution of the light intensities during nighttime is not as important as the net amount of light intensity in a particular village, the mean and standard deviation of the nightlight intensities for a particular village are used for training a ridge regression model.
During implementation, we carried out the following tasks:
• Segmenting out [13, 14] sized one-dimensional figures centered on the village centroids for around 6000 villages.
• Taking the mean, standard deviation, and maximum values of the light intensity of each village, and serializing the data along with the five target variables.
• Normalizing the input features, and scaling down the target labels by a factor of 100, to bring them to the range [0, 1] from [0, 100].
• Running the ridge regression algorithm on the training data, and evaluating the Mean Squared Error and Mean Absolute Error on the testing set.
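The tasks listed above can be sketched with a closed-form ridge solver. This is a minimal NumPy illustration, not the authors' code: the function names and the regularization strength `alpha` are assumptions, and the intercept is left unpenalized by choice.

```python
import numpy as np


def village_features(patch):
    """Summarize one nightlight patch by its mean, standard deviation, and maximum."""
    return np.array([patch.mean(), patch.std(), patch.max()], dtype=float)


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form multi-output ridge: W = (Xb^T Xb + alpha*I)^-1 Xb^T Y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append an intercept column
    reg = alpha * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # do not penalize the intercept
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ Y)


def predict_ridge(W, X):
    """Apply the fitted weights, including the intercept, to new features."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W


def mse_and_mae(pred, target):
    """Mean squared error and mean absolute error over all entries."""
    err = pred - target
    return float(np.mean(err ** 2)), float(np.mean(np.abs(err)))
```

With three input features and five targets, the fitted weight matrix has shape (4, 5), one row per feature plus the intercept.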
6 CNN for Night Data
A Convolutional Neural Network (CNN) is also used for the nighttime data, and it provides results similar to the ridge regression approach. Since the input image data was very small in size, a very small CNN was used as the model. The architecture contains only one convolutional layer with 4 × 4 filters, one ReLU activation layer, a flatten layer, a dropout layer with a dropout probability of 0.3, and one fully connected output layer. The TensorFlow library was used in Python for model specification and training. The loss function chosen for this network is the Mean Squared Error (MSE), which can be represented mathematically as in Eq. 2.
MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2    (2)
where MSE is the Mean Squared Error, n is the total number of samples, Y_i is the real value of the ith sample, and \hat{Y}_i is the value predicted by our model. The optimizer used to minimize this loss function is the Adam optimizer, with the parameters given in Table 2.
Table 2 Parameters for ADAM optimizer
Parameter | Value
No. of epochs | 10
Learning rate | 0.001
Momentum 1 | 0.9
Momentum 2 | 0.999
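The layer sequence of the small network described in this section can be traced with a forward pass written in plain NumPy. This sketch is deliberately not TensorFlow code, so that it stays self-contained. The number of filters, the input size, and all names are assumptions for illustration, since the text does not specify them.

```python
import numpy as np


def conv2d_valid(image, kernels):
    """Single-channel 'valid' convolution (cross-correlation, as in CNN layers).

    image: (H, W); kernels: (F, kh, kw); returns (F, H-kh+1, W-kw+1).
    """
    num_filters, kh, kw = kernels.shape
    height, width = image.shape
    out = np.empty((num_filters, height - kh + 1, width - kw + 1))
    for f in range(num_filters):
        for i in range(height - kh + 1):
            for j in range(width - kw + 1):
                out[f, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[f])
    return out


def forward(image, kernels, dense_w, dense_b, drop_prob=0.3, rng=None):
    """Conv (4 x 4 filters) -> ReLU -> flatten -> dropout -> fully connected output."""
    x = np.maximum(conv2d_valid(image, kernels), 0.0)  # ReLU activation
    x = x.reshape(-1)                                  # flatten
    if rng is not None:                                # training mode: inverted dropout
        keep = rng.random(x.shape) >= drop_prob
        x = x * keep / (1.0 - drop_prob)
    return x @ dense_w + dense_b                       # five regression outputs
```

Passing `rng=None` corresponds to inference, where dropout is disabled and the output is deterministic.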
7 Results
First, we obtain the results separately for the nighttime and daytime data by training different models on each and evaluating them on the same testing set, i.e., the same set of villages for both nighttime and daytime. Then, we combine the predictions of the models performing best on each dataset by taking the average of their results. The results obtained are described below.
7.1 Daytime Model
The convolutional neural network trained on the daytime dataset, with the census socio-economic vector as the target, was run for 550 iterations with two separate optimization algorithms.
7.2 SGD with Momentum
The obtained results are tabulated in Table 3. The curve for training loss using this algorithm is given in Fig. 4. The spikes in the training loss as training progresses are due to the mini-batching used by the SGD algorithm, which updates the weights with respect to the loss obtained on a single batch at each iteration. Some mini-batches contain, by chance, unlucky data for the optimization, inducing the spikes seen in the cost function.
7.2.1 ADAM Optimizer
The obtained results are tabulated in Table 4. The curve for training loss using this algorithm is given in Fig. 5.
Table 3 Results on daytime data using SGD with momentum
Result parameter | Value
Iterations | 550
Euclidean loss (Training set) | 1.1
Euclidean loss (Test set) | 0.088
Mean absolute error (Test set) | 0.236
Fig. 4 Training loss for SGD with momentum
Table 4 Results on daytime data using ADAM optimizer
Result parameter | Value
Iterations | 550
Euclidean loss (Training set) | 0.8
Euclidean loss (Test set) | 0.062
Mean absolute error (Test set) | 0.168
Fig. 5 Training loss for ADAM optimizer
Table 5 Results on night data using ridge regression
Result parameter | Value
Euclidean loss (Training set) | 0.0353
Mean absolute error (Training set) | 0.1314
Euclidean loss (Test set) | 0.0342
Mean absolute error (Test set) | 0.1289
Table 6 Results on night data using convolution neural network
Result parameter | Value
Euclidean loss (Training set) | 0.0282
Mean absolute error (Training set) | 0.1264
Euclidean loss (Test set) | 0.0300
Mean absolute error (Test set) | 0.1228
7.3 Nighttime Model
The nighttime models were first trained on a training set of 5400 nightlight intensity images using two different models, the results of which are given here.
7.3.1 Ridge Regression
The obtained results are tabulated in Table 5.
7.3.2 Convolution Neural Network
The obtained results are tabulated in Table 6.
7.4 Final Combined Result
Taking a weighted average of the daytime and nighttime models based on their Euclidean loss values, the results of Table 7 are obtained.
Table 7 Combined results on day and nighttime data
Result parameter | Value
Euclidean loss (Half test set) | 0.029
Mean absolute error (Half test set) | 0.120
Euclidean loss (Full test set) | 0.029
Mean absolute error (Test set) | 0.119
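One plausible way to form a loss-based weighted average of two models is sketched below. The text does not state the exact weighting formula, so the inverse-loss weights used here, and the function names, are assumptions made purely for illustration.

```python
import numpy as np


def inverse_loss_weights(losses):
    """Give each model a weight proportional to 1/loss, normalized to sum to 1."""
    inv = 1.0 / np.asarray(losses, dtype=float)
    return inv / inv.sum()


def combine_predictions(predictions, losses):
    """Weighted average of per-model predictions, each of shape (N, D)."""
    weights = inverse_loss_weights(losses)
    stacked = np.stack(predictions)  # shape (M, N, D) for M models
    return np.tensordot(weights, stacked, axes=1)
```

Under this assumed scheme, the model with the lower Euclidean loss on the test set contributes more to the combined prediction.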
8 Conclusion
Today, it is very difficult to find reliable, high-frequency data; the available census data is also error-prone and expensive. In this paper, a machine learning approach is presented that uses satellite images taken in real time at both nighttime and daytime. The presented approach can be a useful tool for addressing various development issues, such as poverty, health, agriculture, sanitation, and other resource management, in India and other developing countries.
References
1. L. Bhandari, K. Roy Chowdhury, Night lights and economic activity in India: a study using DMSP-OLS night time images. Proc. Asia-Pac. Adv. Netw. 32, 218–236 (2011). https://doi.org/10.7125/apan.32.24; http://dx.doi.org/10.7125/APAN.32.24. ISSN 2227-3026
2. P.K. Suraj, A. Gupta, M. Sharma, S.B. Paul, S. Banerjee, On monitoring development indicators using high-resolution satellite images (2018). arXiv:1712.02282v3
3. C.D. Elvidge, P.C. Sutton, T. Ghosh, B.T. Tuttle, K.E. Baugh, B. Bhaduri, E. Bright, A global poverty map derived from satellite data. Comput. Geosci. 35(8), 1652–1660 (2009)
4. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization (2014). arXiv:1412.6980
5. A. Albert, J. Kaur, M.C. Gonzalez, Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD (2017), pp. 1357–1366
6. N. Jean, M. Burke, M. Xie, W.M. Davis, D.B. Lobell, S. Ermon, Combining satellite imagery and machine learning to predict poverty. Science 353(6301), 790–794 (2016)
7. The Ministry of Home Affairs, Government of India. Census Data. http://www.censusindia.gov.in/2011-Common/CensusData2011.html
8. P.C. Mahalanobis, On the generalised distance in statistics. Proc. National Inst. Sci. India 2(1), 49–55 (1936). Retrieved 27 Sept 2016
9. The Ministry of Science and Technology, Government of India, Survey of India (2017). http://www.surveyofindia.gov.in
10. Google Static Maps API. https://developers.google.com/maps/documentation/staticmaps/
11. NOAA/NGDC Earth Observation Group. National Geophysical Data Center, Version DMSP-OLS Nighttime Lights Time Series (2013)
12. A.V. Buslaev, S.S. Seferbekov, V.I. Iglovikov, Fully Convolutional Network for Automatic Road Extraction from Satellite Imagery (2018). arXiv:1806.05182
13. V. Iglovikov, S. Mushinskiy, O. Vladimir, Satellite Imagery Feature Detection using Deep Convolutional Neural Network: A Kaggle Competition (2017). arXiv:1706.06169
14. K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: delving deep into convolutional nets, in British Machine Vision Conference (2014)
15. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding. CoRR, abs/1408.5093 (2014). http://arxiv.org/abs/1408.5093
16. A. Ng, J. Ngiam, C.Y. Foo, Y. Mai, C. Suen, UFLDL Tutorial (2017). http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
Semantic Space Autoencoder for Cross-Modal Data Retrieval
Shaily Malik and Poonam Bansal
Abstract The primary aim of cross-modal retrieval is to enable the user to retrieve data across different modalities in a flexible manner. In this paper, we tackle the problem of retrieving data across different modalities, where the input is given in one form and relevant data of another type is retrieved as the output, as per the requirement of the user. Most of the techniques used so far have not considered the preservation of feature and semantic information, and as a result they do not obtain effective results. Here, we propose a two-stage learning method that projects multimodal data to low-dimensional embeddings that preserve both feature and semantic information, which enabled us to get satisfactory results. We propose an autoencoder for cross-modal retrieval that can process both visual and textual data based on their semantic similarity.
Keywords Semantic learning · Multimodal data retrieval · Neural network
1 Introduction
Most applications in today's world involve multiple modalities, such as text, images, sound, or video, describing a variety of information. To grasp the information present in all such modalities, one must understand the relationships that exist between them. Although some techniques have already been proposed to resolve this problem, they are unable to produce satisfactory results, as they fail to preserve feature and latent information. Through this work, we elaborate a model that addresses this issue by creating an effective technique to retrieve data across different modalities and produce adequate results [1, 2].
S. Malik (B) · P. Bansal Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India e-mail: [email protected] P. Bansal e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_45
1.1 Multimodal Data In our day-to-day lives, we come across many applications of multimodal data, where information comes from different sources such as images, text, or speech. The content of a web page, for example, is usually described through text, images, or videos that present the same content while exhibiting heterogeneous properties. Most search techniques used so far are single-modality based and do not meet the demands of information retrieval across multimodal data. In this paper, we therefore explore the technical challenges surrounding multimodal data by exploiting the semantic features of the data. This can be achieved by learning how samples that belong to the same semantic category can be mapped into a common space, even though the data come from different modalities, while samples from different semantic groups remain apart. We also need to learn the discriminative features of data generated by heterogeneous multimedia sources. Minimizing the discrimination loss in two spaces, the common representation space and the label space, can be used to meet these needs [3–5]. At the same time, we minimize the distance between the representations of each image-text pair to reduce the cross-modal discrepancy. Semantic code vectors that contain both the feature information and the label information are learned first; the projections are then learned by a multimodal semantic autoencoder, which projects image and text together to the learned code vector and from which the image and text can be reconstructed.
2 Literature Review K. Wang et al. presented an effective and robust approach to retrieving data across modalities, which is more suitable and powerful than traditional single-modality methods. They provided an overview of cross-modal retrieval and summarized a variety of representation methods, which can be classified into two main groups: (a) real-valued representation learning and (b) pairwise representation learning. Several commonly used multimodal datasets are introduced, and the performance of some representative methods on these datasets is evaluated [1]. Another approach for learning common representations of heterogeneous data was proposed in [2]. The learned common representations are both discriminative and modality-invariant for cross-modal retrieval. This goal was achieved by a new approach named DSCMR, which minimizes the discrimination loss and the modality invariance loss simultaneously [2]. A new approach to multimodal data retrieval is discussed in [6]. In this method, modality-specific mappings are learned so that data from different sources can be projected to embeddings in such a way that the original feature information and the semantic information in both modalities are preserved [6]. J. Gu et al. first learn feature-aware semantic code vectors, which combine the information from both the feature spaces and the label space. Afterwards, an encoder-decoder paradigm is used to learn projections that map the image and text to the semantic code vector and recover the original features from the semantic code vector [7]. The authors of [8] gave a new paradigm for cross-modal retrieval. In this method, modality-specific projections are learned so that data from these modalities can be projected to embeddings that preserve the semantic information and the original feature information in both modalities. Feature-aware semantic code vectors, which combine the information from the label space and the feature spaces, are learned first. An encoder-decoder model is then used to learn the projections that map the available textual features and image captions to the semantic code vector, and the semantic code vector is then used to reconstruct the initial features [8]. Other works describe effective methods for feature extraction, the creation of shared subspaces that take high-level semantic information into account, and how to optimize them [9, 10], as well as how to extract features from hand-drawn images [11]. The neural network learns a multimodal embedding space for fragments of images and sentences and reasons about their latent, inter-modal alignments. It is shown that the combination of CNN and RNN [12], CNN visual features [13], and Cross-Modal Generative Adversarial Networks (CM-GANs) [14, 15] can easily achieve superior results compared with using traditional visual features.
3 Cross-Modal Retrieval In this paper, we propose an autoencoder for cross-modal retrieval that can process both visual and textual data. The autoencoder can convert textual data to images and vice versa, and can be used to find similar images or text. It can be applied to machine translation, recommendation systems, image denoising, dimensionality reduction for data visualization, and many other fields. For this work we used the Flickr8k dataset. The image dataset consists of 8092 images, and the text dataset, in JSON format, consists of the captions corresponding to the images. The work is divided into four parts: image-to-text conversion, text-to-text conversion, image-to-image conversion, and text-to-image conversion.
A. Image to Text Image-to-text conversion is achieved through image captioning, which has two components. The first component is an image encoder that takes the image as input and converts it into representations that are meaningful for captioning. A deep convolutional neural network is used as the image encoder. The second component is the caption decoder, which takes the image representations as input and gives the descriptions as output. A GRU is used for caption decoding. To avoid training the image encoder from scratch, we use the pre-final layer activations of an existing image classifier, the Inception network. The representations from the Inception network are fed into the recurrent neural network (RNN). We train the decoder and check its performance by generating captions for random images from the training and testing datasets.
B. Text to Text Text-to-text retrieval is built [16] on the representations developed by the network while captioning the images. The words must be fed to the network in a format that can act as its input, so we begin with randomly initialized word embeddings [17] and explore what the network learnt about words during training. Since high-dimensional data cannot be visualized directly, we use t-SNE, which reduces the number of dimensions while largely preserving the neighbors of each point when converting from the high- to the low-dimensional space. The 100-dimensional representation of a word is taken, and its cosine similarity to all the other words in the data is calculated.
C. Image to Image To find images similar to the image given as input, we apply the same t-SNE technique to visualize the nearest neighbors of the input image. We compute the representation of each image [18] and store the representations in a text file. This part of the work provides the functionality of searching for the image most similar to the one that the user provides as input [19]. We first take the representation of the image provided by the user and then apply cosine similarity to find the closest image in the data.
D. Text to Image Text-to-image conversion is achieved by searching images via captions [20]. For this we perform the reverse of what we did to generate a caption for an image. As the first step, we start with a completely random 300-dimensional tensor as input, rather than the 300-dimensional representation of an image coming from the encoder. In the next step, all layers of the network are frozen, i.e., PyTorch does not calculate gradients for their weights. Treating the randomly generated input tensor as if it came out of the image encoder, we feed it into the caption decoder. The caption that the network generates for this arbitrary input is then compared with the caption given by the user, and the loss is calculated from this comparison. The gradients with respect to the input tensor are calculated to minimize the loss, and the input tensor is changed by taking a tiny step in the direction given by the gradients. We repeat until convergence or until the loss falls below a set threshold. The final input tensor is then used to find the images closest to it by applying cosine similarity.
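The text-to-image search described above, in which the network is kept frozen and only the input is updated, can be illustrated with a toy sketch. It is not the authors' PyTorch code: a fixed linear map stands in for the frozen GRU caption decoder, a squared error stands in for the caption loss, and the gradient is written out by hand instead of being obtained from autograd. The names `FROZEN_WEIGHTS`, `TARGET`, and `optimize_input` are hypothetical.

```python
import random


def decode(weights, x):
    """Frozen stand-in decoder: a fixed linear map (the paper uses a GRU)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def loss_and_grad(weights, x, target):
    """Squared error between decoder output and target, plus its gradient w.r.t. x."""
    residual = [o - t for o, t in zip(decode(weights, x), target)]
    loss = sum(r * r for r in residual)
    # d(loss)/d(x_j) = 2 * sum_i residual_i * W_ij; the weights are never updated.
    grad = [2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
            for j in range(len(x))]
    return loss, grad


def optimize_input(weights, target, dim, lr=0.05, threshold=1e-6,
                   max_steps=5000, seed=0):
    """Start from a random input and take small gradient steps on the input only."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    loss = float("inf")
    for _ in range(max_steps):
        loss, grad = loss_and_grad(weights, x, target)
        if loss < threshold:  # stop once the loss falls below the threshold
            break
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x, loss


# Assumed toy values: a 3-dimensional input and a 2-dimensional "caption" target.
FROZEN_WEIGHTS = [[1.0, 0.5, 0.0], [0.0, 1.0, -0.5]]
TARGET = [1.0, -1.0]
final_x, final_loss = optimize_input(FROZEN_WEIGHTS, TARGET, dim=3)
```

In the setting described in the text, `final_x` would play the role of the final input tensor that is compared with the stored image representations by cosine similarity.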
4 Results and Discussion The image captioning model was trained for 40 epochs, and the average running loss came out to be around 2.84. With the trained model as the base, the functionalities of similar-text and similar-image retrieval were developed. Finally, to retrieve images from text given as input, we performed the reverse of image captioning. The epochs versus loss graph was plotted to see the loss incurred while producing the text-to-image results. Several architectures were tested with different combinations of dense and CNN layers. The resulting architecture configuration (the size and number of layers) showed the best results in cross-validation tests, which corresponds to the optimal usage of training data. Our tests showed that the architecture that uses dense layers for fixed-length vectors and CNN layers for variable-length vectors performed best (Fig. 1). For the text-to-image path, we achieved our objective of encouraging the grounded text feature to generate an image that is similar to the ground-truth one, as shown in Fig. 2a. Although the generated images are of limited quality for complex multi-object scenes, they still contain plausible shapes, colors, and backgrounds when compared with the ground-truth image and the retrieved images. This suggests that our model can capture the complex underlying language-image relations. Cosine similarity is the metric used to measure how similar two images are: it quantifies the degree of similarity between the representations of the two images.
Fig. 1 Epochs versus loss graph to check the loss during text to image conversion
Fig. 2 Results of the Auto encoding process for the image and text modalities
In Fig. 2b, we can see that the retrieved results are highly accurate, as similar text is identified using the cosine similarity among the words. In Fig. 2c, for image-to-image retrieval, the test results indicate that the system can match query images of different resolutions with the images in the database and identify similar images from the dataset. Fig. 2d shows image-to-text retrieval, with the retrieved captions alongside the ground-truth captions. The captions retrieved by our model describe the query images better.
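The cosine-similarity ranking behind the similar-text and similar-image results can be sketched as follows. This is a minimal illustration in plain Python, assuming that each item is already available as a fixed-length representation vector; the file names and the three-dimensional vectors in `STORED` are made up for the example and are not taken from the paper's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, stored, top_k=1):
    """Return the names of the top_k stored vectors closest to the query."""
    ranked = sorted(stored.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical stored representations (the paper uses much longer vectors).
STORED = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
closest = most_similar([0.8, 0.2, 0.1], STORED, top_k=2)
```

The same function serves word-to-word and image-to-image lookups, since only the source of the vectors changes.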
5 Conclusion The autoencoder aims to retrieve relevant data across heterogeneous modalities, i.e., text and images. The key idea is to identify the quantitative similarity in the single-modal subspaces and then transfer it to the common subspace to establish the semantic relationships between unpaired items across modalities. Experiments show that our method outperforms the state-of-the-art approaches in single or pair-based data retrieval tasks. In this paper, we follow the dataset partition and feature extraction strategies with Inception as the image encoder and a Gated Recurrent Unit (GRU) as the sentence, or caption, decoder. Based on the study of various resources, we found that GRU exhibits better performance on certain smaller datasets. We can further examine the impact of the image encoding model on the cross-modal feature embedding by using the VGG19 model instead of Inception, or LSTM instead of GRU, and evaluate the performance for further optimization. Future work includes designing smoother algorithms to summarize multimodal data and multimodal learning with limited and noisy annotations. We can also work on improving scalability on large-scale data and on finer-level cross-modal semantic correlation modeling.
References
1. K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, in Senior Member (IEEE, 2016)
2. L. Zhen, P. Hu, X. Wang, D. Peng, Deep supervised cross-modal retrieval, machine intelligence laboratory, in College of Computer Science (Sichuan University Chengdu, 610065, China, 2019), pp. 10394–10403
3. X. Zhai, Y. Peng, J. Xiao, Learning cross-media joint representation with sparse and semi supervised regularization. IEEE Trans. Circuits Syst. Video Technol. 24(6), 965–978 (2014)
4. Y.T. Zhuang, Y.F. Wang, F. Wu, Y. Zhang, W.M. Lu, Supervised coupled dictionary learning with group structures for multi-modal retrieval, in AAAI Conference on Artificial Intelligence (2013)
5. C. Wang, H. Yang, C. Meinel, Deep semantic mapping for cross-modal retrieval, in International Conference on Tools with Artificial Intelligence (2015), pp. 234–241
6. Y. Wu, S. Wang, Q. Huang, Multi-modal semantic autoencoder for cross-modal retrieval. Neurocomputing (2018). https://doi.org/10.1016/j.neucom.2018.11.042
7. J. Gu, J. Cai, S. Joty, L. Niu, G. Wang, Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (Hangzhou, China, CVPR, 2018)
8. V. Ranjan, N. Rasiwasia, C.V. Jawahar, Multi-label cross-modal retrieval, in 2015 IEEE International Conference on Computer Vision (ICCV) (Santiago, 2015), pp. 4094–4102. https://doi.org/10.1109/iccv.2015.466
9. T. Yao, T. Mei, C.-W. Ngo, Learning query and image similarities with ranking canonical correlation analysis, in International Conference on Computer Vision (2015), pp. 28–36
10. R. Socher, A. Karpathy, Q.V. Le, C.D. Manning, A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014)
11. Y. Jhansi, E. Sreenivasa Reddy, Sketch based image retrieval with cosine similarity. ANU College of Engineering, Acharya Nagarjuna University, India, Int. J. Adv. Res. Comput. Sci. 8(3) (2017)
12. A. Karpathy, A. Joulin, F. Li, Deep fragment embeddings for bidirectional image sentence mapping, in Advances in Neural Information Processing Systems (2014), pp. 1889–1897
13. Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, S. Yan, Cross-modal retrieval with cnn visual features: a new baseline, in IEEE Transactions on Cybernetics, p. Preprint
14. Y. Peng, J. Qi, Y. Yuan, Cross-modal generative adversarial networks for common representation learning. TMM (2017)
15. X.-Y. Jing, R.-M. Hu, Y.-P. Zhu, S.-S. Wu, C. Liang, J.-Y. Yang, Intra view and interview supervised correlation analysis for multi-view feature learning, in AAAI Conference on Artificial Intelligence (2014), pp. 1882–1889
16. J. Martinez-Gil, An overview of textual semantic similarity measures based on web intelligence. https://doi.org/10.1007/s10462-012-9349-8
17. Y. Lu, Z. Lai, X. Li, Fellow, IEEE, D. Zhang, Fellow, IEEE, W. KeungWong, C. Yuan, Learning Parts-Based and Global Representation for Image Classification. https://doi.org/10.1109/tcsvt.2017.2749980
18. Y. Gong, Q. Ke, M. Isard, S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vision 106(2), 210–233 (2014)
19. L. Yang, V.C. Bhavsar, H. Boley, On semantic concept similarity methods, in Proceedings of International Conference on Information and Communication Technology and System (2008), pp. 4–11
20. F. Cararra, A. Esuli, T. Fagni, F. Falchi, A. Moreo, Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions (2016). arXiv:1606.07287v1[cs.IR]
A Novel Approach to Classify Cardiac Arrhythmia Using Different Machine Learning Techniques Parag Jain, C. S. Arjun Babu, Sahana Mohandoss, Nidhin Anisham, Shivakumar Gadade, A. Srinivas, and Rajasekar Mohan
Abstract Cardiovascular disease is the major cause of death around the world. Arrhythmia is one such disease, in which the heart beats with an abnormal rhythm or rate. Detecting and classifying the various types of cardiac arrhythmia is a challenging task for doctors. If it is not done accurately or on time, the patient’s life can be at great risk, as some arrhythmias are serious and some can even cause potentially fatal symptoms. This paper presents an effective solution to help doctors in the critical diagnosis of various types of cardiac arrhythmia. To classify the type of arrhythmia that the patient might be suffering from, the solution uses a variety of machine learning algorithms. The UCI Machine Learning Repository arrhythmia dataset is used for training and testing the model. Implementing the solution can provide a much-needed early diagnosis that proves critical in saving many human lives.
P. Jain · C. S. Arjun Babu · S. Mohandoss · S. Gadade · R. Mohan (B), PES University, Banashankari, Bengaluru 560085, India
N. Anisham, The University of Texas at Dallas, Campbell Rd, Richardson 75080, USA
A. Srinivas, Dayananda Sagar University, Bengaluru 560068, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_46
Keywords Machine learning · ECG recordings · Cardiac Arrhythmia · Ensemble methods · Hard voting · Healthcare · Feature selection
1 Introduction The heart of a healthy human being beats at a rate of 60–100 beats per minute in a periodic sinus rhythm, which is maintained by the heart’s electrical system. When there are problems with this electrical system, the heart chambers beat in a random way, or the heart beats too fast or too slow. These conditions are collectively called cardiac arrhythmia. The patient’s history and ECG tests are crucial in the diagnosis of patients suspected of having arrhythmias [1]. A typical electrocardiogram (ECG) tracing comprises a P wave, a QRS complex, and a T wave, which repeat in sequence. A normal ECG tracing is shown in Fig. 1. A cardiologist evaluates ailments based on various parameters of the waves, such as their shape, duration, and amplitude and the PR, QT, and RR intervals [2]. Determining the specific type of arrhythmia is difficult because of the massive amount of information involved and the possibility of miscounting the beats when reading an ECG. Pattern recognition of ECGs by visual interpretation is prone to errors. Some arrhythmias are just slightly uncomfortable, while a few, such as ventricular fibrillation, are deadly [3]. It is therefore pivotal to identify the exact type of arrhythmia that affects the patient. The objective of this paper is to train a machine learning system to categorize the records of the arrhythmia dataset into one of 16 classes. This paper makes the following specific contributions:
• Offers a GUI-based framework to assist doctors in diagnosing patients who are suspected to have a cardiac arrhythmia.
• Predicts the type of arrhythmia that the patient might be suffering from, using an ensemble of trained machine learning models.
• Improves prediction performance over existing work in the same field of study.
Fig. 1 Normal ECG tracing [17]
2 Literature Review In the early days, arrhythmia detection was carried out using conventional statistical methods such as heart rate variability (HRV) analysis [4]. Variations in the indicators of HRV, such as the duration of successive RR intervals and derived statistical parameters such as the root mean square difference and standard deviation, point to the existence of an arrhythmia [4]. The arrhythmia dataset [5] was created, and its classification was proposed, in [6], whose authors developed a new supervised inductive learning algorithm, VFI5, for the classification. Several machine learning algorithms were investigated for the same classification problem in [2]; feature selection using the gradient boosting technique combined with a model trained with SVM was found to give the best results. In [7], Principal Component Analysis (PCA) was used to select features, and arrhythmia was detected using various SVM-based methods: Fuzzy Decision Function, Decision Directed Acyclic Graph, One Against One, and One Against All. In [8], cardiac arrhythmia diagnosis was carried out using the Fisher Score and Least Squares-SVM with a Gaussian radial basis function and a 2D grid search of the parameters. In [9], arrhythmia prediction was accomplished by combining dimensionality reduction by PCA and clustering by Bag of Visual Words with different models based on Random Forest (RF), SVM, Logistic Regression, and kNN. In [10], the arrhythmia dataset was classified by selecting significant features using a wrapper method around RF and normalizing them; the result was then used to train several classifiers, such as Multi-Layer Perceptron, NB, kNN, RF, and SVM.
3 The Dataset and Its Preprocessing Dataset: We use the arrhythmia dataset from the UCI repository [5], which contains records of 452 patients with 279 attributes. Every record contains 4 personal details of the patient (age, weight, gender, and height) and 275 derived attributes of the ECG waves, such as amplitude, width, and vector angle, which are described in [5]. Each record carries the conclusion of an expert cardiologist, which represents the class of arrhythmia. Class 01 indicates a normal ECG, classes 02–15 indicate various types of arrhythmia, and class 16 indicates the remaining unclassified ones.
Preprocessing: Records with abnormal values, such as a height of 500 or 780 cm or an age of 0, were removed. Missing values, represented by “?”, were replaced with the median value of that feature. WEKA [11] was used to visualize the variance of the features. All features with a standard deviation close to zero were then eliminated, as they have very little effect on the final result. Preprocessing yields a clean dataset of 163 features and 420 records.
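Two of the preprocessing steps above, median imputation of “?” entries and removal of near-constant features, can be sketched in plain Python. This is only an illustration under assumptions: the records are taken to be lists of strings as read from a CSV file, the cut-off `min_std` is an arbitrary placeholder because the paper does not state its threshold, and the function names are hypothetical. Outlier removal is omitted because the exact rules are not given.

```python
import statistics


def impute_median(rows, missing="?"):
    """Replace missing entries in each column with that column's median."""
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        observed = [float(r[j]) for r in rows if r[j] != missing]
        medians.append(statistics.median(observed))
    return [[medians[j] if r[j] == missing else float(r[j])
             for j in range(n_cols)] for r in rows]


def drop_low_variance(rows, min_std=1e-6):
    """Keep only columns whose standard deviation exceeds min_std (assumed cut-off)."""
    n_cols = len(rows[0])
    keep = [j for j in range(n_cols)
            if statistics.pstdev([r[j] for r in rows]) > min_std]
    return [[r[j] for j in keep] for r in rows], keep


# Made-up miniature table: the last column is constant and should be dropped.
RAW = [
    ["50", "170", "?", "0"],
    ["60", "?", "0.4", "0"],
    ["70", "180", "0.6", "0"],
]
imputed = impute_median(RAW)
cleaned, kept_columns = drop_low_variance(imputed)
```

On the real data, the same two steps would run after the abnormal records have been removed.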
4 System Description Supervised machine learning techniques are used to solve the classification problem. All of them are implemented in Python. We form different models by training each of the algorithms below with the training dataset.
4.1 Naïve Bayes (NB) NB is derived from Bayes’ theorem. It assumes that the value of a feature is independent of the value of any other feature [12]. In NB, the predicted class is the one with the highest posterior probability, which is given as
$$\text{posterior probability} = \frac{\text{prior probability} \times \text{likelihood}}{\text{evidence}} \tag{1}$$
where the prior probability of a class is the ratio of the number of samples of that class to the total number of samples, and the evidence is the sum of prior probability × likelihood over all classes. Before the likelihood of a class is calculated, the conditional probability P(A|C) of each attribute given that class is calculated from the training sample. It is given as
$$P(A \mid C) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \tag{2}$$
where x is the value of that attribute, σ² is the variance, and μ is the mean of all the values of that attribute in that class. The likelihood is the product of the conditional probabilities of all attributes of that class.
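Equations (1) and (2) can be traced in a small Gaussian NB sketch. It is a plain-Python illustration rather than the authors' implementation: the two-feature toy data and class names are invented, and a small constant is added to each variance to avoid division by zero, which is an assumption of this sketch.

```python
import math
from collections import defaultdict


def gaussian_pdf(x, mean, var):
    """Eq. (2): Gaussian density of x for a given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit(samples, labels):
    """Estimate the prior and the per-attribute (mean, variance) of each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        prior = len(rows) / len(samples)
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9  # smoothing (assumed)
            stats.append((mean, var))
        model[y] = (prior, stats)
    return model


def posteriors(model, x):
    """Eq. (1): prior * likelihood for each class, normalized by the evidence."""
    joint = {}
    for y, (prior, stats) in model.items():
        likelihood = 1.0
        for v, (mean, var) in zip(x, stats):
            likelihood *= gaussian_pdf(v, mean, var)
        joint[y] = prior * likelihood
    evidence = sum(joint.values())
    return {y: j / evidence for y, j in joint.items()}


# Invented toy data with two features and two classes.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
Y = ["normal", "normal", "arrhythmia", "arrhythmia"]
model = fit(X, Y)
probs = posteriors(model, [1.1, 1.9])
predicted = max(probs, key=probs.get)
```

With 163 features, multiplying many small densities can underflow; summing logarithms instead is a common remedy, though the paper does not say whether it was needed.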
4.2 Decision Trees (DT) DT is a classifier that follows a tree structure. We implement a DT using the ID3 algorithm, in which the attribute used to split the data samples is decided by the information gain. The information gain, which describes how effectively a given attribute splits the training sample into the given classes, is given as
$$\mathrm{Gain}(S, A) = E(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \times E(S_v) \tag{3}$$
where $S_v$ is the subset of S for which attribute A has value v, and the entropy E(S) is given as
$$E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{4}$$
where $p_i$ is the proportion of S belonging to class i, c is the number of classes, and S is the set of samples. The data samples are split on the attribute with the highest information gain. The process continues until the entropy becomes zero.
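Equations (3) and (4) can be checked on a tiny example. The sketch below is plain Python and assumes a categorical attribute; how the continuous ECG attributes were discretized for ID3 is not stated in the paper, so that step is left out. The attribute values and labels are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Eq. (4): E(S) = -sum_i p_i * log2(p_i) over the classes present in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Eq. (3): entropy of S minus the weighted entropy of each subset S_v."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for a, y in zip(values, labels) if a == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Invented example: one attribute separates the classes perfectly, one not at all.
LABELS = ["normal", "normal", "arrhythmia", "arrhythmia"]
gain_perfect = information_gain(["low", "low", "high", "high"], LABELS)
gain_useless = information_gain(["low", "high", "low", "high"], LABELS)
```

ID3 would split on the first attribute here, since its gain equals the full entropy of the labels while the second attribute's gain is zero.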
4.3 k-Nearest Neighbors (kNN) The kNN algorithm groups instances based on their similarity. kNN is a lazy learning algorithm: the training instances and their class labels are simply stored, and all computation is postponed until classification [13]. The predicted class is the majority class among the k nearest neighbors of the test instance. In this work, a k value of 3 is used. If two samples p and q have n attributes each, the Euclidean distance d(p, q) is given as
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^{2}} \tag{5}$$
All the training samples are sorted by their Euclidean distance to the test instance, and the k nearest ones are taken as the nearest neighbors.
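The procedure above, with Eq. (5) as the distance and k = 3, can be sketched as follows. The training points and labels are invented for illustration, and ties between classes are broken by whichever label `Counter` encounters first, which is an assumption because the paper does not describe tie handling.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Eq. (5): Euclidean distance between two equal-length samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train_x, train_y, query, k=3):
    """Sort training samples by distance to the query and vote among the k nearest."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented two-dimensional training set with class labels 1 and 2.
TRAIN_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
TRAIN_Y = [1, 1, 1, 2, 2]
near_origin = knn_predict(TRAIN_X, TRAIN_Y, [0.5, 0.5], k=3)
near_far_cluster = knn_predict(TRAIN_X, TRAIN_Y, [5.5, 5.0], k=3)
```

Because the features of the arrhythmia dataset have very different scales, the distance is dominated by the largest ones unless the features are normalized first; whether the authors normalized is not stated.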
4.4 Support Vector Machine (SVM) SVM works by creating hyperplanes that separate the different classes in space. We use the “one-versus-all” approach: one class is taken and a hyperplane is formed that separates it from the rest of the classes. This is done for all the classes, and the label is predicted. Classification depends on the parameters C and gamma and on the specified kernel. The linear SVM kernel function is the dot product of the data point vectors, given as
$$K(x_i, x_j) = x_i^{T} x_j \tag{6}$$
We tried different kernels and values of C and gamma; the best accuracies were obtained with the radial basis function (RBF) kernel [14] for C and gamma values of 100 and 1000, respectively.
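The two kernels mentioned above can be written out directly. The linear kernel follows Eq. (6); the RBF form used below, exp(−gamma · ‖x_i − x_j‖²), is the commonly used definition and is an assumption here, since the paper does not print its RBF formula. The sample vectors are invented.

```python
import math


def linear_kernel(xi, xj):
    """Eq. (6): dot product of two data point vectors."""
    return sum(a * b for a, b in zip(xi, xj))


def rbf_kernel(xi, xj, gamma):
    """Common RBF definition: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


# Invented vectors; gamma values chosen only to show the effect of the parameter.
A = [1.0, 2.0]
B = [3.0, 4.0]
linear_value = linear_kernel(A, B)
rbf_same_point = rbf_kernel(A, A, gamma=1000.0)
rbf_small_gamma = rbf_kernel(A, B, gamma=0.01)
rbf_large_gamma = rbf_kernel(A, B, gamma=1.0)
```

A larger gamma makes the kernel value fall off faster with distance, so each training point influences a smaller neighborhood.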
4.5 Voting Feature Interval (VFI) Classification in each VFI algorithm is based on a majority vote over the class predictions made by each feature. A feature makes its prediction based on the projections of all the training instances on that feature. In VFI, each feature is given equal weight. This paper makes use of all five variations of the VFI algorithm [15].
VFI1 The algorithm constructs feature intervals for all the features of each class. The sum of all the votes, for all the features, is calculated for each class. The class with the majority of votes is identified as the predicted class.
VFI2 This algorithm differs from VFI1 in how it finds the lower bounds of the intervals. The endpoints are selected as the midpoints, instead of the lower bounds.
VFI3 VFI3 is again a modification of VFI1, in how the class counts are determined. This is done to consider the three lower bound types of the range intervals.
VFI4 VFI4 is similar to VFI3, but if the highest and lowest points of a feature are the same for a class, a point interval is constructed instead of a range interval.
VFI5 VFI5 is similar to VFI4; however, it constructs point intervals for all endpoints and treats all values between distinct endpoints as range intervals, excluding the endpoints.
4.6 Ensemble Method The ensemble method takes multiple models and combines them into an aggregate model, which performs better than any of its individual models. We use a hard voting ensemble method. It makes use of all the above algorithms to classify an unknown data record. Each algorithm predicts one of the 16 class labels. The class that is predicted most often is taken as the final predicted class. A GUI is used to display the result.
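Hard voting as described above reduces to counting labels. The sketch below is a plain-Python illustration: the per-model predictions are invented, and ties are resolved in favor of the label that appears first, which is an assumption because the paper does not say how ties are handled.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the class label predicted by the largest number of models."""
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]


# Invented predictions of the nine trained models for one patient record.
MODEL_PREDICTIONS = {
    "NB": 1, "DT": 2, "kNN": 1, "SVM": 1,
    "VFI1": 10, "VFI2": 1, "VFI3": 2, "VFI4": 1, "VFI5": 1,
}
final_class = hard_vote(MODEL_PREDICTIONS)
```

Since five of the nine voters are VFI variants, their votes are unlikely to be independent, which is worth keeping in mind when reading the ensemble result.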
5 Results and Discussions 5.1 k-Fold Cross Validation If a dataset has a very low ratio of the number of data records to the number of features, the accuracy estimates vary a lot across different partitions into training and testing datasets. To mitigate this, we perform k-fold cross validation, in which the original sample is randomly partitioned into k equal subsamples. One of the subsamples is used as validation data for testing the model, and the remaining k − 1 subsamples are used as training data. This is repeated k times, so that each of the k subsamples is used exactly once as validation data. The above models perform at their best with 15-fold cross validation. The architectural design of the system used in this paper is shown in Fig. 2.
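The partitioning scheme above can be sketched as an index generator. This plain-Python illustration uses the 420 records and k = 15 reported in the paper; the shuffling seed and the function name `k_fold_indices` are assumptions, and the sketch does not stratify by class.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)       # random partition of the sample
    folds = [indices[i::k] for i in range(k)]  # k subsamples of (near-)equal size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 420 cleaned records split into 15 folds of 28 records each.
splits = list(k_fold_indices(n_records=420, k=15))
```

Each model would be trained on the 392 training indices of a split and scored on the 28 validation indices, and the 15 accuracies would then be averaged.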
5.2 Performance Analysis We use accuracy as the performance indicator. Accuracy is calculated as the ratio of the number of correct predictions to the number of evaluated records. Fig. 3 shows the accuracy percentage of the classifiers for k = 15 folds. To summarize the figure:
• The low accuracy of the NB algorithm is due to the fact that every feature is assumed to be independent of the others, so the interdependence of the features is not taken into account.
• The DT algorithm overfits the data. This causes incorrect predictions and lowers the accuracy.
• The kNN algorithm is more effective when more neighbors are available. Hence, the accuracy can be improved further with a larger dataset.
• The SVM algorithm gave the highest accuracy, 68.33%. SVM supports different kernels, which can be used to create nonlinear hyperplanes between the classes and thereby increase the accuracy of the model.
• The VFI algorithms also assume feature independence. Their accuracies were better than that of NB, but the training time increased.
Fig. 2 Architecture design for arrhythmia classification
Fig. 3 Accuracy percentage of various classifiers
5.3 Arrhythmia Classification The hard voting ensemble method predicts with an accuracy of 90.71%, a significant improvement in prediction performance. An individual model is prone to different kinds of error, such as variance, noise, and bias on the dataset [16]. This can result in average performance, because each individual model might overfit different parts of the dataset. If the models are reasonably diverse, informed, and independent, the risk of overfitting is reduced, because their individual mistakes are averaged out when all the predictions are merged. The outcomes consequently tend to be substantially better. The core intuition is to develop a “strong learner” from a group of “weak learners”. Ultimately, this paper provides a concrete diagnosis that is highly dependable.
6 Conclusion We have provided a solution to detect the presence of cardiac arrhythmia and to classify it. The approach was to preprocess the arrhythmia dataset, use k-fold cross validation to train various models with machine learning algorithms on the training set, and predict the arrhythmia class on the testing set. Preprocessing the arrhythmia dataset addressed issues such as underfitting and overfitting. k-fold cross validation performed best for a k value of 15. We trained the models using the NB, DT, kNN, SVM, VFI1, VFI2, VFI3, VFI4, and VFI5 algorithms on the training set. Finally, the class of arrhythmia was predicted by the majority vote of these models using the hard voting ensemble method. The paper has achieved a best-in-class accuracy of 90.71%, which is robust and reliable enough for doctors to provide a crucial diagnosis. In predicting the class of arrhythmia, the accuracy of the system thus surpasses that of previous models of a similar type. In the future, the execution time could be reduced by using methods such as multithreading and batch processing.
References
1. T. Harrison, D. Kasper, S. Hauser et al., Harrison’s Principles of Internal Medicine (McGraw-Hill Education, New York [and others], 2018)
2. A. Batra, V. Jawa, Classification of Arrhythmia using conjunction of machine learning algorithms and ECG diagnostic criteria. Int. J. Biol. Biomed. 2016, 1–7 (2016)
3. H. Publishing, Cardiac Arrhythmias-harvard health, in Harvard Health (2020). https://www.health.harvard.edu/a_to_z/cardiac-arrhythmias-a-to-z. Accessed 11 Jan 2020
4. T. Electrophysiology, Heart rate variability. Circulation 93, 1043–1065 (1996). https://doi.org/10.1161/01.cir.93.5.1043
5. (2020) UCI Machine Learning Repository: Arrhythmia Data Set, in Archive.ics.uci.edu. https://archive.ics.uci.edu/ml/datasets/Arrhythmia. Accessed 11 Jan 2020
6. H. Guvenir, B. Acar, G. Demiroz, A. Cekin, A supervised machine learning algorithm for arrhythmia analysis. Comput. Cardiol. (1997). https://doi.org/10.1109/cic.1997.647926
7. N. Kohli, N. Verma, Arrhythmia classification using SVM with selected features. Int. J. Eng. Sci. Technol. (2012). https://doi.org/10.4314/ijest.v3i8.10
8. E. Yılmaz, An expert system based on fisher score and LS-SVM for cardiac Arrhythmia diagnosis. Comput. Math. Methods Med. 1–6 (2013). https://doi.org/10.1155/2013/849674
9. P. Shimpi, S. Shah, M. Shroff, A. Godbole, A machine learning approach for the classification of cardiac arrhythmia. Int. Conf. Comput. Methodol. Commun. (ICCMC) 2017, 603–607 (2017). https://doi.org/10.1109/iccmc.2017.8282537
10. A. Mustaqeem, S.M. Anwar, M. Majid, A.R. Khan, Wrapper method for feature selection to classify cardiac arrhythmia, in 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2017), pp. 3656–3659. https://doi.org/10.1109/embc.2017.8037650
11. W. Badr, Getting started with Weka 3 - machine learning on GUI, in Medium (2019). https://towardsdatascience.com/getting-started-with-weka-3-machine-learning-on-gui-7c58ab684513. Accessed 12 Jan 2020
12. P. Joshi, Artificial Intelligence with Python (Packt Publishing Ltd., Birmingham, UK, 2017)
P. Jain et al.
13. S. Karimifard, A. Ahmadian, M. Khoshnevisan, M.S. Nambakhsh, Morphological heart Arrhythmia detection using hermitian basis functions and kNN classifier. Int. Conf. IEEE Eng. Med. Biol. Soc. 2006, 1367–1370 (2006). https://doi.org/10.1109/iembs.2006.260182
14. A. Alexandridis, E. Chondrodima, N. Giannopoulos, H. Sarimveis, A fast and efficient method for training categorical radial basis function networks. IEEE Trans. Neural Netw. Learn. Syst. 28, 2831–2836 (2017). https://doi.org/10.1109/tnnls.2016.2598722
15. G. Demiröz, Non-Incremental Classification Learning Algorithms Based on Voting Feature Intervals (Bilkent University, M.Sc., 1997)
16. R.R.F. DeFilippi, Boosting, bagging, and stacking-ensemble methods with sklearn and mlens, in Medium (2018). https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de. Accessed 12 Jan 2020
17. (2020) Sinus rhythm, in En.wikipedia.org. https://en.wikipedia.org/wiki/Sinus_rhythm. Accessed 11 Jan 2020
Offline Handwritten Mathematical Expression Evaluator Using Convolutional Neural Network
Amit Choudhary, Savita Ahlawat, Harsh Gupta, Aniruddha Bhandari, Ankur Dhall, and Manish Kumar
Abstract Recognition of offline Handwritten Mathematical Expressions (HME) is a complicated task in the field of computer vision. The method proposed in this paper follows three steps: segmentation, recognition and evaluation of the HME image (which may include multiple mathematical expressions and linear equations). The segmentation of symbols from the image incorporates a novel pre-contour filtration technique to remove distortions from the segmented symbols. Recognition of the segmented symbols is then done using a Convolutional Neural Network trained on an augmented dataset prepared from EMNIST and a custom-built dataset, giving an accuracy of 97% in recognizing the symbols correctly. Finally, the expressions/equations are evaluated by tokenizing them, converting them into postfix expressions and then solving them using a custom-built parser.
Keywords Offline HME image · Symbol segmentation · Multiple expressions · Linear equation · Augmented dataset · CNN
A. Choudhary
Department of Computer Science, Maharaja Surajmal Institute, New Delhi, India
S. Ahlawat (B) · H. Gupta · A. Bhandari · A. Dhall · M. Kumar
Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, New Delhi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_47
1 Introduction
Digitization of offline work is on the rise to increase the longevity of documents, most of which contain mathematical expressions. Since mathematics is almost entirely subsumed in its expressions, it is imperative to digitize these expressions properly to maintain consistency in the digital documents [1]. However, recognition of handwritten mathematical expressions is a difficult task and the topic of many ongoing and concluded research works [2]. Handwritten mathematical expression recognition is of two types: online and offline. The former consists of recognizing the characters by their strokes while they are being written on a tablet or smartphone, while the latter consists of recognizing the characters from an image of a handwritten document. This paper is entirely dedicated to the digitization and evaluation of offline handwritten mathematical expressions. Since mathematics itself is a very wide field, digitizing and evaluating all mathematical symbols is a complex and tedious task. Therefore, only a subset of these symbols is considered in this paper: digits (0–9), arithmetic operators (‘+’, ‘–’, ‘*’, ‘÷’), characters (‘a’, ‘b’, ‘c’, ‘d’, ‘u’, ‘v’, ‘w’, ‘x’, ‘y’, ‘z’) and parentheses. All of these are referred to as symbols in the rest of this paper. The focus of this research is on the segmentation and recognition of multiple arithmetic and linear mathematical expressions from a single image, followed by the evaluation of the successfully recognized expressions. The original contributions and approaches followed in this paper are outlined in the following paragraphs. An image comprising single or multiple mathematical expressions, each either entirely arithmetic or linear, is considered as input.
A new approach of pre-contour filtration is considered in this paper, in which expressions are tightly cropped to remove any noise that might obstruct the segmentation of symbols. The segmented symbols are arranged in their original order using a novel algorithmic technique. These segmented symbols are recognized using a Convolutional Neural Network (CNN), chosen for its state-of-the-art performance in image classification [3], which results in digitized expressions. The evaluation of these digitized expressions is performed by a custom-built string manipulation algorithm. The segmentation of the ‘=’ and ‘÷’ characters results in the detection of separate components instead of a single symbol. This problem is solved by vertically combining the components over the height of the image, resulting in a single segmented image of the symbol. Another problem is the ambiguity between the ‘×’ and ‘*’ characters because of the similarity of their handwritten versions. This problem is solved by considering the succeeding symbol, which should be a digit or an open parenthesis in the case of ‘*’ and any other symbol for ‘×’. Finally, the recognition of offline handwritten symbols was made easier by using a shallow CNN, which was able to capture the complex relationship of strokes constituting a symbol.
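The succeeding-symbol rule for telling ‘*’ from ‘×’ can be sketched as follows. This is only an illustration: the function name, the list-of-strings token representation and the choice of ‘×’ as the initial label for every cross-shaped symbol are assumptions, not details given in the paper.

```python
def resolve_cross(tokens, ambiguous="×"):
    """Relabel each cross-shaped token using the token that follows it."""
    resolved = []
    for i, tok in enumerate(tokens):
        if tok == ambiguous:
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # A digit or '(' after the cross marks it as the operator '*'.
            if nxt is not None and (nxt.isdigit() or nxt == "("):
                tok = "*"
        resolved.append(tok)
    return resolved


print(resolve_cross(["2", "×", "(", "3", "+", "4", ")"]))
print(resolve_cross(["3", "×", "=", "6"]))
```

In the first call the cross is followed by ‘(’ and is relabelled as ‘*’; in the second it is followed by ‘=’ and keeps its ‘×’ label.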
The related work is presented in Sect. 2. The proposed method of this paper is described in Sect. 3. Sections 4 and 5 present the results and the conclusion of the work done.
2 Related Work
A lot of work has been done in the field of handwritten mathematical expression recognition, and some of it was studied before implementing the system proposed in this paper. The systems proposed in [4, 5] normalize the image of the HME. A threshold value of 50px was used. Edge detection followed by morphological transformations is then applied, and the components of the image are separated. Features such as skew, entropy and standard deviation were extracted to improve the accuracy of the neural network. Recognition was done with a backpropagation neural network with adaptive learning. The neural network had 10 input nodes, two hidden layers and one output layer with 10 nodes. The network achieved an accuracy of 99.91% on the training dataset of 5 × 7 pixel images. In [6], the proposed method classified handwritten mathematical symbols. A Convolutional Neural Network (CNN) model was used with a 5 × 5 kernel for the convolutional layer and a 2 × 2 kernel for the max-pooling layer. The sigmoid function was used for non-linearity at every layer of the network, and the log-likelihood cost function was used to check the performance. The CROHME 2014 dataset was used for training, and images were resized to 32 × 32 pixels. The accuracy achieved was 87.72%. A crucial point identified in [6] was that some symbols were misclassified by the CNN because their structure was similar to that of other symbols, a problem that was also faced in this paper. In [7], the main objective was symbol detection from images of HME. Three modified versions of the SSD model were used along with the original SSD model for the detection and classification of mathematical symbols from HME. There were 52353 grayscale images of size 32 × 32 belonging to 106 classes of symbols. The HME dataset contained 2256 images of 552 expressions at 300 × 300 resolution, divided into three sets. The precision for each class and the class weight for each symbol were calculated.
The maximum mAP gain (0.65) was observed in the version of SSD in which one convolution layer was modified and two new layers were added. In [8], the main focus was to recognize and digitize HME. A Convolutional Neural Network with an input shape of 45 × 45, three convolution and max-pooling layers, a fully connected layer and an output shape of 83 was used for the classification of HMS extracted from HME. Preprocessing of HME images included grayscale conversion, noise reduction using median blur, binarization using an adaptive threshold, and thinning to make the foreground 1 pixel thick. Segmentation was then done using the projection profiling algorithm and connected component labelling. The CNN achieved an accuracy of about 87.72% on HMS.
The approach taken in [9] differs from the other proposed systems in that it mainly focuses on Chinese HME. For symbol segmentation, strokes are decomposed, and dynamic programming is then used to find the paths corresponding to the best segmentation and to reduce the stroke search complexity. For symbol recognition, spatial geometry and directional element features are classified by a Gaussian Mixture Model learned through the Expectation-Maximization algorithm. For semantic relationship analysis, a ternary tree is used to store the symbols ranked by their calculated priorities. The system was tested on a dataset consisting of 30 model expressions with a total of about 15000 symbols. The system performs well at the symbol level, but recognition of full expressions shows 17% accuracy. The system proposed in [10] recognizes and evaluates single or grouped handwritten quadratic equations. The NIST dataset and self-prepared symbols are used for training after preprocessing techniques such as grayscale conversion, binarization and low-pass filtering. Horizontal compact projection analysis and combined connected component analysis are used for segmentation. A CNN is applied for the classification of specific characters. The system fully recognized 39.11% of the equations correctly in a set of 1000 images, whereas the character segmentation accuracy was 91.08%. The methodology proposed in [11] uses a CNN for feature extraction, a bidirectional LSTM for encoding the extracted features, and an LSTM with an attention model for generating the target LaTeX. The dataset used is CROHME, with augmentation techniques such as local and global distortion. The recognition network consists of five convolution layers and four max-pooling layers, with no dropout layer. The accuracy obtained on CROHME was 35.19%.
3 Methodology
The proposed work used the EMNIST [12] and Handwritten Math Symbols [13] datasets for the mathematical expression digitizer and evaluator. Each class of operands in the dataset [12] contained images of dimension 32 × 32 × 3. Only 2471 images of each lowercase letter (‘a’, ‘b’, ‘c’, ‘d’, ‘u’, ‘v’, ‘w’, ‘x’, ‘y’ and ‘z’) and each digit (0–9) are selected in the present work. The letters were selected on the basis of the quality of the handwriting in the image, possible similarities with other symbols used in the classifier, and the frequency of occurrence of different letters in different types of equations. The proposed method follows the processing steps shown in the flowchart in Fig. 1. The subsequent subsections elaborate on these steps.
Fig. 1 Proposed method
3.1 Preprocessing of the Input Image
The input image is first preprocessed to prepare it for correct segmentation of symbols.
Illumination- Making the brightness constant throughout the image (185, a value chosen by trial and error) largely eliminates the small, irrelevant contours detected due to variation in natural lighting, thereby improving the accuracy of evaluation.
Grayscale- Converting coloured images to grayscale reduces the processing power and time required. In addition, thresholding and edge detection algorithms work best with grayscale images.
Gaussian Blur- A Gaussian blur with a 7 × 7 kernel worked best of all the blurring techniques tested. Blurring reduces the amount of noise in the image, so that only relevant features are extracted during later processing.
Threshold- A threshold value of 150 is applied to the pixel intensities of the image. It helps separate the digits/foreground (higher pixel values) from the background (lower pixel values).
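The grayscale and threshold steps can be sketched without any image library as shown below. This is an illustrative simplification: the function names are hypothetical, the luma weights are the common ITU-R BT.601 values rather than anything stated in the paper, and the illumination and Gaussian blur steps are left out for brevity.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to rows of 0-255 luminance values."""
    # ITU-R BT.601 luma weights, a common choice for RGB-to-gray conversion.
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]


def binary_threshold(gray_image, threshold=150):
    """Map pixels above the threshold to 255 (foreground), the rest to 0."""
    return [[255 if value > threshold else 0 for value in row]
            for row in gray_image]


# Hypothetical 2 x 2 image: two bright pixels and two dark ones.
image = [[(250, 250, 250), (10, 10, 10)],
         [(0, 0, 0), (200, 180, 220)]]
binary = binary_threshold(to_grayscale(image))
print(binary)  # [[255, 0], [0, 255]]
```

In practice the same steps would normally be delegated to an image library such as OpenCV, which the paper uses for contour detection.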
3.2 Segmentation of Preprocessed Image
Segmentation involves extracting the symbols from the image and then sorting them in digitized form according to their original order in the image. The methods adopted for Tight Crop, Contour Detection, Padding Contours, Extending Contours and Sorting Contours are explained below.
Tight Crop- A pre-contour filtration technique is used to remove very small, meaningless contours, which may be marks on the paper or simply noise.
The contours are initially detected using the OpenCV library function findContours() with the RETR_TREE retrieval mode, because it returns all the contours in the expression image. Detected contours with Contour Area < (0.002 * Total Area of the Image) are removed, eliminating the small and irrelevant contours that would otherwise have affected the overall accuracy of the expression during evaluation. The value 0.002 was chosen by trial and error. After the removal of small contours, the expression image is tightly cropped to the minimum and maximum x and y coordinates obtained from all of the detected contours.
Contour Detection- A threshold of 120 is applied to remove any remaining noise from the tightly cropped expression image obtained in the previous step. To extract the digits and operators from the resultant image, the OpenCV findContours() function is used with the RETR_EXTERNAL retrieval mode, so that only the extreme outer contour containing the complete digit or operator is returned. To filter out small contours that might be subparts of operands and operators, all contours with Contour Area < (0.002 * Total Area of the Image) are removed. Here too, the value 0.002 was finalized after running several trials and observing the error.
Padding Contours- The contours obtained from the input image are tightly cropped, so they are padded using the copyMakeBorder() function of OpenCV. In the present work, each contour is padded with 40 pixels on all sides, ensuring that the symbol within the contour is centrally aligned for easy detection by the neural network.
Extending Contours- In the present work, contours with x-coordinate length >= (2 * y-coordinate length) are extended vertically in both directions by 0.5 times the difference between the length and breadth of the image. This also solved the problem of detecting the ‘=’ and ‘÷’ symbols as single operators.
The resultant images are shown in Fig. 2a, b.
Sorting Contours- Since the expression image can contain multiple lines of mathematical expressions, it is important to sort the operators and operands in the correct order so that the expressions can be solved correctly. Therefore, in the present work, the contours are sorted using the following processing steps.
(a) Each contour is assigned to the appropriate expression row according to its minimum and maximum y coordinate values.
(b) The contours grouped into each row of mathematical expression are stored in separate arrays.
(c) All contours in each mathematical expression are sorted according to their x coordinate values, thereby arranging the contours in the order in which they appear in the input image.
Fig. 2 a Extracted contour b Extended contour
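The area filter, the vertical extension of wide contours and the row-wise sorting described in this subsection can be sketched on plain bounding boxes of the form (x, y, w, h). Several points here are assumptions made for illustration: bounding-box area stands in for contour area, the extension amount is taken as 0.5 × (w − h) of the box itself, rows are grouped by vertical overlap, and all function names are hypothetical.

```python
def filter_small(boxes, image_w, image_h, ratio=0.002):
    """Drop boxes whose area is below ratio * image area."""
    min_area = ratio * image_w * image_h
    return [b for b in boxes if b[2] * b[3] >= min_area]


def extend_wide(box):
    """Grow a wide box (w >= 2h) vertically by 0.5 * (w - h) on each side."""
    x, y, w, h = box
    if w >= 2 * h:
        grow = (w - h) // 2
        # y may become negative near the top edge; callers should clamp it.
        return (x, y - grow, w, h + 2 * grow)
    return box


def sort_reading_order(boxes):
    """Group boxes into rows by vertical overlap, then sort each row by x."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        top, bottom = box[1], box[1] + box[3]
        for row in rows:
            if top < row["bottom"] and bottom > row["top"]:
                row["boxes"].append(box)
                row["top"] = min(row["top"], top)
                row["bottom"] = max(row["bottom"], bottom)
                break
        else:
            rows.append({"top": top, "bottom": bottom, "boxes": [box]})
    return [sorted(row["boxes"], key=lambda b: b[0]) for row in rows]


# Hypothetical boxes from a 200 x 100 image holding two expression rows.
boxes = [(50, 10, 20, 30), (10, 12, 20, 28), (30, 60, 20, 30),
         (5, 62, 20, 25), (0, 0, 2, 2)]
kept = filter_small(boxes, image_w=200, image_h=100)
print(sort_reading_order(kept))
```

The 2 × 2 speck is discarded by the area filter, and the remaining boxes come out as two rows, each ordered from left to right.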
3.3 Augmenting the Dataset
The Handwritten Math Symbols dataset was used for the operator images [13]. However, this dataset had only a few sample images of the division (÷) operator. Therefore, the proposed work performed augmentation of the images. The augmentation process combines several steps, including rotating the images upside down and laterally inverting them.
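The two augmentation operations named above can be sketched on an image stored as a list of pixel rows. The function names and the de-duplication of identical variants are assumptions for illustration; the paper mentions further steps that are not described and are therefore not shown.

```python
def rotate_180(image):
    """Rotate the image upside down (reverse the rows and each row)."""
    return [row[::-1] for row in image[::-1]]


def mirror(image):
    """Laterally invert the image (reverse each row)."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the original plus its distinct rotated and mirrored variants."""
    unique = []
    for variant in (image, rotate_180(image), mirror(image)):
        if variant not in unique:
            unique.append(variant)
    return unique


sample = [[1, 2],
          [3, 4]]
print(augment(sample))
```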
3.4 Preprocessing the Dataset Images
The present work used only 2471 images, of dimension 45 × 45 × 3, from the EMNIST and Handwritten Math Symbols datasets as sample images [12, 13]. The sample image set contains an equal number of images from each class. On careful observation, the writing strokes in the sample images were found to be thin and only partially visible. To overcome this problem, the following preprocessing steps are applied to each image:
• Dilation- A 3 × 3 matrix of ones is used for dilation, which smoothed images that had become pixelated due to the increase in size.
• Threshold- A threshold value is chosen to separate the foreground (higher pixel values) from the background (lower pixel values). In the present work, a threshold value of 150 is used for all symbol images except the ‘÷’ operator, for which the threshold value is 235. The threshold values were finalized on a trial and error basis.
The sample images of digits, variables and parentheses required further preprocessing, which is elaborated below.
Preprocessing of Digits and Variables
The following preprocessing steps are applied to improve the quality of the sample images and ensure good classification by the neural network:
(a) Inverting the RGB values- The RGB values of each image are inverted (i.e. subtracted from 255), since the images in the dataset [12] were white text on a black background.
(b) Resizing Images- Each image is resized from 32 × 32 × 3 to 45 × 45 × 3 to match the size of the operator images.
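Dilation with a 3 × 3 matrix of ones and inversion of pixel values can be sketched as follows on single-channel images. The function names are hypothetical, and the sketch assumes that dilation is applied while the strokes are the brighter pixels, since a maximum filter thickens bright regions; the paper does not state the order of the steps.

```python
def dilate_3x3(image):
    """Grayscale dilation with a 3x3 kernel of ones: each output pixel is
    the maximum of the input pixels in its 3x3 neighbourhood."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neighbours = [image[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(max(neighbours))
        out.append(row)
    return out


def invert(image):
    """Subtract every pixel value from 255."""
    return [[255 - value for value in row] for row in image]


# A single bright pixel grows into a 3 x 3 bright block after dilation.
dot = [[0, 0, 0],
       [0, 255, 0],
       [0, 0, 0]]
print(dilate_3x3(dot))
print(invert(dot))
```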
Fig. 3 a Original image b Image after padding and resizing
Preprocessing Parenthesis
The images of parentheses in the dataset [13] are very similar to the digit ‘1’, which would result in incorrect recognition of digits. To solve this problem, these images were preprocessed as follows to make them look more like parentheses:
1. Padding Images. Each image is padded with 14 white pixels on the top and the bottom of the image to increase the bulge in the centre of the parenthesis.
2. Resizing Images. Each image has a size of 45 × 73 × 3 after the padding applied in the previous step. The images were resized back to their original size of 45 × 45 × 3.
It is clearly visible from Fig. 3a, b that, after applying the preprocessing techniques in Steps 1 and 2, the resultant parenthesis images are more curved and bear a greater resemblance to actual handwritten parentheses than the images in the dataset before preprocessing.
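Steps 1 and 2 can be sketched on a single-channel image as shown below. The function names and the nearest-neighbour interpolation are assumptions for illustration; the paper works on three-channel images and does not state which interpolation it uses.

```python
WHITE = 255


def pad_top_bottom(image, pad=14, value=WHITE):
    """Add `pad` rows of `value` above and below the image."""
    width = len(image[0])
    top = [[value] * width for _ in range(pad)]
    bottom = [[value] * width for _ in range(pad)]
    return top + [row[:] for row in image] + bottom


def resize_nearest(image, new_h, new_w):
    """Resize with nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    return [[image[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]


# Hypothetical 45 x 45 all-black image standing in for a parenthesis sample.
original = [[0] * 45 for _ in range(45)]
padded = pad_top_bottom(original)            # 73 rows, 45 columns
squeezed = resize_nearest(padded, 45, 45)    # back to 45 x 45
print(len(padded), len(padded[0]), len(squeezed), len(squeezed[0]))
```

Squeezing the taller padded image back to 45 rows compresses the stroke vertically while its width is unchanged, which is what makes the parenthesis look more curved.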
3.5 Recognizing Symbol Using CNN
In this step, a deep neural network is created and trained on the sample images. The recognized symbols are stored and passed to the next step for equation evaluation. The details are as follows:
Creating and Training the Deep Neural Network
A convolutional neural network was created to classify the different digits, operators and variables. The network is made up of three convolutional layers, two fully connected layers, and two dropout layers, as shown in Fig. 4.
3.6 Solving the Digitized Expression
The expression obtained after classification is a stream of characters stored in a string. Different operations are performed for arithmetic expressions and linear equations. The steps are as follows:
Fig. 4 Convolutional neural network architecture
(a) Tokenizing the string- The stream of characters is converted into a list of tokens in the order in which they appear in the expression.
(b) Creating a parser to solve the arithmetic equations- First, a function is created that converts the string detected by the neural network into a list of strings, each containing an operator or an operand. Then a function to solve the arithmetic equations is created. It first determines whether to check the correctness of the expression or to solve it. To solve it, the expression is converted into a postfix expression according to the BODMAS rule, and the postfix expression is then evaluated.
(c) Extracting the coefficients from the linear equation- If the expression is a set of linear equations, the equations are passed to a function that solves them and returns the values of the variables used.
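Steps (a) and (b) can be sketched with a tokenizer, a shunting-yard conversion to postfix and a stack-based evaluator. This is a minimal sketch rather than the authors' parser: the function names are hypothetical, ASCII ‘-’ and ‘/’ stand in for ‘–’ and ‘÷’, only non-negative integers are handled, and the correctness check and the linear-equation solver of step (c) are not covered.

```python
import operator
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
OPERATIONS = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "/": operator.truediv}


def tokenize(expression):
    """Split a string such as '2+3*(4-1)' into numbers, operators, brackets."""
    return re.findall(r"\d+|[-+*/()]", expression)


def to_postfix(tokens):
    """Shunting-yard conversion of infix tokens to postfix order."""
    output, stack = [], []
    for tok in tokens:
        if tok.isdigit():
            output.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # discard the matching "("
        else:
            # Pop operators of higher or equal precedence (left-associative).
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
    while stack:
        output.append(stack.pop())
    return output


def evaluate_postfix(postfix):
    """Evaluate a postfix token list with an operand stack."""
    stack = []
    for tok in postfix:
        if tok.isdigit():
            stack.append(int(tok))
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPERATIONS[tok](left, right))
    return stack[0]


postfix = to_postfix(tokenize("2+3*(4-1)"))
print(postfix, evaluate_postfix(postfix))
```

Converting to postfix first means the evaluator never has to reason about precedence or brackets; it only pops two operands for each operator it meets.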
4 Experimental Result
The experimental results are shown in Table 1, which reports the outcome of the k-fold cross-validation approach used to train the convolutional neural network for better results and less overfitting.
Table 1 Accuracy after each fold of CNN
Cross-validation fold   Training accuracy (max)   Training loss (min)   Validation accuracy (max)   Validation loss (min)
1                       0.9639                    0.1115                0.9786                      0.0635
2                       0.9787                    0.0686                0.9894                      0.0260
3                       0.9882                    0.0371                0.9955                      0.0144
4                       0.9908                    0.0287                0.9968                      0.0108
5                       0.9904                    0.0291                0.9982                      0.0064
6                       0.9946                    0.0162                0.9989                      0.0028
7                       0.9955                    0.0150                0.9993                      0.0019
8                       0.9948                    0.0177                0.9991                      0.0042
9                       0.9962                    0.0137                0.9993                      0.0015
10                      0.9966                    0.0111                0.9998                      0.0006
The problem of detecting ‘=’ and ‘÷’ was overcome by our proposed approach of sorting the contours according to their x coordinates, which is similar to the approach used in [10]. The ambiguity between the ‘*’ and ‘×’ symbols was resolved by checking whether the succeeding token in the string is a digit or ‘(’, in which case the symbol is ‘*’; otherwise it is ‘×’. The augmentation and preprocessing of the training images used for the customized convolutional neural network helped improve the classification of letters and parentheses. The proposed system was also tested on various self-shot images containing various kinds of expressions and equations, and it was able to recognize and evaluate them successfully.
5 Conclusion and Future Work
This paper focused on the segmentation, recognition and evaluation of offline handwritten mathematical expressions. Only arithmetic and linear mathematical expressions were considered. A pre-contour filtration technique was proposed to remove distortions from segmented symbols, and it reduced the noise in the images to a great extent. The sorting technique preserved the equation order for any number of expressions in the image. The customized convolutional neural network gave a convincing result, with an accuracy of 97% in recognizing the segmented symbols. Finally, correct evaluation was achieved for the tested expressions. In future work, the segmentation technique needs to be improved further, because sub-parts of symbols are being detected as separate contours, leading to erroneous detection. The proposed method can be extended to quadratic equations by employing the same segmentation and recognition techniques. Also, the dataset can be expanded to include the remaining letters of the alphabet, thus increasing the domain of the system.
References
1. A.M. Awal, H. Mouchère, C.V. Gaudin, Towards handwritten mathematical expression recognition, in 10th International Conference on Document Analysis and Recognition (2009), pp. 1046–1050
2. C. Lu, K. Mohan, Recognition of Online Handwritten Mathematical Expressions Using Convolutional Neural Networks (2015), pp. 1–7
3. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Neural Inf. Process. Syst. 1(25), 1097–1105 (2012)
4. S. Shinde, R.B. Waghulade, D.S. Bormane, A new neural network based algorithm for identifying handwritten mathematical equations, in International Conference on Trends in Electronics and Informatics ICEI (2017), pp. 204–209
5. S. Shinde, R.B. Waghulade, An improved algorithm for recognizing mathematical equations by using machine learning approach and hybrid feature extraction technique, in International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE2017) (2017), pp. 1–7
6. I. Ramadhan, B. Purnama, S.A. Faraby, Convolutional neural networks applied to handwritten mathematical symbols classification, in Fourth International Conference on Information and Communication Technologies (ICoICT) (2016), pp. 1–4
7. G.S. Tran, C.K. Huynh, T.S. Le, T.P. Phan, Handwritten mathematical expression recognition using convolutional neural network, in 3rd International Conference on Control, Robotics and Cybernetics (CRC) (2018), pp. 15–19
8. L. D’Souza, M. Mascarenhas, Offline handwritten mathematical expression recognition using convolutional neural network, in International Conference on Information, Communication, Engineering and Technology (ICICET) (2018), pp. 1–3
9. Y. Hu, L. Peng, Y. Tang, On-line handwritten mathematical expression recognition method based on statistical and semantic analysis, in 11th IAPR International Workshop on Document Analysis Systems (2014), pp. 171–175
10. M.B. Hossain, F. Naznin, Y.A. Joarder, M.Z. Islam, M.J. Uddin, Recognition and solution for handwritten equation using convolutional neural network, in Joint 7th International Conference on Informatics, Electronics and Vision (ICIEV) (2018)
11. A.D. Le, M. Nakagawa, Training an end-to-end system for handwritten mathematical expression recognition by generated patterns, in 14th IAPR International Conference on Document Analysis and Recognition (2017), pp. 1056–1061
12. G. Cohen, S. Afshar, J. Tapson, A. Van Schaik, EMNIST: an extension of MNIST to handwritten letters, in International Joint Conference on Neural Networks (IJCNN) (2017), pp. 2921–2926
13. Handwritten Math Symbols Dataset [Online]. Available: https://www.kaggle.com/xainano/handwrittenmathsymbols
14. Y. Chajri, A. Maarir, B. Bouikhalene, A comparative study of handwritten mathematical symbols recognition, in International Conference Computer Graphics, Imaging and Visualization (CGIV), vol. 1(13) (2016), pp. 448–451
15. A.M. Hambal, Z. Pei, F.L. Ishabailu, Image noise reduction and filtering techniques. Int. J. Sci. Res. (IJSR) 6(3), 2033–2038 (2017)
An Empirical Study on Diabetes Mellitus Prediction Using Apriori Algorithm
Md. Tanvir Islam, M. Raihan, Fahmida Farzana, Promila Ghosh, and Shakil Ahmed Shaj
Abstract Diabetes mellitus refers to a group of diseases that affect how the human body uses sugar. Sugar plays a vital role, as it is the main source of energy for the cells that make up muscles and tissues. Any condition that makes it difficult to maintain a normal blood sugar level can therefore create serious problems. Diabetes results in an abnormal sugar level in the blood and can occur due to several factors such as bad diet, obesity, hypertension, increasing age and depression. Diabetes can lead to cardiovascular disease, hearing impairment, and damage to the kidneys, brain, feet, skin, nerves and eyes. With this in mind, in this study we have tried to build rules using the Association Rule Mining technique with various diabetes symptoms and factors to predict diabetes efficiently. We obtained 8 rules using the Apriori algorithm.
Keywords Diabetes mellitus · Diabetes prediction · Machine learning · Association rule mining · Apriori algorithm
Md. Tanvir Islam · M. Raihan (B) · F. Farzana · P. Ghosh · S. Ahmed Shaj
North Western University, Khulna, Bangladesh
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_48
1 Introduction
Diabetes is currently one of the major health care issues and is spreading very fast. Generally, it appears when the blood sugar level rises above the normal level [1]. From a study, we came to know that around 12,550 people had diabetes at mature ages, and the development of type-2 diabetes (T2D) was almost 2.5 times. Additionally, each pathophysiological disease entity serves to exacerbate the other. Both hypertension and diabetes increase the chances of cardiovascular disease (CVD) and renal disease [2]. Bangladesh is one of the six nations of the International Diabetes Federation (IDF) South-East Asia (SEA) region. Around 425 million individuals worldwide have diabetes, including 82 million in the SEA region; by 2045 the latter figure will rise to 151 million [3]. The motive of our analysis is to build a model that identifies diabetes accurately using a machine learning algorithm. The purpose of our study is to find the relationships between diabetes risk factors that increase the probability of developing diabetes. The rest of the manuscript is arranged as follows: Sects. 2 and 3 present the related works and the methodology, respectively, with a view to justifying the choice of classifier algorithms. Section 4 discusses the outcome of this analysis to justify the novelty of this work. Finally, Sect. 5 concludes the paper.
2 Related Works Vrushali R. Balpande et al. proposed an algorithm that provides severity in terms of a ratio interpreted as the impact of diabetes. The patterns are generated in step 7, which handles frequent pattern generation rather than Apriori [4]. Another research team surveyed data mining methods combined with Decision Trees and Association Rules. They applied the Apriori-Genetic algorithm to 1251 different cases for T2D and tried to show that the interaction of multiple SNPs is associated with diabetes [5]. Another research team applied the Apriori algorithm to a dataset of 768 instances with 8 numeric variables. Their model generates optimal rules where the coverage is 74 and the confidence is 100% under the condition of low pregnancy, normal diastolic blood pressure, and low Diabetes Pedigree Function (DPF) [6]. Similarly, a study using Bottom-up Summarization (BUS) and Association Rule summarization techniques found 10 rules [7]. An analysis of association rules from 1-sequence patterns generated a medication trajectory graph for visual comparison, to quickly identify interesting patterns with a minimum support value of 0.001 [8]. Likewise, an analysis of gestational diabetes mellitus used several methods such as Iterative Dichotomiser 3 (ID3), C4.5, T-test, and F-test. The dataset consisted of 3075 instances with seven attributes, and the study produced two rules. The risk factors are Polyhydramnios, Preeclampsia, Infections, Risk of Operative Delivery, Puerperal Sepsis, Wound Infection, and Ketoacidosis [9]. Another study predicted the associations of pleiotropic genes by data mining on the Human Leukocyte Antigen (HLA) gene complex in relation to Type-1 Diabetes (T1D). HLA is the human form of the Major Histocompatibility Complex (MHC), and HLA types are inherited [10]. In a health informatics study, 33 attributes were used with the minimum support set to 10% and the minimum confidence set to 90%. For heart disease, 25 medical risk factors were used with a minimum support of 1%, a minimum confidence of 70%, a maximum rule size of 4, a minimum lift of 1.20, and a minimum lift of 2.0 for cover rules. In other research associated with diabetes, a dataset was analyzed using a Frequent Pattern Tree (FP-Tree) based on Association Rules (AR); the achieved accuracy was 95%, while sensitivity and specificity were 97 and 96%, respectively [11]. So, Association Rule Mining techniques are clearly useful for predicting diabetes efficiently.
3 Methodology Our study has been performed in several steps. They are as follows:
– Data Collection
– Data Preprocessing
– Dataset Training
– Association Rule Mining
– Tools and Simulation Environment.
Figure 1 shows the overall work flow of our study.
3.1 Collection of Data The dataset we have used for this study was collected from various diagnostic centers located in Khulna, Bangladesh. It contains 464 instances and each instance has 22 unique features shown in Table 1 where 48.92% of the total are male and 51.08% are female.
Fig. 1 Work flow of the study: Start → Import collected data with 464 instances and 22 features → Preprocess data with median() and trimmedMean() → Apply Machine Learning Association algorithm → Select Support and Confidence values → Apriori algorithm → Determine rules → End
Table 1 Features list
Attribute | Subcategory | Data distribution (Mean, Median)
Age | L.V. = 20 yrs; H.V. = 83 yrs | 41.65, 40.40
Gender | Male; Female | 48.92%; 51.08%
Drug history | Yes; No | 57.11%; 42.89%
Weight loss | Yes; No | 46.34%; 53.66%
Diastolic blood pressure (bp) | L.V. = 80 mmHg; H.V. = 170 mmHg | 117.8, 120
Systolic blood pressure (bp) | L.V. = 50 mmHg; H.V. = 110 mmHg | 77.67, 80
Duration of diabetes | L.V. = 0 day; H.V. = 7300 days | 713.8, 90
Height | L.V. = 138 cm; H.V. = 174 cm | 155.9, 156
Weight | L.V. = 37 kg; H.V. = 85 kg | 60.79, 60
Blood sugar before eating | L.V. = 3.07 mmol/L; H.V. = 20.6 mmol/L | 7.792, 7.205
Blood sugar after eating | L.V. = 5.8 mmol/L; H.V. = 28.09 mmol/L | 13.003, 12.760
Urine color before eating | Nil; Blue; Yellow; Orange; Green; Brick Red; Green Yellow | 46.55%; 10.13%; 9.70%; 15.73%; 16.81%; 0.43%; 0.65%
Urine color after eating | Nil; Blue; Red; Yellow; Orange; Green; Brick Red; Green Yellow | 46.98%; 6.90%; 2.80%; 9.27%; 22.20%; 7.11%; 3.23%; 1.51%
Waist | H.V. = 28 cm; L.V. = 44 cm | 35.22, 35.00
Thirst | Yes; No | 48.06%; 51.94%
Hunger | Yes; No | 44.61%; 55.39%
Relatives | Yes; No | 58.41%; 41.59%
Pain or numbness | Yes; No | 46.55%; 53.45%
Blurred vision | Yes; No | 53.45%; 46.55%
3.2 Data Preprocessing To handle missing data, we used two functions from R-3.5.3: trimmedMean(), which discards a proportion of the highest and lowest observations and then takes the average of the values that remain in the dataset [12], and median(), which computes the middle value [13].
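The two imputation ideas described above can be sketched in a few lines. The study itself used R; the Python below is only an illustration, and the helper names (trimmed_mean, impute), the 20% trimming proportion, and the sample weights are assumptions made for this example, not values from the study.

```python
from statistics import median

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the smallest and of the largest values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

def impute(column, fill):
    """Replace missing entries (None) with fill(observed values)."""
    observed = [v for v in column if v is not None]
    value = fill(observed)
    return [value if v is None else v for v in column]

# Hypothetical weight column (kg) with two missing entries and one outlier.
weights = [52, 60, None, 58, 61, 120, 59, None, 57, 62]
by_median = impute(weights, median)
by_trimmed = impute(weights, lambda xs: trimmed_mean(xs, 0.2))
plain_mean = sum(v for v in weights if v is not None) / 8
print(by_median[2], by_trimmed[2], plain_mean)
```

Both fill values ignore the outlier of 120 kg, whereas the plain mean is pulled upward by it, which is the usual reason for preferring a median or a trimmed mean when filling gaps.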
3.3 Data Training To train the model, we used the percent split method, which divides the dataset into 70% training data and 30% test data.
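A percent split can be sketched as follows. This is a minimal illustration in Python, not the tool used in the study; the shuffling step, the seed, and the function name percent_split are assumptions.

```python
import random

def percent_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(464))  # stand-in for the 464 instances
train, test = percent_split(records)
print(len(train), len(test))
```

With 464 instances, a 70/30 split gives 325 training and 139 test records.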
3.4 Association Rule Mining Association Rule Mining is a Machine Learning technique that works on the basis of rules [14], and it is very useful for selecting a proper market strategy. Business analysts use it to discover customer behavior by finding associations and correlations between the products that consumers use. The results of this kind of analysis help them decide whether their existing strategy should change [15]. The technique describes the relationships among different variables or features of a large dataset by finding frequent if-then associations, known as association rules [14]. There are several algorithms for Association Rule Mining. In this study, we used the Apriori algorithm (implemented in R-3.5.3), which has three common components to measure association, as follows:
3.4.1 Support
The support of an itemset A is the proportion of the transactions in the database in which A appears; it signifies the popularity of the itemset (Table 2).
\[ \mathrm{Support}(A) = \frac{\text{Number of transactions in which } A \text{ appears}}{\text{Total number of transactions}} \]
3.4.2 Confidence
It signifies the likelihood of item B being purchased when item A is purchased.
\[ \mathrm{Confidence}(\{A\} \rightarrow \{B\}) = \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)} \]
Table 2 Features list second part
Attribute | Subcategory | Data distribution (Mean, Median)
Type of medicine | No; Tablet; Insulin | 35.99%; 37.07%; 26.94%
Family stroke | Yes; No | 42.89%; 57.11%
Physical activity | Yes; No | 90.30%; 9.70%
Classes | Yes; No | 65.09%; 34.91%
L.V. = Lowest Value, H.V. = Highest Value
3.4.3 Lift
This signifies the likelihood of item B being purchased when item A is purchased, while taking into account the popularity of B.
\[ \mathrm{Lift}(\{A\} \rightarrow \{B\}) = \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A) \times \mathrm{Support}(B)} \]
This technique can be very slow because it produces a large number of combinations. To speed up the process, we follow the given steps [16]:
1. Set a minimum value for support and confidence.
2. Extract the subsets whose support is higher than the minimum threshold.
3. Select every rule from those subsets whose confidence is higher than the minimum threshold.
4. Order the rules by descending lift.
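The four steps above, together with the support, confidence, and lift definitions, can be sketched on the four toy transactions of Table 3. This is a simplified level-wise search written in Python for illustration; it is not the arules implementation used in the study, it omits the subset-pruning optimization of the full Apriori algorithm, and the function names and thresholds are assumptions.

```python
from itertools import combinations

# The four toy transactions from Table 3.
transactions = [
    {"Soap", "Handwash", "Shampoo"},
    {"Handwash", "Soap", "Shampoo"},
    {"Onion", "Potato", "Burger"},
    {"Potato", "Onion", "Burger"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_itemsets(min_support):
    """Grow itemsets level by level, keeping only those above the threshold."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    found = []
    while level:
        found.extend(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_support]
    return found

def rules(min_support, min_confidence):
    """Rules above both thresholds, ordered by descending lift."""
    out = []
    for itemset in frequent_itemsets(min_support):
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                rhs = itemset - lhs
                conf = support(itemset) / support(lhs)
                if conf >= min_confidence:
                    lift = conf / support(rhs)
                    out.append((set(lhs), set(rhs), support(itemset), conf, lift))
    return sorted(out, key=lambda rule: rule[4], reverse=True)

mined = rules(min_support=0.5, min_confidence=0.9)
for lhs, rhs, s, c, l in mined[:3]:
    print(lhs, "->", rhs, f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

In this toy data every item appears in exactly half of the transactions and always together with the other two items of its group, so every mined rule has support 0.5, confidence 1.0, and lift 2.0.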
3.5 Tools and Simulation Environment
– R-3.5.3
– RStudio 1.1.463
3.6 R Packages Some of the important functions we have used to perform the analysis are given below [17]:
– subset(): returns the subsets of vectors, matrices, or data frames that satisfy given conditions.
– is.na(): helps to deal with missing or not available (NA) data; it is typically used together with if-else.
– median(): finds the middle value of a data series.
– apriori(): mines association rules and correlations between items.
– inspect(): summarizes the relevant options, statistics, and plots that should be examined.
– library(arulesViz): loads the arulesViz package, which visualizes frequent itemsets and association rules.
– plot(): a generic function for plotting R objects.
4 Outcomes The analysis gives 8 rules representing the association and correlation between several features of the dataset. The rules are given in Table 4. The features are categorized as yes or no; for example, weightloss = no means the patient is not losing weight, and drug_history = yes means the patient has been taking medicine. The outcomes are classified into two types: classes = yes (diabetes affected) and classes = no (not diabetes affected). The first rule shows the association of pain or numbness with the duration of diabetes, where Support is 0.502, Confidence is 0.939, and Lift is 1.098. The second and third rules relate weight loss and hunger to the duration of diabetes: for no weight loss and no hunger, the values of support, confidence, and lift are 0.509, 0.948, 1.108 and 0.509, 0.918, 1.073, respectively. For these three rules, the range of diabetes duration is 0 to 2100 days. Rules 4 and 5 relate drug history to classes and to physical activity. According to rules 6 and 7, physical activity is yes for diabetes-affected patients, which means the patients have to do some physical activity such as walking and exercise. Similarly, physical activity is also yes for patients who have both drug history and diabetes. So, if a person has diabetes, he or she has to do physical activity, and if the person has both diabetes and drug history, he or she also has to do physical activity. In the same manner, rule 8 states that for patients with both drug history and physical activity, the class is diabetes (classes = yes). The highest support, confidence, and lift are, respectively, 0.603 for rule 6, 1.00 for rule 8, and 1.536 for rule 8. Figure 2 plots the 8 rules, where most of the rules are within the support range of 0.5–0.53, the confidence range of 0.5–0.95, and the lift range of 0.5–1.2. So, 5 rules out of 8 are inside these ranges and the rest are outside.
Table 3 Associated rules between some items
Transactions | Item 1 | Item 2 | Item 3
1 | Soap | Handwash | Shampoo
2 | Handwash | Soap | Shampoo
3 | Onion | Potato | Burger
4 | Potato | Onion | Burger
Fig. 2 Scatter plot for 8 rules (axes: Support and Confidence; shading: Lift)
Fig. 3 Nominal features and their percentages for yes and no
Some nominal attributes of our dataset are plotted in Fig. 3 as percentages. These attributes mainly take two values, Yes and No; for example, if someone has felt pain, the value of pain for that person is yes. The graph shows the total percentage of yes and no for each attribute. The highest percentage of yes is found for physical activity, about 90.30%, and the highest percentage of no is for family stroke history, about 57.11%. Table 5 shows the differences between previous models and our newly proposed model based on the number of instances, attributes, and algorithms used for the analysis, and it also compares the outcomes of the systems.
Table 4 Association rules
Rules | LHS | RHS | Support | Confidence | Lift
[1] | {pain_numbness = no} | {duration_of_diabetes = 0-2100} | 0.50 | 0.94 | 1.098
[2] | {weightloss = no} | {duration_of_diabetes = 0-2100} | 0.51 | 0.95 | 1.108
[3] | {hunger = no} | {duration_of_diabetes = 0-2100} | 0.51 | 0.92 | 1.073
[4] | {drug_history = yes} | {classes = yes} | 0.56 | 0.99 | 1.519
[5] | {drug_history = yes} | {physical_activity = yes} | 0.52 | 0.92 | 1.209
[6] | {classes = yes} | {physical_activity = yes} | 0.60 | 0.93 | 1.222
[7] | {drug_history = yes, classes = yes} | {physical_activity = yes} | 0.52 | 0.93 | 1.223
[8] | {drug_history = yes, physical_activity = yes} | {classes = yes} | 0.52 | 1.00 | 1.536
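The lift definition of Sect. 3.4.3 can be rewritten as confidence divided by the support of the right-hand side, which gives a quick arithmetic check on a rule. The short Python sketch below applies it to rule 8, under the assumption that the class proportion in Table 2 (classes = yes, 65.09%) is the support of the right-hand side that was in effect when the rules were mined; the function name is hypothetical.

```python
def lift_from_confidence(confidence, rhs_support):
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    return confidence / rhs_support

# Rule 8: {drug_history = yes, physical_activity = yes} -> {classes = yes}
confidence_rule8 = 1.00          # Confidence column of Table 4
support_classes_yes = 0.6509     # classes = yes in Table 2 (assumed RHS support)
print(round(lift_from_confidence(confidence_rule8, support_classes_yes), 3))
```

Under that assumption the computed value agrees with the lift of 1.536 reported for rule 8.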
Table 5 Comparison of other systems with the proposed system
Reference number | Sample size | Attributes | Algorithms | Number of rules
[5] | 1251 | – | Decision trees and Association rule with the Apriori-Genetic algorithm | 1
[6] | 768 | 8 | Apriori algorithm | 14
[9] | 3075 | 7 | ID3, C4.5, T-test, F-test | 2
Our proposed system | 464 | 23 | Apriori algorithm | 8
5 Conclusion Diabetes Mellitus is a common illness that affects people all over the world. We therefore carried out this investigation using Machine Learning to identify diabetes precisely, and the experiment was performed effectively with the expected results. The Apriori algorithm has some limitations, because it may sometimes need to find many rules, which takes a large amount of computing time. For this reason, we have tentative plans to carry the investigation forward with more accuracy. We obtained a total of 8 rules; the highest and lowest values are 0.603 and 0.502 for support, 1.0 and 0.917 for confidence, and 1.536 and 1.073 for lift. We would like to use other popular algorithms such as Frequent Pattern Tree (FP-tree), Maximum Frequent Itemset Algorithm (MAFIA), the AprioriTid algorithm, the Apriori Hybrid algorithm, the Tertius algorithm, etc. Finally, depending on the best performance of these investigations and calculations, we want to develop an expert system using the results of this exploration and learning.
References
1. A. Bhatia, Y. Chiu (David Chiu), Machine Learning with R Cookbook, 2nd edn. Livery Place 35 Livery Street Birmingham B3 2PB, UK.: Packt (2015). Diabetes, World Health Organization (2017). [Online]. http://www.who.int/news-room/fact-sheets/detail/diabetes. Accessed 25 Jan 2019
2. G. Govindarajan, J. Sowers, C. Stump, Hypertension and diabetes mellitus. European Cardiovascular Disease (2006)
3. IDF SEA members, The International Diabetes Federation (IDF), Online (2013). http://www.idf.org/our-network/regions-members/south-east-asia/members/93-bangladesh.html. Accessed 01 Feb 2019
4. V. Balpande, R. Wajgi, Prediction and severity estimation of diabetes using data mining technique, in 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, India (2017), pp. 576–580
5. B. Shivakumar, S. Alby, A survey on data-mining technologies for prediction and diagnosis of diabetes, in 2014 International Conference on Intelligent Computing Applications, Coimbatore, India (2014), pp. 167–173
6. B. Patil, R. Joshi, D. Toshniwal, Association rule for classification of type-2 diabetic patients, in 2010 Second International Conference on Machine Learning and Computing, Bangalore, India (2010), pp. 330–334
7. G. Simon, P. Caraballo, T. Therneau, S. Cha, M. Castro, P. Li, Extending association rule summarization techniques to assess risk of diabetes mellitus, in IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, pp. 130–141 (2015). Accessed 12 Feb 2019
8. P.H. Khotimah, A. Hamasaki, M. Yoshikawa, O. Sugiyama, K. Okamoto, T. Kuroda, On association rule mining from diabetes medical history, in DEIM (2018), pp. 1–5
9. C. Raveendra, M. Thiyagarajan, P. Thulasi, S. Priya, Role of association rules in medical examination records of Gestational Diabetes Mellitus, in 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India (2017), pp. 78–81
10. I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, I. Chouvarda, Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116 (2017)
11. W. Altaf, M. Shahbaz, A. Guergachi, Applications of association rule mining in health informatics: a survey. Artif. Intell. Rev. 47(3), 313–340 (2016). https://doi.org/10.1007/s10462-016-9483-9. Accessed 17 Feb 2019
12. H. Emblem, When to use a trimmed mean. Medium (2018). [Online]. https://medium.com/@HollyEmblem/when-to-use-a-trimmed-mean-fd6aab347e46. Accessed 05 Mar 2019
13. Median Function R Documentation (2017). [Online]. https://www.rdocumentation.org/packages/stats/versions/3.5.2/topics/median. Accessed 10 Mar 2019
14. A. Yosola, Association rule mining - apriori algorithm. NoteWorthy-The Journal Blog (2018). [Online]. https://blog.usejournal.com/association-rule-mining-apriori-algorithm-c517f8d7c54c. Accessed 12 Mar 2019
15. A. Shah, Association rule mining with modified apriori algorithm using top down approach, in 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), Bangalore, India (2016), pp. 747–752
16. U. Malik, Association rule mining via apriori algorithm in Python. Stack Abuse (2018). [Online]. https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/. Accessed 16 Mar 2019
17. A. Bhatia, Yu-Wei, D. Chiu, Machine Learning with R Cookbook - Second Edition: Analyze Data and Build Predictive Models, 2nd edn. (Packt Publishing Ltd., Birmingham, 2017)
An Overview of Ultra-Wide Band Antennas for Detecting Early Stage of Breast Cancer M. K. Anooradha, A. Amir Anton Jone, Anita Jones Mary Pushpa, V. Neethu Susan, and T. Beril Lynora
Abstract This study gives an overview of ultra-wideband (UWB) antenna sensors and their use in medical applications, especially microwave radar imaging. UWB sensor-based microwave energy is used in microwave imaging to detect tumor cells in the breast. In radar imaging, the electrical changes in human tissues are analyzed from the backscattered radiation picked up by the sensor. Tumor cells exhibit a higher dielectric constant because their water content is high. The aim of this article is to give microwave researchers deeper information on electromagnetic techniques for microwave imaging sensors and to explain recent developments in these techniques. To detect breast cancer in women at an early stage with comfortable and easy methods, different types of UWB antenna are surveyed here: Novel antenna, bow-tie antenna, Slot antenna, Microstrip antenna, Planar plate antenna, and Circular antenna. The frequency range generally used in the medical field is 3.1 GHz to 10.6 GHz. Keywords Tumor cells detection · Ultra-wide band antenna · Radar imaging techniques · Backscatter radiation · Breast cancer · Microwave imaging
M. K. Anooradha · A. Amir Anton Jone (B) · A. Jones Mary Pushpa · V. Neethu Susan · T. Beril Lynora Department of ECE, Karunya Institute of Technology and Sciences, Karunya Nagar, Coimbatore, India e-mail: [email protected] M. K. Anooradha e-mail: [email protected] A. Jones Mary Pushpa e-mail: [email protected] V. Neethu Susan e-mail: [email protected] T. Beril Lynora e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_49
1 Introduction Women are at risk of breast cancer, and the chance of developing it increases with age. Its exact cause may not be known. There are two types of tumor cells: benign (non-cancerous) and malignant (cancerous). Benign tumor cells do not affect other organs in the body but continue to grow to an abnormal size, whereas malignant cells invade other cells and spread to other parts of the body. The earlier the cancer is detected, the more curable it is and the lower the mortality. The most common way of detecting breast cancer is mammography, which can find cancer when there are no symptoms or at the initial stage. But most women feel some discomfort during the process. The pressure of the testing equipment against the breast causes a little discomfort, and the method is applicable for young women. Ultrasound is an imaging process that sends high-frequency sound waves into the breast and converts them into images for viewing. This technique is not the first one advised, but when the physician finds an abnormality in the breast during mammography, ultrasound is the best way to detect tumor cells. For women under age 30, ultrasound is preferred over mammography to evaluate lumps in the breast. Ultrasound can also find cancer cells in women with denser breasts.
2 Related Works
A. Square patch T-slot antenna This design uses a simple square patch antenna with a T-slot mounted in the ground plane. Three rectangular slots are added for the surface current. The square patch sits on an FR-4 substrate with dielectric permittivity εr = 3.34 and a thickness of 0.794 mm. A 50 Ω microstrip feed line excites the T-slot. It has a return loss of –10 dB. The antenna is designed with good impedance matching and exhibits a large bandwidth of 14.26 GHz. A value of 12.1 GHz is observed graphically when the test results are compared with the numerical predictions. Simulation at 3 GHz gives the near and far (far-field) radiation patterns. This method of detecting tumors with an antenna is suitable for microwave thermal radiation or thermography. Its advantages are that it is a safer diagnostic and that small tumors can also be detected. This antenna shows low spatial resolution compared with other methods [1] (Fig. 1).
Fig. 1 Square patch T-slot antenna (Courtesy Proposed antenna structure design)
B. Bow-Tie Antenna This antenna is designed for a frequency range of 5–7 GHz and is used to detect tumors of different sizes. The relative permittivity εr is 50 and the conductivity σ = 0.7 S/m. It operates around the center frequency of 5.5 GHz. The metal body of the bow-tie is printed on two thin flexible dielectric sheets [2]. The total thickness of the dielectric sheets is 0.1287 mm, and the relative permittivity of the sheet is 4. It has a reflection loss of –10 dB and covers a bandwidth of 500 MHz. The pulse bandwidth is less than 10% of the center frequency. This antenna can detect sphere-like tumors using a narrow-band pulse of this center frequency and bandwidth. The radiation used here is non-ionizing. The antenna is required to be wideband, compact, low profile, lightweight, and flexible so that it can be placed directly on the breast. It provides a low-risk tool (Fig. 2).
Fig. 2 Compact bow tie antenna used for imaging (courtesy)
C. Slot Antenna A UWB antenna is printed on the circuit board and is fabricated as a slot antenna. The frequency used is 6 GHz. The slot is mounted on the antenna element, and there is an antenna fork which is fed on the microstrip with a symmetric multiprocessor. After the current distribution is analyzed, the antenna parameter S11, which stays below –10 dB, is used for cancer detection. The antenna interface is connected to the down pulse generator (DPG) and the up pulse generator (UPG). The width of the Gaussian monocycle pulse (GMP) [3] is determined by the delay elements when the time response is the same in both. The power supply is given to the DPG and the UPG, and the antenna interface is connected to a clock, which is connected to a pattern generator and an oscillator. Cancer detection is done by an antenna array. At the output, hemispherical plane wave fronts are obtained, which are scattered at the target. This waveform is received at the receiver antenna. The cancer is detected when the input pulse is from 5 to 6 GHz with a width of 198 ps (Fig. 3).
Fig. 3 Length and width of the tuning fork placed on the antenna, back end and its current distribution (Courtesy photograph of a slot antenna and a simulation result of current distribution a Front side b Back side c Current distribution)
D. Microstrip Antenna The flexible microstrip antenna is designed for the detection of different types of tumor cells. It identifies the type of tumor cells from their dielectric properties and relative permittivity, and the result is obtained in the form of return loss. A single microstrip antenna is built on a flexible substrate and ground plane because this provides better shielding from stray radiation. A flexible microstrip antenna can be designed and operated in the ISM (Industrial, Scientific and Medical) [4] band, and it is used to identify the cells in their early phase. The feed technique used here is the microstrip feed. The antenna is designed for a frequency of 2.4 GHz (ISM band). The simulation result is a return loss of –29 dB and high gain. The flexible substrate is a Kapton polyimide substrate, which retains its dielectric properties under any circumstances and is water resistant. The substrate has the following properties: dielectric constant = 3.4, thickness = 1 mil, loss tangent = 0.002. The radiation pattern is omnidirectional (Fig. 4).
Fig. 4 Microstrip antenna and its schematic representation (courtesy)
Fig. 5 Top view and bottom view (Courtesy a Top view b Bottom view)
E. Planar Plate Antenna The planar plate antenna is designed in the shape of a circular disc that is produced on two vertical rectangular plates. It is located on a ground plane with a length and width of 40 mm and a thickness of 0.5 mm. The antenna is fed through a vertical plate with a length of 5 m and a breadth of 15 mm. The feeding probe is connected to the vertical plate through the slot in the ground plane. The antenna is formed from a copper plate with a thickness of 0.5 mm. The HP 8510C [5] network analyzer is used for the measurements. When tumor cells are present, strong scattering takes place as the microwave hits the tumor cells. A bandwidth with a minimum return loss of 10 dB is achieved between 3 and 8 GHz. The radiation pattern is directional, with a gain of 8 dBi (Fig. 5).
F. Rectangular Antenna This antenna is designed for a frequency of 2.45 GHz with a total size of 37.26 × 28.82 mm on an FR4 substrate. It is a rectangular gusset-fed microstrip patch antenna. It has a relative permittivity εr = 4.4, a width of 65.4 mm, a length of 88.99 mm, and a thickness of 1.588 mm. Five different antennas are located on the skin of the breast to obtain different parameters of the electric field, magnetic field, and current density of active breast tissue. A hemispherical shape is used to model the breast phantom, with a skin of outer radius 70 mm and thickness 2 mm. The five antennas have a radiation pattern varying from 3.34 dB to 1.6 dB. In the circular array layout, 8 antennas are placed close to one another, attached to the clod interface and separated by a circular spacing of λ/2. Energy radiated by one antenna and received by another is known as mutual coupling. This antenna has a good directional radiation pattern and is easily designed for Microwave Breast Imaging (MBI) [6]. The simulated result shows good impedance matching and a high radiation pattern with low mutual coupling (Fig. 6).
Fig. 6 Different structures of rectangular antenna
Table 1 Antenna Specifications
Specifications | Value | Specifications | Value
Width of the antenna [1] | 20 | Slot1 | 6
Length of the antenna [1] | 25 | Slot2 | 5
Width and length of the patch [1] | 10*10 | Slot3 | 3
Length of the feed [1] | 15 | Ground length | 12.33
G. Rectangular Step Slot Antenna The UWB antenna is a well-established element for identifying tumor cells with higher efficiency. The antenna proposed here is a rectangular step slot antenna. It is designed for good impedance matching to 50 ohm. A microstrip feed line, offset from the center, feeds the antenna and is printed on an FR-4 substrate. The ultimate goal of designing a UWB antenna is to reduce its size (compact size) and enhance its performance. Several steps were introduced in the rectangular stair slot to overcome the restrictions of narrow bandwidth and poor impedance matching. Increasing the number of stairs (steps) between the feed and the antenna improves the matching. The lengths and widths of the steps are L1 = L2 = 2.4 mm, L3 = 2.3 mm and W1 = W2 = 0.5 mm, W3 = 1.5 mm, as shown in Table 3. The broadband performance is obtained by resonating the antenna at numerous frequencies to increase the bandwidth (Fig. 7).
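Several of the figures quoted in the survey above (–10 dB return loss, the 3.1–10.6 GHz band, a 500 MHz bandwidth around 5.5 GHz, and λ/2 spacing at 2.45 GHz) can be related through standard textbook formulas. The Python sketch below is only an illustration: the function names are hypothetical, and the half-wavelength is computed for free space, which is an assumption, since the wavelength inside a substrate or tissue is shorter.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s

def reflection_magnitude(return_loss_db):
    """Magnitude of the reflection coefficient for a return loss in dB (positive value)."""
    return 10 ** (-return_loss_db / 20)

def vswr(return_loss_db):
    """Voltage standing wave ratio corresponding to a return loss in dB."""
    g = reflection_magnitude(return_loss_db)
    return (1 + g) / (1 - g)

def fractional_bandwidth(f_low, f_high):
    """Bandwidth divided by the center frequency of the band."""
    return (f_high - f_low) / ((f_high + f_low) / 2)

def half_wavelength_mm(freq_hz):
    """Free-space half wavelength in millimetres (assumption: no dielectric loading)."""
    return C / freq_hz / 2 * 1000

print(f"10 dB return loss: |Gamma| = {reflection_magnitude(10):.3f}, VSWR = {vswr(10):.3f}")
print(f"29 dB return loss: reflected power = {reflection_magnitude(29) ** 2:.4%}")
print(f"3.1-10.6 GHz fractional bandwidth = {fractional_bandwidth(3.1e9, 10.6e9):.1%}")
print(f"500 MHz at 5.5 GHz = {0.5e9 / 5.5e9:.1%} of the center frequency")
print(f"lambda/2 at 2.45 GHz = {half_wavelength_mm(2.45e9):.2f} mm")
```

A return loss of 10 dB means that about 10% of the incident power is reflected, which is why –10 dB is the usual threshold for an acceptable match, and 500 MHz is about 9.1% of 5.5 GHz, consistent with the statement that the pulse bandwidth is below 10% of the center frequency.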
3 Conclusion Different types of antennas are surveyed for the detection of breast cancer. Here we have compared the performance of different types of antennas for better results. Our antenna's purpose is to overcome the restrictions of the established solutions and to detect breast cancer accurately at its early stage. Our main aim is to identify the tumor cells by a method that is painless and also harmless for the skin. The purpose is to identify tumor cells at their initial stage, before the cells mature and spread all over the breast and the woman reaches the stage where the breast must be removed surgically. Despite these challenges, there is much evidence to suggest that these tumor cells may be curable if diagnosed and treated early (Table 2). We have found that microwave radar imaging has so far failed to elicit a survival benefit. This has led to an overutilization of resources and to expensive methods with false identification. With future guidelines and improved techniques, we can avoid both under- and over-evaluation of a patient's disease status (Tables 3 and 4).
Fig. 7 Proposed rectangular step slot antenna
Table 2 Specifications of rectangular antenna
Specifications | Value (mm) | Specifications | Value (mm)
W [6] | 65.4 | L1 | 3.997
L [6] | 88.99 | L2 | 13.84
Wp [6] | 37.26 | GL | 9.57
Lp [6] | 28.82 | GW | 1
LG [6] | 48.82 | FL | 20
W1 [6] | 4 | FW | 3.036
W2 [6] | 11.26 | – | –
Table 3 Antenna design specifications
Specifications | Units (mm) | Specifications | Units (mm)
Ws [7] | 19 | Ls | 33
Wp [7] | 3 | Lp | 4
Wf [7] | 1.8 | Lf | 9
Lg [7] | 6 | Lr | 2
Fd [7] | 11.5 | Wu1 | 0.5
Wu2 [7] | 0.5 | Wu3 | 1.5
Lu1 [7] | 2.4 | Lu2 | 2.4
Lu3 [7] | 2.3 | – | –
Table 4 Comparative analysis of UWB antennas
S.no | Types of antenna | Year | Advantages | Disadvantages
01. | Novel antenna | 2016 | Safer diagnostic; it can detect even small tumors | Measurement losses, effects of the cable
02. | Bow-tie antenna | 2018 | Compact; lightweight; flexible | It can't detect 2 tumors with space less than 20 mm
03. | Slot antenna | 2013 | Its simplicity; can transmit high power | Low radiation efficiency; high cross-polarization level
04. | Microstrip antenna | 2017 | Water resistant | It has lower gain; low efficiency
05. | Planar antenna | 2010 | Good radiation pattern | Fabrication inaccuracies
06. | Rectangular antenna | 2017 | Better image enhancing | No proper reflection coefficient
07. | Rectangular step slot antenna | 2019 | Compact size, low VSWR, and better return loss; high gain and directivity | NIL
References 1. A. Afyf, L. Bellarbia, N. Yaakoubib, E. Gaviotb, L. Camberleinb, M. Latrachc, M.A. Sennouni, Novel Antenna structure for early breast cancer detection, in Procedia Engineering (2016), pp. 1334–1337 2. Abdullah K. Alqallaf, Rabie Deeb, compact bow-tie antenna for the detection of multiple tumors. ICBES 140, 1–8 (2018) 3. T. Sugitani, S. Kubota, M. Hafiz, A. Toya, T. Kikkawa, A breast cancer detection system using 198 ps gaussian monocycle pulse CMOS transmitter and UWB antenna array, in Poceedings of the 2013 International Symposium on Electromagnetic Theory (2013), pp. 372–374 4. P. Chauhan, Sayan Dey, Subham Dhar, J.M. Rathod, Breast cancer detection using flexible microstrip antenna. Kalpa Publ. Eng. 1, 348–353 (2017)
An Overview of Ultra-Wide Band Antennas …
5. R. Abd-Alhameed, C. Hwang See, I. Tamer Elfergani, A compact UWB antenna design for breast cancer detection 6(2), 129–132 (2010)
6. K. Ouerghi, N. Fadlallah, A. Smida, R. Ghayoula, J. Fattahi, N. Boulejfen, Circular antenna array design for breast cancer detection (IEEE Access, 2017)
7. A. Amir Anton Jone, T. Anita Jones Mary, A novel compact microstrip UWB rectangular stair slot antenna for the analysis of return losses. Int. J. Innovat. Technol. Explor. Eng. 8(10) (2019)
Single Image Haze Removal Using Hybrid Filtering Method K. P. Senthilkumar and P. Sivakumar
Abstract Mist, smog, fog, and haze in the outdoor environment distort the image picked up by the imaging sensor and hence degrade the resolution and clarity of the acquired scene. To clear the haze from the captured image, a wavelet transform method combined with the guided image filter (GIF) and the globally guided image filter (G-GIF) has been developed for better visibility and improved quality in the output image. In hybrid filtering, the two filters are combined into a hybrid filter (GIF and G-GIF) that removes the haze present in a single image while conserving its finer details. The wavelet technique is used to detect the edges in the input image and gives good results when compared to other techniques. The G-GIF combines an edge-conserving method with a guidance structure transfer method, which conserves small edge structures in the dehazed output image. The dehazed output image preserves edge details well and has sharper edges, which is useful in real-time transportation systems. The performance parameters obtained show good results, namely a high PSNR and a low MSE.
Keywords Haze removal · Single image · Hybrid filter · Wavelet transform · Edge preserving
K. P. Senthilkumar (B), Department of ECE, Kingston Engineering College, Vellore, India
P. Sivakumar, Department of ECE, Dr. N.G.P Institute of Technology, Coimbatore, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_50

1 Introduction
Haze in the outdoor atmosphere is a common and widespread phenomenon. It may be caused by small dust particles, smog, and water droplets, and by the scattering of the light reflected from a surface on its way to the observer. As a result, images captured by a camera suffer from low contrast, reduced and shifted colors, and changes in brightness. The phenomenon is illustrated in Fig. 1, in which the image reaching the observer is degraded by haze and by atmospheric light scattering and absorption.

Fig. 1 Phenomenon of haze formation

Single image haze removal is used to overcome this problem; it gives a more viewable scene and clearer image details. In general, haze removal or denoising removes undesirable visual effects and is therefore considered an image enhancement method. It is useful in image transform methods, machine vision systems, video-assisted driving systems, and outdoor video surveillance systems. Denoising methods are classified into three major domains: time domain, transform domain, and variation-based techniques.

Several methods have been developed for removing haze from a single image and are used in various real-time applications. Narasimhan et al. [1] discussed several issues in recovering scene details of outdoor environments from degraded grayscale images. They located depth discontinuities in the image and reconstructed the visible scene from two images captured under different weather conditions. Tan [2] developed an automated model that requires only a single input image. It relies on two observations: the first is that images taken on clear, sunny days have more contrast than images affected by poor weather, and the second concerns the variation of airlight. Fattal [3] proposed a scheme in which the hazy image is described by a refined image formation model that includes surface shading and a scene transmission map, but it fails when the amount of haze is large. He et al. [4] developed a simple method called the dark channel prior to remove haze from a single image.
They applied it to outdoor images, in which most non-sky regions contain some pixels with very low intensity in at least one color channel; these are known as dark pixels. Drews et al. [5] proposed a method to estimate the transmission map for underwater scenes, known as Underwater DCP, which gives a significant improvement over earlier DCP-based methods.
Pang et al. [6] developed a single image dehazing model using the dark channel method and the GIF. They used the guided image filter to refine the haze image, which takes less running time than other existing methods. In the GIF method [7], the transfer filter was designed using a quadratic expansion method, and the color attenuation prior was used in [8]. The WGIF method [9] uses a weighting function together with the guided image filter and produces better results. The WLS filter [10] was reported to be a significant improvement over the edge-preserving filter, with a smoothing running time considerably better than that of the method in [7]. In [11], edge-preserving decomposition at multiple scales was used with a standard filter, which gives better results. An iterative bilateral filter [12] was proposed for fast image dehazing in real-time applications. The guided joint bilateral filter method [13] was proposed for fast real-time single image haze removal. In [14], optimized contrast enhancement was used for dehazing images and real-time videos. The G-GIF proposed in [15] preserves the minute details of images better than the WGIF and GIF. To remove halo artifacts from the dehazed image, the transmission map was made consistent with the input haze image, and the structure of the input haze image was conserved better when the minimal color channel was used. The performance parameters showed that this algorithm produced good results and sharper images than the other algorithms. A comprehensive survey of single image haze removal methods was presented in [16], and the review in [17] analyzed various haze removal methods using several performance parameters.
The proposed method introduces a new technique, the hybrid filtering method, which combines the guided image filter and the globally guided image filter to remove the noise introduced by haze from a single input image effectively. Performance parameters such as peak signal-to-noise ratio, mean square error, accuracy, sensitivity, and specificity are also calculated. The paper is organized as follows: Sect. 2 describes the existing filter methods, Sect. 3 explains the hybrid filtering method, Sect. 4 gives the simulation results, Sect. 5 lists the performance parameters, and Sect. 6 presents the conclusions.
2 Existing Filter Methods The different filtering methods used for haze removal in a single image are described below.
2.1 Bilateral Filter Method This filter smooths the input image while preserving edges, and it operates by a nonlinear combination of nearby pixel values. The filter is local and very simple: gray levels or colors are combined according to both their geometric closeness (domain) and their photometric similarity (range). It reduces noise while keeping edges, although the amount of noise reduction is limited, and it suffers from the "gradient reversal" effect, which produces undesirable sharpening of edges.
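The domain and range weighting described above can be sketched in a few lines. This is a minimal illustration, assuming a grayscale image stored as a list of rows with values in [0, 1] and Gaussian weights; the function name and the parameters `radius`, `sigma_s`, and `sigma_r` are hypothetical and are not taken from the paper.

```python
import math

def bilateral_filter(img, radius=2, sigma_s=2.0, sigma_r=0.1):
    """Bilateral filter for a 2-D grayscale image given as a list of rows."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            centre = img[y][x]
            total, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        # Domain weight: geometric closeness to the centre pixel.
                        ws = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                        # Range weight: similarity of intensity to the centre pixel.
                        diff = img[yy][xx] - centre
                        wr = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                        total += ws * wr * img[yy][xx]
                        norm += ws * wr
            out[y][x] = total / norm  # norm >= 1 because the centre has weight 1
    return out

# A vertical step edge: pixels across the edge get a negligible range weight.
step = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
smoothed = bilateral_filter(step)
```

With a small `sigma_r`, pixels on the far side of the step contribute almost nothing, so the edge stays sharp while flat regions are averaged.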
2.2 Guided Image Filter Method This is an edge-conserving filter similar to the bilateral filter, and it conserves image edges well. It is a fast algorithm whose running time does not depend on the size of the filter, and it produces no unwanted profiles across edges. It models the output as a local linear transform of the guidance image and hence runs faster than the bilateral filter (BF). Its limitation is that it does not conserve small edge details in the output image.
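A compact sketch of the local linear model may help here. It follows the usual grayscale formulation of guided filtering from [7] (per-window coefficients a and b, then averaging), written in plain Python. The helper names `box_mean` and `combine`, and the defaults for `r` and `eps`, are illustrative assumptions. The box mean below is the naive version; the size-independent running time mentioned above needs a running-sum box filter instead.

```python
def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = sum(window) / len(window)
    return out

def combine(f, *images):
    """Apply f element-wise across images of equal size."""
    return [[f(*vals) for vals in zip(*rows)] for rows in zip(*images)]

def guided_filter(guide, src, r=1, eps=1e-3):
    """Filter src using guide; eps controls how strongly edges are kept."""
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    corr_ii = box_mean(combine(lambda i, j: i * j, guide, guide), r)
    corr_ip = box_mean(combine(lambda i, p: i * p, guide, src), r)
    var_i = combine(lambda c, m: c - m * m, corr_ii, mean_i)
    cov_ip = combine(lambda c, mi, mp: c - mi * mp, corr_ip, mean_i, mean_p)
    # Linear coefficients of q = a * I + b in each window.
    a = combine(lambda c, v: c / (v + eps), cov_ip, var_i)
    b = combine(lambda mp, aa, mi: mp - aa * mi, mean_p, a, mean_i)
    # Average the coefficients of all windows covering each pixel.
    return combine(lambda ma, mb, i: ma * i + mb,
                   box_mean(a, r), box_mean(b, r), guide)
```

In flat regions the variance is near zero, so a is near 0 and the output is a local mean; near strong edges a approaches 1 and the edge in the guidance image is carried into the output.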
2.3 Guided Joint Bilateral Filter Method This filter is employed to create a new mask that eliminates heavy texture details and restores the details across edges in the image. It is used to filter the atmospheric mask to obtain a new, better-detailed atmospheric mask. The guided joint bilateral filter is used when the haze image does not provide reliable edge information, for example when the input image is noisy. It can be used for noise removal because it forces the edges of the filtered image to follow those of the reference image.
2.4 Weighted Guided Image Filter Method Local filters have major disadvantages compared with global filters, such as poorer edge preservation, so the WGIF was introduced to reduce this problem by using a weighting function along with the guided image filter. Like global filters, it conserves sharp edges in the image, so that halo artifacts and unwanted profiles around edges are removed. A Gaussian filter is adopted to remove the artifacts, and the time needed is only O(N) for an image with N pixels. In single image haze removal, the main problems are halo artifacts, noise amplification, and color fidelity, and this filter overcomes all of them. Its major drawback is that it does not preserve fine edge information in the dehazed image and over-smooths across small edges.
2.5 Globally Guided Image Filter Method This filter is also described as a global filter and is used in many applications. It consists of two filters: a structure transfer filter and an edge-conserving smoothing filter. The first transfers the structure of the guidance image to the image to be filtered, and the second smooths the dehazed image. Single image haze removal using both the globally guided image filter and Koschmieder's law conserves the small edge details of the image and produces better output than older methods such as GIF and WGIF.
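Koschmieder's law is usually written as I = J·t + A·(1 − t), where I is the observed hazy intensity, J the haze-free scene radiance, t the transmission, and A the atmospheric light. The sketch below inverts this model per pixel. It assumes that t and A have already been estimated (for example with a dark channel method); the function names and the lower bound `t_min` are illustrative choices, not details given in the paper.

```python
def add_haze(scene, transmission, airlight):
    """Forward haze model: I = J * t + A * (1 - t), applied per pixel."""
    return [[j * t + airlight * (1.0 - t) for j, t in zip(j_row, t_row)]
            for j_row, t_row in zip(scene, transmission)]

def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the haze model: J = (I - A) / max(t, t_min) + A."""
    return [[(i - airlight) / max(t, t_min) + airlight
             for i, t in zip(i_row, t_row)]
            for i_row, t_row in zip(hazy, transmission)]

scene = [[0.3, 0.8]]
transmission = [[0.5, 0.9]]
hazy = add_haze(scene, transmission, airlight=0.9)
restored = remove_haze(hazy, transmission, airlight=0.9)
```

The clamp on t keeps the division stable in dense haze, at the cost of leaving some haze in those regions.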
3 Hybrid Filter Method The hybrid filter technique is developed from two filters, the GIF and the G-GIF. Both have numerous advantages and applications, among which removal of haze from a single image is very important. The new technique is applied to single image haze removal to overcome the limitation of the GIF. The hybrid filter is built from a structure guidance method and a fast edge-conserving technique. Its process flow is shown in Fig. 2.
Fig. 2 Process flow of the hybrid filter method
3.1 Guided Image Filter (GIF) The guided image filter performs edge-preserving filtering of an image by using a second image, called the guidance image, to steer the filter operation. The guidance image may be the input image itself, a modified version of it, or a completely different image. When the guidance image is the same as the image to be filtered, the structures are the same, so an edge in the input is also an edge in the guidance. If the guidance image contains structures that differ from those of the image to be filtered, these structures are imprinted on the output image. This phenomenon is known as structure transfer.
3.2 Globally Guided Image Filter (G-GIF) The G-GIF is constructed from two filters: a global structure guidance filter and a global edge-conserving smoothing filter. To reduce the processing time of the guided filter, the input images are first down-sampled using nearest-neighbor interpolation. After the globally guided image filter has been applied, the output image is restored to full resolution by up-sampling with bilinear interpolation.
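The two resampling steps can be sketched as follows. This is an illustrative version only: the integer `factor`, the function names, and the corner-aligned bilinear mapping are assumptions, since the paper does not state the sampling details.

```python
def downsample_nearest(img, factor):
    """Keep every factor-th pixel in both directions (nearest neighbour)."""
    return [row[::factor] for row in img[::factor]]

def upsample_bilinear(img, out_h, out_w):
    """Bilinear interpolation with the image corners aligned."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
small = downsample_nearest(image, 2)     # filtering would run on this smaller image
restored = upsample_bilinear(small, 3, 3)
```

Filtering the smaller image and interpolating the result back trades some fine detail for a lower processing cost.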
3.3 Hybrid Filter Method The limitation of the GIF is resolved by the hybrid filtering technique, which combines the GIF and G-GIF filters to eliminate the haze effect from a single image. To preserve edge detail and the sharp finer details of the input image, the edges must first be detected. The wavelet transform and inverse wavelet transform are used for this purpose. The procedure for haze removal from a single image using the hybrid filter technique is given in Sect. 3.4.
3.4 Process Flow
Step 1 Apply wavelet decomposition to the hazy input image to obtain the lower frequency region and the higher frequency region.
Step 2 Apply the wavelet transform again to the lower frequency region to obtain its lower and higher frequency regions.
Step 3 Apply the GIF and G-GIF filters to obtain the reconstructed dehazed image, shown in Fig. 5e.
Step 4 For the higher frequency regions:
(a) Estimate the local noise variance.
(b) For every pixel in the higher frequency regions:
(i) Determine the threshold range in the image.
(ii) Apply the soft threshold.
Step 5 Apply the inverse wavelet transform to obtain the output dehazed image.
The wavelet-based method partitions the hazy input image, shown in Fig. 3a, into low and high frequency regions, and the hybrid filter method is then applied to the haze image and the guidance image. The GIF output is shown in Fig. 4c, the G-GIF output in Fig. 4d, and the reconstructed image in Fig. 5e; the reconstructed image is then passed to the inverse wavelet transform to obtain the output shown in Fig. 5f.
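The wavelet part of this flow (Steps 1, 4, and 5) can be illustrated on a one-dimensional signal. The sketch below makes several assumptions that the paper does not specify: a single-level Haar wavelet, a noise estimate from the median absolute high-band coefficient, and the universal threshold sigma·sqrt(2 ln N). The hybrid GIF and G-GIF filtering of the low band (Step 3) is left out.

```python
import math

def haar_forward(signal):
    """Single-level Haar transform of an even-length signal."""
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return low, high

def haar_inverse(low, high):
    """Inverse of haar_forward."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.append((l + h) * s)
        out.append((l - h) * s)
    return out

def soft_threshold(c, t):
    """Shrink a coefficient toward zero by t; small coefficients become zero."""
    return math.copysign(max(abs(c) - t, 0.0), c)

def estimate_sigma(high):
    """Noise estimate from the (upper) median absolute coefficient (assumed)."""
    magnitudes = sorted(abs(c) for c in high)
    return magnitudes[len(magnitudes) // 2] / 0.6745

def denoise(signal):
    low, high = haar_forward(signal)                     # Step 1
    sigma = estimate_sigma(high)                         # Step 4(a)
    t = sigma * math.sqrt(2.0 * math.log(len(signal)))   # Step 4(b)(i), assumed rule
    high = [soft_threshold(c, t) for c in high]          # Step 4(b)(ii)
    return haar_inverse(low, high)                       # Step 5
```

For images, the same idea is applied to the horizontal, vertical, and diagonal detail bands of a two-dimensional wavelet transform.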
4 Simulation Results The input hazy image is shown in Fig. 3a and the various output images are shown below.
Fig. 3 Results of hybrid filtering technique: a haze image, b guidance image
Fig. 4 c GIF filter output image, d G-GIF filter output image
Fig. 5 e Reconstructed image, f output image
5 Performance Parameters The following performance parameters are calculated for the proposed method. The two most important are the mean square error (MSE) and the peak signal-to-noise ratio (PSNR), given in Eqs. (1) and (2); accuracy, specificity, and sensitivity are also calculated and shown in Table 1.
5.1 Mean Square Error (MSE) The mean square error is computed between the input haze image and the output dehazed image using

$$\mathrm{MSE} = \frac{1}{p\,q} \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \bigl( I(i,j) - K(i,j) \bigr)^{2} \qquad (1)$$

where p and q are the height and width of the haze image, and I(i, j) and K(i, j) are the output dehazed image and the input haze image, respectively. A very low MSE indicates that the quality of the output image is good.
5.2 Peak Signal to Noise Ratio (PSNR) The peak signal-to-noise ratio is computed from the input haze image and the output dehazed image using

$$\mathrm{PSNR} = 10 \log_{10} \frac{\left( 2^{n} - 1 \right)^{2}}{\mathrm{MSE}} \qquad (2)$$

where n is the number of bits per pixel, so that 2^n − 1 is the maximum possible pixel value. A very high PSNR indicates that the quality of the output image is very good.
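Equations (1) and (2) translate directly into code. The sketch below assumes grayscale images stored as lists of rows and a default of 8 bits per pixel; the function names and the `bits` parameter are illustrative.

```python
import math

def mse(dehazed, hazy):
    """Mean square error between two images of equal size, Eq. (1)."""
    p, q = len(dehazed), len(dehazed[0])
    total = sum((dehazed[i][j] - hazy[i][j]) ** 2
                for i in range(p) for j in range(q))
    return total / (p * q)

def psnr(dehazed, hazy, bits=8):
    """Peak signal-to-noise ratio in dB, Eq. (2); infinite for identical images."""
    err = mse(dehazed, hazy)
    if err == 0:
        return math.inf
    peak = 2 ** bits - 1
    return 10.0 * math.log10(peak ** 2 / err)

a = [[10, 20], [30, 40]]
b = [[12, 20], [30, 44]]
error = mse(a, b)        # (4 + 0 + 0 + 16) / 4
quality = psnr(a, b)
```

The guard for a zero error avoids a division by zero when the two images are identical.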
Fig. 6 Pie Chart representing the different performance parameters
Table 1 Performance parameters and results obtained

| S. no. | Performance parameter | Result |
| 1 | Peak signal to noise ratio | 24.2 |
| 2 | Mean square error | 0.00063 |
| 3 | Sensitivity | 89 |
| 4 | Accuracy | 86 |
| 5 | Specificity | 81 |
The results obtained by the hybrid filtering method are given in Table 1, and the performance parameters are shown as a chart in Fig. 6.
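The paper does not state how sensitivity, specificity, and accuracy are derived for a dehazed image. A common reading, assumed here, is that they come from a confusion matrix of true and false positives and negatives. The sketch below shows those standard definitions; the counts are made up for illustration and are unrelated to Table 1.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, returned as fractions in [0, 1]."""
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Illustrative counts only.
sens, spec, acc = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```

Multiplying the fractions by 100 gives percentages.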
6 Conclusions The proposed hybrid filter produces clear images and conserves the edge details in the dehazed output image better than the other methods. The smaller details in the dehazed output image are clearer and sharper than those of current single image haze removal methods. The simulation results show that the hybrid filter based haze removal technique enhances the visible quality of the image and conserves fine edge details in the output image, giving a high PSNR and a low MSE. In the future, this method will be extended to remove haze from real-time video while preserving the fine edge structure and producing a sharper output image. The method can be used in applications such as single image dehazing, real-time transportation systems, and outdoor video surveillance systems.
References
1. S.G. Narasimhan, S.K. Nayar, Chromatic framework for vision in bad weather, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Hilton Head Island, SC, 2000), pp. 598–605
2. R. Tan, Visibility in bad weather from a single image, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Anchorage, AK, 2008), pp. 1–8
3. R. Fattal, Single image de-hazing, in Proceedings of the SIGGRAPH (2008), pp. 1–9
4. K. He, J. Sun, X. Tang, Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341–2353 (2011)
5. P. Drews, E. Nascimento, F. Moraes, S. Botelho, M. Campos, Transmission estimation in underwater single images, in Proceedings of the IEEE International Conference on Computer Vision Workshop (2013), pp. 825–830
6. J. Pang, O.C. Au, Z. Guo, Improved single image dehazing using guided filter, in Proceedings of the APSIPA, ASC (2011), pp. 1–4
7. K. He, J. Sun, X. Tang, Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2013)
8. Q. Zhu, J. Mai, L. Shao, A fast single image haze removal algorithm using colour attenuation prior. IEEE Trans. Image Process. 24(11), 3522–3533 (2015)
9. Z. Li, J. Zheng, Z. Zhu, W. Yao, S. Wu, Weighted guided image filtering. IEEE Trans. Image Process. 24(1), 120–129 (2015)
10. Z. Li, J. Zheng, Edge-preserving decomposition-based single image haze removal. IEEE Trans. Image Process. 24(12), 5432–5441 (2015)
11. Z. Farbman, R. Fattal, D. Lischinski, R. Szeliski, Edge-preserving decompositions for multiscale tone and detail manipulation. ACM Trans. Graph. 27(3), 67 (2008)
12. S. Kang, W. Bo, Z. Zhihui, Z. Zhiqiang, Fast single image de-hazing using iterative bilateral filter, in Proceedings of the International Conference on Computer Science (2010), pp. 1–4
13. C. Xiao, J. Gan, Fast image de-hazing using guided joint bilateral filter. Vis. Comput. 28(6–8), 713–721 (2012)
14. J.H. Kim, W.D. Jang, J.Y. Sim, C.S. Kim, Optimized contrast enhancement for real-time image and video de-hazing. J. Vis. Commun. Image Represent. 24(3), 410–425 (2013)
15. Z. Li, J. Zheng, Single image de-hazing using globally guided image filtering. IEEE Trans. Image Process. 27(1) (2018)
16. K.P. Senthilkumar, P. Sivakumar, Haze removal techniques: a comprehensive survey. Int. J. Control Theory Appl. 9(28), 365–376 (2016)
17. K.P. Senthilkumar, P. Sivakumar, A review on haze removal techniques, in Lecture Notes in Computational Vision and Biomechanics, vol. 31 (Springer, 2019), pp. 113–123, ISBN 978-3030-04061-1
An Optimized Multilayer Outlier Detection for Internet of Things (IoT) Network as Industry 4.0 Automation and Data Exchange Adarsh Kumar and Deepak Kumar Sharma
Abstract In this work, a multilayered, multidimensional outlier detection mechanism is proposed. The multilayered system consists of ultra-lightweight, lightweight, and heavyweight detection, whereas the multiple dimensions involve machine learning, Markov construction, and content- and context-based outlier detection. A threshold-based system is proposed for the ultra-lightweight and lightweight outlier detection systems. The heavyweight outlier detection system requires higher computational and communication costs for outlier detection. In simulation, it is observed that the optimal numbers of clusters required for networks of 50, 100, 500, 1000, 2000, 3000, 4000, and 5000 nodes are 5–18, 16–17, 5–25, 33–34, 13–39, 38–39, 22–51, and 45–52, respectively.
Keywords Outlier · Inlier · Threshold · Attack · Performance · Time-series analysis
A. Kumar (B) · D. K. Sharma, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_51

1 Introduction
The integration of mobile ad hoc networks (MANETs) and the Internet of Things (IoT) opens new applications for smart environments such as automated power plants, intelligent transportation and traffic management systems, and car connectivity. The range of applications for IoT systems grows with the opportunities for interoperability between different types of networks in a smart environment. In smart environments such as MANET–IoT systems, information exchange between different things, routing principles and protocols, clustering, and cluster interaction are design issues for building interoperable networks. Due to these complex design issues, MANETs are highly vulnerable to attacks. MANET–IoT interaction characteristics such as an open medium, a distributed network, autonomous node distribution and participation, and decentralization make such systems more complex and challenging. These systems are therefore prone to attacks, and countermeasures are required that identify unauthorized and malicious activities using existing and minimal resources. In existing infrastructure, performance-based identification of nodes helps in identifying various attacks. The performance-based process considers node deployment, node interaction, overall network outputs and services, etc. The majority of MANET routing protocols provide clustering and cluster head selection mechanisms. The reuse of the clustering process for identifying active and passive nodes is studied for attack detection. Designing a MANET–IoT outlier detection scheme that is energy efficient and generates little outlier identification traffic, with high accuracy and a high detection rate, is the major research focus of this work. In this work, a multidimensional, multilayered outlier detection architecture with increasing complexity is proposed. In this architecture, ultra-lightweight, lightweight, and heavyweight outlier detection mechanisms are the proposed dimensions. In ultra-lightweight outlier detection (ULOD), outlier nodes are identified from their deployment using internal [1–7] and external [8–16] indices, without analyzing their performance. In lightweight outlier detection (LOD), outlier nodes are identified after analyzing their performance using QoS parameters. In heavyweight outlier detection (HOD), outlier nodes are identified using multiple techniques deployed at different layers of the MANET architecture. Overall, outlier nodes are identified based on their deployment, density, performance, and interactions with other nodes at different layers.
Unlike most conventional outlier detection mechanisms, which adopt a promiscuous monitoring strategy and generate heavy outlier detection traffic, the proposed approach uses a dynamic, continuous, and increasing complexity (for outliers)-based monitoring strategy in which nodes with a high outlier probability are monitored more frequently than inlier nodes. This paper is organized as follows. A literature survey of recent outlier detection mechanisms is presented in Sect. 2. A detailed description of the increasing complexity-based outlier detection approach is presented in Sect. 3. This scheme uses an internal and external indices-based ULOD approach, a QoS parameter and performance-based LOD approach, and a layer-dependent, increasing computational complexity-based HOD approach. In Sect. 4, simulation-based analysis is performed to measure the stability of clusters in the outlier detection process. Finally, conclusions are drawn in Sect. 5.
2 Literature Survey In this section, various outlier detection mechanisms for MANETs are surveyed [17–28]. Li et al. [17] proposed an outlier detection mechanism that uses behavioral patterns for attack detection based on the Dempster–Shafer theory. The scheme takes observations from multiple nodes and accounts for the uncertainty and unreliability of the observations. It is observed to be resilient to various attacks and stable for a distributed network. Although communication overhead is required for efficient detection, actual detection of attacks validates the efficiency of the scheme. It is further observed that the scheme suits a distributed network, but a single approach is not reliable for detection in dynamic networks like MANETs. Sun et al. [18] focus on intrusion detection systems for mobile ad hoc networks and wireless sensor networks. These systems also perform attack detection and elimination using the outlier detection process. It is observed that a vast majority of methods depend on outlier-based threshold mechanisms. These threshold-based outlier systems collect data in multiple ways and perform outlier detection after applying data analytics. Karlsson et al. [19] perform wormhole attack detection using the outlier detection process. It is found that algorithms such as traversal time and hop count analysis (TTHCA) and the modified transmission time-based mechanism (MTTM) are efficient for attack detection with low traffic overhead. Karlsson et al. [3] extended the TTHCA algorithm and named it traversal time per hop analysis (TTpHA). The extended version uses a threshold-based outlier process with different node radio coverage and prevailing MANET conditions. The authors claim that the extended version is better than the base version in terms of detection performance. Yadav et al. [20] proposed the detection of black hole attacks in MANETs using the outlier detection process over the ad hoc on-demand distance vector (AODV) routing protocol. The scheme is vigilant toward nodes that attract other nodes for data communication by compromising some of their secret entities. In experimental analysis, the authors observed that the scheme provides simplicity, robustness, and effectiveness for the AODV protocol and other existing routing mechanisms.
However, this approach is another threshold-based mechanism covering performance evaluation in detail. A major challenge is deciding the ideal threshold value for outlier detection. Kumar et al. [21, 24–28] proposed attack detection using the outlier detection process after integrating a trust mechanism into MANET routing protocols. The trust mechanism includes trust score generation, trust score transmission, trust re-computation, and trust regeneration. In evaluation, it is observed that the trust mechanism is resilient to various attacks. The outlier detection process helps in detecting attacks and intrusions, whereas the trust mechanism helps in finding nodes whose disconnection would protect the network from unauthorized activities. The outlier detection process is efficient if lightweight cryptography primitives are pre-integrated into MANETs built from low-resource devices. Henningsen et al. [22, 23] identified the use of the term "misbehavior detection" for attack analysis in wireless networks. In that work, the authors used this terminology for identifying attacks in industrial wireless networks as an exemplary application area. The work focuses on data collected at the physical layer. Data is analyzed using machine learning techniques, and outperforming and underperforming data elements are traced for outlier detection. Finally, it is observed that a technique suitable for wireless communication is also beneficial for ad hoc networks in dynamic situations with high flexibility and mobility.
3 Proposed Approach This section proposes a multidimensional and multilayered approach for outlier identification and countermeasures. It starts with a description of the dataset collected from a hierarchical MANET and used for the outlier detection process. The multiple dimensions of the proposed outlier detection mechanism are explained first. As shown in Fig. 1, there are three major dimensions: ultra-lightweight, lightweight, and heavyweight. The ultra-lightweight and lightweight outlier detection mechanisms concentrate on cost-effectiveness, especially communication and computational costs. The heavyweight outlier detection process concentrates on identifying outliers without concern for the costs involved. Thus, a multilayered architecture is proposed that identifies outliers at different layers of the MANET protocol stack. The detailed process is explained in the following subsections:
Fig. 1 Proposed multidimensional multilayered outlier detection architecture
r -t 0.044240586 -Hs 8 -Hd -1 -Ni 8 -Nx 75.00 -Ny 500.00 -Nz 0.00 -Ne -4.220000 -Nl RTR -Nw --- -Ma 0 -Md ffffffff Ms
s -t 0.044240777 -Hs 3 -Hd -1 -Ni 3 -Nx 85.00 -Ny 400.00 -Nz 0.00 -Ne -6.054000 -Nl RTR -Nw --- -Ma 1 -Md ffffffff -Ms
d -t 0.044240605 -Hs 4 -Hd -1 -Ni 4 -Nx 55.00 -Ny 670.00 -Nz 0.00 -Ne -8.770000 -Nl RTR -Nw --- -Ma 2 Md ffffffff -M
r -t 0.044240887 -Hs 6 -Hd -1 -Ni 6 -Nx 35.00 -Ny 800.00 -Nz 0.00 -Ne -3.032000 -Nl RTR -Nw --- -Ma 1 -Md ffffffff -Ms
Fig. 2 Sample records in dataset generated using hierarchical MANET
3.1 Dataset Figure 2 shows sample entries from the dataset considered for analysis. The fields selected for analysis are packet action (send, receive, drop, and forward), packet action time, packet action location, layer involved, flags, sequence number, packet type, packet size, flags for source and destination addresses, etc.
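Records such as those in Fig. 2 can be split into fields with a small parser. The sketch below treats each record as an event letter followed by "-flag value" pairs. The reading of the flags (for example -t as time, -Ni as node id, -Nx and -Ny as position, -Ne as energy, -Nl as trace level) is an assumption based on the apparent similarity to the NS-2 wireless trace format, and `parse_trace_line` is a hypothetical helper, not part of the authors' tooling.

```python
def parse_trace_line(line):
    """Split one trace record into an event letter and a dict of flag values."""
    def is_flag(token):
        # A flag is '-' followed by a letter; '-1', '-4.22' and '---' are values.
        return len(token) > 1 and token[0] == "-" and token[1].isalpha()

    tokens = line.split()
    record = {"event": tokens[0]}   # r, s, d (receive, send, drop)
    i = 1
    while i < len(tokens):
        if is_flag(tokens[i]) and i + 1 < len(tokens) and not is_flag(tokens[i + 1]):
            record[tokens[i][1:]] = tokens[i + 1]
            i += 2
        else:
            i += 1                  # skip truncated or stray tokens
    return record

# First sample record of Fig. 2, copied as printed (including its truncated tail).
sample = ("r -t 0.044240586 -Hs 8 -Hd -1 -Ni 8 -Nx 75.00 -Ny 500.00 -Nz 0.00 "
          "-Ne -4.220000 -Nl RTR -Nw --- -Ma 0 -Md ffffffff Ms")
record = parse_trace_line(sample)
position = (float(record["Nx"]), float(record["Ny"]))
```

Values are kept as strings so that fields such as `---` or hexadecimal addresses survive unchanged; numeric fields are converted only where needed.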
3.2 Proposed Increasing Complexity Outlier Detection Architecture In this section, MANET is divided into a set of clusters using the top-down cluster splitting mechanism. These clusters are formed in such a way that every node in a given cluster is within a limited distance and transmission range. These nodes also share a common set of properties described in the distance metric. After constructing a hierarchical network, one node per cluster is elected as cluster head which monitors, controls and provides outlier detection and other clustering services to all other cluster nodes for a predefined period of time. The detailed functionalities of this component are explained as follows.
3.2.1 MANET Clustering and Cluster Head Election
Hierarchical clustering plays an important role in resource-constrained MANETs. MANETs are extremely dynamic and unsteady in nature which creates problems in splitting the network into clusters, and selection of cluster heads for controlling and monitoring cluster’s activities. The major objective of the work presented in this subsection is to reduce the packets transmission overhead at every time in the clustering process. In order to reduce packet transmission overhead, a novel distance metric-based efficient clustering approach is utilized for hierarchical clustering and cluster-head selection. In contrast to the k-means clustering algorithm, hierarchical clustering does not require to prespecify the number of clusters. This scheme uses a distance matrix between observations for identifying similar groups. Initially, a divisive hierarchical clustering approach is used for clustering. Thus, all data points fall
under a single group; thereafter, nodes with similar nature are divided into different clusters. Pseudocode 1 presents the detailed divisive clustering process.
Pseudocode 1: Divisive Hierarchical Clustering Algorithm
Goal: To create clusters
1. All data points are grouped into a single cluster.
2. Pick each data point one by one and follow these steps:
   a. Iterate over each data point in the cluster and compute the average dissimilarity of this point from all other data points in the same cluster.
   b. If the dissimilarity is greater than a pre-determined threshold, then put the data point in a new cluster.
   c. Repeat Steps 2a and 2b for every data point in every cluster.
3. Count the number of data points in each cluster and pick all clusters having more than one data point.
4. Compare each cluster with the other clusters using data points and dissimilarity score.
5. If the dissimilarity score of two data points in two different clusters is greater than zero, then move the data point to another cluster.
6. Else, if the dissimilarity score is less than or equal to zero, then a new cycle of restructuring clusters will start and outliers are identified from the beginning.
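Steps 1 and 2 of Pseudocode 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Euclidean distance stands in for the proposed distance metric, and the function names, the threshold value and the `max_rounds` guard are hypothetical. Steps 3 to 6 (cluster comparison and restructuring) are not covered.

```python
from math import dist  # Euclidean distance; assumed stand-in for the paper's metric

def average_dissimilarity(index, cluster):
    """Mean distance from cluster[index] to every other point in the cluster."""
    others = [p for i, p in enumerate(cluster) if i != index]
    if not others:
        return 0.0
    return sum(dist(cluster[index], p) for p in others) / len(others)

def split_once(cluster, threshold):
    """Steps 2a-2b: points above the threshold are moved to a new cluster."""
    kept, moved = [], []
    for i, point in enumerate(cluster):
        target = moved if average_dissimilarity(i, cluster) > threshold else kept
        target.append(point)
    return kept, moved

def divisive_clustering(points, threshold, max_rounds=10):
    clusters = [list(points)]  # Step 1: everything starts in one cluster
    for _ in range(max_rounds):  # hypothetical guard against endless splitting
        next_clusters, changed = [], False
        for cluster in clusters:
            kept, moved = split_once(cluster, threshold)
            if kept and moved:  # a real split happened
                next_clusters += [kept, moved]
                changed = True
            else:
                next_clusters.append(cluster)
        clusters = next_clusters
        if not changed:  # Step 2c: stop when no cluster changes any more
            break
    return clusters

# Hypothetical node positions: three nodes near the origin, two far away.
nodes = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = divisive_clustering(nodes, threshold=9.0)
```

With these sample positions the two distant nodes end up in their own cluster. The outcome depends strongly on the threshold, which the paper describes only as pre-determined.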
3.2.2 Ultra-Lightweight Outlier Detection (ULOD)
Cluster validation is the process of evaluating the goodness of a clustering algorithm [1]. According to [2–5], cluster validation methods can be categorized into three classes: internal, external and relative cluster validation. To measure the goodness of a clustering algorithm, internal index methods use information that is internal to the dataset and is used in the clustering process. In this category of methods, compactness, separation/connectedness and connectivity are reflected in the outcomes [1]. Internal cluster validation indices include [5–9] the Calinski-Harabasz measure (CHI), Density-Based Cluster Validation (DBCV), etc. ULOD is an index- and threshold-based lightweight outlier detection mechanism for resource-constrained devices. To detect outliers, internal and external indices are used for analysis. Various internal and external indices [10–14] are used to identify n objects with efficient internal and external indices-based outlier detection factor (IEIODF) values. Pseudocode 2 shows the generic steps followed for ULOD in this work.
Pseudocode 2: Ultra-lightweight Outlier Detection (ULOD)
1. Apply the divisive hierarchical clustering algorithm initially, followed by hierarchical clustering using the proposed distance metric, and construct clusters.
2. If any node falls outside the clusters, then consider those nodes as outliers
3. X_Internal_Indices_inlier_list = NULL
4. X_External_Indices_inlier_list = NULL
5. X_Internal_Indices_list = [DI, DBI, RMSSDI, RSI, SI, II, XBI, CHI]
6. X_External_Indices_list = [FI, NMII, PI, EI, RI, JI]
7. Count_outlier = 0
8. Count_inlier = 0
9. for item in X_Internal_Indices_list
10.   Compute item
11.   …
12.   if … then
13.     Append item to X_QoS_outlier_list
14.   else
15.     Append item string to X_QoS_inlier_list
16.   end if
17. end for
18. for item in X_External_Indices_list
19.   Compute item
20.   …
21.   if … then
22.     Append item to X_QoS_outlier_list
23.   else
24.     Append item string to X_QoS_inlier_list
25.   end if
26. end for
27. for item in X_QoS_outlier_list
28.   Count_outlier = Count_outlier + 1
29. end for
30. for item in X_QoS_inlier_list
31.   Count_inlier = Count_inlier + 1
32. end for
33. if … then
34.   for item in X_QoS_outlier_list
35.     Identify the clusters using item and declare them outliers
36.   end for
37. end if
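The counting logic of Pseudocode 2 can be sketched as below. The comparison conditions are not legible in the pseudocode, so this sketch assumes two rules that are not confirmed by the source: an index votes "outlier" when its value falls below its lower threshold, and a cluster is declared an outlier when outlier votes outnumber inlier votes. The lower thresholds are the 50-node values reported in Sect. 4.2.1; the sample index values and function names are hypothetical.

```python
def classify_indices(index_values, lower_thresholds):
    """Split index names into outlier and inlier lists (assumed rule: value < threshold)."""
    outlier_list, inlier_list = [], []
    for item, value in index_values.items():
        if value < lower_thresholds[item]:  # assumption: the source condition is lost
            outlier_list.append(item)
        else:
            inlier_list.append(item)
    return outlier_list, inlier_list

def is_outlier_cluster(index_values, lower_thresholds):
    """Assumed decision rule: more outlier votes than inlier votes."""
    outlier_list, inlier_list = classify_indices(index_values, lower_thresholds)
    return len(outlier_list) > len(inlier_list)

# Lower thresholds for the 50-node dataset (Sect. 4.2.1).
LOWER = {"II": 0.084, "XBI": 0.271, "GAMMA": 0.243}

# Hypothetical index values for two clusters.
stable = {"II": 0.30, "XBI": 0.40, "GAMMA": 0.25}
suspect = {"II": 0.05, "XBI": 0.10, "GAMMA": 0.30}
```

Here `is_outlier_cluster(suspect, LOWER)` is true because two of the three indices fall below their thresholds, while `stable` passes all three.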
3.2.3 Lightweight Outlier Detection (LOD)
LOD is a performance- and threshold-based lightweight outlier detection mechanism for resource-constrained devices. To measure performance, QoS metrics are used for analysis. To identify n objects with efficient local distance, proportionate QoS outlier factor (LDQOF) values, the throughput (TP), goodput (GP) and end-to-end delay (ETED) QoS metrics are used. Pseudocode 3 explains the LOD process in detail.
Pseudocode 3: Lightweight Outlier Detection (LOD)
1. Pick one data point … in …
2. Retrieve …'s …-nearest connected neighbors using …
3. Consider this as one network
4. X_QoS_outlier_list := NULL
5. X_QoS_inlier_list := NULL
6. X_QoS_list = [TP, GP, ETED]
7. Count_outlier = 0
8. Count_inlier = 0
9. for item in X_QoS_list
10.   Compute item
11.   …
12.   if … then
13.     Append item to X_QoS_outlier_list
14.   else
15.     Append item string to X_QoS_inlier_list
16.   end if
17. end for
18. for item in X_QoS_outlier_list
19.   Count_outlier = Count_outlier + 1
20. end for
21. for item in X_QoS_inlier_list
22.   Count_inlier = Count_inlier + 1
23. end for
24. if … then
25.   for item in X_QoS_outlier_list
26.     Sort the nodes using item and declare them outliers
27.   end for
28. end if
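The "Compute item" step of Pseudocode 3 can be illustrated with the usual definitions of the three QoS metrics. This is a hedged sketch: the `Packet` record layout, the limit values and the flagging rule (low throughput or goodput, high delay) are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    """Hypothetical record of one delivered packet."""
    sent_at: float      # seconds
    received_at: float  # seconds
    total_bits: int     # header plus payload
    payload_bits: int   # application data only

def qos_metrics(packets, duration_s):
    """Throughput and goodput in bits/s, mean end-to-end delay in seconds."""
    tp = sum(p.total_bits for p in packets) / duration_s
    gp = sum(p.payload_bits for p in packets) / duration_s
    eted = sum(p.received_at - p.sent_at for p in packets) / len(packets)
    return {"TP": tp, "GP": gp, "ETED": eted}

def qos_outlier_items(metrics, limits):
    """Assumed rule: TP/GP below their limit or ETED above its limit is flagged."""
    flagged = []
    for item, value in metrics.items():
        too_bad = value > limits[item] if item == "ETED" else value < limits[item]
        if too_bad:
            flagged.append(item)
    return flagged

packets = [Packet(0.0, 0.1, 512, 400), Packet(1.0, 1.3, 512, 400)]
metrics = qos_metrics(packets, duration_s=2.0)
flagged = qos_outlier_items(metrics, {"TP": 1000.0, "GP": 300.0, "ETED": 0.5})
```

In this example only the throughput falls short of its hypothetical limit, so `flagged` contains just `"TP"`.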
3.2.4 Heavyweight Outlier Detection (HOD)
As shown in Fig. 1, HOD is a multilayered outlier detection process. This process collects data from the LOD process and preprocesses it to identify missing data, remove duplicated data and enrich the data. After preprocessing, the data parser module parses the data for three different layers: MAC layer, routing layer and application layer. Each of these layers has its own outlier detection process. A brief description of outlier detection at each of these layers is as follows:
• MAC layer outlier detection (MACLOD) uses a machine learning process for outlier detection. The process has four phases: preprocessing, learning/training, evaluation and prediction. In the preprocessing phase, training and testing datasets are prepared for analysis. In the learning/training phase, features are extracted and compared using a decision tree-based clustering mechanism. In the evaluation phase, the features of the testing data are compared with the features of the training data for outlier detection. In the prediction phase, features of new data are extracted and compared directly with the expected features of an inlier or outlier, rather than executing the learning/training and evaluation phases again and again.
• Routing layer outlier detection (RLOD) is the second layer of the outlier detection process in HOD. Here, routing packets are filtered for analysis and constructing a
Markov chain. The state transition probability matrix is computed from the Markov chain, and the probability of an outlier is computed using this matrix.
• Application layer outlier detection (ALOD) is the third layer of the outlier detection process in HOD. In this layer, filtered application layer packets are processed for associativity rules. Associativity rules-based outlier detection is a content and contextual outlier detection mechanism. Content association determines the chain of source, intermediate and destination nodes, whereas context-based association determines outliers by dividing the nodes into colonies and empires. Contextually similar colonies are put together to construct empires, and dissimilar colonies are either moved to neighboring empires or considered outliers.
After the nodes are labeled as outliers or inliers, they are passed through the scoring module. This module may take a single-layer opinion or the aggregated observations of all layers, depending upon its configuration. The final score is computed as the percentage of outliers in a particular cluster and in the whole network.
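The routing-layer step described above (building a Markov chain and its transition matrix from filtered routing packets) can be sketched as follows. The paper does not say how states are defined or how the outlier probability is derived from the matrix, so this sketch assumes that states are routing packet types and that a sequence with low likelihood under the learned matrix is suspicious. The state names and function names are hypothetical.

```python
from collections import defaultdict

def transition_matrix(sequences):
    """Estimate P(next state | current state) from observed state sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    matrix = {}
    for current, row in counts.items():
        total = sum(row.values())
        matrix[current] = {nxt: n / total for nxt, n in row.items()}
    return matrix

def sequence_probability(matrix, seq):
    """Likelihood of a sequence of transitions; unseen transitions get probability 0."""
    prob = 1.0
    for current, nxt in zip(seq, seq[1:]):
        prob *= matrix.get(current, {}).get(nxt, 0.0)
    return prob

# Hypothetical training traces of routing packet types.
training = [
    ["RREQ", "RREP", "DATA"],
    ["RREQ", "RREP", "DATA"],
    ["RREQ", "RREQ", "RREP"],
]
P = transition_matrix(training)
p_normal = sequence_probability(P, ["RREQ", "RREP", "DATA"])
p_flood = sequence_probability(P, ["RREQ", "RREQ", "RREQ"])
```

A trace dominated by repeated requests (`p_flood`) scores far lower than the common request, reply, data pattern (`p_normal`), which is the kind of contrast an outlier threshold could act on.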
4 Simulation, and Its Results and Analysis
This section explains the experimental setup, the simulation parameters taken for analysis and the visualization of clusters. Clusters are validated through various internal and external indices. The simulation of the proposed approach and its environment setup are explained as follows.
4.1 Simulation Environment
In the simulation analysis, 50–5000 nodes are distributed randomly over a 1500 m × 1500 m area. The Random Waypoint mobility model is used with a wireless channel and an omnidirectional antenna to trace and capture packets. A maximum of 7 packets per second can be transferred, with each packet containing a maximum of 512 bits. The ns-3 [15] simulator is used to simulate nodes with 0.1 to 5 m/s mobility. The total simulation is executed for 2000 s with a multi-execution scenario.
4.2 Simulation-Indices Computation and Analysis
As discussed earlier, cluster validation evaluates the goodness of the clustering algorithm. A detailed evaluation of the proposed approach using cluster validation methods is explained as follows.
4.2.1 A Comparative Analysis of Internal Indices
Internal cluster validation methods use properties that are internal to the dataset. In this work, the indices used for evaluation are II, XBI and Γ. Figure 3 shows the comparative analysis of II, XBI and Γ with variation in the number of nodes. Figure 3 shows that the optimal indices' values for the 50, 100, 500, 1000, 2000, 3000, 4000 and 5000 node datasets are observed during the T1–T2, T3–T4, T5–T6, T7–T8, T8–T9, T8–T9, T8–T9 and T8–T9 slots with 4, 18, 23, 33, 38, 40, 53 and 56 clusters, respectively. Table 1 shows the comparative analysis of timing slots and the number of clusters indicating the optimal indices' value for all internal indices taken for analysis. The IEIODF_item^LOWER threshold values selected for all internal indices (taken for analysis) are the values where all indices agree on considering all clusters as valid clusters. For the 50-node dataset, the IEIODF_item^LOWER threshold values selected for II, XBI and Γ are 0.084, 0.271 and 0.243, respectively. The detailed change of the threshold index with variations in the number of nodes is shown in Fig. 3. This analysis is an experiment for internal indices. Figure 3 shows the comparative analysis of those indices whose values vary between 0 and 1. As compared to XBI, the values for II and Γ remain almost constant for all types of networks (small to large scale). If II,
Fig. 3 Analysis of index value variation for three internal indices (II, XBI and Γ). Axes: index value versus no. of nodes (50 to 5000)
Table 1 Timing slots and no. of clusters indicating optimal indices' value
Indices       Datasets
              50       100      500      1000     2000     3000     4000     5000
II       A    T5–T6    T10–T11  T6–T7    T11–T12  T1–T2    T10–T11  T5–T6    T12–T13
         B    32       36       23       41       8        49       34       59
         C    0.084    0.052    0.26     0.023    0.292    0.141    0.696    0.4
XBI      A    T7–T8    T8–T9    T10–T11  T6–T7    T2–T3    T1–T2    T5–T6    T2–T3
         B    32       36       38       30       13       7        34       27
         C    0.271    0.145    0.032    0.203    0.139    0.269    0.01     0.223
Γ        A    T12–T13  T5–T6    T6–T7    T6–T7    T12–T13  T2–T3    T7–T8    T5–T6
         B    32       22       31       30       44       14       48       46
         C    0.243    0.104    0.209    0.312    0.184    0.108    0.212    0.174
*A = timing slots, B = number of clusters, C = indices value
XBI and Γ are compared among themselves, then maximum variation is observed in II and minimum variation is observed in XBI. The Γ index value decreases from 50 to 100 nodes (very small-scale network), increases from 100 to 1000 nodes (very small- to medium-scale network), decreases from 1000 to 3000 nodes (medium-scale network) and increases from 3000 to 5000 nodes with a small decrease from 4000 to 5000 nodes (large-scale network). The XBI value decreases from 50 to 500 nodes (small-scale network), increases from 500 to 3000 nodes with a one-time decrease at 2000 nodes (medium-scale network), and shows its maximum decrease (from 3000 to 4000 nodes) and increase (from 4000 to 5000 nodes) for the large-scale network.
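The trends described above can be checked directly against the index values in row C of Table 1. The short sketch below only restates those table values and computes the direction of change between consecutive dataset sizes; the variable and function names are illustrative, and the third series is the index whose symbol appears as Γ in the legend of Fig. 3.

```python
NODES = [50, 100, 500, 1000, 2000, 3000, 4000, 5000]
# Row C of Table 1 (indices value) for the third index and for XBI.
GAMMA = [0.243, 0.104, 0.209, 0.312, 0.184, 0.108, 0.212, 0.174]
XBI = [0.271, 0.145, 0.032, 0.203, 0.139, 0.269, 0.01, 0.223]

def trends(values):
    """Direction of change between each pair of consecutive dataset sizes."""
    return ["up" if b > a else "down" if b < a else "flat"
            for a, b in zip(values, values[1:])]

def largest_step(values):
    """Index i of the largest absolute change between values[i] and values[i + 1]."""
    steps = [abs(b - a) for a, b in zip(values, values[1:])]
    return steps.index(max(steps))

gamma_trend = trends(GAMMA)
xbi_trend = trends(XBI)
i = largest_step(XBI)
xbi_biggest_change = (NODES[i], NODES[i + 1])
```

The largest single change in XBI falls between the 3000- and 4000-node datasets, which agrees with the maximum decrease noted in the text.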
4.2.2 A Comparative Analysis of External Indices
To measure the goodness of a clustering algorithm, external validation methods use external information for comparisons. For example, known labeled cluster datasets are generally preferred for comparing a produced partition with a known partition [16]. In this work, the methods used for external cluster validation are EI, RI and JI. Figure 4 shows the comparative analysis of EI, RI and JI with variations in the number of nodes. For example, Fig. 4 shows that the optimal EI values for the 50, 100, 500, 1000, 2000, 3000, 4000 and 5000 node datasets are observed during the T3–T4, T3–T4, T1–T2, T7–T8, T2–T3, T8–T9, T3–T4 and T4–T5 slots with 13, 36, 11, 34, 21, 39, 55 and 48 clusters, respectively. Table 2 shows a detailed analysis of timing slots and the number of clusters indicating the optimal indices' value for all external indices taken for evaluation. The IEIODF_threshold^Lower values selected for all external indices (taken for analysis) are the values where all indices agree on considering all clusters as valid clusters. For the 50-node dataset, the IEIODF_threshold^Lower
Fig. 4 Analysis of index value variation for external indices (EI, RI and JI). Axes: index value versus no. of nodes (50 to 5000)
Table 2 Timing slots and no. of clusters indicating optimal indices value
Indices       Datasets
              50        100       500      1000      2000     3000     4000     5000
EI       A    T2–T3     T6–T7     T3–T4    T7–T8     T4–T5    T8–T9    T12–T13  T7–T8
         B    13        36        11       34        21       39       55       48
         C    0.837     0.881     0.905    0.868     0.98     0.793    0.946    0.898
RI       A    Up to T1  Up to T1  T1–T2    Up to T1  T1–T2    T3–T4    T9–T10   T10–T11
         B    3         4         5        5         8        21       52       57
         C    0.981     0.998     0.92     0.931     0.937    0.98     0.983    0.901
JI       A    T2–T3     T5–T6     T3–T4    T6–T7     T7–T8    T4–T5    T5–T6    T2–T3
         B    13        22        11       30        34       24       34       27
         C    0.982     0.885     0.806    0.882     0.967    0.961    0.975    0.84
*A = timing slots, B = number of clusters, C = indices value
thresholds for EI, RI and JI are 0.837, 0.981 and 0.982, respectively. A detailed comparative analysis of threshold variations with variations in the number of nodes is shown in Fig. 4. This experimentation is performed for external indices (EI, RI and JI). It is observed that the threshold index value lies between 0.7 and 1. As compared to the internal threshold index variation, the external threshold indices show a slight variation or remain constant with variations in the number of nodes. For a small-scale network (50–500 nodes), EI increases, whereas RI increases for a very small-scale network (50–100 nodes) and decreases for a very small-scale to small-scale network (100–500 nodes). The JI index value also decreases for a small-scale network (50–500 nodes). For a medium-scale network (500–3000 nodes), the JI and RI values increase, whereas the EI values increase for a small-scale to medium-scale network (500–1000 nodes) and decrease for a medium-scale network (1000–2000 nodes), followed by an increase from a medium- to large-scale network (2000–3000 nodes). For a large-scale network (3000–5000 nodes), EI increases for 3000–4000 nodes followed by a decrease for 4000–5000 nodes, whereas the RI and JI values increase from 3000 to 4000 nodes and decrease slightly from 4000 to 5000 nodes.
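External indices compare a produced partition with a known one. As an illustration, the sketch below computes two standard pair-counting measures, the Rand index and the Jaccard index. It assumes that RI and JI in this section denote these two measures, which the section itself does not spell out; EI is not covered. The label vectors are hypothetical.

```python
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Count point pairs by agreement between the known and produced partitions."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1  # together in both partitions
        elif same_true:
            sd += 1  # together in the known partition only
        elif same_pred:
            ds += 1  # together in the produced partition only
        else:
            dd += 1  # separated in both partitions
    return ss, sd, ds, dd

def rand_index(labels_true, labels_pred):
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(labels_true, labels_pred):
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    return ss / (ss + sd + ds)

known = [0, 0, 1, 1]     # hypothetical ground-truth cluster labels
produced = [0, 0, 1, 2]  # hypothetical output of a clustering run
ri = rand_index(known, produced)
ji = jaccard_index(known, produced)
```

Both measures equal 1 when the two partitions are identical, which is consistent with the threshold values close to 1 reported in Table 2.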
5 Conclusion In MANET, outlier detection systems are not only helpful in detecting the number of attacks but can also adaptively respond and/or mitigate the detected attacks. In this work, the proposed outlier detection scheme used a multidimensional and multilayer outlier detection mechanism for MANETs. In a multidimensional architecture, three subsystems are proposed: ultra-lightweight, lightweight and heavyweight. Ultra-lightweight and lightweight systems are threshold-based outlier detection systems. Ultra-lightweight system detects outliers based on internal and external
indices, whereas the lightweight system detects them through QoS parameters. The heavyweight outlier system uses a multilayered outlier detection mechanism. This system detects outliers from the application, routing and MAC layers of the MANET protocol stack. In the simulation analysis, it is observed that the number of clusters required for small-, medium- and large-scale networks varies from 5 to 52. A minimum of 0.91% and a maximum of 104.1% improvement in cluster stability is observed.
References
1. Cluster Validation Statistics: Must Know Methods - Articles - STHDA. http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/. Accessed 5 July 2018
2. G. Brock, V. Pihur, S. Datta, S. Datta, clValid: an R package for cluster validation. J. Stat. Softw. 25(4), 1–22 (2008)
3. M. Charrad, N. Ghazzali, B. Boiteau, A. Niknafs, NbClust: an R package for determining the relevant number of clusters in a data set. J. Stat. Softw. 61(6), 1–8 (2014)
4. S. Theodoridis, K. Koutroumbas, Pattern Recognition (Academic Press, 2009)
5. P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
6. J.C. Dunn, Well-separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)
7. D.L. Davies, D.W. Bouldin, A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. (1979)
8. T. Caliński, J. Harabasz, A dendrite method for cluster analysis. Commun. Stat. (1974)
9. D. Moulavi, P.A. Jaskowiak, R.J. Campello, A. Zimek, J.D. Sander, Density-based clustering validation. In: Proceedings of the 2014 SIAM International Conference on Data Mining (2014)
10. Evaluation of clustering. https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html#fig:clustfg3. Accessed 5 July 2018
11. Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, S. Wu, Understanding and enhancement of internal clustering validation measures. IEEE Trans. Cybern. (2013)
12. F. Kovács, C. Legány, A. Babos, Cluster validity measurement techniques. In: 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (2006), pp. 388–394
13. S. Huang, Y. Cheng, D. Lang, R. Chi, G. Liu, A formal algorithm for verifying the validity of clustering results based on model checking. PLoS One (2014)
14. S. Gurung, S. Chauhan, A dynamic threshold based approach for mitigating black-hole attack in MANET. Wirel. Networks, 1–15 (2017)
15. The Network Simulator-ns-2. https://www.isi.edu/nsnam/ns/. Accessed 5 July 2018
16. T. Van Craenendonck, K. Leuven, H. Blockeel, Using Internal Validity Measures to Compare Clustering Algorithms (ICML, 2015), 1–8
17. W. Li, A. Joshi, Outlier detection in ad hoc networks using Dempster-Shafer theory. In: 2009 Tenth International Conference on Mobile Data Management: Systems, Services and Middleware (IEEE, May 2009), pp. 112–121
18. B. Sun, L. Osborne, Y. Xiao, S. Guizani, Intrusion detection techniques in mobile ad hoc and wireless sensor networks. IEEE Wirel. Commun. 14(5) (2007)
19. J. Karlsson, G. Pulkkis, L.S. Dooley, A packet traversal time per hop based adaptive wormhole detection algorithm for MANETs. In: 2016 24th International Conference on Software, Telecommunications and Computer Networks (SoftCOM) (IEEE, September 2016), pp. 1–7
20. S. Yadav, M.C. Trivedi, V.K. Singh, M.L. Kolhe, Securing AODV routing protocol against black hole attack in MANET using outlier detection scheme. In: 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON) (IEEE, October 2017), pp. 1–4
21. A. Kumar, K. Gopal, A. Aggarwal, Design and analysis of lightweight trust mechanism for secret data using lightweight cryptographic primitives in MANETs. IJ Netw. Secur. 18(1), 1–18 (2016)
22. S. Henningsen, S. Dietzel, B. Scheuermann, Challenges of misbehavior detection in industrial wireless networks. In: Ad Hoc Networks (Springer, Cham, 2018), pp. 37–46
23. A. Kumar, K. Gopal, A. Aggarwal, Novel trust hierarchical construction for RFID sensor-based MANETs using ECCs. ETRI J. 37(1), 186–196 (2015)
24. A. Kumar, K. Gopal, A. Aggarwal, A novel lightweight key management scheme for RFID-sensor integrated hierarchical MANET based on internet of things. Int. J. Adv. Intell. Paradi. 9(2–3), 220–245 (2017)
25. A. Kumar, A. Aggarwal, Performance analysis of MANET using elliptic curve cryptosystem. In: 14th International Conference on Advanced Communication Technology (ICACT) (IEEE, 2012), pp. 201–206
26. A. Kumar, K. Gopal, A. Aggarwal, Simulation and cost analysis of group authentication protocols. In: 2016 Ninth International Conference on Contemporary Computing (IC3) (IEEE, Noida, India, 2016), pp. 1–7
27. A. Kumar, A. Aggarwal, K. Gopal, A novel and efficient reader-to-reader and tag-to-tag anti-collision protocol. IETE J. Res. 1–12 (2018). [Published Online]
28. A. Kumar, A. Aggarwal, Charu, Survey and taxonomy of key management protocols for wired and wireless networks. Int. J. Netw. Secur. Appl. 4(3), 21–40 (2012)
Microscopic Image Noise Reduction Using Mathematical Morphology
Mangala Shetty and R. Balasubramani
Abstract In image processing, mathematical morphological (MM) operations play an important role in enhancing regions of an image. Basic morphological techniques are mainly useful for improving the quality of an image. During acquisition and transmission, an image may be polluted by salt-and-pepper noise, which directly reduces image quality in the subsequent stages of image analysis. Recovering the actual image from an image that is distorted by noise is therefore of great importance [1]. This paper presents an approach that uses morphological functions to reduce salt-and-pepper noise in scanning electron microscopic (SEM) images of bacteria cells. Noise removal has a strong effect on the accuracy of segmentation and classification of bacteria cells; to identify bacteria cells automatically, accurately and within a short period of time, the noise has to be removed from the SEM image. Various quality assessment operations are used to measure the quality of the enhanced images. The experimental results indicate that the approach reduces noise effectively without blurring edges. The validation outcomes of the denoised images, with a higher peak signal-to-noise ratio (PSNR) and a lower mean squared error (MSE), show their reliable application potential.
Keywords Mathematical morphology · Structuring element · SEM Bacteria
M. Shetty (B) · R. Balasubramani
NMAMIT Nitte, Karkala, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_52
1 Introduction
Set theory principles are the building blocks of MM. The geometrical shape of the object in an image is considered when applying MM techniques. Since many morphological operations examine the relative ordering of pixel values, these operators are
effective in reducing the noise level present in images [2–4]. The design of noise reduction strategies for image-based computing systems is of great importance [5]. The primary principle behind morphological processing is to analyze the geometric shape of an image by probing it at different locations with small patterns called structuring elements [6]. Noise can be defined as a random variation in image brightness that degrades the image. In most cases, noise is introduced during the image acquisition phase, in which an optical image is converted into a continuous electrical signal and then sampled [2]. Complete noise removal without distorting the image signal is not possible, but noise can be reduced to an acceptable range for further analysis and processing. Salt-and-pepper noise is the most common type of additive noise in images. It arises for many reasons, including defective camera sensors, faulty memory regions, timing faults in the digitization process and transmission of the signal over noisy channels [2]. Several impulsive noise reduction strategies are described in the literature. The most popular non-linear and scalable filter is the standard median filter (SMF) [7]. As the noise level increases, this filter tends to blur the image and distort its content. The same limitation applies to the progressive switching median filter (PSMF) [8]. More recent adaptive noise removal algorithms are the decision-based algorithm (DBA) [9] and the noise adaptive fuzzy switching median (NAFSM) filter for salt-and-pepper noise reduction [10]. In the present study, an efficient structuring element for reducing the salt-and-pepper noise present in scanning electron microscope (SEM) images of bacteria cells is presented. The relevant factors to be analyzed are as follows.
(1) Impacts of peak signal-to-noise ratio and MSE on noise reduction in an unprocessed SEM image.
(2) Effective quality assessment of noise-reduced SEM images.
(3) Comparison with other 2D-flat structuring elements in reducing salt-and-pepper noise from SEM images using the proposed mathematical morphology-based method.
In the resulting images, the proposed method retains edge and feature information. In addition, experimental data show that the proposed model performs well in noise reduction even when images are distorted with higher levels of impulsive noise.
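For orientation, the salt-and-pepper noise model and the standard median filter mentioned in the introduction can be sketched in a few lines. This is a baseline illustration only, not the morphological method proposed in the paper. Images are plain nested lists of 8-bit gray levels, and the function names, the 3 × 3 window and the border handling are assumptions.

```python
import random

def add_salt_and_pepper(image, density, rng):
    """Replace roughly `density` of the pixels by 0 (pepper) or 255 (salt)."""
    noisy = [row[:] for row in image]
    for r in range(len(noisy)):
        for c in range(len(noisy[0])):
            if rng.random() < density:
                noisy[r][c] = 255 if rng.random() < 0.5 else 0
    return noisy

def median_filter_3x3(image):
    """Standard median filter; windows are clipped at the image border."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(h):
        for c in range(w):
            window = sorted(
                image[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            )
            out[r][c] = window[len(window) // 2]
    return out

flat = [[100] * 5 for _ in range(5)]
one_salt = [row[:] for row in flat]
one_salt[2][2] = 255                     # a single corrupted pixel
restored = median_filter_3x3(one_salt)   # the median removes the isolated impulse
heavily_noisy = add_salt_and_pepper(flat, density=1.0, rng=random.Random(0))
```

An isolated impulse is removed completely, but at high noise densities most of each window is corrupted, which is the regime in which the median filter is reported to blur the image.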
2 Morphological Operations
In image processing, morphological operations are widely used [8] to improve the appearance of images. MM is also applied to reduce noise: it uses a structuring element to probe the image, so that useful information can be extracted and noise reduced while the features are preserved. This paper reports an experiment in which four morphological operations are applied to reduce the noise in grayscale images and thereby enhance their quality.
2.1 Structuring Element
A structuring element (SE) can be defined as a simple predefined shape used to identify neighborhood pixel values. 2D-flat SEs play a vital role in morphological operations on binary and grayscale data because the light transmission functions of such data are unknown and morphological operations act on the relative ordering of pixel values rather than on their numerical values. Graphically, a structuring element can be represented either by a matrix of 0s and 1s or as a set of foreground pixels that all have the value 1. Conventional structuring elements include arbitrary, ball and diamond shapes. There are two types of structuring elements: flat and non-flat. In this paper, five arbitrary 2D-flat structuring elements, namely disk, square, rectangle, line and octagon, are used for the experiments; they are shown in Figs. 1, 2, 3 and 4.
2.2 Dilation
Dilation adds pixels to the boundaries of a region and can also fill holes in the picture. Depending on the structuring element, the holes in an image are either closed completely or narrowed. Dilation therefore extends the original figure while shrinking its holes. It can also connect disjoint pixels and insert pixels at edges.
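The matrix representation of flat structuring elements described above can be sketched as follows. The constructors are illustrative assumptions: the disk is a plain Euclidean disk, which may differ from the disk approximation used by a particular image-processing toolbox, and the octagon is omitted.

```python
def rectangle_se(rows, cols):
    """Flat rectangular SE: a rows x cols matrix of 1s."""
    return [[1] * cols for _ in range(rows)]

def square_se(size):
    """Flat square SE: a size x size matrix of 1s."""
    return rectangle_se(size, size)

def line_se(length):
    """Flat horizontal line SE: a single row of 1s."""
    return [[1] * length]

def disk_se(radius):
    """Flat disk SE: 1 where the cell lies within `radius` of the centre."""
    size = 2 * radius + 1
    return [
        [1 if (r - radius) ** 2 + (c - radius) ** 2 <= radius ** 2 else 0
         for c in range(size)]
        for r in range(size)
    ]

def foreground_pixels(se):
    """The same SE as a set of (row, col) offsets from its centre (the origin)."""
    oy, ox = len(se) // 2, len(se[0]) // 2
    return {(r - oy, c - ox)
            for r, row in enumerate(se)
            for c, v in enumerate(row) if v}
```

`foreground_pixels` shows the second representation mentioned in the text: the set of foreground positions relative to the SE origin, here assumed to be the centre cell.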
Fig. 1 Square boundary and rectangle boundary SE
Fig. 2 Octagon boundary SE
Fig. 3 Circle boundary SE
Fig. 4 Line boundary SE
2.3 Erosion
The erosion operation produces the reverse effect of dilation: boundaries are narrowed and holes are expanded. As the structuring element slides across the image, an ON-valued pixel is kept only where the structuring element fits completely within the ON-valued region; every other ON-valued pixel is set to OFF.
2.4 Opening and Closing
Erosion and dilation may be applied repeatedly to achieve the desired results. However, the order in which these operations are executed makes a difference in the processed picture. Opening and closing are obtained by combining dilation and erosion. Opening consists of erosion followed by dilation with the same structuring element. Closing consists of dilation followed by erosion with the same structuring element. Opening smooths surface contours, splits narrow joints and removes thin ridges. Closing also smooths contour sections, but it fuses narrow breaks, fills contour gaps and removes small holes. When an image contains many small noise regions, the opening operation should be used. Closing, on the other hand, restores connectivity between objects that are close to each other.
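The four operations described in Sects. 2.2 to 2.4 can be sketched for grayscale images with a flat SE, where erosion takes the minimum and dilation the maximum over the SE neighborhood. This is a minimal sketch under assumptions: images are nested lists, the SE is symmetric with its origin at the centre, and neighbors outside the image are ignored. It is not the authors' implementation.

```python
def _probe(image, se, pick):
    """Apply `pick` (min or max) over the SE neighborhood of every pixel."""
    h, w = len(image), len(image[0])
    oy, ox = len(se) // 2, len(se[0]) // 2  # SE origin assumed at its centre
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            values = [
                image[r + i - oy][c + j - ox]
                for i, row in enumerate(se)
                for j, on in enumerate(row)
                if on and 0 <= r + i - oy < h and 0 <= c + j - ox < w
            ]
            out[r][c] = pick(values) if values else image[r][c]
    return out

def erode(image, se):
    return _probe(image, se, min)

def dilate(image, se):
    return _probe(image, se, max)

def opening(image, se):
    """Erosion followed by dilation: removes small bright (salt) specks."""
    return dilate(erode(image, se), se)

def closing(image, se):
    """Dilation followed by erosion: fills small dark (pepper) specks."""
    return erode(dilate(image, se), se)

square = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
flat = [[100] * 5 for _ in range(5)]
salt = [row[:] for row in flat]
salt[2][2] = 255     # one bright impulse
pepper = [row[:] for row in flat]
pepper[2][2] = 0     # one dark impulse
```

With a 3 × 3 square SE, `opening(salt, square)` and `closing(pepper, square)` both return the flat image, which is why an opening followed by a closing is a common way to treat both kinds of impulse.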
Fig. 5 Process sequence in the proposed method
3 Proposed Approach
The term morphology refers to a specific method of filtering with SEs in digital image processing. In morphological image processing, choosing a suitable SE is a very important task. An SE can be represented either by a matrix of 0s and 1s or as a set of foreground pixels that all have the value 1. The origin of the SE must be clearly identified in both representations. The proposed technique uses morphological operations with 2D-flat structuring elements to eliminate salt-and-pepper noise. Figure 5 displays the process sequence of the proposed method. In the first stage, images of lactococcus bacteria are taken in different dimensions; in the second stage, they are converted into grayscale images. A grayscale image is defined by intensity values that range from black at the lowest intensity to white at the highest. In the third stage, the morphological operations are performed with the arbitrary 2D-flat SEs proposed in this paper and shown in Figs. 1, 2, 3 and 4.
4 Experimental Results and Discussions
Five arbitrary 2D-flat SEs were used with the morphological operators to perform the noise removal process: disk, square, line, octagon and rectangle. Lactococcus images were chosen for the experiments, with dimensions of 512 × 512, 460 × 819, 1024 × 1024, 1218 × 1120 and 2048 × 2048 and a salt-and-pepper noise level within sixty percent. From the resulting images in Figs. 10, 11, 12, 13, 14 and 15, it is clear that most of the noise can be reduced using the square-shaped SE. The numerical PSNR and MSE measurements, shown in Figs. 6, 7, 8 and 9, quantify the improvement obtained by reducing the noise. According to these statistical observations, the square SE yields the highest PSNR and the octagon SE the lowest with the proposed method.
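The two quality measures used above follow their standard definitions: MSE is the mean of the squared pixel differences between a reference image and a processed image, and PSNR is 10·log10(peak² / MSE) in decibels. The sketch below assumes 8-bit images (peak value 255) stored as nested lists; the function names and sample images are illustrative.

```python
import math

def mse(reference, processed):
    """Mean squared error between two equally sized grayscale images."""
    total, count = 0, 0
    for ref_row, proc_row in zip(reference, processed):
        for a, b in zip(ref_row, proc_row):
            total += (a - b) ** 2
            count += 1
    return total / count

def psnr(reference, processed, peak=255):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(reference, processed)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)

reference = [[100, 100], [100, 100]]
processed = [[100, 100], [100, 110]]  # one pixel off by 10 gray levels
error = mse(reference, processed)      # 10**2 / 4 pixels
quality = psnr(reference, processed)   # 10 * log10(65025 / 25)
```

A better denoising result has a lower MSE and therefore a higher PSNR, so the two measures always move in opposite directions for a fixed peak value.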
Fig. 6 PSNR for 1024 × 1024 image
Fig. 7 PSNR for 512 × 512 image
Fig. 8 MSE for 1024 × 1024 image
Fig. 9 MSE for 512 × 512 image
Fig. 10 Noisy Image
Fig. 11 Image with line SE
Fig. 12 Image with disk SE
Fig. 13 Image with square SE
Fig. 14 Image with rectangle SE
Fig. 15 Image with octagon SE
5 Conclusion and Future Work
The SE plays an important role in image enhancement for noise removal using morphological operations on SEM images of bacteria. Selecting different structuring elements leads to a wide range of applications for analyzing and storing the geometric details of images, and ultimately determines the distribution and volume of the data that remain after the morphological transformation. Dilation, erosion, opening and closing are the morphological procedures applied in this experiment to the noisy SEM images of lactococcus bacteria cells. Although each of these operations contributes to improving the images on its own, combining them greatly improves the appearance of a noisy image by reducing noise. The conclusions made in this paper are based purely on the experimental outcome. The morphological analysis was performed with five arbitrary SEs to carry out the noise reduction procedure. Statistical measurements are also reported alongside the resulting images for the different 2D-flat SEs. Among the various SEs, the square SE is found to be the most reliable for eliminating noise according to both visual evaluation and statistical measurements. The results were reliable and a strong degree of improvement was reached, showing the efficiency of the proposed work. More morphological operations will be experimented with at higher noise levels in future research.
Acknowledgments The authors are grateful to Dr. Dennis Kunkel, former president of Dennis Kunkel Microscopy Inc., for supplying the SEM images.
A Decision-Based Multi-layered Outlier Detection System for Resource Constraint MANET Adarsh Kumar and P. Srikanth
Abstract A MANET is a useful network for providing various services and applications, among which resource sharing is important. Sharing resources is possible only when their availability is ensured. In this work, a multi-dimensional, multi-layered solution is proposed for ensuring the availability of network resources. The multi-dimensional approach provides criteria for collecting and analyzing data from different security dimensions. A multi-layered outlier detection algorithm using hierarchical data interconnection is proposed. The analysis shows that internal indices such as DBI and RSI confirm cluster stability under the proposed approach: a minimum of 4.1% and a maximum of 11.3% stability is observed as the number of nodes varies. Similarly, external indices such as F-measure and NMI indicate stability in comparison to external clusters, with a minimum of 2% and a maximum of 13.5% stability observed.
Keywords Outliers · Attack detection and countermeasure · MANET · Clustering · QoS
A. Kumar (B) · P. Srikanth
Department of Systemics, School of Computer Science, University of Petroleum and Energy Studies, Dehradun 248007, Uttarakhand, India
e-mail: [email protected]
P. Srikanth
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_53

1 Introduction
Mobile ad hoc networks (MANETs), built from devices with limited hardware, are decentralized, autonomous, and dynamic in nature. With this type of network, various applications can be designed to address [1] natural or man-made disasters, road traffic issues, group or military movements, item and visitor tracking, autonomous household appliances, etc. The major challenge in addressing these issues is implementing security under a scarcity of resources [2]. Security requirements include the implementation of cryptographic primitives and protocols. To implement any primitive or protocol, devices must be available; thus, availability is the first challenge that must be resolved. Outlier detection mechanisms can help ensure availability: devices with good historical or present records are allowed to work in the network, while other devices are put under scrutiny until they prove themselves. The goal of ensuring availability through outlier detection is to identify the nodes relevant for communication. There are various outlier detection mechanisms, and the majority of them follow statistical approaches. Statistical approaches require both training and testing processes for identification, and they are either parametric or non-parametric [2]. Gaussian processes modeled using means and covariances are the best example of parametric outlier detection, whereas non-parametric approaches do not rely on an explicit mathematical model. In another classification, univariate and multivariate approaches are also helpful in outlier detection. In network-specific scenarios (like MANETs), multi-layered, parametric, and multivariate features are considered important for priority-based outlier detection [3]. These multi-layered solutions offer scalability, high-quality results, and better performance. Because different attacks are possible at different layers, a performance- and feature-based system helps identify unruly nodes. This work proposes a multi-layered architecture for outlier detection with four layers. Layer-1 uses an unsupervised machine learning approach and a Bayesian classifier for outlier detection and analysis.
The training dataset is prepared with each record labeled as outlier or inlier. Network layer packet features are used for probability-based outlier detection. This data is used again at Layer-3 for application trust score-based outlier detection, where an association rule-based mechanism is applied to the extracted features. Layer-4 is an aggregated outlier detection module built on the three layers above; in this layer, outliers are identified at the subgroup and network levels. Overall, the multi-layered outlier detection module scales flexibly with increasing complexity while ensuring availability. This work is organized as follows. Section 2 surveys the literature on multi-layered outlier detection in MANETs. Section 3 explains the proposed unsupervised machine learning-based multi-layered outlier detection approach for MANETs built from limited-hardware devices. Section 4 presents the simulation results and their analysis. Finally, conclusions are drawn in Sect. 5.
2 Literature Survey
With the growth of wide-ranging applications for MANETs, it is essential to recognize the importance of efficient security services that ensure the availability of resources as and when required. Outlier detection approaches are preferred because their lightweight statistical techniques suit the resource-constrained devices in MANETs. Various multi-layered outlier detection models have been proposed for networks with ad hoc connectivity and instability. For example, Mapanga et al. [3] proposed a neural network-based multi-layered intrusion detection architecture for MANETs using the outlier detection process. The model analyzes packets at the network layer and passes the results to other layers. An enhanced threshold-based packet analysis process is used for attack analysis, and the technique detects and isolates black hole attacks from the network. It is claimed to perform better in terms of packet delivery and end-to-end delay. In [4], a two-level outlier detection scheme is proposed for detecting misbehaving nodes in MANETs. Guo et al. [4] proposed an outlier detection mechanism based on joint prediction of the level and the conditional variance of the traffic flow series. In this mechanism, data is collected from different regions and a comparative analysis is performed to establish the efficiency of the proposed detection. Recommendations are made to investigate the underlying outlier-generating mechanism and countermeasures for transportation-based applications. The approach relies on the idea that the variance of the smaller population, which represents malicious nodes, should be greater than the variance of the larger population, which represents normal nodes in the network. To separate normal nodes from malicious nodes efficiently and effectively, linear regression is performed in parallel to compute the threshold from the measured fluctuation of the received classified instances. Other outlier detection approaches that are helpful for ad hoc connectivity and unstable environments are discussed in detail in [5–11, 13–18]. As expected, most existing approaches use network-layer packet analysis for intrusion detection [7]. These packet analysis processes hardly apply any machine learning in their pre-processing or analysis; thus, incomplete or inconsistent data records are also identified as outliers. In addition, the complexity of outlier computation is not taken into consideration. Complexity is an important parameter for resource-constrained devices, so a mechanism should be flexible and scalable with respect to the computational complexity of outlier detection.
3 Proposed Approach
This section explains the proposed outlier detection approach in detail, as shown in Fig. 1. The approach applies multiple techniques at different layers for outlier detection. These techniques are discussed as follows:
3.1 Data Pre-processing
Initially, the network is constructed and each node's performance data is logged for analysis. The constructed network is adaptive in nature and uses multiple protocols for data exchange and configuration. Adaptability is applied whenever performance needs to improve. After network construction and configuration, the logged data is processed further. This processing includes identifying false entries, records affected by side-channel attacks, and duplicated records, and making initial observations of data dependencies. The initial observations and record dependencies help reduce the data for further analysis and presentation. After data preparation, the data is forwarded to the layers for analysis.
Fig. 1 Proposed outlier detection approach. Layer-1: unsupervised machine learning approach characterizing network layer data. Layer-2: unsupervised machine learning approach characterizing transport layer data. Layer-3: unsupervised machine learning approach characterizing application layer data, with advanced outlier detection using rule mining. Layer-4: aggregated detection

3.2 Layer-1 Outlier Detection
Fig. 2 Machine learning cycle
This is the first layer of outlier detection. In this layer, data link layer features are extracted and these feature sets are processed through the machine learning cycle shown in Fig. 2. Data processing starts with one window, and the window size varies up to 'N'. Datasets whose label is observed as outlier or inlier are placed in the training set; unlabeled and unpredictable data is placed in the testing dataset. The proposed mechanism observes the nature of the data for anomaly detection: the rate of anomalies is computed during training to estimate the probability of an anomaly. In a given window, a packet is observed multiple times for distinct values. If a packet with the same source but different destinations, or with different sources but the same destination, is observed 'x' times with 'd' distinct values, then the probability of anomaly is x/d. This processing is performed in a decision tree using a ruleset, which also helps in building a trained classifier. The classifier saves time when detecting outliers and inliers in new data. The sliding window process of outlier detection in the machine learning cycle is explained in Fig. 3. This process collects data sequences from the logged data and inserts a window into the training dataset. New entries from the testing dataset are then inserted into the training dataset one by one through feature extraction and comparison. The comparison matches an outlier with a new outlier label and an inlier with a new inlier label. After labeling, each node profile is built. A node profile is a context-aware representation of the node's features. The hierarchical clustering mechanism [17] is used for building node profiles and computing the average of all node values that share a similar profile.
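The window-based counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: packets are assumed to be `(source, destination)` pairs, `x` is read as the number of packets seen from one source inside the window, and `d` as the number of distinct destinations among them. The function names are hypothetical, and the ratio is returned exactly as the text writes it (x/d), since the source does not say how it is normalized.

```python
from collections import defaultdict

def window_anomaly_ratio(window):
    """Return {source: (x, d, x / d)} for one window of (source, destination) pairs.

    x: number of packets observed from the source inside the window (assumed reading)
    d: number of distinct destinations among those packets
    """
    seen = defaultdict(list)
    for source, destination in window:
        seen[source].append(destination)
    return {
        source: (len(dests), len(set(dests)), len(dests) / len(set(dests)))
        for source, dests in seen.items()
    }

def sliding_windows(packets, size):
    """Yield every run of `size` consecutive packets, advancing one packet at a time."""
    for start in range(len(packets) - size + 1):
        yield packets[start:start + size]
```

A caller would typically iterate `sliding_windows(logged_packets, size)` for each window size from 1 up to N and pass each window to `window_anomaly_ratio`.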
Fig. 3 WxN-window process in machine learning cycle
This hierarchical process of profile building helps connect similar nodes together and identify feature-based outliers. After the machine learning phase, another evaluation phase is integrated for extracting node features. This phase predicts dependencies based on process contextualization and identifies outliers [10]. Figure 4a–c shows process contextualization without outliers, and Fig. 4d–f shows process contextualization with outliers. Figure 4a shows a process with independent nodes that have no interconnection. Although the nodes are not interconnected, all of them are connected to a single process/activity, so they are not considered outliers. Figure 4b shows nodes interconnected among themselves and connected to a single common process. In this scenario, some nodes act as sources and others as destinations. Nodes may also act as intermediate nodes, but no intermediate node should allow an alternative path to existing paths. Multiple self-loops are allowed in this process. Figure 4c shows another scenario with parallel activities, in which multiple nodes are interconnected in single or multiple processes and parallel paths are possible. Each process must have a single source and a single destination. For k = n, a maximum of n parallel activities is allowed. Figure 4d shows a process of contextual outlier detection in which nodes that are connected in a process but have not performed any activity for a long time are considered outliers. This is a threshold-based approach: initially, the threshold time period is the average waiting time for any activity in the network; thereafter, the average time period of the subgroup is used for detection. Figure 4e shows contextual outlier detection when not all nodes are connected in a process and the disconnected nodes regularly act as source nodes. In this scenario, multiple paths from disconnected nodes to destination nodes are not possible. Figure 4f shows a scenario where multiple paths are possible. In both cases, disconnected nodes are considered outlier nodes. The detailed process of contextual outlier detection is explained in Pseudocode 1.
Fig. 4 Process contextualization as directed acyclic graphs (DAGs), panels a–f.
a DAG when k = 0: nodes found connected in a process but with no activity (no outliers).
b DAG when k = 0: nodes found connected in a process with a single activity (no outliers).
c DAG when k = 2: nodes found connected in a process with two parallel activities (no outliers).
d DAG when k = 0: nodes found not connected in a process for a long time with no activity (outliers).
e DAG when k = 1: nodes found not connected in a process for a long time with activity (outliers). Possibilities: distance bounding attack, distance hijacking attack, man-in-the-middle attack, sinkhole attack.
Pseudocode 1: Contextual Outlier Detection
Goal: To evaluate the class of data points collected from a particular node.
1. Iterate over the nodes one by one.
2. Extract the features of each data element coming into or going out of the node.
3. Analyze the features and identify whether the collected features predict a graph with one or more parent nodes.
4. Calculate the period of inactivity of a node that has no connection with any process.
5. If the time of inactivity exceeds a threshold then
6.     Mark the node as an outlier
7. End if
8. If the node is not connected with any process but is performing an activity with other nodes then
9.     Mark the node as an outlier
10. End if
11. If all nodes are connected with any process then
12.     Execute the content-based outlier detection process
13.     If randomly picked content is suspicious according to historical records then
14.         Get the connected nodes' profiles and mark them for outlier analysis
15.     End if
16.     If randomly picked content is suspicious according to historical records with multi-dimensional features then
17.         Get the connected nodes' profiles and mark them for outlier analysis
18.     End if
19. else
20.     return
21. End if
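The control flow of Pseudocode 1 can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' code: the `Node` fields, the time unit, and the `content_is_suspicious` callback (which stands in for the content-based checks of steps 12 to 18) are hypothetical names introduced only for illustration. The inactivity test follows the wording of step 4, i.e., it is applied to nodes that are not connected with any process.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

@dataclass
class Node:
    name: str
    connected_to_process: bool  # is the node attached to any process?
    inactive_for: float         # time without activity (hypothetical unit: seconds)
    active_with_peers: bool     # is it exchanging traffic with other nodes?

def contextual_outliers(
    nodes: List[Node],
    threshold: float,
    content_is_suspicious: Callable[[Node], bool] = lambda node: False,
) -> Tuple[Set[str], Set[str]]:
    """Return (outliers, flagged_for_analysis), following the order of Pseudocode 1."""
    outliers: Set[str] = set()
    flagged: Set[str] = set()
    for node in nodes:                                   # step 1
        if not node.connected_to_process:
            if node.inactive_for > threshold:            # steps 4-7
                outliers.add(node.name)
            if node.active_with_peers:                   # steps 8-10
                outliers.add(node.name)
    if all(node.connected_to_process for node in nodes):  # step 11
        for node in nodes:                               # steps 12-18
            if content_is_suspicious(node):
                flagged.add(node.name)
    return outliers, flagged
```

Nodes in the second set are only marked for further outlier analysis, which mirrors steps 14 and 17; they are not labeled as outliers directly.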
Observations from the context- and content-based outlier detection processes are compared with those of the divisive hierarchical clustering process defined previously. Both observations are given equal importance: if the labels from both analyses are the same, that label is taken as the final data label; if they disagree, the data is put back into the testing dataset for analysis again.
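The reconciliation rule above amounts to a simple comparison of two label sets. The sketch below assumes that each analysis yields a dictionary from a node identifier to a label; the function name and the dictionary representation are illustrative assumptions.

```python
def reconcile_labels(context_labels, cluster_labels):
    """Keep labels on which both analyses agree; send the rest back for re-testing.

    Both arguments map a node identifier to a label such as "outlier" or "inlier".
    """
    final_labels = {}
    retest = []
    for node, label in context_labels.items():
        if cluster_labels.get(node) == label:
            final_labels[node] = label   # both analyses agree: label is final
        else:
            retest.append(node)          # discrepancy: back to the testing dataset
    return final_labels, retest
```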
3.3 Layer-2 Outlier Detection
The Layer-2 outlier detection process deals with the transitions of node states. A node state can vary, indicating damage caused by side-channel effects. Transitions between node states are used to construct a Markov chain. Initially, nodes are placed randomly and their movements are observed over a certain period of time. The Markov chain process analyzes historical data and helps detect outliers using a ruleset. Transitions of node states are observed for control and regular messages. A control message sender or receiver is put under scrutiny if these messages are sent or received beyond a threshold without any further action. The complete outlier detection process follows these steps: chain construction, transition matrix formation, and final computations. The chain construction process uses graphical datasets for record-keeping and computations. The probability matrix accesses the graphical dataset and stores the paths among nodes in two-dimensional space. Figure 5 shows an example of Markov chain construction, and the probability transition matrix is shown in Table 1. Figure 5 and Table 1 present two routes from source to destination. The path probability ratio of the two routes is calculated as Route 1/Route 2 = (0.7 + 0.2 + 0.05)/(0.3 + 0.7 + 0.05) = 0.95/1.05 ≈ 0.9 < 1. If nodes select route 2 for control or data message transmission, then no outlier exists (i.e., all nodes are inliers). If route 1 is selected for transmission, then source and destination nodes with degree >1 in route 2 are put under scrutiny. In addition, all intermediate nodes with degree ≥3 are put under scrutiny.

Fig. 5 Example of Markov chain construction
Table 1 Transition matrix (rows and columns indexed by nodes 1, 2, 3, 4, 5, 6, …, N)
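The route comparison above can be reproduced with a short Python sketch. It follows the text's own convention of summing the transition values along each route. The node names `S`, `A`, `B`, `C`, `E`, `D` and the assignment of the six values to particular hops are hypothetical, chosen only so that the two route sums match the worked example; the reading that the higher-scoring route is the expected one is likewise an interpretation.

```python
def route_score(route, transition):
    """Sum the transition values along consecutive hops of a route (the text's convention)."""
    return sum(transition[(a, b)] for a, b in zip(route, route[1:]))

# Hypothetical hop assignment that reproduces the sums in the worked example.
transition = {
    ("S", "A"): 0.7, ("A", "B"): 0.2, ("B", "D"): 0.05,  # hops of route 1
    ("S", "C"): 0.3, ("C", "E"): 0.7, ("E", "D"): 0.05,  # hops of route 2
}
route_1 = ["S", "A", "B", "D"]
route_2 = ["S", "C", "E", "D"]

ratio = route_score(route_1, transition) / route_score(route_2, transition)
# A ratio below 1 means route 2 scores higher, so traffic on route 1 triggers scrutiny.
expected_route = "route 2" if ratio < 1 else "route 1"
```

With these values the ratio is 0.95/1.05, which is about 0.905 and therefore below 1, in line with the text.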
3.4 Layer-3 Outlier Detection
The Layer-3 outlier detection process starts with the assumption that all nodes are randomly deployed and that their deployment area is known in advance, as shown in Fig. 6a. The profiles of all nodes are collected from the layer above, and the initial population is decided for analysis, as shown in Fig. 6b. Association rules [11–13] are applied for outlier detection. Among the initial population, highly trusted nodes are identified for applying association rules, as shown in Fig. 6c. Using the trusted nodes, the initial population is divided into imperialist countries and imperialist states. An imperialist country is considered denser than an imperialist state: it is defined as a collection of nodes whose number of interconnections (with trusted nodes) is greater than a certain threshold. An imperialist state is also a collection of nodes, but its number of interconnections is lower than that of an imperialist country and greater than the minimum density-based threshold required for outlier detection. Figure 6e shows the construction of a colony, empire, and sub-zones. The whole population area is divided into colonies covering imperialist countries or states. Highly trusted nodes are interconnected for authentic data communication. Thereafter, high-power nodes are connected with highly trusted nodes to construct an empire. Each connection between a trusted node and a high-power node forms a sub-zone. Small sub-zones are merged by moving high-power nodes to neighboring sub-zones. If nodes are left isolated or smaller sub-zones still exist after repeated merging attempts, these nodes or sub-zones are considered outliers.
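The two-threshold division into imperialist countries and imperialist states can be illustrated with a small Python sketch. The thresholds, the per-node count of links to trusted nodes, and the `unassigned` bucket for nodes below both thresholds are assumptions made for illustration; the source does not give threshold values or name that third group.

```python
def classify_by_trusted_links(trusted_links, country_threshold, state_threshold):
    """Group nodes by their number of interconnections with trusted nodes.

    trusted_links maps a node identifier to its count of links to trusted nodes.
    country_threshold must be greater than state_threshold.
    """
    groups = {"country": [], "state": [], "unassigned": []}
    for node, links in trusted_links.items():
        if links > country_threshold:
            groups["country"].append(node)     # dense: imperialist country
        elif links > state_threshold:
            groups["state"].append(node)       # sparser: imperialist state
        else:
            groups["unassigned"].append(node)  # below the minimum density threshold
    return groups
```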
3.5 Layer-4 Detection
Layer-4 outlier detection is added to the proposed multi-layered architecture for devices with scarce resources. In this layer, the outlier detection administrator has the option of considering the observations of a single layer or of multiple layers in the final opinion. Resource-constrained devices may choose any layer's implementation and observations for analysis, whereas resourceful networks/devices should select the combined opinion of all layers. Pseudocode 4 explains the combined outlier detection process in detail.
Fig. 6 a Nodes are distributed randomly over geographical region (Stage 1). b Decide initial population. c Identify highly trusted nodes. d Divide all nodes into imperialist states and countries. e Build colonies inside empires using nearest possible connection to high power node
Pseudocode 4: Combined outlier score calculator
1. Iterate over each layer regularly and collect the labels of each node
2. if each node's label is the same for all three layers above then
3.
4. else
5.     Apply fuzzy min-max to compute the exact labels of conflicting nodes
6.
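The branching in Pseudocode 4 can be sketched as below. Two points are assumptions rather than statements of the source: the agreed label is taken as final when all three layers match, and conflicts are resolved here by a simple majority vote. The majority vote is only a stand-in; it is not the fuzzy min-max computation named in step 5, which a caller could supply through the `resolve` parameter.

```python
from collections import Counter

def combine_layer_labels(layer_labels, resolve=None):
    """Combine per-layer labels into one label per node.

    layer_labels maps a node identifier to its list of labels from Layers 1-3.
    resolve is called on the label list of a conflicting node.
    """
    if resolve is None:
        # Stand-in resolver (majority vote); NOT the fuzzy min-max step of Pseudocode 4.
        resolve = lambda labels: Counter(labels).most_common(1)[0][0]
    final_labels = {}
    conflicts = []
    for node, labels in layer_labels.items():
        if len(set(labels)) == 1:
            final_labels[node] = labels[0]      # all layers agree (assumed to be final)
        else:
            conflicts.append(node)
            final_labels[node] = resolve(labels)
    return final_labels, conflicts
```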
4 Simulation, Evaluations, and Analysis
This section describes the network simulation and the performance of the internal and external cluster indices, which reflect the stability of the colony, empire, and sub-zones.
4.1 Simulation Setup
In the simulation, a network of 50–5000 nodes is set up for performance analysis. Nodes are free to move in any direction at a specified speed within a geographic area. Details of the simulation parameters are shown in Table 2. In this work, eight variations of the network, with different numbers of data records, are considered for analysis. The analysis is carried out over different time periods, with variation in the number of clusters formed. It is observed that the number of clusters and their stability increase with time.
Table 2 Simulation setup
Parameters                                           Value
Nodes                                                50–5000
Communication via                                    Wireless channel
Radio propagation model                              Ray tracing
Interface                                            Wireless Phy
MAC type                                             802.11
Queue type                                           Priority queue
Antenna                                              Omni antenna
Waiting queue size                                   50 packets
Maximum X-dimension of the topography                500 m
Maximum Y-dimension of the topography                500 m
Mobility model                                       Random Waypoint mobility
Data transfer rates                                  7 packets/second
Single packet size                                   1024 bits
Discrete event simulator                             ns-3 [12]
Total simulation time                                1500 s
Number of slots assigned to reader at stretch ()     1
Time of each slot                                    10 ms
Velocity (minimum to maximum)                        0.3–5 m/s
4.2 Analysis of Internal and External Cluster Indices
This sub-section explains the internal and external clustering indices used to measure the quality of the clusters formed in the outlier detection process. The higher the quality indicated by the clustering indices, the better the cluster implementation process, which in turn validates the effective and efficient identification of outliers and inliers. Simulation is performed over three different time slots for analysis. The internal and external indices are analyzed as follows:
4.2.1 Internal Cluster Indices
Internal indices measure the quality of clustering without any external data; only the data units and features inherent in the dataset are used for measurement. In this work, the Davies–Bouldin index (DBI) and the R-squared index (RSI) are used as internal indices for analysis, as shown in Fig. 7. The trends for DBI and RSI are almost the same. All index values increase from 50 to 100 nodes (very small scale network) and decrease from 100 to 5000 nodes (small scale to large scale network). Figure 7a and b shows the DBI and RSI analysis during different time slots. In the case of RSI and RMSSDI, an elbow structure indicates higher stability. Thus, the proposed clustering and outlier detection mechanism is validated to be stable.

Fig. 7 Internal cluster evaluation. a Davies–Bouldin index (DBI). b R-squared indices (RSI). Both panels plot Index Value against No. of Nodes (50, 100, 500, 1000, 2000, 3000, 4000, 5000), with one series per time slot (up to T1, T1–T2, …, T12–T13)
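For readers who want to reproduce the two internal indices, the following Python sketch computes them from clusters given as lists of coordinate tuples. It uses the usual textbook definitions (Davies–Bouldin as the mean of the worst scatter-to-separation ratio per cluster, R-squared as the between-cluster share of the total sum of squares); whether the authors used exactly these variants is an assumption, and the function names are illustrative. The sketch needs at least two clusters with distinct centroids.

```python
import math

def _centroid(points):
    dims = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dims))

def _distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(clusters):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    centers = [_centroid(c) for c in clusters]
    scatter = [sum(_distance(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    worst = [
        max((scatter[i] + scatter[j]) / _distance(centers[i], centers[j])
            for j in range(len(clusters)) if j != i)
        for i in range(len(clusters))
    ]
    return sum(worst) / len(worst)

def r_squared(clusters):
    """(total sum of squares - within-cluster sum of squares) / total sum of squares."""
    everything = [p for c in clusters for p in c]
    grand = _centroid(everything)
    total = sum(_distance(p, grand) ** 2 for p in everything)
    within = sum(_distance(p, _centroid(c)) ** 2 for c in clusters for p in c)
    return (total - within) / total
```

Under these definitions a lower Davies–Bouldin value and a higher R-squared value both indicate more compact, better separated clusters.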
4.2.2 External Cluster Indices
External indices measure the quality of clustering using external information, i.e., quantities and features inherited from the known cluster structure of a dataset. In this work, the F-measure index (FI) and the Normalized Mutual Information (NMI) index are used as external indices for analysis, as shown in Fig. 8. Both FI and NMI increase for the very small scale network (50–100 nodes) and decrease from the very small scale to the small scale network (100–500 nodes). Higher FI and NMI values mean more stability. According to FI, the proposed mechanism is best for networks of 50, 100, and 1000 nodes, and it is good during the initial time slots (up to T5) for the other networks, as shown in Fig. 8a. The NMI values in Fig. 8b show that the proposed mechanism is best for networks of 50, 100, and 4000 nodes. Although it is good for other networks as well, more fluctuations are observed in those cases.
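The two external indices can be computed from a list of reference labels and a list of predicted labels, as in the Python sketch below. Several variants of both measures exist, so the choices here are assumptions: the F-measure is the binary F1 score of the label `"outlier"`, and NMI is normalized by the arithmetic mean of the two entropies. The source does not state which variants were used.

```python
import math
from collections import Counter

def f_measure(true_labels, pred_labels, positive="outlier"):
    """Binary F1 score of the `positive` label."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(true_labels, pred_labels):
    """Mutual information normalized by the mean of the two label entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_t, count_p = Counter(true_labels), Counter(pred_labels)
    mi = sum((c / n) * math.log((c / n) / ((count_t[t] / n) * (count_p[p] / n)))
             for (t, p), c in joint.items())
    denom = _entropy(true_labels) + _entropy(pred_labels)
    return 1.0 if denom == 0 else 2 * mi / denom
```

NMI is invariant to renaming of cluster labels, which makes it convenient for comparing a detected clustering against a known cluster structure.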
Fig. 8 External cluster evaluation. a F-measure index. b NMI. Both panels plot Index Value against No. of Nodes (50, 100, 500, 1000, 2000, 3000, 4000, 5000), with one series per time slot (up to T1, T1–T2, …, T12–T13)
5 Conclusion
In networks with dynamically changing topology, such as MANETs, single-dimensional security solutions are not efficient in providing a proper safeguard. Thus, layer-based solutions are preferred. A single-dimension, single-layer solution does not identify all types of attacks, whereas multi-dimensional, multi-layer solutions are gaining importance because they consider different data at different points. In this work, such an approach is proposed, in which four different layers consider different types of data for attack analysis. The proposed approach filters and analyzes network, transport, and application layer data at three different layers. The fourth layer collects the observations from the three layers above and concludes the results. The analysis shows that internal indices such as DI, RMSSDI, DBI, and RSI confirm cluster stability under the proposed approach: a minimum of 4.1% and a maximum of 11.3% stability is observed as the number of nodes varies. Similarly, external indices such as F-measure and NMI indicate stability in comparison to external clusters, with a minimum of 2% and a maximum of 13.5% stability observed. In future, hybrid indices will be explored to improve the results, and advanced analysis will be performed to reduce error approximations.
References
1. J. Liu, Y. Xu, Y. Shen, X. Jiang, T. Taleb, On performance modeling for MANETs under general limited buffer constraint. IEEE Trans. Veh. Technol. 66(10), 9483–9497 (2017)
2. S. Sen, J.A. Clark, J.E. Tapiador, Security threats in mobile ad hoc networks. In: Security of Self-Organizing Networks: MANET, WSN, WMN, VANET, ed. by A.-S. Khan Pathan, 1st edn (CRC Press, New York, 2016), pp. 127–147
3. I. Mapanga, V. Kumar, W. Makondo, T. Kushboo, P. Kadebu, W. Chanda, Design and implementation of an intrusion detection system using MLP-NN for MANET. In: IST-Africa Week Conference (IST-Africa) 2017 (IEEE, Windhoek, Namibia, 2017), pp. 1–12
4. J. Guo, W. Huang, B.M. Williams, Real time traffic flow outlier detection using short-term traffic conditional variance prediction. Transp. Res. Part C: Emerg. Technol. 50, 160–172 (2014)
5. I. Butun, S.D. Morgera, R. Sankar, A survey of intrusion detection systems in wireless sensor networks. IEEE Commun. Surv. Tutorials 16(1), 266–282 (2014)
6. L. Nishani, M. Biba, Machine learning for intrusion detection in MANET: a state-of-the-art survey. J. Intell. Inf. Syst. 46(2), 391–407 (2016)
7. A. Amouri, V.T. Alaparthy, S.D. Morgera, Cross layer-based intrusion detection based on network behavior for IoT. In: 2018 IEEE 19th Wireless and Microwave Technology Conference (WAMICON) (IEEE, 2018), pp. 1–4
8. M.A. Hayes, M.A. Capretz, Contextual anomaly detection framework for big sensor data. J. Big Data 2(2), 1–22 (2015)
9. R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases. In: ACM SIGMOD Record (ACM, NY, USA, 1993), pp. 207–216
10. M. Hahsler, R. Karpienko, Visualizing association rules in hierarchical groups. J. Bus. Econ. 87(3), 313–335 (2017)
11. S. Shamshirband, A. Amini, N.B. Anuar, M.L. Mat Kiah, Y.W. Teh, S. Furnell, D-FICCA: a density-based fuzzy imperialist competitive clustering algorithm for intrusion detection in wireless sensor networks. Meas. J. Int. Meas. Confed. 55, 212–226 (2014)
12. The Network Simulator - ns-2. https://www.isi.edu/nsnam/ns/. Accessed 5 July 2018
13. F. Chen, P. Deng, J. Wan, D. Zhang, A.V. Vasilakos, X. Rong, Data mining for the internet of things: literature review and challenges. Int. J. Distrib. Sens. Netw. 11(8), 1–14 (2015)
14. A. Kumar, K. Gopal, A. Aggarwal, Simulation and cost analysis of group authentication protocols. In: 2016 Ninth International Conference on Contemporary Computing (IC3) (IEEE, Noida, India, 2016), pp. 1–7
15. A. Kumar, A. Aggarwal, K. Gopal, A novel and efficient reader-to-reader and tag-to-tag anti-collision protocol. IETE J. Res., 1–12 (2018). [Published Online]
16. A. Kumar, K. Gopal, A. Aggarwal, Design and analysis of lightweight trust mechanism for secret data using lightweight cryptographic primitives in MANETs. Int. J. Netw. Secur. 18(1), 1–18 (2016)
17. S.K. Solanki, J.T. Patel, A survey on association rule mining. In: 2015 Fifth International Conference on Advanced Computing & Communication Technologies (ACCT) (IEEE, 2015), pp. 212–216
18. A. Kumar, A. Aggarwal, Survey and taxonomy of key management protocols for wired and wireless networks. Int. J. Netw. Secur. Appl. 4(3), 21–40 (2012)
Orthonormal Wavelet Transform for Efficient Feature Extraction for Sensory-Motor Imagery Electroencephalogram Brain–Computer Interface

Poonam Chaudhary and Rashmi Agrawal

Abstract The wavelet transform (WT) is a well-known method for localizing frequency content in time for transient, non-stationary signals such as electroencephalogram (EEG) signals. EEG signals are used in the design of non-invasive brain–computer interface (BCI) systems. Generally, the signals are decomposed into dyadic (two-band) frequency bands for frequency localization in the time domain. The triadic approach instead filters the EEG signal into three frequency bands using a low-pass filter, a band-pass filter, and a high-pass filter. The sensory-motor imagery (SMI) frequencies (α, β, and high γ) can be localized efficiently in non-stationary EEG signals using this triadic wavelet filter. Features can then be extracted using common spatial pattern (CSP) algorithms and classified by machine learning algorithms. This paper discusses dyadic and non-dyadic filtering in detail and proposes an approach to frequency localization that uses a three-band orthogonal wavelet transform for the classification of sensory-motor imagery EEG signals.

Keywords Electroencephalogram (EEG) · Filter band · Common spatial patterns (CSP) · Non-dyadic orthogonal wavelet transformation · Sensory-motor imagery (SMI) · Brain–computer interface
P. Chaudhary (B) · R. Agrawal
Manav Rachna International Institute of Research and Studies, Faridabad, India
e-mail: [email protected]
R. Agrawal
e-mail: [email protected]

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_54

1 Introduction of BCI

Brain–computer interfaces have been an attractive research area for the last two decades, with successful online applications in education, rehabilitation, home automation, restoration, entertainment, and enhancement. Advances in technologies such as wireless recording, signal processing techniques, computer algorithms, and
brain sciences have made possible the once unimaginable task of converting brain signals into control signals for a computer or other electronic device. In particular, physically impaired patients can be rehabilitated. People with brain damage or brain disease (e.g., patients with amyotrophic lateral sclerosis or stroke) can use BCI, and a significant body of literature reports a positive impact on a large scale [1–3].

Brain signals can be acquired either invasively, by implanting electrodes in the grey matter of the brain, or non-invasively, by placing electrodes on the scalp. Electroencephalography (EEG) records brain signals non-invasively from the scalp [4–6]. Despite its low signal-to-noise ratio (SNR), this method is considered convenient because it does not require any surgical procedure. Electrocorticography (ECoG), also known as intracranial EEG, is an invasive neuroimaging method that records from the cortical surface of the brain [4]. Local field potentials (LFPs), single-unit activity, and multi-unit activity are other invasive recording modalities. Biomechanical parameters, and even motor imagery, can be extracted from these spatio-temporal signals and applied successfully [7–10]. Several groups [11–15] have used these signals to classify upper limb movement and to control electronic devices. They performed the task using invasive electrodes in human and monkey brains and obtained a high signal-to-noise ratio and accurate control of prosthetic devices in three-dimensional space [13–17]. However, signals acquired invasively degrade gradually over time, and the risk of the surgery needed to place the electrodes makes these methods less practical. EEG measures neural activity directly, economically, and portably for medical use.
Thus, EEG is the most widely accepted method for brain–computer interfaces, in preference to technologies with higher spatial resolution such as fMRI and MEG. A speller system for paralyzed people [16], an EEG-based wheelchair [17], and reach-and-grasp control of a robotic arm [18] are some successful projects in assistive and rehabilitation devices for physically disabled patients.

Use of a BCI system comprises two phases. In the first, known as the training phase, the system is calibrated; in the second, the BCI system is used online to translate recognized brain activity patterns into control commands for a computer. Finding relationships and analyzing patterns among the brain signal, the underlying physical events, and cognitive processing are some of the challenging tasks in brain–computer interfacing. The basic framework of an online EEG-based BCI system is a closed loop. It starts with acquiring specific EEG patterns from the user (e.g., elicited by motor imagery or by visual stimuli). The acquired EEG signals are then preprocessed using advanced signal processing techniques such as de-noising, digitization, and spatial and spectral filtering [19] in order to normalize the signals. Feature extraction and selection are then performed on the preprocessed signals to represent them in a compact form [20]. These compact feature sets are classified [21] and then converted into command signals for a computer application. Finally, the user gives feedback on whether the command signal correctly reflects the intended mental task. Every step of this protocol involves algorithms that can be optimized for better performance of the BCI system [22]. Designing a BCI system is very challenging because of subject
and application specificity. Performance depends on the algorithms used at every step of the BCI design. The basic performance measure is classification accuracy, which assumes balanced classes and an unbiased classifier. When classes are unbalanced or the classifier is biased, the kappa metric or the confusion matrix, from which the sensitivity–specificity pair or the precision can be calculated, is an alternative. Other performance measures for BCI systems are the ROC curve and the area under it (AUC), often used when classification depends on a continuous parameter such as a threshold. The overall performance depends strongly on the performance of the subcomponents of the BCI system. BCI systems come in different configurations, such as self-paced, system-paced, and hybrid systems [23].

Section 2 of this paper discusses wavelet analysis versus spectrum analysis in signal processing and describes different wavelets used for signal decomposition. Section 3 discusses the construction and advantages of orthonormal non-dyadic wavelets. Section 4 proposes the application of an orthonormal non-dyadic wavelet to decompose the acquired EEG signal for motor imagery classification, followed by the conclusion and future work in Sect. 5.
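The performance measures discussed in this section can be illustrated with a small sketch. The labels below are hypothetical and deliberately unbalanced, and the helper names (confusion_matrix, cohen_kappa, sensitivity_specificity) are illustrative choices rather than anything defined in the paper. The point is only to show why accuracy can mislead on unbalanced classes while kappa does not.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Fraction of trials on the diagonal of the confusion matrix."""
    return np.trace(cm) / cm.sum()

def cohen_kappa(cm):
    """Agreement corrected for the agreement expected by chance."""
    n = cm.sum()
    p_observed = np.trace(cm) / n
    p_chance = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    return (p_observed - p_chance) / (1.0 - p_chance)

def sensitivity_specificity(cm):
    """Binary case only; class 1 is treated as the positive class."""
    tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: 8 trials of class 0 and 2 trials of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always answers class 0.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred, 2)
print("accuracy:", accuracy(cm))
print("kappa:", cohen_kappa(cm))
print("sensitivity, specificity:", sensitivity_specificity(cm))
```

For this toy case the accuracy is 0.8 although the classifier never detects class 1; kappa is 0 and the sensitivity is 0, which exposes the problem.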
2 Wavelet Analysis

Wavelet analysis of biosignals has attracted attention in recent years as a software-based signal processing technique. The wavelet transform (WT) has gained popularity over the Fourier transform (FT), which was the most commonly applied representation of signals. Unlike the FT, which expands a signal in trigonometric polynomials, the WT expands it in translations and dilations of a single basis function, the mother wavelet. As a result, the signal is localized in both frequency and time [24]. This gives a close correspondence between a function and its coefficients and ensures numerical stability during reconstruction. Constructing powerful wavelet basis functions is the main goal of wavelet analysis, along with finding efficient methods for their computation, analogous to the fast FT. A wavelet spectrum can thus be formed that carries the signal's time (or spatial) domain information as well as its frequency domain information. Because biosignals are non-stationary in the time domain, processing them with the FT does not yield the desired results, which makes wavelet analysis a powerful tool for non-stationary signals. A wavelet is a smooth, flexible, short, time-localized oscillating waveform that has an average value of zero and good frequency and time localization [25]. Figure 1a, b shows the Fourier and wavelet transforms of a signal. The decomposition of a function of time (here a signal) into its constituent frequencies is performed by the Fourier transform, represented mathematically as
Fig. 1 a Fourier transform of a signal, b wavelet transform of a signal
Fig. 2 Block diagram of proposed sensory-motor imagery EEG classification using non-dyadic wavelet decomposition. Stages: N-channel raw EEG → 3-band wavelet decomposition → feature extraction and selection → classification (ANN, SVM, Bayesian, …) → motor task
\[ F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt \tag{1} \]
Equation 1 gives the Fourier coefficients F(ω) as the integral over all time of the signal f(t) multiplied by a complex exponential. According to Heisenberg's uncertainty principle, the momentum and position of an object cannot both be measured exactly at the same time. An analogous limit applies to signals: time resolution and frequency resolution cannot both be made arbitrarily fine, regardless of the transform used. Multiresolution analysis (MRA) is therefore an alternative approach that analyzes a signal at different frequencies with different resolutions. It gives good frequency resolution and poor time resolution at low frequencies, and good time resolution and poor frequency resolution at high frequencies. EEG signals contain low-frequency components that last a long time and high-frequency components that last a very short time, so MRA is suitable for EEG signal analysis. Wavelet transforms can be categorized as the discrete wavelet transform (DWT), the continuous wavelet transform (CWT), and the multiresolution-based transform (MRT). Like
the short-time Fourier transform (STFT), the CWT multiplies a signal of finite energy by a function, here the wavelet, and computes the transform separately for different segments of the time-domain signal. The signal is reconstructed by integrating the resulting frequency components. Unlike the STFT, the CWT does not compute negative frequencies, and the transform is computed for every single spectral component. Multiplying the FT coefficients by a sinusoid of frequency ω yields the constituent sinusoidal components of the original signal; the wavelet coefficients play the same role for the wavelet transform. Formally, the CWT is the sum over all time of the signal multiplied by scaled and shifted versions of the wavelet function. The FT uses sin() and cos() as basis functions, whereas wavelets define a set of basis functions ψ_k(t) such that

\[ f(t) = \sum_{k} a_k\, \psi_k(t) \tag{2} \]
The basis can be constructed by applying translations (by a real number τ) and scalings (stretching or compressing by a positive scale s) to the “mother” wavelet ψ(t):

\[ \psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t-\tau}{s}\right) \tag{3} \]
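The projection of a function y onto the subspace of scale s then has the form

\[ y_s(t) = \int_{\mathbb{R}} WT_\psi\{y\}(s,\tau)\, \psi_{s,\tau}(t)\, d\tau \tag{4} \]

with wavelet coefficients

\[ WT_\psi\{y\}(s,\tau) = \langle y, \psi_{s,\tau} \rangle = \int_{\mathbb{R}} y(t)\, \psi_{s,\tau}(t)\, dt \tag{5} \]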
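Equations (3) and (5) can be approximated numerically. The sketch below is only an illustration under stated assumptions: it uses an unnormalised Mexican hat wavelet as the mother wavelet, a toy transient centred at t = 2 as the signal, and a plain Riemann sum in place of the integral. The function names are hypothetical.

```python
import numpy as np

def mexican_hat(t):
    """Unnormalised Mexican hat mother wavelet psi(t) (assumed choice)."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def daughter(t, s, tau):
    """psi_{s,tau}(t) = psi((t - tau) / s) / sqrt(s), as in Eq. (3)."""
    return mexican_hat((t - tau) / s) / np.sqrt(s)

def cwt_coefficient(y, t, s, tau):
    """Riemann-sum approximation of Eq. (5) on a uniform time grid t."""
    dt = t[1] - t[0]
    return np.sum(y * daughter(t, s, tau)) * dt

# Uniform time grid with step 0.01.
t = np.linspace(-10.0, 10.0, 2001)
# Toy transient: a short wavelet-shaped burst centred at t = 2.
y = mexican_hat((t - 2.0) / 0.5)

# Scan the translation tau at a fixed scale s = 0.5.
taus = np.linspace(0.0, 4.0, 41)
coeffs = np.array([cwt_coefficient(y, t, 0.5, tau) for tau in taus])
best_tau = taus[np.argmax(np.abs(coeffs))]
print("translation with the largest coefficient:", best_tau)
```

The coefficient magnitude peaks where the shifted wavelet lines up with the transient, which is the time localization that the Fourier coefficients of Eq. (1) do not provide.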
Continuous wavelets include the Poisson wavelet, Mexican hat wavelet, Morlet and modified Morlet wavelets, Shannon wavelet, Beta wavelet, causal wavelet, Hermitian wavelet, Cauchy wavelet, Meyer wavelet, and many more [26]. Analyzing a signal using all the wavelet coefficients is computationally impractical, because the scale and translation parameters vary continuously. The wavelets are therefore sampled discretely, and the signal is reconstructed from the samples. The series expansion of a discrete-time signal is as follows. If x[n] is a square-summable sequence, i.e., x[n] ∈ ℓ²(Z), its orthonormal expansion has the form

\[ x[n] = \sum_{k \in \mathbb{Z}} \langle \varphi_k[l], x[l] \rangle\, \varphi_k[n] = \sum_{k \in \mathbb{Z}} X[k]\, \varphi_k[n] \tag{6} \]

where

\[ X[k] = \langle \varphi_k[l], x[l] \rangle = \sum_{l} \varphi_k^{*}[l]\, x[l] \]

is the transform of x, and the basis functions φ_k satisfy the orthonormality constraint ⟨φ_k[n], φ_l[n]⟩ = δ[k − l]. The expansion conserves energy: ‖x‖² = ‖X‖² [26].
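A finite-dimensional version of the orthonormal expansion in Eq. (6) can be checked directly. The sketch below assumes a 4-point Haar basis and an arbitrary example vector; both are illustrative choices, not taken from the paper.

```python
import numpy as np

r = 1.0 / np.sqrt(2.0)

# Rows are the basis vectors phi_k of a 4-point orthonormal Haar basis.
Phi = np.array([
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [r, -r, 0.0, 0.0],
    [0.0, 0.0, r, -r],
])

x = np.array([4.0, 2.0, 5.0, 5.0])   # arbitrary example signal

# Analysis: X[k] = <phi_k, x>
X = Phi @ x
# Synthesis: x[n] = sum_k X[k] phi_k[n]
x_rec = Phi.T @ X

print("coefficients X:", X)
print("energy of x:", np.sum(x**2))
print("energy of X:", np.sum(X**2))
```

Because the rows of Phi are orthonormal, the signal is recovered exactly from its coefficients and both energies are equal (70 for this example).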
The DWT decomposes the signal into a coarse approximation and detail information, and analyzes the signal in different frequency bands at different resolutions. It employs two sets of functions in the time domain, wavelet functions and scaling functions, which are associated with a high-pass filter and a low-pass filter, respectively. The original signal y[i] is convolved with the high-pass filter g[i] and with the low-pass filter h[i], as in Eqs. (7) and (8). After filtering, half of the samples can be discarded according to Nyquist's theorem, because the highest frequency in each filtered signal is now f/2 instead of f. Filtering followed by subsampling by 2 constitutes one level of decomposition of the signal y[i], represented mathematically as

\[ y_{\mathrm{high}}[k] = \sum_{i} y[i]\, g[2k-i] \tag{7} \]

\[ y_{\mathrm{low}}[k] = \sum_{i} y[i]\, h[2k-i] \tag{8} \]
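Equations (7) and (8) amount to a convolution followed by keeping every second output sample. A minimal sketch, assuming Haar filters for h and g (the paper does not fix a particular filter pair here) and a short made-up signal:

```python
import numpy as np

def dwt_one_level(y, h, g):
    """One decomposition level following Eqs. (7) and (8).

    np.convolve(y, h)[n] equals sum_i y[i] * h[n - i], so taking every
    second sample (n = 2k) gives sum_i y[i] * h[2k - i].
    """
    low = np.convolve(y, h)[::2]
    high = np.convolve(y, g)[::2]
    return low, high

# Haar low-pass and high-pass filters (assumed for illustration).
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)

y = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
low, high = dwt_one_level(y, h, g)

print("approximation:", low)
print("detail:", high)
print("input energy:", np.sum(y**2))
print("sub-band energy:", np.sum(low**2) + np.sum(high**2))
```

With an orthonormal filter pair such as Haar, the energy of the two sub-bands together equals the energy of the input. The first and last coefficients include boundary effects, because the full convolution treats samples outside the signal as zero.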
The EEG signal is non-stationary. For such transient signals, a time–frequency representation is highly desirable for deriving meaningful features [10].
3 Non-dyadic Wavelet Transform

A wavelet series expansion decomposes finite-energy functions for analysis. The basis functions must therefore be regular, well localized, and of finite energy. It is convenient to take special values for s and τ in defining the wavelet basis: s = 2^(−j) and τ = k · 2^(−j) for the jth stage of the process. A wavelet transform whose scale samples follow a geometric sequence of ratio 2 is known as a dyadic wavelet transform. Equation (3) can then be rewritten as Eq. (10), known as the dyadic wavelet transform of f, where ⋆ denotes convolution:

\[ W f(u, 2^{j}) = \int_{-\infty}^{+\infty} f(t)\, \frac{1}{\sqrt{2^{j}}}\, \psi\!\left(\frac{t-u}{2^{j}}\right) dt = f \star \bar{\psi}_{2^{j}}(u) \]

with

\[ \bar{\psi}_{2^{j}}(t) = \psi_{2^{j}}(-t) = \frac{1}{\sqrt{2^{j}}}\, \psi\!\left(\frac{-t}{2^{j}}\right) \tag{10} \]

The family of dyadic wavelets is a frame of L²(R).
Time–frequency localized basis functions are popular among researchers for applications such as analysis of acquired signals [27], image coding [28, 29], and feature extraction [30–32]. Orhan et al. [33] and Übeyli [34] implemented two-band wavelet filter banks, extracted features, and then classified the
features into predefined classes. The authors of [35] used two frequency bands, each of width ω = π/2, and reported poor frequency resolution in both the high- and low-frequency bands. The dyadic filter bank (M = 2 bands) can be extended to an M-band filter bank with M > 2 sub-bands, which improves the frequency resolution to ω = π/M. To increase the frequency resolution for high- or low-frequency signals, the number of sub-bands can be increased in the region of interest. The higher frequency resolution of a triadic filter bank can thus be useful in practical applications that involve high- or low-frequency signals [36]. Further, localization of the high- or low-band filters in the spatio-temporal domain improves the performance of 3-band filter banks. Xie and Morris [37] and Sharma et al. [38] designed dyadic regular orthogonal and biorthogonal filter banks, respectively, using time–frequency wavelet basis functions. The two-band wavelet transform has been implemented extensively [28, 31, 32, 37, 38], and it outperforms many other existing methods such as empirical mode decomposition (EMD) [39, 40], high-order moment parameters [41, 42], and autoregression and band-power based models [43]. Nevertheless, the literature reports poor frequency resolution with the two-band wavelet transform for both high- and low-frequency signals. Sensory-motor imagery (SMI) classification accuracy may therefore improve if the frequency resolution is increased in any frequency region of the dyadic wavelet transform. A more flexible tiling of the time–frequency plane can be obtained using M-band wavelet decomposition. Lin et al. [44] and Lin et al. [45] proposed the construction of M-band wavelets using multiresolution analysis (MRA). They decomposed the input signal into M parts using a filter bank matrix built from the calculated filter coefficients. The filter bank matrix X is the concatenation of K overlapping factor matrices of size M × K, given by X = [X_0, X_1, …, X_{K−1}].
The filter bank should produce orthonormal and reconstructable output for a given polyphase matrix B(z) (Eq. 11); the conditions for such output are shown in Eq. 12.

\[ B(z) = X_0 + X_1 z^{-1} + \cdots + X_{K-1} z^{-(K-1)} \tag{11} \]

\[ \begin{cases} Z e = M e_1 \\ R R^{T} = I \\ S S^{T} = I \end{cases} \tag{12} \]

where

\[ Z = \sum_{i=0}^{K-1} X_i, \quad e = [1, 1, \ldots, 1]^{T}, \quad e_1 = [1, 0, \ldots, 0]^{T} \tag{13} \]

(The bodies of Eqs. (14) and (15) are missing from the source text.)

The M-band filter bank X_k is decomposed into orthogonal matrices, in order to solve the constraint equations, using singular value decomposition (SVD):

\[ X_k = E D_k F \tag{16} \]
where the factor matrices E and F are orthogonal; X = [X_0, X_1] satisfies Eq. (12) if and only if it has this decomposition [44, 45]. Sharma et al. [46] used an optimal orthogonal wavelet to decompose the ECG signal for automated heartbeat classification. They designed a finite impulse response (FIR) filter that satisfies both the zero-moment condition and the orthogonality condition. Chandel et al. [47] proposed triadic wavelet decomposition to find features that give higher accuracy for epileptic seizure classification. Bhati et al. [48] classified epileptic seizure signals using a three-band orthogonal wavelet filter bank with stopband energy. Benchabane et al. [49] applied a statistical threshold to the wavelet decomposition coefficients of individual evoked potential signals and estimated their mean across trials to improve the signal-to-noise ratio.
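The orthonormality and reconstruction requirements on an M-band filter bank can be illustrated with the simplest possible case. The sketch below is an assumption-laden toy, not the construction of [44, 45] and not the authors' filter: it takes M = 3 with a single factor matrix (K = 1, no overlap) and uses the 3 × 3 orthonormal DCT-II matrix as the filter bank, so each row acts as a low-pass, band-pass, or high-pass filter on non-overlapping blocks of three samples.

```python
import numpy as np

M = 3
n = np.arange(M)
k = n.reshape(-1, 1)

# Orthonormal DCT-II matrix used as a toy 3-band filter bank (assumed choice).
X = np.sqrt(2.0 / M) * np.cos(np.pi * (2 * n + 1) * k / (2 * M))
X[0, :] = np.sqrt(1.0 / M)

def analyze(signal):
    """Split the signal into blocks of M samples and project each block.

    Returns an array of shape (M, n_blocks); row b is sub-band b,
    already downsampled by M.
    """
    blocks = np.asarray(signal, dtype=float).reshape(-1, M).T
    return X @ blocks

def synthesize(subbands):
    """Invert analyze() using the transpose of the orthonormal matrix."""
    return (X.T @ subbands).T.reshape(-1)

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 2.0, 2.0, 2.0])
subbands = analyze(signal)
restored = synthesize(subbands)

# Singular value decomposition of the filter matrix, X = E diag(d) F.
E, d, F = np.linalg.svd(X)

print("X X^T is the identity:", np.allclose(X @ X.T, np.eye(M)))
print("perfect reconstruction:", np.allclose(restored, signal))
print("singular values:", d)
```

The last block of the example signal is constant, so only the low-pass sub-band is non-zero there. All singular values equal 1 because the matrix is orthogonal; a design with K > 1 overlapping factors would need the additional constraints of Eq. (12).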
4 Proposed Approach of Sensory-Motor Imagery (SMI) EEG Classification Using Non-dyadic Wavelet Decomposition

Brain function depends on the level of perception and exhibits different rhythmic activities. The rhythms are affected by cognitive processes such as thinking and the preparation of actions; an eye blink, for example, can attenuate a particular rhythm. The fact that mere thoughts alter the rhythms can serve as the basis for a BCI system. Different brain rhythms can be identified in the EEG by their frequency ranges. Niedermeyer [50] uses the Greek letters delta, theta, alpha, beta, gamma, and mu (δ, θ, α, β, γ, and μ) to denote the brain rhythms and explains that sensory-motor patterns are present in the α, β, and high γ rhythms. The frequency ranges of these rhythms in the EEG signal are as follows: (i) alpha rhythm: 8–13 Hz, (ii) beta rhythm: 13–30 Hz, (iii) gamma rhythm: 30–85 Hz. This section discusses a new approach that filters out these frequency bands using non-dyadic wavelet decomposition [51] for sensory-motor imagery EEG classification in brain–computer interfacing. The steps of the proposed methodology are shown in Fig. 2.
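To relate the rhythm ranges in Hz to a uniform three-band split of the normalized frequency axis [0, π], a sampling rate is needed. The sketch below assumes a hypothetical sampling rate of 250 Hz (the paper does not state one) and uses illustrative helper names; it only reports which of the three uniform sub-bands each rhythm overlaps.

```python
import math

# Rhythm ranges in Hz as listed in the text.
RHYTHMS_HZ = {"alpha": (8.0, 13.0), "beta": (13.0, 30.0), "gamma": (30.0, 85.0)}

def to_normalized(f_hz, fs):
    """Map a frequency in Hz to rad/sample, where pi is the Nyquist frequency."""
    return 2.0 * math.pi * f_hz / fs

def overlapping_subbands(band_hz, fs, m=3):
    """Indices of the m uniform sub-bands of [0, fs/2] that overlap band_hz."""
    width = (fs / 2.0) / m
    lo, hi = band_hz
    return [b for b in range(m) if lo < (b + 1) * width and hi > b * width]

FS = 250.0  # hypothetical sampling rate in Hz, assumed for illustration

mapping = {name: overlapping_subbands(band, FS) for name, band in RHYTHMS_HZ.items()}
for name, band in RHYTHMS_HZ.items():
    lo_n, hi_n = to_normalized(band[0], FS), to_normalized(band[1], FS)
    print(f"{name}: {lo_n:.3f} to {hi_n:.3f} rad/sample, sub-bands {mapping[name]}")
```

Under this assumed sampling rate the alpha and beta rhythms both fall inside the lowest sub-band, which suggests why a further split of the lowest sub-band may be needed to separate them.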
The raw N-channel EEG data will be decomposed by a three-band filter to localize its time–frequency characteristics. This segments the frequency bandwidth into three sub-bands: (a) 0 to π/3, (b) π/3 to 2π/3, and (c) 2π/3 to π. Splitting the bandwidth with a triadic wavelet increases flexibility. The lowest-frequency sub-band can be divided again, up to the required number of levels. To find the α, β, and high γ frequencies of sensory-motor imagery patterns, however, only a few frequency bands need to be selected from the decomposed signal for feature extraction, chosen on the basis of the corresponding brain rhythms; the remaining frequencies can be discarded. Band power, CSP, and power spectral density are examples of features that can be extracted from the selected frequency bands. Wavelet fuzzy approximate entropy, clustering techniques, cross-correlation techniques, and many other techniques also exist for feature extraction from raw EEG signals.

The extracted features can be high-dimensional vectors, depending on the number of channels, the number of trials, the number of sessions from multiple modalities, and the sampling rate of each modality. It is neither realistic nor useful to consider all features for classification, so selecting a smaller subset of distinctive features, or projecting the feature space, is an important step in pattern recognition. The aim of feature selection is to remove redundant and uninformative features and to find distinctive features that do not overfit the training set and that classify real data with high accuracy even in the presence of noise and artifacts. To reduce the curse of dimensionality, representative features can be selected from all the coefficients obtained from the three frequency bands [52–56]. This opens the way to many analytical and statistical measures (1st moment, 2nd moment, 3rd moment, etc.) for evaluating them further.

Machine learning algorithms such as the support vector machine (SVM), k-means clustering, Bayesian networks, artificial neural networks (ANN), radial basis function (RBF) networks, and decision trees can then be applied to identify the imagined motor task. The ultimate goal of BCI design is to translate the user's mental event into control commands: the acquired raw EEG signal has to be converted into a real action in the surrounding environment. Classification, or pattern matching of the signal against predefined classes, is therefore the natural next step after preprocessing, feature extraction, and feature selection. Machine learning plays an important role not only in identifying user intent but also in handling variation in the user's ongoing signals. Considering the traditional pattern-matching approach, classification algorithms for recognizing mental tasks in EEG signals can be grouped into four categories: (1) adaptive classifiers, (2) transfer learning-based classifiers, (3) matrix and tensor classifiers, and (4) deep learning-based classifiers.
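The CSP feature extraction step mentioned above can be sketched as follows. This is a generic textbook-style CSP via whitening, written under stated assumptions: the trials are synthetic random data (class A is stronger on channel 0, class B on channel 3), the function names are hypothetical, and the input is assumed to be already band-limited to a selected sub-band. It is not the authors' implementation.

```python
import numpy as np

def mean_covariance(trials):
    """Average trace-normalised spatial covariance; each trial is channels x samples."""
    covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
    return np.mean(covs, axis=0)

def csp_filters(trials_a, trials_b):
    """Return spatial filters (rows of w) sorted by class-A variance, plus both covariances."""
    c_a, c_b = mean_covariance(trials_a), mean_covariance(trials_b)
    # Whiten the composite covariance c_a + c_b.
    evals, evecs = np.linalg.eigh(c_a + c_b)
    whitening = np.diag(evals ** -0.5) @ evecs.T
    # Diagonalise the whitened class-A covariance.
    psi, b = np.linalg.eigh(whitening @ c_a @ whitening.T)
    order = np.argsort(psi)[::-1]
    return b[:, order].T @ whitening, c_a, c_b

def log_variance_features(trial, w, n_pairs=1):
    """Log of the normalised variance of the first and last n_pairs CSP components."""
    selected = np.vstack([w[:n_pairs], w[-n_pairs:]])
    var = np.var(selected @ trial, axis=1)
    return np.log(var / var.sum())

rng = np.random.default_rng(0)

def make_trials(strong_channel, n_trials=20, n_channels=4, n_samples=200):
    """Synthetic trials in which one channel has three times the amplitude."""
    scale = np.ones((n_channels, 1))
    scale[strong_channel] = 3.0
    return [scale * rng.standard_normal((n_channels, n_samples)) for _ in range(n_trials)]

trials_a = make_trials(strong_channel=0)
trials_b = make_trials(strong_channel=3)

w, c_a, c_b = csp_filters(trials_a, trials_b)
feat_a = log_variance_features(trials_a[0], w)
feat_b = log_variance_features(trials_b[0], w)
print("class A features:", feat_a)
print("class B features:", feat_b)
```

The two-element feature vectors order their components oppositely for the two classes, so even a simple linear classifier can separate them. In the proposed pipeline such features would be computed per selected sub-band before feature selection and classification.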
5 Conclusion and Future Work

Brain–computer interfacing (BCI) is a new pathway to the human brain and opens up many solutions for physically disabled people. Non-invasive acquisition of brain signals using EEG has brought this task into the practical domain. As a starting point, the BCI
Competition IV dataset can be considered for the analysis [55]. Despite the existing literature on algorithms and methods, there is still scope for improvement in every step of designing a robust BCI system. This paper discusses and proposes a new approach to signal processing based on the wavelet transform. It discusses the Fourier transform, continuous wavelet transform (CWT), discrete wavelet transform (DWT), and multiresolution wavelet transform (MWT) and their application to EEG signal decomposition. The literature on both dyadic and non-dyadic orthogonal transforms has been reviewed with respect to time–frequency localization of the EEG signal. A new approach has been proposed for sensory-motor imagery (SMI) EEG classification using non-dyadic wavelet decomposition. In future work, this approach will be implemented for preprocessing EEG signals, and machine learning algorithms such as the support vector machine (SVM), k-means clustering, Bayesian networks, ANN, radial basis function (RBF) networks, and decision trees will be compared for identifying the imagined motor task. With the growth of BCI applications, security and threats have become major issues [57]; these threats and the associated ethical issues could also be explored further. The proposed approach could analyze non-stationary power at different frequencies, which exists in a fractal structure in the time series, so that the dissimilarities between target and non-target EEG signals can be recognized. The application of the non-dyadic filter will be demonstrated in our next paper.
References
1. N. Birbaumer, W. Heetderks, J. Wolpaw, D. McFarland, P.H. Peckham, G. Schalk, E. Donchin, L. Quatrano, C. Robinson, T. Vaughan, Brain-computer interface technology: a review of the first international meeting. IEEE Trans. Rehabil. Eng. 8(2), 164–173 (2000)
2. J.R. Wolpaw, N. Birbaumer, D.J. McFarland, G. Pfurtscheller, T.M. Vaughan, Brain-computer interfaces for communication and control. Clin. Neurophysiol. 113(6), 767–791 (2002)
3. M.A. Lebedev, M.A. Nicolelis, Brain-machine interfaces: from basic science to neuroprostheses and neurorehabilitation. Physiol. Rev. 97(2), 767–837 (2017)
4. L.F. Nicolas-Alonso, J. Gomez-Gil, Brain computer interfaces, a review. Sensors 12(2), 1211–1279 (2012)
5. N. Birbaumer, T. Hinterberger, A. Kubler, N. Neumann, The thought-translation device (ttd): Neurobehavioral mechanisms and clinical outcome. IEEE Trans. Neural Syst. Rehabil. Eng. 11, 120–123 (2003)
6. J. Wolpaw, D. McFarland, T. Vaughan, G. Schalk, The wadsworth center brain computer interface (BCI) research and development program. IEEE Trans. Neural Syst. Rehabil. Eng. 11, 204–207 (2003)
7. G. Pfurtscheller, C. Neuper, G. Muller, B. Obermaier, G. Krausz, A. Schlogl, R. Scherer, B. Graimann, C. Keinrath, D. Skliris, M. Wrtz, G. Supp, C. Schrank, Graz-BCI: state of the art and clinical applications. IEEE Trans. Neural Syst. Rehabil. Eng. 11, 177–180 (2003)
8. J. Borisoff, S. Mason, G. Birch, Brain interface research for asynchronous control applications. IEEE Trans. Neural Syst. Rehabil. Eng. 14, 160–164 (2006)
9. M.W. Slutzky, R.D. Flint, Physiological properties of brain-machine interface input signals. J. Neurophysiol. 118(2), 1329–1343 (2017)
10. T. Gandhi, B.K. Panigrahi, S. Anand, A comparative study of wavelet families for EEG signal classification. Neurocomputing 74(17), 3051–3057 (2011)
11. L.R. Hochberg et al., Reach and grasp by people with tetraplegia using a neurally controlled robotic arm. Nature 485(7398), 372–375 (2012)
12. M. Velliste, S. Perel, M.C. Spalding, A.S. Whitford, A.B. Schwartz, Cortical control of a prosthetic arm for self-feeding. Nature 453(7198), 1098–1101 (2008)
13. S.-P. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, G.M. Friehs, M.J. Black, Point-and-click cursor control with an intracortical neural interface system by humans with tetraplegia. IEEE Trans. Neural Syst. Rehabil. Eng. 19(2), 193–203 (2011)
14. D.M. Taylor, S.I.H. Tillery, A.B. Schwartz, Direct cortical control of 3D neuroprosthetic devices. Science 296(5574), 1829–1832 (2002)
15. J. Vogel et al., An assistive decision-and-control architecture for force-sensitive hand–arm systems driven by human–machine interfaces. Int. J. Rob. Res. 34(6), 763–780 (2015)
16. N. Birbaumer et al., A spelling device for the paralysed. Nature 398(6725), 297–298 (1999)
17. L. Bi, X.-A. Fan, Y. Liu, EEG-based brain-controlled mobile robots: a survey. IEEE Trans. Hum. Mach. Syst. 43(2), 161–176 (2013)
18. J. Meng, S. Zhang, A. Bekyo, J. Olsoe, B. Baxter, B. He, Noninvasive electroencephalogram based control of a robotic arm for reach and grasp tasks. Sci. Rep. 6, 38565 (2016)
19. B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, K.R. Müller, Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Proc. Mag. 25, 41–56 (2008)
20. F. Lotte, M. Congedo, EEG Feature Extraction (Wiley, New York, 2016), pp. 127–43
21. F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, B. Arnaldi, A review of classification algorithms for EEG-based brain–computer interfaces. J. Neural Eng. 4, R1–13 (2007)
22. C. Neuper, G. Pfurtscheller, Neurofeedback training for BCI control, in Brain–Computer Interfaces: Revolutionizing Human-Computer Interaction, ed. by B. Graimann, G. Pfurtscheller, B. Allison (Springer, Berlin, 2010), pp. 65–78
23. M. Fatourechi, R. Ward, S. Mason, J. Huggins, A. Schlogl, G. Birch, Comparison of evaluation metrics in classification applications with imbalanced datasets, in International Conference on Machine Learning and Applications (IEEE, 2008), pp. 777–82
24. H.D.N. Alves, Fault diagnosis and evaluation of the performance of the overcurrent protection in radial distribution networks based on wavelet transform and rule-based expert system, in 2015 IEEE Symposium Series on Computational Intelligence (IEEE, 2015), pp. 1852–1859
25. Y. Shi, X. Zhang, A Gabor atom network for signal classification with application in radar target recognition. IEEE Trans. Signal Process., 2994–3004 (2001)
26. A. Bruce, H.Y. Gao, Applied Wavelet Analysis with S-Plus (Springer, 1996)
27. D. Gabor, Theory of communication. Part 1: The analysis of information. J. Inst. Electr. Eng. Part III: Radio Commun. Eng. 93(26), 429–441 (1946)
28. D.M. Monro, B.G. Sherlock, Space-frequency balance in biorthogonal wavelets, in Proceedings of International Conference on Image Processing, vol. 1 (IEEE, 1997), pp. 624–627
29. L. Shen, Z. Shen, Compression with time-frequency localization filters. Wavelets and Splines, 428–443 (2006)
30. B. Boashash, N.A. Khan, T. Ben-Jabeur, Time–frequency features for pattern recognition using high-resolution TFDs: A tutorial review. Digit. Signal Proc. 40, 1–30 (2015)
31. R. San-Segundo, J.M. Montero, R. Barra-Chicote, F. Fernández, J.M. Pardo, Feature extraction from smartphone inertial signals for human activity segmentation. Sig. Process. 120, 359–372 (2016)
32. A.T. Tzallas, M.G. Tsipouras, D.I. Fotiadis, Automatic seizure detection based on time-frequency analysis and artificial neural networks. Comput. Intell. Neurosci. (2007)
33. U. Orhan, M. Hekim, M. Ozer, EEG signals classification using the K-means clustering and a multilayer perceptron neural network model. Expert Syst. Appl. 38(10), 13475–13481 (2011)
34. E.D. Übeyli, Combined neural network model employing wavelet coefficients for EEG signals classification. Digit. Signal Proc. 19(2), 297–308 (2009)
35. A.N. Akansu, P.A. Haddad, R.A. Haddad, P.R. Haddad, Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets (Academic Press, 2001)
36. M. Rhif, A. Ben Abbes, I.R. Farah, B. Martínez, Y. Sang, Wavelet transform application for/in non-stationary time-series analysis: a review. Appl. Sci. 9(7), 1345 (2019) 37. H. Xie, J.M. Morris, Design of orthonormal wavelets with better time-frequency resolution, in Wavelet Applications, vol. 2242 (International Society for Optics and Photonics, March 1994). pp. 878–887 38. M. Sharma, V.M. Gadre, S. Porwal, An eigenfilter-based approach to the design of timefrequency localization optimized two-channel linear phase biorthogonal filter banks. Cir. Syst. Signal Process. 34(3), 931–959 (2015) 39. R. Sharma, R. Pachori, U. Acharya, Application of entropy measures on intrinsic mode functions for the automated identification of focal electroencephalogram signals. Entropy 17(2), 669–691 (2015) 40. V. Bajaj, R.B. Pachori, Classification of seizure and nonseizure EEG signals using empirical mode decomposition. IEEE Trans. Inf. Technol. Biomed. 16(6), 1135–1142 (2011) 41. R. Ebrahimpour, K. Babakhan, S.A.A.A. Arani, S. Masoudnia, Epileptic seizure detection using a neural network ensemble method and wavelet transform. Neural Netw. World 22(3), 291 (2012) 42. K. Abualsaud, M. Mahmuddin, M. Saleh, A. Mohamed, Ensemble classifier for epileptic seizure detection for imperfect EEG data. Sci. World J. (2015) 43. E. Parvinnia, M. Sabeti, M.Z. Jahromi, R. Boostani, Classification of EEG Signals using adaptive weighted distance nearest neighbor algorithm. J. King Saud Univ. Comput. Inf. Sci. 26(1), 1–6 (2014) 44. T. Lin, P. Hao, S. Xu, Matrix factorizations for reversible integer implementation of orthonormal M-band wavelet transforms. Sig. Process. 86(8), 2085–2093 (2006) 45. A.L. Goldberger, L.A. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, J.E. Mietus, G.B. Moody, C.K. Peng, H.E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000) 46. M. Sharma, R.S. 
Tan, U.R. Acharya, Automated heartbeat classification and detection of arrhythmia using optimal orthogonal wavelet filters. Inform. Med. Unlocked 16, 100221 (2019) 47. G. Chandel, P. Upadhyaya, O. Farooq, Y.U. Khan, Detection of seizure event and its onset/offset using orthonormal triadic wavelet based features. IRBM 40(2), 103–112 (2019) 48. D. Bhati, R.B. Pachori, V.M. Gadre, Optimal design of three-band orthogonal wavelet filter bank with stop band energy for identification of epileptic seizure eeg signals, in Machine Intelligence and Signal Analysis (Springer, Singapore, 2019). pp. 197–207 49. B. Benchabane, M. Benkherrat, B. Burle, F. Vidal, T. Hasbroucq, S. Djelel, A. Belmeguenai, Wavelets statistical denoising (WaSDe): individual evoked potential extraction by multiresolution wavelets decomposition and bootstrap. IET Signal Proc. 13(3), 348–355 (2019) 50. E. Niedermeyer, The normal EEG of the waking adult, in Electroencephalography: Basic Principles, Clinical Applications, and Related Fields, vol. 167 (2005). pp. 155–164 51. T. Lin, S. Xu, Q. Shi, P. Hao, An algebraic construction of orthonormal M-band wavelets with perfect reconstruction. Appl. Math. Comput. 172(2), 717–730 (2006) 52. K.P. Thomas, C. Guan, A.P. Vinod, C.T. Lau, K.K. Ang, A new discriminative common spatial pattern method for motor imagery brain–computer interfaces. IEEE Trans. Biomed. Eng. 56(11), 2730–2733 (2009) 53. W. Wu, Z. Chen, X. Gao, Y. Li, E.N. Brown, S. Gao, Probabilistic common spatial patterns for multichannel EEG analysis. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 639–653 (2015) 54. S.H. Park, D. Lee, S.G. Lee, Filter bank regularized common spatial pattern ensemble for small sample motor imagery classification. IEEE Trans. Neural Syst. Rehabil. Eng. 26, 2 (2018) 55. T. Michael, et al., Review of the BCI competition IV. Front. Neurosci. 6, 55 (2012) 56. P. Chaudhary, R. 
Agrawal, A comparative study of linear and non-linear classifiers in sensory motor imagery based brain computer interface. J. Comput. Theor. Nanosci. 16(12), 5134–5139 (2019) 57. P. Chaudhary, R. Agrawal, Emerging threats to security and privacy in brain computer interface. Int. J. Adv. Stud. Sci. Res. 3(12) (2018)
Performance of RPL Objective Functions Using FIT IoT Lab
Spoorthi P. Shetty and Udaya Kumar K. Shenoy
Abstract The Internet of Things (IoT) connects many heterogeneous devices and finds application in several areas. Because IoT devices are power constrained, IoT networks are typically Low Power and Lossy Networks (LLNs). The IETF has standardized the Routing Protocol for Low Power and Lossy Networks (RPL) for LLNs. RPL constructs a Destination Oriented Directed Acyclic Graph (DODAG) to select an appropriate path to the destination, and the DODAG is built according to an objective function. The choice of objective function therefore plays a major role in RPL. Since our focus is on the design of power-efficient IoT, power is the main metric for selecting the objective function. The most widely used objective functions in RPL are OF0, which uses hop count as its metric, and MRHOF, which uses the expected transmission count (ETX). Existing research compares these two objective functions using simulation studies only, not real testbed experiments. Hence, experiments on a real testbed are needed to assess which objective function is suitable. In this paper, experiments are conducted in the FIT IoT Lab to select the best objective function with respect to power. The results show that OF0 and MRHOF perform about equally, and that in some cases MRHOF is more power efficient than OF0. The objective functions are also evaluated for single-sink and multi-sink scenarios. The experiments show that increasing the number of sink nodes does not affect the power consumption.
S. P. Shetty (B)
Department of MCA, N.M.A.M. Institute of Technology Nitte, Karkala, India
U. K. K. Shenoy
Department of CSE, N.M.A.M. Institute of Technology Nitte, Karkala, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_55
1 Introduction
An LLN is a network of many embedded devices that are limited in storage, power, and other resources. These devices are connected to each other by a variety of links. LLNs have a wide range of applications, including industrial surveillance, building automation, environmental monitoring, energy management, and health care. RPL is one of the most capable routing protocols for LLNs. RPL uses the Destination Oriented DAG (DODAG) concept to construct a tree structure, and the DODAG is built according to a specific objective function. Selecting the best objective function thus plays a vital role in building the DODAG and in making the routing protocol efficient and effective. The most widely used objective functions are OF0, which uses hop count as its metric, and MRHOF, which uses ETX. Existing research compares the objective functions using simulation, but testing on a real testbed would give better evidence of how an objective function behaves. This paper is organized as follows: Sect. 2 gives the motivation of the work and covers the literature survey. Section 3 introduces the testbed configuration. Section 4 explains the outcomes and gives the framework to predict the nature of the network. Finally, Sect. 5 provides the conclusion.
2 Literature Survey
2.1 Low Power and Lossy Network
The IETF created a working group that mainly works on standardizing the routing protocol for Low Power and Lossy Networks (LLNs). The salient feature of an LLN is that its nodes operate on little energy and can carry and store only a limited amount of data [1]. In addition, error rates and link failures are high, which results in a low packet delivery ratio [2]. An LLN also supports different types of traffic flows, in which the source and the destination can each be either a single point or multiple points. Because of this support for different types of traffic flows, LLNs are well suited to IoT.
2.2 Routing Protocol for Low Power and Lossy Networks
RPL is one of the effective protocols for IPv6 LLNs. The special features of RPL are: (i) It is highly adaptable to network conditions. (ii) When routes are not accessible, it provides an alternative route by default. (iii) It uses the DODAG topological notion to construct a tree structure. (iv) It operates in two steps, route discovery and route maintenance. Route discovery creates new paths between the nodes, and route maintenance maintains the created routes. RPL uses the concept of a DODAG, which is based on the DAG (Directed Acyclic Graph), to construct the routing structure. A logical routing tree is constructed on top of the physical network, and the construction is based on the objective function. The DODAG is periodically rebuilt and modified using the trickle timer.
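The effect of the objective function on the DODAG can be illustrated with a minimal, centralized sketch. It is not the RPL specification: real RPL builds the DODAG in a distributed way, and the standardized objective functions include further details (for example, the hysteresis in MRHOF) that are omitted here. The topology, the ETX values, and the function names `build_dodag`, `hop_count_cost`, and `etx_cost` are hypothetical and chosen only for illustration.

```python
import heapq


def build_dodag(links, sink, link_cost):
    """Return {node: (rank, preferred_parent)} for a tree rooted at the sink.

    links:     dict mapping (node_a, node_b) -> ETX of that bidirectional link
    link_cost: function(etx) -> rank increase when joining through the link
    """
    neighbours = {}
    for (a, b), etx in links.items():
        neighbours.setdefault(a, []).append((b, etx))
        neighbours.setdefault(b, []).append((a, etx))

    best = {sink: (0.0, None)}           # the sink (DODAG root) has rank 0
    queue = [(0.0, sink)]
    while queue:
        rank, node = heapq.heappop(queue)
        if rank > best[node][0]:
            continue                      # stale entry, a better rank was found
        for nxt, etx in neighbours.get(node, []):
            new_rank = rank + link_cost(etx)
            if nxt not in best or new_rank < best[nxt][0]:
                best[nxt] = (new_rank, node)
                heapq.heappush(queue, (new_rank, nxt))
    return best


def hop_count_cost(etx):
    return 1.0    # OF0-like: every hop costs the same, link quality is ignored


def etx_cost(etx):
    return etx    # MRHOF-like: the cost is the link's expected transmission count


# Hypothetical topology: the direct link sink-C is lossy (ETX 4.0).
links = {("sink", "A"): 1.0, ("A", "C"): 1.5, ("sink", "C"): 4.0}

hop_dodag = build_dodag(links, "sink", hop_count_cost)
etx_dodag = build_dodag(links, "sink", etx_cost)
print("hop count:", hop_dodag["C"])   # C attaches directly to the sink
print("ETX:      ", etx_dodag["C"])   # C attaches through A
```

With the hop-count cost, node C chooses the sink as its parent even though that link is lossy. With the ETX cost, it routes through A because the two good links together need fewer expected transmissions.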
2.3 Comparison of RPL for Varied Topology
Much research has examined IoT networks of stable (static) nodes running RPL, and it shows that RPL is an efficient routing protocol for such networks. The performance of the routing protocol under the RPL objective functions is considered by Spoorthi et al. [3]. Long et al. [4] and Gnawali et al. [5] compare the performance of the Collection Tree Protocol (CTP) with that of RPL and show that RPL scales better. CTP and RPL are compared in terms of packet reception ratio and power. The results indicate that CTP performs better in sparse networks, whereas RPL works well in dense networks with more data traffic. The limitation of this work is that it does not identify the suitable objective function in RPL for a stable topology. Qasem et al. [6] evaluate the objective functions using simulation, considering random and grid topologies. In their study, the power consumption of the IoT network is calculated based on the RX value. The experiment shows that for a 60% RX value, both objective functions perform well in terms of power and packet delivery ratio (PDR), and that in some scenarios MRHOF outperforms OF0 in both random and grid topologies. In [7], the objective functions are compared through simulation; the authors show that OF0 typically performs better than MRHOF in terms of power consumption and convergence time for a static grid topology. In [8], the authors focus on the power consumption and PDR metrics for a stable network, simulated under two topologies, random and fixed. The results show that OF0 achieves a higher PDR in low-density networks, whereas MRHOF uses power more efficiently in dense networks.
Lam Nguyen et al. [4] addressed the load balancing problem and evaluated the skewness of the DODAG via both numerical simulations and an actual large-scale testbed. They proposed a solution called SB-RPL, which aims to achieve a balanced distribution of workload between the nodes in a large-scale LLN. They implemented SB-RPL in ContikiOS and conducted an extensive evaluation using computer simulation and a large-scale real-world testbed. They also compared their solution with the current objective functions, mainly on load balancing but not on power. In all of the above papers, the objective functions are compared mainly using a simulator. In some of the papers OF0 performs better, while in others MRHOF performs better. Because RPL is mainly used in LLNs, the main criterion for selecting the best objective function is power efficiency. Hence, it is important to check the behavior of the objective functions in a real environment using a testbed. The parameter considered in this test is power, which makes this work unique.
3 Experiment Details
The experiment tests the performance of the RPL objective functions as the number of nodes is varied. Our main objective in this paper is to evaluate the performance of the OF0 and MRHOF objective functions with the RPL protocol in the FIT IoT Lab.
3.1 FIT IoT Lab Setup
In this part, we present our study on the FIT IoT Lab testbed. In our experiments, we used the platform installed at the Lille site, France. We used 40 nodes (M3 ARM-Cortex) from the Lille site of the FIT IoT Lab testbed, as described in Table 2. The topology includes one sink located at the center and 40 randomly placed sensor nodes. UDP packets are generated at a predefined time interval. The M3 node has an ARM Cortex-M3 micro-controller, 64 kB of RAM, an IEEE 802.15.4 radio (AT86RF231), a rechargeable 3.7 V LiPo battery, and several types of sensors. To construct the multi-hop topology, the transmission power is set to −17 dBm, as in the FIT IoT Lab tutorial. The parameters are detailed in Table 1.
Table 1 Hardware parameters
Parameter            Value
Antenna model        Omni-directional
MAC                  802.15.4 beacon enabled
Radio chip           TI CC2420
Radio propagation    2.4 GHz
Transmission power   −17 dBm
RX RSSI threshold    −69 dBm
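The power levels in Table 1 are given in dBm, a logarithmic unit relative to 1 mW. As a small worked example, the standard conversion P(mW) = 10^(dBm/10) gives the corresponding linear values. The function name `dbm_to_mw` is chosen here for illustration only.

```python
def dbm_to_mw(dbm):
    """Convert a power level in dBm to milliwatts: P_mW = 10 ** (dBm / 10)."""
    return 10 ** (dbm / 10)


tx_power_mw = dbm_to_mw(-17)         # transmission power from Table 1
rssi_threshold_mw = dbm_to_mw(-69)   # RX RSSI threshold from Table 1

print(f"TX power:       {tx_power_mw:.4f} mW")
print(f"RSSI threshold: {rssi_threshold_mw:.3e} mW")
```

A transmission power of −17 dBm corresponds to roughly 0.02 mW, which is why neighbouring nodes cannot all reach the sink directly and a multi-hop topology forms.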
Table 2 FIT IoT Lab experimental setup
Experimental parameter        Value
Environment                   Indoor
Network scale                 40 nodes and 1 sink
Node placement                Uniform random
Deployed nodes                41 random nodes
Platform                      ContikiOS/M3 Cortex ARM
Duration                      15 min per instance
Application traffic           UDP/IPv6 traffic
Payload size                  16 bytes
Number of hops                Multihop
Embedded network stack        ContikiMAC
Compared objective functions  RPL (OF0, MRHOF)
4 Result
4.1 Comparison of OF0 and MRHOF Using Single Sink
As an initial step, the OF0 and MRHOF objective functions are compared in terms of power for a single sink and a varied number of sender nodes, as shown in Table 3 and Fig. 1. OF0 consumes less power for 20 and 40 sender nodes, and MRHOF consumes less power for 10 and 30 sender nodes. Hence, with a single sink, one cannot conclude which objective function is more power efficient.
Table 3 Comparison of OF0 and MRHOF for single sink
Number of nodes   OF0     MRHOF
10                0.162   0.161
20                0.161   0.162
30                0.162   0.160
40                0.160   0.161
Fig. 1 Comparison of OF0 with MRHOF for single sink
Table 4 Comparison of OF0 and MRHOF for multi sink
Number of nodes   OF0     MRHOF
10                0.162   0.159
20                0.161   0.160
30                0.162   0.161
40                0.162   0.161
Fig. 2 Comparison of OF0 with MRHOF for multi sink
4.2 Comparison of OF0 and MRHOF Using Multi Sink
As shown in Table 4 and Fig. 2, the objective functions are compared for multiple sinks with a varied number of nodes. In these experiments, MRHOF consumes less power than OF0 for every network size tested, so it is the more power-efficient objective function in the multi-sink case.
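The comparisons in Tables 3 and 4 can be reproduced with a short script. The sketch below only restates the reported power values (the unit is not given in the tables) and picks the objective function with the lower value for each network size. The names `power`, `lower_power`, `winners`, and `averages` are illustrative.

```python
from statistics import mean

nodes = [10, 20, 30, 40]

# Power values copied from Table 3 (single sink) and Table 4 (multi sink).
power = {
    "single": {"OF0": [0.162, 0.161, 0.162, 0.160],
               "MRHOF": [0.161, 0.162, 0.160, 0.161]},
    "multi": {"OF0": [0.162, 0.161, 0.162, 0.162],
              "MRHOF": [0.159, 0.160, 0.161, 0.161]},
}


def lower_power(of0, mrhof):
    """Name the objective function that consumed less power."""
    if of0 < mrhof:
        return "OF0"
    if mrhof < of0:
        return "MRHOF"
    return "tie"


winners = {
    sinks: [lower_power(a, b) for a, b in zip(vals["OF0"], vals["MRHOF"])]
    for sinks, vals in power.items()
}
averages = {
    sinks: {of: mean(values) for of, values in vals.items()}
    for sinks, vals in power.items()
}

for sinks in power:
    print(sinks, dict(zip(nodes, winners[sinks])), averages[sinks])
```

For a single sink the lower-power objective function alternates with the network size, whereas for multiple sinks MRHOF is lower in every row. The differences are at most 0.003 in the reported unit.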
4.3 Analyzing the Performance of OF0 Using Single Sink and Multi Sink
Table 5 and Fig. 3 compare the power consumed by OF0 with a single sink and with multiple sinks. Increasing the number of sinks has little effect on power: for 10, 20, and 30 nodes, OF0 consumes the same power in both cases. For 40 nodes, however, OF0 consumes less power with a single sink than with multiple sinks.
Table 5 Comparison of OF0 for single and multi sink
Number of nodes   Single sink   Multi sink
10                0.162         0.162
20                0.161         0.161
30                0.162         0.162
40                0.160         0.162
Fig. 3 Comparison of OF0 for single and multi sink
Table 6 Comparison of MRHOF for single and multi sink
Number of nodes   Single sink   Multi sink
10                0.161         0.159
20                0.162         0.160
30                0.160         0.161
40                0.161         0.161
Fig. 4 Comparison of MRHOF for single and multi sink
4.4 Analyzing the Performance of MRHOF Using Single and Multi Sink
The performance of MRHOF is compared for a single sink and multiple sinks, as shown in Table 6 and Fig. 4. For the sparse network (10 and 20 sender nodes), MRHOF consumes less power with multiple sinks than with a single sink. For the dense network (30 and 40 sender nodes), MRHOF consumes almost the same power with a single sink as with multiple sinks.
5 Conclusion
The paper has evaluated the two main objective functions of RPL using the FIT IoT Lab. The experiments show that OF0 and MRHOF perform equally for a single sink and that MRHOF is more power efficient than OF0 for multiple sinks. The performance of the network is also checked for both a single sink and multiple sinks, and it is observed that a change in the number of sinks does not affect the power consumption of either objective function. In future work, the experiment can be conducted with mobile nodes to analyze the behavior of the objective functions.
References
1. H.-S. Kim, J. Ko, D.E. Culler, J. Paek, Challenging the IPv6 routing protocol for low-power and lossy networks (RPL): a survey. IEEE Commun. Surv. Tutor. 19(4), 2502–2525 (2017)
2. G.G. Krishna, G. Krishna, N. Bhalaji, Analysis of routing protocol for low-power and lossy networks in IoT real time applications. Procedia Comput. Sci. 87, 270–274 (2016)
3. S.U.K. Shetty Spoorthi, Performance of static IoT networks using RPL objective functions. IJRTE 8, 8972–8977 (2019)
4. N.T. Long, N. De Caro, W. Colitti, A. Touhafi, K. Steenhaut, Comparative performance study of RPL in wireless sensor networks, in 19th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT) (IEEE, 2012), pp. 1–6
5. O. Gnawali, R. Fonseca, K. Jamieson, D. Moss, P. Levis, Collection tree protocol, in Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems (ACM, 2009), pp. 1–14
6. M. Qasem, H. Altawssi, M.B. Yassien, A. Al-Dubai, Performance evaluation of RPL objective functions, in 2015 IEEE International Conference on (CIT/IUCC/DASC/PICOM) (IEEE, 2015), pp. 1606–1613
7. W. Mardini, M. Ebrahim, M. Al-Rudaini, Comprehensive performance analysis of RPL objective functions in IoT networks. Int. J. Commun. Netw. Inf. Secur. 9(3), 323–332 (2017)
8. Q.Q. Abuein, M.B. Yassein, M.Q. Shatnawi, L. Bani-Yaseen, O. Al-Omari, M. Mehdawi, H. Altawssi, Performance evaluation of routing protocol (RPL) for internet of things. Perform. Eval. 7(7) (2016)
Predictive Analytics for Retail Store Chain
Sandhya Makkar, Arushi Sethi, and Shreya Jain
Abstract Purpose: Forecasting techniques are used in real-world systems for better decision making. The main purpose of this research paper is to explore the techniques used by retail store chains to forecast a variety of products at various store locations, by working on a retail chain's dataset. Methodology: A public dataset of a retail store chain is taken, which has various details regarding weekly sales. Using Python, the data is handled and analyzed, and a model is created, tested, and then used to forecast future sales. Finding: An understanding of the various techniques used for forecasting multiple products at multiple places, and selection of the best technique based on accuracy.
Keywords Forecasting · Exponential smoothing · Random forest · Regression
S. Makkar (B) · A. Sethi · S. Jain
Lal Bahadur Shastri Institute of Management, New Delhi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_56
1 Introduction
If information is the oil of the twenty-first century, then analytics is surely the internal combustion engine. One important analytics tool that has gained the attention of business organizations over the years is forecasting. Forecasting is the process of estimating a future event that is outside the control of the business, and it provides a basis for decision making and managerial planning. An organization cannot control its future circumstances, but it can reduce their impact with proper management and planning. Forecasting is one such step towards reducing the impact of future uncertainty. It has been consistently recognized as a crucial capability for business planning and management [1]. Forecasting is pervasive: it is needed at every level, whether in manufacturing, sales, inventory management, or service provision, because demand forecasting is necessary to identify market opportunities, enhance customer satisfaction, schedule production, and anticipate financial requirements. Forecasting supports the optimal utilization of resources because it identifies trends in future sales from past sales; raw material can then be purchased according to the sales forecast, which also reduces the bullwhip effect. Forecasting plays a crucial role across the functions of business management because every organization wants a forward estimate of the market so that it can plan its internal functions accordingly. Managers who make decisions without knowing what is likely to happen in the near future cannot balance inventory against sales, and cannot invest with a reasonable expectation of profit. Forecasting is thus central to formulating strategic and tactical decisions for the effective and efficient management of a business [2]. As organizations grow larger, the magnitude of every decision grows as well, so organizations are moving towards a more systematic approach, of which forecasting is an important part [3]. Predicting a future event is complex because it depends on internal factors, such as the variety of products offered, the life span of the products, and product usage, and on external factors, such as the market in which the firm operates, competing firms, market segments, and many more [4]. But the future belongs to those who prepare themselves for it.
2 Forecasting as Distinct from Planning
Forecasting and planning are two different functions of the firm. Forecasting is generally used to estimate the possible future demand based on past records and a certain set of assumptions [5]. Planning, on the other hand, determines the steps to be taken in light of the forecasting results [6]. Once the results of forecasting are before the firm, strategies need to be made for how to respond to them. Forecasting describes the expected situation, and planning decides what action plan is needed for that situation. One important point that managers need to consider is what impact the plan itself will have on the forecast, and how the results of the forecast may best be incorporated into planning [7].
3 Principles of Forecasting
• Accuracy of the forecast: Most businesses tolerate a small amount of forecast error, and the acceptable percentage varies from company to company. The error should not exceed the permissible limits, and the standards of accuracy must be maintained.
• Impact of time horizon: As the time horizon increases, the accuracy of the forecast decreases, because a longer time span gives new patterns a greater chance to emerge and affect the forecast [8].
• Technological change: Forecasting works best in industries where the rate of technological change is fairly constant. In a dynamic industry it becomes difficult to identify patterns, which affects the forecast.
• Barriers to entry: Forecasts are more accurate when there are more barriers to entry, because fewer competitors can disturb the established patterns.
• Distribution of information: The faster information is disseminated, the less competitive advantage forecasting gives to the firm, because competitors can make use of the same information.
• Elasticity of demand: The more inelastic the demand, the greater the accuracy. For example, the demand for necessities can be predicted more easily than the demand for automobiles, which is elastic and therefore predicted less accurately.
• Consumer versus industrial goods: Forecasting is more accurate for consumer goods than for industrial goods, because industrial goods are sold to only a few customers, and losing some of them causes a large loss [9].
• Aggregate versus disaggregate: An aggregate forecast for a product family gives more accurate results than a forecast for a single item, because the patterns of single items change much faster than the patterns of aggregate groups.
4 Multivariate Time Series Data
Time is the most important asset that a firm can have, and this is especially true for a multi-store retailer, which must align its activities with the season and identify the best times to boost sales. In a multivariate time series, there are a number of variables whose values depend on time [10]. The variables depend not only on their own past values but also on each other, and this interdependence is used in forecasting.
Multivariate time series forecasting of a retail store chain
Retail is one of the most important business domains. It faces numerous optimization problems, such as optimal prices, stock levels, allowed discounts, and recommendations, which can now be addressed with various data analysis methods. Data science and data mining applications can also be used to forecast sales, which ultimately helps in optimizing price, cost, inventory, etc. [5]. Accurate prediction of sales is a challenging task in today's competitive and dynamic business environment, and it helps retailers manage inventory and increase profits. To understand sales forecasting for the retail sector, a dataset of a global retail store chain is taken, which has 45 outlets at different locations all over the USA and 99 different departments. The data is a public dataset taken from Google toolkit. The data also contains the data points for which forecasting is to be done. The data is multivariate because the prediction is to be done for different stores and departments in a time series. The whole forecasting exercise is carried out in Python, and part of the data exploration is done using Tableau.
4.1 Data Understanding
The dataset spans over 3 years and has various variables, which fall into 3 categories: sales description, store description, and location and weekly description. The sales description contains details about the weekly sales of products by department and store.
Variable name: Description
Store ID: The stores are assigned IDs ranging from 1 to 45
Department number: The departments are numbered from 1 to 99 for every store
Is Holiday: This variable is Boolean, i.e., it takes the values true and false. True indicates a holiday. The main holidays in the period are:
  Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
  Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
  Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
  Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
Sales: The weekly sales figure of an individual department in an individual store
Store Description: Details related to all 45 stores
Type: The 45 stores are divided into 3 categories, A, B and C
Size: The size of each store
Location and weekly Description: Details about the different locations, such as temperature and fuel price
Temperature: Average temperature of the location in that particular week
Fuel price: Average weekly fuel price for the different stores
Markdown: 5 different types of markdowns on prices in a specific week, mostly during the holidays
CPI: Consumer Price Index
Fig. 1 shows details of all the recorded variables, including the data type of each variable.
4.2 Data Exploration
The data has 14 variables in total, which are used for sales prediction. Before running the prediction model, however, a proper understanding of the data is necessary. This is known as data exploration.
Fig. 1 Descriptive analysis of variables
4.2.1 Multivariate Analysis
A correlation heat map, shown in Fig. 2, helps in visually checking whether any variables are strongly correlated. The heat map shows that the maximum correlation among the variables is 0.3, which is not very high. Markdown 2 and Markdown 3 are positively correlated with IsHoliday, which indicates that markdowns are applied during holiday weeks. Store size is correlated with weekly sales: the larger the store, the higher the sales. CPI is negatively correlated with unemployment: the higher the CPI, the lower the unemployment rate of a state. The variables are correlated to some degree, but no correlation is high enough to require removing a variable.
Fig. 2 The correlation between different variables
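Each cell of such a heat map is a pairwise correlation coefficient. As a minimal sketch, assuming the Pearson coefficient is the measure used (the text does not name it), the matrix behind a heat map can be computed as follows. The three columns hold made-up numbers for illustration only, not values from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative columns (not taken from the retail dataset).
columns = {
    "Size": [100.0, 150.0, 200.0, 250.0],
    "Weekly_Sales": [20.0, 28.0, 41.0, 50.0],
    "Unemployment": [8.0, 7.5, 7.9, 7.0],
}

# One entry per ordered pair of variables, i.e. one cell of the heat map.
matrix = {(a, b): pearson(columns[a], columns[b])
          for a in columns for b in columns}

for (a, b), r in matrix.items():
    print(f"{a:>13} vs {b:<13} {r:+.2f}")
```

The matrix is symmetric with ones on the diagonal, so only one triangle carries information when reading a heat map.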
4.2.2 Univariate Analysis
Each variable is now explored individually for better understanding. Every variable is plotted against the target variable, weekly sales, to look for notable patterns. Two variables show important patterns: type of store and holiday.
Type of store: The 45 stores are divided into 3 types, A, B and C. Plotting store type against weekly sales (Fig. 3) shows that both the number and the amount of sales in type C stores are much lower than in the other two types.
Holiday: 0 denotes a week without a holiday and 1 denotes a holiday week. Fig. 4 clearly shows that holiday weeks have more sales, and larger sales amounts, than weeks without holidays. Fig. 5 shows that the Christmas weekend had the maximum sales, of more than 240,000.
Fig. 3 Bivariate analysis between weekly sales and type of store
Fig. 4 Holiday and weekly sales
Fig. 5 Count of weeks on which the sales is more than 240,000
4.3 Data Preprocessing
The raw data should be preprocessed into a form that the model can use. This helps the model to be implemented properly and therefore to give more efficient and accurate results.
4.3.1 Null Value Treatment
The data has various null values, especially in the markdown columns, as seen in Fig. 6, and these need to be treated. A null value in a markdown column indicates that no markdown was available on that date, so it can be replaced with zero. The weekly sales column also has 115,064 null data points; these are the data points that need to be predicted. For the time being, these are also filled with 0.
Fig. 6 The count of null values in the data
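The null value treatment can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' code: the rows and their values are made up, the column name `Weekly_Sales` is assumed, the `md1_present` flag is assumed to record whether a markdown value existed before filling (the name appears in the selected-variables list, but its construction is not described), and `to_forecast` is a hypothetical helper column.

```python
# Made-up rows; None marks a null value.
rows = [
    {"Store": 1, "Dept": 1, "MarkDown1": None, "Weekly_Sales": 24924.5},
    {"Store": 1, "Dept": 1, "MarkDown1": 5000.0, "Weekly_Sales": None},
]


def treat_nulls(rows, markdown_cols=("MarkDown1",), target="Weekly_Sales"):
    """Return copies of the rows with nulls replaced by 0 and flags added."""
    cleaned = []
    for row in rows:
        new_row = dict(row)                       # leave the input untouched
        for i, col in enumerate(markdown_cols, start=1):
            # Assumed meaning of mdN_present: 1 if a markdown value existed.
            new_row[f"md{i}_present"] = int(row[col] is not None)
            new_row[col] = 0.0 if row[col] is None else row[col]
        # Hypothetical flag: rows without sales are the ones to be forecast.
        new_row["to_forecast"] = int(row[target] is None)
        new_row[target] = 0.0 if row[target] is None else row[target]
        cleaned.append(new_row)
    return cleaned


cleaned = treat_nulls(rows)
for row in cleaned:
    print(row)
```

Recording a presence flag before filling keeps "no markdown" distinguishable from "markdown of zero", and the forecast flag keeps the rows to be predicted separable from the historical rows even after their sales have been set to 0.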
4.3.2 Creating Dummies
Holiday: The IsHoliday variable is Boolean (false and true), so a dummy is created for it before running the model; 0 is assigned to true, i.e., a holiday week, and 1 is assigned to a non-holiday week.
Month: There are 12 dummies for month, one for every month. Based on the date of the sales, 1 is assigned to the column of that month and the others are left as 0.
Black Friday: If the week is a Black Friday week, 1 is assigned to the Black Friday yes column and 0 to the other, and vice versa.
Pre-Christmas: Sales during the Christmas period are high compared to other weeks, so a dummy is created to indicate whether the sales occur during the Christmas period or not.
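The month and Black Friday dummies can be sketched as follows. The column names `Month_1` to `Month_12` are hypothetical, and treating the weeks dated on the Thanksgiving entries of the holiday table as the Black Friday weeks is an assumption made for this illustration.

```python
from datetime import date

# Assumption: the weeks dated on the Thanksgiving entries of the holiday
# table are treated as the Black Friday weeks.
BLACK_FRIDAY_WEEKS = {date(2010, 11, 26), date(2011, 11, 25), date(2012, 11, 23)}


def month_dummies(week):
    """Twelve 0/1 columns, with 1 only in the column of the week's month."""
    return {f"Month_{m}": int(week.month == m) for m in range(1, 13)}


def yes_no_dummies(name, flag):
    """A pair of complementary 0/1 columns, e.g. Black_Friday_yes / _no."""
    return {f"{name}_yes": int(flag), f"{name}_no": int(not flag)}


def encode(week):
    """Build the dummy columns for one weekly record."""
    row = month_dummies(week)
    row.update(yes_no_dummies("Black_Friday", week in BLACK_FRIDAY_WEEKS))
    return row


encoded = encode(date(2011, 11, 25))
print({name: value for name, value in encoded.items() if value == 1})
```

Exactly one month column is set for every record, and the yes/no pair is always complementary, which matches the paired columns (`Black_Friday_no`, `Black_Friday_yes`) in the list of selected variables.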
4.4 Model Implementation
4.4.1 Random Forest
Random forest is an ensemble learning method based on bagging, used for classification and regression. It comprises several decision trees. In classification, each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the prediction of the model. In regression, the mean of the trees' predictions is used. The data for this retail store chain is multivariate, i.e., there are various variables to help the prediction, and the prediction is to be done for different departments, stores, and dates. So normal time series forecasting cannot be used. Therefore, the random forest algorithm is used here to show how multivariate data forecasting is done.
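The two ideas in this description, bagging and the aggregation of tree outputs, can be sketched without any machine learning library. The sketch below does not train decision trees; it only shows how a bootstrap sample is drawn and how the outputs of already trained trees would be combined. The function names and the example values are illustrative.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows with replacement for one tree."""
    return rng.choices(rows, k=len(rows))


def forest_classify(tree_votes):
    """Classification: the class predicted by the most trees wins."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: the forest output is the mean of the tree outputs."""
    return mean(tree_predictions)


rng = random.Random(0)                       # fixed seed for repeatability
sample = bootstrap_sample([1, 2, 3, 4, 5], rng)
label = forest_classify(["high", "low", "high"])
sales = forest_regress([100.0, 200.0, 300.0])

print(sample, label, sales)
```

Since weekly sales is a numeric target, the regression rule is the one that applies to this forecasting task: the forecast for a week is the average of the individual trees' predictions.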
4.4.2
Lagged Values
Random Forest evaluates each data point without combining information from the past with the present. For this reason, lagged variables are created; in this case, lagged sales is created, which helps bring in a pattern from the past when evaluating the present. The lagged sales variable is created with a lag of 1 week.
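One way the 1-week lag could be built is sketched below. The row layout (`Store`, `Dept`, `Date`, `Weekly_Sales`) is assumed, and treating `LaggedAvailable` as a flag for "a previous week exists" is an assumption about a variable the text lists but does not define.

```python
from collections import defaultdict


def add_lagged_sales(rows, lag=1):
    """Add LaggedSales (sales `lag` weeks earlier) per store and department.

    rows: list of dicts with Store, Dept, Date (ISO string) and Weekly_Sales.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Store"], row["Dept"])].append(row)

    out = []
    for series in groups.values():
        series.sort(key=lambda r: r["Date"])  # ISO dates sort chronologically
        for i, row in enumerate(series):
            new = dict(row)
            if i >= lag:
                new["LaggedSales"] = series[i - lag]["Weekly_Sales"]
                new["LaggedAvailable"] = 1
            else:
                # No earlier week for this store and department (assumed handling).
                new["LaggedSales"] = 0.0
                new["LaggedAvailable"] = 0
            out.append(new)
    return out


rows = [
    {"Store": 1, "Dept": 1, "Date": "2012-01-13", "Weekly_Sales": 110.0},
    {"Store": 1, "Dept": 1, "Date": "2012-01-06", "Weekly_Sales": 100.0},
    {"Store": 2, "Dept": 1, "Date": "2012-01-06", "Weekly_Sales": 50.0},
]
lagged = add_lagged_sales(rows)
```

Grouping by store and department before shifting keeps one series from leaking into another.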
4.4.3
Selected Variables
‘LaggedSales’, ‘Sales_dif’, ‘LaggedAvailable’, ‘CPI’, ‘Fuel_Price’, ‘isHoliday_False’, ‘isHoliday_True’, ‘Temperature’, ‘Unemployment’, ‘MarkDown1’, ‘MarkDown2’, ‘MarkDown3’, ‘MarkDown4’, ‘MarkDown5’, ‘Size’, ‘Pre_christmas_no’, ‘Pre_christmas_yes’, ‘Black_Friday_no’, ‘Black_Friday_yes’, ‘md1_present’, ‘md2_present’, ‘md3_present’, ‘md4_present’, ‘md5_present’.
Fig. 7 This figure represents the output of the model
These are the 24 variables which are considered for use in the model.
4.4.4
Train and Test Split
The dataset is finally divided into historic and forecasting data. Initially, the forecasting data was combined with the given historic data and named test. The historic data is further divided into 80% training and 20% testing to obtain more accurate results.
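A minimal sketch of an 80/20 split of the historic rows is given below. The text does not say whether the split was random or chronological; a seeded random shuffle is assumed here, and the function name and parameters are hypothetical.

```python
import random


def split_historic(rows, train_fraction=0.8, seed=0):
    """Shuffle the historic rows and split them into training and testing parts."""
    shuffled = rows[:]  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


historic = [{"week": i, "Weekly_Sales": 1000.0 + i} for i in range(10)]
train, test = split_historic(historic)
```

With lagged features in the model, a chronological cut (train on earlier weeks, test on later ones) would be a reasonable alternative, because it avoids evaluating on weeks that precede some of the training data.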
4.5 Result Random forest is first fitted on the training set, with the number of trees set to 20. It is then run on the whole dataset and finally used for the prediction. The graphs in Fig. 7 show the distribution of the predicted values, first against the weekly sales and second as a probability distribution. Because the points are condensed together, the error appears to be small. The final forecasting of the weekly sales of the retail store is done for future dates on the basis of the model created, and the forecasts were 25,978.1, 26,966.8, 27,052.5, 54,787.7 and 54,313.9 for 5 consecutive days.
5 Conclusion In the past, forecasting was the work of mathematicians or consultants, but with changing times and technology, more and more senior managers are trying to make use of these techniques for the long-term planning of their organizations. This helps reduce the expense and time of bringing in top consultants for small tasks. But there are still various complexities involved, such as the type of technique to be selected for a particular purpose, the amount of data required, etc. Forecasting methods still cannot be perfect under all conditions. Even after applying the appropriate technique, one should properly monitor and control the process so as to avoid any error. Forecasting techniques are rewarding for managers, but they need to tackle all the challenges that come their way.
Object Identification in Satellite Imagery and Enhancement Using Generative Adversarial Networks Pranav Pushkar, Lakshay Aggarwal, Mohammad Saad, Aditya Maheshwari, Harshit Awasthi, and Preeti Nagrath
Abstract Ship detection from satellite images is an essential application for sea security, port traffic control, disaster management, and rescue operations, which incorporates traffic surveillance, illicit fisheries, oil spills, and observation of ocean contamination. Significant challenges for this task include clouds, tidal waves, and the variability of ship sizes. In this paper, we introduce a framework for ship detection from low-resolution satellite images using the best combination of Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs) with respect to image enhancement, training time reduction, and accuracy. The proposed method has been evaluated on the Kaggle open-source (“Ships in Satellite Imagery”) dataset. Keywords Ship detection · Satellite imagery · Generative adversarial networks · Convolutional neural networks
P. Pushkar (B) · L. Aggarwal · M. Saad · A. Maheshwari · H. Awasthi · P. Nagrath Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, Delhi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_57
1 Introduction The greatest challenge in the satellite imaging domain is coping with a small dataset and a limited amount of annotated data, especially when employing an unsupervised [1] learning algorithm, which generally requires a large number of clear and enhanced training [2] samples. Considerable effort has been made during the last decades to design and develop different algorithms and tools for ship detection in satellite/aerial imagery. This study presents a review of the recent progress in the field of ship detection from satellite imagery using deep learning [3]. We also use different GAN [4] models to speed up the training process (while also enhancing the images and reducing noise), and finally compare the performance of each on our dataset. Ship recognition is a significant and vast research area in the field of Computer Vision (along with various Deep Learning methods) that has recently come into the limelight, and these methods are believed to be useful in applications such as identification of illicit oil spills, observation of oceanic traffic in fisheries, security, military movements [5], etc. Our Kaggle open-source dataset (“Ships in Satellite Imagery”) consists of images of various ships collected over the San Francisco Bay and San Pedro Bay areas of California. We apply a Convolutional Neural Network (CNN) [6], DCGAN [7] (Deep Convolutional Generative Adversarial Network), as well as RCNN, and DCGAN gave us the most accurate results.
Objectives:
• To remove the ship-like images which are formed by clouds and tidal waves.
• To enhance the image, since a moving ship forms a wave line alongside it.
• To increase the size of the image so that the object is properly identified.
• To compare with the previous methodologies for the detection of ships.
The rest of the paper is organized as follows: Sect. 2 states the related works, Sect. 3 describes the methodology of the proposed work, Sect. 4 presents the experimental set-up and results, Sect. 5 contains the discussion, and Sect. 6 concludes the paper.
2 Related Works Early works relied on 2D, 3D, and 5D top-view images, together with top- and side-view images taken by synthetic-aperture radar (SAR) [8], using explicit models alongside image descriptors because high-resolution images were unavailable. For instance, the authors of [9–11] treated ship detection as a 3-D object detection problem. Visual saliency estimation is one of the pre-attentive mechanisms by which humans focus their eyes on regions with salient content in a scene; the resulting candidate regions contain both true and false targets. In these regions, the complete profile of each suspected target still has to be confirmed. Given the suspected ship center, the region is rotated to align the ship axis with the vertical direction, and then the S-HOG [12] descriptor is computed. Based on this feature, a discrimination procedure decides whether the suspected target is a real ship. Later, several ship detection methods used combinations of different features to capture different characteristics of ships. Corbane et al. [10] assigned membership probabilities to the detection results in a post-processing stage. The final decision is thus left to the operator, who can validate the detection results based on experience. Further research used a greedy non-maximum suppression algorithm [15] and an appearance-based clustering approach [12] to group multiple detections. Although it is well known that using more than a single feature [18] improves the overall performance of the detection algorithm, the performance depends heavily on the choice of features. Other ship detection techniques operate on synthetic aperture radar (SAR) [10] data to mitigate the problems of night-time imaging. Several papers in the open literature treat methods for detecting ship targets in SAR data. Inggs and Robinson (1999) investigated the use of radar range profiles as target signatures for the identification of ship targets, and neural networks for the classification of these signatures. Tello et al. (2004) used the wavelet transform, by means of multiresolution analysis, to analyze multiscale discontinuities of SAR images and thereby detect ship targets in a particularly noisy background. Given the long history and ongoing interest, there is an extensive literature on algorithms for ship detection.
As far as operational performance is concerned, Zhang et al. (2006) reported limitations of SAR in detecting smaller ships in inland waters. Furthermore, because of the presence of speckle and the reduced dimensions of the targets compared with the sensor spatial resolution, the automatic interpretation of SAR images is often complex, even though undetected vessels are sometimes visible to the eye. The second approach to ship detection lies in optical remote sensing, which has been explored since the launch of Landsat in the 1970s. In computer vision, models often fail to recognize or localize objects in low-resolution images. To tackle this problem, SRGAN [13] is used. It consists of two sub-networks, a super-resolution sub-network and a detection sub-network. The super-resolution sub-network is built by stacking identity residual blocks, while the detection sub-network adopts the single shot multibox detector (SSD).
Fig. 1 Flowchart of the proposed methodology
3 Methodology 3.1 Overview of the Proposed Work In this paper, our main objective is object detection (ships in our case) using Generative Adversarial Networks and Convolutional Neural Networks. We reconstruct an HR image (i.e., 400 × 400) from the given LR input (i.e., 80 × 80) using SRGAN [1, 4] (Fig. 2) and then apply EEGAN [9, 17] (Fig. 3) to the SR output to enhance the edges of the 400 × 400 images. Section 3.2 briefly introduces GANs (in particular EEGAN). Section 3.2 tells us about the error detection in the form of clouds or crest waves (by heat maps or by some detection algorithm). Section 3.2 tells us about the technique employed in training our dataset using CNN [17], RCNN [19] (less accuracy), and DCGAN [21], and finally Sect. 3.2 provides the network architecture and implementation details (Fig. 1).
3.2 Generative Adversarial Networks (GAN) Generative Adversarial Networks (GANs) [4] are a powerful class of neural networks used for unsupervised learning [16]. A GAN essentially consists of two competing neural network models that can analyze, capture, and reproduce the variations within a dataset. In GANs, there is a generator and a discriminator. The Generator produces fake samples of data (be it an image, audio, and so on) and attempts to fool the Discriminator [12]. The Discriminator, in turn, attempts to distinguish the genuine samples from the fake ones. The Generator and the Discriminator are both neural networks, and they compete with each other during the training stage. The steps are repeated many times, and the Generator and Discriminator [2] get better at their respective tasks after each repetition.
Fig. 2 [2]: (Generator and discriminator model)
As highlighted in Fig. 3, our proposed method EEGAN is made up of three basic sections: a generator (G), a discriminator (D), and a VGG19 [12] network for feature extraction. The generator (G) can be divided into two subnetworks: an edge-enhancement subnetwork (EESN) and an ultradense subnetwork (UDSN). UDSN is made of a few dense blocks and a reconstruction layer for producing an intermediate HR result. EESN is used to enhance the target edges extracted from the intermediate SR image by removing most of the unwanted noise. We obtain the final HR output by replacing the noisy edges with the enhanced edges from EESN (Fig. 4).
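The competition between generator and discriminator described above is commonly written as a minimax game. The expression below is the standard objective from the original GAN formulation and is given only as background; the text does not state which exact losses were used to train SRGAN or EEGAN here, and the symbols $p_{\text{data}}$ (distribution of real images) and $p_{z}$ (distribution of the generator's input noise) are notation introduced for this illustration.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator $D$ is trained to increase $V$ by scoring real samples high and generated samples low, while the generator $G$ is trained to decrease it by producing samples that $D$ scores as real.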
Fig. 3 [5]: (Generator and discriminator models for super resolution GAN [SRGAN])
Fig. 4 [7]: (Representation of UDSN and EESN models of EEGAN [edge enhanced GAN])
4 Simulation and Results 4.1 Experimental Set-Up We use the Kaggle open-source dataset (“Ships in Satellite Imagery”), which consists of satellite images collected over the San Francisco Bay and San Pedro Bay areas of California. It includes 500 RGB images of 80 × 80 pixels, each labeled with either a “ship” or “no-ship” classification. The entire dataset is in .png format. Of these images, 130 contain a “ship”; the ships are of different sizes and orientations, and atmospheric interference such as clouds and tidal waves is included. We run the models on a system that provides a GPU so that the images can be processed faster (Figs. 5 and 6).
Fig. 5 (“Ship class” label)
Fig. 6 (“No-ship class” label)
The “no-ship” class includes 370 images. Most of them are random samples of different land cover features: water, vegetation, bare buildings, etc. Some of them are “partial ships” that contain only a portion of a ship. We use this dataset to train our models, then apply them to a scene that contains a large number of ships, and compare their training time, testing time, and accuracy on the same dataset.
4.2 Results The following results were obtained chronologically. Before proceeding, one point should be kept in mind: the training and testing time, as well as the accuracy of the model, depend on the system used for implementation and may vary on different systems. But they do give a comparative idea of the models and can help us choose a better model for ship detection.
1. EEGAN (Figs. 7 and 8)
Time taken for training the model = 2500 s
Time taken for image preparation = 1540 s
2. CNN (Fig. 9)
Time taken for training the model = 2780 s
Time taken for ship detection = 2200 s
Accuracy of the model = 92%
3. RCNN (Figs. 10 and 11)
Time taken for training the model = 3000 s
Time taken for ship detection = 1500 s
Accuracy of the model = 96%
4. Faster RCNN (Figs. 12, 13, 14, 15 and 16)
Time taken for training the model = 2500 s
Time taken for ship detection = 1400 s
Accuracy of the model = 98%
Fig. 7 Before applying EEGAN
5 Discussions We compile our results in a table to get a more precise look at them and a direct idea of their performance with respect to accuracy, time taken to train, and time taken to detect ships.
Model                                                      Training time (s)   Detection/image preparation time (s)   Accuracy (%)
Edge enhanced generative adversarial network (EEGAN)       2500                1540                                   NA
Convolutional neural network (CNN)                         2780                2200                                   92
Region convolutional neural network (RCNN)                 3000                1500                                   96
Faster region convolutional neural network (faster RCNN)   2500                1400                                   98
5.1 Comparative Analysis As we are aiming for the best combination of models for satellite image analysis, we need to pair a GAN model with a CNN-based model. As we have only one GAN model, we can only compare the CNN-based models and find the best one to combine with our GAN model. Although CNN is easy to implement and its training time is lower than that of RCNN, its detection time is much higher than that of the other two models. Hence, we do not proceed further with this model. RCNN has the highest training time, but it is accurate and fast at detecting ships. Faster RCNN provides the highest accuracy and the lowest detection time. However, implementing both RCNN and faster RCNN is complex, and these models are sometimes incompatible with certain systems.
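The comparison above can be reproduced directly from the reported numbers. The short sketch below copies the values from the results table; the ranking rule (highest accuracy first, then lowest detection time) is an assumption chosen to mirror the reasoning in the text, not a rule stated by the authors.

```python
# Values copied from the results table (seconds, seconds, percent).
results = {
    "CNN": {"train_s": 2780, "detect_s": 2200, "accuracy": 92},
    "RCNN": {"train_s": 3000, "detect_s": 1500, "accuracy": 96},
    "Faster RCNN": {"train_s": 2500, "detect_s": 1400, "accuracy": 98},
}


def best_detector(table):
    """Pick the model with the highest accuracy, breaking ties by detection time."""
    return max(table, key=lambda name: (table[name]["accuracy"],
                                        -table[name]["detect_s"]))


best = best_detector(results)
fastest = min(results, key=lambda name: results[name]["detect_s"])
slowest_to_train = max(results, key=lambda name: results[name]["train_s"])
```

EEGAN is left out of the ranking because it is the image preparation step and has no accuracy figure in the table.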
6 Conclusion In this paper, we provide an unsupervised method to detect ships from satellite images using a GAN-based framework and Convolutional Neural Networks (CNNs). In the proposed technique, we used SRGAN and EEGAN, respectively, to prepare HR images and to enhance the edges of those HR images, so that we obtain suitable images with noise reduced and artifacts and sharp edges removed. Moreover, the proposed method is robust for scenes with clouds and tidal waves, is effective when ship size varies, and achieves high accuracy in the detection of ships. Though the study was limited by technical constraints, a good comparative analysis could still be drawn, and the models could be given their respective merits and demerits with respect to each other. EEGAN + faster RCNN gives the best model combination considering the present scope of research and technical limitations.
Fig. 8 After applying EEGAN
Fig. 9 Ships detected through CNN
Fig. 10 Ships detected through RCNN
Fig. 11 Confusion matrix for RCNN
Fig. 12 Training model statistics for successive epochs
Fig. 13 Tabular data for different epochs (cycles)
Fig. 14 Loss-learning rate curve for successive epochs
Fig. 15 Ships detected through faster RCNN
Fig. 16 Confusion matrix for faster RCNN model
References
1. R. Girshick, J. Donahue, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the International Conference on CVPR (IEEE, Columbus, 2014), pp. 580–587
2. R. Girshick, Fast R-CNN, in Proceedings of the International Conference on CVPR (IEEE, Santiago, 2015), pp. 1440–1448
3. S. Ren, K. He, R. Girshick, Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI 39, 1137 (2017)
4. K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, J. Jiang, Edge-enhanced GAN for remote sensing image superresolution. IEEE Trans. Geosci. Remote Sens. 1, 1–13 (2019)
5. W. Liu, D. Anguelov, SSD: single shot multibox detector, in Proceedings of the International Conference on ECCV (Springer, Amsterdam, 2015), pp. 21–37
6. J. Redmon, S. Divvala, You only look once: unified, real-time object detection, in Proceedings of the International Conference on CVPR (IEEE, Las Vegas, 2016), pp. 779–788
7. S. Bell, C.L. Zitnick, Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks, in Proceedings of the International Conference on CVPR (IEEE, Las Vegas, 2015), pp. 2874–2883
8. F. Yang, W. Choi, Y. Lin, Exploit all the layers: fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers, in Proceedings of the International Conference on CVPR (IEEE, Las Vegas, 2016), pp. 2129–2137
9. V. Ramakrishnan, A.K. Prabhavathy, J. Devishree, A survey on vehicle detection techniques in aerial surveillance. Int. J. Comput. Appl. 55(18), 43–47 (2012)
10. C. Corbane, L. Najman, E. Pecoul, L. Demagistri, M. Petit, A complete processing chain for ship detection using optical satellite imagery. Int. J. Remote Sens. (Taylor & Francis) 31(22), 5837–5854 (2010)
11. S. Qi, J. Ma, J. Lin, Y. Li, J. Tian, Unsupervised ship detection based on saliency and S-HOG descriptor from optical satellite images. IEEE Geosci. Remote Sens. Lett. 12(7), 1415–1455 (2015)
12. P.F. Felzenszwalb, R.B. Girshick, Object detection with discriminatively trained part-based models. TPAMI 47, 6–7 (2014)
13. I.J. Goodfellow, J. Pouget Abadie, Generative adversarial networks advances, in Neural Information Processing Systems (2014), pp. 2672–2680
Keyword Template Based Semi-supervised Topic Modelling in Tweets Greeshma N. Gopal, Binsu C. Kovoor, and U. Mini
Abstract The performance of supervised and semi-supervised approaches for topic modelling is highly dependent on the prior information used for tagging. Tweets are short texts and hence demand supplementary information to infer their topic. The correlation of a word with a topic changes with time in social media. Therefore, it is not appropriate to fix a tag for a keyword for an indefinite time. Here we propose a framework for the adaptive selection of keywords for tagging with the help of external knowledge. The keyword template is updated appropriately with the time slice under consideration. The evaluation metrics show that this model gives consistent and accurate latent topic identification in short text. Keywords Topic modelling · Semi-supervised learning · LDA · Tweets
1 Introduction Social media is considered one of the richest sources from which to extract statistical information. Social media can provide relevant and genuine information through analysis. However, identifying and extracting only the relevant data is not easy. Several statistical models have been implemented to extract the hidden subject category that a text deals with. Latent Dirichlet Allocation (LDA) is one of the most commonly used techniques for topic modelling. This model has given satisfactory results in topic modelling of documents. When trying to infer the topic of tweets, one of the major challenges we face is the sparsity of words in short text. Therefore, information recognition fails to attain accuracy with naive models based on LDA [1].
G. N. Gopal (B) · B. C. Kovoor School of Engineering, CUSAT, Kochi, India
G. N. Gopal College of Engineering Cherthala, Cherthala, India
U. Mini CUSAT, Kochi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_58
2 Related Work The shortage of words in short texts like tweets has been addressed by several researchers, either by incorporating knowledge from external sources or by aggregating the tweets. Yang et al. identified key-phrases from the short text itself [2]. They considered Wikipedia, WordNet, YAGO and KnowItAll for extracting the supplementary knowledge. Similar work was done by Zhu et al., who doubled the strength of feature keywords after estimating the importance of each keyword from an external source [3]. Cheng et al. used word correlation statistics obtained from an external corpus to improve LDA [4]. Scientific document titles were efficiently classified using similar external knowledge in the work of Vo and Ock [5]. Social data providers like Socialbakers and topic-enhanced word embedding (TEWE) were used by Li et al. to develop an entity knowledge base [6]. The work of Kim et al., who used time-series-based feedback in their iterative topic modelling, was a step towards a model that adapts to the change in topic with time [7]. When external sources like Wikipedia word distributions [8] are considered, the constantly changing mapping of a word to a topic is not reflected. Collecting additional knowledge from relevant sources that are constantly updated will help us map the word to its current semantics. Basave et al. [9] used news groups to infer word relationships through a summarization technique, from which they extracted the dominant words of a particular topic. Behavioural studies of social media users have shown that people tweet more about controversial topics than about other relevant facts mentioned in news portals. So the word frequencies in external sources and in tweets will not always match. Another challenge in understanding the topic is that the relationship of a word with a topic is highly dynamic in social media.
For example, when the price of onions hit the roof during November 2019 in India, people used the word Onion to express disapproval of government policies. Onion, which had been a word related to cooking, was then contributing to the topic of politics. Clearly, news groups can provide relevant words related to a topic for a particular time period. On the other hand, relying on word frequency to extract the dominant words of a topic from news headlines is not reliable, due to the sparsity of words. Hence topic models for tweets that use external knowledge must address both the changing word semantics and the sparsity of keywords. The information acquired from an external corpus can be incorporated into the models by reshaping the unsupervised algorithm into a supervised or semi-supervised one. Initial tagging of the corpus in supervised learning demands a huge amount of human intervention. Hence semi-supervised algorithms are widely accepted in developing Guided LDA.
3 Semi-Supervised Topic Modelling for Tweets In this section, we describe the framework and model for automatic labelling in the semi-supervised LDA. Figure 1 shows the framework of tweet classification based on topic modelling. A set of words k_1, k_2, ..., k_n is selected to form keyword templates that act as the starting data points for the clustering. These keywords are chosen with the utmost care so that their semantics are fixed. Using these initial data points, word clustering is done in the news corpus to tag the clustered words to a topic t_i. During the topic sampling of words in the LDA, this additional information is considered for estimating the distribution. Later, classification is done with the topic-modelled text. Meanwhile, the top words selected for a particular topic are fed back to the keyword template selection for update.
3.1 Inference Algorithm To infer whether a term belongs to a particular topic, we have used collapsed Gibbs sampling. Initially, a distribution over K topics is generated with Dirichlet prior α. We then draw a Dirichlet distribution φ over words with the Dirichlet prior β. The topic distribution tells what a document is about. Finally, for each word position in a document, a topic is drawn, and then a word that contributes to that topic is drawn from the multinomial distribution. In the naive LDA model, the initial distribution of topics over words is based on the initial Dirichlet distribution. Here, instead, a set of words is tagged with labels, and this acts as a prior during sampling. The whole system is described in the plate notation shown in Fig. 2. There are M tweets to be classified under different topics, and N is the number of words in these documents. The correlation of a tweet with a topic t_i depends on how the words in that tweet are aligned to that topic. The fundamental objective of our work was to automate keyword tagging in a semi-supervised LDA. From the preliminary analysis, we observed that most tweets are incomplete or short and do not carry sufficient information to identify the subject they deal with. This is because people usually tweet to express their opinion about controversial topics rather than to pass on the news. Plenty of information mentioning the topic floods the social network, and people post comments that have a direct or indirect connection with the subject. Hence, it is necessary to depend on an external source to complete the semantic information. From the external source, word correlations can be identified. However, identifying this correlation from the frequency with which two words occur together was not always successful, because of the sparsity of words in the text. Moreover, this frequency was not consistent across the corpus. For example, the words ‘Modi’ and ‘ISRO’ were found to occur together only 5 times in the news corpus, whereas ‘Modi’ and ‘Trump’ occurred together 15 times.
Fig. 1 Framework for adaptive topic modelling of tweets
If the word co-occurrence statistics set 10 as the frequency threshold, the two pairs will fall into different clusters. We have observed that clustering algorithms like k-means, spectral clustering, etc. fail to retrieve entity relationships in the news corpus. Hence we have used the DBSCAN algorithm, which captures the strength or density of the co-occurrence in the current space. In this algorithm, if the similarity strength between a word t1 in the keyword template and a word t2 in the news corpus exceeds a threshold, then t2 is added to the cluster. The algorithm then runs recursively with the new term as the input. The recursive algorithm is shown in Algorithm 2.
Fig. 2 Adaptive topic modelling plate notation
Algorithm 1 Template labelled topic modelling
Require:
  M: Number of tweets
  N: Number of words
  k: Number of topics
  α: Prior for the Dirichlet distribution over topics
  β: Prior for the Dirichlet distribution over words
  Z: List of topics from z_0 to z_k
  W_c: Keywords obtained from clustering with template keywords as initial data points
 1: for i = 1:k do
 2:   do Clustering(W_1...n)
 3: end for
 4: for i = 1:M do
 5:   Generate θ_i ∼ Dir(α)
 6: end for
 7: for i = 1:k do
 8:   Generate φ_i ∼ Dir(β)
 9: end for
10: for m = 1:M; n = 1:N and c = 1:k do
11:   if w(m, n) ∈ W_c then
12:     Z_m,n = l(W_c)
13:   else
14:     Z_m,n = Multinomial(θ_m)
15:   end if
16:   w(m, n) = Multinomial(φ_Z_m,n)
17: end for
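The labelling step of Algorithm 1 (seeded words keep the topic of their keyword cluster, all other words get a topic drawn from the tweet's topic distribution) can be sketched as below. This covers only the initial topic assignment, not the full collapsed Gibbs sampler; the function name, the `seed_topics` mapping, and the fixed `theta` values are hypothetical and introduced purely for illustration.

```python
import random


def assign_topics(tweets, seed_topics, theta, rng):
    """Assign a topic id to every word of every tweet.

    tweets:      list of token lists
    seed_topics: word -> topic id, for words tagged through the keyword clusters
    theta:       per-tweet topic probabilities (one list of weights per tweet)
    rng:         random.Random instance used for sampling
    """
    assignments = []
    for m, tokens in enumerate(tweets):
        topic_ids = range(len(theta[m]))
        z_m = []
        for word in tokens:
            if word in seed_topics:
                # Seeded word: use the label of its keyword cluster.
                z_m.append(seed_topics[word])
            else:
                # Unseeded word: draw a topic from the tweet's topic distribution.
                z_m.append(rng.choices(topic_ids, weights=theta[m])[0])
        assignments.append(z_m)
    return assignments


tweets = [["onion", "price", "government"], ["match", "score"]]
seed_topics = {"government": 0, "match": 1}  # 0 = politics, 1 = sports (assumed ids)
theta = [[0.7, 0.3], [0.4, 0.6]]
z = assign_topics(tweets, seed_topics, theta, random.Random(0))
```

In a full sampler these assignments would be revisited in every Gibbs sweep, with the seeded words either kept fixed or given a strong prior towards their label.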
Algorithm 2 Clustering
Function addterms(term t)
  for i = 1 : N do
    if JaccardSimilarity(t, i) > tck then
      if t ∉ W_c then
        addterm(t)
      end if
    end if
  end for
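A minimal Python sketch of the recursive keyword expansion is given below. It assumes, as one possible reading of Algorithm 2, that every term is represented by the set of news items in which it occurs, that a corpus term whose Jaccard similarity to the current term exceeds the threshold joins the cluster, and that the function then recurses with the newly added term. The names `expand_keywords` and `doc_sets` are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand_keywords(seed, doc_sets, threshold, cluster=None):
    """Grow a keyword cluster recursively, starting from a template keyword.

    seed      -- the term currently being expanded
    doc_sets  -- mapping term -> set of ids of news items containing the term
    threshold -- minimum Jaccard similarity for a term to join the cluster
    """
    if cluster is None:
        cluster = {seed}
    seed_docs = doc_sets.get(seed, set())
    for term, docs in doc_sets.items():
        if term in cluster:
            continue
        if jaccard_similarity(seed_docs, docs) > threshold:
            cluster.add(term)
            # Recurse with the new term as the input.
            expand_keywords(term, doc_sets, threshold, cluster)
    return cluster
```

With this chaining, a term can reach the cluster through an intermediate term even when its direct similarity to the template keyword is below the threshold.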
The keywords that are used to tag the corpus were extracted from the external corpus. One of the major challenges faced during this process was due to the sparsity
G. N. Gopal et al.
of the data in the extracted news. This is because there will be only two or three headlines related to a piece of news. Measuring the relationship of one word to another in short and sparse text is very difficult. Usually, the co-occurrence similarity of two words is measured using metrics like Jaccard similarity [10], where the distance between two terms t_1 and t_2 is
$$d_J(t_1, t_2) = 1 - J(t_1, t_2) = \frac{|t_1 \cup t_2| - |t_1 \cap t_2|}{|t_1 \cup t_2|}$$
However, in sparse data, the number of times two terms co-occur may be very small compared with the number of times they occur individually. In such cases, the similarity index turns out to be close to zero even though both terms co-occur multiple times. Therefore, we have considered the normalized value of the co-occurrence count here. Observing that the keyword relationships are only weakly reflected by the counts, our next step was to extract only the important words for the clustering process. Entity recognition was performed to extract only the relevant keywords. In addition, the entities themselves had to be cleaned, since they contained many stop words. Leaving the entity structure as it was would not have been a good choice, since tweets seldom contain complete entity patterns. People use the first name or last name of a person, and not the complete name, when they tweet. For example, for the entity “Indian National Congress”, people use the word Congress when they tweet. So we extracted only the proper nouns from the entities recognized in the news. News texts reduced to their proper nouns were then given to the clustering algorithm. The keywords obtained through clustering were then used to tag the tweet corpus.
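The two measures discussed above can be illustrated with a short Python sketch. The Jaccard distance follows the formula given in the text, with each term represented by the set of headlines in which it occurs. The text does not specify the normalization of the co-occurrence count, so dividing by the largest pair count is purely an assumption made for this example, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard_distance(docs_a, docs_b):
    """d_J = (|A ∪ B| - |A ∩ B|) / |A ∪ B| for two sets of headline ids."""
    union = docs_a | docs_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return (len(union) - len(docs_a & docs_b)) / len(union)


def normalized_cooccurrence(headlines):
    """Count how often each word pair shares a headline, scaled to [0, 1].

    Assumption: counts are normalized by the largest pair count.
    """
    counts = Counter()
    for tokens in headlines:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    top = max(counts.values(), default=0)
    if top == 0:
        return {}
    return {pair: count / top for pair, count in counts.items()}
```

For instance, two terms that appear in headlines {1, 2, 3} and {2, 3, 4} share two of four headlines, which gives a Jaccard distance of 0.5.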
4 Experiments and Results
The dataset used for the experiments consists of tweets extracted from Twitter for a chosen time period. The tweets were collected through the Twitter API using Tweepy. During the preprocessing of the tweets, URLs and special characters were removed. The texts were then converted to lower case and tokenized. Later, stop words were removed from the tweets. The external corpus that we have used consists of news websites, from which the daily news is scraped. Importance was given to trending and most-shared news. During the initial preprocessing of the news corpus, URLs and image links were removed. In every news text, the entity words that describe the topics the text is aligned with had to be recognized. For this, we employed Named Entity Recognition (NER) and extracted all names, locations and organizations. The NER was done using spaCy, since it showed better performance on the dataset than Stanford NER [11]. The experiments were also repeated with entity words chosen by extracting only the proper nouns from the text. The word list of each news text is then clustered using Algorithm 2.
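The tweet preprocessing steps described above could look roughly like the following Python sketch. It uses only the standard library; the regular expressions and the tiny stop-word set are illustrative assumptions, not the resources used in the experiments.

```python
import re

# Illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")


def preprocess_tweet(text):
    """Remove URLs and special characters, lower-case, tokenize, drop stop words."""
    text = URL_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]
```

Whitespace tokenization is the simplest possible choice here; a tweet-aware tokenizer could be substituted without changing the rest of the pipeline.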
For the clustering, we have selected initial data points from which the clustering starts. These words have a close relationship with the topic. However, the consistency of the topic-word relationship does not matter in our model, since these words are updated every time with the top words obtained through collapsed Gibbs sampling. So our model not only tags the words automatically for semi-supervised learning, but also adapts to changes in the semantics of the keywords. In essence, human intervention is required only during the deployment of the model, thereby reducing the burden of labelling in semi-supervised and supervised algorithms. The modelling was done by extending the STTM tool for short-text topic modelling [12]. The topic distribution obtained through modelling is used to classify the text. The classification accuracy of this semi-supervised algorithm is compared with the accuracy of the unsupervised algorithm. The experiments have shown that our algorithm with automatic topic labelling gives a consistent solution throughout the execution, as shown in Fig. 3. The figure shows the accuracy of KSSTM and naive LDA when they were run multiple times with the same input. In the unsupervised algorithm, the performance of the classification depends completely on the initial distribution. The change in accuracy was also observed when using only the entities extracted from the external news corpus and when considering only proper nouns. The results have shown that the accuracy improves as we filter out ambiguous words that may fall into both clusters (Fig. 4).
Fig. 3 a Classification accuracy in dataset 1, b classification accuracy in dataset 2
Fig. 4 a Classification accuracy by applying NER and b proper noun extraction from data
4.1 Conclusion
Through the experiments, it is observed that semi-supervised and supervised LDA provide a more consistent topic categorization. However, tagging the data needs a lot of human effort and time. The suggested model proposes a method to automatically extract the words to be tagged from an external corpus. The proposed method is designed for topic modelling in tweets, which are short in length. The semantics of tweets are highly correlated with the current news, and this hypothesis is used to extract the knowledge for the inference. The experiments have shown that the model provides a better and more reliable solution to the topic modelling problem.
References
1. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
2. S. Yan, W. Lu, D. Yang, L. Yao, B. Wei, Short text understanding by leveraging knowledge into topic model, in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2015), pp. 1232–1237
3. Y. Zhu, L. Li, L. Luo, Learning to classify short text with topic model and external knowledge, in International Conference on Knowledge Science, Engineering and Management (Springer, Berlin, 2013), pp. 493–503
4. X. Cheng, X. Yan, Y. Lan, J. Guo, BTM: topic modeling over short texts. IEEE Trans. Knowl. Data Eng. 26(12), 2928–2941 (2014)
5. D.T. Vo, C.Y. Ock, Learning to classify short text from scientific documents using topic models with various types of knowledge. Expert Syst. Appl. 42(3), 1684–1698 (2015)
6. Q. Li, S. Shah, X. Liu, A. Nourbakhsh, R. Fang, Tweetsift: tweet topic classification based on entity knowledge base and topic enhanced word embedding, in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (ACM, 2016), pp. 2429–2432
7. H.D. Kim, M. Castellanos, M. Hsu, C. Zhai, T. Rietz, D. Diermeier, Mining causal topics in text data: iterative topic modeling with time series feedback, in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (ACM, 2013), pp. 885–890
8. J. Wood, P. Tan, W. Wang, C. Arnold, Source-LDA: enhancing probabilistic topic models using prior knowledge sources, in 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (IEEE, 2017), pp. 411–422
9. A.E.C. Basave, Y. He, R. Xu, Automatic labelling of topic models learned from twitter by summarisation, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 2, Short Papers, 2014), pp. 618–624
10. A. Saxena, M. Prasad, A. Gupta, N. Bharill, O.P. Patel, A. Tiwari, M.J. Er, W. Ding, C.T. Lin, A review of clustering techniques and developments. Neurocomputing 267, 664–681 (2017)
11. B. Kleinberg, M. Mozes, A. Arntz, B. Verschuere, Using named entities for computer-automated verbal deception detection. J. Forensic Sci. 63(3), 714–723 (2018)
12. J. Qiang, Y. Li, Y. Yuan, W. Liu, X. Wu, STTM: A Tool for Short Text Topic Modeling (2018). arXiv:1808.02215
A Community Interaction-Based Routing Protocol for Opportunistic Networks
Deepak Kumar Sharma, Shrid Pant, and Rinky Dwivedi
Abstract Opportunistic Networks (Opp-Nets) provide the capability to interact and transfer information between spontaneous mobile nodes. In these networks, the routing of messages, which involves the selection of the best intermediate hop for the relay of message packets, is one of the most important issues. This is primarily due to the non-availability of prior knowledge about the network topology and configuration. This paper presents a community interaction-based routing protocol for Opp-Nets that may be used to select appropriate intermediate nodes as hops. The selection is based on the interaction point probability and social activeness of the nodes, which are calculated and analyzed at the sender and at each intermediate node. The results for the proposed protocol are obtained using the ONE simulator and are compared, analytically and graphically, with other contemporary routing protocols to show its effectiveness.
Keywords Opportunistic networks · Routing · Community interaction · ONE simulator
D. K. Sharma · S. Pant, Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India
R. Dwivedi (B), Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, Janakpuri, New Delhi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_59
1 Introduction
With the emergence of novel and affordable wireless technologies such as Bluetooth, 3G [1], WiFi, and many others, it is now possible to equip almost any device with wireless connectivity. This has led to an exponential increase in the number of wireless networks. However, wireless network infrastructure is not available in every scenario, especially in extreme environments like deep space and oceans. Mobile Ad Hoc Networks (MANETs) [2] were introduced to deal with some of these challenges because, in MANETs, each node is mobile and acts as an intermediate node for transferring data. MANETs, however, require end-to-end (E2E) connectivity between any pair of source and destination nodes for packet transfer. Under this assumption, data may get dropped midway when intermediate nodes go down due to failure or power outage, or when nodes move out of radio range [3]. Hence, without a completely connected path, communication might never happen in MANETs. In real-world scenarios, there are many circumstances in which end-to-end active paths never exist. In these situations, Opportunistic Networks (Opp-Nets) can provide a means to route the packets while accounting for intermittently connected paths. In Opp-Nets, the nodes are obligated to buffer their message packets until they discover appropriate nodes to forward them to, in such a way that these packets may eventually be delivered to their desired destinations. Opp-Nets are a sub-class of Delay Tolerant Networks (DTNs) [4]. They are traditionally characterized by low-power mobile devices communicating and networking in an ad hoc manner. The network connectivity is sparse, intermittent, and usually unpredictable. Even the duration for which two nodes remain in contact is highly variable.
Thus, reliable delivery of the message is not guaranteed in these networks due to network partitions and intermittent connectivity. Every node decides the best next hop using several appropriate parameters before it transfers a message to any other node in the network. Routing in Opp-Nets does not require the existence of a connected path. Instead, each node decides which nodes it should relay the packets to, in order to maximize the chance of successful transfer of the packets with minimal delay. Because of network partitions, the intermediate nodes might not discover viable nodes to relay the packets toward the target. In these circumstances, the nodes might have to store the packets in their buffers for a period of time, as long as there is no opportunity to forward them toward the destination nodes. So, a buffer management scheme is needed for the case in which new packets arrive at a node whose buffer is already full. For this, an acknowledgment-based technique can be used to remove delivered message copies from the buffers of the nodes. Also, the nodes in Opp-Nets have limited energy resources. Intermittent contacts, frequent disconnections and reconnections, long delays, etc., generally result in the drainage of the battery. This is one of the primary issues in Opp-Nets that must be addressed. Secure routing is also needed to provide secure communication between the nodes in Opp-Nets. By designing the protocol with security features, it can be assured that the network is protected against various types of security threats
in the underlying environment. This is also an area of concern in Opp-Nets that needs attention [5]. The following are some characteristics of Opp-Nets:
(1) They are devoid of any fixed network topology, as the nodes are in constant movement relative to each other.
(2) Contact opportunities are few and contact durations are short and variable, since the nodes are moving at all times.
(3) Links present in the network at any given instant may not be present at all times due to node failure, nodes being out of radio range of each other, or node power failure.
(4) As a result, links are unreliable and give varying performance.
(5) Buffer capacity in each node is high in order to buffer as many messages as possible and to prevent them from being dropped due to buffer overflow.
(6) Since these networks are delay tolerant, the average latency for delivering a message to its intended receiver is quite high compared to existing legacy networks that require end-to-end connected paths between source and destination.
The following are some of the applications of Opportunistic Networks [6]:
• Emergency applications: Opp-Nets can be used in all kinds of emergency situations, such as earthquakes and hurricanes. A seed Opp-Net with some nodes can be deployed for disaster recovery. Other potential helper nodes equipped with more facilities can be added as required to grow it into an expanded Opp-Net.
• Opportunistic computing: Opp-Nets can be used to provide a platform for distributed computing where various resources, content, services, and applications can be shared by mobile devices for various purposes.
• Recommender systems: Opp-Nets can exploit context information about the nodes, such as mobility patterns, contact history, and workplace information. This contextual information can be used to furnish suggestions on multiple items.
• Mobile data offloading: Mobile social Opp-Nets can be used for the purpose of mobile data offloading. The immense increase in smartphone users has overloaded large portions of the 3G networks. A number of research works have taken advantage of mobile social Opp-Nets for data offloading on 3G networks.
• Information exchange: Opp-Nets also utilize the data transmission potential of small devices, such as mobile phones. These handheld devices form an Opp-Net when they come in close proximity of other wireless devices to exchange data.
2 Related Works
This section presents a review of some relevant routing protocols and of the attempts made to decrease congestion. Numerous algorithms have been proposed for effective routing of messages in opportunistic networks. There are mainly two classes of routing protocols: (1) Context-Ignorant Routing Protocols and (2) Context-Aware Routing Protocols. In Context-Ignorant Protocols, the nodes select the next intermediate nodes without using any information about the network. In Context-Aware Algorithms, on the other hand, the delivery probability is calculated using different
network metrics for routing. The following are some of the routing strategies that are used in delay tolerant networks [7].
A. First Contact: First Contact is a simple single-copy routing algorithm in which a node hands the message to the first node it comes in contact with, and the message is delivered once its current carrier meets the target. Until a contact is available, the packets are stored in the buffer. In this protocol, the local copy of a message is removed after every transfer between nodes. Since there exists just one copy of the data in the entire network, the congestion and resource utilization are low. Although simple, First Contact has very limited applications, as the delivery ratio is very poor and relaying along random paths might not allow progress toward the target.
B. Epidemic Routing: Epidemic Routing Protocol [8] is founded on the theory of Epidemic Algorithms. It is a dissemination-based protocol in which the message packets are passed within the network with the help of flooding mechanisms. In it, the starting node floods the entire network with numerous replicas of the message packets intended for delivery to the target node. This is accomplished by distributing copies of the message packets to every encountered node, which further distributes the copies to its adjacent nodes. This activity is continued until a replica of the message has reached the target node. Thus, the message spreads in the network like an epidemic, and each node infects all its surrounding nodes that have not yet been infected. The algorithm has a good delivery rate, but suffers from heavy buffer and bandwidth requirements, resulting in wastage of network resources.
C. PROPHET Routing: PROPHET [9] is a history-based protocol that employs the knowledge of past interactions and transitivity to route a message.
The protocol employs a parameter named delivery predictability, which is the probability of interaction between a node and the destination, to decide the next receiver of the message packet. PROPHET is founded on the presumption that the movement of nodes follows a pattern that is repetitive over a given interval. PROPHET utilizes this repetitiveness in the nodes’ movements and maintains a table of delivery predictabilities, which contains the probability of final delivery of the message packets from a given node. This probability is based on the node’s movement pattern and the history of interactions with other nodes that have helped the node to successfully deliver messages in the past. In this protocol, whenever a node encounters other nodes, they exchange delivery predictability values. This allows the message packets to be forwarded to those nodes that possess a better delivery predictability value.
D. Spray and Wait Routing: Spray and Wait Protocol [10] is based on the technique of controlled flooding. It is essentially an improvement over the existing Epidemic routing algorithm such that
it restricts the volume of flooding and reduces the network resource usage. Routing takes place in two stages: the Spray Phase and the Wait Phase. In the Spray Phase, the starting node computes L, the number of replicas of the message with which the network should be flooded, and forwards these copies to L distinct nodes called the relay nodes. The message could be directly delivered to the target in this phase. During the Wait Phase, the sender node waits for one of the L relay nodes to deliver the message directly to the target. So, the network gets flooded by only L copies of the message. This protocol requires a high amount of node mobility within the network.
E. Spray and Focus Routing: Spray and Focus Routing [11] is an advancement over Spray and Wait Routing. It operates in two stages, namely, the Spray Phase and the Focus Phase. In the Spray Phase, the initiator of the message can distribute copies to only a fixed number of relays, say L. The nodes that received a copy can in turn distribute copies to only half of this number of relays, i.e., L/2, and so on. When L = 1, the packet can be forwarded to only one relay, chosen on the basis of a particular relaying criterion. During the Focus Phase, the forwarding is done based on this criterion. A group of timers is employed by the protocol to measure the interval between the meetings of two nodes. The timers are used to define a utility function, which helps nodes decide the usefulness of relay nodes in delivering packets. Packets are transmitted to only those nodes that have a higher utility function value.
F. Other Works: Many modern routing algorithms have been proposed to provide a more efficient way for message routing. Different node characteristics and network information are analyzed to decide the best possible routes between nodes.
Besides the benchmark protocols like Epidemic, PROPHET, and others which have been discussed above, other routing protocols apply numerous techniques to achieve optimal results. Application of game theory [12, 13], clustering techniques [14, 15], fuzzy systems [16], machine learning [17–19], and many others [20, 21] have resolved issues pertaining to specific aspects of Opp-Nets.
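To make the delivery predictability used by PROPHET (item C above) more concrete, the following Python sketch shows the update rules as they are commonly stated for the original PROPHET proposal: a direct-encounter update, an aging step, and a transitive update. The rules and the constant values are not taken from this paper; the constants are typical defaults and should be read as assumptions, as should the function names.

```python
P_INIT = 0.75  # initialization constant (typical default, assumed here)
BETA = 0.25    # scaling constant for transitivity (typical default, assumed)
GAMMA = 0.98   # aging constant per time unit (typical default, assumed)


def on_encounter(p_ab):
    """Raise P(a, b) when node a meets node b."""
    return p_ab + (1.0 - p_ab) * P_INIT


def age(p_ab, elapsed_units):
    """Decay P(a, b) for the time units elapsed since the last update."""
    return p_ab * GAMMA ** elapsed_units


def transitive(p_ac, p_ab, p_bc):
    """Raise P(a, c) when a meets b and b is likely to meet c."""
    return p_ac + (1.0 - p_ac) * p_ab * p_bc * BETA


def should_forward(p_carrier_dest, p_peer_dest):
    """Forward only to a peer with a better predictability for the destination."""
    return p_peer_dest > p_carrier_dest
```

Two nodes that meet repeatedly quickly approach a predictability of 1, while the aging step lowers the value for pairs that stop meeting.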
3 Proposed Protocol The proposed algorithm is thoroughly described in this section.
3.1 Parameters Considered
A novel routing algorithm is proposed for efficient message delivery in an Opportunistic Network environment. It works by minimizing the number of message copies and by selecting appropriate relay nodes in an Opportunistic
Network environment. The intermediate nodes for message delivery are selected on the basis of the following factors:
(1) Interaction Point Probability: An interaction point is a particular location where numerous nodes from diverse communities routinely come together to interact. The nodes that have a greater probability of moving toward the interaction point are good candidates for intermediate nodes.
(2) Socially Active Nodes: A node is said to be socially active if it interacts with relatively many nodes of the network. Compared to static or less mobile nodes, the nodes that change their positions frequently have a higher probability of interacting with other nodes. For a node to be considered socially active, it should change its position frequently, i.e., move fast and have a short wait time.
A node that is either socially active or has a high interaction point probability is considered as an intermediate node for message delivery.
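The selection rule described above can be sketched as a small Python predicate. The text does not give numeric thresholds, so `min_speed`, `max_wait` and `ip_threshold` below are hypothetical values chosen only for illustration, as are the class and function names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    speed: float           # movement speed in m/s
    wait_time: float       # pause time between movements in s
    ip_probability: float  # probability of heading to the interaction point


def is_socially_active(node, min_speed=4.0, max_wait=30.0):
    """A node is treated as socially active if it moves fast and waits briefly."""
    return node.speed >= min_speed and node.wait_time <= max_wait


def is_relay_candidate(node, ip_threshold=0.5):
    """Candidate if socially active OR likely to visit the interaction point."""
    return is_socially_active(node) or node.ip_probability > ip_threshold
```

A slow node can therefore still qualify as a relay when it is likely to visit the interaction point, and a fast node qualifies regardless of its interaction point probability.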
3.2 Assumptions
While proposing the routing algorithm, the following assumptions were made:
1. Message exchange takes place at the interaction point or inside a community, and at no other place.
2. Nodes meeting at the interaction point will diverge, i.e., move in different directions and enter different communities.
3. Minimal time is taken for data transfer between a node and the destination, i.e., the destination does not change its community during a data transfer.
3.3 The Proposed Protocol
The source generates the message ID along with the destination ID. The source delivers the message when it comes in contact with the destination. However, the probability of that happening is very low, so we take the help of intermediate nodes. The source gives a copy of the message to a node in its community that is socially active. This socially active node transfers the message to a node that is in range (i.e., connected) and has an interaction point probability higher than a threshold value. Every node maintains a table of the messages that are yet to be delivered in the form of a message buffer. The node then moves out of the community to reach the interaction point, where it will meet some other nodes. The meeting nodes exchange the messages that they do not have in common. Thus, various nodes at the interaction point will have a copy of the message. When these nodes move to different communities, so does the message. After meeting at the interaction point, the nodes enter different communities. On entering a community, a node transfers its message list to a node that is a socially active host in that community. Thus, various communities have a copy of the message. Chances
that the destination node is in one of these communities are high; therefore, the message will be delivered whenever the destination is in contact with any of the socially active nodes. This keeps the end-to-end delay low because, even if the destination changes its community, it may enter a community whose socially active node has a copy of the message. If any of the intermediate nodes is the required destination, then that node receives the message and the message is not relayed further.
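The message exchange that takes place when two nodes meet, either at the interaction point or inside a community, can be sketched as follows. This is a simplified illustration rather than the simulator code: each buffer is modelled as a dictionary from message ID to destination ID, buffer limits are ignored, and the names `transfer` and `meet` are hypothetical.

```python
def transfer(sender_buf, receiver_id, receiver_buf, delivered):
    """Copy to the receiver every message it does not already hold."""
    for msg_id, dest_id in sender_buf.items():
        if msg_id in receiver_buf or msg_id in delivered:
            continue  # only messages that are not common are exchanged
        if dest_id == receiver_id:
            delivered.add(msg_id)  # destination reached: not relayed further
        else:
            receiver_buf[msg_id] = dest_id


def meet(id_a, buf_a, id_b, buf_b, delivered):
    """Two-way exchange of the messages the two nodes do not have in common."""
    snapshot_a = dict(buf_a)  # what a held before receiving anything from b
    transfer(buf_b, id_a, buf_a, delivered)
    transfer(snapshot_a, id_b, buf_b, delivered)
```

The same `meet` step could be reused when a node entering a community hands its message list to the socially active host of that community.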
4 Simulation Setup and Results In this section, the simulation setup is explained and the results are thoroughly discussed.
4.1 Simulation Setup
Simulation studies have been conducted by employing the ONE simulator for comparing the efficacy of Community Interaction-based routing against Epidemic, PROPHET, First Contact, Spray and Wait, and Direct Delivery. It has been presumed that the buffer size and transmission duration of the nodes are restricted. The parameters and relevant values of simulation are as follows:

Parameter                      Value
Area                           6500 m × 6500 m
Data transfer rate             250 Kbps
Number of groups               10
Buffer space of each node      5 MB
Speed range                    1–7 m/s
Wait time range                0–120 s
Message size                   50–150 Kb
Message generation interval    25–35 s
Simulation time                43,000 s
Movement model                 CommInteractMovement
The following performance metrics are taken into consideration:
(1) Delivery Probability: The fraction of generated messages that are successfully received by the target within a given time period.
(2) Hop Count: The number of hops required by a packet to travel from source to destination.
(3) Dropped Messages: The number of packets dropped from the buffers of the nodes.
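As a rough illustration of how these three metrics can be computed from a simulation log, consider the Python sketch below. The input format (a count of created messages, a mapping from delivered message ID to its hop count, and a count of drops) is an assumption made for this example and does not correspond to a specific report of the simulator.

```python
def summarize(created, delivered_hops, dropped):
    """Compute delivery probability, average hop count and dropped messages.

    created        -- number of messages generated during the run
    delivered_hops -- mapping delivered message id -> hops taken
    dropped        -- number of packets dropped from node buffers
    """
    delivered = len(delivered_hops)
    delivery_probability = delivered / created if created else 0.0
    avg_hop_count = sum(delivered_hops.values()) / delivered if delivered else 0.0
    return {
        "delivery_probability": delivery_probability,
        "avg_hop_count": avg_hop_count,
        "dropped": dropped,
    }
```

For example, 3 delivered messages out of 10 created, with hop counts 1, 2 and 3, give a delivery probability of 0.3 and an average hop count of 2.0.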
4.2 Simulation Results
This subsection presents a graphical and analytical discussion of the results obtained by varying several simulation parameters in the Opportunistic Network Environment (ONE) simulator. Figures 1, 2, 3, 4, 5, and 6 show the performance of the proposed algorithm on several performance metrics and against some existing routing protocols. Figure 1 shows the performance of various routing protocols with respect to the delivery probability. The delivery probability naturally increases with time for all the protocols. The proposed algorithm’s curve initially lies above that of Prophet due to imprecise prediction in Prophet, but soon falls below it and tends to follow the Epidemic curve. The Epidemic curve gives the best results in its initial stages, but decreases toward the end due to packet loss caused by overloaded buffers. Direct
Fig. 1 Comparison against various existing routing protocols in terms of delivery probability
Fig. 2 Cumulative probability comparison against message delay
Fig. 3 Variation of delivery probability with number of host nodes
Fig. 4 Effect of speed variation on delivery probability
Fig. 5 Number of hop count for different routing algorithms
Delivery, Spray and Wait, and First Contact perform worse than the proposed algorithm, as the intermediate hop selection in the proposed algorithm considers multiple parameters before relaying the messages. As shown in Fig. 2, the cumulative probability of all the routing algorithms increases with the message delay until it approaches a maximum value. The rise is steeper toward the start and slows down at the end. The proposed algorithm’s curve lies above those of Direct Delivery, Spray and Wait, and First Contact, while it is below those of the Epidemic and Prophet routers. Figure 3 shows the variation of the delivery probability with the number of hosts per group. The probability increases with the number of hosts in each group until it reaches a maximum, after which it starts to fall due to overflowing buffers at higher values. Increasing the number of nodes increases the number of intermediate helping nodes for message delivery, and thus the delivery probability. Figure 4 shows the change in delivery probability with the variation in the speed of mobile nodes. As the speed of nodes is increased, more nodes become socially active, thereby increasing the number of replicas of the message. This provides a greater delivery probability. The hop count of various routing algorithms, for the same settings, is depicted in Fig. 5. The hop count for Direct Delivery, as expected, comes to 1, while the others have a value above 1.5. The proposed algorithm’s hop count comes out to be less than that of most of the existing routing algorithms because the intermediate nodes are selected only if they satisfy certain conditions, as described in Sect. 3. This reduces the hops required to route the message, which indicates its efficiency over the others. In Fig. 6, the average latency of various existing routing protocols and the proposed method are plotted and compared.
The figure clearly shows the latency of the proposed algorithm to be in the range of First Contact and Direct Delivery, and slightly greater than Epidemic and Prophet.
Fig. 6 Average latency of various routing protocols
5 Limitations and Conclusion This section describes the various limitations faced by our protocol and provides a glimpse of the possible future works.
5.1 Limitations
1. The proposed algorithm assumes that there is a fixed point in the network (the interaction point) where nodes meet for message relaying. This assumption prevents the algorithm from working for every kind of network design in opportunistic networks.
2. Some of the nodes are considered to be moving with a higher speed, i.e., more than a threshold value. Hence, in networks where the node mobility is low, this algorithm does not work efficiently.
3. Since the network is considered to be a group of communities in our proposed algorithm, it does not work as efficiently with other movement models.
5.2 Conclusion
This paper has presented a novel routing mechanism and compared it with other contemporary protocols for Opportunistic Networks. The architecture, characteristics and challenges of Opportunistic Networks have been discussed at length. Further, the architecture and different modules of the ONE simulator have also been explored. The proposed community interaction-based routing protocol attempts to minimize the number of message copies and the end-to-end delay in the delivery of messages by selecting appropriate intermediate nodes. The simulation results show that the proposed algorithm outperforms the Spray and Wait, First Contact, and Direct Delivery protocols with respect to the delivery ratio, and is very close to Epidemic. The results also show that the average hop count taken by the messages is less than 2. These simulated results indicate that the proposed protocol helps in minimizing the bandwidth usage of the network by restricting the excess messages that would otherwise have been dropped by nodes.
References
1. K. Miya, M. Watanabe, M. Hayashi, T. Kitade, O. Kato, K. Homma, CDMA/TDD cellular systems for the 3rd generation mobile communication, in 1997 IEEE 47th Vehicular Technology Conference. Technology in Motion, Phoenix, AZ, USA (Vol. 2, 1997), pp. 820–824
2. V. Chandrasekhar, W.K.G. Seah, Y.S. Choo, H.V. Ee, Localization in underwater sensor networks: survey and challenges, in Proceedings of the 1st ACM International Workshop on Underwater Networks (WUWNet’06) (ACM, New York, NY, USA, 2006), pp. 33–40
3. H. Yang, H. Luo, F. Ye, L. Songwu, L. Zhang, Security in mobile ad hoc networks: challenges and solutions. IEEE Wirel. Commun. 11(1), 38–47 (2004)
4. V. Singh, L. Raja, D. Panwar, P. Agarwal, Delay tolerant networks architecture, protocols, and its application in vehicular Ad-Hoc networks, in Hidden Link Prediction in Stochastic Social Networks (IGI Global, 2019), pp. 135–161
5. S. Trifunovic, S.T. Kouyoumdjieva, B. Distl, L. Pajevic, G. Karlsson, B. Plattner, A decade of research in opportunistic networks: challenges, relevance, and future directions. IEEE Commun. Mag. 55(1), 168–173 (2017)
6. M.K. Denko, Mobile Opportunistic Networks: Architectures, Protocols and Applications (Auerbach Publications, 2019)
7. M. Alajeely, R. Doss, A. Ahmad, Routing protocols in opportunistic networks: a survey. IETE Tech. Rev. 35(4), 369–387 (2018)
8. A. Vahdat, D. Becker, Epidemic routing for partially-connected Ad Hoc networks. Technical report number CS-200006, Duke University, pp. 1–14
9. T. Huang, C. Lee, L. Chen, PRoPHET+: an adaptive PRoPHET-based routing protocol for opportunistic network, in 2010 24th IEEE International Conference on Advanced Information Networking and Applications, Perth, WA (2010), pp. 112–119
10. T. Spyropoulos, K. Psounis, C.S. Raghavendra, Spray and wait: an efficient routing scheme for intermittently connected mobile networks, in SIGCOMM’05 Workshops, 22–26 August 2005, Philadelphia, PA, USA
11. T. Spyropoulos, K. Psounis, C.S. Raghavendra, Spray and focus: efficient mobility-assisted routing for heterogeneous and correlated mobility, in Fifth Annual IEEE International Conference on Pervasive Computing and Communications Workshops (PerComW’07), White Plains, NY (2007), pp. 79–85
12. A. Chhabra, V. Vashishth, D.K. Sharma, SEIR: a Stackelberg game based approach for energy-aware and incentivized routing in selfish opportunistic networks, in 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD (2017), pp. 1–6
13. A. Chhabra, V. Vashishth, D.K. Sharma, A game theory based secure model against Black hole attacks in opportunistic networks, in 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD (2017), pp. 1–6
14. D.K. Sharma, S.K. Dhurandher, D. Agarwal et al., kROp: k-means clustering based routing protocol for opportunistic networks. J. Ambient Intell. Human Comput. 10, 1289–1306 (2019)
15. D.K. Sharma, Aayush, A. Sharma, J. Kumar, KNNR: K-nearest neighbour classification based routing protocol for opportunistic networks, in 2017 Tenth International Conference on Contemporary Computing (IC3), Noida (2017), pp. 1–6
16. A. Chhabra, V. Vashishth, D.K. Sharma, A fuzzy logic and game theory based adaptive approach for securing opportunistic networks against black hole attacks. Int. J. Commun. Syst. 31, e3487 (2018)
17. D.K. Sharma, S.K. Dhurandher, I. Woungang, R.K. Srivastava, A. Mohananey, J.J.P.C. Rodrigues, A machine learning-based protocol for efficient routing in opportunistic networks. IEEE Syst. J. 12(3), 2207–2213 (2018)
18. S.K. Dhurandher, D.K. Sharma, I. Woungang, S. Bhati, HBPR: history based prediction for routing in infrastructure-less opportunistic networks, in 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), Barcelona (2013), pp. 931–936
19. A. Chhabra, V. Vashishth, D.K. Sharma, GMMR: a Gaussian mixture model based unsupervised machine learning approach for optimal routing in opportunistic IoT networks. Comput. Commun. 134 (2018). https://doi.org/10.1016/j.comcom.2018.12.001
20. A. Gupta, A. Bansal, D. Naryani, D.K. Sharma, CRPO: cognitive routing protocol for opportunistic networks, in Proceedings of the International Conference on High Performance Compilation, Computing and Communications (HP3C-2017) (ACM, New York, NY, USA, 2017), pp. 121–125
21. D.K. Sharma, S. Singh, V. Gautam, S. Kumaram, M. Sharma, S. Pant, An efficient routing protocol for social opportunistic networks using ant routing. IET Netw. (2019)
Performance Analysis of the ML Prediction Models for the Detection of Sybil Accounts in an OSN
Ankita Kumari and Manu Sood
Abstract Online Social Networks (OSNs) have become huge platforms for information sharing and social interaction for a wide variety of users across the globe. As they evolve rapidly, these OSNs are increasingly subjected to illegal activities, especially in the form of security attacks, which have already begun to harm these interactions. One prominent attack in such environments, the Sybil attack, jeopardizes various categories of social interaction as the number of users holding Sybil accounts on these platforms grows phenomenally. The existence of Sybil accounts threatens to defeat the very purpose of these OSNs, and the accounts of malicious users are very difficult to detect and almost impossible to control. In this paper, Machine Learning (ML) is used to uncover the presence of such Sybil accounts on an OSN, namely Twitter. After the acquisition and preprocessing of the available datasets, the Correlation with Heatmap and Logistic Regression-Recursive Feature Elimination (LR-RFE) feature selection techniques were applied to obtain an optimal set of features from these datasets. Prediction models were then trained on these datasets using the Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM) classifiers. Further, the effects of biasing genuine accounts with fake accounts on feature selection and classification are presented. It is concluded that the prediction models using the DT algorithm outperformed all other classifiers.
Keywords Feature selection · Support vector machine · Random forest · Logistic regression · Decision tree · Sybil account · Biasing
A. Kumari (B) · M. Sood Department of Computer Science, Himachal Pradesh University, Shimla, India e-mail: [email protected] M. Sood e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_60
1 Introduction
Online Social Networks (OSNs) such as Twitter, Facebook, and Instagram are now among the most widely used sources of information and channels of social interaction. Their growth has transformed how people conduct business and interact with each other [1]. No single standard definition covers these platforms at present. However, Boyd and Ellison [2] define social network sites as web-based services that allow individuals to (a) create a public or semi-public profile within a bounded system, (b) make a list of other users with whom they share a connection, and (c) view and traverse the lists of connections within the system. Because these platforms host a huge number of users, and because any user can easily hide her/his real identity or create virtual identities, including Sybil identities, any user can be trolled at no cost to the trolling users. This has led to an uncontrolled increase in the number of fake profiles on these OSNs, which has become a serious problem for the genuine users of these platforms [3]. Initially, these platforms were used to connect with family members and friends and to get back in touch with old, out-of-contact friends. Nowadays, OSNs are used for many different purposes and at a massive scale, and their users generate mammoth amounts of data. Special techniques are therefore needed to extract useful information, such as the identities of fake users, from this data. Machine Learning (ML) is one such mechanism: it provides techniques for making sense of large volumes of data in effective and simple ways [4].
ML is a component of Artificial Intelligence (AI) whose main focus is to make a machine learn from given data, much as humans learn from their experiences. A machine can learn in four ways: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, labeled data is given as input to the machine, and based on that labeled data the machine is trained to label the outcome as well. Classification and regression are the two significant techniques under supervised learning. In unsupervised learning, unlabeled data is given as input, and the model finds hidden patterns among the pieces of data; clustering is one of the popular techniques in this category. In semi-supervised learning, both labeled and unlabeled data can be given as input, as it combines supervised and unsupervised learning. In reinforcement learning, the model is trained on the basis of feedback from its environment; it is also known as feedback-oriented or reward-based learning [1]. The ML process involves various steps depending upon the prediction model. In general, as defined in [5], these steps are (a) data collection, (b) data preparation, (c) data analysis, (d) model training, and (e) model testing, after which the prediction model is ready to use.
In OSNs, a Sybil attack is said to have taken place when a normal-looking malicious user creates multiple fake user accounts and tries to control the behavior of a social platform. The Sybil attack is a security threat that tries not only to control the resources available on the network but also to influence natural social interactions. Many studies have developed defense mechanisms against Sybil attacks [6–9]. Twitter, currently one of the most significant OSNs, started as a microblogging service in which tweets were restricted to 140 characters. It has since become an information-sharing platform as well, and the tweet limit has been doubled to 280 characters. The use of this platform for social causes is increasing rapidly, and so is the number of Sybil accounts. Hence, it becomes imperative to detect such accounts. In this paper, we have developed prediction models using ML classifiers for the detection of Sybil accounts, a term used here as a synonym for fake accounts. We have used four classification techniques, namely Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM), to determine whether an account is fake or genuine.
1.1 Objectives
The objectives kept in focus while conducting this research work are (a) to analyze the effect of biasing on the datasets, i.e., of mixing fake accounts (FSF, INT, TWT) with genuine accounts (E13 and TFP); (b) to analyze the classification results on the biased datasets for the RF, DT, LR, and SVM classifiers; and (c) to compare the classifier results based on evaluation metrics.
1.2 Paper Structure
This paper consists of four sections. Section 2 explains the methodology followed to achieve the objectives of this study. Section 3 presents the results and analysis of the experimentation. Section 4 concludes the work and points toward future work. The novelty of this paper is as follows: (a) a set of real-time datasets has been used for the prediction models with different proportions of biasing; (b) two feature selection techniques from different categories, namely Correlation with Heatmap and LR-RFE, have been used to select the optimum set of features before predictive modeling; (c) four different classifiers have been explored in order to compare their performances on the given real-time datasets; and (d) the performances of two of the proposed models have been found to be almost ideal for almost all the evaluation parameters, with the DT model performing best.
2 The Methodology Followed
This section briefly describes the methodology followed in this research work.
2.1 Dataset and Biasing
The datasets used in this study are described briefly here, and their details are shown in Table 1. Cresci et al. [10] collected these datasets in their research work, and we are thankful to them for allowing us to perform our experiments on them. Table 1 lists data on Twitter user accounts. It comprises five datasets, of which three contain fake accounts (FSF, TWT, INT) and two contain genuine accounts (E13, TFP). Every dataset has the same number of features, i.e., 34. The authors of [10] collected the genuine-account datasets themselves for their own study and bought the fake-account datasets online. After collecting the datasets, the next step was the biasing of the datasets. Table 2 shows the details of the biased datasets.

Table 1 Datasets considered [4]
Type of accounts   S. no.  Dataset                   No. of features  No. of accounts
Fake accounts      1       FSF (Fast Followerz)      34               1169
Fake accounts      2       INT (Inter Twitter)       34               1337
Fake accounts      3       TWT (Twitter Technology)  34               845
Genuine accounts   1       E13 (Elezioni 2013)       34               1481
Genuine accounts   2       TFP (The Fake Project)    34               469

Table 2 Biased datasets
Cases  Datasets               No. of accounts
D1     Dataset-D1 (E13-FSF)   2650
D2     Dataset-D2 (E13-INT)   2818
D3     Dataset-D3 (E13-TWT)   2326
D4     Dataset-D4 (TFP-FSF)   1638
D5     Dataset-D5 (TFP-INT)   1806
D6     Dataset-D6 (TFP-TWT)   1314

Table 2 shows that a total of six datasets were obtained after the biasing of fake accounts with genuine accounts. Further, to study the effect of biasing, four cases were prepared for each dataset. In the first case, 100% biasing was done, i.e., all genuine and fake accounts were combined. In the second case, 75% of the genuine accounts were biased with 25% of the fake accounts. In the third case, 60% of the genuine accounts were biased with 40% of the fake accounts. In the last case, 50% of the genuine accounts were biased with an equal percentage of the fake accounts. These cases are named D11 (100), D12 (75–25), D13 (60–40), and D14 (50) for Dataset-D1, and likewise for Dataset-D2, D3, D4, D5, and D6. These 24 datasets in total were used for feature selection and classification to predict the occurrences of Sybil accounts.
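The four mixing cases can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that a case such as 75–25 means drawing 75% of the genuine accounts and 25% of the fake accounts at random and concatenating them, which is one possible reading of the description above. The helper name make_biased_case, the seed, the 0/1 labels, and the placeholder account records are hypothetical; only the E13 and FSF account counts come from Table 1.

```python
import random

# Case labels mapped to (genuine fraction, fake fraction). The fractions are an
# assumed reading of the 100, 75-25, 60-40 and 50 cases described in the text.
CASES = {
    "D11 (100)": (1.00, 1.00),
    "D12 (75-25)": (0.75, 0.25),
    "D13 (60-40)": (0.60, 0.40),
    "D14 (50)": (0.50, 0.50),
}

def make_biased_case(genuine, fake, genuine_frac, fake_frac, seed=0):
    """Sample a fraction of each account list and return shuffled labelled rows."""
    rng = random.Random(seed)
    g = rng.sample(genuine, round(len(genuine) * genuine_frac))
    f = rng.sample(fake, round(len(fake) * fake_frac))
    rows = [(acc, 0) for acc in g] + [(acc, 1) for acc in f]  # 1 = fake (assumed)
    rng.shuffle(rows)
    return rows

# Placeholder records standing in for the E13 and FSF accounts (sizes from Table 1).
e13 = [f"e13_{i}" for i in range(1481)]
fsf = [f"fsf_{i}" for i in range(1169)]

sizes = {name: len(make_biased_case(e13, fsf, gf, ff))
         for name, (gf, ff) in CASES.items()}
print(sizes)
```

Under this reading, the 100% case for Dataset-D1 contains all 2650 accounts listed in Table 2, while the other cases are smaller and more skewed toward genuine accounts.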
2.2 Experimental Setup
In this study, we have used the Python language for feature selection and for the implementation of the classification algorithms. Data preprocessing includes the scaling, cleaning, integration, and reduction of data, and yields normalized, scaled, and cleaned data. The datasets used in this study contained some features with no values and some features with missing values. The features with no values were eliminated first, and the missing values in the remaining features were then replaced with zero. A feature named dataset contained only the name of the particular dataset, so it was also dropped. At the end of preprocessing, a subset of 31 features was obtained from the original set of 34 features.
The next step was to obtain the most significant of these 31 features using feature selection techniques. Feature Selection (FS) is the process of removing insignificant, unwanted, and noisy features [11]. With FS, a subset of pertinent features is selected from all the features available in a dataset. This retains the features that contribute most toward the output variable and thus helps make the predictive model more competent [4]. FS techniques fall into three categories: filter methods, wrapper methods, and embedded methods. In our study, we have used Correlation with Heatmap, which is a filter method, and Recursive Feature Elimination with Logistic Regression (LR-RFE), which is a wrapper method, for the selection of optimal features.
Correlation with Heatmap is a feature selection technique in which the data is represented graphically: a 2D visualization of the data is given as a matrix in which different colors represent different values. Correlation is a statistical measure that expresses how closely two variables are related, i.e., the strength and direction of their linear relationship [12, 13].
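The preprocessing and the correlation-based filter step described above can be sketched in a few lines. This is an illustrative sketch assuming pandas and NumPy, which the text does not name. The column names in the toy frame, the helper names preprocess and drop_correlated, and the 0.9 correlation threshold are assumptions; the text gives no threshold.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Drop features with no values and the 'dataset' name column, zero-fill gaps."""
    df = df.dropna(axis=1, how="all")                   # features with no values
    df = df.drop(columns=["dataset"], errors="ignore")  # holds only the dataset name
    return df.fillna(0)                                 # remaining missing values -> 0

def drop_correlated(features, threshold=0.9):
    """Filter step: drop one feature of every pair with |Pearson r| > threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), corr

# Toy frame; the column names are illustrative, not the 34 features of [10].
toy = pd.DataFrame({
    "followers_count": [10, 200, 35, 4000, 12],
    "friends_count":   [20, 400, 70, 8000, 24],   # exactly 2 x followers_count
    "statuses_count":  [5, 3, 900, 40, 7],
    "geo_enabled":     [1, np.nan, 0, 1, np.nan],
    "utc_offset":      [np.nan] * 5,              # a feature with no values
    "dataset":         ["E13"] * 5,
})
clean = preprocess(toy)
selected, corr = drop_correlated(clean)
print(list(selected.columns))
```

The matrix corr returned here is what a heatmap would display, one colored cell per feature pair; plotting is left out to keep the sketch free of a plotting dependency.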
The Recursive Feature Elimination (RFE) method of feature selection iteratively reduces the number of features. To obtain the importance of each feature, the predictor is first trained on the original set of features; the features with the least significance are then eliminated, and this process continues until a suitable set of features is obtained. RFE mainly helps in ranking feature significance and in feature selection. The study in [14] shows that reducing features with RFE helps improve prediction accuracy. These feature selection techniques were therefore used in this study to obtain the best subset of the original 31 features. At the end of the feature selection process, an optimal subset of 22 features was obtained, which was then used for model building.
The predictive models were built using ML classifiers. Four classifiers were used, namely Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM), for the training and testing of the classification models. The conventional 70:30 ratio has been used for the training and testing of the classifier models in our study. RF is an ensemble learning method in which a multitude of decision trees is constructed at training time [15]. Decision trees classify instances by sorting them based on feature values [16]. LR is a regression method for predicting a binary dependent variable [17]. SVM is a supervised learning algorithm that is useful for recognizing precise patterns in complex datasets [18]. The experimentation in this study was conducted to detect fake (Sybil) accounts in the Twitter datasets. To evaluate the results, we have used the confusion matrix and the following evaluation metrics: Accuracy, Precision, Recall, F1 score, Matthews Correlation Coefficient (MCC), and Specificity.
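The LR-RFE step and the 70:30 training and testing of the four classifiers can be sketched as below. The sketch assumes scikit-learn, which the text does not name, and uses synthetic data in place of the Twitter datasets. The hyperparameters are library defaults or arbitrary choices, and the stratified split and fixed seeds are assumptions made for reproducibility; none of these are reported in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one biased dataset: 31 features after preprocessing.
X, y = make_classification(n_samples=600, n_features=31, n_informative=10,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# Wrapper step: RFE refits LR and discards the weakest feature in each round
# until 22 features remain (the subset size reported in the text).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=22, step=1)
X_sel = rfe.fit_transform(X, y)

# 70:30 train/test split, as stated in the text.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, y, test_size=0.30, stratify=y, random_state=0)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
# Test-set accuracy of each fitted model.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(X_sel.shape, scores)
```

Fitting RFE on all rows before the split mirrors the order of steps in the text; fitting it on the training portion only would keep the test rows out of feature selection.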
3 Results and Analysis
The results of the experiments conducted with the four classifiers RF, DT, LR, and SVM are shown in Tables 3, 4, 5, and 6, respectively. These results are also depicted graphically for each classifier in Figs. 1, 2, 3, and 4 for the sake of comparison. Table 3 displays the experimental results of the RF classifier for all 24 cases obtained after the biasing of datasets, and Fig. 1 gives the graphical representation of these results. As can be seen, this classifier performs quite well on Dataset-D1, D2, and D4 with respect to all six evaluation metrics, but for the other datasets the values of these metrics are somewhat lower. Table 4 displays the experimental results of the DT classifier for all 24 cases, and the corresponding graphical representation is shown in Fig. 2. It can be concluded from this figure that this classifier not only produces quite good values of the evaluation metrics for Dataset-D1, D2, and D4, but also yields better values for the other three datasets than the RF classifier.

Table 3 Results of Random Forest classifier-based prediction model (metric values for random forest classifier)
Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC     Specificity
Dataset-D1  D11 (100)    1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D1  D12 (75–25)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D1  D13 (60–40)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D1  D14 (50)     1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D2  D21 (100)    0.996     0.996      1.000   0.993     0.992   0.992
Dataset-D2  D22 (75–25)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D2  D23 (60–40)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D2  D24 (50)     1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D3  D31 (100)    0.974     0.980      0.993   0.968     0.942   0.937
Dataset-D3  D32 (75–25)  0.984     0.991      1.000   0.982     0.930   0.885
Dataset-D3  D33 (60–40)  0.963     0.976      0.989   0.963     0.895   0.877
Dataset-D3  D34 (50)     0.974     0.980      0.986   0.974     0.942   0.942
Dataset-D4  D41 (100)    1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D4  D42 (75–25)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D4  D43 (60–40)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D4  D44 (50)     1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D5  D51 (100)    0.983     0.970      0.980   0.961     0.959   0.984
Dataset-D5  D52 (75–25)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D5  D53 (60–40)  0.981     0.972      1.000   0.947     0.959   0.972
Dataset-D5  D54 (50)     0.994     0.989      1.000   0.979     0.985   0.992
Dataset-D6  D61 (100)    0.954     0.945      0.936   0.954     0.906   0.967
Dataset-D6  D62 (75–25)  0.973     0.978      0.971   0.985     0.944   0.976
Dataset-D6  D63 (60–40)  0.927     0.924      0.982   0.873     0.860   0.882
Dataset-D6  D64 (50)     0.984     0.976      0.954   1.000     0.966   1.000

Table 4 Results of DT classifier-based prediction model (metric values for decision tree classifier)
Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC     Specificity
Dataset-D1  D11 (100)    0.996     0.996      0.993   1.000     0.993   1.000
Dataset-D1  D12 (75–25)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D1  D13 (60–40)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D1  D14 (50)     1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D2  D21 (100)    0.988     0.989      0.987   0.99      0.977   0.989
Dataset-D2  D22 (75–25)  0.993     0.996      1.000   0.992     0.98    0.968
Dataset-D2  D23 (60–40)  0.990     0.992      0.990   0.995     0.979   0.991
Dataset-D2  D24 (50)     1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D3  D31 (100)    0.962     0.971      0.967   0.976     0.917   0.953
Dataset-D3  D32 (75–25)  0.965     0.980      0.992   0.969     0.837   0.783
Dataset-D3  D33 (60–40)  0.933     0.956      0.966   0.947     0.810   0.825
Dataset-D3  D34 (50)     0.968     0.976      0.982   0.970     0.930   0.943
Dataset-D4  D41 (100)    1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D4  D42 (75–25)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D4  D43 (60–40)  1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D4  D44 (50)     1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D5  D51 (100)    0.972     0.951      0.922   0.981     0.932   0.992
Dataset-D5  D52 (75–25)  0.993     0.993      0.987   1.000     0.986   1.000
Dataset-D5  D53 (60–40)  0.983     0.974      1.000   0.950     0.963   0.975
Dataset-D5  D54 (50)     1.000     1.000      1.000   1.000     1.000   1.000
Dataset-D6  D61 (100)    0.944     0.934      0.934   0.934     0.887   0.952
Dataset-D6  D62 (75–25)  0.967     0.973      0.960   0.986     0.933   0.979
Dataset-D6  D63 (60–40)  0.897     0.892      0.920   0.865     0.796   0.878
Dataset-D6  D64 (50)     0.993     0.990      0.980   1.000     0.985   1.000

Table 5 displays the experimental results of the LR classifier for all 24 cases obtained after the biasing of datasets, and Fig. 3 shows the graphical representation of these results. An examination of this table and figure shows that the values of almost all the evaluation metrics for this classifier are quite low. Table 6 displays the experimental results of the SVM classifier for all 24 cases, and Fig. 4 gives the graphical representation of these results. From Table 6 and Fig. 4, it can be deduced that the performance of the SVM classifier on all the evaluation metrics is far from satisfactory. Based on the results and analyses of the values of all six evaluation metrics for the four classifiers used in our experimentation, it is concluded that the Decision Tree classifier gave the best results on all the evaluation metrics. This entails that this prediction model, when used to predict occurrences of Sybil accounts in datasets pertaining to the Twitter OSN, will produce the best results with the best possible values of accuracy, recall, specificity, precision, F1 score, and MCC. We have arrived at this conclusion because the values achieved for all these metrics in our experiments are near perfect.
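The six metrics follow directly from the four confusion-matrix counts. The sketch below computes them with the standard definitions and then evaluates a hypothetical classifier that labels every test account as positive. Such a classifier has recall 1.000, specificity 0.000, and MCC 0.000, with accuracy equal to the share of positives in the test set; the rows of Table 6 that show this combination are consistent with that behaviour. The function name binary_metrics and the counts are illustrative, and returning 0 when a denominator is zero is a convention assumed here.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Return the six evaluation metrics from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0   # assumed convention when undefined

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "specificity": ratio(tn, tn + fp),
    }

# Hypothetical test set: 60 positives, 40 negatives, everything predicted positive.
always_positive = binary_metrics(tp=60, fp=40, tn=0, fn=0)
print(always_positive)

# A mostly correct classifier on the same hypothetical test set.
good = binary_metrics(tp=58, fp=3, tn=37, fn=2)
print(round(good["accuracy"], 3), round(good["mcc"], 3))
```

Because accuracy alone can look acceptable for a one-class predictor on an imbalanced mix, MCC and specificity are the more informative columns when comparing the cases.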

Table 5 Results of LR classifier-based prediction model (metric values for logistic regression classifier)
Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC     Specificity
Dataset-D1  D11 (100)    0.817     0.805      0.674   1.000     0.689   1.000
Dataset-D1  D12 (75–25)  0.902     0.937      0.882   1.000     0.753   1.000
Dataset-D1  D13 (60–40)  0.861     0.886      0.795   1.000     0.746   1.000
Dataset-D1  D14 (50)     0.869     0.874      0.777   1.000     0.769   1.000
Dataset-D2  D21 (100)    0.765     0.705      0.546   0.996     0.604   0.997
Dataset-D2  D22 (75–25)  0.974     0.983      0.967   1.000     0.931   1.000
Dataset-D2  D23 (60–40)  0.829     0.849      0.747   0.982     0.694   0.976
Dataset-D2  D24 (50)     0.851     0.842      0.733   0.989     0.736   0.990
Dataset-D3  D31 (100)    0.687     0.792      0.910   0.700     0.235   0.266
Dataset-D3  D32 (75–25)  0.745     0.852      0.853   0.851     −0.064  0.081
Dataset-D3  D33 (60–40)  0.718     0.824      0.890   0.767     0.141   0.221
Dataset-D3  D34 (50)     0.638     0.758      0.858   0.679     0.084   0.207
Dataset-D4  D41 (100)    0.872     0.739      0.586   1.000     0.703   1.000
Dataset-D4  D42 (75–25)  0.868     0.861      0.756   1.000     0.767   1.000
Dataset-D4  D43 (60–40)  0.895     0.835      0.717   1.000     0.784   1.000
Dataset-D4  D44 (50)     0.903     0.779      0.638   1.000     0.751   1.000
Dataset-D5  D51 (100)    0.833     0.552      0.386   0.968     0.547   0.995
Dataset-D5  D52 (75–25)  0.837     0.812      0.683   1.000     0.715   1.000
Dataset-D5  D53 (60–40)  0.888     0.788      0.682   0.933     0.731   0.978
Dataset-D5  D54 (50)     0.916     0.789      0.661   0.979     0.762   0.995
Dataset-D6  D61 (100)    0.638     0.416      0.325   0.577     0.198   0.843
Dataset-D6  D62 (75–25)  0.779     0.803      0.688   0.965     0.610   0.953
Dataset-D6  D63 (60–40)  0.902     0.900      0.909   0.891     0.804   0.896
Dataset-D6  D64 (50)     0.728     0.415      0.283   0.777     0.347   0.958

Table 6 Results of SVM classifier-based prediction model (metric values for support vector machine classifier)
Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC     Specificity
Dataset-D1  D11 (100)    0.562     0.719      1.000   0.562     0.000   0.000
Dataset-D1  D12 (75–25)  0.825     0.904      1.000   0.825     0.000   0.000
Dataset-D1  D13 (60–40)  0.676     0.806      1.000   0.676     0.000   0.000
Dataset-D1  D14 (50)     0.584     0.737      1.000   0.584     0.000   0.000
Dataset-D2  D21 (100)    0.513     0.678      1.000   0.513     0.000   0.000
Dataset-D2  D22 (75–25)  0.782     0.877      1.000   0.782     0.000   0.000
Dataset-D2  D23 (60–40)  0.640     0.78       1.000   0.64      0.000   0.000
Dataset-D2  D24 (50)     0.539     0.701      1.000   0.539     0.000   0.000
Dataset-D3  D31 (100)    0.653     0.79       1.000   0.653     0.000   0.000
Dataset-D3  D32 (75–25)  0.86      0.924      1.000   0.86      0.000   0.000
Dataset-D3  D33 (60–40)  0.743     0.852      1.000   0.743     0.000   0.000
Dataset-D3  D34 (50)     0.661     0.796      1.000   0.661     0.000   0.000
Dataset-D4  D41 (100)    0.691     0.000      0.000   0.000     0.000   1.000
Dataset-D4  D42 (75–25)  0.868     0.861      0.756   1.000     0.767   1.000
Dataset-D4  D43 (60–40)  0.629     0.000      0.000   0.000     0.000   1.000
Dataset-D4  D44 (50)     0.733     0.000      0.000   0.000     0.000   1.000
Dataset-D5  D51 (100)    0.734     0.000      0.000   0.000     0.000   1.000
Dataset-D5  D52 (75–25)  0.515     0.68       1.000   0.515     0.000   0.000
Dataset-D5  D53 (60–40)  0.696     0.000      0.000   0.000     0.000   1.000
Dataset-D5  D54 (50)     0.761     0.000      0.000   0.000     0.000   1.000
Dataset-D6  D61 (100)    0.603     0.000      0.000   0.000     0.000   1.000
Dataset-D6  D62 (75–25)  0.655     0.792      1.000   0.655     0.000   0.000
Dataset-D6  D63 (60–40)  0.517     0.000      0.000   0.000     0.000   1.000
Dataset-D6  D64 (50)     0.658     0.000      0.000   0.000     0.000   1.000

4 Conclusion
In this study, data preprocessing and a combination of feature selection techniques have been applied to datasets obtained from the authors of another study. To obtain a subset of optimal features, we used two FS techniques belonging to two different categories, Correlation with Heatmap and LR-RFE, and obtained a subset of 22 effective features from the original set of 31 features. We carried out experimentation on the set of 24 biased datasets containing data for these selected features. The prediction models were then built using four classifiers, namely RF, DT, LR, and SVM. The analysis of the results for all six evaluation metrics shows that, with the selected set of features on the 24 datasets, the performance of the Decision Tree (DT) classifier was better than that of the other three classifiers used in this study. The values of all six metrics for this classifier were found to be near perfect, which means that this model can be used to identify the presence of Sybil accounts in the datasets of an OSN, specifically Twitter, with great accuracy. In future, we plan to enhance our prediction models by using ensemble and optimization techniques to achieve better results on the same or different datasets.

Fig. 1 Comparative analysis of RF classifier metrics on 24 biased datasets (y-axis: value of metrics; x-axis: cases D11 (100) to D64 (50), grouped by Dataset-D1 to Dataset-D6; series: Accuracy, Precision, Recall, F1 Score, MCC, Specificity)
Fig. 2 Comparative analysis of DT classifier metrics on 24 biased datasets (same axes and series as Fig. 1)
Fig. 3 Comparative analysis of LR classifier metrics on 24 biased datasets (same axes and series as Fig. 1)
Fig. 4 Comparative analysis of SVM classifier metrics on 24 biased datasets (same axes and series as Fig. 1)

Acknowledgments We convey our gratitude to Cresci et al. [10] for permitting us to perform our experiments on the datasets we acquired from them.
References
1. M. Al-Qurishi, M. Al-Rakhami, A. Alamri, M. Alrubaian, S.M.M. Rahman, M.S. Hossain, Sybil defense techniques in online social networks: a survey. IEEE Access 5, 1200–1219 (2017)
2. D. Boyd, N. Ellison, Social network sites: definition, history, and scholarship. J. Comput.-Mediat. Commun. 13, 210–230 (2007)
3. P. Galán-García, J.G.D.L. Puerta, C.L. Gómez, I. Santos, P.G. Bringas, Supervised machine learning for the detection of troll profiles in twitter social network: application to a real case of cyberbullying. Logic J. IGPL 24(1), 42–53 (2016)
4. D. Sonkhla, M. Sood, Performance analysis and feature selection on Sybil user data using recursive feature elimination. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 8, 48–56 (2019)
5. H.M. Anwer, M. Farouk, A. Abdel-Hamid, A framework for efficient network anomaly intrusion detection with feature selection, in Proceedings of 9th International Conference on Information and Communication Systems, Irbid (2018), pp. 157–162
6. A. Vasudeva, M. Sood, Sybil attack on lowest id clustering algorithm in the mobile ad hoc network. Int. J. Netw. Secur. Appl. 4(5), 135–147 (2012)
7. M. Sood, A. Vasudeva, Perspectives of Sybil attack in routing protocols of mobile ad hoc network, in Computer Networks and Communications (NetCom), ed. by N. Chaki et al. Lecture Notes in Electrical Engineering, vol. 131 (Springer, New York, NY, 2013), pp. 3–13
8. A. Vasudeva, M. Sood, A Vampire Act of Sybil attack on the highest node degree clustering in mobile Ad Hoc networks. Indian J. Sci. Technol. 9(32), 1–9 (2016)
9. A. Vasudeva, M. Sood, Survey on Sybil attack defense mechanisms in wireless ad hoc networks. J. Netw. Comput. Appl. 120, 78–118 (2018)
10. S. Cresci, R.D. Pietro, M. Petrocchi, A. Spognardi, M. Tesconi, Fame for sale: efficient detection of fake Twitter followers. Decis. Support Syst. 80, 56–71 (2015)
11. H. Nkiama, S.Z.M. Said, M. Saidu, A subset feature elimination mechanism for intrusion detection system. Int. J. Adv. Comput. Sci. Appl. 7(4), 148–157 (2016)
12. Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
13. S. Zhao, Y. Guo, Q. Sheng, Y. Shyr, Advanced heat map and clustering analysis using heatmap3. BioMed. Res. Int. (2014)
14. T.E. Mathew, A logistic regression with recursive feature elimination model for breast cancer diagnosis. Int. J. Emerg. Technol. 10(3), 55–63 (2019)
15. A. Liaw, M. Wiener, Classification and regression by random forest. R News 2(3), 18–22 (2002)
16. S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)
17. I. Kurt, M. Ture, A.T. Kurum, Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Syst. Appl. 34, 366–374 (2008)
18. P. Pavlidis, I. Wapinski, W.S. Noble, Support vector machine classification on the web. Bioinform. Appl. Note 20(4), 586–587 (2004)

Exploring Feature Selection Technique in Detecting Sybil Accounts in a Social Network
Shradha Sharma and Manu Sood
Abstract Machine learning (ML) provides techniques for extracting meaningful insights from the information embedded in datasets by making a machine learn from those datasets. Different ML techniques are available for different purposes. The general sequence of steps for a typical supervised ML technique includes preprocessing, feature selection, building the prediction model, and testing and validating the model. Various ML techniques are being used to detect fake as well as spambot accounts on a number of Online Social Networks (OSNs). These fake/spambot accounts, especially the Sybil accounts, appear in these networks with malicious intentions to disrupt or hijack the very purpose of these networks. In this paper, we have trained various prediction models using appropriate real-time datasets to detect the presence of Sybil accounts on online social media. Since the data is collected from various sources, the datasets require preprocessing. The preprocessing has mainly been carried out for (a) removing the noise from this data and/or (b) normalizing the values of various features. Next, three different feature selection techniques have been used to select an optimal set of features from the superset of features, so as to remove the features that are redundant or irrelevant for making accurate predictions. The three feature selection techniques used are Correlation Matrix with Heatmap, Feature Importance and Recursive Feature Elimination with Cross-Validation. Further, K-Nearest Neighbor (KNN), Random Forest (RF) and Support Vector Machine (SVM) classifiers have been deployed to train the proposed prediction models for predicting the presence of Sybil accounts in the OSN dataset. The performances of the proposed prediction models have been analyzed using six standard metrics. We conclude that the prediction model based on the Random Forest classifier provides the best results in predicting the presence of Sybil accounts in the dataset of an OSN.

S. Sharma (B) · M. Sood
Department of Computer Science, Himachal Pradesh University, Shimla, India

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_61
Keywords Data preprocessing · Feature selection · Classification · K nearest neighbor classifier · Random forest classifier · Support vector machine classifier · Datasets · Sybil account
1 Introduction
Online transactions and interactions on the Internet generate huge amounts of digital traces in the cyber-physical space. A significant share of this data can be attributed to social media and social networks. Online social networks, because of their user-friendliness, simple interfaces and multilevel stay-in-touch approaches, not only attract a large number of users around the clock but also draw the attention of spammers and attackers. These spammers and attackers exploit the inbuilt mechanisms of social networks to influence the interactions of genuine users, sometimes adversely. Social network sites record these interactions, so large volumes of data are generated and stored on various servers. It is difficult to separate the data related to the attackers from this huge volume of data with the existing manipulation practices. Of late, Machine Learning (ML) has come to the forefront for the detection of such malicious activities intentionally launched by users with vested interests. Used appropriately, it can classify data as belonging to genuine users or to fake/malicious users. ML, as a subset of Artificial Intelligence, focuses on training machines through ML algorithms on huge datasets so as to detect or predict the occurrences of data related to fake users or attackers.

There are four types of ML techniques: supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning. Supervised learning trains the machine using labeled data. It mainly uses classification and regression techniques: classification predicts discrete responses whereas regression predicts continuous responses. Unsupervised learning works on data that is not labeled. It uses techniques like clustering and association. Clustering is mainly used for grouping data and finding hidden patterns. Association rules mainly help in finding the associations among the data objects in a huge dataset. In semi-supervised learning, the amount of input data is very large but only a small part of it is labeled. It mainly deploys graph-based methods along with classification, regression and prediction techniques. Reinforcement learning is based on the reward-and-punishment method. It consists of three main components, namely agent, environment and actions, and is mainly used in gaming, navigation and robotics. Given the problem at hand, i.e., the detection/prediction of fake/malicious users, supervised learning is the most suitable ML category.

In supervised learning, the process of training a model consists of a number of steps, the very first being the collection of the data necessary to train the model. Data collected from various sources is large in size and may contain noise. So, the next step is the preprocessing of this data, which involves data cleaning, data transformation, error correction and normalization [1]. Data cleaning deals with the removal of missing
and noisy data. Missing data refers to records in which some values are missing, and noisy data is irrelevant data. The next step is data transformation, which converts the data into the required form, followed by feature selection. Datasets may consist of hundreds of features, out of which only a few may contribute to effective prediction. Hence, irrelevant features are dropped using various feature selection techniques, which in turn improves the accuracy and reduces the training time. There are three methods for feature selection: filter, wrapper and embedded methods. In the filter method, features are selected on the basis of their scores in a statistical test of their correlation with the target variable [2]. In the wrapper method, the feature subset is selected using the induction algorithm, which is also part of the evaluation function [3]. The embedded method combines the techniques used in the filter and wrapper methods.

The next step in the creation of the prediction model is the selection of the ML technique. A number of techniques are available for training a prediction model, and to date no single technique can be used in all scenarios. The technique(s) to be used depends upon the problem at hand, the datasets and various other factors. Since our datasets contain labeled data and our problem is one of classification, we have used the K-Nearest Neighbor, Random Forest and Support Vector Machine classification techniques. After the selection of the prediction model, the next important step is to test it, for which the dataset is divided into two parts: training data and testing data. Training data is used to train the model and testing data is used to test the designed and trained model.

Online social networks (OSNs) are platforms through which people all around the world can connect with each other and exchange their ideas and information. They are similar to virtual communities connected through hardware, networking and special software for socialization. One of the most popular OSNs used by people across the globe is Twitter. The adversaries in such OSNs could be either fake users having Sybil accounts associated with them or spambots. The OSNs often provide either some rules or some software program(s) to detect/remove the spambots, but at times both these mechanisms fail to detect the fake users or the spambots, and the spambots succeed in their malicious designs [4–6]. Twitter, before being transformed into a huge social site, was originally started as a personal micro-blogging site [7]. Fake followers and social spambots come into play in order to increase the number of followers of a target account on Twitter or any other OSN. A social spambot is a profile on a social media platform that is programmed to automatically generate messages and follow accounts.

In this paper, after using three feature selection techniques to obtain an optimal subset of features, we have explored three different ML techniques to find the fake followers and social spambots on an OSN such as Twitter. In order to determine whether a given account in the dataset is a real or a Sybil account, three classification techniques were used: K-Nearest Neighbor (KNN), Random Forest (RF) and Support Vector Machine (SVM). The datasets used for this research work contained both numerical and categorical values. Each and every instance in the
dataset has a unique set of values. Since the Python-based implementation could not deal with categorical values directly, all the categorical values were converted into corresponding numerical values instead of being assigned binary values, which in turn provided better results.

This research work is intended to fulfill the following objectives: (1) to retrieve, using a combination of feature selection techniques, the optimal set of features from the complete feature set for all the datasets used, (2) to compare the prediction performance of the three ML classifiers, i.e., RF, KNN and SVM, on the various available datasets, and (3) to study and analyze the effect of biasing on human accounts with fake followers and human accounts with social spambots. The novelty of this research work is that (a) various datasets containing real-time data, used with the permission of the owners, have been prepared for the training and testing of ML prediction models using different percentages of biasing, (b) a combination of three feature selection techniques has been used in cascade in order to obtain the optimal set of features in the datasets, and (c) three supervised learning classifiers have been used and their performances compared on all the datasets under investigation so as to predict the occurrence of Sybil accounts and spambot accounts in these real-time datasets with high accuracy.

The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the methodology followed. Section 4 presents the results and their analysis. Section 5 presents the conclusion and future scope.
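The conversion of categorical values into numerical values mentioned in this section can be sketched as follows. This is only a minimal sketch under stated assumptions: the function name `encode_categorical` and the sample language codes are hypothetical and not taken from the study, and each distinct category is simply mapped to an integer code in order of first appearance.

```python
def encode_categorical(values):
    """Map each distinct categorical value to an integer code.

    Codes are assigned in order of first appearance, so the mapping is
    deterministic for a given input order.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical values of a categorical feature such as the account language.
lang_column = ["en", "it", "en", "es", "it"]
encoded, mapping = encode_categorical(lang_column)
print(encoded)   # [0, 1, 0, 2, 1]
print(mapping)   # {'en': 0, 'it': 1, 'es': 2}
```

Integer codes of this kind impose an arbitrary order on the categories, which distance-based classifiers such as KNN may be sensitive to; whether that matters depends on the feature.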
2 Related Work
Machine learning is being widely used across the world to find solutions that have so far been elusive in different problem domains. One such domain is the presence of Sybil and/or fake accounts on OSNs, where ML techniques are being used to predict the occurrence of fake Twitter accounts and spambots. Some of the related work is mentioned here to give an idea of the current state of affairs in this field. For the detection of fake Twitter accounts, the authors in [7] have used the feature selection process to create a subset of the features in such a way that no useful information was left out but the unnecessary and repetitive features were removed. They have shown that this enhanced the accuracy and reduced the computational time. Feature selection techniques, when used appropriately, decrease the storage needs, prevent overfitting and enable data visualization [8]. Whether an account on an OSN is a human account or a fake account can be predicted on the basis of the information obtained from the profile of the account [9]. A subtle criterion has been outlined in [10] on the basis of which a human account can be detected in a dataset that contains a combination of human and spambot accounts. The authors in [11] have successfully used ML algorithms to detect spam on Twitter. In their work, they have proposed a hybrid approach to detecting the streaming of Twitter spam using a combination of Decision Tree,
Particle Swarm Optimization and Genetic algorithm. Some important features based on message content and user behavior can be used along with the SVM classification algorithm for the successful detection of spam [12]. McCord and Chuah [13] have carried out their evaluation on the basis of suggested user-based and content-based features with the help of four classifiers for the detection of spammers and legitimate users.
3 Methodology Followed
This section briefly describes the process followed to achieve the goal. The various ML algorithms have been implemented in Python, and the code has been executed in Jupyter Notebook, an open-source, web-based application [14].
3.1 Datasets
In order to achieve the set objectives, a set of datasets containing human, fake Twitter and spambot accounts has been considered. We have studied nine datasets, of which eight have been used in this study: two datasets (E13 and Genuine accounts) contain human accounts, three contain fake account data and the other three contain spambot data. Each dataset has the same number of features, all of which have the same labels. Table 1 presents the name and nature of each dataset and the number of accounts in it. Table 2 presents the names of all the 32 features of the datasets. Cresci et al. created these datasets for their research work [7]. They verified each and every genuine account themselves, which makes these datasets special.

Table 1 Datasets considered
| S. no. | Dataset | Nature of dataset accounts | No. of accounts |
|---|---|---|---|
| 1 | E13 (Elezioni 2013) | Human | 1481 |
| 2 | TFP (The Fake Project) | Human | 469 |
| 3 | Genuine accounts | Human | 3474 |
| 4 | TWT (Twitter Technology) | Fake accounts | 845 |
| 5 | INT (Inter Twitter) | Fake accounts | 1337 |
| 6 | FSF (Fast Followers) | Fake accounts | 1169 |
| 7 | Spambot 1 | Spambots | 991 |
| 8 | Spambot 2 | Spambots | 3457 |
| 9 | Spambot 3 | Spambots | 464 |

Table 2 Features in the features set
| Id | Friends count | Language | Geo enabled |
| Name | Favorite count | Time zone | Profile image URL |
| Screen name | Listed count | Location | Profile banner URL |
| Status count | Created at | Default profile | Profile text color |
| Followers count | URL | Default profile image | Profile image URL https |
| UTC offset | Protected | Verified | Updated |
| Profile sidebar fill color | Profile background image URL | Profile background color | Profile link color |
| Profile use background image | Profile background image URL https | Profile sidebar border color | Profile background |

The first dataset of human accounts, i.e., E13 (Elezioni 2013), was created during the elections conducted in Italy in 2013. The second dataset of human accounts, i.e., The Fake Project (TFP), was created by Cresci et al. themselves: they started a project named “The Fake Project” (a Twitter account) in order to collect data about these human accounts. We have not used this dataset in our experiments. The next three datasets, which contain the details of fake accounts, were bought by them online. The Spambot 1 dataset was created by observing a group of social bots on Twitter during the 2014 mayoral elections in Rome. Spambot 2 promoted the #TALNTS hashtag for several months, where Talnts was a mobile phone application used for hiring workers in the fields of writing, digital photography and music. The Spambot 3 dataset contains accounts that advertised products on sale on Amazon.com.
3.2 Dataset Biasing
Out of the nine datasets described in Table 1, we have used eight to generate 18 biased datasets, as shown in Table 3. The biasing was carried out as follows. The E13 dataset was biased with the FSF, INT and TWT datasets in the ratios of 50:50, 25:75 and 75:25 to obtain nine datasets named M1–M9. Nine more biased datasets, M10–M18, were obtained by biasing the Genuine accounts dataset with the three spambot datasets, again in the ratios of 50:50, 25:75 and 75:25. Here, a ratio such as 25:75 denotes that 25% of the accounts of the human dataset were combined with 75% of the accounts of the fake or spambot dataset; for example, M2 combines 371 of the 1481 E13 accounts with 877 of the 1169 FSF accounts. Table 3 also lists the resulting total number of accounts in each of these 18 datasets.
Table 3 Biased datasets
| S. no. | Case | Human accounts | Fake accounts | Spambot | Total accounts |
|---|---|---|---|---|---|
| M1 | (E13–FSF) (50%–50%) | 741 | 585 | – | 1326 |
| M2 | (E13–FSF) (25%–75%) | 371 | 877 | – | 1248 |
| M3 | (E13–FSF) (75%–25%) | 1111 | 293 | – | 1404 |
| M4 | (E13–INT) (50%–50%) | 741 | 669 | – | 1410 |
| M5 | (E13–INT) (25%–75%) | 371 | 1003 | – | 1374 |
| M6 | (E13–INT) (75%–25%) | 1111 | 335 | – | 1446 |
| M7 | (E13–TWT) (50%–50%) | 741 | 423 | – | 1164 |
| M8 | (E13–TWT) (25%–75%) | 371 | 634 | – | 1005 |
| M9 | (E13–TWT) (75%–25%) | 1111 | 212 | – | 1323 |
| M10 | (Genuine–spambot 1) (50%–50%) | 1738 | – | 496 | 2234 |
| M11 | (Genuine–spambot 1) (25%–75%) | 869 | – | 744 | 1613 |
| M12 | (Genuine–spambot 1) (75%–25%) | 2609 | – | 248 | 2857 |
| M13 | (Genuine–spambot 2) (50%–50%) | 1738 | – | 1729 | 3467 |
| M14 | (Genuine–spambot 2) (25%–75%) | 869 | – | 2594 | 3463 |
| M15 | (Genuine–spambot 2) (75%–25%) | 2609 | – | 865 | 3474 |
| M16 | (Genuine–spambot 3) (50%–50%) | 1738 | – | 233 | 1971 |
| M17 | (Genuine–spambot 3) (25%–75%) | 869 | – | 349 | 1218 |
| M18 | (Genuine–spambot 3) (75%–25%) | 2609 | – | 117 | 2726 |
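The biasing procedure behind the datasets in Table 3 can be sketched as follows. This is a minimal sketch and not the authors' code: the function name `bias_datasets`, the use of random sampling and the rounding up of fractional counts are assumptions made for illustration, since the paper does not state how the subsets were drawn or rounded. Only the dataset sizes 1481 (E13) and 1169 (FSF) are taken from Table 1.

```python
import math
import random


def bias_datasets(human, other, human_pct, other_pct, seed=0):
    """Combine human_pct% of `human` with other_pct% of `other`.

    Each account is returned as an (account, label) pair, with label 0
    for human accounts and 1 for fake/spambot accounts.
    """
    rng = random.Random(seed)
    # Assumption: fractional counts are rounded up.
    n_human = math.ceil(len(human) * human_pct / 100)
    n_other = math.ceil(len(other) * other_pct / 100)
    mixed = [(acc, 0) for acc in rng.sample(human, n_human)]
    mixed += [(acc, 1) for acc in rng.sample(other, n_other)]
    rng.shuffle(mixed)
    return mixed


# Placeholder account identifiers with the sizes of E13 and FSF from Table 1.
e13 = [f"human_{i}" for i in range(1481)]
fsf = [f"fake_{i}" for i in range(1169)]

m2 = bias_datasets(e13, fsf, human_pct=25, other_pct=75)
n_human = sum(1 for _, label in m2 if label == 0)
n_fake = sum(1 for _, label in m2 if label == 1)
print(n_human, n_fake, len(m2))  # 371 877 1248
```

With these assumptions the sketch reproduces the counts of case M2, but rounding up does not reproduce every row of Table 3 (for example, the human counts of M10–M18), so the rule actually used in the study may differ.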
3.3 Data Preprocessing
Data preprocessing is the process of cleaning, scaling and transforming the data into the required format. During data cleaning, NaN (Not-a-Number), inconsistent and missing values are removed [1]. The data under consideration also had some
missing values, and some features in the datasets did not have any values at all. The first step was the removal of the features that did not contain any values. After that, the missing values were replaced with zero. Features containing redundant values were also dropped. After this preprocessing, we were left with 24 features in the feature set, and the datasets with these 24 features were then subjected to the feature selection techniques.
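The three preprocessing steps described above (dropping empty features, replacing missing values with zero and dropping redundant features) can be sketched as follows. This is a minimal sketch under stated assumptions: records are plain dictionaries with `None` marking a missing value, a feature is treated as redundant when it takes a single value across all records (the paper does not define redundancy precisely), and the function name and the sample records are hypothetical.

```python
def preprocess(records):
    """Clean a list of account records (dicts mapping feature -> value)."""
    features = list(records[0].keys())

    # Step 1: drop features that contain no values at all.
    features = [f for f in features
                if any(r.get(f) is not None for r in records)]

    # Step 2: replace the remaining missing values with zero.
    cleaned = [{f: (r.get(f) if r.get(f) is not None else 0) for f in features}
               for r in records]

    # Step 3 (assumption): drop features that carry a single constant value.
    features = [f for f in features
                if len({row[f] for row in cleaned}) > 1]
    return [{f: row[f] for f in features} for row in cleaned]


# Hypothetical records; None marks a missing value.
accounts = [
    {"statuses_count": 120, "friends_count": 40, "protected": None, "verified": 0},
    {"statuses_count": None, "friends_count": 15, "protected": None, "verified": 0},
    {"statuses_count": 7, "friends_count": 300, "protected": None, "verified": 0},
]
cleaned_accounts = preprocess(accounts)
for row in cleaned_accounts:
    print(row)
```

In this toy input, `protected` is removed because it is empty, `verified` is removed because it is constant, and the missing `statuses_count` becomes 0.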
3.4 Feature Selection
Large datasets typically contain a large number of features in the feature set, and these features carry information about the target variable. However, the statement “the more the number of features, the better is the performance” is not valid in every case. Some features can be removed from the feature set without affecting the solution at all. Such features are typically irrelevant, constitute noise or are redundant [10], and hence can be dropped safely. Of the three categories of feature selection methods, the filter methods normally use mathematical functions and are known to be faster than wrapper methods. They include the Univariate method, the Chi-Square method and Correlation Matrix with Heatmap. Wrapper methods use the classifiers to prepare the feature set with maximum accuracy [15]. The third category, embedded methods, combines the salient features of the other two categories. Filter methods provide better results when the feature set contains a very large number of features, whereas wrapper methods work better when dealing with fewer features [15].

Three feature selection techniques have been explored in this work in order to find the optimal feature set for further processing. The first is Correlation Matrix with Heatmap, which deals with the relationships among the features themselves and with the target variable. The second is Feature Importance, which ranks the features according to their importance. The third is Recursive Feature Elimination with Cross-Validation (RFE-CV).
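The filter step based on a correlation matrix can be sketched as follows. This is a minimal sketch, not the authors' implementation: it uses the Pearson correlation coefficient, keeps features whose absolute correlation with the target reaches a threshold, and keeps only one feature of any pair that is highly correlated with each other. The thresholds 0.1 and 0.9, the function names and the toy data are assumptions chosen for illustration, and the heatmap plotting itself is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_filter(columns, target, min_target_corr=0.1, max_pair_corr=0.9):
    """Select features by correlation with the target and with each other.

    `columns` maps a feature name to its list of values.
    """
    # Keep features that are at least weakly correlated with the target,
    # strongest first.
    scored = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    candidates = sorted((n for n in scored if scored[n] >= min_target_corr),
                        key=lambda n: scored[n], reverse=True)
    selected = []
    for name in candidates:
        # Skip a feature that nearly duplicates an already selected one.
        if all(abs(pearson(columns[name], columns[s])) < max_pair_corr
               for s in selected):
            selected.append(name)
    return selected


# Toy data: label 1 marks a fake account (hypothetical values).
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "friends_count": [10, 12, 11, 13, 90, 95, 92, 97],
    "friends_count_copy": [20, 24, 22, 26, 180, 190, 184, 194],  # redundant
    "constant_flag": [1, 1, 1, 1, 1, 1, 1, 1],                     # irrelevant
}
selected = correlation_filter(columns, target)
print(selected)
```

In this toy example only one of the two duplicated features survives and the constant feature is dropped, which mirrors the removal of redundant and irrelevant features described above.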
In RFE-CV, let C be the classifier used for prediction (e.g., a random forest), F the scoring function used to evaluate the performance of the classifier (e.g., accuracy) and k the number of features to be eliminated in every step (e.g., 1 feature). RFE-CV starts with all the n features, makes predictions with cross-validation using C, and computes the cross-validated performance score F and the ranking of the importance of the features. It then eliminates the k lowest-ranked features and repeats the predictions, the computation of the performance score and the feature ranking. It continues until all the features are eliminated. Finally, it outputs the set of features that produced the predictor with the best performance score [16].

After applying Correlation Matrix with Heatmap, 15 features were selected out of the 24 features. Subsequently, applying the Feature Importance technique reduced the number further to 11. Finally, the RFE-CV technique output the 8 optimal features shown in Table 4. All the later operations were performed on these selected features.
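The RFE-CV loop described above can be sketched as follows. This is a minimal, self-contained sketch rather than the authors' implementation: a nearest-centroid rule stands in for the classifier C (the text suggests, e.g., a random forest), accuracy is the scoring function F, the importance of a feature is approximated by the distance between the two class centroids along that feature, and ties in the score are resolved in favour of the smaller feature set. All function names and the toy data are hypothetical, and class labels are assumed to be 0 and 1.

```python
def centroids(rows, labels, feats):
    """Per-class mean of the selected features."""
    cents = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        cents[cls] = [sum(r[f] for r in members) / len(members) for f in feats]
    return cents


def predict(cents, row, feats):
    """Classifier C: assign the class whose centroid is nearest."""
    def dist(cls):
        return sum((row[f] - c) ** 2 for f, c in zip(feats, cents[cls]))
    return min(cents, key=dist)


def cv_accuracy(rows, labels, feats, folds=5):
    """Scoring function F: cross-validated accuracy."""
    correct = 0
    for fold in range(folds):
        train = [i for i in range(len(rows)) if i % folds != fold]
        test = [i for i in range(len(rows)) if i % folds == fold]
        cents = centroids([rows[i] for i in train],
                          [labels[i] for i in train], feats)
        correct += sum(predict(cents, rows[i], feats) == labels[i] for i in test)
    return correct / len(rows)


def rfe_cv(rows, labels, k=1, folds=5):
    """Eliminate the k lowest-ranked features per step; return the best subset."""
    feats = list(range(len(rows[0])))
    best_feats, best_score = feats, -1.0
    while feats:
        score = cv_accuracy(rows, labels, feats, folds)
        if score >= best_score:  # ties favour the smaller subset
            best_feats, best_score = feats, score
        # Rank features by the separation of the two class centroids.
        cents = centroids(rows, labels, feats)
        importance = {f: abs(a - b) for f, a, b in zip(feats, cents[0], cents[1])}
        ranked = sorted(feats, key=lambda f: importance[f], reverse=True)
        feats = sorted(ranked[:-k])  # drop the k least important features
    return best_feats, best_score


# Toy data: feature 0 separates the classes, features 1 and 2 are noise.
rows = [[10 * (i % 2) + (i % 4) / 4, i % 3, i % 5] for i in range(40)]
labels = [i % 2 for i in range(40)]
best_feats, best_score = rfe_cv(rows, labels)
print(best_feats, best_score)  # [0] 1.0
```

In practice the ranking would normally come from the classifier itself (for instance, the feature importances of a random forest) instead of the centroid heuristic used here.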
Table 4 Final features set
| statuses_count | friends_count | favourites_count | Lang |
| Profile_text_color | profile_background_image_url | Profile_link_color | Updated |
Fig. 1 Comparison of values of evaluation metrics for KNN classifier (x-axis: evaluation metrics Accuracy, Precision, Recall, F-Measure, MCC and Specificity; y-axis: values of metrics; one series per dataset M1–M18)
3.5 Classifiers Used
The three classifiers used in this study are Random Forest (RF), Support Vector Machine (SVM) and K-Nearest Neighbor (KNN). The KNN classifier is a nonparametric algorithm used for both classification and regression. It works on the principle that objects that are close to each other within a dataset have similar properties [17], and it is very fast to train. The RF classifier is also used for both classification and regression problems. This algorithm works in two steps: in the first step, random trees are created, and in the second step, the output is predicted on the basis of the votes of the trees generated in the first step [1]. The SVM classifier is discriminative in nature and is likewise used for classification and regression. It constructs a hyperplane that segregates the data points into two classes. There can be multiple hyperplanes separating the data points into two classes, but the one with the maximal margin between the data points of the two classes is considered the best and is selected. The data points that affect the position of the hyperplane are called the support vectors.
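The nearest-neighbor principle behind the KNN classifier described in this section can be sketched as follows. This is a minimal sketch, not the implementation used in the study: it assumes Euclidean distance and a simple majority vote among the k nearest training points, and the value k = 3, the function name and the toy feature vectors are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data: (statuses_count, friends_count) with hypothetical values.
train_x = [(500, 120), (650, 200), (480, 90), (3, 2000), (1, 1800), (5, 2500)]
train_y = ["human", "human", "human", "fake", "fake", "fake"]

print(knn_predict(train_x, train_y, (550, 150)))  # human
print(knn_predict(train_x, train_y, (2, 2100)))   # fake
```

Because KNN only stores the training points, "training" is essentially free and the cost is paid at prediction time, when distances to all stored points are computed.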
3.6 Evaluation Criteria
This study has been conducted to detect fake accounts such as Sybil accounts and social spambots. The experimental results have been evaluated using the confusion matrix and six evaluation metrics: Accuracy, Precision, Recall, F-Measure, MCC (Matthews Correlation Coefficient) and Specificity.
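The six metrics can be computed from the four entries of a binary confusion matrix, as sketched below. The standard textbook definitions are used here, since the paper does not spell out its formulas; the function name and the example counts are hypothetical, and the positive class is assumed to be the fake/Sybil account.

```python
import math


def evaluation_metrics(tp, fp, fn, tn):
    """Six standard metrics from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc, "specificity": specificity}


# Hypothetical counts: 45 true positives, 5 false positives,
# 5 false negatives and 45 true negatives.
metrics = evaluation_metrics(tp=45, fp=5, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The sketch does not guard against empty rows or columns of the confusion matrix (for example, no predicted positives), in which case some denominators would be zero.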
4 Results and Analysis
After the requisite features had been selected for the datasets using the feature selection techniques, the selected features and their data were used for the further experiments. Prediction models based on the KNN, RF and SVM classifiers were then built. Each prediction model was trained and tested on all the 18 datasets listed in Table 3. The values of the six evaluation metrics obtained from the experiments are listed in Tables 5, 6 and 7 for the KNN, RF and SVM classifiers, respectively. Table 5 and Fig. 1 present the performance of the predictive model trained and tested using the KNN classifier in terms of the values calculated for the six evaluation metrics. Table 6 and Fig. 2 show the values of these metrics for the predictive model trained using the RF classifier, and Table 7 and Fig. 3 show the values of these metrics for the predictive model trained using the SVM classifier.

Table 5 Values of evaluation metrics for KNN classifier
| Case | Accuracy | Precision | Recall | F-measure | MCC | Specificity |
|---|---|---|---|---|---|---|
| M1 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M2 | 0.99 | 1.00 | 0.99 | 0.99 | 0.98 | 1.00 |
| M3 | 0.99 | 1.00 | 1.00 | 0.99 | 0.99 | 1.00 |
| M4 | 0.98 | 1.00 | 1.00 | 0.97 | 0.97 | 1.00 |
| M5 | 0.97 | 0.96 | 0.94 | 0.93 | 0.96 | 0.98 |
| M6 | 0.96 | 0.93 | 0.89 | 0.92 | 0.93 | 0.97 |
| M7 | 0.97 | 0.98 | 0.96 | 0.96 | 0.96 | 0.98 |
| M8 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M9 | 0.97 | 0.97 | 0.92 | 0.94 | 0.95 | 0.99 |
| M10 | 0.98 | 1.00 | 0.97 | 1.00 | 0.97 | 1.00 |
| M11 | 0.99 | 1.00 | 0.99 | 0.99 | 0.98 | 1.00 |
| M12 | 0.98 | 0.99 | 0.97 | 0.97 | 0.97 | 0.99 |
| M13 | 0.98 | 0.97 | 0.97 | 0.94 | 0.96 | 0.99 |
| M14 | 0.98 | 0.97 | 0.98 | 0.97 | 0.98 | 0.97 |
| M15 | 0.99 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 |
| M16 | 0.97 | 0.96 | 0.97 | 0.97 | 0.97 | 0.98 |
| M17 | 0.98 | 0.99 | 0.98 | 0.98 | 0.97 | 0.99 |
| M18 | 0.97 | 0.97 | 0.96 | 0.96 | 0.95 | 0.97 |

Table 6 Values of evaluation metrics for RF classifier
| Case | Accuracy | Precision | Recall | F-measure | MCC | Specificity |
|---|---|---|---|---|---|---|
| M1 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
| M2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M4 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M5 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M6 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M7 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M8 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M9 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M11 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M12 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M13 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M14 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M15 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M16 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M17 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| M18 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

In the case of the KNN-based model, we got the best result for dataset M3 (E13–FSF) (75%–25%). In the case of the RF-based model, the value achieved for each metric is 1.00 on every dataset, except for precision (0.99) on dataset M1 (E13–FSF) (50%–50%). In the case of the SVM-based model, 14 datasets gave the best results and only four datasets gave values lower than the others: M4 (E13–INT) (50%–50%), M7 (E13–TWT) (50%–50%), M12 (Genuine–spambot 1) (75%–25%) and M16 (Genuine–spambot 3) (50%–50%), as listed in Table 7. After comparing the maximum values of the metrics obtained from the three classifiers, we observed that the performance of the RF classifier is the best among these classifiers, with the best results for dataset M3 (E13–FSF) (75%–25%) in all three cases.
Table 7 Values of evaluation metrics for SVM classifier
| Case | Accuracy | Precision | Recall | F-measure | MCC | Specificity |
|---|---|---|---|---|---|---|
| M1 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M2 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M3 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M4 | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | 0.97 |
| M5 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M6 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M7 | 0.98 | 0.99 | 0.96 | 0.97 | 0.97 | 0.99 |
| M8 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M9 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M10 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M11 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M12 | 0.98 | 0.98 | 0.96 | 0.96 | 0.97 | 0.98 |
| M13 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M14 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M15 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M16 | 0.98 | 0.98 | 0.96 | 0.96 | 0.97 | 0.98 |
| M17 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
| M18 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00 |
Fig. 2 Comparison of values of evaluation metrics for RF classifier (x-axis: evaluation metrics Accuracy, Precision, Recall, F-Measure, MCC and Specificity; y-axis: values of metrics; one series per dataset M1–M18)
Fig. 3 Comparison of values of evaluation metrics for SVM classifier (x-axis: evaluation metrics Accuracy, Precision, Recall, F-Measure, MCC and Specificity; y-axis: values of metrics; one series per dataset M1–M18)
5 Conclusion and Future Scope
In this work, we first preprocessed the data collected from various sources and then applied three different feature selection techniques to the preprocessed data. Preprocessing removes the noisy data, and feature selection removes the redundant features and helps in selecting the optimal features. After the selection of a set of eight optimal features for the datasets under consideration, three prediction models were created based on three different classifiers: KNN, RF and SVM. The performances of these prediction models were evaluated on the basis of the values of six standard metrics normally used for this purpose. Of the three classifiers used, RF provides the best results on all six metrics. This study can be extended further to implement a high-performance model for a real-time environment. It can also be used to design a set of rules for applying further optimization techniques to the experiments conducted.

Acknowledgments We express our utmost gratitude toward Cresci et al. [7, 18] for allowing us to use the datasets created by them for this research work, since these datasets were the basic requirement for our research.
References
1. N. Bindra, M. Sood, Data pre-processing techniques for boosting performance in network traffic classification, in 1st International Conference on Computational Intelligence and Data Analytics, ICCIDA-2018 (Springer CCIS Series, Bhubaneshwar, Odhisha, India, 2018)
2. https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methodswith-an-example-or-how-to-select-the-right-variables/. Last accessed on 21 Nov 2019
3. G. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in Proceedings of 5th International Conference on Machine Learning (Morgan Kaufmann, New Brunswick, NJ, Los Altos, CA, 1994), pp. 121–129
4. A. Vasudeva, M. Sood, Survey on Sybil attack defense mechanisms in wireless ad hoc networks. J. Netw. Comput. Appl. 120, 78–118 (2018)
5. A. Vasudeva, M. Sood, A vampire act of Sybil attack on the highest node degree clustering in mobile ad hoc networks. Indian J. Sci. Technol. 9(32), 1–9 (2016)
6. A. Vasudeva, M. Sood, Perspectives of Sybil attack in routing protocols of mobile ad hoc network, in Computer Networks and Communications (NetCom), ed. by Chaki et al. LNEE, vol. 131 (Springer, New York, NY, 2013), pp. 3–13
7. S. Cresci, R.D. Pietro, R. Petrocchi, A. Spognardi, M. Tesconi, Fame for sale: efficient detection of fake Twitter followers. Decis. Support Syst. 80, 56–71 (2015)
8. G. Devi, M. Sabrigiriraj, Feature selection, online feature selection techniques for big data classification: a review, in Proceeding of International Conference on Current Trends Toward Converging Technologies (IEEE, Coimbatore, India, 2018), pp. 1–9
9. J. Alowibdi, U. Buy, P. Yu, L. Stenneth, Detecting deception in online social networks, in Proceedings of International Conference on Advances in Social Network Analysis and Mining (ASONAM) (IEEE/ACM, 2014), pp. 383–390
10. G. Stringhini, M. Egele, C. Kruegel, G. Vigna, Poultry markets: on the underground economy of twitter followers, in Proceedings of Workshop on Online Social Networks WOSN’12 (ACM, 2012), pp. 1–6
11. S. Murugan, G. Devi, Detecting streaming of twitter spam using hybrid method. Wireless Pers. Commun. 103(2), 1353–1374 (2018)
12. Z. Xianghan, Z. Zeng, Z. Chen, Y. Yu, C. Rong, Detecting spammers on social networks. Neurocomputing 159, 27–34 (2015)
13. M. McCord, M. Chuah, Spam detection on twitter using traditional classifiers, in Proceedings of 8th International Conference on Autonomic and Trusted Computing, ATC 2011. LNCS (Springer, Berlin, Heidelberg, 2011), pp. 175–186
14. Project Jupyter, https://jupyter.org/. Last accessed on 21 Nov 2019
15. D. Sonkhla, M. Sood, Performance analysis and feature selection on Sybil user data using recursive feature elimination. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 8(9S4), 48–56 (2019)
16. https://www.researchgate.net/post/Recursive_feature_selection_with_crossvalidation_in_the_caret_package_R_how_is_the_final_best_feature_set_select.ed2/. Last accessed on 22 Nov 2019
17. K. Yan, D. Zhang, Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sens. Actuators B Chem. 212, 353–363 (2015)
18. S. Cresci, R.D. Pietro, R. Petrocchi, A. Spognardi, M. Tesconi, The paradigm-shift of social spambots: evidence, theories, and tools for the arms race, in Proceedings of 26th International Conference on World Wide Web Companion International World Wide Web Conferences Steering Committee (2017), pp. 963–972
Implementation of Ensemble-Based Prediction Model for Detecting Sybil Accounts in an OSN
Priyanka Roy and Manu Sood
Abstract Online Social Networks (OSNs) are the leading platforms for a variety of social interactions, generally aimed at fulfilling the specific needs of different strata of users. A user is normally allowed to join these networks with little or no antecedent verification, which leads to the coexistence of fake entities with malicious intentions on these websites. A specific category of such accounts is known as Sybil accounts: a malicious user, pretending to be an honest user, creates multiple fake identities to manipulate or harm honest users, giving the real users in the OSN the illusion that these identities are real. In the absence of stringent control mechanisms, it is difficult to identify and remove these malicious accounts. However, since every interaction on a social media website leaves a digital trace, and the huge number of such interactions every day accumulates into huge datasets, Machine Learning (ML) techniques can be used to build prediction models for identifying these Sybil accounts. This paper is one such attempt, in which we have used ML techniques to build prediction models that predict the presence of Sybil accounts in Twitter datasets. After preprocessing the data, we have selected an optimal set of features using one filter method, Correlation with Heatmap, and two wrapper methods, Recursive Feature Elimination (RFE) and Recursive Feature Elimination with Cross-Validation (RFE-CV). Then, using 8 classifiers (SVM, NN, LR, DT, RF, NB, GPC, and KNN) to classify the accounts in the datasets, we have concluded that the Decision Tree classifier gives the best prediction performance among them. Lastly, we have used an ensemble of 6 classifiers (SVM, NN, LR, DT, RF, and KNN), combined by Bagging (max voting), in an attempt to achieve better results. However, because the ensemble includes weak learners such as SVM and NN, DT on its own still gives the best prediction outcomes.

P. Roy (B) · M. Sood, Department of Computer Science, Himachal Pradesh University, Shimla, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_62
Keywords Data preprocessing · Feature selection · Classifier · Ensemble of classifiers · Bagging · Max voting · Sybil attack
1 Introduction
Owing to recent rapid advances in technology, Online Social Networks (OSNs) are fast becoming the leading platforms for a variety of social interactions worldwide. OSNs generally run over wide area networks, connecting people anywhere, anytime. Users connect to an OSN by registering on its website, i.e., by creating a profile on a particular social network such as Facebook, Instagram, Twitter, or LinkedIn. They can hook on to these networks in different ways: (a) by sending and receiving friend requests, (b) by using tags (on Twitter, mostly hashtags), and (c) by sharing links with each other. An OSN has two principal elements: users (connectors) and links (connections). Users play the role of connectors by creating profiles that display information about themselves, most of which cannot be verified by any means. These connectors join each other by sending and receiving requests, uploading data, sharing information, and building communication. The resulting links, known as connections, show the associations among users; these associations may be one-to-one, one-to-many, or many-to-many. OSNs are used for a variety of purposes: personal, professional (e.g., video conferencing), and business (e.g., e-commerce) [1]. More than 65 OSN platforms are available to users; of these, the two most popular, especially among younger generations, are Facebook and Twitter. Facebook is generally used for sharing social views, photos, videos, knowledge, opinions, business, etc., whereas Twitter, in addition to being a microblogging site, is used more for political debates, professional interactions, and as the largest source of daily hunts.

Owing to the inherent structure of these OSNs, any user has ample opportunity to create multiple fake or spam accounts on purpose. These fake users may also attempt to steal or copy the personal or professional information of other users in order to create fake profiles that look like those of real, genuine users. Fake accounts also hide the real identities of the people behind them, helping them to remain untraceable [2]. Fake users may have a variety of motivations, such as broadcasting false information, trolling, spreading fake news, content, or rumors about a person, group, organization, or event, cyberbullying, and destroying the reputations of competitors. The primary purpose of a fake user is to harm some genuine user intentionally [3]. When a fake user uses multiple fake profiles to pose as multiple different genuine users for some ulterior motive, it is known as a Sybil attack. The term Sybil comes from Sybil Dorsett, a woman with multiple personality disorder. There are three dimensions of a Sybil attack: (a) What: a user with malicious intentions creates multiple Sybil nodes with the aim of harming honest users by portraying the Sybil nodes as real nodes; (b) How: it first joins the weak targets (honest nodes) and then mounts multiple Sybil
nodes to attack them; and (c) When: a Sybil attack can easily be mounted if the system has no centralized trusted party at a suitable location. Different approaches used for detecting the occurrence of a Sybil attack are given in [4, 5]. The two categories of graph-based approaches to Sybil attack detection are (a) Sybil detection schemes (SybilGuard, Gatekeeper, SybilRank, SybilLimit, SybilDefence, SybilFrame, SybilBelief, Integro) and (b) Sybil tolerance schemes (Ostra, Sumup, True Top). There are also two ML-based approaches: (a) supervised approaches (Uncovering Sybil, Social Turing Test) and (b) unsupervised approaches (Click Stream, Latest community model). Using these techniques, Sybil nodes in OSNs can be detected so that steps can be initiated to keep users safe [6].

OSNs generate an enormous amount of data, which is of late being used to detect or predict the occurrence of Sybil attacks in social media such as Twitter [7]. Machine Learning (ML) provides various solutions for identifying malicious nodes along with their Sybil nodes by processing this enormous data, organized as datasets. Just as humans learn from past experience, a machine can be trained through experimentation on datasets. In this research work, we have used the Cresci 2015 dataset, which is based on genuine and fake Twitter accounts [7], to present various ML-based prediction models. First, we have applied data preprocessing techniques for cleaning, transforming, and normalizing the datasets. Thereafter, three feature selection techniques have been used to select the most relevant subset of the given features, the purpose being to reduce overfitting, improve accuracy, and reduce the training time of the models. Feature selection removes noisy, irrelevant, and redundant features [8]. We have used filter and wrapper methods of feature selection to retrieve the optimal features from each of the datasets [9]. As the filter method, Correlation with Heatmap (Pearson correlation) has been used for all the datasets; from the heatmap it is easy to see which features are more strongly related to the target variable, and hence to retrieve the best features. As the wrapper method, out of a few available choices, Recursive Feature Elimination (RFE) was chosen to rank the features. Later, Recursive Feature Elimination with Cross-Validation (RFE-CV) was used to transform the entire dataset using the best-scoring number of features. Cross-validation provides a better optimal feature set, whereas simple RFE provides only a ranking of the features. In this paper, only the results obtained from Correlation with Heatmap were finally used for further processing, since the wrapper methods use the same strategy as the filter method and are computationally expensive by comparison.

To achieve the best performance, most prediction models use ensemble techniques along with classifiers. There are multiple ensemble techniques available in the literature; bagging along with max voting
for a classifier ensemble is one of them, and it can be trained on a specific dataset to achieve the best merged accuracy.

In OSNs there are multiple fake accounts, and some of them pose as multiple genuine users to launch a Sybil attack on target user(s). A Sybil attack is used to influence the online interactions and behavior of honest users, for example by spreading false information, gathering personal information of genuine users, and posting negative comments and responses to the posts and blogs of genuine accounts. Detecting these fake accounts is therefore of prime importance in making OSNs safe and secure platforms for genuine users. Although almost all social media platforms capture the various interactions of all their users, it is difficult to detect the presence of Sybil accounts in these datasets without machine learning techniques. Not much work has been carried out on the detection of Sybil accounts using ML techniques, and in what little work exists, the performance of the models has been below par. This paper is an attempt to find a model, based on an optimal ensemble of ML classifiers, that achieves better prediction performance than the individual classifiers.

The paper is divided into five sections. Section 2 highlights the research methodology used to achieve all the objectives, and presents the datasets and the experimental setup. Section 3 discusses the ensemble of classifiers used, whereas Sect. 4 summarizes the results of the experiments followed by their analysis. Section 5 concludes the work with a pointer toward future work.

The novelty of this research work, as perceived by the authors, is as follows:
• All five datasets of Cresci 2015 [7] have been considered for training and testing the proposed predictive models, and the process of controlled biasing has turned these five datasets into 24 datasets.
• Three feature selection techniques, Correlation with Heatmap, RFE, and RFE-CV, belonging to two different classes of FS techniques, have been used before the final selection of an optimal set of features for the proposed prediction models.
• To explore the performance of the proposed prediction models, eight ML classifiers, namely SVM, DT, RF, NB, GPC, NN, KNN, and LR, have been used, the main purpose being to compare the performance of these individual classifiers on the biased datasets.
• An ensemble of 6 classifiers has been used on the same datasets, and its performance has been compared with that of the individual classifiers.
• The results of the comparison have shown that the performance of the ensemble in predicting the presence of Sybil accounts is almost ideal for the datasets used.

The objectives of this work are (1) to select an optimal set of features for the classifiers using appropriate feature selection techniques, (2) to analyze the performance of various individual classifiers on the biased datasets (genuine with fake accounts), and (3) to analyze the performance of an ensemble of classifiers on the same datasets and compare it with that of the individual classifiers.
2 Research Methodology and Simulation Setup
Figure 1 depicts the methodology used to implement this research work, whereas Table 1 gives the details of the datasets used, which are based on the work of Cresci et al. [7]. The Cresci 2015 dataset consists of Twitter account data and comprises five datasets: Elezioni 2013 (E13), The Fake Project (TFP), Fast Followers (FSF), Inter Twitter (INT), and Twitter Technology (TWT). Two datasets contain genuine accounts (E13, TFP) and three contain fake accounts (FSF, INT, TWT). The number of features and their labels are exactly the same in the genuine and fake accounts and are listed in Table 2.

Fig. 1 Methodology used for this research work (flowchart: Cresci 2015 dataset → data preprocessing → feature selection (Correlation with Heatmap, RFE, RFE-CV) → training set (70%) and testing set (30%) → classification models (SVM, NN, KNN, DT, NB, GPC, RF, LR) → calculation of six evaluation metrics → ensemble of six classifiers → output of classifiers by using bagging and max voting → evaluation metrics of ensemble)

Table 1 Twitter datasets and their details [7]
Account type        No.   Name of dataset             Total accounts
Genuine accounts    1     E13 (Elezioni 2013)         1481
Genuine accounts    2     TFP (The Fake Project)      469
Fake accounts       3     FSF (Fast Followers)        1169
Fake accounts       4     INT (Inter Twitter)         1337
Fake accounts       5     TWT (Twitter Technology)    845

Table 2 Complete features set of Twitter accounts [7]
ID                 Location                          Profile background title
Name               Default profile                   Profile sidebar fill color
Screen name        Default profile image             Profile background image URL
Status count       Geo enable                        Profile link color
Follower counts    Profile image URL                 Utc offset
Friends count      Profile banner URL                Protected
Favorites count    Profile use background image      Verified
Listed count       Profile background image https    Description
Created at         Profile text color                Updated
URL                Profile image URL https           Dataset
Time zone          Profile sidebar border color

We have biased the datasets, i.e., combined the genuine-account datasets (E13, TFP) with the fake-account datasets (FSF, INT, TWT), to build 24 named datasets based on the ratio of genuine to fake accounts, (100:100), (75:25), (60:40), and (50:50), as shown in Table 3. For the practical implementation we have used Jupyter Notebook within the Anaconda distribution [10].
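The biasing step and the 70:30 split shown in Fig. 1 can be sketched as follows. This is only a minimal illustration under stated assumptions: the paper does not describe its sampling procedure, so the function names `mix_accounts` and `split_70_30`, the use of random sampling without replacement, and the label convention (1 = genuine, 0 = fake) are hypothetical. The sketch reads a ratio such as 75:25 as "75% of the genuine file plus 25% of the fake file", which is consistent with the totals in Table 3 (for example, 0.75 × 1481 + 0.25 × 1169 = 1403 for E13 + FSF).

```python
import random


def mix_accounts(genuine, fake, genuine_pct, fake_pct, seed=0):
    """Build one biased dataset from a genuine and a fake account list.

    Assumption: a ratio such as 75:25 means 75% of the genuine records
    and 25% of the fake records, drawn at random without replacement.
    Returns shuffled (record, label) pairs with 1 = genuine, 0 = fake.
    """
    rng = random.Random(seed)
    n_genuine = round(len(genuine) * genuine_pct / 100)
    n_fake = round(len(fake) * fake_pct / 100)
    rows = [(rec, 1) for rec in rng.sample(genuine, n_genuine)]
    rows += [(rec, 0) for rec in rng.sample(fake, n_fake)]
    rng.shuffle(rows)
    return rows


def split_70_30(rows):
    """Split already-shuffled rows into 70% training and 30% testing."""
    cut = len(rows) * 70 // 100
    return rows[:cut], rows[cut:]


# Toy stand-ins for two account files (hypothetical data).
genuine = [f"g{i}" for i in range(100)]
fake = [f"f{i}" for i in range(80)]

mixed = mix_accounts(genuine, fake, genuine_pct=75, fake_pct=25)
train, test = split_70_30(mixed)
print(len(mixed), len(train), len(test))
```

Fixing the seed keeps the sampled subset reproducible between runs, which matters when the same biased dataset is reused to compare several classifiers.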
3 The Proposed Models
It is common knowledge that feature selection helps in removing noisy, redundant, and irrelevant features [8]. Feature selection (FS) techniques have been used in this work as ‘data cleaning agents’ and to select the most relevant set of features, in order to reduce overfitting, improve accuracy, and reduce the training time of the models. Out of the three categories of FS methods, for the sake of exploration, we have used one filter and two wrapper methods to retrieve the optimal set of features. We have used the Correlation with Heatmap (Pearson correlation) filter method to select the optimal set of features for all datasets and to show the relationship between each input and the target in a 2D colored matrix. This technique pictorially depicts the relation of every feature with the target variable; the correlation can be positive or negative. Next, we used two wrapper methods, namely Recursive Feature Elimination (RFE) and Recursive Feature Elimination with Cross-Validation (RFE-CV). RFE finds the ranking of the features. We
Table 3 Details of biased datasets used (biasing of genuine accounts with fake accounts)
Biasing ratio    E13 + FSF      Total accounts    E13 + INT      Total accounts    E13 + TWT      Total accounts
100–100          AP1 (100)      2650              AP2 (100)      2818              AP3 (100)      2326
75–25            AP1 (75–25)    1403              AP2 (75–25)    1446              AP3 (75–25)    1322
60–40            AP1 (60–40)    1357              AP2 (60–40)    1424              AP3 (60–40)    1227
50–50            AP1 (50)       1325              AP2 (50)       1408              AP3 (50)       1162

Biasing ratio    TFP + FSF      Total accounts    TFP + INT      Total accounts    TFP + TWT      Total accounts
100–100          AP4 (100)      1638              AP5 (100)      1806              AP6 (100)      1314
75–25            AP4 (75–25)    645               AP5 (75–25)    687               AP6 (75–25)    563
60–40            AP4 (60–40)    749               AP5 (60–40)    816               AP6 (60–40)    619
50–50            AP4 (50)       818               AP5 (50)       902               AP6 (50)       656
have used 24 datasets in our research work, and RFE gives different rankings to the feature sets of the different datasets (24 in all in our case). RFE-CV transforms the entire set using the best-scoring number of features; cross-validation gives a better optimal feature set than plain RFE. Here, we have carried forward only the results of the Correlation with Heatmap filter method for use in the classifiers, since it gave better results than the wrapper methods. Another deciding factor was that the wrapper methods used the same strategy as the filter method but were computationally expensive by comparison. On our datasets, the Correlation with Heatmap feature selection technique resulted in a set of 22 optimal features, as shown in Fig. 2. This figure also depicts the relationship of each input with the target in a 2D colored matrix, from which it can easily be determined which features are more strongly related to the target variable; it has thus enabled us to retrieve the best set of features.

After selecting the best features of all 24 datasets, as is standard practice, we split the data into training and testing sets in the ratio 70:30 for use in the classifiers. A classifier is an algorithm that maps input data to a specific category or label. The eight classifiers used in this work are Support Vector Machine (SVM), Logistic Regression (LR), Neural Network (NN), K-Nearest Neighbor (KNN), Random Forest (RF), Gaussian Process Classifier (GPC), Naive Bayes (NB), and Decision Tree (DT) [11, 12]. To measure the performance of the classifiers, a set of six metrics has been used, some of which are based on the Confusion Matrix. A Confusion Matrix, also
Fig. 2 Features selected through Correlation with heatmap filter method
called an error matrix [7], is generally used as the basis for measuring the performance of a classifier in ML. In our case, True Positive (TP) is the number of genuine accounts identified as genuine, and True Negative (TN) is the number of fake accounts identified as fake. False Negative (FN) is the number of genuine accounts identified as fake, and False Positive (FP) is the number of fake accounts identified as genuine. The evaluation metrics used to evaluate our results are Accuracy, Precision, Recall/Sensitivity, F1 score, MCC (Matthews Correlation Coefficient), and Specificity. After evaluating these six metrics for all 8 classifiers mentioned above, we used an ensemble of classifiers for further investigation.

In ensemble learning, the outputs of multiple classifiers are combined to obtain better prediction or classification accuracy. The classifier outputs can be merged using basic ensemble techniques such as max voting, averaging, and weighted averaging. There are also advanced ensemble techniques such as stacking, boosting, blending, and bagging. In this research work, we have used bagging because it combines Bootstrap and Aggregation and is simple to implement with good performance [13]. This ensemble classification model runs multiple classifiers in parallel and independently of each other. Bagging takes bootstrap samples of the data and trains the classifiers on each sample, after which the classifiers' predictions (votes) are combined by majority voting. The bagging ensemble method is reported to give high classification accuracy [14]. Based on their unweighted results, the SVM and GPC classifiers have been considered weak learners. We have therefore used the bagging technique to randomly sample weak learners, and have merged the outputs of only 6 of the 8 classifiers in the ensemble process. The classifiers used for the ensemble are KNN, RF, LR,
SVM, DT, and NN. We did not include GPC and NB because both produced results somewhat similar to those of SVM. Within bagging, we have used max voting: multiple models make a prediction for each instance of a dataset, the prediction of each model counts as a single vote, and the prediction that receives the majority of the votes is taken as the final result. Of the two voting methods used in ensemble learning, hard voting and soft voting, we have used hard voting for the ensemble on all 24 datasets because of its established superiority.
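The two ingredients described above, bootstrap sampling and hard (max) voting, can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the helper names `bootstrap_sample` and `hard_vote` are hypothetical, the prediction lists are made-up toy values, and the tie-breaking rule (the label that appears first among the votes wins) is an assumption, since the paper does not say how a 3–3 tie among six classifiers is resolved.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def hard_vote(predictions):
    """Combine per-classifier label lists by majority (max) voting.

    predictions: one list of labels per classifier, all the same length.
    Each classifier contributes exactly one vote per instance.
    Assumption: on a tie, the label seen first among the votes wins,
    which is how Counter.most_common orders equal counts.
    """
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


rng = random.Random(42)
training_rows = list(range(10))            # toy stand-in for a training set
sample = bootstrap_sample(training_rows, rng)

# Toy predictions from three classifiers for three accounts (1 = genuine).
votes = hard_vote([[1, 0, 1],
                   [1, 1, 0],
                   [0, 0, 1]])
print(sample, votes)
```

In a full pipeline each classifier would be fitted on its own bootstrap sample before its test-set predictions are passed to `hard_vote`; soft voting would instead average predicted class probabilities.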
4 Results and Analysis
After selecting an optimum feature set of 22 features for all 24 biased datasets using the feature selection techniques, the values of the six metrics for the 8 classifiers in the supervised ML simulation experiments are presented in Figs. 3, 4, 5, 6, 7, 8, 9, and 10. We first obtained the results without the ensemble of classifiers, i.e., the six evaluation metrics of the 8 classifiers (SVM, NN, KNN, LR, DT, RF, NB, and GPC) for all 24 biased datasets. From these results, we conclude that the Decision Tree (DT) classifier performs best over the metrics accuracy, precision, recall, specificity, F1 score, and MCC; the second-best performance is that of the KNN and RF classifiers, the third-best belongs to LR and NB, and the worst performance is that of the SVM, NN, and GPC classifiers. Secondly, the performance of the ensemble of 6 classifiers (SVM, NN, KNN, LR, DT, and RF) over these six metrics for all 24 biased datasets is presented in Fig. 11 and Table 4.

Fig. 3 Performance of six metrics for SVM classifier on various datasets (y-axis: values of metrics; x-axis: biased datasets AP1 to AP6 at ratios 100, 75–25, 60–40, and 50, grouped as Dataset 1 to Dataset 6; series: Accuracy, F1 score, Recall, Precision, MCC, Specificity)
Fig. 4 Performance of six metrics for NN classifier on various datasets (same axes and series as Fig. 3)
Fig. 5 Performance of six metrics for LR classifier on various datasets (same axes and series as Fig. 3)

A careful perusal of Figs. 3, 4, 5, 6, 7, 8, 9, 10, and 11 shows that the performance of the ensemble of classifiers and that of the DT classifier are almost comparable. A deeper look at the evaluation metrics reveals that the performance of the ensemble is below that of DT. The reason the ensemble does not perform better has been traced to the participation of weak learners such as SVM and NN, which have pulled down its performance. This has been found experimentally to hold not only for the bagging technique used to combine the outputs in the ensemble but for all the other techniques too.

As can be seen from Fig. 8, the metrics for the Decision Tree classifier on Datasets 1, 2, and 4 are near perfect. For the remaining three datasets too, the values of the six evaluation metrics are quite good. This means that the DT classifier, when used to predict Sybil accounts on the social media datasets, performs best not only in accuracy but also in the other metrics.
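The effect discussed above, where majority voting falls short of the best individual classifier, can be reproduced with a toy example. The sketch below is purely illustrative: the labels and predictions are invented, not taken from the paper's experiments, and the names `strong`, `good`, and `weak_1` to `weak_4` are hypothetical stand-ins for one perfect classifier, one good classifier, and four weaker classifiers whose mistakes overlap.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote per instance over a list of per-classifier label lists."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(predicted, truth):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


# Invented ground truth for ten accounts (1 = genuine, 0 = fake).
truth  = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

strong = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # perfect, standing in for the best model
good   = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # one mistake
weak_1 = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # weaker models share mistakes
weak_2 = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]   # on accounts 0 and 5
weak_3 = [0, 1, 1, 1, 0, 1, 0, 0, 0, 0]
weak_4 = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]

ensemble = hard_vote([strong, good, weak_1, weak_2, weak_3, weak_4])

print("best individual:", accuracy(strong, truth))
print("ensemble:       ", accuracy(ensemble, truth))
```

Because every model has one vote, the four weaker models outvote the strong one wherever their errors coincide, so the ensemble scores 0.8 while the best individual model scores 1.0. Weighted or soft voting is a common way to reduce this effect, although the paper does not evaluate it.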
Fig. 6 Performance of six metrics for RF classifier on various datasets (y-axis: values of metrics, plotted from 0.82 to 1.02; x-axis: biased datasets AP1 to AP6, grouped as Dataset 1 to Dataset 6; series: Accuracy, F1 score, Recall, Precision, MCC, Specificity)
Fig. 7 Performance of six metrics for KNN classifier on various datasets (y-axis: values of metrics; x-axis: biased datasets AP1 to AP6, grouped as Dataset 1 to Dataset 6; series: Accuracy, F1 score, Recall, Precision, MCC, Specificity)
Fig. 8 Performance of six metrics for DT classifier on various datasets (y-axis: values of metrics; x-axis: biased datasets AP1 to AP6, grouped as Dataset 1 to Dataset 6; series: Accuracy, F1 score, Recall, Precision, MCC, Specificity)
Fig. 9 Performance of six metrics for NB classifier on various datasets (same axes and series as Fig. 8)
Fig. 10 Performance of six metrics for GPC classifier on various datasets (same axes and series as Fig. 8)
Fig. 11 Performance of six metrics for the ensemble of 6 classifiers on various datasets (same axes and series as Fig. 8)

5 Conclusion and Future Scope
In this research work, after due data preprocessing and biasing (mixing genuine accounts with fake accounts), we have built 24 datasets from the five available datasets for the purpose of building an ML model to predict the presence of Sybil accounts on an OSN website. Features were then selected using both filter and wrapper methods, but we carried forward only the results of the filter method, Correlation with Heatmap, because the wrapper methods are computationally expensive. After obtaining an optimal set of 22 features, we used all 24 datasets to train and test the proposed prediction models using the SVM, DT, NN, NB, KNN, GPC, RF, and LR classifiers. From the results of the evaluated metrics, we conclude that the Decision Tree performs best among all the classifiers, whereas SVM, NN, and GPC are the worst performers. Subsequently, we evaluated an ensemble of 6 of these classifiers and compared its performance with that of the individual classifiers. Although the performance of the ensemble over all the metrics is encouraging, the involvement of weak learners such as SVM and NN means that the Decision Tree classifier still provides the best performance, both among the individual classifiers and in comparison with the ensemble. The results were no different for the various other combining techniques available for an ensemble of classifiers. In the future, different optimization techniques may be tried on the ensemble output in order to explore and compare the results of the ensemble of classifiers with optimization.
Table 4 Performance metrics data of ensemble of 6 classifiers
Datasets     Cases          Accuracy   F1 score   Recall   Precision   MCC      Specificity
Dataset 1    AP1 (100)      0.993      0.994      0.99     0.997       0.987    1
Dataset 1    AP1 (75–25)    1          1          1        1           1        1
Dataset 1    AP1 (60–40)    1          1          1        1           1        1
Dataset 1    AP1 (50)       1          1          1        1           1        1
Dataset 2    AP2 (100)      0.99       0.991      1        0.982       0.981    0.98
Dataset 2    AP2 (75–25)    1          1          1        1           1        1
Dataset 2    AP2 (60–40)    0.995      0.996      1        0.992       0.99     0.987
Dataset 2    AP2 (50)       0.99       0.991      1        0.982       0.981    0.98
Dataset 3    AP3 (100)      0.851      0.892      0.968    0.826       0.667    0.645
Dataset 3    AP3 (75–25)    0.891      0.939      1        0.885       0.53     0.317
Dataset 3    AP3 (60–40)    0.943      0.961      0.988    0.936       0.855    0.578
Dataset 3    AP3 (50)       0.851      0.892      0.968    0.826       0.677    0.645
Dataset 4    AP4 (100)      0.906      0.805      0.673    1           0.77     1
Dataset 4    AP4 (75–25)    0.906      0.89       0.801    1           0.825    1
Dataset 4    AP4 (60–40)    0.924      0.887      0.797    1           0.843    1
Dataset 4    AP4 (50)       0.934      0.87       0.771    1           0.84     1
Dataset 5    AP5 (100)      0.874      0.6964     0.553    0.9397      0.658    0.987
Dataset 5    AP5 (75–25)    0.995      0.995      0.99     1           0.99     1
Dataset 5    AP5 (60–40)    0.885      0.805      0.69     0.966       0.748    1
Dataset 5    AP5 (50)       0.926      0.836      0.728    0.98        0.804    0.995
Dataset 6    AP6 (100)      0.784      0.581      0.418    0.951       0.535    0.988
Dataset 6    AP6 (75–25)    0.923      0.934      0.877    1           0.852    1
Dataset 6    AP6 (60–40)    0.951      0.946      0.94     0.951       0.902    0.96
Dataset 6    AP6 (50)       0.796      0.607      0.442    0.968       0.564    0.992
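The six metrics reported in Table 4 all follow from the four confusion-matrix counts defined in Sect. 3, with genuine accounts as the positive class. The sketch below uses the standard textbook formulas; the function names `evaluation_metrics` and `f1_from` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for convenience. As a small consistency check, it also recomputes the F1 score from the precision and recall printed in two rows of Table 4.

```python
import math


def evaluation_metrics(tp, tn, fp, fn):
    """Compute the six metrics from confusion-matrix counts.

    tp: genuine identified as genuine, tn: fake identified as fake,
    fp: fake identified as genuine,   fn: genuine identified as fake.
    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts for a 100-account test set.
m = evaluation_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 3) for name, value in m.items()})

# Rows of Table 4 as (reported F1, recall, precision).
table4_rows = {"AP4 (100)": (0.805, 0.673, 1.0), "AP6 (100)": (0.581, 0.418, 0.951)}
for case, (f1, recall, precision) in table4_rows.items():
    print(case, "reported:", f1, "recomputed:", round(f1_from(precision, recall), 3))
```

The recomputed values agree with the reported F1 scores to within rounding, which also confirms the column order used in Table 4.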
Acknowledgments We are grateful to Cresci et al. [7] for allowing us to use their real-time dataset, i.e., Cresci 2015 dataset in this research work.
References
1. H. Mayadunna, L. Rupasinghe, A trust evaluation model for online social networks, in Proceedings of IEEE 2018 National Information Technology Conference (NITC) 02–04 October, Colombo, Sri Lanka (2018)
2. A.H. Wang, Don’t follow me: spam detection in twitter, in 2010 International Conference on Security and Cryptography (SECRYPT) (IEEE, 2010), pp. 1–10
3. F. Masood, G. Ammad, A. Almogren, A. Abbas, H.A. Khattak, I.U. Din, M. Guizani, M. Zuair, Spammer Detection and Fake User Identification on Social Networks (IEEE, 2019)
4. M. Al-Qurishi, M. Al-Rakhami, A. Alamri, M. Alrubaian, S.M.M. Rahman, M.S. Hossain, Sybil defense techniques in online social networks: a survey. IEEE Access 5, 1200–1219 (2017)
5. A. Vasudeva, M. Sood, Survey on Sybil attack defense mechanisms in wireless ad hoc networks. J. Netw. Comput. Appl. 120, 78–118 (2018)
6. H. Bansal, M. Misra, Sybil detection in online social networks (OSNs), in Proceedings of IEEE 6th International Conference on Advanced Computing (2016)
7. S. Cresci, R.D. Pietro, R. Petrocchi, A. Spognardi, M. Tesconi, Fame for sale: efficient detection of fake Twitter followers. Decis. Support Syst. 80, 56–71 (2015)
8. H. Nkiama, S.Z.M. Said, M. Saidu, A subset feature elimination mechanism for intrusion detection system. Int. J. Adv. Comput. Sci. Appl. 7(4), 148–157 (2016)
9. N. Bindra, M. Sood, Data pre-processing techniques for boosting performance in network traffic classification, in Proceedings of First International Conference on Computational Intelligence and Data Analytics, ICCIDA-2018, 26–27 October 2018 (Springer CCIS Series, Gandhi Institute For Technology (GIFT), Bhubaneshwar, Odhisha, India, 2018)
10. https://www.anaconda.com/distribution/#download-section. Last accessed on 07 Dec 2019
11. https://analyticsindiamag.com/7-types-classification-algorithms. Last accessed on 07 Dec 2019
12. https://scikit-learn.org/stable/modules/gaussian_process.html. Last accessed on 07 Dec 2019
13. R. Polikar, Ensemble based systems in decision making. IEEE Circ. Syst. Mag. 21–44 (2008)
14. J.J. Rodriguez, L.I. Kuncheva, Rotation forest: a new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 28(10), 1619–1630 (2006)
Performance Analysis of Impact of Network Topologies on Different Controllers in SDN Dharmender Kumar and Manu Sood
Abstract Over the past decade, technology, especially in computer science, has advanced beyond what was once thought possible. The traditional approach to networking has several shortcomings, which have been reduced to some extent by a modern networking approach known as Software-Defined Networks (SDNs). SDNs have made communication more flexible, dynamic, and agile. These properties follow from distinguishing features such as centralized control, direct programmability, and physical separation of the network control plane from the forwarding (data) plane. Because the control plane controls the whole network and its entities, this separation of the two planes makes SDNs fundamentally different from traditional networking. Networking implies communication among various physical and logical devices, so communication is a vital part of any network. To achieve better communication in SDN, it is essential to analyze and evaluate the performance of different network topologies. It is therefore of interest to identify the topologies available in SDN and to determine which of them is best suited for communication. In this paper, we propose to find the best topology among four possible topologies in SDN, on three different SDN controllers, through simulation in Mininet. The selection is based on an analysis and evaluation of network parameters such as throughput, round trip time, end-to-end delay, bandwidth, and packet loss with and without a link down. Based on the results obtained for these parameters, we identify the topologies that give the best-case and the worst-case communication in our experiment. Four different topologies on three different SDN controllers (OpenDaylight, POX, and NOX) have been simulated for SDN using Mininet and Wireshark.

D. Kumar (B) · M. Sood
Department of Computer Science, H.P. University, Shimla, India
e-mail: [email protected]
M. Sood
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_63
Keywords Software-defined network · OpenFlow · Mininet · Wireshark · Control plane · Throughput · Round trip time · OpenDaylight · POX · NOX
1 Introduction
The traditional approach that we still use for network communication has been popular for a long time, but it is simple. Traditional networking is characterized by features that are always implemented on a particular device. These dedicated devices can be switches, routers, or application delivery controllers [1–6]. On the whole, physical devices perform the networking functionality in traditional networks. There is also tight coupling between the data and control planes, which means that data forwarding and control policies are provided by the same layer. This tight coupling may lead to additional complexity; for instance, policies once defined are hard to alter dynamically according to the needs of users. The only way to change these networking policies on the go is to halt the communication temporarily to accommodate suitable modifications in the policies. The traditional approach has other significant limitations: (a) network setup is a time-consuming and error-prone process, (b) specialized experts are needed to handle a multi-vendor environment, and (c) network security and reliability are low. All these limitations of the traditional networking paradigm are major contributors to the emergence of SDN as a new networking approach.
1.1 Software-Defined Networking
The factors that limit the traditional way of networking gave rise to SDN, which has emerged over the past few years as a new networking technology. It plays a very important role in networking and overcomes the limitations encountered in traditional networking. Whereas the traditional approach follows the principle of strong coupling, SDN works on the principle of loose coupling, which means that the data plane and the control plane are decoupled. Moreover, the separated data and control planes are not tied to a particular hardware device [1–5]. SDN uses entirely different sets of data forwarding policies. In SDN, the controller plays the major role in the network with respect to these forwarding and controlling policies. The controller can be reconfigured dynamically to make changes, which is not possible in traditional networking [6]. Dynamic configuration means that, in SDN, new policies can be enforced at a later stage to override the previous ones. The primary functionality of the control plane is to specify the path and communication parameters, while the data plane implements the decisions directed by the control plane. The most significant features of SDN are (a) the control plane arrangement, (b) the controller as a centralized authority, (c) an open interface, and (d) a direct and dynamic programming facility. The complete SDN architecture, which comprises the whole networking platform for SDN, is explained in detail in [7]. The architecture consists of three layers: the application layer, the control layer, and the infrastructure layer. The northbound API provides communication between the upper two layers, whereas the southbound API provides communication between the two lower layers. The application layer, as the topmost layer, is responsible for providing the abstract application view, while the control layer, as the middle layer, plays a very important role in SDN. The network operating system, i.e., the OpenFlow controller, resides in the control layer. The purpose of this layer is to provide the interface to the upper application layer and to the lower data or infrastructure layer. The availability of OpenFlow differentiates SDN from traditional networks. In addition, the dynamic and programmatic configuration of the controller makes the configuration process easy and arbitrary, which is not possible in traditional networking. Forwarding is done by the lowermost data layer.
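The split between a control plane that decides and a data plane that only forwards can be illustrated with a toy model. The sketch below is not OpenFlow and does not use any controller API; the class names `Switch` and `Controller` and the rule format (destination host mapped to an output port) are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Switch:
    """Data plane: forwards packets using only the rules it was given."""
    name: str
    # Hypothetical rule format: destination host -> output port.
    flow_table: Dict[str, int] = field(default_factory=dict)

    def forward(self, dst: str) -> Optional[int]:
        # Returns the output port, or None when no rule matches (table miss).
        return self.flow_table.get(dst)


class Controller:
    """Control plane: decides the paths and pushes rules to the switches."""

    def __init__(self) -> None:
        self.switches: Dict[str, Switch] = {}

    def register(self, switch: Switch) -> None:
        self.switches[switch.name] = switch

    def install_rule(self, switch_name: str, dst: str, out_port: int) -> None:
        # A later call for the same destination overrides the earlier policy.
        self.switches[switch_name].flow_table[dst] = out_port


c0 = Controller()
s1 = Switch("s1")
c0.register(s1)
c0.install_rule("s1", "h2", 2)
print(s1.forward("h2"))          # 2
c0.install_rule("s1", "h2", 3)   # policy changed at run time
print(s1.forward("h2"))          # 3
print(s1.forward("h9"))          # None (table miss)
```

The second `install_rule` call mirrors the point made above: a new policy can override an earlier one without touching the forwarding logic of the switch.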
1.2 SDN and Traditional Networking Comparison
The key networking differences between the two approaches are listed in Table 1.

Table 1 Traditional networks versus SDN [7]
S. no. | Features | Conventional network | SDN
1 | Control plane and data plane | Tight coupling | Loosely coupled, both the layers are decoupled
2 | Protocol | A number of protocols are used | Mostly OpenFlow is used
3 | Reliability | Less reliable | More reliable
4 | Security | Moderate | High
5 | Energy consumption | Very high | Less

1.2.1 Topologies' Significance
Topology plays an important role in the proper management of a network, as many characteristics of a network are affected by it, e.g., performance, complexity of communication policies, reliability, and efficiency [8–10]. A network may have several different topologies, each with its own merits and demerits [11]. However, no single topology is the best for every network requirement, so selecting the best among all topologies requires some evaluation. Throughput, bandwidth, packet loss, round trip time (RTT), and end-to-end delay are the parameters that affect the performance of a network topology [12]. From a careful examination of the SDN-related literature, carried out before exploring the simulation results in this paper, we observed that no existing paper compares the performance of different topologies with different SDN controllers. Mininet is a simple and easily accessible tool for simulation in the SDN environment. In this paper, different topologies and their feasibility in SDN are compared with respect to different SDN controllers through Mininet. To determine the best and worst among the six topologies available in SDN, we simulate four of them with Mininet. The simulation results obtained with Mininet are complemented with Wireshark, a tool that can plot graphs.
2 Proposed Work
Using simulated results, the performance of various SDN topologies has been evaluated on the basis of different network parameters. The network topologies that can be created with commands in the Mininet environment [13, 14] are as follows:
(a) Minimal topology: the simplest topology in SDN, with a single switch and two hosts by default. The Mininet command for the minimal topology is sudo mn --topo=minimal.
(b) Single topology: a topology with one switch and N hosts. The Mininet command for the single topology is sudo mn --topo=single,3.
(c) Linear topology: a topology with N switches and N hosts, instead of one switch and N hosts as in the single topology. The Mininet command for the linear topology is sudo mn --topo=linear,3.
(d) Tree topology: as the name suggests, a topology with a tree-like internal structure of multiple levels, where two hosts are attached to each switch at the lowest level. The Mininet command for the tree topology is sudo mn --topo=tree,3.
(e) Reversed topology: the reverse of the single topology, in which the hosts are connected to the switch in reverse order. The Mininet command for the reversed topology is sudo mn --topo=reversed,3.
(f) Torus topology: a topology similar to the mesh topology in traditional networking. The Mininet command for the torus topology is sudo mn --topo=torus,3,3.
The numeric values in the above commands specify the number of hosts and switches; for the tree topology, the value represents the number of levels. These values are arbitrary and can be increased or decreased depending on the network requirement.
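How the numeric arguments translate into switches and hosts can be sketched with a small helper. This is only an illustration: the functions `topology_size` and `mn_command` are hypothetical, and the counts assume Mininet's usual defaults (a tree fanout of 2 and one host per switch in the torus), which the text above does not state.

```python
def topology_size(name, *args):
    """Return (switches, hosts) for a Mininet built-in topology name.

    Assumes default options: tree fanout 2, one host per torus switch.
    """
    if name == "minimal":
        return (1, 2)
    if name in ("single", "reversed"):
        (k,) = args
        return (1, k)
    if name == "linear":
        (k,) = args
        return (k, k)
    if name == "tree":
        depth = args[0]
        fanout = args[1] if len(args) > 1 else 2
        # One switch at the root, fanout switches below it, and so on.
        switches = sum(fanout ** level for level in range(depth))
        return (switches, fanout ** depth)
    if name == "torus":
        x, y = args
        return (x * y, x * y)
    raise ValueError(f"unknown topology: {name}")


def mn_command(name, *args):
    """Build the command line string that selects the topology."""
    topo = ",".join([name] + [str(a) for a in args])
    return f"sudo mn --topo={topo}"


for spec in [("minimal",), ("single", 3), ("linear", 3), ("tree", 3), ("torus", 3, 3)]:
    print(mn_command(*spec), topology_size(*spec))
```

Under these assumptions, a tree with three levels has 7 switches and 8 hosts, which agrees with the 7 switches and the 7 client hosts plus 1 server host reported later for the tree topology.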
3 Simulation Environment and Results
The experiments were carried out on Ubuntu 18.04 LTS, with a minimum of 2 GB RAM and 6–8 GB of free hard disk space. The simulation results were obtained for three different SDN controllers, OpenDaylight, POX, and NOX, with Python as the language. Torus topology was not used in this experiment because it is a non-switched topology. Ring topology was not used either, because switches are irrelevant in this topology and because it is rarely used in current networking scenarios. The topologies used in our experiments, created with Mininet, are listed in Table 2 with complete details of switches and hosts for all three SDN controllers. For all four topologies, we created one server host (h1 in all cases) and treated the remaining hosts as clients (e.g., h3 or h4); one of the client hosts requests the server host to download a file. The same bandwidth was used for all topologies and for all three SDN controllers; it is fixed using the option --link tc,bw=value. A file of the same size is used in all cases. Linear topology creation is depicted in Fig. 1 for the POX controller. Hosts are shown as h1, h2, etc., switches as s1 and s2, and the controller as c0.

Fig. 1 Creating linear topology in SDN using Mininet

The Mininet command that makes host h1 the server is h1 python -m SimpleHTTPServer 80 & [13, 14] (Fig. 2). The file downloading operation is performed in all four topologies and for all three SDN controllers; the values used and obtained for the different network parameters are shown in Tables 2, 3, 4, 5, and 6. The results are obtained for the different topologies and all three SDN controllers, on different network parameters, with Wireshark and Mininet [14–18]. The simulations were run in Mininet, while the tables and graphs of our experiments were obtained from these results with Wireshark. The results are for downloading a specific file from the server host through a client host, for all topologies and controllers. For all four topologies, the values obtained for the different network parameters are given in Tables 3, 5, and 6 for the POX, NOX, and OpenDaylight controllers, respectively; Table 4 lists the segment-level values.

Table 2 Different simulation network elements used
S. no. | Topology used | No. of servers | No. of switches | No. of hosts | No. of controllers
1 | Single topology | 1 | 1 | 2 | 1
2 | Linear topology | 1 | 3 | 2 | 1
3 | Tree topology | 1 | 7 | 7 | 1
4 | Reversed topology | 1 | 1 | 2 | 1

Table 3 Topologies' simulation results for POX controller
Topology used | Bandwidth fixed (Gb/s) | End-to-end delay obtained (ms) | Throughput obtained (b/s) | Round trip time obtained (ms) | Packet loss (when link is down) | Packet loss (when link is not down)
Single | 26 | 7.2799 | 20,300 | 15.032 | 0 | 66
Linear | 26 | 23.41 | 31,000 | 46.83 | 0 | 66
Tree | 26 | 32.36 | 24,100 | 65.72 | 0 | 25
Reversed | 26 | 6.26 | 24,000 | 13.550 | 0 | Network unreachable

Table 4 Simulated network parameters' value in experiments
Parameters used | Single topology | Linear topology | Tree topology | Reversed topology
Segment range (Bytes) | 0–1500 | 01–1500 | 0–120 | 0–1500
No. of segments | 3 | 7 | 12 | 3
RTT (ms) | 15.032 | 46.83 | 65.72 | 13.55
Average throughput (bps) | 20,300 | 31,000 | 24,100 | 24,000

Table 5 Topologies' simulation results for NOX controller
Topology used | Bandwidth fixed (Gb/s) | End-to-end delay obtained (ms) | Throughput obtained (b/s) | Round trip time obtained (ms) | Packet loss (when link is down) | Packet loss (when link is not down)
Single | 26 | 9.2799 | 23,300 | 12.02 | 0 | 66
Linear | 26 | 25.41 | 27,000 | 46.43 | 0 | 66
Tree | 26 | 33.36 | 24,100 | 65.92 | 0 | 25
Reversed | 26 | 8.26 | 22,000 | 13.250 | 0 | Network unreachable

Table 6 Topologies' simulation results for OpenDaylight controller
Topology used | Bandwidth fixed (Gb/s) | End-to-end delay obtained (ms) | Throughput obtained (b/s) | Round trip time obtained (ms) | Packet loss (when link is down) | Packet loss (when link is not down)
Single | 26 | 8.2799 | 22,100 | 18.002 | 0 | 66
Linear | 26 | 27.41 | 28,000 | 40.8 | 0 | 66
Tree | 26 | 35.36 | 23,400 | 60.32 | 0 | 25
Reversed | 26 | 10.26 | 24,800 | 10.450 | 0 | Network unreachable

The purpose of our experiment was to find the single topology that provides the best communication results across all three controllers. Figure 3 is a table listing the details of the packets transmitted in the linear topology with the POX controller, captured through Wireshark during the download of the file. The downloaded file was divided into 14 packets of different sizes, as shown in Fig. 3. The file was downloaded in the same way for the other topologies and for the other two SDN controllers, OpenDaylight and NOX.
3.1 Graphs
Figures 4 and 5, plotted with Wireshark, depict the throughput and round trip time graphs for the linear topology. In the same way, we obtained the throughput and round trip time graphs for the remaining three topologies and for the other two SDN controllers, OpenDaylight and NOX. These graphs were captured with Wireshark while the file was being downloaded from the server in response to a request made by the client host, for all topologies and all controllers.
Fig. 2 Downloading file from server with linear topology
Fig. 3 Packet details for linear topology
Fig. 4 Linear topology graph for throughput with POX controller
Fig. 5 Linear topology graph for round trip time with POX controller
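The two quantities plotted in these graphs can be approximated from a packet capture with a few lines of code. The sketch below is only an illustration and is not how Wireshark computes its graphs: the function names are hypothetical, the packet list is invented sample data, and average throughput is taken simply as total bits divided by the capture duration.

```python
def average_throughput_bps(packets):
    """packets: list of (timestamp_seconds, size_bytes), sorted by time."""
    if len(packets) < 2:
        raise ValueError("need at least two packets")
    duration = packets[-1][0] - packets[0][0]
    if duration <= 0:
        raise ValueError("non-positive capture duration")
    total_bits = 8 * sum(size for _, size in packets)
    return total_bits / duration


def mean_rtt_ms(samples):
    """samples: list of (segment_sent_time, ack_received_time) in seconds."""
    rtts = [(ack - sent) * 1000.0 for sent, ack in samples]
    return sum(rtts) / len(rtts)


# Invented sample capture, not data from the experiments.
capture = [(0.0, 1500), (0.5, 1500), (1.0, 1000)]
rtt_samples = [(0.0, 0.040), (0.5, 0.550)]

print(average_throughput_bps(capture))
print(mean_rtt_ms(rtt_samples))
```

A capture exported from Wireshark as (time, length) pairs could be fed to the same functions, keeping in mind that Wireshark's own graphs use windowed averages rather than a single figure for the whole transfer.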
The results for the different network parameters, for all topologies and all three SDN controllers, are shown in Tables 3, 5, and 6 [17, 18]. Graphs for throughput and round trip time are shown in Figs. 4 and 5, respectively. From the analysis of the results of our experiments, we observe that tree topology is the worst topology for the POX, NOX, and OpenDaylight controllers when RTT is taken into consideration, because the maximum RTT is obtained with tree topology for all three controllers. When average throughput is taken into account, the lowest values in Tables 3, 5, and 6 are obtained with the single, reversed, and single topologies for POX, NOX, and OpenDaylight, respectively. From the RTT point of view, the best topologies were reversed, single, and reversed for POX, NOX, and OpenDaylight, respectively. Similarly, linear topology was considered the best topology for SDN in all three cases because it results in maximum throughput. Based on these simulation results for SDN, we can say that (a) the best topology among all four topologies, with three different SDN controllers, is linear topology, as it gives maximum average throughput and medium RTT, and (b) tree topology is the worst topology for SDN with respect to RTT. No single topology can be best or worst for all three SDN controllers on all network parameters; we therefore identify the best topology on two network parameters, throughput and round trip time.
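The per-controller comparison can be checked mechanically from the tabulated values. The sketch below copies the RTT and throughput columns of Tables 3, 5, and 6 into a dictionary; the names `RESULTS` and `summarize` are hypothetical and serve only this illustration.

```python
# (RTT in ms, throughput in b/s) copied from Tables 3, 5, and 6.
RESULTS = {
    "POX": {"single": (15.032, 20300), "linear": (46.83, 31000),
            "tree": (65.72, 24100), "reversed": (13.550, 24000)},
    "NOX": {"single": (12.02, 23300), "linear": (46.43, 27000),
            "tree": (65.92, 24100), "reversed": (13.250, 22000)},
    "OpenDaylight": {"single": (18.002, 22100), "linear": (40.8, 28000),
                     "tree": (60.32, 23400), "reversed": (10.450, 24800)},
}


def summarize(results):
    """Pick the extreme topologies per controller for RTT and throughput."""
    summary = {}
    for controller, rows in results.items():
        summary[controller] = {
            "lowest_rtt": min(rows, key=lambda t: rows[t][0]),
            "highest_rtt": max(rows, key=lambda t: rows[t][0]),
            "highest_throughput": max(rows, key=lambda t: rows[t][1]),
            "lowest_throughput": min(rows, key=lambda t: rows[t][1]),
        }
    return summary


summary = summarize(RESULTS)
for controller, picks in summary.items():
    print(controller, picks)
```

Ranking on a single metric at a time, as done here, matches the way the comparison is stated in the text; a combined score would require weights that the paper does not define.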
4 Conclusion
SDN applications involve fine-grained traffic, quick failover demands, and fast interaction among switches, hosts, and controllers. The generation and execution of control messages and operations in SDN can vary from one topology to another. Therefore, in this paper we aimed to identify the topology that gives the best controlled communication for SDN. From the simulation and experimental results, we found that no single topology shows the best outcome for all three SDN controllers on all network parameters. On the basis of throughput and RTT, our experiments identify linear and tree as the best and worst topologies for SDN, respectively. Due to time constraints, we simulated a limited set of experiments on these topologies and controllers; various other sets of experiments are left for future research. Variations that can be conducted in future to expand the scope of the investigation include varying the size and/or number of the files being communicated. It would also be interesting to investigate the results with different sizes of data packets.
References
1. D. Kreutz, F.M.V. Ramos, Software-defined networking: a comprehensive survey. Proc. IEEE 103(1), 14–76 (2015)
2. H. Farhady, H.Y. Lee, Software-defined networking: a survey. Comput. Netw. 81, 1–95 (2015)
3. S. Badotra, J. Singh, A review paper on software defined networking. Int. J. Adv. Res. Comput. Sci. 8(3) (2017)
4. S. Sezer, S. Scott-Hayward, Are we ready for SDN? Implementation challenges for software-defined networks. IEEE Commun. Mag. 51(7), 36–43 (2013). https://doi.org/10.1109/MCOM.2013.6553676
5. B. Astuto, A. Nunes, A survey of software-defined networking: past, present, and future of programmable networks. IEEE Commun. Surv. Tutorials 16(3), Third Quarter, 1617–1634 (2014)
6. S.H. Yeganeh, A. Tootoonchian, On scalability of software-defined networking. IEEE Commun. Mag. 51(2), 136–141 (2013)
7. M. Sood, Nishtha, Traditional verses software defined networks: a review paper. Int. J. Comput. Eng. Appl. 7(1) (2014)
8. S. Perumbuduru, J. Dhar, Performance evaluation of different network topologies based on ant colony optimization. Int. J. Wirel. Mobile Netw. (IJWMN) 2(4) (2010), http://airccse.org/journal/jwmn/1110ijwmn.12.pdf. Last accessed on 31 Dec 2018
9. R. Hegde, The impact of network topologies on the performance of the in-vehicle network. Int. J. Comput. Theory Eng. 5(3) (2013), http://ijcte.org/papers/719-A30609.pdf. Last accessed on 31 Dec 2018
10. D.S. Lee, J.L. Kal, Network topology analysis (Sandia Report, SAND2008-0069, Sandia National Laboratories, California, 2008), https://prod-ng.sandia.gov/techlib-noauth/accesscontrol.cgi/2008/080069.pdf. Last accessed on 31 Dec 2018
11. B. Meador, A survey of computer network topology and analysis examples, https://www.cse.wustl.edu/~jain/cse567-08/ftp/topology.pdf. Last accessed on 31 Dec 2018
12. M. Gallagher, Effect of topology on network bandwidth. Masters Thesis, University of Wollongong Thesis Collection, 1954–2016, University of Wollongong, Australia. Available: https://ro.uow.edu.au/theses/2539/. Last accessed on 31 Dec 2018
13. D. Kumar, M. Sood, Software defined networks (SDN): experimentation with Mininet topologies. Indian J. Sci. Technol. 9(32) (2016). https://doi.org/10.17485/ijst/2016/v9i32/100195
14. Mininet walkthrough, http://mininet.org/walkthrough/. Last accessed on 31 Dec 2018
15. R. Barrett, A. Facey, Dynamic traffic diversion in SDN: test bed vs Mininet, in International Conference on Computing, Networking and Communications (ICNC): Network Algorithms and Performance Evaluation (2017). https://doi.org/10.1109/iccnc.2017.7876121
16. E. Guruprasad, G. Sindhu, Using custom Mininet topology configuring L2-switch in Opendaylight. Int. J. Recent Innov. Trends Comput. Commun. 5(5), 45–48. ISSN: 2321-8169
17. J. Biswas, Ashutosh, An insight into network traffic analysis using packet sniffer. Int. J. Comput. Appl. 94(11), 39–44 (2014)
18. Wireshark Complete Tutorial, https://www.wireshark.org/docs/wsug_html/. Last accessed on 31 Dec 2018
Bees Classifier Using Soft Computing Approaches Abhilakshya Agarwal and Rahul Pradhan
Abstract Researchers have made many attempts to classify two different types of bees, i.e., honey bees and bumble bees. Many of these approaches depend on subjective analysis and lack any measurable analysis. Our work is therefore an attempt to bridge the gap between the subjective and measurable approaches to classifying bees. We use machine learning algorithms (classification and neural network) to classify a bee as a honey bee or a bumble bee using data in the form of images. Automating this classification from a photograph will greatly speed up the study of bee populations, and the information produced by these algorithms can be used by researchers in the study of bees. This research also includes manipulation of images, in which the data is prepared for the model based on features extracted from the bees. We also perform dimension reduction to focus on the bee in the image, i.e., to disregard background details such as flowers and other unnecessary details.
Keywords Support vector machines · K-nearest neighbor · Random forest · Decision tree · Logistic regression
A. Agarwal (B) · R. Pradhan
Department of Computer Engineering and Application, GLA University, Mathura, UP, India
e-mail: [email protected]
R. Pradhan
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_64

1 Introduction
Much research has been carried out on classifying bees as honey bees or bumble bees. Many of these studies depended on qualitative techniques rather than a quantitative approach for identifying and detecting the bees. Some researchers have also investigated quantitative techniques for monitoring the classification of bees. They collect photographs to identify the bees and derive new features as indicators, i.e., biological features such as fatness, length, robustness, and many more. The tip of their abdomen is more pointed. Bees are flying insects known for their role in pollination and for producing honey and beeswax. There are more than 16,000 known species of bees in seven recognized biological families, found on every continent except Antarctica. Some species, including honey bees, bumble bees, and stingless bees, live socially in colonies. A honey bee (also spelled honeybee) is a flying insect within the genus Apis of the bee clade, all native to Eurasia but spread to four other continents by humans. Honey bees are known for building perennial colonial nests from wax. A bumble bee is any of more than 250 species in the genus Bombus, part of Apidae, one of the bee families. Wild bees are essential to sustaining pollination. At present, identifying the type of a bee requires expert attention. This study is motivated by the "Naive Bees Classifier" challenge by Metis, hosted by Drivendata.org. Automating this process, so that simply an image of a bee can be used for classification by a machine learning algorithm, will greatly speed up the study of bee populations. As part of this research, the aim was to develop classifiers that accurately identify the genus of a bee as either a honey bee (class 0) or a bumble bee (class 1) from its photograph.
2 Methodology
Figure 1 shows the paradigm of the system proposed in this article. First, researchers use cameras to gather photographs of bees. A real-time algorithm, or a set of functions, then extracts and counts particular biological features of the bees. The resulting feature information is stored in a database. Once each bee has been observed, the data in the database is divided into two sets, i.e., a training set and a test set. As the dataset contains 3969 bees, it is divided in the ratio 7:3. After the data is divided, various ML models are trained on the training set, which contains the labels 0 and 1 for honey bee and bumble bee, respectively. To improve the models, we then apply a validation algorithm to verify each model's confusion matrix parameters.

Fig. 1 System overview
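The 7:3 division described above can be sketched with the standard library alone. The function name, the fixed seed, and the choice to shuffle before splitting are assumptions for illustration; the paper does not say how the split was drawn.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle the sample indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(3969)
print(len(train_idx), len(test_idx))  # 2778 1191
```

With 3969 images, this yields 2778 training and 1191 test samples. Because the classes are imbalanced, a per-class (stratified) split may be preferable in practice.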
2.1 Photograph Gathering
To extract the features of the bees, we gathered bee photographs from various sources such as GitHub, Google Photos, and many more. We decided to use a Kinect sensor, as it is less expensive than an RGBD sensor, to detect the bees and photograph them, whether honey bee or bumble bee. The data we collected consists of 3969 bee images, which is very small compared with the ambient dimensionality of the image (200 × 200 × 3). In addition, the class distribution of honey bees and bumble bees in our dataset is about 1:4, i.e., the distribution is skewed.
2.2 Features-Detecting Algorithm
Various features indicate the class of a bee. Many of them can be extracted by flattening the image after converting it from three channels to one, i.e., from an RGB image to a grayscale image, and applying histogram equalization to each image. We then obtain features using the histogram of oriented gradients (HOG) and the DAISY feature descriptor [1]. Some of these features are the raw intensity value of each pixel, the color combination of the bee, and the light brown striations on the abdomen. Bumble bees, on the other hand, are darker and broader. The quality of the images varies drastically. Each image yields 128,100 values, i.e., the number of pixels or features obtained after flattening the image. As this amounted to approximately 9.30 GB of data to be processed, we had to reduce the dimensions of the images. We focused on bees located in the center of the image, but this caused a problem because in some images the bee is at a corner of the image.
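The first preprocessing steps named above (grayscale conversion, histogram equalization, and flattening) can be sketched in plain Python on a tiny image. The function names are hypothetical, the luminance weights are the common ITU-R BT.601 values rather than anything stated in the paper, and HOG and DAISY extraction are not shown.

```python
def to_grayscale(rgb_image):
    """rgb_image: rows of (r, g, b) tuples with values in 0..255."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb_image]


def equalize_histogram(gray, levels=256):
    """Spread the intensity values over the full 0..levels-1 range."""
    pixels = [p for row in gray for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of the intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in gray]
    scale = (levels - 1) / (total - cdf_min)
    return [[int(round((cdf[p] - cdf_min) * scale)) for p in row] for row in gray]


def flatten(gray):
    """Turn the 2-D image into one feature vector, row by row."""
    return [p for row in gray for p in row]


tiny = [[(50, 50, 50), (50, 50, 50)], [(100, 100, 100), (200, 200, 200)]]
features = flatten(equalize_histogram(to_grayscale(tiny)))
print(features)
```

After equalization, the darkest pixels map to 0 and the brightest to 255, which reduces the effect of the varying image quality mentioned above before any descriptor is computed.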
2.3 Database
After manipulating each bee image with the PCA transformation so that the original image focuses on the bee, and gathering the image features in the form of pixels of the manipulated images, we converted the data to a CSV file, which acts as the database. The data for each image is recorded as a new row in the database. The counts of the various behaviors used for classification are treated as separate features, each representing a single pixel in the image.
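A CSV "database" of this kind, with one row per image holding a label followed by its feature values, can be sketched with the standard `csv` module. The column layout (`label`, `f0`, `f1`, ...) and the function names are assumptions; the paper does not describe the file format beyond one row per image.

```python
import csv
import io


def write_feature_rows(rows, n_features, out):
    """rows: iterable of (label, feature_list); out: a text file-like object."""
    writer = csv.writer(out)
    writer.writerow(["label"] + [f"f{i}" for i in range(n_features)])
    for label, features in rows:
        if len(features) != n_features:
            raise ValueError("feature vector has the wrong length")
        writer.writerow([label] + list(features))


def read_feature_rows(src):
    """Read back (label, feature_list) pairs, skipping the header row."""
    reader = csv.reader(src)
    next(reader)
    return [(int(r[0]), [float(v) for v in r[1:]]) for r in reader]


# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO()
write_feature_rows([(0, [0.1, 0.5]), (1, [0.9, 0.3])], 2, buffer)
buffer.seek(0)
loaded = read_feature_rows(buffer)
print(loaded)  # [(0, [0.1, 0.5]), (1, [0.9, 0.3])]
```

Checking the vector length on write guards against rows of different widths, which would otherwise surface only later, when the models are trained.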
2.4 Classification Algorithm
Once every bee to be classified has been observed, the data for each feature or pixel of the image is used for classification. Several ML algorithms are used to classify the images: K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Random Forest, Decision Tree, and Logistic Regression. We then test these models, and the algorithms are analyzed using the confusion matrix. Each model is trained and tested and then validated by the validation algorithm. The default class label is bumble bee, i.e., 1, and the other class label is honey bee, i.e., 0. With this approach, most of the data is predicted as bumble bee; one reason may be that bumble bee images outnumber honey bee images, so that the lack of honey bee images pulls the predictions toward the bumble bee label.
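The effect of the class skew on the confusion matrix can be seen in a small sketch. The helper names are hypothetical and the labels are invented sample data that follow the roughly 1:4 class ratio; they are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def metrics(cm):
    """Derive accuracy, precision, and recall from the four counts."""
    total = cm["tp"] + cm["fp"] + cm["tn"] + cm["fn"]
    predicted_pos = cm["tp"] + cm["fp"]
    actual_pos = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        "precision": cm["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": cm["tp"] / actual_pos if actual_pos else 0.0,
    }


# Invented example: 1 honey bee (0) and 4 bumble bees (1), and a model
# that always predicts bumble bee.
y_true = [0, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm, metrics(cm))
```

A model that always answers "bumble bee" reaches 80% accuracy on such data while never recognizing a honey bee, which is why the individual cells of the confusion matrix are more informative than accuracy alone.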
2.5 Validation Algorithm
Since the predictions made above are now labeled as honey bee or bumble bee, we create a validation process that uses a cross-validation algorithm to verify how the accuracy of each model varies as its parameters change. In this research, we have used the support vector machine (SVM), K-nearest neighbor (K-NN), random forest (RF), logistic regression (LR), and decision tree (DT) algorithms. We split the data in a 7:3 ratio, taking 70% as the training part and 30% as the testing part. The data is also skewed, as bumble bee images outnumber honey bee images by about 4:1. We use k-fold cross-validation to validate on different parts of the image set and obtain the most accurate result.
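One way to cross-validate on skewed data is to build the folds per class, so that each fold keeps roughly the same class ratio. The sketch below is an assumption about how this could be done; the paper does not state the number of folds or whether the folds were stratified, and the function name is hypothetical.

```python
def stratified_k_fold(labels, k=5):
    """Yield (train_indices, validation_indices) with similar class ratios."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    # Deal the samples of each class round-robin over the folds.
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train, validation


# Invented labels with a 1:4 class ratio (4 honey bees, 16 bumble bees).
labels = [0] * 4 + [1] * 16
splits = list(stratified_k_fold(labels, k=4))
for train, validation in splits:
    print(len(train), len(validation), [labels[i] for i in validation])
```

Each model would be trained on `train` and scored on `validation` once per fold, and the fold scores averaged for every parameter setting under comparison.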
3 Experiments
3.1 Data Generation
The data used in this research is created by collecting photographs from various sources, and the classification algorithms are run on the manipulated data, which is formed by the following operations.

First, we perform Region of Interest (ROI) cropping. In most cases, the ROI occupies only a small portion of the image, so by cropping we can discard the portion that does not contain the bee. To achieve this, we use the region covariance descriptor [2], with a set of bee templates (available online) to identify the regions of the image that contain the bee. This method first constructs a suitable feature vector (for both the template image and the candidate image segments), then estimates an empirical covariance matrix for the feature vectors, and finally compares the relative distances of the covariance matrices from those of the template images. Upon further investigation, we found that this technique does not always give good results, as it may crop a part of the image that does not contain any bee. Figure 2 shows the cropping of images by this method.

Fig. 2 Cropping

Second, we perform data augmentation. As the distribution of images is not balanced and the data is skewed, we employ data augmentation on the samples of class 0 to make the distribution of labels roughly uniform. We augment the honey bee (class 0) images by adding noise (zero-mean AWGN with variance 0.01) and by introducing rotations (−90° and 90°). This quadruples the size of class 0 (we get 3308 images from 827). The augmented dataset, which now contains 5955 samples, is then processed to obtain the feature vectors as described in the following section.
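The augmentation step (one noisy copy and two rotated copies per class 0 image) can be sketched on a small grayscale image stored as nested lists. The function names are hypothetical, and two details are assumptions not stated in the text: the intensities are taken to be scaled to [0, 1], and noisy values are clipped back into that range.

```python
import random


def rotate_90_clockwise(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate_90_counterclockwise(image):
    """Rotate a 2-D image (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]


def add_gaussian_noise(image, variance=0.01, rng=None):
    """Add zero-mean Gaussian noise; assumes intensities scaled to [0, 1]."""
    rng = rng or random.Random(0)
    sigma = variance ** 0.5
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]


def augment(image, rng=None):
    """Return the original plus three variants, quadrupling the sample count."""
    return [
        image,
        add_gaussian_noise(image, 0.01, rng),
        rotate_90_clockwise(image),
        rotate_90_counterclockwise(image),
    ]


sample = [[0.1, 0.2], [0.3, 0.4]]
variants = augment(sample)
print(len(variants), 827 * len(variants))  # 4 3308
```

Applying `augment` to each of the 827 class 0 images gives the 3308 images mentioned above; for color images the same operations would be applied to each channel.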
3.2 K-Nearest Neighbor Algorithm
With the data generated, we classify it using the K-Nearest Neighbor algorithm. It is a classic non-parametric method that can be used for both classification and regression. In both cases, the input consists of the k closest training examples in the feature space; the only difference lies in the output. Our classification task is binary (honey bee or bumble bee), so the output of K-NN is a class membership. The image of a bee is assigned to the class that is most common among its k nearest neighbors, where k is a small positive integer, and the most common class is determined by a vote of the neighbors. If k = 1, the bee is simply assigned to the class of its single nearest neighbor.
A. Agarwal and R. Pradhan
This type of classification, where the class is assigned on the basis of local statistics and the full computation is avoided or deferred, is often referred to as lazy learning or instance-based learning. The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This set can be considered the training set for the algorithm, although no explicit training step is required. Figure 3 illustrates an example of k-NN classification. The green dot represents the sample to be classified, and the k-NN model must assign it to either the blue squares or the red triangles. With k = 3 (the solid-line circle), the green dot is classified as a red triangle, because the circle contains 2 triangles and only 1 square. With k = 5 (the larger circle), it is classified as a blue square, because there are 3 squares and only 2 triangles inside the circle. The algorithm thus works on majorities. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase consists only of storing the feature vectors of the images (the pixels) and the class labels of the training samples (0 for honey bee and 1 for bumble bee). In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point, here the image of a bee) is assigned the label, 0 or 1, that occurs most frequently among the k training samples nearest to it. The accuracy of the k-NN algorithm can be severely degraded by noisy or irrelevant features, for instance a colorful flower whose coloration resembles that of the bee, or by feature scales that are not consistent with their importance.
In binary classification problems such as ours, it is helpful to choose k to be an odd number, as this avoids tied votes. We therefore tested various odd values of k and settled on k = 11, which gave the highest accuracy.
Fig. 3 Example of K-NN classification
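The voting rule can be made concrete with a short sketch. The coordinates below are invented to mimic the situation of Fig. 3 (they are not read from the figure), and `knn_predict` is a hypothetical helper name; the paper itself uses scikit-learn on pixel features.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label the query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Invented layout: two triangles are closest, then three squares.
points = [(1.0, 0.0), (0.0, 1.2), (-1.4, 0.0), (0.0, -2.0), (2.2, 0.0), (5.0, 5.0)]
labels = ["triangle", "triangle", "square", "square", "square", "triangle"]
query = (0.0, 0.0)

print(knn_predict(points, labels, query, k=3))  # 2 triangles vs 1 square
print(knn_predict(points, labels, query, k=5))  # 3 squares vs 2 triangles
```

The same query receives different labels for k = 3 and k = 5, which is why k has to be tuned, and an odd k guarantees that a two-class vote cannot tie.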
Fig. 4 Example of SVM
3.3 Support Vector Machine Algorithm
The support vector machine, often abbreviated as SVM, is a classifier that separates two classes by means of a separating hyperplane; with a suitable kernel it can also handle classes that are not linearly separable in the original feature space. SVM takes labeled data and outputs an optimal hyperplane that categorizes new examples. As Fig. 4 shows, in 2D space we can plot the data points, colored by class, and the SVM tries to find the hyperplane that most clearly divides the two classes. The separator is characterized not just by a single line but by the margin around it that divides the two classes; the data points that lie on the margin borders can be thought of as supporting this margin, and the model is named after these support vectors. For our support vector machine implementation, we divided our dataset of 3969 images in a 7:3 ratio for training and testing. We then created an SVM classifier with the attributes kernel = ‘linear’, probability = True, and random state 42. We trained the model with different parameters, fixing the random state so that we get the same result every time we run. After training and testing, we found accuracy in the range 66–80% with the above parameters. SVM achieves its best accuracy of 80% with PCA 500.
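How a trained linear SVM uses its hyperplane can be sketched in a few lines. The weight vector and bias below are hypothetical values chosen for illustration, not parameters learned from the bee data; in practice they would come from the fitted scikit-learn classifier.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return 1 (bumble bee) on the non-negative side, 0 (honey bee) otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else 0

def margin_width(w):
    """Distance between the margin borders w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -5.0  # hypothetical parameters of a 2-D linear SVM

print(predict(w, b, [2.0, 1.0]), predict(w, b, [0.0, 0.0]))
print(margin_width(w))
```

Training chooses w and b so that this margin is as wide as possible while keeping the training points on the correct side; the support vectors are the points that lie on the two margin borders.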
3.4 Decision Tree Algorithm
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. A decision tree can be represented as a flowchart in which each internal node represents a condition or test, and the outcomes of this test are represented by the child nodes of that internal node. The test at an internal node is preferably binary, so that it has two possible outcomes. Classification rules can be derived from the paths between the root and the leaf nodes; each of these paths becomes a classification rule. A decision tree consists of three types of nodes: internal nodes, also known as decision nodes, represented by squares; child nodes that are not decision nodes, represented by circles; and leaf nodes, which hold the final outcome, represented by rectangles. Decision trees are commonly used in operations research and operations management. Another use of decision trees is as a descriptive means for calculating conditional probabilities. In this research, we used this classifier with entropy as the criterion and a maximum tree depth of 11, which gave an accuracy of 71.2% (the best among the parameters used); with other criterion and maximum depth values, the accuracy decreased. The precision was found to be 70% for this classification.
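The entropy criterion mentioned above can be illustrated as follows. This is a generic sketch of how a candidate split is scored, with invented label lists; it is not the authors' code, and `information_gain` is a hypothetical helper name.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Invented example: 0 = honey bee, 1 = bumble bee.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
print(information_gain(parent, [0, 0, 0, 0], [1, 1, 1, 1]))  # perfect split
print(information_gain(parent, [0, 0, 1, 1], [0, 0, 1, 1]))  # uninformative split
```

At each internal node the tree picks the test with the largest gain, and the maximum depth (11 in the experiment above) limits how many such tests can be chained.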
3.5 Random Forest Algorithm
Random forest, as its name suggests, is an ensemble approach that combines a large number of decision trees into a single model. For our random forest implementation, we divided our dataset of 3969 images in a 7:3 ratio for training and testing. We then created a random forest classifier with the attributes: N estimators value 15, criterion entropy, and random state 1234321. We trained our model with different parameters: N estimator values from 5 to 20, the entropy and gini criteria, and a fixed random_state so that we get the same result every time we run. After training and testing, we found the best accuracy of 80% with the parameters n_estimator = 10, criterion = ‘entropy’. A random forest consists of multiple trees, each built on a random sample of the training data. Random forests are generally more accurate than single decision trees. Each tree in the random forest gives a class prediction, and the class with the most votes becomes the model’s prediction. As we can see in Fig. 5, nine decision trees are trained for the random forest; six of them predict one class and three predict the other, so the overall result of the random forest is the class predicted by the six-tree majority.
Fig. 5 Example of random forest
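The voting step of a random forest can be sketched as below. The nine votes are hypothetical and merely mirror the six-to-three situation of Fig. 5; which class the six trees chose in the figure is an assumption here.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Nine hypothetical trees: six vote for class 1, three for class 0.
votes = [1, 0, 1, 1, 0, 1, 1, 0, 1]
print(forest_vote(votes))
```

Because each tree is trained on a different random sample, their individual errors tend to differ, and the majority vote averages many of them out.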
3.6 Logistic Regression Algorithm
Another classification algorithm we use is logistic regression. The algorithm takes its name from the function at its core, the logistic function, which some authors also call the sigmoid function because of its S-shaped curve. It takes any real-valued number and maps it to a value between 0 and 1. (In our case, 0 stands for honey bee and 1 stands for bumble bee.)

\[ \frac{1}{1 + e^{-\mathrm{value}}} \tag{1} \]

Here, e is the base of the natural logarithm (Euler’s number, or the EXP() function) and ‘value’ is the actual numerical value to be transformed. The coefficients (beta values b) of the logistic regression algorithm must be estimated from the training data. This is done using maximum-likelihood estimation, given whether each picture is labeled as honey bee or bumble bee. The best coefficients result in a model that predicts a value near 1 (i.e., bumble bee) for the default class and a value near 0 (i.e., honey bee) for the other class. The intuition behind maximum likelihood for logistic regression is that a search procedure seeks values for the coefficients that minimize the error between the probabilities predicted by the model and those in the data (e.g., a probability of 1 if the data belongs to the primary class). In binary or binomial logistic regression, the outcome is usually coded as “0” or “1”, as this leads to the most straightforward interpretation. The logarithm of the odds is the logit of the probability. The logit is defined as follows:

\[ \operatorname{logit}(p) = \ln\frac{p}{1 - p} \quad \text{for } 0 < p < 1 \tag{2} \]

\[ \operatorname{logit} E(Y) = \alpha + \beta x \tag{3} \]
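Equations (1) to (3) translate directly into code. The sketch below is a generic illustration; the coefficient values passed to `predict_proba` are hypothetical and are not fitted to the bee data.

```python
import math

def sigmoid(value):
    """Logistic function of Eq. (1), written to avoid overflow for large negative inputs."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    z = math.exp(value)
    return z / (1.0 + z)

def logit(p):
    """Log-odds of Eq. (2), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def predict_proba(x, alpha, beta):
    """Probability of class 1 (bumble bee) under the linear model of Eq. (3)."""
    return sigmoid(alpha + beta * x)

print(sigmoid(0.0))
print(predict_proba(1.0, alpha=-1.0, beta=1.0))  # hypothetical coefficients
```

The logit and the logistic function are inverses of each other, so a model that is linear in the log-odds, as in Eq. (3), yields probabilities by passing α + βx through Eq. (1).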
4 Results
The data is classified using the following algorithms: Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree, Random Forest (RF), and Logistic Regression. We use Python’s machine learning package scikit-learn to implement these algorithms and compute the confusion matrix to obtain the performance metrics. As part of this work, the aim was to create classifiers that accurately recognize the species of a bee as either a “Honey bee” (class 0) or a “Bumble bee” (class 1) from its photo. The features were obtained by feature extraction followed by dimensionality reduction. In the extraction step, histogram equalization was applied to each image, and the raw pixel values were then used to form our features. We analyze the performance of the classifiers using the confusion matrix. The SVM algorithm gave an accuracy varying between 75 and 80%, K-NN gave an accuracy of 80%, Decision Tree gave an accuracy of 71.2%, Random Forest gave an accuracy of 80%, and Logistic Regression gave an accuracy of 75.3%. The training data available for each class is highly skewed (the honey bee to bumble bee ratio is about 1:4). Thus, the error rate alone is not a reliable indicator of the performance of a classifier (at least for the original and cropped data, where the dataset is skewed). Hence, in addition to error rates, we use the area under the receiver operating characteristic (ROC) curve, also referred to as AUC, as a performance metric. Figure 6 shows the AUC performance of the classifiers (SVM, RF, LR), i.e., support vector machine, random forest, and logistic regression, respectively, on (a) the original dataset, (b) the dataset formed by cropping the images (ROI), and (c) the augmented dataset; panel (d) shows the decay of the singular values of the PCA features for the original, cropped, and augmented datasets.
Fig. 6 AUC curves for a the original dataset, b the dataset formed by cropping of images (ROI), c the augmented dataset; d shows the decay of singular values of the PCA features for the original, cropped, and augmented datasets
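Why accuracy can mislead on a 1:4 dataset, and why AUC is used alongside it, can be shown with a toy example. The data below is invented, and the `auc` helper computes the area under the ROC curve through its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative); it is a sketch, not the scikit-learn routine the authors presumably used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented 1:4 data: 2 honey bees (0) and 8 bumble bees (1).
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
always_bumble = [1] * 10      # a trivial classifier that ignores the image
constant_scores = [0.8] * 10  # the same classifier expressed as scores

print(accuracy(y_true, always_bumble))
print(auc(y_true, constant_scores))
```

The trivial classifier reaches 80% accuracy simply because bumble bees dominate, while its AUC of 0.5 shows that it cannot rank the two classes at all.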
5 Conclusion
This research presents the Naive bees classifier, which automatically classifies a bee as either a honey bee or a bumble bee with the help of various machine learning classification algorithms. The result indicates whether a given image shows a honey bee or a bumble bee. The dataset contains 3969 images of both types of bees. Once the data was loaded, we applied PCA for dimensionality reduction and flattened the images, giving 128,100 pixels (columns, or features) per image. After the data was generated, we applied the classification algorithms and checked the accuracy of each. The outcome of the Naive bees classifier helps researchers collect field data more quickly and efficiently, with the help of an automated machine. Pollinating bees play critical roles in both ecology and agriculture, and diseases such as colony collapse disorder threaten these insects. Identifying different species of bees in the field means that we can better understand the prevalence and growth of these important insects.
As a future expansion of this research, we can add images of other important insects that live in the wild and contribute to the balanced functioning of the ecosystem, and build a classification model that distinguishes each of them.
References
1. E. Tola, V. Lepetit, P. Fua, DAISY: an efficient dense descriptor applied to wide baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010)
2. O. Tuzel, F. Porikli, P. Meer, Region covariance: a fast descriptor for detection and classification, in Proceedings of the 9th European Conference on Computer Vision, Volume Part II, ECCV’06 (Springer-Verlag, Berlin, Heidelberg, 2006), pp. 589–600
Fuzzy Trust Based Secure Routing Protocol for Opportunistic Internet of Things
Nisha Kandhoul and S. K. Dhurandher
Abstract Opportunistic Internet of Things (OppIoT) is a network of Internet of Things (IoT) devices and communities formed by humans. Data is shared among the nodes in a broadcast manner, using the opportunistic contacts between humans, so devising a secure data transmission technique is necessary. As OppIoT comprises a wide range of devices such as sensors and smart devices, not all of them are capable of handling the complexity of security protocols. We incorporate fuzzy logic to add flexibility to the system. In this paper, the authors propose a fuzzy trust based routing protocol, FuzzyT_CAFE, for protecting the network against bad or good mouthing, Sybil, blackhole, and packet fabrication attacks. The protocol derives the trust of nodes in the network using four fuzzy attributes: Unfabricated Packet Ratio, Amiability, Forwarding Ratio, and Encounter Ratio. The output parameter is the Trust of the node, which determines whether the selected node is malicious and whether the message should be forwarded. Simulation results suggest that the proposed FuzzyT_CAFE protocol is more flexible and outperforms the base routing protocol T_CAFE in terms of unfabricated packets received, higher message delivery probability, and a very low dropped message count.
Keywords OppIoT · Fuzzy logic · Security · Trust
N. Kandhoul (B) Division of Information Technology, N.S.I.T, University of Delhi, New Delhi, India e-mail: [email protected]
S. K. Dhurandher Department of Information Technology, Netaji Subhas University of Technology, New Delhi, India e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_65
1 Introduction
Internet of Things (IoT) [1] is a collection of sensors, digital devices, and humans that are connected to the Internet. Devices forming the IoT are present everywhere and are used in domains such as health care, intelligent homes, and many more. Opportunistic Networks (OppNets) [2] are a type of Delay Tolerant Network, where routes are built on the fly whenever a device needs to send a message to other devices in the network. Opportunistic Internet of Things (OppIoT) [3] is an amalgamation of IoT and OppNets that brings together humans and a wide variety of devices such as smartphones, sensors, and RFID. The diversity of IoT devices and the broadcast technique of data sharing further aggravate the problems of user privacy [4] and data security. Attacks [5] on such networks can be passive, aiming at gathering information about the network and its users (e.g., eavesdropping), or active, aiming at harming the users and data (e.g., packet fabrication and blackhole attacks). The attackers obstruct the normal functioning of the network and waste the resources of power-limited devices. So, the need of the hour is to ensure the privacy and security of OppIoT devices. Cryptography based methods are not very successful in OppIoT, as they consume a lot of resources and key management is very difficult in the presence of attacker nodes in the network. Trust based routing is therefore well suited to securing OppIoT. Trust is the measure of belief that a node has in another node about its future behavior in the network. This paper proposes a fuzzy version of the existing T_CAFE [6], a trust based secure routing protocol. Fuzzy Logic (FL) brings fuzziness into the system and relaxes absolute conditions. FL can handle numerical data and linguistic knowledge simultaneously. FL allows the computation of trust, Fuzzy_Trust, based on several adaptive rules. These rules map several input parameters to Fuzzy_Trust, thereby reducing the errors in the system and making it more adaptive while guarding the network against attacks, viz. bad and good mouthing, black hole, Sybil, and packet fabrication.
The major contributions of this work include:
– Fuzzy based Trust computation: Relaxation is provided in trust computation using fuzzy logic, thereby making the system flexible.
– Detection and isolation of attackers: FuzzyT_CAFE isolates attackers from the packet forwarding procedure.
The remainder of the paper is arranged as follows. Section 2 presents the literature survey. The details of the proposed FuzzyT_CAFE are given in Sect. 3. Simulation results are discussed in Sect. 4. Section 5 provides the paper’s conclusion.
2 Related Work
In this section, existing works related to fuzzy based secure OppIoT networks are discussed. Cuka et al. [7] presented several fuzzy based systems for the selection of IoT devices in opportunistic networks. The input parameters used were device speed, distance, and residual energy, and the output parameter was the IoT device selection decision. The malicious behavior of nodes was not addressed.
Chhabra et al. [8] proposed FuzzyPT, which provides defense against black hole attacks by extracting information from the messages available in the buffer and from threat messages, and by applying fuzzy logic. FuzzyPT improved decision-making and reduced the number of false positives. The authors verified the proposition using a game theoretic approach. Xia et al. [9] proposed a fuzzy and trust based approach for character building of the nodes. This protocol dynamically computed fuzzy trust for setting up a routing path free from malicious nodes. Dhurandher et al. [10] proposed a fuzzy approach for geocasting in OppNets. The protocol employed several fuzzy attributes, namely movement, residual energy, and buffer space, for the selection of the next hop for message forwarding.
3 System Model
3.1 Motivation
Fuzzy logic takes decisions based on perception: exact values are not required, and errors are accepted within a range. Message forwarding decisions can be made based on certain attributes such as trust and forwarding behavior. The presence of malicious nodes strongly affects these decisions. A fuzzy controller can be used for this decision-making by taking these attributes as input, and its output can then be used for making forwarding decisions. Fuzzy logic reduces the complexity of the system without affecting its performance. This has motivated the authors to design FuzzyT_CAFE, a fuzzy version of T_CAFE, in which the fuzzy trust of the OppIoT devices is calculated using fuzzy parameters, making the system more flexible and thereby protecting it from attackers.
3.2 Proposed FuzzyT_CAFE Routing Protocol
FuzzyT_CAFE is a fuzzy extension of T_CAFE. The details of the computation of Trust can be found in [6]. When a node wants to send a packet, it initiates fuzzy trust computation for the neighboring nodes, as depicted in Fig. 1. In this work, the encounter ratio (ER), forwarding ratio (FR), amiability (Amb), and unfabricated packet ratio (UR) [6], as computed by T_CAFE, are the inputs, and trust is the defuzzified output of a fuzzy controller. The higher the values of the input parameters, the higher the trust. If FR is low, the trust is low, as this suggests a possible black hole. ER together with Amb measures the social behavior of the node and is used in the detection of Sybil nodes. UR is used to detect packet fabrication attacks. All these parameters combined form the trust. For malicious nodes, the value of trust is very low, whereas benign nodes, which are social and have a high forwarding rate, have higher trust. The fuzzy logic controller (FLC) fuzzifies Trust and adds flexibility to the system to bring it close to real-world behavior. Table 1 provides the rule base for the FLC. Figure 2 shows the membership functions of the fuzzy inputs and Fig. 3 shows the output variable. All values have been normalized and vary in the range 0–1.
Fig. 1 Fuzzy logic controller (blocks: T_CAFE computed parameters; Fuzzifier; fuzzy parameters: Fuzzy Unfabricated Ratio, Fuzzy Amiability, Fuzzy Forwarding Ratio, Fuzzy Encounter Ratio; Rule Based Processor; Defuzzifier; Fuzzy Trust)
Fig. 2 Input parameters: (a) Amiability, (b) Encounter Ratio, (c) Unfabricated Ratio, (d) Forwarding Ratio
Fig. 3 Trust
Table 1 Fuzzy rule base
S_No  ER  FR  UR  Amb  Trust
01    P   P   P   P    VL
02    P   P   P   G    VL
03    P   P   G   P    VL
04    P   G   P   P    L
05    G   P   P   P    VL
06    P   P   G   G    L
07    P   G   G   P    L
08    G   G   P   P    M
09    G   P   P   G    L
10    G   G   G   P    G
11    P   G   G   G    G
12    G   G   P   G    M
13    G   G   G   G    G
14    E   P   P   P    VL
15    P   E   P   P    M
16    P   P   E   P    M
17    P   P   P   E    M
18    G   P   P   E    M
19    E   P   P   G    G
20    P   P   G   E    M
21    G   P   E   P    G
22    E   G   P   P    G
23    P   E   G   P    M
24    G   G   E   G    VG
25    E   G   G   G    VG
26    G   E   G   G    G
27    G   G   G   E    G
28    G   G   P   E    M
29    E   G   G   P    G
30    G   E   G   P    G
31    P   G   E   G    M
32    P   E   G   G    M
33    E   G   G   E    VG
34    E   E   G   G    VG
35    G   G   E   E    G
36    E   E   E   E    E
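The way a rule base such as Table 1 turns four normalized inputs into a trust value can be sketched as follows. Everything numeric here is an assumption made for illustration: the triangular membership functions for the input levels P, G, and E, the singleton output values for the trust levels, the use of min for rule firing strength, and the weighted-average defuzzification are not taken from the paper, whose actual membership functions are given in Figs. 2 and 3. Only four rules (rows 01, 04, 08, and 13 of Table 1) are included to keep the sketch short.

```python
def tri(x, left, peak, right):
    """Triangular membership function with its maximum of 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Assumed input sets on the normalized range 0-1.
INPUT_SETS = {"P": (-0.5, 0.0, 0.5), "G": (0.0, 0.5, 1.0), "E": (0.5, 1.0, 1.5)}
# Assumed crisp value for each output trust level.
TRUST_VALUE = {"VL": 0.0, "L": 0.2, "M": 0.4, "G": 0.6, "VG": 0.8, "E": 1.0}

# (ER, FR, UR, Amb) -> Trust, rows 01, 04, 08, and 13 of Table 1.
RULES = [
    (("P", "P", "P", "P"), "VL"),
    (("P", "G", "P", "P"), "L"),
    (("G", "G", "P", "P"), "M"),
    (("G", "G", "G", "G"), "G"),
]

def fuzzy_trust(er, fr, ur, amb):
    """Weighted average of rule outputs; None if no rule fires."""
    inputs = (er, fr, ur, amb)
    num = den = 0.0
    for levels, trust in RULES:
        strength = min(tri(x, *INPUT_SETS[lv]) for x, lv in zip(inputs, levels))
        num += strength * TRUST_VALUE[trust]
        den += strength
    return num / den if den > 0 else None

print(fuzzy_trust(0.0, 0.0, 0.0, 0.0))
print(fuzzy_trust(0.5, 0.5, 0.5, 0.5))
```

With these assumptions, a node whose four ratios are all poor receives the lowest trust, and the trust rises smoothly as the ratios improve, instead of jumping at a hard threshold.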
4 Simulation Results
This section provides the details of the simulations performed using the ONE simulator [11]. Each simulation is run for 42,800 s. The performance of FuzzyT_CAFE is evaluated and compared with that of T_CAFE under varying Time to Live (TTL) of messages. A message of size 500 Kb–1 Mb is created every 25–35 s. TTL is varied in the range of 100–300 min, and the results are captured in Figs. 4 and 5. From these figures, it is clear that the fuzzy version produces results comparable to those of the basic version while adding flexibility to the system.
Figure 4 shows the impact of changing the TTL of messages on various performance parameters. Figure 4a shows that the probability of message delivery falls with rising TTL, as the messages now stay in the buffer for a longer period of time, which eventually leads to their being dropped. The average delivery probability for FuzzyT_CAFE is 0.34, which is only 6% lower than that of T_CAFE. Figure 4b captures the number of messages dropped with rising TTL. The average number of messages dropped for FuzzyT_CAFE is 5608. With increasing TTL, the count of unfabricated messages received at the devices reduces, as malicious nodes tend to get detected with increasing time, as shown in Fig. 4c. The average count of unfabricated messages for FuzzyT_CAFE is 1176 and for T_CAFE is 1005 which is only 1.7% lower.
Figure 5 shows the impact of changing the percentage of malicious nodes present in the network on the performance metrics considered. Figure 5a shows that the probability of message delivery falls with rising maliciousness in the network. The average delivery probability for FuzzyT_CAFE is 0.3, which is only 1% lower than that of T_CAFE. Figure 5b captures the number of messages dropped with a rising count of malicious nodes in the network. The average number of messages dropped for FuzzyT_CAFE is 5472. With increasing malicious nodes in the network, the count of unfabricated messages received at the devices rises, as shown in Fig. 5c. The average count of unfabricated messages for FuzzyT_CAFE is 1103 and for T_CAFE is 1012.
Fig. 4 Performance metrics versus TTL: (a) Delivery probability, (b) Messages dropped, (c) Unfabricated packets received (x-axis: TTL in minutes, 100–300; series: T_CAFE and FuzzyT_CAFE)
Fig. 5 Performance metrics versus percentage of malicious nodes: (a) Delivery probability, (b) Messages dropped, (c) Unfabricated packets received (x-axis: percentage of malicious nodes, 5–25; series: T_CAFE and FuzzyT_CAFE)
5 Conclusion
A novel fuzzy trust based secure routing protocol, FuzzyT_CAFE, was proposed for OppIoT networks. The simulation results indicate that FuzzyT_CAFE performs well compared with T_CAFE without using more resources, and that it requires very little computation. As future work, FuzzyT_CAFE can be tested on real mobility traces.
References
1. L. Atzori, A. Iera, G. Morabito, The internet of things: a survey. Comput. Netw. 54(15), 2787–2805 (2010)
2. L. Pelusi, A. Passarella, M. Conti, Opportunistic networking: data forwarding in disconnected mobile ad hoc networks. IEEE Commun. Mag. 44(11), 134–141 (2006)
3. B. Guo, D. Zhang, Z. Wang, Z. Yu, X. Zhou, Opportunistic IoT: exploring the harmonious interaction between human and the internet of things. J. Netw. Comput. Appl. 36(6), 1531–1539 (2013)
4. S. Sicari, A. Rizzardi, L.A. Grieco, A. Coen-Porisini, Security, privacy and trust in internet of things: the road ahead. Comput. Netw. 76, 146–164 (2015)
5. T.N.D. Pham, C.K. Yeo, Detecting colluding blackhole and greyhole attacks in delay tolerant networks. IEEE Trans. Mob. Comput. 15(5), 1116–1129 (2016)
6. N. Kandhoul, S.K. Dhurandher, I. Woungang, T_CAFE: a trust based security approach for opportunistic IoT. IET Commun. (2019)
7. M. Cuka, D. Elmazi, K. Bylykbashi, E. Spaho, M. Ikeda, L. Barolli, Implementation and performance evaluation of two fuzzy-based systems for selection of IoT devices in opportunistic networks. J. Ambient Intell. Hum. Comput. 10(2), 519–529 (2019)
8. A. Chhabra, V. Vashishth, D.K. Sharma, A fuzzy logic and game theory based adaptive approach for securing opportunistic networks against black hole attacks. Int. J. Commun. Syst. 31(4), e3487 (2018)
9. Hui Xia, Zhiping Jia, Ju Lei, Youqin Zhu, Trust management model for mobile ad hoc network based on analytic hierarchy process and fuzzy theory. IET Wirel. Sens. Syst. 1(4), 248–266 (2011)
10. S.K. Dhurandher, J. Singh, I. Woungang, M. Takizawa, G. Gupta, R. Kumar, Fuzzy geocasting in opportunistic networks, in International Conference on Broadband and Wireless Computing, Communication and Applications (Springer, Cham, 2019), pp. 279–292
11. A. Keränen, J. Ott, T. Kärkkäinen, The one simulator for DTN protocol evaluation, in Proceedings of the 2nd International Conference on Simulation Tools and Techniques (ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2009), pp. 1–10
Student’s Performance Prediction Using Data Mining Technique Depending on Overall Academic Status and Environmental Attributes
Syeda Farjana Shetu, Mohd Saifuzzaman, Nazmun Nessa Moon, Sharmin Sultana, and Ridwanullah Yousuf
Abstract In the education sector, it has been a challenging task to identify students individually so that appropriate actions can be taken to get the best outcome from them. At the same time, a student pursuing higher education should know the market demands and where their own weaknesses lie. If data can be obtained from students’ academic records and their perceptions of factors related to academic performance, it may help to understand the reasons for success and failure, which would be very useful for the educational environment and for students’ success rates. We collect data from students of different institutes. First, we create an online survey form to gather the data, and then we process it to extract valuable information. We then visualize the data and analyze it from different perspectives. We try to obtain specific knowledge that can be crucial for students’ success or failure in an academic environment. We apply a data mining technique, the decision tree algorithm (J48), to develop a model that shows the hierarchy of the different attributes, related to students’ academic performance and personal behavior, that affect a student’s academic status. Then we try to determine which attributes have a positive and which have a negative impact on students’ academic growth.
Keywords Educational data mining · J48 algorithm · ID3 algorithm · Data mining
S. F. Shetu (B) · M. Saifuzzaman · N. N. Moon · S. Sultana · R. Yousuf Department of Computer Science & Engineering, Daffodil International University, Dhaka, Bangladesh e-mail: [email protected]
M. Saifuzzaman e-mail: [email protected]
N. N. Moon e-mail: [email protected]
S. Sultana e-mail: [email protected]
R. Yousuf e-mail: [email protected]
M. Saifuzzaman Department of Computer Science, Jahangirnagar University, Dhaka, Bangladesh
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_66
1 Introduction
Data mining is the process of collecting data from some domain and analyzing it to extract hidden and useful information that can be stored as knowledge. This knowledge can then be used to improve decision-making. Educational data mining is a research field concerned with information generated from educational environments through data mining methodology. Educational institutes are the base of a nation, and education is its backbone. So every parent, student, tutor, or other person involved in education is continuously trying to improve the quality of education and its outcomes. Effective learning and good outcomes are expected by everyone, yet it is hard for any institution to guarantee them to its students. Students and parents, for their part, are even more eager for better outcomes. The significance of educational data mining can be described for different levels of an educational setting, but here we focus on students and their academic results. Improving the learning system effectively is every institution’s desire. If an institution can understand its students’ learning behaviors and activity patterns, it can adjust them for weaker students to obtain improved outcomes. In our research, we collected data related to students’ academic and personal lives that may affect their academic excellence. We then classify the students based on their academic results and try to find the reasons for their success or failure.
2 Literature Review
Our background study draws on several sources that address student performance, academic status, and the overall environment. Our research combines two major aspects, academic performance and the educational environment. Hussain et al. [1] and Costa et al. [2] used Educational Data Mining (EDM) to study students’ academic failure. Hussain et al. [1] applied four classification methods: J48, PART, Random Forest, and Bayes Network. They gathered data from three colleges in Assam, India. By using the Apriori algorithm, they reported an accuracy of 99%. Costa et al. [2] applied artificial intelligence in education and automatic prediction to analyze failure rates. They collected their data from Brazilian public universities. They
Student’s Performance Prediction Using Data Mining Technique …
used the support vector machine to identify students likely to fail. Asif et al. [3] collected four years of student data and divided the students into low and high achievers based on their marks. Kumar et al. [4] proposed a model that predicts students’ achievement in individual subjects, together with their weak and strong points; this research used five types of classifiers: rule-based, Naïve Bayes, neural network, decision tree, and k-nearest neighbor. Mueen et al. [5] used data mining in the educational sector to predict students’ academic performance from their academic records and forum participation. They obtained a prediction accuracy of 86% using Naïve Bayes, neural network, and decision tree classifiers. Amrieh et al. [6] proposed a model that can predict students’ behavioral features using a set of classifiers: artificial neural network, Naïve Bayes, and decision tree. They also used Bagging, Boosting, and Random Forest (RF), and with these methods their overall accuracy was 80%. Their method is useful for predicting students’ behavior from the data of an e-learning management system. Yehuala [7] proposed a data mining approach that can predict students’ success and failure rates. This method extracts hidden patterns from students’ data and finds relations between different variables. The study used 11,873 records with the Naïve Bayes and J48 decision tree algorithms and obtained 92.3% accuracy. Amrieh et al. [8] and Shahiri et al. [9] used educational data mining to predict behavioral features of students in an e-learning system. Using the Experience API web service (xAPI), they collected data from Kalboard 360 and applied data mining techniques such as artificial neural network, Naïve Bayes, and decision tree.
Borkar et al. [10] presented a method for predicting students’ academic performance using the Apriori algorithm, a neural network, and association rule mining. They used 60 records, and their accuracy was 46%, which is below average. Ahmed et al. [11] and Ramesh et al. [12] applied the ID3 algorithm and Knowledge Discovery in Databases (KDD) [11] to analyze students’ academic performance and also applied a multilayer perceptron. They used various types of variables, and their final accuracy was 72.38% [12]. Kabakchieva et al. [13] also used Educational Data Mining (EDM) to predict students’ performance. They extracted their dataset in several phases from three admission campaigns of a Bulgarian university, but they did not reach the accuracy they desired. Baradwaj et al. [14] used the decision tree method with some specific attributes to extract hidden knowledge that describes students’ performance. Yadav et al. [15] proposed a model for improving the performance of engineering students using the C4.5, ID3, and CART algorithms. They reported a true positive rate of 0.786 for the ID3 and C4.5 decision tree algorithms. Osmanbegovic et al. [16] collected a large dataset from the Economics department of the University of Tuzla in the academic year 2010–2011, and their proposed model can help students in their higher education. Kabra et al. [17] and Ramaswami et al. [18] used educational data mining and the CHAID prediction model to assess students’ performance. They collected datasets of 772 and 346 students, and their accuracies were 0.907 and 44.69% (the latter from the CHAID model). Oyelade et al. [19] applied K-means clustering
methods to predict students’ performance records. Baker et al. [20] used Bayesian Knowledge Tracing to predict student performance after the use of an intelligent tutor. This tutor consists of 19 modules from 15 universities in North America. Thai-Nghe et al. [21] used a recommender system and matrix factorization to predict student performance. This research used large datasets from 2008 to 2009 for Algebra and Bridge. To sum up, many researchers have tried to solve educational problems using data mining techniques and algorithms in order to improve students’ academic performance. In our research, we focus on students’ overall academic performance as well as on the educational environment, which is very important for every student in improving their academic results.
3 Methodology
In this research, we aim to predict students’ CGPA by classification and to relate it to the important attributes that affect students’ results. At the end of the process, we are able to say which attributes have more effect on the result and which have less. We build a decision tree from the training data and derive a model from it. Data mining predicts the future by means of modeling (Fig. 1).
Fig. 1 Necessary steps of extracting outcome from data
3.1 Data Collection
We collected data from private and public universities: 257 responses from private universities and 186 responses from public universities. The dataset covers the respondents’ overall academic performance, activities, behavior, mental health, and results.
3.2 Data Pre-processing
As stated above, we aim to predict students’ CGPA by classification and to relate it to the attributes that affect students’ results, so that we can say which attributes have more effect on the result and which have less. We build a decision tree from the training data and derive a model from it (Fig. 2). The 257 responses from private universities are described in Table 1, and the 186 responses from public universities in Table 2. A decision tree can easily be converted into a set of rules by following each path from the root node to a leaf node. We have done so to form a predictive model that can predict students’ results.
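The conversion of a decision tree into rules can be sketched as follows. This is a minimal illustration, not the tree reported in this paper: the toy tree, its attribute values, and the helper names `tree_to_rules` and `format_rule` are hypothetical, chosen only to show how each root-to-leaf path becomes one IF-THEN rule.

```python
# A tree is either a class label (a leaf, given as a string) or a pair
# (attribute, {attribute_value: subtree}) for a decision node.

def tree_to_rules(tree, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if isinstance(tree, str):  # leaf: the path so far is a complete rule
        return [(list(conditions), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


def format_rule(conditions, label):
    """Render a rule as readable IF-THEN text."""
    if not conditions:
        return f"THEN CGPA = {label}"
    tests = " AND ".join(f"{attr} = {val}" for attr, val in conditions)
    return f"IF {tests} THEN CGPA = {label}"


# Hypothetical toy tree (NOT the tree shown in the paper's figures).
toy_tree = ("ClassTest", {
    "Good": ("Attendance", {
        "Highly regular": "A",
        "Regular": "B",
        "Irregular": "C",
    }),
    "Average": "C",
    "Poor": "D",
})

rules = tree_to_rules(toy_tree)
for conditions, label in rules:
    print(format_rule(conditions, label))
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaves in the tree.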
Fig. 2 Students attributes models
Table 1 Data description for Private University

| Attribute name | Probable values | Data type |
|---|---|---|
| CGPA | A, B, C, D, E | Categorical/nominal |
| Gender | Male, female | Categorical/nominal |
| Attendance | Highly regular, regular, irregular | Categorical/nominal |
| GroupStudy | Often, sometimes, never | Categorical/nominal |
| ClassTest | Good, average, poor | Categorical/nominal |
| Attention | Good, average, bad | Categorical/nominal |
| Responsiveness | Yes, sometimes, no | Categorical/nominal |
| Contents | Yes, enough, no | Categorical/nominal |
| Interaction | Good, average, bad | Categorical/nominal |
| Depression | Yes, sometimes, no | Categorical/nominal |
| Social Media | High, average, low | Categorical/nominal |
| ExtraCurriculuar | Yes, little, no | Categorical/nominal |
| Financial | Often, sometimes, rare | Categorical/nominal |
| Affair | Yes, no | Categorical/nominal |
| DrugAddiction | Yes, little, no | Categorical/nominal |
3.3 Data Classification or Mining
As our objective was to predict students’ results, we classified the result into five classes:
A = Students who have a great result and great potential
B = Students who have a good result and good potential
C = Students who have an average result but can improve
D = Students who have a poor result and need care to improve
E = Students who have a poor result and may fail
A decision tree builds classification or regression models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while incrementally developing the associated decision tree. The end result is a tree with decision nodes and leaf nodes. A decision node has two or more branches. A leaf node represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
Table 2 Data description for Public University

| Attribute | Probable values | Data type |
|---|---|---|
| CGPA | A, B, C, D, E | Categorical/nominal |
| Gender | Male, female | Categorical/nominal |
| Attendance | Highly regular, regular, irregular | Categorical/nominal |
| GroupStudy | Often, sometimes, rare | Categorical/nominal |
| PLStudy | Regular, sometimes, never | Categorical/nominal |
| StudyAtHall | Often, sometimes, never | Categorical/nominal |
| ClassTest | Good, average, poor | Categorical/nominal |
| Attention | Good, average, poor | Categorical/nominal |
| Responsiveness | Often, sometimes, rare | Categorical/nominal |
| Content | Very much, enough, a little | Categorical/nominal |
| ClassEnvironment | Good, average, poor | Categorical/nominal |
| Interaction | Good, average, poor | Categorical/nominal |
| Tuition | High, average, low | Categorical/nominal |
| Social Media | High, average, low | Categorical/nominal |
| DrugAddiction | Regular, sometimes, never | Categorical/nominal |
In the decision tree we generated, each leaf node is therefore one of the classes above, and its parent nodes are the attributes that influence that class. The root node is the attribute that affects a student’s result the most; the next most important attributes appear at the second level, and so on. We first applied attribute selection methods to the datasets to get an idea of which attributes are most strongly related to the CGPA class. We used three attribute selection methods:
1. Gain Ratio Feature Evaluator: evaluates the worth of an attribute by measuring the gain ratio with respect to the class. GainR(Class, Attribute) = (H(Class) − H(Class|Attribute))/H(Attribute).
Table 3 Attribute selection methods’ results for Private University

| Attribute | GainRatio | InfoGain | Correlation |
|---|---|---|---|
| ClassTest | 0.11675 | 0.13148 | 0.1844 |
| Attendance | 0.06425 | 0.07347 | 0.1547 |
| Contents | 0.05466 | 0.07833 | 0.0963 |
| Gender | 0.05155 | 0.0472 | 0.1408 |
| Depression | 0.03733 | 0.05864 | 0.106 |
| Social Media | 0.03487 | 0.04272 | 0.0964 |
| Financial | 0.03454 | 0.05225 | 0.0814 |
| Attention | 0.02328 | 0.02864 | 0.0551 |
| GroupStudy | 0.01695 | 0.02672 | 0.0478 |
| ExtraCurriculuar | 0.01482 | 0.02336 | 0.0495 |
| Interaction | 0.01389 | 0.01932 | 0.0334 |
| DrugAddiction | 0.01062 | 0.01129 | 0.0655 |
| Affair | 0.0091 | 0.00853 | 0.0597 |
| Responsiveness | 0.00852 | 0.0113 | 0.0481 |
2. Information Gain Ranking Filter: evaluates the worth of an attribute by measuring the information gain with respect to the class. InfoGain(Class, Attribute) = H(Class) − H(Class|Attribute).
3. Correlation Ranking Filter: evaluates the worth of an attribute by measuring the (Pearson’s) correlation between it and the class.
Table 3 shows the results of the attribute selection methods for the Private University dataset, and Table 4 shows those for the Public University dataset.
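The information gain and gain ratio scores used for ranking can be sketched in a few lines. This is a minimal illustration under stated assumptions: the four-student toy data and the function names are hypothetical, and the code follows the textbook definitions InfoGain = H(Class) − H(Class|Attribute) and GainR = InfoGain / H(Attribute) rather than reproducing any particular tool's implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def conditional_entropy(attribute, labels):
    """H(Class | Attribute): entropy of the class within each attribute value,
    weighted by how often that value occurs."""
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def info_gain(attribute, labels):
    return entropy(labels) - conditional_entropy(attribute, labels)


def gain_ratio(attribute, labels):
    split_info = entropy(attribute)  # H(Attribute)
    return info_gain(attribute, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: four students, two candidate attributes.
cgpa = ["A", "A", "B", "B"]
attributes = {
    "ClassTest": ["Good", "Good", "Poor", "Poor"],   # separates the classes perfectly
    "Gender": ["Male", "Female", "Male", "Female"],  # carries no class information
}

# Rank attributes by gain ratio, highest first.
ranking = sorted(attributes,
                 key=lambda name: gain_ratio(attributes[name], cgpa),
                 reverse=True)
print(ranking)
```

Gain ratio divides by H(Attribute) to penalize attributes with many distinct values, which plain information gain tends to favor.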
3.4 Data Interpretation
After applying the attribute selection methods, we can see which attributes have a stronger relationship with the result. We then apply a decision tree classification algorithm (J48) to generate the predictive model. The core algorithm for building decision trees, ID3 by J. R. Quinlan, uses a top-down, greedy search through the space of possible branches without backtracking. ID3
Table 4 Attribute selection methods’ results for Public University

| Attribute name | GainRatio | InfoGain | Correlation |
|---|---|---|---|
| ClassTest | 0.17374 | 0.2493 | 0.1866 |
| Attendance | 0.09853 | 0.1225 | 0.1316 |
| DrugAddiction | 0.07635 | 0.0421 | 0.0983 |
| Content | 0.07504 | 0.1111 | 0.1126 |
| Interaction | 0.06794 | 0.0777 | 0.0981 |
| PLStudy | 0.03752 | 0.055 | 0.0823 |
| Social Media | 0.03694 | 0.0436 | 0.0848 |
| Tuition | 0.03611 | 0.0544 | 0.0653 |
| Gender | 0.01843 | 0.0153 | 0.0678 |
| GroupStudy | 0.01828 | 0.0283 | 0.056 |
| ClassEnvironment | 0.01503 | 0.021 | 0.0311 |
| Responsiveness | 0.01501 | 0.0236 | 0.0544 |
| StudyAtHall | 0.01484 | 0.0193 | 0.0484 |
| Attention | 0.00846 | 0.0129 | 0.0403 |
makes use of entropy and information gain to build a decision tree, using the following formulas.
Entropy of one attribute:
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
Entropy of two attributes:
$$E(T, X) = \sum_{c \in X} P(c)\, E(c)$$
Information gain:
$$\mathrm{Gain}(T, X) = E(T) - E(T, X)$$
We evaluate the model with the True Positive (TP) rate, the proportion of instances classified as class x among all instances that actually belong to class x, and with Precision, the proportion of instances that actually belong to class x among all instances classified as class x.
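As a worked illustration of the entropy and gain formulas above, consider a hypothetical set T of 14 students, 9 labelled "good" and 5 labelled "poor", split by an attribute X with three values whose subsets contain (2 good, 3 poor), (4 good, 0 poor), and (3 good, 2 poor). These counts are assumed for illustration only and are not taken from our datasets.

```latex
\begin{align*}
E(T) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
E(2,3) &= -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971,
  \qquad E(4,0) = 0, \qquad E(3,2) \approx 0.971 \\
E(T, X) &= \tfrac{5}{14}\,(0.971) + \tfrac{4}{14}\,(0) + \tfrac{5}{14}\,(0.971) \approx 0.694 \\
\mathrm{Gain}(T, X) &= E(T) - E(T, X) \approx 0.940 - 0.694 \approx 0.247
\end{align*}
```

ID3 computes this gain for every candidate attribute and splits on the one with the largest value, then repeats the procedure on each resulting subset.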
4 Results and Outcome
Decision Tree (J48) Result for Private University
For Private University, the highest TP rate is 0.818, for class B, and the highest precision is 0.898, for class E. The weighted averages are 0.774 for the TP rate and 0.791 for precision (Fig. 3 and Tables 5 and 6).
Fig. 3 Decision tree generated from data set of Private University

Table 5 Result for generating model for Private University

| Measure | Value |
|---|---|
| Correctly classified instances | 77.4319% |
| Incorrectly classified instances | 22.5681% |
| Kappa statistic | 0.6952 |
| Mean absolute error | 0.1087 |
| Root mean squared error | 0.2331 |
| Relative absolute error | 37.4531% |
| Root relative squared error | 61.2592% |
| Total number of instances | 257 |

Table 6 Result for each class

| Class | TP rate | Precision |
|---|---|---|
| A | 0.737 | 0.519 |
| B | 0.818 | 0.675 |
| C | 0.765 | 0.703 |
| D | 0.698 | 0.800 |
| E | 0.815 | 0.898 |
| Weighted avg. | 0.774 | 0.791 |
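The per-class TP rate and precision, and their weighted averages, can be computed from a confusion matrix as sketched below. The three-class confusion matrix is hypothetical (the paper does not report one), and the function names are illustrative. The sketch also shows why the weighted-average TP rate (0.774) matches the share of correctly classified instances (77.4319%): weighting each class's TP rate by its number of instances simply recovers overall accuracy, and 77.4319% of 257 instances corresponds to 199 correct predictions.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of instances of true class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        actual = sum(confusion[k])                    # row sum: instances truly in class k
        predicted = sum(row[k] for row in confusion)  # column sum: instances predicted as k
        metrics[label] = {
            "tp_rate": true_positive / actual if actual else 0.0,
            "precision": true_positive / predicted if predicted else 0.0,
            "support": actual,
        }
    return metrics


def weighted_average(metrics, key):
    """Average a per-class metric, weighting each class by its number of instances."""
    total = sum(m["support"] for m in metrics.values())
    return sum(m[key] * m["support"] for m in metrics.values()) / total


# Hypothetical confusion matrix for three classes (illustration only).
labels = ["A", "B", "C"]
confusion = [
    [8, 2, 0],
    [1, 15, 4],
    [0, 3, 17],
]

metrics = per_class_metrics(confusion, labels)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))
weighted_tp_rate = weighted_average(metrics, "tp_rate")

# Consistency check against the reported summary: 77.4319% of 257 instances.
correct_private = round(257 * 77.4319 / 100)
print(weighted_tp_rate, accuracy, correct_private)
```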
Fig. 4 Decision tree generated from data set of Public University
Decision Tree (J48) Result for Public University
After applying the attribute selection methods, we could see that in both Private and Public University, class tests, social media, interaction, and group study had a great impact on the result (Fig. 4). After analyzing the raw data we collected directly from students, we built a predictive model for each educational environment. We had 257 instances for Private University and 186 for Public University. In both environments, we obtained an accuracy above 75% (Table 7).

Table 7 Result for generating model for Public University

| Measure | Value |
|---|---|
| Correctly classified instances | 75.8065% |
| Incorrectly classified instances | 24.1935% |
| Kappa statistic | 0.6912 |
| Mean absolute error | 0.1183 |
| Root mean squared error | 0.2432 |
| Relative absolute error | 37.5045% |
| Root relative squared error | 61.2526% |
| Total number of instances | 186 |
Table 8 Result for each class

| Class | TP rate | Precision |
|---|---|---|
| A | 0.860 | 0.754 |
| B | 0.846 | 0.767 |
| C | 0.650 | 0.703 |
| D | 0.686 | 0.774 |
| E | 0.682 | 0.833 |
| Weighted avg. | 0.758 | 0.759 |
The attributes and their impact on the result are described as levels: the lower an attribute’s level, the greater its impact on the result (Table 8).
5 Conclusion
In this article, we explored the potential of educational data mining techniques. A large amount of data can be collected from students, administration, institutions, parents, and so on. The data can then be processed and analyzed to find relations in the data, learning patterns, and previously unknown information, which can be extracted as knowledge and used for better decision-making. Evaluating this knowledge gives us the status of students’ academic success, and administrative and managerial staff can use that information for better oversight. Students can also learn about their shortcomings and about the factors that help them perform well and achieve good results. This research project aims to assess the capabilities of data mining techniques in educational environments. We gathered data from many students of various educational institutes and applied established data mining techniques to produce a valid output. One limitation of our research is that it predicts results separately for different educational settings rather than with a single, general system. Another is that it does not provide exact decisions for the administration level or for a learning/tutoring system. This research is a preliminary experiment in extracting knowledge from large student data domains and in discussing the possibilities of data mining and its techniques in the educational sector. Further improvement is needed to deploy a system that can identify students and support them according to their needs. Our goal was to predict a student’s performance from the overall academic status and environment. Further research and implementation are required to realize this potential, and more factors with a strong effect on students’ performance should be identified to obtain more accurate results.
References
1. S. Hussain, N.A. Dahan, F.M. Ba-Alwib, N. Ribata, Educational data mining and analysis of students’ academic performance using WEKA. Indones. J. Electr. Eng. Comput. Sci. 9(2), 447–459 (2018)
2. E.B. Costa, B. Fonseca, M.A. Santana, F.F. de Araújo, J. Rego, Evaluating the effectiveness of educational data mining techniques for early prediction of students’ academic failure in introductory programming courses. Comput. Hum. Behav. 73, 247–256 (2017)
3. R. Asif, A. Merceron, S.A. Ali, N.G. Haider, Analyzing undergraduate students’ performance using educational data mining. Comput. Educ. 113, 177–194 (2017)
4. M. Kumar, A.J. Singh, D. Handa, Literature survey on student’s performance prediction in education using data mining techniques. Int. J. Educ. Manag. Eng. 7(6), 42–49 (2017)
5. A. Mueen, B. Zafar, U. Manzoor, Modeling and predicting students’ academic performance using data mining techniques. Int. J. Mod. Educ. Comput. Sci. 8(11), 36.s (2016)
6. E.A. Amrieh, T. Hamtini, I. Aljarah, Mining educational data to predict student’s academic performance using ensemble methods. Int. J. Database Theory Appl. 9(8), 119–136 (2016)
7. M.A. Yehuala, Application of data mining techniques for student success and failure prediction (The case of Debre_Markos University). Int. J. Sci. Technol. Res. 4(4), 91–94 (2015)
8. E.A. Amrieh, T. Hamtini, I. Aljarah, Preprocessing and analyzing educational data set using X-API for improving student’s performance, in 2015 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT) (IEEE, 2015), pp. 1–5
9. A.M. Shahiri, W. Husain, A review on predicting student’s performance using data mining techniques. Procedia Comput. Sci. 72, 414–422 (2015)
10. S. Borkar, K. Rajeswari, Attributes selection for predicting students’ academic performance using education data mining and artificial neural network. Int. J. Comput. Appl. 86(10) (2014)
11. A.B.E.D. Ahmed, I.S. Elaraby, Data mining: a prediction for student’s performance using classification method. World J. Comput. Appl. Technol. 2(2), 43–47 (2014)
12. V. Ramesh, P. Parkavi, K. Ramar, Predicting student performance: a statistical and data mining approach. Int. J. Comput. Appl. 63(8) (2013)
13. D. Kabakchieva, Predicting student performance by using data mining methods for classification. Cybern. Inf. Technol. 13(1), 61–72 (2013)
14. B.K. Baradwaj, S. Pal, Mining educational data to analyze students’ performance. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 2(6), 2011 (2012)
15. S.K. Yadav, S. Pal, Data mining: a prediction for performance improvement of engineering students using classification. World Comput. Sci. Inf. Technol. J. (WCSIT) 2(2), 51–56 (2012). ISSN: 2221-0741
16. E. Osmanbegovic, M. Suljic, Data mining approach for predicting student performance. Econ. Rev. J. Econ. Bus. 10(1), 3–12 (2012)
17. R.R. Kabra, R.S. Bichkar, Performance prediction of engineering students using decision trees. Int. J. Comput. Appl. 36(11), 8–12 (2011)
18. M. Ramaswami, R. Bhaskaran, A CHAID based performance prediction model in educational data mining. IJCSI Int. J. Comput. Sci. Issues 7(1) (2010). ISSN (Online): 1694-0784, ISSN (Print): 1694-0814
19. O.J. Oyelade, O.O. Oladipupo, I.C. Obagbuwa, Application of k means clustering algorithm for prediction of students academic performance. Int. J. Comput. Sci. Inf. Secur. (IJCSIS) 7(1) (2010)
20. R.S. Baker, A.T. Corbett, S.M. Gowda, A.Z. Wagner, B.A. MacLaren, L.R. Kauffman, A.P. Mitchell, S. Giguere, Contextual slip and prediction of student performance after use of an intelligent tutor, in International Conference on User Modeling, Adaptation, and Personalization (Springer, Berlin, Heidelberg, 2010), pp. 52–63
21. N. Thai-Nghe, L. Drumond, A. Krohn-Grimberghe, L. Schmidt-Thieme, Recommender system for predicting student performance. Procedia Comput. Sci. 1(2), 2811–2819 (2010)
Evaluate and Predict Concentration of Particulate Matter (PM2.5) Using Machine Learning Approach
Shaon Hossain Sani, Akramkhan Rony, Fyruz Ibnat Karim, M. F. Mridha, and Md. Abdul Hamid
Abstract Particulate Matter (PM2.5) is a general term for a mixture of solid particles and liquid droplets. Compared with other air pollutants, PM2.5 is the most serious one associated with death and disease. Here, we focus on the concentration of PM2.5 in Dhaka city. With our proposed predictive model, we can predict hourly PM2.5 concentrations. The ambient air quality data were collected from October 2016 to March 2019. We used an Artificial Neural Network (ANN) to fill the missing values of our dataset and an ensemble model (StackNet) to predict PM2.5. We obtained an RMSE of 26.93 and a Pearson correlation coefficient (R) of 0.9307 for the BARC dataset, and an RMSE of 25.36 and an R of 0.9620 for the Darussalam dataset.
Keywords Particulate matter (PM2.5) · ANN · StackNet · Air pollution
S. H. Sani · A. Rony · F. I. Karim Department of CSE, University of Asia Pacific, Dhaka, Bangladesh e-mail: [email protected] A. Rony e-mail: [email protected] F. I. Karim e-mail: [email protected] M. F. Mridha (B) Department of CSE, Bangladesh University of Business and Technology, Dhaka, Bangladesh e-mail: [email protected] Md. Abdul Hamid Department of Information Technology, King Abdulaziz University Jeddah, Jeddah, Saudi Arabia e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_67
S. H. Sani et al.
1 Introduction
Air pollution is one of the major environmental problems in Dhaka city. Bangladesh is developing rapidly, and as a result the number of mills, factories, industries, brick kilns, and diesel generators is increasing steadily. In addition, the number of motorized vehicles in Dhaka has risen substantially, with a disproportionate increase in a diverse mixture of old-technology vehicles, while road space is shrinking and traffic congestion is becoming unmanageable. The main contributors to air pollution are brick kilns, which emit NOx, SO2, and CO gases along with fine particles of coal [1, 2]. Motorized vehicles emit nitrogen oxides at a high rate by burning fossil fuel, and these are regarded as a major pollutant. The many industries in Dhaka emit air pollutants such as particulate matter, hydrocarbons, ammonia, and other chemicals. Air pollution has become one of the most pressing issues for every citizen living in Dhaka. Recently, much research has been done on air pollution because of adverse climate change and severe health effects. We work on Particulate Matter (PM2.5) in particular, whose sources differ across the areas of Dhaka. Bilkis A. Begum [3] found that particulate matter increases during the winter season in semi-urban as well as urban areas of Dhaka. The carbon concentration at BARC (Farmgate) was much higher than at Darussalam (Gabtoli). A huge number of vehicles move around the Farmgate area from daytime until late at night, and most of them have dual-fuel engines, which cause air pollution.
An Air Quality Index (AQI) value between 151 and 200 indicates that everyone may experience health effects, and the air quality is classified as “Unhealthy”. Members of sensitive groups are of particular concern because they may suffer from various cardiovascular diseases. An AQI between 201 and 300 is classified as “Very Unhealthy”, and a score between 301 and 500 as “Extremely Unhealthy”. Dhaka, one of the most populated cities in the world, has been struggling with air pollution for a long time and has held the top position among the world’s most polluted cities. According to Air Visual, the air quality in Dhaka is “Very Unhealthy”, which ranks it top among all cities in the world. Particulate matter is released directly into the atmosphere from sources such as forest fires, trucks, cars, heavy equipment, and other burning activities such as wood-fired boilers, wood stoves, and waste burning. Primary particles also consist of crustal material from sources such as stone crushing, unpaved roads, construction sites, and metallurgical operations. The major sources of air pollution, however, are transportation engines, power and heat generation, industrial processing, and the burning of solid waste. The air quality declines further during the dry season from October to April but improves during the monsoon, because fine particles follow seasonal patterns. Particulate matter comes in two varieties, PM2.5 and PM10. PM2.5 consists of fine particles with aerodynamic diameters less than or equal to 2.5 µm, and PM10 of particles with diameters less than or equal to 10 µm. An extensive body of scientific evidence shows that short-
Evaluate and Predict Concentration of Particulate Matter …
or long-term exposure to tiny particles can cause adverse cardiovascular effects, including heart attacks and strokes [4, 5], resulting in hospitalizations and, in some cases, premature death. Recently, the number of patients with diseases related to air pollutants has increased dramatically because of poor air quality. In our proposed machine learning approach, we use an ensemble model (StackNet) based on GradientBoostingRegressor, ExtraTreesRegressor, and MLPRegressor. Our regression models use important air pollutants (SO2, NOx, CO, O3, PM10) and meteorological values (solar radiation, barometric pressure, rain, temperature, humidity) to predict Particulate Matter (PM2.5) [6–8]. With the predicted values, we can evaluate how accurately the model learns from the data. However, our dataset has many missing values, and accurate prediction is nearly impossible without careful preparation of the dataset. We use Artificial Neural Networks to fill the missing values, which probably works better than traditional methods. In this paper, we make the following contributions:
• An Artificial Neural Network [9] is proposed to predict and fill the missing values of the dataset, because an artificial neural network is able to learn accurately from data.
• An ensemble model is proposed to predict Particulate Matter (PM2.5), and we achieve a comparatively lower RMSE.
The rest of the paper is organized as follows. In Sect. 2, we discuss works related to the proposal in this paper. We present the data processing and analysis in Sect. 3. In Sect. 4, we present the methodology of this paper. In Sect. 5, we present the result analysis. Finally, in Sect. 6, we conclude the paper. This paper proposes a machine learning technique to predict Particulate Matter (PM2.5).
2 Related Works
After studying the literature on PM2.5, we found some related papers whose methods resemble our proposed methodology, but we are not fully satisfied with those methodologies. Some related works use machine learning classification algorithms to classify the pollutant class. Others pursue very complicated approaches that are unusual for this kind of problem. In [3], Bilkis A. Begum analyzes data covering 15 years and observes the effects of major air pollutants. The author observes that the long-term air quality of Dhaka has been degraded not only by the rapid establishment of industries and factories but also by the increasing number of passenger cars and brick kiln sites. Another study collected air quality data, especially PM2.5, PM10, and black carbon, to analyze and compare which area of Pabna city, Bangladesh, is most polluted and what the main causes of high particulate matter concentrations are, using a counter in “Repeat mode”. In [10], PM10 is the main concern; data sampled over 10 years are used to predict PM10 in New Zealand with Multivariate Linear Regression (MLR), Artificial Neural Networks, and several variants of Classification and Regression Trees. “Dixian Zhu” [11] forecast PM2.5, ozone, and sulfur using several types of regularization techniques such as “Standard
Frobenius norm regularization” and “Nuclear norm regularization”. They showed that the proposed regularization technique achieves better results than other existing works that used regularization. “Heidar Malek” [12] used an artificial neural network to predict the air pollution indices AQI and AQHI for Ahvaz, Iran, obtaining an RMSE of 0.59 and a Pearson correlation coefficient (R) of 0.87.
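RMSE and Pearson's R, the two scores quoted here and in the abstract, can be computed as in the following minimal sketch. The observed and predicted values are hypothetical and the function names are illustrative; the code follows the standard definitions rather than any specific library's implementation.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical hourly PM2.5 concentrations in µg/m3 (illustration only).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

print(rmse(observed, predicted), pearson_r(observed, predicted))
```

RMSE is expressed in the unit of the target (here µg/m3), so values are only comparable between studies that predict the same quantity, whereas R is unitless.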
3 Data Pre-processing and Analysis
3.1 Data Source
In Dhaka city, there are three stations that measure and collect air quality data on an hourly basis. The government of Bangladesh is taking action to reduce air pollution and strictly monitors air quality. For that purpose, the Ministry of Environment and Forests launched a project named Clean Air and Sustainable Environment (CASE) [13]. We collected data covering 2.5 years, from October 2016 to March 2019, from only two stations (Tables 1 and 2). The Shangshad Bhaban and Bangladesh Agricultural Research Council (BARC) stations are within 1 km of each other, so we use the data of only one of them, BARC, because it is located in Farmgate, which is regarded as the most polluted area in Dhaka city. From 1990 to 1999, a few observations were conducted and showed that the concentration of particles rose to 3000 µg/m3 (Police Box, Farmgate, December 1999), while the allowed level was 4000 µg/m3. Around the Farmgate area, the maximum permitted level of sulfur dioxide is 1000 µg/m3, while the measured level was 385 µg/m3. All of this happened because the sulfur dioxide and particulate matter concentrations were high in the Tejgaon area. The other station is Darussalam, located near Gabtoli, Dhaka. Within 5 km of Darussalam there are many brick kilns, industries, and one of the busiest bus stations, and on the other side of Darussalam there are many garment factories. Both stations are therefore closely associated with air pollution and pollutants.
3.2 Analyze Dataset and Missing Data
After a close inspection of the datasets, we found that both contain many missing values. We therefore have to use the right technique to fill the missing values in order to obtain better results and predictions; once we thoroughly understand every key feature of the dataset, the remaining steps follow easily from that understanding. In both the BARC and Darussalam datasets, either a single value or a series of consecutive values is missing. Our proposed technique is to split Date into Year, Month, Week, and Day (Table 3). We then used the
Evaluate and Predict Concentration of Particulate Matter …
775
Fig. 1 Artificial neural networks
interpolation method for Temperature, Humidity, Solar Radiate, Barometric pressure columns because those values are always maintaining an upward or downward trend. So, if any value is missing, then the interpolation method can guess the data between them. In Fig. 1, we have used Artificial Neural Networks for predicting PM10 with respect to Year, Month, Week, Day, Hour, Temperature, Humidity, Barometric pressure, Solar Radiate and guessing the missing value and filling those. For predicting PM2.5 with respect to Year, Month, Week, Day, Hour, Temperature, Humidity, Barometric pressure, Solar radiate, PM10 and guessing the missing value and filling those. Then for predicting O3 and CO, we have comprised Year, Month, Week, Day, Hour, Temperature, Humidity, Barometric pressure, Solar Radiate and filled their missing values. After that to fill the missing value of NOx, we have comprised Year, Month, Week, Day, Hour, Temperature, Humidity, Barometric pressure, Solar radiate, O3 and filled their missing values. Finally, we have used Year, Month, Week, Day, Hour, Temperature, Humidity, Barometric pressure, Solar radiate, PM10 , PM2.5 , CO, NOx, O3 for filling the missing values of SO2 with the help of Artificial Neural networks model. We select different input parameters based on the relation between data.
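The date split and the interpolation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `split_date` and `interpolate_linear` are hypothetical, linear interpolation is assumed (the text does not name the interpolation variant), gaps at the edges are filled with the nearest known value, and the weekday numbering follows the Saturday (0) to Friday (6) convention used in the figure captions.

```python
from datetime import date


def split_date(d):
    """Return (year, month, ISO week, weekday) with Saturday=0 ... Friday=6."""
    iso_week = d.isocalendar()[1]
    # Python's weekday() uses Monday=0; shift so that Saturday becomes 0.
    weekday_sat0 = (d.weekday() + 2) % 7
    return d.year, d.month, iso_week, weekday_sat0


def interpolate_linear(values):
    """Fill None gaps by linear interpolation between the nearest known values.

    Assumption: gaps before the first or after the last known value are
    filled with that nearest known value.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        return filled
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            frac = (i - left) / span
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled


# Hypothetical hourly temperature readings with a two-hour gap.
temperature = [26.0, None, None, 29.0]
print(split_date(date(2016, 10, 1)))
print(interpolate_linear(temperature))
```

Linear interpolation is a reasonable choice only for slowly varying meteorological columns; the pollutant columns in the text are filled by the neural network models instead.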
3.3 Visualizing PM2.5 Concentration Over Time
Visualizing both the BARC and Darussalam datasets shows that the PM2.5 concentration varies considerably by month, week, day, and hour (Figs. 2, 3, 4, 5, 6, 7, 8 and 9).
4 Methodology
StackNet (Fig. 10) is an analytical and mathematical meta-modelling framework developed by Kaz-Anova, first in Java and later in Python, that works like a feedforward neural network. It is based on Wolpert’s stacked generalization and combines models across multiple levels (layers) to improve accuracy on conventional machine learning problems. Unlike a feedforward neural network, it is not trained through backpropagation: the network is built iteratively, one layer at a time, and each layer uses the final target as its target. The methodology is presented in two parts:
1. StackNet Design
2. Flow Chart
In this paper, we focus on the Gradient Boosting Regressor (GBR), Extra Trees Regressor (ETR), Random Forest Regressor (RFR), and Bayesian Ridge. We evaluate the proposed model with the Root Mean Squared Error (RMSE).
Fig. 2 The concentration of PM2.5 on various hours of a day shows variation over time (BARC dataset)
Fig. 3 The concentration of PM2.5 on various days of a month shows variation over day (BARC dataset)
Fig. 4 The concentration of PM2.5 on various days of a week shows variation over day (BARC dataset). Days are categorized from Saturday as (0) to Friday (6)
Fig. 5 The concentration of PM2.5 on various days of a month shows variation over day (BARC dataset)
4.1 StackNet Design
StackNet has two modes [14]. In the first, each layer uses only the predictions of the immediately preceding layer. In the second, called restacking, each layer uses the predictions of all previous layers, including the input layer. StackNet is usually better than the best single model in its first layer. However, its performance still depends on a mix of strong and diverse single models to get the best out of this meta-modelling methodology. We designed the StackNet architecture for our problem on the following principles: (a) including more models that have similar prediction performance, (b) having a linear model in each layer, (c) placing models with better performance on a higher layer, and (d) increasing the diversity in each layer. The resulting StackNet, shown in Fig. 10, consists of three layers built from the following models: Bayesian Ridge Regressor, Random Forest Regressor (RFR), Extra Trees Regressor (ETR), Gradient Boosting Regressor (GBR), and Ridge Regressor. The first layer has one linear regressor and three ensemble-based regressors, the second layer contains one linear regressor and two ensemble-based regressors, and the third layer has only one linear regressor. Each layer uses the predictions from all previous layers, including the input layer.
Fig. 6 The concentration of PM2.5 on various hours of a day shows variation over time (Darussalam dataset)
Fig. 7 The concentration of PM2.5 on various days of a month shows variation over day (Darussalam dataset)
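The difference between the two StackNet modes can be made concrete by counting how many input columns each layer receives. The sketch below is illustrative only: the function name `layer_input_widths` is hypothetical, the layer sizes 4, 3, and 1 are taken from the description above, and the count of nine original features is an assumed value, since the text does not state the exact width of the input layer.

```python
def layer_input_widths(n_features, layer_sizes, restacking):
    """Return the number of input columns seen by each layer.

    Without restacking, a layer sees only the predictions of the layer
    directly before it. With restacking, it sees the original features
    plus the predictions of every earlier layer.
    """
    widths = []
    previous_outputs = []  # number of models (one prediction each) per earlier layer
    for size in layer_sizes:
        if not previous_outputs:
            widths.append(n_features)  # the first layer always sees the raw input
        elif restacking:
            widths.append(n_features + sum(previous_outputs))
        else:
            widths.append(previous_outputs[-1])
        previous_outputs.append(size)
    return widths


# Assumed: 9 original features; layers with 4, 3, and 1 models as in the text.
print(layer_input_widths(9, [4, 3, 1], restacking=True))
print(layer_input_widths(9, [4, 3, 1], restacking=False))
```

With restacking, later layers keep access to the raw meteorological inputs, which is why the mode is useful when the first-layer models discard information that a higher-layer linear model could still exploit.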
Fig. 8 The concentration of PM2.5 on various days of a week shows variation over day (Darussalam dataset). Days are categorized from Saturday as (0) to Friday (6)
Fig. 9 The concentration of PM2.5 on various days of a month shows variation over day (Darussalam dataset)
4.2 Flow Chart
Figure 11 shows our overall working procedure and describes graphically how the proposed model works. First, we process the raw dataset with feature selection and fill the missing values using Artificial Neural Networks. We then split the preprocessed dataset into eight folds for k-fold cross-validation. Finally, we fit the proposed model on each training split and evaluate it on the corresponding test split (Figs. 12 and 13).
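The eightfold split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the generator name `k_fold_indices` is hypothetical, and contiguous, unshuffled folds are an assumption, since the text does not say whether the rows were shuffled before splitting.

```python
def k_fold_indices(n_samples, k=8):
    """Yield (train_idx, test_idx) pairs for k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Each sample is used for testing exactly once across the eight folds.
for train_idx, test_idx in k_fold_indices(20, k=8):
    print(len(train_idx), len(test_idx))
```

Every row appears in exactly one test fold, so averaging the per-fold errors gives an estimate that uses the whole dataset for evaluation.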
Fig. 10 StackNet framework architecture. For each ensemble-based regressor, the first and second numbers indicate the number of trees and the maximum depth of each tree, respectively
Fig. 11 Flowchart of the whole process for predicting PM2.5
Fig. 12 Evaluation results of PM2.5 (Darussalam dataset)
Fig. 13 Evaluation results of PM2.5 (BARC dataset)
5 Result Analysis
We evaluate performance with the Root Mean Square Error (RMSE). RMSE is often preferred over the Mean Squared Error (MSE) because it is expressed in the same units as the predicted quantity, which makes model performance easier to interpret. The datasets used in this paper, shown in Table 1 and Table 2, were collected from the Environment Department (CASE project) in Bangladesh. Table 1 is the BARC dataset (Farmgate, Dhaka) and Table 2 is the Darussalam dataset (Darussalam, Dhaka). Both
Table 1 BARC dataset (Farmgate, Dhaka)
Sl no | Date | Time | SO2 | NOX | CO | O3 | PM2.5 | PM10 | Temp | RH | Solar Rad | BP | Rain
1
2016-10-01
1
3.98
0.8
55.13 26.59 84.03
1.2
2
2016-10-01
2
2.83
1.23
34.41 26.61 86.57
2.3
3
2016-10-01
3
1.79
0.95
19.04 27.15 88.1
1.61
4
2016-10-01
4
0.83
0.78
12.59 27.05 89.71
5
2016-10-01
5
0.36
0.92
6
2016-10-01
6
0.11
7
2016-10-01
7
8
2016-10-01
8
9
2016-10-01
26.45 91.11
1.44
0.62
13.42 26.51 87.56
1.97
0.63
19.85 26.61 88.33
1.35
3.41
0.72
17.52 26.54 91.88
0.75
9
6.27
1.02
34.47 27.16 88.68
1.68
10 2016-10-01 10
5.02
0.55
74.13 26.67 85.05
0.84
11 2016-10-01 11
7.5
1.88
114.12 26.74 76.43
0.32
12 2016-10-01 12
3.05
0.95
141.83 26.72 65.41
1.02
13 2016-10-01 13
0.52
0.34 1.24
145.12 26.47 64.32
1.5
14 2016-10-01 14
0.13
0.94 1.4
122.42 26.45 65.05
15 2016-10-01 15
0.44
0.82 1.27
117.16 26.98 72.67
1.2
16 2016-10-01 16
1.09
0.6
108.26 26.4
2.46
0.82
9.2
72.05
datasets contain a large number of missing values, which makes them nearly impossible to use as collected. Using neural networks to predict the missing values made both datasets usable, and StackNet combined with eightfold cross-validation then gave accurate results on both. For the Darussalam dataset, we obtained a Pearson correlation coefficient (R) of 0.96 and an RMSE of 25.37. For the BARC dataset, we obtained R = 0.93 and an RMSE of 26.93 (Table 4).
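For reference, the two reported metrics can be computed as in the sketch below. The function names `rmse` and `pearson_r` and the sample values are illustrative, not taken from the paper; the formulas are the standard definitions, and `pearson_r` assumes that neither series is constant.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Made-up PM2.5 values (µg/m3) for illustration only.
observed = [55.0, 34.0, 19.0, 74.0]
predicted = [50.0, 38.0, 25.0, 70.0]
print(rmse(observed, predicted), pearson_r(observed, predicted))
```

Because R measures only linear association, a model can score a high R while still being biased, which is why the two metrics are reported together.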
6 Conclusion
The main goal of this paper is to predict the concentration of particulate matter (PM2.5). An Artificial Neural Network is used to fill in the missing values, and an ensemble model (StackNet) is proposed to predict PM2.5. StackNet applies the stacking technique with k-fold cross-validation, together with the Gradient Boosting Regressor, Random Forest Regressor, and Extra Trees Regressor. The RMSE values obtained so far, 26.93 for the BARC dataset and 25.37 for the Darussalam dataset, are encouraging. This work opens several directions for further study. We expect that the proposed approach can also be used to predict PM10, SO2, NOX, CO, and other pollutants.
Table 2 Darussalam dataset (Darussalam, Dhaka)
Sl no | Date | Time | CO | PM10 | Temp | RH | Solar Rad | BP
1 | 2016-10-01 | 1 | 1.76 | 124.34 | 29.48 | 80.56 | 7.99 | 1006.15
2 | 2016-10-01 | 2 | 1.67 | 69.42 | 29.01 | 82.86 | 8.06 | 1006.05
3 | 2016-10-01 | 3 | 1.6 | 83.36 | 28.84 | 83.68 | 8.19 | 1006.01
4 | 2016-10-01 | 4 | 1.54 | 60.51 | 28.75 | 84.18 | 8.18 | 1006.14
5 | 2016-10-01 | 5 | 1.48 | 43.51 | 28.53 | 85.12 | 8.2 | 1006.36
6 | 2016-10-01 | 6 | 1.44 | 26.87 | 27.98 | 83.48 | 8.35 | 1006.71
7 | 2016-10-01 | 7 | 1.24 | 23.5 | 27.01 | 82.81 | 13.69 | 1007.36
8 | 2016-10-01 | 8 | 2.05 | 41.22 | 27.87 | 85.8 | 48.36 | 1007.73
9 | 2016-10-01 | 9 | 2.33 | 49.47 | 28.33 | 83.92 | 211.88 | 1008.16
10 | 2016-10-01 | 10 | 2.07 | 66.98 | 30.46 | 76.01 | 550.7 | 1008.35
11 | 2016-10-01 | 11 | 2.2 | 101.93 | 31.19 | 71.77 | 459.63 | 1008.26
12 | 2016-10-01 | 12 | 1.91 | 183.43 | 32.32 | 65.36 | 758.1 | 1007.83
13 | 2016-10-01 | 13 | 1.73 | 157.72 | 32.91 | 62.56 | 522.61 | 1006.78
14 | 2016-10-01 | 14 | 1.81 | 115.92 | 31.74 | 68.13 | 151.19 | 1005.8
15 | 2016-10-01 | 15 | 1.79 | 148.36 | 29.9 | 74.41 | 68.21 | 1005.17
16 | 2016-10-01 | 16 | 1.87 | 162.42 | 29.47 | 72.87 | 79.13 | 1004.81
Columns with missing entries, listed from Time 16 down to Time 1 with blanks omitted:
SO2: 0.7, 0.34, 0.11, 0.33, 0.28, 0.27, 0.12, 0.2, 1.08, 0.07, 0.08, 0.07, 0.29, 0.57
NOX: 43.3, 48.6, 88.07, 65.7, 71.82, 76.89, 56.51, 39.83, 2.62, 1.66, 1.97, 10.61, 19.13, 17.45, 23.06
O3: 1.25, 1.62, 0.9, 1.61, 1.54, 0.82, 1.1, 1.09, 0.97, 1.2, 0.99, 1.05, 1.13, 1.03, 1.12
PM2.5: no values
Rain: 0.02, 0.02, 0.08, 0.02
Table 3 ANN model input parameters for missing-value prediction
Input params for PM10 | Input params for PM2.5 | Input params for O3 | Input params for CO | Input params for NOx | Input params for SO2
Date | Date | Date | Date | Date | Date
Month | Month | Month | Month | Month | Month
Week | Week | Week | Week | Week | Week
Day | Day | Day | Day | Day | Day
Time | Time | Time | Time | Time | Time
Temp | Temp | Temp | Temp | Temp | Temp
Hum | Hum | Hum | Hum | Hum | Hum
BP | BP | BP | BP | BP | BP
Solar Rad | Solar Rad | Solar Rad | Solar Rad | Solar Rad | Solar Rad
– | PM10 | – | – | NOx | PM10
– | – | – | – | – | PM2.5
– | – | – | – | – | CO
– | – | – | – | – | NOx
– | – | – | – | – | O3
Table 4 Performance measurement
Dataset | Pearson correlation coefficient (R) | RMSE
BARC | 0.9307 | 26.9317
Darussalam | 0.9620 | 25.3664
References
1. Md. Raquibul Hasan, Md. Akram Hossain, U. Sarjana, Md. Rashedul Hasan, Status of air quality and survey of particulate matter pollution in Pabna city, Bangladesh. Am. J. Eng. Res. (AJER) 5(11), 18–22
2. M.M. Hoque, B.A. Begum, A.M. Shawan, S.J. Ahmed, Particulate matter concentrations in the air of Dhaka and Gazipur city during winter: a comparative study, in International Conference on Physics Sustainable Development & Technology (ICPSDT-2015)
3. B.A. Begum, P.K. Hopke, Ambient air quality in Dhaka Bangladesh over two decades: impacts of policy on air quality. Aerosol Air Qual. Res. 18, 1910–1920 (2018)
4. M.-J. Chen, P.-H. Yang, M.-T. Hsieh, C.-H. Yeh, C.-H. Huang, C.-M. Yang, G.-M. Lin, Machine learning to relate PM2.5 and PM10 concentrations to outpatient visits for upper respiratory tract infections in Taiwan: a nationwide analysis. https://doi.org/10.12998/wjcc.v6.i8.200
5. R. Gore, D. Deshpande, Air data analysis for predicting health risks. IJCSN Int. J. Comput. Sci. Netw. 7(1) (2018). ISSN (Online): 2277-5420, www.IJCSN.org
6. D. Xiao, F. Fang, J. Zheng, C.C. Pain, I.M. Navon, Machine learning-based rapid response tools for regional air pollution modelling. Atmos. Environ. 199(15), 463–473 (2019). https://doi.org/10.1016/j.atmosenv.2018.11.051
7. J.K. Deters, R. Zalakeviciute, M. Gonzalez, Y. Rybarczyk, Modeling PM2.5 urban pollution using machine learning and selected meteorological parameters. J. Electr. Comput. Eng. 2017(Article ID 5106045), 14 (2017). https://doi.org/10.1155/2017/5106045
8. C.R. Aditya, C.R. Deshmukh, D.K. Nayana, P.G. Vidyavastu, Detection and prediction of air pollution using machine learning models. Int. J. Eng. Trends Technol. (IJETT) 59(4) (2018)
9. S. Roy, Prediction of particulate matter concentrations using artificial neural network. https://doi.org/10.5923/j.re.20120202.05
10. J. Whalley, S. Zandi, Particulate matter sampling techniques and data modelling methods. https://doi.org/10.5772/65054
11. D. Zhu, C. Cai, T. Yang, X. Zhou, A machine learning approach for air quality prediction: model regularization and optimization. https://doi.org/10.3390/bdcc2010005
12. H. Maleki, A. Sorooshian, G. Goudarzi, Z. Babol, Y.T. Birgani, M. Rahmati, Air pollution prediction by using an artificial neural network model. https://doi.org/10.1007/s10098-019-01709-w
13. http://case.doe.gov.bd
14. M. Michailidis, Stacknet, meta modelling framework (2017), https://github.com/kaz-Anova/StackNet
Retrieval of Frequent Itemset Using Improved Mining Algorithm in Hadoop
Sandhya Sandeep Waghere, PothuRaju RajaRajeswari, and Vithya Ganesan
Abstract In parallel mining, extracting frequent patterns from a huge dataset in a short time is a difficult task. Frequent pattern mining plays a vital role not only in framing association rules but also in effective classification and clustering. Apriori, FP-Growth, and Eclat are the basic algorithms for frequent pattern mining, but they fall short in workload balancing, fault tolerance, and synchronization. To overcome this, recently proposed algorithms focus on parallelization across a large number of machines in a distributed computing environment using the MapReduce framework. As a contribution in this direction, we propose an improved frequent itemset mining algorithm that finds frequent itemsets in a huge dataset. It uses clustering for effective utilization of space and easy retrieval: large pattern sets are divided into discrete and uniform clusters, and each cluster is characterized by its center point. For pattern matching, we use the FP-Growth algorithm. We compare the existing system with the proposed system in terms of time and accuracy, and show that the proposed system is more accurate and requires less time to find frequent itemsets. We used the Online Retail dataset; because it provides a large amount of data for mining, we extracted the items bought by each customer. The existing system takes 99 s to discover frequently occurring items, whereas our new approach takes only 10 s. Our technique is implemented with the help of locality-sensitive hashing, moving k-means, and the FP-Growth algorithm.
S. S. Waghere (B) Research Scholar, CSE Department, KLEF, Guntur, AP, India
P. RajaRajeswari · V. Ganesan Professor, CSE Department, KLEF, Guntur, AP, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_68
Keywords MapReduce · HDFS · Improved mining algorithm · FP-Growth algorithm
1 Introduction
The discovery of frequent itemsets is a crucial problem in data mining. Frequent itemset mining techniques strengthen the process of discovering meaningful information, which gives business insight and helps decision makers, and they can be implemented with the MapReduce programming framework [1]. In some cases, the transaction database is too large to process on a single machine. In the era of Big Data, a new approach is needed to compute frequent itemsets from data that consist of terabytes of records. Researchers have proposed several approaches to address these challenges, but all of them suffer from problems with synchronization, workload balance, and fault tolerance. To overcome these problems, the MapReduce programming model was introduced, which processes large datasets in a parallel and distributed environment [2]. It has two main tasks, Map and Reduce. The mapper takes the original dataset, processes it, and converts it into key-value pairs. The reducer takes the mapper output as input, aggregates the intermediate results, and produces the final output. Hadoop is a framework that allows us to store and process large datasets in a parallel and distributed fashion. It also supports heterogeneous environments through data placement strategies [3]. To extend frequent itemset mining to very large datasets, we propose an improved mining algorithm, and for pattern matching we use the FP-Growth algorithm. We use the Hadoop architecture with the MapReduce programming model. The algorithm works by splitting massive datasets into manageable pieces of data that run on different nodes.
Splitting the whole dataset into manageable chunks reduces the space complexity, because it is easier to process a small dataset effectively on a single node than to load and scan the whole dataset at once. It also reduces the time complexity by supporting the parallelization of tasks and the execution of sub-tasks. The map function on each individual node or machine receives these pieces of data and generates output according to the local minimum support. Before the final set of frequent itemsets is generated as output, the reduce function counts all values and discards those that do not meet the global minimum threshold. The paper is organized as follows. Section 2 gives a brief account of related work. Section 3 gives an overview of the proposed system, its architecture, and its flow of working. Section 4 discusses the experimental setup and the observed results. Section 5 states the conclusion.
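The map and reduce roles described above can be imitated on a single machine to show the data flow. The sketch below is a plain-Python simulation under stated assumptions, not Hadoop code: the names `map_chunk` and `reduce_counts` are hypothetical, each chunk stands for the data held by one node, and the thresholds are illustrative.

```python
from collections import Counter


def map_chunk(transactions, local_min_support):
    """Mapper: count items in one chunk and keep those meeting the local support."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= local_min_support}


def reduce_counts(partial_counts, global_min_support):
    """Reducer: sum the per-chunk counts and drop items below the global threshold."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # Counter.update adds counts rather than replacing them
    return {item: c for item, c in total.items() if c >= global_min_support}


# Two chunks, as if stored on two different nodes.
chunk_1 = [["a", "b"], ["a", "c"]]
chunk_2 = [["a", "b"], ["b", "c"]]
partials = [map_chunk(chunk, local_min_support=1) for chunk in (chunk_1, chunk_2)]
print(reduce_counts(partials, global_min_support=3))
```

Note that a local threshold above 1 discards counts before they reach the reducer, so the global totals become lower bounds rather than exact counts; that trade-off between traffic and exactness is a design decision for any such scheme.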
2 Literature Survey
There are three classic frequent itemset mining algorithms that run on a single node.
1. The Apriori algorithm is based on candidate generation, in which every subset of a frequent itemset must also be frequent. It requires more time because the database is scanned each time a candidate itemset is generated. It uses breadth-first search and requires a large amount of memory because too many candidate itemsets are generated.
2. The Frequent Pattern-Growth (FP-Growth) algorithm is better than Apriori because it takes less time and scans the database only twice in the whole process. It uses the divide-and-conquer technique. An FP-Tree is generated, but building the FP-Tree is expensive and consumes more memory.
3. The Eclat algorithm converts the original dataset into a new table in which every row contains the sorted list of transaction IDs of the respective item. Frequent itemsets are then extracted by intersecting the transaction lists of the items. This algorithm suffers from main memory and communication overhead.
Zhang et al. [4] describe two methods that convert the Apriori algorithm into MapReduce tasks. The first method takes all relevant itemsets in the map phase and then removes those that do not satisfy the minimum threshold in the reduce phase. The second method is a direct conversion of the Apriori algorithm. These methods are used by M. Y. Lin et al. [5]. The MREclat algorithm is based on the MapReduce framework. It presents an improved Eclat algorithm that increases the efficiency of frequent itemset mining and solves the storage and computation problems that arise with massive data. Compared with other algorithms, it has very high scalability and better speedup. S. Moens et al. [6] suggested two methods for frequent itemset mining on Big Data with MapReduce. The first method is named DistEclat.
It is a modified version of the pure Eclat method that divides the search space uniformly among the map jobs. It has two issues: mining the subtrees requires the entire database in main memory, and the entire dataset must be communicated to most of the mappers. The second method, BigFIM, combines the best parts of the Apriori and Eclat algorithms so that the database fits in main memory. DistEclat provides speed but not scalability, whereas BigFIM scales well but is slower. M. Riondato et al. [7] present PARMA, a Parallel Randomized Algorithm. It uses a sampling method to find frequent itemsets in very little time. It generates a sampling list that helps in forming clusters. Its execution speed is high because it minimizes data replication. The MRPrePost algorithm was proposed by Jinggui Liao and Yuelong Zhao. It is a modified version of PrePost based on prefix patterns, so it performs well in association rule mining on large data. It is better than PrePost and PFP in terms of scalability and stability. The mining result of MRPrePost is approximate but close to the original result. The ClustBigFIM algorithm overcomes the main memory and communication overhead issues of DistEclat. It is a hybrid approach that combines the Apriori, k-means, and Eclat algorithms for frequent itemset mining on large datasets. It overcomes the drawbacks of BigFIM by increasing scalability and performance. Its output is approximately close to the original results but is obtained faster.
Our work is based on the analysis of these three algorithms. We consider four datasets that differ in the number of transactions. The comparison uses the time taken and the memory used by each algorithm to find the frequent patterns. On large clusters, existing parallel mining algorithms face difficulties in data distribution. To address this, we design a parallel FP approach on Hadoop to find frequent itemsets, based on FiDoop’s MapReduce programming model. To avoid storing conditional pattern bases as in traditional FP-Growth, FiDoop [8] uses three MapReduce jobs to complete the mining task. In the important third MapReduce job, the mappers independently decompose itemsets and the reducers perform the combination by constructing small ultrametric trees. We have implemented FiDoop on our in-house Hadoop cluster. Because itemsets of different lengths have different decomposition and construction costs, FiDoop is sensitive to data dimensionality and data distribution. To improve the performance of FiDoop, we developed a workload balancing measure that balances the load across the compute nodes of the cluster. Finding frequent itemsets in large heterogeneous databases in minimum time is considered one of the most important problems in data mining. As a solution, several algorithms have been proposed that accelerate execution through parallelization and even distribution of the workload across a large number of machines in a distributed computing environment, generally using the MapReduce framework for data processing. Some of them can determine the appropriate number of machines needed, taking into account workload balance and execution efficiency.
However, it is very difficult to know in advance the precise number of iterations required to find the frequent itemsets of a large dataset when the search is based on iterative sampling. The authors propose an Improved and Compact Algorithm (ICA) that finds frequent items in minimum time in a distributed computing environment. It can also determine the exact number of internal iterations required for any large dataset, whether the data is structured or not [9]. J. Xie et al. address improving MapReduce performance through data placement in heterogeneous Hadoop clusters, describing the problem of placing data across nodes so that each node achieves improved data processing performance adaptively [3]. Sandhya S. Waghere et al. describe how storing filtered or processed data on HDFS improves performance and speed [10].
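The vertical layout used by the Eclat family of algorithms surveyed in this section can be illustrated briefly. The sketch below is a single-node toy under stated assumptions, not DistEclat or BigFIM: the names `vertical_format` and `support` are hypothetical, and transaction IDs are simply list positions.

```python
def vertical_format(transactions):
    """Map each item to the set of transaction IDs (tidset) that contain it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def support(itemset, tidsets):
    """Support of an itemset = size of the intersection of its items' tidsets."""
    sets = [tidsets.get(item, set()) for item in itemset]
    return len(set.intersection(*sets)) if sets else 0


transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
tidsets = vertical_format(transactions)
print(support(["a", "b"], tidsets))
```

Because support is obtained by set intersection rather than by rescanning the database, the cost shifts from I/O to memory, which matches the main-memory limitation attributed to Eclat above.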
3 System Overview
3.1 Problem Statement
We design a system that retrieves frequent itemsets from a large input dataset in minimum time. We propose a data partitioning scheme and an improved frequent itemset mining algorithm that reorganizes the data to minimize the time needed to find frequent itemsets. Figure 1 shows the proposed system architecture. To extend frequent itemset mining to massive datasets, we use a moving k-means algorithm, and for pattern matching we use the FP-Growth algorithm. We work on the Hadoop architecture with the MapReduce programming model. Large pattern sets are divided into discrete and uniform clusters, and each cluster is characterized by its center point. Finally, we show that the system requires less time to find frequent itemsets.
Fig. 1 System architecture: Dataset; Hadoop Distributed File System for storage; MapReduce model for parallel item counting; FP-Growth for finding frequent itemsets; data partitioning and FiDoop for minimizing time; Result
3.2 System Architecture
Figure 1 shows the proposed system architecture. We use k-means clustering for scaling and the FP-Growth algorithm for pattern matching, together with an effective data partitioning scheme. The improved frequent itemset mining algorithm is implemented on the Hadoop MapReduce framework. The system works as follows:
1. The dataset is uploaded.
2. The system stores the data in the Hadoop Distributed File System (HDFS).
3. Hadoop MapReduce computes a parallel count of the itemsets.
4. A parallel FP algorithm is applied to the dataset to obtain the frequent itemsets and the time taken.
5. The proposed algorithm reorganizes the data into clusters.
6. FiDoop, together with the proposed improved mining algorithm, generates the frequent itemsets in minimum time.
3.3 System Flow
First, we upload the dataset to HDFS. The data is then processed with MapReduce, which generates key-value pairs that give the count of each item. After infrequent items are discarded, the output of this step is the set of frequent 1-itemsets. This output is given as input to the k-means algorithm, which forms the clusters. Next, a data matrix is formed using MinHash. With locality-sensitive hashing, we find correlated items. Correlated items are placed in the same bucket, so a number of buckets are created, each holding correlated items. Finally, a compact structure is built, from which the frequent itemsets of the given dataset are obtained in very little time.
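The MinHash and locality-sensitive hashing steps in this flow can be sketched as follows. This is a generic illustration under stated assumptions, not the system's implementation: all function names are hypothetical, MD5 with a numeric seed stands in for a family of hash functions, the signature length (16) and number of bands (4) are arbitrary, and every input set is assumed to be non-empty.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed (illustrative choice)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=16):
    """One minimum hash value per seed; similar sets tend to share many of them."""
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures, bands=4):
    """Return groups of keys whose signatures agree on at least one whole band."""
    buckets = {}
    for key, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets.setdefault((b, band), set()).add(key)
    return [members for members in buckets.values() if len(members) > 1]


signatures = {
    "t1": minhash_signature({"x", "y", "z"}),
    "t2": minhash_signature({"z", "y", "x"}),
    "t3": minhash_signature({"p", "q"}),
}
print(estimated_similarity(signatures["t1"], signatures["t2"]))
print(lsh_buckets(signatures))
```

Banding means that only keys landing in a common bucket need to be compared in detail, which is how bucketing reduces the number of pairwise comparisons.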
3.4 Mathematical Model
S = {I, O, M, F, K, L, L1}, where I = input, O = output, M = MapReduce, F = FP-Growth, K = moving k-means, and L = locality-sensitive hashing.
MapReduce: M = {I}; output M1 = the count of every input item over all transactions in the dataset.
FP-Growth algorithm: F = {M1}; output F1 = the counts of the elements with different patterns.
(1) Process P3 (moving k-means): P3 = {I}, with
\[ J = \sum_{i=1}^{K} \sum_{j=1}^{n} (x_i - y_j)^2 \]
where (x_i − y_j) is the Euclidean distance between a point x_i and y_j. Output = the dataset clustered into different groups.
(2) Process P4 (locality-sensitive hashing): P4 = {P3}, with
\[ \mathrm{Sim}(D_1, D_2) = \frac{1}{k} \sum_{i=1}^{k} \left[ h_i(D_1) = h_i(D_2) \right], \qquad h(x) = x \bmod m \]
where the bracket equals 1 when the two hash values agree and 0 otherwise, m is a power of 2, and k is the number of characters used.
(3) Process P5: P5 = {P4}; repeat process P4. Output = similar data grouped into a single unit.
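As a worked instance of the clustering objective J, the sketch below evaluates the usual k-means reading of it, in which each point contributes its squared Euclidean distance to its nearest cluster center. This reading is an assumption: the moving k-means variant named in the text may assign or update centers differently, and the function name `kmeans_objective` and the sample points are hypothetical.

```python
def kmeans_objective(points, centers):
    """Sum over all points of the squared Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total


# Three 2-D points and two candidate centers (illustrative values).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centers = [(0.0, 1.0), (10.0, 0.0)]
print(kmeans_objective(points, centers))
```

A lower value of J means tighter clusters, so comparing J before and after a center update is a simple way to check that an iteration improved the clustering.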
4 Result Analysis
4.1 Dataset
We used the Online Retail dataset, which provides a large amount of data for mining. We extracted the items bought by each customer, so that each row of the extracted data represents the items purchased by a single customer, and saved this data in the “Retail Item Data” file. We then converted the item names into numbers, where each number represents one item, and stored the resulting numeric data in “RetailMainDataset”. We applied the different processing techniques to this data to obtain the frequent itemsets. Table 1 shows a sample of the input dataset, and Fig. 2 shows a snapshot of the numeric values obtained after pre-processing the original dataset.
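The two pre-processing steps described above, grouping items per customer and replacing item names with numbers, can be sketched as follows. The function names `build_transactions` and `encode_items` are hypothetical, the input is assumed to be already reduced to (CustomerID, Description) pairs, and numbering items from 1 in first-seen order is an illustrative choice that the text does not specify.

```python
def build_transactions(rows):
    """Group distinct item names by customer; rows are (customer_id, item_name) pairs."""
    baskets = {}
    for customer_id, item_name in rows:
        items = baskets.setdefault(customer_id, [])
        if item_name not in items:
            items.append(item_name)
    return baskets


def encode_items(baskets):
    """Replace item names with integer ids assigned in first-seen order, starting at 1."""
    item_ids = {}
    encoded = {}
    for customer_id, items in baskets.items():
        encoded[customer_id] = [
            item_ids.setdefault(name, len(item_ids) + 1) for name in items
        ]
    return encoded, item_ids


rows = [
    (17850, "WHITE METAL LANTERN"),
    (17850, "HAND WARMER UNION JACK"),
    (13047, "ASSORTED COLOUR BIRD ORNAMENT"),
    (17850, "WHITE METAL LANTERN"),
]
encoded, item_ids = encode_items(build_transactions(rows))
print(encoded)
```

Keeping the `item_ids` mapping makes it possible to translate the mined numeric itemsets back into product descriptions afterwards.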
4.2 Result
Figure 3 shows the execution and output of our system. The parallel FP-Growth algorithm takes 99 s to discover frequently occurring items, whereas our new approach takes only 10 s (Fig. 4). Our technique is implemented with the help of locality-sensitive hashing, moving k-means, and the FP-Growth algorithm. Table 2 gives the time comparison. The observations show that the FP-Growth algorithm requires more time for pattern matching and frequent itemset mining. Our proposed approach uses an improved data partitioning algorithm that relies on locality-sensitive hashing and MinHash to reorganize the data. The reorganized data are used as input to the FP-Growth algorithm.
Table 1 Sample of dataset
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country
536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01-12-2010 08:26 | 2.55 | 17850 | United Kingdom
536365 | 71053 | WHITE METAL LANTERN | 6 | 01-12-2010 08:26 | 3.39 | 17850 | United Kingdom
536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01-12-2010 08:26 | 2.75 | 17850 | United Kingdom
536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01-12-2010 08:26 | 3.39 | 17850 | United Kingdom
536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01-12-2010 08:26 | 3.39 | 17850 | United Kingdom
536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 01-12-2010 08:26 | 7.65 | 17850 | United Kingdom
536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 01-12-2010 08:26 | 4.25 | 17850 | United Kingdom
536366 | 22633 | HAND WARMER UNION JACK | 6 | 01-12-2010 08:28 | 1.85 | 17850 | United Kingdom
536366 | 22632 | HAND WARMER RED POLKA DOT | 6 | 01-12-2010 08:28 | 1.85 | 17850 | United Kingdom
536367 | 84879 | ASSORTED COLOUR BIRD ORNAMENT | 32 | 01-12-2010 08:34 | 1.69 | 13047 | United Kingdom
536367 | 22745 | POPPY’S PLAYHOUSE BEDROOM | 6 | 01-12-2010 08:34 | 2.1 | 13047 | United Kingdom
536367 | 22748 | POPPY’S PLAYHOUSE KITCHEN | 6 | 01-12-2010 08:34 | 2.1 | 13047 | United Kingdom
Fig. 2 Snapshot of the pre-processed dataset
Fig. 3 Snapshot after execution of PFP, which requires 99 s
From Table 2, we conclude that FP-Growth on the reorganized data requires less time to find frequent itemsets.
Fig. 4 Snapshot after execution of FiDoop, which requires 1 s
Table 2 Time comparison
Algorithm | Time (s)
FP-Growth algorithm | 47
Improved frequent itemset mining algorithm | 30
Table 2 shows that the FP-Growth algorithm required 47 s to find the frequent itemsets in the input customer retail dataset, whereas the suggested improved frequent itemset mining algorithm required 30 s for the same output. Figure 5 shows a time comparison graph of FP-Growth and the proposed algorithm. Table 3 shows datasets of different dimensions. FiDoop with the proposed method takes less time for frequent itemset mining than the normal FP-Growth algorithm. For example, on the T40I40D1K dataset the FP-Growth algorithm takes 50 s, whereas FiDoop with the suggested method takes 45 s to produce the same output.
Fig. 5 Time comparison graph
Table 3 Dimensionality comparison
Dataset name | FP-Growth (s) | FiDoop (s)
T40I40D1K | 50 | 45
T40I30D1K | 55 | 50
T40I20D1K | 65 | 60
T40I10D1K | 75 | 70
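The timings reported in Table 2 and Table 3 can be turned into relative savings with a few lines of arithmetic. The sketch below is only an illustration: the function name `reduction_percent` and the dictionary layout are assumptions introduced here, and the numbers are copied from the two tables.

```python
def reduction_percent(baseline_s, improved_s):
    """Relative time saved with respect to the baseline, in percent."""
    return 100.0 * (baseline_s - improved_s) / baseline_s


# (FP-Growth seconds, proposed/FiDoop seconds), copied from Table 2 and Table 3.
timings = {
    "customer retail dataset": (47, 30),
    "T40I40D1K": (50, 45),
    "T40I30D1K": (55, 50),
    "T40I20D1K": (65, 60),
    "T40I10D1K": (75, 70),
}

for name, (fp_growth_s, proposed_s) in timings.items():
    saved = reduction_percent(fp_growth_s, proposed_s)
    print(f"{name}: {saved:.1f}% less time")
```

Expressed this way, the absolute saving of 5 s on every Table 3 dataset corresponds to a shrinking relative saving as the baseline time grows.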
5 Conclusion
We proposed a frequent itemset mining approach that reduces mining time and thereby improves performance. In this system, we first use Hadoop MapReduce programming to count each item in parallel and then apply the FP-Growth algorithm to find the frequent itemsets. To reduce the mining time further, we combine the FP-Growth algorithm with techniques such as locality-sensitive hashing, MinHash, and an improved k-means. Finally, we conclude that the suggested system is more efficient for finding frequent itemsets.
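The parallel item count that precedes FP-Growth can be sketched in plain Python. This is an in-process imitation of the map and reduce phases, not Hadoop code: the function names, the way the data is split, and the toy transactions (StockCode values borrowed from Table 1) are all assumptions made for illustration.

```python
from collections import Counter
from functools import reduce


def map_phase(transactions):
    """Mapper: count the items in one split of the transaction database."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # an item counts once per transaction
    return counts


def reduce_phase(partial_counts):
    """Reducer: merge the per-split counters into global item counts."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


def frequent_items(splits, min_support):
    """Keep the items whose global count reaches the minimum support."""
    totals = reduce_phase(map_phase(split) for split in splits)
    return {item: count for item, count in totals.items() if count >= min_support}


# Hypothetical splits, as if two mappers each received two transactions.
splits = [
    [["85123A", "71053", "84406B"], ["85123A", "22752"]],
    [["71053", "85123A"], ["22633", "22632"]],
]
print(frequent_items(splits, min_support=2))
```

Only the items that survive this count need to be inserted into the FP-tree, which is why the counting pass is a natural candidate for parallel execution.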
Number Plate Recognition System for Vehicles Using Machine Learning Approach
Md. Amzad Hossain, Istiaque Ahmed Suvo, Amitabh Ray, Md. Ariful Islam Malik, and M. F. Mridha
Abstract In recent times, deep learning techniques, in particular Convolutional Neural Networks (CNNs), have been used extensively in computer vision and machine learning. They provide high accuracy on classification benchmarks such as MNIST, CIFAR-10, CIFAR-100, and ImageNet. Although a great deal of research on Bangla number plate recognition has been conducted in the last decade, none of it has been deployed as a physical system because of poor recognition accuracy. In this research work, we propose a new algorithm for vehicle number plate recognition based on Connected Component Analysis (CCA) and Convolutional Neural Networks (CNNs). We implemented the CCA technique for number plate detection and character segmentation, which produced 92.78% accuracy for number plate detection and 97.94% accuracy for character segmentation. We also implemented a CNN model for character recognition and trained it on a dataset named “PlateNumbers”. The dataset consists of 408 character images (120 × 110) in 17 classes; it is a standard dataset and the first of its kind. Finally, our CNN model achieved 96.91% accuracy in the character recognition stage. The results indicate that the performance of the system is noteworthy.
Md. Amzad Hossain (B) · I. A. Suvo · A. Ray · Md. Ariful Islam Malik · M. F. Mridha, Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Rupnagar R/A, Mirpur-2, Dhaka 1216, Bangladesh
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_69
Keywords NPR · CNN · CCA · Machine learning · Character segmentation · Character recognition · SVM
1 Introduction
Number Plate Recognition (NPR) is also known as Automatic Number Plate Recognition (ANPR) [1]. It is an important research area because of its applications in traffic enforcement, secure gated entrances, law enforcement, and toll gates. It also contributes to the development of Intelligent Transportation Systems (ITS) [2]. Modern life is closely tied to ITS because such systems can manage the movement of vehicles on roads and in cities. Bangladesh needs intelligent transportation systems for traffic control, toll gates, and parking, but developing them is difficult because of varying backgrounds, varying lighting effects, and especially the Bangla characters. In this research work, we used a database of images taken against different backgrounds, under different lighting effects, and from different angles. This research work introduces a structure for recognizing Bangladeshi number plates. The structure consists of four processing steps:
1. Preprocessing by removing noise and shadow.
2. Detecting the number plate by edge detection, CCA, morphological processing, and vertical projection.
3. Character segmentation by CCA and bounding box.
4. Character recognition by identifying the features of a character image and applying the CNN model.
We implemented a Connected Component Analysis (CCA) [3] architecture for detecting the number plate and segmenting each character. In the number plate detection (NPD) phase, we mainly focus on edge detection and morphological processing, and finally apply a vertical projection to detect the actual plate. In the character recognition and classification phase, we used a supervised classification technique. We implemented a supervised Convolutional Neural Network (CNN) [4] model for character recognition, which achieved a recognition rate of 96.91%. Our study also indicates that this research work can reach up to 97% overall accuracy, whereas the other methods achieve lower accuracy when recognizing Bangla number plates.
2 Related Works
This section surveys related work on number plate recognition systems. We studied and worked with Convolutional Neural Networks (CNNs) and Connected Component Analysis (CCA), but we also studied the other techniques that are used for number plate recognition. Hossen et al. [5] implemented a digital license plate recognition system for Bangladeshi vehicles using a back-propagation feed-forward neural network and morphological approaches. In [6], Shahed et al. presented an automatic Bengali NPR system that recognizes the characters of Bangla number plates for metropolitan cities; they focus on edge detection, CNN, and morphological operations. An ANPR system for Bangladeshi vehicles is presented in [7], where Amin et al. used Bangla Optical Character Recognition (OCR) in the final stage of NPR to recognize the characters on the number plate. In [8], Abdullah et al. implemented an LPR system for Dhaka metropolitan city based mainly on a YOLO network; they used the YOLOv3 algorithm to localize the number plate and recognize the characters. Roy et al. [9] focus on skew correction and on matching RGB color intensity with contour detection. Abedin et al. [10] implemented an LPR system for Bangla license plates based on deep learning models with contour properties. Rokonuzzaman et al. [11] implemented their system based on machine learning and the Robot Operating System (ROS). Gou et al. [12] combined several techniques: extremal regions, various validations, and edge detection in the primary stage, and Restricted Boltzmann Machines for character recognition in the final stage. Miah et al. [13] implemented an ANPR system for Bangladeshi vehicles, focusing on neural networks and pattern recognition with machine vision techniques. From the above study, we can conclude that a great deal of work has been done on number plate recognition. In this research work, we achieved up to 97% accuracy, which is a remarkable result.
3 Proposed Method
Our proposed method is designed for a Number Plate Recognition (NPR) system, a well-known image processing technology. Starting from the input image, the process of our system is divided into four main parts. The block diagram of the NPR system is shown in Fig. 1.
Fig. 1 Block diagram of NPR system (Input Image → Preprocessing → Number Plate Detection → Character Segmentation → Character Recognition → Output)
Fig. 2 Gray to binary image transform
3.1 Preprocessing
Preprocessing is the initial step of NPR. It improves the quality of the input image by removing shadows and noise, which in turn improves the number plate detection rate, so it is executed before the number plate detection stage. In this step, the input image is first converted to a grayscale image, in which each pixel value lies between 0 and 255. The grayscale image is then converted to a binary image, in which each pixel is either black or white. Figure 2 shows the transformation from grayscale to binary. Various preprocessing algorithms are used for NPR; we used the Otsu binarization method in this research work. With this method, the input image is divided into several sub-regions and a threshold value is calculated for each sub-region. The grayscale image is then converted to a binary image according to the sub-region thresholds.
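The thresholding step can be illustrated with a minimal, dependency-free sketch of Otsu's method. It computes a single global threshold for one block of pixels; applying it per sub-region, as described above, would mean calling it once for each block. The function names and the tiny sample image are assumptions made for illustration, not the implementation used in this work.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes the between-class variance.

    `pixels` is a flat iterable of 8-bit gray levels; pixels <= t form one class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(gray_rows, threshold):
    """Map a 2-D list of gray levels to 0 (dark) or 1 (bright)."""
    return [[1 if p > threshold else 0 for p in row] for row in gray_rows]


# Hypothetical 3 x 4 grayscale block with a dark and a bright cluster.
gray = [
    [20, 25, 30, 200],
    [22, 210, 215, 205],
    [18, 24, 220, 198],
]
t = otsu_threshold(p for row in gray for p in row)
binary = binarize(gray, t)
```

In practice an image library would normally be used for this step; the plain-list version is meant only to show what the threshold search does.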
3.2 Number Plate Detection
This is probably the most important stage of the NPR system, in which the position of the number plate is determined. An image is taken as input and the number plate is returned as the output image. First, we applied Connected Component Analysis (CCA) to identify the connected regions in the input image. Along with CCA, we also applied edge detection and morphological processing. CCA is useful for grouping and labeling connected regions: when the value of a pixel is similar to that of a neighboring pixel, the two are considered to be connected to each other. The labeling of CCA is shown in Fig. 3. For mapping and labeling all the connected regions, we used the measure.label method in this stage. We also used regionprops and patches.Rectangle to mark all the mapped regions with rectangles. In this way, all the regions that are connected to each other are detected. The labeled image is shown in Fig. 4.
Fig. 3 Labeling of connected component analysis
Figure 4 shows all the labeled regions, but it also contains unnecessary connected regions. To remove those regions, we used the following conditions, which characterize a number plate:
• The region must be rectangular in shape.
• The height of the region is less than its width.
• The width of the NP region is between 12 and 42% of the full image.
• The height of the NP region is between 5 and 20% of the full image.
After applying the above conditions, the result is as shown in Fig. 5.
Fig. 4 Map all the connected regions
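The labeling and filtering steps above can be sketched without any imaging library. The sketch below labels 4-connected foreground regions with a breadth-first search and then applies the height and width conditions to each bounding box. It does not use the measure.label or regionprops functions named in the text; the helper names, the choice of 4-connectivity, and the reading of the percentages as fractions of the image width and height are assumptions, and the rectangularity test is omitted for brevity.

```python
from collections import deque


def label_regions(binary):
    """Return bounding boxes (top, left, bottom, right) of 4-connected regions of 1-pixels."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def looks_like_plate(box, img_h, img_w):
    """Apply the size conditions: wider than tall, 12-42% of width, 5-20% of height."""
    top, left, bottom, right = box
    height = bottom - top + 1
    width = right - left + 1
    return (height < width
            and 0.12 * img_w <= width <= 0.42 * img_w
            and 0.05 * img_h <= height <= 0.20 * img_h)


# Hypothetical 20 x 20 binary image with one wide block and one tall block.
img = [[0] * 20 for _ in range(20)]
for y in range(3, 5):
    for x in range(2, 8):
        img[y][x] = 1   # 2 rows x 6 columns: plate-like
for y in range(10, 16):
    for x in range(12, 14):
        img[y][x] = 1   # 6 rows x 2 columns: too tall

boxes = label_regions(img)
candidates = [box for box in boxes if looks_like_plate(box, 20, 20)]
```

Several candidates may survive these conditions on a real image, which is the case the vertical projection step is meant to resolve.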
Fig. 5 Identify the actual number plate
Figure 5 shows the detected plate region. Sometimes, however, other regions resemble a number plate, for example headlamps and stickers, and the system then wrongly detects more than one region as a plate region. To handle that situation, we used a vertical projection, which identifies the actual number plate according to its density: the density of the actual plate area is always high because characters are written on it. The projection process finally identifies and extracts the actual number plate, as shown in Fig. 6.
Fig. 6 Extracted actual number plate
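One way to read the density criterion is sketched below: each candidate region is projected onto its columns, and the candidate with the largest share of foreground pixels is kept. The function names, the exact density measure, and the two toy regions are assumptions for illustration; the source does not specify how the density is computed.

```python
def column_projection(region):
    """Count the foreground pixels (value 1) in each column of a binary region."""
    return [sum(column) for column in zip(*region)]


def ink_density(region):
    """Fraction of foreground pixels, computed from the column projection."""
    profile = column_projection(region)
    return sum(profile) / (len(region) * len(region[0]))


def pick_plate(candidates):
    """Return the candidate region with the highest projection density."""
    return max(candidates, key=ink_density)


# Hypothetical candidates: a nearly empty headlamp-like region and a character-rich one.
headlamp = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
plate = [
    [1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0, 1],
]
chosen = pick_plate([headlamp, plate])
```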
Fig. 7 Line segmentation
3.3 Character Segmentation
Character segmentation is the step that follows number plate detection. We segmented the characters of the number plate in two steps: line segmentation, and word and character segmentation. Both steps are based on the horizontal and vertical histograms of the plate image.
Line Segmentation: In the line segmentation stage, we separated the lines of the number plate by scanning the plate image horizontally. We constructed a row histogram that counts the black pixels in each row. A row whose count is zero marks a boundary between two lines, so the lines on either side of it are separate and not connected. That is shown in Fig. 7.
Word and character Segmentation: In the word and character segmentation stage, we separated each word and character of the number plate by scanning the plate image vertically. We constructed a column histogram that counts the black pixels in each column. A continuous run of columns that contain black pixels is considered to be a word or character, whereas a column whose count is zero denotes the space between words or characters, so the words or characters on either side of it are considered to be separate. That is shown in Fig. 8.
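The two histogram passes can be sketched as follows. Rows with a zero count split the plate into lines, and within each line, columns with a zero count split it into words or characters. The function names and the toy plate (1 marks a black pixel) are assumptions for illustration.

```python
def runs_of_ink(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment(plate):
    """Split a binary plate into lines, then each line into character boxes.

    Returns one list per text line; each box is (top, bottom, left, right)
    with exclusive bottom and right bounds.
    """
    row_hist = [sum(row) for row in plate]            # black pixels per row
    lines = []
    for top, bottom in runs_of_ink(row_hist):
        band = plate[top:bottom]
        col_hist = [sum(col) for col in zip(*band)]   # black pixels per column
        lines.append([(top, bottom, left, right)
                      for left, right in runs_of_ink(col_hist)])
    return lines


# Hypothetical plate with two text lines separated by an empty row.
plate = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
]
boxes_per_line = segment(plate)
```

Each returned box can then be cropped and resized before it is passed to the recognition stage.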
Fig. 8 Character segmentation
3.4 Character Recognition
This is the last step of our NPR work, in which we used a machine learning approach to recognize the plate characters. Machine learning approaches are mainly categorized into supervised learning, unsupervised learning, and reinforcement learning. We used supervised learning because the output classes of a number plate are already known. We then implemented our CNN model. To train this model, we used a database that combines the training and testing datasets. We used an image size of 120 px by 110 px for training and testing, because each character image extracted from the plate image had already been resized to 120 × 110. After training and testing our model, we applied it to recognize the plate characters. The structure of our CNN model is shown in Fig. 9. After applying our CNN model, we successfully recognized the words and characters of the number plate. Some recognition results are shown in Fig. 10.
Fig. 9 Our system’s convolution neural network
Fig. 10 Bangla word and character recognition
4 NPRBV Algorithm
Result: Recognize number plate characters
Input: Color image
1. Input an image
2. Convert to grayscale image
3. Calculate threshold value and convert grayscale to binary image
4. Apply CCA to identify connected regions
5. while Region > 50 do
   a. if Region == Like Number Plate then
      i. Store this region
   b. else
      i. Eliminate this region
      end
   end
6. Apply vertical projection on all stored regions and detect the actual number plate
7. Apply CCA and segment all plate characters from the number plate
8. Store each segmented character as an image
9. Apply CNN and predict all class labels for those images
10. Recognize all plate characters and output them
This is the overall algorithmic process of our research work. The implementation is based mainly on two techniques: Connected Component Analysis (CCA) and Convolutional Neural Network (CNN). Both techniques provided high accuracy in their respective steps. The flowchart of our research work is shown in Fig. 11.
Fig. 11 Flowchart of number plate recognition system (Start → Input image → Convert to grayscale image → Filter and convert to binary image → Connected component analysis (CCA), edge detection, and morphological processing → Region == Like NP? No: eliminate; Yes: vertical projection → Connected component analysis (CCA) → Character image → CNN model, obtained by CNN training on the training and test datasets → NP image label prediction → Final recognition → End)
5 Experimental Results and Analysis
The purpose of testing is to discover the errors and defects in a work. In our research work, we used images of different colors, angles, and sizes. We worked with a total of 408 images in 17 classes, which were used to train and test our NPR system. To implement our model, we used Anaconda, Spyder, Python, Keras, and TensorFlow. Our model performs well in every step of NPR: we obtained 92.78% for number plate extraction, 97.94% for character segmentation, and 96.91% for character recognition. The performance of our model over 10 epochs is shown in Table 1.
Result of Existing Method: For the existing method, we used a Support Vector Machine (SVM) model. SVM is a widely used machine learning algorithm. We also used a cross-validation technique to determine the validation accuracy of the model. The accuracy of the existing model is very poor, at almost 63%, as shown in Fig. 12. The accuracy graph of the existing model goes up and down from epoch to epoch and finally reaches 63%, so this method produces a large number of recognition errors.
Result of Proposed Method: For the proposed method, we used CCA along with the CNN model. The proposed method is almost 97% accurate, with an error rate of less than 4%, which is much better than the existing method. The accuracy of our method is shown in Fig. 13: the accuracy of our model increases from epoch to epoch and reaches a level that the existing system does not attain. Similarly, the error rate decreases from epoch to epoch and falls below 4%, as shown in Fig. 14.
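The cross-validation mentioned above can be sketched as a k-fold split over sample indices. The number of folds is not given in the text, so k = 5 below is an assumption, as are the function names and the `fit_and_score` callback, which stands in for training a model on the training indices and returning its accuracy on the validation indices. A real experiment would normally shuffle the samples before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) lists for k contiguous folds of nearly equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_val_accuracy(n_samples, k, fit_and_score):
    """Average the validation scores returned by fit_and_score(train_idx, val_idx)."""
    scores = [fit_and_score(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 408 character images as in this work; k = 5 is a hypothetical choice.
folds = list(k_fold_indices(408, 5))
fold_sizes = [len(val_idx) for _, val_idx in folds]
```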
We used different techniques for the different steps of NPR and obtained good results in each step. Table 2 shows the results of the different steps, with the applied techniques, for the input images.
Table 1 Training and validation results of our model for 10 epochs
Item | Epoch 1 | Epoch 2 | Epoch 3 | Epoch 4 | Epoch 5 | Epoch 6 | Epoch 7 | Epoch 8 | Epoch 9 | Epoch 10
Training accuracy | 0.340 | 0.736 | 0.862 | 0.920 | 0.947 | 0.965 | 0.977 | 0.986 | 0.994 | 0.996
Validation accuracy | 0.567 | 0.721 | 0.848 | 0.827 | 0.899 | 0.924 | 0.911 | 0.901 | 0.942 | 0.971
Training error | 2.174 | 0.892 | 0.457 | 0.263 | 0.170 | 0.114 | 0.078 | 0.052 | 0.028 | 0.021
Validation error | 2.896 | 0.685 | 0.481 | 0.480 | 0.363 | 0.369 | 0.402 | 0.362 | 0.351 | 0.349
Fig. 12 Plotting graph of the existing method
Fig. 13 Graph of the proposed method for accuracy
Fig. 14 Plotting graph for error
Table 2 Accuracy of different stages of proposed work with applied techniques
Techniques | Steps of NPR | Number of images | Correctly classified | Accuracy (%)
CCA | Number plate detection | 97 | 90 | 92.78
CCA | Character segmentation | 97 | 95 | 97.94
CNN | Character recognition | 97 | 94 | 96.91
Discussion: Our existing method is implemented with the Support Vector Machine (SVM), whereas our proposed method is implemented with Connected Component Analysis (CCA) and a Convolutional Neural Network (CNN). CCA is an application of graph theory in which the subsets of connected components are uniquely labeled based on a given heuristic. CNN is most commonly applied to analyzing visual imagery. We used a cross-validation technique to determine the validation accuracy of the existing method, and the resulting accuracy is almost 63%, which is very poor. For the proposed method, we calculated four metrics on the training and testing sets: “loss”, “val-loss”, “acc”, and “val-acc”. Here, “acc” and “val-acc” define the accuracy of our work, and they give an accuracy of almost 97% for the proposed method. The accuracy comparison graph of the existing and proposed methods is shown in Fig. 15. From the above discussion and comparison graphs, it is clear that our proposed method is better than the existing method. Table 3 compares the accuracy of the proposed method, the existing method, and other related works.
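The stage accuracies in Table 2 follow directly from the counts of correctly processed images. The short check below reproduces them; the function name and the dictionary are introduced here for illustration only, and the counts are copied from Table 2.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage, rounded to two decimals as in Table 2."""
    return round(100.0 * correct / total, 2)


# (correctly classified, number of images), copied from Table 2.
stages = {
    "Number plate detection": (90, 97),
    "Character segmentation": (95, 97),
    "Character recognition": (94, 97),
}

for stage, (correct, total) in stages.items():
    print(f"{stage}: {accuracy_percent(correct, total)}%")
```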
Fig. 15 Plotting graph of the proposed method for error
Table 3 Comparison among proposed, existing, and other well-reported methods for number plate recognition
References | NPD accuracy (%) | CS accuracy (%) | CR accuracy (%)
Proposed method | 92.78 | 97.94 | 96.91
Hossen et al. [5] | 93.89 | 98.22 | 92.77
Abdullah et al. [8] | 89.00 | 90.00 | 92.70
Miah et al. [13] | 94.20 | 92.60 | 91.80
Existing method | 92.78 | 96.94 | 63.00
In the above table, we compare our research work with four other methods, including the existing method, focusing on the accuracy of the three basic stages of NPR: number plate detection (NPD), character segmentation (CS), and character recognition (CR). In the first stage, most of the works detect the number plate with an accuracy similar to ours, and the same holds for the second stage. In the final stage, however, none of the other works recognizes characters with more than 93% accuracy, whereas our work recognizes characters with 96.91% accuracy. The overall system accuracy of the proposed, existing, and other related works is shown in the histogram in Fig. 16.
Fig. 16 Overall comparison of proposed versus other methods
6 Conclusion
In this paper, we have implemented a number plate recognition algorithm for Bangladeshi vehicles. The algorithm represents a complete NPR system based on machine learning. We applied different tools and techniques in each step of the NPR system. First, we used edge detection and morphological processing along with CCA for number plate detection. In some cases, we also used vertical projection to distinguish the actual number plate from regions that look like a number plate. Subsequently, for character segmentation, we used the bounding box along with Connected Component Analysis (CCA). Finally, we used the Convolutional Neural Network (CNN) for character recognition by extracting features from the segmented images. The input images were taken against different backgrounds, with different lighting effects and variations of the plate model. Our work performs well in all steps of number plate recognition: it achieved a 92.78% success rate for number plate detection under varying distance between vehicle and camera, a 97.94% success rate for character segmentation, and a 96.91% success rate for character recognition.
References
1. S. Azam, Md. Monirul Islam, Automatic License Plate Detection in Hazardous Condition, vol. 36 (Elsevier, 2016), pp. 172–186
2. J.P.D. Dalida, A.-J.N. Galiza, A.G.O. Godoy, M.Q. Nakaegawa, J.L.M. Vallester, A.R. dela Cruz, Development of Intelligent Transportation System for Philippine License Plate Recognition (IEEE, 2016), pp. 3762–3766
3. Md. Azher Uddin, J.B. Joolee, S.A. Chowdhury, Bangladeshi Vehicle Digital License Plate Recognition for Metropolitan Cities Using Support Vector Machine (IEEE, 2016)
4. P. Dhar, S. Guha, T. Biswas, Md. Zainal Abedin, A System Design for License Plate Recognition by Using Edge Detection and Convolution Neural Network (IEEE, 2018), pp. 1–4
5. M.K. Hossen, A.C. Roy, Md. Shahnur Azad Chowdhury, Md. Sajjatul Islam, K. Deb, License Plate Detection and Recognition System Based on Morphological Approach and Feed-Forward Neural Network, vol. 18 (IEEE, 2018), pp. 36–45
6. Md. Tanvir Shahed, Md. Rahatul Islam Udoy, B. Saha, A.I. Khan, S. Subrina, Automatic Bengali Number Plate Reader (IEEE, 2017), pp. 1364–1368
7. Md. Ruhul Amin, N. Mohammad, Md. Abu Naser Bikas, An automatic number plate recognition of Bangladeshi vehicles. Citeseer, vol. 93 (2014)
8. S. Abdullah, Md. Mahedi Hasan, S.M. Saiful Islam, YOLO-based three-stage network for Bangla license plate recognition in Dhaka metropolitan city, in ICBSLP (2018), pp. 1–6
9. A.C. Roy, M.K. Hossen, D. Nag, License Plate Detection and Character Recognition System for Commercial Vehicles Based on Morphological Approach and Template Matching (IEEE, 2016), pp. 1–6
10. Md. Zainal Abedin, A.C. Nath, P. Dhar, K. Deb, M. Shahadat Hossain, License Plate Recognition System Based on Contour Properties and Deep Learning Model (IEEE, 2017), pp. 590–593
11. M. Rokonuzzaman, M.A. Al Amin, M.H.K.M.U. Ahmed, M. Towhidur Rahman, Automatic Vehicle Identification System Using Machine Learning and Robot Operating System (ROS) (IEEE, 2017), pp. 253–258
12. C. Gou, K. Wang, Y. Yao, Z. Li, Vehicle Number Plate Recognition Based on Extremal Regions and Restricted Boltzmann Machines, vol. 17 (IEEE, 2015), pp. 1096–1107
13. M.B.A. Miah, S. Akter, C. Bonik, Automatic Bangladeshi vehicle number plate recognition system using neural network. Am. Int. J. Res. Sci. Technol. Eng. Math. 62–66 (2015)
The Model to Determine the Location and the Date by the Length of Shadow of Objects for Communication Networks
Renrui Zhang
Abstract Sun-shadow positioning technology is a new positioning method that determines the location and date of shooting in communication networks from the changes in the sun-shadows of objects. Based on an analysis of the changes in the sun-shadow, this paper uses the solar azimuth, elevation angle, declination angle, and solar hour angle to establish a mathematical model that determines the position and date of objects. Firstly, the principle of the sun-shadow is analyzed using the relevant parameters, the sun-shadow, the geographical coordinates, and the provided moments. Secondly, two different methods are used to calculate the positions. Lastly, 40 groups of shadow-length data for a straight bar are obtained, one per minute starting from 8:55, by processing the video information with CAD software. The actual shadow length is then solved for by means of a similarity relationship. According to the relationship between the camera coordinate system and the world coordinate system, and with reference to the solution of the latitude and longitude under the sundial model, we find that the video may have been shot in Hohhot in July.
Keywords Analemmatic sundial model · Sun-benchmark orientation method · Camera coordinate system · Longitude correction
R. Zhang (B) School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_70
1 Introduction
In the era of rapid development of Internet technology, image and video processing has become an important means of extracting information and has attracted growing international attention. Determining the position and recording date of a video is an important aspect of video data analysis [1–5]. Sun-shadow positioning technology, an important method for determining the shooting location and date of a video, achieves higher accuracy by analyzing the change of the sun-shadow in the video. Establishing mathematical and physical models that capture the changes in the sun-shadows of objects in order to obtain the position and the shooting date also has a very wide range of applications, especially in the military field [6–10]. We choose an interesting problem from the Internet to build an appropriate model, and then test and improve the model with the Appendix data [11–14].
Sun-shadow positioning technology is a new positioning method that determines the location and date of shooting from the changes in the sun-shadows of objects [15]. Based on an analysis of the changes in the sun-shadow, this paper uses the solar azimuth, elevation angle, declination angle, and solar hour angle to establish a mathematical model that determines the position and date of objects.
Firstly, by analyzing the formation principle of the sun-shadow and combining the relevant parameters with the sun-shadow, the geographical coordinates, and the provided moments, we constructed two different models of the shadow length of a straight bar: one without the time difference and one with the time difference. By solving these models, we obtain the shadow-length values of the straight bar in Tiananmen Square over time and then draw the shadow curve with MATLAB [16]. A simple conclusion follows: when the time difference is not considered, the shadow-length curve is symmetrical; when the time difference is considered, the curve is no longer symmetric.
Secondly, this paper uses two different methods to calculate the positions [17]. Method I: the shadow-length model is established according to the definition of the solar elevation angle. We then find that the computed latitude and longitude, N 2.4° E 110°, remain almost the same even if the length of the straight bar is changed. Judging from the latitude and longitude, the location may be in Malaysia [18]. Method II: taking the ground plane on which the shadow falls as the sundial plane and the fixed straight bar as the gnomon, we obtain the orientations and the coordinate system of directions with the sun-benchmark orientation method, based on the Cartesian coordinate system mentioned above [19]. We then obtain the approximate range of longitude (111°–130.5°) from the relationship between time difference and longitude. After accuracy correction, the range of longitude becomes (109.5°–129.3°) [20]. Combined with the latitude solutions (N 2.6° and N 18.2°), the location may be in Hainan or Malaysia. Although the methods are different, the results are basically consistent.
Thirdly, in addition to the geographical location of the straight bar, we also need to find the date. According to the relationship between the motion pattern of the subsolar point and the solar elevation angle, and taking the spring equinox as the reference time and the Tropic of Cancer as the reference latitude [21], we finally obtain the positions N 22, E 75, possibly in Xinjiang, dated July 7, calculated from Appendix II; and N 33, E 112, possibly in Hubei.
Lastly, we obtain 40 groups of shadow-length data for the straight bar, one per minute starting from 8:55, by processing the video information with CAD software [22]. We then solve for the actual shadow length by means of a similarity relationship. According to the relationship between the camera coordinate system and the world coordinate system, and with reference to the solution of the latitude and longitude under the sundial model, we find that the video may have been shot in Hohhot in July.
Table 1 Symbol description
Symbol        Description
α             Solar elevation angle
δ             Solar declination angle
ϕ             Local latitude
β             Solar hour angle
l             The length of shadow
l_n           The distance from point to straight bar
H             Straight bar height
L             Central meridian
L_meridian    Object longitude values
t_lc          Longitude calibration value
h             Time angle
σ             Direct sunlight latitude
1. Establish a mathematical model of the changes in shadow length; analyze how the shadow length varies with the various parameters; draw the shadow length curve of a 3-m straight bar at Tiananmen Square (north latitude 39° 54 min 26 s, east longitude 116° 23 min 29 s) from 9:00 to 15:00 (Beijing time) on Oct. 22, 2015.
2. From the coordinates of the shadow endpoints of a straight bar fixed on the ground, build a mathematical model to determine where the straight bar is located. Then apply the model to the data of Appendix 1 to obtain a number of possible locations.
3. Construct another mathematical model to determine the date on which the straight bar was filmed.
4. Given that the height of the straight bar in the video is estimated to be 2 m, establish a mathematical model to determine the possible location where the video was shot.
2 Model Assumptions and Symbol Description
1. The plane containing the objects in the video is assumed to be horizontal (symbols are listed in Table 1);
2. The Earth is assumed to be a standard sphere;
3. Atmospheric refraction is ignored;
4. The sun's rays reaching the Earth are assumed to be parallel, and the horizontal ground at any place on the Earth is the tangent plane at that place.
3 The Establishment and Solution of Model
The online problem asks for a graph of the trend of the shadow length. Starting from the definition of the solar elevation angle, we discuss how the graph differs when the time difference is considered and when it is not.
Fig. 1 Relationship between shadow length and sun elevation angle
(1) Solution of the length of the straight rod's sun-shadow
Using the geometric relation between the solar elevation angle [15] and the shadow length of the straight bar, the shadow lengths at each time between 9:00 and 15:00 are obtained, from which the curve of the shadow length can be drawn. From Fig. 1, the shadow length is calculated as

\[ l = \frac{H}{\tan\alpha} \tag{1} \]

where H is the length of the straight bar and α is the solar elevation angle. From the relevant references, the solar elevation angle is given by

\[ \sin\alpha = \sin\varphi \sin\delta + \cos\varphi \cos\delta \cos\beta \tag{2} \]
where ϕ is the local latitude, δ is the solar declination [17], and β is the solar hour angle [16]. To calculate the shadow length, these parameters must be determined first.
a. Solar declination angle [17]
From the references, a precise formula for the solar declination [17] is

\[ \delta = \frac{180^\circ}{\pi}\,\bigl(0.006918 - 0.399912\cos\gamma + 0.070257\sin\gamma - 0.006758\cos 2\gamma + 0.000907\sin 2\gamma - 0.002697\cos 3\gamma + 0.00149\sin 3\gamma\bigr) \tag{3} \]

\[ \gamma = \frac{2\pi (N - 1)}{365} \tag{4} \]

where δ is measured in degrees and N is the day number counted from Jan. 1.
b. Solar hour angle [16] and solar azimuth [4]
The solar hour angle is

\[ \beta = (T - 12) \times 15^\circ \tag{5} \]

and the solar azimuth A satisfies

\[ \cos A = \frac{\sin\alpha \sin\varphi - \sin\delta}{\cos\alpha \cos\varphi} \tag{6} \]

\[ \sin A = \frac{\cos\delta \sin\beta}{\cos\alpha} \tag{7} \]
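The formulas in Eqs. (2)–(7) can be sketched in a few lines of Python. This is only a minimal illustration, not the author's MATLAB code: the function names are hypothetical, the coefficient 0.00149 is taken as printed in Eq. (3), and T is assumed to be the time in hours used in Eq. (5).

```python
import math


def solar_declination_deg(day_number):
    """Eqs. (3)-(4): solar declination in degrees for day N (Jan. 1 is N = 1)."""
    gamma = 2.0 * math.pi * (day_number - 1) / 365.0
    delta_rad = (
        0.006918
        - 0.399912 * math.cos(gamma) + 0.070257 * math.sin(gamma)
        - 0.006758 * math.cos(2 * gamma) + 0.000907 * math.sin(2 * gamma)
        - 0.002697 * math.cos(3 * gamma) + 0.00149 * math.sin(3 * gamma)
    )
    return math.degrees(delta_rad)


def hour_angle_deg(time_h):
    """Eq. (5): solar hour angle in degrees for time T in hours."""
    return (time_h - 12.0) * 15.0


def elevation_deg(lat_deg, decl_deg, hour_deg):
    """Eq. (2): solar elevation angle in degrees."""
    phi, delta, beta = (math.radians(v) for v in (lat_deg, decl_deg, hour_deg))
    sin_alpha = (math.sin(phi) * math.sin(delta)
                 + math.cos(phi) * math.cos(delta) * math.cos(beta))
    return math.degrees(math.asin(sin_alpha))


def azimuth_deg(lat_deg, decl_deg, hour_deg):
    """Eqs. (6)-(7): solar azimuth A in degrees, negative before noon."""
    phi, delta, beta = (math.radians(v) for v in (lat_deg, decl_deg, hour_deg))
    alpha = math.radians(elevation_deg(lat_deg, decl_deg, hour_deg))
    cos_a = ((math.sin(alpha) * math.sin(phi) - math.sin(delta))
             / (math.cos(alpha) * math.cos(phi)))
    sin_a = math.cos(delta) * math.sin(beta) / math.cos(alpha)
    # atan2 combines Eqs. (6) and (7) so that the quadrant of A is unambiguous.
    return math.degrees(math.atan2(sin_a, cos_a))
```

Using both the sine and the cosine of A through `atan2` avoids the sign ambiguity that arises when only one of Eqs. (6) or (7) is inverted.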
(2) Establishment of the trajectory of the shadow
From the references, the solar azimuth [18] is the angle between the projection of the sunlight on the ground and the local meridian, which can be approximately regarded as the angle between the shadow of the erected rod and due south. To describe the trajectory of the shadow, and further to determine the solar azimuth, we construct a Cartesian coordinate system in which due east is the X-axis and due north is the Y-axis. The sun-shadow of an object then moves from west to east: before noon the shadow lies on the west side, after noon it lies on the east side, and the shadow is shortest at noon. According to the principle of the sundial shadow and the concept of the solar azimuth, we construct the following coordinates. Let the endpoint of the rod's shadow be (x_i, y_i) (i = 9, 10, ..., 15); then

\[ x_i = -l \cos(180^\circ - A), \qquad y_i = l \sin(180^\circ - A) \tag{8} \]
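As a small illustration, Eq. (8) can be evaluated directly. The sketch below follows the equation exactly as printed; the function name `shadow_endpoint` is hypothetical, and the sample inputs are the 9:00 values reported later in Tables 4 and 5.

```python
import math


def shadow_endpoint(length_m, azimuth_deg):
    """Eq. (8) as printed: endpoint (x, y) of a shadow of given length and solar azimuth A."""
    angle = math.radians(180.0 - azimuth_deg)
    x = -length_m * math.cos(angle)
    y = length_m * math.sin(angle)
    return x, y


# Sample call with the 9:00 values from Tables 4 and 5 (l = 6.6917 m, A = -50.4 degrees).
x9, y9 = shadow_endpoint(6.6917, -50.4)
print(round(x9, 3), round(y9, 3))
```

With A = 0 at noon the formula returns (l, 0), which agrees with the 12:00 column of the coordinate table in the paper.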
4 Solution of the Model for Problem
(1) Solution of the solar declination angle [17]
If Jan. 1 is the first day of the year, then Oct. 22 is the 295th day, that is, N = 295. Substituting N into Eq. (3) gives δ = −11.0740°.
(2) According to Eq. (5), the solar hour angle is obtained as shown in Table 2.
(3) According to Eq. (2), the solar elevation angle is calculated. Tiananmen Square in Beijing is located at north latitude 39° 54 min 26 s, east longitude 116° 23 min 29 s. After unit conversion, ϕ = 39.907°, μ = 116.391°. Substituting the values of δ, β, and ϕ into Eq. (2), we obtain the solar elevation angle at each time, as listed in Table 3. Data analysis: the data show that the solar elevation angle from 9:00 to 15:00 is symmetric about 12:00 and reaches its maximum at noon.
(4) Solution of the length of the rod's shadow
The length of the straight bar is 3 m, that is, H = 3. Substituting the value of α at each time into Eq. (1) gives the shadow lengths in Table 4. Fitting these data with MATLAB gives the following expression for the shadow length:

\[ y = 0.3375x^2 - 8.101x + 52.2, \qquad R^2 = 0.99 \]

(5) Curve of change of shadow length
The curve of the shadow length of the 3-m straight bar is drawn from the data in Table 4, as shown in Fig. 2.
Table 2 Solar hour angle
Time (unit: h)                9     10    11    12    13    14    15
Solar hour angle (unit: °)    −45   −30   −15   0     15    30    45
Table 3 Solar elevation angle
Time (unit: h)                     9       10      11      12      13      14      15
Solar elevation angle (unit: °)    24.15   31.92   37.15   39.02   37.15   31.92   24.15
Table 4 The length of straight shadow
Time (unit: h)            9        10       11       12       13       14       15
Shadow length (unit: m)   6.6917   4.8161   3.9593   3.7022   3.9593   4.8161   6.6917
Fig. 2 Change curve of shadow length of 3-m straight
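The values in Tables 3 and 4 can be checked with a short script. The sketch below is an illustration in Python rather than the MATLAB workflow used in the paper; it takes ϕ = 39.907°, δ = −11.074°, and H = 3 m from the text, ignores the time difference, and uses hypothetical function names.

```python
import math

LATITUDE_DEG = 39.907      # Tiananmen Square latitude, from the text
DECLINATION_DEG = -11.074  # declination reported for Oct. 22 (N = 295)
BAR_HEIGHT_M = 3.0         # H in Eq. (1)


def elevation_deg(hour):
    """Eq. (2) with the hour angle of Eq. (5); hour is the time in hours."""
    phi = math.radians(LATITUDE_DEG)
    delta = math.radians(DECLINATION_DEG)
    beta = math.radians((hour - 12.0) * 15.0)
    sin_alpha = (math.sin(phi) * math.sin(delta)
                 + math.cos(phi) * math.cos(delta) * math.cos(beta))
    return math.degrees(math.asin(sin_alpha))


def shadow_length_m(hour):
    """Eq. (1): l = H / tan(alpha)."""
    return BAR_HEIGHT_M / math.tan(math.radians(elevation_deg(hour)))


for hour in range(9, 16):
    print(hour, round(elevation_deg(hour), 2), round(shadow_length_m(hour), 4))
```

Because the hour angle enters only through its cosine, the computed values are symmetric about 12:00, which is the symmetry noted in the data analysis.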
The shadow length from 9:00 to 15:00 is symmetric about 12:00 and reaches its minimum at noon. The shadow length ranges over the closed interval [3.7, 6.69]. The solar azimuth is solved next: substituting the values of δ, β, and α into Eqs. (6) and (7) gives the solar azimuth in Table 5, and the coordinates of the shadow endpoints are given in Table 6.
Table 5 Solar azimuth
Time (unit: h)            9       10      11      12    13     14     15
Solar azimuth (unit: °)   −50.4   −35.5   −18.6   0     18.6   35.5   50.4
Table 6 Solution of coordinates of straight shadow endpoints
Time (unit: h)   9        10       11      12      13      14       15
x-coordinate     4.262    3.920    3.752   3.702   3.752   3.920    3.894
y-coordinate     −5.152   −2.793   1.263   0       1.263   −2.793   −5.420
5 Conclusion
Using the solar azimuth, elevation angle, declination angle, and hour angle, this paper establishes a mathematical model to determine the position and date for some objects. Firstly, the principle of sun-shadow formation is analyzed using the relevant sun-shadow parameters, the geographical coordinates, and the given times. Secondly, two different methods are used to calculate the position. Finally, 40 groups of shadow length data for the straight bar, one per minute starting from 8:55, are obtained by processing the video with CAD software. The actual shadow length is then recovered using similarity relations. According to the relationship between the camera coordinate system and the world coordinate system, and consulting the latitude and longitude solution under the sundial model, we find that the place in the video may be Hohhot, in July.
References 1. H. Tao, M.Z.A. Bhuiyan, M.A. Rahman, T. Wang, J. Wu, S.Q. Salih, Y. Li, T. Hayajneh, TrustData: trustworthy and secured data collection for event detection in industrial cyberphysical system. IEEE Trans. Ind. Inform. (2019) 2. M.A. Rahman, A.T. Asyhari, S. Azad, M.M. Hasan, C.P. Munaiseche, M. Krisnanda, A cyberenabled mission-critical system for post-flood response: exploiting TV White Space as network backhaul links. IEEE Access 7, 100318–100331 (2019) 3. H. Tao, I. Ebtehaj, H. Bonakdari, S. Heddam, C. Voyant, N. Al-Ansari, R. Deo, Z.M. Yaseen, Designing a new data intelligence model for global solar radiation prediction: application of multivariate modeling scheme. Energies 12(7), 1365 (2019) 4. M.A. Rahman, Q.M. Salih, A.T. Asyhari, S. Azad, Traveling distance estimation to mitigate unnecessary handoff in mobile wireless networks. Ann. Telecommun. 1–10 (2019) 5. H. Tao, M.Z.A. Bhuiyan, M.A. Rahman, G. Wang, T. Wang, M.M. Ahmed, J. Li, Economic perspective analysis of protecting big data security and privacy. Future Gener. Comput. Syst. 98, 660–671 (2019) 6. M.A. Rahman, A.T. Asyhari, M.Z.A. Bhuiyan, Q.M. Salih, K.Z.B. Zamli, L-CAQ: joint linkoriented channel-availability and channel-quality based channel selection for mobile cognitive radio networks. J. Netw. Comput. Appl. 113, 26–35 (2018) 7. H. Tao, A.M. Bobaker, M.M. Ramal, Z.M. Yaseen, M.S. Hossain, S. Shahid, Determination of biochemical oxygen demand and dissolved oxygen for semi-arid river environment: application of soft computing models. Environ. Sci. Pollut. Res. 26(1), 923–937 (2019) 8. H. Tao, M.Z.A. Bhuiyan, A.N. Abdalla, M.M. Hassan, J.M. Zain, T. Hayajneh, Secured data collection with hardware-based ciphers for IoT-based healthcare. IEEE Internet Things J. (2018) 9. A. Rahman, S.N. Sadat, A.T. Asyhari, N. Refat, M.N. Kabir, R.A. Arshah, A secure and sustainable framework to mitigate hazardous activities in online social networks. IEEE Trans. Sustain. Comput. (2019) 10. H. Tao, L. 
Diop, A. Bodian, K. Djaman, P.M. Ndiaye, Z.M. Yaseen, Reference evapotranspiration prediction using hybridized fuzzy model with firefly algorithm: regional case study in Burkina Faso. Agric. Water Manag. 208, 140–151 (2018) 11. M.A. Rahman, S. Azad, A.T. Asyhari, M.Z.A. Bhuiyan, K. Anwar, Collab-SAR: a collaborative avalanche search-and-rescue missions exploiting hostile alpine networks. IEEE Access 6, 42094–42107 (2018)
12. H. Tao, B. Keshtegar, Z.M. Yaseen, The feasibility of integrative radial basis M5Tree predictive model for river suspended sediment load simulation. Water Resour. Manag. 1–20 (2019) 13. M.A. Rahman, V. Mezhuyev, M.Z.A. Bhuiyan, S.N. Sadat, S.A.B. Zakaria, N. Refat, Reliable decision making of accepting friend request on online social networks. IEEE Access 6, 9484– 9491 (2018) 14. H. Tao, S.O. Sulaiman, Z.M. Yaseen, H. Asadi, S.G. Meshram, M.A. Ghorbani, What is the potential of integrating phase space reconstruction with SVM-FFA data-intelligence model? Application of rainfall forecasting over regional scale. Water Resour. Manag. 1–25 (2018) 15. M.U. Saleem, Gnomon assessment for geographic coordinate, solar horizontal & equatorial coordinates, time of local sunrise, noon, sunset, direction of qibla, size of Earth & Sun for Lahore Pakistan. Open J. Appl. Sci. 6(02), 100 (2016) 16. M. Macedon, V. Ion, Angular stroke requirements for solar tracking azimuthal mechanism at any latitude, in IFToMM World Congress on Mechanism and Machine Science (Springer, Cham, 2019), pp. 3573–3582 17. A.L. Mahmood, Date/time operated two axis solar radiation tracking system for Baghdad city. Int. J. Appl. Eng. Res. 13(7), 5534–5537 (2018) 18. I.D.A.S. Tracker, Regular paper improve dual axis solar tracker algorithm based on sunrise and sunset position. J. Electr. Syst. 11(4), 397–406 (2015) 19. S. Skouri, A.B.H. Ali, S. Bouadila, M.B. Salah, S.B. Nasrallah, Design and construction of sun tracking systems for solar parabolic concentrator displacement. Renew. Sustain. Energy Rev. 60, 1419–1429 (2016) 20. W. Nsengiyumva, S.G. Chen, L. Hu, X. Chen, Recent advancements and challenges in solar tracking systems (STS): a review. Renew. Sustain. Energy Rev. 81, 250–279 (2018) 21. A.Z. Hafez, A.M. Yousef, N.M. Harag, Solar tracking systems: technologies and trackers drive types–A review. Renew. Sustain. Energy Rev. 91, 754–782 (2018) 22. C. Morón, J. Díaz, D. Ferrández, M. 
Ramos, Mechatronic prototype of parabolic solar tracker. Sensors 16(6), 882 (2016)
CW-CAE: Pulmonary Nodule Detection from Imbalanced Dataset Using Class-Weighted Convolutional Autoencoder
Seba Susan, Dhaarna Sethi, and Kriti Arora

Abstract A Class-Weighted Convolutional Autoencoder (CW-CAE) is proposed in this paper to resolve the skewed class distribution found in lung nodule image datasets. The source of these images is the Lung Image Database Consortium image collection (LIDC-IDRI), which comprises lung Computed Tomography (CT) scans. The annotated CT scans are divided into image patches that are labeled as either ‘nodule’ or ‘non-nodule’ images. Understandably, the number of samples containing nodules is substantially smaller than the number of non-nodule samples. To solve the class-imbalance issue and prevent bias in decision-making, a class-weight equal to the ratio of the total population to the class population is introduced. Each class-weight is multiplied with the loss function associated with its class during the computation of the aggregate loss function in the training phase. The training module consists of a feature extractor, which is the encoder part of a Convolutional Autoencoder (CAE) pre-trained on the lung nodule dataset, and a classifier comprising randomly initialized fully connected layers at the output stage. Experiments demonstrate the efficacy of our class-weighted approach for the imbalanced dataset as compared to the state of the art.
Keywords Convolutional autoencoder · Class-weight · Loss function · Pulmonary nodule detection
S. Susan (B) · D. Sethi · K. Arora
Department of Information Technology, Delhi Technological University, Bawana Road, Delhi 110042, India
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_71

1 Introduction
One of the most common forms of cancer found worldwide is pulmonary or lung cancer. The first stage in diagnosing lung cancer from lung Computed Tomography (CT) scans is the detection of lung nodules [1]. The later stages of diagnosis involve a malignancy test for these lung nodules. The accurate detection of lung nodules in CT scan images is thus a crucial factor in cancer diagnosis, and designing an efficient machine learning algorithm for this task is the subject of this paper. Deep neural networks are today a natural choice of machine learning strategy for learning from biomedical datasets [2]. In several situations, deep learning has been shown to outperform handcrafted features [3]. In a broad perspective, deep learning covers several types of network variants such as deep perceptron networks, deep Convolutional Neural Networks (CNN), and autoencoders [4]. In our experiments, we use autoencoder networks [5], more specifically the Convolutional Autoencoder (CAE) proposed by Masci et al. in [6] for 2D input images. The CAE is now a popular choice for computer vision problems as a feature extractor prior to the classifier stage [7]. It has the unique property of learning from itself, i.e., the input and output consist of the same image, while the CAE weights are tuned to achieve this reconstruction. Expanding the layers of a CAE is the usual form of research found in the literature [8, 9]. The traditional methods for pulmonary nodule detection generally involve deep convolutional neural networks and hyperparameter tuning [10, 11]. Researchers usually tend to overlook the issue of skewed class distributions affecting most biomedical datasets. In this paper, we incorporate class-weights in our deep learning module to mitigate the bias in decision-making while training the network to identify images containing lung nodules. Our learning module, called the Class-Weighted Convolutional Autoencoder (CW-CAE), utilizes a set of fully connected layers at the output stage to learn the features extracted by the CAE.
This paper is organized as follows. Some basic concepts are reviewed in Sect. 2; the proposed model and learning scheme are presented in Sect. 3; the results are discussed in Sect. 4; the conclusions are drawn in Sect. 5.
2 Preliminaries
2.1 Convolutional Autoencoder (CAE)
Figure 1 shows the general architecture of a three-layer autoencoder. Proposed by Bengio et al. in [5], autoencoders generate the hidden representation (the output activation of the middle or hidden layer) for an input based on connection weights learned in the training phase. The task is to reconstruct the input at the output layer, and the autoencoder weights are optimized to achieve this objective. Convolutional Autoencoders (CAE) were introduced by Masci et al. in [6] to handle and process 2D inputs or images. The basic autoencoder principle involving the encoder–decoder combination, shown in Fig. 1, is the same. The main advantage over conventional autoencoders is the preservation of spatial locality and the sharing of weights. After training is complete, in which the entire set of training images is presented to the CAE, the encoder module is separated out for feature extraction. The output of the encoder part of the CAE is the set of hidden representations, and these are applied to a classifier that learns the extracted features. Classifiers such as Support Vector Machines, Convolutional Neural Networks, and Neural Nets (Dense Fully Connected Networks (FCN)) could be used [12, 13]. We use fully connected layers (FCN) as the classifier for our CAE model. The details of the layers of the proposed CAE-FCN model are described in Sect. 3.
Fig. 1 The three-layer autoencoder structure. ENCODER: [Input Layer; Hidden Layer], DECODER: [Output Layer]
3 Proposed CW-CAE Model for Pulmonary Nodule Detection
3.1 Proposed Learning Model
Figure 2 shows the stacked layers in the proposed model. Figure 2a shows the architecture of the Convolutional Autoencoder (CAE). The weights of the CAE are tuned on the set of training images. The procedure is to present the training images one by one to both the CAE input and output layers, and to adjust the connection weights so that the input image is reconstructed at the output. After training is complete, the encoder part is separated and attached to randomly initialized fully connected (FCN) layers that act as the classifier, as shown in Fig. 2b. The final end-to-end network architecture in Fig. 2b is then deep-tuned, so that all weights (encoder of the CAE and FCN) are fine-tuned on the training set. The class-weights are introduced for this final network architecture (encoder of CAE + fully connected layers) in Fig. 2b. The class-weights determined to be optimal for our experiments are 1 for the majority class and 6 for the minority class.
Fig. 2 The proposed model. a Layers of CAE. b Encoder of CAE + fully connected layers
Fig. 3 Instances of ‘nodule’ and ‘non-nodule’ image patches
3.2 Class-Weighted Loss Function to Resolve Class-Imbalance
In data mining, imbalanced datasets are treated mostly by sampling approaches [14] that involve undersampling of the majority class and/or oversampling of the minority class. Replicating an intelligent sampling scheme in computer vision is challenging due to the computational complexity involved. At a small scale, data augmentation serves to achieve the effect of oversampling, by artificially creating affine transformed images of both classes [3] or specifically of the minority class only [2]. In the proposed approach, a class-weighting scheme is introduced for our final architecture of Encoder CAE + FCN in Fig. 2b. In this scheme, the loss function associated with each class (minority/majority) is weighted (or multiplied) by a factor equal to the ratio of the total population to the class population during the computation of the aggregate loss function. Let x be the class-weight associated with the minority class and y be the class-weight associated with the majority class. The aggregate loss function is computed as

\[ \text{Loss} = \text{Majority}_{\text{loss}} \cdot y + \text{Minority}_{\text{loss}} \cdot x \tag{1} \]

where x is computed as

\[ x = \frac{\text{Total population}}{\text{Minority population}} = \frac{\text{Number of nodule and non-nodule images in training set}}{\text{Number of nodule images in training set}} \tag{2} \]

and y is computed as

\[ y = \frac{\text{Total population}}{\text{Majority population}} = \frac{\text{Number of nodule and non-nodule images in training set}}{\text{Number of non-nodule images in training set}} \tag{3} \]

The loss function in (1) incorporating class-weights is embedded in the network optimization algorithm for the end-to-end architecture in Fig. 2b. The idea is to make the system more sensitive to samples from the underrepresented class while training the final network.
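A minimal sketch of Eqs. (1)–(3) in plain Python follows. It is not the authors' implementation: the function names are hypothetical, and the per-class loss is assumed here to be the mean squared error over the samples of that class, since the paper does not spell out how the class losses are formed.

```python
def class_weights(n_minority, n_majority):
    """Eqs. (2)-(3): weights x (minority) and y (majority) as total / class population."""
    total = n_minority + n_majority
    return total / n_minority, total / n_majority


def class_mse(targets, predictions, label):
    """Mean squared error over the samples whose target equals `label` (assumed loss form)."""
    errors = [(t - p) ** 2 for t, p in zip(targets, predictions) if t == label]
    return sum(errors) / len(errors) if errors else 0.0


def aggregate_loss(targets, predictions, x, y, minority_label=1, majority_label=0):
    """Eq. (1): Loss = Majority_loss * y + Minority_loss * x."""
    minority_loss = class_mse(targets, predictions, minority_label)
    majority_loss = class_mse(targets, predictions, majority_label)
    return majority_loss * y + minority_loss * x


# Toy batch: one 'nodule' sample (label 1) and two 'non-nodule' samples (label 0).
toy_loss = aggregate_loss([1, 0, 0], [0.5, 0.0, 0.5], x=6, y=1)
print(toy_loss)
```

In the toy batch the single minority error contributes six times its raw value, which is how the weighting makes the optimizer more sensitive to the underrepresented class.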
4 Experimental Results
4.1 Experimental Setup
The proposed CW-CAE model was implemented in Python (ver 3.7) on an Intel dual-core processor clocked at 2.7 GHz. The model was trained on ‘nodule’ and ‘non-nodule’ images annotated by radiologists. The source of these images is the Lung Image Database Consortium image collection (LIDC-IDRI), which comprises annotated lung Computed Tomography (CT) scans [15]. The CT scans are cropped into non-overlapping image patches of size 50 × 50 that are labeled as either ‘nodule’ or ‘non-nodule’ images based on the presence or absence of pulmonary nodules. The total number of image patches is 5187, the number of ‘nodule’ image patches is 845, and the number of ‘non-nodule’ image patches is 4342. The numbers reflect the imbalanced class distribution, since images containing lung nodules are far fewer than those without nodules. Our CW-CAE model is proposed to counteract the class-imbalance by introducing class-weights in the network optimization. The hyperparameters of our learning model (and also of the state-of-the-art networks used for comparison) are: optimizer, Adadelta; learning rate, 0.01; loss function, mean square error; number of epochs, 200; batch size, 128. In the testing phase, there are 1340 ‘non-nodule’ and 282 ‘nodule’ image samples.
Table 1 Percentage of correct classification for the LIDC dataset by different methods
Method                              Accuracy (in %)
CNN [16] (Fig. 4)                   91.6
CAE + CNN [6] (Fig. 5)              91.5
Deep CAE [17]                       86.62
VGG16 [18] + fine tuning            83
CAE + Dense FCN (proposed)          92
CW-CAE + Dense FCN (proposed)       93.24
4.2 Results and Discussions
The results of our experiments are summarized in Table 1 in terms of the percentage of correct classification, or accuracy. Our method is observed to outperform the existing techniques of Convolutional Neural Networks (CNN) [16], Convolutional autoencoders with CNN [6], Deep Convolutional autoencoders [17], and the popular pre-trained network VGG16 [18], with an accuracy of 93.34%. Of the traditional networks used for comparison, the CNN [16] and the CAE [6] architectures are shown in Figs. 4 and 5, respectively. For the Deep CAE architecture, the readers are referred to [17]. VGG16 is a publicly available deep convolutional network [18] pre-trained on millions of images from ImageNet, the object database. The pre-trained network is fine-tuned on our lung nodule images for this experiment. The higher accuracy achieved by our method is attributed to the class-weights assigned in the network cost optimization. This is evident when comparing the results with the non-weighted version of our model, which achieves 92% accuracy. The performance curve is plotted in Fig. 6 as a function of the minority class-weight x. It is observed that the maximum accuracy of 93.34% is obtained when the minority class-weight x = 6. A plausible explanation is that the ratio of the total number of image patches to the number of patches containing lung nodules is (5187/845), which rounds off to 6. Along similar lines, the ratio of the total number of image patches to the number of patches not containing lung nodules is (5187/4342), which rounds off to 1. This justifies our choice of y = 1 as the majority class-weight.
Fig. 4 The convolutional neural network in [16] used for comparison
Fig. 5 a Convolutional autoencoder in [6] used for comparison. b Encoder part of the convolutional autoencoder concatenated with CNN layers at the output [6]
Fig. 6 Performance curve as the minority class-weight x is varied for the proposed method
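The rounding argument can be reproduced with a few lines of arithmetic. The sketch below simply plugs the patch counts reported in Sect. 4.1 into the ratios; the variable names are illustrative only.

```python
# Patch counts reported in the experimental setup.
total_patches = 5187
nodule_patches = 845
non_nodule_patches = 4342

# Ratio of total population to class population for each class.
minority_ratio = total_patches / nodule_patches      # about 6.14
majority_ratio = total_patches / non_nodule_patches  # about 1.19

# Rounded to the nearest integer, these give the class-weights x and y.
x = round(minority_ratio)
y = round(majority_ratio)
print(x, y)
```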
5 Conclusions In this work, we explore class-weighted network optimization as a solution to tackle skewed class distributions found in pulmonary nodule datasets. Convolutional autoencoders are used to extract features from input images and the linearized features are learned through fully connected dense layers. While training, the loss function associated with each class (nodule/non-nodule) is weighted by a factor equal to the ratio of the total population to the class population during the computation of the aggregate loss function. Higher accuracies prove the efficiency of our Class-Weighted Convolutional Autoencoder (CW-CAE) learning approach, as compared to the state of the art. Diagnostic tests on the cancerous nature of the detected nodules form the future scope of our work.
References 1. I. Ali, G.R. Hart, G. Gunabushanam, Y. Liang, W. Muhammad, B. Nartowt, M. Kane, X. Ma, J. Deng, Lung nodule detection via deep reinforcement learning. Front. Oncol. 8, 108 (2018) 2. M. Saini, S. Susan, Data augmentation of minority class with transfer learning for classification of imbalanced breast cancer dataset using inception-V3, in Iberian Conference on Pattern Recognition and Image Analysis (Springer, Cham, 2019), pp. 409–420 3. M. Saini, S. Susan, Comparison of deep learning, data augmentation and bag of-visual-words for classification of imbalanced image datasets, in International Conference on Recent Trends in Image Processing and Pattern Recognition (Springer, Singapore, 2018), pp. 561–571 4. W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications. Neurocomputing 234, 11–26 (2017) 5. Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in Advances in Neural Information Processing Systems (2007), pp. 153–160 6. J. Masci, U. Meier, D. Cire¸san, J. Schmidhuber, Stacked convolutional auto-encoders for hierarchical feature extraction, in International Conference on Artificial Neural Networks (Springer, Berlin, Heidelberg, 2011), pp. 52–59 7. Y. Wang, Z. Xie, X. Kai, Y. Dou, Y. Lei, An efficient and effective convolutional auto-encoder extreme learning machine network for 3D feature learning. Neurocomputing 174, 988–998 (2016) 8. F. Li, H. Qiao, B. Zhang, Discriminatively boosted image clustering with fully convolutional auto-encoders. Pattern Recognit. 83, 161–173 (2018) 9. V. Turchenko, A. Luczak, Creation of a deep convolutional auto-encoder in Caffe, in 2017 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), vol. 2 (IEEE, 2017), pp. 651–659 10. B. Van Ginneken, A.A.A. Setio, C. Jacobs, F. 
Ciompi, Off-the-shelf convolutional neural network features for pulmonary nodule detection in computed tomography scans, in 2015 IEEE 12th International symposium on biomedical imaging (ISBI) (IEEE, 2015), pp. 286–289 11. H. Xie, D. Yang, N. Sun, Z. Chen, Y. Zhang, Automated pulmonary nodule detection in CT images using deep convolutional neural networks. Pattern Recognit. 85, 109–119 (2019) 12. G. Mi, Y. Gao, Y. Tan, Apply stacked auto-encoder to spam detection, in International Conference in Swarm Intelligence (Springer, Cham, 2015), pp. 3–15 13. B. Hou, R. Yan, Convolutional auto-encoder model for Finger-Vein verification. IEEE Trans. Instrum. Meas. (2019)
14. S. Susan, A. Kumar, SSOMaj-SMOTE-SSOMin: three-step intelligent pruning of majority and minority samples for learning from imbalanced datasets. Appl. Soft Comput. 78, 141–149 (2019) 15. S.G. Armato III, G. McLennan, L. Bidaut, M.F. McNitt-Gray, C.R. Meyer, A.P. Reeves, B. Zhao et al., The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans. Med. Phys. 38(2), 915– 931 (2011) 16. Q.Z. Song, L. Zhao, X.K. Luo, X.C. Dou, Using deep learning for classification of lung nodules on computed tomography images. J. Healthc. Eng. 2017 (2017) 17. O. Yildirim, U.B. Baloglu, R.-S. Tan, E.J. Ciaccio, U. Rajendra Acharya, A new approach for arrhythmia classification using deep coded features and LSTM networks. Comput. Methods Programs Biomed. 176, 121–133 (2019) 18. K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps (2013). arXiv:1312.6034
SORTIS: Sharing of Resources in Cloud Framework Using CloudSim Tool
Kushagra Gupta and Rahul Johari

Abstract Today, many FOSS and proprietary Cloud computing tools are available to academicians and researchers for carrying out their SAAS- or PAAS-oriented programming. In the current research work, the Free and Open Source (FOSS) Java-based CloudSim tool has been used to customize the Cloud framework for better performance, using multiple attributes/resources such as Virtual Machines (VM), MIPS requirement (million instructions per second), RAM, Bandwidth (BW), and the number of processors required. A user is prompted to state his requirements, and resources are allocated depending on his day-to-day dynamic needs.
Keywords CloudSim · Virtual machines · Resource allocation
1 Introduction Cloud computing is the way using which one can store, manage, process, host, and share the data and resources on the network of servers over the web. Cloud computing is evolving at a rapid rate, many research scholars, academicians, researchers, and entrepreneurs are using it to view and host their applications. As compared to 10 years back, their was very limited use of Cloud computing concepts, principles, and activities, but with MAGI Companies (Microsoft, Manjrasoft, Amazon, Adobe, Apple, Accenture, Alibaba, Google, IBM, Infosys) pumping in million of dollar of investment in establishing state of the art World class Servers and Green Computing Supported by Swinger Lab, GGSIPU Delhi. K. Gupta (B) · R. Johari SWINGER (Security, Wireless IoT Network Group of Engineering and Research) Lab, USICT, GGSIP University, New Delhi, India e-mail: [email protected] R. Johari e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_72
835
836
K. Gupta and R. Johari
compatible Data Centres, Cloud computing has witnessed tremendous growth. This has also helped many service-oriented companies, educational institutions, and universities to start using PAAS- and IAAS-based applications in a big way. Cloud computing is very useful for distributing a computing environment over a network.
CloudSim is an active simulator for testing Cloud-related work. It is a simulation framework developed by a team of researchers headed by Dr. Rajkumar Buyya at the CLOUDS Laboratory, University of Melbourne. There are three basic entities in the CloudSim framework:
1. Cloud Information Service (CIS)
2. Data Center
3. Data Center Broker
Each entity is briefly explained as follows.
Cloud Information Service: It is the entity with which all resources have to be registered. It helps to detect the status of currently active services or resources.
Data Center: A Data Center contains a set of hosts and helps in managing Virtual Machines (VMs). It creates hosts with characteristics such as RAM, Processing Elements (PE), Central Processing Unit (CPU), bandwidth, and Million Instructions Per Second (MIPS). Each host is virtualized into virtual machines, which later run the users' tasks.
Data Center Broker: The Data Center Broker is the main part of the CloudSim framework. It works on behalf of the user. Cloudlets are submitted to the broker, and the broker then submits these cloudlets to the Virtual Machines present at the server.
Basic Working Model of CloudSim Framework: It is a six-step process.
Step 1: A Cloud Information Service object is created and registered.
Step 2: A Data Center is created. Each Data Center contains physical hosts. Each host is virtualized into Virtual Machines (VMs), which then use the resources of the host in the virtual environment.
Step 3: A Data Center Broker is created. Once the broker is created, it queries the Cloud Information Service for registered Data Centers.
Step 4: The Cloud Information Service returns the registered Data Centers with their configurations.
Step 5: Cloudlets are submitted to the Data Center Broker for execution.
Step 6: The broker submits the cloudlets to Virtual Machines according to the policies defined for the physical machines.
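The six steps above can be illustrated with a small toy model. This is only a sketch in plain Python, not the CloudSim Java API: the class names, the `dispatch` method, and the cyclic cloudlet-to-VM policy are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Datacenter:
    name: str
    vm_ids: list  # IDs of the VMs obtained by virtualizing this data center's hosts


@dataclass
class CloudInformationService:
    # Step 1: the CIS is created and acts as the registry of resources.
    registry: dict = field(default_factory=dict)

    def register(self, dc):
        self.registry[dc.name] = dc

    def registered_datacenters(self):
        # Step 4: return the data centers that have registered.
        return list(self.registry.values())


@dataclass
class Broker:
    cis: CloudInformationService
    cloudlets: list = field(default_factory=list)

    def submit(self, cloudlet_id):
        # Step 5: cloudlets are submitted to the broker.
        self.cloudlets.append(cloudlet_id)

    def dispatch(self):
        # Step 3: query the CIS. Step 6: map cloudlets to VMs.
        vms = [vm for dc in self.cis.registered_datacenters() for vm in dc.vm_ids]
        # Placeholder policy (an assumption): cycle through the VMs in order.
        return {c: vms[i % len(vms)] for i, c in enumerate(self.cloudlets)}


cis = CloudInformationService()                  # Step 1
cis.register(Datacenter("dc0", vm_ids=[0, 1]))   # Step 2
broker = Broker(cis)                             # Step 3
for cloudlet_id in ("c0", "c1", "c2"):           # Step 5
    broker.submit(cloudlet_id)
mapping = broker.dispatch()                      # Steps 4 and 6
print(mapping)
```

In this toy run the three cloudlets are spread over the two VMs in turn; a real CloudSim simulation would apply its own VM allocation and cloudlet scheduling policies instead.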
2 Problem Statement
In Cloud computing, efficient allocation of resources is very important. If resources are allocated inefficiently, so that some servers carry a high load while others carry little, energy consumption rises. Because Cloud services can scale from 2 servers to thousands of servers or more, it is not practical to deploy
Fig. 1 CloudSim working model
and test the Cloud itself to check whether the resource allocation is efficient, and this is the biggest problem. To analyze resource allocation in a Cloud computing environment that extends from 2 to n servers, a Cloud simulation and modeling tool is required that manages the Cloud resources as per requirement. In the current research work, it is proposed to use the CloudSim framework for managing resources such as hosts and Virtual Machines (VMs) (Fig. 1).
3 Literature Survey
In [1], the CloudSim package available in the GitHub repository was successfully downloaded and installed. All eight user-friendly Java-based programs were cleanly deployed.
In [2], the author(s) have explained the basics of Cloud computing. They explain that Cloud computing comprises three service models and four deployment models. According to the author(s), the three service models are Software as a Service (SAAS), Platform as a Service (PAAS), and Infrastructure as a Service (IAAS).
Software as a Service (SAAS): In this model, Cloud providers provide applications and customers use these applications. Clients can access these applications from various client devices, for example through a Web browser.
Platform as a Service (PAAS): In PAAS, customers create their own customized applications and can deploy them onto the Cloud infrastructure. However, customers do not have the right to manage Cloud infrastructure components such as networks, servers, operating systems, and storage.
Infrastructure as a Service (IAAS): In IAAS, customers have the right to choose the operating system and storage.
According to the author(s), the four deployment models are Private Cloud, Public Cloud, Hybrid Cloud, and Community Cloud.
Private Cloud: Private Cloud infrastructure is used by a single organization (having multiple customers). It is deployed globally. A private Cloud offers more security than a public Cloud and is more expensive.
Public Cloud: Public Cloud infrastructure is used by the general public. It is for open use. It is deployed locally. A public Cloud offers less security than a private Cloud and is less expensive.
Community Cloud: This type of Cloud infrastructure is used by a community of customers from organizations.
Hybrid Cloud: It is a combination of two or more different Cloud infrastructures such as private, community, or public.
In [3], the author(s) have explained three task scheduling algorithms: First Come First Serve (FCFS), Round Robin scheduling, and Generalized Priority scheduling. In FCFS, tasks are scheduled in the order of their arrival, that is, tasks that arrive first are scheduled first. The main disadvantage of FCFS is that short tasks can be delayed behind long ones: a short task that arrives later has to wait for a bigger task to finish. Round Robin scheduling uses the concept of a time quantum. Each task is executed for one time quantum; if the job does not complete within that quantum, it is sent back to the queue and the next job is scheduled. The major drawback of Round Robin scheduling is that large jobs take longer to complete. In the Generalized Priority algorithm, priorities are assigned to tasks on the basis of their size, such that the task with the biggest size has the top rank, and priorities are given to Virtual Machines on the basis of their MIPS value, such that the VM with the largest MIPS has the top rank.
In [4], the author(s) have explained Cloudlet schedulers. Cloudlet schedulers define how the available VM resources are allocated to cloudlets. There are two types of allocation criteria.
Space-Shared: Assign specific CPUs to specific Virtual Machines.
Time-Shared: Dynamically allocate the capacity of CPUs to Virtual Machines.
In [5], the author(s) have explained three load balancing algorithms.
Round Robin Algorithm: This policy works on the time-slicing technique, where each node is allocated a time slot and may run only within that slot; otherwise it has to wait for its next turn.
Equally Spread Current Execution Load: This technique uses a load balancer that manages the jobs that are ready for execution. The load balancer arranges the jobs in a queue and then transfers them to different virtual machines.
Throttled Load Balancing: This technique first searches for the appropriate VM for a job. The job manager maintains a list of all virtual machines and then allots the job to the appropriate machine.
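The contrast between FCFS and Round Robin described above can be made concrete with a short sketch. The task lengths and the quantum are hypothetical, all tasks are assumed to arrive at time 0, and context-switch cost is ignored; this is an illustration of the two policies, not code from [3].

```python
from collections import deque


def fcfs_completion(bursts):
    """Completion time of each task when tasks run to the end in arrival order."""
    clock, done = 0, []
    for burst in bursts:
        clock += burst
        done.append(clock)
    return done


def round_robin_completion(bursts, quantum):
    """Completion time of each task under Round Robin with a fixed time quantum."""
    remaining = list(bursts)
    queue = deque(range(len(bursts)))
    done = [0] * len(bursts)
    clock = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)  # unfinished: back to the end of the queue
        else:
            done[i] = clock
    return done


bursts = [8, 2, 1]  # hypothetical task lengths, all arriving at time 0
fcfs = fcfs_completion(bursts)
rr = round_robin_completion(bursts, quantum=2)
print("FCFS:", fcfs)
print("RR:  ", rr)
```

Under FCFS the two short tasks wait behind the long one, whereas under Round Robin they finish early and the long task is the last to complete, which matches the drawbacks noted for each policy.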
In [6], the author(s) analyse different CPU scheduling algorithms. The salient goals of CPU scheduling algorithms are the following.
Efficiency: A scheduler must keep the system busy for all of the available time so that more work is done and higher efficiency is achieved.
Response time: A scheduler must focus on minimizing the response time for real-world applications.
Turnaround time: A scheduler must focus on minimizing the turnaround time so that processes do not wait long for CPU allocation.
Throughput: A scheduler must focus on maximizing the throughput so that more jobs are processed per unit time.
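A minimal sketch of how these goals are usually quantified is given below. It uses the standard textbook definitions (response time is first start minus arrival, turnaround time is finish minus arrival, throughput is jobs per unit time); the schedule itself is hypothetical and the function name is an assumption for illustration.

```python
def scheduling_metrics(jobs):
    """jobs: list of (arrival, first_start, finish) tuples in one common time unit."""
    n = len(jobs)
    response = [start - arrival for arrival, start, _ in jobs]
    turnaround = [finish - arrival for arrival, _, finish in jobs]
    makespan = max(f for _, _, f in jobs) - min(a for a, _, _ in jobs)
    return {
        "avg_response": sum(response) / n,
        "avg_turnaround": sum(turnaround) / n,
        "throughput": n / makespan,  # jobs completed per time unit
    }


# Hypothetical schedule: three jobs arrive at time 0 and run back to back.
jobs = [(0, 0, 8), (0, 8, 10), (0, 10, 11)]
metrics = scheduling_metrics(jobs)
print(metrics)
```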
4 Flow Chart
The complete process of customization (allocation and deallocation) of resources on an on-demand basis in the CloudSim tool is depicted in the flow chart (Fig. 2).
5 Algorithm Used
5.1 Algorithm 1: Validate the User Details
Notation
UID: User ID entered
PWD: Password entered
SPWD: Password stored in the database for the corresponding UID
Trigger: When a user wants to enter CloudSim.
1. SPWD: Extract the stored password for UID from the database.
2. If (PWD equals SPWD) return true; else return false.
3. On successful validation, initialize the Resource Allocation Process.
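Algorithm 1 can be sketched as follows. The in-memory dictionary stands in for the database, and the user name, password, and `login` helper are hypothetical; the constant-time comparison is an addition of this sketch rather than something the algorithm specifies.

```python
import hmac

# Hypothetical in-memory stand-in for the database of stored passwords (SPWD).
STORED_PASSWORDS = {"alice": "s3cret"}


def validate_user(uid, pwd):
    """Return True when PWD matches the SPWD stored for UID."""
    spwd = STORED_PASSWORDS.get(uid)  # step 1: extract the stored password
    if spwd is None:
        return False
    # Step 2: compare; compare_digest avoids leaking timing information.
    return hmac.compare_digest(pwd.encode(), spwd.encode())


def login(uid, pwd):
    if validate_user(uid, pwd):
        return "resource allocation initialized"  # step 3
    return "access denied"


print(login("alice", "s3cret"))
print(login("alice", "wrong"))
```

A production system would store salted password hashes rather than plain-text passwords; the plain comparison here simply mirrors the notation of the algorithm.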
5.2 Algorithm 2: Allocate Resources to the User
Notation
MIPS: MIPS requirement of the user
RAM: RAM requirement of the user
CPU: Processor requirement of the user
BW: Bandwidth requirement of the user
VM: Represents a Virtual Machine
ID: Autogenerated ID of the VM
Trigger: When resources are available for allocation.
1. Initialize the CloudSim library.
2. Create a Datacenter.
3. Create a Datacenter Broker.
4. Create a Virtual Machine with the given characteristics: VM (ID, MIPS, CPU, RAM, BW).
5. Submit the VM to the VM list.
6. Create a Cloudlet.
7. Add the Cloudlet to the Cloudlet list.
8. Start the simulation.

Fig. 2 Flowchart and graph (virtual machines vs. time) of CloudSim tool
5.3 Algorithm 3: VM Resource Allocation
Notation
UID: User ID entered by the user
MIPS: MIPS requirement of the user
RAM: RAM requirement of the user
CPU: Processor requirement of the user
BW: Bandwidth requirement of the user
VM: Represents a Virtual Machine
ID: Autogenerated ID of the VM
Trigger: When a user demands resources using CloudSim.
1. for (1 to N)
2.   If (validateUser (UID, PWD))
3.     If (MIPS and RAM and CPU and BW == available)
         Resource Allocation (MIPS, RAM, CPU, BW)
       else
         Print "Requested resources are not available"
     else
       Print "Please enter correct User ID and Password"
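The availability check in Algorithm 3 can be sketched as below. The `Pool` class, the capacity numbers, and the `authenticated` flag (standing in for the result of validateUser) are hypothetical; the sketch only shows the control flow, not the CloudSim allocation code.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Pool:
    """Free capacity of a hypothetical host."""
    mips: int
    ram: int
    cpu: int
    bw: int


_vm_ids = count()  # autogenerated VM IDs: 0, 1, 2, ...


def allocate(pool, mips, ram, cpu, bw):
    """Return a VM record if every requested resource is available, else None."""
    if mips <= pool.mips and ram <= pool.ram and cpu <= pool.cpu and bw <= pool.bw:
        pool.mips -= mips
        pool.ram -= ram
        pool.cpu -= cpu
        pool.bw -= bw
        return {"id": next(_vm_ids), "mips": mips, "ram": ram, "cpu": cpu, "bw": bw}
    return None


def handle_request(pool, authenticated, mips, ram, cpu, bw):
    # 'authenticated' stands for the outcome of validateUser(UID, PWD).
    if not authenticated:
        return "Please enter correct User ID and Password"
    vm = allocate(pool, mips, ram, cpu, bw)
    if vm is None:
        return "Requested resources are not available"
    return vm


pool = Pool(mips=1000, ram=2048, cpu=4, bw=1000)
first = handle_request(pool, True, 500, 1024, 2, 500)
second = handle_request(pool, True, 600, 512, 1, 100)  # exceeds remaining MIPS
third = handle_request(pool, False, 100, 128, 1, 10)   # failed validation
print(first, second, third, sep="\n")
```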
6 Acknowledgement
We are thankful to our university, GGSIPU, Delhi, for providing a research-oriented academic environment and for funding the current research work.
7 SnapShots The snapshots of the simulation work performed in CloudSim Package are depicted in Figs. 3, 4, and 5, respectively.
Fig. 3 Output of dynamic allocation on single VM
Fig. 4 Dynamic allocation on multiple VMs
8 Conclusion and Future Work
After carefully reading and analyzing the skeletal sample code snippets bundled in the Java-based FOSS CloudSim package, a program was written to implement dynamic resource allocation. In dynamic allocation, the user is prompted for the required resource configuration, for instance: What is the MIPS requirement of the user? What is the RAM requirement of the user? What is the bandwidth requirement of the user? How many processors does the user require?
Fig. 5 Output of CloudSim Program, depicted in Fig. 4
In future work, encryption will be provided to the user. Encryption techniques will be implemented between the user and the Data Center Broker to protect the framework against attackers. It is proposed to use encryption techniques such as the Caesar cipher and the Vigenere cipher.
References
1. https://github.com/Cloudslab/cloudsim
2. P. Mell, T. Grance, The NIST definition of cloud computing (2011)
3. A. Agarwal, S. Jain, Efficient optimal algorithm of task scheduling in cloud computing environment. arXiv preprint arXiv:1404.2076 (2014)
4. R. Kumar, G. Sahoo, Cloud computing simulation using CloudSim. arXiv preprint arXiv:1403.3253 (2014)
5. H.S. Mahalle, P.R. Kaveri, V. Chavan, Load balancing on cloud data centres. Int. J. Adv. Res. Comp. Sci. Soft. Eng. 3(1) (2013)
6. M. Gahlawat, P. Sharma, Analysis and performance assessment of CPU scheduling algorithms in cloud using Cloud Sim. Int. J. Appl. Inf. Syst. (IJAIS) (2013) ISSN: 2249–0868
Predicting Diabetes Using ML Classification Techniques Geetika Vashisht, Ashish Kumar Jha, and Manisha Jailia
Abstract The healthcare industry is advancing rapidly with extensive use of IT tools and techniques. Machine learning algorithms are no longer restricted to the field of computer science: they have entered the healthcare industry and assist medical practitioners in predicting the onset of several diseases from a set of attributes such as age, BMI, blood pressure, glucose and insulin level. Diabetes is one such disease. It is growing at a very rapid rate and can be fatal, which creates the need for a reliable prediction system that diagnoses the onset of the disease before it silently attacks patients and causes avoidable damage to health. Machine learning techniques are well suited to mining a diabetes dataset to classify and predict the disease efficiently. In this study, four machine learning classification algorithms are compared to find the most viable one for distinguishing diabetic from non-diabetic patients.
Keywords Diabetes · Classification · PIDD (Pima Indian Diabetes Dataset) · LDA (Linear Discriminant Analysis) · kNN (k-Nearest Neighbour) · RF (Random Forest) · SVM (Support Vector Machine) · ANN (Artificial Neural Network)
G. Vashisht (B) · A. K. Jha
Department of Computer Science, CVS College, Delhi University, New Delhi, India
e-mail: [email protected]
M. Jailia
Department of Computer Science, Banasthali Vidyapith, Banasthali, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_73

1 Introduction
Diabetes mellitus is a metabolic disease with several causes and risk factors, mainly lack of insulin secretion by the pancreas, obesity, impaired glucose tolerance, high blood pressure, dyslipidemia, sedentary lifestyle, age, stress levels, family history, high alcohol intake, low levels of good cholesterol (HDL), high levels of bad cholesterol (LDL), high levels of triglycerides and Polycystic Ovary Syndrome (PCOS). Diabetes mellitus has two forms, “type 1” and “type 2”. In type 1 diabetes, little to no
insulin is produced by the pancreas, although insulin is required for the body to use blood sugar. Type 2 diabetes is a lifelong disease that impairs the body's use of insulin. It has no cure, which makes it important to predict it as early as possible so that the disease can be controlled at an initial stage. Frequent urination, increased hunger and thirst, weight loss, blurry vision and extreme fatigue are some warning signs of the disease. If left undiagnosed, it can lead to several health hazards such as diabetic ketoacidosis and nonketotic hyperosmolar coma [2]. Other complications associated with diabetes are heart disease, stroke, neuropathy, nephropathy, retinopathy, dementia, depression, hearing loss and recurring bacterial or fungal infections. China has the largest number of diabetic patients in the world, around 116 million people. India has over 30 million people who have been diagnosed with diabetes so far.
This study aims to analyse and evaluate existing patterns in order to discover new, valuable patterns, interpret them and extract useful knowledge. Data mining techniques are tried out to find appropriate approaches for efficient classification of the diabetes dataset. RStudio was employed as the tool for implementing the classification algorithms for diagnosing diabetes. In this study, we apply Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF) and k-Nearest Neighbours (kNN) classifiers and then compare their performance.
2 Related Work Authors have been taking a keen interest in the prediction of diabetes using the Pima Indian Diabetes Dataset (PIDD) in the recent past. The classified work is presented in Table 1 [4–11].
3 Data and Methodology
3.1 Dataset Used
RStudio is used for executing the experiment. The classifiers are evaluated on the PIDD diabetes dataset, acquired from the UCI repository. This dataset comprises medical details of 768 instances. Only female patients are considered in this study. The dataset has eight numeric-valued attributes (N). The dataset is described in Table 2 and its attributes are described in Table 3. The dataset was analysed thoroughly in order to build an effective model to predict and diagnose diabetes.
Table 1 Comparison of the work done by authors in the field on the basis of accuracy
Authors | Approach used | Dataset | Results
V. Anuja Kumari (2013) | SVM | PIDD | Accuracy = 78%
Nagarajan, R. M. Chandrasekaran and P. Ramasubramanian (2014) | ID3, NB, C4.5, Random Tree | 540 records having seven attributes | Highest accuracy by Random Tree = 93.8%
Yilmaz, Inan and Uzer (2014) | K-Means + SVM | Statlog PIDD | Accuracy = 93.65%
Kumari, Vohra and Arora (2014) | Bayesian Network | 206 records having nine input attributes and one output attribute | Highest accuracy by Bayesian Network = 99.51% with a Mean Absolute Error (MAE) of 0.0053
Heydari, Teimouri, Heshmati and Alavinia (2015) | ANN, Bayesian Network, Decision Tree, SVM, 5-NN | 2536 cases from Tabriz University of Medical Sciences | Accuracy: ANN = 97.44%, SVM = 81.19%, 5-NN = 90.85%, Decision Tree = 95%, Bayesian Network = 91.60%
Zhu, Xie and Zheng (2015) | MFWC (k = 10) | PIDD | Accuracy = 93.45%
Santhanam and Padmavathi (2015) | SVM, K-Means, Genetic Algorithm | PIDD | Highest accuracy by SVM = 98.82%
Paul and Latha (2017) | Classification model using recursive partitioning algorithm | PIDD | Accuracy = 83%
Ashiquzzaman, Tushar, Islam and Kim (2017) | Deep Learning | PIDD | Accuracy = 88.41%
Islam and Jahan (2017) | Logistic Regression, SVM, MLP, NB, IBK | PIDD | Highest accuracy by Logistic Regression = 78.01%
Dwivedi (2017) | ML algorithms: SVM, ANN, KNN, Logistic Regression, Classification Tree | PIDD | Highest accuracy by Logistic Regression = 78% with a misclassification rate of 0.22
Vijayashree and Jayashree (2017) | Deep Neural Network, ANN | PIDD | Highest accuracy by Deep Neural Network = 82.67%
K. Priyadarshini and Dr. I. Lakshmi (2018) | Bayesian Network, Naive Bayes | 1540 instances from the “National Institute of Diabetes Diseases” | Highest accuracy by Bayesian Network = 99.35%
Sisodia and Sisodia, Prediction of Diabetes using Classification Algorithms, Procedia Computer Science 132 (2018) | Decision Tree, Naive Bayes, SVM | PIDD | Highest accuracy by Naive Bayes = 76.30%

Table 2 Description of the dataset used
Database | #Attributes | #Instances
PIDD | 8 | 768
Table 3 Attributes description
Attributes | Value | Mean | Min | Max
Number of times pregnant | N | 3.4 | 0 | 17
Plasma glucose concentration | N | 120.9 | 0 | 199
Diastolic blood pressure (mm Hg) | N | 69.1 | 0 | 122
Skin fold thickness (mm) | N | 20.5 | 0 | 99
2-h serum insulin (mu U/ml) | N | 79.8 | 0 | 846
BMI (weight in kg/(height in m)^2) | N | 32 | 0 | 67.1
Diabetes pedigree function | N | 0.5 | 0.078 | 2.42
Age in years | N | 33.2 | 21 | 81
3.2 Data Pre-processing
When data is unstructured and noisy because of URLs, symbols, slang and so on [8], pre-processing is required. Pre-processing takes place in several stages: converting text to the same case, stemming words, removing blank spaces, removing punctuation and numbers, and removing stop words using an existing stop word list and a customized list. The dataset used here is structured and has no missing values.
3.3 Data Methodology
The proposed model is presented in Fig. 1. The dataset is initially split into two sections: ‘Training data’, the set of data that is used to train the model, and ‘Testing data’, the remaining set of data used to test the model.
Fig. 1 Proposed model’s architecture
Training the machine learning model on a training set enables it to learn correlations in that set; the model is then evaluated on the test set to check how accurately it can predict. Here, 80% of the dataset is allocated to the training set and the remaining 20% to the test set. For this task, the data partition method from the caret package of R is used.
Algorithms Used for Classification
Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Random Forest (RF) and k-Nearest Neighbours (kNN) classifiers are used to separate the patients who are prone to diabetes from the ones who are not [1–3].
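The 80/20 partition described above can be sketched in plain Python. This is not the caret code used in the study: the sketch assumes a class-stratified split, and the 500/268 class counts are the figures commonly quoted for PIDD rather than numbers taken from this paper.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Return (train_idx, test_idx), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# 768 instances; the 500 negative / 268 positive split is an assumption here.
labels = [0] * 500 + [1] * 268
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratifying by class keeps the share of diabetic cases similar in both subsets, which matters when the classes are imbalanced.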
3.3.1 Support Vector Machine (SVM)
SVM is a popular supervised machine learning classification algorithm that aims at finding a hyperplane (decision boundary) in an N-dimensional space that distinctly classifies the data points, where N is the number of features. More than one hyperplane can separate the two classes of data points; the objective is to find the plane that has the maximum distance to the data points of the two classes. The number of features determines the dimension of the hyperplane. The data points nearest to the hyperplane are known as support vectors, and they determine the margin of the classifier that is maximized.
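The geometry behind this description can be shown with a small sketch. It does not train an SVM; it takes a hypothetical hyperplane and four hypothetical labelled points, computes signed distances, and identifies the points closest to the hyperplane, which play the role of support vectors.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical 2-D hyperplane x1 + x2 - 3 = 0 and labelled points.
w, b = (1.0, 1.0), -3.0
points = {(0.0, 1.0): -1, (1.0, 1.0): -1, (2.0, 2.0): +1, (4.0, 3.0): +1}

distances = {p: signed_distance(w, b, p) for p in points}
# A point is correctly classified when its label and distance share a sign.
correct = all(label * distances[p] > 0 for p, label in points.items())
# The points closest to the hyperplane act as the support vectors.
closest = min(abs(d) for d in distances.values())
support_vectors = [p for p, d in distances.items() if math.isclose(abs(d), closest)]
print(correct, closest, support_vectors)
```

An actual SVM would search over w and b to make this closest distance as large as possible while keeping the points on the correct sides.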
3.3.2 Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis is a supervised classification algorithm that works well for problems where the output variable takes a fixed number of values, that is, a categorical variable. LDA supports both binary and multi-class classification. It works by estimating the probability that a new set of inputs belongs to each class. The class with the highest probability is the output class, and this is how the prediction is made. Bayes' theorem is used to estimate the probabilities.
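The use of Bayes' theorem can be sketched for a single feature. The sketch assumes Gaussian class-conditional densities with one shared variance, which is the usual LDA modelling assumption; the means, priors, and variance below are hypothetical and not estimated from PIDD.

```python
import math


def lda_posteriors(x, means, priors, variance):
    """Posterior P(class | x) for 1-D Gaussian classes sharing one variance."""
    # Bayes' theorem: posterior is proportional to prior * likelihood.
    # The Gaussian normalizing constant is the same for every class and cancels.
    scores = {
        c: priors[c] * math.exp(-((x - means[c]) ** 2) / (2 * variance))
        for c in means
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical glucose-like feature: class means, priors and pooled variance.
means = {"non-diabetic": 110.0, "diabetic": 140.0}
priors = {"non-diabetic": 0.65, "diabetic": 0.35}
posterior = lda_posteriors(150.0, means, priors, variance=900.0)
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```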
3.3.3 Random Forest (RF)
Random Forest is a supervised learning algorithm that usually gives accurate results, especially on a large database with a large set of variables. This is because it captures the relative importance of several input variables at the same time and lets many observations take part in the prediction. It also gives an estimate of the importance of the individual variables (features) to the classification, computing a score for each variable automatically after training.
3.3.4 k-Nearest Neighbours (kNN)
k-Nearest Neighbours is an easily implementable, supervised classification algorithm with a low calculation time. It is called a lazy learning algorithm because it has no specialized training phase: it uses the entire training data at classification time. It is also called a non-parametric learning algorithm, as no assumption is made about the underlying data. kNN assigns a value to a new data point on the basis of its similarity to the points in the training set. The value of ‘k’ has a great impact on the performance of the classifier.
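A minimal sketch of the kNN idea follows, using Euclidean distance and majority voting. The training points and labels are hypothetical two-feature examples, not PIDD records.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority label among the k training points closest to the query."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two well-separated classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 5.0)]
train_labels = [0, 0, 0, 1, 1, 1]

near_zero = knn_predict(train_points, train_labels, (1.2, 1.5), k=3)
near_one = knn_predict(train_points, train_labels, (6.0, 5.5), k=3)
print(near_zero, near_one)
```

All distance computations happen at prediction time, which is why no separate training step appears in the sketch.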
4 Results and Discussion
Predictive models cannot guarantee 100% correct prediction; hence, several performance metrics are generally used to evaluate a proposed model. Popular performance measures are Sensitivity, Specificity, Precision and F1-Score. Table 4 shows the classifiers' performance in terms of Precision, Sensitivity, Specificity and F1-Score. The classifiers' performance in terms of Accuracy, Kappa, F1, Precision and Sensitivity is plotted in Figs. 3, 4, 5, 6 and 7.
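The measures named above follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the confusion matrix of any classifier in this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts for a 150-sample test set.
m = classification_metrics(tp=90, fp=20, tn=30, fn=10)
print({name: round(value, 3) for name, value in m.items()})
```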
Table 4 Performance of the classifiers
Classifiers | Precision (%) | Specificity (%) | Sensitivity (%) | F1-Score (%)
LDA | 81 | 58 | 92 | 86
kNN | 79 | 54 | 85 | 81
RF | 80 | 54 | 90 | 85
SVM | 80 | 58 | 87 | 83
Fig. 2 Evaluation metrics used
Fig. 3 Accuracy–kappa plot
Fig. 4 Performance measure—accuracy
Fig. 5 Performance measure—F1
Fig. 6 Performance measure—precision
Fig. 7 Performance measure—sensitivity
5 Conclusion
Detection of diabetes before its onset can be a boon to people worldwide, as the disease has spread across the globe. This study attempted to design a model that can assist in the prediction of the disease. Four machine learning classifiers are presented and evaluated on several performance measures. Experimental results on the Pima Indian diabetes database show the competence of the proposed model, with an accuracy of 78% using Linear Discriminant Analysis. In the future, the model used in this work can be applied to other medical datasets to predict the onset of other diseases.
References
1. P. Indoria, Y.K. Rathore, A survey: detection and prediction of diabetes using machine learning techniques. Int. J. Eng. Res. Technol. (IJERT), 287–291 (2018)
2. D. Kumar, R. Govindasamy, Performance and evaluation of classification data mining techniques in diabetes. Int. J. Comput. Sci. Inf. Technol. 6, 1312–1319 (2015)
3. P. Kumar, V. Umatejaswi, Diagnosing diabetes using data mining techniques. Int. J. Sci. Res. Publ., 705–709 (2017)
4. V. Kumari, R. Chitra, Classification of diabetes disease using support vector machine. Int. J. Eng. Res. Appl. (IJERA) www.ijera.com, 1797–1801 (2013)
5. S. Perveen, M. Shahbaz, A. Guergachi, K. Keshavjee, Performance analysis of data mining classification techniques to predict diabetes. Procedia Comput. Sci., 115–121 (2016)
6. S. Sadhana, S. Savitha, Analysis of diabetic data set using Hive and R. Int. J. Emerg. Technol. Adv. Eng. 4 (2014)
7. S. Saru, S. Subashree, Analysis and prediction of diabetes using machine learning. Int. J. Emerg. Technol. Innov. Eng. 5(4) (2019)
8. N.M. Sharef, M.A. Azmi Murad, N. Mustapha, H.M. Zin, The effects of pre-processing strategies in sentiment analysis of online movie reviews, in AIP Conference Proceedings, vol. 1891 (2017), p. 020089
9. K. Sharmila, S. Manickam, Efficient prediction and classification of diabetic patients from big data using R. Int. J. Adv. Eng. Res. Sci. 2 (2015)
10. D. Sisodia, D.S. Sisodia, Prediction of diabetes using classification algorithms, in International Conference on Computational Intelligence and Data Science (ICCIDS 2018) (2018), pp. 1578–1585
11. Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, H. Tang, Predicting diabetes mellitus with machine learning techniques. Bioinform. Comput. Biol. 9 (2018)
12. Pima Indians dataset. https://www.kaggle.com/uciml/pima-indians-diabetes-database
Er–Yb Co-doped Fibre Amplifier Performance Enhancement for Super-Dense WDM Applications Anurupa Lubana, Sanmukh Kaur, and Yugnanda Malhotra
Abstract The need for high-speed data transmission has increased the demand for all-optical amplifiers in dense wavelength division multiplexing (DWDM) systems. All-optical amplifiers do not need to convert the input optical signal to an electrical signal and back to an optical signal; hence, they are very efficient when employed in DWDM systems. Modern high-capacity optical communication needs all-optical amplifiers with higher gain. In this paper, the performance of an all-optical Erbium–Ytterbium (Er–Yb) co-doped fibre amplifier is optimized for a 100-channel super-dense wavelength division multiplexing (SD-WDM) system over a wavelength range of 1525–1625 nm at a reduced channel spacing of 0.2 nm. The EYDFA amplifier, with a maximum gain of approximately 45 dB over a flat band range, proves to be a good choice for achieving high and flat gain for large-volume, fast data transmission. Low gain variation ratios (GVR) of 0.03 and 0.16 have been attained as a function of channel wavelength and input power, respectively, with a good Quality-factor of 6.6. The boosted gain, reduced GVR and acceptable Quality-factor make the EYDFA amplifier superior to other amplifiers for long-haul applications.
Keywords Doped fibre amplifier · EYDFA · SD-WDM systems · Gain · Gain variation ratio · Quality-factor
A. Lubana (B) · S. Kaur ASET, Amity University, Noida, Uttar Pradesh 201313, India e-mail: [email protected] S. Kaur e-mail: [email protected] A. Lubana Ambedkar Institute of Technology, Shakarpur, New Delhi 110092, India Y. Malhotra Bharati Vidyapeeth College of Engineering, Pashchim Vihar, New Delhi 110063, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_74
1 Introduction
Modern optical communication systems need improved and unconventional system constituents and devices to attain high-speed data transmission [1, 2]. To make the best use of bandwidth, dense wavelength division multiplexing (DWDM) systems are used in digital communication; they can transmit many optical signals of different wavelengths over a single optical fibre very efficiently. The improved performance of DWDM systems is owed to the evolution and upgrading of the materials used to fabricate all-optical amplifiers [3–6]. All-optical amplifiers are capable of strengthening any wavelength if provided with appropriate pump wavelengths. DWDM systems with a channel spacing of 25 GHz (0.2 nm) or less are referred to as SD-WDM systems in [7]. Semiconductor optical amplifiers (SOAs) exhibit nonlinear behaviour such as self-phase modulation and four-wave mixing, and require electrical pumping [8]. Raman amplifiers, on the other hand, offer amplification by using a specific pumping wavelength, and a resultant gain of up to 15 dB is achieved [9–11]. The EYDFA is the most widely used all-optical fibre amplifier. Er–Yb-doped fibre amplifiers use erbium–ytterbium double-clad, large-core fibre technologies to achieve high output power at low cost. The erbium–ytterbium-doped fibre amplifier shows wide gain bandwidth and high efficiency [4]. The fibre length to be used depends on the quantity of erbium–ytterbium doping ions, the pump wavelength, the pump power and the input signal power. A gain of ~13 dB has been achieved using a fibre amplifier at an input data rate of 10 Gb/s for an SD-WDM system [12]. The pump laser, the erbium–ytterbium-doped fibre and the wavelength selective coupler are the three basic components of the EYDFA amplifier. In [13], a gain of 56 dB has been achieved with a dual-stage EYDFA amplifier with a better noise figure.
In [14], the characteristics of an EYDFA amplifier and an EYDFA + Raman hybrid amplifier have been reviewed by Singh et al. Gains of 50 and 35 dB have been attained for the EYDFA alone and the hybrid amplifier, respectively, with a maximum noise figure of 6. The all-optical amplifiers in DWDM systems are designed to operate within a specified band for the given number of input channels. The gain achieved by a fibre amplifier is one of the main characteristics that determine the performance of a DWDM system; in addition, the difference between maximum and minimum gain should be small and the Quality-factor acceptable. The amplification factor (gain) of an all-optical fibre amplifier is calculated as 10 log of the ratio of output power to input power [4]. Anurupa et al. presented an EYDFA amplifier in an SD-DWDM system and achieved a maximum gain of up to 43 dB with a very low gain variation ratio. Along with high gain, such an amplifier must also possess gain flatness, which describes how little the obtained gain varies between different transmission channels [3]. If the gain variation ratio (GVR) of an optical amplifier is high, the WDM channels will experience different gains, which is not desirable in optical amplification. The GVR is the measure of the gain flatness of all-optical amplifiers [4]: it is the ratio of the difference between the maximum and minimum gain to the minimum gain. The Quality-factor is a function of the optical signal to noise ratio (OSNR) which
provides a qualitative measure of the receiver performance and can be measured from the eye diagram at the receiver end of the communication link. A higher bit rate requires a higher OSNR. Bit error rate (BER) is the number of errored bits per unit of time. The Quality-factor indicates the minimum OSNR required to obtain a specific BER for a given signal [12–15].
In this paper, we first propose and analyse the EYDFA amplifier over the 1525–1625 nm wavelength range to achieve high and flat gain. Next, at two different pump wavelengths, the EYDFA amplifier is analysed in terms of gain, Quality-factor and GVR with respect to channel wavelength and input power, respectively, covering the low-GVR range of 1535–1567 nm. Further, the performance of the proposed amplifier at the two wavelengths is compared. The results achieved using the proposed amplifier are the following:
• The EYDFA amplifier, with a maximum gain of approximately 45 dB, proves to be a good choice for achieving high and flat gain for large-volume, fast data transmission.
• Low gain variation ratios (GVR) of 0.03 and 0.16 have been attained with a good Quality-factor of 6.6.
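The two definitions used above, gain as 10 log of the output-to-input power ratio and GVR as (maximum gain minus minimum gain) divided by minimum gain, can be written out as a short sketch. The power levels and per-channel gains below are hypothetical values chosen for easy arithmetic, not simulation results from this work.

```python
import math


def gain_db(p_out_mw, p_in_mw):
    """Amplifier gain in dB: 10 * log10(P_out / P_in)."""
    return 10 * math.log10(p_out_mw / p_in_mw)


def gain_variation_ratio(gains_db):
    """GVR = (G_max - G_min) / G_min over the channel gains."""
    g_max, g_min = max(gains_db), min(gains_db)
    return (g_max - g_min) / g_min


# Hypothetical example: 0.1 mW in, 1000 mW out.
g = gain_db(p_out_mw=1000.0, p_in_mw=0.1)
# Hypothetical per-channel gains in dB across a flat band.
gvr = gain_variation_ratio([40.0, 41.0, 41.2])
print(round(g, 2), round(gvr, 3))
```

A GVR close to zero means every channel sees nearly the same gain, which is the flatness property the amplifier is designed for.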
2 Performance Analysis of EYDFA Amplifier
2.1 Methodology, System Set-up and Simulation Results
The proposed EYDFA amplifier has been simulated in an SD-WDM system using OptiSystem 15 software, as shown in Fig. 1. The parameters fixed during the design of the system are listed in Table 1. The set-up consists of a system of 100 channels with a channel spacing of 25 GHz (0.2 nm). For every transmitted wavelength, the maximum gain has been recorded. The system has been evaluated over the wavelength range of 1525–1625 nm using a range of
Fig. 1 SD-DWDM system employing EYDFA amplifier
Table 1 System design parameters
Parameter | Value
Transmission distance | 150 km (SMF) + 5 m (EYDFA)
Spacing between channels | 25 GHz (0.2 nm)
Input data | 80 Gb/s
Channel count | 100
Power provided at input | 0.1–2.5 mW
Modulation | NRZ
Dark current | 1 nA
Responsivity of detector | 1 A/W

Table 2 EYDFA amplifier parameters
Parameter | Value
Cross-section area | 50 µm^2
Pump signal wavelength | 1056, 1480 nm
Counter pump power | 200 mW
Length of fibre | 5 m
Ion density (Er3+) | 5.14 × 10^25 m^-3
Ion density (Yb3+) | 6.2 × 10^26 m^-3
Lifetime of Er3+ ions | 10 ms
Lifetime of Yb3+ ions | ms
input powers from 0.1 to 2.5 mW, at an EYDFA pump signal power of 200 mW. The system input data rate is 80 Gb/s, and the system has been analysed for 1056 and 1480 nm pump signals. After passing through the multiplexer, the signals are transmitted over a single-mode fibre of length 125 km and an EYDFA amplifier of length 5 m. In this work, the proposed EYDFA amplifier has been examined over the range of 1525–1625 nm using 1056 and 1480 nm pump signals. For every transmitted wavelength range, a bandwidth of 20 nm is analysed at a time, since the channel spacing is 0.2 nm and the number of channels is 100. Similarly, the 1535–1567 nm part of the 1525–1625 nm band has been investigated separately. The parameters of the EYDFA amplifier are listed in Table 2. The counter-pumping scheme has been utilized for fibre loss compensation. The erbium and ytterbium concentrations have been kept constant while taking results.
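The relation between the 25 GHz channel spacing, its 0.2 nm wavelength equivalent and the 20 nm window covered by 100 channels can be checked numerically. This is a small sketch under stated assumptions: the conversion uses the standard approximation Δλ ≈ λ²Δf/c, the centre wavelength of 1550 nm is assumed here (the paper does not name one), and the function names are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s


def spacing_nm(delta_f_hz, centre_wavelength_nm):
    """Convert a frequency spacing to a wavelength spacing near a centre wavelength."""
    wavelength_m = centre_wavelength_nm * 1e-9
    return (wavelength_m ** 2) * delta_f_hz / SPEED_OF_LIGHT * 1e9


def occupied_bandwidth_nm(channel_count, channel_spacing_nm):
    """Wavelength window covered by equally spaced channels."""
    return channel_count * channel_spacing_nm


# Assumed centre wavelength of 1550 nm; 25 GHz and 100 channels come from Table 1.
print(round(spacing_nm(25e9, 1550.0), 3))
print(occupied_bandwidth_nm(100, 0.2))
```

Because the conversion depends on the centre wavelength, the 0.2 nm figure is a rounded value that holds near 1550 nm.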
2.2 Performance Evaluation of EYDFA Amplifier at 1056 nm Pump Signal Wavelength
Here, first, the system has been studied for a pump signal wavelength of 1056 nm. The maximum gain obtained at each wavelength among the 100 channels has been
Fig. 2 Gain versus wavelength plot of proposed EYDFA amplifier
plotted in Fig. 2. The figure shows two gain peaks at wavelengths of 1537 and 1559 nm, providing gains of 41.2 and 41.37 dB, respectively. Beyond 1570 nm the gain approaches 30 dB, and it drops to a low value of 16 dB at a wavelength of 1620 nm. The Quality-factor reflects the noise present in the signal and indicates the performance of an optical communication system. The maximum Quality-factor at each wavelength over the entire range of 1525–1625 nm is shown in Fig. 3. The maximum Quality-factor of 23 is achieved at a wavelength of 1579 nm. The average Quality-factor as a function of input power is plotted in Fig. 4. An average Quality-factor of 5.7 is observed at an input power of 0.1 mW, whereas the highest Quality-factor of 7.7 is observed at an input power of 2.5 mW. Further, the EYDFA amplifier has been examined for the narrower band stretching from 1535 to 1567 nm, over which the gain is higher and the GVR is lower. Figure 5 shows the magnified gain spectrum of the proposed amplifier over this narrower band for input powers ranging from 0.1 to 2.5 mW. The highest and lowest flat band gains of 41.37 and 30.7 dB are observed at wavelengths of 1559 and 1551 nm, respectively. The highest gain values occur at an input power of 0.1 mW. As the input power grows from 0.1 to 2.5 mW, the gain reduces from 41.37 to 35 dB. Figure 6 displays the average gain of the proposed amplifier at a pump signal of 1056 nm with respect to input powers from 0.1 to 2.5 mW. The highest and lowest average gains of 40 and 32 dB are observed at input powers of 0.1 and 2.5 mW, respectively. It is visible from the figure that the gain of the amplifier decreases as the input power increases.
Fig. 3 Quality-factor versus wavelength plot at different wavelengths (1525–1625 nm)
Fig. 4 Quality-factor at different input power (1525–1625 nm)
2.3 Performance Comparison of EYDFA Amplifiers at Two Different EYDFA Wavelengths
In this section, we compare the performance of the proposed fibre amplifier in terms of gain, Quality-factor and gain variation ratio at two different pump signal wavelengths for two wavelength ranges: 1525–1625 nm (full C + L band) and 1535–1567 nm (flat
Fig. 5 Gain of proposed EYDFA amplifier at varying power values
Fig. 6 Average gain versus input power (1535–1567 nm)
narrower band). Figure 7 depicts the comparison of the EYDFA amplifier gain at the two pump signal wavelengths of 1056 and 1480 nm over the wavelength band of 1525–1625 nm. The figure shows that a gain improvement of 4 dB has been achieved with the proposed EYDFA amplifier. The minimum gain of the system, 18 dB, occurs at 1620 nm. Figs. 2 and 8 show that the gain peaks for both pump wavelengths occur at the same wavelengths, 1537 and 1559 nm.
Fig. 7 Flat band Quality-factor comparison plot at different wavelengths
Figure 8 depicts the Quality-factor spectrum at the two EYDFA pump signals of 1056 and 1480 nm over the narrower, low-GVR wavelength band of 1535–1567 nm. The maximum Quality-factor of 11.38 is achieved at a wavelength of 1551 nm. The Quality-factors achieved for the two pump signal wavelengths are comparable. Figure 9 shows the GVR of the EYDFA amplifier at the two pump signals as a function of wavelength from 1535 to 1567 nm. A GVR of less than 0.05 has been observed at the 1480 nm pump signal; on the other hand, the GVR
Fig. 8 Gain plots of EYDFA amplifier at different pump wavelengths
Fig. 9 Flat band GVR comparison plot at different wavelengths
as a function of different power levels is shown in Fig. 10. A GVR of less than 0.7 has been observed over the range of 1535–1567 nm. The average gain and Quality-factor as a function of input power at the two pump wavelengths are shown in Figs. 11 and 12. An average gain of 44 dB and an average Quality-factor of 6.6 have been observed over the flat band range.
Fig. 10 Flat band GVR at different input powers
Fig. 11 Flat band average gain at input powers
Fig. 12 Flat band average Quality-factor at input powers
3 Summary of Results and Comparison with Other Optical Amplifiers with Proposed EYDFA Amplifier
The proposed amplifier at a data rate of 80 Gb/s with an input power of 0.1–2.5 mW has been analysed over the wavelength range of 1525–1625 nm. Further, for the narrower flat band from 1535 to 1567 nm, the proposed amplifier pump wavelength has been optimized to achieve higher gain, lower GVR and a good Quality-factor. The gain of 45 dB, GVR of […]
[…] 100 GB to be compressed to SAMBAMBA format in 1.5 h, which previously took 8.5 h or more through serial computing. Ko Kusudo et al. [18] proposed a bit-parallel algorithm for matching multiple patterns with different lengths. It employs fast string search using OpenMP directives to facilitate data-parallel search for strings. Compared to the single-core CPU implementation of the PFAC algorithm, the method claimed 1.4 times higher throughput for a genome dataset that had many partial matches between the text and patterns. The method has a limitation on the total length of patterns, which must be large, but smaller than the word size. Al-Dabbagh et al. [15] proposed a parallel version of an exact string matching algorithm over OpenMP. The test results of the proposed algorithm claimed a speedup of 5 for 200 MB of string data. Thakur et al. [16] compared serial and parallel versions of a searching algorithm using OpenMP on two and four cores with two and four threads, respectively. J. Zheng et al. [19] processed Arabidopsis and Poplar promoter data using OpenMP technology. In that paper, the authors
R. Saxena and M. Jain
executed both the promoter data processing and the prime split algorithms in parallel. A comparison was drawn between serial and parallel execution over 4, 8, and 16 threads. The experimental results show a reduction from approximately 1.1 to 0.5 min for 40165 effective Arabidopsis combination items and 72962 effective Poplar combination items. Borovikov [20] proposed a practical approach to Face Image Retrieval (FIR). It utilizes the multi-core computing architecture to implement its major modules, such as face detection, and obtains considerable speedups over real-time web-based image data. In this paper, we parallelize the content-based search for duplicate files over a huge database using OpenMP multithreading and directives. It is shown later in the paper that the proposed method also yields a large reduction in the time taken to summarize a long video. The following sections discuss the search mechanisms, the need for parallelization, and the algorithmic procedure, along with an analysis of the results obtained.
3 Searching Algorithms
For searching any file in a collection of files, some existing techniques are discussed below. Table 1 compares these techniques in terms of their worst-, average-, and best-case complexities.
A. Linear Search
It is the most common and simplest approach, which searches for the specific file or element in the directory [3]. The algorithm examines the elements in linear order; when the specific element is found, the search returns its location. On average, the complexity of the algorithm is O(n).
B. Binary Search
When the data to be searched is large, the binary search algorithm gives better results than linear search, but it only works when the elements are arranged in sorted order [4]. The complexity of the binary search algorithm is O(log n).
Table 1 Comparison of searching techniques in terms of complexities
Searching algorithm | Worst case | Average case | Best case
Linear search | O(n) | O(n/2) | O(1)
Binary search | O(log n) | O(log n) | O(1)
Tree search | O(n) | O(log n) | O(1)
Hashing | O(n) | O(1) | O(1)
Enhancing Redundant Content Elimination …
C. Tree Search
A binary search tree is a tree in which the left subtree of each node contains elements smaller than the node and the right subtree contains elements greater than the node. The worst-case complexity of searching a binary search tree is O(n).
D. Hashing
Hashing is a popular technique for searching in large databases. In the hashing technique, the array elements are mapped to table positions by a hash function. The hash function is generally of the form M mod N. The elements are then searched with the help of the hash function [5].
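The searching techniques compared above can be illustrated with a short sketch. It is not taken from the paper: the function names are hypothetical, binary search relies on the standard-library `bisect_left`, and the hash table assumes non-negative integer keys with the key taken modulo the number of buckets as the hash function.

```python
from bisect import bisect_left


def linear_search(items, target):
    """Scan the elements in order; O(n) comparisons in the worst case."""
    for index, value in enumerate(items):
        if value == target:
            return index
    return -1


def binary_search(sorted_items, target):
    """Halve the search range each step; requires sorted input."""
    index = bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1


def build_hash_table(items, bucket_count):
    """Place each (value, position) pair in bucket value mod bucket_count."""
    buckets = [[] for _ in range(bucket_count)]
    for index, value in enumerate(items):
        buckets[value % bucket_count].append((value, index))
    return buckets


def hash_search(buckets, target):
    """Look only inside the bucket selected by the hash function."""
    for value, index in buckets[target % len(buckets)]:
        if value == target:
            return index
    return -1


data = [3, 8, 15, 23, 42, 57]  # hypothetical sorted keys
table = build_hash_table(data, 7)
print(linear_search(data, 23), binary_search(data, 42), hash_search(table, 57))
```

The hash lookup degrades towards O(n) when many keys fall into the same bucket, which is the collision problem the next section refers to.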
4 Need for Parallelization and High Performance Computing Platforms
A. Need for Parallelization
The searching techniques discussed in the previous section have been used for years in a variety of modes for numerous applications. However, every mechanism has its own pros and cons. Binary search works only with sorted data, whereas tree search requires rebalancing as the data grows, which adds to the cost of searching. Hashing, often regarded as the most efficient construct for data storage, also performs poorly over a large data set when the hash function is poorly chosen and collisions occur in the hash table [14]. The simplest of all is the linear search method, but it is rarely the algorithm of choice because the complexity of the operation is a linear function of the input data size. For an enormous data chunk, this linear order hurts badly. Thus, there is a need to curb this growth in computation time, and the architecture of modern machines, incorporating multiple processors with high clock frequencies, allows us to do that. In this paper, we have used the Single Instruction Multiple Data (SIMD) power of the processors to do this job: the large dataset is distributed over the processors, and the algorithm is modified to map onto this parallel architecture of the machine. Each processor searches the content in parallel over the portion of the dataset mapped to it. This algorithmic modification reduces the running time of the algorithm by a factor of ‘p’, where ‘p’ is the number of processors. The hyper-threading architecture of the processor further speeds up the runtime execution of the algorithm. So, in the following sections,
this algorithmic modification is explained along with parallel computing and the available standards.
B. High-Performance Computing: Overview
Complex computation and processing of huge data is a need in today’s world. Most applications based on real-life problems take a long time to execute because of their huge databases and complex operations. High-Performance Computing (HPC) addresses these problems. HPC is based on the concept of multi-core and many-core architectures and uses more than one processor to solve a problem. Basically, the problem is split into multiple subproblems, and these subproblems are assigned to various processors with the use of threads [6]. The threads run in parallel on all the cores, which results in faster computation. When the problem is split among 10 cores or fewer (Central Processing Unit), the architecture is called multi-core; if it uses more than 100 cores (Graphical Processing Unit), it is called many-core. Both CUDA and OpenMP follow shared-memory parallelism. To use HPC to its full potential, the serial program needs to be rewritten to map onto the parallel architecture. Various platforms are available for writing parallel code, such as Compute Unified Device Architecture (CUDA), OpenCL, and OpenACC for many-core processing, and OpenMP for multi-core architectures [7]. For the elimination of duplicate records based upon content similarity, the proposed method exploits parallelism over a multi-core architecture using OpenMP.
C. Open Multi-Processing (OpenMP)
OpenMP is a popular and easy-to-use platform for writing parallel code to run on multi-core architectures. OpenMP consists of three main components, namely, directives, environment variables, and clauses. The main concept behind the parallelism is multithreading. Multiple threads are created in a program, and work is distributed among the threads using directives, clauses, and environment variables.
An OpenMP program is a combination of serial and parallel code. The programmer has to explicitly specify which code to run in parallel using compiler directives [7]. OpenMP supports the C, C++, and Fortran languages. We have used the C language for the parallel implementation of the algorithm and compared the performance among various numbers of cores and threads [8]. Figure 3 shows the basic architecture of OpenMP.
Fig. 3 Basic architecture of OPENMP [9]
5 Algorithm for Removal of Duplicate Files
The proposed algorithm is an application for the deletion of duplicate files over a multi-core architecture. The algorithm follows two steps: first, it searches among all the files for duplicate files with the same file name as well as the same content, and second, it deletes all the duplicate files identified except the original file. Below is the algorithm for detection and deletion of duplicate files.
Algorithm
1. Define Number of files (N) and Number of Elements (E)
2. Create ‘N’ files and input ‘E’ number of random numbers in each file
3. Compare one file with the rest (N-1) files
4. Repeat the comparison of files, for each file
5. If the numbers of, say, file1 and file2 are found to be equal, increment the ‘equal’ variable
6. At the end of comparison of two files, if they have similarity more than the threshold declared, discard the second file.
7. Repeat the same during comparison of all the pair of files.
It will work both in serial and parallel computation. The serial and parallel pseudocode implementations of the comparison function are shown below.
A. SERIAL CODE: SERIAL APPROACH
Procedure comparison (file1, file2)
// file1 and file2 being the file descriptors for the files to be matched
Begin
    for each entity in file 1 and file 2
        if (value in file 1 is same as file 2)
            continue;
        else
            return not identical files
    end for
    return file 2 is identical to file 1
End

B. PARALLEL OPENMP CODE: PARALLEL APPROACH
Procedure comparison new (file1, file2)
// file1 and file2 being the file descriptors for the files to be matched
Begin
    #pragma omp parallel for num_threads (n)
    for each entity in file 1 and file 2 and increment by n
        if (value in file 1 is same as file 2)
            continue;
        else
            return not identical files
    end for
    return file 2 is identical to file 1
End
In the above pseudocode, the for loop computations are spread over ‘n’ threads, as permitted by the available system resources. The loop advances by ‘n’ because ‘n’ elements of the file are compared in one go. This is the key reduction step of the algorithm: the number of times the loop runs is reduced by a factor of ‘n’, which is equal to the number of threads invoked.
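The algorithm steps and the strided comparison can be sketched together as follows. This is an illustrative sketch only, not the authors' C/OpenMP code: it swaps in Python's `ThreadPoolExecutor` for the OpenMP directive, files are modelled as in-memory lists of numbers, and the names `count_equal`, `similarity` and `remove_duplicates` as well as the threshold of 0.9 are hypothetical. In CPython the threads here show how the work is divided (each worker visits indices start, start + n, start + 2n, ...) rather than a real speedup, because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def count_equal(file_a, file_b, start, step):
    """Count matching positions among indices start, start + step, ..."""
    return sum(1 for i in range(start, len(file_a), step)
               if file_a[i] == file_b[i])


def similarity(file_a, file_b, workers=4):
    """Fraction of positions that hold the same value in both files."""
    if len(file_a) != len(file_b) or not file_a:
        return 0.0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Worker s handles every 'workers'-th element, starting at offset s.
        parts = pool.map(lambda s: count_equal(file_a, file_b, s, workers),
                         range(workers))
        equal = sum(parts)
    return equal / len(file_a)


def remove_duplicates(files, threshold=0.9, workers=4):
    """Return indices of files kept; a file is discarded when its
    similarity to an already kept file exceeds the threshold."""
    kept = []
    for index, candidate in enumerate(files):
        if all(similarity(files[k], candidate, workers) <= threshold
               for k in kept):
            kept.append(index)
    return kept


# Hypothetical data: the second file duplicates the first.
files = [[1, 2, 3, 4], [1, 2, 3, 4], [9, 8, 7, 6], [1, 2, 3, 5]]
print(remove_duplicates(files))
```

Unlike the pseudocode, this sketch counts matches instead of returning at the first mismatch, so that the threshold test of step 6 can be applied to the resulting similarity.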
6 Performance Analysis of Serial Code and Parallel Code
For testing the algorithm of duplicate file detection and deletion, we created different numbers of files with various contents and compared the performance of the serial code in C, which uses a single core, with that of the parallel code in OpenMP, which uses multiple cores (quad-core) [22, 23]. The test results show that the multi-core architecture gives better results than the single core in terms of time. These results have been evaluated on a machine with an Intel Xeon Phi series quad-core architecture with hyper-threading enabled and a Linux distribution (CentOS 6.0.2 or Ubuntu 14.04) installed on bare metal (i.e., not on virtual machines). The parallel code has been written using the OpenMP 4.0 standard in C++. The code has been tested on a variety of datasets, and the algorithmic efficiency is unaffected by the nature of the data being compared. Table 2 shows the results of the performance analysis of serial versus parallel code. A snapshot of the results achieved for 100 files and 50,000 elements is shown in Fig. 4. The comparison of serial and parallel code is also shown in Fig. 5, where the X axis is the total number of files × contents per file and the Y axis represents the running time in seconds.
Table 2 Results of performance analysis of serial versus parallel code
No. of files (n) | No. of elements (e) | Total no. of elements (n*e) | Serial code execution time (min, s) | Parallel code execution time (min, s)
10 | 1000 | 10,000 | 0 m 3.305 s | 0 m 0.020 s
100 | 1,000 | 1,00,000 | 0 m 2.573 s | 0 m 1.246 s
100 | 10,000 | 10,00,000 | 0 m 25.700 s | 0 m 11.788 s
100 | 50,000 | 50,00,000 | 2 m 5.230 s | 0 m 16.557 s
1,000 | 1,000 | 10,00,000 | 2 m 47.277 s | 1 m 56.865 s
500 | 35,000 | 1,75,00,000 | 33 m 7.738 s | 3 m 50.461 s
Fig. 4 The snapshots of results achieved for 100 files and 50,000 elements
Fig. 5 Shows the comparison of serial and parallel code in terms of time taken to run the code
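The speedup implied by each row of Table 2 is the serial time divided by the parallel time. The following sketch parses the "m ... s" notation used in the table and computes that ratio for two of its rows; the helper names `to_seconds` and `speedup` are hypothetical, and the timings are copied from Table 2.

```python
import re


def to_seconds(text):
    """Convert a string such as '2 m 5.230 s' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*m\s*([\d.]+)\s*s\s*", text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def speedup(serial_time, parallel_time):
    """Speedup = serial execution time / parallel execution time."""
    return to_seconds(serial_time) / to_seconds(parallel_time)


# Two rows taken from Table 2.
rows = [
    ("100 files x 50,000 elements", "2 m 5.230 s", "0 m 16.557 s"),
    ("500 files x 35,000 elements", "33 m 7.738 s", "3 m 50.461 s"),
]
for label, serial_time, parallel_time in rows:
    print(f"{label}: speedup {speedup(serial_time, parallel_time):.2f}")
```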
7 Applications
The proposed algorithm can be used in various applications, some of which are listed below.
• Android memory cleaner [9]
• Disk cleaner
• Duplicate record detection, which can have sub-applications such as processing call detail records [10]
• Storage optimizing encoder and method: an encoder and method, such as for use in CD-ROM pre-mastering software, optimize storage on a computer-readable recording medium by eliminating redundant storage of identical data streams for duplicate files [11]. The encoder and method detect whether two files have equivalent data streams and encode such duplicate files as a single data stream referenced by the respective directory entries of the files.
• In fields like image processing and signal processing, duplicate file detection has vast applications. One such example in image processing is video summarization.
• In the field of signal processing, ‘audio fingerprinting’ involves duplicate detection, where the goal is to identify duplicate audio clips in a set even if they differ in compression quality or duration [12].
Fig. 6 Time comparison of serial and parallel code for video summarization
8 Video Summarization Using Proposed Algorithm
Some videos take up a lot of space, and watching their whole content takes a long time. We can summarize a video using the proposed algorithm and view its main contents instead of watching the whole video. A video mainly consists of a collection of frames; we can compare the matrix values of the frames and delete the duplicates [13]. We took a sample video of 15 min and applied the proposed algorithm on single-core and multi-core architectures; the video was shortened to approximately 2 min, revealing only its main contents. The analysis shows that the multi-core architecture gives far better results than the single core. Figure 6 shows the time taken by the single-core and multi-core architectures.
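The frame comparison described above can be sketched in a few lines. This is a minimal illustration under assumptions the paper does not spell out: frames are modelled as small matrices (lists of rows) of pixel values, each frame is compared only with the most recently kept frame, and the function names and the threshold of 0.9 are hypothetical.

```python
def frame_similarity(frame_a, frame_b):
    """Fraction of pixel positions with identical values in two frames."""
    total = 0
    equal = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            if pixel_a == pixel_b:
                equal += 1
    return equal / total if total else 0.0


def summarize(frames, threshold=0.9):
    """Keep a frame only if it differs enough from the last kept frame."""
    kept = []
    for frame in frames:
        if not kept or frame_similarity(kept[-1], frame) <= threshold:
            kept.append(frame)
    return kept


# Hypothetical 2 x 2 frames: the second frame repeats the first.
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[1, 1], [1, 1]],
    [[1, 1], [1, 0]],
]
print(len(summarize(frames)))
```

Lowering the threshold discards more near-identical frames and so produces a shorter summary.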
9 Conclusion and Future Scope
For the proposed algorithm, the execution time of the parallel code is about half, or less than half, of the execution time of the serial code. A further optimization would be to open multiple files to be compared in parallel, which largely reduces the execution time. An example is shown below (statistics for N = 500; E = 10,000):
• Parallelizing the multiple-file comparison function takes 0 m 26.299 s
• Parallelizing only the comparison () function takes 1 m 33.207 s
The state of the art in [15] and [16] utilizes a concept similar to the one discussed in this paper. The authors in [15] have evaluated results for various pattern lengths of
strings and claim a speedup of 5 for 200 MB of data. The proposed algorithm in this paper shows a speedup of 11 for 500 files having 10,000 floating values in each file over a quad-core machine, as discussed above. This speedup is expected to scale further for larger numbers of files and larger data sizes per file as the number of processors increases. Results can be improved further by taking into account some specialized OpenMP features such as:
• Varying the number of threads (increasing up to a limit) [24, 25]
• Using the schedule clause (dynamic with chunk size, or guided)
In the future, more optimal results could be achieved by the techniques used in high-performance computing [21].
References
1. WinExt, (2013, March 13). Retrieved from https://www.trisunsoft.com/
2. “Speed up & Optimize Your PC with CCleaner®.” RSS. N.p., (2005)
3. V.P. Parmar, C.K. Kumbharana, Comparing linear search and binary search algorithms to search an element from a linear list implemented through static array, dynamic array and linked list. Int. J. Comput. App. 121(3) (2015)
4. S. Pushpa, P. Vinod, Binary search tree balancing methods: a critical study. IJCSNS Int. J. Comput. Sci. Network Sec. 7(8), 237–243 (2007)
5. J. Rautaray, R. Kumar, Hash based searching algorithm. Int. J. Innov. Res. Sci Eng. Technol. 2(2), 469–471 (2013)
6. R. Saxena, M. Jain, D.P. Sharma, S. Jaidka, A review on VANET routing protocols and proposing a parallelized genetic algorithm based heuristic modification to mobicast routing for real time message passing. J. Int. Fuzzy Sys. 36(3), 2387–2398 (2019)
7. B. Mustafa, R. Shahana, W. Ahmed, Parallel implementation of doolittle algorithm using openMP for multicore machines, in 2015 IEEE International Advance Computing Conference (IACC), IEEE (2015, June), pp. 575–578
8. R. Saxena, M. Jain, S.M. Yaqub, Sudoku game solving approach through parallel processing, in Proceedings of the Second International Conference on Computational Intelligence and Informatics (Springer, Singapore, 2018), pp. 447–455
9. B. Stockton, 5 Android Apps That Really Clean Up Your Device (No Placebos!) (2019, December 10). Retrieved from https://www.makeuseof.com/tag/5-apps-really-clean-androiddevice-arent-placebos/
10. A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: A survey. IEEE Trans. Know. Data Eng. 19(1), 1–16 (2006)
11. I.M.A.D. Suarjaya, A new algorithm for data compression optimization (2012). arXiv:1209.1045
12. P. Cano, E. Batlle, T. Kalker, J. Haitsma, A review of audio fingerprinting. J. VLSI Signal Proc. Sys. Signal Image Video Technol. 41(3), 271–284 (2005)
13. Z. Elkhattabi, Y. Tabii, A. Benkaddour, Video summarization: techniques and applications. Int. J. Comput. Inf. Eng. 9(4), 928–933 (2015)
14. R. Ghosh, Advantages and Disadvantages of Hashing (1970, January 1). http://rajaghoshtech2.blogspot.com/2010/03/advantages-and-disadvantages-of-hashing.html
15. S.S.M. Al-Dabbagh, N.H. Barnouti, M.A.S. Naser, Z.G. Ali, Parallel quick search algorithm for the exact string matching problem using openMP. J. Comput. Commun. 4(13), 1–11 (2016)
16. N. Thakur, S. Kumar, V.K. Patle, Comparison of serial and parallel searching in multicore systems, in 2014 International Conference on Parallel, Distributed and Grid Computing, IEEE (2014, December), pp. 334–338
17. G.G. Faust, I.M. Hall, SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics 30(17), 2503–2505 (2014)
18. K. Kusudo, F. Ino, K. Hagihara, A bit-parallel algorithm for searching multiple patterns with various lengths. J. Para. Dis. Comput. 76, 49–57 (2015)
19. Y. Shi, J. Lu, J.J. Zheng, The parallel processing for promoter data base on OpenMP, in Electronic Engineering and Information Science: Proceedings of the International Conference of Electronic Engineering and Information Science 2015 (ICEEIS 2015), January 17–18, (CRC Press, Harbin, China, 2015, June), p. 325
20. E. Borovikov, S. Vajda, G. Lingappa, M.C. Bonifant, Parallel computing in face image retrieval: practical approach to the real-world image search, in Multi-Core Computer Vision and Image Processing for Intelligent Applications (IGI Global, 2017), pp. 155–189
21. R. Saxena, M. Jain, D.P. Sharma, GPU-based parallelization of topological sorting, in Proceedings of First International Conference on Smart System, Innovations and Computing (Springer, Singapore, 2018), pp. 411–421
22. R. Saxena, M. Jain, D. Singh, A. Kushwah, An enhanced parallel version of RSA public key crypto based algorithm using OpenMP, in Proceedings of the 10th International Conference on Security of Information and Networks (2017, October), pp. 37–42
23. R. Saxena, M. Jain, A. Kumar, V. Jain, T. Sadana, S. Jaidka, An improved genetic algorithm based solution to vehicle routing problem over OpenMP with load consideration, in Advances in Communication, Devices and Networking (Springer, Singapore, 2019), pp. 285–296
24. M. Jain, R. Saxena, S. Jaidka, M.K. Jhamb, Parallelization of data buffering and processing mechanism in mesh wireless sensor network for IoT applications, in Smart Computing Paradigms: New Progresses and Challenges (Springer, Singapore, 2020), pp. 3–12
25. R. Saxena, M. Jain, K. Malhotra, K.D. Vasa, An optimized openmp-based genetic algorithm solution to vehicle routing problem, in Smart Computing Paradigms: New Progresses and Challenges (Springer, Singapore, 2020), pp. 237–245
Matched Filter Design Using Dynamic Histogram for Power Quality Events Detection
Manish Kumar Saini and Rajender Kumar Beniwal
Abstract This paper proposes a novel scheme of matched filter design for the detection of power quality events in renewable integrated systems. A matched filter gives a better response on account of its similarity with the signal and thus yields a high SNR. Therefore, filters for power quality signals have been designed utilizing the information inherent in the power quality signal itself. In power quality signals, event-related information is hidden in the form of repetitive patterns. That repetitive information is extracted through dynamic histogram-based quantization for designing a filter matched to the PQ disturbance signal. The stability of the newly designed matched filters has been analyzed using their frequency response. The proposed scheme has been implemented for the detection of voltage sag, transients, and harmonics occurring in the renewable integrated system.
Keywords Matched filter · Histogram · Repetitive pattern · Power quality
M. K. Saini
Electrical Engineering Department, D. C. R. University of Science and Technology, Sonepat 131039, India
R. K. Beniwal (B)
Electrical Engineering Department, Sobhasaria Group of Institutions, Sikar 332001, India
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_80
1 Introduction
With the advancement in technology, on one hand, the advanced equipment used in renewable integrated systems is becoming more sensitive and, on the other hand, the use of non-linear power electronic switches, which degrade the supply, is increasing. Consequently, the demand for good power quality is also rising. The non-linear electronic switches used to integrate renewable energy sources with the main grid inject harmonics into the renewable integrated system and produce current distortion. Harmonics are components of the current and voltage waveforms whose frequencies are integer multiples of the fundamental frequency of the waveform [1]. These harmonics cause severe damage to sensitive equipment and lead to financial instability of connected loads [2]. Another prominent source of power quality degradation is capacitor banks, which are generally installed for power factor improvement and sometimes also for reactive power compensation. Switching events of large capacitor banks cause transients and maloperation of adjustable speed drives connected in the network [3]. These switching actions also cause momentary voltage sag, which can be very dangerous to production plants. Besides all these causes, the intermittent supply of renewable sources is also a major source of power quality issues. Consequently, characterization and detection of all these power quality issues have become essential for both utilities and customers. Detection of power quality issues has been dealt with effectively in the literature using numerous signal processing techniques [4]. Among all the tools developed for power quality analysis, filtering is the most basic and effective technique. Many researchers have utilized different filters, e.g., passive filters, active filters, and hybrid filters. G. Chang et al. have employed passive harmonic filters for limiting the harmonic voltage and the voltage distortion produced in the system [5]. H. Brantsaeter et al. have designed different configurations of eight LCL filters and applied these filters with grid inverters in an offshore wind power plant [6]. Owing to the limited compensation behavior in dynamic conditions, high cost, and weight of passive filters, active filters are preferred for signal filtering [7]. S. Aravind et al. have proposed self-tuning series active filters for mitigation of the harmonics problem in wind/solar hybrid RES [8].
Successful implementation of any filter needs a suitable filter design method. The effectiveness of a filter depends on how well it has been made capable of dealing with different power quality signals [9]. This is where the concept of the matched filter comes into the picture. Matched filters are designed using the characteristics of the signal itself. Consequently, these filters produce better output with a high signal-to-noise ratio (SNR), as evidenced by numerous research works in the literature. PS de Oliveira et al. have designed a quadrature matched filter using independent component analysis to extract the harmonics from the signals [10]. Matched filters are also widely applicable in other research domains, for instance, target detection in hyperspectral images [11], detection of weak nuclear quadrupole resonance signals [12], seizure detection in EEG [13], face recognition [14], speaker recognition [15], speech analysis [16], fingerprint classification [17], and many others. The utilization of matched filters has not been explored much in the domain of power quality analysis. Therefore, this paper presents a novel technique for matched filter design. For designing the matched filter, information in the form of repetitive patterns is extracted from the signal and used as the basis for filter design. Repetitive patterns in the signal are captured by using the dynamic histogram of the power quality signal followed by vector quantization of the obtained result. This gives the coefficients for designing the filter matched to the signal, particularly for power quality issues like harmonics, transients, and sag. The stability analysis of the matched filters has also been performed to validate the feasibility of the designed filters. The frequency response of the filtered signals verifies the superiority of matched filters in the domain of power quality analysis, particularly for renewable integrated systems. In this paper, Sect. 2 discusses the proposed technique for matched filter design. Section 3 presents the simulation results of the proposed technique for PQ events like voltage sag, harmonics, and transient. Section 4 gives the conclusion of this paper.
2 Proposed Methodology
This paper presents a novel technique of matched filter design for the analysis of PQ events in a renewable integrated system. The flow of the proposed concept is shown in Fig. 1. The filter design starts with computation of the dynamic histogram to extract the repetitive pattern in the signal. A dynamic histogram is preferred over a static histogram because it is continuously updated as the signal varies. The dynamic histogram is combined with vector quantization to perform histogram-based quantization [18]. Vector quantization is an efficient compression technique that reduces the dimension of the input vector without losing the significant and useful information it contains. Vector quantization generates a codebook that represents a reduced form of the repetitive pattern present in the power quality signal. This codebook acts as the signal representative and is used for designing the matched filters [16]. To implement the proposed algorithm, a renewable integrated system is simulated in MATLAB/Simulink. Three power quality events, i.e., voltage sag, transient, and harmonics, are simulated, and the resulting three-phase voltage signals are segmented into single-phase signals. The single-phase signals are given as input to the further stages shown in Fig. 1. After filter design, stability analysis of the designed filters is performed using the frequency response of the matched filter.
Fig. 1 Block diagram of proposed methodology
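The histogram-based quantization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the bin count, the forgetting factor `decay`, the codebook size `k`, the Lloyd-style one-dimensional quantizer, and the synthetic sag signal are all hypothetical choices, since the text does not specify them. How the codebook is mapped to filter coefficients is likewise not detailed in the text and is not reproduced here.

```python
import math

def dynamic_histogram(samples, n_bins=16, lo=-1.5, hi=1.5, decay=0.999):
    """Running histogram: existing counts fade by `decay` before each new sample is added."""
    counts = [0.0] * n_bins
    width = (hi - lo) / n_bins
    for x in samples:
        counts = [c * decay for c in counts]
        idx = min(n_bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1.0
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts

def codebook_from_histogram(centers, counts, k=4, iters=50):
    """Weighted 1-D Lloyd iteration: each codeword is the count-weighted mean of its cell."""
    step = len(centers) / k
    code = [centers[int((i + 0.5) * step)] for i in range(k)]  # evenly spread start
    for _ in range(iters):
        sums = [0.0] * k
        wts = [0.0] * k
        for c, w in zip(centers, counts):
            j = min(range(k), key=lambda m: abs(c - code[m]))  # nearest codeword
            sums[j] += c * w
            wts[j] += w
        code = [sums[j] / wts[j] if wts[j] > 0 else code[j] for j in range(k)]
    return sorted(code)

# Hypothetical test signal: 50 Hz sine sampled at 3.2 kHz with a sag to 0.5 pu in the middle.
fs, f0 = 3200, 50
signal = [(0.5 if 1000 <= n < 2000 else 1.0) * math.sin(2 * math.pi * f0 * n / fs)
          for n in range(3200)]
centers, counts = dynamic_histogram(signal)
codebook = codebook_from_histogram(centers, counts)
print(codebook)
```

Because old counts decay, the histogram follows the most recent part of the signal, which is the property the text attributes to a dynamic histogram as opposed to a static one.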
3 Filter Design for Power Quality Events
This section presents the simulation results of the filters designed for voltage sag, transient, and harmonic signals. Figure 2 shows the three-phase voltage signals during sag, harmonics, and transient. A histogram is plotted for each of these three PQ event signals, as shown in Fig. 3. These histograms reflect the variation in the voltage signals of the different PQ events. The voltage signals are then segmented into single-phase signals. The three single-phase signals for voltage sag are shown in Fig. 4, those for harmonics in Fig. 5, and those for the voltage transient in Fig. 6. Dynamic histograms of each phase are then plotted for all three events, as in Figs. 7, 8, and 9. The histograms provide information about the repetitive pattern in the signal. Each power quality event carries different information in its voltage signal, and the histograms show how the repetitive pattern of the voltage signal differs between events. The pattern obtained from the histogram is further compressed using vector quantization. This generates a codebook for each phase of every PQ event signal. The respective codebooks supply the signal information used for designing the matched filter. Matched filter coefficients for each phase are listed in Table 1 for voltage sag, harmonics, and transient signals. Thus, separate matched filters have been designed for detecting each type of PQ event. A matched filter gives a better response because of its similarity to the signal, and therefore a high SNR.
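The statement that similarity between filter and signal yields a high SNR is the classical matched-filter property: correlating the input with the expected pattern produces an output peak where the pattern occurs. The sketch below illustrates this with a hypothetical template and synthetic noise; the template values, noise level, and offset are assumptions for illustration and are unrelated to Table 1.

```python
import random

def matched_filter(x, template):
    """Convolve x with the time-reversed template (equivalently, cross-correlate x with it)."""
    h = template[::-1]
    n, m = len(x), len(h)
    y = []
    for i in range(n - m + 1):
        # sum_k h[k] * x[i + m - 1 - k] equals sum_j template[j] * x[i + j]
        y.append(sum(h[k] * x[i + m - 1 - k] for k in range(m)))
    return y

random.seed(0)
template = [0.2, 0.6, 1.0, 0.6, 0.2, -0.4, -0.8, -0.4]  # hypothetical pattern
offset = 120                                            # where the pattern is hidden
x = [random.gauss(0.0, 0.05) for _ in range(200)]       # low-level background noise
for j, t in enumerate(template):
    x[offset + j] += t

y = matched_filter(x, template)
peak = max(range(len(y)), key=lambda i: y[i])
print("pattern inserted at", offset, "- filter output peaks at", peak)
```

The output at the true location is close to the template energy, while elsewhere it stays near the noise floor, which is what makes the peak easy to detect.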
Fig. 2 Power quality three-phase voltage signals (i) sag (ii) harmonics and (iii) transient
Fig. 3 Dynamic histogram of three-phase voltage signals (i) sag (ii) harmonics and (iii) transient
Fig. 4 Single phase voltage signals during sag (i) Phase-A (ii) Phase-B and (iii) Phase-C
Fig. 5 Single phase voltage signals during harmonics (i) Phase-A (ii) Phase-B and (iii) Phase-C
Fig. 6 Single phase voltage signals during transient (i) Phase-A (ii) Phase-B and (iii) Phase-C
Fig. 7 Dynamic histogram of all three phases of voltage sag signal (i) Phase-A (ii) Phase-B and (iii) Phase-C
Fig. 8 Dynamic histogram of all three phases of harmonic signal (i) Phase-A (ii) Phase-B and (iii) Phase-C
Fig. 9 Dynamic histogram of all three phases of transient signal (i) Phase-A (ii) Phase-B and (iii) Phase-C
Further, stability analysis of the designed matched filters has been carried out with the help of the frequency response. Both the magnitude and phase responses of the newly constructed filters for the voltage sag signal are illustrated in Fig. 10. The magnitude response of these matched filters exhibits the characteristics of a low-pass filter. The phase response of all three filters is linear, which points toward stability of the designed filters. Similarly, Fig. 11 shows the magnitude and phase responses of the filters designed for all three phases of the voltage harmonics signal. Here too, the phase response is linear for all three phases. The magnitude and phase responses for the three phases of the voltage transient signal are shown in Fig. 12. In the case of the voltage transient also, the designed filters show a linear phase response. Thus, the frequency response exhibits stable behavior for all the filters designed for the voltage sag, voltage transient, and voltage harmonics signals. This work has therefore proposed the design of stable matched filters for the detection of PQ events in a renewable integrated system. Different filters have been designed for the different PQ events occurring in the system.
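The frequency-response check described above can be reproduced numerically. The sketch below assumes the matched filter is an FIR filter with symmetric taps; the taps shown are hypothetical and are not taken from Table 1. For such a filter the phase is linear because of the tap symmetry, while stability holds for any FIR filter because all of its poles lie at the origin.

```python
import cmath
import math

def freq_response(b, w):
    """H(e^{jw}) = sum_n b[n] * e^{-jwn} for an FIR filter with taps b."""
    return sum(bn * cmath.exp(-1j * w * n) for n, bn in enumerate(b))

def is_symmetric(b, tol=1e-12):
    """True when b[n] == b[N-1-n], the condition for linear phase used here."""
    return all(abs(b[i] - b[-1 - i]) <= tol for i in range(len(b) // 2))

# Hypothetical symmetric low-pass taps (assumption, not from Table 1).
b = [0.05, 0.12, 0.20, 0.26, 0.20, 0.12, 0.05]
delay = (len(b) - 1) / 2  # constant group delay in samples

for w in (0.0, 0.5, 1.0, 2.0, 3.0):
    H = freq_response(b, w)
    # Removing the linear-phase term e^{-jw*delay} should leave a purely real amplitude.
    A = H * cmath.exp(1j * w * delay)
    print(f"w={w:.1f}  |H|={abs(H):.4f}  imag(A)={A.imag:.1e}")
```

If the imaginary part of `A` stays at numerical zero for every frequency, the phase is exactly `-w * delay` (up to sign flips of the real amplitude), which is the linear behavior reported for Figs. 10, 11, and 12.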
Table 1 Filter coefficients for all three phases during voltage sag, harmonics, and transient Voltage sag
Harmonics
Phase-A Phase-B
Phase-C
0.0211
0.0271
0.4457
−0.0542 −0.3350 0.0545 −0.0690 0.2521
Transient
Phase-A Phase-B
Phase-C
0.0271
−0.2183 −0.1468 −0.0816 −0.1971
0.0670
−0.0545 −0.1218 0.1853
−0.1302 −0.0755 −0.1724 0.2984
0.2063
−0.3394 −0.0970 0.2061
0.0824
0.4350
0.2879
0.4592 0.3915
0.3185
−0.3017 0.1321
0.2090
0.2028
0.2916
0.1636 0.1818
−0.5304 −0.0937
−0.1631 0.2484
−0.3212 0.5411
−0.0157 0.0816
−0.1009 −0.2160 0.1576
−0.3527 1.0000
−0.2334 1.0000
1.0000
−1.8534 1.0000
−0.3527 1.0000
1.0000
−2.2314 1.0000
−0.4210 −0.2392 1.0000
−0.2774 −0.8099 −0.1696 −0.5212 0.2156 0.3185
0.7286
0.2090
0.2028
−0.3685 −0.5840 −0.2606 0.1152 0.4350
0.2063
−0.3394 −0.0970 0.2061
−0.0690 0.2521
0.2879
−0.1606 −0.4263 0.2152
0.0824
0.4592
−0.0542 −0.3350 0.0545
−0.0545 −0.1218 0.1853
0.0211
0.0271
0.4457
0.0271
0.0670
−0.0157 0.0816
0.5356 −0.4239
−0.3212 0.5411
−0.1890 0.0697
−0.2215 0.1296
−0.1302 −0.0755 −0.1724 0.2984
−2.7288
−0.5790 1.0000
−0.1009 −0.2160 0.1576 −0.1631 0.2484
−0.4239 0.5356
−0.5790 1.0000
−1.7820 1.0000
−0.2334 1.0000
−0.3017 0.1321 0.3915
0.3993
−0.1890 0.0697
−0.2774 −0.8099 −0.1696 −0.5212 0.2156 −0.4210 −0.2392 1.0000
Phase-C
−0.2111 −0.1725 0.1045
−0.2215 0.1296
−0.1606 −0.4263 0.2152
−0.3685 −0.5840 −0.2606 0.1152 0.7286
Phase-A Phase-B
0.1818
−0.0937 −0.5304
−0.2111 −0.1725 0.1045 0.2916
0.1636
0.3993
−0.2183 −0.1468 −0.0816 −0.1971
4 Conclusion
This work presents the design of matched filters for the detection of power quality disturbances, e.g., voltage sag, harmonics, and transients. The use of nonlinear electronic equipment and nonlinear loads in a renewable integrated system causes many power quality events and can damage the whole system. Early and accurate detection of these events is therefore needed, and it is feasible with matched filters, which can detect the events better than standard filters. The repetitive information hidden in the signal is extracted using dynamic histogram-based quantization and then used to design the matched filters. The designed matched filters have been found stable in all three cases of power quality events, as illustrated by their frequency response.
Fig. 10 Frequency responses (magnitude response with phase response) of filter for all three single-phase voltage signals during sag (i) Phase-A (ii) Phase-B and (iii) Phase-C
Fig. 11 Frequency responses (magnitude response with phase response) of filter for all three single-phase voltage signals during harmonics (i) Phase-A (ii) Phase-B and (iii) Phase-C
Fig. 12 Frequency responses (magnitude response with phase response) for all three single-phase voltage signals during transient (i) Phase-A (ii) Phase-B and (iii) Phase-C
References
1. R. Kapoor, M.K. Saini, Multiwavelet transform based classification of PQ events. Int. Trans. Elec. Energy Sys. 22(4), 518–532 (2011)
2. R. Kapoor, M.K. Saini, Hybrid demodulation concept and harmonic analysis for single/multiple power quality events detection and classification. Int. J. Elec. Pow. Energy Sys. 33(10), 1608–1622 (2011)
3. R. Kapoor, M.K. Saini, Detection and tracking of short duration variation of power system disturbances using modified potential function. Int. J. Elec. Pow. Energy Sys. 47, 394–401 (2013)
4. M.K. Saini, R. Kapoor, Classification of power quality events – a review. Int. J. Elec. Pow. Energy Sys. 43(1), 11–19 (2012)
5. G. Chang, S.-Y. Chu, H.-L. Wang, A new method of passive harmonic filter planning for controlling voltage distortion in a power system, in IEEE Power Engineering Society General Meeting, vol. 3 (San Francisco, CA, USA, 2005), p. 2333
6. H. Brantsæter, L. Kocewiak, A.R. Årdal, E. Tedeschi, Passive filter design and offshore wind turbine modelling for system level harmonic studies. Energy Proc. 80, 401–410 (2015)
7. H. Prasad, T.D. Sudhakar, Design of active filters to reduce harmonics for power quality improvement, in Int. Conf. on Computation of Power, Energy, Information and Communication (2015), pp. 336–344
8. S. Aravind, U. Vinatha, V.N. Jayasankar, Wind-solar grid connected renewable energy system with series active self-tuning filter, in International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (Chennai, 2016), pp. 1944–1948
9. M.K. Saini, R.K. Beniwal, Optimum fractionally delayed wavelet design for PQ event detection and classification. Int. Trans. Elec. Energy Sys. 27(10), 1–15 (2017)
10. P.S.D. Oliveira, M.A.A. Lima, A.S. Cerqueira, C.A. Duque, D.D. Ferreira, Harmonic extraction based on Independent Component Analysis and quadrature matched filters, in 17th International Conference on Harmonics and Quality of Power (ICHQP) (Belo Horizonte, 2016), pp. 344–349
11. Z. Shi, S. Yang, Z. Jiang, Target detection using difference measured function based matched filter for hyperspectral imagery. Optik Int. J. Light Elec. Opt. 124(17), 3017–3021 (2013)
12. J. Niu, T. Su, X. He, K. Zhu, H. Wu, Weak NQR signal detection based on generalized matched filter. Proc. Eng. 7, 377–382 (2010)
13. B. Boashash, G. Azemi, A review of time–frequency matched filter design with application to seizure detection in multichannel newborn EEG. Dig. Signal Proc. 28, 28–38 (2014)
14. A. Sinha, K. Singh, The design of a composite wavelet matched filter for face recognition using breeder genetic algorithm. Opt. Lasers Eng. 43(12), 1277–1291 (2005)
15. M.K. Saini, S. Jain, Designing of speaker based wavelet filter, in International Conference on Signal Processing and Communication (Noida, 2013), pp. 261–266
16. A. Gupta, S.D. Joshi, S. Prasad, A new approach for estimation of statistically matched wavelet. IEEE Trans. Signal Proc. 53(5), 1778–1779 (2005)
17. M.K. Saini, J.S. Saini, S. Sharma, Moment based wavelet filter design for fingerprint classification, in International Conference on Signal Processing and Communication (Noida, 2013), pp. 267–270
18. C.Y. Wan, L.S. Lee, Histogram-based quantization for robust and/or distributed speech recognition. IEEE Trans. Audio, Speech Lang. Proc. 16(4), 859–873 (2008)
Managing Human (Social) Capital in Medium to Large Companies Using Organizational Network Analysis: Monoplex Network Approach with the Application of Highly Interactive Visual Dashboards
Srečko Zajec, Leo Mrsic, and Robert Kopal
Abstract Human resource practitioners are increasingly interested in social networks as a way to strengthen relationships among employees, to improve the efficiency of project teams, to better understand information flow within the company, and, finally, to respond to modern demands in the management of a company's human capital. Identifying leaders in particular fields, understanding the dynamics of communities, and assessing their influence on overall organizational performance can be supported and guided by technology. This paper offers a framework for establishing an interactive measurement tool inside a medium to large organization (500+ employees). It conceptualizes different types of network theory metrics and uses case examples to identify outcomes typically associated with each division, team, or group of employees. Human capital evaluation is a challenging task and one of the key issues in modern organizations, which face a millennial workforce and constant turbulence in the talent supply. Network analysis is a scientific approach that uses various data models and advanced visualization to represent the structure of relationships between people, organizations, goals, interests, and other entities within an organization. In this article, we describe how network theory, applied in the form of organizational network analysis (ONA), can be used as an interdisciplinary concept for managing human capital in a large organization. We also present a framework for building an interactive HR support dashboard that follows core network theory concepts and adapts them into a management-friendly tool.
S. Zajec · L. Mrsic (B)
Algebra University College, Ilica 242, 10000 Zagreb, Croatia
e-mail: [email protected]
S. Zajec
e-mail: [email protected]
R. Kopal
Effectus University College, J. F. Kennedy Square 2, 10000 Zagreb, Croatia
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_81
Keywords Network theory · ONA · Organizational network analysis · Organizational interaction · Working interaction · Managing human capital · Data visualization
1 Introduction
Despite continuously pushing boundaries, the scientific community often encounters economic challenges when it comes to the monetization and business usefulness of scientific work. Reducing complex mathematical calculations to straightforward concepts that are useful in an ever-changing business environment can be a challenge in itself. Although Organizational Network Analysis is a common technique for optimizing the use of human resources, practice often shows a significant imbalance in how analytical results are understood by network science practitioners on the one hand and by a firm's senior (human) resource management on the other. The approach used in the following scenario tries to overcome this imbalance with a highly interactive visualization tool powered by a set of network metrics and calculations. The main research outcomes can be summarized as follows: (i) the visualization tool was highly interactive, with embedded filtering options for exploratory analysis, something that standard network analysis outputs aimed at management systematically lack; (ii) although various forms of employee interaction can be analyzed through a network approach, combining different sets of layers grouped from various questions gave even more meaningful insight into how the informal organization functions; (iii) overall employee satisfaction was also evaluated and later combined with the network analysis, which helped to identify key personas within the company and enabled a reactive approach to managing human resources; (iv) scalable granularity was implemented in the analysis, which helped to identify not only key people but key organizational units as well.
2 Data Collection
For the use-case scenario, a medium-to-large IT company was chosen (N = 431). Network mapping was carried out through a comprehensive questionnaire whose questions were designed to gather data that can later be transformed into dyadic form, with link weights (1, 0.7, 0.4, 0.2) assigned according to the priority of each person named in an answer [1]. The questions (N = 16) were organized into ten groups and yielded 16 monoplex networks of directed and undirected type. The groups corresponded to mission and vision, work interactions, social/grapevine, decision making, innovations, expertise, customer knowledge, general communication, feedback on work, and growth and advancement & future of organization. In total, 269 employees answered the questionnaire. A person who did not answer could still appear in a network, since someone else could have named them. This could not happen in the more general part of the questionnaire (Sense of Community Index, company image, etc.).
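The transformation of ranked answers into weighted dyads described above can be sketched as follows. The weights are the ones given in the text; the function name, the answer format, and the employee names are hypothetical, and the sketch assumes each respondent names at most four colleagues per question, in priority order.

```python
PRIORITY_WEIGHTS = (1.0, 0.7, 0.4, 0.2)  # weights from the text, by choice priority

def answers_to_edges(answers):
    """Map {respondent: [named colleagues, highest priority first]} to (src, dst, weight) dyads."""
    edges = []
    for src, choices in answers.items():
        for rank, dst in enumerate(choices[:len(PRIORITY_WEIGHTS)]):
            if dst != src:  # ignore self-nominations
                edges.append((src, dst, PRIORITY_WEIGHTS[rank]))
    return edges

# Hypothetical answers to one question; "luka" did not respond but is named by "ivan".
answers_q3 = {
    "ana": ["ivan", "maja"],
    "ivan": ["maja", "ana", "luka"],
}
edges = answers_to_edges(answers_q3)
nodes = {n for s, d, _ in edges for n in (s, d)}
print(edges)
print(sorted(nodes))
```

The example also shows the effect noted in the text: a non-respondent still becomes a node of the network once a colleague names them.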
3 Methods
Although a single (monoplex) network is commonly perceived as one layer, the semantics in this case are somewhat different: one layer (or, better, one perspective) is composed of several different networks (questions). Layers were also grouped into two distinct levels of interaction, working and organizational, where working interactions represent purely work-related matters and organizational interactions provide a macro overview from slightly different perspectives [2, 3].

Layer 1 (Working interactions)
Layer 1 Working interactions (with whom someone works most)
Q2: Who do you work with (whether it is working together, sharing information, or other resources) to get your job done?
Q3: From whom do you most often seek information related to your business?
Q4: To whom do you most often provide job-related information?

Layers 2–7 (Organizational interactions)
Layer 2 Communication (informal structure of the organization)
Q1: Who do you talk to in your business on topics such as mission, vision, and corporate values?
Q5: Who do you talk to at the company about what is happening at work and at the company in general?
Q6: Who do you talk to about non-business-related topics (sports, film, music, etc.)?
Q12: With whom in the company should you interact more to be more efficient in your business?
Q13: Who are you most involved with in your business?
Layer 3 Information network (who goes to whom regarding work-related matters)
Q3: From whom do you most often seek information related to your business?
Q4: To whom do you most often provide job-related information?
Q9: From whom in the company do you get expert advice regarding your business?
Q10: When you need information or advice, who is available to you with sufficient time?
Layer 4 Access (colleagues with enough time)
Q10: When you need information or advice, who is available to you with sufficient time?
Layer 5 Interpersonal (people who see someone's career path)
Q14: From whom in the company do you get feedback or confirmation that what you do is useful and has a positive impact on the company?
Q15: From whom in the company do you get advice on your career and professional development that helps you work more effectively?
Layer 6 Engaging in dialogue that helps people solve problems at work
Q2: Who do you work with (whether it is working together, sharing information, or other resources) to get your job done?
Q7: Who do you work with at the company when you need some input, suggestion, or comment before making a decision related to your business?
Q8: Who do you talk to at your company about ideas, new solutions, innovations, or the like that can help you do a better job?
Q11: Who do you talk to in the company about customer needs (external and internal) and market trends and requirements?
Q15: From whom in the company do you get advice on your career and professional development that helps you work more effectively?
Layer 7 Future leader (employees perceived as stars)
Q16: When you think about the people who work for the company today, who would you single out as the prototype of the person who will work for the company in five years?

Because of the relatively small number of nodes and the context of each question, basic network calculations were applied with caution. The analysis was undertaken at the node and network level using the widely known Pajek software, followed by further calculations and visualizations in Tableau. The results for each network were afterwards combined and summarized for every layer, taking into account the link direction of each question (especially important for authority and hub calculations). Calculations were made for each corresponding layer.
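The per-question calculation followed by a per-layer summary can be sketched as below. The study used Pajek and Tableau; this pure-Python version only illustrates the idea. The hub/authority computation is a standard HITS power iteration on a weighted directed network, while averaging the scores across the questions of a layer is an assumption, since the text does not state how results were combined. All names and edge values are hypothetical.

```python
from collections import defaultdict

def hits(edges, iters=100):
    """Hub and authority scores for a weighted directed edge list [(src, dst, weight), ...]."""
    nodes = sorted({n for s, d, _ in edges for n in (s, d)})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        auth = {n: 0.0 for n in nodes}
        for s, d, w in edges:
            auth[d] += w * hub[s]          # good authorities are pointed to by good hubs
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        new_hub = {n: 0.0 for n in nodes}
        for s, d, w in edges:
            new_hub[s] += w * auth[d]      # good hubs point to good authorities
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth

def layer_authority(question_networks):
    """Average authority per employee over all question networks of one layer (assumed rule)."""
    totals = defaultdict(float)
    for edges in question_networks:
        _, auth = hits(edges)
        for n, v in auth.items():
            totals[n] += v
    return {n: v / len(question_networks) for n, v in totals.items()}

# Hypothetical edge lists for two questions that belong to the same layer.
q_a = [("ana", "ivan", 1.0), ("luka", "ivan", 0.7)]
q_b = [("ana", "ivan", 1.0), ("ivan", "maja", 0.7)]
summary = layer_authority([q_a, q_b])
top = max(summary, key=summary.get)
print(top, round(summary[top], 3))
```

Link direction matters here because reversing an edge swaps the roles of hub and authority, which is why the text singles out these two measures.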
4 Results
To reach comprehensive conclusions, the analysis was split into two parts. The general part of the questionnaire dealt with the Social Community Index and a broader set of questions, from which the three most representative groups were formed: the already mentioned Social Community Index, a group of questions on the relationship between the firm and its employees, and the firm's overall prestige. Results were analyzed from best to worst overall score at the question level, the organizational unit level, and the employee level. The latter also included classical clustering to identify people with a good, average, or bad score (Fig. 1). The idea was to combine the data available on every employee, such as HR department records and files, overall status, and the questionnaire score, with the network analysis results. Findings about employees were used to identify dissatisfied people and to relate those insights to the network position and metrics of the corresponding employee (i.e., to determine a high or low potential for information control). If the satisfaction level deviated from a high network score, this would be a warning signal for further investigation. Classical network metrics such as closeness yielded a relatively small spread between the largest and smallest values because of the modest node count, which means that information can reach the network's periphery in a relatively small number of steps (and vice versa) [4]. Given the nature of each question and the corresponding networks with embedded link weights, Laplacian centrality showed the most consistent results when measuring overall node importance. Although a correlation with betweenness centrality was observed, Laplacian centrality was used as the most comprehensive metric because it takes link direction and weights into consideration [5] (Fig. 2).
Fig. 1 Social community index visualized on employee, organizational, and question level
Fig. 2 Working interactions layer shows laplace, closeness, betweenness, degree and authority/hub metrics per employee (x axis—values, y axis—organizational unit) + current project teams
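Laplacian centrality, as defined for weighted networks in [5], measures how much the Laplacian energy of the network drops when a node is removed. The sketch below implements the undirected weighted form, with energy equal to the sum of squared node strengths plus twice the sum of squared link weights. It does not handle link direction, and the example networks are hypothetical.

```python
def laplacian_energy(weights, nodes):
    """E_L = sum_i s_i^2 + 2 * sum_{i<j} w_ij^2, where s_i is the summed link weight at node i."""
    strength = {n: 0.0 for n in nodes}
    pair_sq = 0.0
    for (u, v), w in weights.items():
        if u in strength and v in strength:   # skip links that touch a removed node
            strength[u] += w
            strength[v] += w
            pair_sq += w * w
    return sum(s * s for s in strength.values()) + 2.0 * pair_sq

def laplacian_centrality(weights):
    """Energy drop caused by deleting each node; `weights` maps each undirected pair once."""
    nodes = {n for pair in weights for n in pair}
    full = laplacian_energy(weights, nodes)
    return {n: full - laplacian_energy(weights, nodes - {n}) for n in nodes}

# Unweighted path a - b - c: the middle node should score highest.
path = {("a", "b"): 1.0, ("b", "c"): 1.0}
path_scores = laplacian_centrality(path)

# Hypothetical weighted working-interaction links.
links = {("ana", "ivan"): 1.0, ("ivan", "maja"): 0.7, ("maja", "luka"): 0.4}
scores = laplacian_centrality(links)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Because the energy uses both node strengths and squared link weights, a node scores highly when it has strong links itself and sits next to other strongly connected nodes, which fits its use here as an overall importance measure.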
As an add-on, to broaden the perspective of the analysis, not only the formal hierarchy was observed but the project hierarchy as well. This was undertaken from two different angles. As Fig. 2 already shows, a basic filtering option (right side) made it possible to select only those employees working on a specific project and observe their metrics, which were calculated at the company level. Additionally, a two-mode network was taken into consideration, where one type of node represented projects and the other represented the employees working on those projects. This approach yielded extra insight because the network metrics for the involved employees were recalculated, this time at the project level [6, 7] (Fig. 3).
Fig. 3 Two mode network analysis with included network metrics for involved employees
Comparing node positions between different layers was the next obvious step. Perhaps the most notable difference was observed when comparing the working interactions layer with the communication layer. For a few dozen nodes, high values of the key metrics (Laplace, Betweenness, Closeness) were recorded in one layer (communication network) and proportionally reversed values in the other (working interactions). This showed not only that these employees have questionable social activity in working interactions, but also that they are quite active in informal communication, meaning that their meaningful use of working hours is possibly low (Figs. 4 and 5).
Fig. 4 Communication layer on node level (Laplace: y axis, Closeness: x axis, Betweenness: circle size)
Fig. 5 Working interactions layer on node level (Laplace: y axis, Closeness: x axis, Betweenness: circle size)
Gartner-like visualizations of the three key network metrics mentioned above gave a clear comparison between layers. Three levels of comparison were made. The upper part of the graph represents the total node count and the corresponding network metrics [9, 10]. The graph in the middle represents the same level of aggregation, but with a classical clustering method applied (five clusters in total) to help rank similar employees. In the last (bottom) visualization the level of aggregation is different, and each dot represents a single organizational unit in the company. The size of the dot corresponds to the sum of each metric over all employees in that organizational unit [8]. In this way, insight into social flow and social capabilities among units in the company hierarchy was clearly demonstrated. Finally, a comparison between the analyzed layers and the formal organization was accomplished.
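The project-level recalculation described above can be illustrated with a small sketch: a two-mode membership table (employees and projects) selects the members of one project, and a metric is then recomputed on the links among those members only. Weighted in-degree stands in for the richer metrics used in the study; the function names, employees, projects, and weights are all hypothetical.

```python
def project_subnetwork(edges, membership, project):
    """Keep only links whose two endpoints both work on `project`."""
    members = {e for e, projects in membership.items() if project in projects}
    return [(s, d, w) for s, d, w in edges if s in members and d in members]

def weighted_in_degree(edges):
    """Sum of incoming link weights per node; nodes with no incoming links get 0."""
    score = {}
    for s, d, w in edges:
        score[d] = score.get(d, 0.0) + w
        score.setdefault(s, 0.0)
    return score

# Hypothetical company-level links and two-mode membership (employee -> projects).
edges = [("ana", "ivan", 1.0), ("maja", "ivan", 0.7), ("luka", "ana", 0.4)]
membership = {
    "ana": {"crm"},
    "ivan": {"crm", "erp"},
    "maja": {"erp"},
    "luka": {"crm"},
}

company = weighted_in_degree(edges)
crm = weighted_in_degree(project_subnetwork(edges, membership, "crm"))
print("company level:", company)
print("project 'crm':", crm)
```

An employee's score can differ between the two levels, which is the extra insight the text attributes to recalculating metrics within a project.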
5 Conclusion
Although well-known and established analysis techniques were applied to every single (monoplex) network in this scenario, the collected data has considerable potential to be adapted for multilayer network analysis, which has been established as a new framework in network theory. In practice, this means that several different multilayer models can be considered for calculations at the node, subnetwork, or any other network level. Because research in this area is still ongoing (and available software is lacking), the currently established techniques are partly limited. Nevertheless, the outcomes and concepts introduced in this article would be a worthy basis for a comparison in which the traditional, well-known network science framework is challenged by a completely new approach with great potential to yield different computational results.
References
1. V. Krebs, Managing core competences of the organization: organizational network mapping. IHRIM J. XII(5), 393–410 (2008)
2. D. Hansen, B. Shneiderman, M. Smith, Analyzing Social Media Networks with NodeXL (Morgan Kaufmann, Kindle Edition, Burlington, MA, 2011)
3. W. De Nooy, A. Mrvar, V. Batagelj, Exploratory Social Network Analysis with Pajek (Structural Analysis in the Social Sciences) (Cambridge University Press, Kindle Edition, 2011)
4. U. Brandes, A Faster Algorithm for Betweenness Centrality (KOPS, Institutional Repository of the University of Konstanz, Konstanz, 2001)
5. X. Qi, E. Fuller, Q. Wu, Y. Wu, C.-Q. Zhang, Laplacian Centrality: A New Centrality Measure for Weighted Networks, vol. 194 (Elsevier, Information Sciences, New York, 2012), pp. 240–253
6. J. Kreutel, Augmenting network analysis with linked data for humanities research, in Digital Cultural Heritage, ed. by H. Kremers (Springer, Cham, 2020)
7. I. Salehudin, Social/Network Power: Applying Social Capital Concept to Explain the Behavioral Tendency of Individuals in Granting Favors within the Organizational Context (ICMBR 2009)
8. M. Kilduff, D. Krackhardt, Interpersonal Networks in Organizations: Cognition, Personality, Dynamics, and Culture (Structural Analysis in the Social Sciences) (Cambridge University Press, 2008)
9. P. Anklam, Net Work: A Practical Guide to Creating and Sustaining Networks at Work and in the World (Butterworth-Heinemann Publishing, 2007)
10. R.L. Cross, J.C. Singer, R.J. Thomas, Y. Silverstone, The Organizational Network Fieldbook: Best Practices, Techniques and Exercises to Drive Organizational Innovation and Performance (Wiley, 2010)
Gender and Age Estimation from Gait: A Review
Tawqeer Ul Islam, Lalit Kumar Awasthi, and Urvashi Garg
Abstract Gait, the way a person walks, is a behavioral biometric that can be used for human identification just like other biometrics such as fingerprints, hand geometry, eyes, face, and ear. Gait can also be used to estimate an individual's gender, age group, and age. Age estimation has relatively more extensive applications, for example in visual surveillance and monitoring and in access control to places such as shopping malls, airports, liquor shops, and public clubs. Over the years, many age estimation methods and techniques have been proposed, mainly in the field of computer vision. This paper presents a broad literature survey of gender and age estimation techniques based on gait. It gives a methodical analysis of age estimation algorithms, their classification, the quantified features and representations of gait they use, and their merits and demerits. We also present a separate analysis of the human gait databases that have been developed over the years in this research area. The study offers a holistic comparison and thereby suggests a way forward in this research area.
Keywords Gait · Gait energy image · Gait contour · Convolutional neural network · Age estimation
T. U. Islam (B) · L. K. Awasthi · U. Garg
Department of Computer Science and Engineering, Dr. B R Ambedkar National Institute of Technology, Jalandhar, PB 144011, India
e-mail: [email protected]
L. K. Awasthi
e-mail: [email protected]
U. Garg
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_82
T. U. Islam et al.
1 Introduction Gait is the manner in which an individual walks. It is determined by attributes such as posture, stride, arm swing, and step length. Because gait can be quantified using these attributes, it can serve as a biometric for identity authentication. Numerous studies have addressed this area [1–11], and much work is still going on. In the medical sciences, gait has largely been studied by orthopaedists to investigate and diagnose various orthopaedic ailments and diseases [12, 13]. This area largely focuses on sensor-based data collection [14–17], where multiple sensors are attached to the human subject at critical points on the body such as the knees, elbows, back, and shoulders. The sensors generate continuous signals, which are processed using signal-processing techniques to transform them into a human-interpretable form for inferential diagnosis. In bioinformatics, the approach and applications have been different and more diverse. Here the focus has been mainly on image- and video-based gait data. Most studies apply image processing techniques to images and videos (a video being a sequence of image frames). The applications lie in forensics and computer-assisted surveillance, including person identification, gender detection, and age estimation. This approach allows the gait of a subject to be captured at a distance and without their cooperation, which makes it suitable for areas such as access control to places with an age restriction. The gait databases for gender and age estimation do not store the gait images or videos of individual subjects in raw form; instead, the raw data are converted into simpler representations that retain the essential features that determine these attributes. The most commonly used descriptors are the Gait Image Silhouette, the Gait Image Contour, and the Gait Energy Image. The silhouette is the output of a basic filter: a black and white image in which white pixels represent the body of the subject and everything else is black. The contour is an even simpler representation in which only the boundary of the subject's body is outlined in white pixels and everything else is black. The Gait Energy Image (GEI) captures more features because it superimposes multiple Gait Image Silhouettes to form a richer representation. A GEI captures the arm swing and leg movement, both of which prove to be decisive in gender and age estimation (Figs. 1, 2 and 3).
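The three descriptors can be illustrated with a minimal sketch. It assumes that silhouettes are already extracted, size-normalized, and aligned binary images stored as nested lists (1 for the subject, 0 for background). The GEI is computed as the pixel-wise average of the silhouettes, which is the usual definition following Han and Bhanu [4]; the function names and the 4-neighbourhood boundary rule for the contour are illustrative assumptions, not taken from any of the surveyed papers.

```python
from typing import List

Frame = List[List[int]]  # binary silhouette: 1 = subject, 0 = background


def gait_energy_image(silhouettes: List[Frame]) -> List[List[float]]:
    """Average aligned binary silhouettes pixel-wise over one gait cycle."""
    if not silhouettes:
        raise ValueError("need at least one silhouette")
    rows, cols = len(silhouettes[0]), len(silhouettes[0][0])
    n = len(silhouettes)
    return [
        [sum(frame[r][c] for frame in silhouettes) / n for c in range(cols)]
        for r in range(rows)
    ]


def contour(silhouette: Frame) -> Frame:
    """Keep only subject pixels that touch the background (4-neighbourhood)."""
    rows, cols = len(silhouette), len(silhouette[0])

    def is_background(r: int, c: int) -> bool:
        # Positions outside the image are treated as background.
        return not (0 <= r < rows and 0 <= c < cols) or silhouette[r][c] == 0

    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if silhouette[r][c] == 1 and any(
                is_background(r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            ):
                out[r][c] = 1
    return out
```

Pixels that are white in every frame (for example the torso) receive a GEI value of 1, while pixels covered only during part of the cycle (swinging arms and legs) receive intermediate values, which is how the GEI encodes motion in a single image.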
Fig. 1 Gait image silhouette example
Fig. 2 Gait image contour example
Fig. 3 Gait energy image example
2 Methodology The standard gait image-based gender and age estimation procedure begins with the gait capturing phase. In this phase, a camera (or multiple cameras set at different angles) captures multiple images of the subject while he or she walks on a treadmill or a plain surface. Next, these raw images go through a filter that outputs a low-dimensional gait descriptor. This is followed by the feature extraction phase [18–20], in which certain parameters are measured over the descriptor. The output of this phase is a feature vector whose size equals the number of features selected; thus, every image is converted into a distinct feature vector. In the model creation phase, a model is built on these feature vectors to capture the relationship between the features and the dependent variables, i.e., gender and age. Figure 4 depicts the whole procedure. Recently, many studies have proposed techniques that do not require explicit feature extraction. Most of these approaches use deep learning models such as convolutional neural networks (CNNs). During the training phase, the CNN automatically learns the hidden patterns (features) from the input gait descriptor images. In such techniques, the procedure begins in the same way up to the creation of the gait descriptor, usually the GEI, and is followed by training the deep learning model on a subset of the gait descriptor dataset. Once training is complete, the model is evaluated on a separate test subset of the gait dataset, as depicted in Fig. 5.
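The manual-feature procedure described above (descriptor, feature vector, model) can be sketched as follows. This is only a toy illustration under stated assumptions: the feature (mean intensity of each descriptor row) and the model (a 1-nearest-neighbour age predictor) are hypothetical stand-ins chosen for brevity, not the features or models used in the surveyed studies.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # gait descriptor, e.g. a GEI with values in [0, 1]


def extract_features(descriptor: Image) -> List[float]:
    """Hypothetical feature vector: mean intensity of each descriptor row."""
    return [sum(row) / len(row) for row in descriptor]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_age(
    train: Sequence[Tuple[List[float], float]], features: List[float]
) -> float:
    """1-nearest-neighbour stand-in for the model creation phase.

    `train` holds (feature vector, age) pairs; the age of the closest
    training vector is returned as the estimate.
    """
    _, age = min(train, key=lambda item: squared_distance(item[0], features))
    return age
```

In a CNN-based approach, `extract_features` would disappear and the descriptor image itself would be the model input.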
Fig. 4 Age estimation from gait with manual feature extraction (flowchart: Gait Image Capturing → Conversion to Gait Descriptor → Feature Extraction → feature vector [x1, x2, x3, …, xn] → Model Creation → Model Performance Evaluation → Evaluation Results)
Fig. 5 Age estimation from gait without manual feature extraction (flowchart: Gait Image Capturing → Conversion to Gait Descriptor → Model Training → Trained Model → Model Testing → Evaluation Results)
3 Performance Evaluation In gender estimation and age group classification, the commonly used evaluation metric is accuracy, which is calculated as

$$\mathrm{Accuracy} = \frac{CC}{N} \times 100\%, \tag{1}$$

where $CC$ is the total number of correctly classified samples and $N$ is the total number of samples in the evaluation set.
In the age regression problem, in which the age in years is estimated, the three commonly used performance evaluation criteria are Mean Absolute Error (MAE), Standard Deviation (SD), and Cumulative Score (CS). MAE is calculated as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |t_i - p_i|, \tag{2}$$

where $N$ is the total number of samples in the evaluation set, and $t_i$ and $p_i$ are the true and estimated age for the $i$th sample. Standard Deviation is calculated as

$$\mathrm{SD} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(|t_i - p_i| - \mathrm{MAE}\right)^2}. \tag{3}$$

CS is calculated as

$$\mathrm{CS}(k) = \frac{N_k}{N} \times 100\%, \tag{4}$$

where $N_k$ is the number of samples for which the absolute error $|t_i - p_i|$ is less than $k$ years.
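The four criteria in Eqs. (1) to (4) can be computed directly from lists of true and predicted values. The following is a minimal sketch; the function names are illustrative, and the cumulative score uses a strict "less than k" comparison to match the definition given in the text.

```python
import math
from typing import List, Sequence


def accuracy(true_labels: Sequence, predicted_labels: Sequence) -> float:
    """Eq. (1): percentage of correctly classified samples."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


def absolute_errors(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> List[float]:
    """Per-sample absolute error |t_i - p_i|."""
    return [abs(t - p) for t, p in zip(true_ages, predicted_ages)]


def mean_absolute_error(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> float:
    """Eq. (2): mean of the absolute errors."""
    errors = absolute_errors(true_ages, predicted_ages)
    return sum(errors) / len(errors)


def standard_deviation(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> float:
    """Eq. (3): sample standard deviation of the absolute errors."""
    errors = absolute_errors(true_ages, predicted_ages)
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / (len(errors) - 1))


def cumulative_score(true_ages: Sequence[float], predicted_ages: Sequence[float], k: float) -> float:
    """Eq. (4): percentage of samples whose absolute error is below k years."""
    errors = absolute_errors(true_ages, predicted_ages)
    return 100.0 * sum(e < k for e in errors) / len(errors)
```

For example, with true ages [10, 20, 30, 40] and predictions [12, 20, 25, 41], the absolute errors are [2, 0, 5, 1], so the mean absolute error is 2.0 years and the cumulative score at k = 5 is 75%.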
4 Literature Survey The problem of age estimation from gait has recently been studied by various researchers, and multiple databases have been built to address it. These databases keep improving as more subjects are added, but they still have a few shortcomings, such as the small number of subjects at the far ends of the age spectrum. Table 1 provides a detailed description of these databases. The studies addressing this problem can be broadly classified into two categories: (a) Inertial Sensor-Based and (b) Computer Vision-Based.
a. Inertial Sensor-Based Approaches: Abreu et al. [24] showed in part that it is possible to estimate the age of an individual from gait analysis. They first generated the hip-knee cyclogram, a representation of cyclic patterns such as walking, and then extracted features from it. A total of 40 subjects, 20 men and 20 women, took part in the study. The authors were able to distinguish between two age groups, young and elderly, but could not determine gender. Another study by Callisaya et al. [25] revealed that the gender of an individual modifies the relationship between gait and age. They studied 223 subjects aged between 60 and 86 years and found a significant association between gender and different gait variables. Nigg et al. [26] studied the correlation between gait and age by measuring selected kinematic variables related to the ankle and knee joints
Table 1 Sensor-based gait databases
Database Name | No. of Subjects | Time (Seconds) | No. of Sensors | Capturing Environment | Activity | Age Information Available | Description
HuGaDB [21] | 18 | 11544 | 6 (Inertial Sensors) | Indoors/Outdoors | Walk | No | Walking and Turning at various speeds on a flat surface
HuGaDB | 18 | 1218 | 6 | Indoors/Outdoors | Running | No | Running at Various Speeds
MAREA (Indoor) [22] | 11 | 600 | 4 (Accelerometers) | Indoors (Flat Space and Treadmill) | Walk and Run | Yes | Movement Analysis in Real-World Environments Using Accelerometers
MAREA (Street) | 9 | 360 | 4 (Accelerometers) | | Walk and Run | Yes | Movement Analysis in Real-World Environments Using Accelerometers
OU-ISIR (Inertial Sensor) [23] | 774 | 17 | 3 | Indoors | Walk | Yes | Inertial Sensor-based Gait Database
Table 2 Sensor-based approaches
Authors | Proposed technique | No. of subjects | Output type
Abreu et al. | Used hip-knee cyclogram representation | 40 | Age Groups
Callisaya et al. | Used GAITRite walkway to record the gait speed, step length, cadence, step width, etc. | 223 | Gender
Nigg et al. | Used correlation between kinematic variables related to the ankle and knee joints and reaction forces exerted by the ground beneath the feet | 118 | Age Groups
and reaction forces exerted by the ground beneath the feet. The study was conducted on 60 male and 58 female subjects (Table 2).
b. Computer Vision-Based Approaches: The earlier computer vision-based studies mostly relied on manual feature extraction techniques. Zhang et al. [27] used the Gait Image Contour representation of the gait for feature extraction. They first converted the contour image into a lower-dimensional representation called the Frame to Exemplar (FED) distance, which measures the distance from the contour centroid to multiple points on the contour boundary, and then built a Hidden Markov Model (HMM) on it. The database was a self-created gait database of 14 subjects of varying ages. The model performed a binary classification into young and old classes and reached a correct classification rate (CCR) of up to 83.33%.
Makihara et al. [28] used a Gaussian regression model in conjunction with the Gait Energy Image descriptor for age estimation. The model was trained and tested on a self-created whole-generation gait database containing 1728 subjects aged between 2 and 94, and achieved a Mean Absolute Error (MAE) of 8.2 years. Mansouri Nabila et al. [29] proposed a new gait descriptor that captures both the Spatiotemporal Longitudinal and the Spatiotemporal Transverse projections of the silhouette. They used a Support Vector Machine (SVM) model on the OU-ISIR database of around 4000 subjects and achieved a precision of around 74%. Jiwen Lu et al. [30] used a set of Gabor features, such as the Gabor magnitude and the phase information of the gait sequence, and applied a feature fusion technique to improve performance. The results were demonstrated on the USF database, which consists of 1870 sequences from 122 subjects. The model achieved a best MAE of 5.42 years.
Xiang Li et al. [31] proposed an age group-dependent method for age estimation. They first assigned the subjects to separate age groups using a directed acyclic graph with SVM as the classification model. The next step is a support vector regression with a Gaussian kernel combined with a manifold learning technique. The model was evaluated on the OULP Age dataset comprising 63,846 subjects (31,093 males and 32,753 females) aged between 2 and 90. It achieved an average classification accuracy of 72.23% in age group estimation and an MAE of 6.78 years in age estimation. Hu et al. [32] introduced a maximization of mutual information technique, using Gabor filters for feature extraction and the Bayes rule based on HMM for classification. Again, this was a binary classifier of age into young and old. The gender classification results were evaluated on the CASIA(B) and IRIP datasets, and the age classification on the database used by Zhang et al. The results were better than those of Zhang et al., who also used an HMM, as the CCR reached up to 85.71%.
Jiwen Lu et al. [33] studied gender estimation and identity recognition in a different setup, under the more realistic assumption that subjects walk in an arbitrary direction. They proposed a distance metric learning method based on sparse reconstruction, which simultaneously minimizes the sparse reconstruction error within the same class and maximizes the error between different classes. They developed a new dataset, the ADSC-AWD dataset, using the Microsoft Kinect depth sensor, and achieved an accuracy of 93.7% in gender estimation.
A recent study by Haiping Zhu et al. [34] proposed a deep learning approach that uses three local convolutional neural networks and a global CNN, whose outputs are fed into an ordinal distribution regression model for age estimation. The three local CNNs are trained on three separate parts of the GEI (head, body, and feet), and the global CNN is trained on the whole GEI. The model was trained and tested on the OULP Age dataset and achieved an MAE of 5.24 years and a CS(k = 5) of 69.95%. Deep learning models such as CNNs eliminate the need to explicitly extract features from the gait descriptor, as they automatically learn and capture its distinctive patterns. Another recent study by Sakata et al. [35] proposed a state-of-the-art CNN-based method for gender and age estimation. They used a sequence of CNNs for gender estimation, age group estimation, and age regression: the GEI first goes through a CNN that predicts gender, and then passes sequentially through two further CNNs that predict the age group and the age, respectively. The model was trained and tested on the OULP Age dataset, and the results were promising, with an MAE of 5.84 years. The model performs relatively worse for elderly subjects, a behavior caused by the smaller number of subjects in the old-age category of the OULP database. Murat Berksan [36] studied different CNN architectures over the gait silhouette average representation and obtained a gender estimation accuracy of 97.45% and an MAE of 5.74 years in age estimation (Tables 3 and 4).
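As an illustration of the kind of manual feature used in the earlier studies, the centroid-to-boundary distances described for the contour representation of Zhang et al. can be sketched as follows. This is not their exact FED computation; it is a minimal sketch that assumes the contour is given as a list of (row, column) points ordered along the boundary, and the sampling scheme and function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (row, column) of one contour pixel


def contour_centroid(points: List[Point]) -> Point:
    """Mean position of the contour points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def centroid_distances(points: List[Point], n_samples: int) -> List[float]:
    """Distances from the centroid to n_samples contour points.

    Assumes `points` is ordered along the boundary, so that picking points
    at regular index intervals spreads the samples around the contour.
    """
    if not 0 < n_samples <= len(points):
        raise ValueError("n_samples must be between 1 and len(points)")
    cr, cc = contour_centroid(points)
    step = len(points) / n_samples
    sampled = [points[int(i * step)] for i in range(n_samples)]
    return [math.hypot(r - cr, c - cc) for r, c in sampled]
```

The result is a fixed-length vector per frame, which is the form a sequence model such as an HMM expects as its observations.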
5 Discussion Although sensor-based studies are relatively few and mainly confined to the medical sciences, they played a vital role early on by establishing the relationships between various gait features and the age and gender of a person. Manual feature extraction techniques are simpler than automatic ones, but they capture fewer independent features from a GEI. Among the computer vision-based approaches, this review reveals a lack of data for certain age groups, such as the elderly and children, which ultimately degrades classification accuracy. Large gait databases like OU-ISIR made it possible to use deep learning models like CNN
Table 3 Camera-based gait databases
Database name | No. of subjects | Sequences | Viewpoints | Capturing environment | Activity | Age information available | Description
USF [37] | 122 | 1877 | 10 | Outdoors | Walk | Yes | Video-based gait dataset
OU-ISIR Treadmill A [38] | 34 | 612 | 1 | Indoors | Walk | Yes | The OU-ISIR Gait Database, Treadmill Dataset
OU-ISIR Treadmill B | 68 | 2764 | 1 | Indoors | Walk | Yes | The OU-ISIR Gait Database, Treadmill Dataset
OU-ISIR Treadmill C | 200 | 200 | 25 | Indoors | Walk | Yes | The OU-ISIR Gait Database, Treadmill Dataset
OU-ISIR Treadmill D | 185 | 370 | 1 | Indoors | Walk | Yes | The OU-ISIR Gait Database, Treadmill Dataset
OU-ISIR (MVLP) | 10307 | 28 | 14 | Indoors | Walk | Yes | The OU-ISIR Multi-view Large population Database
OU-ISIR (GaitST 1) | 26 | 60 | 1 | Indoors | Walk (Accelerated) | No | The OU-ISIR Gait Speed Transition Database
OU-ISIR (GaitST 2) | 178 | 60 | 1 | Indoors | Walk (Accelerated) | No | The OU-ISIR Gait Speed Transition Database
OU-ISIR (OULP-Bag) | 65528 | 25 | 1 | Indoors | Walk | Yes | The OU-ISIR Large Population Gait Database with real life carried object (UL-LP Bag)
OU-ISIR (OULP Age) | 63,846 | 25 | 1 | Indoors | Walk | Yes |
CASIA A [39] | 20 | 12 | 3 | Indoors | Walk | No | Standard Dataset
CASIA B [40] | 124 | 1240 | 11 | Indoors | Walk | No | Multi-View Dataset
CASIA C [41] | 153 | 1530 | 1 | Outdoors | Walk and Fast Walk | No | Infrared Gait Dataset
CASIA D [42] | 88 | NA | 4 | | Walk and Fast Walk | No | Gait and its Corresponding Footprint Dataset
Table 4 Computer vision-based approaches
Authors | Proposed method | Dataset used | Output type | Performance
Zhang et al. | Frame to Exemplar Distance (FED) and Hidden Markov Model | Self-Created | Binary Classification (Young and Old) | 83.33%
Makihara et al. | Gaussian Regression | Whole Generation Gait Database | Age Regression | 8.2 Years
Mansouri Nabila et al. | Spatiotemporal Longitudinal and Spatiotemporal Transverse with Support Vector Machine | OU-ISIR | Binary Classification | 74%
Jiwen Lu et al. | Gabor Features and Feature Fusion Technique | USF | Age Regression | 5.42 Years
Xiang Li et al. | Manifold Learning Technique with SVM and Gaussian kernel | OULP Age | Age Regression | 6.78 Years
M. Hu et al. | Gabor filter for feature extraction and Bayes Rule based on HMM | CASIA(B) | Binary Classification | 86%
Jiwen Lu et al. | Gender recognition from gait with arbitrary walking directions | ADSC-AWD | Gender and Identity Recognition | 93.7%
Haiping Zhu et al. | Ordinal Distribution Regression using CNN | OULP Age | Gender and Age Regression | 5.24 Years
A. Sakata et al. | Multi-stage CNN | OULP Age | Gender and Age Regression | 5.84 Years
Murat Berksan | CNN over Gait Silhouette Average | OULP Age | Gender and Age Regression | 5.74 Years
but because GEIs contain relatively few intrinsic features, the size of the neural network is restricted (Figs. 6 and 7).
Fig. 6 Comparison of gender classification accuracy
Fig. 7 Comparison of MAE in age estimation (bar chart; y-axis: MAE in years; x-axis: authors; plotted values: 8.2, 6.78, 5.42, 5.24, 5.84, 5.74)
6 Conclusion and Future Scope This study provides a holistic view of the problem of age estimation from gait. Although the problem is relatively new, commendable work has already been done in this research field. From this study, we conclude that deep learning-based approaches outperform the other approaches. More work needs to be done on the problem using the ever-improving deep learning models. We can also expect gait databases to grow, which should in turn improve the age estimation accuracy of deep learning models. In the future, we can expect deep learning models that are trained directly on video data to further improve accuracy.
References
1. M.S. Nixon, J.N. Carter, Automatic recognition by gait. Proc. IEEE. 94(11), 2013–2024, November (2006)
2. N.V. Boulgouris, D. Hatzinakos, K.N. Plataniotis, Gait recognition: a challenging signal processing technology for biometric identification. IEEE Signal Proc. Mag. 22(6), 78–90, November (2005)
3. S. Sarkar, P.J. Phillips, Z. Liu, I.R. Vega, P. Grother, K.W. Bowyer, The humanID gait challenge problem: Data sets, performance, and analysis. IEEE Trans. Patt. Anal. Mach. Intell. 27(2), 162–177, February (2005)
4. J. Han, B. Bhanu, Individual recognition using gait energy image. IEEE Trans. Patt. Anal. Mach. Intell. 28(2), 316–322, February (2006)
5. L. Wang, T. Tan, H. Ning, W. Hu, Silhouette analysis-based gait recognition for human identification. IEEE Trans. Pattern Anal. Mach. Intell. 25(12), 1505–1518, December (2003)
6. J. Lu, E. Zhang, Gait recognition for human identification based on ICA and fuzzy SVM through multiple views fusion. Patt. Recognit. Lett. 28(16), 2401–2411 (2007)
7. G.V. Veres, M.S. Nixon, J.N. Carter, Modelling the time-variant covariates for gait recognition, in Proc. Audio- and Video-Based Biometric Person Authentication (2005), pp. 597–606
8. S. Argyropoulos, S.D. Tzovaras, D. Ioannidis, M.G. Strintzis, N.V. Boulgourisa, Z.X. Chi, A channel coding approach for human authentication from gait sequences. IEEE Trans. Inf. Forensics Sec. 4(3), 428–440, September (2009)
9. C. Chen, J. Liang, H. Zhao, H. Hu, J. Tian, Factorial HMM and parallel HMM for gait recognition. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 39(1), 114–123, January (2009)
10. D. Ioannidis, D. Tzovaras, I.G. Damousis, S. Argyropoulos, K. Moustakas, Gait recognition using compact feature extraction transform and depth information. IEEE Trans. Inf. Forensics Sec. 2(3), 623–630, September (2007)
11. N. Takemura, Y. Makihara, D. Muramatsu, T. Echigo, Y. Yagi, On input/output architectures for convolutional neural network-based cross-view gait recognition. IEEE Trans. Circ. Syst. Video Technol. PP(99), 1–1 (2017)
12. D. Janssen, W.I. Schöllhorn, K.M. Newell, J.M. Jäger, F. Rost, K. Vehof, Diagnosing fatigue in gait patterns by support vector machines and self-organizing maps. Hum. Mov. Sci. 30(5), 966–975 (2011). https://doi.org/10.1016/j.humov.2010.08
13. R. Liao, Y. Makihara, D. Muramatsu, I. Mitsugami, Y. Yagi, K. Yoshiyama, H. Kazui, M. Takeda, Video-based gait analysis in cerebrospinal fluid tap test for idiopathic normal pressure hydrocephalus patients (in Japanese), in The 15th Annual Meeting of the Japanese Society of NPH (Suita, Japan, 2014)
14. O. Tirosh, R. Baker, J. McGinley, GaitaBase: web-based repository system for gait analysis. Comput. Bio. Med. 40(2), 201–207 (2010)
15. T.T. Ngo, Y. Makihara, H. Nagahara, Y. Mukaigawa, Y. Yagi, The largest inertial sensor-based gait database and performance evaluation of gait-based personal authentication. Patt. Recog. 47(1), 228–237 (2014)
16. D.T.P. Fong, Y.Y. Chan, The use of wearable inertial motion sensors in human lower limb biomechanics studies: a systematic review. Sensors 10, 11556–11565 (2010)
17. W. Tao, T. Liu, R. Zheng, H. Feng, Gait analysis using wearable sensors. Sensors 12, 2255–2283 (2012)
18. X. Li, S.J. Maybank, S. Yan, D. Tao, D. Xu, Gait components and their application to gender recognition. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 38(2), 145–155, February (2008)
19. K. Bashir, T. Xiang, S. Gong, Feature selection for gait recognition without subject cooperation, in Proc. British Machine Vision Conf. (Leeds, UK, September, 2008)
20. K. Bashir, T. Xiang, S. Gong, Feature selection on gait energy image for human identification, in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (2008), pp. 985–988
21. R. Chereshnev, A. Kertész-Farkas, Hugadb: human gait database for activity recognition from wearable inertial sensor networks, in International Conference on Analysis of Images, Social Networks and Texts (Springer, Cham, 2017, July), pp. 131–141
22. S. Khandelwal, N. Wickström, Evaluation of the performance of accelerometer-based gait event detection algorithms in different real-world scenarios using the MAREA gait database. Gait & Posture 51, 84–90, ISSN 0966-6362, January (2017)
23. C. Xu, Y. Makihara, G. Ogi, X. Li, Y. Yagi, J. Lu, The OU-ISIR gait database comprising the large population dataset with age and performance evaluation of age estimation. IPSJ Trans. Comput. Vis. Appl. 9(1), 24 (2017)
24. R.S. Abreu, E.L.M. Naves, T.B. Caparelli, D.T.G. Mariano, V.C. Dionísio, Is it possible to identify the gender and age group of adults from gait analysis with hip-knee cyclograms? Revista Brasileira de Engenharia Biomédica 30(3), 274–280 (2014)
25. M.L. Callisaya, L. Blizzard, M.D. Schmidt, J.L. McGinley, V.K. Srikanth, Sex modifies the relationship between age and gait: a population-based study of older adults. J. Gerontol. Series A: Bio. Sci. Med. Sci. 63(2), 165–170 (2008)
26. B.M. Nigg, V. Fisher, J.L. Ronsky, Gait characteristics as a function of age and gender. Gait & Posture 2(4), 213–220 (1994)
27. D. Zhang, Y. Wang, B. Bhanu, Age classification based on gait using HMM, in Int. Conf. Computer Society, Istanbul-Turkey (August 2010), pp. 3834–3837
28. Y. Makihara, M. Okumura, H. Iwama, Y. Yagi, Gait-based age estimation using a whole-generation gait database, in 2011 International Joint Conference on Biometrics, IJCB 2011. 2011.6117531 (2011)
29. M. Nabila, A.I. Mohammed, B.J. Yousra, Gait-based human age classification using a silhouette model. IET Biometr. 7(2), 116–124 (2017)
30. J. Lu, Y.P. Tan, Gait-based human age estimation. IEEE Trans. Inf. Forensics Sec. 5(4), 761–770 (2010)
31. X. Li, Y. Makihara, C. Xu, Y. Yagi, M. Ren, Gait-based human age estimation using age group-dependent manifold learning and regression. Multi. Tools Appl. 77(21), 28333–28354 (2018)
32. M. Hu, Y. Wang, Z. Zhang, Maximisation of mutual information for gait-based soft biometric classification using Gabor features. IET Biomet. 1(1), 55–62 (2012)
33. J. Lu, G. Wang, P. Moulin, Human identity and gender recognition from gait sequences with arbitrary walking directions. IEEE Trans. Inf. Forensics Sec. 9(1), 51–61 (2013)
34. H. Zhu, Y. Zhang, G. Li, J. Zhang, H. Shan, Ordinal Distribution Regression for Gait-based Age Estimation (2019). arXiv:1905.11005
35. A. Sakata, N. Takemura, Y. Yagi, Gait-based age estimation using multi-stage convolutional neural network. IPSJ Trans. Comput. Vis. Appl. 11(1), 4 (2019)
36. M. Berksan, Gender recognition and age estimation based on human gait (Master's thesis, Başkent Üniversitesi Fen Bilimleri Enstitüsü) (2019)
37. S. Sarkar, P.J. Phillips, Z. Liu, I.R. Vega, P. Grother, K.W. Bowyer, The human id gait challenge problem: data sets, performance, and analysis. IEEE Trans. Patt. Anal. Mach. Int. 27(2), 162–177 (2005)
38. T.T. Ngo, Y. Makihara, H. Nagahara, Y. Mukaigawa, Y. Yagi, The largest inertial sensor-based gait database and performance evaluation of gait-based personal authentication. Patt. Recog. 47(1), 228–237 (2014)
39. L. Wang, T. Tan, H. Ning, W. Hu, Silhouette analysis-based gait recognition for human identification. IEEE Trans. Patt. Anal. Mach. Int. (PAMI) 25(12), 1505–1518 (2003)
40. S. Zheng, J. Zhang, K. Huang, R. He, T. Tan, Robust view transformation model for gait recognition, in International Conference on Image Processing (ICIP) (Brussels, Belgium, 2011)
41. D. Tan, K. Huang, S. Yu, T. Tan, Efficient night gait recognition based on template matching, in Proc. of the 18th International Conference on Pattern Recognition (ICPR06) (Hong Kong, China, August 2006)
42. S. Zheng, K. Huang, T. Tan, D. Tao, A cascade fusion scheme for gait and cumulative foot pressure image recognition. Patt. Recog. 45(10), 3603–3610 (2012)
Parkinson’s Disease Detection Through Visual Deep Learning Vasudev Awatramani and Deepak Gupta
Abstract Parkinson's Disease (PD) is a neurodegenerative disorder that affects numerous people and tends to become more acute as time progresses. From its early stages, patients show several symptoms such as micrographia, tremors, and stiffness. If the disease is identified early, treatment is much more effective. This work aims to build an automated deep learning-based system to determine whether a given individual is suffering from Parkinson's. For this purpose, we use images of written exams taken by subjects (from the HandPD dataset, consisting of Spiral and Meander templates). Physiological datasets are challenging to work with because of typical obstacles such as insufficient data and disproportionate class representation. The proposed methodology employs techniques based on Transfer Learning to address these problems. Through these procedures, an accuracy of 98.24% on the Spiral dataset and 98.11% on the Meander dataset was achieved. Keywords Computer vision · Deep learning · Transfer learning · Biomedical image analysis · Class imbalance
V. Awatramani (B) · D. Gupta Maharaja Agrasen Institute of Technology, Rohini, New Delhi, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_83
1 Introduction Parkinson's disease is a progressive, persistent neurological ailment that affects the substantia nigra [1] of the human brain, repressing the production of dopamine. Dopamine serves as a neurotransmitter involved in physical mobility. Parkinson's hampers the ability of the brain to produce sufficient dopamine; hence, typical motor symptoms of this condition include bradykinesia, muscle rigidity, and tremor or shaking in the limbs. One of the most common indications is a change in handwriting, as sufferers find writing difficult and exhibit micrographia. Patients suffering from Parkinson's show rigidity and diminished amplitudes in their handwriting flow from the early onset of the disease. Therefore, handwritten evaluation mechanisms can be used readily and economically in the early stages. In such assessments, subjects are asked to draw figures over predefined templates such as spirals and meanders. An expert then examines these drawings and decides whether the subject suffers from Parkinson's. To reach a decisive outcome, individuals draw multiple figures of the same template. Analyzing each of these samples is time-consuming and complicated, as interpretations can be subjective and vary from expert to expert. Hence, this paper proposes a deep learning-based system that automates the examination process. This work studies images of handwritten figures from the HandPD dataset. Both Parkinsonian patients and healthy subjects drew these figures, which were labeled accordingly. To classify these samples, the study employs deep CNN architectures and Transfer Learning. Moreover, the approach also addresses the problem of an unbalanced dataset through both established and non-conventional methods.
2 Related Work This section outlines some of the state-of-the-art works related to the HandPD dataset. Pereira et al. [2], who contributed to the formulation of the dataset itself, developed a two-stage classification methodology in their earlier work. The first stage extracted features by measuring the distance between the template and the figure drawn by the subject at specific points. These features were normalized and processed for the next stage. In the second stage, the extracted attributes were fed to traditional machine learning algorithms such as Optimum-Path Forest (OPF), Naive Bayes (NB), and Support Vector Machine (SVM). They achieved a maximum accuracy of 65.88% for the Spiral dataset through NB, and 66.37% through SVM. Pereira et al. also employed deep learning techniques in their later works and obtained better performance. In one study [3], they processed signal data from the biometric pen used by subjects to draw the exam figures. This data represented the handwriting dynamics of subjects as a time series, and a CNN-based network was used for feature extraction. It resulted in an improved accuracy of 84.42% for the Spiral and 83.77% for the Meander dataset. Lastly, in another study with Residual Networks, Passos et al. [4] obtained the highest recorded accuracy on the HandPD dataset. They preprocessed images as inputs to a pre-trained ResNet-50 and applied PCA dimensionality reduction to the network feature vector to obtain representations of the samples. These representations were fed to an OPF model, which achieved mean accuracies of 96.31% and 96.71% on the Meander and Spiral datasets, respectively. Apart from deep learning, evolutionary algorithms have also been adopted to extract informative representations from samples in the HandPD dataset. Pereira et al. [5] applied the Particle Swarm Optimization Algorithm, Bat Algorithm, and Firefly Algorithm to CNN features extracted from the images and obtained accuracies of 89.62%, 90.38%, and 83.79%, respectively (for the Spiral dataset; the Meander dataset showed marginal variations from these results). Gupta et al. [6] used the Optimized Cuttlefish Algorithm over the representations computed by Pereira et al. in their original work [2] and achieved an improvement of approximately 22%. Along similar lines, another work [7] employed a modified version of the Grey Wolf Optimization algorithm and recorded 92.41% and 93.02% accuracy on the Spiral and Meander datasets, respectively. There have also been many extensive studies on the detection of Parkinson's through other modalities, such as speech processing using ANNs [8] and DBNs [9], as well as vocal [10] and gait analysis [11]. Since our work is primarily focused on studying generalizable methods over visual data, these works are not discussed here.
3 Methodology 3.1 Class Imbalance Resolution The study uses the HandPD [2] and NewHandPD [3] datasets, both of which contain handwritten samples from two groups of subjects: the Control/Healthy group and the Patient group, the latter being affected by Parkinson’s disease. The samples were collected at Botucatu Medical School, São Paulo State University, Brazil. In the HandPD dataset, the subjects consisted of 92 individuals, of whom 74 were affected by Parkinson’s disease and the remaining 18 were free from the ailment. As a result, 368 images (covering both Spiral and Meander) were assembled, 72 belonging to the Healthy group and 296 to the Patient group. The NewHandPD dataset is more balanced than the HandPD dataset in terms of both the number of individuals and the number of samples in each category. It includes 264 images from a balanced set of subjects consisting of 35 Healthy individuals and 31 Patients.
Fig. 1 Class distribution of the original (old) HandPD dataset
From mere observation (Fig. 1), one can infer that the HandPD dataset suffers from the Class Imbalance Problem [12]. This problem arises when one label is underrepresented relative to the other labels. It is most prevalent in two-class data, where it usually takes the form of the number of positive instances being
V. Awatramani and D. Gupta
outnumbered by the number of negative instances. The majority of machine learning algorithms function best when the class distribution is even, i.e., when the number of samples belonging to each category is roughly equal. When this condition is not met, the classifier develops an inherent bias towards the majority category. As an illustration, consider a dataset of financial transactions used to determine whether a given transaction is fraudulent or genuine. Fraudulent transactions would be far fewer than genuine ones, say ten fraudulent to 10,000 genuine samples. A classifier trained on this dataset may achieve 99.9% training accuracy by merely disregarding the fraudulent category and predicting genuine as the label for every input instance. Hence, dataset imbalance is a significant challenge and has a considerable influence, especially in binary classification. To address the problem of Class Imbalance, the methodology uses conventional Upsampling as a solution. Upsampling involves duplicating samples of the minority class so that it is proportionately represented and has a similar effect on the classifier during training as the majority class does. The same intuition can be applied in reverse, i.e., by removing some samples of the majority class; this is known as Undersampling. However, deciding which instances to exclude (or even removing them at random) could prove counterproductive, as neural networks are data-hungry algorithms. Apart from sampling-based strategies, some strategies tune or regularize the cost function of the machine learning model. However, these can be tedious and result in cumbersome tweaking of hyperparameters. Another alternative is Synthetic Sampling, which involves generating synthetic instances of the minority class.
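The fraud illustration and the upsampling idea above can be checked with a few lines of Python. This is a minimal sketch rather than the study's code: the function names are hypothetical, the integer indices merely stand in for images, and only the class counts (10 vs 10,000 transactions; 72 Healthy vs 296 Patient images) come from the text.

```python
import random
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def upsample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes
    are as large as the largest one (plain random upsampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

# Fraud illustration from the text: 10 fraudulent vs 10,000 genuine samples.
fraud_labels = ["fraud"] * 10 + ["genuine"] * 10000
baseline = majority_baseline_accuracy(fraud_labels)  # 10000 / 10010

# HandPD image counts from the text: 72 Healthy vs 296 Patient images.
handpd_labels = ["healthy"] * 72 + ["patient"] * 296
handpd_samples = list(range(len(handpd_labels)))  # placeholder indices, not real images
balanced_samples, balanced_labels = upsample_minority(handpd_samples, handpd_labels)
balanced_counts = Counter(balanced_labels)
```

The baseline shows why accuracy alone is misleading on imbalanced data: always predicting the majority label already scores about 99.9% on the fraud example, while detecting no fraud at all.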
Synthetic Minority Over-sampling Technique (SMOTE) [13] is a popular option for synthetic sampling. The technique considers each minority class sample and introduces synthetic examples along the line segments joining it to any or all of its k-nearest neighbors belonging to the same class. Depending on the number of instances required, neighbors from the k-nearest neighbors are randomly chosen. Though notably useful for low-dimensional features, SMOTE has certain shortcomings for high-dimensional data such as images. These limitations include the intensive computation required to determine the k-nearest neighbors. Dimensionality reduction is a possible remedy, but it may result in loss of information. To generate viable synthetic samples, the study explores some deep learning-based procedures and identifies Wasserstein Generative Adversarial Networks (WGANs) [14] as a fitting solution. Similar to Ian Goodfellow’s Generative Adversarial Networks (GANs), WGANs consist of a generator and a critic (discriminator). However, unlike in the original GANs, the discriminator in WGANs is essentially a regressor rather than a classifier, and it uses the Wasserstein-1 distance as its metric instead of the Jensen–Shannon (JS) or Kullback–Leibler (KL) divergence, which provide less smooth and less stable training of the generator. The Wasserstein-1 distance is also known as the Earth-Mover distance, as it evaluates the similarity of two probability distributions by the distance between them. Through these changes, Arjovsky et al. [14] were able to avoid vanishing gradient problems. Furthermore, WGANs continue to learn meaningfully even when the loss becomes zero, in contrast to GANs, where gradient learning essentially saturates.
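To give some intuition for the Wasserstein-1 (Earth-Mover) distance mentioned above, the following sketch computes it for two one-dimensional empirical samples of equal size, where it reduces to the mean absolute difference between the sorted values. The function name and the toy data are assumptions made for illustration; real images are high-dimensional, which is why a WGAN estimates the distance with a critic network instead.

```python
def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two equal-sized 1-D empirical samples.

    With equal sample sizes and uniform weights, the optimal transport plan
    matches sorted values pairwise, so the distance is the mean absolute
    difference between the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs_sorted, ys_sorted)) / len(xs)

real = [0.0, 1.0, 2.0, 3.0]
shifted = [x + 5.0 for x in real]   # same shape, moved 5 units away

d_same = wasserstein1_1d(real, real)      # identical samples
d_shift = wasserstein1_1d(real, shifted)  # grows linearly with the shift
```

Because the distance grows in proportion to how far the samples are moved apart, it still provides a useful training signal when the generated and real distributions barely overlap.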
Consider,
$P_g$: images synthetically generated by the WGAN in the initial epochs.
$P_r$: actual samples from the dataset.
The aim is to minimize the Wasserstein distance over the discriminator outputs for the two kinds of images. Therefore, we optimize the following approximated loss:
$$\min_{\theta} \max_{w \in W} \; \mathbb{E}_{x \sim P_r}[D_w(x)] - \mathbb{E}_{z \sim p(z)}[D_w(g_\theta(z))]$$
subject to the parameterized critic satisfying $\|D_w\|_L \le K$ (K-Lipschitz continuity). Here, we try to bring closer the expectations $\mathbb{E}$ of the discriminator function $D$ over the actual samples $x \sim P_r$ and over the generator-produced images $g_\theta(z)$, which follow $P_g$, for noise $z$ drawn from the noise distribution $p(z)$. $D$ is parameterized as $D_w$ with Lipschitz constant at most $K$ in order to approximate the intractable Wasserstein distance in a tractable form. However, some shortcomings still make training WGANs challenging, such as slow convergence and inferior quality of generated images at higher resolutions. Gulrajani et al. proposed WGAN-GP, which introduces a gradient penalty term in the loss function used to train WGANs. Nevertheless, the approach used Transfer Learning as the solution to this predicament as well. Inspired by ProGANs by Karras et al. [15], the study applied limited Progressive Growing of the network to produce stable, good-quality images at higher resolutions. First, the generator was trained on 64 × 64 images for 150 epochs with an input noise vector of shape (100,). Then, we replaced the first few layers (2 blocks consisting of 2-dimensional transposed convolutional layers with ReLU non-linearity and
Batch Normalization layers, which also take the noise vector as their input) of the trained generator with fresh layers suited to higher-dimensional images, say 128 × 128, and trained for another 300 epochs. We repeated this process once more so that the generator could produce 256 × 256 images. Correspondingly, a new critic was instantiated for every increase in the dimension of the generator’s output (Fig. 2). Note that, to maintain the fidelity of the synthetic samples, the approach produces instances of the Healthy class (the underrepresented class) only. However, a clinical opinion on the fidelity of the synthetic instances would be helpful for practices such as this that attempt to model real-world samples (Fig. 3).
Fig. 2 Steps followed for progressively training WGANs. Training GANs directly on higher-dimensional images is difficult, as a large number of epochs is required to generate suitable images
Fig. 2 stages: (1) generator and critic for 64 × 64 images, 150 training epochs; (2) progressive growth: generator for 128 × 128 images (retains part of the previously trained generator network) with a new critic for 128 × 128 images; (3) progressive growth: generator for 256 × 256 images (retains part of the previously trained generator network) with a new critic for 256 × 256 images
Fig. 3 (1) One of the original images from the HandPD dataset and (2) one of the images synthetically generated by Pro-WGANs
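The critic objective used in this section can be illustrated on plain lists of critic scores. The sketch below is only an assumption-laden illustration: the function names are hypothetical, and enforcing the Lipschitz constraint by weight clipping with c = 0.01 follows the original WGAN formulation [14] rather than anything the study states about its own training loop.

```python
def critic_loss(real_scores, fake_scores):
    """Loss minimized by the critic: -(E[D(x)] - E[D(g(z))])."""
    mean_real = sum(real_scores) / len(real_scores)
    mean_fake = sum(fake_scores) / len(fake_scores)
    return -(mean_real - mean_fake)

def generator_loss(fake_scores):
    """Loss minimized by the generator: -E[D(g(z))]."""
    return -sum(fake_scores) / len(fake_scores)

def clip_weights(weights, c=0.01):
    """Keep every critic weight in [-c, c] (assumed Lipschitz enforcement)."""
    return [max(-c, min(c, w)) for w in weights]

# Toy critic outputs on two real and two generated images.
loss_d = critic_loss([1.0, 3.0], [0.0, 1.0])
loss_g = generator_loss([0.0, 1.0])
clipped = clip_weights([-0.5, 0.005, 0.2])
```

The critic outputs unbounded scores rather than probabilities, which is what makes it a regressor rather than a classifier in the sense described above.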
Fig. 4 Comparison of performance of various pre-trained models
Such efforts to balance the dataset provided clear gains: the methodology reached 96.83% accuracy on the balanced Spiral HandPD dataset, compared with 94.71% on the original.
3.2 Classification Deep Learning is a representation learning method in which task-specific feature extraction is not restricted to hand-crafted rules determined by meticulous study; instead, the model identifies the significant traits of the input that are needed to emulate the required output function. This quality allows Deep Learning models to produce state-of-the-art results, provided that a substantial amount of training data is available. However, this is often not the case, especially for physiological datasets, including the dataset chosen here. To overcome this obstacle, the approach employs Transfer Learning, a machine learning technique in which a model trained on one task is repurposed for another related task. The pre-trained model selected for the task is ResNet-50 [16] (Fig. 4). ResNets are a popular choice for image classification tasks due to their ability to provide the advantages of both deep and shallow networks without the drawbacks of either. Deep networks are proficient in learning complex features but suffer from vanishing gradient problems. To address this concern, the ResNet architecture contains residual or skip connections that allow gradients to flow directly from later layers to the initial filters (Fig. 5). Here $a^{[l]}$ represents the activation from the previous layer $l$. Then, $a^{[l+1]} = H(W^{[l+1]} a^{[l]} + b^{[l+1]})$, where $H(x)$ represents the activation function (ReLU in this case), while $W^{[l]}$ and $b^{[l]}$ represent the weights and bias of layer $l$, respectively. However, due to the skip connection, the output at layer $l + 2$ becomes,
Fig. 5 Basic block of a Residual Network with a skip connection
$$a^{[l+2]} = H(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$
Therefore, ResNet-based networks can possess numerous layers and hidden units without degrading performance. ResNet-50 is also preferred by real-world practitioners: the majority of the top submissions in Stanford’s DAWNBench [17] competition are based on ResNet-50 and its variations. In addition to a comprehensive architecture, pre-trained weights from Transfer Learning allow models to reach exceptional accuracy without training their convolutional base. However, fine-tuning the convolutional base on the selected dataset boosts accuracy considerably compared with the use of frozen weights.
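The two-layer residual computation above can be traced with a small dense example. This is a simplified sketch: ResNet-50 uses convolutional layers and batch normalization, whereas the hypothetical `residual_block` below uses plain fully connected layers so that the effect of the skip connection is easy to see.

```python
def relu(v):
    """Elementwise ReLU, the activation H in the equations above."""
    return [max(0.0, x) for x in v]

def affine(W, a, b):
    """Compute W a + b for a weight matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]

def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = H(W[l+2] a[l+1] + b[l+2] + a[l]) with a[l+1] = H(W[l+1] a[l] + b[l+1])."""
    a_l1 = relu(affine(W1, a_l, b1))
    z_l2 = affine(W2, a_l1, b2)
    return relu([z + skip for z, skip in zip(z_l2, a_l)])

a_l = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [0.0, 0.0]

# With all weights at zero the block passes its input through unchanged.
out_zero = residual_block(a_l, zeros, no_bias, zeros, no_bias)
# With identity weights the skip connection adds the input to the layer output.
out_identity = residual_block(a_l, identity, no_bias, identity, no_bias)
```

The zero-weight case shows why extra residual layers do little harm: a block can fall back to the identity mapping, so the input activation (and its gradient) passes straight through.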
4 Results Since the study addressed class-imbalanced datasets with both upsampling and synthetically generated samples, we validated the system separately for each strategy. This helped in examining the effect that unbalanced data sources have on the model’s performance (Tables 1 and 2).
Table 1 Mean accuracies over spiral exam in HandPD
Dataset                           Accuracy (%)
Old spiralPD                      94.41
New spiralPD                      95.23
Combined oversampled spiralPD     98.24
Synthetic spiralPD                96.83
‘Old’ refers to the earlier version of the HandPD dataset, which was class imbalanced, and ‘New’ refers to the later version, which was balanced in comparison
Table 2 Mean accuracies over meander exam in HandPD
Dataset                           Accuracy (%)
Old meanderPD                     93.71
New meanderPD                     94.89
Combined oversampled meanderPD    98.11
Synthetic meanderPD               96.74
5 Conclusion This paper examines the effectiveness of Transfer Learning in the field of biomedical imagery and establishes it as a suitable strategy for small datasets from different domains, such as HandPD. Techniques such as Deep Residual Networks are vital for correctly predicting intricate samples that resemble one group (healthy or patient) but belong to the other class. Moreover, the nature of the dataset cannot be disregarded in the achievement of the current results. Class balance in a dataset can prove to be an asset when preparing an unbiased predictive system. Though unconventional, synthetically generated samples, such as those from a WGAN, can be leveraged to train machine learning models in such challenging cases. Transfer learning assists in this task as well, through Progressive Growing of Generative Adversarial Networks. Consequently, the ability of neural networks to reuse knowledge learned by prior systems can be harnessed extensively to counter common hindrances in physiological research, as in the case of HandPD. In future work, we shall focus on making Transfer Learning models more efficient and computationally inexpensive. For this, we shall explore neural network compression techniques [18] over pre-trained architectures. This shall enable us to produce accurate as well as lightweight models that can run considerably fast on smaller mobile devices. Apart from Transfer Learning, we shall also examine the usefulness of neural architecture search [19] in biomedical imagery, as automated hyper-parameter tuning and AutoML are still nascent in the field of physiological machine learning. Lastly, to secure networks against malicious intent, we shall incorporate defense strategies against adversarial attacks [20] into the training process.
References
1. P. Johns, Clinical Neuroscience, in Chapter 13 Parkinson’s diseases (2014), pp. 173–179
2. C.R. Pereira, D.R. Pereira, F.A. Silva, J.P. Masieiro, S.A.T. Weber, C. Hook, J.P. Papa, A new computer vision-based approach to aid the diagnosis of Parkinson’s disease. Comput. Methods Prog. Biomed. (2016)
3. C.R. Pereira, S.A.T. Weber, C. Hook, G.H. Rosa, J.P. Papa, Deep learning-aided Parkinson’s disease diagnosis from handwritten dynamics, in 2016 29th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI) (2016)
4. L.A. Passos, C.R. Pereira, E.R.S. Rezende, T.J. Carvalho, S.A.T. Weber, C. Hook, J.P. Papa, Parkinson’s disease identification using residual networks and optimum-path forest, in 2018 IEEE 12th International Symposium on Applied Computational Intelligence and Informatics (SACI) (2018)
5. C.R. Pereira, D.R. Pereira, J.P. Papa, G.H. Rosa, X.S. Yang, Convolutional neural networks applied for Parkinson’s disease identification, in A. Holzinger (ed.), Machine Learning for Health Informatics. Lecture Notes in Computer Science, vol 9605 (Springer, Cham, 2016)
6. D. Gupta, A. Julka, S. Jain, T. Aggarwal, A. Khanna, N. Arunkumar, V.H.C. de Albuquerque, Optimized cuttlefish algorithm for diagnosis of Parkinson’s disease. Cogn. Sys. Res. (2018)
7. P. Sharma, S. Sundaram, M. Sharma, A. Sharma, D. Gupta, Diagnosis of Parkinson’s disease using modified grey wolf optimization. Cogn. Sys. Res. (2018)
8. S. Cimen, B. Bolat, Diagnosis of Parkinson’s disease by using ANN, in 2016 International Conference on Global Trends in Signal Processing, Information Computing and Communication (ICGTSPICC) (2016)
9. A.H. Al-Fatlawi, M.H. Jabardi, S.H. Ling, Efficient diagnosis system for Parkinson’s disease using deep belief network, in 2016 IEEE Congress on Evolutionary Computation (CEC) (2016)
10. E.A. Belalcazar-Bolanos, J.R. Orozco-Arroyave, J.D. Arias-Londono, J.F. Vargas-Bonilla, E. Noth, Automatic detection of Parkinson’s disease using noise measures of speech, in Symposium of Signals, Images and Artificial Vision - 2013: STSIVA (2013)
11. X. Wu, X. Chen, Y. Duan, S. Xu, N. Cheng, N. An, A study on gait-based Parkinson’s disease detection using a force sensitive platform, in 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2017)
12. M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, in IEEE Transactions on Systems, Man, and Cybernetics (2012)
13. C. Bowyer, H. Kegelmeyer, SMOTE: synthetic minority over-sampling technique. J. Art. Intell. 16 (2002)
14. M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN (2017). http://arxiv.org/abs/1701.07875
15. T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation. https://arxiv.org/abs/1710.10196
16. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
17. Stanford DAWN Deep Learning Benchmark (DAWNBench). https://dawn.cs.stanford.edu/benchmark/
18. H. Kim, M. Khan, C. Kyung, Efficient Neural Network Compression (2019)
19. B. Zoph, V. Vasudevan, J. Shlens, Q. Le, Learning Transferable Architectures for Scalable Image Recognition (2019)
20. N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Celik, A. Swami, Practical Black-Box Attacks against Machine Learning (2019)
Architecture and Framework Enabling Internet of Vehicles Towards Intelligent Transportation System R. Manaswini, B. Saikrishna, and Nishu Gupta
Abstract Internet of Things (IoT) enables the interconnection of entities across domains such as smart health, smart industry, and smart homes. These systems have already gained popularity, which is still increasing day by day. Moving ahead, we aim to develop a ubiquitous transport system with the help of IoT sensor technology. This new direction leads to the introduction of the Internet of Vehicles (IoV). IoV facilitates vehicle-to-vehicle, vehicle-to-sensor, and vehicle-to-roadside unit communication. With the help of artificial intelligence, IoV forms a solid backbone for Intelligent Transportation Systems (ITS), which gives further insight into technologies for traffic efficiency and traffic management applications. In this article, a novel approach towards the architecture model, framework, cross-level interaction, applications, and challenges of IoV is discussed in the context of ITS and future vehicular scenarios. Keywords Artificial intelligence · Cloud computing · Internet of vehicles · Internet of things · Intelligent transportation system · VANET
R. Manaswini · B. Saikrishna · N. Gupta (B) Electronics and Communication Engineering Department, Vaagdevi College of Engineering, Warangal, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_84
1 Introduction The emergence of large-scale computational techniques, combined with wireless communication networking and cloud infrastructure, has brought the realization of smart cities within reach. The concept of smart objects which provide seamless connectivity while ensuring safety and a smart environment
R. Manaswini et al.
through increased interaction and interoperability is called the Internet of Things (IoT) [1]. In the current research paradigm of Intelligent Transportation Systems (ITS), the conventional vehicular ad hoc network (VANET) is transforming into the Internet of Vehicles (IoV). A VANET covers only a small mobility network that is inherently subject to mobility constraints. Several aspects of big cities, such as traffic jams, tall buildings, bad driver behavior, and complex road networks, have been shown to hinder efficient transportation. Therefore, VANET, in its strict sense, is temporary, random, and unstable, and its range of usage is local and discrete. Consequently, VANET cannot supply ubiquitous (global) and sustainable services or applications to its end users, and over the past several years it has not seen widespread implementation [2]. On the other hand, IoV provides integrated management of ITS and other applications in smart cities. The concept of smart cities is emerging as a strategy to mitigate the challenges of rapid and continuous urbanization while providing better Quality of Service (QoS) within the cities [3]. Smart cities are characterized by manageability, controllability, and credibility, and are composed of multi-user, multi-thing, multi-vehicle, and multi-network entities. IoV can be regarded as a convergence of the mobile Internet and the traditional IoT. It is a large interaction network built on dynamic mobile communication systems and related technologies. It also allows the exchange of information over vehicle-to-device (V2D), vehicle-to-sensor (V2S), and device-to-device (D2D) links within a vehicle. Deployment of IoV in smart cities enables information sharing and the gathering of large amounts of data on vehicles, roads, infrastructure, buildings, and their surroundings. IoV can supply services to advertisers and supervise vehicles. Figure 1 depicts an overview of applications of IoV. It is also helpful in supplying rich multimedia and mobile Internet application services. IoV concentrates on serving ITS applications by ensuring driver safety, traffic efficiency, and infotainment. Smart cities need IoV services for large-scale data sensing, collection, processing, and storage. One of the main challenges of IoV deployment in smart cities is the integration of all its components. Another challenge is to ensure reliable, real-time delivery of emergency services and large-scale data collection between vehicular applications and platforms [4]. The remainder of this paper is organized as follows. In Sect. 2, background work in the form of a literature survey is presented. Section 3 discusses the proposed authentication protocol. Performance evaluation is presented in Sect. 4. Finally, the article is concluded in Sect. 5.
2 Background Work ITS, in its broad conception, involves several applications in traffic management, such as automatic license plate recognition and traffic signal control systems. The rapid emergence of wireless mobile communication technologies has drawn researchers’ attention towards communication between vehicles for
Architecture and Framework Enabling Internet …
Fig. 1 An overview of applications of the Internet of Vehicles (figure labels: vehicle-to-grid, V2G + G2V, charging stations; vehicle-to-device communication, V2D; telematics; parking alerts)
road safety and improved transportation efficiency. Specifically, VANET has long been under the spotlight for this purpose. VANET uses Dedicated Short-Range Communication (DSRC) technology [4, 5]. However, it has its own limitations. The problem remains unsolved because of the high-speed mobility of vehicles and the currently incomplete infrastructure, which leave the reliability of services and connections in VANET vulnerable. The emergence of big data and IoT led to the concept of IoV. In IoV, wireless communication and information exchange are conducted under agreed communication protocols and data interaction standards between a vehicle and anything else (V2X), such as another vehicle or road infrastructure [6]. IoV mainly focuses on the integration of human and vehicle, in which the vehicle is an extension of human abilities. It comprises a network model, a service model, and a behavior model of the human-vehicle system, which makes it highly different from the wireless mobile network [2]. IoV is an interactive communication system of different types, as shown in Fig. 2. The first block depicts the inter-vehicular network and the second the intra-vehicular network. In the first category, the network is divided into three types, viz., V2S, V2D, and D2D. The intra-vehicular network is divided into four types, viz., vehicle-to-vehicle (V2V), vehicle-to-pedestrian (V2P), vehicle-to-grid (V2G), and vehicle-to-barrier (V2B). We discuss these network types in detail by category.
Fig. 2 Interactive model of IoV
Fig. 3 Registration vehicle
• Inter-vehicular network system
Vehicle-to-Sensor (V2S) An IoV implementation requires different devices, such as sensors and actuators, communicating with each other. These communication systems rely on a Data Acquisition System (DAS), in which the vehicular data is transferred to the network through the On Road Diagnosis (ORD2) interface. This helps in avoiding accidents, supports safe driving, and improves the driving experience [7].
Vehicle-to-Driver (V2D) This covers vehicle-to-driver (V2D) and vehicle-to-environment (V2E) communication for motorcycles and other vehicles, with a smartphone at its core and wireless communication. The vehicle is equipped with embedded Bluetooth-CAN and
interfaced with the smartphone, which acts as the gateway towards other elements and the web server [8].
Device-to-Device (D2D) The D2D communication model allows a device to connect to and communicate with another device directly, rather than through an intermediary application server. These devices communicate over many types of networks, including IP networks and the Internet, and use protocols such as Bluetooth, Z-Wave, or Zigbee to establish direct D2D communication [9].
• Intra-vehicular network system
Vehicle-to-Vehicle (V2V) In the layered architecture, distributed information sharing is enabled through V2V communication using DSRC technologies with the help of GPS information. A technology known as Advanced Driver Assistance (ADA) systems is introduced in these protocols. It combines and embeds data in the vehicle, and a peer-to-peer process shares road conditions so that collisions can be avoided [10, 11].
Vehicle-to-Pedestrians (V2P) At present, two methods exist to overcome the issue of scanning and extension receiving. These methods help decrease night-time accidents on roads caused by fast driving. To reduce these problems, LED lights are to be introduced in vehicles. With these two methods, vulnerable road users (VRUs) such as pedestrians and light motor vehicle drivers can be protected.
Vehicle-to-Barrier (V2B) The deployment of vehicle-to-barrier communication helps minimize vehicle crashes caused by run-off-road (ROR) events. ROR crashes are increasing day by day; to control them, V2B communication is set up between vehicles and radios embedded in roadside barriers.
Vehicle-to-Grid (V2G) V2G offers advantages for power and energy concerns, such as stabilizing fluctuations in energy demand and supply. Introducing plug-in electric vehicles (PEVs) reduces power consumption and cost. A challenge in using V2G is providing security for the grid [12].
3 Proposed Authentication Protocol for IoV In this section, we propose an authentication protocol to achieve lightweight authentication of V2V communication. According to this protocol, only authenticated vehicles are permitted to communicate with each other. The protocol is further sub-divided
Table 1 The notation and specific explanations
Notation    Description
1           A vehicle x in the network
TA          Trust authority
ID_TA       ID of TA
ID_1        ID of 1
φ_TA        Private key of unit TA
h(.)        Hash function
⊕           The XOR operator
‖           The connection (concatenation) symbol
into three authentication subprotocols, namely, the initial stage, the registration stage, and the authentication stage.
Initial stage In the authentication protocol, each node in the region of the Trust Authority (TA) is uniquely identified by an ID. The TA generates its private key using a secure one-way hash function $h(\cdot)$:
$$\phi_{TA} = h(ID_{TA} \,\|\, R_{TA}) \tag{1}$$
where $\phi_{TA}$ is the private key of the TA, $ID_{TA}$ is the ID of the TA, and $R_{TA}$ is a random number generated by the TA. The hash function takes an input message of arbitrary length and produces a 128-bit output. MD5 (Message-Digest algorithm 5) splits the input message into 512-bit blocks. Each block is divided into 16 sub-blocks of 32 bits. After a sequence of processing steps, the output consists of four 32-bit groups, which are concatenated to create the 128-bit hash value. Nevertheless, the performance time of MD5 algorithms is up to 0.0018 s and 9.258 s, correspondingly. In addition, Table 1 shows the major notations and their meanings.
Registration stage Each vehicle has a unique identity (ID) and a security key. Let $ID_o$ denote the ID of vehicle $o$ and $S_o$ its security key. Instead of the vehicle’s ID being regenerated repeatedly by the system, its factors are generated using $ID_o$ and $S_o$, as shown in the following equation:
$$\rho_o = h(ID_o \,\|\, S_o) \tag{2}$$
Vehicle $o$ computes the factor $\zeta_o$ as shown in the following equation:
$$\zeta_o = A_o \oplus B\,\zeta_o \tag{3}$$
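The hash computations of Eqs. (1) and (2) can be sketched with Python's standard `hashlib` module. Everything below is an illustrative assumption rather than the authors' implementation: the identifiers `TA-001` and `VEH-042` are hypothetical, the random values are drawn with `os.urandom`, and concatenation is implemented as plain byte concatenation.

```python
import hashlib
import os

def h(*parts: bytes) -> bytes:
    """One-way hash h(.) over the concatenation of its arguments.

    MD5 is used because the text names it; it yields a 128-bit (16-byte) digest.
    """
    return hashlib.md5(b"".join(parts)).digest()

# Initial stage, Eq. (1): private key of the trust authority.
id_ta = b"TA-001"        # hypothetical TA identifier
r_ta = os.urandom(16)    # random number generated by the TA
phi_ta = h(id_ta, r_ta)

# Registration stage, Eq. (2): factor derived from the vehicle ID and security key.
id_o = b"VEH-042"        # hypothetical vehicle identifier
s_o = os.urandom(16)     # security key of the vehicle
rho_o = h(id_o, s_o)
```

Both results are 16-byte strings, which matches the 128-bit output described for MD5 and lets them be combined with other factors of the same length through XOR.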
Fig. 4 Communication price
The factor $\zeta_o$ is transmitted to the TA by vehicle $o$. On receiving it, the TA generates a random number $h_{TA}$. The TA factor $\tau_{TA}$ is computed as shown in the following equation:
$$\tau_{TA} = h(\zeta_o \,\|\, \varphi_o) \oplus \phi_{TA} \tag{4}$$
where $\varphi_o = h(ID_o \,\|\, h_{TA})$. Finally, the factors $\tau_{TA}$ and $h_{TA}$ are transmitted to vehicle $o$, as shown in Fig. 4. On receiving the message, vehicle $o$ stores these factors in its Tamper-Proof Device (TPD), which then holds the factor set $\{\zeta_o, \tau_{TA}, h_{TA}, \varphi_o\}$ (Fig. 3).
Authentication stage
a. Identity detection Before communicating with other vehicles, a vehicle first authenticates its own identity, and it communicates with other vehicles only after finishing this authentication stage [13]. The vehicle generates the factor $\zeta$ using its specific ID and security key, as shown in Eqs. (2) and (3). If the result is identical to the stored value, the vehicle is authenticated. If not, it must re-register until authentication succeeds. When vehicle $o$ needs to communicate with other entities, it recomputes the factor $\zeta_o$ according to Eq. (3). If the values are equal, vehicle $o$ is authenticated successfully and is permitted to communicate with other entities. The vehicle authorization process is comparatively simple.
b. Message authentication To ensure the safety of transmitted data, the communicating entity must be authenticated before it is ready to transmit data.
i. Request message When vehicle 1 requests to transmit data to vehicle 2, it first sends a request message to that vehicle and records the delivery time of the request. At the same time, vehicle 1 generates a random number $h_1$. Subsequently, the vehicle retrieves the factors from the TPD and computes the factor $\varphi_o$ as $\varphi_o = h(ID_o \,\|\, h_{TA})$. The vehicle uses the factors $\zeta_o$, $\varphi_o$, and $\tau_{TA}$ to calculate the security key of the TA, as shown in the following equation:
$$\phi_{TA} = h(\zeta_o \,\|\, \varphi_o) \oplus \tau_{TA} \tag{5}$$
The vehicle then computes the following factors:
$$T_o = h(\phi_{TA} \,\|\, S_{tx}) \oplus h_1 \tag{6}$$
$$S_o = T_o \oplus h_1 \oplus \phi_{TA} \tag{7}$$
$$\mu_o = Rqst \oplus T_o \oplus \phi_{TA} \oplus S_{tx} \tag{8}$$
where $S_{tx}$ is the timestamp of the request.
ii. Reply message Vehicle 2 first records the timestamp at which it receives the factors $\{T_o, \mu_o, S_{tx}\}$, denoted $S_{rx}$. Subsequently, $S_{rx}$ is compared with $S_{tx}$, which is extracted from $\{T_o, \mu_o, S_{tx}\}$. If $S_{rx}$ is too late, the following inequality holds:
$$S_{rx} - S_{tx} \ge \alpha_{S1} \tag{9}$$
where $\alpha_{S1}$ is a system factor. If the inequality holds, the received factors $\{T_o, \mu_o, S_{tx}\}$ have expired, and vehicle 2 immediately stops communicating with vehicle 1. Otherwise, it proceeds to the next step. Vehicle 2 recalculates the factor $h_1$, which was previously generated by vehicle 1. The recalculated factor $h_1$ is given by
$$h_1 = T_o \oplus h(\phi_{TA} \,\|\, S_{tx}) \tag{10}$$
Correspondingly, the vehicle 2 recalculates S o as shown in the following equation:
$$S_o = h_1 \oplus T_o \oplus \emptyset_{TA}. \tag{11}$$

Subsequently, vehicle 2 extracts the request message from $\mu_o = Rqst \oplus T_o \oplus \emptyset_{TA} \oplus S_{tx}$ as

$$Rqst = \mu_o \oplus T_o \oplus \emptyset_{TA} \oplus S_{tx}. \tag{12}$$
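The request and recovery steps in Eqs. (6)-(12) rely only on hashing and XOR, so the receiver can undo the sender's masking with the same operations. The following minimal Python sketch illustrates this. It is an illustration under stated assumptions, not the authors' implementation: SHA-256 stands in for $h$, every factor is padded to 32 bytes, the value that is masked in Eq. (6) and recovered in Eq. (10) is represented by a single variable `nonce`, and the helper names (`h`, `pad`, `xor`, `build_request`, `open_request`) and the threshold value `ALPHA_S1` are hypothetical.

```python
import hashlib
import os

WIDTH = 32      # assumed common width of all factors (SHA-256 digest size)
ALPHA_S1 = 5    # assumed freshness threshold in seconds (system parameter)


def h(*parts: bytes) -> bytes:
    """Hash the concatenation of the parts (stand-in for h(a || b))."""
    return hashlib.sha256(b"".join(parts)).digest()


def pad(value: bytes) -> bytes:
    """Right-pad a value with zero bytes to the common width."""
    return value.ljust(WIDTH, b"\x00")


def xor(*values: bytes) -> bytes:
    """Bytewise XOR of byte strings that all have length WIDTH."""
    out = bytes(WIDTH)
    for v in values:
        out = bytes(x ^ y for x, y in zip(out, v))
    return out


def build_request(phi_ta: bytes, nonce: bytes, rqst: bytes, s_tx: int):
    """Sender side: compute T_o (Eq. 6) and mu_o (Eq. 8)."""
    ts = s_tx.to_bytes(4, "big")
    t_o = xor(h(phi_ta, ts), nonce)
    mu_o = xor(pad(rqst), t_o, phi_ta, pad(ts))
    return t_o, mu_o, s_tx


def open_request(phi_ta: bytes, t_o: bytes, mu_o: bytes, s_tx: int, s_rx: int):
    """Receiver side: freshness check (Eq. 9), then Eqs. (10) and (12)."""
    if s_rx - s_tx >= ALPHA_S1:
        raise ValueError("request expired")
    ts = s_tx.to_bytes(4, "big")
    nonce = xor(t_o, h(phi_ta, ts))                          # Eq. (10)
    rqst = xor(mu_o, t_o, phi_ta, pad(ts)).rstrip(b"\x00")   # Eq. (12)
    return nonce, rqst


phi_ta = h(b"demo TA key")   # placeholder for the shared TA key
nonce = os.urandom(WIDTH)
t_o, mu_o, s_tx = build_request(phi_ta, nonce, b"hello", s_tx=1000)
recovered_nonce, recovered_rqst = open_request(phi_ta, t_o, mu_o, s_tx, s_rx=1002)
```

Because XOR is its own inverse, the receiver needs only the shared key and the transmitted timestamp to recover both the masked value and the request.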
After obtaining these factors, vehicle 2 computes two new factors $F_2$ and $L_2$:

$$F_2 = h(h_1 \,\|\, \alpha_{S1} \,\|\, \emptyset_{TA}), \tag{13}$$

$$L_2 = F_2 \oplus \emptyset_{TA} \oplus S_o \oplus h_1. \tag{14}$$

Finally, vehicle 2 communicates the applicable factors to vehicle 1. When they are received, vehicle 1 sends a response message to vehicle 2. To secure the channel, the reply message is encrypted as

$$EN\text{-}Reply = ECD_{F_2}(Reply). \tag{15}$$
Finally, vehicle 2 sends the factors $\{L_2, Reply\}$ to vehicle 1. The protocol aims to reduce the running time of the authentication process. Recall Eq. (15): the inputs to the encryption function are the reply message and the key $F_2$, from which EN-Reply is generated.

iii. Communication units
The exchange of control messages (send, reply) between vehicle 1 and vehicle 2 allows each to obtain the other's information. On receiving $\{L_2, Reply\}$, vehicle 1 first records the time of reception as a timestamp. It then performs a safety check on whether the inequality $S_{rx} - S_{tx} \ge \alpha_{S2}$ is satisfied. If it is not, vehicle 2 avoids communicating with vehicle 1. Once the check is passed and the message is found to be secure, vehicle 1 obtains the reply message from EN-Reply. To do so, vehicle 1 must compute $F_2$ correctly and decrypt EN-Reply successfully. Vehicle 1 computes $F_2$ according to Eq. (16).
Let $F_2'$ denote the value of $F_2$ computed by vehicle 1, which is given by

$$F_2' = L_2 \oplus \emptyset_{TA} \oplus S_o \oplus h_1. \tag{16}$$

The factor $F_2'$ is then used to decrypt EN-Reply and obtain the reply message:
$$Reply = DCP_{F_2'}(EN\text{-}Reply), \tag{17}$$

where $DCP_{F_2'}(\ast)$ is the decryption function keyed with the value $F_2'$ computed by vehicle 1. If $F_2' = F_2$, vehicle 1 can decrypt EN-Reply and obtain the reply. When decryption succeeds, vehicle 1 deems vehicle 2 trustworthy and may communicate with it.
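Equations (13)-(17) can be exercised end to end with a short sketch. The text does not specify the cipher behind $ECD$ and $DCP$, so the sketch below substitutes an illustrative hash-based XOR keystream, which is a stand-in and not a vetted cipher. SHA-256 again stands in for $h$, and all function names and byte values are hypothetical placeholders.

```python
import hashlib


def h(*parts: bytes) -> bytes:
    """Hash the concatenation of the parts (stand-in for h(a || b || c))."""
    return hashlib.sha256(b"".join(parts)).digest()


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR, truncated to the shorter input."""
    return bytes(x ^ y for x, y in zip(a, b))


def keystream_cipher(key: bytes, data: bytes) -> bytes:
    """Toy stand-in for ECD/DCP: XOR data with a SHA-256 counter keystream.

    Applying it twice with the same key returns the original data.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += h(key, counter.to_bytes(4, "big"))
        counter += 1
    return xor(data, stream)


# Placeholder 32-byte factors shared by both vehicles after the request phase.
h1 = h(b"random number h1")
phi_ta = h(b"TA key")
s_o = h(b"factor S_o")
alpha_s1 = (5).to_bytes(4, "big")

# Vehicle 2: Eqs. (13)-(15).
f2 = h(h1, alpha_s1, phi_ta)
l2 = xor(xor(xor(f2, phi_ta), s_o), h1)
reply = b"reply from vehicle 2"
en_reply = keystream_cipher(f2, reply)

# Vehicle 1: Eqs. (16)-(17).
f2_prime = xor(xor(xor(l2, phi_ta), s_o), h1)
decrypted = keystream_cipher(f2_prime, en_reply)
```

The recovered key matches only when vehicle 1 holds the same $\emptyset_{TA}$, $S_o$, and $h_1$ as vehicle 2, which is what ties successful decryption to authentication.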
4 Performance Evaluation

This section evaluates the communication cost, storage cost, computational time, and battery consumption of the authentication protocol. The analysis was performed on a computer with an Intel(R) Core i3-5000 CPU at 3.9 GHz and 8.00 GB of RAM. The results are discussed below.

i. Communication and storage cost
The communication overhead is computed from the total number of variables that the vehicles use in message transmission across the V2V communication area. Figures 5 and 6 show the communication and storage costs, respectively. The storage cost is the total memory needed to keep the various factors. We assume that a hash digest is 256 bits, a random number is 8 bytes, a timestamp is 4 bytes, a bilinear pairing requires 128 bytes, symmetric and asymmetric encryption and decryption require 64 bytes, and a signature takes 128 bytes.

Fig. 5 Storage price

Fig. 6 Computational price

ii. Computational time
The computational time is computed from the number of cryptographic operations that the vehicles need in the login, authentication, and communication stages. The protocols under comparison use the following operations: a one-way hash function ($T_h$), asymmetric encryption ($T_{asecd}$), asymmetric decryption ($T_{asdc}$), signing ($T_s$), exponentiation ($T_p$), and bilinear pairing ($T_c$). Figure 7 shows the computational cost.

iii. Battery consumption
The total energy needed to carry out the system's processes is called battery consumption. It is the energy needed for a complete communication among vehicles, which involves vehicles and servers. It is computed as $ENC = TP \times C$, where $TP$ is the total computational time, $ENC$ is the energy consumed, and $C$ is the maximum power for wireless communication (11.08 W). Figure 8 shows the battery consumption.
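The storage estimate and the energy formula $ENC = TP \times C$ are simple arithmetic, which the following sketch works through. The field sizes and the 11.08 W power figure are those assumed in the text. The message composition (two hash-sized factors and a timestamp) and the computational time of 2 ms are hypothetical inputs chosen for illustration, not measurements from the paper.

```python
# Field sizes in bytes, as assumed in the text (a 256-bit hash is 32 bytes).
SIZES_BYTES = {
    "hash": 32,
    "random": 8,
    "timestamp": 4,
    "pairing": 128,
    "cipher": 64,
    "signature": 128,
}
MAX_POWER_W = 11.08  # maximum power C for wireless communication


def message_size(fields):
    """Total size in bytes of a message made of the named fields."""
    return sum(SIZES_BYTES[name] for name in fields)


def energy_joules(tp_seconds):
    """Battery consumption ENC = TP * C, in joules."""
    return tp_seconds * MAX_POWER_W


# Hypothetical message: two hash-sized factors plus a timestamp.
request_bytes = message_size(["hash", "hash", "timestamp"])

# Hypothetical total computational time of 2 ms.
energy = energy_joules(0.002)
```

With these inputs the message occupies 68 bytes, and the energy scales linearly with the computational time, so halving the number of cryptographic operations roughly halves the battery cost.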
Fig. 7 Energy ingesting
5 Conclusion

This paper presents an overview of wireless communication technologies, including advanced protocols, for enhancing vehicular communication in the IoV. A five-level architecture of the IoV is proposed. We discuss the protocols used for wireless access communication and the routing protocols used in inter-vehicular networks. We propose an authentication protocol for the V2V scenario in VANETs. The system requires only hash operations while upholding the necessary security level. Additionally, the privacy and integrity of messages are protected. We made the system lightweight by reducing the number of stored variables and hence the memory required. The results report the performance of our protocol in terms of communication, storage, and computational costs and battery consumption.
References

1. L.M. Ang, K.P. Seng, G. Ijemaru, A.M. Zungeru, Deployment of IoV for smart cities: applications, architecture and challenges (2018)
2. Y. Fangchun, W. Shangguang, L. Jinglin, L. Zhihan, S. Qibo, An overview of the Internet of Vehicles (2014)
3. O. Kaiwartya, A. Abdullah, Y. Cao, A. Altameem, M. Prasad, C. Lin, X. Liu, Internet of Vehicles: motivation, layered architecture, network model, challenges, and future aspects. IEEE Access 4, 5356–5373 (2016)
4. S. Al-Sultan, M.M. Al-Doori, A.H. Al-Bayatti, H. Zedan, A comprehensive survey on vehicular ad hoc network. J. Netw. Comput. Appl. 37, 380–392 (2014)
5. M. Chen, Y. Tian, G. Fortino, J. Zhang, I. Humar, Cognitive Internet of Vehicles. Comput. Commun. (2018)
6. N. Gupta, A. Prakash, R. Tripathi, Medium access control protocols for safety applications in vehicular ad-hoc network: a classification and comprehensive survey. Vehicular Commun. 2, 223–237 (2015)
7. Y. Xie, X. Su, Y. He, X. Chen, STM32-based vehicle data acquisition system for internet-of-vehicles (IEEE Computer Society, 2017), pp. 895–898
8. P. Gandotra, R.K. Jha, S. Jain, A survey on device-to-device (D2D) communication: architecture and security issues. J. Netw. Comput. Appl. 78, 9–29 (2017)
9. S. Gao, A. Lim, D. Bevly, An empirical study of DSRC V2V performance in truck platooning scenarios. Digit. Commun. Netw. 2, 233–244 (2016)
10. N. Salameh, G. Challita, S. Mousset, A. Bensrhair, S. Ramaswamy, Collaborative positioning and embedded multi-sensors fusion cooperation in advanced driver assistance system. Transp. Res. Part C 29, 197–213 (2013)
11. N. Gupta, A. Prakash, R. Tripathi, Clustering based cognitive MAC protocol for channel allocation to prioritize safety message dissemination in vehicular ad-hoc network. Vehicular Commun. 5, 44–54 (2016)
12. S. Temel, M.C. Vuran, R.K. Faller, A primer on vehicle-to-barrier communications: effects of roadside barriers, encroachment, and vehicle braking, in Proc. Vehicular Tech. Conf. (2016)
13. H. Vasudev, D. Das, A lightweight authentication protocol for V2V communication in VANETs, in Proceedings of the 2018 IEEE SmartWorld, Scalable Computing and Communications (Guangzhou, China, October 2018), pp. 1237–1242
Group Data Sharing and Auditing While Securing Sensitive Information

Shubham Singh and Deepti Aggarwal
Abstract Cloud computing is currently one of the most important research topics, largely because of the number of services it provides to its users. One of these services allows a number of users to share their data with each other. This service is very helpful in collaborative environments, as it improves efficiency and has a number of potential applications. However, the shared data might include sensitive information that should not be shared with every member of the group. To tackle this issue, this paper proposes a scheme that secures such sensitive data when it is shared among a group of cloud users. The proposed approach uses a block design-based key agreement protocol for sharing data and an identity-based blind signature for verifying the data's integrity. A performance analysis shows that the proposed scheme is efficient and stable.

Keywords Data security · Sensitive information hiding · Weil pairing · Block design · Ring signature
1 Introduction

Cloud computing is one of the hottest research topics today. It provides resources (which can also be shared), along with the capability to store and process them, to its users on demand [1]. It motivates people to store their data on an external storage server and access it whenever required. However, cloud services also raise privacy concerns. The cloud allows a user to form a group and share data among its members. But if the data to be shared contains information that should be kept secret from all or some members of the group, a problem arises, because no mechanism exists to hide such sensitive information. For instance, if a user wants to share a secret military document with a few other members of the group, and the document contains highly confidential information that should not be revealed to anyone, the problem is how to share the document with its sensitive part hidden. Moreover, even after the document is shared with the group members, the security of the file remains a concern: only the members of the group should be able to access it and to verify whether it is intact. Thus, not only is sensitive information hiding required, but also a way to share the file securely among group members and to verify its integrity. This paper puts forward a method for securely hiding sensitive information in a file and sharing it among a number of users with the help of a key agreement protocol based on block design [2], in such a way that each member of the group is able to verify its integrity with the help of an identity-based blind signature [3].

S. Singh, NIT Meghalaya, Shillong, India
D. Aggarwal (B), IMS Engineering College, Ghaziabad, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_85
2 Related Work

Confidentiality, integrity, and availability are crucial attributes of any data, and they cannot all be guaranteed by a single security method. Liu discusses the traditional technologies and capabilities of the cloud and the newer technologies that must be used for better security and privacy of data [4]. The importance of securing data stored in the cloud has been emphasized many times, along with the measures needed to do so. In [5], the AES algorithm is used to provide security to the end users of the cloud. Al-Jaberi et al. focus on integrity verification and privacy preservation, using algorithms and protocols for cloud-stored data with Amazon S3 as the cloud storage provider [6]. RSA partial homomorphic encryption and the MD5 algorithm are used in the model proposed in [7]. Mahalle et al. use a hybrid encryption algorithm to ensure data security such that even the administrator of the cloud server does not have access to the user's private data [8]. The keys are generated on the basis of system time, thereby making the overall system more secure. In [9], the authors propose a new security model with a double authentication mechanism to prevent unauthorized persons from gaining control of a user's data. People are afraid of sharing their data with an untrusted cloud service provider. In [10], different security issues arising from the use of cloud services are discussed; the authors argue that security capabilities need to be strengthened if cloud computing is to be taken to the next level. In [11], elliptic curve cryptography is used to secure the data in the cloud so that only authenticated users can access it. A breach at the cloud would not affect the security of the stored data, as the data is in encrypted form. The author of [12] covers cloud security techniques, the attacks possible against cloud-based services, and the countermeasures to thwart these attacks.
Mohamed et al. implemented software on an Amazon EC2 micro instance to enhance data security [13]. Researchers are working hard to find ways of securely storing local data on a cloud server and verifying its integrity later. The authors of [14, 15] used homomorphic authentication to decrease computation and communication costs. These schemes were later modified to improve their efficiency in [16–18], and features such as data update were added in [19, 20]. A scheme supporting user revocation is proposed in [21]. In [22], the authors present a dynamic auditing scheme that uses proxy tag update but does not consider ciphertext storage. The authors of [23] proposed a verifiable database scheme, but it lacks public verifiability. Vector commitment was used in [24] to build a publicly verifiable database. Backes et al. presented a scheme in [25] whose properties overcome the limitations of [24]. Group signatures, introduced in [26], allow any member of a group to hold its own secret key and sign a message. The identity of the actual signer is known only to the trusted group manager, and in this way the scheme ensures the signer's anonymity.
3 Problem Formulation

There are three entities in the proposed scheme: group users, the TPA, and the cloud server, as shown in Fig. 1. Their individual functions are explained as follows.

Fig. 1 Architecture of proposed scheme
3.1 Group Users

The data owner and the other members who come together to share information are collectively known as group users. In the proposed scheme, any number of individuals can form a group for sharing data without the need for a third-party entity such as a group manager. The member who shares his data with the other members of the group is identified as the data owner. He is able to hide sensitive information present in his data so that it is not visible to the other members of the group.
3.2 Cloud

The cloud provides remote storage services to its users. Data outsourced by any cloud user is stored here. The cloud is a semi-trusted entity and may therefore gain access to the stored data, which makes that data vulnerable. For this reason, the proposed scheme encrypts data before it is stored in the cloud.
3.3 TPA

TPA refers to the Third-Party Auditor. It is responsible for generating the public parameters and the cryptographic keys of the group users. It is also responsible for verifying the integrity of stored data when requested by any authorized group member.
4 Preliminaries

4.1 Weil Pairing

Consider a group $G_1$ with generator $G$, and let $G_2$ be the subgroup of $\mathbb{F}_{p^2}^*$ containing all elements of order $q$. The mapping $\hat{e}: G_1 \times G_1 \to G_2$ is known as the Weil pairing [2] and has the following properties:

1. The mapping is bilinear, i.e., $\hat{e}(aP, bQ) = \hat{e}(P, Q)^{ab}$ $\forall P, Q \in G_1$ and $a, b \in \mathbb{Z}$.
2. It is non-degenerate, i.e., $\hat{e}(P, P) \neq 1$ for every non-identity $P \in G_1$.
3. It is commutative, i.e., $\hat{e}(P, Q) = \hat{e}(Q, P)$ $\forall P, Q \in G_1$.
4. $\hat{e}(P, Q)$ is computable.
5. $\forall P_1, P_2, Q_1, Q_2 \in G_1$, $\hat{e}(P_1 + P_2, Q_1) = \hat{e}(P_1, Q_1)\,\hat{e}(P_2, Q_1)$ and $\hat{e}(P_1, Q_1 + Q_2) = \hat{e}(P_1, Q_1)\,\hat{e}(P_1, Q_2)$.
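The algebraic properties listed above can be checked numerically on a small bilinear map. The sketch below is a toy stand-in and not a Weil pairing: it takes $G_1 = (\mathbb{Z}_q, +)$ with generator 1 and defines $e(x, y) = g^{xy} \bmod p$, which is bilinear and symmetric but offers no security, because discrete logarithms in $\mathbb{Z}_q$ are trivial. The parameters `P_MOD`, `Q_ORD`, and `GEN` are assumptions chosen so that `GEN` has order $q$ modulo $p$.

```python
P_MOD = 23   # prime p
Q_ORD = 11   # prime q dividing p - 1
GEN = 2      # element of order q modulo p (2**11 % 23 == 1)


def pairing(x: int, y: int) -> int:
    """Toy symmetric bilinear map on (Z_q, +): e(x, y) = g^(x*y) mod p."""
    return pow(GEN, (x * y) % Q_ORD, P_MOD)


elements = range(Q_ORD)

# Property 1: e(aP, bQ) = e(P, Q)^(ab).
bilinear = all(
    pairing(a * P % Q_ORD, b * Q % Q_ORD) == pow(pairing(P, Q), a * b, P_MOD)
    for P in elements for Q in elements
    for a in range(1, 4) for b in range(1, 4)
)

# Property 2: e(P, P) != 1 for every non-identity P.
non_degenerate = all(pairing(P, P) != 1 for P in range(1, Q_ORD))

# Property 3: e(P, Q) = e(Q, P).
symmetric = all(pairing(P, Q) == pairing(Q, P) for P in elements for Q in elements)

# Property 5: e(P1 + P2, Q) = e(P1, Q) * e(P2, Q).
additive = all(
    pairing((P1 + P2) % Q_ORD, Q) == pairing(P1, Q) * pairing(P2, Q) % P_MOD
    for P1 in elements for P2 in elements for Q in elements
)
```

A real deployment would use a pairing over an elliptic curve group from a dedicated library; the toy map only makes the identities concrete.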
4.2 (v, k + 1, 1) Design

Consider a set $V = \{0, 1, \ldots, v-1\}$ of $v$ elements and a set $B = \{B_0, B_1, \ldots, B_{b-1}\}$ of $b$ blocks. A finite structure $\sigma = (V, B)$ is said to be a $(b, v, r, k, \lambda)$ design [2] if it satisfies the following conditions:

1. Every two elements of $V$ appear together in $\lambda$ blocks.
2. Every element of $V$ appears in $r$ of the $b$ blocks.
3. $k$ and $v$ satisfy $k < v$.
4. $b$ and $v$ satisfy $b \ge v$,

where $k = |B_i|$, $v = |V|$, $b = |B|$, and $r$ and $\lambda$ are parameters of the design. If $k = r$ and $b = v$, it is known as a $(v, k, \lambda)$ design; a $(v, k + 1, 1)$ design is used in the proposed scheme.
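A small concrete instance helps to make these conditions tangible. The sketch below builds the $(7, 3, 1)$ design, which is the $(v, k + 1, 1)$ case with $k = 2$ and $v = k^2 + k + 1 = 7$, by cyclically shifting the base block $\{0, 1, 3\}$ modulo 7, and then counts how often each pair and each element occurs. The construction by cyclic shifts and the function names are choices made for this illustration; the scheme in [2] may construct its blocks differently.

```python
from itertools import combinations


def cyclic_design(v, base_block):
    """Blocks obtained by adding each shift 0..v-1 to the base block modulo v."""
    return [sorted((x + shift) % v for x in base_block) for shift in range(v)]


def pair_counts(v, blocks):
    """Number of blocks in which each unordered pair of elements appears."""
    counts = {pair: 0 for pair in combinations(range(v), 2)}
    for block in blocks:
        # Blocks are sorted, so each generated pair is already in (low, high) order.
        for pair in combinations(block, 2):
            counts[pair] += 1
    return counts


def element_counts(v, blocks):
    """Number of blocks in which each element appears."""
    return [sum(1 for block in blocks if x in block) for x in range(v)]


blocks = cyclic_design(7, [0, 1, 3])
pairs = pair_counts(7, blocks)
replication = element_counts(7, blocks)
```

Every pair of participants shares exactly one block, which is the property a block design-based key agreement exploits to limit how many messages each participant must exchange.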
4.3 Identity-Based Blind Signature

The identity-based blind signature is basically built on the traditional ring signature and uses a user's identity as his public key [3]. This simplifies the complex key structure. The scheme involves three entities, namely, signer, user, and a third-party auditor. It is a collection of four algorithms, namely, Setup, Extract, Sign, and Verify, which are described below:

• Setup: Executed by the third-party auditor. It generates the public parameters and the master key.
• Extract: Generates the secret keys of all the users. It takes as input the public parameters, the master key, and a user's arbitrary identity ID, and outputs the private key.
• Sign: Computes a signature for the input data.
• Verify: Verifies whether the computed signature is valid.
4.4 AES Algorithm

AES refers to the Advanced Encryption Standard. It is a symmetric block cipher and performs all of its computations on bytes rather than bits. The number of encryption rounds depends on the key length: 10, 12, or 14 rounds for 128-, 192-, or 256-bit keys, respectively. Each round of the encryption process is divided into subprocesses. The decryption process is the reverse of the encryption process, with each round consisting of four subprocesses: add round key, mix columns, shift rows, and byte substitution. Since the subprocesses are applied in reverse order, the encryption and decryption algorithms have to be implemented separately. AES is regarded as more secure than earlier encryption algorithms and is fast in both hardware and software. It supports key sizes of 128, 192, and 256 bits. No practical attack on it is known other than brute force, which attempts to decrypt data by trying all possible keys. With the most secure 256-bit key, guessing the key by brute force would take an impractically large number of years, and hence AES is regarded as almost unbreakable [27].
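To give a sense of scale for the brute-force argument, the following sketch estimates the worst-case time needed to try every key of a given length. The round counts are those defined for AES with 128-, 192-, and 256-bit keys. The guessing rate of $10^{12}$ keys per second is a hypothetical figure chosen for illustration and does not describe any particular attacker.

```python
# AES round counts by key length in bits.
ROUNDS = {128: 10, 192: 12, 256: 14}

SECONDS_PER_YEAR = 365 * 24 * 3600


def brute_force_years(key_bits: int, guesses_per_second: float) -> float:
    """Worst-case number of years needed to try every key of the given length."""
    return 2 ** key_bits / guesses_per_second / SECONDS_PER_YEAR


# Hypothetical attacker testing 10^12 keys per second.
years_128 = brute_force_years(128, 1e12)
years_256 = brute_force_years(256, 1e12)
```

Even at this assumed rate, the 128-bit keyspace alone would take far longer than the age of the universe to exhaust, and each additional key bit doubles the effort.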
4.5 RSA Algorithm

RSA refers to the Rivest, Shamir, Adleman algorithm. It is an asymmetric cryptographic algorithm whose security rests on the difficulty of factoring a large integer. Two prime numbers are chosen at random and used to generate the cryptographic keys. The public key consists of two numbers, one of which is the product of these two primes. The most direct known way of compromising the secret key is to factor this large number. Hence the strength of the encryption depends on the key size, and increasing the key size increases security. The keys are 1024 bits long, and breaking them has not yet been feasible [27].
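The key generation just described can be illustrated with deliberately tiny numbers. The sketch below is textbook RSA without padding, using the primes 61 and 53 and the public exponent 17 as illustrative assumptions; real keys use primes hundreds of digits long together with a padding scheme, and the function name `make_keys` is hypothetical.

```python
from math import gcd


def make_keys(p: int, q: int, e: int):
    """Return (public, private) textbook RSA key pairs for primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to (p - 1)(q - 1)")
    d = pow(e, -1, phi)   # modular inverse of e (Python 3.8+)
    return (e, n), (d, n)


public, private = make_keys(61, 53, 17)

message = 65
cipher = pow(message, *public)    # encryption: m^e mod n
plain = pow(cipher, *private)     # decryption: c^d mod n
```

Anyone who can factor $n = 3233$ into 61 and 53 can recompute the private exponent, which is why the key size has to make factoring infeasible.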
5 Proposed Scheme

5.1 Overview

The proposed scheme is based on an identity-based blind signature [3] and a block design-based key agreement protocol [2] for sharing data among a group of users in the cloud. Figure 2 shows the steps involved in the proposed scheme.

Fig. 2 Process flow of the proposed scheme

According to the proposed scheme, any number of users can come together as a group to share data among themselves without the need for a third-party entity such as a group manager. The member who wants to store or share his data with other members is the data owner. He has the option to hide sensitive information present in the data before sharing it with other members of the group. This sensitive information is encrypted using the AES algorithm, producing a partially encrypted file that is used in the rest of the process. The partially encrypted file is then shared among the group users with the help of a conference key generated by the block design-based key agreement protocol. Before the file is shared, the identity-based blind signature is used to generate a signature of the file; the third-party auditor later uses this signature to verify the integrity of the shared file when requested by any member of the group.
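The idea of a partially encrypted file can be sketched in a few lines. In the sketch below, sensitive spans are assumed to be delimited by the hypothetical marker strings `[[S]]` and `[[/S]]`, and only the text between markers is encrypted and stored as hexadecimal. To stay within the standard library, a SHA-256 counter keystream is used as a placeholder where the scheme uses AES; the placeholder reuses its keystream across blocks and is not secure. The function names mirror the scheme's F_Encrypt and F_Decrypt, but the code is an illustration rather than the authors' implementation.

```python
import hashlib
import re

OPEN, CLOSE = "[[S]]", "[[/S]]"   # hypothetical markers around sensitive spans
PATTERN = re.compile(re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE), re.DOTALL)


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Placeholder cipher (NOT AES): XOR data with a SHA-256 counter keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(x ^ y for x, y in zip(data, stream))


def f_encrypt(text: str, key: bytes) -> str:
    """Encrypt only the marked spans, leaving the rest of the file readable."""
    def seal(match):
        sealed = _keystream_xor(key, match.group(1).encode("utf-8"))
        return OPEN + sealed.hex() + CLOSE
    return PATTERN.sub(seal, text)


def f_decrypt(text: str, key: bytes) -> str:
    """Reverse f_encrypt and restore the original file."""
    def unseal(match):
        plain = _keystream_xor(key, bytes.fromhex(match.group(1)))
        return OPEN + plain.decode("utf-8") + CLOSE
    return PATTERN.sub(unseal, text)


key = b"demo key held by the data owner"
document = "Unit at [[S]]grid 42[[/S]] moves at dawn."
partial = f_encrypt(document, key)
restored = f_decrypt(partial, key)
```

Group members without the key can still read the unmarked text, while the marked span stays opaque until the data owner decrypts it.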
5.2 Concrete Scheme

The proposed scheme consists of six algorithms: PreProcess, Setup, Extract, ConfKeyGen, Sign, and Verify. This section gives a concrete definition of the proposed scheme.

• PreProcess
It is initiated by the data owner and hides the sensitive information present in the data to be shared with other members. The data to be shared is first divided into a number of blocks. The blocks are constructed in such a way that a separate block is created for each sensitive part, which is identified with the help of specially placed markers in the original file. All the sensitive blocks are then encrypted with the AES algorithm, and finally all the blocks are joined together to create a partially encrypted file. This file is shared by the data owner with the other members of the group, as illustrated in Fig. 3.

Fig. 3 Generation of partially encrypted file

PreProcess makes use of two algorithms, F_Encrypt and F_Decrypt. F_Encrypt divides the file into blocks and encrypts the sensitive blocks with AES to generate the partially encrypted file. This partially encrypted file is shared with the help of the block design-based key agreement protocol, and a signature for the file is then generated using the identity-based blind signature. F_Decrypt is the reverse of F_Encrypt: it decrypts the sensitive information and restores the original file.

• Setup
This step generates the public parameters and the master key of the TPA. Let $G_1$, $G_2$, $G$, and $\hat{e}$ be the parameters of the Weil pairing, let $H_1$ and $H_2$ be hash functions, and let $p$ and $q$ be prime numbers. Generate the public parameters $\{p, q, G_1, G_2, G, \hat{e}, P_{pub}, H_1, H_2\}$ and the secret key $s \in \mathbb{Z}_q^*$ of the TPA, which acts as the master key.

• Extract
This algorithm generates the cryptographic keys of the group members. Given the identity $ID_i \in \{0,1\}^*$ of participant $i$, compute the signer's public key $H_1(ID_i)$ and his secret key $s_i = sH_1(ID_i)$. For authentication, the RSA algorithm is used in the proposed scheme. For each participant, the TPA chooses a public key $e_i$ and a private key $d_i$, and then distributes the pair $(e_i, n)$ to all the participants. Here $n$ is the product of two large prime numbers $p$ and $q$. Compute $Y_i = H_2(ID_i)$ and $X_i = (Y_i)^{d_i}$. Keep $(d_i, X_i)$ secret.

• ConfKeyGen
1. Select a random number $r_i$ as the secret key and, for each participant, compute $M_i = \hat{e}(G, e_i r_i S)$.
2. Find $w_i = H_2(M_i, t_i)$ and then compute $T_i = X_i\,\hat{e}(G, w_i r_i S_i)$. $Y_i$, $T_i$, and the time stamp $t_i$ are used for authentication.
3. If $j \in E_i$ ($j \neq i$), participant $i$ receives from participant $j$ the message $D_j = \{Y_j, (M_j)^{e_i}, T_j, t_j\}$.
4. According to the block structure of $E$, there are the following cases:
   a. Participant 0 receives messages from participant $j$, where $1 \le j \le k$.
   b. Participant $i$, where $i \le k$, receives messages from participant $j$, where $j = mk + 1 + \mathrm{MOD}_k\,(i-1)(m-1)$, given that $j \neq i$.
   c. Participant $i$, where $i = E_{m,m}$, receives messages from participant 0 and participant $j$, where $j = ((i-1)/k)k + 1$, given that $j \neq i$.
   d. The remaining members receive messages from participant $(i-1)/k$ and participant $j$, where $j = mk + 1 + \mathrm{MOD}_k\,(mx - x - m + r)$, given that $j \neq i$ and $r = 2, 3, \ldots, k$.
5. After each member of the group has received the messages needed to generate the common conference key, compute $M_j = \left((M_j)^{e_i}\right)^{d_i}$, where $j \in E_i - \{i\}$, to decipher the message.
6. Participant $i$ computes $T_j^{e_j} / M_j^{w_j^*}$, where $w_j^* = H_2(M_j, t_j)$, to authenticate participant $j$. If the result equals $Y_j$, authentication is successful.
7. Compute $C_{i,j} = \prod_{x \in E_i - \{j\}} M_x$, followed by the common conference key $K = M_i \prod_{j\,:\,i \in E_j} C_{j,i}$.
• Sign
1. Select a number $r \in \mathbb{Z}_q^*$ and compute $R = rP$, which is used as the commitment.
2. Randomly select two numbers $a$ and $b$ from $\mathbb{Z}_q^*$ as blinding factors. Compute $c = H(m, \hat{e}(bH_1(ID_i) + R + aP, e_i)) + b \pmod q$ to blind the message $m$, and send it to the signer.
3. The signer sends back $S = c\,s_i + r\,e_i$ as a signature.
4. To unblind, compute $S' = S + a\,e_i$ and $c' = c - b$, and output $\{m, S', c'\}$.
5. Output $(S', c')$ as the blind signature of the message.

• Verify
If $c' = H(m, \hat{e}(S', P)\,\hat{e}(H_1(ID_i), e_i)^{-c'})$, the signature is valid.
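The Sign and Verify steps can be traced end to end on a toy bilinear map. The sketch below is a hedged illustration and not the authors' implementation. It replaces the Weil pairing with the insecure toy map $e(x, y) = g^{xy} \bmod p$ on $(\mathbb{Z}_q, +)$ with generator $P = 1$, uses SHA-256 reduced modulo $q$ for both hash functions, and assumes that the public value written $e_i$ in the steps above plays the role of $P_{pub} = sP$ as in the blind signature of [3]. All parameter values and function names are hypothetical.

```python
import hashlib

P_MOD, Q_ORD, GEN = 2039, 1019, 4   # toy parameters: q divides p - 1, GEN has order q
BASE = 1                            # generator "P" of the toy group (Z_q, +)


def pairing(x: int, y: int) -> int:
    """Toy symmetric bilinear map: e(x, y) = g^(x*y) mod p (no security)."""
    return pow(GEN, (x * y) % Q_ORD, P_MOD)


def hash_to_zq(*parts) -> int:
    """Hash arbitrary values into Z_q (stand-in for H and H_1)."""
    data = "|".join(str(p) for p in parts).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q_ORD


def extract(master: int, identity: str):
    """Return (H_1(ID), s * H_1(ID)) for the given identity."""
    q_id = hash_to_zq("H1", identity)
    return q_id, (master * q_id) % Q_ORD


def blind(message, q_id, commitment, p_pub, a, b):
    """User: c = H(m, e(b*Q_ID + R + a*P, P_pub)) + b mod q."""
    t = pairing((b * q_id + commitment + a * BASE) % Q_ORD, p_pub)
    return (hash_to_zq("H", message, t) + b) % Q_ORD


def sign(c, secret, r, p_pub):
    """Signer: S = c * s_ID + r * P_pub."""
    return (c * secret + r * p_pub) % Q_ORD


def unblind(s, c, a, b, p_pub):
    """User: S' = S + a * P_pub and c' = c - b."""
    return (s + a * p_pub) % Q_ORD, (c - b) % Q_ORD


def verify(message, s_prime, c_prime, q_id, p_pub):
    """Check c' == H(m, e(S', P) * e(Q_ID, P_pub)^(-c'))."""
    inverse_term = pow(pairing(q_id, p_pub), (-c_prime) % Q_ORD, P_MOD)
    t = pairing(s_prime, BASE) * inverse_term % P_MOD
    return c_prime == hash_to_zq("H", message, t)


master = 123                          # TPA master key s
p_pub = (master * BASE) % Q_ORD       # P_pub = s * P
q_id, secret = extract(master, "alice")

r, a, b = 77, 5, 9                    # commitment nonce and blinding factors
commitment = (r * BASE) % Q_ORD       # R = r * P
message = "partially encrypted file digest"

c = blind(message, q_id, commitment, p_pub, a, b)
s = sign(c, secret, r, p_pub)
s_prime, c_prime = unblind(s, c, a, b, p_pub)
valid = verify(message, s_prime, c_prime, q_id, p_pub)
```

The signer only ever sees the blinded value `c`, yet the unblinded pair verifies against the signer's identity, because the blinding terms cancel in the pairing product.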
6 Performance Evaluation

This section presents a performance analysis of the proposed scheme, assuming that the underlying building blocks are secure. The algorithms are implemented in C with the help of the Pairing-Based Cryptography (PBC) and GNU Multiple Precision Arithmetic (GMP) libraries, and were executed on a Windows 10 machine with an Intel Core i7-4770 CPU at 3.40 GHz and 8 GB of memory. The minimum execution time taken by each algorithm of the proposed scheme is shown in Fig. 4. All the algorithms have a very low time cost, and hence the computations involved are not expensive.

Fig. 4 Execution time taken by algorithms

The time taken by the ConfKeyGen algorithm to generate the common conference key for different numbers of participants is shown in Fig. 5. The time taken increases with the number of participants, but at a very slow rate. This means that participants can be added without much concern about delay in key generation.

Fig. 5 Time taken by ConfKeyGen

The signature generation and verification times are shown in Figs. 6 and 7, respectively. The graphs show that the time taken to sign and to verify increases with the number of files, but at a very slow rate. Hence the simulation indicates that the proposed scheme has been implemented successfully and that its performance is stable.
7 Conclusion

The proposed scheme ensures the security of sensitive information in a file shared among a group of users in a cloud-based environment by encrypting it with AES, one of the best available algorithms. The use of a key agreement protocol based on block design allows group sharing of data with the help of a common conference key, and the addition of the identity-based blind signature strengthens security further. It helps ensure the integrity of data by allowing any member of the group to check the data's integrity simply by sending a request to the TPA. A performance analysis shows that the proposed scheme is stable and efficient.
Fig. 6 Time taken for signature generation
Fig. 7 Time taken for signature verification
8 Future Work

The proposed scheme does not support features such as dynamic file updating, user accountability, and user revocation. Without these features, it is not possible to update a file without downloading it, to limit file access to specific members of the group, or to identify the signer of a file. Enhancements to the proposed scheme are therefore needed to support these features.
References

1. The NIST Definition of Cloud Computing. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
2. J. Shen, T. Zhou, D. He, Y. Zhang, X. Sun, Y. Xiang, Block design-based key agreement for group data sharing in cloud computing. IEEE Trans. Depend. Secure Comput. 16(6), 996–1010 (2019)
3. F. Zhang, K. Kim, ID-based blind signature and ring signature from pairings, in International Conference on the Theory and Application of Cryptology and Information Security (Springer, 2002), pp. 533–547
4. W. Liu, Research on cloud computing security problem and strategy, in Consumer Electronics, Communications and Networks (CECNet), 2012 2nd International Conference (2012), pp. 1216–1219
5. B. Thiyagarajan, R. Kamalakannan, Data integrity and security in cloud environment using AES algorithm, in Information Communication and Embedded Systems (ICICES), 2014 International Conference (2014), pp. 1–5
6. M.F. Al-Jaberi, A. Zainal, Data integrity and privacy model in cloud computing, in Biometrics and Security Technologies (ISBAST), 2014 International Symposium (2014), pp. 280–284
7. P. Ora, P.R. Pal, Data security and integrity in cloud computing based on RSA partial homomorphic and MD5 cryptography, in Computer, Communication and Control (IC4), 2015 International Conference (2015), pp. 1–6
8. V.S. Mahalle, A.K. Shahade, Enhancing the data security in cloud by implementing hybrid (RSA & AES) encryption algorithm, in Power, Automation and Communication (INPAC), 2014 International Conference (2014), pp. 146–149
9. R. Kaur, R.P. Singh, Enhanced cloud computing security and integrity verification via novel encryption techniques, in Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference (2014), pp. 1227–1233
10. M.Z. Meetei, A. Goel, Security issues in cloud computing, in 2012 5th International Conference on BioMedical Engineering and Informatics (BMEI 2012)
11. A. Kumar, B.G. Lee, H.J. Lee, A. Kumari, Secure storage and access of data in cloud computing, in 2012 International Conference on ICT Convergence (2012), pp. 336–339
12. M. Hamdi, Security of cloud computing, storage, and networking, in Collaboration Technologies and Systems (CTS), 2012 International Conference (2012), pp. 1–5
13. E.M. Mohamed, H.S. Abdelkader, S. El-Etriby, Enhanced data security model for cloud computing, in Informatics and Systems (INFOS), 2012 8th International Conference (2012), pp. CC12–CC17
14. G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, D. Song, Provable data possession at untrusted stores, in Proceedings of the 14th ACM Conference on Computer and Communications Security (ACM, 2007), pp. 598–609
15. A. Juels, B.S. Kaliski, PORs: proofs of retrievability for large files, in Proceedings of the 14th ACM Conference on Computer and Communications Security (ACM, 2007), pp. 584–597
16. C. Wang, Q. Wang, K. Ren, W. Lou, Privacy-preserving public auditing for data storage security in cloud computing, in INFOCOM, 2010 Proceedings IEEE (IEEE, 2010), pp. 1–9
17. B. Wang, B. Li, H. Li, Oruta: privacy-preserving public auditing for shared data in the cloud, in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on (IEEE, 2012), pp. 295–302
18. J. Yuan, S. Yu, Proofs of retrievability with public verifiability and constant communication cost in cloud, in Proceedings of the 2013 International Workshop on Security in Cloud Computing (ACM, 2013), pp. 19–26
19. Y. Dodis, S. Vadhan, D. Wichs, Proofs of retrievability via hardness amplification, in Theory of Cryptography Conference (Springer, 2009), pp. 109–127
20. C.C. Erway, A. Küpçü, C. Papamanthou, R. Tamassia, Dynamic provable data possession. ACM Trans. Inf. Syst. Secur. (TISSEC) 17, 15 (2015)
21. B. Wang, B. Li, H. Li, Public auditing for shared data with efficient user revocation in the cloud, in IEEE INFOCOM 2013 (Turin, Italy, IEEE, 2013), pp. 2904–2912
22. J. Yuan, S. Yu, Efficient public integrity checking for cloud data sharing with multi-user modification, in INFOCOM, 2014 Proceedings IEEE (IEEE, 2014), pp. 2121–2129
23. S. Benabbas, R. Gennaro, Y. Vahlis, Verifiable delegation of computation over large datasets, in Annual Cryptology Conference (Springer, 2011), pp. 111–131
24. D. Catalano, D. Fiore, Vector commitments and their applications, in Public-Key Cryptography–PKC 2013 (Springer, 2013), pp. 55–72
25. M. Backes, D. Fiore, R.M. Reischuk, Verifiable delegation of computation on outsourced data, in Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (ACM, 2013), pp. 863–874
26. D. Chaum, E. Van Heyst, Group signatures, in Workshop on the Theory and Application of Cryptographic Techniques (Springer, 1991), pp. 257–265
27. W. Stallings, Cryptography and Network Security: Principles and Practice, 5th edn. (2011)
Novel Umbrella 360 Cloud Seeding Based on Self-landing Reusable Hybrid Rocket
Satyabrat Shukla, Gautam Singh, Saikat Kumar Sarkar, and Purnima Lala Mehta
Abstract Insufficient rainfall has always been a problem in agriculture: some areas receive good rain while others receive none at all. Drought-hit areas suffer rising temperatures, extreme pollution, and disturbance of the plants’ respiration process. Crop yields fall short, which hits farmers’ finances hard, and whatever produce is available becomes less affordable for consumers. Existing cloud-seeding methods have been tried before but are costly, less effective, risky, and time-consuming. In this paper, we propose an umbrella-based 360-degree design of a self-landing hybrid rocket for cloud seeding, intended to address the problems mentioned above effectively.
Keywords Artificial rain · Cloud seeding · Self-landing · Hybrid rocket · 360 umbrella mechanism
S. Shukla (B) · G. Singh · S. K. Sarkar · P. L. Mehta, Department of CSE, IILM College of Engineering and Technology, Noida, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_86
1 Introduction Rain forms as moisture accumulates around airborne particles such as dust and sand until the air reaches saturation, the point at which it can no longer hold the weight of that moisture, and the droplets fall as rain. Cloud seeding is the dispersion into the air (clouds) of substances that serve as cloud condensation or ice nuclei; it is a form of weather modification that aims to increase precipitation by altering the cloud composition.
Water is one of the essential commodities on earth sustaining human life. In many regions of the world, traditional sources and supplies of groundwater, rivers, and reservoirs are either inadequate or under threat from ever-increasing demands on water from changes in land use and growing populations. Only a small part of the available moisture in clouds is transformed into precipitation that reaches the surface. This has led scientists and engineers to pursue cloud seeding as a way of increasing water supplies, i.e., rainmaking. Cloud seeding accelerates precipitation by providing additional nuclei around which water droplets can accumulate and condense.
1.1 Background Cloud-seeding techniques are new in India and are not used much in the country at present. Although some experiments have been performed, none has shown success. Heavy rain damaged several parts of Maharashtra over some 6–7 years; ironically, for the past 2–3 years these same areas have been declared drought areas. Artificial rainmaking has been attempted several times, with little or no success. In the Nashik project [1, 2], which used a rocket-based cloud-seeding technique, five rockets were launched, but only two hit the target, which was not enough to produce adequate rainfall. Within India, this technique was first used in Maharashtra. For another project, in Aurangabad District (Maharashtra), four King Air B-200 planes [2] were arranged to carry the silver iodide, and a Doppler radar was imported from America for the cloud seeding. Though the experiment was completed, its results were not satisfactory, and it was too expensive for further trials. The southern states made the most recent attempt at cloud seeding to bring back the rain, after seasonal monsoons failed in these regions for the third consecutive year and drought hit Karnataka [3]. Mass migration and farmer suicides mark the agricultural crisis in the state. A cloud-seeding program has been approved in Karnataka [4] at a cost of Rs. 88 crores, further supported by the cabinet at Rs. 93 crores including other expenses [4]; being a plane-based seeding process, however, it is expensive and time-consuming. Thus, although various cloud-seeding projects have been implemented, none has produced proper, satisfactory results. Cloud-seeding adoption is evidently at an early stage in India, and achieving it would be a milestone in the history of India.
1.2 Motivation We can create endless solutions for water management, but water must first be available. Around 2050 [5], India will face a vast and severe problem of water scarcity, which will make matters worse, and we cannot afford to stay still and do nothing. Increasing pollution and temperature, drought, and disturbance of the water cycle are together creating a problem that was not faced earlier. Cloud seeding is not a complete solution, but it can play a large part in restoring balance to the water cycle and setting off a chain reaction that reduces the abovementioned problems.
2 State of the Art Several cloud-seeding methods are described in the literature. Some of these approaches are as follows:
(a) Plane-based cloud seeding: Flares that produce small salt particles are attached to the trailing edge of the wings of seeding aircraft and ignited in updrafts below the cloud base of convective storms. This method overcomes most of the problems and difficulties faced in handling and using hygroscopic materials, difficulties which had made seeding with ice nuclei (AgI) a more attractive option [6, 7].
(b) Drone-based cloud seeding: The “Sandoval Silver State Seeder,” a new drone built by the Desert Research Institute [8], deploys silver-iodide flares to kick off rainfall. The cloud-seeding drone recently had its first test flight in the USA. It did not produce any rain, but it climbed only 400 feet. Cloud seeding is not really about creating weather from nothing but about getting precipitation to fall when and where we want it, and it is not a means of ending droughts.
(c) Ground generators: Ground-based generators are widely used worldwide in weather modification, seeding clouds for precipitation enhancement and hail suppression. In particular, in the United States, Morocco, Cuba, and many other countries, ground generators are used to increase precipitation, while in France, Spain, and Brazil they are used for anti-hail protection [9]. Because ground-based generators are sometimes economically preferable to airplane or rocket technology, development of ground-based and firework ice-forming aerosol generators began in 2005 [10, 11].
(d) Electric rainmaking technology: The ion technology’s backers think their idea beats [12] cloud seeding for several reasons. It produces more rainfall, and it does not need clouds to be in the area to work. It is also expected, in theory, to be less expensive, because it does not require aircraft to spread chemicals, the usual method. Further, they believe that changing the polarity and quantity of the ions could reduce rainfall where it is too plentiful, prevent hail, and even break up fog at airports [12]. To these claims, Earthwise adds that its technology reduced air pollution [13, 14] in trials in Mexico City and Salamanca, because the condensation it caused warmed the air, creating an updraft that carried away pollution [14].
(e) Rocket-based cloud seeding: Countries like China, as well as companies and agencies, are using rockets for cloud seeding: an innovative network of artificial-intelligence-enabled strategic micro-rocket launches combined with a distributed grid of climatic sensors and spreading technologies. One example, associated with ACAP and Stroyproject, is the “LOZA” missile protection system, which is designed to act powerfully on clouds by spraying chemical reagent into them [15–17].
(f) Laser technologies: Laser-induced [18] condensation has recently been proposed as a possible alternative to more traditional rain enhancement techniques such as hygroscopic [19] and crystallizing seeding, owing to its potential for triggering condensation in sub-saturated conditions. Although lasers have been shown to produce condensation on very local scales by generating CCN in sub-saturated air, questions remain about the relevance of this technology to precipitation enhancement; the approach therefore currently lacks the scientific basis for enhancing precipitation in the atmosphere [9, 20].
(g) Acoustic waves: A hail or acoustic cannon is a shock wave generator [21]. The shock wave travels at the speed of sound through the cloud forming above, a disturbance which manufacturers claim increases the collision-coalescence growth of tiny water droplets, thus producing bigger raindrops. The cannon is also claimed to disrupt the growth phase of hailstones. Reviews of the application of cannons to weather modification find no scientific basis for this methodology.
Recent work on predicting weather conditions with basic sensors for factors such as humidity, temperature, and wind gives an overview of how weather initiates and evolves with time and environmental conditions [22]. Modeling studies of seeding with liquid CO2 report enhanced rain formation and, ultimately, greater total rainfall [23].
Nevertheless, contributions made in past years have laid the groundwork for future development of weather enhancement methods. Their applications lie in agriculture and the environment, offering solutions to severe problems such as drought control [24]: a long, continuous cloud-seeding program can reduce the impact of drought slowly and steadily, since increased precipitation before and after a drought would temper the reduction in rainfall during the drought period. Cloud seeding can also be used for extended water management, since it creates rain and so provides water [2, 9]. India has made progress in providing a well-sanitized water supply for the masses, but as the population increases its resources are becoming endangered. Despite improvements to drinking water, many water resources are becoming polluted. Water scarcity in India is also predicted to worsen as the overall population is expected to increase to 1.6 billion by the year 2050. Beyond these problems, water crises are set to become a global concern.
i. Pollution control: As raindrops fall, they attract aerosol and carbon particles and carry them to the ground [12, 13], and they clean the trees, making the respiration of plants more effective. Rain thus helps control pollution.
ii. Temperature control: Humidity aside, rain and temperature are inversely related. Clouds shield the land from direct sun rays, acting like a plastic shed, so the temperature falls; it falls further as the cold water droplets bring it down to a suitable level, and this may result in a cooling effect for the Earth. Some scientists may see this as a potential benefit, as this cooling may offset the warming caused by climate change. For example, Dubai had a successful artificial rain project last year and is opting for a further project to control temperature and pollution.
iii. Hail suppression [24, 25]: To prevent hail damage, it is necessary to transform dangerous convective clouds so that large hailstones cannot form. The number of ice crystals in the cloud is small, and under appropriate conditions they grow into hailstones of increased size.
In terms of investment, Dubai has spent $11 million [26], China $168 million [27], and the USA $15 million [28]; India plans an estimated 10 million [4], and other countries have invested as well. All these cloud-seeding projects seed by drone or, mostly, by plane, which makes them immensely costly and inefficient.
3 Umbrella 360 Cloud-Seeding Approach: The Proposed Concept The umbrella 360 cloud-seeding approach is an omni-directional 360-degree seeding technology (described below): it uses eight solid rocket seeders placed at a particular angle in a controlled launching mechanism (Figs. 4 and 5), together with Thrust Vector Control (TVC), which enables station keeping (hovering in air, refer to Fig. 1), and a self-propulsion system. Umbrella 360 seeding increases the seeded area by changing the seeding angle (Fig. 6) on every launch, which reduces the cost and increases the probability of rainmaking. For drought areas, we can target rain clouds [24] and move them to those regions to provide rain. Figure 1 shows the trajectory of the rocket: how it launches, enters the cloud, and hovers, and finally how the reaction control system (RCS) helps retrieve the rocket safely, ready to be relaunched. Figure 2 represents the launch-land trajectory at different altitude levels, divided into the following segments:
O to A: Engine at full thrust (takeoff), gimbaling to overcome drag.
At A: Maximum powered height; the engine goes offline.
A to B: Inertial ascent until apogee.
At B: Apogee (maximum height).
B to C: RCS in action, controlling the seeding activity and the rocket in free fall.
C to D: Engine partially online to slow the rocket down.
D to E: Final burn for self-landing, with RCS and landing legs online.
Figure 3 shows the umbrella mechanism opened, which provides a takeoff platform for the rocket seeders. 360-degree mechanism: The rocket places itself at the most probable center of the cloud and ejects the seeders through 360 degrees, covering a large diameter of cloud in a single takeoff: where a plane covers 1 km, the rocket covers 4 km.
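As a rough illustration of the inertial ascent from A to B, the height gained while coasting can be estimated from the speed at engine cut-off. The sketch below assumes a vertical flight with no drag and constant gravity, and the cut-off altitude and speed are hypothetical values chosen only to fall inside the 2–5 km range of Fig. 2; none of these numbers or function names come from the paper.

```python
G = 9.80665  # standard gravity in m/s^2, assumed constant over the flight


def coast_apogee(cutoff_altitude_m, cutoff_speed_ms):
    """Altitude of point B for a vertical, drag-free coast from point A."""
    # Energy balance: 0.5 * v^2 = g * (h_B - h_A)
    return cutoff_altitude_m + cutoff_speed_ms ** 2 / (2.0 * G)


def coast_time(cutoff_speed_ms):
    """Seconds spent coasting from A to B under the same assumptions."""
    return cutoff_speed_ms / G


# Hypothetical cut-off state at point A: 2000 m altitude, 200 m/s upward.
apogee_m = coast_apogee(2000.0, 200.0)
print(f"apogee ~ {apogee_m:.0f} m after {coast_time(200.0):.1f} s of coasting")
```

A real trajectory would come out lower, because drag acts throughout the coast; the estimate is only an upper bound under the stated assumptions.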
Fig. 1 Rocket launch-land trajectory
Fig. 2 Rocket launch-land trajectory for 2–5 km
Fig. 3 Umbrella mechanism completely open
The mechanism also includes angle spreading, which ensures that every part of the circular area is covered by changing the seeding angle relative to the previous launch. Figure 4 shows the first launch, which covers part of the whole circle according to the planned seeding cycle. Figure 5 shows that each launch covers a different area simply by changing the rocket’s roll seeding angle to target the remaining part of the circle of the seeding cycle. Lastly, Fig. 6 shows that once each launch has covered a different area of the cycle, the seeding coverage becomes dense, which increases the seeded area and the probability of rain formation [9]. Figure 7 illustrates the landing mechanism: the upper part holds the oxidizer (N2O/O2), and the lower part holds solid fuel embedded in a gyro mechanism. Two independent motors control the pitch and yaw of the rocket nozzle, allowing the rocket to maneuver independently and to manage its acceleration, deceleration, and position with an onboard gyro sensor. Figure 8 illustrates the rocket fuel and oxidizer flow layout (how the engine works). In simple words, the oxidizer is released into the fuel through an inlet and, once ignited, the fuel burns in the excess oxygen provided by the N2O tank. The energy is released in the form of thrust, which accelerates the rocket forward.
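The angle spreading described for Figs. 4 to 6 can be sketched as a small calculation of seeder directions per launch. The eight seeders come from the text; the even 45-degree spacing, the four launches per seeding cycle, and the function name are assumptions made purely for illustration.

```python
def seeding_azimuths(launch_index, seeders=8, launches_per_cycle=4):
    """Azimuths (degrees) of the seeders for one launch of a seeding cycle."""
    spacing = 360.0 / seeders                 # angle between adjacent seeders
    roll_step = spacing / launches_per_cycle  # extra roll applied per launch
    offset = (launch_index % launches_per_cycle) * roll_step
    return [(offset + k * spacing) % 360.0 for k in range(seeders)]


# One full (hypothetical) four-launch cycle fills the circle with 32 directions.
for launch in range(4):
    print(launch, seeding_azimuths(launch))
```

Under these assumptions each launch rolls the seeder pattern by 11.25 degrees, so successive launches fill the gaps left by earlier ones and the coverage becomes progressively denser.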
Fig. 4 Real angle seeding
Fig. 5 Change angle seeding
Fig. 6 Complete seeding
Fig. 7 Self-landing mechanism
Fig. 8 Oxidizer and engine workings/flow
4 The Business Aspect Along with its problem-solving potential, the approach is also a good business idea. The global market with conventional methods is estimated at 300–500 million dollars [26–28], and as the technology improves, the business could expand to massive levels, perhaps becoming a billion-dollar industry. This speculation covers cloud-seeding applications alone. The picture shifts again once we include drought control, pollution control, hail and snow suppression, and early acidic rain for preventing crop damage. A market that large naturally creates customers, i.e., government bodies and contractors. Customer willingness to pay rises directly with criticality, which should ensure strong business growth. The business can be financed either by bootstrapping or through funds from the government or from investors, increasing its valuation along with demand. Figure 9 shows the area of business potential in terms of TAM (Total Available Market), SAM (Serviceable Available Market), and SOM (Serviceable Obtainable Market). Figure 10 shows the relation between criticality and customer willingness to pay: as water becomes scarce in almost every region, people (customers) will be ready to pay any legitimate price for a solution to the problem.
Fig. 9 Market analysis (TAM: 3,500 Cr; SAM: 300–500 Cr; SOM: 200 Cr)
Fig. 10 Comparison between criticality and customer willing to pay
5 Conclusion Plentiful and healthy agricultural produce requires a sufficient amount of rain. Farmers struggle against nature to obtain a good and healthy crop yield, especially in drought-hit areas. This paper proposed a 360-degree umbrella mechanism for cloud seeding using a reusable, self-landing hybrid rocket, as an eco-friendly, sustainable, and fair option whose reusability helps combat the agricultural problems of areas that receive little or no rain. We described the launching and landing trajectory of the hybrid rocket and discussed the 360-degree omni-directional cloud-seeding method with suitable diagrams. Lastly, we concluded the paper by discussing the business aspect of adopting the cloud-seeding method.
References
1. Nashik: Rocket finally fired in the dry zone, cloud seeding to bring in rain | Nashik News, Times of India. https://timesofindia.indiatimes.com/city/nashik/Nashik-Rocket-finally-firedin-dry-zone-cloud-seeding-to-bring-in-rain/articleshow/48510296.cms. Accessed 30 Jan 2020
2. Maharashtra Government Plans Cloud-Seeding in Drought-Hit Regions of State in August, https://weather.com/en-IN/india/news/news/2019-05-29-maharashtra-governmentplans-cloud-seeding-in-drought-hit-regions-of. Accessed 31 Jan 2020
3. J.R. Kulkarni, S.B. Morwal, N.R. Deshpande, Rainfall enhancement in Karnataka state cloud seeding program “Varshadhare” 2017. Atmos. Res. 219, 65–76 (2019). https://doi.org/10.1016/j.atmosres.2018.12.020
4. Karnataka Cabinet Approves Cloud Seeding Programme, News18, https://www.news18.com/news/india/karnataka-cabinet-approves-cloud-seeding-programme-2161895.html. Accessed 30 Jan 2020
5. P. Mehta, Impending water crisis in India and comparing clean water standards among developing and developed nations. 11 (2012)
6. R.T. Bruintjes, A review of cloud seeding experiments to enhance precipitation and some new prospects. Bull. Am. Meteorol. Soc. 80, 805–820 (1999)
7. UAE Research Program for Rain Enhancement Science, http://www.uaerep.ae/. Accessed 30 Jan 2020
8. What is cloud seeding? https://www.dri.edu/cloudseeding/about-the-program. Accessed 30 Jan 2020
9. M. Sioutas, Hail characteristics and cloud seeding effect for hail suppression in Central Macedonia, Greece, in Perspectives on Atmospheric Sciences, ed. by T. Karacostas, A. Bais, P.T. Nastos (Springer International Publishing, Cham, 2017), pp. 271–277. https://doi.org/10.1007/978-3319-35095-0_38
10. ScienceDirect Snapshot, https://www.sciencedirect.com/science/article/pii/016980959400088U
11. M. Murakami, Japanese Cloud Seeding Experiments for Precipitation Augmentation (JCSEPA): New Approaches and some results from wintertime and summertime weather modification programs. 4
12. Electric Rainmaking Technology Gets Mexico’s Blessing, IEEE Spectrum, https://spectrum.ieee.org/energy/environment/electric-rainmaking-technology-gets-mexicosblessing. Accessed 30 Jan 2020
13. Can rain clean the atmosphere? http://news.mit.edu/2015/rain-drops-attract-aerosols-cleanair-0828. Accessed 30 Jan 2020
14. Urban Runoff Pollution Control Quantity and Its Design Rainfall in China, China Water & Wastewater (2008), issue 22, http://en.cnki.com.cn/Article_en/CJFDTOTALGSPS200822006.htm
15. Stroyproject: about us. Manufacturer of LOZA ROCKETS, https://www.cloud-seeding.info/page.php?id=2&lang=1. Accessed 30 Jan 2020
16. The Greater Saint John Cloud Seeding Program, http://www.acapsj.org/cloud-seeding. Accessed 30 Jan 2020
17. V. Horvat, B. Lipovscak, Cloud seeding with the TG-10 rockets. J. Weather Modif. 15, 56–61 (2012)
18. K. Yoshihara, Laser-induced mist and particle formation from ambient air: a possible new cloud seeding method. Chem. Lett. 34, 1370–1371 (2005). https://doi.org/10.1246/cl.2005.1370
19. S. Malik, Division of Environmental Sciences, SKUAST K Shalimar, Srinagar, J & K, India. Cloud seeding; its prospects and concerns in the modern world: a review. Int. J. Pure, Appl. Biosci. 6, 791–796 (2018). https://doi.org/10.18782/2320-7051.6824
20. Seeding Change in Weather Modification Globally | World Meteorological Organization, https://public.wmo.int/en/resources/bulletin/seeding-change-weather-modification-globally. Accessed 30 Jan 2020
21. M.P. Foster, J.C. Pflaum, Acoustic seeding. J. Weather Modif. 17, 38–44 (2012)
22. A. Malhotra, S. Som, S.K. Khatri, IoT based predictive model for cloud seeding, in 2019 Amity International Conference on Artificial Intelligence (AICAI) (2019), pp. 669–773. https://doi.org/10.1109/AICAI.2019.8701412
23. H. Xiao, W. Zhai, Z. Chen, Y. He, D. Jin, A modeling method of cloud seeding for rain enhancement, in Current Trends in High-Performance Computing and Its Applications, ed. by W. Zhang, W. Tong, Z. Chen, R. Glowinski (Springer, Berlin, Heidelberg, 2005), pp. 539–543. https://doi.org/10.1007/3-540-27912-1_74
24. 1520-0450(1965)0040553CSATDI2.0.pdf, https://journals.ametsoc.org/doi/pdf/10.1175/1520-0450%281965%29004%3C0553%3ACSATDI%3E2.0.CO%3B2
25. Hail Suppression Agency, https://www.weathermod-bg.eu/index_en.php. Accessed 30 Jan 2020
26. The cost of cloud seeding in the UAE, https://www.arabianbusiness.com/the-cost-of-cloudseeding-in-uae-670857.html. Accessed 31 Jan 2020
27. China creates 55 billion tons of artificial rain a year, and it plans to quintuple that, Quartz, https://qz.com/138141/china-creates-55-billion-tons-of-artificial-rain-ayear-and-it-plans-to-quintuple-that/. Accessed 30 Jan 2020
28. 204166.pdf, https://www.gao.gov/assets/210/204166.pdf
User Detection Using Cyclostationary Feature Detection in Cognitive Radio Networks with Various Detection Criteria
Budati Anil Kumar, V. Hima Bindu, and N. Swetha
Abstract The non-cooperative detection methods used to identify a user’s presence in Cognitive Radio (CR) networks are Energy Detection (ED), Matched Filter Detection (MFD), and Cyclostationary Feature Detection (CFD). The signal power threshold is the critical parameter for deciding whether the user is present in or absent from the spectrum. In the literature, various authors have studied spectrum sensing using CFD with static and predefined dynamic thresholds. In this paper, the authors propose a novel CFD method with an inverse covariance approach, in which the dynamic threshold is estimated using the Generalized Likelihood Ratio Test (GLRT) and Neyman–Pearson (NP) observer detection criteria. The results show that, compared with existing methods, the proposed method increases the probability of detection (PD) and reduces the probabilities of false alarm (Pfa) and missed detection (Pmd). The results are simulated in Matlab and analyzed in terms of these three parameters.
Keywords CFD · CR · Spectrum sensing · GLRT · NP observer
B. A. Kumar (B) · V. Hima Bindu · N. Swetha, Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_87
1 Introduction CFD senses Primary User (PU) transmissions by exploiting the cyclostationary features of the received signals [1]. Cyclostationary detection is one of the non-cooperative detection methods in CR networks. Cabric et al. [2] studied spectrum sensing based on a cyclostationary feature detector with various modulation techniques. The modulated signals considered have the properties of pulse trains, hopping sequences, cyclic prefixes, etc. When the data is a stationary random process, the modulated signals are characterized as cyclostationary because their statistics, such as mean and autocorrelation, are periodic. The authors estimated the static threshold using the cyclic autocorrelation function (CAF) and evaluated the results through the Power Spectral Density (PSD) at various SNR values [2]. Fehske et al. [3] observed that, in wireless communication systems, spectrum allocation and channel sensing are the two aspects of interest in licensed CR bands. To classify signals by application, their method applies cyclic spectral analysis to cyclostationary detectors. The performance of the detector depends on the observation and computation time when the signal bandwidth and carrier frequency are unknown. For the case in which a baseband signal is present, they used a neural network for signal classification [3]. Other researchers explored signal detection based on the spectral correlation method in IEEE 802.22. They demonstrated that the spectral correlation function detects signals more effectively, since all human-made signals exhibit cyclostationary properties. In the IEEE 802.22 detection environment, the spectral correlation method reduces the computational complexity. They estimated the magnitude response for various frequencies and analyzed the detection performance at various cyclic frequencies. The features are derived by extracting fundamental statistics such as the mean and autocorrelation, or periodicity in the signal, or they can be induced deliberately to assist spectrum detection [4, 5]. To detect a user’s presence in a given spectrum, the cyclic correlation function is used instead of the PSD. Unlike the ED scheme, cyclostationarity-based detection algorithms can distinguish noise from PU signals.
The results indicate that noise is Wide-Sense Stationary (WSS), whereas modulated signals are cyclostationary, with spectral correlation arising from the redundancy of signal periodicities. Zhao et al. [6] detected the user’s presence using cyclostationary detection with an asymptotically optimal chi-square test. PSK, QAM, OFDM, and GMSK modulation were used to analyze the performance of cyclostationary detection with different delays at the same cyclic frequency. The performance of the detection method was analyzed using the metrics PD, Pfa, and ROC [6]. Bhargavi et al. [7] proposed cyclostationary detection for CR spectrum sensing. Spectral correlation density and magnitude squared coherence methods were used to estimate the static threshold, and the parameters PD, Pfa, and ROC were used to analyze the performance of the detection method [7]. A comparative analysis of various detection methods was carried out, with Pmd analyzed at various SNR values for a fixed Pfa. Bagwari et al. [8] studied the transmitter-detection spectrum sensing techniques MFD, CFD, and ED. The authors found that the CFD technique requires prior information about the user and is complex to implement [8]. The performance parameters PD, Pfa, and Pmd were used to analyze the CFD detection performance. Armi et al. [9] explored CR spectrum sensing using the cyclostationary feature method. They evaluated the performance of CR with ED and with the cyclostationary feature method under AWGN, computed PD and Pfa under a static threshold condition, and assessed the cyclostationary feature method with windowing techniques such as the Hamming, rectangular (rectwin), and Bartlett windows [9]. Salahdine et al. [10] carried out a comparative analysis of ED, MFD, and CFD with a dynamic threshold. Using the NP observer detection criteria, a static threshold is estimated, and the performance metrics Pd, Pf, and Pm are measured [11]. The researchers also worked on a dynamic threshold and reported simulation results for the number of samples versus Pd and Pf. ED is easy to implement, but it cannot differentiate between signal and noise, resulting in a high false alarm rate. Li et al. [11] proposed an algorithm that robustly estimates the carrier frequencies of mixed signals using features of the spectral coherence function [12]. They also evaluated the performance of the proposed algorithm for signal mixtures under different channel conditions. Sardana et al. [12] studied spectrum sensing in CR networks. They observed that channels in a wireless environment suffer from hidden nodes and fading effects, which cause the transmitted signal strength to decay [13]. The authors avoid this problem by applying relay-based cooperative spectrum sensing. Kumar and NandhaKumar [13] proposed cyclostationary spectrum sensing that integrates the cyclic prefix with OFDM. Cyclostationary spectrum sensing with OFDM but without the cyclic prefix was subsequently proposed, so that the transceiver can make effective use of the bandwidth otherwise wasted on the cyclic prefix [14]. The authors analyzed the performance of cyclostationary spectrum sensing with metrics such as PD, Pfa, and bit error rate. CFD with the autocorrelation function and the spectral autocorrelation function has also been proposed, with the user’s presence estimated using a fixed threshold and a fixed dynamic threshold.
In that work, the threshold was estimated under uniform AWGN, and PD and Pf were measured using the GLRT detection criteria alone. The ROC was plotted and analyzed to measure receiver sensitivity, and the cyclic features were computed for one or two cyclic frequencies. If the fixed threshold is set high, PD is low and missed detections may occur; if the threshold is set low, PD is high but false alarms may occur. The threshold is therefore a critical parameter for identifying the user’s presence accurately. Existing research suggests that GLRT is the more suitable detection criterion when the signal characteristics are known, and that the NP observer is more consistent when they are unknown. Threshold measurement using GLRT alone has been proposed. The following gaps are identified in the available literature: the fixed threshold is measured with either GLRT or the NP observer under uniform AWGN at low SNR values, and little attention has been paid to threshold estimation with the NP observer detection criteria or to estimating the probability of missed detection. In the proposed research work, an attempt is made to estimate the optimal threshold with both the GLRT and NP observer detection criteria using nonuniform AWGN. The proposed work also focuses on measuring Pmd along with PD and Pfa.
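The trade-off between threshold, PD, Pfa, and Pmd can be illustrated with a small Neyman–Pearson style calculation. This is a minimal sketch and not the authors’ estimator: it assumes the test statistic is Gaussian under both hypotheses, and the means, standard deviations, and function names are hypothetical. It uses only `NormalDist` from the Python standard library.

```python
from statistics import NormalDist


def np_threshold(target_pfa, noise_mean, noise_std):
    """Threshold that yields the requested false-alarm probability under H0."""
    return NormalDist(noise_mean, noise_std).inv_cdf(1.0 - target_pfa)


def detection_metrics(threshold, h0, h1):
    """Return (PD, Pfa, Pmd) for Gaussian statistics h0 (absent), h1 (present)."""
    pfa = 1.0 - h0.cdf(threshold)
    pd = 1.0 - h1.cdf(threshold)
    return pd, pfa, 1.0 - pd


# Hypothetical statistic: mean 1.0 when the user is absent, 2.0 when present.
h0 = NormalDist(1.0, 0.3)
h1 = NormalDist(2.0, 0.4)
for target in (0.1, 0.01, 0.001):
    thr = np_threshold(target, h0.mean, h0.stdev)
    pd, pfa, pmd = detection_metrics(thr, h0, h1)
    print(f"thr={thr:.3f}  PD={pd:.3f}  Pfa={pfa:.3f}  Pmd={pmd:.3f}")
```

Tightening the target Pfa pushes the threshold up, which lowers PD and raises Pmd, matching the qualitative behavior described above.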
B. A. Kumar et al.
Fig. 1 Block diagram of cyclostationary feature detector
2 Cyclostationary Feature Detection
A signal whose statistical properties vary cyclically with time is called a cyclostationary process. Cyclostationary analysis aggregates statistical parameters such as the mean and variance over a single cycle. This averaging is equivalent to treating the time or phase of the process as a random variable uniformly distributed over one period. This type of analysis is best suited to operations in which a periodic structure is observed with synchronism, and the receiver is designed deliberately for cyclostationary signals. Such signals usually provide a significant amount of information about the actual phase of the message, in the form of pulse-stream synchronization or sinusoidal signal timing. The block diagram of the CFD method is shown in Fig. 1. The CFD method discriminates signal from noise and does not depend on prior information about the user. The cyclostationary detector exploits the spectral redundancy present in almost every human-made signal. The CFD consists of a Fast Fourier Transform (FFT), a correlator, an averaging (integrator) block, and a feature detector, and it relies on inherent cyclostationary properties such as periodic statistics and spectral correlation. The received signal is applied to a Bandpass Filter (BPF), which is used to measure the energy in the band of interest. The output of the BPF is applied to the FFT. The FFT output is passed to the correlator, which correlates the signal and feeds the averaging block. The averaged output is passed to the feature detector, which compares it with the threshold. The result of this comparison indicates the user's absence or presence.
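The correlate, average, and compare steps described above can be sketched in a few lines. The following is a minimal time-domain illustration, not the FFT-based detector of Fig. 1: it assumes the standard cyclic autocorrelation estimator, a cyclic frequency `alpha` expressed in cycles per sample, and a hypothetical helper `has_cyclic_feature` with a hand-picked threshold. None of these names come from the chapter.

```python
import cmath
import math

def cyclic_autocorrelation(y, alpha, lag=0):
    """Estimate (1/N) * sum_n y[n+lag] * conj(y[n]) * exp(-j*2*pi*alpha*n).

    alpha is the cyclic frequency in cycles per sample (an assumed normalisation).
    """
    n_terms = len(y) - lag
    total = 0j
    for n in range(n_terms):
        product = y[n + lag] * complex(y[n]).conjugate()
        total += product * cmath.exp(-2j * math.pi * alpha * n)
    return total / n_terms

def has_cyclic_feature(y, alpha, threshold):
    """Hypothetical detector: compare the magnitude of the estimate with a threshold."""
    return abs(cyclic_autocorrelation(y, alpha)) > threshold

# A real sinusoid at f0 has a cyclic feature at alpha = 2*f0,
# because its squared value is periodic with that frequency.
f0 = 0.1
signal = [math.cos(2 * math.pi * f0 * n) for n in range(1000)]
feature = abs(cyclic_autocorrelation(signal, 2 * f0))
off_feature = abs(cyclic_autocorrelation(signal, 0.13))
```

For this noiseless test signal the estimate is large at the true cyclic frequency and essentially zero elsewhere, which is the property the feature detector relies on to separate a structured signal from stationary noise.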
3 Proposed Work
The importance of the optimum threshold value for identifying users has been discussed in the introduction. In existing methods, uniform AWGN is used for threshold estimation, and most researchers have formulated the threshold using GLRT only. The threshold has been computed as either static or fixed dynamic. In this paper, the dynamic threshold is formulated under nonuniform AWGN using two detection criteria, GLRT and the NP observer. The block diagram of the proposed CFDI is shown in Fig. 2.
User Detection Using Cyclostationary Feature Detection …
Fig. 2 Block diagram of the proposed CFDI
The received signal y(n) is given as input to the N-point FFT in the proposed CFDI method, as shown in Fig. 2. The FFT determines the frequency content of the signal so that spectral analysis can be performed in discrete form; it achieves computational efficiency through a divide-and-conquer approach. The FFT output is applied to a correlator that correlates the present sample value with the previous sample. If it matches the previous sample, the decision for the previous sample is taken as the decision for the present sample. If it does not match, the current sample is forwarded for threshold comparison. In the proposed work, a dynamic threshold is formulated using the GLRT and NP observer detection criteria. To derive the threshold, the received signal y(n) is modeled by two statistical hypotheses $H_0$ and $H_1$, where $H_1$ corresponds to signal plus noise and $H_0$ to noise alone [14]. The received signal y(n) is represented as

$$H_0:\; y(n) = w(n), \qquad H_1:\; y(n) = s(n) + w(n) \tag{1}$$
Here, w(n) is the noise signal, assumed to have zero mean and unit variance, and s(n) is the original signal. In this work, a different noise level is assumed to be present on each channel, so the threshold is estimated for various SNR levels. The threshold is calculated for each of the N channels, and the jointly Gaussian probability density [15] is given as

$$P(y; H_i) = \frac{1}{(2\pi\sigma^2)^{N/2}\,\det(C)^{1/2}} \exp\!\left[-\frac{1}{2}(y - H_i)^T C^{-1}(y - H_i)\right] \tag{2}$$
where $C^{-1}$ is the inverse of the covariance matrix, $T$ denotes the transpose, and $H_i$ is either $H_0$ or $H_1$. To estimate the threshold value, the GLRT detection criterion is applied [16]. The threshold estimation using the GLRT condition is

$$L(Y) = \frac{P(y; H_1)}{P(y; H_0)} = \gamma \tag{3}$$
where $\gamma$ is the initial threshold and $L(Y)$ is the GLRT threshold value. Substituting the probabilities under $H_1$ and $H_0$ from Eq. (2) into Eq. (3) gives

$$R^{\alpha}_{yy^*} = \frac{\exp\!\left[-\frac{1}{2}(y - H_1)^T C^{-1}(y - H_1)\right]}{\exp\!\left[-\frac{1}{2}(y - H_0)^T C^{-1}(y - H_0)\right]} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma \tag{4}$$
where $R^{\alpha}_{yy^*}$ is the autocorrelation function and $\alpha$ is the cyclic frequency, so that

$$L(Y) = R^{\alpha}_{yy^*} \tag{5}$$
The CFDI has a periodicity property that defines the period of samples $\alpha$. The present samples are correlated with previous samples of the same series for decision-making; this is the Autocorrelation Function (AF), denoted $R_{yy}$. The combination of the AF with the cyclic frequency is denoted $R^{\alpha}_{yy}$, and it is equivalent to the threshold $L(Y)$. Taking the natural logarithm of Eq. (4) simplifies it to

$$-\frac{1}{2}(y - H_1)^T C^{-1}(y - H_1) + \frac{1}{2}(y - H_0)^T C^{-1}(y - H_0) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln(\gamma) \tag{6}$$
Expanding Eq. (6) and collecting terms gives

$$(H_1 - H_0)^T C^{-1}\left[y - \frac{1}{2}(H_0 + H_1)\right] \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln(\gamma) \tag{7}$$
The necessary threshold condition for the GLRT of CFDI is

$$R^{\alpha}_{yy^*} = (H_1 - H_0)^T C^{-1} y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \ln(\gamma) + \frac{1}{2}(H_1 - H_0)^T C^{-1}(H_0 + H_1) \tag{8}$$
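The comparison in Eq. (8) can be evaluated directly once a covariance is chosen. The sketch below assumes a diagonal covariance, with one noise variance per channel standing in for nonuniform AWGN, and uses illustrative values for the mean vectors under the two hypotheses and for γ. The function names are hypothetical, and the sketch stops at Eq. (8); it does not apply the later FFT or averaging steps.

```python
import math

def weighted_inner(a, b, noise_var):
    """Compute a^T C^{-1} b for a diagonal covariance C = diag(noise_var)."""
    return sum(x * z / v for x, z, v in zip(a, b, noise_var))

def glrt_statistic_and_threshold(y, h0, h1, noise_var, gamma):
    """Return the left-hand statistic and right-hand threshold of Eq. (8)."""
    diff = [b - a for a, b in zip(h0, h1)]      # H1 - H0
    total = [a + b for a, b in zip(h0, h1)]     # H0 + H1
    statistic = weighted_inner(diff, y, noise_var)
    threshold = math.log(gamma) + 0.5 * weighted_inner(diff, total, noise_var)
    return statistic, threshold

def glrt_decide(y, h0, h1, noise_var, gamma):
    """Decide H1 (user present) when the statistic exceeds the threshold."""
    statistic, threshold = glrt_statistic_and_threshold(y, h0, h1, noise_var, gamma)
    return statistic > threshold

# Illustrative values only: four channels with different noise variances.
h0 = [0.0, 0.0, 0.0, 0.0]
h1 = [1.0, 1.0, 1.0, 1.0]
noise_var = [0.5, 1.0, 2.0, 4.0]
stat_present, thr = glrt_statistic_and_threshold(h1, h0, h1, noise_var, gamma=1.0)
```

With these values, a noise-free observation equal to the signal mean gives a statistic of 3.75 against a threshold of 1.875, so the test decides in favor of the user being present; an all-zero observation falls below the threshold. Channels with smaller noise variance contribute more to the statistic, which is the practical effect of the nonuniform noise assumption.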
The FFT condition is applied to the threshold condition of Eq. (8) for CFDIG, and the resulting equation is

$$R^{\alpha}_{yy^*} = \ln(\gamma) + \frac{1}{2}(H_1 - H_0)^T C^{-1}(H_1 + H_0)\,\exp(-j 2\pi \alpha n f_s) \tag{9}$$
where $f_s$ is the sampling frequency and $n$ is the number of samples. The average value of Eq. (9) is obtained as

$$R^{\alpha}_{yy^*} = \left[\ln(\gamma) + \frac{1}{2}(H_1 - H_0)^T C^{-1}(H_1 + H_0)\,\exp(-j 2\pi \alpha n f_s)\right]^{1/2} \tag{10}$$
Equation (10) is the final threshold condition for the CFDIG. The literature suggests that if the characteristics of the received signal are not known, the NP observer is more suitable for signal detection. In the proposed work, therefore, an attempt is made to estimate the optimum threshold for CFDI using the NP observer method. The NP observer threshold condition [17] is

$$L(Y) = \frac{P(y; H_1)^{2/\sigma^2}}{P(y; H_0)^{2/\sigma^2}} = \gamma \tag{11}$$
where $\sigma^2$ is the variance. Substituting $P(y; H_1)$ and $P(y; H_0)$ from Eq. (2) into Eq. (11), and using the equivalence of Eq. (5), gives

$$R^{\alpha}_{yy^*} = \frac{\left\{\exp\!\left[-\frac{1}{2}(y - H_1)^T C^{-1}(y - H_1)\right]\right\}^{2/\sigma^2}}{\left\{\exp\!\left[-\frac{1}{2}(y - H_0)^T C^{-1}(y - H_0)\right]\right\}^{2/\sigma^2}} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \gamma \tag{12}$$
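Taking the logarithm of Eq. (12) and simplifying gives

$$-\frac{1}{2}(y - H_1)^T C^{-1}(y - H_1) + \frac{1}{2}(y - H_0)^T C^{-1}(y - H_0) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \sigma^2 \ln\sqrt{\gamma} \tag{13}$$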
Expanding and simplifying Eq. (13) gives

$$(H_1 - H_0)^T C^{-1}\left[y - \frac{1}{2}(H_0 + H_1)\right] \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \sigma^2 \ln\sqrt{\gamma} \tag{14}$$
Moving the term that does not depend on $y$ to the right-hand side gives the essential threshold condition for the CFDINP method:

$$R^{\alpha}_{yy^*} = (H_1 - H_0)^T C^{-1} y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \sigma^2 \ln\sqrt{\gamma} + \frac{1}{2}(H_1 - H_0)^T C^{-1}(H_0 + H_1) \tag{15}$$
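The GLRT and NP observer conditions of Eqs. (8) and (15) share the same statistic and differ only in the threshold term that depends on γ. The short sketch below makes that difference concrete. Here `bias` is a hypothetical name for the common term ½(H1 − H0)ᵀC⁻¹(H0 + H1), and the numeric values are illustrative assumptions rather than values from the chapter.

```python
import math

def glrt_threshold(gamma, bias):
    """Right-hand side of Eq. (8): ln(gamma) plus the common bias term."""
    return math.log(gamma) + bias

def np_threshold(gamma, sigma2, bias):
    """Right-hand side of Eq. (15): sigma^2 * ln(sqrt(gamma)) plus the same bias term."""
    return sigma2 * math.log(math.sqrt(gamma)) + bias

# Illustrative values: gamma = 4 and the bias term from a four-channel example.
bias = 1.875
thr_glrt = glrt_threshold(4.0, bias)
thr_np_unit = np_threshold(4.0, 1.0, bias)   # unit noise variance
thr_np_two = np_threshold(4.0, 2.0, bias)    # noise variance of 2
```

Because σ² ln √γ equals (σ²/2) ln γ, the two thresholds coincide when σ² = 2, and for γ > 1 the NP observer threshold is lower than the GLRT threshold whenever σ² < 2. In this form the NP observer threshold scales with the noise variance, whereas the GLRT threshold depends on the noise only through the covariance in the bias term.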
The FFT condition is applied to the threshold condition of Eq. (15) for CFDINP, and the comparison simplifies to

$$R^{\alpha}_{yy^*} = (H_1 - H_0)^T C^{-1} y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \sigma^2 \ln\sqrt{\gamma} + \frac{1}{2}(H_1 - H_0)^T C^{-1}(H_0 + H_1)\,\exp(-j 2\pi \alpha n f_s) \tag{16}$$
where $f_s$ is the sampling frequency and $n$ is the number of samples. The average value of Eq. (16) is obtained as

$$R^{\alpha}_{yy^*} = (H_1 - H_0)^T C^{-1} y \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \left[\sigma^2 \ln\sqrt{\gamma} + \frac{1}{2}(H_1 - H_0)^T C^{-1}(H_0 + H_1)\,\exp(-j 2\pi \alpha n f_s)\right]^{1/2} \tag{17}$$
Equation (17) is the final threshold condition for the CFDINP. The received signal is compared with the estimated dynamic thresholds of Eqs. (10) and (17) to determine how many samples are detected correctly, how many are falsely detected, and how many are missed. The user's presence is assessed using three performance metrics, PD, Pfa, and Pmd, which are used to evaluate the detection level of the two proposed methods, CFDIG and CFDINP. The three metrics are computed from the following equations [16, 18]. The probability of false alarm (Pfa) is estimated as

$$P_{fa} = Q\!\left(\frac{\gamma - (H_1 - H_0)^T C^{-1} H_0}{\sqrt{(H_1 - H_0)^T C^{-1}(H_1 - H_0)}}\right) \tag{18}$$

The probability of detection (PD) is estimated as

$$P_D = Q\!\left[Q^{-1}(P_{fa}) - \sqrt{(H_1 - H_0)^T C^{-1}(H_1 - H_0)}\right] \tag{19}$$

The probability of miss detection (Pmd) is measured as

$$P_{md} = Q\!\left(\frac{\gamma - (H_1 - H_0)^T C^{-1} H_1}{\sqrt{(H_1 - H_0)^T C^{-1}(H_1 - H_0)}}\right) \tag{20}$$
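Alongside the closed-form expressions, the three metrics can also be estimated empirically by counting detector decisions against known ground truth, as is commonly done in Monte Carlo studies. The sketch below is such a counting estimator, not the authors' simulation code. The function name and the example decision lists are hypothetical, and Pmd is taken as the fraction of occupied trials in which the user was not detected.

```python
def detection_metrics(decisions, truth):
    """Estimate PD, Pfa, and Pmd from per-trial decisions and ground truth.

    decisions[i] is True when the detector declares the user present;
    truth[i] is True when the user is actually present in trial i.
    """
    present = sum(1 for t in truth if t)
    absent = len(truth) - present
    hits = sum(1 for d, t in zip(decisions, truth) if d and t)
    false_alarms = sum(1 for d, t in zip(decisions, truth) if d and not t)
    misses = present - hits
    return {
        "PD": hits / present if present else 0.0,
        "Pfa": false_alarms / absent if absent else 0.0,
        "Pmd": misses / present if present else 0.0,
    }

# Illustrative trial outcomes: five occupied trials followed by five empty ones.
truth = [True] * 5 + [False] * 5
decisions = [True, True, True, True, False, True, False, False, False, False]
metrics = detection_metrics(decisions, truth)
```

Counted this way, PD and Pmd always sum to one over the occupied trials, while Pfa is computed only over the trials in which the channel is actually empty.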
4 Results and Discussion
For the simulations, 100 Monte Carlo samples are generated in place of 100 samples scanned from the radio environment, and nonuniform AWGN is added to them. The free-space propagation path loss is assumed to be 40 dB/decade [18], the environment contains Rayleigh fading, the power ranges from −10 to 0 dB, and the scanning period is assumed to be 20 s. The period α is taken over every 100 samples. PD is compared between the proposed CFDIG and CFDINP methods and the existing CFD method in Fig. 3. At −10 dB, the CFD achieves a PD of only 0.3, whereas CFDIG and CFDINP achieve 0.45 and 0.43, respectively, as shown in Table 1. PD increases as the SNR is increased from −10 to 0 dB, and it is higher than that of the existing method at all SNR levels.
Fig. 3 PD versus input SNR for proposed CFDI
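Table 1 PD at various SNR levels of CFDI

| S. no. | Input power (dB) | CFDIG PD | CFDINP PD |
|---|---|---|---|
| 1 | −10 | 0.45 | 0.43 |
| 2 | −9 | 0.51 | 0.48 |
| 3 | −8 | 0.58 | 0.54 |
| 4 | −7 | 0.66 | 0.61 |
| 5 | −6 | 0.75 | 0.70 |
| 6 | −5 | 0.83 | 0.77 |
| 7 | −4 | 0.90 | 0.86 |
| 8 | −3 | 0.95 | 0.93 |
| 9 | −2 | 0.98 | 0.97 |
| 10 | −1 | 0.99 | 0.99 |
| 11 | 0 | 1 | 1 |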
The threshold used in the existing method is fixed for all SNR values, which results in a lower PD. In the proposed methods, the dynamic threshold is estimated and applied at every input SNR. Hence, the proposed methods achieve a higher PD than the existing CFD method. Of the two proposed methods, CFDIG gives the better detection probability. The two methods reach the same PD from −1 dB upward, but from −10 to −2 dB CFDIG detects more than CFDINP. Hence, at low SNR levels, CFDIG provides better detection than the existing CFD and the proposed CFDINP.
Fig. 4 Comparison of PD with dynamic and fixed thresholds
The comparison of PD for the proposed dynamic thresholds and the static threshold, at power levels between −10 and −5 dB, is shown in Fig. 4. In the CFD method, the threshold value is fixed for all SNR values. At −10 dB, the fixed threshold gives a PD of 0.4, whereas the dynamic threshold gives 0.45 for CFDIG and 0.43 for CFDINP. At −5 dB, the fixed threshold gives a PD of 0.63, whereas the dynamic threshold gives 0.83 for CFDIG and 0.77 for CFDINP. With the fixed threshold, PD therefore increases by only 0.23 as the power level rises from −10 to −5 dB. With the dynamic threshold, PD increases by 0.38 for CFDIG and by 0.34 for CFDINP over the same range. Both the increase in PD and the level of PD are therefore higher for the proposed methods than for the fixed threshold, and higher for CFDIG than for CFDINP. The dynamic threshold thus improves detection over the fixed threshold at low SNR levels, and CFDIG provides the highest detection probability.
Pfa is compared with that of the existing CFD method in Fig. 5. At −10 dB, the CFD method has a Pfa of 0.7, whereas the proposed CFDINP has 0.49 and CFDIG has 0.48 at the same power level, as shown in Table 2.
Fig. 5 Pfa versus input SNR for proposed CFDI
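Table 2 Pfa at various SNR levels of CFDI

| S. no. | Input power (dB) | CFDIG Pfa | CFDINP Pfa |
|---|---|---|---|
| 1 | −10 | 0.48 | 0.49 |
| 2 | −9 | 0.46 | 0.48 |
| 3 | −8 | 0.42 | 0.44 |
| 4 | −7 | 0.35 | 0.39 |
| 5 | −6 | 0.30 | 0.33 |
| 6 | −5 | 0.22 | 0.27 |
| 7 | −4 | 0.18 | 0.23 |
| 8 | −3 | 0.15 | 0.18 |
| 9 | −2 | 0.11 | 0.14 |
| 10 | −1 | 0.07 | 0.10 |
| 11 | 0 | 0.02 | 0.06 |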
As the SNR increases from −10 to 0 dB, the false alarm probability decreases. Compared with the existing CFD, Pfa is lower for both proposed methods at every input SNR. The existing methods use a low fixed threshold for all SNR values, so their Pfa is high. In the proposed methods, the dynamic threshold is estimated and applied at every input SNR, so they yield a lower Pfa than the existing methods. Of the two proposed methods, CFDIG produces fewer false alarms than CFDINP.
Fig. 6 Comparison of Pfa with dynamic and fixed thresholds
Hence, at low SNR levels, CFDIG provides a greater improvement than the existing CFD and the proposed CFDINP. The comparison of Pfa for the proposed dynamic thresholds and the fixed threshold, at power levels between −10 and −5 dB, is shown in Fig. 6. In the existing method, the threshold value is fixed for all SNR values. At −10 dB, the fixed threshold gives a Pfa of 0.8, whereas the dynamic threshold gives 0.48 for CFDIG and 0.49 for CFDINP. At −5 dB, the fixed threshold gives a Pfa of 0.45, whereas the dynamic threshold gives 0.22 for CFDIG and 0.27 for CFDINP. With the fixed threshold, Pfa therefore decreases by about 0.35 as the SNR increases from −10 to −5 dB. With the dynamic threshold, Pfa decreases by 0.26 for CFDIG and by 0.22 for CFDINP over the same range. The false alarm probability thus falls faster for the existing method than for the proposed methods, but its level is lower for the proposed methods. Of the two proposed methods, CFDIG shows both the larger decrease in Pfa and the lower level of false alarms, so it is less likely to raise a false alarm than CFDINP. The dynamic threshold improves false alarm performance over the fixed threshold at lower SNR levels.
Fig. 7 Pmd versus input SNR for proposed CFDI
Pmd is compared with that of the existing CFD method in Fig. 7. At −10 dB, the CFD method has a Pmd of 0.7, whereas the proposed CFDINP has 0.56 and CFDIG has 0.54 at the same power level, as shown in Table 3. As the SNR increases from −10 to 0 dB, Pmd decreases. Compared with the CFD method, Pmd is lower for both proposed methods at every input SNR. The existing practice uses a high fixed threshold for all SNR values, which makes Pmd high. In the proposed methods, the dynamic threshold is estimated and applied at every input SNR. Hence, the proposed methods yield a lower Pmd than the existing CFD.
Table 3 Pmd at various SNR levels of CFDI
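| S. no. | Input power (dB) | CFDIG Pmd | CFDINP Pmd |
|---|---|---|---|
| 1 | −10 | 0.54 | 0.56 |
| 2 | −9 | 0.48 | 0.51 |
| 3 | −8 | 0.41 | 0.45 |
| 4 | −7 | 0.33 | 0.38 |
| 5 | −6 | 0.24 | 0.30 |
| 6 | −5 | 0.16 | 0.22 |
| 7 | −4 | 0.09 | 0.13 |
| 8 | −3 | 0.04 | 0.06 |
| 9 | −2 | 0.01 | 0.02 |
| 10 | −1 | 0 | 0 |
| 11 | 0 | 0 | 0 |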
Fig. 8 Comparison of Pmd with dynamic and fixed thresholds
Of the proposed methods, CFDIG produces fewer miss detections than CFDINP. The comparison of Pmd for the proposed dynamic thresholds and the static threshold, at power levels between −10 and −5 dB, is shown in Fig. 8. In the CFD method, the threshold value is fixed for all SNR values. At −10 dB, the higher fixed threshold gives a Pmd of 0.8, whereas the dynamic threshold gives 0.54 for CFDIG and 0.56 for CFDINP. At −5 dB, the fixed threshold gives a Pmd of 0.45, whereas the dynamic threshold gives 0.16 for CFDIG and 0.22 for CFDINP. With the fixed threshold, Pmd therefore decreases by about 0.35 as the SNR increases from −10 to −5 dB. With the dynamic threshold, Pmd decreases by 0.38 for CFDIG and by 0.34 for CFDINP over the same range. The decrease in Pmd is thus comparable for the existing and proposed methods, but the level of miss detection is lower for the proposed methods. Of the two proposed methods, CFDIG shows both the larger decrease in Pmd and the lower Pmd level, so it is less likely to miss a detection than CFDINP.
Fig. 9 ROC curve of CFDI
The dynamic threshold reduces miss detections compared with the fixed threshold at low SNR levels. The parameters PD, Pfa, and Pmd have been analyzed individually, and the results show that the proposed CFDIG performs better than the CFD and the proposed CFDINP methods. The ROC curve, which plots PD against Pfa, is shown in Fig. 9. It is used to compare the receiver sensitivity of the CFDINP, CFDIG, and existing CFD methods. From Fig. 9, CFDIG gives better sensitivity at low SNR than CFDINP and the existing CFD method. The PD of the existing method varies from 0.3 to 0.9, whereas the PD of the proposed CFDIG varies from 0.45 to 1 over the various levels of Pfa.
5 Conclusion
In this paper, the cyclostationary feature detection process was first reviewed and the gaps in the literature were identified. Secondly, the cyclostationary feature detector was described and the usefulness of the proposed detection method was explained. The dynamic threshold was then estimated for the proposed CFDIG and CFDINP detection methods, and the parameters PD, Pfa, and Pmd were calculated for both. In the results and discussion, each parameter was measured and plotted for power levels from −10 to 0 dB. A comparison of the three parameters at the −10 and −5 dB power levels showed that CFDIG provides better detection than the existing CFD and the proposed CFDINP. Finally, the ROC curve was plotted to determine which detection method makes the receiver more sensitive. It can be concluded that CFDIG provides better sensitivity and detection at low SNR levels.
References
1. M. Oner, F. Jondral, Cyclostationarity based air interface recognition for software radio systems, in Radio and Wireless Conference, 2004 IEEE (IEEE, 2004, September), pp. 263–266
2. D. Cabric, S.M. Mishra, R.W. Brodersen, Implementation issues in spectrum sensing for cognitive radios, in Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems, and Computers, 2004, vol. 1 (IEEE, 2004, November), pp. 772–776
3. A. Fehske, J. Gaeddert, J.H. Reed, A new approach to signal classification using spectral correlation and neural networks, in 2005 First IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, 2005. DySPAN 2005 (IEEE, 2005, November), pp. 144–150
4. K. Maeda, A. Benjebbour, T. Asai, T. Furuno, T. Ohya, Recognition among OFDM-based systems utilizing cyclostationarity-inducing transmission, in 2nd IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, 2007. DySPAN 2007 (IEEE, 2007, April), pp. 516–523
5. P.D. Sutton, J. Lotze, K.E. Nolan, L.E. Doyle, Cyclostationary signature detection in multipath Rayleigh fading environments, in 2nd International Conference on Cognitive Radio Oriented Wireless Networks and Communications, 2007. CrownCom 2007 (IEEE, 2007, August), pp. 408–413
6. Z. Zhao, G. Zhong, D. Qu, T. Jiang, Cyclostationarity-based spectrum sensing with subspace projection, in 2009 IEEE 20th International Symposium on Personal, Indoor and Mobile Radio Communications (IEEE, 2009, September), pp. 2300–2304
7. D. Bhargavi, C.R. Murthy, Performance comparison of energy, matched-filter, and cyclostationarity-based spectrum sensing, in 2010 IEEE Eleventh International Workshop on Signal Processing Advances in Wireless Communications (SPAWC) (IEEE, 2010, June), pp. 1–5
8. A. Bagwari, B. Singh, Comparative performance evaluation of spectrum sensing techniques for cognitive radio networks, in 2012 Fourth International Conference on Computational Intelligence and Communication Networks (CICN) (IEEE, 2012, November), pp. 98–105
9. N. Armi, M.Z. Yusoff, N.M. Saad, Cooperative spectrum sensing in a decentralized cognitive radio system, in EUROCON, 2013 IEEE (IEEE, 2013, July), pp. 113–118
10. F. Salahdine, H. El Ghazi, N. Kaabouch, W.F. Fihri, Matched filter detection with dynamic threshold for cognitive radio networks, in 2015 International Conference on Wireless Networks and Mobile Communications (WINCOM) (IEEE, 2015, October), pp. 1–6
11. D. Li, L. Zhang, Z. Liu, Z. Wu, Z. Zhang, Mixed-signal detection and carrier frequency estimation based on coherent spectral features, in 2016 International Conference on Computing, Networking and Communications (ICNC) (IEEE, 2016, February), pp. 1–5
12. M. Sardana, A. Vohra, Analysis of different spectrum sensing techniques, in 2017 International Conference on Computer, Communications and Electronics (Comptelix) (IEEE, 2017, July), pp. 422–425
13. A. Kumar, P. NandhaKumar, OFDM system with cyclostationary feature detection spectrum sensing. ICT Express 5(1), 21–25 (2019)
14. S. Tertinek, Optimum detection of deterministic and random signals (2004)
15. A. Papoulis, S.U. Pillai, Probability, Random Variables, and Stochastic Processes (Tata McGraw-Hill Education, 2002)
16. A.K. Budati, H. Valiveti, Identify the user presence by GLRT and NP detection criteria in cognitive radio spectrum sensing. Int. J. Commun. Syst. e4142 (2019)
17. W.C. Lee, Mobile Communications Design Fundamentals, vol. 25 (Wiley, 2010)
18. S.M. Kay, Fundamentals of Statistical Signal Processing, vol. II: Detection Theory. Signal Processing (Upper Saddle River, NJ, Prentice Hall, 1998)
Fuzzy-Based DBSCAN Algorithm to Elect Master Cluster Head and Enhance the Network Lifetime and Avoid Redundancy in Wireless Sensor Network
Tripti Sharma, Amar Mohapatra, and Geetam Tomar
Abstract The sensor nodes are distributed over a specific geographical region within the wireless sensor network. Since the nodes are deployed randomly, the network can contain both sparsely and densely deployed areas. The DBSCAN algorithm is used to separate the high- and low-density areas. The entire network is divided into four grids; from each grid, a master cluster head is selected, and only that master cluster head is allowed to communicate sensed information to the sink. The main goal of this algorithm is to prevent redundancy while enhancing the network's lifetime and improving the stability period. Simulation shows that the proposed algorithm extends the network lifespan and prolongs the stability period compared with LEACH and IC-ACO in a densely deployed network.
Keywords LEACH · Clustering · DBSCAN
T. Sharma (B), IT Department, Maharaja Surajmal Institute of Technology, New Delhi, India
A. Mohapatra, IT Department, Indira Gandhi Delhi Technical University for Women, New Delhi, India
G. Tomar, Birla Institute of Applied Sciences, Bhimtal 263136, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_88
1 Introduction
Recent developments in WSNs have advanced the concept of the sensor network, which is formed by many low-power, inexpensive, multi-functional sensor nodes. Energy consumption is the main problem in a WSN [1], since sensor nodes have a limited and fixed power supply. It therefore affects the design of protocols and routing algorithms in the sensor network. The primary objective of these sensor nodes is to sense data in their surroundings. The data is then transmitted to one of the authorized nodes, such as the sink node. Several hierarchical routing algorithms have been proposed for this purpose. In these routing protocols, particular groups of nodes, called clusters, and their heads are chosen, and the heads transmit the information to the sink to reduce energy consumption. Mostly, the nodes in a WSN are deployed randomly; hence, the network is likely to contain both high- and low-density areas. Nodes in the high-density areas are highly likely to sense similar or redundant information, which aggravates the problem of battery lifetime. The literature review reveals that most hierarchical routing protocols use a probabilistic model [2]; hence, the selected cluster heads may lie in close proximity to one another [3]. A feasible solution is therefore needed that extends the network lifetime and makes protocols more energy efficient. Fuzzy logic, which handles and manipulates linguistic variables, has been used as a possible solution for cluster head selection in many routing protocols. Fuzzy logic systems are grounded in linguistic rules that combine different linguistic parameters to create the fuzzy output. Fuzzy logic works with imprecise data rather than precise crisp data. In the real world, most quantities are imprecise rather than exact; thus, fuzzy logic relates better to the physical world than classical information theory does. The fuzzy information system supports an effective representation of uncertainty. A Fuzzy Inference System (FIS) has four components [4]: the fuzzifier, the defuzzifier, the fuzzy inference engine, and the rule base. In fuzzy methodology, quantitative data are represented by linguistic variables, that is, words or sentences in natural language. The lifetime of a WSN can be improved and its energy consumption reduced by appropriate cluster head selection.
In this work, fuzzy logic is used to select the master CHs on the basis of two descriptors, the distance and the residual energy. The rest of the article is structured as follows: Sect. 2 summarizes algorithms already used in WSNs. Section 3 describes the energy model used in the proposed algorithm. Section 4 presents the methodology of the proposed algorithm. Section 5 discusses the simulation results, and Sect. 6 concludes the work.
2 Related Work
2.1 LEACH
In this protocol, the Cluster Head (CH) nodes are chosen randomly in a probabilistic fashion. The LEACH algorithm proceeds through a series of rounds, and each round has two phases, the setup phase and the transmission phase [5]. The selection of CHs and the formation of clusters take place in the setup phase. In the transmission phase, each node transfers its information to its CH, and the CH forwards it to the sink. LEACH is comparatively better at prolonging network lifetime and conserving energy than other conventional algorithms. However, LEACH is not appropriate for large and dense networks.
2.2 EAUCF
The CH selection procedures in EAUCF [6] and LEACH are similar. EAUCF uses fuzzy logic to compute the competition radius of each tentative CH. The distance from the sink to the node and the node's residual energy are chosen as the two fuzzy parameters. This algorithm gives better results for the half-node-dead and first-node-dead metrics, but LEACH performs better for the last-node-dead metric.
2.3 CHEF
CHEF uses fuzzy if-then rules to select CHs. The residual energy and proximity distance are taken as input parameters. The stability period of CHEF [7] is also better than that of the LEACH protocol.
2.4 PEGASIS
PEGASIS [8] is an extension of the LEACH protocol. Data is transmitted and received between neighboring nodes, and multiple clusters are not formed. Instead, a chain-like structure is formed for data transmission to the base station. The nodes that are closer to the sink are allowed to transmit the data. An annealing algorithm is used to select the chain. Hierarchical PEGASIS and Energy Balancing PEGASIS are extensions of PEGASIS.
2.5 DBSCAN
The DBSCAN [9] algorithm identifies clusters in large spatial datasets by examining the local density of database elements, using a single input parameter. In addition, the user is given a recommendation on which parameter value would be suitable, so minimal knowledge of the domain is required. DBSCAN can also determine which data should be classified as noise or outliers. Despite this, it is fast and scales very well with the size of the database.
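The density-based grouping described above can be illustrated with a compact, unoptimized sketch of DBSCAN. It is written for clarity rather than speed, using a brute-force neighborhood search, and is not the implementation used in this chapter. The sketch exposes both the neighborhood radius `eps` and the minimum neighbor count `min_pts`, where `min_pts` is often simply fixed in practice. The node coordinates and parameter values are illustrative assumptions.

```python
import math

NOISE = -1  # label given to outliers

def region_query(points, idx, eps):
    """Return the indices of all points within eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE          # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in neighbours if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                seeds.extend(j_neighbours)   # core point: keep growing the cluster
    return labels

# Two dense groups of sensor nodes and one isolated node (illustrative coordinates).
nodes = [(0, 0), (0, 1), (1, 0), (1, 1),
         (10, 10), (10, 11), (11, 10), (11, 11),
         (50, 50)]
labels = dbscan(nodes, eps=1.5, min_pts=3)
```

In this example the two dense groups receive separate cluster labels and the isolated node is marked as noise, which mirrors how densely deployed regions of a sensor field can be separated from sparse ones.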
Fig. 1 Diagram for energy dissipation
3 Energy Model Analysis
The proposed algorithm uses the simple first-order radio model suggested by Heinzelman et al. Figure 1 shows the transmitter and receiver electronics and the transmit amplifier. Power attenuation depends on the distance between the transmitter and the receiver. The energy dissipated by the transmitter in sending a k-bit packet over a distance d is

$$E_{Tx}(k, d) = k\,E_{elec} + E_{mp}\,k\,d^4 \quad \text{if } d \ge d_0 \tag{1}$$

$$E_{Tx}(k, d) = k\,E_{elec} + E_{fs}\,k\,d^2 \quad \text{if } d < d_0 \tag{2}$$

On receiving a k-bit data packet, the energy dissipated is

$$E_{Rx}(k) = E_{elec}\,k \tag{3}$$

Here, $E_{elec}$ is the energy dissipated per bit by the electronic circuits, $k$ is the packet size, $d$ is the distance between the two nodes, and $E_{fs}$ and $E_{mp}$ are the transmitter amplifier characteristics for the free-space ($d^2$) and multipath ($d^4$) regimes, respectively.
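The energy model can be turned into a small calculator. The sketch below follows the usual form of the first-order radio model, in which the free-space coefficient applies with d² below the crossover distance and the multipath coefficient with d⁴ beyond it. The numeric constants and the definition of the crossover distance as the square root of the ratio of the two amplifier coefficients are assumptions chosen for illustration; the chapter does not state its parameter values in this section.

```python
import math

# Illustrative parameter values (assumptions, not taken from the chapter).
E_ELEC = 50e-9       # J/bit, energy spent by the transmit/receive electronics
E_FS = 10e-12        # J/bit/m^2, free-space amplifier coefficient
E_MP = 0.0013e-12    # J/bit/m^4, multipath amplifier coefficient
D0 = math.sqrt(E_FS / E_MP)  # assumed crossover distance where both amplifier terms match

def tx_energy(k_bits, d):
    """Energy to transmit k_bits over distance d (metres)."""
    if d < D0:
        return k_bits * E_ELEC + E_FS * k_bits * d ** 2   # free-space regime
    return k_bits * E_ELEC + E_MP * k_bits * d ** 4       # multipath regime

def rx_energy(k_bits):
    """Energy to receive k_bits."""
    return E_ELEC * k_bits

# Cost of sending one 4000-bit packet over a short and a long link.
short_link = tx_energy(4000, 50.0)
long_link = tx_energy(4000, 100.0)
receive_cost = rx_energy(4000)
```

With these assumed constants the crossover distance is a little under 88 m, and doubling the link length from 50 m to 100 m more than doubles the transmit energy. This steep growth with distance is why limiting long-range transmissions to the sink to a few nodes saves energy.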
4 Proposed Algorithm If the sensor nodes are distributed randomly, there is a high likelihood of high-density and low-density sensor nodes network areas are being created. It has been observed that the areas having high density are probable of sensing the similar or redundant
information since they are placed in close proximity, so the energy of the nodes in these areas is depleted in sensing similar information. Since energy is the key issue in WSNs [10], conserving it in such situations is essential for any routing algorithm. The literature survey shows that several routing algorithms have been proposed to decrease the energy cost of sensing redundant information. In the proposed algorithm, DBSCAN [11] is used to separate the entire network into high- and low-density areas, and clusters are formed in both. Because the clusters are identified dynamically, some nodes are classified as noise or outliers and do not participate in cluster formation. However, to cover the entire region, these nodes still sense data and send it either directly to the sink or to the nearest CH. After cluster formation, the whole network is divided into four grids, and not all CHs transmit information to the sink. In each grid, a master CH is chosen from among the CHs lying in that grid, and only these master CHs transmit the information to the sink [12]. Hence, there is one master CH per grid and four master CHs in total. The master CH is chosen based on two fuzzy parameters: the distance from the sink and the residual energy. The algorithm is performed in two phases, the setup phase and the transmission phase, which are described below.
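The density-based separation described above can be illustrated with a minimal DBSCAN written in plain Python. It is a sketch of the textbook algorithm, not the authors' implementation: the function name, the `eps` and `min_pts` parameters and the sample coordinates are all assumptions made for illustration.

```python
import math

NOISE = -1  # label for nodes that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_pts:   # j is also a core point
                queue.extend(nb_j)
    return labels


if __name__ == "__main__":
    # Hypothetical node positions: two dense groups and one isolated node.
    nodes = [(0, 0), (1, 0), (0, 1), (1, 1),
             (50, 50), (51, 50), (50, 51), (51, 51),
             (90, 10)]
    print(dbscan(nodes, eps=2.0, min_pts=3))
```

Nodes labelled `NOISE` correspond to the outliers that, in the proposed scheme, send their data directly to the sink or to the nearest CH.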
4.1 The Setup Phase
In the setup phase, the DBSCAN algorithm forms the clusters and divides the entire network into high- and low-density zones. The CH of each cluster is chosen randomly, as in the LEACH protocol. To minimize redundancy in a dense cluster, not all nodes are allowed to transmit information. Sleep management is applied, and only the 5% of nodes with the highest energy are allowed to transmit data to their CHs, since nodes in high-density areas have a high probability of sensing similar information. Sleep management thus reduces the sensing and transmission of similar information. Once the CHs have been identified, the clusters have been formed, and the whole network has been partitioned into four grids, the CHs lying in each grid choose a master CH, and only the master CH transmits the information to the sink.
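A small sketch of the two bookkeeping steps in the setup phase is given below. It rests on assumptions that the text leaves open: the 5% share is rounded up and at least one node stays awake, and the four grids are taken to be the four quadrants of the 100 × 100 field. The function names are hypothetical.

```python
import math


def awake_nodes(energies, fraction=0.05):
    """Indices of the highest-energy nodes of a cluster that stay awake."""
    # Assumption: round up and keep at least one node awake.
    keep = max(1, math.ceil(fraction * len(energies)))
    ranked = sorted(range(len(energies)), key=lambda i: energies[i], reverse=True)
    return sorted(ranked[:keep])


def grid_of(x, y, width=100.0, height=100.0):
    """Grid index 0..3, assuming the field is split into 2 x 2 quadrants."""
    col = 1 if x >= width / 2 else 0
    row = 1 if y >= height / 2 else 0
    return 2 * row + col


if __name__ == "__main__":
    cluster_energies = [i / 100 for i in range(30)]  # hypothetical residual energies
    print(awake_nodes(cluster_energies))
    print(grid_of(10, 10), grid_of(90, 10), grid_of(10, 90), grid_of(90, 90))
```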
4.2 Selection of Master CHs Using Fuzzy Logic
In the setup phase, the master CH node is elected on the basis of fuzzy logic [13]. Two input variables are used:
• Distance between the CHs and the sink.
• Residual energy: the leftover energy of the CHs.
The residual energy of each CH node and its distance to the sink are passed to the Fuzzy Inference System (FIS) [15], where these inputs are converted into fuzzy sets [14]. Three membership functions are associated with each of the two input variables. The residual energy ranges from 0 to 0.5 and the distance ranges from 0 to 75. Each linguistic variable is represented by three levels: far, medium, and near for distance, and low, average, and high for residual energy, as shown in Table 1. Figures 2 and 3 show the membership functions for distance and residual energy. The fuzzy rule base [16] operates on these linguistic variables to produce the output once the inputs have been mapped to the membership functions in the FIS. Table 2 lists the rule base used in the proposed algorithm.

Table 1 Input variables and fuzzy sets
| Input | Fuzzy sets |
|---|---|
| Residual energy of CH node | Low, Average, High |
| Distance of cluster head node from sink | Near, Medium, Far |
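The election of a master CH from the two fuzzy inputs can be sketched as follows. The rule table is the one given in Table 2; everything else is an assumption made for illustration, since the membership shapes appear only in figures: triangular functions with evenly spaced breakpoints over the stated ranges, `min` for the rule conjunction, and a weighted average that scores "Low" as 0 and "High" as 1.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (shoulder if a == b or b == c)."""
    if x <= a:
        return 1.0 if a == b else 0.0
    if x >= c:
        return 1.0 if b == c else 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical breakpoints over the ranges stated in the text.
ENERGY = {"Low": (0.0, 0.0, 0.25), "Average": (0.0, 0.25, 0.5), "High": (0.25, 0.5, 0.5)}
DISTANCE = {"Near": (0.0, 0.0, 37.5), "Medium": (0.0, 37.5, 75.0), "Far": (37.5, 75.0, 75.0)}

# Rule base from Table 2: (energy, distance) -> chance.
RULES = {
    ("Low", "Near"): "Low", ("High", "Near"): "High", ("Average", "Near"): "High",
    ("Low", "Medium"): "Low", ("High", "Medium"): "High", ("Average", "Medium"): "High",
    ("Low", "Far"): "Low", ("High", "Far"): "High", ("Average", "Far"): "Low",
}


def chance(energy, distance):
    """Crisp chance in [0, 1] of a CH becoming master CH."""
    strength = {"Low": 0.0, "High": 0.0}
    for (e_label, d_label), out in RULES.items():
        fire = min(tri(energy, *ENERGY[e_label]), tri(distance, *DISTANCE[d_label]))
        strength[out] += fire
    total = strength["Low"] + strength["High"]
    return strength["High"] / total if total > 0 else 0.0


def elect_master(cluster_heads):
    """cluster_heads: list of (node_id, residual_energy, distance_to_sink)."""
    return max(cluster_heads, key=lambda ch: chance(ch[1], ch[2]))[0]


if __name__ == "__main__":
    print(elect_master([("a", 0.10, 10.0), ("b", 0.45, 20.0)]))
```

Running the election once per grid yields the four master CHs that forward data to the sink.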
4.3 The Steady-State Phase
The actual data transmission takes place in this phase. The four master CHs, elected using the two fuzzy descriptors (the residual energy and the distance from the sink), forward the sensed data to the sink node. The algorithm for the above phases is as follows:
Step 1: Initialization of parameters and basic layouts.
Step 2: Separation of high-density and low-density areas using the DBSCAN algorithm and formation of clusters.
Step 3: Selection of random CHs in each cluster.
Step 4: Minimization of redundancy based on sleep management.
Step 5: Division of the network area into four grids.
Step 6: Selection of master CHs based on the two fuzzy descriptors.
Step 7: Transmission of information from the master CHs to the sink.
Fig. 2 Membership function editor distance
5 Simulation Results
The simulation results and the performance of three protocols, namely LEACH, IC-ACO, and the proposed algorithm, are discussed in this section. The sink node is positioned at (50, 50). The results show that the fuzzy logic approach used to elect the master CHs evaluates the linguistic rules in a natural way. For the simulation, 100, 200, and 300 nodes are randomly placed in a 100 × 100 region. Table 3 lists the parameters used for the simulation. The following criterion is used to compare LEACH, IC-ACO, and the proposed algorithm: Stable region: the period, measured in rounds, during which every single node is alive.
Fig. 3 Membership function editor residual energy
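Table 2 Inference rules
| S. no | Energy | Distance | Chances |
|---|---|---|---|
| 1 | Low | Near | Low |
| 2 | High | Near | High |
| 3 | Average | Near | High |
| 4 | Low | Medium | Low |
| 5 | High | Medium | High |
| 6 | Average | Medium | High |
| 7 | Low | Far | Low |
| 8 | High | Far | High |
| 9 | Average | Far | Low |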
Table 3 Parameter values
| Simulation parameters | Values |
|---|---|
| ETX = ERX | 50 × 10^-9 J |
| Sink's (X, Y) position | (50, 50) |
| Data Aggregation Energy (EDA) | 5 × 10^-9 J |
| Initial energy | 0.5 J |
| E_fs | 10 × 10^-12 J |
| E_mp | 0.0013 × 10^-12 J |
| Maximum rounds | 3000 |

Table 4 Values of first node dead round of LEACH, IC-ACO, and proposed algorithm
| Description | LEACH | IC-ACO | Proposed algorithm |
|---|---|---|---|
| First node dead round (100 nodes) | 436 | 930 | 1320 |
| First node dead round (200 nodes) | 222 | 948 | 1346 |
| First node dead round (300 nodes) | 133 | 962 | 1403 |

The figures clearly show that the stability period of the proposed algorithm is longer than that of IC-ACO [17] and LEACH; both the overall network lifetime and the stable period are improved in the proposed algorithm compared with the existing techniques. Table 4 indicates that as the network becomes denser, from 100 to 200 and from 200 to 300 nodes, the proposed algorithm does significantly better than LEACH and IC-ACO. Figures 4, 5, and 6 show the total number of alive nodes at different rounds with 100, 200, and 300 nodes, respectively, which indicates the lifespan of the network. They show that the proposed algorithm does significantly better in a dense environment than LEACH and IC-ACO. The first node dies at rounds 436, 222, and 133 in LEACH when the number of nodes is 100, 200, and 300; at rounds 930, 948, and 962 in IC-ACO; and at rounds 1320, 1346, and 1403 in the proposed algorithm. This demonstrates that the performance of LEACH degrades in a dense network, whereas IC-ACO performs better. The figures confirm that the proposed algorithm does significantly better than IC-ACO.
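The size of the improvement can be read off Table 4 with a few lines of arithmetic. The sketch below only restates the table's first-node-dead rounds and computes how many times longer the proposed algorithm's stable region is than each baseline; the dictionary layout and function name are illustrative.

```python
# First-node-dead rounds copied from Table 4.
FIRST_DEAD = {
    100: {"LEACH": 436, "IC-ACO": 930, "Proposed": 1320},
    200: {"LEACH": 222, "IC-ACO": 948, "Proposed": 1346},
    300: {"LEACH": 133, "IC-ACO": 962, "Proposed": 1403},
}


def stability_gain(nodes, baseline):
    """Ratio of the proposed algorithm's stable region to a baseline's."""
    row = FIRST_DEAD[nodes]
    return row["Proposed"] / row[baseline]


if __name__ == "__main__":
    for n in sorted(FIRST_DEAD):
        print(n,
              round(stability_gain(n, "LEACH"), 2),
              round(stability_gain(n, "IC-ACO"), 2))
```

The ratio against LEACH grows with node count because the first node in LEACH dies earlier as the network gets denser, while the ratio against IC-ACO stays between roughly 1.4 and 1.5.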
Fig. 4 Total number of nodes alive (100 nodes) versus rounds (x-axis: number of rounds; y-axis: number of nodes alive; curves: proposed, LEACH, IC-ACO)
6 Conclusion
Identifying the optimal route, finding the optimal number of CHs, and forming clusters are some of the challenging issues in WSNs. The main purpose of this algorithm is to extend the lifespan of the network when it is densely deployed. In a dense network, the sensor nodes are usually positioned very close to each other and tend to communicate redundant data to the base station, thus wasting energy in processing that redundant information. In the proposed framework, cluster creation and CH selection are based on the DBSCAN algorithm and fuzzy logic, and sleep management is applied to choose the nodes that go into sleep mode in a particular round. The simulation results reveal that, even with the extra overhead of selecting critical nodes, dividing the network into high-density and low-density areas, and sleep management, the algorithm extends the network lifespan and stability period in a densely deployed network.
Fig. 5 Total number of nodes alive (200 nodes) versus rounds (x-axis: number of rounds; y-axis: number of nodes alive; curves: LEACH, IC-ACO, proposed)
Fig. 6 Total number of nodes alive (300 nodes) versus rounds (x-axis: number of rounds; y-axis: number of nodes alive; curves: LEACH, proposed, IC-ACO)
References
1. C.C. Shen, C. Srisathapornphat, C. Jaikaeo, Sensor information, networking architecture and applications. IEEE Pers. Commun. 8(4), 52–59 (2001)
2. W.R. Heinzelman, A. Chandrakasan, H. Balakrishnan, Energy-efficient communication protocol for wireless micro sensor networks, in Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, vol. 2 (2000, January), pp. 1–10
3. Y. Gao, K. Wu, F. Li, Analysis on the redundancy of wireless sensor networks, in Proceedings of the 2nd ACM International Workshop on Wireless Sensor Networks and Applications (WSNA) (San Diego, CA, 2003), pp. 108–114
4. T. Haider, M. Yusuf, A fuzzy approach to energy optimized for wireless sensor networks. Int. Arab J. Inf. Technol. 6(2), 179–185 (2009)
5. W.B. Heinzelman, A.P. Chandrakasan, H. Balakrishnan, An application-specific protocol architecture for wireless microsensor networks. IEEE Trans. Wirel. Commun. 1(4), 660–670 (2002)
6. H. Bagci, A. Yazici, An energy aware fuzzy unequal clustering algorithm for wireless sensor networks, in IEEE International Conference on Fuzzy Systems (2010, July), pp. 1–8
7. I. Gupta, D. Riordan, S. Sampalli, Cluster-head election using fuzzy logic for wireless sensor networks, in Proceedings of the 3rd Annual Conference on Communication Networks and Services Research (2005, May), pp. 255–260
8. S. Lindsey, C.S. Raghavendra, PEGASIS: Power-efficient gathering in sensor information systems, in Proceedings, IEEE Aerospace Conference, vol. 3 (IEEE, 2002, March), pp. 3–3
9. H.S. Emadi, S.M. Mazinani, A novel anomaly detection algorithm using DBSCAN and SVM in wireless sensor networks. Wirel. Pers. Commun. 98(2), 2025–2035 (2018)
10. S. Okdem, D. Karaboga, Routing in wireless sensor networks using an ant colony optimization (ACO) router chip. Sensors 9, 909–921 (2009)
11. D. Pan, L. Zhao, Uncertain data cluster based on DBSCAN, in 2011 International Conference on Multimedia Technology (IEEE, 2011, July), pp. 3781–3784
12. T. Sharma, B. Kumar, Fuzzy based master cluster head election protocol in wireless sensor network. Int. J. Comput. Sci. Telecommun. 3(10), 8–13 (2012)
13. J.M. Kim, S.H. Park, Y.J. Han, T.M. Chung, CHEF: cluster head election mechanism using fuzzy logic in wireless sensor networks, in Proceedings of 10th International Conference on Advanced Communication Technology, vol. 1 (2008, February), pp. 654–659
14. J. Anno, L. Barolli, L. Durresi, F. Xhafa, A. Koyama, A cluster head decision system for sensor networks using fuzzy logic and number of neighbor nodes, in Proceedings of International conference on Ubiquitous-Media (2008), pp. 50–56
15. S.Y. Chiang, J.L. Wang, Routing analysis using fuzzy logic system in wireless sensor network. KES 966–973 (2008)
16. J.S. Lee, W.L. Cheng, Fuzzy-logic-based clustering approach for wireless sensor networks using energy prediction. IEEE Sens. J. 12(9), 2891–2897 (2012)
17. J.Y. Kim, T. Sharma, B. Kumar, G.S. Tomar, K. Berry, W.H. Lee, Intercluster ant colony optimization algorithm for wireless sensor network in dense environment. Int. J. Distrib. Sens. Netw. Article ID 457402, 10 (2014)
Water Quality Evaluation Using Soft Computing Method Shivam Bhardwaj, Deepak Gupta, and Ashish Khanna
Abstract Water is one of the essential ingredients of life, as every species of flora and fauna requires water for its survival. The usable freshwater available is only 2.5% of all the water present on planet Earth. Furthermore, owing to the rise in water pollution, it has become necessary to determine the Water Quality Index (WQI) in order to decide the suitable use of water. In this report, we apply a soft computing approach to evaluate the WQI and categorise water according to its suitable uses. Determining the WQI is highly complex, as it is influenced by numerous factors and chemical components of water. Fuzzy logic is an approach based on the ‘magnitude of truth’ rather than on ‘two-valued’ (binary) logic. Soft computing provides tools to express the WQI linguistically, for example as ‘very good’, ‘poor’ or ‘moderately poor’. The study focuses on determining the WQI by applying fuzzy logic methodology.
Keywords WQI · Soft computing · Fuzzy logic · Binary logic
1 Introduction
WQI is determined to turn convoluted water data sets into information that is comprehensible and lucid. It reflects the water quality parameters taken into account as an overall measure of quality. The current methodology used to evaluate the water quality index is not precise and often leads to errors in the results. Due to the rise in water pollution, the freshwater available to us has become contaminated with harmful elements that can cause virulent diseases in numerous living species if it is consumed or used for other purposes. Various countries are concerned about the complexity of water data sets and the imprecision of the WQI results generated by the statistical approach.
S. Bhardwaj (B) · D. Gupta · A. Khanna Department of Computer Science Engineering, Maharaja Agrasen Institute of Technology, Guru Gobind Singh Indraprastha University, Delhi, India e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_89
Fig. 1 Vengaihnakere Lake google maps image [8]
WQI values determined by soft computing techniques were more accurate and precise than those from the statistical approach, owing to the difference between the two approaches. Soft computing is based on fuzzy logic and probabilistic reasoning; it can handle complex and noisy data sets that have high levels of uncertainty. Hard computing, by contrast, is based on binary logic and crisp systems; it uses a precisely stated model to generate an output from exact input data. The area under study is Vengaihnakere Lake, which is situated in Bengaluru, Karnataka (Fig. 1). It covers an area of 16.2 hectares, which is equivalent to 40 acres. Vengaihnakere Lake is a popular tourist spot in Bengaluru and a great attraction for boating enthusiasts. The primary motive of our study was to evaluate the ecological status of Vengaihnakere Lake, the results of which would help in improving the quality of the lake water [1]. In building the software to evaluate the Fuzzy Water Index (FWI), we use six variables: pH level, Total Hardness (TH), nitrates, iron, magnesium and Total Dissolved Solids (TDS). First, we build fuzzy sets from these variables, then perform operations on them, and finally perform defuzzification to obtain a crisp value of the FWI. The software helps us analyse the lake water of the Varthur area; by doing so we can divide water into different categories and decide its appropriate use, or apply treatment if necessary, and see how the specific elements present in water shape its quality. In the study conducted by Raman Bai, Reinier Bouwmeester and S. Mohan, a WQI value was obtained to categorize a river, making Water Quality (WQ) evaluation more comprehensible, especially to the public [2]. A fuzzy expert system is an easy-to-use and effective tool with which ordinary people can assess the quality of their drinking water [3]. A fuzzy-based approximation of air quality has been used to analyse the way ahead for the Zimbabwe sugarcane growing and processing industry; a further area of enhancement based on soft computing is to provide technical intervention once environmental quality levels have been approximated [4]. A new index was applied at a sampling station on the Karoon River in Iran, based on empirical water quality data; the fuzzy model showed that the water quality agreed closely with the expected results for the Karoon River [5].
2 Principle of Proposed Work
The principle used to develop the Fuzzy Water Index (FWI) is fuzzy logic, which was introduced by Dr. Lotfi Zadeh of the University of California, Berkeley, during the 1960s [6]. Fuzzy logic measures the certainty or vagueness of the membership of an element in a set, in a way similar to how a person makes decisions while performing mental and physical activities. Fuzzy logic deals with fuzzy sets and fuzzy algebra; fuzzy can be regarded as a language, like relational and Boolean algebra. The solution of a particular problem is calculated on the basis of fuzzy rules that are defined for fuzzy sets. It uses the if-then rule, i.e. if A, then B. That is,
Premise: x is P
Implication: IF x is P, THEN y is C
Consequent: y is C
where P and C are linguistic values defined by fuzzy sets on the universes of discourse of x and y, respectively. For example, consider the relationship between a man's honesty and his trustworthiness. If Aman is extremely honest, then he is completely trustworthy. If Aman is partially honest, then he is trustworthy at times. If Aman is dishonest, then he is not trustworthy at all. A fuzzy logic architecture has four main components. The first component is the fuzzy rule base; it holds all the rules and if-then conditions provided by specialists to control the decision-making system. The second component is the fuzzification interface; it converts the crisp input into a fuzzy input so that the fuzzy rules and properties can operate on it. The third component is the inference engine; it determines the degree of match between the fuzzy input and the fuzzy rules and, based on that degree, decides which rules are to be applied to the given input. The fourth component, defuzzification, is applied last; it transforms the fuzzy sets produced by the fuzzification interface and processed by the inference engine into their corresponding crisp value (Fig. 2).
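The four components can be traced on the honesty example with a short sketch. Everything numeric here is an assumption chosen for illustration: honesty and trust are scored on a 0 to 1 scale, the three honesty levels use evenly spaced triangular memberships, and defuzzification is a weighted average of three representative trust values. The names are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b (shoulder if a == b or b == c)."""
    if x <= a:
        return 1.0 if a == b else 0.0
    if x >= c:
        return 1.0 if b == c else 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# 1. Rule base: IF honesty is <key> THEN trust is <value> (crisp representative).
RULE_BASE = {"dishonest": 0.0, "partially honest": 0.5, "extremely honest": 1.0}

# Hypothetical membership breakpoints for the honesty input.
HONESTY_SETS = {
    "dishonest": (0.0, 0.0, 0.5),
    "partially honest": (0.0, 0.5, 1.0),
    "extremely honest": (0.5, 1.0, 1.0),
}


def fuzzify(honesty):
    """2. Fuzzification: crisp honesty score -> degree of each linguistic value."""
    return {label: tri(honesty, *abc) for label, abc in HONESTY_SETS.items()}


def infer(degrees):
    """3. Inference: pair each rule's firing strength with its consequent."""
    return [(degrees[label], RULE_BASE[label]) for label in RULE_BASE]


def defuzzify(fired):
    """4. Defuzzification: weighted average of the consequents."""
    total = sum(strength for strength, _ in fired)
    return sum(strength * value for strength, value in fired) / total


def trust(honesty):
    return defuzzify(infer(fuzzify(honesty)))


if __name__ == "__main__":
    for h in (0.0, 0.5, 0.75, 1.0):
        print(h, trust(h))
```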
Fig. 2 Fuzzy rule maker
2.1 Membership Functions
A membership function represents the magnitude of truth in fuzzy logic. It maps each element of the universe of discourse onto a membership value in the interval [0, 1]. The membership value of an element is also described as its degree of membership; a larger membership value signifies that the element belongs more strongly to the corresponding fuzzy set, and vice versa. Membership functions are graphical representations of fuzzy sets, where the X-axis denotes the universe of discourse and the Y-axis denotes the degree of membership in the [0, 1] interval.

$$S = \{(A, \mu_S(A)) \mid A \in U\}$$

where S = fuzzy set, $\mu_S$ = membership function of the set and A = elements of the fuzzy set, drawn from the universe of discourse U. A fuzzy model for lake water quality assessment is developed. Different shapes of membership functions can be used according to the application, for correct prediction of the Fuzzy Water Index (FWI), as shown in Fig. 3. The accuracy also depends on the number of fuzzy sets used in the mapping process [7].
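Two common membership shapes, and the representation of a fuzzy set as pairs of an element and its degree of membership, can be sketched as follows. The shapes are standard, but the breakpoints and the small universe used in the example are hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between.

    Requires a < b <= c < d.
    """
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def triangle(x, a, b, c):
    """Triangular membership as a trapezoid whose plateau is the single point b."""
    return trapezoid(x, a, b, b, c)


def fuzzy_set(universe, membership):
    """Fuzzy set S = {(A, mu_S(A)) | A in U} over a discrete universe."""
    return [(element, membership(element)) for element in universe]


if __name__ == "__main__":
    # Hypothetical universe and breakpoints.
    print(fuzzy_set([0, 1, 2, 3, 4], lambda x: triangle(x, 0, 2, 4)))
    print(trapezoid(5, 0, 2, 6, 8))
```

More fuzzy sets over the same universe give a finer mapping, at the cost of a larger rule base.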
3 Methodology
The methodology adopted in this study is known as the weighted arithmetic index method. Six parameters were considered, namely pH, Total Hardness (TH), Total Dissolved Solids (TDS), nitrates, iron and magnesium. WQI reflects the water quality parameters taken into account as an overall measure of quality. It is a computed value from which we can draw facts and conclusions about the water of a particular area and its use. The WQI is calculated in steps. First, the relative weight ($W_n$) is determined by assigning a weight ($w_i$) to each selected parameter according to Indian standards, as given in Table 1. The relative weight is

$$W_n = \frac{w_i}{\sum w_i}$$

where $W_n$ = relative weight and $w_i$ = weight of each parameter.
Fig. 3 Different membership functions [9]
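Table 1 Simulation parameters
| S. no. | Parameters | Indian standards (S_i) | w_i | W_n | Units |
|---|---|---|---|---|---|
| 1. | pH | 6.5–8.5 | 4 | 4/21 | – |
| 2. | TDS | 500–2000 | 4 | 4/21 | mg/L |
| 3. | TH | 300–600 | 2 | 2/21 | mg/L |
| 4. | Iron | 0.3–1.0 | 4 | 4/21 | mg/L |
| 5. | Magnesium | 30–100 | 2 | 2/21 | mg/L |
| 6. | Nitrates | 45–100 | 5 | 5/21 | mg/L |
| | | | Σw_i = 21 | ΣW_n = 1 | |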
After calculating $W_n$, we compute the quality rating scale ($Q_n$). $Q_n$ is computed for each parameter by dividing the observed concentration by the Indian standard ($S_i$) and multiplying the result by 100:

$$Q_n = \frac{C_i}{S_i} \times 100$$

where $Q_n$ = quality rating scale, $C_i$ = observed concentration and $S_i$ = Indian standards.
After calculating $W_n$ and $Q_n$, the WQI is ready to be computed, but before that, we need to calculate a Sub Index (SI) for every parameter taken for simulation.

$$SI = Q_n \times W_n$$

$$WQI = \sum SI$$

where SI = sub index, $Q_n$ = quality rating scale and $W_n$ = relative weight.
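The weighted arithmetic index steps can be checked with a short script. The weights are those of Table 1 and the quality ratings are the values reported in Table 3 (minimum evaluated WQI); the function names are illustrative, and the standard value passed to `quality_rating` is left to the caller because Table 1 gives a range rather than a single figure.

```python
# Weights w_i from Table 1.
WEIGHTS = {"pH": 4, "TDS": 4, "TH": 2, "Iron": 4, "Magnesium": 2, "Nitrates": 5}


def relative_weights(weights):
    """W_n = w_i / sum(w_i) for every parameter."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def quality_rating(observed, standard):
    """Q_n = (C_i / S_i) * 100."""
    return 100.0 * observed / standard


def wqi(quality_ratings, weights=WEIGHTS):
    """WQI = sum of sub-indices SI = Q_n * W_n."""
    rel = relative_weights(weights)
    return sum(quality_ratings[name] * rel[name] for name in weights)


if __name__ == "__main__":
    # Quality ratings Q_n reported in Table 3.
    q_min = {"pH": 94.58, "TDS": 205.50, "TH": 43.51,
             "Iron": 10.0, "Nitrates": 56.67, "Magnesium": 50.0}
    print(round(wqi(q_min), 2))
```

The script uses the exact fractions w_i/21, so its result differs slightly in the second decimal place from the 81.44 reported in Table 3, which was computed with relative weights rounded to four decimals.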
4 Results and Discussion
In the FWI modelling, a total of 22 rules were made, while mathematical equations were implemented in the statistical modelling. A significant difference was recorded between the WQI values determined by the two models, along with a tangible drift in their accuracies. The FWI model was developed by dividing the selected parameters into three sections and evaluating an FWI for each section; the obtained FWIs were then used to compute the WQI. This was done by taking two FWIs as inputs and one as output and making rules from them; 21 rules are implemented in generating the WQI output, and the mean of the generated output WQI values is then calculated. The statistical model, by contrast, was developed by entering the mathematical formulas and equations to generate an output WQI value (Figs. 4, 5, 6, 7, 8, 9 and 10). In the fuzzy model, the FWIs were first determined from the selected input parameters, and the outputs obtained were then used as inputs to determine the final WQI, which is 71.26. In the statistical approach, the WQI is determined through the mathematical formula of WQI, as given in Tables 2 and 3. WQI from statistical formula = mean (max WQI, min WQI) = 89.85. The quality is determined using Table 4.

Fig. 4 Membership functions for fuzzy rules
Fig. 5 Membership functions for fuzzy rules
Fig. 6 Membership functions for fuzzy rules
Fig. 7 Fuzzy rules
Fig. 8 Fuzzy rules
Fig. 9 Fuzzy rules
Fig. 10 Fuzzy rules

Table 2 Maximum evaluated WQI
| S. no. | Parameters | Relative weight (W_n) | Quality rating scale (Q_n) | Sub Index (SI) |
|---|---|---|---|---|
| 1. | pH | 0.1905 | 104.46 | 19.89 |
| 2. | TDS | 0.1905 | 66.60 | 12.68 |
| 3. | TH | 0.0952 | 106.67 | 10.15 |
| 4. | Iron | 0.1905 | 66.667 | 12.698 |
| 5. | Nitrates | 0.2381 | 46.67 | 11.11 |
| 6. | Magnesium | 0.0952 | 333.33 | 31.74 |
| | | | | WQI = 98.26 |

Table 3 Minimum evaluated WQI
| S. no. | Parameters | Relative weight (W_n) | Quality rating scale (Q_n) | Sub Index (SI) |
|---|---|---|---|---|
| 1. | pH | 0.1905 | 94.58 | 18.01 |
| 2. | TDS | 0.1905 | 205.50 | 39.14 |
| 3. | TH | 0.0952 | 43.51 | 4.14 |
| 4. | Iron | 0.1905 | 10 | 1.905 |
| 5. | Nitrates | 0.2381 | 56.67 | 13.49 |
| 6. | Magnesium | 0.0952 | 50 | 4.76 |
| | | | | WQI = 81.44 |

Table 4 Water quality categories [10]
| WQI value | Water quality |
|---|---|
| WQI < 50 | Excellent [A] |
| 50 ≤ WQI < 100 | Good [B] |
| 100 ≤ WQI < 200 | Poor [C] |
| 200 ≤ WQI < 300 | Very poor water [D] |
| 300 ≤ WQI | Virulent water [E] |
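The final step, averaging the two statistical WQI values and looking up the category, can be sketched as below. The thresholds are those of Table 4 and the WQI values are the ones reported in the text; the function name is illustrative.

```python
# Category thresholds from Table 4: (upper bound, label), checked in order.
CATEGORIES = [
    (50.0, "Excellent [A]"),
    (100.0, "Good [B]"),
    (200.0, "Poor [C]"),
    (300.0, "Very poor water [D]"),
]


def classify(wqi_value):
    """Return the Table 4 category for a WQI value."""
    for upper, label in CATEGORIES:
        if wqi_value < upper:
            return label
    return "Virulent water [E]"


if __name__ == "__main__":
    statistical = (98.26 + 81.44) / 2   # mean of max and min WQI
    fuzzy = 71.26                       # WQI from the fuzzy model
    print(round(statistical, 2), classify(statistical))
    print(fuzzy, classify(fuzzy))
```

Both values fall in category B, with the statistical value sitting much closer to the boundary with category C than the fuzzy value.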
5 Conclusion
Fuzzy logic helped us analyse the complex water data sets and reduced the time needed to solve the problem, whereas the statistical approach requires a large amount of time and often leads to errors when computing complicated data sets. This makes fuzzy logic a useful tool for understanding complex data sets. The lake water was determined to be suitable for outdoor bathing and to belong to category B, which is noteworthy because Vengaihnakere Lake is a tourist destination and boating point. It would be harmful for the community and tourists if the lake's water caused diseases. However, the two models generated different WQI values. The statistical model predicted a WQI value very close to category C, i.e. water that is suitable for drinking and bathing after chemical treatment, whereas the WQI value computed by the fuzzy water index lay in the middle of category B.
References
1. http://travel2karnataka.com/vengaihnakere_lake_bangalore.htm
2. V. Raman Bai, Reinier Bouwmeester, S. Mohan, Fuzzy logic water quality index and importance of water quality parameters. Air Soil and Water Res. 2, 51–59 (2009)
3. Nidhi Mishra, P. Jha, Fuzzy expert system for drinking water quality index. Recent Res. Sci. Technol. 6(1), 122–125 (2014)
4. Davison Zimwara, Lameck Mugwagwa, Knowledge Nherera, Soft computing methods for predicting environmental quality: a case study of the Zimbabwe sugar processing industry. J. US China Public Adm. 10(4), 345–357 (2013)
5. F. Babaei Semiromi, A.H. Hassani, A. Torabian, A.R. Karbassi, F. Hosseinzadeh Lotfi, Water quality index development using fuzzy logic: a case study of the Karoon River of Iran. Afr. J. Biotechnol. 10(50), 10125–10133 (2011)
6. https://en.wikipedia.org/wiki/Fuzzy_rule
7. S. Naveen, J. Aishwarya, K.N. Rashmishree, S. Naveen Kumar, Determination of water quality of Vengaihnakere Lake and Varthur lake, Bangalore. Int. Res. J. Eng. Technol. 5(6), 2597–2601 (2018)
8. https://www.google.com/maps/@13.0166385,77.6995821,2005m/data=!3m1!1e3
9. http://researchhubs.com/post/engineering/fuzzy-system/fuzzy-membership-function.html
10. https://www.researchgate.net/post/How_to_calculate_water_quality_Index
Crowd Estimation of Real-Life Images with Different View-Points Md Shah Fahad and Akshay Deepak
Abstract This paper identifies a problem in crowd estimation of real-life images and suggests a solution. In general, cameras at public places are fixed in position and capture top view images, and most recent neural networks are developed with such images. Recently, the CSRNet model was developed using the ShanghaiTech dataset and achieved better accuracy than state-of-the-art methods. It is difficult to capture top view images where a crowd gathers at random, such as at a strike or a riot. Therefore, we capture both top view and front view images to deal with such circumstances. In this work, the CSRNet model is evaluated on two test cases, one consisting only of top view images and the other only of front view images. The mean absolute error (MAE) and mean squared error (MSE) are higher for the front view images than for the top view images: the relative MAE and MSE of the CSRNet model for the front view images are 28.64% and 47.86% higher, respectively. Higher MAE and MSE mean lower performance. This issue can be resolved using the suggested GANN network, which can project front view images into top view images, after which the images can be evaluated using the CSRNet model.
Keywords Crowd estimation · CSRNet · Dilated CNN · Gradient adversarial neural network (GANN) · Top view · Front view
M. S. Fahad · A. Deepak (B) National Institute of Technology, Patna, India e-mail: [email protected] M. S. Fahad e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_90
1 Introduction
Crowd estimation is the task of counting or measuring the number of people in a crowd. It helps in ensuring social stability and population safety at public sites. Crowd estimation methods are broadly classified into two categories: (i) direct methods [1–4] and (ii) indirect methods [5, 6]. In direct methods, each human body is segmented and detected using classifiers, and the detected bodies are then counted. Direct methods are subdivided into model-based methods [1–3] and trajectory clustering-based methods [4]. In model-based methods, detection and segmentation are carried out using machine-learning algorithms [7], while trajectory clustering-based methods identify each body through long-term tracking and clustering of interest points on human bodies, as in [8]. Indirect methods treat the crowd as a whole and use regression models to measure the crowd size from extracted crowd features such as textures. The features used by indirect methods are pixel-, texture-, and gradient-based [5, 6], and they are extracted by careful analysis of the images. Nowadays, the Convolutional Neural Network (CNN) is the leading network for crowd estimation [9–14]. A CNN learns its filters from the data itself, and these filters extract useful features for crowd estimation. A large number of networks have been proposed for crowd estimation. In [13], the contextual pyramid CNN was proposed to generate density maps; this network explicitly utilizes contextual information at the local and global levels. A single-column fully convolutional neural network was proposed in [15] for crowd estimation. The limitations of the single-column network were addressed by the Multi-Column Convolutional Neural Network (MCNN) [16], which achieved high performance for crowd estimation. However, the MCNN does not work well as a deeper network, owing to the flexible receptive fields provided by its convolutional filters, and it requires a long training time. A CNN-based model known as CSRNet (dilated convolutional neural networks for estimating highly congested scenes), proposed in [17], is the most recent success in this field of research. The CSRNet model achieves better estimation than state-of-the-art methods. The main idea of this model is to capture high-level features with larger receptive fields, which are obtained with a dilated convolutional neural network. In a dilated CNN, the filter is expanded by filling the vacant positions with zeros. The dilated filters expand the filter size without loss of coverage or resolution, and the number of parameters remains the same as in typical CNN filters. The benefit of a dilated CNN is more coverage with efficient computation and less memory usage. In this work, the CSRNet model is evaluated using real-life images collected at different camera elevations. The images are categorized into top and front views and are evaluated using the CSRNet model trained on the ShanghaiTech dataset. The relative MAE and MSE of the CSRNet model for the front view images are 28.64% and 47.86% higher, respectively, than for the top view images. The MAE and MSE increase dramatically for the front view images because of self-occlusion and different light conditions in the evaluation dataset. The rest of the paper is structured as follows. Section 2 describes the CSRNet architecture. The results corresponding to the different views are discussed in Sect. 3. An approach that may be a robust solution for view-invariant crowd estimation is suggested in Sect. 4. The paper is concluded in Sect. 5.
Crowd Estimation of Real-Life Images with Different View-Points
2 CSRNet Architecture
The CSRNet architecture uses VGG-16 [18] as the front-end because of its transfer learning capability and the flexibility to add a new architecture at the back-end for the desired task. In the CSRNet architecture, the fully connected layers (classification part) of VGG-16 were removed, and the new architecture was added after the convolutional layers of VGG-16. The output size of the front-end of CSRNet is 1/8 of the input size. Adding extra convolutional and max-pooling layers after the front-end would reduce the output size further, which makes it difficult to create high-quality density maps of the crowd. To deal with this issue, dilated convolution is used in CSRNet. The dilated CNN captures detailed features while maintaining the output resolution. CSRNet uses the 2-D dilated convolution [19], which can be defined as follows:

$$ d(p, q) = \sum_{i=1}^{P} \sum_{j=1}^{Q} y(p + r \times i,\; q + r \times j)\, w(i, j) \tag{1} $$

Here, d(p, q) is the output of the dilated convolution of the input y(p, q), w(i, j) is a kernel of length P and width Q, and r is the dilation rate, which equals one for a normal convolution. In general, a pooling layer is used after a convolution layer in a CNN. Pooling operations such as max and average pooling are used to prevent overfitting and reduce variance. However, pooling drastically reduces the spatial resolution, which means that spatial information is lost from the feature maps. Deconvolution [20] is an alternative way to recover the spatial resolution by up-sampling, but it adds complexity and execution time and is therefore not suitable for all applications. Dilated convolution is a better choice: it uses sparse kernels that take over the role of both the convolution and the pooling layers. The dilated convolution enlarges the receptive field without increasing the computational cost or the number of parameters in the network. A small kernel of size f × f is enlarged to f + (f − 1)(r − 1) along each side by a dilated convolution with dilation rate r. The dilated CNN thus provides multi-scale contextual information at the same resolution. To illustrate the difference between up-sampling and dilated convolution, Fig. 1 is reproduced from [17]. In the upper part of Fig. 1, max-pooling is used, which downsamples the input by a factor of two. The result is passed through a convolutional layer with a 3 × 3 Sobel kernel. The output size is reduced by a factor of two because of the max-pooling operation, and a deconvolution layer then up-samples the output to the size of the input. In the lower part of Fig. 1, dilated convolution with dilation rate r = 2 is used. It keeps the input and output dimensions equal and retains more detailed information than the up-sampling operation.
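As a minimal sketch of Eq. 1 and of the enlarged kernel size, the following plain-Python code evaluates a dilated convolution directly from the double sum. It is an illustration under stated assumptions rather than the CSRNet implementation: indices are 0-based, no padding is applied (so the output shrinks, unlike the padded layers used in practice), and the function names are hypothetical.

```python
def effective_kernel_size(f, r):
    """Side length covered by an f x f kernel with dilation rate r."""
    return f + (f - 1) * (r - 1)


def dilated_conv2d(y, w, r=1):
    """Evaluate the double sum of Eq. 1 at every position where the
    dilated kernel fits entirely inside the input (no padding)."""
    P, Q = len(w), len(w[0])
    out_rows = len(y) - effective_kernel_size(P, r) + 1
    out_cols = len(y[0]) - effective_kernel_size(Q, r) + 1
    out = []
    for p in range(out_rows):
        row = []
        for q in range(out_cols):
            total = 0
            for i in range(P):
                for j in range(Q):
                    # Input samples are taken r pixels apart.
                    total += y[p + r * i][q + r * j] * w[i][j]
            row.append(total)
        out.append(row)
    return out


# 5 x 5 input holding 0..24 and a 3 x 3 kernel of ones (arbitrary example).
y = [[5 * a + b for b in range(5)] for a in range(5)]
w = [[1, 1, 1] for _ in range(3)]

print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))
print(dilated_conv2d(y, w, r=1))  # normal convolution, 3 x 3 output
print(dilated_conv2d(y, w, r=2))  # dilated, kernel spans the whole 5 x 5 input
```

With r = 2 the same nine weights span a 5 × 5 window, so a single output value already sees the whole example input.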
M. S. Fahad and A. Deepak
Fig. 1 Comparison between max-pooling, convolution, up-sampling, and dilated convolution. The size of the kernel is 3 × 3 in both operations. The dilation rate r = 2 is used in the dilated convolution operation
CSRNet was fine-tuned from VGG-16. Its front-end consists of the first ten layers of VGG-16, which contain only three pooling layers instead of the five pooling layers of the original VGG-16. Four back-end architectures were placed after the VGG-16 front-end of CSRNet. Because the output of VGG-16 is not the same size as the input, up-sampling was used to make the input and output sizes equal. The four back-end architectures, which use dilated convolution, were then evaluated. An RNN combined with dilated convolution could also be a good alternative for estimating high-density images. The loss function of CSRNet is defined as follows:

$$ L(W) = \frac{1}{N} \sum_{i=1}^{N} \left\| C(X_i, W) - C_i \right\|_2^2 \tag{2} $$

where X_i is the i-th image given to CSRNet, N is the batch size during training, C(X_i, W) is the output generated by CSRNet with parameters W, and C_i is the ground truth count of the image. The first 10 convolutional layers of VGG-16 (front-end) were fine-tuned. The back-end layers were initialized with Gaussian weights with a standard deviation of 0.01. The Euclidean distance was used to measure the difference between the network output and the ground truth label, as defined in Eq. 2. Stochastic gradient descent (SGD) was used to optimize the parameters with a fixed learning rate of 1e−6.
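The loss in Eq. 2 and a fixed-rate SGD update can be sketched as follows. This is a hedged illustration, not the training code of CSRNet: outputs and targets are represented as flat lists of numbers, the gradients are supplied by the caller, and the names `euclidean_loss` and `sgd_step` are hypothetical.

```python
def euclidean_loss(outputs, targets):
    """Eq. 2: mean over the batch of the squared L2 distance between
    each network output and its ground truth."""
    if not outputs or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and equal in length")
    total = 0.0
    for out, tgt in zip(outputs, targets):
        total += sum((o - t) ** 2 for o, t in zip(out, tgt))
    return total / len(outputs)


def sgd_step(params, grads, lr=1e-6):
    """One plain SGD update with a fixed learning rate (1e-6 as in the text)."""
    return [p - lr * g for p, g in zip(params, grads)]


# Batch of N = 2 toy outputs and targets (arbitrary numbers).
outputs = [[1.0, 2.0], [0.0, 0.0]]
targets = [[1.0, 0.0], [3.0, 4.0]]
print(euclidean_loss(outputs, targets))   # (4 + 25) / 2
print(sgd_step([1.0, -2.0], [0.5, 0.5]))
```

In a real setup the gradients would come from backpropagation through the network; they are passed in by hand here only to keep the sketch self-contained.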
3 Result and Discussion
The CSRNet model is trained on the ShanghaiTech dataset, which is divided into part A and part B. Part A contains 482 images of highly congested scenes collected from the Internet. Part B contains 716 images taken on streets in Shanghai; these images are sparser than those in part A. The entire dataset contains a total of 330,165 persons. The CSRNet model is evaluated using real-life images of the Indian dataset. We collected 26 images from NIT ghat and 35 images from various events organized in our college, and downloaded crowd images from the Internet. The Indian dataset is divided into three parts, namely, (1) NIT ghat, (2) Google, and (3) college. The NIT ghat dataset contains top view images, while the Google and college datasets contain front view images. The ground truth of these images was annotated by our M. Tech students at NIT Patna, India.
3.1 Evaluation Metrics
Two popular metrics, Mean Absolute Error (MAE) and Mean Squared Error (MSE), are used to evaluate the CSRNet model. The MAE and MSE are defined as follows:

$$ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{L}_i - L_i \right| \tag{3} $$

$$ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{L}_i - L_i \right|^2 \tag{4} $$

where N is the number of images in the evaluation dataset, L_i is the ground truth count, and \hat{L}_i is the estimated count from the model, which is calculated as

$$ C_i = \sum_{l=1}^{L} \sum_{w=1}^{W} p_{(l,w)} \tag{5} $$

p_{(l,w)} is the pixel value at (l, w) of the generated density map, while L and W are the length and width of the density map. The density maps of different images are plotted in Fig. 2. The density map in Fig. 2b follows the distribution of the people in the corresponding image. The images in Fig. 2a, c are taken from the front view, and their density maps do not follow the distribution of the people in the actual images. This is due to the self-occlusion of people in the images, a natural problem that degrades the accuracy of crowd estimation.
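The metrics of Eqs. 3 to 5 can be sketched directly in plain Python. The density maps and ground truth counts below are made-up toy values chosen only to make the arithmetic easy to follow, and the function names are hypothetical.

```python
def count_from_density_map(density_map):
    """Eq. 5: the estimated count is the sum of all density map pixels."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(estimated, ground_truth):
    """Eq. 3: average absolute difference between estimated and true counts."""
    n = len(ground_truth)
    return sum(abs(e - g) for e, g in zip(estimated, ground_truth)) / n


def mean_squared_error(estimated, ground_truth):
    """Eq. 4: average squared difference (no square root, as in the text)."""
    n = len(ground_truth)
    return sum(abs(e - g) ** 2 for e, g in zip(estimated, ground_truth)) / n


# Two toy 2 x 2 density maps and their (made-up) ground truth counts.
density_maps = [
    [[0.5, 0.5], [1.0, 0.0]],      # sums to 2.0
    [[0.25, 0.25], [0.25, 0.25]],  # sums to 1.0
]
ground_truth = [3, 1]

estimated = [count_from_density_map(m) for m in density_maps]
print(estimated)
print(mean_absolute_error(estimated, ground_truth))
print(mean_squared_error(estimated, ground_truth))
```

Note that Eq. 4 as written has no square root, so the sketch reports the plain mean of squared errors.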
Fig. 2 Images with different views and their density maps. a and c are front view images, while b is a top view image. The upper row shows the density maps of the images in the lower row
To support the above claim, the CSRNet model is evaluated using real-life images of the Indian dataset. The MAE and MSE values are calculated for each dataset and reported in Table 1. The MAE and MSE values for the NIT ghat images are better than those for the Google and college images. The NIT ghat images are taken from the top view, while the Google and college images are taken from the front view. The CSRNet model was trained on the ShanghaiTech dataset, which in general contains top view images. The MAE and MSE values increase significantly for the Google and college images because the training and testing distributions differ. The MAE and MSE values of the front view datasets are, in relative terms, 28.64% and 47.86% higher than those of the top view dataset. The CSRNet model is evaluated on real-life images. These images are taken in different light conditions and include both color and greyscale images. These results highlight a problem that can occur in natural environments: the accuracy for front view images is degraded by the occlusion of human bodies in the images. A solution is suggested to deal with this problem. The Generative Adversarial Neural Network (GANN) is suggested, which can project front view images into top view images. The transformed front view images can then be evaluated using the CSRNet model. GANN is a popular approach to image translation, and researchers now widely use it for domain adaptation [21]. The future direction of our work is to develop the GANN and evaluate the images it generates. The GANN formulation is discussed in Sect. 4.

Table 1 Evaluation result of real-life images from CSRNet model
Datasets                                          MAE     MSE
NIT ghat images (top view)                        47.72   53.13
Google images (front view)                        61.39   78.56
College images (front view, dense and sparse)     62.78   79.8
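The relative increase quoted in the text can be checked against the values in Table 1. The sketch below assumes that the reported 28.64% and 47.86% compare the Google (front view) row with the NIT ghat (top view) row; this pairing is inferred from the arithmetic and is not stated explicitly in the paper. The helper `relative_increase` is a hypothetical name.

```python
def relative_increase(value, baseline):
    """Percentage by which `value` exceeds `baseline`."""
    return (value - baseline) / baseline * 100.0


# Values copied from Table 1.
top_view = {"MAE": 47.72, "MSE": 53.13}       # NIT ghat images
google = {"MAE": 61.39, "MSE": 78.56}         # front view
college = {"MAE": 62.78, "MSE": 79.8}         # front view

for name, front in (("Google", google), ("College", college)):
    mae_up = relative_increase(front["MAE"], top_view["MAE"])
    mse_up = relative_increase(front["MSE"], top_view["MSE"])
    print(f"{name}: MAE +{mae_up:.2f}%, MSE +{mse_up:.2f}%")
```

Under this assumption the Google row matches the quoted percentages to within rounding, while the college row gives slightly larger increases.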
4 Suggested Approach
In this approach, the variation due to the view is reduced using GANN [22]. GANN is configured as a domain converter that projects front view images into top view images. Figure 3 shows the proposed GANN architecture. In general, a GANN uses one generator and one discriminator. The discriminator is a classifier that identifies whether the generated image is real or fake; in our case, it identifies whether the generated image is a top view or a front view. The proposed architecture uses two discriminators. The first discriminator (D1) identifies whether the generated image is a top view or a front view. The second discriminator (D2) identifies whether or not the generated image is associated with the original image. The generator model (G) takes images as input and captures the data distribution. Both the discriminators and the generator are neural networks trained using the backpropagation algorithm. Real-life images contain various types of noise that can be filtered by the techniques proposed in [23]. The input to the discriminator D2 is a pair of top and front view images. This network is trained to output the probability that the pair is associated. The loss function for discriminator D2 is defined as follows:

$$ L_{D2}(X_F, X) = -t \cdot \log[D2(X_F, X)] + (t - 1) \cdot \log[1 - D2(X_F, X)] \tag{6} $$

Fig. 3 Proposed GANN architecture that can project front view images into top view images. There are two discriminators, D1 and D2. D1 identifies whether the generated image is a top view or a front view. D2 identifies whether or not the generated image is associated with the top view image
$$ t = \begin{cases} 1, & \text{if } X_F = X_T \\ 0, & \text{otherwise} \end{cases} $$

where X_F and X_T are the front and top view images, respectively, and L_{D2} is the loss of the discriminator D2. The input to the discriminator D1 is the top and front view images. This network is trained to output the probability that the image is a top view rather than a front view. The loss function for discriminator D1 is defined as follows:

$$ L_{D1}(X) = -t \cdot \log[D1(X)] + (t - 1) \cdot \log[1 - D1(X)] \tag{7} $$

$$ t = \begin{cases} 1, & \text{if } X = X_T \\ 0, & \text{if } X = X_F \end{cases} $$

The input to the generator G is the front view images. The generator G is trained with the help of discriminators D1 and D2. When the front view images are given
to the generator, their labels are set to top view. At first, the discriminator classifies the generated images as front view images. The generator is trained until the discriminator D1 classifies the generated images as top view images. In this way, the generator fools the discriminator D1. This type of training is called adversarial training. At the same time, discriminator D2 checks whether or not the generated top view images are associated with the top view images. The discriminator D2 thus preserves important information about the images. If D2 were absent from the network, the actual information could be lost from the images.
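The two discriminator losses in Eqs. 6 and 7 share the binary cross-entropy form with a 0/1 label, which the sketch below computes for scalar probabilities. It is an illustration only, since the paper states that the GANN is not fully implemented. The function names are hypothetical, and `generator_objective` in particular is an assumption: the text says only that generated images are labelled as top view and that both discriminators guide the generator, so the unweighted sum used here is one plausible choice, not the authors' definition.

```python
import math


def bce_loss(prob, label, eps=1e-12):
    """-label * log(prob) + (label - 1) * log(1 - prob), as in Eqs. 6 and 7."""
    prob = min(max(prob, eps), 1.0 - eps)   # keep log() finite
    return -label * math.log(prob) + (label - 1) * math.log(1.0 - prob)


def d1_loss(prob_top_view, is_top_view):
    """Eq. 7: D1 should output 1 for top view images and 0 for front view."""
    return bce_loss(prob_top_view, 1 if is_top_view else 0)


def d2_loss(prob_associated, is_associated):
    """Eq. 6: D2 should output 1 when the image pair is associated."""
    return bce_loss(prob_associated, 1 if is_associated else 0)


def generator_objective(d1_prob_on_generated, d2_prob_on_generated_pair):
    """ASSUMPTION: generated images are scored against the labels
    'top view' and 'associated', and the two terms are simply added."""
    return bce_loss(d1_prob_on_generated, 1) + bce_loss(d2_prob_on_generated_pair, 1)


print(d1_loss(0.9, True), d1_loss(0.9, False))
print(d2_loss(0.2, True))
print(generator_objective(0.1, 0.9), generator_objective(0.9, 0.9))
```

The generator objective falls as D1 becomes more convinced that the generated image is a top view, which corresponds to the generator fooling D1 in the adversarial training described above.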
5 Conclusion
This paper highlighted the problem of crowd estimation in real-life images. The CSRNet model was evaluated on real-life images categorized into top view and front view images. The CSRNet model achieved better accuracy for the top view images than for the front view images, where accuracy is reduced by the self-occlusion of human bodies. To deal with this issue, a GANN model is suggested that can transform front view images into top view images, so that the CSRNet model can achieve better accuracy for front view images. The proposed GANN differs from the original GANN in that it uses two discriminators. One discriminator identifies whether the generated image is a top view or a front view. The second checks whether or not the generated image is associated with the original image. The second discriminator preserves the original information while the first discriminator drives the transformation into the top view. In this work, the GANN model is not fully implemented. Our future work is to evaluate our dataset using images generated by the proposed GANN.
References
1. P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2009)
2. J. Gall, V. Lempitsky, Class-specific hough forests for object detection, in Decision Forests for Computer Vision and Medical Image Analysis (Springer, 2013), pp. 143–157
3. P. Gardziński, K. Kowalak, Ł. Kamiński, S. Maćkowiak, Crowd density estimation based on voxel model in multi-view surveillance systems, in 2015 International Conference on Systems, Signals, and Image Processing (IWSSIP) (IEEE, 2015), pp. 216–219
4. A.S. Rao, J. Gubbi, S. Marusic, M. Palaniswami, Estimation of crowd density by clustering motion cues. Vis. Comput. 31(11), 1533–1552 (2015)
5. N. Hussain, H.S.M. Yatim, N.L. Hussain, J.L.S. Yan, F. Haron, Cdes: a pixel-based crowd density estimation system for masjid al-haram. Saf. Sci. 49(6), 824–833 (2011)
6. X. Xu, D. Zhang, H. Zheng, Crowd density estimation of scenic spots based on multifeature ensemble learning. J. Electr. Comput. Eng. 2017 (2017)
7. G. Shrivastava, K. Sharma, M. Khari, S.E. Zohora, Role of cyber security and cyber forensics in India, in Handbook of Research on Network Forensics and Analysis Techniques (IGI Global, 2018), pp. 143–161
8. R. Ghosh, S. Thakre, P. Kumar, A vehicle number plate recognition system using region-of-interest based filtering method, in 2018 Conference on Information and Communication Technology (CICT) (IEEE, 2018), pp. 1–6
9. V.A. Sindagi, V.M. Patel, A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recogn. Lett. 107, 3–16 (2018)
10. D.B. Sam, S. Surya, R.V. Babu, Switching convolutional neural network for crowd counting, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2017), pp. 4031–4039
11. E. Walach, L. Wolf, Learning to count with cnn boosting, in European Conference on Computer Vision (Springer, 2016), pp. 660–676
12. C. Shang, H. Ai, B. Bai, End-to-end crowd counting via joint learning local and global count, in 2016 IEEE International Conference on Image Processing (ICIP) (IEEE, 2016), pp. 1215–1219
13. V.A. Sindagi, V.M. Patel, Generating high-quality crowd density maps using contextual pyramid cnns, in Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 1861–1870
14. C. Zhang, K. Kang, H. Li, X. Wang, R. Xie, X. Yang, Data-driven crowd understanding: a baseline for a large-scale crowd dataset. IEEE Trans. Multimedia 18(6), 1048–1061 (2016)
15. M. Marsden, K. McGuinness, S. Little, N.E. O’Connor, Fully convolutional crowd counting on highly congested scenes. arXiv preprint arXiv:1612.00220 (2016)
16. Y. Zhang, D. Zhou, S. Chen, S. Gao, Y. Ma, Single-image crowd counting via multi-column convolutional neural network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 589–597
17. Y. Li, X. Zhang, D. Chen, CSRNet: dilated convolutional neural networks for understanding the highly congested scenes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 1091–1100
18. S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in Advances in Neural Information Processing Systems (2015), pp. 1135–1143
19. L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.L. Yuille, Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
20. H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 1520–1528
21. S. Yu, H. Chen, G. Reyes, B. Edel, N. Poh, Gaitgan: invariant gait feature extraction using generative adversarial networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017), pp. 30–37
22. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Systems (2014), pp. 2672–2680
23. S. Sahu, H.V. Singh, B. Kumar, A.K. Singh, P. Kumar, Image processing based automated glaucoma detection techniques and role of de-noising: a technical survey, in Handbook of Multimedia Information Security: Techniques and Applications (Springer, Cham, 2019), pp. 359–375
Scalable Machine Learning in C++ (CAMEL)
Moolchand Sharma, Anshuman Raina, Kashish Khullar, Harshit Khandelwal, and Saumye Mehrotra
Abstract As the technology for collecting and processing data from everyday tasks has advanced, the number of inferences drawn from datasets has risen significantly. This has made Machine Learning seemingly omnipresent in decision-making processes around the world. From decision-driven programs to Expert Systems, we count on Machine Learning for optimization and for increasing the efficiency of subsystems. Yet, we find that Machine Learning today is constrained by the programming language used in development. In this paper, we created a library developed purely in C++, a widely used compiled language. We also aim to measure the performance of compiled versus interpreted languages after developing the algorithms. The scientific library “Armadillo” is used to simplify many math-related functions and to help us handle dynamically loaded datasets instead of statically coded matrices. This paper aims to highlight the differences between compiled and interpreted languages and to find out whether compiled languages are a better alternative for ML algorithms. This research is also intended as a continuing effort toward a library like TensorFlow, which offers Application Program Interfaces (APIs) coded in C as a medium. Lastly, we aim to increase the scalability of these algorithms by removing language-based constraints. These are the main reasons for developing the C++ Augmented Machine Learning (CAMEL) library.
M. Sharma · A. Raina (B) · K. Khullar · H. Khandelwal · S. Mehrotra, Department of Computer Science and Engineering, MAIT, Delhi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_91
M. Sharma et al.
Keywords CAMEL (C++ Augmented MachinE Learning) · Machine learning · Library · Regression · Classification · Perceptron · Time performance · Scalability · Interpreted and compiled languages
1 Introduction
With the recent advancements in machine learning algorithms, there is a need to improve the scalability of the existing algorithms. C++ is a fast, compiled, object-oriented programming language that executes at a speed comparable to assembly language. Presently, Python is the interpreted language most widely used by data scientists and analysts all over the world for creating models. Interpreted languages are generally much slower than compiled languages. For students learning machine learning, implementing its algorithms from scratch is the best way to learn, as it strengthens their grasp of the concepts. Although Python's built-in libraries offer convenient alternatives, C++ offers more to a student of machine learning than to a user of machine learning. Many Python libraries, instead of running code in the Python interpreter, run C++ or other low-level language code to create models. Although this improves the speed to some extent, it also adds overhead to the model. In our library, we have created models in C++ and compared the execution speed and efficiency of our system with an existing implementation and with a model developed in an interpreted language. Once implemented, our library can be applied in multiple domains to improve the scalability of machine learning models. Since our library is coded in C++, it inherently allows programmers and data scientists to code machine learning projects in C++. For ease, we shall denote our models as CAMEL (C++ Augmented MachinE Learning). Although multiple machine learning libraries are available for building models, only a handful of them are centered on the scalability of those models. Our library therefore targets this domain by being lightweight and scalable. Furthermore, we make use of the latest C++ features and template strategies to optimize our code.
Lastly, our library provides a consistent and intuitive interface with a short learning curve, which is vital for a better development experience. CAMEL, software for small-scale scalable machine learning, is open-sourced at https://github.com/camelml/camel. The paper is organized as follows. Relevant concepts regarding machine learning and the algorithms are described in Sect. 2. The existing system is described in Sect. 3. The proposed procedure followed in the experiment is described in Sect. 4. The experimental setup and the results produced by executing the proposed algorithms are described and discussed in Sects. 5, 6, and 7. Finally, the last section concludes the paper with final remarks.
2 Literature Review
Many data scientists and researchers use Python or R for their machine learning models because of the highly readable syntax and gentle learning curve of these languages. Hence, few texts have been published and little research has been conducted on machine learning algorithms in compiled languages in the recent past. This changed with the onset of TensorFlow. Beginning in 2011, Google Brain built DistBelief as a proprietary machine learning framework based on neural networks. Its use grew steadily across various Alphabet companies for ubiquitous applications in new-age decision systems [1, 2]. Google assigned many computer scientists to simplify and refactor DistBelief into a faster, more robust application-grade library, which is now known as TensorFlow [3, 4]. Mlpy is a machine learning library written in the Python programming language that is built on NumPy at its core [5]. It provides basic models for supervised and unsupervised learning. Mlpy enhances the development experience by providing modularity, and its high usability reduces the number of lines of code. sklearn is a medium-scale Python module that provides implementations of machine learning algorithms in the Python programming environment [6]. For mathematical calculations, sklearn uses NumPy and SciPy for arrays and matrices, respectively. Further, it optimizes computation by using Cython, which provides the speed of compiled languages within a Python-like environment [7]. The release of TensorFlow as open source heralded a positive change across many organizations. TensorFlow could now be used as open-source software, configured according to one's needs and the amount of data one had (GPU vs. CPU) [8, 9]. Projects revolving around face identification and image analysis now exist in profusion.
But the main problem remains mostly untouched, namely the absence of alternatives for coders who want compiled-language support for compiled-language software. Although extensions for using Python and R in such programs exist as a solution, they typically cannot match the speed of native algorithms. Armadillo is a C++ mathematical library for linear algebra whose syntax is modeled on MATLAB [10]. Armadillo depends heavily on templates, operator overloading, and type casting. By thoroughly exploiting the latest C++ features, Armadillo enables compile-time parsing and optimizations for faster computation. Ryan et al. have presented a similar library of machine learning algorithms implemented in C++. Their library uses generic C++ programming methodologies and outperforms its counterparts [11]. Their experimental section demonstrates that their C++ implementation works more efficiently than other frameworks for the same parameters and datasets. Similarly, Dlib-ml is another open-source library that allows scientists and researchers to develop machine learning models in C++ [12]. Dlib-ml optimizes its mathematical computations by using BLAS.
Optimizing machine learning algorithms is particularly tricky and tedious. The studies in [13–15] provide a comprehensive survey of machine learning optimization techniques, comparing multiple methods that have been used in the past and are currently being implemented. Zidek et al. have presented research on the usability of web technologies for machine learning applications [8]. The backend system that runs the machine learning algorithm is implemented in C++. Since the embedded system runs on low power, the high computation speed of C++ makes it an ideal solution for running a machine learning algorithm on a Raspberry Pi [16, 17].
3 Existing System
3.1 Why C++?
Generic programming, optimization, acceleration, and code portability are some of the salient features of C++. We aim to exploit these strengths of C++ in our library as well. We have also used various features of the C++17 standard, which allow our code to run faster than it would under the earlier C++11 standard. Moreover, C++ compilers apply various optimization techniques such as intraprocedural dataflow analysis, function inlining, and efficient memory allocation. Machine Learning, AI, and Neural Networks are now ubiquitous in every segment, be it research, industry, or academia. There is a general perception that this work is always done in some language other than C++ (such as Python or R). But we often forget that even though Python is one of the most used languages for AI, most of the frameworks that work with Python rely on highly optimized C++ libraries under the hood. Finally, if you need to deploy your application on embedded hardware or in a more constrained environment, you will eventually need to develop the final solution in C++ (regardless of whether a cloud-based virtual machine is used). Many C++ libraries offer fast numerical operations and are the tools of the trade for many developers of intelligent systems. From the developer's point of view, the prevalence of C++ makes its use in ML a rational choice. Even now, many deep learning libraries support extensive C++ and C APIs, which help in designing better and more accurate composite models. Also, using C++ in models ensures that the program is closer to machine code than with other high-level languages.
3.2 Advantages of C++
C++ takes precedence over other languages in our proposed model because:
(i) C++ is a robust language and is often the go-to platform for scalable and secure APIs that extend other systems.
(ii) Through the Standard Template Library (STL), we can apply generic programming practices to the factory pattern. This helps to generalize the model, i.e., we are able to train our model irrespective of the type of data entered.
(iii) C++ is much closer to machine code than HLLs such as Python and R. This saves time during translation, which matters when the code base is large.
(iv) A compiler is better suited to speed-critical tasks, where mistakes are corrected by experience. This is a dominant priority in our choice of language.
(v) Due to its static nature and the fact that it is “closer to the metal,” it is used in mathematical computation, Neural Networks, AI, and platforms like TensorFlow.
The main reasons that lead us to prioritize C++ when coding the models are the following:
(i) Speed: C++ is a compiled language, which uses a compiler to convert high-level code to machine code. The compiler translates the whole program before execution, instead of the line-by-line approach of an interpreter. This makes it much more efficient when speed is a factor. We aim to utilize this speed since our datasets will have records ranging from hundreds to thousands, if not millions.
(ii) Static Nature of C++: C++ does not run its executable code on a virtual machine. The code is compiled specifically for the system it runs on. This makes it a better bare-metal programming language and efficient for software requiring little or no runtime support.
3.3 C++ and Machine Learning
Geometric transformations, numerical solvers, and other related computations, which are part of linear algebra, have found real use in machine learning applications and especially in deep learning models. In most cases, deep learning applications involve operations on matrices and vectors of varying sizes with a large number of additions and multiplications. As a result, C++ is used under the hood to speed up such computations. To further improve the computation speed, execution can be accelerated by using the many cores of a GPU. Intel’s Nervana, Google’s Tensor Processing Unit, and Microsoft’s Project Catapult are among those that have implemented such acceleration [11, 9].
Compared with the CPU, the use of the GPU has significantly improved the performance of these applications. Moreover, GPU clusters have offered easy-to-use solutions to many data scientists and researchers at low cost. The customizability and flexibility of C++ allow a researcher to write machine learning applications for any purpose. Since C++ is closer to assembly code, it offers the added advantage of using the same group of fast and rigorous C++ libraries that modeling giants like TensorFlow, Scala, and CNTK employ. This rich library support and the already developed functionality help with various complex tasks such as matrix multiplication and convolution theorem analysis, and ease hyperparameter tuning. The ML modeling libraries and APIs available to today's users and developers can take advantage of accelerated computation on GPUs, which enables many ML applications to process larger datasets. The hardware currently most prevalent among ML developers is Nvidia’s GPUs. TensorFlow also offers built-in hardware acceleration through CUDA, a parallel computing platform. A similar form of hardware acceleration has also been built for AMD and for embedded platforms such as the Raspberry Pi and ARM Mali GPUs using TensorFlow interfaces [18].
4 Proposed System
In this paper, we implement machine learning algorithms in C++ for two purposes. First, we compare the performance of the C++ implementations against variants of the same algorithms written in other languages. Second, we aim to devise a Medium-Scale Integrated Library for Machine Learning in C++ to increase support for C++ programmers.
4.1 Scalability
The scalability of a system is its capacity to keep handling incoming requests efficiently as the number and size of those requests grow. Scalability is thus often associated with augmentation of the system rather than its revision: a scalable system can be adapted, with relative ease, to a broader user base than initially intended. C++ allows us to achieve the needed scalability, although it may not be the most popular choice for machine learning compared with generally more straightforward languages such as MATLAB, Java, or Python [19]. Of these four languages, C++ is the closest to machine code and thus the hardest to understand and code in. It also supports generic programming via its Standard Template Library (STL). Although C++ is a complex and diverse language, it is quite scalable. Using basic coding concepts, it is possible to write functions for any complex purpose that is predefined in MATLAB or R [15]. This makes for a better learning experience.
4.2 Extrapolation as Application Programming Interfaces (APIs)
Implementing these models with Armadillo is more straightforward than writing general code. Without Armadillo, however, we gain a better understanding of the models themselves from a basic level, rather than just using pre-implemented blocks of code. As is evident from the released software, it is much more efficient to use these libraries than to write general code from scratch. There are, however, more advantages to not using libraries than just an in-depth understanding of the concepts: it avoids the time spent going through pages of library documentation. Armadillo gives us one significant advantage, namely matrices as a derived data type. Thus, instead of writing loops to compute sums, products, and other operations, we use dot and cross products as well as summarization and aggregation features to reduce the code. Further, a library may not be maintained for as long as you want to use it, in which case you might have to remove it and rebuild everything from the base after all. Working without built-in functions and libraries therefore adds another level of independence to the work and to the competence of the coder.
4.3 Use of Armadillo
In the code release, we can see two variations: (i) the LARS code has equations coded along with the pseudo-code, i.e., following set conventions; (ii) the evolved and more complicated algorithms are more concise and have fewer equations. The first follows a set paradigm of procedural and object-oriented functions, in which we encode the equations using math libraries to get the desired results. For the more complex algorithms, such as the Decision Tree model and K Nearest Neighbors, we have used Armadillo as our go-to scientific library. This helps reduce the run time and offers the added benefit of coding as in interpreted languages, which takes less development time. For example, to multiply two matrices m and n, we can use the approach demonstrated in Algorithm 1.
Algorithm 1—Multiplication of two Matrices
But using Armadillo, we can use the following:
mat a; (declaration of ‘a’ as the derived data type mat)
a = m*n; (the multiplication is done, and the result is stored in matrix ‘a’)
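The body of Algorithm 1 is not reproduced in the text. As a minimal sketch of what a loop-based product typically looks like, assuming matrices stored as nested `std::vector<double>` (the `Matrix` alias and the `multiply` helper are hypothetical names for illustration, not taken from the CAMEL sources), the explicit version that the one-line Armadillo statement replaces could be written as:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type used only for this illustration.
using Matrix = std::vector<std::vector<double>>;

// Naive triple-loop product a = m * n, where m is (rows x inner)
// and n is (inner x cols). Not the CAMEL implementation.
Matrix multiply(const Matrix& m, const Matrix& n) {
    const std::size_t rows = m.size();
    const std::size_t inner = n.size();
    const std::size_t cols = inner == 0 ? 0 : n[0].size();
    Matrix a(rows, std::vector<double>(cols, 0.0));
    for (std::size_t i = 0; i < rows; ++i) {
        // The column count of m must match the row count of n.
        assert(m[i].size() == inner);
        for (std::size_t j = 0; j < cols; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < inner; ++k) {
                sum += m[i][k] * n[k][j];
            }
            a[i][j] = sum;
        }
    }
    return a;
}
```

The three nested loops are exactly the bookkeeping that a matrix type with an overloaded `*` operator hides from the caller.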
4.4 General Approach for Implementation of the Proposed Models
Figure 1 gives the flowchart of the programming paradigm followed to design the algorithms of the library.
5 Experimental Setup
Table 1 shows the specifications of the machine used to test our proposed system. The datasets considered for each model are listed below.
Fig. 1 Flowchart for the general approach
Table 1 Machine specification

| Specifications | Details |
|---|---|
| Processor | Intel Core i5 (6th Gen) processor |
| RAM | 8 GB DDR4 RAM |
| Graphics memory | 2 GB |
| Clock speed | 2.3 GHz |
| Graphic processor | NVIDIA GeForce 940MX |
5.1 Regression Dataset
Table 2 shows the dimensions of the datasets used to test the performance of the algorithms.

Table 2 Regression datasets details

| Size (rows * columns) | Abbreviated |
|---|---|
| 100 * 3 | Small-scale |
| 1000 * 3 | Medium |
| 10000 * 3 | Large |
| 100000 * 3 | Very large |
5.2 Classification Dataset
Table 3 shows the dimensions of the datasets used to test the performance of the algorithms.
6 Observations
We have tested the proposed C++ library against existing machine learning libraries and algorithms in other languages. In the following tables, we compare the speed and computational efficiency of our library with those of commonly used machine learning libraries. Tables 4, 5, 6, 7, 8, 9, and 10 [7, 8] show the time performance comparison of individual algorithms on the same machine and the same datasets (note: 1 ms = 10^-3 s) for (i) Linear Regression, (ii) Polynomial Regression, (iii) Multivariate Regression, (iv) Logistic Regression, (v) KNN, (vi) K-Means, and (vii) Single-Layer Perceptron.
Table 3 Classification datasets details

| Size (rows * columns) | Abbreviated |
|---|---|
| 100 * 3 | Small-scale |
| 1000 * 3 | Medium |
| 10000 * 3 | Large |
| 100000 * 3 | Very large |

Table 4 Linear regression time comparison

| Dataset size | CAMEL (C++) (ms) | Weka (Java) (ms) | sklearn (Python) |
|---|---|---|---|
| 100 × 2 | 0.79 | 13 | 4.42 ms |
| 1000 × 2 | 2.65 | 27 | 17.8 ms |
| 10000 × 2 | 13.41 | 12 | 25.49 ms |
| 100000 × 2 | 24.46 | 48 | 30.71 s |
Table 5 Polynomial regression time comparison

| Power | Dataset size | CAMEL (C++) (ms) | Weka (Java) (ms) | sklearn (Python) (ms) |
|---|---|---|---|---|
| 3 | 100 × 4 | 1.21 | 0.82 | 7.52 |
| 3 | 1000 × 4 | 7.32 | 2.1 | 8.13 |
| 3 | 10000 × 4 | 10.61 | 16 | 10.1 |
| 3 | 100000 × 4 | 62.27 | 52 | 76.2 |

Table 6 Multivariate regression time comparison

| Dataset size | CAMEL (C++) (ms) | Weka (Java) (ms) | sklearn (Python) |
|---|---|---|---|
| 100 × 3 | 0.71 | 2 | 0.47 ms |
| 1000 × 3 | 6.74 | 14 | 2 ms |
| 10000 × 3 | 14.72 | 48 | 8.37 ms |
| 100000 × 3 | 26.54 | 119 | 23 s |

Table 7 Logistic regression time comparison

| Dataset size | CAMEL (C++) (ms) | Weka (Java) | sklearn (Python) (ms) |
|---|---|---|---|
| 100 × 3 | 0.87 | 5 ms | 15.6 |
| 1000 × 3 | 3.92 | 13 ms | 15.66 |
| 10000 × 3 | 5.42 | 89 ms | 15.8 |
| 100000 × 3 | 7.66 | 226 | 15.81 |
Table 8 KNN time comparison

| K | Dataset size | CAMEL (C++) (ms) | Weka (Java) (ms) | sklearn (Python) (ms) |
|---|---|---|---|---|
| 2 | 100 × 3 | 0.215 | 24 | 3.34 |
| 2 | 1000 × 3 | 1.59 | 36 | 4.89 |
| 2 | 10000 × 3 | 9.13 | 107 | 14.83 |
| 2 | 100000 × 3 | 107.49 | 454 | 135 |
Table 9 K-Means time comparison

| K | Dataset size | CAMEL (C++) (ms) | Weka (Java) (ms) | sklearn (Python) (ms) |
|---|---|---|---|---|
| 2 | 100 × 3 | 1.35 | 1.7 | 38.2 |
| 3 | 1000 × 3 | 4.54 | 15 | 86.9 |
| 2 | 10000 × 3 | 27.87 | 22 | 167 |
| 3 | 100000 × 3 | 225 | 301 | 1680 |
Table 10 Single-layer perceptron time comparison

| Dataset size | CAMEL (C++) (ms) | Weka (Java) (ms) | sklearn (Python) (ms) |
|---|---|---|---|
| 100 × 3 | 0.31 | 4 | 4.33 |
| 1000 × 3 | 5.26 | 38 | 5.56 |
| 10000 × 3 | 25.84 | 102 | 16.7 |
| 100000 × 3 | 153.78 | 379 | 158 |
7 Results and Analysis
As stated above, we have compiled the performance of algorithm variants coded in different languages and available on various platforms. The metrics used above are the wall time and the run time of the problem. In layman's terms, we measure the time it takes to fit the data and formulate a hypothesis, which in turn will be tested on the testing data.
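The measurement harness itself is not described in the text. A minimal sketch of how the wall time of a single fitting call might be captured in C++, assuming the fit is wrapped in a callable (the name `time_fit_ms` is hypothetical and not part of CAMEL), is:

```cpp
#include <chrono>
#include <functional>

// Returns the wall-clock time, in milliseconds, of one call to `fit`.
// Hypothetical helper for illustration; `fit` stands in for any
// model-fitting routine.
double time_fit_ms(const std::function<void()>& fit) {
    // steady_clock is monotonic, so the difference is never negative.
    const auto start = std::chrono::steady_clock::now();
    fit();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For sub-millisecond fits such as those on the 100-row datasets, a single call is dominated by timer resolution and cache effects, so averaging over repeated calls would be a reasonable refinement.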
7.1 Linear Regression
In the linear regression comparison, we find that for small-scale datasets our implementation is 8.2 times faster than its closest competitor, sklearn, which is coded in Python. This is because, for small datasets, compiled languages like C++, which are closer to the hardware, perform much better than interpreted languages. Java also fares poorly because it operates at a higher level of complexity than C++. The performance difference decreases as the dataset size increases, but even when the dataset has on the order of 100,000 records, linear regression in C++ remains faster. Figure 2 shows how the performance measures of the three libraries change with dataset size.
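The note accompanying Fig. 8 describes CAMEL's fitting procedure as gradient-free. The text does not give the formulas, so the following is only a sketch of one common gradient-free option, the closed-form ordinary least squares fit for a single feature; the `Line` struct and `fit_line` function are hypothetical names, and whether CAMEL uses this exact formulation is an assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fitted model y = slope * x + intercept (hypothetical type).
struct Line {
    double slope;
    double intercept;
};

// Closed-form least squares fit for one feature; no iterations needed.
Line fit_line(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i];
        sy += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // all x identical: the slope is undefined
    const double slope = (n * sxy - sx * sy) / denom;
    const double intercept = (sy - slope * sx) / n;
    return {slope, intercept};
}
```

A single pass over the data suffices here, which is one reason a closed-form fit can be fast on small inputs compared with an iterative solver.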
7.2 Polynomial Regression
Here, Weka fares better for small-scale datasets, although the time taken by CAMEL is also low. The time taken by Python exceeds both, so our library competes directly with Java's performance. For datasets in the thousands, the performance of Weka and CAMEL is quite comparable, as can be seen on the graph. The curves intersect once the data size reaches ten thousand, after which sklearn becomes much more optimized. Even then, CAMEL fares better than Weka. Figure 3 shows the comparison of the performances across dataset sizes.
Fig. 2 Linear regression comparison chart
Fig. 3 Polynomial regression comparison chart
7.3 Multivariate Regression
In multivariate regression, the time performances of sklearn (in Python) and CAMEL (our library) go hand in hand, and both effectively outperform Weka as the data becomes large. Figure 4 shows the performance of the libraries.
Fig. 4 Multivariate regression comparison chart
7.4 K-Means
CAMEL outperforms sklearn and Weka on K-Means, from small datasets of around a hundred records to large datasets of around 100,000 records. This shows that our K-Means is efficient at finding the underlying relationships within a set of clustered data. Figure 5 shows the time performance comparison of the K-Means algorithm.
Fig. 5 K-means comparison chart
Fig. 6 Perceptron comparison chart
7.5 Perceptron
Our perceptron is intended as an entry point for C++ into deep learning frameworks. For small-scale data our perceptron works very well, although its performance decreases as the data grows. We still outperform Weka on the Single-Layer Perceptron, and this improves as the data grows into the hundreds or thousands. As the graph shows, sklearn takes more time to fit the data at the start, yet its fitting time improves around ten thousand records, which is where the sklearn perceptron's run time beats ours. Ours improves again as the data increases further. Throughout, Weka's time is in a much higher range and need not be compared. Figure 6 shows the time performance comparison on the dataset.
7.6 Logistic Regression
Logistic regression also shows good time performance when coded in C++, as can be seen from the time difference our model delivers. This can be put to effective use in practice. It fares better for every dataset size in comparison to the sklearn and Weka variants. Moreover, the time performance does not depend much on the size, as seen from the data above. Figure 7 graphs the time performance comparison between libraries.
Fig. 7 Logistic regression comparison chart
7.7 K Nearest Neighbors
KNN performed best in our library, as it outperforms the other models at every dataset size. Although sklearn comes close for datasets of around ten thousand records, it is still slower than our model. Figure 8 shows the time performance comparison of the K Nearest Neighbors algorithm.
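The paper states that CAMEL's KNN is built on Armadillo but does not list the code. To make the timed operation concrete, here is a brute-force sketch using only the standard library; the `Sample` struct, the `knn_predict` function, the squared Euclidean distance, and the tie-breaking rule are all assumptions made for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One labelled training point (hypothetical type for illustration).
struct Sample {
    std::vector<double> features;
    int label;
};

// Brute-force KNN: majority label among the k nearest training points.
// Ties in the vote are resolved in favour of the smallest label.
int knn_predict(const std::vector<Sample>& train,
                const std::vector<double>& query, std::size_t k) {
    assert(k >= 1 && k <= train.size());
    std::vector<std::pair<double, int>> dist;  // (squared distance, label)
    dist.reserve(train.size());
    for (const Sample& s : train) {
        assert(s.features.size() == query.size());
        double d2 = 0.0;
        for (std::size_t i = 0; i < query.size(); ++i) {
            const double diff = s.features[i] - query[i];
            d2 += diff * diff;
        }
        dist.emplace_back(d2, s.label);
    }
    // Only the k smallest distances need to be in sorted order.
    std::partial_sort(dist.begin(),
                      dist.begin() + static_cast<std::ptrdiff_t>(k),
                      dist.end());
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k; ++i) {
        ++votes[dist[i].second];
    }
    int best_label = dist[0].second;
    int best_votes = 0;
    for (const auto& v : votes) {
        if (v.second > best_votes) {
            best_votes = v.second;
            best_label = v.first;
        }
    }
    return best_label;
}
```

Each query costs a full pass over the training set, so the prediction time grows linearly with the number of rows, which is the scaling variable in Table 8.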
8 Conclusion
In this paper, we have presented the working of Linear Regression, Logistic Regression, Multivariate Regression, Polynomial Regression, KNN, K-Means, and Perceptron for Neural Networks implemented in C++ for our library named CAMEL. We conducted several tests on our library with dataset sizes ranging from small to medium for each algorithm, alongside the models implemented in other libraries written in Java and Python. Based on our observations, we conclude that CAMEL outperforms other libraries for small datasets for all implemented algorithms. Although we note a spike in training time as the dataset size increases, the training time of other libraries is comparatively much higher. Also, for massive datasets, the time taken by our library is lower than that of all other libraries for every algorithm considered in this paper. Finally, we conclude that even though machine learning practices in Python and R (interpreted languages) are ubiquitous, our library performs better than sklearn and Weka in terms of training time.
Fig. 8 KNN comparison chart. Note: The procedure we follow is to grade for iterations, i.e., a gradient-free approach, instead of relying on custom iterations for accuracy. This is in contrast to the gradient descent employed by sklearn.
9 Future Work
9.1 Scope for Machine Learning in C++
We live in an age where hardware has evolved at a higher rate than the available software. Thus, even if we had to parse and compute over a billion records, the availability of hardware support would be the last concern. Yet further optimization of algorithms and the search for better alternatives to the current system should always be a priority; as our study suggests, once fully optimized, our algorithms can and will replace Python-coded variants. This is because today we have TensorFlow, Keras, Scala, and similar libraries coded in languages that are compiled, not interpreted. Attaching them to a native language closer to the hardware can yield better results if such algorithms are optimized further.
9.2 Scope for CAMEL in Future
The future scope of our library, on the other hand, is quite different. Our library is currently at an initial stage and thus not fully developed. Much remains to be developed before it can be used at the production level. For one, it has a limited number of models, and more research is required to include the latest models currently in practice. Once that is done, it can be used by amateurs and professionals alike to design and run custom machine learning models. Secondly, our library can be extended by parameterizing and formatting the existing functions to support run-time (dynamic) user customization. This will allow the user to run further tests and design more complex models. Moreover, this will help us achieve a higher accuracy score even when the number of program runs and cycles is low. Lastly, we believe that our proposed system can be used at the production level to process millions of records. Therefore, we can extend our research by utilizing the CUDA framework to exploit GPU cores. The library was thus developed first to perform at its optimized level as a Medium-Scale Machine Learning System, and it can then be elaborated and extrapolated to create an LSMLS (Large-Scale Machine Learning System) that can easily be used in Expert Systems to improve the decision-making process.
References
1. Nervana: https://spectrum.ieee.org/tech-talk/computing/software/nervana-systems-putsdeep-learning-ai-in-the-cloud
2. Maili: https://developer.arm.com/technologies/machine-learning-on-arm/developermaterial/software-for-machine-learning-on-arm
3. Tensorflow: https://gpuopen.com/rocm-tensorflow-1-8-release
4. M. Hall, E. Frank, G. Holmes, WEKA: Experiences with a Java open-source project. J. Mach. Learn. Res. (2010)
5. D. Albanese, G. Jurman, R. Visintainer, S. Merler, mlpy: machine learning Python (October 2011)
6. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
7. F. Pedregosa et al., Scikit-learn: machine learning in Python. JMLR 12, 2825–2830 (2011)
8. K. Židek, J. Piteľ, A. Hošovský, Machine learning algorithms implementation into embedded systems with web application user interface, in International Conference on Intelligent Engineering Systems (2017)
9. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, TensorFlow: a system for large-scale machine learning, in USENIX Symposium on Operating Systems Design and Implementation (OSDI 16)
10. C. Sanderson, R. Curtin, Armadillo: a template-based C++ library for linear algebra. J. Open Source Softw. 1, 26 (2016)
11. R.R. Curtin, J.R. Cline, N.P. Slagle, W.B. March, P. Ram, N.A. Mehta, A.G. Gray, ML Pack: A scalable machine learning library (College of Computing Georgia Institute of Technology). J. Mach. Learn. Res. 14, 801–805 (2013)
12. D.E. King, Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. (2009)
13. L. Bottou, F.E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning (2018)
14. P. Oliveira, F. Portela, M.F. Santos, A. Abelha, J. Machado, Machine learning: an overview of optimization techniques. ISBN: 978-1-61804-320-7
15. Machine Learning, Optimization, and Data Science 4th International Conference, LOD 2018 (Volterra, Italy, September 13–16, 2018), Revised Selected Papers
16. Unified and efficient machine learning library, homepage at shogun-toolbox.org
17. R. Monga et al., TensorFlow: large-scale machine learning on heterogeneous systems (PDF) (2015), TensorFlow.org. Google Research. Accessed 10 Nov 2015
18. M. Wong, R. Douglas, E. Barsoum, S. Pati, P. Goldsborough, F. Seide, Towards machine learning for C++ (2016)
19. S. Perez, Google open-sources the machine learning tech behind google photos search, smart reply and more. TechCrunch (2015). Accessed 11 Nov 2015
20. C. Metz, Google just open sourced TensorFlow, its artificial intelligence engine. Wired (2015). Accessed 10 Nov 2015
21. R. Kohavi, D. Sommerfield, Data mining using a machine learning library in C++. Int. J. Artif. Intell. (1997)
22. O. Reyes, E. Pérez, M. del Carmen Rodríguez-Hernández, H.M. Fardoun, S. Ventura, JCLAL: a Java framework for active learning. J. Mach. Learn. Res. 17 (2016)
Intelligent Gateway for Data-Centric Communication in Internet of Things Rohit Raj, Akash Sinha, Prabhat Kumar, and M. P. Singh
Abstract Extending a wireless sensor network to the Internet of Things poses a plethora of challenges. One of the major causes is that these networks have been designed as resource-constrained networks, while the Internet of Things requires delivering a variety of services to end users irrespective of their geolocation. This paper proposes an intelligent gateway to facilitate data-driven communication in resource-constrained networks, so as to enable the extension of resource-constrained sensor/actuator networks to the Internet of Things. The proposed solution allows for multiple levels of Quality of Service in low-end devices without extra infrastructural changes. Results obtained show that the proposed solution is feasible in real-time scenarios and is capable of handling multiple types of data simultaneously.
Keywords Internet of Things · Gateway · Publish-subscribe · Data-centric communication · MQTT
R. Raj · A. Sinha · P. Kumar (B) · M. P. Singh
Computer Science and Engineering, National Institute of Technology Patna, Patna, India
e-mail: [email protected]
R. Raj e-mail: [email protected]
A. Sinha e-mail: [email protected]
M. P. Singh e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_92

1 Introduction
Wireless Sensor Networks (WSNs) have been a focal point of research for the past few years [1]. These networks are characterized by limited energy capabilities and generally use an ad hoc mode of communication, in which the sensor nodes communicate with each other as peers to send and receive data. The ad hoc mode of communication allows peer-to-peer communication without a centralized access point. This feature makes the sensor nodes easily deployable on dynamic and random terrains. Each node in a WSN collects data independently and forwards it to various neighboring nodes based on a predefined algorithm. Popular ad hoc network routing algorithms are Destination-Sequenced Distance Vector (DSDV) [2] and Optimized Link State Routing Protocol (OLSR) [3]. Both are proactive algorithms and use a convergence and aggregation traffic pattern for communication, as shown in Fig. 1. In the aggregation traffic pattern, multiple requests are combined into a single request at some particular node [4]. Sensor networks route their data to the sink. The sink can then send the data to the cloud for further processing and for delivering services to the client. Data is sent to the cloud via the information gateway or task manager node. The gateway hence acts as middleware between the sensors' data and the clients and is responsible for proper delivery. Several middleware architectures have been proposed for this interoperation. The common design patterns used for such gateways are service oriented [5, 6], virtual machine based [7], software defined network (SDN) [8], and publisher–subscriber based [9]. In this paper, we propose an intelligent gateway which segregates the incoming aggregated data from the sink and delivers it to the concerned parties on the cloud with multiple Quality of Service (QoS) levels. The proposed architecture enables deployment of multiple types of sensor nodes, such as temperature sensors, humidity sensors, and luminosity detectors, as a single large network. These nodes can further interoperate and send their data to the sink, where the data is aggregated. The collected data is then communicated to the gateway, where it is segregated in an intelligent manner and delivered to the interested clients.
Fig. 1 Aggregation traffic pattern in WSN (r1, r2, and r3 combined into r1+r2+r3)
Therefore, by using such an intelligent gateway, we can increase the interoperability and cost effectiveness of a wireless sensor network. The proposed intelligent gateway uses the publisher–subscriber model. In such a model, messages are published on a topic to specific receivers rather than to the final intended client. The end clients can then retrieve the data by subscribing to the topic. This message pattern is based on the data-centric approach to communication [4], where the resolution of the final client does not take place at the source. Therefore, the use of the publisher–subscriber model abstracts the network resolution details and allows developers to work on the more important content of the data rather than the delivery mechanisms. In this work, we use the Message Queuing Telemetry Transport (MQTT) protocol at the application layer [10] for pub/sub-based communication between the nodes and the gateway. It is a broker-based protocol which unicasts topic-based messages to the broker, as opposed to the DDS protocol, which uses a broker-less architecture and multicasts the data stream, making it unsuitable for low-energy inter-node WSN communication. The MQTT protocol, which was pioneered at IBM labs, can provide both connectionless and connection-oriented service depending upon the level of Quality of Service (QoS) selected. Furthermore, this protocol is well suited to ZigBee-based low-bandwidth devices because MQTT has a very small code footprint as well as a very low bandwidth cost, with a header as small as 2 bytes per message [10]. This flexibility and lightness make the protocol suitable for our intelligent gateway. The rest of the paper is organized as follows: Sect. 2 discusses the existing literature related to the proposed work; Sect. 3 explains the proposed methodology; Sect. 4 analyzes the experimental results; and finally, Sect. 5 presents the concluding remarks.
2 Related Works
WSNs require the integration of the sensor network with traditional infrastructures (LAN, Internet, etc.), which is done by introducing a gateway between the two [11]. As the dynamic nature of sensor/actuator (SA) networks increases the likelihood of failure of the SA nodes and the wireless links involved, the conventional approach of using network addresses for communication faces many roadblocks and is quite challenging. This motivates the data-centric communication model used in our paper. Publisher–subscriber models are widespread in distributed computing systems. In a standard data-centric communication model [12], the delivery of information to consumers is based on their interests and the content of the data; it does not depend on the network address. Publish/subscribe (pub/sub) models introduce a dynamic application topology, in which it is very easy to add new data sources/consumers or replace existing modules. Three types of pub/sub systems have been proposed, viz., topic based, type based, and content based [13]. Hill et al. [14] proposed a topic-based publisher–subscriber system in which they introduced a broker into the traditional network. Here, a single pub message may be sent to a number of subscribers, which captures the true essence of the publisher–subscriber model. Similarly, Zoumboulakis et al. proposed Asene [15], in which they implemented an active database within a WSN, using the publisher/subscriber model for communication between the nodes. It waits for events and then evaluates a condition. Asene, however, is not an all-purpose transport mechanism. Mires [16] is another publisher–subscriber architecture for WSNs. The SA devices publish their readings only if the user has subscribed to the specific sensor reading. Sink nodes, which are directly connected to the PC, issue subscriptions. In DV/DRP (Distance Vector/Dynamic Receiver Partitioning), proposed in [17], subscriptions are made on the basis of the content of the messages. Due to the complexity of matching subscriptions to arbitrary data packets, it is very difficult to implement this protocol on the target devices. Hunkeler et al. proposed MQTT-S [10], an extension of the open publish/subscribe Message Queuing Telemetry Transport (MQTT) [18] to WSNs. It can operate on low-end, battery-operated SA devices while overcoming the bandwidth constraints of WSNs. However, the major drawback of this system is the lack of accountability in case of failure. There is neither clear responsibility nor a single source of truth unless all messages are routed through a central database (which in turn renders the entire broker architecture pointless). Thus, even though the above-mentioned system is lightweight and low powered, it provides no Quality of Service (QoS) level 1 or 2 from broker to client. Another major problem is the duty and sleep cycle of the client devices, which has to be handled efficiently to save as much energy as possible. Thus, the gateway and broker need to have information about the potential sleep time.
3 Proposed Solution
This section discusses the architecture of the proposed solution. The architecture comprises multiple individual sensor nodes that are responsible for sensing environmental data and communicating it to the sink node. The sink node aggregates the data received from the sensor nodes and transmits the aggregated data to the gateway residing at the cloud. It is vital to mention that the sensor nodes may have diverse capabilities. For instance, it is not necessary that all the sensor nodes collect similar data from the environment. In such a heterogeneous environment, nodes for sensing temperature and humidity may all be embedded in the same network. Such diverse networks are meaningful in IoT-based scenarios, where every “thing” is supposed to have network connectivity. The sink performs the aggregation in the network. The aggregated data is received at the proposed gateway, where it is separated into multiple units, each representing a different sort of information. The isolated information is then sent to the clients or actuator devices that have subscribed to that particular information. For instance, there may be street lights that function according to the light intensity of the surroundings and HVAC systems that require temperature information about the environment. In such a scenario, even though the luminosity and temperature sensors deployed in an area measure different values, it is the responsibility of the gateway to separate the information it has received in aggregated form from the sink node and deliver it separately to the street lights and the HVAC systems, which act as the actuating devices. Figure 2 depicts the overall architecture of the system.
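The segregation step described above can be illustrated with a small in-memory sketch. It assumes that the sink delivers an aggregated batch of (topic, value) readings and that each actuator has registered for a topic beforehand; the `Reading` struct, the `Gateway` class, and the topic and client names are hypothetical and do not come from the authors' implementation:

```cpp
#include <map>
#include <string>
#include <vector>

// One sensor reading inside an aggregated batch (hypothetical type).
struct Reading {
    std::string topic;
    double value;
};

// Toy gateway: remembers who subscribed to which topic and splits an
// aggregated batch into per-client deliveries. No networking involved.
class Gateway {
public:
    void subscribe(const std::string& topic, const std::string& client) {
        subscribers_[topic].push_back(client);
    }

    // Returns, for every client, the readings it should receive.
    std::map<std::string, std::vector<Reading>> segregate(
        const std::vector<Reading>& aggregated) const {
        std::map<std::string, std::vector<Reading>> outbox;
        for (const Reading& r : aggregated) {
            const auto it = subscribers_.find(r.topic);
            if (it == subscribers_.end()) {
                continue;  // nobody subscribed to this topic
            }
            for (const std::string& client : it->second) {
                outbox[client].push_back(r);
            }
        }
        return outbox;
    }

private:
    std::map<std::string, std::vector<std::string>> subscribers_;
};
```

In the street light and HVAC scenario, the light readings would end up only in the street light's outbox and the temperature readings only in the HVAC system's, even though both arrived in the same batch from the sink.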
Fig. 2 Proposed architecture
The separation of data at the gateway is facilitated by the use of topics representing different types of information. Similar to MQTT, the proposed solution provides flexibility in terms of publication of requests. There can be multiple levels of topics on which the sensor nodes can publish. For example, we can have the topics “network1/light_sensor1” and “network1/Sound_sensor2” for the same network, each with two levels of hierarchy. The first level denotes the network in which the node is present. The second level denotes the type of node. A further level may denote the unique node number of the node. It is important to note that the levels are separated by “/” in this protocol. Data published by the sensor nodes is assigned to a particular topic by the sink node. This is possible since the sink node has knowledge of all the sensors deployed in an area. Each node is linked to a particular topic, so that data received from that sensor node is assigned to the topic with which the node is linked. The use of topics further allows the clients or actuating devices to subscribe only to the topics of interest to them. One of the main advantages of the proposed gateway over the existing MQTT-S protocol is that it can support multiple QoS levels simultaneously, which is currently not possible with the MQTT-S gateway. This is possible since the broker itself resides on the gateway, which is not the case in MQTT-S. The MQTT-S gateway connects to the broker using the MQTT client API [10]. It cannot delay acknowledging a message once it has received it from the client API, as that would result in API stalling. In contrast, since the proposed gateway itself acts as the broker, API stalling does not arise.
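The topic hierarchy described above can be made concrete with a short sketch that splits a topic string into its levels at each “/”. The function name `split_topic` is hypothetical; the example topic is the one given in the text:

```cpp
#include <string>
#include <vector>

// Splits a topic such as "network1/light_sensor1" into its levels.
// Hypothetical helper for illustration; empty levels are preserved.
std::vector<std::string> split_topic(const std::string& topic) {
    std::vector<std::string> levels;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type slash = topic.find('/', start);
        if (slash == std::string::npos) {
            levels.push_back(topic.substr(start));  // last level
            break;
        }
        levels.push_back(topic.substr(start, slash - start));
        start = slash + 1;
    }
    return levels;
}
```

With the two-level scheme from the text, the first element identifies the network and the second the node type; a third element, if present, would carry the node number.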
The proposed architecture can be explained with the example of a local area network in a home with multiple interconnected devices, typically relying on the same ISP for Internet connectivity. In such a situation, we require multiple QoS levels for different devices. For example, the door controller would require the highest QoS, as the action should be performed exactly once. Typical fire alarms would require a QoS of at least once, as the information can be sent out more than once. Other devices can very well function at lower levels, depending on the urgency and security of the service required. In this architecture, a gateway performs data segregation before sending the data out to the Internet. Additionally, it is responsible for providing multiple QoS levels on the same network. So, our door unlocker would open the gate only once for maximum reliability, whereas the fire alarms can send out multiple detections to speed up the process with “at-least-once” delivery.
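The home network example amounts to a per-device QoS policy held at the gateway. A minimal sketch of such a policy table follows, using the three delivery levels that MQTT defines (0: at most once, 1: at least once, 2: exactly once); the device names, the `qos_for` helper, and the default of level 0 for unlisted devices are assumptions made for illustration:

```cpp
#include <map>
#include <string>

// The three MQTT delivery guarantees, numbered as in the protocol.
enum class QoS { AtMostOnce = 0, AtLeastOnce = 1, ExactlyOnce = 2 };

// Looks up the QoS level for a device. Devices without an explicit
// entry fall back to the cheapest level (an assumption of this sketch).
QoS qos_for(const std::map<std::string, QoS>& policy,
            const std::string& device) {
    const auto it = policy.find(device);
    return it == policy.end() ? QoS::AtMostOnce : it->second;
}
```

A policy for the scenario in the text would map the door controller to `QoS::ExactlyOnce` and the fire alarm to `QoS::AtLeastOnce`, leaving less critical devices at the default.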
4 Results and Analysis
The proposed architecture uses the publisher–subscriber design paradigm on the combined data of sensor nodes to intelligently separate the requests and deliver them to the clients. In this regard, a few simulations were carried out to further demonstrate the applicability of our work. Simulations were done in the laboratory under various modes of operation of the MQTT protocol. A local Mosquitto broker [19], an open-source broker for the MQTT protocol, was set up. This broker was deployed on Debian-based Kali Linux, and the network infrastructure was a Small Office/Home Office (SOHO) wireless network based on a Dlink DIR-2730u wireless modem. The other method of publication was the Eclipse cloud-based MQTT broker, which is sandboxed and freely available on port 1883. Figure 3 shows the total publishing time for a single sensor node as the payload is varied from 10 bytes to roughly 1600 bytes in steps of 20 bytes. The broker used for this was the Eclipse cloud-based MQTT broker. This was done to measure the performance impact of aggregating data in a sensor node network. From Fig. 3, we can see that even after increasing the payload size from 10 to 1600 bytes, the average time taken to publish a message via the sandboxed online broker is almost the same. The average publishing time varies between 700 and 800 ms, with a few abrupt peaks and troughs in between owing to network bandwidth variation. This shows that aggregation of data will not affect the performance of the sensor network. Figure 4 shows the dependency of the publication time on the programming language. Two popular programming languages were used for this simulation, i.e., Java and Python.
These two languages are polar opposites: Python provides rich code abstraction and is easy to implement, whereas Java provides a much more scalable framework and is suitable for industrial production. The number of messages sent at a single time was varied. The local Mosquitto broker was deployed to obtain the results in a local environment. To get the publication time, the timestamp value was appended as the payload data. This timestamp value was subtracted from the current timestamp at the server end to get the publication time. It can be observed that for a small number of messages (10–20), the performance of MQTT with both Python and Java is roughly the same, i.e., it varies between 0 and
Fig. 3 MQTT performance characteristics
Fig. 4 Programming language dependency
50 ms. However, as the number of messages increases, Python tends to become slower than its Java counterpart. We can see a significant difference of nearly 100 ms when the number of aggregated messages is around 230. Hence, while designing such a system, we need to consider the size of the network we are going to use. For a small network (up to 50–100 nodes), Python should be the language of choice, as it is considerably easier to implement. However, for a larger deployment, Java or embedded C-based programming should be preferred.
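The timestamp-based measurement used for the publication time can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the JSON payload layout and the function names are assumptions made here, the broker is left out entirely, and the method presumes that the publisher and the receiving end share a synchronized clock.

```python
import json
import time


def build_payload(reading, now=None):
    """Attach the send timestamp (seconds since the epoch) to the payload."""
    sent_at = time.time() if now is None else now
    return json.dumps({"reading": reading, "sent_at": sent_at}).encode("utf-8")


def publication_time_ms(payload, now=None):
    """Subtract the embedded timestamp from the receive time, in milliseconds."""
    received_at = time.time() if now is None else now
    sent_at = json.loads(payload.decode("utf-8"))["sent_at"]
    return (received_at - sent_at) * 1000.0
```

In a real run, `build_payload` would be called just before publishing and `publication_time_ms` in the message callback on the receiving side; the optional `now` argument exists only so that the arithmetic can be checked with fixed times.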
5 Conclusions and Future Works
This paper proposes a topic-based pub/sub gateway model generalized for WSNs. The proposed solution is superior to MQTT-S because it provides all three Quality of Service levels on a single network. The transfer of all the sensor data from the gateway to the broker requires protocols that are bandwidth-efficient, energy-efficient, and capable of working with limited hardware resources (i.e., main memory and power supply). Protocols such as Message Queue Telemetry Transport (MQTT), Constrained Application Protocol (CoAP), RESTful HTTP, XMPP, and DOS are used for this purpose. The performance of the protocols is measured in terms of delay and total data (bytes) transferred per message. The total data transferred per message indicates the bandwidth usage. Another important advantage of this architecture is its stability under variable payload. The proposed architecture considers the above-mentioned attributes for serving the requests of clients requiring diverse QoS. In the future, the current architecture can be extended to include multiple protocols on a single gateway and to provide a more intelligent middleware with an enhanced protocol interoperation scheme.
References
1. I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, E. Cayirci, Wireless sensor networks: a survey. Comput. Netw. 38(4), 393–422 (2002)
2. A.H.A. Rahman, Z.A. Zukarnain, Performance comparison of AODV, DSDV and I-DSDV routing protocols in mobile ad hoc networks. Eur. J. Sci. Res. 31(4), 566–576 (2009)
3. T. Clausen, P. Jacquet, Optimized link state routing protocol (OLSR) (No. RFC 3626) (2003)
4. L. Krishnamachari, D. Estrin, S. Wicker, The impact of data aggregation in wireless sensor networks, in Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops, 2002 (IEEE, 2002), pp. 575–578
5. L.J. Zhang, J. Zhang, H. Cai, Service-oriented architecture. Serv. Comput. 89–113 (2007)
6. D. Valtchev, I. Frankov, Service gateway architecture for a smart home. IEEE Commun. Mag. 40(4), 126–132 (2002)
7. T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, D. Boneh, Terra: a virtual machine-based platform for trusted computing, in ACM SIGOPS Operating Systems Review, vol. 37, no. 5 (ACM, 2003, October), pp. 193–206
8. K. Kirkpatrick, Software-defined networking. Commun. ACM 56(9), 16–19 (2013)
9. A. Hakiri, P. Berthou, A. Gokhale, S. Abdellatif, Publish/subscribe-enabled software defined networking for efficient and scalable IoT communications. IEEE Commun. Mag. 53(9), 48–54 (2015)
10. U. Hunkeler, H.L. Truong, A. Stanford-Clark, MQTT-S: a publish/subscribe protocol for wireless sensor networks, in 3rd International Conference on Communication Systems Software and Middleware and Workshops, 2008. COMSWARE 2008 (IEEE, 2008, January), pp. 791–798
11. K. Aberer, M. Hauswirth, A. Salehi, The Global Sensor Networks middleware for efficient and flexible deployment and interconnection of sensor networks (No. LSIR-REPORT-2006-006) (2006)
12. B. Krishnamachari, D. Estrin, S. Wicker, Modelling data-centric routing in wireless sensor networks, in IEEE INFOCOM, vol. 2 (2002, June), pp. 39–44
13. G. Kendall, C. Horril, C. Cole, U.S. Patent Application No. 10/254,456 (2002)
14. J.L. Hill, D.E. Culler, Mica: a wireless platform for deeply embedded networks. IEEE Micro 22(6), 12–24 (2002)
15. M. Zoumboulakis, G. Roussos, A. Poulovassilis, Active rules for sensor databases, in Proceedings of the 1st International Workshop on Data Management for Sensor Networks (DMSN 04) (2004), pp. 98–103
16. E. Souto, G. Guimarães, G. Vasconcelos, M. Vieira, N. Rosa, C. Ferraz, J. Kelner, Mires: a publish/subscribe middleware for sensor networks. Pers. Ubiquit. Comput. 10(1), 37–44 (2006)
17. D.B. Johnson, D.A. Maltz, Dynamic source routing in ad hoc wireless networks. Mob. Comput. 153–181 (1996)
18. A. Banks, R. Gupta, MQTT Version 3.1.1. OASIS standard, 29 (2014)
19. An Open Source MQTT v3.1 Broker, Mosquitto.org (2018), https://mosquitto.org/. Accessed 28 Jan 2020
A Critical Review: SANET and Other Variants of Ad Hoc Networks
Ekansh Chauhan, Manpreet Sirswal, Deepak Gupta, and Ashish Khanna
Abstract In recent years, technology has advanced at a tremendous rate, and wireless mobile ad hoc networks and their variants have played a salient role in several critical applications. Owing to their different topologies and features, the variants of ad hoc networks have become a networking standard for exploring the unmapped and unplumbed areas of land and ocean where infrastructure-based networks cannot be installed. Among these variants, flying ad hoc networks (FANETs) have nodes that operate at high altitudes with lower node density, whereas in vehicular ad hoc networks (VANETs) the nodes operate on the ground with higher mobility. By comparison, sea ad hoc networks (SANETs) are a relatively unexplored area of research. Therefore, this paper gives a detailed critical review of SANETs. Additionally, the differences between SANETs and other variants of ad hoc networks are described. SANETs facilitate applications such as seismic monitoring, environment monitoring, military uses, and many more. The paper also provides an overview of the challenges that must be overcome for the development of the SANET system, among them security and peer-to-peer connections. Different deployment procedures and the issues they raise are also scrutinized. Moreover, the different routing protocols are analyzed and their applications in SANETs are studied. Finally, future areas of research and development in SANETs are discussed.
E. Chauhan · M. Sirswal · D. Gupta · A. Khanna
Maharaja Agrasen Institute of Technology, New Delhi, India
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_93
Keywords MANET · VANET · FANET · SANET · AUV · USV · Distributed system
1 Introduction
According to [1], “A wireless ad hoc network is a bracket of communication nodes which can compose and sustain a network among themselves, without the help of the main station or a leading administrator (infrastructure)”. In recent years, the characteristics of wireless networks have given many researchers the opportunity to explore several types of ad hoc networks: MANET, VANET, and FANET [2]. Multiple surveys and numerous studies have been conducted on these ad hoc networks. This paper therefore examines a newer type of ad hoc network, the sea ad hoc network (SANET) (Fig. 1). A mobile ad hoc network (MANET) consists of mobile routers connected to form an arbitrary graph. The routers are free to move arbitrarily and to organize themselves randomly; therefore, the topology of the network may change quickly and unpredictably [3]. Vehicular ad hoc networks (VANETs) are a more challenging version of MANETs, and research on them is being conducted in the automotive industry as well as in wireless networking [4, 5]. In a VANET, the nodes are moving vehicles. The high speed of these vehicles is the main distinguishing feature of VANETs [6]. According to [7] “Flying ad hoc networks (FANETs) are ad hoc network between UAVs. With the progress of embedded systems now it is feasible to produce small or mini UAVs at a low cost”. However, the coordination and collaboration of multiple UAVs is very complex and cannot be handled by a single UAV, which leads to the formation of a multi-UAV system; this makes FANET more challenging and acts as a distinguishing factor for it. Most of the existing work in the field of ad hoc networks concentrates on land and air, neglecting the ocean. The ocean covers about two-thirds of the earth’s surface and has fascinated humans for centuries. Yet the majority of the ocean is unexplored.
In the fields of military and transportation, the ocean has played a notable role in the past century [8]. Hence the need arises for a wireless data acquisition network, i.e., an ad hoc network for aquatic applications, which is called SANET. Moreover, with the increase in global warming, the polar ice sheets are melting at a very rapid rate, leading to a rise in sea level. Therefore, it is necessary to install a system that can accurately record the change in sea level, for example, an underwater ad hoc network that can provide accurate and timely information to the government [1]. The nodes in SANET are boat nodes whose main aim is to provide a network when a node is at sea. SANET can also be used to save and guide refugees trying to cross a body of water. While in FANET there may be an obstruction in
Fig. 1 Classification of Ad Hoc networks
forming networks because of high mountains, no such obstruction can affect SANET and instead there is open space for communication between the nodes.
2 SANET Applications
The various applications of the SANET system are as follows:
(1) Catastrophe: SANET is helpful when an accident or disaster happens, such as the sinking of boats, oil barges or shipwrecks [2]. The nodes of SANET would send the information as soon as the mishap occurs, and if an error occurs during transmission of the message, that error can be identified and the message will still be conveyed. Whenever a distressed or missing vessel is located, the authorized organizations deploy helicopters, rescue vessels or any other appropriate vessel to return them to land.
(2) Seismic monitoring: SANET can be used to detect and record the earth’s motion under the sea or any water body from man-made and natural sources. The seismometer works on the principle of inertia. Its body rests or floats on the surface of the sea. Inside the body, a heavy mass is suspended between two magnets. With the movement of the earth, the seismometer moves too, and so do the magnets, but the mass remains in its place. As the mass oscillates relative to the magnetic field, an electric current is produced, which is measured by the instrument [9]. Variations in an oil reservoir can also be recorded over a fixed time duration; a whole branch of study, called “4-D seismic”, is dedicated to this. Terrestrial oil fields are monitored only annually or quarterly using this technique, since it involves large capital and operational costs [10].
(3) Environment monitoring: SANET can be used to monitor and observe changes in normal ocean waves, earthquakes, volcanic eruptions and other explosions above or below water. The gravitational pull of the earth and sun generates the wind and tides, which in turn produce these natural disasters. Therefore, SANET can be used to predict a tsunami or weather conditions and to transfer this information to the ships or vessels in any water body.
Hence, information about natural disasters such as tsunamis can be conveyed to the concerned authorities well in advance, and arrangements can be made to save people and reduce casualties as much as possible. The climatic observations made at sea can help us in the following ways:
A. In sending a warning to other ships and coastal administrators.
B. In understanding the global climate.
C. In predicting the future weather.
D. In sending the data to National Meteorological and Hydrological Services (NMHS) centres, where it feeds the climatic prediction models from which local and global forecasts are generated.
E. In making observations at sea in the same time period as those on land, thereby helping in understanding the different weather conditions.
As the old saying goes, “Garbage in, garbage out”; the forecasts will only be as good as the data received.
(4) Military: SANET is a promising technique for exchanging information between military headquarters and ships. Earlier, navy ships used satellites to communicate with each other or with a ground station on land. But these communications were usually delayed and were restricted by limited bandwidth. SANET, on the other hand, allows a ship area network to be formed at sea, thus enabling high-speed transmission of data among ships, improving their sharing of multimedia data and their coordination in battlefield operations [11]. SANET can also help in the detection, tracking and identification of submarines. It can be designed to work at very low frequencies, ranging from a few hertz to a few kHz. At low frequencies the detection range increases. Therefore, SANET can correctly detect and locate low-frequency noise sources [1].
(5) Underwater robots: A very important application of SANET is autonomous underwater robots that can coordinate the sensing of oil leaks or of biological phenomena such as phytoplankton concentrations [10], i.e. chores that would be difficult for a human. SANET will require a multi-robot network. In a multi-robot network the robots form a communication network on the fly, i.e. they talk to each other and collaborate in a distributed fashion [10]. Although the communication between these robots is expected to be of low rate, they are still expected to solve any issue efficiently by coordinating and planning with each other.
3 Comparison Between Ad Hoc Networks
3.1 MANET
According to [12] “A mobile ad hoc network is comprised of a number of mobile nodes connected together to form a network without any existing infrastructure. MANETs are peer-to-peer, multi-hop wireless networks in which data is transferred to a random destination node, through various middle nodes”. MANET node movement is relatively slow, and its mobility model is random, which can sometimes result in undesirable path plans. Because node movement is slow, topology changes in MANET are also slow compared to other ad hoc networks [3]. The nodes in ad hoc networks can act as routers as well; hence they have the computational capability to process the information. MANET nodes are battery powered since their power consumption is low [13].
3.2 VANET
According to [14] “VANET is a subclass of MANET, where mobile nodes are moving vehicles. VANETs are a fundamental part of the International Transport System (ITS) framework”. Sometimes, VANETs are referred to as Intelligent Transportation Networks. The nodes in VANET have unlimited energy and high mobility, which is predictable because of the finite street layouts [15]. In VANET, a GPS receiver can provide coordinates accurate to 10–15 m, which is enough for route navigation. However, this accuracy is not enough for safety-critical operations such as crash alerts. Hence some researchers use Assisted GPS (AGPS) or Differential GPS (DGPS), with an accuracy of about 10 cm [16, 17].
3.3 FANET
FANET can be described as a subdivision of VANET in which the nodes are usually UAVs (unmanned aerial vehicles). FANET presents itself as a low-cost, versatile solution for extending network infrastructure around the world. Perhaps the greatest distinguishing feature of UAVs is their high mobility and speed variation, which allows them to get to hard-to-reach places. According to [18], “a FANET node can have a speed of 30–460 km/h. Due to the high mobility of UAVs, the topology change is faster and more frequent in FANET”. UAVs have a random mobility model, but in some cases they move on a prearranged path and have a regular mobility model. UAVs are generally dispersed in the sky, so their density is lower than the density of nodes in MANET and VANET. When it comes to power consumption, the FANET communication system is supported by the power source of the UAVs, which means there
is no power resource problem [19]. The only limitation on computational power in FANET is weight; however, most UAVs have high computational power. Because of the high-speed multi-UAV framework, FANET requires exceptionally precise location data. GPS alone is not enough for that; therefore, a GPS receiver and an Inertial Measurement Unit (IMU) are fitted inside every UAV.
3.4 SANET
A sea ad hoc network (SANET) is composed of boat nodes such as ships, boats, underwater vehicles, USVs (unmanned surface vehicles) and vessels connected together to form a large network. The intent of SANET is to increase the extent of aquatic connectivity [20]. “Node density is defined as the average number of nodes present in a unit area”; since the nodes in SANET are dispersed in oceans or other water bodies, the node density is medium. The nodes in SANET move faster than the nodes in MANET but slower than the nodes in VANET and FANET. The line of movement of underwater vehicles can be random and unpredictable, causing the mobility model to be random too. Due to the mobility of USVs in SANET, the topology of SANET changes faster than that of MANET but slower than that of VANET and FANET. Since the nodes are on the water, the line of sight between boat nodes is very high. Energy consumption in SANET is also quite high. The frequency band of SANET lies between 5 and 8 GHz. GPS alone cannot track the nodes in SANET accurately enough; therefore, AGPS, DGPS and the Automatic Identification System (AIS) are used in every node. Since SANET supports many applications and is restricted mainly by the delay in data transfer, a stable multi-hop synchronization mechanism for reliable communication between nodes needs to be developed [21]. Table 1 presents the comparison of SANET with other ad hoc networks.
4 Deployment Schemes for SANET
According to [22], “There are three schemes for SANET that will be applied in positions of harbour, shore and ocean, respectively”. Since the current land networks can be used at the harbour, we will focus on the other two positions in this paper, i.e. shore and ocean.
Table 1 Comparison between ad hoc networks

| Parameter | MANET | VANET | FANET | SANET |
| --- | --- | --- | --- | --- |
| Node type | Smartphone, laptop, PDA, tablet | Car, bus, motorbike, truck | Drone, aircraft, copter, satellite | Ship, boat, underwater vehicle, USV (unmanned surface vehicle), vessel |
| Node density | Low | High | Very low | Medium |
| Node mobility | Low | High | Very high | Medium |
| Mobility model | Random; 2D or 3D | Regular; 2D | Random, or regular under special conditions; 3D | Random; 2D or 3D |
| Topology change connectivity | Medium | High (rush hours) | Low | Low |
| Propagation model | On the ground; low LoS | On the ground; low LoS | In the air; high LoS | On the water; high LoS |
| Power consumption | Low | High | High (depends on the UAV) | High |
| Setup | Positioning | Road | Airfields, hands | Water |
| Localization | GPS | GPS, AGPS, DGPS | GPS, AGPS, DGPS, IMU (inertial measurement unit) | GPS, AGPS, DGPS, AIS (automatic identification system) |
| Frequency band | 2.4 GHz | 5.9 GHz | 2.4/5 GHz | 5/8 GHz |
4.1 SANET Network at Shore
If a node lies within the radius in which it can connect directly to the Radio Access Station (RAS), it can convey everything directly; but if a node is outside that radius, it has to form a network with other nodes in order to communicate. Figure 2 shows the network architecture of SANET at the shore, where solid lines represent the UHF band links for terrestrial communications and dotted lines represent the VHF band links for SANET communications.
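The shore scheme can be illustrated by a small reachability sketch. It assumes, purely for illustration, that nodes are points on a plane, that every link has the same fixed radio range, and that the station is stored under the hypothetical key "RAS"; the separate UHF and VHF bands are not modelled. A breadth-first search returns the direct link when the RAS is in range and the shortest relay chain otherwise.

```python
import math
from collections import deque


def route_to_ras(source, positions, ras="RAS", radio_range=10.0):
    """Return the hop list from `source` to the RAS, or None if unreachable.

    `positions` maps node names (including the RAS) to (x, y) coordinates.
    Breadth-first search yields the direct link when the RAS is in range
    and the shortest multi-hop relay chain otherwise.
    """
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == ras:
            return path
        for name, position in positions.items():
            in_range = math.dist(positions[current], position) <= radio_range
            if name not in visited and in_range:
                visited.add(name)
                queue.append(path + [name])
    return None
```

For a boat 8 units from the station the result is the single hop to the RAS, while a boat 16 units out is routed through the nearer boat.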
Fig. 2 Proposed maritime wireless communication architecture for shore
4.2 SANET Network in Ocean
No root station is available in the ocean because of the large distance from land, and it is not practically possible to deploy one there. Therefore, peer-to-peer communication is used, since it requires no base station. Figure 3 shows the network architecture of SANET in the ocean. The solid, dotted and white lines represent the UHF, VHF and HF band links, respectively. Since links to the RAS cannot be reached even with multi-hop VHF, communication needs to be done using the available HF band modem.
Fig. 3 Proposed maritime wireless communication architecture for ocean
5 Routing Protocols in SANET
The routing protocols in SANET are classified into various categories based on the way routing data is shared and the way paths are established [23]. Based on these criteria, we have the following routing protocols.
(1) Proactive Routing Protocols: Proactive routing protocols use routing tables to store data about the connections between each pair of nodes. Every node keeps one or more routing tables, which together give a full picture of the network topology. These routing tables need to be refreshed consistently so that data is routed correctly from the source node to the destination node [1]. Consequently, it becomes easier to choose the shortest route from the root to the target node, which decreases latency significantly [21].
But to keep up with the latest routing information, topology data must be exchanged among the USVs regularly, causing network congestion, higher bandwidth consumption and a slow reaction to disconnections [24]. Therefore, the main advantages of such algorithms are:
(1) Routes are always available on request.
(2) Lower delivery delays.
(3) Easier to choose the shortest route.
And the disadvantages are:
(1) Slow reaction to reconstruction.
(2) A lot of packets and data need to be maintained for smooth working.
Examples of these algorithms are OLSR and DSDV.
(1.1) Destination Sequenced Distance Vector (DSDV): DSDV is a table-driven routing scheme, which means it is a proactive routing protocol. It is an improved version of the Bellman–Ford algorithm. The use of sequence numbers improves on the Bellman–Ford algorithm by providing freedom from loops. In DSDV, every node must maintain a routing table comprising the addresses of all possible destination nodes, the number of hops needed to reach each destination, and the address of the next node. The routing table is updated through the exchange of data between neighbouring nodes (Fig. 4). According to [25] “sequence number is also attached with every route to a target address. Whenever the topology of the network changes, a new sequence number is assigned to the changed paths”. Therefore, the sequence number indicates the validity of a path: the higher the sequence number, the more reliable the path, which avoids the formation of loops. Every time a path changes, its sequence number is incremented by two. Thus, all paths with even sequence numbers are reachable. If a node notices that a path to the destination is not working, the path is assigned a very high hop count (meaning infinity) and its sequence number is made odd; therefore, an odd sequence number means that the path is not reachable.
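The sequence-number rules above can be sketched as a table update routine. This is a simplified illustration with hypothetical names (`Route`, `update_table`, `mark_broken`), not a full DSDV implementation. Preferring the smaller hop count when two sequence numbers are equal is the usual DSDV tie-break and is an addition to what the text states.

```python
from dataclasses import dataclass

INFINITY = float("inf")


@dataclass
class Route:
    next_hop: str
    hops: float
    seq: int  # even: reachable, odd: broken


def should_replace(current, advertised):
    """Prefer the higher sequence number; on a tie, prefer fewer hops."""
    if current is None:
        return True
    if advertised.seq != current.seq:
        return advertised.seq > current.seq
    return advertised.hops < current.hops


def update_table(table, destination, neighbor, advertised_hops, advertised_seq):
    """Apply one advertisement received from `neighbor`; return True if adopted."""
    candidate = Route(neighbor, advertised_hops + 1, advertised_seq)
    if should_replace(table.get(destination), candidate):
        table[destination] = candidate
        return True
    return False


def mark_broken(table, destination):
    """Broken path: the hop count becomes infinity and the sequence number odd."""
    route = table[destination]
    route.hops = INFINITY
    if route.seq % 2 == 0:
        route.seq += 1
```

An advertisement with an older sequence number is ignored even if it offers fewer hops, which is what prevents a stale route from reintroducing a loop.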
Now, to reduce the traffic, two kinds of updates are exchanged in this system: a “full dump” contains all the available routing information, while an “incremental” update contains only the information about the changes made [26]. Assigning new sequence numbers every time the topology changes takes time; hence DSDV is not suited to highly dynamic networks.
(1.2) Optimized Link State Routing Protocol (OLSR): The OLSR protocol has been discussed in various studies [27–33], which have implemented it under several model circumstances. It has information about all the present
Fig. 4 Mechanism of DSDV
links between USVs. It inherits the stability of the link state algorithm because it is an optimization of that algorithm. OLSR minimizes the overhead because it uses only selected nodes, called multipoint relays (MPRs), to forward messages. The knowledge of all the existing links is established by the periodic exchange of topology control packets between the nodes of the network. Using MPRs for the transmission of messages also limits the number of transmissions needed to spread a message across all the nodes and the maximum time required for the transmission. As [24] states “This protocol is very helpful for networks where a large number of nodes are interacting with another set of a large number of nodes and the source, destination pairs are changing over time”. OLSR is best suited to crowded networks, as the optimization done using MPRs is an advantage only when there is a lot of traffic. One drawback of this algorithm is that no exception is made for small networks: a small network using this protocol will still do the same amount of work, which might not even be required. This restricts the scalability of this protocol, and it works efficiently only in dense networks (Fig. 5).
(2) Reactive Routing Protocols: Reactive routing protocols are also called “on-demand routing protocols” because they do not retain routing information in tables, and the route-finding process is started only when one node wants to communicate with another. The path is determined by exploring the available routing paths, because of which this type of protocol suffers from high delay and response time, particularly when the system is fragmented. The route is found using two packets: Route Request (RREQ) and Route Reply (RREP) packets. The source node floods the network with the RREQ, and only the target (destination) node replies to it with an RREP; when the RREP reaches the source, the communication is initiated.
Thus, this algorithm can be used for networks with a huge bandwidth [34] like SANET. The main advantages of these protocols are:
Fig. 5 Mechanism of OLSR
(1) Reduces overhead.
(2) Suitable for systems with large data transfer capacity.
And the disadvantages are:
(1) High reaction time due to the route discovery process.
(2) High congestion can make the network cluttered.
Examples of these algorithms are DSR and AODV.
(2.1) Dynamic Source Routing (DSR): According to [35] “The dynamic source routing protocol or DSR is a simple and efficient routing protocol designed mainly for wireless interlocking networks and is based on a method known as source routing”. Since DSR is reactive in nature, the discovery process is started only on demand, i.e. when communication is required. DSR was designed to be a routing protocol that has very little running cost but can respond to network changes very quickly. DSR makes it possible to find multiple paths to a destination by exchanging RREQ packets. Each intermediate node that passes the packet on to the next node adds its own address to the list in the packet. When the RREQ packet arrives at the destination node, a one-way response packet (RREP) containing the addresses of all the intermediate nodes is produced. The final path is then the one that requires the minimum number of hops to reach the destination node.
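The route discovery just described can be sketched as follows. This is a toy model under stated assumptions: the network is a hypothetical adjacency dictionary, links are bidirectional and loss-free, and every loop-free route record is explored, which is only practical for small graphs (real DSR nodes discard RREQs they have already forwarded).

```python
from collections import deque


def discover_routes(graph, source, destination):
    """Flood an RREQ and collect every loop-free route record that arrives.

    Each forwarding node appends its own address to the record carried by
    the packet; a record that reaches the destination would be returned to
    the source in an RREP.
    """
    routes = []
    queue = deque([[source]])
    while queue:
        record = queue.popleft()
        node = record[-1]
        if node == destination:
            routes.append(record)
            continue
        for neighbor in graph[node]:
            if neighbor not in record:  # never revisit a node: avoids loops
                queue.append(record + [neighbor])
    return routes


def select_route(routes):
    """Pick the route with the minimum number of hops, or None if none exist."""
    return min(routes, key=len) if routes else None
```

The list returned by `discover_routes` corresponds to the multiple paths that DSR can learn, and `select_route` applies the minimum-hop rule to choose among them.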
Fig. 6 Mechanism of DSR
The DSR protocol comprises two main schemes, Route Discovery and Route Maintenance. When a certain number of transmissions fail, the node produces a route error message, which contains a reference to the problem and is transmitted to the root node. The root node then needs another path to the destination node, one that it does not already have in its memory, so it initiates the route discovery process again to find a better path (Fig. 6). Since DSR does not need any regular update messages, it avoids wasting bandwidth.
(2.2) Ad Hoc On-Demand Distance Vector (AODV): As stated in [36], the AODV algorithm enables dynamic, self-starting and multi-hop routing between the participating mobile nodes. It finds routes only when the need for communication arises, and it maintains these routes as long as the communication continues. Similar to DSR, AODV floods RREQs across the network to identify the path. The differentiating characteristic of this protocol is its association of a sequence number with every route entry. An intermediate node can reply to an RREQ if it is the destination node or if it has a path to the target node whose sequence number is equal to or larger than the one present in the RREQ. Otherwise, it rebroadcasts the route request. Nodes store and check the route request’s IP address and broadcast ID.
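The way a node decides what to do with an incoming RREQ can be sketched as below. The `Rreq` and `Node` classes and the returned action strings are hypothetical names chosen for illustration, and discarding a request whose (IP address, broadcast ID) pair has been seen before follows standard AODV behaviour. Timers, hop counts and reverse-route setup are omitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rreq:
    originator: str       # IP address of the node that issued the request
    broadcast_id: int     # per-originator counter identifying the request
    destination: str
    destination_seq: int  # latest destination sequence number known to the originator


@dataclass
class Node:
    address: str
    routes: dict = field(default_factory=dict)  # destination -> (next_hop, seq)
    seen: set = field(default_factory=set)      # (originator, broadcast_id) pairs

    def handle_rreq(self, rreq):
        """Return 'drop', 'reply' or 'rebroadcast' for an incoming RREQ."""
        key = (rreq.originator, rreq.broadcast_id)
        if key in self.seen:
            return "drop"  # this request has already been processed
        self.seen.add(key)
        if self.address == rreq.destination:
            return "reply"
        route = self.routes.get(rreq.destination)
        if route is not None and route[1] >= rreq.destination_seq:
            return "reply"  # the stored route is at least as fresh as requested
        return "rebroadcast"
```

An intermediate node holding a route with sequence number 12 can therefore answer a request that asks for sequence number 10, but must rebroadcast one that asks for 20.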
If they receive a RREQ from an IP address that they have already processed, they do not forward it. When the root node receives a route reply, it starts sending data to the destination node. The root node revises its routing table and starts using the more efficient route if it receives a route reply with a greater sequence number than the one it already possesses. When a link breaks while the path is still in use, the node preceding the broken link transmits a RERR to the root node. In AODV, routes to the destination are maintained only while the communication is active; when the communication stops, all the links are deleted (Fig. 7). (3) Hybrid Protocols: To overcome the disadvantages of the previously mentioned routing approaches, hybrid protocols were designed, which merge PRP and RRP protocols and combine the benefits of both. PRP requires overhead to sustain a network, and RRP needs an ample amount of time to establish the possible paths. To fix these problems, hybrid protocols adopt the notion of separate areas or zones: a proactive approach is used inside the zones, thus lessening the overhead, and a reactive strategy is used for communication between zones. This approach is best suited to large networks. The paths are initially created using a proactive approach, and after that demand is served using a reactive approach.
Fig. 7 Mechanism of AODV
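The AODV rules described above (duplicate suppression by originator address and broadcast ID, replying only with a sufficiently fresh sequence number, and switching to a route with a greater sequence number) can be sketched as follows. The class and method names are hypothetical and the message fields are reduced to the ones mentioned in the text; this is a simplified illustration, not an implementation of RFC 3561.

```python
class AodvNode:
    """Minimal sketch of one node's RREQ/RREP handling (illustrative only)."""

    def __init__(self, address):
        self.address = address
        self.seen_requests = set()  # (originator address, broadcast ID) pairs already handled
        self.routes = {}            # destination -> (next hop, destination sequence number)

    def handle_rreq(self, originator, broadcast_id, destination, requested_seq):
        """Return 'drop', 'reply' or 'forward' for an incoming route request."""
        key = (originator, broadcast_id)
        if key in self.seen_requests:
            return "drop"           # this request was already processed
        self.seen_requests.add(key)
        if destination == self.address:
            return "reply"          # the destination itself always answers
        entry = self.routes.get(destination)
        if entry is not None and entry[1] >= requested_seq:
            return "reply"          # the cached route is fresh enough
        return "forward"            # otherwise rebroadcast the request

    def handle_rrep(self, destination, next_hop, seq):
        """Adopt the advertised route only if its sequence number is greater."""
        current = self.routes.get(destination)
        if current is None or seq > current[1]:
            self.routes[destination] = (next_hop, seq)
            return True
        return False


node = AodvNode("B")
node.handle_rrep("D", "C", 5)                   # learn a route to D via C, sequence number 5
first = node.handle_rreq("S", 1, "D", 4)        # cached route is fresh enough
duplicate = node.handle_rreq("S", 1, "D", 4)    # same originator and broadcast ID
stale = node.handle_rreq("S", 2, "D", 9)        # requester wants a fresher route
```

The sketch follows the text in comparing sequence numbers only; tie-breaking on hop count, timers and RERR handling are deliberately left out.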
The main advantages of these protocols are: (1) It contains the benefits of both PRP and RRP. (2) It is convenient for crowded networks. (3) Low overhead. (4) Less time delay in finding routes.
The main disadvantage of these protocols is: (1) The time to find a route depends on the slope of the traffic volume. An example of a hybrid protocol is ZRP. (3.1) Zone Routing Protocol (ZRP): According to [37] “Zone Routing Protocol or ZRP is a hybrid routing protocol merging two types of routing protocols, Reactive and Proactive”. It utilizes the benefits of both to make the route-finding process more efficient and quick. In ZRP, the entire topology is divided into zones, and RRP or PRP is used inside a zone or between zones according to their strengths and weaknesses. Each zone is defined by the distance between nodes using a predefined radius r, so each node has its own set of zone nodes. AUVs in the same zone communicate using intra-zone routing, which follows a proactive methodology. If communication between zones is required, a data packet must be transmitted from one zone to another using inter-zone routing, which is based on a reactive approach. This reduces the need for nodes to keep the entire network proactive. ZRP also defines a strategy called BRP (Bordercast Resolution Protocol) to regulate traffic between zones. If the source unmanned vehicle S has no path to a target node D that lies in a different zone, BRP is used to flood the Route Request (RREQ) across nodes. ZRP enhances the organization of nodes (Fig. 8). Features of the different routing protocols in SANET are compared in Table 2.
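The zone idea can be illustrated with a short Python sketch. It assumes that the radius r is measured in hops, which the text above does not state explicitly, and the function names and the five-node chain topology are hypothetical.

```python
from collections import deque


def routing_zone(links, node, radius):
    """Return {neighbour: hop count} for all nodes within `radius` hops of `node`."""
    hops = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if hops[current] == radius:
            continue  # do not expand beyond the zone border
        for neighbour in links.get(current, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                queue.append(neighbour)
    del hops[node]  # the zone lists the other members, not the node itself
    return hops


def choose_strategy(links, source, destination, radius):
    """Intra-zone traffic is served proactively, inter-zone traffic reactively."""
    if destination in routing_zone(links, source, radius):
        return "proactive"
    return "reactive"


def border_nodes(links, node, radius):
    """Nodes exactly `radius` hops away; a bordercast would hand the RREQ to these."""
    zone = routing_zone(links, node, radius)
    return sorted(n for n, h in zone.items() if h == radius)


# Hypothetical chain topology S - A - B - C - D with zone radius 2.
links = {"S": ["A"], "A": ["S", "B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
zone_of_s = routing_zone(links, "S", 2)
```

With this topology, a packet from S to B stays inside the zone of S, while a packet to D falls outside it and would trigger a reactive request relayed through the border node B.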
Fig. 8 Mechanism of ZRP
Table 2 Comparison between routing protocols
Feature | OLSR | AODV | DSDV | DSR | ZRP
Protocol type | Link state | Table driven and source routing | Distance vector | Source routing | Table driven and source routing
Route maintained in | Routing table | Routing table | Routing table | Route cache | Routing table and route cache
Route discovery | Via control message link sensing | On demand | Via control message | On demand | Via control message and on demand
Multiple route discovery | Yes | No | No | Yes | Yes
Multicast | Yes | Yes | Yes | No | Yes
Broadcast | Limited by MPR set | Yes | Full | Yes | Parity
Reuse of routing information | Yes | No | Yes | No | Yes
Route reconfiguration | Link state mechanism/routing message transmission in advance | Erase route then source notification or local route repair | Sequence number adopted | Erase route then source notification | Sequence number and erase route then source notification
Limited overhead | Concept of MPRs | No | Concept of sequence numbers | Concept of cache | Concept of sequence numbers and route cache
Advantages | Minimize the overhead, improve the transmission quality | Adaptable to highly dynamic topologies, reduced control overhead | Avoid extra traffic, reduce the amount of space in the routing table | Multiple routes, reduces bandwidth overhead | Enhances the organization of nodes
Disadvantages | Require more processing power and bandwidth | Scalability problems, large delay caused by the route discovery process | High control overhead, wastage of bandwidth | Scalability problems due to source routing and flooding, large delay | Inadequate for high mobility
6 Challenges and Issues in SANET
(1) GPS Localization [4] In any ad hoc network, it is crucial for each USV to know its current state with respect to other USVs. However, GPS localization and time synchronization in SANET are not easy, since the high-frequency waves used by the Global Positioning System (GPS) do not travel well below water and are quickly absorbed at the water surface. The GPS-free methodology commonly used in land-based ad hoc services, which measures the Time-Difference-of-Arrival (TDoA) [38] between a radio-frequency signal and an acoustic signal, can no longer be used, because the radio frequencies normally employed fail to operate under water. Moreover, the flow of water, temperature differences and pressure influence the speed of acoustic waves. An alternative way of transmitting signals in water is to use EM waves, which are described as a rapid and productive means of communication. EM waves are considered better than acoustic waves mainly because of their high bandwidth, but certain factors limit their use, such as the need to transmit them differently depending on the water type [1].
(2) Security In any ad hoc network, the safety of data transfer is the biggest challenge. It needs revisiting every time a new concept is announced, so that the security services can be updated and upgraded. The data being transferred and the nodes in use should be secure from any malicious attack, which can do even more damage if the node is a basic point. As discussed above, SANET nodes are restricted in power, computation and communication abilities, which makes SANET even more vulnerable to security threats. Moreover, a self-organizing ad hoc system like SANET needs more security than just cryptography; attacks can still be made even if an efficient cryptosystem is guarding the network.
The biggest threat is the denial-of-service attack, which can occur if the battery of a node drains because of extra computation and communication, or if the network connection of a node is interrupted. These attacks can take place irrespective of the presence of cryptographic protections.
(3) Peer-to-peer communication According to [5] “A fleet of autonomous underwater vehicles (AUVs) require coordinated synchronization and accident avoidance using peer-to-peer communications (P2P)”. However, AUVs operating underwater have a very limited operational range, especially while transmitting heavy data. The operating range of AUVs can be extended, but only at the expense of data transmission speed, which can be risky because data and warnings may not arrive on time. There are two further problems underwater: the propagation time is much larger than the transportation time, and waves are scattered. Scattering is the change in the direction of motion of a particle because of a collision with another particle
[1]. The higher the turbidity, the greater the scattering effect. According to [39] “These three conditions are necessary to be implemented in order for an ad hoc network to be successful underwater: (a) Presence of a stable connection link, (b) A reliable routing protocol, (c) A protocol for sharing of communication link”. Therefore, peer-to-peer communication among nodes is a major challenge in SANET: if the nodes cannot communicate with each other effectively and collaborate to transfer data, the network serves no purpose. The route for data transfer is decided solely by the nodes.
(4) Energy supply efficiency Because the energy consumption of USVs is very high, it is a challenge to supply them with the required energy constantly and efficiently so that they keep working at the required rate. Sustainable improvements are underway in renewable energy through the use of resources such as solar energy [21]. However, the power produced this way cannot satisfy the energy needs of data transfer and long-distance movement of USVs. The first possible alternative is to cooperate with other USVs to reach the required energy level. The second is the ideal positioning of recharge stations. To address the energy concerns, a new line of investigation proposes combining a cloud computing prototype with SANETs. However, cloud computing brings a security challenge of its own, because the data can be transferred between USVs, and various safety mechanisms are required to overcome it. Cloud computing prototypes are still only under investigation and are yet to be applied [5]. The nodes must therefore be designed with low power consumption in mind to maximize the network lifetime; every aspect of their design should aim to minimize the energy requirement. Another possible solution is to use lithium batteries.
Lithium batteries are a smart substitute for AA batteries, which suffer from physical deterioration and leakage currents. However, using batteries in USVs might not be a good idea, because changing them requires retrieving the USVs from the bottom of the sea, which can be a time-consuming and costly process [10].
(5) Routing challenges Routing protocols in SANET differ from those of other ad hoc networks because the deployment of SANET is different. It is therefore a challenge to propose a routing protocol and algorithm that can update the routing tables in SANET when the deployment changes. Moreover, the routing protocols designed so far target nodes with low mobility (MANET) or very high mobility (FANET); little or no research has been done on making routing protocols efficient for nodes with medium mobility. Both medium mobility and medium density are significant challenges in developing a routing approach that ensures reliable transfer of information. According to [40], “Currently, there are two routing protocols developed for sublunary sensor networks namely, proactive and reactive”. Both of them have
some issues and challenges. In a proactive approach, extensive signalling is triggered to create paths every time the topology changes because of constant node movement. Reactive protocols are more appropriate for dynamic conditions, but they cause a significant delay in sending information bundles because routes must first be created. The severity of these difficulties increases when the nodes move in 3D space (i.e. underwater). Therefore, there is a pressing need for new protocols that implement new techniques for the above-mentioned situations.
7 Future Research Topics
SANET has become a critical area of research in recent years. This technology can provide formidable aid to existing services and can enable new implementations. However, it still has many issues and challenges to overcome before it becomes a successfully established network. Specific technical issues that need attention are frequent disconnections, restricted transfer speed and inadequate energy capacity. Routing protocols need to advance in bandwidth optimization, latency reduction, security, fault tolerance and connection strength, so that frequent link disconnections are prevented under high mobility, and all the required sensors should be integrated in a single unit. The basic SANET environment, such as the density and mobility structure of AUVs, can also be examined. Security of data transfer is one of the biggest and most difficult issues, not only in SANET but in all variants of ad hoc networks; extensive research needs to be done on the generation and transmission of security codes.
8 Conclusion
The main concepts of SANETs have been analyzed in this paper. The contrasts between SANETs and other variants of ad hoc networks in terms of versatility, node density, topology transition and power utilization are highlighted. The basic purpose is to serve customers, and the potential use of SANETs in seismic monitoring, environment monitoring, underwater robots and military domains is presented. Then the main problems associated with the SANET system, such as transmission problems, security and routing challenges, are shown. In SANETs, routing is regarded as one of the leading components for guaranteeing correct network performance; therefore, several routing protocols have been discussed thoroughly. According to the evidence, reactive protocols are mostly used in SANET. We have highlighted AODV and DSR in this category because they do not need to store routing information for a long time. A brief study that discusses and validates all the major routing protocols is presented. Finally, the less explored challenges of SANET are identified and listed under future research topics. As a final conclusion, we can say that SANET can give rise to a large number of different systems, and the various applications of this system are
exposed. Furthermore, the crucial factors that need to be considered in SANET are energy consumption, capacity and dependability of the network, which need to be analyzed comprehensively to ensure the right quality of service.
References
1. M. Garcia, S. Sendra, M. Atenas, J. Lloret, Underwater wireless Ad-hoc networks: A survey. 2 (2011)
2. Taher, H. Mohammed, Sea Ad hoc Network (SANET) challenges and development. Int. J. Appl. Eng. Res. 1 (2018)
3. W. Lou, W. Liu, Y. Zhang, Y. Fang, SPREAD: Improving network security by multipath routing in mobile ad hoc networks. Wireless Networks 15(3) (2010)
4. T. Eissa, S. Razak, M. Ngadi, Towards providing a new lightweight authentication and encryption scheme for MANET. Wireless Networks 17(4) (2011)
5. S. Mutly, G. Yilmaz, A distributed cooperative trust based intrusion detection framework for MANETs. In: The Seventh International Conference on Networking and Services ICNS 2011 (2011)
6. K. Hartann, C. Steup, The vulnerability of UAVs to cyber attacks-An approach to the risk assessment
7. I. Bekmezci, S. Koray, S. Temel, Flying Ad-Hoc networks (FANETs): A survey. (Elsevier, 2013), p. 1
8. J. Kong, J.-H. Cui, D. Wu, G. Mario, Building underwater Ad-Hoc networks and sensor networks for large scale real-time aquatic applications. Milcom 2005, 1 (2005)
9. Wikipedia. Available https://en.wikipedia.org/wiki/Ocean-bottom_seismometer
10. J. Heidemann, Y. Li, A. Syed, J. Wills, W. Ye, Underwater sensor networking: research challenges and potential applications (2005)
11. Wikipedia. Available https://en.wikipedia.org/wiki/Vehicular_ad-hoc_network
12. JavaTPoint. Available https://www.javatpoint.com/mobile-adhoc-network
13. S. Misra, S. Bhardwaj, Secure and robust localization in a wireless ad. Trans. Vehic. Technol. 58, 1480–1489 (2009)
14. Research challenges in intelligent transportation network. IFIP Keynote (2008)
15. W. Li, H. Song, ART: An attack-resistant trust management scheme for securing vehicular ad hoc networks. Trans. Intell. Transp. Syst. 17(4), 960–969 (2016)
16. H.-S. Ahn, C.-H. Won, DGPS/IMU integration-based geolocation system: airborne experimental test results. Aerosp. Sci. Technol. 13, 316–324 (2009)
17. A.K. Wong, T.K. Woo, A.T.L. Lee, X. Xiao, V.W.H. Luk, K.W. Cheng, An AGPS-based elderly tracking system. International Conference on Ubiquitous and Future Networks (2009)
18. J. Clapper, J. Young, J. Cartwright, J. Grimes, Unmanned systems roadmap 2007–2032. Dept. Defense (2007)
19. A. Purohit, F. Mokaya, P. Zhang, Collaborative indoor sensing with the sensor fly aerial sensor network. In: Proceedings of the 11th International Conference on Information Processing in Sensor Networks (2012), pp. 145–146
20. R. Al-Zaidi, J.C. Woods, M. Al-Khalidi, H. Hu, Building novel VHF-based wireless sensor networks for the Internet of marine things. Sensors J. 18(5), 2131–2144 (2018)
21. O.S. Oubbati, M. Atiquzzaman, P. Lorenz, M.H. Tareque, M.S. Hossain, Routing in flying Ad Hoc networks: Survey, constraints, and future challenge perspectives (2019)
22. Y. Kim, J. Kim, Y. Wang, K. Chang, J. Park, Y. Lim, Application scenarios of nautical Ad-hoc network for maritime communications. Oceans 2009, 2 (2009)
23. S.A. Ade, P.A. Tijare, Performance comparison of AODV, DSDV, OLSR and DSR routing protocols in mobile Ad Hoc networks. Int. J. Inf. Technol. Knowl. Manage. 2(2), 545–548 (2010)
24. T. Clausen, P. Jacquet, Optimized link state routing protocol (OLSR). RFC (2003)
25. Computer Science & Electronics Journals. Available https://www.csjournals.com/
26. N.S. Mohamad, Performance evaluation of AODV, DSDV & DSR routing protocol in grid environment. Int. J. Comput. Sci. Netw. Secur. 9(7) (2009)
27. S. Makkar, Y. Singh, R. Singh, Performance investigation of OLSR and AODV routing protocols for 3 D FANET environment using NS3. J. Commun. Eng. Syst. 8(2), 1–10 (2018)
28. A. Leonov, G. Litvinov, D. Korneev, Simulation and analysis of transmission range effect on AODV and OLSR routing protocols in flying ad hoc networks (FANETs) formed by miniUAVs with different node density. In: Proceedings System Signal Synchronization, Generating Processed Telecommunication (2018), pp. 1–7
29. Y. Jiao, W. Li, I. Joe, OLSR improvement with link live time for FANETs. Adv. Multim. Ubiquit. Eng. 521–527 (2017)
30. A. Nayyar, Flying adhoc network (FANETs): Simulation based performance comparison of routing protocols: AODV, DSDV, DSR, OLSR, AOMDV and HWMP. In: Proceedings of the International Conference on Advanced Big Data, Computational Data Communication System (icABCD) (2018), pp. 1–9
31. K. Singh, A.K. Verma, Experimental analysis of AODV, DSDV and OLSR routing protocol for flying adhoc networks (FANETs). In: Proceedings IEEE International Conference on the Electronics, Computation Communication Technologies (ICECCT) (2015), pp. 1–4
32. K. Singh, A.K. Verma, Applying OLSR routing in FANETs. In: Proceedings of the IEEE International Conference Advance Communication Control Computer Technologies (ICACCCT) (2014), pp. 1212–1215
33. D.S. Vasiliev, D.S. Meitis, A. Abilov, Simulation-based comparison of AODV, OLSR and HWMP protocols for flying ad hoc networks. In: Proceedings of the International Conference Next Generation Wired/Wireless Networks (Cham, 2014), pp. 245–252
34. A.M. Sagheer, H.M. Taher, Identity based cryptography for secure AODV routing protocol. TELFOR 198–201 (2012)
35. D. Johnson, Y. Hu, D. Maltz, The dynamic source routing protocol (DSR) for mobile Ad-hoc networks for IPv4. RFC 4728 (2007)
36. C. Perkins, E. Belding-Royer, S. Das, Ad-Hoc on-demand distance vector (AODV) routing. RFC 3561 (2003)
37. Z.J. Haas, M.R. Pearlman, ZRP: A hybrid framework for routing in ad hoc networks. Ad Hoc Netw. 221–253 (2001)
38. A. Savvides, C.C. Han, M.B. Srivastava, Dynamic fine-grained localization in Ad-Hoc networks of sensors. (ACM MOBICOM, 2001), pp. 166–179
39. K.Y. Foo, P.R. Atkins, T. Collins, S.A. Pointer, C.P. Tiltman, Sea trials of an underwater, ad hoc, acoustic network with stationary assets. IET Radar Sonar Navig. (2007)
40. A. Abdullah, M. Ayaz, Underwater wireless sensor networks: Routing issues and future challenges (2009)
HealthStack–A Decentralized Medical Record Storage Application
Mayank Bansal, Kalpna Sagar, and Anil Ahlawat
Abstract The aim of this study is to design and develop a blockchain-based web app called Health Stack that maintains accurate and complete medical records of patients, helps doctors fetch a patient’s previous medical history, assists users in finding out which disease they are suffering from, and more. For medical services, secure data storage is a major concern. This problem can be addressed by developing an app on blockchain technology, which offers decentralization and verifiability. Development of this app does not depend on any third party.
Keywords Blockchain · Decentralized · Disease prediction · Health IT · Medical records
1 Introduction
Health Information Technology (Health IT) is a broad term that describes the technology and infrastructure used to record, analyze and share a patient’s health data. The technologies involved include health record systems (personal, paper and electronic), personal health devices (including smart devices and applications) and, finally, communities for sharing and discussing information. Some of these technologies can indicate whether a patient needs to follow a diet or take a particular medication. Health IT helps in providing better care to patients and in achieving health equity. It supports patient data recording to improve healthcare delivery and allows this information to be analyzed by health services and government agencies. It further improves patient safety, reduces medical errors, and strengthens interaction between patients and health care providers.
M. Bansal (B) · K. Sagar · A. Ahlawat KIET Group of Institutions, Ghaziabad, India e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_94
It has been found that there is a need for reliable and inexpensive medical record software in low- and middle-income countries (LMIC). Using Health IT in medical clinics not only improves the quality of health care and provides accurate patient records, but also enables clinicians to better understand the patient’s medical history. With an extensive patient history, physicians are better able to treat diseases and to prevent excessive use of drugs, which can be fatal. Without a medical record, the physician has to rely on the patient’s memory, which can lead to a false medical history because of forgetfulness, complex drug names, and diseases that affect the patient’s memory. This motivates us to design and develop the Health Stack app. The objective of this paper is to provide a secure ledger for storing sensitive medical information on a blockchain server, together with critical data analysis, such as the most common disease in a specific area, age group or gender, which can be used for the development of any country. This ensures that any patient’s medical, treatment, medication and vaccine records are tied to his Aadhaar card so that previous reports no longer need to be stored. Any physician or family member of the patient can access this crucial information. Personal details and the medical history of a patient will be stored on the blockchain to ensure security and decentralization. Data analysis can be employed to assess the overall health condition of the citizens of the country. A doctor can access a patient’s account after scanning the patient’s Aadhaar card with a phone and can update or add a medical prescription along with the reports. This will help other doctors and the patient’s family members track the medical history of the patient. The application has the following interfaces and features.
(a) Chat with Expert A chatbot built using a machine learning algorithm is integrated on the landing page itself. If the user is confused or wants to ask about the disease he is suffering from, he can do so by typing the symptoms. He will also be shown the medicines and diet he should take.
(b) Scan Aadhaar Card Any user can scan his own Aadhaar card or that of a family member to check the person’s medical history, the disease he is or was suffering from, the treatment he has gone through and the drugs and vaccines he has taken.
(c) Professional Interface The professional interface is used by doctors. After registering on the application, a doctor can scan the Aadhaar card of any of his patients, see the patient’s medical history, previous treatment records and previous drug and vaccine records, and add new prescriptions or treatment records.
(d) Registration To access the application as “Doctor”, a person needs to fill in a registration form asking for his details, qualification, area of interest, years of experience, license number and some identity cards. Once these details are filled in, the doctor can access his interface.
(e) Scan & Add A doctor can access a patient’s record after scanning the patient’s Aadhaar card and can then view or add new prescriptions, treatment information, and drug and vaccine records. The remaining sections of this paper are arranged as follows: the “Technology used” section explains the different technologies used for the development of this app.
2 Related Work
Health IT is generally seen as a way to improve the healthcare sector [1–3]. It has increased the accessibility of medical data to aid medical research and healthcare management [4–8]. A large number of people, including students, researchers, and entrepreneurs, are working to bring blockchain into the health sector, and specifically into the retrieval of health records from a unique identity. Blockchain technology is considered a shared decentralized ledger for recording transactions. It is employed to record events, such as the history of a product from its beginning to its present state, in an unmodifiable log [9–11]. Decentralization, verifiability, and immutability are essential features of blockchain that are required in the medical and healthcare industry. The applications of blockchain have gained the attention of research institutions around the world. Health bank is an international innovator in digital health and is actively involved in tapping into blockchain features such as smart contracts [11]. Gem Health, one of the established providers of enterprise blockchain solutions, has partnered with Philips Blockchain Lab to use blockchain technology to address the trade-off between patient-centric care and operational efficiency by creating a healthcare ecosystem connected to a universal data infrastructure [12]. The effective utilization of blockchain technology in healthcare is increasing, to the benefit of population health and medical records. Further, various research studies highlight applications of blockchain in the healthcare sector. In this industry, ongoing research in blockchain technology includes medical information protection, medical data storage and sharing, medical data application, forecast analysis, etc. The study [13] discussed the goals and benefits of blockchain technologies in healthcare.
The study [14] introduced a blockchain-based application whose architecture not only supports patients in managing and sharing their own data securely, but also enables untrusted third parties to process medical and health data while ensuring patient privacy by introducing secure multifactor computing. The study [15] introduced the OPAL/Enigma encryption platform, which is based on blockchain technology, to create a secure environment for storing and analyzing medical data. A few research studies based on blockchain technology for storing and managing patient medical records have also been found [16–19]. The study [18] provided a blockchain-based method to share the patient’s data. The study [19] also
employed blockchain network technology to generate an interinstitutional medical health prediction model.
3 Feasibility Study
A. Implementation & Technical Feasibility At present, the government is focusing on linking every detail of a person with his Aadhaar card, such as his bank account, PAN card, mobile number, gas connection and property records. We therefore aim to link his medical reports to his Aadhaar card to create a unique identity for every person. This will greatly help a patient or his family members in tracking his medical, treatment, drug and vaccine history, and it will reduce dependency on hard copies or paper. The use of machine learning will help any user find out what disease he has and its proper treatment without going to a doctor and spending time and money. The project will benefit both people and the government, and it is therefore feasible to implement. As for technical feasibility, the project relies only on popular and tested technologies used in an innovative manner, and thus we guarantee 100% feasibility of the project.
B. Need & Significance As discussed above, we need a unique identity for each person, so that all of his details, such as personal, financial and even medical details, can be tracked. We need this project to provide a secure ledger for storing sensitive medical information on the blockchain server, together with crucial data analysis, such as the most common disease in a particular area, age group or gender, which can be useful for the development of any country. We also need to ensure that the medical, treatment, drug and vaccine records of any patient are linked with his Aadhaar card so that he does not need to store his previous reports anymore, and so that any doctor or family member can access this crucial information.
4 Brief Overview of Technologies Used
A blockchain-based web application is built to enhance the healthcare industry. Blockchain is used to store and mine the data and to increase the security of the database through decentralization. A blockchain, as the name implies, is a chain of digital “blocks” that contain records of transactions. Each block is linked to the blocks before and after it. This alone might not seem like much of a deterrent, but blockchain has some other inherent characteristics that provide additional means of security.
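The linking of blocks described above can be sketched with Python's standard `hashlib` and `json` modules. This is a minimal, single-machine illustration of hash chaining only; it is not the Health Stack implementation, and the function names, record fields and patient identifiers are hypothetical. Consensus, mining and distribution across nodes are left out.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, previous_hash, records):
    """Hash the block contents together with the hash of the previous block."""
    payload = json.dumps(
        {"index": index, "previous_hash": previous_hash, "records": records},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, records):
    """Append a block whose hash depends on the block before it."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    block = {
        "index": index,
        "previous_hash": previous_hash,
        "records": records,
        "hash": block_hash(index, previous_hash, records),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash; any edited record or broken link makes this False."""
    expected_previous = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(index, block["previous_hash"], block["records"]):
            return False
        expected_previous = block["hash"]
    return True


chain = []
append_block(chain, [{"patient": "P-001", "note": "prescription A"}])
append_block(chain, [{"patient": "P-001", "note": "vaccine B"}])
valid_before = is_valid(chain)

chain[0]["records"][0]["note"] = "altered"  # tamper with an old record
valid_after = is_valid(chain)
```

Because each hash covers the previous block's hash, changing an old record invalidates that block and every block after it, which is the property that makes silent edits detectable.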
Node JS is used as the backend language to retrieve and update data. It is a platform built on Chrome’s JavaScript runtime for easily building fast and scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, and well suited to data-intensive real-time applications that run across distributed devices.
AngularJS is used to make the frontend of the application mobile responsive and interactive, and also to support the Scan feature. AngularJS is a structural framework for dynamic web apps. AngularJS is what HTML would have been, had it been designed for applications. HTML is a great declarative language for static documents. AngularJS’s data binding and dependency injection eliminate much of the code you would otherwise have to write. The impedance mismatch between dynamic applications and static documents is often solved with:
• A library: a collection of functions which are useful when writing web apps. Your code is in charge and it calls into the library when it sees fit. E.g., jQuery.
• Frameworks: a particular implementation of a web application, where your code fills in the details. The framework is in charge and it calls into your code when it needs something app specific. E.g., durandal, ember, etc.
AngularJS takes another approach. It attempts to minimize the impedance mismatch between document-centric HTML and what an application needs by creating new HTML constructs. AngularJS teaches the browser new syntax through a construct we call directives. Examples include:
• Data binding, as in {{}}.
• DOM control structures for repeating, showing and hiding DOM fragments.
• Support for forms and form validation.
• Attaching new behavior to DOM elements, such as DOM event handling.
• Grouping of HTML into reusable components.
A machine learning algorithm (regression model) is used to predict the disease a patient has from his symptoms and to suggest medicines and treatment accordingly.
Regression analysis is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It is used for forecasting, time series modelling, and finding cause-and-effect relationships between variables. For example, the relationship between rash driving and the number of road accidents caused by a driver is best studied through regression. Linear, logistic, polynomial, stepwise, ridge, lasso, and elastic net regression are different types of regression models.
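As a minimal illustration of the simplest of these models, the sketch below fits a straight line by ordinary least squares in plain Python. The function name `fit_line` and the data values are hypothetical and chosen only for illustration; this is not the model used in the application, which the paper does not specify in detail.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: rash-driving incidents per year vs. accidents per year.
incidents = [0, 1, 2, 3, 4]
accidents = [1, 3, 5, 7, 9]

slope, intercept = fit_line(incidents, accidents)
predicted = slope * 5 + intercept  # prediction for an unseen predictor value
```

The predictor here is a single variable; the other regression types listed above extend the same idea with a different link function or with penalty terms.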
5 Experimentations and Results
The architecture of the medical blockchain is shown in Fig. 1. Medical institutions, patients, and third-party agencies (such as medical information service platforms, medical insurance companies, etc.) are the three main types of transaction bodies in the medical blockchain. The code snippet in Fig. 2 shows the algorithm used to store the medical records of a patient. Table 1 compares conventional systems with the scheme proposed in this paper.
Fig. 1 The architecture of medical blockchain
6 Conclusion
The personal details and medical history of a patient are stored on a blockchain to ensure security and decentralization. A doctor can access a patient's account by scanning the patient's Aadhaar card with a phone and can then update or add a medical prescription along with the associated reports. This ties each patient's medical, treatment, medication, and vaccine records to his or her Aadhaar card, so that security and privacy are achieved and previous reports remain digitally available in the long run. It also overcomes the problem of a patient's medical data being dispersed across various medical institutions. We therefore conclude that this model can bring significant change to the health-tech sector. In the near future, blockchain-based medical networks can be established with the aim of connecting various medical institutions.
Fig. 2 Code snippet

Table 1 Comparison to conventional systems
Property                             ABE [20]   KAC [21]   Proposed app
Reliance on trusted third parties    Yes        Yes        Yes
Tamper resistance                    Yes        Yes        Yes
Privacy protection                   Yes        Yes        Yes
Secure storage                       Yes        Yes        Yes
Control of medical records           Partial    Partial    Complete
References
1. M.Z. Hydari, R. Telang, W.M. Marella, Saving patient Ryan: can advanced electronic medical records make patient care safer? Manage Sci. Artic. Adv. 1–19 (2018). https://doi.org/10.1287/mnsc.2018.3042
2. H.K. Bhargava, A.N. Mishra, Electronic medical records and physician productivity: Evidence from panel data analysis. Manag. Sci. 60, 2543–2562 (2014). https://doi.org/10.1287/mnsc.2014.1934
3. C. Zheng, C. Xia, Q. Guo, M. Dehmer, Interplay between SIR-based disease spreading and awareness diffusion on multiplex networks. J. Parallel Distr. Com. 115, 20–28 (2018). https://doi.org/10.1016/j.jpdc.2018.01.001
4. X.B. Li, J. Qin, Anonymizing and sharing medical text records. Inf. Syst. Res. 28, 32–352 (2017). https://doi.org/10.1287/isre.2016.0676
5. C. Li, L. Wang, S. Sun, C. Xia, Identification of influential spreaders based on classified neighbors in real-world complex networks. Appl. Math. Comput. 320, 512–523 (2018). https://doi.org/10.1016/j.amc.2017.10.001
6. Z. Xu, X. Wei, X. Luo, Y. Liu, L. Mei, C. Hu, L. Chen, Knowle: a semantic link network based system for organizing large scale online news events. Future Gener. Comp. Syst. 43–44, 40–50 (2015). https://doi.org/10.1016/j.future.2014.04.0.02
7. D. He, N. Kumar, H. Wang, L. Wang, K.K.R. Choo, A. Vinel, A provably-secure cross-domain handshake scheme with symptoms-matching for mobile healthcare social network. IEEE Trans. Depend. Secur. Comput. 15, 633–645 (2018). https://doi.org/10.1109/tdsc.2016.2596286
8. M. Ma, D. He, M.K. Khan, J. Chen, Certificateless searchable public key encryption scheme for mobile healthcare system. Comput. Electr. Eng. 65, 413–424 (2018). https://doi.org/10.1016/J.COMPELECENG.2017.05.014
9. K.N. Griggs, O. Ossipova, C.P. Kohlios, A.N. Baccarini, E.A. Howson, T. Hayajneh, Healthcare blockchain system using smart contracts for secure automated remote patient monitoring. J. Med. Syst. 42, 130 (2018). https://doi.org/10.1007/s10916-018-0982-x
10. C. Lin, D. He, X. Huang, K.K.R. Choo, A.V. Vasilakos, BSeIn: A blockchain-based secure mutual authentication with fine-grained access control system for industry 4.0. J. Netw. Comput. Appl. 42–52 (2018). https://doi.org/10.1016/JJNCA.2018.05.005
11. C. Lin, D. He, X. Huang, M. Khan, K. Choo, A new transitively closed undirected graph authentication scheme for blockchain-based identity management systems. IEEE Access 6, 28203–28212 (2018). https://doi.org/10.1109/ACCESS.2018.2837650
12. B.N. Peter, Blockchain applications for healthcare. (2017). http://www.cio.com/article/3042603/innovation/blockchainapplications-forhealthcare.html. Accessed 17 March 2017
13. G. Prisco, The blockchain for healthcare: Gem launches gem health network with philips blockchain lab. (2016). https://bitcoinmagazine.com/articles/the-blockchain-for-heathcaregemlaunches-gem-health-network-with-philips-blockchain-lab-1461674938/. Accessed 26 April 2018
14. M. Mettler, Blockchain technology in healthcare: The revolution starts here. In: 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom), (2016), pp. 1–3. https://doi.org/10.1109/HealthCom.2016.7749510
15. X. Yue, H. Wang, D. Jin, M. Li, W. Jiang, Healthcare data gateways: Found healthcare intelligence on blockchain with novel privacy risk control. J. Med. Syst. 40(10), 1–8 (2016). https://doi.org/10.1007/s10916-016-0574-6
16. D. Ivan, Moving toward a blockchain-based method for the secure storage of patient records. (2016). http://www.healthit.gov/sites/default/files/9-16-drew_ivan_20160804_blockchain_for_healthcare_final.pdf. Accessed 4 August 2016
17. B. Yuan, W. Lin, C. McDonnell, Blockchains and electronic health records. (2016). http://mcdonnell.mit.edu/blockchain_ehr.pdf. Accessed 4 May 2016
18. A. Ekblaw, A. Azaria, J.D. Halamka, A. Lippman, A Case study for blockchain in healthcare: “MedRec” prototype for electronic health records and medical research data. (2016). https://pdfs.semanticscholar.org/56e6/5b469cad2f3ebd560b3a10e7346780f4ab0a.pdf. Accessed 4 May 2017
19. A. Azaria, A. Ekblaw, T. Vieira, A. Lippman, MedRec: Using blockchain for medical data access and permission management. (2016)
20. K. Peterson, R. Deeduvanu, P. Kanjamala, K.B.M. Clinic, A blockchain-based approach to health information exchange networks. (2016). https://www.healthit.gov/sites/default/files/1255-blockchain-based-approach-final.pdf. Accessed 26 May 2016
21. T.T. Kuo, C.N. Hsu, L. Ohno-Machado, ModelChain: decentralized privacy-preserving healthcare predictive modeling framework on private blockchain networks. (2016). https://www.healthit.gov/sites/default/files/10-30-ucsd-dbmi-oncblockchain-challenge.pdf. Accessed 22 May 2016
AEECC-SEP: Ant-Based Energy Efficient Condensed Cluster Stable Election Protocol in Wireless Sensor Network
Tripti Sharma, Amar Mohapatra, and Geetam Tomar
Abstract Sensor nodes in a wireless sensor network (WSN) are scattered over a specific geographical region. A heterogeneous WSN comprises many normal sensor nodes and a few advanced nodes. An advanced node, used mainly for data clustering, filtering, transport, and fusion, is more expensive and more capable than a normal node. In this paper, an energy-efficient condensed cluster protocol is proposed in which data routing is based on the ant colony optimization algorithm. In SEP, cluster formation is random and based entirely on threshold values; in this protocol, an attempt is made to lower the number of clusters by combining adjacent clusters. As a result, only one cluster is formed within a specific geographical region, which helps to resolve redundancy over that region. Reducing the number of cluster-heads prevents redundancy, and ant colony optimization-based routing selects the optimal routing path. The protocol is designed to increase the stability period and decrease the instability period while improving energy efficiency and network lifetime. Simulation results show that the suggested protocol offers higher energy efficiency, a longer stability period, and a shorter instability period than the SEP and LEACH protocols.
Keywords Clustering · WSN · Ant colony optimization
T. Sharma (B)
IT Department, Maharaja Surajmal Institute of Technology, New Delhi, India
e-mail: [email protected]
A. Mohapatra
IT Department, Indira Gandhi Delhi Technical University for Women, New Delhi, India
e-mail: [email protected]
G. Tomar
Birla Institute of Applied Sciences, Bhimtal 263136, India
e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_95
1 Introduction
To the best of our understanding, most work on wireless sensor networks assumes that all nodes are identical in their communication, sensing, power, and computation capabilities. Such a network is considered a homogeneous wireless sensor network [1]. We anticipate that diverse, special-purpose sensor nodes can be combined into a single sensor network to accomplish more comprehensive tasks; for example, some sensors collect audio signals, some collect image data, some have more power than other nodes, and some have more processing capability. The result is a heterogeneous Wireless Sensor Network (WSN) [1] that contains a mix of sensor types. A typical heterogeneous WSN consists of a large number of normal sensor nodes and some advanced nodes. In a heterogeneous WSN, an advanced node [2] is mainly used for data clustering, filtering, transport, and fusion. These nodes are more expensive and more capable than normal nodes. When sensors sense and transfer information to other nodes in the network, a substantial amount of energy is dissipated. A variety of routing approaches has been proposed to limit the energy consumed by these sensor nodes [3]. Hierarchical routing protocols based on clustering, such as LEACH and SEP, are important for sustaining energy efficiency [4]. In this paper, we discuss AEECC-SEP, a heterogeneous energy-efficient protocol that improves on the existing SEP protocol. In SEP, cluster formation is random and based entirely on threshold values; in this protocol, an attempt is made to lower the number of clusters by fusing adjacent clusters. As a result, only one cluster is formed within a specific geographical region, which helps to resolve redundancy over that region.
The research problem addressed in this paper is effective redundancy management in a clustered heterogeneous WSN to prolong its lifetime. Simulation results show that the suggested protocol offers improved energy efficiency, a longer stable region, and a shorter instability period than the SEP and LEACH protocols. Section 2 describes existing heterogeneous routing protocols for WSNs. Section 3 describes AEECC-SEP, an Ant-based Energy Efficient Condensed Cluster Stable Election Protocol for WSNs. Section 4 presents the results of our clustering-based approach. Section 5 concludes the work.
2 Related Work
A WSN is a collection of numerous sensor nodes dispersed randomly. Clustering prolongs the network's lifetime by reducing energy depletion and also improves scalability. To support node heterogeneity, clustering algorithms must be energy efficient.
2.1 Stochastic Distributed Energy Efficient Clustering for HWSNs (SDEEC)
SDEEC [1] is an advancement of the DEEC algorithm, proposed as Stochastic DEEC by Elbhiri et al. The choice of CH is based on the nodes' remaining energy. The primary idea is a stochastic strategy through which intra-cluster transmission is reduced. Like DEEC, it considers two levels of heterogeneity, but it conserves more energy [1]. In this protocol, the network is partitioned into dynamic clusters. All non-CH nodes transmit information to their CHs within their assigned transmission periods [5]. To receive all the data from the nodes in its cluster, a CH must keep its receiver on. On receipt of the data, the CH performs signal processing to compress the collected data into a single signal. Afterwards, each CH forwards the aggregated information to its primary CH. Non-CH nodes can remain in a sleep state to save energy.
2.2 The Steady Group Clustering Scheme for HWSN (SGCH)
The SGCH protocol was suggested by Liaw et al. in 2009. In this algorithm, all nodes are divided into groups according to their initial energy [6]. The process has two stages, namely the grouping stage and the data transmission stage. The sink node transmits a Group Head Request (GHR) to all existing nodes. Each node sends back an acknowledgement containing its identity and initial energy information [6]. The sink node chooses the cluster-heads and announces them by sending a message containing the cluster-heads' identities. Afterwards, each group head discovers its members by sending a group request message to every node in the network. The algorithm forms the groups (clusters) through this process. The protocol considers multilevel energy heterogeneity of the sensor nodes.
2.3 Ant Colony Optimization (ACO)
ACO was introduced by Marco Dorigo in 1992 under the name Ant System (AS). ACO algorithms provide approximate solutions to hard combinatorial problems in reasonable computation time. In Ant Colony Optimization [7], ants (asynchronous agents) are continuously launched from various nodes, and each builds a partial solution to the problem as it passes through the problem's stages. These agents follow a greedy local decision policy that relies on attractiveness and trail information [8]. Each ant thus incrementally produces a partial solution while traversing the different phases of the problem [9]. ACO tries to solve an optimization problem by repeating the following steps:
• Candidate solutions are constructed using a pheromone model.
• The candidate solutions are used to update the pheromone values so that future sampling is biased toward better solutions.
2.4 Stable Election Protocol for Clustered HWSNs (SEP)
The Stable Election Protocol (SEP) [10], proposed by Smaragdakis et al., addresses the effect of heterogeneity. SEP needs no global knowledge of the network's energy distribution [10]. It aims to ensure that the energy of each node is used uniformly by assigning each node a weighted election probability, which is used to elect the CH nodes. The difficulty with heterogeneity-oblivious protocols is that if the same threshold is set for normal and advanced nodes, there is no assurance that $n \cdot P_{opt}$ is the number of CHs per round per epoch. SEP uses two heterogeneity parameters, namely the additional energy factor (a) and the fraction of advanced nodes (m). SEP is fair in its energy consumption because, owing to the weighted election probability, an advanced node is more likely to become a CH than a normal node. The weighted election probabilities for normal and advanced nodes, $P_{nrm}$ and $P_{adv}$, are calculated from Eqs. (1) and (2):

$$P_{nrm} = \frac{P_{opt}}{1 + a \cdot m} \qquad (1)$$

$$P_{adv} = \frac{P_{opt}\,(1 + a)}{1 + a \cdot m} \qquad (2)$$
With the support of these weighted probabilities, thresholds are determined for the normal nodes and advanced nodes in SEP.
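The weighted probabilities of Eqs. (1) and (2) can be checked numerically. The sketch below uses the simulation values reported later in the paper (Popt = 0.05, a = 2, m = 0.2, 100 nodes). The `threshold` function follows the usual LEACH/SEP threshold form, which this section does not print, so it is an assumption here; rounding the epoch length 1/p to an integer is a further simplification of this sketch. All function names are hypothetical.

```python
def sep_probabilities(p_opt, a, m):
    """Weighted election probabilities of Eqs. (1) and (2)."""
    p_nrm = p_opt / (1 + a * m)
    p_adv = p_opt * (1 + a) / (1 + a * m)
    return p_nrm, p_adv


def threshold(p, round_number):
    """Assumed LEACH/SEP-style threshold for a node that has not yet been
    a cluster-head in the current epoch of roughly 1/p rounds."""
    epoch = round(1 / p)  # simplification: integer epoch length
    return p / (1 - p * (round_number % epoch))


n, p_opt, a, m = 100, 0.05, 2, 0.2
p_nrm, p_adv = sep_probabilities(p_opt, a, m)

# Expected number of cluster-heads per round over both node types.
expected_heads = n * (1 - m) * p_nrm + n * m * p_adv
```

With these values the expected number of cluster-heads per round equals n · Popt, which is the property the weighting is meant to preserve.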
3 AEECC-SEP: An Ant-Based Energy Efficient Condensed Cluster Stable Election Protocol
The SEP protocol does not guarantee that the optimum number of clusters is formed. The number of nodes and the network size influence the number of clusters formed in SEP. The literature survey and simulation results have shown that, for a stable network, the number of clusters should be neither very high nor very low [10]. SEP does not require any global information about the position and appearance of these clusters, but it allows a large number of clusters to form. In the suggested protocol, an extra measure is taken to lower the number of clusters so that no two clusters are formed within a small geographical region of a given radius. This prevents the selection of CHs that lie in close proximity. Otherwise, nearby CHs would sense similar or redundant information and transmit it, so that the energy of the CHs and sensor nodes would be wasted in collecting and forwarding redundant information. In the proposed work, CHs are well distributed. Like LEACH and SEP, AEECC-SEP runs two phases at the start of every round: the setup phase and the transmission (steady-state) phase. The details of the phases are given below.
3.1 Setup Phase for AEECC-SEP
The main objective of the setup phase is to create the clusters and identify the CHs. Candidate CHs are chosen randomly by comparison against a threshold value T(n): if the random value drawn by a node is lower than its threshold, that node becomes a candidate CH for the round. Each candidate then advertises its signal strength (i.e., residual energy and identity) within a circle of radius d. If no other candidate is present within that circle, the node is announced as a cluster-head. If another candidate exists within the circular region, the node compares its energy with that candidate's; if its energy is lower, it withdraws from becoming a CH. If its energy is greater than that of the other candidates within the circular region, those candidates are discarded and the node nominates itself as the CH for that round. Once all the CHs are determined for the round in this way, they broadcast their identity over the whole network so that clusters form as in the SEP protocol. The proposed protocol thus ensures that no two CHs exist within a circular area of radius d, which reduces the number of clusters and hence the energy dissipation.
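The pruning of nearby candidates described above can be sketched as follows. This is one possible reading, not the authors' implementation: candidates are visited in order of decreasing residual energy and accepted only if no already accepted CH lies within distance d. The function name, the tuple layout, and the sample coordinates are hypothetical.

```python
import math


def prune_cluster_heads(candidates, d):
    """Keep candidate CHs so that no two accepted CHs are closer than d.

    candidates: list of (node_id, x, y, residual_energy) tuples.
    Greedy reading of the setup phase: higher-energy candidates win.
    """
    accepted = []
    for cand in sorted(candidates, key=lambda c: c[3], reverse=True):
        _, x, y, _ = cand
        # Accept only if every already accepted CH is at least d away.
        if all(math.hypot(x - ax, y - ay) >= d for _, ax, ay, _ in accepted):
            accepted.append(cand)
    return accepted


# Hypothetical candidates in a 100 x 100 field.
candidates = [
    ("n1", 10.0, 10.0, 0.40),
    ("n2", 20.0, 15.0, 0.90),  # about 11 units from n1, higher energy
    ("n3", 80.0, 80.0, 0.30),
]
heads = prune_cluster_heads(candidates, d=30.0)
head_ids = [h[0] for h in heads]
```

In this example n1 withdraws because n2 lies within the radius and has more residual energy, while n3 is far from both and remains a CH.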
3.1.1 Selection of Cluster-Heads
Like SEP, the proposed protocol is implemented in a heterogeneous network in which some of the nodes are advanced nodes; in other words, the nodes do not all have the same computation power or initial battery level. Let m be the fraction of advanced nodes and n the number of nodes, so that n · m is the total number of advanced nodes. Each node becomes a CH with an optimal probability $P_{opt}$, and once a node has been chosen as a CH, it cannot become a CH again within the next $1/P_{opt}$ rounds. If the advanced nodes have a times more energy than the normal nodes, the weighted probability of a normal node becoming a CH is given by Eq. (1), and that of an advanced node by Eq. (2). In the proposed protocol, the CHs are separated by at least a separation distance d. The value of d must be chosen carefully: a very small d allows too many clusters to form, so the protocol behaves almost like SEP, while a large d produces very few clusters, which is also not stable. After a number of experiments with varying d, it was found that, for a 100 × 100 network area with 100 randomly placed nodes, the best result is obtained when d is between 30 and 35.
3.1.2 Calculation of Separation Distance d
A square region of side M is considered. If $K_{opt}$ is the optimum number of clusters in this region [11], then $K_{opt}$ can be calculated as given in Eq. (3):

$$K_{opt} = \frac{\sqrt{N}}{\sqrt{2\pi}} \sqrt{\frac{E_{fs}}{E_{mp}}} \, \frac{M}{d_{toBS}^{2}} \qquad (3)$$

Here, N is the number of nodes, M is the side length of the region, $E_{fs}$ and $E_{mp}$ are the amplifier energy parameters, and $d_{toBS}$ is the distance to the base station. The area of each cluster is therefore $(M \cdot M)/K_{opt}$. Assuming this area to be circular with radius d, it can be written as Eq. (4):

$$\frac{M \cdot M}{K_{opt}} = \pi d^{2} \qquad (4)$$

From Eq. (4), d can be calculated by Eq. (5):

$$d = \sqrt{\frac{M \cdot M}{\pi K_{opt}}} \qquad (5)$$

For example, with M = 100 and $K_{opt}$ = 3, as used in the simulation, d comes out to approximately 32.6, which agrees with the experimentally determined range of 30 to 35. Figure 1 shows the average energy dissipation as the number of clusters increases. It is clear from the figure that the minimum energy is dissipated when the number of clusters is 2 or 3.
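The worked example for the separation distance can be reproduced directly from Eq. (5). The short sketch below evaluates it for a few cluster counts; the function name `separation_distance` is a hypothetical label, and the inputs are the field side M = 100 and candidate values of the cluster count.

```python
import math


def separation_distance(side, k_opt):
    """Eq. (5): radius of a circle whose area equals one cluster's share
    (side * side / k_opt) of the square field."""
    return math.sqrt(side * side / (math.pi * k_opt))


# Evaluate Eq. (5) for a 100 x 100 field and a few cluster counts.
for k in (2, 3, 4):
    print(k, round(separation_distance(100, k), 1))
```

For three clusters the value is close to 32.6, inside the 30 to 35 range reported from the experiments; fewer clusters give a larger radius and more clusters a smaller one.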
3.2 Steady-State Phase for AEECC-SEP
This phase handles the transfer of information from the non-CH nodes to the CHs or to the base station, whichever is nearer. Clusters are formed in the setup phase, but information is transmitted, received, and aggregated in the steady-state phase. Hence, node energy is dissipated only in this phase [12]. To keep the overhead low, the steady-state phase is kept longer than the setup phase.

Fig. 1 The optimal value of cluster for minimum energy dissipation (x-axis: number of clusters, 1 to 10; y-axis: average energy dissipated, 0.039 to 0.0404)

In each round, the energy dissipated in a CH and in a non-CH node is calculated by Eqs. (6) and (7), respectively, where k is the packet size in bits:

$$E(CH) = E_{T}(k, d_{toBS}) + E_{R}(k) + E_{DA} \qquad (6)$$

$$E(non\text{-}CH) = E_{T}(k, d_{toCH}) + E_{R}(k) \qquad (7)$$

The sink is considered to have unlimited energy; hence, its energy is not depleted when the base station receives data from the wireless nodes in the network.
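Equations (6) and (7) leave the transmit and receive terms unspecified. The sketch below fills them in with the first-order radio model commonly used with LEACH and SEP (free-space loss below a crossover distance, multipath loss above it); that model, and charging the aggregation energy per bit, are assumptions of this sketch rather than statements from the paper. The constants are the values listed in the paper's simulation parameter table, and all function names are hypothetical.

```python
import math

E_ELEC = 50e-9       # ETX = ERX, J/bit (simulation parameter table)
E_FS = 10e-12        # Efs, free-space amplifier parameter
E_MP = 0.0013e-12    # Emp, multipath amplifier parameter
E_DA = 5e-9          # data aggregation energy, assumed here to be per bit
D0 = math.sqrt(E_FS / E_MP)  # crossover distance of the assumed radio model


def e_tx(k, d):
    """Energy to transmit k bits over distance d (assumed radio model)."""
    if d < D0:
        return k * E_ELEC + k * E_FS * d ** 2
    return k * E_ELEC + k * E_MP * d ** 4


def e_rx(k):
    """Energy to receive k bits."""
    return k * E_ELEC


def e_ch(k, d_to_bs):
    """Eq. (6): per-round energy of a cluster-head."""
    return e_tx(k, d_to_bs) + e_rx(k) + k * E_DA


def e_non_ch(k, d_to_ch):
    """Eq. (7): per-round energy of a non-cluster-head node."""
    return e_tx(k, d_to_ch) + e_rx(k)
```

With 4000-bit packets, a CH at distance 50 from the base station spends roughly 0.52 mJ per round under these assumptions, against an initial energy of 0.5 J for a normal node.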
3.2.1 The Developing Solution for Ant-Based Routing
This phase comprises the transmission of information within the network. The Ant Colony Optimization algorithm is used to find optimized routes between the nodes and the CHs. The steps for identifying the ACO [13] path are as follows:
A. Forward ants are launched at each source node.
B. The ants traverse the intermediate nodes in order to reach their associated CHs.
C. Each ant decides probabilistically which node to visit next. This decision is based on the pheromone and heuristic information, and the probability is computed as

$$P_{ij} = \frac{(\tau_{ij})^{\alpha_1} (\eta_j)^{\beta_1}}{\sum_{l \in N} (\tau_{il})^{\alpha_1} (\eta_l)^{\beta_1}} \qquad (8)$$

where $\tau_{ij}$ denotes the pheromone information, computed as

$$\tau_{ij} = \frac{1}{d_{ij}} \qquad (9)$$

Here, $d_{ij}$ is the distance from node i to its associated CH. $\eta_j$ denotes the heuristic information, which refers to the node's energy and is calculated as

$$\eta_j = \frac{E_0 - E_{residual}}{\sum_{k \in N} E_k} \qquad (10)$$

where $E_{residual}$ is the residual energy and $E_0$ is the initial energy. The parameters $\alpha_1$ and $\beta_1$ regulate the relative weight of the pheromone trail and the heuristic information, respectively.
D. The node with the maximum probability is chosen as the next hop for transmitting the data to its associated cluster-head.
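Steps C and D can be sketched as a small next-hop selection routine. The code below computes the pheromone term from Eq. (9) and the transition probabilities from Eq. (8), then picks the most probable candidate as in step D. The heuristic values are passed in directly and assumed to be positive; the function names and sample numbers are hypothetical, and pheromone updating is not modelled.

```python
def next_hop_probabilities(distances, etas, alpha=1.0, beta=1.0):
    """Transition probabilities of Eq. (8) for one ant at node i.

    distances: d_ij for each candidate next hop j (all > 0).
    etas: heuristic value eta_j for each candidate (assumed > 0).
    """
    taus = [1.0 / d for d in distances]                       # Eq. (9)
    weights = [(t ** alpha) * (e ** beta) for t, e in zip(taus, etas)]
    total = sum(weights)
    return [w / total for w in weights]                       # Eq. (8)


def choose_next_hop(distances, etas, alpha=1.0, beta=1.0):
    """Step D: index of the candidate with the maximum probability."""
    probs = next_hop_probabilities(distances, etas, alpha, beta)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical candidates with equal heuristic values.
distances = [10.0, 20.0, 40.0]
etas = [0.5, 0.5, 0.5]
probs = next_hop_probabilities(distances, etas)
best = choose_next_hop(distances, etas)
```

With equal heuristic values the choice is driven by distance alone, so the nearest candidate is selected; raising the second exponent gives the energy term more influence.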
4 Simulation Results
In this section, the performance of the proposed protocol, AEECC-SEP, is assessed through a number of experiments that compare it with LEACH and SEP under the same parameters and network setup. The simulation results show that AEECC-SEP has a longer stable period and lower energy dissipation in each round.
4.1 Network Settings
For the simulation, 100 nodes are randomly positioned in a 100 × 100 area. The base station is assumed to be positioned at (50, 50) in the network area. The proposed protocol is considered as an extension of LEACH and SEP for heterogeneous wireless sensor networks. The fraction m of advanced nodes is set to 0.2; hence, 20 advanced nodes are positioned in the given network. The packet size is assumed to be 4000 bits. Table 1 lists the values used for the simulation parameters.
Table 1 Simulation parameters for AEECC-SEP
Simulation parameter                 Value
ETX = ERX                            50 nJ/bit
Sink's (X, Y) position               (50, 50)
Data aggregation energy (EDA)        5 × 10^-9 J
Initial energy                       0.5 J
K (packet size)                      4000 bits
Kopt (optimal no. of clusters)       3
Popt                                 0.05
N (no. of nodes)                     100
a (level of heterogeneity)           2
m (advanced node fraction)           0.2
d (separation distance)              30
Efs                                  10 × 10^-12 J
Emp                                  0.0013 × 10^-12 J
Network size                         100 × 100
The following measures are considered in evaluating the performance of the proposed protocol:
• Stable region (stability period), measured here by the round at which the first node dies; it should be as long as possible.
• Instability period: the period between the round at which the first node dies and the round at which the last node dies; it should be as short as possible.
• Energy dissipation against the number of rounds in the network.
• Behavior under different values of heterogeneity.
In AEECC-SEP, the minimum cluster-head separation distance d is set to 30 for the simulation. Figure 2 shows how the stability period changes with different values of d; the network is most stable when the cluster-head separation distance is 30.

Fig. 2 Stability period vs cluster-head separation distance

Figure 3 depicts the significant improvement in the stability period of AEECC-SEP compared with the SEP and LEACH protocols for a heterogeneous wireless sensor network. In the simulation, the first node dies at round 1053 in LEACH and at round 1113 in SEP, whereas in AEECC-SEP the first node dies at round 1370, an improvement of 23.1% over SEP and 30.1% over LEACH, as shown in Table 2.

Fig. 3 Total number of alive normal nodes vs number of rounds for m = 0.2

Table 2 Improvement of stable region in AEECC-SEP
Description        LEACH   SEP    AEECC-SEP   Improvement over SEP
First node dead    1053    1113   1370        23.1%

The instability period is calculated as the difference between the round at which the first node dies and the round at which the last node dies. For a WSN, the unstable region should ideally be as short as possible and the stable period as long as possible. Table 3 shows that the unstable region declines from 954 rounds in SEP to 893 rounds in AEECC-SEP. The lengths of the unstable region compare as LEACH > SEP > AEECC-SEP; hence, AEECC-SEP has the lowest instability period.

Table 3 Declination of unstable region in AEECC-SEP
Description                                        LEACH   SEP   AEECC-SEP   Improvement over SEP
Instability period (round at which last node
dies − round at which first node dies)             1689    954   893         6.8%

Figures 3 and 4 display the stability periods of normal and advanced nodes, respectively. Figure 4 shows the number of advanced nodes dead for LEACH, SEP, and the proposed AEECC-SEP protocol. The advanced nodes die earlier in AEECC-SEP than in LEACH and SEP, which indicates that advanced nodes are properly selected as CHs.

Fig. 4 Total number of alive advanced nodes versus number of rounds for m = 0.2

AEECC-SEP also proves to be more resilient than SEP. When the number of advanced nodes is increased or the total additional energy of the system is raised, AEECC-SEP is flexible enough to provide better results, as seen in Fig. 5. The stable region was checked at different values of m · a, and the figure shows that the proposed protocol is more resilient than SEP. As the heterogeneity (i.e., m · a) is varied, the stability period of AEECC-SEP also improves.

Fig. 5 Length of stable periods vs total additive energy in the system

After every round, the aggregate energy of every cluster continues to dissipate. Figure 6 illustrates the improvement of AEECC-SEP over the existing LEACH and SEP protocols, with advanced sensor nodes present, in terms of total cluster energy. Because fewer clusters are formed in AEECC-SEP, less redundant information is transmitted, and after some rounds the information received at the base station is lower than in LEACH and SEP, as shown in Fig. 7.
Fig. 6 Total energy of clusters vs number of rounds for m = 0.2
Fig. 7 Data packets received at base station versus number of rounds
The following conclusions are drawn for AEECC-SEP:
• The stability period of AEECC-SEP is prolonged in comparison with LEACH and SEP.
• The instability period is diminished.
• AEECC-SEP is more robust than SEP in consuming the advanced nodes' extra energy. As the extra energy increases, AEECC-SEP has a higher stability period than SEP.
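The percentages reported for the stable and unstable regions can be rechecked from the round counts given in Tables 2 and 3. The sketch below does the arithmetic; the variable names are hypothetical, and treating each improvement as a relative change is an assumption about how the figures were derived.

```python
def percent_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old


# Round counts as reported in Tables 2 and 3.
first_dead = {"LEACH": 1053, "SEP": 1113, "AEECC-SEP": 1370}
instability = {"LEACH": 1689, "SEP": 954, "AEECC-SEP": 893}

gain_over_sep = percent_change(first_dead["AEECC-SEP"], first_dead["SEP"])
gain_over_leach = percent_change(first_dead["AEECC-SEP"], first_dead["LEACH"])

# Reduction of the unstable region, normalized two ways.
reduction = instability["SEP"] - instability["AEECC-SEP"]
reduction_vs_sep = 100.0 * reduction / instability["SEP"]
reduction_vs_proposed = 100.0 * reduction / instability["AEECC-SEP"]
```

The stability gains come out at about 23.1% and 30.1%, matching the reported values. For the unstable region, the reported 6.8% is reproduced when the 61-round reduction is divided by the AEECC-SEP value (893); dividing by the SEP value (954) instead gives about 6.4%.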
5 Conclusion
Research on heterogeneous sensor networks shows that when a percentage of the nodes is equipped with more energy than the other nodes in the network, these nodes play a major role in extending the overall network lifetime as well as the stability period of the wireless sensor network. AEECC-SEP is designed with the motivation that various existing applications could benefit from such heterogeneity. AEECC-SEP is an extension of SEP with two levels of hierarchy that results in a longer stability period than SEP and LEACH. In the proposed protocol, clusters are well separated from each other and redundant information is reduced. Clusters are merged on the basis of the separation distance between them. The results show that AEECC-SEP performs better than existing protocols in heterogeneous environments and is more resilient than SEP. The stable region of AEECC-SEP improves by 23.1% over SEP, and the unstable region declines by 6.8% relative to SEP. The longer stable region and the shorter unstable region therefore indicate the improved efficiency of the proposed protocol.
References
1. V. Katiyar, N. Chand, S. Soni, Clustering algorithms for heterogeneous wireless sensor network: A survey. Int. J. Appl. Eng. Res. 1(2), 273 (2010)
2. C. Duan, H. Fan, A distributed energy balance clustering protocol for heterogeneous wireless sensor networks. In: 2007 International Conference on Wireless Communications, Networking and Mobile Computing (IEEE, 2007), pp. 2469–2473
3. K. Akkaya, M. Younis, A survey on routing protocols for wireless sensor networks. Ad hoc Netw. 3(3), 325–349 (2005)
4. J.Z. Li, H. Gao, Research advances in wireless sensor networks. J. Comput. Res. Adv. 45(1), 1–15 (2008)
5. I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, E. Cayirci, Wireless sensor networks: a survey. Comput. Netw. 38 (2002)
6. J.J. Liaw, C.Y. Dai, Y.J. Wang, The steady clustering scheme for heterogeneous wireless sensor networks. In: 2009 Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing (IEEE, 2009), pp. 336–341
7. M. Saleem, G.A. Di Caro, M. Farooq, Swarm intelligence based routing protocol for wireless sensor networks: Survey and future directions. Inf. Sci. 181(20), 4597–4624 (2011)
8. N. Jiang, R. Zhou, S. Yang, Q. Ding, An improved ant colony broadcasting algorithm for wireless sensor networks. Int. J. Distrib. Sens. Netw. 5(1), 45–45 (2009)
9. A.M. Zungeru, L.M. Ang, K.P. Seng, Classical and swarm intelligence based routing protocols for wireless sensor networks: A survey and comparison. J. Netw. Comput. Appl. 35(5), 1508–1536 (2012)
10. G. Smaragdakis, I. Matta, A. Bestavros, SEP: A stable election protocol for clustered heterogeneous wireless sensor networks. (Boston University Computer Science Department, 2004)
11. J. Yick, B. Mukherjee, D. Ghosal, Wireless sensor network survey. Comput. Netw. 52(12), 2292–2330 (2008)
12. D. Estrin, A. Sayeed, M. Srivastava, Wireless sensor networks. In: Tutorial at the Eighth ACM International Conference on Mobile Computing and Networking (MobiCom 2002), vol. 255 (2002)
13. S. Okdem, D. Karaboga, Routing in wireless sensor networks using an ant colony optimization (ACO) router chip. Sensors 9(2), 909–921 (2009)
Measurement and Modeling of DTCR Software Parameters Based on Intranet Wide Area Measurement System for Smart Grid Applications
Mohammad Kamrul Hasan, Musse Mohamud Ahmed, and Sherfriz Sherry Musa

Abstract In advanced smart grid computing and measurement, the wide area measurement (WAM) infrastructure based on phasor measurement units (PMUs) facilitates real-time monitoring, control, and measurement. The WAM system is synchronized through GPS satellites and measures the voltage and current phasors of the transmission line feeders and of the entire grid. The WAM infrastructure is mainly operated through the dynamic thermal current rating (DTCR) software, which estimates the thermal effect in transmission lines to ensure that the thermal current does not exceed its limit. However, the existing DTCR is not fully functional because of thermal estimation and calibration issues that frequently affect the transmission line current ratings. Due to the thermal effect, the current flowing through the conductor may reach its maximum value, and when overcurrent occurs it degrades the performance of the transmission line. This article therefore studies recent advances in the WAM infrastructure as well as static and dynamic DTCR models for calibrating the climate parameters in the software. The main aim is to identify the drawbacks of the existing systems and then propose a heat balance model that considers atmospheric conditions such as wind speed and ambient temperature when estimating the current ratings. The performance of the thermal models is evaluated using MATLAB-based numerical analysis. The results show the impact of atmospheric conditions on dynamic current ratings.

Keywords Smart grid · PMU · DTCR · WAM
M. K. Hasan (B) Center for Cyber Security, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Malaysia e-mail: [email protected]
M. M. Ahmed · S. S. Musa Department of Electrical and Electronics Engineering, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_96
1 Introduction
Dynamic thermal current rating (DTCR) for grid computing was introduced in the 1970s to measure environmental effects on overhead transmission lines: the conductor current of an overhead line is rated on the basis of dynamic environmental measurements, and a realization framework for the DTCR system was given in [1]. The technique introduced the idea that the capacity of a transmission line is dynamic. The main concept is that the “concentrated current of the transmission line is the static thermal capacity limit that is set up to prevent overheating when the line load increases”. The limit reflects the climate conditions under which transmission can continue. DTCR is WAM-based software operated through intranet wireless connectivity. The WAM system is an efficient system that employs phasor measurement units (PMUs) to collect data at distant locations. It is regarded as a system with superior performance in monitoring and controlling power system dynamics. The PMUs are installed at substations to generate and transmit data to the phasor data concentrator (PDC), which gathers the data and groups them into sets based on the timestamp [2]. The PMU is an upgraded technology that can overcome the limitations of the supervisory control and data acquisition (SCADA) system [3, 4]. Figure 1 shows the SCADA system framework. The limitations of the SCADA system can be summarized as follows:
• Data obtained from different locations at a particular time are less precise
• Slow changes in reactive and active power
• Monitoring of the transmission line does not perform as expected.
Fig. 1 Satellite timestamp enabled communication system for WAM System [4]
Fig. 2 Operational block diagram of PMU [3]
Because of its advantages, the PMU, although a high-cost device, is used to provide a view of the transmission line [5]. For each bus, the PMU can show the voltage, current, and power flow on the grid. In short, the PMU is a device that provides phasor measurements of voltages and currents. PMU data are transmitted to a central location where a PDC is installed. In the communication framework, data are transmitted as bytes organized into a message frame (see Fig. 2). Once the data have reached the PDC, they are forwarded to the next-level PDC for use by various applications.
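The timestamp-based grouping that a PDC performs on incoming PMU data, as described above, can be illustrated with a minimal sketch. The `PhasorFrame` record, its field names, and the microsecond timestamp are hypothetical choices made for illustration only; the source does not specify the frame layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PhasorFrame:
    """Hypothetical PMU measurement record (field names are assumptions)."""
    pmu_id: str
    timestamp_us: int          # GPS-synchronized timestamp, microseconds
    voltage_mag: float         # voltage phasor magnitude
    voltage_angle_deg: float   # voltage phasor angle
    current_mag: float         # current phasor magnitude
    current_angle_deg: float   # current phasor angle


def group_by_timestamp(frames):
    """Collect frames from all PMUs into sets that share one timestamp."""
    groups = defaultdict(list)
    for frame in frames:
        groups[frame.timestamp_us].append(frame)
    # Return the sets ordered by time so downstream users see a time series.
    return dict(sorted(groups.items()))
```

Each returned set holds the phasors reported by different substations for the same instant, which is what allows voltages and currents from distant locations to be compared directly.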
1.1 Dynamic Line Rating
A number of transmission line monitoring systems are available on the market worldwide. One of the most viable techniques for increasing transmission line capacity is dynamic line rating (DLR), also called dynamic thermal current rating (DTCR). DLR is a method of exploiting the existing transmission lines where
it has the capability to avoid new line investment, increase the loading capacity, and provide attachments in a faulted section of the transmission line [6]. The thermal current limit, or ampacity, is the maximum current that can flow through a conductor or device. Two types of thermal rating are used under normal operation of a transmission line: static and dynamic. The static thermal rating is the current that the conductor can carry under presumed atmospheric conditions of wind, solar radiation, and rain. The dynamic thermal rating, on the other hand, is the current that the conductor can carry under the real atmospheric conditions. In an overhead transmission line, the line rating is governed by the statutory clearance between the conductor and other elements, because a rise in line sag is mainly caused by high conductor temperature [7]. In theory, the loading limit of transmission and distribution networks is determined by the maximum thermal capacity of the conductor. For long transmission lines, aspects such as economic energy losses, system stability, and voltage regulation are also taken into account in determining the maximum loading capability. For particular loading conditions, the thermal rating of a transmission line is expressed in terms of current or MVA at the nominal voltage limit (expert’s system), usually based on the normal, seasonal, and emergency load levels. Hence, under normal and fault conditions, aspects such as the loading limit, the electrical characteristics, and the real atmospheric conditions affect the conductor temperature of a power transmission line. Using the WAM system, however, DTCR calibrates the climate parameters of the conductors to estimate the real-time state of the conductor under critical climate conditions.
Therefore, the main goal of this study is to model the climate parameters used to calibrate the DTCR software of the WAM system. The main contributions of this article are as follows:
1. A state-of-the-art literature review of the related methods and models for static and dynamic measurement systems.
2. A thermal balance model for parameter calibration is proposed for the DTCR software in the WAM system, incorporating the climate parameters. The key parameters considered are wind speed and ambient temperature.
3. The proposed model improves the efficiency of the DTCR software in calibrating and measuring the climate situation dynamically.
The rest of the paper is organized as follows. Section 2 discusses related studies of dynamic thermal line rating models and summarizes the pros and cons of the existing models. Section 3 discusses the design of the dynamic line rating model and its considerations. Section 4 presents the results and analyzes the performance. Finally, Sect. 5 concludes the paper.
2 Related Works
Dynamic thermal line rating (DTLR) relies on real-time measurement data such as local measurements and the loading of assets on the transmission line [8]. The dynamic thermal current rating can be calculated from various parameters available in the real-time monitoring data of transmission lines. Because it increases the maximum load while respecting the thermal constraint, the dynamic thermal current rating is in most circumstances higher than the static thermal rating. In general, DTLR depends on the actual current flowing through the conductor, with all the important atmospheric conditions changing over time [8]. The purpose of DTLR is to upgrade static line rating (SLR), which operates on fixed, conservative weather assumptions. It has been reported that the thermal limit is more sensitive to wind speed than to ambient temperature [9]. For high-voltage alternating current (AC) lines, the surge impedance loading limit of the line cannot exceed 30% of the thermal capability [10]. The authors of [11] modeled a DLR scheme in which the predicted weather conditions are the input and the maximum or future current capacity of the transmission line is the output. To obtain the expected result, the thermodynamic model used for the maximum current capacity (MCC) calculation must be highly accurate, because a less accurate model can lead to line sag, especially when the transmission line reaches its loading and aging limits [12]. Dynamic thermal line rating, which includes weather measurement and line rating calculation, is considered good practice in smart grid applications [9]. The authors of [8, 12] claimed that the use of DLR is not demanding in itself; it becomes demanding, however, with respect to public safety, which depends on the electrical clearance and is related to the aging of the conductor.
It has been stated that the conductor temperature tends to be highest at the line span that experiences the least cooling, and this section is known as the critical span. Devices such as sensors, together with a weather model, are used in the DTLR system to monitor the condition of the entire transmission line [9, 13]. A DTLR system can carry more power than a static rating system [16, 17]. Conservative atmospheric conditions such as high ambient temperature, low wind speed with perpendicular direction, and high solar radiation can be used for static rating estimation. A rating is more accurate and secure when the forecasted rating is close to the actual rating. Hence, static line rating is more secure but less accurate than the dynamic line monitoring system. Table 1 summarizes the literature review with the pros and cons of each model.
Table 1 Summary of literature review

| Models | Merits | Demerits |
|---|---|---|
| Focused mainly on the thermo-resistivity coefficient [12] | At some point, some inherent errors can be decreased by focusing on some concealed parameter | Inconsistent calibration on PMU can lead to errors (noise). However, this model still suffers from calibration errors |
| Rain is considered as a parameter [16] | The evidence shows that the predicted and measured conductor temperatures produced a good correlation during rainy conditions | However, precipitation can be investigated in detail only when other quantities such as wind speed are held constant |
| A new model in the DLR system that includes not only the steady-state calculation but also the transient one [17] | Able to calculate the conductor temperature precisely | Due to the online sensors, one of the major challenges of this project was the power supply |
| High-Temperature Equivalent Model (HTEM) that could decrease the errors caused by inaccurate meteorological parameters [20] | It can be noted from the formula that evaporation at the conductor surface is also taken into account | Needs some improvement for further development of dynamic rating systems |

3 Dynamic Thermal Line Rating Model
A single measurement method was investigated using the single-phase model. Single-phase modeling can describe the line behavior in steady-state conditions, where the three sequence networks are decoupled [15]. Only the positive sequence voltages and currents are used to determine the positive sequence parameters of the overhead line. To overcome the limitation of the algorithm for determining the average conductor temperature, the calibration process focused on uncertain parameters such as the thermo-resistivity coefficient. This offers a quick way to formulate the calculation as a non-linear programming problem in which the current case and the unknown parameters are treated concurrently. The error arising from the difference between the temperature returned by the model and the average temperature can thereby be reduced. In that work, the authors introduced a vector of unknowns that is used to estimate the exact thermo-resistivity coefficient, α. During overload conditions the errors should be weighted, because high accuracy of temperature estimation is required in such cases. For the standard single measurement method, noise and PMU calibration errors are responsible for the inconsistency; this model therefore still suffers from calibration errors [15].

On the other hand, the authors of [16] presented an approach that treats rain as a parameter in the thermal rating model based on the CIGRE assumptions, because the CIGRE standard does not take into account the cooling effect of rain on the overhead transmission line. The introduced algorithm was validated by comparing laboratory DTR measurements with the results computed from the upgraded DTR estimation. For the experimental measurements, a testing site was built as in Fig. 2. The evidence shows that the predicted and measured conductor temperatures produced a good correlation during rainy conditions. The rain parameter plays an important role in the evaporation and convection terms. However, it has been
studied that precipitation can be investigated in detail only when other quantities such as wind speed are held constant [17]. The authors of [18] described a DLR system that includes not only the steady-state calculation but also the transient one. Thermal inertia and other related parameters such as wind speed and direction, solar radiation, and precipitation were also considered. With the aid of sensors, the sag along the line can be measured easily, and this parameter provides information on the average temperature of the conductor. The model also includes the cooling effect of precipitation. As a result, the researchers developed the DLR model at HVL-BUTE, which can estimate the conductor temperature and the sag-temperature. Because of the online sensors, the major issue that arose was the power supply. In this case, high voltage is a condition that requires attention, because the international standard practices for qualifying equipment under conditions such as corona and high electromagnetic fields are still not sufficient. Some studies have repeatedly stressed that a model which includes the dynamic thermal behavior affects the predicted changes in conductor temperature: the temperature takes time to change even when there is a sudden change in the current [19]. Power Donut 2 (PD2) has been proposed as a solution to overloading conditions on transmission lines; it is a device that can be used for transmission line monitoring [15]. The authors applied the Pearson correlation coefficient matrix to quantify the correlation between each input parameter and the ampacity based on steady-state formulae. They reported that wind speed, air temperature, and the angle between the line and the wind azimuth have the greatest correlation with the maximum current rating. Figure 3 shows the relationship between the weather variables and the correlation magnitude.
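The Pearson correlation analysis mentioned above can be sketched as follows. This is only an illustration of how each weather input might be correlated with the computed ampacity; the function names and the dictionary layout are assumptions, and no values from the cited study are reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def correlation_with_ampacity(weather_inputs, ampacity):
    """Correlate every weather variable (name -> samples) with ampacity."""
    return {name: pearson(samples, ampacity)
            for name, samples in weather_inputs.items()}
```

Ranking the absolute values of the returned coefficients gives the kind of correlation-magnitude comparison between input variables that the cited work reports.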
The authors compared fuzzy and probabilistic estimation of the dynamic thermal current rating of a transmission line. In that investigation, fuzzy numbers are used to describe the uncertainties of the unpredictable weather on
Fig. 3 DTLR testing system: (a) thermal rate system [12], (b) Diagram of testing site [16]
the transmission line as well as the estimated maximum current. In the fuzzy approach, a fuzzy number was assigned to each weather variable, and these variables were inserted into the fuzzy thermal heat balance model. The minimum current was taken as the ampacity estimate. Because of its lower time complexity and shorter simulation time, the fuzzy-based model reduces the computational cost of the rating estimation [12]. Monte Carlo simulation (MCS) is a technique used to simulate the weather variables, which are then fed into the conductor thermal heat balance model of the IEEE standard. The thermal heat balance equation is used to calculate the current rating of the transmission line. The ampacity estimate for the entire transmission line is obtained from the minimum ampacity at a specific span, which is described by Eq. (1) [14]:

$$I(x) = I_1(x) + I_2(x) \tag{1}$$

$$I_1(x) = \frac{1}{\sigma_1}\,\varphi\!\left(\frac{x-\mu_1}{\sigma_1}\right)\cdot\Phi\!\left(\frac{\rho\,(x-\mu_1)}{\sigma_1\sqrt{1-\rho^2}} - \frac{x-\mu_2}{\sigma_2\sqrt{1-\rho^2}}\right)$$

$$I_2(x) = \frac{1}{\sigma_2}\,\varphi\!\left(\frac{x-\mu_2}{\sigma_2}\right)\cdot\Phi\!\left(\frac{\rho\,(x-\mu_2)}{\sigma_2\sqrt{1-\rho^2}} - \frac{x-\mu_1}{\sigma_1\sqrt{1-\rho^2}}\right)$$

where $(I_1, I_2)$ are the terms associated with the correlated Gaussian random variables for the current at the two sections, $(\mu_1, \mu_2)$ are the means, $(\sigma_1, \sigma_2)$ are the standard deviations, $\rho$ is the correlation factor, $\varphi$ is the probability density function (PDF), and $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution.
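As a hedged illustration, Eq. (1) can be evaluated numerically with the sketch below, which reads $I(x)$ as the density of the smaller of two correlated Gaussian section ampacities. The function names are hypothetical, and any parameter values passed in are placeholders rather than data from the paper.

```python
import math


def std_normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def min_ampacity_density(x, mu1, mu2, sigma1, sigma2, rho):
    """Evaluate I(x) = I1(x) + I2(x) from Eq. (1) for |rho| < 1."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    root = math.sqrt(1.0 - rho * rho)
    # I1: section 1 takes the value x while section 2 is larger.
    i1 = (1.0 / sigma1) * std_normal_pdf((x - mu1) / sigma1) * std_normal_cdf(
        rho * (x - mu1) / (sigma1 * root) - (x - mu2) / (sigma2 * root))
    # I2: the same expression with the two sections swapped.
    i2 = (1.0 / sigma2) * std_normal_pdf((x - mu2) / sigma2) * std_normal_cdf(
        rho * (x - mu2) / (sigma2 * root) - (x - mu1) / (sigma1 * root))
    return i1 + i2
```

Because the two terms are symmetric in the section indices, swapping the two sections leaves the result unchanged, which is a quick consistency check on any implementation.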
4 Result and Discussion
The performance of the current rating model is evaluated using Monte Carlo simulation. The simulation parameters are adopted from the thermal heat balance model of the IEEE 738 standard. The IEEE standard parameters are listed in Table 2. DTCR is one of the software applications in WAMS that uses the phasor measurement data from the remote ends of a transmission line. A thermal heat balance model is built into DTCR to estimate the maximum current through the transmission line. The DTCR model is considered for a grid system equipped with phasor measurement units (PMUs) in a wide area measurement (WAM) system. Field-measured weather data such as wind speed and ambient temperature are applied to the heat balance model. The performance of the dynamic thermal model is evaluated by simulation. The following plots are obtained for the current ratings.

Table 2 Parameters with the IEEE and CIGRE standard values and the measured values [14, 16–21]

| Parameters | IEEE | CIGRE | Measured |
|---|---|---|---|
| Conductor temperature (°C) | 29.4 | 29.2 | 30.7 |
| Wind speed (m/s) | 2.0 | 2.0 | 20 |
| Direction/angle with conductor (degree) | 58 | 58 | 58 |
| Ambient temperature (°C) | 22.7 | 22.7 | 22.7 |
| Solar radiation (W/m²) | 960 | 960 | 960 |
| Load current (A) | 800 | 800 | 800 |
| Equal current sharing (A) | 200 | 200 | 200 |
| Conductor height above sea level (m) | 1500 | 1500 | 1500 |

Fig. 4 Representation of input variable and correlation magnitude [18]

Figure 4 shows the probabilities of the wind at critical climate conditions and the relation between wind speed and current ratings. It can be seen from Fig. 5 that the current rating is affected by variations in wind speed. The current rating is affected linearly as the wind speed increases. Hence, wind plays an important role in line cooling, and under most circumstances the limit of the proposed dynamic model is therefore higher than the static line rating, and the ratings are also more accurate. Higher wind generation thus corresponds to higher line current ratings. Similarly, the relation between solar radiation and current ratings is shown in Fig. 6, which demonstrates the probabilities of the ambient temperature for various wind measurements and the effect on the current rating of variations in conductor heat conditions. The characteristic effects of solar radiation, including the effect of air density, on current ratings are shown in Fig. 7. It can be observed that the proposed dynamic parameter modeling and calibration can measure real-time climate effects from the environment dynamically. The measurements suggest that the dynamic parameter calibration can indicate that the conductor temperature is also affected by variations in wind speed.
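The steady-state heat balance that underlies the current rating discussed above can be sketched as follows. The sketch assumes the general balance used in IEEE 738-style models, q_c + q_r = q_s + I²R, so that I = sqrt((q_c + q_r − q_s) / R). The function names are hypothetical, and the convective loss is deliberately left as an input because the wind-dependent correlations of the standard are not reproduced here.

```python
import math

STEFAN_BOLTZMANN = 5.670374419e-8  # W/(m^2 K^4)


def radiative_loss(diameter_m, emissivity, t_cond_c, t_amb_c):
    """Radiated heat loss per unit length (W/m) of a cylindrical conductor."""
    t_cond_k = t_cond_c + 273.15
    t_amb_k = t_amb_c + 273.15
    surface_per_m = math.pi * diameter_m  # surface area per metre of line
    return (surface_per_m * emissivity * STEFAN_BOLTZMANN
            * (t_cond_k ** 4 - t_amb_k ** 4))


def solar_gain(diameter_m, absorptivity, solar_w_m2, incidence_deg=90.0):
    """Solar heat gain per unit length (W/m) on the projected conductor area."""
    return (absorptivity * solar_w_m2
            * math.sin(math.radians(incidence_deg)) * diameter_m)


def ampacity(q_conv, q_rad, q_solar, resistance_ohm_per_m):
    """Steady-state current rating (A) from the heat balance.

    q_conv is the convective loss (W/m); it must come from a separate
    wind-speed and wind-direction model, which this sketch does not include.
    """
    net_cooling = q_conv + q_rad - q_solar
    if net_cooling <= 0.0:
        return 0.0  # no current can be carried without exceeding the limit
    return math.sqrt(net_cooling / resistance_ohm_per_m)
```

A Monte Carlo study of the kind described in this section would call `ampacity` once per sampled weather state, with the convective term recomputed from each sampled wind speed and direction.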
Fig. 5 DTCR parameter modeling for the probability of the wind parameter (a), and the ratio of the current rating against wind speed (b)
5 Conclusion
Climate parameters are very important measurements for grid transmission and for computation in the DTCR software. This article studied existing models and methods for static DTCR systems and proposed dynamic DTCR parameters for calibrating the climate conditions. The most advanced WAM system and the complete grid layout were composed and studied according to the DTCR software configuration and network requirements. The results and performance analysis show that when the wind speed is higher, the wind incident on the lines is expected to be higher than the value assumed in calculating the static limit. It is also observed that when solar radiation is higher, the effect on the current rating is correspondingly higher. Hence, the proposed DTCR calibration parameters demonstrate their impact on the system under various climate conditions.
Fig. 6 DTCR parameter modeling for the probability of the ambient temperature (a), and the ratio of the conductor temperature against wind speed (b)
Fig. 7 Current rating versus solar radiation
References
1. IEEE Standard for calculating the current-temperature relationship of bare overhead conductors, IEEE Standard 738–2012 (2013)
2. A.S. Rana, M.S. Thomas, N. Senroy, Wide area measurement system performance based on latency and link utilization. 1–5 (2015)
3. P. Nanda, PMU implementation for a wide area measurement of a power system. 23–24 (2017)
4. H. Mohammad Kamrul et al., Phase offset analysis of asymmetric communications infrastructure in smart grid. Elektronika ir Elektrotechnika 25(2), 67–71 (2019)
5. S. Das, D.K. Mohanta, Simulation of wide area measurement system with optimal phasor measurement unit location. 226–230 (2014)
6. J. Teh, I. Cotton, S.T.R. Al, Risk-informed design modification of dynamic thermal rating system. 9, 2697–2704 (2015)
7. I. Engineering, Accommodating increased fluctuations in conductor current due to intermittent renewable energy a short-term dynamic thermal rating for. 141–145 (2016)
8. Y. Cong, P. Regulski, P. Wall, M. Osborne, V. Terzija, On the use of dynamic thermal-line ratings for improving operational tripping schemes. 31(4), 1891–1900 (2016)
9. L. Dawson, A.M. Knight, Applicability of dynamic thermal line rating for long lines. 8977 (2017)
10. W. Group, O. Lines, Real-time overhead transmission-line monitoring for dynamic rating. 31(3), 921–927 (2016)
11. S. Karimi, A.M. Knight, P. Musilek, J. Heckenbergerova, A probabilistic estimation for dynamic thermal rating of transmission lines. (2016)
12. S. Karimi, A.M. Knight, P. Musilek, A comparison between fuzzy and probabilistic estimation of dynamic thermal rating of transmission lines. 1740–1744 (2016)
13. E. Fernandez, I. Albizu, M.T. Bedialauneta, A.J. Mazon, P.T. Leite, Review of dynamic line rating systems for wind power integration. Renew. Sustain. Energy Rev. 53, 80–92 (2016)
14. M. Šiler, J. Heckenbergerová, P. Musilek, J. Rodway, Sensitivity analysis of conductor current-temperature calculations. Dept. of Mathematics and Physics, Faculty of Electrical Engineering and information (2013), p. 3
15. E.M. Carlini, C. Pisani, A. Vaccaro, D. Villaci, Dynamic line rating monitoring in WAMS: challenges and practical solutions. 5 (2015)
16. G. Kosec, M. Maksić, V. Djurica, Dynamic thermal rating of power lines – model and measurements in rainy conditions. Int. J. Electr. Power Energy Syst. 91(4), 222–229 (2017)
17. D. Balangó, I. Pácsonyi, Overview of a new dynamic line rating system, from modeling to measurement. 1–6 (2015)
18. M. Musavi, D. Chamberlain, Q. Li, Overhead conductor dynamic thermal rating measurement and prediction. In: Proc. IEEE Int. Conf. Smart Meas. Grids SMFG 2011, vol. no. 1 (2011), pp. 135–138
19. Y. Liu, Y. Cheng, The field experience of a dynamic rating system on overhead power transmission lines. In: IEEE International Conference on High Voltage Engineering and Application (ICHVE), 2016 (IEEE, 2016), pp. 1–4
20. X. Zhou, Y. Wang, X. Zhou, W. Tao, Z. Niu, A. Qu, Uncertainty analysis of dynamic thermal rating of overhead transmission line. J. Inf. Proces. Syst. 15(2) (2019)
21. V. Roy, S. Noureen, T. Atique, S. Bayne, M. Giesselmann, A.S. Subburaj, M.A. Harral, Design, development and experimental setup of a PMU network for monitoring and anomaly detection. In: 2019 SoutheastCon (IEEE, 2019), pp. 1–6
Dynamic Load Modeling and Parameter Estimation of 132/275 kV Using PMU-Based Wide Area Measurement System
Musse Mohamud Ahmed, Mohammad Kamrul Hasan, and Noor Shamilawani Farhana Yusoff

Abstract This paper presents dynamic load modeling for the Kuching power system using real data and load models of the Sarawak grid system. Load modeling is vital in the power industry because power system reliability and stability can be assessed from time-domain simulation results. Dynamic load modeling was carried out in MATLAB using the load response, and the system stability was assessed. Parameter estimates of the load composition at the selected bus are also included. The simulation results for the load models are compared with the system data recorded by the phasor measurement units (PMUs) of the wide area measurement system (WAMS) for a bus tripping event over a time interval. The estimated load composition parameters are then iterated to convergence using the least square error method and compared with the actual recorded data until optimized load models are achieved.

Keywords WAMs · PMU · Dynamic load modeling · Least square error method · Parameter estimation
M. M. Ahmed · N. S. F. Yusoff Department of Electrical and Electronics Engineering, Faculty of Engineering, Universiti Malaysia Sarawak (UNIMAS), 94300 Kota Samarahan, Sarawak, Malaysia e-mail: [email protected]
N. S. F. Yusoff e-mail: [email protected]
M. K. Hasan (B) Center for Cyber Security, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), 43600 Bangi, Selangor, Malaysia e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_97
1 Introduction
Growth of the existing load in the Sarawak grid has led to significant changes in the current transmission and distribution network, chiefly in power demand. Growing power demand puts pressure on the utility company in Sarawak (SEB) to improve grid reliability by preventing the grid from operating beyond its capability. For that reason, load modeling based on real data and load models is required to secure reliable power system operation. SEB is interested in a load modeling method because little is known about the dynamic characteristics and load composition of the Sarawak grid system. In addition, SEB has recently studied arbitrary static load models with assumed parameters by matching recorded data and simulated load response. Models of transformers, generators, and transmission lines can generally be built more exactly than load models, because some loads are time varying and depend on the load class (residential, industrial, etc.). Moreover, most researchers are interested in assessing static and dynamic load profiles because of the appearance of new, non-conventional types of loads that are power-electronic based or interfaced, and because of the need to operate with increasing non-conventional and intermittent generation [1]. The use of devices connected to switching power supplies and of energy-saving lighting is also rising. With the high penetration of power-electronic-based loads, most of the load models in use today are out of date; a correct load modeling analysis should therefore consider an exact and accurate representation of the static and dynamic behavior of the system loads. This study is important for system reliability, given the numerous blackout events that have occurred, such as the one in Sweden in 1983, which was due to an inappropriate representation of system loads.
The CLODBL load model type in PSSE is considered an appropriate load model to refer to, since it has the complex load model structure shown in Fig. 2. The simulated Kuching area power system consists of 4 buses, with PV and PQ buses for load modeling, and the load at the Entinggan bus was chosen as the estimated load (in the dashed red box), as shown in Fig. 1. The 132/275 kV network of the Kuching power system was modeled using MATLAB Simulink, and to assess the parameter estimation at the Entinggan bus, the CLODBL load type from the PSSE software was applied to the network as a realistic load model. The CLODBL load parameters applied in MATLAB Simulink are shown in Fig. 3 and were obtained by estimating the large and small motor, transformer saturation, and constant power load components. Because the bus tripping event occurred in the daytime, the percentage of discharge lighting is taken as 0 on the assumption that no lighting was in use; it is therefore not included in the simulation.
Fig. 1 Simulated 132/275 kV Kuching area power system
Fig. 2 Complex load model features [10]
Fig. 3 Load model applied into network
2 Related Work
Several works related to this research on dynamic load modeling are highlighted in this section, together with the authors’ contributions and the pros and cons of the selected papers.
2.1 Dynamic Load Model for Industrial Facilities Using Template and PSS/E
The authors in [2] proposed a feasible dynamic load modeling method for industrial facilities, using a 110 MW Kraft paper mill as the example, because most earlier load modeling methods concentrate only on the aggregate load that serves a mix of customers. The method consists of three steps:
1. Conduct a load survey for a specific type of industrial facility (e.g., a 110 MW Kraft paper mill) and create a template. The survey covers the existence of co-generation, the configuration of the distribution system (e.g., location of load at each voltage level), motor components (e.g., asynchronous and synchronous) [3, 4], and the fractions of Variable Frequency Drives (VFD) and static loads.
2. Use the template to determine the load composition of the facility required by the PSS/E CLOD load model structure. The CLOD load model includes large and small motors, discharge lighting, transformer saturation, and other static loads. The method proposes the following rules: the percentage of large motors is the fraction of induction motors rated at 2.4 kV; the percentage of small motors is the fraction of induction motors rated at 480 V; the transformer exciting current is set to a typical 1–2%; discharge lighting is set to 0% because it is not considered; VFDs are treated as constant MVA load; and the remaining percentage is the fraction of other static load (e.g., lighting).
3. Create the PSS/E CLOD load model of the facility from the load composition data.
These steps were carried out by lumping the load at the 13.2 kV bus and turning off the co-generation unit. The accuracy of the proposed model was validated by comparing the response of the PSS/E CLOD load model with the measured disturbance response at the 230 kV Point of Interconnection (POI) bus for two disturbance events [5]:
1. A voltage sag of 85% lasting 0.5 s, to validate the voltage dependence of the model.
2. A frequency excursion between 59.8 Hz and 59.9 Hz, to validate the frequency dependence of the load model.
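The composition rules in step 2 can be illustrated with a small Python sketch. The function name, its arguments, and the survey numbers are hypothetical and are not taken from [2]; the sketch also assumes that the "other static" share is whatever remains after every listed component, including the exciting-current share, has been subtracted from 100%.

```python
def clod_composition(total_mw, im_2400v_mw, im_480v_mw, vfd_mw,
                     exciting_current_pct=1.5):
    """Return CLOD-style load percentages from hypothetical survey totals (MW)."""
    if total_mw <= 0:
        raise ValueError("total_mw must be positive")

    def pct(mw):
        # Share of the total facility load, in percent.
        return 100.0 * mw / total_mw

    composition = {
        "large_motor": pct(im_2400v_mw),   # induction motors rated 2.4 kV
        "small_motor": pct(im_480v_mw),    # induction motors rated 480 V
        "transformer_exciting_current": exciting_current_pct,  # typical 1-2%
        "discharge_lighting": 0.0,         # not considered in the method
        "constant_mva": pct(vfd_mw),       # VFDs treated as constant MVA load
    }
    # Assumption: everything not listed above is "other static" load.
    remaining = 100.0 - sum(composition.values())
    if remaining < 0:
        raise ValueError("surveyed components exceed the total load")
    composition["other_static"] = remaining
    return composition


# Hypothetical survey of a 110 MW facility (the MW values are made up).
example = clod_composition(total_mw=110.0, im_2400v_mw=44.0,
                           im_480v_mw=22.0, vfd_mw=11.0)
```

The returned percentages always sum to 100, which makes it easy to spot a survey whose surveyed components exceed the facility total.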
2.2 Modeling of Nonlinear Dynamic Power System Using the Vector Fitting Technique
The authors in [3, 6–9] proposed a measurement-based generic load model, built with the vector fitting technique, that is applicable to dynamic simulation under larger loading conditions. Several dynamic modeling scenarios were simulated in the NEPLAN software to confirm the accuracy of the load models. The model structure treats real and reactive power as independent quantities, based on the Exponential Recovery Load Model (ERLM). The simulations used the IEEE 39-bus New England test system, taking the responses at Bus 12 after its load was replaced with a new load consisting of eight Induction Motors (IM) and a static load. Throughout the simulations, an ideal on-load tap changer was used, and the voltage magnitude, real power, and reactive power, sampled at 1000 samples per second, served as inputs to the load models. Different numbers of iterations were considered in order to demonstrate the influence of the R² criterion on the results of the proposed model. The proposed model was then assessed with Monte Carlo (MC) simulations in which the load composition (e.g., the number of static loads and IMs) was varied. Finally, simulations with a fifth-order IM model showed that the proposed model can also be used for higher-order load dynamics.
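The source does not define the R² criterion. Assuming it is the usual coefficient of determination between a measured and a simulated response, a minimal Python sketch looks as follows; the function name and the sample values are illustrative only.

```python
def r_squared(measured, simulated):
    """Coefficient of determination between two equally long series."""
    if not measured or len(measured) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")
    mean = sum(measured) / len(measured)
    # Residual sum of squares: mismatch between model and measurement.
    ss_res = sum((m - s) ** 2 for m, s in zip(measured, simulated))
    # Total sum of squares: spread of the measurement around its mean.
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0:
        raise ValueError("measured response is constant")
    return 1.0 - ss_res / ss_tot


perfect_fit = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only_fit = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A value of 1 means the simulated response reproduces the measurement exactly, while a model that only predicts the mean scores 0.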
2.3 Automated Identification of Power System Load Models Based on Field Measurements
The Automated Load Modeling Tool (ALMT) was developed in MATLAB to significantly improve the efficiency of load modeling and to allow a reliable assessment of system dynamic performance. It is a fully automated, software-based load modeling process that requires no human involvement. According to [7], ALMT has three major stages:
1. Data Processing: Recorded power system signals, each containing V, I, P, Q, and f, are imported into the program and processed for load model identification. This stage covers disturbance identification, data conditioning, and classification.
2. Load Model Selection: The most suitable load model for a recorded disturbance is identified from the shape of the dominant response.
3. Load Model Parameter Identification: The load model parameters are iterated until the accuracy benchmark is reached, by comparing the recorded and simulated load responses.
The authors proposed a fully automated, software-based load modeling process that needs no human intervention. ALMT enables automatic load model identification at buses based on online measurements. The method can improve the effectiveness of load modeling with proper accuracy, allows a more reliable assessment of system dynamics, and makes load modeling for power system stability studies effortless because it needs no human involvement. ALMT uses actual recorded system data, which may contain noise from the acquisition devices, so it filters the data with MA, SG, and BW filters. However, the SG filter cannot properly filter the start and end points, which leads to a noticeable spike at each end. In addition, the load model applies only to the load mix at the monitored bus at that moment. Table 1 summarizes the papers reviewed in this section in terms of the authors' contributions, merits, and research gaps.

Table 1 Summary of related works
Methods | Pros | Cons
Generic load modeling for industrial facilities using the PSS/E CLOD load model | The method is practical and feasible for power system planning studies, so the load modeling currently used by utility companies can be improved | The template requires an extended load survey for the specific types of load in the industrial facility
A generic load model based on the vector fitting method for dynamic simulation of power systems | The model can represent high-order loads accurately by using a variable-order transfer function | All coefficients must be estimated, and estimated accurately, for the performance of the proposed model to be reliable
A fully automated, software-based load modeling process without human intervention; the Automated Load Modeling Tool (ALMT) enables automatic load model identification at buses based on online measurements | ALMT improves the efficiency of load modeling with appropriate accuracy and allows a more reliable assessment of system dynamic performance | The SG filter leaves a noticeable spike at the start and end points, and the load model applies only to the load mix at the monitored bus
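The noise filtering that ALMT applies to recorded data can be illustrated with a simple smoother. The sketch below assumes that "MA" denotes a moving-average filter; the function name, the window size, and the edge handling (shrinking the window near both ends) are illustrative choices and are not taken from [7].

```python
def moving_average(signal, window=5):
    """Centered moving average; the window shrinks near both ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)                  # clip the window at the start
        hi = min(len(signal), i + half + 1)    # clip the window at the end
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


flat = moving_average([1.0, 1.0, 1.0, 1.0], window=3)
spike = moving_average([0.0, 3.0, 0.0], window=3)
```

Shrinking the window at the ends is one way to avoid inventing samples beyond the record; it also shows why the first and last points of any windowed filter are treated differently from the interior points.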
3 Dynamic Load Models
This part discusses several load models available in the PSS/E software. Each load model has its own load classification and description.
3.1 ZIP Models
The ZIP model is a static load model that combines constant impedance (Z), constant current (I), and constant power (P) components. In the PSS/E load models, users are free to choose their own model, either e0, e1, or e2, depending on the load classification (e.g., load with constant Z = e0). The following equations are used in the ZIP model, with all quantities in per unit (pu):
\[ P = P_{Load}\,\left(P_p + P_1 V + P_z V^2\right), \qquad P_p + P_1 + P_z = 1 \]
\[ Q = Q_{Load}\,\left(Q_p + Q_1 V + Q_z V^2\right), \qquad P_p + P_1 + P_z = 1 \]
However, the Q expression can be used only for a single electrical device. It is not appropriate for a part of the grid, where the value is almost zero, and hence the equation of Q is replaced with that of P. A static load model shows no time dependence during a system outage.
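As a worked illustration of the ZIP equations above, the following Python sketch evaluates P and Q for a given per-unit voltage. The function name and the coefficient values are hypothetical. Only the constraint on the P coefficients is checked, because that is the constraint the text states.

```python
def zip_load(v_pu, p_load, q_load, p_coeffs, q_coeffs):
    """Evaluate the ZIP model at voltage v_pu (all values in per unit).

    p_coeffs = (Pp, P1, Pz) and q_coeffs = (Qp, Q1, Qz): constant power,
    constant current, and constant impedance shares.
    """
    pp, p1, pz = p_coeffs
    qp, q1, qz = q_coeffs
    if abs(pp + p1 + pz - 1.0) > 1e-9:
        raise ValueError("Pp + P1 + Pz must equal 1")
    p = p_load * (pp + p1 * v_pu + pz * v_pu ** 2)
    q = q_load * (qp + q1 * v_pu + qz * v_pu ** 2)
    return p, q


# Hypothetical coefficients: 20% constant power, 30% current, 50% impedance.
p_nom, q_nom = zip_load(1.0, 1.0, 0.5, (0.2, 0.3, 0.5), (0.2, 0.3, 0.5))
# A pure constant-impedance load at 0.9 pu voltage draws 0.81 pu power.
p_z_only, _ = zip_load(0.9, 1.0, 0.5, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
```

At nominal voltage the model returns the nominal load, and the constant-impedance share is the one that falls fastest as the voltage drops.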
3.2 Dynamic Load Models
The PSS/E library contains several dynamic load models with different characteristics:
• IEEBL: A static load model with a frequency-dependent component. The equations of this model are
\[ P = P_{Load}\,\left(a_1 v^{n_1} + a_2 v^{n_2} + a_3 v^{n_3}\right)\left(1 + a_7\,\Delta f\right) \]
\[ Q = Q_{Load}\,\left(a_4 v^{n_4} + a_5 v^{n_5} + a_6 v^{n_6}\right)\left(1 + a_8\,\Delta f\right) \]
where a1, …, a6 are the constants of the static load (ZIP model), n1, …, n6 are the voltage exponents (e.g., 0, 1, 2 for a ZIP model), a7 and a8 give the percentage effect of frequency on the load, and Δf is the frequency effect (frequency deviation).
• LDFRBL: A load model that depends on the system frequency, in which the frequency effect is applied throughout. This model has no effect on constant admittance loads or shunt devices.
\[ P = P_0 \left(\frac{\omega}{\omega_0}\right)^{m}, \qquad Q = Q_0 \left(\frac{\omega}{\omega_0}\right)^{n} \]
\[ I_p = I_{po} \left(\frac{\omega}{\omega_0}\right)^{r}, \qquad I_q = I_{qo} \left(\frac{\omega}{\omega_0}\right)^{s} \]
where m, n, r, and s depend on the system characteristics (e.g., integers, or zero if the corresponding load component is independent of frequency).
• CLODBL: This model replaces all constant current, constant MVA, and constant admittance loads with a composite load consisting of large and small motors, lighting, and any equipment fed from the substation (e.g., transformer saturation). It is appropriate when the simulation has to be done without detailed dynamic data.
• CIM5BL and CIM6BL: These models are used for large individual induction motors and can represent either single-cage or double-cage machines, including rotor flux dynamics. The load composition for these models can be any percentage of constant MVA, current, or admittance. The stator and rotor impedances and the magnetizing reactance can be obtained from the manufacturer of the induction motor.
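The IEEBL and LDFRBL expressions listed above can be evaluated directly. The Python sketch below is only an illustration of those formulas and is not PSS/E code; the function names, argument order, and numerical values are assumptions.

```python
def ieebl_power(v, delta_f, p_load, q_load, a, n):
    """IEEBL-style load: a = [a1..a8], n = [n1..n6], v and delta_f in pu."""
    p_voltage = sum(a[i] * v ** n[i] for i in range(0, 3))  # a1..a3 terms
    q_voltage = sum(a[i] * v ** n[i] for i in range(3, 6))  # a4..a6 terms
    p = p_load * p_voltage * (1.0 + a[6] * delta_f)         # a7 scales P
    q = q_load * q_voltage * (1.0 + a[7] * delta_f)         # a8 scales Q
    return p, q


def ldfrbl(omega, omega0, p0, q0, ip0, iq0, m, n, r, s):
    """LDFRBL-style frequency dependence of P, Q, Ip, and Iq."""
    ratio = omega / omega0
    return (p0 * ratio ** m, q0 * ratio ** n,
            ip0 * ratio ** r, iq0 * ratio ** s)


# Hypothetical ZIP-like coefficients and exponents.
a = [0.2, 0.3, 0.5, 0.2, 0.3, 0.5, 1.0, -1.0]
n = [0, 1, 2, 0, 1, 2]
ieebl_nominal = ieebl_power(1.0, 0.0, 2.0, 1.0, a, n)
ldfrbl_half = ldfrbl(0.5, 1.0, 2.0, 1.0, 0.5, 0.25, 1, 2, 0, 0)
```

With exponents r = s = 0 the current components are independent of frequency, which matches the note that a zero exponent marks a frequency-independent load component.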
4 Dynamic Load Model and Parametric Study
Network data recorded at every bus in the whole Kuching area, together with system outage data for the Mambong bus, were collected from SEB. The data were assessed with regard to their behavior and influence on the power system, such as under/over frequency and under/over voltage. The load at the Entinggan bus was chosen for the study of load composition. The 132/275 kV Kuching Power System network was reduced to 4 buses to simplify simulation and analysis, since a larger network requires many more iterations. The reduced network is shown in Fig. 4.

Fig. 4 Four (4) bus 132/275 kV Kuching power system

To study dynamic stability, a disturbance event was applied to the power system and simulated, and the behavior of frequency, voltage, current, and power at the bus was analyzed. The event has a short sample time of 4 min 29 s, trimmed to a total of 130 s based on the true event recorded by the PMU, which is the Mambong tripping. Dynamic load modeling was carried out in this paper by assessing bus and transmission line stability and by identifying which part of the network becomes heavily loaded. For the parameter estimation outlined in Fig. 3, a designed load model is used during simulation to determine the load composition at the Entinggan bus. The simulated response is then compared with the data measured by the PMU to determine the error. Figure 5 gives more detail on the model parameter estimation procedure, which determines the load composition by comparing the simulated response with the recorded data. The simulated data are obtained by implementing the CLODBL complex load model in MATLAB Simulink, and the recorded data are collected from the PMU, with the Mambong bus fault as the disturbance event for both [9, 3]. The least square error between the measured and simulated response data is given below [4, 5], with the symbols defined in Table 2.
\[ E(X) = \sum_{k=1}^{N} \left\{ X_{meas}(t_k) - X_{simu}(t_k, X) \right\}^2 \]
Fig. 5 Model parameter estimation procedure
Table 2 Symbols definition [6, 3, 4, 9, 11]
Symbol | Definition | Values
N | Total number of samples | 130
k | Sample index | 1, 2, 3, …, N
X_meas | Value of voltage/current at sample k | Follows the recorded PMU value
X_simu | Value of the simulated load response of voltage/current at sample k | Follows the simulated load models
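The least square error criterion and the estimation loop of Fig. 5 can be sketched in Python as follows. This is a toy illustration under stated assumptions: the response model, the voltage samples, and the candidate grid are hypothetical, and a plain search over candidates stands in for whatever optimizer the authors used.

```python
def least_square_error(measured, simulated):
    """E(X): sum of squared differences over all samples."""
    if len(measured) != len(simulated):
        raise ValueError("series must have the same number of samples")
    return sum((m - s) ** 2 for m, s in zip(measured, simulated))


def estimate_parameter(measured, simulate, candidates):
    """Return the candidate parameter with the smallest error, and that error."""
    best_x, best_err = None, float("inf")
    for x in candidates:
        err = least_square_error(measured, simulate(x))
        if err < best_err:
            best_x, best_err = x, err
    return best_x, best_err


# Hypothetical per-unit voltage samples during a disturbance.
VOLTAGE_PU = [1.00, 0.95, 0.90, 0.92, 0.97, 1.00]


def toy_response(motor_fraction):
    # Toy model: the motor share behaves as constant power,
    # the rest of the load as constant impedance.
    return [motor_fraction + (1.0 - motor_fraction) * v ** 2
            for v in VOLTAGE_PU]


measured = toy_response(0.6)                  # stands in for PMU data
candidates = [i / 10 for i in range(11)]      # 0.0, 0.1, ..., 1.0
best_x, best_err = estimate_parameter(measured, toy_response, candidates)
```

In this toy case the search recovers the motor fraction used to generate the "measured" series; with real PMU data the minimum error would generally stay above zero because of noise and model mismatch.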
5 Result and Discussion
The 132/275 kV Kuching Power System was simulated using the load flow tool of MATLAB, to study the load flow of the system under normal and disturbance conditions and to evaluate the stability of the power system. The evaluation considered Bus 1: Sejingkat, Bus 2: Engkelili, Bus 3: Entinggan, and Bus 4: Mambong. Several system outage events were simulated over a simulation time of 5 s, and the behavior of the system after the synchronous machine at Sejingkat tripped was evaluated. The simulation results show the condition of the system before the fault and during the fault, when the Sejingkat source is shut down and stops generating real and reactive power into the system (Fig. 6). The voltage magnitudes measured at the Sejingkat, Entinggan, and Mambong buses decreased, since a high current leads to a high voltage drop. However, the voltage drop at Engkelili was higher and fell beyond the safe range of voltage magnitude, as no load was supplied at the Engkelili bus. High voltage drops in the system lead to poor performance of domestic appliances such as refrigerators or lighting systems, and at peak time these appliances might not function properly. In addition, industrial consumers that use high-rating motors may face a total shutdown, since the motors are unable to operate at full capacity at a low voltage level [8].

Fig. 6 Condition of current during and before Sejingkat Machine tripped (current in A against bus number)

The bar graph in Fig. 7 shows the system condition before and during the fault at the Mambong-Engkelili transmission line. In summary, during the fault the current in the Mambong-Entinggan transmission line increases rapidly because the line is connected to the same bus, and the voltage magnitude drops until it becomes unstable, beyond the acceptable range of 0.9–1.1 pu.

Fig. 7 Condition of current before and during Mambong-Engkelili fault (current in A for the transmission lines/loads Sejingkat-Entinggan, Mambong-Engkelili, Mambong-Entinggan, Sejingkat, and Entinggan)

Figure 8 shows the load flow results for the voltage magnitude when the model was simulated for several cases: normal operating condition, and increments and decrements of the real and reactive power load. A further increase in the real and reactive power of the load causes a severe voltage drop, a higher current draw, and a high reactive power demand on the system. Since the voltage magnitude must stay within the acceptable range of 0.9–1.1 pu, the voltage magnitude at Mambong was considered low.

Fig. 8 Load power consumptions for variation of cases in power demand (voltage magnitude in pu at Sejingkat, Entinggan, Mambong, and Engkelili against variation of cases)

The complex load model was designed with its components estimated as percentages. The simulated response was compared with the measured PMU data, and the estimate was iterated until the error converged to its minimum. Figure 9 shows the load composition for the Entinggan bus. The approximate load composition identified at Entinggan is made up mostly of small motor load (60%), followed by constant MVA load (20%), with large motor and transformer saturation at 10% each.

Fig. 9 Load composition in Entinggan (constant MVA, transformer saturation, large motor, small motor)
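The 0.9–1.1 pu acceptance check used in the discussion above can be written as a short Python sketch. The bus voltages below are hypothetical placeholders and are not the simulated values of this paper; only the Entinggan load composition percentages are taken from the text.

```python
def classify_voltages(bus_voltages_pu, v_min=0.9, v_max=1.1):
    """Label each bus voltage as 'low', 'high', or 'acceptable'."""
    labels = {}
    for bus, v in bus_voltages_pu.items():
        if v < v_min:
            labels[bus] = "low"
        elif v > v_max:
            labels[bus] = "high"
        else:
            labels[bus] = "acceptable"
    return labels


# Hypothetical per-unit voltages, for illustration only.
example_voltages = {"Sejingkat": 0.97, "Engkelili": 0.62,
                    "Entinggan": 0.95, "Mambong": 0.85}
labels = classify_voltages(example_voltages)

# Load composition reported for the Entinggan bus (percent).
entinggan_composition = {"small_motor": 60, "constant_mva": 20,
                         "large_motor": 10, "transformer_saturation": 10}
composition_total = sum(entinggan_composition.values())
```

Checking that the estimated shares add up to 100% is a cheap sanity test to run after each parameter estimation pass.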
6 Conclusion
This paper discussed load profiling methods for dynamic load modeling and proposed a dynamic load model to obtain optimized load model parameters as well as the overall load composition. The least square error method was used to estimate the parameters by comparing the data simulated in the PSS/E software with the PMU data recorded during the system outage. The results were analyzed and presented graphically, with the parameters iterated until the error converged toward zero, at which point the optimized load model was obtained.
Acknowledgments We deeply acknowledge Sarawak Energy (SEB) for providing the system outage data and the data recorded by the PMU. The work was financially supported by the Research Innovation and Enterprise Centre (RIEC), Universiti Malaysia Sarawak (UNiMAS) under the Grant F02/DPD/1639/2018.
References
1. J.V. Milanović, K. Yamashita, S.M. Villanueva, S.Ž. Djokić, L.M. Korunović, International industry practice on power system load modeling. 1–9 (2012)
2. S. Li, Q.V Qro, Dynamic load modeling for industrial facilities using template and PSS/E composite load model structure CLOD. 1–9 (2017)
3. M.K. Hasan, A.F. Ismail, S. Islam, W. Hashim, B. Pandey, Dynamic spectrum allocation scheme for heterogeneous network. Wireless Pers. Commun. 95(2), 299–315 (2017)
4. A.A. Eltahir, R.A. Saeed, A. Mukherjee, M.K. Hasan, Evaluation and analysis of an enhanced hybrid wireless mesh protocol for vehicular ad hoc network. EURASIP J. Wireless Commun. Netw. 2016(1), 1–11 (2016)
5. M.K. Hasan, A.F. Ismail, S. Islam, W. Hashim, M.M. Ahmed, I. Memon, A novel HGBBDSACTI approach for subcarrier allocation in heterogeneous network. Telecommun. Syst. 70(2), 245–262 (2019)
6. E.O. Kontis, A.I. Chrysochos, G.K. Papagiannis, Modeling of nonlinear dynamic power system loads using the vector fitting technique. (2016)
7. Y. Zhu, J.V. Milanovic, Automatic identification of power system load models based on field measurements. IEEE Trans. Power Syst. 8950, 1–1 (2017)
8. S. Nunoo, J.C. Attachie, F.N. Duah, An investigation into the causes and effects of voltage drops on an 11 kV feeder. (2012)
9. M.K. Hasan, S.H. Yousoff, M.M. Ahmed, A.H.A. Hashim, A.F. Ismail, S. Islam, Phase offset analysis of asymmetric communications infrastructure in smart grid. Elektronika ir Elektrotechnika 25(2), 67–71 (2019)
10. S. Industry, S. Power, T. International, PSS/E model library. (2013)
11. M.Q. Khan, M.M. Ahmed, A.M. Haidar, N. Julai, M.K. Hasan, Synchrophasors based wide area protection and phasor estimation: a review. In 2018 IEEE 7th International Conference on Power and Energy (PECon) (IEEE, 2018), pp. 215–220
Enhanced Approach for Android Malware Detection Gulshan Shrivastava and Prabhat Kumar
Abstract Today, Android-based devices are the most prevalent in the market because of their user-friendly features. These same features have also increased the security issues. Through repackaged apps, the risk of losing confidential information is increasing day by day. A repackaged app is a malware app that behaves like the original app. It is infected with malicious code, and the user cannot tell the difference between the original app and the repackaged app. This paper discusses the repackaged app and the activity it performs on devices. An enhanced model, inspired by the web crawler technique, is also proposed to identify malware applications. The model aims to keep the overhead on the device negligible and to provide highly accurate detection of repackaged versions of a known app. A classification technique is also used to ascertain the app.
Keywords Android application · Classification · Malware · Similarity score · VSM
1 Introduction
Android has gained tremendous popularity among users since its launch. According to the technology research and advisory firm Gartner [1], more than 4 billion devices running Google's Android OS had been shipped by 2018, giving it an 80% share of the mobile market. The Google Play Store hosts infected applications alongside genuine ones. In 2019, a warning was issued to Android users after researchers at ESET exposed a year-long campaign in which 42 apps were installed 80 lakh times [2].

G. Shrivastava (B) · P. Kumar
Department of Computer Science & Engineering, National Institute of Technology Patna, Bihar, India
e-mail: [email protected]
P. Kumar e-mail: [email protected]
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_98

Android is the most prominent operating system designed mainly for touch screen mobile devices such as smartphones and tablet PCs. It was developed by Google, based on a modified version of the Linux kernel and other open-source software. One of the main reasons it is so widely used is that it is a capable operating system that supports a range of uses on smartphones. Being open source makes Android even more convenient for everybody to use. Android development supports the full Java programming language. Its first version was released in 2008, and the most recent version is "Android 10", released in September 2019 as the 17th version of the Android OS. Android is architected as a software stack comprising applications, an operating system, a run-time environment, middleware, services, and libraries. This architecture, with each layer of the stack and its related parts, is shown in Fig. 1. Each layer is carefully designed and can be tuned to provide the best application development and execution environment for mobile devices, with useful Java libraries.

Fig. 1 Android architecture and layers [19]

Android comprises an operating system and a set of software components, divided into five sections and organized in four principal layers, as depicted in Fig. 1.
• Linux Kernel: The Linux kernel is the core layer of the whole system, and all the associated drivers are incorporated in it. The kernel's fundamental system functions include process management, memory management, and device management, which covers the camera, keypad, display, etc. It uses limited resources and is customized for the embedded environment. The entire Android OS is based on the Linux kernel, which, with minor changes, becomes the core of Android. It also acts as an interface layer between the hardware and the other software layers.
• Libraries: This layer sits just above the Linux kernel and includes some of the native Android libraries. These libraries make the Android device capable of handling different types of data. They include the open-source browser engine WebKit, SQLite, libc, OpenGL, SSL, etc. WebKit is used to render HTML content. SQLite is the database used to store application data. OpenGL is used mainly for rendering 2D and 3D content on the screen. The SSL library is responsible for internet security.
• Application Framework: This layer provides the services required to run and manage an Android application. These functionalities are exposed as Java classes, and application developers can use these services in the applications they create.
The services included in this framework are:
Activity Manager: manages all aspects of the application lifecycle and the activity stack.
Resource Manager: manages all the resources used by an application and makes it possible to maintain these resources independently of the code. Such non-code resources include strings, color settings, and user interface layouts.
View System: a full set of views used to build application user interfaces.
Content Providers: manage access to structured sets of data, and also handle data encapsulation and security. They are the standard interface for passing data from one running process to another at runtime. This service can also be shared at the application level.
Notification Manager: displays alerts and notifications to users. With this service, applications can notify users of events that happen from time to time.
• Android Application (App): This section holds the critical component called the Dalvik Virtual Machine, a kind of Java virtual machine designed and optimized for Android. It is the software that runs applications on Android devices. The Dalvik VM makes use of features provided by the Linux kernel, such as multithreading, a multitasking execution environment, and memory management. It also lets each application run as its own process with its own VM instance [28].
Android, being open source, is the most widely used operating system. This widespread popularity, however, makes it more vulnerable to security attacks, because Android is free to download and any application can be published on its market without any cost. Any attacker can build a malware application and publish it on the Android market. Such applications, if downloaded and used, can lead to serious security issues. A comprehensive review of the literature on security issues in Android applications and their solutions clearly shows that such attacks can be a menace for users. An apt solution for identifying malware applications is therefore required in order to avoid these security threats, and this paper aims to develop one. The paper is organized as follows: Sect. 2 presents related work, Sect. 3 presents the proposed formulation for the repackaged app together with the result analysis, and Sect. 4 discusses the conclusion and future work.
2 Literature Survey
Mobile applications play a major part in the huge increase in the popularity of Android devices. These applications are becoming increasingly sophisticated, robust, life-engaging, and privacy-intrusive. This massive collection of applications covers most categories, ranging from entertainment, productivity, healthcare, and games to online dating, home security, and business management. iOS users need to root or jailbreak their device in order to download applications from unknown sources. Android users do not, which gives them broad freedom to install pirated, infected, or banned applications from outside Google Play just by changing the system settings. Android gives users strong incentives to install third-party applications but exposes their privacy to significant security risks.
The exponentially growing number of Android applications, unofficial application developers, and the existing security vulnerabilities in the Android OS encourage malware developers to exploit the vulnerable OS and applications and to steal the user's private data, which harms the application markets and the developers' reputation. The ubiquity of Android applications ("apps"), Android's openness regarding the sources of available applications, and the variety of available data are part of the reason the security risk is heightened in Android. In Android, the effort required to develop and distribute privacy-harming applications is low. Android applications in Google Play were found to leak sensitive information (e.g., device identifiers). At present, Android malware applications request a list of permissions whose approval helps the malware to harvest data effectively from infected devices (e.g., SMS messages, phone numbers, and user accounts). In response to the growing number of malware entries in its marketplace, Google introduced Bouncer in February 2012, which performs malware analysis on the applications in the marketplace. Furthermore, the latest versions of Android (i.e., v.4.2 and onwards) include a thin client that analyzes all applications on the device, including those installed from alternative sources. Nevertheless, a recent evaluation shows the ineffectiveness of this mechanism (15% detection rate). A large amount of research on Android security is being carried out, including vulnerability assessments and the detection and analysis of malware. An overview of current malware trends is presented by Faruki et al. [3]. Malware analysis can be performed with static, dynamic, and hybrid analysis methods. Static malware analysis extracts the properties of applications and analyzes many static features without running the code.
Dynamic analysis is performed by generating runtime profiles of applications and by observing and recording memory and CPU consumption, battery usage, and network traffic statistics. A static method cannot be replaced by a dynamic method, and vice versa. The app market can use the static technique directly, and Android users can use the dynamic technique directly. Sharma et al. [4] proposed an approach that detects a malware application while it is being downloaded to the device. A benign and malware dataset (M0Droid) is used to analyze the permissions of applications, and the approach identifies the list of malware permissions. Supervised learning is used in this analysis, and the performance analysis is done with a confusion matrix. Hyun Jae et al. [5] designed a fast malware detector that uses creator information. It detects malware permissions and behavior. The analysis is based on the static analysis technique, and the certificate serial number is used as a feature to detect malware. It is a lightweight process that analyzes only a few features of apps, namely malware behavior and permissions, and it provides high detection accuracy. In 2017, Shen et al. [6] proposed Petridish, which exploits the significant security semantics of Dalvik bytecode to produce the detection model. Automatic mining and progressive distillation to update the detection model make it more efficient. DroidMOSS [7] uses a fuzzy hashing method to generate a fingerprint for identifying app repackaging. Each piece that is hashed to form the fingerprint is a subset of the complete opcode sequence. Nearly 2300 apps were scanned using this technique, and 5 to 13% of the apps were found to be repackaged. Embedding watermarks in the application is one way to check its originality. If the app has no watermark, or the watermark is not trustworthy, the app can be declared repackaged. This concept, named AppInk, was proposed by Zhou et al. [8]. Chen et al. [9] proposed the concept of the centroid, which addressed the lack of an accurate and scalable technique for detecting repackaged applications. Zhang et al. [10] proposed ViewDroid, a user interface based approach to detecting repackaged mobile applications. Android applications are event dominated and user interaction intensive, and the interactions between users and applications are implemented through the user interface, or views. This observation inspires the design of a new birthmark for Android applications, the feature view graph, which captures the user's navigation behavior across app views (Table 1).

Table 1 Historical highlights of repackaged app analysis
Year | Description
2011 | An analysis of the AnserverBot Trojan [20]
2012 | Aurasium [21], a robust and practical technology that defends users of the widely used Android platform from malicious and untrusted applications
2014 | VetDroid [22], a dynamic analysis platform for generally analysing complex behaviours in Android apps from a new permission use perspective
2015 | AndroidSOO [23], a lightweight method for the recognition of repackaging indications on Android apps
2016 | Detecting repackaged apps by permissions in the app [24]
2017 | A highly efficient method for detecting Android malware using the machine learning method RF [25]
2018 | Machine learning based significant permission identification for Android malware detection [26]
2019 | SensDroid [27], intent and permission based analysis

However, when explicitly asked, a minority of users in the study by Felt et al. [11] reported that they had cancelled the installation of an application because of its permission requests. Users also mistakenly believed that applications undergo security analysis during their submission to the Android Market. The study further found confusion about application testing in application markets, and most users were unaware of the existence of an application testing mechanism. DREBIN [12] used the manifest file and the disassembled code of applications to check the API calls, permissions, app components, hardware resources, filtered intents, and network addresses. Although these experiments used the most extensive dataset, with 129013 applications, only around 4.5% of them were malware samples.
This experiment achieved 94% of accuracy in malware detection. Extensive processing was required by Derbin for execution and extraction of
a considerable number of features from the application code and manifest file. It is considered a less efficient method, as it takes more time to analyze an application. DroidMat [13] extracts several features from the smali files of the disassembled code and from the manifest file. These features include permissions, intent messages, component deployment, and API calls. In this experiment, the K-means algorithm was applied for clustering, together with the Singular Value Decomposition (SVD) method for low-rank approximation. To classify apps as malware or benign, the minimized clusters are processed with a k-nearest neighbor (KNN) algorithm. An accuracy of 97.6% is reached with a zero false-positive rate. In total, 1738 applications are analyzed, consisting of only 238 malware samples and 1500 benign samples. Malware thus makes up only about 13.7% of the dataset, which is not sufficient for identifying malware usage patterns. The method also needs to be checked in execution, as the accuracy is lower when the processing time is higher. In 2013, La Polla et al. [14] reported a survey on the security of mobile devices. In this work, the authors first examined the current state of mobile malware, along with some notable examples. They additionally outlined likely future threats and made a few predictions for the near future. They also categorized known attacks against smartphones, particularly at the application level, concentrating on how each attack is carried out and what the goal of the attacker is. Finally, the paper assessed current security solutions for smartphones by concentrating on existing mechanisms.
3 Proposed Approach for Malware Detection and Result Analysis

An Android device may be infected in various ways, such as visiting a malicious website, downloading a malware application, spam, malicious SMS and other messages, and malware-infected advertisements. We have categorized malware according to its behavior. Here, a framework is proposed that identifies malware applications, as depicted in Fig. 2. The following steps are used in the framework to identify malware apps.

A. Crawler. The crawler is a program that searches web pages on the internet and automatically provides web pages according to user requirements. It serves several purposes, including downloading websites in bulk. It is the main component and uses the search engine and an index file. It accesses web pages, obtains permission to explore them, and gathers queries as per the user's requirements. It periodically collects and archives large datasets of web pages, and it is used for web data mining to analyze the statistical properties of web pages. The crawler not only manages the web repository centrally but also manages hundreds of web content providers.
Fig. 2 Proposed malware apps detection technique
B. Parser. The parser extracts application strings such as API permissions, commands, certificate information, and intents. This general catalogue information is sufficient for the detector and classifier components.

C. Profile Mold. This component analyzes the attacker profile. It is categorized into the following two parts [15]:

(1) Inductive Profiling: This analyzes the malware dataset statically according to malware behavior and calculates the correlation between malware apps and attacker characteristics. The following four stages are used:
(a) Data integration: Malware-centric and creator-centric information is collected from multiple resources, such as the web-based crawler.
(b) Malware Behavior Definition: The attacker steals confidential information without the user's knowledge. The mobile device can be infected in various ways, such as downloading repackaged apps, and is exposed to financial threats.
(c) Malware Attack State: This stage categorizes and summarizes the different malware characteristics. Malware behavior is analyzed while the malware application executes, and the relation between the type of malware application and the data lost by the victim is calculated.
(d) Profile Creation: Malware behavior is analyzed according to the above three stages. A profile is then created according to the malware's behavior and pattern and is then executed on the device.

Malware behaviors are estimated from the adoption of malicious API sequences, permission distributions, the serial number of the certificate, the usage of system commands, etc.
(a) Serial Number of Certificate: The distributed app is signed with the creator's private key, and the standard certificate carries the corresponding public key. According to RFC 2459 and X.509, the standard requires the certificate to state the creator's name, location, and organization. The certificate has a unique serial number; if the creator fills in false information, this is identified and the process cannot continue.
If a malware creator uses a particular serial number many times, we add it to a blacklist built from the certificates previously stored in the malware dataset.
(b) Intent: An Android application is not specified as a single program. It is made up of Android components (broadcast receiver, activity, service, and content provider) and runs efficiently compared with other operating systems. These four components are used to deliver messages and to establish relationships with other components. An intent delivers a message asking a particular application to perform an instruction [16].
(c) API: APIs provide control over the Android platform. All APIs are extracted from the source code of the application. By analyzing the APIs of malware samples, a malware application can be identified, because the APIs used reveal the behavior of the malware.
(d) System command: Malware uses commands such as "reboot", "mkdir", "getprop", "ln", "ps", "chmod", "insmod", "su", "mount", "sh", and "killall". These commands run as root on the Android device; if we extract the strings in an application and obtain any of the above commands, this indicates that it is a malware application.
(e) Permission: An Android application needs a few permissions from the user before its installation. The application requests these permissions, and, because of unawareness, the user grants them. The malware application then has access to confidential data with the user's acceptance [17].

(2) Deductive Profiling: This is based on the top-down approach and performs analysis according to deductive logic.

D. Detection Engine. The detection rules cover the creation of the serial number list, the rule for checking leaks of sensitive information, the rule for checking the usage of commands for leveraging forged files, and the likelihood ratio of the requested and API-related critical permissions. Here, an algorithm is proposed that compares the serial number of each application with the blacklist from the malware dataset; if it matches, the application is identified as malware. The algorithm then checks the usage of commands for leveraging forged files; this check runs on a rooted device and identifies malware code. The last step calculates the likelihood ratio under the given critical permissions.

(1) Vector Space Model: Here, a unique opcode approach is used that creates the distribution list, which records document and term frequencies. The score counts are calculated by the following equations.
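$$\text{BGN\_Dcount} = \frac{\text{bgn\_DFrq}}{\text{total no. of bgn doc}}$$

$$\text{MAL\_Dcount} = \frac{\text{mal\_DFrq}}{\text{total no. of mal doc}}$$

$$\text{BGN\_Tcount} = \frac{\text{bgn\_termF}}{\text{latest bgn\_termcount}}$$

$$\text{MAL\_Tcount} = \frac{\text{mal\_termF}}{\text{latest mal\_termcount}}$$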
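One plausible reading of the four score counts above is that the D counts are per-class document frequencies and the T counts are per-class term frequencies of each opcode. The sketch below computes both for a toy dataset under that reading. The function name `opcode_scores`, the normalization by the total number of opcodes in the class, and the sample opcode lists are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def opcode_scores(docs):
    """Return (d_count, t_count) for one class.

    docs: list of opcode sequences, one list of strings per app.
    d_count[op]: fraction of apps in the class that contain op (assumed Dcount).
    t_count[op]: occurrences of op divided by all opcode occurrences
                 in the class (assumed Tcount).
    """
    doc_freq = Counter()   # number of apps containing each opcode
    term_freq = Counter()  # total occurrences of each opcode
    for ops in docs:
        term_freq.update(ops)
        doc_freq.update(set(ops))  # count each opcode once per app
    n_docs = len(docs)
    total_terms = sum(term_freq.values())
    d_count = {op: doc_freq[op] / n_docs for op in doc_freq}
    t_count = {op: term_freq[op] / total_terms for op in term_freq}
    return d_count, t_count


# Hypothetical toy data: two benign and two malware opcode sequences.
benign = [["move", "invoke-virtual", "return-void"],
          ["move", "move", "const/4"]]
malware = [["invoke-virtual", "invoke-static", "goto"],
           ["invoke-static", "goto", "goto"]]

bgn_d, bgn_t = opcode_scores(benign)
mal_d, mal_t = opcode_scores(malware)
print(bgn_d["move"], bgn_t["move"])  # document and term score of "move" in benign apps
```

Opcodes whose scores differ strongly between the two classes are the natural candidates to keep as features; how the paper combines the four counts into one score is not stated, so that step is left out here.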
The vector space model is represented by the following equations:

$$D_i = (w_{1i}, w_{2i}, \ldots, w_{ni}), \qquad Q_j = (w_{1j}, w_{2j}, \ldots, w_{sj})$$

Here $D_i$ represents the row and $Q_j$ represents the column. The Hartley test is a good approach to reduce the opcode space: it checks the homogeneity of variance of each opcode in the target classes and calculates the $F_{max}$ value (the ratio of the largest to the smallest group variance).

E. Malware App Classification. The dataset is classified into different groups to identify the particular malware according to the following approaches:
(1) Similarity Score: The similarity score measures the similarity between two malware applications. It is computed from the analysis of critical permissions, suspicious API strings, and malicious commands by the following equation:
$$S = \sum_i w_i \, BFS_i, \qquad \sum_i w_i = 1$$

where $w_i$ is the weight assigned to the similarity value $BFS_i$. The Jaccard coefficient is applied to calculate the similarity of malicious commands. The similarities of requested permissions and of API-related permissions are averaged, and this average is used as the similarity of critical permissions.

(2) Classification: Classification groups similar malware. The signature of a malicious application is compared with the signature of each existing group. If the signatures are the same, the application is included in that malware group. Each sample is handled as follows:
• Compute the similarity score between the signature of the sample and that of each existing group, giving one score per group.
• Choose the highest similarity score, max(SS).
• Compare the selected value with the similarity threshold TS. If max(SS) > TS, the sample is included in the corresponding group. If max(SS) < TS, the sample forms a new group, and its signature represents the signature of that group.

The classification is based on artificial intelligence and statistical analysis. A Bayesian network, which is a graphical model, is used for classification. It represents a set of random variables and the information flow in the decision table. Graphical representations are used to reason about an uncertain domain. Each node in the graph represents a random variable, while the edges represent probabilistic dependencies among the corresponding random variables. Statistical and computational methods are used to estimate the graph. The Bayesian network combines theoretical principles with statistics. Naive Bayes factorizes the class-conditional density, which is then used to form the joint distribution. Graphical models are different ways to represent conditional independence (CI) mathematically. To identify the joint probability distribution, the following equations are used:

i. $X_j \perp X_k \mid Y$
ii. $p(x \mid y) = \prod_{j=1}^{n} p(x_j \mid y)$
iii. $p(x, y) = p(y) \prod_{j=1}^{n} p(x_j \mid y)$

F. Malware App List. This phase produces the result, i.e., the list of malware applications, and evaluates performance using the confusion matrix. We
used the MoDroid dataset [18] to analyze the performance of the proposed method, as represented in Table 2. Benign samples that were classified as malware are called false positives, while malware samples that were classified correctly are called true negatives, as given in Table 3.

Sensitivity: The sensitivity of a test, in our case, is its ability to classify the benign cases correctly.

$$\text{Sensitivity} = \frac{t\_pos}{pos} = \frac{179}{200} = 89.5\%$$

Specificity: The specificity of a test, in our case, is its ability to classify the malware cases correctly.

$$\text{Specificity} = \frac{t\_neg}{neg} = \frac{182}{200} = 91\%$$

Accuracy: The accuracy of a test is its competence to segregate the benign and malware samples correctly.

$$\text{Accuracy} = \text{sensitivity} \cdot \frac{pos}{pos + neg} + \text{specificity} \cdot \frac{neg}{pos + neg} = 90.25\%$$
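The three measures can be checked with a few lines of Python. The sketch below uses the counts reported for the 400-app evaluation (179 of 200 benign apps and 182 of 200 malware apps classified correctly); the function and argument names are illustrative only, and benign is treated as the positive class, as in the paper.

```python
def evaluate(benign_correct, benign_as_malware, malware_correct, malware_as_benign):
    """Return (sensitivity, specificity, accuracy) with benign as the positive class."""
    pos = benign_correct + benign_as_malware    # all actual benign apps
    neg = malware_correct + malware_as_benign   # all actual malware apps
    sensitivity = benign_correct / pos
    specificity = malware_correct / neg
    # Accuracy as the class-size-weighted mean of sensitivity and specificity.
    accuracy = (sensitivity * pos / (pos + neg)
                + specificity * neg / (pos + neg))
    return sensitivity, specificity, accuracy


sens, spec, acc = evaluate(179, 21, 182, 18)
print(f"{sens:.2%} {spec:.2%} {acc:.2%}")  # prints 89.50% 91.00% 90.25%
```

Because both classes contain 200 apps, the weighted form reduces to the plain average of sensitivity and specificity, which is also equal to (179 + 182) / 400.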
Table 2 Dataset classification

Dataset type    Number of apps
Malware         200
Benign          200

Table 3 Performance matrix

Classes               Predicted: benign app   Predicted: malware app   Total   Recognition (%)
Actual: benign app    t_pos                   f_pos                    200     89.50
Actual: malware app   f_neg                   t_neg                    200     91.00
Total                 197                     203                      400     90.25

Benign: positive for apps. Malware: negative for apps.
True Positive (TP) = 179, False Positive (FP) = 21, True Negative (TN) = 182, False Negative (FN) = 18
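The similarity score and the grouping rule of the classification step in Section 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights, the threshold value 0.6, the use of the Jaccard coefficient for every feature (the paper names it only for malicious commands), the handling of a score exactly equal to the threshold, and all identifiers are assumptions.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty feature sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical feature weights w_i; they sum to 1 as the score requires.
WEIGHTS = {"permissions": 0.4, "apis": 0.3, "commands": 0.3}


def similarity(sig_a, sig_b):
    """Weighted sum S = sum_i w_i * BFS_i over the per-feature similarities."""
    return sum(w * jaccard(sig_a[k], sig_b[k]) for k, w in WEIGHTS.items())


def assign(sample, groups, threshold=0.6):
    """Add sample to the most similar group, or open a new group.

    groups: list of dicts {"signature": sig, "members": [sig, ...]}.
    Returns the index of the group the sample ends up in.
    """
    scores = [similarity(sample, g["signature"]) for g in groups]
    if scores and max(scores) > threshold:       # max(SS) > TS
        best = scores.index(max(scores))
        groups[best]["members"].append(sample)
        return best
    # Otherwise the sample starts a new group and becomes its signature
    # (assumption: a score equal to the threshold also starts a new group).
    groups.append({"signature": sample, "members": [sample]})
    return len(groups) - 1


# Hypothetical signatures of three samples.
a = {"permissions": {"SEND_SMS", "READ_CONTACTS"},
     "apis": {"sendTextMessage"}, "commands": {"su", "chmod"}}
b = {"permissions": {"SEND_SMS", "READ_CONTACTS"},
     "apis": {"sendTextMessage"}, "commands": {"su"}}
c = {"permissions": {"INTERNET"},
     "apis": {"openConnection"}, "commands": set()}

groups = []
result = (assign(a, groups), assign(b, groups), assign(c, groups))
print(result)  # prints (0, 0, 1)
```

In this toy run the second sample scores 0.85 against the first group and joins it, while the third sample shares no feature with that group and therefore opens a new one.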
4 Conclusion

In this paper, a framework is proposed to identify Android malware applications based on feature analysis. The framework is designed to address security breaches and provides security by identifying malware applications while they are being downloaded to the device. This paper analyzed Android application permissions through crawler and classification techniques and obtained 90.25% accuracy. In the future, we will extend this framework with a detection approach that covers more features with highly accurate results.
References

1. C. Hopping, Gartner predicts 2% growth in PCs, tablets and smartphones [Internet]. PC Tech Author. (2018). https://www.pcauthority.com.au/news/gartner-predicts-2-growth-inpcs-tablets-and-smartphones-475656
2. Z. Whittaker, Millions downloaded dozens of Android apps from Google Play that were infected with adware. https://techcrunch.com/2019/10/24/millions-dozens-android-apps-adware/
3. P. Faruki, A. Bharmal, V. Laxmi, V. Ganmoor, M.S. Gaur, M. Conti, M. Rajarajan, Android security: a survey of issues, malware penetration, and defenses. IEEE Commun. Surv. Tutor. 17(2), 998–1022 (2014)
4. K. Sharma, B.B. Gupta, Towards privacy risk analysis in Android applications using machine learning approaches. Int. J. E-Serv. Mob. Appl. (IJESMA) 11(2), 1–21 (2019)
5. A.P. Felt, M. Finifter, E. Chin, S. Hanna, D. Wagner, A survey of mobile malware in the wild. In Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices (2011), pp. 3–14
6. Z.X. Shen, C.W. Hsu, S.W. Shieh, Security semantics modeling with progressive distillation. IEEE Trans. Mob. Comput. 16(11), 3196–3208 (2017)
7. Y.D. Lin, Y.C. Lai, C.H. Chen, H.C. Tsai, Identifying android malicious repackaged applications by thread-grained system call sequences. Comput. Secur. 39, 340–350 (2013)
8. W. Zhou, X. Zhang, X. Jiang, AppInk: watermarking android apps for repackaging deterrence. In Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security (2013), pp. 1–12
9. K. Chen, P. Liu, Y. Zhang, Achieving accuracy and scalability simultaneously in detecting application clones on android markets. In Proceedings of the 36th International Conference on Software Engineering (2014), pp. 175–186
10. F. Zhang, H. Huang, S. Zhu, D. Wu, P. Liu, ViewDroid: Towards obfuscation-resilient mobile application repackaging detection. In Proceedings of the 2014 ACM Conference on Security and Privacy in Wireless & Mobile Networks (2014), pp. 25–36
11. A.P. Felt, E. Ha, S. Egelman, A. Haney, E. Chin, D. Wagner, Android permissions: User attention, comprehension, and behavior. In Proceedings of the eighth symposium on usable privacy and security, pp. 1–14
12. D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, Drebin: Effective and explainable detection of android malware in your pocket. NDSS 14, 23–26 (2014)
13. D.J. Wu, C.H. Mao, T.E. Wei, H.M. Lee, K.P. Wu, Droidmat: Android malware detection through manifest and api calls tracing. In 2012 Seventh Asia Joint Conference on Information Security (IEEE, 2012), pp. 62–69
14. M. La Polla, F. Martinelli, D. Sgandurra, A survey on security for mobile devices. IEEE Commun. Surv. Tutor. 15(1), 446–471 (2012)
15. U. Bayer, P.M. Comparetti, C. Hlauschek, C. Kruegel, E. Kirda, Scalable, behavior-based malware clustering. NDSS 9, 8–11 (2009)
16. G. Shrivastava, P. Kumar, Intent and permission modeling for privacy leakage detection in android. Energy Syst. 1–14 (2019). https://doi.org/10.1007/s12667-019-00359-7
17. G. Shrivastava, P. Kumar, Privacy analysis of android applications: state-of-art and literary assessment. Scal. Comput. Pract. Exp. 18(3), 243–252 (2017)
18. M. Damshenas, A. Dehghantanha, K.K.R. Choo, R. Mahmud, Modroid: An android behavioralbased malware detection model. J. Inf. Privacy Secur. 11(3), 141–157 (2015)
19. S.R. Srivastava, S. Dube, G. Shrivastava, K. Sharma, Smartphone triggered security challenges—Issues, case studies and prevention. Cyber Security in Parallel and Distributed Computing: Concepts, Techniques, Applications and Case Studies (2019), pp. 187–206
20. Y. Zhou, X. Jiang, An analysis of the anserverbot trojan. Tech. Rep. 9 (2011)
21. R. Xu, H. Saïdi, R. Anderson, Aurasium: Practical policy enforcement for android applications. In Presented as part of the 21st USENIX Security Symposium (USENIX Security 12) (2012), pp. 539–552
22. Y. Zhang, M. Yang, B. Xu, Z. Yang, G. Gu, P. Ning, B. Zang, Vetting undesirable behaviors in android apps with permission use analysis. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security (2013), pp. 611–622
23. H. Gonzalez, A.A. Kadir, N. Stakhanova, A.J. Alzahrani, A.A. Ghorbani, Exploring reverse engineering symptoms in Android apps. In Proceedings of the Eighth European Workshop on System Security (2015), pp. 1–7
24. P. Teufl, M. Ferk, A. Fitzek, D. Hein, S. Kraxberger, C. Orthacker, Malware detection by applying knowledge discovery processes to application metadata on the Android Market (Google Play). Secur. Commun. Netw. 9(5), 389–419 (2016)
25. S. Kumar, A. Viinikainen, T. Hamalainen, Evaluation of ensemble machine learning methods in mobile threat detection. In 2017 12th International Conference for Internet Technology and Secured Transactions (ICITST) (IEEE, 2017), pp. 261–268
26. K. Sharma, B.B. Gupta, Mitigation and risk factor analysis of android applications. Comput. Electr. Eng. 71, 416–430 (2018)
27. G. Shrivastava, P. Kumar, SensDroid: analysis for malicious activity risk of Android application. Multimedia Tools Appl. 78(24), 35713–35731 (2019)
28. G. Shrivastava, P. Kumar, D. Gupta, J.J. Rodrigues, Privacy issues of android application permissions: A literature review. Trans. Emerg. Telecommun. Technol. e3773 (2019). https://doi.org/10.1002/ett.3773
Author Index
A Abdul Hamid, Md., 771 Agarwal, Abhilakshya, 737 Agarwal, Akshara, 267 Aggarwal, Deepti, 985 Aggarwal, Lakshay, 643 Agrawal, Rashmi, 611 Agrawal, Sunil, 279 Ahlawat, Anil, 1115 Ahlawat, Savita, 69, 497, 527 Ahmed, Musse Mohamud, 1139, 1151 Ahmed Shaj, Shakil, 539 Amir Anton Jone, A., 551 Amzad Hossain, Md., 799 Anand, Abhishek, 321 Anand, Harshit, 321 Anisham, Nidhin, 517 Anooradha, M. K., 551 Ariful Islam Malik, Md., 799 Arjun Babu, C. S., 517 Arora, Kriti, 825 Arora, Monika, 427 Awasthi, Harshit, 643 Awasthi, Lalit Kumar, 947 Awatramani, Vasudev, 963
B Bajaj, Prachi, 427 Balasubramani, R., 585 Bansal, Mayank, 1115 Bansal, Neha, 345 Bansal, Poonam, 289, 509 Batra, Monarch, 867 Beniwal, Rajender Kumar, 921
Beril Lynora, T., 551 Bhalla, Raunaq, 867 Bhandari, Aniruddha, 527 Bhardwaj, Sharat Chandra, 83 Bhardwaj, Shivam, 1043 Bhushan, Bharat, 377 Bindra, Jatin, 69
C Chakraborty, Mainak, 331 Chamoli, Vivek, 353 Chaudhary, Bhawna, 1 Chaudhary, Poonam, 611 Chaudhary, Utkarsh, 309 Chauhan, Ekansh, 1093 Chhabra, Megha, 189 Chopra, Deepti, 237 Choudhary, Amit, 527 Chugh, Anirudh, 887
D Das, Indrashis, 321 Deepak, Akshay, 1053 Deep, Paluck, 267 Dhall, Ankur, 527 Dhurandher, S. K., 749 Dixit, Sunanda, 117 Dubey, Vinay, 299 Dutta, Kunal Bikram, 57 Dwivedi, Rinky, 667
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2
F Fahad, Md Shah, 1053 Faridi, Adeela, 131 Farzana, Fahmida, 539
G Gadade, Shivakumar, 517 Ganesan, Vithya, 787 Garg, Ritu, 487 Garg, Urvashi, 947 Ghosh, Promila, 539 Godha, Preshi, 215 Gopal, Greeshma N., 659 Gope, Sadhan, 437 Govind, N., 11 Grover, Khyati, 179 Gupta, Deepak, 963, 1043, 1093 Gupta, Harsh, 527 Gupta, Ishu, 215 Gupta, Kushagra, 835 Gupta, Mukta, 365 Gupta, Nishu, 973 Gupta, Sarthak, 267
H Hasan, Mohammad Kamrul, 1139 Hasanl, Mohammad Kamrul, 1151 Hima Bindu, V., 1013 Htwe, Thet Thet, 169 Hussain, Arooj, 103
I Iqbal, Anam, 463 Islam, Tawqeer Ul, 947
J Jadon, Swati, 215 Jailia, Manisha, 845 Jain, Diksha, 365 Jain, Monika, 907 Jain, Parag, 517 Jain, Shreya, 631 Janani, R., 453 Janapala, Doondi Kumar, 47 Janghel, Rekh Ram, 255 Jassal, B. S., 83 Jha, Ashish Kumar, 845 Jindal, Kshitij, 179 Jindal, Lokesh, 365 Johari, Rahul, 835
Jones Mary Pushpa, Anita, 551
K Kagalagomb, Chetan G., 117 Kandhoul, Nisha, 749 Kapoor, Divneet Singh, 203 Karim, Fyruz Ibnat, 771 Katarya, Rahul, 299 Kaur, Arvinder, 237 Kaur, Ishadeep, 35 Kaur, Sanmukh, 855 Kaushik, Ila, 19 Kedia, Rewant, 309 Kham, Nang Saing Moon, 169 Khandelwal, Harshit, 1063 Khan, Mohammad Muzammil, 877 Khanna, Ashish, 1043, 1093 Kherwa, Pooja, 289, 497 Khoriya, Namrata, 395 Khullar, Kashish, 1063 Kiran, Hotha Uday, 145 Kopal, Robert, 937 Kovoor, Binsu C., 659 Kumar, Adarsh, 571, 595 Kumar, Ashish, 247 Kumar, Budati Anil, 1013 Kumar, Dharmender, 725 Kumari, Ankita, 681 Kumar, Manish, 527 Kumar, Prabhat, 1083, 1165 Kumar, Rahul, 405 Kumar, Sanjay, 179, 309, 415 Kumar, Shubham, 247 Kumar, Sudesh, 255
L Lal, Rajendra Prasad, 11 Lubana, Anurupa, 855
M Maheshwari, Aditya, 643 Makkar, Sandhya, 365, 631 Malhotra, Yugnanda, 855 Malik, Shaily, 509 Manaswini, R., 973 Manchanda, Chinkit, 19 Maneesha, 93 Mann, P. S., 35 Mathur, Sonakshi, 497 Mehrotra, Saumye, 1063 Mehta, Priyanshu, 427
Mehta, Purnima Lala, 999 Mini, U., 659 Mishra, Astha, 473 Mittal, Antriksh, 887 Mittal, Divyanshu, 427 Mohandoss, Sahana, 517 Mohan, Gunjan, 497 Mohan, Rajasekar, 517 Mohanty, Sabyasachi, 473 Mohapatra, Amar, 1031, 1125 Mohdiwale, Samrudhi, 395, 405 Moon, Nazmun Nessa, 757 Mridha, M. F., 771, 799 Mrsic, Leo, 937 Musa, Sherfriz Sherry, 1139
N Naaz, Sameena, 103 Nag, Bhavya, 415 Nagrath, Preeti, 643 Naing, Thinn Thu, 895 Naman, Pranjal, 867 Neethu Susan, V., 551 Nesasudha, Moses, 47
P Pandey, Manjusha, 57, 321 Pandey, Praveen Kant, 93 Pant, Shrid, 667 Patle, Anshi, 215 Pradhan, Nicky, 437 Pradhan, Rahul, 737 Prakash, Rishi, 353 Pramanick, Alik, 331 Purwar, Archana, 267 Pushkar, Pranav, 643
R Raghav, Yash, 415 Raihan, M., 539 Raina, Anshuman, 1063 RajaRajeswari, PothuRaju, 787 Raj, Rohit, 1083 Rajesh, Bulla, 69 Rautaray, Siddharth S., 57, 321 Ravulakollu, Kiran Kumar, 189 Ray, Amitabh, 799 Ray, Ananya, 353 Rony, Akramkhan, 771 Roy, Priyanka, 709 Rustagi, Apeksha, 19
S Saad, Mohammad, 643 Sadeghi, Ali, 225 Sagar, Kalpna, 1115 Sahu, Aman, 57 Sahu, Mridu, 395, 405 Sahu, Satya Prakash, 255 Saifuzzaman, Mohd, 757 Saikrishna, B., 973 Saini, Manish Kumar, 921 Saiyeda, Anam, 877 Sani, Shaon Hossain, 771 Sarin, Sumit, 887 Sarkar, Saikat Kumar, 999 Saxena, Ankur, 473 Saxena, Rahul, 907 Senthilkumar, K. P., 561 Sethi, Arushi, 631 Sethi, Dhaarna, 825 Seyed Dizaji, Seyedeh Haleh, 225 Sharma, Arun, 345 Sharma, Bharat, 57 Sharma, Bharti, 215 Sharma, Deepak Kumar, 571, 667 Sharma, Moolchand, 1063 Sharma, Nikhil, 19, 377 Sharma, Shradha, 695 Sharma, Tripti, 1031, 1125 Sheony, Udaya Kumar K., 623 Shetty, Mangala, 585 Shetty, Spoorthi P., 623 Shetu, Syeda Farjana, 757 Shrivastava, Gulshan, 1165 Shukla, A. K., 83 Shukla, Manoj Kumar, 189 Shukla, Satyabrat, 999 Shukla, Swati, 247 Siddiqui, Farheen, 131, 463 Sin, Ei Shwe, 895 Singhal, Tushar, 309 Singh, Ashutosh Kumar, 215 Singh, Gautam, 999 Singh, Gurpartap, 279 Singh, Jagsir, 155 Singh, Jaswinder, 155 Singh, Karan, 1 Singh, Kiran Jot, 203 Singh, Kshetrimayum Millaner, 437 Singhla, Lakshay, 179 Singh, M. P., 1083 Singh, R. K., 345, 487 Singh, Shikha, 395 Singh, Shubham, 985
Sinha, Akash, 1083 Sirswal, Manpreet, 1093 Sivakumar, P., 561 Sobti, Rishabh, 497 Sohi, Balwinder Singh, 203, 279 Sood, Manu, 681, 695, 709, 725 Srikanth, P., 595 Srinivas, A., 517 Srivastava, Smriti, 867, 887 Sultana, Sharmin, 757 Sunita Dhavale, Sanjay, 331 Susan, Seba, 825 Suvo, Istiaque Ahmed, 799 Swetha, N., 1013
T Tanvir Islam, Md., 539 Tiwari, Sharad Kumar, 145 Tomar, Geetam, 1031, 1125
U Upadhyay, Yogita, 395
V Vaish, Saurabh, 267 Vashisht, Geetika, 845 Vats, Satyarth, 867 Verma, Anjali, 395 Vidyarthi, Anurag, 83, 353
W Waghere, Sandhya Sandeep, 787
Y Yogameena, B., 453 Yousuf, Ridwanullah, 757 Yusoff, Noor Shamilawani Farhana, 1151
Z Zajec, Srečko, 937 Zhang, Renrui, 815 Zolfy Lighvan, Mina, 225