International Conference on Innovative Computing and Communications: Proceedings of ICICC 2020, Volume 2 [1st ed.] 9789811551475, 9789811551482

This book includes high-quality research papers presented at the Third International Conference on Innovative Computing and Communications (ICICC 2020).


English | Pages: XXIX, 1182 [1158] | Year: 2021


Table of contents:
Front Matter ....Pages i-xxix
A Dummy Location Generation Model for Location Privacy in Vehicular Ad hoc Networks (Bhawna Chaudhary, Karan Singh)....Pages 1-10
Evaluating User Influence in Social Networks Using k-core (N. Govind, Rajendra Prasad Lal)....Pages 11-18
Depression Anatomy Using Combinational Deep Neural Network (Apeksha Rustagi, Chinkit Manchanda, Nikhil Sharma, Ila Kaushik)....Pages 19-33
A Hybrid Cost-Effective Genetic and Firefly Algorithm for Workflow Scheduling in Cloud (Ishadeep Kaur, P. S. Mann)....Pages 35-45
Flexible Dielectric Resonator Antenna Using Polydimethylsiloxane Substrate as Dielectric Resonator for Breast Cancer Diagnostics (Doondi Kumar Janapala, Moses Nesasudha)....Pages 47-55
Machine Learning-Based Prototype for Restaurant Rating Prediction and Cuisine Selection (Kunal Bikram Dutta, Aman Sahu, Bharat Sharma, Siddharth S. Rautaray, Manjusha Pandey)....Pages 57-68
Deeper into Image Classification (Jatin Bindra, Bulla Rajesh, Savita Ahlawat)....Pages 69-81
Investigation of Ionospheric Total Electron Content (TEC) During Summer Months for Ionosphere Modeling in Indian Region Using Dual-Frequency NavIC System (Sharat Chandra Bhardwaj, Anurag Vidyarthi, B. S. Jassal, A. K. Shukla)....Pages 83-91
An Improved Terrain Profiling System with High-Precision Range Measurement Method for Underwater Surveyor Robot (Maneesha, Praveen Kant Pandey)....Pages 93-102
Prediction of Diabetes Mellitus: Comparative Study of Various Machine Learning Models (Arooj Hussain, Sameena Naaz)....Pages 103-115
Tracking of Soccer Players Using Optical Flow (Chetan G. Kagalagomb, Sunanda Dixit)....Pages 117-129
Selection of Probabilistic Data Structures for SPV Wallet Filtering (Adeela Faridi, Farheen Siddiqui)....Pages 131-143
Hybrid BF-PSO Algorithm for Automatic Voltage Regulator System (Hotha Uday Kiran, Sharad Kumar Tiwari)....Pages 145-153
Malware Classification Using Multi-layer Perceptron Model (Jagsir Singh, Jaswinder Singh)....Pages 155-168
Protocol Random Forest Model to Enhance the Effectiveness of Intrusion Detection Identification (Thet Thet Htwe, Nang Saing Moon Kham)....Pages 169-178
Detecting User’s Spreading Influence Using Community Structure and Label Propagation (Sanjay Kumar, Khyati Grover, Lakshay Singhla, Kshitij Jindal)....Pages 179-187
Bagging- and Boosting-Based Latent Fingerprint Image Classification and Segmentation (Megha Chhabra, Manoj Kumar Shukla, Kiran Kumar Ravulakollu)....Pages 189-201
Selecting Social Robot by Understanding Human–Robot Interaction (Kiran Jot Singh, Divneet Singh Kapoor, Balwinder Singh Sohi)....Pages 203-213
Flooding and Forwarding Based on Efficient Routing Protocol (Preshi Godha, Swati Jadon, Anshi Patle, Ishu Gupta, Bharti Sharma, Ashutosh Kumar Singh)....Pages 215-223
Hardware-Based Parallelism Scheme for Image Steganography Speed up (Seyedeh Haleh Seyed Dizaji, Mina Zolfy Lighvan, Ali Sadeghi)....Pages 225-236
Predicting Group Size for Software Issues in an Open-Source Software Development Environment (Deepti Chopra, Arvinder Kaur)....Pages 237-246
Wireless CNC Plotter for PCB Using Android Application (Ashish Kumar, Shubham Kumar, Swati Shukla)....Pages 247-254
Epileptic Seizure Detection Using Machine Learning Techniques (Sudesh Kumar, Rekh Ram Janghel, Satya Prakash Sahu)....Pages 255-266
Analysis of Minimum Support Price Prediction for Indian Crops Using Machine Learning and Numerical Methods (Sarthak Gupta, Akshara Agarwal, Paluck Deep, Saurabh Vaish, Archana Purwar)....Pages 267-277
A Robust Methodology for Creating Large Image Datasets Using a Universal Format (Gurpartap Singh, Sunil Agrawal, B. S. Sohi)....Pages 279-288
A Comparative Empirical Evaluation of Topic Modeling Techniques (Pooja Kherwa, Poonam Bansal)....Pages 289-297
An Analysis of Machine Learning Techniques for Flood Mitigation (Vinay Dubey, Rahul Katarya)....Pages 299-307
Link Prediction in Complex Network: Nature Inspired Gravitation Force Approach (Sanjay Kumar, Utkarsh Chaudhary, Rewant Kedia, Tushar Singhal)....Pages 309-319
Hridaya Kalp: A Prototype for Second Generation Chronic Heart Disease Detection and Classification (Harshit Anand, Abhishek Anand, Indrashis Das, Siddharth S. Rautaray, Manjusha Pandey)....Pages 321-329
Two-Stream Mid-Level Fusion Network for Human Activity Detection (Mainak Chakraborty, Alik Pramanick, Sunita Vikrant Dhavale)....Pages 331-343
Content Classification Using Active Learning Approach (Neha Bansal, Arun Sharma, R. K. Singh)....Pages 345-352
Analysis of NavIC Multipath Signal Sensitivity for Soil Moisture in Presence of Vegetation (Vivek Chamoli, Rishi Prakash, Anurag Vidyarthi, Ananya Ray)....Pages 353-364
Uncovering Employee Job Satisfaction Using Machine Learning: A Case Study of Om Logistics Ltd (Diksha Jain, Sandhya Makkar, Lokesh Jindal, Mukta Gupta)....Pages 365-376
Transaction Privacy Preservations for Blockchain Technology (Bharat Bhushan, Nikhil Sharma)....Pages 377-393
EEG Artifact Removal Techniques: A Comparative Study (Mridu Sahu, Samrudhi Mohdiwale, Namrata Khoriya, Yogita Upadhyay, Anjali Verma, Shikha Singh)....Pages 395-403
Evolution of Time-Domain Feature for Classification of Two-Class Motor Imagery Data (Rahul Kumar, Mridu Sahu, Samrudhi Mohdiwale)....Pages 405-414
Finding Influential Spreaders in Weighted Networks Using Weighted-Hybrid Method (Sanjay Kumar, Yash Raghav, Bhavya Nag)....Pages 415-426
Word-Level Sign Language Gesture Prediction Under Different Conditions (Monika Arora, Priyanshu Mehta, Divyanshu Mittal, Prachi Bajaj)....Pages 427-435
Firefly Algorithm-Based Optimized Controller for Frequency Control of an Autonomous Multi-Microgrid (Kshetrimayum Millaner Singh, Sadhan Gope, Nicky Pradhan)....Pages 437-451
Abnormal Activity-Based Video Synopsis by Seam Carving for ATM Surveillance Applications (B. Yogameena, R. Janani)....Pages 453-462
Behavioral Analysis from Online Data Using Temporal Graphs (Anam Iqbal, Farheen Siddiqui)....Pages 463-472
Medical Data Analysis Using Machine Learning with KNN (Sabyasachi Mohanty, Astha Mishra, Ankur Saxena)....Pages 473-485
Insight to Model Clone’s Differentiation, Classification, and Visualization (Ritu Garg, R. K. Singh)....Pages 487-495
Predicting Socio-economic Features for Indian States Using Satellite Imagery (Pooja Kherwa, Savita Ahlawat, Rishabh Sobti, Sonakshi Mathur, Gunjan Mohan)....Pages 497-508
Semantic Space Autoencoder for Cross-Modal Data Retrieval (Shaily Malik, Poonam Bansal)....Pages 509-516
A Novel Approach to Classify Cardiac Arrhythmia Using Different Machine Learning Techniques (Parag Jain, C. S. Arjun Babu, Sahana Mohandoss, Nidhin Anisham, Shivakumar Gadade, A. Srinivas, Rajasekar Mohan)....Pages 517-526
Offline Handwritten Mathematical Expression Evaluator Using Convolutional Neural Network (Amit Choudhary, Savita Ahlawat, Harsh Gupta, Aniruddha Bhandari, Ankur Dhall, Manish Kumar)....Pages 527-537
An Empirical Study on Diabetes Mellitus Prediction Using Apriori Algorithm (Md. Tanvir Islam, M. Raihan, Fahmida Farzana, Promila Ghosh, Shakil Ahmed Shaj)....Pages 539-550
An Overview of Ultra-Wide Band Antennas for Detecting Early Stage of Breast Cancer (M. K. Anooradha, A. Amir Anton Jone, Anita Jones Mary Pushpa, V. Neethu Susan, T. Beril Lynora)....Pages 551-559
Single Image Haze Removal Using Hybrid Filtering Method (K. P. Senthilkumar, P. Sivakumar)....Pages 561-570
An Optimized Multilayer Outlier Detection for Internet of Things (IoT) Network as Industry 4.0 Automation and Data Exchange (Adarsh Kumar, Deepak Kumar Sharma)....Pages 571-584
Microscopic Image Noise Reduction Using Mathematical Morphology (Mangala Shetty, R. Balasubramani)....Pages 585-594
A Decision-Based Multi-layered Outlier Detection System for Resource Constraint MANET (Adarsh Kumar, P. Srikanth)....Pages 595-610
Orthonormal Wavelet Transform for Efficient Feature Extraction for Sensory-Motor Imagery Electroencephalogram Brain–Computer Interface (Poonam Chaudhary, Rashmi Agrawal)....Pages 611-622
Performance of RPL Objective Functions Using FIT IoT Lab (Spoorthi P. Shetty, Udaya Kumar K. Shenoy)....Pages 623-630
Predictive Analytics for Retail Store Chain (Sandhya Makkar, Arushi Sethi, Shreya Jain)....Pages 631-641
Object Identification in Satellite Imagery and Enhancement Using Generative Adversarial Networks (Pranav Pushkar, Lakshay Aggarwal, Mohammad Saad, Aditya Maheshwari, Harshit Awasthi, Preeti Nagrath)....Pages 643-657
Keyword Template Based Semi-supervised Topic Modelling in Tweets (Greeshma N. Gopal, Binsu C. Kovoor, U. Mini)....Pages 659-666
A Community Interaction-Based Routing Protocol for Opportunistic Networks (Deepak Kumar Sharma, Shrid Pant, Rinky Dwivedi)....Pages 667-679
Performance Analysis of the ML Prediction Models for the Detection of Sybil Accounts in an OSN (Ankita Kumari, Manu Sood)....Pages 681-693
Exploring Feature Selection Technique in Detecting Sybil Accounts in a Social Network (Shradha Sharma, Manu Sood)....Pages 695-708
Implementation of Ensemble-Based Prediction Model for Detecting Sybil Accounts in an OSN (Priyanka Roy, Manu Sood)....Pages 709-723
Performance Analysis of Impact of Network Topologies on Different Controllers in SDN (Dharmender Kumar, Manu Sood)....Pages 725-735
Bees Classifier Using Soft Computing Approaches (Abhilakshya Agarwal, Rahul Pradhan)....Pages 737-748
Fuzzy Trust Based Secure Routing Protocol for Opportunistic Internet of Things (Nisha Kandhoul, S. K. Dhurandher)....Pages 749-755
Student’s Performance Prediction Using Data Mining Technique Depending on Overall Academic Status and Environmental Attributes (Syeda Farjana Shetu, Mohd Saifuzzaman, Nazmun Nessa Moon, Sharmin Sultana, Ridwanullah Yousuf)....Pages 757-769
Evaluate and Predict Concentration of Particulate Matter (PM2.5) Using Machine Learning Approach (Shaon Hossain Sani, Akramkhan Rony, Fyruz Ibnat Karim, M. F. Mridha, Md. Abdul Hamid)....Pages 771-785
Retrieval of Frequent Itemset Using Improved Mining Algorithm in Hadoop (Sandhya Sandeep Waghere, PothuRaju RajaRajeswari, Vithya Ganesan)....Pages 787-798
Number Plate Recognition System for Vehicles Using Machine Learning Approach (Md. Amzad Hossain, Istiaque Ahmed Suvo, Amitabh Ray, Md. Ariful Islam Malik, M. F. Mridha)....Pages 799-814
The Model to Determine the Location and the Date by the Length of Shadow of Objects for Communication Networks (Renrui Zhang)....Pages 815-823
CW-CAE: Pulmonary Nodule Detection from Imbalanced Dataset Using Class-Weighted Convolutional Autoencoder (Seba Susan, Dhaarna Sethi, Kriti Arora)....Pages 825-833
SORTIS: Sharing of Resources in Cloud Framework Using CloudSim Tool (Kushagra Gupta, Rahul Johari)....Pages 835-843
Predicting Diabetes Using ML Classification Techniques (Geetika Vashisht, Ashish Kumar Jha, Manisha Jailia)....Pages 845-854
Er–Yb Co-doped Fibre Amplifier Performance Enhancement for Super-Dense WDM Applications (Anurupa Lubana, Sanmukh Kaur, Yugnanda Malhotra)....Pages 855-866
Seizure Detection from Intracranial Electroencephalography Recordings (Pranjal Naman, Satyarth Vats, Monarch Batra, Raunaq Bhalla, Smriti Srivastava)....Pages 867-875
Reader: Speech Synthesizer and Speech Recognizer (Mohammad Muzammil Khan, Anam Saiyeda)....Pages 877-886
Comparing CNN Architectures for Gait Recognition Using Optical Flows (Sumit Sarin, Anirudh Chugh, Antriksh Mittal, Smriti Srivastava)....Pages 887-893
Digital Identity Management System Using Blockchain Technology (Ei Shwe Sin, Thinn Thu Naing)....Pages 895-906
Enhancing Redundant Content Elimination Algorithm Using Processing Power of Multi-Core Architecture (Rahul Saxena, Monika Jain)....Pages 907-919
Matched Filter Design Using Dynamic Histogram for Power Quality Events Detection (Manish Kumar Saini, Rajender Kumar Beniwal)....Pages 921-935
Managing Human (Social) Capital in Medium to Large Companies Using Organizational Network Analysis: Monoplex Network Approach with the Application of Highly Interactive Visual Dashboards (Srečko Zajec, Leo Mrsic, Robert Kopal)....Pages 937-945
Gender and Age Estimation from Gait: A Review (Tawqeer Ul Islam, Lalit Kumar Awasthi, Urvashi Garg)....Pages 947-962
Parkinson’s Disease Detection Through Visual Deep Learning (Vasudev Awatramani, Deepak Gupta)....Pages 963-972
Architecture and Framework Enabling Internet of Vehicles Towards Intelligent Transportation System (R. Manaswini, B. Saikrishna, Nishu Gupta)....Pages 973-984
Group Data Sharing and Auditing While Securing Sensitive Information (Shubham Singh, Deepti Aggarwal)....Pages 985-997
Novel Umbrella 360 Cloud Seeding Based on Self-landing Reusable Hybrid Rocket (Satyabrat Shukla, Gautam Singh, Saikat Kumar Sarkar, Purnima Lala Mehta)....Pages 999-1011
User Detection Using Cyclostationary Feature Detection in Cognitive Radio Networks with Various Detection Criteria (Budati Anil Kumar, V. Hima Bindu, N. Swetha)....Pages 1013-1029
Fuzzy-Based DBSCAN Algorithm to Elect Master Cluster Head and Enhance the Network Lifetime and Avoid Redundancy in Wireless Sensor Network (Tripti Sharma, Amar Mohapatra, Geetam Tomar)....Pages 1031-1042
Water Quality Evaluation Using Soft Computing Method (Shivam Bhardwaj, Deepak Gupta, Ashish Khanna)....Pages 1043-1052
Crowd Estimation of Real-Life Images with Different View-Points (Md Shah Fahad, Akshay Deepak)....Pages 1053-1062
Scalable Machine Learning in C++ (CAMEL) (Moolchand Sharma, Anshuman Raina, Kashish Khullar, Harshit Khandelwal, Saumye Mehrotra)....Pages 1063-1081
Intelligent Gateway for Data-Centric Communication in Internet of Things (Rohit Raj, Akash Sinha, Prabhat Kumar, M. P. Singh)....Pages 1083-1091
A Critical Review: SANET and Other Variants of Ad Hoc Networks (Ekansh Chauhan, Manpreet Sirswal, Deepak Gupta, Ashish Khanna)....Pages 1093-1114
HealthStack–A Decentralized Medical Record Storage Application (Mayank Bansal, Kalpna Sagar, Anil Ahlawat)....Pages 1115-1123
AEECC-SEP: Ant-Based Energy Efficient Condensed Cluster Stable Election Protocol in Wireless Sensor Network (Tripti Sharma, Amar Mohapatra, Geetam Tomar)....Pages 1125-1138
Measurement and Modeling of DTCR Software Parameters Based on Intranet Wide Area Measurement System for Smart Grid Applications (Mohammad Kamrul Hasan, Musse Mohamud Ahmed, Sherfriz Sherry Musa)....Pages 1139-1150
Dynamic Load Modeling and Parameter Estimation of 132/275 KV Using PMU-Based Wide Area Measurement System (Musse Mohamud Ahmed, Mohammad Kamrul Hasanl, Noor Shamilawani Farhana Yusoff)....Pages 1151-1164
Enhanced Approach for Android Malware Detection (Gulshan Shrivastava, Prabhat Kumar)....Pages 1165-1178
Back Matter ....Pages 1179-1182

Advances in Intelligent Systems and Computing 1166

Deepak Gupta · Ashish Khanna · Siddhartha Bhattacharyya · Aboul Ella Hassanien · Sameer Anand · Ajay Jaiswal, Editors

International Conference on Innovative Computing and Communications Proceedings of ICICC 2020, Volume 2

Advances in Intelligent Systems and Computing Volume 1166

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, and life science are covered. The list of topics spans all the areas of modern intelligent systems and computing, such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, perception and vision, DNA and immune based systems, self-organizing and adaptive systems, e-learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.

The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results.

** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156

Deepak Gupta · Ashish Khanna · Siddhartha Bhattacharyya · Aboul Ella Hassanien · Sameer Anand · Ajay Jaiswal

Editors

International Conference on Innovative Computing and Communications
Proceedings of ICICC 2020, Volume 2

Deepak Gupta, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Ashish Khanna, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, Karnataka, India
Aboul Ella Hassanien, Department of Information Technology, Faculty of Computers and Information, Cairo University, Giza, Egypt
Sameer Anand, Department of Computer Science, Shaheed Sukhdev College of Business Studies, University of Delhi, Rohini, Delhi, India
Ajay Jaiswal, Department of Computer Science, Shaheed Sukhdev College of Business Studies, University of Delhi, Rohini, Delhi, India

ISSN 2194-5357    ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-981-15-5147-5    ISBN 978-981-15-5148-2 (eBook)
https://doi.org/10.1007/978-981-15-5148-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Dr. Deepak Gupta would like to dedicate this book to his father Sh. R. K. Gupta and his mother Smt. Geeta Gupta for their constant encouragement, to his family members including his wife, brothers, sisters, and kids, and to his students close to his heart.

Dr. Ashish Khanna would like to dedicate this book to his mentors Dr. A. K. Singh and Dr. Abhishek Swaroop for their constant encouragement and guidance, and to his family members including his mother, wife and kids. He would also like to dedicate this work to his (Late) father Sh. R. C. Khanna with folded hands for his constant blessings.

Prof. (Dr.) Siddhartha Bhattacharyya would like to dedicate this book to his father Late Ajit Kumar Bhattacharyya, his mother Late Hashi Bhattacharyya, his beloved wife Rashni, and his colleagues Jayanta Biswas and Debabrata Samanta.

Prof. (Dr.) Aboul Ella Hassanien would like to dedicate this book to his wife Azza Hassan El-Saman.

Dr. Sameer Anand would like to dedicate this book to his Dada Prof. D. C. Choudhary, his beloved wife Shivanee and his son Shashwat.

Dr. Ajay Jaiswal would like to dedicate this book to his father Late Prof. U. C. Jaiswal, his mother Brajesh Jaiswal, his beloved wife Anjali, his daughter Prachii and his son Sakshaum.

ICICC-2020 Steering Committee Members

Patrons
Dr. Poonam Verma, Principal, SSCBS, University of Delhi
Prof. Dr. Pradip Kumar Jain, Director, National Institute of Technology Patna, India

General Chairs
Prof. Dr. Siddhartha Bhattacharyya, Christ University, Bengaluru
Dr. Prabhat Kumar, National Institute of Technology Patna, India

Honorary Chairs
Prof. Dr. Janusz Kacprzyk, FIEEE, Polish Academy of Sciences, Poland
Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava, Czech Republic

Conference Chairs
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt
Prof. Dr. Joel J. P. C. Rodrigues, National Institute of Telecommunications (Inatel), Brazil
Prof. Dr. R. K. Agrawal, Jawaharlal Nehru University, Delhi


Technical Program Chairs
Prof. Dr. Victor Hugo C. de Albuquerque, Universidade de Fortaleza, Brazil
Prof. Dr. A. K. Singh, National Institute of Technology, Kurukshetra
Prof. Dr. Anil K. Ahlawat, KIET Group of Institutes, Ghaziabad

Editorial Chairs
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi
Dr. Arun Sharma, Indira Gandhi Delhi Technical University for Women, Delhi
Prerna Sharma, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi

Conveners
Dr. Ajay Jaiswal, SSCBS, University of Delhi
Dr. Sameer Anand, SSCBS, University of Delhi
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Gulshan Shrivastava, National Institute of Technology Patna, India

Publication Chairs
Prof. Dr. Neeraj Kumar, Thapar Institute of Engineering and Technology
Dr. Mohamed Elhoseny, University of North Texas
Dr. Hari Mohan Pandey, Edge Hill University, UK
Dr. Sahil Garg, École de technologie supérieure, Université du Québec, Montreal, Canada

Publicity Chairs
Dr. M. Tanveer, Indian Institute of Technology, Indore, India
Dr. Jafar A. Alzubi, Al-Balqa Applied University, Salt, Jordan
Dr. Hamid Reza Boveiri, Sama College, IAU, Shoushtar Branch, Shoushtar, Iran


Co-convener
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India

Organizing Chairs
Dr. Kumar Bijoy, SSCBS, University of Delhi
Dr. Rishi Ranjan Sahay, SSCBS, University of Delhi

Organizing Team
Dr. Gurjeet Kaur, SSCBS, University of Delhi
Dr. Aditya Khamparia, Lovely Professional University, Punjab, India
Dr. Abhimanyu Verma, SSCBS, University of Delhi
Dr. Onkar Singh, SSCBS, University of Delhi
Kalpna Sagar, KIET Group of Institutes, Ghaziabad


Preface

We are delighted to announce that Shaheed Sukhdev College of Business Studies, New Delhi, in association with National Institute of Technology Patna and the University of Valladolid, Spain, has hosted the eagerly awaited and much coveted International Conference on Innovative Computing and Communication (ICICC-2020). The third edition of the conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, receiving abstracts from more than 3,200 authors from different parts of the world. The committee of professionals dedicated to the conference strove to achieve a high-quality technical program with tracks on Innovative Computing, Innovative Communication Network and Security, and Internet of Things. All the tracks chosen for the conference are interrelated and very popular among the present-day research community, and a great deal of research is taking place in these tracks and their related sub-areas. As the name of the conference features the word ‘innovative’, it has targeted out-of-the-box ideas, methodologies, applications, expositions, surveys and presentations that help to upgrade the current status of research. More than 800 full-length papers were received, among which the contributions focus on theoretical research, computer simulation-based research, and laboratory-scale experiments. Among these manuscripts, 196 papers have been included in the Springer proceedings after a thorough two-stage review and editing process. All the manuscripts submitted to ICICC-2020 were peer-reviewed by at least two independent reviewers, who were provided with a detailed review proforma. The comments from the reviewers were communicated to the authors, who incorporated the suggestions in their revised manuscripts. The recommendations from the two reviewers were taken into consideration while selecting a manuscript for inclusion in the proceedings. The exhaustiveness of the review process is evident, given the large number of articles received addressing a wide range of research areas. The stringent review process ensured that each published manuscript met rigorous academic and scientific standards. It is an exalting experience to finally see these elite contributions materialize into two book volumes as the ICICC-2020 proceedings by Springer, entitled International Conference on Innovative Computing and Communications.


The articles are organized into two volumes in some broad categories covering subject matters on machine learning, data mining, big data, networks, soft computing, and cloud computing, although, given the diverse areas of research reported, this might not always have been possible. ICICC-2020 invited six keynote speakers, who are eminent researchers in the field of computer science and engineering, from different parts of the world. In addition to the plenary sessions on each day of the conference, fifteen concurrent technical sessions were held every day to ensure the oral presentation of around 195 accepted papers. Keynote speakers and session chair(s) for each of the concurrent sessions were leading researchers from the thematic area of the session. A technical exhibition was held during all three days of the conference, putting on display the latest technologies, expositions, ideas and presentations. The delegates were provided with a book of extended abstracts so that they could quickly browse through the contents and participate in the presentations, and so that the content would be accessible to a broad audience. The research part of the conference was organized in a total of 45 special sessions. These special sessions provided the opportunity for researchers conducting research in specific areas to present their results in a more focused environment.

An international conference of such magnitude and the release of the ICICC-2020 proceedings by Springer have been the remarkable outcome of the untiring efforts of the entire organizing team. The success of an event undoubtedly involves the painstaking efforts of several contributors at different stages, dictated by their devotion and sincerity. Fortunately, since the beginning of its journey, ICICC-2020 has received support and contributions from every corner. We thank all those who have wished the best for ICICC-2020 and contributed by any means towards its success. The edited proceedings volumes by Springer would not have been possible without the perseverance of all the steering, advisory and technical program committee members. The organizers of ICICC-2020 owe thanks to all the contributing authors for their interest and exceptional articles. We would also like to thank the authors of the papers for adhering to the time schedule and for incorporating the review comments. We wish to extend our heartfelt acknowledgment to the authors, peer-reviewers, committee members and production staff whose diligent work gave shape to the ICICC-2020 proceedings. We especially want to thank our dedicated team of peer-reviewers who volunteered for the arduous and tedious step of quality checking and critiquing the submitted manuscripts. We wish to thank our faculty colleagues Mr. Moolchand Sharma and Ms. Prerna Sharma for extending their enormous assistance during the conference. The time spent by them and the midnight oil burnt are greatly appreciated, for which we will ever remain indebted. The management, faculty, administrative and support staff of the college have always extended their services whenever needed, for which we remain thankful to them.


Lastly, we would like to thank Springer for accepting our proposal to publish the ICICC-2020 conference proceedings. The help received from Mr. Aninda Bose, the senior acquisitions editor, in the process has been very useful.

Rohini, India

Ashish Khanna
Deepak Gupta
Organizers, ICICC-2020

About This Book

International Conference on Innovative Computing and Communication (ICICC-2020) was held on 21–23 February 2020 at Shaheed Sukhdev College of Business Studies, in association with National Institute of Technology Patna and the University of Valladolid, Spain. The conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, receiving papers from more than 3,200 authors from different parts of the world. Only 195 papers were accepted and registered, with an acceptance ratio of 24%, to be published in two volumes of the prestigious Springer Advances in Intelligent Systems and Computing (AISC) series. This volume includes a total of 98 papers.


About the Editors

Dr. Deepak Gupta is an eminent academician who plays versatile roles and responsibilities, juggling between lectures, research, publications, consultancy, community service, and Ph.D. and postdoctorate supervision. With 12 years of rich expertise in teaching and two years in industry, he focuses on rational and practical learning. He has contributed extensive literature in the fields of human–computer interaction, intelligent data analysis, nature-inspired computing, machine learning and soft computing. He has served as Editor-in-Chief, Guest Editor, and Associate Editor for SCI and various other reputed journals. He completed his postdoc at Inatel, Brazil, and his Ph.D. at Dr. APJ Abdul Kalam Technical University. He has authored/edited 33 books with national/international publishers (Elsevier, Springer, Wiley, Katson). He has published 105 scientific research publications in reputed international journals and conferences, including 53 SCI-indexed journals of IEEE, Elsevier, Springer, Wiley and many more. He is the convener and organizer of the ‘ICICC’ Springer conference series.

Dr. Ashish Khanna has 16 years of expertise in teaching, entrepreneurship, and research and development. He received his Ph.D. degree from National Institute of Technology, Kurukshetra, and completed his M.Tech. and B.Tech. at GGSIPU, Delhi. He completed his postdoc at the Internet of Things Lab at Inatel, Brazil, and the University of Valladolid, Spain. He has published around 45 SCI-indexed papers in IEEE Transactions, Springer, Elsevier, Wiley and many more reputed journals, with a cumulative impact factor above 100. He has around 100 research articles in top SCI/Scopus journals, conferences and book chapters, and is co-author of around 20 edited books and textbooks. His research interests include distributed systems, MANET, FANET, VANET, IoT, machine learning and many more. He is the originator of Bhavya Publications and the Universal Innovator Lab; Universal Innovator is actively involved in research, innovation, conferences, startup funding events and workshops. He has served the research field as a keynote speaker, faculty resource person, session chair, reviewer, TPC member and postdoctorate supervisor. He is convener and organizer of the ICICC conference series. He is currently working at the Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, under GGSIPU, Delhi, India, and is also serving as Series Editor for the Elsevier and De Gruyter publishing houses.

Dr. Siddhartha Bhattacharyya is currently serving as a Professor in the Department of Computer Science and Engineering of Christ University, Bangalore. He is a co-author of 5 books and the co-editor of 50 books, and has more than 250 research publications in international journals and conference proceedings to his credit. He has two PCTs to his credit. He has been a member of the organizing and technical program committees of several national and international conferences. His research interests include hybrid intelligence, pattern recognition, multimedia data processing, social networks and quantum computing. He is also a certified Chartered Engineer of the Institution of Engineers (IEI), India. He is on the Board of Directors of the International Institute of Engineering and Technology (IETI), Hong Kong, and is a privileged inventor of NOKIA.

Dr. Aboul Ella Hassanien is the Founder and Head of the Egyptian Scientific Research Group (SRGE). Prof. Hassanien has more than 1000 scientific research papers published in prestigious international journals and over 50 books covering such diverse topics as data mining, medical images, intelligent systems, social networks and smart environments. He has won several awards, including the Best Researcher of the Youth Award of Astronomy and Geophysics of the National Research Institute, Academy of Scientific Research (Egypt, 1990). He was also granted a scientific excellence award in humanities from the University of Kuwait in 2004 and received the University Award for scientific superiority (Cairo University, 2013). He was also honored in Egypt as the best researcher at Cairo University in 2013. He received the Islamic Educational, Scientific and Cultural Organization (ISESCO) prize in technology (2014) and the State Award for Excellence in Engineering Sciences (2015). He was awarded the Medal of Sciences and Arts of the first class by the President of the Arab Republic of Egypt in 2017, and the international Scopus Award for meritorious research contribution in the field of computer science (2019).

Dr. Sameer Anand is currently working as an Assistant Professor in the Department of Computer Science at Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He received his M.Sc., M.Phil., and Ph.D. (software reliability) from the Department of Operational Research, University of Delhi. He is a recipient of the ‘Best Teacher Award’ (2012) instituted by the Directorate of Higher Education, Government of NCT, Delhi. Dr. Anand’s research interests include operational research, software reliability and machine learning. He has completed an Innovation Project from the University of Delhi and has worked in different capacities in international conferences. Dr. Anand has published several papers in reputed journals such as IEEE Transactions on Reliability, International Journal of Production Research (Taylor & Francis), and International Journal of Performability Engineering. He is a member of the Society for Reliability Engineering, Quality and Operations Management, and has more than 16 years of teaching experience.

Dr. Ajay Jaiswal is currently serving as an Assistant Professor in the Department of Computer Science of Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He is co-editor of two books/journals and co-author of dozens of research publications in international journals and conference proceedings. His research interests include pattern recognition, image processing, and machine learning. He has completed an interdisciplinary project titled ‘Financial Inclusion-Issues and Challenges: An Empirical Study’ as Co-PI; this project was awarded by the University of Delhi. He obtained his masters from the University of Roorkee (now IIT Roorkee) and his Ph.D. from Jawaharlal Nehru University, Delhi. He is a recipient of the Best Teacher Award from the Government of NCT of Delhi and has more than nineteen years of teaching experience.

A Dummy Location Generation Model for Location Privacy in Vehicular Ad hoc Networks Bhawna Chaudhary and Karan Singh

Abstract Vehicular ad hoc networks are designed to tackle the problems that occur due to the proliferation of vehicles in our society. However, most of their applications require access to the locations of vehicles participating in the network, which may lead to life-threatening situations. Hence, a careful solution is required while sharing credentials, including location. In this work, we propose to use dummy location generation for vehicles. This method helps in protecting the location privacy of a vehicle by creating confusion in the network. This paper contributes a dummy location generation method based on evaluating the conditional probabilities of location and time pairings. We first describe the technique used by the adversary and then present our dummy location generation method, which is simple in nature and more efficient than existing methods. Results demonstrate the validity of our proposed model. Keywords Dummy location · Location privacy · Security · Anonymity · VANET

1 Introduction

In recent times, traffic congestion is considered one of the most serious issues faced by the whole world. The problems that arise due to the use of private transport are the increasing number of road accidents, additional expenses and related dangers, as well as serious socioeconomic issues being faced by modern society. To deal with these problems, a very promising technology has been developed, i.e., vehicular ad hoc networks (VANETs) [1]. Using this technology, vehicles equipped with an on-board unit (OBU) communication device can communicate with the help of roadside units (RSUs), i.e., the V2R architecture, or they can communicate directly by short-range direct
communications by sending beacons as messages to each other, i.e., the V2V architecture. Using these architectures, VANETs can offer a vast variety of applications. These applications fall into two classes: safety-related and non-safety-related applications. Safety-related applications include warning messages, cooperative driving, and traffic optimization, whereas non-safety-related applications include the exchange of entertainment messages [8]. The main purpose of the growth of vehicular communication is to enhance road safety. These safety applications are on-demand and require location-aware services to feed real-time information to their users. For this purpose, beacon messages are transmitted into the network every 10 ms and contain a lot of personal information, such as the timestamp of the vehicle, its identity, and some spatiotemporal information (i.e., speed, velocity, acceleration, etc.) [4]. This information helps a driver to sense forthcoming dangerous situations on the road, with a window gap for the driver to respond. Still, neighbor nodes (having malicious tendencies) can easily eavesdrop on the messages and then link them based on the identity of the vehicle to extract all the visited locations. This can compromise network privacy, as one vehicle is associated with only one driver [5, 8]. For the effective utilization of such a network, it is necessary to develop a set of elaborate protocols and finely designed privacy mechanisms to make VANET applications feasible, i.e., personal information of the driver, such as identity and most visited places, must be preserved in order to prevent users from being traced illegally through vehicular communications. A compromised network does not only affect one's privacy but can also threaten one's security [7]. A malicious node may spoof the information present in the beacons and misdirect the other nodes. We consider an attack scenario in which an adversary gathers spatiotemporal user information, such as patterns of the user's frequently visited locations including office address, residence address, and restaurants, and uses it to distinguish the actual location from the dummy locations (Fig. 1). In this work, we propose a methodology that produces dummy locations that remain untraceable under such an attack strategy, using a simple statistical method. In conclusion, we define the attack model and the objective of our method as follows:

Attack Scenario: An attacker keeps prior knowledge of the target node and external spatiotemporal information, corresponding to a context-linking attack [4]. The adversary may try to find out the real location of the vehicle from the dummy locations using such information.

Objective: We present a method to generate realistic dummy locations that are untraceable under this attack scenario. We propose a dummy generation method that carefully selects the dummy locations from the k highest-priority (frequently visited) locations obtained by finding conditional probabilities. Moreover, we focus on locations that are considered more vulnerable with respect to spatiotemporal contexts by adding a weight factor. Experiments have shown that this approach works well in the given scenario. Our approach generates more realistic dummy locations while considering the time of actual events. This approach is sufficiently simple to be utilized in real-time applications and obscures the actual location among the dummy locations more successfully than existing methods.


Fig. 1 Dummy location generation model by RSU

2 Related Work

The authors of [9] are inclined toward cryptographic mix zones, deployed via a special RSU at places where traffic density is very high, like crossroads and toll booths. Mix zones can be described as anonymized regions of the network, where the identifiers of mobile nodes are changed to obscure the relationship between entering and exit events. Whenever a vehicle enters a cryptographic mix zone, a symmetric key is assigned by the RSU to the vehicle. While traveling through a mix zone, every communicated message remains encrypted to protect the useful information embedded in the message from the adversary. Vehicles in the mix zones send the symmetric key with the message to the vehicles that are in direct transmission range outside of the mix zones, so that those vehicles are also able to decrypt messages. In [3, 12], in order to effectively change and reduce the number of pseudonyms used in the network, results have proven that the synchronous pseudonym change algorithm is more efficient than the similar-status algorithm, and that the similar-status algorithm works better than the position algorithm. They simulated the three algorithms in the same environment using the vehicular mobility model STRAW (an
integrated mobility model), which observes vehicular behavior and simplifies traffic control mechanisms. The heuristics applied to optimize pseudonyms in the network [11] reduce the communication required for procuring pseudonyms and the possibility of tracking at the time of procurement. This work asserts that the proposed heuristics, which update the pseudonym at particular places and times when vehicle density is low, help maximize anonymity with a minimal updating frequency. CARAVAN [10] suggests that associating neighboring vehicles into groups reduces the frequency of message broadcasting by a vehicle for V2I applications. Using a group, vehicles can be provided with an extended silent period, which, in turn, enhances their anonymity and also achieves unlinkability. This solution assumes that VANETs have a registration authority (RA), which has data on all the vehicles joining the network. Each vehicle also registers for the services of its interest, and only the RA knows the association between the real identity and the pseudonyms allocated to a vehicle. An enhancement technique is suggested that allows for actual differentiation between RSUs and for transmission power control by vehicles. In [6], privacy is preserved using dummies for the very first time. In this approach, dummies are used in a query-response system with location-based services. A new privacy protocol known as PARROTS (Position Altered Requests Relayed Over Time and Space) has been presented, suggesting that privacy can be preserved for location-based services and that users can be protected from LBS administrators in these three cases: (a) when the LBS demands constant support from the network, (b) if the RA conflicts with the RSU, and (c) when the spatiotemporal information of a vehicle can be linked. Though this study does not compare network efficiencies, it does introduce a new method of protecting privacy. In this work, we propose a dummy location generation technique in which the attacker has prior knowledge of the target user's profile and spatiotemporal information. Unlike other approaches, this approach considers the realistic scenario in which an attacker may collect information about the vehicle and its owner from social networking sites to learn the pattern of the target node.

3 Our Threat Model and Dummy Generation

3.1 Threat Model

A basic architecture for location-based services in VANETs consists of vehicular nodes having geographical positioning devices, RSUs, and service providers, which compute and respond to queries using the users' location coordinates. Vehicles present in the network can communicate using beacon messages. In our threat model, we consider an adversary to be a vehicular node that may set up communication with the target node. The attacker behaves according to its predefined protocol but tries to find out the real information of the target vehicle (the real location, in this
case, is the last location updated at the nearest RSU). Out of the different possible attack scenarios described, we have chosen the fixed-position attack, where the adversary observes a query set from the target node. Moreover, each node participating in the network shares its location every time it initiates communication. Also, we assume that the GPS installed in the vehicle is trustworthy and cannot be spoofed by the adversary. We concentrate on achieving location privacy by implementing a dummy generation technique in which the decisions made by the technique are assumed to be authentic.

3.2 The Dummy Generation Scheme

To deal with the dummy generation mentioned in the above section, a method is proposed which hides the exact location from the service provider by offering a set of fake locations, known as dummy locations or dummies, containing the exact location. The procedure works as follows:

a. The user's vehicle is present at some location A.
b. The user communicates position data A, along with a set of fake locations, such as B, C, D, and E, to the RSU.
c. The RSU generates location values for all the dummy locations from A to E and sends the message back to the user vehicle.
d. The user vehicle communicates using the values from the received set, thereby hiding its real location.

The actual location A is not exposed to any other vehicle or to other RSUs. Only the intended vehicle is aware of its exact location, whereas the RSU may not be. Thus, no entity except the vehicle itself can distinguish the actual location of the vehicle from the pool of k defined locations (including 1 actual location and k-1 dummy locations). Therefore, the aforementioned method can be used to preserve location privacy by establishing k-anonymity. The dummy selection routine falls back to a default candidate set C when needed and returns a random element r of the resulting set R.
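To make steps a-d concrete, the following is a minimal Python sketch of the vehicle-side dummy set construction; all names are illustrative, and dummies are drawn uniformly from a candidate pool here, whereas Sect. 4 replaces the uniform draw with conditional-probability-based selection:

```python
import random

def build_dummy_set(actual_location, candidate_pool, k):
    """Return a k-anonymous location set: the actual location hidden
    among k-1 dummies drawn from the candidate pool (illustrative sketch)."""
    dummies = random.sample(
        [loc for loc in candidate_pool if loc != actual_location], k - 1)
    report = dummies + [actual_location]
    random.shuffle(report)   # an observer cannot tell which entry is real
    return report

# Example: the vehicle at A reports A hidden among four dummies (k = 5)
print(build_dummy_set("A", ["A", "B", "C", "D", "E", "F"], k=5))
```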

3.3 Our Attack Model

In our proposed attack model, the adversary can perform a context-linking attack, which assumes that the adversary is aware of the spatiotemporal information of the target vehicle. Using information posted online, an adversary may predict the location of its target vehicle. For example, a user commuting from her home to the workplace generally follows the same route and stops at similar points. After observing such behavior of the target user, an adversary may collect the remaining information from different platforms such as social networking sites, information
exchanged in the beacons, etc. The attacker may also gain knowledge of frequently visited restaurants and other accessed places. This prior knowledge of locations acts as an analytical challenge for the development of the dummy generation technique.

4 Our Proposed Work

In this section, we present a novel dummy generation method that makes use of prior knowledge about the user vehicle and its whereabouts. This approach is based on the following objectives:

a. Generate dummy locations that the target user frequently visits at a particular time.
b. Generate dummy locations that seem vulnerable from the target user's perspective.

We tackle the first objective by calculating conditional probabilities and add a weighting scheme to fulfill the second.

4.1 Generating Frequently Visited Locations of Vehicles

Our work examines the dummy locations by calculating the conditional probability of a location given a time, predicting the targeted vehicle's behavior at a particular time of the day. This can be calculated by finding the probability that a user vehicle is at a specific location at a given time [13]:

P(Location of the vehicle | Time) = P(Location of the vehicle ∩ Time) / P(Time)    (1)

Equation (1) obtains the conditional probability from the joint probability of events, i.e., being at this location at this particular time. Also, we initialize the count of every location/time pair to 1 to avoid zero probabilities. After calculating P(location | time) for all the probable locations at some specific time, we produce dummy locations from the locations with the highest probabilities. If we encounter two equal values of P(location | time), then only P(location) is considered to break the tie. This method reveals the probable locations of a user vehicle at any specific time.
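A minimal sketch of this computation, assuming the logs are available as (location, time) pairs; the function and variable names are our own:

```python
from collections import Counter

def top_k_locations(logs, time_slot, k, locations):
    """Rank locations by P(location | time) estimated from (location, time)
    log pairs, with every count initialized to 1 to avoid zero probabilities,
    and return the k most probable ones (the dummy candidates)."""
    counts = Counter({loc: 1 for loc in locations})        # add-one smoothing
    counts.update(loc for loc, t in logs if t == time_slot)
    total = sum(counts.values())                           # proportional to P(time)
    probs = {loc: counts[loc] / total for loc in locations}
    # Ties in P(location | time) could further be broken by the marginal P(location)
    return sorted(probs, key=probs.get, reverse=True)[:k]

logs = [("office", "9am"), ("office", "9am"), ("cafe", "9am"), ("home", "6pm")]
print(top_k_locations(logs, "9am", k=2,
                      locations=["office", "cafe", "home"]))  # ['office', 'cafe']
```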

4.2 Assessment of Vulnerable Time/Location Pairs

We assume a few vulnerable locations that are known to attackers and may be used as dummy locations. Generally, the schedule of a user is fixed. Our model generates dummy locations for the regularly visited location and time pairs to
increase uncertainty for the attacker. For every vulnerable location and time pair, our model allocates a weight known as risk:

P(dummy location) = (P(location ∩ time) × risk) / P(time)    (2)

If the value of the risk is greater than 1, then the location/time pair is vulnerable, and if it is equal to 1, then we consider the location/time pair to be under control. Thereafter, we assign a dummy vehicle location to every possible vulnerable location.
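The risk weighting of Eq. (2) then simply rescales the conditional probabilities of Eq. (1); a sketch with hypothetical inputs:

```python
def weighted_scores(cond_probs, risk):
    """Eq. (2): score each location by P(location | time) * risk, where
    risk > 1 flags a vulnerable location/time pair and risk = 1 means the
    pair is considered under control (all inputs here are hypothetical)."""
    return {loc: p * risk.get(loc, 1.0) for loc, p in cond_probs.items()}

scores = weighted_scores({"office": 0.50, "cafe": 0.33, "home": 0.17},
                         risk={"cafe": 3.0})     # cafe flagged as vulnerable
print(max(scores, key=scores.get))               # -> 'cafe' is prioritised
```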

5 Experimental Design and Evaluation

To determine the efficiency of the proposed model, we perform experiments based on real data observed for a vehicle's user. The experimental study is designed to verify the following objectives:

a. How robust are the dummy vehicle locations generated by calculating the conditional probabilities against the attacker model?
b. Does identifying vulnerable location and time pairs help ensure more robust dummy locations?

5.1 Experimental Settings

We have established our dataset, which involves logged information (time and location) about a target vehicle in Jaipur city. These logs were collected by observing the target vehicle for 2 weeks, yielding 198 log entries in the database. The logs cover approximately 45 well-known places of the city, of which only the 12 most visited locations are chosen for this study. Out of the chosen dataset, we train on 150 instances and analyze the proposed model on data containing at least 5 days of logs.

Attacker's Scenario: Our assumption is that the attacker has collected information about the target vehicle beforehand.

Target vehicle's information (T_i): Vehicle information can be found on the Internet. As a result, the vehicle owner's name and address can be retrieved from the Internet, and by searching the same identity on social networking sites, more information can be obtained, such as the addresses of the target vehicle's office, home, cafe, and clinics.

Spatiotemporal information (T_i^s): Our assumption is that the attacker has all the prior information about cafes, restaurants, and other places. Additionally, the attacker has the common-sense knowledge that people generally go to the office in the morning hours (7-9 a.m.), to a cafe in the evening (5-7 p.m.), and are at home at night. This pattern helps the attacker to predict the location of a vehicle at any specific time.

Performance Measure: We measure the average probability that the attacker is able to figure out the exact location e of the target vehicle. As per our attack model described in Sect. 3.3, the attacker randomly selects a candidate for the target's location from the given set E. The success probability of the attacker can be measured as

Success probability = 1/|E| if e ∈ E, and 0 if e ∉ E    (3)

Fig. 2 The performance of our approach according to different risk and k
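Equation (3) is straightforward to encode; a minimal sketch:

```python
def attacker_success(e, E):
    """Eq. (3): the attacker picks uniformly at random from the reported set E,
    so success probability is 1/|E| when the real location e is in E, else 0."""
    return 1.0 / len(E) if e in E else 0.0

print(attacker_success("A", ["A", "B", "C", "D", "E"]))   # 0.2, i.e., 1/k for k = 5
```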

5.2 Experimental Results

We evaluate the functionality of our algorithm in comparison with previously suggested methods. In our base paper [2], dummy locations are generated using a cloaking technique, each location having a circular probability. Another dummy location service uses an entropy-based scheme to place the dummy locations on the road [14]. Moreover, we evaluate the results theoretically against the optimal k-anonymity algorithm, for which the probability of estimating the real location is 1/k. In Fig. 3, we show the comparison of our work with the other two methods described earlier.

Fig. 3 Comparison of success probability of attacker and different k-anonymity

The results show that different k-anonymity levels can be obtained by using various numbers of dummy locations. The graphs show that our approach results in a lower probability of the attacker identifying the real location of the vehicle, which indicates that the proposed model is safer than the other solutions. On average, with our method the attacker succeeds 3.8% less often than with the second baseline approach and 28.6% less often than with the first. Additionally, as the number of dummy locations increases, the efficacy of our algorithm also increases. Thereafter, we examine our weighting scheme for pairing vulnerable times and locations of the target vehicle, which can be determined by examining the schedule of the vehicle thoroughly. We set the risk value from 1 to 9 according to the vulnerability of the location. The reason behind this is that a larger value of k means that most of the vulnerable locations are already included among the dummy locations. This approach can be efficiently applied to vehicular ad hoc networks, as it lowers the network establishment cost. If a vehicle is able to set its route in advance, our approach will generate much safer dummy locations and may provide a safer network.

6 Conclusion and Future Aspects

In this work, we have considered an attacker strategy that relies on prior knowledge of the target user's vehicle and external background information. To address this serious threat, we define a dummy location generation algorithm for vehicles that efficiently places the dummy locations of vulnerable vehicles after calculating the conditional probabilities at specific times. Furthermore, we show that location privacy can be achieved if we consider the spatiotemporal information. Experimental results show that our statistical method gives more effective results than existing methods. In our future work, we plan to extend the proposed work to tackle random attacks and to implement it on a full city map.


References

1. R. Al-ani, B. Zhou, Q. Shi, A. Sagheer, A survey on secure safety applications in VANET, in 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (IEEE, 2018), pp. 1485–1490
2. M. Arif, G. Wang, T. Peng, Track me if you can? Query based dual location privacy in VANETs for V2V and V2I communication, in 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE) (2018), pp. 1091–1096
3. S. Bao, W. Hathal, H. Cruickshank, Z. Sun, P. Asuquo, A. Lei, A lightweight authentication and privacy-preserving scheme for VANETs using TESLA and Bloom filters. ICT Express (2017)
4. S. Buchegger, T. Alpcan, Security games for vehicular networks, in 2008 46th Annual Allerton Conference on Communication, Control, and Computing (IEEE, 2008), pp. 244–251
5. G. Calandriello, P. Papadimitratos, J.-P. Hubaux, A. Lioy, Efficient and robust pseudonymous authentication in VANET, in Proceedings of the Fourth ACM International Workshop on Vehicular Ad Hoc Networks (ACM, 2007), pp. 19–28
6. G. Corser, H. Fu, T. S. P. D. W. M. S. L. Y. Z, Privacy-by-decoy: protecting location privacy against collusion and deanonymization in vehicular location based services, in IEEE Intelligent Vehicles Symposium Proceedings (2014), pp. 1030–1036
7. M. Gupta, N.S. Chaudhari, Anonymous roaming authentication protocol for wireless network with backward unlinkability and natural revocation. Ann. Telecommun. 1–10 (2018)
8. H. Hartenstein, L. Laberteaux, A tutorial survey on vehicular ad hoc networks. IEEE Commun. Mag. 46(6), 164–171 (2008)
9. M. Raya, J.-P. Hubaux, Securing vehicular ad hoc networks. J. Comput. Secur. 15(1), 39–68 (2007)
10. K. Sampigethaya, L. Huang, M. Li, R. Poovendran, K. Matsuura, K. Sezaki, CARAVAN: providing location privacy to VANETs. Defense Technical Information Center (2005)
11. K. Sharma, B.K. Chaurasia, S. Verma, G.S. Tomar, Token based trust computation in VANET. Int. J. Grid Distrib. Comput. 9(5), 313–320 (2016)
12. M. Wang, D. Liu, L. Zhu, Y. Xu, F. Wang, LESPP: lightweight and efficient strong privacy preserving authentication scheme for secure VANET communication. Computing 98(7), 685–708 (2016)
13. Z. Yan, P. Wang, W. Feng, A novel scheme of anonymous authentication on trust in pervasive social networking. Inf. Sci. 445, 79–96 (2018)
14. Q. Yang, A. Lim, R. X. Q. X, Location privacy protection in contention based forwarding for VANETs, in IEEE Global Telecommunications Conference GLOBECOM (2010), pp. 1–5

Evaluating User Influence in Social Networks Using k-core N. Govind and Rajendra Prasad Lal

Abstract Given a social network with an influence propagation model, selecting a small subset of users to maximize the influence spread is known as the influence maximization problem. It has been shown that the influence maximization problem is NP-hard, and several approximation algorithms and heuristics have been proposed. In this work, we follow a graph-theoretic approach to find the initial spreaders, called seed nodes, such that the expected number of influenced users is maximized. It has been well established through a series of research works that a special subgraph called the k-core is very useful for finding the most influential users. A k-core subgraph H of a graph G is defined as a maximal induced subgraph where every node in H has at least k neighbors. We apply a topology-based algorithm called Local Index Rank (LIR) on the k-core (for some fixed k) to select the seed nodes in a social network. The accuracy and efficiency of the proposed method have been established using two benchmark datasets from the SNAP (Stanford Network Analysis Project) database. Keywords Influence maximization · Social network · k-core · Independent cascade

1 Introduction

Social Network Analysis (SNA) is an active research area which has attracted researchers from academia as well as industry. Social networks like Facebook, Flickr, YouTube and Twitter are extensive and very effective in propagating information and promoting market products among their users in a very short time span. There are specific nodes in social networks, called influential nodes, which can propagate information to a large number of users quickly. Identifying such nodes in the social graph will help in controlling epidemic outbreaks, speeding up information propagation, advertisement by e-commerce websites, and so on. This has attracted
scientists from various fields like economics, sociology and computer science to study influence spread in social networks. Specifically, identifying influential nodes (users) in social networks has recently been discussed and analyzed extensively by academics as well as people from industry [2, 8, 13]. With the inspiration of viral marketing, the influence maximization problem was first studied algorithmically by Domingos et al. [3, 17] from a data mining perspective. Kempe et al. [8] were the first to formulate the influence maximization problem as a stochastic optimization problem and established its NP-hardness. The problem can be defined as: given a social network G and a positive integer s (the number of influential spreaders), find a subset S of influential users so that the total number of users influenced by them is maximized. These influential users, also called the seed set, are the initial adopters of an innovation or piece of information and will propagate it to their neighbors in the network. The users influenced by these initial adopters will in turn propagate the information or influence to their neighbors in the network. Several influence propagation models, such as the Independent (weighted) Cascade and Linear Threshold (LT) models, have been proposed by researchers in the literature. As the influence maximization problem is NP-hard, various approximation algorithms [2, 6, 8, 10, 12] and heuristics [2, 8, 13, 20] have been proposed. Recently, some graph-theoretic approaches using special subgraphs like the k-core, k-truss, etc., have also been proposed to find influential users in a social network. A k-core is a maximal induced subgraph in which all the nodes have degree at least k. The high connectivity of the nodes in a k-core makes it very useful for influence spreading [9]. The core decomposition of graphs has also been applied in the areas of community detection, event detection, text mining, etc. [14]. In this work, we apply heuristics like degree discount and Local Index Rank (LIR) on the k-core of the social network to find a set of influential nodes. We have taken the Independent Cascade (IC) model to calculate the expected number of influenced users. The paper is organized as follows: Sect. 2 contains the problem definition, Sect. 3 includes existing work, Sect. 4 consists of the proposed methods, Sect. 5 includes the dataset description, and Sect. 6 contains the experimental results, followed by the conclusion and future work.

2 Influence Maximization Here, we look into the basic definitions and methods useful to study and solve the influence maximization problem. Social Network: A social network is a graph denoted by G = (V, E), where V is a set of users, and E is a set of links between the users. Here, for our study, we consider undirected graphs.


Influence Maximization: The influence maximization problem can be defined as: given a social network G = (V, E) and a positive number s, identify a subset of users S ⊂ V, |S| = s, such that the influence spread function f(S) is maximized [8]. The influence spread function f(S) of a seed set S is defined as the expected number of nodes that get influenced under a propagation or diffusion model. The IC model [7, 8, 18] and LT model [4, 5, 8] are mainly used to stochastically model influence propagation by triggering the propagation of influence in the network from the already chosen seed set. In the IC model, a probability of influence p_uv is assigned to every edge (u, v). If at a given time t the node u is influenced, then at time t + 1 it attempts to influence the node v with probability p_uv; if it succeeds, then v gets influenced. Every influenced node has at most one chance to influence its neighbors. Once a node reaches an influenced state, it remains in that state. This process starts with an initial seed set and stops when no nodes remain to influence. In the LT model, the edge between a node u and each of its neighbors v is given a weight b_uv, and every node is given a threshold value. Node u gets influenced when the conditions Σ_{v∈N(u)} b_uv ≤ 1 and Σ_{v∈N(u)} b_uv ≥ θ_u are satisfied, where θ_u is the threshold of user u, generated uniformly at random in the interval [0, 1], and N(u) represents the neighbors of node u. The process continues until no nodes remain to influence. Other diffusion models with variations are also available in the literature [23], and Kempe et al. [8] gave generalized versions of these two models.
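For illustration, a Monte Carlo sketch of influence spread estimation under the IC model described above, assuming a networkx-style undirected graph and a uniform influence probability p (function and variable names are our own):

```python
import random

def ic_spread(G, seeds, p=0.05, runs=1000):
    """Monte Carlo estimate of the influence spread f(S) under the IC model:
    every newly influenced node u gets exactly one chance to influence each
    uninfluenced neighbour v, succeeding with probability p."""
    total = 0
    for _ in range(runs):
        influenced = set(seeds)
        frontier = list(seeds)
        while frontier:
            fresh = []
            for u in frontier:
                for v in G.neighbors(u):
                    if v not in influenced and random.random() < p:
                        influenced.add(v)
                        fresh.append(v)
            frontier = fresh                 # only new nodes try to spread next
        total += len(influenced)
    return total / runs
```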

3 Existing Work

Kempe et al. [8] were the first to formulate the influence maximization problem as an optimization problem under the IC and LT models. They proved its NP-hardness and proposed an approximation algorithm with an approximation ratio of 1 − 1/e − ε. In order to find efficient solutions to the influence maximization problem, various algorithms have been proposed in the literature. These can mainly be divided into two types, viz. greedy and heuristic algorithms. Greedy algorithms: Kempe et al. [8] proved that the influence spread function is submodular and proposed a hill-climbing greedy algorithm. Leskovec et al. [10] presented a Cost-Effective Lazy Forward (CELF) algorithm based on lazy evaluation of the objective function, which was 700 times more efficient than the former algorithm. Goyal et al. [6] proposed CELF++ by improving the CELF algorithm. Chen et al. [2] proposed greedy algorithms like NewGreedy and MixedGreedy. Greedy algorithms work well to produce seed sets but have prohibitively high time complexity. Heuristic algorithms: Kempe et al. [8] proposed heuristics like degree and degree centrality. Chen et al. [2] improved over degree centrality with the notion of degree discount. The idea is to discount the degree of a node if it has seed nodes as its neighbors. The discounted degree of node v is given by d_v − 2t_v − (d_v − t_v)t_v·p, where d_v denotes degree(v), t_v is the count of seed nodes among the neighbors of v, and p is the propagation
probability. Wang et al. [21] proposed the generalized degree discount algorithm as an extension of degree discount. The idea is to modify the degree discount by considering two-hop neighbors. The generalized degree discount of a node v is given by d_v − 2t_v − (d_v − t_v)t_v·p + (1/2)t_v(t_v − 1)p − Σ_w t_w·p, where the sum runs over the d_v − t_v non-seed neighbors w of v. Liu et al. [13] proposed a topology-based algorithm called LIR, which is based on degree. Zhang et al. [22] proposed the VoteRank algorithm, which is based on the voting capacity of nodes and the average degree of the network. Nodes in the network vote for their neighbors, and the node with the highest number of votes is chosen. The selected node does not participate in further voting, and the voting ability of its neighbor nodes is decreased in the next turn. Pal et al. [16] studied the heuristics of centrality measures and modeled a new centrality measure based on diffusion degree.
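As an illustration of the degree discount idea quoted above, a sketch of the selection loop (the networkx-style graph interface and all names are assumptions):

```python
def degree_discount(G, s, p=0.05):
    """Degree discount heuristic of Chen et al. [2], a sketch: repeatedly pick
    the node with the highest discounted degree d_v - 2*t_v - (d_v - t_v)*t_v*p,
    where t_v counts already-chosen seed neighbours of v."""
    d = {v: G.degree(v) for v in G}
    t = {v: 0 for v in G}
    dd = dict(d)                          # discounted degree, initially d_v
    seeds, chosen = [], set()
    for _ in range(s):
        u = max((v for v in G if v not in chosen), key=lambda v: dd[v])
        seeds.append(u)
        chosen.add(u)
        for v in G.neighbors(u):          # discount neighbours of the new seed
            if v not in chosen:
                t[v] += 1
                dd[v] = d[v] - 2 * t[v] - (d[v] - t[v]) * t[v] * p
    return seeds
```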

4 Proposed Method

In this section, we discuss k-core subgraphs and the methods proposed by us. The concept of cores was coined by Seidman [19] in 1983 to find cohesive subgroups of users in a given network. Cohesive subgroups are subsets of nodes or users with strong, direct, intense, or positive ties. Kitsak et al. [9] studied influence spread based on the k-core using an epidemic model and showed that the core of the network contains the influential spreaders. Malliaros et al. [15] studied influence spread based on the k-truss, which is a triangle (cycle of length 3)-based extension of the k-core. A k-core is defined as a subgraph H = (V′, E′) induced from a graph G = (V, E) in which degree(v′) ≥ k for all v′ ∈ V′, where k is a positive number. A k-core subgraph can be obtained by decomposing a graph based on the degree property, i.e., nodes whose degree is less than k, along with the edges incident on them, are deleted recursively until all remaining nodes have degree at least k. The linear-time algorithm proposed by Batagelj et al. [1] can be employed to find the k-core of a graph. Liu et al. [13] proposed a heuristic called LIR with the intuition of avoiding the "rich club effect," i.e., avoiding adjacency between two high-degree nodes. The nodes selected by LIR have degree higher than their neighboring nodes, and most of the time they are not connected with each other. They proposed a ranking LI for each node v based on the degree of that node and its neighbors, given by the number of neighbors having a higher degree than node v. After computing LI for all nodes, the nodes with LI = 0 are selected and sorted in descending order of their degree. Then the required number of nodes is chosen as the seed set from the sorted list. It can be observed that the chance of nodes in the k-core having nonzero LI values is high. Hence, LIR applied to a graph G may not select an adequate number of nodes from the k-core of G. It excludes some influential nodes from the k-core, which contradicts the fact that most influential nodes reside in the k-core [9]. Here, we apply LIR on the k-core of the graph to find the seed set. The intuition is to include those influential nodes which are excluded by LIR when applied to the original graph.


Our approach is accurate and scalable, as k-cores are relatively small in size and time-efficient to compute. Here, we also use the degree discount heuristic on the k-core to find influential nodes. Our proposed algorithm is outlined in Algorithm 1. It takes the graph G = (V, E) and the seed set size s as input and produces the top-s influential nodes. Step 1 of Algorithm 1 computes the maximal k-core subgraph using the algorithm of Batagelj et al. [1]. Then degree discount or LIR is applied on the k-core obtained in step 1. In step 3, the top-s nodes are selected based on their degree discount value or their degree (among the nodes with LI = 0). Step 1 of Algorithm 1 has time complexity O(|E|). Steps 2 and 3 of Algorithm 1 combined take O(s log |V| + |E|) for the degree discount heuristic and O(|E|) for LIR, respectively. So, the overall time complexity of the proposed Algorithm 1 is O(|E|).

Algorithm 1: k-core-LIR / Degree Discount
Input: G(V, E), s
Output: top-s seeds (seed set)
1 Compute the maximal k-core of G
2 Apply the LIR or Degree Discount heuristic on the k-core subgraph
3 Select the set of s nodes.
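A compact sketch of Algorithm 1 with the LIR heuristic, assuming networkx is available (its k_core routine implements the O(|E|) Batagelj-Zaversnik decomposition of step 1); the helper name and the degree tie-breaking are our reading of the LIR step:

```python
import networkx as nx

def local_index(H, v):
    """LI(v): the number of neighbours of v with a strictly higher degree."""
    return sum(1 for w in H.neighbors(v) if H.degree(w) > H.degree(v))

def kcore_lir(G, s, k=3):
    """Algorithm 1 (k-core-LIR): compute the maximal k-core, keep the
    core nodes with LI = 0, and return the top-s of them by degree."""
    H = nx.k_core(G, k)                                  # step 1
    li0 = [v for v in H if local_index(H, v) == 0]       # step 2
    li0.sort(key=H.degree, reverse=True)                 # step 3
    return li0[:s]
```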

5 Datasets

In this section, we provide information about the datasets. We use two benchmark datasets, ca-GrQc and ca-HepTh, from SNAP [11]. ca-GrQc is an undirected graph representing a collaboration network of Arxiv General Relativity, and ca-HepTh is also an undirected graph representing the research collaboration of scientists who have co-authored papers in the High Energy Physics category. Some of the fundamental properties of these two networks are given in Table 1.

Table 1 Properties of the two datasets

Name      Nodes  Edges  Average clustering coefficient  Average degree  Type
ca-GrQc   5242   14496  0.5296                          5.5             Undirected
ca-HepTh  9877   25998  0.4717                          5.2             Undirected


6 Experimental Results

Here, we discuss the experiments run and the results obtained. We compare the proposed methods with existing heuristics like degree [8], degree discount [2], generalized degree discount [21] and VoteRank [22]. Now, we discuss the experimental settings. The experiments are run on an Intel(R) Xeon(R) CPU with 64 GB of main memory. We use the IC model for the calculation of influence spread. We run each experiment 10,000 times and take the average. We set the influence probability p = 0.05. The total number of seeds, i.e., the size of the seed set, is 50. The p value for degree discount and generalized degree discount is set to 0.05. For the construction of the k-core, we follow the algorithm proposed by Batagelj et al. [1], and we fix k = 3 in the k-core algorithm. The results in Figs. 1 and 2 show the variation of the influence spread with varying seed set size s. The degree heuristic shows less spread on both datasets, and k-core-LIR shows the highest spread compared to the others. k-core degree discount also shows spread almost equal to the other heuristics like degree discount, generalized degree discount and VoteRank on both datasets. Generalized degree discount shows better spread than the degree heuristic on the ca-GrQc dataset, and it shows almost the same spread as the other heuristics like degree discount, VoteRank and k-core degree discount in the case of the ca-HepTh dataset. A significant increase in spread with the k-core-LIR method can be seen on both the ca-GrQc and ca-HepTh datasets, and this method is time-efficient in computing the seed set from the core. The results show that the proposed methods are more accurate and time-efficient.

Fig. 1 Number of influenced users versus seed set size on ca-GrQc dataset


Fig. 2 Number of influenced users versus seed set on ca-HepTh dataset

7 Conclusion

In this work, we have proposed a graph-theoretic method to find an efficient solution to the problem of influence maximization in social networks. Our approach is based on applying different topology-based algorithms like LIR and degree discount on the k-core of the social graph. As the size of the k-core is generally small in comparison to the social graph, our proposed methods are scalable. The experimental study on the two datasets ca-GrQc and ca-HepTh shows that our proposed method outputs highly influential seed nodes, resulting in a larger number of influenced users compared to other existing heuristics. Other types of subgraphs with special structures, like k-truss, k-clubs, k-clans, etc., can also be utilized to identify influential seed users in social networks. In the future, we plan to work with these structures along with various other influence propagation models, such as the LT model and SIR model, to find more efficient algorithms for the influence maximization problem.

References

1. V. Batagelj, M. Zaversnik, An O(m) algorithm for cores decomposition of networks (2003). arXiv preprint arXiv:cs/0310049
2. W. Chen, Y. Wang, S. Yang, Efficient influence maximization in social networks, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2009), pp. 199–208
3. P. Domingos, M. Richardson, Mining the network value of customers, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2001), pp. 57–66
4. J. Goldenberg, B. Libai, E. Muller, Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark. Lett. 12(3), 211–223 (2001)
5. J. Goldenberg, B. Libai, E. Muller, Using complex systems analysis to advance marketing theory development: modeling heterogeneity effects on new product growth through stochastic cellular automata. Acad. Mark. Sci. Rev. 9(3), 1–18 (2001)
6. A. Goyal, W. Lu, L.V. Lakshmanan, CELF++: optimizing the greedy algorithm for influence maximization in social networks, in Proceedings of the 20th International Conference Companion on World Wide Web (ACM, 2011), pp. 47–48
7. M. Granovetter, Threshold models of collective behavior. Am. J. Soc. 6, 1420–1443 (1978)
8. D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2003), pp. 137–146
9. M. Kitsak, L.K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H.E. Stanley, H.A. Makse, Identification of influential spreaders in complex networks. Nat. Phys. 6(11), 888 (2010)
10. J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, N. Glance, Cost-effective outbreak detection in networks, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2007), pp. 420–429
11. J. Leskovec, A. Krevl, SNAP datasets: Stanford large network dataset collection (2014). http://snap.stanford.edu/data
12. Y. Li, J. Fan, Y. Wang, K.L. Tan, Influence maximization on social graphs: a survey. IEEE Trans. Knowl. Data Eng. 30(10), 1852–1872 (2018)
13. D. Liu, Y. Jing, J. Zhao, W. Wang, G. Song, A fast and efficient algorithm for mining top-k nodes in complex networks. Sci. Rep. 7, 43330 (2017)
14. F.D. Malliaros, A.N. Papadopoulos, M. Vazirgiannis, Core decomposition in graphs: concepts, algorithms and applications, in EDBT (2016), pp. 720–721
15. F.D. Malliaros, M.E.G. Rossi, M. Vazirgiannis, Locating influential nodes in complex networks. Sci. Rep. 6, 19307 (2016)
16. S.K. Pal, S. Kundu, C. Murthy, Centrality measures, upper bound, and influence maximization in large scale directed social networks. Fundam. Inf. 130(3), 317–342 (2014)
17. M. Richardson, P. Domingos, Mining knowledge-sharing sites for viral marketing, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2002), pp. 61–70
18. T.C. Schelling, Micromotives and Macrobehavior. WW Norton & Company (2006)
19. S.B. Seidman, Network structure and minimum degree. Soc. Netw. 5(3), 269–287 (1983)
20. A. Sheikhahmadi, M.A. Nematbakhsh, A. Shokrollahi, Improving detection of influential nodes in complex networks. Physica A 436, 833–845 (2015)
21. X. Wang, X. Zhang, C. Zhao, D. Yi, Maximizing the spread of influence via generalized degree discount. PLoS One 11(10), e0164393 (2016)
22. J.X. Zhang, D.B. Chen, Q. Dong, Z.D. Zhao, Identifying a set of influential spreaders in complex networks. Sci. Rep. 6, 27823 (2016)
23. Y. Zheng, A survey: models, techniques and applications of influence maximization problem (2018)

Depression Anatomy Using Combinational Deep Neural Network Apeksha Rustagi, Chinkit Manchanda, Nikhil Sharma, and Ila Kaushik

Abstract Depression is a mood disorder that causes a persistent feeling of sadness and loss of interest in any activity. It is a major source of mental illness, associated with an increased risk of early death and an economic burden on a country. Traditional clinical analysis procedures are subjective, complex and need considerable contribution from professionals. The turn of the century saw incredible progress in using deep learning for medical diagnosis. However, predicting a person's mental state can be remarkably hard. In this paper, we present a Combinational Deep Neural Network (CDNN) for automated depression detection from facial images and text data, using an amalgamation of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). The prediction scores of the CNN and RNN models are then combined, and the level of depression is decided on the basis of the range of predefined depression-level scores. Simulation outcomes based on real-field measurements show that the proposed model can predict depression with superior performance. Keywords Depression · Artificial intelligence · Mental illness · Combinational deep neural network · CNN · RNN


1 Introduction

Mental illness is less visible than physical illness, but it is more common and no less difficult to bear. Depression is a serious medical illness that negatively affects how a person feels and the way one thinks. India is among the countries experiencing an extreme burden of psychological illness, in terms of the maximum years of life lost due to ill health or death, adjusted for the size of the population [1]. As stated in a WHO report, 6.5% of the population of India suffers from depression. General symptoms of depression are:

• Feelings of sadness and emptiness
• Losing interest in most common day-to-day activities and personal hobbies [2, 3]
• Anxiety, short temperament and irritation
• Difficulty in concentrating, thinking and even making choices

Depression is a curable disease. If depression is detected at an early stage, the duration of treatment is shortened [4]. Regrettably, the rate of access to treatment is surprisingly low. There are effective measures and treatments for curing depression, but there is a severe shortage of mental health staff, such as psychologists, doctors and psychiatrists, in the country. The major problem is that nearly two-thirds of patients do not seek help. The reasons for not seeking any treatment include not recognizing the symptoms of depression, the fear of going to a mental health specialist, the high cost of doctors and the shortage of doctors. Depression damages not only the mental health of a person but also physical fitness. It is associated with high blood pressure, back ache and diabetes [5]. Depression also increases the risk for heart patients by 67% and the risk of cancer by 50% [6]. Also, this psychological sickness affects the whole family, peers and other connections in the form of anxiety and mental strain. This gives us a reason to be motivated to help and cure the people suffering, and a purpose to invest in depression prevention and treatment. Keeping in mind the destructive effects of depression on people and on society as well, computer vision developers have suggested approaches based on vocal and non-vocal data for precise assessment of a person's depression level. The patterns of change in audio and visual data have been exploited for automated, contact-free analysis of depressive behaviours. The graph in Fig. 1 shows the ranks of eight countries on the basis of their depressed population.

Fig. 1 Depression ranks of countries

China has the highest percentage of depressed population at 12.25%, that is, almost one-eighth of its population suffers from depression. India is the third leading country in depression with 6.5%, that is, 8.7 crore people out of its 133.92 crore population suffer from depression [7]. Many machine learning algorithms which classify depression have already been proposed [8]. These approaches treat depression as a classification problem, differentiating between patients' depression levels [9]. These algorithms face a dataset disproportion problem. Around 300 million people out of 7.7 billion people
face depression, and this low frequency of depression among the general population leads to imbalanced datasets. Some authors have also proposed that the features of speech and voice of depressed people differ from the features of non-depressed/healthy people. Visual data convey important features like facial expressions, head pose, body movement and eye blinks, which also differ between depressed and non-depressed people. But building models on this information also fails at times because of data disproportion issues. Although humans will probably always be better at understanding emotions than machines, machines are also gaining experience based on their own assets. Also, the motive is not to have a competition between humans and machines but to make the machines learn from humans. Emotional Artificial Intelligence (EAI) is a subclass of artificial intelligence which deals with human emotions. It is an upcoming research topic widely used for sentiment analysis involving human emotions. To help increase the rate of access to mental health services, it is essential to use advanced technology and active measures. Moreover, people should be more conscious about their emotional and mental health. There should be an effective and easily approachable depression detection system made available on a platform which is easily accessed by the majority of people, that is, the internet. According to various surveys conducted worldwide, it has been reported that Twitter is one of the most popular social networks in the world, and people use it as a means of sharing thoughts, beliefs and feelings as well as their life events. Therefore, we chose Twitter as the platform for building the depression detection system. Summing up, the goal of this research is to deliver a depression detection tool on Twitter, the most popular social network, by using Natural Language Processing (NLP), Convolutional Neural Network
(CNN) and Recurrent Neural Network (RNN) techniques for emotional analysis, and to construct the depression detection algorithm [10]. The problem of disproportionate datasets can be resolved by refining the datasets and bringing them into almost equal shapes. CNN is one of the most widely used techniques for image classification, with the desired number of convolutional layers. Twitter also provides a good dataset for text classification because of the limit on the number of characters allowed in a single tweet. RNN with Long Short-Term Memory (LSTM) and NLP can be used for the classification of texts. The collective result from both algorithms provides a better, reformed version of the emotional AI (EAI) algorithms used so far. In this paper, we applied EAI to tweet data which is divided into depressed and non-depressed emotion classes. An RNN with LSTM-based model is used for prediction on the data obtained from the user. The speech-to-text library of Python is used for obtaining vocal data from the user, which is converted into text data for prediction by the model. The refined image dataset is used for training the model for classification of images between the two classes using a CNN-based model. The combined result from both models is used for predicting the emotional state and level of depression of the user.
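The paper does not name the specific speech-to-text library it uses; the following sketch assumes the widely used SpeechRecognition package and a microphone source:

```python
import speech_recognition as sr   # the SpeechRecognition package (an assumption)

def vocal_input_to_text():
    """Record speech from the microphone and convert it to text, so that the
    RNN/LSTM text model can score it; the specific library choice here is
    illustrative, since the paper only says a Python library is used."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    try:
        return recognizer.recognize_google(audio)   # free Google Web Speech API
    except sr.UnknownValueError:                    # speech was unintelligible
        return ""
```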

2 Literature Review

Healthcare is not just about how you are doing physically, but also about how well your mental state is. A lot of people who are not doing well psychologically tend to form a pattern of actions in their day-to-day activities, the first being their choice of words, social media activities, searches, etc. The methods embraced for diagnosis are personality inventories, psychological tests, clinical examination, brain scanning, etc. Expert consultations: This practice is performed by skilled mental health professionals. The consultant requires solid knowledge of depression signs and symptoms along with observational skills. It is a talking treatment that involves proficient professionals guiding the patient in the right direction. This practice can also be led by other mental health experts, but this method is time consuming. Gadgets like smartphones, laptops, etc., can be useful for collecting the user's behavioural data, which reflects the mental condition of the user. As youngsters use various applications and perform certain activities, they leave digital footprints that might offer signs about their psychological well-being. Specialists say likely signs include variations in writing speed, voice quality, word choice, etc. A vast range of work has observed user behaviour or mental state using the data collected by smartphones, including detecting whether the user is depressed or not [11]. Depression has been predicted from mobile data using features extracted from GPS location tracking, SMS, Google searches, social media activities, etc. Earlier research used speech to observe and identify depression. The features of acoustic speech have recently been examined as conceivable signs of depression
in grownups. The properties of depression reflected in the speech production system make speech a viable feature for depression detection. Cannizzaro et al. [12] studied the connection between depression and speech through statistical analysis of different factors of speech. Acoustic speech has different variables, which include speaking rate (words per minute), percent pause time and pitch disparity. Speaking rate and pitch disparity showed strong interdependence with depression. Besides the study of speech for detecting depression, there is research that studies writing for depression detection, covering the syntactic construction and semantic content produced by an individual with depression. There are different psychological concepts which chain semantic factors to depression detection. Beck et al.'s [13] concept of depression propounds that individuals inclined to depression have a depressive schema, which results in seeing the world from a negative perspective, not appreciating anything and being isolated from everything. These schemata, once triggered, give rise to intensified depressive, conflicting and traumatic behaviour. De Choudhury et al. [14] studied how effectively social media would be able to perceive depression. Social media generate an opportunity to analyse social network data for users' states of mind and thoughts, to study their moods and attitudes when they interact via social media applications. The dependent variables in the data, such as social activities, sentiments, choice of words, etc., were fetched from Twitter. Tweets showing self-assessed depressive aspects help in recognizing depression beforehand and make it possible for parents, specialists and individuals to examine posts for linguistic cues that signal deteriorating mental well-being. The model developed in this research predicted depression with 70% accuracy. Wang et al. [15] presented a framework to generate probabilistic appearance outlines from video data to work on depression detection. To detect depression from video data, it initially detects significant facial landmarks to depict facial appearance variation and calculates the outline variations of regions defined by different landmarks, which are further used to train a support vector machine classifier model. After that, a Bayesian estimation scheme is applied to the facial data from the videos to generate a probabilistic outline for the facial landmarks. Examining the outlines of the facial landmarks, the outcome shows that there is a difference between the expressions of depressed and non-depressed individuals. Zhu et al. [16] proposed a Deep Convolutional Neural Network (DCNN)-based method for the prediction of depression from video data. DCNNs are most commonly used for analysing visual image data and achieved a superior result here. The presented model comprises two parallel CNNs: an expression DCNN to extract facial features and a dynamic DCNN to extract dynamic features by calculating the visual drift among a certain number of consecutive frames, both predicting the scale of depression. At the end of their DCNN model, to merge the results of both CNNs (expression and dynamic), two fully connected layers are implemented.


3 Dataset

The facial image dataset is obtained by modifying the open-source FER2018 data. We had a training dataset, a test dataset (used as the validation dataset for our project), and a private dataset (the same size as the test dataset, used to evaluate prediction performance). Note that the provided dataset (both training and test) contains six categories in total: Angry, Surprise, Happy, Sad, Disgust, and Neutral. The main problem is that our research requires only two categories, depressed and non-depressed images. As a solution, we grouped four of the above categories into the non-depressed dataset and two into the depressed dataset. After further image selection, we ended up with around 6000 images in each of the two categories. The text data consisted of tweets collected using the Twitter API. Around 10,000 tweets were collected and segregated into training and test datasets with a split ratio of 80:20. Two lists of tokens (words) were compiled, one for each dataset: the training list consisted of words signifying depressive inclinations such as 'depressed', 'suicide', and 'self-harm', while for the test dataset random tweets were collected, including both positive and negative features.

Our Approach

Distribution learning is a framework that allows us to assign a distribution label to an entity rather than several discrete labels [17]. When a model learns the distribution over a label space for a sample, it captures the degree of importance of each label in that space [18]; this method can therefore improve a model's predictive accuracy. Distribution learning is widely used in problems such as emotion recognition [19] and age estimation [20]. Here, we divided our task into two sections: the first comprises the classification of facial expressions into the depressed or non-depressed class, and the second comprises the classification of text data received from the user into the same two classes [21]. The combined prediction from both sections is used to determine the user's level of depression among the three predefined levels.
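As one concrete reading of this data preparation, the relabelling and split could look like the sketch below. The grouping of Sad and Disgust into the depressed class, the column names, and the file name are assumptions for illustration; the paper does not state which two emotions form the depressed set.

```python
# Sketch of the FER relabelling and 80:20 split described above.
# 'fer2018.csv' and the 'emotion' column are hypothetical names.
import pandas as pd
from sklearn.model_selection import train_test_split

DEPRESSED = {'Sad', 'Disgust'}  # assumed grouping of two categories

df = pd.read_csv('fer2018.csv')
# binary relabelling: 1 = depressed, 0 = non-depressed
df['label'] = df['emotion'].map(lambda e: 1 if e in DEPRESSED else 0)

# 80:20 split, as used for the tweet corpus
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```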

4 Model Architecture

The first section of the model classifies the facial images of a person into the depressed and non-depressed classes. The CNN [22] is a well-known class of DNNs specializing in image classification: it takes an image as input, assigns learnable weights and biases to various attributes in the image, and uses them to distinguish one class from another [23]. The architecture of a ConvNet is loosely analogous to the connectivity patterns of the human brain. The purpose of using a convolutional neural network is to reduce the images into a form that

Depression Anatomy Using Combinational Deep Neural Network

25

is easier to process while retaining the features that are crucial for a good prediction [24]. As shown in Fig. 2, the input is a coloured (32, 32) image from the training dataset, which is first grey-scaled and passed into the convolutional neural network with three convolutional layers and fully connected layers, producing an output from one of the two classes [25]. We start with a set of three convolutional layers, each followed by a max-pooling layer; the 'relu' activation function is used for all three layers, with pool size (2, 2) in the max-pooling layers. The number of features captured by the convolutional stack increases from 32 to 128, as such a hierarchical structure (with increasing layer widths) has been proposed to perform better for deep neural networks. Finally, the convolved output is flattened and passed through two more dense layers to reach the output layer, where the 'Softmax' activation function is used for the (two-class) classification. Table 1 shows the number of trainable parameters in each layer of the convolutional network; we obtain a total of 37,218 parameters, trained on around 12,000 images for depression detection from images. The second section of the model comprises the text classification, for which an RNN [26] with LSTM units is used. Depression is a state of mind that cannot be predicted from a single text from a person; predicting it requires keeping in mind the previous conversations and inputs [27]. Although we do not yet fully understand how the brain works, it is assumed to contain a logic unit and a memory unit, with decisions made on the basis of reasoning and experience. For an algorithm to do the same, we must provide it with memory, which is the purpose of using an RNN [28]. A general feed-forward neural network memorizes what it learnt during training and generates outputs, whereas an RNN additionally learns from past inputs and uses them when generating outputs.

Fig. 2 CNN architecture for depression detection

Table 1 Summary for CNN model

Layer (type)        Output shape          Number of parameters
Conv2d_1            (None, 26, 26, 32)    896
Max_pooling2d_1     (None, 13, 13, 32)    0
Conv2d_2            (None, 11, 11, 32)    9248
Max_pooling2d_2     (None, 5, 5, 32)      0
Conv2d_3            (None, 3, 3, 64)      18496
Max_pooling2d_3     (None, 1, 1, 64)      0
Flatten_1           (None, 64)            0
Dense_1             (None, 128)           8320
Dense_2             (None, 2)             258
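As a minimal sketch, the CNN summarized in Table 1 can be assembled in Keras as follows. The input shape (28, 28, 3) is an assumption chosen so that the output shapes and parameter counts reproduce Table 1 exactly (the text mentions (32, 32) grey-scaled input, so this is one consistent reading, not the authors' exact code):

```python
# Keras sketch reproducing the layer summary in Table 1.
from tensorflow.keras import layers, models

def build_cnn():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu',
                      input_shape=(28, 28, 3)),   # -> (26, 26, 32), 896 params
        layers.MaxPooling2D(pool_size=(2, 2)),    # -> (13, 13, 32)
        layers.Conv2D(32, (3, 3), activation='relu'),  # -> (11, 11, 32), 9248 params
        layers.MaxPooling2D(pool_size=(2, 2)),    # -> (5, 5, 32)
        layers.Conv2D(64, (3, 3), activation='relu'),  # -> (3, 3, 64), 18496 params
        layers.MaxPooling2D(pool_size=(2, 2)),    # -> (1, 1, 64)
        layers.Flatten(),                         # -> (64,)
        layers.Dense(128, activation='relu'),     # 8320 params
        layers.Dense(2, activation='softmax'),    # depressed / non-depressed, 258 params
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

build_cnn().summary()  # total trainable parameters: 37,218
```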

For example, a vanilla feed-forward network learns what a '1' looks like and then applies that learning to classify all inputs, whereas an RNN classifies later outputs on the basis of current knowledge (training) as well as past knowledge (previous inputs). In a general feed-forward neural network, a fixed-size input vector is processed and converted into a fixed-size output vector; when these transformations are applied to a series of input vectors, the network becomes a recurrent network with varying input size and higher accuracy. In practice, however, RNNs suffer from two difficulties that make the plain form unfit for use: the vanishing gradient problem and the exploding gradient problem [29]. This is where LSTMs come in. The LSTM introduces a memory unit called the 'cell' into the network: the decision is made after considering the current input, the previous output, and the previous memory, after which a new output is generated and the old memory is updated. Figure 3 illustrates the reason for using an RNN with an LSTM network, comparing its accuracy against the other candidate networks, and Fig. 4 illustrates the working of a recurrent neural network, using the past outputs at time intervals (t − 1) and (t) for prediction at time interval (t + 1). On receiving the datasets, the data is pre-processed, which includes removing duplicates, word tokenization, removing stop words, and expanding contractions. All inputs to a neural network must be of the same length, so the length of the longest sentence is stored; the words are converted into tokens, and sentences shorter than the maximum length are padded with the value '0' at the end. An LSTM Embedding layer is then added; embedding solves the major problem of sparse input data by mapping the high-dimensional data to lower dimensions. The model is compiled with the 'categorical_crossentropy' loss function and the 'adam' optimizer. Table 2 shows the number of trainable parameters in each layer of the recurrent neural network; the total for the text model is 511,194 parameters. In this paper, the captured image of the user is the input to the image prediction model, and the text obtained from the user's answers to the system's questionnaire is the input to the text prediction model. The combined prediction scores of the two models are averaged, and the level of depression is decided by the range of predefined depression-level scores into which the averaged score falls.


Fig. 3 Accuracy versus Epoch graph for text analysis

Fig. 4 Sequence of RNN



Table 2 Summary for RNN model

Layer (type)           Output shape         Number of parameters
Embedding_1            (None, 2573, 128)    256000
Spatial_dropout1d_1    (None, 2573, 128)    0
Lstm_1                 (None, 196)          254800
Dense_3                (None, 2)            394
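A minimal Keras sketch of the text branch summarized in Table 2 follows. The vocabulary size of 2000 and sequence length of 2573 are inferred from the embedding parameter count (2000 × 128 = 256,000) and the reported output shapes; the dropout rate is an assumption:

```python
# Keras sketch of the RNN/LSTM text model in Table 2.
from tensorflow.keras import layers, models

max_features, embed_dim, maxlen = 2000, 128, 2573  # inferred from Table 2

def build_rnn():
    model = models.Sequential([
        layers.Embedding(max_features, embed_dim,
                         input_length=maxlen),     # 256,000 params
        layers.SpatialDropout1D(0.4),              # rate is an assumption
        layers.LSTM(196),                          # 254,800 params
        layers.Dense(2, activation='softmax'),     # 394 params
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
```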

Fig. 5 Comparison graph for three approaches

Figure 5 supports the claim that our approach outperforms previously proposed approaches to the problem: it shows the accuracy obtained when using only facial expressions for depression prediction (blue), prediction from text alone (green), and prediction from combined text and facial images (mustard).

Proposed Algorithm

1. Data collection: the csv file is converted to images, and image selection is done on the basis of the factors mentioned earlier in the paper.
2. A CNN model is prepared using three convolutional layers; the images are greyscaled and the model is trained for 300 epochs. Here we take a small matrix of numbers (a filter), pass it over the image, and transform the image on the basis of the filter values:


G[m, n] = (f ∗ h)[m, n] = Σ_j Σ_k h[j, k] f[m − j, n − k]    (1)

The successive feature map values are determined by the above expression, where f is the input image and h is the kernel; the row and column indexes of the resulting matrix are denoted by m and n, respectively. By the rules of convolution, the filter and the image must have the same number of channels. To apply several filters to an image, we convolve each filter with the image separately, stack the results, and combine them into one output, whose shape follows

[n, n, n_C] ∗ [f, f, n_C] = [⌊(n + 2P − f)/s⌋ + 1, ⌊(n + 2P − f)/s⌋ + 1, n_f]    (2)

where n = image size, f = filter size, n_C = number of channels in the image, P = padding used, s = stride used, and n_f = number of filters.

3. An RNN model with LSTM units is prepared for the text-based predictions and trained for 25 epochs. The gates in an LSTM are sigmoid activation functions with outputs between 0 (gate closed) and 1 (gate open). The LSTM gate equations are

i_t = σ(w_i [h_{t−1}, x_t] + b_i)    (3)

which stores information in the cell state,

f_t = σ(w_f [h_{t−1}, x_t] + b_f)    (4)

which determines the information to be discarded from the cell state, and

o_t = σ(w_o [h_{t−1}, x_t] + b_o)    (5)

the output gate, which provides the activation for the final output of the LSTM block. Here i_t is the input gate, f_t the forget gate, o_t the output gate, σ the sigmoid function, w_x the weights of the gate (x) neurons, b_x the biases of gate (x), h_{t−1} the output of the previous block at timestamp (t − 1), and x_t the input at timestamp (t). The final output of the block is h_t = o_t ∗ tanh(c_t), where c_t is the cell state (memory) at timestamp (t).
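To make Eqs. (3)-(5) concrete, the following toy NumPy step applies the three gates to one input. The candidate-cell computation, which the paper does not write out, is included so the update is complete; all weight and bias containers are placeholders:

```python
# Toy single-step LSTM rendering of Eqs. (3)-(5); weights are placeholders.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    i_t = sigmoid(w['i'] @ z + b['i'])       # input gate, Eq. (3)
    f_t = sigmoid(w['f'] @ z + b['f'])       # forget gate, Eq. (4)
    o_t = sigmoid(w['o'] @ z + b['o'])       # output gate, Eq. (5)
    c_hat = np.tanh(w['c'] @ z + b['c'])     # candidate cell state (not in the paper)
    c_t = f_t * c_prev + i_t * c_hat         # updated memory
    h_t = o_t * np.tanh(c_t)                 # block output h_t = o_t * tanh(c_t)
    return h_t, c_t
```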


4. Both models are saved with the '.h5' extension.
5. The inputs are collected from the user: an image captured using OpenCV and text gathered through the questionnaire. Passing these to the respective models gives two outputs, which are label encoded and used to calculate the final output as

p_f = (p_i + p_t) / 2    (6)

where p_f is the final prediction, p_i the prediction from the image, and p_t the prediction from the text; p_f = 0 indicates no depression, p_f = 0.5 medium depression, and p_f = 1 high depression.
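A minimal sketch of steps 4-5 and Eq. (6) is given below; the model file names and the assumption that each model's class prediction is taken via argmax are illustrative, not the authors' exact code:

```python
# Fuse the two saved models' predictions as in Eq. (6).
import numpy as np
from tensorflow.keras.models import load_model

# Hypothetical file names for the models saved in step 4.
image_model = load_model('cnn_depression.h5')
text_model = load_model('rnn_depression.h5')

def fuse(face_batch, text_batch):
    """Average the label-encoded class predictions, Eq. (6)."""
    p_i = int(np.argmax(image_model.predict(face_batch), axis=-1)[0])
    p_t = int(np.argmax(text_model.predict(text_batch), axis=-1)[0])
    p_f = (p_i + p_t) / 2
    levels = {0.0: 'no depression', 0.5: 'medium depression',
              1.0: 'high depression'}
    return p_f, levels[p_f]
```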

5 Result

Figure 6 shows some of the questions asked by the machine to collect the user's answers as input for the RNN-based text model.

Fig. 6 Question–Answer between machine and user

The result is shown in Fig. 7: the image of the user along with the level of depression predicted by each model and the final calculated level of depression of the user.

Fig. 7 Image captured with final predictions

6 Conclusion and Future Scope

Motivated by the imbalance in depression datasets, which mirrors the uneven prevalence of depression in the population, we have built an automated depression detection model that can be easily accessed by most people. In this paper, a combined learning architecture of a CNN and an RNN is presented for automated depression detection. Although training was divided into two sections, the prediction is made after combining the scores from both models. This division makes it possible to discover the connection between facial images, text data, and depression levels, and leaves room to improve the model's accuracy with expert suggestions in the future. Supervised classification alone cannot reach human-level accuracy when predicting depression from text data, so facial image data are used to obtain a better result and higher model accuracy. Experiments on the public FER2018 data and Twitter datasets show that the proposed method yields interesting results in comparison with related work in this field, suggesting that treating depression as a combination of a visible and a mental illness holds significance. The data used in this research are limited, and further experiments can be done with more data. In the future, we can also work on other features for depression detection, such as features extracted from speech. The model can be converted into a smartphone application to increase its reach, and features for handling different stages of depression can be added to provide the needed help.


References

1. M.S. Neethu, R. Rajasree, Sentiment analysis in Twitter using machine learning techniques, in Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT) (2013)
2. S. Meystre, P.J. Haug, Natural language processing to extract medical problems from electronic clinical documents: performance evaluation. J. Biomed. Inf. 39(6) (2006)
3. M. Desai, M.A. Mehta, Techniques for sentiment analysis of Twitter data: a comprehensive survey, in International Conference on Computing, Communication and Automation (ICCCA) (2016)
4. D.J. Conti, W.N. Burton, The economic impact of depression in the workplace. J. Occup. Med. 36, 983–988 (1994)
5. Mental Health Foundation, Physical health and mental health
6. T. Kongsuk, S. Supanya, K. Kenbubpha, S. Phimtra, S. Sukhawaha, J. Leejongpermpoon, Services for depression and suicide in Thailand. WHO South-East Asia J. Public Health 6(1), 34–38 (2017)
7. A. Halfin, Depression: the benefits of early and appropriate treatment, pp. 92–97 (2007)
8. L. Canzian, M. Musolesi, Trajectories of depression: unobtrusive monitoring of depressive states by means of smartphone mobility traces analysis, in Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing (ACM, 2015)
9. R. LiKamWa et al., MoodScope: building a mood sensor from smartphone usage patterns, in Proceedings of ACM MobiSys (2013)
10. Ö. Yürür et al., Context-awareness for mobile sensing: a survey and future directions. IEEE Commun. Surv. Tutor. 18(1), 68–93 (2016)
11. S. Saeb et al., Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: an exploratory study. J. Med. Internet Res. 17(7) (2015)
12. M. Cannizzaro, B. Harel, N. Reilly, P. Chappell, P.J. Snyder, Voice acoustical measurement of the severity of major depression. Brain Cogn. 56(1), 30–35 (2004)
13. A.T. Beck, Depression: Clinical, Experimental, and Theoretical Aspects (University of Pennsylvania Press, 1967)
14. M. De Choudhury, M. Gamon, Predicting depression via social media, in Proceedings of the Seventh International AAAI Conference on Weblogs and Social Media, 128–137 (2013)
15. P. Wang, F. Barrett, E. Martin, M. Milonova, R.E. Gur, R.C. Gur, C. Kohler, R. Verma, Automated video-based facial expression analysis of neuropsychiatric disorders. J. Neurosci. Methods 168(1), 224–238 (2008)
16. Y. Zhu, Y. Shang, Z. Shao, G. Guo, Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Trans. Affect. Comput. (2017)
17. M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie, On the learnability of discrete distributions, in ACM Symposium on Theory of Computing (1994)
18. C. Manchanda, R. Rathi, N. Sharma, Traffic density investigation & road accident analysis in India using deep learning, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974528
19. X. Geng, C. Yin, Z. Zhou, Facial age estimation by learning from label distributions. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2401–2412 (2013)
20. Y. Zhou, H. Xue, X. Geng, Emotion distribution recognition from facial expressions, in ACM Multimedia (2015)
21. M. Chakarverti, N. Sharma, R.R. Divivedi, Prediction analysis techniques of data mining: a review. SSRN Electron. J. (2019). https://doi.org/10.2139/ssrn.3350303
22. P. Ray, A. Chakrabarti, A mixed approach of deep learning method and rule-based method to improve aspect level sentiment analysis. Appl. Comput. Inf. (2019)
23. M. Grover, B. Verma, N. Sharma, I. Kaushik, Traffic control using V-2-V based method using reinforcement learning, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974540


24. L. Zhang, S. Wang, B. Liu, Deep learning for sentiment analysis: a survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8(4), e1253 (2018)
25. J. Deriu, M. Gonzenbach, F. Uzdilli, A. Lucchi, V. De Luca, M. Jaggi, SwissCheese at SemEval-2016 Task 4: sentiment classification using an ensemble of convolutional neural networks with distant supervision, in Proceedings of the 10th International Workshop on Semantic Evaluation (2016), pp. 1124–1128
26. M. Harjani, M. Grover, N. Sharma, I. Kaushik, Analysis of various machine learning algorithm for cardiac pulse prediction, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974519
27. Y. Yin, Y. Song, M. Zhang, NNEMBs at SemEval-2017 Task 4: neural Twitter sentiment classification: a simple ensemble method with different embeddings, in Proceedings of the 11th International Workshop on Semantic Evaluation (2017), pp. 621–625
28. R. Tiwari, N. Sharma, I. Kaushik, A. Tiwari, B. Bhushan, Evolution of IoT & data analytics using deep learning, in 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974481
29. H. Pan, H. Han, S. Shan, X. Chen, Mean-variance loss for deep age estimation from a face, in CVPR (2018)

A Hybrid Cost-Effective Genetic and Firefly Algorithm for Workflow Scheduling in Cloud Ishadeep Kaur and P. S. Mann

Abstract Cloud computing is developing as a new platform that delivers high-quality information over the Internet at very low cost. It still has numerous concerns that need to be addressed, and workflow scheduling is among the most serious of them. In this paper, we propose a hybrid Cost-Effective Genetic and Firefly Algorithm (CEFA) for workflow scheduling in cloud computing. In the existing approach, the number of iterations is very large, which increases the total execution cost and time; the proposed algorithm optimizes both. The performance is estimated on scientific workflows, and the results show that the proposed algorithm performs better than the existing algorithm. Three parameters are used to compare the performance of the existing and proposed algorithms: (1) execution time, (2) execution cost, and (3) termination delay. Keywords Cloud computing · Genetic algorithm · Workflow scheduling · Firefly algorithm · Execution time · Execution cost · Termination delay

1 Introduction

Recent movements in cloud infrastructure are pushing providers to offer services through increasingly versatile and organized systems. Cloud computing is an emerging paradigm that relies on a pay-per-use model: applications, information, bandwidth, and IT services are provided over the Internet. The goal of cloud service providers is to utilize resources efficiently and achieve the greatest possible benefit. The steady evolution of cloud computing in IT

I. Kaur (B), Department of Computer Science and Engineering, DAVIET, Jalandhar, India, e-mail: [email protected]
P. S. Mann, Department of Information Technology, DAVIET, Jalandhar, India, e-mail: [email protected]


has led to several formal definitions. The US National Institute of Standards and Technology (NIST) defines cloud computing as [1]: "Cloud computing is a model enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." Well over a hundred million computing devices are connected to the Internet, and a considerable number of them use cloud computing services daily, because the cloud provides a flexible and easy way to keep and retrieve data and files [2]. Cloud computing is a promising technology that allows customers to pay as they go, and it enables the hosting of pervasive applications from consumer, scientific, and business domains. Cloud computing offers utility-oriented IT services to users around the globe, and the growing cost of tuning and managing computer systems is driving the outsourcing of business services to hosting centres. The features of the cloud framework include on-demand self-service, broad network access, resource pooling, and rapid elasticity; on-demand self-service means that clients (typically organizations) can request and manage their own computing resources. The cloud framework combines computing technology with computing resources: it pools diverse assets in a hybrid structure that delivers what is needed over the web as services, along with the hardware and system software required to support those services.

2 Related Work

There are several studies of workflow scheduling in cloud computing. The author of [1] proposed a genetic algorithm approach for scheduling workflow applications that minimizes cost while meeting a user's deadline constraint, or minimizes execution time while meeting the user's budget. The proposed algorithm evaluates the fitness function in two parts, cost fitness and time fitness, and solves budget- and deadline-constrained optimization problems; the results show that the genetic algorithm handles complex workflow structures well. In paper [2], the author proposed a new heuristic task-scheduling algorithm that embeds a fast technique named Elitism Stepping into the Genetic Algorithm, with the objective of reducing the schedule length within an acceptable computational time. The algorithm sorts the tasks in execution order according to the bottom level, which reduces the finish time. Compared with BGA, the proposed algorithm obtained a better schedule length (finish time), and the results show a significant improvement in computation time. The author of [3] surveyed existing workflow scheduling algorithms in cloud computing and tabulated their parameters along with the associated tools, concluding that existing workflow scheduling algorithms do not consider reliability and availability; there is therefore a need for a workflow scheduling algorithm that can improve


the availability and reliability in the cloud environment. The author of [4] presented a scheduling technique based on a relatively new swarm-based approach known as Cat Swarm Optimization (CSO), which improves on PSO in terms of convergence speed. By using its seeking and tracing modes, the algorithm reduces wasted effort and obtains a solution in far fewer iterations; the work targets minimization of the total cost, a minimum number of iterations, and fair distribution of workload, and shows that CSO gives better results than PSO in terms of execution time and computation time. The paper [5] presents a novel hybrid algorithm named ACO–FA, which joins Ant Colony Optimization (ACO) with a Firefly Algorithm (FA) to solve unconstrained optimization problems. The hybrid combines the merits of both ACO and FA: the algorithm is initialized with a set of random ants roaming through the search space, its procedural simplicity makes it suitable for complex real-world problems, and it efficiently overcomes the drawback of the classical ant colony algorithm, which is not suitable for continuous optimization. The author of [6] proposed MPQGA, which produces multiple priority queues using heuristic-based crossover and mutation operators in order to reduce the makespan. The work uses an integer-string-coded genetic algorithm employing roulette-wheel selection and elitism, and exploits the HEFT heuristic to find a better result: the highest-priority task, determined by the upward rank, is mapped onto the processor that gives the smallest EFT. It produces a set of multiple priority queues based on downward rank and on a combination of level, upward rank, and downward rank for the initial population, with the remaining priority queues chosen randomly; these three heuristic methods generate good seeds spread uniformly over the feasible solution space, so that no stone is left unturned, and the algorithm covers a larger search space than a deterministic algorithm without much extra cost. In paper [7], the author presented a deadline-constrained heuristic-based genetic algorithm for scheduling applications on the cloud that decreases execution cost while meeting the deadline; each task is assigned a priority via its bottom level and top level. The algorithm is compared with SGA under the same deadline constraint and pricing model, and the simulation results show promising performance; the evaluation uses synthetic workflows such as Montage, LIGO, Epigenomics, and CyberShake. The author of [8] presented a hybrid approach to workflow scheduling in heterogeneous computing systems, which combines the benefits of a heuristic algorithm and a metaheuristic algorithm by modifying the genetic operators. In paper [9], the author suggested a genetic algorithm for multicore processors whose main objective is to reduce the makespan and raise the speed-up ratio, with the Weighted Sum Approach (WSA) used to calculate the fitness function. The simulation results show that the suggested algorithm performs better than the current algorithm and is very efficient and effective at improving the overall performance of the computing system. It uses the HEFT heuristic, which outperforms other list-based heuristics in robustness and makespan, for the initial seed; the HEFT heuristic guides the algorithm in improving


the performance, and as a result it converges faster than a random initial population. It uses a direct representation for the chromosome, with each chromosome consisting of two parts. Elitism helps maintain quality by copying the best chromosomes from one iteration to the next, and the twofold genetic operators, crossover and mutation, help optimize the fundamental objective (minimizing makespan) in less time; the approach also balances the load during execution and produces a smaller makespan by reassigning tasks on a multicore processor. In paper [10], the author proposed the RTEAH algorithm, which improves scheduling by decreasing the makespan, waiting time, and burst time while managing the load on the processors. First, the Round Robin algorithm is hybridized with the Throttled algorithm, which gives more flexibility than either algorithm alone; it is then hybridized with the ESCE algorithm, which reduces the waiting and burst times, and finally merged with the ABCO algorithm to overcome the index-table updating problem. The RTEAH algorithm therefore performs better while also managing the load. In paper [11], a novel workflow scheduling method is introduced: a fuzzy dominance sort based heterogeneous earliest finish time (FDHEFT) algorithm, which merges the fuzzy dominance sort mechanism with list-based scheduling. The proposed algorithm performs better than the existing algorithms and also minimizes CPU runtime. The algorithm proposed in paper [12] is GAAPI, a hybridization of the Genetic Algorithm and Ant Colony Optimization. It addresses the problem of getting trapped between local and global optima, as well as the lack of an advanced search capability, by maintaining a balance between exploration and exploitation: the Genetic Algorithm drives the solution search while API tempers its convergence, increasing the chance of faster convergence towards the global optimum. The proposed algorithm is compared with PSO and GA and performs better. The author of [13] describes an IIoT-based health monitoring framework in which smartphones or desktops connected via Bluetooth can continuously monitor a person's body with the help of ECG signals; if any disorder is detected, the information is safely sent to healthcare professionals, helping to avoid preventable deaths. The service is integrated with the cloud for secure, safe, high-quality data transmission and to maintain patient privacy.

3 Proposed Approach

The main objective of this paper is to optimize the results of the cost-effective genetic algorithm by hybridizing it with the firefly algorithm: PEFT-generated solutions seed the initial population of the firefly algorithm, the firefly algorithm optimizes the solution, and the firefly-optimized solution is then provided to the genetic algorithm for further optimization, thereby giving better results in terms of termination delay, finish time, and execution cost. The PEFT algorithm is chosen because it is the first list-based heuristic that


has outperformed HEFT, which was previously the best in terms of makespan and efficiency. The working of the PEFT algorithm is explained below.

Algorithm 1: PEFT
1. Calculate the values of the OCT matrix according to Eq. (1), which assigns the cost of executing all the jobs:

OCT(t_i, p_k) = max_{t_j ∈ succ(t_i)} [ min_{p_w ∈ P} { OCT(t_j, p_w) + w(t_j, p_w) + c̄_{i,j} } ]    (1)

where c̄_{i,j} = 0 if p_w = p_k.
2. Compute the OCT of each node; the computed OCT gives the rank of every job (rank_OCT) using Eq. (2):

rank_OCT(t_i) = ( Σ_{k=1}^{P} OCT(t_i, p_k) ) / P    (2)

3. Repeat until all jobs are assigned to the desired resources:
   a. Calculate the optimistic earliest finish time (OEFT) using Eq. (3):

OEFT(t_i, p_j) = EFT(t_i, p_j) + OCT(t_i, p_j)    (3)

   b. Assign each job to the processor that gives the least OEFT.
4. Return the optimal solution.
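A compact Python sketch of the OCT recursion and rank computation in Eqs. (1)-(2) is given below, assuming the task graph is encoded as dictionaries succ, w, and c; this illustrates the PEFT equations only, not the authors' implementation (which was simulated in WorkflowSim):

```python
# Sketch of Eqs. (1)-(2): OCT recursion and rank_OCT.
from functools import lru_cache

def make_oct(succ, w, c, processors):
    """succ[t]: successor tasks; w[t][p]: execution cost; c[i][j]: comm cost."""
    @lru_cache(maxsize=None)
    def oct_(ti, pk):
        if not succ[ti]:                     # exit task: OCT = 0
            return 0.0
        return max(
            min(oct_(tj, pw) + w[tj][pw]
                + (0.0 if pw == pk else c[ti][tj])   # c̄ = 0 on the same processor
                for pw in processors)
            for tj in succ[ti])
    return oct_

def rank_oct(ti, oct_, processors):
    # Eq. (2): average OCT of task ti over all processors
    return sum(oct_(ti, pk) for pk in processors) / len(processors)
```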

The optimal solution obtained from PEFT is used in the initial population of the firefly algorithm: it acts as the first solution in the firefly population, and the remaining solutions are generated randomly. The fitness, which is the attractiveness of a firefly based on its light intensity, is then calculated, and the firefly population is updated on that basis.

Algorithm 2: Firefly Optimization
1. Initialize the firefly population using the prioritized solution of the PEFT algorithm, which computes the initial solution from the OCT table and the optimistic earliest finish time (OEFT).
2. Repeat steps a to c until the termination condition is met:
   a. Calculate the relative distance and attractiveness between the fireflies in the population.
   b. Update the light intensity of the fireflies as determined by the objective function.
   c. Order the fireflies and update their positions.
3. Return the best optimal solution.
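The firefly move in step 2 is commonly implemented with the standard update x_i ← x_i + β0·e^(−γr²)·(x_j − x_i) + α·ε; since the paper does not spell out its exact variant, the following sketch with assumed constants is only illustrative:

```python
# Standard firefly move towards a brighter firefly (assumed variant).
import numpy as np

def move_firefly(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.2):
    r2 = np.sum((x_i - x_j) ** 2)          # squared distance between fireflies
    beta = beta0 * np.exp(-gamma * r2)     # attractiveness decays with distance
    noise = alpha * (np.random.random(x_i.shape) - 0.5)
    return x_i + beta * (x_j - x_i) + noise
```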

The optimized solution of the firefly algorithm is then fed to the genetic algorithm to obtain the final schedule, evaluated in terms of termination delay, execution cost, and finish time, giving the proposed hybrid CEGF algorithm.

Algorithm 3: Proposed CEGF
1. Create the first population by generating one chromosome with the PEFT algorithm and the remaining chromosomes randomly.
2. Optimize the population using the firefly algorithm.
3. Compute the fitness value of the firefly-optimized population as the execution time of the solution.
4. Feed the optimized solution of the firefly algorithm to the genetic algorithm.
5. Select chromosomes randomly and apply the crossover and mutation operators of the genetic algorithm to produce the next generation.
6. Validate the resulting solution by checking the fitness function and add it to the new population.
7. The genetic algorithm produces the best-optimized solution.
8. Evaluate the performance parameters: execution time, execution cost, and termination delay.
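A skeletal rendering of the genetic stage of Algorithm 3 follows; the fitness, crossover, and mutation callables are placeholders, since the paper only names these operators:

```python
# Skeleton of the GA stage in Algorithm 3 (operators are placeholders).
import random

def cegf(init_population, fitness, crossover, mutate, generations=100):
    # init_population: firefly-optimised schedules (steps 1-4);
    # fitness returns a value to minimise (e.g. execution time).
    population = sorted(init_population, key=fitness)
    for _ in range(generations):
        a, b = random.sample(population, 2)       # random parent selection (step 5)
        child = mutate(crossover(a, b))
        population.append(child)                  # keep validated child (step 6)
        population = sorted(population, key=fitness)[:len(init_population)]
    return population[0]                          # best-optimised solution (step 7)
```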

The flowchart of the proposed technique, showing its functionality as a block diagram, is given in Fig. 1.

4 Results and Discussion

The proposed approach, the CEGF (Cost-Effective Genetic and Firefly) hybrid, has been simulated in the WorkflowSim simulator using Java (JDK) under the NetBeans IDE. The results have been analyzed on various scientific workloads available in the WorkflowSim package, including Montage, CyberShake, and Epigenomics, with varying numbers of tasks, and compared with the existing technique CEGA (Cost-Effective Genetic Algorithm) in terms of finish time, execution cost, and termination delay.

4.1 Analysis in Terms of Finish Time

Finish time is the total execution time of a task t_i on the virtual machine of the type that has the least execution time among all VM types available in the cloud; the completion time is defined as


Fig. 1 Flowchart of the proposed algorithm

Finish Time(T_i) = Start Time(T_i) + End Time_{VM_k} / (1 − variation)

Figure 2 and Table 1 show the comparison of finish time between the proposed algorithm and the existing algorithm. The proposed algorithm performs better than the existing procedure: for the Montage 100 workload, the finish time is 98.08 ms for the proposed algorithm versus 116.82 ms for the existing one. The same holds in the other cases, and the proposed technique is best in all of them.

4.2 Analysis in Terms of Execution Cost

The objective is to find a suitable schedule S for a given workflow such that the total execution time does not exceed the workflow's deadline D. The execution cost can be calculated as

Fig. 2 Comparison of finish time

Table 1 Simulation results of proposed technique in terms of finish time

Scientific workflows    GAFFA (Proposed)    GA (Existing)
[Montage, 50]           36.71               59.21
[Montage, 100]          98.08               116.82
[CyberShake, 30]        280.01              482.94
[CyberShake, 50]        561.68              894.35
[CyberShake, 100]       1178.24             2290.21
[Epigenomics, 46]       4144.49             7550.89
[Epigenomics, 100]      5739.4              8554

Execution Cost = Σ_{j=1}^{R} C_{r_j} × (LFT_{r_j} − LST_{r_j})

where R is the set of leased resources, LST is the lease start time, and LFT is the lease finish time. Figure 3 and Table 2 show the comparison of execution cost between the proposed algorithm and the existing algorithm; the proposed algorithm performs better. For the Montage 100 workload, the execution cost is 22,490 for the proposed algorithm and 154,580 for the existing one. Similarly, for the other workloads, the proposed technique performs better than the other methods.
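Read as code, the cost formula simply sums each leased resource's unit cost times its lease interval; a one-function sketch (with an assumed lease encoding) is:

```python
# Sketch of the execution cost formula above.
def execution_cost(leases):
    # leases: iterable of (unit_cost, lease_start_time, lease_finish_time)
    return sum(c * (lft - lst) for c, lst, lft in leases)
```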


Fig. 3 Comparison of execution cost

Table 2 Simulation results of proposed technique in terms of execution cost

Scientific workflows    GAFFA (Proposed)    GA (Existing)
[Montage, 50]           11420               45790
[Montage, 100]          22490               154580
[CyberShake, 30]        28950               113580
[CyberShake, 50]        30272               151890
[CyberShake, 100]       58952               235982
[Epigenomics, 46]       15800               198518
[Epigenomics, 100]      23974               237521

4.3 Analysis in Terms of Termination Delay

When a VM is leased, it takes time to initialize properly, and whenever computing resources are released, they take time to shut down. A longer resource-acquisition time increases the total execution time, and a longer shutdown time increases the overall cost of the workflow. Figure 4 and Table 3 show the comparison of termination delay between the proposed and existing algorithms; the proposed algorithm performs better. For the Montage 100 workload, the termination delay is 308 ms for the proposed algorithm and 6156 ms for the existing one. Similarly, for the other workloads, the proposed technique performs better than the other methods.


Fig. 4 Termination delay comparison

Table 3 Simulation results of proposed technique in terms of termination delay

Scientific workflows    GAFFA (Proposed)    GA (Existing)
[Montage, 50]           238                 2016
[Montage, 100]          308                 6156
[CyberShake, 30]        418                 2895
[CyberShake, 50]        445                 3034
[CyberShake, 100]       127                 2987
[Epigenomics, 46]       284                 1945
[Epigenomics, 100]      487                 2259

5 Conclusion

The existing algorithm introduced a novel scheme for the encoding, population initialization, crossover, and mutation operators of the genetic algorithm, focusing mainly on minimizing delay, finish time, and cost. The existing CEGA algorithm reflects the key characteristics of the cloud, such as heterogeneity, on-demand resource provisioning, and the pay-as-you-go model; simulation experiments on four scientific workflows show that CEGA exhibits the highest hit rate for the deadline constraint. However, in the existing approach the number of iterations is very large, which increases the total execution cost and total execution time, and this is what the proposed algorithm optimizes. The proposed algorithm was simulated in the WorkflowSim simulator using the NetBeans IDE, and the results are better than the existing procedures. A future extension is to enhance the proposed algorithm with resource-aware and more sophisticated load-balancing algorithms.


References

1. J. Yu, R. Buyya, Scheduling scientific workflow applications with deadline and budget constraints using genetic algorithms. Sci. Program. 14(3–4), 217–230 (2006)
2. A. Masoud Rahmani, M. Ali Vahedi, A novel task scheduling in multiprocessor systems with genetic algorithm by using elitism stepping method. INFOCOMP—J. Comput. Sci. 7(2), 58–64 (2008)
3. A. Bala, I. Chana, A survey of various workflow scheduling algorithms in cloud environment, in Proceedings of the 2nd National Conference on Information and Communication Technology (NCICT) (2011)
4. I. Ciornei, E. Kyriakides, Hybrid ant colony-genetic algorithm (GAAPI) for global continuous optimization. IEEE Trans. Syst. Man Cybern. B Cybern. 42(1), 234–245 (2011)
5. A.A. El-Sawy, R.M. Rizk-Allah, E.M. Zaki, Hybridizing ant colony optimization with firefly algorithm for unconstrained optimization problems. Appl. Math. Comput. 224, 473–483 (2013)
6. S. Bilgaiyan, M. Das, S. Sagnika, Workflow scheduling in cloud computing environment using cat swarm optimization, in Proceedings of the 2014 IEEE International Advance Computing Conference (IACC) (IEEE, 2014)
7. J. Hu, K. Li, K. Li, Y. Xu, A genetic algorithm for task scheduling on heterogeneous computing systems using multiple priority queues. Inf. Sci. 270(6), 255–287 (2014)
8. A. Verma, S. Kaushal, Cost-time efficient scheduling plan for executing workflows in the cloud. J. Grid Comput. 13(4), 495–506 (2015)
9. S.G. Ahmad, C.S. Liew, E.U. Munir, T.F. Ang, S.U. Khan, A hybrid genetic algorithm for optimization of scheduling workflow applications in heterogeneous computing systems. J. Parallel Distrib. Comput. 87, 80–90 (2016)
10. M.S. Hossain, G. Muhammad, Cloud-assisted Industrial Internet of Things (IIoT)-enabled framework for health monitoring. Comput. Netw. 101, 192–202 (2016)
11. A. Bose, P. Kuila, T. Biswas, A novel genetic algorithm based scheduling for multi-core systems, in 4th International Conference on Smart Innovations in Communication and Computational Sciences (SICCS), vol. 851 (Springer, 2018), pp. 1–10
12. G. Zhang, J. Sun, J. Zhou, S. Hu, T. Wei, X. Zhou, Minimizing cost and makespan for workflow scheduling in cloud using fuzzy dominance sort based HEFT. Future Gener. Comput. Syst. 93, 278–289 (2019)
13. S.R. Gundu, T. Anuradha, Improved hybrid algorithm approach based load balancing technique in cloud computing 9(2), Version 1 (2019)

Flexible Dielectric Resonator Antenna Using Polydimethylsiloxane Substrate as Dielectric Resonator for Breast Cancer Diagnostics Doondi Kumar Janapala and Moses Nesasudha

Abstract In this work, a flexible Dielectric Resonator Antenna (DRA) operating at 2.45 GHz is presented for breast cancer diagnosis. Polydimethylsiloxane (PDMS) is used as the Dielectric Resonator (DR). The proposed radiating element consists of concentric circular arcs formed in an inverse-symmetrical manner on both sides of the microstrip feed line. A Defected Ground Structure (DGS) is used as the ground plane, formed by etching slots into concentric square rings below the radiator. Four square PDMS slabs are used as the DR and placed below the slots of the DGS. The DRA is simulated in both flat and bent conditions, and a comparative analysis is presented. The suitability of the antenna is verified by analyzing it placed near a female breast phantom model without and with cancerous tumor tissue. The simulated Specific Absorption Rate (SAR) of the antenna on a skin model, on a male human left-arm phantom model, and on a female breast model is evaluated and presented. Keywords Polydimethylsiloxane (PDMS) · Flexible · Wearable · Specific absorption rate (SAR) · Dielectric resonator antenna · Breast cancer diagnosis

1 Introduction

The development of flexible antennas has grown rapidly in recent years, driven by the need for antennas that can adapt to daily-life monitoring: health, entertainment, emergency response, surveillance, sensing, military applications, and health care for screening, diagnostics, and treatment. Flexible antennas are very easy to mount on curved surfaces, which makes them well suited for human

D. K. Janapala (B) · M. Nesasudha, Department of ECE, Karunya Institute of Technology and Sciences (Deemed to be University), Coimbatore 641114, India, e-mail: [email protected]
M. Nesasudha, e-mail: [email protected]


wearable applications. Over the years, several flexible antennas have been developed using different kinds of flexible dielectric substrates, such as Rogers materials, Kapton, paper, cloth, polyimide, polyetherimide, polyethylene glycol, and PDMS [1–5]. One of the main applications of these flexible devices is health monitoring, diagnosis, and treatment, and over the years several antennas have been developed for detection, monitoring, and treatment in health care. When designing such wearable antennas, the main consideration is to understand the variations in the dielectric properties of human tissues across age groups, sizes, and genders. The measurement of the electrical characteristics of female breast tissue is explained in [6], and in vivo and in vitro validation for breast tissue is presented in [7]; such validated measurements of the electrical properties of healthy and malignant tissues can help in developing antennas for diagnosis or treatment. Breast cancer is one of the deadliest cancers women face, and its early detection using antennas is an open research topic in which different kinds of antennas and detection methods have been investigated over the years. A review of electromagnetic techniques for breast cancer detection is presented in [8]. Nanomaterial-based sensors and wearable sensors can also be used for breast cancer detection [9]. The use of antennas for microwave imaging of breast cancer is presented in [10–12]. A five-port ring reflectometer probe system for in vitro breast tumor detection is implemented in [13], and mm-wave skin cancer detection using a Vivaldi antenna is presented in [14]. A flexible microwave antenna has been developed for chemo-thermotherapy of the breast [15], a flexible 16-element antenna array is used to detect breast cancer in [16], a 4 × 4 array antenna is developed for 3D breast cancer detection in [17], and a comparison of wide-slot and stacked antennas for breast cancer detection is presented in [18]. In the current work, a compact 53 mm × 36 mm antenna is designed, and 1-mm-thick PDMS slabs are used as the DR to obtain a flexible DR antenna for wearable applications. The proposed antenna backed with the PDMS DR shows a significant decrease in SAR, because the resonator minimizes leakage towards the human phantom body. The designed antenna's performance is analyzed on different human body phantom models to validate its suitability for wearable applications, and the antenna is analyzed for bending radii of 30, 40, and 50 mm. The proposed antenna is placed near the female breast phantom model, with the dielectric properties of healthy and malignant tissues assigned to the phantom at 2.45 GHz. The analysis carried out in this work, including the differences between the tumor-free and tumor cases for breast cancer detection, is presented using E-, H-, and J-field distributions.


2 Proposed DRA Antenna Design and Specifications

The geometry of the proposed DRA is presented in Fig. 1. Rogers RO3006, with a dielectric constant of 6.15 and a loss tangent of 0.0025, is used as the dielectric spacer. A transparent and flexible PDMS substrate is used as the DR; the pure PDMS layer is prepared without impurities. The PDMS dielectric constant is 2.7 and its loss tangent is 0.314. The dimensions of the DR antenna operating at 2.45 GHz were optimized through a parametric study, and the optimal dimensions are presented in Table 1.

2.1 Step by Step Implementation of the Proposed Antenna

The proposed antenna is designed and simulated using ANSYS HFSS 19.2. The step-by-step implementation of the designed DRA is presented in Fig. 2, from which it can be seen that the antenna without the DR and DGS operates at 2.52 GHz (red), while adding the DGS tunes the antenna to 2.42 GHz (blue). With the addition of the PDMS layers on the back side as the DR, the antenna operates at 2.45 GHz; the DR also improves the bandwidth (BW) and the reflection coefficient.

Fig. 1 Proposed DRA a Top view b Bottom view c Side view and d Diagonal view

Table 1 Optimal dimensions (in mm)

L = 53         L1 = 35.5      L2 = 1.3        L3 = 15.85
W = 36         W1 = 1.3       W2 = 14.2       W3 = 1.65
W4 = 5.675     W5 = 2.0625    W6 = 10.4875    r1 = 4.2625
r2 = 6.325     r3 = 9.075     r4 = 11.1375    r5 = 13.2
r6 = 14.85     r7 = 16.5      H1 = 1.27       H2 = 1


Fig. 2 Step by step implementation of DRA and the respective reflection coefficient versus frequency curves comparison

2.2 Performance of DRA Without Bending and With Bending

The designed DR antenna is bent onto cylindrical surfaces of radius 30, 40, and 50 mm. A comparative analysis is presented using the reflection coefficient curves in Fig. 3a and the radiation patterns in Fig. 3b, c. From Fig. 3a it can be seen that the designed DRA maintains a reflection coefficient below −10 dB at 2.45 GHz in both flat and bent conditions. The simulated reflection coefficient at 2.45 GHz is −27.24 dB in the flat condition and −16.43 dB, −16.21 dB, and −13.94 dB for bending radii of 30 mm, 40 mm, and 50 mm, respectively. In Fig. 3b, c, the radiation patterns for the flat and bent conditions show some shift: at phi = 0° the pattern is broadened under bending compared with the flat condition.
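For readers relating the reported reflection coefficients to matching quality, the following helper converts S11 in dB to |Γ| and VSWR using the standard definitions (these formulas are textbook relations, not taken from the paper); all four cases stay below VSWR 2:

```python
# Convert a reflection coefficient in dB to linear |S11| and VSWR.
def vswr_from_s11_db(s11_db):
    gamma = 10 ** (s11_db / 20.0)      # |S11| as a linear magnitude
    return (1 + gamma) / (1 - gamma)

for s11 in (-27.24, -16.43, -16.21, -13.94):   # values reported above
    print(f"S11 = {s11} dB -> VSWR = {vswr_from_s11_db(s11):.2f}")
```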

Fig. 3 (a) Reflection coefficient curves comparison for flat and flexible condition with different bending radii (b) radiation pattern flat condition and (c) radiation pattern bending condition (Ra = 30 mm)


3 Effects of Human Body on Designed DRA

The effect of the human body's presence on the designed DRA is evaluated by analyzing the antenna under different conditions. The minimum distance between the antenna and the phantom model is kept at 10 mm, and for the safety of the human body an input power of 100 mW is given to the antenna. In the flat condition, the antenna is analyzed on top of a four-layer tissue model consisting of skin followed by fat, muscle, and bone; the dielectric properties of the respective tissues at 2.45 GHz are listed in Table 2. The position of the DRA in the flat condition on top of the phantom model and the corresponding SAR are presented in Fig. 4. In a similar way, the bent antenna is placed near the female breast phantom model and the SAR analysis is carried out; Fig. 5 illustrates the DRA position near the female phantom and the SAR analysis. From the data in Figs. 4 and 5, the maximum average SAR value evaluated over a volume of 1 g of tissue is listed in Table 3; the maximum SAR value is 1.3867 W/Kg. According to the FCC, the standard tolerable SAR value is 1.6 W/Kg over a volume of 1 g of tissue, a limit also followed by India and the US.

Table 2 Dielectric properties of body tissue at 2.45 GHz

S.No    Tissue    Relative permittivity (εr)    Loss tangent    Conductivity (S/m)
1       Skin      42.853                        0.27255         1.5919
2       Fat       5.2801                        0.14524         0.10452
3       Muscle    52.729                        0.24194         1.7388
4       Bone      11.381                        0.2542          0.39431

Fig. 4 Position of the Antenna on layered phantom model and SAR analysis


Fig. 5 DRA bended (30 mm) condition placed near female breast phantom model

Table 3 SAR comparison

S.No    Condition                        SAR value (W/Kg)
1       Flat (Layered phantom)           1.1041
2       Bend (Female breast phantom)     1.3867

4 Breast Cancer Detection Using Designed DRA

For breast cancer detection using antennas, the measuring setup consists of one transmitting antenna and one receiving antenna. The reflected wave is analyzed over time for the tumor-free and tumor cases, and the difference in the impulse response gives an indication of the tumor. By changing the position of the antenna and using the field distribution curves, the position and size of the tumor can be estimated. In the current work, with the designed DRA placed near the human breast phantom model, the changes in the E-, H-, and J-field distributions are analyzed for the cases without and with a tumor. A cancerous tumor of 3 mm radius is considered, positioned 2 mm under the skin of the phantom model, with the antenna placed exactly 10 mm away from the phantom. The dielectric properties of the healthy and cancer-affected breast tissues at 2.45 GHz are listed in Table 4. The position of the tumor is presented in Fig. 6, and the comparative analysis of the E-, H-, and J-field distributions without and with the tumor, for the antenna in flat and bent conditions, is illustrated in Figs. 7 and 8.

Table 4 Breast tissue dielectric properties

                        Healthy tissue    Cancer tumor tissue
Dielectric constant     4.4401            55.2566
Conductivity (S/m)      0.1304            2.7015


Fig. 6 Female breast phantom model with cancer tumor

Comparing the E-field distributions without and with the tumor in Fig. 7a, c, the maximum E-field value is 63.86 V/m without the tumor and increases to 74.26 V/m with the tumor. For the current distribution, from the data in Fig. 7b, d, the maximum value is 102.539 A/m² without the tumor and 117.8865 A/m² with the tumor. Similarly, for the antenna in the bent condition, from Fig. 8a, c the maximum E-field value is 51.91 V/m without the tumor and 58.50 V/m with the tumor.

Fig. 7 field distribution comparison for antenna in flat condition without tumor: a E-field, b J-field & with Tumor: c E-field, d J-field


Fig. 8 Field distribution comparison for antenna in bended (30 mm) condition without tumor: a E-field, b J-field & with Tumor: c E-field, d J-field

From the data in Fig. 8b, d, the maximum current distribution over the volume is 101.74 A/m² without the tumor and 117.099 A/m² with the tumor. The presence of the tumor causes a significant anomaly in the field distribution and increases the E- and J-field values because of the change in the dielectric properties of the cancer-affected tissue. From Figs. 7 and 8 it can be seen that there is a significant change in the E-, H-, and J-field distributions between the tumor-free and tumor cases.

5 Conclusion

A flexible dielectric resonator antenna has been designed for 2.45 GHz wearable health care applications, and its behaviour under bending has been verified by analyzing its performance in different conditions. From the data in Sect. 3, the designed antenna keeps the SAR below the standard limit of 1.6 W/Kg, with a maximum value of 1.38 W/Kg, which makes the current DRA a suitable candidate for wearable applications. From the data in Sect. 4, the antenna's suitability for detecting breast cancer is verified using a female breast phantom model without and with a tumor: there is a significant change in the E-, H-, and J-field distribution curves between the two cases for the antenna in both flat and bent conditions. The designed DRA can be deployed as a transmit–receive pair around the female breast model to detect the change in the received impulse and thereby determine the position and size of the tumor.



Machine Learning-Based Prototype for Restaurant Rating Prediction and Cuisine Selection Kunal Bikram Dutta, Aman Sahu, Bharat Sharma, Siddharth S. Rautaray, and Manjusha Pandey

Abstract India is popular for its assorted multi-cuisine food prepared in a huge number of restaurants and hotel resorts, which is indicative of its unity in diversity. The food chain industry and the restaurant business in India are very competitive, and a lack of research and knowledge about the competition usually leads to the failure of many such enterprises. The principal issues that continue to create difficulties include high real estate expenses, escalating food costs, a fragmented supply chain, and over-licensing; even then, a restaurateur does not know whether the business will develop or not. This project aims to solve this problem by analyzing ratings, reviews, cuisines, restaurant type, demand, online ordering service, table booking, and availability of the restaurant, making a machine learning model learn these features to predict the ratings of a new restaurant and the balance of positive and negative reviews to be expected. This research work considers data for the city of Bengaluru from Zomato as an example to show how our model works and how it can help a restaurateur choose the location and cuisine that will yield better ratings and reviews and make the business more profitable.

Keywords Cuisine · Random forest regressor · Ratings · Restaurants · Zomato

K. B. Dutta (B) · A. Sahu · B. Sharma · S. S. Rautaray · M. Pandey School of Computer Engineering, Kalinga Institute of Industrial Technology (Deemed to be University), Bhubaneswar, India e-mail: [email protected] A. Sahu e-mail: [email protected] B. Sharma e-mail: [email protected] S. S. Rautaray e-mail: [email protected] M. Pandey e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_6


1 Introduction
The restaurant industry is highly competitive in India, as restaurants from all over the world can be found here; from the United States to Japan to Russia, you get all types of cuisines. Delivery, dine-out, pubs, bars, drinks, buffet, desserts: any type you name, India has it. Unless you have a reliable selection of cuisines, chances are you will have difficulty standing out from the crowd. Setting up new restaurants and gaining a competitive advantage requires a detailed study of the demographics of the surrounding area and the quality of the existing contenders, and in this field there is a lack of technology-based analysis and solutions. The food industry is not yet saturated relative to the demand, and new eateries are opening every day; consequently, it has become challenging for them to face already established restaurants. So, we take the help of the Zomato dataset of Bengaluru to showcase how our machine learning prototype can help a new restaurateur in picking the menu, theme, cuisine, etc., based on an analysis of the demography of the locations and the ratings of restaurants there, which can be an advantage in avoiding high competition in the industry. In Bengaluru, most people depend mainly on restaurants, as they do not have time to cook for themselves. With such an overwhelming demand for restaurants, it has become important to study the demography of a location, what kind of food is more popular in a locality, and where to eat for the best experience in that locality. Based upon the existing restaurants and their ratings, our prototype can predict which cuisines can give the best ratings, using features of the dataset such as restaurant type, votes, reviews, average cost, online ordering, table booking facility, etc. [1].

2 Technologies Used
NumPy, short for Numerical Python, efficiently provides an interface for storing and operating on dense data buffers. NumPy arrays are similar to Python's list type, but they are faster and less costly than their counterpart, and they operate efficiently even as the size of the data increases. NumPy's implementation centers on a powerful N-dimensional array object, built in a focused way that makes it easy to integrate with languages like C/C++ and with code providing mathematical functionality of different types, such as Fourier transforms and other capabilities. NumPy is also very compatible with the other machine learning modules, such as Pandas and Matplotlib, which are very useful in the ML world.
Pandas is another widely used package in machine learning; it is written in Python and inherits some of the properties of NumPy.


The basic advantage of using Pandas is that it provides the DataFrame and Series types. A DataFrame, explained in simpler terms, is an array that also has row and column names and can hold different types of data, including missing data, and it provides storage for the data. Pandas offers some of the best data operations that can be performed on database-like tables, and it implements much additional functionality, such as operations on the data based on other columns or the creation of pivot tables.
Scikit-Learn is another very widely used Python library that presents powerful versions of a huge number of known algorithms, including random forests, the k-means algorithm, gradient boosting, and SVM (support vector machine), and it has been designed to operate with the Python libraries NumPy and SciPy. Scikit-Learn is defined by a clean, uniform, and streamlined API alongside quite helpful and complete online documentation, with advanced functions such as boosting and bagging, feature selection, detection of outliers and rejection of noise, and methods for model selection and validation such as cross-validation, hyperparameter tuning, and metrics.
Matplotlib is a data visualization library that builds on NumPy arrays and is designed to work with the SciPy stack. It produces high-quality figures in a variety of formats and interactive environments across platforms. One can create plots, histograms, bar charts, error charts, scatter plots, etc., with just a few lines of code.
Seaborn is used for data visualization and has an interface for producing detailed, visually appealing plots; it inherits properties of Matplotlib and is compatible with the Pandas module. It intends to present visualization as an important part of exploring and presenting data. Its functions operate on lists, DataFrames, or even large arrays, performing the statistical operations and mapping needed to produce very informative plots [2].
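As a minimal illustration of how these libraries fit together, the sketch below builds a tiny labeled table, fits a toy model, and draws one plot. The column names and data values here are invented for demonstration only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression

# A small hypothetical table of restaurant features (Pandas DataFrame
# wrapping NumPy arrays).
df = pd.DataFrame({
    "votes": np.array([120, 45, 300], dtype=np.int64),
    "avg_cost": [400.0, 250.0, 800.0],
    "rate": [4.1, 3.4, 4.5],
})

# Pandas: summary statistics in one call.
print(df.describe())

# Scikit-Learn: fit a toy regression on the numeric columns.
model = LinearRegression().fit(df[["votes", "avg_cost"]], df["rate"])
print(model.coef_)

# Seaborn/Matplotlib: one-line visualization of a relationship.
sns.scatterplot(data=df, x="avg_cost", y="rate")
plt.show()
```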

3 State of Art

Article: Restaurant Rating: "Industrial Standard and Word-of-Mouth A Text Mining and Multi-dimensional Sentiment Analysis" [3]
Author: Yang Yu, Qiwei Gan [9–12]
Year: 2015
Approach: Sentiment analysis of reviews regarding aspects of food, decor, service, pricing, and special contexts to predict rating
Review: Multidimensional sentiment analysis and text mining were applied. The paper is more theory-driven than data-driven

Article: Prediction of star ratings from online reviews [4]
Author: Ch. Sarath Chandra Reddy, K. Uday Kumar, J. Dheeraj Keshav, Bakshi Rohit Prasad, Sonali Agarwal [13–15]
Year: 2017
Approach: Many classifiers, such as the typically used Bag of Words with Multinomial NB, Trigram Multinomial NB, Bigram Multinomial NB, etc., and also Random Forest
Review: Classifiers like Random Forest performed better than the other known classifiers. It is a good implementation of rating prediction, but not of much help to new restaurateurs, which we provide

Article: Multi-view Clustering in Collaborative Filtering Based Rating Prediction [26]
Author: Chengcui Zhang, Ligaj Pradhan [16–18]
Year: 2016
Approach: To predict an unknown rating of a user for a restaurant, the cluster to which the user/restaurant belongs is first found, and then the average of the k-NNs from the user cluster gives the prediction for the user rating
Review: Multi-view clustering produced better results; it automatically compares several known views and selects a set at or near the best views, which can improve user-item rating prediction, but it only predicts the rating and cannot predict a better cuisine for a location

Article: Machine learning based class level prediction of restaurant reviews [5]
Author: F. M. Takbir Hossain, Md. Ismail Hossain [19–21]
Year: 2017
Approach: The Scikit-learn library and the Natural Language Toolkit (NLTK) were used
Review: This model aims to predict the reviews given by users as negative or positive. Sentiment analysis was done on the online reviews, but the problem of establishing new restaurants is not solved

Article: Restaurant rating based on textual feedback [6]
Author: Sanjukta Saha, A. K. Santra [22, 23]
Year: 2017
Approach: Collaborative filtering
Review: Analysis of reviews is done to calculate the user ratings. There was no use of machine learning

Article: Restaurant Recommendation System for User Preference and Services Based on Rating and Amenities [7]
Author: R. M. Gomathi, S. P. Ajitha, T. G. Hari Satya Krishna, U. I. Harsha Pranay [24]
Year: 2019
Approach: NLP (Natural Language Processing) algorithms are used to identify the sentiments of the user comments
Review: Sentiment analysis is performed on the reviews and user comments to recommend a hotel

Article: Restaurant setup business analysis using yelp dataset [8]
Author: Sindhu Hegde, Supriya Satyappanavar, Shankar Setty [25]
Year: 2017
Approach: Manual observation, k-d tree
Review: Analysis of the data was done very well, but there was no use of machine learning

4 Architecture Design
The Zomato Bengaluru dataset consists of 17 columns and 51,717 rows. The columns are URL, name, address, book_table, online_order, votes, rate, phone, location, dish_liked, rest_type, approx_cost(for two people), cuisines, menu_item, reviews_list, listed_in(type), and listed_in(city). The ratings were strings, which were converted into floating-point values, and the null values in that column were filled with the mean value of the column. The rows containing null values in the remaining columns were dropped, as they were comparatively very few in number. Following this, a layer of exploratory analysis was added to further understand the relations between the various columns and how they correlate with the rating column. After this, label encoding was applied to the "location," "cuisines," and "rest_type" (restaurant type) columns. With this, our data was cleaned and conditioned and was ready to be fed to the model. Figure 1 depicts the prototype of our model, i.e., how it works to predict the ratings of the restaurants that have not been rated yet. As we decided to go with the Random Forest algorithm, our model of choice was the RandomForestRegressor from Scikit-Learn. It is an ensemble learner for regression built on decision trees; it functions by forming many decision trees at training time and then outputs the mean prediction of the individual trees for regression. Random forest builds multiple CART models (CART, short for Classification and Regression Trees) with different combinations of samples and different uses of the initial variables, and then performs a final prediction for each observation. The ultimate prediction is a function of the individual predictions; it can simply be their mean.
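A minimal sketch of the preprocessing steps described above is given below. The column names follow the Zomato Bengaluru dataset, but the "4.1/5"-style rating format and the local file name are assumptions, not details taken from the paper.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("zomato.csv")  # hypothetical local copy of the dataset

# Convert string ratings (e.g., "4.1/5") into floats; non-numeric entries
# become NaN and are then filled with the column mean.
df["rate"] = pd.to_numeric(df["rate"].astype(str).str.split("/").str[0],
                           errors="coerce")
df["rate"] = df["rate"].fillna(df["rate"].mean())

# Drop rows with nulls in the remaining feature columns (few in number).
df = df.dropna(subset=["location", "rest_type", "cuisines"])

# Label-encode the categorical columns used by the model.
for col in ["location", "rest_type", "cuisines"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```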


Fig. 1 Proposed prototype

5 Implementation and Results
The Zomato Bengaluru dataset was loaded with the help of the Pandas library. The dataset consists of 17 columns and 51,717 rows: the URLs of unique restaurants, addresses, names, online order availability, table booking facility, rate, votes, phone numbers, locations, restaurant types, lists of liked dishes, cuisines, average approximate cost for two people, lists of user reviews, menu items, and restaurant type listings. First, we converted the "rate" column from string to float, and the null values in this column were filled in with the mean of the column. Then, since the phone number, URL, and address do not contribute to the overall rating of a restaurant, we dropped those columns. Exploratory data analysis was then done on the dataset to find relations among the columns in an efficient way using the Python libraries discussed earlier: which are the top restaurant chains in Bengaluru, the percentage of restaurants in each location, the percentage of each restaurant type, how many restaurants do not accept online orders or offer table booking services, the ratio between restaurants that do and do not provide table booking, whether there is any difference between the votes of restaurants accepting and not accepting online orders, the top restaurants by rating, the relation between the cost and rating of restaurants, the distributions of restaurant type, location, and rating, the most common restaurant types in Bengaluru, the most popular cuisines of Bengaluru, the most liked dishes, the items that appear most often on menus, and the most common cuisines in each location. Every such relation among the columns (features) was extracted through pie charts, bar plots, box plots, histograms (with kernel density estimation), and scatter plots using the Python Seaborn and Matplotlib libraries. From the EDA, it is observed that online ordering helps a restaurant achieve a high rating (Fig. 3), whereas online table booking does not affect it much (Fig. 2). It is also clear from the EDA that most of the restaurants are located in the top 10–15 locations (Fig. 4), with the others scattered across other locations, and we can also observe the most famous restaurant types in the city (Fig. 5).
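A short sketch of the kind of EDA plots described above, assuming the preprocessed DataFrame df from the earlier snippet (plot choices here are illustrative, not the authors' exact figures):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Rating distribution (histogram with kernel density estimate)
sns.histplot(df["rate"], kde=True)
plt.title("Distribution of ratings")
plt.show()

# Rating versus online-order availability (box plot)
sns.boxplot(x="online_order", y="rate", data=df)
plt.title("Rating versus online order")
plt.show()
```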

Fig. 2 Rating versus table booking

Fig. 3 Rating versus online order


Fig. 4 Percentage of restaurants in that location

Fig. 5 Percentage type of restaurants


Since we are also trying to predict the best cuisines for a location given the rating, from the EDA we take a look at the most famous cuisines of the city (Fig. 6). We observe the rating distribution of the restaurants and the cost distribution to get an idea of how to proceed with feature selection and preprocessing. From the graphs (Figs. 7 and 8), it is observed that most of the restaurants are rated between 3.5 and 4. It is also clear that the restaurants with an average cost of less than 1000 have better ratings compared to the more expensive restaurants. Following the EDA, we pre-process the data, fill in the missing values, and drop the rows with unknown locations; finally, from the relations of the other columns with the rating, we selected nine columns having a high correlation with the rating column to proceed with our model.

Fig. 6 Most popular cuisines of Bengaluru

Fig. 7 Distribution of ratings


Fig. 8 Distribution of costs of all restaurants

Fig. 9 r2_score of different machine learning models

Before implementing the model, LabelEncoder from the Scikit-Learn library was used to label-encode the location, rest_type, and cuisines columns. Before encoding, the null values of these three columns were dropped, as they were relatively few in number. Then, using StandardScaler from Scikit-Learn's preprocessing module (sklearn.preprocessing), we scaled the values of the dataset. The dataset was then split in the proportion 60/20/20 for training, validation, and testing using train_test_split from sklearn.model_selection. The machine learning models LinearRegression, DecisionTree, and RandomForestRegressor were trained on the training data, and the models were evaluated using r2_score from sklearn.metrics. The best model was the RandomForestRegressor, which gave a very high r2_score without any hyperparameter tuning (Fig. 9).
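The following sketch reproduces this pipeline under stated assumptions: the paper does not list the nine selected columns, so a smaller illustrative subset is used here, and the Yes/No service columns are assumed to need mapping to 0/1 first.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Map Yes/No service columns to 0/1 (an assumption about the raw encoding).
for col in ["online_order", "book_table"]:
    df[col] = (df[col] == "Yes").astype(int)

features = ["online_order", "book_table", "votes", "location",
            "rest_type", "cuisines"]            # illustrative subset
X = StandardScaler().fit_transform(df[features])
y = df["rate"].values

# 60% train; the remaining 40% is split evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Train the three models and compare their r2 scores on the test split.
for model in (LinearRegression(), DecisionTreeRegressor(),
              RandomForestRegressor()):
    model.fit(X_train, y_train)
    print(type(model).__name__, r2_score(y_test, model.predict(X_test)))
```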

6 Conclusion
This project predicts the rating scores for new restaurants based on their location, cuisine, approximate cost, and other factors, based on which the model can suggest the best-fit cuisine for a location for a new restaurant business. This will be of great help to entrepreneurs in gaining some advantage at the beginning of the business. Here, the Zomato Bengaluru dataset has been explored to find out which traits/features are essential for predicting the rating of a restaurant. This project uses a RandomForestRegressor to effectively predict the ratings of restaurants on the basis of the input features and is able to train it to a relatively high accuracy.


7 Future Works
Building on the predicted ratings, work on finding the best cuisine in a location for a new restaurant is ongoing. Further work includes sentiment analysis on the customers' reviews of restaurants, which would further aid in predicting the rating of a restaurant to be established and thus help in finding the best cuisines for a particular location.

References
1. https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants
2. https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks
3. System Sciences (HICSS), Annual Hawaii International Conference on IEEE, Restaurant Rating: Industrial Standard and Word-of-Mouth A Text Mining and Multidimensional Sentiment Analysis
4. TENCON, IEEE Region 10 International Conference, Prediction of Star Ratings from Online Reviews
5. Humanitarian Technology Conference (R10-HTC), IEEE Region 10, Machine learning based class level prediction of restaurant reviews
6. International Conference on Microelectronic Devices, Circuits and Systems (ICMDCS), Restaurant rating based on textual feedback
7. International Conference on Computational Intelligence in Data Science (ICCIDS), Restaurant Recommendation System for User Preference and Services Based on Rating and Amenities
8. International Conference on Advances in Computing, Communications, and Informatics (ICACCI), Sentiment based Food Classification for Restaurant Business
9. American Automobile Association Approval Requirements and Diamond Rating Guidelines (AAA Publishing, Heathrow, FL, 2009)
10. C. Dellarocas, The digitization of word of mouth: promise and challenges of online feedback mechanisms. Manage. Sci. 49(10), 1407–1424 (2003)
11. N. Archak, A. Ghose, P.G. Ipeirotis, Deriving the pricing power of product features by mining consumer reviews. Manage. Sci. 57(8), 1485–1509 (2011)
12. W. Duan, B. Gu, A.B. Whinston, The dynamics of online word-of-mouth and product sales: an empirical investigation of the movie industry. J. Retail. 84(2), 233–242 (2008)
13. S. Aravindan, A. Ekbal, Feature extraction and opinion mining in online product reviews, in Information Technology (ICIT) 2014 International Conference on (2014), pp. 94–99, December
14. Y. Mengqi, M. Xue, W. Ouyang, Restaurants Review Star Prediction for Yelp Dataset
15. G. Dubey, A. Rana, N.K. Shukla, User reviews data analysis using opinion mining on the web, in Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE) 2015 International Conference on (2015), pp. 603–612, February
16. M. Sharma, S. Mann, A survey of recommender systems: approaches and limitations. Int. J. Innov. Eng. Technol. 2(2), 8–14 (2013)
17. S. Bickel, T. Scheffer, Multi-view clustering, in Proceedings of IEEE International Conference on Data Mining (2004), pp. 19–26, November
18. X. He, M.-Y. Kan, P. Xie, X. Chen, Comment-based multi-view clustering of web 2.0 items, in Proceedings of the 23rd International Conference on World Wide Web (2014), pp. 771–782, April
19. B. Pang, L. Lee, S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10 (2002)


20. H. Minqing, B. Liu, Mining and summarizing customer reviews, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004)
21. G. Anindya, G. Ipeirotis, Designing novel review ranking systems: predicting the usefulness and impact of reviews, in Proceedings of the Ninth International Conference on Electronic Commerce (2007)
22. X. Lei, X. Qian, G. Zhao, Rating prediction based on social sentiment from textual reviews, in IEEE Transactions on Multimedia, Manuscript Id: MM-006446, pp. 1–12
23. S. Prakash, A. Nazick, R. Panchendrarajan, A. Pemasiri, M. Brunthavan, S. Ranathunga, Categorizing food names in restaurant reviews, in IEEE (2016), pp. 1–5
24. U. Fasahte, D. Gambhir, M. Merulingkar, A. Monde, A. Pokhare, Hotel recommendation system. Imp. J. Interdiscip. Res. (IJIR) 3(11), 318–324 (2017)
25. H. Parsa, A. Gregory, M. Terry, Why do restaurants fail? Part III: an analysis of macro and micro factors. Emerg. Asp. Redefin. Tour. Hosp. 1(1), 16–25 (2010)
26. 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Multi-view Clustering in Collaborative Filtering Based Rating Prediction

Deeper into Image Classification Jatin Bindra, Bulla Rajesh, and Savita Ahlawat

Abstract Recognizing images was a challenging task a few years back. With the advancement of technology and the introduction of deeper neural networks, the problem of recognizing images has been solved to a large extent. Inspired by the performance of deep learning models in image classification, the present paper proposes three techniques and implements them for image classification: a residual network, a convolutional neural network, and logistic regression. Neural networks have shown state-of-the-art results in the classification of images. In the implementation of these models, some modifications are made to build a deep residual network and a convolutional neural network. On testing, the ResNet model gave 98.49% accuracy on MNIST and 87.31% on Fashion MNIST, the CNN model gave 98.73% accuracy on MNIST and 87.38% on Fashion MNIST, and logistic regression gave 91.79% on MNIST and 83.74% on Fashion MNIST.

Keywords Deep neural networks · Residual network · Convolutional neural network · Logistic regression · MNIST · Fashion MNIST

1 Introduction Image classification involves a series of steps, which are performed on an image to get a label for it. With the advancement of technology, the images are generated and shared on a very large scale. Because of its increasing significance, bringing up deeper models and improving the classification can have a major impact on computer J. Bindra · S. Ahlawat (B) Department of CSE, Maharaja Surajmal Institute of Technology, Delhi, India e-mail: [email protected] J. Bindra e-mail: [email protected] B. Rajesh Department of IT, Indian Institute of Information Technology, Allahabad, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_7


vision. It might be possible that one algorithm is performing better for one use case and one for another use case. Therefore, it is important to compare various algorithms on more than one dataset. The most well-known algorithms for image classification is based on the techniques of deep learning. Deep learning algorithms are used widely in image classification problems. For instance, deep convolutional neural networks which were applied to ImageNet dataset showed that convolutional neural network is capable of achieving some record-breaking results [1]. This paper demonstrates the implementation of deep ResNet neural network, convolutional neural network, and regression model for image classification and comparison with the other state–of-the-art algorithms. The models were first tested on MNIST Dataset. The MNIST dataset is used popularly in the field of computer vision to compare the state-of-the-art algorithms. The dataset consists of 70000 images of handwritten digits from 0 to 9. Each image is grayscale with size 28 × 28. The MNIST Dataset (Fig. 1) was introduced in 1998. At that time, good computing power was not widely available. In today’s world with good computing power, many algorithms can get good accuracy with the MNIST dataset and it is widely used because of its simplicity. In April 2017, Google Brain research scientist asked people in a tweet to move away from MNIST as it is overused [2]. Even the basic Machine Learning algorithms can achieve more than 90% of classification accuracy on MNIST. For this reason, we also tested our models on Fashion MNIST dataset. In Aug 2017, Fashion MNIST [3] was released. Similar to MNIST, it consists of 70,000, 28 × 28 grayscale images (Fig. 2). Out of which 60,000 are used for training purposes and rest 10,000 images are used for testing purposes. Fashion MNIST also consists of 10 classes. It contains shapes of some complicated wearable fashion items. The MNIST dataset and Fashion MNIST dataset is so popular because it is widely available in

Fig. 1 MNIST dataset of handwritten images from 0 to 9


Fig. 2 Fashion MNIST dataset consisting of wearable fashion items

most libraries and deep learning frameworks, and there are many helper functions provided by different frameworks. The overall structure of the deep residual network implemented in this work consists of 12 layers with two jump connections; four such structures are connected to each other to form a larger model. The CNN is used with 13 layers to form a deep neural network for classification, and logistic regression was used from the Sklearn library. These models were then tested on the MNIST and Fashion MNIST datasets by calculating the accuracy of each model on the two datasets.

2 Related Work
Deep learning forms the basis of image classification. Recent research has shown that deep residual networks can be used in a variety of applications, not just static image classification but also the detection of moving objects, surveillance recordings, and so on. For instance, Szegedy et al. [4] gave evidence that, by using residual connections, the training of Inception networks improved significantly. The research presented three new networks, which


include Inception-ResNet-v1, Inception-ResNet-v2, and Inception-v4. Ou et al. [5] proposed a structure based on ResNet to detect moving objects, using ResNet-18 with an encoder–decoder structure; further, they used supervised learning, in which the input fed to the model includes the object frame along with the corresponding labels. Using this structure, they showed that the performance on the I2R and CDnet2014 datasets was better than that of other conventional algorithms. Lu et al. [6] proposed a DCR (Deep Coupled Residual) network, consisting of two branch networks and a trunk network, and used this model for face recognition at lower resolutions. Their experiments showed that the DCR model performs better on the LFW and SCface datasets than the other state-of-the-art models. Jung et al. [7] used surveillance recordings for classification and localization using a deep ResNet of 18 layers with the ReLU function for the classification part. For localization, they used R-FCN with deep residual models for accurate results, and further showed that their model outperformed the other state-of-the-art models in both classification and localization. Palvanov and Cho [8] implemented four models, including a residual network, a capsule network, a convolutional neural network, and multinomial logistic regression, and tested these models on the MNIST dataset in a real-time environment. Li and He [9] proposed an improved version of ResNet using adjustable shortcut connections. Their reported results showed an improvement in accuracy compared to the classical ResNet; moreover, under a learning rate of 0.001, their improved ResNet had 2.85% higher accuracy on CIFAR-10 and 3.81% higher accuracy on CIFAR-100 compared to the classical ResNet. Xia et al. [10] used SCNN and ResNet models combined with the SIFT-flow algorithm for kidney segmentation and showed that kidney segmentation accuracy was improved by their approach. Zhang et al. [11] proposed a deep convolutional network for image denoising in which the authors used residual learning for the separation of noise from the noisy observation; in their work, residual learning also played a role in speeding up the training process and boosting denoising performance. Their results produced favorable image denoising both quantitatively and qualitatively, and with a GPU implementation the run time also looks promising. CNN is widely used for image analysis and is considered one of the state-of-the-art approaches for it. Shin et al. [12] studied and demonstrated three important factors for applying convolutional neural networks to computer-aided detection problems: architecture, transfer learning, and dataset characteristics. Baldassarre et al. [13] proposed a model in which they combined a convolutional neural network with high-level features extracted from a pre-trained model, Inception-ResNet-v2; the authors were successful in image colorization tasks for high-level components like sky, sea, etc. Advances on the CNN model have also been made to improve accuracy, training time, or testing time. One such work is Fast R-CNN: Girshick [14] proposed the Fast Region-based Convolutional Network method for object detection, which improved not only detection accuracy but also testing and training speed. Huang et al. [15] introduced the Dense Convolutional Network (DenseNet). The layers in this network are


connected to every other layer in a feed-forward fashion. Testing a model on a single dataset may not help in establishing its generalizability; thus, the authors compared the model on four popular datasets (CIFAR-10, CIFAR-100, SVHN, and ImageNet) and showed that DenseNets obtain significant improvement over other models. Gidaris and Komodakis [16] used a multi-region deep convolutional neural network to propose an object detection system and obtained 78.2% and 73.9% accuracy on the PASCAL VOC2007 and PASCAL VOC2012 challenges, respectively. Abadi et al. [17] described the TensorFlow interface and the implementation details of TensorFlow. It was built at Google and is widely used to solve artificial intelligence problems; the TensorFlow APIs were created and made open source so that the community of developers and researchers around the globe can use them. He and Sun [18] presented an architecture that gave comparable accuracy on the ImageNet dataset while being 20% faster than AlexNet. In the past, many models have been built to classify images based on neural networks. Agarap [19] used CNN-Softmax and CNN-SVM to classify images using both the MNIST and Fashion MNIST datasets. Xiao et al. [3], who introduced Fashion MNIST, also tested the data with various state-of-the-art algorithms, including a decision tree classifier, an extra trees classifier, gradient boosting, k-neighbors, Linear SVC, logistic regression, MLP, Passive Aggressive, Perceptron, random forest, SGDC, and SVC. Chen et al. [20] compared four neural networks on the MNIST dataset: deep residual networks, convolutional neural networks, the Dense Convolutional Network (DenseNet), and an improvement on CNN using CapsNet; the authors showed that CapsNet requires a small amount of training data and can achieve excellent accuracy with it. Seo and Shin [21] used the hierarchical structure of apparel classes for classification; the Hierarchical Convolutional Neural Networks (H-CNN) they proposed were based on VGGNet.

3 Methodology
The objective of the methodology is to classify the MNIST and Fashion MNIST datasets using a deep ResNet neural network, a convolutional neural network, and a regression model. A block diagram of the methodology is shown in Fig. 3; it is the basic procedure followed in all three models. First, the data is imported to get the input, directly from tensorflow.examples.tutorials and tensorflow.keras. It is then pre-processed before being fed to the model: the preprocessing step includes resizing the images and normalizing the pixel values by dividing the matrix by 255, after which the labels are converted into categorical format. Finally, the images from the training set are fed to the model in order to train it, and then those from the testing set to get the output labels for the images.
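A minimal sketch of these pre-processing steps using Keras utilities (the exact reshape details are assumptions):

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Reshape to 28x28x1 and normalize pixel values by dividing by 255.
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# Convert integer labels into categorical (one-hot) format.
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
```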


Fig. 3 Basic steps followed for each model

The experiments were done on a system with the following tools and system configurations:
• Coding Language: Python 3
• Development Environment: Jupyter Notebook hosted in Google Colab. Google Colab is an interactive environment used to write and execute code. The environment provides 12.72 GB of RAM, 48.97 GB of disk, and 14.73 GB of GPU memory.
• OS: Microsoft Windows 10, 2015
• Processor used: Intel(R) Core(TM) i3-2310M CPU @ 2.10 GHz
• Models used: (A) ResNet (Deep Residual Network), (B) CNN (Convolutional Neural Network), (C) Logistic Regression

A. ResNet
A residual neural network (ResNet) is a special type of deep learning model in which skip connections are present: the network has connections that jump over a few layers, which is useful to avoid the vanishing gradient problem. Li and He [9] explained that introducing shortcut connections solved the problem of gradient fading, supporting this by simplifying ResNet and deducing its backpropagation. He et al. [22] presented a residual learning framework to ease the training of networks that are substantially deeper than those used previously; deep residual nets helped them win first place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. Szegedy et al. [23] showed


that the introduction of residual connections leads to improved training speed for the Inception architecture. In this paper, the implemented ResNet is formed with two skip connections (Fig. 4). The residual network consists of 12 layers. The layers implemented in our ResNet are: batch normalization, convolution, dropout, ReLU, a jump step in which the input of the block is added to the output of the ReLU, dropout, convolution, batch normalization, and finally the addition of the output of the last convolution layer to the input layer. On increasing the number of layers, the weights learned by the initial layers play a negligible role in prediction because of the vanishing gradient problem. To overcome this, we first introduced a connection from the first layer to the output of the ReLU function. A few more layers were then added to make the network deeper.

Fig. 4 Residual network implemented in TensorFlow


Again, with the increased number of layers, adding another connection helps in solving the vanishing gradient problem, so a connection was made from the input layer to the last convolutional layer to give the final output. The ReLU operation is defined in Eq. 1:

f(x) = max(0, x)   (1)

Here, x is the input and f(x) is the output. From this equation, it can be seen that the output is x if x ≥ 0 and 0 if x < 0. The MNIST training dataset was used with a batch size of 32 and Fashion MNIST with a batch size of 1000. Before sending the training images to the model, they were first passed through a convolutional layer and then through a batch normalization layer. After this, the images were passed through four residual blocks, one after the other, linearly. The output obtained is passed through another convolutional layer, and the 4-dimensional tensor obtained from the convolutional network is flattened into a 2-dimensional tensor. Finally, it is passed through a dense layer to get the output. Mathematically, a residual block function is defined as:

y = f(x, {W_i}) + x   (2)

Here in Eq. 2, y is the output vector, x is the input vector, and f(x, {W_i}) represents the mapping that is to be learned.
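The following is a sketch of one such residual block with its two jump connections, written with the Keras functional API. The filter count and dropout rate are assumptions, and the block assumes its input already has the stated number of channels (so that the additions are shape-compatible).

```python
from tensorflow.keras import Model, layers

def residual_block(x, filters=32):
    """One block following the layer order described in the text."""
    h = layers.BatchNormalization()(x)
    h = layers.Conv2D(filters, 3, padding="same")(h)
    h = layers.Dropout(0.2)(h)
    h = layers.ReLU()(h)
    h = layers.Add()([x, h])      # first jump: input added to the ReLU output
    h = layers.Dropout(0.2)(h)
    h = layers.Conv2D(filters, 3, padding="same")(h)
    h = layers.BatchNormalization()(h)
    return layers.Add()([x, h])   # second jump: input added to the final output

# Tiny usage example: input assumed to already carry 32 channels.
inp = layers.Input(shape=(28, 28, 32))
out = residual_block(inp, filters=32)
block = Model(inp, out)
block.summary()
```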

B. CNN
CNN is used as a baseline model in many image classification tasks. For instance, Johnson and Zhang [24] used CNN on text categorization to exploit the 1-D structure of text data for accurate prediction. Zeiler and Fergus [25] introduced a novel visualization technique that described the function of intermediate feature layers and the operation of the classifier; the visualizations helped them find model architectures that outperformed the ImageNet classification benchmark set by Krizhevsky et al. [1]. In this research, a 13-layer deep CNN network is used for the classification of images. The input is pre-processed and fed into the model. The convolutional, max pool, and dropout layers are used three times, linearly, before the flatten layer. Max pooling is done to downsample the image; for instance, after applying 2 × 2 max pooling to the matrix in Eq. 3, the output is given in Eq. 4.

X = ⎡ 1  2  3  4 ⎤
    ⎢ 5  6  7  8 ⎥
    ⎢ 9 10 11 12 ⎥
    ⎣13 14 15 16 ⎦   (3)

MaxPooling(X) = ⎡ 6  8 ⎤
                ⎣14 16 ⎦   (4)

Then the feature maps are flattened and passed through three dense layers with 128, 50, and 10 units, respectively. Figure 5 depicts the CNN model in Keras.

Fig. 5 CNN model in Keras
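A sketch of a 13-layer sequential CNN of the kind described: three (Conv2D, MaxPooling2D, Dropout) stages, a Flatten layer, and three Dense layers of 128, 50, and 10 units. The filter counts and dropout rate are assumptions; the training configuration follows Sect. 4 (batch size 200, 100 epochs) and reuses the arrays from the preprocessing sketch in Sect. 3.

```python
from tensorflow.keras import Input, Sequential, layers

model = Sequential()
model.add(Input(shape=(28, 28, 1)))
for filters in (32, 64, 128):                       # three conv/pool/dropout stages
    model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
    model.add(layers.MaxPooling2D(pool_size=2))
    model.add(layers.Dropout(0.2))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation="relu"))
model.add(layers.Dense(50, activation="relu"))
model.add(layers.Dense(10, activation="softmax"))   # one unit per class

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=100, batch_size=200)
```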

C. Logistic regression
Logistic regression is a basic machine learning model used for classification. The logistic regression was imported directly from the linear models in Sklearn [26]. Logistic regression has a sigmoidal curve; the sigmoid function (Eq. 5) is given as:

S(x) = 1/(1 + e^(−x))   (5)

This formula results in the formation of an "S-shaped" curve.


The images were pre-processed, and the model was trained by calling the fit function with the maximum number of iterations set to 2000. Finally, the testing dataset was fed into the trained model.
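A minimal sketch of this baseline, assuming flattened 28 × 28 images and integer class labels (Scikit-Learn does not accept one-hot labels); the array names follow the preprocessing sketch in Sect. 3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_tr = x_train.reshape(len(x_train), -1)   # 784-dimensional vectors
X_te = x_test.reshape(len(x_test), -1)
y_tr = np.argmax(y_train, axis=1)          # back from one-hot to integer labels
y_te = np.argmax(y_test, axis=1)

clf = LogisticRegression(max_iter=2000)    # 2000 iterations so the solver converges
clf.fit(X_tr, y_tr)
print("Test accuracy:", clf.score(X_te, y_te))
```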

4 Results and Discussion
In this paper, after training the three models, we fed the testing data through them. For the MNIST dataset (Fig. 6), the highest accuracy was obtained by CNN, followed by ResNet and logistic regression; the accuracies of the CNN and ResNet models were very close to each other. For Fashion MNIST (Fig. 7), the highest accuracy was likewise obtained by CNN, followed by ResNet and logistic regression. The accuracy and testing configuration of the different models, (A) ResNet, (B) CNN, and (C) logistic regression, are stated below. The models were compared with previous work in the literature; the comparison is presented in Table 1. The performance of the ResNet, CNN, and regression models implemented in this research is comparable with other implementations of similar models.

Fig. 6 ResNet, CNN, and logistic regression on MNIST dataset

Fig. 7 ResNet, Regression, and CNN on Fashion MNIST dataset

Table 1 Comparison of implementation of ResNet, CNN, and logistic regression with other models from the literature

Model                    MNIST dataset (%)   Fashion MNIST dataset (%)
LinearSVC [3]            91.70               83.60
LogisticRegression [3]   91.70               84.20
CNN-SVM [19]             99.04               90.72
ResNet [8]               97.3                –
CNN [8]                  98.1                –
ResNet                   98.49               87.31
Logistic Regression      91.79               83.74
CNN                      98.73               87.38

A. ResNet
The ResNet model for the MNIST dataset was trained for 7 epochs with a batch size of 32, and testing was performed with a batch size of 1000. The accuracy of the ResNet model on the MNIST dataset comes out to be 98.49%. The same model was then used for Fashion MNIST; the only changes were the number of epochs and the batch size while training. For MNIST, increasing the number of epochs decreased the accuracy, which may be due to overfitting. In the case of Fashion MNIST, the test accuracy comes out to be 87.31% after increasing the number of epochs to 80; the batch size while training was kept at 1000.
B. CNN
The CNN model described in Fig. 5 was implemented as a sequential model in Keras. For training, a batch size of 200 was used and the number of epochs was set to 100. The MNIST dataset gave an accuracy of 98.73% on the testing dataset, and the Fashion MNIST dataset gave an accuracy of 87.38%. While running the Fashion MNIST dataset, the number of epochs and the batch size were the same as those set for MNIST.
C. Logistic regression
The logistic regression was implemented from Sklearn. The maximum number of iterations taken for the solvers to converge is set to 100 by default in the fit method of logistic regression; the iterations were set to 2000, as otherwise the model failed to converge. The MNIST dataset gave an accuracy of 91.79% on the testing dataset, and Fashion MNIST gave an accuracy of 83.74%; for Fashion MNIST also, the iterations were set to 2000.


5 Conclusion
This study provides insights into how deep learning models give higher accuracy for image classification. The comparison of the models on two datasets, MNIST and Fashion MNIST, shows that the accuracies of ResNet and CNN are very close to each other. The comparison with other models in the literature shows how our tuning in the implementation of ResNet and CNN has an impact on accuracy. The accuracy can be further increased by making the neural network deeper, training it with increased computing power, and increasing the size of the training dataset. The use of ResNet and other deep learning models can also be extended to many real-world applications using various datasets, such as image datasets, time-series datasets, etc. In the future, more advancements can be made in deep neural networks to explore more combinations of features, and pre-processing techniques can be further explored to help reduce the training time and increase accuracy.

References
1. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in 25th International Conference on Neural Information Processing Systems (ACM, Lake Tahoe, Nevada, 2012), vol. 1, p. 9
2. I. Goodfellow (2017). https://twitter.com/goodfellow_ian/status/852591106655043584?lang=en
3. H. Xiao et al., Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms (2017). http://arxiv.org/abs/1708.07747
4. C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning (AAAI, 2016)
5. X. Ou, P. Yan, Y. Zhang, B. Tu, G. Zhang, J. Wu, W. Li, Moving object detection method via ResNet-18 with encoder–decoder structure in complex scenes. IEEE Access 7, 108152–108160 (2019)
6. Z. Lu, X. Jiang, A.C. Kot, Deep coupled ResNet for low-resolution face recognition. IEEE Signal Process. Lett. 25, 526–530 (2018)
7. H. Jung, M. Choi, J. Jung, J. Lee, S. Kwon, W.Y. Jung, ResNet-based vehicle classification and localization in traffic surveillance systems, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017), pp. 934–940
8. A. Palvanov, Y.I. Cho, Comparisons of deep learning algorithms for MNIST in real-time environment. Int. J. Fuzzy Log. Intell. Syst. 18, 126–134 (2018). https://doi.org/10.5391/IJFIS.2018.18.2.126
9. B. Li, Y. He, An improved ResNet based on the adjustable shortcut connections. IEEE Access 6, 18967–18974 (2018)
10. K. Xia, H. Yin, Y. Zhang, Deep semantic segmentation of kidney and space-occupying lesion area based on SCNN and ResNet models combined with SIFT-Flow algorithm. J. Med. Syst. 43, 1–12 (2018)
11. K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26, 3142–3155 (2017)
12. H. Shin, H. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D.J. Mollura, R.M. Summers, Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35, 1285–1298 (2016)


13. F. Baldassarre, D.G. Morín, L. Rodés-Guirao, Deep koalarization: image colorization using CNNs and Inception-ResNet-v2 (2017). http://arxiv.org/abs/1712.03400
14. R. Girshick, Fast R-CNN, in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 1440–1448
15. G. Huang, Z. Liu, K.Q. Weinberger, Densely connected convolutional networks, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 2261–2269
16. S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware CNN model, in 2015 IEEE International Conference on Computer Vision (ICCV) (2015), pp. 1134–1142
17. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I.J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, M. Levenberg, D. Mané, R. Monga, S. Moore, D.G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P.A. Tucker, V. Vanhoucke, V. Vasudevan, F.B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems (2015). http://arxiv.org/abs/1603.04467
18. K. He, J. Sun, Convolutional neural networks at constrained time cost, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014), pp. 5353–5360
19. A.F. Agarap, An architecture combining Convolutional Neural Network (CNN) and Support Vector Machine (SVM) for image classification (2017). http://arxiv.org/abs/1712.03541
20. F. Chen, N. Chen, H. Mao, H. Hu, Assessing four neural networks on handwritten digit recognition dataset (MNIST) (2018). http://arxiv.org/abs/1811.08278
21. Y. Seo, K. Shin, Hierarchical convolutional neural networks for fashion image classification. Expert Syst. Appl. 116, 328–339 (2019)
22. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, Las Vegas, NV, USA, 2016), p. 12
23. C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in AAAI'17 Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (ACM, San Francisco, California, USA, 2017), p. 12
24. R. Johnson, T. Zhang, Effective use of word order for text categorization, in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Association for Computational Linguistics, Denver, Colorado, 2015), p. 10
25. M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in Computer Vision – ECCV 2014, Lecture Notes in Computer Science, vol. 8689, ed. by D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Springer, Cham, 2014)
26. Scikit-learn.org, scikit-learn: machine learning in Python – scikit-learn 0.22 documentation (2019). https://scikit-learn.org/stable/. Accessed 9 Dec 2019

Investigation of Ionospheric Total Electron Content (TEC) During Summer Months for Ionosphere Modeling in Indian Region Using Dual-Frequency NavIC System Sharat Chandra Bhardwaj, Anurag Vidyarthi, B. S. Jassal, and A. K. Shukla

Abstract When signals from satellites propagate through the ionosphere, a delay is introduced due to the presence of Total Electron Content (TEC) between the transmitter and receiver. The generation of TEC in the ionosphere is primarily dependent on solar activity (diurnal and seasonal). The ionospheric delay can cause a major degradation in the positional accuracy of a satellite navigation system. For the estimation of ionospheric delay, the slant TEC (STEC) along the path between satellite and receiver is needed. For a single-frequency user, a modeled ionospheric vertical TEC (VTEC) at the Ionospheric Pierce Point (IPP) is converted into STEC for delay estimation. However, the behavior of TEC is highly dynamic in low-latitude and equatorial regions (such as the Indian region), and thus a conventional ionospheric model introduces additional error in positioning. The geostationary satellite constellation of the NavIC (Navigation with Indian Constellation) system is uniquely capable of supporting the investigation of ionospheric TEC, and it can facilitate ionospheric modeling applications. This paper deals with the estimation of accurate STEC and VTEC using dual-frequency NavIC code and carrier measurements, and the investigation of their temporal variation for modeling applications.

Keywords Total Electron Content (TEC) · STEC · VTEC · Ionospheric delay · NavIC · Ionospheric modeling

S. C. Bhardwaj (B) · A. Vidyarthi · B. S. Jassal Propagation Research Laboratory, Department of Electronics and Communication, Graphic Era (Deemed to be University), Dehradun, India e-mail: [email protected] A. Vidyarthi e-mail: [email protected] B. S. Jassal e-mail: [email protected] A. K. Shukla Space Applications Center, Indian Space Research Organization, Ahmedabad, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_8



1 Introduction
Satellite positioning is an essential service in modern public and military applications. To deliver positional services in the Indian region, the NavIC (Navigation with Indian Constellation) system, formerly known as the Indian Regional Navigation Satellite System (IRNSS), has been designed with a seven-satellite constellation (three geostationary (GEO) and four geosynchronous (GSO)). The NavIC system operates in a dual-frequency band at S1 (2492.028 MHz) and L5 (1176.45 MHz) [1]. The accuracy of the position determined from satellite signals is crucial for applications like aircraft landing and guidance systems, and the study of glacier and tectonic plate movements needs millimeter-level accuracy [2]. To achieve such accuracy, it is necessary to determine and eliminate all sources of error in satellite positioning. There are various sources, such as the atmospheric layers (ionosphere, troposphere), satellite and receiver clock errors, multipath, the earth's magnetic field, etc., that influence the satellite signal measurements and in turn lead to positional error [3]. Among all of these, the ionosphere is the major source, accounting for 90% of the positional error. When the signal passes through the ionosphere, its velocity changes and it tends to bend due to the change in the refractive index of the ionosphere [4]. This phenomenon, which introduces a delay in the signal measured at the receiver, can introduce positional errors of up to 100 m. The ionospheric delay depends on the Total Electron Content (TEC) between the satellite and the receiver, and on the frequency of the signal. The electrons present in the ionosphere are mainly affected by solar radiation, geomagnetic storms, and lower-atmosphere waves [5]. The estimation of this ionospheric delay (called the first-order error) needs the slant TEC (STEC) along the path between satellite and receiver. Due to the complexity of the design and the cost of a dual-frequency system, many ionospheric-model data sources, such as coefficient-based models, are used to facilitate ionospheric delay estimation in single-frequency receivers [6]. For a single-frequency user, a modeled ionospheric vertical TEC (VTEC) at the Ionospheric Pierce Point (IPP) is converted into STEC for delay estimation. Although the modeled VTEC works well for mid-latitude regions, where the ionosphere behaves smoothly, it can cause a significant positional error in low-latitude and equatorial regions, such as India, due to the dynamic and unpredictable behavior of the ionosphere [7]. To meet the challenge of ionospheric modeling and delay estimation in the Indian region, there is a need to investigate the ionospheric VTEC as a function of diurnal and seasonal solar activity. However, this requires the estimation of accurate STEC and VTEC. Accurate STEC can be determined by taking the difference between dual-frequency measurements (i.e., code and carrier phase) [8]. Most of the error sources, like the troposphere, multipath, and satellite and receiver clocks, are frequency independent; by taking the difference of the measurements, all these effects are eliminated and only the effect of the frequency-dependent source (i.e., the ionosphere) remains [9]. The NavIC receiver is installed at Graphic Era Deemed to be University, Dehradun (lat. 31.26° N, long. 77.99° E), and data is being collected at the dual frequencies L5 and S1. A plot of the observable satellites over 24 h (IST) at the receiver is shown in Fig. 1.

Fig. 1 Observable NavIC satellites at the receiver (June 5, 2017) (azimuth–elevation polar plot showing PRNs 2–7)

In the figure, PRN 2, 4, 5 can be observed as geosynchronous (GSO) satellites and PRN 3, 6, 7 as geostationary (GEO) satellites. The GEOs are uniquely suited to ionospheric studies over the Indian region: due to the constant IPP of a GEO satellite, the behavior of ionospheric TEC as a function of diurnal and seasonal solar activity can be investigated more precisely than with the variable IPPs of GPS (Global Positioning System) satellites. Thus, in this paper, the investigation of STEC and VTEC is carried out using the GEO satellites (i.e., PRN 3, 6, 7). In Sects. 2 and 3, the estimation and analysis of STEC and VTEC are discussed.

2 Estimation of STEC and VTEC
The STEC can be determined by taking the difference of dual-frequency code and/or carrier phase measurements. A typical STEC using satellite code measurements is given by [10]:

$$\mathrm{STEC}_{\pounds} = \frac{1}{40.3}\,\frac{f_1^2 f_2^2}{f_1^2 - f_2^2}\,(\pounds_2 - \pounds_1) \quad (1)$$

where £1, £2 are the measured code ranges and f1, f2 are the satellite frequencies. For the NavIC frequencies, f1 (S1) = 2492.028 MHz and f2 (L5) = 1176.45 MHz, Eq. (1) can be written as [11]:

$$\mathrm{STEC}_{\pounds} = 4.4192 \times 10^{16} \times (\pounds_2 - \pounds_1)\ \text{electron/m}^2 \quad (2)$$

$$\mathrm{STEC}_{\pounds} = 4.4192 \times (\pounds_2 - \pounds_1)\ [\text{TECU}] \quad (3)$$


Similarly, the STEC derived from the carrier phase range can be written as

$$\mathrm{STEC}_{\varphi} = 4.4192 \times (\varphi_1 - \varphi_2)\ [\text{TECU}] \quad (4)$$
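As a worked illustration of Eqs. (2)–(4), the scale factor of about 4.4192 × 10^16 electrons/m² per metre of dual-frequency range difference can be computed directly from the stated NavIC frequencies. The snippet below is a minimal sketch (not the authors' code); the function and variable names are illustrative.

```python
import numpy as np

F1 = 2492.028e6   # NavIC S1 frequency (Hz)
F2 = 1176.45e6    # NavIC L5 frequency (Hz)
K = (F1**2 * F2**2) / (40.3 * (F1**2 - F2**2))  # ~4.4192e16 el/m^2 per metre
TECU = 1.0e16     # electrons/m^2 in one TEC unit

def stec_code(p1, p2):
    """Code-derived STEC in TECU from dual-frequency pseudoranges in metres (Eq. 3)."""
    return K * (np.asarray(p2) - np.asarray(p1)) / TECU

def stec_carrier(phi1, phi2):
    """Carrier-derived relative STEC in TECU (Eq. 4); offset by an unknown
    integer-cycle ambiguity until leveled against the code STEC."""
    return K * (np.asarray(phi1) - np.asarray(phi2)) / TECU
```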

2.1 Smoothing and True STEC
As compared to the code range, the carrier phase range is much more precise; however, its use is not straightforward due to the presence of integer carrier cycle ambiguities. Both code and carrier phase measurements can be used together to improve the accuracy of STEC estimation [12]. A Hatch filter-based code–carrier leveling process has been used for the determination of absolute STEC [13]. In Fig. 2, the code-derived STEC£ (in blue) and carrier-derived STECφ (in green) are shown. Due to the presence of the integer carrier cycle ambiguity, the estimated STECφ lies below zero (negative values). Using STEC£, a leveling constant D (i.e., 105.8 for PRN 3) has been derived and added to STECφ to find the absolute STEC (red line). It can be observed that the leveled STEC overlaps and follows the mean variation of STEC£. This STEC still contains satellite and receiver Differential Instrumental Biases (DIBs). These biases arise due to the path delay difference of the signal at the two frequencies (i.e., L5 and S1). The initial satellite DIBs are provided by SAC, Ahmedabad, and the initial receiver DIB is estimated using the FRB method. The final DIBs are estimated using a Kalman filter. The biased and true STECs (hereafter called STEC) are shown in Fig. 3. After the removal of biases, the STEC can be used for VTEC estimation.
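A minimal sketch of the leveling step is shown below, assuming the leveling constant D is taken as the mean code-minus-carrier offset, which matches the behaviour described above; this is not the authors' exact Hatch-filter implementation.

```python
import numpy as np

def level_carrier_stec(stec_code, stec_phase):
    """Code-carrier leveling: shift the precise but ambiguous carrier STEC by a
    constant D so that it matches the mean level of the code-derived STEC."""
    d = np.mean(np.asarray(stec_code) - np.asarray(stec_phase))  # leveling constant D
    return np.asarray(stec_phase) + d, d
```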

Fig. 2 Smoothing of STEC (PRN 5): code-derived STEC£, carrier-derived STECφ, and leveled STEC in TECU versus IST (hours); leveling by constant D = 105.8

Fig. 3 Diurnal variation of true STEC, June 5, 2017 (true STEC in TECU versus IST in hours for PRNs 2–7)

2.2 Estimation of VTEC
For single-frequency users, the mapped or modeled VTEC is converted into STEC in order to calculate the ionospheric delay. Thus, the STEC estimated from dual-frequency measurements must be converted into VTEC for mapping or modeling in the Indian region. The VTEC can be obtained by taking a projection from the slant path to the vertical path, as shown in Fig. 4. The ionosphere is considered as a thin layer (called the thin shell model) at an altitude of around 300–400 km above the earth's surface. The point where the user-to-satellite line of sight intersects the effective height (the centroid of the mass) of the ionospheric shell is called the Ionospheric Pierce Point (IPP). The STEC is converted into VTEC by multiplying by an obliquity factor given as [5, 14]:
Fig. 4 Ionosphere thin shell model and location of the IPP (geometry showing the receiver (φu, λu), the IPP (φIPP, λIPP) at shell height hIPP, elevation angle E, earth radius RE, and the earth's centre O)

Fig. 5 Diurnal variation of VTEC, June 5, 2017 (true VTEC in TECU versus IST in hours for PRNs 2–7)

$$\mathrm{VTEC} = \mathrm{STEC} \times \cos\left[\sin^{-1}\left(\frac{R_e \cos\theta}{R_e + h_{\max}}\right)\right] \quad (5)$$

where Re (radius of the earth) = 6378 km, hmax = 350 km, and θ is the elevation angle at the receiver. The diurnal variation of the estimated VTECs for all PRNs is shown in Fig. 5. Similar to STEC, an elevation-dependent variation in the VTEC values has been observed for the GSO and GEO satellites. The VTECs of the GEOs follow the diurnal sun variation and peak at the same time as found in the case of the STECs.
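Equation (5) translates directly into code. The following is a small sketch (an assumed implementation, not the authors' code) of the STEC-to-VTEC mapping using the stated Re and hmax values.

```python
import numpy as np

RE = 6378.0     # earth radius (km)
H_MAX = 350.0   # thin-shell height (km), as stated above

def stec_to_vtec(stec, elev_deg, re=RE, h_max=H_MAX):
    """Map slant TEC to vertical TEC with the thin-shell obliquity factor (Eq. 5).

    stec: slant TEC (TECU); elev_deg: satellite elevation angle at the receiver (degrees).
    """
    theta = np.radians(elev_deg)
    obliquity = np.cos(np.arcsin(re * np.cos(theta) / (re + h_max)))
    return np.asarray(stec) * obliquity
```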

3 Analysis of STEC and VTEC
As discussed in Sect. 1, the investigation of STEC and VTEC is required in order to estimate the ionospheric delay precisely for the Indian region. It was also discussed in Sect. 2 that the GEO satellites are suitable for this investigation, since their STEC and VTEC curves follow the diurnal sun variation. In this section, the STEC and VTEC are investigated for one week in each of the summer months (June, July, and August 2017) for the GEO satellites (i.e., PRN 3, 6, 7). The estimated STEC and VTEC for June 5–10, 2017 are shown in Fig. 6. It can be observed that the curves are similar (±2 TECU) except for June 7 and 9, owing to unknown solar variations. Thus, the mean of the STEC and VTEC has been calculated and plotted in Fig. 7 along with the standard deviation. The deviations are larger in the afternoon, as compared to night and morning, due to the unpredictability of solar radiation, but the mean value helps to find a general trend for the month. The mean STECs and VTECs for June, July, and August 2017 for the GEO satellites are shown in Fig. 8.

Fig. 6 Diurnal variation of (a) STEC and (b) VTEC for PRN 3, June 5–10, 2017 (TECU versus IST in hours)
Fig. 7 Mean of hourly averaged (a) STEC and (b) VTEC with standard deviation, PRN 3 (TECU versus IST in hours)

Due to their lower elevation angles, the STECs for PRNs 6 and 7 are higher (Fig. 8b, c) as compared to PRN 3 (Fig. 8a). Although different peak values are observed for different PRNs, the trends of the curves, i.e., a sharp rise in the morning, a steep fall in the evening, and a nearly constant value before sunrise, are found to be similar. The monthly VTEC curves of the individual PRNs are almost similar (±2 TECU), except at the peak time (±4 TECU). Due to the different positions of PRNs 3, 6, and 7 (Fig. 1), the corresponding IPPs are different, and thus a difference in the VTEC peak values is expected. However, due to the dependency of STEC on elevation, PRNs 6 and 7 have lower VTEC values (Fig. 8e, f) than PRN 3 (Fig. 8d). Hence, elevation-dependent VTEC modeling, along with latitude and longitude, is needed to overcome the effect of low-elevation STECs. From the observations, it is found that in the summer season the VTECs are quite stable and could be suitably used for ionospheric modeling applications.

Fig. 8 Diurnal variation of STEC and VTEC of (a, d) PRN 3, (b, e) PRN 6, and (c, f) PRN 7, respectively, for the months June, July, and August 2017 (TECU versus IST in hours)


4 Conclusion
The STEC and VTEC have been estimated using dual-frequency code and carrier phase data, and their diurnal variation for the summer months (June, July, and August 2017) has been investigated. The overall diurnal VTEC variation is found to be similar to that of the STEC, with elevation-dependent peak values. It is also found that an elevation angle dependency is present along with the diurnal sun variation. Hence, it is suggested that elevation-dependent VTEC modeling, along with latitude and longitude, is needed for the Indian region.
Acknowledgments The authors would like to thank the Space Applications Center, Indian Space Research Organization (ISRO), Ahmedabad, for providing the necessary funds, instruments, and technical support for carrying out this research work under a sponsored research project.

References
1. IRNSS SIS ICD for SPS, ISRO-ISAC V 1.1 (2011)
2. C. Rizos, Principle and Practice of GPS Surveying, Monograph No. 17, School of Geomatic Engg., University of New South Wales, Sydney (1997)
3. J.G. Peter, GPS for Geodesy (Springer-Verlag, Berlin Heidelberg, 1998)
4. C.H. Papas, Theory of Electromagnetic Wave Propagation (McGraw-Hill, New York, 1988)
5. J.A. Klobuchar, Ionospheric time-delay algorithm for single-frequency GPS users. IEEE Trans. Aerosp. Electron. Syst. AES-23(3), 325–331 (1987)
6. N. Jakowski, C. Mayer, M.M. Hoque, V. Wilken, Total electron content models and their use in ionosphere monitoring. Radio Sci. 46, RS0D18 (2011)
7. P.R. Rao, K. Niranjan, D.S.V.V.D. Prasad, S.G. Krishna, G. Uma, On the validity of the ionospheric pierce point (IPP) altitude of 350 km in the Indian equatorial and low-latitude sector. Ann. Geophys. 24(8), 2159–2168 (2006)
8. S. Bassiri, G.A. Hajj, Modeling the global positioning system signal propagation through the ionosphere. Telecommunications and Data Acquisition Progress Report, NASA Jet Propulsion Laboratory, Caltech, Pasadena (1992), pp. 92–103
9. E.J. Petrie, M. Hernández-Pajares, P. Spalla, P. Moore, M.A. King, A review of higher order ionospheric refraction effects on dual frequency GPS. Surv. Geophys. 32(3), 197–253 (2011)
10. A.J. Manucci, B.A. Iijima, U.J. Lindqwister, X. Pi, L. Sparks, B.D. Wilson, GPS and ionosphere. URSI Reviews of Radio Science. Jet Propulsion Laboratory, Pasadena (1999)
11. S.C. Bhardwaj, A. Vidyarthi, B.S. Jassal, A.K. Shukla, Study of temporal variation of vertical TEC using NavIC data, in 2017 International Conference on Emerging Trends in Computing and Communication Technologies (ICETCCT) (IEEE, 2017)
12. D.E. Wells, N. Beck, D. Delikaraoglou, A. Kleusberg, E.J. Krakiwsky, G. Lachapelle, P. Vanicek, Guide to GPS Positioning (Canadian GPS Associates, Fredericton, 1986)
13. S. Sinha, R. Mathur, S.C. Bharadwaj, A. Vidyarthi, B.S. Jassal, A.K. Shukla, Estimation and smoothing of TEC from NavIC dual frequency data, in 2018 4th International Conference on Computing Communication and Automation (ICCCA) (IEEE, 2018)
14. M.S. Bagiya, H.P. Joshi, K.N. Iyer, M. Aggarwal, S. Ravindran, B.M. Pathan, TEC variations during low solar activity period (2005–2007) near the equatorial ionospheric anomaly crest region in India (2009)

An Improved Terrain Profiling System with High-Precision Range Measurement Method for Underwater Surveyor Robot Maneesha and Praveen Kant Pandey

Abstract The paper presents an improved terrain profiling system with a high-precision range measurement method for an underwater surveyor robot. Extensive research has been carried out in the area of terrain profiling in different scenarios; however, limited work has been performed for the underwater environment. In the present work, a surveyor robot has been designed using an ultrasonic range sensor for terrain profiling in the underwater environment. The dynamic nature of the underwater scenario adds significant noise to the acoustic signals, leading to inaccurate range measurements. The noise embedded in the data degrades the working of the surveyor robot, leading to uncertainty in terrain profiling, surveying, and navigation. Different digital signal filtering techniques are used to remove noise from the data, leading to a better estimate of the range data and an improved signal-to-noise ratio. Two sets of range measurement data in an underwater setup have been estimated using signal processing techniques. The results show that the low pass FIR filter improved the results as compared to the Moving Average method; however, the high standard deviation of the results shows that FIR filtering is not adequate for accurate range measurement and thereby for the faithful working of the underwater robot. The Kalman filter, being a recursive estimator, provides an optimal solution for estimation and data prediction tasks and is efficient in filtering the noisy data from an input sensor. However, its filtering performance is dependent on the selection of the measurement noise covariance R used in the predictor-corrector model. In an actual underwater data streaming environment, it is very difficult to obtain the optimal value of R from the complex device configuration. In order to avoid a poor selection of R and to determine its best estimate directly from the sensor data, an analytical method using a Denoising Autoencoder is used in the present work. The results show that the Kalman filter method using the Denoising Autoencoder estimated the range data accurately. The terrain profile for an underwater test setup was generated by simultaneously recording the position of the robot and the elevation data filtered using the above method. Maneesha · P. K. Pandey (B) Department of Electronics, Maharaja Agrasen College, University of Delhi, Delhi, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_9


The results are in good agreement with the actual terrain profile of the test setup. Keywords Terrain profile · Elevation map · Ultrasonic range sensor · Underwater range measurement · Digital signal filter · Low pass FIR filter · Kalman filter · Denoising Autoencoder

1 Introduction
The underwater milieu is attracting major interest with regards to the vast resources lying underneath the oceans. Scientists and researchers are continuously developing newer and better technologies to further uncover this rather undiscovered world for the benefit of the world at large. In this scenario, advancements in underwater robotics with better sensing technology are providing numerous opportunities to explore and harness the vast energy which lies amidst the water bodies. Underwater robots are highly dependent on their ability to sense and respond to their environment for their exploration activities. In recent years, various research efforts in the field of underwater robots are giving rise to more focused, consistent, and reliable underwater robotic vehicles, thereby minimizing the need for human workers [1]. Currently, Remotely Operated Vehicles (ROVs) and Autonomous Underwater Vehicles (AUVs) provide sensor platforms for measuring underwater properties [2]. Both can operate in a previously unmapped environment with unpredictable disturbances and threats [3]. Salamonowicz and Arnold generated terrain profiles of Alaska and Antarctica using Seasat radar altimeter data and a data reduction method developed by the Geoscience Research Corporation; however, when compared with the terrain profile generated using Doppler measurements, an error of magnitude 50 feet was observed [4]. Florida Atlantic University's Ocean Engineering Dept. and the University of South Florida's Marine Science Department designed the long-range AUV "The Ocean Voyager II", used for coastal oceanography on the principle of light reflectance and absorption measurement while flying at a constant altitude [5]. Apart from these, the Internet of Underwater Things (IoUT) is being used as a worldwide network of smart interconnected underwater objects with a digital entity [6] to enable various practical applications, such as environmental monitoring, underwater exploration, and disaster prevention. For terrain profiling and surveillance, many types of range sensors based on electromagnetic waves, such as light or short radio waves, are available for estimating distance in the air medium, but these sensors are not effective under water, as electromagnetic waves are heavily attenuated beyond short distances [7, 8]. However, sound waves propagate easily in water. Hence, ultrasonic range sensors are preferred for depth measurement, and for detecting and mapping landmarks under the water by the surveyor robot. However, the precision and update rate of the range measurement are the two major limitations of ultrasonic range sensors. Further, the dynamic nature of the underwater


medium disrupts the signal quality, as acoustic signals are actuated mechanically [9]. Thus, the data collected by ultrasonic range sensors has limited precision, as the sensors are prone to noise in dynamic environments with variations in temperature, turbidity, and strong underwater currents, leading to uncertainty in terrain profiling, surveying, and navigation. The noise embedded in the data collected by ultrasonic range sensors has a vital impact on the working of the surveyor robot. In order to avoid a non-convergent system, a new calibration is essential: the inherent noise in the data must be filtered to obtain a better estimate of the range data. The paper presents a comparative study of the use of digital signal filtering techniques, including the Moving Average filter, the Finite Impulse Response (FIR) filter, and the Kalman filter, to remove noise from the data obtained from the ultrasonic range sensor, leading to a better estimate of the range data and an improved signal-to-noise ratio. The filtered data was used to generate the terrain profile of an underwater test setup.

2 Design of Underwater Surveyor Robot

2.1 Hardware
In the present work, the underwater surveyor robot was designed using an ATmega328 microcontroller and was equipped with an ultrasonic range sensor, accelerometer, gyroscope, and temperature sensor for generating the terrain profile. The GY-85 BMP085, a 9-axis sensor module comprising a 3-axis gyroscope, 3-axis accelerometer, and 3-axis magnetometer, is used for the inertial measurement system. A DYP-ME007Y-PWM waterproof ultrasonic sensor is used for underwater range measurement and terrain scanning. After power-on, the ultrasonic sensor waits for the trigger signal. When it receives a trigger signal, it generates and transmits eight 40 kHz pulses and waits for the echo signal. A PWM pulse is generated according to the delay between the transmitted and the echo signal, and the distance can be deduced from the pulse width. If no echo is detected, the sensor generates a constant pulse width of about 35 ms. Two sets of range measurement data were obtained using the range sensor for estimating depth in an underwater setup. The raw data received from the sensor lacks precision due to the inherent characteristics of the sensor as well as the noisy underwater environment. Hence, it is imperative to estimate the accurate signal from the sensor data using signal processing techniques. The paper analyzes and compares the filtered range sensor data obtained using three different signal processing techniques, i.e., the Moving Average filter, the Finite Impulse Response filter, and the Kalman filter. The terrain profile was generated by simultaneously recording the position of the robot using the inertial measurement system, while the elevation data was filtered using the improved Kalman filter method.
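As an illustration of the timing-to-range conversion just described, the sketch below (an assumed helper, not the authors' firmware) converts the echo pulse width to a range using a nominal speed of sound in water of about 1480 m/s, a value the paper does not state.

```python
SPEED_OF_SOUND_WATER = 1480.0  # m/s, nominal; varies with temperature and salinity
NO_ECHO_PULSE_S = 35e-3        # the sensor emits a ~35 ms pulse when no echo is detected

def pulse_width_to_range(pulse_width_s):
    """Convert the sensor's echo pulse width (seconds) to a one-way range in metres.

    The pulse width spans the round trip of the 40 kHz burst, so the range is
    half the acoustic path length. Returns None for the "no echo" pulse.
    """
    if pulse_width_s >= NO_ECHO_PULSE_S:
        return None  # no target detected within range
    return pulse_width_s * SPEED_OF_SOUND_WATER / 2.0
```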

96

Maneesha and P. K. Pandey

2.2 High-Precision Range Measurement Methods 2.2.1

Moving Average Filter Algorithm

The Moving Average filter or running-mean filter is one of the most commonly used filters in the underwater environment [10]. It is used to filter out short-term ripples and emphasize longer term trends. The threshold between short-term and long-term depends on the parameters of the Moving Average filter, chosen according to the application. The filter output consists of the filtered data sequence with a degree of smoothing, and an associated loss of information from both ends of the input data, depending on the number of filter weights [10]. The mathematical design of the filter is described in detail by Thomson and Emery:

$$z_{M+k} = \frac{1}{2M+1}\sum_{i=0}^{2M} x_{i+k} \quad (1)$$

$$w = \frac{1}{2M+1} \quad (2)$$

The above equation shows that the Moving Average filter is a moving rectangular window filter and consists of an odd number of 2M + 1 equal weights w which resembles a uniform probability density function. Two implementations of Moving Average filter with M = 1 and 2 are presented in the current work.
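As a concrete illustration of Eqs. (1)–(2), the following minimal sketch (an assumed implementation, not the authors' code) computes the centred running mean with 2M + 1 equal weights.

```python
import numpy as np

def moving_average(x, m=1):
    """Centred moving-average filter with 2M+1 equal weights (Eqs. 1-2).

    Returns len(x) - 2*m samples: M points are lost at each end of the record.
    """
    window = 2 * m + 1
    weights = np.full(window, 1.0 / window)
    return np.convolve(np.asarray(x, dtype=float), weights, mode="valid")
```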

2.2.2

Finite Impulse Response (FIR) Filter Algorithm

A Finite Impulse Response filter is a filter whose impulse response is of finite time duration, i.e., it decays and settles to zero in finite time. An FIR filter generates its output as a weighted sum of samples of the input signal. The output of an Nth order general linear FIR filter, with impulse response h_k, is given by the following equation [11]:

$$z_k = \sum_{m=0}^{N-1} h_m\, x_{k-m} = \sum_{m=0}^{N-1} h_{k-m}\, x_m \quad (3)$$

For an ideal low pass FIR filter, the impulse response h_k is given by:

$$h_k = \frac{\sin(k\omega_c)}{k\pi} = \frac{\omega_c}{\pi}\,\mathrm{sinc}(k\omega_c) \quad (4)$$

Finite Impulse Response filters with N = 10 and 20 were implemented to obtain the estimated signal for both sets of range data.
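A practical way to realise Eq. (4) is to truncate the ideal impulse response to N taps. The sketch below is an assumed implementation, not the authors' code; the Hamming window and DC-gain normalisation are common additions the paper does not mention.

```python
import numpy as np

def lowpass_fir_taps(n, cutoff_norm):
    """Windowed-sinc low-pass FIR taps of order n (Eq. 4 truncated to n taps).

    cutoff_norm: cutoff frequency as a fraction of the Nyquist frequency (0..1).
    """
    wc = np.pi * cutoff_norm
    k = np.arange(n) - (n - 1) / 2.0
    taps = (wc / np.pi) * np.sinc(wc * k / np.pi)  # np.sinc(x) = sin(pi x)/(pi x)
    taps *= np.hamming(n)                          # reduce truncation ripple
    return taps / taps.sum()                       # normalise for unity DC gain

def fir_filter(x, taps):
    """Apply the FIR filter as the weighted sum of Eq. 3 (by convolution)."""
    return np.convolve(np.asarray(x, dtype=float), taps, mode="same")
```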

2.2.3 Kalman Filter

The Kalman filter uses a series of measurements observed over time, containing process and environment noise, and produces estimates of unknown variables which are more accurate and precise, by estimating a joint probability distribution over the variables for each timeframe [12]. Knowing the covariance matrices of the estimate and the incoming measurements, the filter fuses measurements and estimates, minimizing the variance of the resulting estimate [13]. The filter is also referred to as the Linear Quadratic Estimator (LQE). The Kalman model for the current problem, comprising the state difference and measurement equations for a linear dynamic system, is given by [14]:

$$x_k = A x_{k-1} + B u_k + w_k \quad (5)$$

$$z_k = H x_k + v_k \quad (6)$$

where

Variable | Description | Dimension
x | System state vector, x_k ∈ R^n | n × 1
u | System control vector | p × 1
w | Process/perturbation noise vector, w_k ∈ R^n | n × 1
z | Measurement vector, z_k ∈ R^m | m × 1
v | Measurement noise vector, v_k ∈ R^m | m × 1
A | System state matrix, A ∈ R^(n×n) | n × n
B | System control matrix, B ∈ R^(n×p) | n × p
H | Measurement matrix, H ∈ R^(m×n) | m × n

w_k and v_k are independent Gaussian white noise sequences which satisfy the following equations [15–17]:

$$E\{w_k\} = 0; \quad E\{v_k\} = 0; \quad E\{w_k v_j^T\} = 0 \quad (7)$$

$$E\{w_k w_j^T\} = Q\,\delta_{kj}; \quad E\{v_k v_j^T\} = R\,\delta_{kj} \quad (8)$$

where E{X} represents the expectation (or mean) of X, Q is the covariance matrix of the process noise vector w_k, R is the covariance matrix of the measurement noise vector v_k, and δ_kj = 1 if k = j and δ_kj = 0 if k ≠ j. The Kalman filter is a recursive filter and is a two-step process comprising Prediction and Correction (or Update). The first step, i.e., the Prediction phase, uses the state estimate from the previous time step to produce an estimate of the state at the current time step. In the Correction phase, the current prediction is combined with the current observation information to obtain a more precise state estimate.


Prediction:

$$\hat{x}_k^- = A \hat{x}_{k-1} + B u_k \quad (9)$$

$$P_k^- = A P_{k-1} A^T + Q \quad (10)$$

where x̂_k^- is the predicted state estimate and P_k^- is the predicted error covariance. P_k ∈ R^(n×n) is the error covariance matrix, i.e., the covariance of the state error e_k (the difference between the estimated state value and the true state). P may be defined [12] by the following equation:

$$E\{e_k e_j^T\} = P_k\,\delta_{kj} \quad (11)$$

where δ_kj = 1 if k = j and δ_kj = 0 if k ≠ j.

Correction/Update:

$$K_k = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1} \quad (12)$$

$$\hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H \hat{x}_k^-\right) \quad (13)$$

$$P_k = (I - K_k H)\, P_k^- \quad (14)$$

where K is the Kalman gain, i.e., the relative weight given to the measurements and the current state estimate through the covariance matrices Q and R, and x̂_k is the corrected/updated state estimate. Although the Kalman filter is very efficient in filtering noisy data from an input sensor, its filtering performance depends on the input noise parameters. Selecting an optimal value of the measurement noise covariance R is important for effective filtering with the Kalman filter. In an actual underwater data streaming environment, it is very difficult to obtain the optimal value of R from the complex device configuration, and with a poor choice of R the accuracy of the Kalman filter is reduced and degraded. In order to avoid a poor selection of R and to determine its best estimate directly from the sensor data, an analytical method using a Denoising Autoencoder is used in the present work [18, 19].
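A minimal sketch of Eqs. (9)–(14) for the scalar case is given below, assuming a constant-level range model (A = H = 1, B = 0). The function name and the default q, r values are illustrative; r stands in for the autoencoder-derived estimate of the measurement noise covariance.

```python
import numpy as np

def kalman_1d(z, q=1e-4, r=0.25, x0=0.0, p0=1.0):
    """Scalar Kalman filter over a measurement sequence z (Eqs. 9-14).

    q, r: process and measurement noise variances; x0, p0: initial state and
    error covariance. Returns the filtered state estimates.
    """
    x, p = x0, p0
    out = np.empty(len(z))
    for k, zk in enumerate(z):
        # Prediction (Eqs. 9-10), with A = 1, B = 0
        x_pred, p_pred = x, p + q
        # Correction (Eqs. 12-14), with H = 1
        kgain = p_pred / (p_pred + r)
        x = x_pred + kgain * (zk - x_pred)
        p = (1.0 - kgain) * p_pred
        out[k] = x
    return out
```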

3 Result
To reduce the noise in the range measurement data and improve the signal-to-noise ratio, data for two depths were recorded using the ultrasonic range sensor. Data set I was measured for a depth of less than 100 cm, whereas data set II was taken for a depth greater than 200 cm.


Two implementations of the Moving Average filter, with M = 1 and M = 2, were applied to data set I and data set II. The results are shown in Fig. 1. The mean value and standard deviation of the filtered data (data set I, M = 1) in Fig. 1Ia are 75.46 cm and 0.085, respectively. The standard deviation of the filtered data (data set I, M = 2) in Fig. 1Ib reduces to 0.076 while the mean remains the same. The mean value and standard deviation of the filtered data (data set II, M = 1) in Fig. 1IIa are 210.91 cm and 0.147, respectively. The standard deviation of the filtered data (data set II, M = 2) in Fig. 1IIb reduces to 0.117 while the mean remains the same. Finite Impulse Response filters with order N equal to 10 and 20 were implemented and applied to both sets of the sensor data. The results are shown in Fig. 2. The mean value and standard deviation of the filtered data (data set I, N = 10) in Fig. 2Ia are 75.38 cm and 1.330, respectively, while the mean value and standard deviation of the filtered data (data set I, N = 20) in Fig. 2Ib are 75.39 cm and 1.395, respectively. The mean value and standard deviation of the filtered data (data set II, N = 10) in Fig. 2IIa are 210.90 cm and 0.118, respectively; the standard deviation of the filtered data (data set II, N = 20) in Fig. 2IIb increases to 0.128 while the mean remains the same. The Kalman filter was applied to data sets I and II and the results are shown in Fig. 3. The mean and standard deviation of the filtered data for data set I were found to be 75.41 cm and 0.059. The mean and standard deviation values for data set II were 210.91 cm and 0.020, respectively.

Fig. 1 Filtering of range sensor data using Moving Average filter


Fig. 2 Filtering of range sensor data using FIR filter

Fig. 3 Filtering of range sensor data using Kalman filter

A test setup of size 60 cm × 35 cm was designed and immersed in a water tank of height 90 cm. The setup is shown in Fig. 4. The range data, along with the position, was captured simultaneously by the underwater surveyor robot, and the range data was then filtered using the improved Kalman filter method described above. The terrain profile for the test setup was thus generated by the underwater surveyor robot. The results were logged in a computer for plotting the 2D surface plots; MATLAB was used to plot the terrain profile, which is shown in Fig. 5.


Fig. 4 Underwater test setup

Fig. 5 Terrain profile for underwater test setup

4 Conclusion
The paper presents an improved terrain profiling system with a high-precision range measurement method for an underwater surveyor robot. A comparative study of digital signal filtering techniques, including the Moving Average filter, the Finite Impulse Response (FIR) filter, and the Kalman filter, for removing noise from ultrasonic range sensor data is presented, leading to a better estimate of the range data and an improved signal-to-noise ratio. The results show that the Kalman filter method with the Denoising Autoencoder estimated the range data accurately and was superior to the Moving Average and FIR filter methods in filtering accuracy. The improved Kalman filter method was used to generate the terrain profile of an underwater test setup. The results are in good agreement with the actual terrain profile of the test setup.


References
1. J. Yuh, Design and control of autonomous underwater robots: a survey. Auton. Robot. 8(1), 7–24 (2000)
2. R.E. Thomson, W.J. Emery, Data acquisition and recording, in Data Analysis Methods in Physical Oceanography, 3rd edn. (Elsevier, 2014), pp. 1–186
3. J.G. Bellingham, K. Rajan, Robotics in remote and hostile environments. Science 318(5853), 1098–1102 (2007)
4. P.H. Salamonowicz, A.M. ASCE, G.C. Arnold, Terrain profiling using Seasat radar altimeter. J. Surv. Eng. (1985). https://doi.org/10.1061/(asce)0733-9453(1985)111:2(140)
5. S.M. Smith, S.E. Dunn, The Ocean Voyager II: an AUV designed for coastal oceanography, in Autonomous Underwater Vehicle Technology, AUV '94 (1994). http://doi.org/10.1109/AUV.1994.518618
6. J. Pascual, O. Sanjuán, J.M. Cueva, B.C. Pelayo, M. Álvarez, A. González, Modeling architecture for collaborative virtual objects based on services. J. Netw. Comput. Appl. 34(5), 1634–1647 (2011)
7. P.K. Pandey, Maneesha, S. Sharma, V. Kumar, S. Pandey, An intelligent terrain profiling embedded system for underwater applications, in Proceedings of the International Conference on Computational Intelligence & Communication Technology (CICT) (2018), ISBN: 978-1-5386-0886-9, IEEE Xplore
8. P. Jonsson, I. Sillitoe, B. Dushaw, J. Heltne, Observing using sound and light – a short review of underwater acoustic and video-based methods. Ocean Sci. Discuss. 6, 819–870 (2009)
9. M.R. Arshad, Recent advancement in sensor technology for underwater applications. Indian J. Geo-Marine Sci. 38(3), 267–273 (2009)
10. R.E. Thomson, W.J. Emery, Digital filters, in Data Analysis Methods in Physical Oceanography, 3rd edn. (Elsevier, 2014), p. 607
11. F.J. Taylor, Digital Filters: Principles and Applications with MATLAB (IEEE-John Wiley & Sons, Inc., 2012)
12. J. Ou, S. Li, J. Zhang, C. Ding, Method and evaluation method of ultra-short-load forecasting in power system, in Data Science: 4th International Conference of Pioneering Computer Scientists, Engineers and Educators, Proceedings, Part 2 (2018)
13. T.D. Larsen, K.L. Hansen, N.A. Andersen, O. Ravn, Design of Kalman filters for mobile robots: evaluation of the kinematic and odometric approach, in Proceedings of the IEEE Conference on Control Applications, vol. 2 (1999)
14. R. Kalman, A new approach to linear filtering and prediction problems. J. Basic Eng. (1960)
15. X.F. Zhang, A.W. Heemink, J.C.H. Van Eijeren, Performance robustness analysis of Kalman filter for linear discrete-time systems under plant and noise uncertainty. Int. J. Syst. Sci. 26(2), 257–275 (1995)
16. G. Chen, J. Wang, L. Shieh, Interval Kalman filtering. IEEE Trans. Aerosp. Electron. Syst. 33(1), 250–259 (1997)
17. J. Xiong, C. Jauberthie, L. Trave-Massuyes, New computation aspects for the interval Kalman filtering, in 15th IFAC Workshop on Control Applications of Optimization (2012)
18. P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, in Proceedings of the 25th International Conference on Machine Learning (2008), pp. 1096–1103
19. P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in Proceedings of the ICML Workshop on Unsupervised and Transfer Learning (2011), pp. 37–49

Prediction of Diabetes Mellitus: Comparative Study of Various Machine Learning Models Arooj Hussain and Sameena Naaz

Abstract Diabetes is a common metabolic-cum-endocrine disorder in the world today. It is generally a chronic problem where either the pancreas does not produce an adequate quantity of Insulin, a hormone that regulates the blood glucose level, or the body does not effectively utilize the produced Insulin. This review paper presents a comparison of various Machine Learning models in the detection of Diabetes Mellitus (Type-2 Diabetes). Selected papers published from 2010 to 2019 have been comparatively analyzed and conclusions drawn. The various models compared are Adaptive Neuro-Fuzzy Inference System (ANFIS), Deep Neural Network (DNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression, Decision Tree, Naive Bayes, K-Nearest Neighbours (KNN) and Random Forest. The two models which have outperformed all others in most of the studies taken into consideration are Random Forest and Naive Bayes. Other powerful mechanisms are SVM, ANN and ANFIS. The criteria chosen for comparison are accuracy and the Matthews Correlation Coefficient (MCC). Keywords Machine Learning · Diabetes Mellitus · Random Forest · Artificial Neural Network (ANN) · Logistic Regression · Cross-validation · Percentage split

1 Introduction
Machine Learning (ML) can be defined as a subtype of Artificial Intelligence used to solve real-world problems by "providing learning ability to computer without additional programming" [1]. As development in ML has increased, so has the use of computers in medicine.
A. Hussain · S. Naaz (B) Department of Computer Science and Engineering, School of Engineering Sciences and Technology, Jamia Hamdard, New Delhi 110062, India e-mail: [email protected] A. Hussain e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_10


Diabetes Mellitus is a disease that is affecting a large population throughout the world and is becoming a huge challenge to tackle. According to the data released by the International Diabetes Federation [2], in 2017 there were 425 million people suffering from Diabetes globally, a figure expected to rise to 629 million by 2045. For the classification and prediction of the occurrence of Diabetes, various computational techniques have been developed and utilized. The use of Machine Learning techniques in prediction is proving to be very useful, as it increases the accuracy of diagnosis, reduces costs, and increases the rates of effective treatment. Patients with other diseases, such as breast cancer and brain tumours, can also benefit from Machine Learning employed to detect anomalies in their scans, as shown by Naaz et al. [3], Kahksha et al. [4], and Hassan et al. [5]. A comprehensive study of a number of Machine Learning models used for the identification of diabetes has been done in this review, thereby comparing their performance and figuring out the most suitable amongst them. In this survey, several ML models, viz. Adaptive Neuro-Fuzzy Inference System (ANFIS), Deep Neural Network (DNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression, Decision Tree, Naive Bayes, K-Nearest Neighbours (KNN) and Random Forest, have been compared. At last, the outcomes of the previously conducted studies have been analyzed, which hopefully will help in future advancement and research. The remaining paper has been arranged into sections as follows: In Sect. 2, the description of the problem statement and the aim of the study are given. Section 3 discusses all the techniques that have been used by researchers to predict Diabetes Mellitus in the selected papers, followed by the validation techniques used in Sect. 4. Section 5 summarizes the results that are obtained from each study. Section 6 describes the cumulative results drawn from all the studies according to the method of validation used, and in Sect. 7, the conclusion from the entire exercise is provided. Finally, the challenges faced, future directions, and limitations of this study are briefly discussed in Sect. 8.

2 Problem Statement This survey has been formulated to carry out a review of Machine Learning techniques/models and their application in classification/detection of Diabetes Mellitus. The aim of this survey is to compare these techniques and to conclude which is the most feasible model for achieving the highest accuracy in the prediction process. Going by the experience of scientists from the papers [6, 7] who have compared ML techniques in various research studies, the literature review was carried out by rigorously scanning and going through previously published papers in depth to obtain inferences about Diabetes prediction. Only those papers were selected that were published between 2010 and 2019.


3 Machine Learning Techniques
The various techniques that have been compared in this review are briefly described below. The Adaptive Neuro-Fuzzy Inference System (ANFIS) is a system that incorporates the basic principles of Neural Networks as well as Fuzzy Logic. It combines parallel distributed processing with the learning ability of an ANN and hence works as a hybrid. ANFIS is itself composed of two parts, the antecedent part and the conclusion part, which communicate with each other using some pre-set rules [8]. Architecturally, it is generally made up of 5 layers [9]. An Artificial Neural Network (ANN) is a Machine Learning model that takes inspiration from the biological neural networks present inside the human brain. Analogous to the human nervous system, the basic building block of an ANN is called a Neuron. Any Neural Network has a minimum of three layers: an input layer, an output layer, and a hidden layer sandwiched in between them. An ANN featuring more than one hidden layer is called a Deep Neural Network (DNN). A Support Vector Machine (SVM) is a supervised Machine Learning method employed mostly for classification problems and rarely for regression problems, proposed by Platt et al. [10]. As a classifier, it is a discriminative classification method that segregates data points into two or more classes on the basis of certain parameters [11]. The best hyperplane is considered to be the one with the largest distance from the closest data points, known as the margin, which decreases the chances of generalization error. Logistic Regression can be thought of as a simpler version of a DNN in a way, deprived of any hidden layers [12]. The processing of data is done by using an activation function and a sigmoid function whose output is then compared to 0.5 for the classification purpose. It uses the sigmoid function to find the probability of a class using the following rule: if the probability comes out to be ≥ 0.5, the sample is assumed to belong to class 1, and if the probability is < 0.5, to class 0.
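Comparisons like those surveyed here can be exercised end to end with scikit-learn. The snippet below is an illustrative sketch (not taken from any of the reviewed papers) that scores a few of the compared classifiers with 10-fold cross-validation; the synthetic X, y are placeholders standing in for a diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Placeholder data: X would hold patient attributes, y the diabetes label (0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")  # 10-fold CV
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```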

A Slope Sign Change is counted when, for consecutive samples x_{i−1}, x_i, x_{i+1}:

$$\{x_i > x_{i+1} \text{ and } x_i > x_{i-1}\} \text{ or } \{x_i < x_{i+1} \text{ and } x_i < x_{i-1}\} \quad (4)$$

$$\{|x_i - x_{i+1}| \ge \varepsilon\} \text{ and } \{|x_i - x_{i-1}| \ge \varepsilon\} \quad (5)$$

Waveform Length. The waveform length is the total length of the signal waveform over each segment. This can be calculated as given in Eq. (6):

$$WL = \sum_{i=1}^{N} |\Delta x_i| \quad (6)$$

where Δx_i = x_i − x_{i−1}.
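The four features can be computed per segment as below. This is an assumed sketch, not the authors' code: MAV and ZC follow their standard definitions (their defining equations fall in a part of the paper not reproduced here), and eps plays the role of the threshold ε in Eqs. (4)–(5).

```python
import numpy as np

def time_domain_features(x, eps=0.01):
    """MAV, ZC, SSC and WL for one EEG segment (cf. Eqs. 4-6)."""
    x = np.asarray(x, dtype=float)
    mav = np.mean(np.abs(x))                                 # mean absolute value
    zc = np.sum((x[:-1] * x[1:] < 0) &                       # sign change between samples
                (np.abs(x[:-1] - x[1:]) >= eps))             # ... exceeding the threshold
    mid, prev, nxt = x[1:-1], x[:-2], x[2:]
    ssc = np.sum((((mid > prev) & (mid > nxt)) |             # local peak or valley (Eq. 4)
                  ((mid < prev) & (mid < nxt)))
                 & (np.abs(mid - nxt) >= eps)                # amplitude thresholds (Eq. 5)
                 & (np.abs(mid - prev) >= eps))
    wl = np.sum(np.abs(np.diff(x)))                          # waveform length (Eq. 6)
    return mav, zc, ssc, wl
```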

3 Classification
Two mental tasks, right hand and feet movement, are classified by ten different classifiers, including Naïve Bayes, Bayes net, AdaBoost, and several decision-based classifiers. In this classification, the features extracted by the time-domain method are applied as input to these classifiers so as to evaluate their performance and understand their capability to separate the motor imagery tasks.

4 Results and Discussion
This section provides the results obtained from classification using the time-domain features. The different classifiers are evaluated to judge the time-domain features, and the details are presented in the subsequent tables. In the current work, the motor imagery task is taken for analysis. The EEG task data are taken from BNCI Horizon 2020. In the dataset, signals are recorded through 15 channels for 14 subjects. In this work, only 3 subjects with the central channels (C3, CZ, and C4) are considered for analysis. Each subject consists of 3 runs for training and 5 runs for testing; every run comprises the data, trial value, and class label. The two classes, right hand and feet movement, are considered as class 1 and class 2, respectively. Before performing the pre-processing method, the data are first segmented into 20 trials for each channel. In the next stage, the four time-domain features MAV, ZC, SSC, and WL are calculated and combined into a feature set. On this feature set, 10 different classifiers are run and the performance of each classifier is evaluated. The obtained results show that for subject 1 (Table 1), AdaBoostM1 and Decision table show the highest (and equal) recognition capability with 56.25%, and the minimum accuracy is achieved by the Logistic classifier. For subject 2 (Table 2), the maximum accuracy of 59.375% is achieved by the Bayes net and IBK classifiers. Similarly, for subject 3 (Table 3), the maximum accuracy of 59.375% is achieved by Naive Bayes, SGD, IBK, and LWL.

Table 1 Performance of different classifiers on S01

Classifier | Class | TP rate | FP rate | Precision | Recall | F-measure | Accuracy (%)
Bayes net | 1 | 0.375 | 0.5 | 0.429 | 0.375 | 0.4 |
Bayes net | 2 | 0.5 | 0.625 | 0.444 | 0.5 | 0.471 |
Bayes net | Wt. avg | 0.438 | 0.563 | 0.437 | 0.438 | 0.435 | 43.75
Naïve Bayes | 1 | 0.375 | 0.5 | 0.429 | 0.375 | 0.4 |
Naïve Bayes | 2 | 0.5 | 0.625 | 0.444 | 0.5 | 0.471 |
Naïve Bayes | Wt. avg | 0.438 | 0.563 | 0.437 | 0.438 | 0.435 | 43.75
Logistic | 1 | 0.313 | 0.625 | 0.333 | 0.313 | 0.323 |
Logistic | 2 | 0.375 | 0.688 | 0.353 | 0.375 | 0.364 |
Logistic | Wt. avg | 0.344 | 0.656 | 0.343 | 0.344 | 0.364 | 34.375
SGD | 1 | 0.25 | 0.438 | 0.364 | 0.25 | 0.296 |
SGD | 2 | 0.563 | 0.75 | 0.429 | 0.563 | 0.486 |
SGD | Wt. avg | 0.406 | 0.594 | 0.396 | 0.406 | 0.391 | 40.625
SMO | 1 | 0.313 | 0.5 | 0.389 | 0.313 | 0.345 |
SMO | 2 | 0.5 | 0.688 | 0.421 | 0.5 | 0.357 |
SMO | Wt. avg | 0.46 | 0.594 | 0.403 | 0.406 | 0.401 | 40.625
IBK | 1 | 0.625 | 0.75 | 0.455 | 0.625 | 0.526 |
IBK | 2 | 0.25 | 0.375 | 0.4 | 0.25 | 0.308 |
IBK | Wt. avg | 0.438 | 0.563 | 0.427 | 0.438 | 0.417 | 43.75
LWL | 1 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
LWL | 2 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
LWL | Wt. avg | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 50
AdaboostM1 | 1 | 1 | 0.875 | 0.533 | 1 | 0.696 |
AdaboostM1 | 2 | 0.125 | 0 | 1 | 0.125 | 0.222 |
AdaboostM1 | Wt. avg | 0.563 | 0.438 | 0.767 | 0.563 | 0.459 | 56.25
Decision table | 1 | 0.75 | 0.625 | 0.545 | 0.75 | 0.632 |
Decision table | 2 | 0.375 | 0.25 | 0.6 | 0.375 | 0.462 |
Decision table | Wt. avg | 0.563 | 0.438 | 0.573 | 0.563 | 0.547 | 56.25
Random forest | 1 | 0.25 | 0.5 | 0.333 | 0.25 | 0.286 |
Random forest | 2 | 0.5 | 0.75 | 0.4 | 0.5 | 0.444 |
Random forest | Wt. avg | 0.375 | 0.625 | 0.367 | 0.375 | 0.365 | 50

In terms of subjects, subject 2 (S02) and subject 3 (S03) delivered the highest accuracy, while subject 1 (S01) delivered the minimum accuracy for the same set of features. The best results are given by the Bayes net and Naive Bayes classifiers, with an average classification accuracy of 53.125%, and the worst result is given by the Logistic classifier, with an average of 43.75%. To examine the efficiency of the features, Table 4 provides a comparison between the proposed approach and an existing approach on the same dataset; the table shows the classification performance for each of the three subjects. As shown in Table 4, the proposed method gives a better performance of 56.25% for subject 1 and 59.375% for subjects 2 and 3. Therefore, it can be stated that the proposed method performs significantly better than the existing method discussed in the literature survey section.


Table 2 Performance of different classifiers on S02

Classifier | Class | TP rate | FP rate | Precision | Recall | F-measure | Accuracy (%)
Bayes net | 1 | 0.65 | 0.5 | 0.684 | 0.65 | 0.667 |
Bayes net | 2 | 0.5 | 0.35 | 0.462 | 0.5 | 0.48 |
Bayes net | Wt. avg | 0.594 | 0.444 | 0.601 | 0.594 | 0.597 | 59.375
Naïve Bayes | 1 | 0.6 | 0.5 | 0.667 | 0.6 | 0.632 |
Naïve Bayes | 2 | 0.5 | 0.4 | 0.429 | 0.5 | 0.462 |
Naïve Bayes | Wt. avg | 0.563 | 0.463 | 0.577 | 0.563 | 0.568 | 56.25
Logistic | 1 | 0.5 | 0.583 | 0.588 | 0.5 | 0.541 |
Logistic | 2 | 0.417 | 0.5 | 0.333 | 0.417 | 0.37 |
Logistic | Wt. avg | 0.469 | 0.552 | 0.493 | 0.469 | 0.477 | 46.875
SGD | 1 | 0.5 | 0.5 | 0.625 | 0.5 | 0.556 |
SGD | 2 | 0.5 | 0.5 | 0.375 | 0.5 | 0.429 |
SGD | Wt. avg | 0.5 | 0.5 | 0.531 | 0.5 | 0.508 | 50
SMO | 1 | 0.5 | 0.417 | 0.667 | 0.5 | 0.571 |
SMO | 2 | 0.583 | 0.5 | 0.412 | 0.583 | 0.483 |
SMO | Wt. avg | 0.531 | 0.448 | 0.571 | 0.531 | 0.538 | 53.125
IBK | 1 | 0.55 | 0.333 | 0.733 | 0.55 | 0.629 |
IBK | 2 | 0.667 | 0.45 | 0.471 | 0.667 | 0.552 |
IBK | Wt. avg | 0.594 | 0.377 | 0.635 | 0.594 | 0.6 | 59.375
LWL | 1 | 0.45 | 0.147 | 0.643 | 0.45 | 0.529 |
LWL | 2 | 0.583 | 0.55 | 0.389 | 0.583 | 0.467 |
LWL | Wt. avg | 0.5 | 0.467 | 0.548 | 0.5 | 0.506 | 50
AdaboostM1 | 1 | 0.15 | 0.083 | 0.75 | 0.15 | 0.25 |
AdaboostM1 | 2 | 0.917 | 0.85 | 0.393 | 0.917 | 0.55 |
AdaboostM1 | Wt. avg | 0.438 | 0.371 | 0.616 | 0.438 | 0.363 | 43.75
Decision table | 1 | 0.5 | 0.667 | 0.556 | 0.5 | 0.526 |
Decision table | 2 | 0.333 | 0.5 | 0.286 | 0.333 | 0.308 |
Decision table | Wt. avg | 0.438 | 0.604 | 0.454 | 0.438 | 0.444 | 43.75
Random forest | 1 | 0.1 | 0 | 1 | 0.1 | 0.182 |
Random forest | 2 | 1 | 0.9 | 0.4 | 1 | 0.571 |
Random forest | Wt. avg | 0.438 | 0.338 | 0.775 | 0.438 | 0.328 | 43.75

and subject 3. Therefore it can be stated that proposed method significantly performs better than existing method that already discussed in the literature survey section. It can be noticed that highest accuracy archived is 59.375%, which is not enough to design the stable and reliable BCI. Lowest accuracy indicates that regardless of feature ability in classification and it had difficulty to deal with chaotic behavior of EEG signal. Mostly best feature is the one that gives better accuracy in order to design the BCI.

Evolution of Time-Domain Feature for Classification …

413

Table 3 Performance of different classifiers on S03 Classifier TP rate FP rate Precision Recall Bayes net

Naïve Bayes

Logistic

SGD

SMO

IBK

LWL

AdaboostM1

Decision table

Random forest

0.5 0.667 0.563 0.333 0.45 0.377 0.35 0.75 0.5 0.4 0.917 0.594 0.4 0.917 0.594 0.5 0.75 0.594 0.4 0.917 0.594 0.15 1 0.469 0.4 0.583 0.469 0.1 1 0.438

0.333 0.5 0.396 0.733 0.471 0.635 0.25 0.65 0.4 0.083 0.6 0.277 0.083 0.6 0.277 0.25 0.5 0.344 0.083 0.6 0.277 0 0.85 0.319 0.417 0.6 0.485 0 0.9 0.338

0.714 0.444 0.613 0.733 0.471 0.635 0.7 0.409 0.591 0.889 0.478 0.735 0.889 0.478 0.735 0.769 0.474 0.658 0.889 0.478 0.735 1 0.414 0.78 0.615 0.368 0.523 1 0.4 0.775

Table 4 Comparison with other methods Authors Methods S01 Sahu et al. DWT features Proposed method Time-domain features

54.375 56.250

0.5 0.667 0.568 0.55 0.667 0.594 0.35 0.75 0.5 0.4 0.917 0.594 0.4 0.917 0.581 0.5 0.75 0.594 0.4 0.917 0.594 0.15 1 0.469 0.4 0.583 0.469 0.1 1 0.438

F-measure Class

Accuracy

0.588 0.533 0.568 0.629 0.552 0.6 0.467 0.529 0.49 0.552 0.629 0.581 0.552 0.629 0.581 0.606 0.581 0.597 0.552 0.629 0.581 0.261 0.585 0.383 0.485 0.452 0.472 0.182 0.571 0.328

1 2

56.25

1 2

59.375

1 2

50

1 2

59.375

1 2

59.375

1 2

59.375

1 2

59.375

1 2

46.875

1 2

43.75

1 2

43.75

S02

S03

57.500 59.375

51.250 59.375


5 Conclusion
In this paper, four time-domain features have been used to classify two-class motor imagery actions, right hand and feet movement. The four time-domain features MAV, ZC, SSC, and WL were calculated and used as input for classification. Ten classifiers were used in order to check the recognition ability of the features and to compare the performance of the classifiers. The performance of the classifiers varies according to the subject: for subject S01, AdaboostM1 and Decision table show the highest accuracy; for subject S02, the Bayes net and IBK classifiers achieve the highest accuracy; and Naive Bayes, SGD, IBK, and LWL perform well for subject S03. Across all subjects, Bayes net and Naive Bayes are the best among the 10 classifiers, giving 53.125% average accuracy. The maximum accuracies obtained lie between 55 and 59%. The comparative analysis also shows the better performance of the time-domain features, but there is still scope in the future for larger datasets, where performance can be improved with more variety in the motor imagery data.

References
1. J.R. Wolpaw, N. Birbaumer, D.J. McFarland, G. Pfurtscheller, T.M. Vaughan, Brain–computer interfaces for communication and control. 113(6), 767–791 (2002)
2. S.G. Mason, G.E. Birch, A general framework for brain-computer interface design. 11(1), 70–85 (2003)
3. M.X. Cohen, Analyzing Neural Time Series Data: Theory and Practice (MIT Press, 2014)
4. S. Vaid, P. Singh, C. Kaur, EEG signal analysis for BCI interface: a review, in 2015 Fifth International Conference on Advanced Computing and Communication Technologies (IEEE, 2015), pp. 143–147
5. A. Khorshidtalab, M. Salami, M. Hamedi, Evaluation of time-domain features for motor imagery movements using FCM and SVM, in 2012 Ninth International Conference on Computer Science and Software Engineering (JCSSE) (IEEE, 2012), pp. 17–22
6. P. Geethanjali, Y.K. Mohan, J. Sen, Time domain feature extraction and classification of EEG data for brain computer interface, in 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (IEEE, 2012), pp. 1136–1139
7. R. Upadhyay, A. Manglick, D. Reddy, P. Padhy, P.K. Kankar, Channel optimization and nonlinear feature extraction for electroencephalogram signals classification. 45, 222–234 (2015)
8. A.S. Sankar, S.S. Nair, V.S. Dharan, P. Sankaran, Wavelet sub band entropy based feature extraction method for BCI. 46, 1476–1482 (2015)
9. Z. Liu, J. Sun, Y. Zhang, P. Rolfe, Sleep staging from the EEG signal using multi-domain feature extraction. 30, 86–97 (2016)
10. V. Harpale, V. Bairagi, An adaptive method for feature selection and extraction for classification of epileptic EEG signal in significant states (2018)
11. M. Sahu, S. Shukla, Impact of feature selection on EEG based motor imagery, in Information and Communication Technology for Competitive Strategies (Springer, 2019), pp. 749–762
12. G.U. Technology, Two class motor imagery (002-2014) (2015). http://bnci-horizon-2020.eu/database/data-sets
13. G. Pfurtscheller, C. Neuper, Motor imagery and direct brain-computer communication. 89(7), 1123–1134 (2001)

Finding Influential Spreaders in Weighted Networks Using Weighted-Hybrid Method Sanjay Kumar, Yash Raghav, and Bhavya Nag

Abstract Finding efficient influencers has attracted a lot of researchers considering the advantages and the various ways in which it can be used. There are a lot of methods but most of them are available for unweighted networks, while there are numerous weighted networks available in real life. Finding influential users on weighted networks has numerous applications like influence maximization, controlling rumours, etc. Many algorithms such as weighted-Degree, weightedVoteRank, weighted-h-index, and entropy-based methods have been used to rank the nodes in a weighted network according to their spreading capability. Our proposed method can be used in case of both weighted or unweighted networks for finding strong influencers efficiently. Weighted-VoteRank and weighted-H-index methods take the local spreading capability of the nodes into account, while entropy takes both local and global capability of influencing the nodes in consideration. In this paper, we consider the advantages and drawbacks of the various methods and propose a weighted-hybrid method using our observations. First, we try to improve the performance of weighted-VoteRank and weighted-h-index methods and then propose a weighted-hybrid method, which combines the performance of our improved weighted-VoteRank, improved weighted-H-index, and entropy method. Simulations using an epidemic model, Susceptible-Infected-Recovered (SIR) model produces better results as compared to other standard methods. Keywords Complex networks · Influence maximization · Node centrality · SIR model · Weighted-H-index · Weighted-VoteRank S. Kumar (B) · Y. Raghav · B. Nag Department of Computer Science and Engineering, Delhi Technological University, Shahbad Daulatpur, Main Bawana Road, Delhi 110042, India e-mail: [email protected] Y. Raghav e-mail: 1yashraghav[email protected] B. Nag e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_37


1 Introduction
Most of the real-world networks, like social networks, biological networks, collaboration networks, and circuit networks, are complex networks. These networks have a large number of nodes or users, and the interactions between nodes are complex. The evolution of complex networks has led to the establishment of many useful applications like influence maximization, node classification, viral marketing, link prediction, and information propagation. Influence maximization [1, 2] requires a strategic selection of influential individuals who are capable of spreading information by the "word-of-mouth" analogy, after knowing the information source. Therefore, the objective of influence maximization is of great value in real-world systems, dictating the choice of source spreaders to maximize the final extent of spreading/propagation in the complex network. Determining influential spreaders in the propagation process has a number of applications, such as rumour control, virus transmission, and information dissemination. Many real-life networks, like transportation networks and email networks, are weighted networks. Finding influential nodes in a weighted network is a hot and demanding research topic, and many centralities have been proposed for this task. Weighted-degree [2] is the simplest weighted centrality in use. It helps in deciding the most effective influencers on the basis of the product of the degree of the node and the average weights of its connections with its neighbours; however, it only captures local influence. Closeness [3] and betweenness [4] are both centralities that consider a node's shortest paths to other nodes and thus are not very viable centralities for large networks. Other than these, many advanced methods have come up to achieve this objective. PageRank [5] uses an algorithm that was used to rank web pages, i.e., counting citations or back-links of the page. In K-shell Decomposition [6], a core number (Ks) is allotted to each node (not unique) which represents the location of that node according to successive layers in the shell of the network, thereby being able to take the global structure of the network into consideration. Many improved K-shell methods, such as Mixed Degree Decomposition, were also proposed [7–11], aiming at differentiating nodes with different influences. VoteRank [12] is a method that chooses a spreader by considering the voting ability of its neighbours; once chosen, the node can't vote again. The main advantage of this method is that "far-off" or relatively disconnected spreaders are chosen. In the H-index method [13], the "h" value reflects the relative local influence of a node. These existing methods are usually applied to unweighted networks that consider only a single type of relation between all nodes. However, in real-world scenarios, the edges are weighted, depicting the extent of interaction. Thus, if a social network comprises more than one kind of relation between


the individuals, weighted graphs are used. This relation between nodes can depict capacity or capability (unweighted), or duration and emotional intensity (weighted). Thus, the concept of weighted networks can be expanded to a lot of real-world networks. Traditional methods for unweighted networks don't consider the weights of edges, and thus many variants of the traditional methods are available, such as weighted-degree centrality [14], weighted betweenness centrality [2], weighted K-shell decomposition [15], weighted-H-Index centrality [13], and weighted-VoteRank. In the weighted K-shell decomposition, both the degree of the node and the weight of the links were considered by allocating each node a weighted degree k; the pruning process was done similar to that of K-shell decomposition [15]. The weighted-H-Index defined the edge weight as a product of the degrees of the connecting nodes, while the weighted-VoteRank changed the traditional calculation of the voting score to include the effect of the strength of the links between two nodes. The recently proposed NCVoteRank method extends VoteRank by bringing the neighbourhood coreness value into voting [16]. Thus, by tweaking the unweighted centralities to include the strengths of links, weighted centralities can be implemented in real-world networks and the effect of different kinds of links can be understood. The contributions of our proposed method, i.e., weighted-Hybrid, involve the following:
1. An improved version of weighted-H-index contributes to the final score of a node, which helps us in deciding the efficient influencers in a social network.
2. An improved version of weighted-VoteRank also contributes to the final score of a node.
3. Entropy centrality helps us in combining the global property with the local property, which is considered by the other two.
The organization of this paper is as follows: Sect. 2 presents a brief review of related works. In Sect. 3, we present the information diffusion model, performance metrics, and datasets used in this work. The proposed method is described in Sect. 4. Section 5 summarizes our results and findings, and eventually the paper is concluded by Sect. 6.

2 Related Works Weighted-H-Index: In the weighted-H-index method, edge weights were defined to quantify the diffusion capacity of the network. The edge weight was defined as the product of the degrees of the vertices connected by the edge. For each vertex i connected to a vertex j, the weighted edge was decomposed into multiple weighted edges equal in number to the degree of vertex j. After completing this procedure for each neighbour of vertex i, the H-index was calculated in the traditional manner, i.e. the maximum value h for node i such that it has at least h neighbours with weights greater than or equal to h.


Weighted-VoteRank: Sun et al. [17] proposed the weighted-VoteRank method to improve the VoteRank method by taking into consideration not only the number of neighbours but also the weights of their relations with the current node. This method finds multiple influential spreaders in a weighted network in which each node v is assigned a tuple {s_v, va_v} consisting of its voting score and voting ability, initialized to {0, 1}. At each step, every node votes for its directly connected neighbours according to its voting ability. The voting score in a weighted network is defined as the square root of the product of the number of neighbours and the weighted sum of the neighbours' voting abilities, as shown in the equation below:

s_v = \sqrt{ |\Gamma(v)| \cdot \sum_{i \in \Gamma(v)} va_i \, w_{v,i} }    (1)

Thus, for any given node v, three factors are taken into consideration when determining the voting score: the number of neighbour nodes of v, i.e. |\Gamma(v)|; the voting ability of its neighbour i, i.e. va_i; and the edge weight between neighbour i and node v, i.e. w_{v,i}. Initially, each node has a voting ability equal to unity; however, after each voting round, the neighbours of the selected node have their voting ability reduced by a fixed constant. Entropy-based centrality: Qiao et al. [18] proposed an entropy-based centrality for weighted networks in which the total influencing power of the current node is divided into two parts, local power and global power. The local power is obtained by combining the interaction frequency entropy, which indicates the accessibility of the node, and the structural entropy, which indicates the popularity and communication activity of the node. A complete network is deconstructed into smaller subnetworks, from which the required interaction and structural information is derived. This information, along with the information from the two-hop neighbours, forms the total power of a node.
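To make the procedure concrete, the following minimal Python sketch (using networkx) puts the weighted-VoteRank pieces together: the voting score of Eq. (1) and the voting-ability reduction after each election. The decrement delta is an assumed parameter, since the constant is left unspecified above; this is an illustrative sketch, not the authors' implementation.

```python
import math
import networkx as nx

def weighted_voterank(G, k, delta=0.5):
    """Elect k spreaders by weighted-VoteRank (illustrative sketch).

    G     : networkx.Graph with a 'weight' attribute on edges
    delta : assumed decrement of a neighbour's voting ability
    """
    va = {v: 1.0 for v in G}                  # voting ability, initially 1
    spreaders = []
    for _ in range(k):
        def score(v):                          # Eq. (1)
            s = sum(va[i] * G[v][i].get("weight", 1.0) for i in G[v])
            return math.sqrt(G.degree(v) * s)
        best = max((v for v in G if v not in spreaders), key=score)
        spreaders.append(best)
        va[best] = 0.0                         # an elected node cannot vote again
        for i in G[best]:                      # weaken its neighbours' ability
            va[i] = max(va[i] - delta, 0.0)
    return spreaders
```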

3 Information Diffusion Model, Performance Metrics and Datasets 3.1 Information Diffusion Model In this paper, the stochastic Susceptible-Infected-Recovered (SIR) model is used as the information diffusion model to assess the performance of our algorithm. This model divides network nodes into three categories: susceptible (S), infected (I) and recovered (R). Nodes in the susceptible state are liable to receive data from their neighbours. The SIR model takes as input a list of spreaders, i.e. a subset of the network nodes, an infection probability (β) and a recovery probability (γ).


In this model, all nodes are initially susceptible except a few seed nodes that start in the infected state. At every step, susceptible neighbours are infected by the infected nodes with probability β; infected nodes then enter the recovered stage with probability γ. Once they reach the recovered stage, they are immunized and cannot be infected again. As this is a stochastic model, the above steps were run 100 times and the results were averaged over all 100 simulations.
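A minimal Python sketch of one such stochastic SIR run is given below; beta and gamma are the infection and recovery probabilities described above (the values shown are illustrative), and the run would be repeated 100 times and averaged exactly as stated in the text.

```python
import random

def sir_final_scale(G, spreaders, beta=0.01, gamma=1.0, seed=None):
    """One stochastic SIR run; returns recovered/total at steady state."""
    rng = random.Random(seed)
    infected, recovered = set(spreaders), set()
    while infected:
        newly_infected = {v for u in infected for v in G[u]
                          if v not in infected and v not in recovered
                          and rng.random() < beta}
        newly_recovered = {u for u in infected if rng.random() < gamma}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return len(recovered) / G.number_of_nodes()

# averaged over 100 simulations, as in the text:
# F_tc = sum(sir_final_scale(G, S, seed=i) for i in range(100)) / 100
```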

3.2 Performance Metrics We judge the performance of our approach, along with the others, using the following metrics: (1) Final Infected Scale (F(t_c)): the ratio of recovered nodes at the end of the SIR simulations to the total number of nodes in the network. Here, recovered nodes are those that first got infected and then recovered in the SIR model. A high value of F(t_c) means that the information or idea propagated by the influential spreaders has reached a large number of people in the social network. The final infected scale is calculated using the following equation:

F(t_c) = \frac{n_{R(t_c)}}{n}    (2)

where n_{R(t_c)} is the number of recovered nodes when the spreading reaches steady state and n is the total number of nodes. (2) Shortest path length (L_s): used to evaluate the structural properties between each pair of selected spreaders. The shortest path length is calculated for each pair of spreaders and is an essential metric that considers the locations of the influential spreaders. A high value denotes that the spreaders are widely distributed in the network and hence can spread information to a more substantial portion of the network:

L_s = \frac{1}{|S|(|S|-1)} \sum_{u,v \in S,\, u \neq v} l_{u,v}    (3)

where l_{u,v} denotes the shortest path length from node u to node v, and |S| is the total number of spreaders.
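As a sketch, Eq. (3) can be computed directly with networkx; hop-count (unweighted) shortest paths are assumed here.

```python
import itertools
import networkx as nx

def mean_spreader_distance(G, S):
    """Ls of Eq. (3): mean shortest-path length over ordered spreader pairs
    (assumes every pair of spreaders is connected)."""
    S = list(S)
    total = sum(nx.shortest_path_length(G, u, v)
                for u, v in itertools.permutations(S, 2))
    return total / (len(S) * (len(S) - 1))
```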


Table 1 Used datasets

S. No. | Dataset name | Description | #Nodes | #Edges
1 | Powergrid | An undirected weighted network containing information about the power grid of the Western States of the United States of America | 4941 | 6594
2 | Facebook-like social network | An undirected weighted dataset originating from an online community for students at the University of California, Irvine | 1899 | 20297
3 | US top 500 airport network | An undirected weighted network of the 500 busiest commercial airports in the United States | 500 | 28237
4 | Bitcoin+11 | A user–user trust/distrust undirected weighted network | 5881 | 35592

3.3 Datasets We chose four real-life weighted-network datasets to judge the performance of our proposed method of finding influencers. Table 1 lists all the datasets used with brief descriptions. These datasets are publicly available at https://toreopsahl.com/datasets/.

4 Proposed Method In this section, we first present the improved weighted-H-index and improved weighted-VoteRank methods, and then describe the proposed Weighted-Hybrid method, which is a combination of three techniques: the improved weighted-H-index, the improved weighted-VoteRank and entropy centrality. The weighted degree often gave the best results (in about 70% of the models), but it only considers the significance of the one-hop neighbours when determining the most significant spreaders. Thus, by the same logic, we decided to improve the traditional weighted-H-index and weighted-VoteRank methods by including information about the neighbours of the nodes in the formulas. The traditional weighted-H-index method evaluates a node's spreading power according to its number of highly influential neighbours; however, it fails to account for the topological structure of the network, so on its own it is unable to give excellent results. It is nevertheless highly beneficial in real-world scenarios where a few links or some network information is missing, because it is not sensitive to small variations in degree. Keeping in mind its benefit of neutralizing the effect of missing links and data on the final output, we decided to include an enhanced version of the weighted-H-index in our hybrid. Weighted-VoteRank has a huge advantage over the other methods,


i.e. it protects the output from the rich-club phenomenon. While determining multiple spreaders, it discounts the voting ability of the selected spreader's neighbours; thus, rather than choosing all the spreaders in one crowded area, thereby causing an overlap of influences and neglecting far-flung regions, this method tries to choose spreaders that are far from each other so as to maximize the influence and reduce the chance of overlap. Entropy is a very useful method that considers the topological structure of the network through the indirect influence of a node in the form of its two-hop neighbours; thus, entropy captures the global qualities of a node. Improved weighted-H-index: In the classical weighted-H-index, we introduce an effective weight that combines the strength of a link (the edge weight) and the spreading capacity of the link (the degrees of its endpoints). The effective weight of an edge is defined as

w'_{ij} = w_{ij} + k_i \cdot k_j    (4)

where w_{ij} is the weight of the edge between nodes i and j, and k_i and k_j are the degrees of nodes i and j, respectively. Further, we follow the same procedure to find the effective H-index of a node by first decomposing each weighted edge into multiple weighted edges based on the degree of the neighbouring node. Finally, we follow the definition of the H-index for a node n, which is the maximum value h such that n has at least h neighbours of 'weight' equal to or larger than h. Improved weighted-VoteRank: In the weighted-VoteRank method, we propose to modify the voting score s_v of node v presented in Eq. (1); the resultant equation is

s'_v = \sqrt{ |\Gamma(v)| \cdot \sum_{i \in \Gamma(v)} va_i \, w_{v,i} \, k_i }    (5)

Here, k_i is the degree of node i (a neighbour of node v). Thus, we also take into account the number of neighbours of i, i.e. nodes up to two hops from v, which may capture the spreading process in a better manner. Weighted-Hybrid method: We propose our weighted-Hybrid method as a combination of these three methods:

E_i = \left( \alpha \cdot WH'_i + \beta \cdot WV'_i + \mu \cdot TP_i \right) / 3    (6)

where the effective influence E_i of a node i combines its improved weighted-H-index (WH'_i), its improved weighted-VoteRank score (WV'_i) and its total influencing power calculated through the entropy formula (TP_i); α, β, μ are constants. Experimentally, we keep the values of α, β, μ equal, i.e. α = β = μ.


As we combine three different methods, we normalized each score using min-max normalization (Eq. (7)), which scales the data between 0 and 1:

y = \frac{x - \min}{\max - \min}    (7)
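A small Python sketch of Eqs. (6) and (7) is shown below; the three per-node score dictionaries (wh, wv, tp) are assumed to have been computed beforehand by the respective methods.

```python
def min_max(scores):
    """Min-max normalization of Eq. (7): scale values into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0            # guard against identical scores
    return {v: (s - lo) / span for v, s in scores.items()}

def weighted_hybrid(wh, wv, tp, alpha=1.0, beta=1.0, mu=1.0):
    """Eq. (6): combine the three normalized scores (alpha = beta = mu)."""
    wh, wv, tp = min_max(wh), min_max(wv), min_max(tp)
    return {v: (alpha * wh[v] + beta * wv[v] + mu * tp[v]) / 3 for v in wh}
```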

5 Results and Analysis We worked on the four weighted-network datasets mentioned in Sect. 3 of this paper: Powergrid, Facebook-like social network, Bitcoin+11 and US Top-500 airport network. We ran the SIR model 100 times, as it is a stochastic model whose outcome can vary. The β value was taken to be 0.01, meaning that, at each step, an infected node infects each of its susceptible neighbours with probability 0.01. The results of the 100 SIR runs were averaged to take the different ways of spreading into consideration. We compared the algorithms for choosing efficient influencers on the basis of the final infected scale F(t_c) versus time. The performance of our proposed method, i.e. W-Hybrid, was compared with the other methods and the results were noted. From the experiments conducted, it is evident that our proposed method gives better results under the final infected scale F(t_c) performance metric: the weighted-hybrid is able to infect a larger number of nodes than the other methods in the same time. We also noticed that decreasing the β value usually improved the results of our W-Hybrid method (Figs. 1, 2, 3 and 4).


Fig. 1 a, b F(t c ) versus time for powergrid data set with initial spreaders as 10 and 20


Fig. 2 a, b F(t c ) versus time for Facebook-like social network data set with initial spreaders as 10 and 20


Fig. 3 a, b F(t c ) versus time for Bitcoin data set with initial spreaders as 10 and 20 and β as 0.01

Ls values: Table 2 lists the values of L_s for the various datasets, calculated using Eq. (3). As these results show, our method maximizes the shortest path between the selected spreaders, which leads to a better spread of influence across the network.



Fig. 4 a, b F(t_c) versus time for US Top-500 airport network data set with initial spreaders as 10 and 20 and β as 0.01

6 Conclusions In this manuscript, we have proposed a weighted-hybrid method to find the most effective influencers or spreaders in a given weighted network, so that information can reach a large number of users in the system. The proposed weighted-hybrid method includes both the local and global attributes of a node in determining its influence. Furthermore, the weighted-hybrid is able to deal with issues such as the rich-club phenomenon and hidden/missing links in the network. The experiments conducted show that the proposed technique is a better technique for finding multiple influencers in a weighted complex network.

Table 2 Ls values for all datasets with initial spreaders as 10 and β as 0.01

Dataset | Degree | Closeness | Betweenness | Weighted-H-Index | Improved weighted-H-Index | Improved weighted-VoteRank | Entropy | W-Hybrid
US Top-500 airport network | 372.9 | 7.986 | 17.337 | 169.07 | 2905.9 | 1527.2 | 203.8 | 703.5
Facebook-like social network | 4.433 | 2.241 | 2.494 | 3.322 | 4.055 | 5.013 | 4.185 | 5.364
Bitcoin | 21.178 | 19.571 | 21.535 | 20.535 | 24.178 | 32.857 | 28.218 | 33.678
Powergrid | 5.515 | 2.939 | 4.33 | 4.015 | 6.438 | 2.424 | 5.181 | 9.030



References
1. W. Chen, C. Wang, Y. Wang, Scalable influence maximization for prevalent viral marketing in large-scale social networks, in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2010), pp. 1029–1038
2. T. Opsahl, F. Agneessens, J. Skvoretz, Node centrality in weighted networks: generalizing degree and shortest paths. Soc. Netw. 32(3), 245–251 (2010)
3. Y. Du, C. Gao, X. Chen, Y. Hu, R. Sadiq, Y. Deng, A new closeness centrality measure via effective distance in complex networks. Chaos Interdiscip. J. Nonlinear Sci. 25(3), 033112 (2015)
4. D. Prountzos, K. Pingali, Betweenness centrality. ACM SIGPLAN Not. 48(8), 35 (2013)
5. S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
6. M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. Stanley, H. Makse, Identification of influential spreaders in complex networks. Nat. Phys. 6(11), 888–893 (2010)
7. Z. Liu, C. Jiang, J. Wang, H. Yu, The node importance in actual complex networks based on a multi-attribute ranking method. Knowl.-Based Syst. 84, 56–66 (2015)
8. A. Zareie, A. Sheikhahmadi, A hierarchical approach for influential node ranking in complex social networks. Expert Syst. Appl. 93, 200–211 (2018)
9. J. Bae, S. Kim, Identifying and ranking influential spreaders in complex networks by neighborhood coreness. Phys. A 395, 549–559 (2014)
10. Z. Wang, Y. Zhao, J. Xi, C. Du, Fast ranking influential nodes in complex networks using a k-shell iteration factor. Phys. A 461, 171–181 (2016)
11. Z. Wang, C. Du, J. Fan, Y. Xing, Ranking influential nodes in social networks based on node position and neighborhood. Neurocomputing 260, 466–477 (2017)
12. J. Zhang, D. Chen, Q. Dong, Z. Zhao, Identifying a set of influential spreaders in complex networks. Sci. Rep. 6(1) (2016)
13. L. Lü, T. Zhou, Q. Zhang, H. Stanley, The H-index of a network node and its relation to degree and coreness. Nat. Commun. 7(1) (2016)
14. A. Nikolaev, R. Razib, A. Kucheriya, On efficient use of entropy centrality for social network analysis and community detection. Soc. Netw. 40, 154–162 (2015)
15. A. Garas, F. Schweitzer, S. Havlin, A k-shell decomposition method for weighted networks. New J. Phys. 14(8), 083030 (2012)
16. S. Kumar, B.S. Panda, Identifying influential nodes in social networks: neighborhood coreness based voting approach. Phys. A 124215 (2020)
17. H.L. Sun, D.B. Chen, J.L. He, E. Chng, A voting approach to uncover multiple influential spreaders on weighted networks. Phys. A 519, 303–312 (2019)
18. T. Qiao, W. Shan, G. Yu, C. Liu, A novel entropy-based centrality approach for identifying vital nodes in weighted networks. Entropy 20(4), 261 (2018)

Word-Level Sign Language Gesture Prediction Under Different Conditions Monika Arora, Priyanshu Mehta, Divyanshu Mittal, and Prachi Bajaj

Abstract Over 6% of the population suffers from hearing problems and relies on sign language to communicate with the masses and express emotions through actions. It has been an onerous task for speech- and hearing-impaired people to make themselves understood, and thus it is a necessity to build a system that can help anyone understand the gestures and generate their meaning. A system for sign language recognition can be a preliminary step towards establishing better communication. We used the word-level Argentinian Sign Language (LSA) video dataset with 64 actions, shot under different lights and with non-identical subjects. Video data accommodates both dimensional and sequential attributes; thus, we used a deep convolutional neural network along with a recurrent neural network with LSTM units to incorporate both together. We created two different test cases, that is, an indoor lighting environment with a single subject, and a mix of both indoor and outdoor conditions with multiple subjects, and achieved accuracies of 93.75% and 90.625%, respectively. Keywords Sign language gesture prediction · Recurrent neural network · Convolutional neural network · LSTM

M. Arora · P. Mehta (B) · D. Mittal · P. Bajaj Bhagwan Parshuram Institute of Technology, Delhi, India e-mail: [email protected] M. Arora e-mail: [email protected] D. Mittal e-mail: [email protected] P. Bajaj e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_38


1 Introduction The challenge of communication between speech- and hearing-impaired individuals and others exists worldwide, hence the need for a unique strategy of communication that does not rely on verbal methods, namely sign language. Sign language incorporates gestures involving hand orientations and movements, along with facial expressions. These gestures play a significant role in sharing thoughts and help impaired individuals communicate with others. An ordinary person, however, may not be able to gain proficiency in sign language, making it hard for them to comprehend such symbols, and it is not feasible to have a translator available every time. To overcome this issue, many researchers have worked over a large span of time to develop technologically supported systems that bridge the communication gap effectively. Taking this serious concern into consideration, we have tried to develop a viable system that can recognize each sign language gesture proficiently and translate it to a relevant readable form that the general public can understand. The practical application involves inputting gestures from a user, database-backed recognition, and translation to an understandable form upon comparison. The input techniques involve two approaches: vision-based identification, or hardware gloves with sensors. We have adopted a hybrid of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) that takes videos as inputs and derives frames from them for training and testing. The recordings used are part of the Sign Language dataset utilized to set up our proposed model. Similar to the existence of various spoken languages around the world, there are variants of sign language too, such as Japanese Sign Language, Korean Sign Language, Indian Sign Language (ISL), and American Sign Language (ASL) [1]. Our chosen dataset is the Argentinian Sign Language (LSA) database, which includes videos of 10 non-expert subjects who executed each gesture five times for the distinct signs. Signs for 64 commonly used words, comprising verbs and nouns, have been chosen in the LSA. The dataset is a collection of 3200 recordings of the various signs shot under different types of lighting [2].

2 Literature Review Using different datasets as per individual requirements, several approaches have been undertaken by researchers to develop models for sign language recognition. Some have worked on identifying the alphabet, while others have considered the identification of commonly used phrases or terms. Masood et al. [1] proposed a method for real-time sign language recognition from video sequences by implementing a CNN and an RNN to train on the spatial and temporal features, respectively, over the Argentinian Sign Language (LSA) gestures.


The model also implemented Long Short-Term Memory (LSTM) to bridge time intervals in noisy, incompressible input sequences, along with a pool-layer approach. The work by Masood et al. [3] involved using the finger-spelled letters of American Sign Language as its dataset to train a CNN model inspired by VGG19. With the aim of reducing the learning time considerably, a pre-trained model was used to initialize the weights; only a few epochs were sufficient for convergence despite a very deep model. Tripathi et al. [4] proposed a method applying gradient-based key-frame extraction for recognizing symbols from continuous gestures. To find patterns in the input data, principal component analysis was applied, and several distance metrics, such as cosine distance, Mahalanobis distance, and Euclidean distance, were used for gesture recognition; Euclidean distance and correlation yielded the highest recognition rates. Pardeshi et al. [5] worked on the comparison and analysis of deep learning models such as AlexNet and GoogLeNet for training on the images; to minimize the training period, the project used multicore processors and GPUs with the parallel computing toolbox in MATLAB. In the method proposed by Ko et al. [6], the KETI dataset (Korean language) takes into account certain words and phrases that need to be expressed in an emergency situation, where physical disability may act as a severe hindrance. Their recognition system works on feature extraction of human keypoints from the hands, face, and other body parts using the OpenPose library; after vector normalization, stacked bidirectional GRUs classify the standardized feature vectors. The system by Mali et al. [7] involves pre-processing using MATLAB, skin thresholding, and dilation and erosion before feature extraction via PCA; the SVM classifier is then applied for classification and analysis, achieving an overall accuracy of 95.31%. Singha and Das [8] proposed a method divided into steps of data acquisition, pre-processing, feature extraction, and classification, with classification done using eigenvalue-weighted Euclidean distance; from continuous videos, the system could recognize 24 different alphabets of Indian Sign Language with an accuracy of 96%.

3 Proposed Methodology 3.1 Data Acquisition The dataset utilized for the framework is the Argentinian Sign Language (LSA) dataset, which was created with the objective of building a dictionary for LSA and training an automatic sign recognizer. It comprises videos in which 10 non-expert subjects executed each gesture five times for the 64 distinct signs. The signs were chosen among the most commonly used ones in the LSA dictionary, including verbs and nouns.


Table 1 64 symbols of LSA

ID | Name | H || ID | Name | H || ID | Name | H || ID | Name | H
1 | Opaque | R || 17 | Call | R || 33 | Hungry | R || 49 | Yogurt | B
2 | Red | R || 18 | Skimmer | R || 34 | Map | B || 50 | Accept | B
3 | Green | R || 19 | Bitter | R || 35 | Coin | B || 51 | Thanks | B
4 | Yellow | R || 20 | Sweet milk | R || 36 | Music | B || 52 | Shut down | R
5 | Bright | R || 21 | Milk | R || 37 | Ship | R || 53 | Appear | B
6 | Light-blue | R || 22 | Water | R || 38 | None | R || 54 | To land | B
7 | Colors | R || 23 | Food | R || 39 | Name | R || 55 | Catch | B
8 | Red | R || 24 | Argentina | R || 40 | Patience | R || 56 | Help | B
9 | Women | R || 25 | Uruguay | R || 41 | Perfume | R || 57 | Dance | B
10 | Enemy | R || 26 | Country | R || 42 | Deaf | R || 58 | Bathe | B
11 | Son | R || 27 | Last name | R || 43 | Trap | B || 59 | Buy | R
12 | Man | R || 28 | Where | R || 44 | Rice | B || 60 | Copy | B
13 | Away | R || 29 | Mock | B || 45 | Barbecue | B || 61 | Run | B
14 | Drawer | R || 30 | Birthday | R || 46 | Candy | R || 62 | Realize | R
15 | Born | R || 31 | Breakfast | B || 47 | Chewing gum | R || 63 | Give | B
16 | Learn | R || 32 | Photo | B || 48 | Spaghetti | B || 64 | Find | R

The dataset is a collection of 3200 recordings of the various signs, shot under different types of lighting [2] (Table 1).

3.2 Pre-processing 3.2.1 Frame Extraction Since it is difficult to train a model on videos directly, approximately 200 frames were extracted from each video sequence and then used for training the model, thus enlarging the dataset and playing a major role in improving the accuracy (Fig. 1).
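A minimal OpenCV sketch of this frame-extraction step is shown below; the 200-frame cap follows the description above, while the paths and file-naming scheme are illustrative.

```python
import os
import cv2

def extract_frames(video_path, out_dir, max_frames=200):
    """Dump up to max_frames JPEG frames from one gesture video."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while count < max_frames:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{count:04d}.jpg"), frame)
        count += 1
    cap.release()
    return count
```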

3.2.2 Feature Extraction The hands of the subjects are detected using the OpenCV library in Python. The background and other body parts act as noise and are not required for training our recognition system, so they are removed. The background is made black and the image is converted to grayscale so that the colour of the gloves is not involved in the learning of the model.
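The sketch below illustrates this pre-processing with OpenCV; the HSV colour range used to keep the hand regions is an assumed, illustrative threshold rather than the authors' exact values.

```python
import cv2
import numpy as np

def isolate_hands(frame_bgr):
    """Keep only hand-coloured regions, black out the rest, go grayscale."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 30, 60])       # assumed lower HSV bound
    upper = np.array([20, 150, 255])    # assumed upper HSV bound
    mask = cv2.inRange(hsv, lower, upper)
    hands = cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)  # background -> black
    return cv2.cvtColor(hands, cv2.COLOR_BGR2GRAY)
```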


Fig. 1 Data flow model

3.3 Classification Videos comprise both spatial and temporal features. To extract the spatial features we have employed a Convolutional Neural Network, and to extract the temporal features, by relating frames one after the other, we have employed a Recurrent Neural Network (Fig. 2).

3.3.1 Positional Feature Extraction Using CNN CNNs can be effectively utilized for classifying images due to their outstanding ability to recognize relations and find patterns irrespective of translational or rotational changes in images [9]. First, we clustered the frames into their respective symbol subfolders and retrained on each frame using TensorFlow's retrain script in order to generate the bottleneck features corresponding to each frame. Transfer learning then reuses a pre-trained neural network, in this case through these bottlenecks. We have extracted the spatial features from


Fig. 2 Frame extraction

video frames with the CNN by implementing the Inception V3 image recognition model of the TensorFlow library [10], which uses about 25 million parameters and about 5 billion operations to classify each image. Since only the final layer had to be trained, this could be completed in feasible time and with feasible resources. The per-frame predictions are stored for the training of the temporal model.
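The paper uses TensorFlow's retrain workflow; as a rough Keras-based equivalent, the sketch below extracts one 2048-dimensional bottleneck vector per frame from a pre-trained Inception V3 with the classification head removed, so that only a small classifier on top needs training.

```python
import tensorflow as tf

# Inception V3 without its top acts as a fixed "bottleneck" feature extractor.
base = tf.keras.applications.InceptionV3(weights="imagenet",
                                         include_top=False, pooling="avg")

def bottleneck(frame_bgr):
    """2048-d bottleneck features for one BGR frame (sketch)."""
    x = tf.image.resize(frame_bgr[..., ::-1], (299, 299))          # BGR -> RGB
    x = tf.keras.applications.inception_v3.preprocess_input(x)
    return base(tf.expand_dims(x, 0)).numpy()[0]
```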

3.3.2 Sequential Feature Extraction Using RNN

After the model is trained using the CNN, softmax-based prediction is implemented to produce outputs that can be passed to the RNN for the final prediction of the actual word related to each video sequence. As we have sequential data, to predict the output the RNN utilizes the current input and the previous output recurrently [11]. Since simple RNNs cannot learn long-term dependencies, we have utilized the Long Short-Term Memory (LSTM) model [12].


Fig. 3 Process flow

The sequence of predictions for each sign of the training data from the CNN is then given to the RNN model for training on the temporal features. After this, the model is used for making predictions on the test data (Fig. 3).
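A minimal sketch of such a temporal model is given below; the 64 classes and 200-frame sequences follow the dataset description, while the LSTM width of 256 units is an assumed hyper-parameter.

```python
import tensorflow as tf

NUM_CLASSES, SEQ_LEN, FEAT_DIM = 64, 200, 2048   # 64 signs, 200 frames/video

# LSTM over the per-frame CNN features, ending in a softmax over the signs.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(256, input_shape=(SEQ_LEN, FEAT_DIM)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_sequences, train_labels, ...) on the stored CNN features
```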

4 Result Each video was split into 200 frames in order to retrieve the dimensional characteristics; predictions were made on each frame, and the RNN, operating on the frames arranged sequentially, gave the final predicted value. We prepared a training set and two different testing conditions: in Case 1 we tested signs from the same set of subjects and environmental conditions, whereas in Case 2 the test set used a mixture of different subjects under both artificial and natural lighting, with the test subjects not being part of the training data. In Case 1, 120 out of 128 gestures of the same subject under equivalent lighting conditions were interpreted successfully and we achieved an accuracy of 93.75%, while the accuracy dropped to 90.625% in Case 2, as was expected, where 290 out of 320 videos were recognized correctly. In Fig. 4a, the gesture 'copy' has accuracy below 45% because the left hand of the subject slightly overlaps the right hand during the course of the action

Fig. 4 a Copy sign and b Breakfast sign


completion in the video for a fraction of a second; while this is less prominent in natural lighting, the overlapping frame is more prominent when the subject or illumination is changed, thus making the prediction slightly more difficult. On the contrary, the gesture in Fig. 4b has 100% accuracy; this gesture also uses both hands for the depiction, but the course of the action is identical: both hands go from one position upwards and traverse back to the same position. Also, no overlapping is involved, hence no ambiguity for the algorithm.

5 Conclusion Gestures and expressions are substantial in day-to-day communication, and their recognition by computers is an equally exciting and strenuous task. Our work presents a method that is able to interpret hand gestures from the Argentinian Sign Language (LSA) and examines the effect of different lighting conditions on the predicted result. We set up two test instances based on disparate subjects and illumination conditions, attaining 93.75% accuracy for Case 1, where both train and test data had the same subject under alike lighting, and 90.625% for Case 2, where the test data had a mixture of distinct subjects under different lights. Through our study, we draw two main conclusions: first, a CNN along with an RNN can be highly effective in treating video sequences; and second, there are certain losses, such as in edge detection and frame mapping, when the subject or the environment changes without training the model on these conditions.

References
1. S. Masood, A. Srivastava, H.C. Thuwal, M. Ahmad, Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. Intell. Eng. Inf. 623–632 (2018)
2. F. Ronchetti, F. Quiroga, C. Estrebou, L. Lanzarini, A. Rosete, LSA64: a dataset of Argentinian Sign Language, in XXII Congreso Argentino de Ciencias de la Computación (CACIC) (2016)
3. S. Masood, H.C. Thuwal, A. Srivastava, American sign language character recognition using convolution neural network. Smart Comput. Inf. 403–412 (2018)
4. K. Tripathi, N.B.G.C. Nandi, Continuous Indian sign language gesture recognition and sentence formation. Proc. Comput. Sci. 54, 523–531 (2015)
5. K. Pardeshi, R. Sreemathy, A. Velapure, Recognition of Indian sign language alphabets for hearing and speech impaired people using deep learning, in Proceedings of the International Conference on Communication and Information Processing (ICCIP) (2019)
6. S.-K. Ko, J.G. Son, H. Jung, Sign language recognition with recurrent neural network using human keypoint detection, in The 2018 Conference (2018)
7. D.G. Mali, N.S. Limkar, S.H. Mali, Indian sign language recognition using SVM classifier, in Proceedings of the International Conference on Communication and Information Processing (ICCIP) (2019)
8. J. Singha, K. Das, Automatic Indian sign language recognition for continuous video sequence. ADBU J. Eng. Technol. 2, 0021105(5pp) (2015)
9. B. Garcia, S. Viesca, Real-time American sign language recognition with convolutional neural networks. Convolutional Neural Networks for Visual Recognition, Stanford University (2016)


10. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado et al., TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
11. H. Cooper, B. Holt, R. Bowden, Sign language recognition, in Visual Analysis of Humans (Springer, London, 2011), pp. 539–562
12. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

Firefly Algorithm-Based Optimized Controller for Frequency Control of an Autonomous Multi-Microgrid Kshetrimayum Millaner Singh, Sadhan Gope, and Nicky Pradhan

Abstract This paper considers a mathematical model of a two-area microgrid based on renewable energy resources for the study of automatic generation control. Each microgrid consists of a solar photovoltaic (SPV) unit, a hydro unit, a battery energy storage system (BESS) and a load; one area has a biogas turbine generator (BGTG) and the other a biodiesel engine generator (BDEG). A Proportional-Integral (PI) controller is used as the frequency controller for this system. The BDEG, BGTG, and BESS are considered as instant Load Frequency Control (LFC) sources during a disturbance in the system frequency. Cuckoo Search (CS) and Firefly (FA) algorithms are used for tuning the gain values of the controllers. Finally, to validate the proposed approach, the system performance obtained by the firefly algorithm for the PI controller under random step load perturbation is compared with that of the CS algorithm. Keywords Proportional-Integral · Solar photovoltaic · Biodiesel engine generator · Battery energy storage system · Biogas turbine generator

1 Introduction The electrical energy consumption of the world has continued to rise rapidly, at a rate faster than other forms of energy consumption. In recent times, the demand for energy to fuel each country's economic growth has only spurted; the core idea is that the overall growth and development of any country is largely decided by the quantum of energy that country consumes. Furthermore, most of the energy for global consumption comes from conventional sources. For instance,
K. M. Singh (B) · S. Gope · N. Pradhan Electrical Engineering Department, Mizoram University, Aizawl, India e-mail: [email protected] S. Gope e-mail: [email protected] N. Pradhan e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_39


coal-fired power plants generate 72% of India's electricity. However, conventional sources of energy are fast running out, and their ever-increasing exploitation is bringing about catastrophic and irreparable damage to the environment as well as accelerating global warming. In this scenario, the integration of Renewable Energy Sources (RESs) into the power grid is being promoted across the globe in order to ensure sustainable development and combat climate change. With the initiation of the policy of deregulation in the power system, distributed generation has opportunities to encourage microgrid systems in the power industry. However, integrating RESs with conventional energy sources is difficult and complex due to the uncertain nature of RESs, load variation, and the imbalance between load demand and supply generation. This problem can be addressed by load frequency control, as it is the primary control that correspondingly regulates the system frequency and active power of the system. Energy storage is, as a result, the most secure and efficient option to mitigate the difference between energy demand and supply. In the literature, energy storage systems and their effect on frequency deviation and ACE control of multi-microgrids have been analyzed [1]. The Grasshopper Optimization Algorithm (GOA) has been used for multi-area microgrid frequency control with a fuzzy PID controller [2]. Frequency deviation considering communication delay in multi-microgrids is studied in [3]. The frequency response of a microgrid with a PID controller, considering renewable source uncertainties, has been compared across different optimization algorithms, namely the Cow Search Algorithm, the Whale Optimization Algorithm (WOA), and Mosquito Flying Optimization (MFO) [4]. Different controllers for load frequency control, such as PI, PID, PD, ID, and PIFD, have been compared with the help of Particle Swarm Optimization (PSO), the Grasshopper Optimization Algorithm (GOA), and the Genetic Algorithm (GA) [5]. In this paper, the PI controller is used to investigate the load frequency control of a two-area multi-microgrid connected by a tie-line. The parameters of the PI controllers are tuned with the help of the CS and FA algorithms, considering load changes applied as SLP. The results of the two algorithms are compared.

2 Overview of Multi-Microgrid 2.1 Multi-Microgrid Multi-microgrids (MMGs) are considered a more advanced system than a single microgrid. They operate at the medium-voltage level. A multi-microgrid comprises numerous low-voltage microgrids (MGs) and distributed generator (DG) units connected along MV feeders, and it can contain many controllable DG units and MGs. A benefit of multi-microgrids is that demand-side management (DSM) becomes possible, as it uses a hierarchical control scheme; this requires efficient control and management of the system [6].


2.2 Solar Photovoltaic Model A PV system consists of several cells connected in series and parallel to deliver the desired current and voltage. The V-I characteristic of a PV system is non-linear, and the output power of the PV array depends on the load current and solar radiation. The PV system transfer function can be expressed as

G_{pv}(S) = \frac{K_{pv}}{1 + S T_{pv}}    (1)

where K_{pv} is the gain constant and T_{pv} is the time constant [6].

2.3 Biogas Turbine Generator (BGTG) Biogas is obtained from the decomposition of wastes and animal excreta. It can be economically used in a micro biogas turbine generator (BGTG) for power generation [7]. The output power of the BGTG is directly proportional to the stroke of the components in the BGTG. The transfer function of the BGTG is given as

G_{BGTG}(S) = \frac{1 + S X_C}{(1 + S Y_C)(1 + S b_B)} \cdot \frac{1 + S T_{CR}}{1 + S T_{BG}} \cdot \frac{1}{1 + S T_{BT}}    (2)

2.4 Biodiesel Engine Generator (BDEG) Biofuel is extracted from crops; transesterification is the process used for extracting the fuel. The fuel has similar chemical properties to diesel and can be used in standard diesel generators [7]. The BDEG transfer function can be expressed as

G_{BDEG}(S) = \frac{K_{VA}}{1 + S T_{VA}} \cdot \frac{K_{BE}}{1 + S T_{BE}}    (3)

2.5 Hydro Plant The hydro plant model is similar to that of the thermal plant. The three main units of the hydro plant are the governor, the turbine, and the generator load. The speed governor transfer function can be represented as [8]


G_{HG}(S) = \frac{K_{HG}}{1 + S T_{HG}}    (4)

where K_{HG} is the gain constant and T_{HG} is the time constant. The turbine unit can be represented by the transfer function given below [8]:

G_{HT}(S) = \frac{K_{HT}}{1 + S T_{HT}}    (5)

where K_{HT} and T_{HT} are the gain constant and time constant of the turbine.

2.6 System Frequency Variation and Power Deviation To keep the power system operating stably, the total power generation should be efficiently controlled and suitably dispatched to meet the total load demand [1]. The total power generation (P_T) in the microgrid system is therefore equal to the sum of all sources, i.e., the solar photovoltaic power (P_PV), the biodiesel engine generator power (P_BDEG), the biogas turbine generator power (P_BGTG), the hydro plant power (P_H), and the energy storage system power (P_ESS):

P_T = P_{PV} + P_{BDEG} + P_{BGTG} + P_H \pm P_{ESS}    (6)

The deviation of power in the system is given by the total power generation (P_T) minus the power demand (P_D):

\Delta P_e = P_T - P_D    (7)

As the system frequency deviation is caused by the change in total power, the frequency variation Δω is given by

\Delta\omega = \frac{\Delta P_e}{K_{sys}}    (8)

where K_{sys} is the frequency characteristic constant. A time delay exists between the power deviation and the frequency deviation; therefore, the frequency deviation to power deviation (in per unit) can be expressed by the transfer function

G_{sys}(S) = \frac{\Delta\omega}{\Delta P_e} = \frac{1}{D + MS} = \frac{1/K_{sys}}{1 + S\,(M/K_{sys})}    (9)

where M denotes the equivalent inertia constant and D denotes the system damping constant.
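As an illustration, the first-order blocks above can be assembled and exercised in Python with scipy.signal; the parameter values below are taken from the Appendix, and the snippet is only a sketch of how such responses could be reproduced outside MATLAB.

```python
import numpy as np
from scipy import signal

K_pv, T_pv = 1.0, 1.5
G_pv = signal.TransferFunction([K_pv], [T_pv, 1])   # Eq. (1): K/(1 + sT)

D1, M1 = 0.02, 0.8
G_sys = signal.TransferFunction([1.0], [M1, D1])    # Eq. (9): 1/(D + Ms)

t, dw = signal.step(G_sys, T=np.linspace(0, 50, 500))
# dw approximates the area-1 frequency deviation for a unit power step
```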


2.7 Interconnection of the Proposed Multi-Microgrid with a Tie-Line Interconnecting standalone microgrids through a tie-line makes the power supply for the load demand more reliable; we consider that each microgrid has its own control area. How a specific area responds to the system frequency deviation is determined by tie-line bias control. It is used to exchange energy between microgrids when the power generation and load demand are not equal and a frequency deviation occurs in that area [1]. The tie-line power deviation (ΔP_tie) is given as

\Delta P_{tie} = P_s \left( \int \Delta\omega_1 \, dt - \int \Delta\omega_2 \, dt \right)    (10)

where Δω_1 and Δω_2 are the frequency deviations of area-1 and area-2, respectively, and P_s is the synchronizing power coefficient. The Laplace transform of the tie-line power deviation is given by (see Fig. 1)

\Delta P_{tie}(S) = \frac{P_s}{S} \left[ \Delta\omega_1(S) - \Delta\omega_2(S) \right]    (11)

Fig. 1 Proposed block diagram of the two-area multi-microgrid model/system


3 Algorithm A detailed overview of the adopted optimization techniques, along with the flowchart of the FA algorithm, is given in Refs. [6, 9, 10], and of the CS algorithm in Refs. [11, 12].
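For orientation, a compact Python sketch of the standard firefly algorithm is given below; the FA parameters (beta0, gamma, alpha) and population settings are assumed illustrative values, and the objective callable is expected to simulate the system and return the ISE cost for a candidate gain vector. This is a generic sketch, not the exact implementation used in this paper.

```python
import numpy as np

def firefly_minimise(objective, dim, bounds=(0.0, 2.0), n_fireflies=20,
                     n_iter=50, beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Tune a gain vector (e.g. the Kp/Ki values) by the firefly algorithm."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds                                   # 0-2 matches Tables 1-2
    pos = rng.uniform(lo, hi, size=(n_fireflies, dim))
    cost = np.array([objective(p) for p in pos])
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:                 # move i toward brighter j
                    r2 = np.sum((pos[i] - pos[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)
                    pos[i] += (beta * (pos[j] - pos[i])
                               + alpha * (rng.random(dim) - 0.5))
                    pos[i] = np.clip(pos[i], lo, hi)
                    cost[i] = objective(pos[i])
    best = int(np.argmin(cost))
    return pos[best], cost[best]
```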

4 Results and Discussion 4.1 PI Controller with Firefly Optimization The simulation results for the PI controller tuned by firefly algorithm optimization in the proposed two-area multi-microgrid model for the study of load frequency control are given as follows. Figure 2 shows the load variation in area-1 together with the power-generation responses of the hydro, ESS, PV and BDEG units. Initially the load demand is at its nominal value; it then drops below nominal, to 0.8 pu at t = 55 s and 0.73 pu at t = 60 s, returns to normal from t = 65 s to t = 85 s, then rises to 1.2 pu at t = 90 s, 1.5 pu at t = 95 s and 1.15 pu at t = 100 s, and finally returns to the nominal value from t = 105 s to t = 120 s. The system initially gives a normal step output, and during roughly the first 5 s most of the sources, including the ESS, fluctuate before being brought back to normal by the LFC. At t = 50 s the BDEG generation begins to reduce gradually until t = 60 s, while the supply from

Fig. 2 Response of area-1 with PI controller by using FA


Fig. 3 Response of area-2 with PI controller by using FA

the ESS starts to decrease up to 55 s; the ESS supply then increases up to 60 s, after which it is reduced back to normal, unlike the hydro generation, which does not have much effect on the system as it acts as baseload generation. When the load demand is increased at t = 85 s, the BDEG generation is increased and the ESS also supplies power up to t = 100 s, with hydro contributing a small amount of power, after which everything returns to normal. Figure 3 shows the load variation in area-2 together with the power-generation responses of the hydro, ESS, PV and BGTG units. Initially the load demand is at its nominal value; it then rises above nominal, to 1.2 pu at t = 55 s and 1.15 pu at t = 60 s, returns to normal from t = 65 s to t = 80 s, and then drops to 0.8 pu at t = 85 s, 0.75 pu at t = 90 s and 1.15 pu at t = 95 s, before returning to the nominal value from t = 100 s to t = 120 s. Similarly, the system initially gives a normal step output, and during roughly the first 5 s most of the sources, including the ESS, fluctuate before being brought back to normal by the LFC. When the load demand increases at t = 50 s, the ESS supplies power to the system up to t = 60 s and then reduces its supply, while the BGTG increases its generation up to 60 s before reducing it until 65 s; during 60–65 s the ESS again supplies power, after which everything returns to nominal. At t = 80 s the load demand is reduced; both the ESS and BGTG reduce their supply up to 90 s, then both increase their supply up to 100 s, after which they return to normal. Here too, hydro acts as the baseload generation. Figures 4 and 5 show the frequency responses of MG-1 and MG-2, respectively. Figure 6 shows the deviation of power between area-1 (MG-1) and area-2 (MG-2). The above results show that the frequency deviations of both area-1 and area-2 return to normal after the disturbances caused by the changing load demand,


Fig. 4 Frequency response of area-1 with PI controller by using FA

Fig. 5 Frequency response of area-2 with PI controller by using FA

and that the tie-line power deviation during the disturbances also returns to normal, which makes it clear that the LFC controls both the frequency deviation and the exchange of power taking place over the tie-line.


Fig. 6 Power deviation response of tie-line for area-1 and area-2

4.2 PI Controller with Cuckoo Search Optimization The simulation results for the PI controller tuned by Cuckoo Search optimization in the proposed two-area multi-microgrid model for the study of load frequency control are given as follows. Figure 7 shows the responses of the load, ESS, hydro, BDEG and PV in area-1. Here the load demand starts to decrease from t = 50 s, reaching 0.8 pu at t = 55 s and 0.73 pu at t = 65 s, and then increases from t = 85 s, i.e., 1.2 pu at t = 90 s, 1.5 pu at t = 95 s and 1.15 pu at t = 100 s. The result shows that hydro acts as the baseload generator throughout. The ESS and BDEG start to reduce their supply up to 55 s, then slowly increase it up to 65 s, reduce it up to 85 s, increase it again up to 95 s, reduce it up to 100 s, and finally return to normal; here hydro also contributes supply according to the load changes, but its effect is smaller than that of the other sources. Figure 8 shows the load variation in area-2 together with the power-generation responses of the hydro, ESS, PV and BGTG units. Initially the load demand is at its nominal value; it then rises to 1.2 pu at t = 55 s and 1.15 pu at t = 60 s, returns to normal from t = 65 s to t = 80 s, then drops to 0.8 pu at t = 85 s, 0.75 pu at t = 90 s and 1.15 pu at t = 95 s, before returning to the nominal value from t = 100 s to t = 120 s. Here the ESS has more impact than the other sources, as it supplies power from 55 s up to 60 s in response to the load, during which period the BGTG and hydro also increase their supply; afterwards they reduce it. At t = 85 s all sources reduce their supply up to 95 s, then increase it up to 100 s, and then reduce it back to normal.


Fig. 7 Response of area-1 with PI controller by using CS

Fig. 8 Response of area-2 with PI controller by using CS

Figure 9 shows the frequency responses of MG-1 and MG-2. Figure 10 shows the deviation of power between area-1 (MG-1) and area-2 (MG-2). The above results show that the frequency deviations of both area-1 and area-2 return to normal after the disturbances caused by the changing load demand, and that the tie-line power deviation during the disturbances also returns to normal, which clearly


Fig. 9 Frequency response of area-1 and area-2 with PI controller by using CS

Fig. 10 Power deviation response of tie-line for area-1 and area-2

verifies that LFC controls both the frequency deviation and exchange of power that take place in the tie-line.

Table 1 Gain values of the PI controllers by Firefly Algorithm optimization

Controller | Kp | Ki
Controller-1 | 1.9833 | 1.9996
Controller-2 | 1.1865 | 1.1060
Controller-3 | 1.7418 | 1.9998
Controller-4 | 1.5865 | 1.9449
Controller-5 | 0.6189 | 1.9845
Controller-6 | 1.9925 | 1.9636
Pf1 | 1.9996
Pf2 | 1.9750

4.3 Comparison of the Firefly Algorithm and Cuckoo Search Optimization The proposed model was simulated in MATLAB 2018a, using the ISE as the objective function to be optimized by the Firefly Algorithm and the Cuckoo Search Algorithm. The system was analyzed under Step Load Perturbation (SLP), and the responses of the system with the PI controller tuned by the Firefly Algorithm and by the Cuckoo Search Algorithm were compared. The gain values of the PI controllers for the Firefly Algorithm and Cuckoo Search are given in Tables 1 and 2, respectively. The various comparisons are shown in the figures: Figures 11, 12, and 13 depict the comparison of the algorithms. Figure 11 shows the comparison of the frequency deviation of microgrid-1 with the PI controller tuned by both the Firefly Algorithm and the Cuckoo Search Algorithm. Similarly, Fig. 12 shows the comparison for the frequency deviation of microgrid-2, where FA is superior to CS. Figure 13 depicts the comparison of the tie-line response using the FA and CS algorithms; here too, FA gives better results than CS. Table 2 Gain values of the PI controllers by Cuckoo Search Algorithm optimization:

Controller | Kp | Ki
Controller-1 | 1.581289934981527 | 1.275385914668063
Controller-2 | 1.981013131586057 | 0.785919954776326
Controller-3 | 0.899786331530040 | 2
Controller-4 | 0 | 0.843885456792004
Controller-5 | 1.403856781582448 | 0.283325335077086
Controller-6 | 1.168895975320635 | 1.150072145237343
Pf1 | 1.150072145237343
Pf2 | 0.620301772980351


Fig. 11 Comparison of frequency response of area-1 with PI controller by using FA and CS

Fig. 12 Comparison of frequency response of area-2 with PI controller by using FA and CS

5 Conclusion In this paper, a multi-microgrid connected by a tie-line has been investigated by regulating the gains of the PI controllers embedded in the individual microgrid systems. The FA and CS algorithms have been employed to generate optimal PI controller gains for the stability of the system under random step load perturbation. The simulation results show that the PI controller tuned by FA has a better control effect than the PI controller tuned by CS under step load perturbation.


Fig. 13 Comparison of power tie-line using FA and CS

Appendix

Symbol and abbreviation | Values
K_PV, T_PV (solar photovoltaic gain constant and time constant) | 1, 1.5
K_ESS, T_ESS (energy storage system gain constant and time constant) | −10, 0.1
K_VA, T_VA, K_BE, T_BE (biodiesel engine generator gain constants and time constants) | 1, 0.05, 1, 0.5
X_C, Y_C, b_B, T_CR, T_BG, T_BT (biogas turbine generator gain constants and time constants) | 0.6, 1, 0.05, 0.01, 0.23, 0.2
K_HG, T_HG, K_HT, T_HT (hydro governor and turbine gain constants and time constants) | 1, 41.6, 1, 0.5
D1, D2, M1, M2, Ps (power system gain constants and tie-line gain constant) | 0.02, 0.03, 0.8, 0.7, 1.5
B1, 1/R1, B12, B22, 1/R2, B2, 1/R22, 1/R12 (system droop and bias gain constants) | 0.1866, 0.1666, 0.4366, 0.1966, 0.4466, 12.5, 25, 0.4168


References
1. A.H. Chowdhury, M. Asaduz-Zaman, Load frequency control of multi-microgrid using energy storage system, in IEEE International Conference on Electrical and Computer Engineering (2014), pp. 548–551
2. D.K. Lal, A.K. Barisal, M. Tripathy, Load frequency control of multi area interconnected microgrid power system using grasshopper optimization algorithm optimized fuzzy PID controller, in IEEE International Conference on Recent Advances on Engineering, Technology and Computational Sciences (2018), pp. 1–6
3. X. Wang et al., Load frequency control in multiple microgrids based on model predictive control with communication delay. J. Eng. 13, 1851–1856 (2017)
4. P. Srimannarayana, A. Bhattacharya, S. Sharma, Load frequency control of microgrid considering renewable source uncertainties, in IEEE International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC) (2018), pp. 419–423
5. A.K. Barik, D.C. Das, Expeditious frequency control of solar photovoltaic/biogas/biodiesel generator based isolated renewable microgrid using grasshopper optimization algorithm. IET Renew. Power Gener. 12(14), 1659–1667 (2018)
6. N.J. Gil, J.A.P. Lopes, Hierarchical frequency control scheme for islanded multi-microgrids operation, in IEEE Lausanne Power Tech (2007), pp. 473–478
7. D. Muthu, C. Venkatasubramanian, K. Ramakrishnan, J. Sasidhar, Production of biogas from wastes blended with cow dung for electricity generation—a case study, in IOP Conference Series: Earth and Environmental Science, vol. 80, no. 1 (2017), pp. 1–8
8. C. Srinivasarathnam, C. Yammani, S. Maheswarapu, Multi-objective jaya algorithm for optimal scheduling of DGs in distribution system sectionalized into multi-microgrids. Smart Sci. 7(1), 59–78 (2019)
9. A.A. El-Fergany, M.A. El-Hameed, Efficient frequency controllers for autonomous two-area hybrid microgrid system using social-spider optimiser. IET Gener. Transm. Distrib. 11(3), 637–648 (2017)
10. X.-S. Yang, X. He, Firefly algorithm: recent advances and applications. Int. J. Swarm Intell. 1(1), 36–50 (2013)
11. R. Rajabioun, Cuckoo optimization algorithm. Appl. Soft Comput. 11(8), 5508–5518 (2011)
12. P.K. Ray, S.R. Mohanty, N. Kishor, Small-signal analysis of autonomous hybrid distributed generation systems in presence of ultra-capacitor and tie-line operation. J. Electric. Eng. 61(4), 205–214 (2010)

Abnormal Activity-Based Video Synopsis by Seam Carving for ATM Surveillance Applications B. Yogameena and R. Janani

Abstract Criminal activities have been increasing in ATM centers, but the law enforcement authorities become aware of them only after the incident has occurred. Viewing the whole video sequence is tedious and slows down the investigation process. The abnormal activity addressed in the proposed work is stabbing. The Lucas-Kanade method of optical flow is proposed to analyze the stabbing action through the velocity and direction with which the knife moves, and the analysis is further enhanced by facial expression recognition. The proposed method involves abnormal activity analysis followed by video synopsis. Video condensation by seam carving provides an effective solution: the concept of seam carving is to associate a reliable activity-aware cost with seams and recursively remove seams one at a time. The seam cost is equal to the sum of all pixels that make up a seam. To condense a given video, the ribbons with minimum cost are removed until a user-defined stopping criterion is met. The datasets are real-time ATM crimes involving stabbing action with a knife. The experimental outcomes show that the proposed framework gives an effective synopsis of the video based on abnormal activities, i.e., a reduction in duration and number of frames. Keywords Faster R-CNN · Seam carving · Stabbing action · Video synopsis

1 Introduction Surveillance is the monitoring of, and inference of information about, the behavior and activities of a person, and it is also used for preventing crime. Video synopsis is a compact presentation of events that allows hours of video footage to be checked in

453

454

B. Yogameena and R. Janani

just minutes [1, 2]. Now a day, ATM crime is increasing by involving abnormal activities such as threatening people with sharp objects, breaking ATM machines, changing posture, bending, and walking [3, 4]. Accordingly, an efficient approach is required to provide a synopsis video based on suspicious activities which will be helpful in forensic analysis. Seam carving is a content-aware resizing algorithm. It works by setting an amount of seams (path of least significance) in an image and automatically extracts seams to minimize the size of the image or insert seam to extend it. The reduction as well as the extension of the image size in both directions is achieved by eliminating or adding seams successively [5]. Optical flow is a visual scene design of evident movement of objects, surface, and edges brought about by the relative movement between the camera and the scene. The Lucas Kanade method in computer vision is a broadly utilized differential strategy for assessing the optical flow. It accepts that the flow in a local area of the pixel under rumination is essentially constant and can provide an rate of the motion of fascinating features in progressive images of a scene [6]. An entire image and a set of target proposals are the input for a Fast R-CNN network [4, 7]. The training set is hands with a knife and the position of the knife in the training images will be given in the ground truth table for learning different knives at different angles and positions.

2 Related Work Video synopsis is an effective tool that preserves the crucial activities in the original video while providing a short video presentation. Lu et al. [8] and Rav-Acha et al. [1] have focused on video synopsis, which is the most broadly utilized approach for shortening a lengthy video. The texture method and Gaussian mixture model by Lu et al. [8] are combined to identify an increasingly compact foreground with shadow removal. A particle refine tracker is used to generate more fluent tubes for the synopsis video. This method efficiently concatenates many tube fragments belonging to one object activity. Avidan et al. [9] proposed an algorithm used for object removal and image content enhancement. Seam carving is parameterized in pixel count, and resizing is expressed in the number of seams to be removed or inserted. Chen et al. [10] introduced an algorithm where sheets are carved incrementally from the video cube to lessen a video's length. The algorithm carves out images of smaller importance until the desired video size is obtained. The main contribution of the system is a generalization of the video frame to a sheet in a video cube. The sheet can be formulated by using a min-cut formulation. Li et al. [11] have suggested a video condensation method in which the ribbons are carved using dynamic programming to minimize an activity-aware cost function. Compared to previous synopsis approaches, our method is applicable to ATM surveillance for generating synopsis videos.


2.1 Problem Formulation and Motivation There is a lack of literature on abnormal activity-based video synopsis for ATM surveillance applications, and on handling the challenges when multiple objects move in different directions and at different speeds, with changes in the chronological order of events, bad tube extraction, shadow, and occlusion. (1) Criminal activities have been increasing in ATM centers. Many times, law enforcement authorities become aware of the crime only several hours after the incident. (2) Investigation of the crime is done by watching the surveillance videos from the ATM centers. Watching the entire video sequence is time-consuming and also slows down the investigation process.

2.2 Contribution and Objective As per the survey of abnormal activity-based video synopsis, no work has been done till now on stabbing action-based video synopsis for ATM applications. The objective is to synopsize a video of a sparse crowd involving stabbing action, which is applicable for forensic analysis and useful as evidence, and to develop an efficient algorithm for abnormal activity-based video synopsis in a sparse crowd, where ribbons are carved out to minimize an activity-aware cost function for ATM surveillance applications.

3 Methodology 3.1 Foreground Segmentation Using Gaussian Mixture Model Foreground segmentation is used as a primary step for the detection of moving objects. The background is modeled and the foreground is detected by a Gaussian Mixture Model (GMM) [12]. A Gaussian mixture model is a weighted sum of the densities of M components, given by Eq. (1) as

p(x|λ) = Σ_{i=1}^{M} w_i g(x | μ_i, Σ_i)    (1)

where x is a D-dimensional continuous-valued vector, w_i, i = 1, ..., M, are the mixture weights, and g(x | μ_i, Σ_i), i = 1, ..., M, are the component Gaussian densities. Each component density is a D-variate Gaussian function of the form given by Eq. (2):

g(x | μ_i, Σ_i) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp{ −(1/2) (x − μ_i)^T Σ_i^{−1} (x − μ_i) }    (2)

where μ_i is the mean vector and Σ_i is the covariance matrix.
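As an illustrative sketch only (not the authors' exact pipeline), OpenCV's MOG2 background subtractor implements an adaptive per-pixel Gaussian mixture in the spirit of Eqs. (1)–(2); the video file name below is a placeholder.

```python
import cv2

cap = cv2.VideoCapture("atm_surveillance.mp4")  # hypothetical input video
# Each pixel is modeled by an adaptive mixture of Gaussians, as in Eqs. (1)-(2)
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)       # 255 = foreground, 127 = shadow
    # Keep only confident foreground pixels, discarding shadow labels
    fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)[1]
    cv2.imshow("foreground", fg_mask)
    if cv2.waitKey(30) & 0xFF == 27:        # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```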

3.2 Blob Detection and Labeling A blob is projected from the head plane to the ground plane; any discontinuity in a blob that represents a person is handled by linking the discontinuous blobs, which are enclosed by a bounding rectangle. For shorter people, setting the head plane height too large may result in a zero intersected area. Conversely, setting the head plane height very low can result in detecting shorter objects. A rectangular area is created for each blob by joining the opposite points C1 and C2. If this area is below the area threshold, then the area is categorized separately. The blobs whose estimated area exceeds a threshold are grouped. The blob projection is then split into the head plane and the ground plane, separated, and labeled as

The area projected on head plane = C1 − L2
The area projected on ground plane = C2 − L1
Intersected area = C1 − C2
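A minimal sketch of this step, assuming the foreground mask fg_mask and frame from the GMM stage above; the area threshold is an assumed tuning value, not a figure from the paper.

```python
import cv2

AREA_THRESHOLD = 400  # assumed value; tuned per camera setup

# Connected-component labeling groups foreground pixels into blobs
n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg_mask,
                                                               connectivity=8)
for i in range(1, n):                       # label 0 is the background
    x, y, w, h, area = stats[i]             # bounding box and pixel count
    if area >= AREA_THRESHOLD:              # keep blobs exceeding the threshold
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
        cv2.putText(frame, f"blob {i}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
```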

3.3 Hand with Knife Detection by Using R-CNN At test time, R-CNN generates around 2000 category-independent region proposals for the given image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs. For example, to train a binary classifier (SVM) to detect a knife, the image region tightly enclosing the knife is a positive example and the background region is a negative example. A region that partially overlaps the knife is resolved by thresholding, with the threshold ranging from 0 to 0.5. The classification is between hand with knife and others.


3.4 Optical Flow by Lucas Kanade Method and Face Expression Recognition by Faster R-CNN The Lucas Kanade feature tracker is used in order to determine the movement of each subtarget. Optical flow is used to calculate the speed and direction of a moving object from one frame to another. The Lucas Kanade method of optical flow is used because it provides fast calculation and accurate time derivatives. Faster R-CNN is a system that consists of a region proposal network and Fast Regions with Convolutional Neural Network Features (Fast R-CNN). Following the multi-task loss in Fast R-CNN, the objective function is minimized. For an anchor box, the loss function is defined by Eq. (3):

L(p_i, t_i) = L_cls(p_i, p_i*) + λ p_i* L_reg(t_i, t_i*)    (3)

where p_i is the predicted probability of anchor i being an object. Since Faster R-CNN is used, the Region Of Interest (ROI) of each image must be marked first. In order to improve the reliability of the experimental results, three different depths of the network are used to train and test the data. Facial expression recognition enhances the stabbing action analysis.
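A minimal sketch of the Lucas Kanade step with OpenCV, assuming two consecutive frames prev_frame and frame; the speed threshold separating fast, stabbing-like motion from benign motion is an assumed value.

```python
import cv2
import numpy as np

prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Corner features in the previous frame, tracked into the current frame
pts0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                               qualityLevel=0.3, minDistance=7)
pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts0, None,
                                           winSize=(15, 15), maxLevel=2)

flow = (pts1 - pts0).reshape(-1, 2)[status.ravel() == 1]
speed = np.linalg.norm(flow, axis=1)                    # pixels per frame
direction = np.degrees(np.arctan2(flow[:, 1], flow[:, 0]))
fast_motion = speed > 5.0                               # assumed threshold
```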

3.5 Video Synopsis by Seam Carving The main goal of video condensation is to remove inactive pixels and produce a shorter length video [13, 14]. A segment of N consecutive video frames, W pixels wide and H pixels tall, should end up as a new segment of N' consecutive frames with N ≥ N'. If R denotes a vertical or horizontal ribbon, then the cost of the ribbon is given by

C(R) = Σ_{(x,y,t)∈R} C(x, y, t)    (4)

C(x, y, t) = (I_x(x, y, t))^2 + (I_y(x, y, t))^2 + (I_t(x, y, t))^2    (5)

where C is a cost function over pixels with coordinates (x, y, t), and I_x, I_y, I_t are local estimates of the horizontal, vertical, and temporal derivatives. Consequently, a vertical or horizontal ribbon cannot span more than M_φ = (φ max(W, H), φ + 1), where W is the width, H is the height, and φ is the flex parameter [15]. The experimental results of video synopsis by seam carving are shown in Table 2.
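A minimal numeric sketch of Eqs. (4)–(5), assuming the video is available as a grayscale array; the explicit pixel-list ribbon is a simplification of the dynamic-programming carving used in [11, 15].

```python
import numpy as np

def cost_volume(video):
    """video: (T, H, W) grayscale float array."""
    i_t, i_y, i_x = np.gradient(video)      # temporal, vertical, horizontal
    return i_x ** 2 + i_y ** 2 + i_t ** 2   # C(x, y, t) of Eq. (5)

def ribbon_cost(C, ribbon):
    """ribbon: iterable of (t, y, x) pixel coordinates forming one ribbon."""
    return sum(C[t, y, x] for (t, y, x) in ribbon)      # Eq. (4)
```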


4 Results and Discussions 4.1 Experimental Results The real-time dataset-3 consists of 1,79,220 frames; the locations and a sample frame are shown in Table 1 and Fig. 1, respectively. The proposed algorithm is applied to the input video to provide a synopsis of the video based on stabbing action, which will be useful in forensic analysis to analyze the crime effectively. From the above results (Figs. 2, 3, and 4), the input video is modeled and the foreground objects are identified by the Gaussian Mixture Model (GMM); then, the detected foreground is labeled. The individual with a knife in hand is detected by using the Regional Convolutional Neural Network and is finally marked with a blob of red color. Video synopsis is done by seam carving, which carves ribbons out by the cost of seams. The addition or removal of seams is based on the flex parameter, whose value ranges from 0.1 to 0.3 as shown in Table 3. A seam involving stabbing action has a high value and a seam having no activity has a low value. The seams are removed until the desired length of video is obtained.

Table 1 Real-time datasets
S.No | Dataset             | Frame no | Location
1    | Real-time Dataset 1 | 198      | India (Bangalore)
2    | Real-time Dataset 2 | 64       | China
3*   | Real-time Dataset-3 | 251      | Italy


Table 2 Video synopsis by seam carving
Input video       | Flex values | Output video duration
Duration: 97 min  | φ = 0.1     | 2 min
                  | φ = 0.2     | 3 min
                  | φ = 0.3     | 4 min

Fig. 1 Sample frame of real-time dataset-3

5 Conclusion An efficient algorithm is required to provide an abnormal activity-based video synopsis in order to make forensic analysis faster. The background is modeled and the foreground is segmented by the Gaussian mixture model. The individuals in the foreground are grouped into individual blobs. The individual with a knife in hand


Fig. 2 Foreground detection by Gaussian mixture model

Fig. 3 Blob detection and labeling

is detected by using the Regional Convolutional Neural Network. The motion vector estimation is done by Lucas Kanade optical flow to determine the speed and direction with which the knife moves. The velocity of a knife used for destructive purposes will be higher than that of a knife used for other, constructive purposes. If a face is detected, then the Faster Regional Convolutional Neural Network is used to recognize the facial expression, which enhances recognition of the stabbing action. If a face is not detected, then the sequence passes to abnormal activity-based video


Fig. 4 Individual’s hand with knife detection using R-CNN

Table 3 Performance measure of video synopsis
S.No | Input video         | No of input frames | Flex value | No of output frames | Condensation rate
1    | Real-time Dataset-1 | 1,45,500           | φ = 0.1    | 3000                | 1:48.5
     |                     |                    | φ = 0.2    | 4500                | 1:32.3
     |                     |                    | φ = 0.3    | 6000                | 1:24.25
2    | Real-time Dataset-2 | 20,025             | φ = 0.1    | 1620                | 1:12
     |                     |                    | φ = 0.2    | 3240                | 1:6
     |                     |                    | φ = 0.3    | 4050                | 1:5
3*   | Real-time Dataset-3 | 1,79,220           | φ = 0.1    | 6240                | 1:28.72
     |                     |                    | φ = 0.2    | 7800                | 1:22.98
     |                     |                    | φ = 0.3    | 9360                | 1:19.15

synopsis, which is done by seam carving. Seam carving in the video extracts the seams of frames involving stabbing action on the basis of the cost function. The seams with low cost are eliminated to obtain the video synopsis. Then, the performance of the video synopsis is evaluated by comparing the number of frames in the input and output videos. Consequently, the system provides an efficient abnormal activity-based video synopsis for ATM surveillance applications. Till now, there has been a lack of work on obtaining video synopsis by both horizontal and vertical seam carving. In addition, other abnormal activities such as change of posture, walking, and bending can be considered in future work.


References
1. A. Rav-Acha, Y. Pritch, S. Peleg, Making a long video short: dynamic video synopsis, in IEEE Conference on CVPR (Computer Vision and Pattern Recognition), December 2006, pp. 1–5
2. H.-C. Chen, P.-C. Chung, Online surveillance video synopsis, in Proceedings of IEEE International Conference on CVPR, May 2012, pp. 1843–1846
3. A. Glowacz, A. Dziech, M. Kmieć, Visual detection of knives in security applications using active appearance model. IEEE Trans. Image Process. 54, 703–712 (2015)
4. B. Yogameena, S. Veeralakshmi, E. Komagal, S. Raju, V. Abhaikumar, RVM-based human action classification in crowd through projection and star skeletonization. J. Image Video Process. (2009)
5. X. Ye, J. Yang, X. Sun, Foreground background separation from video clips via motion-assisted matrix restoration. IEEE Trans. Circ. Syst. Video Technol. 25(11), 1721–1734 (2015)
6. D. Patel, S. Upadhyay, Optical flow measurement using Lucas Kanade method. Int. J. Comput. Appl. 61(10), 6–10 (2013)
7. J. Li, J. Zhang, D. Zhang, J. Zhang, T. Li, Y. Xia, Q. Yan, L. Xun, Facial expression recognition with faster R-CNN. Int. Conf. Inf. Commun. Technol. 107, 135–140 (2017)
8. M. Lu, Y. Wang, G. Pan, Generating fluent tubes in video synopsis, in Proceedings of IEEE International Conference on Pattern Recognition, May 2013, pp. 2292–2296
9. S. Avidan, A. Shamir, Seam carving for content-aware image resizing. ACM TOG (Transactions on Graphics) 26(3) (2008)
10. B. Chen, P. Sen, Video carving, in Euro-graphics Conference on Computational Photography and Image-Based Rendering (2008)
11. Z. Li, P. Ishwar, J. Konrad, Video condensation by ribbon carving. IEEE Trans. Image Process. 18(11), 2572–2583 (2017)
12. V. Tiwari, D. Choudhary, V. Tiwari, Foreground segmentation using GMM combined temporal differencing, in International Conference on Computer, Communications and Electronics (2017)
13. K. Li, B. Yan, W. Wang, H. Gharavi, An effective video synopsis approach with seam carving. IEEE Signal Process. 23(1), 11–14 (2016)
14. P.S. Surafi, H.S. Mahesh, Surveillance video synopsis via scaling down moving objects. Int. J. Sci. Technol. Eng. 3(9), 298–302 (2017)
15. R. Furuta, T. Yamasaki, I. Tsubaki, Fast volume seam carving with multi-pass dynamic programming, in International Conference on Image Processing (2016), pp. 1818–1822

Behavioral Analysis from Online Data Using Temporal Graphs Anam Iqbal and Farheen Siddiqui

Abstract The Internet, over and above social media, is the basis of human interaction, information exchange, and communication nowadays, which has resulted in prodigious data footprints. If prediction techniques are efficiently employed, this data can be put to appropriate use for deducing human behavior. In our work, we have proposed a methodology for collecting data from social media and assessing user interactions online, using time-varying attributed or temporal graphs. Initially, we discuss temporal graphs and how the temporal and structural properties of users can be modeled using these evolving graphs for predicting the personality type of the user. The online platforms from which the datasets have been used for the deductions are Stack Exchange and Twitter. Moreover, the secondary research question addressed in this paper is how temporal or time-varying features impact our user behavior prediction. The graphs plotted using the provided datasets show the interactive behavior of users on different platforms. Keywords Time-varying attributed graph · Social media data · Stack exchange · Facebook · Twitter · Data mining

1 Introduction In the last decade, the number of Internet users has increased to about 56% of the world population in 2019, up 9% from January 2018 [1]. With the dawn of new and developing technologies, human interaction has mostly become dependent on social media and hence has resulted in the generation of huge data footprints. For example, according to a survey by Domo, called Data Never Sleeps, 0.0016 GB of data is created every second [2]. The data from all the social networking platforms is stored on large servers, which can be utilized for data analytics. We in our paper have proposed a methodology for


utilizing the data that defines human interactions on the social network and hence deducing some attributes of user behavior. The type of information that can be extracted from social media data is still limited, but its utilization in the most appropriate scenario can give results even better than expected. In order to gain an insight into such applications, most researchers have concentrated on deriving those features which impact user behavior. This is a challenging task, as user behavior and the attributes related to it depend on temporal, spatial, and contextual factors [3, 4]. An appropriate understanding of the dynamics of a user's interaction with social media can have multiple applications in diverse fields; for example, for a WhatsApp user, the online and offline times at the start and end of the day, respectively, can help determine his sleep cycle. Over the course of our work, we have come across some fundamental arguments and we have posed them as research questions: 'What methodology and its consequent implementation can allow data analysts to model, explore, extract, and then finally predict the user behavior using social media data, by employing data mining techniques?'

2 Literature Review A huge number of online platforms are used for communication, discussion, and information exchange. If this data is put to use in prediction, it can serve various applications [5]. The authors in [6] proposed the communication model, where the message travels from the sender to the receiver; it was extended in [7] by adding the feedback factor. The clients using social media are both senders and receivers at their respective ends, which emphasizes feedback more than the original message.

2.1 Prediction A large number of researchers have proposed and researched various aspects related to the response of users to activities taking place on the Internet, like predicting purchasing activities [8], customer behavior [9], churn [10], users' loyalty [11], and identification of criminals and financial defaulters [12]. Enhanced pieces of work incorporate complex analysis, like predicting temporal properties of a client and hence providing them with an improved experience [13]. In [14], the analysts predicted user cuisine choices by examining check-ins; the limitation was that the computational cost was not evaluated. In [15], historical geographical data was used to predict the future locations of the user accurately. In [16], user activity time at home was predicted.


2.2 Modeling User Behavior In our research, we have classified models into graph-based models and dynamics-based models, on the basis of their core functionalities and structure.

2.2.1 Dynamic Models

These models represent the behavior of objects with respect to time. The objects in our scenario are Internet users. A lot of work has been done on the analysis of dynamic networks [17, 18], as well as on modeling based on time-varying attributes in large-scale network datasets [19]; [20] emphasizes the use of temporal links to improve prediction models. For implementing dynamic models of human behavior, in [21] a human is treated as analogous to a device with varied mental states. Each state in its own right is a dynamic process

x_i = f_i(x, t) + ε(t)    (1)

where x_i is the state vector at time i, f_i is a function that models x_i, and ε represents a noise process. In the case of dynamic multiple models, the probability of an n-dimensional observation Y_z, given the kth model dynamics, can be put as

P^{(k)}(Y_z | X_z*) = exp( −(1/2) τ_z^{(k)T} R^{−1} τ_z^{(k)} ) / ( (2π)^{n/2} Det(R)^{1/2} )    (2)

where R is the covariance matrix and

τ_z^{(k)} = Y_z − f^{(k)}(X_z^{*(k)}, t)    (3)
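A minimal numeric sketch of Eqs. (2)–(3): each candidate model's residual τ is scored by the Gaussian likelihood, and the best-fitting model is selected. The toy dynamics f_models and covariance R are assumptions for illustration.

```python
import numpy as np

def model_likelihood(Y, X, f_k, R):
    tau = Y - f_k(X)                                    # residual, Eq. (3)
    n = Y.shape[0]
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(R))
    return np.exp(-0.5 * tau @ np.linalg.inv(R) @ tau) / norm   # Eq. (2)

f_models = [lambda x: 0.9 * x, lambda x: x + 1.0]       # two toy state dynamics
R = 0.5 * np.eye(2)                                     # observation covariance
Y, X = np.array([1.1, 2.0]), np.array([1.0, 2.2])
best_k = max(range(len(f_models)),
             key=lambda k: model_likelihood(Y, X, f_models[k], R))
```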

2.3 Graph-Based Models For using graphs to determine user behavior, nodes of the graph are used to represent people and the edges represent the interaction between these nodes. These models are used to establish structural properties of users. A graph is represented as (Fig. 1)

G = (V, E)    (4)


Fig. 1 An AMG with four nodes, u1-4, four edges, e1-4, and each node having three attributes

where the V nodes are the people and the E edges represent the interactions between the people. These graphs can be used to learn the types of associations between the users, i.e., who follows whom, and between the user and the platform, i.e., how active a user is on Twitter. A simple graph model only has users and their associations, but a real-world scenario is much more complex. For that, there is a need to extend the classical graph model. The result is powerful statistical analysis of the data. One implementation is nodes having associated attributes [22]. Such graphs are called attributed graphs. An attributed graph is given in Fig. 1. If nodes do not have attributes, in order to simplify the task at hand, a combination of graphs can also be used, as shown in [23].
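A minimal sketch of an attributed graph like the AMG of Fig. 1, using the python-igraph package adopted later in this paper; the attribute names and values are illustrative only.

```python
import igraph as ig

g = ig.Graph(directed=True)
g.add_vertices(4)
g.vs["name"] = ["u1", "u2", "u3", "u4"]     # node identities
g.vs["age"] = [21, 34, 27, 19]              # example node attributes
g.vs["platform"] = ["twitter"] * 4
g.add_edges([(0, 1), (1, 2), (2, 3), (3, 0)])
g.es["kind"] = ["follows", "mentions", "retweets", "follows"]
```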

3 Temporal Graphs Temporal or time-varying graphs (TVGs) comprise a set of entities X, relationships between entities represented by edges E, and a variable for defining any other property, P, i.e., E ⊆ X × X × P. A single property can be defined over multiple entities. The relationships between entities have a lifetime, referred to as the timestamp, Γ, where Γ ∈ T. System dynamics can be represented by a TVG, G, i.e.,

G = (X; E; Γ; σ; ξ)    (5)

where σ: E × Γ → {0, 1} represents whether an edge exists or not at a particular time, and ξ: E × Γ → T indicates the time taken to traverse one edge. One very important deduction from the behavior of temporal graphs is that the final graph is a union of many temporal sub-graphs (each a static graph when time is held constant),


Fig. 2 Process flow of the proposed work

i.e., F(G) = SG(1) ∪ SG(2) ∪ ... ∪ SG(k−1) ∪ SG(k).    (6)

4 Temporal Graphs for Behavior Prediction A temporal graph is one of the most suitable representations for a social network, because the edges fluctuate. The interactions between users also keep changing with time, due to the spatial movement of users and changes in the status of relationships between users, like acquaintances, friends, and family, which are all dynamic relations. Following the research questions raised, the research process in this project is divided into the following steps:


1. Developing the model: The model used is the Time-Varying Graph, which describes the structural and temporal features of the users.
2. Feature Extraction: This includes feature extraction and classification. Semantic and computational are the two classes of features that have been considered.
3. Prediction: Four algorithms which can be used are: Extreme Gradient Boosting (XGB), Decision Trees (DT), Linear Regression (LR), and k-Nearest Neighbor (kNN). Evaluation metrics like accuracy and time can be used to compare these with other previous models.
4. Visualization: At last, we propose visualization on the datasets using Case-Based Reasoning (CBR).
This paper puts forth the results of the prediction based on the TVG.

5 Retaining Temporal Features The two datasets used are StackOverflow and Twitter, each obtained from their official websites. The layout algorithm used is the Fruchterman–Reingold algorithm. The dynamic network models can be developed using Python or R. We have chosen Python as the language and Anaconda as the platform for the construction of the graph. The rest of the implementation is done in Python using the igraph package. Our model encapsulates the time-varying properties of social media data. The temporal attributes enable us to further divide the graph into sub-graphs. This is an efficient mechanism, as the features are to be computed only for the sub-graph, hence reducing the computation time. Our previous deductions have already stated that a social network is very similar to a graph. Thus we have employed graph theory for the detailed study of social network traits. The Time-Varying Attributed Graph (TVAG) consists of objects with time-varied attributes. These attributes can be engineered to design the interaction and relationship graphs. When people interact on social media, relationships are established between them. In order to represent these, a relationship graph RelG(t) has been used, where its nodes and edges represent the users and their relationships, respectively. Each relationship R(Ut_i(t)) necessarily has two attributes: one is the source, Ut_SOURCE, and the other is the target, Ut_TARGET. When people interact on social media, they interact with each other through messages. In order to represent these, an interaction graph INTG(t) has been used, where its nodes and edges represent users and their messages, respectively.
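A minimal sketch of this sub-graph construction, assuming an interactions table with source, target, and a datetime-parsed timestamp column (the column names are assumptions): the TVG is retained as a list of static snapshots SG(i), per Eq. (6), so features are computed per window rather than over the full graph. Each snapshot can then be drawn with g.layout_fruchterman_reingold().

```python
import igraph as ig
import pandas as pd

def tvg_snapshots(interactions: pd.DataFrame, freq="W"):
    """Split interactions into time windows and build one static graph each."""
    snapshots = []
    for window, chunk in interactions.groupby(
            pd.Grouper(key="timestamp", freq=freq)):
        g = ig.Graph.TupleList(
            chunk[["source", "target"]].itertuples(index=False), directed=True)
        g["window"] = str(window)           # graph-level attribute: lifetime
        snapshots.append(g)
    return snapshots                        # F(G) is the union of all SG(i)
```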


6 Results 6.1 Twitter Twitter is a micro-blogging website which enables users to post tweets and follow other users. Tweets are generally associated with a '#' hashtag, which represents the trending area. The data model consists of five tables: users, tweets, entities, entities in objects, and places. In the relationship graph, nodes are users while the edges show the relation between users. Twitter discards the temporal variations; hence, relationships once established do not generally change. The interaction graph of Twitter comes out to be time-stamped. The tweets of users are broadcast to their corresponding followers. The interaction graph hence is modeled between users and their tweets. The tweets have an attribute which determines their type: mention, reply, or retweet. In the interaction graph, each and every edge has a lifetime, but in the relationship graph the timestamp starts from the time of the first tweet and it follows every retweet. Figure 3 gives the interaction graph between 500 random users following a particular hashtag, and the distribution of other users who are not following the hashtag. Figure 4 gives the relationship graph of the 500 most active users. These users interact (i.e., tweet and retweet) with each other quite a lot, as is depicted by the density of the interaction graph.

Fig. 3 Interaction graph for twitter


Fig. 4 Relationship graph for twitter

7 Stack Exchange Stack Exchange is a collection of about 128 question-answer websites on varied topics. Each of these sites follows a particular data model. The interactions here are queries, responses to these queries, and additional comments. The data model consists of six tables. In order to form the basis of these models, the interaction graph is to be modeled between users and their messages. Additional attributes here are Reputation, Views, and Message attributes. These are generally temporal and hence need an additional attribute which can define the time-bound changes that occur in the contents of the messages. In Stack Exchange, the users cannot develop relations between them; hence, a relationship graph cannot be established. For our model, we have picked up the questions and answers from July 2009 to October 2014 and have developed an interaction graph for this scenario, which is given in Fig. 5. Users

Fig. 5 Interaction graph for Stack Exchange


Fig. 6 Interaction graph for Stack Exchange

are represented by the nodes, while the answer flow is represented by the edges. Another interaction graph is plotted which shows the interaction of users who follow a particular subject of study and are involved in answering the questions related to that particular topic (Fig. 6).

8 Conclusion The research carried out in this paper is based on a primary research problem, that is, 'how can social media data be modeled in a way that allows capturing, exploring, and hence understanding human behavior?' Graphs are one of the most suitable tools for representing and exploring social media datasets. Social network data like Stack Overflow and Twitter data can be modeled into a temporal model with hardly any loss of data and with focus on the time-varying behavioral attributes of the users. Hence, to demonstrate the usage of the proposed model, the Twitter and Stack Overflow datasets were modeled. Moving forward, the structural features derived from this temporal model play a very important role in the feature extraction step of machine learning, hence reducing the effort needed for extracting desired and useful features automatically.


References
1. Topic: Internet Usage Worldwide (2019). www.statista.com, https://www.statista.com/topics/1145/internet-usage-worldwide/. Accessed 15 Dec 2019
2. D. Cohen, 10 Takeaways from Domo's 7th Annual Data Never Sleeps Infographic (2019). www.adweek.com, https://www.adweek.com/digital/10-takeaways-from-domos-7th-annual-data-never-sleeps-infographic/
3. S. Scellato, A. Noulas, C. Mascolo, Exploiting place features in link prediction on location-based social networks, in Proceedings of the 17th ACM SIGKDD (ACM, USA, 2011), pp. 1046–1054
4. A. Guille, H. Hakim, A predictive model for the temporal dynamics of information diffusion, in Proceedings of the 21st International Conference on WWW (ACM, 2012)
5. G. Barbier, H. Liu, Data mining in social media, in Social Network Data Analytics (Springer, Boston, MA, 2011), pp. 327–352
6. C. Shannon, A mathematical theory of communication 27(4), 623–656 (1948)
7. W. Schramm, D.F. Roberts, The Process and Effects of Mass Communication, rev. edn. (University of Illinois Press, Urbana, 1971)
8. C. Lo, D. Frankowski, J. Leskovec, Understanding behaviors that lead to purchasing: a case study of Pinterest, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2016)
9. A. Martínez et al., A machine learning framework for customer purchase prediction in the non-contractual setting. Eur. J. Oper. Res. (2018)
10. M. Milošević, N. Živić, I. Andjelković, Early churn prediction with personalized targeting in mobile social games. Expert Syst. Appl. 83, 326–332 (2017)
11. W. Buckinx, G. Verstraeten, D. Van den Poel, Predicting customer loyalty using the internal transactional database. Expert Syst. Appl. 32(1), 125–134 (2007)
12. G. Sudhamathy, C. Jothi Venkateswaran, Analytics using R for predicting credit defaulters, in 2016 IEEE International Conference on Advances in Computer Applications (ICACA) (IEEE, 2016)
13. R. Boutaba et al., A comprehensive survey on machine learning for networking: evolution, applications and research opportunities. J. Internet Serv. Appl. 9(1), 16 (2018)
14. W. Min et al., A survey on food computing. ACM Comput. Surv. (CSUR) 52(5), 92 (2019)
15. Y. Miura et al., A simple scalable neural networks based model for geolocation prediction in Twitter, in Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT) (2016)
16. J. Chen, Y. Liu, M. Zou, Home location profiling for users in social media. Inf. Manag. 53(1), 135–143 (2016)
17. Bommakanti, S.A.S. Rajita, S. Panda, Events detection in temporally evolving social networks, in 2018 IEEE International Conference on Big Knowledge (ICBK) (IEEE, 2018)
18. L.Y. Zhilyakova, Dynamic graph models and their properties. Autom. Remote Control 76(8), 1417–1435 (2015)
19. R.A. Rossi et al., Modeling dynamic behavior in large evolving graphs, in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (ACM, 2013)
20. V. Nicosia et al., Graph metrics for temporal networks, in Temporal Networks (Springer, Berlin, 2013), pp. 15–40
21. A.W. Woolley, I. Aggarwal, T.W. Malone, Collective intelligence and group performance. Curr. Dir. Psychol. Sci. 24(6), 420–424 (2015)
22. V. Peysakhovich, C. Hurter, A. Telea, Attribute-driven edge bundling for general graphs with applications in trail analysis, in 2015 IEEE Pacific Visualization Symposium (PacificVis) (IEEE, 2015)
23. A. Guille et al., Information diffusion in online social networks: a survey. ACM SIGMOD Record 42(2), 17–28 (2013)

Medical Data Analysis Using Machine Learning with KNN Sabyasachi Mohanty, Astha Mishra, and Ankur Saxena

Abstract Machine learning has been used to develop diagnostic tools in the field of medicine for decades. Huge progress has been made in this area; however, a lot more work has yet to be done in order to make it more pertinent for real-time application in our day-to-day life. As a part of data mining, ML learns from previously fed data to classify and cluster relevant information. Hence, the main problems arise due to variations in the big data across individuals and huge amounts of unorganised datasets. We have used ML to figure out various patterns in our dataset and to calculate the accuracy of this data, with the hope that this serves as a stepping stone towards developing tools that can help in medical diagnosis/treatment in the future. Creating an efficient diagnostic tool will help improve healthcare to a great extent. We have used a mixed dataset in which both individuals with a severe illness in its early stages and individuals who are further along are present. We use libraries like seaborn to construct a detailed map of the data. The fundamental factors considered in this dataset are age, gender, region of stay, and blood group. The main goal is to compare different data to each other and locate patterns within. Keywords Medical diagnosis · Seaborn · Matplotlib · Data mining · KNN



1 Introduction Machine learning has helped to make huge strides in the fields of science and technology, including medical data processing, and has had a significant impact on life science and medical research. A few highlights include the recent advances that have been made in the development of machine learning pipelines for statistical bioinformatics and their deployment in clinical diagnosis, prognosis, and drug development [1]. Machine learning algorithms can also be trained to screen for complications on medical imaging data [2]. We obtained this data using Google Trends, which reflects the interest people have shown in the field of machine learning since 2014. It is based on the web searches made over this period, which is a good source for gauging the popularity of any kind of entity in this digital age [3]. From 2014 to 2019 there has been a consistent rise by huge proportions, which shows how the vast applications of ML are being realised and discovered by more and more people [4]. Machine learning has gradually spread across several areas within the medical industry, with the potential to revolutionise the whole industry [5]. Until a few years ago, medicine was solely dependent on heuristic approaches, where knowledge is gathered through experience and self-learning, crucial in a healthcare environment [6]. The increasing amount of data, or big data, is the basis for the application of machine learning [7]. ML is a platform that can skim information from numerous sources into an integrated system that can help in decision-making processes even for professionals [8].

1.1 Artificial Intelligence The focus of artificial intelligence has been hugely drawn towards the improvement of healthcare since the 1960s. In addition to building databases which store medical data such as patient data, research libraries, and administrative and financial systems, the research focus for artificial intelligence is innovating techniques for better medical diagnosis [9]. For example, PubMed is a service of the US National Library of Medicine that includes over 16 million citations from journals for biomedical articles dating back to the 1950s.

1.2 Medical Diagnosis It analyses structured datasets such as images, genetic data, and EP data. In clinical applications, the ML procedures attempt to cluster patients' traits or infer the probability of disease outcomes. In the process of molecular drug discovery and manufacturing of drugs, machine learning can be used for precision medicine, next-generation


sequencing, nano-medicine, etc. [10]. For better treatments, we are aiming towards the development of improved algorithms, for example, augmenting existing treatment methods, say, precision cancer treatment, with machine learning technologies [11]. Machine learning models have been trained to screen patients. Screening models, or algorithms, have already been started for identifying tumours, diabetes, heart diseases, skin cancer, etc. The algorithms and ML models should be of high precision and high sensitivity for the best evaluation and diagnosis of diseases or ailments [12] (Fig. 1). Machine learning tools can be put to various kinds of uses [13]. Figure 2 shows a heat map that has been used to analyse the Air Quality Index (AQI) of the entire city of Delhi over a month. This data analysis was performed by the Data Intelligence Unit of a renowned media company and news channel, India Today, on pollution statistics provided by the CPCB [14]. The CPCB is the Central Pollution Control Board, a statutory organisation under the Ministry of Environment, Forest and Climate Change. Therefore, this is publicly available data, which could be easily ignored if not for the processing that India Today did on that otherwise impactless statistical data [15]. One glance at the heat map gives enough information regarding the state of air quality in the city [16]. The dark shades in the odd–even weeks show that the air was at its worst during this period, with an average AQI of 365. We are able to analyse the impact of a very popular government scheme without having to read and compare hundreds of numeric values of the index [17]. The use of ML tools in this example has been to analyse large-scale medically and environmentally relevant data for an area of 1,484 km2 with a population of 1.9 crores. Its utility is indeed limitless. Figure 3 is a step ahead; it is a pollution calendar for the year 2019 [18, 19].

Fig. 1 Changes in people’s interest in machine learning over a period of 5 years


Fig. 2 Air quality index heat graph for January of 2016 (statistics during odd-even scheme implementation)

Fig. 3 AQI heat graph calendar for year of 2019


2 Methodology 2.1 About the Dataset We collected this dataset through a Google Form that we circulated among our college mates and friends. That is why most of the dataset contains the medical information of individuals in the age groups 16–30. Upon receiving complete responses, further processing of the dataset involved calculating BMI from the height and weight data of the individuals, changing certain column entries like medical history and symptom diagnosis to Boolean format, and grouping the ages of individuals into two-year age groups. The dataset processing was instrumental to the correct, unambiguous presentation and seamless execution of ML tools on the data.

2.2 Environment Setup Anaconda was installed to get the work started, as it makes the process of installing libraries seamless; it is used with Python version 3.7. We used Jupyter Notebook as our IDE because it is one of the gold-standard IDEs for machine learning, is user friendly, and has a simple interface. It was most appropriate for our work as it displays the graphs and the data clearly.

2.3 Starting We used the most popular machine learning libraries of Python, like sklearn, in our work. The data was used in Comma Separated Values (CSV) format. Before starting with the analysis, we need to import the libraries and their dependencies. The libraries imported cover everything needed for data analysis, machine learning, and data visualisation. Pandas, numpy, matplotlib, and seaborn are a few major libraries. The dataset looks like (see Fig. 4), and the complete process can be summarised as (see Fig. 5). Upon installation of Jupyter Notebook, an integrated development environment, on our desktop, we used the preinstalled libraries on the software for further editing: numpy for mathematical operations, seaborn, sklearn, pandas, and matplotlib. We then used the pandas library for importing our dataset onto Jupyter. We performed data visualisation using these library functions to help us with the data analysis process. Then, we used the KNN algorithm to classify the data (Fig. 6).
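A minimal sketch of this setup; the CSV file name and column labels are assumptions, not the actual schema of the survey form.

```python
import pandas as pd

df = pd.read_csv("medical_survey.csv")          # CSV exported from the form
df["BMI"] = df["Weight_kg"] / (df["Height_m"] ** 2)          # derived column
df["Medical_History"] = df["Medical_History"].map({"Yes": True, "No": False})
df["Age_Group"] = (df["Age"] // 2) * 2                        # two-year bins
print(df.head())                                # first 5 rows, as in Fig. 4
```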


Fig. 4 head() function of pandas shows the first 5 rows of the dataset

Fig. 5 The workflow

3 Results and Discussion 3.1 Data Relation with Respect to Gender Using Pairplot Function of Seaborn Library The above plot shows the analysis of different parameters of the dataset with respect to gender in terms of male and female. The red dots show the females and the blue ones represent the males. We can comprehend various patterns in these clusters of points plotted against the two axes (Fig. 7).
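A minimal sketch of the pairplot call, with hypothetical column names for the hue and the compared parameters.

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue="Gender",
             palette={"Female": "red", "Male": "blue"},
             vars=["Age", "BMI", "Medications_Per_Year"])    # assumed columns
plt.show()
```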

3.2 The Few Individual Parameters of Dataset in Form of Graphs or Histograms Which Is Crucial for Data Analysis The above histogram gives an overview of the ages of the participants. It shows that the data has a broad group of participants between the ages of 18 and 21 years as compared to elder groups. This is because the survey was done in a higher educational institute, with a majority population of young participants. This helps us to predict the kind of illnesses that could be common among the individuals in this group and what we can expect in general in terms of medical histories (Fig. 8).


Fig. 6 pairplot was used to plot the histogram to show relations

The above histogram shows that the highest number of individuals have the B+ve blood group; it is also the most common blood group among people of the Indian subcontinent. This definitely speaks well of the accuracy of this dataset (Fig. 9). The survey form, circulated through electronic media and messaging apps, received significant responses from female individuals. This can also indicate greater medical and physiological awareness among female participants (Fig. 10).

3.3 Comparison of Two Parameters Together The above histogram shows that females took more medications as compared to the male individuals. The above graph supports the observation that females usually consume more medicines and fall ill more frequently as compared to men (Fig. 11).


Fig. 7 Individual parameter study of age groups in form of histogram

Fig. 8 Individual parameter study of blood groups in form of histogram


Fig. 9 Individual parameter study of blood groups in form of histogram

Fig. 10 Parameter study of gender and medications in form of histogram


Fig. 11 Parameter study of gender and blood group in form of histogram


Fig. 12 The graph shows the inter relationship of one parameter with each other by a float value


Fig. 13 The above histogram plots the number of males and females with respect to BMI and medical history

3.4 For Better Analysis We Did a 1 to 1 Comparison of Data The heat map gives information regarding the type of data collected by the survey. The dark shade of green marks entries that had the standard type of data obtained during data collection, and, thus, lighter shades indicate non-standard data that had to be further processed before visualisation and analysis. We did this analysis of the impact factor without having to read and compare hundreds of numeric values of the dataset (Fig. 13).

3.5 Finding the K Nearest Neighbour (KNN) Algorithm This supervised ML algorithm was used to solve the classification problem in terms of males and females over the fields BMI and medical history of the people. Here, similar groups are closer together and the dissimilar ones are relatively farther apart (Fig. 14). To calculate and present the distance between two corresponding points, we have plotted the above graph for further data analysis in terms of its accuracy (Euclidean distance, or the straight-line distance, was used).
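A minimal sketch of the KNN classification and the accuracy-versus-k curve of Fig. 14, reusing the df from the loading sketch above; feature and label columns are assumptions. KNeighborsClassifier uses the Euclidean distance by default.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

X = df[["BMI", "Medical_History"]].astype(float)
y = df["Gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
accuracies = []
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)   # Euclidean metric by default
    knn.fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))

plt.plot(range(1, 26), accuracies)
plt.xlabel("k (number of neighbors)")
plt.ylabel("test accuracy")
plt.show()
```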


Fig. 14 The above graph shows the relation between the accuracy of data with K nearest values

4 Conclusion and Future Scope This model requires better, more robust entries that are accurate and curated. Since the diagnosis is not specific, it cannot be analysed with just these few parameters, as more information needs to be analysed due to differences across multiple ailments. The data should be 98% accurate for it to be acceptable in real-time diagnostic tool development. The dataset is required to be trained rigorously to make the analysis more efficient. Also, future work may involve deep learning and neural networks like BERT and other better algorithms after an improved dataset is formed. Acknowledgments We would like to express our deep sense of gratitude towards Amity Institute of Biotechnology and our families; without their support throughout the process this paper would not have been accomplished.

References
1. I. Sharma, A. Agarwal, A. Saxena, S. Chandra, Development of a better study resource for genetic disorders through online platform. Int. J. Inf. Syst. Manag. Sci. 1(2), 252–258 (2018)
2. S. Mohagaonkar, A. Rawlani, P. Srivastava, A. Saxena, HerbNet: intelligent knowledge discovery in MySQL database for acute ailments, in 4th International Conference on Computers and Management (ICCM) (ELSEVIER-SSRN, 2018), pp. 161–165. ISSN: 1556-5068
3. S. Shukla, A. Saxena, Python based drug designing for Alzheimer's disease, in 4th International Conference on Computers and Management (ICCM) (ELSEVIER-SSRN, 2018), pp. 20–24. ISSN: 1556-5068
4. A. Agarwal, A. Saxena, Comparing machine learning algorithms to predict diabetes in women and visualize factors affecting it the most, a step toward better healthcare for women, in International Conference on Innovative Computing and Communications (2019). https://doi.org/10.1007/978-981-15-1286-5_29
5. A. Saxena, N. Kaushik, A. Chaurasia, N. Kaushik, Predicting the outcome of an election results using sentiment analysis of machine learning, in International Conference on Innovative Computing and Communications (2019). https://doi.org/10.1007/978-981-15-1286-5_43
6. A. Saxena, S. Chandra, A. Grover, L. Anand, S. Jauhari, Genetic variance study in human on the basis of skin/eye/hair pigmentation using apache spark, in International Conference on Innovative Computing and Communications (2019). https://doi.org/10.1007/978-981-15-1286-5_31
7. V.V. Vijayan, C. Anjali, Prediction and diagnosis of diabetes mellitus, a machine learning approach, in 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS) (Trivandrum, 2015)
8. B. Sarwar, V. Sharma, Intelligent Naive Bayes approach to diagnose diabetes type-2. Int. J. Comput. Appl. Issues Chall. Netw. Intell. Comput. Technol. (2012)
9. R. Motka, V. Parmar, Diabetes mellitus forecast using different data mining techniques, in IEEE International Conference on Computer and Communication Technology (ICCCT) (2013)
10. S. Sapna, A. Tamilarasi, M. Pravin, Implementation of genetic algorithm in predicting diabetes. Int. J. Comput. Sci. 9, 234–240
11. K. Savvas, N. Schizas Christos, Region based support vector machine algorithm for medical diagnosis on Pima Indian diabetes dataset, in IEEE Conference on Bioinformatics and Bioengineering (2012), pp. 139–144
12. A. Al Jarullah, Decision discovery for the diagnosis of Type II diabetes, in IEEE Conference on Innovations in Information Technology (2011), pp. 303–307
13. D.M. Nirmala, B.S. Appavu alias, U.V. Swathi, An amalgam KNN to predict diabetes mellitus, in IEEE International Conference on Emerging Trends in Computing, Communication and Nanotechnology (ICECCN) (2013), pp. 691–695
14. U. Poonam, H. Kaur, P. Patil, Improvement in prediction rate and accuracy of diabetic diagnosis system using fuzzy logic hybrid combination, in International Conference on Pervasive Computing (ICPC) (2015), pp. 1–4
15. S.S. Vinod Chandra, S. Anand Hareendran, Artificial Intelligence and Machine Learning (PHI Learning Private Limited, Delhi, 2014)
16. R. Bellazzi, B. Zupan, Predictive data mining in clinical medicine: current issues and guidelines. Int. J. Med. Informatics 77, 81–97 (2008)
17. A. Agarwal, A. Saxena, Malignant tumor detection using machine learning through scikit-learn. Int. J. Pure Appl. Math. 119(15), 2863–2874 (2018)
18. S. Saria, A.K. Rajani, J. Gould, D. Koller, A.A. Penn, Integration of early physiological responses predicts later illness severity in preterm infants. Sci. Transl. Med. 2, 48ra65 (2010)
19. D.C. Kale, D. Gong, Z. Che et al., An examination of multivariate time series hashing with applications to health care, in IEEE International Conference on Data Mining (ICDM) (2014), pp. 260–269

Insight to Model Clone’s Differentiation, Classification, and Visualization Ritu Garg and R. K. Singh

Abstract Model clones are model fragments, in terms of model elements clustered together in a containment relationship, that are highly similar. Due to their property of defect propagation, these are harmful from a maintenance point of view. Work on model cloning is less mature as compared to code cloning. In order to fill this gap, the authors identified the key areas regarding model clones, with important concepts along with their benefits, limitations, and findings on the basis of existing literature, which are helpful for further research. It creates awareness about the attributes in which code and model clones differ. Then the classification of model clones is refined and proposed on the basis of similarity and clustering strategy, followed by the techniques for detection of model clones, where the importance of hybrid clone detection is studied on the basis of the pros and cons of other existing techniques for clone detection. Recommendations are given regarding the techniques for the visualization and reporting of the model clones detected. Keywords Clone · Software quality · Modeling · Duplication · Refactoring

1 Introduction to Model Clones and Its Representation A model is an abstract representation of any system with respect to a context from a specific viewpoint [1, 2]. "Connected submodels, that are in structural equivalence to one another, up to a particular threshold represents Model Clones" [1]. The idea of modeling promotes the transformation of real-world ideas into a clear design and a more maintainable system. In the initial phases of the Software Development Life Cycle (SDLC), process modeling of software systems is done using UML, which is used to design the structure and behaviors of the software. The Unified Modeling Language (UML)


is useful for modeling application systems, and Simulink for modeling embedded systems. The internal representation of UML models is stored textually in the form of XML files with a tree structure [3]. However, the physical representation in itself may not replace the conceptual view of the software, which takes the form of a graph. This is because the conceptual view represents the semantic information along with the structural information, in the form of concepts and their features, while the physical representation represents structural information only. There are many literature surveys on code clone detection and its associated areas, but model clones are still lagging behind. The features corresponding to model elements, if similar to those of other model elements of the same type, may be termed a model clone. However, such similarity is vague in itself, as affirmed by Ira Baxter [4]. Due to this, different authors use different bases of similarity, whether in terms of the attributes of similarity, clone classification, similarity detection technique, or reporting of similar model elements. This study not only provides a consolidated and comprehensive view of model clones, their attributes, classification, detection, and visualization, but also reports the major findings in these areas. This will help future authors to understand these concepts and distinguish them from code clones for further research. The main contributions are as under:
1. Attributes in which code and model clones are similar or different during clone detection are reported.
2. Different clone classifications used in the literature are discussed and then refined by the authors. This is done to have a clear and concise idea of the interrelations between them for comparing various studies.
3. Different model clone detection techniques are presented with their pros and cons, to depict the need for a hybrid technique for better efficiency.
4. Various techniques for visualization of clones are studied, to focus on the need for aggregation of both textual and graphical approaches for better understanding and navigation of clones.
The authors have focused on restricting the scope to the concept of model clones. In this paper, Sect. 2 shows the similarities/differences of the attributes in which model and code clones differ. Section 3 discusses the proposed classification of clones in models. The shift to a hybrid technique for model clone detection is discussed in Sect. 4. Section 5 deals with the different techniques used for visualization of clones. Section 6 discusses the conclusions and future work for extending this study.

2 Similarities/Differences of Attributes in Which Model and Code Clones Differ In contrast to code clones (software clones that exist during the implementation phase of the SDLC), model clones differ in the ways shown in Table 1. The similarities between code and model clones are on the basis of layout, color, and position among the elements, which are lines of code in the case of code clones and model elements

Insight to Model Clone’s …

489

Table 1 Differentiation of code and model clones [3, 5–7]
Attribute       | Code clone | Model clones (UML)
Information     | Dependency information (implicit similarity) | Containment information along with dependencies among objects (explicit similarity)
Structure       | Textual structure at file level | Tree-like structure at higher level (containment info), graph-like structure (dependency information), or a combination of both
Notes/Comments  | Comments are not important from a structural and semantic viewpoint with respect to code | Notes/comments are important from a semantic viewpoint with respect to models
Identifiers     | Identifiers identified by names are locally unique | Identifiers are identified by the id, which is globally unique, while the name attribute of model elements is locally unique
Naming          | All identifiers must have a name | All identifiers must have an id. Name is optional, which may result in loopholes
Coupling        | Depends on the structure of function calls | Depends on the relationships between model elements along with the structure of objects
Renaming        | Simple renaming in terms of variable, constant, or literal names | Blind renaming in terms of variable, constant, or literal names on the basis of the type of the block

in the case of model clones. The clone detection techniques use these attributes as the basis of comparison for model fragments.

3 Proposed Classification of Clones in Models In the existing literature, model clones have many classifications at the broader domain level. However, the authors present the major classifications used, which can be interrelated to one another. This provides a common view of the classification of model clones for further research and will help in comparing the various clone detection techniques on the basis of various attributes for better efficiency.


3.1 Classification of Clones on Basis of Clustering Effect Is Identified as [3, 7]
1. Primary clone
2. Secondary clone.
Primary clones: clones based on the similarity of a fragment as a whole, consisting of all similar elements, or resulting from merging various clone fragments, where each contains similar elements within the fragments. Secondary clones: clones based on the similarity of a single indivisible element in a fragment.

3.2 Classification of Model Clones on the Basis of Similarity as Per Störrle [3, 7]

Type A: Exact Model Clone. A model that is exactly similar in terms of content, ignoring layout, secondary notation, and internal identifiers. It disregards differences in position, color, spacing, text fonts, appearance and formatting, orientation, etc. In code clones, comments do not play an important role at the code level due to the concrete level of detail, but in model clones, notes/comments play a vital role due to the abstract representation of the software and can help reveal potential clones. In model clones, a globally unique identifier is associated with each model element, whereas for code clones uniqueness rests only on the naming convention used during coding. Due to these facts, model elements may be similar but not identical within a system.

Type B: Renamed Model Clone. A model that is highly similar in terms of content apart from changes such as renaming of elements, attributes, and parts, in addition to the variations mentioned in Type A. Thus, it takes into account variations among the labels along with their values with respect to the model elements of the model. It identifies similarities while allowing for variations in data types, access specifiers, or other meta-attributes of model elements, since developers may change the scope and accessibility of a model element.

Type C: Modified Model Clone. A model that is highly similar in terms of content apart from changes such as the addition or removal of parts (sets of model elements as submodels) and reordering at the same hierarchical level, in addition to the variations mentioned in Type B. It tolerates variations among submodels where any model element is added, modified, or removed, up to a certain threshold. It may lead to gapped clones, meaning that the number of clones increases while the size of each clone decreases.

Type D: Semantic Model Clone. A model that is approximately similar in terms of content only, which may arise from practices like copying of model fragments or methods, constraints imposed by the languages, convergent development, or other processes. It takes into account equivalence as a unique normal form of models [8]. Such clones may be exactly similar in terms of the meaning of their content: the behavior of the system is checked on the basis of whether the same inputs produce the same outputs. These clones are very hard to detect because of the semantic nature of the abstract representation, which is why interlinking them with preconditions and postconditions increases the precision of clone detection techniques. Different authors have used Object Constraint Language (OCL) specifications to capture the semantic similarity of these pre- and postconditions. In addition, maintaining a dictionary of synonyms identifies similar model elements at a lower level of abstraction to increase precision. Type D differs from the other types because it involves not only pairwise matching of structural content but also semantic transformations, which are very hard to detect. Pairwise matching of model elements and attributes reports many exact model elements as secondary clones; however, the focus should be on detecting maximal matches as primary clones, whether exact or approximate, to identify the main areas of emphasis for detailed clone detection. In order to remove such accidental duplications reported as secondary clones, there should be a threshold on the number of model elements reported in primary clones.

3.3 Classification of Model Clones on the Basis of Similarity as Per Rattan [6, 9, 10]

Type-1: Model clones based on standard modeling or coding practices. These are repetitions of model elements within a model (fields in a class) due to programming or modeling conventions (e.g., default identifiers in a serializable class).

Type-2: Model clones by purpose. These are repetitions in the nature of relationships (e.g., an overriding feature among a parent and all child subclasses, or a realization relationship between an interface and its implementing classes due to the repetition of abstract operations).

Type-3: Model clones based on design practices. These represent repetitions present among different model elements (classes) in clusters of different sizes, due to unfinished design or other reasons.

The classification given by Rattan relates to class diagrams. The authors observe that all three types of clones mentioned by Rattan [6, 9, 10] are based on the same naming conventions for attributes/relationships, allowing for minor changes in their meta-attributes. Therefore, they may be regarded as exact model clones as defined by Störrle [3, 7]. The refined classification is shown in Fig. 1.



Fig. 1 Refined/Proposed classification of model clones

4 Shift to Hybrid Technique for Model Clone Detection

Model clone detection is similar to code clone detection, with the difference that it is concerned with model elements and the meta-attributes of those model elements with respect to models. The Model Clone Detection (MCD) techniques, along with the pros and cons that motivate the shift to a hybrid technique for clone detection, are listed below:

1. Graph-based MCD
2. Token-based MCD
3. Tree-based MCD
4. Hierarchical textual MCD (a special class of tree-based MCD)
5. Metric-based MCD
6. Semantic-based MCD
7. Feature-based MCD

Earlier (especially before 2012), the techniques used for model clone detection were graph-based [1, 2, 11, 12]. Graph-based MCD involves matching for sub-graph isomorphism, which is an NP-complete problem [4]; this makes it difficult and time consuming. The token-based technique can mainly detect T-1 and T-2 clones, so recall is lower where these approaches are used [13]. With tree-based clone detection, structural clones are easy to detect, but any shift of a model element is difficult to detect [3, 5–7, 9, 10]. In the hierarchical textual approach, a lexical pass detects renaming within the tree structure, which the tool Simone uses for model clone detection [14]. Metric-based approaches provide better performance, comparing model elements easily via their metrics with less detection time [15]. Semantic-based approaches rely on the behavior of the concepts under study, which requires transformations that are difficult to detect [8]. Feature-based clone detection identifies the features corresponding to the concepts and measures similarity on the basis of granularity (class, method, identifier) [16–18]; it may use machine-learning approaches to train the system so that it can later test for similarities. Due to the high complexity involved in such MCD, a heuristic is required to balance time and space complexity. To overcome these limitations, a combination of these MCD techniques in the form of a hybrid technique, chosen depending on the type of software and the desired performance parameters, is a better choice for efficiency in terms of time and space complexity.
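To make the metric-based idea concrete, the following is a minimal illustrative sketch (not taken from any of the surveyed tools): two model elements are summarized by a small vector of hypothetical metrics, and elements whose vectors agree within a tolerance are flagged as candidate clones. The chosen metrics and the dictionary-based element representation are assumptions for illustration only.

```python
# Illustrative metric-based comparison of two UML class elements.
def metric_vector(element):
    # Hypothetical metrics: attribute, operation, and association counts.
    return (
        len(element["attributes"]),
        len(element["operations"]),
        len(element["associations"]),
    )

def is_candidate_clone(e1, e2, tol=0):
    # Elements whose metric vectors agree within `tol` are candidate clones;
    # a real detector would follow up with finer-grained matching.
    return all(abs(a - b) <= tol
               for a, b in zip(metric_vector(e1), metric_vector(e2)))
```

Because such vectors are cheap to compute, a hybrid detector could use them as a fast pre-filter before applying a more expensive tree- or graph-based comparison.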

5 Different Techniques Used for Visualization of Clones

The clones reported to the developer or maintainer of the system are of two kinds:

1. Clone pairs
2. Clone classes.

A clone pair represents exactly two clone instances or components (sets of interrelated model elements) that are highly similar to each other. A clone class represents two or more clone instances or components (sets of interrelated model elements) that are highly similar to one another. For a clone class having n clone instances, we may have n(n−1)/2 clone pairs, each with a different pair of clone instances. The cardinality of a clone class is the number of cloned instances of a fragment, including the original fragment, whereas the size of a clone class is the number of nodes or model elements present in a clone instance.

Models are created in an Integrated Development Environment (IDE), but their visualization is complicated by the hierarchical layering of models that span multiple files and models. Therefore, there should be some mechanism to represent the clones either textually or graphically.

In a textual representation, there is no link with the IDE; model elements that are clones are simply referred to by means such as the complete paths of model element names, or by highlighting the names of the blocks or lines that are clones as clone pairs or clone classes [5, 15, 19]. Along with clone classes and clone instances, such a representation should also provide the capability to explore clone instances in the same working environment [20].

In a graphical representation, there is a linkage with the IDE, where the model elements that are clones are shown within the IDE itself. The techniques used are [5]: different coloring schemes for small systems; scatter-plots to identify the models in which major or highly risky components (according to the Pareto principle) are cloned; and matrix representations of the clones.

In some cases, we may use both textual and graphical representations: the clones are first presented in textual form, and the user can then flexibly switch to a graphical mode within the IDE itself. For example, the complete paths of element names may be listed, and on clicking one, the user is redirected to a visual representation of the clones in the models with coloring schemes (used for large software systems) [21].



Such combined representations of clones are better than an individual textual or graphical approach for visualization. Due to easier navigation and understanding of clones in software, this finds its major application in clone lineage and clone genealogies during software evolution. Clone lineage is represented by a Directed Acyclic Graph (DAG) corresponding to a clone group's history of evolution across versions. Clone genealogies represent the co-change relation in the revision history: they depict the effect on the other clone instances, on the basis of the co-change phenomenon, if a change occurs in one clone instance of a clone class. Since models are abstract in nature, they contain less information than code, due to which both the precision of and the time taken by the clone detection process decrease.

6 Conclusion and Future Scope

This paper deals with model clones, highlighting the attributes in which code clones are differentiated from model clones during MCD, such as information, structure, comments, identifiers, naming scheme, coupling, and renaming. The model clone classifications used in existing research for detecting model clones are then refined by the authors on the basis of similarity and clustering strategy, to provide a consistent overall view. The pros and cons of the various MCD techniques highlight the need for detecting clones using hybrid MCD to balance time and space complexity. The clones detected by an MCD technique, classified on the basis of various attributes, are then presented using model clone visualization techniques based on textual, graphical, and combined approaches, along with their linkage to the IDE; the combination of textual and graphical approaches provides better understanding and navigation during clone evolution. This study will be useful to future researchers for understanding model clones and their classification and visualization, in order to advance further research. Model cloning needs further exploration and refinement in the areas of analysis, detection, and management during the evolution of software, which may help improve the quality of industrial practices for developing and maintaining software. The effects of model attributes on clone detection and management still need exploration and validation through empirical studies.

References

1. F. Deissenboeck, B. Hummel, E. Juergens, M. Pfaehler, B. Schaetz, Model clone detection in practice, in Proceedings of the 4th International Workshop on Software Clones (ACM, 2010), pp. 57–64
2. B.J. Muscedere, R. Hackman, D. Anbarnam, J.M. Atlee, I.J. Davis, M.W. Godfrey, Detecting feature-interaction symptoms in automotive software using lightweight analysis, in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (IEEE, 2019), pp. 175–185



3. M. Chochlov, M. English, J. Buckley, D. Ilie, M. Scanlon, Identifying feature clones: an industrial case study, in 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER) (IEEE, 2019), pp. 544–548
4. D. Rattan, R. Bhatia, M. Singh, Model clone detection based on tree comparison, in India Conference (INDICON) (IEEE, 2012), pp. 1041–1046
5. B. Hummel, E. Juergens, D. Steidl, Index-based model clone detection, in Proceedings of the 5th International Workshop on Software Clones (ACM, 2011), pp. 21–27
6. H. Störrle, Towards clone detection in UML domain models. Softw. Syst. Model. 12(2), 307–329 (2013)
7. D. Rattan, R. Bhatia, M. Singh, Detecting high level similarities in source code and beyond. Int. J. Energy. Inf. Commun. 6(2), 1–16 (2015)
8. H. Störrle, Effective and efficient model clone detection, in Software, Services, and Systems (Springer International Publishing, 2015), pp. 440–457
9. M.H. Alalfi, J.R. Cordy, T.R. Dean, Analysis and clustering of model clones: an automotive industrial experience, in IEEE Conference on Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), Software Evolution Week (IEEE, 2014), pp. 375–378
10. E.J. Rapos, A. Stevenson, M.H. Alalfi, J.R. Cordy, SimNav: Simulink navigation of model clone classes, in International Working Conference on Source Code Analysis and Manipulation (SCAM) (IEEE, 2015), pp. 241–246
11. D. Rattan, M.G. Singh, R.G. Bhatia, Design and development of an efficient software clone detection technique. Doctoral dissertation (2015)
12. G. Mahajan, Software cloning in extreme programming environment. arXiv (2014), pp. 1906–1919
13. F. Deissenboeck, B. Hummel, E. Jürgens, B. Schätz, S. Wagner, J.F. Girard, S. Teuchert, Clone detection in automotive model-based development, in ICSE '08: ACM/IEEE 30th International Conference on Software Engineering (2008), pp. 603–612
14. R. Garg, R.K. Singh, Detecting model clones using design metrics, in International Conference on New Frontiers in Engineering, Science and Technology (2018), pp. 147–153
15. B. Al-Batran, B. Schätz, B. Hummel, Semantic clone detection for model-based development of embedded systems. Model Driven Eng. Lang. Syst. 258–272 (2011)
16. C.K. Roy, J.R. Cordy, A survey on software clone detection research. Queen's School Comput. TR 541(115), 64–68 (2007)
17. D. Rattan, R. Bhatia, M. Singh, Software clone detection: a systematic review. Inf. Softw. Technol. 55(7), 1165–1199 (2013)
18. N.H. Pham, H.A. Nguyen, T.T. Nguyen, J.M. Al-Kofahi, T.N. Nguyen, Complete and accurate clone detection in graph-based models, in Proceedings of the 31st International Conference on Software Engineering (IEEE Computer Society, 2009), pp. 276–286
19. I.D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, L. Bier, Clone detection using abstract syntax trees, in Proceedings of the International Conference on Software Maintenance (IEEE, 1998), pp. 368–377
20. S.K. Choudhary, M.A. Sindagi, M.V. Patel, U.S. Patent Application No. 15/637,684 (2019)
21. E.J. Rapos, A. Stevenson, M.H. Alalfi, J.R. Cordy, SimNav: Simulink navigation of model clone classes, in IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM) (2015), pp. 241–246

Predicting Socio-economic Features for Indian States Using Satellite Imagery Pooja Kherwa, Savita Ahlawat, Rishabh Sobti, Sonakshi Mathur, and Gunjan Mohan

Abstract This paper presents a novel, accurate, inexpensive, and scalable method for estimating socio-economic features, such as electricity availability, treated water, electronics (television, radio), communication mediums (mobile phone, landline phone), and vehicles (2/3/4 wheelers), from high-resolution daytime and nighttime satellite imagery. Our approach helps to track and target poverty and development in India and other developing countries.

Keywords Satellite images · Ridge regression · Stochastic Gradient Descent (SGD) · Machine learning · Convolutional neural network

1 Introduction

In developing countries like India, collecting information based on precise estimates of economic and development indicators on foot, through a census, is difficult. Census data is error-prone and noisy because of the considerable variability in data collection processes across geographies, and there is often no validation. Through our machine learning model, we endeavor to limit this effort, move toward easier development tracking, and provide more accurate results. Accurate estimates of the economic characteristics of populations critically influence both research and policy. Such estimates shape decisions by individual governments about how to allocate resources, and provide the foundation for worldwide efforts to understand and track progress toward improving human livelihoods. Through our model, we were able to accomplish the following:

P. Kherwa · S. Ahlawat (B) · R. Sobti · S. Mathur · G. Mohan Maharaja Surajmal Institute of Technology, New Delhi 110058, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_44




• Prepared and analyzed the data for around 10,000 villages spread over two states of North India using Census 2011, village boundaries, and corresponding satellite imagery.
• Trained eight deep convolutional neural network-based models for direct regression of the socio-economic features of the Census 2011 data on daytime image data.
• Used ridge regression on nighttime satellite images of the villages.
• Compared the regression scores obtained by both models with existing models that used either the nighttime or the daytime data alone for evaluation.

The two Indian states of Punjab and Haryana are analyzed in the present research. For the primary run, 2000 images were used for the day and nighttime data.

2 Literature Survey

In 2011, the association between nightlights and GDP estimates for India at the district level was studied using non-linear regression techniques [1]. In 2013, high-resolution nightlight satellite images were used as a proxy for development using a regression model [2, 3]. In this line of work, a machine learning tool with very high accuracy is developed to predict socio-economic scenarios using daylight images [4]. A global poverty map has been produced using a poverty index calculated by dividing the population count by the brightness of satellite-observed lighting. In another work, a deep Convolutional Neural Network (CNN) with daytime images is used to identify land use patterns [5]; land use patterns are analyzed using advanced computer vision techniques for labeling, with ground-truth labels obtained from surveys. Predicting poverty is another important problem in developing countries, where nighttime lighting is used as a rough proxy for the economic wealth of a region [6].

3 Data Collection and Description

3.1 Census Data Vector

The asset model for the two north Indian states of Punjab and Haryana was created [7]. The village-level information is ordered by village ids and gives aggregated data on around 140 household characteristics. A dimension reduction technique is then used to reduce the 140-dimensional feature vector to 5 dimensions. The vector fields are electricity availability, treated water, electronics (television, radio), communication mediums (mobile phone, landline), and vehicle (2/3/4 wheeler).



3.2 Outlier Removal

The Census 2011 data has noise and a large number of errors for some villages. Outliers are values that do not match the general character of the dataset. To reject outliers, we compute the distribution of the Mahalanobis distance of all villages [8]. The threshold was set at a 10% deviation from the standard median. Results are reported both with and without outlier rejection.
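As a rough illustration (not the authors' exact code), a minimal NumPy sketch of Mahalanobis-distance-based outlier rejection might look as follows; the simple fixed cutoff stands in for the paper's median-deviation rule, and the function name is hypothetical.

```python
import numpy as np

def mahalanobis_keep_mask(X, cutoff=3.0):
    """Return a boolean mask of rows whose Mahalanobis distance from the
    sample mean is below `cutoff` (assumed threshold for illustration)."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))  # pseudo-inverse for stability
    diff = X - mu
    dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
    return dist <= cutoff  # True = keep this village's record
```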

3.3 Daytime Satellite Images

For geo-registration of the villages, the publicly available geospatial vector data published by the Government of India [9] is used. We acquire the daytime satellite images corresponding to the villages from Google Static Maps using the API provided by Google [10] (Google Static Maps API, 2017). A sample image is given in Fig. 1. The API call requires an API key, latitude, longitude, zoom level, etc., for a successful query. We set the zoom level to 15 and the image size to 640 × 640 pixels, which is roughly equivalent to 7 sq km of ground area.
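A minimal sketch of such a request, assuming the `requests` library; the helper name, the output path handling, and the API-key placeholder are illustrative, while the zoom and size parameters follow the text.

```python
import requests

STATIC_MAPS_URL = "https://maps.googleapis.com/maps/api/staticmap"

def fetch_village_image(lat, lng, api_key, out_path):
    # One 640x640 satellite tile centered on the village centroid (zoom 15).
    params = {
        "center": f"{lat},{lng}",
        "zoom": 15,
        "size": "640x640",
        "maptype": "satellite",
        "key": api_key,  # replace with your own API key
    }
    resp = requests.get(STATIC_MAPS_URL, params=params)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
```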

Fig. 1 Daytime satellite image



Fig. 2 Nighttime satellite image

We experimented using single images corresponding to the village centroid, each covering about 7 sq km of ground area. For the primary run, 6000 images were used as our initial set.

3.4 Nighttime Images

The nightlight data provided by the Defense Meteorological Satellite Program's Operational Linescan System [11] has been used in the present work. A sample nighttime satellite image is given in Fig. 2. The nightlight data is available in 30 arc-second grids; the nightlight map is a two-dimensional array of intensities. The nightlight image is cropped into smaller images of 7 sq km each, centered at the geographical coordinates of the villages under consideration. Each nightlight image is indexed by its unique village id.

4 Proposed Convolutional Neural Network Architecture

The Convolutional Neural Network (CNN) is the most widely used and powerful technique for analyzing high-resolution satellite images. A motivating work successfully carried out automatic road extraction from high-resolution satellite images [12], and convolutional networks have been used on satellite images to detect and classify roads, buildings, vegetation, etc. Since features like roads and buildings, and their quantitative analysis, form the basis of overall development and are concurrent with other socio-economic features, it can be concluded that a trained CNN is fully capable of predicting socio-economic features from satellite imagery [13, 14]. As we have a large image dataset at our disposal, it was feasible to train all the fully connected layers from scratch. The complete architecture of the model used by us is described in Fig. 3. The first five convolutional layers of the VGG CNN-S architecture are taken as they are, i.e., the weights are used unchanged and are not tuned while training the model. The last three fully connected layers are removed and trained from scratch, initializing the weights using a Gaussian distribution with zero mean and a standard deviation of 0.01.
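As a rough PyTorch-style sketch of this freeze-and-retrain scheme (the authors used Caffe with VGG CNN-S; VGG16 and the layer sizes below are stand-in assumptions for illustration):

```python
import torch.nn as nn
from torchvision import models

# Sketch: keep the pretrained convolutional stack frozen and train a fresh
# regression head for the five socio-economic targets. VGG16 stands in for
# VGG CNN-S; a 640x640 input yields 20x20 feature maps after 5 poolings.
backbone = models.vgg16(weights="IMAGENET1K_V1").features
for p in backbone.parameters():
    p.requires_grad = False  # convolutional weights stay fixed

head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 20 * 20, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 5),  # regression instead of 1000-way classification
)
for m in head:
    if isinstance(m, nn.Linear):  # Gaussian init: mean 0, std 0.01
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

model = nn.Sequential(backbone, head)
```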

Fig. 3 Architecture of CNN used

Table 1 Hyperparameters used for training

| Hyperparameter | SGD with momentum | Adam optimizer |
|---|---|---|
| Learning policy | Step, with step size = 500 | Fixed |
| Learning rate | 0.000001 | 0.0001 |
| Weight decay | 0.005 | – |
| Momentum 1 | 0.8 | 0.9 |
| Momentum 2 | – | 0.999 |
| Gamma | 0.2 | – |

The pre-trained VGG CNN-S is trained on an input size of 224 × 224, but the use of higher-resolution images, with input dimensions of 640 × 640, mandates the removal and retraining of the fully connected layers. Moreover, the underlying task of the existing model is the classification of an image into one of a thousand classes, so another change was to convert this underlying task into regression of the five target outputs that we need. The weight decay was changed to 0.005. The Caffe architecture [15], a convolutional neural network framework for fast feature embedding, has been used for model specification and training. Since Caffe's stock layers target classification, the Euclidean loss, which is essentially the L2 loss layer, has been used; it is given by Eq. 1:

E = \frac{1}{2N} \sum_{n=1}^{N} \left\| \hat{y}_n - y_n \right\|_2^2    (1)

where E is the Euclidean loss, N is the total number of samples, and \| \cdot \|_2 is the Euclidean norm. We have used two optimizers: the first is Stochastic Gradient Descent with momentum, introduced in [16]; the second solver is the Adam optimizer introduced in [4]. The hyperparameters used for both optimizers are presented in Table 1. All the images in the dataset are first written to a Lightning Memory-mapped Database (LMDB), and before being fed into the feedforward CNN, the mean of all images is subtracted from each image to normalize the input features. The target labels are also scaled down by a factor of 0.01, so that the range of the target vector changes from [0, 100] to [0, 1].
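In code, that normalization step amounts to something like the following NumPy sketch (array names are illustrative):

```python
import numpy as np

def normalize_batch(images, labels):
    # Subtract the dataset-wide mean image and rescale targets [0, 100] -> [0, 1].
    mean_image = images.mean(axis=0)
    return images - mean_image, labels * 0.01
```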

5 Regression for Night Data

On the nightlight data, which is essentially a two-dimensional matrix of light intensities, we applied ridge regression with the asset vector as the multidimensional target. Since the spatial distribution of light intensities at night matters less than the net amount of light in a particular village, the mean and standard deviation of the nightlight intensities of each village are used as features for training the ridge regression model.



During implementation, we carried out the following tasks (a small sketch follows the list):

• Segmenting out [13, 14] sized figures centered on the village centroids for around 6000 villages.
• Taking the mean, standard deviation, and maximum values of the light intensity of each village, and serializing this data along with the five target variables.
• Normalizing the input features, and scaling down the target labels by a factor of 100 to bring them from the range [0, 100] to [0, 1].
• Running the ridge regression algorithm on the training data, and evaluating the Mean Squared Error and Mean Absolute Error on the testing set.
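A minimal sketch of this pipeline, assuming scikit-learn; the file names, the split, and the default regularization strength are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error

# X: per-village [mean, std, max] nightlight intensities; y: 5 asset targets.
X = np.load("night_features.npy")          # hypothetical serialized features
y = np.load("asset_targets.npy") / 100.0   # scale [0, 100] -> [0, 1]

split = int(0.9 * len(X))                  # illustrative train/test split
model = Ridge(alpha=1.0).fit(X[:split], y[:split])
pred = model.predict(X[split:])
print("MSE:", mean_squared_error(y[split:], pred))
print("MAE:", mean_absolute_error(y[split:], pred))
```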

6 CNN for Night Data

A Convolutional Neural Network (CNN) is also used for the nighttime data, and it provides results similar to the ridge regression approach. Since the input images are very small, a very small CNN was used as the model: the architecture contains only one convolutional layer with 4 × 4 filters, one ReLU activation layer, a flatten layer, a dropout layer with a dropout probability of 0.3, and one fully connected output layer. The TensorFlow library was used in Python for model specification and training. The loss function chosen for this network is the Mean Squared Error (MSE), given by Eq. 2:

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2    (2)

where MSE is the Mean Squared Error, n is the total number of samples, Y_i is the real value of the i-th sample, and \hat{Y}_i is the value predicted by our model. The optimizer used to minimize this loss function is the Adam optimizer, with the parameters given in Table 2.

Table 2 Parameters for the ADAM optimizer

| Parameter | Value |
|---|---|
| No. of epochs | 10 |
| Learning rate | 0.001 |
| Momentum 1 | 0.9 |
| Momentum 2 | 0.999 |
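A minimal TensorFlow/Keras sketch of this small network follows; the input patch shape and the number of convolution filters are not stated in the text and are assumptions here.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # One 4x4 convolution + ReLU, as described; 8 filters and a 32x32
    # single-channel input are illustrative assumptions.
    layers.Conv2D(8, (4, 4), activation="relu", input_shape=(32, 32, 1)),
    layers.Flatten(),
    layers.Dropout(0.3),
    layers.Dense(5),  # fully connected output: five asset targets
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                       beta_1=0.9, beta_2=0.999),
    loss="mse",
)
# model.fit(X_train, y_train, epochs=10)  # per Table 2
```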



7 Results

First, we obtain the results separately for the nighttime and daytime data by training different models on each and evaluating them on the same testing set, i.e., the same set of villages for both nighttime and daytime. Then, we combine the predictions of the models performing best on each dataset by averaging the results obtained from the two models. The results obtained are described below.

7.1 Daytime Model

The convolutional neural network trained on the daytime dataset, with the census socio-economic vector as the target, was run for 550 iterations with two separate optimization algorithms.

7.2 SGD with Momentum

The obtained results are tabulated in Table 3. The curve for the training loss using this algorithm is given in Fig. 4. The spikes in the training loss as the epochs progress arise from the mini-batching technique used by the SGD algorithm, which updates the weights with respect to the loss obtained on each iteration of a single batch; some mini-batches happen to contain data that is unfavorable for the optimization, inducing the spikes seen in the cost function.

7.2.1 ADAM Optimizer

The obtained results are tabulated in Table 4. The curve for the training loss using this algorithm is given in Fig. 5.

Table 3 Results on daytime data using SGD with momentum

| Result parameter | Value |
|---|---|
| Iterations | 550 |
| Euclidean loss (Training set) | 1.1 |
| Euclidean loss (Test set) | 0.088 |
| Mean absolute error (Test set) | 0.236 |



Fig. 4 Training loss for SGD with momentum

Table 4 Results on daytime data using ADAM optimizer

| Result parameter | Value |
|---|---|
| Iterations | 550 |
| Euclidean loss (Training set) | 0.8 |
| Euclidean loss (Test set) | 0.062 |
| Mean absolute error (Test set) | 0.168 |

Fig. 5 Training loss for ADAM optimizer



Table 5 Results on night data using ridge regression

| Result parameter | Value |
|---|---|
| Euclidean loss (Training set) | 0.0353 |
| Mean absolute error (Training set) | 0.1314 |
| Euclidean loss (Test set) | 0.0342 |
| Mean absolute error (Test set) | 0.1289 |

Table 6 Results on night data using convolutional neural network

| Result parameter | Value |
|---|---|
| Euclidean loss (Training set) | 0.0282 |
| Mean absolute error (Training set) | 0.1264 |
| Euclidean loss (Test set) | 0.0300 |
| Mean absolute error (Test set) | 0.1228 |

7.3 Nighttime Model

The nighttime models were trained on a training set of 5400 nightlight intensity images using two different models; the results obtained from them are given here.

7.3.1 Ridge Regression

The obtained results are tabulated in Table 5.

7.3.2 Convolutional Neural Network

The obtained results are tabulated in Table 6.

7.4 Final Combined Result

Taking a weighted average of the daytime and nighttime models based on their Euclidean loss values, the results of Table 7 are obtained.

Table 7 Combined results on day and nighttime data

| Result parameter | Value |
|---|---|
| Euclidean loss (Half test set) | 0.029 |
| Mean absolute error (Half test set) | 0.120 |
| Euclidean loss (Full test set) | 0.029 |
| Mean absolute error (Full test set) | 0.119 |
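The exact weighting scheme is not spelled out; one plausible reading (inverse-loss weighting) can be sketched as follows, with all names illustrative:

```python
def combine_predictions(pred_day, pred_night, loss_day, loss_night):
    # Weight each model inversely to its Euclidean test loss (assumption;
    # the text only says the weights are "based on" the loss values).
    w_day, w_night = 1.0 / loss_day, 1.0 / loss_night
    return (w_day * pred_day + w_night * pred_night) / (w_day + w_night)
```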

8 Conclusion

In today's era, it is very difficult to find reliable, high-frequency data; the available census is also error-prone and expensive. In this paper, a machine learning approach is therefore presented that works from satellite images, both nighttime and daytime, obtained in near real time. The presented approach can be a useful tool for addressing various development issues in India and other developing countries, such as poverty, health, agriculture, sanitation, and other resource management.

References

1. L. Bhandari, K. Roy Chowdhury, Night lights and economic activity in India: a study using DMSP-OLS night time images. Proc. Asia-Pac. Adv. Netw. 32, 218–236 (2011). https://doi.org/10.7125/apan.32.24. ISSN 2227-3026
2. P.K. Suraj, A. Gupta, M. Sharma, S.B. Paul, S. Banerjee, On monitoring development indicators using high-resolution satellite images (2018). arXiv:1712.02282v3
3. C.D. Elvidge, P.C. Sutton, T. Ghosh, B.T. Tuttle, K.E. Baugh, B. Bhaduri, E. Bright, A global poverty map derived from satellite data. Comput. Geosci. 35(8), 1652–1660 (2009)
4. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization (2014). arXiv:1412.6980
5. A. Albert, J. Kaur, M.C. Gonzalez, Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD (2017), pp. 1357–1366
6. N. Jean, M. Burke, M. Xie, W.M. Davis, D.B. Lobell, S. Ermon, Combining satellite imagery and machine learning to predict poverty. Science 353(6301), 790–794 (2016)
7. The Ministry of Home Affairs, Government of India, Census Data. http://www.censusindia.gov.in/2011-Common/CensusData2011.html
8. P.C. Mahalanobis, On the generalised distance in statistics. Proc. National Inst. Sci. India 2(1), 49–55 (1936). Retrieved 27 Sept 2016
9. The Ministry of Science and Technology, Government of India, Survey of India (2017). http://www.surveyofindia.gov.in
10. Google Static Maps API. https://developers.google.com/maps/documentation/staticmaps/
11. NOAA/NGDC Earth Observation Group, National Geophysical Data Center, Version DMSP-OLS Nighttime Lights Time Series (2013)
12. A.V. Buslaev, S.S. Seferbekov, V.I. Iglovikov, Fully convolutional network for automatic road extraction from satellite imagery (2018). arXiv:1806.05182
13. V. Iglovikov, S. Mushinskiy, O. Vladimir, Satellite imagery feature detection using deep convolutional neural network: a Kaggle competition (2017). arXiv:1706.06169
14. K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: delving deep into convolutional nets, in British Machine Vision Conference (2014)



15. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding. CoRR abs/1408.5093 (2014). http://arxiv.org/abs/1408.5093
16. A. Ng, J. Ngiam, C.Y. Foo, Y. Mai, C. Suen, UFLDL Tutorial (2017). http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Semantic Space Autoencoder for Cross-Modal Data Retrieval Shaily Malik and Poonam Bansal

Abstract The primary aim of cross-modal retrieval is to enable the user to retrieve data across different modalities in a flexible manner. In this paper, we tackle the problem of retrieving data across modalities, where the input is given in one form and relevant data of another type is retrieved as output, as required by the user. Most techniques used so far have not considered the preservation of feature and semantic information and, as a result of this neglect, have not obtained effective results. We propose a two-stage learning method that projects multimodal data onto low-dimensional embeddings preserving both feature and semantic information, which enabled us to obtain satisfactory results. In this paper, we propose an autoencoder for cross-modal retrieval that can process both visual and textual data based on their semantic similarity.

Keywords Semantic learning · Multimodal data retrieval · Neural network

1 Introduction

Most applications in today's world involve multiple modalities, such as text, images, sound, or videos, describing a variety of information. To grasp the information present in all such modalities, one must understand the relationships that exist between them. Although some techniques have already been proposed to address this problem, they fail to preserve feature and latent information and are therefore unable to produce satisfactory results. Through this work we tried to elaborate

509

510

S. Malik and P. Bansal

a model that challenges this issue by creating an effective technique to retrieve data across different modalities and produce adequate results [1, 2].

1.1 Multimodal Data In our day to day lives, we come across many applications of multimodal data where information comes from different sources such as the images, text, or speech. Usually the content of a web page is described through the text, images, or videos for exhibiting the common content, illustrating heterogeneous properties. Most of the search techniques that have been used so far are single modality based that does not fulfill the demanding requirement of information retrieval across multimodal data. So, through this paper, we try to explore the technical challenges surrounding multimodal data by accomplishing the semantic features of the data. This can be achieved through learning how the samples belonging to the same semantic category can be mapped into common space, even though in today’s world the data is generated from multimodal sources, and there should be divergent samples from different semantic groups. We also need to gain knowledge about the prejudiced features of data generated by heterogeneous multimedia sources. A minimization of discrimination loss of the two spaces: common representation space and the label space can be proposed to fulfill these needs [3–5]. At the same time, we likewise limit the separation between the portrayals of each picture content pair to lessen the cross-modular disparity. The semantic code vectors are found out which comprises both: the component data and the mark data, at that point the projections are found out by multi-modular semantic autoencoder which is utilized to ventures picture and content together to the educated code vector and the picture and content from code vector can be remade.

2 Literature Review K. Wang et al. gave an effective and hearty strategy for recovery of information from various modalities which is progressively appropriate and amazing when contrasted with the customary single-modality based strategies. They provided an overview for cross-modal retrieval and summarized a variety of representation methods which can be classified into two fundamental gatherings: (a) genuine esteemed representation learning and (b) pair wise representation learning. A few normally utilized multimodal datasets are presented, the presentation of some agent strategies on some ordinarily utilized datasets are assessed [1]. Another methodology for learning basic portrayals for the heterogeneous information is proposed. The regular portrayals scholarly can be both discriminative just as methodology invariant for cross-modular recovery. This goal was achieved by a new approach named DSCMR by reducing the discrimination loss as well as the modality invariance loss simultaneously [2]. A

Semantic Space Autoencoder for Cross-Modal Data Retrieval

511

new approach for accomplishing the task of multimodal data retrieval is discussed. In this method, the type of data modality mappings of the cross-modal retrieval is learnt so that data from different sources can be projected to embeddings in such a way that the original extracted feature information and the semantic information in both modalities would be preserved [6]. J. Gu1 et al. said that in the first place, get familiar with the component mindful 405 semantic code vectors which join the data from both element spaces and the name spaces. Afterwards, encoder-decoder worldview is utilized to learn projections which venture the picture and content to the semantic code vector and recoup the first highlights from the semantic code vector [7]. The authors gave a new paradigm for accomplishing the task of cross-modal retrieval. In this given method, modalitybased projections are learnt so that data from these modalities can be projected to embeddings that would preserve the semantic information and the original feature information in both modalities. We from the outset become familiar with the 405 component semantic code vectors which have the consolidated data from the name space and the element spaces. An encoder-decoder model is used to learn the projections and mappings which are further used to project the available textual feature and image captions to the semantic code vector, and then the semantic code vector is used to reconstruct the initial features [8]. We learn effective methods for extraction of features, creation of shared subspaces considering the significant level of semantic data and how to optimize them [9, 10], how to extract features from hand-drawn images [11]. The neural system learns a multi-modular inserting space for pieces of pictures and sentences and reasons about their idle, between the modular arrangements. It is shown that the combination of CNN and RNN [12], CNN visual features [13], Cross-Modal Generative Adversarial Networks (CM-GANs) [14, 15] can without much of a stretch accomplish predominant outcomes contrasted and utilizing conventional visual highlights.

3 Cross-Modal Retrieval In this paper, we have proposed an auto encoder for cross-model retrieval that can process both visual as well as textual data. The autoencoder can convert textual data to image and vice versa, and can be used to find similar images or text. This can be used to address machine translation problems, recommendation systems, image denoising, and dimensionality reduction for data visualization and in many other fields. For this work we have used flickr8k. The image dataset consists of 8092 images and the text dataset in json format consists of corresponding captions to the image dataset. This work is basically divided into four parts: Image to Text Conversion, Text to Text conversion, Image to Image Conversion, and Text to Image Conversion. A. Image to Text Image to Text conversion can be achieved through the process called image captioning which is done into two parts: First component is an image encoder that takes the image

512

S. Malik and P. Bansal

as input and converts it into representations that are meaningful to do captioning. A deep convolution neural network is used as image encoder. Second component is the caption decoder which takes the image representations as input and gives the descriptions as output. GRU is used for caption decoding. In this work, we have used the pre-final layer activations of an already existing image classifier, i.e., inception network. This is to avoid training image encoder from the beginning. The representations from the inception network are fed into the Recurrent Neural Network (RNN). We train the decoder and check the performance by generating the captions for random images from the training and testing datasets. B. Text to Text The functionality of text to text generation is build [16] by the representations developed by the network while captioning the images. In this part of the work, we need to feed the words to the network in such a format that it can act as the input to the network. So we begin by randomly created word embeddings [17] and try exploring what the network learnt about words when training was done. Since visualization of large dimensions is not possible so we have used a technique known as T-SNE which helps in reduction of number of dimensions without leading to any change in the neighbors while converting from high to low dimensional space. 100-dimensional representation is taken and the cosine similarity to all the other words present in the data is calculated. C. Image to Image To find the similar images to the image given as input we have applied the same technique of T-SNE for visualizing the nearest neighbors of the image given as input. We find the image representations of each image [18] and store the representations corresponding to each in a text file. This part of the work aims at providing the functionality of searching the most similar image to the image that the user provides as input [19]. We first take the representation of the image provided by the user and apply cosine similarity to find the closest image in the data. D. Text to Image Text to image conversion is achieved by developing the functionality of searching images via captions [20]. For this we perform the reverse of what we did for generating caption for an image. As the first step, we start with completely random 300dimensional tensor as input rather than 300-dimensional representation of image coming from an encoder. In the next step, all layers of the network are frozen, i.e., PyTorch does not calculate the gradients. Assuming that randomly generated input tensor comes out of the image encoder we feed it into caption decoder. Then the caption being generated by network is taken at the time that arbitrary input was given and is compared with the user-given caption. The loss is calculated by comparing the network-generated and user-provided caption. The gradients for the input tensor are calculated to minimize the loss. The input tensor is changed by taking tiny step in

Semantic Space Autoencoder for Cross-Modal Data Retrieval

513

the direction given by gradients. We repeat unless we reach to convergence or until the loss reaches below a definite threshold. Then the final input tensor is taken and its value is used to find the closest images to it by applying cosine similarity.

4 Results and Discussion The model for image captioning was trained at 40 epochs and the average running loss came out to be around 2.84. Keeping the trained model as the base, the functionalities of similar text and similar image retrieval were developed. At last, for retrieval of image from text given as input, we performed the reverse of image captioning. The epochs vs loss graph was plotted to see the loss incurred while giving the text to image results. Several architectures have been tested with different combinations of dense layers with CNN ones. The resulting architecture configuration (the size and number of layers) showed the best results on cross-validation test which corresponds to the optimal usage of training data. Our tests proved that the architecture using dense layers to deal with fixed-length vectors and CNN layers for handling varied length vectors is optimal (Fig. 1). For the text to image conversion path, we achieved our objective to energize the grounded content component to produce a picture that is like the ground-truth one as appeared in Fig. 2a. Although the produced pictures are of restricted quality for complex multi-object scenes, they despite everything contain certain conceivable shapes, hues, and foundations when contrasted with the ground-truth picture and the recovered pictures. This proposes that our model can catch the complex basic language-picture relations. Cosine similarity is a metric that is used to measure how similar the two images are. It quantifies the degree of the similarity between intensity patterns in two images. Fig. 1 Epochs versus loss graph to check the loss during text to image conversion

514

S. Malik and P. Bansal

Fig. 2 Results of the Auto encoding process for the image and text modalities

In Fig. 2b, we can clearly see that the results retrieved are of high accuracy as the similar text is identified using the cosine similarity among the words. In Fig. 2c, in Image to Image retrieval it can be inferred from the test results that the system is able to match query images in different resolutions with the images in the database. It tries to identify the similar type of images from the dataset. In Fig. 2d, picture to-content recovery, where the aftereffects of recovered inscriptions just as the ground-truth subtitles. We can see that the recovered subtitles of our model can all the more likely depict the inquiry pictures.

5 Conclusion The autoencoder is aimed to retrieve relevant data using heterogeneous modalities, i.e., text vs. images. The key thought is to distinguish the quantitative similitude in single-modular subspace and afterward move them to the basic subspace to set up the semantic connections between unpaired things across modals. Experiments show that our method outperforms the state-of-the-art approaches in single or pairbased data retrieval tasks. In this paper, we follow the dataset segment and highlight

Semantic Space Autoencoder for Cross-Modal Data Retrieval

515

extraction methodologies as inception for picture encoder and Gated Recurrent Unit (GRU) for a sentence or as content decoder. Based on the study of various resources we found that GRU exhibits better performance on certain smaller datasets. We can additionally examine the impact of picture encoding model on the cross-modular element installing by supplanting the VGG19 model rather than beginning or utilizing LSTM instead of GRU, and evaluate the performance for further optimization. The future work includes the designing of more smooth algorithms to summarize the multimodal data and multimodal learning with limited and noisy annotations. We can also work on improvement of scalability on large scale data and Finer-level cross-modal semantic correlation modeling.

References 1. K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey on cross-modal retrieval, in Senior Member (IEEE, 2016) 2. L. Zhen, P. Hu, X. Wang, D. Peng, Deep supervised cross-modal retrieval, machine intelligence laboratory, in College of Computer Science (Sichuan University Chengdu, 610065, China, 2019), pp. 10394–10403 3. X. Zhai, Y. Peng, J. Xiao, Learning cross-media joint representation with sparse and semi supervised regularization. IEEE Trans. Circuits Syst. Video Technol. 24(6), 965–978 (2014) 4. Y.T. Zhuang, Y.F. Wang, F. Wu, Y. Zhang, and W.M. Lu, Supervised coupled dictionary learning with group structures for multi-modal retrieval, in AAAI Conference on Artificial Intelligence (2013) 5. C. Wang, H. Yang, C. Meinel, Deep semantic mapping for cross-modal retrieval, in International Conference on Tools with Artificial Intelligence (2015), pp. 234–241 6. Y. Wu, S. Wang, Q. Huang, Multi-modal semantic autoencoder for cross-modal retrieval. Neurocomputing (2018). https://doi.org/10.1016/j.neucom.2018.11.042 7. J. Gu1, J. Cai2, S. Joty2, L. Niu3, G. Wang, Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models (Hangzhou, China, CVPR, 2018) 8. V. Ranjan, N. Rasiwasia, C.V. Jawahar, Multi-label cross-modal retrieval, in 2015 IEEE International Conference on Computer Vision (ICCV) (Santiago, 2015), pp. 4094–4102. https://doi. org/10.1109/iccv.2015.466 9. T. Yao, T. Mei, C.-W. Ngo, Learning query and image similarities with ranking canonical correlation analysis, in International Conference on Computer Vision (2015), pp. 28–36 10. R. Socher, A. Karpathy, Q.V. Le, C.D. Manning, A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. 2, 207–218 (2014) 11. Y. Jhansi, E. Sreenivasa Reddy, Sketch based image retrieval with cosine similarity. ANU College of Engineering,Acharya Nagarjuna University, India, Int. J. Adv. Res. Comput. Sci. 8(3) (2017) 12. A. Karpathy, A. Joulin, F. Li, Deep fragment embeddings for bidirectional image sentence mapping, in Advances in Neural Information Processing Systems (2014), pp. 1889–1897 13. Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, S. Yan, Cross-modal retrieval with cnn visual features: a new baseline, in IEEE Transactions on Cybernetics, p. Preprint 14. Y. Peng, J. Qi, Y. Yuan. Cross-modal generative adversarial networks for common representation learning. TMM (2017) 15. X.-Y. Jing, R.-M. Hu, Y.-P. Zhu, S.-S. Wu, C. Liang, J.-Y. Yang, Intra view and interview supervised correlation analysis for multi-view feature learning, in AAAI Conference on Artificial Intelligence (2014), pp. 1882–1889

516

S. Malik and P. Bansal

16. J. Martinez-Gil, An overview of textual semantic similarity measures based on web intelligence. https://doi.org/10.1007/s10462-012-9349-8 17. Y. Lu, Z. Lai, X. Li, Fellow, IEEE, D. Zhang, Fellow, IEEE, W. KeungWong, C. Yuan, Learning Parts-Based and Global Representation for Image Classification. https://doi.org/10.1109/tcsvt. 2017.2749980 18. Y. Gong, Q. Ke, M. Isard, S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vision 106(2), 210–233 (2014) 19. L. Yang, V.C. Bhavsar, H. Boley, On semantic concept similarity methods, in Proceedings of International Conference on Information and Communication Technology and System (2008), pp. 4−11 20. F. Cararra, A. Esuli, T. Fagni, F. Falchi, A. Moreo, Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions (2016). arXiv:1606.07287v1[cs.IR]

A Novel Approach to Classify Cardiac Arrhythmia Using Different Machine Learning Techniques Parag Jain, C. S. Arjun Babu, Sahana Mohandoss, Nidhin Anisham, Shivakumar Gadade, A. Srinivas, and Rajasekar Mohan

Abstract The major cause of deaths around the world is cardiovascular disease. Arrhythmia is one such disease in which the heart beats in an abnormal rhythm or rate. The detection and classification of various types of cardiac arrhythmia is a challenging task for doctors. If it’s not done accurately or not done on time, the patient’s life can be at a great risk, as few arrhythmias are serious, and some can even cause potentially fatal symptoms. This paper illustrates an effective solution to help doctors in the critical diagnosis of various types of cardiac arrhythmias. To classify the type of arrhythmia, the patient might be suffering from, the solution utilizes a variety of machine learning algorithms. UCI machine learning repository dataset is used for training and testing the model. Implementing the solution can provide a much-needed early diagnosis that proves to be critical in saving many human lives.

P. Jain · C. S. Arjun Babu · S. Mohandoss · S. Gadade · R. Mohan (B) PES University, Banashankari, Bengaluru 560085, India e-mail: [email protected] P. Jain e-mail: [email protected] C. S. Arjun Babu e-mail: [email protected] S. Mohandoss e-mail: [email protected] S. Gadade e-mail: [email protected] N. Anisham The University of Texas at Dallas, Campbell Rd, Richardson 75080, USA e-mail: [email protected] A. Srinivas Dayananda Sagar University, Bengaluru 560068, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_46

517

518

P. Jain et al.

Keywords Machine learning · ECG recordings · Cardiac Arrhythmia · Ensemble methods · Hard voting · Healthcare · Feature selection

1 Introduction The heart of a healthy human being beats at a rate of 60–100 beats per minute in a periodic sinus rhythm, which is maintained by the heart’s electrical system. When there are problems with this electrical system, the heart chambers will beat in a random way or the heart will beat too fast or too slow. These conditions are collectively called as cardiac arrhythmia. The history and ECG tests are crucial in the diagnosis of the patients suspected with arrhythmias [1]. A typical electrocardiogram (ECG) tracing comprises of P wave, T wave, and QRS complex, which repeats in a sequence. A normal ECG tracing is shown in the Fig. 1. A cardiologist evaluates the ailments based on the various parameters like the shape, duration, amplitude, PR, QT, RR intervals, etc., of the waves [2]. Determining the specific type of arrhythmia is a difficult task because of the massive amount of information involved and the possibility of miscalculating the number of beats by looking at ECG. Pattern recognition of ECG by visual interpretation is prone to errors. Some arrhythmias are just slightly uncomfortable while few arrhythmias such as ventricular fibrillation are deadly [3]. Therefore, it becomes pivotal to evaluate the exact type of arrhythmia the patient is affected with. The objective of this paper is to train a machine learning system to categorize the arrhythmia dataset into one of the 16 classes. This paper makes the following specific contributions: • Offers a GUI-based framework to assist doctors in diagnosing patients who are suspected to have a cardiac arrhythmia. Fig. 1 Normal ECG tracing [17]

A Novel Approach to Classify Cardiac Arrhythmia …

519

• Predicts the type of arrhythmia which the patient might be suffering from using the ensemble of trained machine learning models. • Improvement in prediction performance over existing work done in the same field of study.

2 Literature Review In the early days, arrhythmia detection was carried out using conventional statistical methods like heart rate variability (HRV) analysis [4]. Variations in the indicators of HRV, like duration of successive RR intervals and multiple derived statistical parameters such as root mean square difference and standard deviations, point to the existence of an arrhythmia [4]. The arrhythmia dataset [5] was created and classification was proposed in [6]. They developed a new supervised inductive learning algorithm, VFI5 for the classification. A couple of machine learning algorithms have been investigated in the same classification problem [2]. It was found that feature selection using gradient boosting technique and the model trained with SVM, gave the best results comparatively. To select features Principle Component Analysis (PCA) technique was used and detection of arrhythmia was done using various SVM-based methods like Fuzzy Decision Function, Decision Directed Acyclic Graph, One Against One and One Against All in [7]. Cardiac arrhythmia diagnosis was carried out by techniques such as Fisher Score and Least Squares-SVM with Gaussian radial basis function and 2Dgrid search parameters in [8]. In [9], an arrhythmia prediction was accomplished by a combination of methods like dimensionality reduction by PCA and clustering by Bag of Visual Words on different models, based on Random Forest (RF), SVM, Logistic Regression, and kNN. The arrhythmia dataset was classified by selecting significant features using the wrapper method around RF and normalizing it in [10]. Further, it was used to implement several classifiers such as Multi-Layer Perceptron, NB, kNN, RF, and SVM.

3 The Dataset and Its Preprocessing Dataset: We use the dataset from the UCI repository [5], which contains records of 452 patients with 279 different attributes. Every record contains 4 personal details of the patient (age, weight, gender, and height) and 275 attributes derived from the ECG waves, such as amplitude, width, and vector angle, which are documented in [5]. Each record carries the conclusion of an expert cardiologist, which represents the class of arrhythmia. Class 01 indicates a normal ECG, classes 02–15 indicate various types of arrhythmias, and class 16 contains the remaining unclassified records. Preprocessing: Records with abnormal values, such as heights of 500 or 780 cm or an age of 0, were removed. The missing values represented by "?" are replaced


with the median value of that feature. WEKA [11] was used to visualize the variance of the features. Further, all the features with a standard deviation close to zero were eliminated, as they have very little effect on the final result. The preprocessing yields a clean dataset of 163 features and 420 records.
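As an illustration of the preprocessing just described, the following is a minimal pandas sketch; the file name, the missing-value marker, and the variance cut-off are assumptions for illustration (the paper itself inspected feature variance in WEKA):

```python
import pandas as pd

# Load the UCI arrhythmia data; "?" marks missing values in the raw file.
df = pd.read_csv("arrhythmia.data", header=None, na_values="?")

# Replace every missing value with the median of its feature (column).
df = df.fillna(df.median(numeric_only=True))

# Drop features whose standard deviation is close to zero, since they
# contribute very little to the final result.
stds = df.std(numeric_only=True)
df = df.drop(columns=stds[stds < 1e-3].index)  # illustrative cut-off
```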

4 System Description Supervised machine learning techniques are used to solve the classification problem. All of them are implemented in Python. We then form different models by training each of the algorithms below on the training dataset.

4.1 Naïve Bayes (NB) NB is derived from Bayes' theorem. It assumes that the value of a feature is independent of any other feature's value [12]. In NB, the predicted class is the one with the highest posterior probability. The posterior probability is given as

$$ \text{posterior probability} = \frac{\text{prior probability} \times \text{likelihood}}{\text{evidence}} \quad (1) $$

where the prior probability of a class is the ratio of the number of samples of that class to the total number of samples, and the evidence is the sum of the likelihoods of all classes. Before the likelihood of a class is calculated, the conditional probability $P(A \mid C)$ of each attribute of that class in the training sample is calculated. It is given as

$$ P(A \mid C) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \quad (2) $$

where x is the value of that attribute, σ² is the variance, and μ is the mean of all the values of that attribute. The likelihood of a class is the product of the conditional probabilities of all its attributes.
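A minimal NumPy sketch of Eqs. (1) and (2) may make the computation concrete; priors, means, and stds are assumed to be per-class statistics estimated from the training data (hypothetical names, not the paper's own code):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Eq. (2): conditional probability of attribute value x given the class,
    # using the per-class mean mu and standard deviation sigma.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def class_scores(x, priors, means, stds):
    # Eq. (1) without the evidence term, which is identical for all classes:
    # the likelihood of a class is the product over its attributes.
    return {c: priors[c] * np.prod(gaussian_pdf(x, means[c], stds[c]))
            for c in priors}  # predicted class = argmax of these scores
```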

4.2 Decision Trees (DT) DT is a classifier which follows a tree structure. We implement a DT using the ID3 algorithm, in which the attribute for splitting the data samples is chosen by information gain. The information gain, which describes how effectively a given attribute splits the training sample into the given classes, is given as

$$ \mathrm{Gain}(S, A) = E(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, E(S_v) \quad (3) $$

where S_v is the subset of S for which attribute A has value v, and the entropy E(S) is given as

$$ E(S) = -\sum_{i=1}^{c} p_i \log_2 p_i \quad (4) $$

where p_i is the proportion of S belonging to class i and |S| is the total number of samples. The data samples are split on the attribute with the highest information gain, and the process continues until the entropy becomes zero.
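A short sketch of Eqs. (3) and (4), assuming NumPy arrays of attribute values and class labels (illustrative helper names, not the paper's own code):

```python
import numpy as np

def entropy(labels):
    # Eq. (4): E(S) = -sum over classes of p_i * log2(p_i).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(values, labels):
    # Eq. (3): entropy of S minus the weighted entropy of each subset S_v.
    gain = entropy(labels)
    for v in np.unique(values):
        subset = labels[values == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain
```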

4.3 k-Nearest Neighbors (kNN) The kNN algorithm groups instances based on their similarity. kNN is a lazy learning algorithm: the class labels of all training instances are stored, and all computation is postponed until classification [13]. The predicted class is determined by the majority of the k nearest neighbors of the test instance. In this work, a k value of 3 is used. If two samples p and q each have n attributes, the Euclidean distance d(p, q) is given as

$$ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \quad (5) $$

All the training samples are sorted by their Euclidean distance to the test instance, and the nearest neighbors are determined based on k.
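A compact sketch of the kNN prediction described above (Eq. (5) plus the majority vote), assuming NumPy arrays for the training data:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Eq. (5): Euclidean distance from the query to every training sample.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]            # the k closest samples
    votes = Counter(train_y[i] for i in nearest)   # majority vote
    return votes.most_common(1)[0][0]
```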

4.4 Support Vector Machine (SVM) SVM is a technique that works by creating hyperplanes which separate the different classes in space. We use the "one-versus-all" approach: one class at a time is separated from all the remaining classes by a hyperplane, and this is repeated for every class to predict the label. The classification depends on the parameters C, gamma, and a specified kernel. The linear SVM kernel function is the dot product of the data point vectors, given as

$$ K(x_i, x_j) = x_i^{T} x_j \quad (6) $$


We have tried different kernels and different values of C and gamma; the best accuracies were obtained with the radial basis function (rbf) kernel [14] with C = 100 and gamma = 1000.
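A scikit-learn sketch of this configuration; the one-versus-all scheme is made explicit with OneVsRestClassifier, and X_train, y_train, and X_test are assumed to hold the preprocessed arrhythmia data (an illustration, not the paper's exact implementation):

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One-versus-all SVM with the rbf kernel and the settings reported above.
model = OneVsRestClassifier(SVC(kernel="rbf", C=100, gamma=1000))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```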

4.5 Voting Feature Interval (VFI) Classification in the VFI algorithms is based on majority voting over the class predictions made by each feature. A feature's prediction is based on the projections of all the training instances onto that feature, and every feature is given equal weight. This paper makes use of all five variations of the VFI algorithm [15].
VFI1 constructs feature intervals for all the features of each class. The votes for each class are summed over all features, and the class with the majority of votes is the predicted class.
VFI2 differs from VFI1 in finding the lower bounds of the intervals: the endpoints are selected as midpoints instead of the lower bounds.
VFI3 is again a modification of VFI1, in determining the class counts; this is done to account for the three lower-bound types of the range intervals.
VFI4 is similar to VFI3, but if the highest and lowest points of a feature are the same for a class, a point interval is constructed instead of a range interval.
VFI5 is similar to VFI4; however, it constructs point intervals for all endpoints, and range intervals for the values between distinct endpoints, excluding the endpoints.

4.6 Ensemble Method The ensemble method takes multiple models and combines them to produce an aggregate model which performs better than any of its individual models. We use a hard voting ensemble method, which makes use of all the above algorithms to classify an unknown data record: each algorithm predicts one of the 16 class labels, and the class predicted most often is taken as the final prediction. A GUI is used to display the result.
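A minimal sketch of the hard vote itself, assuming scikit-learn-style trained models with a predict() method:

```python
from collections import Counter

def hard_vote(models, record):
    # Every trained model predicts one of the 16 class labels; the label
    # predicted most often is the ensemble's final prediction.
    predictions = [m.predict([record])[0] for m in models]
    return Counter(predictions).most_common(1)[0][0]
```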


5 Results and Discussions 5.1 k-Fold Cross Validation If a dataset has a very low ratio of data records to features, the accuracy estimates vary considerably across different partitions of training and testing data. To mitigate this, we perform k-fold cross validation, where the original sample is randomly partitioned into k equal subsamples. One subsample is used as validation data for testing the model, and the remaining k-1 subsamples are used as training data. The procedure is repeated k times, until each of the k subsamples has been used once as validation data. The above models perform at their best with 15-fold cross validation. The architectural design of the system used in this paper is shown in Fig. 2.
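For illustration, 15-fold cross validation of one model with scikit-learn; model, X, and y are assumed to be a trained classifier and the preprocessed dataset:

```python
from sklearn.model_selection import cross_val_score

# 15-fold cross validation, as used above.
scores = cross_val_score(model, X, y, cv=15)
print(scores.mean(), scores.std())
```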

5.2 Performance Analysis We use accuracy as the performance indicator, calculated as the ratio of the number of correct predictions to the number of evaluated records. Fig. 3 illustrates the accuracy percentage of the several classifiers for k = 15. To summarize the figure:
• The low accuracy of the NB algorithm is due to the fact that every feature is assumed to be independent of the others, so the interdependence of the features is not taken into account.

Fig. 2 Architecture design for arrhythmia classification


Fig. 3 Accuracy percentage of various classifiers

• The DT algorithm overfits the data. This causes incorrect predictions and lowers the accuracy.
• The kNN algorithm is more effective when there are more neighbors; hence, the accuracy could be improved further with a larger dataset.
• The SVM algorithm gave the highest accuracy, 68.33%. SVM supports different kernels, which can be used to create nonlinear hyperplanes between the classes, increasing the accuracy of the model.
• The VFI algorithms also assume feature independence; their accuracies were better than NB's, but the training time increased.

5.3 Arrhythmia Classification The hard voting ensemble method predicts with an accuracy of 90.71%, a significant improvement in prediction performance. Individually, each model is prone to different kinds of errors, like variance, noise, and bias on the dataset [16], which can result in average performance because each individual model may over-fit different parts of the dataset. If the models are reasonably diverse, informed, and independent, the risk of over-fitting is reduced, as their individual mistakes average out when the predictions are merged. The outcomes consequently tend to be substantially better. The core intuition is to build a "strong learner" from a group of "weak learners". Ultimately, this gives the diagnosis a concrete and reliable basis.


6 Conclusion We have provided a solution to detect the presence of cardiac arrhythmia and to classify it. The approach was to preprocess the arrhythmia dataset, use k-fold cross validation to train various machine learning models on the training set, and predict the arrhythmia class on the testing set. Preprocessing the arrhythmia dataset addressed issues like underfitting and overfitting, and k-fold cross validation performed best for a k value of 15. We trained models with the NB, DT, kNN, SVM, VFI1, VFI2, VFI3, VFI4, and VFI5 algorithms on the training set. Finally, the class of arrhythmia was predicted by the majority vote of these models using the hard voting ensemble method. The paper achieves a best-in-class accuracy of 90.71%, robust and reliable enough for doctors to base a crucial diagnosis on. Hence, in predicting the class of arrhythmia, the accuracy of the system surpasses previous models of a similar type. In the future, the execution time could be reduced by making use of methods like multithreading and batch processing.

References
1. T. Harrison, D. Kasper, S. Hauser et al., Harrison's Principles of Internal Medicine (McGraw-Hill Education, New York, 2018)
2. A. Batra, V. Jawa, Classification of arrhythmia using conjunction of machine learning algorithms and ECG diagnostic criteria. Int. J. Biol. Biomed. 2016, 1–7 (2016)
3. H. Publishing, Cardiac Arrhythmias, in Harvard Health (2020). https://www.health.harvard.edu/a_to_z/cardiac-arrhythmias-a-to-z. Accessed 11 Jan 2020
4. T. Electrophysiology, Heart rate variability. Circulation 93, 1043–1065 (1996). https://doi.org/10.1161/01.cir.93.5.1043
5. UCI Machine Learning Repository: Arrhythmia Data Set (2020). https://archive.ics.uci.edu/ml/datasets/Arrhythmia. Accessed 11 Jan 2020
6. H. Guvenir, B. Acar, G. Demiroz, A. Cekin, A supervised machine learning algorithm for arrhythmia analysis. Comput. Cardiol. (1997). https://doi.org/10.1109/cic.1997.647926
7. N. Kohli, N. Verma, Arrhythmia classification using SVM with selected features. Int. J. Eng. Sci. Technol. (2012). https://doi.org/10.4314/ijest.v3i8.10
8. E. Yılmaz, An expert system based on Fisher score and LS-SVM for cardiac arrhythmia diagnosis. Comput. Math. Methods Med. 1–6 (2013). https://doi.org/10.1155/2013/849674
9. P. Shimpi, S. Shah, M. Shroff, A. Godbole, A machine learning approach for the classification of cardiac arrhythmia. Int. Conf. Comput. Methodol. Commun. (ICCMC) 2017, 603–607 (2017). https://doi.org/10.1109/iccmc.2017.8282537
10. A. Mustaqeem, S.M. Anwar, M. Majid, A.R. Khan, Wrapper method for feature selection to classify cardiac arrhythmia, in 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2017), pp. 3656–3659. https://doi.org/10.1109/embc.2017.8037650
11. W. Badr, Getting started with Weka 3 - machine learning on GUI, in Medium (2019). https://towardsdatascience.com/getting-started-with-weka-3-machine-learning-on-gui-7c58ab684513. Accessed 12 Jan 2020
12. P. Joshi, Artificial Intelligence with Python (Packt Publishing Ltd., Birmingham, UK, 2017)


13. S. Karimifard, A. Ahmadian, M. Khoshnevisan, M.S. Nambakhsh, Morphological heart arrhythmia detection using Hermitian basis functions and kNN classifier. Int. Conf. IEEE Eng. Med. Biol. Soc. 2006, 1367–1370 (2006). https://doi.org/10.1109/iembs.2006.260182
14. A. Alexandridis, E. Chondrodima, N. Giannopoulos, H. Sarimveis, A fast and efficient method for training categorical radial basis function networks. IEEE Trans. Neural Netw. Learn. Syst. 28, 2831–2836 (2017). https://doi.org/10.1109/tnnls.2016.2598722
15. G. Demiröz, Non-Incremental Classification Learning Algorithms Based on Voting Feature Intervals (M.Sc. thesis, Bilkent University, 1997)
16. R.R.F. DeFilippi, Boosting, bagging, and stacking: ensemble methods with sklearn and mlens, in Medium (2018). https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de. Accessed 12 Jan 2020
17. Sinus rhythm, in En.wikipedia.org (2020). https://en.wikipedia.org/wiki/Sinus_rhythm. Accessed 11 Jan 2020

Offline Handwritten Mathematical Expression Evaluator Using Convolutional Neural Network Amit Choudhary, Savita Ahlawat, Harsh Gupta, Aniruddha Bhandari, Ankur Dhall, and Manish Kumar

Abstract Recognition of Offline Handwritten Mathematical Expressions (HME) is a complicated task in the field of computer vision. The method proposed in this paper follows three steps: segmentation, recognition, and evaluation of the HME image (which may include multiple mathematical expressions and linear equations). The segmentation of symbols from the image incorporates a novel pre-contour filtration technique to remove distortions from segmented symbols. Recognition of the segmented symbols is then done using a Convolutional Neural Network trained on an augmented dataset prepared from EMNIST and a custom-built dataset, giving an accuracy of 97% in recognizing the symbols correctly. Finally, the expressions/equations are evaluated by tokenizing them, converting them into postfix expressions, and solving them with a custom-built parser. Keywords Offline HME image · Symbol segmentation · Multiple expressions · Linear equation · Augmented dataset · CNN

A. Choudhary Department of Computer Science, Maharaja Surajmal Institute, New Delhi, India e-mail: [email protected] S. Ahlawat (B) · H. Gupta · A. Bhandari · A. Dhall · M. Kumar Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, New Delhi, India e-mail: [email protected] H. Gupta e-mail: [email protected] A. Bhandari e-mail: [email protected] A. Dhall e-mail: [email protected] M. Kumar e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_47


1 Introduction Digitization of offline work is on the rise to increase the longevity of documents, many of which contain mathematical expressions. Since mathematics is almost entirely subsumed in its expressions, it is imperative to digitize these expressions properly to maintain consistency in the digital documents [1]. However, recognition of handwritten mathematical expressions is a difficult task and the topic of many ongoing and concluded research works [2]. Handwritten mathematical expression recognition is of two types: online and offline. The former consists of recognizing the characters by their strokes while they are being written on a tablet or smartphone, while the latter consists of recognizing the characters from an image of a handwritten document. This research paper is entirely dedicated to the digitization and evaluation of offline handwritten mathematical expressions. Since mathematics itself is a very wide field, digitizing and evaluating all mathematical symbols is a very complex and tedious task. Therefore, only a subset of these symbols is considered in this paper: digits (0-9), arithmetic operators ('+', '–', '*', '÷'), characters ('a', 'b', 'c', 'd', 'u', 'v', 'w', 'x', 'y', 'z'), and parentheses. All of these will be referred to as symbols in the rest of this paper. The focus of this research is on the segmentation and recognition of multiple arithmetic and linear mathematical expressions from a single image, followed by the evaluation of the successfully recognized expressions. The original contributions and approaches of this paper are outlined in the following paragraphs. Images comprising single or multiple mathematical expressions, containing either entirely arithmetic or entirely linear expressions, are considered as input. A new approach of pre-contour filtration is introduced, where expressions are tightly cropped to remove any noise that might obfuscate the segmentation of symbols. The segmented symbols are arranged in their original order using a novel algorithmic technique. These segmented symbols are recognized using a Convolutional Neural Network (CNN), because of its state-of-the-art performance in image classification [3], which results in digitized expressions. The evaluation of these digitized expressions is performed by a purpose-built string-manipulation algorithm. The segmentation of the '=' and '÷' characters results in the detection of separate components instead of a single symbol; this problem is solved by vertically combining the components over the height of the image, resulting in a single segmented image of the symbol. Another problem faced is the ambiguity between the '×' and '*' characters because of the similarity of their handwritten versions. This is solved by considering the succeeding symbol, which should be a digit or an open parenthesis in the case of '*' and any other symbol for '×'. Finally, the recognition of offline handwritten symbols was made easier by using a shallow CNN, which was able to learn the complex relationship of the strokes constituting a symbol.


The related work is presented in Sect. 2. The proposed method of this paper is described in Sect. 3. Sections 4 and 5 present the results and the conclusion of the work done.

2 Related Work A lot of work has been done in the field of handwritten mathematical expression recognition, some of which was studied before implementing the system proposed in this paper. The system proposed in [4, 5] first normalizes the image of the HME, using a threshold value of 50px. Edge detection followed by morphological transformations is then applied, and separation of the components of the image is considered. Features like skew, entropy, and standard deviation were extracted to improve the accuracy of the neural network. Recognition was done with a backpropagation neural network with adaptive learning, having 10 input nodes, two hidden layers, and one output layer with 10 nodes. The proposed network achieved an accuracy of 99.91% on a training dataset of 5 × 7 pixel images. In [6], the proposed method classifies handwritten mathematical symbols. A Convolutional Neural Network (CNN) model was used, with a 5 × 5 kernel for the convolutional layer and a 2 × 2 kernel for the max-pooling layer. The sigmoid function was used for non-linearity at every layer of the network, and the log-likelihood cost function was used to check the performance. The CROHME 2014 dataset was used for training, with images resized to 32 × 32 pixels; the accuracy achieved was 87.72%. A crucial point identified in [6] was that some symbols were misclassified by the CNN because their structure was similar to that of other symbols, a problem also faced in this paper. In [7], the main objective was symbol detection in images of HME. Three modified versions of the SSD model were used along with the original SSD model for the detection and classification of mathematical symbols. There were 52,353 gray images of size 32 × 32 belonging to 106 classes of symbols; the HME dataset contained 2256 images of 552 expressions at 300 × 300 resolution, divided into three sets. The precision and class weight were calculated for each symbol class, and the maximum mAP gain (0.65) was observed in the SSD version where one convolution layer was modified and two new layers were added. In [8], the main focus was recognizing and digitizing HME. A convolutional neural network with an input shape of 45 × 45, three convolution and max-pooling layers, a fully connected layer, and an output shape of 83 was used to classify the handwritten mathematical symbols (HMS) extracted from HME. Preprocessing of the HME images included grayscale conversion, noise reduction using median blur, binarization using an adaptive threshold, and thinning to make the foreground 1 pixel thick. Segmentation was then done using a projection profiling algorithm and connected component labelling. The CNN achieved an accuracy of about 87.72% on HMS.


The approach taken in [9] is different from the other proposed systems; it mainly focuses on Chinese HME. For symbol segmentation, a decomposition into strokes is performed, and dynamic programming is used to find the paths corresponding to the best segmentation and to reduce the stroke-searching complexity. For symbol recognition, spatial geometry and directional element features are classified by a Gaussian Mixture Model learned through the Expectation-Maximization algorithm. For semantic relationship analysis, a ternary tree stores the ranked symbols after calculating their priorities. The system was tested on a dataset of 30 model expressions with a total of about 15,000 symbols; it performs well at the symbol level, but recognition of full expressions shows only 17% accuracy. The system proposed in [10] recognizes and evaluates single or grouped handwritten quadratic equations. The NIST dataset and self-prepared symbols are used for training, after preprocessing techniques such as grayscale conversion, binarization, and low-pass filtering. Horizontal compact projection analysis and combined connected component analysis methods are used for segmentation, and a CNN is applied to classify the individual characters. The system fully recognized 39.11% of the equations correctly in a set of 1000 images, whereas the character segmentation accuracy was 91.08%. The methodology proposed in [11] uses a CNN for feature extraction, a bidirectional LSTM for encoding the extracted features, and an LSTM with an attention model for generating the target LaTeX. The dataset used is CROHME, with augmentation techniques such as local and global distortion. The recognition network consists of five convolution layers and four max-pooling layers, with no dropout layer. The accuracy obtained on CROHME was 35.19%.

3 Methodology The proposed work used the EMNIST [12] and Handwritten Math Symbols [13] datasets for the mathematical expression digitizer and evaluator. Each class of operands in the dataset [12] contained images of 32 × 32 × 3 dimensions. Only 2471 images of each lower-case letter ('a', 'b', 'c', 'd', 'u', 'v', 'w', 'x', 'y' and 'z') and digit (0–9) are selected in the present work. The letters were selected on the basis of the quality of the text written in the image, the possible similarities with other necessary symbols used in the classifier, and the frequency of occurrence of different letters in different types of equations. The proposed method follows the processing steps shown in the flowchart in Fig. 1. The subsequent sub-sections elaborate on the steps followed in the present work.


Fig. 1 Proposed method

3.1 Preprocessing of the Input Image The input image is initially preprocessed to prepare for the correct segmentation of symbols; a sketch of these steps follows the list below.
Illumination: By making the brightness constant (185, a trial-and-error based value) throughout the image, small and irrelevant contours detected due to variation in natural lighting can be eliminated to a great extent, thereby improving the accuracy of evaluation.
Grayscale: Converting coloured images into grayscale reduces the processing power and time required; thresholding and edge detection algorithms also work best with grayscale images.
Gaussian Blur: A Gaussian blur with a 7 × 7 kernel worked best of all the blurring techniques tested. Blurring reduces the amount of noise in the image, which helps during processing so that only relevant features are extracted.
Threshold: A threshold value of 150px is applied to the image. It helps to separate the digits/foreground (higher pixel values) from the background (lower pixel values).
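A minimal OpenCV sketch of these preprocessing steps; the illumination-normalization step is omitted here, and the threshold polarity depends on whether the input is dark-on-light:

```python
import cv2

def preprocess(path):
    image = cv2.imread(path)                        # input expression image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # grayscale conversion
    blurred = cv2.GaussianBlur(gray, (7, 7), 0)     # 7 x 7 Gaussian blur
    _, binary = cv2.threshold(blurred, 150, 255,    # threshold at 150px
                              cv2.THRESH_BINARY)
    return binary
```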

3.2 Segmentation of Preprocessed Image Segmentation involves extracting the symbols from the image and then sorting them, in digitized form, according to their original order in the image. The methods adopted for Tight Crop, Contour Detection, Padding Contours, Extending Contours, and Sorting Contours are explained in the subsequent section. Tight Crop: To remove contours that are very small and meaningless, such as marks on the paper or simple noise, a pre-contour filtration technique is used.


The contours are initially detected using the OpenCV library function findContours() with the RETR_TREE hierarchy of contours, because it provides all the contours of the expression image. Detected contours with Contour Area < (0.002 * Total Area of the Image) are removed, eliminating the small and irrelevant contours that would otherwise have affected the overall accuracy of the expression during evaluation; the 0.002 value was chosen on a trial-and-error basis. After the removal of small contours, the expression image is tightly cropped within the minimum and maximum x and y coordinates obtained from all the detected contours.
Contour Detection: A threshold of 120px is applied to remove any remaining noise from the tightly cropped expression image obtained in the previous step. To extract the digits and operators from the resultant image, the OpenCV findContours() function is used with the RETR_EXTERNAL hierarchy, so that only the extreme outer contour containing the complete digit or operator is kept. To filter out small contours that might be sub-parts of operands and operators, all contours with Contour Area < (0.002 * Total Area of the Image) are again removed; here too the 0.002 value was finalized after running several trials and observing the error.
Padding Contours: The tightly cropped contours obtained from the input image are padded using OpenCV's copyMakeBorder() function. In the present work, each contour is padded with 40 pixels on all sides, making sure the symbol within the contour is centrally aligned for easy detection by the neural network.
Extending Contours: In the present work, contours whose x-coordinate length is at least twice their y-coordinate length are extended vertically in both directions by 0.5 times the difference between the length and breadth of the image. This also solves the problem of detecting '=' and '÷' as a single operator each. The resultant images are shown in Fig. 2a, b.
Sorting Contours: Since the expression image can contain multiple lines of mathematical expressions, it is important to sort the operators and operands in the correct order so that each expression can be solved correctly. In the present work, the sorting of the contours is performed using the following processing steps (a sketch of the filtering and sorting follows Fig. 2):
(a) Segregating each contour into the appropriate expression row according to its minimum and maximum y coordinate values.
(b) Storing the contours clubbed together from each row of a mathematical expression in separate arrays.
(c) Sorting all contours of each mathematical expression by their x coordinate values, thereby arranging the contours in their original order in the input image.
Fig. 2 a Extracted contour b Extended contour

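A sketch of the contour filtering described above, under the stated 0.002 area ratio; the sorting here covers only the simple left-to-right case for a single expression row, while the multi-row grouping described in the text is omitted:

```python
import cv2

def symbol_boxes(binary):
    # Extreme outer contours only, as with RETR_EXTERNAL above.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    min_area = 0.002 * binary.shape[0] * binary.shape[1]
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) >= min_area]     # drop tiny noise contours
    return sorted(boxes, key=lambda b: b[0])        # left-to-right order
```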


3.3 Augmenting the Dataset The Handwritten Math Symbols dataset has been used for operator images [13]. However, this dataset had only a few sample images of the division (÷) operator. Therefore, the proposed work performed augmentation of these images; the augmentation process combines various steps, including rotating the images upside down and laterally inverting them.

3.4 Preprocessing the Dataset Images The present work used only 2471 images of 45 × 45 × 3 dimensions from the EMNIST and Handwritten Math Symbols datasets as sample images [12, 13]. The sample image set contains an equal number of images from each class. On carefully observing the sample images, it was found that the writing strokes are thin and partially visible. To overcome this problem, the following preprocessing steps are applied to each image:
• Dilation: A 3 × 3 kernel is used for dilation, which smooths images that became pixelated when their size was increased.
• Threshold: A threshold value is chosen to separate the foreground (higher pixel values) from the background (lower pixel values). In the present work, a threshold of 150px is used for all symbol images except the '÷' operator, for which the threshold is 235px; the threshold values were finalized on a trial-and-error basis.
The sample images of digits, variables, and parentheses required further preprocessing, which is elaborated in the subsequent steps. Preprocessing of Digits and Variables: The following steps are applied to improve the quality of the sample images and ensure good classification by the neural network: (a) Inverting the RGB values: The RGB values of each image are inverted (i.e., subtracted from 255), since the images in the dataset [12] were white text on a black background. (b) Resizing Images: Each image is resized from 32 × 32 × 3 to 45 × 45 × 3 to match the size of the operator images.


Fig. 3 a Original image b Image after padding and resizing


Preprocessing Parentheses: The images of parentheses in the dataset [13] are very similar to the digit '1', which would result in wrong recognition of digits. To solve this problem, these images were preprocessed as follows to make them look more like parentheses:
1. Padding Images: Each image is padded with 14 white pixels at the top and bottom to increase the bulge in the centre of the parenthesis.
2. Resizing Images: Each image has a size of 45 × 73 × 3 after the padding applied in the previous step; the images are resized back to their original size of 45 × 45 × 3.
It is clearly visible from Fig. 3a, b that after the preprocessing of Steps 1 and 2, the resultant parenthesis images turned out to be more curved and bore greater resemblance to an actual handwritten parenthesis than the images present in the dataset before preprocessing.

3.5 Recognizing Symbol Using CNN In this last step, a deep neural network is created and trained on sample images. The recognized symbols are stored and passed to the next step for equation evaluation. The details are as follows: Creating and Training the Deep Neural Network A convolutional neural network was created to classify the different numbers, operators and variables. The network is made up of three convolutional layers, two fully connected layers, and two dropout layers as shown in Fig. 4.
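A Keras sketch matching the stated structure (three convolutional layers, two fully connected layers, two dropout layers, 45 × 45 input); the filter counts, dropout rates, and number of output classes are illustrative assumptions, since the exact values are carried by Fig. 4:

```python
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 26  # illustrative; the exact count follows from the symbol set

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(45, 45, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dropout(0.25),                  # first dropout layer
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                   # second dropout layer
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```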

3.6 Solving the Digitized Expression The expression obtained after the classification is a stream of characters stored in a string. Different operations are performed for arithmetic expressions and linear equations. Following are the steps for arithmetic equations:


Fig. 4 Convolutional neural network architecture

(a) Tokenizing the string: The stream of characters is converted into a list of tokens in the order they appear in the expression.
(b) Creating a parser to solve the arithmetic equations: First, a function converts the string detected by the neural network into a list of strings, each containing an operator or an operand. Then a function solves the arithmetic equation: it first determines whether to check the correctness of the expression or to solve it according to the BODMAS rule, by converting the expression into a postfix expression and then evaluating that postfix expression (a sketch of this step follows).
(c) Extracting the coefficients from the linear equations: If the expression is a set of linear equations, they are passed to a function that solves them and returns the values of the variables used in the equations.
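A minimal sketch of step (b): shunting-yard conversion to postfix followed by stack-based evaluation. Single-character operands are assumed here, and the correctness checking and variable handling of the full parser are omitted:

```python
def to_postfix(tokens):
    # Shunting-yard conversion honouring BODMAS precedence.
    prec = {"+": 1, "-": 1, "*": 2, "/": 2}
    out, ops = [], []
    for t in tokens:
        if t.isdigit():
            out.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()                      # discard the "("
        else:
            while ops and ops[-1] != "(" and prec[ops[-1]] >= prec[t]:
                out.append(ops.pop())
            ops.append(t)
    return out + ops[::-1]

def eval_postfix(postfix):
    stack = []
    for t in postfix:
        if t.isdigit():
            stack.append(float(t))
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"+": a + b, "-": a - b, "*": a * b, "/": a / b}[t])
    return stack[0]

# e.g. eval_postfix(to_postfix(["2", "*", "(", "3", "+", "4", ")"])) -> 14.0
```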

4 Experimental Result The experimental results are shown in Table 1, which shows the result of the k-fold cross-validated approach used to train the convolutional neural network for better results and lower overfitting.


Table 1 Accuracy after each fold of CNN

Cross-validation fold | Training accuracy (max) | Training loss (min) | Validation accuracy (max) | Validation loss (min)
1  | 0.9639 | 0.1115 | 0.9786 | 0.0635
2  | 0.9787 | 0.0686 | 0.9894 | 0.0260
3  | 0.9882 | 0.0371 | 0.9955 | 0.0144
4  | 0.9908 | 0.0287 | 0.9968 | 0.0108
5  | 0.9904 | 0.0291 | 0.9982 | 0.0064
6  | 0.9946 | 0.0162 | 0.9989 | 0.0028
7  | 0.9955 | 0.0150 | 0.9993 | 0.0019
8  | 0.9948 | 0.0177 | 0.9991 | 0.0042
9  | 0.9962 | 0.0137 | 0.9993 | 0.0015
10 | 0.9966 | 0.0111 | 0.9998 | 0.0006

The problem of detecting '=' and '÷' was overcome by our proposed approach of sorting the contours according to x coordinates, which is similar to the approach used in [10]. The ambiguity between the '*' and '×' symbols was solved by checking whether the succeeding token in the string is a digit or '(' (in the case of '*') or any other symbol (in the case of '×'). The augmentation and preprocessing of the training images used for the customized convolutional neural network helped to improve the classification of letters and parentheses. The proposed system was also tested on various self-shot images with different kinds of expressions and equations, and it recognized and evaluated them successfully.

5 Conclusion and Future Work This paper focused on the segmentation, recognition, and evaluation of offline handwritten mathematical expressions. Only arithmetic and linear mathematical expressions were considered. A pre-contour filtration technique was suggested to remove distortions from segmented symbols, which reduced the noise in the images to a great extent. The sorting technique designed was able to preserve the equation order for any number of expressions in the image. The customized convolutional neural network gave a convincing result, with an accuracy of 97% in recognizing the segmented symbols. Finally, correct evaluation was achieved for the tested expressions. In future work, the segmentation technique needs to be further improved, because sub-parts of symbols are sometimes detected as separate contours, leading to erroneous detection. The proposed method can be extended to quadratic equations by employing the same segmentation and recognition techniques. Also, the dataset


can be expanded to include the remaining alphabets, thus increasing the domain of the system.

References
1. A.M. Awal, H. Mouchère, C.V. Gaudin, Towards handwritten mathematical expression recognition, in 10th International Conference on Document Analysis and Recognition (2009), pp. 1046–1050
2. C. Lu, K. Mohan, Recognition of Online Handwritten Mathematical Expressions Using Convolutional Neural Networks (2015), pp. 1–7
3. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Neural Inf. Process. Syst. 1(25), 1097–1105 (2012)
4. S. Shinde, R.B. Waghulade, D.S. Bormane, A new neural network based algorithm for identifying handwritten mathematical equations, in International Conference on Trends in Electronics and Informatics ICEI (2017), pp. 204–209
5. S. Shinde, R.B. Waghulade, An improved algorithm for recognizing mathematical equations by using machine learning approach and hybrid feature extraction technique, in International Conference on Electrical, Instrumentation and Communication Engineering (ICEICE2017) (2017), pp. 1–7
6. I. Ramadhan, B. Purnama, S.A. Faraby, Convolutional neural networks applied to handwritten mathematical symbols classification, in Fourth International Conference on Information and Communication Technologies (ICoICT) (2016), pp. 1–4
7. G.S. Tran, C.K. Huynh, T.S. Le, T.P. Phan, Handwritten mathematical expression recognition using convolutional neural network, in 3rd International Conference on Control, Robotics and Cybernetics (CRC) (2018), pp. 15–19
8. L. D'Souza, M. Mascarenhas, Offline handwritten mathematical expression recognition using convolutional neural network, in International Conference on Information, Communication, Engineering and Technology (ICICET) (2018), pp. 1–3
9. Y. Hu, L. Peng, Y. Tang, On-line handwritten mathematical expression recognition method based on statistical and semantic analysis, in 11th IAPR International Workshop on Document Analysis Systems (2014), pp. 171–175
10. M.B. Hossain, F. Naznin, Y.A. Joarder, M.Z. Islam, M.J. Uddin, Recognition and solution for handwritten equation using convolutional neural network, in Joint 7th International Conference on Informatics, Electronics and Vision (ICIEV) (2018)
11. A.D. Le, M. Nakagawa, Training an end-to-end system for handwritten mathematical expression recognition by generated patterns, in 14th IAPR International Conference on Document Analysis and Recognition (2017), pp. 1056–1061
12. G. Cohen, S. Afshar, J. Tapson, A. Van Schaik, EMNIST: an extension of MNIST to handwritten letters, in International Joint Conference on Neural Networks (IJCNN) (2017), pp. 2921–2926
13. Handwritten Math Symbols Dataset [Online]. Available: https://www.kaggle.com/xainano/handwrittenmathsymbols
14. Y. Chajri, A. Maarir, B. Bouikhalene, A comparative study of handwritten mathematical symbols recognition, in International Conference Computer Graphics, Imaging and Visualization (CGIV), vol. 1(13) (2016), pp. 448–451
15. A.M. Hambal, Z. Pei, F.L. Ishabailu, Image noise reduction and filtering techniques. Int. J. Sci. Res. (IJSR) 6(3), 2033–2038 (2017)

An Empirical Study on Diabetes Mellitus Prediction Using Apriori Algorithm Md. Tanvir Islam, M. Raihan, Fahmida Farzana, Promila Ghosh, and Shakil Ahmed Shaj

Abstract Diabetes mellitus refers to a group of diseases that affect the way the human body uses sugar. Sugar plays a vital role, as it is the main source of energy for the cells that build up muscles and tissues, so any condition that disturbs the maintenance of a normal blood sugar level can create serious problems. Diabetes is one such disease: it results in abnormal blood sugar levels and can occur due to several factors like a bad diet, obesity, hypertension, increasing age, depression, etc. Diabetes can lead to cardiovascular disease, hearing impairment, and damage to the kidneys, brain, feet, skin, nerves, and eyes. Motivated by this, in this study we have tried to build rules using the association rule mining technique over various diabetes symptoms and risk factors, to predict diabetes efficiently. We have obtained 8 rules using the Apriori algorithm. Keywords Diabetes mellitus · Diabetes prediction · Machine learning · Association rule mining · Apriori algorithm

Md. Tanvir Islam · M. Raihan (B) · F. Farzana · P. Ghosh · S. Ahmed Shaj North Western University, Khulna, Bangladesh e-mail: [email protected]; [email protected]; [email protected] Md. Tanvir Islam e-mail: [email protected]; [email protected] F. Farzana e-mail: [email protected] P. Ghosh e-mail: [email protected] S. Ahmed Shaj e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_48


1 Introduction In recent days, diabetes is one of the major issues in health care, and it is spreading very fast. Generally, it appears when the level of blood sugar rises above the normal level [1]. One study reported around 12,550 people with diabetes at mature ages, with the development of type-2 diabetes (T2D) almost 2.5 times as high; additionally, each pathophysiological disease entity serves to exacerbate the other, and both hypertension and diabetes increase the chances of cardiovascular disease (CVD) and renal disease [2]. Bangladesh is one of the six nations of the International Diabetes Federation (IDF) South-East Asia (SEA) region. Around 425 million individuals worldwide have this chronic disease, 82 million of them in the SEA region; by 2045 the latter figure will rise to 151 million [3]. The motive of our analysis is to build a model that identifies diabetes accurately using a machine learning algorithm; the purpose of our study is to find the relationships between diabetes risk factors that increase the probability of developing diabetes. The rest of the manuscript is arranged as follows: Sects. 2 and 3 elaborate the related works and the methodology, respectively; Sect. 4 explains the outcome of the analysis and the novelty of this exploration; and Sect. 5 concludes the paper.

2 Related Works An algorithm proposed by Vrushali R. Balpande et al. provides the severity of diabetes in terms of a ratio interpreted as its impact; its frequent patterns are generated in step 7 of the procedure, which is faster for frequent pattern generation than Apriori [4]. Another research team combined data mining with decision trees and association rules, applying an Apriori-Genetic algorithm to 1251 different T2D cases; they tried to prove that the interaction of multiple SNPs is associated with diabetes [5]. A further team used the Apriori algorithm on a dataset containing a total of 768 instances with 8 numeric variables; their model generates optimal rules with a coverage of 74 and a confidence of 100% under the condition of low pregnancy count, normal diastolic blood pressure, and low Diabetes Pedigree Function (DPF) [6]. Similarly, a study was conducted with Bottom-up Summarization (BUS) and association rule summarization techniques, finding 10 rules [7]. Another analysis mined association rules from 1-sequence patterns, with a minimum support value of 0.001, and visualized them as a medication trajectory graph to quickly identify interesting patterns [8]. Likewise, an analysis of gestational diabetes mellitus used several algorithms, such as Iterative Dichotomiser 3 (ID3), C4.5, the T-test, and the F-test, on a dataset of 3075 instances with seven attributes, and obtained two rules; the risk factors are polyhydramnios, preeclampsia, infections, risk of operative delivery, puerperal sepsis, wound infection, and ketoacidosis [9]. Other research applied data mining to the Human Leukocyte Antigen (HLA) gene complex, whose HLA types are inherited and which encodes the Major Histocompatibility Complex (MHC) proteins in humans, to predict the associations of pleiotropic genes with Type-1 Diabetes (T1D) [10]. In a health informatics survey, 33 attributes were considered with the minimum support set to 10% and the minimum confidence to 90%; for heart disease, 25 medical risk factors were used with a minimum support of 1%, a minimum confidence of 70%, a maximum rule size of 4, and minimum lift values of 1.20 (and 2.0 for cover rules) associated with diabetes. Related research analyzed the dataset using a Frequent Pattern Tree (FP-Tree) based on association rules (AR); the achieved accuracy was 95%, while sensitivity and specificity were 97% and 96%, respectively [11]. So, it is quite clear that association rule mining techniques are very useful for predicting diabetes efficiently.

3 Methodology Our study has been performed in several steps. They are as follows:
– Data Collection
– Data Preprocessing
– Dataset Training
– Association Rule Mining
– Tools and Simulation Environment
Figure 1 shows the overall work flow of our study.

3.1 Collection of Data The dataset used for this study was collected from various diagnostic centers located in Khulna, Bangladesh. It contains 464 instances, each with 22 unique features, shown in Table 1; 48.92% of the subjects are male and 51.08% are female.


Fig. 1 Work flow of the study (flowchart: start → import collected data with 464 instances and 22 features → preprocess data with median() and trimmedMean() → apply the machine learning association algorithm → select support and confidence values → run the Apriori algorithm → determine rules → end)

Table 1 Features list

Attribute | Subcategory | Data distribution (Mean, Median)
Age | L.V. = 20 yrs, H.V. = 83 yrs | 41.65, 40.40
Gender | Male / Female | 48.92% / 51.08%
Drug history | Yes / No | 57.11% / 42.89%
Weight loss | Yes / No | 46.34% / 53.66%
Diastolic blood pressure (bp) | L.V. = 80 mmHg, H.V. = 170 mmHg | 117.8, 120
Systolic blood pressure (bp) | L.V. = 50 mmHg, H.V. = 110 mmHg | 77.67, 80
Duration of diabetes | L.V. = 0 days, H.V. = 7300 days | 713.8, 90
Height | L.V. = 138 cm, H.V. = 174 cm | 155.9, 156
Weight | L.V. = 37 kg, H.V. = 85 kg | 60.79, 60
Blood sugar before eating | L.V. = 3.07 mmol/L, H.V. = 20.6 mmol/L | 7.792, 7.205
Blood sugar after eating | L.V. = 5.8 mmol/L, H.V. = 28.09 mmol/L | 13.003, 12.760
Urine color before eating | Nil / Blue / Yellow / Orange / Green / Brick Red / Green Yellow | 46.55% / 10.13% / 9.70% / 15.73% / 16.81% / 0.43% / 0.65%
Urine color after eating | Nil / Blue / Red / Yellow / Orange / Green / Brick Red / Green Yellow | 46.98% / 6.90% / 2.80% / 9.27% / 22.20% / 7.11% / 3.23% / 1.51%
Waist | L.V. = 28 cm, H.V. = 44 cm | 35.22, 35.00
Thirst | Yes / No | 48.06% / 51.94%
Hunger | Yes / No | 44.61% / 55.39%
Relatives | Yes / No | 58.41% / 41.59%
Pain or numbness | Yes / No | 46.55% / 53.45%
Blurred vision | Yes / No | 53.45% / 46.55%

60.79,60 7.792,7.205 13.003,12.760 46.55% 10.13% 9.70% 15.73% 16.81% 0.43% 0.65% 46.98% 6.90% 2.80% 9.27% 22.20% 7.11% 3.23% 1.51% 35.22,35.00 48.06% 51.94% 44.61% 55.39% 58.41% 41.59% 46.55% 53.45% 53.45% 46.55%

3.2 Data Preprocessing To handle missing data we have used a couple of functions from R-3.5.3 namely trimmedMean() which evacuates a proportion of the highest and lowest perceptions

544

M. Tanvir Islam et al.

and afterward takes the average of the numbers that stay in the dataset [12] and median() which computes the most middle value [13].

3.3 Data Training To train our dataset we have used the percent split method which split the dataset. The dataset contains 70% training data and 30% test data.

3.4 Association Rule Mining Association Rule Mining is very useful for selecting a proper market strategy. It is a Machine Learning technique which works based on some rules [14]. Business analysts use this technique to discover the behavior of customers by finding association and correlation between the products that have been used by consumers. The results from this kind of analysis help them to know whether their existing strategy should change or not [15]. It describes the relationships among different variables or features of a large dataset. It predicts frequent if-then associations know as association rule mining [14]. There are several algorithms to implement Association Rule Mining. In this study, we have used Apriori algorithm (implemented in R-3.5.3) which has three common components to measure association as follows:

3.4.1

Support

Support of an item set A is proportion of the transactions in the database in which the item A appears is signify the popularity of an items set (Table 2). Suppor t (A) =

3.4.2 Confidence
It signifies the likelihood of item B being purchased when item A is purchased.

$$ \mathrm{Confidence}(\{A\} \rightarrow \{B\}) = \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)} $$

Suppor t (A ∪ B) Sup(A)

An Empirical Study on Diabetes Mellitus Prediction Using Apriori Algorithm Table 2 Features list second part Attribute Type of medicine

Family stroke Physical activity Classes

545

Subcategory

Data distribution Mean, Median

No Tablet Insulin Yes No Yes No Yes No

35.99% 37.07% 26.94% 42.89% 57.11% 90.30% 9.70% 65.09% 34.91%

L.V. = Lowest Value H.V. = Highest Value

3.4.3

Lift

This signifies the likelihood of an item Y being purchased when item X is purchased while taking into account the popularity of Y. Li f t ({A} → {B}) =

Suppor t (A ∪ B) Suppor t (A) × Suppor t (B)

This technique can be very slow as it gives a number of combinations. So, to speed up the process we need to follow the given steps [16]: 1. For support and confidence set a minimum value. 2. Extract the subsets having the highest value of support than the lowest threshold. 3. If confidence value of any rule is higher than the minimum threshold then selects that rule and thus select all rules from the subsets. 4. According to descending order of lift, order the rules.

3.5 Tools and Simulation Environment – R-3.5.3 – RStudio 1.1.463

546

M. Tanvir Islam et al.

3.6 R Packages Some of the important functions we have used to perform the analysis are given below [17]: subset() It returns subsets of matrices, vectors or data frame if they satisfy conditions. is.na() It helps to deal with missing data or the data that are not available (na) and basically used with if else. median() This method is used to find out the most middle value of a data series. apriori() It provides rules of association and correlations between items using the mining technique. inspect() It summarizes the pertinent alternative, statistics, and plot that should be examined. library(arulesViz) It visualizes frequent item sets and association rules. plot() It’s a generic function to plot R items or objects.

4 Outcomes The analysis gives 8 rules representing the association and correlation between several features of the dataset. The rules have been given in Table 3. Here, the features have been categorized by yes and no, for example, weightloss = no, means the patient is not losing weight, and similarly drug history = yes, means the patient has been taking medicine. And, the outcomes have been classified as two types: classes = yes (diabetes affected) and classes = no (not diabetes affected). The first rule shows the association of pain numbness with a duration of diabetes where the value of Support is 0.502, Confidence is 0.939, and Lift is 1.098. The second and third rules also show relations of weight loss and hunger with a duration of diabetes where for no weight loss and no hunger the values of support, confidence, and lift are 0.509, 0.948, 1.108 and 0.509, 0.918, 1.073, respectively. For these three rules, the range of diabetes duration is 0 to 2100 days. Rules 4 and 5 state the relation of drug history with classes and physical activity. According to rules 6 and 7 for diabetes affected patients physical activity is yes which means the patients have to do any physical activity such as walk and exercise. Similarly, patients who have both drug history and diabetes, for them the physical activity is also yes. So, if a person has diabetes, he or she has to do physical activity and if the person has both diabetes and drug history, he or she also has to do physical activity. In the same manner, for having both drug history and physical activity, the class will be diabetes (classes = yes), that is, rule number 8. The highest support, confidence, and lift are, respectively, 0.603 for rule 6, 1.00 for rule 8, and 1.536 for rule 8. Figure 2 plotted the 8 rules, where most of the rules are within the support range of 0.5–0.53, confidence range from 0.5 to 0.95, and lift range from 0.5 to 1.2. So, 5 rules out of 8 are inside of these ranges and the rest are outside.

An Empirical Study on Diabetes Mellitus Prediction Using Apriori Algorithm Table 3 Associated rules between some items Transactions Item 1 1 2 3 4

Soap Handwash Onion Potato

547

Item 2

Item 3

Handwash Soap Potato Onion

Shampoo Shampoo Burger Burger

Confidence

1

1.5 1.4 1.3 1.2 1.1

0.98 0.96 0.94 0.92 0.5

0.52

0.54

0.56

Support

0.58

0.6

Lift

Fig. 2 Scatter plot for 8 rules Fig. 3 Nominal features and their percentages for yes and no

Some nominal attributes of our dataset have been plotted in the Fig. 3 with respect to the percentage. The attributes contain mainly two types of value. One is Yes, and another is No, for example, if someone has felt pain then the value of pain for that person will be yes. Similarly, the other attributes have value as yes or no. The graph describes the total percentage of yes and no for each attribute. Here, the highest percentage of yes is found for physical activity, that is, about 90.30%, and the highest percentage of no is for family stroke history, that is, about 57.11%. Table 4 shows the differences between previous models and our newly proposed model based on the number of instances, attributes, and algorithms used for the analysis, and it also compares the outcomes of the systems (Table 5).

548

M. Tanvir Islam et al.

Table 4 Association rules Rules LHS [1] [2]

{pain_numbness = no} {weightloss = no}

[3]

{hunger = no}

[4] [5]

{drug_history = yes} {drug_history = yes}

[6]

{classes = yes}

[7]

{drug_history = yes, classes = yes} {drug_history = yes, physical_activity =yes}

[8]

RHS

Support

Confidence Lift

{duration_of_ diabetes = 0-2100} {duration_of _diabetes = 0-2100} {duration_of _diabetes = 0-2100} {classes = yes} {physical _activity = yes} {physical _activity = yes} {physical _activity = yes} {classes = yes}

0.50

0.94

1.098

0.51

0.95

1.108

0.51

0.92

1.073

0.56 0.52

0.99 0.92

1.519 1.209

0.60

0.93

1.222

0.52

0.93

1.223

0.52

1.00

1.536

Table 5 Comparison of other systems with the proposed system

Reference | Sample size | Attributes | Algorithms | Number of rules
[5] | 1251 | - | Decision trees and association rules with the Apriori-Genetic algorithm | 1
[6] | 768 | 8 | Apriori algorithm | 14
[9] | 3075 | 7 | ID3, C4.5, T-test, F-test | 2
Our proposed system | 464 | 23 | Apriori algorithm | 8

5 Conclusion Diabetes Mellitus is a regular illness that is upsetting people all over the world. So, we have played out this investigation by utilizing Machine Learning to identify diabetes precisely, and the experiment has been performed effectively with expected results. Although there were some limitations of using the Apriori algorithm, because sometimes it may need to find many rules that need huge time to compute. In this case, we have some tentative arrangements to lead the investigation with more accuracy. We have got a total of 8 rules, and the highest and lowest values are 0.603 and 0.502 for support, 1.0 and 0.917 for confidence, 1.536 and 1.073 for lift. We would like to use more popular algorithms like Frequent Pattern Tree (FP-tree), Maximum Frequent Itemset Algorithm (MAFIA), Aprioritid algorithm, Apriori Hybrid algorithm, Tertius

An Empirical Study on Diabetes Mellitus Prediction Using Apriori Algorithm

549

algorithm, etc. Finally, depending on the best execution of these investigations and calculations, we want to develop an expert system by using the results of exploration and learning.

References 1. A. Bhatia, Y. Chiu (David Chiu), Machine Learning with R Cookbook, 2nd edn. Livery Place 35 Livery Street Birmingham B3 2PB, UK.: Packt (2015). Diabetes, World Health Organization (2017). [Online]. http://www.who.int/news-room/fact-sheets/detail/diabetes. Accessed 25 Jan 2019 2. G. Govindarajan, J. Sowers, C. Stump, Hypertension and diabetes mellitus. European Cardiovascular Disease (2006) 3. IDF SEA members, The International Diabetes Federation (IDF), Online (2013). http:// www.idf.org/our-network/regions-members/south-east-asia/members/93-bangladesh.html. Accessed 01 Feb 2019 4. V. Balpande, R. Wajgi, Prediction and severity estimation of diabetes using data mining technique, in 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, India (2017), pp. 576–580 5. B. Shivakumar, S. Alby, A survey on data-mining technologies for prediction and diagnosis of diabetes, in 2014 International Conference on Intelligent Computing Applications, Coimbatore, India (2014), pp. 167–173 6. B. Patil, R. Joshi, D. Toshniwal, Association rule for classification of type-2 diabetic patients, in 2010 Second International Conference on Machine Learning and Computing, Bangalore, India (2010), pp. 330–334 7. G. Simon, P. Caraballo, T. Therneau, S. Cha, M. Castro, P. Li, Extending association rule summarization techniques to assess risk of diabetes mellitus, in IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 1, pp. 130–141 (2015). Accessed 12 Feb 2019 8. P.H. Khotimah, A. Hamasaki, M. Yoshikawa, O. Sugiyama, K. Okamoto, T. Kuroda, On association rule mining from diabetes medical history, in DEIM (2018), pp. 1–5 9. C. Raveendra, M. Thiyagarajan, P. Thulasi, S. Priya, Role of association rules in medical examination records of Gestational Diabetes Mellitus, in 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India (2017), pp. 78– 81 10. I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I. Vlahavas, I. Chouvarda, Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116 (2017) 11. W. Altaf, M. Shahbaz, A. Guergachi, Applications of association rule mining in health informatics: a survey. Artif. Intell. Rev. 47(3), 313–340 (2016). https://doi.org/10.1007/s10462016-9483-9. Accessed 17 Feb 2019 12. H. Emblem, When to use a trimmed mean. Medium (2018). [Online]. https://medium.com/ @HollyEmblem/when-to-use-a-trimmed-mean-fd6aab347e46. Accessed 05 Mar 2019 13. Median Function R Documentation (2017). [Online]. https://www.rdocumentation.org/ packages/stats/versions/3.5.2/topics/median. Accessed 10 Mar 2019 14. A. Yosola, Association rule mining - apriori algorithm. NoteWorthy-The Journal Blog (2018). [Online]. https://blog.usejournal.com/association-rule-mining-apriorialgorithm-c517f8d7c54c. Accessed 12 Mar 2019 15. A. Shah, Association rule mining with modified apriori algorithm using top down approach, in 2016 2nd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), Bangalore, India (2016), pp. 747–752

550

M. Tanvir Islam et al.

16. U. Malik, Association rule mining via apriori algorithm in Python. Stack Abuse (2018). [Online]. https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/. Accessed 16 Mar 2019 17. A. Bhatia, Yu-Wei, D. Chiu, Machine Learning with R Cookbook - Second Edition: Analyze Data and Build Predictive Models, 2nd edn. (Packt Publishing Ltd., Birmingham, 2017)

An Overview of Ultra-Wide Band Antennas for Detecting Early Stage of Breast Cancer M. K. Anooradha, A. Amir Anton Jone, Anita Jones Mary Pushpa, V. Neethu Susan, and T. Beril Lynora

Abstract This study gives an overview of ultra-wideband (UWB) antenna sensors and their use in medical applications, especially microwave radar imaging. UWB sensor-based microwave energy is used in microwave imaging to detect tumor cells in the breast. In radar imaging, the electrical changes in human tissues are analyzed from the radiation backscattered to the sensor. Tumor cells exhibit higher dielectric constants because their water content is high. The aim of this article is to provide microwave researchers with deeper insight into electromagnetic techniques for microwave imaging detectors and to explain recent developments in these techniques. To detect breast cancer in women at an early stage with comfortable and easy methods, different types of UWB antennas (novel antenna, bow-tie antenna, slot antenna, microstrip antenna, planar plate antenna, circular antenna) are surveyed. The frequency band generally used in the medical field is 3.1-10.6 GHz.

Keywords Tumor cells detection · Ultra-wide band antenna · Radar imaging techniques · Backscatter radiation · Breast cancer · Microwave imaging



1 Introduction Women are at risk of breast cancer, and the chance of developing it increases with age; the exact cause is not known. There are two types of tumor cells: benign (non-cancerous) and malignant (cancerous). Benign tumor cells do not affect the organs in the body but continue to grow to an abnormal size, whereas malignant cells invade other cells and spread to other body parts. The earlier the cancer is detected, the more curable it is and the lower the mortality. The most common way of detecting breast cancer is mammography, which screens women who have no symptoms or are at the initial stage. However, most women feel some discomfort during the process: the pressure of the testing equipment against the breast causes discomfort, particularly for young women. Ultrasound is an imaging process that sends high-frequency sound waves into the breast and converts them into images for viewing. This technique is not advised directly, but when the physician finds an abnormality in the breast during mammography, ultrasound is the best way to detect tumor cells. For women under age 30, ultrasound is preferred over mammography to evaluate lumps in the breast. Ultrasound can even find cancer cells in women with denser breasts.

2 Related Works A. Square Patch T-Slot Antenna This design uses a simple square patch antenna with a T-slot mounted in the ground plane; three rectangular slots are added to shape the surface current. The square patch sits on an FR-4 substrate with dielectric permittivity εr = 3.34 and thickness 0.794 mm, and a 50 Ω microstrip feed line excites the T-slot. It has a return loss of -10 dB. The antenna is designed with good impedance matching and exhibits a large bandwidth of 14.26 GHz; a bandwidth of 12.1 GHz is observed graphically when the test results are compared with the numerical predictions. Simulations at 3 GHz yield the far-field radiation pattern. This method of detecting tumors using an antenna is suitable for microwave thermal radiation or thermography. Its advantages are that it is a safer diagnostic and that small tumors can also be detected; however, this antenna demonstrates low spatial resolution compared with other methods [1] (Fig. 1). B. Bow-Tie Antenna This antenna is designed for a frequency range of 5-7 GHz and is used to detect tumors of different sizes. The relative permittivity εr is 50 and the conductivity σ = 0.7 S/m. It operates around the center frequency of 5.5 GHz. The metal body of the bow-tie is printed on two thin flexible dielectric sheets [2]; the total thickness of the sheets is 0.1287 mm and their relative permittivity is 4. It has a


Fig. 1 Square patch T-slot antenna (Courtesy Proposed antenna structure design)

reflection loss of -10 dB and covers a bandwidth of 500 MHz. The pulse bandwidth is less than 10% of the center frequency. This antenna can detect a spherical tumor using a narrowband pulse of the given center frequency and bandwidth. The radiation exhibited here is non-ionizing. The antenna is required to be wideband, compact, low profile, lightweight and flexible so that it can be placed directly on the breast, providing a low-risk tool (Fig. 2). C. Slot Antenna A UWB antenna is printed on a circuit board and fabricated as a slot antenna. The frequency range used is up to 6 GHz. The slot is mounted on the antenna element, and an antenna fork is fed symmetrically on the microstrip. After analyzing the current distribution, the antenna parameter S11, whose deviations fall below -10 dB, is used for cancer detection. The antenna interface is connected to a down-pulse generator and an up-pulse generator. The width of the

Fig. 2 Compact bow tie antenna used for imaging (courtesy)


Fig. 3 Length and width of the tuning fork placed on the antenna, its back end and its current distribution (Courtesy: photograph of a slot antenna and a simulation result of current distribution; a front side, b back side, c current distribution)

Gaussian monocycle pulse (GMP) [3] is determined by the delay elements when the time response is the same in both. The power supply is given to the DPG and UPG, and the antenna interface is connected to a clock, which is connected to a pattern generator and an oscillator. Cancer detection is performed by an antenna array. At the output, hemispherical plane wavefronts are obtained, which are scattered at the target; this waveform is received at the receiver antenna. The cancer is detected when the input pulse lies in the 5-6 GHz band with a width of 198 ps (Fig. 3). D. Microstrip Antenna The flexible microstrip antenna is designed for the detection of different types of tumor cells. The antenna identifies the type of tumor cells through its parameters, dielectric properties and relative permittivity, and the result is obtained in the form of return loss. A single microstrip antenna is changed to flexible substrates and ground planes because this antenna provides better shielding from stray radiation. A flexible microstrip antenna can be designed to operate in the ISM (Industrial, Scientific and Medical) band [4] and is used to identify the cells in their early phase. The feed technique used here is a microstrip feed. The antenna is designed at a frequency of 2.4 GHz (ISM band). The simulation result obtained is a return loss of -29 dB and high gain. The flexible substrate used is a Kapton polyimide substrate, which retains its dielectric property under any circumstances and is water resistant. The substrate has the following properties: dielectric constant = 3.4, thickness = 1 mil, loss tangent = 0.002. The radiation pattern is omnidirectional (Fig. 4).

Fig. 4 Microstrip antenna and its schematic representation (courtesy)

Fig. 5 Top view and bottom view (Courtesy: a top view, b bottom view)

E. Planar Plate Antenna The planar plate antenna is designed in the shape of a circular disc produced on two vertical rectangular plates. It is located on a ground plane with a length and width of 40 mm and a thickness of 0.5 mm. The antenna is fed by a vertical plate of length 5 mm and breadth 15 mm; the feeding probe is connected to the vertical plate through a slot in the ground plane. The antenna is formed from a copper plate of 0.5 mm thickness. An HP 8510C network analyzer [5] is used for the measurements. When tumor cells are present, strong scattering takes place as the microwave hits the tumor cells, and a bandwidth with at least 10 dB return loss is achieved between 3 and 8 GHz. The radiation pattern exhibited here is directional, with a gain of 8 dBi (Fig. 5). F. Rectangular Antenna This antenna is designed with a resonant frequency of 2.45 GHz and an overall size of 37.26 × 28.82 mm on an FR4 substrate. It is a rectangular inset-fed microstrip patch antenna with relative permittivity εr = 4.4, width 65.4 mm, length 88.99 mm and thickness 1.588 mm. Five different antennas are located on the skin of the breast to obtain different parameters, namely the electric and magnetic fields and the current density, of the active breast tissue. A hemispherical shape is used to model the breast phantom, composed of a skin with an outer radius of 70 mm and a thickness of 2 mm. The five antennas have radiation patterns varying from 3.34 dB to 1.6 dB. An array in a circular layout is used, in which 8 antennas are placed close to one another and separated by a circular spacing of λ/2. The energy radiated by one antenna and received by an adjacent one is


Fig. 6 Different structures of rectangular antenna

Table 1 Antenna specifications

Specifications                       Efficacy   Specifications   Efficacy
Width of the antenna [1]             20         Slot1            6
Length of the antenna [1]            25         Slot2            5
Width and length of the patch [1]    10*10      Slot3            3
Length of the feed [1]               15         Ground length    12.33

known as mutual coupling. This antenna has a good directional radiation pattern and is easily designed for Microwave Breast Imaging (MBI) [6]. The simulated result shows good impedance matching and a good radiation pattern with low mutual coupling (Fig. 6). G. Rectangular Step Slot Antenna The UWB antenna is a well distinguished element for providing higher efficiency in identifying tumor cells. The antenna proposed here is a rectangular step slot antenna, designed for better impedance matching of 50 Ω. A microstrip feed line with an offset feed from the center feeds the antenna, which is printed on an FR-4 substrate. The ultimate goal of designing a UWB antenna is to reduce its size (compact size) while enhancing performance. Several steps in the rectangular stair slot are introduced to overcome the restrictions of narrow bandwidth and poor impedance matching; increasing the number of steps between the feed and the antenna improves the matching. The lengths and widths of the steps are L1 = L2 = 2.4 mm, L3 = 2.3 mm, W1 = W2 = 0.5 mm and W3 = 1.5 mm, as shown in Table 3. The broadband performance can be obtained by resonating the antenna at numerous frequencies to increase the bandwidth (Fig. 7).

3 Conclusion Different types of antennas are surveyed for the detection of breast cancer, and their performance is compared for better results. The purpose of these antennas is to overcome the restrictions of the established solutions for detecting breast cancer at an early stage and accurately. Our main aim is to identify the


Fig. 7 Proposed rectangular step slot antenna

Table 2 Specifications of rectangular antenna

Specifications   Efficacy (mm)   Specifications   Efficacy (mm)
W [6]            65.4            L1               3.997
L [6]            88.99           L2               13.84
Wp [6]           37.26           GL               9.57
Lp [6]           28.82           GW               1
LG [6]           48.82           FL               20
W1 [6]           4               FW               3.036
W2 [6]           11.26





tumor cells in a painless method that is also harmless to the skin. The purpose is to identify tumor cells at the initial stage, before the cells mature and spread all over the breast and the woman reaches the stage of having the breast removed surgically. Despite these challenges, there is much evidence suggesting that these tumor cells may be curable if diagnosed and treated early (Table 2). We have noted that microwave radar imaging has so far failed to elicit a survival benefit; this has led to over-utilization of resources and expensive methods of false identification. With future guidelines and improved techniques, both under- and over-evaluation of a patient's disease status can be avoided (Tables 3 and 4).


Table 3 Antenna design specifications

Specifications   Units (mm)   Specifications   Units (mm)
Ws [7]           19           Ls               33
Wp [7]           3            Lp               4
Wf [7]           1.8          Lf               9
Lg [7]           6            Lr               2
Fd [7]           11.5         Wu1              0.5
Wu2 [7]          0.5          Wu3              1.5
Lu1 [7]          2.4          Lu2              2.4
Lu3 [7]          2.3





Table 4 Comparative analysis of UWB antennas

S.no   Types of antenna                Year   Advantages                                                              Disadvantages
01.    Novel antenna                   2016   Safer diagnostic; can detect even small tumors                          Measurement losses affect the cable
02.    Bow-tie antenna                 2018   Compact; lightweight; flexible                                          Cannot detect two tumors spaced less than 20 mm apart
03.    Slot antenna                    2013   Simple; can transmit high power                                         Low radiation efficiency; high cross-polarization level
04.    Microstrip antenna              2017   Water resistant                                                         Lower gain; low efficiency
05.    Planar antenna                  2010   Good radiation pattern                                                  Fabrication inaccuracies
06.    Rectangular antenna             2017   Better image enhancement                                                No proper reflection coefficient
07.    Rectangular step slot antenna   2019   Compact size, low VSWR, better return loss; high gain and directivity   NIL

References
1. A. Afyf, L. Bellarbi, N. Yaakoubi, E. Gaviot, L. Camberlein, M. Latrach, M.A. Sennouni, Novel antenna structure for early breast cancer detection, in Procedia Engineering (2016), pp. 1334-1337
2. A.K. Alqallaf, R. Deeb, Compact bow-tie antenna for the detection of multiple tumors. ICBES 140, 1-8 (2018)
3. T. Sugitani, S. Kubota, M. Hafiz, A. Toya, T. Kikkawa, A breast cancer detection system using 198 ps Gaussian monocycle pulse CMOS transmitter and UWB antenna array, in Proceedings of the 2013 International Symposium on Electromagnetic Theory (2013), pp. 372-374
4. P. Chauhan, S. Dey, S. Dhar, J.M. Rathod, Breast cancer detection using flexible microstrip antenna. Kalpa Publ. Eng. 1, 348-353 (2017)


5. R. Abd-Alhameed, C. Hwang See, I. Tamer Elfergani, A compact UWB antenna design for breast cancer detection. 6(2), 129-132 (2010)
6. K. Ouerghi, N. Fadlallah, A. Smida, R. Ghayoula, J. Fattahi, N. Boulejfen, Circular antenna array design for breast cancer detection (IEEE Access, 2017)
7. A. Amir Anton Jone, T. Anita Jones Mary, A novel compact microstrip UWB rectangular stair slot antenna for the analysis of return losses. Int. J. Innovat. Technol. Explor. Eng. 8(10) (2019)

Single Image Haze Removal Using Hybrid Filtering Method K. P. Senthilkumar and P. Sivakumar

Abstract The mist, smog, fog and haze occurring in the outdoor environment produce distortion in the image picked up by the imaging sensor, degrading the resolution and the clear scene of the acquired image. Hence, to clear the haze from the captured image, a wavelet transform method combined with a guided image filter and a globally guided image filter has been developed for better visibility and improved quality in the output image. In hybrid filtering, two different filters (GIF and GGIF) are combined as a hybrid filter to remove the haze present in the single image, and this method also conserves the finer details of the image. The wavelet technique is used to detect the edges in the input image, which gives good results compared with other techniques. The globally guided image filter (GGIF) combines an edge-conserving method and a guidance structure transfer method, which conserves small edge structures in the dehazed output image. The dehazed output image has good preservation of edge details and also gives sharper edges, which is useful in real-time transportation systems. The performance parameters obtained show better results, namely high PSNR and low MSE.

Keywords Haze removal · Single image · Hybrid filter · Wavelet transform · Edge preserving

1 Introduction The occurrence of haze in the external atmosphere is a universal and general event. It may occur due to the presence of small dust molecules, smog and water droplets, and from the light reflected from the surface region as it travels to the


Fig. 1 Phenomenon of Haze formation

observer. Due to this, images captured by the camera suffer from low contrast, faded colors and changed brightness. The phenomenon is explained in Fig. 1, in which the image reaching the observer suffers from haze and atmospheric light scattering/absorption. The single image haze removal method is used to overcome this problem, giving a better viewable scene and clearer, more effective image details. In general, haze removal or denoising clears the undesirable visual effects and hence is considered an image enhancement method. This methodology is useful in various image transform methods, computer vision methods, video-aided automobile moving systems and outdoor surveillance video systems. Denoising methods are classified into three major domains: time domain, transform domain and variation-based techniques. Several methods were developed for removing haze from a single image and used in various real-time applications. Narasimhan et al. [1] discussed several issues in recovering the image details of the outer environment from degraded grayscale images. They estimated and located depth discontinuities in the image and designed the visible scene from two images captured under different outdoor conditions. Tan [2] developed an automated model that requires a single input image. They observed that images taken in clear weather have a higher contrast value than outdoor images affected by poor climatic conditions; the other observed condition was the airlight difference. Fattal [3] prepared a scheme in which a hazy image was formed by precise image modeling with a dark surface region and a depth-scene transmission map, but it was unsuccessful for large amounts of haze. He et al. [4] developed a transparent method called the dark channel prior to remove haze from a single image. They applied it to outdoor environment images, in which the non-atmospheric regions have at least one color channel with very low intensity at some places, known as the dark region. Drews et al. [5] found a method to estimate the transmission map for submarine regions; they proposed a model known as Underwater DCP, which enables a vital enhancement over the older DCP-based methods.


Pang et al. [6] developed a haze model for a single image using the dark channel method and GIF. They utilized the guided image filter to filter the hazy image, which has a shorter running time than other existing methods. In the GIF [7] method, the transfer filter was designed using quadratic expansion, and the color attenuation prior method was used in [8]. The WGIF method [9] used a weighting function along with the guided image, which produced better results. The WLS filter [10] was a significant method compared with the edge-preserving filter, in which the running time of the smoothing filter was made significantly better than that of the method in [7]. In [11], edge preserving for multiple images was used with a standard filter, giving better results. An iterative bilateral filter [12] was proposed for fast image dehazing in real-time applications. A guided joint bilateral filter method [13] was proposed for real-time fast dehazing in single image haze removal. In [14], optimized contrast enhancement was used for image dehazing and real-time videos. The G-GIF proposed in [15] preserves the minute details of images better than WGIF and GIF. To remove halo artifacts in the dehazed image, they kept the transmission map consistent with the input haze image, and the structure of the input haze image was conserved better when the minimal color channel was used. The performance parameters showed that the proposed algorithm produced good results and sharper images compared with the other algorithms. A comprehensive survey of single image haze removal methods was given in [16]. A review of various haze removal methods was presented in [17], in which several performance parameters were analyzed. In the proposed method, a new technique called the hybrid filtering method is developed by combining the Guided Image Filter and the Globally Guided Image Filter to remove the noise from a single input image in an effective manner. The different performance parameters, namely peak signal-to-noise ratio, mean square error, accuracy, sensitivity and specificity, are also calculated. The paper is organized as follows: Sect. 2 describes the various existing filter methods, Sect. 3 explains the hybrid filtering method, Sect. 4 gives the simulation results, Sect. 5 lists the various performance parameters and Sect. 6 describes the conclusion.

2 Existing Filter Methods The different filtering methods used for haze removal in a single image are given below.


2.1 Bilateral Filter Method This filtering technique is used for smoothing input images and has an edge-preserving property, operating by combining nearby pixel values. The filter is local and very simple: gray levels or colors are fused according to their geometric proximity in both the domain and the range. It filters the edges to remove the noise present in them, but does not support noise reduction to a great extent; moreover, this filter suffers from the "gradient reversal" effect, which gives undesirable sharpening of edges.

2.2 Guided Image Filter Method This filter is also called an edge-preserving filter; like the bilateral filter, it conserves the image edges well. The filter works as a fast algorithm whose running time does not depend on the filter size and introduces no unwanted profile across the edges, hence the name edge-preserving filter. It establishes a linear relation between the input haze image and the guidance image with respect to the output, and hence runs faster than the BF. Its limitation is that it does not conserve the small edge details in the output image.

2.3 Guided Joint Bilateral Filter Method This filter method is employed to create a new mask that eliminates the heavy composition details and restores the details across the edges in the image. The filter is used to filter the atmospheric mask to obtain a better, more detailed new air envelope. The guided joint bilateral filter is used when the hazy image does not give the original edge details because the input image contains too much noise. This filter can be used in noise removal because it enforces edge details of the filtered input image that agree with the reference image.

2.4 Weighted Guided Image Filter Method Local filters have major disadvantages, such as edge-preservation problems, compared with global filters, so the WGIF was introduced to reduce this complication by using a weighting function along with the guided image filter. It conserves the sharp edges in the image like other global filters, so that halo artifacts and unwanted profiles around the edges are removed. Here a Gaussian filter was adopted to remove the artifacts, and the time needed is only O(N). In haze removal for the single


image, the various problems discussed are halo artifacts, noise amplification and color fidelity. This filter overcomes all these major problems. However, its major drawback is that it does not preserve the fine edge information in the dehazed image and over-smoothens the small edges.

2.5 Globally Guided Image Filter Method This is also known as a global filter and hence is used in many applications. It uses two filters: a structure transfer filter and an edge-conserving smoothing filter. The first filter changes the initial form of the image for the filtering operation, and the smoothing filter smoothens the dehazed image. Haze removal from a single image utilizes both the globally guided image filter and Koschmieder's law, which conserves the small edge details of the image and produces better output than older methods like GIF and WGIF.

3 Hybrid Filter Method The hybrid filter technique is developed using the two filters GIF and G-GIF. Both filtering techniques have numerous advantages and applications, among which removal of haze from a single image is very important. This new technique is used for single image haze removal in an effective manner, eliminating the limitation of GIF. The hybrid filter is developed using a structure guidance method and a fast edge-conserving technique. The process flow of the new filter method is explained in Fig. 2.

Fig. 2 Process flow of the hybrid filter method


3.1 Guided Image Filter (GIF) The guided image filter operates on an image so that edges are preserved, using a second image, called the guidance image, to improve the filtering operation. The guidance image may be the original image itself, a modified form of it, or a totally different image. When the guidance image is similar to the filtered image, its elements match the edges in the input. If the guidance image contains structures that differ from the image to be filtered, these structures are imposed on the output image. This phenomenon is known as transferring the structure to the output image.
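As an aside, the local linear model that underlies this structure transfer can be written in a few lines. The sketch below is the standard single-channel guided filter computation (He et al. [7] in the reference list), not the authors' code, and it assumes grayscale float images scaled to [0, 1]; the window radius and regularization eps are illustrative defaults.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, p, radius=8, eps=1e-3):
    """Single-channel guided filter: output q = a*I + b with locally
    constant coefficients. I is the guidance image, p the image to
    filter, both float arrays in [0, 1]."""
    size = 2 * radius + 1
    mean_I = uniform_filter(I, size)
    mean_p = uniform_filter(p, size)
    cov_Ip = uniform_filter(I * p, size) - mean_I * mean_p
    var_I = uniform_filter(I * I, size) - mean_I * mean_I
    a = cov_Ip / (var_I + eps)          # local linear coefficient
    b = mean_p - a * mean_I
    # Average the coefficients over each window, then form the output.
    return uniform_filter(a, size) * I + uniform_filter(b, size)
```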

3.2 Globally Guided Image Filter (G-GIF) The globally guided image filter is constructed from two filters: a global structure guidance filter and a global corner-conserving smoothing filter. This filter employs nearest-neighbor interpolation to down-sample the input images, which reduces the processing time of the guided filter. After the globally guided image filtering operation, the output image is up-sampled using bilinear interpolation to restore the full-resolution output.
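The down-/up-sampling speed-up described here can be wrapped around any guided-style filter. The helper below is a hedged sketch using OpenCV's resize (nearest-neighbour down, bilinear up); the filter itself is passed in as a function, and the subsampling factor s is an illustrative choice.

```python
import cv2

def fast_filter(filter_fn, I, p, s=4):
    """Run filter_fn(guidance, input) at 1/s resolution, then bilinearly
    up-sample the result back to the full resolution of I and p."""
    h, w = I.shape
    def down(img):
        return cv2.resize(img, (w // s, h // s),
                          interpolation=cv2.INTER_NEAREST)
    q_small = filter_fn(down(I), down(p))   # filter at low resolution
    return cv2.resize(q_small, (w, h), interpolation=cv2.INTER_LINEAR)
```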

3.3 Hybrid Filter Method The limitation of the GIF can be resolved by the hybrid filtering technique, combining the GIF and G-GIF filters to eliminate the haze effect from a single image. To preserve the edge detail and the sharp finer details of the input image, the edges must first be detected; the wavelet transform and inverse wavelet transform are used for this process. The procedure for haze removal from a single image using the hybrid filter technique is given in Sect. 3.4.

3.4 Process Flow
Step 1: Apply the wavelet decomposition to the hazy input image to obtain a low-frequency region and a high-frequency region.
Step 2: Apply the wavelet transform again to partition the low-frequency region into its own low- and high-frequency regions.
Step 3: Apply the GIF and G-GIF filters to obtain the reconstructed dehazed image shown in Fig. 5e.
Step 4: For the high-frequency regions: (a) estimate the local noise variance; and (b) for every pixel, (i) determine the threshold range in the image and (ii) apply the soft threshold.
Step 5: Finally, apply the inverse wavelet transform to obtain the output dehazed image (a sketch of these steps appears after the next paragraph).

The wavelet-based method partitions the hazy input image (Fig. 3a) into low- and high-frequency regions, and the hybrid filter method is then applied to the haze image and the guidance image. The GIF output is shown in Fig. 4c and the G-GIF output in Fig. 4d; the reconstructed image is shown in Fig. 5e and is then given to the inverse wavelet transform to obtain the output shown in Fig. 5f.
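The following is a minimal sketch of Steps 1, 2, 4 and 5 under stated assumptions: a two-level Haar decomposition via PyWavelets, a median-based noise estimate with the universal threshold for the soft thresholding, and a placeholder where the GIF/G-GIF processing of Step 3 would run. The wavelet family, threshold rule and function name are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
import pywt

def wavelet_denoise(img):
    """Two-level Haar DWT, soft-threshold the high-frequency sub-bands,
    then reconstruct. Assumes image dimensions divisible by 4."""
    cA, highs1 = pywt.dwt2(img, "haar")            # Step 1
    cA2, highs2 = pywt.dwt2(cA, "haar")            # Step 2
    # Step 3 would process cA2 with the GIF/G-GIF pair here (placeholder).

    def soft(bands):
        out = []
        for b in bands:
            sigma = np.median(np.abs(b)) / 0.6745          # Step 4a: noise estimate
            thr = sigma * np.sqrt(2 * np.log(b.size))      # Step 4b(i): threshold
            out.append(pywt.threshold(b, thr, mode="soft"))  # Step 4b(ii)
        return tuple(out)

    cA = pywt.idwt2((cA2, soft(highs2)), "haar")   # Step 5, level 2
    return pywt.idwt2((cA, soft(highs1)), "haar")  # Step 5, level 1
```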

4 Simulation Results The input hazy image is shown in Fig. 3a and the various output images are shown below.

Fig. 3 Results of hybrid filtering technique: a haze image, b guidance image

Fig. 4 c GIF filter output image, d G-GIF filter output image


Fig. 5 e Reconstructed image, f output image

5 Performance Parameters The following performance parameters are calculated for the proposed method. The two most important are the mean square error (MSE) and the peak signal-to-noise ratio (PSNR), described below in Eqs. (1) and (2); the other parameters, accuracy, specificity and sensitivity, are also calculated and shown in Table 1.

5.1 Mean Square Error (MSE)
The mean square error (MSE) is estimated between the input haze image and the output dehazed image. It is determined by the formula

\mathrm{MSE} = \frac{1}{p\,q} \sum_{i=0}^{p-1} \sum_{j=0}^{q-1} \bigl( I(i,j) - K(i,j) \bigr)^{2}   (1)

where p and q are the height and breadth of the haze image, and I(i, j) and K(i, j) are the output dehazed image and the haze input. A very small MSE indicates that the quality of the output image is good.

5.2 Peak Signal to Noise Ratio (PSNR)
The peak signal-to-noise ratio (PSNR) is estimated by considering the input haze image and the output dehazed image:

\mathrm{PSNR} = 10 \log_{10} \frac{\left( 2^{n} - 1 \right)^{2}}{\mathrm{MSE}}   (2)

where n is the number of bits per pixel. A very high PSNR indicates that the quality of the output image is very good.
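As a quick illustration, Eqs. (1) and (2) translate directly into a few lines of numpy. The function below assumes 8-bit images supplied as equal-sized arrays; it is a sketch, not part of the paper's implementation.

```python
import numpy as np

def mse_psnr(I, K, bits=8):
    """Eq. (1) and Eq. (2) for equal-sized images I (hazy input) and
    K (dehazed output); bits is the pixel depth n."""
    I = np.asarray(I, dtype=np.float64)
    K = np.asarray(K, dtype=np.float64)
    mse = np.mean((I - K) ** 2)                          # Eq. (1)
    psnr = 10 * np.log10(((2 ** bits - 1) ** 2) / mse)   # Eq. (2); needs mse > 0
    return mse, psnr
```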


Fig. 6 Pie Chart representing the different performance parameters

Table 1 Performance parameters and results obtained

S. no.   Performance parameters        Results
1        Peak signal-to-noise ratio    24.2
2        Mean square error             0.00063
3        Sensitivity                   89
4        Accuracy                      86
5        Specificity                   81

The results obtained by the hybrid filtering method are shown in Table 1. The performance parameters obtained are also given as a graphical chart in Fig. 6.

6 Conclusions The proposed hybrid filter produces clear images and conserves the edge details in the dehazed output image better than the other methods. The smaller details in the output dehazed image are clearer and sharper than those of the current single image haze removal methods. The simulation results show that the hybrid-filter-based haze removal technique enhances the visible quality of the image and conserves fine edge details in the output image, giving high PSNR and low MSE. In the future, this method will be applied to haze removal from real-time video, preserving the fine edge structure and producing a sharper output image. The method can be used in various applications like single image dehazing, real-time transportation systems and outdoor video surveillance systems.


References
1. S.G. Narasimhan, S.K. Nayar, Chromatic framework for vision in bad weather, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Hilton Head Island, SC (2000), pp. 598-605
2. R. Tan, Visibility in bad weather from a single image, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK (2008), pp. 1-8
3. R. Fattal, Single image de-hazing, in Proceedings of the SIGGRAPH (2008), pp. 1-9
4. K. He, J. Sun, X. Tang, Single image haze removal using dark channel prior. IEEE Trans. Pattern Anal. Mach. Intell. 33(12), 2341-2353 (2011)
5. P. Drews, E. Nascimento, F. Moraes, S. Botelho, M. Campos, Transmission estimation in underwater single images, in Proceedings of the IEEE International Conference on Computer Vision Workshops (2013), pp. 825-830
6. J. Pang, O.C. Au, Z. Guo, Improved single image dehazing using guided filter, in Proceedings of the APSIPA ASC (2011), pp. 1-4
7. K. He, J. Sun, X. Tang, Guided image filtering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397-1409 (2013)
8. Q. Zhu, J. Mai, L. Shao, A fast single image haze removal algorithm using colour attenuation prior. IEEE Trans. Image Process. 24(11), 3522-3533 (2015)
9. Z. Li, J. Zheng, Z. Zhu, W. Yao, S. Wu, Weighted guided image filtering. IEEE Trans. Image Process. 24(1), 120-129 (2015)
10. Z. Li, J. Zheng, Edge-preserving decomposition-based single image haze removal. IEEE Trans. Image Process. 24(12), 5432-5441 (2015)
11. Z. Farbman, R. Fattal, D. Lischinski, R. Szeliski, Edge-preserving decompositions for multiscale tone and detail manipulation. ACM Trans. Graph. 27(3), 67 (2008)
12. S. Kang, W. Bo, Z. Zhihui, Z. Zhiqiang, Fast single image de-hazing using iterative bilateral filter, in Proceedings of the International Conference on Computer Science (2010), pp. 1-4
13. C. Xiao, J. Gan, Fast image de-hazing using guided joint bilateral filter. Vis. Comput. 28(6-8), 713-721 (2012)
14. J.H. Kim, W.D. Jang, J.Y. Sim, C.S. Kim, Optimized contrast enhancement for real-time image and video de-hazing. J. Vis. Commun. Image Represent. 24(3), 410-425 (2013)
15. Z. Li, J. Zheng, Single image de-hazing using globally guided image filtering. IEEE Trans. Image Process. 27(1) (2018)
16. K.P. Senthilkumar, P. Sivakumar, Haze removal techniques-a comprehensive survey. Int. J. Control Theory Appl. 9(28), 365-376 (2016)
17. K.P. Senthilkumar, P. Sivakumar, A review on haze removal techniques, in Lecture Notes in Computational Vision and Biomechanics, vol. 31 (Springer, 2019), pp. 113-123

An Optimized Multilayer Outlier Detection for Internet of Things (IoT) Network as Industry 4.0 Automation and Data Exchange Adarsh Kumar and Deepak Kumar Sharma

Abstract In this work, a multilayered, multidimensional outlier detection mechanism is proposed. The multilayered system consists of ultra-lightweight, lightweight and heavyweight detection, whereas the multiple dimensions involve machine learning, Markov construction, and content- and context-based outlier detection. A threshold-based system is proposed for the ultra-lightweight and lightweight outlier detection systems. The heavyweight outlier detection system requires higher computational and communication costs for outlier detection. In simulation, it is observed that the optimal numbers of clusters required for 50, 100, 500, 1000, 2000, 3000, 4000 and 5000 node networks are 5-18, 16-17, 5-25, 33-34, 13-39, 38-39, 22-51 and 45-52, respectively.

Keywords Outlier · Inlier · Threshold · Attack · Performance · Time-series analysis

1 Introduction The integration of mobile ad hoc networks (MANETs) and the Internet of Things (IoT) opens new applications for smart environments such as automated power plants, intelligent transportation and traffic management systems, car connectivity, etc. The possibilities of wide applications for IoT systems increase with opportunities for interoperability between different types of networks in a smart environment. In smart environments like MANET-IoT systems, information exchange over different things, routing principles and protocols, clustering, cluster interaction, etc. are design issues for interoperable network construction. Due to these complex design issues, MANETs are highly vulnerable to attacks. MANET-IoT interaction characteristics


like open medium, distributed network, autonomous node distribution and participation, decentralization, etc., make such systems more complex and challenging. Thus, these systems are easily prone to attacks, and countermeasures are required that identify unauthorized and malicious activities using existing, minimal resources. In the existing infrastructure, identification of performance-based nodes helps in identifying various attacks. The performance-based process includes features of node deployment, node interaction, overall network outputs and services, etc. The majority of MANET routing protocols provide clustering and cluster-head selection mechanisms. The reusability of the clustering process for identifying active and passive nodes is studied for attack detection. Designing a MANET-IoT outlier detection scheme that is energy efficient and generates low outlier-identification traffic with high accuracy and detection rate is the major area of research in this work. In this work, a multidimensional, multilayered outlier detection architecture with increasing complexity is proposed. In this architecture, ultra-lightweight, lightweight and heavyweight outlier detection mechanisms are the proposed dimensions. In ultra-lightweight outlier detection (ULOD), outlier nodes are identified from their deployment using internal [1-7] and external [8-16] indices, without analyzing their performance. In lightweight outlier detection (LOD), outlier nodes are identified after analyzing their performance using QoS parameters. In heavyweight outlier detection (HOD), outlier nodes are identified using multiple techniques deployed at different layers of the MANET architecture. Overall, outlier nodes are identified based on their deployment, density, performance and interactions with other nodes at different layers. Unlike most conventional outlier detection mechanisms that adopt a promiscuous monitoring strategy and result in heavy outlier detection traffic, the proposed approach uses a dynamic, continuous, increasing-complexity monitoring strategy wherein nodes with high outlier probability are monitored more frequently than inlier nodes. This paper is organized as follows. A literature survey on recent outlier detection mechanisms is presented in Sect. 2. A detailed description of the increasing-complexity outlier detection approach is presented in Sect. 3. This scheme uses an internal and external indices-based ULOD approach, a QoS parameter and performance-based LOD approach, and a layer-dependent, increasing computational complexity-based HOD. In Sect. 4, simulation-based analysis is performed to measure the stability of clusters in the outlier detection process. Finally, a conclusion is drawn in Sect. 5.

2 Literature Survey In this section, various outlier detection mechanisms for MANETs are surveyed [17-28]. Li et al. [17] proposed an outlier detection mechanism demonstrating behavioral patterns for attack detection using the Dempster-Shafer theory. This scheme takes observations from multiple nodes and reflects the uncertainty and unreliability of the


observations. The proposed scheme is observed to be resilient to various attacks and stable for a distributed network. Although communication overhead is required for efficient detection, actual detection of attacks validates the efficiency of this scheme. It is further observed that the proposed scheme is better suited to a distributed network, but a single approach is not reliable for detection in dynamic networks like MANETs. Sun et al. [18] focus on intrusion detection systems for mobile ad hoc networks and wireless sensor networks. These systems perform attack detection and elimination using the outlier detection process as well. It is observed that the vast majority of methods depend on outlier-based threshold mechanisms. These threshold-based outlier systems collect data in multiple ways and perform outlier detection after applying data analytics. Karlsson et al. [19] perform wormhole attack detection using the outlier detection process. It is found that algorithms like traversal time and hop count analysis (TTHCA) and the modified transmission time-based mechanism (MTTM) are efficient for attack detection with low traffic overheads. Karlsson et al. [3] extended the TTHCA algorithm and named it traversal time per hop analysis (TTpHA). This extended version uses a threshold-based outlier process with different node radio coverage and prevailing MANET conditions. The authors claimed that the extended version is better than the base version in terms of detection performance. Yadav et al. [20] proposed the detection of a black hole attack in MANET using the outlier detection process over the ad hoc on-demand distance vector (AODV) routing protocol. This scheme is vigilant toward nodes that attract other nodes for data communication by compromising some of their secret entities. In experimental analysis, the authors observed that the proposed scheme provides simplicity, robustness and effectiveness for the AODV protocol and other existing routing mechanisms. However, this approach is another threshold-based mechanism covering performance evaluation in detail; a major challenge is deciding the ideal threshold value for outlier detection. Kumar et al. [21, 24-28] proposed attack detection using the outlier detection process after integrating a trust mechanism into MANET routing protocols. The trust mechanism includes trust score generation, trust score transmission, trust re-computation and trust regeneration. In evaluation, it is observed that the trust mechanism is resilient to various attacks. The outlier detection process helps in detecting attacks and intrusions, whereas the trust mechanism helps in finding nodes whose disconnection would protect the network from unauthorized activities. The outlier detection process is efficient if lightweight cryptography primitives are pre-integrated with low-resource device-based MANETs. Henningsen et al. [22, 23] identified the use of the term "misbehavior detection" for attack analysis in wireless networks. In this work, the authors used this terminology for identifying attacks in industrial wireless networks as an exemplary application area. This work focuses on data collected at the physical layer. The data is analyzed using machine learning techniques, and over- and under-performing data elements are traced for outlier detection. Finally, it is observed that the technique suitable for wireless communication is also beneficial for ad hoc networks in dynamic situations with high flexibility and mobility.


3 Proposed Approach This section proposes a multidimensional and multilayered approach for outlier identification and countermeasures. It starts with an explanation of the dataset collected from a hierarchical MANET and considered for the outlier detection process; the multiple dimensions of the proposed mechanism are then explained. As shown in Fig. 1, there are three major dimensions: ultra-lightweight, lightweight and heavyweight. The ultra-lightweight and lightweight outlier detection mechanisms concentrate on cost-effectiveness, especially communication and computational costs. In the heavyweight outlier detection process, the major focus is on identifying outliers without concern for the costs involved. Thus, a multilayered architecture is proposed which identifies outliers at different layers of the MANET protocol stack. The detailed process is explained in the following subsections:

Fig. 1 Proposed multidimensional multilayered outlier detection architecture


r -t 0.044240586 -Hs 8 -Hd -1 -Ni 8 -Nx 75.00 -Ny 500.00 -Nz 0.00 -Ne -4.220000 -Nl RTR -Nw --- -Ma 0 -Md ffffffff -Ms
s -t 0.044240777 -Hs 3 -Hd -1 -Ni 3 -Nx 85.00 -Ny 400.00 -Nz 0.00 -Ne -6.054000 -Nl RTR -Nw --- -Ma 1 -Md ffffffff -Ms
d -t 0.044240605 -Hs 4 -Hd -1 -Ni 4 -Nx 55.00 -Ny 670.00 -Nz 0.00 -Ne -8.770000 -Nl RTR -Nw --- -Ma 2 -Md ffffffff -Ms
r -t 0.044240887 -Hs 6 -Hd -1 -Ni 6 -Nx 35.00 -Ny 800.00 -Nz 0.00 -Ne -3.032000 -Nl RTR -Nw --- -Ma 1 -Md ffffffff -Ms

Fig. 2 Sample records in dataset generated using hierarchical MANET

3.1 Dataset Figure 2 shows an example of the entries considered for analysis in the dataset. The fields selected for analysis include the packet action (send, receive, drop, forward), packet action time, packet action location, layer involved, flags, sequence number, packet type, packet size, and flags for source and destination addresses. A sketch of a parser for these records follows.
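The flag/value records shown in Fig. 2 can be loaded with a few lines of Python. The sketch below is a hypothetical parser, not the authors' tooling; it assumes the first token is the packet action and that the remaining tokens alternate between a '-flag' and its value, with any trailing unpaired flag ignored.

```python
def parse_trace_record(line):
    """Parse one trace record into a dict: first token is the action
    (r = receive, s = send, d = drop, f = forward), then '-flag value'
    pairs such as -t (time), -Ni (node id), -Nx/-Ny/-Nz (position)."""
    tokens = line.split()
    record = {"action": tokens[0]}
    for flag, value in zip(tokens[1::2], tokens[2::2]):
        record[flag.lstrip("-")] = value
    return record

rec = parse_trace_record(
    "r -t 0.044240586 -Hs 8 -Hd -1 -Ni 8 -Nx 75.00 -Ny 500.00 -Nz 0.00 "
    "-Ne -4.220000 -Nl RTR -Nw --- -Ma 0 -Md ffffffff -Ms")
print(rec["t"], rec["Ni"], rec["Nl"])   # -> 0.044240586 8 RTR
```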

3.2 Proposed Increasing Complexity Outlier Detection Architecture In this section, MANET is divided into a set of clusters using the top-down cluster splitting mechanism. These clusters are formed in such a way that every node in a given cluster is within a limited distance and transmission range. These nodes also share a common set of properties described in the distance metric. After constructing a hierarchical network, one node per cluster is elected as cluster head which monitors, controls and provides outlier detection and other clustering services to all other cluster nodes for a predefined period of time. The detailed functionalities of this component are explained as follows.

3.2.1 MANET Clustering and Cluster Head Election

Hierarchical clustering plays an important role in resource-constrained MANETs. MANETs are extremely dynamic and unsteady in nature, which creates problems in splitting the network into clusters and in selecting cluster heads for controlling and monitoring cluster activities. The major objective of the work presented in this subsection is to reduce the packet transmission overhead at each step of the clustering process. To do so, a novel distance metric-based efficient clustering approach is utilized for hierarchical clustering and cluster-head selection. In contrast to the k-means clustering algorithm, hierarchical clustering does not require prespecifying the number of clusters. This scheme uses a distance matrix between observations for identifying similar groups. Initially, a divisive hierarchical clustering approach is used; thus, all data points fall


under a single group; thereafter, nodes with similar nature are divided into different clusters. Pseudocode 1 presents the detailed divisive clustering process; a runnable sketch follows it.

Pseudocode 1: Divisive Hierarchical Clustering Algorithm
Goal: To create clusters
1. Group all data points into a single cluster.
2. Pick each data point one by one and follow these steps:
   a. For each data point in the cluster, compute the average dissimilarity of this point from all other data points in the same cluster.
   b. If the dissimilarity is greater than a predetermined threshold, put the data point in a new cluster.
   c. Repeat Steps 2a and 2b for every data point in every cluster.
3. Count the number of data points in each cluster and pick all clusters having more than one data point.
4. Compare each cluster with the other clusters using data points and the dissimilarity score.
5. If the dissimilarity score of two data points in two different clusters is greater than zero, move the data point to another cluster.
6. Else, if the dissimilarity score is zero or less, a new cycle of restructuring the clusters starts and outliers are identified from the beginning.
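The splitting loop of Pseudocode 1 (Steps 1-2) can be sketched as below. Plain Euclidean distance over hypothetical 2-D node positions is used purely for illustration; the paper's own distance metric over trace-derived features would be substituted in practice.

```python
import numpy as np

def divisive_clusters(points, threshold):
    """Start with one cluster; repeatedly move any point whose average
    dissimilarity to the rest of its cluster exceeds `threshold` into a
    new cluster, until no point moves."""
    clusters = [list(range(len(points)))]
    changed = True
    while changed:                          # Step 2c: repeat until stable
        changed = False
        for cluster in list(clusters):
            for idx in list(cluster):
                if len(cluster) <= 1:
                    break
                # Step 2a: average dissimilarity of idx to its cluster.
                avg_d = np.mean([np.linalg.norm(points[idx] - points[j])
                                 for j in cluster if j != idx])
                if avg_d > threshold:       # Step 2b: split into a new cluster
                    cluster.remove(idx)
                    clusters.append([idx])
                    changed = True
    return clusters

# Hypothetical node positions: three nearby nodes and one far away.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0]])
print(divisive_clusters(pts, threshold=3.0))   # -> [[0, 1, 2], [3]]
```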

3.2.2 Ultra-Lightweight Outlier Detection (ULOD)

Cluster validation is a process of evaluating the goodness of a clustering algorithm [1]. According to [2-5], cluster validation methods can be categorized into three classes: internal, external and relative cluster validation. To measure the goodness of a clustering algorithm, internal indices methods use properties that are internal information from a dataset and are used in the clustering process; compactness, separation/connectedness and connectivity are reflected in their outcomes [1]. Various internal cluster validation indices are [5-9]: the Calinski-Harabasz index (CHI), density-based cluster validation (DBCV), etc. ULOD is an indices- and threshold-based lightweight outlier detection mechanism for resource-constrained devices. In order to detect outliers, internal and external indices are used for analysis; to identify n objects with efficient internal and external indices-based outlier detection factor (IEIODF) values, various internal and external indices are used [10-14]. Pseudocode 2 explains the ULOD process in detail.


Pseudocode 2: Ultra-lightweight Outlier Detection (ULOD)
1. Apply the divisive hierarchical clustering algorithm, followed by hierarchical clustering using the proposed distance metric, and construct clusters.
2. If any node falls outside the clusters, consider those nodes as outliers.
3. X_QoS_outlier_list = NULL
4. X_QoS_inlier_list = NULL
5. X_Internal_Indices_list = [DI, DBI, RMSSDI, RSI, SI, II, XBI, CHI]
6. X_External_Indices_list = [FI, NMII, PI, EI, RI, JI]
7. Count_outlier = 0
8. Count_inlier = 0
9. for item in X_Internal_Indices_list
10.   Compute item and its IEIODF threshold
11.   if item violates its IEIODF threshold then
12.     Append item to X_QoS_outlier_list
13.   else
14.     Append item string to X_QoS_inlier_list
15.   end if
16. end for
17. for item in X_External_Indices_list
18.   Compute item and its IEIODF threshold
19.   if item violates its IEIODF threshold then
20.     Append item to X_QoS_outlier_list
21.   else
22.     Append item string to X_QoS_inlier_list
23.   end if
24. end for
25. for item in X_QoS_outlier_list
26.   Count_outlier = Count_outlier + 1
27. end for
28. for item in X_QoS_inlier_list
29.   Count_inlier = Count_inlier + 1
30. end for
31. if Count_outlier > Count_inlier then
32.   for item in X_QoS_outlier_list
33.     Identify the clusters using item and declare them outliers
34.   end for
35. end if
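The index-voting idea in Pseudocode 2 can be illustrated with a few of the named indices that have direct scikit-learn counterparts (SI, CHI, DBI). The threshold values, their orientation and the majority-vote rule below are assumptions for the sketch, not the paper's calibrated IEIODF settings.

```python
import numpy as np
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def ulod_vote(X, labels, thresholds):
    """Compute internal indices for a clustering and vote them against
    per-index thresholds. thresholds maps index name to
    (threshold_value, higher_is_better)."""
    scores = {
        "SI": silhouette_score(X, labels),            # higher is better
        "CHI": calinski_harabasz_score(X, labels),    # higher is better
        "DBI": davies_bouldin_score(X, labels),       # lower is better
    }
    outlier_votes = inlier_votes = 0
    for name, value in scores.items():
        thr, higher_is_better = thresholds[name]
        violated = value < thr if higher_is_better else value > thr
        if violated:
            outlier_votes += 1
        else:
            inlier_votes += 1
    # Majority vote, mirroring steps 25-35 of Pseudocode 2.
    return outlier_votes > inlier_votes, scores

# Hypothetical data and thresholds for illustration only.
rng = np.random.default_rng(0)
X = rng.random((60, 2))
labels = (X[:, 0] > 0.5).astype(int)
print(ulod_vote(X, labels, {"SI": (0.2, True),
                            "CHI": (20.0, True),
                            "DBI": (1.5, False)}))
```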

3.2.3 Lightweight Outlier Detection (LOD)

LOD is a performance- and threshold-based lightweight outlier detection mechanism for resource-constrained devices. In order to measure performance, a QoS metric is used for analysis. To identify n objects with efficient local distance-proportionate QoS outlier factor (LDQOF) values, the throughput (TP), goodput (GP) and end-to-end delay (ETED) QoS metrics are used. A detailed explanation of the LOD process is given in Pseudocode 3.

Pseudocode 3: Lightweight Outlier Detection (LOD)
1. Pick one data point p.
2. Retrieve p's k-nearest connected neighbors using the distance metric.
3. Consider this as one network.
4. X_QoS_outlier_list := NULL
5. X_QoS_inlier_list := NULL
6. X_QoS_list = [TP, GP, ETED]
7. Count_outlier = 0
8. Count_inlier = 0
9. for item in X_QoS_list
10.   Compute item and its LDQOF threshold
11.   if item violates its LDQOF threshold then
12.     Append item to X_QoS_outlier_list
13.   else
14.     Append item string to X_QoS_inlier_list
15.   end if
16. end for
17. for item in X_QoS_outlier_list
18.   Count_outlier = Count_outlier + 1
19. end for
20. for item in X_QoS_inlier_list
21.   Count_inlier = Count_inlier + 1
22. end for
23. if Count_outlier > Count_inlier then
24.   for item in X_QoS_outlier_list
25.     Sort the nodes using item and declare them outliers
26.   end for
27. end if
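A compact sketch of the corresponding QoS vote follows. The metric units, threshold values and their "higher is better" orientation (throughput and goodput up, end-to-end delay down) are assumptions for the sketch, not the paper's LDQOF calibration.

```python
def lod_vote(qos, thresholds):
    """Vote a node's measured QoS metrics against per-metric thresholds;
    declare an outlier when violations dominate (steps 17-27 above).
    thresholds maps metric name to (threshold_value, higher_is_better)."""
    outlier, inlier = [], []
    for name, value in qos.items():
        thr, higher_is_better = thresholds[name]
        violated = value < thr if higher_is_better else value > thr
        (outlier if violated else inlier).append(name)
    return (len(outlier) > len(inlier)), outlier

# Hypothetical measurements: Mbit/s for TP/GP, milliseconds for ETED.
print(lod_vote({"TP": 0.8, "GP": 0.6, "ETED": 240.0},
               {"TP": (1.0, True), "GP": (0.9, True), "ETED": (150.0, False)}))
# -> (True, ['TP', 'GP', 'ETED'])
```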

3.2.4 Heavyweight Outlier Detection (HOD)

As shown in Fig. 1, HOD is a multilayered outlier detection process. It collects data from the LOD process and performs data preprocessing to identify missing data, remove duplicated data and enrich the data. After preprocessing, a data parser module parses the data for three different layers: the MAC layer, the routing layer and the application layer. Each of these layers has its own outlier detection process, briefly described as follows:
• MAC layer outlier detection (MACLOD) uses a machine learning process for outlier detection, with four phases: preprocessing, learning/training, evaluation and prediction. In the preprocessing phase, training and testing datasets are prepared for analysis. In the learning/training phase, features are extracted and compared using a decision tree-based clustering mechanism. In the evaluation phase, the testing data's features are compared with the training set's features for outlier detection. In the prediction phase, new data features are extracted and compared directly with the expected features of an inlier or outlier, rather than executing the learning/training and evaluation phases again and again.
• Routing layer outlier detection (RLOD) is the second layer of the outlier detection process in HOD. Here, routing packets are filtered for analysis and for constructing a


Markov chain. The state probability transition matrix is computed from the Markov chain construction, and the probability of an outlier is computed using this matrix (a sketch follows this list).
• Application layer outlier detection (ALOD) is the third layer of the outlier detection process in HOD. In this layer, filtered application layer packets are processed for associativity rules. Associativity rules-based outlier detection is a content and contextual outlier detection mechanism. Content association determines the chain of source, intermediate and destination nodes, whereas context-based association determines outliers by dividing the nodes into colonies and empires. Contextually similar colonies are put together to construct empires, and dissimilar colonies are either moved to neighboring empires or considered outliers.

After receiving the nodes with a label of outlier or inlier, the nodes are passed through the scoring module. This module may take a single layer's opinion or the aggregated observations of all layers, depending on its configuration. The final score is computed as the percentage of outliers in a particular cluster and in the whole network.
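As a concrete illustration of the RLOD step referenced above, the sketch below estimates a state-transition probability matrix from an integer-coded sequence of routing events and scores an observed sequence under it; the event coding is hypothetical, not the paper's.

```python
import numpy as np

def transition_matrix(states, n_states):
    """Estimate the Markov state-transition probability matrix from an
    observed integer-coded sequence of routing events."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero
    return counts / row_sums

def sequence_probability(P, states):
    """Probability of an observed sequence under the chain; very low
    values mark the node that produced it as a candidate outlier."""
    p = 1.0
    for a, b in zip(states[:-1], states[1:]):
        p *= P[a, b]
    return p

# Hypothetical coding: 0 = RREQ, 1 = RREP, 2 = DATA.
normal = [0, 1, 2, 2, 2, 0, 1, 2, 2, 2]
P = transition_matrix(normal, 3)
print(sequence_probability(P, [0, 0, 0, 0]))   # low -> suspicious pattern
```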

4 Simulation, and Its Results and Analysis This section explains the experimental setup, the simulation parameters taken for analysis and the visualization of the clusters. Clusters are validated through various internal and external indices. The simulation of the proposed approach, together with the environment setup, is explained in detail as follows.

4.1 Simulation Environment In the simulation analysis, 50-5000 nodes are distributed randomly over a 1500 m × 1500 m area. The Random WayPoint mobility model is used with a wireless channel and an omnidirectional antenna to trace and capture packets. A maximum of 7 packets per second can be transferred at a time, with each packet containing a maximum of 512 bits. The ns-3 [15] simulator is used to simulate nodes with 0.1-5 m/s mobility. The total simulation is executed for 2000 s with a multi-execution scenario.

4.2 Simulation-Indices Computation and Analysis As discussed earlier, cluster validation evaluates the goodness of the clustering algorithm. A detailed evaluation of the proposed approach using cluster validation methods is explained as follows.


4.2.1 A Comparative Analysis of Internal Indices

Internal cluster validation methods use properties that are internal information from the dataset. In this work, the indices used for evaluation are II, XBI, and Γ. Figure 3 shows the comparative analysis of II, XBI, and Γ with variation in the number of nodes: the optimal index values for the 50, 100, 500, 1000, 2000, 3000, 4000, and 5000 node datasets are observed during the T1–T2, T3–T4, T5–T6, T7–T8, T8–T9, T8–T9, T8–T9, and T8–T9 slots with 4, 18, 23, 33, 38, 40, 53, and 56 clusters, respectively. Table 1 shows the comparative analysis of timing slots and the number of clusters indicating the optimal index value for all internal indices taken for analysis. The IEIODF_LOWER threshold values selected for all internal indices are the values at which all indices agree on considering all clusters as valid clusters. For the 50-node dataset, the IEIODF_LOWER threshold values selected for II, XBI, and Γ are 0.084, 0.271, and 0.243, respectively. A detailed threshold index change with variation in the number of nodes is shown in Fig. 3, which compares those indices whose values vary between 0 and 1. As compared to XBI, the values of II and Γ remain almost constant for all types of networks (small to large scale).

Fig. 3 Analysis of index value variation for three internal indices (II, XBI and Γ)

Table 1 Timing slots and no. of clusters indicating optimal indices' value

Indices |   | 50      | 100     | 500     | 1000    | 2000    | 3000    | 4000   | 5000
II      | A | T5–T6   | T10–T11 | T6–T7   | T11–T12 | T1–T2   | T10–T11 | T5–T6  | T12–T13
        | B | 32      | 36      | 23      | 41      | 8       | 49      | 34     | 59
        | C | 0.084   | 0.052   | 0.26    | 0.023   | 0.292   | 0.141   | 0.696  | 0.4
XBI     | A | T7–T8   | T8–T9   | T10–T11 | T6–T7   | T2–T3   | T1–T2   | T5–T6  | T2–T3
        | B | 32      | 36      | 38      | 30      | 13      | 7       | 34     | 27
        | C | 0.271   | 0.145   | 0.032   | 0.203   | 0.139   | 0.269   | 0.01   | 0.223
Γ       | A | T12–T13 | T5–T6   | T6–T7   | T6–T7   | T12–T13 | T2–T3   | T7–T8  | T5–T6
        | B | 32      | 22      | 31      | 30      | 44      | 14      | 48     | 46
        | C | 0.243   | 0.104   | 0.209   | 0.312   | 0.184   | 0.108   | 0.212  | 0.174

*A = timing slots, B = number of clusters, C = indices value


If II, XBI, and Γ are compared among themselves, then the maximum variation is observed in II and the minimum variation in XBI. The Γ index value decreases from 50 to 100 nodes (very small-scale network), increases from 100 to 1000 nodes (very small- to medium-scale network), decreases from 1000 to 3000 nodes (medium-scale network), and increases from 3000 to 5000 nodes with a small decrease from 4000 to 5000 nodes (large-scale network). The XBI value decreases from 50 to 500 nodes (small-scale network), increases from 500 to 3000 nodes with a one-time decrease at 2000 nodes (medium-scale network), and shows the maximum decrease (from 3000 to 4000 nodes) and increase (from 4000 to 5000 nodes) for the large-scale network.
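As a concrete illustration of how such an internal index can be computed, the following sketch (not from the paper, and assuming XBI denotes the Xie–Beni index) evaluates it for a hard clustering given node features and cluster labels as NumPy arrays.

import numpy as np

def xie_beni_index(X, labels):
    """Lower values generally indicate compact, well-separated clusters."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Compactness: total squared distance of points to their own centroid.
    compactness = sum(((X[labels == c] - centroids[i]) ** 2).sum()
                      for i, c in enumerate(clusters))
    # Separation: minimum squared distance between any two centroids.
    dists = [((centroids[i] - centroids[j]) ** 2).sum()
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    return compactness / (len(X) * min(dists))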

4.2.2 A Comparative Analysis of External Indices

To measure the goodness of a clustering algorithm, external validation methods use external information for comparison. For example, known labeled cluster datasets are generally preferred for comparing a produced partition with a known partition [16]. In this work, the methods used for external cluster validation are EI, RI, and JI. Figure 4 shows the comparative analysis of EI, RI, and JI with variation in the number of nodes. For example, Fig. 4 shows that the optimal EI values for the 50, 100, 500, 1000, 2000, 3000, 4000, and 5000 node datasets are observed during the T3–T4, T3–T4, T1–T2, T7–T8, T2–T3, T8–T9, T3–T4, and T4–T5 slots with 13, 36, 11, 34, 21, 39, 55, and 48 clusters, respectively. Table 2 shows a detailed analysis of timing slots and the number of clusters indicating the optimal index value for all external indices taken for evaluation. The IEIODF_LOWER threshold values selected for all external indices are the values at which all indices agree on considering all clusters as valid clusters. For the 50-node dataset, the IEIODF_LOWER thresholds selected for EI, RI, and JI are 0.837, 0.981, and 0.982, respectively.

Fig. 4 Analysis of index value variation for external indices (EI, RI and JI)


Table 2 Timing slots and no. of clusters indicating optimal indices value

Indices |   | 50       | 100      | 500    | 1000     | 2000   | 3000   | 4000    | 5000
EI      | A | T2–T3    | T6–T7    | T3–T4  | T7–T8    | T4–T5  | T8–T9  | T12–T13 | T7–T8
        | B | 13       | 36       | 11     | 34       | 21     | 39     | 55      | 48
        | C | 0.837    | 0.881    | 0.905  | 0.868    | 0.98   | 0.793  | 0.946   | 0.898
RI      | A | Up to T1 | Up to T1 | T1–T2  | Up to T1 | T1–T2  | T3–T4  | T9–T10  | T10–T11
        | B | 3        | 4        | 5      | 5        | 8      | 21     | 52      | 57
        | C | 0.981    | 0.998    | 0.92   | 0.931    | 0.937  | 0.98   | 0.983   | 0.901
JI      | A | T2–T3    | T5–T6    | T3–T4  | T6–T7    | T7–T8  | T4–T5  | T5–T6   | T2–T3
        | B | 13       | 22       | 11     | 30       | 34     | 24     | 34      | 27
        | C | 0.982    | 0.885    | 0.806  | 0.882    | 0.967  | 0.961  | 0.975   | 0.84

*A = timing slots, B = number of clusters, C = indices value

A detailed comparative analysis of threshold variations with variation in the number of nodes is shown in Fig. 4. This experimentation is performed for the external indices (EI, RI, and JI). It is observed that the threshold index value lies between 0.7 and 1. As compared to the internal threshold index variation, the external threshold indices show only a slight variation or remain constant with variation in the number of nodes. For a small-scale network (50–500 nodes), the EI index increases, whereas the RI index increases for a very small-scale network (50–100 nodes) and decreases from a very small-scale to a small-scale network (100–500 nodes). The JI index value decreases for a small-scale network (50–500 nodes). For a medium-scale network (500–3000 nodes), the JI and RI values increase, whereas the EI value increases from a small- to medium-scale network (500–1000 nodes), decreases within the medium-scale network (1000–2000 nodes), and then increases from a medium- to large-scale network (2000–3000 nodes). For a large-scale network (3000–5000 nodes), EI increases for 3000–4000 nodes followed by a decrease for 4000–5000 nodes, whereas the RI and JI values increase from 3000 to 4000 nodes and decrease slightly from 4000 to 5000 nodes.
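For reference, the following sketch (not from the paper, and assuming RI and JI denote the Rand and Jaccard indices) computes both indices by pairwise comparison of a produced partition against known labels.

from itertools import combinations

def pair_counts(true_labels, pred_labels):
    a = b = c = d = 0  # a: same/same, b: same/diff, c: diff/same, d: diff/diff
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred: a += 1
        elif same_true: b += 1
        elif same_pred: c += 1
        else: d += 1
    return a, b, c, d

def rand_index(true_labels, pred_labels):
    a, b, c, d = pair_counts(true_labels, pred_labels)
    return (a + d) / (a + b + c + d)

def jaccard_index(true_labels, pred_labels):
    a, b, c, _ = pair_counts(true_labels, pred_labels)
    return a / (a + b + c)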

5 Conclusion

In MANET, outlier detection systems are not only helpful in detecting a range of attacks but can also adaptively respond to and/or mitigate the detected attacks. In this work, the proposed outlier detection scheme used a multidimensional and multilayer outlier detection mechanism for MANETs. In the multidimensional architecture, three subsystems are proposed: ultra-lightweight, lightweight, and heavyweight. The ultra-lightweight and lightweight systems are threshold-based outlier detection systems: the ultra-lightweight system detects outliers based on internal and external indices, whereas the lightweight system detects them through QoS parameters. The heavyweight outlier system uses a multilayered outlier detection mechanism, detecting outliers at the application, routing, and MAC layers of the MANET protocol stack. In the simulation analysis, it is observed that the number of clusters required for small-, medium-, and large-scale networks varies from 5 to 52. A minimum of 0.91% and a maximum of 104.1% improvement in cluster stability is observed.

References
1. Cluster Validation Statistics: Must Know Methods—Articles—STHDA. http://www.sthda.com/english/articles/29-cluster-validation-essentials/97-cluster-validation-statistics-must-know-methods/. Accessed 5 July 2018
2. G. Brock, V. Pihur, S. Datta, S. Datta, clValid: an R package for cluster validation. J. Stat. Softw. 25(4), 1–22 (2008)
3. M. Charrad, N. Ghazzali, B. Boiteau, A. Niknafs, NbClust: an R package for determining the relevant number of clusters in a data set. J. Stat. Softw. 61(6), 1–8 (2014)
4. S. Theodoridis, K. Koutroumbas, Pattern Recognition (Academic Press, 2009)
5. P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
6. J.C. Dunn, Well-separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95–104 (1974)
7. D.L. Davies, D.W. Bouldin, A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. (1979)
8. T. Caliński, J. Harabasz, A dendrite method for cluster analysis. Commun. Stat. (1974)
9. D. Moulavi, P.A. Jaskowiak, R.J. Campello, A. Zimek, J.D. Sander, Density-based clustering validation. In: Proceedings of the 2014 SIAM International Conference on Data Mining (2014)
10. Evaluation of clustering. https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html#fig:clustfg3. Accessed 5 July 2018
11. Y. Liu, Z. Li, H. Xiong, X. Gao, J. Wu, S. Wu, Understanding and enhancement of internal clustering validation measures. IEEE Trans. Cybern. (2013)
12. F. Kovács, C. Legány, A. Babos, Cluster validity measurement techniques. In: 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases (2006), pp. 388–394
13. S. Huang, Y. Cheng, D. Lang, R. Chi, G. Liu, A formal algorithm for verifying the validity of clustering results based on model checking. PLoS One (2014)
14. S. Gurung, S. Chauhan, A dynamic threshold based approach for mitigating black-hole attack in MANET. Wirel. Netw. 1–15 (2017)
15. The Network Simulator—ns-2. https://www.isi.edu/nsnam/ns/. Accessed 5 July 2018
16. T. Van Craenendonck, K. Leuven, H. Blockeel, Using internal validity measures to compare clustering algorithms (ICML, 2015), pp. 1–8
17. W. Li, A. Joshi, Outlier detection in ad hoc networks using Dempster–Shafer theory. In: 2009 Tenth International Conference on Mobile Data Management: Systems, Services and Middleware (IEEE, May 2009), pp. 112–121
18. B. Sun, L. Osborne, Y. Xiao, S. Guizani, Intrusion detection techniques in mobile ad hoc and wireless sensor networks. IEEE Wirel. Commun. 14(5) (2007)
19. J. Karlsson, G. Pulkkis, L.S. Dooley, A packet traversal time per hop based adaptive wormhole detection algorithm for MANETs. In: 2016 24th International Conference on Software, Telecommunications and Computer Networks (SoftCOM) (IEEE, September 2016), pp. 1–7
20. S. Yadav, M.C. Trivedi, V.K. Singh, M.L. Kolhe, Securing AODV routing protocol against black hole attack in MANET using outlier detection scheme. In: 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON) (IEEE, October 2017), pp. 1–4


21. A. Kumar, K. Gopal, A. Aggarwal, Design and analysis of lightweight trust mechanism for secret data using lightweight cryptographic primitives in MANETs. IJ Netw. Secur. 18(1), 1–18 (2016)
22. S. Henningsen, S. Dietzel, B. Scheuermann, Challenges of misbehavior detection in industrial wireless networks. In: Ad Hoc Networks (Springer, Cham, 2018), pp. 37–46
23. A. Kumar, K. Gopal, A. Aggarwal, Novel trust hierarchical construction for RFID sensor-based MANETs using ECCs. ETRI J. 37(1), 186–196 (2015)
24. A. Kumar, K. Gopal, A. Aggarwal, A novel lightweight key management scheme for RFID-sensor integrated hierarchical MANET based on internet of things. Int. J. Adv. Intell. Paradig. 9(2–3), 220–245 (2017)
25. A. Kumar, A. Aggarwal, Performance analysis of MANET using elliptic curve cryptosystem. In: 14th International Conference on Advanced Communication Technology (ICACT) (IEEE, 2012), pp. 201–206
26. A. Kumar, K. Gopal, A. Aggarwal, Simulation and cost analysis of group authentication protocols. In: 2016 Ninth International Conference on Contemporary Computing (IC3) (IEEE, Noida, India, 2016), pp. 1–7
27. A. Kumar, A. Aggarwal, K. Gopal, A novel and efficient reader-to-reader and tag-to-tag anti-collision protocol. IETE J. Res. 1–12 (2018). [Published Online]
28. A. Kumar, A. Aggarwal, Charu, Survey and taxonomy of key management protocols for wired and wireless networks. Int. J. Netw. Secur. Appl. 4(3), 21–40 (2012)

Microscopic Image Noise Reduction Using Mathematical Morphology Mangala Shetty and R. Balasubramani

Abstract In image processing, mathematical morphological (MM) operations play an important role in enhancing regions of an image. The application of basic morphological techniques is useful in improving image quality. During collection and delivery, an image may be polluted by salt-and-pepper noise, which directly reduces image quality in all subsequent processes of image analysis. Recovering the actual image from an image distorted by noise is therefore of great importance [1]. This paper presents an approach using morphological functions to reduce salt-and-pepper noise from scanning electron microscopic (SEM) images of bacteria cells. Noise removal has a wide effect on obtaining accurate segmentation and classification of bacteria cells; to identify bacteria cells automatically within a short period of time, the noise has to be removed from the SEM image. Various quality assessment operations are used to measure the quality of the enhanced images. The experimental results indicate that this approach can reduce noise effectively from the input image without blurring edges. The validation outcomes of denoised images, with a higher peak signal-to-noise ratio (PSNR) and lower mean squared error (MSE), show their reliable application potential.

Keywords Mathematical morphology · Structuring element · SEM · Bacteria

1 Introduction

Set theory principles are the building blocks of MM. The geometrical shape of the object in an image is considered for the application of MM techniques. Since many morphological operations examine relatively ordered pixel values, these operators are

1 Introduction Set theory principles are the building blocks of MM. Geometrical shape of the object in an image is considered for the application of MM techniques. Since many morphological operations examine relatively ordered pixel values, these operators are M. Shetty (B) · R. Balasubramani NMAMIT Nitte, Karkala, India e-mail: [email protected] R. Balasubramani e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_52

585

586

M. Shetty and R. Balasubramani

effectively helpful in reducing the noise level present in images [2–4]. The design of noise reduction strategies for image-based computing systems is of great importance [5]. The primary principle behind morphological processing is to analyze an image's geometric shape by comparing it with tiny patterns, called structuring elements, at different locations [6]. Noise can be defined as a random variation in image brightness that degrades the image. In most cases, the presence of noise is mainly due to the image acquisition phase, in which an optical image is transferred into a continuous electrical signal and then sampled [2]. Complete noise removal without image signal distortion is not possible, but reducing noise to an acceptable range for further analysis and processing of the image is possible. Salt-and-pepper noise is the most common type of additive noise in images. Its inclusion has many causes, including defective camera sensors, faulty memory regions, timing faults in the digitization process, and transmission of the signal over noisy channels [2]. Impulsive noise reduction strategies are described in the literature. A popular non-linear and scalable filter is the standard median filter (SMF) [7]; as the noise level increases, this filter tends to blur the image and distort its content. The same limitation applies to the progressive switching median filter (PSMF) [8]. More recent adaptive noise removal algorithms are the decision-based algorithm (DBA) [9] and the noise adaptive fuzzy switching median salt-and-pepper noise reduction filter (NAFSM) [10]. In the present study, efficient structuring elements to reduce salt-and-pepper noise in scanning electron microscope (SEM) images of bacteria cells are presented. The relevant factors analyzed for evaluation are as follows: (1) impact of peak signal-to-noise ratio and MSE on noise reduction in an unprocessed SEM image; (2) effective quality assessment of noise-reduced SEM images; (3) comparison with other 2D-flat structuring elements in reducing salt-and-pepper noise from SEM images using the proposed mathematical morphology-based method. In the resulting images, the proposed method retains edge and feature information. In addition, the experimental data show that when images are distorted with higher impulsive noise, the proposed model performs well in noise reduction.
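As a small illustration of the noise model discussed above (not from the paper), the following sketch injects salt-and-pepper noise into a grayscale NumPy image at a chosen density.

import numpy as np

def add_salt_and_pepper(img, density=0.6, seed=None):
    """Corrupt `density` of the pixels, half as pepper (0), half as salt (255)."""
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    mask = rng.random(img.shape)
    noisy[mask < density / 2] = 0            # pepper
    noisy[mask > 1 - density / 2] = 255      # salt
    return noisy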

2 Morphological Operations

In image processing, morphological operations are widely used [8] for improving image appearance. MM is also applied to reduce noise: a structuring element is used to probe the image, so that useful information can be obtained and noise can be reduced while preserving features. This paper describes an experiment in which four morphological operations are applied to reduce noise from grayscale images, thereby enhancing their quality.


2.1 Structuring Element

A structuring element (SE) can be defined as a simple predefined shape used to identify neighborhood pixel values. 2D-flat SEs play a vital role in morphological operations on binary and grayscale data because their light transmission functions are unknown, and morphological operations are applied to the relative ordering of pixel values instead of their numerical values. Graphically, a structuring element can be represented either by a matrix of 0s and 1s or as a set of foreground pixels all having the value 1. Conventional structuring elements include arbitrary, ball, and diamond shapes. There are two types of structuring elements: flat and non-flat. In this paper, five arbitrary 2D-flat structuring elements, namely disk, square, rectangle, line, and octagon, have been used for the experiments, as shown in Figs. 1, 2, 3 and 4.
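For illustration, the following sketch (not the authors' code; it assumes scikit-image and NumPy are available) constructs the five kinds of 2D-flat structuring elements used in the experiments.

import numpy as np
from skimage.morphology import disk, square, rectangle, octagon

ses = {
    "disk": disk(3),                           # circle-boundary SE, radius 3
    "square": square(5),                       # 5 x 5 square SE
    "rectangle": rectangle(3, 7),              # 3 x 7 rectangular SE
    "line": np.ones((1, 7), dtype=np.uint8),   # horizontal line SE
    "octagon": octagon(3, 2),                  # octagonal SE
}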

2.2 Dilation

Dilation is useful for adding pixels to the boundaries of a region and for filling holes in a picture: holes may be completely closed or merely narrowed, so the initial shape can be extended by dilation. Dilation can also connect disjoint pixels and insert pixels at edges.

Fig. 1 Square boundary and rectangle boundary SE

Fig. 2 Octagon boundary SE


Fig. 3 Circle boundary SE

Fig. 4 Line boundary SE

2.3 Erosion

The erosion operation produces the reverse effect of dilation: boundaries are narrowed and holes are expanded. As the structuring element slides across the image, an ON-valued pixel is kept only where the SE completely overlaps ON-valued pixels; otherwise it is set to OFF.

2.4 Opening and Closing

Erosion and dilation may be applied repeatedly to achieve the desired results; however, the order in which these operations are executed makes a difference in the processed picture. Combining dilation and erosion yields opening and closing. The opening procedure is erosion followed by dilation with the same structuring element, while closing begins with dilation followed by erosion with the same structuring element. Opening is performed to smooth object contours, split narrow joints, and remove thin ridges. Closing can also smooth contour sections, but it fuses narrow breaks, fills contour gaps, and removes small holes. When an image contains many small noise regions, the opening operation should be used; closing, on the other hand, restores connectivity between objects close to each other. A short sketch of both operations follows.
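A minimal sketch of the two combined operations, assuming OpenCV is available and using a hypothetical input file name; it is an illustration, not the authors' implementation.

import cv2
import numpy as np

img = cv2.imread("lactococcus_sem.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
se = np.ones((5, 5), np.uint8)  # 5 x 5 square structuring element

opened = cv2.morphologyEx(img, cv2.MORPH_OPEN, se)       # erosion then dilation
cleaned = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, se)  # dilation then erosion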


Fig. 5 Process sequence in the proposed method

3 Proposed Approach

In digital image processing, the term morphology refers to a specific method of filtering with SEs. In morphological image processing, choosing a suitable SE is a very important task. An SE can be represented either by a matrix of 0s and 1s or as a set of foreground pixels; the origin of the SE must be clearly identified in both representations. The proposed technique uses morphological operations with 2D-flat structuring elements to eliminate salt-and-pepper noise. Figure 5 displays the process sequence of the proposed method. In the initial phase, images of lactococcus bacteria are taken in different dimensions and converted in the second stage into grayscale images, with intensity values ranging from black (lowest intensity) to white (highest intensity). In the third stage, the morphological operations are performed with the arbitrary 2D-flat SEs proposed in this paper and shown in Figs. 1, 2, 3 and 4.

4 Experimental Results and Discussions

Five 2D-flat arbitrary SEs were used with the morphological operators to perform the noise removal process: disk, square, line, octagon, and rectangle. Lactococcus images were chosen for the experimental research; lactococcus images of 512 × 512, 460 × 819, 1024 × 1024, 1218 × 1120, and 2048 × 2048 dimensions were used with sixty percent salt-and-pepper noise. From the final resulting images (Figs. 10, 11, 12, 13, 14 and 15), it is clear that most of the noise can be reduced using the square-shaped SE. The numerical measurements with PSNR and MSE are shown in Figs. 6, 7, 8 and 9; the noise reduction is also reflected in these numerical measures of the improved images. From the statistical observations using the proposed method, the square boundary SE contributes the highest PSNR and the octagon SE yields the lowest PSNR.
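The two quality measures used above can be computed as in the following sketch (not from the paper), assuming 8-bit grayscale images as NumPy arrays.

import numpy as np

def mse(ref, test):
    return np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)

def psnr(ref, test, peak=255.0):
    m = mse(ref, test)
    return float("inf") if m == 0 else 10 * np.log10(peak ** 2 / m)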


Fig. 6 PSNR for 1024 × 1024 image

Fig. 7 PSNR for 512 × 512 image


Fig. 8 MSE for 1024 × 1024 image

Fig. 9 MSE for 512 × 512 image

Fig. 10 Noisy image

Fig. 11 Image with line SE

Fig. 12 Image with disk SE


Fig. 13 Image with square SE

Fig. 14 Image with rectangle SE

Fig. 15 Image with octagon SE


5 Conclusion and Future Work

The SE plays an important role in image enhancement for noise removal using morphological operations on SEM images of bacteria. The choice of structuring element determines how the geometric details of an image are analyzed and stored, and ultimately governs the distribution and volume of data retained by the morphological transformation. Dilation, erosion, opening, and closing are the morphological procedures applied in this experiment to the noisy SEM image of lactococcus bacteria cells. Although each of these operations contributes individually to improving the images, combining these operators can greatly improve the appearance of the noisy image by reducing noise. The conclusions made in this paper are based purely on the experimental outcomes. The morphological analysis with five arbitrary SEs was performed to carry out the noise reduction procedure, and statistical measurements are reported together with the resulting images for the different 2D-flat SEs. Among the various SEs, the square SE is recognized as the most reliable in eliminating noise, as per the visual perception evaluation and statistical measurements. The result was reliable, and a very strong degree of improvement was reached, showing the efficiency of the proposed work. More morphological operations will be experimented with at higher noise levels in future research.

Acknowledgments The authors are grateful to Dr. Dennis Kunkel, former president of Dennis Kunkel Microscopy Inc., for supplying the SEM images.

References
1. Y. Shi, X. Yang, Y. Guo, Translation invariant directional framelet transform combined with Gabor filters for image denoising. IEEE Trans. Image Process. 23(1), 44–55 (2013)
2. T. Huang, G. Yang, G. Tang, A fast two-dimensional median filtering algorithm. IEEE Trans. Acoust. Speech Signal Process. 27(1), 13–18 (1979)
3. A. Taleb-Ahmed, X. Leclerc, T. Michel, Semi-automatic segmentation of vessels by mathematical morphology: application in MRI. In: Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), vol. 3 (IEEE, 2001), pp. 1063–1066
4. K.K.V. Toh, N.A.M. Isa, Noise adaptive fuzzy switching median filter for salt-and-pepper noise reduction. IEEE Signal Process. Lett. 17(3), 281–284 (2009)
5. V.V. Das, S. Janahanlal, Y. Chaba, Computer Networks and Information Technologies: Second International Conference on Advances in Communication, Network, and Computing, CNC 2011, Bangalore, India, March 10–11, 2011. Proceedings, vol. 142 (Springer, 2011)
6. K. Ratna Babu, K.V.N. Sunitha, Image de-noising and enhancement for salt and pepper noise using genetic algorithm-morphological operations. Int. J. Signal Image Process. 4(1), 36 (2013)
7. Z. Wang, D. Zhang, Progressive switching median filter for the removal of impulse noise from highly corrupted images. IEEE Trans. Circ. Syst. II: Analog Digit. Signal Process. 46(1), 78–80 (1999)
8. K.S. Srinivasan, D. Ebenezer, A new fast and efficient decision-based algorithm for removal of high-density impulse noises. IEEE Signal Process. Lett. 14(3), 189–192 (2007)
9. F. Ortiz, Gaussian noise removal by color morphology and polar color models. In: International Conference Image Analysis and Recognition (Springer, 2006), pp. 163–172
10. S.E. Umbaugh, Computer Imaging: Digital Image Analysis and Processing (CRC Press, 2005)

A Decision-Based Multi-layered Outlier Detection System for Resource Constraint MANET Adarsh Kumar and P. Srikanth

Abstract MANET is a useful network for providing various services and applications. Among these, resource sharing is important, and sharing of resources is possible only when their availability is ensured. In this work, a multi-dimensional, multi-layered solution is proposed for ensuring the availability of network resources. The multi-dimensional approach provides criteria for collecting and analyzing data from different security dimensions. A multi-layered outlier detection algorithm using hierarchical data interconnection is proposed in this work. In the analysis, it is observed that internal indices like DBI and RSI confirm cluster stability with the proposed approach: a minimum of 4.1% and a maximum of 11.3% stability is observed with variation in the number of nodes. Similarly, external indices like F-measure and NMI indicate stability in comparison to external clusters: a minimum of 2% and a maximum of 13.5% stability is observed.

Keywords Outliers · Attack detection and countermeasure · MANET · Clustering · QoS

1 Introduction

Mobile ad hoc networks (MANETs), constituted with limited hardware devices, are decentralized, autonomous, and dynamic in nature. Using this type of network, various applications can be designed to resolve [1]: natural or man-made disasters, road traffic issues, group/military movements, item/visitor tracking systems, autonomous household appliances, etc. The major challenge among resolving these

1 Introduction Mobile ad hoc networks (MANETs) constituted with limited hardware devices are decentralized, autonomous, and dynamic in nature. Using this type of network, various applications can be designed to resolve [1]: natural or man-made disasters, road traffic issues, group/military movements, item/visitor tracking systems, autonomous household appliances, etc. The major challenge among resolving these A. Kumar (B) · P. Srikanth Department of Systemics, School of Computer Science, University of Petroleum and Energy Studies, Dehradun 248007, Uttrakhand, India e-mail: [email protected] P. Srikanth e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_53

595

596

A. Kumar and P. Srikanth

issues is the implementation of security aspects due to the scarcity of resources [2]. Security requirements include the implementation of cryptographic primitives and protocols, and in order to implement any primitive or protocol, devices must be available. Thus, availability is the first challenge that must be resolved. Outlier detection mechanisms can be helpful in ensuring availability: only those devices with good historical or present records are allowed to work in the network, while other devices are put under scrutiny until they prove themselves. The goal of ensuring availability through outlier detection mechanisms is to identify nodes relevant for communication. There are various outlier detection mechanisms, and the majority of them follow statistical approaches, which require both training and testing processes for identification. Statistical approaches are either parametric or non-parametric [2]. Gaussian processes modeled using means and covariances are the best example of parametric outlier detection, whereas non-parametric approaches do not rely on such explicit mathematical models. In another classification, univariate and multivariate methods are also helpful in outlier detection. In network-specific scenarios (like MANETs), the multi-layered, parametric, and multivariate feature set is considered an important and priority-based outlier detection mechanism [3]. These multi-layered solutions offer scalability, high-quality results, and better performance. With multiple layers, different attacks are possible at different layers; thus, a performance- and feature-based system is helpful in identifying unruly nodes. This work proposes a multi-layered architecture for outlier detection with four layers. Layer-1 uses an unsupervised machine learning approach and a Bayesian classifier for outlier detection and analysis; the trained dataset is prepared with labels as outlier or inlier. Network layer packet features are used for probability-based outlier detection. This data is used again at Layer-3 for application trust score-based outlier detection, where an association rule-based mechanism is used for outlier detection with extracted features. Layer-4 detection is an aggregated outlier detection module built on the above three layers; in this layer, outliers are identified at the subgroup and network levels. Overall, the multi-layered outlier detection module is flexible enough to scale with increasing complexity while ensuring availability. This work is organized as follows. Section 2 discusses the literature survey carried out on multi-layered outlier detection in MANETs. Section 3 explains the proposed unsupervised machine learning-based multi-layered outlier detection approach for MANETs based on limited hardware devices. Section 4 presents the simulation results and analysis for the proposed work. Finally, a conclusion is drawn in Sect. 5.


2 Literature Survey

With the development of wide areas of application for MANETs, it is essential to realize the importance of efficient security services for ensuring the availability of resources as and when required. Outlier detection approaches are a preferred choice, as their lighter statistical techniques make them suitable for resource constraint devices in MANETs. Various multi-layered outlier detection models have been proposed for networks with ad hoc connectivity and instability. For example, in [3], Mapanga et al. proposed a neural network-based multi-layered intrusion detection architecture for MANET using an outlier detection process. This model analyzes packets at the network layer and passes results to other layers, and an enhanced threshold-based packet analysis process is used for attack analysis. The technique is used for detecting and isolating black hole attacks from the network and is claimed to be better in terms of packet delivery and end-to-end delay. In [4], a two-level outlier detection scheme is proposed for detecting misbehaving nodes in MANETs. Guo et al. [4] proposed an outlier detection mechanism based on joint prediction of the level and the conditional variance of the traffic flow series. In this detection mechanism, data is collected from different regions, and a comparative analysis is performed to ensure the efficiency of the proposed outlier detection mechanism; recommendations are made to investigate the underlying outlier-generating mechanism and countermeasures for transportation-based applications. The outlier detection approach uses the concept that the variability of the smaller population representing malicious nodes should be greater than the variance of the larger population representing normal nodes in the network. In order to efficiently and effectively separate normal nodes from malicious nodes, a linear regression process is performed in parallel for computing the threshold using the measured fluctuation of received classified instances. In [5–11, 13–18], other outlier detection approaches helpful for ad hoc connectivity and unstable environments are discussed in detail. As expected, most of the existing approaches use the network layer's packet analysis for intrusion detection [7]. These packet analysis processes hardly apply any machine learning approach in their pre-processing or analysis; thus, incomplete or inconsistent data records are also identified as outliers. In addition, the complexity of outlier computation is not taken into consideration. Complexity is an important parameter for resource constraint devices; thus, a mechanism should be flexible and scalable for computational complexity-based outlier detection.


3 Proposed Approach

This section explains the proposed outlier detection approach in detail, as shown in Fig. 1. The proposed approach applies multiple techniques at different layers for outlier detection. These techniques are discussed as follows:

3.1 Data Pre-processing

Initially, the network is constructed and the nodes' performance data is logged for analysis. The constructed network is adaptive in nature and uses multiple protocols for data exchange and configuration; adaptability is invoked when there is a need to improve performance. After network construction and configuration, data is logged for analysis and further processing. This processing includes identifying false entries, side-channel attacked records, and duplicated records, as well as making initial data dependency observations. Initial observations and record dependencies help in data reduction for further analysis and presentation. After data preparation, the data is forwarded to the layers for analysis.

3.2 Layer-1 Outlier Detection

This is the first layer of outlier detection. In this layer, data link layer features are extracted, and these feature sets are processed through the machine learning cycle shown in Fig. 2. Initially, data processing starts with one window, and the window size varies up to 'N'. Datasets whose label is observed as outlier or inlier are put in the training set; unlabeled and unpredictable data is put in the testing dataset. The proposed mechanism observes the nature of the data for anomaly detection: the rate of anomalies is computed during training for estimating the probability of anomaly. In a given window, a packet is observed multiple times for

Unsupervised machine learning approach characterizing network layer data Unsupervised machine learning approach characterizing transport layer data Unsupervised machine learning approach characterizing application layer data and advances outlier detection using rule mining Aggregated detection

Fig. 1 Proposed outlier detection approach

A Decision-Based Multi-layered Outlier Detection …

599

Fig. 2 Machine learning cycle

distinct values. If a packet with the same source but different destinations or different sources with the same destination is observed ‘x’ number of times with ‘d’-distinct values then the probability of anomaly is x/d. This processing is performed in the decision tree using ruleset. Further, this is helpful in building a trained classifier. This classifier saves time for outliers and inliers detection in new data. Sliding window process of outlier detection in the machine learning cycle is explained in Fig. 3. This process collects data sequences from logged data and inserts a window for the trained dataset. In this trained dataset, new entries from the testing dataset are inserted one by one through feature extraction and comparison. This comparison involves an outlier with a new outlier label and an inlier with a new inlier label. After labeling, each node profile is built. Node profile is a contextual aware representation of the node’s features. The hierarchical clustering mechanism [17] is used for node profile building and computing the average of all node values having a similar node profile.

Fig. 3 WxN-window process in machine learning cycle

600

A. Kumar and P. Srikanth

This hierarchical process of profile building is helpful in connecting similar nodes together and identifying feature-based outliers. After the machine learning phase, another evaluation phase is integrated for extracting node features. This phase predicts dependencies based on process contextualization and identifies outliers [10]. Figure 4a–c shows process contextualization without outliers. Figure 4d–f shows process contextualization with outliers. Figure 4a shows a process with independent nodes without interconnection. Although there is no interconnection among nodes, all nodes are connected with a single process/activity. Thus, these nodes are not considered as outliers in this process. Figure 4b shows nodes interconnected among themselves and connected with a single common process. In this scenario, some nodes will act as sources and others as a destination. Nodes may act as intermediate nodes also but no intermediate node should allow an alternative path to existing paths. Multiple self-loops are allowed in this process. Figure 4c shows another scenario with parallel activities. In this scenario, multiple nodes are interconnected in single or multiple processes and parallel paths are possible. Each process must have a single source and a single destination. For k = n, a maximum of n parallel activities is allowed. Figure 4d shows a process of contextual outlier detection. In this process, those nodes are considered as outliers who are connected in a process but are not performing an activity for a long time. This is a threshold-based outlier detection approach. Initially, the threshold time period is the average value of waiting for any activities in the network. Thereafter, the average time period of the subgroup is considered for detection. Figure 4e shows a process of contextual outlier detection when all nodes are not connected in a process and disconnected nodes are acting as source nodes regularly. In this scenario, multiple paths from disconnected nodes to destination nodes are not possible. Figure 4f shows a scenario where multiple paths are possible. In both cases, disconnected nodes are considered as outlier nodes. Detail process of contextual outlier detection is explained in Pseudocode 1.

A Decision-Based Multi-layered Outlier Detection … Fig. 4 a Directed acyclic graph (DAG) when k = 0 Nodes found connected in a process but no activity (NO OUTLIERS). b Directed acyclic graph (DAG) when k = 0 Nodes found connected in a process with single activity (NO OUTLIERS). c Directed Acyclic Graph (DAG) when k = 2 Nodes found connected in a process with two parallel activities (NO OUTLIERS). d Directed Acyclic Graph (DAG) when k = 0 Nodes found not connected in a process for long time with no activity (OUTLIERS). e Directed Acyclic Graph (DAG) when k = 1 Nodes found not connected in a process for long time with activity (OUTLIERS). Possibilities: • Distance bounding attack, • Distance Hijacking attack, • Man-in-Middle attack, • Sync-hole attack

a

b

c

d

e

601

602 Fig. 4 (continued)

A. Kumar and P. Srikanth

f

Pseudocode 1: Contextual Outlier Detection Goal: To evaluate the class of data points collected from a particular node. 1. Iterate each node one by one. 2. Extract features of each data element coming or going out of a particular node. 3. Analyze the features and identify whether the collected feature predicts the graph with one or more parent nodes. 4. Calculate the time period of the inactivity of a node without connection with any process. 5. If the time of inactivity is going beyond a threshold then 6. Node is marked outlier 7. End if 8. if the node is not connected with any process but it is performing an activity with other nodes then 9. Mark the node as an outlier 10. End if 11. If all nodes are connected with any process then 12. Execute content based outlier detection process 13. If randomly picked content is suspicious from historical records then 14. Get connect node’s profile and mark them for outlier analysis 15. End if 16. If randomly picked content is suspicious from historical records with multi-dimensional features then 17. Get connect node’s profile and mark them for outlier analysis 18. End if 19. else 20. return 21. end if

Observations from context and content-based outlier detection processes are compared with the divisive hierarchical clustering process defined previously. Importance to both observations is given equally if labels of both analyses are same then the data label is considered as the final label else if there is a discrepancy in the observation dataset is put in testing dataset for analysis again.

A Decision-Based Multi-layered Outlier Detection …

603

3.3 Layer-2 Outlier Detection Layer-2 outlier detection process deals with the transition of node states. A node state can vary indicating damage caused due to side-channel effects. Transitions between node states are helpful in constructing a Markov chain. Initially, nodes are placed randomly and their movements are observed over a certain period of time. Markov chain process is a process of analysis using historical data and it is helpful in detecting outliers using ruleset. Transitions of node’s states are observed for control and regular messages. Control message sender or receiver is put under scrutiny if these messages are sent or received beyond threshold without any further action. Complete process of outlier detection follows the following steps: chain construction, transition matrix formation, and final computations. The chain construction process uses graphical datasets for record-keeping and computations. The probability matrix accesses the graphical dataset and store paths among nodes in two-dimensional space. Figure 5 shows an example of Markov chain construction and the probability transition matrix is shown in Table 1. Figure 5 and Table 1 presents two routes from source to destination. Path probability ratio of two paths is calculated as Route 1/route 2 = (0.7 + 0.2 + 0.05)/(0.3 + 0.7 + 0.05) = 0.95/1.05 = 0.9 < 1. Now, if nodes are selecting

Fig. 5 Example of markov chain construction

Table 1 Transition matrix 1

2

3

4

5

6



N

1

0

0.3

2

0.3

0

0

0.7

0

0





0.7

0

0

0





3

0

0.7

4

0.7

0

0

0.2

0.05

0.05





0.2

0

0

0.95



5

0

0



0

0.05

0

0.95





6

0

0

0









0.05

0.95

0















N


















route 2 for control or data message transmission then no outlier exists (i.e., all nodes are inliers). If route 1 is selected for transmission then source and destination nodes with degree >1 in route 2 are under scrutiny. In addition, all intermediate nodes with degree ≥3 are under scrutiny.
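The following sketch (not from the paper) estimates a transition matrix from an observed state sequence and compares two routes using the additive path-probability ratio used above; the state indices and routes are illustrative.

import numpy as np

def transition_matrix(states, n_states):
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states, states[1:]):
        counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def path_ratio(P, route1, route2):
    """Sum edge probabilities along each route, as done in the text above."""
    s1 = sum(P[a, b] for a, b in zip(route1, route1[1:]))
    s2 = sum(P[a, b] for a, b in zip(route2, route2[1:]))
    return s1 / s2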

3.4 Layer-3 Outlier Detection

The Layer-3 outlier detection process starts with the assumption that all nodes are randomly deployed and that their deployment area is well known in advance, as shown in Fig. 6a. Profiles of all nodes are collected from the layer above, and the initial population is decided for analysis, as shown in Fig. 6b. Association rules [11–13] are applied for outlier detection. Among the initial population, highly trusted nodes are identified for applying association rules, as shown in Fig. 6c. Using the trusted nodes, the initial population is divided into imperialist countries and imperialist states. An imperialist country is considered denser than an imperialist state: it is defined as a collection of nodes whose number of interconnections with trusted nodes is greater than a certain threshold. An imperialist state is also a collection of nodes, but its number of interconnections is smaller than that of an imperialist country while still greater than the minimum density-based threshold required for outlier detection. Figure 6e shows the construction of colonies, empires, and sub-zones. The whole population area is divided into colonies covering imperialist countries or states. Highly trusted nodes are interconnected for authentic data communication; thereafter, high-power nodes are connected with highly trusted nodes to construct an empire. The connection of each trusted node with high-powered nodes formulates a sub-zone. Small sub-zones are merged by moving high-powered nodes to neighboring sub-zones. If nodes are left isolated, or smaller sub-zones still exist after repeated merging attempts, then these nodes or sub-zones are considered outliers. A hedged sketch of the density thresholds follows.
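An illustrative reading of the density thresholds above, not the authors' algorithm; the threshold values are assumptions.

def classify_nodes(links_to_trusted, country_thr=5, state_thr=2):
    """links_to_trusted: dict node -> number of links to trusted nodes."""
    roles = {}
    for node, k in links_to_trusted.items():
        if k >= country_thr:
            roles[node] = "imperialist country"
        elif k >= state_thr:
            roles[node] = "imperialist state"
        else:
            roles[node] = "outlier candidate"   # isolated / too sparse
    return roles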

3.5 Layer-4 Detection

Layer-4 outlier detection is added to the proposed multi-layered architecture for devices where resources are scarce. In this layer, the outlier detection administrator has the option of considering the observations of a single layer or of multiple layers in the final opinion. Resource constraint devices may choose any single layer's implementation and observations for analysis, whereas resourceful networks/devices should select the combined opinion of all layers. Pseudocode 4 explains the combined outlier detection process in detail.


Fig. 6 a Nodes are distributed randomly over geographical region (Stage 1). b Decide initial population. c Identify highly trusted nodes. d Divide all nodes into imperialist states and countries. e Build colonies inside empires using nearest possible connection to high power node


Fig. 6 (continued)

Pseudocode 4: Combined outlier score calculator
1. Iterate over each layer regularly and collect the labels of each node
2. If each node's label is the same for all of the above three layers then
3.   Assign that common label as the node's final label
4. else
5.   Implement fuzzy min-max to compute the conflicting nodes' exact labels
6. End if

4 Simulation, Evaluations, and Analysis

This section explains the network simulation and the performance of the internal and external cluster indices, reflecting the stability of colonies, empires, and sub-zones, as follows:

4.1 Simulation Setup

In the simulation, a network of 50–5000 nodes is formulated for performance analysis. Nodes have the flexibility to move in any direction at a specified speed within a geographic area. Details of the simulation parameters are shown in Table 2. In this work, eight variations of the network, with different numbers of data records, are considered for analysis. The analysis is observed during different time periods with variation in the number of clusters formed; it is observed that the number of clusters and their stability increase with time.

Table 2 Simulation setup

Parameters | Value
Nodes | 50–5000
Communication via | Wireless channel
Radio propagation model | Ray tracing
Interface | Wireless Phy
MAC type | 802.11
Queue type | Priority queue
Antenna | Omni antenna
Waiting queue size | 50 packets
Maximum X-dimension of the topography | 500 m
Maximum Y-dimension of the topography | 500 m
Mobility model | Random Waypoint mobility
Data transfer rate | 7 packets/second
Single packet size | 1024 bits
Discrete event simulator | ns-3 [12]
Total simulation time | 1500 s
Number of slots assigned to reader at a stretch | 1
Time of each slot | 10 ms
Velocity (minimum to maximum) | 0.3–5 m/s

4.2 Analysis of Internal and External Cluster Indices

This sub-section explains the internal and external clustering indices used for measuring the quality of the clusters formed in the outlier detection process. The higher the quality of the clustering indices, the better the cluster implementation process, which in turn validates the effective and efficient identification of outliers and inliers. The simulation is performed over different timing slots for analysis. The analyses of the internal and external indices are explained as follows:

4.2.1 Internal Cluster Indices

Internal indices measure the quality of clustering without any external data: only data units and features inherent in the dataset are used for measurement. In this work, the Davies–Bouldin index (DBI) and R-squared index (RSI) are used as internal indices for the analysis shown in Fig. 7. The trends for DBI and RSI are almost the same.

Fig. 7 Internal cluster evaluation. a Davies–Bouldin index (DBI). b R-squared indices (RSI)

All index values increase from 50 to 100 nodes (very small-scale network) and decrease from 100 to 5000 nodes (small-scale to large-scale network). Figure 7a and b shows the index analysis during different time slots. In the case of RSI, an elbow structure indicates higher stability. Thus, the proposed clustering and outlier detection mechanisms are validated as stable.
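For reference, the DBI reported above can be computed with scikit-learn as in the following sketch (not the authors' code); KMeans on synthetic blobs merely stands in for the paper's clustering step.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy node features
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("DBI:", davies_bouldin_score(X, labels))  # lower is better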

4.2.2 External Cluster Indices

External indices measure the quality of clustering using external information, i.e., quantities and features derived from the known cluster structure of a dataset. In this work, the F-measure index (FI) and normalized mutual information (NMI) are used as external indices for the analysis shown in Fig. 8. The FI and NMI indices show an increase for a very small-scale network (50–100 nodes) and a decrease from a very small-scale to a small-scale network (100–500 nodes). Higher FI and NMI values mean more stability. According to FI, the proposed mechanism is best for the 50-, 100-, and 1000-node networks, and it is good during the initial time slots (up to T5) for the other networks, as shown in Fig. 8a. The NMI values in Fig. 8b show that the proposed mechanism is best for the 50-, 100-, and 4000-node networks; although it is good for the other networks as well, more fluctuations are observed in these cases. A small sketch of both indices follows Fig. 8.

Fig. 8 External cluster evaluation. a F-measure index. b NMI


5 Conclusion

In dynamically changing topology-based networks like MANET, single-dimensional security solutions are not efficient in providing proper safeguards; thus, layer-based solutions are preferred. A single-dimension, single-layer solution does not identify all types of attacks, whereas multi-dimensional, multi-layer solutions are gaining importance by considering different data at different points. In this work, such an approach is proposed, where four different layers consider different types of data for attack analysis. The proposed approach filters and analyzes network, transport, and application layer data at three different layers, and the fourth layer provides a provision for collecting the observations from the above three layers and concluding the results. In the analysis, it is observed that internal indices like DI, RMSSDI, DBI, and RSI confirm cluster stability with the proposed approach: a minimum of 4.1% and a maximum of 11.3% stability is observed with variation in the number of nodes. Similarly, external indices like F-measure and NMI indicate stability in comparison to external clusters: a minimum of 2% and a maximum of 13.5% stability is observed. In the future, hybrid indices will be explored to improve the results, and advanced analysis will be performed to reduce error approximations.

References
1. J. Liu, Y. Xu, Y. Shen, X. Jiang, T. Taleb, On performance modeling for MANETs under general limited buffer constraint. IEEE Trans. Veh. Technol. 66(10), 9483–9497 (2017)
2. S. Sen, J.A. Clark, J.E. Tapiador, Security threats in mobile ad hoc networks. In: Security of Self-Organizing Networks: MANET, WSN, WMN, VANET, ed. by A.-S. Khan Pathan, 1st edn (CRC Press, New York, 2016), pp. 127–147
3. I. Mapanga, V. Kumar, W. Makondo, T. Kushboo, P. Kadebu, W. Chanda, Design and implementation of an intrusion detection system using MLP-NN for MANET. In: IST-Africa Week Conference (IST-Africa) 2017 (IEEE, Windhoek, Namibia, 2017), pp. 1–12
4. J. Guo, W. Huang, B.M. Williams, Real time traffic flow outlier detection using short-term traffic conditional variance prediction. Transp. Res. Part C: Emerg. Technol. 50, 160–172 (2014)
5. I. Butun, S.D. Morgera, R. Sankar, A survey of intrusion detection systems in wireless sensor networks. IEEE Commun. Surv. Tutorials 16(1), 266–282 (2014)
6. L. Nishani, M. Biba, Machine learning for intrusion detection in MANET: a state-of-the-art survey. J. Intell. Inf. Syst. 46(2), 391–407 (2016)
7. A. Amouri, V.T. Alaparthy, S.D. Morgera, Cross layer-based intrusion detection based on network behavior for IoT. In: 2018 IEEE 19th Wireless and Microwave Technology Conference (WAMICON) (IEEE, 2018), pp. 1–4
8. M.A. Hayes, M.A. Capretz, Contextual anomaly detection framework for big sensor data. J. Big Data 2(2), 1–22 (2015)
9. R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases. In: ACM SIGMOD Record (ACM, NY, USA, 1993), pp. 207–216
10. M. Hahsler, R. Karpienko, Visualizing association rules in hierarchical groups. J. Bus. Econ. 87(3), 313–335 (2017)
11. S. Shamshirband, A. Amini, N.B. Anuar, M.L. Mat Kiah, Y.W. Teh, S. Furnell, D-FICCA: a density-based fuzzy imperialist competitive clustering algorithm for intrusion detection in wireless sensor networks. Meas. J. Int. Meas. Confed. 55, 212–226 (2014)


12. The Network Simulator—ns-2. https://www.isi.edu/nsnam/ns/. Accessed 5 July 2018
13. F. Chen, P. Deng, J. Wan, D. Zhang, A.V. Vasilakos, X. Rong, Data mining for the internet of things: literature review and challenges. Int. J. Distrib. Sens. Netw. 11(8), 1–14 (2015)
14. A. Kumar, K. Gopal, A. Aggarwal, Simulation and cost analysis of group authentication protocols. In: 2016 Ninth International Conference on Contemporary Computing (IC3) (IEEE, Noida, India, 2016), pp. 1–7
15. A. Kumar, A. Aggarwal, K. Gopal, A novel and efficient reader-to-reader and tag-to-tag anti-collision protocol. IETE J. Res. 1–12 (2018). [Published Online]
16. A. Kumar, K. Gopal, A. Aggarwal, Design and analysis of lightweight trust mechanism for secret data using lightweight cryptographic primitives in MANETs. IJ Netw. Secur. 18(1), 1–18 (2016)
17. S.K. Solanki, J.T. Patel, A survey on association rule mining. In: 2015 Fifth International Conference on Advanced Computing & Communication Technologies (ACCT) (IEEE, 2015), pp. 212–216
18. A. Kumar, A. Aggarwal, Survey and taxonomy of key management protocols for wired and wireless networks. Int. J. Netw. Secur. Appl. 4(3), 21–40 (2012)

Orthonormal Wavelet Transform for Efficient Feature Extraction for Sensory-Motor Imagery Electroencephalogram Brain–Computer Interface
Poonam Chaudhary and Rashmi Agrawal

Abstract Wavelet transform (WT) is a well-known method for localizing frequency in the time domain in transient and non-stationary signals like electroencephalogram (EEG) signals. These EEG signals are used for non-invasive brain–computer interface (BCI) system design. Generally, the signals are decomposed into dyadic (two-band) frequency bands for frequency localization in the time domain. The triadic approach involves filtering the EEG signals into three frequency bands: a low-pass band, a high-pass band, and a band-pass band. The sensory-motor imagery (SMI) frequencies (α, β, and high γ) can be localized efficiently from non-stationary EEG signals using this triadic wavelet filter. Further, features can be extracted using common spatial pattern (CSP) algorithms, and these features can be classified by machine learning algorithms. This paper discusses dyadic and non-dyadic filtering in detail and proposes an approach for frequency localization using three-band orthogonal wavelet transformation for the classification of sensory-motor imagery electroencephalogram (EEG) signals.

Keywords Electroencephalogram (EEG) · Filter band · Common spatial patterns (CSP) · Non-dyadic orthogonal wavelet transformation · Sensory-motor imagery (SMI) · Brain–computer interface

1 Introduction of BCI

A Brain–Computer Interface has been an alluring research area for the last two decades, with successful online applications in education, rehabilitation, home automation, restoration, entertainment, and enhancement. Advancements in technologies like wireless recording, signal processing techniques, computer algorithms, and brain sciences have made possible the once-unimagined task of converting brain signals into control signals for a computer or any other electronic device.

Specifically, physically impaired patients can be rehabilitated: people suffering from brain damage or brain diseases (e.g., people with amyotrophic lateral sclerosis, stroke patients) can use BCI, and a significant body of literature reports positive impact on a large scale [1–3].

Brain signals can be acquired either invasively, by planting electrodes inside the grey matter of the brain, or non-invasively, by placing electrodes on the scalp. Electroencephalography (EEG) records brain signals non-invasively from the scalp [4–6]. Despite its low signal-to-noise ratio (SNR), this method is considered convenient as it does not require any surgical procedure. Electrocorticography (ECoG), by contrast, is an invasive neuroimaging method recorded from the cortical surface of the brain; it is also known as intracranial EEG [4]. Local field potentials (LFPs), single-unit activity potentials, and multi-unit activity potentials are some other invasive technologies. Biomechanical parameters, like motor imagery events, can be extracted and applied successfully from these spatio-temporal signals [7–10]. Literature is available in which researchers [11–15] used these signals for classifying upper limb movement and for controlling electronic devices; they achieved the task using electrodes implanted in human and monkey brains, which resulted in high signal-to-noise ratio and accurate control of prosthetic devices in three-dimensional space [13–17]. Though brain signals acquired by invasive methods degrade gradually, it is the risk of the surgery required for electrode placement that makes these methods less realistic. EEG measures neural activity directly, economically, and portably for medical use. Thus, EEG is the most widely accepted method for brain–computer interfacing, rather than higher-spatial-resolution technologies like fMRI, MEG, etc. A speller system for paralyzed people [16], an EEG-based wheelchair [17], and reach-and-grasp control of a robotic arm [18] are some successful projects in assistive and rehabilitation devices for physically disabled patients.

The use of a BCI system comprises two phases: in the first phase the system is calibrated (the training phase), and in the second phase the BCI system is used online to translate the recognized brain activity patterns into control commands for a computer. Finding relationships and analyzing patterns in the brain signal, the underlying physical events, and the cognitive processing are some of the challenging tasks in brain–computer interfacing. The basic framework of an online EEG-based brain–computer interface system is a closed loop, which starts with acquiring/recording specific EEG patterns of the user (e.g., motor imagery or responses to visual stimuli). The acquired EEG signals are then preprocessed using advanced signal processing techniques like de-noising, digitization, and spatial and spectral filtering [19] in order to normalize the signals; feature extraction and selection are then performed on these preprocessed signals to characterize them in a compact, compatible structure [20]. These compact feature sets are classified [21] before being converted into command signals for a computer application. Finally, users give feedback on whether the command signal has been interpreted correctly as the intended mental task.
Every step of this protocol involves algorithms which can be optimized for better performance of the brain–computer interface system [22].

Designing a BCI system is a very challenging task due to subject and application specificity, and the performance depends on the algorithms used at every step of the BCI design. The basic performance measure is the classification accuracy of the classifier used, under the assumption of balanced classes and an unbiased classifier. The kappa metric, or a confusion matrix from which the sensitivity–specificity pair or precision can be computed, is an alternative choice when classes are biased or unbalanced. Other performance measures for BCI systems are the ROC curve and the area under it (AUC), often used when the classification depends on a continuous parameter like a threshold. The overall performance strongly depends on the performance of the subcomponents of the BCI system. There are different orchestrations of the BCI system, such as hybrid, self-paced, and system-paced configurations [23].

The remainder of this paper is organized as follows. Section 2 discusses wavelet analysis versus spectrum analysis and describes the different wavelets used for decomposition of signals. Section 3 discusses the construction and advantages of the orthonormal non-dyadic wavelet. Section 4 proposes the application of the orthonormal non-dyadic wavelet for decomposing acquired EEG signals for motor imagery classification, followed by the conclusion and future work in Sect. 5.
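
To make the performance measures above concrete, the following is a minimal sketch using scikit-learn; the labels and decision scores are made-up toy values for demonstration only, not results from any BCI experiment:

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, roc_auc_score)

# Illustrative two-class motor imagery outputs (fabricated for the demo)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.2, 0.6, 0.9, 0.7, 0.4, 0.1, 0.8, 0.3]  # continuous decision values

print(accuracy_score(y_true, y_pred))     # fine when classes are balanced
print(cohen_kappa_score(y_true, y_pred))  # corrects for chance agreement
print(confusion_matrix(y_true, y_pred))   # sensitivity/specificity readable
print(roc_auc_score(y_true, y_score))     # threshold-independent measure
```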

2 Wavelet Analysis

Wavelet analysis of biosignals has attracted attention in recent years for signal processing using software techniques. The wavelet transform (WT) has gained popularity over the Fourier Transform (FT), previously the most commonly applied representation of signals. Unlike the FT, which uses trigonometric polynomials, the WT expands a signal over translations and dilations of a mother wavelet basis function, and localization of scaling properties is performed in the frequency and time domains [24]. This allows a close correlation between the function and its coefficients and ensures numerical stability during reconstruction. The construction of powerful wavelet basis functions is the main goal of wavelet analysis, which extends to finding efficient methods for their computation; for example, a fast transform can be formulated using wavelet basis functions. A wavelet spectrum can then be formulated which considers the signal's time (or spatial) domain information as well as its frequency-domain information. Due to their non-stationary characteristics in the time domain, biosignals processed with the FT do not yield the desired results, so wavelet analysis is a powerful tool for processing non-stationary signals. A wavelet is a smooth, flexible, short-time localized oscillating waveform which has an average value of zero and good frequency and time localization [25]. Figure 1a, b illustrates the Fourier and wavelet transforms of a signal, and Fig. 2 shows the block diagram of the proposed classification pipeline. The decomposition of a function of time (here, a signal) into its constituent frequencies can be done by the Fourier Transform, represented mathematically as

Fig. 1 a Fourier transform of a signal; b wavelet transform of a signal

Fig. 2 Block diagram of the proposed sensory-motor imagery EEG classification using non-dyadic wavelet decomposition (N-channel raw EEG → 3-band wavelet decomposition → feature extraction and selection → classification with ANN, SVM, Bayesian, etc. → motor task)

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt \qquad (1)$$

Equation (1) gives the Fourier coefficients F(ω) as a sum over all time of the signal f(t) multiplied by a complex exponential. According to Heisenberg's uncertainty principle, the velocity and position of an object cannot both be measured exactly at the same time; analogously, time and frequency resolution cannot both be made arbitrarily fine, regardless of the transform used. Thus, multiresolution analysis (MRA) is an alternative approach for analyzing a signal at different frequencies with different resolutions. This approach gives good frequency resolution and poor time resolution at low frequencies, and good time resolution and poor frequency resolution at high frequencies. As EEG signals contain low-frequency content over long durations and high-frequency content over very short durations, multiresolution analysis is suitable for EEG signal analysis. Wavelet transforms can be categorized as the discrete wavelet transform (DWT), the continuous wavelet transform (CWT), and multiresolution-based transforms (MRT).

Like the short-time Fourier transform (STFT), in the CWT a signal of finite energy is multiplied by a function over a frequency band, and the transform is computed separately for different segments of the time-domain signal; the signal is reconstructed by integrating the resulting frequency components. Unlike the STFT, the CWT does not compute negative frequencies, and the transform is computed for every spectral component. Multiplying the signal by scaled and shifted versions of the wavelet and summing over all time yields the wavelet coefficients. Whereas the FT uses sin() and cos() basis functions, wavelets define a set of basis functions ψ_k(t) as follows:

$$f(t) = \sum_{k} a_k\, \psi_k(t) \qquad (2)$$

The basis can be constructed by applying translations (by a real number τ) and scalings (stretching/compressing by a positive scale s) to the "mother" wavelet ψ(t):

$$\psi(s, \tau, t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t-\tau}{s}\right) \qquad (3)$$

The projection of a function y onto the subspace of scale s then has the form

$$y_s(t) = \int_{\mathbb{R}} WT_{\psi}\{y\}(s, \tau)\; \psi_{s,\tau}(t)\, d\tau \qquad (4)$$

with wavelet coefficients

$$WT_{\psi}\{y\}(s, \tau) = \langle y, \psi_{s,\tau} \rangle = \int_{\mathbb{R}} y(t)\; \psi_{s,\tau}^{*}(t)\, dt \qquad (5)$$

Some of the continuous wavelets are the Poisson wavelet, Mexican hat wavelet, Morlet and modified Morlet wavelets, Shannon wavelet, Beta wavelet, causal wavelet, Hermitian wavelet, Cauchy wavelet, Meyer wavelet, and many more [26]. The analysis of a signal using all wavelet coefficients is computationally intractable, so wavelets are discretely sampled and reconstructed. The series expansion of discrete-time signals is explained as follows: if x[n] is a square-summable sequence, i.e., x[n] ∈ ℓ²(ℤ), an orthonormal expansion of x[n] has the form

$$x[n] = \sum_{k \in \mathbb{Z}} \langle \varphi_k, x \rangle\, \varphi_k[n] = \sum_{k \in \mathbb{Z}} X[k]\, \varphi_k[n] \;\;\Rightarrow\;\; \|x\|^2 = \|X\|^2 \qquad (6)$$

where X[k] = ⟨φ_k, x⟩ = Σ_l φ_k*[l] x[l] is the transform of x, and the basis functions φ_k satisfy the orthonormality constraint ⟨φ_k, φ_l⟩ = δ[k − l] [26].

The DWT decomposes the signal into detailed information and a coarse approximation, analyzing the signal at different frequency bands with different resolutions. Two filters, a high-pass filter and a low-pass filter, employ two sets of time-domain functions, known as the wavelet functions and the scaling functions, respectively. The original signal y[i] is filtered by g[i] (the high-pass filter) and by h[i] (the low-pass filter) as in Eqs. (7) and (8). After this convolution, half of the samples can be eliminated by Nyquist's theorem, as the highest frequency of each sub-band is now f/2 instead of f. Sub-sampling by 2 then gives one level of decomposition of the signal y[i], mathematically represented as

$$y_{high}[k] = \sum_{i} y[i]\; g[2k - i] \qquad (7)$$

$$y_{low}[k] = \sum_{i} y[i]\; h[2k - i] \qquad (8)$$

EEG is a non-stationary signal. Hence, for such transient signals, a time–frequency representation is highly desirable, with an aim to derive meaningful features [10].
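
As an illustration of Eqs. (7) and (8), the following is a minimal sketch of one level of dyadic decomposition; the Haar filter pair and the synthetic test signal are illustrative assumptions, not choices made in this paper:

```python
import numpy as np

def dwt_one_level(y, h, g):
    """One level of dyadic DWT: convolve with the low-pass filter h and the
    high-pass filter g, then keep every second sample (Eqs. 7 and 8)."""
    low = np.convolve(y, h)[1::2]    # coarse approximation y_low[k]
    high = np.convolve(y, g)[1::2]   # detail coefficients y_high[k]
    return low, high

# Orthonormal Haar analysis filters (the simplest assumed choice)
h = np.array([1.0, 1.0]) / np.sqrt(2)    # low-pass (scaling) filter
g = np.array([1.0, -1.0]) / np.sqrt(2)   # high-pass (wavelet) filter

t = np.linspace(0.0, 1.0, 256, endpoint=False)
signal = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 35 * t)

approx, detail = dwt_one_level(signal, h, g)
print(len(approx), len(detail))  # 128 128: each sub-band is half the length
```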

3 Non-dyadic Wavelet Transform

Wavelet series expansion decomposes finite-energy functions for analysis; the basis functions must therefore be regular, well localized, and of finite energy. It is convenient to take special values of s and τ when defining the wavelet basis, namely s = 2^{-j} and τ = k·2^{-j} for the jth stage of the process. Sampling the scale of the wavelet transform along a geometric sequence of ratio 2 in this way is known as the dyadic wavelet transform. Equation (3) can then be rewritten as Eq. (10), known as the dyadic wavelet transform of f; the family of dyadic wavelets is a frame of L²(ℝ):

$$Wf(u, 2^j) = \int_{-\infty}^{+\infty} f(t)\, \frac{1}{\sqrt{2^j}}\, \psi\!\left(\frac{t-u}{2^j}\right) dt = f \star \bar{\psi}_{2^j}(u) \qquad (10)$$

with

$$\bar{\psi}_{2^j}(t) = \psi_{2^j}(-t) = \frac{1}{\sqrt{2^j}}\, \psi\!\left(\frac{-t}{2^j}\right)$$

Time–frequency localized basis functions are popular among researchers for applications like the analysis of acquired signals [27], image coding [28, 29], and feature extraction [30–32]. Orhan et al. [33] and Übeyli [34] implemented two-band wavelet filter banks and extracted features from them.

They then classified the features into predefined classes. The authors of [35] used two frequency bands, each of width Δω = π/2, and reported poor frequency resolution in both the high- and the low-frequency band. A dyadic (M = 2) filter bank can be extended to an M-band filter bank with M > 2 sub-bands, which improves the frequency resolution to Δω = π/M. To increase the frequency resolution for high- or low-frequency signals, the number of sub-bands in that region can be increased; thus, the higher frequency resolution of a triadic filter bank can be useful for practical applications involving high- or low-frequency signals [36]. Further, localization of the high- or low-band filters in the spatio-temporal domain improves the performance of 3-band filter banks. Xie and Morris [37] and Sharma et al. [38] designed dyadic regular orthogonal and biorthogonal filter banks, respectively, using time–frequency wavelet basis functions. The two-band wavelet transform has been implemented extensively [28, 31, 32, 37, 38] and outperforms many other existing methods like empirical mode decomposition (EMD) [39, 40], high-order moment parameters [41, 42], autoregression, and band-power-based models [43]. The literature has shown poor frequency resolution with the two-band wavelet transform for both high- and low-frequency signals, so sensory-motor imagery (SMI) classification accuracy can be improved by increasing the frequency resolution in a given frequency region of the dyadic wavelet transform. A more flexible time–frequency tiling can be built using M-band wavelet decomposition. Lin et al. [44] and Lin et al. [45] proposed the construction of M-band wavelets using multiresolution analysis (MRA). They decomposed the input signal into M parts using a filter bank matrix based on the calculated filter coefficients. The filter bank matrix X is the concatenation of K overlapping M × K factor matrices, X = [X_0, X_1, …, X_{K−1}]. The filter bank should produce orthonormal and reconstructable output for the given polyphase matrix B(z) in Eq. (11); the conditions for such output are shown in Eq. (12):

$$B(z) = X_0 + X_1 z^{-1} + \cdots + X_{K-1} z^{-(K-1)} \qquad (11)$$

$$Z e = M e_1, \qquad R R^T = I, \qquad S S^T = I \qquad (12)$$

where

$$Z = \sum_{i=0}^{K-1} X_i, \qquad e = [1, 1, \ldots, 1]^T, \qquad e_1 = [1, 0, \ldots, 0]^T \qquad (13)$$

(14)

(15)

The M-band filter bank X_k is decomposed into orthogonal matrices, to solve the constraint equations, using singular value decomposition (SVD), which can be written as

$$X_k = E\, D_k\, F \qquad (16)$$

where the factored matrices E and F are orthogonal; X = [X_0, X_1] satisfies Eq. (12) if and only if it admits this decomposition [44, 45]. Sharma et al. [46] used an optimal orthogonal wavelet to decompose ECG signals for automated heartbeat classification; they designed a finite impulse response (FIR) filter that satisfies the vanishing-moment and orthogonality conditions. Chandel et al. [47] proposed triadic wavelet decomposition to find suitable features that give higher accuracy for epileptic seizure classification. Bhati et al. [48] designed epileptic seizure signal classification using a three-band orthogonal wavelet filter bank with stopband energy. Benchabane et al. [49] applied a statistical threshold to the wavelet decomposition coefficients of individual evoked-potential signals and estimated their mean across trials to improve the signal-to-noise ratio.
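
To make the triadic frequency tiling concrete, here is a minimal sketch that splits a signal into the three normalized frequency bands [0, π/3), [π/3, 2π/3), and [2π/3, π] with an idealized FFT brick-wall mask. This illustrates only the band layout, not the orthogonal filter bank design of Eqs. (11)–(16); the sampling rate and test signal are assumptions:

```python
import numpy as np

def triadic_band_split(x):
    """Split a real signal into three sub-bands covering the normalized
    frequency ranges [0, pi/3), [pi/3, 2pi/3), and [2pi/3, pi] using an
    idealized FFT mask (brick-wall filters)."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    omega = 2 * np.pi * np.fft.rfftfreq(n)   # rad/sample, in [0, pi]
    edges = [(0.0, np.pi / 3), (np.pi / 3, 2 * np.pi / 3),
             (2 * np.pi / 3, np.pi + 1e-9)]
    return [np.fft.irfft(spectrum * ((omega >= lo) & (omega < hi)), n=n)
            for lo, hi in edges]

fs = 250.0                                    # assumed EEG sampling rate
t = np.arange(0, 2, 1 / fs)
x = (np.sin(2 * np.pi * 10 * t)     # alpha-range component
     + np.sin(2 * np.pi * 25 * t)   # beta-range component
     + np.sin(2 * np.pi * 60 * t))  # gamma-range component
low, mid, high = triadic_band_split(x)
```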

4 Proposed Approach for Sensory-Motor Imagery (SMI) EEG Classification Using Non-dyadic Wavelet Decomposition

The working brain shows different rhythmic activities depending on the level of perception. These rhythms are affected by the cognitive processing of thoughts and the preparation of actions; an eye blink, for example, can attenuate a particular rhythm. The fact that sheer thoughts disturb the rhythms can become the basis for a BCI system. Different brain rhythms can be identified in EEG in different frequency ranges. Niedermeyer [50] used the Greek letters delta, theta, alpha, beta, gamma, and mu (δ, θ, α, β, γ, and μ) to denote the brain rhythms, and explained that sensory-motor patterns are present in the α, β, and high γ rhythms. The frequency ranges of these rhythms in the EEG signal are: (i) alpha wave, 8–13 Hz; (ii) beta rhythm, 13–30 Hz; (iii) gamma rhythm, 30–85 Hz. This section discusses a new approach to filter out these frequency bands using non-dyadic wavelet decomposition [51] for sensory-motor imagery EEG classification for brain–computer interfacing. The steps of the proposed methodology for sensory-motor imagery classification from EEG signals are shown in Fig. 2.

The raw N-channel EEG data will be decomposed by a three-band filter bank to localize the time–frequency characteristics. This segments the frequency bandwidth into three sub-bands: (a) 0 to π/3, (b) π/3 to 2π/3, and (c) 2π/3 to π. Splitting the frequency bandwidth using a triadic wavelet increases flexibility, and the lowest frequency sub-band can be divided again, up to the required number of levels. To find the α, β, and high γ frequencies of sensory-motor imagery patterns, a few frequency bands can be selected from the decomposed signal for further feature extraction, and the remaining frequencies discarded. The selection of sub-bands can be done on the basis of the corresponding brain rhythms. Band power, CSP, and power spectral density are some examples of features to be extracted from the selected frequency bands; wavelet fuzzy approximate entropy, clustering techniques, cross-correlation techniques, and many other techniques also exist for feature extraction from raw EEG signals. The extracted features can be high-dimensional vectors, depending on the number of channels, trials, and sessions from multiple modalities and on the sampling rate of the modality. It is neither realistic nor useful to consider all features for classification, so selecting a smaller subset of distinctive features, or a feature space projection, is an important step in pattern recognition. The aim of the feature selection process is to remove redundant and uninformative features and to find unique features which do not overfit the training set and which classify real data with higher accuracy even in the presence of noise and artifacts. To reduce the curse of dimensionality, representative features can be selected out of all the coefficients obtained from the three-band frequency domain [52–56]. This opens the use of many analytical and statistical measures (first, second, and third moments, etc.) to further evaluate them. Machine learning algorithms like support vector machines (SVM), k-means clustering, Bayesian networks, artificial neural networks (ANN), radial basis function (RBF) networks, decision trees, etc. can then be applied for identification of the imagined motor task.

The ultimate goal of BCI design is to translate the mental event of the user into control commands: the acquired raw EEG signal has to be converted into real action in the surrounding environment. So classification, or pattern matching of the signal into predefined classes, is naturally the next step after preprocessing, feature extraction, and selection. Machine learning has played an important role not only in identifying the user's intent but also in handling the variation in the user's ongoing signals. Considering the traditional approach of pattern matching, the classification algorithms for mental task recognition in EEG signals can be categorized into four groups: (1) adaptive classifiers, (2) transfer-learning-based classifiers, (3) matrix and tensor classifiers, and (4) deep-learning-based classifiers.
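
As an illustration of the CSP feature extraction step mentioned above, the following is a minimal two-class sketch; the trial shapes, the number of filter pairs, and the log-variance normalization are common conventions assumed here rather than choices specified by this paper:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=2):
    """Common spatial pattern (CSP) filters for two-class motor imagery.
    Each trials array has shape (n_trials, n_channels, n_samples)."""
    mean_cov = lambda trials: np.mean([np.cov(tr) for tr in trials], axis=0)
    ca, cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenproblem ca w = lambda (ca + cb) w; eigenvalues ascend
    _, vecs = eigh(ca, ca + cb)
    # Filters at both ends of the spectrum discriminate the classes best
    idx = np.r_[np.arange(n_pairs), np.arange(-n_pairs, 0)]
    return vecs[:, idx].T                    # (2 * n_pairs, n_channels)

def csp_features(trial, w):
    """Normalized log-variance features of one spatially filtered trial."""
    z = w @ trial                            # (2 * n_pairs, n_samples)
    var = z.var(axis=1)
    return np.log(var / var.sum())

# Toy usage with random data standing in for band-filtered EEG trials
rng = np.random.default_rng(0)
a = rng.standard_normal((30, 8, 500))        # 30 trials, 8 channels
b = rng.standard_normal((30, 8, 500))
w = csp_filters(a, b)
print(csp_features(a[0], w))                 # 4 features for one trial
```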

5 Conclusion and Future Work

Brain–computer interfacing (BCI) is a new pathway to the human brain and unlocks many solutions for physically disabled people, and non-invasive acquisition of brain signals using EEG has carried this task into the practical domain.

As a starting point, the BCI competition IV dataset can be considered for the analysis [55]. Despite the existing literature on algorithms and methods, there is still scope for improvement at every step of designing a robust BCI system. This paper discussed and proposed a new approach to signal processing analysis based on wavelet transformation. It discussed the Fourier Transform, Continuous Wavelet Transform (CWT), Discrete Wavelet Transform (DWT), and Multiresolution Wavelet Transform (MWT) and their application to EEG signal decomposition. The literature on both dyadic and non-dyadic orthogonal transformations was discussed for time–frequency localization of EEG signals, and a new approach was proposed for sensory-motor imagery (SMI) EEG classification using non-dyadic wavelet decomposition. In future work, this approach will be implemented for preprocessing of EEG signals, and machine learning algorithms like support vector machines (SVM), k-means clustering, Bayesian networks, ANN, radial basis function (RBF) networks, decision trees, etc. will be compared for identification of the imagined motor task. With the growth of BCI applications, security and threats have become major issues [57]; these threats and ethical issues could also be explored further. The proposed approach can analyze the non-stationary power at different frequencies, which exists in a fractal structure in the time series, so that dissimilarities between target and non-target EEG signals are recognized. The application of the non-dyadic filter will be demonstrated in our next paper.

References 1. N. Birbaumer, W. Heetderks, J. Wolpaw, W. Heetderks, D. McFarland, P.H. Peckham, G. Schalk, E. Donchin, L. Quatrano, C. Robinson, T. Vaughan, Brain-computer interface technology: a review of the first international meeting. IEEE Trans. Rehabil. Eng. 8(2), 164–173 (2000) 2. J.R. Wolpaw, N. Birbaumer, D.J. McFarland, G. Pfurtscheller, T.M. Vaughan, Brain-computer interfaces for communication and control (in eng). Clin. Neurophysiol. 113(6), 767–791 (2002) 3. M.A. Lebedev, M.A. Nicolelis, Brain-machine interfaces: from basic science to neuroprostheses and neurorehabilitation. Physiol. Rev. 97(2), 767–837 (2017) 4. L.F. Nicolas-Alonso, J. Gomez-Gil, Brain computer interfaces, a review. Sensors 12(2), 1211– 1279 (2012) 5. N. Birbaumer, T. Hinterberger, A. Kubler, N. Neumann, The thought-translation device (ttd): Neurobehavioral mechanisms and clinical outcome. IEEE Trans. Neural Syst. Rehabil. Eng. 11, 120–123 (2003) 6. J. Wolpaw, D. McFarland, T. Vaughan, G. Schalk, The wadsworth center brain computer interface (BCI) research and development program. IEEE Trans. Neural Syst. Rehabil. Eng. 11, 204–207 (2003) 7. G. Pfurtscheller, C. Neuper, G. Muller, B. Obermaier, G. Krausz, A. Schlogl, R. Scherer, B. Graimann, C. Keinrath, D. Skliris, M. Wrtz, G. Supp, C. Schrank, Graz-BCI: state of the art and clinical applications. IEEE Trans. Neural Syst. Rehabil. Eng. 11, 177–180 (2003) 8. J. Borisoff, S. Mason, G. Birch, Brain interface research for asynchronous control applications. IEEE Trans. Neural Syst. Rehabil. Eng. 14, 160–164 (2006) 9. M.W. Slutzky, R.D. Flint, Physiological properties of brain-machine interface input signals. J. Neurophysiol. 118(2), 1329–1343 (2017)

10. T. Gandhi, B.K. Panigrahi, S. Anand, A comparative study of wavelet families for EEG signal classification. Neurocomputing 74(17), 3051–3057 (2011) 11. L.R. Hochberg et al., Reach and grasp by people with tetraplegia using a neurally controlled robotic arm. Nature 485(7398), 372–375 (2012) 12. M. Velliste, S. Perel, M.C. Spalding, A.S. Whitford, A.B. Schwartz, Cortical control of a prosthetic arm for self-feeding. Nature 453(7198), 1098–1101 (2008) 13. S.-P. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, G.M. Friehs, M.J. Black, Point-andclick cursor control with an intracortical neural interface system by humans with tetraplegia. IEEE Trans. Neural Syst. Rehabil. Eng. 19(2), 193–203 (2011) 14. D.M. Taylor, S.I.H. Tillery, A.B. Schwartz, Direct cortical control of 3D neuroprosthetic devices. Science 296(5574), 1829–1832 (2002) 15. J. Vogel et al., An assistive decision-and-control architecture for force-sensitive hand–arm systems driven by human–machine interfaces. Int. J. Rob. Res. 34(6), 763–780 (2015) 16. N. Birbaumer et al., A spelling device for the paralysed. Nature 398(6725), 297–298 (1999) 17. L. Bi, X.-A. Fan, Y. Liu, EEG-based brain-controlled mobile robots: a survey. IEEE Trans. Hum. Mach. Syst. 43(2), 161–176 (2013) 18. J. Meng, S. Zhang, A. Bekyo, J. Olsoe, B. Baxter, B. He, Noninvasive electroencephalogram based control of a robotic arm for reach and grasp tasks. Sci. Rep. 6, 38565 (2016) 19. B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, K.R. Müller, Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Proc. Mag. 25, 41–56 (2008) 20. F. Lotte, M. Congedo, EEG Feature Extraction (Wiley, New York, 2016). pp 127–43 21. F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, B. Arnaldi, A review of classification algorithms for EEG-based brain–computer interfaces. J. Neural Eng. 4, R1–13 (2007) 22. C Neuper, G. Pfurtscheller, Neurofeedback training for BCI control, in Brain–Computer Interfaces: Revolutionizing Human-Computer Interaction, ed. by B. Graimann, G. Pfurtscheller, B. Allison (Springer, Berlin, 2010). pp. 65–78 23. M. Fatourechi, R. Ward, S. Mason, J. Huggins, A. Schlogl, G. Birch, Comparison of evaluation metrics in classification applications with imbalanced datasets International Conference on Machine Learning and Applications (IEEE, 2008). pp 777–82 24. H.D.N. Alves, Fault diagnosis and evaluation of the performance of the overcurrent protection in radial distribution networks based on wavelet transform and rule-based expert system, in 2015 IEEE Symposium Series on Computational Intelligence (IEEE, 2015). pp. 1852–1859 25. Y. Shi, X. Zhang, A Gabor atom network for signal classification with application in radar target recognition. IEEE Trans. Signal Process., 2994–3004 (2001) 26. A. Bruce, H.Y. Gao, Applied Wavelet Analysis with S-Plus (Springer, 1996) 27. D. Gabor, Theory of communication. Part 1: The analysis of information. J. Inst. Electr. Eng. Part III: Radio Commun. Eng. 93(26), 429–441 (1946) 28. D.M. Monro, B.G. Sherlock, Space-frequency balance in biorthogonal wavelets, in Proceedings of International Conference on Image Processing, vol. 1 (IEEE, 1997). pp. 624–627 29. L. Shen, Z. Shen, Compression with time-frequency localization filters. Wavelets and Splines, 428–443 (2006) 30. B. Boashash, N.A. Khan, T. Ben-Jabeur, Time–frequency features for pattern recognition using high-resolution TFDs: A tutorial review. Digit. Signal Proc. 40, 1–30 (2015) 31. R. San-Segundo, J.M. Montero, R. Barra-Chicote, F. Fernández, J.M. 
Pardo, Feature extraction from smartphone inertial signals for human activity segmentation. Sig. Process. 120, 359–372 (2016) 32. A.T. Tzallas, M.G. Tsipouras, D.I. Fotiadis, Automatic seizure detection based on timefrequency analysis and artificial neural networks. Comput. Intell. Neurosci. (2007) 33. U. Orhan, M. Hekim, M. Ozer, EEG signals classification using the K-means clustering and a multilayer perceptron neural network model. Expert Syst. Appl. 38(10), 13475–13481 (2011) 34. E.D. Übeyli, Combined neural network model employing wavelet coefficients for EEG signals classification. Digit. Signal Proc. 19(2), 297–308 (2009) 35. A.N. Akansu, P.A. Haddad, R.A. Haddad, P.R. Haddad, Multiresolution Signal Decomposition: Transforms, Subbands, and Wavelets (Academic Press, 2001)

36. M. Rhif, A. Ben Abbes, I.R. Farah, B. Martínez, Y. Sang, Wavelet transform application for/in non-stationary time-series analysis: a review. Appl. Sci. 9(7), 1345 (2019) 37. H. Xie, J.M. Morris, Design of orthonormal wavelets with better time-frequency resolution, in Wavelet Applications, vol. 2242 (International Society for Optics and Photonics, March 1994). pp. 878–887 38. M. Sharma, V.M. Gadre, S. Porwal, An eigenfilter-based approach to the design of timefrequency localization optimized two-channel linear phase biorthogonal filter banks. Cir. Syst. Signal Process. 34(3), 931–959 (2015) 39. R. Sharma, R. Pachori, U. Acharya, Application of entropy measures on intrinsic mode functions for the automated identification of focal electroencephalogram signals. Entropy 17(2), 669–691 (2015) 40. V. Bajaj, R.B. Pachori, Classification of seizure and nonseizure EEG signals using empirical mode decomposition. IEEE Trans. Inf. Technol. Biomed. 16(6), 1135–1142 (2011) 41. R. Ebrahimpour, K. Babakhan, S.A.A.A. Arani, S. Masoudnia, Epileptic seizure detection using a neural network ensemble method and wavelet transform. Neural Netw. World 22(3), 291 (2012) 42. K. Abualsaud, M. Mahmuddin, M. Saleh, A. Mohamed, Ensemble classifier for epileptic seizure detection for imperfect EEG data. Sci. World J. (2015) 43. E. Parvinnia, M. Sabeti, M.Z. Jahromi, R. Boostani, Classification of EEG Signals using adaptive weighted distance nearest neighbor algorithm. J. King Saud Univ. Comput. Inf. Sci. 26(1), 1–6 (2014) 44. T. Lin, P. Hao, S. Xu, Matrix factorizations for reversible integer implementation of orthonormal M-band wavelet transforms. Sig. Process. 86(8), 2085–2093 (2006) 45. A.L. Goldberger, L.A. Amaral, L. Glass, J.M. Hausdorff, P.C. Ivanov, R.G. Mark, J.E. Mietus, G.B. Moody, C.K. Peng, H.E. Stanley, PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220 (2000) 46. M. Sharma, R.S. Tan, U.R. Acharya, Automated heartbeat classification and detection of arrhythmia using optimal orthogonal wavelet filters. Inform. Med. Unlocked 16, 100221 (2019) 47. G. Chandel, P. Upadhyaya, O. Farooq, Y.U. Khan, Detection of seizure event and its onset/offset using orthonormal triadic wavelet based features. IRBM 40(2), 103–112 (2019) 48. D. Bhati, R.B. Pachori, V.M. Gadre, Optimal design of three-band orthogonal wavelet filter bank with stop band energy for identification of epileptic seizure eeg signals, in Machine Intelligence and Signal Analysis (Springer, Singapore, 2019). pp. 197–207 49. B. Benchabane, M. Benkherrat, B. Burle, F. Vidal, T. Hasbroucq, S. Djelel, A. Belmeguenai, Wavelets statistical denoising (WaSDe): individual evoked potential extraction by multiresolution wavelets decomposition and bootstrap. IET Signal Proc. 13(3), 348–355 (2019) 50. E. Niedermeyer, The normal EEG of the waking adult, in Electroencephalography: Basic Principles, Clinical Applications, and Related Fields, vol. 167 (2005). pp. 155–164 51. T. Lin, S. Xu, Q. Shi, P. Hao, An algebraic construction of orthonormal M-band wavelets with perfect reconstruction. Appl. Math. Comput. 172(2), 717–730 (2006) 52. K.P. Thomas, C. Guan, A.P. Vinod, C.T. Lau, K.K. Ang, A new discriminative common spatial pattern method for motor imagery brain–computer interfaces. IEEE Trans. Biomed. Eng. 56(11), 2730–2733 (2009) 53. W. Wu, Z. Chen, X. Gao, Y. Li, E.N. Brown, S. Gao, Probabilistic common spatial patterns for multichannel EEG analysis. IEEE Trans. Pattern Anal. 
Mach. Intell. 37(3), 639–653 (2015) 54. S.H. Park, D. Lee, S.G. Lee, Filter bank regularized common spatial pattern ensemble for small sample motor imagery classification. IEEE Trans. Neural Syst. Rehabil. Eng. 26, 2 (2018) 55. T. Michael, et al., Review of the BCI competition IV. Front. Neurosci. 6, 55 (2012) 56. P. Chaudhary, R. Agrawal, A comparative study of linear and non-linear classifiers in sensory motor imagery based brain computer interface. J. Comput. Theor. Nanosci. 16(12), 5134–5139 (2019) 57. P. Chaudhary, R. Agrawal, Emerging threats to security and privacy in brain computer interface. Int. J. Adv. Stud. Sci. Res. 3(12) (2018)

Performance of RPL Objective Functions Using FIT IoT Lab

Spoorthi P. Shetty and Udaya Kumar K. Shenoy

Abstract The Internet of Things (IoT) is a system which connects many heterogeneous devices and finds application in several areas. The network used in IoT is the Low Power and Lossy Network (LLN), because the devices used in IoT are power constrained. LLN uses the Routing Protocol for Low Power and Lossy Networks (RPL) as its routing protocol, which is the IETF-standardized protocol for LLN. RPL constructs a Destination Oriented Directed Acyclic Graph (DODAG) to select the appropriate path to the destination. In RPL, the DODAG is constructed based on an objective function, so the selection of the best objective function plays a major role in RPL. The main metric for selecting the objective function here is power, as our focus is on the design of power-efficient IoT. The most widely used objective functions in RPL are OF0, which uses the hop count metric, and MRHOF, which uses the expected transmission count (ETX) metric. In the existing research, the relative merits of these two objective functions are established using only simulation studies, not real testbed experiments; hence, it is necessary to conduct experiments on a real testbed to assess the suitable objective function. In this paper, experiments are conducted in the FIT IoT Lab to select the best objective function with respect to the power parameter. From the results, it is identified that both OF0 and MRHOF perform equally in most cases, and in some cases MRHOF is more power efficient than OF0. The objective functions are also evaluated for single- and multi-sink scenarios; the experiments show that increasing the number of sink nodes does not affect the power consumption.

1 Introduction

The LLN is a network of many embedded devices that are limited in storage, power, and resource management. These devices are connected to each other by a variety of links. LLNs have a wide range of applications, including industrial surveillance, building automation, environment monitoring, energy management, health care, etc. RPL is one of the most powerful routing protocols for the LLN network. RPL uses the Destination Oriented DAG (DODAG) topological concept to construct a tree structure, and this DODAG uses a specific objective function for tree construction. Selecting the best objective function thus plays a vital role in building the DODAG and in making the routing protocol more efficient and effective. The most widely used objective functions are OF0, which uses the hop count metric, and MRHOF, which uses the ETX metric. In the existing research, the comparison of different objective functions is performed using simulation, but testing on a real testbed gives a better testimony of the objective function. This paper is organized as follows: Sect. 2 gives the motivation of the work and covers the literature survey, Sect. 3 introduces the testbed configuration, Sect. 4 explains the outcomes and gives the framework to predict the nature of the network, and Sect. 5 provides the conclusion.

2 Literature Survey

2.1 Low Power and Lossy Network

Low Power and Lossy Networks are standardized by an IETF working group, which works mainly on standardizing the routing protocol for LLN. The salient features of an LLN are that its nodes can carry only limited data with little energy [1]. The nodes used in this network are powered with little energy and can store only a limited amount of data, while error rates and link failures are high and the packet delivery ratio is low [2]. The LLN also supports different types of traffic flows, with source and destination as either point or multipoint; because of this support for different types of traffic flows, the LLN network is well suited for IoT.

2.2 Routing Protocol for Low Power and Lossy Networks

RPL is one of the effective protocols for the IPv6 LLN network. The special features of RPL are: (i) it is highly adaptable to any network circumstances; (ii) when routes are not accessible, it provides an alternative route by default.

(iii) It uses the DODAG topological notion to construct a tree structure. (iv) It comprises two functions, route discovery and route maintenance, where route discovery creates new paths between the nodes and route maintenance keeps the created routes alive. RPL uses the concept of the DODAG, which builds on the Directed Acyclic Graph (DAG), to construct the routing structure: a logical routing tree is constructed over the physical network, based on the objective function. The DODAG is periodically rebuilt and modified using the trickle timer.
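
To make the role of the objective function concrete, here is a minimal sketch of how a node might rank candidate parents under the two objective functions. The constant, the candidate values, and the simplified rules are illustrative assumptions; real OF0 (RFC 6552) and MRHOF (RFC 6719) include hysteresis and further configuration terms:

```python
# Minimal sketch of RPL parent selection under the two objective functions.
MIN_HOP_RANK_INCREASE = 256   # default RPL rank step per hop

def of0_rank(parent_rank):
    """OF0: a node's rank grows by a fixed step per hop, so the preferred
    parent is the candidate advertising the lowest rank (fewest hops)."""
    return parent_rank + MIN_HOP_RANK_INCREASE

def mrhof_path_etx(parent_path_etx, link_etx):
    """MRHOF with the ETX metric: path cost through a parent is the
    parent's advertised path ETX plus the ETX of the link to it."""
    return parent_path_etx + link_etx

# Candidate parents: (advertised rank, advertised path ETX, measured link ETX)
candidates = [(512, 2.0, 1.8), (768, 1.5, 1.1), (512, 2.5, 1.0)]

best_of0 = min(candidates, key=lambda c: of0_rank(c[0]))
best_mrhof = min(candidates, key=lambda c: mrhof_path_etx(c[1], c[2]))
print(best_of0, best_mrhof)   # the two rules can prefer different parents
```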

2.3 Comparison of RPL for Varied Topology

There has been a lot of research on RPL with stable IoT nodes, demonstrating that RPL is an efficient routing protocol for IoT networks with stable nodes. The performance of the routing protocol using RPL objective functions is considered by Spoorthi et al. [3]. In Long et al. [4] and Gnawali et al. [5], the performance of the Collection Tree Protocol (CTP) is compared with RPL, showing how RPL performs better with respect to scalability. The results of CTP and RPL are compared on the parameters of packet reception ratio and power; the results show that CTP performs better in sparse networks, while RPL works well in dense networks with more data traffic. The limitation of this work is that the researchers did not identify the most suitable objective function in RPL for a stable topology. In Qasem et al. [6], the working of the objective functions is evaluated using simulation over random and grid topologies, with the power consumption of the IoT network calculated based on the RX (receive duty cycle) value. The experiments identify that for a 60% RX value, both objective functions perform well for power and PDR; it is also noted that, in some scenarios, the performance of MRHOF is superior to OF0 in random and grid topologies. In [7], the working of the objective functions is compared through simulation, showing that OF0 typically performs better than MRHOF in terms of power consumption and convergence time for a static grid topology. In [8], the authors focus on the power consumption and Packet Delivery Ratio (PDR) metrics for a stable network, with simulations performed under two topologies, random and fixed. The results show that with the OF0 objective function, PDR is higher in low-density networks, while with MRHOF, power is used more efficiently in dense networks. Lam Nguyen et al. [4] addressed the load balancing problem and evaluated the skewness of the DODAG both via numerical simulations and via an actual large-scale testbed, proposing a solution called SB-RPL, which aims to obtain a balanced, large-scale distribution of workload between the nodes in an LLN. They implemented SB-RPL in ContikiOS and conducted an extensive evaluation using computer simulation and a large-scale real-world testbed.

They also compared their solution with the current objective functions, mainly on the parameter of load balancing, but not on power. It can be noted that in all of the above papers the comparison of the objective functions is done mainly using a simulator, and in some papers OF0 performs better while in others MRHOF performs well. The main criterion for selecting the best objective function is power efficiency, because RPL is the protocol mainly used in LLN networks. Hence, it is important to check the working of the objective functions in a real environment using a testbed. The parameter considered in this test is power, which makes this work unique.

3 Experiment Details

The experiment is carried out with the aim of testing the performance of the RPL objective functions for different numbers of nodes. Our main objective in this paper is to evaluate the performance of the OF0 and MRHOF objective functions with the RPL protocol in the FIT IoT Lab.

3.1 FIT IoT Lab Setup

In this part, we present our study on the FIT IoT Lab testbed. In our experiments, we used the platform installed at the Lille site, France: 40 M3 (ARM Cortex-M3) nodes contributed by the FIT IoT Lab testbed, as described in Table 2. The topology includes one sink located at the center and 40 randomly placed sensor nodes, each generating UDP packets at a predefined time interval. The M3 node has an ARM Cortex-M3 microcontroller, 64 kB of RAM, an IEEE 802.15.4 AT86RF231 radio, a rechargeable 3.7 V LiPo battery, and several types of sensors. In order to construct the multi-hop topology, the transmission power is set to −17 dBm, as in the FIT IoT Lab tutorial. The hardware parameters are described in Table 1.

Table 1 Hardware parameters

Antenna model: Omni-directional
MAC: 802.15.4 beacon enabled
Radio chip: TI CC2420
Radio propagation: 2.4 GHz
Transmission power: −17 dBm
RX RSSI threshold: −69 dBm

Table 2 FIT IoT Lab experimental setup

Environment: Indoor
Network scale: 40 nodes and 1 sink
Node placement: Uniform random
Deployed nodes: 41 random nodes
Platform: ContikiOS/M3 Cortex ARM
Duration: 15 min per instance
Application traffic: UDP/IPv6 traffic
Payload size: 16 bytes
Number of hops: Multihop
Embedded network stack: ContikiMAC
Compared objective functions: RPL (OF0, MRHOF)

4 Result

4.1 Comparison of OF0 and MRHOF Using Single Sink

As an initial step, the OF0 and MRHOF objective functions are compared in terms of power for a single sink and a varied number of sender nodes, as shown in Table 3 and Fig. 1. It is noted that OF0 consumes less power for 20 and 40 sender nodes, and MRHOF consumes less power for 10 and 30 sender nodes. Hence, using a single sink, one cannot conclude which objective function is more power efficient.
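
A minimal sketch of how such per-configuration averages can be computed from per-node consumption logs; FIT IoT Lab provides consumption monitoring, but the file name and column names below are assumptions made for illustration, not the lab's actual export format:

```python
import pandas as pd

# Hypothetical per-node power log (columns assumed for illustration)
df = pd.read_csv("consumption.csv")   # node_id, objective, n_nodes, power_w

# Mean power per objective function and network size, as in Tables 3 and 4
summary = (df.groupby(["objective", "n_nodes"])["power_w"]
             .mean()
             .round(3)
             .unstack("objective"))
print(summary)   # rows: 10/20/30/40 nodes; columns: MRHOF, OF0
```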

Table 3 Comparison of OF0 and MRHOF for single sink

Number of nodes | OF0 | MRHOF
10 | 0.162 | 0.161
20 | 0.161 | 0.162
30 | 0.162 | 0.160
40 | 0.160 | 0.161

Fig. 1 Comparison of OF0 with MRHOF for single sink

Table 4 Comparison of OF0 and MRHOF for multi sink

Number of nodes | OF0 | MRHOF
10 | 0.162 | 0.159
20 | 0.161 | 0.160
30 | 0.162 | 0.161
40 | 0.162 | 0.161

Fig. 2 Comparison of OF0 with MRHOF for multi sink

4.2 Comparison of OF0 and MRHOF Using Multi Sink

As shown in Table 4 and Fig. 2, the objective functions are compared for multi sink with a varied number of nodes. From the experiments, it is noted that MRHOF is more power efficient than OF0.

4.3 Analyzing the Performance of OF0 Using Single Sink and Multi Sink

In Table 5 and Fig. 3, the performance of the OF0 objective function is compared for a single sink and multi sink. It is noted that an increase in the number of sinks does not affect the power: when the number of nodes is 10, 20, or 30, the performance of OF0 is the same in both the single-sink and multi-sink cases. But when the number of nodes is 40, OF0 consumes less power in the single-sink case and more power in the multi-sink case.

Table 5 Comparison of OF0 for single and multi sink

Number of nodes | Single sink | Multi sink
10 | 0.162 | 0.162
20 | 0.161 | 0.161
30 | 0.162 | 0.162
40 | 0.160 | 0.162

Fig. 3 Comparison of OF0 for single and multi sink

Table 6 Comparison of MRHOF for single and multi sink

Number of nodes | Single sink | Multi sink
10 | 0.161 | 0.159
20 | 0.162 | 0.160
30 | 0.160 | 0.161
40 | 0.161 | 0.161

Fig. 4 Comparison of MRHOF for single and multi sink

4.4 Analyzing the Performance of MRHOF Using Single and Multi Sink

The performance of MRHOF is compared for both single sink and multi sink, as shown in Table 6 and Fig. 4. From the results, it is noted that for the sparse network (10 and 20 sender nodes), MRHOF consumes less power with a multi sink than with a single sink. In the case of the dense network (30 and 40 sender nodes), MRHOF consumes almost the same power in the single-sink and multi-sink cases.

5 Conclusion

The paper has evaluated the two main objective functions of RPL using the FIT IoT Lab. From the experiments, it is noted that both OF0 and MRHOF perform equally for a single sink.

MRHOF, however, is more power efficient than OF0 in the case of a multi sink. The performance of the network is also checked for both single sink and multi sink, and it is observed that a change in the number of sinks does not affect the power consumption for either objective function. In future work, the experiment can be conducted with mobile nodes to analyze the working of the objective functions.

References 1. H.-S. Kim, J. Ko, D.E. Culler, J. Paek, Challenging the ipv6 routing protocol for low-power and lossy networks (RPL): a survey. IEEE Commun. Surv. Tutor 19(4), 2502–2525 (2017) 2. G.G. Krishna, G. Krishna, N. Bhalaji, Analysis of routing protocol for low-power and lossy networks in iot real time applications. Procedia Comput. Sci. 87, 270–274 (2016) 3. S.U.K. Shetty Spoorthi, Performance of static IoT networks using RPL objective functions. IJRTE 8, 8972–8977 (2019) 4. N.T. Long, N. De Caro, W. Colitti, A. Touhafi, K. Steenhaut, Comparative performance study of RPL in wireless sensor networks, in 19th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT) (IEEE, 2012). pp. 1–6 5. O. Gnawali, R. Fonseca, K. Jamieson, D. Moss, P. Levis, Collection tree protocol, in Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems (ACM, 2009). pp. 1–14 6. M. Qasem, H. Altawssi, M.B. Yassien, A. Al-Dubai, Performance evaluation of RPL objective functions, in 2015 IEEE International Conference on (CIT/IUCC/DASC/PICOM) (IEEE, 2015). pp. 1606–1613 7. W. Mardini, M. Ebrahim, M. Al-Rudaini, Comprehensive performance analysis of RPL objective functions in iot networks. Int. J. Commun. Netw. Inf. Secur. 9(3), 323–332 (2017) 8. Q.Q. Abuein, M.B. Yassein, M.Q. Shatnawi, L. Bani-Yaseen, O. Al-Omari, M. Mehdawi, H. Altawssi, Performance evaluation of routing protocol (RPL) for internet of things. Perform. Eval. 7(7) (2016)

Predictive Analytics for Retail Store Chain

Sandhya Makkar, Arushi Sethi, and Shreya Jain

Abstract Purpose: Forecasting techniques are used in real-world systems for better decision making. The main purpose of this research paper is to explore the techniques used by retail store chains for a variety of products at various store locations, by working on a retail chain's dataset. Methodology: A public dataset of a retail store chain is taken, which contains various details regarding weekly sales. With the help of Python, the data is handled and analyzed, and a model is created, tested, and then used to forecast the future. Findings: Understanding the various techniques used for forecasting multiple products at multiple places and selecting the best technique based on accuracy.

Keywords Forecasting · Exponential smoothing · Random forest · Regression

1 Introduction

If information is the oil of the twenty-first century, then analytics is surely the internal combustion engine. One of the important tools of analytics which has gained the attention of business organizations over the years is forecasting. Forecasting can be termed the process of estimating a future event which is out of the control of the business and which becomes a basis for decision making and managerial planning. An organization cannot control its future circumstances, but their impact can be reduced with proper management and planning. Forecasting is one such step towards reducing the impact of any future uncertainty.

It has been consistently recognized as a crucial capability for business planning and management [1]. Forecasting is all pervasive: it is needed at every level, whether manufacturing, sales, inventory management, or service provision, as demand forecasting is necessary to identify market opportunities, enhance customer satisfaction, schedule production, and anticipate financial requirements. Forecasting ensures the optimum utilization of resources, as it identifies trends in future sales based upon past sales; raw material can then be purchased based on the sales forecast, which also reduces the bullwhip effect.

Forecasting plays a very crucial role in different functional areas of business management, because every organization wants a forward estimate of the market so that it can plan its internal functions accordingly: if managers make decisions without knowing what is going to happen in the near future, they cannot maintain a balance between the inventory needed and the amount of sales, and cannot make investments without knowing the expected profit. Forecasting plays a crucial role in the formulation of strategic and tactical decisions for the effective and efficient management of business [2]. As organizations grow larger and larger, the magnitude of every decision increases, so organizations are moving towards a more systematic approach, of which forecasting is considered an important part [3]. Predicting a future event involves a lot of complexity, as it depends upon internal factors, such as the variety of products offered, the life span of the products, and product usage, and external factors, such as the market in which the firm exists, competing firms, and the market segment, among many others [4]. But the future belongs to those who prepare themselves for it.

2 Forecasting as Distinct from Planning

Forecasting and planning are two different functions of the firm. Forecasting is generally used to estimate possible future demand based on past records and a certain set of assumptions [5]. Planning, on the other hand, is used to decide the steps to be taken in light of the results of forecasting [6]. Once the results of forecasting are before the firm, strategies need to be made for how to act on those results: forecasting gives the situation, and deciding what action plan is needed for that situation is the function of planning. One important point managers need to take into consideration is what impact the planning will have on the forecast result, and how the results of the forecast may best be combined into planning [7].

3 Principles of Forecasting

• Accuracy of the forecast: In most businesses, a minimal amount of error is reserved and tolerated, and the percentage of this error varies from company to company; but the error should not exceed permissible limits, and standards of accuracy must be maintained.
• Impact of time horizon: As the time horizon increases, the accuracy of the forecast decreases, because over a longer time span there is a greater chance of new patterns emerging which impact the result of forecasting [8].
• Technological change: Forecasting works best in industries in which technological change is somewhat constant; in a dynamic industry, it becomes difficult to form patterns, which impacts the result of forecasting.
• Barriers to entry: Forecasts are more accurate when there are more barriers to entry, because there are fewer competitors to disturb the established patterns.
• Distribution of information: The faster the dissemination of information, the less competitive advantage forecasting gives the firm, because competitors can make use of the same information.
• Elasticity of demand: The more inelastic the demand, the greater the accuracy; for example, the demand for necessities can be predicted easily compared to the demand for automobiles, which is elastic and hence harder to predict.
• Consumer versus industrial goods: Forecasting accuracy is better for consumer goods than for industrial goods, because industrial goods are sold to only a few customers, and if some of these are lost, there is a huge loss [9].
• Aggregate versus disaggregate: Aggregate forecasts for a product family give more accurate results than forecasts for single items, because the patterns of single items change much faster than the patterns of aggregate groups.

4 Multivariate Time Series Data

Time is the most important asset a firm can have, and this statement applies especially to a multistore chain, which needs to align its activities according to the season and identify the best times at which it can boost sales. In a multivariate time series, there are a number of variables whose values depend upon time [10]. The variables depend not only on past data but also on each other, and this inter-dependency is used in the forecasting.

Multivariate time series forecasting of a retail store chain

Retail is one of the most important business domains, facing numerous optimization problems, such as optimal prices, stock levels, discounts allowed, and recommendations, which can now be easily solved with various data analysis methods. Data science and data mining applications can be used even in the forecasting of sales, which ultimately helps in the proper optimization of price, cost, inventory, etc. [5]. The accurate prediction of sales is a challenging task in today's competitive and dynamic business environment, and it can help the retailer in inventory management and in increasing profits. For the purpose of understanding sales forecasting for the retail sector, a dataset of a global retail store chain is taken, which has 45 different outlets at different locations all over the USA and 99 different departments. The data is a public dataset taken from the Google toolkit, and it also contains the data points for which forecasting is to be done. The data is multivariate because the prediction is to be done for different stores and departments in a time series. The whole forecasting is carried out in Python, and part of the data exploration is done using Tableau.

4.1 Data Understanding

The dataset ranges over 3 years and has various variables, mainly classified into 3 different categories:

Sales Description contains details about the sales of products under different departments in different stores on a weekly basis:

Store ID: The stores are assigned IDs ranging from 1 to 45.
Department number: The departments are numbered from 1 to 99 for every store.
Is Holiday: This variable is a Boolean, i.e., it takes the values true and false; true states that the particular week contained a holiday. The main holidays in that period were the Super Bowl (12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13), Labor Day (10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13), Thanksgiving (26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13) and Christmas (31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13).
Sales: The sales figure of individual departments in individual stores on a weekly basis.

Store Description includes the details related to all the 45 stores:

Type: All the 45 stores are divided into 3 categories: A, B and C.
Size: The size of each store.

Location and Weekly Description covers the description of each location, such as temperature, fuel price, etc.:

Temperature: Average temperature of the location in that particular week.
Fuel price: The average weekly fuel price for the different stores.
Markdown: 5 different types of markdowns on the prices in a specific week, mostly during the holidays.
CPI: Consumer Price Index.

Figure 1 shows the details of all the variables recorded, including the data type of each individual variable.

4.2 Data Exploration

The data has 14 variables in total, which will be used for sales prediction. But before carrying out prediction with the model, a proper understanding of the data is necessary; this is known as data exploration.

Fig. 1 Descriptive Analysis of Variables

4.2.1 Multivariate Analysis

A correlation heat map is created in Fig. 2, which helps in visually understanding whether there is any strong correlation between the variables. The heat map shows that the maximum correlation among variables is 0.3, which is not very high. Markdown 2 and Markdown 3 have a positive correlation with IsHoliday, which clearly shows that markdowns are applied during holiday weeks. The size of the store is correlated with weekly sales, which shows that the larger the store, the higher the sales. CPI is known to have a negative relation with unemployment: the higher the CPI, the lower the unemployment rate of a state. The variables are correlated, but no correlation is high enough that any variable needs to be removed.
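As a minimal sketch of how such a correlation heat map can be produced with pandas and seaborn (the file name "retail_sales.csv" is a hypothetical placeholder for the combined dataset):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name for the merged sales/stores/features data.
df = pd.read_csv("retail_sales.csv")

# Pearson correlation over the numeric columns only.
corr = df.select_dtypes("number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between variables")
plt.tight_layout()
plt.show()
```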

Fig. 2 The correlation between different variables

4.2.2 Univariate Analysis

The variables are now individually explored for better understanding. Every variable is plotted against the target variable, i.e., weekly sales, to notice any specific observations. The two variables which showed important observations are 'type of store' and 'holiday', as shown in Fig. 3.

Type of store: There are 3 types of stores, A, B and C, into which all the 45 stores are divided. After plotting this against weekly sales, it is observed that the number and amount of sales for type C are much lower than for the other two.

Holiday: 0 stands for no holiday and 1 for a holiday week. It can be clearly seen in Fig. 4 that the weeks with holidays have a larger number of sales, and the amount of sales is also large compared to the weeks without holidays. It can be observed in Fig. 5 that the Christmas weekend had the maximum sales of more than 240,000.

Fig. 3 Bivariate analysis between weekly sales and type of store

Fig. 4 Holiday and weekly sales


Fig. 5 Count of weeks on which the sales is more than 240,000

4.3 Data Preprocessing

The data should be preprocessed to convert the raw data into an understandable form for the model. This will help in the proper implementation of the model and will therefore give much more efficient and accurate results.

4.3.1 Null Value Treatment

The data has various null values, especially in the markdown columns, as seen in Fig. 6, which need to be treated. A null value in a markdown column indicates that no markdown was available on that date, so it can be written as zero. Even in the weekly sales column there are 115,064 null data points; these are the data points which need to be predicted. For the time being, these are also filled with 0.

Fig. 6 The count of null values in the data
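A minimal pandas sketch of the null-value treatment described above; the column names (MarkDown1 to MarkDown5, Weekly_Sales) and file name are assumptions for illustration:

```python
import pandas as pd

df = pd.read_csv("retail_sales.csv")  # hypothetical file name

# Missing markdowns mean "no markdown offered that week" -> fill with 0.
markdown_cols = ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]
df[markdown_cols] = df[markdown_cols].fillna(0)

# Null weekly sales are the rows to be forecast; fill with 0 for now.
df["Weekly_Sales"] = df["Weekly_Sales"].fillna(0)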

4.3.2 Creating Dummies

Holiday: The IsHoliday variable contains Boolean data, false and true, so a dummy is created for it before running the model. 0 is assigned to true, i.e., a holiday week, and 1 is assigned to a no-holiday week.

Month: For month there are 12 dummies, one for every month. On the basis of the date of sale, 1 is assigned to the column of that month and the others are left at 0.

Black Friday: If the week is Black Friday, then 1 is assigned to the Black_Friday_yes column and 0 to the other, and vice versa.

Pre-Christmas: Sales during the Christmas period are high compared to other weeks, so a dummy is created to classify whether a sale falls in the Christmas period or not.
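A short pandas sketch of this dummy creation; the column names and the holiday dates below are illustrative assumptions, not values confirmed by the paper:

```python
import pandas as pd

# df: the preprocessed DataFrame from the previous sketch.
df["Date"] = pd.to_datetime(df["Date"])  # hypothetical column name

# One dummy column per month, derived from the sales date.
df = pd.concat([df, pd.get_dummies(df["Date"].dt.month, prefix="month")], axis=1)

# Boolean IsHoliday flag -> isHoliday_True / isHoliday_False dummies.
df = pd.concat([df, pd.get_dummies(df["IsHoliday"], prefix="isHoliday")], axis=1)

# Flags for Black Friday and pre-Christmas weeks (dates are illustrative only).
black_fridays = pd.to_datetime(["2010-11-26", "2011-11-25", "2012-11-23"])
pre_christmas = pd.to_datetime(["2010-12-24", "2011-12-23", "2012-12-21"])
df["Black_Friday_yes"] = df["Date"].isin(black_fridays).astype(int)
df["Pre_christmas_yes"] = df["Date"].isin(pre_christmas).astype(int)
```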

4.4 Model Implementation

4.4.1 Random Forest

Random Forest is an ensemble bagging learning method, used especially for classification and regression. It comprises several decision trees. In classification, each individual tree in the random forest gives a class prediction, and the class with the most votes becomes the prediction of the model; for regression, the mean prediction is taken. The data for this retail store chain is multivariate, i.e., there are various variables to help the prediction, and the prediction is to be done for different departments, stores, and dates, so normal time series forecasting cannot be used. Therefore, the random forest algorithm is used here to show how multivariate data forecasting is done.

4.4.2 Lagged Values

Random Forest evaluates data points without combining information from the past with the present. Because of this, lagged variables are created; in this case, a lagged sales variable is created, which helps bring a pattern from the past into the evaluation of the present. The lagged sales variable is created considering a lag of 1 week.
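A minimal sketch of creating the 1-week lag feature with pandas, assuming hypothetical column names Store, Dept, Date and Weekly_Sales:

```python
import pandas as pd

# df: the preprocessed DataFrame from the earlier sketches.
df = df.sort_values(["Store", "Dept", "Date"])

# Last week's sales for the same store and department.
df["LaggedSales"] = df.groupby(["Store", "Dept"])["Weekly_Sales"].shift(1)

# Week-over-week difference, analogous to the Sales_dif feature listed below.
df["Sales_dif"] = df["Weekly_Sales"] - df["LaggedSales"]
```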

4.4.3 Selected Variables

‘LaggedSales’, ‘Sales_dif’, ‘LaggedAvailable’, ‘CPI’, ‘Fuel_Price’, ‘isHoliday_False’, ‘isHoliday_True’, ‘Temperature’, ‘Unemployment’, ‘MarkDown1’, ‘MarkDown2’, ‘MarkDown3’, ‘MarkDown4’, ‘MarkDown5’, ‘Size’, ‘Pre_christmas_no’, ‘Pre_christmas_yes’, ‘Black_Friday_no’, ‘Black_Friday_yes’, ‘md1_present’, ‘md2_present’, ‘md3_present’, ‘md4_present’, ‘md5_present’.


Fig. 7 Output of the model

These are the 24 variables which are considered for use in the model.

4.4.4 Train and Test Split

The dataset is finally divided into historic and forecasting parts. In the beginning, the forecasting data was combined with the given historic data and named the test set. The historic data is further divided into 80% training and 20% testing to obtain more accurate results.

4.5 Result

Random Forest is first run on the training set, with the number of trees set to 20, followed by a run on the whole model, and then it is finally used for the prediction. The graphs in Fig. 7 represent the distribution of the predicted values, first against the weekly sales, and second as a probability distribution. As the distribution is condensed, it clearly shows that the error is minimized. The final forecasts of the weekly sales of the retail store for future dates, on the basis of the model created, were 25,978.1, 26,966.8, 27,052.5, 54,787.7 and 54,313.9 for 5 consecutive forecast periods.
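A minimal scikit-learn sketch of the training and prediction flow described above (20 trees, 80/20 split); the feature subset and column names are assumptions based on Sect. 4.4.3, not the paper's exact code:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Subset of the 24 selected variables (names follow Sect. 4.4.3).
features = ["LaggedSales", "Sales_dif", "CPI", "Fuel_Price", "Temperature",
            "Unemployment", "Size", "MarkDown1", "MarkDown2", "MarkDown3",
            "MarkDown4", "MarkDown5"]

historic = df[df["Weekly_Sales"] != 0].dropna(subset=features)  # known sales
forecast = df[df["Weekly_Sales"] == 0].fillna(0)                # rows to predict

X_train, X_test, y_train, y_test = train_test_split(
    historic[features], historic["Weekly_Sales"],
    test_size=0.2, random_state=42)                             # 80/20 split

model = RandomForestRegressor(n_estimators=20, random_state=42)  # 20 trees
model.fit(X_train, y_train)

print("Test MAE:", mean_absolute_error(y_test, model.predict(X_test)))
future_sales = model.predict(forecast[features])                 # final forecast
```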

5 Conclusion

In the past, forecasting was the work of mathematicians or consultants, but with changing times and technology, more and more senior managers are trying to use these techniques for the long-term planning of their organizations. This helps in reducing the expense and time of bringing in top consultants for small jobs. But there are still various complexities involved, such as


the type of technique to be selected for a particular purpose, the amount of data required, etc. Forecasting methods still cannot be perfect under all conditions. Even after applying the appropriate technique, one should properly monitor and control the process so as to avoid any error. Forecasting techniques are rewarding for managers, but they need to tackle all the challenges coming their way.

References
1. J.T. Mentzer, R. Gomes, R.E. Krapfel, Physical distribution service: A fundamental marketing concept? JAMS 17, 53–62 (1989)
2. L. Cassettari, I. Bendato, M. Mosca, R. Mosca, A New Stochastic Multi Source Approach to Improve the Accuracy of the Sales Forecasts (University of Greenwich, 2016)
3. C. Maritime, Kent, UK, K.K., Intelligent techniques for forecasting multiple time series in real-world systems, in NW School of Business and Economics (Fayetteville State University, North Carolina, USA, 2014)
4. D. Waddell, A.S. Sohal, Forecasting: the key to managerial decision making. Manag. Decis. 32(1), 41–49 (1994)
5. R. Fildes, T. Huang, D. Soopramanien, The value of competitive information in forecasting FMCG retail product sales and the variable selection problem. Eur. J. Oper. Res. 237, 738–748 (2014)
6. I. Alon, M.H. Qi, R.J. Sadowski, Forecasting aggregate retail sales: A comparison of artificial neural networks and traditional methods. J. Retailing Consum. Serv. 8(3), 147–156 (2001)
7. N.S. Arunraj, D. Ahrens, A hybrid seasonal autoregressive integrated moving average and quantile regression for daily food sales forecasting. Int. J. Econ., 321–335 (2015)
8. A. Chong, B. Li, E. Ngai, E. Ch'Ng, F. Lee, Predicting online product sales via online reviews, sentiments, and promotion strategies: a big data architecture and neural network approach. Int. J. Oper. Prod. Manag. 36, 358–383 (2016)
9. K.J. Ferreira, B.H.A. Lee, D. Simchi-Levi, Analytics for an online retailer: Demand forecasting and price optimization. Manuf. Serv. Oper. Manag. 18, 69–88 (2016)
10. M.D. Geurts, J.P. Kelly, Forecasting retail sales using alternative models. IJF 2, 261–272 (1986)

Object Identification in Satellite Imagery and Enhancement Using Generative Adversarial Networks Pranav Pushkar, Lakshay Aggarwal, Mohammad Saad, Aditya Maheshwari, Harshit Awasthi, and Preeti Nagrath

Abstract Ship detection from satellite images is an essential application for sea security, port traffic control, disaster management, and rescue operations, which incorporate traffic surveillance, illicit fisheries, oil spills, and observation of ocean contamination. Significant challenges for this method include clouds, tidal waves, and even the variability of ship sizes. In this paper, we introduce a framework for ship detection from low-resolution satellite images using the best combination of Generative Adversarial Networks (GANs) and Convolutional Neural Networks (CNNs) with respect to image enhancement, training time reduction, and high accuracy. The operations of the proposed method have been performed on the Kaggle open-source ("Ships in Satellite Imagery") dataset.
Keywords Ship detection · Satellite imagery · Generative adversarial networks · Convolutional neural networks

P. Pushkar (B) · L. Aggarwal · M. Saad · A. Maheshwari · H. Awasthi · P. Nagrath Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, Delhi, India e-mail: [email protected] L. Aggarwal e-mail: [email protected] M. Saad e-mail: mo[email protected] A. Maheshwari e-mail: [email protected] H. Awasthi e-mail: [email protected] P. Nagrath e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_57


1 Introduction

The greatest challenge in the satellite imaging domain is how to cope with a small dataset and a limited amount of annotated data, especially while employing unsupervised [1] learning algorithms, which generally require a large number of clear and enhanced training [2] samples. Considerable efforts have been made during the last decades to design and develop different algorithms and tools for ship detection in satellite/aerial imagery. This study presents a review of the recent progress in the field of ship detection from satellite imagery using deep learning [3]. We also use different GAN [4] models to speed up the training process (at the same time enhancing the images and performing noise reduction), and finally we compare the performance of each on our dataset. Ship recognition is a significant and vast research area in the field of Computer Vision (along with various Deep Learning methods) that has recently come into the limelight, and it is believed that these methods can come in very handy in different applications like identification of illicit oil spills, monitoring oceanic traffic in fisheries, security, military movements [5], etc. Our Kaggle open-source dataset ("Ships in Satellite Imagery") consists of images of various ships collected over the San Francisco Bay and San Pedro Bay areas of California. We apply a Convolutional Neural Network (CNN) [6] and a DCGAN [7] (Deep Convolutional Generative Adversarial Network), as well as an RCNN, and DCGAN gave us the most accurate results. Objectives:

• To remove ship-like images which are formed by clouds and tidal waves.
• To enhance the image, since a moving ship leaves a wave line behind it.
• To increase the size of the image so that the object is properly identified.
• To compare with previous methodologies for ship detection.

The remainder of the paper is organized as follows: Sect. 2 states the related works, Sect. 3 describes the methodology of the proposed work, Sect. 4 presents the experimental set-up and results, Sect. 5 contains the discussion, and Sect. 6 the conclusion.

2 Related Works

Early works utilized 2D, 3D, and 5D top-view pictures, along with top- and side-view images taken by synthetic aperture radar [8], relying on explicit models together with image descriptors, as a result of the unavailability of high-resolution images. For instance, in [9–11] the authors addressed ship detection as a 3-D object detection problem. Visual saliency estimation is one of the pre-attentive mechanisms by which people focus their eyes on regions of a scene with salient content; candidate regions contain both genuine and false targets. In these regions, complete profiles for each


suspected target are still to be confirmed. Given the presumed ship centre, the region is rotated by aligning the ship axis to the vertical direction, and then the S-HOG [12] descriptor is computed. Based on this feature, a discrimination procedure decides whether the suspected target is a genuine ship. Later, several ship detection methods used combinations of different features in order to capture different characteristics of ships. Christiana Corbane [10] detected ships by assigning membership probabilities to the detection results in a post-processing stage; the final decision is thus left to the operator, who can then validate the detection results based on experience. Further research used a greedy non-maximum suppression algorithm [15] and a clustering, appearance-based [12] approach to group multiple detections. Although it is widely known that using more than a single feature [18] improves the overall performance of a detection algorithm, the performance is highly dependent on the choice of features. Other ship detection methods operate on synthetic aperture radar (SAR) [10] to mitigate the problem of night-time imaging. Several papers in the open literature treat techniques for detecting ship targets in SAR data. Inggs and Robinson (1999) investigated the use of radar range profiles as target signatures for the identification of ship targets, and neural networks for the classification of these signatures. Tello et al. (2004) used the wavelet transform, by means of multiresolution analysis, to examine multiscale discontinuities of SAR images and thereby detected ship targets against a particularly noisy background. Given the long history and ongoing interest, there is an extensive literature on algorithms for ship detection. In terms of operational performance, Zhang et al. (2006) reported limitations of SAR in detecting smaller ships in inland waters. Furthermore, because of the presence of speckle and the small size of the targets relative to the sensor's spatial resolution, the automatic interpretation of SAR images is often difficult, even though undetected vessels are sometimes visible to the eye. The second strategy for ship detection lies in optical remote sensing, which has been investigated since the launch of Landsat in the 1970s. In computer vision, models often fail to recognize or localize objects in low-resolution images. To handle this issue, SRGAN [13] is used. It comprises two sub-networks: a super-resolution sub-network, built by stacking identity residual blocks, and a detection sub-network that adopts the single shot multibox detector (SSD).


Fig. 1 Flowchart of the proposed methodology

3 Methodology

3.1 Overview of the Proposed Work

In this paper, our main objective is to perform object detection (of ships, in our case) using Generative Adversarial Networks and Convolutional Neural Networks. We reconstruct an HR image (i.e., 400 × 400) from the given LR input (i.e., 80 × 80) using SRGAN [1, 4] (Fig. 2) and then apply EEGAN [9, 17] (Fig. 3) to the SR output to perform edge enhancement of the 400 × 400 images. Section 3.2 briefly introduces GANs (in particular EEGAN), discusses the detection of errors in the form of clouds or crest waves (by heat maps or by some detection algorithm), describes the technique employed in training our dataset using CNN [17], RCNN [19] (lower accuracy), and DCGAN [21], and finally provides the network architecture and implementation details (Fig. 1).

3.2 Generative Adversarial Networks (GAN)

Generative Adversarial Networks (GANs) [4] are a powerful class of neural networks used for unsupervised learning [16]. GANs essentially comprise a pair of competing neural network models, which compete with one another and can analyze, capture, and reproduce the variations within a dataset.


Fig. 2 [2]: (Generator and discriminator model)

In GANs, there is a generator and a discriminator. The Generator produces fake samples of data (be it a picture, sound, and so on) and attempts to fool the Discriminator [12]. The Discriminator, in turn, attempts to distinguish genuine samples from fake ones. The Generator and the Discriminator are both neural networks, and both run in competition with each other during the training stage. The steps are repeated numerous times, and the Generator and Discriminator get better and better at their tasks after each repetition. As highlighted in Fig. 3, our proposed method EEGAN is made up of three basic sections: a generator (G), a discriminator (D), and a VGG19 [12] network for feature extraction. The generator (G) can be divided into two subnetworks: an EESN and a UDSN. The UDSN is made of a few dense blocks and a reconstruction layer for producing an intermediate HR result. The EESN is utilized to enhance the target edges extracted from the intermediate SR image by removing most of the unwanted noise. We obtain the final HR output by replacing the noisy edges with the more enhanced edges from the EESN (Fig. 4).
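To make the adversarial training dynamic concrete, the following is a minimal PyTorch sketch of a generic GAN training step. These are not the paper's EEGAN/SRGAN architectures; the toy fully connected networks, the latent size of 100, and the 80 × 80 patch size are assumptions for illustration only:

```python
import torch
import torch.nn as nn

IMG = 80 * 80 * 3  # flattened 80x80 RGB patch (illustrative size)
G = nn.Sequential(nn.Linear(100, 512), nn.ReLU(), nn.Linear(512, IMG), nn.Tanh())
D = nn.Sequential(nn.Linear(IMG, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator step: push real images toward 1, generated ones toward 0.
    fake = G(torch.randn(b, 100)).detach()
    loss_d = loss_fn(D(real_batch), ones) + loss_fn(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make D label generated output as real.
    fake = G(torch.randn(b, 100))
    loss_g = loss_fn(D(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Repeating this step over many batches is exactly the competition described above: each network improves against the other until the generated samples become hard to distinguish from real ones.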

Fig. 3 [5]: (Generator and discriminator models for super resolution GAN [SRGAN])


Fig. 4 [7]: (Representation of UDSN and EESN models of EEGAN [edge enhanced GAN])

4 Simulation and Results

4.1 Experimental Set-Up

We are using the Kaggle open-source dataset ("Ships in Satellite Imagery"), which consists of satellite images collected over the San Francisco Bay and San Pedro Bay areas of California. It includes 500 80 × 80 RGB images labeled with either a "ship" or "no-ship" classification, and the entire dataset is in .png format. There are more than 500 images, of which 130 contain a "ship"; ships of different sizes and orientations, and atmospheric interferences like clouds and tidal waves, are included. We take the dataset and run the models on a system that provides a GPU, so that the images can be processed faster (Figs. 5 and 6).

Fig. 5 (“Ship class” label)


Fig. 6 (“No-ship class” label)

The “no-ship” class includes 370 images. Most of them are random samples of different land cover features (water, vegetation, bare buildings, etc.). Some of them are “partial ships” that contain only a part of a ship. We use this dataset to train our models and then apply them to a scene that contains a large number of ships, and then check the training time, testing time, and accuracy each model provides on the same dataset.

4.2 Results

The following results were obtained chronologically. Before proceeding, one point should be kept in mind: the training and testing times, as well as the accuracy of a model, depend on the system used for implementation and may vary across systems. They do, however, give a comparative idea of the models and can help in choosing a better model for ship detection.

1. EEGAN (Figs. 7 and 8): time taken for training the model = 2500 s; time taken for image preparation = 1540 s.
2. CNN (Fig. 9): time taken for training the model = 2780 s; time taken for ship detection = 2200 s; accuracy of the model = 92%.
3. RCNN (Figs. 10 and 11): time taken for training the model = 3000 s; time taken for ship detection = 1500 s; accuracy of the model = 96%.
4. Faster RCNN (Figs. 12, 13, 14, 15 and 16): time taken for training the model = 2500 s; time taken for ship detection = 1400 s; accuracy of the model = 98%.


Fig. 7 Before applying EEGAN

5 Discussions

We compile our results in a table to get a more precise look at them and have a direct idea of their performance with respect to the accuracy, time taken to train, as well as to detect ships.

Model | Time taken for training the model (s) | Time taken for detecting ships/image preparation (s) | Accuracy (%)
Edge enhanced generative adversarial network (EEGAN) | 2500 | 1540 | NA
Convolutional neural network (CNN) | 2780 | 2200 | 92
Region convolutional neural network (RCNN) | 3000 | 1500 | 96
Faster region convolutional neural network (faster RCNN) | 2500 | 1400 | 98

5.1 Comparative Analysis

As we are aiming for the best combination of models for satellite image analysis, we need to take GAN + CNN model combinations. As we have only one GAN model, we can only compare the CNN models and find the best combination with our GAN model. Though CNN is very easy to implement and its training time is low, its detection time is very high compared to the other two models; hence, we cannot proceed further with this model. With RCNN, our training time is the highest, but at the same time it is very accurate and fast while detecting ships. Faster RCNN provides the highest accuracy and the least detection time. However, the implementation of both the RCNN and faster RCNN models is complex, and hence sometimes incompatible with certain systems.

6 Conclusion

In this paper, we provide an unsupervised method to detect ships from satellite images using a GAN-based framework and Convolutional Neural Networks (CNNs). In the proposed technique, we used SRGAN or EEGAN for preparing HR images and for the edge enhancement of those HR images, so that we obtain suitable images with reduced noise, fewer artifacts, and sharper edges. Moreover, the proposed method is robust for scenes with clouds and tidal waves, is effective when ship sizes vary, and achieves high accuracy in the detection of ships. Though the study was limited by technical constraints, a good comparative analysis could still be drawn, and the models could be assigned their respective merits and demerits with respect to each other. EEGAN + faster RCNN gives the best model combination considering the present scope of research and the technical limitations.


Fig. 8 After applying EEGAN

Fig. 9 Ships detected through CNN


Fig. 10 Ships detected through RCNN
Fig. 11 Confusion matrix for RCNN


Fig. 12 Training model statistics for successive epochs

Fig. 13 Tabular data for different epochs (cycles)


Fig. 14 Loss-learning rate curve for successive epochs

Fig. 15 Ships detected through faster RCNN


Fig. 16 Confusion matrix for faster RCNN model

References
1. R. Girshick, J. Donahue, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the International Conference on CVPR (IEEE, Columbus, 2014), pp. 580–587
2. R. Girshick, Fast R-CNN, in Proceedings of the International Conference on CVPR (IEEE, Santiago, 2015), pp. 1440–1448
3. S. Ren, K. He, R. Girshick, Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI 39, 1137 (2017)
4. K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, J. Jiang, Edge-enhanced GAN for remote sensing image superresolution. IEEE Trans. Geosci. Remote Sens. 1, 1–13 (2019)
5. W. Liu, D. Anguelov, SSD: single shot multibox detector, in Proceedings of the International Conference on ECCV (Springer, Amsterdam, 2015), pp. 21–37
6. J. Redmon, S. Divvala, You only look once: unified, real-time object detection, in Proceedings of the International Conference on CVPR (IEEE, Las Vegas, 2016), pp. 779–788
7. S. Bell, C.L. Zitnick, Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks, in Proceedings of the International Conference on CVPR (IEEE, Las Vegas, 2015), pp. 2874–2883
8. F. Yang, W. Choi, Y. Lin, Exploit all the layers: fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers, in Proceedings of the International Conference on CVPR (IEEE, Las Vegas, 2016), pp. 2129–2137
9. V. Ramakrishnan, A.K. Prabhavathy, J. Devishree, A survey on vehicle detection techniques in aerial surveillance. Int. J. Comput. Appl. 55(18), 43–47 (2012)
10. C. Corbane, L. Najman, E. Pecoul, L. Demagistri, M. Petit, A complete processing chain for ship detection using optical satellite imagery. Int. J. Remote Sens. (Taylor & Francis) 31(22), 5837–5854 (2010)
11. S. Qi, J. Ma, J. Lin, Y. Li, J. Tian, Unsupervised ship detection based on saliency and S-HOG descriptor from optical satellite images. IEEE Geosci. Remote Sens. Lett. 12(7), 1415–1455 (2015)


12. P.F. Felzenszwalb, R.B. Girshick, Object detection with discriminatively trained part-based models. TPAMI 47, 6–7 (2014)
13. I.J. Goodfellow, J. Pouget-Abadie, Generative adversarial networks, in Advances in Neural Information Processing Systems (2014), pp. 2672–2680

Keyword Template Based Semi-supervised Topic Modelling in Tweets Greeshma N. Gopal, Binsu C. Kovoor, and U. Mini

Abstract The performance of supervised and semi-supervised approaches for topic modelling is highly dependent on the prior information used for tagging. Tweets are short texts and hence demand supplementary information to infer their topics. Moreover, the correlation of a word with a topic changes with time in social media, so it is not appropriate to fix a tag for a keyword for an indefinite time. Here we propose a framework for the adaptive selection of the keywords for tagging with the help of external knowledge. The keyword template is updated appropriately with the time slice under consideration. The evaluation metrics show that this model gives consistent and accurate latent topic identification in short text.
Keywords Topic modelling · Semi-supervised learning · LDA · Tweets

G. N. Gopal (B) · B. C. Kovoor School of Engineering, CUSAT, Kochi, India e-mail: [email protected] B. C. Kovoor e-mail: [email protected] G. N. Gopal College of Engineering Cherthala, Cherthala, India U. Mini CUSAT, Kochi, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_58

1 Introduction

Social media is considered one of the richest sources from which to extract statistical information, and it can provide relevant and genuine information through analysis. However, identifying and extracting only the relevant data is not that easy. There are several statistical models that have been implemented to extract the hidden category of subject that a text is dealing with. The Latent Dirichlet Allocation algorithm is one of the most commonly used techniques for topic modelling. This model has


given satisfying results in topic modelling for documents. While trying to infer the topic in tweets, one of the major challenges that we face is the sparsity of words in short text. Therefore, information recognition fails to attain accuracy with naive models based on Latent Dirichlet Allocation (LDA) [1].

2 Related Work

The inadequacy of sufficient words in short texts like tweets was addressed by several researchers by incorporating knowledge from external sources or by aggregating the tweets. Yang et al. identified the key-phrases from the short text itself [2]; they considered Wikipedia, WordNet, YAGO and Know It All for extracting the supplementary knowledge. Similar work was done by Zhu et al., doubling the strength of feature keywords after estimating the importance of each keyword from an external source [3]. Cheng et al. used word correlation statistics obtained from an external corpus to improve LDA [4]. Scientific document titles were efficiently classified using similar external knowledge in the proposed work of Vo and Ock [5]. Social data providers like Socialbakers and Topic Enhanced Word Embedding (TEWE) were used by Li et al. to develop an entity knowledge base [6]. Work by Kim et al., who took feedback based on time series into their iterative topic modelling, was a step toward models that adapt to changes in topic over time [7]. When considering external sources like Wikipedia word distributions [8], the constantly changing mapping of a word to a topic is not reflected. Collecting additional knowledge from relevant sources that are constantly updated over time helps map a word to its current semantics. Basave et al. [9] used news groups for inferring word relationships through a summarization technique; through summarization they extracted the dominant words of a particular topic. Behavioural studies of social media users show that people tweet more about controversial topics than about other relevant facts mentioned in news portals, so word frequencies in external sources and tweets will not always match. Another challenging factor in understanding the topic is that the relationship of a word to a topic is highly dynamic in social media. For example, when the price of onion hit the roof during November 2019 in India, people used the word Onion to disapprove of government policies; Onion, previously a word related to cooking, was then contributing to the topic politics. Obviously, news groups can provide relevant words for a particular time period related to a topic. On the other hand, depending on word frequency to extract the dominant words of a topic from news headlines is not very reliable, due to the sparsity of words. Hence, topic models for tweets that use external knowledge must address both the altering word semantics and the sparsity of keywords. The information acquired from an external corpus can be incorporated into the models by reshaping the unsupervised algorithm into a supervised or semi-supervised algorithm. Initial tagging of the corpus in supervised learning


demands a huge amount of human intervention. Hence, semi-supervised algorithms are widely accepted for developing Guided LDA.

3 Semi-Supervised Topic Modelling for Tweets

In this section, we describe the framework and model for automatic labelling for the semi-supervised LDA. Figure 1 shows the framework of tweet classification based on topic modelling. A set of words k1, k2, ..., kn is selected to form keyword templates that act as the starting data points for the clustering. These keywords are chosen with the utmost care, so that their semantics are fixed. Using these initial data points, word clustering is done on the news corpus to tag those words to a topic ti. During the topic sampling of words in the LDA, this additional information is considered when estimating the distribution. Later, classification is done with the topic-modelled text. Meanwhile, the top words selected for a particular topic are fed back to the keyword template selection for update.

Fig. 1 Framework for adaptive topic modelling of tweets

3.1 Inference Algorithm

For inferring whether a term belongs to a particular topic, we have used Collapsed Gibbs Sampling. Initially, a distribution over K topics is generated with the Dirichlet prior α. We then draw a Dirichlet distribution φ over words with the Dirichlet prior β; this distribution can tell what a document is about. Finally, for each word in a document, a topic is drawn, and then a word that contributes to that topic, from the multinomial distribution. In the naive LDA model, the initial distribution of topics for the words is based only on the initial Dirichlet distribution, while here a set of words is tagged with labels, and during sampling this acts as a prior. The whole system is described in the plate notation shown in Fig. 2. There are M tweets that are to be classified under different topics, and the number of words in these documents is N. The correlation of a tweet with a topic ti depends on how the words in that particular tweet are aligned with that topic. The fundamental objective of our work is to automate keyword tagging in a semi-supervised LDA. From the preliminary analysis, we observed that most tweets are incomplete or short and do not carry sufficient information to identify the subject they are dealing with. This is because people usually tweet to express their opinion about controversial topics, rather than to pass on the news; there is plenty of information flooding the social network mentioning the topic, as people comment and post content that has a direct or indirect connection with the subject. Hence, depending on an external source to complete the semantic information is required. From the external source, the word correlations can be identified. However, identifying this correlation from the frequency with which two words occur together was not always successful, because of the sparsity of words in the text; moreover, the frequency of this observation was not consistent across the corpus. For example, the words 'Modi' and 'ISRO' were found to co-occur only 5 times in the news corpus, while 'Modi' and 'Trump' co-occurred 15 times. If the word co-occurrence statistics set a frequency threshold of 10, the two pairs will fall into different clusters. We have observed that the clustering

Fig. 2 Adaptive topic modelling plate notation


algorithms like k-means, spectral clustering, etc. fail to retrieve entity relationships in the news corpus. Hence, we have used a DBSCAN-style algorithm that uses the strength, or density, of the co-occurrence in the current space. In this algorithm, if a word in the news corpus has sufficient similarity strength with a word in the keyword template, that word is added to the cluster. The algorithm then runs recursively with the new term as the input. The recursive algorithm is shown in Algorithm 2.

Algorithm 1 Template labelled topic modelling
Require: M: Number of Tweets; N: Number of words; k: Number of topics; α: Prior for the Dirichlet Distribution over topics; β: Prior for the Dirichlet Distribution over words; Z: List of topics from z0 to zk; Wc: Keywords obtained from Clustering with Template Keywords as initial data points.
1: for i = 1: k do
2:   Clustering(W1...n)
3: end for
4: for i = 1: M do
5:   Generate θi ∼ Dir(α)
6: end for
7: for i = 1: k do
8:   Generate φi ∼ Dir(β)
9: end for
10: for m = 1: M; n = 1: N and c = 1: k do
11:   if w(m, n) ∈ Wc then
12:     Zm,n = l(Wc)
13:   else
14:     Zm,n = Multinomial(θi)
15:   end if
16:   w(m, n) = Multinomial(φZm,n)
17: end for

Algorithm 2 Clustering
1: Function addterms(term t)
2: for i = 1: N do
3:   if JaccardSimilarity(t, i) > tck then
4:     if t ∉ Wc then
5:       addterm(t)
6:     end if
7:   end if
8: end for
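A minimal Python sketch of the recursive, density-based template expansion in Algorithm 2; the similarity function, the threshold name t_ck, and the data structures are assumptions based on the description above, not the authors' code:

```python
def jaccard_similarity(a_contexts: set, b_contexts: set) -> float:
    """Co-occurrence similarity of two terms, from the news items each appears in."""
    union = a_contexts | b_contexts
    if not union:
        return 0.0
    return len(a_contexts & b_contexts) / len(union)

def add_terms(term, corpus_terms, contexts, cluster, t_ck=0.3):
    """Grow the keyword cluster recursively, DBSCAN-style (t_ck is an assumed threshold)."""
    for other in corpus_terms:
        if other in cluster:
            continue
        if jaccard_similarity(contexts[term], contexts[other]) > t_ck:
            cluster.add(other)
            add_terms(other, corpus_terms, contexts, cluster, t_ck)  # expand from the new term

# Usage: start from each template keyword and expand over the news-corpus terms.
# `contexts` maps each term to the set of news-headline ids it occurs in (assumption).
```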

The keywords that are used to tag the corpus were extracted from the external corpus. One of the major challenges faced during this process was due to the sparsity


of the data in the extracted news. This is because there will be only two or three headlines related to a piece of news. Measuring the relationship of a word to another word in short and sparse text is very difficult. Usually, the co-occurrence similarity of two words is measured using metrics like the Jaccard similarity [10], where the distance between two terms t1 and t2 is

dJ(t1, t2) = 1 − J(t1, t2) = (|t1 ∪ t2| − |t1 ∩ t2|) / |t1 ∪ t2|

However, in sparse data, the number of times two terms co-occur may be more than the number of times they occur individually. In such a case, the similarity index turns out to be zero even though both terms co-occur multiple times. Therefore, we have considered the normalized value of the co-occurrence count here. Observing that keyword relationships get only a slight projection based on counts, our next step was to extract only important words for the clustering process. Entity recognition was done to extract only relevant keywords. In addition, the entities themselves had to be cleaned, since they contained many stop words. Leaving the structure of an entity as it was, was not a good choice, since tweets seldom contain complete entity patterns: people use the first name or last name of a person, not the complete name, when they tweet. For example, for the entity "Indian National Congress", people tweeting use the word Congress. So we extracted only the proper nouns from the entities recognized in the news. Later, news containing only the proper nouns was given to the clustering algorithm. The keywords obtained through clustering were then used to tag the tweet corpus.
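A small sketch of one possible normalized co-occurrence score of the kind suggested above; the normalization choice (dividing by the larger individual count) is an assumption, since the paper does not spell it out:

```python
def normalized_cooccurrence(count_ab: int, count_a: int, count_b: int) -> float:
    """Co-occurrence count normalized by the larger individual count, so that
    frequently co-occurring pairs in sparse headlines do not collapse to zero."""
    denom = max(count_a, count_b, 1)
    return min(count_ab / denom, 1.0)
```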

4 Experiments and Results

The dataset used for the experiments consists of tweets extracted from Twitter for a chosen time period, collected using the Twitter Tweepy API. During preprocessing, URLs and special characters were removed from the tweets, the texts were converted to lower case and tokenized, and stop words were removed. The external corpus we have used is news websites: the daily news is scraped from them, with importance given to trending and most-shared news. During the initial preprocessing of the news corpus, URLs and image links were removed. In every news text, the entity words that describe the topics it is aligned with were to be recognized. For this, we employed Named Entity Recognition (NER) and extracted all names, locations, and organizations. The NER was done using spaCy, since it showed better performance on our dataset than the Stanford recognizer [11]. Experiments were also done by choosing entity words through extracting only the proper nouns from the text. The word list of each news text is then clustered using Algorithm 2.


For the clustering, we selected initial data points from which the clustering starts. These words have a close relationship with the topic. However, the consistency of the topic-word relationship does not matter in our model since, every time, these words are updated with the top words obtained through Collapsed Gibbs sampling. So our model not only automatically tags the words for semi-supervised learning, it also adapts to changes in the semantics of the keywords. In essence, human intervention is required only during the deployment of the model, thereby reducing the burden of labelling in semi-supervised and supervised algorithms. The modelling was done by extending the STTM tool for short-text topic modelling [12]. The topic distribution obtained through modelling is used to classify the text. The classification accuracy of this semi-supervised algorithm is compared with the accuracy of the unsupervised algorithm. The experiments have shown that our algorithm with automatic topic labelling gives a consistent solution throughout execution, as in Fig. 3, which shows the accuracy of KSSTM and naive LDA when they were run multiple times with the same input. In the unsupervised algorithm, the performance of the classification depends completely on the initial distribution. A change in accuracy was observed when using only the entities extracted from the external news corpus, and also when considering only proper nouns. The results show that the accuracy improves as we filter out ambiguous words that may fall into both clusters (Fig. 4).

Fig. 3 a Classification accuracy in dataset 1, b classification accuracy in dataset 2

Fig. 4 a Classification accuracy by applying NER and b proper noun extraction from data


4.1 Conclusion

Through the experiments, it is observed that semi-supervised and supervised LDA can always provide a more consistent topic categorization. However, tagging the data needs a lot of human effort and time. The suggested model proposes a method to automatically extract the words to be tagged from the external corpus. The proposed method is designed for topic modelling in tweets, which are short in length. The semantics of tweets are highly correlated with current news, and this hypothesis is used to extract the knowledge for inference. The experiments have shown that the model provides a better and more reliable solution to the topic modelling problem.

References
1. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
2. S. Yan, W. Lu, D. Yang, L. Yao, B. Wei, Short text understanding by leveraging knowledge into topic model, in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2015), pp. 1232–1237
3. Y. Zhu, L. Li, L. Luo, Learning to classify short text with topic model and external knowledge, in International Conference on Knowledge Science, Engineering and Management (Springer, Berlin, 2013), pp. 493–503
4. X. Cheng, X. Yan, Y. Lan, J. Guo, BTM: topic modeling over short texts. IEEE Trans. Knowl. Data Eng. 26(12), 2928–2941 (2014)
5. D.T. Vo, C.Y. Ock, Learning to classify short text from scientific documents using topic models with various types of knowledge. Expert Syst. Appl. 42(3), 1684–1698 (2015)
6. Q. Li, S. Shah, X. Liu, A. Nourbakhsh, R. Fang, Tweetsift: tweet topic classification based on entity knowledge base and topic enhanced word embedding, in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (ACM, 2016), pp. 2429–2432
7. H.D. Kim, M. Castellanos, M. Hsu, C. Zhai, T. Rietz, D. Diermeier, Mining causal topics in text data: iterative topic modeling with time series feedback, in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (ACM, 2013), pp. 885–890
8. J. Wood, P. Tan, W. Wang, C. Arnold, Source-LDA: enhancing probabilistic topic models using prior knowledge sources, in 2017 IEEE 33rd International Conference on Data Engineering (ICDE) (IEEE, 2017), pp. 411–422
9. A.E.C. Basave, Y. He, R. Xu, Automatic labelling of topic models learned from twitter by summarisation, in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 2, Short Papers, 2014), pp. 618–624
10. A. Saxena, M. Prasad, A. Gupta, N. Bharill, O.P. Patel, A. Tiwari, M.J. Er, W. Ding, C.T. Lin, A review of clustering techniques and developments. Neurocomputing 267, 664–681 (2017)
11. B. Kleinberg, M. Mozes, A. Arntz, B. Verschuere, Using named entities for computer-automated verbal deception detection. J. Forensic Sci. 63(3), 714–723 (2018)
12. J. Qiang, Y. Li, Y. Yuan, W. Liu, X. Wu, STTM: A Tool for Short Text Topic Modeling (2018). arXiv:1808.02215

A Community Interaction-Based Routing Protocol for Opportunistic Networks Deepak Kumar Sharma, Shrid Pant, and Rinky Dwivedi

Abstract Opportunistic Networks (Opp-Nets) provide the capability to interact and transfer information between spontaneous mobile nodes. In these networks, the routing of messages, which involves the selection of the best intermediate hop for the relay of message packets, is one of the most important issues, primarily due to the non-availability of prerequisite knowledge about the network topology and configuration. This paper presents a community interaction-based routing protocol for Opp-Nets that may be used to select appropriate intermediate nodes as hops. The selection is based on the interaction point probability and social activeness of the nodes, which are calculated and analyzed at the sender and at each intermediate node. The results for the proposed protocol are obtained using the ONE simulator and compared, analytically and graphically, with other contemporary routing protocols to show its effectiveness.
Keywords Opportunistic networks · Routing · Community interaction · ONE simulator

D. K. Sharma · S. Pant Department of Information Technology, Netaji Subhas University of Technology (Formerly Netaji Subhas Institute of Technology), New Delhi, India e-mail: [email protected] S. Pant e-mail: [email protected] R. Dwivedi (B) Department of Computer Science and Engineering, Maharaja Surajmal Institute of Technology, Janakpuri, New Delhi, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_59


1 Introduction

With the emergence of novel and affordable wireless technologies such as Bluetooth, 3G [1], WiFi, and many others, it is now possible to enable almost any device with wireless technology. This has led to an exponential increase in the number of wireless networks. However, wireless network infrastructures are not available in every scenario, especially those involving extreme environments like deep space and oceans. Therefore, Mobile Ad Hoc Networks [2] (MANETs) were introduced to deal with some of these challenges, because in MANETs each node is mobile and acts as an intermediate node for transferring the data. MANETs, however, require that end-to-end (E2E) connectivity between any pair of source-destination nodes be available for packet transfer. This assumption may result in data getting dropped midway when intermediate nodes go down due to failure or power outage, or when nodes move out of radio range [3]. Hence, without a completely connected path, communication might never happen in MANETs. In real-world scenarios, there are many such circumstances where end-to-end active paths may never exist. In these situations, Opportunistic Networks (Opp-Nets) can provide a means to route packets while accounting for intermittently connected paths. In Opp-Nets, nodes are obligated to buffer their message packets until they discover appropriate nodes to forward them to, in such a way that these packets may eventually be delivered to their desired destinations. Opp-Nets also come under the sub-class of Delay Tolerant Networks [4] (DTNs). They are traditionally characterized by low-power mobile devices communicating and networking in an ad hoc manner. The network connectivity is sparse, intermittent, and usually unpredictable; even the duration for which two nodes might meet is highly variable. Thus, reliable delivery of a message is not guaranteed in these networks, due to network partitions and intermittent connectivity. Every node decides the best next hop using several appropriate parameters before it transfers a message to any other node in the network. Routing in Opp-Nets does not require the existence of a connected path; instead, each node decides which nodes it should relay the packets to in order to guarantee the successful transfer of the packets with minimal delay. Because of network partitioning, intermediate nodes might not discover viable nodes to relay the packets toward the target. In these circumstances, nodes might have to store the packets in their buffer for a period of time when no forwarding opportunity toward the goal node exists. So, a buffer management scheme is needed for when new packets arrive at a node whose buffer is already full; for this, an acknowledgment-based technique can be used to remove delivered message copies from the buffers of nodes. Also, the nodes in an Opp-Net have limited energy resources; intermittent contacts, frequent disconnection and reconnection, long delays, etc., generally result in battery drainage. This is one of the primary issues in Opp-Nets that must be addressed. Secure routing is also needed to provide secure communication between the nodes in Opp-Nets. By designing the protocol with security features, it can be assured that the network is protected against various types of security threats


in the underlying environment. This is also an area of concern in Opp-Nets that needs attention [5].

The following are some characteristics of Opp-Nets: (1) they are devoid of any fixed network topology, as the nodes are in constant movement relative to each other; (2) contact opportunities and contact durations are short and variable, since the nodes are moving at all times; (3) links present in the network at any given instant may not be present at all times, due to node failure, nodes being out of radio range of each other, or node power failure; (4) as a result, links are unreliable and give varying performance; (5) buffer capacity in each node is high, in order to buffer as many messages as possible and to prevent them from being dropped due to buffer overflow; and finally (6) since these networks are delay tolerant, the average latency for delivering a message to its intended receiver is quite high compared to existing legacy networks that require end-to-end connected paths between source and destination.

The following are some of the applications of Opportunistic Networks [6]:

• Emergency applications: Opp-Nets can be used in all kinds of emergency situations, such as earthquakes and hurricanes. A seed Opp-Net with some nodes can be deployed for disaster recovery; other potential helper nodes equipped with more facilities can be added as required to grow into an expanded Opp-Net.
• Opportunistic computing: Opp-Nets can provide a platform for distributed computing where resources, content, services, and applications can be shared by mobile devices for various purposes.
• Recommender systems: Opp-Nets can exploit various context information about the nodes, such as mobility patterns, contact history, and workplace information. This contextual information can be used to furnish suggestions on multiple items.
• Mobile data offloading: Mobile social Opp-Nets can be used for the purpose of mobile data offloading. The immense increase in smartphone users has overloaded large portions of 3G networks, and a number of research works have taken advantage of mobile social Opp-Nets for data offloading on 3G networks.
• Information exchange: Opp-Nets also utilize the data transmission potential of small devices, such as mobile phones. These handheld devices form an Opp-Net when they come into close proximity with other wireless devices to exchange data.

2 Related Works

This section presents a review of some relevant routing protocols and of attempts made to decrease congestion. Numerous algorithms have been proposed for the effective routing of messages in opportunistic networks. There are mainly two classes of routing protocols: (1) context-ignorant routing protocols and (2) context-aware routing protocols. In context-ignorant protocols, the nodes are oblivious to the network information when selecting the next intermediate nodes. In context-aware algorithms, on the other hand, the delivery probability is calculated using different


network metrics for routing. The following are some of the routing strategies that are used in delay tolerant networks [7].

A. First Contact: First Contact is a simple routing algorithm in which the sender passes the message only when it comes into direct contact with the target, or when the sender and target are immediate neighbors. Until then, the packets are stored in the buffer, waiting for contact with the destination. In this protocol, the local copy of a message is removed after every transfer between nodes. Since there exists just one copy of the data in the entire network, congestion and resource utilization are low. Although simple, First Contact has very limited applications, as delivery is very poor and relaying along random paths might not make progress toward the target.

B. Epidemic Routing: The Epidemic Routing Protocol [8] is founded on the theory of epidemic algorithms. It is a dissemination-based protocol in which the message packets are passed through the network with the help of flooding mechanisms. The starting node floods the entire network with numerous replicas of the message packets intended for delivery to the target node. This is accomplished by distributing many copies of the message packets to every encountered node, which further distributes the copies to its adjacent nodes. This activity continues until a replica of the message has reached the target node. Thus, the message spreads through the network like an epidemic, and each node infects all its surrounding nodes that haven't been infected. The algorithm has a good delivery rate, but suffers from heavy buffer and bandwidth requirements, resulting in wastage of network resources.

C. PROPHET Routing: PROPHET [9] is a history-based protocol that employs knowledge of past interactions and transitivity to route a message. The protocol employs a parameter named delivery predictability, which is the probability of interaction between a node and the destination, to decide the next receiver of the message packet (a sketch of these predictability updates follows the list). PROPHET is founded on the presumption that the movement of nodes always follows a special movement pattern and is repetitive over a given interval. PROPHET utilizes this repetitiveness in the nodes' movements and creates a probability table of delivery predictabilities, which contains the probability of final delivery of the message packets from a given node. This probability is based on the node's movement pattern and the history of interactions with other nodes that have helped the node deliver messages successfully in the past. In this protocol, whenever a node meets other nodes, an exchange of delivery predictability values takes place. This allows message packets to be forwarded to those nodes which possess better delivery predictability values.

D. Spray and Wait Routing: The Spray and Wait Protocol [10] is based on the technique of controlled flooding. It is essentially an improvement over the existing Epidemic routing algorithm such that


it restricts the volume of flooding and reduces network resource usage. Routing takes place in two stages: the Spray Phase and the Wait Phase. In the Spray Phase, the starting node computes L, the number of replicas of the message with which the network should be flooded, and forwards these copies to L distinct nodes called relay nodes. The message may also be delivered directly to the target in this phase. During the Wait Phase, the sender node waits until at least one of the L relay nodes delivers the message directly to the target. The network is therefore flooded with only L copies of the message. This protocol requires a high degree of node mobility within the network.

E. Spray and Focus Routing: Spray and Focus Routing [11] is an advancement over Spray and Wait Routing. It operates in two stages, namely the Spray Phase and the Focus Phase. In the Spray Phase, the initiator of the message can distribute copies to only a fixed number of relays, say L. Each node that receives copies can then distribute copies to only half that number of relays, i.e., L/2, and so on. Once a node is down to L = 1, the packet can only be transmitted to a single relay on the basis of a particular relaying criterion. During the Focus Phase, forwarding is done according to this forwarding criterion. The protocol employs a group of timers to measure the interval between meetings of two nodes. The timers are used to define a utility function, which helps nodes decide the usefulness of relay nodes in delivering packets. Packets are transmitted only to nodes with a higher utility function value.

F. Other Works: Many modern routing algorithms have been proposed to provide more efficient message routing. Different node characteristics and network information are analyzed to decide the best possible routes between nodes. Besides the benchmark protocols such as Epidemic and PROPHET discussed above, other routing protocols apply numerous techniques to achieve optimal results. Applications of game theory [12, 13], clustering techniques [14, 15], fuzzy systems [16], machine learning [17–19], and many others [20, 21] have resolved issues pertaining to specific aspects of Opp-Nets.
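As a concrete illustration of the PROPHET mechanics referred to in item C above, the sketch below implements the three standard delivery-predictability update rules (encounter update, aging, and transitive update). The constants P_INIT, GAMMA, and BETA are the commonly cited defaults from the original PROPHET proposal, not values taken from this paper, and the data structures are illustrative.

```python
# A minimal sketch of PROPHET's delivery-predictability bookkeeping.
# P maps (node, destination) pairs to predictability values in [0, 1].
# P_INIT, GAMMA, and BETA are the commonly cited PROPHET defaults (assumed).
P_INIT, GAMMA, BETA = 0.75, 0.98, 0.25

def on_encounter(P, a, b):
    """When node a meets node b, raise a's predictability for b."""
    old = P.get((a, b), 0.0)
    P[(a, b)] = old + (1.0 - old) * P_INIT

def age(P, a, b, k):
    """Decay a's predictability for b after k time units without contact."""
    P[(a, b)] = P.get((a, b), 0.0) * (GAMMA ** k)

def transitive(P, a, b, c):
    """If a meets b often and b meets c often, a is also a useful relay for c."""
    old = P.get((a, c), 0.0)
    P[(a, c)] = old + (1.0 - old) * P.get((a, b), 0.0) * P.get((b, c), 0.0) * BETA

def should_forward(P, a, b, dest):
    """Hand a copy for dest from a to b only if b is the better relay."""
    return P.get((b, dest), 0.0) > P.get((a, dest), 0.0)
```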

3 Proposed Protocol The proposed algorithm is thoroughly described in this section.

3.1 Parameters Considered A novel routing algorithm is proposed for efficient message delivery, by minimizing the number of copies and selecting appropriate relay nodes, in an Opportunistic


Network environment. The intermediate nodes for message delivery are selected on the basis of the following factors: (1) Interaction Point Probability: An interaction point is a particular location where numerous nodes from diverse communities routinely gather to interact. Nodes with a greater probability of moving toward the interaction point are good candidates for intermediate nodes. (2) Socially Active Nodes: A node is said to be socially active if it interacts with relatively many nodes of the network. Compared with static or less mobile nodes, nodes that change their positions frequently have a higher probability of interacting with other nodes. To be considered socially active, a node should change its position frequently, i.e., move fast and have a short wait time. A node that is either socially active or has a high interaction point probability is considered as an intermediate node for message delivery.

3.2 Assumptions While proposing the routing algorithm, the following assumptions were made: 1. Message exchange takes place only at an interaction point or inside a community, and at no other place. 2. Nodes meeting at the interaction point will then diverge, i.e., will enter different communities. 3. Data transfer between a node and the destination takes minimal time, i.e., the destination does not change its community during a data transfer.

3.3 The Proposed Protocol The source generates the message ID along with the destination ID. The source delivers the message itself if it comes in contact with the destination; but since the probability of that happening is very low, intermediate nodes are used. The source gives a copy of the message to a socially active node in its community. This socially active node transfers the message to any node that is in range (connected) and whose interaction point probability is higher than a threshold value. Every node maintains a table of messages yet to be delivered, in the form of a message buffer. The node moves out of the community to reach the interaction point. On reaching the interaction point, the node will meet some other node, and the meeting nodes exchange the messages they do not have in common. Thus, various nodes at the interaction point hold a copy of the message, and when these nodes move to different communities, so does the message. After meeting at the interaction point, the nodes enter different communities. On entering a community, a node transfers its message list to the socially active host of that community. Thus, several communities hold a copy of the message. Chances


that the destination node is in one of these communities are high; therefore, the message will be delivered whenever the destination is in contact with any of the socially active nodes. This ensures minimal end-to-end delay because, even if the destination changes its community, it may enter a community whose socially active node has a copy of the message. If any intermediary node is itself the required destination, that node receives the message and it is not relayed further. A compact sketch of this forwarding logic is given below.
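The sketch below restates the forwarding decisions described above in code. The protocol is given only in prose in this paper, so the node attributes, helper names, and the threshold value are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the community interaction-based forwarding logic,
# assuming hypothetical node attributes; the threshold value is illustrative.
from dataclasses import dataclass, field

THRESHOLD = 0.5  # assumed cut-off for the interaction point probability

@dataclass
class Node:
    name: str
    socially_active: bool = False
    ip_probability: float = 0.0               # chance of heading to the interaction point
    buffer: set = field(default_factory=set)  # messages yet to be delivered

def choose_relay(neighbours, dest):
    """Pick the next hop for a message addressed to dest."""
    if dest in neighbours:
        return dest  # direct delivery when the destination is in range
    candidates = [n for n in neighbours
                  if n.ip_probability > THRESHOLD or n.socially_active]
    # Prefer the node most likely to reach the interaction point.
    return max(candidates, key=lambda n: n.ip_probability, default=None)

def exchange_at_interaction_point(a, b):
    """Meeting nodes swap the messages they do not have in common."""
    a.buffer, b.buffer = a.buffer | b.buffer, a.buffer | b.buffer
```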

4 Simulation Setup and Results In this section, the simulation setup is explained and the results are thoroughly discussed.

4.1 Simulation Setup Simulation studies have been conducted using the ONE simulator to compare the efficacy of Community Interaction-based routing against Epidemic, PROPHET, First Contact, Spray and Wait, and Direct Delivery. It has been presumed that the buffer size and transmission duration of the nodes are restricted. The parameters and relevant values of the simulation are as follows:

Parameter                     Value
Area                          6500 m * 6500 m
Data transfer rate            250 Kbps
Number of groups              10
Buffer space of each node     5 MB
Speed range                   1–7 m/s
Wait time range               0–120 s
Message size                  50–150 Kb
Message generation interval   25–35 s
Simulation time               43,000 s
Movement model                CommInteractMovement
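In the ONE simulator, parameters of this kind are normally supplied through a plain-text settings file. The snippet below is a hedged sketch of how the table above might translate into the simulator's usual setting keys; the interface name, the event-generator block, and the grouping of hosts are assumptions, since the paper does not reproduce its settings file.

```
# Illustrative ONE simulator settings (default_settings.txt style); values
# follow the table above, key names follow the simulator's usual conventions.
Scenario.endTime = 43000
Scenario.nrofHostGroups = 10
MovementModel.worldSize = 6500, 6500
Group.movementModel = CommInteractMovement
Group.bufferSize = 5M
Group.speed = 1, 7
Group.waitTime = 0, 120
btInterface.transmitSpeed = 250k
Events1.class = MessageEventGenerator
Events1.interval = 25, 35
Events1.size = 50k, 150k
```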

The following performance metrics are taken into consideration: (1) Delivery Probability: the fraction of messages successfully received by the target within a given time period. (2) Hop Count: the number of hops required for a packet to travel from source to destination. (3) Dropped Messages: the number of packets dropped from the buffers of the nodes.


4.2 Simulation Results This subsection presents the graphical and analytical analysis of the results obtained by varying various simulation parameters in the Opportunistic Network Environment (ONE) Simulator. Figures 1, 2, 3, 4, and 5 show the performance of the proposed algorithm on several performance metrics and against some existing routing protocols. Figure 1 shows the performance of various routing protocols with respect to delivery probability. The delivery probability naturally increases with time for all the protocols. The proposed algorithm's curve initially lies above PROPHET, due to imprecise prediction in PROPHET, but soon falls below it and tends to follow the Epidemic curve. The Epidemic curve gives the best results in its initial stages but declines toward the end due to packet loss caused by overloaded buffers.

Fig. 1 Comparison against various existing routing protocols in terms of delivery probability

Fig. 2 Cumulative probability comparison against message delay


Fig. 3 Variation of delivery probability with number of host nodes

Fig. 4 Effect of speed variation on delivery probability

Fig. 5 Hop counts for different routing algorithms


Direct Delivery, Spray and Wait, and First Contact perform worse than the proposed algorithm, since the intermediate hop selection in the proposed algorithm considers multiple parameters before relaying the messages. As shown in Fig. 2, the cumulative probability of all the routing algorithms increases with message delay until it approaches a maximum value. The rise is steeper toward the start and slows down at the end. The proposed algorithm's curve lies above Direct Delivery, Spray and Wait, and First Contact, and below the Epidemic and PROPHET routers. Figure 3 shows the variation of delivery probability with the number of hosts per group. The probability increases with the number of hosts in each group until it reaaches a maximum, after which it starts to fall because buffers overflow at higher values. Increasing the number of nodes increases the number of intermediate helper nodes available for message delivery, and thus the delivery probability. Figure 4 shows the change in delivery probability with the variation in the speed of the mobile nodes. As the speed of nodes is increased, more nodes become socially active, thereby increasing the number of replicas of the message. This yields a greater delivery probability. The hop counts of the various routing algorithms, for the same settings, are depicted in Fig. 5. The hop count for Direct Delivery, as expected, comes to 1, while the others have values above 1.5. The proposed algorithm's hop count comes out lower than most of the existing routing algorithms because intermediate nodes are selected only if they satisfy certain conditions, as described in Sect. 3. This reduces the hops required to route the message, and hence demonstrates its efficiency over the others. In Fig. 6, the average latency of various existing routing protocols and of the proposed method is plotted and compared. The figure clearly shows the latency of the proposed algorithm to be in the range of First Contact and Direct Delivery, and slightly greater than that of Epidemic and PROPHET.

Fig. 6 Average latency of various routing protocols


5 Limitations and Conclusion This section describes the limitations of the proposed protocol and provides a glimpse of possible future work.

5.1 Limitations 1. The proposed algorithm assumes that there is a fixed point in the network (the interaction point) where nodes meet for message relaying. This assumption prevents the algorithm from working for all kinds of network designs in opportunistic networks. 2. Some of the nodes are assumed to move at a higher speed, i.e., above a threshold value. Hence, in networks where node mobility is slow, the algorithm does not work efficiently. 3. Since our proposed algorithm considers the network to be a group of communities, it does not work as efficiently under other movement models.

5.2 Conclusion This paper has presented a novel routing mechanism and compared it with other contemporary protocols for Opportunistic Networks. The architecture, characteristics, and challenges of Opportunistic Networks have been discussed at great length. Further, the architecture and different modules of the ONE Simulator have also been explored. The proposed community interaction-based routing protocol attempts to minimize the number of message copies and the end-to-end delay of message delivery by selecting appropriate intermediate nodes. The simulation results show that the proposed algorithm outperforms the Spray and Wait, First Contact, and Direct Delivery protocols with respect to delivery ratio, and comes very close to Epidemic. The results also show that the average hop count taken by the messages is less than 2. These simulation results reveal that the proposed protocol significantly helps in minimizing network bandwidth usage by restricting the excess messages that would otherwise have been dropped by nodes.


References
1. K. Miya, M. Watanabe, M. Hayashi, T. Kitade, O. Kato, K. Homma, CDMA/TDD cellular systems for the 3rd generation mobile communication, in 1997 IEEE 47th Vehicular Technology Conference. Technology in Motion, Phoenix, AZ, USA (Vol. 2, 1997), pp. 820–824
2. V. Chandrasekhar, W.K.G. Seah, Y.S. Choo, H.V. Ee, Localization in underwater sensor networks: survey and challenges, in Proceedings of the 1st ACM International Workshop on Underwater Networks (WUWNet'06) (ACM, New York, NY, USA, 2006), pp. 33–40
3. H. Yang, H. Luo, F. Ye, L. Songwu, L. Zhang, Security in mobile ad hoc networks: challenges and solutions. IEEE Wirel. Commun. 11(1), 38–47 (2004)
4. V. Singh, L. Raja, D. Panwar, P. Agarwal, Delay tolerant networks architecture, protocols, and its application in vehicular Ad-Hoc networks, in Hidden Link Prediction in Stochastic Social Networks (IGI Global, 2019), pp. 135–161
5. S. Trifunovic, S.T. Kouyoumdjieva, B. Distl, L. Pajevic, G. Karlsson, B. Plattner, A decade of research in opportunistic networks: challenges, relevance, and future directions. IEEE Commun. Mag. 55(1), 168–173 (2017)
6. M.K. Denko, Mobile Opportunistic Networks: Architectures, Protocols and Applications (Auerbach Publications, 2019)
7. M. Alajeely, R. Doss, A. Ahmad, Routing protocols in opportunistic networks: a survey. IETE Tech. Rev. 35(4), 369–387 (2018)
8. A. Vahdat, D. Becker, Epidemic routing for partially-connected Ad Hoc networks. Technical report number CS-200006, Duke University, pp. 1–14
9. T. Huang, C. Lee, L. Chen, PRoPHET+: an adaptive PRoPHET-based routing protocol for opportunistic network, in 2010 24th IEEE International Conference on Advanced Information Networking and Applications, Perth, WA (2010), pp. 112–119
10. T. Spyropoulos, K. Psounis, C.S. Raghavendra, Spray and wait: an efficient routing scheme for intermittently connected mobile networks, in SIGCOMM'05 Workshops, 22–26 August 2005, Philadelphia, PA, USA
11. T. Spyropoulos, K. Psounis, C.S. Raghavendra, Spray and focus: efficient mobility-assisted routing for heterogeneous and correlated mobility, in Fifth Annual IEEE International Conference on Pervasive Computing and Communications Workshops (PerComW'07), White Plains, NY (2007), pp. 79–85
12. A. Chhabra, V. Vashishth, D.K. Sharma, SEIR: a Stackelberg game based approach for energy-aware and incentivized routing in selfish opportunistic networks, in 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD (2017), pp. 1–6
13. A. Chhabra, V. Vashishth, D.K. Sharma, A game theory based secure model against Black hole attacks in opportunistic networks, in 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD (2017), pp. 1–6
14. D.K. Sharma, S.K. Dhurandher, D. Agarwal et al., kROp: k-means clustering based routing protocol for opportunistic networks. J. Ambient Intell. Human Comput. 10, 1289–1306 (2019)
15. D.K. Sharma, Aayush, A. Sharma, J. Kumar, KNNR: K-nearest neighbour classification based routing protocol for opportunistic networks, in 2017 Tenth International Conference on Contemporary Computing (IC3), Noida (2017), pp. 1–6
16. A. Chhabra, V. Vashishth, D.K. Sharma, A fuzzy logic and game theory based adaptive approach for securing opportunistic networks against black hole attacks. Int. J. Commun. Syst. 31, e3487 (2018)
17. D.K. Sharma, S.K. Dhurandher, I. Woungang, R.K. Srivastava, A. Mohananey, J.J.P.C. Rodrigues, A machine learning-based protocol for efficient routing in opportunistic networks. IEEE Syst. J. 12(3), 2207–2213 (2018)
18. S.K. Dhurandher, D.K. Sharma, I. Woungang, S. Bhati, HBPR: history based prediction for routing in infrastructure-less opportunistic networks, in 2013 IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), Barcelona (2013), pp. 931–936


19. A. Chhabra, V. Vashishth, D.K. Sharma, GMMR: a Gaussian mixture model based unsupervised machine learning approach for optimal routing in opportunistic IoT networks. Comput. Commun. 134 (2018). https://doi.org/10.1016/j.comcom.2018.12.001
20. A. Gupta, A. Bansal, D. Naryani, D.K. Sharma, CRPO: cognitive routing protocol for opportunistic networks, in Proceedings of the International Conference on High Performance Compilation, Computing and Communications (HP3C-2017) (ACM, New York, NY, USA, 2017), pp. 121–125
21. D.K. Sharma, S. Singh, V. Gautam, S. Kumaram, M. Sharma, S. Pant, An efficient routing protocol for social opportunistic networks using ant routing. IET Netw. (2019)

Performance Analysis of the ML Prediction Models for the Detection of Sybil Accounts in an OSN
Ankita Kumari and Manu Sood

Abstract Online Social Networks (OSNs) have become huge platforms for information sharing and social interactions for a wide variety of users across the globe. Amid their rapid transformation, these OSNs are increasingly subject to illegal activities, especially in the form of security attacks, which have already begun to seriously harm these interactions. One of the prominent attacks in such environments, the Sybil attack, jeopardizes various categories of social interactions, as the number of Sybil accounts on these social platforms is growing phenomenally. The existence of such Sybil accounts on OSNs threatens to defeat the very purpose of these networks, and the Sybil accounts of malicious users are almost impossible to control and very difficult to detect. In this paper, an attempt has been made, with the help of Machine Learning (ML), to uncover the presence of such Sybil accounts on an OSN such as Twitter. After the acquisition and preprocessing of the available datasets, the Correlation with Heatmap and Logistic Regression-Recursive Feature Elimination (LR-RFE) feature selection techniques were applied to obtain a set of optimal features from these datasets. Prediction models were then trained on these datasets using Random Forest (RF), Decision Tree (DT), Logistic Regression (LR) and Support Vector Machine (SVM) classifiers. Further, the effects of biasing genuine accounts with fake accounts on feature selection and classification are presented. It is concluded that the prediction models using the DT algorithm outperformed all the other classifiers. Keywords Feature selection · Support vector machine · Random forest · Logistic regression · Decision tree · Sybil account · Biasing

A. Kumari (B) · M. Sood Department of Computer Science, Himachal Pradesh University, Shimla, India e-mail: [email protected] M. Sood e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_60


1 Introduction In the present times, Online Social Networks (OSNs) like Twitter, Facebook, Instagram, etc., are becoming the most widely used sources of information as well as social interaction. Their growth has transformed how people conduct business and interact with each other [1]. These platforms cannot currently be bound by any single standard definition. However, Boyd and Ellison in [2] have defined social networks as web-based services that allow individuals to (a) create a public or semi-public profile within a bounded system, (b) make a list of other users with whom they share a connection, and (c) view and traverse various lists of connections within the system. Because of the huge number of users on these platforms and the uncanny ease with which any user can hide her/his real identity or create virtual identities, including Sybil identities, any user can be easily trolled at no cost to the trolling users. This has led to an uncontrolled increase in the number of fake profiles on these OSNs, culminating in a serious problem for the genuine, authorized users of these platforms [3].

In the beginning, these platforms were used to connect with family members and friends and to reconnect with old, out-of-contact friends, but nowadays the use of OSNs has increased multifold, for multidimensional purposes, at a massive rate. Different people use these platforms for different purposes, generating mammoth amounts of data. To extract useful information, such as the identities of fake users, from this large body of data, special techniques are needed. Machine Learning (ML) is one such mechanism: it provides techniques to make sense of stacks of data in effective and simple ways [4]. ML is a component of Artificial Intelligence (AI) whose main focus is to make a machine learn from given data, much as humans learn from their experiences. There are basically four methods through which a machine can learn: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, labeled data is given as input to the machine, and based on that labeled data, the machine is trained to label the outcomes too; classification and regression are the two significant techniques used under supervised learning. In unsupervised learning, unlabeled data is given as input, from which a model finds the hidden patterns in the data, clustering being one of the popular techniques in this category. In semi-supervised learning, both labeled and unlabeled data can be given as input, as it is a combination of supervised and unsupervised learning. In reinforcement learning, the model is trained on the basis of feedback from the neighboring environment; it is also known as feedback-oriented or reward-based learning [1]. The ML process involves various steps depending on the prediction model; in general, as defined in [5], these are (a) data collection, (b) data preparation, (c) data analysis, (d) model training, and (e) model testing, after which the prediction model is ready to use.


In OSNs, a Sybil attack is said to have taken place when a normal-looking malicious user creates multiple fake user accounts and tries to control the behavior of a social platform. The Sybil attack is basically a security threat that tries not only to control the resources available on the network but also to influence natural social interactions. To counter these types of attacks, many studies on the development of defense mechanisms against Sybil attacks are available [6–9]. Twitter, at present one of the most significant Online Social Networks, started as a microblogging platform on which tweets were restricted to 140 characters. It has since also become an information-sharing platform, and the size of tweets has been doubled to 280 characters. The use of this platform for social causes is increasing at a huge pace, and so is the number of Sybil accounts present. Hence, it becomes imperative to detect such accounts. In this paper, we have developed prediction models using ML classifiers for the detection of these Sybil accounts, used here as a synonym for fake accounts. We have used four classification techniques, namely Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM), to determine whether an account is fake or genuine.

1.1 Objectives The objectives kept in focus while conducting this research work are (a) to analyze the effect of biasing the datasets, i.e., of combining fake accounts (FSF, INT, TWT) with genuine accounts (E13 and TFP); (b) to analyze the classification results on the biased datasets for the RF, DT, LR, and SVM classifiers; and (c) to compare the classifier results based on evaluation metrics.

1.2 Paper Structure This paper consists of four sections. Section 2 explains the methodology followed to achieve the objectives of this study. Section 3 highlights the results and analysis of the experimentation. Section 4 concludes the work and points toward future work. The novelty of this paper is as follows: (a) a set of real-time datasets has been used for the prediction models, with different proportions of biasing; (b) two feature selection techniques from different categories, namely Correlation with Heatmap and LR-RFE, have been used to select the optimum set of features before the application of predictive modeling; (c) four different classifiers have been explored for this predictive modeling for the purpose of comparing their performances on the given real-time datasets; and (d) the performances of two of the proposed models have


been found to be nearly ideal on almost all the evaluation parameters, with the performance of the DT model the best.

2 The Methodology Followed In this section, the methodology followed to conduct this research work is described briefly.

2.1 Dataset and Biasing The datasets used to conduct this study are briefly described here; their details are shown in Table 1. Cresci et al. in [10] collected these datasets in their research work, and we are thankful to them for allowing us to perform our experiments on them. The table contains the data of Twitter user accounts. It includes five datasets, of which three are of fake accounts (FSF, TWT, INT) and two are of genuine accounts (E13, TFP). The number of features in all the datasets is the same, i.e., 34. The authors of [10] collected the dataset of genuine accounts themselves for their own study, and the dataset of fake accounts was bought online. After collecting the datasets, the next step carried out was the biasing of the datasets. Table 2 shows the details of the biased datasets.

Table 1 Datasets considered [4]

Type of accounts   S. no.  Dataset                   No. of features  No. of accounts
Fake accounts      1       FSF (Fast Followerz)      34               1169
                   2       INT (Inter Twitter)       34               1337
                   3       TWT (Twitter Technology)  34               845
Genuine accounts   1       E13 (Elezioni 2013)       34               1481
                   2       TFP (The Fake Project)    34               469

Table 2 Biased datasets

Case  Dataset     Composition  No. of accounts
D1    Dataset-D1  E13-FSF      2650
D2    Dataset-D2  E13-INT      2818
D3    Dataset-D3  E13-TWT      2326
D4    Dataset-D4  TFP-FSF      1638
D5    Dataset-D5  TFP-INT      1806
D6    Dataset-D6  TFP-TWT      1314


Table 2 shows that a total of six datasets were obtained after biasing fake accounts with genuine accounts. Further, to study the effect of biasing, four cases were prepared for each dataset. In the first case, 100% biasing is done, i.e., all accounts of genuine and fake users are combined together. In the second case, 75% of the genuine users' accounts are biased with 25% of the fake users' accounts. In the third case, 60% of the genuine accounts are biased with 40% of the fake ones. In the last case, 50% of the genuine accounts are biased with an equal percentage of fake ones. These cases have been named D11 (100), D12 (75–25), D13 (60–40), and D14 (50) for Dataset-D1, and likewise for Datasets D2 through D6. These 24 datasets in total were used during the process of feature selection and classification for the purpose of predicting the occurrences of Sybil accounts. A sketch of how such biased cases might be constructed is given below.
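The biased cases described above can be produced with a few lines of pandas, under one plausible reading of the biasing percentages (fractions sampled from each class). The file names, label coding, and random seed below are hypothetical, not taken from the paper.

```python
# A minimal sketch of building the biased cases for Dataset-D1 (E13 + FSF).
# File names, the label column, and the seed are illustrative assumptions.
import pandas as pd

genuine = pd.read_csv("E13.csv")  # hypothetical file of genuine accounts
fake = pd.read_csv("FSF.csv")     # hypothetical file of fake accounts
genuine["label"], fake["label"] = 0, 1  # assumed class coding

def biased_case(genuine_frac, fake_frac, seed=42):
    g = genuine.sample(frac=genuine_frac, random_state=seed)
    f = fake.sample(frac=fake_frac, random_state=seed)
    # Shuffle the combined rows so the two classes are interleaved.
    return pd.concat([g, f]).sample(frac=1.0, random_state=seed)

d11 = biased_case(1.00, 1.00)  # case D11 (100)
d12 = biased_case(0.75, 0.25)  # case D12 (75-25)
d13 = biased_case(0.60, 0.40)  # case D13 (60-40)
d14 = biased_case(0.50, 0.50)  # case D14 (50)
```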

2.2 Experimental Setup In this study, we have used the Python language for the process of feature selection and the implementation of various classification algorithms. Data preprocessing includes the scaling, cleaning, integration, and reduction of data, after which normalized, scaled, and cleaned data is obtained. The dataset used in this study contained some features with no values at all and some features with missing values. So, the features with no values were eliminated first, and after that, the missing values in the dataset were replaced with zero. A feature named dataset contained only the name of the particular dataset, so this feature was also dropped. At the end of data preprocessing, a subset of 31 features was obtained from the original set of 34 features (a sketch of these steps is given below). The next step was to obtain the most significant features out of these 31 features using feature selection techniques. Feature Selection (FS) is basically the process of removing insignificant, unwanted, and noisy features [11]. With the help of FS, a subset of pertinent features is selected out of the total number of features in any available dataset. This helps in selecting the features that contribute most toward the output variable and thus makes the predictive model more competent [4]. FS techniques are divided into three categories: the filter method, the wrapper method, and the embedded method. In our study, we have used Correlation with Heatmap, which is a filter method, and the Recursive Feature Elimination with Logistic Regression (LR-RFE) technique, which is a wrapper method, for the selection of optimal features. Correlation with Heatmap is a feature selection technique in which the data is represented graphically: a 2D visualization of the data is given as a matrix in which different colors represent different values. Correlation itself is a statistical measure that conveys the strength of the linear relationship between two variables [12, 13].
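The preprocessing steps and the correlation-based filter translate naturally into pandas. The sketch below assumes the raw data sits in a DataFrame df; the correlation threshold of 0.9 is an illustrative choice, not a value stated in the paper.

```python
# A minimal sketch of the preprocessing steps described in the text, plus a
# correlation-based filter of the kind used with the heatmap. 'df' and the
# 0.9 threshold are assumptions for illustration.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(axis=1, how="all")                     # drop features with no values
    df = df.fillna(0)                                     # replace missing values with zero
    return df.drop(columns=["dataset"], errors="ignore")  # drop the name-only feature

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()
    # Keep only the upper triangle so each feature pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```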


The Recursive Feature Elimination method of feature selection selects a smaller number of features iteratively. In this method, to gauge the importance of each feature, the predictor is first trained on the original set of features; the least significant features are then eliminated, and this process continues until a proper set of features is obtained. RFE mainly helps in ranking feature significance and in feature selection. The study in [14] shows that reducing features using RFE helps improve prediction accuracy. So, in order to obtain the best subset of features from the original 31, these feature selection techniques were used in this study. At the end of the feature selection process, an optimal subset of 22 features was obtained, which was then used in the process of model building. The predictive models were built using ML classifiers. In this study, four classifiers were used, namely Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), and Support Vector Machine (SVM), for the training and testing of the classification models. The conventional ratio of 70:30 has been used for training and testing the classifier models in our study. RF is an ensemble learning method in which a multitude of decision trees is constructed at training time [15]. Decision trees classify instances by sorting them based on feature values [16]. LR is a regression method for predicting a binary dependent variable [17]. SVM is a supervised learning algorithm that is useful for recognizing precise patterns in complex datasets [18]. The experimentation in this study was conducted for the detection of fake (Sybil) accounts in the Twitter datasets. For the evaluation of the experimental results, we have used the confusion matrix and evaluation metrics; the metrics used here were Accuracy, Precision, Recall, F1 score, Matthews Correlation Coefficient (MCC), and Specificity. A sketch of this pipeline is given below.
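A compact sketch of the whole pipeline, LR-RFE followed by the four classifiers and the six metrics, is given below using scikit-learn. The number of selected features (22) and the 70:30 split come from the text; the solver settings and random seed are assumptions.

```python
# A minimal sketch of LR-RFE feature selection plus training and scoring the
# four classifiers with a 70:30 split; hyperparameters are assumed defaults.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def run_experiment(X, y):
    # Wrapper-style selection: RFE around a Logistic Regression estimator.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=22)
    X_sel = rfe.fit_transform(X, y)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sel, y, test_size=0.30, random_state=42)
    models = {"RF": RandomForestClassifier(), "DT": DecisionTreeClassifier(),
              "LR": LogisticRegression(max_iter=1000), "SVM": SVC()}
    for name, model in models.items():
        y_pred = model.fit(X_tr, y_tr).predict(X_te)
        tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
        print(name,
              accuracy_score(y_te, y_pred), precision_score(y_te, y_pred),
              recall_score(y_te, y_pred), f1_score(y_te, y_pred),
              matthews_corrcoef(y_te, y_pred),
              tn / (tn + fp))  # specificity = TN / (TN + FP)
```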

3 Results and Analysis The results of the experiments conducted in this study using the four classifiers RF, DT, LR, and SVM are shown in Tables 3, 4, 5, and 6, respectively. The graphical representations of these results for each classifier are depicted in Figs. 1, 2, 3, and 4 for the sake of comparison. Table 3 displays the experimental results of the RF classifier for all the 24 cases obtained after the biasing of datasets, and Fig. 1 gives the graphical representation of the results compiled in Table 3. As can be seen, the performance of this classifier is quite good for datasets 1, 2, and 4 as far as the values of the six evaluation metrics are concerned, but for the other datasets the values of these metrics are somewhat lower. Table 4 displays the experimental results of the DT classifier for all the 24 cases, with the corresponding graphical representation shown in Fig. 2. It can be concluded from this figure that not only does this classifier produce quite good metric values for datasets 1, 2, and 4, but the values for the other three datasets are also better than those of the RF classifier.


Table 3 Results of Random Forest classifier-based prediction model (metric values for the Random Forest classifier)

Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC    Specificity
Dataset-D1  D11 (100)    1.000     1.000      1.000   1.000     1.000  1.000
            D12 (75–25)  1.000     1.000      1.000   1.000     1.000  1.000
            D13 (60–40)  1.000     1.000      1.000   1.000     1.000  1.000
            D14 (50)     1.000     1.000      1.000   1.000     1.000  1.000
Dataset-D2  D21 (100)    0.996     0.996      1.000   0.993     0.992  0.992
            D22 (75–25)  1.000     1.000      1.000   1.000     1.000  1.000
            D23 (60–40)  1.000     1.000      1.000   1.000     1.000  1.000
            D24 (50)     1.000     1.000      1.000   1.000     1.000  1.000
Dataset-D3  D31 (100)    0.974     0.980      0.993   0.968     0.942  0.937
            D32 (75–25)  0.984     0.991      1.000   0.982     0.930  0.885
            D33 (60–40)  0.963     0.976      0.989   0.963     0.895  0.877
            D34 (50)     0.974     0.980      0.986   0.974     0.942  0.942
Dataset-D4  D41 (100)    1.000     1.000      1.000   1.000     1.000  1.000
            D42 (75–25)  1.000     1.000      1.000   1.000     1.000  1.000
            D43 (60–40)  1.000     1.000      1.000   1.000     1.000  1.000
            D44 (50)     1.000     1.000      1.000   1.000     1.000  1.000
Dataset-D5  D51 (100)    0.983     0.970      0.980   0.961     0.959  0.984
            D52 (75–25)  1.000     1.000      1.000   1.000     1.000  1.000
            D53 (60–40)  0.981     0.972      1.000   0.947     0.959  0.972
            D54 (50)     0.994     0.989      1.000   0.979     0.985  0.992
Dataset-D6  D61 (100)    0.954     0.945      0.936   0.954     0.906  0.967
            D62 (75–25)  0.973     0.978      0.971   0.985     0.944  0.976
            D63 (60–40)  0.927     0.924      0.982   0.873     0.860  0.882
            D64 (50)     0.984     0.976      0.954   1.000     0.966  1.000

Table 5 displays the experimental results of the LR classifier for all the 24 cases obtained after the biasing of datasets, and Fig. 3 shows the graphical representation of these results. An examination of this table, as well as the figure, shows that the values of almost all the evaluation metrics for this classifier are quite low. Table 6 displays the experimental results of the SVM classifier for all the 24 cases, and Fig. 4 gives the graphical representation of the results displayed in this table. From Table 6 and Fig. 4, it can be deduced that the performance of the SVM classifier on all the evaluation metrics is far from satisfactory. Based upon the results and the analyses of the values of all six evaluation metrics for the four classifiers used in our experimentation, it is concluded that the results of the Decision Tree classifier were the best for all the evaluation metrics. This entails that, when used for the prediction of occurrences of Sybil accounts on datasets pertaining to the Twitter OSN, this prediction model will produce the best results, with the best possible values of accuracy, recall, specificity, precision, F1 score, and MCC. We have arrived at this conclusion as the values achieved for all these metrics in our experiments are near perfect.


Table 4 Results of DT classifier-based prediction model (metric values for the Decision Tree classifier)

Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC    Specificity
Dataset-D1  D11 (100)    0.996     0.996      0.993   1.000     0.993  1.000
            D12 (75–25)  1.000     1.000      1.000   1.000     1.000  1.000
            D13 (60–40)  1.000     1.000      1.000   1.000     1.000  1.000
            D14 (50)     1.000     1.000      1.000   1.000     1.000  1.000
Dataset-D2  D21 (100)    0.988     0.989      0.987   0.990     0.977  0.989
            D22 (75–25)  0.993     0.996      1.000   0.992     0.980  0.968
            D23 (60–40)  0.990     0.992      0.990   0.995     0.979  0.991
            D24 (50)     1.000     1.000      1.000   1.000     1.000  1.000
Dataset-D3  D31 (100)    0.962     0.971      0.967   0.976     0.917  0.953
            D32 (75–25)  0.965     0.980      0.992   0.969     0.837  0.783
            D33 (60–40)  0.933     0.956      0.966   0.947     0.810  0.825
            D34 (50)     0.968     0.976      0.982   0.970     0.930  0.943
Dataset-D4  D41 (100)    1.000     1.000      1.000   1.000     1.000  1.000
            D42 (75–25)  1.000     1.000      1.000   1.000     1.000  1.000
            D43 (60–40)  1.000     1.000      1.000   1.000     1.000  1.000
            D44 (50)     1.000     1.000      1.000   1.000     1.000  1.000
Dataset-D5  D51 (100)    0.972     0.951      0.922   0.981     0.932  0.992
            D52 (75–25)  0.993     0.993      0.987   1.000     0.986  1.000
            D53 (60–40)  0.983     0.974      1.000   0.950     0.963  0.975
            D54 (50)     1.000     1.000      1.000   1.000     1.000  1.000
Dataset-D6  D61 (100)    0.944     0.934      0.934   0.934     0.887  0.952
            D62 (75–25)  0.967     0.973      0.960   0.986     0.933  0.979
            D63 (60–40)  0.897     0.892      0.920   0.865     0.796  0.878
            D64 (50)     0.993     0.990      0.980   1.000     0.985  1.000



Table 5 Results of LR classifier-based prediction model (metric values for the Logistic Regression classifier)

Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC     Specificity
Dataset-D1  D11 (100)    0.817     0.805      0.674   1.000     0.689   1.000
            D12 (75–25)  0.902     0.937      0.882   1.000     0.753   1.000
            D13 (60–40)  0.861     0.886      0.795   1.000     0.746   1.000
            D14 (50)     0.869     0.874      0.777   1.000     0.769   1.000
Dataset-D2  D21 (100)    0.765     0.705      0.546   0.996     0.604   0.997
            D22 (75–25)  0.974     0.983      0.967   1.000     0.931   1.000
            D23 (60–40)  0.829     0.849      0.747   0.982     0.694   0.976
            D24 (50)     0.851     0.842      0.733   0.989     0.736   0.990
Dataset-D3  D31 (100)    0.687     0.792      0.910   0.700     0.235   0.266
            D32 (75–25)  0.745     0.852      0.853   0.851     −0.064  0.081
            D33 (60–40)  0.718     0.824      0.890   0.767     0.141   0.221
            D34 (50)     0.638     0.758      0.858   0.679     0.084   0.207
Dataset-D4  D41 (100)    0.872     0.739      0.586   1.000     0.703   1.000
            D42 (75–25)  0.868     0.861      0.756   1.000     0.767   1.000
            D43 (60–40)  0.895     0.835      0.717   1.000     0.784   1.000
            D44 (50)     0.903     0.779      0.638   1.000     0.751   1.000
Dataset-D5  D51 (100)    0.833     0.552      0.386   0.968     0.547   0.995
            D52 (75–25)  0.837     0.812      0.683   1.000     0.715   1.000
            D53 (60–40)  0.888     0.788      0.682   0.933     0.731   0.978
            D54 (50)     0.916     0.789      0.661   0.979     0.762   0.995
Dataset-D6  D61 (100)    0.638     0.416      0.325   0.577     0.198   0.843
            D62 (75–25)  0.779     0.803      0.688   0.965     0.610   0.953
            D63 (60–40)  0.902     0.900      0.909   0.891     0.804   0.896
            D64 (50)     0.728     0.415      0.283   0.777     0.347   0.958

4 Conclusion In this study, data preprocessing and a combination of feature selection techniques have been applied to the datasets taken from the authors of another study. For obtaining a subset of optimal features, we used two FS techniques belonging to two different categories, Correlation with Heatmap and LR-RFE, and through them we obtained a subset of 22 effective features from the original set of 31 features. We carried out experimentation on the set of 24 biased datasets containing the data related to these selected features. The prediction models were then built using four classifiers, namely RF, DT, LR, and SVM. The analyses of the results obtained for all six evaluation metrics show that, with the selected set of features on the 24 datasets, the performance of the Decision Tree (DT) classifier was better than that of the other three classifiers used in this study. The values of all six metrics for this classifier have been found to be near perfect, which means that the predictions made by this model can be used to identify the presence of Sybil accounts in the datasets of an OSN, specifically Twitter, with great accuracy. In future, we intend to enhance our prediction models by using ensemble and optimization techniques to achieve better results on the same or different datasets.


Table 6 Results of SVM classifier-based prediction model (metric values for the Support Vector Machine classifier)

Datasets    Cases        Accuracy  Precision  Recall  F1-score  MCC    Specificity
Dataset-D1  D11 (100)    0.562     0.719      1.000   0.562     0.000  0.000
            D12 (75–25)  0.825     0.904      1.000   0.825     0.000  0.000
            D13 (60–40)  0.676     0.806      1.000   0.676     0.000  0.000
            D14 (50)     0.584     0.737      1.000   0.584     0.000  0.000
Dataset-D2  D21 (100)    0.513     0.678      1.000   0.513     0.000  0.000
            D22 (75–25)  0.782     0.877      1.000   0.782     0.000  0.000
            D23 (60–40)  0.640     0.780      1.000   0.640     0.000  0.000
            D24 (50)     0.539     0.701      1.000   0.539     0.000  0.000
Dataset-D3  D31 (100)    0.653     0.790      1.000   0.653     0.000  0.000
            D32 (75–25)  0.860     0.924      1.000   0.860     0.000  0.000
            D33 (60–40)  0.743     0.852      1.000   0.743     0.000  0.000
            D34 (50)     0.661     0.796      1.000   0.661     0.000  0.000
Dataset-D4  D41 (100)    0.691     0.000      0.000   0.000     0.000  1.000
            D42 (75–25)  0.868     0.861      0.756   1.000     0.767  1.000
            D43 (60–40)  0.629     0.000      0.000   0.000     0.000  1.000
            D44 (50)     0.733     0.000      0.000   0.000     0.000  1.000
Dataset-D5  D51 (100)    0.734     0.000      0.000   0.000     0.000  1.000
            D52 (75–25)  0.515     0.680      1.000   0.515     0.000  0.000
            D53 (60–40)  0.696     0.000      0.000   0.000     0.000  1.000
            D54 (50)     0.761     0.000      0.000   0.000     0.000  1.000
Dataset-D6  D61 (100)    0.603     0.000      0.000   0.000     0.000  1.000
            D62 (75–25)  0.655     0.792      1.000   0.655     0.000  0.000
            D63 (60–40)  0.517     0.000      0.000   0.000     0.000  1.000
            D64 (50)     0.658     0.000      0.000   0.000     0.000  1.000




Fig. 1 Comparative analysis of RF classifier metrics on 24 biased datasets


Fig. 2 Comparative analysis of DT classifier metrics on 24 biased datasets



Fig. 3 Comparative analysis of LR classifier metrics on 24 biased datasets


Fig. 4 Comparative analysis of SVM classifier metrics on 24 biased datasets

Acknowledgments We convey our gratitude to Cresci et al. [10] for granting us permission to perform our experiments on the datasets we acquired from them.

References
1. M. Al-Qurishi, M. Al-Rakhami, A. Alamri, M. Alrubaian, S.M.M. Rahman, M.S. Hossain, Sybil defense techniques in online social networks: a survey. IEEE Access 5, 1200–1219 (2017)
2. D. Boyd, N. Ellison, Social network sites: definition, history, and scholarship. J. Comput. Med. Commun. 13, 210–230 (2007)


3. P. Galán-García, J.G.D.L. Puerta, C.L. Gómez, I. Santos, P.G. Bringas, Supervised machine learning for the detection of troll profiles in twitter social network: application to a real case of cyberbullying. Logic J. IGPL 24(1), 42–53 (2016)
4. D. Sonkhla, M. Sood, Performance analysis and feature selection on Sybil user data using recursive feature elimination. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 8, 48–56 (2019)
5. H.M. Anwer, M. Farouk, A. Abdel-Hamid, A framework for efficient network anomaly intrusion detection with feature selection, in Proceedings of 9th International Conference on Information and Communication Systems, Irbid (2018), pp. 157–162
6. A. Vasudeva, M. Sood, Sybil attack on lowest id clustering algorithm in the mobile ad hoc network. Int. J. Netw. Secur. Appl. 4(5), 135–147 (2012)
7. M. Sood, A. Vasudeva, Perspectives of Sybil attack in routing protocols of mobile ad hoc network, in Computer Networks and Communications (NetCom), ed. by N. Chaki, et al. Lecture Notes in Electrical Engineering, vol. 131 (Springer, New York, NY, 2013), pp. 3–13
8. A. Vasudeva, M. Sood, A Vampire Act of Sybil attack on the highest node degree clustering in mobile Ad Hoc networks. Indian J. Sci. Technol. 9(32), 1–9 (2016)
9. A. Vasudeva, M. Sood, Survey on Sybil attack defense mechanisms in wireless ad hoc networks. J. Netw. Comput. Appl. 120, 78–118 (2018)
10. S. Cresci, R.D. Pietro, R. Petrocchi, A. Spognardi, M. Tesconi, Fame for sale: efficient detection of fake Twitter followers. Decis. Support Syst. 80, 56–71 (2015)
11. H. Nkiama, S.Z.M. Said, M. Saidu, A subset feature elimination mechanism for intrusion detection system. Int. J. Adv. Comput. Sci. Appl. 7(4), 148–157 (2016)
12. Y. Saeys, I. Inza, P. Larrañaga, A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
13. S. Zhao, Y. Guo, Q. Sheng, Y. Shyr, Advanced heat map and clustering analysis using heatmap. BioMed. Res. Int. (2014)
14. T.E. Mathew, A logistic regression with recursive feature elimination model for breast cancer diagnosis. Int. J. Emerg. Technol. 10(3), 55–63 (2019)
15. A. Liaw, M. Wiener, Classification and regression by random forest. R News 2(3), 18–22 (2002)
16. S.B. Kotsiantis, I. Zaharakis, P. Pintelas, Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)
17. I. Kurt, M. Ture, A.T. Kurum, Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Syst. Appl. 34, 366–374 (2008)
18. P. Pavlidis, I. Wapinski, W.S. Noble, Support vector machine classification on the web. Bioinform. Appl. Note 20(4), 586–587 (2004)

Exploring Feature Selection Technique in Detecting Sybil Accounts in a Social Network
Shradha Sharma and Manu Sood

Abstract Machine learning (ML) provides techniques to carve out meaningful insights from the useful information embedded in various datasets by making the machine learn from the data. Different machine learning techniques are available for various purposes. The general sequence of steps for a typical supervised machine learning technique includes preprocessing, feature selection, building the prediction model, and testing and validating the model. Various ML techniques are being used to detect the presence of fake as well as spambot accounts on a number of Online Social Networks (OSNs). These fake/spambot accounts, especially Sybil accounts, appear in these networks with malicious intentions to disrupt or hijack the very purpose of these networks. In this paper, we have trained various prediction models using appropriate real-time datasets to detect the presence of Sybil accounts on online social media. Since the data was collected from various sources, preprocessing of the datasets was necessary. The preprocessing was mainly carried out to (a) remove the noise from this data and/or (b) normalize the values of various features. Next, three different feature selection techniques were used to select the optimal set of features from the superset of features, so as to remove the features that are redundant and irrelevant for making accurate predictions. The three feature selection techniques used are Correlation Matrix with Heatmap, Feature Importance, and Recursive Feature Elimination with Cross-Validation. Further, K-Nearest Neighbor (KNN), Random Forest (RF), and Support Vector Machine (SVM) classifiers were deployed to train the proposed prediction models for predicting the presence of Sybil accounts in the OSN dataset. The performances of the proposed prediction models have been analyzed using six standard metrics. We conclude that the prediction model based on the Random Forest classifier provides the best results in predicting the presence of Sybil accounts in the dataset of an OSN.

S. Sharma (B) · M. Sood Department of Computer Science, Himachal Pradesh University, Shimla, India e-mail: [email protected] M. Sood e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_61


Keywords Data preprocessing · Feature selection · Classification · K nearest neighbor classifier · Random forest classifier · Support vector machine classifier · Datasets · Sybil account

1 Introduction The transactions and interactions performed online on the Internet are generating huge amounts of digital traces in the cyber-physical space. A significant chunk of this data can be attributed to social media and social networks. Online social networks, because of their great user-friendliness, simple interfaces, and multilevel stay-in-touch approaches, are not only attracting a large number of users to use these networks 24 × 7, but are also drawing the attention of spammers and attackers. These spammers and attackers exploit the inbuilt mechanisms of these social networks to influence the interactions of genuine users, sometimes adversely. The social network sites record these interactions, due to which large amounts of data are generated and stored in various servers. With the existing data-manipulation practices, it is difficult to separate the data related to attackers from this huge body of data. Of late, Machine Learning (ML) has come to the forefront for the detection of such malicious activities intentionally launched by users with vested interests. Used appropriately, it can classify the data related to genuine users and fake/malicious users. ML, as a subset of Artificial Intelligence, focuses on training machines through ML algorithms on huge datasets so as to detect or predict the occurrences of data related to fake users or attackers.

There are four types of ML techniques: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Supervised learning trains the machine using labeled data; it mainly uses classification and regression techniques, where classification predicts discrete responses and regression predicts continuous responses. Unsupervised learning does not use labeled data; it uses techniques like clustering and association. Clustering is mainly used for grouping data and finding hidden patterns, while association rules mainly help in finding associations among the data objects in a huge dataset. In semi-supervised learning, the amount of input data is very large but only a small part of it is labeled; it mainly deploys graph-based methods along with classification, regression, and prediction techniques. Reinforcement learning is based on the reward and punishment method; it consists of three main components (agent, environment, and actions) and is mainly used in gaming, navigation, and robotics. Given the problem at hand, the detection/prediction of fake/malicious users, supervised learning is the best suited ML category for this purpose.

In supervised learning, the process of training a model consists of a number of steps, the very first being the collection of the data necessary to train the model. Data collected from various sources is large in size and may contain noise. So, the next step is preprocessing of this data, which involves data cleaning, data transformation, error correction, and normalization [1]. Data cleaning deals with the removal of missing


and noisy data. Missing data is data with some values absent, and noisy data is irrelevant data. The next step is data transformation, which transforms the data into the required form, followed by the process of feature selection. Datasets may consist of hundreds of features, out of which only a few may contribute to effective prediction. Hence, irrelevant features are dropped using various feature selection techniques, which in turn improves the accuracy and reduces the training time. There are three methods for feature selection: the filter, wrapper, and embedded methods. In the filter method, features are selected based on their scores in a statistical test of their correlation with the target variable [2]. In the wrapper method, the feature subset is selected using an induction algorithm that is also part of the evaluation function [3]. The third technique, the embedded method, combines the techniques used in the filter and wrapper methods. The next step in the creation of the prediction model is the selection of the ML technique. A number of techniques are available for training a prediction model, and to date no single technique can be used in all scenarios in general; the technique(s) to be used depend on the problem at hand, the datasets, and various other factors. Since our datasets contained labeled data and our problem is one of classification, we have used the K-Nearest Neighbor, Random Forest, and Support Vector Machine classification techniques. After the selection of the prediction model, the next important step is to test it, for which the dataset is divided into two parts: training data and testing data. Training data is used to train the model, and testing data is used to test the designed and trained model.

Any social network is a huge platform that provides a way for people all around the world to connect with each other. Online social networks (OSNs) provide platforms through which people across the globe can exchange their ideas and information with one another. They are similar to virtual communities connected through hardware, networking, and special software for socialization. One of the most popular OSNs used by people across the globe is Twitter. The adversaries in such OSNs may be either fake users with Sybil accounts associated with them or spambots. The OSNs often provide either some rules to detect/remove the spambots or some software program(s) for this purpose, but at times both of these mechanisms fail to detect the fake users or the spambots, and the spambots succeed in their malicious designs [4–6]. Twitter, before being transformed into a huge social site, originally started as a personal micro-blogging site [7]. In order to increase the number of followers of a target account on Twitter or any other OSN, fake followers and social spambots come into play. A social spambot is a profile on a social media platform that is programmed to automatically generate messages and follow accounts. In this paper, after using three feature selection techniques to obtain an optimal subset of features, we have explored three different ML techniques to find the fake followers and social spambots on an OSN such as Twitter. In order to determine whether a given account in the dataset is a real or a Sybil account, three classification techniques were used: K-Nearest Neighbor (KNN), Random Forest (RF) and Support Vector Machine (SVM).
The datasets used for this research work contained both numerical and categorical values, and each instance in a dataset has a unique set of values. Since we have used Python for the implementation, the classifiers could not consume the categorical values directly. Instead of assigning binary values to the categorical data, all the categorical values were converted into their corresponding numerical values, which in turn provided better results.

This research work is intended to fulfill the following objectives: (1) using a combination of feature selection techniques, to retrieve the set of best optimal features from the complete feature set for all the datasets used, (2) to compare the prediction performance of the three ML classifiers, i.e., RF, KNN and SVM, on the various available datasets, and (3) to study and analyze the effect of biasing on human accounts with fake followers and human accounts with social spambots.

The novelty of this research work is: (a) various datasets containing real-time data, used with the permission of the owners, have been prepared for the training and testing of ML prediction models using different percentages of biasing, (b) in order to obtain the most optimal set of features in the datasets, a combination of three feature selection techniques has been used in cascade, and (c) three supervised learning classifiers have been used and their performances compared on all the datasets under investigation, so as to predict the occurrence of Sybil accounts and spambot accounts in these real-time datasets with high accuracy.

The remainder of this paper consists of the following sections. Section 2 presents the related work. Section 3 describes the methodology followed. Section 4 highlights the results and their analyses. Section 5 presents the conclusion and future scope.

2 Related Work

Machine learning is being widely used across the world to find solutions, elusive so far, in different problem domains. One such domain is the presence of Sybil and/or fake accounts on OSNs. The techniques of machine learning are also being used in predicting the occurrences of these fake Twitter accounts and spambots. Some of the related work is mentioned here so as to give an idea of the current state of affairs in this field. For the detection of fake Twitter accounts, the authors in [7] have used the feature selection process to create a subset of the features in such a way that no useful information was left out but the unnecessary and repetitive features were removed. They have shown that it enhanced the accuracy and reduced the computational time. The feature selection techniques, when used appropriately, decrease the storage needs, prevent overfitting and enable data visualization [8]. Whether an account on an OSN is a human account or a fake account can be predicted on the basis of the information obtained from the profile of the account [9]. A subtle criterion has been outlined in [10] on the basis of which a human account can be detected in a dataset that contains a combination of human as well as spambot accounts. The authors in [11] have successfully used ML algorithms to detect spam on Twitter. In their work, they have proposed a hybrid approach to detecting streaming Twitter spam using a combination of Decision Tree, Particle Swarm Optimization and Genetic algorithm. Some important features based on message content and user behavior can be used along with the SVM classification algorithm for the successful detection of spam [12]. McCord and Chuah in [13] have carried out their evaluation on the basis of suggested user- and content-based features with the help of four classifiers for the detection of spammers and legitimate users.

3 Methodology Followed

This section briefly describes the process followed to achieve the goal. For the implementation of the various ML algorithms, the Python language has been used. We have used Jupyter Notebook to execute the code written in Python; Jupyter Notebook is an open-source, web-based application [14].

3.1 Datasets

In order to achieve the set objectives, a collection of datasets having human, fake Twitter as well as spambot accounts has been considered. We have studied nine datasets, of which a total of eight have been used in this study: two datasets (E13, Genuine accounts) contain human accounts, three datasets contain fake-account data and the other three contain the data for spambots. Each dataset supports the same number of features, all of which have the same labels. Table 1 presents the type of each dataset, its nature and the number of accounts it contains. Table 2 presents the names of all 32 features in the datasets. Cresci et al. created these datasets for their research work [7]. They verified each and every genuine account themselves, which makes these datasets special.

Table 1 Datasets considered

S. no. | Dataset | Nature of accounts | No. of accounts
1 | E13 (Elezioni 2013) | Human | 1481
2 | TFP (The Fake Project) | Human | 469
3 | Genuine accounts | Human | 3474
4 | TWT (Twitter Technology) | Fake accounts | 845
5 | INT (Inter Twitter) | Fake accounts | 1337
6 | FSF (Fast Followers) | Fake accounts | 1169
7 | Spambot 1 | Spambots | 991
8 | Spambot 2 | Spambots | 3457
9 | Spambot 3 | Spambots | 464

Table 2 Features in the features set

Id, Friends count, Language, Geo enabled, Name, Favorite count, Time zone, Profile image URL, Screen name, Listed count, Location, Profile banner URL, Status count, Created at, Default profile, Profile text color, Followers count, URL, Default profile image, Profile image URL https, UTC offset, Protected, Verified, Updated, Profile sidebar fill color, Profile background image URL, Profile background color, Profile link color, Profile use background image, Profile background image URL https, Profile sidebar border color, Profile background

The first dataset of human accounts, E13 (Elezioni 2013), was created during the elections conducted in Italy in 2013. The second dataset of human accounts, The Fake Project (TFP), was created by the authors on their own: they had started a project named "The Fake Project" (a Twitter account) in order to collect the data for these human accounts. We have not used this dataset in our experiments. The next three datasets, which contain the details of fake accounts, were bought by them online. The Spambot 1 dataset was created by observing a group of social bots on Twitter during the 2014 Mayoral elections in Rome. Spambot 2 promoted the #TALNTS hashtag for several months, where Talnts was a mobile phone application used for hiring workers in the fields of writing, digital photography and music. The Spambot 3 dataset was collected from accounts advertising products on sale at Amazon.com.

3.2 Dataset Biasing

Out of the nine datasets described in Table 1, we have used eight to generate 18 biased datasets, as shown in Table 3. The biasing of data is carried out as follows. Biasing of the E13 dataset with the FSF, INT and TWT datasets in the ratios of 50:50, 25:75 and 75:25 was carried out to obtain nine datasets named M1-M9. Nine more biased datasets, M10-M18, were obtained by biasing the Genuine accounts dataset with the three spambot datasets, again in the ratios of 50:50, 25:75 and 75:25. The table also lists the resulting total number of accounts in each of these 18 datasets.
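As a rough illustration of this biasing step, the following sketch (ours; the CSV file names, labeling scheme and random seed are assumptions) mixes a human-accounts dataset with a fake-accounts dataset in a chosen ratio using pandas:

```python
import pandas as pd

def make_biased_dataset(human_df, bot_df, human_frac, bot_frac, seed=42):
    """Sample the two classes in the given proportions, label them,
    and shuffle the combined rows."""
    human = human_df.sample(frac=human_frac, random_state=seed).assign(label=0)
    bots = bot_df.sample(frac=bot_frac, random_state=seed).assign(label=1)
    return pd.concat([human, bots]).sample(frac=1, random_state=seed)

# Example: M3 = (E13-FSF) (75%-25%); the CSV file names are hypothetical
e13 = pd.read_csv("E13.csv")
fsf = pd.read_csv("FSF.csv")
m3 = make_biased_dataset(e13, fsf, human_frac=0.75, bot_frac=0.25)
```

Sampling 75% of E13 (1481 accounts) and 25% of FSF (1169 accounts) yields roughly the 1111 + 293 = 1404 accounts reported for M3 in Table 3.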

Table 3 Biased datasets

Case | Dataset mix | Human accounts | Fake accounts | Spambot accounts | Total accounts
M1 | (E13-FSF) (50%-50%) | 741 | 585 | - | 1326
M2 | (E13-FSF) (25%-75%) | 371 | 877 | - | 1248
M3 | (E13-FSF) (75%-25%) | 1111 | 293 | - | 1404
M4 | (E13-INT) (50%-50%) | 741 | 669 | - | 1410
M5 | (E13-INT) (25%-75%) | 371 | 1003 | - | 1374
M6 | (E13-INT) (75%-25%) | 1111 | 335 | - | 1446
M7 | (E13-TWT) (50%-50%) | 741 | 423 | - | 1164
M8 | (E13-TWT) (25%-75%) | 371 | 634 | - | 1005
M9 | (E13-TWT) (75%-25%) | 1111 | 212 | - | 1323
M10 | (Genuine-spambot 1) (50%-50%) | 1738 | - | 496 | 2234
M11 | (Genuine-spambot 1) (25%-75%) | 869 | - | 744 | 1613
M12 | (Genuine-spambot 1) (75%-25%) | 2609 | - | 248 | 2857
M13 | (Genuine-spambot 2) (50%-50%) | 1738 | - | 1729 | 3467
M14 | (Genuine-spambot 2) (25%-75%) | 869 | - | 2594 | 3463
M15 | (Genuine-spambot 2) (75%-25%) | 2609 | - | 865 | 3474
M16 | (Genuine-spambot 3) (50%-50%) | 1738 | - | 233 | 1971
M17 | (Genuine-spambot 3) (25%-75%) | 869 | - | 349 | 1218
M18 | (Genuine-spambot 3) (75%-25%) | 2609 | - | 117 | 2726

3.3 Data Preprocessing

Data preprocessing is the process of cleaning, scaling and transforming the data into the required format. In the process of data cleaning, NaN (Not-a-Number), inconsistent and missing values are removed [1]. The data under consideration also had some missing values, and some features in the datasets did not have any value at all. The first step was the removal of those features which did not contain any values. After that, the remaining missing values were replaced with zeros. Features containing redundant values were also dropped. After this preprocessing of the data, we were left with 24 features in the feature set, and the datasets with these 24 features were further subjected to feature selection techniques.
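One plausible pandas rendering of these cleaning steps is sketched below; the constant-column test and the factorize-based encoding of categorical columns are our assumptions rather than details given in the paper:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(axis=1, how="all")   # drop features that contain no values at all
    df = df.fillna(0)                   # replace remaining missing values with zero
    df = df.loc[:, df.nunique() > 1]    # drop redundant (constant-valued) features
    # Encode categorical (string) columns as integer codes, since the
    # classifiers cannot consume string values directly
    for col in df.select_dtypes(include="object").columns:
        df[col] = pd.factorize(df[col])[0]
    return df
```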

3.4 Feature Selection

Large datasets typically contain a large number of features in their feature sets, and these features carry information about the target variables. However, the statement "the more the number of features, the better the performance" is not valid in every case. A feature set usually contains some features which, when removed, do not affect the solution at all; such features may be irrelevant, may constitute noise, or may be redundant [10], and hence can be dropped safely.

Out of the three categories of feature selection methods, the filter methods normally use mathematical functions and are known to be faster than wrapper methods; they include the Univariate method, the Chi-Square method and Correlation Matrix with Heatmap. Wrapper methods use the classifiers themselves to prepare the feature set with maximum accuracy [15]. The third category, the embedded methods, combines the salient features of the other two categories. Filter methods provide better results when there is a very large number of features in the feature set, but when dealing with fewer features, wrapper methods work better [15].

Three feature selection techniques have been explored in this work in order to find the best optimal feature set for further processing. The first one is Correlation Matrix with Heatmap, which deals with the relationships among the features themselves and with the target variable. The second method used is Feature Importance, which ranks the features according to their importance. The third method used is Recursive Feature Elimination with Cross-Validation (RFE-CV). If C is the classifier used for prediction (e.g., a random forest), F is the scoring function used to evaluate the performance of the classifier (e.g., accuracy), and k is the number of features to be eliminated in every step (e.g., 1 feature), then RFE-CV starts with all n features, makes predictions with cross-validation using C, and computes the cross-validated performance score F together with a ranking of the importance of the features. It then eliminates the k lowest-ranked features and repeats the predictions, the computation of the performance score and the feature ranking, continuing until all the features have been eliminated. Finally, it outputs the set of features that produced the predictor with the best performance score [16].

After applying Correlation Matrix with Heatmap, 15 features were selected out of 24. Subsequently, applying the Feature Importance technique reduced the number further to 11. Finally, the RFE-CV technique output the 8 best optimal features shown in Table 4. All subsequent operations are performed on these selected best optimal features listed in Table 4.
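The last two stages map naturally onto scikit-learn; the following minimal sketch assumes X is a DataFrame holding the surviving features and y the account labels (the estimator choice and cv=5 are our assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rf = RandomForestClassifier(random_state=42)

# Feature Importance: rank features by the fitted forest's impurity-based scores
rf.fit(X, y)
ranking = sorted(zip(rf.feature_importances_, X.columns), reverse=True)

# RFE-CV with C = the forest, F = accuracy, k = 1 feature eliminated per step
selector = RFECV(estimator=rf, step=1, cv=5, scoring="accuracy").fit(X, y)
print("Optimal number of features:", selector.n_features_)
print("Best subset:", list(X.columns[selector.support_]))
```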

Table 4 Final features set

statuses_count, friends_count, favourites_count, lang, profile_text_color, profile_background_image_url, profile_link_color, updated

Fig. 1 Comparison of values of evaluation metrics for KNN classifier

3.5 Classifiers Used

The three classifiers used in this study are Random Forest (RF), Support Vector Machine (SVM) and K-Nearest Neighbor (KNN). The KNN classifier, a non-parametric algorithm, is used for both classification and regression. It works on the principle that objects within a dataset that are close to each other have similar properties [17]; it is also very fast to train. The RF classifier is likewise used to address both classification and regression problems. This algorithm works in two steps: in the first step, random trees are created, and in the second step the output is predicted on the basis of the votes of the trees generated in the first step [1]. The SVM classifier is discriminative in nature and is used for both classification and regression. It constructs a hyperplane among the data points, segregating them into two classes. There can be multiple planes separating the data points into two classes, but the one with the maximal margin between the data points of the two classes is considered the best and is selected as the hyperplane. The data points that affect the position of the hyperplane are called the support vectors.
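A minimal sketch of how the three classifiers could be trained and compared with scikit-learn is given below; the hyperparameters and the 70:30 split are our assumptions, as the paper does not state them:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# X, y: the eight selected features and the account labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```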


3.6 Evaluation Criteria

This study has been conducted to detect fake accounts such as Sybil accounts and social spambots. For the evaluation of the experimental results, the confusion matrix and six evaluation metrics have been used: Accuracy, Precision, Recall, F-Measure, MCC and Specificity.
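All six metrics can be derived from the confusion matrix; one possible helper is sketched below (our code, not the paper's; Specificity is computed manually from TN and FP because scikit-learn does not expose it directly):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

def evaluate(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F-Measure": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Specificity": tn / (tn + fp),
    }
```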

4 Results and Analysis

After selecting the requisite features of the datasets using the feature selection techniques, the selected features with their data were used for further experiments. Prediction models based on the KNN, RF and SVM classifiers were built next. Each prediction model was trained and tested on all 18 datasets described in Table 3. The results collected from the experiments for all six evaluation metrics are listed in Tables 5, 6 and 7, respectively, for each of the classifiers. Table 5 and Fig. 1 present the performance of the predictive model trained and tested using the KNN classifier in terms of the values calculated for the six evaluation metrics.

Table 5 Values of evaluation metrics of KNN classifier

Case | Accuracy | Precision | Recall | F-measure | MCC | Specificity
M1 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M2 | 0.99 | 1.00 | 0.99 | 0.99 | 0.98 | 1.00
M3 | 0.99 | 1.00 | 1.00 | 0.99 | 0.99 | 1.00
M4 | 0.98 | 1.00 | 1.00 | 0.97 | 0.97 | 1.00
M5 | 0.97 | 0.96 | 0.94 | 0.93 | 0.96 | 0.98
M6 | 0.96 | 0.93 | 0.89 | 0.92 | 0.93 | 0.97
M7 | 0.97 | 0.98 | 0.96 | 0.96 | 0.96 | 0.98
M8 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M9 | 0.97 | 0.97 | 0.92 | 0.94 | 0.95 | 0.99
M10 | 0.98 | 1.00 | 0.97 | 1.00 | 0.97 | 1.00
M11 | 0.99 | 1.00 | 0.99 | 0.99 | 0.98 | 1.00
M12 | 0.98 | 0.99 | 0.97 | 0.97 | 0.97 | 0.99
M13 | 0.98 | 0.97 | 0.97 | 0.94 | 0.96 | 0.99
M14 | 0.98 | 0.97 | 0.98 | 0.97 | 0.98 | 0.97
M15 | 0.99 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98
M16 | 0.97 | 0.96 | 0.97 | 0.97 | 0.97 | 0.98
M17 | 0.98 | 0.99 | 0.98 | 0.98 | 0.97 | 0.99
M18 | 0.97 | 0.97 | 0.96 | 0.96 | 0.95 | 0.97

Table 6 Values of evaluation metrics for RF classifier

Case | Accuracy | Precision | Recall | F-measure | MCC | Specificity
M1 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00
M2 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M3 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M4 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M5 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M6 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M7 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M8 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M9 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M11 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M12 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M13 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M14 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M15 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M16 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M17 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
M18 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00

Table 6 and Fig. 2 show the performance of these metrics for the predictive model trained using the RF classifier, and Table 7 and Fig. 3 highlight the performance of these metrics for the predictive model trained using the SVM classifier. In the case of the KNN-based model, we got the best result for dataset M3 (E13-FSF) (75%-25%). In the case of the RF-based model, the value achieved for each metric is 1.00, except precision (0.99) for dataset M1 (E13-FSF) (50%-50%). In the case of the SVM-based model, we got the best results for 14 of the datasets; only four datasets provided lower values than the others, namely M3 (E13-FSF) (75%-25%), M7 (E13-TWT) (50%-50%), M12 (Genuine-spambot 1) (75%-25%) and M16 (Genuine-spambot 3) (50%-50%). After comparing the maximum values of the metrics obtained from the three different classifiers, we observed that the performance of the RF classifier is the best among these classifiers, with the best results for dataset M3 (E13-FSF) (75%-25%) across all three cases.

Table 7 Values of evaluation metrics for SVM classifier

Case | Accuracy | Precision | Recall | F-measure | MCC | Specificity
M1 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M2 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M3 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M4 | 0.98 | 0.99 | 0.99 | 0.98 | 0.98 | 0.97
M5 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M6 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M7 | 0.98 | 0.99 | 0.96 | 0.97 | 0.97 | 0.99
M8 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M9 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M10 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M11 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M12 | 0.98 | 0.98 | 0.96 | 0.96 | 0.97 | 0.98
M13 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M14 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M15 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M16 | 0.98 | 0.98 | 0.96 | 0.96 | 0.97 | 0.98
M17 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00
M18 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 1.00

Fig. 2 Comparison of values of evaluation metrics for RF classifier

Fig. 3 Comparison of values of evaluation metrics for SVM classifier

5 Conclusion and Future Scope

In this work, we first preprocessed the data collected from various sources, and then three different feature selection techniques were applied to the preprocessed data. Preprocessing removes all the noisy data, while feature selection removes all the redundant features and helps in selecting the best optimal features. After the selection of a set of eight best optimal features for the datasets under consideration, three prediction models were created based on three different classifiers: KNN, RF and SVM. The performances of these prediction models have been evaluated on the basis of the values of six standard metrics normally used for this purpose. Of the three classifiers used, RF provides the best results on all six metrics. This study can be further extended to implement a high-performance model for a real-time environment, and it can also be used to design a set of rules for applying further optimization techniques to the experiments conducted.

Acknowledgments We express our utmost gratitude toward Cresci et al. [7, 18] for allowing us to use the datasets created by them for this research work, since these datasets were the basic requirement for our research.

References

1. N. Bindra, M. Sood, Data pre-processing techniques for boosting performance in network traffic classification, in 1st International Conference on Computational Intelligence and Data Analytics, ICCIDA-2018 (Springer CCIS Series, Bhubaneshwar, Odhisha, India, 2018)
2. https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/. Last accessed on 21 Nov 2019
3. G. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in Proceedings of 5th International Conference on Machine Learning (Morgan Kaufmann, New Brunswick, NJ, Los Altos, CA, 1994), pp. 121-129
4. A. Vasudeva, M. Sood, Survey on Sybil attack defense mechanisms in wireless ad hoc networks. J. Netw. Comput. Appl. 120, 78-118 (2018)
5. A. Vasudeva, M. Sood, A vampire act of Sybil attack on the highest node degree clustering in mobile ad hoc networks. Indian J. Sci. Technol. 9(32), 1-9 (2016)
6. A. Vasudeva, M. Sood, Perspectives of Sybil attack in routing protocols of mobile ad hoc network, in Computer Networks and Communications (NetCom), ed. by Chaki et al. LNEE, vol. 131 (Springer, New York, NY, 2013), pp. 3-13
7. S. Cresci, R.D. Pietro, R. Petrocchi, A. Spognardi, M. Tesconi, Fame for sale: efficient detection of fake Twitter followers. Decis. Support Syst. 80, 56-71 (2015)
8. G. Devi, M. Sabrigiriraj, Feature selection, online feature selection techniques for big data classification: a review, in Proceeding of International Conference on Current Trends Toward Converging Technologies (IEEE, Coimbatore, India, 2018), pp. 1-9
9. J. Alowibdi, U. Buy, P. Yu, L. Stenneth, Detecting deception in online social networks, in Proceedings of International Conference on Advances in Social Network Analysis and Mining (ASONAM) (IEEE/ACM, 2014), pp. 383-390
10. G. Stringhini, M. Egele, C. Kruegel, G. Vigna, Poultry markets: on the underground economy of twitter followers, in Proceedings of Workshop on Online Social Networks WOSN'12 (ACM, 2012), pp. 1-6
11. S. Murugan, G. Devi, Detecting streaming of twitter spam using hybrid method. Wireless Pers. Commun. 103(2), 1353-1374 (2018)
12. Z. Xianghan, Z. Zeng, Z. Chen, Y. Yu, C. Rong, Detecting spammers on social networks. Neurocomputing 159, 27-34 (2015)
13. M. McCord, M. Chuah, Spam detection on twitter using traditional classifiers, in Proceedings of 8th International Conference on Autonomic and Trusted Computing, ATC 2011. LNCS (Springer, Berlin, Heidelberg, 2011), pp. 175-186
14. Project Jupyter, https://jupyter.org/. Last accessed on 21 Nov 2019
15. D. Sonkhla, M. Sood, Performance analysis and feature selection on Sybil user data using recursive feature elimination. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 8(9S4), 48-56 (2019)
16. https://www.researchgate.net/post/Recursive_feature_selection_with_crossvalidation_in_the_caret_package_R_how_is_the_final_best_feature_set_select.ed2/. Last accessed on 22 Nov 2019
17. K. Yan, D. Zhang, Feature selection and analysis on correlated gas sensor data with recursive feature elimination. Sens. Actuators B Chem. 212, 353-363 (2015)
18. S. Cresci, R.D. Pietro, R. Petrocchi, A. Spognardi, M. Tesconi, The paradigm-shift of social spambots: evidence, theories, and tools for the arms race, in Proceedings of 26th International Conference on World Wide Web Companion, International World Wide Web Conferences Steering Committee (2017), pp. 963-972

Implementation of Ensemble-Based Prediction Model for Detecting Sybil Accounts in an OSN

Priyanka Roy and Manu Sood

Abstract Online Social Networks (OSNs) are the leading platforms being used these days for a variety of social interactions, generally aimed at fulfilling the specific needs of different strata of users. Normally, a user is allowed to join these social networks with little or negligible antecedent verification, which essentially leads to the coexistence of fake entities with malicious intentions on these social networking websites. A specific category of such accounts is known as Sybil accounts, where a malicious user, pretending to be an honest user, creates multiple fake identities to manipulate or harm honest users while creating the illusion for the real users of the OSN that these are real identities. In the absence of stringent control mechanisms, it is difficult to identify and remove these malicious accounts. But, as every single interaction on a social media website leaves a digital trace, and the huge number of such interactions every day culminates in huge datasets, it is possible to use Machine Learning (ML) techniques to build prediction models for identifying these Sybil accounts. This paper is one such attempt, where we have used ML techniques to build prediction models that can predict the presence of Sybil accounts in Twitter datasets. After preprocessing the data in these datasets, we have selected an optimal set of features using one filter method, namely Correlation with Heatmap, and two wrapper methods, namely Recursive Feature Elimination (RFE) and Recursive Feature Elimination with Cross-Validation (RFE-CV). Then, using 8 classifiers (SVM, NN, LR, DT, RF, NB, GPC, and KNN) for the classification of accounts in the datasets, we have concluded that the Decision Tree classifier gives the best prediction performance among all these classifiers. Lastly, we have used an ensemble of 6 classifiers (SVM, NN, LR, DT, RF, and KNN) by using Bagging (max voting) to achieve better results. But it can be concluded that, due to the inclusion of weak learners like SVM, NN, and GPC in the ensemble, DT has given the best possible prediction outcomes.

P. Roy (B) · M. Sood
Department of Computer Science, Himachal Pradesh University, Shimla, India
e-mail: [email protected]
M. Sood
e-mail: [email protected]

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_62


Keywords Data preprocessing · Feature selection · Classifier · Ensemble of classifiers · Bagging · Max voting · Sybil attack

1 Introduction

Due to recent rapid advancements in technology, Online Social Networks (OSNs) are fast becoming prominent and leading platforms for a variety of social interactions worldwide. OSNs generally ride on wide area networks connecting people anywhere, anytime. Users connect to an OSN by registering on its website, i.e., by creating profiles on a particular social network like Facebook, Instagram, Twitter, LinkedIn, etc. They can hook on to these networks through different processes, like (a) sending and receiving friend requests, (b) using tags (on Twitter, mostly hashtags are used), and (c) sharing links with each other. There are two principal elements of an OSN: users (connectors) and links (connections). Users play the role of connectors by creating profiles displaying information (mostly unverifiable by all means) about the listed user. These connectors join each other by sending/receiving requests, uploading data, sharing information, and building communication. The links are known as connections and show the associations among users; these associations may be one-to-one, one-to-many or many-to-many. OSNs are used for a variety of purposes, such as personal and professional (video conferencing) as well as business (e.g., e-Commerce) [1]. There are 65+ OSN platforms available to users; among those, the two most popular platforms, especially among the younger generations, are Facebook and Twitter. Generally, Facebook is used for sharing social views, photos, videos, knowledge, opinions, business, etc., whereas Twitter, in addition to being a micro-blogging site, is used more for political debates and professional interactions, and as the largest source of daily hunts.

Due to the inherent structures of these OSNs, there are ample chances for any user to create multiple fake accounts or spam accounts on purpose. These fake users may also attempt to steal/copy information, personal or professional, of other users so as to create fake profiles seemingly looking like those of real or genuine users. Also, fake user accounts hide the real identities of the actual users, helping them to remain untraceable [2]. Fake users may have a variety of motivations, like broadcasting false information, trolling somebody, spreading fake news/content/rumors about a person/group/organization/event, cyberbullying, destroying the reputations of competitors, etc. The primary purpose of fake users is to harm some genuine user intentionally [3]. When a fake user uses multiple fake profiles to pretend to be multiple different genuine users for some ulterior motive, it is known as a Sybil attack. The word Sybil comes from Sybil Dorsett, a woman with a multiple personality disorder. There are three dimensions of a Sybil attack: (a) What: a user with malicious intentions creates multiple Sybil nodes with an aim to harm the honest users by portraying the Sybil nodes as real nodes; (b) How: the attacker first joins the weak targets (honest nodes) and then mounts multiple Sybil nodes to attack them; and (c) When: a Sybil attack can be easily mounted if there is no centralized trust party in the system at a suitable location.

Different approaches used for detecting the occurrences of a Sybil attack are given in [4, 5]. Two categories of graph-based approaches used for the detection of Sybil attacks are (a) Sybil Detection Schemes (SybilGuard, Gatekeeper, SybilRank, SybilLimit, SybilDefence, SybilFrame, SybilBelief, Integro) and (b) Sybil Tolerance Schemes (Ostra, Sumup, True Top). There are two ML-based approaches too for the detection of Sybil attacks: (a) Supervised Approaches (Uncovering Sybil, Social Turing Test) and (b) Unsupervised Approaches (Click Stream, Latest community model). Using these Sybil attack detection techniques, Sybil nodes in OSNs can be detected so as to initiate steps for making the users safe [6].

Online Social Networks generate an enormous amount of data which is, of late, being used for the detection/prediction of the occurrences of Sybil attacks in social media such as Twitter [7]. Machine Learning (ML) provides various solutions to identify/earmark the presence of malicious nodes along with their Sybil nodes, based on the manipulation of this enormous data organized into datasets. As humans learn from their past experiences, a machine can also be trained from the experience of experimentation based upon datasets. In this research work, we have used the Cresci 2015 dataset, which is based upon genuine and fake Twitter accounts [7], for the purpose of presenting various ML-based prediction models. Firstly, we have applied some data preprocessing techniques for cleaning, transforming, and normalizing the datasets. Thereafter, three feature selection techniques have been used to select the most relevant subset of features out of the given set of features, the purpose being to reduce overfitting, improve accuracy, and reduce the training time of the models. With the help of feature selection, we can remove noisy, irrelevant and redundant features [8]. We have used filter and wrapper methods of feature selection to find the best results by retrieving the optimal features from each of the datasets [9]. As far as the filter method is concerned, Correlation with Heatmap (Pearson Correlation) has been used for selecting the optimal features in all the datasets. From this, a Heatmap can easily be drawn to conclude which features are more related to the target variable, so as to retrieve the best optimal features. In the wrapper method, out of a few available choices, the Recursive Feature Elimination (RFE) method was chosen to find the ranking of features. Later, Recursive Feature Elimination with Cross-Validation (RFE-CV) was used to transform the entire dataset using the best-scoring number of features. Cross-validation provides a better optimal feature set, whereas the simple RFE method provides the ranking of the features. In this paper, only the results obtained from Correlation with Heatmap were finally considered for further processing, since the wrapper methods use the same strategy as the filter method but are computationally expensive in comparison.

For the purpose of achieving the best performance, most prediction models use ensemble techniques along with classifiers. There are multiple ensemble techniques available in the literature; bagging along with max voting for a classifier ensemble is one of them that can be trained on a specific dataset for achieving the best merged accuracy.

In OSNs, there are multiple fake accounts, and some of these fake accounts pretend to be multiple genuine users to launch a Sybil attack on target user(s). A Sybil attack is used to influence the online interactions and behaviors of honest/genuine users, maybe by spreading false information, gathering personal information of genuine users, and posting negative comments and negative responses to posts and blogs of genuine accounts, etc. So, detection of these fake accounts is of prime importance to make OSNs safe and secure platforms for genuine users. Although almost all the social media platforms capture the various interactions of all their users, it is difficult to detect the presence of Sybil accounts in these datasets without using machine learning techniques. Not much work has been carried out on the detection of Sybil accounts using various ML techniques, and in whatever little work has been carried out, the performance of the models has been below par. This paper is an attempt to find a model based on an optimal ensemble of ML classifiers that achieves the best prediction performance as compared to the individual classifiers.

The paper is divided into five sections. Section 2 highlights the research methodology used in this work to achieve all the objectives; the details of the datasets used and other necessary details are presented along with the experimental setup in this section. Section 3 discusses the ensemble of classifiers used, whereas Sect. 4 summarizes the results of the experiments followed by their analysis. The work has been concluded in Sect. 5 with a pointer toward future work.

The novelty of this research work, as perceived by the authors, is as follows:
• All five datasets of Cresci 2015 [7] have been considered for the purpose of training and testing the proposed predictive models, and the process of controlled biasing has culminated in 24 datasets derived from these five datasets.
• Three different feature selection techniques, Correlation with Heatmap, RFE, and RFE-CV, belonging to two different classes of FS techniques, have been used before the final selection of an optimal set of features for the proposed prediction models.
• To explore the performances of the various proposed prediction models, eight ML classifiers, namely SVM, DT, RF, NB, GPC, NN, KNN, and LR, have been used, the main purpose being the comparison of the performances of these individual classifiers on the biased datasets.
• An ensemble of 6 classifiers has been used on the same datasets and its performance has been compared with those of the individual classifiers.
• The results of the comparison have shown that the performance of the ensemble for the purpose of predicting the presence of Sybil accounts is almost ideal for the datasets used.

The objectives of this work are (1) using appropriate feature selection techniques, to select an optimal set of features for the classifiers, (2) to analyze the performances of various individual classifiers on the biased datasets (genuine with fake accounts), and (3) to analyze the performances of an ensemble of classifiers on the same datasets and compare its performance with the individual classifiers.


2 Research Methodology and Simulation Setup

Figure 1 depicts the methodology that we have used to implement this research work, whereas Table 1 shows the details of the datasets, based on the work carried out by Cresci et al. [7], used in our research work. The Cresci 2015 collection consists of Twitter account data. There is a total of five datasets, named Elezioni 2013 (E13), The Fake Project (TFP), Fast Followers (FSF), Inter Twitter (INT), and Twitter Technology (TWT). Two datasets are of genuine accounts (E13, TFP) and three of fake accounts (FSF, INT, TWT).

Fig. 1 Methodology used for this research work

Table 1 Twitter datasets and their details [7]

Account type | Dataset | Name of dataset | Total accounts
Genuine accounts | 1 | E13 (Elezioni 2013) | 1481
Genuine accounts | 2 | TFP (The Fake Project) | 469
Fake accounts | 3 | FSF (Fast Followers) | 1169
Fake accounts | 4 | INT (Inter Twitter) | 1337
Fake accounts | 5 | TWT (Twitter Technology) | 845

Table 2 Complete features set of Twitter accounts [7]

ID, Location, Profile background title, Name, Default profile, Profile sidebar fill color, Screen name, Default profile image, Profile background image URL, Status count, Geo enable, Profile link color, Follower counts, Profile image URL, Utc offset, Friends count, Profile banner URL, Protected, Favorites count, Profile use background image, Verified, Listed count, Profile background image https, Description, Created at, Profile text color, Updated, URL, Profile image URL https, Dataset, Time zone, Profile sidebar border color

The number of features and their labels in the genuine and fake accounts are exactly the same; they are listed in Table 2. We have biased the datasets, i.e., genuine accounts (E13, TFP) with fake accounts (FSF, INT, TWT), to build 24 named datasets based on the ratios of genuine to fake accounts (100:100), (75:25), (60:40) and (50:50), as shown in Table 3. For the practical implementation we have used Anaconda, within which we have used Jupyter Notebook [10].

3 The Proposed Models

It is common knowledge that feature selection helps in removing noisy, redundant and irrelevant features [8]. Feature selection techniques have been used in this work as 'data cleaning agents' and to select the most relevant set of features in order to achieve reduced overfitting, improved accuracy, and reduced training time of the models. Out of the three categories of FS methods, for the sake of exploration, we have used one filter and two wrapper methods to retrieve the optimal set of features.

We have used the Correlation with Heatmap (Pearson Correlation) filter method for selecting the optimal set of features for all datasets and to show the relationship of each input with its target in a 2D colored matrix. This technique pictorially depicts the relations of all the features with the objective variables; the correlation can be positive or negative. Next, we used two wrapper methods, namely Recursive Feature Elimination (RFE) and Recursive Feature Elimination with Cross-Validation (RFE-CV). RFE is for finding the ranking of features.
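A minimal sketch of this filter step, assuming the cleaned accounts sit in a pandas DataFrame df with a binary label column named label (both the column name and the 0.1 threshold are our assumptions):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation matrix over all features and the target
corr = df.corr(method="pearson")
sns.heatmap(corr, cmap="coolwarm", center=0)  # the 2D colored matrix
plt.show()

# Keep features whose absolute correlation with the target is non-negligible;
# the 0.1 cut-off is illustrative, not a value given in the paper
target_corr = corr["label"].drop("label")
selected_features = target_corr[target_corr.abs() > 0.1].index.tolist()
```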


Table 3 Details of biased datasets used (biasing of genuine accounts with fake accounts)

Biasing ratio | E13 + FSF | Total accounts | E13 + INT | Total accounts | E13 + TWT | Total accounts
100-100 | AP1 (100) | 2650 | AP2 (100) | 2818 | AP3 (100) | 2326
75-25 | AP1 (75-25) | 1403 | AP2 (75-25) | 1446 | AP3 (75-25) | 1322
60-40 | AP1 (60-40) | 1357 | AP2 (60-40) | 1424 | AP3 (60-40) | 1227
50-50 | AP1 (50) | 1325 | AP2 (50) | 1408 | AP3 (50) | 1162

Biasing ratio | TFP + FSF | Total accounts | TFP + INT | Total accounts | TFP + TWT | Total accounts
100-100 | AP4 (100) | 1638 | AP5 (100) | 1806 | AP6 (100) | 1314
75-25 | AP4 (75-25) | 645 | AP5 (75-25) | 687 | AP6 (75-25) | 563
60-40 | AP4 (60-40) | 749 | AP5 (60-40) | 816 | AP6 (60-40) | 619
50-50 | AP4 (50) | 818 | AP5 (50) | 902 | AP6 (50) | 656

We have used 24 datasets in our research work, and RFE gives different rankings to the feature sets of the different datasets (24 in all in our case). RFE-CV transforms the entire set using the best-scoring number of features. Cross-validation gives a better optimal feature set as compared to that of RFE. Here, we have considered the results of the Correlation with Heatmap filter method only for further use in the classifiers, since it gave better results as compared to the wrapper methods. Another deciding factor was that the wrapper methods used the same strategy as the filter method but were computationally expensive in comparison. In our datasets, the application of the Correlation with Heatmap feature selection technique resulted in a set of 22 optimal features, as shown in Fig. 2. This figure also depicts the relationship of each input with its target in a 2D colored matrix, where it can easily be determined which feature is more related to the target variable; hence it has enabled us to retrieve a set of best optimal features.

After selecting the best features of all 24 datasets, as a standard practice, we split our data into training and testing data in the ratio of 70:30 to be used in the classifiers. A classifier is an algorithm that is used to map input data to some specific category or label for the sake of classification. The eight classifiers used in this work are Support Vector Machine (SVM), Logistic Regression (LR), Neural Network (NN), K-Nearest Neighbor (KNN), Random Forest (RF), Gaussian Process Classifier (GPC), Naive Bayes (NB), and Decision Tree (DT) [11, 12]. For measuring the performance of the various classifiers, a set of six metrics has been used, some of which are based on the Confusion Matrix.
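For illustration, the training loop over the eight classifiers could look like the following sketch, assuming X and y hold the 22 selected features and the account labels (the hyperparameters are ours, not the paper's):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# 70:30 split, as stated in the paper
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

classifiers = {
    "SVM": SVC(), "LR": LogisticRegression(max_iter=1000),
    "NN": MLPClassifier(max_iter=500), "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=42),
    "GPC": GaussianProcessClassifier(), "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy = {clf.score(X_test, y_test):.3f}")
```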


Fig. 2 Features selected through Correlation with heatmap filter method

A Confusion Matrix, also called an error matrix [7], is generally used as the basis for measuring the performance of a classifier in ML. In our case, True Negative (TN) is the number of fake accounts identified as fake, True Positive (TP) is the number of genuine accounts identified as genuine, False Negative (FN) is the number of genuine accounts identified as fake, and False Positive (FP) is the number of fake accounts identified as genuine. The evaluation metrics used to evaluate our results are Accuracy, Precision, Recall/Sensitivity, F1 score, MCC (Matthews Correlation Coefficient), and Specificity.

After evaluating these six metrics for all 8 classifiers mentioned above, we used an ensemble of classifiers for further investigations. In ensemble learning, we take multiple classifiers and combine their outputs to get better prediction or classification accuracy. Here, the classifier outputs are merged based on different ensemble techniques like max voting, averaging, and weighted averaging; there are also advanced ensemble techniques such as stacking, boosting, blending, and bagging. In this research work, we have used bagging because it is a combination of Bootstrap and Aggregation and is simple to implement with good performance [13]. This ensemble classification model runs multiple classifiers in parallel and independently of each other. Bagging takes bootstrap samples of the data and trains the classifiers on each sample before the classifiers' predictions (votes) are combined by majority voting. The bagging ensemble method entails high classification accuracy [14]. On the basis of their unweighted outputs, the SVM and GPC classifiers have been considered weak learners. Therefore, we have used the bagging technique to randomly sample the learners and merged the output of only 6 out of the 8 classifiers in the ensemble process. The classifiers used for the ensemble are KNN, RF, LR, SVM, DT, and NN.


We did not consider GPC and NB because they both produced results somewhat similar to SVM. Within the bagging technique, we have used max voting: max voting uses multiple models to make a prediction for each instance of the datasets, the prediction of each model is considered a single vote, and the prediction that gets the majority of votes is taken as the final result. Out of the two types of voting methods used in ensemble learning, hard voting and soft voting, we have used hard voting for performing the ensemble on all 24 datasets because of its established supremacy.
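One way to realize the max-voting half of this scheme in scikit-learn is sketched below (our illustration, with assumed hyperparameters); the bagging half could additionally wrap each base learner in a BaggingClassifier for bootstrap resampling:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hard (max) voting over the six retained classifiers: each model casts one
# vote per instance, and the majority class becomes the ensemble prediction
ensemble = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("dt", DecisionTreeClassifier(random_state=42)),
        ("nn", MLPClassifier(max_iter=500)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```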

4 Results and Analysis

After the selection of an optimal feature set consisting of 22 features for all the 24 biased datasets using the feature selection techniques, the performance of the six metrics for the 8 classifiers in the supervised-ML simulation experiments is presented in Figs. 3, 4, 5, 6, 7, 8, 9, and 10. We first obtained the results without the use of the ensemble of classifiers, i.e., the performance of the six evaluation metrics of the 8 classifiers (SVM, NN, KNN, LR, DT, RF, NB, and GPC) for all 24 biased datasets. From these results, we conclude that the performance of the Decision Tree (DT) classifier over the metrics, namely accuracy, precision, recall, specificity, F1 score, and MCC, is the best among all the classifiers; the second best performance on these parameters is that of the KNN and RF classifiers, the third best belongs to LR and NB, and the worst performance is that of the SVM, NN, and GPC classifiers. Secondly, the results of the performance of the ensemble of 6 classifiers (SVM, NN, KNN, LR, DT, and RF) over these six metrics for all the 24 biased datasets are presented in Fig. 11 and Table 4.

Fig. 3 Performance of six metrics for SVM classifier on various datasets

Fig. 4 Performance of six metrics for NN classifier on various datasets

Fig. 5 Performance of six metrics for LR classifier on various datasets

A careful perusal of Figs. 3, 4, 5, 6, 7, 8, 9, 10, and 11 shows that the performances of the ensemble of classifiers and that of the DT classifier are almost comparable. A deeper look at the evaluation metrics has revealed that the performance of the ensemble is below that of the DT. The reason for the ensemble not providing better performance has been traced to the participation of weak learners like NB and GPC, which have pulled down the performance of the ensemble. This has been found to be experimentally true not only for the bagging technique used for combining the outputs in the ensemble, but for all other techniques too. As can be seen from Fig. 8, the performances of the metrics for the Decision Tree classifier on Datasets 1, 2, and 4 are almost near perfect. For the remaining three datasets too, the values of the six evaluation metrics have been found to be quite good. This means that the DT classifier, when used for the prediction of Sybil accounts on the social media datasets, performs the best not only with the highest accuracy but also with the highest values of the other metrics.

Fig. 6 Performance of six metrics for RF classifier on various datasets

Fig. 7 Performance of six metrics for KNN classifier on various datasets

5 Conclusion and Future Scope

In this research work, after due data preprocessing and biasing (genuine accounts with fake accounts), we built 24 datasets from the five available datasets for the purpose of building an ML model to predict the presence of Sybil accounts on an OSN website. After preparing the datasets, features were selected using both filter and wrapper methods, but we considered only the results of the filter method, i.e., the results obtained by Correlation with Heatmap, as the final results, because the wrapper methods are computationally expensive. After obtaining an optimal set of 22 features, we used all 24 datasets to train and test the proposed prediction models using the SVM, RF, NN, NB, KNN, GPC, DT, and LR classifiers.

Fig. 8 Performance of six metrics for DT classifier on various datasets

Fig. 9 Performance of six metrics for NB classifier on various datasets

From the results of the evaluated metrics, we conclude that the performance of the Decision Tree is the best among all the classifiers, whereas the SVM, NN, and GPC classifiers are the worst performers. Subsequently, using an ensemble of 6 of these classifiers, we evaluated and compared the performance of the ensemble of classifiers, only to conclude that, though the performance of this ensemble over all the parameters is encouraging, due to the involvement of weak learners like SVM, NN, and GPC in this ensemble, the Decision Tree classifier still provides the best performance individually, in comparison to both the other individual classifiers and the ensemble of classifiers. The results were found to be no different for the various other available combining techniques used in the ensemble of classifiers.

Fig. 10 Performance of six metrics for GPC classifier on various datasets

Fig. 11 Performance of six metrics for the ensemble of 6 classifiers on various datasets

Further, different optimization techniques may be tried on this ensemble output in the future in order to explore and compare the results of the ensemble of classifiers with optimization too.

Table 4 Performance metrics data of ensemble of 6 classifiers

Datasets | Cases | Accuracy | F1 score | Recall | Precision | MCC | Specificity
Dataset 1 | AP1 (100) | 0.993 | 0.994 | 0.99 | 0.997 | 0.987 | 1
Dataset 1 | AP1 (75-25) | 1 | 1 | 1 | 1 | 1 | 1
Dataset 1 | AP1 (60-40) | 1 | 1 | 1 | 1 | 1 | 1
Dataset 1 | AP1 (50) | 1 | 1 | 1 | 1 | 1 | 1
Dataset 2 | AP2 (100) | 0.99 | 0.991 | 1 | 0.982 | 0.981 | 0.98
Dataset 2 | AP2 (75-25) | 1 | 1 | 1 | 1 | 1 | 1
Dataset 2 | AP2 (60-40) | 0.995 | 0.996 | 1 | 0.992 | 0.99 | 0.987
Dataset 2 | AP2 (50) | 0.99 | 0.991 | 1 | 0.982 | 0.981 | 0.98
Dataset 3 | AP3 (100) | 0.851 | 0.892 | 0.968 | 0.826 | 0.667 | 0.645
Dataset 3 | AP3 (75-25) | 0.891 | 0.939 | 1 | 0.885 | 0.53 | 0.317
Dataset 3 | AP3 (60-40) | 0.943 | 0.961 | 0.988 | 0.936 | 0.855 | 0.578
Dataset 3 | AP3 (50) | 0.851 | 0.892 | 0.968 | 0.826 | 0.677 | 0.645
Dataset 4 | AP4 (100) | 0.906 | 0.805 | 0.673 | 1 | 0.77 | 1
Dataset 4 | AP4 (75-25) | 0.906 | 0.89 | 0.801 | 1 | 0.825 | 1
Dataset 4 | AP4 (60-40) | 0.924 | 0.887 | 0.797 | 1 | 0.843 | 1
Dataset 4 | AP4 (50) | 0.934 | 0.87 | 0.771 | 1 | 0.84 | 1
Dataset 5 | AP5 (100) | 0.874 | 0.6964 | 0.553 | 0.9397 | 0.658 | 0.987
Dataset 5 | AP5 (75-25) | 0.995 | 0.995 | 0.99 | 1 | 0.99 | 1
Dataset 5 | AP5 (60-40) | 0.885 | 0.805 | 0.69 | 0.966 | 0.748 | 1
Dataset 5 | AP5 (50) | 0.926 | 0.836 | 0.728 | 0.98 | 0.804 | 0.995
Dataset 6 | AP6 (100) | 0.784 | 0.581 | 0.418 | 0.951 | 0.535 | 0.988
Dataset 6 | AP6 (75-25) | 0.923 | 0.934 | 0.877 | 1 | 0.852 | 1
Dataset 6 | AP6 (60-40) | 0.951 | 0.946 | 0.94 | 0.951 | 0.902 | 0.96
Dataset 6 | AP6 (50) | 0.796 | 0.607 | 0.442 | 0.968 | 0.564 | 0.992

Acknowledgments We are grateful to Cresci et al. [7] for allowing us to use their real-time dataset, i.e., the Cresci 2015 dataset, in this research work.

References

1. H. Mayadunna, L. Rupasinghe, A trust evaluation model for online social networks, in Proceedings of IEEE 2018 National Information Technology Conference (NITC), 02-04 October, Colombo, Sri Lanka (2018)
2. A.H. Wang, Don't follow me: spam detection in twitter, in 2010 International Conference on Security and Cryptography (SECRYPT) (IEEE, 2010), pp. 1-10
3. F. Masood, G. Ammad, A. Almogren, A. Abbas, H.A. Khattak, I.U. Din, M. Guizani, M. Zuair, Spammer Detection and Fake User Identification on Social Networks (IEEE, 2019)
4. M. Al-Qurishi, M. Al-Rakhami, A. Alamri, M. Alrubaian, S.M.M. Rahman, M.S. Hossain, Sybil defense techniques in online social networks: a survey. IEEE Access 5, 1200-1219 (2017)
5. A. Vasudeva, M. Sood, Survey on Sybil attack defense mechanisms in wireless ad hoc networks. J. Netw. Comput. Appl. 120, 78-118 (2018)
6. H. Bansal, M. Misra, Sybil detection in online social networks (OSNs), in Proceedings of IEEE 6th International Conference on Advanced Computing (2016)
7. S. Cresci, R.D. Pietro, R. Petrocchi, A. Spognardi, M. Tesconi, Fame for sale: efficient detection of fake Twitter followers. Decis. Support Syst. 80, 56-71 (2015)
8. H. Nkiama, S.Z.M. Said, M. Saidu, A subset feature elimination mechanism for intrusion detection system. Int. J. Adv. Comput. Sci. Appl. 7(4), 148-157 (2016)
9. N. Bindra, M. Sood, Data pre-processing techniques for boosting performance in network traffic classification, in Proceedings of First International Conference on Computational Intelligence and Data Analytics, ICCIDA-2018, 26-27 October 2018 (Springer CCIS Series, Gandhi Institute For Technology (GIFT), Bhubaneshwar, Odhisha, India, 2018)
10. https://www.anaconda.com/distribution/#download-section. Last accessed on 07 Dec 2019
11. https://analyticsindiamag.com/7-types-classification-algorithms. Last accessed on 07 Dec 2019
12. https://scikit-learn.org/stable/modules/gaussian_process.html. Last accessed on 07 Dec 2019
13. R. Polikar, Ensemble based systems in decision making. IEEE Circ. Syst. Mag. 21-44 (2008)
14. J.J. Rodriguez, L.I. Kuncheva, Rotation forest: a new classifier ensemble method. IEEE Trans. Pattern Mach. Intell. 28(10), 1619-1630 (2006)

Performance Analysis of Impact of Network Topologies on Different Controllers in SDN
Dharmender Kumar and Manu Sood

Abstract Over the past decade, technology, especially in computer science, has advanced beyond what was once thought possible. The traditional approach to networking has several loopholes, which have been minimized to some extent with a modern networking approach known as Software-Defined Networking (SDN). SDN has made communication more interesting, with several notable features such as flexibility and dynamic, agile behavior. These features have become possible through centralized control, direct programmability, and the physical separation of the network control plane from the forwarding (data) plane. Since the whole network and its entities are controlled by the control plane, this separation of the two planes makes SDN completely different from traditional networking. Networking requires communication among various physical and logical devices, so communication plays a vital role in any network. In order to achieve better communication in SDN, it is essential to analyze and evaluate the performance of different network topologies. It is therefore interesting to identify the available network topologies and to find the best among them for communication in SDN. In this paper, we propose to find the best topology among four possible topologies in SDN, on three different SDN controllers, through simulation in Mininet. This selection is done by analyzing and evaluating different network parameters such as throughput, round trip time, end-to-end delay, bandwidth, and packet loss with/without a link being down. Based on the results obtained for these parameters, we identify the topologies that provide the best-case as well as worst-case communication in our experiment. Four different types of topologies on three different SDN controllers (OpenDaylight, POX, and NOX) have been simulated through Mininet and Wireshark. D. Kumar (B) · M. Sood Department of Computer Science, H.P. University, Shimla, India e-mail: [email protected] M. Sood e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_63


Keywords Software-defined network · OpenFlow · Mininet · Wireshark · Control plane · Throughput · Round trip time · OpenDaylight · POX · NOX

1 Introduction The traditional approach, which we are still using for network communication, has been popular for a long time, but it is simple. Traditional networking is characterized by functionality that is always implemented on a particular dedicated device, which can be a switch, a router, or an application delivery controller [1–6]. On the whole, physical devices are used to perform networking functions in traditional networks. There is also a hard coupling between the data and control planes in traditional networking, which means that the data forwarding and controlling policies are provided by the same layer. This tight coupling between the data and control planes may lead to a certain level of additional complexity; for instance, policies once defined in traditional networking are hard to alter dynamically during communication as per the needs of users. The only way to change these networking policies on the go is to halt the communication temporarily to accommodate suitable modifications in the policies. The traditional approach to networking has some other significant limitations: (a) network setup is a time-consuming and error-prone process, (b) specific professional experts are needed to handle multi-vendor environments, and (c) network security and reliability are low. All these limitations of the traditional networking paradigm are the major contributors to the emergence of SDN as a networking approach.

1.1 Software-Defined Networking The factors limiting the traditional way of networking gave rise to SDN, which has emerged as a new networking technology over the past few years. It plays a very important role in networking and overcomes the limitations encountered in traditional networking. Where the traditional approach follows the principle of strong coupling, SDN works on the principle of loose coupling, meaning the data plane and control plane are decoupled. Moreover, the separated data and control planes are not tied to the hardware devices [1–5]. SDN uses an entirely different set of data forwarding policies. In SDN, the controller acts as the major role player in the network with respect to these forwarding and controlling policies. The controller can be configured dynamically in order to make changes, addressing a drawback of traditional networking [6]. Dynamic configuration means that, in SDN, new policies can be enforced at a later stage to override previous ones. The control plane's primary


functionality is to specify the path and communication parameters, while the data plane implements the decisions directed by the control plane. The most significant features of SDN are (a) the control plane arrangement, (b) the controller as a centralized authority, (c) open interfaces, and (d) a direct and dynamic programming facility. The complete SDN architecture is explained in detail in [7]; it comprises the whole networking platform for SDN. The SDN architecture consists of three layers, named the application layer, the control layer, and the infrastructure layer. The northbound API provides communication between the upper two layers, whereas the southbound API enables communication between the two lower layers. The application layer, as the topmost layer, is responsible for providing the abstract application view, while the control layer, as the middle layer, plays a very important role in SDN. A network operating system, known as the OpenFlow controller, resides in the control layer. The purpose of this layer is to provide the interface between the upper application layer and the lower data (infrastructure) layer. The availability of OpenFlow differentiates SDN from traditional networks. Also, the dynamic and programmatic configuration of the controller makes the configuration process easy and flexible, a feature not available in traditional networking. Forwarding is done by the lowest data layer.

1.2 SDN and Traditional Networking Comparison The key networking differences between the two approaches are summarized in Table 1.

1.2.1 Topologies' Significance

For the proper management of a network, topology plays an important role, as many characteristics of the network are affected by it, e.g., performance, the complexity of communication policies, reliability, and efficiency [8–10].

Table 1 Traditional networks versus SDN [7]

| S. no. | Features | Conventional network | SDN |
| 1 | Control plane and data plane | Tight coupling | Loosely coupled, both the layers are decoupled |
| 2 | Protocol | Number of protocols are used | Mostly OpenFlow is used |
| 3 | Reliability | Less reliable | More reliable |
| 4 | Security | Moderate | High |
| 5 | Energy consumption | Very high | Less |


A network may have several different topologies, each with its own merits and demerits [11], but no single topology can act as the best for every network requirement. Therefore, selecting the best among all topologies requires evaluation. Throughput, bandwidth, packet loss, round trip time (RTT), and end-to-end delay are the performance-affecting parameters of a network topology [12]. A careful examination of the SDN-related literature shows that, prior to the simulation results in this research paper, no paper covers performance-comparison results for different topologies with different SDN controllers. Mininet is a simple and easily accessible tool for simulation in the SDN environment. In this paper, different topologies and their feasibility in SDN have been compared with respect to different SDN controllers through Mininet. Of the six topologies available in SDN, we simulate four with Mininet in order to identify the best and worst. The simulation results obtained with Mininet have been complemented by Wireshark, a tool that can plot graphs.

2 Proposed Work Using simulated results, the performance of various SDN topologies has been evaluated based on different network parameters. The network topologies that can be created with commands in the SDN Mininet environment [13, 14] are as follows (a programmatic sketch is given after the list):
(a) Minimal: Minimal topology is the simplest topology in SDN, with a single switch and two hosts by default. Mininet command: sudo mn --topo=minimal.
(b) Single Topology: Another kind of topology used in SDN is the single topology, having one switch and N hosts. Mininet command: sudo mn --topo=single,3.
(c) Linear Topology: The linear topology differs slightly from the single topology in SDN, with N switches and N hosts instead of one switch and N hosts. Mininet command: sudo mn --topo=linear,3.
(d) Tree Topology: As the name suggests, the tree topology has an internal structure like a tree with multiple levels, where two hosts are associated with every switch. Mininet command: sudo mn --topo=tree,3.
(e) Reversed Topology: The reversed topology, as the name suggests, is the complete reverse of the single topology, in which hosts and switch are connected in reverse order. Mininet command: sudo mn --topo=reversed,3.
(f) Torus Topology: The torus topology of SDN is similar to the mesh topology in traditional networking. Mininet command: sudo mn --topo=torus,3,3.


The numeric values in the above commands specify the number of hosts and switches; for the tree topology, the value represents the number of levels. These values are arbitrary and may be increased or decreased depending on the network requirements.
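As an illustration (not taken from the paper), the same kind of topology can also be built with Mininet's Python API rather than the CLI. In the sketch below, the class name SingleTopo, the host count n=3, and the bandwidth value bw=10 are our assumptions:

```python
# Illustrative sketch only: a 'single' topology (1 switch, N hosts) built with
# Mininet's Python API; equivalent to: sudo mn --topo=single,3 --link tc,bw=10
from mininet.topo import Topo
from mininet.net import Mininet
from mininet.link import TCLink

class SingleTopo(Topo):
    def build(self, n=3):
        switch = self.addSwitch('s1')
        for i in range(1, n + 1):
            host = self.addHost('h%d' % i)
            # TCLink applies traffic shaping, so the link bandwidth can be fixed
            self.addLink(host, switch, cls=TCLink, bw=10)

if __name__ == '__main__':
    net = Mininet(topo=SingleTopo(n=3), link=TCLink)
    net.start()
    net.pingAll()   # quick connectivity check between all hosts
    net.stop()
```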

3 Simulation Environment and Results The experiments were carried out on Ubuntu 18.04 LTS with a minimum of 2 GB of RAM and a minimum of 6–8 GB of free hard disk space. The simulation results have been obtained for three different SDN controllers, OpenDaylight, POX, and NOX, with Python as the language. The torus topology was not used in this experiment because of its non-switched structure. The ring topology was also not used, because switches are irrelevant in this topology and because, in current networking scenarios, the ring topology is rarely used. All the topologies used for our experiment, created with Mininet, with complete details of switches and hosts for all three SDN controllers, are listed below. For all four topologies, we created one server host (h1 in all cases) and made the remaining hosts clients (e.g., h3 or h4); one client host in each topology requests the server host to download a file. The same bandwidth was used for all topologies and for all three SDN controllers; the bandwidth is fixed using the option --link tc,bw=<value>. A file of the same size is used for the experiment in all cases. The creation of the linear topology for the POX controller is depicted in Fig. 1. The hosts, switches, and controller used in our experiments are shown with h1, h2, etc. as hosts and

Fig. 1 Creating linear topology in SDN using Mininet


s1, s2, and c0 as switches and controller, respectively. The Mininet command executed to make host h1 the server is h1 python -m SimpleHTTPServer 80 & [13, 14]. The file-download operation was performed in all four topologies and for the three SDN controllers; the numeric values used and obtained for the different network parameters are shown in Tables 2, 3, 4, 5, and 6 as the outcomes of these simulations (Fig. 2). Simulated results were obtained for the different topologies and all three SDN controllers, on different network parameters, with Wireshark and Mininet [14–18]. The simulation results were obtained with Mininet, while the tables and graphs of our experiments were produced from these results with Wireshark. These results are for downloading a specific file from the server host through a client host, for all topologies and controllers.

Table 2 Different simulation network elements used

| S. no. | Topology used | No. of servers | No. of switches | No. of hosts | No. of controllers |
| 1 | Single topology | 1 | 1 | 2 | 1 |
| 2 | Linear topology | 1 | 3 | 2 | 1 |
| 3 | Tree topology | 1 | 7 | 7 | 1 |
| 4 | Reversed topology | 1 | 1 | 2 | 1 |

Table 3 Topologies' simulation results for POX controller

| Topology used | Bandwidth fixed (Gb/s) | End-to-end delay obtained (ms) | Throughput obtained (b/s) | Round trip time obtained (ms) | Packet loss (when link is down) | Packet loss (when link is not down) |
| Single | 26 | 7.2799 | 20,300 | 15.032 | 0 | 66 |
| Linear | 26 | 23.41 | 31,000 | 46.83 | 0 | 66 |
| Tree | 26 | 32.36 | 24,100 | 65.72 | 0 | 25 |
| Reversed | 26 | 6.26 | 24,000 | 13.550 | 0 | Network unreachable |

Table 4 Simulated network parameters' values in experiments

| Parameters used | Single topology | Linear topology | Tree topology | Reversed topology |
| Segment range (Bytes) | 0–1500 | 0–1500 | 0–120 | 0–1500 |
| No. of segments | 3 | 7 | 12 | 3 |
| RTT (ms) | 15.032 | 46.83 | 65.72 | 13.55 |
| Average throughput (bps) | 20,300 | 31,000 | 24,100 | 24,000 |


Table 5 Topologies' simulation results for NOX controller

| Topology used | Bandwidth fixed (Gb/s) | End-to-end delay obtained (ms) | Throughput obtained (b/s) | Round trip time obtained (ms) | Packet loss (when link is down) | Packet loss (when link is not down) |
| Single | 26 | 9.2799 | 23,300 | 12.02 | 0 | 66 |
| Linear | 26 | 25.41 | 27,000 | 46.43 | 0 | 66 |
| Tree | 26 | 33.36 | 24,100 | 65.92 | 0 | 25 |
| Reversed | 26 | 8.26 | 22,000 | 13.250 | 0 | Network unreachable |

Table 6 Topologies' simulation results for OpenDaylight controller

| Topology used | Bandwidth fixed (Gb/s) | End-to-end delay obtained (ms) | Throughput obtained (b/s) | Round trip time obtained (ms) | Packet loss (when link is down) | Packet loss (when link is not down) |
| Single | 26 | 8.2799 | 22,100 | 18.002 | 0 | 66 |
| Linear | 26 | 27.41 | 28,000 | 40.8 | 0 | 66 |
| Tree | 26 | 35.36 | 23,400 | 60.32 | 0 | 25 |
| Reversed | 26 | 10.26 | 24,800 | 10.450 | 0 | Network unreachable |

For all four topologies and the three SDN controllers, the values obtained for these different network parameters are given in Tables 3–6. The purpose of our experiment was to find the single topology, across all three controllers, that provides the best communication results. Figure 3 is a table listing the details of the packets used and transmitted in the linear topology with the POX controller, captured through Wireshark during the download of the file. The downloaded file was divided into 14 packets of different sizes, as shown in Fig. 3. The file was downloaded in the same way for the other topologies and for the other two SDN controllers, OpenDaylight and NOX.

3.1 Graphs Figures 4 and 5, plotted with Wireshark, depict the throughput and round trip time graphs for the linear topology. In the same way, we obtained the throughput and round trip time graphs for the remaining three topologies and for the other two SDN controllers, OpenDaylight and NOX. These graphs were captured with Wireshark during the download of the file, for all topologies and controllers, from the server in response to a request made by the client host.


Fig. 2 Downloading file from server with linear topology

Fig. 3 Packet details for linear topology


Fig. 4 Linear topology graph for throughput with POX controller

Fig. 5 Linear topology graph for round trip time with POX controller

The results for the different network parameters, for all topologies and all three SDN controllers [17, 18], are shown in Tables 3, 5, and 6. The graphs for throughput and round trip time are shown in Figs. 4 and 5, respectively. From the analysis of our experimental results, we observe that the worst topology for the POX, NOX, and OpenDaylight SDN controllers is the tree topology when RTT is taken into consideration, because the maximum RTT is obtained with the tree topology


for all three controllers. When average throughput is taken into account, the topologies providing the worst results for the other two controllers were the single topology for NOX and the reversed topology for OpenDaylight, respectively. From the RTT point of view, the topologies providing the best results were reversed, single, and reversed for the three controllers, respectively. Similarly, the linear topology was found to be the best topology for SDN in all three cases, because it yields the maximum throughput. Based on these simulation results for SDN, we can say that (a) the best topology among all four, across the three SDN controllers, is the linear topology, as it gives the maximum average throughput and a medium RTT, and (b) the tree topology is the worst topology for SDN with respect to RTT. No single topology can be best or worst for all three SDN controllers with respect to all network parameters; we therefore identify the best topology on the basis of two network parameters, throughput and round trip time.

4 Conclusion SDN applications involve fine-grained traffic, quick failover demands, and fast interaction among switches, hosts, and controllers. The generation and execution of control messages and operations in SDN can vary across topologies. Therefore, this paper aims to identify the best-controlled communication topology for SDN. From the simulation and experimental results, we found that there exists no single topology that shows the best outcome for all three SDN controllers across all network parameters. The precise result of our experimentation, on the basis of throughput and RTT, is that the best and worst topologies for SDN are the linear and tree topologies, respectively. Due to time constraints, we have run a limited set of experiments on these topologies and controllers; various other sets of experiments are left for future research. Variations that could expand the scope of the investigation include varying the size and/or number of the files being communicated. It would also be interesting to investigate the results with different sizes of data packets in the experiments.

References
1. D. Kreutz, F.M.V. Ramos, Software-defined networking: a comprehensive survey. IEEE/ACM Trans. Audio, Speech, Lang. Process. 103(1), 1–76 (2015)
2. H. Farhady, H.Y. Lee, Software-defined networking: a survey. Comput. Netw. 81, 1–95 (2015)
3. S. Badotra, J. Singh, A review paper on software defined networking. Int. J. Adv. Res. Comput. Sci. 8(3) (2017)
4. S. Sezer, S. Scott-Hayward, Are we ready for SDN?—Implementation challenges for software-defined networks. IEEE Commun. Mag. 51(7), 36–43 (2013). https://doi.org/10.1109/MCOM.2013.6553676


5. B. Astuto, A. Nunes, A survey of software-defined networking: past, present, and future of programmable networks. IEEE Commun. Surv. Tutorials 16(3), 1617–1634 (2014)
6. S.H. Yeganeh, A. Tootoonchian, On scalability of software-defined networking. IEEE Commun. Mag. 51(2), 136–141 (2013)
7. M. Sood, Nishtha, Traditional versus software defined networks: a review paper. Int. J. Comput. Eng. Appl. 7(1) (2014)
8. S. Perumbuduru, J. Dhar, Performance evaluation of different network topologies based on ant colony optimization. Int. J. Wirel. Mobile Netw. (IJWMN) 2(4) (2010), http://airccse.org/journal/jwmn/1110ijwmn.12.pdf. Last accessed on 31 Dec 2018
9. R. Hegde, The impact of network topologies on the performance of the in-vehicle network. Int. J. Comput. Theory Eng. 5(3) (2013), http://ijcte.org/papers/719-A30609.pdf. Last accessed on 31 Dec 2018
10. D.S. Lee, J.L. Kal, Network topology analysis (Sandia Report, SAND2008-0069, Sandia National Laboratories, California, 2008), https://prod-ng.sandia.gov/techlib-noauth/accesscontrol.cgi/2008/080069.pdf. Last accessed on 31 Dec 2018
11. B. Meador, A survey of computer network topology and analysis examples, https://www.cse.wustl.edu/~jain/cse567-08/ftp/topology.pdf. Last accessed on 31 Dec 2018
12. M. Gallagher, Effect of topology on network bandwidth. Masters Thesis, University of Wollongong Thesis Collection, 1954–2016, University of Wollongong, Australia, https://ro.uow.edu.au/theses/2539/. Last accessed on 31 Dec 2018
13. D. Kumar, M. Sood, Software defined networks (SDN): experimentation with Mininet topologies. Indian J. Sci. Technol. 9(32) (2016). https://doi.org/10.17485/ijst/2016/v9i32/100195
14. Mininet walkthrough, http://mininet.org/walkthrough/. Last accessed on 31 Dec 2018
15. R. Barrett, A. Facey, Dynamic traffic diversion in SDN: test bed vs Mininet, in International Conference on Computing, Networking and Communications (ICNC): Network Algorithms and Performance Evaluation (2017). https://doi.org/10.1109/iccnc.2017.7876121
16. E. Guruprasad, G. Sindhu, Using custom Mininet topology configuring L2-switch in OpenDaylight. Int. J. Recent Innov. Trends Comput. Commun. 5(5), 45–48. ISSN: 2321-8169
17. J. Biswas, Ashutosh, An insight into network traffic analysis using packet sniffer. Int. J. Comput. Appl. 94(11), 39–44 (2014)
18. Wireshark Complete Tutorial, https://www.wireshark.org/docs/wsug_html/. Last accessed on 31 Dec 2018

Bees Classifier Using Soft Computing Approaches
Abhilakshya Agarwal and Rahul Pradhan

Abstract Researchers have carried out many studies attempting to classify two different types of bees, the honey bee and the bumble bee. Many of these approaches depend on subjective analysis and lack any measurable analysis. Our work is therefore an attempt to bridge the gap between subjective and measurable approaches to classifying bees. We use machine learning algorithms (classification and neural networks) to classify a bee as a honey bee or a bumble bee using data in the form of images. This research will greatly speed up the study of bee populations: the machine learning models used here help automate the classification of a bee from its photograph, and the information from these algorithms can be used by researchers in the study of bees. This research also includes the manipulation of images, in which the data is prepared for the model based on features extracted from the bees. We also performed dimensionality reduction in order to focus specifically on the bee in each image, i.e., disregarding background details such as flowers and other unnecessary elements. Keywords Support vector machines · K-nearest neighbor · Random forest · Decision tree · Logistic regression

1 Introduction Many studies have been carried out to classify bees as honey bees or bumble bees. Many of those studies depended on qualitative techniques rather than a quantitative approach for identifying and detecting the bees. Some researchers have also investigated quantitative techniques A. Agarwal (B) · R. Pradhan Department of Computer Engineering and Application, GLA University, Mathura, UP, India e-mail: [email protected] R. Pradhan e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_64


that could be used for monitoring and classifying bees. They acquire photos for identifying the bees and use biological features as indicators to distinguish them, e.g., fatness, length, robustness, and many more; the tip of the abdomen of one genus is more pointed. Bees, flying insects, are known for their role in pollination and for producing honey and beeswax. There are more than 16,000 known species of bees in seven recognized biological families, found on every continent except Antarctica. A few species, including honey bees, bumble bees, and stingless bees, live socially in colonies. A honey bee (also spelled honeybee) is a flying insect within the genus Apis of the bee clade, all native to Eurasia but spread to four other continents by humans; honey bees are known for the construction of perennial colonial nests from wax. A bumble bee is any of more than 250 species in the genus Bombus, part of Apidae, one of the bee families. Wild bumble bees are essential for sustaining pollination. At present, identifying the genus of a bee requires expert precision. This research is motivated by the "Naive Bees Classifier" challenge by Metis hosted by Drivendata.org. Automating this process, such that simply a picture of a bee can be used for classification by a machine learning algorithm, will greatly speed up the study of bee populations. As part of this research, the aim was to develop classifiers that accurately recognize the genus of a bee as either a honey bee (class 0) or a bumble bee (class 1) from its photo.

2 Methodology Figure 1 shows an overview of the system proposed in this article. First, researchers use cameras for gathering photographs of bees. A real-time algorithm, or a collection of functions, extracts and counts particular biological features of the bees. The resulting feature information is then stored in a database. Once every bee has been observed, the data in the database
Fig. 1 System overview


will be divided into two sets, a training set and a test set. As the dataset contains 3969 bees, it is divided in the ratio 7:3. After the data is divided, various ML models are trained on the training set, with labels 0 for honey bee and 1 for bumble bee. To improve each model, we then apply a validation algorithm to verify its confusion-matrix parameters.

2.1 Photograph Gathering In this system, to extract the bees' features, we used various sources for gathering bee photographs, such as GitHub, Google Photos, and many more. We decided to use a Kinect sensor, as it is relatively inexpensive compared with other RGB-D sensors, to detect the bees and photograph them, whether honey bee or bumble bee. The data we collected consists of 3969 bee images, which is very small compared with the ambient dimensionality of the images (200 × 200 × 3). In addition, the class distribution of honey bees and bumble bees in our dataset is about 1:4, i.e., the distribution is skewed.

2.2 Features-Detecting Algorithm Various features indicate a bee's class; many of them can be extracted by flattening the image after converting its 3 channels to 1 channel, i.e., from an RGB image to a grayscale image, and applying histogram equalization to each image. We then obtain features using the histogram of oriented gradients (HOG) and the DAISY feature descriptor [1]. Some of these features are the raw intensity value of each pixel, the color combination of the bees, and the light brown striations on their abdomens; bumble bees, on the other hand, are darker and broader. The quality of the images varies drastically. Each flattened image yields 128,100 values (pixels or features); as this amounted to roughly 9.30 GB of data to process, we had to reduce the dimensionality of the images. We focused specifically on bees that were in the center of the images, although this caused problems for some images in which the bee was at a corner of the image.
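A minimal sketch of this feature pipeline, assuming scikit-image; the HOG and DAISY descriptor parameters below are illustrative assumptions, not the authors' exact settings:

```python
# Hedged sketch: grayscale conversion, histogram equalization, then HOG and
# DAISY descriptors; all descriptor parameters here are assumptions.
import numpy as np
from skimage.color import rgb2gray
from skimage.exposure import equalize_hist
from skimage.feature import hog, daisy

def extract_features(rgb_image):
    gray = equalize_hist(rgb2gray(rgb_image))   # 3 channels -> 1, equalized
    hog_vec = hog(gray, orientations=9,
                  pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    daisy_vec = daisy(gray, step=32, radius=24).ravel()
    raw = gray.ravel()                          # raw intensity of each pixel
    return np.concatenate([raw, hog_vec, daisy_vec])
```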

2.3 Database After manipulating each bee's image with the PCA transformation to focus on the bee, and gathering the image features in the form of pixels of the manipulated images, we converted the data to a CSV file that acts as a database. Each observation is recorded as a new row in the database. The values


used for classification are considered separate features, each representing a single pixel of the image.

2.4 Classification Algorithm Once all the bees to be classified have been observed, the data for each feature (pixel) of the image is used for classification. Various ML algorithms are used for classifying the images: K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Random Forest, Decision Tree, and Logistic Regression. We test the data on these models and analyze the algorithms using the confusion matrix. Each model is trained, tested, and then validated by the validation algorithm. The default class label is bumble bee, i.e., 1, while the other class is honey bee, i.e., 0. With this setup, most of the data is predicted as bumble bees; one reason may be that bumble bee images outnumber honey bee images, and the lack of honey bee images biases predictions toward the bumble bee label.

2.5 Validation Algorithm Since the predictions made above are labeled as honey bee or bumble bee, we create a validation process to verify how the accuracy of each model varies as its parameters are changed, using a cross-validation algorithm. In this research, we have used the support vector machine (SVM), K-nearest neighbor (K-NN), random forest (RF), logistic regression (LR), and decision tree (DT) algorithms. We split our data in the ratio 7:3, taking 70% of the data for training and 30% for testing. There is also skew in the data, as bumble bee images outnumber honey bee images by about 4:1. We use k-fold cross-validation to validate on different parts of the images and obtain the most reliable results.
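A brief sketch of this validation step with scikit-learn; the fold count cv=5 is our assumption, as the paper does not state it:

```python
# Hedged sketch: k-fold cross-validation of one candidate model.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=11)
scores = cross_val_score(model, X, y, cv=5)   # X, y: features and 0/1 labels
print('fold accuracies:', scores, 'mean:', scores.mean())
```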

3 Experiments 3.1 Data Generation The data used in this research was created by collecting photographs from various sources, and we run our classification algorithms on the manipulated data, which is produced by the operations described below.


Fig. 2 Cropping

First, we perform Region of Interest (ROI) cropping, since in most cases the ROI occupies only a small portion of the image. By cropping the images, we can discard the portions of the image that do not contain the bee. To achieve this, we use the region covariance descriptor [2], where a set of bee templates (available online) is used to identify the regions of the image that contain the bee. This method first constructs a suitable feature vector (for both the template image and candidate image segments), then estimates an empirical covariance matrix for the feature vectors, and finally compares the relative distances of the covariance matrices from those of the template images. Upon further investigation, we found that this technique does not always produce good results, as it may crop a part of the image that does not contain the bee. Figure 2 shows images cropped by this method. Second, we perform data augmentation. As the distribution of images is not balanced and the data is skewed, we augment the samples of class 0 to make the label distribution roughly uniform. We augment the honey bee (class 0) images by adding noise (zero-mean AWGN with a variance of 0.01) and by introducing rotations (−90° and 90°). This quadruples the size of class 0 (we get 3308 images from 827). We then process this augmented dataset, which now contains 5955 samples, to obtain the feature vectors as described in the following section; a sketch of the augmentation step is shown below.
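An illustrative sketch of the augmentation just described (zero-mean AWGN with variance 0.01, and rotations of −90° and +90°), assuming scikit-image and NumPy:

```python
# Hedged sketch of the class-0 augmentation: each honey bee image yields
# three extra images (noisy, rotated -90, rotated +90), quadrupling the class.
import numpy as np
from skimage.util import random_noise

def augment(image):
    noisy = random_noise(image, mode='gaussian', mean=0.0, var=0.01)
    rot_neg90 = np.rot90(image, k=-1)   # -90 degrees
    rot_pos90 = np.rot90(image, k=1)    # +90 degrees
    return [noisy, rot_neg90, rot_pos90]
```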

3.2 K-Nearest Neighbor Algorithm With the data generated, we classify it using the K-Nearest Neighbor algorithm. K-NN is a classic non-parametric method available for classification and regression tasks. Both rely on the k closest training examples in the feature space; the only difference is in the output, depending on whether the task is classification or regression. In classification, we have the binary classes honey bee and bumble bee, so the output of K-NN is a class membership. The image of a bee is assigned to the class most common among its k neighbors, where k is a small positive integer and the most common class is determined by the votes of the neighbors. If k is 1, the bee is simply assigned to the class of its single nearest neighbor.


Classification of this type, where the class assignment is decided from local statistics and the full computation is avoided or delayed, is often referred to as lazy learning or instance-based learning. The neighbors are taken from a set of objects for which the class (for K-NN classification) or the object property value (for K-NN regression) is known. This can be considered the training set for the algorithm, as no explicit training step is required. Figure 3 illustrates an example of K-NN classification. In Fig. 3, the green dot represents a sample, and the K-NN model must classify it as either a blue square or a red triangle. With k = 3 (the inner solid-line circle, containing 2 triangles and only 1 square), the dot is more likely to be classified as a red triangle; but with k = 5 (the outer circle, containing 3 squares and 2 triangles), K-NN classifies the dot as a blue square. The algorithm works on majorities. The training examples are vectors in a multidimensional feature space, each with a category label. The training phase consists only of storing the feature vectors of the images, i.e., the pixels, and the class labels of the training samples (0 for honey bee and 1 for bumble bee). In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point, here the image of a bee) is classified by assigning the label, 0 or 1, occurring most frequently among the k training samples nearest to that query point. The accuracy of the K-NN algorithm may be critically degraded by the presence of noisy data, for instance a colorful flower similar in color to the bee, or if the feature scales are not consistent with their importance. In binary classification problems such as ours, it is helpful to choose k to be an odd number, as this avoids tied votes; we therefore tested various odd values of k and settled on k = 11, which gave the highest accuracy (a minimal sketch follows).
Fig. 3 Example of K-NN classification
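A minimal sketch of this setup with scikit-learn, assuming the feature matrix X and labels y (0 = honey bee, 1 = bumble bee) have already been built; the random_state is our assumption:

```python
# Hedged sketch: 70/30 split and k-NN with k = 11, as reported above.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
print('k-NN accuracy:', knn.score(X_test, y_test))
```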

Bees Classifier Using Soft Computing Approaches

743

Fig. 4 Example of SVM

3.3 Support Vector Machine Algorithm The support vector machine, often abbreviated SVM, is a classifier that separates two classes by finding a separating hyperplane. SVM uses labeled data and outputs an optimal hyperplane that categorizes new examples. As Fig. 4 shows, data points can be plotted in 2D space with colors representing their classes, and SVM tries to find an optimal plane that clearly divides the two classes. This hyperplane is not a single line but a margin dividing the two classes; the data points lying along the margin borders can be viewed as the points that support the margin, and hence the model is named after them: support vector machine. For our support vector machine implementation, we divided our dataset of 3969 images in the ratio 7:3 for training and testing. We then created an SVM classifier with the parameters kernel = 'linear', probability = True, and random_state = 42; the fixed random state ensures we get the same result on every run. After training and testing, we found accuracies in the range 66–80% with these parameters; SVM achieves its best accuracy of 80% with 500 PCA components (a sketch of this setup follows).
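A sketch matching the reported settings (linear kernel, probability=True, random_state=42, 500 PCA components); the pipeline arrangement is our assumption:

```python
# Hedged sketch: PCA to 500 components followed by a linear-kernel SVM.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

svm_clf = make_pipeline(
    PCA(n_components=500),
    SVC(kernel='linear', probability=True, random_state=42),
)
svm_clf.fit(X_train, y_train)    # split as in the k-NN sketch above
print('SVM accuracy:', svm_clf.score(X_test, y_test))
```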

3.4 Decision Tree Algorithm A decision tree is a decision-support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs,


and utility. Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. A decision tree can be represented as a flowchart in which each internal node represents a condition or test, and the outcome of this test is represented by the child nodes of that node. The test represented by an internal node is preferably binary, with two possible outcomes. Classification rules can be derived from the paths between the root and the leaf nodes; each such path becomes a classification rule. A decision tree consists of three types of nodes: internal (decision) nodes, represented by squares; chance nodes, represented by circles; and leaf nodes, the final outcomes, represented by rectangles. Decision trees are commonly used in operations research and operations management; another use is as a descriptive means of calculating conditional probabilities. In this research, we used this classifier with entropy as the criterion and a maximum tree depth of 11, which gave an accuracy of 71.2% (the best among the parameters tried); on changing the criterion and maximum depth, the accuracy kept decreasing. The precision of this classification was found to be 70% (a sketch follows).
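A sketch with the stated parameters (entropy criterion, maximum depth 11):

```python
# Hedged sketch of the decision tree classifier described above.
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='entropy', max_depth=11)
dt.fit(X_train, y_train)         # split as in the k-NN sketch above
print('Decision tree accuracy:', dt.score(X_test, y_test))
```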

3.5 Random Forest Algorithm The random forest, as its name suggests, offers an ensemble approach by collecting a large number of decision trees under one umbrella. For our random forest implementation, we divided our dataset of 3969 images in the ratio 7:3 for training and testing. We then created a random forest classifier with the parameters n_estimators = 15, criterion = 'entropy', and random_state = 1234321 (the fixed random state ensures we get the same result on every run), and trained the model for different parameters, varying n_estimators from 5 to 20 and the criterion between entropy and gini. After training and testing, we found the best accuracy of 80% with n_estimators = 10 and criterion = 'entropy'. Random forests comprise multiple trees, each based on a random sample of the training data, and are generally more accurate than single decision trees. Each tree in the random forest gives a class prediction, and the class with the most votes becomes the model's prediction. As we can see in Fig. 5, nine decision trees are trained in the random forest; six of them vote for one class and three for the other, so the majority class becomes the forest's overall prediction (a sketch follows Fig. 5).


Fig. 5 Example of random forest
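A sketch with the best reported parameters (n_estimators=10, entropy criterion, random_state=1234321):

```python
# Hedged sketch of the random forest classifier described above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=10, criterion='entropy',
                            random_state=1234321)
rf.fit(X_train, y_train)         # split as in the k-NN sketch above
print('Random forest accuracy:', rf.score(X_test, y_test))
```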

3.6 Logistic Regression Algorithm Another classification algorithm we use is logistic regression. This algorithm takes its name from the function at its core, the logit or logistic function, which some authors also call the sigmoid function because of its S-shaped curve, which takes any real number and maps it to a value between 0 and 1. (Here, in our case, 0 stands for honey bee while 1 stands for bumble bee.)

1 / (1 + e^(−value))   (1)

Here, 'e' is the base of the natural logarithm (Euler's number, or the EXP() function), and 'value' is the actual numerical value to be transformed. The coefficients (beta values b) of the logistic regression algorithm must be estimated from the training data. This is done using maximum-likelihood estimation, to decide whether a picture is classified as a honey bee or a bumble bee. The best coefficients result in a model that predicts a value near 1 (i.e., bumble bee) for the default class and a value near 0 (i.e., honey bee) for the


other class. Intuitively, maximum likelihood for logistic regression is a search procedure that seeks the coefficient values minimizing the error between the probabilities predicted by the model and those in the data (e.g., a probability of 1 if a sample belongs to the primary class). In binary (binomial) logistic regression, the outcome is usually coded as "0" or "1", as this gives the most straightforward interpretation. The logit of the probability is the logarithm of the odds, defined as follows:

logit(p) = ln(p / (1 − p)), for 0 < p < 1   (2)

logit(E(Y)) = α + βx   (3)
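A sketch of the classifier; scikit-learn fits the coefficients by (regularized) maximum likelihood, consistent with the description above. The max_iter value is our assumption:

```python
# Hedged sketch of the logistic regression classifier described above.
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)         # split as in the k-NN sketch above
print('Logistic regression accuracy:', lr.score(X_test, y_test))
```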

4 Results The data is processed with various classification algorithms: Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Decision Tree, Random Forest (RF), and Logistic Regression. We use Python's machine learning package, scikit-learn, to implement these algorithms, and we compute the confusion matrix to check the metric parameters. As part of this research, the aim was to create classifiers that accurately recognize the genus of a bee as either a honey bee (class 0) or a bumble bee (class 1) from its photo. The features were then selected using the extraction method and dimensionality reduction. In extraction, histogram equalization was applied to each image, and the raw pixel values were used to form the features. We analyze performance using the confusion matrix. On analysis, the SVM algorithm gave an accuracy varying between 75 and 80%, K-NN gave an accuracy of 80%, Decision Tree gave 71.2%, Random Forest gave 80%, and Logistic Regression gave 75.3%. The training data available for each class is highly skewed (the honey bee to bumble bee ratio is about 1:4); thus, the error rate alone is not a good indicator of classifier performance (at least for the original and cropped data, where the dataset is skewed). Hence, in addition to error rates, we use the area under the receiver operating characteristic (ROC) curve, also referred to as AUC, as a performance metric (a sketch of this evaluation follows Fig. 6). Figure 6 shows the AUC performance of the classifiers (SVM, RF, and LR, i.e., support vector machine, random forest, and logistic regression) on (a) the original dataset, (b) the dataset formed by cropping the bees, and (c) the augmented dataset; panel (d) shows the decay of the singular values of the PCA features for the original, cropped, and augmented datasets.


Fig. 6 AUC curves for a the original dataset, b the dataset formed by cropping of images (ROI), c the augmented dataset, and d the decay of singular values of the PCA features for original, cropped, and augmented datasets
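A sketch of this evaluation, assuming a classifier exposing predict_proba (such as the SVM pipeline sketched earlier):

```python
# Hedged sketch: confusion matrix plus ROC AUC on the held-out test set.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = svm_clf.predict(X_test)
y_prob = svm_clf.predict_proba(X_test)[:, 1]   # probability of class 1 (bumble bee)
print(confusion_matrix(y_test, y_pred))
print('AUC:', roc_auc_score(y_test, y_prob))
```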

5 Conclusion This research presents a Naive Bees classifier that automatically classifies a bee as a honey bee or a bumble bee with the help of various machine learning classification algorithms. We generate a result indicating whether a given image shows a honey bee or a bumble bee. The dataset contains 3969 images of both types of bees. Once the data is loaded, we apply PCA for dimensionality reduction and flatten the images, yielding 128,100 pixels (columns or features) per image. After the data is generated, we apply the classification algorithms and check the accuracy of each. The Naive Bees classifier helps researchers collect field data more quickly and efficiently, with the help of an automated machine. Pollinating bees have critical roles in both ecology and agriculture, and diseases such as colony collapse disorder threaten these insects. Identifying different species of bees in the field means that we can better understand the prevalence and growth of these important insects.


As a future expansion of this research, we can add images of other important insects in the wild that contribute to the balanced functioning of the ecosystem, and build a classification model for each.

References
1. E. Tola, V. Lepetit, P. Fua, DAISY: an efficient dense descriptor applied to wide baseline stereo. IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 815–830 (2010)
2. O. Tuzel, F. Porikli, P. Meer, Region covariance: a fast descriptor for detection and classification, in Proceedings of the 9th European Conference on Computer Vision—Volume Part II, ECCV'06 (Springer-Verlag, Berlin, Heidelberg, 2006), pp. 589–600

Fuzzy Trust Based Secure Routing Protocol for Opportunistic Internet of Things
Nisha Kandhoul and S. K. Dhurandher

Abstract Opportunistic Internet of Things (OppIoT) is a network of Internet of Things (IoT) devices and communities formed by humans. Data is shared among the nodes in a broadcast manner, using opportunistic contacts between humans, so devising secure data transmission techniques is necessary. As OppIoT comprises a wide range of devices, such as sensors and smart devices, not all of them are capable of handling the complexity of security protocols; we incorporate fuzzy logic to add flexibility to the system. In this paper, the authors propose a fuzzy trust based routing protocol, FuzzyT_CAFE, for protecting the network against bad-mouthing or good-mouthing, Sybil, blackhole, and packet fabrication attacks. The protocol derives the trust of the nodes in the network using four fuzzy attributes: Unfabricated Packet Ratio, Amiability, Forwarding Ratio, and Encounter Ratio. The output parameter is the trust of the node, used to determine whether the selected node is malicious and whether the message should be forwarded. Simulation results suggest that the proposed FuzzyT_CAFE protocol is more flexible and outperforms the base routing protocol T_CAFE in terms of unfabricated packets received, a higher message delivery probability, and a very low dropped message count. Keywords OppIoT · Fuzzy logic · Security · Trust

1 Introduction The Internet of Things (IoT) [1] is a collection of sensors, digital devices, and humans connected to the Internet. Devices forming the IoT are present everywhere and are used in domains such as health care, intelligent homes, and many more. Opportunistic N. Kandhoul (B) Division of Information Technology, N.S.I.T, University of Delhi, New Delhi, India e-mail: [email protected] S. K. Dhurandher Department of Information Technology, Netaji Subhas University of Technology, New Delhi, India e-mail: [email protected] © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1166, https://doi.org/10.1007/978-981-15-5148-2_65


Networks (OppNets) [2] are a type of delay-tolerant network, where routes are built on the fly whenever a device needs to send a message to other devices in the network. Opportunistic Internet of Things (OppIoT) [3] is an amalgamation of IoT and OppNets that brings together humans and a wide variety of devices such as smartphones, sensors, RFID, and so on. The diversity of IoT devices and the broadcast technique of data sharing further elevate the problems of user privacy [4] and data security. Attacks [5] can be passive, aimed at gathering information about the network and its users (e.g., eavesdropping), or active, aimed at harming the users and data (e.g., packet fabrication and blackhole attacks). Attackers obstruct the normal functioning of the network and waste the resources of power-limited devices. The need of the hour, therefore, is to ensure the privacy and security of OppIoT devices. Cryptography-based methods are not very successful in OppIoT, as they consume many resources, and key management is very difficult in the presence of attacker nodes in the network. Trust-based routing is thus most appropriate for securing OppIoT. Trust is the measure of belief that one node has in another node's future behavior in the network. This paper proposes a fuzzy version of the existing trust based secure routing protocol T_CAFE [6]. Fuzzy Logic (FL) brings fuzziness into the system and relaxes absolute conditions. FL can handle numerical data and linguistic knowledge simultaneously, and it allows the computation of trust (Fuzzy_Trust) based on several adaptive rules. These rules map several input parameters to Fuzzy_Trust, thereby reducing errors in the system and making it more adaptive, while guarding the network against attacks, viz. bad and good mouthing, blackhole, Sybil, and packet fabrication. The major contributions of this work include – Fuzzy based trust computation: relaxation is provided in the trust computation using fuzzy logic, thereby making the system flexible. – Detection and isolation of attackers: FuzzyT_CAFE isolates attackers from the packet forwarding procedure. The remainder of the paper is arranged as follows: Sect. 2 presents the literature survey, Sect. 3 gives the details of the proposed FuzzyT_CAFE, Sect. 4 discusses the simulation results, and Sect. 5 provides the paper's conclusion.

2 Related Work In this section, existing works related to fuzzy based secure OppIoT networks are discussed. Cuka et al. [7] presented several fuzzy based systems for the selection of IoT devices in opportunistic networks. The input parameters used were device speed, distance, and residual energy, and the output parameter is the IoT device selection decision. The malicious behavior of nodes is not addressed.


Chhabra et al. [8] proposed FuzzyPT, which defends against blackhole attacks by extracting information from messages available in the buffer and from threat messages, and applying fuzzy logic. FuzzyPT improved decision-making and reduced the number of false positives; the authors verified the proposition using a game-theoretic approach. Xia et al. [9] proposed a fuzzy and trust based approach for characterizing the behavior of nodes. This protocol dynamically computes fuzzy trust to set up routing paths free from malicious nodes. Dhurandher et al. [10] proposed a fuzzy approach for geocasting in OppNets. The protocol employs several fuzzy attributes, namely movement, residual energy, and buffer space, for the selection of the next hop for message forwarding.

3 System Model 3.1 Motivation Fuzzy logic takes decisions based on perception; exact values are not used, and errors are accepted within a range. Message forwarding decisions can be made based on certain attributes such as trust, forwarding behavior, and so on, but the presence of malicious nodes hugely affects these decisions. A fuzzy controller can support this decision-making by taking these attributes as input; its output can then be used for making forwarding decisions. Fuzzy logic reduces the complexity of the system without affecting its performance. This motivation led the authors to design FuzzyT_CAFE, a fuzzy version of T_CAFE, in which the fuzzy trust of OppIoT devices is calculated using fuzzy parameters, making the system more flexible and thereby protecting it from attackers.

3.2 Proposed FuzzyT_CAFE Routing Protocol FuzzyT_CAFE is a fuzzy extension of T_CAFE; the details of the trust computation can be found in [6]. When a node wants to send a packet, it initiates the fuzzy trust computation for the neighboring nodes, as depicted in Fig. 1. In this work, the encounter ratio (ER), forwarding ratio (FR), amiability (Amb), and unfabricated packet ratio (UR) [6], as computed by T_CAFE, are the inputs, and trust is the defuzzified output of a fuzzy controller. The higher the values of the input parameters, the higher the trust. If FR is low, the trust is low, as this suggests a possible blackhole. ER, along with Amb, measures the social behavior of a node and is used to detect Sybil nodes; UR is used to detect packet fabrication attacks. All these parameters combine to form trust. For malicious nodes, the value of trust is very low, whereas benign nodes, which are social and have a high forwarding rate,


(a) Amiability

(b) Encounter Ratio

(c) Unfabricated Ratio

(d) Forwarding Ratio

Fig. 2 Input parameters

Fig. 3 Trust

Fuzzy Trust Based Secure Routing Protocol for Opportunistic Internet of Things Table 1 Fuzzy rule base S_No ER FR UR 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18

P P P P G P P G G G P G G E P P P G

P P P G P P G G P G G G G P E P P P

P P G P P G G P P G G P G P P E P P

753

Amb

Trust

S_No ER

FR

UR

Amb

Trust

P G P P P G P P G P G G G P P P E E

VL VL VL L VL L L M L G G M G VL M M M M

19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

E P G E P G E G G G E G P P E E G E

P P P G E G G E G G G E G E G E G E

P G E P G E G G G P G G E G G G E E

G M G G M VG VG G G M G G M M VG VG G E

G E P P P G G G E E P P G G E G E E

have higher trust. FLC fuzzifies Trust and adds flexibility to the system to make it close to the real-world behavior. Table 1 provides the rule base for the FLC. Figure 2 shows the membership functions of fuzzy input and Fig. 3 shows the output variable. All the values have been normalized and vary in the range 0–1.

4 Simulation Results This section provides the details of simulations performed using ONE simulator [11]. Each simulation is run for 42,800 s. The performance of FuzzyT_CAFE is evaluated and compared with the results for T_CAFE, under varying Time to Live of messages. 500 Kb–1 Mb sized message is created every 25–35 s. TTL is varied in the range of 100–300 min and the result of this variation is captured in Figs. 4 and 5. From these figures, it is clear that the fuzzy version produces comparable results like the basic version and adds flexibility to the system. Figure 4 shows the impact of changing the TTL of messages on various performance parameters. Figure 4a shows that the probability of message delivery falls with rising Time to Live of messages, as the messages now stay in the buffer for a larger period of time eventually leading them