Advances in Intelligent Systems and Computing 1387
Ashish Khanna · Deepak Gupta · Siddhartha Bhattacharyya · Aboul Ella Hassanien · Sameer Anand · Ajay Jaiswal Editors
International Conference on Innovative Computing and Communications Proceedings of ICICC 2021, Volume 1
Advances in Intelligent Systems and Computing Volume 1387
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by DBLP, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST). All books published in the series are submitted for consideration in Web of Science.
More information about this series at https://link.springer.com/bookseries/11156
Editors
Ashish Khanna, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Deepak Gupta, Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Siddhartha Bhattacharyya, Department of Computer Science and Engineering, CHRIST (Deemed to be University), Bangalore, Karnataka, India
Aboul Ella Hassanien, Faculty of Computers and Information, Cairo University, Giza, Egypt
Sameer Anand, Department of Computer Science, Shaheed Sukhdev College of Business Studies, Rohini, Delhi, India
Ajay Jaiswal, Department of Computer Science, Shaheed Sukhdev College of Business Studies, Rohini, Delhi, India
ISSN 2194-5357 ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-981-16-2593-0 ISBN 978-981-16-2594-7 (eBook)
https://doi.org/10.1007/978-981-16-2594-7
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022, corrected publication 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Dr. Ashish Khanna would like to dedicate this book to his mentors Dr. A. K. Singh and Dr. Abhishek Swaroop for their constant encouragement and guidance and his family members including his mother, wife, and kids. He would also like to dedicate this work to his (Late) father Sh. R. C. Khanna with folded hands for his constant blessings. Dr. Deepak Gupta would like to dedicate this book to his father Sh. R. K. Gupta, his mother Smt. Geeta Gupta for their constant encouragement, his family members including his wife, brothers, sisters, kids, and to my students close to my heart. Prof. (Dr.) Siddhartha Bhattacharyya would like to dedicate this book to Late Kalipada Mukherjee and Late Kamol Prova Mukherjee. Prof. (Dr.) Aboul Ella Hassanien would like to dedicate this book to his wife Nazaha Hassan. Dr. Sameer Anand would like to dedicate this book to his Dada Prof. D. C. Choudhary, his beloved wife Shivanee, and his son Shashwat.
Dr. Ajay Jaiswal would like to dedicate this book to his father Late Prof. U. C. Jaiswal, his mother Brajesh Jaiswal, his beloved wife Anjali, his daughter Prachii, and his son Sakshaum.
ICICC-2021 Steering Committee Members
Patrons:
Dr. Poonam Verma, Principal, SSCBS, University of Delhi
Prof. Dr. Pradip Kumar Jain, Director, National Institute of Technology Patna, India

General Chairs:
Prof. Dr. Siddhartha Bhattacharyya, Christ University, Bangalore
Prof. Valentina Emilia Balas, Aurel Vlaicu University of Arad, Romania
Dr. Prabhat Kumar, National Institute of Technology Patna, India

Honorary Chairs:
Prof. Dr. Janusz Kacprzyk, FIEEE, Polish Academy of Sciences, Poland
Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava, Czech Republic

Conference Chairs:
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt
Prof. Dr. Joel J. P. C. Rodrigues, National Institute of Telecommunications (Inatel), Brazil
Prof. Dr. R. K. Agrawal, Jawaharlal Nehru University, Delhi
Technical Program Chairs:
Prof. Dr. Victor Hugo C. de Albuquerque, Universidade de Fortaleza, Brazil
Prof. Dr. A. K. Singh, National Institute of Technology, Kurukshetra
Prof. Dr. Anil K. Ahlawat, KIET Group of Institutes, Ghaziabad

Editorial Chairs:
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi
Dr. Arun Sharma, Indira Gandhi Delhi Technical University for Women, Delhi
Prerna Sharma, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi

Conveners:
Dr. Ajay Jaiswal, SSCBS, University of Delhi
Dr. Sameer Anand, SSCBS, University of Delhi
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Gulshan Shrivastava, National Institute of Technology Patna, India

Publication Chairs:
Prof. Dr. Neeraj Kumar, Thapar Institute of Engineering and Technology
Dr. Hari Mohan Pandey, Edge Hill University, UK
Dr. Sahil Garg, École de technologie supérieure, Université du Québec, Montreal, Canada
Dr. Vicente García Díaz, University of Oviedo, Spain

Publicity Chairs:
Dr. M. Tanveer, Indian Institute of Technology, Indore, India
Dr. Jafar A. Alzubi, Al-Balqa Applied University, Salt, Jordan
Dr. Hamid Reza Boveiri, Sama College, IAU, Shoushtar Branch, Shoushtar, Iran
Prof. Med Salim Bouhlel, Sfax University, Tunisia
Co-Convener:
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India

Organizing Chairs:
Dr. Kumar Bijoy, SSCBS, University of Delhi
Dr. Rishi Ranjan Sahay, SSCBS, University of Delhi
Dr. Amrina Kausar, SSCBS, University of Delhi
Dr. Abhishek Tandon, SSCBS, University of Delhi

Organizing Team:
Dr. Gurjeet Kaur, SSCBS, University of Delhi
Dr. Aditya Khamparia, Lovely Professional University, Punjab, India
Dr. Abhimanyu Verma, SSCBS, University of Delhi
Dr. Onkar Singh, SSCBS, University of Delhi
Dr. Kalpna Sagar, KIET Group of Institutes, Ghaziabad
Dr. Purnima Lala Mehta, Assistant Professor, IILM
Dr. Suresh Chavhan, Vellore Institute of Technology, Vellore, India
Dr. Mona Verma, SSCBS, University of Delhi
Preface
We are delighted to announce that Shaheed Sukhdev College of Business Studies, New Delhi, in association with the National Institute of Technology Patna and the University of Valladolid, Spain, hosted the eagerly awaited and much-coveted International Conference on Innovative Computing and Communication (ICICC-2021) in hybrid mode. The fourth edition of the conference attracted a diverse range of engineering practitioners, academicians, scholars, and industry delegates, receiving abstracts from more than 3,600 authors from different parts of the world. The committee of professionals dedicated to the conference strove to achieve a high-quality technical program with tracks on Innovative Computing, Innovative Communication Network and Security, and Internet of Things. All the tracks chosen in the conference are interrelated and are very popular among the present-day research community; therefore, a lot of research is happening in the above-mentioned tracks and their related sub-areas. As the name of the conference starts with the word “innovation,” it targeted out-of-the-box ideas, methodologies, applications, expositions, surveys, and presentations helping to upgrade the current status of research. More than 900 full-length papers were received, with contributions focused on theoretical research, computer simulation-based research, and laboratory-scale experiments. Among these manuscripts, 210 papers have been included in the Springer proceedings after a thorough two-stage review and editing process. All the manuscripts submitted to ICICC-2021 were peer-reviewed by at least two independent reviewers, who were provided with a detailed review proforma. The comments from the reviewers were communicated to the authors, who incorporated the suggestions in their revised manuscripts. The recommendations from two reviewers were taken into consideration while selecting a manuscript for inclusion in the proceedings. The exhaustiveness of the review process is evident, given the large number of articles received addressing a wide range of research areas. The stringent review process ensured that each published manuscript met rigorous academic and scientific standards. It is an exalting experience to finally see these elite contributions materialize into three book volumes as the ICICC-2021 proceedings by Springer entitled “International Conference on Innovative Computing and Communications.” The articles are organized into three volumes in some broad categories covering subject
matters on machine learning, data mining, big data, networks, soft computing, and cloud computing, although, given the diverse areas of research reported, a perfect grouping has not always been possible. ICICC-2021 invited seven keynote speakers, who are eminent researchers in the field of computer science and engineering, from different parts of the world. In addition to the plenary sessions on each day of the conference, ten concurrent technical sessions were held every day to accommodate the oral presentation of around 210 accepted papers. Keynote speakers and session chair(s) for each of the concurrent sessions were leading researchers from the thematic area of the session. A technical exhibition was held during the 2 days of the conference, putting on display the latest technologies, expositions, ideas, and presentations. The research part of the conference was organized in a total of 28 special sessions and 3 international workshops. These special sessions and international workshops provided the opportunity for researchers conducting research in specific areas to present their results in a more focused environment. An international conference of such magnitude and the release of the ICICC-2021 proceedings by Springer have been the remarkable outcome of the untiring efforts of the entire organizing team. The success of an event undoubtedly involves the painstaking efforts of several contributors at different stages, dictated by their devotion and sincerity. Fortunately, since the beginning of its journey, ICICC-2021 has received support and contributions from every corner. We thank all who have wished the best for ICICC-2021 and contributed by any means toward its success. The edited proceedings volumes by Springer would not have been possible without the perseverance of all the steering, advisory, and technical program committee members. We owe thanks to all the contributing authors for their interest and exceptional articles, and we also thank them for adhering to the time schedule and for incorporating the review comments. We wish to extend our heartfelt acknowledgment to the authors, peer-reviewers, committee members, and production staff whose diligent work gave shape to the ICICC-2021 proceedings. We especially want to thank our dedicated team of peer-reviewers who volunteered for the arduous and tedious step of quality checking and critique of the submitted manuscripts. We wish to thank our faculty colleagues Mr. Moolchand Sharma and Ms. Prerna Sharma for extending their enormous assistance during the conference. The time spent by them and the midnight oil burnt are greatly appreciated, for which we will ever remain indebted. The management, faculty, administrative, and support staff of the college have always extended their services whenever needed, for which we remain thankful to them. Lastly, we would like to thank Springer for accepting our proposal for publishing the ICICC-2021 conference proceedings. The help received from Mr. Aninda Bose, the senior acquisitions editor, in the process has been very useful.
Delhi, India
Rohini, India
Ashish Khanna Deepak Gupta Organizers, ICICC-2021
Contents
Building Virtual High-Performance Computing Clusters with Docker: An Application Study at the University of Economics Ho Chi Minh City . . . 1
Quoc Hung Nguyen, Thanh Le, Ha Quang Dinh Vo, and Viet Phuong Truong
Implementing Multilevel Graphical Password Authentication Scheme in Combination with One Time Password . . . 11
T. Srinivasa Ravi Kiran, A. Srisaila, and A. Lakshmanarao
State of Geographic Information Science (GIS), Spatial Analysis (SA) and Remote Sensing (RS) in India: A Machine Learning Perspective . . . 29
Shruti Sachdeva and Bijendra Kumar
Application of Noise Reduction Techniques to Improve Speaker Verification to Multi-Speaker Text-to-Speech Input . . . 43
Md. Masudur Rahman, Sk. Arifuzzaman Pranto, Romana Rahman Ema, Farheen Anfal, and Tajul Islam
Utilization of Machine Learning Algorithms for Thyroid Disease Prediction . . . 57
Md. Shahajalal, Md. Masudur Rahman, Sk. Arifuzzaman Pranto, Romana Rahman Ema, Tajul Islam, and M. Raihan
Detection of Hepatitis C Virus Progressed Patient’s Liver Condition Using Machine Learning . . . 71
Ferdib-Al-Islam and Laboni Akter
Energy Performance Prediction of Residential Buildings Using Nonlinear Machine Learning Technique . . . 81
D. Senthil Kumar, D. George Washington, A. K. Reshmy, and M. Noorunnisha
Cloud Image Prior: Single Image Cloud Removal . . . 95
Anirudh Maiya and S. S. Shylaja
Prioritizing Python Code Smells for Efficient Refactoring Using Multi-criteria Decision-Making Approach . . . 105
Aakanshi Gupta, Deepanshu Sharma, and Kritika Phulli
Forecasting Rate of Spread of Covid-19 Using Linear Regression and LSTM . . . 123
Ashwin Goyal, Kartik Puri, Rachna Jain, and Preeti Nagrath
Employment of New Cryptography Algorithm by the Use of Spur Gear Dimensional Formula and NATO Phonetic Alphabet . . . 135
Sukhwant Kumar, Sudipa Bhowmik, Priyanka Malakar, and Pushpita Sen
Security Framework for Enhancing Security and Privacy in Healthcare Data Using Blockchain Technology . . . 143
A. Sivasangari, V. J. K. Kishor Sonti, S. Poonguzhali, D. Deepa, and T. Anandhi
American Sign Language Identification Using Hand Trackpoint Analysis . . . 159
Yugam Bajaj and Puru Malhotra
Brain Tumor Detection Using Deep Neural Network-Based Classifier . . . 173
Ambeshwar Kumar and R. Manikandan
Detecting Diseases in Mango Leaves Using Convolutional Neural Networks . . . 183
Rohan Sharma, Kartik Suvarna, Shreyas Sudarsan, and G. P. Revathi
Recommending the Title of a Research Paper Based on Its Abstract Using Deep Learning-Based Text Summarization Approaches . . . 193
Sheetal Bhati, Shweta Taneja, and Pinaki Chakraborty
An Empirical Analysis of Survival Predictors for Cancer Using Machine Learning . . . 203
Ishleen Kaur, M. N. Doja, and Tanvir Ahmad
Epitope Prediction for Peptide Vaccine Against Chikungunya and Dengue Virus, Using Immunoinformatics Tools . . . 213
Krishna S. Gayatri, Geethu Gopinath, Bhawana Rathi, and Anupama Avasthi
Airflow Control and Gas Leakage Detection System . . . 239
J. S. Vimali, Bevish Jinila, S. Gowri, Sivasangari, Ajitha, and Jithina Jose
Impact of Lightweight Machine Learning Models for Speech Emotion Recognition . . . 249
Swaraj Dhondge, Rashmi Shewale, Madhura Satao, and Jayashree Jagdale
Impact of COVID-19 Pandemic on Mental Health Using Machine Learning and Artificial Intelligence . . . 263
Rakshanda Naiem, Jasanpreet kaur, Shruti Mishra, and Ankur Saxena
Identification of Student Group Activities in Educational Institute Using Cognitive Analytics . . . 275
Ganeshayya Shidaganti, Ikesh Yadav, Himanshu Dagdi, Jagdish, and Aman
A Machine Learning Model for Automated Classification of Sleep Stages using Polysomnography Signals . . . 285
Santosh Kumar Satapathy, D. Loganathan, S. Sharathkumar, and Praveena Narayanan
An Efficient Approach for Brain Tumor Detection Using Deep Learning Techniques . . . 297
R. V. Belfin, J. Anitha, Aishwarya Nainan, and Lycia Thomas
Real-Time Detection of Student Engagement: Deep Learning-Based System . . . 313
Zeyad A. T. Ahmed, Mukti E. Jadhav, Ali Mansour Al-madani, Mohammed Tawfik, Saleh Nagi Alsubari, and Ahmed Abdullah A. Shareef
Bangla Handwritten Digit Recognition Based on Different Pixel Matrices . . . 325
Forhad An Naim
Information Extraction and Sentiment Analysis to Gain Insight into the COVID-19 Crisis . . . 343
Sandhya Avasthi, Ritu Chauhan, and Debi Prasanna Acharjya
Gender Recognition Using Deep Leering Convolutional Neural Network . . . 355
Belal Alsellami and Prapti D. Deshmukh
Detection of COVID-19 Using EfficientNet-B3 CNN and Chest Computed Tomography Images . . . 365
Sahar Alquzi, Haikel Alhichri, and Yakoub Bazi
Comparative Study on Identification and Classification of Plant Diseases with the Support of Transfer Learning . . . 375
Aditi Singh and Harjeet Kaur
Cross Channel Scripting Attacks (XCS) in Web Applications . . . 387
R. Shashidhara, V. Kantharaj, K. R. Bhavya, and S. C. Lingareddy
Localization-Based Multi-Hop Routing in Wireless Body Area Networks for Health Monitoring . . . 399
Subba Reddy Chavva and Ravi Sankar Sangam
Analyzing Security Testing Tools for Web Applications . . . 411
Amel F. Aljebry, Yasmine M. Alqahtani, and Norrozila Sulaiman
A Study on Password Security Awareness in Constructing Strong Passwords . . . 421
Norrozila Sulaiman
Minimum Pearson Distance Detection for MIMO-OFDM Systems . . . 431
H. A. Anoop and Prerana G. Poddar
Study on Emerging Machine Learning Trends on Nanoparticles—Nanoinformatics . . . 443
B. Lavanya and G. Sasipriya
A Review of the Oversampling Techniques in Class Imbalance Problem . . . 459
Shweta Sharma, Anjana Gosain, and Shreya Jain
Eye Blink-Based Liveness Detection Using Odd Kernel Matrix in Convolutional Neural Networks . . . 473
N. Nanthini, N. Puviarasan, and P. Aruna
Predicting Student Potential Using Machine Learning Techniques . . . 485
Shashi Sharma, Soma Kumawat, and Kumkum Garg
Routing Based on Spectrum Quality and Availability in Wireless Cognitive Radio Sensor Networks . . . 497
Veeranna Gatate and Jayashree Agarkhed
A Review on Scope of Distributed Cloud Environment in Healthcare Automation Security and Its Feasibility . . . 509
Mirza Moiz Baig and Shrikant V. Sonekar
Smart Traffic Monitoring and Alert System Using VANET and Deep Learning . . . 525
Manik Taneja and Neeraj Garg
Enhancement of Lifetime of Wireless Sensor Network Based on Energy-Efficient Circular LEACH Algorithm . . . 537
Jainendra Singh and Zaheeruddin
Estimation and Correction of Multiple Skews Arabic Handwritten Document Images . . . 553
M. Ravikumar and Omar Ali Boraik
Heart Disease Prediction Using Hybrid Classification Methods . . . 565
Aniket Bharadwaj, Divakar Yadav, and Arun Kumar Yadav
Job Recommendation System Using Content and Collaborative-Based Filtering . . . 575
Rahul Pradhan, Jyoti Varshney, Kartik Goyal, and Latesh Kumari
Recommendation System for Business Process Modelling in Educational Organizations . . . 585
Anu Saini, Astha Jain, and J. L. Shreya
RETRACTED CHAPTER: Using Bidirectional LSTMs with Attention for Categorization of Toxic Comments . . . 595
Zubin Tobias and Suneha Bose
Detection of Rheumatoid Arthritis Using a Convolutional Neural Network . . . 607
A. S. Mahesh Kumar, M. S. Mallikarjunaswamy, and S. Chandrashekara
The Improved Method for Image Encryption Using Fresnel Transform, Singular Value Decomposition and QR Code . . . 619
Anshula and Hukum Singh
A Study on COVID-19 Impacts on Indian Students . . . 633
Arpita Telkar, Chahat Tandon, Pratiksha Bongale, R. R. Sanjana, Hemant Palivela, and C. R. Nirmala
Improving Efficiency of Machine Learning Model for Bank Customer Data Using Genetic Algorithm Approach . . . 649
B. Ajay Ram, D. J. santosh Kumar, and A. Lakshmanarao
Unsupervised Learning to Heterogeneous Cross Software Projects Defect Prediction . . . 659
Rohit Vashisht and Syed Afzal Murtaza Rizvi
PDF Text Sentiment Analysis . . . 679
Rahul Pradhan, Kushagra Gangwar, and Ishika Dubey
Performance Analysis of Digital Modulation Schemes Over Fading Channels . . . 691
Kamakshi Rautela, Sandeep Kumar Sunori, Abhijit Singh Bhakuni, Narendra Bisht, Sudhanshu Maurya, Pradeep Kumar Juneja, and Richa Alagh
Single Image Dehazing Using NN-Dehaze Filter . . . 701
Ishank Agarwal
Comparative Analysis for Sentiment in Tweets Using LSTM and RNN . . . 713
Rahul Pradhan, Gauri Agarwal, and Deepti Singh
Solving Connect 4 Using Artificial Intelligence . . . 727
Mayank Dabas, Nishthavan Dahiya, and Pratish Pushparaj
A Pareto Dominance Approach to Multi-criteria Recommender System Using PSO Algorithm . . . 737
Saima Aysha and Shrimali Tarun
Twitter Sentiment Analysis Using K-means and Hierarchical Clustering on COVID Pandemic . . . 757
Nainika Kaushik and Manjot Kaur Bhatia
Improved ECC-Based Image Encryption with 3D Arnold Cat Map . . . 771
Priyansi Parida and Chittaranjan Pradhan
Virtual Migration in Cloud Computing: A Survey . . . 785
Tajinder Kaur and Anil Kumar
Supervised Hybrid Particle Swarm Optimization with Entropy (PSO-ER) for Feature Selection in Health Care Domain . . . 797
J. A. Esther Rani, E. Kirubakaran, Sujitha Juliet, and B. Smitha Evelin Zoraida
A Multimodal Biometrics Verification System with Wavelet . . . 807
Aderonke F. Thompson
IoT-Based Voice-Controlled Automation . . . 827
Anjali Singh, Shreya Srivastava, Kartik Kumar, Shahid Imran, Mandeep Kaur, Nitin Rakesh, Parma Nand, and Neha Tyagi
Trusted Recommendation Model for Social Network of Things . . . 839
Akash Sinha, Prabhat Kumar, and M. P. Singh
Efficient Classification Techniques in Sentiment Analysis Using Transformers . . . 849
Leeja Mathew and V. R. Bindu
Ultra-Wideband Scattered Microwave Signal for Classification and Detection of Breast Tumor Using Neural Network and Statistical Methods . . . 863
Mazhar B. Tayel and Ahmed F. Kishk
Retraction Note to: Using Bidirectional LSTMs with Attention for Categorization of Toxic Comments . . . C1
Zubin Tobias and Suneha Bose
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885
About the Editors
Dr. Ashish Khanna has 16 years of expertise in Teaching, Entrepreneurship, and Research & Development He received his Ph.D. degree from National Institute of Technology, Kurukshetra. He has completed his M. Tech. and B. Tech. GGSIPU, Delhi. He has completed his postdoc from Internet of Things Lab at Inatel, Brazil and University of Valladolid, Spain. He has published around 55 SCI indexed papers in IEEE Transaction, Springer, Elsevier, Wiley and many more reputed Journals with cumulative impact factor of above 100. He has around 120 research articles in top SCI/ Scopus journals, conferences and book chapters. He is co-author of around 30 edited and text books. His research interest includes Distributed Systems, MANET, FANET, VANET, IoT, Machine learning and many more. He is originator of Bhavya Publications and Universal Innovator Lab. Universal Innovator is actively involved in research, innovation, conferences, startup funding events and workshops. He has served the research field as a Keynote Speaker/ Faculty Resource Person/ Session Chair/ Reviewer/ TPC member/ post-doctorate supervision. He is convener and Organizer of ICICC conference series. He is currently working at the Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, under GGSIPU, Delhi, India. He is also serving as series editor in Elsevier and De Gruyter publishing houses. Dr. Deepak Gupta received a B.Tech. degree in 2006 from the Guru Gobind Singh Indraprastha University, India. He received M.E. degree in 2010 from Delhi Technological University, India and Ph.D. degree in 2017 from Dr. APJ Abdul Kalam Technical University, India. He has completed his Post-Doc from Inatel, Brazil. With 13 years of rich expertise in teaching and two years in the industry; he focuses on rational and practical learning. He has contributed massive literature in the fields of Intelligent Data Analysis, BioMedical Engineering, Artificial Intelligence, and Soft Computing. He has served as Editor-in-Chief, Guest Editor, Associate Editor in SCI and various other reputed journals (IEEE, Elsevier, Springer, & Wiley). He has actively been an organizing end of various reputed International conferences. He has authored/edited 50 books with National/International level publishers (IEEE,
Elsevier, Springer, Wiley, Katson). He has published 180 scientific research publications in reputed international journals and conferences, including 94 SCI-indexed journals of IEEE, Elsevier, Springer, Wiley, and more. Prof. Siddhartha Bhattacharyya, FRSA FIET (UK), is currently the Principal of Rajnagar Mahavidyalaya, Birbhum, India. Prior to this, he was a Professor at Christ University, Bangalore, India. He served as Senior Research Scientist at the Faculty of Electrical Engineering and Computer Science of VSB Technical University of Ostrava, Czech Republic, from October 2018 to April 2019. He also served as the Principal of the RCC Institute of Information Technology, Kolkata, India. He is a co-author of 6 books and a co-editor of 75 books and has more than 300 research publications in international journals and conference proceedings to his credit. His research interests include soft computing, pattern recognition, multimedia data processing, hybrid intelligence, and quantum computing. Prof. Aboul Ella Hassanien is the Founder and Head of the Egyptian Scientific Research Group (SRGE) and a Professor of Information Technology at the Faculty of Computers and Artificial Intelligence, Cairo University. Professor Hassanien is an ex-dean of the Faculty of Computers and Information, Beni Suef University. Professor Hassanien has more than 800 scientific research papers published in prestigious international journals and over 40 books covering such diverse topics as data mining, medical images, intelligent systems, social networks, and smart environments. Prof. Hassanien has won several awards, including the Best Researcher of the Youth Award of Astronomy and Geophysics of the National Research Institute, Academy of Scientific Research (Egypt, 1990). He was also granted a scientific excellence award in humanities from the University of Kuwait in 2004 and received the scientific University Award (Cairo University, 2013). He was also honored in Egypt as the best researcher at Cairo University in 2013. He received the Islamic Educational, Scientific and Cultural Organization (ISESCO) prize on Technology (2014) and the State Award for Excellence in Engineering Sciences in 2015. He was awarded the Medal of Sciences and Arts of the first class by the President of the Arab Republic of Egypt in 2017. Dr. Sameer Anand is currently working as an Assistant Professor in the Department of Computer Science at Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He received his M.Sc., M.Phil., and Ph.D. (Software Reliability) from the Department of Operational Research, University of Delhi. He is a recipient of the ‘Best Teacher Award’ (2012) instituted by the Directorate of Higher Education, Govt. of NCT, Delhi. The research interests of Dr. Anand include Operational Research, Software Reliability, and Machine Learning. He has completed an innovation project from the University of Delhi. He has worked in different capacities in international conferences. Dr. Anand has published several papers in reputed journals such as IEEE Transactions on Reliability, International Journal of Production Research (Taylor & Francis), International Journal of Performability Engineering,
etc. He is a member of the Society for Reliability Engineering, Quality and Operations Management. Dr. Sameer Anand has more than 16 years of teaching experience. Dr. Ajay Jaiswal is currently serving as an Assistant Professor in the Department of Computer Science of Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He is a co-editor of two books/journals and a co-author of dozens of research publications in international journals and conference proceedings. His research interests include pattern recognition, image processing, and machine learning. He has completed an interdisciplinary project titled “Financial Inclusion - Issues and Challenges: An Empirical Study” as Co-PI; this project was awarded by the University of Delhi. He obtained his master’s degree from the University of Roorkee (now IIT Roorkee) and his Ph.D. from Jawaharlal Nehru University, Delhi. He is a recipient of the best teacher award from the Government of NCT of Delhi. He has more than nineteen years of teaching experience.
Building Virtual High-Performance Computing Clusters with Docker: An Application Study at the University of Economics Ho Chi Minh City

Quoc Hung Nguyen, Thanh Le, Ha Quang Dinh Vo, and Viet Phuong Truong
Abstract The need for high-performance computing in science and technology has become a challenging issue in recent years. Building a high-performance computing system by utilizing existing hardware and software resources is a low-cost solution, and virtualization technology is proposed to solve this problem. It brings convenience and efficiency as it can run on various operating systems, and it can be used to implement many computational algorithms simultaneously on the same hardware system, including parallel processing and/or cluster processing systems. It can be expanded for computation and storage if resources are still available. Virtualization can also combine existing hardware and software resources to solve the problem of mobilizing multiple resources. Docker is considered a powerful virtualization technology, offering a new virtualization solution: instead of creating independent virtual machines with different virtual hardware and operating systems, it allows applications to be repackaged into individual units and run together on the operating system kernel. Sharing the resources of the underlying hardware platforms is the strength of Docker. The paper focuses on analyzing the advantages of hardware virtualization technology and thereby proposes to build a high-performance virtual computing system using Docker technology while utilizing the available hardware platform at the University of Economics Ho Chi Minh City (UEH).

Keywords Virtual high-performance computing · Docker · Cluster computing

Q. H. Nguyen (B) · T. Le · H. Q. D. Vo · V. P. Truong
School of Business Information Technology, University of Economics Ho Chi Minh City, Ho Chi Minh City 700000, Vietnam
e-mail: [email protected]
T. Le
e-mail: [email protected]
H. Q. D. Vo
e-mail: [email protected]
V. P. Truong
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_1
1 Introduction

Today, the use of high-performance computers in the sciences and engineering has changed scientific research activities. Many scientific disciplines need powerful computational capabilities, such as computational biology, computational chemistry, computational physics, computational materials science, computational mechanics, computational geophysics, computational statistics, and banking and financial technology. A common feature is processing and analyzing information and predicting results by simulating and assessing mathematical models with the help of a high-performance computing system. Such a system is considered a critical solution to meet the increasing demand for computing. Not only scientific research activities but also practical activities, such as socioeconomic applications, require high computing performance. They have been developing on a large scale and require increasingly complex technology. Therefore, implementing the related models on conventional computers is impossible due to the large computational volume and time constraints. High-Performance Computing (HPC) systems were born to meet this situation; they allow the implementation of big data solutions based on a system integrating many computing nodes, in which Central Processing Units (CPUs) are combined with Graphics Processing Units (GPUs). On another note, maximizing resource utilization on HPC systems has remained problematic in recent years, and finding the optimal solution to this problem to maximize the computing power of the HPC system is attracting the attention of scientists from research institutes and universities. However, equipping a university or a public research unit with an HPC system requires a large investment of money to purchase and maintain the system as well as human resources: technicians need operation and management qualifications when the system is put into use. On the other hand, there still exist separate servers with independent applications on the existing infrastructure. Therefore, building a high-performance computing system with virtualization technology based on the available distributed hardware resources, which we call vHPC (virtual HPC), is the motivation of this study. In this paper, we develop such a system based on Docker technology. Our contributions in this paper focus on the following: (1) proposing an efficient virtual HPC architecture to utilize the existing hardware and software resources of UEH; (2) combining the existing physical servers to build virtual servers that suit the usage of Docker technology; (3) configuring the system to optimize resource allocation and to balance its performance with the other tasks running on the existing physical servers. The system uses advanced virtualization technologies such as Hyper-V, VMware ESX Server, and Docker Cluster to exploit the available physical resources to create and operate multiple machines [1]. Virtual machines share resources, and they allow consolidating and running multiple workloads as virtual machines on a single server. This technology allows a virtual machine to be treated as a physical computer that can fully run an operating system and applications. It enables the virtualization of hardware from server to servers, virtualization of storage on the
network (SAN—Storage Area Network) [2], or virtualization of shared applications (Applications).
2 Literature Review

Several system virtualization technologies have been applied in many fields. Gautam Bhanage [3] studied the experimental evaluation of OpenVZ (Open Virtuozzo) virtualization technology from a testbed deployment perspective. OpenVZ [4] is a virtualization technology based on the Linux kernel. It allows a single physical server to run multiple separate operating system instances, known as containers, Virtual Private Servers (VPSs), or Virtual Environments (VEs). OpenVZ is not full virtualization; it only shares a modified Linux kernel and can, therefore, only run Linux operating systems. Thus, all VPS virtual servers can only run Linux with the same technology and kernel version. Microsoft Hyper-V virtualization technology [5] consists of three main components: the hypervisor, the virtualization stack, and a new virtualized I/O model. The hypervisor is a very small layer of software that runs on a processor with Intel VT or AMD-V technology. It is responsible for creating the partitions in which the virtual entities run. A partition is a logically isolated unit and can contain an operating system. There is always at least one root partition, containing Windows Server 2008 and the virtualization stack, with direct access to hardware devices. The root partition can then create child partitions, usually called virtual machines, to run the guest operating systems; a partition can also create its own child partitions. Virtual machines do not have access to the physical processor but only “see” the processor provided by the hypervisor. A virtual machine can only use virtual devices; all requests to a virtual device are passed through the VMBus to the device in the parent partition, and feedback is also transmitted via the VMBus. If the device on the parent partition is also a virtual device, the request is forwarded until it meets the real device on the root partition [6]. Irfan Habib presented an overview of KVM (Kernel-based Virtual Machine) virtualization technology [7], which allows full virtualization on the hardware platform: the host OS emulates the hardware for other OSs to run on and acts as a hypervisor, fairly sharing resources such as disk, network I/O, and CPU. The original server has Linux installed, but KVM supports creating virtual servers that can run both Linux and Windows, and it supports both x86 and x86-64 systems. David Chisnall presented the XEN virtualization technology [8], which allows running multiple VPS virtual servers on a physical server at the same time. XEN allows each virtual server to run its own kernel, so a VPS can install either Linux or Windows operating systems. Each VPS has its own file system and acts as an independent physical server.
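A small illustration of the shared-kernel property that container-based approaches such as OpenVZ and Docker rely on, in contrast to the hypervisor-based Hyper-V, KVM, and Xen: the sketch below, which assumes a host with Docker Engine and the docker-py Python SDK installed, runs `uname -r` inside a CentOS 7 container and compares it with the host kernel release. The two values match because the container shares the host kernel rather than booting its own.

```python
# Minimal sketch (assumes Docker Engine and the docker-py SDK are installed):
# containers share the host kernel, unlike hypervisor-based virtual machines.
import platform

import docker

client = docker.from_env()

# Kernel release reported by the host itself.
host_kernel = platform.release()

# Kernel release reported from inside a CentOS 7 container.
container_kernel = (
    client.containers.run("centos:7", "uname -r", remove=True)
    .decode()
    .strip()
)

print(f"host kernel     : {host_kernel}")
print(f"container kernel: {container_kernel}")
print("shared kernel   :", host_kernel == container_kernel)
```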
Next is VMware virtualization technology [9], one of the most optimized and widely used virtualization technologies in recent times, developed by VMware, which supports virtualization at the hardware level. It offers a user-friendly interface, is simple to install and use, has many advanced features, and supports multiple operating systems and diverse versions. This technology is often adopted by large organizations such as banks and is rarely used for commercial VPSs sold on the market. Structurally, it is a virtualization application that runs on Linux or Windows operating systems. In the study of HE Yu et al. [10], the authors proposed a solution to automatically build bridges connecting computing nodes into clusters for parallel processing with MPI (Message Passing Interface) using a Docker cluster [11]. Similar work was carried out by de Bayser et al. [12], who integrated MPI parallel computation with HPC systems using Docker. It is also important to evaluate efficiency when virtualizing a computational system as the basis for building HPC. Muhammad Abdullah et al. at Punjab University, Pakistan, conducted a study [13] to evaluate the effectiveness of using VMs (Virtual Machines) and Docker on the OpenNebula cloud service platform. The results showed that the performance of the Docker deployment reached 70.23% compared with 46.48% for VMs, demonstrating the efficiency of virtualizing with Docker. In Vietnam, several studies in the field of HPC have been carried out [14]. Researchers at the Center for Computing Engineering, University of Technology—Vietnam National University Ho Chi Minh City, aim at building solutions and techniques used in the field of high performance and at building a tool that can evaluate the performance of a powerful computer system, achieving performance in the range of 30–90%; the results were published in “500 of the world’s most powerful computers” [15]. In project [16], Nguyen Thanh Thuy et al. at the High-Performance Computing Center, Hanoi University of Technology, under the protocol of scientific and technological cooperation with the Indian government in the period 2004–2005, proposed to build an HPC system named BKluster. BKluster is a parallel cluster computing system based on the Beowulf architecture and the message passing programming model, together with the BKlusware software suite, which supports many users at many different levels.
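To make the Docker-cluster idea in [10, 12] concrete, the sketch below shows one common pattern: attach a head container and several worker containers to a shared, attachable overlay network so that MPI processes can reach each other by container name. It is only an illustration under stated assumptions, not the setup of the cited papers: the image name `mpi-demo` (an image assumed to contain OpenMPI and a compiled `hello` binary) is hypothetical, Swarm mode is assumed to be active for the overlay driver, and the docker-py SDK is assumed to be installed.

```python
# Hypothetical sketch: wiring containers into a small MPI-style cluster.
# Assumes Docker Swarm mode is active (for the overlay driver) and that an
# image called "mpi-demo" with OpenMPI and a launcher setup exists.
import docker

client = docker.from_env()

# A shared attachable network lets containers resolve each other by name.
client.networks.create("mpi-net", driver="overlay", attachable=True)

workers = []
for i in range(1, 4):
    workers.append(
        client.containers.run(
            "mpi-demo",                  # hypothetical image
            name=f"worker{i}",
            network="mpi-net",
            detach=True,
            command="sleep infinity",    # keep the node alive
        )
    )

head = client.containers.run(
    "mpi-demo", name="head", network="mpi-net",
    detach=True, command="sleep infinity",
)

# Launch an MPI job from the head node across the workers (illustrative only).
exit_code, output = head.exec_run(
    "mpirun --allow-run-as-root -np 3 --host worker1,worker2,worker3 ./hello"
)
print(exit_code, output.decode())
```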
3 Virtual High-Performance Computing Based on Virtual Clusters and Docker

In this paper, we develop a vHPC system by combining cluster computing on the available computing resources of the existing infrastructure at UEH (Table 1) with Docker technology (Fig. 1). As shown in Fig. 1, the vHPC architecture includes three processing layers:
• The Physical layer: It consists of 8 physical servers, as shown in Table 1.
• The Virtualization layer: Referring to Taknet’s analytical evaluation report [17] on building an HPC system, the minimum configuration consists of 1 control
Table 1 List of the existing servers of UEH used for virtualization

No | Server configuration                                         | Storage
1  | IBM X3560 M4, 12 Core 2.0 GHz, E5 2620, 24G Ram              | 270 GB/272 GB
2  | HP Proliant DL380 Gen9, 32 Core 2.10 GHz, E5 2620, 64G Ram   | 600 GB/830 GB
3  | HP Proliant DL380 Gen9, 24 Core 2.10 GHz, E5 2620, 128G Ram  | 100 GB/830 GB
4  | IBM X3560 M3, 16 Core 2.4 GHz, E5620, 36G Ram                | 1.62 TB/1.62 TB
5  | DELL PowerEdge R740, 24 Core 2.59 GHz, 64G Ram               | 1.30 TB/1.90 TB
6  | DELL PowerEdge R720, 24 Core 2.10 GHz, 64G Ram               | 2 TB/3.26 TB
7  | DELL PowerEdge R720, 24 Core 2.10 GHz, 64G Ram               | 1 TB/3.26 TB
8  | SAN (Storage Area Network)                                   | 7 TB
Fig. 1 The proposed vHPC architecture with the Docker
computer (defined as the Head Node) and at least 3 to 4 processing computers (defined as Nodes). These nodes can be supplemented if necessary to improve the overall computing performance of the HPC system. Therefore, we virtualize 5 virtual servers as in Fig. 1b, in which 1 machine plays the role of the manager (Head Node) and the other 4 machines act as the processing cluster computers (Nodes). All machines run the CentOS operating system version 7.8, as in Fig. 2. For the hardware configuration of the cluster computers, we set up the 5 machines with the same configuration, as in Fig. 3 (a cluster-bootstrap sketch is given after this list).
• The vHPC layer: The system includes two components: (i) the CentOS operating system v7.8 running in the background; (ii) a human-interactive interface with Docker to compute resources (operation images and folders with data containers), as in Fig. 4.
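The following sketch illustrates how such a five-node layer could be wired together as a Docker Swarm, with the head node (172.16.248.170, from Fig. 2) acting as the manager and the remaining four virtual servers joining as workers. It is a minimal sketch, not the authors' exact procedure: it assumes the docker-py SDK is installed and that the Docker daemons on the nodes are reachable over TCP (port 2375 is an assumed, unsecured setting used only for illustration).

```python
# Minimal sketch: bootstrap a Docker Swarm over the five virtual servers.
# Assumes docker-py is installed and each node's Docker daemon is reachable
# over TCP (the port and plain-TCP exposure are illustrative assumptions).
import docker

HEAD = "172.16.248.170"                       # Head Node (manager), from Fig. 2
NODES = ["172.16.248.171", "172.16.248.172",
         "172.16.248.173", "172.16.248.174"]  # processing Nodes (workers)

# 1. Initialize the swarm on the head node.
head = docker.DockerClient(base_url=f"tcp://{HEAD}:2375")
head.swarm.init(advertise_addr=HEAD)
head.swarm.reload()
worker_token = head.swarm.attrs["JoinTokens"]["Worker"]

# 2. Join each processing node as a worker.
for ip in NODES:
    node = docker.DockerClient(base_url=f"tcp://{ip}:2375")
    node.swarm.join(remote_addrs=[f"{HEAD}:2377"], join_token=worker_token)

# 3. Confirm cluster membership from the manager.
for n in head.nodes.list():
    print(n.attrs["Description"]["Hostname"], n.attrs["Spec"]["Role"])
```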
Fig. 2 Virtualized Servers with IP Addresses: 172.16.248.170, 172.16.248.171, 172.16.248.172, 172.16.248.173, 172.16.248.174
Fig. 3 The hardware configuration for the virtual server
For the vHPC performance calculation system [18], Docker images with storage containers are built for each specific problem, and the system provides information about each Docker image. Interaction with the outside world through the IP address is authenticated over the HTTPS protocol when the system runs internally; when working from outside, tasks connect to the system via VPN. All the procedures are illustrated in Fig. 5.
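As an illustration of the per-problem image workflow described above, the sketch below builds an image from a local build context and runs it on the cluster with an explicit CPU and memory budget, so that other workloads on the shared physical servers keep their share of resources. It is a hedged example: the build directory, the tag `ueh/hpl-bench`, and the resource limits are hypothetical, and the docker-py SDK is again assumed.

```python
# Hypothetical sketch: build a per-problem image and run it with resource limits.
import docker

client = docker.from_env()

# Build the image for one specific problem from a local Dockerfile
# (the path and tag are illustrative, not from the paper).
image, build_logs = client.images.build(path="./hpl-benchmark",
                                        tag="ueh/hpl-bench:latest")

# Run the container pinned to a subset of CPUs and under a memory cap, so
# other workloads on the same physical servers keep their resources.
container = client.containers.run(
    "ueh/hpl-bench:latest",
    detach=True,
    cpuset_cpus="0-11",       # 12 cores reserved for this job (assumed budget)
    mem_limit="16g",
    volumes={"/data/hpl": {"bind": "/work", "mode": "rw"}},
)
print(container.id)
```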
Fig. 4 The hardware configuration for the virtual server
Fig. 5 Computational performance of the vHPC system
Providing accounts to members is performed by the administrator on the same Docker usage interface, corresponding to the computing hardware requirements authorized for each Docker image.
4 Conclusions

Virtualization technology was developed a long time ago, has achieved many milestones, and is commonly used for server systems. In addition to flexibility in deployment, virtual machines are known for characteristics such as ease of management, high security, and efficient isolation between the scopes of use and control. Currently, in the field of high-performance computing, virtual machine deployment infrastructure plays an important role in services such as cloud computing and cloud storage. In this research, we have discussed and tested several virtualization technologies for servers integrated with large computational resources such as CPUs and GPUs, toward an HPC system with superior features, scalability, and higher reliability. On the other hand, it gives researchers the tools to implement algorithms for solving hard problems such as computer vision [19, 20], machine learning [21], and big data processing [22] in different environments, while meeting technical factors such as high availability, superior fault tolerance, and, especially, utilization of the existing resources.

Acknowledgements This work was supported by the University of Economics Ho Chi Minh City under project CS-2020-14.
References

1. Lee, H. (2014). Virtualization basics: Understanding techniques and fundamentals. School of Informatics and Computing, Indiana University, 815 E 10th St., Bloomington, IN 47408.
2. Khattar, R. K., Murphy, M. S., Tarella, G. J., & Nystrom, K. E. (1999). Introduction to Storage Area Network, SAN. IBM Corporation, International Technical Support Organization.
3. Bhanage, G., Seskar, I., Zhang, Y., Raychaudhuri, D., & Jain, S. (2011). Experimental evaluation of openvz from a testbed deployment perspective. In Development of Networks and Communities, volume 46 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (pp. 103–112). Berlin: Springer.
4. Jin, Y., Wen, Y., & Chen, Q. (2012). Energy efficiency and server virtualization in data centers: An empirical investigation. In 2012 Proceedings IEEE INFOCOM Workshops (pp. 133–138).
5. Performance Report Hyper-V (2010). White Paper: https://sp.ts.fujitsu.com/.
6. Kusnetzky, D. (2011). Virtualization: A Manager’s Guide. O’Reilly Media, Inc.
7. Habib, I. (2008, February). Virtualization with KVM. Linux Journal, 2008(166), Article No.: 8.
8. Chisnall, D. (2013). The Definitive Guide to the Xen Hypervisor (1st ed.). USA: Prentice Hall Press.
9. Technical Papers. VMware Infrastructure Architecture Overview. White Paper. https://www.vmware.com/pdf/vi_architecture_wp.pdf.
10. Yu, H. E., & Huang, W. (2015). Building a virtual hpc cluster with auto scaling by the docker. arXiv:1509.08231.
11. Rad, B. B., Bhatti, H. J., & Ahmadi, M. (2017). An introduction to docker and analysis of its performance. International Journal of Computer Science and Network Security (IJCSNS), 17(3), 228.
12. de Bayser, M., & Cerqueira, R. (2017). Integrating MPI with docker for HPC. In 2017 IEEE International Conference on Cloud Engineering (IC2E), Vancouver, BC, 2017 (pp. 259–265). https://doi.org/10.1109/IC2E.2017.40.
13. Abdullah, M., Iqbal, W., & Bukhari, F. (2019). Containers vs virtual machines for auto-scaling multi-tier applications under dynamically increasing workloads (pp. 153–167). https://doi.org/10.1007/978-981-13-6052-7_14.
14. Hung, N. Q., Phung, T. K., Hien, P., & Thanh, D. N. H. (2021). AI and blockchain: Potential and challenge for building a smart E-learning system in Vietnam. In IOP Conference Series: Materials Science and Engineering (In press).
15. Tran Thoai, N., et al. (2016). Research and design of a 50–100 TFlops high-performance computing system. University of Technology - Vietnam National University HCMC, Project in HCM City.
16. Thuy, N. T., et al. (2006). Research on high-performance computational systems and application to micro-material simulation. Project of the Ministry of Science and Technology (MOST), 2004–2005.
17. Tan, G., Yeo, G. K., Turner, S. J., & Teo, Y. M. (Eds.). (2013). AsiaSim 2013: 13th International Conference on Systems Simulation, Singapore, November 6–8, 2013. Proceedings (Vol. 402).
18. Petitet, A., Whaley, R. C., Dongarra, J., & Cleary, A. HPL - A portable implementation of the high-performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl
19. Thanh, D. N. H., & Dvoenko, S. D. (2019). A denoising of biomedical images. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XL-5/W6, 73–78.
20. Kumar, V., Mishra, B. K., Mazzara, M., Thanh, D. N. H., & Verma, A. (2020). Prediction of malignant and benign breast cancer: A data mining approach in healthcare applications. In: Borah, S., Emilia Balas, V., & Polkowski, Z. (Eds.), Advances in Data Science and Management. Lecture Notes on Data Engineering and Communications Technologies (Vol. 37). Singapore: Springer.
21. Erkan, U. (2020). A precise and stable machine learning algorithm: Eigenvalue classification (EigenClass). Neural Computing and Applications. https://doi.org/10.1007/s00521-020-05343-2 (In press).
22. Fowdur, T. P., Beeharry, Y., Hurbungs, V., Bassoo, V., & Ramnarain-Seetohul, V. (2018). Big data analytics with machine learning tools. In: Dey, N., Hassanien, A., Bhatt, C., Ashour, A., & Satapathy, S. (Eds.), Internet of Things and Big Data Analytics Toward Next-Generation Intelligence. Studies in Big Data (Vol. 30). Cham: Springer.
Implementing Multilevel Graphical Password Authentication Scheme in Combination with One Time Password T. Srinivasa Ravi Kiran, A. Srisaila, and A. Lakshmanarao
Abstract At present, everyone makes instantaneous digital transactions by using applications like PhonePe, Google Pay, etc. A password is necessary to prove user authentication. At the same time, user authentication is also verified by validating a one time password. A shoulder surfer may have the possibility of cracking the password at entry time. In this paper, we present a novel, clear, recall-based graphical password scheme where the user is required to figure out the triangle for some exact permutations of the password on the active display. The token holder is expected to pick the correct password permutations in the expected order, cyclically, for every login endeavor. For example, the user must select the first permutation of the password, and that permutation must form a triangle on the visual display. After the correct triangle is identified on the visual display, a one time password is sent to the user. The user is required to enter the one time password in the text box given on the visual display. If the one time password is correct, then the login attempt is successful; otherwise, the login attempt fails. Likewise, the user is instructed to identify triangles with the other permutations of the password for the remaining login attempts. Keywords Password · Authentication · Security · Shoulder surfing · Visual display · One time password · Triangle
T. Srinivasa Ravi Kiran (B) Department of Computer Science, P.B.Siddhartha College of Arts & Science, 520010 Vijayawada, Andhra Pradesh, India e-mail: [email protected] A. Srisaila Department of Information Technology, V.R Siddhartha Engineering College, 520007 Vijayawada, Andhra Pradesh, India A. Lakshmanarao Department of Information Technology, Aditya Engineering College, Surampalem, Kakinada, Andhra Pradesh, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_2
1 Introduction
Even though the technique of using textual passwords is insecure, the majority of digital systems still use textual passwords to implement their security. To overcome the vulnerabilities of textual-based passwords, graphical passwords provide a promising solution for implementing better security [1]. A graphical password is an authentication method that facilitates the user to select from images or symbols in a precise order on the existing graphical user interface (GUI). People use passwords commonly, many times a day, to facilitate online banking transactions, to access social networks and to verify their emails [2]. The most popular way of identifying user authenticity is the use of a one time password [3]. One time passwords are intended to be a secret between two parties and can be used once within a limited time lapse. This prevents someone who looks over your shoulder from reusing the password later and also protects against network sniffing and spyware attacks [4]. The password strength of a graphical password scheme is greater than that of text-based schemes, and as a result, it provides a superior level of security. Owing to this benefit, there is an increasing interest in graphical passwords. In graphical password schemes, images or symbols are used as an alternative to alphanumerical passwords [5]. The graphical password schemes that are offered protect the user and/or application privileges from hacking attempts [6]. According to Wiedenbeck et al. [7], memorable places depend on the type of image and the defined sequence of click locations. Chopra et al. [8] present potential outcomes in terms of security, ease of use and memorability.
2 Related Works
To authenticate user credentials, reference [9] proposed a click-based graphical password system. The user is instructed to click a well-ordered sequence of five pass points on the images of the presented interface. The authentication of the user fails if he is not able to choose the well-ordered sequence of pass points. As per the instructions of Jermyn et al. [1], the client is required to "Draw A Secret" (DAS) on the grid of the accessible interface. The password is the clear-cut pattern presented in a grid. This scheme is keystroke-independent and allows the user to draw the pattern without difficulty. Sobrado and Birget [10] talk about the "Triangle Scheme", where a number of pass-objects are displayed on the interface along with "decoy" objects. The client chooses the pass objects during the initial sign-up stage. At the authentication phase, the client must find out the pass objects, and it is mandatory to click inside the convex hull framed by all the pass objects. Since the password space of the hull is vast, the chance of compromising the password is very low. The S3PAS scheme was proposed by Zhao et al. [5]. A 10 × 10 fixed grid containing 94 printable characters consisting of alphabets, digits and special symbols is shown to the user. The password is a predetermined string of four characters such that any three-character combination can form a triangle on the accessible
grid. For example, if the password selected is "6Tg:" then the combinations chosen are "6Tg", "Tg:", "g:6" and ":6T". The combinations chosen successfully form a triangular pattern on the existing interface independently. The user is required to identify the triangle with the chosen password combinations in the specified order for authentication. In line with Kiran et al. [11], the users are instructed to recognize quadruplets formed from the password blends on the existing interface, derived from the password chosen at the time of registration. Kiran et al. [11] presented a graphical password authentication scheme resistant to peeping attack which begins with recognizing quadruplets formed from the password combination chosen by the client, starting with the first character and rotating one character toward the right so that the last character of the password combination comes into view as the first character of the password combination. For instance, if the password chosen at the time of registration is "Sa3T:" then the quadruplets formed are "Sa3TS", "a3T:a", "3T:S3", "T:SaT" and ":Sa3:". It is compulsory for the user to select the arrangement of the password blends in the expected fashion, rotated for every login attempt. Ravi Kiran et al. [12] presented a new, interactive, recall-based scheme where the user starts with identifying the required transformation applied to every individual character of the password amalgamation. Prasad et al. [3] illustrate an authentication schema resistant to peeping attack that starts with identifying a triangle formed by clicking on buttons of the interface that have the colors red, green, blue and red of the grid, respectively. Rao et al. [13] proposed the PPC schema, in which the user is instructed to symbolize a rectangle on the accessible interface. Any character existing on the edge of the rectangle may be the pass character. Jusoh et al. [14] present a comparative study of recognition-based authentication algorithms. The comparisons are based on very important characteristics like usability, weaknesses and security attacks. For better usability, Stobert et al. [15] suggest that many click-points could be used on smartphone displays since it is very difficult to display larger image sizes. Hemavathy et al. [16] proposed an innovative graphical password authentication system to resist shoulder surfing. As a part of authentication, the users are instructed to choose the appropriate horizontal and vertical lines on the pass matrix to find out the pass objects. Khadke et al. [17] conveyed that the security of the graphical password can be imposed at multiple levels. Katsini et al. [18] discussed implications to improve recognition-based graphical passwords by adopting personalization techniques derived from individual cognitive characteristics.
3 Projected Scheme
In the existing interface, we make use of a 10 × 10 table formed with the 94 printable characters along with spaces, as shown in Fig. 1. Password amalgamations are verified with mouse taps on individual cells of the grid.
The precise plan begins with identifying the triangle shape by tapping on the cells on display. The recognition of the required triangle for that login attempt generates a one time password to the hand-held device. If the user is able to verify the one time password, then the login attempt is successful. Suppose a triangle does not form; then that blend can be overlooked. In previous studies, verification is at a single level only, that is, the user is expected to identify the required triangle with password combinations. However, in the proposed work, novelty is achieved in that user authentication is carried out at multiple levels: the user is expected to validate a one time password after the required triangle is identified, for better security.
Fig. 1 Projected display
Permutations of the secret phrase are input by four taps on the displayed screen. For example, if the secret phrase selected at sign-up time is "g5:G", then the plausible triangles are formed by tapping on the cells "g5:g", "5:G5", ":Gg:" and "Gg5G", that is, by pivoting the original secret key character blend one position from left to right every time, with the first character repeated as the last character (see the code sketch after the flowchart below). The users are instructed to enter the combinations of the password blends cyclically in the expected order for every login attempt.
Algorithm
1. Start.
2. Register a password of length four.
3. Select four distinct four-character amalgamations of the registered password, pivoting from left to right.
4. The user makes an entry attempt by selecting the password amalgamation in the predicted way on the accessible display.
5. If the client opts for the correct password amalgamation in the expected manner and it forms a triangle, then a one time password is sent to the hand-held device and that one time password is verified on the interface.
6. If the one time password entered on the interface is correct, the login attempt is successful; otherwise, the login attempt fails, so evaluate the next amalgamation.
7. If the client does not choose a secret token combination in the expected style, or the combination does not form a triangle shape, the entry process is obstructed; disregard that amalgamation of secret tokens and opt for the next amalgamation, in such a way that all the amalgamations of secret word tokens must form a triangle.
8. Stop.
Flowchart: Start → register a password → select four distinct password blends from the registered password → set n := 1 → while n ≤ 4: the user makes the nth login attempt; if the nth password blend forms a triangle, a one time password (OTP) is sent for verification, and if the OTP is verified the login attempt is successful, otherwise the login attempt fails; set n := n + 1 and repeat → Stop.
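To make the blend-generation rule of step 3 concrete, the short Python sketch below (our own illustration, not code from the paper; the function name password_blends is hypothetical) rotates the registered password one position per blend and repeats the new first character as the last character, reproducing the example blends of "g5:G".

```python
def password_blends(password):
    """Generate the rotated blends of a registered password.

    Each blend starts one position further into the password (pivoting from
    left to right) and ends with a repeat of its own first character, since
    the scheme requires the first and last tokens of a blend to match.
    """
    blends = []
    for i in range(len(password)):
        rotated = password[i:] + password[:i]     # pivot one position per blend
        blends.append(rotated[:-1] + rotated[0])  # last character = first character
    return blends


print(password_blends("g5:G"))  # ['g5:g', '5:G5', ':Gg:', 'Gg5G']
```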
Fig. 2 First phase of login is verified by clicking on the cells “g5:g” and also by verifying one time password “0783”
Step 1: For the first instance, the login entry is valid for the amalgamation "g5:g" and for one time password "0783" (Fig. 2).
Step 2: No one time password is generated and the entry is foiled when the sequence "g5:g" is entered again at the second attempt, since the client has chosen the wrong blend (Fig. 3).
Step 3: For the second instance, the login entry is valid for the password blend "5:G5" and for one time password "1508" (Fig. 4).
Step 4: For the third instance, the login entry is valid for the password blend ":Gg:" and for one time password "4545" (Fig. 5).
Step 5: For the fourth instance, the login entry is valid for the password blend "Gg5G" and for one time password "9412" (Fig. 6).
Step 6: If the first character is different from the terminal character, the specific password amalgamation is ignored (Fig. 7). For example, if the user clicks on the buttons containing the letters "g5:G", respectively, the initial token "g" is not the same as the terminal token "G", and the triangle pattern cannot be formed. In such a case the amalgamation is disregarded, and the user must select a correct amalgamation of the password blend in such a way that a triangle is formed.
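One way the interface could check the triangle condition used in the steps above is to test whether the three distinct grid cells of a blend are non-collinear (the first and terminal tokens map to the same cell). The sketch below is only an illustrative assumption about the implementation, not the authors' code; cell_of is a hypothetical mapping from each printable character to its (row, column) position in the 10 × 10 grid.

```python
def forms_triangle(blend, cell_of):
    """Return True if the cells tapped for a blend form a triangle on the grid.

    `blend` is a four-character password blend whose first and last characters
    must be identical; `cell_of` maps a character to its (row, column) cell.
    """
    if blend[0] != blend[-1]:
        return False  # the amalgamation is disregarded (see Fig. 7)
    a, b, c = (cell_of[ch] for ch in blend[:3])
    # Twice the signed area of triangle a-b-c; zero means the three cells are
    # collinear, so they do not form a triangle and the blend is overlooked.
    doubled_area = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    return doubled_area != 0
```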
Fig. 3 Login is denied since the user selects the wrong password blend for login entry
Fig. 4 During the second phase, the entry is verified by clicking on the cells “5:G5” and also by verifying one time password “1508”
Fig. 5 At third login instance, the login entry is passed for password combination ":Gg:" and for one time password 4545
Fig. 6 At the fourth stage, login is successful by clicking on the cells "Gg5G" and by verifying one time password "9412"
Fig. 7 Triangle not formed by selecting buttons “g5:G”. Amalgamation is ignored
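The one time passwords shown in Figs. 2, 4, 5 and 6 are four-digit codes delivered to the user's hand-held device after the correct triangle is identified. A minimal sketch of how such a short-lived code could be generated and checked is given below; it is our own illustration under the assumption of a 60-second validity window, not the authors' implementation.

```python
import secrets
import time

OTP_VALIDITY_SECONDS = 60  # assumed validity window, not specified in the paper


def generate_otp():
    """Return a random four-digit one time password and its issue time."""
    code = f"{secrets.randbelow(10_000):04d}"   # e.g. "0783"
    return code, time.time()


def verify_otp(entered, code, issued_at):
    """Accept the entry only if it matches and is still within the window."""
    return entered == code and (time.time() - issued_at) <= OTP_VALIDITY_SECONDS
```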
4 Results and Usability Study
The results were encouraging, and the clients' expected triangles were formed by clicking on the cells in the exact manner. On average, it takes 37 ms to identify the required triangle and to verify the one time password, as shown in the following login tables. Peeping attacks were not possible with the proposed scheme since the client's taps on the visible tokens are in random order (Tables 1, 2, 3, 4, 5, 6, 7 and 8).

Table 1 Login table for first pass (times in milliseconds)
S. no. | Password | Pass 1 | Time to verify triangle | One time password | Time to enter one time password | Login time
1 | g5:G | 5:G5 | 23 | 1508 | 11 | 34
2 | 3U75 | U75U | 26 | 1134 | 13 | 35
3 | tg + 8 | G+8g | 24 | 6534 | 12 | 34
4 | zQ2c | Q2cQ | 25 | 8620 | 12 | 37
5 | 8I_i | I_iI | 23 | 0634 | 14 | 38
Average login time in milliseconds for the first pass using an i5 processor: 36
Table 2 Login table for second pass (times in milliseconds)
S. no. | Password | Pass 2 | Time to verify triangle | One time password | Time to enter one time password | Login time
1 | g5:G | :Gg: | 24 | 4545 | 12 | 36
2 | 3U75 | 7537 | 25 | 0536 | 12 | 37
3 | tg + 8 | +8t+ | 25 | 9735 | 13 | 38
4 | zQ2c | 2cz2 | 26 | 9064 | 14 | 40
5 | 8I_i | _i8_ | 24 | 7896 | 13 | 37
Average login time in milliseconds for the second pass using an i5 processor: 38
Table 3 Login table for third pass (times in milliseconds)
S. no. | Password | Pass 3 | Time to verify triangle | One time password | Time to enter one time password | Login time
1 | g5:G | Gg5G | 23 | 3456 | 11 | 34
2 | 3U75 | 53U5 | 25 | 1813 | 13 | 38
3 | tg + 8 | 8tg8 | 25 | 2845 | 12 | 37
4 | zQ2c | czQc | 24 | 1745 | 12 | 36
5 | 8I_i | I8Ii | 23 | 0073 | 12 | 35
Average login time in milliseconds for the third pass using an i5 processor: 36
Table 4 Login table for fourth pass (times in milliseconds)
S. no. | Password | Pass 4 | Time to verify triangle | One time password | Time to enter one time password | Login time
1 | g5:G | 5:G5 | 24 | 9412 | 12 | 36
2 | 3U75 | 53U5 | 23 | 1234 | 11 | 34
3 | tg + 8 | 8tg8 | 23 | 9667 | 13 | 36
4 | zQ2c | cZQc | 25 | 7845 | 12 | 37
5 | 8I_i | I8Ii | 24 | 0556 | 13 | 37
Average login time in milliseconds for the fourth pass using an i5 processor: 36
Table 5 Average login time for four passes (milliseconds, i5 processor)
S. no. | Pass | Average login time
1 | Pass 1 | 36
2 | Pass 2 | 36
3 | Pass 3 | 38
4 | Pass 4 | 36
Average login time for all four passes: 37
Table 6 A 12-point efficiency scale (user features; Y: Yes, N: No)
S. no. | Recognition-based schema | Satisfaction | Mouse usage | Meaningful assignable image | Memorability | Simple steps | Nice interface | Training simply | Pleasant picture | Password permutations | Selecting of triangle pattern | Use of one time password | Multilevel security
1 | Blonder | Y | Y | Y | N | Y | Y | Y | Y | N | N | N | N
2 | Jermyn | Y | Y | N | Y | Y | Y | Y | Y | N | N | N | N
3 | Stallings | Y | Y | Y | Y | Y | Y | Y | Y | N | N | N | N
4 | Ziran Zheng | Y | Y | Y | Y | Y | Y | Y | Y | N | N | N | N
5 | S3PAS | Y | Y | Y | Y | Y | Y | Y | Y | N | Y | N | N
6 | A Novel GP Scheme | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | N | N
7 | A Robust | Y | Y | Y | Y | Y | Y | Y | Y | Y | N | N | N
8 | Multilevel | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y
Table 7 A 12-point efficiency scale
S. no. | Recognition schema | Scale of efficiency
1 | Blonder | 7
2 | Jermyn | 7
3 | Stallings | 8
4 | Ziran Zheng | 8
5 | S3PAS | 9
6 | A Novel GP | 9
7 | A Robust | 9
8 | Multilevel | 12
5 Comparative Analysis
The visual interface is a 10 × 10 grid where the user is educated to form a triangle with password blends derived from the password chosen at the time of registration. At the same time, the user is instructed to enter a one time password on the interface once the password blend is validated. This type of multilevel evaluation is the first attempt in a password authentication system. We have analyzed eight schemas of graphical password authentication. Among all these schemas, only four consider the approach of taking the password blends from the original password into account. The schemas of the other authors also did not report the robustness of the original password and password blends. In the proposed schema, the robustness of the password and of the password blends has been defined and fixed at 99.61% and 97.65%, respectively. The password and password blends reduce the probability of cracking by shoulder surfers, since every password blend needs to be validated with the one time password (Figs. 8 and 9).
6 Conclusion The existing scheme focused on the functioning of the user interface based on static grids. It is possible to improve security by designing dynamic graphical grids. The address space for passwords can be improved by increasing the grid size by rows as well as columns.
Table 8 A comparative analysis of innovative graphical password schemas (Y: Yes, N: No, NS: Not Specified)
Row | Proposed schema | Is actual password protected? | Is login phase verified with password permutations? | Password space | Do the password permutations form a triangle? | Is the password security verified with one time password? | Drawbacks | Security attacks pervious to | Impervious to security attacks | Average login time in milliseconds using i5 processor | Robustness of the password scheme | Robustness of each password permutation
1 | Blonder | N | N | NS | N | N | Password space is small; password strength is not specified | Guessing, dictionary, brute force, shoulder surfing, spyware | NS | NS | NS | NS
2 | Jermyn | N | N | NS | N | N | Password space is small; password strength is not specified | Dictionary, guessing, shoulder surfing | NS | NS | NS | NS
3 | Stallings | N | N | 4 × 4 grid | N | N | Password strength is not specified | Guessing, dictionary, brute force, spyware | NS | NS | NS | NS
4 | Ziran Zheng | N | N | 5 × 5 grid | N | N | Password robustness is not specified | Brute force, dictionary, spyware | NS | NS | NS | NS
5 | S3PAS | Y | Y | 10 × 10 grid | Y | N | Lengthier login processes | NS | Shoulder surfing, hidden camera, spyware | NS | NS | NS
6 | A novel graphical password scheme | Y | Y | 14 × 14 grid | N | N | Password strength is not specified | NS | Shoulder surfing, hidden camera, random click attacks | 38.46 | 99.96% | 99.23%
7 | A robust | Y | Y | 10 × 10 grid | Y | N | Password strength is not specified | NS | Shoulder surfing, hidden camera, spyware | 44 | 99.96% | 99.23%
8 | Multilevel | Y | Y | 10 × 10 grid | Y | Y | NS | NS | Shoulder surfing, hidden camera, spyware, guessing, dictionary, brute force | 37 | 99.61% | 97.65%
Fig. 8 Histogram of multilevel schema
Fig. 9 Linegraph of multilevel schema
7 Future Scope
It is important that the presented authentication scheme has a natural defense mechanism against network security attacks. The authentication system can be executed in a two-tier client–server model that only needs communication of grid coordinates between the client and the server; hence it requires less bandwidth and reduces the load on the authenticating server. It follows that increasing the grid dimensions will lead to better security of such graphical authentication systems.
References
1. Jermyn, I., et al. (1999). The design and analysis of graphical passwords. In Proceedings of the 8th USENIX Security Symposium, Washington, D.C., USA, August 23–26, 1999.
2. Kiran, T. S. R., et al. (2019, March). A robust scheme for impervious authentication. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 8(4S2). ISSN: 2278-3075.
3. Ravi Kiran, T. S., et al. (2012, April). Combining captcha and graphical passwords for user authentication. International Journal of Research in IT & Management, 2(4). ISSN: 2231-4334, http://www.mairec.org.
4. Kiran, T. S. R., et al. (2013, September). A symbol based graphical schema resistant to peeping attack. IJCSI International Journal of Computer Science Issues, 10(5), No. 1. ISSN (Print): 1694-0814, ISSN (Online): 1694-0784, www.IJCSI.org.
5. Zhao, H., et al. (2007). S3PAS: A scalable shoulder-surfing resistant textual-graphical password authentication scheme. In 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW '07), Vol. 2.
6. Stallings, W., & Brown, L. (2008). Computer security: Principles and practices. Pearson Education.
7. Wiedenbeck, S., et al. (2005). PassPoints: Design and longitudinal evaluation of a graphical password system. International Journal of Human-Computer Studies, 63, 102–127.
8. Chopra, A., et al. (2020, April). A bankable pictorial password authentication approach. In International Conference on Innovative Computing and Communication (ICICC 2020).
9. Blonder, G. (1996). Graphical password. Murray Hill, NJ: Lucent Technologies, Inc., United States Patent 5559961.
10. Sobrado, L., & Birget, J.-C. (2002). Graphical passwords. The Rutgers Scholar, An Electronic Bulletin for Undergraduate Research, 4.
11. Kiran, T. S. R., et al. (2012). A novel graphical password scheme resistant to peeping attack. International Journal of Computer Science and Information Technologies (IJCSIT), 3(5), 5051–5054. ISSN: 0975-9646, http://www.ijcsit.org/.
12. Ravi Kiran, T. S., et al. (2014). A shoulder surfing graphical password schema based on transformations. International Journal of Applied Engineering Research, 9(22), 11977–11994. ISSN: 0973-4562.
13. Rao, M. K., et al. (2012, August). Novel shoulder-surfing resistant authentication schemes using text-graphical passwords. International Journal of Information & Network Security (IJINS), 1(3), 163–170. ISSN: 2089-3299.
14. Jusoh, Y., et al. (2013, May). A review on the graphical user authentication algorithm: Recognition-based and recall-based. International Journal of Information Processing and Management.
15. Stobert, E., et al. (2010). Exploring usability effects of increasing security in click-based graphical passwords. In Proceedings of ACSAC 2010. https://doi.org/10.1145/1920261.1920273.
16. Hemavathy, M., & Nirenjena, S. (2017). Multilevel graphical authentication for secure banking. International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), 2(2). ISSN: 2456-3307.
17. Khadke, A., et al. (2020, March). Three level password authentication. International Journal of Emerging Technologies and Innovative Research, 7(3), 68–72. ISSN: 2349-5162, www.jetir.org.
18. Katsini, C., et al. (2019). A human-cognitive perspective of users' password choices in recognition-based graphical authentication. International Journal of Human-Computer Interaction, 35(19).
State of Geographic Information Science (GIS), Spatial Analysis (SA) and Remote Sensing (RS) in India: A Machine Learning Perspective Shruti Sachdeva and Bijendra Kumar
Abstract The past decade has witnessed India making significant progress in the spheres of computer science and space engineering. Coupled with the increased frequency of successful satellite launches, a huge amount of geospatial data has now become available in the government repositories, empowering the concerned authorities in making quicker decisions and better policies. Resource monitoring and disaster prediction and management are two of the most crucial fields employing these advancements, and they indulge in a prolific consumption of the earth observation data being captured by the Indian satellites. The paper reviews the work accomplished in the aforementioned fields using the state-of-the-art sciences of geographical information science (GIS), remote sensing (RS), spatial analysis (SA), and machine learning (ML) in the Indian landscape in the past decade. The emphasis is on finding the impact of machine learning techniques on these earth data studies, in which interesting patterns are sought in resource potential mappings and in hazard susceptibility and vulnerability studies. It was observed in the study that high prediction accuracies were achieved with the employment of the latest data analysis techniques for geospatial mapping for predicting the future occurrence of a specific object/event/phenomenon on the earth's surface. The field holds enormous potential for environmental assessment and preservation and sustainable city planning and development. Keywords Geographic information science · Remote sensing · Geographic information system · Machine learning · Geospatial analysis · Earth observation data
S. Sachdeva (B) · B. Kumar Department of Computer Science and Engineering, Netaji Subhas University of Technology, Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_3
1 Introduction India with its massive area coverage and astronomical population indisputably offers a challenge on an immense scale in the areas of governance, administration, natural resource monitoring, disaster management, adversity handling, law implementation, etc. The magnitude of these seemingly innocuous tasks in the sphere of the dayto-day working in a regular country can be emphasized from the fact that India is ranked seventh and second in the list of countries ranked on the basis of their area coverage and population, respectively. The sheer enormity of the country’s size and population impedes the country’s economic growth. Subsequently, it does not come as a surprise that despite the government’s best efforts, a lot of it leaves to be desired in terms of the country’s development and progress. Despite the impediments effectuated by the magnitude, acreage, and dimensions associated with the country, it is the scale itself that can metamorphose into an opportunity in disguise. In the digital age, where data is the new currency, India with its wide diversity and versatility can provide a perfect canvas for building up an unparalleled empire of data. While the financial benefit is an incentive enough for the employment of such a digitization exercise, it has also become the need of the hour to be able to oversee the country’s assets and liabilities, which otherwise is a very tedious and chaotic process. The widening economic gap between the haves and have nots, expanding population, increasing poverty, climate change, agrarian crisis, environmental degradation, growing unemployment, etc. are other hindrances in the way of India transitioning from a developing country to a developed country. The strained relations with India’s neighboring countries, fluctuating western trade policies, and China’s market dominance have made it imperative for the concerned agencies to take cognizance of the situation and devise and implement the necessary strategies for tackling the maladies ailing the system.
1.1 Status Quo in Indian Resources A large portion of the country’s assets is tangible in nature, thus making them locatable on the earth’s surface. Water (both surface and groundwater), land, and air, being the primary resources necessary for human sustenance, are the most obvious candidates for data gathering and information generation. The deteriorating air and water quality, water scarcity, and increased migration to cities causing constrained spaces, and urban expansion leading to unsustainable land use and land cover bring to the fore the dire need for an active monitoring of these basic amenities and prevent further damage, as well as frame obligatory policies for their future preservation. Other natural resources like metals (iron, copper, aluminum, gold, silver, etc.), coal, petroleum, natural gas, oils, and minerals are also present in various degrees of volume and extractability in the Indian subcontinent [1]. The feasibility studies for discerning the commercial aspects of extraction of such resources like
the investments needed, return on investments expected, etc. are inevasible for a profitable venture. The digitization of the existing remunerative inventories along with that of the topology and landscape factors influencing a region’s potential to yield these resources could allow for mapping and identifying other complimentary high-yielding (previously unknown) locations and mines.
1.2 Natural Disasters in India Among all the countries in the world, India is considered to be one of the most frequent witnesses of a variety of natural hazards as the region has seen the maximum number of disaster-induced casualties and fatalities. The main reason attributed is the dynamic ecosystem in prevalence. The extensive financial overhaul and rural empowerment leading to population remodeling in terms of both their numbers and diversification is a conspicuous manifestation of such activities. Climate change and rapid landuse transformations are other sinister by-products of these exertions. Among all the fatal disasters induced by natural hazards, the widest scale of apathy and the havoc actuated has been attributed to landslides, floods, forest fires, and avalanches, etc. [2]. In a country already struggling to progress, it comes as no surprise that the occurrence of such disasters could severely hamper and deter the gained headway. The prediction of a region’s vulnerability and potential to witness the occurrence of hazards can intercede to stymie more than the minimal damage. Such an analysis could be based on the past history of similar hazards and the factors that led up to the event.
1.3 The Role of GIS, RS, and SA Via ML Vis-à-Vis India The common thread among all the aforementioned proposed and previously implemented endeavors is the use of data about the country’s tangible resources and the existing inventories (resource inventories like wells, rivers, coal mines, etc. to hazard inventories like that of the set of flood/landslide/avalanche locations) to establish the relations between the preexisting conditions on the ground and set of locations that witnessed an event. The event here could vary from “presence of coal, metal or oil” to “occurrence of flood inundation or land subsidence”. Such relations are spatial in nature since they relate attributes of a location in space to the event. The study of these relations for their identification and employment in mapping studies gives us spatial analysis [3]. If the space being analyzed is on the earth’s surface, which is the case here, they form the quintessential geospatial relations and subsequently beget the geospatial analysis. Conventionally, the data for any such explorations required doing rigorous field works and extensive surveys that were hugely time and effort-consuming. Also, in a lot of cases, the area under study is not the most accessible place in the country,
such as those for oil and natural gas exploration. The resource reservoirs are usually found at great depths under the earth’s surface, or in the case of natural hazard studies, more often than not, the region tends to get disconnected owing to the various communication and connectivity failures at the time of adversity. In the case of man-made disasters like gas, chemical, nuclear radiation, leaks, etc., the areas are deliberately cordoned off, thus making the data collections via ground surveys an even more imposing task. The advancements achieved in the fields of science, technology, geographical analysis, data processing, and analytics have come together in a cohesive amalgamation bringing about an exemplified simplicity in its wake so much so that the surveying activities have almost been rendered trivial. Especially with the high accessibility to cheap smartphones and a good internet connection, availability of global positioning systems on the devices, etc., any user (expert or non-expert) with access to these resources could become a potential data provider or consumer. Remote sensing is the process of accessing information about a region without having to be physically present at the said location, i.e., it allows the user to remotely gain perspective into the possibly inaccessible landscape. This information is usually obtained by means of satellite images. India has been progressing by leaps and bounds in this area, with the esteemed and ambitious Indian Space Research Organization (ISRO) launching multiple satellites serving a variety of purposes and capturing the spatial characteristics of the Indian landscape. A geographic information system is a system for capturing, storing, checking, integrating, manipulating, analyzing, and displaying data that are spatially referenced to the Earth [4]. Geographical information science is the use of these systems on remotely sensed earth observation data (via satellite images or ground surveys) to acquire/import, store/export, preprocess (handling and management), analyze, and report results/patterns uncovered on the specific area’s characteristics. The science behind image-based pattern identification and trend analysis has witnessed a transition from the conventional statistical approaches such as frequency ratio, weight of evidence, etc. methods to the upcoming machine learning models. The early statisticsbased techniques relied merely on qualitative approaches that were derived from the subjective judgment of an expert or a unanimous judgment from a group of experts and evaluating weights of criteria, thus making them cost and time-intensive and error-prone due to human error arising because of the roles of these experts. These shortcomings were eventually overcome by the advent of the quantitative approaches relying on mathematical methodologies based on arithmetic articulations of the relation between conditioning elements and the occurrence of the event/phenomenon. Machine learning is one such quantitative domain that is capable of assisting in the domain of spatial prediction of events. These techniques tend to simulate human intelligence and decision-making by accepting past data (inventory) and the factors/circumstances present there causing the phenomenon. It then tends to observe the relationship between these factors and the events and builds a model accordingly, which is a collection of rules which can be then reiterated over a new data set, allowing us to make a new set of predictions. 
Hence it apes the human thought process at a much faster pace and greater accuracy. Novel ways of optimization and ensemble models have pushed the envelope further by achieving higher levels of accuracy
and precision levels between the digital representations (cartographic along with non-spatial attributes) and the ground truths.
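As a concrete illustration of the workflow just described, the sketch below trains a classifier on an inventory of event and non-event locations described by their conditioning-factor values and then predicts susceptibility for new locations. It is a minimal example with synthetic arrays and an arbitrarily chosen random forest; the variable names and data are assumptions, not any specific study's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Rows: sampled locations; columns: conditioning factors (e.g. slope, elevation,
# distance to rivers). Labels: 1 = event recorded in the inventory, 0 = absence.
rng = np.random.default_rng(0)
factors = rng.random((500, 8))
labels = (factors[:, 0] + factors[:, 3] > 1.0).astype(int)  # synthetic relation

X_train, X_test, y_train, y_test = train_test_split(
    factors, labels, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Susceptibility of unseen locations: probability of the positive (event) class,
# which can then be mapped back onto the study area cell by cell.
susceptibility = model.predict_proba(X_test)[:, 1]
```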
2 GIS Mappings: An Insight into Indian Studies A large number of geospatial studies within the Indian boundaries have been undertaken in the past decade for various tasks, such as hazard susceptibility and vulnerability assessment studies on flood/landslide/forestfire/earthquake/avalanche/tsunami susceptibility mappings, natural resource modeling studies on groundwater/iron/coal/oil/natural gas potential assessments, land use and land cover predictions, organic soil carbonate detections, nitrate and fluoride water contamination mappings, crop suitability, and many more. Table 1 summarizes a few of the Indian studies that undertook geospatial analysis in some parts of the country by employing various types of processing of satellite data using machine learning for predictions. Here, IL represents the inventory locations, indicating the points/polylines/polygons denoted by their coordinates (latitude, longitude) that witnessed the occurrence of an event such as floods, landslides, etc. or the presence of resource such as groundwater, iron, silver, etc. [5, 14, 15]. The CF denotes the conditioning factors that could have impacted the occurrence of an event or the presence of the object. For instance, in the case of flood susceptibility mappings, the literature indicates that the proximity of a location from a drainage system/river played a crucial role in influencing the location’s probability of being inundated in the past occurrence of a flood and hence could play a significant role in the future as well [6]. It could also impact the density of inundation in the future as well. The data acquisition, processing, and analysis are carried out in these studies via RS and GIS tools. On the other hand, the ML models are used in order to identify latent relations and interactions in the CFs and ILs so as to predict the potential/susceptibility/vulnerability of similar new locations on the basis of their CF values alone after the model training. The accuracy of the trained models as measured on the metrics of accuracy and area under the curve (AUC) represents how well the model performs on unknown data. The precision and accuracy of the final maps (against the ground truth values) produced by these models are proportional to the model’s performance. In the past, such studies have been part of scientific explorations undertaken by different government departments and ministries for serving different purposes, and most of them relied on ground surveys and required collecting field information manually. As a result, these analysis exercises, more often than not, became error-prone due to the inclusion of human factors, and hence, their results were not always reliable. However, with advanced Indian satellites, capturing information with improved spatial and temporal resolutions, earth surface analysis and ocean modeling have been revolutionized. The Indian research community has taken cognizance of these advancements and have time and again taken up similar projects focusing on
Table 1 Geospatial assessments in India
# | Type | Study region | State | #IL | #CF | CF | ML models | Result
[5] | Flood susceptibility | Koiya river basin | West Bengal | 264 | 8 | Land use land cover, soil, rainfall, normalized difference vegetation index, distance to rivers, elevation, topographic wetness index, stream power index | Evidential belief function-logistic regression | AUC = 0.84
[6] | Flood susceptibility | Sundarban biosphere reserve | West Bengal | 228 | 10 | Slope, drainage density, surface curvature, elevation, embankment density, flood inundation density, topographic wetness index, stream power index, distance to rivers, normalized difference vegetation index | Support vector machine | AUC = 0.83
[7] | Flood susceptibility | Chamoli district | Uttarakhand | 143 | 11 | Elevation, slope, aspect, plan curvature, topographic wetness index, stream power index, soil texture, land cover, normalized difference vegetation index, rainfall, and distance from rivers | Particle swarm optimization-support vector machine | Accuracy = 96.5%
[8] | Landslide susceptibility | Northern part of Himalaya | – | 930 | 15 | Slope, aspect, elevation, curvature, plan curvature, profile curvature, soil types, land cover, rainfall, distance to lineaments, distance to roads, distance to rivers, lineament density, road density, river density | Multi-boost on neural networks | AUC = 0.886
[9] | Landslide susceptibility | Northern part of Himalaya | – | 930 | 15 | Slope, road density, curvature, land use, distance to road, plan curvature, lineament density, distance to lineaments, rainfall, distance to river, profile curvature, elevation, aspect, river density, soil type | Rotation forest-based radial basis function neural network | AUC = 0.891
[10] | Landslide susceptibility | Pauri Garhwal area | Uttarakhand | 1295 | 16 | Slope angle, elevation, slope aspect, profile curvature, land cover, curvature, lithology, plan curvature, soil, distance to lineaments, lineament density, distance to roads, road density, distance to river, river density and rainfall | Random forest | AUC = 0.985
[11] | Landslide susceptibility | Region of Himalaya | Uttarakhand | 430 | 11 | Slope angle, slope aspect, elevation, curvature, lithology, soil, land cover, distance to roads, distance to rivers, distance to lineaments, and rainfall | Sequential minimal optimization-based support vector machines | AUC = 0.891
[12] | Landslide susceptibility | North-Eastern regions of India | Assam, Nagaland | 436 | 16 | Elevation, slope, aspect, general curvature, plan curvature, profile curvature, surface roughness, topographic wetness index, stream power index, slope length, normalized difference vegetation index, land use land cover, distance from roads, rivers, faults and railways | Logistic regression-gradient boosted decision trees-voting feature interval | AUC = 0.98
[13] | Forest fire susceptibility | Nanda Devi biosphere reserve | Uttarakhand | 702 | 18 | Elevation, slope, aspect, plan curvature, topographic position index, topographic water index, normalized difference vegetation index, soil texture, temperature, rainfall, aridity index, potential evapotranspiration, relative humidity, wind speed, land cover and distance from roads, rivers and habitations | Evolutionary optimized gradient boosted decision trees | AUC = 0.955
[14] | Gully erosion susceptibility | Pathro river basin | Jharkhand | 174 | 12 | Slope gradient, altitude, plan curvature, slope aspect, land use, slope length, topographical wetness index, drainage density, soil type, distance from the river, distance from the lineament, and distance from the road | Random forest | AUC = 0.962
[15] | Groundwater potential | Vadodara district | Gujarat | 34 | 10 | Slope, aspect, plan curvature, topographic wetness index, rainfall, river density, lithology, land use, and soil | Rotation forest-based decision stump | AUC = 0.988
regions (state, district, block, taluk, etc.), phenomena (land-use modification, gully erosion, air quality monitoring, etc.), hazards (floods, landslides, etc.), and resources (groundwater, iron, gold, etc.).
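The accuracy and AUC figures reported in Table 1 are typically computed on held-out inventory locations; a hedged sketch of that evaluation step (assuming a fitted scikit-learn style model and test arrays such as those in the earlier random forest example) is:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# `model`, `X_test` and `y_test` are assumed to come from a training step
# such as the random forest sketch shown earlier in this survey.
predicted_labels = model.predict(X_test)
predicted_scores = model.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, predicted_labels))
print("AUC:", roc_auc_score(y_test, predicted_scores))
```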
3 Geospatial Abridgement: ML Studies at a Glance As shown in Fig. 1, the decision trees machine learning model and its different variants (ensemble and optimized by bagging, boosting, etc.) have proved to be the most successful in geospatial mappings in the Indian context. The simplicity and understandability associated with decision trees are found to be the major factors that tilt the scales toward their extensive application. Support vector machines are the second most commonly employed learning technique owing to the fact that they are inherently most suited for high-dimensional data with clearly linearly separable classes. Since most of the spatial analysis data build high-dimensional data sets from earth observation data, the support vector machine models fit quite seamlessly into its logical analytics. Also, most of these analysis schemes end up having a binary classification problem at its core, with the positive class usually signifying the presence of event/phenomena on the earth’s surface, while the negative class represents its absence which again further makes support vector machine a good choice of classifier in similar studies. The logistic regression model ranks up next in its applicability for predictions on GIS data sets in the Indian context. The logistic regression is the most widely understood model which explains its predominance on the ML scene. However, it is interesting to note that very few studies in this field have been carried with the application of artificial neural network models, even though these models are generally considered to be suitable for identifying complex patterns arising due to the interaction between the contributing attributes. Fig. 1 Contribution of ML in GIS in the Indian context
(Pie chart, "Contribution of ML models in Indian GIS studies": Decision Trees 50%, Support Vector Machine 25%, Logistic Regression 17%, Artificial Neural Networks 8%)
4 Conclusion and Future Scope Through this survey, we observed that machine learning with its tremendous powers in identifying complex relations and patterns with relative ease has made a significant impact on the sophisticated field of GIS mappings and remote sensing. It was found that a huge volume of resource potential mappings and hazard susceptibility mappings have been undertaken using geospatial data produced by the Indian satellites. They have successfully employed state-of-the-art machine learning models to demarcate regions with a high probability of unearthing precious resources or regions with high vulnerability to specific hazards, respectively, for resource modeling and hazard susceptibility mappings. We found that decision trees and their variants owing to their inherent simplicity have found vast use in these studies. However, owing to the novelty of the field, not many attempts have been made to employ the latest machine learning optimizations and ensemble models which hold the key for further enhancement in this field. Feed-forward neural networks and convolution neural networks which have time and again proved and established their suitability for image processing and other related prediction tasks have yet to be taken up in similar studies. Also, due to the vastness of the Indian land coverage and population, there is a need for many more such explorations for better and efficient use of Indian resources. The field thus holds vast potential in India, as it could be applied for identifying future patterns of occurrence and absence for any object/event/phenomenon that ever found presence on the earth’s surface. Since it is just the onset of a digital era of its own kind, there is still a long way to go before the cohesive forces of GIS, RS and ML could be utilized for the complete potential that they have to offer.
References
1. Jain, R., & Sharma, R. U. (2019). Airborne hyperspectral data for mineral mapping in Southeastern Rajasthan, India. International Journal of Applied Earth Observation and Geoinformation. https://doi.org/10.1016/j.jag.2019.05.007.
2. Chakraborty, A., & Joshi, P. K. (2016). Mapping disaster vulnerability in India using analytical hierarchy process. Geomatics, Natural Hazards and Risk. https://doi.org/10.1080/19475705.2014.897656.
3. Guhathakurta, S. (2019). Spatial analysis. In The Routledge handbook of international planning education.
4. Chorley, R., & Buxton, R. (1991). The government setting of GIS in the United Kingdom. Geographical Information Systems: Principles and Applications, 1, 67–79.
5. Chowdhuri, I., Pal, S. C., & Chakrabortty, R. (2020). Flood susceptibility mapping by ensemble evidential belief function and binomial logistic regression model on river basin of eastern India. Advances in Space Research. https://doi.org/10.1016/j.asr.2019.12.003.
6. Sahana, M., Rehman, S., Sajjad, H., & Hong, H. (2020). Exploring effectiveness of frequency ratio and support vector machine models in storm surge flood susceptibility assessment: A study of Sundarban Biosphere Reserve, India. Catena. https://doi.org/10.1016/j.catena.2019.104450.
7. Sachdeva, S., Bhatia, T., & Verma, A. K. (2017). Flood susceptibility mapping using GIS-based support vector machine and particle swarm optimization: A case study in Uttarakhand (India). In 8th International Conference on Computing, Communications and Networking Technologies, ICCCNT 2017.
8. Pham, B. T., Tien Bui, D., Prakash, I., & Dholakia, M. B. (2017). Hybrid integration of Multilayer Perceptron Neural Networks and machine learning ensembles for landslide susceptibility assessment at Himalayan area (India) using GIS. Catena. https://doi.org/10.1016/j.catena.2016.09.007.
9. Pham, B. T., Shirzadi, A., Tien Bui, D., et al. (2018). A hybrid machine learning ensemble approach based on a Radial Basis Function neural network and Rotation Forest for landslide susceptibility modeling: A case study in the Himalayan area, India. International Journal of Sediment Research. https://doi.org/10.1016/j.ijsrc.2017.09.008.
10. Pham, B. T., Khosravi, K., & Prakash, I. (2017). Application and comparison of decision tree-based machine learning methods in landslide susceptibility assessment at Pauri Garhwal Area, Uttarakhand, India. Environmental Processes. https://doi.org/10.1007/s40710-017-0248-5.
11. Pham, B. T., Tien Bui, D., Prakash, I., et al. (2017). A comparative study of sequential minimal optimization-based support vector machines, vote feature intervals, and logistic regression in landslide susceptibility assessment using GIS. Environmental Earth Sciences. https://doi.org/10.1007/s12665-017-6689-3.
12. Sachdeva, S., Bhatia, T., & Verma, A. K. (2020). A novel voting ensemble model for spatial prediction of landslides using GIS. International Journal of Remote Sensing, 41. https://doi.org/10.1080/01431161.2019.1654141.
13. Sachdeva, S., Bhatia, T., & Verma, A. K. (2018). GIS-based evolutionary optimized Gradient Boosted Decision Trees for forest fire susceptibility mapping. Natural Hazards, 92. https://doi.org/10.1007/s11069-018-3256-5.
14. Gayen, A., Pourghasemi, H. R., Saha, S., et al. (2019). Gully erosion susceptibility assessment and management of hazard-prone areas in India using different machine learning algorithms. Science of the Total Environment. https://doi.org/10.1016/j.scitotenv.2019.02.436.
15. Pham, B. T., Jaafari, A., Prakash, I., et al. (2019). Hybrid computational intelligence models for groundwater potential mapping. Catena. https://doi.org/10.1016/j.catena.2019.104101.
Application of Noise Reduction Techniques to Improve Speaker Verification to Multi-Speaker Text-to-Speech Input
Md. Masudur Rahman, Sk. Arifuzzaman Pranto, Romana Rahman Ema, Farheen Anfal, and Tajul Islam
Abstract Text-to-speech is a very common implementation in the modern world. Its use is everywhere, from hearing aids to virtual assistants. But the development of voice models for TTS involves a lot of sample speech from professional speakers. Voice cloning can reduce this development cost by generating artificial voices from small speech samples. Speaker verification to multi-speaker text-to-speech (SV2TTS) makes this possible with its three individual neural networks and a lot of speech data. But it is still not possible to use it casually because of the noise around us. Noise creates garbage data during training, and that makes the output less desirable. We propose to add a noise reduction system to the recorder of SV2TTS to reduce noise in the speech data and create a more desirable output from SV2TTS. We compared six noise reduction algorithms and applied the best-performing one to SV2TTS. We intend to expand this research to implement SV2TTS for the Bengali language. Keywords Text-to-speech (TTS) · Speaker verification · Speaker verification to text-to-speech (SV2TTS) · Encoder · Synthesizer · Vocoder · Mel spectrogram · Phoneme · Waveform
1 Introduction
A text-to-speech (TTS) system reads out text data in a preset voice. It is vastly used in services like visually impaired assistants, PDF readers and dictionaries. Smart assistants are the latest application of TTS. But not all TTS is natural or human-like, and that makes assistants seem not smart enough, even though they are capable. A speaker verification system is able to recognize a speaker from speech data. If SV and TTS can be combined together, the speaker identity can be used to clone a speaker's voice. The idea is already applied in speaker verification to multi-speaker text-to-speech synthesis (SV2TTS) [1].
Md. M. Rahman · Sk. A. Pranto (B) · R. R. Ema · F. Anfal · T. Islam North Western University, Khulna, Bangladesh © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_4
SV2TTS works by training three independent neural networks: (1) an encoder that calculates a fixed-dimensional vector from a speech signal, (2) a sequence-to-sequence synthesizer, which predicts a Mel spectrogram from a sequence of grapheme or phoneme inputs, conditioned on the speaker embedding vector, and (3) an autoregressive WaveNet [2] vocoder, which converts the spectrogram into time-domain waveforms. We modified the speaker encoder stage to reduce noise and unwanted sounds so that the engine can acquire clean sample data and predict the voice from professional training data accurately. This helps to create a more reliable clone of a person's voice. The work can be further extended to other languages as well, since it is currently limited to English only. Other language support can be achieved by establishing datasets of hundreds of speakers in a language.
2 Related Works SV2TTS [1] is based on new technology. Thus, many researchers attempted to build new models and applications that are related to the modules of SV2TTS. Arık et al. [3] introduced a voice cloning system that learns to synthesize a voice from a few audio samples. They described the system in two approaches: speaker adaptation and speaker encoding. The adaptation is to fine-tune a multi-speaker generative model. Speaker encoding is to train a separate model for inferring a new speaker embedding that can be applied to a multi-speaker generative model. They showed that the naturalness of the speech and the similarity of the speaker both can achieve good performance. Arık et al. introduced another technique for augmenting neural TTS. They tried to improve the audio quality of Deep Voice 1 and Tacotron with Deep Voice 2 [4]. A post-processing neural vocoder is introduced by the authors that improved audio quality of output for Tacotron. They showed that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, as well as achieving high audio quality synthesis and preserving the speaker identities almost perfectly. The authors contributed by presenting Deep Voice 2 architecture and introducing WaveNet-based vocoder and using it with Tacotron. They also used these two models as the base to introduce trainable speaker embeddings into Deep Voice 2 and Tacotron. Attention mechanisms are vastly used in handwriting recognition, machine translation and image caption generation. Chorowski et al. [5] proposed an attentionbased mechanism for speech recognition. They adapted the attention-based model for recognizing speech and showed that it can recognize phonemes with an 18.7% phoneme error rate (PER). Then they proposed a change to the attention mechanism that reduces PER to the 17.6% level.
3 Preliminaries To continue with the research, we first need to assemble SV2TTS [1] part by part from its individual modules. As stated in the first section of this article, SV2TTS consists of three individual neural networks: encoder, synthesizer and vocoder. In this section, we describe these modules and the database.
3.1 Encoder The speaker encoder takes some audio input of a speaker and generates an output embedding that represents the speaker's own voice. The encoder learns the voice characteristics of the speaker, e.g., high/low pitch, accent, tone, etc. These features are then combined into a low-dimensional vector, formally called a d-vector or speaker embedding. As a result, utterances spoken by the same speaker are close to each other in the embedding space, while utterances spoken by different speakers are far apart. Speaker verification [6] technology is used for this implementation.
3.2 Synthesizer The synthesizer receives the speaker embedding from the encoder and merges it with its own encoded representation of the text inputs. Then, it generates a Mel spectrogram corresponding to both the speaker and the text by repeatedly attending and decoding. The Mel spectrogram is later converted to audio by a vocoder unit. Tacotron 2 [7] is used as the synthesizer to make this possible.
3.3 Vocoder The vocoder's job is to convert Mel spectrograms into raw audio. This is also a neural network, trained on the voice samples of hundreds of speakers. These samples help it generate synthetic utterances that match the utterances of the speaker at the encoder. WaveNet [2] is used as the vocoder in this research, as it can generate realistic utterances based on speaker data.
3.4 Database LibriTTS [8] is used as the database for this research. It is a corpus derived from LibriSpeech [9], a corpus created from audiobooks. LibriTTS inherits the properties of LibriSpeech and addresses the issues that make LibriSpeech less ideal for TTS. LibriTTS contains 585 h of speech data at a 24 kHz sampling rate spoken by 2,456 speakers. The corresponding texts of all utterances are also included in the corpus.
4 Methodology 4.1 System Architecture Our proposed system combines a noise reduction system with the whole SV2TTS pipeline to improve SV2TTS performance. This reduces the need for high-quality or professional recordings when gathering speaker reference speech. The proposed system consists of the noise reduction system, along with SV2TTS's speaker encoder, synthesizer and vocoder. The proposed system is visualized in Fig. 1.
5 Implementation Our system implements the combination of a noise reduction system and SV2TTS. This helps SV2TTS receive cleaner reference data to process when generating cloned audio. We used a noise reduction system [10] that offers six different techniques to reduce noise from audio. We implemented each technique and compared them. Both SV2TTS and the noise reduction system are implemented in Python. The time series and sampling rate are extracted from the audio file from the recorder and the noise reduction algorithm is then applied. Figure 2 shows the basic steps of the implementation. However, the noise reduction step itself has six variations with different approaches. We implemented each technique and tested the results. The techniques are discussed below. Noise Reduction Using Power. This technique applies regular sound engineering effects using pysndfx [11]. At first, the spectral centroid is calculated. Then the high and low thresholds of the sound are calculated from the median of the centroid: the median is multiplied by 1.5 and 0.1 for the high and low thresholds, respectively. Then noise reduction is applied using the threshold values. The method lowshelf applies noise reduction at the low end, while highshelf applies noise reduction at the high end. The parameters for these methods are gain, frequency and slope. The frequency parameters are the high threshold and the low threshold for highshelf and lowshelf, respectively.
Fig. 1 The proposed system
Fig. 2 The basic steps for noise reduction
Gain and slope are kept constant for highshelf as −12.0 and 0.5 and for lowshelf as −30.0 and 0.8, respectively. The following pseudo-code shows the entire process. Pseudo 1: Power
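Since the pseudo-code listings are reproduced only as figures, the following Python sketch illustrates the power technique described above. It assumes the librosa and pysndfx (AudioEffectsChain) interfaces used by the noise reduction system [10, 11]; the input file name is a placeholder.

```python
import numpy as np
import librosa
from pysndfx import AudioEffectsChain

def reduce_noise_power(y, sr):
    # Spectral centroid of each frame; its median sets the shelving thresholds
    cent = librosa.feature.spectral_centroid(y=y, sr=sr)
    threshold_h = round(np.median(cent)) * 1.5   # high threshold for highshelf
    threshold_l = round(np.median(cent)) * 0.1   # low threshold for lowshelf
    # Shelving filters with the constant gain/slope values stated above
    fx = (AudioEffectsChain()
          .lowshelf(gain=-30.0, frequency=threshold_l, slope=0.8)
          .highshelf(gain=-12.0, frequency=threshold_h, slope=0.5))
    return fx(y)

y, sr = librosa.load('reference.wav', sr=None)   # hypothetical reference recording
y_clean = reduce_noise_power(y, sr)
```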
Noise Reduction Using Centroid Analysis (No Boost). To reduce noise with this technique, similar steps are followed, except that the thresholds are calculated in a different way. This time, the upper and lower thresholds are taken as the maximum and minimum values of the centroid, respectively. Then the noise reduction effects are applied. Here, the method limiter is used to limit the gain, with 6.0 as the gain parameter. The pseudo-code shows the process. Pseudo 2: Centroid
Noise Reduction Using Centroid Analysis (With Boost). This is similar to the previous technique, except that the audio signal analysis and noise reduction are performed twice. The first pass is the same, except that the slope value of highshelf is taken as 0.5 and the limiter gain as 10.0. Then the centroid is calculated again and the audio is boosted using lowshelf. The boosted audio is then returned, as in the following pseudo-code: Pseudo 3: Centroid with boost
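For comparison, a minimal sketch of the two centroid-based variants is given below, reusing the imports from the previous sketch. Only the limiter gains (6.0 and 10.0) and the second boosting pass follow the description above; the shelving gain and slope values, and the boost gain of 16.0, are illustrative assumptions not stated in the text.

```python
def reduce_noise_centroid(y, sr, boost=False):
    cent = librosa.feature.spectral_centroid(y=y, sr=sr)
    threshold_h = float(np.max(cent))   # upper threshold from the largest centroid value
    threshold_l = float(np.min(cent))   # lower threshold from the smallest centroid value
    fx = (AudioEffectsChain()
          .lowshelf(gain=-30.0, frequency=threshold_l, slope=0.8)   # illustrative gain/slope
          .highshelf(gain=-12.0, frequency=threshold_h, slope=0.5)
          .limiter(gain=10.0 if boost else 6.0))                    # 6.0 (no boost) / 10.0 (boost)
    y_clean = fx(y)
    if boost:
        # Second pass: recompute the centroid and boost the speech with lowshelf
        cent2 = librosa.feature.spectral_centroid(y=y_clean, sr=sr)
        booster = AudioEffectsChain().lowshelf(gain=16.0, frequency=float(np.min(cent2)), slope=0.5)
        y_clean = booster(y_clean)
    return y_clean
```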
Noise Reduction Using Mel-frequency Cepstral Coefficient (MFCC) Down. In this technique, MFCC features and log Mel-filterbank energy features are calculated. Then, a cepstral lifter is applied to the features. This increases the magnitude of high-frequency features in the audio. For every frame, the sum of the squared feature values is calculated to find the strongest frame in the audio. The Hertz value of this frame is used as the threshold. The highshelf method is used to omit high-level noise only, since we are doing the down method. The pseudo-code shows the entire process. Pseudo 4: MFCC Down
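The following sketch outlines the MFCC-based threshold selection, assuming the python_speech_features and pysndfx packages used in [10]. Only the filter-bank branch is shown for brevity, and the shelving gain and slope values are illustrative assumptions.

```python
import numpy as np
from python_speech_features import base as psf
from pysndfx import AudioEffectsChain

def reduce_noise_mfcc_down(y, sr):
    # Log Mel-filterbank energies with a cepstral lifter applied (MFCC branch omitted for brevity)
    feats = psf.lifter(psf.logfbank(y, samplerate=sr), L=22)
    # Strongest frame = frame whose squared feature values sum to the maximum
    strongest = int(np.argmax(np.sum(feats ** 2, axis=1)))
    # Convert that frame back to Hz and use its maximum as the shelving threshold
    threshold_hz = float(np.max(psf.mel2hz(feats[strongest])))
    # "Down" variant: only the highshelf filter is applied
    fx = AudioEffectsChain().highshelf(gain=-12.0, frequency=threshold_hz, slope=0.5)
    return fx(y)
```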
Noise Reduction Using Mel-frequency Cepstral Coefficient (MFCC) Up. This MFCC technique is the same as the previous one, except that only the lowshelf is applied. The strongest frame is calculated as in the previous method. Then the lowshelf is applied using the threshold, and boosted speech is generated by applying it to the sound.
Pseudo 5: MFCC Up
Noise Reduction Using Median. This is the simplest technique of all: calculate the median and filter the sound. Applying SciPy's signal.medfilt to the sound does all the work. The kernel size taken for the method is 3. The pseudo-code is as follows: Pseudo 6: Median
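A sketch of the median technique, which needs only SciPy's medfilt, is shown below; the input file name is a placeholder.

```python
import librosa
import scipy.signal

def reduce_noise_median(y):
    # Median-filter the raw samples with a kernel size of 3
    return scipy.signal.medfilt(y, kernel_size=3)

y, sr = librosa.load('reference.wav', sr=None)   # hypothetical reference recording
y_clean = reduce_noise_median(y)
```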
6 Results This section illustrates our results after implementing the six algorithms. Here, we present each algorithm's results and show how they differ from the original data. Figure 3 shows the original voice data. The data is an indoor recording that contains background chatter and additional noises from an indoor environment.
6.1 Noise Reduction Using Power The result of this algorithm shows impressive output. Figure 4 illustrates the result. If we compare the result with the original, we can see that the spectra are smoother. Low-level sounds from the original recording have been omitted and the high-level sounds are in clear shape, while the original includes some residue along with them. This shows that the low-dB sounds, which are most likely to be noise, are cleared out, and the voice presence at high dB is preserved by clearing out the low-dB sound within it and boosting the voice.
Fig. 3 Original voice data
Fig. 4 Noise reduced data using power
Fig. 5 Noise reduced data using centroid analysis (no boost)
6.2 Noise Reduction Using Centroid Analysis (No Boost) Theoretically, this algorithm is supposed to show better results, but in practice it does not perform very well in reducing noise. Comparing Fig. 5 with Fig. 3 gives an idea of the performance of this algorithm. The thresholds were calculated using centroid analysis, so the high-dB and low-dB sounds are better represented. But in practice, the result is not much more noise-free. We can see in the figure that low-dB noises are not removed. Instead, the voice sounds are boosted, so the noise can be heard less. This algorithm works, but not as well as power.
6.3 Noise Reduction Using Centroid Analysis (with Boost) This algorithm is almost the same as the no-boost variant, so its performance is also almost the same. But it performs better than its predecessor because the dB signals are boosted. In Fig. 6 we can see that high-dB signals are boosted to their peaks, while low-dB signals are also boosted but still remain below the peaks. Bass boost is increased in this technique. This makes the sound louder, making it harder to hear the noise.
Fig. 6 Noise reduced data using centroid analysis (with boost)
Fig. 7 Noise reduced data using MFCC down
6.4 Noise Reduction Using Mel-Frequency Cepstral Coefficient (MFCC) Down The result of this technique is shown in Fig. 7. In this technique, the threshold was calculated using MFCC features, so it is expected to give a better result. In real-life experience, it works better than centroid analysis, but not as well as power. The bass boost is very high in this technique. Since it is MFCC down, only lower-dB sounds are reduced, so it should work in places with low-level noise.
6.5 Noise Reduction Using Mel-Frequency Cepstral Coefficient (MFCC) Up This technique works almost the same as the other MFCC technique. We can analyze it by inspecting Fig. 8. Unlike MFCC down, high-level noise is reduced in this technique, so it is suitable for very loud areas. Similar to MFCC down, this technique also increases bass boost while trying to remove noise. But the bass boost makes it harder to understand the actual voice. So, power is still the most suitable technique in our case.
Fig. 8 Noise reduced data using MFCC up
Fig. 9 Noise reduced data using median
6.6 Noise Reduction Using Median As discussed in the implementation section, this is the most basic technique of all, and the result shows it. We can see that when the median is used, there is not much difference between the original voice (Fig. 3) and the noise-reduced voice (Fig. 9). Although some noise was reduced, it was not reliable for our scenario.
6.7 Final Result After analyzing all the data, we concluded that reducing noise using power is most suitable for our work. Voice cloning recordings are usually conducted in closed rooms, and in that scenario power performs best. If outdoor and other scenarios are taken into consideration, then the noise reduction technique may have to be changed based on the scenario. Table 1 shows the comparison between the results of the algorithms. Noise reduction using power reduced the audio stream's intensity to between 0.19 dB and −0.19 dB. This resulted in an audio stream containing only speech data and no light or loud noise. It reduced about 39% of the audio that can be considered noise. MFCC down and median also reduced noise like power, but not enough for implementation: MFCC down achieves 28% audio reduction, while median barely changed the input audio. The centroid analysis algorithms and MFCC up boosted the audio intensity after noise reduction, so their performance is relative, but we can clearly see that MFCC up barely changes the audio, like median. Centroid analysis increases the audio intensity to 0.49 dB and −0.42 dB. This makes the audio louder, and the noise can also be heard loudly, which is not acceptable for our work. So, we chose power, which performed better than the other algorithms.
Table 1 Comparison of the audio samples

| Audio sample | Max. intensity (dB) | Min. intensity (dB) | Reduced intensity (dB) | Increased intensity (dB) | Overall audio reduction (%) | Overall audio increase (%) |
|---|---|---|---|---|---|---|
| Original | 0.35 | −0.29 | – | – | – | – |
| Power | 0.19 | −0.19 | 0.16, 0.09 | – | 39 | – |
| Centroid analysis | 0.49 | −0.42 | – | 0.14, 0.13 | – | 42 |
| Centroid analysis (Boost) | 1.0 | −1.0 | – | 0.65, 0.71 | – | 200 |
| MFCC down | 0.25 | −0.19 | 0.09, 0.09 | – | 28 | – |
| MFCC up | 0.45 | −0.32 | – | 0.1, 0.03 | – | 2 |
| Median | 0.35 | −0.28 | 0.006, 0.004 | – | 2 | – |
7 Conclusion Our attempt to improve SV2TTS involved both SV2TTS [1] itself and the noise reduction system [10]. Both systems are new, so they are not yet perfect, and small improvements may take them closer to perfection. The noise reduction system can be improved by applying more intelligent classifier algorithms, while SV2TTS can be improved by using more datasets. Together, these two systems may create a virtual voice artist. We wish to research further and contribute more to this field in order to make this system work in our mother language, Bangla. This can be achieved by contributing to the voice repository by taking voice samples from speakers. That is our focus for the future of this work.
References 1. Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F. et al. (2019). Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In 32nd Conference On Neural Information Processing Systems (NEURIPS 2018) (pp. 1–15). Montréal, Canada. 2. van den Oord, A, Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A generative model for raw audio. CoRR, http://arXiv.org/abs/1609.03499. 3. Arik, S. O., Chen, J., Peng, K., Ping, W., & Zhou, Y. (2018). Neural voice cloning with a few samples. CoRR, https://arxiv.org/abs/1802.06006. 4. Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., et al. (2017) Deep voice 2: Multi-speaker neural text-to-speech. In: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 30, pp 2962–2970). Curran Associates, Inc.
5. Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K. H., & Bengio, Y. (2015). Attention-based models for speech recognition. CoRR, http://arxiv.org/abs/1506.0750. 6. Wan, L., Wang, Q., Papir, A., & Moreno, I. L. (2018). Generalized end-to-end loss for speaker verification. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 7. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., et al. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 8. Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., et al. (2019). LibriTTS: A corpus derived from LibriSpeech for text-to-speech. CoRR, https://arxiv.org/abs/1904.02882. 9. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015 (pp. 5206–5210). IEEE. 10. Dror Ayalon: noise_reduction page. Retrieved March 1, 2020, from https://github.com/dodiku/noise_reduction. 11. Pysndfx page. Retrieved March 1, 2020, from https://pypi.org/project/pysndfx.
Utilization of Machine Learning Algorithms for Thyroid Disease Prediction Md. Shahajalal, Md. Masudur Rahman, Sk. Arifuzzaman Pranto, Romana Rahman Ema, Tajul Islam, and M. Raihan
Abstract Thyroid diseases are not uncommon in our society. Normally, thyroid diseases are diagnosed through various tests and the accompanying symptoms, but they can be predicted more efficiently by using machine learning. In this paper, we compared a number of ML algorithms and measured their performance in predicting thyroid diseases. We found that the support vector machine (SVM) works best among the compared ML algorithms. The research can be extended by including more advanced machine learning techniques and measuring their performance. Keywords Hyperthyroid · K-nearest neighbor · Linear discriminant analysis · Naïve Bayes · Logistic regression · Support vector machine
1 Introduction Thyroid diseases do not get much attention from researchers, since they are not considered major diseases. But a thyroid-related problem can even lead to thyroid cancer. Thyroid diseases are diagnosed by measuring the hormone levels released by the thyroid. It is a long process and sometimes goes unnoticed. Machine learning can achieve very complex predictions that are intractable for the human mind. If we can use ML to predict thyroid diseases, then it can be more efficient to determine one's thyroid condition. Fortunately, many researchers have come forward in implementing ML algorithms to predict thyroid diseases. This motivated us to find the algorithm which performs best in predicting thyroid diseases. In order to find that algorithm, we compared some of the best algorithms and present the results. The thyroid, a butterfly-shaped gland situated in the front of the neck under the Adam's apple, in front of the windpipe [1], produces two types of hormones containing iodine [2]: T4 (thyroxine) and T3 (triiodothyronine). Hyperthyroid is the condition of high production of these hormones, while goiter is the condition of swelling of the thyroid [1]. About 1.2% of the US population has hyperthyroidism [3]. Women are more likely to develop hyperthyroidism, and people who have a family history of hyperthyroidism are also prone to it [4]. Md. Shahajalal · Md. M. Rahman (B) · Sk. A. Pranto · R. R. Ema · T. Islam · M. Raihan, North Western University, Khulna, Bangladesh
Hyperthyroidism is diagnosed by symptoms and blood tests for TSH (thyroid-stimulating hormone) levels [5]. The common features of hyperthyroidism include dizziness, fatigue, vision changes, change of appetite, thinning of hair, etc. [6]. However, with the help of machine learning (ML), hyperthyroid can be accurately predicted by gathering the symptoms and hormone levels. Many ML algorithms have been developed to predict thyroid diseases; the algorithms vary in their accuracy and validation. We have compared the algorithms' accuracy and cross-validated the results in order to find the most suitable algorithm for hyperthyroid prediction. In the second section of this paper, we discuss the work done in the same field. We have used five algorithms, K-nearest neighbor, linear discriminant analysis, naïve Bayes, logistic regression and support vector machine, to train data and test the algorithms' performance. The algorithms are described in Sect. 3. Section 4 presents the methodology of our work, including the dataset features and the algorithms used for training and testing. The results of the tests are discussed in Sect. 5, where the cross-validation check for every algorithm and the accuracy comparison are presented. We conclude our paper in Sect. 6, with the hope of contributing more to medical science.
2 Related Works Many researchers have contributed to thyroid disease prediction. Vijayalakshmi et al. introduced an intelligent thyroid prediction system using big data in their work [7]. They made a framework that works with a predefined dataset and predicts hypothyroid disease. The dataset can also be updated as needed. Geetha et al. introduced a model for thyroid disease classification using the evolutionary multivariate Bayesian prediction method [8]. The model classifies the diseases related to the thyroid. It can detect hyperthyroid, hypothyroid and normal conditions using the Bayesian prediction method. Rasitha Banu et al. discussed thyroid disease prediction using data mining techniques [9]. Their work introduces the possibility of using different classifiers for predicting thyroid diseases with the density-based spatial clustering of applications with noise (DBSCAN) algorithm. The algorithm classifies the dataset using a hierarchical multiple classifier. By applying their method, one can predict thyroid disease from the symptoms. Ammulu et al. proposed a methodology for a data classification algorithm for predicting thyroid data [10]. They used the random forest approach to mine data and classify them in order to find prediction accuracy. The Weka tool was used to carry out this experiment. They produced a confusion matrix in the end for different K values in RF. This ensures better treatment and decision-making in hypothyroid disorder. Shaik et al. compared different ML algorithms for thyroid disease prediction [11]. They compared decision tree, naïve Bayes, SVM and multiple linear regression algorithms in order to find the best-performing ML algorithm. It turned out that decision tree was the best-performing algorithm in their research, with 99.23% accuracy.
3 Preliminaries 3.1 K-Nearest Neighbor Algorithm The K-nearest neighbor algorithm uses a distance measure, such as the Euclidean or Manhattan distance, between data points according to their characteristics, and K defines how many neighbors will be chosen for the algorithm [12]. The algorithm classifies data by following these steps, so that the dataset can be classified even if it is not pre-tagged (a code sketch of these steps is given below).
• First, it calculates the distance from the unknown data point to the other known data points.
• Then, the data points are sorted in ascending order according to the measured distance.
• K data points are then selected from the sorted data points.
• The unknown data point is then assigned to the class that has the maximum number of data points among those K neighbors.
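The following is a minimal NumPy sketch of these steps using the Euclidean distance; the variable names and the toy data are illustrative only.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 1: Euclidean distance from the unknown point to every known point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Steps 2 and 3: sort by distance and keep the K nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 4: the majority class among the K neighbours is the prediction
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example with two classes
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([5.0, 4.9]), k=3))  # prints 1
```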
3.2 Linear Discriminant Analysis It classifies data from the data matrix by calculating the separability between classes (between-class matrix) and the distance between the mean and the sample values within each class (within-class matrix), and then minimizing the within-class scatter while maximizing the between-class scatter. The algorithm works by using the following equation:

D = a_1 X_1 + a_2 X_2 + \cdots + a_i X_i + b \quad (1)

where D = discriminant function, X = response score, a = discriminant coefficient and i = number of discriminant variables. The algorithm is widely used as a classification technique. Data validation using test data and new data is done by this algorithm.
3.3 Logistic Regression This algorithm analyzes the relationship between the independent variables and a categorical outcome variable in order to classify observations [13]. It is a regression technique that can find the odds of a variable belonging to a certain class. The algorithm works in three steps. First, it generalizes the data by converging them using the feature attributes. In order to do this, a list is initialized, and feature
scaling is performed on the attributes of the example data. Then the data is iterated until convergence and the list is sent to a later stage. Next, the data are clustered using the previous generalized data. This is acquired by initializing another list, normalizing the dataset and then iterated until convergence. At last, the data is generalized. It is performed by initializing a list, normalizing data, prepending all data columns and then iterating 100 times to store consecutive values. This ensures the learning process in LR.
3.4 Naïve Bayes The naïve Bayes classifier classifies data by its feature vector. It learns from the features and classifies the variables according to the features in a probabilistic way. The algorithm works in these steps. • Converting the dataset into a frequency table. • Creating a likelihood table based on probability. • Calculating the posterior probability for each class by using the naïve Bayesian equation. The class whose posterior probability is highest is the outcome.
3.5 Support Vector Machine The support vector machine is a supervised learning classifier. It takes training data to train itself for a certain set of results. It then predicts the results for new test data based on what it has learned. The procedure can be described by the following steps:
• First, a gene subset G and an empty ranked list are taken.
• Then, the following steps are repeated while G is not empty:
• Train the SVM using the subset G.
• Compute the weight vector.
• Compute the ranking criteria.
• Sort the features by rank.
• Update the feature list.
• Eliminate the features with the smallest rank.
4 Methodology The goal is to find accuracy for each of the algorithms’ predictions. To obtain it, we designed an ML system that learns from the training dataset and predicts using different algorithms from the testing dataset. Then it calculates the accuracy for each algorithm and draws a learning curve for the algorithms. The whole process is described in Fig. 1.
4.1 Dataset The data is gathered from the UCI machine learning repository’s [14] thyroid disease dataset. The dataset used for training contains 2800 patients’ data. Data attributes are described in Table 1.
Fig. 1 Flowchart for ML system
Table 1 Dataset features
| Parameter | Description | Range |
|---|---|---|
| Age | Age of the patient | 1–94 |
| Sex | Gender of the patient | M or F |
| On_thyroxine | Patient is on thyroxine doses | t or f |
| Query_on_thyroxine | If query for thyroxine is done | t or f |
| On_antithyroid medication | Patient is on antithyroid medication | t or f |
| Sick | Patient feels sick | t or f |
| Pregnant | Patient is pregnant | t or f |
| Thyroid_surgery | Patient has faced thyroid surgery | t or f |
| I131_treatment | Patient on radioactive iodine therapy | t or f |
| Query hyperthyroid | If tests are done on hyperthyroid | t or f |
| Lithium | Bipolar affective disorder | t or f |
| Goiter | Patient has abnormal thyroid gland | t or f |
| Tumor | If the patient has tumor | t or f |
| Hypopituitary | Affected by hypopituitarism | t or f |
| Psych | – | t or f |
| TSH | Thyroid-stimulating hormone, normal range (0.5–6 uU/ml) | – |
| T3 | Serum triiodothyronine, normal range (80–180 ng/dl) | – |
| TT4 | Total thyroxine, normal range (4.5–11.5 ug/dL) | – |
| T4U | Thyroid hormone binding ratio | – |
| FTI | Free thyroxine index | – |
4.2 Applying Algorithms Since we need to classify the dataset, we used a few classifier algorithms in order to find the best outcome among them. We took every algorithm in the model, calculated the fitness function and performed classification with the algorithms using our training and test datasets. We chose Python to implement the algorithms. These algorithms have various properties and usage options. The properties of the algorithms used in our research are described below. K-Neighbors Classifier. This is a classifier implemented using the K-nearest neighbors vote. We used five neighbors and equal weight for all neighbors. We used the Minkowski power parameter as 2.0. The leaf size passed to BallTree or KDTree is 30. Linear Discriminant Analysis. A classifier with a linear decision boundary, generated by fitting class-conditional densities to the data and using Bayes' rule. We initialized the algorithm with singular value decomposition as the solver, the threshold for rank estimation was 1.0e-4, and no shrinkage was used. Logistic Regression. With this algorithm, regularization was applied by default. The penalty mode was L2 (ridge regression), which adds the "squared magnitude" of the coefficients as a penalty term to the loss function. The tolerance of the stopping criteria was 1e-4. We initialized the algorithm with the limited-memory BFGS (LBFGS) solver.
σ y and μ y in (2) are calculated using maximum likelihood. We initialized the algorithm’s function using probabilities of the classes as a priority and kept the largest variance of all features as 1e-9 for calculation stability. Support Vector Classification. The classifier is called C-support vector classification. The implementation is based on libsvm (a library in python for support vector machine algorithm). We used the regularization parameter C as 1.0. The kernel we used is the radial basis function (RBF). Gamma for the algorithm is provided as 1/(n_features * X.var()). Tolerance for stopping criterion was 1e − 3. All classes are supposed to have weight one in this implementation. The implementation returns a one vs. rest decision function of shape as all other classifiers.
4.3 Accuracy After the implementation of the algorithms, we calculate the accuracy of each algorithm. The accuracy is calculated by

\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \quad (3)
where TP = true positive, TN = true negative, FP = false positive and FN = false negative result for the test datasets. TP + TN denotes all correct predictions and TP + FP + TN + FN denotes both correct and wrong predictions.
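For illustration, Eq. (3) can be computed from a confusion matrix as in the following sketch, assuming a binary hyperthyroid/negative labeling and hypothetical variable names for the true and predicted test labels.

```python
from sklearn.metrics import confusion_matrix

# y_test and y_pred: true and predicted labels for the test split (binary case assumed)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)   # Eq. (3)
```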
4.4 Cross-Validation We also cross-validated the algorithms to check how well they fit our dataset. Cross-validation shows how well the algorithm model is learning and performing. It is performed by plotting curves of learning (y-axis) against experience (x-axis). The train learning curve shows how well the model learns, and the validation curve shows how well the model can predict based on the test dataset. Overfitting. When the model learns the training data too well, the learning curve is considered overfit. An overfitted algorithm learns noisy data and leads to wrong predictions. Underfitting. When the algorithm model cannot learn from the dataset, it is called an underfitted model. Such a model will surely predict wrong results. Good Fitting. Our goal is to stay between overfit and underfit so that the model learns neither too much nor too little. This is called good fitting. Cross-validation shows whether the algorithm is overfitted, underfitted or well fitted.
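The paper does not state how its learning curves were produced; one common way, sketched below, is scikit-learn's learning_curve utility, where the fold count and training sizes are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(clf, X, y, title):
    # Training and cross-validation scores at increasing training-set sizes
    sizes, train_scores, val_scores = learning_curve(
        clf, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(sizes, val_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.title(title)
    plt.legend()
    plt.show()
```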
5 Results 5.1 Cross-Validation The learning curve shows the training versus validation performance for each algorithm. The training score and cross-validation score are compared on the curve. KNN. The K-nearest neighbor algorithm shows an accuracy of 97.64%. The learning curve for KNN is shown in Fig. 2.
Fig. 2 Learning curve for KNN
LDA. Linear discriminant analysis shows the training curve below the cross-validation curve at some points, so the performance may be low in actual use. The curve is shown in Fig. 3. LR. Logistic regression shows an impressive performance in the cross-validation check. Its curve is shown in Fig. 4. The training and cross-validation curves do not cross at any point, so it can be considered a good fit.
Fig. 3 Learning curve for LDA
Fig. 4 Learning curve for LR
NB. Naïve Bayes does not perform so well in cross-validation. It can be seen in Fig. 5 that the curves underfit at some points, so the overall accuracy falls because of this. SVM. The support vector machine performs best in our cross-validation check. Figure 6 shows that the training and cross-validation scores do not cross. Also, the cross-validation curve is almost a straight line, so the accuracy is best in the case of SVM.
Fig. 5 Learning curve for NB
Fig. 6 Learning curve for SVM
Table 2 Accuracy of each algorithm

| Algorithm | Accuracy (%) |
|---|---|
| Logistic regression | 97.89 |
| K-nearest neighbor classifier | 97.64 |
| Linear discriminant analysis | 94.89 |
| Naïve Bayes | 71.64 |
| Support vector machine | 98.46 |
5.2 Accuracy The accuracy for each algorithm is shown in Table 2. We can see that, except for naïve Bayes, every algorithm shows an accuracy of about 95% or higher. SVM's accuracy is best in this case. Figure 7 shows the comparison between the accuracies of the algorithms.
6 Conclusion This paper aimed to compare various algorithms based on hyperthyroid patients' diagnosis reports. The algorithms used are logistic regression, K-nearest neighbor classifier, linear discriminant analysis, naïve Bayes and support vector machine. The accuracy of these algorithms is 97.89% for logistic regression, 97.64% for the K-nearest neighbors classifier, 94.89% for linear discriminant analysis, 71.64% for naïve Bayes and 98.46% for the support vector machine. Among these algorithms, the support vector machine came up with the highest accuracy. The learning curve showed that the dataset was a perfect fit for SVM.
Fig. 7 Accuracy comparison of algorithms
So, a program applying SVM to this dataset will provide the best results. This will be of great help in predicting patients' hyperthyroid diseases at an early stage. It will decrease the hyperthyroid detection time and is helpful for medical technology. Thus, this paper contributes to medical technology in an efficient manner.
References 1. WebMD article. Retrieved September 5, 2019 from https://www.webmd.com/women/picture-of-the-thyroid. 2. EurekAlert article. Retrieved September 5, 2019 from http://www.eurekalert.org/pub_releases/2014-09/mali-nht093014.php. 3. National Institute of Diabetes and Digestive and Kidney Diseases article. Retrieved September 5, 2019 from https://www.niddk.nih.gov/health-information/endocrine-diseases/hyperthyroidism. 4. Golden, S. H., Robinson, K. A., Saldanha, I., Anton, B., & Ladenson, P. W. (2009). Clinical review: Prevalence and incidence of endocrine and metabolic disorders in the United States: a comprehensive review. Journal of Clinical Endocrinology and Metabolism, 94(6), 1853–1878. https://doi.org/10.1210/jc.2008-2291. 5. MayoClinic article. Retrieved September 5, 2019 from https://www.mayoclinic.org/diseases-conditions/hyperthyroidism/diagnosis-treatment/drc-20373665. 6. EndocrineWeb article. Retrieved September 5, 2019 from https://www.endocrineweb.com/conditions/hyperthyroidism/hyperthyroidism-symptoms. 7. Vijayalakshmi, K., Dheeraj, S., & Deepthi, B. S. S. (2009). Intelligent thyroid prediction system using big data. International Journal of Computer Sciences and Engineering, 6(1), 326–331. https://doi.org/10.26438/ijcse/v6i1.326331. 8. Geetha, K., & Baboo, S. S. (2016). An empirical model for thyroid disease classification using evolutionary multivariate Bayesian prediction method. Global Journal of Computer Science and Technology: E Network, Web & Security, 16(1), 1–10. ISSN 0975-4172.
9. Rasitha Banu, G., & Baviya, M. (2015). Predicting thyroid disease using datamining technique. International Journal of Modern Trends in Engineering and Research, 2(3), 666–670. 10. Ammulu, K., & Venugopal, T. (2017). Thyroid data prediction using data classification algorithm. International Journal for Innovative Research in Science & Technology, 4(2), 208–212. 11. Shaik, R., Prathyusha, P. S., Krishna, N. V., & Sumana, N. S. (2018). A comparative study of machine learning algorithms on thyroid disease prediction. International Journal of Engineering & Technology, 7(2.8), 315–319. 12. Zhang, Z. (2016). Introduction to machine learning: K-nearest neighbors. Annals of Translational Medicine, 4(11), 218. https://doi.org/10.21037/atm.2016.03.37. 13. Park, H.-A. (2013). An introduction to logistic regression: from basic concepts to interpretation with particular attention to nursing domain. Journal of Korean Academy of Nursing, 43(2), 154–164. https://doi.org/10.4040/jkan.2013.43.2.154. 14. Dua, D., & Graff, C. (2019). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml.
Detection of Hepatitis C Virus Progressed Patient’s Liver Condition Using Machine Learning Ferdib-Al-Islam
and Laboni Akter
Abstract Hepatitis C virus is a significant cause of chronic liver disease, infecting more than 170 million individuals around the world. If an individual's liver disease progresses because of the Hepatitis C virus, from fibrosis to cirrhosis and liver failure, the condition leads to death. There are many difficulties, including the availability of hepatologists and the cost burden of Hepatitis C virus screening and its stage-specific screening. In this research, we have applied machine learning to classify normal people (normal or suspected blood donors) and Hepatitis C-infected individuals with their current conditions, such as only Hepatitis C, fibrosis, and cirrhosis, from a biochemical test dataset, with superior performance in accuracy, precision, and recall, and have also shown the feature importance scores. The logistic regression model, SVM model, and XGBoost model achieved accuracies of 95%, 95%, and 92%, respectively. Keywords Hepatitis C virus · Machine learning · Logistic regression · Support vector machine · XGBoost · Feature importance score
1 Introduction Hepatitis C virus (HCV) infection causes fibrosis, cirrhosis, and other liver diseases around the world; however, for most of the duration of the disease, and even when treatment is possible, the infection is silent and either unrecognized or disregarded by most people. Improvements in treatment appear to be taking effect just as the impact of infection starts to rise. From a public health point of view, these treatment improvements are not likely to have a significant effect without considerable changes in the techniques for control and diagnosis. The progression to cirrhosis and the related complications is an inevitable concern for patients with chronic liver disease. Rates of progression to cirrhosis can vary significantly across individuals [1].
Ferdib-Al-Islam (B) · L. Akter, Khulna University of Engineering & Technology, Khulna 9203, Bangladesh
Research to prevent transmission and to easily assess the condition of HCV patients is therefore required. Risk prediction tools could be especially helpful given the demand among people with chronic Hepatitis C, the main cause of cirrhosis worldwide. Despite the availability of effective antiviral treatment for chronic Hepatitis C (CHC), disease elimination remains very challenging because of limited access to specialty care and the steep cost of antiviral drugs. The ability to predict the risk of clinical progression could assist in identifying patients at risk of unfavorable outcomes and might help target resources to those at the highest risk [2]. HCV remains a critical public health issue despite the availability of highly successful antiviral treatment. In this paper, we have classified blood donors (normal and suspected) and HCV-progressed conditions (only Hepatitis C, fibrosis, and cirrhosis) using machine learning techniques for quick assessment of a patient's current situation, with better classification performance, and have also shown the feature importance scores from the classification model. The rest of the work is structured as follows: the "Literature Review" section describes recent research on diagnosing and predicting HCV with machine learning techniques. The implementation details of this work are presented in the "Methodology" section with different subsections. The results are described in the "Results and Discussion" section. The "Conclusion" section presents the paper's conclusion and future work.
2 Literature Review Barakat et al. [3] developed intelligent models and obtained new cutoff values for fibrosis-4 (FIB-4) and APRI for the prediction and staging of fibrosis among children with chronic Hepatitis C. Random forest was used for that study. The simple, noninvasive model, together with the APRI and FIB-4 cutoffs, would enable timely intervention with medicines. This might help to lower the number of children's liver biopsies as well as HCV consequences. Abd El-Salam et al. [4] used machine learning techniques for the prediction of esophageal varices in cirrhotic patients based on their clinical examination and laboratory data. To devise a quicker and more effective procedure for disease diagnosis, leading to timely patient therapy, they analyzed 4,962 patients with chronic Hepatitis C. A Bayesian network accomplished the best outcome, with an accuracy of 68.9%. Hashem et al. [5] developed a model for staging chronic liver disease to avoid the drawbacks of biopsy, which are substantially increasing. The aim of the study was to combine serum biomarkers and clinical data to build a model that could predict advanced liver fibrosis. A total of 39,567 patients with chronic Hepatitis C were included. Two models were developed using a decision tree learning algorithm, and the accuracy was 84.8%.
Singal et al. [6] developed and compared predictive models for HCC development among cirrhotic patients. The machine learning algorithm provided greatly enhanced diagnostic precision, as assessed by net reclassification improvement and integrated discrimination improvement. Hashem et al. [7] used different machine learning strategies for the prediction of advanced fibrosis by combining blood biomarkers and clinical data to build the models. Multi-linear regression models, genetic algorithms, decision trees, and particle swarm optimization (PSO) were developed for advanced fibrosis risk prediction, and the accuracies were 66.3% and 84.4%.
3 Methodology The methodology of the proposed work has been divided into the following steps:
• Data collection
• Data preprocessing
• Exploratory data analysis
• Machine learning for classification
The proposed system architecture has been illustrated in Fig. 1.
3.1 Data Collection In this implemented work, we have used the “HCV Dataset” from the “UCI Machine Learning Repository” [8, 9]. This dataset contains the clinical values of blood donors and Hepatitis C sufferers. There are 615 instances and 14 attributes, where each entity represents the clinical test values of a patient. Several biochemical tests like alanine amino transferase (ALT), albumin (ALB), aspartate amino transferase (AST), bilirubin (BIL), cholinesterase (CHE), cholesterol (CHOL), gamma glutamyl transferase (GGT), creatinine (CREA), alkaline phosphatase (ALP), and protein (PROT) along with patients’ gender, age, and corresponding category where the patient was suffering from Hepatitis C or not were recorded. The target variable for the classification of this task was “Category” which has five classes of two kinds: blood donors (normal and suspected) versus Hepatitis C (only Hepatitis C, fibrosis, and cirrhosis).
3.2 Data Preprocessing While starting the implementation, we first removed the unnecessary column from the dataset. All attributes except "sex" and "category" were already in numerical form.
Fig. 1 Proposed system architecture
Label encoding means converting categorical features into numerical values [10]. Features that characterize a class are categorical variables. Machine learning algorithms expect features to be either integer or float numbers, and hence categorical features should be converted to numerical values. The label encoder changes a categorical feature into an integer number. We have done label encoding to handle the categorical variables. There were missing values in several attributes of the dataset. This is one of the most common situations that occur with real-world datasets. These datasets may contain missing values for different reasons, for example, unclear values, data collection errors, and so forth. Building a model with a dataset that has a lot of missing values can drastically affect the performance of the machine learning model [11]. We have used the mean value imputation technique to handle this [12]. In this approach, we calculate the mean of the non-missing values in a column and replace the missing values within every column separately and independently from the others. It is a straightforward and fast strategy and works well with small numerical datasets. Feature scaling is a procedure to normalize the independent features present in the dataset to a fixed range. If feature scaling is not done, a machine learning
algorithm will, in general, weight features with larger values more heavily and treat smaller values as less important, regardless of the unit of the values. Here, we have used the min-max feature scaling technique. It is a scaling procedure in which values are shifted and rescaled so that they end up ranging between 0 and 1. The formula for min-max feature scaling is given in (1):

X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \quad (1)

where X_{\max} and X_{\min} are the highest and the lowest values of the feature, respectively.
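A sketch of the preprocessing pipeline described in this subsection is given below. The file name and column names follow the UCI HCV data description and are assumptions rather than quotations from the paper.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv('hcvdat0.csv')            # assumed local copy of the UCI HCV data file
df = df.drop(columns=df.columns[0])        # drop the unnecessary index column

# Label encoding of the categorical attributes ("Sex" and the target "Category")
for col in ['Sex', 'Category']:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Mean-value imputation: replace missing entries with each column's mean
df = df.fillna(df.mean())

# Min-max scaling of the input features to the [0, 1] range, as in Eq. (1)
y = df['Category']
X = df.drop(columns=['Category'])
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
```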
3.3 Exploratory Data Analysis Exploratory data analysis is an important step in statistics that enables validation, summarization, and hypothesis generation for a dataset. Exploratory data analysis is an approach to analyzing data. It is the stage at which an analyst gets a better view of the data and attempts to understand it. It is often the first stage of data examination and is carried out before more formal statistical procedures are applied [13]. Correlation gives an indication of how changes in two factors are related. If two features change in the same direction, they are positively correlated. If they change in opposite directions together, they are negatively correlated. Correlation certainly impacts feature importance [14], meaning that if features are highly correlated, there would be a significant level of redundancy in keeping them all. Since two correlated features change together, a change in one will change the other. Some machine learning algorithms, such as logistic regression, can perform poorly if there are many correlated input variables in the dataset. Any two of the independent factors are viewed as redundant if they have a high correlation. In Fig. 2, a correlation plot of our dataset is presented. It can be easily seen that there was no high correlation between the input features.
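A correlation plot such as Fig. 2 can be produced, for example, with pandas and seaborn as sketched below, where X is the preprocessed feature frame from the previous sketch.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation between the preprocessed input features
corr = X.corr()
sns.heatmap(corr, cmap='coolwarm', square=True)
plt.title('Correlation matrix of the input variables')
plt.show()
```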
3.4 Machine Learning for Classification We have applied the logistic regression, support vector machine, and XGBoost algorithms to classify blood donors, suspected blood donors, and HCV-affected people (Hepatitis C, fibrosis, and cirrhosis). We have split the dataset into an 80:20 ratio for the training set and the test set, respectively.
Fig. 2 Correlation matrix of the input variables
Logistic Regression. Logistic regression is one of the most basic and widely used machine learning algorithms. Despite its name, logistic regression is not a regression method but a probabilistic classification algorithm. The idea in logistic regression is to cast the problem as a generalized linear model, as in (2):
\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n \quad (2)
where \hat{y} is the predicted value, the x are the independent features, and the β are the learned coefficients. In multi-class logistic regression, multi-class classification can be enabled or disabled by passing values to the parameter called "multi_class" in the constructor of the method [15]. In the multi-class case, the training method uses the one-versus-rest scheme if the "multi_class" parameter is set to "ovr" and uses the cross-entropy loss if the "multi_class" parameter is set to "multinomial". In this work, we have used the "multi_class" option "multinomial", where the solver was "lbfgs".
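A minimal sketch of this configuration is shown below, using the scaled features X and labels y from the preprocessing sketch; the 80:20 split follows Sect. 3.4, while the random seed and iteration limit are arbitrary assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 80:20 train/test split of the scaled features (random_state is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
lr.fit(X_train, y_train)
print('LR accuracy:', lr.score(X_test, y_test))
```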
Support Vector Machine. SVM is one of the most popular supervised learning algorithms. It is mainly used for classification problems in the area of machine learning [16]. The objective of the SVM algorithm is to find the best line or decision boundary; this decision boundary is known as a hyper-plane. SVM prefers the extreme points that help in constructing the hyper-plane. Here, we have applied the "ovr" procedure, also called one-versus-all. This approach consists of fitting one classifier for each class; for every classifier, the class is fitted against all the other classes. Besides its computational efficiency, one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multi-class classification and is a reasonable default choice [17]. XGBoost. XGBoost is a popular and efficient open-source implementation of gradient boosted trees [18]. Gradient boosting is a supervised learning algorithm that attempts to predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function and a penalty term for model complexity. The workflow of the XGBoost algorithm is illustrated in Fig. 3. The training proceeds iteratively, adding new trees that predict the residuals of the preceding trees, which are then combined with the previous trees to build the final prediction. It is called gradient boosting because it uses a gradient descent method to minimize the loss while adding new models. The objective function is a sum of a specific loss evaluated over all predictions and a sum of regularization terms for all the classifiers. Mathematically, it can be represented as in (3):
obj(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \quad (3)

Fig. 3 XGBoost algorithm's workflow
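The remaining two classifiers and the feature importance scores of Sect. 4 can be sketched as follows, on the same 80:20 split as above; the macro averaging of precision and recall and the XGBoost hyperparameters are assumptions, since the paper does not state them.

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
from xgboost import XGBClassifier, plot_importance

# One-vs-rest SVC and a multi-class XGBoost model on the same 80:20 split
svm = SVC(decision_function_shape='ovr').fit(X_train, y_train)
xgb = XGBClassifier(objective='multi:softprob', eval_metric='mlogloss').fit(X_train, y_train)

for name, clf in [('SVM', svm), ('XGBoost', xgb)]:
    pred = clf.predict(X_test)
    print(name,
          accuracy_score(y_test, pred),
          precision_score(y_test, pred, average='macro', zero_division=0),
          recall_score(y_test, pred, average='macro', zero_division=0))

# Feature importance scores from the trained XGBoost model (cf. Fig. 4)
plot_importance(xgb)
```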
4 Results and Discussion As described previously, we have executed the logistic regression, support vector machine, and XGBoost algorithms for classifying patients' biochemical test data into any of the five classes: "Blood Donors", "Suspected Blood Donors", "Only Hepatitis C", "Fibrosis", and "Cirrhosis". We have acquired significant results after implementing those algorithms on this real-world clinical dataset. We have measured the performance of the machine learning models with various metrics. Logistic Regression Performance. The accuracy of the LR model was 95%, precision was 84.5%, and recall was 83.3%. SVM Performance. The accuracy of the SVM model was 95%, precision was 90.75%, and recall was 88.3%. XGBoost Performance. The accuracy of the XGBoost model was 92%, precision was 73.75%, and recall was 88.3%. The performance of these three models is listed in Table 1. The performance comparison between our implemented models and others (on the same dataset) is presented in Table 2. It can be seen that our proposed system achieved better performance. We computed the feature importance scores using the XGBoost algorithm; Fig. 4 presents the corresponding scores. From Fig. 4, "Alkaline phosphatase (ALP)", "Aspartate Amino Transferase (AST)", and "Gamma Glutamyl Transferase (GGT)" are the top-3 features in the feature score list according to their significance.
Table 1 Model performance

| Model | Accuracy (%) | Precision (%) | Recall (%) |
|---|---|---|---|
| LR | 95 | 84.5 | 83.3 |
| SVM | 95 | 90.75 | 88.3 |
| XGBoost | 92 | 73.75 | 88.3 |

Table 2 Proposed model comparison with the state-of-the-art

| Author | Model | Accuracy (%) |
|---|---|---|
| Hoffmann et al. [9] | Decision tree with ELF score | 75.3 |
| Proposed work | Logistic regression | 95 |
| Proposed work | SVM | 95 |
| Proposed work | XGBoost | 92 |
Fig. 4 Feature importance score
5 Conclusion Hepatitis C is a liver disease caused by the Hepatitis C virus (HCV). The virus can cause both acute and chronic hepatitis, ranging in severity from a mild illness lasting a few weeks to a serious, lifelong disease. Most patients with Hepatitis C do not have any manifestations until they have fibrosis and cirrhosis, and even when they have early cirrhosis, they might not have symptoms. Early diagnosis of HCV infection, accordingly, is essential for avoiding the serious complications of liver disease. Our implemented machine learning algorithms were able to differentiate Hepatitis C cases and their existing conditions with relatively high values of accuracy, precision, and recall. The logistic regression, support vector machine, and XGBoost classifiers achieved 95%, 95%, and 92% accuracy, respectively. The precision and recall rates for LR, SVM, and XGBoost were 84.5%, 90.75%, 73.75% and 83.3%, 88.3%, 88.3%, respectively. We have also shown the feature importance scores. Further analysis and models with feature selection methods can be applied to improve the classification performance more accurately.
References 1. Freeman, A., Law, M., Kaldor, J., & Dore, G. (2003). Predicting progression to cirrhosis in chronic hepatitis C virus infection. Journal of Viral Hepatitis, 10, 285–293. https://doi.org/10.1046/j.1365-2893.2003.00436.x. 2. Waljee, A., Higgins, P., & Singal, A. (2014). A primer on predictive models. Clinical and Translational Gastroenterology, 5. https://doi.org/10.1038/ctg.2013.19. 3. Barakat, N., Barakat, S., & Ahmed, N. (2019). Prediction and staging of hepatic fibrosis in children with hepatitis c virus: A machine learning approach. Healthcare Informatics Research, 25, 173. https://doi.org/10.4258/hir.2019.25.3.173. 4. Abd El-Salam, S., Ezz, M., Hashem, S., et al. (2019). Performance of machine learning approaches on prediction of esophageal varices for Egyptian chronic hepatitis C patients.
Informatics in Medicine Unlocked, 17. https://doi.org/10.1016/j.imu.2019.100267. 5. Hashem, S., Esmat, G., Elakel, W., et al. (2016). Accurate prediction of advanced liver fibrosis using the decision tree learning algorithm in chronic hepatitis C Egyptian patients. Gastroenterology Research and Practice, 2016, 1–7. https://doi.org/10.1155/2016/2636390. 6. Singal, A., Mukherjee, A., Elmunzer, J., et al. (2013). Machine learning algorithms outperform conventional regression models in predicting development of hepatocellular carcinoma. American Journal of Gastroenterology, 108, 1723–1730. https://doi.org/10.1038/ajg.2013.332. 7. Hashem, S., Esmat, G., Elakel, W., et al. (2018). Comparison of machine learning approaches for prediction of advanced liver fibrosis in chronic hepatitis C patients. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 15, 861–868. https://doi.org/10.1109/tcbb.2017.2690848. 8. UCI Machine Learning Repository: HCV data Data Set. In: Archive.ics.uci.edu (2020). Retrieved December 15, 2020 from https://archive.ics.uci.edu/ml/datasets/HCV+data. 9. Hoffmann, G., Bietenbeck, A., Lichtinghagen, R., & Klawonn, F. (2020). Using machine learning techniques to generate laboratory diagnostic pathways—a case study. Jlpm.amegroups.com. Retrieved December 15, 2020 from http://jlpm.amegroups.com/article/view/4401. 10. ML | Label Encoding of datasets in Python - GeeksforGeeks. In: GeeksforGeeks. Retrieved December 15, 2020 from https://www.geeksforgeeks.org/ml-label-encoding-of-datasets-in-python/. 11. Padgett, C., Skilbeck, C., & Summers, M. (2014). Missing data: the importance and impact of missing data from clinical research. Brain Impairment, 15, 1–9. https://doi.org/10.1017/brimp.2014.2. 12. Jerez, J., Molina, I., García-Laencina, P., et al. (2010). Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artificial Intelligence in Medicine, 50, 105–115. https://doi.org/10.1016/j.artmed.2010.05.002. 13. Velleman, P. F., & Hoaglin, D. C. (2012). Exploratory data analysis. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbooks in psychology®. APA handbook of research methods in psychology (Vol. 3). Data analysis and research publication (pp. 51–70). American Psychological Association. https://doi.org/10.1037/13621-003. 14. Nicodemus, K., & Malley, J. (2009). Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics, 25, 1884–1890. https://doi.org/10.1093/bioinformatics/btp331. 15. Escalona-Moran, M., Soriano, M., Fischer, I., & Mirasso, C. (2015). Electrocardiogram classification using reservoir computing with logistic regression. IEEE Journal of Biomedical and Health Informatics, 19, 892–898. https://doi.org/10.1109/jbhi.2014.2332001. 16. Mathur, A., & Foody, G. (2008). Multiclass and binary SVM classification: implications for training and classification users. IEEE Geoscience and Remote Sensing Letters, 5, 241–245. https://doi.org/10.1109/lgrs.2008.915597. 17. Xu, Y., Shao, Y., Tian, Y., & Deng, N. (2009). Linear multi-class classification support vector machine. In: Y. Shi, S. Wang, Y. Peng, J. Li, & Y. Zeng (Eds.), Cutting-edge research topics on multiple criteria decision making. MCDM 2009. Communications in computer and information science (Vol. 35). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-02298-2_93. 18. Chen, T., & Guestrin, C. (2016). XGBoost. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939785.
Energy Performance Prediction of Residential Buildings Using Nonlinear Machine Learning Technique D. Senthil Kumar, D. George Washington, A. K. Reshmy, and M. Noorunnisha
Abstract Energy consumption is the total energy used by an individual or organization to perform an activity. The consumption of energy in buildings has steadily increased to between 35 and 40% and exceeds the other major sectors like transportation and industry. The indoor climate has a major impact on energy consumption through the ventilation and air-conditioning in the buildings. It is necessary to forecast the heating load (HL) and cooling load (CL) prior to the design of the building plan for the efficient use of energy. To obtain an efficient building design, the optimum level of heat energy should be measured to maintain the temperature at the appropriate level. In this paper, a deep learning algorithm is proposed to forecast the HL and CL using simulated building information. Experimental results reveal that support vector regression can be a suitable forecasting model for the heating load, while the proposed deep learning model is most suitable for cooling load prediction. The proposed study also helps engineers and architects to construct an efficient building design based on the optimum levels of cooling load and heating load.
D. Senthil Kumar (B) University College of Engineering (BIT Campus), Anna University, Tiruchirappalli, India e-mail: [email protected] D. George Washington Ramanujan Computing Centre, College of Engineering Guindy Campus, Anna University, Chennai, India e-mail: [email protected] A. K. Reshmy Department of Computer Applications, B.S. Abdur Rahman Crescent Institute of Science and Technology, Vandalur, India e-mail: [email protected] M. Noorunnisha Department of Computer Science and Engineering, M.A.M College of Tiruchirappalli, Tiruchirappalli, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_7
1 Introduction The energy utilization in residential buildings has increased significantly in the last decade. The energy performance of a building is the calculation of the amount of energy actually consumed in the building to meet the various needs associated with its standard use, including heating/cooling and lighting. The aim of energy efficiency is to decrease the amount of energy needed to perform the same set of tasks or processes while reducing pollution [1]. Energy efficiency also helps the economy by saving a huge amount of money in energy costs, and it preserves the environment for sustainable development. The utilization of energy-efficient appliances diminishes the usage of natural resources [2]. Efficient use of energy improves the conservation of these resources as an approach to achieve sustainable development. The buildings where we live and work account for 30% of overall greenhouse gas emissions in the USA. Advancements such as more economical heating, cooling, and lighting allow buildings to use less energy, which assists in lessening greenhouse gas emissions [3]. Energy-saving activities are characterized as everyday and common practices of households that emphasize explicit reductions in energy use. Families decide how to keep their home warm in the winter and how to keep it cool in the summer, making use of their main energy systems [4]. The greater part of the energy and environmental research around building energy performance and energy performance improvement programs is concentrated on building retrofits [5]. Incentives are given for installing energy-efficient technologies: lighting, high-efficiency heating and cooling systems, and so on. However, there is no conclusive evidence that such technology measures alone essentially lead to improved performance [6]. One approach to diminish the ever-growing demand for additional energy supply is to develop more energy-efficient buildings with progressive energy conservation properties. To diminish the energy utilization in buildings, it is essential to anticipate the heating and cooling loads when the building is being planned [7]. Nowadays, designers use estimates of the heating load (HL) and cooling load (CL) to reduce the usage of energy in the building while designing simulated building projects. Moreover, data mining methods like least square support vector machines (LS-SVM), fuzzy logic (FL), random forest (RF), and artificial neural networks (ANN) are applied to foresee the HL and CL by considering different input parameters of energy utilization without utilizing simulation programs. Nowadays, deep learning techniques are gaining popularity in different real-time applications due to their efficiency in modeling with higher prediction accuracy. In deep learning, the data are transferred through a large number of hidden layers [8–11]. A single convolutional neural network architecture with a multi-task learning procedure was designed for natural language processing (NLP).
The deep autoencoder network was used to transform high-dimensional data into low-dimensional codes, and tests showed that it works better than PCA for dimensionality reduction. In [12], a stacked auto-encoder (SAE) was applied for organ identification in clinical magnetic resonance images. Deep learning approaches have also been applied to forecasting of time series data; in [13], the results show that the deep learning approach performs better than existing AI techniques in the prediction of power load time series data [14]. In this paper, eight input parameters, namely relative compactness, overall height, wall area, orientation, roof area, surface area, glazing area, and glazing area distribution, are utilized to predict the heating and cooling loads of residential buildings. This paper investigates the data, analyzes the properties of the input and output variables, and uses a nonlinear machine learning technique (deep learning) to predict the HL and CL [7]. To improve the efficiency of forecasting the energy utilization in building structures, a deep learning (DL) algorithm is proposed, and it is compared with the artificial neural network (ANN), chi-square automatic interaction detector (CHAID), support vector regression (SVR), classification and regression tree (CART), and general linear regression (GLR) [7, 8].
2 Data Mining Algorithms In this section, the data mining algorithms used in this study, namely ANN, CART, SVR, GLR, and CHAID, are discussed.
2.1 Artificial Neural Networks The ANN mimics the functions of the human brain for learning and forecasting [15]. It uses various elements for creating the prediction model, such as neurons, input features, synaptic strength, activation function, target, and bias, as denoted in Eq. 1. The interrelation among neurons increases the strength of the neural network, and the basic ANN architecture is shown in Fig. 1 [16].

$$\mathrm{network}_n = \sum_{j} w_{nj} T_j \quad\text{and}\quad y_n = f(\mathrm{network}_n) = \frac{1}{1 + e^{-\mathrm{network}_n}} \tag{1}$$

where j indexes the set of neurons in the previous layer; w_{nj} is the linking weight between neurons n and j; T_j is the target of neuron j; and y_n is obtained through the logistic (sigmoid) transfer function.
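To make Eq. 1 concrete, the following minimal Python sketch computes a single neuron's output from illustrative previous-layer targets and weights; the numerical values are invented for the example and are not taken from the paper's data.

```python
import numpy as np

def neuron_output(targets, weights):
    """Compute Eq. 1: a weighted sum followed by the sigmoid transfer function."""
    network_n = np.dot(weights, targets)        # sum_j w_nj * T_j
    y_n = 1.0 / (1.0 + np.exp(-network_n))      # logistic (sigmoid) activation
    return y_n

# Illustrative values only
T = np.array([0.2, 0.7, 0.5])     # outputs of neurons in the previous layer
w = np.array([0.4, -0.3, 0.8])    # linking weights w_nj
print(neuron_output(T, w))        # sigmoid(0.27) ~ 0.567
```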
Fig. 1 Diagrammatic representation of ANN model
2.2 Chi-Square Automatic Interaction Detector (CHAID) The CHAID is a decision-tree-based classification algorithm introduced by Kass [17]. To verify the improvement of precision in a segment node, the chi-square test is used. Every target node with the higher p-values associated with the significant inputs is utilized as a root node. The split operation is not performed, and the algorithm ends, if there is no statistically significant improvement in the tested features. For a feature variable, the merging of categories may stop too early, leaving residual categories that still differ significantly. CHAID avoids this issue by continually merging feature categories until only two significant categories remain. Subsequently, based on the adjusted p-values, it identifies which features should be chosen for the best split [18].
2.3 Classification and Regression Tree (CART) The classification and regression tree (CART) [18] is used to build the decision tree. The creation of a CART is an iterative method of constructing a binary decision tree that can be used both for classification and for regression. It supports both categorical and numeric target types [19]. It is a rule-based algorithm in which the data are partitioned into two feature subsets. The new feature subset has higher homogeneity, i.e., higher accuracy, compared with the first feature subset. For classification trees, CART uses the Gini coefficient reduction principle for feature selection and produces a binary tree. The splitting method is applied iteratively until the homogeneity criterion is reached. A CART is sufficiently adaptable to consider misclassification costs and to represent the probability distribution in a classification problem. In a CART model, the accuracy
is characterized as the similarity between the input and target values, and it is measured as perfect when all feature subset values are identical [20]. Three measures are used to obtain the target field and to design the CART models. The target is typically represented using the Gini measure, while the least-square deviation technique is used for continuous targets. The Gini index g(t) in a node t of CART is defined using Eq. (2):

$$g(t) = 1 - \sum_{i=1}^{j} p_i^{2} \tag{2}$$

where j is the number of classes after splitting of the tree and p_i is the probability of node t belonging to class i.
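As an illustration of Eq. (2), the short Python sketch below computes the Gini index for a hypothetical node; the class proportions are purely illustrative and not derived from the study's data.

```python
import numpy as np

def gini_index(class_probabilities):
    """Eq. (2): g(t) = 1 - sum_i p_i^2 for the class proportions in node t."""
    p = np.asarray(class_probabilities, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Illustrative node with class proportions 0.7 / 0.3
print(gini_index([0.7, 0.3]))   # 1 - (0.49 + 0.09) = 0.42
```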
2.4 Support Vector Regression (SVR) A support vector machine (SVM) can be used for both classification and regression problems; when used for regression, it is called support vector regression (SVR). SVM and SVR are similar, with the slight difference that SVR relies on kernel functions and mainly aims to keep the error within some threshold. In SVR, the input F is first mapped into a high-dimensional feature space through a nonlinear mapping, and linear regression is then performed in that feature space. If the data are not linearly separable, the approach can be extended to the nonlinear case, and the original feature space is mapped into a higher-dimensional feature space [11, 21]:

$$f(F, w) = \langle w, F \rangle + c \quad \text{with } w \in Z,\ c \in Z \tag{3}$$

$$L(F, w) = \begin{cases} 0 & \text{if } |T - f(F, w)| \le \varepsilon \\ |T - f(F, w)| & \text{otherwise} \end{cases} \tag{4}$$

The SVR algorithm has a parameter ε that denotes the insensitive loss. It is used to estimate a linear regression for the feature set while simultaneously reducing ||w||² to minimize the time complexity of the algorithm. Two slack parameters ξ_i, ξ_i* are introduced to handle the reduction problem. Thus, the SVR algorithm is expressed in Eq. 5, and the parameter D ≥ 0 controls the relation between f(F, w) and ε [22]:

$$\text{Minimize } \frac{1}{2}\|w\|^{2} + D \sum_{i=1}^{l} (\xi_i + \xi_i^{*}) \tag{5}$$

$$\text{subject to: } \begin{cases} T_i - \langle w, F_i \rangle - c \le \varepsilon + \xi_i \\ \langle w, F_i \rangle + c - T_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0 \end{cases}$$
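For readers who wish to experiment with ε-insensitive regression of this kind, a hedged scikit-learn sketch is shown below. The data are random placeholders standing in for the eight building features and the heating load, and the hyperparameters C (playing the role of the trade-off parameter D) and epsilon (the insensitivity width ε) are illustrative rather than the configuration used in the study.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder data: 100 samples, 8 features, noisy linear target
rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = X @ rng.random(8) + 0.05 * rng.standard_normal(100)

# Standardize features, then fit epsilon-insensitive SVR with an RBF kernel
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
model.fit(X, y)
print(model.predict(X[:3]))
```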
2.5 General Linear Regression (GLR) Generalized linear regression (GLR) is one of the more flexible forms of linear regression (LR), and it creates models with different response distributions. It establishes the relationship between the dependent variable Y and the set of independent variables using a link function (denoted by g(·)). Various link functions are available in the literature, namely the Identity, Logit, Log, Probit, Complementary Log-Log, and Power links. The GLR model relationship is represented by Eq. 6:

$$E(Y) = \mu = g^{-1}(X\beta) \tag{6}$$

where Xβ is the linear predictor, Y is the response variable, and g(μ) = Xβ. Generalized linear regression supports output variables with any exponential-family distribution type, but it requires a reasonably large dataset. SVR has regularization parameters, which make the user think about avoiding overfitting; unfortunately, the kernel model can be relatively sensitive to overfitting in the model selection stage. The main merit of CHAID is its ability to describe subpopulations using groupings of the analytical elements. ANN can also be used with both categorical and continuous data, but several parameters have to be set in a neural network, and optimizing the network can be challenging, especially to avoid overfitting. CART handles missing values automatically using "surrogate splits"; it is a great way to explore and visualize data, but the model structure can be unstable, and it does a poor job of modeling linear structure. DL is a developing method in the machine learning research field; it mimics the learning nature of the brain to analyze data. ANN, CHAID, SVR, CART, and GLR mainly support a linear relationship between the target and the features, whereas the proposed deep learning algorithm supports a complex nonlinear relationship between the target and the features. This indicates that the proposed deep learning algorithm is a better choice for multi-target regression when compared to ANN, CHAID, SVR, CART, and GLR. Therefore, this paper proposes a deep learning algorithm for predicting the energy efficiency measures (HL and CL) in a multi-target regression setting.
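To see Eq. 6 in practice, a minimal, hedged sketch of fitting a generalized linear model is given below. It uses statsmodels (not a tool referenced in the study), random placeholder data, and the Gaussian family with its default identity link, which is only one of the link choices mentioned above.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder predictors and response (stand-ins for the building features and a load)
rng = np.random.default_rng(1)
X = rng.random((100, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(100)

# Gaussian family with its default identity link; other link functions
# (log, logit, probit, ...) can be supplied through the family's `link` argument.
glm = sm.GLM(y, sm.add_constant(X), family=sm.families.Gaussian())
result = glm.fit()
print(result.params[:3])   # intercept and first two coefficients
```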
3 Proposed Algorithm In this section, the proposed deep learning technique is briefly discussed.
3.1 Deep Learning Deep learning is created from multi-layered neural networks, but there is a contrast between deep learning and basic multi-layered neural networks: deep learning models are constructed with more than two hidden layers. A multi-layered neural network is designed by the interconnection of neurons; these groups of neurons form the neural architecture. The number of neurons in the input layer is based on the input variables in the dataset, and the most appropriate number of neurons in the hidden layer is selected in the R tool using cross-validation (CV) performance. If the provided information covers several features, then multi-layered neural networks are favored. There are many types of neural networks; among them, the feed-forward neural network is broadly used, and the processing is done in the forward direction from the input to the output layer. The neural architecture consists of input and output layers, along with many hidden layers. Figure 2 shows the input layer consisting of a set of inputs, i.e., features (f_1, f_2, f_3, …, f_n) with associated weights (ω_1, ω_2, ω_3, …, ω_n). The weights of the inputs are randomly selected first; typical values are between −1 and 1 or −0.5 and 0.5. The weighted sum of the input features is represented in Eq. 7, and the nonlinear activation with bias y is denoted by T in Eq. 8 [10, 12, 23]:

$$a = \sum_{i=1}^{p} w_i f_i \tag{7}$$

$$T = f(a + y) \tag{8}$$
4 Experimental Results and Discussion In this section, machine learning algorithms like ANN, SVR, CART, CHAID, and GLR are compared with the proposed deep learning algorithm using the performance metrics MSE, RMSE, and R-square. In this paper, 12 basic simulated 3D building shapes derived from an elementary cube (3.5 × 3.5 × 3.5 m), all with the same volume of 771.75 m³, are used, where each building shape is composed of 18 components (elementary 3D shapes) with diverse surface areas. Table 1 shows the details of the features and the possible number of
Fig. 2 Diagrammatic representation of deep learning model

Table 1 Details of the features and the target variables

Notation of features | Name of features | Number of possible values
F1 | Relative compactness | 12
F2 | Surface area | 12
F3 | Wall area | 7
F4 | Roof area | 4
F5 | Overall height | 2
F6 | Orientation | 4
F7 | Glazing area | 4
F8 | Glazing area distribution | 6
T1 | Heating load | 586
T2 | Cooling load | 636
Table 2 Comparison of experimental results with DL

Method | CL MSE | CL RMSE | CL R-square | HL MSE | HL RMSE | HL R-square
ANN | 2.8156 | 1.672 | 0.9682 | 0.3721 | 0.61 | 0.9961
SVR | 2.7126 | 1.647 | 0.9702 | 0.1197 | 0.346 | 0.9982
CART | 3.3892 | 1.841 | 0.9623 | 0.64 | 0.8 | 0.9921
CHAID | 3.4559 | 1.859 | 0.9623 | 0.8262 | 0.909 | 0.9911
GLR | 3.0276 | 1.74 | 0.9663 | 1.0795 | 1.039 | 0.9911
DL-H2O | 0.9965 | 0.9982 | 0.99995 | 0.9961 | 0.998 | 0.9999
DL-MXNet | 0.7301 | 0.8544 | 0.9027 | 0.7383 | 0.8592 | 0.9427
values in the target variables [8]. The experimental results of the tested algorithms are presented in Table 2.
The mean square error (MSE) provides information about the fit of the regression line; the lower the MSE value, the better the model. The root mean square error (RMSE) is an error measure used to calculate the deviations of the predicted values from the actual values; the individual deviations are also denoted as "residuals". The RMSE evaluates the magnitudes of the errors. It is a very good estimator of accuracy that is used to compare prediction errors from different estimators for a certain variable, but not between variables, since this measure is scale-dependent. R-squared (R², the coefficient of determination) is a statistical measure in a regression model that quantifies the proportion of variance in the dependent variable that can be explained by the independent variables; that is, R-square indicates how well the data fit the regression model. R-square can take any value between 0 and 1, and a value closer to 1 indicates a better model.
Table 2 shows that, for CL, deep learning using MXNet gives the lowest error rates for the two measures, MSE (0.7301) and RMSE (0.8544), with an R-square value of 0.9027. SVR gives the second-lowest error rates, MSE (2.7126) and RMSE (1.647), with the highest R-square value (0.9702). ANN achieves the third-lowest error rates, MSE (2.8156) and RMSE (1.672), with an R-square value of 0.9682. GLR achieves the fourth-lowest error rates, MSE (3.0276) and RMSE (1.74), with an R-square value of 0.9663. For the cooling load, CART achieves the fifth-lowest error rates, MSE (3.3892) and RMSE (1.841), with an R-square value of 0.9623, and CHAID achieves the sixth-lowest error rates, MSE (3.4559) and RMSE (1.859), with an R-square value of 0.9623. For HL, SVR achieves the lowest error rates, MSE (0.1197) and RMSE (0.346), with the highest R-square value (0.9982). ANN achieves the second-lowest error rates, MSE (0.372) and RMSE (0.61),
with an R-square value of 0.9961. For HL, CART achieves the third-lowest error rates, MSE (0.64) and RMSE (0.8), with an R-square value of 0.9921; CHAID achieves the fourth-lowest error rates, MSE (0.8262) and RMSE (0.909), with an R-square value of 0.9911; DL achieves the fifth-lowest error rates, MSE (0.7383) and RMSE (0.8592), with an R-square value of 0.9427; and GLR achieves the sixth-lowest error rates, MSE (1.0795) and RMSE (1.039), with an R-square value of 0.9911.
The experimental results clearly show that, for the proposed deep learning model, the MSE (0.7301) and RMSE (0.8544) for CL are very low when using the MXNet package, and the R-square value for CL is high (0.99995) when using the H2O package, compared with the existing techniques. The MSE and RMSE for HL are moderate in both deep learning packages relative to the other existing methods, but the R-square value in the H2O package (0.9999) is higher than for all the other existing techniques. Therefore, the proposed deep learning model provides high accuracy for the prediction of both CL and HL. Hence, the proposed deep learning technique is the most suitable algorithm for predicting CL and HL because it gives high accuracy compared with the existing techniques (Figs. 3, 4, 5, and 6).
Fig. 3 Comparison of CL for RMSE and MSE
Fig. 4 Comparison of cooling load for R2
Fig. 5 Comparison of HL for RMSE and MSE
Fig. 6 Comparison of heating load for R2
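The error measures summarized in Table 2 and Figs. 3–6 (MSE, RMSE, and R-square) can be reproduced with standard library routines. The hedged sketch below uses hypothetical actual and predicted cooling-load values rather than the study's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual vs. predicted cooling-load values (illustrative only)
y_true = np.array([15.2, 28.1, 33.4, 21.7, 40.9])
y_pred = np.array([14.8, 27.5, 35.0, 22.3, 39.6])

mse = mean_squared_error(y_true, y_pred)   # mean of squared residuals
rmse = np.sqrt(mse)                        # root mean square error
r2 = r2_score(y_true, y_pred)              # coefficient of determination
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```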
5 Conclusion The working performance of different machine learning techniques, like ANN, SVR, CART, CHAID, DL, and GLR, is compared on simulated building designs to predict HL and CL. The proposed deep learning method performs effectively in various environmental applications, like building energy consumption. Several building features were used as input features for HL and CL, and 768 simulated building designs were used for this prediction model. The proposed DL algorithm is easy to use and requires fewer parameters and tests to tune when compared to building simulation software. Support vector regression can be a suitable forecasting model for HL, while the proposed deep learning model is more suitable for cooling load prediction. The proposed deep learning algorithm is efficient for designing and analyzing energy-saving building structures, and it can play an important role in saving energy. It is also a fast and accurate method for forecasting the CL and HL of buildings that can help
engineers and architects design efficient buildings. Hence, it can be used to minimize the cost of building energy-efficient buildings and to reduce the time of the construction works.
References 1. Lin, Y., Zhou, S., Yang, W., Shi, L., & Li, C.-Q. (2018). Development of building thermal load and discomfort degree hour prediction models using data mining approaches. Energies, 11(6), 1570. 2. Castelli, M., Trujillo, L., Vanneschi, L., & Popovi, A. (2015). Prediction of energy performance of residential buildings: A genetic programming approach. Energy and Buildings, 102, 67–74. 3. Fan, C., Liao, Y., & Ding, Y. (2019). Development of a cooling load prediction model for air-conditioning system control of office buildings. International Journal of Low-Carbon Technologies, 14(1), 70–75. 4. Li, C., Ding, Z., Zhao, D., Yi, J., & Zhang, G. (2017). Building energy consumption prediction: An extreme deep learning approach. Energies, 10(10), 1525. 5. Yu, Z., Haghighat, F., Fung, B. C. M., & Yoshino, H. (2010). A decision tree method for building energy demand modeling. Energy and Buildings, 42(10), 1637–1646. 6. Chou, J.-S., & Bui, D.-K. (2014). Modeling heating and cooling loads by artificial intelligence for energy-efficient building design. Energy and Buildings, 82, 437–446. 7. Kumar, S., Pal, S. K., & Singh, R. P. (2018). A novel method based on extreme learning machine to predict heating and cooling load through design and structural attributes. Energy and Buildings, 176, 275–286. 8. Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49(2012), 560–567. 9. Yezioro, A., Dong, B., & Leite, F. (2008). An applied artificial intelligence approach towards assessing building performance simulation tools. Energy and Buildings, 40(4), 612–620. 10. Fayaz, M., & Kim, D. (2008). A prediction methodology of energy consumption based on deep extreme learning machine and comparative analysis in residential buildings. Electronics, 7(10), 222. 11. Dong, B., Cao, C., & Lee, S. E. (2005). Applying support vector machines to predict building energy consumption in tropical region. Energy and Buildings, 37(5), 545–553. 12. Huval, B., Adam, C., & Ng, A. (2013). Deep learning for class-generic object detection. arXiv:1312.6885. 13. Qiu, X., Zhang, L., Ren, Y., Suganthan, P. N., & Amaratunga, G. (2014). Ensemble deep learning for regression and time series forecasting. In 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL) (pp. 1–6). IEEE. 14. Ekici, B. B. (2016, June). Building energy load prediction by using LS-SVM. Energy and Buildings, 3(3). 15. Pandey, S., Hindoliya, D. A., & Pandey, R. (2011). Artificial neural networks for predicting cooling load reduction using roof passive cooling techniques in buildings. International Journal of Advanced Research in Computer Science, 2(2). 16. Alam, A. G., Chang, I. B., & Han, H. (2016). Prediction and analysis of building energy efficiency using artificial neural network and design of experiments. In Applied mechanics and materials (Vol. 819, pp. 541–545). Trans Tech Publications. 17. Kass, G. V. (1980). An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 29(1), 119–127. 18. Kwok, S. S. K., Lee, E. W. M. (2011). A study of the importance of occupancy to building cooling load in prediction by intelligent approach. Energy Conversion and Management, 52(7), 2555–2564; SPSS, Clementine 12.0 Algorithm Guide, Integral Solutions Limited, Chicago, IL, 2007.
19. Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. New York, NY: Chapman & Hall/CRC. 20. Biggs, D., De Ville, B., & Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18(1), 49–62. 21. Li, Q., Meng, Q., Cai, J., Yoshino, H., & Mochida, A. (2009). Applying support vector machine to predict hourly cooling load in the building. Applied Energy, 86(10), 2249–2256. 22. Yang, I. T., Husada, W., & Adi, T. J. W. (2016). Data mining methods in structural reliability estimation. In Proceedings of the 16th International Conference on Computing in Civil and Building Engineering. 23. Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (pp. 160–167). ACM. 24. Singh, T. N., Sinha, S., & Singh, V. K. (2007). Prediction of thermal conductivity of rock through physico-mechanical properties. Building and Environment, 42(1), 146–155.
Cloud Image Prior: Single Image Cloud Removal Anirudh Maiya and S. S. Shylaja
Abstract Cloud removal from satellite imagery is a well-known problem in both remote sensing and deep learning. Many methods have been developed to address the cloud removal problem in a supervised setting. These methods require gathering of huge datasets to learn the mapping from cloudy images to cloud-free images. In this paper, we address cloud removal as an inverse problem. We convert this problem into an inpainting task in which cloudy regions are treated as missing pixels and completely solve the cloud removal problem in an unsupervised setting. We show that the structure of a network is a good prior by itself and is sufficient to remove clouds from satellite imagery using Deep Image Prior algorithm. Experimental results on Sentinel-2 Imagery have quantitatively and qualitatively demonstrated the effectiveness of our technique on a diverse range of clouds. Keywords Deep image prior · Cloud image prior · Convolutional neural networks · Inpainting · U-NET · Meshgrid · Peak signal-to-noise ratio · Structural similarity index measure
1 Introduction Neural networks have been around for a long time. Their increasing popularity is mainly credited to their application in automating/solving various image-related tasks such as segmentation [1], classification [2], image-captioning, and more. LeNet [2] was one of the first convolutional neural networks (ConvNets), and it was used to recognize handwritten zip codes. Ever since the invention of AlexNet [3], which showed that ConvNets and GPU training go hand-in-hand, ConvNets have been extensively used to win the ImageNet Large Scale Visual Recognition Challenge, most notably [4, 5]. A. Maiya (B) · S. S. Shylaja Department of Computer Science and Engineering, PES University, Bangalore, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_8
Fig. 1 Images containing clouds from the Sentinel-2 dataset
ConvNets have been used to solve problems in the remote sensing domain such as semantic segmentation of crops [6], detecting urban footprints [7], etc. Oftentimes, clouds obstruct the analysis of the abovementioned tasks (Fig. 1). Therefore, cloud-free images are desirable for any segmentation and classification tasks in multispectral remote sensing imagery. Hence, it is imperative to add cloud removal as a pre-processing step to carry on further analysis. ConvNets have also been used to solve a wide variety of image inverse problems such as deblurring, inpainting, denoising, dehazing, etc. Solving an inverse problem is the process of calculating parameters from a distribution that has more noise than signal and is often incomplete. For example, removing a certain unknown noise distribution from an image (denoising) is an inverse problem. Solutions to inverse problems are non-unique and ill-posed. Existing methods solve the cloud removal problem in a supervised setting, i.e., they gather huge datasets consisting of both cloudy and cloud-free images and employ deep learning techniques to learn the prior through this data. Singh and Komodakis [8] have used Generative Adversarial Networks to learn the distribution between input (cloudy) and target (cloud-free) domains with an additional cycle consistency loss. Chen et al. [9] have used multi-source data (content, spectral, and texture) to build a unified framework which detects and removes clouds. We eliminate the step of gathering a dataset to learn the mapping from the cloudy to the cloud-free distribution and solve the cloud removal problem as an inpainting task. Our work is mainly based on Deep Image Prior [10], which shows that the structure of a ConvNet is a sufficient prior for such an inpainting task. We call this method Cloud Image Prior.
2 Problem Formulation Deep Image Prior (DIP) works in a supervised manner in an unconventional way. DIP shows that a great deal of information is stored in the structure of the network itself. DIP also assumes that we have a network structure that has a powerful prior. In other words, the only prior information that the network has is the structure of the network itself. For example, U-Net is a network with a powerful prior. The following
section is a proof that DIP does not require any prior information on the dataset to work in a supervised manner. Consider the image restoration (IR) problem, where y is the clean image, ŷ is the corrupted image, and y* is the restored image. According to Bayes' theorem,

$$P(y \mid \hat{y}) = \frac{P(\hat{y} \mid y)\,P(y)}{P(\hat{y})} \tag{1}$$
where P(y) is the prior, P(ŷ|y) is the probability of observing a corrupted image ŷ given a hypothesis y, and P(y|ŷ) is the posterior probability, which tells how confident we are in our hypothesis y after observing the corrupted data ŷ. Choosing the maximally probable hypothesis y from a set of hypotheses y ∈ H, where H is a set of candidate hypotheses, Eq. 1 reduces to

$$y_{MAP} \equiv \operatorname*{argmax}_{y \in H} P(y \mid \hat{y}) \;\propto\; \operatorname*{argmax}_{y \in H} P(\hat{y} \mid y)\,P(y). \tag{2}$$
According to DIP's assumptions, the model has no prior information about the clean image. Hence, P(y) is a constant. Therefore, $y_{MAP} \propto \operatorname*{argmax}_{y \in H} P(\hat{y} \mid y)$.
Assuming y follows a normal distribution, we know that maximizing the likelihood is the same as minimizing the sum of squared errors. Let J(y; ŷ) denote the sum of squared errors. We further convert the parameters from the image space to the parametric space θ that the neural network operates upon:

$$y^{*} = \operatorname*{argmin}_{y} J(y; \hat{y}), \qquad \theta^{*} = \operatorname*{argmin}_{\theta} J(f_{\theta}(z); \hat{y}) = \operatorname*{argmin}_{\theta} \| f_{\theta}(z) - \hat{y} \|^{2}, \tag{3}$$
where f_θ(z) is an operator that maps from the parametric space to the image space and can be thought of as a convolutional neural network with parameters θ. A fixed noise vector z sampled from a uniform distribution serves as the input to this convolutional neural network. Hence, DIP does not require any prior information about the dataset to learn a good prior; rather, it assumes that the prior given by the network structure is enough for solving image restoration problems. To sum up the procedure, we use a gradient descent optimizer from which a local minimizer θ* is found. The resulting restored image is given by f_{θ*}(z).
Fig. 2 Framework of our proposed model. U-Net (1) is the model that performs cloud segmentation in a supervised setting. U-Net (2) is the model that performs cloud removal on a single image with no prior information about the dataset other than the randomly initialized network itself. Z is sampled from a uniform distribution or is a meshgrid, and ⊙ is Hadamard's product
3 Proposed Framework and Methodology In our framework, the process of removal of the cloud is done in two stages— cloud segmentation and cloud removal. Hence, two separate architectures are used as shown in Fig. 2. Image inpainting is an inverse problem where a corrupted image yˆ has missing pixels. In order to fill these missing pixels, the structure of the missing pixels must be shown to the network to solve the inpainting problem. Therefore, a separate model is created to detect/segment the cloud and this model is called cloud segmentation.
3.1 Cloud Segmentation Segmentation is the process of classifying each pixel in an image to a particular class. In the context of cloud segmentation, the task is to classify a pixel into cloudy and cloud-free region. Hence, the output of the segmentation model is a binary mask indicating cloudy and cloud-free regions. This model requires a dataset with cloudy images as the input and binary masks indicating regions of cloud as the output (supervised setting).
3.2 Cloud Removal In this part of the framework, motivated by the Deep Image Prior algorithm, which assumes that the structure of a network is by itself a good prior for solving an inverse problem, we carry on our analysis. We use the binary mask obtained from the cloud segmentation model mentioned in the previous step to denote the missing pixels (cloudy region) for our cloud removal model. Hence, no dataset comprising cloudy images and their corresponding cloud-free regions is used for the cloud removal model.
4 Dataset In this paper, our dataset is obtained from Sentinel-2 imagery. The Sentinel-2 dataset is free of charge and has an open data policy. We choose Band-2 (Blue), Band-3 (Green), and Band-4 (Red) out of the 13 bands present in the dataset. We use the default color ranges of the chosen bands (RGB) with no false coloring. We select Paris as the place to carry out our analysis since [8] have used the same region. We choose imagery from the years 2017 to 2019, covering all months from January to December. We further split our images into patches of 768 × 768. Considering the overall diversity of the images, we choose 40 images for the task of cloud segmentation. Since the segmentation problem is solved in a supervised manner, we create binary masks indicating cloudy and cloud-free regions of the same size (768 × 768) using the Pixel Annotation Tool.
5 Architecture In this paper, we use the U-Net architecture, which has skip connections. Skip connection architectures are of great use in image generation tasks. Also, skip connections alleviate the vanishing gradient problem as the network gets deeper. We use two variants of U-Net to solve the cloud segmentation and cloud removal problems, where the former has skip connections between the down-sampling and up-sampling layers and the latter has no skip connections. Figure 3 depicts the respective architectures used in this paper.
Fig. 3 Architectures used for cloud segmentation and cloud removal model
6 Training 6.1 Cloud Segmentation The cloud segmentation model was trained on 40 images. Data augmentation techniques, such as horizontal and vertical flips, random rotation, and the addition of Gaussian noise, are applied to increase the robustness of our model. Binary cross-entropy is used as the loss function since the output consists of two classes (cloudy and cloud-free regions). The Adam optimizer is used with a step size of 0.001 for 50 epochs.
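A condensed PyTorch sketch of this training setup is given below. The `UNet` model and the `cloud_loader` data loader are assumptions (stand-ins for the paper's segmentation network and its augmented 768 × 768 patches) and not part of any released code.

```python
from torch import nn, optim

def train_segmentation(model: nn.Module, cloud_loader, epochs: int = 50):
    """Train a sigmoid-output U-Net on (image, binary_mask) pairs with BCE + Adam."""
    criterion = nn.BCELoss()                              # two classes: cloudy vs. cloud-free
    optimizer = optim.Adam(model.parameters(), lr=1e-3)   # step size 0.001 as in the text
    for _ in range(epochs):
        for image, mask in cloud_loader:
            optimizer.zero_grad()
            prediction = model(image)                     # sigmoid output in [0, 1]
            loss = criterion(prediction, mask)
            loss.backward()
            optimizer.step()
    return model
```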
6.2 Cloud Removal The cloud removal problem is solved with the Cloud Image Prior algorithm. We solve the cloud removal problem without any dataset consisting of scenes containing clouds and their corresponding cloud-free regions. We only show the cloudy image, along with the segmentation mask obtained from the cloud segmentation model, to the cloud removal model. The input z for U-Net(2) is created with numpy.meshgrid, where z ∈ R^{H×W×2} is fixed during the optimization process. For every cloudy image we resample z and reinitialize the network with random weights. Adam is used to optimize the parameters of the network with an initial learning rate of 0.05 for 6000 iterations, which is further decayed by a factor of 2 at 8000 and 10000 iterations. The detailed algorithm is given in Algorithm 1.
Algorithm 1 Cloud Image Prior for a single image
Require: ŷ – corrupted image (cloudy image)
Require: m – mask (from cloud segmentation)
Require: α – step size
1: Initialize z with meshgrid, where z ∈ R^{H×W×2}
2: Obtain segmentation mask m from the trained cloud segmentation model
3: Solve the optimization problem by finding the parameters of the network:
   θ_t = θ_{t−1} − α · ∇_θ J(θ_{t−1}), where J is the cost function defined as
   J(f_θ(z), ŷ) = ||(f_θ(z) − ŷ) ⊙ m||²
4: Find the solution to the inpainting problem after the model has converged:
   y* = f_{θ*}(z)
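The following PyTorch sketch condenses Algorithm 1 into code. The `generator` (a U-Net without skip connections accepting a 2-channel input), the tensor shapes, the mask convention (1 for cloud-free pixels, 0 for cloudy pixels), and the omission of the learning-rate decay schedule are simplifying assumptions made for illustration.

```python
import numpy as np
import torch
from torch import nn, optim

def cloud_image_prior(cloudy, mask, generator: nn.Module, steps=10000, lr=0.05):
    """Sketch of Algorithm 1. `cloudy` is a (1, C, H, W) tensor, `mask` is (1, 1, H, W)."""
    _, _, H, W = cloudy.shape
    # Fixed network input z in R^{H x W x 2}, built with numpy.meshgrid
    ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
    z = torch.from_numpy(np.stack([ys, xs])).float().unsqueeze(0)   # (1, 2, H, W)

    optimizer = optim.Adam(generator.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        output = generator(z)
        # Masked reconstruction loss: only known (cloud-free) pixels contribute
        loss = torch.sum(((output - cloudy) * mask) ** 2)
        loss.backward()
        optimizer.step()
    return generator(z).detach()        # y* = f_{theta*}(z)
```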
Fig. 4 Optimization process carried on by cloud removal Model for 10000 iterations. At the end of 10000 iterations, a cloud-free image is obtained
7 Results We solve the cloud removal problem using the cloud removal model in an unsupervised setting. Figure 4 shows the optimization process of our model with respect to the number of iterations. Interestingly, Fig. 4 shows that the cloud removal model is able to conclude that a river patch lies below the cloud without knowing the true distribution of the dataset. We perform our experiments on four different scenes consisting of clouds. Figure 5 demonstrates the results of our model evaluated on a wide variety of images with varying percentages of clouds present in them. The results show that the cloud removal model is able to interpolate through the cloudy region by utilizing the context of the image from the adjacent cloud-free regions. Due to the unavailability of their code, we do not provide a comparison with [8]. We further report quantitative results such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) to demonstrate the efficacy of our model. Since we could not obtain pairs of cloudy images with the corresponding cloud-free images, we add Perlin noise to cloud-free images and report the quantitative results on four different scenes. The results in Fig. 6 show that our model is able to increase the peak signal-to-noise ratio and structural similarity index measure. We also report our results on thick/dense clouds, where our model fails to find the corresponding cloud-free image. Figure 7 demonstrates that these thick/dense clouds obstruct the image completely and hence our model lacks the context of the
Fig. 5 Results from our cloud removal. Row 1 shows the cloudy image and Row 2 shows the corresponding cloud-free image
Fig. 6 Quantitative results such as PSNR and SSIM on four different scenes
Fig. 7 Grainy output is produced when the image is significantly covered with clouds
surrounding pixels to interpolate onto these cloudy regions ultimately resulting in grainy output.
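The reported PSNR and SSIM values can be computed with scikit-image. The sketch below uses synthetic single-channel arrays in place of the actual reference and restored scenes, so the printed numbers are illustrative only.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Hypothetical cloud-free reference and restored output (single channel, values in [0, 1])
rng = np.random.default_rng(0)
reference = rng.random((768, 768))
restored = np.clip(reference + 0.01 * rng.standard_normal((768, 768)), 0, 1)

psnr = peak_signal_noise_ratio(reference, restored, data_range=1.0)
ssim = structural_similarity(reference, restored, data_range=1.0)
print(f"PSNR={psnr:.2f} dB  SSIM={ssim:.4f}")
```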
8 Conclusions We have introduced cloud image prior, in which cloud replacement is done with a single image. Our model eliminates the process of gathering huge datasets before performing cloud removal. We also report quantitative image quality metrics such as PSNR and SSIM scores on four different synthetic scenes, where our model was able to increase the value of the abovementioned metrics. Although our model was not able to successfully replace a cloud that was fully enveloping the image, cloud image prior is nevertheless a good baseline for other models that are trained in a supervised setting. The future scope of this paper would be to utilize all bands from Sentinel-2 imagery and use ground information to aid the cloud removal process.
References 1. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. CoRR, arXiv:abs/1505.04597. 2. Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. 3. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105. 4. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. CoRR, arXiv:abs/1512.03385. 5. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, arXiv:abs/1409.1556. 6. Wang, A., Xu, Y., Wei, X., & Cui, B. (2020). Semantic segmentation of crop and weed using an encoder-decoder network and image enhancement method under uncontrolled outdoor illumination. IEEE Access, 8, 81724–81734. 7. Li, W., He, C., Fang, J., Zheng, J., Fu, H., & Yu, L. (2019). Semantic segmentation-based building footprint extraction using very high-resolution satellite images and multi-source GIS data. Remote Sensing, 11(4), 403. 8. Singh, P., & Komodakis, N. (2018). Cloud-Gan: Cloud removal for sentinel-2 imagery using a cyclic consistent generative adversarial networks. In IEEE International Geoscience and Remote Sensing Symposium (pp. 1772–1775). 9. Chen, Y., Tang, L., Yang, X., Fan, R., Bilal, M., & Li, Q. (2019). Thick clouds removal from multitemporal ZY-3 satellite images using deep learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 143–153. 10. Ulyanov, D., Vedaldi, A., & Lempitsky, V. S. (2020). Deep image prior. International Journal of Computer Vision, 128, 1867–1888.
Prioritizing Python Code Smells for Efficient Refactoring Using Multi-criteria Decision-Making Approach Aakanshi Gupta , Deepanshu Sharma , and Kritika Phulli
Abstract Software is subjected to regular modifications for the fulfilment of new specifications of the end user, which might lead to design issues in the software. The cost of maintaining a good quality software tends to be toilsome when design issues such as code smells are involved enormously. The best solution to rectify code smells is to redesign the code through refactoring without compromising its functionality. This research proposes a ranking order of the considered code smells for realizing the priority for a cost-effective and efficient refactoring process using the Multi-Criteria Decision-Making (MCDM) approach. The considered five Python-based code smells (Cognitive Complexity, Collapsible "IF", Many Parameter, Naming Convention, Unused Variable) have been examined over 10552 classes of 20 Python applications. The Python classes affected with any of the considered code smells have been explored for obtaining the critical software metrics (criteria) through machine learning rule-based classifiers and are further evaluated for weight estimation. The obtained order rankings have been achieved by the application of "VIKOR": an MCDM technique, supported by its compromise solution. The weight estimators, i.e., Shannon's entropy and CRITIC method, were individually applied with this MCDM technique. This research determines the imperative software metrics with an accuracy of approx. 90% alongside the respective weight estimators. The prominent outcomes portray the preference scores obtained from the VIKOR technique to be the order ranking for the considered code smells in Python applications. It was observed that the Collapsible "IF" smell, being the most critical smell, should be primarily considered for refactoring, followed by the cognitive complexity smell, for improved and sustainable development of Python software systems. Keywords Machine learning · Multi-criteria decision making · Python code smells · VIKOR · Weight estimation (Shannon entropy and CRITIC) A. Gupta · K. Phulli (B) Department of Computer Science and Engineering, Amity School of Engineering and Technology, GGSIPU, New Delhi, India e-mail: [email protected] D. Sharma Executive Branch-IT, Indian Navy, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_9
1 Introduction The current software industry demands good-quality and highly maintainable software applications. It is believed that over 75% of the cost is associated with the maintenance phase [1]. However, there exist some design flaws, termed code smells, which hinder maintainability and pose a severe threat to software evolution and modifiability in the form of technical debt [2]. Code smells tend to violate design principles and increase the technical debt [3, 4, 5]. Pragmatically, a single software system might be affected by various code smells at a time. This infection causes the software to deteriorate in performance and efficiency to a significant extent. One of the most promising approaches to eliminate code smells is the refactoring process [3]. Researchers, however, find it challenging to refactor all the smells simultaneously due to the time consumed. Consequently, it is desirable for the code smells to be ranked according to their impact on the software code. Many studies have observed the impact of code smells on various maintainability-related aspects [6, 7, 8], especially changes, effort, modularity, comprehensibility, and defects [9, 10, 11, 12]. A code smell implies the kind of refactoring to be applied [13, 14, 15, 16]. Refactoring involves restructuring the internal structure of the code without hampering its external behavior, and it helps in enhancing the performance and increasing the maintainability of the code [17]. However, refactoring an extensive system involves several sources of uncertainty concerning the priority levels of the code smells to be rectified along with the location of the smells in the relevant classes [18]. One code smell may affect the code quality more than the other code smells present in the system. An empirical study holds that the smells introduced in the initial stages of development should be removed at the earliest to avoid further maintainability problems [19, 20]. It is possible that a less prioritized code smell gets refactored and utilizes resources and human effort, which might not increase the maintainability of the software. It is only since 2016 that code smells have been detected for the Python language [21]. This research studies the diffusion of code smells in Python software along with their prioritization. It has been seen that the Python language, in terms of code smells, has rarely been explored, though it is among the most popular languages. Moreover, Python, being a dynamically typed, strong multipurpose language, has extensive support for multiple libraries and is viable for building complex modules. The smells considered for the study involve the generic code smells that infect Python software systems, making it unfavorable for a developer to refactor them all simultaneously without knowing the extent to which each impacts the software system. The above factors mark the necessity to prioritize these identified code smells and only eliminate those code smells (by applying suitable refactoring) that profoundly affect the improvement of the underlying system design. Additionally, refactoring depends both on quantitative measures and qualitative measures, which include the impact and assessment of code smells on the quality of the code. Since both
measures depend upon a variety of criteria [22, 17], the refactoring effort calculation, referred to as the prioritization of code smells, is estimated through the Multi-Criteria Decision-Making (MCDM) technique [23, 24, 5]. This approach can sort the components based on their need for refactoring. MCDM is considered a branch of operational research that deals with finding optimal results in complex scenarios involving various indicators, conflicting objectives, and criteria. MCDM helps a decision-maker quantify the criteria based on their importance in the presence of other objectives. Multi-criteria optimization determines the best feasible solution according to the established criteria (representing different effects). Mareschal studied the stability of the ranking under changes in the criteria weights [25]. However, some situations may cause a conflict in preferencing, which might require a compromise solution. One method for resolving this conflict is the VIKOR method, which ranks the alternatives and determines the compromise solution that is closest to the ideal one [26]. This method employs a weight estimation technique for the determination of the multi-criteria weights. For this, two types of objective weight estimation approaches have been followed in this study: the ENTROPY (Shannon entropy) method and the CRITIC (CRiteria Importance Through Intercriteria Correlation) method [27, 28]. The concern of refactoring all the code smells encourages the authors to propose an order preferencing of the code smells of Python-based systems for refactoring needs using the VIKOR MCDM technique along with its weight estimators. The rankings obtained will help Python developers refactor the most prevalent and harmful of the five considered code smells, reduce their effort, and build good-quality, highly maintainable, and efficiently modifiable software systems for future extensive development in the Python domain.
1.1 Research Contributions

1. Applied the machine learning technique for the extraction of critical software metrics vital for order preferencing of Python code smells.
2. Evaluated the objective weights of the selected software metrics through the Shannon entropy and CRITIC method.
3. Proposed a ranking of Python code smells through VIKOR: a multi-criteria decision-making approach.
Structure of the Paper: The paper is arranged as follows: In Sect. 2, the related work is highlighted; Sect. 3 presents the methodology; Sect. 4 presents the context selection; and the experimental setup of the work is described in Sect. 5. The results and discussion are given in Sect. 6, and the threats to validity are described in Sect. 7. Finally, Sect. 8 reviews the conclusions of the work and Sect. 9 briefs the future scope of this work.
2 Related Studies Code smells are symptoms of issues at the design level that can be rectified through appropriate refactoring techniques [4]. Fowler has proposed 22 generic code smells [4]. This section presents the work already carried out in the field of prioritization of different code smells. Code smell prioritization strategies remain a subject of recent studies in the literature. The Python software domain has been explored less, perhaps due to its dynamic nature. Chen et al. investigated Python code smells [21]; they developed a code smell detector named PySmell, examined Python code smells under three approaches, and determined smelly modules to be more change- and fault-prone than non-smelly modules [21]. In studies by Harland and Akerblom et al., respectively, Python programs have been investigated for their prominent dynamic activities at program start-up [29, 30]. A hybrid dynamic slicing method was proposed by Chen et al. in 2014, capturing data dependencies through a bytecode interpreter for Python [22]. Xu et al. proposed a static slicing approach for Python objects [31]. A substantial amount of research has been done to prioritize and optimize refactoring [22, 23]. Along with selecting the appropriate refactoring strategy, it is equally essential to determine the sequence of the refactoring procedure for code smells for highly maintainable products [22]. Some research focuses on ranking refactoring recommendations when it comes to code smell prioritization. For instance, Tsantalis and Chatzigeorgiou [32] proposed a technique for ordering the refactoring suggestions given by JDeodorant using historical information. Marinescu defined a method to rank smell instances based on their severity using heuristic-based code smell detectors [30]. Apart from design flaws, there are many other studies where the application of MCDM techniques has been observed. Onder and Dag proposed an approach for the supplier selection problem based on AHP and improved TOPSIS [31]. Sehgal et al. applied MCDM approaches in a similar context using the Fuzzy TOPSIS method [22]. Another integrated approach by Yang Wu studies the construction of an evaluation model in health welfare for sustainability [33]. Zaidan et al. presented an approach based on integrated TOPSIS and AHP for selecting optimal open-source EMR software packages [34]. A comparison between the TOPSIS and VIKOR techniques for quality-of-service-based network selection has recently been studied by Malathy et al. [35]. Past research has seen many applications of the CRITIC method; for example, it was applied to a sample of Greek pharmaceutical industries by Diakoulaki et al. [27]. Also, a water resource management model was developed by Yılmaz and Harmancıoglu for Turkey's Gediz River Basin [36]. The alternatives for management were estimated with Compromise Programming, Simple Additive Weighting, and the Technique for Order Preference by Similarity to Ideal Solution. During these estimations, the Analytic Hierarchy Process method was used for subjective criteria, and the CRITIC method, the standard deviation method, and the Entropy method were used for objective criteria.
3 Research Methodology This research advances the study of Python-based code smells in terms of their order ranking to realize the prioritization of the code smells diffused in Python applications, as shown in Fig. 1. Figure 1 explains the step-by-step study for obtaining the ranking of the considered code smells. This study has been performed in conjunction with Multi-Criteria Decision-Making (MCDM) methods, which assist in evaluating the multiple conflicting criteria (i.e., software code metrics) depending upon their prominence in the code smells.
Fig. 1 Workflow of the study
The initial phase of the study concentrates on dataset collection in terms of smell selection and application selection. The subsequent process helps in detecting the smells and exploring the critical software metrics that tend to harm the software code severely. Following this, weight calculations have been performed through two weight-determining techniques—Shannon entropy and the CRITIC method. Finally, the weights extracted from each technique were individually employed to compute the preference scores using the VIKOR method, obtaining the order rankings along with a set of compromise solutions in the presence of conflicting criteria. These statistics indicate the order in which the code smells should be refactored at the earliest and given primary attention to avoid significant maintainability issues and further development complications.
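To make the weight-estimation step concrete, a generic Python sketch of Shannon entropy weighting is shown below. The decision-matrix values are purely hypothetical, and the function illustrates the standard entropy-weight formula rather than the exact computation performed in this study.

```python
import numpy as np

def entropy_weights(decision_matrix):
    """Objective criteria weights via Shannon entropy.
    Rows = alternatives (code smells), columns = criteria (software metrics)."""
    X = np.asarray(decision_matrix, dtype=float)
    m = X.shape[0]
    P = X / X.sum(axis=0)                                   # column-wise proportions
    E = -np.sum(P * np.log(P), axis=0) / np.log(m)          # entropy per criterion
    d = 1.0 - E                                             # degree of divergence
    return d / d.sum()                                      # normalized weights

# Illustrative 5x3 matrix: five smells, three hypothetical metric criteria
M = [[3, 0.4, 12], [7, 0.9, 30], [2, 0.1, 8], [5, 0.5, 20], [4, 0.3, 15]]
print(entropy_weights(M))
```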
4 Context Selection

This section details the dataset collection used for this experiment. The intent is to analyze Python code smells for order preferencing. This includes the selection of the Python software systems and the selection and detection of the Python-based code smells.
4.1 Python Software System Selection

The study is practised on open-source Python software systems cloned from the GitHub repository. The open-source applications were extracted by applying GitHub's most-starred filter and requiring at least 15K lines of code. The software systems are of different sizes and scopes, were developed by various communities, and are still maintained by Python developers. Python applications belonging to web development frameworks, APIs, numerical libraries, and shell development frameworks were considered. A total of 20 Python-based software applications, aggregating to 6817 Python files and 10552 classes, were assessed. These classes were then further analyzed using the statistical techniques required to obtain the desired results. Table 1 lists the Python applications with their required statistics.
4.2 Python-Based Code Smell Selection and Detection

This study intends to detect and prioritize the code smells that contribute to technical debt. One such tool for code smell detection is SonarQube (https://www.sonarqube.org/). Therefore, five code smells are considered, which are detected by this open-source code inspection tool.
Table 1 Python software systems

Name of application | #No. of releases | #Files | #Classes | #Functions
--- | --- | --- | --- | ---
BottlePy | 75 | 34 | 125 | 843
CertBot | 87 | 325 | 585 | 4088
Django-API | 118 | 155 | 1446 | 2649
Docker-compose | 141 | 90 | 178 | 2077
Keras | 49 | 198 | 346 | 3319
Misago | 42 | 720 | 987 | 4153
Pandas | 109 | 864 | 1733 | 20194
Request | 137 | 35 | 79 | 608
Scrapy | 90 | 293 | 868 | 3241
Zeronet | 24 | 220 | 207 | 1839
Authomatic | 16 | 93 | 122 | 432
CPython | 42 | 1942 | 11355 | 52549
Django-Filter | 18 | 113 | 138 | 459
Django-newsletter | 19 | 42 | 97 | 303
Image AI | 13 | 178 | 68 | 731
IPython | 96 | 351 | 501 | 3261
Luigi | 54 | 255 | 1427 | 4874
Powerline | 19 | 173 | 302 | 1598
Ulauncher | 145 | 175 | 167 | 1125
Web2Py | 92 | 396 | 638 | 4265
Turbo Gear | 37 | 165 | 559 | 2378
It is worth noting that the term "code smells" as assigned in SonarQube does not refer to Fowler's popularly known code smells [3]. A dataset of 10552 classes was examined for the existence of code smells. The classes indicating the presence of the respective code smell are marked as TRUE, whereas the rest of the classes (absence of the smell) are marked as FALSE. Table 2 depicts the considered code smells for order preferencing.

Table 2 Code smell description

Code Smells | Description
--- | ---
Cognitive complexity | Cognitive complexity of a function should not be too high
Collapsible "IF" | Collapsible "IF" statement should be merged
Many parameter | Functions, methods, and lambdas should not have too many parameters
Unused variable | Unused local variable should be removed
Naming conventions | Class, function, method name should comply with the naming conventions
5 Experimental Setup

5.1 Data Pre-processing

The diffusion of code smells in the 20 Python applications was examined across the 10552 classes; the visualization is depicted in Fig. 2. To evaluate the critical software metrics for order preferencing, a step-by-step machine learning classification approach for code smells was utilized, as discussed below. Initially, a set of 37 software metrics was extracted using the static analyzer tool SciTools UNDERSTAND (https://scitools.com/). These software metrics act as the multiple criteria for decision-making and are extracted for the 20 Python software applications. In the next step, before classifying the data, a feature selection technique was applied to filter the metrics essential for the presence of a code smell out of the 37 metrics. The Information Gain feature selector was practised with a Ranker searching approach to minimize the complexity of the detection rules. Next, the rule-based approach of supervised learning was used, as it determines the significant software metrics that can severely harm the maintainability of the Python software. Several machine learning classifiers were compared using the paired T test (depicted in Table 3), demonstrating JRip to be the most accurate classifier with 89.81% accuracy.

Fig. 2 Diffusion of Python code smells
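To make the workflow concrete, the following sketch shows how an equivalent metric-selection and classification step could be set up in Python with scikit-learn. It is not the authors' pipeline: the paper presumably relies on an InfoGain/Ranker selector and the JRip rule learner, so mutual information and a shallow decision tree are used here as stand-ins, and the file and column names are hypothetical.

```python
# Hypothetical sketch: information-gain-style metric ranking plus a simple
# rule-style classifier, standing in for the InfoGain/Ranker + JRip workflow.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("python_smell_dataset.csv")        # hypothetical dataset file
X, y = df.drop(columns=["smelly"]), df["smelly"]     # 37 metrics + smell label

# Rank the metrics by mutual information and keep the top five.
scores = pd.Series(mutual_info_classif(X, y), index=X.columns)
top_metrics = scores.sort_values(ascending=False).head(5).index.tolist()

# A shallow decision tree yields threshold rules broadly comparable to a rule learner.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
accuracy = cross_val_score(clf, X[top_metrics], y, cv=10).mean()
print(top_metrics, round(accuracy * 100, 2))
```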
Table 3 Machine learning algorithms comparison using paired T test

ML algorithms | Percent correct (%)
--- | ---
ZeroR | 86.52
OneR | 89.58
PART | 89.78
JRip | 89.81
Table 4 Extracted software metrics along with their comparators

Software metric | Comparator
--- | ---
SumCyclomatic | >
MaxCyclomatic | >
CountDeclInstanceVar | >
CountLine | >
MaxNesting | >
Finally, this process generates a detection rule for smell existence, consisting of five significant metrics along with their suitable comparators for determining their nature, as listed in Table 4.
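For illustration, a detection rule of the kind described above could be applied as a set of threshold checks on the five metrics. The cut-off values and the AND-combination below are placeholders, since the excerpt reports only the selected metrics and their ">" comparators, not the learned rule itself.

```python
# Hypothetical thresholds for the five metrics of Table 4; the real JRip rule
# (and whether the conditions are AND-ed) is not given in this excerpt.
THRESHOLDS = {
    "SumCyclomatic": 20,
    "MaxCyclomatic": 10,
    "CountDeclInstanceVar": 7,
    "CountLine": 200,
    "MaxNesting": 4,
}

def is_smelly(metrics: dict) -> bool:
    """Flag a class as smelly when every metric exceeds its threshold."""
    return all(metrics[name] > limit for name, limit in THRESHOLDS.items())

print(is_smelly({"SumCyclomatic": 35, "MaxCyclomatic": 12,
                 "CountDeclInstanceVar": 9, "CountLine": 310, "MaxNesting": 5}))
```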
5.2 Weight Estimation

The MCDM techniques require a weight associated with each of the criteria (i.e., the five software metrics acquired) for further computations, and each metric has a different weight associated with it. In this research, objective weight-determining methods are preferred, such as the Entropy method, multiple objective programming [37, 38], and principal element analysis [38]. The procedure involves the usage of two weight estimation methods, namely the Entropy method (which measures the amount of information) and the CRITIC method.

Shannon Entropy Weight Estimation Method: Shannon entropy is a fundamental quantity in information theory, based on the quantity of uncertainty in information following probability theory [28]. It is one of the well-known methods for obtaining weights for MCDM problems when a suitable weight based on preferences is difficult to obtain from decision-making experiments. Shannon's information entropy averages a random variable's unpredictability, which corresponds to its information content [37, 38].

Definition: Let X = {x_1, x_2, ..., x_n} be a discrete random variable with probability mass function P(X); the entropy of X is defined as

$H(X) = E(X) = D[-\ln(p(X))]$   (1)
where D is the expected value operator. For a finite sample, the entropy can be explicitly written as

$E(X) = -\sum_{i} P(x_i) \ln P(x_i)$   (2)
The following steps were followed to obtain the weights using the Entropy method:
Step 1: Normalize the decision matrix using the following formula:

$r_{ij} = \frac{x_{ij}}{\sum_{i=1}^{m} x_{ij}}$   (3)

where $r_{ij}$ is an element of the normalized decision matrix and $x_{ij}$ is the criteria value of the variants.

Step 2: Compute the entropy:

$E_j = -h \sum_{i=1}^{m} r_{ij} \ln r_{ij}, \quad j = 1, 2, \ldots, n$   (4)
where the constant "h" guarantees that $E_j$ (j = 1, 2, ..., n) belongs to the interval [0, 1], and m = number of code smells (m = 5):

$h = \frac{1}{\ln(m)} = \frac{1}{\ln(5)} = 0.62133493456$   (5)
Step 3: Compute the weight vector:

$w_j = \frac{1 - E_j}{\sum_{j=1}^{n} (1 - E_j)}, \quad j = 1, 2, \ldots, n$   (6)
where $(1 - E_j)$ is called the degree of diversity of the information involved in the outcomes of the jth criterion.

CRITIC Method (Criteria Importance Through Intercriteria Correlation): This method determines objective weights for decision-making [36]. It incorporates both the conflict among the criteria and the intensity of the contrast for the MCDM problem. It determines the information in the criteria by evaluating the variants through analytical testing of the decision matrix [27, 39], and it searches for contrasts between the criteria by correlation analysis [36]. The procedure for finding the weights through the CRITIC method is as follows.

Let $X_{ij}$ = performance value of the ith alternative on the jth criterion.

Step 1: Normalize the decision matrix. For beneficial criteria,

$X_{ij}^{*} = \frac{X_{ij} - X_j^{\mathrm{worst}}}{X_j^{\mathrm{best}} - X_j^{\mathrm{worst}}}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n$   (7)

For non-beneficial criteria,

$X_{ij}^{*} = \frac{X_j^{\mathrm{worst}} - X_{ij}}{X_j^{\mathrm{worst}} - X_j^{\mathrm{best}}}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n$   (8)
where $X_{ij}^{*}$ is the normalized performance value of the ith alternative on the jth criterion (here, the normalization does not consider the type of criteria).

Step 2: Estimate the standard deviation $\sigma_j$ for each criterion j.

Step 3: Obtain the symmetric matrix (n x n) having elements $r_{jk}$, where $r_{jk}$ is the linear correlation coefficient between the vectors $x_j$ and $x_k$.

Step 4: Measure the conflict created by criterion j with respect to the decision situation defined by the rest of the criteria:

$\sum_{k=1}^{m} (1 - r_{jk})$   (9)

where $r_{jk}$ is the correlation coefficient between two criteria.

Step 5: Determine the quantity of information concerning each criterion:

$C_j = \sigma_j \sum_{k=1}^{m} (1 - r_{jk})$   (10)

where $C_j$ is the quantity of information contained in the jth criterion (a higher value of $C_j$ means a larger amount of information is obtained from the given criterion).

Step 6: Determine the objective weights:

$W_j = \frac{C_j}{\sum_{k=1}^{m} C_k}$   (11)
*Note that the comparator provided by the rule-based classifier indicates whether a criterion is beneficial or non-beneficial. The five essential metrics acquired all possess beneficial criteria (comparator ">"). The above-stated steps resulted in the weights for each software metric under the Entropy and CRITIC weight-determining methods, which are shown as a histogram in Fig. 3.
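A minimal sketch of the two weighting schemes follows, assuming a small, strictly positive decision matrix with beneficial criteria only; the toy numbers below are not the paper's metric data.

```python
# Minimal sketch of the two weighting schemes (Eqs. 3-11) for an
# (m alternatives x n criteria) decision matrix with beneficial criteria.
import numpy as np

def entropy_weights(X):
    P = X / X.sum(axis=0)                                 # Eq. (3): column-wise normalisation
    h = 1.0 / np.log(X.shape[0])                          # Eq. (5)
    E = -h * np.sum(P * np.log(P + 1e-12), axis=0)        # Eq. (4), eps avoids log(0)
    d = 1.0 - E                                           # degree of diversity
    return d / d.sum()                                    # Eq. (6)

def critic_weights(X):
    Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # Eq. (7)
    sigma = Xn.std(axis=0, ddof=1)                        # Step 2
    R = np.corrcoef(Xn, rowvar=False)                     # Step 3: criteria correlations
    C = sigma * np.sum(1.0 - R, axis=0)                   # Eqs. (9)-(10)
    return C / C.sum()                                    # Eq. (11)

X = np.array([[0.15, 0.2, 0.3], [0.4, 0.1, 0.5], [0.9, 0.7, 0.2],
              [0.3, 0.8, 0.6], [0.5, 0.4, 0.9]])          # toy 5 x 3 matrix
print(entropy_weights(X), critic_weights(X))
```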
5.3 VIKOR MCDM Technique

After computing the weights for the multiple criteria, the VIKOR method was selected as the prime MCDM analysis approach for discovering the order preference of the Python-based code smells, owing to its ability to provide a compromise solution along with its standard order-ranking process.
Fig. 3 Weights of software metrics using Entropy and CRITIC methods
VIKOR Method. VIKOR (from the Serbian VIsekriterijumsko KOmpromisno Rangiranje, multi-criteria compromise ranking) proves to be an appropriate method: it deals with every kind of judgment criterion, clarifies the results, and minimizes the difficulty of dealing with parameters and choices. It proposes a multi-criterion optimizing compromise ranking solution, along with a natural order preferencing of the code smells, providing an agreement reached by mutual concession among the smells. The VIKOR method was founded by Yu and Zeleny and later advocated by Opricovic and Tzeng in 1998 [26]. The compromise ranking is performed by comparing the closeness to the ideal alternative, where each alternative is evaluated based on a criterion function. The following steps have been followed, applying the respective weights [26]:

Step 1: Determine the best and the worst value for each criterion from the decision matrix. For beneficial criteria, $X_j^{+}$ = best value = $\max_i(x_{ij})$ and $X_j^{-}$ = worst value = $\min_i(x_{ij})$.

Step 2: Compute the values $S_i$ and $R_i$ for each code smell using the criteria weights $W_j$:

$S_i = \sum_{j=1}^{m} W_j \frac{X_j^{+} - X_{ij}}{X_j^{+} - X_j^{-}}$   (12)

$R_i = \max_j \left[ W_j \frac{X_j^{+} - X_{ij}}{X_j^{+} - X_j^{-}} \right]$   (13)

Step 3: Compute the values of $Q_i$:

$Q_i = \mu \frac{S_i - S^{*}}{S^{-} - S^{*}} + (1 - \mu) \frac{R_i - R^{*}}{R^{-} - R^{*}}$   (14)

where $S^{*} = \min_i S_i$, $S^{-} = \max_i S_i$, $R^{*} = \min_i R_i$, $R^{-} = \max_i R_i$, and $\mu$ is the weight of the strategy of the majority of criteria ($\mu$ = 0.5 is considered here).
Table 5 Order preferencing of Python-based code smells using VIKOR technique

Python Code Smells | Si (Entropy) | Si (CRITIC) | Ri (Entropy) | Ri (CRITIC) | Qi (Entropy) | Qi (CRITIC) | Order by VIKOR
--- | --- | --- | --- | --- | --- | --- | ---
Cognitive complexity | 0.15 | 0.16 | 0.06 | 0.07 | 0.038 | 0.03 | 2nd
Collapsible "IF" | 0.08 | 0.1 | 0.06 | 0.07 | -2.11E-07 | 1.14E-06 | 1st
Many parameter | 0.27 | 0.26 | 0.07 | 0.1 | 0.12 | 0.15 | 3rd
Naming convention | 1 | 1 | 0.23 | 0.27 | 1 | 1 | 5th
Unused variable | 0.66 | 0.67 | 0.16 | 0.2 | 0.61 | 0.63 | 4th
Step 4: Based on the Q, R, and S values, rank the code smells (ascending order of Q). The rankings are described in Table 5.

Step 5: Check that the following two conditions are satisfied for acquiring a compromise solution for the code smells ranked according to Q:

C1: Acceptable advantage:

$Q(A_2) - Q(A_1) \geq DQ$   (15)

where $DQ = \frac{1}{j - 1}$ and j = number of alternatives = 5.

C2: Acceptable stability in decision-making: the alternative $A_1$ must also be best ranked by S or/and R. This guarantees that the compromise solution is stable within the given decision-making process of voting by majority rule.

The dataset studied in this research satisfies condition C2 (acceptable stability in decision-making) but does not satisfy condition C1 (acceptable advantage). Therefore, a set of compromise solutions is proposed, consisting of the alternatives $A_1, A_2, \ldots, A^{(M)}$ when condition C1 is not satisfied, where the maximum M is determined by the relation

$Q(A^{(M)}) - Q(A_1) < DQ$   (16)

(the positions of these alternatives are "in closeness").
6 Results and Discussion

This study targets the order preferencing of Python-based code smells (contributing to technical debt) using MCDM techniques, which have already proved effective for order preferencing in other software fields. This research explores VIKOR as the decision-making technique, applied with two weight-determining approaches, i.e., Shannon entropy and the CRITIC method. The dataset consists of 10552 classes extracted from 20 Python software systems and was used for obtaining the multi-criteria software metrics depicting the combined presence of code smells.

The initial process elects the vital software metrics using the rule-based classification method. A paired T-test was applied to select an accurate classification algorithm. The observation implies that JRip, which implements a propositional rule learner, generates detection rules with an accuracy of 89.81% for the previously prepared dataset. The metrics composed in the detection rule, with their respective comparators, drive the subsequent weight-determining procedure. The following software metrics indicate the existence of code smells: SumCyclomatic, MaxCyclomatic, CountLine, CountDeclInstanceVariable, and MaxNesting. It is worth noting that all the considered software metrics possess a beneficial criterion for deciding the order ranking of the code smells.

Once the metrics had been determined, the next step advanced to weight determination. Two methods were used to estimate the weights for the metrics, as mentioned earlier, which ensures that the order preferencing does not rely on a single weighing criterion. To ensure diversification in the weighing criteria, Entropy and CRITIC were the two weight-determining techniques employed for analyzing the significant metrics that have a crucial impact on the smells, as discussed in Sect. 5.2. The Entropy method assigns the highest weight of 23.06% to CountDeclInstanceVariable and the lowest weight of 15.92% to CountLine. In contrast, for the CRITIC method, the MaxCyclomatic metric reaches the highest weight of 27.78%, whereas the MaxNesting metric weighs the least, accounting for 10.82%. The rest of the weights for both methods are shown in Fig. 3.

Once the weights were calculated, the indicated MCDM technique, i.e., VIKOR, was applied to the critical software metrics extracted for the five considered code smells, as explained in Sect. 5.3. Considering µ = 0.5 (weight of the strategy) for the VIKOR technique, the obtained ranking indicates that the Collapsible "IF" smell appears to be the most prioritized smell for refactoring, with rank "1." In contrast, Naming Convention tends to be the least prioritized smell (rank 5). The ranking is based on the preference scores described in Table 5. A compromise solution has been proposed by this method, describing the maximum "group utility" for the majority and a minimum individual regret for the opponent. A set of three smells comprising Collapsible "IF" (rank 1), Cognitive Complexity (rank 2), and Many Parameter (rank 3) appears as the compromise solution after applying the VIKOR technique, indicating that if the first-ranked smell is absent, it is advised to refactor the second-ranked smell, and so on.
The existing results (preference scores) have been compared with other MCDM techniques to affirm the results obtained from the VIKOR method for better prioritization of code smells to support early-stage refactoring. The MCDM techniques considered against the VIKOR method are TOPSIS (Technique for Order Preference by Similarity to the Ideal Solution) and WASPAS (Weighted Aggregated Sum Product Assessment). When applied individually to the static code metrics of the considered code smells, along with the two different weight estimators (Shannon entropy and CRITIC), both techniques help in computing the respective preference scores.
7 Threats to Validity

This section reviews the threats encountered during the course of this study. The ordering results may vary for different smells, but more often than not they will continue to show a similar ordering. The limited size of the dataset may pose some threat to validity. Considering advancements in the mathematical formulations, the results might vary if other order-preferencing techniques were taken into consideration. Moreover, the order in this study is based on static code metrics, and they might not be the only criteria for deciding the order.
8 Conclusion

The presented research work targets the ranking of code smells for the Python programming language. Ranking the code smells helps the refactoring process to be faster and more useful for optimized coding solutions. Optimization in Python programming would ensure that current and upcoming developments remain sustainable and maintainable by eliminating design flaws, i.e., code smells. These flaws belong to the "Technical Debt" domain, which deteriorates software maintenance activities while adding new features. This study considers the code smells that constitute technical debt, and therefore the detection of smells is done with the SonarQube tool.

This research applied the Multi-Criteria Decision-Making (MCDM) technique for the order preferencing of five Python-based code smells. The analysis comprehends 20 Python applications accounting for 10552 classes. The inspection of the Python applications extracts about 37 software metrics through a static code analyzer and results in a set of five essential software metrics that severely affect the software and are capable of being the vital criteria for decision-making. A rule-based machine learning classifier, in combination with feature selection filters, has been applied for discovering the essential software metrics affected in the Python source code by smelly classes. The metrics were then used for weight determination and the MCDM techniques.
[Figure 4 shows the five smells in their devised ranking: Collapsible "IF" (1st), Cognitive Complexity (2nd), Many Parameter (3rd), Unused Variable (4th), Naming Convention (5th).]
Fig. 4 Devised order ranking of code smells
The weight estimation was fulfilled using two distinct objective methods, i.e., the Shannon entropy method and the CRITIC method. This resulted in two different weight percentages for each software metric, which were further utilized in evaluating the preference scores through the VIKOR MCDM technique. Hence, the devised ranking, represented in Fig. 4, helps make the complicated and expensive refactoring efforts efficient and saves time and labor for the developer and maintenance teams for the considered Python-based code smells.
9 Future Scope

The outcome of this study will facilitate Python developers with better decision-making while refactoring Python code smells during the maintenance phase of the software development lifecycle. It paves a path for other researchers to prioritize code smells in other languages and realize their importance and preference scores utilizing the above-employed approaches, empowering them with better decision-making. We plan to compare the existing technique with other MCDM approaches, such as TOPSIS and WASPAS, to affirm the order preferencing established here in an extended version of this study.
References 1. Lenarduzzi, V., Saarimäki, N. & Taibi, D. (2020). Some sonarqube issues have a significant but small effect on faults and changes. A large-scale empirical study. Journal of Systems and Software, 170, 110750. 2. Narasimhan, V. L. (2008). A risk management toolkit for integrated engineering asset maintenance. Australian Journal of Mechanical Engineering, 5(2), 105–114. 3. Fowler, M., Beck, K., & Opdyke, W.R. (1997, June). Refactoring: Improving the design of existing code. In 11th European Conference. Jyväskylä, Finland. 4. Fowler, M. (2018, November 20). Refactoring: Improving the design of existing code. Addison-Wesley Professional.
5. Tan, J., Feitosa, D., Avgeriou, P., & Lungu, M. (2020). Evolution of technical debt remediation in Python: A case study on the Apache Software Ecosystem. Journal of Software: Evolution and Process, p. e2319. 6. Lozano, A., Wermelinger, M., & Nuseibeh, B. (2007, September). Assessing the impact of bad smells using historical information. In Ninth International Workshop on Principles of Software Evolution: In Conjunction with the 6th ESEC/FSE Joint Meeting (pp. 31–34). 7. Yamashita, A., & Moonen, L. (2012, September). Do code smells reflect important maintainability aspects?. In 2012 28th IEEE international conference on software maintenance (ICSM) (pp. 306–315). IEEE. 8. Yamashita, A., & Moonen, L. (2013, May). Exploring the impact of inter-smell relations on software maintainability: An empirical study. In 2013 35th International Conference on Software Engineering (ICSE) (pp. 682–691). IEEE. 9. Alsolai, H., & Roper, M., (2019). Application of ensemble techniques in predicting objectoriented software maintainability. In Proceedings of the Evaluation and Assessment on Software Engineering (pp. 370–373). 10. Khomh, F., Di Penta, M., & Gueheneuc, Y. G. (2009, October). An exploratory study of the impact of code smells on software change-proneness. In 2009 16th Working Conference on Reverse Engineering (pp. 75–84). IEEE. 11. Lozano, A., & Wermelinger, M. (2008, September). Assessing the effect of clones on changeability. In 2008 IEEE International Conference on Software Maintenance (pp. 227–236). IEEE. 12. Olbrich, S. M., Cruzes, D. S., & Sjøberg, D. I. (2010, September). Are all code smells harmful? A study of god classes and brain classes in the evolution of three open source systems. In 2010 IEEE International Conference on Software Maintenance (pp. 1–10). IEEE. 13. Zhang, M., Baddoo, N., Wernick, P., & Hall, T. (2011, March). Prioritising refactoring using code bad smells. In 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops (pp. 458–464). IEEE. 14. Stolee, K. T., & Elbaum, S. (2013). Identification, impact, and refactoring of smells in pipe-like web mashups. IEEE Transactions on Software Engineering, 39(12), 1654–1679. 15. Tsantalis, N., & Chatzigeorgiou, A. (2009). Identification of move method refactoring opportunities. IEEE Transactions on Software Engineering, 35(3), 347–367. 16. Catolino, G., Palomba, F., Fontana, F. A., De Lucia, A., Zaidman, A., & Ferrucci, F. (2020 Jan 1). Improving change prediction models with code smell-related information. Empirical Software Engineering., 25(1), 49–95. 17. Malhotra, R., & Jain, J. (2019). Analysis of refactoring effect on software quality of objectoriented systems. In International Conference on Innovative Computing and Communications (pp. 197–212). Springer, Singapore. 18. BenIdris, M., Ammar, H., Dzielski, D., & Benamer, W. H. (2020, September 14). Prioritizing Software components risk: towards a machine learning-based approach. In Proceedings of the 6th International Conference on Engineering & MIS 2020 2020 Sep 14 (pp. 1–11). 19. Bertrán, I. M. (2013). On the detection of architecturally-relevant code anomalies in software systems (Doctoral dissertation, Ph.D. thesis, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rio de Janeiro, Brazil). 20. Cedrim, D., Garcia, A., Mongiovi, M., Gheyi, R., Sousa, L., de Mello, R., et al. (2017). August. Understanding the impact of refactoring on smells: A longitudinal study of 23 software projects. 
In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (pp. 465–475). 21. Chen, Z., Chen, L., Ma, W, & Xu, B. (2016, November). Detecting code smells in Python programs. In 2016 International Conference on Software Analysis, Testing and Evolution (SATE) (pp. 18–23). IEEE. 22. Sehgal, R., Mehrotra, D., & Bala, M. (2018). Prioritizing the refactoring need for critical component using combined approach. Decision Science Letters, 7(3), 257–272. 23. Vidal, S. A., Marcos, C., & Díaz-Pace, J. A. (2016). An approach to prioritize code smells for refactoring. Automated Software Engineering, 23(3), 501–532.
24. Nagpal, R., Sehgal, R., & Mehrotra, D. (2019, January). Decision making on critical component using combined approach. In International Conference on Distributed Computing and Internet Technology (pp. 143–165). Singapore: Springer. 25. Mareschal, B., Brans, J. P., & Vincke, P. (1984). PROMETHEE: A new family of outranking methods in multicriteria analysis (No. 2013/9305). ULB–Universite Libre de Bruxelles. 26. Opricovic, S., & Tzeng, G. H. (2004). Compromise solution by MCDM methods: A comparative analysis of VIKOR and TOPSIS. European Journal of Operational Research, 156(2), 445–455. 27. Diakoulaki, D., Mavrotas, G., & Papayannakis, L. (1995). Determining objective weights in multiple criteria problems: The critic method. Computers & Operations Research, 22(7), 763– 770. 28. Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379–423. 29. Girba, T., Ducasse, S., & Lanza, M. (2004, September). Yesterday’s weather: Guiding early reverse engineering efforts by summarizing the evolution of changes. In 20th IEEE International Conference on Software Maintenance, 2004. Proceedings. (pp. 40–49). IEEE. 30. Marinescu, R. (2012). Assessing technical debt by identifying design flaws in software systems. IBM Journal of Research and Development, 56(5), 9–1. 31. Önder, E., & Dag, S. (2013). Combining analytical hierarchy process and TOPSIS approaches for supplier selection in a cable company. Journal of Business Economics and Finance (JBEF), 2, 56–74. 32. Tsantalis, N., & Chatzigeorgiou, A. (2011, March). Ranking refactoring suggestions based on historical volatility. In 2011 15th European Conference on Software Maintenance and Reengineering (pp. 25–34). IEEE. 33. Wei, C. C., Chien, C. F., & Wang, M. J. J. (2005). An AHP-based approach to ERP system selection. International Journal of Production Economics, 96(1), 47–62. 34. Zaidan, A. A., Zaidan, B. B., Al-Haiqi, A., Kiah, M. L. M., Hussain, M., & Abdulnabi, M. (2015). Evaluation and selection of open-source EMR software packages based on integrated AHP and TOPSIS. Journal of Biomedical Informatics, 53, 390–404. 35. Malathy, E. M., & Muthuswamy, V. (2019). A comparative evaluation of QoS-based network selection between TOPSIS and VIKOR. In International Conference on Innovative Computing and Communications (pp. 109–115). Singapore: Springer. 36. Yilmaz, B., & Harmancioglu, N. (2010). Multi-criteria decision making for water resource management: a case study of the Gediz River Basin, Turkey. Water SA, 36(5). 37. Choo, E. U., & Wedley, W. C. (1985). Optimal criterion weights in repetitive multicriteria decision-making. Journal of the Operational Research Society, 36(11), 983–992. 38. Ma, J., Fan, Z. P., & Huang, L. H. (1999). A subjective and objective integrated approach to determine attribute weights. European Journal of Operational Research, 112(2), 397–404. 39. Madic, M., & Radovanovi´c, M. (2015). Ranking of some most commonly used nontraditional machining processes using ROV and CRITIC methods. UPB Scientific Bulletin Series D, 77(2), 193–204.
Forecasting Rate of Spread of Covid-19 Using Linear Regression and LSTM Ashwin Goyal, Kartik Puri, Rachna Jain, and Preeti Nagrath
Abstract The COVID-19 virus, known as the novel coronavirus, spread across the world. The World Health Organization (WHO) marked March 11, 2020 as the day when COVID-19 was declared a pandemic. The virus first originated in Wuhan, China. In recent months, COVID-19 has impacted various social and economic fields across the world. It is necessary to quantify its spread and make predictions on how it is going to affect the world in the coming months. In this paper, our aim is to use linear regression and LSTM algorithms to forecast the spread of COVID-19. The objective of this study is to determine whether the spread can be forecasted with better accuracy using linear regression and LSTM algorithms.

Keywords Machine learning · Linear regression · LSTM · Mean absolute error · COVID-19
1 Introduction

The spread of COVID-19, caused by the SARS-CoV-2 virus that emerged in Wuhan, China, is on the rise and has shaken the world. The World Health Organization christened the illness COVID-19 after the first cases of this virus were reported. The global spread of COVID-19 affected every major nation and was declared a pandemic by the WHO in March 2020. This paper tracks the spread of the novel coronavirus, also known as COVID-19, a contagious respiratory virus that first started in Wuhan in December 2019 [1].

A. Goyal (B) · K. Puri · R. Jain · P. Nagrath
Department of Electronics & Communication Engineering, Bharati Vidyapeeths College of Engineering, New Delhi, India
R. Jain e-mail: [email protected]
P. Nagrath e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_10
The two types of coronaviruses, namely "severe acute respiratory syndrome coronavirus" and "Middle East respiratory syndrome coronavirus", have affected more than 20,000 individuals in the last 10 years [2]. The coronavirus can spread by various means; the common modes through which infection can occur are as follows:

1. airborne or aerosol transmission;
2. direct or indirect contact with another human; and
3. droplet spray transmission.
However, a person can protect himself from these transmission modes. Close contact can be avoided, and a minimum distance of 1.8 m should be maintained from other people and their respiratory droplets; for airborne transmission, a minimum distance of 4 m should be maintained. Symptoms of COVID-19 are coughing, extreme fever, tiredness or weakness, and pain in some joints of the body.

To help combat the coronavirus, artificial intelligence techniques such as machine learning and deep learning models were studied and implemented in this paper. These models give a rough estimate of how the disease will spread in the upcoming days and how many more people will be affected. This rough estimate helps the governments of various countries understand the spread and enables them to be prepared well in advance for the epidemic. Most of the data-driven approaches used in previous studies [3] have been linear models and often neglect the temporal components of the data. In this report, data preprocessing techniques are applied to the confirmed-cases data, and the preprocessed data is then fed to two models, i.e., LSTM and Linear Regression. The actual and forecast values of cases are compared on predefined metrics, and a comparison is made between the performance of the LSTM and Linear Regression models to see which model fits the data best.

The Literature Review discusses similar work done by other researchers on this topic, along with the models and approaches used by them. The methodology used in this paper and the approach to handling this problem are also discussed. Methods and Models describes the dataset used and its features; since the prediction is done worldwide, the data was processed to suit the needs of the models in use, and a brief description of the processed dataset is provided. Next, evaluation metrics are discussed to understand and compare the results of the two models: MAPE and accuracy are used to compare the results and draw conclusions. The Linear Regression and LSTM network models are then explained, demonstrating our approach. Finally, the experimental results are shown and the evaluation metrics are used to compare them.
2 Literature Review

In [4], a machine learning-based alternative to transmission dynamics for COVID-19 is used; this AI-based approach is executed by implementing a modified stacked auto-encoder model. In [5], a deep learning-based approach is proposed to compare the forecasting values predicted by LSTM and GRU models; the models were trained and tested on the data, and a comparison was made using predefined metrics. In [6], LSTM and Linear Regression models were used to predict COVID-19 incidence through the analysis of Google Trends data in Iran; the models were compared on the basis of the RMSE metric. In [7], an LSTM network-based approach is proposed for forecasting time-series data of COVID-19; the long short-term memory network overcomes problems faced by linear models, where algorithms assign high probability while neglecting temporal information, leading to biased predictions. In [8], the temporal dynamics of the coronavirus outbreak in China, Italy, and France over a span of 3 months are analyzed. In [9], a variety of linear and non-linear machine learning algorithms were studied and the best one was chosen as a baseline; the best features were then chosen using wrapper and embedded feature selection methods, and a genetic algorithm (GA) was used to determine the optimal time lags and number of layers for LSTM model predictive performance optimization. In [10], a modified SEIR model combined with AI prediction is used to forecast the epidemic trend of COVID-19 in China under public health interventions. In [11], a modeling tool was constructed to aid public health officials in estimating healthcare demand from the pandemic; the model used was an SEIR compartmental model to project the pandemic's local spread. In [12], a transmission network-based visualization of COVID-19 in India was created and analyzed; the transmission networks obtained were used to find possible Super Spreader Individuals and Super Spreader Events (SSE). In [13], day-level forecasting models on COVID-19 affected cases were compared using time-series models and mathematical formulations; the study concluded that growth is exponential in countries that do not follow quarantine rules. In [14], phenomenological models that had been validated during previous outbreaks were used to generate and assess short-term forecasts of the cumulative number of confirmed reported cases in Hubei Province.
2.1 Our Work

In our report, the confirmed cases of coronavirus are studied from the start of the epidemic, and the two approaches of Linear Regression and LSTM networks are used,
Fig. 1 Number of cases around the world
and a report is presented stating which of the above-stated models works best on this type of data on the basis of mean absolute error (Fig. 1).
3 Methods and Models

3.1 Data

The dataset used is from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) repository for COVID-19. It consists of three datasets of Death, Confirmed, and Recovered cases for 188 countries, date-wise. The number of date columns is 138, starting from January 22, 2020 to June 8, 2020. Of this, about 85% is used as training data and the rest as testing and validation data, so the models predict the last 15% of the values. The prediction is not made for a specific country; rather, it is made worldwide. Table 1 shows the world data of the coronavirus spread with confirmed, death, and recovery rates.
Table 1 World dataset of corona virus spread with confirmed, death, and recovery rates

 | Confirmed | Recoveries | Deaths | Confirmed change | Recovery rate | Growth rate
--- | --- | --- | --- | --- | --- | ---
Count | 1.390000e+02 | 1.390000e+02 | 139.000000 | 138.000000 | 139.000000 | 138.000000
Mean | 1.918547e+06 | 6.817390e+05 | 123,264.726619 | 50,666.268116 | 0.286331 | 0.076081
Std | 2.170725e+06 | 8.911273e+05 | 138,597.907312 | 42,526.463980 | 0.143922 | 0.117824
Min | 5.400000e+02 | 2.800000e+01 | 17.000000 | 89.000000 | 0.017598 | 0.005032
25% | 7.862450e+04 | 2.747150e+04 | 2703.000000 | 2957.500000 | 0.207790 | 0.021193
50% | 8.430870e+05 | 1.738930e+05 | 44,056.000000 | 67,738.000000 | 0.288055 | 0.032183
75% | 3.546736e+06 | 1.142438e+06 | 249,918.000000 | 84,446.500000 | 0.395898 | 0.085793
Max | 6.992485e+06 | 3.220219e+06 | 397,840.000000 | 130,518.000000 | 0.544809 | 0.951446
3.2 Evaluation Metrics

For the selection of the better performing model, it is necessary to use evaluation metrics to measure each algorithm's performance. In this paper, MAPE and accuracy are used:

1. Mean Absolute Percentage Error: it is defined by the following formula:

$MAPE = \frac{100\%}{n} \sum \left| \frac{y - \hat{y}}{y} \right|$   (1)

where $y$ is the true value and $\hat{y}$ is the predicted value.

2. Accuracy: it is defined by the following formula:

$Accuracy = (100 - MAPE)\%$   (2)
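A direct translation of Eqs. (1) and (2), assuming the true values are non-zero:

```python
# MAPE (Eq. 1) and accuracy (Eq. 2) for a forecast against the true series.
import numpy as np

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def accuracy(y_true, y_pred):
    return 100.0 - mape(y_true, y_pred)

print(accuracy([100, 200, 300], [97, 205, 296]))   # toy example
```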
3.3 Method

The prediction of confirmed cases due to COVID-19 is evaluated using a Recurrent Neural Network method (LSTM) and Linear Regression.
Linear Regression is a statistical model that works where the input variable (x) and output variable (y) have a linear relationship; for a single input the model is known as simple linear regression.

A recurrent neural network (RNN) is a special kind of artificial neural network which has memory of the previous inputs, i.e., it remembers the previous inputs; the output of the previous neuron is fed as input to the next neuron. It is generally used in problems such as predicting the following word in a sentence or modeling time-series data. However, a main problem associated with RNNs is vanishing and exploding gradients: the gradient starts vanishing as we go deeper into the layers, due to which the model stops updating its weights. This problem can be solved using special RNNs such as the Long Short-Term Memory (LSTM) RNN and the Gated Recurrent Unit (GRU), which have much better gradient flow, perform better than traditional RNNs, and are generally used [5].

The dataset used for predicting the values is taken from Johns Hopkins University and contains cases from January 21, 2020 to June 8, 2020. The training and testing of both models are done on this dataset. It contains 138 date columns, out of which 120 are used for training and the remaining 18 days are used for testing, i.e., forecasting. At first the data is preprocessed by converting the date columns into datetime objects and eliminating the missing values. The preprocessed data is then transformed into the required shape to be put into the models. The models are trained, the test data is predicted, and the prediction result is quantified using performance metrics such as MAPE and accuracy. The methodology for each of these steps is shown in Fig. 2.
Linear Regression
Linear regression-based models are generally used for prediction tasks. The technique is used which tries to best fit the value to a linear line. This line can be used to relate both the predicting and predicted values. When there is more than one value then the. In case of exponential relations, linear regression cannot be directly used. But after transformation to a linear expression, even exponential relations can be predicted using linear regression. For example, y = αeβx
(3)
Taking the log on both sides of the equation, we get ln y = ln α + βx
(4)
This expression is of the form of a linear regression model: y = α + βx
(5)
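The log-transformed fit of Eqs. (3)-(5) can be written, for instance, as follows; it reuses the train/test split from the preprocessing sketch above and assumes the case counts are strictly positive:

```python
# Fit ln(y) = ln(alpha) + beta * x with ordinary linear regression,
# then exponentiate the predictions to get forecast case counts.
import numpy as np
from sklearn.linear_model import LinearRegression

x_train = np.arange(len(train)).reshape(-1, 1)
x_test = np.arange(len(train), len(train) + len(test)).reshape(-1, 1)

reg = LinearRegression().fit(x_train, np.log(train.values))
forecast = np.exp(reg.predict(x_test))       # back-transform to case counts
print(forecast[:3])
```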
Fig. 2 Flowchart for proposed methodology
3.3.2 LSTM Model

Long Short-Term Memory (LSTM) is a recurrent neural network which is most effective for time-series prediction. The model used in this case is sequential. As the data is a time series and we needed to predict the positive corona cases, this model was best suited for our study. The model was built using the TensorFlow Keras framework, and the model's performance was evaluated on the mean absolute percentage error (MAPE). The proposed architecture of the LSTM model is depicted in Fig. 3.
Fig. 3 Architecture of LSTM model
4 Experimental Result

In the LSTM prediction, the LSTM layer uses a sequence of 180 nodes. A single-layered structure followed by 2 dense layers, with 60 nodes in the first layer and a single node in the output layer, is used as the LSTM model for verifying the prediction result. The best hyperparameter found is a batch size of 1. The result of the model is shown in Table 2, and the prediction result is shown in Fig. 4.
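One possible Keras realisation of the described architecture (a single 180-unit LSTM layer, a 60-unit dense layer, a single output node, batch size 1) is sketched below; the 7-day look-back window, the scaling, the activation, and the loss/optimizer choice are assumptions not stated in this excerpt. It reuses the `train` series from the preprocessing sketch above.

```python
# Windowed one-step-ahead LSTM forecaster built with tf.keras.
import numpy as np
import tensorflow as tf

WINDOW = 7                                        # assumed look-back window
scale = float(train.max())
series = train.values.astype("float32") / scale   # simple 0-1 scaling for stability
X = np.array([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
y = series[WINDOW:]
X = X.reshape((-1, WINDOW, 1))                    # (samples, timesteps, features)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(180, input_shape=(WINDOW, 1)),   # 180-node LSTM layer
    tf.keras.layers.Dense(60, activation="relu"),         # first dense layer
    tf.keras.layers.Dense(1),                             # single output node
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=20, batch_size=1, verbose=0)       # batch size 1 as reported

print(model.predict(X[-1:]).ravel() * scale)              # next-day forecast, rescaled
```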
Table 2 Accuracy and MAPE of LSTM model

Model | Accuracy (%) | MAPE (%)
--- | --- | ---
LSTM model | 96.90 | 3.092
Fig. 4 Comparison of predicted and true value using LSTM model
Table 3 Accuracy and MAPE of regression model

Model | Accuracy (%) | MAPE (%)
--- | --- | ---
Linear model | 93.57 | 6.421
The linear regression model was used on the time-series data: the date columns were taken as input, and the 18-day test data was predicted using the exponential fit of the model. The result of the model is shown in Table 3, and the prediction result comparing the test data with the predicted data is shown in Fig. 5.
4.1 Comparing with Other Studies

In [4], a multi-step forecasting system was used on the population of China; the estimated average errors are shown in Table 4. In [7], LSTM networks were used on the Canadian population; the results are shown in Table 5. In [5], a deep learning-based approach comparing the forecasting values predicted by LSTM and GRU models was proposed; its results are shown in Table 6.
Fig. 5 Comparison of predicted and true value using Linear Regression model
Table 4 Result [4]: method and average errors

Model | Error (%)
--- | ---
6-Step | 1.64
7-Step | 2.27
8-Step | 2.14
9-Step | 2.08
10-Step | 0.73

Table 5 Results [7]: Canadian datasets

Model | RMSE | Accuracy (%)
--- | --- | ---
LSTM | 34.63 | 93.4

Table 6 Results [5]: LSTM and GRU models

Model | RMSE | Accuracy (%)
--- | --- | ---
LSTM | 53.35 | 76.6
GRU | 30.95 | 76.9
LSTM and GRU | 30.15 | 87
5 Conclusion and Future Scope

The comparison between the Linear Regression and LSTM models signifies that using LSTM yields better results for forecasting the spread of confirmed cases. This work showcases a method that tracks the occurred cases of COVID-19; it could be automated to train on updated data every week and report the predicted values. Also,
the models are trained only on confirmed cases, and the same could be done for both recovered and death cases to obtain their predicted values. The models show only the worldwide cases; however, the dataset also provides country-wise statistics, so it can be used by different countries to forecast the future outcome of the pandemic and take the necessary preventive measures against this worldwide pandemic. A conclusion is drawn that such forecasting models could be used by medical and government agencies to make better policies for controlling the spread of the pandemic, and the comparison between the two models allows them to choose the model better suited for the required task. The availability of high-quality and timely data in the early stages of an outbreak, and collaboration among researchers to analyze the data, could have positive effects on healthcare resource planning.
References 1. World health organization. (2020). Who statement regarding cluster of pnemonia cases in wuhan, china. 2. Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J., Gu, X., et al. (2020). Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet, 395(10223), 497–506. 3. Knight, G. M., Dharan, N. J., Fox, G. J., Stennis, N., Zwerling, A., Khurana, R., & Dowdy, D. W. (2016). Bridging the gap between evidence and policy for infectious diseases: How models can aid public health decision-making. International Journal of Infectious Diseases, 42, 17–23. 4. Hu, Z., Ge, Q., Jin, L., & Xiong, M. (2020). Artificial intelligence forecasting of covid-19 in China. arXiv:2002.07112. 5. Bandyopadhyay, S. K., & Dutta, S. (2020). Machine learning approach for confirmation of covid-19 cases: Positive, negative, death and release. medRxiv. 6. Ayyoubzadeh, S. M., Ayyoubzadeh, S. M., Zahedi, H., Ahmadi, M., & Kalhori, S. R. N. (2020). Predicting covid-19 incidence through analysis of google trends data in iran: Data mining and deep learning pilot study. JMIR Public Health and Surveillance, 6(2), e18828. 7. Chimmula, V. K. R., Zhang, L. (2020). Time series forecasting of covid-19 transmission in canada using lstm networks. Chaos, Solitons & Fractals, 109864. 8. Fanelli, D., & Piazza, F. (2020). Analysis and forecast of covid-19 spreading in China, Italy and France. Chaos, Solitons & Fractals, 134, 109761. 9. Bouktif, S., Fiaz, A., Ouni, A., & Serhani, M. A. (2018). Optimal deep learning lstm model for electric load forecasting using feature selection and genetic algorithm: Comparison with machine learning approaches. Energies, 11(7), 1636. 10. Yang, Z., Zeng, Z., Wang, K., Wong, S.-S., Liang, W., Zanin, M., Liu, P., Cao, X., Gao, Z., Mai, Z., et al. (2020). Modified seir and ai prediction of the epidemics trend of covid-19 in China under public health interventions. Journal of Thoracic Disease, 12(3), 165. 11. Rainisch, G., Undurraga, E. A., Chowell, G. (2020). A dynamic modeling tool for estimating healthcare demand from the covid19 epidemic and evaluating population-wide interventions. arXiv:2004.13544. 12. Singh, R., Singh, P. K. (2020). Connecting the dots of covid-19 transmissions in India. arXiv: 2004.07610. 13. Elmousalami, H. H., & Hassanien, A. E. (2020). Day level forecasting for coronavirus disease (covid19) spread: Analysis, modeling and recommendations. arXiv:2003.07778. 14. Roosa, K., Lee, Y., Luo, R., Kirpich, A., Rothenberg, R., Hyman, J., Yan, P., & Chowell, G. (2020). Real-time forecasts of the covid-19 epidemic in china from february 5th to february 24th, 2020. Infectious Disease Modelling, 5, 256–263.
15. Aritra, K., Tushar, B., & Roy, A. (2020). Detailed study of covid-19 outbreak in india and West Bengal (vol. 5). https://doi.org/10.5281/zenodo.3865821. 16. Tomar, A., & Gupta, N. (2020). Prediction for the spread of covid-19 in india and effectiveness of preventive measures. Science of The Total Environment, 138762. 17. Tuli, S., Tuli, S., Tuli, R., & Gill, S. S. (2020). Predicting the growth and trend of covid-19 pandemic using machine learning and cloud computing. Internet of Things, 100222. 18. Salgotra, R., Gandomi, M., & Gandomi, A. H. (2020). Time series analysis and forecast of the covid19 pandemic in india using genetic programming. Chaos, Solitons & Fractals, 109945. 19. Randhawa, G. S., Soltysiak, M. P. M., El Roz, H., de Souza, C. P. E., Hill, K. A., & Kari, L. (2020). Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: Covid-19 case study. PLOS ONE, 15(4). https://doi.org/10.1371/journal. 20. pone.0232391. https://doi.org/10.1371/journal.pone.0232391. 21. Salgotra, R. (2020). Covid-19: Time series datasets india versus world. https://doi.org/10. 17632/tmrs92j7pv.1. 22. Tathagatbanerjee. (2020). Covid-19 analytics India. https://www.kaggle.com/tathagatbanerjee/ covid-19-analytics-india. 23. Palladino, A., Nardelli, V., Atzeni, L. G., Cantatore, N., Cataldo, M., Croccolo, F., Estrada, N., & Tombolini, A. (2020). Modelling the spread of covid19 in italy using a revised version of the sir model. arXiv:2005.08724. 24. Koubaa, A. (2020). Understanding the covid19 outbreak: A comparative data analytics and study. arXiv:2003.14150. 25. Boccaletti, S., Ditto, W., Mindlin, G., & Atangana, A. (2020). Modeling and forecasting of epidemic spreading: The case of covid-19 and beyond. Chaos, Solitons, and Fractals, 135, 109794. 26. Anastassopoulou, C., Russo, L., Tsakris, A., & Siettos, C. (2020). Data-based analysis, modelling and forecasting of the covid-19 outbreak. PloS ONE, 15(3), e0230405.
Employment of New Cryptography Algorithm by the Use of Spur Gear Dimensional Formula and NATO Phonetic Alphabet Sukhwant Kumar, Sudipa Bhowmik, Priyanka Malakar, and Pushpita Sen
Abstract Cryptography is a branch of science in which the techniques used to protect data are derived from mathematical theories and a collection of calculations known as algorithms, which process messages in many ways to encode them. In this paper, a new approach to a cryptography algorithm is presented in which equations and numerals are adopted. This new finding is established on a unique amalgamation of a mechanical-engineering concept and the alphabets used in the NATO (North Atlantic Treaty Organization) phonetic alphabet.

Keywords Military alphabets · ASCII code · Encryption · Decryption · Outer dimension formula · Spur gear · Teeth · Module · NATO · Phonetic
1 Introduction

In today's world, the amount of data theft has increased greatly. Cryptography is a method of protecting information so that only those for whom the information is intended can read and process it. Cryptography involves creating, writing, or generating codes that allow information to be kept secret [1–8]. In the study of cryptography, the terms plain text and cipher text are the key players. Encryption is the method of encoding a message or information so that only those who are authorized can read it and those who are not authorized cannot [5]. Decryption is the method of translating encoded text or other data back into text that a machine can comprehend [8]; simply put, it is the conversion of cipher text to plain text.

S. Kumar (B)
Department of Mechanical Engineering, JIS College of Engineering, Block–A, Phase–III, Kalyani, Nadia 741235, West Bengal, India
S. Bhowmik · P. Malakar
Department of Computer Applications, Narula Institute of Technology, Agarpara, Kolkata 700109, West Bengal, India
P. Sen
Department of Bio Medical Engineering, JIS College of Engineering, Block–A, Phase–III, Kalyani, Nadia 741235, West Bengal, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_11
The method of converting plain text to cipher text is known as encryption; decryption, on the other hand, is the method of converting cipher text back into plain text. In the projected work, plain-text characters are converted into ASCII values [8, 9]. Each ASCII value is divided into two parts, known as Module and Teeth. The formula for the outer diameter ($O_d$) of a spur gear is [10]

$O_d = M \times (T + 2)$   (1)
After getting the values of $O_d$, the ASCII values are reduced by the values of $O_d$; in this step we obtain the values of S. We then use the condition

$0 \leq S \leq 26$   (2)
*If this condition is satisfied, i.e., S lies in the range 0–26, a 0 is padded before the digits; otherwise, if S is greater than 26, its digits are considered individually. After applying this condition, the values are compared with the conversion chart and the desired cipher text is obtained.
2 Proposed Work

This section deals exclusively with the flowchart of the projected work, followed by Key Engendering, the Encryption Technique, and the Decryption Methodology (Fig. 1). *The running example uses Pushpita# as the plain text.
2.1 Key Engendering

Step 1: Let us consider Pushpita# as the plain text. Each plain-text character is converted to its ASCII value; the leading digit (or the leading two digits when the value has three digits) is considered Key 1, i.e., M, and the remaining last digit is Key 2, i.e., T (Tables 1 and 2), where M is the module of the gear and T is the number of teeth of the gear.

Step 2: Key 3, i.e., $O_d$, is generated by using the formula

Outer Diameter = Module × (No. of Teeth + 2)   (3)
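A small sketch of this key-engendering step, assuming every plain-text character has an ASCII value of at least 10 so that M and T can be split as described:

```python
# Split each character's ASCII value into module M (leading digit or digits)
# and teeth T (last digit), then compute the outer diameter Od = M * (T + 2).
def engender_keys(ch: str):
    a = ord(ch)                                 # ASCII value of the character
    digits = str(a)
    m, t = int(digits[:-1]), int(digits[-1])    # M = all but the last digit, T = last digit
    od = m * (t + 2)                            # Eq. (3)
    return a, m, t, od

for ch in "Pushpita#":
    print(ch, engender_keys(ch))                # e.g. 'P' -> (80, 8, 0, 16)
```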
2.2 Encryption

Step 3: The most important part of the process of cryptography is encryption. In this part, the keys are used to derive a newly brewed cipher text by subtracting $O_d$ from the ASCII value A (Tables 3, 4 and 5). The result S = A − $O_d$ is then compared to the range given in Eq. (4).
Fig. 1 Flowchart of the proposed work

Table 1 Key engendering table

Plain text | A (ASCII value) | M (Module) | T (No. Teeth)
--- | --- | --- | ---
P | 80 | 8 | 0
u | 117 | 11 | 7
s | 115 | 11 | 5
h | 104 | 10 | 4
p | 112 | 11 | 2
i | 105 | 10 | 5
t | 116 | 11 | 6
a | 97 | 9 | 7
# | 35 | 3 | 5
Table 2 Contd. key engendering table

M (Module) | T (No. teeth) | Od (Module * (Teeth + 2))
--- | --- | ---
8 | 0 | 8 * (0 + 2) = 16
11 | 7 | 11 * (7 + 2) = 99
11 | 5 | 11 * (5 + 2) = 77
10 | 4 | 10 * (4 + 2) = 60
11 | 2 | 11 * (2 + 2) = 44
10 | 5 | 10 * (5 + 2) = 70
11 | 6 | 11 * (6 + 2) = 88
9 | 7 | 9 * (7 + 2) = 81
3 | 5 | 3 * (5 + 2) = 21
Table 3 Encryption table

Plain text | A (ASCII value) | M (Module) | T (No. teeth) | Od (M * (T + 2)) | S (A − Od) | C (if 0 ≤ S ≤ 26) | C (otherwise)
--- | --- | --- | --- | --- | --- | --- | ---
P | 80 | 8 | 0 | 16 | 64 | – | 64
u | 117 | 11 | 7 | 99 | 18 | 018 | –
s | 115 | 11 | 5 | 77 | 38 | – | 38
h | 104 | 10 | 4 | 60 | 44 | – | 44
p | 112 | 11 | 2 | 44 | 68 | – | 68
i | 105 | 10 | 5 | 70 | 35 | – | 35
t | 116 | 11 | 6 | 88 | 28 | – | 28
a | 97 | 9 | 7 | 81 | 16 | 016 | –
# | 35 | 3 | 5 | 21 | 14 | 014 | –
$0 \leq S \leq 26$   (4)
*If the number lies in the range, a zero is placed before the number. *If not, the number is considered as separate individual digits. *The cipher text is obtained by assigning the NATO phonetic/military alphabet words (Table 4).
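Putting the pieces together, the encryption of Sect. 2.2 can be sketched as follows; the NATO word list is transcribed from Table 4 and engender_keys() is the helper from the earlier sketch, so the output reproduces the cipher text of Table 5.

```python
# For each character: S = ASCII - Od; if 0 <= S <= 26 pad with a leading zero,
# otherwise take the digits individually; then replace the numbers by the NATO
# phonetic words of Table 4.
NATO = ["Zero", "Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
        "Hotel", "India", "Juliet", "Kilo", "Lime", "Mike", "November", "Oscar",
        "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform", "Victor",
        "Whiskey", "X-ray", "Yankee", "Zulu"]     # indices 0-26 as in Table 4

def encrypt(plain: str) -> str:
    words = []
    for ch in plain:
        a, m, t, od = engender_keys(ch)
        s = a - od
        if 0 <= s <= 26:                          # padded case: "0" + the value
            words += [NATO[0], NATO[s]]
        else:                                     # each digit taken individually
            words += [NATO[int(d)] for d in str(s)]
    return " ".join(words)

print(encrypt("Pushpita#"))
# Foxtrot Delta Zero Romeo Charlie Hotel Delta Delta Foxtrot Hotel
# Charlie Echo Bravo Hotel Zero Papa Zero November
```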
2.3 Decryption

Decryption is the inverse method, with the following exceptions: *The cipher text is arranged in clusters of doublets, starting either from the left or from the right. *A zero occurring to the left of a number after the first sorting step is omitted.
Table 4 Numbers—alphabets—NATO phonetic/military alphabets conversion chart

Numbers | Alphabet | NATO phonetic/military alphabet
--- | --- | ---
0 | – | Zero
1 | A/a | Alfa
2 | B/b | Bravo
3 | C/c | Charlie
4 | D/d | Delta
5 | E/e | Echo
6 | F/f | Foxtrot
7 | G/g | Golf
8 | H/h | Hotel
9 | I/i | India
10 | J/j | Juliet
11 | K/k | Kilo
12 | L/l | Lime
13 | M/m | Mike
14 | N/n | November
15 | O/o | Oscar
16 | P/p | Papa
17 | Q/q | Quebec
18 | R/r | Romeo
19 | S/s | Sierra
20 | T/t | Tango
21 | U/u | Uniform
22 | V/v | Victor
23 | W/w | Whiskey
24 | X/x | X-ray
25 | Y/y | Yankee
26 | Z/z | Zulu
3 Result Analysis

In this section, the proposed work is evaluated and compared with existing cryptography algorithms; the values are tabulated and displayed with the help of a graph. The encryption and decryption times for the executable text files are compared and displayed below (Fig. 2 and Table 6).
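The timings reported in Fig. 2 and Table 6 could be gathered with a simple harness such as the one below; the file name is a placeholder and encrypt() is the sketch from Sect. 2.2, so the numbers it produces are not the paper's measurements.

```python
# Measure the wall-clock time taken to encrypt one source text file.
import time

def time_encryption(path: str) -> float:
    text = open(path, encoding="utf-8", errors="ignore").read()
    text = "".join(ch for ch in text if ord(ch) >= 32)   # keep characters the scheme handles
    start = time.perf_counter()
    encrypt(text)
    return (time.perf_counter() - start) * 1000.0        # elapsed time in milliseconds

print(round(time_encryption("File.1.txt"), 2))           # hypothetical source file
```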
Table 5 Cipher text

C | 64 | 018 | 38 | 44 | 68 | 35 | 28 | 016 | 014
Cipher text | Foxtrot Delta | Zero Romeo | Charlie Hotel | Delta Delta | Foxtrot Hotel | Charlie Echo | Bravo Hotel | Zero Papa | Zero November
[Fig. 2 plots the comparative encryption and decryption times of TDES, AES, and the proposed ECASGDFNPA algorithm for each of the source files File.1.txt–File.15.txt.]
Fig. 2 Encryption and decryption time comparison graph

Table 6 Encryption and decryption time comparison chart

Sl | Source file name | Size (bytes) | Military Enc (mm s) | Military Dec (mm s) | AES Enc (mm s) | AES Dec (mm s) | TDES Enc (mm s) | TDES Dec (mm s)
--- | --- | --- | --- | --- | --- | --- | --- | ---
1 | File.1.txt | 2,565 | 14 | 15 | 16 | 0 | 0 | 16
2 | File.2.txt | 8,282 | 16 | 0 | 62 | 0 | 0 | 16
3 | File.3.txt | 26,585 | 30 | 14 | 328 | 0 | 16 | 0
4 | File.4.txt | 52,852 | 37 | 33 | 36 | 13 | 16 | 0
5 | File.5.txt | 82,825 | 62 | 31 | 333 | 13 | 13 | 16
6 | File.6.txt | 157,848 | 30 | 30 | 31 | 32 | 31 | 31
7 | File.7.txt | 343,587 | 107 | 75 | 47 | 32 | 125 | 141
8 | File.8.txt | 737,157 | 144 | 98 | 62 | 48 | 250 | 152
9 | File.9.txt | 782,732 | 151 | 159 | 89 | 63 | 159 | 215
10 | File.10.txt | 1,375,453 | 215 | 160 | 89 | 89 | 291 | 329
11 | File.11.txt | 1,737,050 | 254 | 298 | 94 | 94 | 344 | 360
12 | File.12.txt | 2,107,551 | 327 | 357 | 109 | 125 | 438 | 453
13 | File.13.txt | 2,770,747 | 502 | 464 | 158 | 235 | 562 | 641
14 | File.14.txt | 3,284,377 | 523 | 538 | 140 | 156 | 815 | 865
15 | File.15.txt | 3,785,411 | 611 | 584 | 298 | 313 | 953 | 1109
4 Conclusion

In this paper, a new cryptography formula is presented. The algorithm is grounded on ASCII transfigurations and mathematical functions drawn from physics and mechanical engineering; these different subjects together form a new formula that opens a new direction in cryptographic science. The algorithm is used for the encryption and decryption of information by using ASCII values and military alphabets, so it is challenging for an attacker to find out the plain text. In future, we will extend this work to provide better security for encryption and decryption.

Acknowledgements We are deeply indebted to our mentor, (late) Professor Dr. Rajdeep Chowdhury, for his inimitable, exuberant style of inspiration, which helped us to complete this stupendous project, owing to his expert guidance and keen interest during this research. Unfortunately, he passed away before the current work could be accomplished. His untimely death left us in a state of depression and anguish that was difficult to address. We send our heartfelt condolences to his family and friends.
Security Framework for Enhancing Security and Privacy in Healthcare Data Using Blockchain Technology A. Sivasangari, V. J. K. Kishor Sonti, S. Poonguzhali, D. Deepa, and T. Anandhi
Abstract Advances in electronic medical records, cloud data storage, and patient data protection regulations are opening new opportunities for clinical data processing and for patients to view and share their health data. Securing data, processing, and transfers while maintaining their seamless integration is of tremendous importance for any data-driven enterprise, and in healthcare in particular, blockchain technology can solve these essential problems robustly and effectively. Blockchain has seen a surge of interest in the health sector because of its usefulness in addressing the security concerns of the electronic health record (EHR). EHRs can improve the delivery of healthcare: electronic access to medical records allows doctors to raise the quality of care significantly, supports better management of a patient's condition and higher levels of preventive treatment, and provides better functions for decision support and enhanced collaboration among carers. Its position in the health sector is therefore increasingly recognized. The protection of such digital information is given the utmost importance, and blockchain is widely used to keep healthcare data safe and stable. Maintaining data integrity while controlling the rising cost of data processing and storage plays a significant role in blockchain healthcare technology. The proposed LWKG architecture, anchored in the blockchain, allows a user to prove the time at which health data was presented. LWKG verifies the integrity of the information and checks the results stored in the database. Keywords Blockchain · Hash function · Security
A. Sivasangari (B) · V. J. K. K. Sonti · S. Poonguzhali · D. Deepa · T. Anandhi Sathyabama Institute of Science and Technology, Chennai, India V. J. K. K. Sonti e-mail: [email protected] T. Anandhi e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_12
1 Introduction "Health is wealth" is the reality that nature put before the human race, especially in 2020. The year underlined the significance of being healthy and fit enough to tackle any health emergency, and understanding the physical and emotional signals of the body is crucial to managing one's health effectively. One of the prime objectives of Electronic Health Care Systems (EHCs) is to assist the patient and the doctor in arriving at a better health diagnosis. The advantages offered by technology have become vital across sectors, and revolutionary changes have been observed in the health sector with innovations in integrated circuit technology. Gordon Moore's prediction (1965) about the growth of processing power, backed by the evolving integration capacity of electronic components, has been a paradigm shift for many interdisciplinary research areas.

The face of health care systems has been changing quickly over the past few decades, and electronics has played a significant role in the evolution of health care components and systems. Sensors, digital electronics, wireless communication, and cryptography have taken a major share in shaping the present Health Care System (HCS), whose delivery process has become more user-friendly and scientific. The Electronic Health Record (EHR) system is one such innovation in data processing and storage. Acquiring, processing, analyzing, interpreting, storing, and promptly retrieving patients' health parameters is the need of the hour in this digital era. Sensors are widely used to gather the required data about a patient's general and specific health status, and various other electronic machinery and processes are often involved in bringing expert opinion to the crucial hour of treatment. EHRs play a key role in doctors' decision-making when treating complex medical cases. An individual's health history is easier to access in an EHR than in manual records, which helps reduce the cost and time of medical diagnosis. EHRs offer several advantages: accessibility, ease of processing, presentation of the key observations and findings, long-term storage, and the ability to store not only numeric data but also images and motion pictures. It is important to store echocardiogram data, radiology images, neural activity graphs, and the fluctuations of various key parameters in a repository.

In an EHR, however, the confidentiality and privacy of this health data is the main obstacle. Data breaches are the major concern in health care data security, and some countries treat this as a key parameter when designing health care policy. In fact, the entire health care industry rests on storing and maintaining the health care data of individuals; hospitals, insurance companies, patients, and other support systems give prime importance to the security and privacy of this data. Providing and ensuring this security, however, has become a Herculean task. As is well known, technology is a boon as well as a bane when misused or not properly controlled. Radiographic images and crucial medical records must be maintained with utmost security; if this information reaches unauthenticated people or agencies, the
damage is unmanageable at both the individual and the health care service provider level. Security here means adopting foolproof procedures for data acquisition, processing, storage, and retrieval. Automation has replaced manual intervention in data security; in other words, data management has become procedural and technical rather than a matter of emotional or ethical judgement. A discussion of effective procedures for handling data, especially health care data in the context of this chapter, is therefore relevant and significant.

Health care data generated in the form of records, analytical information, images, and short videos must be protected with restricted, personally authenticated access. Because the data is maintained in a repository, the people controlling its storage and retrieval must remain vigilant. Many players are involved in using medical data: every day, huge amounts of individuals' health care data are collected and social clusters are generated and maintained. This information is crucial for understanding the repercussions of a particular drug for a particular patient as well as for a social group. Unauthenticated access to crucial health care data must be avoided, and this is possible with efficient EHR maintenance. It is essential to build and validate algorithms, encryption protocols, and foolproof data protection techniques to ensure the security and privacy of health data, and for such a protection framework to function, the EHR repository must be preserved. Various standards offer different levels of security and privacy for an individual's health care data, and various market players promise such data management practices, but recurring discrepancies in medical data management call for better systems. This chapter addresses such issues and offers solutions in this direction.

An EHR has to be maintained with security as the major objective. Data can become corrupted in transmission. At acquisition, care is ensured almost automatically, since patients will not come forward if standards are compromised. Transmission, in turn, has taken a different direction with encryption systems in place: encryption of digital data ensures security, techniques such as steganography provide protection and privacy during data transfer, and redundancy is another means of protecting data during small-scale automated transfers. In a blockchain taxonomy for EHRs divided into five key aspects, namely stability, scalability, governance, interoperability, and privacy, the scalability of the blockchain has broader goals depending on the available tools. A comprehensive literature review was performed on EHRs within a blockchain, with the goal of defining and addressing the main concerns, obstacles, and potential benefits of implementing blockchain in health care. Blockchain adoption has grown well beyond the financial industry, and we demonstrate its value for the health care market while showing that it relies heavily on the latest technologies within the health care ecosystem.
Analyzing the results of the literature review, we conclude that blockchain technology may be a suitable solution for common healthcare problems, such as EHR interoperability, establishing trust between health care providers, auditability, protection, and granting patients control over access to their health data, which could enable them to decide with whom it is shared. However, before
using blockchain technologies on a large scale in health care, additional tests, evaluations, and experiments must be carried out to ensure that a stable and mature framework is applied, because a patient's health data is confidential, critical, and essential material. The modern way of maintaining the confidentiality and safety of medical data uses blockchain technology, which is discussed in Sect. 2.
2 Related Work Tanwar et al. [1] discuss access management policies to strengthen the EHR structure, examining data usability in a health care simulation environment along with efficiency, comparative metrics, and improved outcomes; this helps increase blockchain performance and protection. Alla et al. [2] present a systematic analysis of blockchain for electronic health care systems, covering its advantages for health care processes, security-level issues, and vulnerabilities, and discuss blockchain technologies and infrastructure for a high-performance authentication sector. Vignesh et al. [3] discussed deployments in cloud storage with low-cost replication and higher availability, achieving better performance in a geo-replicated system across data centers. Similarly, Tripathi et al. [4] discussed the S2HS approach for smart health care technology, analyzing social barriers and expert perceptions of a blockchain-based security and integrity system for smart health care and how the traditional health care system is adopting modern technologies in the smart health care ecosystem. Gomathi et al. [5] proposed an energy-efficient, dynamic-clustering routing protocol for underwater wireless sensor networks (UWSN); the technique helps researchers reduce power consumption and response time, avoid overload, and improve network throughput. Ishwarya et al. [6] proposed a project to reduce traffic congestion by comparing current traffic with normal conditions. Mikula et al. [7] proposed blockchain-based identity and access management for digital system protection; they illustrate the functionality on the Hyperledger Fabric platform, and the framework is used for authentication and authorization of electronic health reports in the health care domain. Sivasangari et al. [8] examined fog computing for protection and privacy, using specialized network-edge strategies and facilities at the smart gateway to enhance big-data health management infrastructure. Deepa et al. [9] proposed detecting road damage through image processing on a smartphone and sending the coordinates to the cloud, from which users can visualize the damaged locations on a map and thus help avoid accidents. Samhitha et al. [10] use a CNN to identify dangerous tumors in lung disease; the CNN technique captures many features and the standard
representation of pneumonic radiological complexity, fluctuation, and lung nodule classification. Sivasangari et al. [11] note that WBANs raise serious protection and privacy issues for the health care industry: patients' health records should reach the physician at the right time, and because protection has such a large effect on people's lives, they adopt an efficient SEKBAN model that secures data using ECG signals. Indira et al. [12] implemented an efficient hybrid intrusion detection system using wireless sensor networks, in which spatially distributed wireless sensors detect physical changes and the network supports multiple detections with lightweight portability. Risius et al. [13] describe a blockchain research framework organized into three groups of activities and four levels of analysis; this design highlights that research so far has predominantly focused on new blockchain technologies. Yue et al. [14] proposed the HGD architecture, a blockchain paradigm that allows patients to safely own, monitor, and exchange their data without compromising privacy; personal health care details are organized by ICS, and MPC supports computation on the data without breaching untrusted computing records. Kshetri [15] proposed securing patient health data against unauthorized breaches while keeping the data easily accessible to patients. Chen et al. [16] proposed searchable encryption for EHR sharing in blockchain; they discuss how the EHR index is constructed using expressions so as to facilitate propagation and ensure the integrity and traceability of the index, and they evaluate two aspects of the scheme. Ekblaw et al. [17] compared the existing system with their proposed method, presenting the MedRec working prototype and analyzing how the model leverages unique blockchain properties in health IT. Sivasangari et al. [18] proposed eyeball-based cursor movement detection implemented using an image processing (OpenCL) package. Prokofieva and Miah [19] identify prospective applications of blockchain technology in health care; their systematic study reviews open-source technical and scholarly publications released between 2008 and 2019 to identify the promise of blockchain-based approaches to distributing health care knowledge. Eltayieb et al. [20] present an online/offline attribute-based searchable encryption system with the following contributions: first, the encryption and trapdoor algorithms are divided into two stages; second, message encryption and the attribute management policy are carried out in the offline phase; third, the scheme is shown to be secure against both chosen-plaintext and chosen-keyword attacks; finally, its applicability to the cloud-based smart grid is clarified, since cloud infrastructure by design requires adaptive systems to overcome some of its problems. Dagher et al. [21] proposed a blockchain-based platform for reliable, interoperable, and efficient access to medical data by patients, physicians, and third parties, while protecting the integrity and privacy of patients' confidential information. In an
Ethereum-based blockchain, their platform, called Ancile, uses smart contracts for improved access control and data obfuscation, together with advanced cryptographic techniques for further security. The aims of that paper are to examine how Ancile can serve the diverse interests of patients, providers, and third parties, and to explain how the system could resolve long-standing privacy and security issues in the health care sector. Hasselgren et al. [22] systematically studied and synthesized peer-reviewed articles that use or propose blockchain to optimize processes and services in health care, health science, and health education; the research indicates that efforts to use blockchain in the health sector are growing rapidly and that the health domain could be strongly influenced by the technology. Chen et al. [23] suggested a blockchain-based searchable encryption framework for EHRs: the index for the EHRs is built from complex logic expressions and stored in the blockchain, so that a data user can search the index with those expressions. Two facets of the proposed scheme's performance are analyzed, namely the overhead of collecting the matching EHR document IDs and the overhead of executing Ethereum smart contract transactions. Zhao et al. [24] proposed a modern blockchain-based privacy-preserving software update protocol that delivers secure updates with a reward scheme while protecting the privacy of the users involved. A provider offers the upgrade and makes a commitment through a smart contract that provides a financial incentive upon proof of delivery; to obtain the proof of delivery, a double-authentication-preventing signature (DAPS) is used by the transmitting node to conduct a fair exchange. They also suggest a specific outsourced attribute-based signature (OABS) scheme to address a vulnerability, prove the security of the proposed OABS and the protocol, show the feasibility of the protocol, and implement the smart contract in Solidity. The review by Agbo et al. [25] indicates that a variety of researchers have proposed many use cases for blockchain in health care, but appropriate prototype implementations and studies characterizing their feasibility are lacking; further studies are therefore needed to understand, define, and assess the effectiveness of blockchain in health care. The goal of their analysis is to identify the health care use cases of blockchain technology, the example applications developed for these use cases, the challenges and drawbacks of blockchain-based health care applications, the methodologies used to build them, and areas for future research. Pandey et al. [26] analyzed the societal and technological challenges of delivering universal health care systems on a broad scale and proposed a technology-mediated way to support society as a whole, finding that scalability is a key concern in adopting blockchain health care systems at such a scale. They experimented with creating a blockchain and found that the protocol's throughput is a function of the number of special nodes called ordering nodes, so a trade-off is needed between the time to commit and the fault-tolerance mechanism. Chen et al.
[27] observe that research on blockchain in health care is currently limited, but that blockchain is on the verge of reforming the health care sector: blockchain
can increase the usability and confidentiality of medical records through its open values, and can thus overturn the structure of health care and create a modern system in which patients control their own treatment. Blockchain is now one of the most important fields of software science, and by restoring authority over medical history and health data to the user, it will shift the hierarchy of health care; this transfer of power will lead to an eventual change toward patient-centered care, and patients are only at the start of this blockchain revolution. Yaqoob et al. [28] describe blockchain technology, built over the Internet, as providing the power to use existing health care data in a peer-to-peer and interoperable way, replacing the third party with a patient-centric approach. Applications that maintain and exchange safe, transparent, and immutable audit trails with reduced systemic fraud can be developed using this technology, which has immense potential to address major problems in the sub-sectors of the health care industry and could make the whole environment groundbreaking. Providers, patients, and academic groups are still at the start of this journey and need more research, particularly intensive work on health care and pharmacy supply chains. Jabbar et al. [29] proposed BiiMED, a blockchain platform for improving data interoperability and integrity in EHR sharing, to address these challenges. The suggested solution includes an access control scheme that allows multiple medical providers to share EHRs and a decentralized Trusted Third-Party Auditor (TTPA) to ensure the accuracy of records; the goal of the study was to create a framework for further work on a decentralized EHR management system that ensures security, authentication, encryption, and data exchange between health care facilities using blockchain technology. Engelhardt [30] describes how emerging businesses are looking to apply blockchain technologies to real-world challenges, including attempts to manage public health, centralize research results, track and execute prescriptions, reduce administrative overheads, and consolidate medical data from a growing number of inputs. Clear examples of the use of blockchain technologies in the health sector are outlined, highlighting near-term promise and obstacles; it is an interesting time, full of hope, with many new applications and implementations being explored and developed. Mettler [31] identifies various starting points for blockchain technologies in the health care sector: with blockchain, direct transactions become immediately possible, removing a central agent who managed the records, received commissions, or even intervened in a censoring fashion. It will also encourage new business models and strategies for digital well-being; since (data) intermediaries can be eliminated in the future, this technology opens new doors for how business transactions are performed in health care, holds immense future promise, and will bring revolutionary improvements to the health care industry. Jiang et al. [32] propose BlocHIE, a blockchain-based health care information exchange platform centered on the analysis, verification, and protection of health care details.
Kamel Boulos et al. [33] discuss blockchain applications such as securing patient and provider identities, managing supply chains for medication and medical devices, tracking medical fraud in clinical trials, data monetization, and public health surveillance, e.g., by the US CDC (Centers for Disease Control and Prevention) for exchanging public health data, helping public health professionals respond more effectively to an epidemic that requires geo-tagged data on a fully public and open network.
3 Proposed Work The medical data management scheme is realized mostly with blockchain and cloud computing technologies, which provide safe storage management and data sharing. The transaction heads in a clinical blockchain fall into three primary categories: medical service providers, patients, and third-party organizations. Doctors are responsible for diagnostic tests and treatment of patients and for maintaining the medical records created for further use. Patients can see a doctor at various medical institutions and retain ownership of their personal medical details, while other organizations may provide certain services and medical advice through appointed medical institutions. The main principle in the blockchain setting is data storage and access control over the medical records. Ideally, all medical record details would be stored on the blockchain, but because of constraints such as cost and storage access, only the index details of medical transaction records are documented on chain; large medical data sets are encrypted and saved off-chain, with the clinical data stored under the chain in cloud storage. Control of the data and access to it is decided by permissions, and the entities involved in the various transactions hold different authorizations for access control. In the medical setting, the blockchain is responsible for the creation of blocks: newly created blocks are first validated by the network nodes and then connected to form the permanent main chain holding the medical transaction data, and the time stamp verifies that the blocks obey the timing relation in the blockchain. A three-tier e-Health caregiver and patient architecture contains three layers, namely the front layer, the communication layer, and the back layer (Fig. 1).

e-Health front layer: the first layer consists of the sensors and wearable devices used to collect patient health-related details in real time. With the help of the sensors, the health data details are sent to the MD, collected by the communication layer protocol, and then linked to the Internet so that the data can be stored in cloud storage.

Communication layer: this layer manages the data obtained from the front-layer devices and sends them through the remote gateway to the Internet. Service routines are transmitted via the cloud storage gateway for further analysis, and some immediate services are handled through the
Fig. 1 Layers of the e-Health architecture: e-Health front layer (data collection), communication layer (data analysis), e-Health back layer (decision management), and blockchain layer
fog portal, since checking through the cloud server, which serves a large geographic area, consumes more time; this bottleneck is overcome through the emergency-services path. The overall process in this layer is to frequently compile, compress, and format the patient's health details collected from the front layer.

e-Health back layer: this layer performs the heavy computation for centralized patient management, enabling dynamic and long-term behavioral research techniques and patient relationship information. It also contains a cloud server that takes strategic decisions.

Blockchain layer: blockchain technology addresses the transaction-oriented data issue, since the data is shared transactionally across decentralized nodes, i.e., all the participating computers. These nodes validate each transaction using a consensus algorithm, and once a transaction is checked it is added to the blockchain in the distributed, unchangeable ledger. A blockchain is a linear sequence of blocks holding a list of complete and correct transaction records. Blocks are connected by a link (hash value) to the previous block and thus form a chain; the block that precedes a block is called its parent block, and the very first block is called the genesis block (Fig. 2). A block consists of the block header and the block body. The block header contains
Fig. 2 Transaction hash block process (each block contains a header, holding the version, timestamp, nonce, previous block hash and body root hash, and a body holding the transaction chunks)
• Block version: guidelines for block validation;
• Previous block hash: hash value of the preceding block;
• Timestamp: creation time of the current block;
• Nonce: a 4-byte random field that miners adjust for each hash calculation.
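To make the block structure described above concrete, the following sketch models a block whose header holds the version, previous block hash, timestamp, nonce, and the Merkle root of the transactions in the body. It is a generic illustration of the structure discussed here, not the paper's implementation, and all names in it are our own.

```python
import hashlib
import json
import time

def sha256(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(transactions):
    """Hash the transactions pairwise up a Merkle tree and return the root hash."""
    level = [sha256(tx.encode()) for tx in transactions] or [sha256(b"")]
    while len(level) > 1:
        if len(level) % 2:                         # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [sha256((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

def make_block(transactions, prev_hash, version=1, nonce=0):
    """Assemble a block: the header carries version, parent hash, timestamp, nonce and
    the Merkle root of the body, and the block hash is computed over the header."""
    header = {
        "version": version,                        # block validation guidelines
        "previous_hash": prev_hash,                # hash of the parent block
        "timestamp": int(time.time()),             # block formation time
        "nonce": nonce,                            # 4-byte field adjusted by miners
        "merkle_root": merkle_root(transactions),
    }
    header["hash"] = sha256(json.dumps(header, sort_keys=True).encode())
    return {"header": header, "body": list(transactions)}
```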
The block body consists of the transactions validated within a particular time window. A Merkle tree is used to store all valid transactions: each leaf node is a transaction, and each non-leaf node is the hash of its two concatenated child nodes. Such a tree structure is efficient for verifying the existence and validity of transactions. Initially, we dissect the various prerequisites for sharing medical care information from various sources. Based on the analysis of the data, we utilize two loosely coupled blockchains to handle different types of medical services information. Second, we combine off-chain storage and on-chain verification to satisfy the requirements of both privacy and authenticity. Third, we propose two fairness-based packing algorithms to improve the framework throughput and the fairness among users jointly. To show the practicability and adequacy of BlocHIE, we implement BlocHIE as a minimal viable product and evaluate the proposed packing algorithms extensively. Doctors who want to treat a patient can request health information by providing the patient's ID together with the private key. The details collected by the doctor are stored in local databases, and the person's access rights are checked against the access control list using the given ID; the proposed work is based on an attribute-based access control scheme. Once the consumer is authenticated, further procedures are introduced to sign the information so that digital integrity is guaranteed. This incorporates an underlying gateway that acts
as a basic aggregation site for testing, then assists with confidentiality and also ensures the protection of patient information within the proposed work (Fig. 3). The front end covers the stakeholders' interactions with the medical process through the EHR: physicians and patients request access to the EHR report to look up their data, and a user must be authorized in the EHR system database. The user information is checked against the access list, and if a match is found the user can access the record at the level granted. The back end handles the health care process data via a keyless infrastructure and the blockchain, so that the data is kept safely. Figure 2 shows the transaction process. The proposed LWKG framework enables a consumer to prove the time at which health data were presented, anchored in the blockchain. The signed data is saved, and the signature can be used later to confirm the signing time, the signing party, and the quality of the records. The user sends the signed hash of the document to the server and receives a signature token as proof of the signing time, the signing agency, and the integrity of the data; no keys are needed for a signature token. The aggregator creates a hash tree which is passed on to the next (core) server, and the root hash value is then stored in the blockchain. The blockchain is thus used as a supplementary publication layer on top of the LWKG. By re-running the hash function, LWKG verifies data integrity and checks the outcome against what is stored in the database. Transaction through Blockchain Process Algorithm
Fig. 3 Overview of the EHR healthcare framework (patients, visiting doctors and healthcare providers upload and download records and EHR logs through the blockchain service network)
Step 1: X ← Nodes(Bkn, N)
Step 2: Bkn ← current block
Step 3: Bkn−1 ← null
Step 4: E ← { } (empty waiting-transaction set)
Step 5: While t < τ
Step 6:   If transaction invalid (t, Bkn)
Step 7:     E ← E ∪ {t}
Step 8:   Else
Step 9:     If transaction valid (t, Bkn)
Step 10:      Ynew ← Ynew.add(t); block(Enew, previous block header hash value Bkn−1, timestamp ts)
Step 11:      Bkn ← addblock(bkn)
Step 12:      Nc
Step 13:   End if
Step 14: End while
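A minimal Python rendering of the loop above is sketched below, reusing the make_block helper from the previous sketch. The is_valid and deadline_reached callables stand in for the unspecified transaction-validity check and the t < τ condition, and the fairness-based packing details are not reproduced here.

```python
def process_transactions(pending, current_block, is_valid, deadline_reached):
    """Steps 1-14 in outline: invalid transactions are parked in the waiting set E,
    valid ones are packed into a new block referencing the previous block header hash."""
    waiting = set()     # E: waiting-transaction set
    packed = []         # Y_new: transactions accepted for the next block
    for tx in pending:
        if deadline_reached():                  # loop guard, t < tau
            break
        if is_valid(tx, current_block):
            packed.append(tx)                   # Y_new.add(t)
        else:
            waiting.add(tx)                     # E <- E U {t}
    new_block = make_block(packed, prev_hash=current_block["header"]["hash"])
    return new_block, waiting
```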
This work also identifies several significant application areas in medical services, as well as open issues for the implementation and advancement of blockchain applications, and analyzes opportunities and challenges for future progress and directions for the benefit of IS researchers and practitioners in the field. Since only the index is moved to the blockchain to facilitate propagation, the data owners retain full control over who can see their EHR data, and the use of blockchain technology guarantees the integrity, tamper-resistance, and traceability of the EHR index.
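Because only the EHR index is anchored on the chain, integrity can be checked by re-hashing the off-chain record and comparing the digest with the value stored on the blockchain. The sketch below shows that check in generic terms; the actual LWKG token format is not specified here, so the function and ledger shapes are assumptions.

```python
import hashlib

def anchor_record(record, ledger, record_id):
    """Compute the SHA-256 digest of an off-chain EHR and anchor it in the ledger mapping."""
    digest = hashlib.sha256(record).hexdigest()
    ledger[record_id] = digest
    return digest

def verify_record(record, ledger, record_id):
    """LWKG-style integrity check: the record is intact only if its digest
    matches the value anchored on the chain."""
    return hashlib.sha256(record).hexdigest() == ledger.get(record_id)
```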
4 Performance Analysis The main advantage of using blockchain in the health care sector is that data can be maintained and processed easily, which plays a crucial role in today's health care market. Results are collected automatically, and the verification, correction, and aggregation of data from different places are immutable, providing secured data with a reduced probability of cybercrime. Similarly, blockchain supports distributed data with redundancy and fault tolerance of the system. Cryptography using a shared symmetric key and a private key enables the Electronic Health Records (EHR) to be distributed to other users in the blockchain network. Different types of algorithms are used in this work:
a. Algorithm on admin working.
b. Algorithm on patient working.
c. Algorithm on clinician working.
d. Algorithm on lab working.
Hence, many algorithms are used in order to secure the data efficiently, and the performance is analyzed on various parameters. For benchmarking the blockchain network, Hyperledger Caliper is used; it supports several
Hyperledger frameworks. Here we use the Caliper tool to execute and verify the performance of the system and to check different parameters, including latency, throughput, CPU usage, memory consumption, disk read/write, and network input/output, which serve as the main metrics for evaluating the system. Configuration parameters such as block size, channel, resource allocation, ledger database, and the consumption time for each block can be modified based on repeated assessment (Fig. 2). The transaction throughput TT is calculated from the success rate of the transactions at the defined transaction rate (tps): TCT is the total number of committed transactions over the entire network, NCN is the number of committed nodes, and TTS is the total transaction time; invalid nodes and failed transactions are excluded (Fig. 4).

TT = TCT / (TTS × NCN) (1)
The read latency (RL) is the time between submitting a read request and receiving its reply over the entire network, where ST is the submission time and RR is the time at which the response is received (Figs. 5 and 6).
RL = RR − ST (2)

Fig. 4 Average latency (ms) versus transaction rate (TPS) for different block sizes
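Equations (1) and (2) translate directly into code. The helper names below are ours; TCT, TTS, and NCN are used exactly as defined above.

```python
def transaction_throughput(tct, tts, ncn):
    """Eq. (1): TT = TCT / (TTS * NCN), committed transactions per second over the network."""
    return tct / (tts * ncn)

def read_latency(rr, st):
    """Eq. (2): RL = RR - ST, the delay between submitting a read request and its reply."""
    return rr - st

# Example: 5,000 committed transactions in 10 s on 4 committed nodes -> 125.0 TPS.
print(transaction_throughput(5000, 10.0, 4))
print(read_latency(rr=2.45, st=2.30))   # 0.15, in the same time unit as the inputs
```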
Fig. 5 Throughput (KBps) and average latency (ms) of the LWKG framework

Fig. 6 Throughput (KBps) and average latency (ms) of the existing work
5 Conclusion EHRs contain critical, sensitive, and private information used for analysis, and they support treatment in the health care sector. Sharing medical information is significant and important for a smarter health care system and improves the standard of the health care framework. An EHR is, likewise, a structured digital organization of a patient's health information: patient records are created, maintained, stored, and distributed among many medical departments and clinics so that past patient details can be retrieved quickly. These providers often hold primary access to the records, which prevents patients from accessing their previous information quickly. Even where patients can consult their health record to learn their history, they end up interacting with the information in a fragmented way that reflects how these records are managed and maintained. The principal contribution of this work is a systematic literature survey that highlights previous studies related to EHRs and blockchain; within this review, we investigate the application of a blockchain structure for health care in EHR storage and access management. Blockchain is a distributed ledger protocol originally associated with the Bitcoin technique. It mainly utilizes public key cryptography to append particular data, and it is immutable, containing time
stamped content. It was originally designed for maintaining a financial ledger and follows the blockchain paradigm. For the data being handled, the LWKG provides a digital signature; the digital signature ensures the quality of the data and, to a greater degree, its integrity. A further consequence of using the blockchain is that data can be appended as blocks in order to produce a time stamp for the data stored or revised, which also marks the reception of the block in question. The configuration is modified based on the assessment, resource allocation, ledger database, and so on.
References 1. Tanwar, S., Parekh, K., & Evans, R. (2020). Blockchain-based electronic healthcare record system for healthcare 4.0 applications. Journal of Information Security and Applications, 50, 102407. 2. Alla, S. L., Tatar, U., & Keskin, O. (2018). Blockchain technology in electronic healthcare systems. In: Proceedings IISE Annual Conference and Expo (pp. 901–906). 3. Vignesh, R., Deepa, D., Anitha, P., Divya, S., & Roobini, S. (2020). Dynamic Enforcement of causal consistency for a geo-replicated cloud storage system. International Journal of Electrical gineering and Technology, 11(3). 4. Tripathi, G., Ahad, M. A., & Paiva, S. (2020). S2HS-A blockchain based approach for smart healthcare system. Healthcare, 8(1), 100391. Elsevier. 5. Gomathi, R. M., Martin Leo Manickam, J., Sivasangari, A., & Ajitha, P. (2020). Energy efficient dynamic clustering routing protocol in underwater wireless sensor networks. International Journal of Networking and Virtual Organisations, 22(4), 415–432. 6. Ishwarya, M. V., Deepa, D., Hemalatha, S., VenkataSaiNynesh, A., & PrudhviTej, A. (2019). Gridlock Surveillance and Management System. Journal of Computational and Theoretical Nanoscience, 16(8), 3281–3284. 7. Mikula, T., & Jacobsen, R. H. (2018). Identity and access management with blockchain in electronic healthcare records. In: 2018 21st Euromicro Conference on Digital System Design (DSD) (pp. 699–706). IEEE. 8. Sivasangari, A., Ajitha, P., Brumancia, E., Sujihelen, L., & Rajesh, G. Data security and privacy functions in fog computing for healthcare 4.0. In: Fog Computing for Healthcare 4.0 Environments (pp. 337–354). Springer, Cham. 9. Deepa, D., Vignesh, R., Mana, S. C., Samhitha, B. K., & Jose, J. (2020). Visualizing road damage by monitoring system in cloud. International Journal of Electrical Engineering and Technology, 11(4). 10 Samhitha, B. K., Mana, S. C., Jose, J., Vignesh, R., & Deepa, D. (2020). Prediction of lung cancer using convolutional neural network (CNN). International Journal, 9(3). 11. Sivasangari, A., Ajitha, P., & Gomathi, R. M. (2020). Light weight security scheme in wireless body area sensor network using logistic chaotic scheme. International Journal of Networking and Virtual Organisations, 22(4), 433–444. 12. Indira, K., UshaNandini, D., & Sivasangari, A. (2018). An efficient hybrid intrusion detection system for wireless sensor networks. International Journal of Pure and Applied Mathematics, 119(7), 539–556. 13. Risius, M., & Spohrer, K. (2017). A blockchain research framework. Business and Information Systems Engineering, 59(6), 385–409. 14. Yue, X., Wang, H., Jin, D., Li, M., & Jiang, W. (2016). Healthcare data gateways: Found healthcare intelligence on blockchain with novel privacy risk control. Journal of Medical Systems, 40(10), 218. 15. Kshetri, N. (2018). Blockchain and electronic healthcare records [cybertrust]. Computer, 51(12), 59–63.
16. Chen, L., Lee, W.-K., Chang, C.-C., Choo, K.-K. R., & Zhang, N. (2019). Blockchain based searchable encryption for electronic health record sharing. Future Generation Computer Systems, 95, 420–429. 17. Ekblaw, A., Azaria, A., Halamka, J. D., & Lippman, A. (2016). A case study for blockchain in healthcare: MedRec prototype for electronic health records and medical research data. In Proceedings of IEEE Open and Big Data Conference (vol. 13, p. 13). 18. Sivasangari, A., Deepa, D., Anandhi, T., Ponraj, A., & Roobini, M. S. (2020). Eyeball based cursor movement control. In Proceedings of the 2020 IEEE International Conference on Communication and Signal Processing, ICCSP 2020 (pp. 1116–1119), 9182296. 19. Prokofieva, M., & Miah, S. J. (2019). Blockchain in healthcare. Australasian Journal of Information Systems, 23. 20. Eltayieb, N., Elhabob, R., Hassan, A., & Li, F. (2019). An efficient attribute-based online/offline searchable encryption and its application in cloud-based reliable smart grid. Journal of Systems Architecture, 98, 165–172. 21. Dagher, G. G., Mohler, J., Milojkovic, M., & Marella, P. B. (2018). Ancile: Privacy-preserving framework for access control and interoperability of electronic health records using blockchain technology. Sustainable Cities and Society, 39, 283–297. 22. Hasselgren, A., Kralevska, K., Gligoroski, D., Pedersen, S. A., & Faxvaag, A. (2019). Blockchain in healthcare and health science. International Journal of Medical Informatics. 23. Chen, L., Lee, W.-K., Chang, C.-C., Choo, K.-K. R., & Zhang, N. (2019). Blockchain based searchable encryption for electronic health record sharing. Future Generation Computer Systems, 95, 420–429. 24. Zhao, Y., Liu, Y., Tian, A., Yu, Y., & Du, X. (2019). Blockchain based privacy-preserving software update with proof-of-delivery for Internet of Things. Journal of Parallel and Distributed Computing, 141–149. 25. Agbo, C. C., Mahmoud, Q. H., & Eklund, J. M. (2019). Blockchain technology in healthcare: A systematic review. Healthcare. 26. Pandey, P., & Litoriya, R. (2020). Implementing healthcare services on a large scale: Challenges and remedies based on blockchain technology. Health Policy and Technology (pp. 69–78). 27. Chen, H. S., Jarrell, J. T., Carpenter, K. A., Cohen, D. S., & Huang, X. (2019). Blockchain in healthcare: A patient-centered model. Biomedical Journal of Scientific & Technical Research, 20(3), 15017–15022. 28. Yaqoob, S., Khan, M. M., Talib, R., Butt, A. D., Saleem, S., Arif, F., & Nadeem, A. (2019). Use of blockchain in healthcare: A systematic literature review. International Journal of Advanced Computer Science and Applications (IJACSA), 10(5). 29. Jabbar, R., Fetais, N., Krichen, M., & Barkaoui, K. (2020). Blockchain technology for healthcare: Enhancing shared electronic health record interoperability and integrity. ResearchGate. 30. Engelhardt, M. A. (2017). Hitching healthcare to the chain: An introduction to blockchain technology in the healthcare sector. Technology Innovation Management Review, 7(10). 31. Mettler, M. (2016). Blockchain technology in healthcare: The revolution starts here. In 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom). 32. Jiang, S., Cao, J., Wu, H., Yang, Y., Ma, M., & He, J. (2018). BlocHIE: A BLOCkchain-based platform for healthcare information exchange. In Proceedings of the 2018 IEEE International Conference on Smart Computing (SMARTCOMP) (pp. 49–56). Sicily, Italy. 33. Kamel Boulos, M. N., Wilson, J. T., & Clauson, K. A. (2018). Geospatial blockchain: Promises, challenges, and scenarios in health and healthcare. International Journal of Health Geographics, 17, 25.
American Sign Language Identification Using Hand Trackpoint Analysis Yugam Bajaj and Puru Malhotra
Abstract Sign language helps people with speaking and hearing disabilities communicate with others efficiently. Sign language identification is a challenging area in the field of computer vision and recent developments have been able to achieve near-perfect results for the task, though some challenges are yet to be solved. In this paper, we propose a novel machine learning-based pipeline for American sign language identification using hand trackpoints. We convert a hand gesture into a series of hand trackpoint coordinates that serve as an input to our system. In order to make the solution more efficient, we experimented with 28 different combinations of pre-processing techniques, each run on three different machine learning algorithms, namely, k-Nearest Neighbours, Random Forests and a Neural Network. Their performance was contrasted to determine the best pre-processing scheme and algorithm pair. Our system achieved an accuracy of 95.66% to identify American sign language gestures.
Y. Bajaj · P. Malhotra (B) Department of Information Technology, Maharaja Agrasen Institute of Technology, Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_13

1 Introduction American Sign Language (ASL) uses hand gestures and movements as a means of communication for people with hearing or speaking disabilities. It is a globally recognized standard for sign language, but only about 250,000–500,000 people understand it [1], which restricts users who depend on ASL when conversing in real-life scenarios. In this paper, we propose an ASL recognition system to tackle this problem and hence lay a foundation for translator devices that make dynamic conversation in ASL easier. The proposed system uses a series of preprocessing steps to convert a gesture image into meaningful numeric data using hand trackpoint analysis, which serves as the input to a machine learning/deep learning algorithm for identification. The existing solutions require the use of external devices such as motion-sensing gloves or Microsoft Kinect to capture the essence of finger
movements, which decreases feasibility and accessibility; they also use highly complex systems of neural networks that are computationally expensive. We propose an American sign language gesture identification method capable of recognizing 24 alphabet letters. Our system is based on the pipeline shown in Fig. 1: a gesture image passes through it while being converted into a series of hand trackpoints, so our system represents a particular ASL gesture as coordinates (trackpoints) instead of an image. In this study, various pre-processing combinations were compared by their performance on the k-Nearest Neighbour algorithm, the Random Forest algorithm, and a nine-layered sequential neural network to find the best possible pre-processing and algorithm pair. Our study achieved state-of-the-art results with less complicated algorithms than the ones that are commonly used.
Fig. 1 Process pipeline
2 Related Work A lot of research has been done to date on the problem of sign language detection, the majority of it applying neural networks and achieving accuracies greater than 90% [2–11]. In one such experiment, Razieh Rastgoo et al. constructed multi-view skeletons of the hands from video samples and fed them into a 3D Convolutional Neural Network (CNN) model, outperforming state-of-the-art models on the New York University (NYU) and First-Person datasets. They applied the 3D CNN to stacked input to obtain discriminative local spatio-temporal features, and the outputs were fused and fed to a Long Short-Term Memory (LSTM) network which modelled the different hand gestures [12]. Zafar Ahmed Ansari and Gaurav Harit made an effort to solve the problem of identifying Indian sign language. Indian sign language, which is more complex to tackle as a computer vision task than other sign language standards because it uses both hands and involves movement, has been explored less than other sign languages. In their paper titled 'Nearest neighbour classification of Indian sign language gestures using Kinect camera', they were able to achieve an accuracy of 90.68% for the task, using a k-Nearest Neighbour (k-NN) approach as it provides a good basis for a task where samples can come from different angles and backgrounds [13].
3 Data Collection The data samples were collected through a dummy web interface developed using JavaScript and HTML and hosted on a local web server. The website supported functionality to capture images, process them, and record the hand trackpoints produced by feature extraction. The images were captured using a standard webcam integrated with a laptop. The samples were collected from four volunteers who performed the hand gestures in front of the webcam with their hands at a distance of between 1.5 and 3 feet from it. This distance was sufficient to capture the full gesture and was a good estimate of the distance between two people in a real-life conversation. The hand gestures corresponding to 24 of the American sign language alphabets were performed, omitting the gestures for the letters J and Z as they involve movement. Thirty samples for each alphabet were captured per volunteer, giving a total of 120 samples per alphabet. The backgrounds of the images were a mix of a plain background with only the hand in the frame and a natural background with the person performing the gesture in the frame. Figure 2 shows samples from our collected data corresponding to each of the alphabets mentioned. Each such image was processed and converted into 21
Fig. 2 Dataset sample: American sign language chart
three-dimensional coordinate points as per the process described in the next subsection and then stored in an ordered manner in a CSV file to be used as a dataset for a machine learning model. The dataset had 64 columns: the first 63 contain the nx column followed by the ny column followed by the nz column, representing the x-, y-, and z-coordinates for trackpoint n, where n ∈ [1, 21] in order, and column 64 contains the actual English representation of the gesture.
3.1 Feature Extraction The captured images were passed to a hand tracking model from the MediaPipe framework, which was deployed using TensorFlow.js. The hand tracking model's parameters and weights were adjusted as per our experiment's requirements. The model uses a BlazePalm detector to detect the palm in the captured image and passes the section of the image containing the palm to a Hand Landmark Model [14]. The feature extraction stage returned a tensor of float values containing the coordinates of 21 landmark points of a hand in 3D space. These coordinates serve as the data values for carrying out the task of American sign language recognition.
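The paper deploys the MediaPipe hand-tracking model through TensorFlow.js; an equivalent extraction step can be sketched in Python with the mediapipe package as below. The function name and parameter values are our choices, and the output order (x, y, z per trackpoint) follows the dataset layout described in Sect. 3.

```python
import cv2
import mediapipe as mp

def extract_trackpoints(image_path):
    """Return the 21 hand landmarks flattened as [x1, y1, z1, ..., x21, y21, z21],
    or None when no hand is detected."""
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1,
                                  min_detection_confidence=0.5) as hands:
        result = hands.process(image)
    if not result.multi_hand_landmarks:
        return None
    landmarks = result.multi_hand_landmarks[0].landmark   # 21 points, each with x, y, z
    return [value for lm in landmarks for value in (lm.x, lm.y, lm.z)]
```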
4 Preprocessing 4.1 Normalization of Coordinates The hand gesture can be present anywhere in the image frame. The position of the hand in the image affects the reference point coordinates for all the three (x, y, z) axes. To make it easy for the algorithms to generalize patterns for different gestures, we have to bring all of them to a single reference frame. For that, all the coordinates of a particular image sample were imagined to be contained within a bounding box, which was taken as a reference for calculations. In our study, we carried out all our calculations with two types of bounding boxes: Cuboidal Bounding Box. This type of bounding box was the smallest cuboid that could fit in all the 21 reference points in the space. For each image, the following six points were found: x min , x max , ymin , ymax , zmin , zmax ; which denoted the minimum and maximum coordinate values for their respective axis. Hence, the bounding box stretched from x min to x max , ymin to ymax and zmin to zmax in the x-, y-, z-axes, respectively, and was of the dimensions (x max – x min ) * (ymax – ymin ) * (zmax – zmin ). Cubical Bounding Box. This type of bounding box was the smallest cubical box that could fit in all the 21 reference points in the space. For each image, all the six points for the minimum and maximum points for each axis were found as above. Now the largest value among (x max – x min ), (ymax – ymin ) and (zmax – zmin ) acted as the length of the edge for the cube and the other two lengths were made equal to it by uniformly adjusting their min and max values. This was done using the following general equations: Tmin(new) = Tmin(old) −
(Edge length of cube − (Tmax − Tmin)) / 2 (1)

Tmax(new) = Tmax(old) + (Edge length of cube − (Tmax − Tmin)) / 2 (2)
For any axis T. Hence, the bounding box stretched from modified x min to x max , ymin to ymax and zmin to zmax in the x-, y-, z-axes, respectively, and was of the edge length (x max – x min ). For normalizing the coordinates into a common reference frame, we use two mathematical transformations on our data: • Shifting of Origin and • Scaling. Shifting of Origin. The reference vertex of the bounding box was shifted back to the origin to bring all the coordinates to a common reference position. The reference vertex of the bounding box was considered to be the vertex (x min , ymin , zmin ) for the
box. The shifting of each coordinate took place following the simple mathematical transformation for shifting of origin:

X = x − xmin (3)
Y = y − ymin (4)
Z = z − zmin (5)
For every coordinate point (x, y, z). Scaling. Each bounding box is scaled to a standard size of (255 * 255 * 255). This brings every collected image sample to a uniform box size, placing every coordinate in a common reference space. It was aimed at increasing the efficiency of the classification algorithms, since it eases the comparison and generalization of patterns for every gesture. A scaling factor (f) was calculated for each dimension of the bounding box using the following formula:

f = 255 / L (6)
where L indicates the edge length of the bounding box in each of the x-, y-, and z-directions one by one. Each coordinate value was multiplied by the scaling factor for their respective axis in order to complete their transformation into a box of the required dimensions.
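As a hedged illustration of the whole normalization step, the NumPy sketch below re-implements the bounding-box construction, origin shift and scaling of Eqs. (1)–(6); it is not the authors' code, and the function and variable names are ours.

```python
# Sketch: normalise 21 landmark points into a 255^3 reference box (Eqs. 1-6).
import numpy as np

def normalise(points, cubical=True):
    """points: (21, 3) array of landmark coordinates."""
    mins, maxs = points.min(axis=0), points.max(axis=0)
    if cubical:
        edge = (maxs - mins).max()
        pad = (edge - (maxs - mins)) / 2.0        # symmetric adjustment, Eqs. (1)-(2)
        mins, maxs = mins - pad, maxs + pad
    shifted = points - mins                        # shifting of origin, Eqs. (3)-(5)
    lengths = maxs - mins
    lengths[lengths == 0] = 1.0                    # guard against a degenerate axis
    return shifted * (255.0 / lengths)             # per-axis scaling factor, Eq. (6)

normalised = np.round(normalise(np.random.rand(21, 3)), 3)  # rounding to 3 decimals (Sect. 4.2)
```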
4.2 Rounding Off Another technique used to pre-process the coordinate values was rounding. After many trials and experiments, it was found that rounding the values to three decimal places produced consistent results. Hence, rounding coordinate values to three places after the decimal was performed in the study wherever applicable.
5 Method Three different algorithms were used in the experiment to build a sign language recognition model and their performances were tested on 28 different combinations of the discussed pre-processing techniques. These algorithms were
• k-Nearest Neighbour Classifier, • Random Forest Classifier and • Neural Network. The collected dataset was distributed randomly into an 80:20 Training:Test split for training and testing the models.
5.1 k-Nearest Neighbours Classifier (k-NN) For a data record t to be classified using k-NN, its k nearest neighbours are retrieved, and these form a neighbourhood of t [15]. The record t is then assigned the majority category of this neighbourhood. The performance of the algorithm depends largely on the choice of k. Hence, to find the optimal value of k, the classifier was tested for different values of k in the range (1, 25).
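A minimal sketch of this k search, assuming scikit-learn and the 80:20 split described above; the variable names X_train, X_test, y_train, y_test are placeholders for the normalised landmark features and gesture labels.

```python
# Sketch: sweep k over the stated range and keep the best test-set accuracy.
from sklearn.neighbors import KNeighborsClassifier

best_k, best_acc = None, 0.0
for k in range(1, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_test, y_test)      # accuracy on the held-out 20%
    if acc > best_acc:
        best_k, best_acc = k, acc
```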
5.2 Random Forest Classifier The Random Forest Classifier uses an ensemble of decision trees to label a sample. The performance of the algorithm depends on the number of decision trees (n) that the algorithm builds to make a prediction. To find the optimal number of decision trees, the classifier was tested for different values of n in the range (1, 200).
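An analogous sketch for the sweep over the number of trees, under the same placeholder variable names as above.

```python
# Sketch: sweep the number of trees over the stated range.
from sklearn.ensemble import RandomForestClassifier

best_n, best_acc = None, 0.0
for n in range(1, 200):
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
    acc = rf.score(X_test, y_test)
    if acc > best_acc:
        best_n, best_acc = n, acc
```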
5.3 Neural Network A Neural Network was constructed in Keras to carry out the experiment. The network contained nine dense layers, with the initial eight layers using the ReLU activation function and the final output layer using the Softmax activation function, which outputs the probability of the input belonging to each of the available labels. The network was compiled using a categorical cross-entropy loss function with the Adam optimizer and accuracy as the evaluation metric. For training, the neural network was fitted to the training set for 128 epochs.
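A sketch of such a network in Keras is given below. The hidden-layer widths, the 63-dimensional input (21 landmarks × 3 coordinates) and the 24 output classes are our assumptions, since the paper does not report the exact layer sizes; the training variables are placeholders.

```python
# Sketch: nine dense layers (eight ReLU + one Softmax), categorical cross-entropy, Adam.
from tensorflow import keras
from tensorflow.keras import layers

hidden_units = (256, 256, 128, 128, 64, 64, 32, 32)   # assumed widths, not from the paper
model = keras.Sequential()
model.add(layers.Dense(hidden_units[0], activation="relu", input_shape=(63,)))
for units in hidden_units[1:]:
    model.add(layers.Dense(units, activation="relu"))
model.add(layers.Dense(24, activation="softmax"))      # one probability per letter class

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train_onehot, epochs=128)         # placeholder training arrays
```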
6 Observations The recognition system was implemented on an Intel Core i5 CPU (2.40 GHz × 4) and an NVIDIA GeForce GTX 1650 GPU, with 8 GB RAM. The system ran Windows 10 (64 bit). The system was implemented in the Python programming language. Matplotlib
and Seaborn packages were used for analysing and visualizing the results. Table 1 shows the results obtained by using a cuboidal bounding box and a cubical bounding box. Both types of bounding boxes performed relatively well over the set of different combinations of pre-processing techniques. Among the pre-processing combinations, the combination of rounding the data first and then performing shifting and scaling, respectively, resulted in the maximum accuracy. The performance results for each algorithm are further discussed.

Table 1 Test set accuracy using a cuboidal and cubical bounding box over all the pre-processing combinations

Pre-processing combination | Cuboidal bounding box (k-NN / Random Forest / Neural Network) | Cubical bounding box (k-NN / Random Forest / Neural Network)
No Pre-processing | 70.83 / 77.43 / 91.49 | 70.83 / 77.43 / 87.33
Shifting | 90.97 / 91.15 / 93.06 | 90.97 / 91.15 / 94.27
Scaling | 67.53 / 76.91 / 89.06 | 69.1 / 75.35 / 81.94
Rounding | 70.83 / 77.08 / 87.5 | 70.83 / 70.83 / 91.15
Scaling + Shifting | 92.71 / 93.23 / 92.88 | 93.23 / 92.88 / 93.92
Scaling + Rounding | 67.53 / 78.3 / 87.85 | 69.1 / 75.35 / 89.58
Rounding + Scaling | 67.53 / 77.6 / 84.9 | 69.1 / 74.13 / 88.54
Shifting + Rounding | 90.97 / 91.32 / 94.97 | 90.97 / 91.49 / 93.92
Rounding + Shifting | 90.97 / 91.32 / 93.23 | 90.97 / 91.32 / 94.44
Rounding + Shifting + Scaling | 92.71 / 93.4 / 93.4 | 93.23 / 93.23 / 95.66
Shifting + Scaling + Rounding | 92.71 / 93.23 / 88.72 | 93.23 / 92.88 / 91.67
Rounding + Scaling + Rounding | 67.53 / 77.26 / 90.45 | 69.1 / 75.17 / 83.51
Rounding + Shifting + Rounding | 90.97 / 91.32 / 93.92 | 90.97 / 91.32 / 94.62
Rounding + Shifting + Scaling + Rounding | 92.71 / 93.4 / 91.32 | 93.23 / 92.88 / 93.4
Fig. 3 Confusion matrix for best performance in k-NN
6.1 k-NN The k-NN algorithm performed with an average accuracy of 82.19% over the 28 experimental combinations. Its best performance recorded was for four different combinations which had an accuracy of 93.23%. These combinations were, namely,
• Shifting + Scaling,
• Rounding + Shifting + Scaling,
• Shifting + Scaling + Rounding and
• Rounding + Shifting + Scaling + Rounding,
all in a cubical bounding box. All the methods achieving maximum accuracy had similar results, recognizing eight characters with 100% accuracy and 10 other characters with accuracies over 90% (as shown in Fig. 3).
6.2 Random Forest The Random Forest algorithm gave an average accuracy of 85.30% over the 28 experimental combinations. Its best accuracy of 93.4% was achieved for the pre-processing combinations of
• Rounding + Shifting + Rounding and
• Rounding + Shifting + Scaling + Rounding,
both over a cuboidal bounding box. Both of them recognized 7 letters with 100% accuracy and 12 others with accuracies over 90% (as shown in Fig. 4).
Fig. 4 Confusion matrix for best performance in random forest
6.3 Neural Network The Neural Network architecture designed by us performed the best, with an average accuracy of 90.95%. The pre-processing combination of Rounding + Shifting + Scaling over a cubical bounding box recorded an accuracy of 95.66%, which was the maximum accuracy achieved over the experiment, with a training loss below 0.1 and a training accuracy of 0.978 (Fig. 5). The Neural Network recognized 11 characters with 100% accuracy, taking the total to 20 out of 24 letters recognized with accuracies above 90% (as shown in Fig. 6).
Fig. 5 Loss and training accuracy curve for Neural Network

7 Results and Discussion The study was focused on developing a system to detect American sign language gestures. In the course of the study, a comparison was made between the k-NN algorithm, the Random Forest classifier and a proposed Neural Network. On raw, unprocessed data, the k-NN algorithm performed with an accuracy of 70.83%, Random Forest with an accuracy of 77.43% and the Neural Network with an accuracy of 91.49%. To find an optimized solution, we tried 28 different pre-processing combinations. The use of pre-processing helped increase the performance of the three algorithms to a maximum of 93.23%, 93.4% and 95.66%, respectively. A comparative analysis of the study indicates that the Neural Network is the most effective, with an average accuracy of 90.95%, outperforming k-NN and Random Forest, which had average accuracies of 82.19% and 85.30%, respectively, over a total of 28 test runs completed during the study. We concluded by achieving a maximum accuracy of 95.66% over our test dataset in a combination using a Neural Network. Among all the pre-processing techniques, the combination applying Rounding + Shifting + Scaling proved to be the most efficient, giving high accuracies in all three algorithms. Hence, a pipeline was devised to serve as an American sign language identification system.
Fig. 6 Confusion matrix for best performance in Neural Network
8 Conclusion and Future Work We implemented and trained an American sign language identification system. We were able to produce a robust model for the letters a, b, d, e, f, i, k, l, o, s, x and a modest one for the letters a–y (except r). The pre-processing techniques implemented resulted in a 14.47% average increase in test set accuracy over no pre-processing. This work can be further extended to develop a two-way communication system between sign language and English. Further work is also needed to find ways to recognize letters or gestures involving hand movements.
References
1. Mitchell, R., Young, T., Bachleda, B., & Karchmer, M. (2012). How many people use ASL in the United States?: Why estimates need updating. Sign Language Studies (Gallaudet University Press), 6(3). Retrieved November 27, 2012. ISSN 0302-1475.
2. Singha, J., & Das, K. (2013, January 17–18). Hand gesture recognition based on Karhunen-Loeve transform. In Mobile and Embedded Technology International Conference (MECON), India (pp. 365–371).
3. Aryanie, D., & Heryadi, Y. (2015). American sign language-based finger-spelling recognition using k-nearest neighbors classifier. In 3rd International Conference on Information and Communication Technology (pp. 533–536).
4. Sharma, R., et al. (2013, July 3–5). Recognition of single handed sign language gestures using contour tracing descriptor. In Proceedings of the World Congress on Engineering 2013, WCE 2013, London, U.K. (Vol. II).
5. Starner, T., & Pentland, A. (1997). Real-time American sign language recognition from video using hidden Markov models. Computational Imaging and Vision, 9(1), 227–243.
6. Jeballi, M., et al. (2013). Extension of hidden Markov model for recognizing large vocabulary of sign language. International Journal of Artificial Intelligence & Applications, 4(2), 35–42.
7. Suk, H., et al. (2010). Hand gesture recognition based on dynamic Bayesian network framework. Pattern Recognition, 43(9), 3059–3072.
8. Mekala, P., et al. (2011, March 14–16). Real-time sign language recognition based on neural network architecture. In 2011 IEEE 43rd Southeastern Symposium on System Theory (SSST).
9. Admasu, Y. F., & Raimond, K. (2010). Ethiopian sign language recognition using artificial neural network. In 10th International Conference on Intelligent Systems Design and Applications (pp. 995–1000).
10. Atwood, J., Eicholtz, M., & Farrell, J. (2012). American sign language recognition system. In Artificial Intelligence and Machine Learning for Engineering Design. Dept. of Mechanical Engineering, Carnegie Mellon University.
11. Pigou, L., et al. (2014, September 6–12). Sign language recognition using convolutional neural networks. In European Conference on Computer Vision.
12. Rastgoo, R., Kiani, K., & Escalera, S. (2020). Hand sign language recognition using multi-view skeleton. Expert Systems with Applications, 150(15).
13. Ansari, Z. A., & Harit, G. (2016). Nearest neighbour classification of Indian sign language gestures using kinect camera. Sadhana, 41(2), 161–182. Indian Academy of Sciences.
14. Zhang, F., et al. (2020). MediaPipe hands: On-device real-time hand tracking. https://arxiv.org/pdf/2006.10214.
15. Guo, G., et al. KNN model-based approach in classification. In R. Meersman, Z. Tari, & D. C. Schmidt (Eds.), On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE. OTM 2003 (Vol. 2888). Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.
Brain Tumor Detection Using Deep Neural Network-Based Classifier Ambeshwar Kumar and R. Manikandan
Abstract Brain tumors can be categorized into two groups: benign and malignant tumors. The brain tumor management process depends on the physician’s experience, knowledge, and efficient algorithms. Early detection of the tumor and a treatment plan lead to enhanced quality of life and improved life expectancy in these patients. Clinical studies have shown that treating many primary brain tumors is a challenging task, due in part to the lack of safe and effective compounds that cross the blood–brain barrier. Brain tumors can further be categorized into more than a hundred categories based on the position and growth of the tumor. This research article proposes a novel approach to detect the brain tumor and diagnose the disease with less computational time: a convolutional neural network for the classification problem and a faster region-based convolutional neural network for the segmentation problem, with reduced computational time and improved accuracy. Our research results show that the proposed system is capable of segmenting the tumor image using Faster R-CNN to produce output with good accuracy. Keywords Brain tumor · MRI · Convolutional Neural Network (CNN) · Region-based Convolutional Neural Network (R-CNN) · Region Proposal Network (RPN) · Faster R-CNN
1 Introduction An anomalous development of tissue in the brain causes a brain tumor, which could be life threatening if not detected in time. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scan images are used by clinical researchers to perform detailed analysis of brain images and recognize a tumor in the brain. Early detection of a brain tumor can reduce the death rate of patients, but it is not always possible. The tumor can be benign or malignant. Both categories of tumor
differ from each other: generally, a benign tumor does not extend to other body tissue and it can be surgically removed from the body. The common method to identify the tumor type is MRI imaging [1]. In brain tumor patients, fatigue is a common symptom that has an unfavorable effect on quality of life. There are indeed indications in the brain that reflect the self-described sensation of fatigue. Precisely, the level of self-reported fatigue was meaningfully associated with the level of deactivation accompanying phasic alertness. The fatigue in brain tumor patients is different from the fatigue of normal patients; normal patients do not feel as much tiredness and weakness compared to tumor patients [2]. Early recognition and diagnosis of a brain tumor can maximize the rate of survival; however, it is challenging to achieve a precise diagnosis of the tumor. A convolutional neural network is utilized for the recognition of the brain tumor region; it is well known that a CNN locates the tumor quickly with accurate results. Researchers have noticed that the survival rate of brain tumor patients is customarily not more than 50%. Tumors are normally identified as two types: benign, a non-cancerous tumor, because this tumor grows only in one place and cannot spread to other parts of the body; and malignant, a cancerous tumor, because it develops when cells grow uncontrollably and spread to other parts of the body in a process called metastasis [3]. Brain tumor segmentation using MRI input images has remained an active research area. Tumors have various dimensions and shapes and appear at different locations, and a convolutional neural network analyzes the features derived from the MRI image to segment the tumor. A segmented brain tumor is categorized into high-grade tumor and low-grade tumor. The advantage of a CNN is to provide optimal segmentation accuracy in less computational time. It works on the MRI input image by convolving the input image with filters and optimizing the output. It is a rising area of research in automated tumor segmentation [4]. The World Health Organization has recently shown that the mortality rate due to brain tumors is highest in the Asian continent. It is due to the absence of early detection of the tumor. The several indications of a brain tumor include coordination problems, recurrent headaches, mood fluctuations, variations in speech, trouble in concentration, seizures, and memory loss. Brain tumors are categorized into four malignant tumor grades (I, II, III, IV) as per the growth of the tumor. They are also categorized by progression stages (stage 0, 1, 2, 3, 4), in which stage 0 denotes abnormal cell growth that is not spreading to nearby cells. Stages 1, 2, 3 denote a cancerous tumor cell which grows rapidly into nearby cells. Finally, a stage 4 tumor is a cancer cell which spreads throughout the body. Brain tumor diagnosis can be invasive or non-invasive. In the invasive approach, a tumor sample is collected through an incision for examination. In this examination of the tumor sample, a pathologist observes various features of the tumor cells under a microscope to confirm the malignancy of the tumor. Body scanning, MRI of the brain, and CT scan of the brain are considered non-invasive approaches. In this research article, the brain tumor segmentation procedure has been improved by using a faster region-based convolutional neural network to recognize and diagnose the tumor.
The rest of the article is organized as follows: Sect. 2 covers the literature survey, Sect. 3 presents the
methodology, Sect. 4 covers the results and discussion, and the article ends with the Conclusion and References.
2 Literature Survey The most comprehensively used procedures for brain tumor segmentation and diagnosis by clinical investigators and technicians are described here. Deep learning techniques have proven a great enhancement in the field of medical science to solve complex problems. Medical imaging procedures assist researchers, medical practitioners, and doctors to view inside the human body and examine the internal activities without any incision in the body. Cells and tissues are the elementary building blocks of the body. The protein of a gene acts as a messenger that helps in communication between the cells and the genes themselves. Genes monitor the death process of unhealthy and unwanted cells as well as the replication of healthy cells. Genes are accountable for a tumor in the brain, and they can be categorized into three parts: in the first category, two signaling pathways cause a cell to kill itself; the cell obtains the death signal from adjacent cells and stops the cell growth. The second category of genes is responsible for the repair of DNA; if any malfunction occurs, it triggers the tumor. In the third category, tumor suppressor genes are responsible for the production of the protein encouraging the division process and constraining normal cell death. If the cancer starts from these three mentioned categories, then it is known as a primary tumor. If the cancer starts from a blood vessel, then it is called a secondary tumor [5]. The deep learning technique is used for the segmentation of brain tumor images by a fully convolutional neural network to make the algorithm robust against low data quality. The FCNN aims at profiling the capability of the model to generalize when trained with different configurations of data augmentation. It delivers insights about the performance of the model trained on data from patients with tumors. The FCNN is among the best approaches for segmentation of the tumor, and it can build the model quickly [6]. A convolutional neural network is occasionally called ConvNet; it is a class of deep neural network that collects features from the input data to extract the relevant features from images. A CNN encloses four major operations in feature extraction to deliver the desired output: the convolution operation, the pooling layer, Rectified Linear Unit non-linearization, and the fully connected classification layer. The segmentation process of the brain tumor is into two groups, either tumor or non-tumor. The medical images are preprocessed, resized, cropped, and augmented before proceeding to train the CNN model [7]. BrainNetCNN is composed of novel edge-to-edge, edge-to-node, and node-to-graph convolutional filters that leverage the topological structure of physical brain networks. This framework is used to forecast functional brain connectivity networks using a fully connected neural network with the same number of model parameters. It is intended to leverage the structure of the brain network to predict injury and disorders of the abnormal brain. It has a higher correlation to the
ground truth compared to the baseline methods. It is also used to forecast age, and its predictions were more precise in younger toddlers compared to older patients [8]. Another article deals with the object detection algorithm Faster R-CNN, which is used to identify the tumor in the brain and form a bounding box on the tumor along with the type of tumor. It uses a pre-trained convolutional neural network model and a Region Proposal Network (RPN). The Region Proposal Network generates the regions of interest from the brain input images. It is a fully convolutional network which consists of three convolutional layers and one proposal layer. The occurrence of a tumor is measured by regression and classified using the bounding box provided by the Region Proposal Network [9]. Feature extraction is used to acquire the most suitable information from the original data by using different techniques; statistical features are used, and the features are applied as an input to an artificial neural network. The Canny edge detection technique is used for the scaling purpose to improve the correctness of the system [10]. From the above discussion in the literature survey, it is clear that different technologies and methods are used for recognition and segmentation. The proposed system, Faster R-CNN, is used to recognize the tumor, and the diagnosis of the tumor is performed with less computational time and a higher Disease Diagnosis Rate (DDR).
3 Methodology In this section, we discuss the proposed approach to recognize the brain tumor using the Faster R-CNN deep learning algorithm. Faster R-CNN is an object detection algorithm; it is able to draw a bounding box around the region of interest to localize it inside an image. It represents an improvement over R-CNN, especially in terms of computational time. Faster R-CNN uses a pre-trained network model and a Region Proposal Network (RPN). It utilizes the Region Proposal Network for classification of brain images by tuning. This methodology is analogous to the R-CNN algorithm, but instead of feeding the region proposals to the CNN, it feeds the input image to the CNN to produce a convolutional feature map. The generated convolutional feature map is utilized to identify the regions of interest and warp them into squares using an ROI pooling layer. The detected region-of-interest image is reshaped into a fixed size, so it can be easily fed into a fully connected layer. The reason for choosing Faster R-CNN is that it need not feed the numerous region proposals to the convolutional neural network every time; instead, the convolution operation is completed only once per image and a feature map is generated from it. The overview of brain tumor recognition using Faster R-CNN is shown in Fig. 1. MRI brain images are used to analyze the anatomy of the brain and to identify abnormal conditions such as cerebrovascular incidents, demyelinating diseases, neurodegenerative diseases, etc. The leading advantage of MRI images is that they use no radiation, but they take a longer time to produce compared to CT scan images. The frequently used MRI sequences for brain examination are T-1 weighted and T-2
Fig. 1 The faster R-CNN-based brain tumor recognition (flowchart labels: Medical Database, Number of features, Reducing the noise in images, Filtered Input Image, Recognize the area of interest in the images, Apply the Faster R-CNN Algorithm, Classified Brain Images)
weighted, of which T-1 weighted images are very useful in examining the anatomy of the brain. In the proposed approach, the dataset has been taken from the Kaggle repository to obtain the abnormal and normal images. The input images carry some noise, and this creates problems in the identification and segmentation of images. The images are preprocessed before being carried forward to the further processing techniques, and the noise is eliminated from the input images. To eliminate the noisy portion from the image, subtractive probable pixel normalization has been used; in this method, after getting the input MRI images, it calculates the standard deviation and union of probability using signal normalization depth to obtain the desired output. In this way, unwanted features in the image are eliminated and therefore the noise level in brain MRI images is reduced. Figure 2 shows the difference between the noisy and noiseless image.
Fig. 2 Subtractive probable pixel normalization a Noisy image b Noiseless image
3.1 Faster R-CNN-Based Approach for Segmentation of Images Deep learning techniques provide better results in solving complex problems of the healthcare system, with high efficiency and the ability to produce better output results. The Faster R-CNN approach is an improved version of the Region-based Convolutional Neural Network; it takes an input image and creates a set of bounding boxes as output. A bounding box contains the region of interest in a rectangular box. The original R-CNN computed the neural network individually on each of the many regions of interest, but Faster R-CNN executes the neural network once for the whole image. The CNN architecture is common to both R-CNN and Faster R-CNN and is considered the backbone of the architecture [11]. An overview of Faster R-CNN is given in Fig. 3. It takes the brain MRI images as input images; the preprocessed image is passed to the convolutional neural network, because Faster R-CNN uses a pre-trained neural network. The input images are provided to the convolutional layer network, which produces a convolutional feature map. Instead of using a Selective Search algorithm on the feature map to recognize the regions of interest, a separate network is used to predict the region proposals. The predicted regions of interest are then reshaped using an ROI pooling layer, and the classified tumor images are produced within the bounding box. The classified tumor images are further processed to calculate the grade of the tumor.
Fig. 3 Overview of faster region-based convolutional neural network (block labels: MRI Image, Convolutional Layer, Convolutional Feature Map, RPN Convolutional, Regression Coefficient, Tumor Detection, Proposal Layer, ROI Pooling, Classified Images)
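The paper does not state which framework was used; purely as an illustration of the approach, the sketch below instantiates a Faster R-CNN with a pre-trained backbone and a two-class head (background vs. tumor) using torchvision, which follows the same pre-trained-CNN + RPN + ROI-pooling structure described above.

```python
# Illustrative sketch only (not the authors' implementation).
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Pre-trained Faster R-CNN; replace the box head with two classes: background and tumor.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)

model.eval()
with torch.no_grad():
    prediction = model([torch.rand(3, 512, 512)])[0]   # dummy MRI slice as a 3-channel tensor
# prediction["boxes"], prediction["labels"], prediction["scores"] give the detected regions.
```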
4 Results and Discussion This section discusses the robustness and efficiency of the Faster R-CNN framework architecture. The performance of the proposed Faster R-CNN deep learning algorithm used for classification of the brain tumor was discussed in the above section. The convolutional feature map is taken as input to the Region Proposal Network (RPN). The RPN consists of three convolutional layers and a proposal layer to generate the regions of interest in the images. The RPN convolutional layer predicts the tumor detection and the regression coefficients to extract the location of the tumor. The tumor detection and regression coefficients are provided to the proposal layer as input, which creates the bounding box for the region of interest using the regression coefficients. The classified brain images are used to analyze the effectiveness of the proposed model through accuracy, sensitivity, and specificity. Brain tumor classification accuracy [12, 13] is the percentage of accurate classifications to the entire number of classification results. Classification was carried out with different MR images. Brain tumor classification accuracy is measured using Eq. 1, which is shown below:

Accuracy (%) = (Correct cases / Total number of cases) * 100.  (1)
Specificity is determined by the true and precise classification of the brain tumor type. It is calculated using the formula shown in Eq. 2:

Specificity = True negative cases / (True negative cases + False positive cases).  (2)
Sensitivity is defined as the proportion of all relevant results or outputs correctly classified by the algorithm. It is shown in Eq. 3:

Sensitivity = True positive cases / (True positive cases + False negative cases).  (3)
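A small helper that evaluates Eqs. (1)–(3) from raw counts might look as follows; this is our own illustrative code, not the authors', and the counts would come from comparing the model's output with the ground-truth labels of the test images.

```python
# Sketch: compute accuracy, specificity and sensitivity from confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # Eq. (1): correct cases / total cases
    specificity = tn / (tn + fp)                          # Eq. (2)
    sensitivity = tp / (tp + fn)                          # Eq. (3)
    return accuracy, specificity, sensitivity

print(evaluate(tp=45, tn=40, fp=5, fn=10))                # hypothetical counts
```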
The robustness of the proposed system is measured through the above equations. A significant feature of Faster R-CNN is that the network is not computed independently for each region of interest; it is computed once for the whole image. It is among the best approaches to categorize the tumor from the brain images.
5 Conclusion and Future Work In conclusion, the present study presented the application of deep learning techniques in the healthcare system. We propose a Faster R-CNN approach for the classification of the tumor. The foremost aim of this article is to locate the tumor in brain images in less computational time. The initial stage is to filter the image, remove the noise, and prepare it for further processing through the Faster R-CNN approach. The detections are obtained through the bounding box provided by the RPN for the detected tumor, locating the position of the tumor in the images. It achieved promising results for the segmentation of the brain tumor compared with the existing approach. Future work includes the extension of this proposed model for recognition of tumors in other parts of the body with the actual position and size of the tumor. Consent for Publication No consent for publication.
References
1. Badža, M. M., & Barjaktarović, M. Č. (2020). Classification of brain tumors from MRI images using a convolutional neural network. Applied Sciences, 10(6), 1999.
2. de Dreu, M. J., Schouwenaars, I. T., Rutten, G. J. M., Ramsey, N. F., & Jansma, J. M. (2020). Fatigue in brain tumor patients, towards a neuronal biomarker. NeuroImage: Clinical, 28, 102406.
3. Rammurthy, D., & Mahesh, P. K. (2020). Whale Harris hawks optimization based deep learning classifier for brain tumor detection using MRI images. Journal of King Saud University-Computer and Information Sciences.
4. Bhandari, A., Koppen, J., & Agzarian, M. (2020). Convolutional neural networks for brain tumour segmentation. Insights into Imaging, 11(1), 1–9.
5. Tandel, G. S., Biswas, M., Kakde, O. G., Tiwari, A., Suri, H. S., Turk, M., Laird, J. R., Asare, C. K., Ankrah, A. A., Khanna, N. N., Saba, L., Suri, J. S., & Madhusudhan, B. K. (2019). A review on a deep learning perspective in brain cancer classification. Cancers, 11(1), 111.
6. Lorenzo, P. R., Nalepa, J., Bobek-Billewicz, B., Wawrzyniak, P., Mrukwa, G., Kawulok, M., Ulrych, P., & Hayball, M. P. (2019). Segmenting brain tumors from FLAIR MRI using fully convolutional neural networks. Computer Methods and Programs in Biomedicine, 176, 135–148.
7. Rai, H. M., & Chatterjee, K. (2020). Detection of brain abnormality by a novel Lu-Net deep neural CNN model from MR images. Machine Learning with Applications, 100004.
8. Kawahara, J., Brown, C. J., Miller, S. P., Booth, B. G., Chau, V., Grunau, R. E., Zwicker, J. G., & Hamarneh, G. (2017). BrainNetCNN: Convolutional neural networks for brain networks; towards predicting neurodevelopment. NeuroImage, 146, 1038–1049.
9. Ezhilarasi, R., & Varalakshmi, P. (2018). Tumor detection in the brain using faster R-CNN. In 2018 2nd International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC) (pp. 388–392). IEEE.
10. Zhao, J., Li, D., Kassam, Z., Howey, J., Chong, J., Chen, B., & Li, S. (2020). Tripartite-GAN: Synthesizing liver contrast-enhanced MRI to improve tumor detection. Medical Image Analysis, 101667.
11. Rosati, R., Romeo, L., Silvestri, S., Marcheggiani, F., Tiano, L., & Frontoni, E. (2020). Faster R-CNN approach for detection and quantification of DNA damage in comet assay images. Computers in Biology and Medicine, 123.
12. Kumar, A., & Manikandan, R. Recognition of brain tumor using fully convolutional neural network-based classifier. In International Conference on Innovative Computing and Communications (pp. 587–597). Singapore: Springer.
13. Kumar, A., Manikandan, R., & Rahim, R. (2020). A study on brain tumor detection and segmentation using deep learning techniques. Journal of Computational and Theoretical Nanoscience, 17(4), 1925–1930.
Detecting Diseases in Mango Leaves Using Convolutional Neural Networks Rohan Sharma, Kartik Suvarna, Shreyas Sudarsan, and G. P. Revathi
Abstract Trees and plants, like all other living creatures, can succumb to infections and diseases. These illnesses need to be detected in their early stages so that proper steps can be taken to cure them; otherwise, they can seriously affect the yield. Automated systems that help in looking for these diseases have been around for some time now, and they are all based on Machine Learning (ML) models. The onset of Deep Learning (DL) has further helped improve these automated systems, which can be implemented in areas with a large number of trees and crops. In our work, a multilayer Convolutional Neural Network (CNN) is designed to detect the presence of diseases in the leaves of a mango tree. The network is trained on a dataset consisting of images of healthy and diseased leaves. The results obtained from the same are presented.
1 Introduction Agriculture is the backbone of the economy for many major countries. With a growing population, the global demand for high-quality food is continuously rising. But with this rising demand, the resources available to produce food are diminishing and so production must be made more efficient while maintaining quality standards. The marriage between technology and agriculture has proven to be very fruitful. One of the earliest ways in which technology was introduced into this sector was through the use of pesticides, fertilizers, modern irrigation methods, and artificial crop seeds. In a span of around 40 years after they were introduced, crop yields went up by nearly 300% [1]. The most common use of technology in farming is to monitor and improve crop yield [2]. This can include monitoring the quality of soil, water, and plants, predicting future yield, predicting species of crops, detecting diseases,
and detecting the presence of weed crops. Such monitoring requires systems that can gather information from very large areas of land and analyze them. Such a system can be built using the Internet of Things (IoT) concept. IoT-enabled monitoring systems can enable seamless integration of video and image capturing, and real-time analysis. Exploiting automation techniques can bring down costs, maintain quality standards, and reduce the workload of the farmer. Automation and Machine Learning (ML) go hand-in-hand. Allowing machines to take decisions without human intervention, although sounds scary, has proven to be efficient and reliable. The idea of using ML in agriculture goes as far back as the 1990s, where a program called the AQ11 was used to detect diseases in soybean crops [3]. Proposals to use ML to predict soil types, weather patterns, and the presence of pests to improve the quality of grapes in the wine manufacturing industry also date back to the same time [4]. Currently, ML is used extensively in agriculture for a variety of uses ranging from predicting future data to managing resources [5]. Deep Learning (DL) is a branch of machine learning that uses Artificial Neural Networks (ANN) to recognize and learn patterns from large amounts of data and uses them to make predictions. Compared to traditional ML, DL models need very little to no manual feature extraction. Such functionality makes it ideal for applications that require predictions to be made on inputs like images and videos. In our work, we have used Deep Learning (DL) to detect the presence of diseases in the leaves of a mango tree. A Convolutional Neural Network was trained on healthy and diseased leaves, and the resulting model was used to make predictions. The rest of the paper is organized as follows—In Sect. 2, we present our literature survey; Sect. 3 has details about our methodology including system requirements, data preparation, and image manipulation. Section 4 has details about our deep learning model, including a model summary. Our results are detailed in Sect. 5 followed by our conclusions and future scope in Sect. 6.
2 Literature Survey Advancements in the fields of image processing, ML, and DL have led to a shift in how technology is being used for detecting diseases in fields and large farmlands. Using these techniques leads to a reduction in the amount of manual labor required. Further, since machines are able to make accurate predictions, there is no requirement to consult a specialist every time there is suspicion of disease in the plants and this can help keep overheads low. Digital Image Processing (DIP) and ML techniques have been used by a lot of researchers to create models that can detect diseases in plants. The authors of [6] used image segmentation and proposed ANNs, Bayes classifiers, fuzzy logic, and hybrid algorithms to further improve their work. In [7], the authors were working on technologies that are capable of identifying and classifying a wide variety of plant diseases in a short time. In particular, two types of leaf bean
diseases were detected using HSI models and K-Nearest Neighbors (KNN) to process the images, and a Support Vector Machine (SVM) was used for classification. An ML-based approach was adopted in [8] to evaluate images of pomegranate to detect diseases. KNN, along with a Particle Swarm Optimization technique for feature optimization, was used for this classification. A multiclass SVM was used after extracting the region of interest from the images of leaves in [9]. Using ML to perform classification still requires some amount of manual work for defining features and methods to extract these features. DL models, on the other hand, do not require a feature extraction process, and so many researchers have used various DL techniques to perform disease detection. The authors of [10] used different DL-based detectors like Faster Region-based CNN, Region-based Fully Convolutional Network, and Single Shot Multibox Detector. ANNs were used in [11] to detect diseases in cotton leaves by studying the color changes on the affected portion of the leaf and performing detection using that data. In [12], a CNN model with a Learning Vector Quantization algorithm was used for detecting diseases in tomato leaves.
3 Methodology The first step is to find and prepare the data, which can then be used to build the model. A dataset containing pre-labeled mango leaves was used, and random augmentation changes were made to it to improve model performance. After the data is prepared, it is used to train the model. A Convolutional Neural Network is used for this, with several layers of image processing steps included. The trained model is uploaded to the cloud, where test images can be fed into it for prediction. A cloud-based model makes the system free from hardware dependencies on the user end. Once the model is trained and uploaded, test images can be fed to it and predictions can be made as to whether the image contains a healthy or a diseased leaf (Fig. 1).
3.1 System Requirements The software tool used here was Python 3.4. Python is one of the most prominent and robust languages when it comes to image processing and deep learning. We developed our model using the TensorFlow framework. The Keras module of TensorFlow is an open-source neural network library, which helped in creating the model efficiently. Along with this, various other libraries of Python such as NumPy, Pandas, Seaborn, and Matplotlib were used. All this work was done on a Windows machine.
Fig. 1 A sample image showing a diseased mango leaf
3.2 Preparing the Data The amount of training and testing data in the dataset is huge, and reading it all at once would require high-end computation systems. As an alternative, all these images can be read in batches of images. Keras has built-in functions to automatically process all the data, generate a flow of batches from a directory, and to also manipulate the images.
3.3 Image Manipulation To make the deep learning model more robust, the images used for training were further manipulated using functions that rotate, resize, and rescale these images in a random fashion. With this, the model was trained with a variety of images, all generated from the same training dataset. The ImageDataGenerator function was used to do this automatically. As a result, the trained model is more flexible when it comes to the type, size, and orientation of the input images it can classify. The dataset must be clearly divided into training and testing sets and put into appropriate sub-directories. This is an absolute requirement; otherwise, the flow_from_directory method won’t work.
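A sketch of this augmentation and directory pipeline is shown below; the directory names, target size and batch size are illustrative assumptions rather than values reported in the paper.

```python
# Sketch: Keras batch loading and random augmentation from labelled sub-directories.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rescale=1.0 / 255,       # rescaling
    rotation_range=30,       # random rotation
    zoom_range=0.2,          # random zoom
    shear_range=0.2,         # shearing
    horizontal_flip=True,
    vertical_flip=True,
)
test_gen = ImageDataGenerator(rescale=1.0 / 255)

train_batches = train_gen.flow_from_directory(
    "dataset/train", target_size=(150, 150), batch_size=16, class_mode="binary"
)
test_batches = test_gen.flow_from_directory(
    "dataset/test", target_size=(150, 150), batch_size=16, class_mode="binary", shuffle=False
)
# train_batches.class_indices -> {'Diseased': 0, 'Healthy': 1}
```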
4 Model Design Here, we used a Convolutional Neural Network (CNN) to build the model instead of using a basic conventional neural network because of the following: • CNNs are motivated by the fact that they are able to learn relevant features from an image at different levels, similar to a human brain. Conventional neural networks are unable to do this. • Another main feature of Convolutional Neural Networks is parameter sharing. Say we have a one-layer CNN with 10 filters of size 5 × 5. Simply calculating the parameters of such a CNN gives 5 × 5 × 10 weights and 10 biases, i.e., 260 parameters. Now let us take a simple one-layer fully connected neural network with 250 neurons; here, the number of weight parameters depends on the dimensions of the image and is 250 × K, where K is the size of the image and is equal to the product of its height and width, plus one bias per neuron. For a 28 × 28 MNIST image as input, such a neural network will have 250 × 784 + 250 = 196,250 parameters. Clearly, a CNN is more efficient in terms of memory and complexity.
4.1 Approach The first step here is to acquire an image. Here, we consider a single leaf kept in a brightly lit room with a dark background. Images of two classes have been taken: Healthy and Diseased. We make a respective directory for each of the two classes. To connect the directories of our training images with the model, we establish a pipeline between the two with the help of TensorFlow’s ImageDataGenerator function, which not only helps in establishing this connection but also in the preprocessing of the training as well as test images; various operations like resizing of images, horizontal flipping, vertical flipping, shearing, as well as zooming of the images can be done using this function. As we have access to only a limited number of images of diseased and healthy mango leaves, this helps us to increase the accuracy of our model by presenting it with a wide variety of augmented images. For example, a single image from our dataset can be input as 10 different images with the help of the different functions provided by ImageDataGenerator. After the preprocessing with the ImageDataGenerator, we send the images from our training directory into the model using another TensorFlow function called flow_from_directory. Now that we have the images ready and loaded from the directory, we can create our neural network. We have constructed a Convolutional Neural Network consisting of multiple Conv2D layers, each followed by a MaxPooling layer. This is followed by Keras layers such as Flatten, Dense, Activation, Dropout, and finally another pair of Dense and Activation layers. While we construct the basic neural network, we make sure to use the best set of parameters such as input shape, filter size, kernel size, and pool size for each
layer. These parameters help in improving the accuracy of our network and reducing the training and validation losses. Finally, we compile the model and specify the loss and optimizer functions, followed by the metrics upon which we wish to evaluate our results. In order to train the model, we generate images for training and testing. We then assign indices to the two classes, namely ‘Diseased’ – 0 and ‘Healthy’ – 1. With all the data we have, we are ready to train our model. During training, we focus on our loss and validation loss values for each epoch. Once the training is completed, we plot a training loss versus validation loss graph. Finally, we take a random test image and try to predict its class using our trained model.
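A hedged sketch of a network consistent with this description is given below; the filter counts, kernel sizes, dense width and dropout rate are our assumptions, since the paper reports only the layer types, and the generators are those from the previous sketch.

```python
# Sketch: Conv2D/MaxPooling blocks, Flatten, Dense, Dropout, single-unit sigmoid output.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Activation, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128),
    Activation("relu"),
    Dropout(0.5),            # reduces overfitting, as discussed in Sect. 4.2
    Dense(1),
    Activation("sigmoid"),   # single neuron: Healthy vs. Diseased
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```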
4.2 Model Summary The filters in the first few layers of the model are used for edge detection of the leaf; as we go deeper into the layers and filters, complex features such as the shape of the leaf, the pigment changes present on the leaf, and the diseased part of the leaf can be detected. The dropout in our model helps to reduce the overfitting problem and to generalize to images outside our training dataset. Dropout switches off a few neurons in the layers so that some features from the training dataset are neglected and the model tries to generalize over a broader range of images. Flatten turns the three-dimensional feature maps into a one-dimensional vector. This flattened layer is fed into the Dense part, which is a neural network consisting of two layers. The last Dense layer has only one neuron, which performs logistic regression to classify our two classes: Healthy or Diseased Leaf (Fig. 2).
5 Results Model performance is determined using loss and accuracy values. After each iteration step or epoch, accuracy should increase and loss should decrease. Accuracy is a measure of how well the model predicts the test data compared to the true data. Loss is a measure of how poorly the model behaves when it comes to making predictions; it is the sum of the errors made on the data (Fig. 3). Fifteen epochs were set up for training and testing the model, with an early-stop patience of two epochs. Early stopping is a method that monitors a parameter and stops the training and testing process when the observed value stops improving. In this model, early stopping monitored the validation loss and stopped the training and testing process when this value kept increasing for more than two epochs. This helped obtain better and more accurate results when using the model to make predictions. With each epoch, accuracy kept increasing and validation loss kept decreasing. Each epoch took an average of 69 s to finish. After the 7th epoch, the training and testing
Fig. 2 Model Summary
Fig. 3 Loss Plots
were stopped because of the early-stop criterion. A final accuracy of 99.27% and a final validation loss of 3.88% were obtained (Figs. 4 and 5). For the sake of representation, sample output images were created and the predictions were written onto the images using Python Imaging Library (PIL) tools.
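The early-stopping setup described above would look roughly as follows in Keras, reusing the model and generators from the earlier sketches.

```python
# Sketch: monitor the validation loss with a patience of two epochs over at most 15 epochs.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=2)
history = model.fit(
    train_batches,
    validation_data=test_batches,
    epochs=15,
    callbacks=[early_stop],
)
```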
Fig. 4 Layers of the DL model
Fig. 5 Results showing a Healthy leaf versus a Diseased leaf
6 Conclusions and Future Scope Disease prediction in mango leaves was done using CNNs. Python’s TensorFlow framework was used to build a deep learning model with several layers. A dataset of mango leaves was taken, and the model was trained and tested on it. Since the number of images was not very high, ImageDataGenerator was used to bring in variations in the images to help improve results. Fifteen epochs were run, with an early stop criterion for the validation loss with a patience limit of two epochs. This caused the training to stop after the 7th epoch, and an accuracy of 99.27% was obtained in predicting the presence of disease. To make the process of generating inferences from the model easier, the model should be easily accessible to the farmer. This can be done in several ways. One method involves deploying the model on a cloud server and linking it to an app on the farmer’s phone. This will make the process simple, as the farmer will only have to click a picture, and the cloud-based server will make the predictions and respond accordingly. Another approach to do the same is by using a Raspberry Pi (RPi), on which the model is deployed. An RPi-Camera module can be interfaced, and this can be used to capture images of the plant leaves. No cloud dependency is required in this approach, but the deep learning model used has to be able to run on a low-computation-capacity device such as the RPi. Acknowledgements We would like to thank the Centre of IoT, PES University, for allowing us to undertake this project. We also thank Agbaje Abdullateef, who posted the dataset of mango leaves on Kaggle for public use.
References
1. Clercq, M. D., Vats, A., & Biel, A. (2018). Agriculture 4.0: The future of farming technology.
2. Vuran, M. C., Salam, A., Wong, R., & Imak, S. (2018). Internet of underground things in precision agriculture: Architecture and technological aspects. Ad Hoc Networks. https://doi.org/10.1016/j.adhoc.2018.07.017
3. McQueen, R. J., Garner, S. R., Nevill-Manning, C. G., & Witten, I. H. (1995). Applying machine learning to agricultural data. Computers and Electronics in Agriculture, 12, 275–293.
4. Witten, I. H., Holmes, G., McQueen, R. J., Smith, L., & Cunningham, S. J. (1993). Practical machine learning and its applications to problems in agriculture. Hamilton, New Zealand.
5. Liakos, K. G., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018). Machine learning in agriculture: A review. Sensors, 18, 2674. https://doi.org/10.3390/s18082674
6. Singh, V., & Misra, A. K. (2016). Detection of plant leaf diseases using image segmentation and soft computing techniques. Information Processing in Agriculture, 4, 41–49.
7. Abed, S., & Esmaeel, A. A. (2018). In 2018 IEEE Symposium on Computer Applications and Industrial Electronics, Penang, Malaysia. https://doi.org/10.1109/ISCAIE.2018.8405488
8. Kantale, P., & Thakare, S. (2020). A review on pomegranate disease classification using machine learning and image segmentation techniques. In 2020 4th International Conference on Intelligent Computing and Control Systems, Madurai, India. https://doi.org/10.1109/ICICCS48265.2020.9121161
9. Islam, M., Dinh, A., Wahid, K., & Bhowmik, P. (2017). Detection of potato diseases using image segmentation and multiclass support vector machine. In 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering, Ontario, Canada. https://doi.org/10.1109/CCECE.2017.7946594
10. Akila, M., & Deepan, P. (2018). Detection and classification of plant leaf diseases by using deep learning algorithm. International Journal of Engineering Research & Technology, 6.
11. Shah, N., & Jain, S. (2019). Detection of disease in cotton leaf using artificial neural network. In 2019 Amity International Conference on Artificial Intelligence, Dubai, UAE. https://doi.org/10.1109/AICAI.2019.8701311
12. Sardogan, M., Tuncer, A., & Ozen, Y. (2018). Plant leaf disease detection and classification based on CNN with LVQ algorithm. In 2018 3rd International Conference on Computer Science and Engineering, Sarajevo, Bosnia. https://doi.org/10.1109/UBMK.2018.8566635
Recommending the Title of a Research Paper Based on Its Abstract Using Deep Learning-Based Text Summarization Approaches Sheetal Bhati, Shweta Taneja, and Pinaki Chakraborty
Abstract Due to the increasing use of the Internet and other online resources, there is tremendous growth in the data of text documents. It is not possible to manage this huge data manually. This has led to the growth of fields like text mining and text summarization. This paper presents a title prediction model for research papers using Recursive Recurrent Neural Network (Recursive RNN) and evaluates its performance by comparing it with sequence-to-sequence models. Research papers published between 2018 and 2020 were obtained from a standard repository, viz. Kaggle, to train the title prediction model. The performance of different versions of Recursive RNN and Seq2Seq was evaluated in terms of training and hold-out loss. The experimental results show that Recursive RNN models perform significantly better than the other models. Keywords Text summarization · Title of research paper · LSTM · Seq2Seq model · Recursive RNN
1 Introduction Text summarization refers to a process of creating a shorter or condensed form of a text which consists of the main idea of articles or passages. As the data over the Internet is increasing rapidly, it is a time-consuming process to read entire content to get insight. Thus, text summarization becomes important. It has many applications in diverse fields like science and medicine, business and law, and news production [1].
Fig. 1 Taxonomy of text summarization techniques (on the basis of input: single, multiple; on the basis of purpose: generic, query focused, domain specific; on the basis of output: extractive, abstractive; on the basis of deliverable: headline, keywords, informative, indicative)
Based on the type of output generated, text summarization is categorized into extractive and abstractive text summarization. An extractive summarization technique produces a summary by concatenating the important sentences, paragraphs, etc. from the original document [2]. On the other hand, in abstractive text summarization, new words are added to the generated summary and the summary looks similar to a text written by a human expert. Depending on different types of characteristics of summarization, there are various types of text summarization techniques as given in Fig. 1. There are a variety of approaches used to get optimal summaries. Some common approaches are graph-based, statistical-based, algebraic, machine learning-based, and deep learning-based. In this paper, our focus is on the deep learning-based approaches of text summarization, viz. LSTM-based Seq2Seq Model and Recursive RNNs using Keras library. We compared the different versions of the Seq2Seq model with the different versions of recursive RNNs in order to understand which is the most suitable for title prediction for research papers from their abstracts. The paper is organized as follows: Sect. 2 is a brief description of the work done in the field of text summarization; Sect. 3 describes the dataset used; Sect. 4 presents the methodology; Sect. 5 presents the results; and Sect. 6 presents the conclusion.
2 Related Work Many researchers have contributed in the field of text summarization using different approaches like machine learning, statistical approach, graph-based, and deep learning. Our focus is to explore the deep learning-based approaches to propose
titles for research papers from the abstracts. Statistical approaches were commonly used earlier, but nowadays deep learning-based approaches are being used more widely. Deep learning models were used for the first time in abstractive text summarization by Rush et al. in 2015 [3]. They were based on the encoder–decoder architecture. Using these techniques, the quality of the summary was improved. Deep learning helps in analyzing complex problems and helps in generating human-like summaries. The deep learning models are based on a hierarchical structure which helps in learning. The number of layers affects the level of learning [4]. The higher layers have fewer details as compared to the low layers [5]. There are several models of deep learning which have been used for text summarization, viz. recurrent neural networks (RNNs), CNNs, and seq-to-seq models. In 2015, Lopyrev [6] proposed an encoder–decoder RNN with a simplified attention mechanism to generate headlines for newspaper articles. They suggested using bidirectional RNN to enhance performance. In 2018, Al-Sabahi et al. [7] implemented bidirectional RNN to generate summaries with high abstraction. They had highlighted a need to propose evaluation metrics besides ROUGE for long sequences. In 2016, the encoder–decoder RNN and sequence-to-sequence models were used, which mapped the input sequence to the output sequence [8]. Shi et al. [9] provided a survey based on different sequence-to-sequence models, and they also developed an open-source library for abstractive text summarization. Some researchers have used hybrid models, like the encoder–decoder LSTMs with reinforcement learning, to enhance performance [10]. Table 1 presents a summary of deep learning-based text summarization approaches used since 2015.

Table 1 Summary of prior art

Year | References | Approach | Dataset | Metrics
2015 | Rush et al. [3] | Attention-based summarization (ABS) | DUC, Gigaword | ROUGE
2015 | Lopyrev [6] | Simple attention | Gigaword | BLEU
2015 | Ranzato et al. [11] | Sequence-level training | Gigaword | ROUGE, BLEU
2016 | Nallapati et al. [8] | Switch generator-pointer | DUC, Gigaword, CNN/DM | ROUGE
2017 | Suleiman et al. [4] | Hidden Markov model | Corpus | N-gram
2017 | Paulus et al. [10] | Deep reinforced model (ML + RL version) | CNN/DM, New York Times | ROUGE
2018 | Al-Sabahi et al. [7] | Bidirectional RNN | CNN/DM | ROUGE
2020 | Shi et al. [9] | Seq2Seq models | CNN/DM, Newsroom, Bytecup | ROUGE
3 Data Set We have used a dataset of published research papers obtained from Kaggle (https://www.kaggle.com/Cornell-University/arxiv/download). We applied the text summarization techniques to papers published between 2018 and 2020, which gave 5396 training pairs, 1485 validation pairs, and 20 articles held out for prediction. For each sentence, we added <Start> and <End> as keywords marking the start and the end of the sentence.
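For illustration, a minimal preprocessing sketch of the steps described above is given below; the file name, column names, and split logic are assumptions made for the sketch and are not taken from the paper.

```python
import pandas as pd

# Hypothetical file and column names; the paper only states that arXiv
# metadata from Kaggle was filtered to papers from 2018-2020.
papers = pd.read_csv("arxiv_metadata.csv")
papers = papers[(papers["year"] >= 2018) & (papers["year"] <= 2020)]

# Wrap every target title with the start/end keywords used by the decoder.
papers["title_seq"] = "<Start> " + papers["title"].str.strip() + " <End>"

pairs = list(zip(papers["abstract"], papers["title_seq"]))
train_pairs = pairs[:5396]            # split sizes reported in the paper
val_pairs = pairs[5396:5396 + 1485]
```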
4 Methodology We have used two deep learning-based models of text summarization, viz. the Seq2Seq model and the RNN, for our experiment, and the Keras library for implementation. The categorical cross-entropy loss function was used in our experiment: cross-entropy measures the mismatch between the target and the predicted distribution, and categorical cross-entropy is the variant used when each target belongs to one of multiple classes. We used Root Mean Square Propagation (RMSprop) as the optimizer for the Seq2Seq models and adaptive moment estimation (Adam) for the RNNs. The rest of this section gives a brief introduction to the sequence-to-sequence model and recurrent neural networks.
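As a concrete illustration of the loss function mentioned above, the following sketch computes the categorical cross-entropy between a one-hot target word and a predicted softmax distribution (a generic example, not code from our experiments).

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot target over the vocabulary; y_pred: predicted softmax distribution
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0, 0.0])           # the target word is class 1
y_pred = np.array([0.1, 0.7, 0.1, 0.1])           # decoder softmax output
print(categorical_cross_entropy(y_true, y_pred))  # ~0.357 = -ln(0.7)
```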
4.1 Sequence-to-Sequence Model A sequence-to-sequence model works by mapping an input sequence to an output sequence whose length may differ. The model has two parts, an encoder and a decoder. We have used three different versions of the sequence-to-sequence model in our experiment. Sequence-to-Sequence (Model A). This model uses one LSTM layer in the encoder and one LSTM layer in the decoder, each 128-dimensional with 100 hidden states, and it generates the entire output sequence in a single pass. Sequence-to-Sequence model using GloVe in the encoder (Model B). This model makes use of GloVe word embeddings; GloVe stands for Global Vectors for Word Representation. It builds a global co-occurrence matrix by calculating the probability that a given word will co-occur with other words. This model uses one LSTM layer in the encoder and the decoder, again 128-dimensional with 100 hidden states, and the 100-dimensional GloVe 6B embeddings are fed to the encoder. Sequence-to-Sequence model using GloVe in encoder and decoder (Model C). In this model, the GloVe 6B 100D embeddings are used in both the encoder and the decoder, along with the same 128-dimensional LSTM with 100 hidden states.
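A minimal Keras sketch of an encoder–decoder of the Model A type is shown below. The vocabulary sizes are assumed values, the 128-dimensional embedding and 100 LSTM units follow the description above, and the RMSprop optimizer with categorical cross-entropy follows Sect. 4; this is an illustrative sketch, not the exact configuration used in the experiments.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab, emb_dim, units = 20000, 8000, 128, 100  # assumed sizes

# Encoder: reads the abstract and keeps only its final states.
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, emb_dim)(enc_in)
_, state_h, state_c = LSTM(units, return_state=True)(enc_emb)

# Decoder: generates the title conditioned on the encoder states.
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, emb_dim)(dec_in)
dec_seq = LSTM(units, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
dec_out = Dense(tgt_vocab, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], dec_out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```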
4.2 Recurrent Neural Networks Recurrent neural networks (RNNs) help in summarizing data the way humans do. A single RNN layer uses the output of the previous layers, and the same process is followed until the last layer is reached. In our work, we have used three different versions of recurrent neural networks. Recurrent Neural Network (One-shot RNN). This approach generates the entire output sequence in one pass; the context vector helps the decoder produce the output sequence. Recursive Recurrent Neural Network 1 (RNN1). This model forecasts a single word and is then called recursively: the context vector and the distributed representation of all words generated so far are fed as input to the decoder, which then generates the next word. Recursive Recurrent Neural Network 2 (RNN2). In this model, the encoder first converts the input document into a context vector representation. This representation is given to the decoder at each step of generating the output sequence, which helps the decoder maintain its internal state and produce the next word in the output sequence. For each word in the output sequence, this process is called recursively until the end-of-sequence token is generated or the maximum length is reached.
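The recursive generation loop common to RNN1 and RNN2 can be sketched as follows; predict_next stands in for a trained decoder, and the dummy function below exists only to make the sketch runnable.

```python
def generate_title(predict_next, context, start_id, end_id, max_len=20):
    """Recursively generate one word at a time until <End> or max_len."""
    generated = [start_id]
    while len(generated) < max_len:
        # Decoder is conditioned on the context vector and the words generated so far.
        next_id = predict_next(context, generated)
        if next_id == end_id:
            break
        generated.append(next_id)
    return generated[1:]  # drop the <Start> token

# Dummy decoder used only to make the sketch runnable.
def dummy_predict_next(context, generated, vocab=(5, 6, 7, 2)):
    return vocab[min(len(generated) - 1, len(vocab) - 1)]

print(generate_title(dummy_predict_next, context=None, start_id=1, end_id=2))  # [5, 6, 7]
```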
5 Results For comparison purposes, we have carried out experiments on the different Sequence-to-Sequence and RNN models and calculated accuracy and loss. Figures 2, 3, and 4 are obtained by running 100 epochs with Seq2Seq models A, B, and C, respectively. Figures 5, 6, and 7 show the accuracy and loss obtained with the one-shot RNN model, recursive neural network 1, and recursive neural network 2, respectively. We noticed that in Figs. 2, 3, 4, 5, 6, and 7 the training loss drops sharply while the validation loss eventually goes up, and that accuracy increases as the number of epochs increases. From Figs. 2, 3, 4, 5, 6, and 7 and Table 2, we can say that RNN1 gives the highest accuracy with the lowest training loss. Another important observation is that the RNNs provide better accuracy than the Seq2Seq models.
Fig. 2 Graph of Model A
Fig. 3 Graph of Model B
6 Conclusion There has been tremendous growth in the amount of digital information available in the form of text documents in recent years. This has led to the need to develop automated text mining and text summarization methods. In this paper, we have tried
Fig. 4 Graph of Model C
Fig. 5 Graph of RNN (one-shot)
Fig. 6 Graph of Recursive RNN1
Fig. 7 Graph of Recursive RNN2
Table 2 Evaluation of Seq2Seq models and Recursive RNNs
Model name | Max accuracy | Min loss
Seq2Seq | 0.3248 | 0.5167
Seq2Seq V1 | 0.2758 | 1.3587
Seq2Seq V2 | 0.1876 | 1.1134
One-shot RNN | 0.3520 | 1.2309
RNN1 | 0.7856 | 0.7765
RNN2 | 0.6339 | 1.6164
to predict the title of research papers from the abstract. There are many approaches to text summarization, such as graph-based, statistical, algebraic, machine learning-based, and deep learning-based methods. In this work, we trained RNNs and Seq2Seq models to generate titles from the abstracts of research papers. For comparison purposes, we calculated accuracy and hold-out loss. We observed that RNNs provide better accuracy than Seq2Seq models, and the RNN1 model gives the highest accuracy with the lowest training loss. In the future, these models can be tested on multi-domain datasets of larger size.
References
1. Boorugu, R., Ramesh, G. (2020). A survey on NLP based text summarization for summarizing product reviews. In Proceedings of the Second International Conference on Inventive Research in Computing Applications (pp. 352–356).
2. Moratanch, N., Chitrakala, S. (2017). A survey of extractive text summarization. In Proceedings of the International Conference on Computer, Communication, and Signal Processing (pp. 1–6).
3. Rush, A. M., Chopra, S., Weston, J. (2015). A neural attention model for abstractive sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 379–389).
4. Suleiman, D., Awajan, A., Al Etaiwi, W. (2017). The use of hidden Markov model in natural Arabic language processing: a survey. Procedia Computer Science, 113, 240–247.
5. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444.
6. Lopyrev, K. (2015). Generating news headlines with recurrent neural networks. arXiv:1512.01712.
7. Al-Sabahi, K., Zuping, Z., Kang, Y. (2018). Bidirectional attentional encoder-decoder model and bidirectional beam search for abstractive summarization. arXiv:1809.06662.
8. Nallapati, R., Xiang, B., Zhou, B., Santos, C., & Gulcehre, C. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of Twentieth SIGNLL Conference on Computational Natural Language Learning (pp. 280–290).
9. Shi, T., Keneshloo, Y., Ramakrishnan, N., & Reddy, C. K. (2018). Neural abstractive text summarization with sequence-to-sequence models: a survey. arXiv:1812.
10. Paulus, R., Xiong, C., & Socher, R. (2018). A deep reinforced model for abstractive summarization. In Proceedings of the International Conference on Computation and Language (pp. 1–12).
11. Ranzato, M. A., Chopra, S., Auli, M., Zaremba, W. (2015). Sequence level training with recurrent neural networks. arXiv:1511.06732.
An Empirical Analysis of Survival Predictors for Cancer Using Machine Learning Ishleen Kaur, M. N. Doja, and Tanvir Ahmad
Abstract Cancer management is an active domain of research in machine learning applications, and cancer survival prediction is one such application. However, most of the studies use online datasets due to easy availability and a larger number of instances. Real-world datasets in medical applications are challenging to collect but may give some critical insights relevant to local people. In this paper, a real-world dataset of metastatic prostate cancer patients has been collected from an Indian hospital and analyzed for cancer survival prediction. The authors identified some significant survival predictors and classified them into three categories. The significance of each of the categories is examined using machine learning algorithms. Quality of life attributes gave better prediction results than diagnostic and treatment attributes, indicating that a metastatic patient's survival can be anticipated better from his overall health and comorbidities than from his initial diagnostic attributes. Keywords Data mining · Cancer survival · Machine learning · Prostate cancer · Survival predictors · Quality of life
1 Introduction Cancer management and predicting survivability after being diagnosed with cancer have been pressing issues for researchers and scientists for several decades. After cardiovascular diseases, cancer is the second leading cause of death in the world [1]. Researchers continue to conduct studies every year to better predict the survival of cancer patients. Predicting survival is necessary for patients to be better
I. Kaur (B) · T. Ahmad Department of Computer Engineering, Jamia Millia Islamia, New Delhi, India T. Ahmad e-mail: [email protected] M. N. Doja Indian Institute of Information Technology, Sonepat, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_17
prepared for their lives and future treatments. Survival prediction is also vital for medical practitioners in recommending the best treatment line for the patient. Various studies illustrate the use of data mining methods and techniques using numerous sorts of attributes to predict cancer patients' survival [2–4]. Clinical attributes, like diagnostic variables and medical history, are the most common and useful characteristics [5, 6]. Several studies have also been conducted to recommend proper treatments for patients using machine learning [7, 8]. Today, all hospitals and medical industries store vast amounts of medical records, patients' files, reports, etc. However, these records are just raw data that require gigabytes to petabytes of storage. Researchers and scientists need to turn the stored data into information to enhance the overall health and survival of patients. Though machine learning has been used in many medical applications, most of the studies involve online datasets such as SEER [9] and the UCI machine learning repository [10]. Online datasets are large, but unlike other online datasets, the results obtained from medical datasets might not be applicable to all regions. Previous studies have proven that patients belonging to different races or regions may have different responses to treatments and correspondingly different survival outcomes [11]. Also, almost all online clinical datasets capture data from western countries, but no Indian dataset is publicly available for researchers. This study aims to identify the different classes of variables for predicting the outcome of a cancer patient. The identified categories of variables have been analyzed in order to evaluate the best predictors for survival outcomes. A case study of prostate cancer patients from a hospital located in the capital of India has been used for the analysis. The variables are categorized into diagnostic, treatment, and quality of life attributes. This study is an attempt to identify the significance of attributes for prostate cancer survival prediction to understand their relation with the survival outcome. The rest of the paper is organized as follows: Sect. 2 gives a summary of current literature. Section 3 describes the detailed approach used in the study, while Sect. 4 discusses the outcomes and their relevance in the present scenario. Section 5 concludes the study.
2 Related Work Researchers have been trying for several decades to analyze cancer patients' survival using different predictors [3, 12]. Machine learning has been widely used in various medical applications, including predicting the survival of cancer patients [4, 13]. Our literature survey comprises studies involving the survival of prostate cancer patients using machine learning or statistical techniques. Hall et al. [14] focused on the idea of including various comorbidities while considering the treatment pattern and predicting the survival outcome of a localized prostate cancer patient. Though clinicians may consider age in determining appropriate treatment for a patient, some older patients may not get rigorous treatment options to improve their survival due to their old age and probably lower tolerance and
life expectancy. Various comorbidities have been considered in treatment decisions and outcomes research. Auffenberg et al. [15] tried to educate prostate cancer patients about the treatment options that similar patients had chosen. Diagnostic features were used to predict the primary treatment for the patients. A multinomial random forest model was used to compute the probability of a patient receiving a particular treatment and achieved an overall AUC of 0.81 in the validation cohort. Studies have also been conducted to determine effective treatments using association rule mining or sequence mining for different cancer patients [8, 16]. Kerkmeijer et al. [17] created a model to predict survival in localized prostate cancer patients based on pre-treatment factors and compared it with the currently used risk assessment models. Ten-year overall survival and disease-specific survival were estimated using the Kaplan–Meier method with a comprehensive follow-up of around 8.3 years. The total risk score was estimated using pre-treatment attributes collected for each patient. Nezhad et al. [11] used an online SEER dataset of prostate cancer patients to explore the use of deep learning and active learning to predict cancer patients' survival. The authors also compared the mortality of white and African-American people and suggested that the two groups' treatment options are different.
3 Methodology The study follows the methodology, as shown in Fig. 1. Each of the steps involved in the methodology is explained in detail in the following subsections.
3.1 Data Collection and Preprocessing
3.1.1 Data Collection
It is a retrospective study involving data from Rajiv Gandhi Cancer Institute and Research Center, India, using the case study of prostate cancer patients. The data
Fig. 1 Methodology
collection took place by the authors manually from the patients’ files digitally stored in the hospital repository. The authors got approval from the hospital with letter no. Res/SCM/31/2018/99 received from the Scientific Committee of the hospital. The study focused on metastatic prostate cancer patients diagnosed and treated between 2011 and 2015. Survival data were collected from the patients’ case files or by contacting the family of patients. Any such patient with missing survival information was removed from the analysis.
3.1.2 Feature Selection and Categorization
Features were carefully chosen in consultation with previous literature and experienced urologists from RGCI&RC. The final feature set was divided into three input categories, as mentioned in Table 1. Diagnostic attributes comprised those recorded at the time of cancer diagnosis, like age, PSA levels, Gleason score, and metastasis location. Treatment attributes included up to four treatments given to the patients, along with the nadir PSA levels recorded after treatment with a drug/therapy. Although previous studies have incorporated initial treatment as a valid predictor of survival, literature considering the subsequent lines of treatment remains scarce. Lastly, the quality of life attributes were included. In general, a person's quality of life includes various factors comprising physical, mental, emotional, and social well-being [18]. This study incorporated attributes that capture the patients' physical well-being: ECOG (performance status) and the Charlson Comorbidity Index (CCI). ECOG is recorded for a patient to determine how he handles himself and his day-to-day work and thus which treatments he can take. CCI is an online calculator that considers various comorbidities that can influence the survival of a person [19] and has been used by several outcome-related studies [14]. Table 1 Feature categorization
Type | Attributes
Diagnostic attributes | Age at diagnosis; PSA level at diagnosis; Gleason score (grade); Metastatic location (Bones/nodes/viscera)
Treatment attributes | A sequence of treatments (First/Second/Third/Fourth); Nadir PSA
Quality of life attributes | ECOG; Charlson Comorbidity Index (CCI)
Outcome information | Class
3.1.3 Missing Data
Missing data in the patients' files were handled by removing instances with more than 50% of the entries missing. The remaining missing entries were filled using the mean for numerical attributes (e.g., PSA) and the mode for categorical attributes (e.g., Gleason score), computed from the same class. The final dataset included 407 patients, of which only 158 (38.8%) survived three years and 249 were deceased.
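A sketch of these cleaning rules, assuming the records are held in a pandas DataFrame with a 'Class' column (the column names are illustrative, not those of the hospital dataset), is given below.

```python
import pandas as pd

def clean(df, class_col="Class"):
    # Drop patients with more than 50% of the attributes missing.
    df = df.dropna(thresh=df.shape[1] // 2)

    # Fill remaining gaps per survival class: mean for numeric columns
    # (e.g., PSA), mode for categorical ones (e.g., Gleason score).
    for col in df.columns:
        if col == class_col:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df.groupby(class_col)[col].transform(lambda s: s.fillna(s.mean()))
        else:
            df[col] = df.groupby(class_col)[col].transform(
                lambda s: s.fillna(s.mode().iloc[0]) if not s.mode().empty else s)
    return df
```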
3.2 Classification Techniques The processed data after being divided into three categories are used for classification into survived and deceased groups. The following classification techniques were used for the study due to their relevance in the medical domain and small datasets. Support Vector Machines—SVMs are one of the highly used classification techniques for small datasets. The support vector machine creates a hyperplane that divides the dataset for classification [20]. The name derives from the fact that the hyperplane’s data values are called support vectors and are used for creating the optimal hyperplane. Decision trees—Decision trees have been used in various medical studies due to their straightforward interpretation and decent results. It creates an inverted tree with the root node at the top and leaf nodes (classes) at the bottom. The root node is selected with the highest information gain/Gini ratio and thus best classifies the dataset. Ensembles—An ensemble classifier creates a composite model generated by combining various classifiers. Bagging, boosting, and random forests are popular classifiers that have been used in the study. Bagging combines the classifiers by assigning equal weights to each tuple while boosting updates the weight of a tuple previously misclassified by a classifier so that the next classifier can give it more attention. Random forests are similar to bagged trees, except that the decision trees are more independent of each other in a random forest than bagging [20].
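An illustrative scikit-learn sketch of such a comparison is given below; synthetic data stands in for the hospital dataset, and the default hyper-parameters are not those tuned in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 407 samples with roughly the 39%/61% class split of the study.
X, y = make_classification(n_samples=407, n_features=12, weights=[0.61, 0.39], random_state=0)

models = {
    "SVM": SVC(),
    "Decision tree": DecisionTreeClassifier(),
    "Bagging": BaggingClassifier(),        # bagged decision trees by default
    "Boosting": AdaBoostClassifier(),
    "Random forest": RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```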
4 Results and Discussion Table 2 reports the results using bagging and boosting only, since only these two techniques performed comparably and well enough to draw conclusions. It can be seen from Table 2 that bagging gave better results than boosting, and that quality of life attributes contribute the most to survival prediction. Diagnostic features are not of much importance in predicting the survival of prostate cancer, which implies that it is not possible to accurately predict the survival of metastatic prostate cancer patients at the time of diagnosis. The slightly better predictions of the treatment attributes indicate that different treatments are given
Table 2 Results
Attributes | Bagging Accuracy (%) | Bagging TP Rate | Bagging FP Rate | Bagging AUC | Boosting Accuracy (%) | Boosting TP Rate | Boosting FP Rate | Boosting AUC
Diagnostic attributes | 68.3 | 0.73 | 0.54 | 0.66 | 63.4 | 0.73 | 0.53 | 0.69
Treatment attributes | 71.5 | 0.78 | 0.4 | 0.7 | 70.5 | 0.74 | 0.35 | 0.67
Quality of life attributes | 79.6 | 0.84 | 0.27 | 0.84 | 79.1 | 0.83 | 0.27 | 0.81
Diagnostic + treatment | 73.3 | 0.83 | 0.43 | 0.77 | 72.3 | 0.83 | 0.45 | 0.77
Diagnostic + quality of life | 80.4 | 0.85 | 0.27 | 0.86 | 80.4 | 0.85 | 0.27 | 0.82
Treatment + quality of life | 75.8 | 0.82 | 0.34 | 0.85 | 75.8 | 0.84 | 0.38 | 0.82
All | 81.4 | 0.86 | 0.29 | 0.86 | 80.7 | 0.87 | 0.30 | 0.87
to the patient in his lifetime. Its response can affect a patient’s survivability better than the initial diagnostic attributes. Quality of life attributes, i.e., the patients’ performance quality and various comorbidities, can be used to predict the survivability of the patients effectively. ECOG, a measure used to rate a patient’s general capabilities, is provided to the patients by a clinician and is based directly on his intuition. CCI, which is a measure of comorbidities suffered by a patient, also affects a person’s overall survival. The importance of comorbidities and treatments for survival analysis has also been acknowledged in previous literature [7, 21]. Patients with higher comorbidities and ECOG levels greater than one should be treated accordingly. Figure 2 shows the decision trees obtained using each of the attribute types. Due to a larger number of values for each treatment, Fig. 2c shows a cropped decision tree. However, it can be comprehended from the results that the accuracy and other factors may improve when we combine features from different types. The best prediction accuracy is achieved when all the attributes have been used in the analysis. For a dual set of attributes, diagnostic and quality of life attributes gave a better performance than the treatment and quality of life. Statistical tests have also been conducted on the dataset using SPSS to validate the results. Table 3 gives the p-values computed for the features. Considering a significance level of 0.05, it is clear from the results that both the quality of life attributes are highly significant for the survival outcome. However, only the Gleason score of the diagnostic attributes was found to be substantial in these tests. Gleason score was also the root node of the decision tree constructed as
Fig. 2 Decision tree for a diagnostic attributes; b performance attributes; c treatment attributes
in Fig. 2a. Also, the different treatments were slightly crucial in the class outcome, especially the second and third lines of treatments. It can also be backed by Fig. 2c and previous studies, which indicated that since a patient receives multiple lines of treatments, the later stages of treatments may improve the survival prediction of cancer patients [7].
Table 3 p-values for attributes
Type | Attributes | p-value
Diagnostic attributes | Age at diagnosis | 0.957
Diagnostic attributes | PSA at diagnosis | 0.9
Diagnostic attributes | Gleason score | 0.016
Diagnostic attributes | Bones | 0.389
Diagnostic attributes | Nodes | 0.825
Diagnostic attributes | Viscera | 0.064
Treatment attributes | First | 0.437
Treatment attributes | Second | 0.07
Treatment attributes | Third | 0.0056
Treatment attributes | Fourth | 0.0832
Treatment attributes | Nadir PSA | 0.46
Quality of life attributes | ECOG |
Quality of life attributes | Charlson Comorbidity Index (CCI) |
0, b = 0 is dc offset, a and b are real quantities, which are unknown to both sender and receiver. The codeword will be affected by AWGN vector v = (v1 , v2 , …, vn ), v ∈ R, consisting of n noise samples. The Pearson distance between any possible transmitted vector and received vector is defined in (2). This measure lies in the range of 0–2. i denotes the bit position in a codeword of length n.
$$\delta^{2}(\mathbf{r}, \hat{\mathbf{x}}) = \sum_{i=1}^{n} \left( r_i - \frac{\hat{x}_i - \bar{\hat{x}}}{\sigma_{\hat{x}}} \right)^{2} \qquad (2)$$
Here $\delta^{2}(\mathbf{r}, \hat{\mathbf{x}})$ represents the modified Pearson distance, where r and x̂ denote the received codeword and the pivot codeword, respectively. The summation is carried out over the entire codeword length; $\bar{\hat{x}}$ is the mean of the pivot codeword and $\sigma_{\hat{x}}$ is the standard deviation of the pivot codeword. T-constrained codes are q-ary codes of length n. T
denotes the number of symbol values that are constrained to appear in every codeword of the codebook design, T = 1, 2, …, q. The set of T-constrained codewords is represented by ST. Consider the binary case of q = 2: S1 and S2 are two sets of constrained codewords. In the S1 code design, the symbol '1' must appear at least once in each codeword. In S2, both symbols '0' and '1' must appear at least once in each codeword, so the all-zeros and all-ones codewords are excluded. The codebooks have |S1| = 2^n − 1 and |S2| = 2^n − 2 codewords, respectively. These codes remove the drawback that the Pearson distance measure becomes infinite when the codeword is the all-zeros or a constant word.
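A small sketch that enumerates the binary T = 2 codebook described above and confirms the |S2| = 2^n − 2 count is given below (illustrative only).

```python
from itertools import product

def s2_codebook(n):
    # Binary T = 2 constraint: every codeword contains at least one 0 and one 1,
    # i.e., the all-zeros and all-ones words are excluded.
    return [w for w in product((0, 1), repeat=n) if 0 < sum(w) < n]

for n in (4, 8):
    print(n, len(s2_codebook(n)), 2 ** n - 2)   # the two counts agree
```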
2.2 Slepian’s Algorithm [1] Slepian’s algorithm is a very efficient algorithm used for decoding the received vector r, which is encoded with T-constrained codes. In decoding the received vector, the concept of Pivot words is used. The i-th pivot word is represented by xpi , 1 ≤ i ≤ K, where the total number of pivot codewords is given by K = n + q – T – 1. A pivot word xpi has n symbols lexicographically arranged in the order from largest to smallest. All the symbols 0 to q-1 are present in each codeword. For binary codes, q = 2, thus xpi,1 = 1 and xpi,n = 0. For example, the pivot codewords for binary case with n = 4, q = 2, T = 2 are [1 0 0 0], [1 1 0 0] and [1 1 1 0]. All the remaining distinct codewords of ST can be obtained by permuting the order of n symbols of the ‘K’ pivot words. In Slepian’s sorting algorithm, first, the received vector r is arranged lexicographically from largest to smallest value. Pearson distance from Eq. (2) between r and all K pivot words is computed. If pivot code xpj is at MPD to the received codeword r, then the receiver decides that the respective transmitted codeword will have the same composition as the constant composition pivot codeword xpj . The received codeword can be decoded by applying reverse Slepian’s algorithm on the constant composition codeword xpj . Detailed analysis and working examples can be found in [1, 2] for the application of MPD and Slepian’s algorithm concept on code vectors received from noisy/fading channels.
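The following NumPy sketch illustrates MPD detection for the binary q = 2, T = 2 case using Eq. (2) and the sorting idea of Slepian's algorithm; it is an illustration of the procedure, not the implementation used in this work.

```python
import numpy as np

def pivot_words(n):
    # Binary T = 2 pivot words: k ones followed by n-k zeros, k = 1..n-1.
    return [np.array([1.0] * k + [0.0] * (n - k)) for k in range(1, n)]

def modified_pearson(r_sorted, pivot):
    ref = (pivot - pivot.mean()) / pivot.std()       # (x_i - mean) / sigma of the pivot word
    return np.sum((r_sorted - ref) ** 2)             # Eq. (2)

def mpd_detect(r):
    r = np.asarray(r, dtype=float)
    order = np.argsort(-r)                           # Slepian: sort received values, largest first
    r_sorted = r[order]
    best = min(pivot_words(r.size), key=lambda p: modified_pearson(r_sorted, p))
    x_hat = np.empty_like(best)
    x_hat[order] = best                              # reverse the sort to place the symbols
    return x_hat

# Codeword (1, 0, 1, 0) distorted by gain 2.6, offset 0.4, and mild noise is recovered.
rng = np.random.default_rng(0)
r = 2.6 * np.array([1, 0, 1, 0]) + 0.4 + 0.05 * rng.standard_normal(4)
print(mpd_detect(r))
```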
2.3 Analysis of Transmission Schemes for MIMO Systems In [1], the idea of MPD-based detection was developed and enhanced for MIMO systems with the help of new transmission methods. Signaling schemes and simple arithmetic combining were developed for 2×2 and 4×4 MIMO systems using BPSK and QPSK. The block diagram and signal flow graph for 2 × 2 MIMO are shown in Figs. 1 and 2. Here two symbols are transmitted every three-time intervals. At the receiver side, the received symbols at the end of three-time spans on two antennas
Fig. 1 Generic block diagram for 2 × 2 MIMO with Pearson distance-based detector
Fig. 2 Signal flow graph for 2X2 MIMO system with new arithmetic scheme [2]
are collected and the arithmetic combining operation is carried out as per Eqs. (3) and (4).

$$y_{\mathrm{add}} = y_{11} + y_{21} + y_{12} + y_{22} + y_{13} + y_{23} \qquad (3)$$

$$y_{\mathrm{sub}} = y_{11} + y_{21} + y_{12} + y_{22} - y_{13} - y_{23} \qquad (4)$$
Here y_ij denotes the symbol received on antenna i at time unit j. Expanding Eq. (3), we find that the symbol x1 is scaled by all the channel coefficients while the symbol x2 is suppressed. Similarly, in Eq. (4), the symbol x2 is reinforced by all the channel coefficients while x1 is suppressed. This strategy exploits MIMO diversity to the full extent, so that even if any single link is in a deep fade, the received signal does not weaken much because of the diversity gain contributed by all the channel coefficients. The transmission scheme in Fig. 2 and in [1] for 2 × 2 MIMO belongs to a broad class of techniques, which can be represented as
$$\begin{pmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \end{pmatrix} \begin{pmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{pmatrix} + \begin{pmatrix} n_{11} & n_{12} & n_{13} \\ n_{21} & n_{22} & n_{23} \end{pmatrix} \qquad (5)$$
where xij , nij and yij , refer to the transmitted signal value, AWGN and received signal value at antenna i at time unit j, range of i = 1, 2 and j = 1, 2, 3. We performed a complete theoretical analysis of different possible signaling methods (at transmitter) and combining methods (at receiver) and verified that only
when the transmission symbols are arranged as $\begin{pmatrix} x_1 & x_2 & x_1 - x_2 \\ x_2 & x_1 & x_1 - x_2 \end{pmatrix}$ (or the same arrangement with its two rows interchanged) and the combining equations are chosen as (3) and (4), it enables the isolation of the transmitted symbols at the receiver and also
leverages full channel diversity. Any other combination does not yield the desired effect: it either creates a signal that is a function of both transmitted symbols x1 and x2 or fails to utilize the channel coefficients fully. Our complete analysis can be found in [4]. The new arithmetic scheme proposed in [1] for a 4×4 MIMO system communicates four modulated symbols over four antennas in six units of time. Twenty-four equations are obtained for the received symbols, which are represented in a compact matrix form. To effectively isolate the four transmitted symbols at the receiver, arithmetic combining operations are performed on the symbols received over six time intervals on four receiving antennas. This method is also analyzed by us in [4].
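The combining identities can also be checked numerically. The sketch below follows the symbol arrangement and Eqs. (3)-(5) above, with noise omitted for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = 1.0, -1.0                       # BPSK symbols
X = np.array([[x1, x2, x1 - x2],
              [x2, x1, x1 - x2]])        # 2 antennas over 3 time units
H = rng.normal(size=(2, 2))              # flat-fading channel, one coefficient per link
Y = H @ X                                # received symbols, Eq. (5) without noise

y_add = Y.sum()                                          # Eq. (3)
y_sub = Y[:, 0].sum() + Y[:, 1].sum() - Y[:, 2].sum()    # Eq. (4)

g = 2 * H.sum()                          # diversity gain from all four channel coefficients
print(np.isclose(y_add, g * x1), np.isclose(y_sub, g * x2))   # True True
```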
3 Proposed OFDM and LTE System Design with MPD After verifying the preliminary system designs (BPSK/QPSK/2 × 2 and 4 × 4 MIMO) through simulation and analysis, we proceed with the design of an OFDM system using the MPD concept. The input specifications required for the OFDM system implementation include the number of sub-carriers, bits per OFDM symbol, number of cyclic prefix samples, FFT size, and the total number of OFDM symbols. The input data are first encoded with T-constrained codes with the help of the codebook, and the bits are modulated with BPSK. The modulated data are then interleaved and loaded into the FFT frequency bins, the sub-carriers are converted back to the time domain with an IFFT, and the cyclic prefix is added to form one OFDM symbol. This is transmitted over the radio channel as an OFDM burst, and the process is repeated for the complete data. At the receiver, the converse process is carried out: the cyclic prefix is removed, the symbol is converted to the frequency domain with an FFT, and the result is deinterleaved and applied to the MPD detector. The received symbol values are gathered at the end of every third time interval (in the case of 2 × 2) and every sixth time interval (in the case of 4 × 4 MIMO) to make the length equal to n. These consolidated vectors are input to the MPD detector for estimating the transmitted codeword using Slepian's algorithm. The detailed block diagram for the proposed implementation is shown in Fig. 3. A similar scheme is implemented for the LTE system. The input data bits are first encoded with T-constrained codes, BPSK modulated, and interleaved. The symbols are OFDM modulated and transmitted with any of the multi-antenna schemes. At the receiver side, arithmetic combining is done to recover the symbols and OFDM demodulation is carried out. The antenna configurations SISO, 2×2, and 4×4 are tested. Subsequently, the detection of the transmitted bits is carried out through the MPD concept. Finally, the BER performance is evaluated for different SNRs. Table 1 denotes the parameter values used for the implementation of the LTE system.
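A minimal NumPy sketch of the per-symbol OFDM steps just described (IFFT, cyclic prefix insertion, and their inverses) is given below; the parameters follow the 1.4 MHz column of Table 1, and loading the data onto the first 72 FFT bins is a simplification of the actual sub-carrier mapping and interleaving.

```python
import numpy as np

n_fft, n_sc, cp_len = 128, 72, 16            # FFT size, used sub-carriers, cyclic prefix

def ofdm_modulate(symbols):
    bins = np.zeros(n_fft, dtype=complex)
    bins[:n_sc] = symbols                    # load BPSK symbols into sub-carrier bins
    time = np.fft.ifft(bins) * np.sqrt(n_fft)
    return np.concatenate([time[-cp_len:], time])   # prepend the cyclic prefix

def ofdm_demodulate(rx):
    time = rx[cp_len:]                       # strip the cyclic prefix
    bins = np.fft.fft(time) / np.sqrt(n_fft)
    return bins[:n_sc]

tx = 2 * np.random.randint(0, 2, n_sc) - 1           # BPSK symbols
rx = ofdm_demodulate(ofdm_modulate(tx))
print(np.allclose(rx.real, tx))                      # True on an ideal channel
```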
Fig. 3 Block diagram for LTE MIMO-OFDM system with T-constrained codes and MPD
4 Results and Discussion 4.1 Simulation Results: BER Performance The BER performance of the different methods described above is evaluated here through simulation. The BER result of T-constrained codes with MPD using T = q
Table 1 Parameters for the implementation of LTE system
Transmission bandwidth (MHz) | 3 | 5 | 1.4
Sub-carrier spacing (kHz) | 15 | 15 | 15
FFT size | 256 | 512 | 128
Number of sub-carriers | 180 | 300 | 72
Cyclic prefix length | 4, 16 | 16 | 16
Constrained length | 2 | 2 | 2
Code word length | 4, 8, 16 | 4 | 4
Fig. 4 a BER performance of two-constrained code with MPD-based detection with BPSK and AWGN channel condition with n = 4, 8 and 16. b BER performance of turbo codes with multiple iterations
= 2, code length n = 4, 8, and 16 under AWGN channel is shown in Fig. 4a. The LTE system utilizes Turbo coding mechanism as a standard. Hence it is necessary to compare the BER performance of the proposed scheme with a Turbo coding framework. For simplicity of visualization, the comparison is made using BPSK modulation under AWGN channel conditions for both the transceivers. Figure 4 b shows the BER performance of Turbo codes under BPSK and AWGN conditions with pseudo-random interleaver. The examination of Fig. 4a and b shows that the error performance of the proposed coding scheme is similar to and even better than the Turbo coding scheme with four iterations (green curve in Fig. 4b). Also, it is important to note that T-constrained coding performance can be further enhanced by varying constraint length, order of q-ary codebook and codeword length. Figure 5 a shows the performance of constrained codes with T = 4 and q = 2 and MPD detection technique with OFDM. Figure 5b illustrates the turbo-coded system with OFDM for multiple iterations under AWGN channel. We can see that the performance of two-constrained codes with MPD and OFDM is almost similar to the performance of Block turbo codes with OFDM with five iterations. A BER performance close to 10–3 is obtained at SNR of 4 dB for MPD system, whereas
Fig. 5 BER Performance of BPSK-OFDM on AWGN channel. a Two-constrained code with MPD-based detection. b Turbo codes with multiple iterations
the error performance of 10^-3 is obtained for a SNR of around 3.8 dB for the turbo-coded system. Hence with OFDM also, the performance of the proposed system is comparable. Figures 6, 7, 8, 9 and 10 illustrate the simulation results for various conditions for T = 2 MPD with BPSK-OFDM using MIMO channel. The main parameters considered for the comparison are FFT size, cyclic prefix (CP) size and codeword length n. By comparing Figs. 6 and 11 for the 2×2 MIMO case, the proposed scheme has a BER performance of 5×10^-5 at around 5 dB, whereas the turbo-coded system shows
Fig. 6 BER with MIMO-OFDM MPD, FFT size = 256, n = 4, CP = 16
Fig. 7 BER with MIMO-OFDM MPD, FFT size = 256, n = 8, CP = 16
Fig. 8 BER with MIMO-OFDM MPD, FFT size = 256, n = 16, CP = 16
Fig. 9 BER with MIMO-OFDM MPD, FFT size = 128, n = 4, CP = 16
Fig. 10 BER with MIMO-OFDM MPD, FFT size = 512, n = 4, CP = 16
Fig. 11 Performance of turbo-coded system with OFDM under AWGN channel condition concatenated code as interleaver [3]
a BER performance of 10^-2 at 10 dB. For the 4×4 MIMO case, the proposed scheme has a BER performance of 10^-5 at almost 5 dB, whereas the turbo-coded system shows a BER performance of 10^-3 at almost 10 dB. When we change the FFT size from 256 to 128 and 512, the BER results plotted in Figs. 9 and 10 demonstrate a similar type of performance.
4.2 Computational Complexity and Implementation Aspects Conventional LTE systems use the turbo-coding technique, so a large decoding delay is incurred owing to the large block lengths and the many decoding iterations required for good performance. The receiver uses a channel estimation algorithm followed by a combiner and ML detection based on Euclidean distance. These blocks add significantly to the computational complexity, cost, and delay at the receiver. With the incorporation of Slepian's algorithm, the number of computations is greatly reduced, since the Pearson distance of Eq. (2) needs to be computed only between the received codeword and the pivot codewords, and not with all the codewords of the codebook. The mean and variance of the pivot codes need only a one-time calculation. The multiplication and addition operations in the implementation of the MPD detector increase linearly with the codeword length. In the proposed transmission scheme, the linear combiner requires only real-valued additions/subtractions and no multiplications. This is in contrast to the maximum likelihood combiner equations derived for different MIMO configurations in the literature, e.g. [5]. These results indicate that the proposed technique provides comparable or better performance than a turbo-coded LTE system with less computational complexity.
5 Conclusions We perform the complete theoretical analysis of the methods in [1] to validate that the transmitted symbols can be indeed recovered correctly after arithmetic combining, and performance improvement due to diversity gain is obtained. We develop a transceiver scheme for MIMO-OFDM system by using T-constrained codebook and MPD-based detection, which can provide comparatively better performance than conventional turbo-coding systems, in terms of BER, as well as computational complexity, latency and cost. This can form a potential candidate for upcoming communication systems standards.
References
1. Bharath, L., & Poddar, P. G. (2019). Minimum Pearson distance based detection for MIMO systems. In Third International Conference on Computing and Network Comm (CoCoNet'19). Elsevier Procedia Computer Science.
2. Immink, K. A. S., & Weber, J. H. (2014). Minimum Pearson distance detection for multilevel channels with gain and/or offset mismatch. IEEE Transactions on Information Theory, 60(10), 5966–5974.
3. Torabi, M., & Soleymani, M. R. (2002). Turbo coded OFDM for wireless local area networks. In IEEE Canadian Conference on Electrical & Computer Engineering.
4. Anoop, H. A. (2020). Minimum Pearson distance detection for LTE system. Master of Technology thesis, Department of Electronics and Communication Engineering, B.M.S. College of Engineering, Bengaluru, India, November 2020.
5. Alamouti, S. M. (1998). A simple transmit diversity technique for wireless communications. IEEE Journal on Selected Areas in Communications, 16(8).
Study on Emerging Machine Learning Trends on Nanoparticles— Nanoinformatics B. Lavanya and G. Sasipriya
Abstract Artificial intelligence (AI) is an influential technology, which has helped many branches of science and technology accomplish an unimaginable progress. The field of nanoparticles has had more than a decade-long tryst with AI, but it is yet to fully realize the benefits of AI to the extent other branches of technologies have. In recent times, there has been a renewed interest in applying machine learning and similar AI concepts to produce new nanoparticles, which has resulted in exponential growth of research in that direction and resulting data. This research paper critically analyzes that data to explore the prevalent trends/methods in applying machine learning (ML) and data mining techniques in the field of nanoparticles. Beginning with delineating the challenges faced in implementing ML in the nanoparticles segment, this paper goes on to explore the various learning algorithms relevant to the field of nanomaterials, their applications, suitability, merits, and demerits. In the latter section of the paper, a detailed analysis (in the form of a flow chart) spells out the considerations to be taken while collecting and constructing data, choosing and applying appropriate ML algorithms. This paper reviews various ML algorithms used in recent articles. In essence, this paper compendiously provides information and prediction of nanoparticles necessary to successfully customize ML for nanomaterials vertical, especially for nanoparticles of medical significance and applications. Keywords Machine learning · Artificial intelligence · Nanoparticles · Data mining
1 Introduction Today, there is hardly a field of science or technology, which has not used or benefitted from artificial intelligence (AI). While AI seems to have been readily and well adapted by many branches of science/technology to accomplish unimaginable progression or expedite their advancement, the field of nanoparticles seems to have struggled with a B. Lavanya (B) · G. Sasipriya Department of Computer Science, University of Madras, Chennai, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_37
similar adaptation. The progression of ML applications in the field of nanomaterials is quite slow if not stagnated. Nanoparticles’ segment is an ever-changing field of play, where AI technologies like machine learning (ML) algorithms and data mining techniques can be stretched beyond their limits, thereby attracting the attention of ardent proponents and scholars of such technologies to this segment. However, much of that valuable data are either a grand garble of information or insights hiding like a needle in haystack, especially in the haystack called textual verbose. The imperative need of the hour is a well-trained ML program configured to critically investigate this data explosion and unearth the deeply buried nuggets of insights into it, thereby figuring out what exactly needs to be done to accomplish that elusive success in adapting AI to the needs of nanoparticles’ field. That is exactly the objective of this paper: to utilize data mining and ML techniques on that behemoth of online research materials to extract meaningful information about ML applications in nanoparticles vertical, especially nanomedicine. Taking a cue from our forerunners, who carried out a similar exercise on ML usage in nanoparticles in the fall of 2015 [1], this research proposal went on to scouring several similar research publications and meticulously investigating them. Building upon the aforementioned initiative, this research exercise conducted more online searches and reviews of resulting discoveries with an intent to collate enough data to discern the following: the kind of data collected in experiments concerning ML implementations, reasons behind such a data gathering, techniques involved in the process, choice of ML algorithms, challenges faced in such implementations, and any shortfalls or loose ends or insufficiencies. In addition to discussing these findings, this paper also sheds light on the prevalent challenges facing ML applications in nanomaterials’ segment, especially for critical activities such as the prediction of biomedical properties of nanoparticles, which might be of interest to the medical world. The paper also summarizes the effort so far this decade on the application ML in nanoparticles in Table 1. Furthermore, this research paper posits the recommendation of using Natural Language Processing techniques to probe the online literature in an efficient manner covering the entire breadth and depth of the literature, harvest meaningful data, and build a repository of that resulting data.
2 Nanoparticles and Nanoinformatics: The Uncharted Territories In the field of biomedicine engineering, nanoinformatics assumes a greater significance as it has the potential to deal with the multidimensionality of required analysis in this particular branch of engineering. For example, in this discipline, there is an imperative need for “analyzing and handling data around the structure and, physical and chemical characteristics of nanoparticles and nanomaterials, their interaction with their environments, and their nanomedicine applications” [2]. However, it is the case of easier said than done, as it is not always easy to explore or extrapolate
Table 1 The machine learning methods used in nanoparticles
S no. | Author(s) | Published year | ML algorithm(s) used | Dataset used | Output
1 | Lazarovits et al. [11] | 2019 | Supervised deep neural network | Proteomic data from circulating nanoparticles for inputs, as a dataset | Prediction of lower spleen and lower liver, used for the design of nanoparticle deposition
2 | Ilett et al. [12] | 2020 | CellProfiler and CNN algorithm ilastik [13] | 1000 SEM images per sample of dataset | Image profiling used to study nanoparticle distribution
3 | Horwath et al. [14] | 2019 | CNN | Dataset of 1024 × 1024 ETEM images | To improve accuracy in calculating size distributions of nanoparticle for STEM images
4 | Coquelin et al. [15] | 2019 | CNN | Manually segmented SEM images from 2000 samples | Estimation of particle size distribution by SEM of accumulated TiO2 particles
5 | Hataminia et al. [16] | 2019 | NN | Hydrodynamic diameter (nm), negative zeta potential (mv), incubation time (day), concentration | The model training and toxicity of iron oxide nanoparticle to kidney cells was done using particle size, concentration, incubation time
6 | Kovalishyn et al. [17] | 2017 | kNN, Random Forest, and NN | Four different datasets are used with three different measures toxicity of nanoparticles | Analysis of physicochemical and ecotoxicological properties nanoparticles
7 | Wang et al. [18] | 2017 | KNN | Big experimental datasets with high quality of nanoparticle libraries | Study of gold nanoparticles with four models
8 | Concu et al. [19] | 2016 | Linear NN, radial basis function, multilayer perceptron, and probabilistic NN | A set of 260 unique NPs in the published literature with 31 chemical compositions | Study of metal, metal oxide, and silica nanoparticles
9 | Oksel et al. [20] | 2016 | Genetic programming-based decision tree | Four diverse literature datasets | Prediction for structure–property relationships, molecular or mechanistic interpretation of nanoparticle
10 | Le et al. [21] | 2016 | Linear regression and Bayesian regularized NN | Library of 45 types of ZnO nanoparticles | Study of size, aspect ratio, doping type, etc., of nanoparticles and their biological response data
11 | Fourches et al. [22] | 2016 | Traditional kNN, random forest, and SVM | Dataset of 83 surface-modified carbon nanotubes (CNTs) | Study of nanotubes and assessing their bovine serum albumin, carbonic anhydrase, chymotrypsin, and hemoglobin activity together with acute and immune toxicity in vitro assays
12 | Chen et al. [23] | 2017 | C4.5, decision tree, and random tree models | Theoretical/experimental descriptor from multiple ENMs tested in organisms in vivo | Models for classifying ecotoxicity of nanomaterials using read-across properties
13 | Papa et al. [24] | 2016 | Multiple linear regression, two types of NN and SVM | Experimental data of cell association measured in human lung epithelial carcinoma cells exposed to a library of 105 Au-NPs were used | Linear/nonlinear modeling of nanoparticles, test set error between 8 and 17%
14 | Mikolajczyk et al. [25] | 2015 | Linear regression | Zeta potential on nanoparticles based on 11 image and 17 descriptors | Prediction of zeta potential at 1.25 mV RMSE error in test set and r2 value of 0.87
15 | Casman [26] | 2013 | Classification, regression and random forest | In vivo pulmonary exposures of CNTs | Predicting pulmonary toxicity of carbon nanotubes, r2 value between 0.88 and 0.96
16 | Liu et al. [27] | 2013 | Bayesian classifier, Logistic regression, nearest neighbor classification | Nano-SARs have constructed a dataset of 44 iron oxide core nanoparticles | Nanoparticles affect on aortic endothelial, vascular smooth muscle, hepatocyte, and monocyte/macrophage
17 | Fourches et al. [28] | 2010 | SVM and KNN | Dataset comprising 51 nanoparticles tested in vitro using four doses | Prediction accuracy up to 73% on classification and regression models
18 | Chandana Epa et al. [29] | 2012 | Linear regression and Bayesian | 31 NPs consisting of 11 different metal combinations and 109 NPs | Predictions of the smooth muscle cell apoptosis and uptake of nanoparticles by human umbilical vein epithelial cells and pancreatic cancer cells
19 | Puzyn et al. [30] | 2011 | Linear regression | Descriptors derived from quantum chemical calculations | Prediction cytotoxicity of nanoparticles to Escherichia coli
the structural activity of nanoparticles in relation to their environment, especially their impact on human beings, since there are multitudes of units or factor participating in this orchestration. It is akin to connecting and discerning the pattern out of haphazardly arranged dots. Such an inference is only possible with a highly trained AI system, whose efficacy improves with the dataset it gets to train on. Owing to this hardship and the general lack of information, nanoinformatics in a biomedicine context has been an under-explored territory despite both the popularity of nanoinformatics and the allure of its prospects. Nevertheless, the nanomaterial landscape expands rapidly, as more and more novel nanoparticles are added and existing ones are improved, leading to an urgent need for adopting ML and other AI technologies to comprehend the dynamics of nanoparticle [3].
3 Continuing Challenges Although the need to adapt ML to suit the demands of the nanoparticle segment is imperative, there are equally intensive impediments to such implementations. The foremost challenging impediment is the sheer enormity and heterogeneity of the data generated in this field [4]. To truly understand the nanoparticle, especially in the context of their medical applications, for example, one must record a plethora of factors including “physical and chemical attributes of the nanoparticles, the notions of the mixtures, distribution, shape, and differences in extent of surface modification, manufacturing conditions, and batch effects” [3]. For example, a recent review of some papers has revealed that there is a particular struggle when it comes to choosing nano-QSARs and other predictive models. These selections are crippled by a shortage of “high-quality experimental data, lack of data regarding interactions between nanoparticles like aggregation, high polydispersity in nanoparticles, etc.” [1]. Finally, there are the “biological attributes (e.g., toxicological effects of nanoparticles, modes-of-action, toxicity pathways), interactions (with different cell models), and a large form of measurement approaches with various specific conditions” [5]. This form of analysis and characterization of nanoparticles requires that appropriate analytical techniques be adopted and fine-tuned, and the selected techniques must be able to accommodate the stretch as with “expanding insight into the factors determining toxicity, the list of doubtless relevant properties is growing” [3]. There is another challenging factor on the obverse: researchers might have identical contexts but require different techniques to collect data about the variables—no single quantification or analytical method suits the purpose. Consider the situation (which has been widely discussed in several papers while talking about this challenge) where one needs to identify and predict the cellular uptake of nanoparticles used to treat a disease. The underlying context (figure out cellular uptake of nanoparticles) might be the same, but when nanoparticles vary in that equation, a plethora of details cascades and to make sense of it all with an objective to quantify the nanoparticle—using one general technique is no mean task, or worse, it is next to impossible. Almost
all papers reviewed concurred on this opinion, and they all unequivocally concluded that the “strategy of choice for the quantification of nanoparticles uptake mainly depends on the research question, the available analytical devices also as on the sort of nanoparticle of interest” [6]. Due to the aforementioned factors and so many parts (including size, shape, core material and surface functionalization, etc.) involved, there is a need for very different analytical methods [3]. All these have a bearing on the task of comparing various researches in this field and assessing the data mining and ML techniques for the same inquiry or context [7]. Plus, there are other important factors (which cannot be discounted) that govern how data analysis and ML are adapted in this field: the hazard assessment, and it is a vertical where there is little room for errors considering the cost, time, and other safety implications involved. When such a discord is there, two things become absolutely important to solve them: build a common repository of this critical data using a common set of standards and apply NLP techniques to hone its ability to offer the most-needed information right on top. In addition to the proposition made above, this paper now aims to review research involving the employment of data mining and machine learning for the prediction of biomedical properties of nanoparticles of medical interest [1].
4 Data Mining and Machine Learning for Nanoparticles The usage of nanoparticles in biomedical applications means the margin for error is almost zero in studying the in vitro behavior of the particles [8]. It strengthens the need for machine learning in predicting the properties and effects through in silico methods. Envisioning properties and values of nanoparticles and subsequently processing therapeutic medicines driven by the implementation of machine learning led to steady advances in nanomedicine [1]. The analysis of a broad collection of publications pinpoints the same and helps in improving global research in nanotechnology. A sensible approach is needed for the safe evolution of the nanoparticles and, in turn, their usage in medicine [7].
4.1 Machine Learning Methods Machine learning (ML) is a subset of artificial intelligence technology, and it is a computer science discipline concerned with equipping programs/algorithms with a decision-making ability that improves automatically with discerned new experiences. In essence, the ML algorithms are given the ability to learn about the baseline of data, detect shifts in them, and discern the merging patterns, which render ML techniques the prowess to extract pertinent information from huge datasets [9]. This capability is highly useful in fields such as nanomaterial arena, where there is an imminent need
to detect in the mammoth data critical information, which is hiding in the enormity of the data itself. Common types of ML algorithms are listed below.
Supervised Learning: This type of algorithm builds a model using a set of input and output data, called training data. The algorithm learns from the labeled data and improves its accuracy over multiple iterations until the desired outcome is achieved. Upon reaching the desired accuracy, the model is used on production data [9].
Unsupervised Learning: In contrast to supervised algorithms, which rely mostly on labeled data, unsupervised learning unearths uncharted patterns [10].
Semi-supervised Learning: Here both supervised and unsupervised learning are used in combination [10].
Reinforcement Learning: To make decisions in more complex and uncertain environments, reinforcement learning is used, based on a sequence of choices [9].
4.2 Natural Language Processing for Nanoparticles However, with ever-increasing number of articles on nanoparticles and the innumerable number of properties, processing these huge chunks of unstructured nanoparticle data needs enormous time and is prone to errors. Moreover, all these data contain valuable information and insights buried deep in the textual verbose, like a needle in the haystack. This implies the need for the use of Natural Language Processing (NLP)-Named entity recognition (NER), an NLP technique that identifies named entities in a text and segregates them into pre-built categories automatically. With named entity recognition, you can extract the key data to understand what a text is about, or simply use it to store the collected data in a database. Extracting these entities helps easily analyze humongous amounts of unregulated research data to get more information about the properties of nanoparticles. In a nutshell, this technique helps discern and unearth the meaningful data from the colossal of textual data. Broadly speaking, four NER can be implemented in four kinds of approaches: Dictionary-based methods use domain-specific vocabularies to extract entities from the literature [31]. Rule-based methods overcome the drawbacks of dictionary-based methods by employing handcrafted patterns and rules to address morphological variants. Different patterns including regex are used here to match words, phrases, and sentences and extract the required data from the literature. The main disadvantage of this approach is that reapplication to fit various domains is not easy [31]. Another disadvantage is a greater number of patterns, making it much tougher to maintain. Machine learning methods, on contrary, aim at automatically detecting occurrences of named entities in the text by predictive learning models. There are a few popular ML models namely conditional random fields, hidden Markov models, etc., but almost all of them require training data to train the model [31]. The hybrid methods combine two or more ML methods and/or rule-based methods. The hybrid methods are most suitable for most cases as they address the
drawbacks of other methods and combine the advantages of the aforementioned approaches [31].
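As a hedged illustration of the dictionary-based and rule-based approaches described above (the vocabulary and patterns here are invented for the example, not taken from the reviewed literature), a few lines of Python suffice to tag nanoparticle names and simple size expressions in free text.

```python
import re

# Tiny domain dictionary (dictionary-based NER): known nanoparticle names.
NANOPARTICLE_TERMS = {"ZnO", "TiO2", "silver nanoparticle", "gold nanoparticle"}

# Handcrafted rule (rule-based NER): particle sizes such as "20 nm" or "3.5 nm".
SIZE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?nm\b")

def extract_entities(text: str) -> dict:
    """Return dictionary matches and rule matches found in the text."""
    particles = [t for t in NANOPARTICLE_TERMS if t.lower() in text.lower()]
    sizes = SIZE_PATTERN.findall(text)
    return {"nanoparticles": particles, "sizes": sizes}

print(extract_entities(
    "ZnO and gold nanoparticle samples with a mean diameter of 25 nm were tested."
))
# e.g. {'nanoparticles': ['ZnO', 'gold nanoparticle'], 'sizes': ['25 nm']}
```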
5 Approaches and Methods—An Analysis Now that we have established the challenges, proposed segments for ML applications, and considered the definition and uses of ML, we present the analysis we carried out on the various components of data collection and analysis techniques discussed in the available nanomaterials literature. This analysis depends critically on the following factors: a judicious choice of ML model, data processing techniques, data validation techniques, implementation models, and a review system to fine-tune the functioning of the entire pipeline like clockwork. We observed that there are no steadfast rules or standard processes for accomplishing what has been discussed above. Based on our further search, we propose a workflow for supervised learning and the successful implementation of ML on the widely available nanoparticle datasets. It is essential to label the properties to suit the selected algorithm [32]. The Nanoinformatics 2030 Roadmap [3] envisions a stream of information from several observational areas into organized databases, for possible use by machine learning models to predict property, exposure, and risk values that can support regulatory activities for a targeted nanoparticle. The essence of it all is distilled in the following flow diagram.
6 Discussion of Flow Diagram 6.1 Data Collection Various journal archives and search engines for scholarly literature were accessed, and the following keywords were used to locate and review studies that implement ML to predict nanoparticle properties [7]. The following table summarizes the rules used in data collection.

S. no. | Subject | Description
1 | Search key terms | Nanoparticle, nanomaterial, nanotoxicity, in silico, computational, machine learning model, all properties
2 | Publication | Journals and reports reviewed by experts and peers
3 | Databases | "Google Scholar", "PubMed", "NCBI", "ScienceDirect" and more
4 | Searchable content | Abstract, keywords, title, meta tags, etc.
5 | Period | Previous 10–15 years
6.2 Data Set Formation Next comes the process of forming the dataset to be fed to the ML model. A dataset is built either from the aforementioned resources or from the current literature and databases. The process is streamlined to collect relevant information pertaining to the properties of nanoparticles, so that a rich and relevant nanoparticle dataset is extracted [33].
6.3 Data Cleaning The most critical methods in the data cleaning segment are discussed below: Class Imbalance: When data samples are picked and classified, there is a risk of the members of one sample class outnumbering the members of the other classes. While it might seem trivial, the repercussions of such an imbalance can yield results that throw the research entirely off track, and it is therefore a serious challenge in the field of data mining. In nanoinformatics, it can completely disrupt the data quality required for a successfully curated dataset [1, 7]. However, this can be overcome if enough care is taken to maintain a balance in the number of members across every class found in the dataset. Missing Values: It is needless to mention what this self-explanatory factor can do to data integrity or interoperability. Data filling approaches rectify this issue,
and the following are the popular strategies available in this regard: QSAR methods, trend analysis, and read-across [33]. Data Splitting: This is a vital piece in forming a successful data model for ML implementations. Though it is often not given its due attention, this step determines the future success of the data model when it is subjected to real-world rigors. The data are divided into three groups: (i) a group containing only the data used to train a reliable model for the ML implementation, (ii) a group whose data are used to measure the model's robustness, and (iii) a group used to measure the model's predictive ability. In our review of the literature, we found that the prediction ability of the model improves when the test group data are close in character to the training group data [33].
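A minimal sketch of the three-way split described above, using scikit-learn (the 60/20/20 proportions are illustrative assumptions, not values prescribed by the reviewed studies):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Toy descriptor matrix X and property values y standing in for a curated
# nanoparticle dataset.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

# First split off 20% as the final evaluation (prediction) group ...
X_rest, X_eval, y_rest, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)
# ... then split the remainder into training and robustness/validation groups.
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_valid), len(X_eval))  # 120 40 40
```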
6.4 Data Preprocessing This segment is concerned with preparing the data and improving its fit and utility. The methods include feature selection, feature reduction, and techniques such as cleaning, integration, and transformation. These techniques are used to transform the data into a form suitable for computational tools [34]. Feature Reduction: This step removes (or reduces to the maximum possible extent) the redundancies in the data [35]. Care is taken to remove from the training dataset anything that might otherwise throw the model off target: constant or near-constant descriptors with low variance, descriptors with poor or missing values, and any data that correlate poorly with the research endpoints [33]. Feature Selection: It is not uncommon in this field to work with datasets that are voluminous and extremely varied in dimension, and that are bound to contain variables (features) producing unwanted noise or anomalies [35]. Feature selection helps to "scale the quantity of efficient variables analyzed within the predictive model" [1]. It also helps to avoid overfitting, which is quite common in ML and occurs when a model fits the noise in the training data so closely that its intended performance on new data degrades. Furthermore, almost all the papers reviewed vouch for feature selection's ability to enhance expert assessment of the mechanistic basis of the model [33]. It is to be noted that feature selection and extraction are favored in several domains such as text mining, medical databases, NLP, etc. [36]. LDA: Linear discriminant analysis is a popular method used in statistics and informatics to find a linear combination of features that characterizes two or more classes of objects or events [37]. PCA: Principal component analysis computes principal components and uses them to perform a change of basis of the data. It is quite useful for the analysis of datasets that have multiple variables, as it helps reduce correlation and maximize variance [37].
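The following sketch (illustrative only; the thresholds and component counts are assumptions, not values from the reviewed studies) chains the steps just described: variance-based feature reduction, univariate feature selection, then PCA and LDA from scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Feature reduction: drop near-constant descriptors.
X_red = VarianceThreshold(threshold=0.01).fit_transform(X)

# Feature selection: keep the k descriptors most associated with the endpoint.
X_sel = SelectKBest(f_classif, k=10).fit_transform(X_red, y)

# PCA: change of basis that maximizes variance in the retained components.
X_pca = PCA(n_components=3).fit_transform(X_sel)

# LDA: linear combination of features that best separates the classes.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_sel, y)

print(X_red.shape, X_sel.shape, X_pca.shape, X_lda.shape)
```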
ICA: Independent component analysis is quite a useful process for unraveling the hidden factors in huge multivariate datasets; it operates by separating independent sources from a mixed signal, unlike PCA, which emphasizes maximizing the variance of the data [37]. Non-negative matrix factorization: This method has gained popularity for assessing highly multivariate datasets, given its heightened capability to work with data where only relationships between objects are known; note that it operates on non-negative data [38]. Naive Bayesian: One of the most popular probabilistic classifiers in the data mining sector, the Naïve Bayesian classifier finds applications in many real-world classification problems. Alzubi et al. [39] used this algorithm to classify malignant and benign brain tumors.
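As a hedged sketch of the Naïve Bayesian classifier mentioned above (synthetic data; this is not the experimental setup of [39]), scikit-learn's GaussianNB can be trained in a few lines:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Synthetic two-class data standing in for e.g. benign/malignant feature vectors.
X, y = make_classification(n_samples=400, n_features=12, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = GaussianNB().fit(X_train, y_train)  # assumes conditional feature independence
print(classification_report(y_test, model.predict(X_test)))
```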
6.5 Preprocessing Technique The following are the most-used NLP algorithms: Support Vector Machine (SVM): SVM is a supervised learning model and an excellent option for data classification and regression analysis [37]. Among its many benefits, its ability to function efficiently when there are more dimensions than samples, and its memory efficiency, stand out. In fact, the former quality lends this classifier an edge over others, since it helps with a major problem: most real-world problems concern non-separable data, for which no hyperplane exists that cleanly separates the positive and negative instances in the training set [40]. Bayesian Networks (BN): This is a graph-based model used to represent relationships between events and ideas in order to infer the possibilities and uncertainties associated with those events. One of the key applications of BN is information retrieval [40]. Conditional Random Fields: Conditional random fields (CRFs) stand apart from the other classifiers in that they take the "context" into account. To do so, the prediction is modeled as a graphical model that captures the dependence between predictions. The type of graph used depends on the application; for example, linear-chain CRFs, which model sequential dependencies between predictions, are common in natural language processing [17]. Given these properties, CRFs find a special place in fields that use ML rigorously, such as gene finding, identification of functional regions in peptides, object recognition, and image processing. They are a viable, and in fact healthy, choice of classifier for NLP. Neural Networks: Neural networks are among the most widely discussed families of AI algorithms, renowned for their deep learning and prediction abilities. Touted to mimic a few nuances of the human brain, neural networks are highly capable of adapting to changing inputs and of performing a number of regression and/or classification tasks at once. Most of the papers reviewed touch upon an interesting topic: adapting a neural network to work successfully in a many-state classification problem, as neural networks mostly work on a "one network, one output" basis [10]. Maximum Entropy: A favorite with text mining (especially sentiment analysis) enthusiasts, maximum entropy is a probabilistic classifier. However, unlike
Naïve Bayes algorithm, this one does not assume that the considered features are independent of each other [41].
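A compact, hedged example of the SVM classifier discussed above applied to text (the tiny corpus and labels are invented for illustration): a TF-IDF representation feeds a linear SVM in scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy abstracts labeled by whether they report nanoparticle toxicity (1) or not (0).
texts = [
    "ZnO nanoparticles showed cytotoxicity in kidney cells",
    "Gold nanoparticles were used for targeted drug delivery",
    "TiO2 nanoparticles induced oxidative stress and toxicity",
    "Silver nanoparticles improved the sensor response time",
]
labels = [1, 0, 1, 0]

# TF-IDF turns text into a high-dimensional sparse vector space, where a
# linear SVM tends to perform well even with more features than samples.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["nanoparticle exposure caused toxicity in liver cells"]))
```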
6.6 Validation and Further Analysis This is a crucial step to validate the entire process and review the results, with the aim of improving the predictivity of the model based on three factors: goodness-of-fit, robustness, and predictability measures. Certain reviewed papers deal with this segment in interesting ways. With regard to goodness-of-fit, which determines how well the model accounts for the variability of the responses in the training group, a few papers vouch for measuring the quality of the regression using the squared correlation coefficient and for its role in preventing overfitting [33]. Robustness, the stability of the model predictions even when an adverse influence is applied, has been an elaborate subject for many research papers, and most of the studies indicate that models should be subjected to external validation [33].
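The goodness-of-fit and robustness checks described here can be sketched with scikit-learn as follows (a generic illustration on assumed synthetic data, not the validation protocol of any specific reviewed study):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Goodness-of-fit: squared correlation (R^2) on training and external test data.
print("R2 train:", r2_score(y_train, model.predict(X_train)))
print("R2 test :", r2_score(y_test, model.predict(X_test)))

# Robustness: spread of scores under repeated resampling (5-fold cross-validation).
scores = cross_val_score(RandomForestRegressor(random_state=0), X_train, y_train, cv=5, scoring="r2")
print("CV R2 mean/std:", scores.mean(), scores.std())
```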
7 Conclusion In the backdrop of successful machine learning implementations in numerous fields, the world of nanoparticles is still struggling to benefit from similar adaptations. That is exactly what this paper sets out to change: it pinpoints the reasons for that struggle and turns the reader's attention to the answers and solutions lying in the vast treasure trove sitting right in front of us, namely the research literature concerning implementations of machine learning in the nanoparticle vertical. By advocating the use of NLP techniques on this colossus of data and the derivation of meaningful information from it, this review proposal addresses a huge problem facing the realm of nanoparticles: getting lost in data clutter. It indicates how the prowess of data mining and ML techniques can come to our aid in adapting artificial intelligence to the rigors of the world of nanoparticles, especially for the prediction of in vivo properties of nanoparticles using machine learning in nanosafety and toxicology.
References 1. Jones, D. E., Ghandehari, H., & Facelli, J. C. (2016). A review of the applications of data mining and machine learning for the prediction of biomedical properties of nanoparticles. Computer Methods and Programs in Biomedicine, 132, 93–103. https://doi.org/10.1016/j.cmpb.2016. 04.025. Epub 2016 Apr 28. PMID: 27282231; PMCID: PMC4902872. 2. Maojo, V., Fritts, M., de la Iglesia, D., Cachau, R. E., Garcia-Remesal, M., & Kulikowski, C. (2012). Nanoinformatics: a new area of research in nanomedicine. International Journal of Nanomedicine, 7, 3867–3890. https://doi.org/10.2147/IJN.S24582.
3. Haase, & Klaessig. (2018). EU US Roadmap Nanoinformatics 2030. EU Nanosafety Cluster. https://doi.org/10.5281/zenodo.1486012. 4. Liu, X., & Webster, T. J. (2013). Nanoinformatics for biomedicine: Emerging approaches and applications. International Journal of Nanomedicine, 8(1), 1–5. https://doi.org/10.2147/IJN. S41253. 5. De la Iglesia, D., Harper, S., Hoover, M., Klaessig, F., Lippel, P., Maddux, B., Morse, J., Nel, A., Rajan, K., Reznik-Zellen, R., & Tuominen, M. (2011). Nanoinformatics 2020 Roadmap. 6. Drasler, B., Vanhecke, D., Rodriguez-Lorenzo, L., Petri-Fink, A., & Rothen-Rutishauser, B. (2017). Quantifying nanoparticle cellular uptake: Which method is best? Nanomedicine, 12(10), 1095–1099. 7. Dimitri, A., Talamo, M. (2018). The use of data mining and machine learning in nanomedicine: A survey. Frontiers in Nanoscience and Nanotechnoogy, 4. https://doi.org/10.15761/FNN.100 0S1004. 8. Lewinski, N. A., & McInnes, B. T. (2015). Using natural language processing techniques to inform research on nanotechnology. Beilstein Journal of Nanotechnology, 6, 1439–49. https:// doi.org/10.3762/bjnano.6.149. 9. Nagar, R., & Singh, Y. (2019). A literature survey on machine learning algorithms. International Journal of Emerging Technologies and Innovative Research, 6(4), 471–474. http://www.jet ir.org. ISSN:2349–5162. 10. Ayodele, T. (2010). Types of Machine Learning Algorithms. https://doi.org/10.5772/9385. 11. Lazarovits, J., Sindhwani, S., Tavares, A. J., Zhang, Y., Song, F., Audet, J., Krieger, J. R., Syed, A. M., Stordy, B., & Chan, W. C. W. (2019). Supervised learning and mass spectrometry predicts the in vivo fate of nanomaterials. ACS Nano, 13(7), 8023–8034. https://doi.org/10. 1021/acsnano.9b02774. 12. Ilett, M., Wills, J., Rees, P., Sharma, S., Micklethwaite, S., Brown, A., Brydson, R., & Hondow, N. (2020). Application of automated electron microscopy imaging and machine learning to characterise and quantify nanoparticle dispersion in aqueous media. Journal of Microscopy, 279, 177–184. https://doi.org/10.1111/jmi.12853 13. Berg, S., Kutra, D., Kroeger, T., et al. (2019). ilastik: Interactive machine learning for (bio)image analysis. Nature Methods, 16, 1226–1232. https://doi.org/10.1038/s41592-019-0582-9 14. Horwath, J. P., Zakharov, D. N., Megret, R., & Stach, E. A. (2019). Understanding Important Features of Deep Learning Models for Transmission Electron Microscopy Image Segmentation. arXiv:1912.06077. 15. Coquelin, L., et al. (2019) Towards the use of deep generative models for the characterization in size of aggregated TiO2 nanoparticles measured by Scanning Electron Microscopy (SEM). Materials Research Express, 6, 085001 16. Hataminia, F., Noroozi, Z., & Mobaleghol, E. H. (2019). Investigation of iron oxide nanoparticle cytotoxicity in relation to kidney cells: A mathematical modeling of data mining. Toxicology in Vitro: An International Journal Published in Association with BIBRA., 59, 197–203. https:// doi.org/10.1016/j.tiv.2019.04.024 17. Kovalishyn, V., Abramenko, N., Kopernyk, I., Charochkina, L., Metelytsia, L., Tetko, I. V., Peijnenburg, W., & Kustov, L. (2018). Modelling the toxicity of a large set of metal and metal oxide nanoparticles using the OCHEM platform. Food and Chemical Toxicology, 112, 507–517. https://doi.org/10.1016/j.fct.2017.08.008 Epub 2017 Aug 9 PMID: 28802948. 18. Wang, W., Sedykh, A., Sun, H., Zhao, L., Russo, D. P., Zhou, H., Yan, B., & Zhu, H. (2017). 
Predicting nano-bio interactions by integrating nanoparticle libraries and quantitative nanostructure activity relationship modeling. ACS Nano, 11(12), 12641–12649. https://doi.org/10. 1021/acsnano.7b07093. Epub 2017 Nov 22. PMID: 29149552; PMCID: PMC5772766. 19. Concu, R., Kleandrova, V., Planche, S., & Alejandro. . (2017). Probing the toxicity of nanoparticles: A unified in silico machine learning model based on perturbation theory. Nanotoxicology, 11, 1–16. https://doi.org/10.1080/17435390.2017.1379567 20. Oksel, C., Winkler, D. A., Ma, C. Y., Wilkins, T., & Wang, X. Z. (2016). Accurate and interpretable nanoSAR models from genetic programming-based decision tree construction approaches. Nanotoxicology, 10(7), 1001–1012. https://doi.org/10.3109/17435390.2016.116 1857
21. Le, T. C., Yin, H., Chen, R., Chen, Y., Zhao, L., Casey, P. S., Chen, C., & Winkler, D. A. (2016). An experimental and computational approach to the development of ZnO nanoparticles that are safe by design. Small (Weinheim an der Bergstrasse, Germany), 12, 3568–3577. https:// doi.org/10.1002/smll.201600597 22. Fourches, D., Dongqiuye, Pu., Li, L., Zhou, H., Qingxin, Mu., Gaoxing, Su., Yan, B., & Tropsha, A. (2016). Computer-aided design of carbon nanotubes with the desired bioactivity and safety profiles. Nanotoxicology, 10(3), 374–383. https://doi.org/10.3109/17435390.2015.1073397 23. Chen, G., Peijnenburg, W., Xiao, Y., & Vijver, M. G. (2017). Current knowledge on the use of computational toxicology in hazard assessment of metallic engineered nanomaterials. International Journal of Molecular Sciences, 18(7), 1504. https://doi.org/10.3390/ijms18071 504. 24. Papa, E., Doucet, J. P., Sangion, A., & Doucet-Panaye, A. (2016). Investigation of the influence of protein corona composition on gold nanoparticle bioactivity using machine learning approaches. SAR and QSAR in Environmental Research, 27(7), 521–538. https://doi.org/10. 1080/1062936X.2016.1197310 Epub 2016 Jun 22 PMID: 27329717. 25. Mikolajczyk, A., Gajewicz, A., Rasulev, B., Schaeublin, N., Maurer-Gardner, E., Hussain, S., Leszczynski, J., & Puzyn, T. (2015). Zeta potential for metal oxide nanoparticles: A predictive model developed by a nano-quantitative structure-property relationship approach. Chemistry of Materials, 27(7), 2400–2407. https://doi.org/10.1021/cm504406a 26. Gernand, J. M., & Casman, E. A. (2014). A meta-analysis of carbon nanotube pulmonary toxicity studies—how physical dimensions and impurities affect the toxicity of carbon nanotubes. Risk Analysis, 34(3), 583–597. https://doi.org/10.1111/risa.12109 Epub 2013 Sep 11 PMID: 24024907. 27. Liu, R., Rallo, R., Weissleder, R., Tassa, C., Shaw, S., & Cohen, Y. (2013). Nano-SAR development for bioactivity of nanoparticles with considerations of decision boundaries. Small (Weinheim an der Bergstrasse, Germany), 9, 1842–1852. https://doi.org/10.1002/smll.201 201903 28. Fourches, D., Dongqiuye, Pu., Tassa, C., Weissleder, R., Shaw, S. Y., Mumper, R. J., & Tropsha, A. (2010). Quantitative nanostructure−activity relationship modeling. ACS Nano, 4(10), 5703– 5712. https://doi.org/10.1021/nn1013484 29. Chandana Epa, V., Burden, F. R., Tassa, C., Weissleder, R., Shaw, S., & Winkler, D. A. (2012). Modeling biological activities of nanoparticles. Nano Letters, 12(11), 5808–5812. https://doi. org/10.1021/nl303144k. 30. Puzyn, T., Rasulev, B., Gajewicz, A., et al. (2011). Using nano-QSAR to predict the cytotoxicity of metal oxide nanoparticles. Nature Nanotechnology, 6, 175–178. https://doi.org/10.1038/ nnano.2011.10 31. Ananiadou, S., & McNaught, J. (Eds.). (2006). Text mining for biology and biomedicine. Boston, MA: Artech House. 32. Schmidt, J., Marques, M. R. G., & Botti, S. et al. (2019). Recent advances and applications of machine learning in solid-state materials science. npj Computational Materials, 5(83). https:// doi.org/10.1038/s41524-019-0221-0. 33. Furxhi, I., Murphy, F., Mullins, M., Arvanitis, A., & Poland, C. A. (2020). Practices and trends of machine learning application in nanotoxicology. Nanomaterials, 10, 116. 34. Alasadi, S. A., & Bhaya, W. S. (2017). Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences, 12, 4102–4107. 35. Gupta, D., Rodrigues, J. J. P. C., Sundaram, S., et al. (2020). 
Usability feature extraction using modified crow search algorithm: A novel approach. Neural Computing and Applications, 32, 10915–10925. https://doi.org/10.1007/s00521-018-3688-6 36. Venkatesh, R., Chaitanya, K., Bikku, T., & Paturi, R. (2020). A review on biomedical mining. J RNA Genomics, 16, 629–637. 37. Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., & Brown, D. (2019). Text classification algorithms: A Survey. Information, 10, 150. 38. Brown, K. A., Brittman, S., Maccaferri, N., Jariwala, D., & Celano, U. (2020). Machine learning in nanoscience: Big data at small scales. Nano Letters, 20(1), 2–10. https://doi.org/10.1021/ acs.nanolett.9b04090.
39. Alzubi, J., Kumar, A., Alzubi, O., & Manikandan, R. (2019). Efficient approaches for prediction of brain tumor using machine learning techniques. Indian Journal of Public Health Research & Development, 10, 267. https://doi.org/10.5958/0976-5506.2019.00298.5. 40. Iqbal, M., & Yan, Z. (2015). Supervised machine learning approaches: A survey. International Journal of Soft Computing, 5, 946–952. https://doi.org/10.21917/ijsc.2015.0133. 41. Ratnaparkhi, A. (2016). Maximum entropy models for natural language processing. In C. Sammut & G. Webb (Eds.), Encyclopedia of machine learning and data mining. Boston: Springer. https://doi.org/10.1007/978-1-4899-7502-7_525-1.
A Review of the Oversampling Techniques in Class Imbalance Problem Shweta Sharma, Anjana Gosain, and Shreya Jain
Abstract Class imbalance is often faced in real-world datasets where one class contains a smaller number of instances than the other. Even though this has been an area of interest for more than two decades, it is still a profound field of research for achieving better accuracy. Continuous improvement is pursued in data-level, algorithm-level, and hybrid methods as solutions to the given problem. Sampling techniques, which work at the data level and can be categorized as oversampling and undersampling, have gained significant heed for improving classification performance; oversampling is the more efficient technique as it emphasizes replicating instances, unlike undersampling. In this paper, a review has been done of the issues that come with imbalanced datasets. Imbalanced distribution equally affects unsupervised learning, mostly clustering; hence the shortcomings of the Synthetic Minority Over-sampling Technique (SMOTE) are identified and compared with clustering-based approaches, which focus on the importance of the minority class, unlike SMOTE. A comparative survey is also presented based on the data generation methods used by various oversampling techniques for handling the class imbalance problem. Keywords Imbalanced dataset · Oversampling · Undersampling · SMOTE · Class imbalance · Clustering
S. Sharma (B) · A. Gosain University School of Information, Communication and Technology, Guru Gobind Singh Indraprastha University, New Delhi, India A. Gosain e-mail: [email protected] S. Jain Vellore Institute of Technology, Vellore, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_38
1 Introduction As the fields of machine learning and data mining have emerged, a challenge has been imposed on achieving the aspired classification efficiency when dealing with imbalanced datasets. Imbalance occurs when the instances of one class are more numerous than the instances of another class. The class that consists of a smaller number of instances is known as the minority class, whereas the class consisting of a higher number of instances is known as the majority class. Although the majority class is the most prevalent one, the minority class may be of great significance because of its rare occurrences, yet it is sometimes treated as noisy data and ignored. The seriousness of this skewed distribution can be explained with a real-world example by considering a customer churn dataset, a problem mainly faced by the telecommunication industry, where reducing churn in a competitive market is even more important than attracting new customers. Such a dataset consists of approximately 5–15% churn instances and the rest no-churn instances, giving an imbalance ratio of 5.6–19. Classifiers may achieve nearly 90% accuracy by ignoring the small number of observations of the minority class, but this can lead to biased classification. In the given example, the small portion (about 5%) of churn cases contains useful information, and ignoring it could result in a greater classification error. Learning from a class-imbalance problem can be a challenging task; certain factors influencing classifier performance have been identified from previous work, and these are:
• Most algorithms operate by maximizing the overall classification accuracy while ignoring the rare events, which remain undetected even when the accuracy is high.
• Small disjuncts [1] and lack of data pose another challenge in learning from a class-imbalanced dataset, often causing learning models to fail at detecting rare events.
• Rare minority examples can be misinterpreted as noisy examples, and noise can be wrongly identified as minority examples, because of the rareness of both [2].
• Overlapping among class regions creates serious problems in class-imbalanced datasets. In 2010, the authors discussed the seriousness of the overlapping problem in the class-imbalance scenario; it was observed that overlapping tends to be a more critical issue than class imbalance in isolation.
The solutions to the discussed problems range from sampling approaches to new learning approaches devised especially for handling the class imbalance problem. In general, the methods for the class imbalance problem can be categorized at three levels: (1) data-level approach, (2) algorithm-level approach, and (3) hybrid approach. The data-level methods deal with the class imbalance ratio, keeping in mind the goal of acquiring a balanced distribution either by applying oversampling or undersampling, whereas at the algorithm-level approach, traditional
classification algorithms are adapted to improve learning from the imbalanced data. The hybrid approach is a combination of data-level and algorithm-level techniques. Comparing these three approaches, the algorithm-level approach is the more sensitive one, as it may require specific treatments of minority classes and is not cost-efficient, while a hybrid approach works well on the problems faced by oversampling and undersampling used in combination. The data-level approach is the most influential approach for achieving a balanced distribution, as it is employed as a pre-processing step. There are various techniques under the data-level approach, such as oversampling, undersampling, and feature selection, but oversampling is the more robust approach as it emphasizes replicating instances, unlike undersampling, which might lose significant information. It has been found that, when dealing with highly imbalanced data, oversampling of the minority class is more efficient. In recent years, several qualitative surveys [3] have captured advances in imbalanced learning. The authors in [1] provide a systematic case study of sampling techniques and an evaluation of rule generation algorithms for the application of churn prediction. In the same year, Barua et al. [4] address the issues and challenges of tackling imbalanced data across the spectrum of classification, regression, and clustering, along with the applications. In [5], Haixiang provides a deep review of detecting rare events from an imbalanced learning perspective, covering both binary-class and multi-class problems. The work in [6] provides an overview of all the techniques at the pre-processing level and the algorithmic level for binary-class and multi-class imbalanced data, according to which hybrid methods outperform ensemble techniques applied alone. This paper reviews various oversampling techniques categorized into SMOTE-based and clustering-based oversampling for binary-class problems, and the Mahalanobis distance approach for multi-class problems. SMOTE is a very popular approach because of the simplicity of its procedure, and for this reason various extensions of SMOTE are available, such as Borderline-SMOTE [7], Safe-level-SMOTE [8], DBSMOTE [9], MWMOTE [10], and Cluster-SMOTE [11], which use different types of interpolation methods. Although SMOTE-based techniques [7, 8, 10] have proved to be a standard benchmark for learning from imbalanced data, they are not able to correct the issue of overlapping, and hence a more generalized approach is demanded to handle such issues. Cluster-based oversampling techniques such as ProWSyn [4], DBSMOTE [12], CBSO [13], A-SUWO [14], and SOMO [22], which work on local densities, have attracted great interest as they avoid the over-generalization issue faced by the earlier techniques. They also mitigate the overfitting problem compared with the SMOTE extensions, and hence quality research work is still expected in learning from class imbalance problems. The remaining part of this paper is organized as follows. In the 'Approaches to Handle Class Imbalance Problem' section, we provide an overview of the strategies and methodologies used to handle data with the class imbalance problem. Section 3 gives an overview of the existing approaches based on data generation techniques using SMOTE and clustering, while Section 4 analyzes the comparative study. A comparison table is also presented for various novel methods and techniques.
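To make the accuracy pitfall discussed above concrete, the short sketch below (purely illustrative numbers, assuming a 19:1 imbalance as in the churn example) shows how a classifier that always predicts the majority class still reaches 95% accuracy while detecting no minority instances.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# 950 majority (no-churn = 0) and 50 minority (churn = 1) instances: ratio 19:1.
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate "classifier" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

print("accuracy :", accuracy_score(y_true, y_pred))                  # 0.95
print("recall(1):", recall_score(y_true, y_pred, zero_division=0))   # 0.0
print("f1(1)    :", f1_score(y_true, y_pred, zero_division=0))       # 0.0
```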
2 Approaches to Handle Class Imbalance Problem 2.1 Data-Level Approach There are various approaches (Fig. 1) to handle the class imbalance problem, and the data-level approach is the most popular one; it concentrates on modifying the training dataset by introducing a pre-processing step. The idea is to balance the dataset either by eliminating instances from the majority class or by replicating minority instances to achieve a balanced distribution. Sampling is the most common data-level approach to handle class imbalance; it processes the training data to achieve a more balanced data distribution. In general, there are three methods under the data-level approach: oversampling, undersampling, and feature selection.
2.1.1 Oversampling
In this method, instances are added or replicated from the minority class of the given dataset. For replicating the instances, samples are either generated randomly or produced by an intelligent algorithm. In 2002 [15], Chawla introduced SMOTE, a standard oversampling technique that has gained immense popularity in class imbalance classification. SMOTE proved to be an alternative to standard random oversampling by creating new instances through interpolation between neighboring minority class instances. It is considered more useful than undersampling as it emphasizes replicating or adding instances, unlike undersampling, which might lose vital information from the class samples. Its performance has been shown to improve drastically even for complex datasets. The key idea is to overcome the overfitting problems faced by simple oversampling through replication. Due to its prevalence, it emerged as one of the most prominent data processing and sampling algorithms in machine learning and data mining [16–18], and for this reason various extensions of SMOTE have become available over the past 15 years [9], such as Borderline-SMOTE [7], Safe-level SMOTE [8], ADASYN [19], Cluster-SMOTE [20], and LVQ-SMOTE [21].
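As a hedged illustration of the oversampling step (not code from the reviewed papers), the imbalanced-learn library provides a SMOTE implementation that can rebalance a skewed dataset in a couple of lines:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: roughly 95% majority class, 5% minority class.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE interpolates between a minority instance and its nearest minority neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after :", Counter(y_res))
```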
2.1.2 Undersampling
The concern with undersampling is the removal of crucial data when a large number of instances are deleted from the majority class. In [9], Tomek links provide an undersampling approach by identifying borderline and noisy data. This approach is also used for cleansing the data and removing the overlapping introduced by sampling methods. Though a lot of research work has been carried out in the field of undersampling, it is not regarded as the most efficient option, as it may lose the information carried by the discarded instances.
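For completeness, a hedged sketch of the undersampling side using imbalanced-learn (Tomek-link cleaning followed by random undersampling; the parameter choices are illustrative only):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks, RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original      :", Counter(y))

# Remove majority instances that form Tomek links (borderline/noisy pairs).
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print("after Tomek   :", Counter(y_tl))

# Randomly drop further majority instances until the classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X_tl, y_tl)
print("after random  :", Counter(y_rus))
```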
Fig. 1 Various techniques to handle class imbalance problem
2.1.3 Feature Selection
Another pre-processing step that has gained acceptance over time for the class imbalance problem is feature selection. Compared with other pre-processing methods, the performance achieved with this approach is high. The paper [22] analyzed eight feature selection methods based on correlation measures, which helps researchers select an appropriate feature selection model in less time. Feature selection is still not explored much, and a systematic approach for finding an appropriate model for imbalanced data is needed, which leaves room for further advancements.
2.2 Algorithm-Level Whereas data-level methods rebalance the class distribution (undersampling proceeds by deleting instances from the majority class and oversampling by adding minority instances), the algorithm-level approach adapts the learning algorithm itself. This approach is generally categorized into two groups. One is cost-sensitive learning, a popular learning paradigm [6] based on the reduction of misclassification costs; many studies have been conducted on cost-sensitive learning, such as [23], which proposed an optimized cost-sensitive SVM. The other technique is ensemble learning, which usually faces issues of variance.
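A hedged sketch of the cost-sensitive idea (this uses scikit-learn's generic class-weighting mechanism, not the optimized cost-sensitive SVM of [23]): misclassifying a minority instance is made more expensive than misclassifying a majority instance.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = SVC().fit(X_tr, y_tr)
# class_weight='balanced' scales the penalty inversely to class frequency,
# i.e. errors on the rare class cost roughly 9x more here.
weighted = SVC(class_weight="balanced").fit(X_tr, y_tr)

print("minority recall, plain   :", recall_score(y_te, plain.predict(X_te)))
print("minority recall, weighted:", recall_score(y_te, weighted.predict(X_te)))
```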
2.3 Hybrid Level This approach is a combination of data-level and algorithm-level methods. The purpose of the hybrid approach is to remove the shortcomings of the data-level and algorithm-level methods and thereby achieve better classification results. In the paper [22], a sampling method is proposed based on k-means clustering and a genetic algorithm. It emphasizes the performance of the minority class, whose instances are clustered using k-means; for each cluster, synthetic samples are then generated using the genetic algorithm and an SVM.
3 Previous Work Borderline-SMOTE Han, Wang, and Mao in 2005 [7] proposed the Borderline-SMOTE technique. It proceeds from the idea that instances far from the decision borderline contribute less to classification performance. Thus, this technique identifies the instances that lie near the borderline by using the ratio of majority to minority instances in each neighborhood. In this algorithm, noisy samples whose neighborhood consists only of majority-class instances are not considered, and hence the synthetic instances are densest at the borderline. Safe-Level SMOTE In [8], Bunkhumpornpat et al. proposed a technique known as Safe-level SMOTE, which is basically an extension of SMOTE. It works in a similar way to SMOTE; the only difference is that new instances are introduced along the same line but closer to the minority class, in what is called the safe region. All synthetic instances are thus created inside this safe region only. DBSMOTE In 2012, a new technique, DBSMOTE [12], was proposed. It is influenced by Borderline-SMOTE, which operates in overlapping regions; however, unlike Borderline-SMOTE, the DBSMOTE algorithm oversamples this region and also works in safe regions. The key idea of this algorithm lies in the density-based clustering approach DBSCAN. Synthetic instances are created along the shortest path from each minority instance to a pseudo-centroid of a minority-class cluster, without operating within a noise region. A-SUWO In 2015, Nekooeimehr and Lai-Yuen proposed a technique called A-SUWO for handling imbalanced dataset distributions, which clusters the minority instances using a semi-unsupervised hierarchical clustering approach. The main aim of this approach is to identify hard-to-learn instances, close to the borderline, from each identified sub-cluster. The size
of the cluster [14] is based on the misclassification error, and oversampling of each individual sub-cluster is based on weights assigned to the instances to avoid over-generalization. SMOTE Chawla et al. in 2002 proposed SMOTE (synthetic minority oversampling technique), which generates new instances by interpolating randomly between a selected minority point and its nearby neighbors. It was the first technique to introduce synthetic sample generation into the learning dataset instead of using simple random replication. SMOTE is the most commonly used technique [15, 24] and performs better than random and simple oversampling. It reduces the inconsistency among the samples and creates a correlation among the instances of the minority class. ADASYN Haibo et al. [19] proposed the novel algorithm ADASYN as an extension of the SMOTE algorithm. Instead of generating the same number of synthetic examples for every minority instance, ADASYN assigns weights to minority samples based on how hard they are to learn, so that the number of instances to be generated is decided automatically from a density distribution function. It enhances imbalanced-class learning by shifting the decision boundary toward the hard-to-learn instances. AHC Cohen et al. proposed a technique in 2006 known as AHC. It was the first attempt at using clustering in the generation of synthetic examples to accommodate a balanced data distribution. It uses both undersampling and oversampling to obtain an even distribution: the k-means algorithm is used to undersample the majority class, and agglomerative hierarchical clustering is used for oversampling the minority class. The resulting clusters from all levels are collected, and their corresponding centroids are interpolated with the authentic minority class examples. ROSE Menardi and Torelli in 2014 proposed this oversampling technique, which creates synthetic samples from an estimated probability distribution. It leads to better prediction while avoiding the risk of overfitting and has proved to be a good oversampling technique. SOMO Douzas and Bacao proposed SOMO [22]. This technique applies a self-organizing map algorithm to the initial dataset, partitioning the data points into several clusters. It has gained popularity as it uses a neural network to build two-dimensional maps through competitive learning. The distribution among the generated clusters is adjusted based on cluster density (Figs. 2 and 3).
Fig. 2 Number of papers published in the last decade based on SMOTE technique on searching the term SMOTE along with class imbalance learning in the ‘web of science’ database
Fig. 3 Number of papers published in the last decade based on cluster-based technique on searching the term cluster along with class imbalance learning in the ‘web of science’ database
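Several of the SMOTE variants surveyed here have reference implementations in the imbalanced-learn library; the hedged sketch below compares the class counts produced by SMOTE, Borderline-SMOTE, and ADASYN on the same synthetic data (illustrative only, not the experimental setup of the cited papers):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original        :", Counter(y))

for name, sampler in [
    ("SMOTE           ", SMOTE(random_state=42)),
    ("Borderline-SMOTE", BorderlineSMOTE(random_state=42)),  # oversamples near the border
    ("ADASYN          ", ADASYN(random_state=42)),           # density-adaptive generation
]:
    _, y_res = sampler.fit_resample(X, y)
    print(name, ":", Counter(y_res))
```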
MWMOTE Barua et al. in 2014 proposed a novel method for handling the class-distribution imbalance problem called MWMOTE [10]. It first identifies the most difficult minority samples and assigns weights to the selected samples according to their importance in the data. It alleviates the problems of the KNN-based sample generation approach by applying a clustering algorithm. MDO In 2016, Nik et al. presented a new oversampling technique inspired by the Mahalanobis distance. This multi-class over-sampling technique, called MDO, generates synthetic samples based on the Mahalanobis distance between the candidate sample and the mean of the considered class. The samples generated by MDO provide more information for the learning model compared with other oversampling
algorithms, which in turn increases the model's generalization ability. It also reduces the chances of overlapping among different class regions. CBSO To overcome the problem of creating wrong synthetic minority samples, a new algorithm known as CBSO [13], based on the idea of clustering, was proposed in 2011. It adopts the basic idea of ADASYN for synthetic instance generation and incorporates unsupervised clustering, which ensures that the new samples lie inside the minority regions so as to balance the class distribution. The main advantage of this approach is that the generated samples never lie in the majority class, and all the noisy samples are put into a separate cluster.
4 Comparison of Oversampling Methods As the class imbalance problem emerged as a challenging issue with the advancement of data mining and machine learning, many studies have been conducted to evaluate the performance of various techniques based on the oversampling approach. An attempt was made in [2], which reviewed the limitations of class imbalance classification and emphasized the issue of choosing accuracy as the only performance criterion, which may lead to inaccurate and misleading conclusions. It reviewed the existing work based on the strengths and weaknesses of each classification technique. It was suggested that class overlapping could be addressed by including a sampling strategy in the pre-processing, and a feature selection strategy was also suggested to address classification in class-imbalanced datasets. With high-dimensional feature spaces and the rapid development of big data, there is a need to solve real-world class imbalance problems. Paper [4] attempted to provide a detailed review of rare event detection techniques and their applications. It compares the research works based on applications: in areas such as chemical and biomedical engineering, feature selection can be a popular choice as the data have a fixed structure, whereas for financial management, cost-sensitive learning is relevant to rare event detection. The work in [5] highlights the limitations of the classifiers used in the context of class imbalance and differentiates the various techniques based on the pre-processing technique used along with the algorithm; as per this work, hybrid techniques proved to be better than ensembles and pre-processing applied alone (Table 1).
5 Conclusion In this paper, we have discussed many of the SMOTE-based techniques, and it is observed that they do not consider the importance of individual minority class instances and generate an equal number of synthetic instances for each. However, samples were found
Table 1 Comparison of oversampling techniques. The table compares the reviewed oversampling methods (SMOTE, Borderline-SMOTE, Safe-level SMOTE, ADASYN, DBSMOTE, CBSO, ProWSyn, Cluster-SMOTE, SOMO, MWMOTE, A-SUWO, MDO, SWIM, ROS + MWMOTE + FIDoS, and MetaCost), with the corresponding works of Han et al. [7], Bunkhumpornpat et al. [8, 12], Barua et al. [4, 10, 13], He et al. [19], Huda et al. [20], Douzas and Bacao [22], Cao et al. [23], Nekooeimehr and Lai-Yuen [14], S´anchez et al. [11], and Haixiang et al. [5], along seven columns: the data-generation mechanism (interpolation, density distribution, density-based or partitioning clustering, clustering combined with ADASYN, k-means plus SMOTE, self-organizing maps with cluster density, hierarchical and agglomerative complete-linkage clustering, Mahalanobis-distance-based density estimation, random oversampling with linear interpolation, and multi-class generation); the classifiers used for evaluation (C4.5 and C4.5 rules, RIPPER, k-NN, Naïve Bayes, SVM, LDA, logistic regression, random forest, gradient boosting machines, and neural networks); the evaluation measures (accuracy, TPR, TNR, FNR, precision, recall, F-measure/F-value, G-mean/G-measure, M-measure, ROC, and AUC); the main reported outcome (for example, generating higher-quality synthetic instances, avoiding overlapping or wrongly placed minority samples, weighting sub-clusters or borderline instances, and robustness to extreme imbalance); and the application domain (software defect prediction, intrusion detection, information retrieval, breast tissue data, noisy datasets such as Spambase, Vehicle, and texturing, and cost minimization).
to be of great importance near the classifier's decision boundary or when surrounded by minority instances. Clustering-based approaches treat these problems differently; ProWSyn, for example, considers different weight generation techniques. When the imbalance ratio is low, Cluster-SMOTE gives superior results, as presented in [14], because such methods focus on the areas that actually need new instances. Despite the intensive work over the years, there are many shortcomings in the existing methods that are yet to be addressed. Hence, a robust cluster-based approach in combination with the SMOTE technique, taking into account the importance of minority class instances, is needed.
References 1. Amin, A., et al. (2016). Comparing oversampling technique to handle the CIP: A customer churn prediction case study. IEEE Access, 4, 7940–7957. 2. Ali, A., & Shamsuddin, S. M. (2015). Classification with class imbalance problem. International Journal of Advances in Soft Computing and its Applications, 7(3). 3. Vimalraj, S., & Porkodi, R. (2018). A review on handling imbalanced data. In Proceedings 2018 IEEE International Conference on Current Trends towards Converging Technologies. 4. Barua, S., Islam, M. M., & Murase, K. (2013). ProWSyn: Proximity weighted synthetic oversampling technique for imbalanced data set learning. In Advances in knowledge discovery and data mining (pp. 317–328). Heidelberg: Springer. 5. Haixiang, G., Li, Y., Shang, J., Mingyun, G., Yuanyue, H., & Gong, B. (2016). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73. https://doi.org/10.1016/j.eswa.2016.12.035. 6. Krawczyk, B. (2016). Learning from imbalanced data: Open challenges and future directions. Progress in Artificial Intelligence, 5, 221–232. 7. Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in intelligent computing (pp. 878–887). Heidelberg: Springer. 8. Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-level-SMOTE: Safelevel-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (pp. 475–482). 9. Fernandez, A., Garcia, S., Herrera, F., & Chawla, N. V. (2018). SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61, 863–905. 10. Barua, S., Islam, M. M., Yao, X., & Murase, K. (2014). MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405–425. 11. S´anchez, A. I., Morales, E. F., & Gonzalez, J. A. (2013). Synthetic oversampling of instances using clustering. International Journal of Artificial Intelligence Tools, 22. https://doi.org/10. 1142/S0218213013500085. 12. Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2012). DBSMOTE: Density based synthetic minority over-sampling technique. Applied Intelligence, 36(3), 664–684. 13. Barua, S., Islam, M. M., & Murase, K. (2011). A novel synthetic minority oversampling technique for imbalanced data set learning. In International Conference on Neural Information Processing, ICONIP 2011 (pp. 735–744). 14. Ne kooeimehr, I., & Lai-Yuen, S. K. (2016). Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert System with Applications, 46, 405–416. 15. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
16. Leevy, et al. (2018). A survey on addressing high-class imbalance in big data. Journal of Big Data, 5–42. 17. Ge Elreedy, D., & Atiya, A. (2019). A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Information Sciences, 505. https://doi. org/10.1016/j.ins.2019.07.070. 18. Sharma, S., & Bellinger, C. (2018). Synthetic oversampling with the majority class: A new perspective on handling extreme imbalance. In IEEE International Conference on Data Mining, pp. 447–456. 19. He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of IJCNN, pp. 1322–1328. 20. Huda, S., Liu, K., Abdelrazek, M., Ibrahim, A., Alyahya, S., & Al-Dossari, H. (2018). An ensemble oversampling model for class imbalance problem in software defect prediction. IEEE Access, 10, 1–1. 21. Nakamura, M., Kajiwara, Y., Otsuka, A., & Kimura, H. (2013). LVQ-SMOTE—Learning vector quantization based synthetic minority over–sampling technique for biomedical data. BioData mining., 6, 16. https://doi.org/10.1186/1756-0381-6-16. 22. Douzas, G., & Bacao, F. (2017). Self-organizing map oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications, 82, 40–52. 23. Cao, P., Zhao, D., & Zaiane, O.: An optimized cost-sensitive SVM for imbalanced data learning. In Advances in knowledge discovery and data mining (pp. 280–292). Springer. 24. Zhu, T., Lin, Y., & Liu, Y. (2019). Improving interpolation-based oversampling for imbalanced data learning. Knowledge-Based Systems, 187. https://doi.org/10.1016/j.knosys.2019. 25. Bennin, K. E., Keung, J., Phannachitta, P., Monden, A., & Mensah, S. (2018). MAHAKIL: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction. In IEEE/ACM 40th International Conference on Software Engineering (ICSE), Gothenburg (pp. 699–699).
Eye Blink-Based Liveness Detection Using Odd Kernel Matrix in Convolutional Neural Networks N. Nanthini, N. Puviarasan, and P. Aruna
Abstract Spoofing attacks on biometric systems are one of the major obstacles to their use in secure applications. In the case of face recognition, it is not easy for the computer to detect whether the person is live or not; thus, liveness detection came into existence. Traditional methods of liveness detection use hard computing techniques, which are highly complex in nature. Other existing methods tried to solve this problem using machine learning techniques but failed to eliminate over-fitting issues. Recent growth in convolutional neural networks has paved the way for accurate results in detection and recognition tasks. In this work, a new approach to detecting face liveness based on eye-blinking status using deep learning has been implemented. Eye blinking is an action that can be identified from a sequence of images of a person, containing the eyes in closed and open states. The proposed Odd Kernel Matrix Convolutional Neural Network (OKM-CNN) method uses a modified kernel matrix to accommodate the need of finding the eye-blinking status. Furthermore, comparisons are done using various activation functions. The experimental results show 96% accuracy for the proposed OKM-CNN liveness detection model. Keywords Spoofing attacks · Liveness detection · Eye blink detection · CNN · Kernel
1 Introduction Face recognition, which has developed briskly in the past few years, plays a key role in biometric identification systems. It has been broadly applied to attendance and security systems. But spoofing is a major cause for the failure of various face
473
474
N. Nanthini et al.
recognition systems [1–3]. In general, face recognition system is unable to differentiate a face between its live state and non-live state. A secure system is needed in order to protect against spoofing. Liveness detection is an emerging technique for digital fraud. It is categorized based on texture, motion and life sign [4]. Eye blinking is one among the types from the life sign category. Blinking is an action that consists of sequences of open and close state of eyes. Machine learning methods only utilize the basic feature of images. In order to perform with videos or large sequence of images, deep learning is used. Deep learning has been established to be outstanding in solving multifaceted structures with high-dimensional data. There are various deep learning methods but CNN is widely used for image classification and face recognition. Convolutional neural network (CNN) is an artificial neural network that employs to extract and increase the number of features from the input data. The main contribution of this work is to achieve a strong liveness detection algorithm with great accuracy. In this paper, a modified deep CNN architecture is developed by improving the size selection of the kernel matrix, and the performance of different activation functions is also studied. The rest of this paper is structured as follows. Section 2 discusses the related works of liveness detection. Section 3 explains the CNN methodology Section 4 presents the proposed system in detail. Section 5 shows the experimental results. Section 6 provides the conclusion.
2 Literature Survey Sun et al. first introduced the eye-blink detection for the face liveness detection system with conditional random fields (CRF), which accommodate long-range contextual dependencies among the observation series [5]. Pan et al. proposed an undirected conditional graphical framework for blinking image sequence and computed the local binary pattern descriptors extracted in scale space for scenic clues [6]. Singh and Arora proposed an eye-blinking and mouth movement combined technique for procuring maximum reliability during face liveness detection [7]. Kim et al. classified the open and close condition of eye images using deep learning. They perform zero-center normalization for the input images before give it to neural networks, which uses mean values instead of pixel values [8]. Rehman et al. proposed an effective methodology for face liveness detection during training a deep CNN network, which makes use of continuous data randomization in the form of mini-batches [9]. Avinash et al. conducted a trail and error method for real-time liveness detection in security system [10]. Musab et al. developed a modified CNN architecture for face recognition by adding a normalization layer between two different layers. It improves the accuracy of the system [11]. Yu et al. proposed diffusion-based kernel model that improves the boundaries for each frame in the videos and extracts the feature called Diffusion Kernel (DK) features. It reflects the internal connection of face images in the video used [12]. Patel et al. proposed eye blink detection with CaffeNet and
Eye Blink-Based Liveness Detection Using Odd Kernel Matrix …
475
GoogLeNet architecture for various public datasets [13]. Yu et al. proposed a heterogeneous kernel in multiple convolutional layers to get better results than classic homogenous kernel [14]. Li et al. used a physiological signal, which is found in original video to eye blink detection from the synthesized fake videos [15].
3 Methodology
CNN is very effective in areas such as image recognition and classification. A CNN is a kind of feed-forward neural network built from numerous layers and consists of neurons with learnable weights and biases. Each neuron takes some inputs, carries out convolution and optionally follows it with a non-linearity. The structure of a CNN contains convolution, pooling, activation function and fully connected layers. The general architecture of a CNN is shown in Fig. 1.
3.1 Convolutional Layer
The convolutional layer is the first and most important layer of a convolutional neural network and carries out most of the computational operations. Its main purpose is to extract features from the input, which is an image. The convolutional layer divides the input image into small squares so that image features can be learned between pixels while preserving their spatial relationships. The input image is convolved by a set of learnable neurons. The convolutional layer examines every small square of the input image in depth, as far as possible, to obtain a higher degree of feature extraction. The feature maps produced at the output are then given to the next layer as input data.
Fig. 1 General architecture of CNN
3.2 Pooling Layer
The pooling layer is usually positioned between convolutional layers. The output feature maps from the previous layer are given as inputs to the pooling layer. The input images are divided into a set of non-overlapping rectangles, and each section is down-sampled by one of the non-linear operations such as average, minimum or maximum. The pooling layer never changes the depth of the 3D matrix in the network, but it reduces the dimensionality of each feature map while retaining all necessary, important information, so as to reduce the parameters of the whole neural network and decrease the training time. This layer achieves better generalization, faster convergence, and robustness to translation and distortion.
3.3 Activation Function
In neural networks, activation functions are used to implement diverse computations between the hidden layers and the output layer. An activation function computes the weighted sum of inputs and biases to decide whether a neuron fires or not. It passes the presented data through gradient processing and produces an output for the neural network that captures the parameters in the data.
3.4 Fully Connected Layer
A fully connected layer simply represents the interconnection between the previous layer and the next layer, i.e., each and every filter in the earlier layer is connected to each and every filter in the succeeding layer. The output from the above three layers is a representation of high-level features of the input image. The goal of the fully connected layer is to use these features to categorize the input image into several classes based on the labeled training set. The final pooling layer sends the features to a classifier that uses an activation function. The sum of output probabilities from the fully connected layer is always 1.
4 Proposed Odd Kernel Matrix (OKM-CNN) for Liveness Detection
Here, the proposed eye-blink detection using a convolutional neural network is presented. The proposed Odd Kernel Matrix-CNN architecture is shown in Fig. 2. For all the convolutions, we have used an Odd Kernel Matrix as the kernel size.
Fig. 2 Proposed OKM-CNN architecture for face liveness detection using modified CNN
Eye-blink detection requires accurate identification of the opening and closing of eyes to detect the person's liveness. In general, the eye images are of smaller size compared to the whole human face. In our proposed work, eye images of size 26 × 34 are fed into the input layer to extract the features for eye-blink detection. Odd Kernel matrices extract accurate features from small input images, whereas traditional kernels fail to extract features from such small images. So, to improve network accuracy in eye-blink detection systems, Odd Kernels are more suitable than traditional kernels. The video input is converted into a series of consecutive frames, and the frames are given as input images to the proposed OKM-CNN model. The proposed OKM-CNN model has three convolutional blocks with five convolutional layers, three pooling layers and two fully connected layers. Thus, the proposed OKM-CNN architecture has 10 layers in total. The first layer is the input layer with a sample size of 26 × 34. The first convolutional layer has one convolution with 32 filters and an odd kernel matrix of size 1 × 3. The output of this convolution is a set of image features of size 26 × 34 × 32. Then a max-pooling layer is placed, which outputs image features of size 13 × 17 × 32. The second convolutional layer is designed to have two convolutions with 64 filters and output dimensions of 13 × 17 × 64; here also we have used an odd kernel of size 1 × 3. Then a max-pooling layer is placed, which gives an output feature map of size 6 × 8 × 64. The third convolutional layer has two convolutions with 128 filters and a max-pooling layer. The output of the two convolutions produces 6 × 8 × 128 feature maps, and the max-pooling gives a result of size 3 × 4 × 128. The activation function in the convolutional layers is used to eliminate redundant data while preserving features and draws out these features through non-linear functions, which is necessary for the neural network to solve complex non-linear problems. In this proposed work, four activation functions are applied to this model: Sigmoid, Tanh, ReLU and LeakyReLU.
Sigmoid It is a non-linear activation function that is also referred to as the logistic function. It is used for calculating probabilities from the output and has been applied successfully in binary classification problems.
f(x) = 1 / (1 + e^(−x))    (1)
Tanh The hyperbolic tangent function is a smoother function whose range lies between −1 and 1. Compared to the sigmoid function, Tanh gives better training performance on multi-layer neural networks.
f(x) = (e^x − e^(−x)) / (e^x + e^(−x))    (2)
ReLU The rectified linear unit (ReLU) function works faster than Sigmoid and Tanh. It applies a threshold operation to each input element, setting values less than zero to 0, and mitigates the vanishing gradient problem.
f(x) = max(0, x) = x if x ≥ 0, 0 if x < 0    (3)
LeakyReLU It was introduced to solve the 'dying ReLU' problem. ReLU has zero gradient whenever a unit is not active, which may slow down the training process due to continuously zero values. Leaky ReLU compresses values less than zero rather than mapping them to zero, giving them a small, non-zero gradient.
f(x) = 1(x < 0)(αx) + 1(x ≥ 0)(x)    (4)
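As a quick illustration of Eqs. (1)–(4), the following is a minimal NumPy sketch of the four activation functions; the Leaky ReLU slope value alpha = 0.01 is an assumed default, since the paper does not specify it.

```python
import numpy as np

def sigmoid(x):
    # Eq. (1): logistic function
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Eq. (2): hyperbolic tangent
    return np.tanh(x)

def relu(x):
    # Eq. (3): threshold at zero
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Eq. (4): small non-zero slope for negative inputs (alpha value assumed)
    return np.where(x < 0, alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```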
After the third block, the features are flattened to form 786,944 feature values and fed to a fully connected layer, giving a total of 1,064,641 trainable parameters. The hidden neurons of the dense layers are reduced from 1536 to 512 in the first dense layer and to one neuron in the second dense layer, respectively. Classification then takes place using a sigmoid layer; the final decision on the openness or closeness of the eye images is based on the output of the sigmoid layer.
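For concreteness, a minimal Keras sketch of the layer stack described above is given below. The 2 × 2 pooling windows, 'same' padding, single-channel grayscale input and the placement of the activation inside each layer are assumptions not fixed by the text, so the exact parameter count of this sketch need not match the figure reported above.

```python
from tensorflow.keras import layers, models

def build_okm_cnn(input_shape=(26, 34, 1), activation="relu"):
    """Sketch of the OKM-CNN: three blocks of 1x3 'odd kernel' convolutions."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Block 1: one convolution, 32 filters, odd kernel 1x3
        layers.Conv2D(32, (1, 3), padding="same", activation=activation),
        layers.MaxPooling2D((2, 2)),                      # 26x34 -> 13x17
        # Block 2: two convolutions, 64 filters each
        layers.Conv2D(64, (1, 3), padding="same", activation=activation),
        layers.Conv2D(64, (1, 3), padding="same", activation=activation),
        layers.MaxPooling2D((2, 2)),                      # 13x17 -> 6x8
        # Block 3: two convolutions, 128 filters each
        layers.Conv2D(128, (1, 3), padding="same", activation=activation),
        layers.Conv2D(128, (1, 3), padding="same", activation=activation),
        layers.MaxPooling2D((2, 2)),                      # 6x8 -> 3x4
        # Classifier head: flatten (3*4*128 = 1536) -> 512 -> 1
        layers.Flatten(),
        layers.Dense(512, activation=activation),
        layers.Dense(1, activation="sigmoid"),            # open vs. closed eye
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```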
Fig. 3 Sample images of training dataset
5 Experimental Results
5.1 Dataset
For the training phase, the MRL Eye Dataset is used. It contains a total of 84,898 images of both open and closed eye states from 37 different persons. In this database, the eye images were obtained using a HOG-based eye detector with an SVM classifier. Figure 3 shows sample eye images from the MRL eye dataset. For the testing phase, the ROSE-YOUTU database is used. It consists of 150 videos from 20 different persons, covering the categories with glasses, without glasses, photo imposters and video captures. The length of one video clip is about 10–15 s at 30 frames/s.
5.2 Training and Testing
The proposed model is trained for 25 epochs using 5000 randomly selected eye images from the training dataset. In the training phase, the model has 1,064,641 trainable parameters, and 300 images are used for validation. The loss/error and accuracy curves of the training and validation data for the proposed model are shown in Figs. 4 and 5. For testing, a human face video is given to the proposed model to classify the open and closed states of the eye for eye-blink detection. The input video is converted into a series of consecutive frames, in which the human eye is located and the eye blink is detected. The proposed model produces a positive value for an opened eye and a negative value for a closed eye. Figure 6 shows the detection of eye blinking from the input video using the proposed OKM-CNN model.
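A minimal sketch of how the model from the previous sketch could be trained and then applied to video frames is shown below. The use of OpenCV for frame reading, the eye_extractor helper, and the 0.5 decision threshold are illustrative assumptions, not details taken from the paper.

```python
import cv2
import numpy as np

# Assumes build_okm_cnn() from the earlier sketch and prepared arrays:
# x_train, y_train (5000 eye crops) and x_val, y_val (300 eye crops),
# each crop resized to 26x34 grayscale and scaled to [0, 1].
model = build_okm_cnn(activation="relu")
model.fit(x_train, y_train, epochs=25, validation_data=(x_val, y_val))

def classify_video_frames(video_path, eye_extractor):
    """Run the trained model frame by frame. eye_extractor is a hypothetical
    helper that crops a 26x34 grayscale eye region from a frame."""
    states = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        eye = eye_extractor(frame)                        # shape (26, 34)
        eye = eye[np.newaxis, ..., np.newaxis] / 255.0
        p = float(model.predict(eye, verbose=0)[0, 0])
        states.append("open" if p >= 0.5 else "closed")   # 0.5 threshold assumed
    cap.release()
    return states
```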
Fig. 4 Loss of training/validation data for the proposed model
Fig. 5 Accuracy of training/validation data for the proposed model
5.3 Performance Evaluations
The performance of the proposed CNN is evaluated using four metrics: accuracy, precision, recall and F1-score. The model saved its best accuracy with patience = 10 for each of the four activation functions Sigmoid, Tanh, ReLU and Leaky ReLU. The formulas for the above-mentioned metrics are given below:
Accuracy = (TP + TN) / (total number of samples)    (5)
Precision (pr) = TP / (TP + FP)    (6)
Recall (rc) = TP / (TP + FN)    (7)
F1-score = 2 × (pr × rc) / (pr + rc)    (8)
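A small scikit-learn sketch of how Eqs. (5)–(8) can be computed from thresholded predictions is given below; it assumes the Keras model from the earlier sketch and a held-out test set, and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(model, x_test, y_test, threshold=0.5):
    """Compute accuracy, precision, recall and F1-score for open/closed eye labels."""
    y_pred = (model.predict(x_test, verbose=0).ravel() >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```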
Fig. 6 Screenshot of liveness detection using the proposed OKM-CNN eye-blink detection model: a detection of liveness from a human video; b detection of non-liveness from a photo-imposter video
where TP (true positives) denotes the number of open eyes correctly detected as open, TN (true negatives) denotes the number of closed eyes correctly detected as closed, FP (false positives) denotes the number of closed eyes wrongly detected as open, and FN (false negatives) denotes the number of open eyes wrongly detected as closed. The average performance values for the proposed model were 0.75, 0.76, 0.73 and 0.81. The results suggest that the proposed eye-blink detection model classifies the classes with high accuracy. The performance of the proposed OKM-CNN system is evaluated by calculating the average of the evaluation metrics and the accuracy. From Table 1, the accuracy is 48% for the Sigmoid activation function, 82% for Tanh, 95% for ReLU and 96% for Leaky ReLU. The ReLU activation function eliminates the vanishing gradient problem that arises with the Sigmoid and Tanh activation functions, and Leaky ReLU eliminates the 'dying ReLU' problem, which improves the accuracy of the model. It is observed that the proposed OKM-CNN algorithm with the Leaky ReLU activation function has higher accuracy than existing models using traditional kernels.

Table 1 Performance of the various activation functions used in the proposed OKM-CNN model

Activation function   Precision   Recall   F1-score   Accuracy
Sigmoid               0.49        0.99     0.66       0.48
Tanh                  0.84        0.81     0.83       0.82
ReLU                  0.93        0.93     0.95       0.95
Leaky ReLU            0.95        0.91     0.96       0.96
Table 2 Comparison of the proposed OKM-CNN model with the existing work (accuracy, %)

Epoch   Classic CNN model   Proposed model with ReLU   Proposed model with Leaky ReLU
5       48.95               92.36                      90.97
10      48.95               95.13                      95.83
20      52.25               96.18                      96.18
25      59.65               95.83                      96.87
Table 3 Comparison of HTER value for the ROSE-YOUTU database

Algorithm    Accuracy (%)   HTER
DLTP [16]    92             6.1
OKM-CNN      96             2.27
From Table 2, it is clear that the proposed OKM-CNN model with the ReLU and Leaky ReLU activation functions has higher accuracy than the existing model with ReLU using a traditional kernel. The Half Total Error Rate (HTER), a combination of the false acceptance rate (FAR) and the false rejection rate (FRR), is used to evaluate the detection performance:
HTER = (FAR + FRR) / 2    (9)
From the experimental results in Table 3, the HTER value of the proposed Odd Kernel Matrix convolutional neural network is improved compared with the DLTP algorithm.
6 Conclusion
This work proposed the OKM-CNN model for liveness detection using eye-blink detection. The performance of the proposed model with different activation functions is evaluated by calculating accuracy values. Traditional kernels fail to give good output for the small eye region of the overall human face, so the Odd Kernel Matrix is proposed to obtain better performance for small input image sizes. The performance of the proposed model increases with an increase in epochs. From the above experimental results of the proposed OKM-CNN model, it is inferred that Odd Kernels with the Leaky ReLU activation function achieve a higher accuracy of 96% than the Sigmoid, Tanh and ReLU activation functions. Future work will continue the optimization of the proposed network to improve its performance in liveness detection scenarios.
References
1. Hasan, M., Hasan Mahmud, S. M., & Li, X. Y. (2019). Face anti-spoofing using texture-based techniques and filtering methods. Journal of Physics: Conference Series (IOP-2019), 1–10.
2. Bhele, S. G., & Mankar, V. H. (2012). A review paper on face recognition techniques. International Journal of Research in Computer Science Engineering and Technology, 1, 2278–2323.
3. Parmar, D. N., & Mehta, B. (2013). Face recognition methods and applications. International Journal of Computer Technology Applications, 4, 84–86.
4. Kollreider, K., Fronthaler, H., Faraj, M. I., & Bigun, J. (2007). Real-time face detection and motion analysis with application in "liveness" assessment. IEEE Transactions on Information Forensics and Security, 2(3), 548–558.
5. Sun, L., Pan, G., Wu, Z., & Lao, S. (2007). Blinking-based live face detection using conditional random fields. In International Conference on Biometrics (ICB 2007) (pp. 252–260).
6. Pan, G., Sun, L., Wu, Z., & Wang, Y. (2011). Monocular camera-based face liveness detection by combining eyeblink and scene context. Telecommunication Systems, 47, 215–225. Springer.
7. Singh, M., & Arora, A. S. (2017). A robust anti-spoofing technique for face liveness detection with morphological operations. Optik, 139, 347–354. Elsevier.
8. Kim, K. W., Hong, H. G., Nam, G., & Park, K. R. (2017). A study of deep CNN-based classification of open and closed eyes using a visible light camera sensor. Sensors, MDPI, 17, 1–21.
9. Rehman, Y. A. U., Man, P. L., & Liu, M. (2018). LiveNet: Improving features generalization for face liveness detection using convolution neural networks. Expert Systems with Applications, 45, 1–25. Elsevier.
10. Singh, A. K., Joshi, P., & Nandi, G. C. (2014). Face recognition with liveness detection using eye and mouth movement. In International Conference on Signal Propagation and Computer Technology (ICSPCT) (pp. 592–597).
11. Coşkun, M., Uçar, A., Yıldırım, Ö., & Demir, Y. (2017). Face recognition based on convolutional neural network. In International Conference on Modern Electrical and Energy Systems (IEEE) (vol. 10, pp. 2–5).
12. Yu, C., Yao, C., Pei, M., & Jia, Y. (2019). Diffusion-based kernel matrix model for face liveness detection. Image and Vision Computing, 89, 88–94. Elsevier.
13. Patel, K., Han, H., & Jain, A. K. (2016). Cross-database face antispoofing with robust feature representation. Lecture Notes in Computer Science (CCBR), 1–10.
14. Lu, X., & Tian, Y. (2020). Heterogeneous kernel based convolutional neural network for face liveness detection. In BIC-TA 2019, CCIS 1160 (pp. 381–392). Singapore: Springer.
15. Li, Y., Chang, M.-C., & Lyu, S. (2018). In Ictu Oculi: Exposing AI generated fake face videos by detecting eye blinking. Computer Vision and Pattern Recognition.
16. Parveen, S., Ahmad, S. M., Abbas, N. H., Adnan, W. A., Hanafi, M., & Naeem, N. (2016). Face liveness detection using dynamic local ternary pattern (DLTP). Computers, 5(2).
Predicting Student Potential Using Machine Learning Techniques Shashi Sharma, Soma Kumawat, and Kumkum Garg
Abstract Digitization has immensely impacted our education system and the career planning of students. Generally, career counselors/experts play an important role in evaluating and assisting students in choosing appropriate careers for themselves. These conventional methods and practices are no longer very impactful and, after a point, have proved ineffective. Hence many educational institutions have started using advanced automated solutions developed through artificial intelligence. Automation of the counseling system saves effort as well as time and has the potential to reach a diverse group of people. This paper explores the various machine learning algorithms used for providing effective career guidance and counseling. The study works on real-time student data using machine learning techniques and considers different attributes to find out which factors play a major role in students' career choice. The decision tree provides the highest accuracy for both datasets. Keywords Machine learning · Students · Performance · Career selection · Prediction · Accuracy
1 Introduction
Students are an essential part of any educational organization as well as of the country. Every student wants to make his or her future better, and nowadays it has become very difficult for students to choose their careers. So, we decided to research how students can choose the right career according to their ability, interest, and competence. The aim of this study is to identify a student's career path based on attributes like his/her background information, school achievements, and parents' income, etc. This is done by using different machine learning algorithms and statistical techniques. Machine learning is the study of computer programs that can learn by example; in other words, machine learning is a field of artificial intelligence that extracts patterns out of raw data by using an algorithm. A key goal of machine learning is good generalization ability, which refers to a learning algorithm's ability to make accurate predictions for new data rather than merely fitting the training data. In Rajasthan, 15–20 lakh students pass their 12th standard from different boards every year, so the volume of the dataset is very large. Conventional methods of data analysis are not efficient because they are lengthy and time-consuming, so machine learning techniques are used for the prediction of student potential. This study works on original student datasets collected from different universities and institutions and considers different attributes to find out which factors play a major role in career choice. ML algorithms are used on this dataset to predict the best technical program a prospective student should study. The paper is organized as follows. A literature review is presented in the second section. The third section describes machine learning techniques; the fourth section describes the research methodology used; and the fifth section investigates the results and findings. The conclusion is given in the last section of the paper.
2 Literature Review
This section reviews some research that has been done in related areas. Kumar and Singh proposed a classification model to evaluate student performance using ML algorithms. The paper used a combination of k-means clustering with SVM and ANN to evaluate student performance. The online dataset contained 34 different attributes, and the evaluation of student performance was done based on mean square error and effort estimation. The findings of this study show that the performance of ANN is better than SVM; the mean square error is 5–20% better [1]. Fernandez et al. implemented different machine learning algorithms to evaluate student performance for final grade prediction based on past academic information. The dataset used 335 students' academic records, taken from engineering degree students of an Ecuador university. The paper first performed data collection and preprocessing and then grouped students with similar patterns; the results show that machine learning is effective for predicting student performance [2]. Ammar et al. constructed an ensemble meta-based tree model (EMT) classifier for students' performance prediction. 400 student records with 13 attributes were used in this study, and the experimental results show that EMT is more accurate than other ML techniques [3]. Zohair and Mahmoud developed a model based on classification methods to predict student performance using clustering algorithms. The paper tried to identify the key indicators in small datasets that were used in creating the prediction model, using 50 graduate students' records. The paper used MS Excel and Python 3.6.2 for the analysis of the collected data and R Studio for data visualization. Among the implemented algorithms, the SVM algorithm has better accuracy than other ML algorithms for small dataset sizes [4]. Lau et al. examined both a conventional statistical analysis and a neural network modeling approach for students' performance prediction. The data were collected for about 1000 undergraduate students; the dataset included 275 female and 810 male students and was taken from a Chinese university. Eleven input variables, two hidden layers of neurons, and one output layer were used to model the neural network, and this model achieved 84.8% accuracy [5]. Sudani and Palaniappan proposed a model based on a multi-layered neural network to predict students' performance, with the objective of categorizing student degrees into good or basic classes. The data used in this study were of 481 students, and the dataset was divided into three parts: training (70%), testing (25%), and validation (5%). The paper used four algorithms, viz., support vector machine, k-nearest neighbor, decision tree, and neural network. The neural network model was compared with the other classifiers on the same dataset; results showed that the neural network performed better than the other algorithms in terms of accuracy [6]. Pal and Bhatt studied prediction accuracy rates using R programming. This paper evaluated the performance of postgraduate students, with the aim of analyzing the factors that influence student academic performance. The paper used deep learning along with other methods like linear regression and random forest, and used accuracy, recall, and F-measure for comparison; the outcome of deep learning is better than the other ML techniques [7]. Suhaimi et al. discussed the factors used to predict students' performance and studied the influence of different factors using ML techniques, namely NN, support vector machine (SVM), and decision tree. The NN achieved 95% accuracy, higher than the other ML techniques. The paper shows that academic assessment is a prominent factor when predicting a student's graduation time [8]. Roy and Garg implemented different ML techniques to predict student academic performance, which helped to identify the interests and weaknesses of students. Different attributes, such as social and demographic ones, influenced the performance of students. The paper used the classifiers Naïve Bayes, J48 decision tree, and MLP; Naïve Bayes had the highest accuracy of 68.60% [9]. Hassan et al. studied different ML techniques for predicting student performance, trying to predict future results based on the current status of the students. Data of 1170 students across three subjects were used with K-nearest neighbor and decision tree; the decision tree showed higher accuracy (94.88%) [10]. Tanuar et al. predicted final-year GPA based on first-year semester results using ML algorithms, with RapidMiner as the platform. Three models, viz., deep learning, decision tree, and a general linear model, were used. The findings showed that the important factors that have an impact on the result can be extracted, which can help undergraduates prepare for their exams well in advance [11].
Rahman and Islam predicted students' academic performance based on two categories, behavior and student absence in class. They used four classification algorithms: KNN, NB, decision tree, and artificial neural networks. They also applied ensemble methods such as bagging, random forest, and AdaBoost for better accuracy; the ensemble methods achieved a higher accuracy of 84.3% [12]. Turabieh evaluated ML algorithms for predicting student marks, focusing on an emerging approach for finding meaningful information from collected data. The dataset was collected from the e-learning log files of virtual course students. The paper used a hybrid feature selection method with the classification algorithms KNN, convolutional neural network, Naïve Bayes, and decision tree (C4.5). A wrapper feature selection method, the binary genetic algorithm (BGA), was applied to the collected data; the findings show that BGA increases the performance of all classifiers [13]. Patil et al. predicted student grade point averages using deep learning models compared with ML techniques. Feed-forward and recurrent neural networks were used for GPA prediction, and various recurrent architectures were compared; the bi-directional long short-term memory network accuracy (92.6%) was higher than the other algorithms [14]. Soni et al. examined different classification algorithms for analyzing pupil performance, using graduate and undergraduate students' data collected from different universities during 2017–2018 through a questionnaire survey. Three classifiers, viz., support vector machine, decision tree and Naïve Bayes, were used for the evaluation of students' performance; the accuracy of SVM was 83.33%, higher than the other algorithms [15]. Livieris et al. predicted the performance of secondary school students using a semi-supervised learning method with two wrapper methods, observing and evaluating the efficiency of the algorithms for predicting performance in the final exam. The dataset of 3716 students in mathematics courses was taken from the Microsoft showcase school "Avgoulea-Linardatou" for the years 2007–2016. The results showed that classification accuracy can be increased by using a semi-supervised learning algorithm [16]. Adekitan and Salau examined the effect of engineering students' performance in their first three years on their graduation result, using different data mining algorithms, viz., probabilistic NN, decision tree, tree ensemble, random forest, logistic regression and Naïve Bayes; logistic regression achieved the highest accuracy (89.15%) [17]. Mduma et al. studied different ML approaches for the prediction of student dropout, surveying literature in books, journals, and case studies. Several works have been done using supervised and unsupervised learning algorithms, and the paper concluded that several techniques have been used for this problem in developed countries, but there is a lack of research in developing countries [18].
3 Machine Learning Techniques
According to Arthur Samuel, machine learning is "the field of study that gives computers the ability to learn without being explicitly programmed". In 1997, Tom Mitchell of Carnegie Mellon University gave an engineering-oriented definition: "A computer program is said to learn from experience E concerning a task T and some performance measure P, if its performance on T, as measured by P, improves with experience E" [1]. ML algorithms are classified as either supervised or unsupervised. A child learns many things in life through his own experience, without being told; the supervised learning of human beings comes through elders like parents, friends, or teachers. In supervised ML, the program is "trained" on a predefined set of training examples so that it can learn and reach the correct conclusion when given unseen data. A classification algorithm is useful for predicting discrete outputs; in other words, it is beneficial when the answer to a question falls within a predictable set of possible outcomes. Five classification algorithms are used in this paper: logistic regression, decision tree, support vector machine, KNN, and Naïve Bayes. These are supervised machine learning algorithms, which use labeled data for prediction. This paper uses these algorithms to predict student potential based on their different features.
4 Implemented Methodologies
The goals of this study are (1) preparing a model to predict the best technical program for a prospective student to study and (2) comparing prediction accuracy between different machine learning techniques. For the study, data were collected from Bhartiya Skill Development University and other institutions. The dataset contains information about students who enrolled in the years 2018–19 and 2019–20 and holds 800 records in total. The collected dataset includes a student's demographic, academic, and financial details. The target population of the study is students passing the 12th standard from Rajasthan as well as students who are already pursuing technical and non-technical programs. The whole dataset was divided into two datasets based on the different boards, i.e., RBSE and CBSE; each dataset has 400 students' data. These datasets are used to predict student potential through machine learning techniques. Each record has eight attributes; Table 1 gives a brief description of the variables used in the study.

Table 1 Student dataset

Attributes   Data type   Description
Age          Numeric     Student age
Gender       Nominal     Student gender
Foccu        Nominal     Parent's occupation
Fincome      Numeric     Family income
MoE          Nominal     Medium of education
10th         Numeric     10th marks
12th         Numeric     12th marks
Branch       Nominal     Program

a. Data Preprocessing
Before data analysis, data must be checked and preprocessed. Data preprocessing is a technique used to convert raw data into a standard dataset and to improve its quality. Data attribute variety, data transformation, data reduction, and cleaning are all parts of data preprocessing. After cleaning, 800 records are used for prediction. Data transformation was then performed on the dataset: nominal attributes like Gender, MoE, etc., were transformed into binary data "0" and "1", and other nominal attributes like Foccu, Fincome, etc., were transformed into a numerical data type. We have used the Sklearn library to preprocess the data; Sklearn is a very effective tool for encoding categorical attributes into numeric values. For example, Gender has two levels, either male or female. The whole dataset was preprocessed and then used for analysis.
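A minimal sketch of this encoding step with pandas and scikit-learn is shown below. The column names follow Table 1, the CSV file name is a hypothetical placeholder, and treating Branch (the program) as the prediction target is an assumption based on the stated goal of the study.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("students.csv")   # hypothetical file with the Table 1 columns

# Binary nominal attributes -> 0/1
for col in ["Gender", "MoE"]:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Other nominal attributes -> numeric codes
for col in ["Foccu", "Fincome", "Branch"]:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

X = df.drop(columns=["Branch"])    # features: demographic, academic, financial
y = df["Branch"]                   # assumed target: program chosen
```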
b. Data Analysis
After data preprocessing, both datasets were compared based on the eight attributes. The number of female students is much higher in CBSE than in RBSE, as shown in Fig. 1. The medium of education also impacts the choice of higher programs: more English-medium students take up technical programs than Hindi-medium students, as shown in Fig. 2. Income is also an important attribute for choosing technical programs; high-income group students take admission in technical programs, as shown in Fig. 3.
Fig. 1 Gender ratio
Fig. 2 Medium of education
Fig. 3 Family income
This study concludes that more students choose technical programs from CBSE as compared with RBSE. Figure 4 shows the ratio between students from technical and non-technical programs.
Fig. 4 Technical versus non-technical
5 Experiment and Results
a. Environment
The experiments ran on the Microsoft Windows 10 OS with 64 GB RAM and a 4-core Intel processor. In evaluating the machine learning algorithms, we used Python programming. We split the dataset into two sets: 60% was used to train the models and the remaining 40% was used as test data.
b. Evaluation Measures
We used a common classification measure in our experiments, viz., accuracy. Model accuracy is given as the overall correct classifications divided by the total classifications done; the accuracy of a classification algorithm is one way to measure how often the algorithm classifies a data point correctly. We applied Python to the collected students' dataset. Python and its libraries, such as Pandas, NumPy, Scikit-Learn, and Matplotlib, are used for the interpretation and investigation of the data. This language is generally used for building scalable machine learning algorithms and implements different machine learning algorithms like regression and classification [19]. The datasets are divided into two parts: 75% of the data are used to train the models and 25% are used as a test dataset. One-hot encoding is applied to both the training and test datasets. K = 10 is selected for K-fold cross-validation to evaluate the accuracy of the different classifiers, and Python 3 is used for the implementation of the different machine learning algorithms. Machine learning techniques have been applied to the collected dataset to calculate the prediction accuracy. This study applied five different machine learning techniques, Naïve Bayes, decision tree, K-nearest neighbor, support vector machine, and logistic regression, to the collected dataset. The final accuracy of the classification methods is considered and compared (Tables 2 and 3). Whisker plots are used to compare the accuracy scores of the 10-fold cross-validation for each machine learning algorithm, using both datasets, as given in Figs. 5 and 6. As shown, the decision tree achieves the highest accuracy for both datasets; the decision tree model will thus be applied to new student data for the prediction of student potential.
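A minimal scikit-learn sketch of this comparison is shown below; it assumes the encoded feature matrix X and label vector y from the preprocessing sketch, 10-fold cross-validation, and default hyperparameters.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# X, y: encoded features and target from the preprocessing sketch above.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.4f} (+/- {scores.std():.4f})")
```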
Table 2 Accuracy comparison for RBSE dataset

ML model                       Accuracy
Logistic regression (LR)       0.797288
K-nearest neighbor (KNN)       0.807401
Decision tree (CART)           0.824237
Naïve Bayes (NB)               0.733277
Support vector machine (SVM)   0.817684
Table 3 Accuracy comparison for CBSE dataset

ML model                       Accuracy
Logistic regression (LR)       0.840839
K-nearest neighbor (KNN)       0.828485
Decision tree (CART)           0.868345
Naïve Bayes (NB)               0.477855
Support vector machine (SVM)   0.850023

Fig. 5 RBSE dataset
Fig. 6 CBSE dataset
6 Conclusion
The academic success of a student is of the highest priority for any institute or university. Machine learning classification algorithms are used in this paper for predicting students' potential based on different education boards. The paper compared the performance of different classification algorithms in academics: Naïve Bayes, decision tree, logistic regression, support vector machine, and K-nearest neighbor. Two datasets based on the boards RBSE and CBSE were used for the prediction of student potential. The results show that the decision tree algorithm provides higher accuracy than the other classifiers for both datasets; hence it will be applied to new student data for the prediction of student potential. We also found out which factors play a major role in choosing a career by using machine learning algorithms. Machine learning techniques are fast, and most of the steps involved in the process can be automated by programming the algorithms, which makes them less time-consuming; this is why machine learning techniques were used for this research. Before applying any algorithm, one needs to understand the data used for prediction and then build the model. We will now increase our dataset and focus on further improving the prediction accuracy by implementing ensemble algorithms such as bagging and AdaBoost.
References
1. Kumar, M., & Singh, A. J. (2019). Performance analysis of students using machine learning & data mining approach. International Journal of Engineering and Advanced Technology (IJEAT), 75–79.
2. Almasri, A., Celebi, E., & Alkhawaldeh, R. S. (2019). EMT: Ensemble meta-based tree model for predicting student performance. Scientific Programming, 3610248:1–3610248:13.
3. Mahmoud Abu Zohair, L. (2019). Prediction of student's performance by modelling small dataset size. 16, 18. https://doi.org/10.1186/s41239-019-0160-3.
4. Lau, E., Sun, L., & Yang, Q. (2019). Modelling, prediction and classification of student academic performance using artificial neural networks. SN Applied Sciences, 1. https://doi.org/10.1007/s42452-019-0884-7.
5. Al-Sudani, S., & Palaniappan, R. (2019). Predicting students' final degree classification using an extended profile. Education and Information Technologies, 1–13. https://doi.org/10.1007/s10639-019-09873-8.
6. Pal, V. K., & Bhatt, V. K. (2019). Performance prediction for post graduate students using artificial neural network. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 446–454. G10760587S219/19©BEIESP.
7. Suhaimi, N., Rahman, S., Mutalib, S., Abdul Hamid, N., & Hamid, A. (2019). Review on predicting students' graduation time using machine learning algorithms. International Journal of Modern Education and Computer Science, 11, 1–13. https://doi.org/10.5815/ijmecs.2019.07.01.
8. Roy, S., & Garg, A. (2017). Predicting academic performance of student using classification techniques. 568–572. https://doi.org/10.1109/UPCON.2017.8251112.
9. Hasan, H. R., Rabby, A. S. A., Islam, M. T., & Hossain, S. A. (2019). Machine learning algorithm for student's performance prediction. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1–7). IEEE.
10. Tanuar, E., Heryadi, Y., Abbas, B. S., & Gaol, F. L. (2018). Using machine learning techniques to earlier predict student's performance. In 2018 Indonesian Association for Pattern Recognition International Conference (INAPR) (pp. 85–89). IEEE.
11. Rahman, Md., & Islam, Md. (2017). Predict student's academic performance and evaluate the impact of different attributes on the performance using data mining techniques. 1–4. https://doi.org/10.1109/CEEE.2017.8412892.
12. Turabieh, H. (2019). Hybrid machine learning classifiers to predict student performance. In 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS) (pp. 1–6). IEEE.
13. Patil, A. P., Ganesan, K., & Kanavalli, A. (2017). Effective deep learning model to predict student grade point averages. In IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), 2017, 1–6.
14. Soni, A., Kumar, V., Kaur, R., & Hemavath, D. (2018). Predicting student performance using data mining techniques. International Journal of Pure and Applied Mathematics, 119(12), 221–227.
15. Livieris, I. E., Drakopoulou, K., Tampakas, V. T., Mikropoulos, T. A., & Pintelas, P. (2019). Predicting secondary school students' performance utilizing a semi-supervised learning approach. Journal of Educational Computing Research, 57(2), 448–470.
16. Adekitan, A. I., & Salau, O. P. (2019). The impact of engineering students' performance in the first three years on their graduation result using educational data mining. Heliyon.
17. Mduma, N., Kalegele, K., & Machuve, D. (2019). A survey of machine learning approaches and techniques for student dropout prediction.
18. Mccrea, N. An introduction to machine learning theory and its applications: A visual tutorial with examples. https://www.toptal.com/machine-learning/machine-learning-theory-an-introductory-primer.
19. Python Introduction. https://www.w3schools.com/python/python_intro.asp.
Routing Based on Spectrum Quality and Availability in Wireless Cognitive Radio Sensor Networks Veeranna Gatate and Jayashree Agarkhed
Abstract With the emerging new applications of the Internet of Things (IoT), the abundant use of sensing devices and the increased utilization of the wireless spectrum, reliable communication becomes difficult for unlicensed users. Due to the dynamics of spectrum availability, it is a considerable challenge to devise a spectrum-aware channel assignment and routing mechanism for unlicensed users, called secondary users (SUs), in cognitive radio-enabled sensor networks. In the presented work, a mechanism for routing based on spectrum availability is proposed. The estimation of spectrum availability and spectrum quality is based on distance and the received signal strength indicator (RSSI). Routing is performed by assigning path weights, and the path with the minimum weight is selected as the optimal path. The routing performance of the proposed work is evaluated through simulations, and it is found to achieve a significant performance improvement over the existing algorithm. Keywords Cognitive radio · Clustering · Minimum-cost · Optimal strategy · RSSI
1 Introduction
Currently, a large number of devices with computation and infrastructure-less communication capabilities are devised and deployed in pervasive computing environments, which leads to the intelligent era of the Internet of Things (IoT) [1]. Due to the increased exploitation of mobile applications, portions of the wireless electromagnetic spectrum like the Industrial, Scientific and Medical (ISM) band have become progressively more crowded. As per the Federal Communications Commission (FCC) report, spectrum allocation currently follows static policies and the spectrum is utilized only in limited geographical areas; the licensed spectrum is found to be largely under-utilized [2]. To enhance the efficiency of spectrum use, cognitive radio (CR) has emerged as a fundamental solution for improving the low spectrum utilization ratio [3, 4]. In a typical CR environment, primary users (PUs) operate with secondary users (SUs), and the SUs are equipped with CR devices for spectrum sensing, enabling them to sense the vacant spectrum in their surroundings. SUs have to opportunistically gain access to these vacant spectrum bands, as a PU has higher channel access priority. When SUs sense the arrival of a PU on an occupied spectrum band, they should immediately vacate the currently occupied band and quickly switch to another available vacant band. Recently, research interest has grown especially in multi-hop wireless Cognitive Radio Sensor Networks (CRSNs) [5–7]. A multi-hop CRSN is defined as a distributed network consisting of wireless sensors enabled with CR technology that are capable of sensing an event signal and communicating among themselves over the available spectrum holes in a multi-hop manner. In a multi-hop CRSN, an SU selects the best available channel once a vacant channel is determined for data transmission and releases the occupied channel immediately upon detecting the arrival of a PU on the same channel. By incorporating CR technology, multi-hop CRSNs can increase spectrum usage efficiency, improve SU performance and hence extend the network lifetime [8, 9]. Routing data from an SU source to the base station (BS) in multi-hop CRSNs is challenging because of uncertain and dynamic PU activities, non-availability of channels, and spectrum quality [10]. First, in a typical multi-hop CRSN, a shared channel that can be used by all SUs along the routing path may not exist; traditional routing mechanisms would therefore be unsuccessful in multi-hop CRSNs. Second, the availability of spectrum for SUs varies dynamically. As an SU has to release the channel and abort the data forwarding process when the spectrum is re-occupied by PUs, rerouting is sometimes needed, which leads to further degradation of network performance. Therefore, it is essential to identify the optimal path causing minimum rerouting. Finally, spectrum quality impacts routing path selection. During data forwarding, the currently chosen channel may turn invalid because of PU arrival, and a new channel must be identified to resume the routing. This necessitates that a good routing scheme validates the quality of both the currently occupied channel and the newly available channel. The main contributions of this paper are twofold: (1) the design of an optimal spectrum strategy selecting paths of good quality with maximum spectrum availability, and (2) the design of an optimal routing scheme that selects the minimum-cost routing path. The remaining portion of the paper is organized as follows. Related works are discussed in Sect. 2. The proposed method and execution flow, along with implementation details, are explained in Sect. 3. Section 4 provides the performance analysis of the proposed work against the existing technique, and Sect. 5 concludes the paper.
2 Related Works
The authors in [11] proposed a spectrum-aware routing protocol (SEARCH) that uses a greedy geographic technique to broadcast route request packets on every identified vacant channel; using the received route requests, the destination node identifies the routing path with the lowest hop count to the source node while ensuring the least interference to the PUs. Because the impending spectrum availability is not estimated, the route determination in [11] is tightly coupled to the spectrum dynamics. The authors in [12] proposed a routing protocol (TIGHT) in which the source node selects the optimal routing path with the least distance to the destination to prevent PU interference. In applications with meager PU activity, TIGHT achieves good performance, whereas in highly dynamic applications its performance is poor. The objective of opportunistic routing in CRSNs is to determine the neighbor sequence priority for every intermediate node. Each intermediate node broadcasts the data packet to its neighbors at the network layer, and at the MAC layer only one node replies and acts as the next relay node, depending on the received results and priority. In [13], an Opportunistic Cognitive-aware Routing (OCR) mechanism is presented, in which the relay node priority is estimated from its position and the associated spectrum quality, based on factors like channel throughput, channel reliability and the distance advancement towards the destination. The authors of [14] framed a cross-layer distributed routing protocol, which combines spectrum sensing with relay selection with the intent of reducing the delay between a source and destination. The authors of [15] proposed a spectrum-aware semi-structured routing scheme (SSR), which introduced the forwarding-zone concept for every SU and permitted a single SU node to choose its next relay from the possible relay nodes, minimizing packet delay and energy utilization. Due to the lack of spectrum availability estimation, opportunistic routing mechanisms fall into local optima even though the retransmission probability is minimized. Our previous work, Interference Aware Cluster Formation in CRSNs (IACFC) [16], forms clusters by selecting neighbor nodes with minimum distance, and channels are selected based on a fairness index computed using channel throughput, bandwidth, and average buffer occupancy. IACFC selects channels so as to avoid channel overlapping and channel interference, and the protocol performs opportunistic routing in allotted static time slots. For handling the spectrum dynamics in CRSNs, the authors in [17] propose a mechanism to find the remaining available period of the available spectrum depending on the probable idle duration of channels, the communication history over the channel, and the channel sensing period. This method overcomes the communication interruptions caused by PU arrivals and improves the per-hop transmission performance. By estimating the hop delay, an optimal route between source and destination is determined to provide minimum delay with reliable communication along the selected paths. The scheme proposed in [18] utilizes an evaluation parameter to determine the CH node in each cluster by estimating node weights, and the node with the maximum weight is selected as CH. For route establishment among CHs, common channels are identified between them, enhancing the clustering and routing performance and extending the network lifetime. The authors in [19] propose an unequal clustering method with energy and spectrum awareness (ESAUC), which jointly mitigates the energy and spectrum limitations by balancing the residual energy among the sensor nodes. ESAUC optimally adjusts common channels to improve cluster stability.
The objective of the scheme proposed in [20] is to balance the routing overhead using PU avoidance, based on a user-defined utility function. An optimal discovery radius (k) is determined to balance the routing overhead. The protocol achieves a balance with respect to route optimality, enhancing throughput and packet delivery ratio while minimizing the routing overhead.
3 Proposed Method
The detailed working of the proposed method is presented in this section. Section 3.1 presents the network and sensing model. Section 3.2 describes the RSSI-based channel availability and channel quality estimation and the selection of optimal routing strategies.
3.1 Network and Sensing Model
The CRSN considered consists of CR-enabled wireless sensor nodes, primary users (PUs), secondary users (SUs) and a base station (BS). The sensor nodes are equipped with spectrum sensing features to select the available vacant channels under the network coverage. Each node maintains two tables: a routing table with next-hop information and an RSSI table storing the detected signal strength values; both are periodically updated. Sensor nodes perform spectrum sensing and send the sensed result to a fusion center (FS), where the decision about selecting the channel is made.
3.2 Working of the Proposed Method
The algorithm begins with the BS broadcasting its location information in the network. All the sensor nodes immediately send their location information, which is maintained and updated periodically in a Location Table (LT) at the BS. For every available channel, the channel coefficient (CH_Coefficient) is computed using the channel tuning value (α) and the distance to the PU, as given in Eq. (1). Let Dist_PU indicate the distance of the channel to the PU; then
CH_Coefficient = (Dist_PU)^α    (1)
The BS checks the LT and determines the one-hop neighbors based on the node locations in its table. The next step is to group the sensor nodes by executing the clustering method.
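A tiny sketch of Eq. (1) applied to a set of candidate channels is given below; the example distances and the tuning value α = 2 are purely illustrative assumptions.

```python
def channel_coefficient(dist_pu, alpha=2.0):
    # Eq. (1): CH_Coefficient = (Dist_PU) ** alpha (alpha value assumed)
    return dist_pu ** alpha

# Hypothetical distances (in metres) from each available channel's PU.
channels = {"ch1": 35.0, "ch2": 60.0, "ch3": 20.0}
coefficients = {c: channel_coefficient(d) for c, d in channels.items()}
print(coefficients)
```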
3.2.1 Clustering Process and Cluster-Head Selection
The algorithm computes the proximity list count to determine the nearest nodes to form the cluster. The proximity is computed based on the node communication range, and all the nodes that fall within the node's coverage are added as cluster members, after which the proximity list count is updated. The LT is redefined with each node ID and its neighbors, and all nodes out of proximity are eliminated. The maximum proximity count is computed and a cluster announcement message is broadcast. Within the proximity list of sensor nodes, one node is elected as Cluster-Head (CH); the CH is selected based on the least distance to the proximity list members. Upon CH selection, the proximity list members send a JOIN message to the CH to form the cluster. The JOIN message also contains the node ID and location information of the nodes, which the CH records in its Cluster Member Table (CMT). Cluster heads use the license-free frequency band for intra-cluster communication, and CHs are permitted to opportunistically use the PU channels for inter-cluster communication. During communication, the CH with the highest signal-to-noise ratio (SNR) is chosen as the last hop to the BS so that the BS receives a high signal strength. Once the CMT is constructed, the sensor nodes initiate the spectrum sensing process.
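A simplified sketch of the proximity-based cluster formation and CH election described above is given below; the coordinate representation, the communication-range value and the Euclidean-distance criterion are illustrative assumptions rather than details fixed by the paper.

```python
import math

def distance(a, b):
    # Euclidean distance between two (x, y) positions (metric assumed)
    return math.hypot(a[0] - b[0], a[1] - b[1])

def proximity_list(seed_id, positions, comm_range=30.0):
    """All nodes within the seed node's communication range (range value assumed)."""
    seed = positions[seed_id]
    return [n for n, p in positions.items()
            if n != seed_id and distance(seed, p) <= comm_range]

def elect_cluster_head(members, positions):
    """CH = member with the least total distance to the other proximity-list members."""
    return min(members,
               key=lambda n: sum(distance(positions[n], positions[m])
                                 for m in members if m != n))

positions = {1: (0, 0), 2: (10, 5), 3: (12, 18), 4: (25, 20), 5: (60, 60)}
members = [1] + proximity_list(1, positions)
print("cluster members:", members, "cluster head:", elect_cluster_head(members, positions))
```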
3.2.2 Spectrum Sensing Process
The proposed method determines the Received Signal Strength Indicator (RSSI) of each node to establish the signal strength of the nodes for a non-interrupting connection over the channel. Each node maintains an RSSI table containing the signal strength levels to its neighbor nodes, which is periodically updated. The algorithm computes the average signal strength from the information available in the RSSI table and determines the RSSI deviation as a difference sum. The spectrum sensing process is then initiated by all sensor nodes considering the RSSI and the difference sum. The signal variance is computed as shown in Eq. (2):
Signal_variance = DiffSum / RSSI_count    (2)
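A small sketch of Eq. (2) computed from a node's RSSI table is given below; representing DiffSum as the sum of absolute deviations from the mean RSSI is an assumption, since the paper only calls it a difference sum.

```python
def signal_variance(rssi_table):
    """rssi_table: list of RSSI readings (dBm) for a node's neighbours."""
    if not rssi_table:
        return 0.0
    avg = sum(rssi_table) / len(rssi_table)           # average signal strength
    diff_sum = sum(abs(r - avg) for r in rssi_table)  # RSSI deviation (absolute deviations assumed)
    return diff_sum / len(rssi_table)                 # Eq. (2)

print(signal_variance([-60.0, -62.5, -58.0, -65.0]))
```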
Based on the RSSI information retrieved from the RSSI table, the sampling count is estimated, which indicates the selected frequency. The channel detection probability is then computed based on the PUs in the coverage area, and if the false alarm count and the detection count of a particular channel are enabled, the channel fading coefficient is computed. At the same time, the RSSI table and the RSSI-based distance are periodically updated. The probability of false alarm and the probability of missed detection at an SU node are modeled: a false alarm indicates a false detection of PU presence, and the missed detection probability indicates a missed channel detection. Channel detection is considered successful when the following two conditions are satisfied:
a. the PU is absent on a license-free channel k and no false alarm is produced by any CH, and
b. the SNR at a receiving CH node does not fall below the predefined threshold value γ.
The optimal channel is selected when conditions (a) and (b) are satisfied, and the next step is to devise an optimal routing strategy for packet transmission.
3.2.3 Routing Algorithm
To determine the routing strategy, the path weights are computed for each identified path from the source to the BS. The algorithm looks for the path whose nodes can spend the minimum transmission power, based on the RSSI table entries. The φ value is computed such that it expedites minimum transmission power on the identified paths. Since the CRSN is dynamic, the connection between the intermediate nodes has to be checked frequently. Let a_i denote the connection established; then the path weights P_wt for the identified paths are computed using Eq. (3):
P_wt = Σ_{i=1..n} a_i √φ    (3)
Among the computed path weights, the algorithm selects the path that has the minimum path weight, to ensure the routing path spends minimum energy during communication. If such a path is found, the algorithm proceeds to compute the cost function. The cost function determines the selection of the shortest path among the identified paths, choosing the path with minimum cost by checking the connectivity among the nodes. The optimal routing strategy is then determined once the minimum-cost path is identified, as shown in Eq. (4):
Strategy = φ(n_i / n_{i+1})    (4)
The optimal strategy is determined among all the strategies for the identified paths from the source node to the BS, and packets are forwarded over the corresponding optimal routes. To ensure end-to-end connectivity at all times, the algorithm keeps checking for non-empty RSSI table entries. The optimal routing strategies are computed and updated periodically so that the routing algorithm consumes minimum spectrum resources and delivers enhanced performance. Figure 1 shows the complete flow of execution of the proposed work.
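A minimal sketch of the path-weight computation of Eq. (3) and the minimum-weight path selection, under the assumption that each hop carries a connection indicator a_i and a transmission-power factor ϕ derived from the RSSI table; the path names and values are hypothetical:

```python
import math

# Hypothetical candidate paths from source to BS: list of (a_i, phi) per hop,
# where a_i is 1 if the hop connection is currently established, else 0.
candidate_paths = {
    "P1": [(1, 0.8), (1, 0.6), (1, 0.9)],
    "P2": [(1, 0.5), (1, 0.7)],
    "P3": [(1, 0.9), (0, 0.4), (1, 0.8)],  # contains a broken hop
}

def path_weight(hops):
    # Eq. (3): Pwt = sum_i sqrt(a_i * phi)
    return sum(math.sqrt(a_i * phi) for a_i, phi in hops)

weights = {name: path_weight(hops) for name, hops in candidate_paths.items()}

# Discard paths with a broken hop (connectivity re-check), then pick the
# minimum-weight path, i.e. the one expected to spend the least energy.
usable = {n: w for n, w in weights.items() if all(a for a, _ in candidate_paths[n])}
best = min(usable, key=usable.get)
print(weights, "-> selected:", best)
```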
Fig. 1 Flow chart of the proposed method
4 Results and Discussions

The proposed method is implemented in the NS-2 network simulator, and experiments are conducted to compare the protocol performance with our previous work, Interference Aware Cluster Formation in CRSN (IACFC) [16]. The simulation parameters are shown in Table 1. To improve accuracy, the experiments were conducted over 5 different test runs and the results are derived by taking the mean of these values. The performance metrics are average energy consumption, Packet Delivery Ratio (PDR), dropped packets, and throughput.
4.1 Average Energy Consumption

The average energy consumption is computed by taking the mean of the energy consumed by all nodes in the network. IACFC performs channel allocation based on cluster formation, using a channel fairness index computed from throughput, jitter, and data rate. Since the CRSN is dynamic, these channel-defining parameters vary with time and location. The proposed method selects the channel based on received signal strength (RSS), and an optimal spectrum is allocated by considering minimum cost, which contributes to greater energy preservation. As shown in Fig. 2, the proposed method has lower energy expenditure than IACFC.
4.2 Packet Delivery Ratio (PDR)

The PDR in CRSN is influenced by channel availability and quality and is determined as the ratio of the total count of packets received to the total count of packets sent. In IACFC the dynamics of PU activity are not considered and static time slots are allotted, which affects packet delivery. In the proposed method, the selected RSSI values indicate the intensity of channel occupancy by PUs, and optimal routing strategies are computed based on minimum-cost paths.

Table 1 Simulation set-up

Parameter         Simulation value
Number of nodes   100–200
Packet size       64 bytes
Network size      100 m × 100 m
Antenna model     Omni-directional antenna
Traffic type      CBR
Mobility          Random waypoint
Fig. 2 Number of nodes versus average energy consumption
This significantly enhances the PDR; as shown in Fig. 3, the proposed method achieves a higher PDR than IACFC.
Fig. 3 Packet interval versus PDR
Fig. 4 Packet interval versus throughput
4.3 Throughput

Throughput is estimated as the number of bits received per unit time. Channel quality estimation and routing strategies affect throughput. In IACFC, static time slots are allocated and, due to the dynamic nature of the CRSN, result in lower throughput. In the proposed method, throughput is higher due to the allocation of optimal spectrum and an optimal routing strategy. As shown in Fig. 4, the throughput of the proposed method is higher than that of IACFC.
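For reference, the two reported metrics can be computed from simple counters; a minimal sketch with hypothetical per-run counts (the 64-byte packet size follows Table 1):

```python
# Hypothetical per-run counters collected from the simulation trace.
runs = [
    {"sent": 1000, "received": 905, "bits_received": 905 * 64 * 8, "sim_time": 100.0},
    {"sent": 1000, "received": 918, "bits_received": 918 * 64 * 8, "sim_time": 100.0},
]

pdr = [100.0 * r["received"] / r["sent"] for r in runs]          # Packet Delivery Ratio (%)
throughput = [r["bits_received"] / r["sim_time"] for r in runs]  # bits per second
print(f"mean PDR = {sum(pdr)/len(pdr):.2f} %, "
      f"mean throughput = {sum(throughput)/len(throughput):.1f} bit/s")
```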
4.4 Comparative Study of Performance Variations

The performance differences and a summary of the performance analysis of the benchmark protocols are presented in Table 2.
5 Conclusion

Routing in the cognitive environment requires accurate estimation of spectrum quality and spectrum availability due to the uncertain activities of licensed users. SU nodes get limited opportunities for communication over the license-free channel, which must be optimally utilized to improve communication performance. In this paper, a novel method for spectrum-aware routing is proposed which selects vacant spectrum portions based on received signal strengths and determines candidate routing paths having maximum spectrum availability.
Table 2 Performance differences among benchmark protocols

ICBCA
  Methodology: Implements channel selection based on data rate, network load, and buffer occupancy; implements static time slots for data transmission.
  Performance variations: In CRSN, PU activities are dynamic, so using static time slots degrades packet delivery; estimating PU presence on a channel plays a vital role but is not employed.

Proposed method
  Methodology: Implements channel selection based on received signal strength and estimates outage probabilities; employs an optimal routing strategy selecting the minimum-cost path for packet delivery.
  Performance variations: Outage probability estimation helps eliminate nodes with less spectrum availability; optimal routing strategy and minimum-cost path computation allow enhanced network performance with increased packet delivery and fewer dropped packets.
The optimal route is found by selecting the path with minimum path weight using the cost function, which contributes to the routing performance of secondary users. The simulation results reveal that the proposed work shows significant improvement in throughput and packet delivery with minimum energy expenditure. In future work, the presented approach will be extended with a detailed analysis over additional performance parameters.
References 1. Rawat, P., Singh, K. D., & Bonnin, J. M. (2016). Cognitive radio for M2M and Internet of Things: A survey. Computer Communications, 94, 1–29. 2. Salameh, H. A. B., & Krunz, M. (2009). Channel access protocols for multihop opportunistic networks: Challenges and recent developments. IEEE Network, 23(4), 14–19. 3. Wang, B., & Liu, K. J. R. (2011). Advances in cognitive radio networks: A survey. IEEE Journal of Selected Topics in Signal Processing, 5(1), 5–23. 4. Zareei, M., Mohamed, E. M., Anisi, M. H., Rosales, C. V., Tsukamoto, K., & Khan, M. K. (2016). On-demand hybrid routing for cognitive radio ad-hoc network. IEEE Access, 4, 8294–8302. 5. Zubair, S., Yusoff, S. K. S., & Fisal, N. (2016). Mobility-enhanced reliable geographical forwarding in cognitive radio sensor networks. Sensors, 16(2), 172. 6. Joshi, G. P., & Kim, S. W. (2016). A survey on node clustering in cognitive radio wireless sensor networks. Sensors, 16(9), 1465. 7. Syed, A. R., Yau, K.-L. A., Qadir, J., Mohamad, H., Ramli, N., & Keoh, S. L. (2016). Route selection for multi-hop cognitive radio networks using reinforcement learning: An experimental study. IEEE Access, 4, 6304–6324. 8. Joshi, G. P., Nam, S. Y., & Kim, S. W. (2013). Cognitive radio wireless sensor networks: Applications, challenges and research trends. Sensors, 13(9), 11196–11228.
9. Zhang, L., Cai, Z., Li, P., & Wang, X. (2016, August). Exploiting spectrum availability and quality in routing for multi-hop cognitive radio networks. In Proceedings of 11th International Conference on Wireless Algorithms, System, and Applications (WASA) (pp. 283–294). 10. Sengupta, S., & Subbalakshmi, K. P. (2013). Open research issues in multihop cognitive radio networks. IEEE Communications Magazine, 51(4), 168–176. 11. Chowdhury, K. R., & Felice, M. D. (2009). Search: A routing protocol for mobile cognitive radio ad-hoc networks. Computer Communications, 32(18), 1983–1997. 12. Jin, X., Zhang, R., Sun, J., & Zhang, Y. (2014). TIGHT: A geographic routing protocol for cognitive radio mobile ad hoc networks. IEEE Transactions on Wireless Communications, 13(8), 4670–4681. 13. Liu, Y., Cai, L. X., & Shen, X. S. (2012). Spectrum-aware opportunistic routing in multihop cognitive radio networks. IEEE Journal on Selected Areas in Communications, 30(10), 1958–1968. 14. Cai, Z., Duan, Y., & Bourgeois, A. G. (2015). Delay efficient opportunistic routing in asynchronous multi-channel cognitive radio networks. Journal of Combinatorial Optimization, 29(4), 815–835. 15. Ji, S., Yan, M., Beyah, R., & Cai, Z. (2016). Semi-structure routing and analytical frameworks for cognitive radio networks. IEEE Transactions on Mobile Computing, 15(4), 996–1008. 16. Agarkhed, J., & Gatate, V. (2020). Interference aware cluster formation in cognitive radio sensor networks. In International Conference on Communication, Computing and Electronics Systems (pp. 635–644). Singapore: Springer. 17. Tran-Dang, H., & Kim, D. S. (2020). Link-delay and spectrum-availability aware routing in cognitive sensor networks. IET Communications. 18. Bakr, R., El-Banna, A. A. A., El-Shaikh, S. A., & Eldien, A. S. T. (2020, October). energy efficient spectrum aware distributed cluster-based routing in cognitive radio sensor networks. In 2020 2nd Novel Intelligent and Leading Emerging Sciences Conference (NILES) (pp. 159– 164). IEEE. 19. Stephan, T., Al-Turjman, F., & Balusamy, B. (2020). Energy and spectrum aware unequal clustering with deep learning based primary user classification in cognitive radio sensor networks. International Journal of Machine Learning and Cybernetics, 1–34. 20. Guirguis, A., Digham, F., Seddik, K. G., Ibrahim, M., Harras, K. A., & Youssef, M. (2018). Primary user-aware optimal discovery routing for cognitive radio networks. IEEE Transactions on Mobile Computing, 18(1), 193–206.
A Review on Scope of Distributed Cloud Environment in Healthcare Automation Security and Its Feasibility Mirza Moiz Baig and Shrikant V. Sonekar
Abstract Remote patient monitoring using the internet and wearable medical devices has become a lifesaver and is gaining a lot of attention in the healthcare industry. These technologies also allow researchers to generate huge volumes of medical data for analytics and medical research. As the number of patients increases, problems related to stability arise. There are several reasons why remote connectivity of medical devices may fail, which could lead to failure of the extreme-emergency support system. Although this technology and the products introduced in healthcare industries are lifesavers and allow researchers to understand the major changes happening in a patient's body, their support and maintenance cost is directly proportional to the demand for the technology. Providing multiple connectivity lines and load-balancing servers could be solutions to the problem. This paper primarily deals with the study of the different issues that may occur in remote medical-device monitoring failure and the possible solutions to them. It also covers possible solutions for maintaining medical data over a distributed cloud environment. Keywords DFS · Cloud · Cache · Map-Reduce · Authentication · Electronic Health Records
1 Introduction

In the era of current information technology, the use of cloud or fog computing to share or distribute resources between multiple organizations or individuals has received great support. Cloud computing enables sharing of resources such as storage, servers, networks, or applications by offering user-friendly specifications and metered usage; primarily, there are public and private cloud systems.
M. M. Baig (B) · S. V. Sonekar JD College of Engineering and Management, Nagpur, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_42
The public cloud provides online services or solutions with different hosting models such as Software as a Service (SaaS), Infrastructure as a Service (IaaS), Platform as a Service (PaaS), or Data as a Service (DaaS). Widely used examples of the public cloud are Google App Engine, the Windows Azure Platform, and Amazon AWS. The private cloud, on the other hand, takes another approach because it is dedicated to an individual organization and its servers can be either on-premises or off-site. A private cloud can be more secure and may have better resource allocation than the public cloud, as all resources are dedicated and managed. Examples of private cloud providers are HP, Dell, Cisco, and IBM [1]. These features have increased interest and usage among users around the globe, because cloud services allow easy deployment and transparent on-site sharing of data or resources among multiple individuals or companies worldwide [2]. Moreover, cloud computing services fulfill the user's resource requirements, such as hardware and software, as users depend on the cloud service provider; the cloud also provides an easy user interface and is far less complicated on the user side [3]. The cloud is capable of providing multiple applications to many users who can access or share data with great reliability, flexibility, and accessibility [4]. For these reasons, cloud computing has seen rapid growth over the past several years, making it one of the most used data exchange methods over the internet [5, 6]. In particular, the cloud hosts many applications in various domains such as electronic health records (e-health), data storage services, business applications, etc. We consider e-health as the main scope of this research work. However, even though most cloud service providers are nearly reliable and provide security mechanisms such as Secure Sockets Layer (SSL) and Transport Layer Security (TLS) for data communication, these methods have yet to be fully confirmed, making it essential to secure data over the cloud through a ZE setup using either encrypted or unencrypted techniques. Electronic Health Records (EHR), or electronic medical records (EMR), are one of the major services provided by the cloud, and they can contribute to improved solutions for healthcare providers and recipients by storing patient records and sharing them on demand, such as prescriptions, patient medicines, medical history, allergy details, test results, etc. [7–9]. The cloud can also offer mobile healthcare applications such as PHR, through which patients can be monitored anywhere and anytime while living their normal daily lives, a significant improvement compared to standard data sharing systems [10]. Therefore, given all the advantages and facilities that cloud service providers and many other applications can offer, the cloud has seen very widespread use and growing data sharing over time, which puts the data stored or shared over intranet or internet networks at risk, as the owner of this data does not control it directly; data is accessed by a virtual system or machine, which increases the risk further [11, 12]. Moreover, issues such as data network security, integrity, confidentiality, access control, and availability have been significant concerns, especially when it comes to secure cloud access.
Data migration from the source system to the destination system must be accurate and secure in order to achieve a reliable cloud computing
environment. However, for our chosen topic of e-health and the sharing of electronic health records over the internet via cloud service providers, the risks should be reduced as much as possible [13]. Moving away from the classic (paper-based) healthcare system to the digital version of e-healthcare has resulted in a rapid rise of quickly accessible data that is challenging to deal with over time, and maintaining a large and rapidly growing amount of data is complicated [14]. Hence, the e-health system must meet all the security requirements appropriate to it and support correct data distribution, including availability (the system service must be ready for the maximum possible time), user authentication and data assets, internet and system reliability, confidentiality, and integrity [15]. To achieve stronger security, there are primarily two kinds of methods that can help protect and secure the e-health record system. The first is encryption methods, which include attribute-based rules, symmetric and asymmetric cryptography, and mixed methods that may combine more than one type. The second is non-encrypted methods consisting of access control and role-based policy enforcement methods that define each user's roles and the corresponding access rights [16]. The e-health record system must be available anywhere and anytime while still maintaining all the security aspects mentioned above, in order to function correctly and ensure that data is shared and transmitted securely across the system's networks every time [16]. The remainder of this review is arranged as follows: the second section discusses multiple cloud service provider environments; the third section discusses distributed cloud computing and data storage, divided into encrypted and non-encrypted methods; finally, Sect. 4 presents the anatomy of e-health security and mentions some open issues and research topics in the multi-cloud environment.
2 Multi-Cloud Environment

In a multi-cloud environment, user data is fragmented among various private/public clouds. An adversary cannot obtain the complete data set, which removes most of the threats occurring in a single-cloud environment. This data is managed by a distributed file system (DFS), used to share and access files from multiple hosts in a distributed environment with transparency. DFS has provided a significant breakthrough for cloud computing applications in the multi-cloud environment. It is based on the Map-Reduce paradigm [17]. A system using DFS serves to compute and store simultaneously: a file is distributed into a number of chunks allocated to remote nodes, which enables parallel execution of Map-Reduce operations [18]. There are other types of cloud computing, such as:
Public Cloud The public cloud, also referred to as the "external" cloud, describes the conventional meaning of cloud computing: dynamically provisioned, scalable, often virtualized resources available over the Internet from an off-site third-party provider, which divides up resources [19] and bills its customers on a "utility" basis.

Private Cloud The private cloud (also referred to as the "corporate" or "internal" cloud) is a term used to denote a proprietary computing architecture providing hosted services on private networks. Large companies generally use this type of cloud computing. It allows their corporate network and data center administrators to effectively become in-house "service providers" catering to "customers" within the corporation. However, it forgoes many cloud computing benefits, as organizations still need to buy, set up, and manage their own clouds [20].

Hybrid Cloud It has been suggested that combining resources from both internal and external providers will become the most popular choice for enterprises [21]. This may be because larger organizations are likely to have already invested heavily in the infrastructure required to provide resources in-house, or they may be concerned about the security of public clouds. This work concentrates on public clouds because these services demand the highest security requirements and also hold high potential for security improvements; the achievable security merits of using multiple distinct clouds simultaneously have been surveyed [21, 22]. Various distinct architectures are introduced and discussed according to their security and privacy capabilities and prospects.
3 Distributed File System

In a distributed file system (DFS), multiple clients share files provided by a common file system. In the DFS paradigm, communication between processes is carried out using these shared files. Although this is similar to the DSM and distributed object paradigms (in that communication is abstracted through shared resources), a notable difference is that the resources (files) in DFS are far longer lived. This makes it, for instance, much simpler to provide asynchronous and persistent communication using shared files than using DSM or distributed objects [23]. The fundamental model provided by distributed file systems is that of clients accessing files and directories provided by one or more file servers. A file server provides a client with a file service interface and a view of the file system. Note that the view given to different clients by the same server may differ, for instance, if clients only see the files that they are
authorized to access. Access to files is achieved by clients performing operations from the file service interface (for example, create, delete, read, write, and so on) on a file server. Depending on the server's implementation, the operations may be executed by the server on the open files or by the client on local copies of the files. DFS [24] has some known file naming and access-related issues:

1. Naming and Transparency

a. Naming—the mapping between logical and physical objects. A multi-level mapping abstraction of a file hides the details of how and where on the disk the file is stored. A transparent DFS hides the location in the network where the file is stored. For a file replicated at several sites, the mapping returns the set of locations of this file's replicas; the existence of both multiple copies and their locations is hidden.

b. Naming Structures—the naming structure must follow a formatting that does not allow users to guess the content, and it has the following possible properties. Location Transparency—the file name does not reveal the file's physical storage location. The file name still denotes a specific, although hidden, set of physical disk blocks. This is a convenient way to share data, but it can expose the correspondence between component units and machines. Location Independence—the file name need not be changed when the file's physical storage location changes. This gives better file abstraction, promotes sharing of the storage space itself, and separates the naming hierarchy from the storage-devices hierarchy.

c. Naming Schemes—files named by a combination of their host name and local name guarantee a unique system-wide name. Attaching remote directories to local directories gives the appearance of a coherent directory tree; only previously mounted remote directories can be accessed transparently. With total integration of the component file systems, a single global name structure spans all the files in the system; if a server is unavailable, some arbitrary set of directories on different machines also becomes unavailable.

2. Remote File Access

Traffic is reduced by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same data can be handled locally. If the required data are not cached, a copy of the data is brought from the server to the client, and accesses are performed on the cached copy. Files are identified with one master copy residing at the server machine, but copies of (parts of) the file are scattered in various caches. Cache consistency problem—keeping the cached copies consistent with the master file.
• Cache location—disk versus main memory.
• Advantages of disk caches: more reliable; cached data kept on disk are still there during recovery and need not be fetched again.
• Advantages of main-memory caches: workstations can be diskless; performance speeds up with larger memories; server caches (used to speed up disk I/O) are in main memory regardless of where client caches are located, so using main-memory caches on the client machine permits a single caching mechanism for servers and clients.
4 Challenges in Portable Connected Device Security

It is a common belief that the Internet of Things and cybersecurity are difficult to reconcile. Thousands of portable connected devices are looking for entry into domestic, healthcare, business, transport, industrial, and many other spaces of our daily lives, yet security is very low on device manufacturers' list of priorities. In the new and growing connected ecosystem, there are very few defined industry standards for the model, architecture, or security, and devices often employ proprietary data exchange protocols and custom-built operating systems [25]. Internet of Things security remains an absolute hazard, and glitches in connected devices, cyber-attacks, and viruses will only continue to rise along with the growth in the number of devices. Identified issues in connected device security are likely to be:

• Passwords: weak, guessable, or hardcoded
• Vulnerable: network services and ecosystem interfaces
• Update: lack of a secure update mechanism
• Outdated: use of insecure components
• Privacy: insufficient protection
• Insecure: data transfer and storage
• Lacking: device management, physical hardening
• Default: factory settings in production.
5 Connectivity Challenges

Assurance of data exchange is the most important aspect of real-time patient-monitoring healthcare devices, as failure to transfer emergency data within a minimum threshold time may lead to a critical situation [26, 27].
Fig. 1 Multiple levels of possible failure in communication
Considering wrong data and the failure of data communication, there are multiple possible levels of failure, starting from wrong contact of the device with the patient up to failure of the cloud service provider. Even though researchers may have solutions for each possible failure level, applying them depends upon deployment feasibility and the demand or return on the cost of implementation. Some levels may get an alternative or backup solution, but at other levels finding a backup or alternative solution in real time may not be feasible. Figure 1 illustrates the multiple levels of failure in medical-device data exchange from device to service.

Level 1. Patient negligence: considering the situation at ground level, the device may transfer different or wrong values in case of improper placement of the sensor or the device. This is a very rare and negligible issue since, after the right training or experience, the patient can use the device as per the standard operating procedure (SOP).

Level 2. Device malfunctioning: in rare cases a medical device fails; these products are built following medical-grade standards, but the situation cannot be ignored. Malfunctioning or failure of the device may lead to a loss of data or create a barrier to communication. Solutions to these problems belong to a different discussion area and cannot be covered here with a specific explanation.
Level 3. Local internet issues: if the local internet connectivity fails, communication will fail. The only solution to this problem is to keep backup lines of multiple modes, such as mobile data, wired internet, or Wi-Fi access. This solution also depends upon the connectivity methods supported by the specific healthcare device.

Level 4. ISP failure: again, this is a rare case in which the internet service provider fails to provide service due to a technical or logical error. It may have multiple causes, ranging from user account activation and billing issues to service upgrades. This issue affects a wide customer base using the same service provider. The solution to this problem is for users to keep multiple service provider options.

Level 5. Cloud service failure: this could be the major issue, if the cloud service provider fails to provide service. These problems may be due to hardware failure but, in most cases, they happen due to failure of the software as a service.
6 Proposed Solution

Managing the data on multiple servers by splitting and merging the data or files could be the best solution to avoid dependency on any single cloud service provider or vendor and to provide redundancy upon failure of any of the service providers. Distributed data has many advantages, but the solution must be cost-effective. The following methodology gives the plan for storing and accessing the data from multiple cloud servers. As shown in Fig. 2, a client first uploads a source file (F) along with a user-defined encryption key (k) generation function, which helps to encrypt the file and generate (Fk)e. Once generated, (Fk)e is split into the n parts (Fk)e[0], (Fk)e[1], and (Fk)e[2] specified by the user while writing the file to the application server. The application server may be a local server or a public cloud server. The local server (A) stores part (Fk)e[0], the first part of the complete file, at a defined storage space, which saves the cost of one public cloud otherwise required by the complete system. The system then connects to the configured public cloud servers (B) and (C) and uploads the parts (Fk)e[1] and (Fk)e[2], respectively, using cloud-specific file uploading classes or methods. This mechanism protects the file from a malicious server administrator or user. The complete process has two important functions: the first is encryption and the second is splitting. Encryption and splitting can be scheduled and rearranged according to the user's needs.
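A minimal sketch of the encrypt-then-split write phase, assuming the `cryptography` package's Fernet for encryption; the file name, directory names, and the local write stand-in for the cloud upload calls are illustrative assumptions, not the paper's implementation:

```python
import os
from cryptography.fernet import Fernet  # pip install cryptography

def encrypt_and_split(path: str, n_parts: int = 3):
    """Encrypt file F with key k and split the ciphertext (Fk)e into n parts."""
    key = Fernet.generate_key()                 # user-held encryption key k
    cipher = Fernet(key)
    with open(path, "rb") as fh:
        encrypted = cipher.encrypt(fh.read())   # (Fk)e
    size = -(-len(encrypted) // n_parts)        # ceiling division -> chunk size
    return key, [encrypted[i * size:(i + 1) * size] for i in range(n_parts)]

def upload(target_dir: str, name: str, data: bytes) -> None:
    """Stand-in for the server-specific upload call (local server A or clouds B/C)."""
    os.makedirs(target_dir, exist_ok=True)
    with open(os.path.join(target_dir, name), "wb") as fh:
        fh.write(data)

key, parts = encrypt_and_split("report.pdf")
# Part 0 stays on the local/application server; parts 1 and 2 would go to the
# configured public clouds through their own SDKs in a real deployment.
for target, part in zip(["server_A", "cloud_B", "cloud_C"], parts):
    upload(target, "report.pdf.part", part)
```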
Fig. 2 Upload/Write process
In the Download/Read file process, a client initializes the request for the desired file by selecting the file name from the file-listing table, so that the system can obtain the metadata about server connection details and connect to the servers to get all the parts of the file for merging. The system first gets from the user the name of the file to download and merge (F), the key associated with the file encryption, and a "one-time request key" received after the user submits the file read request to the system. Client authentication is done by evaluating the key and the one-time request key delivered to the registered mobile number provided during the registration phase of
the application. This provides a second level of authentication to the system. After successful authentication, the system allows the user to select the file to download (F) and the key (k). The system gets the server connection details for the specified file name from the metadata, connects to the servers, and downloads the chunks (Fk)e[0], (Fk)e[1], and (Fk)e[2] into a temporary directory in order to merge them into a single file. A blank file Fp is created to merge all parts: the first part is written into Fp and the remaining parts are appended to Fp. After merging all parts into a single file, the system decrypts the file with the provided key (k) to regenerate the original file, and finally makes the file available for download to the client machine. After a successful download, the system deletes all parts, the merged file, and the decrypted file from the temporary directory for security reasons.
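A matching sketch of the merge-and-decrypt read phase, continuing the assumptions of the write-phase sketch (Fernet, local directories standing in for the servers); the actual connection metadata, OTP check, and clean-up of the temporary directory are omitted:

```python
import os
from cryptography.fernet import Fernet  # pip install cryptography

def download(source_dir: str, name: str) -> bytes:
    """Stand-in for fetching one part from its server (locations come from metadata)."""
    with open(os.path.join(source_dir, name), "rb") as fh:
        return fh.read()

def merge_and_decrypt(key: bytes, locations, name: str, out_path: str) -> None:
    # Fetch every part in order, append into a single file Fp, then decrypt with k.
    merged = b"".join(download(loc, name) for loc in locations)
    plain = Fernet(key).decrypt(merged)
    with open(out_path, "wb") as fh:
        fh.write(plain)

# usage (key k is the one returned during the write phase):
# merge_and_decrypt(key, ["server_A", "cloud_B", "cloud_C"],
#                   "report.pdf.part", "report_restored.pdf")
```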
Figure 3 visualizes the complete process just described. The file requested by the user has three parts: Part A, B, and C, respectively. Part A is available on the local host, and the remaining parts B and C reside on the public cloud. All the parts are searched on the local as well as the public cloud available to the system, and a call to the merger and decryption module is made for the downloaded file in order to recover the original file.
7 Upload and Download Compare

• The results below show the time taken for a file to be uploaded to or downloaded from the global cloud.
• The graphs show the time taken by the cloud vendors considered during the trial implementation.
• Amazon Web Services, Azure, GoDaddy, and the Distributed Hidden Ownership Management model are compared over the chunks of information.
• To perform the experiment, we consider different file-size sample spaces ranging from 1 MB, 2 MB, 5 MB, and 10 MB to 50 MB.
Fig. 3 Download/Read process
• The network parameters are given high priority while performing the experiment, as only these files are uploaded or downloaded in the various test sets.
• A high-speed internet connection at 512 KB/s with high priority and no competing traffic is considered to obtain the results [28].
• The comparative analysis concludes that our model, i.e., the Distributed Hidden Ownership Management model, provides better results in comparison with the others over the parameters considered for the experiment (Figs. 4, 5, 6, and 7).
8 Conclusion

Managing the security and stability of medical or healthcare data has always been among the biggest challenges, as the quantum of data is huge and is generated at multiple locations and from multiple sources. Standardization or centralization of healthcare data may lead to different new threats. Handling these issues might differ from service provider to service provider, but the rules can be guided by the governing authority. This paper tries to give an understanding of possible failures of data exchange and the solutions to the problem in a broad way. The topic is not limited to the solutions given in this paper alone.
Tot Time = Conn. Time + HS Time + Actual Transfer Time + NT Load

Fig. 4 Upload and Download Compare—1 MB File [28]
Fig. 5 Upload and Download Compare—2 MB File [28]
Fig. 6 Upload and Download Compare—10 MB File [28]
Fig. 7 Upload and Download Compare—50 MB File [28]
References 1. Ewan, H., Hansdah, R. C. (2018). Julunga: a new large-scale distributed read-write file storage system for cloud computing environments. In 2018 IEEE 32nd International Conference on Advanced Information Networking and Applications (AINA). https://doi.org/10.1109/aina. 2018.00138.
2. Roy, S., Das, A. K., Chatterjee, S., Kumar, N., Chattopadhyay, S., Rodrigues, J. J. (2018). Provably secure fine-grained data access control over multiple cloud servers in mobile cloud computing based healthcare applications. IEEE Transactions on Industrial Informatics, 1–1. https://doi.org/10.1109/tii.2018.2824815. 3. Ranchal, R., Bastide, P., Wang, X., Gkoulalas-Divanis, A., Mehra, M., Bakthavachalam, S., Mohindra, A. (2020). Disrupting healthcare silos: Addressing data volume, velocity and variety with a cloud-native healthcare data ingestion service. IEEE Journal of Biomedical and Health Informatics, 1–1. 4. Vinodhini, A. N., Ayyasamy, S. (2017). Prevention of personal data in cloud computing using biometric. In 2017 International Conference on Innovations in Green Energy and Healthcare Technologies (IGEHT). https://doi.org/10.1109/igeht.2017.8094085. 5. Wang, X., Jin, Z. (2019). An overview of mobile cloud computing for pervasive healthcare. IEEE Access, 7, 66774–66791. https://doi.org/10.1109/access.2019.2917701. 6. Aburukba, R., Sagahyroon, A., Aloul, F., Thodika, N. (2018). Brokering services for integrating health cloud platforms for remote patient monitoring. In 2018 IEEE 20th International Conference on e-Health Networking, Applications and Services (Health-com). https://doi.org/ 10.1109/healthcom.2018.8531151. 7. Abrar, H., Hussain, S. J., Chaudhry, J., Saleem, K., Orgun, M. A., Al-Muhtadi, J., et al. (2018). Risk analysis of cloud sourcing in healthcare and public health industry. IEEE Access, 6, 19140–19150. https://doi.org/10.1109/access.2018.2805919. 8. Yang, L., Zheng, Q., Fan, X. (2017). RSPP: A reliable, searchable and privacy-preserving ehealthcare system for cloud-assisted body area networks. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications. https://doi.org/10.1109/infocom.2017.8056954. 9. Islam, M. M., Razzaque, M. A., Hassan, M. M., Ismail, W. N., Song, B. (2017). Mobile cloudbased big healthcare data processing in smart cities. IEEE Access, 5, 11887–11899. https://doi. org/10.1109/access.2017.2707439. 10. Shen, M., Duan, J., Zhu, L., Zhang, J., Du, X., Guizani, M. (2020). Blockchain-based Incentives for secure and collaborative data sharing in multiple clouds. IEEE Journal on Selected Areas in Communications, 1–1. 11. Zhang, H., Yu, J., Tian, C., Zhao, P., Xu, G., Lin, J. (2018). Cloud storage for electronic health records based on secret sharing with verifiable reconstruction outsourcing. IEEE Access, 6, 40713–40722. https://doi.org/10.1109/access.2018.2857205. 12. Singh, P., Rizvi, M. A. (2018). Virtual machine selection strategy based on grey wolf optimizer in cloud environment: a study. In 2018 8th International Conference on Communication Systems and Network Technologies (CSNT). 13. Hao, M., Li, H., Xu, G., Liu, Z., Chen, Z. (2020). Privacy-aware and resource-saving collaborative learning for healthcare in cloud computing. In ICC 2020 – 2020 IEEE International Conference on Communications (ICC). https://doi.org/10.1109/icc40277.2020.9148979. 14. Aburukba, R., Sagahyroon, A., Elnawawy, M. (2017). Remote patient health monitoring cloud brokering services. In 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom). https://doi.org/10.1109/healthcom.2017.8210798. 15. Mukhopadhyay, A., Suraj, M., Sreekumar, S., Xavier, B. (2018). Emergency healthcare enhancement by multi-iterative filtering of service delivery centers. 
In 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI). https:// doi.org/10.1109/icacci.2018.8554402. 16. Rajasingham, S., Premarathne, U. S. (2018). Efficient agent based trust threshold model for healthcare cloud applications. In 2018 IEEE International Conference on Information and Automation for Sustainability (ICIAFS). https://doi.org/10.1109/iciafs.2018.8913357. 17. Kamoona, M. A., Altamimi, A. M. (2018). Cloud E-health systems: A survey on security challenges and solutions. In 2018 8th International Conference on Computer Science and Information Technology (CSIT). https://doi.org/10.1109/csit.2018.8486167. 18. Dhinakaran, K., Nivetha, M., Duraimurugan, N., Wise, D. C. J. W. (2020). Cloud based smart healthcare management system using blue eyes technology. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC). https://doi.org/10.1109/ice sc48915.2020.9155878.
19. Jose Reena, K., Parameswari, R. (2019, February 14th–16th). A smart health care monitor system in IoT based human activities of daily living: A review. In 2017 IEEE International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (Com-IT-Con), India. 20. McClenaghan, K., Moholth, O.C. (2019). Computational model for wearable hardware commodities. In IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, DO. https://doi.org/10.1109/dasc/picom/cbd com/cyberscitech.2019.00055. 21. Chen, H. (2019). Ubi-care: A decentralized ubiquitous sensing healthcare system for the elderly living support. In IEEE International Conference on Dependable, Autonomic and Secure Computing, International Conference on Pervasive Intelligence and Computing, International Conference on Cloud and Big Data Computing, International Conference on Cyber Science and Technology Congress. https://doi.org/10.1109/DASC/PiCom/CBDCom/CyberS ciTech.2019.00108. 22. Ganesan, M., Sivakumar, N. (2019). IoT based heart disease prediction and diagnosis model for healthcare using machine learning models. In IEEE International Conference on Systems Computation Automation and Networking. 978-1-7281-1524-5. 23. Singh, A., Chandra, U., Kumar, S., Chatterjee, K. (2019). A secure access control model for e-health cloud. https://doi.org/10.1109/TENCON.2019.8929433. 24. Sudheep, K., Joseph, S. (2019). Review on securing medical big data in healthcare cloud. In 2019 5th International Conference on Advanced Computing Communication Systems (ICACCS). 25. Deshmukh, N. M., Kumar, S. (2019). Secure fine-grained data access control over multiple cloud server based healthcare applications. IEEE. 26. Alnefaie, S., Cherif, A., Alshehri, S. (2019). Towards a distributed access control model for IoT in healthcare. In 2019 2nd International Conference on Computer Applications Information Security (ICCAIS). 27. Macis, S., Loi, D., Ulgheri, A. (2019). Design and usability assessment of a multi-device SOAbased telecare framework for the elderly. IEEE Journal of Biomedical and Health Informatics. 28. Gupta, R., NirmalDagdee, P. (2019). HD-MAABE: Hierarchical distributed multi-authority attribute based encryption for enabling open access to shared organizational data. In International Conference on Intelligent Computing and Smart Communication. https://doi.org/10. 1007/978-981-15-0633-8_18.
Smart Traffic Monitoring and Alert System Using VANET and Deep Learning Manik Taneja and Neeraj Garg
Abstract Road accidents have been a major contributing factor in the loss of approximately 1.35 million lives every year. Between 20 and 50 million people suffer non-fatal injuries, and many incur disabilities. As stated by Ghori et al., accidents are a major concern since road injuries cause economic losses to individuals, their families, and to the nation as a whole [1]. As per the WHO [1], the risk factors include speeding, distracted driving, and inadequate post-crash care. Therefore, there is a requirement for a system that detects anomalous driving (speeding and distracted driving) and accidents, reports them to the local authorities for faster post-crash care, and implements inter-vehicle communication so that vehicles in the proximity of the incident can be notified and can remain at a safe distance from the vehicles involved. Keywords VANET · Traffic monitoring · Alert system
1 Introduction

The progress made by governments towards reducing the number of road traffic deaths is on an upward trend, yet global expectations seem distant. There is a pressing need to bring about reforms that ramp up road safety efforts in order to stand by the commitments made during the "Sustainable Development Agenda 2030" [2]. The trend in the number of deaths over recent years is depicted in Fig. 1 [2]; although the rate of deaths per hundred thousand people seems to be decreasing, the total number of deaths has reached an all-time high, an apparent contradiction due to the population surge. To accelerate the aforementioned progress, countries require a state-of-the-art system, which can be visualized through the applications of an ad hoc network.

M. Taneja (B) · N. Garg Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, GGSIP University, New Delhi 110086, India N. Garg e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_43
Fig. 1 Road traffic deaths per 100,000 population: 2000–2016 (courtesy—WHO) [2]
An ad hoc network is a type of decentralized wireless system that does not depend on the currently available infrastructure and is designed around the problem statement. Due diligence needs to be given to the constraints when including an ad hoc network in the workflow. Although there is a gap between the theoretical concepts and the practical limits, the advantages offered by the network compensate for it. The network requires minimal configuration and can be quickly deployed, making it suitable for application during disaster management or a state of conflict. One of the flaws of the ad hoc network is radio spectrum pollution, a phenomenon in which waves in the radio and electromagnetic spectrum stray away from their dedicated allocations.
1.1 Evolution of VANET Technology

VANET—Vehicular Ad Hoc Network—is a unique type of network that derives its characteristics from MANET, a mobile ad hoc network of nodes used to establish one-to-one or one-to-many connections. As shown in Fig. 2, WANET is the parent field of all ad hoc networks. VANET is an extension of MANET that is independent of any infrastructural dependencies [3].
Fig. 2 Classification of Ad-Hoc network (courtesy Ghori et al. [6])
A vehicular ad hoc network is a network technology that brings wireless networking features into vehicles and enables vehicle-to-vehicle data communication as well as vehicle-to-infrastructure communication. These aspects enable VANET to provide both efficient and effective communication. Smart vehicles can collect data from their sensors and expand their perception range [4]. Issues such as throughput delay and significant packet loss make the technology vulnerable to skepticism. Vehicular communication in VANETs is done by exchanging information using two modes of communication: Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) communication [5]. VANET communication, as depicted in Fig. 3, establishes two-way communication in the form of V2V (vehicle to vehicle) and V2I (vehicle to access point, also referred to as the Road Side Unit, RSU). V2V communication further involves one-hop and multi-hop communication. Using the VANET, vehicles can communicate with other vehicles through On-Board Units (OBUs) that allow wireless communication in a distributed manner, while they communicate with RSUs installed as part of the main infrastructure in infrastructure mode [5]. Research on VANET has increased recently [8], as has support for vehicular safety applications [9]. Yousefi et al. characterized VANET by describing its rapid changes in topology, fragmentation rate, network diameter, limited power constraints, scalability and network density, and the impact of driver behavior on the network. Raya et al. [10] characterized the network by its quasi-permanent mobility, high speed and short connection times, node distribution, and vehicle computational and power resources.
Fig. 3 Communication in VANET (courtesy—Rehman et al. [7])
VANET is also characterized by its high mobility and rapidly changing topology, geographical positioning, mobility modeling and prediction, hard delay constraints, and absence of power constraints [8, 11]. The availability of such a technology is a boon for the public if applied carefully and consciously.
1.2 Current System

The OBUs transmit monitoring reports for specific events to their neighboring OBUs as well as to the infrastructure installed on the side of the road. The monitoring reports transmitted from the cars contain the position of the monitored event, and the generated traffic report contains the following fields: vehicle identity, current time, position, direction, speed of the vehicle, and the traffic event, which contains two main fields: the traffic type and the traffic information [12].
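A minimal sketch of how such a monitoring report could be structured; the field names follow the list above, while the JSON encoding, example values, and plate format are illustrative assumptions:

```python
from dataclasses import dataclass, asdict
import json, time

@dataclass
class TrafficReport:
    """Monitoring report an OBU broadcasts to neighbouring OBUs and the RSU."""
    vehicle_id: str
    timestamp: float
    position: tuple       # (latitude, longitude) of the monitored event
    direction: float      # heading in degrees
    speed_kmh: float
    traffic_type: str     # e.g. "accident", "congestion"
    traffic_info: str     # free-text details of the event

report = TrafficReport("DL-01-AB-1234", time.time(), (28.61, 77.20), 92.0, 64.5,
                       "accident", "two-vehicle collision, left lane blocked")
payload = json.dumps(asdict(report))  # what would be handed to the DSRC stack
print(payload)
```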
1.3 Deep Learning

Deep learning uses layered generic algorithms that reveal interesting patterns in a data set without having to write custom, high-level, language-specific code. The data is fed to the generic algorithm, different layers of a neural network are created, and the algorithm is responsible for building its own logic, forming a relationship between the categorical and non-categorical data. Deep learning gives computers or hardware the capability to learn without being explicitly programmed and without having problem-specific algorithms fed to them. It is an artificial intelligence technique that imitates the working of the neurons present in the human brain by applying logic building and pattern generation while processing data for use in decision making.
2 Proposed System

The proposed system includes a workflow that will be instrumental in helping scholars and other simulation professionals to get an edge over the existing methodologies. The modules in the workflow are designed using six tools:

• Terminator: Red
• Predefined process: Blue
• Process: Green
• Decision: Mauve
• Database: Orange
• Display: Yellow
The proposal is based on a combination of two state-of-the-art technologies: VANET and deep learning. The VANET module of the model is responsible for wireless distributed communication, and the deep learning model is responsible for detection, providing the system with the data required to generate information about the vehicles in the field of view. The alpha-numeric data generated can be used to keep a check on a vehicle's accident and/or anomalous-driving records. The system is a theoretical approach; therefore, it includes certain assumptions and prerequisites required for the practicality of the proposal. It is important to make use of the infrastructure and current-age technologies already installed alongside the roads so that minimal investment is needed to develop a practical application. As per Fig. 4, the workflow uses three predefined processes: average speed detection using journey-time analysis, automatic license plate recognition, and V2I communication for alerting vehicles to slow down. These processes are applied to every vehicle that enters the field of view. The billboards are also connected to the RSUs to alert approaching vehicles by displaying a hazard sign.
2.1 Assumptions

• Every vehicle has an OBU installed to enable communication with other vehicles through a Dedicated Short-Range Communication (DSRC) protocol.
• There is enough area for setting up RSUs and other required infrastructure.
• Previously installed high-definition cameras, night-vision detectors, and a LiDAR system are available to detect vehicles and number plates along with their speed.
• A robust database is available to store vehicle class, license plate, speed, and driving characteristics.
• A strong communication channel exists for V2I, V2V, and the concerned authorities in the vicinity.
2.2 Pre-requisites

• The infrastructure requirements for the system include setting up an access point in the vicinity of the area under consideration so that communication between vehicles and the access point can be managed easily, thereby enabling one-hop and multi-hop communication between vehicles depending on the scenario.
• The other requirement is to set up two high-definition cameras directed at the vehicles at an angle of less than 45° for a point-to-point system, with in-device model deployment for ALPR.
Fig. 4 Proposed workflow
2.3 Vehicle Detection

Detecting moving objects in the video stream is essential for understanding behavior and tracking objects. According to S. A. Taie et al., to detect moving objects in a video stream accurately, sequences of frames are segmented by using adaptive Gaussian
mixture models (GMM) [12]. In the proposed system, the vehicles in the field of view are detected through object detection techniques using deep learning models based on a recurrent neural network and visual image input from the cameras. Vehicle detection is an important part of the system, since different emergency vehicles need to be deployed for different vehicles. The system differentiates the approaching vehicles into categories. The proposed classes of vehicles are:

• Class 1—Light vehicles
• Class 2—Medium-heavy vehicles
• Class 3—Large heavy vehicles
• Class 4—Extra-large heavy vehicles
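One way the detection step could be realized is with a pretrained detector; a minimal sketch using torchvision's Faster R-CNN, where the frame file name, confidence threshold, and the mapping from COCO categories to the four proposed vehicle classes are assumptions rather than the paper's trained model:

```python
import torch, torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained COCO detector as a stand-in for the paper's detection model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Assumed mapping from COCO category ids to the proposed vehicle classes.
COCO_TO_CLASS = {3: "Class 1 (light)", 4: "Class 1 (light)",      # car, motorcycle
                 6: "Class 2/3 (heavy)", 8: "Class 2/3 (heavy)"}  # bus, truck

image = Image.open("frame.jpg").convert("RGB")  # one frame from the roadside camera
with torch.no_grad():
    detections = model([to_tensor(image)])[0]

for label, score, box in zip(detections["labels"], detections["scores"], detections["boxes"]):
    if score > 0.6 and int(label) in COCO_TO_CLASS:
        print(COCO_TO_CLASS[int(label)], f"score={score:.2f}", box.tolist())
```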
2.4 Average Speed The system deals with anomalous driving and considers accident as a subset of the aforementioned, therefore there is a need to establish the parameters for recognizing anomalous driving on the road. The model needs to first calculate the average speed in a point-to-point system, as is depicted in Fig. 5, over a sustained distance in its field of view. In the diagram to explain the point-to-point system, we can sees point a and b are connected even we are showing put in a system. The approach is to use journey time prediction. Journey time prediction using sources of real-time measurement data has the potential to assist travelers by providing more accurate estimates of journey times. Improving the accuracy of the prediction by suitable methods helps to reduce the overall uncertainty of journey times [13]. It calculates the journey time associated with each vehicle class for crossing the two camera milestones and then using the distance parameter estimates the effective average speed. Since the average speed may vary at different times of the day due to various unforeseen circumstances, therefore, the system will be programmed to check for Fig. 5 Point to Point system [14]
Table 1 Reflectors available on vehicles (courtesy—Datta Sainath Dwarampudi and Kakumanu [15])

S. No.  Category of reflector   Vehicular parts
1.      Primary reflectors      Front and rear license plates
2.      Secondary reflectors    Headlight, turn signal indicators and tail lights, bumper guard
3.      Tertiary reflectors     Windshields and vehicle external body
2.5 Speed Detection

Once the average speed has been estimated and the vehicle class identified, the LiDAR sensors in the cameras are put to use. A LiDAR speed gun is a device that can determine the velocity of a target by emitting a laser and processing the received signal using a microprocessor. The device is capable of producing reliable range and speed measurements in typical urban and suburban traffic. The acronym LiDAR stands for Light Detection and Ranging [15]. LiDAR uses the different categories of reflectors present on the vehicle, as depicted in Table 1. If the speed of any vehicle detected by the LiDAR is below or above the average speed or the maximum speed limit of the identified vehicle class, as calculated by the authorities, then the system proceeds to identify the number plate.
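A minimal sketch of the journey-time based average-speed calculation and the anomaly check; the segment length, class speed limits, and tolerance are assumed values, not figures from the paper:

```python
# Point-to-point average speed from journey times between camera A and camera B,
# followed by the anomaly check against the class speed limit (values assumed).
SEGMENT_LENGTH_KM = 2.0
CLASS_SPEED_LIMIT_KMH = {"Class 1": 80, "Class 2": 60, "Class 3": 50, "Class 4": 40}

def average_speed(entry_time_s: float, exit_time_s: float) -> float:
    hours = (exit_time_s - entry_time_s) / 3600.0
    return SEGMENT_LENGTH_KM / hours

def is_anomalous(vehicle_class: str, speed_kmh: float, segment_avg_kmh: float,
                 tolerance_kmh: float = 15.0) -> bool:
    # Flag when the measured speed exceeds the class limit or deviates strongly
    # from the rolling 1-minute average computed for the segment.
    return (speed_kmh > CLASS_SPEED_LIMIT_KMH[vehicle_class]
            or abs(speed_kmh - segment_avg_kmh) > tolerance_kmh)

v = average_speed(0.0, 95.0)  # the vehicle crossed the 2 km segment in 95 s
print(f"avg speed = {v:.1f} km/h, anomalous = {is_anomalous('Class 1', v, 62.0)}")
```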
2.6 Automatic License Plate Recognition

The automatic license plate recognition (ALPR) system is used to identify the number plate of the vehicle, as depicted in Fig. 6. The system is put to use in case of an anomalous driving situation. To approach the problem pragmatically, the system also takes care of the worst-case scenario in which the number plate of the vehicle is in a blind spot or not readable due to an accident. The ALPR system requires collection of a data set of number plates used in the particular city and the alpha-numeric characters used on them. This data is cleaned to provide the best possible training set for the model. Training on the data set includes character detection and recognition from the image of the number plate. Since character recognition is not feasible on the entire number plate at once, character segmentation is introduced, after which the characters are easily recognized and the alpha-numeric string on the number plate is generated.
Fig. 6 ALPR using deep learning [16]
Based on the plate number generated, the database consisting of all the valid number plates and their owners is sifted to find the matching plate number.
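A minimal sketch of the segmentation-then-recognition flow described above is given below, using OpenCV for contour-based character segmentation. The recognizer is left as a placeholder (`recognize_char`), since the paper does not specify the network; everything here is illustrative rather than the authors' implementation.

```python
import cv2

def segment_characters(plate_bgr):
    """Split a cropped licence-plate image into per-character crops (left to right)."""
    gray = cv2.cvtColor(plate_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    # Keep boxes that are plausibly characters (filter noise and the plate border).
    h_img = plate_bgr.shape[0]
    boxes = [b for b in boxes if 0.3 * h_img < b[3] < 0.95 * h_img]
    boxes.sort(key=lambda b: b[0])              # left-to-right reading order
    return [binary[y:y + h, x:x + w] for (x, y, w, h) in boxes]

def read_plate(plate_bgr, recognize_char):
    """recognize_char is any classifier mapping a character crop to 'A'-'Z' / '0'-'9'."""
    return "".join(recognize_char(c) for c in segment_characters(plate_bgr))
```

The recognized string would then be matched against the database of valid plate numbers mentioned above.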
2.7 Anomalous Driving A vehicle is considered to be driving anomalously if the magnitude of its speed deviates from the average speed obtained from the journey-time analysis or from the maximum speed defined by the road authorities.
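The rule above can be expressed as a simple check; the tolerance parameter is an assumption introduced here for illustration, since the paper does not state how large a deviation from the average counts as anomalous.

```python
def is_anomalous(speed_kmh, class_avg_kmh, class_max_kmh, tolerance=0.2):
    """Flag a vehicle whose speed deviates from the class average by more than
    `tolerance` (as a fraction) or exceeds the maximum limit set by the authorities."""
    deviates_from_avg = abs(speed_kmh - class_avg_kmh) > tolerance * class_avg_kmh
    exceeds_limit = speed_kmh > class_max_kmh
    return deviates_from_avg or exceeds_limit

print(is_anomalous(95, class_avg_kmh=60, class_max_kmh=80))   # True
```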
2.8 Vehicle Involved in an Accident The LiDAR sensors use three different categories of reflectors for speed detection. To declare a vehicle involved in an accident, we have the following different scenarios: • If the LiDAR is unable to use two out of three reflector techniques and the vehicle detected does not change its position in multiple one-minute intervals then that vehicle will be declared to be involved in an accident.
• If the LiDAR is unable to use two out of the three reflector categories but the vehicle changes its position across multiple one-minute intervals, then that vehicle is red-flagged and notified using V2I communication; a sketch combining both scenarios is given below.
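The two scenarios above reduce to a small decision rule over the number of usable reflector categories and the vehicle's movement across one-minute intervals. The threshold of two unusable categories follows the text; the rest of the interface is hypothetical.

```python
def classify_vehicle(usable_reflector_categories, positions, still_threshold_m=1.0):
    """Decide whether a vehicle is in an accident, should be red-flagged, or is normal.

    usable_reflector_categories: how many of the three reflector categories
        (primary, secondary, tertiary) the LiDAR could still use (0-3).
    positions: list of (x, y) positions sampled at one-minute intervals.
    """
    reflectors_degraded = usable_reflector_categories <= 1   # two of three unusable
    moved = any(
        abs(x2 - x1) + abs(y2 - y1) > still_threshold_m
        for (x1, y1), (x2, y2) in zip(positions, positions[1:])
    )
    if reflectors_degraded and not moved:
        return "accident"          # alert emergency services via V2I
    if reflectors_degraded and moved:
        return "red-flag"          # notify the vehicle via V2I
    return "normal"

print(classify_vehicle(1, [(0, 0), (0, 0), (0.2, 0.1)]))     # 'accident'
```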
2.9 V2I If the detected vehicle is driving anomalously but is not involved in an accident, the RSU issues an alert message to that vehicle through V2I communication. Otherwise, the vehicle is red-flagged as being involved in an accident through V2I communication.
2.10 V2V The RSU yellow-flags the vehicle through V2I communication; the vehicles in proximity of the vehicle involved are alerted with a V2V communication, the location of the accident is displayed on their infotainment screens, and they are requested to lower their speed, turn on hazard lights, and give way to emergency services.
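The alert disseminated over V2I and V2V can be thought of as a small structured message; the field names and JSON encoding below are assumptions made for illustration, not a VANET standard payload.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class TrafficAlert:
    event: str            # "accident" or "anomalous_driving"
    flag: str             # "red" for the involved vehicle, "yellow" for the area
    latitude: float
    longitude: float
    advice: str           # shown on infotainment screens of nearby vehicles

def broadcast_v2v(alert: TrafficAlert, send):
    """`send` is whatever transport the on-board unit exposes (e.g. a DSRC/UDP stub)."""
    send(json.dumps(asdict(alert)).encode("utf-8"))

alert = TrafficAlert("accident", "yellow", 28.6139, 77.2090,
                     "Reduce speed, turn on hazard lights, give way to emergency services")
broadcast_v2v(alert, send=lambda payload: print(payload[:60], b"..."))
```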
2.11 Emergency Vehicle As part of the VANET infrastructure, the vehicle previously identified as anomalous by the system communicates with the access point or the RSU using V2I communication; the unit sends an alert to the patrolling vehicles and the nearby hospital with the details of the location of the accident, providing sufficient time to prepare for an emergency response.
3 Conclusion In this paper, a novel VANET- and deep-learning-based smart road traffic monitoring and alert system has been presented, as represented in Fig. 4. The smart traffic system calculates the average speed at different time intervals of the day and detects the speed of the vehicles in its field of view. When anomalous driving or an accident is detected through vehicle detection and automatic license plate recognition, the driver is alerted through V2I communication, and the vehicles in proximity are alerted through V2V and V2I communication. The area is proposed to be marked yellow and the vehicle
involved in an accident or anomalous driving scenario is red-flagged; the alert is given out on the dashboards of smart vehicles and on billboards to reduce speed, and the concerned authorities are notified. This system is based on a simple VANET architecture. The proposed model will be further developed to extend its application to a wide-area traffic control system, in which all the OBUs are interlinked via a fixed central network that distributes traffic information over a large area to all OBUs, resulting in a better traffic control mechanism. The wide-area system will also allow vehicles to inform the OBUs about their final destination; this information can then be used to calculate the load on different roads and possibly load-balance traffic across them to reduce congestion [17]. As part of future work, the research will focus on more efficient communication and a wider field of view with advanced recognition systems based on more complex deep learning models. The paper therefore proposes a model that can be efficiently used by other researchers and scholars to simulate the traffic monitoring system, and its modules and workflow are expected to give an edge over conventional simulations.
References 1. https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries. 2. Global status report on road safety 2018. Geneva: World Health Organization; 2018. Licence: CC BYNC-SA 3.0 IGO. https://www.who.int/violence_injury_prevention/road_safety_status/ 2018/en/. 3. Ghori, M. R., Zamli, K. Z., Quosthoni, N., Hisyam, M., & Montaser, M. (2018). Vehicular ad-hoc network (VANET): Review. In 2018 IEEE International Conference on Innovative Research and Development (ICIRD), Bangkok (pp. 1–6). https://doi.org/10.1109/icird.2018. 8376311. 4. Wang, Y., Menkovski, V., Ho, I. W., & Pechenizkiy, M. (2019). VANET meets deep learning: The effect of packet loss on the object detection performance. In 2019 IEEE 89th Vehicular Technology Conference (VTC2019-Spring), Kuala Lumpur, Malaysia (pp. 1– 5). https://doi. org/10.1109/VTCSpring.2019.8746657. 5. Shrestha, R., Bajracharya, R., & Nam, S.-Y. (2018). Challenges of future VANET and cloudbased ID 5603518, 15pp. https://doi.org/10.1155/2018/5603518. 6. Ghori, M. R., K. Z. Zamli, N. Quosthoni, M. Hisyam and M. Montaser. “Vehicular ad-hoc network (VANET): Review.” 2018 IEEE International Conference on Innovative Research and Development (ICIRD) (2018): 1–6. 7. Rehman, S., Khan, M. A., Zia, T., & Zheng, L. (2013). Vehicular ad-Hoc networks (VANETs)— An overview and challenges. Journal of Wireless Networking and Communications, 3, 29–38. https://doi.org/10.5923/j.jwnc.20130303.02. 8. Liu, Y., Bi, J., & Yang, J. (2009). Research on vehicular ad hoc networks. In IEEE Control and Decision Conference. CCDC’09. Chinese (pp. 4430–4435). 9. Yousefi, S., Altmaiv, E., El-Azouzi, R., & Fathy, M. (2007). Connectivity in vehicular ad hoc networks in presence wireless mobile base-stations. In 7th International Conference on ITS Telecommunications IEEE. ITST’07 (pp. 1–6). 10. Raya, M., & Hubaux, J. P. (2007). Securing vehicular ad hoc networks. Journal of Computer Security, 15(1), 39–68. 11. Offor, P. (2012). Vehicle Ad Hoc Network (VANET): Safety benefits and security challenges. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2206077.
12. Taie, S. A., & Taha, S. (2017). A novel secured traffic monitoring system for VANET. In 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), Kona, HI (pp. 176–182). https://doi.org/10.1109/percomw.2017. 7917553. 13. Gibbens, R. J., & Saacti. Y. (2006). Road traffic analysis using MIDAS data: journey time prediction. 14. https://www.aiactive.com/en/pages/traffic-solutions/speed-enforcement/average-speed.html? hcb=1. 15. Dwarampudi, D. S., & Kakumanu, V. S. V. (2013). Efficiency of A LIDAR speed gun. International Journal of Electrical, Electronics and Data Communication, 1(9). ISSN: 2320-2084. 16. http://tadviser.com/index.php/Article:Video_analytics_%28terms,_scopes_of_application,_ technologies%29. 17. Nafi, N., & Khan, J. Y. (2012). A VANET based Intelligent Road Traffic Signalling System. In Australasian Telecommunication Networks and Applications Conference, ATNAC, 2012 (pp. 1–6). https://doi.org/10.1109/ATNAC.2012.6398066.
Enhancement of Lifetime of Wireless Sensor Network Based on Energy-Efficient Circular LEACH Algorithm Jainendra Singh and Zaheeruddin
Abstract The wireless sensor network (WSN) is an ad hoc network consisting of a large number of sensor nodes that are densely deployed inside the network. The main task of these sensor nodes is to collect useful information about different environmental conditions. Instead of dispatching raw information to the nodes responsible for fusion, the sensor nodes use their computing capabilities to perform simple local calculations and transmit only the required, partially processed data. They operate on a limited power supply, and the energy storage device is the major energy resource for these sensor nodes; thus, the lifetime of a WSN relies heavily on their power usage. To address this problem, this paper applies the energy-efficient clustering routing protocol LEACH (Low-Energy Adaptive Clustering Hierarchy) to two dense sensor networks, one deployed in a rectangular and one in a circular environment. The network lifetime and energy consumption are computed for both networks, and it is found that circular LEACH outperforms the rectangular LEACH network. Keywords WSN · LEACH · Network lifetime
1 Introduction The popularity of sensor technologies is increasing rapidly due to the ease of deployment, low-cost, and a variety of applications provided by them. The application scenario of these devices is very vast including monitoring, civil and military applications [1]. The WSN nodules deployed as sensors contain the capability to capture J. Singh (B) · Zaheeruddin Department of Electrical Engineering, Jamia Millia Islamia, New Delhi, India e-mail: [email protected] Zaheeruddin e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_44
, process, and send information to a server or an Internet-connected cloud. However, these sensors suffer from some limitations, such as poor energy-storage capability, poor computing ability and, most importantly, limited energy. In addition, resources, particularly the power supply of a sensor node, may not be replaced or recharged because of the hazardous working environment. Therefore, energy-efficient sensor networks are needed to increase the overall lifetime of a WSN. To maximize the lifetime of a WSN, it is necessary to minimize and save the energy consumption of the sensor nodes as efficiently as possible. To enhance the network lifetime, the clustering protocols LEACH [2] and LEACH-C [3] were proposed by Heinzelman et al. Both methods maximize the network lifetime by keeping the energy usage of the sensor nodes low, but deviations in the number of clusters and the unequal distribution of cluster heads limit these clustering-based algorithms. This paper focuses on the deployment of sensor nodes in circular and rectangular WSN environments and then applies a clustering algorithm so that the overall network lifetime is enhanced by minimizing the energy requirements and usage of the sensor nodes. The proposed method outperforms other clustering approaches in minimizing energy consumption and increasing network lifetime, particularly in the circular WSN case. It works by first dividing the entire network area into subareas; the cluster heads are then uniformly distributed over these subareas. The cluster head in each subarea is responsible for receiving the data from all other sensor nodes and forwarding it to the sink node. The protocol follows the deployment of sensor nodes shown in Fig. 1, where the whole network is subdivided into clusters and a cluster head is chosen in each of them; its role is to provide effective communication to the local base station and, further, useful information to the user end. To enhance the overall network lifetime by minimizing the energy consumption of the sensor nodes, the local base station and the cluster nodes are kept at a close distance. Based on this concept, some static clustering protocols were suggested in Refs. [4–9] for effective communication. However, these methods have some limitations; they are not found suitable for effective communication because they use fixed cluster nodes and cluster
Fig. 1 LEACH protocol
heads for the complete network lifetime and assume that the local base station is a node with a higher energy level [10, 11]. The excessive use of this local base station is the main reason the entire network dies early. This paper proposes a novel circular LEACH protocol for enhancing the overall network lifetime by minimizing the energy consumption of WSN sensor nodes compared with the normal LEACH protocol. We further introduce a data-gathering approach based on a fixed clustering scheme that schedules active and inactive nodes in each cluster over the lifetime of the WSN. In this way, the overall network lifetime is enhanced by minimizing the total energy requirement of the entire network. The rest of this paper is organized into five sections: Sect. 2 gives a brief account of the LEACH protocol, Sect. 3 presents the overall framework of the proposed circular LEACH procedure, Sect. 4 provides the simulation results, and Sect. 5 concludes the paper.
2 LEACH Protocol The popular clustering-oriented protocol LEACH (low-energy adaptive clustering hierarchy) was first proposed by Heinzelman et al. in 2000. LEACH operation is split into rounds. Every round starts with a setup phase used for establishing the clusters, followed by a steady-state phase in which the clusters are used; data frames are then transferred from the sensor nodes to the cluster-head node and onward to the base station. As the setup phase begins, a probability model of the entire network is computed and, based on it, cluster heads are chosen among the sensor nodes; the whole process is illustrated in Fig. 2. The selection of a cluster head starts when every sensor node generates a random number between 0 and 1. If the number is smaller than the pre-defined threshold t(n), then for the current round the sensor node acts as a cluster-head node. The threshold value is computed as

t(n) = \begin{cases} \dfrac{p_n}{1 - p_n \left( r_n \bmod \dfrac{1}{p_n} \right)}, & \text{if } n \in G \\[2mm] 0, & \text{otherwise} \end{cases} \qquad (1)

Fig. 2 Phases of LEACH Protocol
Here, p_n denotes the desired probability of becoming a cluster head, r_n denotes the current round, and G is the set of nodes that have not served as cluster heads in the last 1/p_n rounds. Based on this threshold, a node that becomes a cluster head cannot be selected again for the next 1/p_n rounds; after 1/p_n rounds have elapsed, every node in the network is again eligible to become a cluster head. In LEACH-based WSN applications, it is generally assumed that about 5% of the total number of sensor nodes act as cluster heads. The sensor node that becomes a cluster head in a particular round is responsible for broadcasting an advertisement message to the remaining sensor nodes in the network. These non-cluster-head nodes decide which cluster to join in that round after receiving the advertisement messages, the decision being taken depending on the received signal strength of the advertisement. The head node then collects a join request from every sensor node that would like to become part of its cluster and, depending on the number of nodes in the cluster, creates a TDMA schedule assigning each sensor node a time slot in which to transmit its information. After the setup phase, the steady-state phase begins, in which the sensor nodes sense and transmit data to the cluster heads. A non-cluster-head node does not perform any function until its allocated transmission time. The cluster heads gather the data received from all sensor nodes in the cluster and then forward it to the sink.
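A compact sketch of the cluster-head election described above, using the threshold of Eq. (1), is shown below. The node bookkeeping is simplified, and the 5% head fraction follows the text; treat this as illustrative rather than the authors' simulator.

```python
import random

def leach_threshold(p, r, eligible):
    """Threshold t(n) of Eq. (1); only nodes not chosen in the last 1/p rounds are eligible."""
    if not eligible:
        return 0.0
    return p / (1 - p * (r % int(round(1 / p))))

def elect_cluster_heads(node_ids, last_round_as_head, r, p=0.05):
    """Return the node ids elected as cluster heads in round r."""
    heads = []
    for n in node_ids:
        eligible = (r - last_round_as_head.get(n, -10**9)) >= int(round(1 / p))
        if random.random() < leach_threshold(p, r, eligible):
            heads.append(n)
            last_round_as_head[n] = r
    return heads

history = {}
for r in range(3):
    print(r, elect_cluster_heads(range(100), history, r))
```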
2.1 Energy Computation in Radio Model The first-order radio model is used for energy computation, based on the following assumptions:

I. The sensor nodes are deployed within wireless communication range and are able to communicate with each other and with the base station (BS).
II. All sensor nodes are homogeneous, with the same sensing, communication, and computing capabilities across the entire network.
III. All sensor nodes are deployed in a random fashion for the WSN under test.
IV. The BS is situated in the middle of the sensor network and has an unlimited power supply.
V. All sensor nodes start with the same initial energy, and their rate of energy dissipation is assumed to be the same.
VI. The network lifetime is defined as the time span from deployment to the point when the first node has died or all nodes have died. Given assumption (V), all sensors dissipate their energy at the same rate and die out at approximately the same instant of time.
Fig. 3 Dissipation model of radio energy
VII. The information-gathering and clustering processes consume very little power compared with the CPU and radio; therefore, the energy dissipated in sensing and in clustering the data is neglected. Moreover, it is assumed that no energy is dissipated inside a node for cluster formation and that the entire clustering algorithm runs on the BS.
VIII. A round is defined as the time duration in which the BS collects data from each of the sensor nodes; each sensor node senses data only once per round.
IX. To minimize the amount of data transmitted by radio, sensor nodes that receive data merge one or more packets into packets of the same size, reducing the data size transmitted by radio. This occurs in clusters of sensor nodes that are actively sensing data.
X. A fixed amount of energy is dissipated per bit of data. Fig. 3 shows the computation of the cost of transmitting and the cost of receiving a single bit of information over a distance of d units.
The wireless channel model that dissipates the radio energy is depicted in Ref. [9]. For transmitting one bit of information over a distance d, the transmitter spends energy running the transmitter electronics (E_Tx-elec) and the power amplifier (E_Tx-amp). The energy required for transmitting a k-bit message over a distance d is given as

E_{Tx}(k, d) = E_{Tx\text{-}elec}(k) + E_{Tx\text{-}amp}(k, d) \qquad (2)

The power loss in the transmit amplifier is governed by the distance between the transmitting and receiving units: if the distance is smaller than a threshold value d_0, the free-space (fs) model is used; otherwise, the multipath (mp) model is used. The radio energy model, with a d^2 power-loss factor for free space and a d^4 power-loss factor for multipath, is described by the following equation:

E_{Tx}(k, d) = \begin{cases} k E_{elec} + k\,\epsilon_{mp}\, d^{4}, & d \ge d_0 \\ k E_{elec} + k\,\epsilon_{fs}\, d^{2}, & d < d_0 \end{cases} \qquad (3)
The energy spent by the receiver to receive a k-bit message and run the radio electronics is described by the following equation:

E_{Rx}(k) = E_{Rx\text{-}elec}(k) = k E_{elec} \qquad (4)

d_0 = \sqrt{\dfrac{\epsilon_{fs}}{\epsilon_{mp}}} \qquad (5)

The electronics energy E_elec depends on the modulation technique, signal spreading, filtering, and digital coding of the signal, whereas the amplifier energy, ε_fs d² or ε_mp d⁴, depends on the transmission distance and the acceptable bit error rate (BER); d_0 represents the threshold transmission distance for the amplification circuit.
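A direct transcription of Eqs. (2)–(5) into code is given below, using the parameter values listed in Table 1; it is a sketch of the first-order radio model, not the authors' simulator.

```python
import math

E_ELEC = 50e-9          # J/bit, transmit/receive electronics
EPS_FS = 10e-12         # J/bit/m^2, free-space amplifier
EPS_MP = 0.0013e-12     # J/bit/m^4, multipath amplifier
D0 = math.sqrt(EPS_FS / EPS_MP)   # threshold distance of Eq. (5), about 87.7 m

def energy_tx(k_bits, d):
    """Eq. (2)-(3): energy to transmit k bits over distance d."""
    if d >= D0:
        return k_bits * E_ELEC + k_bits * EPS_MP * d ** 4
    return k_bits * E_ELEC + k_bits * EPS_FS * d ** 2

def energy_rx(k_bits):
    """Eq. (4): energy to receive k bits."""
    return k_bits * E_ELEC

print(D0)                       # ~87.7 m
print(energy_tx(4000, 50))      # transmit a 4000-bit packet over 50 m
```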
2.2 Initialization of Rectangular LEACH Network The sensor nodes are initialized across the entire network along with the assignment of their energy variables. The sensor nodes are deployed with a random distribution over an L × L m² region. A random 100-node topology is chosen for a 100 × 100 m² area, and the sink of the network is located at position (50, 50). The initialization of the sensor nodes is shown in Fig. 4.
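The random 100-node deployment of Fig. 4 can be reproduced with a few lines; the sketch below assumes the 100 m × 100 m field and central sink described in the text, and borrows the 2 J initial energy from Table 2 as a placeholder.

```python
import random

def init_rectangular_wsn(n_nodes=100, side=100.0, initial_energy=2.0):
    """Place n_nodes uniformly at random in a side x side field, with the sink at the centre."""
    nodes = [{"id": i,
              "x": random.uniform(0, side),
              "y": random.uniform(0, side),
              "energy": initial_energy} for i in range(n_nodes)]
    sink = {"x": side / 2, "y": side / 2}
    return nodes, sink

nodes, sink = init_rectangular_wsn()
print(len(nodes), sink)          # 100 {'x': 50.0, 'y': 50.0}
```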
Fig. 4 Initialization of rectangular WSN
3 Proposed Algorithm This section presents the routing scheme and the scheduling of sensor nodes in each cluster of the rectangular network. The process diagram of the routing scheme for the WSN is depicted in Fig. 5. The working process is identical to that of the common LEACH protocol: the cluster formation and cluster-head selection processes are the same, and the cluster node to be used as head node is decided in every round by comparing the remaining energy levels. By repeating this process and properly scheduling all sensor nodes in the cluster, the total energy efficiency of the network is improved. After node scheduling, cluster-head selection is performed so as to equalize the distribution of the available residual energy. The main requirement for maximizing network lifetime is that a sensor node should consume only a very small fraction of energy while it is in sleep mode within its cluster. As a result, the number of rounds increases, because the energy consumption in the passive and active modes is controlled and the existing power is distributed in the most efficient way over the entire network (Fig. 5). By optimally scheduling the sensor nodes in each cluster and allocating proper time intervals, minimum power consumption in the nodes is achieved and the overall lifetime of the WSN increases.
Fig. 5 The LEACH protocol procedure
3.1 Circular LEACH Circular LEACH is also based on clustering the sensor nodes according to the distribution of energy in the whole network. The clustering process is divided into rounds: in each round, the cluster-head node receives information messages from the nodes of its cluster and forwards the combined information to the base station or to the upper layer using the TDMA technique. It is assumed that in circular LEACH the BS is situated at position (0, 0). The optimal one-hop transmission distance and the optimal cluster angle may be computed with the help of the following equations:

d_{opt} = \dfrac{2E_{elec} + E_{cpu}}{E_{amp}(\gamma - 1)} \qquad (6)

\theta_{opt} = \dfrac{8\pi^{3}\,(3E_{elec} + E_{cpu})}{N\,(2E_{elec} + E_{cpu})} \qquad (7)

The cluster-head nodes situated in the nearest cluster units send their location and ID information to the BS during the initial setup phase. Each then sends an advertising signal carrying a one-hop distance d_1hop and its position details to the sensor nodes of its cluster. The location and ID information of nearby nodes received by the cluster heads acts as the deciding factor for choosing the first cluster heads. The BS selects the first cluster heads from nodes whose locations lie between the top and second layers, provided the condition d_1hop ≥ d_opt is satisfied. The cluster to which a node belongs is decided by the following flow chart after it receives an advertising signal from the BS (Fig. 6).
4 Simulation and Results This section discusses the simulation results obtained after simulating the model in the MATLAB environment; with the help of the obtained results, the performance of the proposed circular LEACH and the rectangular LEACH algorithm is evaluated in terms of energy efficiency and extension of the WSN lifetime. The simulation parameters used in the experimental work are shown in Table 1. The sensor nodes are deployed randomly in a dense network area between x = 0, y = 0 and x = 100, y = 100, and the position of the base station is fixed at x = 50, y = 50. The total number of nodes that remain active after 700 simulation rounds in the rectangular and circular LEACH protocols is shown in Figs. 7 and 12. From the analysis of Fig. 13, it is clearly seen that nodes in circular LEACH stay alive longer (more rounds) than in the rectangular LEACH case. It is found that, on further increasing the area and the number of nodes in circular LEACH, we can achieve
[Fig. 6 flow chart, summarized: clusters are formed and a TDMA schedule is set up; the current cluster head, whose working time is denoted by T, receives data from the general nodes of its cluster and from the lower-layer cluster head having the identical cluster angle, and forwards the combined information to the upper-layer cluster head; while T < T0 this continues, and once T reaches T0 the current cluster head is replaced by the candidate cluster head, which forwards this information to its members.]
Fig. 6 Flow chart for the circular LEACH, in which each cluster head collects the information from its members and from the lower-layer cluster head positioned at the identical cluster angle

Table 1 Simulation parameters
Parameter | Value
Network area | 100 m * 100 m
Packet size | 40000 bits
E_elec | 50 nJ/bit
d0 | sqrt(εfs/εmp)
BS position | (50, 50)
εmp | 0.0013 pJ/bit/m4
εfs | 10 pJ/bit/m2
E_DA | 0.5 J
Number of nodes | 100
Fig. 7 Rectangular LEACH protocol lifetime after 700 rounds: inactive nodes (nearly 50 nodes) are shown by red points, * denotes the cluster heads, and the sink node at the middle is denoted by ×
an extensively improved network lifetime. The simulation outputs demonstrate that a power reduction of up to 50% can be achieved when the sensor nodes are deployed in the circular LEACH environment as compared to rectangular LEACH. Further analysis of the metrics shows that, both when the first node dies (FND) and when half of the nodes are still alive (HNA), the network lifetime of circular LEACH is improved in comparison with the rectangular LEACH algorithm. Figures 7 and 8 provide the details of the network lifetime in the rectangular and circular LEACH protocols over 700 rounds, showing the dead nodes, cluster heads, and sink node. Figure 9 shows the dead-node results after the complete set of iterations. Figure 10 shows the overall count of active sensor nodes versus the number of rounds when more nodes are added to the network. Figure 11 shows the available network energy per round. The overall energy efficiency of circular LEACH is improved by approximately 50% as compared to rectangular LEACH. This is demonstrated by adding more nodes in the given network area: the circular LEACH nodes become inactive only after about 2700 iterations, whereas in the rectangular LEACH protocol every node has lost its energy after approximately 1300 rounds. This clearly indicates that circular LEACH outperforms the rectangular case. In circular LEACH, the base station is located at node (0, 0). The circular placement of the sensor nodes in the dense network provides a greater coverage region and requires a smaller number of nodes placed at equal distances; it maximizes the network lifetime by minimizing the power usage of the sensor nodes. From the analysis of the two cases it can be concluded
Fig. 8 Circular LEACH lifetime after 700 iterations: inactive nodes (only 4 nodes) are shown by red points, * denotes the cluster heads, and × denotes the sink node at the middle
Fig. 9 Dead Node over complete iteration
Fig. 10 Live sensor nodes versus Number of rounds
Fig. 11 Network energy per round
that when the sensor nodes are placed in a circular scheme, the energy consumption of the sensor nodes can be minimized efficiently and the network lifetime maximized in comparison with the grid-placement network (Table 2).
Fig. 12 Circular LEACH
Fig. 13 The networks lifetime of circular LEACH
5 Conclusion LEACH is the most widely used routing protocol for clustering-oriented wireless sensor networks. This research compares the outputs of LEACH-based wireless sensor networks on the basis of lifetime and throughput for the two particular cases in which sensor nodes are deployed under the rectangular and circular LEACH
Table 2 Circular LEACH simulation parameters
Parameter | Value | Symbol
Total number of nodes | 100 | N
Simulation area | Circular area of radius 50 m | –
Packet size | 500 bytes | P
Transmit/receive electronics | 50 nJ/bit | E_elec
Amplifier constant | 10 pJ/bit/m2 | εfs
Amplifier constant | 0.00013 pJ/bit/m2 | εmp
Base station position | (0, 0) | (x, y)
Number of sectors | 6 | S
Number of tracks | 4 | T
Initial energy | 2 J | E0
Energy for data aggregation | 5 nJ | E_DA
protocols. From a reasonable number of clusters per LEACH round, it is inferred that deploying the sensor nodes in the circular LEACH environment increases the lifetime and reduces the energy consumption of the sensor nodes by about 50%. The circular deployment of sensor nodes in the dense network provides a greater coverage region with a smaller number of nodes placed at equal distances, and it maximizes the network lifetime by minimizing the power usage of the sensor nodes.
References 1. Heinzelman, W. R., Chandrakasan, A., & Balakrishnan, H. (2000). Energy efficient communication protocol for wireless microsensor networks. In Proceedings of the Hawaii International Conference on System Sciences, Hawaii, USA (Vol. 1, pp. 3005–3014. C). 2. Liu, J. S., & Lin, C. H. (2003). Power efficiency clustering method with power limit constraint for sensor networks performance. In Proceedings of the 2003 IEEE International Performance, Computing, and Communications Conference, Arizona, USA (Vol. 9, pp. 129–136. C). 3. Bandyopadhyay, S., & Coyle, E. (2003). An energy efficient hierarchical clustering algorithm for wireless sensor networks. In Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies, San Francisco, USA (Vol. 3, pp. 1713–1723). 4. Xue, Q., & Ganz, A. (2004). Maximizing sensor network lifetime: analysis and design guides. In Proceedings of the 2004 Military Communications Conference, Monterey, CA (Vol. 2, pp. 1144– 1150). 5. Shih, E., Cho, S., & Ickes, N., et al. (2001). Physical layer driven protocol and algorithm design for energy efficient wireless sensor networks. In Proceedings of the 7th Annual Conference on Mobile Computing and Networking, Pisa, Italy (Vol. 6, pp. 272–287).
6. Priscilla, C., O’Dea, B., & Callaway, E. (2002). Energy efficient system design with optimum transmission range for wireless ad-hoc networks. In Proceedings of the 2002 IEEE International Conference on Communications, New York, USA (Vol. 2, pp. 945–952). 7. Heinzelman, W. B., Chandrakasan, A. P., & Balakrishnan, H. (2002). An application-specific protocol architecture for wireless microsensor networks. IEEE Transactions on Wireless Communications, 1(4), 660–670. 8. Guo, B., & Li, Z. (2007). United voting dynamic cluster routing algorithm based on residualenergy in wireless sensor networks. Journal of Electronics & Information Technology, 29(12), 3006–3010. 9. Shelby, Z., Pomalaza-Raez, C., & Karvonen, H. (2005). Energy optimization in multi-hop wireless embedded and sensor networks. International Journal of Wireless Information Networks, 12(1), 11–20. 10. Handy, M. J., Haase, M., & Timmermann, D. (2002). Low energy adaptive clustering hierarchy with deterministic cluster-head selection. In Fourth IEEE Conference on Mobile and Wireless Communications Networks, Stockholm, Sweden (Vol. 12, pp. 368–372). 11. Soro, S., & Heinzelman, W. B. (2005). Prolonging the lifetime of wireless sensor networks via unequal clustering. In Proceedings of the 19th IEEE International Parallel and distributed Processing Symposium, Colorado, USA (Vol. 13, pp. 236–243).
Estimation and Correction of Multiple Skews Arabic Handwritten Document Images M. Ravikumar and Omar Ali Boraik
Abstract Skew detection and correction have become important in the preprocessing stage as the first step in Arabic handwriting document recognition and analysis. In this paper, the skewness is detected and corrected through the help of morphological operation and connected components to identify the line in Arabic documents. For such purpose, the bounding box is drawn for each line and segmented separately which could be replaced by global bounding box in some cases. A statistical method is applied for detecting a skew angle and correcting the skewness with different orientations. The proposed method is implemented on more than 700 Arabic handwritten documents. The accuracy ratio is 98% based on line segmentation and it takes 2.5 s in color images and 1.5 s on binary or grayscale images. Keywords Skew detection · Correction · Arabic handwriting · Bounding box and text line segmentation
1 Introduction Today’s world has witnessed rapid and unprecedented advancements in all aspects. The majority of such advancements are technologically based, such technological developments include easing the methods of documenting, digitizing, saving, transferring, enhancing, modifying documents as well as easing accessing such the document worldwide through the use of technology [1, 2]. Technology is adopted to deal with different types of documents, i.e., historical documents, manually written documents and printed, through the use of various input devices or sensors in order to transfer such documents into digital documents. Within the process of digitizing, documents may not come out as expected to be, i.e., documents may have noise, low contrast, the documents may have not been captured properly, or the documents could be of low quality as in old or badly written M. Ravikumar · O. A. Boraik (B) Department of Computer Science and MCA, Kuvempu University, Shimog 577451, Karnataka, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_45
documents; for instance, the writing could have multiple skew angles due to handwriting mistakes or the use of unlined paper [3]. Any mistakes or errors accompanying the digitizing process may hamper subsequent image processing and may even lead to wrong results. This research paper proposes a method for detecting and correcting multiple skew angles in Arabic handwritten document images. The method relies on a sequence of processing steps that starts with preprocessing and ends with the correction of line skewness. The proposed method gives significant results in terms of running time, accuracy, and memory consumption. The paper is organized as follows: Sect. 2 reviews related work on skew detection and correction, Sect. 3 discusses the proposed methodology in detail, Sect. 4 addresses the results and discussion and, finally, the conclusion is given in Sect. 5.
2 Related Work Over recent years, the techniques and methods for detecting and correcting skew in improperly acquired document images fed to OCR systems have been developed and improved [4]. Skew detection and correction are considered mandatory steps in the preprocessing stage because they directly impact the reliability, accuracy, and efficiency of the next stages of OCR systems [5, 6]. Although these methods and techniques have performed well on printed documents, they are still being improved and developed to overcome the challenges and difficulties of handwritten and historical documents [7]. The most popular methods are the projection profile [3, 7–9] and the Hough transform [8–11]; Boukharouba [11] concentrated on accuracy, but the approach consumes a large amount of memory [8, 9]. The nearest-neighbor method [9] produces more errors when applied to older Arabic documents [3], and more than 13 further methods are discussed in [4, 7, 8]. Techniques and methods have recently expanded to handle multilingual documents and multiple skew angles in a single handwritten document [7]. Sharma and Lehal [12] proposed a method for multiple skew angles in printed Indic (Gurmukhi) script. Guru et al. [7] suggested a method to detect and correct multiple skew angles in multilingual handwritten documents; moreover, another method was proposed by RaviKumar and Shivaprasad [6], which estimates the skew angle based on region properties of words in trilingual handwritten documents. Several studies focus on skew detection and correction in Arabic documents that may or may not have multiple skew angles. Al-Shatnawi and Omar [5] suggested a method for skew correction of printed Arabic documents using the center of gravity (COG), in which the complete text block is inscribed in a polygon and the angle between the COG of the polygon and an ideal horizontal origin is calculated as the document skew angle; however, the method corrects only the global skewness of printed Arabic documents. Ahmad [13] presented a method for skew correction based on connected-component analysis and projection profiles; this method works for different types of printed Arabic
documents containing text or non-text regions; its main drawbacks are that it requires a large amount of memory and runs slowly. Al-Shatnawi [14] presented another method for multiple-skew detection and correction, which estimates a text-line baseline by calculating the center points of the bounding boxes of its sub-words and then aligns the text-line components on the estimated baseline. Based on the review of the available literature, we notice that most algorithms perform well on printed documents for skew detection and correction, whereas few studies detect and correct skewness in handwritten documents [7, 12]. Recognizing Arabic text is more difficult and complex than recognizing languages such as Chinese, Japanese, or Latin scripts because Arabic is written cursively and its characters are written connectedly [5]; therefore, only a scant amount of literature focuses on skew detection and correction for Arabic text. This triggered the need for this study, which focuses on multiple skew detection and correction in Arabic handwritten documents. The proposed work is divided into four parts: (1) preprocessing, (2) line segmentation, (3) skew angle detection and, finally, (4) correction.
3 Proposed Methodology In this section, we discuss the proposed methodology for skew detection and correction of Arabic documents. The block diagram of this work in Fig. 1 shows the four main stages: the first stage covers the preprocessing steps applied to the input image; the second stage covers line segmentation using a morphological operation (dilation), drawing the bounding boxes, and segmenting them; the third stage covers skew angle detection using a statistical technique; and the final stage covers the correction of line skew.
3.1 Preprocessing Stage The OCR system mostly processes a binary input image, so if the input image is colored, it is first converted into a grayscale image and then Otsu's method is applied for binarization [3]. After that, noise and stray dots are removed using a median filter. Some Arabic document images have diacritical marks, and some letters have vertically long strokes (examples are shown in Figs. 2 and 3); these cause interference among lines and errors during processing. To solve these problems, any connected area smaller than 40 pixels is eliminated, and a dilation operation is used to expand the line structure so as to overcome the overlapping within and among the words of the same line and to make its components connected; after that, edges are detected so that a bounding box can be drawn [14].
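The preprocessing pipeline described above maps fairly directly onto standard OpenCV calls; in the sketch below, only the 40-pixel area filter comes from the text, while the kernel size and the exact ordering of steps are assumptions.

```python
import cv2
import numpy as np

def preprocess(image_bgr, min_area=40):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    binary = cv2.medianBlur(binary, 3)                       # remove speckle noise / dots

    # Drop tiny components (diacritics, isolated dots) below the area threshold.
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    cleaned = np.zeros_like(binary)
    for label in range(1, n_labels):
        if stats[label, cv2.CC_STAT_AREA] >= min_area:
            cleaned[labels == label] = 255

    # Dilate horizontally so the components of one text line merge together.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    dilated = cv2.dilate(cleaned, kernel, iterations=1)
    return cleaned, dilated
```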
Fig. 1 Block diagram of the proposed method
3.2 Line Segmentation Edges (black pixels) are detected by checking 0–1 or 1–0 transitions while tracing the image with a 3 × 3 matching window, which determines the eight neighbors of any given pixel. The eight neighbors are used to find edges in the eight possible directions. Then the first minimum point, (x1, y1) or (x2, y2), and the last maximum point, (x3, y3) or (x4, y4), of the line's connected components are determined
Fig. 2 Shows some letters that have vertically long lines
Fig. 3 Shows interference bounding box among lines
for drawing the bounding box. The detected edge coordinates are labeled and stored in a 2D array in order to find the skew angle and to be used later in other stages [14].
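Continuing the sketch from the preprocessing step, the per-line bounding boxes can be obtained from the connected components of the dilated image; again, this is an illustrative reading of the text, not the authors' code.

```python
import cv2

def line_bounding_boxes(dilated, min_height=10):
    """Return one bounding box (x, y, w, h) per merged text-line component."""
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    boxes = [b for b in boxes if b[3] >= min_height]     # drop residual specks
    boxes.sort(key=lambda b: b[1])                       # top-to-bottom order
    return boxes
```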
3.3 Skew Detection and Correction First, the skew angle of a line's coordinates is detected by analyzing the multivariate data through the mean vector and the line's variance–covariance matrix. A set of observations of variables Xi and Yj can be described by its mean vector and its variance–covariance matrix (in the general case, the Yj variables could be, for example, the length, width, and height of a certain object). The mean vector consists of the mean of each variable, and the variance–covariance matrix contains the variances of the variables along the main diagonal and the covariances between each pair of variables
in the other matrix positions. To compute the covariance of the variables x and y, we use the formula

C = \dfrac{\sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})}{n - 1} \qquad (1)

where \bar{x} and \bar{y} denote the means of X and Y, respectively. The resulting variance–covariance matrix is

C = \begin{pmatrix} \sigma(x, x) & \sigma(x, y) \\ \sigma(y, x) & \sigma(y, y) \end{pmatrix} \qquad (2)
The eigen-decomposition produces the matrices of eigenvalues (λ) and eigenvectors of matrix C. Matrix D is the canonical form of C, a diagonal matrix with the eigenvalues of C on its main diagonal, and matrix V is the modal matrix, whose columns are the eigenvectors of C. The eigenvalue–eigenvector computation on matrix C thus returns matrices V and D: the columns of V are the eigenvectors of C, and the diagonal matrix D contains its eigenvalues. If the resulting V has the same size as C, then C has a full set of linearly independent eigenvectors that satisfy C * V = V * D. The orientation of the ellipse can now be obtained from the eigenvectors and eigenvalues of C, which can be visualized with a velocity-vector (quiver) plot: a quiver plot displays velocity vectors as arrows with components (u, v) at the points (x, y), where a velocity vector represents the rate of change of the position of an object, its magnitude gives the speed, and its direction gives the direction of motion [15] (Fig. 4). From Table 1 we can see that the eigenvectors (columns of V) point approximately along the X direction (first column) and the Y direction (second column). Examining the eigenvalues (diagonal of D), the second eigenvalue is much larger than the first; this corresponds to the major axis of the ellipse. We can therefore recover the orientation of the ellipse by finding the major-axis index m, i.e., the index of the largest eigenvalue on the diagonal of D. Second, from this major-axis index, we finally obtain the slope, and the skew angle is calculated using the following formula:

\theta = \operatorname{atan2}\big(V(2, m),\, V(1, m)\big) \times \dfrac{180}{\pi} \qquad (3)
where V contains the eigenvectors of C, m is the index of the largest eigenvalue on the diagonal of D, and the factor 180/π converts the value into degrees for readability. Based on theta (θ), we rotate the bounding box to correct its skewness, as shown in Table 1.
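Equations (1)–(3) amount to a principal-axis computation on the foreground pixel coordinates of a segmented line; a NumPy sketch is given below. The rotation step uses OpenCV and is an assumption about how the correction is applied, not the authors' exact procedure.

```python
import numpy as np
import cv2

def line_skew_angle(line_binary):
    """Skew angle (degrees) of one segmented text line from the eigenvectors of C."""
    ys, xs = np.nonzero(line_binary)              # foreground pixel coordinates
    coords = np.stack([xs, ys])                   # 2 x N matrix of (x, y)
    C = np.cov(coords)                            # 2 x 2 variance-covariance matrix, Eqs. (1)-(2)
    eigvals, eigvecs = np.linalg.eigh(C)          # D (eigenvalues) and V (eigenvectors)
    m = np.argmax(eigvals)                        # index of the major axis
    vx, vy = eigvecs[0, m], eigvecs[1, m]
    return np.degrees(np.arctan2(vy, vx))         # Eq. (3)

def deskew_line(line_binary):
    angle = line_skew_angle(line_binary)
    h, w = line_binary.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(line_binary, M, (w, h))
```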
Fig. 4 Images with different coordinate position (line, word, entire image)
4 The Experimental Results and Discussions The proposed system was tested on more than 700 text documents. These documents are taken from the KHAT database, old documents, and our own handwritten dataset in the Arabic language. Most of the collected handwritten document images include multiple skew angles in a single document or a global skew. The proposed method provides satisfactory results: high accuracy, short running time, and low memory requirements compared with previous studies. The statistical determination of the skew angle is simple and efficient; when the line segmentation outputs are good, the skew detection and correction results are highly accurate. Tables 1 and 2 demonstrate the outputs and timing results of the proposed method, respectively, and the input and output images are listed in Table 3. Some input images have a global skew, in which case the line segmentation process has difficulty extracting each skewed line separately; instead, the algorithm detects the slope of the entire text, as shown in Tables 2 and 3. Any suggested algorithm remains dependent on the processing conditions; otherwise, the results may be of low quality and accuracy due to the difficulty of the Arabic script, particularly the handwritten script, which gives rise to numerous issues and challenges [5].
Table 1 Eigenvalues and eigenvectors of the C matrix for each line and its angle (theta θ)

Sample 1 (3 lines):
C1 = 1.0e+03 * [8.1960 1.3723; 1.3723 0.2819], C2 = 1.0e+03 * [5.5811 0.1085; 0.1085 0.0529], C3 = 1.0e+03 * [5.5960 −1.1991; −1.1991 0.3102]
θ1 = −170.4365, θ2 = −178.8765, θ3 = 167.7980

Sample 2 (1 line):
C1 = 1.0e+03 * [2.4886 −2.2661; −2.2661 2.6383]
θ1 = 134.0541

Sample 3 (4 lines):
C1 = 1.0e+04 * [3.2216 −0.3924; −0.3924 0.0844], C2 = [0.5432 −0.0988; −0.0988 0.9877], C3 = 1.0e+04 * [3.4061 −0.2502; −0.2502 0.0997], C4 = 1.0e+04 * [3.4126 −0.3843; −0.3843 0.0548], C5 = 1.0e+04 * [1.1450 −0.2146; −0.2146 0.0430]
θ1 = 173.6061, θ2 = 172.8575, θ3 = 170.8987, θ4 = 171.3042

Sample 4 (5 lines):
C1 = 1.0e+04 * [3.1545 −0.1535; −0.1535 0.0190], C2 = 1.0e+04 * [3.2805 −0.2147; −0.2147 0.0434], C3 = 1.0e+04 * [3.1097 −0.3315; −0.3315 0.0565], C4 = 1.0e+04 * [3.0927 −0.3162; −0.3162 0.0427], C5 = 1.0e+03 * [1.4419 −0.1324; −0.1324 0.0483]
θ1 = 177.2032, θ2 = 176.2214, θ3 = 173.8740, θ4 = 174.1434, θ5 = 174.6216

Sample 5 (2 lines):
C1 = 1.0e+04 * [3.1569 0.3996; 0.3996 0.1366], C2 = 1.0e+04 * [2.7954 0.2691; 0.2691 0.1249]
θ1 = −172.5899, θ2 = −174.3033

Sample 6 (6 lines):
C1 = 1.0e+05 * [3.4222 0.0859; 0.0859 0.0075], C2 = 1.0e+05 * [3.2960 0.0966; 0.0966 0.0068], C3 = 1.0e+05 * [3.0464 0.1055; 0.1055 0.0071], C4 = 1.0e+05 * [2.6201 0.1191; 0.1191 0.0085], C5 = 1.0e+04 * [4.8140 0.4509; 0.4509 0.0775], C6 = 1.0e+04 * [2.8751 0.1537; 0.1537 0.0402]
θ1 = −178.5604, θ2 = −178.3200, θ3 = −178.0140, θ4 = −177.3952, θ5 = −174.6104, θ6 = −176.9060
Table 2 Processing time (in seconds) of the proposed system for some of the document images in Table 3
Samples | Preprocessing | Line segmentation | Skew detection and correction | Total time
Color image 1 | 0.110343 | 0.393267 | 2.03042 | 2.53403
Color image 2 | 0.098196 | 0.366027 | 0.660495 | 1.124718
Image 3 | 0.145516 | 0.420718 | 1.1112278 | 1.6774618
Image 4 | 0.130016 | 0.423822 | 1.669268 | 2.223106
Image 5 | 0.161941 | 0.432834 | 1.6100778 | 2.2048528
Image 6 | 0.1494 | 0.42276 | 1.636461 | 2.208621
Image 7 | 0.07607 | 0.333851 | 1.185558 | 1.595479
Old image 8 | 0.079928 | 0.313881 | 1.180443 | 1.574252
Table 3 Results showing Arabic handwritten documents (columns: original images, preprocessing, skew detection and correction)
5 Conclusion This paper proposes a method for skew detection and correction based on the coordinate positions of the bounding box of each separately segmented line in Arabic documents, or of a global bounding box in some cases. Through the morphological operation,
line segmentation and a statistical method are applied to detect the skew angle and correct the skewness. We tested the algorithm on old, handwritten, and printed Arabic documents, as shown in Table 3.
References 1. Hull, J. J. (1998). Document image skew detection: Survey and annotated bibliography. Document Analysis Systems II, 40–64. 2. Zhang, Y., et al. (2018). Research on Deskew algorithm of scanned image. In 2018 IEEE International Conference on Mechatronics and Automation (ICMA). IEEE. 3. Rubani, & Rani, J. (2018). Skew detection and correction in text document image using projection profile technique. 2018 7th July International Journal of Computer Sciences and Engineering, 6, 2347–2693. 4. Rezaei, S. B., Sarrafzadeh, H., & Shanbehzadeh, J. (2013). Skew detection of scanned document images. 5. Al-Shatnawi, A. M., & Omar, K. (2009). Skew detection and correction technique for Arabic document images based on center of gravity. Journal of Computer Science, 5(5), 363. 6. Ravikumar, M., et al. (2019). Estimation of Skew Angle from Trilingual Handwritten Documents at Word Level: An Approach Based on Region Props. Soft Computing and Signal Processing (pp. 419–426). Singapore: Springer. 7. Guru, D. S., Ravikumar, M., & Manjunath, S. (2013). Multiple skew estimation in multilingual handwritten documents. International Journal of Computer Science Issues (IJCSI), 10(5), 65. 8. Makkar, N., & Singh, S. (2012). A brief tour to various skew detection and correction techniques. International Journal for Science and Emerging Technologies with Latest Trends, 4(1), 54–58. 9. Al-Khatatneh, A., Pitchay, S. A., & Al-qudah, M. (2015). A review of skew detection techniques for document. In 2015 17th UKSim-AMSS International Conference on Modelling and Simulation (UKSim). IEEE. 10. Jundale, T. A., & Hegadi, R. S. (2015). Skew detection and correction of Devanagari script using Hough transform. Procedia Computer Science, 45, 305–311. 11. Boukharouba, A. (2017). A new algorithm for skew correction and baseline detection based on the randomized Hough Transform. Journal of King Saud university-computer and information sciences, 29(1), 29–38. 12. Sharma, D. V., & Lehal, G. S. (2009). A fast skew detection and correction algorithm for machine printed words in Gurmukhi script. In Proceedings of the International Workshop on Multilingual OCR. 13. Ahmad, I. (2013). A technique for skew detection of printed Arabic documents. In 2013 10th International Conference Computer Graphics, Imaging and Visualization. IEEE. 14. Al-Shatnawi, A. M. (2014). A skew detection and correction technique for Arabic script textline based on subwords bounding. In 2014 IEEE International Conference on Computational Intelligence and Computing Research. IEEE. 15. Robert R., & Walker, J. (2004, June 16). Fundamentals of Physics (7th ed.). Wiley. ISBN 0471232319.
Heart Disease Prediction Using Hybrid Classification Methods Aniket Bharadwaj, Divakar Yadav, and Arun Kumar Yadav
Abstract Predictive analytics is the method of extracting information from available datasets to determine patterns and predict future events and developments. It mainly consists of three steps: pre-processing, feature extraction, and classification. In previous methods, age was the only primary factor taken for analysis and disease prediction; by changing the primitive attributes in the study, better predictions are obtained compared with the older techniques. Second, in this work a novel hybrid classifier is designed by combining two different classifiers: the support vector machine (SVM) and the k-nearest neighbor (k-NN). Features of the dataset are extracted using the SVM classifier, and k-NN is used to provide the final classification result. In contrast to existing techniques, the proposed model performs better in terms of accuracy and execution time. Keywords Predictive analysis · SVM · K-NN · Heart disease
1 Introduction The massive volume of data is being stored in files, databases, and various other applications. The secret and confidential data should not be stored in any place. Thus, it is extremely imperative to determine and present a system that can hold all the data and information [1]. The data should be stored securely and carefully. Occasionally, extracting and using data from big databases becomes tricky for the users. Data mining is utilized to remove this issue. The process of selecting, choosing, A. Bharadwaj · D. Yadav · A. K. Yadav (B) Department of Computer Science & Engineering, NIT Hamirpur (HP), Hamirpur 177005, India e-mail: [email protected] A. Bharadwaj e-mail: [email protected] D. Yadav e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_46
and extracting only functional and significant data for a specified purpose is called data mining. In the current lifestyle, most fatalities occur due to heart failure. This disease can be caused by smoking, excessive alcohol intake, and similar factors. The heart is the main organ of the human body, and any disturbance in its functioning disturbs other body organs as well. Family history, hypertension, a high cholesterol level, age, and poor diet are some of the factors that cause cardiovascular diseases. Narrowing of the blood vessels increases the blood pressure, which becomes a cause of heart attack. The main cause of cardiovascular disease is smoking; according to a survey, about 40% of people all over the world die due to this habit. Smoking limits the oxygen supply within the body, disrupts the blood flow, and tightens the blood vessels. A variety of data mining algorithms have been developed for the prediction of heart-related diseases. A decision tree classifier is utilized to classify the profiles of patients who are suffering from heart diseases, the Naive Bayes algorithm can be used to predict the chances of chronic diseases, and, furthermore, a neural network minimizes the errors that occur during prediction. These algorithms are also utilized for the classification of medical profiles [2] and carry out a proper diagnosis of the patient; in case of some variation, information about the risk level is given to the patients. Medical practitioners try to diagnose cardiovascular diseases at an early phase using all of these classification algorithms. Classification is a data mining technique in which classes are assigned to the gathered data to obtain more accurate analysis and predictions. Classification can analyze extremely large datasets in an efficient way: decisions are made and behavior is predicted on the basis of queries, generating an effective set of classification rules. Initially, a training dataset is generated with the help of a specified set of features; the main purpose of the classification is to learn how these features lead to the result. Heart diseases can be predicted using different classification models, including decision trees, SVM, neural networks, Naive Bayes, and k-NN. These classification algorithms are utilized to predict the risk level of heart disease at preliminary stages. Prediction at an early stage is advantageous to both doctors and patients, as patients may get proper treatment in time [3]. The main contribution of this paper is a novel hybrid classifier obtained by combining two different classifiers: the support vector machine (SVM) and the k-nearest neighbor (k-NN).
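A minimal sketch of the hybrid idea described above, an SVM stage whose output feeds a k-NN stage for the final decision, is shown below using scikit-learn. The synthetic 13-attribute data stands in for a UCI-style heart-disease table, and the way the SVM output is combined with the original attributes is an assumption for illustration; the paper's exact feature handling may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

class HybridSvmKnn:
    """SVM stage produces a decision-function feature; k-NN gives the final class."""
    def __init__(self, k=5):
        self.scaler = StandardScaler()
        self.svm = SVC(kernel="rbf")
        self.knn = KNeighborsClassifier(n_neighbors=k)

    def _augment(self, X):
        margin = self.svm.decision_function(X).reshape(len(X), -1)
        return np.hstack([X, margin])            # original attributes + SVM margin

    def fit(self, X, y):
        X = self.scaler.fit_transform(X)
        self.svm.fit(X, y)
        self.knn.fit(self._augment(X), y)
        return self

    def predict(self, X):
        X = self.scaler.transform(X)
        return self.knn.predict(self._augment(X))

# Synthetic stand-in for a 13-attribute heart-disease table (age, bp, cholesterol, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = HybridSvmKnn(k=7).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
```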
2 Literature Review Dewan and Sharma [4] stated that predicting heart disease was considered one of the most complex jobs in the healthcare sector. Computer-based support systems could be developed for achieving accurate and low-cost therapy, and such systems helped in good decision-making. Most hospitals use information systems to manage their medical data, and a massive volume of data was produced by
these systems in the form of images, text, charts, and numbers. However, medical decision-making was rarely supported by this data, which held a huge amount of hidden and unexplored information. This raised a significant question about how to retrieve constructive information from the available data. Thus, there was a need for an effective scheme to help practitioners predict this disease prior to its occurrence. The major aim of that work was to create a model for determining and extracting unidentified information about the disease. Gandhi and Singh [5] stated that a huge amount of data was produced in the field of medical science, but this available information was not used appropriately; the situation in the medical system was called "data rich" but "knowledge poor." Efficient analysis techniques for discovering relations and patterns in medical data were not available, and in such a situation data mining could be used as a good solution. Various data mining methods could be employed for this purpose, and the main aim of the work was to use current data mining techniques for cardiovascular disease prediction. Babu et al. [6] stated that the hidden patterns in healthcare databases could be explored using efficient medical data mining, and medical diagnosis could be performed using these patterns. The collected data was required in a standard format for this purpose. Fourteen features were extracted from the clinical records, and the possibility of heart disease in a person could be predicted on the basis of these features. For predicting cardiovascular disease, these attributes were fed to the k-means algorithm, the MAFIA algorithm, and a decision tree classifier, and the data mining method was used to support the diagnosis and treatment of the disease. Saboji et al. [7] proposed a scalable framework that used healthcare data, based on certain features, for cardiovascular disease prediction. Predicting and diagnosing heart disease with a smaller number of features was the main aim of the work, and random forest on Apache Spark was used as the prediction solution. Meena et al. [8] presented a generalized survey to demonstrate the significance of data mining techniques for Heart Disease Data (HDD). These techniques were reviewed for learning purposes over an educational dataset of literature from 2006 to 2016, and a classification scheme was proposed to categorize the mining techniques of this period. The corroboration of the analysis and the classification practice was performed in parallel. The outcomes of the work indicated that heart disease was a popular research area among researchers and that data mining on Heart Disease Data was implemented using two most common models. Raju et al. [9] analyzed that cardiovascular disease was the leading cause of death and a major cause of serious long-lasting disability, and that a person can be attacked by this disease very suddenly. The data related to the healthcare sector was called information-rich but knowledge-poor; thus, timely and accurate diagnosis of patients was a challenging task for medical support, and a hospital could lose its reputation through an illogical diagnosis. Accurate heart disease diagnosis was the leading biomedical concern, and developing an efficient treatment using data mining methodologies was the main aim of the work.
Thomas and Princy [10] used data mining methodologies to predict heart disease. The major aim was to provide insight into the detection of heart disease risk levels using data mining techniques. Several surveys have discussed various data mining methodologies and classifiers used to diagnose cardiovascular disease efficiently and effectively. The analytical outcomes showed that different researchers used many technologies and different numbers of attributes in their studies; thus, the accuracy rates provided by different technologies depended on the attributes chosen. Jabbar et al. [11] analyzed that a large number of deaths worldwide were caused by coronary disease and that disease diagnosis was a wearisome task, so an intelligent decision support system was required for prediction. A patient was classified as normal or as having heart disease by means of data mining methods. Hidden Naïve Bayes (HNB) is a data mining model that relaxes the conventional Naïve Bayes conditional independence assumption, and the HNB model was implemented in the proposed work to predict and classify heart disease. Alex et al. [12] used data mining methodologies for heart disease prediction; detecting and healing heart disease using data mining was the major objective of the work. Data were gathered from Jubilee Mission Hospital, Thrissur, by communicating with patients and from the discharge summaries of individual patients. Overall, 20 features of more than 2200 patients were gathered in this manner, and the gathered data was then organized in Excel format so that various data mining algorithms could be applied to it. Twenty features were extracted from the clinical records, the chances of heart disease in a person were predicted on the basis of these features, and the features were applied to different classifiers. Radhimeenakshi [13] stated that coronary heart disease classification could be advantageous to doctors, the main purpose being to find accurate results quickly and automatically. Accurate prediction of the occurrence of heart disease could improve the survival chances of patients. The major purpose of the work was to inculcate the use of AI tools for foreseeing the likelihood of heart disease, and the classes of heart illness were included in the study. For this purpose, Support Vector Machine (SVM) and Artificial Neural Network (ANN) classification algorithms were used, and the algorithms were tested in terms of accuracy and training time. A medical decision support framework was proposed that represented coronary disease prediction in a reasonable, purposeful, accurate, and quick manner. The Cleveland Heart Database and Statlog Database, obtained from the UCI machine learning repository, were used in this work. Salma Banu and Swamy [14] discussed different data mining (DM) models for heart disease prediction. Data mining plays an important role in constructing smart models for the healthcare sector; such models use patients' databases, which contain risk factors related to heart illness, for heart disease detection. Predicting heart disease before it occurs could be advantageous to patients. The data mining tools were used to analyze
the massive volume of data provided by clinical diagnosis, and the valuable information called knowledge was extracted from it. Mining, in this context, is the method of exploring huge databases to extract hidden patterns and previously unknown associations, and of discovering information from healthcare data for heart disease prevention. Several data mining classifiers were employed to predict cardiovascular disease.
3 Methodology The technique by which future possibilities can be predicted by studying previous patterns of events is known as prediction analysis. To cluster similar and dissimilar data based on similarity, the k-means clustering technique is applied. The dataset is given as input, and the arithmetic mean is calculated by the k-means clustering algorithm; this calculated mean is taken as the central point of the data. The Euclidean distance from the central point is calculated, and similar and dissimilar points are placed in separate clusters. To cluster the points that remain unclustered, a backpropagation step is applied, which also improves the accuracy of clustering. The framework used to perform hybrid classification is shown in Fig. 1. The data given as input to the preprocessor has its missing and redundant values removed; this step is called the pre-processing phase. The preprocessed data is divided randomly into two datasets, namely a training set and a test set. To perform feature extraction, the SVM classifier is applied; the feature extraction process generates the relationship between the attribute set and the target set. The SVM classifier draws a hyperplane, and based on the classes of the target set, the data is categorized into classes using this hyperplane. In the one-against-rest approach, N classifiers are generated, one for each class. A feature vector ϕ(X, Y) is used to generate a
Fig. 1 Framework for hybrid classification method
two-class classifier as given by Eq. (1). For generating this feature vector, the input features and the class of the data are paired. At testing time, the classifier selects the class

y = arg max_y W^T ϕ(X, Y)    (1)
The margin generated at training time defines the gap between the value of the correct class and the class closest to it. To perform heart disease prediction, the classification approach is applied in the final phase using the k-NN algorithm. A number of centroid points is defined, and the Euclidean distance from these centroids is calculated; points with similar distances are grouped into one class, while points with differing distances are placed in other classes. The heart disease predictions are obtained through this classification. The k-nearest neighbor classifier's decision is based on the Euclidean distance between a test sample and specific training samples. The Euclidean distance between samples x_i and x_l (l = 1, 2, …, n) and the corresponding Voronoi cell are defined as

d(x_i, x_l) = sqrt((x_i1 − x_l1)^2 + (x_i2 − x_l2)^2 + · · · + (x_ip − x_lp)^2)

R_i = {x ∈ R^p : d(x, x_i) ≤ d(x, x_m), ∀ i ≠ m}    (2)
A Voronoi cell encapsulates the neighboring points nearest to each sample; the Voronoi cell R_i for sample x_i is defined in Eq. (2), where x denotes any point belonging to the cell R_i.
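As an illustration of this two-stage idea, the sketch below chains an SVM-based feature extraction step with a k-NN classifier. It assumes the scikit-learn library, a generic numeric feature matrix X with binary labels y, a linear kernel, and k = 5; these are illustrative assumptions rather than the authors' exact configuration.

# Hedged sketch: SVM-based feature extraction followed by k-NN classification.
# X is a numeric feature matrix and y holds 0/1 heart-disease labels, assumed
# to be loaded elsewhere (e.g., from a UCI-style table).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def hybrid_svm_knn(X, y, k=5):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    scaler = StandardScaler().fit(X_train)                    # pre-processing phase
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
    svm = SVC(kernel="linear").fit(X_train, y_train)          # hyperplane-based extractor
    # The signed distance of each sample from the hyperplane acts as the extracted feature.
    f_train = svm.decision_function(X_train).reshape(-1, 1)
    f_test = svm.decision_function(X_test).reshape(-1, 1)
    knn = KNeighborsClassifier(n_neighbors=k).fit(f_train, y_train)  # final k-NN prediction
    return accuracy_score(y_test, knn.predict(f_test))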
4 Results and Discussion This research work is based on heart disease prediction using machine learning methods. The proposed method is a combination of two classifiers, k-NN and SVM. The heart disease prediction approach has three steps: pre-processing, feature extraction, and classification. In the proposed method, the SVM classifier is applied for feature extraction and the k-NN classifier is applied for the final prediction. The performance of the proposed model and the existing models is tested in terms of accuracy, precision, and recall. The data is collected from the UCI repository, which provides four heart disease datasets: Cleveland, Hungarian, Switzerland, and VA. The accuracy, precision, and recall of various classifiers, namely support vector machine, random forest, Naïve Bayes, k-NN, decision tree, and the proposed model, are compared. The performance of the classifiers is given in Tables 1, 2, and 3 (Figs. 2, 3, and 4).
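A hedged sketch of how such a comparison could be produced is given below; it assumes scikit-learn, a pre-existing train/test split, and default classifier settings, none of which are taken from the paper itself.

# Hedged sketch of the comparison behind Tables 1-3; classifier settings are defaults,
# and X_train, X_test, y_train, y_test are assumed to come from a prior split.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

def compare_classifiers(X_train, X_test, y_train, y_test):
    models = {"Decision tree": DecisionTreeClassifier(),
              "k-NN classifier": KNeighborsClassifier(),
              "Naive Bayes": GaussianNB(),
              "Random forest": RandomForestClassifier(),
              "SVM": SVC()}
    for name, model in models.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        print(name,
              round(100 * accuracy_score(y_test, y_pred), 2),
              round(100 * precision_score(y_test, y_pred), 2),
              round(100 * recall_score(y_test, y_pred), 2))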
Table 1 Accuracy analysis

Classifier name     Percentage
Decision tree       84.88
k-NN classifier     36.58
Naïve Bayes         100
Random forest       82.92
SVM                 100
Proposed method     100

Table 2 Precision analysis

Classifier name     Percentage
Decision tree       93
k-NN classifier     50
Naïve Bayes         100
Random forest       83
SVM                 100
Proposed method     100

Table 3 Recall analysis

Classifier name     Percentage
Decision tree       85
k-NN classifier     65
Naïve Bayes         100
Random forest       84
SVM                 100
Proposed method     100

Fig. 2 Accuracy analysis
Fig. 3 Precision analysis
Fig. 4 Recall analysis
5 Conclusion Data mining is the process through which interesting knowledge and patterns can be extracted to analyze data, and several data mining tools are applied to analyze different kinds of data. Proper analysis is important for achieving highly efficient results. The most important outcomes of data mining are classification, sequence analysis, prediction, and association rule mining. A key difficulty of prediction analysis is the large number of attributes involved. In this work, prediction analysis is performed by combining feature extraction and classification: the SVM classifier is applied for feature extraction, and the extracted features are given as input to the k-NN classifier to generate the predicted results. A Python simulator
is used to implement the proposed methodology. The results are analyzed in terms of accuracy and execution time: applying the proposed hybrid method yields an improvement of up to 8% in accuracy and a 5% reduction in execution time.
References 1. Dey, M., & Rautaray, S. S. (2014). Study and analysis of data mining algorithms for healthcare decision support system. International Journal of Computer Science and Information Technologies, 6(3), 234–239. 2. Chadha, A., & Kumar, S. (2014). An improved k-means clustering algorithm: a step forward for removal of dependency on K. In 2014 International Conference on Reliability, Optimization and Information Technology-(ICROIT 2014) (Vol. 8(1), pp. 6–8). 3. Bahety, A. (2014). Extension and evaluation of ID3- decision tree algorithm. ICCCS, ICCC, 4(1), 23–48. 4. Dewan, A., & Sharma, M. (2015). Prediction of heart disease using a hybrid technique in data mining classification. In 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi (pp. 704–706). 5. Gandhi, M., & Singh, S. N. (2015). Predictions in heart disease using techniques of data mining. In 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), Noida (pp. 520–525). https://doi.org/10.1109/ablaze. 2015.7154917. 6. Babu, S., et al. (2017). Heart disease diagnosis using data mining technique. In 2017 International Conference of Electronics, Communication and Aerospace Technology (ICECA), Coimbatore (pp. 750–753). https://doi.org/10.1109/iceca.2017.8203643. 7. Saboji, R. G. (2017). A scalable solution for heart disease prediction using classification mining technique. In 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), Chennai (pp. 1780–1785). https://doi.org/10.1109/icecds.2017.838 9755. 8. Meena, G., Chauhan, P. S., & Choudhary, R. R. (2017). Empirical study on classification of heart disease dataset-its prediction and mining. In 2017 International Conference on Current Trends in Computer, Electrical, Electronics and Communication (CTCEEC), Mysore (pp. 1041–1043). https://doi.org/10.1109/ctceec.2017.8455127. 9. Raju, C., Philipsy, E., Chacko, S., Padma Suresh, L., & Deepa Rajan, S. (2018). A Survey on Predicting Heart Disease using Data Mining Techniques, Conference on Emerging Devices and Smart Systems (ICEDSS) (pp. 253–255). https://doi.org/10.1109/icedss.2018.8544333. 10. Thomas, J., & Princy, R. T. (2016). Human heart disease prediction system using data mining techniques. In 2016 International Conference on Circuit, Power and Computing Technologies (ICCPCT), Nagercoil (pp. 1–5). https://doi.org/10.1109/iccpct.2016.7530265. 11. Jabbar, M. A., Deekshatulu, B. L., & Chandra, P. (2016). Prediction of heart disease using random forest and feature subset selection. In Advances in Intelligent Systems and Computing (pp 187–196). https://doi.org/10.1007/978-3-319-28031-8_16. 12. Mamatha Alex, P., & Shaji, S. P. (2019). Prediction and diagnosis of heart disease patients using data mining technique. In International Conference on Communication and Signal Processing (ICCSP) (pp. 0848–0852). https://doi.org/10.1109/iccsp.2019.8697977. 13. Radhimeenakshi, S. (2016). Classification and prediction of heart disease risk using data mining techniques of support vector machine and artificial neural network. In 3rd International Conference on Computing for Sustainable Global Development (pp. 3107–3111). 14. Salma Banu, N. K., & Swamy, S. (2016). Prediction of heart disease at early stage using data mining and big data analytics: A survey. In International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT) (pp. 256–261). 
https://doi.org/10.1109/iceeccot.2016.7955226.
Job Recommendation System Using Content and Collaborative-Based Filtering Rahul Pradhan, Jyoti Varshney, Kartik Goyal, and Latesh Kumari
Abstract Dealing with the huge amount of information on the web, a job seeker constantly spends hours discovering useful postings; our system aims to make this process easy. Recommendation systems usually work either by exploiting relations among known features and the content that describes services and products (content-based filtering) or by using the overlap of comparable users who interacted with or rated the target item (collaborative filtering). We present a comparison between content-based and collaborative filtering. Keywords Recommendation system · Content-based filtering · Collaborative-based filtering
1 Introduction Whenever we search for anything on the Internet, in any field, some recommendation algorithm is applied. Nowadays, recommendation is even useful for jobs, helping users find openings quickly without a personal referral. These recommendations aim to fulfil the demands of both job seekers, who have particular preferences concerning their next job move, and recruiters, who try to hire the most appropriate candidates for an offered job [1]. The job market is growing on a day-by-day and year-by-year basis. A job seeker dealing with the enormous amount of job information
on the Internet always spends hours finding the useful ones that match his/her personalized preferences. To reduce this work, we design and implement a job recommendation system. Such recommendations can achieve a higher precision score and are more relevant to the user's preferences [2]. Recently, job recommendation has attracted a great deal of research attention and has played a serious role in companies' online recruiting websites. A good and fast job recommendation precisely matches user and company profiles, but it should not depend on profile matching alone: users should also get job vacancies as per requirements such as job title and job description. In this respect it differs from a traditional recommendation system that simply pushes jobs to the user [3]. In a job recommendation system there are various types of users, such as job applicants and recruiters, and such systems are very useful in the COVID-19 situation for getting jobs quickly without wasting time on searches. Many e-commerce websites, the most frequent application of recommendation algorithms, use collaborative filtering and content-based filtering algorithms without considering a user's documents and an item's properties, so we propose an algorithm that enhances item-based collaborative filtering [2]. The aim of this paper is to provide an effective method of job recommendation for online job search in a personalized manner, helping users to find or search for jobs quickly and conveniently.
2 Related Work 2.1 Recommendation Algorithms Recommendation systems suggest to users the items they are most likely to prefer. In the last two decades, a lot of work has been done on designing recommendation systems that provide the best item recommendations for users. Recommendation systems are beneficial to both service providers and users; two of the most popular applications are Amazon.com and Netflix. Such systems are implemented using data mining and machine learning algorithms. "Recommendation System can be classified mainly in two groups: Preference-based filtering and Rating-based techniques. Preference-based recommender systems focus on predicting the correct relative order of items for a given user. Rating-based recommender systems predict the absolute values of ratings when individual users give ratings" [4]. A major part of building a user profile is extracting features and information from unstructured information, which is stored in textual form generated by humans. So, we need to extract meaningful information
from a large amount of unstructured data for the job recommendation system. Finding similar jobs efficiently relies on calculating the resemblance between people based on their profiles. There are various common similarity computation measures, such as cosine similarity, Mean Squared Difference, Spearman Rank Correlation, and Pearson Correlation. Cosine similarity is estimated as the cosine of the angle between two vectors and is mainly used in content-based recommendation systems. It is defined as follows:

cos(u, v) = (Σ_k u_k v_k) / (sqrt(Σ_k u_k^2) · sqrt(Σ_k v_k^2))    (1)
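A minimal sketch of Eq. (1), assuming plain Python lists as the two profile vectors:

# Hedged sketch of Eq. (1): cosine similarity between two profile vectors.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

print(cosine_similarity([1, 0, 1, 1], [1, 1, 0, 1]))  # two users' skill vectors, ~0.67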
2.2 Content-Based Filtering The purpose of content-based recommendation is to recommend existing jobs whose content is similar to what the target job seekers are looking for. The content includes the personal information of users and their job desires, the description of jobs posted by recruiters, and the descriptions of various companies' backgrounds. Content-based recommendation extracts features of people and jobs and computes their similarity; the outcome is a set of job descriptions that are most similar to the user's preferences [5]. In other words, "content-based filtering uses item features to recommend other items similar to what the user likes which is based on his/her previous actions or explicit feedback which is given by the user" [6]. It performs badly in multimedia fields such as movie or music recommendation, since it is difficult to extract attributes of the items and to track the user's preferences over time [2]. Figure 1 shows content-based filtering.
2.3 Collaborative-Based Filtering Collaborative-based filtering is also known as the user-to-user correlation method. The key step in collaborative filtering is calculating the similarity among users. This family of recommendation algorithms includes memory-based and user-based approaches [6]; memory-based collaborative filtering in turn includes user-based and item-based correlation techniques. Figure 2 shows collaborative-based filtering. Collaborative filtering is a recommendation approach that bases its proposals and predictions on the ratings or behavior of other users in the system [2].
Fig. 1 Content-based filtering
Fig. 2 Collaborative-based filtering
There are two primary algorithms: • User-based collaborative filtering: find other users whose past rating behavior is analogous to that of the present user and use their ratings on other items to predict what the present user will like. User-based CF is a fairly simple technique; it examines the past interests of the user and, based on those past results, produces an accurate result for the candidate who has applied for employment [2]. • Item-based collaborative filtering: this is used to forecast a user's preference on the basis of the similarity between items, computed from users' ratings and behavior.
3 Proposed Approach We are building a highly personalized job recommendation system in which users share their preferences and, from a predefined set of jobs, we give them the most suitable jobs based on their skills. We try different models and check which one suits them best. The dataset contains information about what the user has worked on, which platforms they have used, which languages they know, and so on. The two approaches we have worked on are a content-based filtering approach and a collaborative filtering approach. The primary assumption behind the content-based filtering approach is that people are likely to choose jobs in their domain of expertise, since a person is very unlikely to switch to another domain; so we consider various skills of the user, such as languages, frameworks, and platforms, match the user profiles from the Stack Overflow dataset against the job profiles from the US job postings dataset, and generate the top ten recommendations for every single user [7]. The second way we approached this problem was through collaborative filtering, which recommends jobs based on user-user similarity: if two users are very similar to each other based on their skills, they will be given almost the same recommendations. There are two types of this filtering, user-item collaborative filtering and item-item collaborative filtering. User-item collaborative filtering recommends to a user the items that similar users liked, whereas item-item collaborative filtering takes an item and finds users who also consumed the same item. What we implemented in our project is the user-item collaborative filtering method, where the items are skills: we build a matrix of similar users based on their skills and, based on this matrix, we recommend similar users to similar jobs [8].
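A minimal sketch of this user-item idea is given below; the users, skills, job vectors, and scoring rule are illustrative assumptions, not the real Stack Overflow or US job-posting data.

# Hedged sketch of user-item collaborative filtering over a binary user x skill matrix.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

users = ["u1", "u2", "u3"]
skills = ["python", "java", "sql", "html"]
user_skill = np.array([[1, 0, 1, 0],   # u1
                       [1, 0, 1, 1],   # u2
                       [0, 1, 0, 1]])  # u3
jobs = {"data analyst": [1, 0, 1, 0], "web developer": [0, 1, 0, 1]}

user_sim = cosine_similarity(user_skill)              # user-user similarity matrix
profile = user_sim @ user_skill                       # neighbour-weighted skill profile
scores = profile @ np.array(list(jobs.values())).T    # match each profile against jobs
for i, u in enumerate(users):
    print(u, "->", list(jobs)[int(np.argmax(scores[i]))])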
4 Experiment Setup and Result Dataset: A rich set of descriptor, categorical, and numerical features of users and items is described in the dataset. Descriptor features take a vocabulary size of around 200 K, and categorical features have around dozens of values [5]. In this project, we use two datasets: (a) StackOverflow Dataset: this dataset was obtained from a survey conducted by Stack Overflow in 2018. It contains information about users, their coding experience, and their experience in various domains of information technology such as languages, frameworks, and platforms [9]. (b) US Technology Jobs Dataset: this dataset contains various IT job postings in the United States, including job title, company, work location, job description, and type of employment [10].
In recommendation systems, a document in textual format is represented in vector form, where each element has a value that represents the importance of the associated term for the document. This vector is generally constructed with the use of a "bag-of-words" representation and a weighting function. "Bag-of-words" creates a set of vectors containing the count of word occurrences in the document, and the importance of a term in the document is calculated by a weighting function. Weighting functions are classified into three primary groups: local weighting functions, global weighting functions, and combined local and global weighting functions. Local weighting functions calculate the weight of a given term inside a given document; Bool (Boolean weight), TF (normalized Term Frequency), and LTF (Log Term Frequency) are the most common methods. They are defined as follows:

Bool_{t,d} = 1 if t ∈ d, 0 otherwise    (2)

TF_{t,d} = f_{t,d} / max_k f_{k,d}    (3)

LTF_{t,d} = log(1 + f_{t,d})    (4)

where f_{t,d} is the frequency of the term t in document d. To calculate the weight of a given term in a whole corpus, global weighting functions are used. IDF (Inverse Document Frequency) and normalized Entropy, the most common global weighting functions, are defined as follows [11]:

IDF_t = 1 + log(N / n_t)    (5)

Entropy_t = 1 + Σ_d (p_{dt} log p_{dt}) / log(N)    (6)

where N is the total number of documents in the corpus, n_t is the number of documents that contain the term t, p_{dt} = f_{t,d} / Σ_k f_{t,k} is the probability that the term t ∈ d, and f_{t,d} is defined as in (3). Combinations of local and global weighting functions give better results than local weighting functions alone. TF-IDF and Log-Entropy are typical combinations; they are defined as follows:

TF-IDF_{t,d} = TF_{t,d} × IDF_t    (7)

Log-Entropy_{t,d} = LTF_{t,d} × Entropy_t    (8)

where d is a document and t is a term [5].
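A small sketch of the TF-IDF weighting in Eqs. (3), (5), and (7), written directly from those definitions over a toy corpus; the corpus and the queried term are made up for illustration.

# Hedged sketch of Eqs. (3), (5) and (7) over a toy corpus of job descriptions.
import math
from collections import Counter

corpus = ["java developer with sql", "python developer", "sql analyst"]
docs = [Counter(text.split()) for text in corpus]
N = len(docs)

def tf(term, doc):                 # Eq. (3): normalised term frequency
    return doc[term] / max(doc.values())

def idf(term):                     # Eq. (5): inverse document frequency
    n_t = sum(1 for d in docs if term in d)
    return 1 + math.log(N / n_t)

def tf_idf(term, doc):             # Eq. (7): local weight x global weight
    return tf(term, doc) * idf(term)

print(round(tf_idf("sql", docs[0]), 3))   # weight of "sql" in the first document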
Fig. 3 Popularity of languages
Validation We validate by clustering the job profiles and the user profiles independently and then measuring the similarity between the job clusters and the user clusters [12]. For a validation set in which we recommend jobs to every user, we take the most similar job cluster profile for each user cluster and measure how similar the two are; in this way we can estimate how accurate the model is [13]. The languages considered in the graphs are: '.net', 'bash', 'c#', 'c++', 'css', 'html', 'java', 'javascript', 'perl', 'php', 'python', 'sql'. In Fig. 3, based on a random sample drawn from both datasets, JavaScript is found to be the most popular language in job postings while the number of developers who know it is comparatively low; this ratio is lower for languages such as Java and Python. The domains considered are: 'Back End', 'Data or business analyst', 'Database Administrator (DBA)', 'Designer', 'DevOps', 'Enterprise application', 'Front End', 'Full Stack', 'Information Security', 'Mobile Developer', 'Network Engineer', 'Product Manager', 'QA/Test Developer', 'Software Developer/Java Developer', 'System Administrator', 'Web Developer'. The same analysis is performed in Fig. 4 for domains: Back End is a very popular domain in huge demand, but comparatively few developers work in it. The gap is smaller for data science, which means that the number of data scientists nearly matches the requirement.
Fig. 4 Popularity of domains
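The clustering-based validation described above could be sketched roughly as follows, assuming scikit-learn, TF-IDF profile vectors, and two clusters; the texts and cluster count are placeholders rather than the actual experimental setup.

# Hedged sketch of the clustering-based validation: cluster users and jobs separately,
# then compare cluster centroids. Texts and cluster counts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

user_texts = ["python sql backend", "java spring backend", "html css frontend"]
job_texts = ["backend developer python sql", "frontend developer html css"]

vec = TfidfVectorizer().fit(user_texts + job_texts)   # shared vocabulary
U, J = vec.transform(user_texts), vec.transform(job_texts)

user_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(U)
job_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(J)

# Higher centroid-to-centroid similarity suggests the recommended job clusters
# line up with the user clusters.
print(cosine_similarity(user_km.cluster_centers_, job_km.cluster_centers_))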
5 Conclusion There are a lot of jobs online, and it is hard for users to pick the specific job that matches their skills; that is why we need a recommendation system, so that users get jobs matching their skills. Another reason is the rapid growth of the IT sector: it is an ever-growing sector with a large number of jobs, and to give highly personalized suggestions we need to build a recommendation system dedicated to it. Many recommendation systems deal with recommending careers and jobs in general, but we have chosen to focus specifically on IT because this allows us to look at specific features of the information technology field such as languages, frameworks, and so on. Many recommenders also deal only with structured data, but we have specifically chosen to build a model capable of dealing with unstructured data in natural-language English, converting it into a structured format, and then making a recommendation. In this paper, we show the difference between content-based filtering and collaborative-based filtering and how both are useful in this job recommendation system.
References 1. Abel, F., Benczúr, A., Kohlsdorf, D., Larson, M., & Pálovics, R. (2016). RecSys challenge 2016: Job recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems - RecSys ‘16. New York, New York, USA: ACM Press. 2. Zhang, Y., Yang, C., & Niu, Z. (2014). A research of job recommendation system based on collaborative filtering. In: 2014 Seventh International Symposium on Computational Intelligence and Design (pp. 533–538). IEEE. 3. Gugnani, A., & Misra, H. (2020). Implicit skills extraction using document embedding and its use in job recommendation. In Proceedings of the… AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence (Vol. 34(08), pp. 13286–13293). 4. Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17, 734–749. 5. Siting, Z., et al. (2012). Job recommender systems: A survey. In 2012 7th International Conference on Computer Science & Education (ICCSE) (pp. 920–924). IEEE. 6. Content-based filtering. Retrieved November 09, 2020, from https://developers.google.com/ machine-learning/recommendation/content-based/basics. 7. Retrieved December 08, 2020, from https://www.researchgate.net/profile/Ajib_Susanto/pub lication/339051635_Recommendation_System_of_Information_Technology_Jobs_using_ Collaborative_Filtering_Method_Based_on_LinkedIn_Skills_Endorsement/links/5e3acf734 58515072d818a2d/Recommendation-System-of-Information-Technology-Jobs-using-Collab orative-Filtering-Method-Based-on-LinkedIn-Skills-Endorsement.pdf. 8. Neve, J., & Palomares, I. (2020). Hybrid reciprocal recommender systems: Integrating itemto-user principles in reciprocal recommendation. In: Companion Proceedings of the Web Conference 2020. New York, NY, USA: ACM. 9. https://www.kaggle.com/stackoverflow/stack-overflow-2018-developer-survey#survey_res ults_public.csv. 10. https://www.kaggle.com/PromptCloudHQ/us-technology-jobs-on-dicecom. 11. Diaby, M., et al. (2013). Toward the next generation of recruitment tools: An online social network-based job recommender system. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - ASONAM ‘13 (pp. 821–828). New York, New York, USA: ACM Press. 12. Nguyen, Q.-D., Huynh, T., & Nguyen-Hoang, T.-A. (2016). Adaptive methods for job recommendation based on user clustering. In 2016 3rd National Foundation for Science and Technology Development Conference on Information and Computer Science (NICS) (pp. 165–170). IEEE. 13. Sivaramakrishnan, N., Subramaniyaswamy, V., Ravi, L., Vijayakumar, V., Gao, X. Z., & Sri, S. L. R. (2020) An effective user clustering-based collaborative filtering recommender system with grey wolf optimisation. International Journal of Bio-Inspired Computation, 16, 44.
Recommendation System for Business Process Modelling in Educational Organizations Anu Saini, Astha Jain, and J. L. Shreya
Abstract A recommendation system advocates products and services to potential users who utilize the system for their own benefit. Such a system is generally an application that provides a platform to explore various options and get the best out of them. As of today, many recommendation systems serve requirements for products like movies, blogs, songs, news articles, and other personalized services. Similarly, there is a need for a system that proposes business process models, which are in great demand among business managers, analysts, entrepreneurs, and administrative people working for large organizations and educational institutions. To address this, an approach for building an effective and efficient recommendation system is proposed that involves all the specifications of a business process model for educational institutions. Thus, we present an empirical and tested, recommendation-based process model support system. Keywords Business process models · Collaborative filtering · Education system · Recommendation system
1 Introduction Business Process Modelling (BPM) is a technique for representing the processes of an organization or an enterprise so that the present process may be analysed, automated, and optimized [1]. A BPM is a model that defines and describes the ways in which various operations are carried out to fulfil the needs of a company to sustain itself in the market [2]. But then the question arises: which process model will suit the requirements, and will it satisfy the objectives of increased profit, reduced costs, decreased failures, and increased quality? Hence, we come to recommendation techniques in business process modelling that can address this problem of obtaining an efficient process model which goes right with the flow of working of an
organization. Business process modelling is the systematic and logical illustration of an organization's business processes. Another driving reason that motivates us to build such a recommendation system is the underestimation of the power of a process model, which could not only enhance the performance of an enterprise but also help entrepreneurs make a good start to their business. Most recommendation platforms work extensively in the field of entertainment, but very few suggest to modellers how and what should be done to improve an organization's processes. Education has been and will always be an intrinsic part of one's growth and development. Educational institutions like schools and colleges, as well as virtual organizations, play a vital role in imparting knowledge to learners [3]. The plan that must be followed while performing academic operations needs to be well formed; hence, educational institutions have people like administrators who deal in business process management to enforce these responsibilities [4]. There are many business process modelling techniques used to implement and represent business process models, such as Business Process Modelling Notation (BPMN), UML diagrams, flowcharts, and Coloured Petri-Nets (CPNs). A BPM is generally studied and monitored by the business management team, consisting of administrative people, managers, and other staff within an organization. One can build business process models for different kinds of organizations [5], such as educational institutions, the information technology industry, the automotive industry, medical institutions, and the banking sector. Since a business process model can be a good way to represent and analyse the processes of educational institutions, we take this realization one step further. Not all available business process models may be applicable or suitable to the requirements put forward by some institutions. So, what do we do now? One solution could be to design one's own business process model and implement it. Another could be to choose a model that has already been implemented by some other institution and has proved to be reliable. A recommendation system for business process modelling concerning the specifications of an educational institution is proposed in this paper. The rest of the paper is organized as follows: Sect. 2 describes the related work, Sect. 3 presents the proposed approach and the user flow, followed by simulation and results in Sect. 4. Section 5 concludes the paper along with plans to improve the work.
2 Related Work Today, as many of the world's educational institutions demand effective management that can perform even the most complex tasks with ease, several approaches have contributed to a solution. When the Bologna declaration was signed in 1999 at the University of Bologna, an increase in student and teacher mobility was seen, and the Bologna process made academic organizations more virtual [3]. Educational institutions now work as a single virtual organization for higher
education to fulfil academic requirements, as they offer services like exam registration and library applications through virtual windows. Cotofrei and Stoffel presented an approach with business process models for such academic virtual organizations. They adopted BPMN as the standard notation for modelling administrative processes and supported the concept by building and analysing a BPMN diagram for the academic process "Course Registration" while explaining the elements of the business process modelling notation [6]. Strimbei et al. proposed the BPMN approach for university information systems, including academic processes like admission and student exchange. The BPMN method unfolds the processes, sub-processes, and components of the flows, whereas the UML diagrams give prominence to the actors involved in each activity; with a list of clearly defined specifications from the educational area, the processes can be easily captured and implemented through BPMN [4]. Koschmider, Hornung, and Oberweis presented an experimentally corroborated business process modelling editor, which assists its users in purpose-oriented modelling of processes. In the editor, the user can search for model parts and business process models using a query interface [7]. Kluza et al. presented the idea that recommendation methods capable of recommending features of business process models can provide a solution to the slow procedure of building up a business process model [6]. They categorized recommendations into two major classifications: subject-based and position-based. From here, the idea of forward completion is practically implemented in our system. Kluza et al. also suggested some machine learning approaches; however, the technique best suiting our requirements was collaborative filtering. Research done by Dwivedi et al. focuses on the idea of "learning path recommendations based on a modified variable length genetic algorithm". According to it, a learning path recommender system has been designed by employing a variable length genetic algorithm which recommends optimal learning paths for learners by considering learners' requirements and preferences. The proposed system is based on collaborative filtering methods [8]. Some advanced features inside a recommendation system are proposed and incorporated by [9, 10].
3 Proposed Approach Business process models are visual depictions of processes and are quite easy for a non-business user to understand. There are various methodologies and tools available in the market that support such process modelling, such as BPMN, but not all of them support recommendation mechanisms for BP modellers [6]. So, building a recommendation system for this purpose is a benefit to the management. In our implemented and working system, flowcharts are used to visually represent the business process model, and a user-friendly query interface assists users in getting the required results [11]. The "Recommendation System for Business Process Modelling in Educational Organizations" is implemented in two stages: 1. resolving the cold-start problem, and 2. applying collaborative filtering.
3.1 The Cold-Start Problem and Its Elimination The cold-start problem arises because the system cannot draw any inference about items, and cannot provide recommendations to users, for which it has not yet gathered sufficient information, since the users are new and may be inexperienced. Resolving the cold-start problem is a big challenge for such a recommendation system, as no history or user profile is present in the system. However, the cold-start problem can be eliminated by allowing new users to create their own business process models by choosing the processes that go right with their specifications. The system is based on this approach and provides a platform for new users to undergo a self-learning experience. On initialization of the model-building activity, the user gets a certain number of recommendations that have, in general, been the most preferred. Moving further, at each level the user selects an activity that could further complete the process. Selection of a particular activity adds that node to the flowchart; this action also increments the quantitative and qualitative value (ranking) of the selected activity by 1 (initially, it is 0 for all activities). As the user finds an activity of interest to include in the model, the user can select that activity and so proceed to the completion of the desired customized model.
Algorithm 1
Resolving the cold start problem
Connect to the database for data. Initially the activity score is zero for every activity.
1. Select OrgId=x
2. Select DeptId=y
3. Start function startModel(x,y)
4. Start function displayActivities(x,y)
5. if select_activity=true
6. activityscore+=1
7. activity->flowchart
8. End if
9. else select_activity=false
10. End function displayActivities
11. End function startModel
Initially the activity score is 0 for every activity. An activity score indicates the number of users who chose the activity. Value in x indicates the id of the type of organization selected by the user. Here the organization is “Education” only. Similarly, the value in y indicates the id of the type of department selected by the user under the organization with id “x”. Now the user is ready to create a customized model. As the user selects a particular activity, the select_activity is set to true and the score is incremented by 1 (which was earlier 0). This activity (a new node) is now added to the flowchart. In this way the cold-start problem is resolved because now, each activity can have a certain score.
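A minimal sketch of this scoring step is given below, using an in-memory dictionary as a stand-in for the database; the organization, department, and activity names are hypothetical.

# Hedged sketch of Algorithm 1's scoring: an in-memory dictionary stands in for the
# database, and the organisation/department/activity names are hypothetical.
from collections import defaultdict

activity_score = defaultdict(int)   # initially zero for every activity
flowchart = []

def select_activity(org_id, dept_id, activity):
    activity_score[(org_id, dept_id, activity)] += 1   # activityscore += 1
    flowchart.append(activity)                         # activity -> flowchart

select_activity("education", "examination", "Register for exam")
select_activity("education", "examination", "Issue admit card")
print(flowchart, dict(activity_score))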
3.2 Collaborative Filtering and User Flow Although many recommender systems are designed using the principle of content-based filtering, it did not serve our requirements [12]; rather, the collaborative filtering approach suited the system well. Collaborative filtering is a recommendation technique used by recommender systems to provide suggestions based on similarities in the product preferences of different users, as shown in Fig. 1 [13]. Collaborative filtering is the most important part of our work: it helps new users choose the activities in a process that were earlier chosen by other such users in relation to the currently selected activity. Once the cold-start problem has been resolved, the models created by earlier users have resulted in stored quantitative values (scores) for the activities listed in the database. If a user requires a process model for their organization, activities are recommended to the user based on these quantitative values, which were incremented whenever they were preferred and selected by previous users.
Algorithm 2
Collaborative filtering technique
Connect to the database for data. Here also, initially the activity score is zero for every new activity.
1. Select OrgId=x
2. Select DeptId=y
3. Start function startModel(x,y)
4. Start function displayActivities(x,y)
5. if select_activity=true
6. activityscore+=1
7. activity->flowchart
8. End if
9. End function displayActivities
10. End function startModel
Fig. 1 Example showing collaborative filtering
Algorithm 3
Recommending the activities
1. Select OrgId=x
2. Select DeptId=y
3. Start function startModel(x,y)
4. Start function displayActivities(x,y)
5. prevActId=m
6. Start function recommend(x, y, m)
7. if select_activity=true
8. selectedActId=n
9. start function actScore(m,n)
10. activityscore+=1
11. activity->flowchart
12. End function actScore
13. End if
14. End function recommend
15. End function displayActivities
16. End function startModel
Initially the activity score is 0 for every new activity. The activity score indicates the number of users who chose that activity. The value in x indicates the id of the organization selected by the user (here it is "Education" only), and the value in y indicates the id of the type of department selected by the user under the organization with id x. Now the user is ready to create the model. The function recommend(x, y, m) is called, and activities are recommended to the user by fetching the most preferred activities from the database based on their rankings after the activity with id = m. The ranking of an activity with id = n is based on the number of users who chose that activity in relation to the previous activity with id = m. The activity now chosen by the user (id = n) is added to the flowchart, and its score is also updated through the function actScore. In this way, collaborative filtering is successfully applied, as the user is being recommended the activities preferred by other users with similar interests. The flowchart in Fig. 2 shows the user flow to create a business process model. Any user who wants to create or obtain a business process model through our system can explore and proceed as follows. The organization is selected, and the various departments falling under the organization can be seen; similarly, a particular department can be chosen. After this, the user is shown many options (the activities or individual processes) from which to create a model. These options are the recommendations made by the system to the user. If the user finds a suitable activity, well and good; otherwise, the user is also provided with a search query interface to look for it. If it is found in the search results, the activity can be added to the model. However, if the search results do not display the activity the user is looking for, the user can insert the newly generated activity into the already listed options, thereby making an entry into the process model repository. Finally, a flowchart
Fig. 2 Flowchart showing user flow
of the built business process model can be obtained. In this way, recommendations help in deriving an efficient business process model.
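A minimal sketch of the recommendation step in Algorithm 3 is given below, using an in-memory transition table as a stand-in for the database; the activity names are hypothetical.

# Hedged sketch of Algorithm 3: rank candidate activities by how often earlier users
# chose them right after the current activity. The transition table and the activity
# names are hypothetical stand-ins for the database contents.
from collections import defaultdict

transition_score = defaultdict(int)   # (previous activity, next activity) -> count

def act_score(prev_act, next_act):            # actScore(m, n)
    transition_score[(prev_act, next_act)] += 1

def recommend(prev_act, candidates, top=3):   # recommend(x, y, m), simplified
    return sorted(candidates,
                  key=lambda a: transition_score[(prev_act, a)],
                  reverse=True)[:top]

act_score("Fill admission form", "Pay fees")
act_score("Fill admission form", "Pay fees")
act_score("Fill admission form", "Upload documents")
print(recommend("Fill admission form", ["Pay fees", "Upload documents", "Issue ID card"]))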
4 Simulation and Results The practical implementation of the proposed approach was successfully done, and the result was a recommender system where initially the user is at the homepage of the system. Here, the user can see various sample business process models for reference. The recommendation system allows the user to choose any department present under the “Education” organization as an institution comprises varied academic related
functions for which business process models can be required. A type of department is selected by the user from the displayed options. A page displaying two different but adjacent sections for creating the process model will be seen. However, if the user wants to select a different department, he/she can go back to the previous page by clicking on the “Choose Department” button. If the user is fine with the current selected department, he/she can now create his own customized business process model. The left section of the page is the work area where the user can create the model. The right section comprises three subsections; subsection one displays all the recommended activities that fit into the business process model for this department. If the user finds an activity suitable for the process model, the user can click on that recommended activity which is then added as a node in the flowchart being created in the left subsection simultaneously, otherwise, the second subsection provides the facility of searching the desired activity through a small query interface. Found results are displayed in the second subsection according to their rankings. Still if the user is not able to find that activity, then the last subsection facilitates the user to insert that specific activity into the system. The user can again search for this activity and add it in the process model. In this manner, a user can successfully build a customized business process model using the recommendations provided as shown in Fig. 3.
Fig. 3 The resultant customized business process model
5 Conclusion and Future Work In the built recommendation system, we considered functionalities such as a user-friendly search interface, selection of organizations and the departments present within them, insertion of new activities if not already present in the system, creation of business process models and, most importantly, recommendation of individual activities for a process model using the collaborative filtering approach. The function of such a system is to assist users during process modelling by using process fragments from a process repository. The simplicity of our approach makes it possible for users to apply their knowledge to the best effect. We were able to resolve the cold-start problem successfully, and we applied collaborative filtering, since it concerns user-based recommendations, rather than content-based filtering, which supports item-based recommendations. To further improve this work, we plan to take it to the next level by introducing some advanced social features into the recommendation system. At present the system does not retain user credentials and specifications, as everything is dynamic in nature; to overcome this, an extension can be made by creating a separate repository from user history and a separate user profile. Also, a large process model repository can be maintained, allowing new users to select a model directly from the current models that were previously chosen on a frequent basis.
References 1. Brandall, B. (2016). Why you should bother with business process modelling. Technopreneurph. Retrieved May 30, 2020, from https://technopreneurph.wordpress.com/2016/09/ 01/whyyou-should-bother-with-business-process-modelling-by-benjamin-brandall/. 2. Mathiesen, P. (2010). Business Process Modelling (BPM) best practice. Paul Mathiesen’s Blog. Retrieved May 31, 2020, from https://paulmathiesen.wordpress.com/2010/01/06/business-pro cessmodelling-bpm-best-practice/. 3. Cotofrei, P., & Stoffel, K. (2008). Business process modelling for academic virtual organisations. In L. M. Camarinha-Matos & W. Picard (Eds.), Pervasive Collaborative Networks, PRO-VE 2008, IFIP - The International Federation for Information Processing (Vol. 283). Boston, MA: Springer. 4. Strimbei, C., Dospinescu, O., Strainu, R. M., & Nistor, A. (2016). The BPMN approach of the university information systems. In Ecoforum (Vol. 5, pp. 181–193). Romania: Alexandru Ioan Cuza University of Iasi. 5. Gasson, S. (2008). A framework for the co-design of business and IT systems. In Proceedings of the Annual Hawaii International Conference on System Sciences, Drexel University, Philadelphia, USA (pp. 348–348). https://doi.org/10.1109/hicss.2008.20. 6. Kluza, K., Baran, M., Bobek, S., & Nalepa, G. (2013). Overview of recommendation techniques in business process modeling (Vol. 1070, pp. 46–57). Poland: AGH University of Science and Technology. 7. Koschmider, A., Schallhorn, T., & Oberweis, A. (2011). Recommendation based editor for business process modeling. Data & Knowledge Engineering, 70,. https://doi.org/10.1016/j. datak.2011.02.002. 8. Dwivedi, P., Kant, V., & Bharadwaj, K. K. (2018). Learning path recommendation based on modified variable length genetic algorithm. Education and Information Technologies.
9. Koschmider, A., Song, M., & Reijers, H. A. (2009). Advanced Social Features in a Recommendation System for Process Modeling. In W. Abramowicz (Ed.), Business Information Systems, BIS 2009 (Vol. 21). Lecture Notes in Business Information Processing Heidelberg: Springer, Berlin. 10. Li, Y., Cao, B., Xu, L., Yin, J., Deng, S., Yin, Y., et al. (2014). An efficient recommendation method for improving business process modeling. IEEE Transactions on Industrial Informatics, 10, 502–513. https://doi.org/10.1109/TII.2013.2258677. National Science Foundation of China, China. 11. Ben Schafer, J. (2005). Dynamic lens: A dynamic user-interface for a meta-recommendation system. Department of Computer Science, University of Northern Iowa Cedar Falls, IA 506140507, USA. 12. Pazzani, M. J., & Billsus, D. (2007). Content-based recommendation systems. In The Adaptive Web: Methods and Strategies of Web Personalization ( Vol. 4321, pp. 325–341). Lecture Notes in Computer Science. Berlin Heidelberg New York: Springer. 13. Ben Schafer, J., Frankowski, D., Herlocker, J., & Sen, S. (2008). Collaborative filtering recommender systems. Department of Computer Science, University of Northern Iowa, Cedar Falls, USA.
RETRACTED CHAPTER: Using Bidirectional LSTMs with Attention for Categorization of Toxic Comments
Zubin Tobias and Suneha Bose
Abstract The online atmosphere is conducive to building connections with people all around the world, surpassing geographical boundaries. However, accepting participation from everyone comes at the cost of exposure to abusive language and toxic comments, and barring people from taking part in discussions because of a few misbehaving users is not a viable option. The framework proposed in this research harnesses the power of deep learning to recognize toxic online comments, which are then categorized using natural language processing tools.
1 Introduction
Keywords Natural language processing · Deep learning · Toxic online comments · Neural network · Long short-term memory
Many websites that display user-submitted content must deal with toxic or abusive comments. Human moderation may carry psychological risk to the moderators [1] and is difficult to implement at scale. Potential automated solutions will require not just binary classification (acceptable vs. blocked) but fine-grained comment classifications to maintain civility without interfering with normal discourse and provide explanations to users when their posts are censored. Different websites may wish to have different policies for dealing with different kinds of toxic content. For example, a website may want to allow insulting comments while still blocking threats and identity-based hate speech. In this project, we attempt to improve the identification and fine-grained classification of toxic online comments in a dataset provided by a Kaggle challenge. Our dataset consists of 159,571 comments from Wikipedia talk page edits which have been labeled by human raters for the presence of toxic behavior. (The original version of this chapter was retracted: the retraction note is available at https://doi.org/10.1007/978-981-16-2594-7_71.) The 6 types of
Z. Tobias (B) · S. Bose Maulana Abul Kalam Azad University of Technology, Kolkata, West Bengal, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022, corrected publication 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_49
Fig. 1 A selection of comments from our training dataset, showing the large diversity of the corpus and the ability of a comment to have multiple labels. Also, note that while toxic is a superset of severe toxic, it is not a superset of the other 4 labels
toxicity are toxic, severe toxic, obscene, threat, insult, and identity hate. A sample of entries from the dataset is shown in Fig. 1. In this report, we will describe a baseline approach to this multi-label classification task using a three-layer feedforward neural network with averages of word-vector sentence embeddings as inputs. We will then describe our experiments with uni- and bidirectional long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, and demonstrate a significant improvement in performance with the addition of an attention mechanism. Finally, we present the results of applying our framework to the dataset, including visualizations of the model’s weighting of input words via the attention mechanism.
2 Background and Related Work
The earliest work on machine learning-based detection of online harassment can be traced back to Yin et al. [2], where they fed content, sentiment, and contextual features into a support vector machine to classify instances of harassment. More recently, a paper by the group that created the dataset used in the Kaggle challenge [3] focused on binary identification of toxic comments (no fine-grained classification). They found success with relatively simple n-gram NLP methods, and left more complex methods (like LSTMs) as future work. In last year’s CS224n class, a group of students looked at the same Wikipedia corpus and performed binary classification (personal attacks vs non-personal attacks) using an LSTM architecture and word-level embeddings, a convolutional neural network (CNN) with word-level embeddings, and a CNN with character-level embeddings [4]. They found the CNN with character-level embeddings to be the most successful algorithm. For our project, we decided to take a different approach and investigate whether applying attention to this problem could improve the performance of word-level embeddings, inspired by Yang et al. [5].
3 Approach
3.1 Data and Preprocessing
To convert the raw text data into usable form, we tokenize the input using the nltk package [6]. Since each of the comments is of a different length, we pad or crop the token lists from each comment to a uniform length and mask the (pad) tokens in the RNN output layer (or attention layer, once we started using it). To represent each word token in the input comments, we perform word embedding using the 300-dimensional pre-trained global vectors for word representation (GloVe) [7], which was trained using aggregated global word-word co-occurrence statistics on 6 billion tokens from Wikipedia and Gigaword corpora [8] with a vocabulary size of 400 k. Finally, we create input vectors of int32 word indices using the word2id mappings included with the pre-trained GloVe vectors. We held back a random 30% of the training data to use as a dev set for testing the out-of-sample performance of our models. Many offensive words posted online are not a part of the 400 k words that form the GloVe vocabulary, and therefore map to the (unk) token. Since identifying such offensive words should improve our performance, if a token contains as a substring any member of a list of common vulgarities that are not commonly found as substrings of inoffensive words (refer to our code for details), we map the composite token to the root vulgarity. Another observation about our dataset is the large imbalance in number of examples of each label (see Table 1). We deal with this problem by weighting the contributions of labels inversely to their occurrence in the training data, normalized to 1 for toxic comments (the most common label). These weights cause larger gradients to backpropagate from mistakes on the rarer labels. In this way, we learn more from each example of the more uncommon labels.
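The tokenization, padding and re-weighting steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the pad/unk ids, the maximum length, the placeholder vulgarity list and the helper names are all assumptions.

```python
import nltk  # nltk.download('punkt') may be required once
import numpy as np

MAX_LEN = 200                       # assumed uniform comment length after pad/crop
VULGARITIES = ["idiot", "moron"]    # placeholder; the real list lives in the authors' code

def comment_to_ids(comment, word2id, pad_id=0, unk_id=1):
    """Tokenize, map unknown composite vulgarities to their root word, then pad/crop."""
    ids = []
    for tok in nltk.word_tokenize(comment.lower()):
        if tok not in word2id:
            # e.g. "youidiot123" -> "idiot" so it hits a known GloVe vector
            root = next((v for v in VULGARITIES if v in tok), None)
            if root is not None:
                tok = root
        ids.append(word2id.get(tok, unk_id))
    ids = ids[:MAX_LEN]
    return ids + [pad_id] * (MAX_LEN - len(ids))

def label_weights(y):
    """Inverse-frequency label weights, normalized to 1 for the most common label."""
    counts = y.sum(axis=0).astype(float)   # y: (N, 6) binary label matrix
    return counts.max() / counts
```

The weights returned by `label_weights` can then scale the per-label loss terms so that the rarer labels contribute larger gradients, as described above.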
Table 1 Counts of occurrences of each label category, and percentages of comments in the dataset with the given label (these do not sum to 100% because a single comment can have multiple labels)
Category | # of examples | % of dataset
No label | 143,346 | 89.8
Toxic | 15,294 | 9.6
Severe toxic | 1595 | 1.0
Obscene | 8449 | 5.3
Threat | 478 | 0.3
Insult | 7877 | 4.9
Identity hate | 1405 | 0.9
3.2 Baseline Model
We implemented a fully connected feedforward neural network as our baseline model. The inputs of the model are the average of the 300d GloVe embeddings of all tokens in each comment. Note that no comment padding is necessary in this model, as the averaging procedure reduces comments of any length to a uniform size. The neural network consists of 3 hidden layers with the ReLU activation function [9], with 30, 20 and 10 hidden units in each hidden layer, respectively. Since our goal is to perform multi-label classification (i.e., the toxicity labels are not mutually exclusive), the outputs of the last hidden layer are fed into a sigmoid output layer (rather than a softmax layer, which would force the 6 class probabilities to add up to 1) with 6 units, which correspond to the predicted probabilities of each of the 6 labels. The cross-entropy function is used as the loss. The neural architecture can be defined as follows:

$h_1 = \mathrm{ReLU}(x W_1 + b_1)$
$h_2 = \mathrm{ReLU}(h_1 W_2 + b_2)$
$h_3 = \mathrm{ReLU}(h_2 W_3 + b_3)$
$\hat{y} = \sigma(h_3 W_4 + b_4)$
$J = \mathrm{CE}(y, \hat{y}) = -\sum_{i=1}^{6} y_i \log(\hat{y}_i)$

where

$x \in \mathbb{R}^{B \times 300},\ h_1 \in \mathbb{R}^{B \times 30},\ h_2 \in \mathbb{R}^{B \times 20},\ h_3 \in \mathbb{R}^{B \times 10},\ \hat{y} \in \mathbb{R}^{B \times 6},\ y \in \mathbb{R}^{B \times 6}$
and B is the batch size. In this baseline model and throughout this work, all bias terms are initialized to zero, while all weight matrices are initialized using Xavier initialization [10]. With this baseline model, we found relatively good dev set performance as measured by the ROC curves for each toxicity type (see Fig. 2). The mean ROC AUC across the 6 labels on the blind test set was 0.9490, which is good performance for a baseline, but placed us 0.04 behind the leading Kaggle entry. The similarity between the train and dev set curves shows that we were not suffering from severe overfitting. Since the number of true negatives (non-toxic comments) in our dataset is large (see Table 1), and ROC curves plot the true positive rate (TPR = TP/ (TP + FN)) versus the false positive rate (FPR = FP/ (FP + TN)), their appearance is biased
Fig. 2 The ROC and precision-recall curves showing our baseline model’s performance on each toxicity type
by this imbalance in our dataset. However, since the mean AUC for these ROC curves was the official metric of the Kaggle challenge, we used these as our primary evaluation metric. For a more direct measure of our classifier’s performance for each label, we also plotted the precision versus recall performance of our baseline classifier (shown at right in Fig. 2). Here we see that our performance is better for the labels that have more training examples.
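A hedged Keras sketch of the baseline just described: mean-pooled 300-d GloVe vectors feeding a 30-20-10 ReLU stack and a 6-unit sigmoid output. The optimizer settings and the use of the standard binary cross-entropy are assumptions standing in for the loss defined above; Keras' default glorot (Xavier) initialization matches the initialization described in the text.

```python
import tensorflow as tf

def build_baseline(input_dim=300, n_labels=6):
    # Inputs are the averaged GloVe embeddings of each comment (batch x 300).
    inputs = tf.keras.Input(shape=(input_dim,))
    h = tf.keras.layers.Dense(30, activation="relu")(inputs)
    h = tf.keras.layers.Dense(20, activation="relu")(h)
    h = tf.keras.layers.Dense(10, activation="relu")(h)
    # Sigmoid, not softmax: the six toxicity labels are not mutually exclusive.
    outputs = tf.keras.layers.Dense(n_labels, activation="sigmoid")(h)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(5e-4),
                  loss="binary_crossentropy")
    return model
```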
3.3 Recurrent Architectures
Our baseline model achieved relatively good performance, but we thought that we could improve it by building a recurrent neural network (RNN) model that is able to use positional information of each of the words rather than an average over the entire sentence. We found that RNNs with GRU and LSTM cells were superior to a vanilla RNN architecture, but the differences between LSTM and GRU were minor, so we decided to focus on networks with LSTM cells. We implemented our models in Python using tensorflow [11]. For the LSTM cells that form the backbone of our architecture, we use the tensorflow implementation of the original LSTM equations from [12], shown below:

$i_t = \sigma(x_t W^{(i)} + h_{t-1} U^{(i)})$
$f_t = \sigma(x_t W^{(f)} + h_{t-1} U^{(f)})$
$o_t = \sigma(x_t W^{(o)} + h_{t-1} U^{(o)})$
$\tilde{c}_t = \tanh(x_t W^{(c)} + h_{t-1} U^{(c)})$
$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$
$h_t = o_t \circ \tanh(c_t)$
The LSTM architecture allows information to flow long distances across the unrolled network (if $f_t = 1$ and $i_t = 0$, the previous state flows straight through the network), but also prevents exploding gradient problems ($o_t$ modulates the strength of the output at each step, and $f_t = 0$ allows prior states to be forgotten so backpropagation need not continue all the way back to $t = 0$). We also investigated the use of GRU cells (which have similar properties) for this task and found comparable performance to LSTM cells, so we omit their equations here [13]. For a unidirectional LSTM architecture, it is possible to use the hidden state of the final time step as the overall output state $\tilde{h}$ that is fed into the final output layer. We found better performance, however, when we use the element-wise average over the hidden states of all time steps (masking any hidden states at time steps where the pad token was input) as $\tilde{h}$, the input to the final layer of our architecture. With a bidirectional LSTM architecture, we have one set of parameters for a forward-unrolled LSTM and a separate set of parameters for a backward-unrolled LSTM that reads input token lists starting at the end rather than the beginning. For each token position, we therefore have two output states at each time step, which we either average before averaging across all time steps, or concatenate to form a single joint output state which is then fed into the attention layer described below. To prevent overfitting, we add regularization to the overall output state of the LSTMs (or attention layer) by applying dropout [14] with $p_{\mathrm{drop}} = 0.5$:

$h_{\mathrm{dropout}} = \mathrm{Dropout}(\tilde{h}, p_{\mathrm{drop}})$
Finally, we use a fully connected layer with sigmoid activations to convert the output of the dropout layer to the probabilities of each label:

$\hat{y} = \sigma(h_{\mathrm{dropout}} U + b)$

where $h_{\mathrm{dropout}} \in \mathbb{R}^{B \times H}$, $U \in \mathbb{R}^{H \times 6}$ and $\hat{y} \in \mathbb{R}^{B \times 6}$, with $H$ as the hidden state size of the RNN/LSTM/GRU and $B$ as the minibatch size.
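The recurrent variant can be sketched in the same spirit. The snippet below is an assumption-laden illustration (pad id 0, frozen GloVe embeddings, a mask-aware average pooling standing in for the element-wise average over unmasked time steps), not the authors' TensorFlow implementation.

```python
import tensorflow as tf

def build_bilstm(vocab_size, emb_matrix, max_len=200, hidden=50, n_labels=6):
    inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
    # Frozen pre-trained GloVe vectors; mask_zero lets later layers ignore pad tokens.
    x = tf.keras.layers.Embedding(
        vocab_size, emb_matrix.shape[1], mask_zero=True, trainable=False,
        embeddings_initializer=tf.keras.initializers.Constant(emb_matrix))(inputs)
    # Forward and backward hidden states are concatenated at every time step.
    h = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(hidden, return_sequences=True))(x)
    # Mask-aware average over time steps plays the role of h-tilde.
    h_tilde = tf.keras.layers.GlobalAveragePooling1D()(h)
    h_drop = tf.keras.layers.Dropout(0.5)(h_tilde)
    outputs = tf.keras.layers.Dense(n_labels, activation="sigmoid")(h_drop)
    return tf.keras.Model(inputs, outputs)
```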
3.4 Attention Mechanism To extract and add importance to informative words that might be relevant to the classification, we utilized a word attention mechanism as described by Yang et al. [5]. We implemented the attention mechanism by modifying publicly available code
[15] in order to allow for masking comments with variable length. In our neural architecture, the attention mechanism can be optionally applied to reduce the set of hidden vectors $\{h_t\}$ from every RNN/LSTM/GRU time step $t = \{1, \ldots, T\}$ to an attention output vector $\tilde{h}$, as opposed to the averaging described above. The attention output vector $\tilde{h}$ is then fed as the input to the dropout layer described above. The mathematical formulation of this attention mechanism is described below:

$v_t = \tanh(h_t W_a + b_a)$
$s_t = v_t u_a^{T}$
$\alpha_t = \dfrac{\exp(s_t)}{\sum_{t=1}^{T} \exp(s_t)}$
$\tilde{h} = \sum_{t=1}^{T} \alpha_t h_t$
In this attention mechanism, the hidden vector of each word $h_t$ from recurrent time step $t$ is fed into a single tanh non-linearity to produce a new word representation $v_t$. An attention score $s_t$ is then computed by measuring the dot-product similarity between $v_t$ and the word context vector $u_a$. The attention weights $\alpha_t$ are then computed by normalizing the attention scores across all time steps via a softmax function. Finally, the attention output vector $\tilde{h}$ is computed as the weighted sum of the original hidden states $h_t$ of each word, weighted by the normalized attention weights.
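The four attention equations above translate almost line-for-line into a small custom layer; the sketch below is one possible rendering (the masking of pad positions and the initializers are assumptions), and its pooled output can replace the averaging layer in the model sketch of Sect. 3.3.

```python
import tensorflow as tf

class WordAttention(tf.keras.layers.Layer):
    """v_t = tanh(h_t W_a + b_a);  s_t = v_t . u_a;
    alpha_t = softmax(s_t);  h_tilde = sum_t alpha_t h_t."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W_a = self.add_weight(name="W_a", shape=(d, d), initializer="glorot_uniform")
        self.b_a = self.add_weight(name="b_a", shape=(d,), initializer="zeros")
        self.u_a = self.add_weight(name="u_a", shape=(d,), initializer="glorot_uniform")

    def call(self, h, mask=None):
        v = tf.tanh(tf.tensordot(h, self.W_a, axes=1) + self.b_a)    # (B, T, d)
        s = tf.tensordot(v, self.u_a, axes=1)                        # (B, T)
        if mask is not None:                                         # exclude pad steps
            s += (1.0 - tf.cast(mask, s.dtype)) * -1e9
        alpha = tf.nn.softmax(s, axis=-1)                            # attention weights
        return tf.reduce_sum(h * tf.expand_dims(alpha, -1), axis=1)  # (B, d)
```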
4 Experiments
4.1 Experiment Configurations We ran a variety of models with a range of hyperparameter settings on an Azure NV6 node. We focused on expansions to our model rather than a systematic search for optimal hyperparameter settings, keeping the learning rate for our Adam optimizer [16] fixed at 0.0005. The steady decline of our training loss (as shown at left in Fig. 3) shows that this choice of learning rate was reasonable. Table 2 shows the results for a set of experiments we did before implementing the attention layer, to assess the improvement from using larger hidden states, more layers, and uni- versus bidirectional architectures. We found that bidirectional architectures were in all cases better than unidirectional, more than 1 layer in each direction (if bidirectional) led to overfitting, and LSTM and GRU cells gave nearly equivalent
Fig. 3 At left, a typical loss function from our training runs. The increase in the dev loss after 10 epochs indicates overfitting, so we use early stopping to save the best model before overfitting begins. At right, a plot of the grad norm during training showing that we did not have any exploding gradient problems
Table 2 This table shows the results of a hyperparameter search among cell type, hidden state size, number of layers, and directionality for our RNN model (without attention)
Cell type | Directionality | # layers | Hidden size | Dev (ROC AUC) | Dev (AP)
LSTM | Single | 1 | 50 | 0.9610 | 0.5009
LSTM | Single | 1 | 100 | 0.9683 | 0.5473
LSTM | Single | 2 | 50 | 0.9546 | 0.4470
LSTM | Bidirectional | 1 | 50 | 0.9747 | 0.5788
LSTM | Bidirectional | 2 | 50 | 0.9747 | 0.5882
GRU | Single | 1 | 50 | 0.9659 | 0.5180
GRU | Bidirectional | 1 | 50 | 0.9755 | 0.5837
GRU | Bidirectional | 2 | 50 | 0.9761 | 0.5619
performance. We therefore used a single layer, bidirectional LSTM as our fiducial model for implementing attention. Experiments took anywhere from 1 to 3 h, depending on the size of the model and number of epochs. Initially, we ran all models for 50 epochs, but found that the model began to overfit after ~10 epochs (see Fig. 3). We therefore implemented early stopping, where the model would progressively save the model after each epoch if it had better dev set performance (measured by mean column-wise ROC AUC) than any epoch before it. We also implemented gradient clipping, to ensure that our gradients didn’t explode during backpropagation, but due to the LSTM architecture (see discussion in Sect. 3.3), the norm remained low enough during training that clipping was unnecessary (see right-hand panel of Fig. 3).
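The "keep the best epoch" logic can be sketched as a small training loop; the batch size, the use of get_weights/set_weights and the metric computation below are assumptions rather than the authors' exact pipeline (gradient clipping, had it been needed, could be enabled through the optimizer's clipnorm argument).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fit_with_best_checkpoint(model, train, dev, epochs=50, batch_size=64):
    """Train one epoch at a time and keep the weights with the best
    mean column-wise dev ROC AUC (early stopping by checkpointing)."""
    (x_tr, y_tr), (x_dev, y_dev) = train, dev
    best_auc, best_weights = 0.0, model.get_weights()
    for _ in range(epochs):
        model.fit(x_tr, y_tr, epochs=1, batch_size=batch_size, verbose=0)
        probs = model.predict(x_dev, verbose=0)
        auc = float(np.mean([roc_auc_score(y_dev[:, j], probs[:, j])
                             for j in range(y_dev.shape[1])]))
        if auc > best_auc:
            best_auc, best_weights = auc, model.get_weights()
    model.set_weights(best_weights)   # restore the best epoch before evaluation
    return best_auc
```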
4.2 Evaluation
We set up an evaluation pipeline to automatically plot ROC curves and precision-recall curves for classification of each toxicity type. The pipeline also computed the ROC AUC and average precision (AP) for classification of each toxicity type, and the mean column-wise ROC AUC and mean column-wise average precision across all toxicity types. We submitted our model predictions on a test set of 153,164 comments whose true labels were withheld by Kaggle. Our online submissions returned the mean column-wise ROC AUC (the official evaluation metric in the Kaggle challenge) for our test set predictions. The results for our four primary models are shown in Table 3, highlighting the substantial improvement in our multi-label classification resulting from our implementation of word attention. Looking at Fig. 4, we see that the overall performance of our final model (bidirectional LSTM + attention) is very good, with the ROC very close to the upper-left corner representing perfect classification. In the right-hand panel, we see that the classes for which we have the weakest performance are the ones for which we have the fewest training examples (threat, severe toxic, and identity hate). Some amount of overfitting is present, as evidenced by the dotted dev lines being displaced from the solid lines representing the performance on the training data. One way to further improve this model would be increasing the strength of the regularization beyond the dropout we have already applied.
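The per-label and mean column-wise metrics of this pipeline reduce to a few scikit-learn calls; a minimal sketch (the label names and print format are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def evaluate(y_true, y_prob):
    """y_true, y_prob: arrays of shape (N, 6). Returns the mean column-wise
    ROC AUC and mean column-wise average precision."""
    aucs = [roc_auc_score(y_true[:, j], y_prob[:, j]) for j in range(len(LABELS))]
    aps = [average_precision_score(y_true[:, j], y_prob[:, j]) for j in range(len(LABELS))]
    for name, auc, ap in zip(LABELS, aucs, aps):
        print(f"{name:>14s}  ROC AUC = {auc:.4f}  AP = {ap:.4f}")
    return float(np.mean(aucs)), float(np.mean(aps))
```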
Table 3 A summary table of the results from the primary models we investigated, showing the mean average precision and mean ROC AUC from the dev set, and the mean ROC AUC from the blind test set
Model | Dev (AP) | Dev (ROC AUC) | Test (ROC AUC)
Baseline | 0.5610 | 0.9669 | 0.9490
LSTM | 0.5473 | 0.9683 | 0.9435
Bidirectional LSTM | 0.5882 | 0.9747 | 0.9611
Bidirectional LSTM + Attention | 0.6695 | 0.9859 | 0.9778
Fig. 4 The ROC and precision-recall curves for our final model
4.3 Qualitative Evaluation via Attention
With its attention mechanism, our model is capable of computing the normalized attention weights α t for every word given a sentence. For diagnostic purposes, we can visualize the attention weights of selected comments in order to qualitatively understand how the classifier classified comments the way it did. For example, here are the visualizations of the attention weights of some example true positives—in this case, comments that got successfully classified as toxic and threat:
The attention weights successfully highlight the words and phrases that sound threatening to the other users. Here is an example of a true negative:
Even though the attention weights highlight some potentially vulgar words, the classifier successfully classified it as a negative. On the other hand, here is an example of a false positive:
The model identifies a couple of words that could potentially be toxic, as highlighted by the attention weights, but does not pick up that the commenter is quoting someone else to make a reasonable argument, and therefore misclassifies it as toxic and obscene.
5 Conclusions and Future Work We have implemented a Tensorflow framework based on a bidirectional LSTM with an output attention layer that successfully performs multi-label classification of various subtypes of toxic online comments. Our final mean ROC AUC performance on the blind test set is 0.9778, which places us within 0.011 of the leading
Kaggle entry. The addition of an attention layer to our bidirectional LSTM architecture significantly improves performance both in terms of mean column-wise ROC AUC (0.9611 to 0.9778) and mean column-wise average precision (0.5882 to 0.6695). Our performance is primarily limited by the mapping of new or rarely used obscenities to the (unk) token by GloVe. We tried to ameliorate this by mapping unknown composite obscenities to the embedding of the root obscenity in our preprocessing, but further such preprocessing to eliminate unknown tokens, or the use of character-level embeddings and/or convolutional methods, might further improve performance, since they would be able to recognize toxic character patterns within an unknown word. In addition, stronger regularization beyond dropout might help increase the performance of our model without overfitting.
References
1. Madrigal, A. (2017). The basic grossness of humans. The Atlantic. https://www.theatlantic. com/technology/archive/2017/12/the-basic-grossness-of-humans/548330/. 2. Yin, D., et al. (2009). Detection of harassment on Web 2.0. In Proceedings of the Content Analysis in the WEB 2.0 (CAW2.0) Workshop at WWW2009. 3. Wulczyn, E., Thain, N., & Dixon, L. (2017). Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web. 4. Chu, T., Jue, K., & Wang, M. (2017). Comment abuse classification with deep learning. In CS224n Final Project Reports. 5. Yang, Z., et al. (2016). Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT 2016 (p. 14801489). 6. Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python. OReilly Media Inc. 7. Pennington, J., Socher, R., & Manning, C. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 8. Parker, R., et al. (2011). English Gigaword Fifth Edition LDC2011T07. DVD. Philadelphia: Linguistic Data Consortium. 9. Nair, V. & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (pp. 807–814). USA: Omnipress. 10. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR (Vol. 9, pp. 249–256). 11. Abadi, M., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. 12. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. 13. Cho, K., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078. 14. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958. 15. Ivanov, I. (2018). Tensorflow implementation of attention mechanism for text classification tasks. https://github.com/ilivans/tf-rnn-attention/. 16. Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.
Detection of Rheumatoid Arthritis Using a Convolutional Neural Network A. S. Mahesh Kumar, M. S. Mallikarjunaswamy, and S. Chandrashekara
Abstract Rheumatoid arthritis (RA) is a kind of autoimmune disease. RA generally presents with inflammation, swelling, stiffness, joint pain and loss of function in the joints. Inflammation starts at the smaller joints of the body and in later stages spreads to the heart and other organs. The initial symptoms appear mild, but at later stages the disease causes a major loss of joint functionality, so accurate detection of RA at an early stage is essential. Various modalities are used for RA diagnosis, notably radiography, ultrasound and magnetic resonance imaging (MRI). Even though several modalities are used in the assessment of joint damage and positional changes, plain radiography is the most effective method. Different scoring methods are used in RA assessment, but all of them involve joint evaluation of the fingers, hands, feet and wrist, and the traditional scoring methods and manual diagnosis process require considerable rheumatologist intervention. This work develops an automated RA diagnosis based on a convolutional neural network (CNN) to help rheumatologists with their diagnosis and treatment plan. The CNN architecture for automated RA detection avoids manual preprocessing, handcrafted segmentation and classification, and is used together with data augmentation to increase robustness and avoid overfitting during the training phase. The dataset includes both normal and RA plain hand radiographs for network training and testing. The RA detection CNN model achieves an accuracy of 98.8% and an error rate of 1.2%. The evaluation metrics are measured in terms of accuracy, loss, recall, sensitivity, specificity, precision, false discovery rate, Matthews correlation coefficient, negative predictive value, geometric mean, false positive rate and false A. S. M. Kumar (B) Department of Electronics & Communication Engineering, PES College of Engineering, Mandya, India M. S. Mallikarjunaswamy Sri Jayachamarajendra Colleges of Engineering, JSS Science and Technology University, Mysuru, India e-mail: [email protected] S. Chandrashekara ChanRe Rheumatology & Immunology Center & Research, Bengaluru, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_50
negative rate. The obtained results show that the network has potential for early RA diagnosis. Keywords Convolutional Neural Network (CNN) · Magnetic Resonance Imaging (MRI) · Performance metrics · Radiography · Rheumatoid arthritis
1 Introduction RA generally observed with inflammation, swollen, stiffness, joint pain and loss of functionality in the joints. Connecting tissues, tendons, ligaments, cartilages and muscles also get affected by RA [1–3]. Initially, RA affects the smaller joints of the body such as joints of the fingers, hands and wrists. In later stages, RA affects joints of knee, feet, ankles, elbows, shoulders and hip. Moreover, joint pain experience especially at early hours of the day [4–6]. Human hand fingers have different phalangeal bones like distal phalangeal, proximal phalangeal and metacarpal phalangeal. The Distal Interphalangeal Joint (DIP) is a joint between intermediate and distal phalangeal bone. The Proximal Interphalangeal joint (PIP) is a joint between proximal and intermediate phalangeal bone whereas joint between metacarpal and proximal phalangeal bone is called Metacarpophalangeal joints (MCP) [7, 8]. A small gap is present between phalangeal bones called as Joint Space Width (JSW). The JSW between the phalangeal bones and bone erosion are very much helpful in RA diagnosis at its early stages. However, manual calculation of JSW and bone erosion for every joint requires more time and complex process as well. Few normal and RA affected hand radiography images along with phalangeal bone shown in Fig. 1. There is no cure and permanent solution for RA disease. Early beginning of the medication can avoid disease progression and future disability problems. Few medications are used for RA disease such as Disease-Modifying Antirheumatic Drugs (DMARDs), AntiTNF drugs, B-cell therapies and T-cell therapies [9, 10]. Two international bodies have given a list of parameters and protocols for RA disease diagnosis. European League Against Rheumatism (EULASR) and American College of Fig. 1 MATLAB random selection results of X-ray image of normal and RA affected hand images
Rheumatology (ACR) are the two world-wise popular bodies being used for RA classification [11–13]. Various modalities are being used for the purpose of RA diagnosis problems, namely, X-ray, MRI, ultrasound, bone scintigraphy and CT. X-ray is the best and effective tool in the assessment of joint damage and position changes in RA patients [14–16]. RA radiographic image findings are cartilage loss, JSW and bone erosions. The changes in the JSW, cartilage loss and bone erosion in each of the finger are the sign of RA [17, 18]. Different manual methods have been used to evaluate and examine the joint damage, joints position and gap between the joints. The manual scoring methods are Genant and its variants, Simple Erosion Narrowing Score (SENS), sharp and its variants, Larsen and its variants and Sharp/van der Heijde (SvdH) method. The SvdH method of evaluation is considered as the best RA assessment method in radiography image analysis [19, 20]. Recent method of RA evaluation is modified total sharp (mTS) score method but mTS method requires more time and the rheumatologists intervention for disease diagnosis. Implementation of mTS score method is used to support vector machine for classification and it utilized 45 dataset for experimentation [21]. The manual methods are time consuming and need the rheumatologist intervention for RA diagnosis. A kind of automatic detection can be possible with machine learning methods. CNN is a subset of machine learning used for automatic RA detection. CNN is used in other applications like natural language processing, audio processing, image analysis and medical field application. CNN particularly reduces the burden in feature extraction and classification stage [22, 23]. CNN architecture has input layers, hidden layers and output layers. CNN uses main three mechanisms, which are local receiver, sharing of weights and finally sub-sampling at output. CNN architecture consists of multiple numbers of convolution layers, pooling layer and activation layer. The convolution layers of the CNN play vital role in feature extraction. The convolution filters are used to find different features at different levels by applying different kernel size multiple filters on input images. Pooling layers help to reduce the network parameters, reduction in the parameters reduces processing complexity at the next layers. Pooling layer implemented with some nonlinear function such as average pooling and max pooling. A Rectified Linear Unit (ReLU) is an activation function and its maps all the negative values to zero and retains all the positive values.
2 Related Works In RA diagnosis, the rheumatologists suggest any one of the modality tests along with blood test based on the severity of the problems. Radiography is considered as the best and effective method in the assessment of joint damage and position changes, but radiograph gives only two-dimensional (2D) visualization whereas MRI and ultrasonography gives three-dimensional (3D) visualization of RA affected joints. Compared with other imaging modalities, radiograph is an intrinsic advantage of super capacity of imaging the bone structures of the body. The conventional radiography generates significant difference between bones and soft tissues. In the
work of Huo et al. [24] it is mentioned the advantage of radiography. Radiography image takes shorter exposure time (typically around 0.1 s for capturing) and gives superior spatial resolution (e.g., 0.1 mm2 /pixel). X-ray experimentation is relatively cheap and service available at almost all the hospitals and clinics. Radiography images one of the considerations for RA diagnosis. Hall et al. [25] presented a survey on different preprocessing and feature extraction techniques. The experimentation includes linearization, digital spatial filtering and contrast enhancement. Finally, extracted features are used for classification. Alkan [26] used a series of image processing operations like contrast enhancing followed by adaptive histogram equalization. Gaussian filter used for the removal of random noise. Finally, edges are found by using edge detection. Bhisikar et al. [27] presented automatic RA analysis based on statistical feature. Analysis process involves preprocessing and segmentation. Gabor filter is used for local texture feature. Subramoniam [28] used image segmentation algorithms for arthritis detection. Median filter is used to remove noise from the input image. Histogram is calculated from the filter image, and then a region-growing algorithm is used for grouping purpose. Once the grouping operation is completed, edge detector is used to find out edges. Mitta and Dubey [29] presented an arthritis detection method based on a morphology operation. Dilation and erosion are the morphology operations used for image enhancement. Vinoth and Jayalakshmi [30] introduced an arthritis detection method. The detection process involves operations like de-noising, histogram smoothing and edge detection segmentation. The GLCM is used to extracting features such as energy, entropy, correlation, homogeneity and contrast. Helwan and Tantua [31] introduced new system for RA identification. System involved preprocessing, segmentation and neural classifier for RA identification. Murakami et al. [32] implemented a system for RA diagnosis based on bone erosion. CNN architecture along with segmentation algorithms is used to RA disease diagnosis. Segmentation algorithm used for phalangeal area extraction whereas CNN architecture used for detection of pathology information. The experimentation contains 129 hand radiology images that include both normal and RA images. Oyedotun et al. [33] proposed RA classification based on neural network. CNN architecture has input layers, hidden layers and output layer. The input images are presented at input layer and classification result is obtained at output layer. The hidden layer learns all the needed features, which are essential for image classification. Murakami et al. [32] implemented automatic bone erosion detection using crude segmentation and Multiscale Gradient Vector Flow (MSGVF) snakes’ method. From the above-mentioned literature study, it is clear that there is a need of simple, accurate and effective RA classification method for diagnosis. The work includes X-ray images for training and testing purpose of CNN. CNN-based RA automated diagnostic method is developed, to help the rheumatologists in their diagnosis and treatment plan to avoid future disability and morbidity. The normal and RA affected X-ray images are included in the work.
3 Materials and Methods CNN architecture has input layers, hidden layers and output layer. Input images presented at input layer, associated features are leaned at hidden layers and the results are obtained at output layers. CNN architecture has multiple convolution layers, pooling layer and activation layer. The convolution layers of the CNN play vital role in feature extraction. The convolution filters are used to find different features at different layers by applying different kernel size multiple filters on the input images. Pooling layers help to reduce the network parameters, reduction in the parameters reduces processing complexity at the next layers. Pooling layer implemented with some nonlinear function such as average pooling and max pooling. A ReLU maps all the negative values to zero and retains all the positive values. a.
Image dataset
The X-ray dataset was collected from the ChanRe Rheumatology and Immunology Center, Bangalore, Karnataka, India. The dataset includes both normal and RA-affected images, and the images were processed in MATLAB 2018b. It consists of 165 radiographs, of which 86 are normal and 79 are RA images. The preprocessing steps include grayscale conversion and image resizing; the image dimension is restricted to 128 × 128 pixels. b.
Data augmentation
Data augmentation is a necessary step to increase the overall performance of the network; it also helps avoid overfitting, irrelevant pattern recognition and the memorization that can occur during the network training phase. Data augmentation operations include transformation functions such as random cropping, random rotation, and vertical and horizontal flipping. The RA detection CNN model uses random rotation of the images by up to 20°, and translation is applied both horizontally and vertically by up to three pixels at a time. Table 1 lists the data augmentation operations used in the CNN-based RA detection, and Table 2 summarizes the dataset; an equivalent augmentation pipeline is sketched in the code after Table 1. Table 1 Data augmentation and its properties
Data augmentation | Properties
FillValue | 0
RandXReflection | 0
RandYReflection | 0
RandRotation | [−20 20]
RandXScale | [1 1]
RandYScale | [1 1]
RandXShear | [0 0]
RandYShear | [0 0]
RandXTranslation | [−3 3]
RandYTranslation | [−3 3]
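The settings in Table 1 were configured with MATLAB's image augmenter; a roughly equivalent Keras sketch is shown below for illustration only. The conversion of the ±3-pixel shifts into fractions of the 128 × 128 input and the zero-fill behaviour are assumptions.

```python
import tensorflow as tf

# Approximate Keras counterpart of the Table 1 settings: rotation of +/-20 degrees
# and X/Y translation of +/-3 pixels on 128 x 128 inputs, with zero fill.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(20.0 / 360.0, fill_mode="constant", fill_value=0.0),
    tf.keras.layers.RandomTranslation(height_factor=3.0 / 128.0,
                                      width_factor=3.0 / 128.0,
                                      fill_mode="constant", fill_value=0.0),
])

# Typically applied on the fly, e.g. in a tf.data pipeline:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```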
Table 2 Total number of normal and RA hand radiography images
Labels and characteristics | Count
Normal images | 86
RA images | 79
Image resolution | 128 × 128
Image format | Hand radiography

c. CNN Architecture
Each stage of the CNN architecture contains convolution layers, batch normalization layers and max-pooling layers. The CNN-based RA detection model has four groups of convolution layers (with 8, 16, 32 and 64 filters) using convolution kernel sizes of 5 × 5 and 3 × 3. Batch normalization layers with 8, 16 and 32 channels are used. A ReLU layer maps all negative values to zero and retains all positive values. Three max-pooling layers of size 2 × 2, with a stride of two and zero padding, are used. Two fully connected layers followed by a softmax layer and a classification output layer are used for classification.
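The layer stack of Table 3 was built in MATLAB; for illustration only, an equivalent network can be sketched in Keras as below. The training options (optimizer, loss encoding) are assumptions, and the final dense layer follows Table 3's 2-unit fully connected layer.

```python
import tensorflow as tf

def build_ra_cnn():
    # Mirrors Table 3: four conv blocks (8, 16, 32, 64 filters) with batch
    # normalization and ReLU, three 2x2 max-pooling stages, then a 2-class softmax.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(128, 128, 1)),
        tf.keras.layers.Conv2D(8, 5, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPooling2D(2, strides=2),
        tf.keras.layers.Conv2D(16, 5, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPooling2D(2, strides=2),
        tf.keras.layers.Conv2D(32, 3, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPooling2D(2, strides=2),
        tf.keras.layers.Conv2D(64, 3, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2, activation="softmax"),  # 'Normal' vs 'RA'
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```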
4 Results and Discussion The images are processed in MATLAB 2018b. The dataset consists of 165 radiographs, of which 86 are normal and 79 are RA images. The preprocessing steps include grayscale conversion and image resizing; the image dimension is restricted to 128 × 128 pixels. Plots of accuracy versus iteration and loss versus iteration for the CNN model are shown in Figs. 2 and 3. The accuracy plot has two lines indicating training accuracy and validation accuracy. The blue line indicates training accuracy whereas the dashed black
Fig. 2 RA detection CNN model, accuracy versus iteration
Fig. 3 RA detection CNN model, loss versus iteration
line indicates validation accuracy. The blue and dashed black lines stay very close together, which indicates that the model classifies new images well during the test phase. In the loss plot, the decrease in both training loss and validation loss indicates that the model classifies new images with minimum error; the closeness of the training and validation accuracy lines shows that the model achieves good classification accuracy with a minimum error rate. Plain hand radiographs of size 128 × 128 are used with zero-center normalization. Four groups of convolution layers are applied to the input images with two different convolution kernel sizes, 3 × 3 and 5 × 5. Eight convolution filters of size 5 × 5 are applied to the input images, followed by eight-channel batch normalization. The next level of the architecture uses sixteen convolution filters of kernel size 5 × 5 with eight-channel batch normalization, the third level uses thirty-two convolution filters of kernel size 3 × 3 with sixteen-channel batch normalization, and the next level uses sixty-four convolution filters of kernel size 3 × 3 with 32-channel batch normalization. All levels use the same 2 × 2 max-pooling layer with a 2 × 2 stride and zero padding. Finally, the CNN model ends with fully connected, softmax and crossentropyex classification layers with the classes ‘Normal Images’ and ‘RA Images’. The CNN architecture thus consists of convolution layers, batch normalization layers and max-pooling layers; Table 3 gives the complete architecture used for RA image classification. The evaluation metrics are measured in terms of accuracy, loss, recall, sensitivity, specificity, precision, false discovery rate, Matthews correlation coefficient, negative predictive value, geometric mean, false positive rate and false negative rate. The convolution filters find different features at different layers by applying multiple filters of different kernel sizes to the input images; the extracted features are essential and play a vital role in image classification. Figures 4 and 5 show the features extracted by convolution layers one and two, respectively. In the confusion matrix, the diagonal cells indicate correct classifications and the off-diagonal cells indicate misclassifications. Out of 165 hand radiographs, 86 belong to the normal class and the remaining 79 are RA-affected. The performance statistics are also calculated from the confusion matrix, which is shown in Fig. 6.
Table 3 RA detection CNN model layers
No. | Name | Layer | Layer description
1 | ‘imageinput’ | Image input | 128 × 128 × 1 images with ‘zerocenter’ normalization
2 | ‘conv_1’ | Convolution | 8 5 × 5 × 1 convolutions with stride [1 1] and padding ‘same’
3 | ‘batchnorm_1’ | Batch normalization | Batch normalization with 8 channels
4 | ‘relu_1’ | ReLU | ReLU
5 | ‘maxpool_1’ | Max pooling | 2 × 2 max pooling with stride [2 2] and padding [0 0 0 0]
6 | ‘conv_2’ | Convolution | 16 5 × 5 × 8 convolutions with stride [1 1] and padding ‘same’
7 | ‘batchnorm_2’ | Batch normalization | Batch normalization with 8 channels
8 | ‘relu_2’ | ReLU | ReLU
9 | ‘maxpool_2’ | Max pooling | 2 × 2 max pooling with stride [2 2] and padding [0 0 0 0]
10 | ‘conv_3’ | Convolution | 32 3 × 3 × 16 convolutions with stride [1 1] and padding ‘same’
11 | ‘batchnorm_3’ | Batch normalization | Batch normalization with 16 channels
12 | ‘relu_3’ | ReLU | ReLU
13 | ‘maxpool_3’ | Max pooling | 2 × 2 max pooling with stride [2 2] and padding [0 0 0 0]
14 | ‘conv_4’ | Convolution | 64 3 × 3 × 32 convolutions with stride [1 1] and padding ‘same’
15 | ‘batchnorm_4’ | Batch normalization | Batch normalization with 32 channels
16 | ‘relu_4’ | ReLU | ReLU
17 | ‘fc’ | Fully connected | 2 fully connected layer
18 | ‘softmax’ | Softmax | Softmax
19 | ‘classoutput’ | Classification output | crossentropyex with classes ‘Normal resize’ and ‘RA’
Fig. 4 Feature extracted in the convolution layer 1
Fig. 5 Feature extracted in the convolution layer 2
Fig. 6 Confusion matrix of the RA detection CNN model
From the confusion matrix, all 86 normal images are correctly classified as normal, whereas the 79 RA-affected images contain some misclassifications: 77 are correctly classified as RA and two are misclassified as normal. In other words, 100% (86 out of 86) of the normal radiographs are classified as normal, 97.5% (77 out of 79) of the RA radiographs are correctly classified as RA, and 2.5% (2 out of 79) are misclassified as normal. The performance statistics give further information about the classification. The CNN model is evaluated with different performance measures such as accuracy, loss, recall, sensitivity, specificity, precision, false discovery rate, Matthews correlation
Table 4 Performance statistics of the CNN model
Performance statistic | Formula | Value
Accuracy | (tp + tn)/N | 0.988
Error rate | (fp + fn)/N | 0.012
Sensitivity | tp/(tp + fn) | 1.000
Specificity | tn/(tn + fp) | 0.975
Precision | tp/(tp + fp) | 0.977
Recall | tp/(tp + tn) | 0.527
Negative predictive value | tn/(tn + fn) | 1.000
False discovery rate | fp/(tp + fp) | 0.022
False positive rate | fp/(fp + tn) | 0.025
False negative rate | fn/(tp + fn) | 0.000
Matthews correlation coefficient | ((tp * tn) − (fp * fn))/sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) | 0.642
Geometric mean | sqrt(sensitivity * specificity) | 0.987
coefficient, negative predictive value, geometric mean, false positive rate and false negative rate. Accuracy is the ratio of correct outputs to the total number of targets assigned for classification, and the loss (error) rate is the ratio of incorrect outputs to that total. Sensitivity is the fraction of diseased cases that are correctly classified, while specificity indicates the fraction of normal images that are correctly classified as normal. Precision measures the proportion of predicted disease cases that are truly diseased, and recall reflects the disease patterns that are correctly identified. The remaining measures, namely the false discovery rate, Matthews correlation coefficient, negative predictive value, geometric mean, false positive rate and false negative rate, are also computed for the CNN model. Table 4 gives the performance statistics of the RA detection CNN architecture.
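The entries of Table 4 follow directly from the confusion-matrix counts. A small sketch, assuming the 'Normal' class is treated as positive (tp = 86, fn = 0, fp = 2, tn = 77, read off Fig. 6):

```python
def basic_metrics(tp, fn, fp, tn):
    n = tp + fn + fp + tn
    return {
        "accuracy":    (tp + tn) / n,
        "error_rate":  (fp + fn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
        "npv":         tn / (tn + fn),
        "fdr":         fp / (tp + fp),
        "fpr":         fp / (fp + tn),
        "fnr":         fn / (tp + fn),
    }

# Counts from the confusion matrix in Fig. 6, with 'Normal' as the positive class:
print(basic_metrics(tp=86, fn=0, fp=2, tn=77))
# accuracy ~ 0.988, sensitivity = 1.000, specificity ~ 0.975, precision ~ 0.977
```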
5 Conclusion Different scoring methods are used in RA assessment, but all of them involve the joint evaluation of the fingers, hands, feet and wrist. The traditional scoring methods and the manual diagnosis process require considerable rheumatologist intervention and time. This work develops a CNN-based automated RA diagnosis to help rheumatologists with their diagnosis and treatment plan. The CNN architecture for automated RA detection avoids manual preprocessing, handcrafted segmentation and classification, and data augmentation is used with the CNN architecture to increase the robustness of the network and avoid overfitting during the training phase. The RA detection CNN model achieves an accuracy
of 98.8% and an error rate of 1.2%. The evaluation metrics are measured in terms of accuracy, loss, recall, sensitivity, specificity, precision, false discovery rate, Matthews correlation coefficient, negative predictive value, geometric mean, false positive rate and false negative rate. The obtained results show that the network has potential for early RA diagnosis.
References 1. Silman, A. J., & Pearson, J. E. (2002). Epidemiology and genetics of rheumatoid arthritis. Arthritis Research & Therapy, 4(S3), S265. 2. Kourilovitch, M., Galarza-Maldonado, C., & Ortiz-Prado, E. (2014). Diagnosis and classification of rheumatoid arthritis. Journal of Autoimmunity, 48, 26–30. 3. Gabriel, S. E. (2001). The epidemiology of rheumatoid arthritis. Rheumatic Disease Clinics of North America, 27(2), 269–281. 4. Fleming, A., Crown, J. M., & Corbett, M. A. R. Y. (1976). Early rheumatoid disease. I. Onset. Annals of the Rheumatic Diseases, 35(4), 357–360. 5. Jacoby, R. K., Jayson, M. I. V., & Cosh, J. A. (1973). Onset, early stages, and prognosis of rheumatoid arthritis: A clinical study of 100 patients with 11-year follow-up. British Medical Journal, 2(5858), 96–100. 6. Lineker, Badley, Charles, C., Hart, L., & Streiner, D. (1999). Defining morning stiffness in rheumatoid arthritis. Journal of Rheumatology, 26(1052), 7. 7. Bhisikar, S. A., & Kale, S. N. (2018, December). Classification of rheumatoid arthritis based on image processing technique. In International Conference on Recent Trends in Image Processing and Pattern Recognition (pp. 163–173). Singapore: Springer. 8. Huo, Y., Vincken, K. L., van der Heijde, D., De Hair, M. J. H., Lafeber, F. P., & Viergever, M. A. (2016). Automatic quantification of radiographic finger joint space width of patients with early rheumatoid arthritis. IEEE Transactions on Biomedical Engineering, 63(10), 2177–2186. 9. Burmester, G. R., & Pope, J. E. (2017). Novel treatment strategies in rheumatoid arthritis. The Lancet, 389(10086), 2338–2348. 10. Calabrò, A., et al. (2016). One year in review 2016: Novelties in the treatment of rheumatoid arthritis. Clinical and Experimental Rheumatology, 34(3), 357–372. 11. Kim, Y., Oh, H. C., Park, J. W., Kim, I. S., Kim, J. Y., Kim, K. C., et al. (2017). Diagnosis and treatment of inflammatory joint disease. Hip & Pelvis, 29(4), 211–222. 12. Smolen, J. S., Landewé, R., Breedveld, F. C., Buch, M., Burmester, G., Dougados, M., et al. (2014). EULAR recommendations for the management of rheumatoid arthritis with synthetic and biological disease-modifying antirheumatic drugs: 2013 update. Annals of the Rheumatic Diseases, 73(3), 492–509. 13. Brinkmann, G. H., Norli, E. S., Bøyesen, P., van der Heijde, D., Grøvle, L., Haugen, A. J., et al. (2017). Role of erosions typical of rheumatoid arthritis in the 2010 ACR/EULAR rheumatoid arthritis classification criteria: Results from a very early arthritis cohort. Annals of the Rheumatic Diseases, 76(11), 1911–1914. 14. Tins, B. J., & Butler, R. (2013). Imaging in rheumatology: Reconciling radiology and rheumatology. Insights into Imaging, 4(6), 799–810. 15. Patil, P., & Dasgupta, B. (2012). Role of diagnostic ultrasound in the assessment of musculoskeletal diseases. Therapeutic Advances in Musculoskeletal Disease, 4(5), 341–355. 16. Narvaez, J. A., Narváez, J., De Lama, E., & De Albert, M. (2010). MR imaging of early rheumatoid arthritis. Radiographics, 30(1), 143–163. 17. Schenk, O., Huo, Y., Vincken, K. L., van de Laar, M. A., Kuper, I. H., Slump, K. C., et al. (2016). Validation of automatic joint space width measurements in hand radiographs in rheumatoid arthritis. Journal of medical imaging, 3(4),
18. Duryea, J., Jiang, Y., Zakharevich, M., & Genant, H. K. (2000). Neural network based algorithm to quantify joint space width in joints of the hand for arthritis assessment. Medical Physics, 27(5), 1185–1194. 19. Larsen, A., Dale, K., & Eek, M. (1977). Radiographic evaluation of rheumatoid arthritis and related conditions by standard reference films. Acta Radiologica. Diagnosis, 18(4), 481–491. 20. Sharp, J. T., Bluhm, G. B., Brook, A., Brower, A. C., Corbett, M., Decker, J. L., et al. (1985). Reproducibility of multiple-observer scoring of radiologic abnormalities in the hands and wrists of patients with rheumatoid arthritis. Arthritis & Rheumatism: Official Journal of the American College of Rheumatology, 28(1), 16–24. 21. Tashita, A., Morita, K., Nii, M., Nakagawa, N., & Kobashi, S. (2017, September). Automated estimation of mTS score in hand joint X-ray image using machine learning. In 2017 6th International Conference on Informatics, Electronics and Vision & 2017 7th International Symposium in Computational Medical and Health Technology (ICIEV-ISCMHT) (pp. 1–5). IEEE. 22. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. 23. Wang, S., Zhang, Y., Lei, S., Zhu, H., Li, J., Wang, Q., et al. (2020). Performance of deep neural network-based artificial intelligence method in diabetic retinopathy screening: A systematic review and meta-analysis of diagnostic test accuracy. European Journal of Endocrinology, 183(1), 41–49. 24. Ou, Y., Ambalathankandy, P., Shimada, T., Kamishima, T., & Ikebe, M. (2019, April). Automatic Radiographic Quantification of Joint Space Narrowing Progression in Rheumatoid Arthritis Using POC. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) (pp. 1183–1187). IEEE. 25. Hall, E. L., Kruger, R. P., Dwyer, S. J., Hall, D. L., Mclaren, R. W., & Lodwick, G. S. (1971). A survey of preprocessing and feature extraction techniques for radiographic images. IEEE Transactions on Computers, 100(9), 1032–1044. 26. Alkan, A. (2011). Analysis of knee osteoarthritis by using fuzzy c-means clustering and SVM classification. Scientific Research and Essays, 6(20), 4213–4219. 27. Bhisikar, S. A., & Kale, S. N. (2016, September). Automatic analysis of rheumatoid arthritis based on statistical features. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT) (pp. 242–245). IEEE. 28. Subramoniam, M. (2015, March). A non-invasive method for analysis of arthritis inflammations by using image segmentation algorithm. In 2015 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2015] (pp. 1–4). IEEE. 29. Mittal, A., & Dubey, S. K. (2012). Analysis of rheumatoid arthritis through image processing. International Journal of Computer Science Issues (IJCSI), 9(6), 442. 30. Vinoth, M., & Jayalakshmi, B. (2014). Bone mineral density estimation using digital x-ray images for detection of rheumatoid arthritis. International Journal of Pharma and Bio Sciences, 5(3), 104–121. 31. Helwan, A., & Tantua, D. P. (2016). IKRAI: Intelligent knee rheumatoid arthritis identification. International Journal of Intelligent Systems and Applications, 8(1), 18. 32. Murakami, S., Hatano, K., Tan, J., Kim, H., & Aoki, T. (2018). Automatic identification of bone erosions in rheumatoid arthritis from hand radiographs based on deep convolutional neural network. Multimedia Tools and Applications, 77(9), 10921–10937. 33. Oyedotun, O. K., Olaniyi, E. O., & Khashman, A. (2016). 
Disk hernia and spondylolisthesis diagnosis using biomechanical features and neural network. Technology and Health Care, 24(2), 267–279.
The Improved Method for Image Encryption Using Fresnel Transform, Singular Value Decomposition and QR Code Anshula and Hukum Singh
Abstract In this paper, we strengthen the double random phase encoding (DRPE) procedure by encrypting input images together with a QR code in the Fresnel domain. In the proposed scheme, the wavelength (λ) and the propagation distances (z1, z2) act as encryption keys. The scheme requires no lenses and is therefore also called a lens-less technique, and it is applied directly to the images to be encrypted. Because each step is encrypted individually, decryption requires the corresponding key for each step. Here the random phase mask (RPM), the structured phase mask (SPM) and the Fresnel parameters act as both encryption and decryption keys. The proposed idea is simulated in MATLAB on a computer system (Intel(R) Core(TM) i3-2328 CPU @ 2.20 GHz–2.71 GHz, 2 GB RAM, running Windows 10, MATLAB R2019a (9.6.0.1174912) 64-bit (win64), LM: 40664749).
Anshula Department of Computer Science and Engineering, The NorthCap University, Gurugram, India
H. Singh (B) Department of Applied Sciences, The NorthCap University, Gurugram, India e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_51
1 Introduction
Most of the time, the communicated data can easily be captured by intruders. Even though numerous encryption technologies [1–5] are in common use, the public is still not fully aware of the need for information security. DRPE was first proposed by Refregier and Javidi in 1995 [1] and is also called the 4-f imaging architecture. DRPE is an optically symmetric key system which encrypts the input image using two independent random phase masks (RPMs) and the Fourier transform: one mask is used in the input plane and the other in the Fourier plane. Decryption is done either by using the conjugate of the encrypted image or by using the phase conjugates of the two RPMs. The same DRPE scheme has been studied in various other domains such as the fractional Fourier transform (FrFT) [6–10], Fresnel transform (FrT) [11–13], Fresnel wavelet transform (FWT)
[14], Gyrator transform (GT) [15–18], Gyrator wavelet transform (GWT) [19], Hartley transform (HT) [20, 21], Fractional Hartley transform (FrHT) [22–24], fractional Mellin transform (FrMT) [25–27] Arnold transform (AT) [28] etc. Also, image encryption using diffractive optical elements (DOE), Single Value decomposition (SVD) [29, 30], Asymmetric cryptosystem, multiple image encryption, color image encryption, watermarking, Interference based techniques like equal modulus decomposition (EMD) have been reported [31]. Fractional wavelet transform (FWT) [9, 32] is symmetric or asymmetric in nature and are vulnerable to many attacks. Now a days, encryption schemes are explored using quick response code [33]. By QR code, the data could be retrieved by the reader. Generally, the QR code is referred to as a two-dimensional code and it contains the data regarding an image. After detecting the image, the programmed processor will be used to make the digital analysis. Normally, the QR code illustrates the link to the domain. This link could be static or dynamic. The input image would be stored in the unorganizable format. After completing the scanning of Quick Response Code, domain link will be obtained. Through this technique from the domain link, the input image can access without any data reduction. The QR code is used in the DRPE architecture. At last, the noise-free original image is obtained through this technique. After that, the noise-free optical image encryption (OIE) method grounded on QR code methods. This process is done to increase the security level. The QR code has some advantages. They are error correction and strong fault tolerance. The report contains the demonstration process like Quick Response code, optical encryption for noise-free retrieval. Recently, the asymmetric optical image cryptosystem is proposed [34]. This technique is based on EMD and coherent superposition. This is explored as a combination of asymmetric cryptosystem and interference based imaging.
2 Structured Phase Mask (SPM) The SPM [35–42] has various benefits over the normal RPM, which make it difficult for attackers and hackers to replicate these keys. It also overcomes the axis-alignment problem that generally exists in optical set-ups and provides extra security by holding the features of several keys in one single mask. The complex amplitude produced by a Fresnel wave front is given by

$U(r) = \exp\left(\frac{-i K r^{2}}{2 f}\right)$  (1)

where $K = 2\pi/\lambda$ is the propagation constant. The radial Hilbert phase function can be written as

$H(P, \theta) = \exp(i P \theta)$  (2)
The Improved Method for Image Encryption Using Fresnel Transform, …
621
where P signifies the transformation order. It is evident from here that opposite halves of any radial line of the mask have a relative phase difference of Pπ radian. The combination of these two keys gives rise to a new key, the SPM, which is given by

$V(r, P, \theta) = U(r) \cdot H(P, \theta) = \exp\left(\frac{-i K r^{2}}{2 f}\right) \cdot \exp(i P \theta)$  (3)
Figure 1a represents the Fresnel zone plate (FZP) constructed using Eq. (1). Figure 1b represents the radial Hilbert mask (RHM) designed using Eq. (2). Figure 1c represents the phase part of the FZP and RHM, also known as a spiral phase plate (SPP).
Fig. 1 a Represents the SPM developed after combining the FZP and RHM. b Represents the FZP and c represents the RHM with the transformation order P = 7
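The construction of Eqs. (1)–(3) can be sketched numerically as below. This is an illustrative NumPy snippet; the focal length, pixel pitch and grid size are assumed values and are not taken from the paper.

```python
# Building the structured phase mask of Eq. (3) as the product of the FZP phase and the RHM
import numpy as np

N = 256
wavelength = 632.8e-9          # He-Ne wavelength (m)
focal_length = 0.1             # assumed focal length (m)
P = 7                          # transformation order of the radial Hilbert mask
pixel = 10e-6                  # assumed pixel pitch (m)

x = (np.arange(N) - N / 2) * pixel
X, Y = np.meshgrid(x, x)
r = np.hypot(X, Y)             # radial coordinate
theta = np.arctan2(Y, X)       # azimuthal coordinate

K = 2 * np.pi / wavelength
fzp = np.exp(-1j * K * r**2 / (2 * focal_length))   # Eq. (1): Fresnel zone plate phase
rhm = np.exp(1j * P * theta)                        # Eq. (2): radial Hilbert mask
spm = fzp * rhm                                     # Eq. (3): structured phase mask
```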
622
Anshula andH. Singh
Fig. 2 a Plain image. b U component. c S component. d V component, e decrypted image
3 Singular Value Decomposition (SVD) SVD is a stable and reliable technique to decompose an image into a set of independent components U, S and V [43]:

$[U, S, V] = \mathrm{SVD}(\mathrm{Image})$  (4)

$\mathrm{Image} = U * S * V^{T}$  (5)

where $V^{T}$ denotes the transpose of the matrix V. Figure 2a represents the plain image, Fig. 2b–d elucidate the U, S and V components, respectively, and Fig. 2e denotes the decrypted image.
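A short illustration of Eqs. (4)–(5) with NumPy is given below; the image is random data standing in for the plain image of Fig. 2a.

```python
# Decompose an image with SVD and reconstruct it as U * S * V^T (Eqs. 4-5)
import numpy as np

image = np.random.rand(256, 256)
U, S, Vt = np.linalg.svd(image, full_matrices=False)   # S is returned as a 1-D vector
reconstructed = U @ np.diag(S) @ Vt                     # Image = U * S * V^T
print(np.allclose(reconstructed, image))                # True up to floating-point error
```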
4 Fresnel Transform The Fresnel transform (FrT) of an input image f(x, y) can be written [11–13] as

$F_z(u, v) = \mathrm{FrT}_{\lambda,z}\{ f(x, y) \} = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\, h_{\lambda,z}(u, v, x, y)\, dx\, dy$  (6)

where the operator $\mathrm{FrT}_{\lambda,z}$ denotes the Fresnel transform with parameters λ and z, and $h_{\lambda,z}$ is the kernel of the transform given by

$h_{\lambda,z}(u, v, x, y) = \frac{1}{\sqrt{i \lambda z}} \exp\left(i \frac{2\pi z}{\lambda}\right) \exp\left[\frac{i\pi}{\lambda z}\left((u - x)^{2} + (v - y)^{2}\right)\right]$  (7)

A useful property of the FrT is

$\mathrm{FrT}_{\lambda,z_1}\left[\mathrm{FrT}_{\lambda,z_2}\{ f(x) \}\right] = \mathrm{FrT}_{\lambda,z_1 + z_2}\{ f(x) \}$  (8)
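One common way to evaluate Eq. (6) numerically is to apply the Fresnel (paraxial) transfer function in the Fourier domain, as sketched below. The sampling parameters are assumptions for illustration, and the routine is not the authors' implementation; the last lines simply check the additivity property of Eq. (8).

```python
# FFT-based Fresnel propagation: multiply the angular spectrum by the Fresnel transfer function
import numpy as np

def fresnel_propagate(field, wavelength, z, pixel):
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=pixel)
    FX, FY = np.meshgrid(fx, fx)
    # Fresnel transfer function H(fx, fy) for propagation distance z and wavelength lambda
    H = np.exp(1j * 2 * np.pi * z / wavelength) * \
        np.exp(-1j * np.pi * wavelength * z * (FX**2 + FY**2))
    return np.fft.ifft2(np.fft.fft2(field) * H)

f = np.random.rand(256, 256)
Fz = fresnel_propagate(f, wavelength=632.8e-9, z=20e-3, pixel=10e-6)

# Additivity of Eq. (8): propagating by z1 and then z2 equals propagating by z1 + z2
a = fresnel_propagate(fresnel_propagate(f, 632.8e-9, 0.01, 10e-6), 632.8e-9, 0.02, 10e-6)
b = fresnel_propagate(f, 632.8e-9, 0.03, 10e-6)
print(np.allclose(a, b))   # True
```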
5 Simulation and Numerical Results The validity and capability of the optical cryptosystem are discussed in this section using numerical simulations, following the scheme shown in Fig. 3. The test image is 'baboon', of size 256 × 256 pixels. The two masks, also of size 256 × 256 pixels, are uniformly distributed in the intervals [0, 1] and [0, 2π], respectively. In the Fresnel transform, the propagation distances (z1, z2) and the wavelength (λ) are taken as extra keys in the simulation. The ciphertext is obtained using the Fresnel-transform approach to encryption. Figure 4a shows the baboon input image used for encryption and decryption. To begin the encryption process, a few parameters were set, because the selected domain of the ciphertext is obtained using the FrT: wavelength = 632.8 nm, propagation distance = 20 mm and distance between the source and the observation side = 1.8 mm. Figure 4b shows the structured phase mask for the selected parameters. The image encrypted using the Fresnel transform is shown in Fig. 4c. The decrypted image, obtained when all parameters are correct and equal to the input image, is shown in Fig. 4d. The peak signal-to-noise ratio (PSNR) and the mean square error (MSE) serve as two vital parameters for assessing the degradation of the image.
Fig. 3 Flow chart of the proposed scheme
Fig. 4 a Original image; b SPM; c encoded image and d decoded image
While the MSE is a measure of the error in the image with respect to the original image, the PSNR indicates the effect of residual noise. The MSE signifies the cumulative squared error between the original input image and the image that has undergone some change. The correctness of an algorithm is assessed with this metric, which computes the mean difference between the original image and the decoded image; it measures the effectiveness of the algorithm and verifies the quality of the final recovered image:

$\mathrm{MSE} = \frac{1}{M \times N} \sum_{x=1}^{M} \sum_{y=1}^{N} \left| I_o(x, y) - I_d(x, y) \right|^{2}$  (9)
where M × N is the size of the input image. If the MSE obtained is zero, then the original image is fully recovered without any loss of information. The MSE calculated for the two decrypted images is 1.39 × 10−26 and 2.11 × 10−26, respectively. Figure 5 shows the curve of the MSE against the propagation distance z2. The proposed technique is secure because, to decode the image, the values of the RPM and RHM and both propagation distances (z1 and z2) must all be chosen correctly. If any value is chosen wrongly, an error results (the MSE value is positive) and the decoded image is not obtained. A higher value of the MSE clearly indicates an error in the decrypted image.

Fig. 5 MSE curve against propagation distance

The peak signal-to-noise ratio (PSNR) is defined as

$\mathrm{PSNR} = 10 \times \log_{10} \frac{(N - 1)^{2}}{\mathrm{MSE}}$  (10)
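Equations (9) and (10) translate directly into code. The short sketch below assumes 8-bit grayscale images, so that N − 1 in Eq. (10) is taken as 255; this convention is an assumption of the example, not a statement of the authors' code.

```python
# MSE (Eq. 9) and PSNR (Eq. 10) between an original and a decoded image
import numpy as np

def mse(original, decoded):
    return np.mean(np.abs(original.astype(float) - decoded.astype(float)) ** 2)

def psnr(original, decoded, levels=256):
    err = mse(original, decoded)
    return np.inf if err == 0 else 10 * np.log10((levels - 1) ** 2 / err)

original = np.random.randint(0, 256, (256, 256))
decoded = original.copy()
print(mse(original, decoded), psnr(original, decoded))   # 0.0 and inf for perfect recovery
```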
The PSNR value calculated using Eq. (10) for the grayscale image (baboon) is 18.12 dB. The deciphered image obtained after using all correct parameters, i.e., the correct encryption keys and the Fresnel parameters (wavelength and free-space propagation distance), is depicted in Fig. 4d and confirms the correctness of the system. The numerical simulation of the sensitivity of the proposed scheme is carried out on a PC with an Intel(R) Core(TM) i3-2328 CPU @ 2.20 GHz–2.71 GHz and 2 GB RAM, running Windows 10 and MATLAB R2019a (9.6.0.1174912) 64-bit (win64), LM: 40664749. The processing time for encoding and decoding was 2.061 s. To check the robustness of the proposed system, Fig. 6a, b depicts decrypted images obtained with the correct keys, Fig. 6c, d shows the decrypted images obtained when a wrong propagation key (z1 = 15) is used, Fig. 6e, f depicts the decrypted images obtained when a wrong wavelength is used, and Fig. 6g, h depicts the decrypted images obtained when the other propagation distance is wrong (z2 = 18). Compared with the conventional DRPE scheme, the system developed using the SPM and RPM in symmetric mode offers extra complexity in the encryption and decryption processes. In the encryption method the image is first combined with the SPM and then with the RPM to obtain the encrypted image. Obtaining the decrypted image correctly requires the exact values of the Fresnel propagation parameters and knowledge of the correct keys in addition to the DRPE keys.

Fig. 6 a Decrypted image with correct keys; b–d decrypted images with wrong keys

The degradation caused by noise in the noised image is assessed by repeated calculation; the addition of noise by any method hinders the quality of the deciphered image. As a result, the strength and effectiveness of any algorithm are verified against a noise attack. The interference of multiplicative noise can be represented through the following relation [44–47]:

$E' = E + kG$  (11)

where E and E′ denote the encrypted image before and after the noise attack, G is the Gaussian noise and k is the noise factor. Figure 7 shows the MSE plots of the decrypted images against the Gaussian noise factor (k).
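The noise test of Eq. (11) can be mimicked with a few lines of NumPy, as in the sketch below. The encrypted image here is random data, and the loop only shows how the error introduced by the noise term kG grows with k (cf. Fig. 7); it is not the full encryption–decryption pipeline.

```python
# Adding Gaussian noise scaled by k to an encrypted image (Eq. 11) and measuring the error
import numpy as np

rng = np.random.default_rng(1)
encrypted = rng.random((256, 256))          # stand-in for the encrypted image
for k in range(0, 11):
    noisy = encrypted + k * rng.normal(0.0, 1.0, encrypted.shape)   # E' = E + kG
    print(k, np.mean((noisy - encrypted) ** 2))    # error introduced by the noise rises with k
```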
Fig. 7 Plot of MSE vs noise factor (k)
It is observed that the MSE curves increase with increasing noise factor, and that the scheme is secure against noise attacks, with maximum resistance to Gaussian noise. The drop in quality of the recovered images is comparable to the case of additive noise. Figure 8a shows the input image of baboon, which is transformed into the color image of Fig. 8b; the Fresnel transform is then taken and the result is converted into the QR code of Fig. 8c. The phase image of the Fresnel transform is depicted in Fig. 8d. Figure 9a represents the image encrypted using the proposed scheme. The generated color QR code is shown in Fig. 9b. The SVD image obtained using the singular value decomposition concept is shown in Fig. 9c, and Fig. 9d represents the wavelet decomposition up to 4 stages. Figure 9e represents the phase image of the QR code. Finally, using all correct keys and correct parameters, the decrypted image is shown in Fig. 9d.
6 Conclusion An image encryption scheme based on the FrT using an SPM and an RPM as the masks is proposed. The scheme offers additional security and confidentiality: encoding an image using the Fresnel transform enhances the security and privacy of the input image, and the usage of different masks helps to enlarge the key space. From a cryptographic point of view, the asymmetric method is more secure with less loss of data, and the proposed system can be extended to multiple images. The simulation results demonstrate the sustainability and efficiency of this cryptosystem. The experiments confirm that the proposed scheme, based on the Fresnel transform, is more efficient than the existing algorithm. Finally, in this paper, the encryption of the QR code image is done by
Fig. 8 a Input image of baboon, b color image of baboon, c QR image after Fresnel transform, d phase image of QR image
using the Fresnel transform together with the singular value decomposition, and the order of multiplication for the SVD is also taken into account.
Fig. 9 a Encrypted image. b QR code image generated from the color image. c SVD image. d Wavelet decomposition up to 4 stages. e Phase image of the QR code, d decrypted image with all correct keys
References 1. Refregier, P., & Javidi, B. (1995). Optical image encryption based on input plane and Fourier plane random encoding. Optics Letters, 20, 767–769. 2. Javidi, B., et al. (2016). Roadmap on optical security. Journal of Optics, 18, 1–39. 3. Schneier, B. (1996). Applied Cryptography (2nd ed.). New York, USA: Wiley. 4. Yadav, A. K., Vashisth, S., Singh, H., & Singh, K. (2015). Optical cryptography and watermarking using some fractional canonical transforms, and structured masks. In: Lakshminarayanan V., Bhattacharya I. (eds) Advances in Optical Science and Engineering. Springer Proceedings in Physics, vol 166. Springer, New Delhi. https://doi.org/10.1007/978-81-322-236 7-2_5. 5. Kumar, P., Joseph, J., & Singh, K. (2016). Double random phase encoding based optical encryption systems using some linear canonical transforms: weaknesses and countermeasures. In J. J. Healy, M. A. Kutay, H. M. Ozaktas, J. T. Sheridan (Eds.), Springer series in optical sciences (Vol. 198, pp. 367–396). 6. Unnikrishnan, G., Joseph, J., & Singh, K. (2000). Optical encryption by double random phase encoding in the fractional Fourier domain. Optics Letters, 25, 887–889. 7. Dahiya, M., Sukhija, S., & Singh, H. (2014). Image encryption using quad masks in fractional Fourier domain and case study. IEEE International Advance Computing Conference, 1048– 1053. 8. Maan, P., & Singh, H. (2018). Non-linear cryptosystem for image encryption using radial Hilbert mask in fractional Fourier transform domain. 3D Research, 9, 53. https://doi.org/10. 1007/s13319-018-0205-8. 9. Girija, R., & Singh, H. (2018). Symmetric cryptosystem based on chaos structured phase masks and equal modulus decomposition using fractional Fourier transform, 3D Research, 9, 42. https://doi.org/10.1007/s13319-018-0192-9. 10. Singh, H., Yadav, A. K., Vashisth, S., & Singh, K. (2014). A cryptosystem for watermarking based on fractional Fourier transform using a random phase mask in the input plane and structured phase mask in the frequency plane. Asian Journal of Physics, 23, 597–612. 11. Matoba, O., & Javdi, B. (1999). Encrypted optical memory system using three-dimensional keys in the Fresnel domain. Optics Letters, 24, 762–764. 12. Situ, G., & Zhang, J. (2004). Double random-phase encoding in the Fresnel domain. Optics Letters, 29, 1584–1586. 13. Singh, H., Yadav, A. K., Vashisth, S., & Singh, K. (2015). Optical image encryption using devil’s vortex toroidal lens in the Fresnel transform domain. International Journal of Optics, 926135, 1–13. 14. Singh, H. (2016). Cryptosystem for securing image encryption using structured phase masks in Fresnel wavelet transform domain. 3D Research, 7, 34. https://doi.org/10.1007/s13319-0160110-y. 15. Rodrigo, J. A., Alieva, T., & Calvo, M. L. (2007). Gyrator transform: Properties and applications. Optics Express, 15, 2190–2203. 16. Abuturab, M. R. (2012). Securing color image using discrete cosine transform in gyrator transform domain structured-phase encoding. Optics and Lasers in Engineering, 50, 1383– 1390. 17. Singh, H., Yadav, A. K., Vashisth, S., & Singh, K. (2014). Fully-phase image encryption using double random-structured phase masks in gyrator domain. Applied Optics, 53, 6472–6481. 18. Singh, H., Yadav, A. K., Vashisth, S., & Singh, K. (2015). Double phase-image encryption using gyrator transforms, and structured phase mask in the frequency plane. Optics and Lasers in Engineering, 67, 145–156. 19. Singh, H. (2018). 
Hybrid structured phase mask in frequency plane for optical double image encryption in gyrator transform domain. Journal of Modern Optics, 65, 2065–2078. 20. Singh, H. (2016). Devil’s vortex Fresnel lens phase masks on an asymmetric cryptosystem based on phase-truncated in gyrator wavelet transform. Optics and Lasers in Engineering, 125–139.
21. Hartley, R. V. L. (1942). A more symmetrical Fourier analysis applied to transmission problems. Proceedings of the IRE, 30, 144–150. 22. Chen, L., & Zhao, D. (2006). Optical image encryption with Hartley transforms. Optics Letters, 31, 3438–3440. 23. Singh, H. (2017). Nonlinear optical double image encryption using random-vortex in fractional Hartley transform domain. Optica Applicata, 47(4), 557–578. 24. Girija, R., & Singh, H. (2019). Triple-level cryptosystem using deterministic masks and modified Gerchberg-Saxton iterative algorithm in fractional Hartley domain by positioning singular value decomposition. Optik, 187, 238–257. 25. Girija, R., & Singh, H. (2019). An asymmetric cryptosystem based on the random weighted singular value decomposition and fractional Hartley domain. Multimedia Tools and Applications, 78, 1–19. 26. Zhou, N.-R., Wang, Y., & Gong, L. (2011). Novel optical image encryption scheme based on fractional Mellin transform. Optics Communication, 284, 3234–3242. 27. Vashisth, S., Singh, H., Yadav, A. K., & Singh, K. (2014). Devil’s vortex phase structure as frequency plane mask for image encryption using the fractional Mellin transform. International Journal of Optics, 2014, (728056). https://doi.org/10.1155/2014/728056. 28. Singh, H. (2018). Watermarking image encryption using deterministic phase mask and singular value decomposition in fractional Mellin transform domain. IET Image Processing, 12, 1994– 2001. 29. Yadav, P. L., & Singh, H. (2018). Optical double image hiding in the fractional Hartley transform using structured phase filter and Arnold transform. 3D Research, 9, 20. https://doi.org/10.1007/ s13319-018-0172-0. 30. Girija, R., & Singh, H. (2018). A cryptosystem based on deterministic phase masks and fractional Fourier transform deploying singular value decomposition. Optical Quantum Electronics, 50, 210. https://doi.org/10.1007/s11082-018-1472-6. 31. Singh, H. (2016). Optical cryptosystem of color images using random phase masks in the fractional wavelet transform domain. AIP Conference Proceedings, 1728, 020063-1/4. 32. Singh, H. (2016). Optical cryptosystem of color images based on fractional-, wavelet transform domains using random phase masks. Indian Journal of Science and Technology, 9S, 1–15. 33. Chen, H., Tanougast, C., Liu, Z., & Sieler, L. (2017). Asymmetric optical cryptosystem for color images based on equal modulus decomposition in gyrator domains. Optics and Lasers in Engineering, 93, 1–8. 34. Barrera, J. F., Mira, A., & Taroroba, R. (2013). Optical encryption and QR codes: Secure and moise-free information retrieval. Optics Express, 21, 5373–5378. 35. Cai, J., Shen, X., Lei, M., Lin, C., & Dou, S. (2015). Asymmetric optical cryptosystem based on coherent superposition and equal modulus decomposition. Optics Letters, 40, 475–478. 36. Abuturab, M. R. (2013). Color information security system using Arnold Transform and double structured phase encoding in gyrator transform domain. Optics & Laser Technology, 45, 524– 532. 37. Khurana, M., & Singh, H. (2019). A spiral-phase rear mounted triple masking for secure optical image encryption based on gyrator transform. Recent patents on Computer Science, 12, 80–84. 38. Khurana, M., & Singh, H. (2018). Asymmetric optical image triple masking encryption based on gyrator and Fresnel transforms to remove silhouette problem. 3D Research, 9, 38. https:// doi.org/10.1007/s13319-018-0190-y. 39. Khurana, M., & Singh, H. (2018). 
optical image encryption using Fresnel Zone plate mask based on fast walsh hadamard transform. AIP Conference Proceedings, 1953, 140043-1/4. 40. Khurana, M., & Singh, H. (2018). Spiral-phase masked optical image health care encryption system for medical images based on fast Walsh-Hadamard transform for security enhancement. International Journal of Healthcare Information Systems and Informatics, 13, 98–117. 41. Maan, P., & Singh, H. (2018). Optical asymmetric cryptosystem based on kronecker product hybrid phase and optical vortex phase masks in the phase truncated hybrid transform domain. 3D Research, 10, 8. https://doi.org/10.1007/s13319-019-0218-y.
42. Zamrarni, W., Ahouzi, E., Lizana, A., Campos, J., & Yzuel, M. J. (2016). Optical image encryption technique based on deterministic phase masks. Optical Engineering, 55, 1031081/9. 43. Khurana, M., & Singh, H. (2018). Data computation and secure encryption based on gyrator transform using singular value decomposition and randomization. Procedia Computer Science, 132, 1636–1645. 44. Yadav, A. K., Vashisth, S., Singh, H., & Singh, K. (2015). A phase-image watermarking scheme in gyrator domain using devil’s vortex Fresnel lens as a phase mask. Optics Communication, 344, 172–180. 45. Khurana, M., & Singh, H. (2017). An asymmetric image encryption based on phase truncated hybrid transform. 3D Research, 8, 28. https://doi.org/10.1007/s13319-017-0137-8. 46. Girija, R., & Singh, H. (2018). Enhancing security of double random phase encoding based on random S-Box. 3D Research, 9, 15. https://doi.org/10.1007/s13319-018-0165-z. 47. Anshula, & Singh, H. (2021). Security enrichment of an asymmetric optical image encryptionbased devil’s vortex Fresnel lens phase mask and lower upper decomposition with partial pivoting in gyrator transform domain. Optical Quantum Electronics, 53(4), 1–23.
A Study on COVID-19 Impacts on Indian Students Arpita Telkar, Chahat Tandon, Pratiksha Bongale, R. R. Sanjana, Hemant Palivela, and C. R. Nirmala
Abstract This paper aims to ascertain the numerous constraints and obstacles students face in receiving their education online, by conducting a survey consisting of questions highlighting different socioeconomic aspects of students' lives during the pandemic. Sundry social facets have been studied, including the effect of students' links with their parents, siblings, friends, and teachers, as their support counts in the social well-being of a student's life. A total of 632 students across various institutions took the survey and provided their opinions about how well their learning is progressing during school closures, based on the kind of support they get from their family, friends, and teachers, the kind of assistance their respective institutions equipped them with, their daily learning objectives, the types of professions their parents are involved in, and so on. The results obtained indicate that students who went to international and private institutions have an edge over those who went to public institutions, and that students' parents' professions played quite an important role in determining their learning progress, along with the kind of assistance children were offered by their teachers and friends. Keywords Coronavirus · COVID-19 · Pandemic · SARS
1 Introduction Currently, the world is terror-stricken with the intrusion of a life-threatening disease called the “Coronavirus disease (COVID-19)”. It has caused catastrophic damages over the globe including the disruption of numerous sectors of nations, regardless of whether they are public or non-public, large or small, business/trade-related or not. The impact on most sectors can be estimated to an extent, but it is quite onerous to approximate the intensity of blight that our education system is going through as a consequence of the nationwide closure of all educational institutions ranging from pre-primary to post-graduate schools from the second week of March 2020 in an A. Telkar (B) · C. Tandon · P. Bongale · R. R. Sanjana · H. Palivela · C. R. Nirmala Computer Science and Engineering, BIET, Davangere, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_52
attempt to contain the spread of the disease [1]. There are a lot of on-going debates on the adverse effects of the impairment caused to education in all states of India, as the education sector is a crucial contributing factor in tomorrow's economy. Since the lockdown has caused all schools and colleges to shut down, students are constrained to remain at home, attend classes online, and do and submit their assignments and projects online; everything is online now. They are deprived of several activities and of the school environment, which are crucial for the growth of every student in terms of both academics and co-curricular activities and play a vital role in their overall development. The shift from traditional classrooms to everyday online Zoom, Google Meet, Microsoft Teams and Skype classes has been extraordinarily challenging for many concerned, especially professors and students [2]. Not everybody has an accessible setup of a home desktop, a laptop or even a phone to either conduct or attend online classes, making this mode of education inaccessible to a significant student population. The shift to online classes escalated the use of gadgets among students due to their regular online classes and other virtual activities. University professors have set up accounts on online conferencing platforms like Zoom, Skype Inc and Google LLC to engage with students. With the use of these as virtual classrooms, students get plenty of time to finish their homework and work on other learnings. Conversely, poor internet connectivity, power supply and lack of adequate gadgets hamper the learning of students in rural areas. Data analysis for the hypotheses below can be carried out using several methods. Sample size determination, regression, standard deviation and mean are some of the well-known techniques used in analysing and evaluating data. Sample size determination is a technique to obtain the exact size of a sample so that the study is accurate without using the whole large dataset. The regression method models the relationship between dependent and independent variables. We practiced hypothesis testing and employed it in our study, as we found it to be more relevant and efficient for our purposes. Nonetheless, we have also implemented the mean, SD and SEM for a better understanding of the data. Our main contributions are as follows:
• Determining the impact that teachers' support and guidance has on students' learning during tough times like Covid-19.
• Figuring out how family members, including parents and siblings, affect students' behaviour and help boost their morale when it is hard to gain education the normal way.
• Finding out the degree of motivation and confidence students gained from their peer discussions.
• All of the above tasks were achieved by devising the most appropriate hypotheses according to general social and economic knowledge, and the results were visualized using various seaborn plots shown below in Sect. 3.
1.1 Method of Data Collection The collection of data could be done in various ways. But, due to the emerging health crisis, a survey was conducted by circulating the set of questions through Google forms amongst students across the board, making it a virtual interaction. More than 600 students participated and expressed their opinion by submitting their responses. Google forms, best served this purpose and made it easy for us to gather responses from students across the nation. Responses were recorded regarding their economic status, type of education they were into, number of siblings, parents’ profession, etc. The analysis of the socioeconomic factors affecting students learning in these COVID times was done based on the data recorded.
1.2 Empirical Findings on Socioeconomic Factors Affecting Students Learning Amidst COVID A total of 20 questions were considered necessary for the analysis. Questions included the students' school grade, to get a clear idea of the level of study required of them. From the data gathered, we could see that the majority of the responses were received from college students and that participation was mostly by males. It has been found that secondary-school girls have better learning growth than boys [3–9]. Keith and colleagues [3, 4, 6, 10–12] studied the correlations between learning from home and scholastic achievement and observed correlations around 0.30. The data reported in Table 1 are predominantly the options chosen for each question in the online survey via Google Forms. Several analyses were performed on these data to obtain a better understanding before establishing the students' learning curve during COVID; the scrutiny could thus be done readily and the required curve was achievable (Table 1). Detailed information on each column of the table is given below:
• Count: the total number of observations considered for the analysis.
• Unique values: the number of possible answers that can be received for the question concerned.
• Top: the mode of the data, i.e., the answer that appeared most often among the possible answers.
• Frequency: the number of times the most frequent answer occurred.
• Mean: the calculated central value of the data from a set of values that vary in range.
• Standard deviation (SD): a value indicating how much each value in the data differs from the mean value of the group.
• Standard Error for Mean (SEM): the variation of the sample mean of the data from the true population mean; it is always smaller than the SD.
Table 1 Descriptive statistics of students' socioeconomic factors
Each row gives: factor (categories with counts, where recoverable); unique values; top (frequency); mean; SD; SEM; 95% confidence interval for the mean.

Gender (Male 363, Female 269); 2; Male (363); 0.425; 0.494; 0.0196; 0.387–0.464
Standard (5–8 grade 11, 9–12 grade 7, University 614); 3; University (614); 1.954; 0.28; 0.0111; 1.932–1.957
Institute (Public 136, Private 490, International 6); 3; Private (490); 0.794; 0.427; 0.017; 0.76–0.827
Siblings (None 114, One 362, Two 111, Three 32, Over four 13); 5; One (362); 1.158; 0.848; 0.0337; 1.092–1.224
Father's occupation (Between jobs 2, Service 368, Social work 39, Others 223); 4; Service (368); 1.297; 0.582; 0.023; 1.252–1.342
Mother's occupation (Service 102, House maker 504, Social work 3, Others 23); 4; House maker (504); 1.134; 0.446; 0.017; 1.099–1.169
Student's English capability (Below average to Excellent); 5; Good (353); 1.899; 0.689; 0.027; 1.845–1.952
Learning progress (five-point agreement scale); 5; Agree (336); 2.576; 0.937; 0.037; 2.503–2.649
Learning habit maintenance (five-point agreement scale); 5; Agree (357); 2.592; 0.98; 0.039; 2.515–2.668
Teacher influence (five-point agreement scale); 5; Agree (357); 2.658; 0.918; 0.036; 2.587–2.729
Parent influence (five-point agreement scale); 5; Agree (358); 2.883; 0.838; 0.033; 2.817–2.948
Sibling influence (five-point agreement scale); 5; Agree (294); 2.538; 0.894; 0.035; 2.468–2.608
Peer influence (five-point agreement scale); 5; Agree (303); 2.479; 0.947; 0.038; 2.817–2.948
Self-learning motivation (five-point agreement scale); 5; Agree (370); 2.834; 0.899; 0.036; 2.764–2.904
Family support (five-point agreement scale); 5; Agree (338); 3.164; 0.805; 0.032; 3.102–3.227
Daily learning objectives (five-point agreement scale); 5; Agree (341); 2.568; 0.898; 0.036; 2.498–2.638
Communication and collaboration with friends (five-point agreement scale); 5; Agree (371); 2.723; 0.922; 0.037; 2.651–2.795
Fig. 1 Flowchart of the proposed study
• Confidence Interval for Mean: an interval estimated from the sample data for the mean of the overall population. A 95% confidence interval for the mean is a range of values that is 95% sure to contain the accurate mean of the overall population (an illustrative computation of these summary statistics is sketched below).
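The summary statistics listed above (mean, SD, SEM and the 95% confidence interval) can be computed as in the short sketch below. The response values are made-up Likert codes for illustration only, not the survey data.

```python
# Mean, SD, SEM and a 95% confidence interval for one encoded survey item
import numpy as np
from scipy import stats

responses = np.array([3, 4, 2, 4, 3, 4, 4, 1, 3, 4, 2, 4, 3, 4, 0, 3])  # 0-4 encoded answers
mean = responses.mean()
sd = responses.std(ddof=1)                       # sample standard deviation
sem = stats.sem(responses)                       # = sd / sqrt(n)
lower, upper = stats.t.interval(0.95, df=len(responses) - 1, loc=mean, scale=sem)
print(mean, sd, sem, (lower, upper))
```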
1.3 Workflow The data for our study were fetched through an online survey of students' opinions using Google Forms to collect their answers to our questions. Once the data were obtained, data scrutiny was carried out, selecting only the columns appropriate for our study. The entire process, from collecting the data to analysing them and using them for predictions, is shown in Fig. 1. After the data collection, some cleaning and data-encoding methods were applied for the required study of the data.
1.4 Aims—Hypothesis The goal of the current analysis is to figure out the various socioeconomic determinants affecting students' learning progress during the pandemic. This paper focuses on the following research questions:
1. How does the type of education (public/private/international) affect the student's learning progress? To what extent does the family's income determine the student's academic performance? Depending on the findings, the following hypotheses were framed: international and private institutes offer students a better education (Hypothesis 1); parents occupied in service are better able to provide their wards with a good education (Hypothesis 2).
2. Does having more siblings affect the student's learning hours? Is English an important factor in determining how confident a student is about his/her learning progress? We hypothesized that the more siblings there are, the greater the responsibility of taking care of each other and hence the lower the progress (Hypothesis 3), and that for a successful education in today's world the English language is a critical determinant of learning (Hypothesis 4).
3. Is support from family and friends a determiner of students' learning during the tough times? (Hypothesis 5).
4. How do teachers influence their students' learning and help them cope with their studies online? Teachers play a significant role in a student's learning (Hypothesis 6).
2 Methods 2.1 Design—Participants To collect data on students’ learning curve during COVID, a survey was conducted via online Google forms that were sent to students in all states of India, belonging to different classes. The survey consisted of questions for which options were provided to answer. The students had to choose the best option that is true to them. This way, by circulating the survey from 30th September 2020 to 10th October 2020, we received responses from different parts of the nation. This paper aims to depict how effective is the online education system in providing the confidence about learning to students of different backgrounds and financial status, and how well the society (family and friends) is accepting the new normal and encouraging students in this pandemic. For this purpose, a total of 20 questions were added to the survey asking about the students’ social and economic elements around them. With the received data, the analysis was done to understand how these factors are affecting students learning in the COVID pandemic.
2.2 Study The study included survey reports from various school and college students from all over India. We circulated the Google form across various students that came from different schools, states, backgrounds, etc. Students were catechized on their socioeconomic aspects that influence their education and overall learning progress. The study concentrated on finding out the relationship between the above-mentioned parameters with their learning progress and determining a 95% confidence interval for the student population in India using a sample population of 632 students’ opinions. The data consisted of 363 reviews submitted by male students and 269 reviews were submitted by female students. Data was then analysed for finding the students’ learning curve in the COVID pandemic. The study for all the aims mentioned in the hypothesis was carried out and the effective graph was procured. The acquired graph was then analysed to know the students state and the problem faced by them in this pandemic. Understanding these factors and their effect on the students’ education and growth becomes a dominant feature to be considered as they directly relate to the country’s economic status.
2.2.1 Data Analysis
Initially, the mean values and standard deviations were calculated. For the appraisal of grade and gender differences, an analysis of variance was applied. A 95% confidence interval was calculated to relate the collected dataset to the nationwide student population. As per the hypotheses, the relationships between the various factors affecting the students were calculated to understand how they have affected the students' growth in learning. The procured data were encoded and graphs were developed. Keeping each graph as a reference, conclusions were drawn about how the corresponding factor relates to students' learning progress during COVID, this being one important factor. Likewise, a graph of the relationship between the parents' jobs and learning progress was assembled to understand the role the parents' jobs play in the students' learning and how they have affected learning in the COVID era. Consequently, we analysed each of the factors and determined the impact it has left on students in the online education system.
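A hedged sketch of this kind of analysis is given below: encoded responses are grouped by a factor and the learning progress per institute type is summarized and plotted with seaborn (cf. Fig. 2). The column names and the tiny sample data are assumptions for illustration, not the authors' actual code or dataset.

```python
# Group encoded learning-progress responses by institute type and visualize them
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Tiny made-up sample standing in for the encoded survey responses
df = pd.DataFrame({
    "Institute": ["Public", "Private", "International", "Private", "Public", "Private"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "Learning progress": ["Agree", "Agree", "Strongly agree",
                          "Neither agree nor disagree", "Disagree", "Agree"],
})
order = ["Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree"]
df["progress_code"] = df["Learning progress"].map({c: i for i, c in enumerate(order)})

# Descriptive summary per institute type, mirroring the grouping behind Fig. 2
print(df.groupby("Institute")["progress_code"].agg(["mean", "std", "count"]))

# Bar plot of mean encoded learning progress by institute type and gender
sns.barplot(data=df, x="Institute", y="progress_code", hue="Gender")
plt.ylabel("Mean learning progress (encoded)")
plt.tight_layout()
plt.show()
```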
3 Results 3.1 How Does the Type of Education (Public/Private/International) Affect the Student’s Learning Progress? To What Extent Does the Family’s Income Determine the Student’s Academic Performance? It is evident that students that go to International institutes are quite sure of their learning progress and the same holds for those children that come from private institutes even during the pandemic relying on the online facilities available and the virtual assistance provided by their respective institutions (Fig. 2). On the other hand, children that go to public schools/universities are slightly less sure of their learning progress during COVID as compared to their fellow students that go to private and international institutes. We hypothesized that parents with a decent profession are better able to provide quality education to their wards (Fig. 3). The result confirms the same (Hypothesis 2). It can be seen that students feel more assured about their academic progress when their fathers are into some kind of a service or other profession as they can get their children to receive education in international and private institutions when compared to students whose fathers are either between jobs or into doing social work. In this case (Fig. 4), students seem to be better assured of their learning having their mothers at home when compared to having their mothers go to work. This might be due to the moral support that their mothers can bestow upon them. In a nutshell, of all the above results, students admitted to international schools or universities are most convinced in comparison to those who belong to private and/or public institutions.
Fig. 2 Variation of learning progress given different genders and institute types
Fig. 3 Variation of the learning progress of children belonging to different Institutes with their Fathers coming from different professions
Fig. 4 Variation of the learning progress of children belonging to different Institutes with their Mothers coming from different professions
3.2 Does Having More Siblings Affect the Student's Learning Hours? Is English an Important Factor in Determining How Confident a Student Is About His/Her Learning Progress? Hypothesis 3 did not come true: Fig. 5 shows that having any number of siblings does not significantly impact a student's daily learning objectives, whose average achievement is around 3. The fourth hypothesis seems to be right to a great extent (Hypothesis 4). Students claim good progress in their learning depending on their language capabilities (Fig. 6); we took the English language as the parameter for this purpose. All students who study in international institutions avow that they have excellent English and hence the highest progress in their learning, and the majority of students in private institutions also reported excellent language skills (Fig. 6).
3.3 Is Support from Family and Friends a Determiner of Students' Learning During Tough Times? As per Hypothesis 5, the correlation between family support and students' learning progress was constructed. It is clear from Fig. 7 that students getting better family support have better learning progress than those who lack family support. It is also
Fig. 5 Effect of having Siblings on students’ daily learning objectives
Fig. 6 Effect of English language on student’s learning progress
seen that students with average or the above-average motivation for learning agree to have satisfactory learning progress irrespective of the family support they are getting. It is clear from Fig. 8 that male students are more likely to have better communication with their peers than the female ones. Nonetheless, female students with comparatively less connection with their friends can have good learning progress. Overall, the figure clearly says that students with better communication and collaboration with friends can possess good learning progress during COVID. This could be because of the clarification of the doubts that can happen among the peers which is difficult with teachers.
Fig. 7 Relationship of the learning progress with the family support in accordance with self-learning motivation
Fig. 8 Discrepancy of learning progress with respect to communication and collaboration with friends for different genders
3.4 How Do Teachers Influence Their Students into Learning and Help Them Cope with Their Studies Online? Figure 9 shows the result corresponding to Hypothesis 6. It is evident that students studying in international schools and colleges agree that they are getting upstanding
Fig. 9 Variation of learning progress concerning teacher influence and institute type
encouragement from their teachers through which they have better learning progress from the rest of the students. However, we can see that students studying in public or private schools and colleges also have acceptable learning progress, but they tend to be less when compared to students from international schools. This is because of the quality of education international schools and colleges provide, from superior teachers, adapting the highly flexible methodology to teach their students which other students lack.
4 Conclusion This research paper focuses on highlighting the various impacts on the students' learning curve during the current pandemic situation, which has caused a lot of ups and downs in students' lives. Various social and economic aspects have been studied within this research, including students' connections with their parents, siblings, friends, and teachers, to put light on the social determinants; as far as the economic aspects are concerned, we studied the effect on students' learning progress of the types of institutions they went to, which in turn depended on their parents' professions. In all, it can be concluded that students having a good amount of support from their families, friends, and teachers are helped in ensuring their learning progress, along with being admitted into reputed institutions which can provide them with better assistance and continued learning during hard times like COVID-19.
References 1. Jena, P. K. (2020). Impact of COVID-19 on higher education in India. International Journal of Advanced Education and Research. 2. Abad-Segura, E., González-Zamar, M.-D., Infante-Moro, J. C., & García, G. R. (2020). Sustainable management of digital transformation in higher education: Global research trends. Sustainability, 12(5), 2107 3. Keith, T. Z. (1982). Time spent on homework and high school grades: a large-sample path analysis. Journal of Educational Psychology, 74, 248–253. 4. Keith, T. Z., & Page, E. P. (1985). Homework works at school: national evidence for policy changes. School Psychology Review, 14, 351–359. 5. Tymms, P., & Fitz-Gibbon, C. T. (1992). The relationship of homework to A-level results. Educational Research, 34, 3–10. 6. Keith, T. Z., & Benson, M. J. (1992). Effects of manipulable influences on high school grades across five ethnic groups. Journal of Educational Research, 86, 85–93. 7. Wagner, P., & Spiel, C. (1999). Arbeitszeit fu¨r die Schule e Zu Variabilita¨t und Determinanten [The amount of time pupils work for school variability and determinants]. Empirische Pa¨dagogik, 13, 123–150. 8. Spiel, C., Wagner, P., & Fellner, G. (2002). Wie lange arbeiten Kinder zu Hause fu¨r die Schule? Eine Analyse in Gymnasium und Grundschule [How long and for what subjects do pupils work at home for school? An analysis of academic secondary school and primary school]. Zeitschrift fu¨ r Entwicklungspsychologie und Pa¨dagogische Psychologie, 34, 125–135. 9. Xu, J. (2006). Gender and homework management reported by high school students. Educational Psychology, 26, 73–91. 10. Keith, T. Z., Reimers, T., Fehrmann, P., Pottebaum, S., & Aubey, L. (1986). Parental involvement, homework, and TV-time: direct and indirect effects on high school achievement. Journal of Educational Psychology, 78, 373–380. 11. Cool, V. A., & Keith, T. Z. (1991). Testing a model of school learning: direct and indirect effects on academic achievement. Contemporary Educational Psychology, 16, 28–44. 12. Keith, T. Z., Keith, P. B., Troutman, G. C., Bickley, P. G., Trivette, P. S., & Singh, K. (1993). Does parental involvement affect eighth-grade achievement? Structural analysis of national data. School Psychology Review, 22, 474–496.
Improving Efficiency of Machine Learning Model for Bank Customer Data Using Genetic Algorithm Approach B. Ajay Ram, D. J. santosh Kumar, and A. Lakshmanarao
Abstract Machine learning techniques are very useful in extracting useful patterns from customer datasets. Although Machine Learning techniques give good results in most of the cases, there is a need to improve the efficiency of ML models in different ways. Feature selection is one of the most important tasks in machine learning. A genetic algorithm is a heuristic method that simulates the selection process. Genetic algorithms come under the category of evolutionary algorithms, which are generally used for generating solutions to optimization problems using selection, crossover, mutation methods. In this paper, we proposed a genetic algorithm-based feature selection model to improve the efficiency of Machine Learning techniques for customer related datasets. We applied a genetic algorithm-based feature selection for two different customer information datasets from UCI repository and achieved good results. All the experiments are implemented in Python language which provides vast packages for machine learning tasks. Keywords Machine learning · Genetic algorithm · Feature selection · Python
1 Introduction In today's world, the banking industry generates large volumes of data every day. The banking system contains data related to account information, transactions, etc. There is a need for data analytics in banking to extract meaningful information from this data. It is important to identify the behavior of customers in order to retain them. Nowadays, machine learning methods can be applied to various
B. Ajay Ram Department of CSE, Lendi Institute of Engineering and Technology, Visakhapatnam, A.P, India D. J. santosh Kumar Department of CSE, Avanthi’s Research and Technological Academy, Visakhapatnam, A.P, India A. Lakshmanarao (B) Department of IT, Aditya Engineering College, Surampalem, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_53
applications. Machine learning is a type of data analysis that uses algorithmic models to learn from data; after learning, the model can be used for prediction tasks. ML is considered a subset of the well-known field of artificial intelligence (AI). In machine learning models, computers can extract useful information from data without any human involvement. The good thing about machine learning is that the data, not the code, is used for making predictions. Most applications use dynamic data, so machine learning allows the model to learn and evolve with experience. Although there are several types of machine learning models, the most widely used are supervised learning and unsupervised learning. In supervised models, a predefined set of labeled examples is available for the model to draw conclusions from; in unsupervised models, the model finds useful patterns in data for which no previous examples exist. Classification and regression tasks come under supervised learning: in a classification task we need to predict a label, whereas in a regression task we need to predict a value. For bank customer data, supervised learning methods achieve good results. Many researchers have applied machine learning models to banking applications such as credit card fraud detection, predicting the probability of default of credit card clients, and predicting the subscription of term deposits. Genetic algorithms are based on natural selection and genetics and are very useful for solving optimization and search-oriented problems. Feature selection can also be done with genetic algorithms [1]. Genetic algorithms are stochastic in nature; they operate on a population of possible solutions. The candidate solutions are encoded as 'genes', strings of characters over some alphabet. New solutions are produced by modifying members of the current population and by combining two solutions to form a new one. The better solutions are selected to breed and mutate, and the worse ones are discarded. Genetic algorithms are probabilistic search methods: the states they explore are not determined solely by the properties of the problem, and a random process guides the search. Genetic algorithms are intensively used in the field of artificial intelligence, and the combination of genetic algorithms and machine learning techniques helps in finding better solutions for various applications. Steps in a genetic algorithm (see Fig. 1; an illustrative sketch follows the figure):
(1) Generate the initial population.
(2) Evaluate the fitness.
(3) Selection.
(4) Crossover: generate new child chromosomes from two parent chromosomes by using the information extracted from those parents.
(5) Mutation: it operates independently on each individual.
(6) After mutation, the fitness value is used to evaluate the newly generated child population; if the stopping criterion is not met, repeat Steps (3)–(6) until the maximum number of generations is reached.
Fig. 1 Working model of genetic algorithm
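The sketch below is a compact, self-contained illustration of the loop in steps (1)–(6): binary chromosomes, fitness evaluation, tournament selection, one-point crossover and bit-flip mutation. The toy fitness function (the count of ones) is only there to make the example runnable; it is not the fitness used in this paper.

```python
# Generic genetic-algorithm loop over binary chromosomes (illustrative toy example)
import random

GENES, POP, GENERATIONS, CX_PB, MUT_PB = 20, 30, 40, 0.8, 0.02

def fitness(ind):                                          # (2) evaluate the fitness
    return sum(ind)

def tournament(pop):                                       # (3) tournament selection
    return max(random.sample(pop, 3), key=fitness)

# (1) initial population of random bit strings
pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):                               # (6) repeat up to the generation limit
    nxt = []
    while len(nxt) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        if random.random() < CX_PB:                        # (4) one-point crossover
            cut = random.randint(1, GENES - 1)
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        # (5) bit-flip mutation applied independently to each child
        nxt += [[1 - g if random.random() < MUT_PB else g for g in child] for child in (p1, p2)]
    pop = nxt[:POP]                                        # the new generation replaces the old
print("best fitness:", max(fitness(ind) for ind in pop))
```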
2 Literature Survey Patil and Dharwadkar [2] applied an artificial neural network model on two datasets, credit data, and bank customer data for classification tasks. They achieved an accuracy of 72%, 98% for bank customer’s data and credit card fraud detection respectively. Ozcelik and Duman [3] proposed a method in which each transaction is separately scored. These scores are used for dividing a customer as fraudulent or legitimate. They applied 2 meta-heuristic algorithms namely scatter approach, genetic algorithm. They have shown that their model achieved good results for fraud detection. Vats et al. [4] applied a novel genetic algorithm model for credit card fraud detection and achieved good results. Moro et al. [5] applied machine learning models Neural Network (NN) Logistic Regression, SVM, Decision Trees to predict the success rate of telemarketing calls which are used for selling long-term deposits in banks. They applied the feature selection model which reduces features from 150 to 22. They achieved good results with Neural Network model. Elsalamony [6] evaluated and compared the classification performance of four data mining models Logistic Regression, Neural Network, Naive Bayes and C5.0 on the bank direct marketing dataset to classify for bank deposit subscription. They achieved 93% accuracy with C5.0 model. Neural network and Logistic Regression model achieved 90% accuracy. Oberoi [7] proposed
652
B. Ajay Ram et al.
a genetic algorithm approach for credit card fraud detection. Similar to [7], we also apply a genetic algorithm, but we use it only for feature selection and then apply machine learning techniques. Pouramirarsalani [8] applied a new e-banking fraud detection method using a hybrid feature selection method and a genetic algorithm and achieved good results. A number of researchers have applied genetic-algorithm-based feature selection techniques in different applications. Khare and Burse [9] applied genetic-algorithm-based feature selection for the classification of ovarian cancer; they applied five classification techniques and achieved good results with the BayesNet classifier. Babatunde et al. [10] described a GA-based feature selection procedure. The method developed there involved a novel fitness function to choose a combinatorially optimal subset of features from the original feature list. For benchmarking, the features chosen by both the WEKA feature selectors and the GA were fed into various WEKA classifiers, and the GA-based features outperformed the WEKA-based features in most instances; they showed that the GA-based method produces better results by changing the fitness function. Jain et al. [11] applied various machine learning techniques for credit card fraud detection and achieved good accuracy with random forest. Kahlid and Alkhatib [12] applied various ML classifiers to predict the customers leaving the bank.
3 Research Methodology First, we collected the datasets and applied various data preprocessing techniques. Genetic algorithms were then applied to find the best features. After identifying the best features, machine learning algorithms were applied to the datasets with the selected features. The proposed model is shown in Fig. 2. Dataset-1 (Bank Marketing dataset) contains 20 features, namely: age (numeric), job (categorical), marital status (categorical), education (categorical), default (has credit in default?), housing (has housing loan?), loan (has personal loan?), contact communication type, month (last contact month of year), day_of_week (last contact day of the week), duration (duration of the last contact in seconds), campaign (number of contacts performed in the campaign for the client), pdays (number of days since the customer was last contacted in the previous campaign), previous (number of contacts performed before this campaign for the customer), poutcome (outcome of the previous campaign), emp.var.rate (employment variation rate), cons.price.idx (consumer price index), cons.conf.idx (consumer confidence index), euribor3m (euribor 3-month rate) and nr.employed (number of employees). We need to predict whether the customer will go for a term deposit (yes/no), which is denoted by the dependent variable (y). Dataset-2 (default of credit card clients dataset) contains 24 features, namely: X1 (amount of the given credit), X2 (gender), X3 (education), X4 (marital status), X5 (age), X6–X11 (past payment history), X12–X17 (amounts of bill statements) and X18–X23 (amounts of previous payments). We need to predict the default payment next month (yes/no).

Fig. 2 Proposed method: collection of dataset → apply data preprocessing techniques → apply genetic algorithm with all features (feature selection) → apply ML models with the features selected in the previous step → compare and select the best ML model
4 Experimentation and Results All the experiments are implemented in the Python language, which provides packages for implementing a genetic algorithm; we installed the scoop and deap packages for this purpose. After applying the genetic algorithm to the two datasets, we identified the best features in each. The genetic algorithm gives different feature subsets in different runs, as the division of the dataset may vary from run to run, so we selected the best features across all runs. We selected 12 features from the bank marketing dataset and 10 features from the default of credit card clients dataset (Table 1).
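A hedged sketch of GA-based feature selection with the deap package mentioned above is shown below. Each individual is a 0/1 mask over the columns, and its fitness is the 5-fold cross-validated accuracy of a classifier trained on the selected columns. The placeholder data generated by make_classification stands in for the preprocessed, encoded bank datasets; this is not the authors' exact script.

```python
# GA-based feature selection: binary masks evolved with DEAP, fitness = 5-fold CV accuracy
import random
from deap import base, creator, tools, algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # placeholder data

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("bit", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.bit, n=X.shape[1])
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

def evaluate(mask):
    cols = [i for i, m in enumerate(mask) if m]
    if not cols:                                   # an empty feature subset is useless
        return (0.0,)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()
    return (acc,)

toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=30)
pop, _ = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=15, verbose=False)
best = tools.selBest(pop, 1)[0]
print("selected columns:", [i for i, m in enumerate(best) if m])
```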
4.1 Applying Machine Learning Techniques After selecting best features using genetic algorithm, we applied different machine learning classification models on the two datasets. For comparing different algorithms, we need to use algorithm performance measurements. Accuracy is one such measure for classification task. Before applying a machine learning algorithm, we divided the given dataset into training and testing data. For this division, we applied
Table 1 Features selected (by applying genetic algorithm)
S. no | Features selected for bank marketing dataset | Features selected for default of credit card clients dataset
1 | Age | Age
2 | Marital status | PAY_0
3 | Education | PAY_4
4 | Housing | BILL_AMT1
5 | Month | BILL_AMT2
6 | Duration | BILL_AMT4
7 | pdays | PAY_AMT1
8 | emp.var.rate | PAY_AMT2
9 | cons.price.idx | PAY_AMT3
10 | cons.conf.idx | PAY_AMT4
11 | euribor3m |
12 | nr.employed |
For this division, we applied the 5-fold cross-validation technique. In this technique, the data is divided into 5 parts, also called folds, and each of the 5 folds is used as the testing set in one of the iterations.
4.2 Applying ML Techniques on Bank Marketing Dataset We divide the 'Bank Marketing dataset' into training and testing sets. The total number of samples in this dataset is 41,188. Since we applied the 5-fold cross-validation technique, the training set contains 32,951 samples and the testing set 8,237. We applied 7 classifiers, namely, logistic regression, K-NN, random forest, decision tree, extra-tree classification, gradient boosting, and the AdaBoost classifier, with the selected features (Fig. 3 and Table 2).
Fig. 3 Accuracy comparison
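A sketch of how the classifiers can be compared under 5-fold cross-validation with scikit-learn is given below; the synthetic matrix stands in for the GA-selected features and is an assumption, not the authors' data.

```python
# Sketch: comparing classifiers on the selected features with 5-fold CV.
# make_classification is a stand-in for the GA-selected feature matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, AdaBoostClassifier)

X_sel, y = make_classification(n_samples=2000, n_features=12, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-nearest neighbour": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(),
    "Extra trees": ExtraTreesClassifier(),
    "Gradient boosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
}

for name, model in models.items():
    scores = cross_validate(model, X_sel, y, cv=5, return_train_score=True)
    print(f"{name:22s} train={scores['train_score'].mean():.3f} "
          f"test={scores['test_score'].mean():.3f}")
```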
Table 2 Results for bank marketing dataset
Machine learning technique | Classifier accuracy (Training set) (%) | Classifier accuracy (Testing set) (%)
Logistic regression | 90.8 | 91.5
K-nearest neighbour | 92.3 | 89.8
AdaBoost | 90.7 | 91.5
Decision tree classification | 99.9 | 89.2
Random forest classification | 99.9 | 91
Extra-tree classifier | 97.9 | 90.8
Gradient boosting classifier | 97.4 | 90.9
ANN | 90.5 | 90.1

Table 3 Results for credit card clients dataset
Machine learning technique | Classifier accuracy (Training set) (%) | Classifier accuracy (Testing set) (%)
Logistic regression | 80.8 | 81
K-nearest neighbour | 84.2 | 79.1
AdaBoost | 81.9 | 82.3
Decision tree classification | 98.9 | 72.7
Random forest classification | 98.9 | 81.2
Extra-tree classifier | 94.8 | 81.8
Gradient boosting classifier | 89.4 | 79.2
ANN | 77.9 | 78
4.3 Applying ML Techniques on Credit Card Dataset We divide the 'default of credit card clients' dataset into training and testing sets. The total number of samples in this dataset is 30,000. We applied the 5-fold cross-validation technique, so the training set contains 24,000 samples and the testing set 6,000. We applied seven classification algorithms with the selected features (Table 3 and Fig. 4).
Fig. 4 Accuracy comparison
5 Conclusion In this paper, we applied a genetic algorithm approach for the feature selection process in the machine learning model. We applied a genetic algorithm to two different bank customer datasets and identified the best features. Machine learning models were then built using these features. We applied seven machine learning algorithms to these two datasets. For the Bank Marketing dataset, logistic regression and AdaBoost gave the best accuracy among all classifiers. For the default of credit card clients dataset, AdaBoost gave the best accuracy among all classifiers.
References
1. Jung, M., & Zscheischler, J. (2013). A guided hybrid genetic algorithm for feature selection using expensive cost functions. In International Conference on Computational Science, Procedia Computer Science, 18, 2337–2346 (ICCS-Elsevier).
2. Patil, P. S., & Dharwadkar, N. (2017). Analysis of banking data using machine learning. In International Conference on IoT in Social, Mobile, Analytics and Cloud.
3. Ozcelik, M. H., & Duman, E. (2011). Detection of credit card fraud by genetic algorithm and scatter search. Expert Systems with Applications, 13057–13063 (Elsevier).
4. Vats, S., Dubey, D. K., & Pandey, N. K. (2013). Genetic algorithms for credit card fraud detection. In International Conference on Education and Educational Technologies.
5. Moro, S., Rita, P., & Cortez, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems (DSS), 62, 22–31 (Elsevier).
6. Elsalamony, H. (2014). Bank direct marketing analysis of data mining techniques. International Journal of Computer Applications, 85(7), 0975–8887.
7. Oberoi, R. (2017). Credit-card fraud detection system using genetic algorithm. International Journal of Computer & Mathematical Sciences, 7(6). ISSN: 2347-852.
8. Pouramirarsalani, A., Khalilian, M., & Nikravanshalmani, A. (2017). Fraud detection in e-banking by using the hybrid feature selection and evolutionary algorithms. International Journal of Computer Science and Network Security, 17(8).
9. Khare, P., & Burse, K. (2016). Feature selection using genetic algorithm and classification using WEKA for ovarian cancer. International Journal of Computer Science & Information Technologies, 7(1), 194–196.
10. Babatunde, O. J., Armstrong, L., & Diepeveen, D. (2015). A genetic algorithm-based feature selection. International Journal of Electronics Communication & Computer Engineering, 5(4). ISSN: 2278-4209.
11. Jain, V., Agrawal, M., & Kumar, A. (2020). Performance analysis of machine learning algorithms in credit cards fraud detection. In 8th International Conference on Reliability, Infocom Technologies and Optimization. Noida: Amity University, IEEE.
12. Alkhatib, K., & Abualigah, S. (2020). Predictive model for cutting customers migration from banks: Based on machine learning classification algorithms. In 2020 11th International Conference on Information and Communication Systems (ICICS). IEEE.
Unsupervised Learning to Heterogeneous Cross Software Projects Defect Prediction Rohit Vashisht and Syed Afzal Murtaza Rizvi
Abstract Heterogeneous Cross-Project Defect Prediction (HCPDP) aims to predict defects in a target project with insufficient historical defect data through a defect prediction (DP) model trained on another source project. It does not demand the same set of metrics between the two applications; instead, it builds the DP model on matched heterogeneous metrics showing an analogous distribution in their values for a given pair of datasets. This paper proposes a novel HCPDP model consisting of four phases: the data preprocessing phase, the feature engineering phase, the metric matching phase, and lastly, the training and testing phase. One may employ supervised or unsupervised learning techniques to train the DP model. The supervised method of learning uses tagged data or well-defined instances to train the model. On the other hand, unsupervised learning techniques attempt to train the model by identifying hidden patterns in the distribution of the unlabeled instances' values. The advantage of using unlabeled data is that it is easier to obtain than labeled data, which requires manual effort. This paper empirically and theoretically assesses the impact of the training process on the efficiency of the HCPDP model using an unsupervised learning method. Beyond this, a comparative study has been done among HCPDP with supervised learning, HCPDP with unsupervised learning, and the standard DP approach, i.e., Within-Project Defect Prediction (WPDP). Logistic Regression and Km++ clustering are used as the supervised and unsupervised techniques, respectively. Results show that for both classes of DP, HCPDP and WPDP, the unsupervised learning method demonstrates performance comparable to the supervised learning method.
Keywords Cross project · Unsupervised learning · Heterogeneous · Software metric
R. Vashisht (B) Research Scholar, Jamia Millia Islamia, Delhi, India; e-mail: [email protected]
S. A. M. Rizvi Professor, Jamia Millia Islamia, Delhi, India; e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_54
1 Introduction The primary aim of every software development process is to ensure that the final product or service has the appropriate quality standard at the end, which is called Software Quality Assurance (SQA). Based on end-user specifications, any deviation between the actual and expected results for some regulatory configuration may be described as a defect. The most critical stage of the Software Development Life Cycle (SDLC) is testing, as it often absorbs a significant proportion of the overall cost of the project, so this phase should be planned first in any software development process. Software Defect Prediction (SDP) at the right time is the only way to tackle this issue. The defect prediction (DP) model is primarily designed to detect "within-project" defects by segregating the available defect dataset into two subsections, training the DP model using one subsection (referred to as tagged instances) and using the other subsection to test the built DP model. Testing the DP model means identifying labels for unidentified instances in the target application dataset, which are either defective or non-defective [1]. Cross-Project Defect Prediction (CPDP) is a field of research where a software project lacking the requisite local defect data can use data from other projects to create an effective and efficient DP model. Obviously, to facilitate the local application of CPDP, cross-project knowledge needs to be listed beforehand [2]. CPDP captures common software measures/functions from both the source application (whose defect data are used to train the DP model) and the target application (for which the DP model is made) [3]. But when using HCPDP, there are no uniform metrics between the prediction pair datasets. Related metrics can be located between two applications by calculating the coefficient of correlation between all possible software feature combinations. The feature-pair combinations showing some form of analogous distribution in their values are considered common features between the source and target datasets for forecasting cross-project defects. This paper aims to predict defects in a target application whose features are fully heterogeneous from the source application's feature set, where the classification model is trained using the source project's defect data owing to the deficiency of past defect data in the target project. The proposed research work offers a four-phase novel HCPDP model to address this issue. Figure 1 shows the disparity between homogeneous CPDP and heterogeneous CPDP: the common metrics A, B, and D are extracted from the source and target datasets in homogeneous CPDP, whereas in heterogeneous CPDP the pairs of correlated features (A, Q), (C, S), and (D, P) are evaluated from the two datasets. For training the DP model, one can use any of the machine learning methods. Supervised learning enables us to classify unknown instances using statistics of previously labeled instances. However, there are cases in which unsupervised learning methods are more effective than supervised methods: for example, it is very expensive to annotate large datasets, so one can mark only a few examples manually, or one may not know how many groups, or into what groups, the data should be split [4]. The proposed study also explores the HCPDP model's output using supervised and unsupervised learning techniques. Logistic regression
Fig. 1 Classification of cross-project defect prediction
is used as a supervised approach, while K-means++ (Km++) clustering is used as an unsupervised method of learning for the DP model's training. The paper addresses the following main areas of contribution.
• Compare the results of the proposed HCPDP model with supervised and unsupervised methods of learning.
• Compare WPDP model outcomes with supervised and unsupervised learning methods.
• Whether and to what extent is the HCPDP model's performance comparable to Within-Project Defect Prediction (WPDP) for both categories of learning techniques?
The proposed research work is structured as follows: Sect. 2 provides a detailed literature survey on HCPDP, Sect. 3 describes the novel four-phase HCPDP model and discusses the supervised and unsupervised learning techniques used in the proposed research study, Sect. 4 briefs the datasets used for analysis of the proposed work and the performance parameters used to evaluate the experimental results, Sect. 5 explains the development aspect of the experiments, Sect. 6 discusses the experimental results, and lastly, Sect. 7 summarizes the conclusive findings.
2 Related Work In 2002, Melo et al. [5] introduced the first recognized work in CPDP. They proposed a Multivariate Adaptive Regression Splines (MARS) model for defect prediction using the data of two Java-based systems, Xpose and Jwriter, and predicted the classes in Jwriter based on their proneness to faults using a model trained on the Xpose dataset. They compared the performance of MARS with Linear Regression (LR) and found that MARS outperforms LR and is also more economically applicable. In 2009, Menzies et al. [6] used data of 10 projects from two discrete sources. They filtered the data for effective defect prediction by removing noisy, redundant, and irrelevant data and used the cleaned data for training the model, applying a Nearest Neighbor (NN) approach to perform the experiments on the 10 projects' data. The results were good for within-project defect prediction; meanwhile, the CPDP task performed using these experiments was unable to outperform within-project defect prediction. In 2009, Camargo et al. [7] first used log transformation for finding similar instances in the training and testing projects' data in order to avoid project-dependent data instances. In the same year, Zimmermann et al. [8] proposed classification for defect prediction with Internet Explorer and Mozilla Firefox as training and testing projects, using coding standard and process parameters for the classification task. They trained the proposed DP model using the defect data of Mozilla Firefox and predicted the defects in IE with this trained model; the results showed that the proposed model performed better when IE was used as the training project and Mozilla Firefox as the testing project. In 2011, Menzies et al. [8] contended that relevancy varies with perception and that the relevancy of data may differ when viewed in different aspects: data that seem relevant when seen globally may be irrelevant when seen locally. They proved their assertion through experiments and concluded that local behavior was superior to global behavior and that condition-based rules should be emphasized rather than taking all parts into account. In 2011, Bettenburg et al. [8] enhanced the contention of Menzies et al. by showing that local models were more suitable for a particular dataset, whereas generality was the focus of global models. In 2012, Rahman et al. [9] performed experiments emphasizing performance measures such as F-score, precision, and recall and proposed that these measures are not appropriate for quality assurance when making defect predictions using different models; they proposed that AUC gives analogous performance in within-project defect prediction models. In 2013, Canfora et al. [10] proposed a multi-objective approach to overcome the single-objective model [9]. They trained a Logistic Regression (LR) model using the Non-dominated Sorting Genetic Algorithm (NSGA-II).
In 2014, Zhang et al. [11] used 1398 projects from Google Code and SourceForge and came up with a Universal Defect Prediction (UDP) model. This model matches the metrics between the training and testing projects' datasets, and if at least 26 metrics were matched, predictions were made for the target projects. In the same year, Li et al. [12] overcame this limitation by using characteristic vectors of instances as a new metric. They also compared CPDP with feature disparity and found negative results; the experiments were conducted on 11 projects with 3 datasets. In 2014, He et al. [12] used feature selection methods to compare the performance results for within-project defect prediction (WPDP) as well as cross-project defect prediction (CPDP). They found that when fewer features of the training project were selected for training the classifiers, higher precision was achieved in WPDP, while better F-score and recall values were achieved in CPDP. In [13, 14], various ensemble classifiers are also trained and validated for the CPDP task. In 2015, Jing et al. [3] proposed defect prediction based on Canonical Correlation Analysis (CCA) and were the first to introduce work on Heterogeneous Defect Prediction (HDP). They eliminated the metric disparity problem between the training and testing project datasets by augmenting dummy metrics having null values, performing experiments on 14 projects with 4 datasets. In 2015 and 2016, Nam et al. [15, 16] conducted experiments on 34 projects with 5 datasets and proposed a transfer learning method for the HDP task. They do not augment metrics with null values like the CCA approach proposed by Jing et al., and their results were comparable to WPDP. In the same year, Ryu et al. [17] performed the CPDP task with a new method called transfer cost-sensitive boosting, which gave state-of-the-art results for the CPDP task. They also proposed the CPDP task using a multi-objective naïve Bayes technique considering class imbalance [18]; their multi-objective naïve Bayes technique outperformed all WPDP models as well as the single-objective models. In 2015, Jing et al. [19] propounded a unified metric representation (UMR) for Heterogeneous Defect Prediction (HDP). In the same year, Nam and Kim [16] proposed the HDP task using metric selection and metric matching. They performed their study on 28 projects, and their results showed that the proposed technique was comparable to WPDP and, in some cases, outperformed it with statistical significance. In 2017, Ni et al. [20] suggested a new method, FeSCH, that gave state-of-the-art results against the baseline methods used and also outperformed ALL, TCA+, and WPDP in most scenarios; the results also showed that the performance of FeSCH was self-sustaining and not dependent on the classifiers used. In 2017, Li et al. [21] gave a comparison among four filtering methods for defect data. They propounded that the choice of the correct defect data filtering method highly influences the capability of the model for defect prediction. They compared four filtering methods: Target project data Guided Filter (TGF), Source project data Guided Filter (SGF), Data Characteristic based Filter (DCBF), and Local Cluster based Filter (LCBF), and they proposed a new filter, the Hierarchical Selection-Based Filter (HSBF), to overcome the limitation of the pre-existing four filters regarding scalability to large datasets. The proposed filtering technique outperformed the state-of-the-art filtering methods.
In 2018, Xu et al. [22] propounded a domain adaptation technique to reduce the higher dimensional features of the training and testing project domains. They applied a dictionary learning technique to learn the difference between feature spaces. They used three open-source projects, NetGene, NASA, and AEEEM, and used three performance measures, recall, F-score, and balance, for comparing Heterogeneous Defect Adaptation (HDA) [22], CCA+ [3], and HDP [15]. After collecting relevant data from existing versions of the same program, Lee and Felix, in 2020, focused on method-level (ML) defect estimation using regression models in a new software version [23]. The authors used three performance measurement variables that display significant association with ML defects, namely defect density, defect velocity, and time of implementation of defects. The proposed work also included an analysis and comparison of classifiers before and after data preprocessing, and of entropy rates in the average output datasets. The experiment revealed that, of all three variables, defect velocity had the highest correlation of 93 percent with the count of ML defects. In 2020, Majd et al. proposed statement-level (SL) defect prediction using deep learning (SLDeep) models [24]. In this work, the authors tried to reduce the pressure on software developers in identifying areas or components that are more prone to defects. The authors performed experiments on 119,989 C/C++ programs of Code4Bench using the Long Short-Term Memory (LSTM) deep learning model. The authors also tested the SLDeep model in predicting defects in unseen data, i.e., new statements, and found the results to be strong, with high recall, precision, and accuracy.
3 Proposed HCPDP Model The basic HCPDP model begins with a pair of datasets (S, T) as source and target datasets having m and n software features, respectively, from two heterogeneous software projects, as described in Fig. 2. In every dataset, each row and column represents an observation and a feature, respectively. Firstly, preprocessing of the datasets is performed to label the categorical variables and to deal with the issues of missing data and the Class Imbalance Problem (CIP) in the training datasets. In the second phase of HCPDP modeling, i.e., feature engineering, ranking of features is performed to eliminate redundant and useless features from the original datasets, and the top n features are selected using an appropriate feature selection technique [25]. Feature extraction is performed to pick the most discriminative features from the source set, or to develop new features using the existing features. In the next step, the model uses an efficient correlation estimation technique to find a set of highly correlated pairs of features between the source and target datasets. The third stage in HCPDP modeling, i.e., metric matching, is the most critical and challenging, since the training accuracy of the built model relies largely on the matched metric set. The correlation between metrics can, therefore, be described using several existing methods, such as the least squares method, the dispersion diagram
Fig. 2 Four-phase HCPDP model
method, and Spearman's rank correlation method. After estimating the Coefficient of Correlation Value (CCV) between the possible feature pairs of the two applications, the model selects those feature pairs whose CCV is greater than the specified cutoff threshold. The collection of feature pairs selected after applying the cutoff threshold filter is said to be the set of strongly correlated features. If this strongly correlated metric set is null for a pair of datasets (S, T), then one can conclude that the source dataset S is not feasible for predicting defects in the heterogeneous target dataset T.
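The metric-matching step described above can be sketched as follows; the data frames, the common-length truncation, and the cutoff value are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch of heterogeneous metric matching: compute Spearman correlation for
# every (source feature, target feature) pair and keep pairs above a cutoff.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
source = pd.DataFrame(rng.normal(size=(100, 4)), columns=["s1", "s2", "s3", "s4"])
target = pd.DataFrame(rng.normal(size=(80, 3)), columns=["t1", "t2", "t3"])

cutoff = 0.05
matched = []
for s_col in source.columns:
    for t_col in target.columns:
        # Truncate to a common length so the two series can be compared rank-wise.
        n = min(len(source), len(target))
        rho, _ = spearmanr(source[s_col][:n], target[t_col][:n])
        if abs(rho) > cutoff:
            matched.append((s_col, t_col, rho))

if not matched:
    print("No strongly correlated pairs: this source is not feasible for the target.")
else:
    for s_col, t_col, rho in sorted(matched, key=lambda m: -abs(m[2])):
        print(f"{s_col} <-> {t_col}: rho = {rho:.3f}")
```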
After identification of this highly correlated metric set, the model is trained using an adequate machine learning algorithm (supervised or unsupervised), and the performance results are recorded in the final phase of the model. The performance results are outlined using different evaluation parameters. In supervised learning, the system attempts to learn from previous examples or given marked instances, while in the unsupervised learning approach, the system attempts to find hidden informative patterns directly from the given unmarked data instances that are useful for predicting the expected outcome. The inputs are distinguished based on the hidden features in the unsupervised approach, and then the prediction is made for the unlabeled instance as to which group it belongs. The conceptual difference between supervised and unsupervised learning techniques is shown in Fig. 3. Logistic Regression (LR) and K-means++ (Km++) clustering are used as the supervised and unsupervised algorithms, respectively, in the proposed study. The details of each learning technique are given in Table 1.
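A minimal sketch of the two learners described in Table 1 is shown below; scikit-learn's KMeans with init="k-means++" plays the role of Km++ here, and the parameter values are assumptions.

```python
# Minimal sketch of the two learners from Table 1 using scikit-learn equivalents.
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

supervised = LogisticRegression(max_iter=1000)            # trained on labelled (X, y)
unsupervised = KMeans(n_clusters=2, init="k-means++",     # defective / non-defective groups
                      n_init=10, random_state=0)          # smart centroid initialisation
```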
4 Datasets Used and Evaluation Parameters Datasets Description This section discusses the datasets used to conduct the experiments. The proposed research analysis includes 13 benchmarked datasets in three groups of projects: AEEEM, ReLink, and SOFTLAB. The AEEEM datasets consist of 61 metrics comprising object-oriented metrics (OOs), past defect metrics, application metrics, and so on. The ReLink repository includes 26 code complexity metrics obtained using the Understand tool. SOFTLAB offers proprietary datasets including Halstead and McCabe cyclomatic metrics. Table 2 lists further details about these datasets. According to the statistics in Table 2, the proportion of defective instances in the three project groups varies from a lowest value of 7.43% to a highest value of 50.51% across all datasets. Evaluation Parameters The methods used for assessing the output of the various machine learning classification algorithms are described below. Table 3 shows the confusion matrix parameters used to account for incorrect classifications.
• Accuracy: This is the proportion of correct predictions (TP and TN) to the total number of observed instances. The value varies from 0 for the least accurate outcome to 1 for highly accurate results. Accuracy = (TP + TN)/(TP + TN + FP + FN)
Fig. 3 Difference between supervised and unsupervised learning techniques
• Recall: It is also known as sensitivity and can be interpreted as the likelihood of a positive prediction given that an instance is indeed defective. It is also called the True Positive Rate (TPR). Recall = TP/(TP + FN)
• F-Score: It is also called the F-measure, which measures a test's accuracy in the statistical study of binary classification. It takes into account both the precision and the recall of the test to determine the score, and is calculated as the harmonic mean of precision (p) and recall (r).
Table 1 Description of used machine learning techniques
Technique | Type | Description
Logistic regression [26] | Supervised | It is a linear classification method in which the predictions are transformed through a logistic function. Typically, it is used in binary classification to develop a probability function that gives the probability that an input belongs to a particular class. Several earlier research works confirmed its ability to predict defects. This technique is used when the dependent variable is of categorical type.
K-means++ (Km++) clustering [27, 28] | Unsupervised | Km++ clustering is an upgraded version of the K-means clustering technique. It is a distance-based learning algorithm that repeatedly calculates the distance used to allocate a new input case to a specific cluster, as done in K-means. Randomization in the picking of the K centroids can lead to the creation of distorted clusters in K-means, so Km++ uses a smart centroid initialization technique to overcome this problem; the rest of the algorithm is the same as K-means.
Table 2 Datasets illustration
Project group | Datasets | Total observations | No. of software features
AEEEM [1, 2] | EQ | 324 | 61
AEEEM [1, 2] | JDT | 997 | 61
AEEEM [1, 2] | LC | 691 | 61
AEEEM [1, 2] | ML | 1862 | 61
AEEEM [1, 2] | PDE | 1492 | 61
ReLink [29] | Apache | 194 | 26
ReLink [29] | Safe | 56 | 26
ReLink [29] | Zxing | 399 | 26
SOFTLAB [6] | ar1 | 121 | 29
SOFTLAB [6] | ar3 | 63 | 29
SOFTLAB [6] | ar4 | 107 | 29
SOFTLAB [6] | ar5 | 36 | 29
SOFTLAB [6] | ar6 | 101 | 29
Table 3 Confusion matrix
Actual result \ Predicted result | Faulty | Non-faulty
Faulty | TP | FN
Non-faulty | FP | TN
F-Score = (2 * p * r)/(p + r)
• Silhouette Score: This parameter is used to check the effectiveness of clustering results. Its value ranges from −1 to 1. It can be calculated using the expression below, where p and q are the mean intra-cluster distance and the mean nearest-cluster distance, respectively, and it can be determined only when the number of clusters is at least 2. The closer the value is to 1, the higher the effectiveness of the clustering, and vice versa. SS = (q − p)/max(p, q)
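The listed parameters can be computed with scikit-learn as sketched below; the toy labels and features are placeholders, not results from the studied datasets.

```python
# Sketch: computing the evaluation parameters with scikit-learn on toy data.
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             confusion_matrix, silhouette_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])        # 1 = faulty, 0 = non-faulty
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F-Score  :", f1_score(y_true, y_pred))         # 2pr / (p + r)

# Silhouette score needs the feature matrix and the cluster assignments.
X = np.random.default_rng(0).normal(size=(8, 3))      # placeholder features
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1])           # placeholder cluster labels
print("Silhouette:", silhouette_score(X, labels))     # (q - p) / max(p, q)
```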
5 Proposed Work The proposed research study is divided into three prime objectives. The first two objectives compare the output of HCPDP and WPDP, respectively, with the employed supervised and unsupervised learning strategies. Finally, the last objective investigates whether and to what extent the HCPDP model's performance is comparable to WPDP for both categories of learning techniques. To answer these three research concerns, the proposed research study conducted two experiments. Experiment 1 This experiment is performed to investigate the efficiency of the proposed four-phase HCPDP model with LR as the supervised learning approach and Km++ as the unsupervised learning approach. Three prediction pairs, as listed in Table 4, are considered from the three open-source project groups AEEEM, ReLink, and SOFTLAB to implement this experiment. In the first phase, preprocessing of the datasets requires the deletion of redundant data and the label encoding of categorical data. Class imbalance learning is also introduced in this step to manage the major difference in the ratio between the counts of the binary-class instances.
Table 4 Prediction pairs for HCPDP
Prediction pair | Training dataset | Testing dataset
HCPDP_P1 | PDE | ar6
HCPDP_P2 | ML | ZXing
HCPDP_P3 | ar3 | Apache
Later on, feature ranking and feature selection techniques are applied to extract the collection of K best features that are of higher importance to the final expected outcome for a given dataset. The selection of features is also done to make the two datasets dimensionally equal, so that metric matching can be done easily. After selecting the useful features, the association between each feature pair is evaluated in the metric matching process. This experiment uses both types of learning approach for training the HCPDP model, i.e., to implement the final modeling phase. Finally, the performance of each learning approach for HCPDP is evaluated on the performance parameters listed in Sect. 4. Experiment 2 This experiment is done to investigate the performance of the traditional category of DP, i.e., WPDP, with both methods of learning. Firstly, preprocessing of the dataset is performed to exclude irrelevant software features and to label-encode the categorical data. Then, a set of strongly discriminative features is picked by feature ranking and feature selection techniques. A ratio of 7:3 is used for partitioning the dataset into training and testing samples; Fig. 4 illustrates within-project defect prediction. The training and testing dataset samples used to carry out this experiment are described in Table 5, and a minimal code sketch of this setup is given after the table. The results of the experiment are measured on the basis of the confusion matrix and the other performance measures listed in Sect. 4. This experiment also compares WPDP's performance with HCPDP using supervised and unsupervised learning methods for the three prediction pairs.
Fig. 4 Within-project defect prediction
Table 5 Prediction pairs for WPDP
Prediction pair | Training dataset | Testing dataset
WPDP_P1 | ar6 | ar6
WPDP_P2 | ZXing | ZXing
WPDP_P3 | Apache | Apache
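The sketch below illustrates the WPDP setup of Experiment 2: a 7:3 split of a single project's data with LR as the learner. The synthetic data stands in for a preprocessed dataset such as ar6 and is an assumption.

```python
# Sketch of the WPDP setup: 70/30 split of one project's data, LR as the learner.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, f1_score

X, y = make_classification(n_samples=101, n_features=15, weights=[0.85],
                           random_state=0)            # ar6-sized placeholder
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("Recall :", recall_score(y_te, pred, zero_division=0))
print("F-Score:", f1_score(y_te, pred, zero_division=0))
```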
6 Experimental Results and Discussion In this section, the findings of the performed experiments are presented. The results are listed in Tables 6, 7, 8, and 9. As indicators of performance assessment, the study uses accuracy, recall, F-Score, and silhouette score. Ten-fold cross-validation provides the training accuracy on the datasets. RQ1. Compare the results of the proposed HCPDP model with supervised and unsupervised methods of learning. Three prediction pairs (HCPDP_P1, HCPDP_P2, and HCPDP_P3) are taken into account to test the efficiency of the HCPDP model. Firstly, for HCPDP_P1, the preprocessing of the datasets PDE and ar6 is done.
Table 6 Set of top 15 features for HCPDP_P1
S. no. | PDE | ar6
1. | WCHU_lcom | blank_loc
2. | WCHU_wmc | branch_count
3. | WCHU_dit | code_and_comment_loc
4. | LDHH_fanIn | condition_count
5. | WCHU_numberOfAttributesInherited | decision_count
6. | WCHU_numberOfMethodsInherited | executable_loc
7. | numberOfHighPriorityBugsFoundUntil | halstead_effort
8. | LDHH_rfc | halstead_error
9. | LDHH_cbo | halstead_length
10. | numberOfNonTrivialBugsFoundUntil | halstead_level
11. | WCHU_numberOfMethods | halstead_time
12. | WCHU_numberOfPrivateAttributes | halstead_vocabulary
13. | WCHU_rfc | halstead_volume
14. | ck_oo_cbo | total_operators
15. | ck_oo_fanOut | unique_operands
Table 7 HCPDP performance statistics
Prediction pair | Learning approach | Training accuracy (%) | Recall (%) | F-Score (%) | Silhouette score
HCPDP_P1 | LR | 84.21 | 85.15 | 80.02 | –
HCPDP_P1 | Km++ | 83.09 | 80.56 | 78.13 | 0.617
HCPDP_P2 | LR | 82.23 | 74.63 | 76.12 | –
HCPDP_P2 | Km++ | 81.01 | 72.97 | 77.99 | 0.701
HCPDP_P3 | LR | 86.92 | 87.17 | 86.74 | –
HCPDP_P3 | Km++ | 85.41 | 85.88 | 84.66 | 0.885
Table 8 WPDP performance statistics
Prediction pair | Learning approach | Training accuracy (%) | Recall (%) | F-Score (%) | Silhouette score
WPDP_P1 | LR | 87.73 | 85.01 | 82.42 | –
WPDP_P1 | Km++ | 85.98 | 81.99 | 80.01 | 0.650
WPDP_P2 | LR | 86.33 | 75.61 | 79.86 | –
WPDP_P2 | Km++ | 84.15 | 73.57 | 78.17 | 0.745
WPDP_P3 | LR | 89.90 | 90.17 | 89.02 | –
WPDP_P3 | Km++ | 86.01 | 87.37 | 88.17 | 0.907
Table 9 Comparison between WPDP and HCPDP performance
Performance parameters | WPDP (LR) | WPDP (Km++) | HCPDP (LR) | HCPDP (Km++)
Training accuracy | 87.99 | 85.38 | 85.45 | 83.17
Recall | 83.60 | 80.98 | 82.32 | 79.80
F-score | 83.77 | 82.12 | 81.96 | 80.26
Silhouette score | – | 0.767 | – | 0.734
The max-min technique is used to manage missing values, and the AdaBoost approach is used to address the class imbalance issue in this phase. After the preprocessing stage, the top 15 features are selected from each dataset using the Chi-Square Test (CST); the output of the feature selection process is listed in Table 6. Feature extraction is then implemented using the Principal Component Analysis (PCA) technique to decrease the dimensionality of the selected features. The second phase of the proposed model, i.e., feature engineering, which includes feature selection and feature extraction, is performed in this way. Metric matching is carried out for each possible combination of feature pairs in the two datasets after choosing the collection of highly discriminating features. Suppose (A1, A2, A3, A4, …, An) and (B1, B2, B3, B4, …, Bn) are the top n features extracted from the training and testing datasets, respectively. An (n × n) correlation matrix is generated in this phase, where the value at the (i, j)th index gives the correlation value between the ith and jth features of the respective datasets. Figure 5 displays the heat map showing the association statistics between the PDE and ar6 datasets. Values in this map range from −1 to +1, where values closer to +1 indicate a closely related feature pair. After computing the correlation matrix, a group of strongly correlated feature pairs is obtained: a feature pair (Si, Tj) is said to be highly correlated if the correlation value between them is greater than the selected threshold. The cutoff threshold is taken here as 0.05 to ensure a maximum defect-covering likelihood, as per the state of the art [16]. Spearman's Rho technique (SRT) is used in this experiment to correlate the two datasets [25] (Fig. 6).
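A sketch of the preprocessing and feature-engineering steps described above (min–max scaling, top-15 chi-square selection, PCA) is given below; the synthetic data and the number of retained principal components are assumptions.

```python
# Sketch of the feature-engineering steps: min-max scaling, top-15 chi-square
# selection, then PCA. The synthetic data is a PDE-sized placeholder.
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=1492, n_features=61, random_state=0)

X_scaled = MinMaxScaler().fit_transform(X)          # chi2 requires non-negative inputs
selector = SelectKBest(chi2, k=15).fit(X_scaled, y)
X_top15 = selector.transform(X_scaled)

X_reduced = PCA(n_components=10).fit_transform(X_top15)   # 10 components is an assumption
print("Selected feature indices:", selector.get_support(indices=True))
print("Shape after PCA:", X_reduced.shape)
```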
Fig. 5 Metric matching between PDE and ar6
Fig. 6 Output of Km++ for HCPDP_P1
obtained using the LR method to contrast the efficacy of HCPDP with the supervised technique. As per Table 7 statistics, for all three prediction pairs, Km++ clustering gives comparable performance to LR. For the prediction pair HCPDP_P3 with the highest silhouette score of 0.885, the best clustering is achieved. In the case of unsupervised and supervised learning, respectively, the value of F-Score ranges from the lowest value as 77.99 an 76.12 for HCPDP_P2 to the highest value as 84.66 and 86.74 for HCPDP_P3. The highest recall is again obtained for LR & Km++ techniques as 87.17 &and85.88 for HCPDP_P3 ,respectively. In short, it can be concluded that while LR gives better results than Km++ clustering, the latter approach also achieves comparable prediction efficiency for both HCPDP and WPDP categories on the same side. RQ2. Compare WPDP outcomes with supervised and unsupervised learning methods. To compare the WPDP’s performance using LR & Km++, the study has taken three prediction pairs WPDP_P1, WPDP_P2, and WPDP_P3. For each prediction pair, 70% and 30% of the total observations are taken as training and testing observations, respectively, as shown in Fig. 4. For, e.g., in WPDP_P1 (ar6), 71 instances are used to train the WPDP model, while 30 instances are used to test the built DP model. In Fig. 7, the working model for WPDP is well illustrated. This model consists of three stages: preprocessing stage, feature engineering stage, and the training and testing stage. Initially, dataset preprocessing is performed to handle missing values and the issue of class imbalance. After that, CST is used to pick out a set of relevant features that are useful to predict in relation to the final result. The clustering output for WCPDP_P1 is represented in Fig. 8. The model is also trained and tested with the LR method to compare WPDP’s performance with supervised technique. For other two prediction pairs WPDP_P2
Fig. 7 WPDP model
and WPDP_P3, the same procedure is followed. For WPDP_P3 (Apache), with the highest silhouette score of 0.907, the best clustering outcome is obtained. For LR, the recall value ranges from 75.61 to 90.17 and for Km++, the same value ranges from 73.57 to 87.37. Once again, it can be inferred that WPDP offers comparable DP output for both learning methods. RQ3. Whether and to what extent is HCPDP model’s performance trained with both learning methods comparable to Within-Project Defect Prediction (WPDP)? It is evident that a DP model trained and tested from the data of the same project (WPDP) would often lead to better prediction results compared to the DP model trained and tested from the data of two different projects (HCPDP). It can be inferred from the results of Table 9 (based on the average value of the listed performance parameters) that comparable DP performance is given by WPDP and HCPDP for both methods of learning.
Fig. 8 Km++ output for WPDP_P1
7 Conclusion In the prediction of software defects, HCPDP is an open research area which predicts defects in the target application that deprives past data of defects. This paper proposes the four-phase HCPDP model and the three-phase WPDP model to predict defects between two different projects and within a project, respectively. In addition to this, the study provides a clear comparison between HCPDP and WPDP results with both learning methods. It can be concluded that for both DP methods, WPDP and HCPDP, Km++ clustering provides comparable prediction output performance with LR technique. On the other hand, a second comparative analysis is being performed to assess the degree to which the performance of WPDP can be contrasted with the performance of HCPDP. It is observed that HCPDP is giving slight lesser but comparable performance to conventional approach of DP, i.e., WPDP. As a future work, the efficiency of the proposed HCPDP model can be assessed on the basis of other unsupervised learning techniques. Another interesting direction for future research is to determine the relationship between software defect prediction and predictive maintenance. Another promising future task is to explore various potential deep learning methods for feature engineering phase. As it has been shown, DP performance efficacy is highly dependent on the metric matching process. Therefore, developing a novel method for this stage to avoid noisy metric matching is another path in this area for future work.
References
1. D'Ambros, M., Lanza, M., & Robbes, R. (2012). Evaluating defect prediction approaches: A benchmark and an extensive comparison. Empirical Software Engineering, 17(4–5), 531–577.
2. Han, D., Hoh, I. P., Kim, S., Lee, T., & Nam, J. (2011). Micro interaction metrics for defect prediction. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York, USA: ACM.
3. He, P., Li, B., & Ma, Y. (2014). Towards cross-project defect prediction with imbalanced feature sets. CoRR, vol. abs/1411.4228.
4. https://towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a.
5. Briand, L. C., Melo, W. L., & Wurst, J. (2002). Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Transactions on Software Engineering, 28, 706–720.
6. Bener, A. B., Menzies, T., Di Stefano, J. S., & Turhan, B. (2009). On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5), 540–578.
7. Cruz, A. E. C., & Ochimizu, K. (2009). Towards logistic regression models for predicting fault-prone code across software projects. In Proceedings of the Third International Symposium on Empirical Software Engineering and Measurement (ESEM) (pp. 460–463), Lake Buena Vista, Florida, USA.
8. Butcher, A., Cok, D. R., Marcus, A., Menzies, T., & Zimmermann, T. (2011). Local versus global models for effort estimation and defect prediction. In 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011) (pp. 343–351). Lawrence, KS, USA: IEEE.
9. Bettenburg, N., Hassan, A. E., & Nagappan, M. (2012). Think locally, act globally: Improving defect and effort prediction models. In 9th IEEE Working Conference on Mining Software Repositories, MSR 2012 (pp. 60–69). Zurich, Switzerland: IEEE.
10. Dong, X., Jing, X., Qi, F., Wu, F., & Xu, B. (2015). Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2015 (pp. 496–507). New York, NY, USA: ACM.
11. Devanbu, P., Posnett, D., & Rahman, F. (2012). Recalling the imprecision of cross-project defect prediction. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE-20) (pp. 61–65). ACM, Research Triangle Park, NC, USA.
12. Gao, K., Khoshgoftaar, T. M., Zhang, H., & Seliya, N. (2011). Choosing software metrics for defect prediction: An investigation on feature selection techniques. Software: Practice and Experience, 41(5), 579–606.
13. Ni, C., Liu, W., Gu, Q., Chen, X., & Chen, D. (2017). FeSCH: A feature selection method using clusters of hybrid-data for cross-project defect prediction. In Proceedings of the 41st IEEE Annual Computer Software and Applications Conference, COMPSAC 2017 (pp. 51–56), Italy.
14. Wang, T., Zhang, Z., Jing, X., & Zhang, L. (2015). Multiple kernel ensemble learning for software defect prediction. Automated Software Engineering, 23(4), 1–22.
15. Canfora, G., De Lucia, A., Oliveto, R., Panichella, A., Di Penta, M., & Panichella, S. (2013). Multi-objective cross-project defect prediction. In IEEE Sixth International Conference on Verification and Validation in Software Testing. IEEE, Luxembourg, Luxembourg. ISSN 2159-4848.
16. Ryu, D., & Baik, J. (2016). Effective multi-objective naïve Bayes learning for cross-project defect prediction. Applied Soft Computing, 49, 1062–1077.
17. He, J. Y., Meng, Z. P., Chen, X., Wang, Z., & Fan, X. Y. (2017). Semi-supervised ensemble learning approach for cross-project defect prediction. Journal of Software Engineering, 28(6), 1455–1473.
18. Ryu, D., Jang, J.-I., & Baik, J. (2015). A transfer cost-sensitive boosting approach for cross-project defect prediction. Software Quality Journal, 25(1), 1–38.
19. Nam, J., & Kim, S. (2015). Heterogeneous defect prediction. In Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2015 (pp. 508–519).
20. Fu, W., Menzies, T., & Shen, X. (2016). Tuning for software analytics: Is it really necessary? Information and Software Technology, 76, 135–146.
21. Fu, W., Kim, S., Menzies, T., Nam, J., & Tan, L. (2015). Heterogeneous defect prediction. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE (pp. 508–519). New York, NY, USA: ACM.
22. Li, Y., Huang, Z., Wang, Y., & Fang, B. (2017). Evaluating data filter on cross-project defect prediction: Comparison and improvements. IEEE Access, 5, 25646–25656.
23. https://towardsdatascience.com/understanding-k-means-k-means-and-k-medoids-clustering-algorithms.
24. Lee, S. P., & Felix, E. A. (2020). Predicting the number of defects in a new software version. PLoS ONE, 15(3).
25. Jing, X., Wu, F., Dong, X., Qi, F., & Xu, B. (2015). Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In Proceedings of the 10th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE 2015 (pp. 496–507).
26. Mwadulo, M. W. (2015). A review on feature selection methods for classification tasks. International Journal of Computer Applications Technology and Research, 5(6), 395–402.
27. Xu, Z., Yuan, P., Zhang, T., Tang, Y., Li, S., & Xia, Z. (2018). HDA: Cross-project defect prediction via heterogeneous domain adaptation with dictionary learning. IEEE Access, 6, 57597–57613.
28. Akinsola, J. E. T., Osisanwo, F. Y., Awodele, O., Hinmikaiye, J. O., Olakanmi, O., & Akinjobi, J. (2017). Supervised machine learning algorithms: Classification and comparison. International Journal of Computer Trends and Technology (IJCTT), 48(3), 128–138.
29. Memoona, K., & Tahira, M. (2015). A survey on unsupervised machine learning algorithms for automation, classification and maintenance. International Journal of Computer Applications, 119(13).
PDF Text Sentiment Analysis Rahul Pradhan, Kushagra Gangwar, and Ishika Dubey
Abstract Nowadays, the internet has become a great source of unstructured data. Sentiment analysis operates on raw, unprocessed text, and this brings various issues for computer processing; several steps and tactics are applied to handle them. The paper gives an insight into the field of sentiment analysis, targeting today's main lines of work: lexicon-based approaches, context-free categorization, and deep analysis. Sentiment mining, an important newer subcategory of this analysis, is also discussed in this project. The main objective is to give a brief introduction to this emerging topic and to present a comprehensive account of the major research problems and recent developments in the field; in support of this, the project draws on more than 400 references from the important journals. Although the field works with natural language text, which is generally counted as unprocessed data, this project takes a structured approach to describing the problem, with the aim of bridging the unprocessed and processed domains and enabling qualitative and quantitative analysis of emotions, which is important for practical applications. Keywords Sentiment analysis · PDF · Multilingual
R. Pradhan · K. Gangwar (B) · I. Dubey GLA University, Mathura 281406, UP, India e-mail: [email protected] R. Pradhan e-mail: [email protected] I. Dubey e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_55
1 Introduction PDF (Portable Document Format) is one of the most crucial and commonly used digital media formats across devices, operating systems, and even languages [1]. A PDF contains useful and valuable information with rich text options such as links, buttons, form fields, audio, graphics, layers, videos, and business logic. An approach for document research is presented which uses the Portable Document Format (PDF, the primary file structure for Adobe Acrobat software) as its starting point. Recently, PDLs (Page Description Languages) have evolved and are now being used for the electronic dissemination and storage of documents: Adobe's Portable Document Format (PDF) [2] and Common Ground's Digital Paper [2] are just two recent examples. This strategy recognizes the appearance and the geometric positioning of text and image blocks distributed over the overall document. PDF processing comes under text analytics, and Python provides huge support for doing it in an easier way thanks to its text analytics libraries and frameworks. But there is a catch: no PDF can be directly processed in the existing frameworks of machine learning or Natural Language Processing (NLP) without cleaning, which means that we first have to convert the PDF to text; only then can we process the PDF (in the form of text) using the libraries and frameworks provided by Python. The observation that writing is more crucial and effective than military power or violence suggests that free communication (particularly written language) is more constructive than direct violence [3]. Sentiment analysis can be referred to as a method, technique, or tool that can be used for detecting and extracting subjective information, such as opinions and viewpoints, in any language [3]. In general, the objective of sentiment analysis is to discover the point of view of a writer with regard to a certain topic or the overall contextual polarity of textual data. The viewpoint may be the author's judgment or assessment, emotional state (i.e., the sentimental state of the author while writing), or the intended emotional communication (i.e., the emotional effect that the author wanted to have on the reader). This paper provides information about the number of occurrences of similar and most frequently used words in a PDF and plots a bar graph of them. It also determines the sentiment of the PDF text by counting the positive and negative words in it. This sentiment analysis makes use of a lexicon-based approach, in which segregation of positive and negative words takes place. This helps in determining the sentiment of the PDF automatically rather than by reading it, which is a huge advancement over doing it manually (which also takes a lot of time).
2 Related Work Sentiment Analysis. Sentiment analysis can be viewed as a categorization process. In NLP, the term sentiment analysis covers various procedures regarding the
details about emotions, attitudes, points of view, and social images as described in the lexicon [4]. We have accepted the hypothetical assumptions of that work for our elementary sentiment representation, but the main focus is on one-to-one evaluation and also on the social outcomes. Lately, progress in sentiment analysis has made significant use of newsworthy, contextual, and social statistics to form more distinct and subtle forecasts about a particular language [5]. A model developed to automate this task needs to determine the topics being discussed and the related sentiments, which are typically based on statistical keyword standards that are subject to numerous false positives and negatives because of the limitations of the models in understanding the context of a topic. For example, an existing model may include a controlled vocabulary of positive and negative sentiment words, such as "good", "excellent", "bad", and "awful", which is balanced and unlikely to change. In practice, the sentiment research application is executed to recognize the noun terms, the verb expressions, and the adjective terms that are significant to the opinion about the gist of a sentence. The adjective terms can be checked against a dictionary database of opinion lexicon words to recognize sentences and phrases that are significant to the opinion about the gist. There are three major categorizations in SA: document-level, sentence-level, and aspect-level sentiment analysis. Document-level SA aims to categorize the view of a document as indicating a positive or negative opinion or sentiment; it contemplates the whole document as the primary information unit about a topic. Sentence-level SA aims to categorize the view manifested in every single sentence. The first pace is to find whether the sentence is subjective or objective; if the sentence is subjective, then sentence-level SA will determine whether the sentence manifests positive or negative sentiments, see Fig. 1. Wilson discovered that sentiment terms are not necessarily subjective in type. Nevertheless, there is no elementary difference between document-level and sentence-level categorization, as sentences are essentially brief documents. Categorizing text at the document level or at the sentence level does not impart the important and compulsory detail about views on all characteristics of the object, which is required in numerous applications; to produce this information, we are required to go to the characteristic (aspect) level. Characteristic-level SA aims to categorize the sentiment or opinion in the context of the particular features of objects [6] (Fig. 1).
3 Text Analysis Since we perform textual analysis on textual data such as a PDF, we make an educated estimate of the particular outward interpretations that could possibly be formed from that textual data. One could say that textual analysis is a central methodology of the social sciences, yet we still do not have a single, clearly laid out guide as to what it is and how to make it work.
Fig. 1 Sentiment analysis categorization
It is a procedure: a method of gathering and inspecting information in academic research. Some academic disciplines (particularly in the physical and social sciences) are exceedingly diligent about their procedures; there are definite, enduring, and accepted methods by which it is possible to gather and operate on information [7]. There is an emerging interest among the social sciences in the structured inspection of "text", since most of the retrievable data about human thought and behavior is in the form of text. In this chapter, we studied methods of text analysis in the social sciences and, particularly, how anthropologists have used those procedures to look for meaning and pattern in written text. Textual details globally can be broadly categorized into two main types: facts and opinions. Facts are unbiased expressions about entities, events, and their characteristics. Textual inspection is a method used by reporting researchers to narrate and explain the properties of a prerecorded or visually observed message. The objective of textual inspection is to outline the content, structure, and features of the messages held in textual data. The necessary deliberations in textual inspection include choosing the forms of textual data to be studied, obtaining suitable textual data, and evaluating which specific method to use in inspecting them. Two of the usual classifications of textual data are records of word-to-word recordings and outcomes of word-to-word recordings. With regard to obtaining textual data, outcomes of recordings are far more readily accessible than copies [8].
4 Proposed Approach The problem statement is to provide information about the number of occurrences of similar and most frequent words in a PDF, to plot a bar graph from it, and
Fig. 2 Proposed approach flowchart
to determine the sentiment of the PDF text by counting the positive and negative words in it. A challenge in solving this problem is that we are using a lexicon-based method for sentiment analysis, which consists of counting the number of negative and positive words per sentence. This method has a few limitations, the most important being that it does not know the context of the sentence, so a sentence like "Reduced the crime levels by 10%" will be considered negative. This problem could be solved by training a model using machine learning, but we are not including ML in our project, as it would take a lot of resources and time. In this work, we utilize various NLP techniques: the most frequently used words in Spanish are fetched from a few PDFs and used for further processing. Using sentiment analysis, we segregate the words into positive or negative words; using text analysis, we automatically classify and extract meaningful information from the unstructured text and plot a graph of the given data (Fig. 2). We apply sentiment analysis and text analysis, as described in Sects. 2 and 3, to solve this problem; a minimal sketch of the lexicon-based counting is given below.
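The sketch below illustrates the lexicon-based counting described above; the tiny Spanish word lists are illustrative stand-ins for a full opinion lexicon and are not the lists used in the project.

```python
# Sketch of the lexicon-based sentiment count. The word lists are tiny
# stand-ins for a full opinion lexicon.
positive_words = {"bueno", "excelente", "mejora", "positivo"}
negative_words = {"malo", "crimen", "negativo", "problema"}

def lexicon_sentiment(text: str) -> str:
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(lexicon_sentiment("El resultado fue bueno y la mejora excelente"))  # -> positive
# Note the limitation discussed above: a sentence that merely mentions
# "crimen" would be counted as negative regardless of its actual context.
```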
and social sciences) are exceedingly diligent about their procedures; there are definite, enduring, and accepted methods by which it is sustainable to collect and operate on the information. We used Python, as it is among the top 10 best-known programming languages. Python is a multipurpose, high-level language: it can be used for constructing GUI apps, websites, web applications, and so on. Further, Python, as a high-level programming language, helps us to concentrate on the main features of the application while the routine programming work is taken care of. Only the primary grammatical rules of the language need to be kept in mind; the rest of the subtleties are handled by Python, which makes development smoother, the code understandable, and the application conceivable. Python and its libraries:
i. PyPDF2: for extracting text from PDF files.
ii. spaCy: for passing the extracted text into an NLP pipeline.
iii. NumPy: for fast matrix operations.
iv. pandas: for analyzing and getting insights from datasets.
v. matplotlib: for creating graphs and plots.
vi. seaborn: for enhancing the style of matplotlib plots.
5 Experimental Setup For the execution of this project, Python has been used as a preferred language. It has many built-in libraries which help in the data analysis process.
5.1 Extracting the Text from PDF The first step is to extract the text from the PDF so that analysis can be done on that text. For this, the PyPDF2 library has been used, which is a very popular library for working with PDF files. The process extracts the text from the PDF by using the PdfFileReader method. For this project, as the PDF had most of its text available from page 21 to page 400, only these pages are considered. So the process starts from the 21st page and goes up to the 400th page; for every page, the text is extracted using the extractText() method, and all the text is saved in the form of a string. After that, to remove extra spaces between words and sentences, "re", the library for working with regular expressions, has been used. Finally, after all this, the cleaned data (the text of the PDF) is saved inside a file named transcript_clean.txt.
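A minimal sketch of this extraction step is shown below. The source file name report.pdf is an assumption (the paper does not name the PDF), and the legacy PyPDF2 interface (PdfFileReader, getPage, extractText) is used because those are the method names mentioned above.

```python
# Hedged sketch of Sect. 5.1; "report.pdf" is a placeholder file name.
import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("report.pdf")

# Keep only pages 21-400 (0-indexed here as 20..399), as described in the text.
pages_text = []
for page_number in range(20, 400):
    page = reader.getPage(page_number)
    pages_text.append(page.extractText())

# Collapse extra whitespace with a regular expression and save the cleaned text.
clean_text = re.sub(r"\s+", " ", " ".join(pages_text)).strip()
with open("transcript_clean.txt", "w", encoding="utf-8") as out_file:
    out_file.write(clean_text)
```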
5.2 Loading the Model in the NLP Pipeline Now that we have our data in raw text format, further analysis can be done. In this step, the data is read from the file. The spaCy library is also used to load the Spanish model and create an NLP pipeline for our data. There is a catch, however: the data has more characters than the maximum length allowed by the Spanish model, so first the maximum length of the NLP pipeline is increased to the size of our corpus (the large amount of data in the file). Now we can easily pass the data through this pipeline and create a document.
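A short sketch of this step follows, assuming the small Spanish model es_core_news_sm (the paper only says "the spanish model"); the model has to be downloaded beforehand with python -m spacy download es_core_news_sm.

```python
# Hedged sketch of Sect. 5.2; the concrete model name is an assumption.
import spacy

with open("transcript_clean.txt", encoding="utf-8") as in_file:
    corpus = in_file.read()

nlp = spacy.load("es_core_news_sm")

# The corpus is longer than spaCy's default limit, so raise max_length first.
nlp.max_length = len(corpus) + 100

doc = nlp(corpus)  # the document used in the following steps
```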
5.3 Getting the Tokens The document is sent for further processing, and in this step all the tokens are extracted and classified on the basis of appropriate rules. This makes use of the spaCy library, which provides an easy way to classify tokens. In the document, each word is taken one by one as a token and classified on the basis of the following attributes:
1. text
2. lemma
3. part of speech
4. is alphabetic
5. is stop word (words which occur often in the language, like and, for, if, etc.).
For each token, these attributes are saved in the form of a list. After processing the whole document, we finally have a list of tokens with their classifications, which is saved in a file named tokens.csv (Comma-Separated Values format).
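The token extraction could look roughly like the sketch below, which continues from the doc object created in the previous step and writes the five attributes listed above to tokens.csv.

```python
# Hedged sketch of Sect. 5.3: one row per token with the five listed attributes.
import csv

rows = []
for token in doc:  # `doc` comes from the spaCy pipeline above
    rows.append([token.text, token.lemma_, token.pos_,
                 token.is_alpha, token.is_stop])

with open("tokens.csv", "w", newline="", encoding="utf-8") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(["text", "lemma", "pos", "is_alpha", "is_stop"])
    writer.writerows(rows)
```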
5.4 Counting Positive and Negative Words For this part, we have taken a dataset from Kaggle which contains all negative words and all positive words in separate files. First, the dataset is converted into lists of words. Then, for each sentence in the document that was created, a check is done that its length is greater than 10 characters; only then is the sentence taken into consideration, otherwise it is not processed further. After this check, we go through each word in the sentence and match it against the dataset. If the word belongs to the negative word list, the sentence score (a variable holding the total score of the sentence) is decreased by 1; if it belongs to the positive word list, the score is increased by 1. In this way, the total score is calculated and put, along with the sentence, inside a list. Finally, this list is saved inside a file named sentences.csv.
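A sketch of this lexicon-based scoring is given below; the lexicon file names positive_words.txt and negative_words.txt are assumptions, since the Kaggle dataset is only described as separate files of positive and negative words.

```python
# Hedged sketch of Sect. 5.4; the lexicon file names are placeholders.
import csv

with open("positive_words.txt", encoding="utf-8") as f:
    positive_words = set(f.read().split())
with open("negative_words.txt", encoding="utf-8") as f:
    negative_words = set(f.read().split())

scored_sentences = []
for sentence in doc.sents:            # `doc` from the spaCy pipeline above
    if len(sentence.text) <= 10:      # skip very short sentences
        continue
    score = 0
    for token in sentence:
        word = token.text.lower()
        if word in negative_words:
            score -= 1
        elif word in positive_words:
            score += 1
    scored_sentences.append([sentence.text, score])

with open("sentences.csv", "w", newline="", encoding="utf-8") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(["sentence", "score"])
    writer.writerows(scored_sentences)
```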
5.5 Plotting the Graphs For this part, libraries such as matplotlib, pandas, NumPy, and seaborn are used. First, the data from tokens.csv and sentences.csv is read with the help of the pandas library (which makes working with large data and the CSV format easy) and converted into data frames. So now we have two data frames, i.e., the tokens DF and the sentences DF. Word count graph (number of occurrences): the word count graph is plotted using the tokens data frame. We only consider the top 20 words which are alphabetic, are not stop words, and have length > 1. Using these words, we plot the graph and save it as a PNG image file, words_count.png. Sentiment analysis graph: the sentiment analysis graph is plotted using the sentences data frame, which contains the sentences and their positivity/negativity scores. We focus only on scores between −10 and 10 inclusive. The colors of the positive and negative bars are different; for this we create a color array using the NumPy library, which makes it easy to work with matrix operations. First, we create an array whose length equals the number of sentences, filled with the value "Red". Then, for all the sentences which have a score greater than or equal to 0, we replace the color with "Green". Now positive bars are green and negative bars are red. After plotting the graph, we save it as a PNG image file, sentiment.png.
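A rough sketch of the two plots is shown below; it assumes the output file names words_count.png and sentiment.png used above and is not necessarily the authors' exact plotting code.

```python
# Hedged sketch of Sect. 5.5: top-20 word counts and per-sentence sentiment bars.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

tokens = pd.read_csv("tokens.csv")
sentences = pd.read_csv("sentences.csv")

# Word count graph: alphabetic, non-stop-word tokens of length > 1.
words = tokens[(tokens.is_alpha) & (~tokens.is_stop) &
               (tokens.text.str.len() > 1)]
top20 = words.text.value_counts().head(20)
sns.barplot(x=top20.values, y=top20.index)
plt.savefig("words_count.png")
plt.clf()

# Sentiment graph: scores kept within [-10, 10], green for >= 0, red otherwise.
scores = sentences[(sentences.score >= -10) & (sentences.score <= 10)]
colors = np.full(len(scores), "red", dtype=object)
colors[np.array(scores.score) >= 0] = "green"
plt.bar(range(len(scores)), scores.score, color=list(colors))
plt.savefig("sentiment.png")
```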
5.6 Finding Overall Sentiment of the PDF In this step, we find the total sum of the sentiment scores of the sentences and print the output, which tells whether the overall sentiment of the PDF is positive, negative, or neutral.
6 Result Analysis Final results are based on the graphs that are plotted, i.e., the word count graph and the sentiment analysis graph (Fig. 3).
6.1 Word Count Graph Analysis As we can see from the graph in Fig. 4, the y-axis shows the top 20 words that occur most often in the whole PDF and the x-axis shows the count of the
Fig. 3 Occurrences of words
occurrences of those words. In the graph, the word "nacional" (English: national) occurs more than 800 times in the PDF, so this suggests that the PDF mainly talks about some topic related to the nation, i.e., it could be a document related to the government. The next most frequent word is "junio" (English: June), which suggests it could be related to a national study done in the month of June. The next word is "programar" (English: program), which suggests this report might be based on a government program. So, going further and looking at all the most frequent words (national, June, program, ace, weight, December, service, actuate, federal, develop, health, person, education, security, system, and perform), the final analysis is that this might be a government program survey report conducted between the months of June and December, mainly focused on the development of the nation and also concerned with education, health, and security.
6.2 Sentiment Analysis Graph In the graph in Fig. 5, the y-axis shows a score (both positive and negative) which depicts the sentiment of the sentences, i.e., if it is on the positive side, it has a positive
Fig. 4 Word count graph (count of occurrences of words)

Table 1 Overall sentiment of PDF
Sentiment polarity                        Count
Neutral                                   1330
Positive                                  944
Negative                                  2604
Overall                                   Negative
Total (sum) score of the sentiment        −3929
sentiment, and if it is on the negative side, it has a negative sentiment; the x-axis shows the sentence number, i.e., the number of the sentence we are looking at. The graph shows the sentiment of the PDF (counted per sentence) and how the sentiment varies from sentence to sentence. The most negative reports can be seen near the 1000th sentence and between the 3000th and 4000th sentences. The most positive reports can be seen somewhere between the 1000th and 2000th sentences. The final conclusion of the graph is that the report has a mainly negative rather than positive sentiment, so it might be that the government survey report is not in favor of the government. Table 1 shows the total score of the sentiment, which is the sum of all the scores taken sentence by sentence. It also gives the counts of all three sentiments, i.e., positive, negative, and neutral, showing how many sentences are of positive, negative, or neutral sentiment. It is clear from the table that a huge number of sentences have negative sentiment, which in conclusion gives the overall sentiment of the PDF as NEGATIVE.
Fig. 5 Sentiment analysis graph (Positive and Negative words)
7 Conclusion Due to the rise of Web 2.0 technologies, the internet has come to be seen as a great source of human-generated data. The public boldly expresses its views on a variety of things using various online mediums. Manual inspection of huge quantities of content is not possible, so a particular demand for their computer processing has grown. Sentiment analysis collects and operates on public views and gestures toward items, services, events, etc. The present survey describes a complete review of text mining and its current state of work and development. According to research done in the field of data mining, there is a drawback in describing the problem of extracting facts and figures from surveyed articles by the use of data mining tools and techniques. PDF is of utmost importance in terms of media and data storage and is a majorly used digital medium and an important source of content; many organizations and institutions publish their content in PDFs. Nowadays, as AI is emerging, we require more content for forecasting and categorization; hence, not considering PDFs as a major data store and source could be a very big mistake. Working with PDFs is somewhat difficult, but we can use specific APIs to make it less difficult.
Hence, using Python and its flexible libraries, we have done text and sentiment analysis. In our work, we employed NLP techniques to fetch the most frequently used words of the Spanish language from a pool of random PDFs. Using sentiment analysis, we segregated the words into positive and negative words. Using text analysis, we automatically classified and extracted meaningful information from unstructured text and plotted graphs of the given data.
References
1. Wikipedia contributors: PDF. https://en.wikipedia.org/w/index.php?title=PDF. Last accessed 02 Jan 2021.
2. Lovegrove, W. S., & Brailsford, D. F. (1995). Document analysis of PDF files: Methods, results and implications. Electronic Publishing Origination, Dissemination and Design, 8.
3. Mäntylä, M. V., Graziotin, D., & Kuutila, M. (2018). The evolution of sentiment analysis - A review of research topics, venues, and top cited papers. Computer Science Review, 27, 16–32.
4. Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2, 1–135. https://doi.org/10.1561/1500000011.
5. West, R., Paskov, H. S., Leskovec, J., & Potts, C. (2014). Exploiting social network structure for person-to-person sentiment analysis. Transactions of the Association for Computational Linguistics, 2, 297–310.
6. Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5, 1093–1113.
7. McKee, A. (2003). Textual analysis: A beginner's guide. Thousand Oaks, CA, USA: SAGE Publications.
8. Okamoto, K. (2016). Text analysis of academic papers archived in institutional repositories. In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS). IEEE.
Performance Analysis of Digital Modulation Schemes Over Fading Channels Kamakshi Rautela, Sandeep Kumar Sunori, Abhijit Singh Bhakuni, Narendra Bisht, Sudhanshu Maurya, Pradeep Kumar Juneja, and Richa Alagh
Abstract Wireless has freed us from the burden of cords and cables and given us unprecedented mobility, but a high data rate is required by many services such as video, high-quality audio, and mobile integrated services digital networks. Mobile radio channels are required to transmit data, and the data is sent at high data rates. This may lead to inter-symbol interference (ISI) if the channel impulse response spans a number of symbol periods. Various estimation receivers are used to remove this ISI. In this paper, the zero forcing, minimum mean square error, and maximum likelihood estimation receivers are used. The performance of these equalization methodologies in terms of BER is evaluated by considering a 2 × 2 MIMO system and a 4 × 4 MIMO system using MATLAB. Keywords Wireless · MIMO · QPSK · QAM · ISI · Zero forcing · Maximum likelihood · BER
1 Introduction Wireless has freed us from the burden of cords and cables and given us unprecedented mobility [1], and it was great until we heard about IP traffic. Using a single antenna at both ends causes the signal to be affected by different obstacles, which causes multipath wave propagation. This problem of multipath wave propagation can be eliminated by using more than one antenna at both the source and the destination [2]. MIMO (multiple input, multiple output) is one of the varieties of smart antenna techniques. More than one antenna is used in a MIMO system at the source as well as at the destination. For simple MIMO, we have two antennas on the side K. Rautela Delhi Technological University, Delhi, India S. K. Sunori · A. S. Bhakuni · N. Bisht · S. Maurya (B) Graphic Era Hill University Bhimtal Campus, Bhimtal, India P. K. Juneja · R. Alagh Graphic Era University, Dehradun, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_56
of the transmitter and two antennas by the side of the receiver. Both antennas used for transmission transmit at the same wavelength, but multiple data streams are transmitted from the two antennas [3]. If you ask some people how it operates, they will most likely say that the signals overlap and the data becomes scrambled, because you are not supposed to have two antennas working together on the same frequency while transmitting different forms of data. However, using some very advanced DSP (digital signal processing) techniques, it is quite possible to differentiate between the data streams received by the receiving antennas at the same frequency. In terms of the efficiency of bandwidth usage, that is quite significant. In addition, the throughput doubles. By using more antennas, the efficiency of the entire system can be tripled or quadrupled. In this paper, different estimators for MIMO systems, employing numerous transmitting and receiving antennas, are studied for an uncoded quadrature phase shift keying modulation system. Flat Rayleigh fading is assumed over each transmit-receive link. This paper shows a comparison, by means of graphs generated with the MATLAB tool, of two non-linear interference elimination techniques, zero forcing (ZF) and minimum mean square error (MMSE) with symbol cancellation, and matches their performance with the maximum likelihood (ML) optimum receiver. Let Dc be the peak distortion in the peak distortion criterion, which is minimized over the choice of the equalizer coefficients {Ck}. In general, there is no simple computational algorithm for carrying out this optimization, except when the peak distortion at the equalizer input, defined as D0, is less than unity. When D0 < 1, the distortion Dc is minimized by forcing the equalizer response to qk = 0 for 1 ≤ |k| ≤ K and q0 = 1. In this case there is a simple computational algorithm, called the zero-forcing algorithm, which achieves these conditions [4]. The zero-forcing algorithm is an interference cancellation method, whereas the term MMSE more specifically refers to a quadratic cost function and estimation in a Bayesian setting. We often have some prior statistics about the parameter to be estimated, which gives a straightforward basis for Bayesian estimation in practical situations. Minimum mean square error (MMSE) estimation is a tool in statistics and signal processing [5] that minimizes a commonly used measure of estimator quality, the mean square error of the fitted values of the dependent variable. When independent signals are transmitted in successive symbol intervals, the optimum detector exhibits satisfactory detection performance by observing the sequence of received signals over successive signal intervals. The matched filter or correlation demodulator produces the sequence at its output; the sequence is determined by the detector, which maximizes the conditional probability density function. Such a detector is called a maximum likelihood (ML) detector [6]. QAM and QPSK are the modulation techniques used in this paper. QAM is quadrature amplitude modulation; this modulation technique carries information in both amplitude and phase. QPSK is quadrature phase shift keying; in this modulation technique information is carried in the phase shift.
2 Mathematical Analysis
(i) Equalization
The inter-symbol interference (ISI) and the inter-channel interference (ICI) introduced in the transmission channel are removed by means of zero-forcing equalization. This equalizer is designed for noise-free applications and does not account for noise. Figure 1 displays the zero-forcing equalizer (ZFE) chain diagram, in which the ISI component at the output of the equalizer is made zero by using a linear time-invariant filter with an appropriate transfer function. Let us consider the receiver transfer function HRx(f) and the transmitter transfer function HTx(f):

HRx(f) = sqrt(Hrc(f)) e^(−j2π f t0) for |f| ≤ W, and HRx(f) = 0 for |f| > W

HRx(f) = HTx(f)

Now,

HRx(f) · HTx(f) = Hraised cosine(f) = |HTx(f)|^2

HEqz(f) must compensate the channel distortion, so that

HEqz(f) = 1/Hch(f) = (1/|Hch(f)|) e^(−jθc(f)), |f| ≤ W

where HEqz(f) is the transfer function of the equalizer and Hch(f) is the transfer function of the channel.

(ii) Minimum Mean Square Error (MMSE)
MMSE is a method which estimates with minimum mean squared error, i.e., it is optimal in the statistical sense given the prior statistical information p(x), where
Fig. 1 Block diagram of the communication system with a zero-forcing equalizer (ZFE)
the mean square error (MSE) is defined (in a statistical sense) as

MSE = ∫_x p(x/y) (x̂ − x)^T (x̂ − x) dx

where p(x/y) denotes the posterior distribution of x. Hence, the optimal MMSE estimator can be found by minimizing the MSE as follows:

x̂*_MMSE = arg min_x̂ ∫_x p(x/y) (x̂ − x)^T (x̂ − x) dx

and this can be done by setting the associated derivative to zero, i.e.,

d/dx̂ [ ∫ p(x/y) (x̂ − x)^T (x̂ − x) dx ] = 0

The optimal MMSE estimator is derived as

x̂*_MMSE = ∫_x p(x/y) x dx

Generally, the posterior p(x/y) is cast, based on the Bayesian chain rule, as

p(x/y) = p(y/x) p(x) / p(y)

where p(y/x) is known as the likelihood function, p(x) is the prior, and p(y) is the normalizing constant, given by

p(y) = ∫_x p(y/x) p(x) dx

For a specific system or estimation problem, the remaining task of MMSE estimation is to specify the statistical densities, e.g., p(y/x) and p(x), by incorporating the prior knowledge and the regular properties we already know.

(iii) Maximum Likelihood (ML)
Let P(S1/Y) and P(S2/Y) be the posterior probabilities and P(S1) and P(S2) the prior probabilities. If we assume the bits 1 and 0 to be equiprobable, then P(S1) = P(S2) = 1/2. In this case, the decision rule is as follows:

P(S1/Y) ≷ P(S2/Y)  (decide H1 if the left side is larger, H2 otherwise)   (1)

Now let us express the posterior probabilities in terms of the prior probabilities using Bayes' rule:
P(S1/Y) = P(S1) f(Y/S1) / f(Y)   (2)

P(S2/Y) = P(S2) f(Y/S2) / f(Y)   (3)

Now, substituting Eqs. (2) and (3) in Eq. (1),

f(Y/S1) / f(Y/S2) ≷ P(S2) / P(S1)  (decide H1 if the left side is larger, H2 otherwise)

Here the ratio on the left is called the likelihood ratio and is denoted by Ŷ:

Ŷ = f(Y/S1) / f(Y/S2)

The conditional PDFs f(Y/S1) and f(Y/S2) are termed the likelihoods of S1 and S2. The likelihood ratio test is therefore

Ŷ ≷ P(S2) / P(S1)  (decide H1 if the left side is larger, H2 otherwise)
3 Problem Formulation While MIMO can improve almost every aspect of a wireless communication system, using more antennas makes the system more complex. Multiple signals can cause ISI and many other errors. ISI is an interference problem in which a received signal is confused due to the presence of obstacles or of signals from two or more transmitters on a single frequency. In this article, three estimation receivers are used to evaluate BER performance over Rayleigh and Rician channels with QPSK and QAM modulation techniques [7].
4 Simulation and Results For simulation in MATLAB, 2 × 2 and 4 × 4 MIMO systems are used, employing two antennas at the transmitter side and two at the receiver side, and four antennas at the transmitter side and four at the receiver side, respectively. QPSK and QAM modulation techniques are used over Rayleigh and Rician channels. Using the three interference cancellation estimation receivers, the BER performance is analyzed for 2 × 2 and 4 × 4 QAM and QPSK modulation methodologies across Rayleigh and Rician channels, i.e., the non-line-of-sight and the line-of-sight paths, respectively.
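Although the paper's simulations were written in MATLAB, the sketch below illustrates the same kind of experiment for the 2 × 2 QPSK case over a Rayleigh channel in NumPy; the per-antenna SNR definition, the symbol count, and the bit mapping are assumptions, and the Rician case and the 4 × 4 configuration are omitted for brevity.

```python
# Illustrative NumPy sketch (not the authors' MATLAB code) of ZF/MMSE/ML
# detection for 2x2 QPSK over a flat Rayleigh channel.
import numpy as np

rng = np.random.default_rng(0)
nt, nr, n_sym = 2, 2, 10000
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
cand = np.array([[a, b] for a in qpsk for b in qpsk])   # all 16 ML hypotheses

def bits_of(idx):
    # Two bits per QPSK constellation index (same mapping on both sides).
    return np.stack(((idx >> 1) & 1, idx & 1), axis=-1)

for snr_db in (0, 5, 10, 15, 20):
    noise_var = 10 ** (-snr_db / 10)          # assumed per-antenna SNR definition
    errors = {"ZF": 0, "MMSE": 0, "ML": 0}
    for _ in range(n_sym):
        idx = rng.integers(0, 4, nt)
        x = qpsk[idx]
        H = (rng.standard_normal((nr, nt)) + 1j * rng.standard_normal((nr, nt))) / np.sqrt(2)
        n = np.sqrt(noise_var / 2) * (rng.standard_normal(nr) + 1j * rng.standard_normal(nr))
        y = H @ x + n

        xhat = {"ZF": np.linalg.pinv(H) @ y,
                "MMSE": np.linalg.solve(H.conj().T @ H + noise_var * np.eye(nt),
                                        H.conj().T @ y)}
        for name, est in xhat.items():
            det = np.argmin(np.abs(est[:, None] - qpsk[None, :]), axis=1)
            errors[name] += int(np.sum(bits_of(det) != bits_of(idx)))

        ml = np.argmin(np.sum(np.abs(y - cand @ H.T) ** 2, axis=1))   # exhaustive ML search
        errors["ML"] += int(np.sum(bits_of(np.array([ml // 4, ml % 4])) != bits_of(idx)))

    print(snr_db, "dB:", {k: v / (2 * nt * n_sym) for k, v in errors.items()})
```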
Figure 2 shows the comparison of the different estimation receivers using a 2 × 2 MIMO system with QPSK modulation over the Rayleigh channel. ML has the lowest BER, followed by MMSE and ZF. Figure 3 shows the comparison of the different estimation receivers using a 4 × 4 MIMO system with QPSK modulation over the Rayleigh channel. ML has the lowest BER, followed by MMSE and ZF. The BER of the 4 × 4 MIMO system is lower compared to the 2 × 2 MIMO system. Figure 4 shows the comparison of the different estimation receivers using a 2 × 2 MIMO system with QAM modulation over the Rayleigh channel. ML has the lowest BER, followed by MMSE and ZF. The BER of the 2 × 2 MIMO system with QAM modulation is lower compared to the 2 × 2 MIMO system with the QPSK modulation technique. Figure 5 shows the comparison of the different estimation receivers using a 4 × 4 MIMO system with QAM modulation over the Rayleigh channel. ML has the lowest BER, followed by MMSE and ZF. The BER of the 4 × 4 MIMO system with QAM modulation is lower compared to the 4 × 4 MIMO system with the QPSK modulation technique. Now, for the Rician channel, the BER is calculated using the different estimation receivers for the 2 × 2 and 4 × 4 MIMO systems with QAM and QPSK modulation. Figure 6 shows the comparison of the different estimation receivers using a 2 × 2 MIMO system with QPSK modulation over the Rician channel. ZF has the lowest BER, followed by MMSE and ML. Figure 7 shows the comparison of the different estimation receivers using a 4 × 4 MIMO system with QPSK modulation over the Rician channel. ZF has the lowest BER, followed by MMSE and ML. The BER of the 4 × 4 MIMO system is lower compared to the 2 × 2 MIMO system. Figure 8 shows the comparison of the different estimation receivers using a 2 × 2 MIMO
Fig. 2 Comparison of different estimation receiver using 2 × 2 MIMO system with the QPSK modulation over the Rayleigh Channel
Fig. 3 Comparison of different estimation receiver using 4 × 4 MIMO system with the QPSK modulation over the Rayleigh Channel
Fig. 4 Comparison of different estimation receiver using 2 × 2 MIMO system with the QAM modulation over the Rayleigh Channel
system with QAM modulation over the Rician channel. ZF has the lowest BER, followed by MMSE and ML. The BER of the 2 × 2 MIMO system with QAM modulation is lower compared to the 2 × 2 MIMO system with the QPSK modulation technique. Figure 9 shows the comparison of the different estimation receivers using a 4 × 4 MIMO system with QAM modulation over the Rician channel. ZF has the lowest BER, followed by MMSE and ML. The BER of the 4 × 4 MIMO system with QAM modulation is lower compared to the 4 × 4 MIMO system with the QPSK modulation technique.
Fig. 5 Comparison of different estimation receiver using 4 × 4 MIMO system with the QAM modulation over the Rayleigh Channel
Fig. 6 Comparison of different estimation receiver using 2 × 2 MIMO system with QPSK modulation over Rician Channel
5 Conclusion MIMO is the use of more than one antenna at both the transmitter and receiver sides to enhance communication performance. It can be seen from the graphs above that the 4 × 4 MIMO system has a comparatively low BER compared to the 2 × 2 MIMO system. QAM modulation comes up with better results in terms of BER when compared to the QPSK modulation system.
Fig. 7 Comparison of different estimation receiver using 4 × 4 MIMO system with QPSK modulation over Rician Channel
Fig. 8 Comparison of different estimation receiver using 2 × 2 MIMO system with QAM modulation over Rician Channel
The ISI cancellation methods for the MIMO system work better for the Rayleigh channel than for the Rician channel, as the Rayleigh channel shows a lower BER. In this work, all the results have been obtained using MATLAB.
Fig. 9 Comparison of different estimation receiver using 4 × 4 MIMO system with QAM modulation over Rician Channel
References
1. Goldsmith, A. Wireless Communications (textbook).
2. Fundamentals of 802.11. cisco.com/go/wireless.
3. Assessment of the Global Mobile Broadband Deployments and Forecasts for International Mobile Telecommunications, ITU-R tech. rep. M.2243.
4. Proakis, J. G. (2001). Digital communications. McGraw-Hill.
5. Module-4 Signal Representation and Baseband Processing, Version 2, ECE IIT, Kharagpur.
6. Van Etten, W. (1976). Maximum likelihood receiver for multiple channel transmission systems. IEEE Transactions on Communications.
7. MATLAB software.
Single Image Dehazing Using NN-Dehaze Filter Ishank Agarwal
Abstract Native haze removal techniques like Dark Channel Prior (DCP), Guided Image Filtering (GIF), Weighted Guided Image Filtering (WGIF), and Globally Guided Image Filtering could not perform the task of removing haze properly while preserving the fine structure of the image at the same time. The proposed NN-Dehaze filter consists of a macro layer, which anticipates the transmission map of the whole image, and a micro layer, which refines the results. To train the network, the NYU Depth dataset has been used. The performance analysis demonstrates that the structure of the image is preserved and haze is removed better than with the traditional techniques, on both indoor and outdoor images. Keywords Transmission map · Haze density · CNN
1 Introduction Visual tasks like object detection and facial recognition depend on how outdoor natural scenes are perceived. However, haze, smoke, fog, and rain quite often degrade the quality of outdoor scene images. Due to atmospheric particles, sunlight tends to blend with the light falling on the camera along the line of sight. This leads to degraded luminance [1], lower-intensity colors, and low contrast. Correction of the color distortion caused by sky-light, along with the contrast, can be achieved by haze removal. Thus, there is high demand for dehazing algorithms in the image processing industry. The algorithm in [2] uses a Markov random field and the idea that mist-free images have more contrast than their corresponding misty images, so it increases the local contrast of the misty image to dehaze it. These types of algorithms are able to remove haze, but tend to over-saturate the images. The algorithm in [3] works well with images having a heavy amount of haze. Dark channel prior based dehazing algorithms proposed in [4–6] can handle images with heavy haze quite well. But noise was amplified, especially in bright regions, by [4, 5], even after the lower bound set I. Agarwal (B) Jaypee Institute of Information Technology, sec-62, Noida, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_57
by the transmission maps of these algorithms. Due to the haze present in images, the brightness increases and the color of the scene fades, making the difference with the corresponding non-hazy image large; thus, the color attenuation prior was proposed in [6], which uses a linear model relating the depth to the brightness and saturation. Guided image filtering (GIF) in [6] makes use of this prior. But the algorithm in [7] works well only if the haze in the image is light; if it is heavy, the quality needs to be improved, because the coefficients of the linear model are constant and not adapted to the haze density of the hazy image. It is very difficult to identify the correct parameters of the linear equation model [8]. The algorithm in [8] regards dehazing as a kind of continuously varying process, and [9] introduces a local outline-preserving smoothing layer method to approximate the conveyance map of a misty image, which is also used by GIF [6] and WGIF [8], but these tend to over-smooth images, specifically in areas of fine structure. The algorithm in [9] analyzed many cloudy-image features in multiple regression models fused with random forest algorithms, and these features rely on dark channel features. Many traditional dehazing methods [9, 10] used hand-crafted features; even the globally guided filter (GGIF) [11] preserves the fine structure of the image, but does not work very well at removing haze from images as compared to GIF. A cloudy/misty image is formulated as

X(p) = L(p) t(p) + A (1 − t(p))   (1)
where X(p) is the original misty image, L(p) is the luminosity, A is the air-light, and t(p) is the conveyance map, i.e., the un-scattered portion of light reaching the camera. In image dehazing, one of the main tasks is to approximate the conveyance map of the misty image; we propose a CNN-based filter layer in the proposed filter to learn effective image features for this task. The attributes learned by this proposed algorithm do not use hand-picked cloudy-image properties, but use the data that the images themselves contain. To learn these attributes, the proposed improvement uses a neural network subdivided into two layers: a macro layer, which first approximates the texture of the misty image, and a micro layer, which improves the texture using local morphological anatomy from the output vector of the macro layer. Arbitrary pixel conveyance approximations are removed by this approach and it encourages neighboring pixels to have the same labels. The algorithm works as follows. To begin with, we design a CNN layer to determine attributes from misty images to approximate the conveyance map, which is further improved by a micro layer. Subsequently, to train the model, a benchmark dataset obtained from the New York University image dataset is used, which contains misty images and their conveyance maps. This is done by using mist-free images and their ground truth maps obtained from the same dataset. The trained model is then used to dehaze the hazy image. Third, we apply a morphological layer inspired by [6, 8, 11] to the image output of the CNN model. This provides better definition to the image. The inputs of the proposed morphological
Haze densities of the images in Fig. 1: (a) 2.6754, (b) 1.2043, (c) 1.1959, (d) 1.30332, (e) 1.19122, (f) 1.14771
Fig. 1 Comparison of the dark channel prior, GIF, WGIF, G-GIF, and the NN-Dehaze. a A haze image; b A dehazed image by DCP; c a dehazed image by GIF; d a dehazed by the WGIF; e a dehazed image by G-GIF; f a dehazed image by NN-DeHaze; DCP, GIF, WGIF over smooth the hair of the human subject as illustrated in the zoom-in regions, while the problem is overcome by the proposed G-GIF and even better by NN-DeHaze. Also, haze is best removed by NN-DeHaze as depicted by haze density calculations (numerical values under images), which are explained in detail in Experimental results section
filter are the output image of the CNN model and a guidance vector. Its speed is equal to that of [6, 8]. Fourth, the most essential layer, i.e., an outline-conserving smoothing filter, is proposed and applied to the output image from the third step to make the image more even, inspired by the weighted least squares (WLS) filter. The inputs of the proposed outline-conserving leveling filter are the output image from the morphological step and the guidance vector. The runtime speed of the outline-conserving leveling filter is also equal to that of [6, 8]. As illustrated in Fig. 1, the proposed NN-Dehaze preserves morphological details better than G-GIF, WGIF, and GIF. Overall, the major contribution of this paper is: the proposed NN-Dehaze preserves morphological details better than G-GIF [11], WGIF [9], and GIF [6], along with better haze removal, and it also improves the sharpness of dehazed images.
2 Single Image Dehazing with NN-Dehaze Algorithm The architecture of the proposed CNN layer is shown in Fig. 2b.
1. Macro Layer
The job of the macro layer is to calculate a comprehensive conveyance map, and it consists of 4 processes: convolution, max-pooling, up-sampling, and linear combination. Convolution: these are the hidden layers which are convolved with the attributes and the input image. The output is given by

f_n^(l+1) = σ( Σ_m f_m^l ∗ k_(m,n)^(l+1) + b_n^(l+1) )
Fig. 2 a Training procedure; b macro and micro network
where f_m^l and f_n^(l+1) are attribute vectors of the present layer l and the subsequent layer l + 1, k is the patch size, the indices (m, n) show the mapping of the mth attribute vector of the present layer to the nth vector of the succeeding layer, ∗ stands for the convolution operator, σ(.) stands for the Rectified Linear Unit (ReLU), and b is the bias. Max-pooling: these layers are present before up-sampling and have a down-sampling factor of 2. Up-sampling: in this process the attribute vectors are mapped back from the half-size resolution to the full resolution. The output of this layer is

f_n^(l+1)(2x − 1 : 2x, 2y − 1 : 2y) = f_n^l(x, y)

f_n^l(x, y) = (1/4) Σ_(2×2) f_n^(l+1)(2x − 1 : 2x, 2y − 1 : 2y)
Linear combination: the function of this layer is to combine the attributes from the previous layer through a sigmoid function. Its output is

t_c = s( Σ_n w_n f_n^p + b )

where t_c denotes the conveyance map in the macro layer.

2. Micro Network
After the prediction of t(p) by the macro layer, further refinement is done by the micro network, whose architecture is similar to the macro network except for the first and second hidden layers, as shown in Fig. 2b. The micro layer output is concatenated with the macro layer conveyance map and the learned attributes to approximate the combined conveyance map.
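The paper does not list filter counts or kernel sizes, so the Keras sketch below is only an illustrative stand-in for the macro/micro transmission-map network (convolution, max-pooling by 2, up-sampling by 2, a sigmoid 1 × 1 convolution, and a small refinement branch); all layer widths are assumptions.

```python
# Illustrative Keras sketch of a macro/micro transmission-map network;
# layer widths, kernel sizes and the 480x640 input size are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(480, 640, 3))            # NYU Depth frames are 480 x 640

# Macro layer: convolution -> max-pooling -> convolution -> up-sampling -> 1x1 conv.
x = layers.Conv2D(16, 5, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D(2)(x)                       # down-sampling factor of 2
x = layers.Conv2D(16, 5, padding="same", activation="relu")(x)
x = layers.UpSampling2D(2)(x)                       # back to the input resolution
t_coarse = layers.Conv2D(1, 1, activation="sigmoid")(x)   # coarse conveyance map t_c

# Micro layer: the coarse map is concatenated with the image and refined.
y = layers.Concatenate()([inp, t_coarse])
y = layers.Conv2D(16, 3, padding="same", activation="relu")(y)
t_refined = layers.Conv2D(1, 1, activation="sigmoid")(y)

model = tf.keras.Model(inp, t_refined)
model.compile(optimizer="adam", loss="mse")         # MSE matches the training loss below
```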
3. Training of the model
The model is trained to determine the mapping of misty/cloudy images to their conveyance maps by minimizing the loss between t_i(x) and the ground truth map t_i*(x),

L(t_i(x), t_i*(x)) = (1/n) Σ_(i=1)^(n) || t_i(x) − t_i*(x) ||^2
where n is the number of misty/cloudy images in the dataset. The above loss equation is used in both the macro and micro layers. Air-light calculation: after t(x) is calculated by the CNN model, the air-light needs to be calculated, which is necessary for the restoration of the clean image. From (1), it is derived that I(x) = A when t(x) → 0. The objects are generally far from the observer in outdoor images, so the depth d(x) varies within [0, +∞), and we know that t(x) is equal to zero when d(x) tends to infinity. Therefore, the air-light is approximated by picking out 0.1% of the darkest pixels in t(x). From these pixels, the brightest pixel in the cloudy image is chosen as the air-light. After A and t(x) are estimated by the recommended algorithm, we restore the partly mist-free image using Eq. (1).
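A minimal NumPy sketch of this air-light rule and of the inversion of Eq. (1) is given below; the lower bound t_floor on the conveyance map is a common safeguard and an assumption, not something specified by the paper.

```python
# Minimal sketch of the air-light rule and of inverting Eq. (1).
import numpy as np

def estimate_airlight_and_recover(hazy, t, t_floor=0.1):
    """hazy: H x W x 3 float image in [0, 1]; t: H x W conveyance (transmission) map."""
    flat_t = t.ravel()
    k = max(1, int(0.001 * flat_t.size))            # 0.1% of the pixels
    dark_idx = np.argsort(flat_t)[:k]               # pixels with the smallest t
    brightness = hazy.reshape(-1, 3).sum(axis=1)
    A = hazy.reshape(-1, 3)[dark_idx[np.argmax(brightness[dark_idx])]]  # brightest of them

    t_safe = np.maximum(t, t_floor)[..., None]      # assumed floor to avoid division blow-up
    return A, (hazy - A) / t_safe + A               # invert Eq. (1) for the scene radiance
```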
4. Morphological filter & Leveling Filter
Now, the output image from the CNN model is passed into the morphological filter, which is formulated in the steps below. First, we calculate the dark channels of the normalized misty image X/A and mist-free image Z/A. X̌_m and Ž_m are formulated as

X̌_m(p) = min( X_r(p)/A_r, X_g(p)/A_g, X_b(p)/A_b )

Ž_m(p) = min( Z_r(p)/A_r, Z_g(p)/A_g, Z_b(p)/A_b )

The relationship between X̌_m and Ž_m is given below, as it is independent of t(p):

X̌_m(p) = 1 − t(p) + Ž_m(p) t(p)

In the following equations, ζ(p) is a square window of radius ζ centered at pixel p, and the dark channels of the normalized images X/A and Z/A are formulated as [9]:
J_d^Z(p) = min_(p' ∈ ζ(p)) Ž_m(p')

J_d^X(p) = min_(p' ∈ ζ(p)) X̌_m(p')

ζ is fixed at 7 throughout this paper. In the neighborhood ζ(p), t(p) is constant, so the relation becomes

J_d^X(p) = (1 − t(p)) + J_d^Z(p) t(p)

The cost function of the morphological filter, inspired by GIF [6], WGIF [8], and G-GIF [11], has two terms: the term E1 is in the image domain and E2 is in the gradient domain. The first measures the fidelity of the output image to the input image and the second measures the structure of the output image. The former is defined as

E1(O, X) = Σ_p (O(p) − X(p))^2
where X is the misty image. E1(O, X) means that the output image O should be as near as possible to the input misty image. The term V = (V_h, V_v) is the guidance vector field, and E2 is formulated as

E2(O, V) = Σ_p ||∇O(p) − V(p)||

where the term E2(O, V) means that the structure of the output image O should be as near as possible to the guidance vector field. The overall cost function is computed as

E(O) = λ E1(O, X) + E2(O, V)

where λ is a non-negative constant whose purpose is to achieve a trade-off between the terms E1 and E2; λ is 1/2048 throughout this paper. Using matrix notation, the cost function becomes

λ(O − X)^T (O − X) + (D_x O − V_h)^T (D_x O − V_h) + (D_y O − V_v)^T (D_y O − V_v)

where D_x and D_y represent differentiation operators. The vector O that minimizes the above cost function satisfies

(λI + D_x^T D_x + D_y^T D_y) O = λX + D_x^T V_h + D_y^T V_v
where I is an identity matrix. Since the output image O* still needs to be smoothed, we propose a smoothing filter inspired by the WLS filter, formulated as a quadratic optimization problem min_φ over the cost function below, where γ, θ, and ε are 2048, 13/8, and 1/64, respectively. The cost function can be phrased as

(φ − O*)^T (φ − O*) + γ ( φ^T D_x^T B_x D_x φ + φ^T D_y^T B_y D_y φ )

where B_x and B_y are formulated as

B_x = diag( 1 / (|V_h(p)|^θ + ε) ),  B_y = diag( 1 / (|V_v(p)|^θ + ε) )

The vector φ* that minimizes the cost function is given by

( I + γ (D_x^T B_x D_x + D_y^T B_y D_y) ) φ = O*

The runtime speed of the filter that gives structure to the image by learning the transmission map and the runtime speed of the smoothing filter, which further smooths the image, are equal to the speed of the weighted least squares (WLS) filter. Therefore, the complexity of the proposed NN-DeHaze is about three times that of the GIF in [7] and the WGIF in [9]. Next, the conveyance map t(p) is recalculated with the following equation:

t(p) = 1 − φ*(p)

Finally, we obtain the scene radiance Z(p) with the following equation:

Z_c(p) = (X_c(p) − A_c) / t(p) + A_c
3 Quantitative Evaluation on Benchmark Dataset The no-reference perceptual haze density metric [1] has been used to compare the five haze removal algorithms. This density measure does not need the original hazy image. A lower value of haze density implies better dehazing performance. The proposed
algorithm performs best on the 5 images shown in Fig. 3. The first three of them focus on the fine structure of facial features after dehazing; the remaining two focus on the overall quality of images captured from a far distance. Our algorithm performs better than G-GIF on both types of images. Clearly, the experimental results in Table 1 show that the proposed NN-DeHaze indeed outperforms the G-GIF. Also in Table 1, different choices of patch size are compared with each other in terms of haze
Fig. 3 a A haze image; b a dehazed image with DCP; c a dehazed image with GIF; d a dehazed image with WGIF; e a dehazed image with G-GIF; f a dehazed image with NN-DeHaze. It is observed that DCP performs well on heavy-haze images, but on low-haze images it just increases the contrast, changing the skin color. Similarly, GIF makes images look unreal by increasing contrast and sharpness too much; GIF doesn't preserve the edges. Among the edge-preserving filters WGIF, G-GIF, and the proposed model, the proposed NN-DeHaze removes haze the best in terms of haze density
Table 1 Haze densities of five hazy images in Fig. 3 with varying input parameters in different dehazing algorithms

Images    Parameter                      Original image haze density   DCP      GIF      WGIF     G-GIF    Proposed
Figure 1  Parameter 1: Patch size = 10   0.7275                        0.3905   0.4311   0.5314   0.4801   0.3534
Figure 1  Parameter 2: Patch size = 30   0.7275                        0.4234   0.6753   0.7587   0.5633   0.4134
Figure 1  Parameter 3: Patch size = 60   0.7275                        0.3243   0.5234   0.6452   0.4343   0.3156
Figure 2  Parameter 1: Patch size = 10   0.7228                        0.2991   0.2374   0.3683   0.2448   0.2954
Figure 2  Parameter 2: Patch size = 30   0.7228                        0.3343   0.3443   0.4213   0.3322   0.3234
Figure 2  Parameter 3: Patch size = 60   0.7228                        0.4231   0.4313   0.4314   0.4231   0.4178
Figure 3  Parameter 1: Patch size = 10   1.0870                        0.3359   0.2394   1.1268   0.3243   0.2267
Figure 3  Parameter 2: Patch size = 30   1.0870                        0.3424   0.3244   1.2345   0.3423   0.3056
Figure 3  Parameter 3: Patch size = 60   1.0870                        0.3534   0.3454   1.0254   0.2432   0.2374
Figure 4  Parameter 1: Patch size = 10   0.6617                        0.2488   0.1805   0.6892   0.2544   0.1539
Figure 4  Parameter 2: Patch size = 30   0.6617                        0.2646   0.2524   0.7345   0.1423   0.1267
Figure 4  Parameter 3: Patch size = 60   0.6617                        0.2900   0.3083   0.7398   0.3432   0.2793
Figure 5  Parameter 1: Patch size = 10   2.4084                        1.0205   0.5980   1.2053   0.4933   0.4910
Figure 5  Parameter 2: Patch size = 30   2.4084                        1.2334   0.6332   1.3241   0.5342   0.5234
Figure 5  Parameter 3: Patch size = 60   2.4084                        1.2993   0.7343   1.3443   0.6245   0.6079
Average                                  1.1334                        0.4589   0.3372   0.7842   0.3243   0.3137
Table 2 Average haze densities of different algorithms applied on 500 images

          Original image haze density   DCP      GIF      WGIF     G-GIF    Proposed
Average   1.1945                        0.3427   0.3246   0.3908   0.4023   0.3056
Table 3 Average values of PSNR and SSIM applied on 500 hazy test images through different algorithms

       Original image   DCP     GIF     WGIF    G-GIF   Proposed
PSNR   12.48            15.75   16.34   22.75   23.09   25.23
SSIM   0.35             0.51    0.69    0.84    0.88    0.92
density. Overall, the proposed algorithm preserves very fine and small details present in images, like hair and eye color, visibly better than the algorithms in [5, 6, 8], and image quality is also increased by the proposed filter, as shown by the increase in PSNR and SSIM values computed on an average of 500 images (Table 3). We also compared the recommended algorithm with the traditional dehazing methods [5, 6, 8] using the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM). It is shown that the proposed algorithm indeed increases the quality of the images, and NN-DeHaze outperforms the rest, as shown in Table 2. In the five images, GIF tends to introduce some color distortion. New dataset (for testing of the proposed algorithm): we have used a dataset containing 500 images from the University of Texas at Austin (UT) (different from those used for training). The quality of the results produced by the proposed algorithm is brought to light by metrics like PSNR and SSIM, shown in Table 3, which shows high values of both metrics for the proposed work. Also, the problem of over-smoothing of hair present in GIF and WGIF is vanquished by the proposed model. Greater observable quality and lower chromatic distortions are also seen in the dehazed images produced by the proposed algorithm.
4 Conclusion and Future Work In this research, we have addressed the image dehazing problem via the NN-DeHaze filter. A comparison has been done with conventional filters, and our filter is easy to implement and use. Testing done on real misty images establishes the success of NN-DeHaze, as shown by the increased PSNR and SSIM values in Table 3. Other applications of the proposed filter, which are part of future research, include, for example, the study of panorama imaging, detail enhancement, and image matting.
References
1. Choi, L. K., You, J., & Bovik, A. C. (2015). Referenceless prediction of perceptual fog density and perceptual image defogging. IEEE Transactions on Image Processing, 24(11), 3888–3901.
2. Tan, R. (2008). Visibility in bad weather from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–8). Anchorage, AK, USA.
3. Fattal, R. (2008). Single image dehazing. In Proceedings of SIGGRAPH (pp. 1–9). New York, NY, USA.
4. Chavez, P. S. (1988). An improved dark-object subtraction technique for atmospheric scattering correction of multispectral data. Remote Sensing of Environment, 24(3), 459–479.
5. He, K., Sun, J., & Tang, X. (2011). Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12), 2341–2353.
6. He, K., Sun, J., & Tang, X. (2013). Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6), 1397–1409.
7. Zhu, Q., Mai, J., & Shao, L. (2015). A fast single image haze removal algorithm using color attenuation prior. IEEE Transactions on Image Processing, 24(11), 3522–3533.
8. Li, Z., Zheng, J., Zhu, Z., Yao, W., & Wu, S. (2015). Weighted guided image filtering. IEEE Transactions on Image Processing, 24(1), 120–129.
9. Zhu, Q., Mai, J., & Shao, L. (2015). A fast single image haze removal algorithm using color attenuation prior. IEEE Transactions on Image Processing, 24(11).
10. Tang, K., Yang, J., & Wang, J. (2014). Investigating haze-relevant features in a learning framework for image dehazing. In CVPR.
11. Li, Z. (2018). Single image de-hazing using globally guided image filtering. IEEE Transactions on Image Processing, 27(1).
Comparative Analysis for Sentiment in Tweets Using LSTM and RNN Rahul Pradhan, Gauri Agarwal, and Deepti Singh
Abstract In today's world, the number of users of social networking sites is increasing day by day, as is the number of users of the Twitter application, who post tweets for advertising and informing consumers about their products, services, and reviews of particular topics, among several other purposes. Taking this increased use of social networking into consideration, we decided to build a sentiment analysis system for Twitter data. In this paper, we propose two different models in order to determine which approach is better suited. The two models are LSTM and RNN; both models are good in their own way, but our aim is to determine which one is better suited and more straightforward for a user to apply. We give a well-explained description of both models together with their structure, feasibility, performance, and analysis results. In the end, we find that both models achieve very satisfying results, considerably better than the baseline ones. Keywords Sentiment · Analysis · Twitter · RNN · LSTM · Preprocessing · Approach · Neural
1 Introduction Analysis of sentiments is an approach widely used in text mining. Twitter sentiment analysis means the use of modern text mining routines to examine the sentiment of tweets, i.e., whether they are negative, positive, or neutral (see Fig. 1). It is also termed opinion mining, and its primary use is to analyze discussions, R. Pradhan · G. Agarwal (B) · D. Singh Department of CEA, GLA University, Mathura, India e-mail: [email protected] R. Pradhan e-mail: [email protected] D. Singh e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_58
opinions, and shared perspectives in the form of tweets (in the case of Twitter sentiment analysis). Sentiment analysis is preferably used to decide business plans of action, for political analysis, and even to assess society's reactions. R as well as Python are mostly used for Twitter sentiment analysis datasets. Algorithms such as SVM and Naive Bayes are used for forecasting the polarity of a sentence, which is a part of natural language processing. The analysis of Twitter data sentiments mainly operates at the document level and the sentence level. The flavor or taste of a text depends so much on the context that simply finding the positive and negative words in a particular sentence proves to be inadequate. Analysis can also be performed by looking at the parts-of-speech labeling. Twitter sentiment analysis has several applications [1]:
1. Business: Twitter sentiment analysis is utilized by companies for developing business action plans, assessing customers' feelings with regard to products or brands, knowing how people respond to the companies' campaigns or product launches, and also knowing why consumers are not buying certain companies' products.
2. Politics: Twitter sentiment analysis is used in politics for keeping track of political views and detecting whether actions at the government level are consistent or inconsistent. This analysis of Twitter sentiment datasets is also used for analyzing election results.
3. Public actions: Twitter sentiment analysis is preferably used for monitoring and analyzing social phenomena, for predicting danger and dangerous situations, as well as for determining the general mood of the blogosphere.
The sentiment analysis of Twitter is developed using Python; it can be implemented through famous Python libraries such as Tweepy and TextBlob.
Fig. 1 Different types of sentiments: positive ("Battery of this phone is so good!"), negative ("Oh no! my phone is not working well."), and neutral ("The phone is blue in color.")
1.1 Tweepy Tweepy is a Python client for the official Twitter API; it supports accessing Twitter through Basic Authentication and also has the newer method, i.e., OAuth. OAuth is currently the only way of using the Twitter API, because the Basic Authentication method has been discontinued by Twitter. Tweepy is used to give access to the well-documented Twitter API. Getting any object and using any method offered by the official Twitter API becomes possible through Tweepy. Tweets, Users, Entities, and Places are the main model classes in the Twitter API. Traversing the information and returning the JSON-formatted response is very easy to do using Python.
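A hedged sketch of fetching tweets with Tweepy and OAuth is shown below; the credentials and the query string are placeholders, and the search method name differs across Tweepy versions (api.search in 3.x, api.search_tweets in 4.x).

```python
# Hedged Tweepy sketch; credentials and query are placeholders.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Each returned status is a model object wrapping the JSON response.
for status in tweepy.Cursor(api.search_tweets, q="phone battery", lang="en").items(10):
    print(status.user.screen_name, ":", status.text)
```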
1.2 TextBlob TextBlob is a very well-known Python library whose use is to process textual data and whose base is NLTK. It functions as a solid framework for every important task required in natural language processing. TextBlob offers many enhanced features, such as:
1. Extracting the sentiment
2. Correcting the spelling.
TextBlob supports the analysis of sentiments of Twitter datasets using Python in the ways defined below.
1.2.1 Tokenization
Tokenization is the process by which TextBlob splits blocks of text into sentences and words, in order to make reading across the lines and sentences much easier.
1.2.2 Extraction of Noun Phrases Using TextBlob
Entities in sentences are mostly referred to by nouns. In dependency parsing, a main element of interest in natural language processing is the noun. This is the method of extracting the nouns out of a sentence using TextBlob.
1.2.3 Tagging of Parts of Speech Using TextBlob
TextBlob can also tag and recognize the parts of speech used in a given sentence.
Fig. 2 Classic procedure for sentiment categorization and analysis
1.2.4 N-Grams with TextBlob
Here, N refers to a number. An N-gram is a probabilistic language model for predicting the next item or word in a particular sentence (Fig. 2).
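The TextBlob features described in Sect. 1.2 can be exercised on a sample tweet roughly as follows (the noun-phrase and POS features require the TextBlob corpora, installed with python -m textblob.download_corpora).

```python
# Hedged sketch of the TextBlob features on a sample tweet.
from textblob import TextBlob

blob = TextBlob("Battery of this phone is so good! I love using it every day.")

print(blob.sentences)        # tokenization into sentences
print(blob.words)            # tokenization into words
print(blob.noun_phrases)     # noun phrase extraction
print(blob.tags)             # part-of-speech tagging
print(blob.ngrams(n=3))      # n-grams (here trigrams)
print(blob.sentiment)        # (polarity, subjectivity) sentiment extraction
print(blob.correct())        # spelling correction
```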
1.3 LSTM LSTM is an abbreviation of Long Short-Term Memory [2]. It is basically a recurrent neural network deep-learning technique with some added features that make the analysis of sentiment data much easier and more concise. It is designed to model sequential and time-series records, as it can capture long-term dependencies, and it also outperforms other commonly used forecasters when forecasting real-time demands.
2 Related Work The proposed paper is divided into two different portions, i.e., study of the literature and development of the system [3]. The study of the literature includes studying the different sentiment analysis methods which are in great use recently [4]. In phase II, the implementation requirements as well as the performance are considered before the main development procedure. Along with this, the structure and implementation design of the proposed program and the way in which the program will interact are also considered and identified. In preparing this paper on a Twitter sentiment analysis application, a number of tools are used, like Python, Notepad, etc. In particular, in this paper we have decided to evaluate the sentiment analysis of a Twitter dataset with two different models, i.e., LSTM and RNN, and to determine which of the proposed models proves to be more effective, efficient, fast, easy, and considerate for users to implement. This classification is important because, as we know, there are several
models for this particular task which work reasonably well in their own way, but if we ask which is the best model among all of them, it becomes very tough to answer. That is why working on this paper attracted the team members.
2.1 Sentiment Analysis of Tweets Using Microblogging Techniques [5] A large amount of practical work has been done so far on sentiment analysis of tweets. Some of this research utilizes information from the social network. These studies disclose that the social network relations of opinion holders can add a powerful bias to the textual replicas. On the other hand, current research employs models such as LSTM, RNN, and CNN. These techniques work best for some of these microblogging websites but not for all; they are still under exploration and not commonly covered in the official literature. These techniques harness textual features but ignore verbal features such as hashtags and emoticons, which are widely used nowadays. Syntactic-notion or entity-grounded approaches show a different direction for research work. We have made use of sentiment topic characteristics and entities extracted through third-party services to ease data scarcity. Detail-based replicas are also in use for improving the tweet-level categorizer.
2.2 Representation Learning and Deep Models [5] Current research shows that updating the word vectors during training can capture the polarity details of word sentiments more effectively. Deep replicas/models can also be perceptive in examining the roots that have changed the most during training. We are trying to host a basic comparison between the very initial and the final/tuned vectors and to present how the tuned vectors of different words behave. Distributed task vectors guide different NLP operations while making use of the proposed neural models; all this work is done under representation learning and deep models. Composing these depictions into fixed-length vectors that hold phrase- or sentence-level information also improves the performance of sentiment analysis. The recursive variant of the RNN model operates on binary trees: words in compound expressions are treated as leaf nodes and composed in a bottom-up manner. However, it is very tough to fetch a binary tree from the uncommon or uneven short sentences especially used in tweets. Actually, we do not require the actual compositional details or a parser; LSTM models encode the data in
in a sequential chain and handle compound linguistic structure through the composition of their gates. In this work on Twitter sentiment analysis, we focus on the differences and final effects of using two models for the task, namely the Long Short-Term Memory (LSTM) model and the Recurrent Neural Network (RNN) (Table 1). We also use an approach that first finds the words carrying the opinion and then forecasts the idea expressed in the content of the tweet. This is known as the lexicon-based approach and follows the pattern below [3]:

1. First preprocess the tweet and remove punctuation.
2. Initialize the total polarity score to 0.
3. Check each token; if it is positive, add a positive value to the score, otherwise a negative one.
4. Finally, compare the total polarity score of the tweet with a threshold: if the score is greater than the threshold the tweet is considered positive, otherwise negative.
This method is widely used today, and its main advantage is that it can be adapted to work with any model used for Twitter sentiment analysis, which in our case is LSTM and RNN. A minimal sketch of these steps is given below.
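As an illustration only, the Python sketch below follows the four lexicon-based steps listed above; the word lists, the plus/minus one token scores and the threshold of 0 are assumptions made for the example and are not taken from the paper.

```python
import re

# Hypothetical word lists for illustration; a real system would load curated lexicons.
POSITIVE_WORDS = {"good", "great", "love", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "sad"}

def lexicon_polarity(tweet, threshold=0):
    """Apply the four lexicon-based steps described above to one tweet."""
    # Step 1: preprocess the tweet and remove punctuation.
    tokens = re.sub(r"[^\w\s]", "", tweet.lower()).split()
    # Step 2: initialize the total polarity score to 0.
    score = 0
    # Step 3: check each token and adjust the score up or down.
    for token in tokens:
        if token in POSITIVE_WORDS:
            score += 1
        elif token in NEGATIVE_WORDS:
            score -= 1
    # Step 4: compare the total polarity score with the threshold.
    return "positive" if score > threshold else "negative"

print(lexicon_polarity("I love this movie, the acting is great!"))  # -> positive
```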
3 Proposed Work

In this section, we briefly explain the fundamental model constructs of RNN and LSTM networks. The whole system is divided into three phases: preprocessing, applying the models for analysis, and finally forecasting the nature of the tweets [4]. These phases turn the text content into labels, which in turn decide the actual sentiment of each tweet. The overall architecture for Twitter sentiment analysis is shown in a data flow diagram to make it clear at a single glance. Data extracted from Twitter is first preprocessed and cleaned, stop words are then removed, the sentiments of the tweets are extracted for categorization, and finally testing is carried out. LSTM is used to predict these sequences of words in the sentences [5] (see Fig. 3).
3.1 Long Short-Term Memory Model

LSTM was devised to overcome the vanishing and exploding gradient problem in RNNs. It uses memory cells with three different gates: the input gate, the output gate, and the forget gate. It selects which details to retain or forget by assigning relevant weights to the gates.
Table 1 Brief table about the related work of the paper

Author: Pedro M. Sosa
Title: Twitter sentiment analysis using combined LSTM-CNN models [11]
Approach/model: CNN and LSTM models used in combination
Dataset used: The "Twitter Sentiment Corpus" and the "University of Michigan's Kaggle" competition datasets
Pros: Combining the two models proved very effective, giving 3–4% better results than the regular model
Cons: The combined model focuses mainly on the optimization feature, takes much longer to train, and is slightly harder to work with

Author: Ye Yuan, You Zhou
Title: Twitter sentiment analysis with RNN [8]
Approach/model: Logistic regression baseline and recursive neural networks (one-layer RNTN and two-layer RNN)
Dataset used: The SemEval-2013 dataset provided by York University, containing 6092 rows of training data
Pros: A one-hidden-layer RNN gave a very decent performance, and this approach also allowed the most fine-tuning of the hyperparameters
Cons: Forecasting performance on negative labels was poorer due to the imbalanced dataset, and under-fitting also occurred

Author: Ye Yuan, You Zhou
Title: Predicting polarities of tweets using composing word embeddings with LSTM [1]
Approach/model: Naïve Bayes, maximum entropy, dynamic convolutional neural networks, SVM approaches, and an RNN model
Dataset used: The Stanford Twitter Sentiment corpus, plus manually tagged test datasets containing 177 negative and 182 positive tweets
Pros: Task-specific word vectors are combined with the deep approach, which is considered a hard point for several other networks and approaches
Cons: Words or phrases with dissimilar functions cannot be differentiated, and opposite sentiments are not magnified

Author: Rupal Bhargava, Shivangi Arora, Yashvardhan Sharma
Title: Neural network-based sentiment analysis [4]
Approach/model: LSTM, CNN, and RNN network models
Dataset used: Datasets in several Indian languages from different sources, including Hindi, Bengali, and Tamil
Pros: In a country like India, where people use many languages, sentiment analysis approaches are in great need; the advantage of this work is that it applies across various linguistic variations
Cons: The model cannot analyze the sentiment of sentences that carry emoticons or exclamations, as it cannot remove them on its own
Fig. 3 Data flow diagram for sentiment analysis of twitter [11]
It also generates a new embedding matrix using the details from the previous one [4]. As can be seen, every neural network functions in a slightly different way from the others, and by using a hybrid of these networks one can increase the utility of the model by regulating and combining their advantages [6]. The LSTM algorithm, which learns long-term dependencies, was introduced by Sepp Hochreiter and Jürgen Schmidhuber, and it proved to be a solution to the long-standing
problem of learning over many time steps. As noted above, the LSTM consists of three gating units which are multiplicative in behaviour. This collection of gates controls the flow of data through the cell by means of the sigmoid function [7]. The input gate passes new input information into the cell block, the forget gate controls which details are discarded from the cell block, and the output gate passes information on to the next stage [5]. LSTM cells thus read the input, and collect and store the data needed to extract the relevant details. A sketch of one LSTM step is given below.
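To make the gate mechanics concrete, the NumPy sketch below shows one step of the standard LSTM cell formulation rather than code from the paper; the weight matrices and biases in `p` are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, p):
    """One step of a standard LSTM cell with the three gates described above.

    `p` is an assumed dict of weight matrices (W_*, U_*) and biases (b_*),
    one set per gate (input i, forget f, output o) plus the candidate state g.
    """
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    g = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # candidate state
    c_t = f * c_prev + i * g          # keep selected old details, admit new ones
    h_t = o * np.tanh(c_t)            # expose part of the memory to the next stage
    return h_t, c_t
```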
3.2 Recurrent Neural Networks

RNNs have attracted attention in NLP since the development of semantic language structures based on a simple recurrent format, widely known as Elman networks. RNNs are currently used in sentiment analysis work to forecast words or characters in sequence [8]. In this model, each word is mapped to a vector by a lookup-table layer. The hidden layer's input comes from the previous hidden state together with the current layer's activation, and the hidden layer then encodes past and current opinions in a fixed format [9]. An RNN is a kind of neural network that remembers its inputs through its internal memory, which suits sequential data such as tweet content, genomes, or numerical time series. Recalling previously seen words in order to forecast the upcoming ones is hard for conventional neural networks, but an RNN resolves this through its hidden layers, which carry the sequence of information through time to produce the output effectively [7].
3.3 Datasets of Twitter

We ran the experiment on the First GOP Debate Twitter Sentiment dataset from Kaggle. The size of the dataset is approximately 8.1 MB. It was collected using emoticons as queries to the Twitter API and contains about 10,000 tweets in total, of which 83 are positive and 813 are negative, along with a number of neutral tweets, according to the emoticons or queries. We did not build any manually labelled test set for this work.
4 Results

4.1 Experimental Setup

The results report the mean accuracy of each network taken over several test runs [10]. We used the NumPy and pandas libraries for this work. For testing purposes, we prepared training sets of 10,000 tweets and testing sets of 10,000 tagged tweets from the original dataset [11]. We took 83 positive and 813 negative tweets for training and testing; the negative testing length is 853 and the positive testing length is 226. Training was carried out using the parameters shown in Table 2, and we also tracked the accuracy of both models during tagging and testing [7] (Fig. 4).
Fig. 4 Experimental setup
Table 2 Parameters for training

Parameter                | Value
-------------------------|------------------------------
Dimensions of embeddings | 128
Epochs                   | 7
Batch size               | 32
LSTM output units        | 196
Verbose                  | 2
Pool size                | 2
Dropout                  | 0.2
Word embeddings          | Data is not previously trained
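As a hedged illustration of how the parameters in Table 2 could be assembled into a Keras model, the sketch below uses the embedding dimension of 128, LSTM output of 196, dropout of 0.2, 7 epochs, batch size 32 and verbose level 2; the vocabulary size, the sequence preparation and the exact layer arrangement are assumptions, since the paper does not spell them out.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, LSTM, Dense

MAX_FEATURES = 2000   # assumed vocabulary size; not stated in the paper

model = Sequential([
    Embedding(MAX_FEATURES, 128),                    # embedding dimension 128, not pre-trained
    SpatialDropout1D(0.2),                           # dropout = 0.2
    LSTM(196, dropout=0.2, recurrent_dropout=0.2),   # LSTM output = 196
    Dense(2, activation="softmax"),                  # positive / negative classes
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# X_train / y_train are assumed to be tokenized, padded tweets and one-hot labels.
# model.fit(X_train, y_train, epochs=7, batch_size=32, verbose=2)
```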
Table 3 Accuracy table of processes

Process  | Accuracy
---------|---------
Training | 0.1
Testing  | 0.33
Fig. 5 Graphs of model accuracy and model loss
4.2 Training and Testing

The training and testing data cover two different settings: binary and multi-class classification. In multi-class classification, each sample is assigned a sentiment tag according to its nature, i.e., positive, negative, or neutral [10]. We applied least-squares regression on the output layer, with 511,194 trainable parameters, 0 non-trainable parameters, and a dense layer size of 394. The training and testing accuracy is shown in Table 3. Some of the training behaviour is also illustrated through graphs: Fig. 5 shows the accuracy rate and the loss rate of the model, respectively.
4.3 Evaluation

For this paper, we used the pandas, NumPy, and Keras libraries in a Kaggle notebook, experimenting with several settings. pandas is an open-source library used to load data quickly and perform analysis in Python, and the notebook is used for creating and sharing the documents in the required manner [7]. As the table below shows, the training loss and accuracy at each epoch were recorded together (Table 4).
Table 4 Training loss and accuracy

Epoch | Time (s) | Loss   | Accuracy
------|----------|--------|---------
1/7   | 17       | 0.1055 | 0.9566
2/7   | 17       | 0.1019 | 0.9578
3/7   | 17       | 0.0912 | 0.9636
4/7   | 17       | 0.0860 | 0.9661
5/7   | 17       | 0.0842 | 0.9654
6/7   | 17       | 0.0847 | 0.9663
7/7   | 17       | 0.0843 | 0.9662

Table 5 Accuracy table

Polarity | Accuracy using LSTM (%) | Accuracy using RNN (%)
---------|-------------------------|-----------------------
Positive | 59.87                   | 43.22
Negative | 90.59                   | 87.69
4.4 Final Results

The results and the corresponding graphs indicate that the LSTM model performs considerably better than the RNN method in terms of accuracy, speed, and usability for sentiment analysis. The LSTM achieved 59.87% accuracy on positive tweets and 90.59% on negative tweets, which is clearly better than the RNN model's 43.22% on positive and 87.69% on negative tweets, as detailed in Table 5. The data for the RNN model is taken from the results and analyses reported by other researchers; we compared the results of our LSTM-based approach against those RNN-based studies. The overall outcome is that the LSTM approach outperforms the RNN.
5 Conclusion

In this paper, we have explored the analysis of Twitter sentiments as conveyed through the interactions of words. The contributions of this paper can be summarized as follows. We designed a long short-term memory-based architecture to form word representations through a learnable composition function. We tested this model on the public dataset, and the proposed model achieves comparatively better results than a conventional data-driven model.
For the procedure of tuning word sentiments, we introduced a task-specific model that relies on labels collected from the whole sentence. An interesting aspect of this paper is that it contains a case study on processing task-specific word vectors combined with a deep model; we have also carried out sentiment analysis with two different models and have produced distinct results for each of them.
References

1. Wang, X., Liu, Y., Sun, C., Wang, B., & Wang, X. (2015). Predicting polarities of tweets by composing word embeddings with long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Vol. 1, Long Papers). Association for Computational Linguistics, Stroudsburg, PA, USA.
2. Bose, B. (2018). Twitter sentiment analysis—introduction and techniques. In Digitalvidya.com. https://www.digitalvidya.com/blog/twitter-sentiment-analysis-introduction-and-techniques/. Accessed 10 Nov 2020.
3. Wikipedia contributors. (2020). Long short-term memory. In Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Long_short-term_memory&oldid=985650266. Accessed 10 Nov 2020.
4. Bhargava, R., Arora, S., & Sharma, Y. (2019). Neural network-based architecture for sentiment analysis in Indian languages. Journal of Intelligent Systems, 28(3), 361–375.
5. In: Researchgate.net. https://www.researchgate.net/publication/301408174_Twitter_sentiment_analysis. Accessed 15 Nov 2020.
6. Chen, Y., Yuan, J., You, Q., & Luo, J. (2018). Twitter sentiment analysis via bi-sense emoji embedding and attention-based LSTM. In 2018 ACM Multimedia Conference on Multimedia Conference MM '18. New York, NY, USA: ACM Press.
7. Monika, R., Deivalakshmi, S., & Janet, B. (2019). Sentiment analysis of US airlines tweets using LSTM/RNN. In 2019 IEEE 9th International Conference on Advanced Computing (IACC), IEEE.
8. Yuan, Y., & Zhou, Y. Twitter sentiment analysis with recursive neural networks. In Stanford.edu. https://cs224d.stanford.edu/reports/YuanYe.pdf. Accessed 15 Nov 2020.
9. In: Unicatt.it. https://publicatt.unicatt.it/retrieve/handle/10807/133048/219284/9788899982096.pdf#page=174. Accessed 15 Nov 2020.
10. Teng, Z., Vo, D. T., & Zhang, Y. (2016). Context-sensitive lexicon features for neural sentiment analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA.
11. Sosa, P. M. Twitter sentiment analysis using combined LSTM-CNN models.
Solving Connect 4 Using Artificial Intelligence Mayank Dabas, Nishthavan Dahiya, and Pratish Pushparaj
Abstract Training an AI model for complex board games so that the agent is reminiscent of the human mind is a genuinely difficult task, and there is a rich history of training AI to master such games. We trained agents for the game of Connect 4 using DDQN, Monte Carlo Tree Search and Alpha-beta pruning. Alpha-beta and Monte Carlo tree search are based on the game-tree search approach, while DDQN is trained purely by self-play. All the above-mentioned agents were made to play against each other to check their potency. The agent trained with DDQN consistently beat the minimax agent, and MCTS also consistently beats minimax when given enough time to choose its move. The match between MCTS and DDQN was a cliffhanger till the final whistle, with MCTS as the winner.

Keywords Connect 4 · Reinforcement learning · Double deep Q-learning · Monte Carlo tree search · Minimax algorithm and Alpha-Beta pruning
1 Introduction Humans have a very basic learning strategy, particularly when it comes to complex decisive situations like when an action is executed but the reward for the action is given after a long time in the future. As multiple actions lead to a certain reward, these types of problems become very hard for humans and take a lot of time to master, whereas AI tends to perform much better in solving these complex problems. Board games hugely resemble these complex decisive problems where win or lose is dependent on each and every action/move performed during play, and therefore, there is a rich history of developing AI to master these board games. We consider the game of connect 4 shown in Fig. 1 for our research, which is a 2 players game where each player is assigned specific coloured discs and there is a board consisting of 6 rows and 7 columns where players drop the coloured discs alternatively. To win, M. Dabas (B) · N. Dahiya · P. Pushparaj Department of Computer Science, Maharaja Surajmal Institute of Technology, Janakpuri 110058, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_59
Fig. 1 Connect 4 board game
players need to align their discs diagonally, horizontally or vertically. There are 4.5 trillion possible board configurations in connect 4. As connect 4 is a zero-sum and decision-making game, a fundamental approach to solving this type of game is a game-tree search. Algorithms like minimax, which use evaluation metrics to choose the best move, and Monte Carlo Tree Search (MCTS) are some examples of game-tree search approaches. Another approach to solve the game is by self-play, where an agent plays with itself allowing it to learn intricate relations without being explicitly programmed. So, keeping that in mind in this paper, our contribution will be: • We are going to implement 3 different agents using algorithms min-max with alpha-beta pruning, Monte Carlo Tree Search and Double DQN (self-play) to tackle the game of connect 4. • We competed them against each other and analyzed their performance. • Analyzing results and choosing the best agent.
2 Related Works

2.1 Supervised Learning Using Artificial Neural Networks

Training artificial neural networks is a classic way of solving many machine learning problems. Applying an ANN to any problem depends directly on the data available for training, and the availability of game data makes it possible to solve board games like connect 4 using neural networks [1]. Using the game data, the board configuration is given as input to the input layer, and further layers are
implemented to learn better from the data; finally, the output layer of 7 neurons gives the probability of selecting each respective column. The error is then calculated and the weights are updated by the backpropagation algorithm. Such networks perform much better than some explicitly programmed algorithms, and their performance increases as more game data becomes available, allowing them to learn more complicated features.
2.2 Temporal Difference (TD) Learning and N-Tuples Network Self-play is one of the most important and powerful concepts of reinforcement learning because of the fact that here we don’t need to provide direct supervision, and hence the agent is not limited because of our logic, in fact, it can achieve superhuman performance. So, exploiting this field of self-play, a combination of TDL and n-tuples is used to implement self-play in connect 4 agents [2]. In this approach, the role of n-tuples is to project the low dimension board onto a higher dimension, thus creating a huge feature set. Now, the agent’s objective is to produce an ideal value function, which is usually 1 and −1 for a win and lose, respectively. The algorithm plays the agent against itself resulting in learning and producing a reward 1, −1 or 0 (Win, lose or draw) at the end of the game. The agent predicts some value function and TD error is calculated in accordance with the current value function, using the TD error the weights are adjusted for each iteration and the main goal of the algorithm is to minimize the TD error, which directly means minimizing the difference between the previous prediction and current prediction.
3 Methodology

In connect 4, an agent must align 4 discs horizontally, vertically or diagonally in order to win the game. We used the connect-x environment from Kaggle, which comes pre-equipped with 2 agents, namely random and negamax, for individual evaluation of our 3 agents [3]. We tackled the game by developing 3 different agents using minimax with alpha-beta pruning, Monte Carlo Tree Search and Double DQN. First, we built the minimax agent with alpha-beta pruning. Then Monte Carlo Tree Search was used to create the second agent. The third agent was DDQN, which is based on self-play. All the agents were initially tested against the negamax and random agents to observe their individual performance. The final analysis was based on the matches our 3 agents played against each other.
3.1 Connect 4 Using Minimax and Alpha-Beta Pruning

The minimax algorithm is an adversarial search algorithm often used in two-player computer games [4]. It generates a game tree, that is, a directed graph where the nodes represent game situations and the edges represent moves. There are two players, known as the maximizer and the minimizer; by convention, the agent is the maximizer and the opponent is the minimizer. The agent chooses the move that gets it as high a score as possible and the opponent chooses the move that counteracts the agent's move. Each state has an associated score, usually calculated via a heuristic function. The heuristics for this agent are:

• If the agent gets 4 discs in a row: 1000000 points
• If the agent gets 3 discs in a row: 1 point
• If the opponent gets 3 discs in a row: −100 points
• If the opponent gets 4 discs in a row: −10000 points.
The limitation of the above algorithm is that it becomes slow with increasing depth. To overcome this we use alpha-beta pruning: two parameters, alpha and beta, are passed through the minimax function and maintained by the algorithm, ensuring that if a better rewarding move is already available, the remaining branches are disregarded. The agent was initially made to play negamax; it played 100 rounds multiple times each against negamax and random with varying values of depth. Minimax beats the random agent 100% of the games at all depths. Further, minimax consistently beats negamax more than 50% of the time, and its win rate increases with increasing depth. A sketch of the pruned search is given below.
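As a rough sketch rather than the authors' implementation, the following Python function shows how the depth-limited minimax search with alpha-beta pruning described above can be organized; the helpers `valid_moves`, `apply_move`, `is_terminal` and the heuristic `score` are assumed, and `score` would encode the point values from the bullet list.

```python
import math

def alphabeta(board, depth, alpha, beta, maximizing,
              valid_moves, apply_move, is_terminal, score):
    """Depth-limited minimax with alpha-beta pruning (sketch).

    The game-specific helpers are assumed to exist. Returns (value, best_column).
    """
    if depth == 0 or is_terminal(board):
        return score(board), None

    best_move = None
    if maximizing:
        value = -math.inf
        for col in valid_moves(board):
            child = apply_move(board, col, maximizing=True)
            child_value, _ = alphabeta(child, depth - 1, alpha, beta, False,
                                       valid_moves, apply_move, is_terminal, score)
            if child_value > value:
                value, best_move = child_value, col
            alpha = max(alpha, value)
            if alpha >= beta:          # a better option already exists: prune
                break
    else:
        value = math.inf
        for col in valid_moves(board):
            child = apply_move(board, col, maximizing=False)
            child_value, _ = alphabeta(child, depth - 1, alpha, beta, True,
                                       valid_moves, apply_move, is_terminal, score)
            if child_value < value:
                value, best_move = child_value, col
            beta = min(beta, value)
            if alpha >= beta:
                break
    return value, best_move
```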
3.2 Connect 4 with Monte Carlo Tree Search (MCTS)

Monte Carlo tree search falls under heuristic search algorithms mostly used for decision processes like board-playing games [5]. The agent creates the game search tree of the possible moves in the game and chooses the node with the most promising reward. Each node of the tree encapsulates two values: the number of times the particular node is visited and the number of times the agent won the game after choosing the given node. MCTS uses the Upper Confidence bound applied to Trees (UCT), calculated using formula (1):

\mathrm{UCT} = \frac{W_i}{n_i} + c \sqrt{\frac{\ln N}{n_i}} \qquad (1)
The agent's goal is to maximize the above expression, where W_i is the number of wins, n_i is the number of times the node has been visited, c is a constant factor controlling the balance between exploration and exploitation, and N is the number of times the parent node has been visited.
Fig. 2 Win (%) of MCTS against random and negamax agents with varying T max (ms)
When the agent reaches a child node that has not been visited before, it picks random moves to expand that branch of the tree and continues to play random simulations until the game is over [6]. The agent was initially made to play negamax and random over 100 game rounds multiple times, with the MCTS agent under a time constraint for choosing each move. The agent's winning percentage increases as the allotted time increases. Figure 2 shows the winning percentage of the agent against negamax with increasing values of the time limit. A sketch of the UCT selection rule is given below.
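The UCT rule in Eq. (1) can be sketched as follows; the exploration constant c = 1.41 and the node representation are assumptions made for the example.

```python
import math

def uct_value(wins, visits, parent_visits, c=1.41):
    """Upper confidence bound from Eq. (1); unvisited nodes are tried first."""
    if visits == 0:
        return math.inf
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=1.41):
    """Pick the child node with the largest UCT value."""
    return max(children, key=lambda n: uct_value(n["wins"], n["visits"], parent_visits, c))

# Example with three hypothetical child nodes of a root visited 30 times:
children = [{"wins": 10, "visits": 20}, {"wins": 4, "visits": 6}, {"wins": 0, "visits": 0}]
print(select_child(children, parent_visits=30))   # the unvisited node is chosen first
```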
3.3 Connect 4 with Deep Reinforcement Learning

3.3.1 Q-Learning

Q-learning is a value-based reinforcement learning algorithm in which the agent finds and learns an optimal mapping between states and actions that allows the future reward to be as high as possible over a succession of time steps [7]. The best action is selected from all the possible actions:

Q(s, a) = r(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a') \qquad (2)
The Q value depends on the reward ‘r’ based on the current state ‘s’ and action ‘a’ and on the best possible future reward we can get. γ is known as the discount factor, values between 0 and 1, which will decide the dependency of our agent on future reward. This discount factor is necessary as it will help the algorithm converge faster [8].
The only difference between Q-learning and DQN is that the function approximator replaces the exact value function. This property of DQN may result in forgetting the precious and valuable experience while training and can make the agent biased to the repetitive experience. This can be resolved by using experience replay memory [7].
3.3.2 Double DQN
The agent is implemented using Double DQN. RGB images of the board are fed to a deep convolutional neural network: the image is processed by two convolution layers with a kernel size of 3 and a stride of 1, each using the ReLU activation function. The network head is a 3-layer network with 64 nodes in the hidden layer and 7 nodes in the output layer, and the agent follows an epsilon-greedy policy. Double DQN separates the policy network from the target network; the target network is updated only once every few games, which prevents overestimation of the Q values. The target Q value is the negative of the opponent's best Q value, because the agent should take actions that put the opponent into a substandard state; this exploits the zero-sum property of connect 4. The model uses a dual replay memory with a capacity of 30,000, which stores each experience along with its mirror image, since connect 4 is left-right symmetric about the middle column, hence doubling the dataset. The agent was trained for 300,000 games, which took approximately 16 h. We tested our Double DQN agent against the negamax and random agents, and it was able to defeat both of them 100% of the time. The other important parameters used to train the DDQN agent are listed below, followed by a sketch of the target computation:

• Batch size: 32
• Number of epochs: 2
• Gamma: 0.999
• Learning rate: 5 × 10−4
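A small sketch of how the zero-sum Double-DQN target described above could be computed is given below; the `policy_q` / `target_q` callables, the transition format and the helper itself are illustrative assumptions rather than the authors' code, with gamma taken from the parameter list.

```python
import numpy as np

GAMMA = 0.999   # discount factor from the parameter list above

def ddqn_targets(batch, policy_q, target_q):
    """Build Double-DQN targets for a batch of Connect 4 transitions (sketch).

    policy_q / target_q: assumed callables mapping a board state to a length-7
    vector of Q values (one per column). Each transition is a dict with keys:
    state, action, reward, next_state, done.
    """
    targets = []
    for t in batch:
        if t["done"]:
            target = t["reward"]
        else:
            # Double DQN: the policy net picks the opponent's best action,
            # the target net evaluates it; negate it because connect 4 is zero-sum,
            # so a good state for the opponent is a bad state for the agent.
            opp_action = int(np.argmax(policy_q(t["next_state"])))
            target = t["reward"] - GAMMA * target_q(t["next_state"])[opp_action]
        targets.append((t["state"], t["action"], target))
    return targets
```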
4 Results

4.1 Monte Carlo Tree Search Versus Alpha-Beta Pruning (Minimax)

Figure 3 shows the winning percentage between Alpha-Beta pruning and Monte Carlo Tree Search, with Monte Carlo Tree Search as our agent and Alpha-Beta pruning (minimax) as the opponent at varying levels of depth (from n = 1 to 3); 250 games were played at each value of T max versus each value of n to generalize the win percentage. The MCTS agent was constrained to choose its move within a limited
Fig. 3 Win (%) of MCTS against minimax with varying T max (ms)
amount of time, T max . The results were rather discouraging with respect to MCTS with the initial value of T max , where MCTS was made to choose its move in less time. As T max increased and reached over 25 ms MCTS started to dominate over Alpha-Beta (Minimax). Among the varying levels of depth of the minimax model, interestingly, the agent with depth n = 2 performed the best.
4.2 Double DQN Versus Monte Carlo Tree Search Figure 4 shows the winning percentage between Double DQN and Monte Carlo Tree Search with varying T max , 250 games were played at each value of T max . The MCTS was again constrained to choose its move within a limited amount of time. Double DQN performs encouragingly better than MCTS initially and even for latter values of T max . But, as T max reached close to 80 ms MCTS started to show an edge over Double DQN but never dominated the performance of Double DQN agent. Fig. 4 Win (%) of double DQN against MCTS with varying T max (ms)
Fig. 5 Win (%) of double DQN against Minimax with varying depth of lookahead
4.3 Double DQN Versus Alpha-Beta Pruning (Minimax)

Lastly, Fig. 5 shows the win percentage between Double DQN and Alpha-Beta pruning (minimax) with varying depth of lookahead (from n = 1 to 3); 250 games were played at each value of n. Here, Double DQN beats Alpha-Beta pruning for every value of n. At n = 2 the margin was close, with Double DQN winning only about 55% of the games and Alpha-Beta about 45%, but even after testing more games at n = 2, Double DQN consistently won more than 50% of the time. Hence, it came out as the superior agent over Alpha-Beta.
5 Conclusion

In this research paper, we have applied three algorithms to the game of Connect 4 and examined their performance by contesting them against each other. The first algorithm was minimax with alpha-beta pruning; the study revealed the general trend that it performs better with increasing depth, although in certain cases a depth of n = 2 performed better than a depth of n = 3, possibly due to the exploration of suboptimal moves at depth 3. Hence, it is always better to limit the depth of the minimax algorithm to an appropriate value. The second algorithm was Monte Carlo Tree Search (MCTS), which consistently defeated the minimax algorithm when given enough time to choose its move. The third algorithm was DDQN, based on self-play; it emerged as the undisputed winner against the minimax agents. DDQN lost to MCTS when MCTS was given around 80 ms to choose its move, but there was no large gap between their winning percentages. We believe that training our Double DQN agent over more games would enable it to overcome MCTS.
References

1. Schneider, M. O., & Rosa, J. L. G. (2002). Neural connect 4—A connectionist approach to the game. In VII Brazilian Symposium on Neural Networks.
2. Thill, M., Koch, P., & Konen, W. Reinforcement learning with N-tuples on the game connect-4. In Lecture Notes in Computer Science (Vol. 7491), Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32937-1_19.
3. Connect X. (n.d.). From https://www.kaggle.com/c/connectx.
4. Rijul, N., Rishabh, D., Shubhranil, M., & Vipul, K. Alpha-beta pruning in minimax algorithm—An optimized approach for a connect-4 game. International Research Journal of Engineering and Technology (IRJET), 5(4), 1637–1641.
5. Monte Carlo tree search. (2020, September 26). Retrieved December 06, 2020, from https://en.wikipedia.org/wiki/Monte_Carlo_tree_search.
6. Wisney, G. (2019, April 28). Deep reinforcement learning and Monte Carlo tree search with connect 4. Retrieved December 02, 2020, from https://towardsdatascience.com/deep-reinforcement-learning-and-monte-carlo-tree-search-with-connect-4-ba22a4713e7a.
7. Larsen, N. (2019, January 17). Why is a target network required? Retrieved December 04, 2020, from https://stackoverflow.com/questions/54237327/why-is-a-target-network-required.
8. Ashraf, M. (2018, April 11). Reinforcement learning demystified: Markov decision processes (Part 1). Retrieved December 04, 2020, from https://towardsdatascience.com/reinforcement-learning-demystified-markov-decision-processes-part-1-bf00dda41690.
A Pareto Dominance Approach to Multi-criteria Recommender System Using PSO Algorithm Saima Aysha and Shrimali Tarun
Abstract Recommender systems are software tools which are used for dealing with the information overload problem by identifying more relevant items to users based on their past preferences. Single Collaborative Filtering, the most successful recommendation technique, provides appropriate suggestions to users based on their similar neighbours through the utilization of overall ratings given by users. But it can select less representative users as neighbours of the active user, indicating that the recommendations thus made are not sufficiently precise in the context of singlecriteria Collaborative Filtering. Incorporating multi-criteria ratings into Collaborative Filtering presents an opportunity to improve the recommendation quality because it represents the user preferences more efficiently. However, learning optimal weights to different criteria for users is a major concern in designing multi-criteria recommendation framework. Our work in this paper is an attempt towards introducing multicriteria recommendation strategies exploring both the concepts of Pareto dominance and Genetic algorithm, to further enhance their quality of recommendations. The contributions of this paper are two fold: First, we develop a Multi-criteria Recommender system using Pareto dominance methodology (MCRS-PDM). The use of Pareto dominance in our method is to filter out less representative users before the neighbourhood selection process while retaining the most promising ones. Second, we applied Particle Swarm Optimization to our proposed methodology for efficiently learning the weights of various criteria of an item. Effectiveness of our proposed RSs is demonstrated through experimental results in terms of various performance measures using Yahoo! movies dataset. Keywords Recommender systems · Collaborative filtering · Multi-criteria decision-making · Pareto dominance · Particle swarm optimization algorithm · Multi-criteria recommendations · Pareto optimal Solution · Unseen item · Criterion Weight · Active user · Similarity computation S. Aysha Career Point University, Kota, Rajasthan, India S. Tarun (B) Department of CS and IT, JRN Rajasthan Vidyapeeth, Udaipur, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_60
1 Introduction

In the modern era, technology increasingly takes over manual work to ease tedious human tasks, and a surplus of information is now available to any user from anywhere at any time. This abundance creates chaos in decision-making, resulting in an information overload problem that makes it difficult for the user to choose one item or piece of information among a potentially overwhelming set of alternatives [1]. Decision-making is a vital part of everyday life, and with the growth of web information, users find it challenging to locate relevant information, products or services according to their needs and preferences. Most general lifestyle activities, such as shopping for clothes, food, books and furniture, watching movies, songs or videos, and booking holidays, trains, flights, hotels and restaurants, are now done online on a very large scale, and users face a plethora of options when accessing these applications. To mitigate the information overload problem, Recommender Systems (RSs) are the most notable web applications that assist users in the decision-making process, providing them with the best possible options matching their own preferences. Popular applications like Netflix, YouTube and Amazon.com recommend movies, videos, clothes and many other products with the help of Recommender Systems. Online social networking sites such as Facebook, Twitter, Instagram, LinkedIn and Google+ are a crucial part of people's lives, and RSs give users an easy way to reach their preferred information smartly and efficiently, saving a large amount of time.

Recommender Systems use various filtering techniques for providing recommendations [2]:

Content-Based Filtering: Recommendations are provided on the basis of items which are similar to the items the user liked in the past.

Collaborative Filtering: Recommendations are provided on the basis of users whose recorded past preferences are similar to those of the active user (the user to whom recommendations are to be provided), since like-minded users are likely to prefer similar items in the future too.

Demographic Filtering: Recommendations are generated on the basis of information given by users with similar demographic features such as age, occupation, nationality and gender.

Hybrid Filtering: This technique combines any of the other filtering techniques in order to make better recommendations, inheriting their advantages while eliminating their respective disadvantages.

Traditional Collaborative Filtering (CF) is the most widely used technique in RSs. CF recommenders try to find the peers of the active user, on the premise that if two users had similar choices in the past, then items liked by one user are likely to be liked by the other as well, since they share a common interest in
their past preferences. CF finds similar users, also known as neighbours of the active user, and uses their recorded preferences to generate recommendations for the active user. It works in the following phases:

1. Similarity computation: The similarity of users is calculated using the ratings they have provided to common items in the past. Many approaches exist for computing the similarity measure; two of the most commonly used are Pearson correlation and cosine-based similarity.
2. Neighbourhood formation: The set of top-k users with the highest similarity values is taken as the neighbours of the active user.
3. Prediction: The ratings provided by users in the neighbourhood set of the active user are used to predict the ratings the active user is likely to give to those items.
4. Recommendation: The items predicted to receive the highest ratings are finally recommended to the active user.
There are two types of rating recommender systems: single-rating recommender systems, in which users give a single overall rating to an item, and multi-criteria recommender systems, in which an item is divided into sub-criteria or features and each criterion is rated separately by users. In our work, we deal with multi-criteria RSs using the Movielens dataset, which has multiple criteria ratings of movies on criteria like story, direction, visuals and star cast. Multi-criteria decision-making is a kind of multiple-objective optimization problem where more than one objective function is to be optimized concurrently. Since it is impractical to expect a single solution that simultaneously optimizes every objective, for such conflicting objectives there exists a finite set of Pareto optimal solutions [3]. A solution x is called Pareto optimal or non-dominated if none of the objective functions can be improved in value without degrading another; in other words, there is no feasible solution which takes a lower value in some objective without causing a simultaneous increase in some other one [4]. While dealing with multiple criteria ratings, one cannot ignore the fact that the preference for each criterion will be different for different users. For example, when choosing a hotel for a holiday, some users may prioritize the location of the hotel while others care about the ambience or the luxurious rooms; when picking a movie, some users may go by the genre, be it comedy, horror, romance or thriller, while others may just be interested in good actors or a good star cast. The priorities across criteria differ from person to person. To resolve this, weights can be assigned to each criterion according to the preferences of the user, which can be done by machine learning algorithms like the Genetic Algorithm and Particle Swarm Optimization (PSO). In this research work, we perform experiments and compare both these algorithms in order to determine which one works best when applied to our proposed algorithm, which uses PSO in a Multi-Criteria Recommender System (MCRS) with the concept of Pareto dominance (PD). We have already used GA in MCRS with the PD approach in our previous work, and thus with this work, we are
analyzing the efficiency of our proposed system (MCRS-PDM) using the other weight-learning mechanism, i.e. PSO.
2 Background and Motivation

Recommender systems are software tools that exploit implicit and explicit user preferences in order to recommend the most relevant information according to users' choices. RSs have successfully used single-criterion CF approaches for years, and have more recently begun to employ multi-criteria ratings in order to capture user preferences more accurately and precisely [5]. The majority of engineering problems are multi-criteria optimization problems [3], and typical methods for solving them include finding Pareto optimal solutions, or optimizing the most important criterion while converting the others into constraints and successively optimizing one criterion at a time.
2.1 Traditional Single-Rating Recommender Systems Versus Multi-criteria Recommender Systems

A traditional CF RS operates on a 2-D User × Item matrix, where R0 is the overall rating given by a user to an item, represented as R: User × Item → R0. A multi-criteria RS, in contrast, includes multiple criteria of an item to which users provide more specific ratings. If each item has k criteria, the rating function is written as R: User × Item → R0 × R1 × R2 × … × Rk. To understand the difference between a single-rating RS and a multi-criteria RS, consider an example of a simple recommendation system with 5 users (u1, …, u5) and 5 items (i1, …, i5), where the items can be thought of as movies. Ratings for each item are provided on a scale of 0–10. As shown in Fig. 1, the goal is to predict the rating that the active (target) user u1 would give to the item he has not seen.
Fig. 1 Single-rating recommender system [3]
Fig. 2 Multi-criteria rating recommender system [3]
In this example, the unseen item is i5. As shown in Fig. 1, users u2 and u3 appear to be the most similar to u1, since the ratings given by u2 and u3 are exactly the same as those given by the active user u1. Therefore, u2 and u3 are taken as the neighbours of u1, and the rating of the unseen item i5 is predicted from the ratings u2 and u3 have provided to that item. As u2 and u3 both rated i5 as "9", the target rating is also predicted as "9". Now let us extend the example by incorporating criteria ratings. The items are movies, and each movie has four criteria: story, acting, star cast and direction, with ratings given as shown in Fig. 2. Looking only at the overall ratings in Fig. 2, u2 and u3 still appear most similar to the target user u1, while u4 and u5 seem to have different opinions. However, analyzing the criteria ratings the users have provided for each item, it turns out that u4 and u5 are actually the most similar users to u1, contradicting the single-rating view of Fig. 1; u2 and u3 in fact have very different criteria preferences for the movies. It is therefore evident that users may give similar overall ratings to an item while their detailed criteria preferences do not match at all. Hence, multi-criteria Recommender Systems are considerably more accurate and powerful in providing better personalized recommendations to users.
2.2 Concept of Pareto Dominance in MCRS

Multi-criteria decision-making problems are gaining attention as the amount of available information grows, leaving users deeply confused about which product or piece of information is most preferable for them.
It is concerned with mathematical optimization in which multiple conflicting objective functions are to be optimized simultaneously [6, 7]. It is practically impossible to obtain a single solution that optimizes every objective at the same time; for such conflicting objective functions, Pareto optimal solutions exist instead [3]. Users who are dominated by others do not tend to have higher similarity to the active user than the ones who dominate them. Therefore, it is acceptable to eliminate the dominated users while keeping the most promising dominating ones, also known as non-dominated users. With these filtered non-dominated users, we proceed by combining the weights obtained from the PSO algorithm and computing the similarity between users on that basis. The top-k users with the highest similarity are used to predict ratings for the active user, and finally the items with the best predicted ratings are recommended. In order to enhance the accuracy of MCRS using the concept of Pareto dominance, the basic operation consists of four major phases [4]:

1. Selecting the non-dominated users.
2. Computing similarity between the non-dominated users and the active user to find the most promising neighbours of the active user.
3. Item prediction.
4. Item recommendation.
2.3 Particle Swarm Optimization

Optimization techniques are used to find the parameters that give the best (maximum or minimum) value of a target function, known as its optimal solution. Particle Swarm Optimization (PSO) is an evolutionary algorithm first described by Russell Eberhart and James Kennedy in 1995 [8, 9], who observed the flocking and schooling patterns of swarms and realized that a simulation of these patterns works very well as an optimization algorithm. Like the genetic algorithm, it is based on a population of candidate solutions, but it differs in that each particle moves through the search space with a velocity, position and acceleration that keep changing, rather than reproducing existing solutions. The movement of particles around the space, which allows them to explore the possible optimal solutions, depends on their velocity and acceleration [10]. Each location visited has a fitness value that reflects how well it satisfies the objective, which in this case encodes user preferences. Each particle maintains its position in the space (the candidate solution), its velocity, its individual best position and the global best position [11]. PSO, along with other algorithms, has been successfully applied in various research fields including recommendation. Xiaohui et al. [12] presented a modified dynamic-neighbourhood particle swarm optimization algorithm that solves multi-objective optimization problems through one-dimensional optimization for handling multiple objectives, and introduced an extended memory to store Pareto optimal solutions, reducing the computation time of the algorithm. In Logesh et al. [13], the authors
proposed a dynamic PSO and hierarchy induced k-means algorithm to generate effective personalized Point of Interest (POI) recommendations based on electroencephalography feedback. Bakshi et al. [14] used PSO to alleviate the problem of sparsity in recommender systems. In Choudhary et al. [15], the authors have worked towards development of MCRS by utilizing different similarity measures and PSO for learning weights.
2.4 Motivation

Selecting the neighbours of the active user from all the users in the dataset is a critical step for the prediction and recommendation processes, and the success of an RS depends heavily on the effectiveness of the algorithm used to find these neighbours. Using the k-nearest-neighbour approach does not always yield the users whose preferences best match those of the active user; it is common to end up with a noteworthy number of neighbours who share little information or taste. Multi-criteria ratings provide more detailed knowledge about the rating of a particular item and show that the overall rating is not independent of the criteria ratings but rather serves as an aggregation of the item's multiple criteria ratings. Moreover, each criterion is likely to carry a different preference value for every user, meaning each feature of an item has its own weight of importance; recommendations would therefore be better if built on these concepts. The concept of Pareto dominance is used to filter out neighbours who have less information in common with the active user compared with other, dominating neighbours who share more; it eliminates less representative neighbours from the neighbourhood set and retains the most promising non-dominated ones. Our proposed work compares the accuracy of MCRS-PDM (Multi-criteria Recommender System using the Pareto Dominance Method) with that of the PSO algorithm applied to a multi-criteria RS with PDM.
3 MCRS-PDM for Multi-criteria Recommender Systems

The concept of Pareto dominance indicates that a problem may have multiple objectives with no single solution that simultaneously optimizes each of them. In that case, the objective functions are said to be conflicting, and there exists a (possibly infinite) number of Pareto optimal solutions [3]. A solution x is called non-dominated, Pareto optimal, Pareto efficient or non-inferior if none of the objective functions can be improved in value without degrading some of the other objective values, i.e. if no other feasible solution exists that takes a lower
value in some objective without causing a simultaneous increase in at least some other one [4, 16]. We propose CF-based RSs with a set S of users who rate m items on c criteria using values from the interval min, …, max, where a missing rating is represented by "•". The variables used in the proposed work are defined as follows:
The proposed method MCRS-PDM aims at enhancing multi-criteria RSs based on Pareto dominance by excluding unpromising users in the k-neighbour selection phase. The basic operation of MCRS-PDM is divided into the following four phases:

Phase 1: Selecting the non-dominated users.
Phase 2: Similarity computation and neighbourhood formation.
Phase 3: Item prediction.
Phase 4: Item recommendation.

A detailed description of these phases is given below.
3.1 Phase 1: Selecting the Non-dominated Users

In this phase, we determine the set of users who are non-dominated with respect to the active user. Let Iu = {i ∈ I | ru,i ≠ •} be the set of items rated by user u, and let ic denote the rating given to criterion c of an item i rated by user u.
Let d ru,i , r x,i be the absolute difference between the ratings given by user u and x to the item i. ru,i − r x,i ru,i = • d ru,i , r x,i = For each criteria (1) ∞ ru,i = • Figure 3 shows an overview of proposed methodology: MCRS-PDM. According to the condition of Pareto dominance: We say that user x dominates user y with respect to another user u, if the following expression is satisfied (Fig. 4). x >u y ↔ ∀i c ∈ Iu : d ru,i , r x,i ≤ d ru,i , r y,i ∧ ∃ j ∈ Iu |∀Ic ∈ Iu : d ru, j , r x, j < ru, j , r y, j
(2)
Conceptually, dominated users do not show higher similarity to the active user than the users who dominate them. Thus, it is viable to discard the dominated users and only keep the dominating or the non-dominated ones. We have used this definition to extract these non-dominated users for an active user (Fig. 5).
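A minimal Python sketch of this filtering step is shown below, assuming each user's ratings are stored as a dictionary keyed by (item, criterion) with `None` for missing values; it follows the distance of Eq. (1) and the dominance test of Eq. (2), and is an illustration rather than the authors' implementation.

```python
import math

def distance(r_u, r_x):
    """Eq. (1): per-criterion absolute difference, infinite if x has no rating."""
    return abs(r_u - r_x) if r_x is not None else math.inf

def dominates(u_ratings, x_ratings, y_ratings):
    """Eq. (2): does user x dominate user y with respect to the active user u?"""
    at_least_as_close, strictly_closer = True, False
    for key, r_u in u_ratings.items():            # key = (item, criterion) rated by u
        d_x = distance(r_u, x_ratings.get(key))
        d_y = distance(r_u, y_ratings.get(key))
        if d_x > d_y:
            at_least_as_close = False
            break
        if d_x < d_y:
            strictly_closer = True
    return at_least_as_close and strictly_closer

def non_dominated_users(active_ratings, candidates):
    """Phase 1: keep every candidate user that no other candidate dominates."""
    kept = {}
    for x_id, x_ratings in candidates.items():
        dominated = any(
            dominates(active_ratings, y_ratings, x_ratings)
            for y_id, y_ratings in candidates.items() if y_id != x_id
        )
        if not dominated:
            kept[x_id] = x_ratings
    return kept
```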
Fig. 3 Flowchart of the proposed Work MCRS-PDM
Fig. 4 A flowchart of the particle swarm optimization
3.2 Phase 2: Similarity Computation and Neighbourhood Formation

In order to find the top k-neighbours of an active user among the filtered non-dominated users, we go through the following steps:

1. Calculate the similarity of each non-dominated user to the active user.
2. Find the top k-neighbours with the highest similarities to the active user.

Let Cx,y = {ic ∈ I | rx,i ≠ • ∧ ry,i ≠ •} denote the set of item criteria co-rated by both users x and y. For similarity computation between the active user and its non-dominated users, we use the modified Pearson correlation formula shown below:
Fig. 5 Graphical representation of results
PC(x, y) = \frac{\sum_{c=1}^{M} \sum_{i_c \in C_{x,y}} (r_{x,i_c} - \bar{r}_{x,i_c})(r_{y,i_c} - \bar{r}_{y,i_c})}{M \cdot \sqrt{\sum_{c=1}^{M} \sum_{i_c \in C_{x,y}} (r_{x,i_c} - \bar{r}_{x,i_c})^2} \cdot \sqrt{\sum_{c=1}^{M} \sum_{i_c \in C_{x,y}} (r_{y,i_c} - \bar{r}_{y,i_c})^2}} \qquad (3)

where M is the total number of criteria of an item. The neighbours are chosen only from the set of non-dominated users. We then select the top-k users most similar to the active user and define Ku as the set of k-neighbours (most similar users) of the active user.
3.3 Phase 3: Item Prediction

After the neighbourhood set is formed, we compute the rating predictions for the active user on the basis of the ratings given by the nearest neighbours. To estimate the rating the active user is likely to give to an unseen item, we combine the ratings given to that item by its k-neighbours. Let pu,i be the prediction of item i for user u, and let Mu,i denote the neighbours of u who have rated item i:

p_{u,i} = \frac{1}{\#M_{u,i}} \sum_{n \in M_{u,i}} r_{n,i} \iff M_{u,i} \neq \bullet

p_{u,i} = \bullet \iff M_{u,i} = \bullet
3.4 Phase 4: Item Recommendation In order to complete the recommendation process, we compute Z u , the set of items likely to be recommended to user u, based on the predicted ratings on unseen items and X u , the set of at most N items to be recommended.
4 The Proposed Approach: MCRS-PDM with PSO Algorithm

In this section, we elucidate our proposed method, which applies the concept of Pareto dominance in MCRS and uses PSO to learn the optimal criteria weights. The MCRS-PDM with PSO framework is accomplished in the three phases below:

Phase I: Selecting the non-dominated users using the concept of Pareto dominance.
Phase II: Calculating similarity using criteria weights (PSO is used to learn the weights).
Phase III: Prediction and recommendation.
5 Phase I

5.1 Selecting the Non-dominated Users

Let the ratings provided by users u and y to item i be ru,i and ry,i, and to item j be ru,j and rx,j. Then let Iu = {i ∈ I | ru,i ≠ •} be the set of items rated by user u, and let ic denote the rating given to criterion c of an item i rated by user u. Let d(ru,i, rx,i) be the absolute difference between the ratings given by users u and x to item i, computed for each criterion:

d(r_{u,i}, r_{x,i}) = \begin{cases} |r_{u,i} - r_{x,i}| & \text{if } r_{x,i} \neq \bullet \\ \infty & \text{if } r_{x,i} = \bullet \end{cases}
According to the concept of Pareto dominance used in our research work [4], and building on the distance in Eq. (1), we say in MCRS that user x dominates user y with respect to another user u if the dominance condition given earlier in Eq. (2) is satisfied.
5.2 Similarity Computation

After determining the non-dominated users, we find the top-k neighbours of the active user in two steps:

1. Calculate the similarity of the active user with each non-dominated user.
2. Select the top-k non-dominated users that show the highest similarity to the active user.

Let $C_{x,y} = \{i_c \in I \mid r_{x,i} \neq \bullet \wedge r_{y,i} \neq \bullet\}$ be the set of items co-rated by users x and y. To compute the similarity, we use the modified Pearson correlation formula

$$sim(x, y) = \sum_{c=1}^{M} w(c) \cdot \frac{\sum_{i_c \in C_{x,y}} (r_{x,i_c} - \bar{r}_{x,i_c})(r_{y,i_c} - \bar{r}_{y,i_c})}{\sqrt{\sum_{i_c \in C_{x,y}} (r_{x,i_c} - \bar{r}_{x,i_c})^2}\,\sqrt{\sum_{i_c \in C_{x,y}} (r_{y,i_c} - \bar{r}_{y,i_c})^2}} \quad (4)$$
Here, M denotes the total number of criteria of an item, and w(c) denotes the weight of each criterion of the item according to the active user. We use particle swarm optimization to learn the appropriate weights for the active user.
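A minimal sketch of how the weighted similarity of Eq. (4) could be computed is given below. It assumes each user's ratings are stored as a dictionary mapping item identifiers to per-criterion rating vectors and that the per-criterion Pearson terms are combined with the learned weights w(c); the function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def weighted_similarity(ratings_x, ratings_y, weights):
    """Weighted multi-criteria Pearson similarity, a sketch of Eq. (4).

    ratings_x, ratings_y: dict item_id -> array of M per-criterion ratings
    weights: array of M criteria weights learned by PSO.
    """
    co_rated = sorted(set(ratings_x) & set(ratings_y))   # the set C_{x,y}
    if not co_rated:
        return 0.0
    Rx = np.array([ratings_x[i] for i in co_rated], dtype=float)  # |C| x M
    Ry = np.array([ratings_y[i] for i in co_rated], dtype=float)
    sim = 0.0
    for c in range(Rx.shape[1]):
        dx = Rx[:, c] - Rx[:, c].mean()
        dy = Ry[:, c] - Ry[:, c].mean()
        denom = np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum())
        if denom > 0:
            sim += weights[c] * (dx * dy).sum() / denom   # weighted Pearson term
    return sim
```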
5.3 Using PSO to Learn Criteria Weights

The PSO algorithm is used to learn the criteria (feature) weights of items for the active user, and hence to tailor the likelihood of recommending an item according to the user's taste and preferences.
5.3.1 Step 1: Particle Dynamics
We use a conventional PSO algorithm [17, 18] to learn the multiple criteria weights in our proposed approach. Each criteria weight is represented by an 8-bit binary number ranging from 0 to 255. We learn 'r' weights, one per criterion, so a particle is encoded as 8r bits. The weighting value for every criterion is then obtained by converting each 8-bit group to its decimal value and dividing it by the total of all weights. Each particle preserves its position, composed of the candidate solution and its evaluated fitness, and its velocity. Additionally, it remembers the best fitness value it has achieved thus far during the run of the algorithm, denoted as the individual
best fitness, and the candidate solution that achieved this fitness, referred to as the individual best position. Finally, the PSO algorithm maintains the best fitness value achieved among all particles in the swarm, called the global best fitness, and the candidate solution that achieved this fitness, called the global best position [19]. The PSO algorithm works in three main steps which are repeated until some stopping criterion is reached [19]:

• Calculate the fitness value of each particle.
• Update the personal best and global best solutions.
• Update the velocity and position of each particle.

The position with the highest fitness value in each iteration is recorded as the global best (gbest) position of the entire swarm, and all other particles move towards that position. Moreover, every particle keeps a record of its own personal best (pbest) position it has visited [20]. The velocity and position of each particle are updated using the following rules:

$$v_i = w v_i + c_1 r_1 (x_{pbest,i} - x_i) + c_2 r_2 (x_{gbest} - x_i) \quad (5)$$
$$x_i = x_i + v_i \quad (6)$$

Here, $v_i$ is the velocity of particle i, $x_i$ is its current position, w is the inertia coefficient (between 0.5 and 1), $c_1$ and $c_2$ are constants set to 1.494 [17], $x_{pbest,i}$ is the personal best position that particle i has visited, $x_{gbest}$ is the swarm's global best position, and $r_1$ and $r_2$ are random values between 0 and 1.
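The particle encoding and the update rules of Eqs. (5) and (6) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the position is treated as continuous-valued for the update (a binary PSO would re-threshold the bits), and the helper names are assumptions.

```python
import numpy as np

def decode_particle(bits, r):
    """Decode a particle of 8*r bits into r normalised criteria weights.

    Each weight is an 8-bit integer in 0..255, as described in the text.
    """
    raw = [int("".join(str(b) for b in bits[8 * c: 8 * (c + 1)]), 2)
           for c in range(r)]
    total = sum(raw)
    return [v / total for v in raw] if total else [1.0 / r] * r

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.494, c2=1.494):
    """One velocity and position update per Eqs. (5) and (6).

    x, v, pbest: arrays of shape (n_particles, n_dims); gbest: (n_dims,).
    """
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```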
The fitness of each particle is evaluated from the prediction error over the training ratings of the active user, where $n_R$ denotes the cardinality of the training set of the active user, while $ar_i$ and $pr_i$ are the respective actual and predicted ratings given to item i by the active user. Upon reaching the predefined iteration threshold, the PSO algorithm terminates. The top-k users with the highest similarity to the active user are then taken as its neighbours. Let the neighbourhood set be denoted as $M_u$.

(a) Prediction. To predict the rating that the active user is likely to give to an unseen item, we utilize the ratings that the users of the neighbourhood set have provided for that item. Let $p_{u,i}$ be the rating prediction for item i, and let $r_{n,i}$ denote the rating provided by a neighbour n to item i. We calculate the predicted ratings using the following equations:
$$p_{u,i} = \frac{1}{\#M_{u,i}} \sum_{n \in M_{u,i}} r_{n,i} \quad \text{if } M_{u,i} \neq \bullet \qquad (7)$$
$$p_{u,i} = \bullet \quad \text{if } M_{u,i} = \bullet$$

(b) Recommendation. The top-n items with the highest predicted rating values are most likely to be preferred by the active user and hence are recommended to him/her. Let $Z_u$ be the set of items to be recommended to the active user.
6 Experiments and Results

To verify and demonstrate the effectiveness of our proposed approach, we have used the Yahoo! Movies dataset, which includes multi-criteria ratings on a large number of movie items. The Yahoo! Movies dataset comprises 6078 users and 976 movies, and each movie has four different criteria ratings along with an overall rating. We performed our experiments in 3 splits, where the dataset is randomly divided into 3 folds by picking 300, 500 and 700 users. Each user's ratings are divided into a training set (80%) and a test set (20%). The ratings in the training set are used to generate the neighbourhood set, i.e. to train the system, while the ratings in the test set are treated as unseen items for the active user and are used to check the effectiveness of the proposed system.
6.1 Performance Measures

The performance of the proposed RS is evaluated on the basis of the following quality measures:

• Coverage: the number of unknown ratings that the system is able to predict.
• Mean Absolute Error (MAE): the mean error in the predictions generated by the system, i.e. how much the predicted ratings deviate from the true ratings provided by the user.
• Precision: the percentage of relevant items recommended with respect to the total number of items recommended in the test data (a measure of correctness).
• Recall: the percentage of relevant recommended items out of the total relevant items in the test data (a measure of completeness).
• F-measure (FM): the harmonic mean of precision and recall.

In information retrieval terms, a perfect precision of 1.0 implies that every retrieved result is relevant, and a perfect recall of 1.0 means that all relevant items are retrieved.
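For concreteness, a small sketch of how these measures could be computed for one active user is given below; it assumes the recommendations and the relevant test items are available as plain lists, and the function names are illustrative only.

```python
def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_recall_f1(recommended, relevant):
    """Precision, recall and F-measure for a top-N recommendation list."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```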
6.2 Experiments

To evaluate the performance of the proposed approach, the following methods are compared with the proposed methodology MCRS-PDM with PSO (PDM-PSO MCRS) on the basis of the above-mentioned measures:

1. Multi-criteria RS with collaborative filtering (MCRS-CF).
2. Multi-criteria RS using the Pareto dominance concept (MCRS-PDM).
3. Multi-criteria RS using Pareto dominance with the PSO algorithm (PDM-PSO MCRS).
Experiments are done in three splits, namely 1000-fold, 1100-fold and 1200-fold users. A fixed neighbourhood set of the top 60% most similar neighbours among all the non-dominated users is extracted and used in the experiments.
7 Results

The impact of the proposed approach can be seen through the experimental results shown in Tables 1, 2, 3 and 4. All the results are reported for the aforementioned 3 folds. In all the performance evaluation measures, i.e. Coverage, Precision, Recall and F-measure, PDM-PSO MCRS has outperformed MCRS-CF and MCRS-PDM. This shows that better predictions with more relevant preferences are provided by our proposed approach, which, in turn, results in better recommendations to users. We can thus conclude that a weight learning approach such as PSO gives improved results in the recommendation process compared to the simple traditional approach.

Table 1 Experimental results of various approaches over coverage

User        MCRS-CF   MCRS-PDM   PDM-PSO MCRS
1000-fold   0.6968    0.8010     0.8900
1100-fold   0.7341    0.8222     0.9190
1200-fold   0.7563    0.8352     0.9327

Table 2 Experimental results of various approaches over precision

User        MCRS-CF   MCRS-PDM   PDM-PSO MCRS
1000-fold   0.8068    0.8376     0.8591
1100-fold   0.8241    0.8471     0.8724
1200-fold   0.8293    0.8524     0.8827
Table 3 Experimental results of various approaches over recall

User        MCRS-CF   MCRS-PDM   PDM-PSO MCRS
1000-fold   0.8010    0.8732     0.9123
1100-fold   0.8111    0.9065     0.9096
1200-fold   0.8975    0.9090     0.9312

Table 4 Experimental results of various approaches over F-measure

User        MCRS-CF   MCRS-PDM   PDM-PSO MCRS
1000-fold   0.8050    0.8421     0.8825
1100-fold   0.8211    0.8635     0.8963
1200-fold   0.8475    0.8745     0.9095
8 Conclusion and Future Scope

In today's scenario of information overload, recommender systems provide users with an easy way to reach their preferred options and have therefore become an integral part of various e-commerce services. Multi-criteria RSs enhance the quality of recommendation by utilizing user preferences on different criteria of each item, calculating more effective predictions and providing better recommendations to the user. A variety of filtering techniques have been developed for generating recommendations, mainly collaborative and content-based filtering.

In this paper, we have used a multi-criteria RS based on the concept of Pareto dominance (MCRS-PDM), where Pareto dominance acts as a pre-filter for CF and selects more optimal neighbours of an active user by eliminating less representative users. This leads to a significant improvement in the accuracy and quality of recommendations. In real-life scenarios, different users have distinct priorities over the various criteria of items, and extracting appropriate weights for these criteria is a major concern in the field of multi-criteria RS. To further increase the performance of MCRS, we have therefore combined PSO with the MCRS-PDM recommendation methodology to learn the optimal weights of the various criteria according to the preferences of the user. The experiments performed on the Yahoo! Movies dataset show that the proposed method gives better results on all performance measures, which proves its effectiveness in the field of recommendation.

In any research venture there is always room for enhancement, and the work described in this paper is no different. In recent years, classifier ensemble techniques have drawn the attention of many researchers in the machine learning community [21]. Beyond the work in this paper, several challenges can be investigated, and the following enhancements and modifications can be made in extension to the proposed method. We used the concept of PDM to obtain an appropriate optimal solution to our multi-criteria optimization problem; there are a few
other algorithms that can be used to derive the non-dominated optimal solutions from the neighbourhood set, such as NSGA, SPEA2, PAES, NSGA-I and NSGA-II. Our work in this paper is specific to movie RSs, and it would be interesting to explore the feasibility of extending it to other domains, e.g. jokes, books or music. Balancing the diversity and novelty of the recommendations along with the accuracy of the system would also help to provide a better user satisfaction rate.
References 1. Ekstrand, M. D., Riedl, J. T., & Konstan, J. A. (2011). Collaborative filtering recommender systems. Foundations and Trends® in Human–Computer Interaction, 4(2), 81–173. 2. Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 6, 734–749. 3. Adomavicius, G., & Kwon, Y. (2007). New recommendation techniques for multicriteria rating systems. IEEE Intelligent Systems, 22(3), 48–55. 4. Ortega, F., SáNchez, J. L., Bobadilla, J., & GutiéRrez, A. (2013). Improving collaborative filtering-based recommender systems results using Pareto dominance. Information Sciences, 239, 50–61. 5. Shambour, Q., Hourani, M., & Fraihat, S. (2016). An item-based multi-criteria collaborative filtering algorithm for personalized recommender systems. International Journal of Advanced Computer Science and Applications, 7(8), 274–279. 6. Lakiotaki, K., Matsatsinis, N. F., & Tsoukias, A. (2011). Multicriteria user modeling in recommender systems. IEEE Intelligent Systems, 26(2), 64–76. 7. Ribeiro, M. T., Ziviani, N., Moura, E. S. D., Hata, I., Lacerda, A., & Veloso, A. (2015). Multiobjective pareto-efficient approaches for recommender systems. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4), 53. 8. Eberhart, R., & Kennedy, J. (1995, November). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks (Vol. 4, pp. 1942–1948). 9. Juneja, M., & Nagar, S. K. (2016, October). Particle swarm optimization algorithm and its parameters: A review. In 2016 International Conference on Control, Computing, Communication and Materials (ICCCCM) (pp. 1–5). IEEE. 10. Blackwell, T. M., & Bentley, P. J. (2002, July). Dynamic search with charged swarms. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation (pp. 19–26). Morgan Kaufmann Publishers Inc.. 11. Shi, Y. (2001, May). Particle swarm optimization: developments, applications and resources. In Proceedings of the 2001 congress on evolutionary computation (IEEE Cat. No. 01TH8546) (Vol. 1, pp. 81–86). IEEE. 12. Hu, X., Eberhart, R. C., & Shi, Y. (2003, April). Particle swarm with extended memory for multiobjective optimization. In Proceedings of the 2003 IEEE Swarm Intelligence Symposium. SIS’03 (Cat. No. 03EX706) (pp. 193–197). IEEE. 13. Logesh, R., Subramaniyaswamy, V., Malathi, D., Senthilselvan, N., Sasikumar, A., Saravanan, P., & Manikandan, G. (2017). Dynamic particle swarm optimization for personalized recommender system based on electroencephalography feedback. Biomedical Research (0970–938X), 28(13). 14. Bakshi, S., Jagadev, A. K., Dehuri, S., & Wang, G. N. (2014). Enhancing scalability and accuracy of recommendation systems using unsupervised learning and particle swarm optimization. Applied Soft Computing, 15, 21–29. 15. Choudhary, P., Kant, V., & Dwivedi, P. (2017, February). A particle swarm optimization approach to multi criteria recommender system utilizing effective similarity measures.
In Proceedings of the 9th International Conference on Machine Learning and Computing (pp. 81–85). ACM.
16. Deb, K., & Saxena, D. K. (2005). On finding pareto-optimal solutions through dimensionality reduction for certain large-dimensional multi-objective optimization problems. Kangal Report, 2005011.
17. Shi, Y. (2001). Particle swarm optimization: Developments, applications and resources. In Proceedings of the 2001 Congress on Evolutionary Computation (IEEE Cat. No. 01TH8546) (Vol. 1, pp. 81–86). IEEE.
18. Chen, H., Fan, D. L., Fang, L., Huang, W., Huang, J., Cao, C., … & Zeng, L. (2020). Particle swarm optimization algorithm with mutation operator for particle filter noise reduction in mechanical fault diagnosis. International Journal of Pattern Recognition and Artificial Intelligence, 34(10), 2058012.
19. Blondin, J. (2009). Particle swarm optimization: A tutorial. http://cs.armstrong.edu/saad/csci8100/psotutorial.pdf.
20. Ujjin, S., & Bentley, P. J. (2003, April). Particle swarm optimization recommender system. In Proceedings of the 2003 IEEE Swarm Intelligence Symposium. SIS'03 (Cat. No. 03EX706) (pp. 124–131). IEEE.
21. Alzubi, O. A., Alzubi, J. A., Alweshah, M., Qiqieh, I., Al-Shami, S., & Ramachandran, M. (2020). An optimal pruning algorithm of classifier ensembles: Dynamic programming approach. Neural Computing and Applications, 1–17.
Twitter Sentiment Analysis Using K-means and Hierarchical Clustering on COVID Pandemic

Nainika Kaushik and Manjot Kaur Bhatia
Abstract Twitter is an online social platform used by many people. Most people express their views by tweeting their thoughts and perception. This paper proposes a mechanism for extracting tweets related to coronavirus all over the world. It also attempts to structure the tweet data into a meaningful format to analyse. K-means and hierarchical clustering algorithms are implemented for data mining to understand what is going on among different categories of people. Sentiment analysis serves as a valuable technique to leverage a better understanding of people’s ongoing moods and emotions during this pandemic and it helps to know people’s attitudes toward the pandemic and find out the heated topics associated with coronavirus through people’s tweet. After analysis, there were mixed emotions among people with a higher degree of pessimism. Keywords Corona · Twitter · Sentiment analysis · K-means · Hierarchical clustering · Covid · Social media
1 Introduction

Social media is a growing platform for exchanging information throughout the world. People express their views and emotions on social media. While most of the data is readily available through social media platforms, it is still in an unstructured format. So to understand the data and study the perception of people on certain topics, various exploratory data analysis techniques like sentimental analysis are performed. Sentiments can be classified as positive, negative, or neutral. Sentiment analysis is a process of determining the feeling or opinion of a piece of text. Humans are good at guessing sentiments by looking at a text. In companies, artificial intelligence is used
for doing this task, as it is really important for a company to know how its customers feel, along with their reviews and likes/dislikes. Recommendation systems also work on it.

Right from the beginning of 2020, coronavirus, more popularly known as COVID-19, has been the most discussed topic among everyone because of its severity. Coronaviruses are a family of viruses that has affected the whole world in many aspects, both economic and physical. The virus has spread widely across the world and taken a toll on the world's economy. Much research is aimed at coping with this pandemic situation, and scientists are working on vaccination. WHO has declared this virus a pandemic; Ebola, too, was declared a pandemic in the past. Coronavirus has changed the entire scenario of the globe. Citizens are affected by it in one way or the other: many people have lost their jobs, and in some developing countries the scenario is even worse, with people lacking money even for basic necessities. Mentally, too, people are affected by it, and it is a scary situation for many. The virus has also been declared airborne. For some time everything was on halt, and there were lockdowns in most of the affected countries. At this point, mental health plays an important role alongside physical health. People express their views more on social platforms these days, and Twitter is the most famous platform, where people express their emotions using tweets and hashtags. There were certain striking effects of this virus, which resulted in migrants fleeing to their home cities as they were left with no work; they were starving and homeless. In the beginning it was a shocking situation for all and, as a result, took a lot of time to settle. It is a pandemic situation, and the government is taking all possible steps to relieve its citizens. With the production of masks, sanitizers, gloves and PPE kits, there is overwhelming help and support from the health department.

In order to see the effect of this pandemic on human minds, this paper proposes a sentiment analysis of the tweets made by users throughout the world. At this point, it is really necessary for everyone to be in good mental health. K-means and hierarchical clustering are implemented. The main goal of the research is to find out people's attitudes towards this pandemic and the major topics that are highlighted during this period.
2 Related Work

Many researchers are working on predicting the impact of COVID-19 on human beings at various levels, such as psychological and economic. In [1], the researchers compared many machine learning and soft computing models for predicting the outbreak of COVID-19. COVID-19 is non-uniform, varying from place to place, and it is difficult to predict its behaviour; the multi-layered perceptron (MLP) and the adaptive network-based fuzzy inference system (ANFIS) showed better results for the prediction. Machine learning is a powerful tool to predict such complex outbreak behaviour. Data from many countries, such as Iran, China, Germany and the USA, are analysed, and GA, PSO and GWO are used for optimization. The outbreak of
this serious respiratory disease is of high concern, so the models are applied to predict new cases.

The authors proposed Twitcident [2], a framework used to analyse information about real-world incidents and crises. It is a broadcasting service which obtains information from tweets and applies sentiment analysis to them, using real-time analytics and faceted search. Crises such as large fires, earthquakes, cyclones or other incidents are also analysed: as soon as users tweet about such a crisis, the tweets are collected and filtered and an emergency broadcast message is circulated. The framework uses NER (Named Entity Recognition).

The researchers propose a combination of sentiment analysis and causal rule discovery to extract useful information from tweets [3]. Twitter is a powerful social media platform where people express their views and opinions using text messages and hashtags. The data used for the analysis concerned the Kurdish political issue in Turkey, a very controversial and heated topic. Such a combination of sentiment analysis and causal rule discovery can be used in various settings, for example by politicians when making policies, or for marketing purposes on customer reviews of a particular product.

The authors implemented the K-means clustering technique on search engine datasets using the WEKA tool [4]. The information available on the WWW is huge, and to dig out useful information the data needs to be reduced, which can be achieved by clustering the data. The clusters are categorized on the basis of some attributes, which play a major role; the main attributes identified by the analysis were back-links, length of the title, keywords in the title, URL length and in-links. These attributes are of great importance in search engine optimization.

The researchers proposed a way for an Android application to present a view of traffic jams in Jakarta, Indonesia [5]. The data is collected from Twitter, which does not have a GUI for maps. Natural language processing is used to extract information from tweets: the tweets are tokenized, and a rule-based approach, POS tagging and sentiment analysis are applied. With the help of Google Maps, the traffic condition is displayed on the mobile phone using three different colours for three different traffic conditions.

The authors showed how Twitter is used by well-known personalities, such as the G7 world leaders, for delivering messages related to COVID-19 to the masses around the globe [6]. Most of the extracted tweets contained useful and relevant information related to the pandemic, and various tweets linked to official government websites. Twitter has turned out to be a powerful tool used by well-known leaders to send messages to their citizens during this pandemic crisis; therefore, general caution should be taken while delivering any information during such a serious scenario.
The research assessed the usage of the terms "Chinese virus" and "China virus" after the US president referred to the pandemic by these terms [7]. The tweets were extracted with the Sysomos software; Stata 16 was used for the quantitative analysis and Python for plotting a state-level heat map of the United States. 16,535 tweets containing keywords like "China virus" or "Chinese virus" were identified during the pre-period, and 177,327 tweets were found during the post-period. The data from all states depicted an exponential increase in these keywords, with the top 5 states being Pennsylvania, New York, Florida, Texas and California. This shows that messaging about the COVID-19 pandemic spreads widely because of social platforms, among which Twitter plays a major role.

The authors analysed how information regarding Ebola was spread to people throughout the globe by social platforms like Twitter [8]. Ebola spread in the United States as it was contagious. The retrieved tweets were categorized into hidden influential users, influential users, common users and disseminators. Spreading of a message can happen through a broadcasting mechanism (one to many) or through viral spreading (one to one); the analysis concluded that broadcasting dominated viral spreading.

The researchers proposed a big data query optimization system for customer sentiment analysis of telecom tweets [9]. The suggested hybrid system leverages a recurrent neural network, a deep learning technique for efficient handling of big data, and spider monkey optimization, a metaheuristic technique that helps to train the network faster and enables fast query handling. The comparison was made with deep convolutional networks, and the model optimized the efficiency and performance of predicting customer churn rate.

The authors propounded Spider Monkey Crow Optimization (SMCO) [10], a hybrid model for sentiment classification and information retrieval of big data. The model was compared with well-known sentiment classification techniques, and the outputs showed an accuracy of 97.7%, precision of 95.5%, recall of 94.6% and F1-score of 96.7%, respectively.
3 Methods and Approach

In the proposed system, all tweets related to COVID-19 are fetched from Twitter. The tweets are preprocessed and cleaned, a term-document matrix is created, a word cloud is formed, and data mining is performed using K-means and hierarchical clustering. The last and major step is sentiment analysis. The system workflow is depicted in Fig. 1.
Fig. 1 System workflow

A. Retrieving text/data from tweets
To fetch tweets, we need to request a Twitter Developer Account. Once Twitter authorizes it, the Twitter login ID and password are used for signing in as a Twitter Developer (Fig. 2). We then create and store the authenticated credential object, i.e. consumer_key, consumer_secret, access_token and access_secret (Fig. 3), and extract the tweets.
Fig. 2 Create an app on Twitter (using developer account)
Fig. 3 Authenticated credential object
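The paper collects the tweets with R and a Twitter Developer account; purely as an illustration, a hedged Python sketch using the tweepy library (one of several possible clients, not the authors' actual tooling) might look like the following. The credential strings are placeholders, and the exact search call depends on the tweepy version (api.search in 3.x, api.search_tweets in 4.x).

```python
import tweepy

# Placeholder credentials from the Twitter Developer account (assumed names).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch English tweets mentioning the pandemic keywords (tweepy 4.x call name).
tweets = [status.text
          for status in tweepy.Cursor(api.search_tweets,
                                      q="CoronaVirus OR COVID-19",
                                      lang="en").items(1000)]
```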
The data fetched from Twitter has 15,000 rows, that is, 15,000 tweets, and 16 variables, i.e. 16 fields/columns. The data is extracted using the keyword "CoronaVirus" or "COVID-19", and only English-language tweets are considered. The data is collected using a Twitter Developer Account. The fields present in the data are listed in Table 1.

Table 1 List of data fields

S. No   Data fields     Description
1       Text            The text of the tweets
2       Favorited       If the tweet is favourited by someone or not
3       favoriteCount   Total count of the favourites on a tweet
4       replyToSN       Screen name of the user this is in reply to
5       Created         When the tweet was created
6       Truncated       Whether the tweet was truncated
7       replyToSID      If a tweet was replied to another tweet
8       Id              ID of the tweet
9       replyToUID      ID of the user this was in reply to
10      Status source   Source user for the tweet
11      Screenname      Screen name of the user
12      retweetCount    Number of retweets in a tweet
13      isRetweet       Whether it is a retweet of some other tweet
14      Retweeted       If the tweet is retweeted or not
15      Longitude       Geocode longitude of the user
16      Latitude        Geocode latitude of the user

A. Data Preprocessing
Tweets fetched from Twitter are unstructured, and hence they cannot be used directly for analysis. It is necessary to clean and preprocess the data before the analysis phase. The steps in data preprocessing are the following:

1. Reading the data: after transforming the tweets into a data/excel table, the data is read from the table and special characters such as symbols and emoticons are removed.
2. Building the corpus: a corpus is a collection of texts.
3. Cleaning and transforming the data, which involves several sub-steps:
   – remove punctuation,
   – remove numbers,
   – remove common words,
   – remove URLs.

B. Document-term Matrix

• The term-document matrix is a two-dimensional matrix whose rows are the terms and whose columns are the documents, so each entry (i, j) represents the frequency of term i in document j. It has various attributes like sparsity, maximal term length and term frequency. The function TermDocumentMatrix is used in RStudio. In the matrix, each entry represents the term frequency, i.e. the number of times term i appears in document j, while the inverse document frequency reflects the number of documents in the corpus which contain term i (Fig. 4).
Fig. 4 Display of term-document matrix
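The paper builds the term-document matrix with R's TermDocumentMatrix; a roughly equivalent sketch in Python using scikit-learn (assumed here purely for illustration, with toy documents) is:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["masks and deaths rise", "world fears the virus", "people wear masks"]
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)      # documents x terms (sparse matrix)
tdm = dtm.T                               # terms x documents, as in the paper
terms = vectorizer.get_feature_names_out()
freq = dtm.sum(axis=0).A1                 # overall frequency of each term
```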
• On the basis of this matrix, we can further perform classification, clustering and association analysis.

C. Frequent Terms/Associations

• A dictionary was created to remove and clean unnecessary words.
• Since the tweets were pulled using the keyword Coronavirus, words like Coronavirus and Covid19 and other random words were removed to find meaningful associated words. This is done to analyse the frequent associations (Fig. 5). As can be seen in the bar plot, people's tweets centred on the pandemic. Words like risk, patient, death, emergency and infections give a glimpse of what is going on related to Coronavirus. There are some non-related words like detainee, because other trending tweets were in place while the data was retrieved.
• As per the findings, the top 5 frequent words associated with Coronavirus are the following:
  1. Virus,
  2. Coronavirus,
  3. Masks,
  4. World,
  5. People.

Fig. 5 Depicting frequent terms

D. Word Cloud

A word cloud is a picture of a cloud showing the words with the maximum frequency, where the words come from textual data (such as a speech, blog post or database); the bigger and bolder a word appears in the word cloud, the higher its frequency. In RStudio, we can build a word cloud by specifying the maximum number of words, the minimum frequency for the terms, the rotation percentage and the word colours. The word cloud shows that words like funeral, fear and deaths are common in people's tweets, which reflects their attitude (Fig. 6). However, there are some unnecessary words like directors and jamesokeefeiii that are related to other tweets.
Fig. 6 Displaying word cloud
E. Data Mining

There are two types of machine learning methods. The first are supervised learning methods, such as Naïve Bayes, Decision Trees, the Dynamic Programming-based Ensemble Design algorithm (DPED) [11], the K-Nearest Neighbour algorithm and many more. Unsupervised learning methods include clustering, association mining, anomaly detection, etc. Vector quantization has many approaches, among which the most commonly used is K-means clustering. The K-means clustering technique defines k (the target number of clusters) and, after choosing arbitrary centroids, proceeds by iteratively refining the clusters in the data. Clustering works on the principle of dividing the populated data into small subsets known as clusters (or data clusters) based on their common characteristics. Hierarchical clustering is often associated with heatmaps: the columns represent different samples and the rows represent measurements from different genes, where red represents a high expression of a gene and blue/purple a lower expression. Hierarchical clustering orders the rows and the columns based on similarity, which makes it easy to see correlations in the data. Heatmaps often come with dendrograms.

F. Sentiment Analysis

More businesses are coming online every day, which leads to a large collection of data. Companies like Apple are increasing their revenues as they use the data for taking decisions, which is referred to as data-driven decisions. Sentiment analysis is a methodology for extracting opinions from texts; many methods exist, such as the rule-based model (RBM) [12]. It is the discovery of people's opinions, emotions and feelings about a product or service, and is also referred to as opinion mining.
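The clustering workflow described above is carried out in R by the authors; a hedged Python counterpart (scikit-learn, SciPy and matplotlib assumed, function names illustrative) could look as follows, clustering the term vectors of the term-document matrix into six K-means clusters and plotting a dendrogram for hierarchical clustering.

```python
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

def cluster_terms(X, terms, k=6):
    """Cluster term vectors: X is a dense terms-x-documents matrix."""
    km = KMeans(n_clusters=k, random_state=42).fit(X)     # K-means, k clusters
    for c in range(k):
        members = [t for t, lbl in zip(terms, km.labels_) if lbl == c]
        print(f"Cluster {c + 1}: {', '.join(members[:10])}")
    # Hierarchical clustering of the same terms, visualised as a dendrogram.
    dendrogram(linkage(X, method="ward"), labels=list(terms))
    plt.show()
```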
4 Results

Implementing K-means clustering, we need to specify the number of clusters "k"; then each term is assigned to a cluster. We transpose the matrix to cluster documents, set a fixed random seed value and print the tweets of every cluster. Six clusters are created using K-means, and the output is shown in Fig. 7.

Fig. 7 Output of K-means clustering

After applying hierarchical clustering, the output is a diagram known as a dendrogram. By looking at the dendrogram, we can tell the similarity between the terms and which cluster was formed first; the order of cluster formation can be read from the lengths of the dendrogram arms, and the more similar terms are grouped together. The first two clusters are coronavirus and covid. We can see that people, death and july have more similarity, so they fall under one cluster (Fig. 8). The dendrogram shows that the words have been divided into clusters, a few of which are as follows:

Cluster 1: Corona virus
Cluster 2: Covid
Cluster 3: pandemic
Cluster 4: new (new cases)
Cluster 5: people, deaths, July
Cluster 6: hydroxychloroquine

Fig. 8 Output of hierarchical clustering

Opinion mining is also referred to as sentiment analysis and is used in various fields like marketing, advertising and many more. Extracting such useful information from a text is known as sentiment analysis. On the Twitter platform, people express their emotions and views on any topic of their concern through tweets, which are basically textual data with some hashtags, so sentiment analysis can be applied to see the thoughts of the people on various issues or products. The 'syuzhet' package in R helps to capture people's emotions in text. The sentiment score plot shows that tweets are more related to positivity, closely followed by negativity (Fig. 9). The bar chart shows that around 80% of people have expressed trust, and about an equal number of people have demonstrated sadness and fear. The graph shows mixed emotions among people with a higher degree of pessimism.

Fig. 9 Sentiment analysis in R
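The emotion scoring above is done with R's 'syuzhet' package; a hedged Python alternative (not used by the authors) is NLTK's VADER analyzer, whose compound score can be thresholded into positive, negative and neutral labels:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")            # one-time lexicon download
sia = SentimentIntensityAnalyzer()

def label(tweet):
    score = sia.polarity_scores(tweet)["compound"]
    if score > 0.05:
        return "positive"
    if score < -0.05:
        return "negative"
    return "neutral"

print(label("Masks and lockdowns again, this is scary"))   # likely "negative"
```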
5 Conclusions

Coronavirus has taken a toll of deaths throughout the globe, and people all around the world are scared by this pandemic. In order to focus on the mental health of citizens, we performed sentiment analysis on one of the most famous social
platforms, that is, Twitter. All the fetched tweets are related to coronavirus or COVID-19, through which people express their opinions and sentiments. We were able to find out people's attitude towards this pandemic and the major topics highlighted during this period. The tweets were fetched in real time from 30/01/2020 to 21/07/2020; 15,000 tweets were fetched and analysed. Preprocessing and cleaning of the tweets were performed, including the removal of unwanted emoticons, words and punctuation. A term-document matrix was then built and the frequency of each word was calculated. The top 5 frequent words associated with coronavirus were Virus, Coronavirus, Masks, World and People. A word cloud was then built, which shows that words like funeral, fear and deaths are common in people's tweets, reflecting their attitude. Using K-means clustering, we specify the number of clusters "k" and each term is assigned to one of the clusters; six clusters were created. Applying hierarchical clustering gives a dendrogram which depicts how the words have been divided into clusters, a few of which are Cluster 1: Corona virus, Cluster 2: Covid, Cluster 3: pandemic, Cluster 4: new (new cases), Cluster 5: people, deaths, july, Cluster 6: hydroxychloroquine. Applying sentiment analysis, we see that positive opinions are slightly more frequent, closely followed by negative opinions. The semantic scores show mixed emotions among people with a higher degree of pessimism: about 80% of people have expressed trust, followed by fear and sadness.
References 1. Ardabili, S., Mosavi, A., Ghamisi, P., Ferdinand, F., Varkonyi-Koczy, A., Reuter, U., Rabczuk, T., & Atkinson, P. (2020). COVID-19 outbreak prediction with machine learning. doi:https:// doi.org/10.20944/preprints202004.0311.v1. 2. Abel, F., Hauff, C., Houben, G.-J., Stronkman, R., & Tao, K. (2012). Twitcident: Fighting fire with information from Social Web streams. In WWW’12 - Proceedings of the 21st Annual Conference on World Wide Web Companion. doi:https://doi.org/10.1145/2187980.2188035. 3. Dehkharghani, R., et al. (2014). Sentimental causal rule discovery from Twitter. Expert Systems with Applications, 41(10), 4950–4958. 4. Jindal, M., & Kharb, N. (2013). K-means clustering technique on search engine dataset using data mining tool. International Journal of Information and Computation Technology, 3(6), 505–510. 5. Endarnoto, S.K., et al. (2011). Traffic condition information extraction & visualization from social media twitter for android mobile application. In Proceedings of the 2011 International Conference on Electrical Engineering and Informatics. IEEE. 6. Rufai, S.R., & Bunce, C. World leaders’ usage of Twitter in response to the COVID-19 pandemic: A content analysis. Journal of Public Health, fdaa049. doi:https://doi.org/10.1093/ pubmed/fdaa049 7. Budhwani, H., & Sun, R. (2020). Creating COVID-19 stigma by referencing the novel coronavirus as the “Chinese virus” on Twitter: Quantitative analysis of social media data. Journal of Medical Internet Research, 22(5), e19301. 8. Liang, H., Fung, I. C., Tse, Z. T. H., et al. (2019). How did Ebola information spread on Twitter: Broadcasting or viral spreading? BMC Public Health, 19, 438. https://doi.org/10.1186/s12889019-6747-8
9. Chugh, A., Sharma, V.K., Bhatia, M.K., & Jain, C. (2021). A big data query optimization framework for telecom customer churn analysis. In 4th International Conference on Innovative Computing and Communication, Advances in Intelligent Systems and Computing. Singapore: Springer. 10. Chugh, A., Sharma, V.K., Kumar, S., Nayyar, A., Qureshi, B., Bhatia, M.K., & Jain, C. (2021). Spider monkey crow optimization algorithm with deep learning for sentiment classification and information retrieval. IEEE Access. 9, 24249–24262. doi:https://doi.org/10.1109/ACCESS. 2021.3055507. 11. Alzubi, O.A., Alzubi, J.A., Alweshah, M., Qiqieh, I., Al-Shami, S., & Ramachandran, M. (2020). An optimal pruning algorithm of classifier ensembles: Dynamic programming approach. Neural Computing & Applications. 12. Dwivedi, R.K., et al. (2019). Sentiment analysis and feature extraction using rule-based model (RBM). In: International Conference on Innovative Computing and Communications. Singapore: Springer.
Improved ECC-Based Image Encryption with 3D Arnold Cat Map

Priyansi Parida and Chittaranjan Pradhan
Abstract Maintaining the secrecy and safety of images while sharing digital data online is a huge challenge. Numerous encryption schemes use the Elliptic Curve Cryptography(ECC) to encrypt and decrypt the images as ECC provides higher security with shorter key sizes. The researchers also suggest the use of the popular chaotic maps for added strength of the encryption process. In this paper, a novel encryption scheme for digital images based on ECC and 3D Arnold cat map is proposed. The 3D Arnold cat map scrambles the position of pixels in the image and then transforms the values of pixels. The transformed pixel values are encrypted and decrypted using the Elliptic Curve Analogue ElGamal Encryption Scheme (ECAEES). The proposed model is implemented using Python. We get an average entropy value of 7.9992, NPCR of 99.6%, UACI of 33.3% and PSNR of 27.89. The correlation coefficient values between adjacent pixels of cipher images are minimized. The improved performance proves that the model put forward is more secure and resilient than the existing noteworthy schemes. Keywords Image encryption · Elliptic Curve Cryptography (ECC) · Chaotic map · 3D Arnold Cat map · Image security
1 Introduction

With the advent of digitized communications, ensuring security of multimedia on the internet is one of the top concerns that needs to be addressed for achieving a reliable transmission. Multimedia security is a form of data protection which consists of preserving the integrity of varying media files, such as audio, images, video, text and more, in the insecure network. Large numbers of images are exchanged between users over the internet every day. Numerous algorithms have been proposed for the encryption and decryption of images to maintain their confidentiality in real time.
Elliptic Curve Cryptography (ECC)-based encryption schemes are a good choice for public-key encryption schemes, as it benefits from the inability of solving the computationally hard Elliptic Curve Discrete Logarithm Problem (ECDLP) in feasible amount of time. The algorithms often use chaotic maps to introduce chaotic behaviour into the system. The maps are used to scramble and unscramble the pixels of the image. Arnold’s cat map, Baker’s map, Logistic map, Lorenz system are some of the popular choices in chaotic maps. In this paper, in Sect. 2, we discuss the existing image encryption schemes. In Sect. 3, we delve into the theoretical view of our scheme. In Sect. 4, we depict the design model for our proposed encryption scheme based on ECC and Arnold’s Cat map. In Sect. 5, we review the experimental results of our scheme. In Sect. 6, we analyse the resistance of the proposed model against various security attacks. In Sect. 7, we discuss the performance of the proposed scheme compared to the existing schemes. In Sect. 8, the conclusions follow.
2 Related Work

Many remarkable works in the field of image security have used the properties of ECC and chaotic systems to devise efficient image encryption models. Zhang and Wang's [1] block-based image encryption scheme used two piecewise linear chaotic maps (PWLCM), one for scrambling of pixels and another for generating a random data sequence to be XORed with the scrambled pixels to create the cipher image. Chen et al. [2] presented a symmetric encryption scheme with the 3D Arnold cat map for data shuffling and diffusion through 'XOR plus mod' after every two rounds of Arnold's map. Luo et al.'s [3] EC-ElGamal image encryption scheme used the secure hash function SHA-512 and DNA encoding to acquire the cipher images; the scheme is secure against attacks but consumes more time in comparison to existing schemes. In Singh and Singh's [4] image encryption scheme, blocks of pixels are encrypted using the elliptic curve with a naive method of adding the values 1 or 2 to pixels before encryption; the scheme increases the number of pixels in the cipher image during encryption. In [5], they improved their encryption by substituting the simple Koblitz encoding with the Elliptic Curve Analogue ElGamal Encryption Scheme (ECAEES) as well as using the 2D Arnold cat map to shuffle the values of pixels. Abdelfatah's [8] chaotic-enhanced ECC-based scheme proposed a hybrid chaotic map, the Sine–Tent–Henon map, with a two-stage encryption process for stronger encryption of images. Niu et al.'s [9] encryption scheme uses DNA coding, the Henon map and genetic operations for pixel scrambling and diffusion, and iterative logistic maps for generating pseudo-random sequences to be XORed with the computed values.
Fig. 1 An Elliptic Curve E over a finite field F p
3 Preliminaries

3.1 Elliptic Curve Cryptography

Elliptic Curve Cryptography (ECC), proposed by Koblitz [10], is one of the popular public-key algorithms. The elliptic curve, denoted as $E(F_p)$ or simply E, is a non-singular algebraic plane curve over a finite field $F_p$ which consists of the set of points that satisfy the well-known Weierstrass equation of the form
(1)
where x, y, a, b ∈ F p and 4a 3 + 27b2 mod p = 0 with a point of infinity, O (Fig. 1).
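To make Eq. (1) concrete, a toy enumeration of the affine points of a small curve is sketched below; the parameters a = 2, b = 3, p = 97 are illustrative values only, not the 512-bit Brainpool curve used in the paper.

```python
def curve_points(a, b, p):
    """Enumerate the affine points of y^2 = x^3 + ax + b over F_p."""
    assert (4 * a ** 3 + 27 * b ** 2) % p != 0, "curve must be non-singular"
    squares = {}
    for y in range(p):
        squares.setdefault((y * y) % p, []).append(y)   # y values for each square
    points = []
    for x in range(p):
        rhs = (x ** 3 + a * x + b) % p
        for y in squares.get(rhs, []):
            points.append((x, y))
    return points

print(curve_points(a=2, b=3, p=97)[:5])   # a few points of the toy curve
```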
3.2 Arnold's Cat Map

Arnold's cat map is a classical chaotic map that performs a repetitive shear mapping on the original image and returns the initial image after a certain number of transformations P, i.e. the periodicity of the map. The 2D Arnold cat map [5] can be expressed as follows:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1 & a \\ b & ab+1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \bmod N \quad (2)$$
where a and b are positive integers called the control parameters and N is the dimension of an N × N square image. The 2D Arnold cat map [5] can be extended to the 3D Arnold cat map by introducing two new control parameters, c and d [11]:

$$\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \begin{pmatrix} 1 & a & 0 \\ b & ab+1 & 0 \\ c & d & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \bmod N \quad (3)$$
The 3D Arnold’s map [11] can be used for pixel position scrambling and pixel value transform as follows:
$$\begin{cases} \begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 1 & a \\ b & ab+1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} \bmod N \\[4pt] p' = (cx + dy + p) \bmod M \end{cases} \quad (4)$$

where p and p' are the initial and transformed pixel values and M is the maximum pixel colour intensity value. For a greyscale image, M = 256.
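A minimal NumPy sketch of one round of Eq. (4), combining position scrambling and the value transform, is given below. It assumes a square greyscale image and applies the value transform using the original pixel coordinates; this is an illustration of the technique, not the authors' implementation.

```python
import numpy as np

def arnold_round(img, a=1, b=1, c=1, d=1, M=256):
    """One round of 3D Arnold position scrambling plus value transform (Eq. 4)."""
    N = img.shape[0]                      # assumes a square N x N greyscale image
    x, y = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    x_new = (x + a * y) % N               # [x'; y'] = [[1, a], [b, ab+1]] [x; y] mod N
    y_new = (b * x + (a * b + 1) * y) % N
    out = np.empty_like(img)
    out[x_new, y_new] = (c * x + d * y + img[x, y]) % M   # p' = (cx + dy + p) mod M
    return out
```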
4 Proposed Design Model The proposed algorithm suggests an improvement to the existing Singh and Singh’s [5] ECC Analogue ElGamal (ECAEES) scheme. The proposed scheme uses the improved 3D Arnold’s cat map discussed in Eq. 4. The proposed scheme uses ECAEES scheme [5] to encode and decode data. An image is made up of pixels. Encryption on blocks of pixels is faster than performing encryption on each and every pixel value. The maximum number of pixels that can be grouped together depends upon the prime elliptic curve parameter(p). For a 512-bit elliptic curve, the maximum number of pixels in a group can be 64. The proposed design of the image encryption scheme, depicted in Fig. 2, is explained as follows.
4.1 Image Encryption

1. Record the image dimensions and related information from the original image. Choose a random integer k in the range 2 to (n − 1), where n is the cyclic order of the curve.
2. Scramble the input image pixels for j rounds of 3D Arnold scrambling, taking the control parameter values a = 1 and b = 1, where

   $$j = kG_x \bmod P \quad (5)$$
Fig. 2 Proposed Design Model of the Algorithm
where $kG_x$ is the x-coordinate of kG and P is the period of the 3D Arnold scrambling for the given image size.

3. Transform the values of the scrambled pixels of the image using the 3D Arnold value transform from Eq. (4) as
   $$p' = r \cdot p \bmod M \quad (6)$$
   where r is a positive integer such that r and M are co-prime, i.e. gcd(r, M) = 1.
4. Partition the pixels into groups and convert each group of 64 pixels from byte values into a single large integer value with base conversion using 256 as the base.
5. Take every two large integer values in succession as the plain text input $P_M$ for the ECAEES encryption [5] process:
   $$C = \{kG,\; P_C\} \quad \text{where } P_C = P_M + k P_b \quad (7)$$
6. Convert the cipher text $P_C$ back to byte values with base-256 conversion and ensure each list contains 64 values, with the necessary padding of zeroes on the left.
7. Convert the cipher values into the cipher image. Send kG and the cipher of the plain image to the receiver.
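Steps 4–6 pack 64 pixel bytes into one large integer and later unpack them; a small sketch of that base-256 conversion using Python's built-in int.from_bytes/to_bytes (the helper names are illustrative) is:

```python
def pixels_to_int(block):
    """Pack a group of up to 64 pixel values (0..255) into one big integer."""
    return int.from_bytes(bytes(block), byteorder="big")

def int_to_pixels(value, size=64):
    """Unpack back to a fixed-size list, left-padded with zeroes as in step 6."""
    return list(value.to_bytes(size, byteorder="big"))

block = [17, 250, 3] + [0] * 61
assert int_to_pixels(pixels_to_int(block), 64) == block   # lossless round trip
```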
4.2 Image Decryption

1. Record the dimension and channel information of the cipher image.
2. Partition the pixels into groups of 64 each, and convert each group of pixels from byte values into a single large integer value with base-256 conversion.
3. Pair up the integer values as $P_C$ to compute the plain text $P_M$ through the ECAEES decryption [5] process:
   $$P_M = P_C - n_b \, kG \quad (8)$$
   where $n_b$ is the private key of the receiver.
4. Convert the plain text $P_M$ back to byte values with base conversion and ensure each list has 64 values, with the necessary left zero-padding.
5. Transform the pixel values using the 3D Arnold pixel value transform $p = s \cdot p' \bmod M$, where s is the modular multiplicative inverse of r, i.e. $r \cdot s \equiv 1 \pmod M$. Unscramble the transformed pixel values for P − j rounds to obtain the original image, with r computed from kG.
6. Convert the values into the plain image.
5 Experimental Results

The proposed model is implemented in Python 3.7 on an HP laptop with an Intel(R) Core(TM) i5-8250U CPU and 8 GB RAM, using Spyder 4.1.5. The elliptic curve used in the proposed algorithm is the secure 512-bit curve from the ECC Brainpool [12]; the elliptic curve parameters are listed in Table 1. The plain, scrambled and cipher images obtained from the proposed scheme are depicted in Figs. 3, 4 and 5.
6 Security Analysis In order to explore the strength and security of the scheme, various statistical as well as security analyses are carried out.
Table 1 Elliptic curve parameters used in the implementation of the proposed scheme. The decimal values of the parameters p, a, b, Gx and Gy are listed below, in that order.
894896220765023255165660281515915342216260964409835451134459718720 005701041355243991793430419195694276544653038642734593796389430992 3928536070534607816947 62948605579730632276664213064763793240747157706227462271369 104454 50301914281276098027990968407983962691151853678563877834221834027 439718238065725844264138 32457890083289670592748495843420779165319090096375019183283236687 36179176583263496463525128488282611559800773506973771797764811498 834995234341530862286627 67920591404245751744356404312691950878431533901025218814680230127 32047482579853077545647446272866794936371522410774532686582484617 946013928874296844351522 65922445552401128733247483814296103413127129403262663313274450666 87010545415256461097707483288650216992613090185042957716318301180 1592 34788504307628509330
Fig. 3 Plain input images: (a) Barbara, (b) Lena, (c) Mandrill, (d) Peppers
Fig. 4 3D scrambled images: (a) Barbara, (b) Lena, (c) Mandrill, (d) Peppers
Fig. 5 Cipher images: (a) Barbara, (b) Lena, (c) Mandrill, (d) Peppers

Fig. 6 Histogram plots of plain images: (a) Barbara, (b) Lena, (c) Mandrill, (d) Peppers

Fig. 7 Histogram plots of cipher images: (a) Barbara, (b) Lena, (c) Mandrill, (d) Peppers
6.1 Histogram Analysis

The histogram plot of an image depicts the frequency of occurrence of each pixel intensity value. An efficient encryption scheme ensures a uniform distribution of pixel intensities in the cipher image. The histogram plots for the plain and cipher images are shown in Figs. 6 and 7, from which it is evident that the histogram plots of the respective cipher images are evenly distributed.
6.2 Entropy Analysis Entropy is the measure of uncertainty in the data. The higher the randomness, the higher is the value of entropy. A good quality cipher image should have entropy
closer to the ideal value of 8. The computed entropy values for the cipher images are close to this theoretical value, as given in Table 3.
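The Shannon entropy of an 8-bit greyscale image, whose ideal value is 8, can be computed with a short NumPy sketch such as the following (the function name is illustrative):

```python
import numpy as np

def entropy(img):
    """Shannon entropy of an 8-bit greyscale image (ideal value is 8)."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```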
6.3 Resistance to Differential Attacks

The Number of Changing Pixel Rate (NPCR) and Unified Average Changed Intensity (UACI) values are used to measure the resistance of the scheme against differential attacks. The ideal values of the NPCR and UACI parameters are 100% and 33.33%, respectively. From Table 3, we can ascertain that the scheme has near-ideal values for both parameters.
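The paper does not print its NPCR/UACI formulas; the commonly used definitions (assumed here) can be sketched as follows for two cipher images of the same size.

```python
import numpy as np

def npcr_uaci(c1, c2):
    """NPCR and UACI (in %) between two cipher images of equal shape."""
    c1 = c1.astype(float)
    c2 = c2.astype(float)
    npcr = 100.0 * np.mean(c1 != c2)                 # fraction of differing pixels
    uaci = 100.0 * np.mean(np.abs(c1 - c2) / 255.0)  # mean normalised intensity change
    return npcr, uaci
```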
6.4 Key Space

The security of an algorithm depends upon the key size of the cryptosystem. An ECC-based algorithm relies upon the hardness of the exponentially difficult Elliptic Curve Discrete Logarithm Problem (ECDLP) to provide higher security with smaller key sizes. We have used a secure standard 512-bit elliptic curve from the ECC Brainpool [12] in our implementation, which successfully resists brute-force attacks.
6.5 Similarity Measurement

The Peak Signal-to-Noise Ratio (PSNR) is inversely related to the Mean Squared Error (MSE), which is the aggregate squared error between the cipher and plain images' pixel data. The Structural Similarity (SSIM) value measures the similarity between the plain and cipher images. The lower the PSNR and SSIM values between plain and cipher data, the better the encryption method. The measured PSNR and SSIM values for the proposed scheme are given in Table 3.
6.6 Known Plain Text Attack Assume the intruder has access to one or more plain image-cipher image pairs from the scheme. The scheme generates different and unique cipher images for the same plain image in different sessions due to a random parameter, k which is uniquely generated in each session of encryption.
Fig. 8 Correlation graphs of plain and cipher greyscale LENA images from the proposed scheme. Here, a and b are horizontal correlation graphs, c and d are vertical correlation graphs and e and f are diagonal correlation graphs
6.7 Correlation Coefficient Analysis The neighbouring pixels of a plain image have higher correlation values which lead to a denser correlation graph. A desirable cipher image has an even distribution and low correlation values between the adjacent pixels. The various correlation graphs of plain and cipher images for the proposed scheme are shown in Fig. 8. Table 2 gives a comparison among various correlation coefficient values for the proposed and existing schemes. The cipher image from the proposed model has
Table 2 Comparison of correlation coefficient values for greyscale images

Scheme     Image     Horizontal            Vertical              Diagonal
                     Plain     Cipher      Plain     Cipher      Plain     Cipher
Ours       Barbara   0.85973   −0.00163    0.95908   0.00223     0.84181   −0.00078
Ours       Lena      0.97189   0.00068     0.98498   0.00149     0.95928   −0.00025
Ours       Mandrill  0.86652   0.00031     0.75864   0.00191     0.72613   0.00115
Ours       Peppers   0.97918   −0.0004     0.98264   0.00121     0.96797   0.001329
Ref. [3]   Barbara   0.9689    0.0024      0.8956    0.0031      0.8536    −0.0013
Ref. [3]   Lena      0.9858    0.0019      0.9801    −0.0024     0.9669    −0.0011
Ref. [3]   Peppers   0.9807    −0.0028     0.9752    0.0039      0.9636    −0.00024
Ref. [5]   Barbara   0.85973   −0.00164    0.95908   0.00797     0.84181   −0.00134
Ref. [5]   Lena      0.97189   −0.00301    0.98498   0.01098     0.95928   −0.03204
Ref. [5]   Mandrill  0.86652   −0.00033    0.75864   0.01002     0.72613   −0.06414
Ref. [5]   Peppers   0.97918   0.00027     0.98264   0.00303     0.96797   −0.00148
Ref. [13]  Lena      0.9771    0.0925      0.9631    0.0430      0.9469    −0.0054
Ref. [14]  Lena      0.9503    −0.0226     0.9775    0.0041      0.9275    0.0368
Ref. [15]  Peppers   0.9295    0.0048      0.9294    0.0062      0.8771    0.0030

Table 3 Performance of the proposed scheme for various greyscale images

Measure    Barbara   Lena      Mandrill  Peppers
NPCR (%)   99.62     99.60     99.59     99.60
UACI (%)   33.27     33.31     33.31     33.28
Entropy    7.9993    7.9992    7.9992    7.9992
PSNR       27.8964   27.892    27.896    27.8959
SSIM       0.00938   0.0099    0.0098    0.00932
lower correlation coefficient values, with more uniform correlation graphs, than the existing schemes.
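The adjacent-pixel correlation coefficients reported in Table 2 are conventionally estimated by sampling random pairs of neighbouring pixels; a hedged sketch of that computation (sample size and function name are illustrative) is:

```python
import numpy as np

def adjacent_correlation(img, direction="horizontal", samples=5000, seed=0):
    """Correlation coefficient of randomly sampled adjacent pixel pairs."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    dx, dy = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}[direction]
    xs = rng.integers(0, h - dx, samples)
    ys = rng.integers(0, w - dy, samples)
    a = img[xs, ys].astype(float)
    b = img[xs + dx, ys + dy].astype(float)
    return float(np.corrcoef(a, b)[0, 1])
```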
7 Discussion The proposed model eliminates the problem of the increase in the size of cipher images faced by Singh and Singh’s [4] ECC-based encryption and authentication scheme. The proposed algorithm preserves the original size of images throughout the encryption process. Multiple greyscale images with varying sizes are taken as input for the comparative analysis. The Singh and Singh’s [5] ECAEES scheme uses the 2D Arnold’s cat map. With the added randomness and complexity in pixel values, the enhanced 3D Arnold’s cat map in our scheme generates a better intermediary cipher image than the 2D Arnold’s cat map. The proposed scheme has lower PSNR and
Table 4 Comparative analysis for N × N greyscale Lena images

Measure     Scheme      N = 64    N = 256   N = 512
Entropy     Proposed    7.996     7.9997    7.9992
Entropy     Ref. [5]    7.9514    7.9997    7.9992
PSNR        Proposed    27.840    27.886    27.892
PSNR        Ref. [5]    27.848    27.888    27.902
SSIM        Proposed    0.0038    0.0087    0.0099
SSIM        Ref. [5]    0.0108    0.0105    0.0103
NPCR (%)    Proposed    99.78     99.604    99.601
NPCR (%)    Ref. [5]    99.78     99.59     99.58
UACI (%)    Proposed    33.42     33.309    33.31
UACI (%)    Ref. [5]    33.414    33.28     33.30
SSIM values than Singh and Singh's [5] ECAEES scheme. The model also shows higher entropy, NPCR, and UACI values, which indicate stronger encryption. Comparing the correlation coefficient values of the cipher images produced by the proposed method with those of the existing schemes in Table 2, we infer that the proposed method achieves a larger reduction in the correlation between adjacent pixels. The performance comparisons in Tables 2 and 4 show that the proposed scheme is preferable to the existing schemes in terms of resistance to statistical and security attacks.
8 Conclusion The proposed scheme performs ECC-based image encryption using a 3D Arnold cat map. The model delivers a stronger and more unique cipher image, as is apparent from the uniform distribution of pixels in the histogram and the correlation graphs. The scheme effectively mitigates threats to image security, as reflected in the improved NPCR, UACI, and PSNR values. In future work, digital signatures can be used to authenticate cipher images before decryption.
References 1. Zhang, X., & Wang, X. (2017). Multiple-image encryption algorithm based on mixed image element and chaos. Computers & Electrical Engineering, 62, 401–413. 2. Chen, G., Mao, Y., & Chui, C. K. (2004). A symmetric image encryption scheme based on 3D chaotic cat maps. Chaos, Solitons & Fractals, 21(3), 749–761.
3. Luo, Y., Ouyang, X., Liu, J., & Cao, L. (2019). An image encryption method based on elliptic curve ElGamal encryption and chaotic systems. IEEE Access, 7, 38507–38522. 4. Singh, L. D., & Singh, K. M. (2015). Image encryption using elliptic curve cryptography. Procedia Computer Science, 54, 472–481. 5. Laiphrakpam, D. S., & Khumanthem, M. S. (2017). Medical image encryption based on improved ElGamal encryption technique. Optik, 147, 88–102. 6. Ravanna, C. R., & Keshavamurthy, C. (2019). A novel priority based document image encryption with mixed chaotic systems using machine learning approach. Facta Universitatis, Series: Electronics and Energetics, 32(1), 147–177. 7. Broumandnia, A. (2019). The 3D modular chaotic map to digital color image encryption. Future Generation Computer Systems, 99, 489–499. 8. Abdelfatah, R. I. (2019). Secure image transmission using chaotic-enhanced elliptic curve cryptography. IEEE Access, 8, 3875–3890. 9. Niu, Y., Zhou, Z., & Zhang, X. (2020). An image encryption approach based on chaotic maps and genetic operations. Multimedia Tools and Applications, 79(35), 25613–25633. 10. Koblitz, N. (1987). Elliptic curve cryptosystems. Mathematics of Computation, 48(177), 203–209. 11. Liu, H., Zhu, Z., Jiang, H., & Wang, B. (2008, November). A novel image encryption algorithm based on improved 3D chaotic cat map. In 2008 The 9th International Conference for Young Computer Scientists (pp. 3016–3021). IEEE. 12. Elliptic Curve Cryptography (ECC) Brainpool Standard Curves and Curve Generation. https://tools.ietf.org/html/rfc5639. Cited 8 Nov 2020. 13. Ye, G. D., Huang, X. L., Zhang, L. Y., & Wang, Z. X. (2017). A self-cited pixel summation based image encryption algorithm. Chinese Physics B, 26(1), 010501. 14. Xu, L., Gou, X., Li, Z., & Li, J. (2017). A novel chaotic image encryption algorithm using block scrambling and dynamic index based diffusion. Optics and Lasers in Engineering, 91, 41–52. 15. Zhang, W., Yu, H., & Zhu, Z. L. (2018). An image encryption scheme using self-adaptive selective permutation and inter-intra-block feedback diffusion. Signal Processing, 151, 130–143. 16. Paar, C., & Pelzl, J. (2014). Understanding cryptography: A textbook for students and practitioners. Heidelberg: Springer. 17. Kanso, A., & Ghebleh, M. (2012). A novel image encryption algorithm based on a 3D chaotic map. Communications in Nonlinear Science and Numerical Simulation, 17(7), 2943–2959. 18. Soleymani, A., Nordin, M. J., & Sundararajan, E. (2014). A chaotic cryptosystem for images based on Henon and Arnold cat map. The Scientific World Journal, 2014. 19. Chen, F., Wong, K. W., Liao, X., & Xiang, T. (2012). Period distribution of the generalized discrete Arnold cat map for N = 2^e. IEEE Transactions on Information Theory, 59(5), 3249–3255. 20. Toughi, S., Fathi, M. H., & Sekhavat, Y. A. (2017). An image encryption scheme based on elliptic curve pseudo random and advanced encryption system. Signal Processing, 141, 217–227. 21. Kaur, M., & Kumar, V. (2020). A comprehensive review on image encryption techniques. Archives of Computational Methods in Engineering, 27(1), 15–43. 22. Hariyanto, E., & Rahim, R. (2016). Arnold's cat map algorithm in digital image encryption. International Journal of Science and Research (IJSR), 5(10), 1363–1365.
Virtual Migration in Cloud Computing: A Survey Tajinder Kaur and Anil Kumar
Abstract In this era where everything is available online, users can access shared services and resources such as hardware, networks, and infrastructure through the Internet. Without the cloud, these services are expensive to obtain. To resolve this issue, the concept of virtualization is used to help clients manage all resources, or virtual machines, efficiently. One of the most prominent features of virtualization is the ability of virtual machines to migrate, which allows transferring the data of one virtual machine to another for backup. This capability supports virtual machine maintenance, energy efficiency, power management, and fault tolerance. When transferring data from virtual machines, various techniques are followed based on the required transfer time, downtime, and latency. In this paper, the concept of virtualization, the migration of machine data, and the different migration techniques are discussed. The paper mainly focuses on a comparison of the various approaches and the challenges faced during virtual machine migration. Keywords Cloud computing · Challenges · Live migration · Pre-copy · Post-copy · Resources · Virtualization
1 Introduction Cloud computing (CC) is a technology that provides computation, storage, and other services to users on demand. Day-to-day technological developments such as CC have made resources like storage and memory cheaper and more easily available [1]. This technology uses the concept of virtualization to decrease computational cost [2] and increase resource utilization.
T. Kaur (B) · A. Kumar Guru Nanak Dev University, Amritsar, India A. Kumar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_63
Virtualization is a term introduced in the 1960s [3]. Virtualization technology is a way to run a number of different operating systems on a single physical server and helps to share resources efficiently. It works by deploying a hypervisor above the hardware layer; the physical server is partitioned into different execution environments that are independent of each other and can run different applications without interfering with one another. The technology has pros and cons: it is not preferred alone [4], since it consumes the CPU, memory, and other resources of the physical hardware, and a poor network connection creates problems in the proper utilization of resources. Still, its performance [5] and security features [6] have matured together with cloud computing and provide valuable capabilities [7].

Most companies have shifted to cloud computing and virtualization. Industries hire hardware resources from the cloud and provide these resources to their customers, who benefit because the services are easily accessible and understandable, allowing a large number of customers to do their work quickly. The main issue faced by a cloud data center is the growing rate at which users access it to run their applications. When all requests must be handled by the main cloud data center, problems such as the overloading or failure of individual virtual machines can arise. These issues challenge researchers to protect virtual machines from such failures, and with them, virtual machine migration (VMM) itself becomes a challenge.

Virtual machine migration is derived from the older concept of process migration; since process migration does not work at the cloud level, migration of virtual machines is used for cloud management to obtain greater efficiency. Virtual machine migration means moving a virtual machine (VM) from the server where it is running to another server. The factors considered along with VMM are downtime [8], energy efficiency and management [9], load balancing and management [10], and server consolidation [11]. Although VMM is used for cloud management, it also increases the overhead on the client machine, the server, the network bandwidth, and more; thus, the technique should be used with care, balancing energy, cost, and overhead.

The organization of the paper is as follows. Section 2 discusses the literature survey, which includes the concept of virtualization, live migration, a comparison of live migration techniques, and the tools used for simulating live migration; Sect. 3 focuses on the challenges and scope of VMM; and Sect. 4 concludes the paper with future scope.
2 Literature Survey The virtual machine migration is based on virtualization technology. In this section, we have included the concept of virtualization and its different methods.
Fig. 1 Virtualized environment
2.1 Virtualization Virtualization is the approach in which multiple operating systems share the same hardware and other resources to increase resource utilization and reduce the cost of operation. Each virtual system runs in a separate virtual machine and appears to run on an actual system with its own memory, CPU, and other resources. The virtual machines thus give the illusion of actual machines running on the virtualized layer, as shown in Fig. 1. At the virtualized layer sits the hypervisor, which is of two types: the first type runs directly on top of the hardware, while the second type runs on a host operating system; it is the hypervisor that provides the feasibility of live migration.
2.2 Live Migration Live migration is a technique that helps to manage the virtual machines of data centers by transferring the current state of a running VM from one host to another. Virtual machines running in data centers may also need to be migrated from one rack to another. The migration process requires moving the memory, device state, and storage along with the data of the machine, and it can be performed in a live or non-live way. Non-live migration suspends the virtual machine, migrates it, and then resumes the same state on another host; the user experiences an interruption during this process. If the process occurs without any interruption of services, it is considered live migration: the user does not notice the movement of memory or storage, and the complete virtual machine is transferred to the destination host. The two main approaches to live machine migration are pre-copy and post-copy, as shown in Fig. 2.
Fig. 2 Types of live migration
Pre-copy [12] is the technique followed by virtualization hypervisors such as KVM [13], Xen [14], and VMware [15]. The process works in iterations: in the first iteration, the pages of the source machine are transferred to the destination machine while the virtual machine keeps running on the source. In each subsequent iteration, only the pages modified on the source host are transferred, and this continues until the set of modified pages falls below a threshold at the destination machine. At the end, the virtual machine is restarted on the destination and the source copy is discarded. This approach is beneficial when there is some chance of failure of the destination machine, and it helps to minimize downtime and ease application upgrades. Post-copy [16] is the approach in which every page of the source machine is sent to the destination exactly once, without the duplication that can occur in the pre-copy approach. It is preferred to minimize the total migration time required, but it can sometimes lead to high downtime and degraded virtual machine performance. Here, the virtual machine on the source host is suspended, its processor state is transferred first, the destination host is checked for availability, and then all pages are transferred to the destination host. The main drawback of this approach is that the migration process cannot be aborted midway, as it can be in the pre-copy approach.
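For illustration, a minimal simulation of the pre-copy loop described above is sketched below. It is not taken from any of the surveyed systems; the page set, dirty-page model, and threshold are assumptions made purely to show the control flow.

```python
import random

def precopy_migration(total_pages: int, threshold: int, dirty_rate: float,
                      max_rounds: int = 30, seed: int = 1):
    """Simulate pre-copy: send all pages, then resend dirty pages until few remain."""
    random.seed(seed)
    to_send = set(range(total_pages))      # round 1: every page is transferred
    transferred, rounds = 0, 0
    while len(to_send) > threshold and rounds < max_rounds:
        transferred += len(to_send)
        rounds += 1
        # while pages were being copied, the running VM dirtied some of them
        to_send = {p for p in range(total_pages) if random.random() < dirty_rate}
    # stop-and-copy phase: the VM pauses, remaining dirty pages define the downtime
    downtime_pages = len(to_send)
    transferred += downtime_pages
    return {"rounds": rounds, "pages_transferred": transferred, "downtime_pages": downtime_pages}

print(precopy_migration(total_pages=10_000, threshold=50, dirty_rate=0.02))
```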
2.3 Comparison of Techniques of Live Migration This section compares the surveyed papers on the basis of the migration algorithms they use, as shown in Table 1.
Table 1 Comparison of various techniques of live migration

Zhang et al. [17]
  Description: This paper focused on live virtual machine migration over the wide-area network. The authors proposed a migration system known as LayerMover, which is beneficial for data-center VM migration, and focused on the data de-duplication technique.
  Algorithm for migration: Proposed a three-layer structure known as LayerMover; focused on WAN-level migration.
  Type of migration: Live migration at the data-center level; a real experiment was performed on two DELL systems.
  Parameters computed: Computation cost, migration time, downtime, and transmission benefit.

Jin et al. [18]
  Description: This paper focused on optimizing the live migration of virtual machines. The authors proposed a pre-copy mechanism to reduce migration downtime, which controls the CPU scheduler of the VM monitor.
  Algorithm for migration: Proposed an optimized pre-copy algorithm that limits the dirty rate of VMs.
  Type of migration: Pre-copy migration for improving live migration; the CPU scheduler is used to implement the optimized pre-copy algorithm.
  Parameters computed: Dirty data delivery time, process resume time, dirty memory generation rate, overall live migration time, and downtime of the optimized pre-copy algorithm.

Gao et al. [19]
  Description: This paper focuses on the virtual machine placement problem. The authors proposed a multi-objective ant colony system algorithm for virtual machine placement, focusing on minimizing resource wastage and power consumption.
  Algorithm for migration: Proposed a multi-objective ant colony system (ACS) algorithm.
  Type of migration: Ant colony optimization and evolutionary multi-objective optimization are performed.
  Parameters computed: —
Sun et al. [20]
  Description: This paper focused on increasing the performance of live migration of more than one virtual machine using an improved version of serial and mixed migration. The method works on the basis of the downtime of the virtual machine and focuses on optimizing the live migration process of multiple virtual machines.
  Algorithm for migration: Mixed migration strategy for multiple virtual machine migration.
  Type of migration: Pre-copy and post-copy strategies are used for live migration.
  Parameters computed: Memory migration, efficiency of resources, quality of service, downtime, average waiting time.

Cerroni et al. [21]
  Description: This paper allows the trade-off between cost, downtime, and resource utilization time.
  Algorithm for migration: A geometric programming model is used.
  Type of migration: Pre-copy strategy for live migration.
  Parameters computed: Downtime, migration time.

Desai and Patel [22]
  Description: This paper optimized the pre-copy strategy by adding a dirty-page threshold parameter for pre-copy and then using a characteristic-based compression (CBC) algorithm to further decrease the page rate.
  Algorithm for migration: Optimized pre-copy in combination with characteristic-based compression.
  Type of migration: Optimized pre-copy strategy.
  Parameters computed: Downtime, migration time.

Jing et al. [23]
  Description: This paper uses a bacterial foraging optimization algorithm.
  Algorithm for migration: Bacterial foraging algorithm.
  Type of migration: Energy-aware virtual machine migration.
  Parameters computed: Energy consumption.
2.4 Tools Used for Simulation of Virtual Migration The virtual machine migration process can be simulated in different environments such as CloudSim, CloudAnalyst, and others, as shown in Table 2.
Table 2 Detail of different working environments used for migration

2018, Zhang et al. [17]
  Working environment for migration: Virtualization is performed in the real world with two DELL systems, each with 1 CPU and 1 GB memory.
  Pros: There is a trade-off between Internet bandwidth and the storage space of the virtual machine; to resolve this over WAN, the paper proposed a three-layer structure based on the size of the VM storage data.
  Cons: Only the de-duplication technique has been focused on.

2011, Jin et al. [18]
  Working environment for migration: Pre-copy migration uses a pair of two-socket servers; each socket has 4 Intel Xeon 1.6 GHz CPUs. Both servers have 4 GB DDR RAM and are connected by a 1000 Mbit/s Ethernet network. The OS used was Linux 2.6.18 with Xen 3.1.0.
  Pros: To reduce the rate of dirty memory generation, the authors proposed optimized live migration. It reduces the overall live migration time by reducing the overall memory transfer rate, and it also reduces the time for the final round of pre-copy.
  Cons: The pre-copy migration fails to reduce the memory size if the memory dirty rate reaches the available bandwidth.

2013, Gao et al. [19]
  Working environment for migration: —
  Pros: Many real-world problems take multiple criteria into account, which is why the researchers looked for a multi-objective solution and proposed this algorithm. It simultaneously minimizes resource wastage and power consumption, and it provides a large solution space for data centers.
  Cons: —
2016, Sun et al. [20]
  Working environment for migration: The virtualized systems Xen and KVM are used.
  Pros: In data centers, it provides a strategy based on small migration time in situations of high downtime.
  Cons: Not focused on the transmission failure rate; no real-time experiment is performed, only simulation.

2016, Cerroni et al. [21]
  Working environment for migration: A Linux-based virtualized environment is used; within Linux, a QEMU-KVM environment is used.
  Pros: It follows the pre-copy migration concept, which limits the downtime and migration time.
  Cons: Only simulation is performed; the design and framework are not given; migration is performed only for a limited number of virtual machines.

2015, Desai and Patel [22]
  Working environment for migration: —
  Pros: The technique of adding a compression algorithm after each round of pre-copy will provide multithreading.
  Cons: Only the design and flowchart are given; no actual implementation is provided.

2019, Jing et al. [23]
  Working environment for migration: CloudSim simulator.
  Pros: The optional migration will also benefit the resource utilization factor.
  Cons: No real-world implementation is performed.

2019, Nashaat and Ashry [24]
  Working environment for migration: CloudSim simulator.
  Pros: The technique includes energy efficiency and the number of migrations, and helps in decreasing the degradation of the system.
  Cons: Sometimes the data is transferred twice.

2019, Shukla et al. [25]
  Working environment for migration: CloudSim simulator.
  Pros: The technique used lowers the total migration time and downtime by minimizing the number of pages being transferred.
  Cons: No real-world implementation was performed.

2018, Karthikeyan et al. [26]
  Working environment for migration: CloudSim simulator.
  Pros: Prediction of VM failure is provided more efficiently than with previously available algorithms.
  Cons: No real-world implementation was performed.
3 Challenges and Future Scope There are several challenges related to live virtual machine migration. The main challenges while performing live migration are the optimization of the migrated machine's memory, power consumption, energy efficiency, and more.
3.1 Optimization in Storage Migration Over a LAN, migration of the virtual machine's memory is the major bottleneck [27]. Memory is shared among multiple virtual machines, but it can be used in an optimized way by decreasing the total migration time. The main purpose is to utilize storage and other resources better. This challenge can be resolved by using checkpointing, de-duplication, and approaches such as memory compression. These approaches [28] help to identify dirty pages and result in optimized storage.
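As a simple illustration of the de-duplication idea mentioned above, the sketch below hashes memory pages and transfers only pages whose content has not already been sent. It is an assumption-laden toy model (fixed page size, in-memory byte strings), not an implementation from the surveyed papers.

```python
import hashlib

PAGE_SIZE = 4096  # bytes; assumed page size for this toy model

def dedup_transfer(memory: bytes):
    """Split memory into pages and send each unique page content only once."""
    seen = {}   # content hash -> index of the first page holding that content
    plan = []   # per-page transfer plan: ("send", data) or ("ref", first_index)
    for i in range(0, len(memory), PAGE_SIZE):
        page = memory[i:i + PAGE_SIZE]
        digest = hashlib.sha256(page).hexdigest()
        if digest in seen:
            plan.append(("ref", seen[digest]))   # duplicate: send only a reference
        else:
            seen[digest] = i // PAGE_SIZE
            plan.append(("send", page))          # new content: send the full page
    sent_bytes = sum(len(p[1]) for p in plan if p[0] == "send")
    return plan, sent_bytes

# Example: memory with many identical zero pages shrinks the transfer substantially.
mem = b"\x00" * (PAGE_SIZE * 8) + b"payload".ljust(PAGE_SIZE, b"\x01")
_, sent = dedup_transfer(mem)
print(f"sent {sent} of {len(mem)} bytes")
```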
3.2 Dynamic Migration of Multiple Virtual Machines Most researchers focus on the migration of a single machine, but migrating multiple machines simultaneously is one of the main challenges. The main concern in resolving this issue is to minimize the total migration time needed to transfer everything from the first virtual machine to the last, which can also affect network latency [29]. The major open research issue is to introduce techniques that utilize the available bandwidth with minimal service degradation.
3.3 Power Consumption The management of power is considered a major challenge from an environmental point of view. To decrease power consumption, the virtual machines in data centers can be consolidated onto a minimal number of physical servers. Power is consumed mainly on the source and destination sides due to resource usage and computation, so power consumption is mainly reflected through CPU utilization [30] and the data transfer rate [31].
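To make the CPU-utilization view of power concrete, the sketch below uses a linear utilization-based power model that is common in the energy-aware consolidation literature; the idle and peak wattages, the per-byte migration cost, and the network overhead are illustrative assumptions, not values from the cited studies.

```python
def host_power_watts(cpu_util: float, p_idle: float = 100.0, p_max: float = 250.0) -> float:
    """Linear power model: idle power plus a utilization-proportional share of the dynamic range."""
    return p_idle + (p_max - p_idle) * max(0.0, min(1.0, cpu_util))

def migration_energy_joules(bytes_moved: int, link_bps: float = 1e9,
                            joules_per_byte: float = 5e-8) -> float:
    """Rough migration cost: energy scales with data moved plus an assumed 10 W network overhead."""
    transfer_time_s = bytes_moved * 8 / link_bps
    return bytes_moved * joules_per_byte + transfer_time_s * 10.0

# Consolidating two 30%-utilized hosts onto one 60%-utilized host saves roughly one idle power budget.
before = host_power_watts(0.3) * 2
after = host_power_watts(0.6)   # the second host is switched off after migration
print(before, after)
```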
3.4 Security In live migration, the virtual machines transferred from the source host to the destination host must be secured, because they contain passwords and other sensitive data during the migration process [32]. The virtual machine should be transferred over a secured link that protects it from various attacks [33].
4 Conclusion Live migration is a research topic that helps researchers and cloud administrators to manage virtual machines and their resources. This paper discusses the concept of virtualization, live migration and its types, and the tools and parameters required for analyzing the performance of migration in data centers. Mainly, CPU utilization, downtime, and migration time are used to identify the optimal migration technique, and the advantages and disadvantages of various techniques are discussed. In the future, this survey will be helpful for choosing an appropriate live migration technique that increases CPU utilization and reduces power consumption.
References 1. Dillon, T., Wu, C., & Chang, E (2010). Cloud computing: Issues and challenges. In 2010 24th IEEE International Conference on Advanced Information Networking and Applications (pp. 27–33). IEEE. 2. Alzubi, J.A., Manikandan, R., Alzubi, O.A., Qiqieh, I., Rahim, R., Gupta, D., & Khanna, A. (2020). Hashed Needham Schroeder Industrial IoT based Cost Optimized Deep Secured data transmission in cloud. Measurement, 150, 107077. 3. Gum, P. H. (1983). System/370 extended architecture: Facilities for virtual machines. IBM Journal of Research and Development, 27(6), 530–544. 4. Huber, N., von Quast, M., Hauck, M., & Kounev, S. (2011). Evaluating and modeling virtualization performance overhead for cloud environments. In CLOSER (pp. 563–573). 5. Wang, L., Tao, J., Kunze, M., Castellanos, A.C., Kramer, D., & Karl, W. (2008) Scientific cloud computing: Early definition and experience. In 2008 10th IEEE International Conference on High Performance Computing and Communications (pp. 825–830). IEEE. 6. Mather, T., Kumaraswamy, S., & Latif, S. (2009). Cloud security and privacy: An enterprise perspective on risks and compliance. O’Reilly Media. 7. Gupta, D., Rodrigues, J.J., Sundaram, S., Khanna, A., Korotaev, V., & de Albuquerque, V.H.C. (2018). Usability feature extraction using modified crow search algorithm: A novel approach. Neural Computing and Applications, 1–11. 8. Salfner, F., Tröger, P., & Polze, A. (2011). Downtime analysis of virtual machine live migration. In The Fourth International Conference on Dependability (DEPEND 2011). IARIA (pp. 100– 105). 9. Sekhar, J., Jeba, G., & Durga, S. (2012). A survey on energy efficient server consolidation through vm live migration. International Journal of Advances in Engineering & Technology, 5(1), 515.
10. Achar, R., Santhi Thilagam, P., Soans, N., Vikyath, P.V., Rao, S., & Vijeth, A.M. (2013). Load balancing in cloud based on live migration of virtual machines. In 2013 Annual IEEE India Conference (INDICON) (pp. 1–5). IEEE. 11. Murtazaev, A., & Sangyoon, Oh. (2011). Sercon: Server consolidation algorithm using live migration of virtual machines for green computing. IETE Technical Review, 28(3), 212–231. 12. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., & Warfield, A. (2005). Live migration of virtual machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation-Volume 2 (pp. 273–286). 13. Soriga, S.G., & Barbulescu, M. (2013). A comparison of the performance and scalability of Xen and KVM hypervisors. In 2013 RoEduNet International Conference 12th Edition: Networking in Education and Research (pp. 1–6). IEEE. 14. Lee, M., Krishnakumar, A.S., Krishnan, P., Singh, N., & Yajnik, S. (2010). Supporting soft real-time tasks in the xen hypervisor. In Proceedings of the 6th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (pp. 97–108). 15. Muller, A., & Wilson, S. (2005). Virtualization with VMware ESX server. 16. Hines, M. R., Deshpande, U., & Gopalan, K. (2009). Post-copy live migration of virtual machines. ACM SIGOPS Operating Systems Review, 43(3), 14–26. 17. Zhang, F., Xiaoming, Fu., & Yahyapour, R. (2018). LayerMover: Fast virtual machine migration over WAN with three-layer image structure. Future Generation Computer Systems, 83, 37–49. 18. Jin, H., Gao, W., Song, Wu., Shi, X., Xiaoxin, Wu., & Zhou, F. (2011). Optimizing the live migration of virtual machine by CPU scheduling. Journal of Network and Computer Applications, 34, 1088–1096. 19. Gao, Y., Guan, H., Qi, Z., Hou, Y., & Liu, L. (2013). A multi-objective ant colony system algorithm for virtual machine placement in cloud computing. Journal of Computer and System Sciences, 79, 1230–1242. 20. Sun, G., Liao, D., Anand, V., Zhao, D., & Yu, H. (2016). A new technique for efficient live migration of multiple virtual machines. Future Generation Computer Systems, 55, 74–86. 21. Cerroni, W., & Esposito, F. (2016). Optimizing live migration of multiple virtual machines. IEEE Transactions on Cloud Computing, 6(4), 1096–1109. 22. Desai, M.R., & Patel, H.B. (2015). Efficient virtual machine migration in cloud computing. IEEE 23. Jing, S., Ebadi, A.G., Mavaluru, D., & Rajabion, L. (2019). A method for virtual machine migration in cloud computing using a collective behavior-based metaheuristics algorithm. Wiley. 24. Nashaat, H., & Ashry, N. (2019). Smart elastic scheduling algorithm for virtual machine migration in cloud computing. Springer Nature. 25. Shukla, R., Gupta, R.K., & Kashyap, R. (2019). A multiphase pre-copy strategy for the virtual machine migration in cloud. Springer. 26. Karthikeyan, K., Sunder, R., Shankar, K., Lakshmanaprabu, S.K. (2018). Energy consumption analysis of virtual machine migration in cloud using hybrid swarm optimization (ABC–BA). Springer. 27. Sharma, S., & Chawla, M. (2016). A three phase optimization method for precopy based VM live migration. Springerplus, 5(1), 1022. 28. Wu, T.-Y., Guizani, N., & Huang, J.-S. (2017). Live migration improvements by related dirty memory prediction in cloud computing. Journal of Network and Computer Applications, 90, 83–89. 29. Xu, F., Liu, F., Liu, L., Jin, H., Li, Bo., & Li, B. (2013). iAware: Making live migration of virtual machines interference-aware in the cloud. 
IEEE Transactions on Computers, 63(12), 3012–3025. 30. Satpathy, A., Addya, S.K., Turuk, A.K., Majhi, B., & Sahoo, G. (2018). Crow search based virtual machine placement strategy in cloud data centers with live migration. Computers & Electrical Engineering, 69, 334–350. 31. Canali, C., Lancellotti, R., & Shojafar, M. (2017). A computation-and network-aware energy optimization model for virtual machines allocation. In International Conference on Cloud Computing and Services Science, vol. 2 (pp. 71–81). SCITEPRESS.
32. Garfinkel, T., & Rosenblum, M. (2005). When virtual is harder than real: Security challenges in virtual machine based computing environments. In HotOS. 33. Babu, M.V., Alzubi, J.A., Sekaran, R., Patan, R., Ramachandran, M., & Gupta, D. (2020). An Improved IDAF-FIT clustering based ASLPP-RR routing with secure data aggregation in wireless sensor network. Mobile Networks and Applications, 1–9.
Supervised Hybrid Particle Swarm Optimization with Entropy (PSO-ER) for Feature Selection in Health Care Domain J. A. Esther Rani, E. Kirubakaran, Sujitha Juliet, and B. Smitha Evelin Zoraida
Abstract Attribute selection keeps the more informative and important features, thereby reducing the size of the dataset. The reduction in dimensionality is achieved by removing inappropriate and unimportant features. Supervised feature selection helps the search process discover the prominent attributes for classifying a given medical dataset. Among the many evolutionary computation techniques, particle swarm optimization helps to identify the global optimal solution in various applications. The present research work takes advantage of particle swarm optimization and the entropy function. This paper proposes a supervised PSO with an entropy function for feature selection that deals with large medical datasets. The effectiveness of the proposed hybrid algorithm is evaluated with respect to error minimization and proved to be effective for classification in the medical domain. Keywords Particle Swarm Optimization (PSO) · Entropy · Attribute selection
J. A. E. Rani Department of Computer Science, Bharathidasan University, Tiruchirappalli 620023, India E. Kirubakaran · S. Juliet (B) Department of Computer Science & Engineering, Karunya Institute of Technology and Sciences, Coimbatore 641114, India e-mail: [email protected] B. S. E. Zoraida School of Computer Science and Engineering, Bharathidasan University, Tiruchirappalli 620023, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_64
1 Introduction Feature selection [1] helps the user to identify a relevant and informative subset from a given list of attributes based on some criteria. The attributes that are not selected should be irrelevant, noisy, or redundant, so that after removing them the decision or the result does not change; in other words, the classification error is minimized when compared with the actual dataset. There are many
dimensionality reduction methods, but the advantage of attribute selection is that the actual significance of the attributes is maintained after reduction. The dependency of the data is identified using rough sets; in turn, the less dependent attributes are removed and the highly dependent attributes are retained in a subset [2]. Attribute selection algorithms are further divided into filter, wrapper, and embedded approaches. Particle Swarm Optimization (PSO), used here as a filter-based method, follows a heuristic approach and is effective for attribute selection, helping to find a minimal optimal reduced set each time it is executed. In the proposed supervised attribute selection algorithm, which combines the advantages of PSO and entropy, the selection is made based on the fitness function. The fitness function considered here is entropy, and a subset is selected by evaluating the entropy, finding the pbest value, and comparing it with gbest. Thus, a more flexible and user-friendly approach to predictive reduced-set selection can be developed.
2 Research Background Two different approaches to supervised feature selection have been adopted. The Quick Reduct (QR) algorithm finds a reduct set without trying out all possible subsets. In the Relative Reduct (RR) algorithm, features are removed by backward elimination: a feature is eliminated if the relative dependency still equals one after its removal, with the relative dependency evaluated for each attribute. An Unsupervised Quick Reduct algorithm has been proposed [3]; here, the positive region from rough set theory is used as the evaluation measure for the unsupervised subset, the degree of dependency is calculated for the attributes in the reduced subset that are strongly associated with the conditional attributes, and the mean dependency values are calculated for all conditional attributes. The decision attribute does not play a vital role in either the unsupervised QR or RR algorithm. In the unsupervised relative reduct algorithm, if the relative dependency remains equal to one after removing a feature, that feature is considered insignificant and can be weeded out. Attributes that are similar to one another are also assumed to be insignificant and irrelevant; weeding them out, however, can lead to loss of information. The Tolerance Quick Reduct algorithm addresses this concept [4]: the similarity check is done against a threshold value rather than exact equality to 1, and the threshold calculated in this algorithm guides the identification of features that are effectively identical. The standard PSO algorithm applies an inertia weight to the particle swarm optimizer so that more precise results are obtained [5]. For datasets without a decision class, feature selection methods are followed by a decision, i.e., the clusters or classes obtained through clustering or classification. The unsupervised PSO with Quick Reduct algorithm combines the advantages of both.
3 Proposed Work Entropy as Fitness Function Entropy measures the degree of uncertainty of random variables. High randomness leads to low accuracy and never leads to any conclusion, whereas lower randomness leads to better accuracy. Let us consider an information system I = (U, A), where U is a non-empty universe of countable objects and A is a non-empty countable set of conditional attributes; for every a ∈ A, an equivalence function F_a : U → V_a is given, where V_a is the set of values of a. Consider X as a subset of U (X ⊆ U); PX with an underbar denotes the P-lower approximation and PX with an overbar the P-upper approximation of the set X. The uncertainty of a random variable is measured by its entropy, which is defined as

H(X) = − Σ_{x ∈ X} p(x) log₂ p(x)    (1)
where X denotes the random variable and p(x) = Pr(X = x) is its probability density function. The key factor is the probability distribution of the random variable, not its actual values; the randomness, or entropy, depends on this distribution. In each pass, one feature is eliminated and the entropy is calculated for the remaining variables. If a variable is the least significant one, the attribute set must produce the least entropy, or randomness, without that variable, and in that case the variable can be removed from the set. This process is repeated until the important variables are identified and selected; the first variable will be the most significant and the last one the least significant, and an ordered list of features is obtained by reversing the order in which they are removed. The proposed research work thus uses the concept of randomness to find a subset in a supervised dataset. PSO is formulated on the basis of the movement of a flock of birds flying in search of food. The position of the bird that is closest to the food source is considered the global best (gbest), while each individual bird's best position so far, i.e., the position closest to the food source it has reached, is its personal best (pbest). The PSO algorithm follows this flocking behavior. The working procedure of the proposed PSO-ER algorithm for selecting features is shown in Fig. 1. In this algorithm, we construct a population in which the features of the dataset are treated as particles. The particles are initialized with random velocities and random positions in an N-dimensional problem space. The entropy function, given in Eq. (1), is evaluated for each feature; the entropy is calculated for all features, including the conditional variables and the decision variable. The entropy calculated for the decision variable is taken as the gbest. Then the fitness function is calculated for each individual bird, i.e., each feature in our case.
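A minimal sketch of the entropy computation in Eq. (1) for a discrete feature column is given below; the use of pandas value counts and the example data are illustrative assumptions, not part of the authors' MATLAB implementation.

```python
import numpy as np
import pandas as pd

def shannon_entropy(values: pd.Series) -> float:
    """H(X) = -sum p(x) log2 p(x) over the observed values of a discrete feature."""
    p = values.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

# Example: a binary feature with an even split has the maximum entropy of 1 bit.
print(shannon_entropy(pd.Series([0, 1, 0, 1])))          # 1.0
print(shannon_entropy(pd.Series(["a", "a", "a", "b"])))  # ~0.811
```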
Fig. 1 Modified PSO with entropy function (flowchart: initialize pbest, gbest, and the particle velocities and positions; compute the fitness value using entropy; update each particle's velocity and position; update pbest and gbest; repeat until the termination check is satisfied)
The pbest of each feature is calculated and compared with gbest. The fitness function, entropy, reflects the randomness of the features; hence, entropy must be minimal for a feature to be selected, and it is used as the fitness function. All the features are compared and a reduct set is formed from the features with minimum randomness. The feature with the lower fitness value, i.e., randomness, is considered, and various possible combinations of the chosen attribute with the remaining attributes are formed; the randomness of the selected attribute in these different combinations is calculated. If the current feature's fitness value is better, i.e., its randomness is lower than the pbest, then this becomes the particle's personal best and its position and fitness are stored. Now, the present particle's personal best (pbest) fitness, i.e., randomness, is compared with all the attribute sets for which the previous best fitness values are
calculated. If the present randomness value is lower than gbest, then the current position becomes the new gbest. Here, the position represents a prominent feature, and it is included in the reduced subset. The next step is to update the particle's velocity and position. The process is iterative and continues until the termination condition is met or the iteration limit is reached. The reduced subset is the PSO-ER feature subset: the entropy fitness function is calculated and the best features are chosen.
PSEUDOCODE (Algorithm—PSO with Entropy)
Step 1: Initialize pbest, gbest, and the particle velocities.
Step 2: Initialize the particle positions.
Step 3: Compute the fitness value using entropy; update the particle positions; update gbest and pbest.
Step 4: Repeat until the maximum number of iterations is met or a global optimum is reached.
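The pseudocode above can be turned into a small binary-PSO-style sketch in which each particle is a feature-subset mask and the fitness is an entropy-based score to be minimized. This is an illustrative reconstruction under stated assumptions (binary masks, a sigmoid position update, a generic fitness callable), not the authors' MATLAB code.

```python
import numpy as np

def pso_feature_selection(fitness, n_features, n_particles=20, iters=50,
                          w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize an entropy-based fitness over binary feature masks with a basic PSO loop."""
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, 2, size=(n_particles, n_features))   # particle positions (masks)
    vel = rng.normal(0, 1, size=(n_particles, n_features))     # particle velocities
    pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
    g = pbest[np.argmin(pbest_fit)].copy()                     # gbest mask
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = (rng.random(pos.shape) < 1 / (1 + np.exp(-vel))).astype(int)  # sigmoid -> binary
        fit = np.array([fitness(p) for p in pos])
        better = fit < pbest_fit                                # lower entropy is better
        pbest[better], pbest_fit[better] = pos[better], fit[better]
        g = pbest[np.argmin(pbest_fit)].copy()
    return g, float(pbest_fit.min())

# Usage: fitness(mask) could return the mean entropy of the selected features
# (e.g., via the shannon_entropy helper sketched earlier), penalizing empty masks.
```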
4 Experimental Results In this section, the attributes selected by executing the PSO-ER algorithm are compared with the actual dataset, and the results show that the errors are minimized when classifying the reduced set. The clinical datasets are taken from the UCI Machine Learning Repository [6]. The Root Mean Square Error (RMSE) [3] and the Mean Absolute Error (MAE) are calculated by classifying the actual dataset and the reduced set; the lower the MAE, the higher the prediction accuracy, and the same holds for RMSE, so error minimization leads to better classification accuracy. The PSO-ER method is compared with the PSO-QR [7] and PSO-RR [4] methods. The selected features are classified using the WEKA tool with its built-in tree classification algorithms. The reduced dataset is obtained by the hybridized algorithms PSO-QR, PSO-RR, and the proposed PSO-ER; the algorithms are implemented in MATLAB, the reduced sets are obtained, and the results are discussed.
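For reference, the two error measures can be computed from predicted and true values as in the generic NumPy sketch below; it is not tied to the WEKA output format.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([1.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8, 1.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred))  # lower values indicate better classification
```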
4.1 Comparative Analysis on Reduced Set and the Actual Dataset The classification is done using the WEKA tool. The analysis shows that the root mean-squared error is minimized for three of the tree classification algorithms, the exception being REP Tree
(Fig. 2). The analysis reveals that the mean absolute error is minimized for Random Tree, Random Forest, and Trees M5P. It is maximized in REP Tree only. Table 1 and Fig. 3 depict the comparative analysis of classification approaches for reduct set and actual data set. From Table 1, it is inferred that RMSE is much higher in the actual dataset while applying the classification Tree algorithms when compared with the reduct set. Table 2 shows the comparison of actual dataset with the reduced dataset based on MAE.
Fig. 2 Comparison of actual dataset & reduced set based on RMSE
Table 1 Comparison of actual dataset with the reduced dataset based on RMSE

Type of tree algorithm (RMSE)   PSO-ER   Actual set
Random Tree                     0.1799   0.3162
Random Forest                   0.1917   0.1862
REP Tree                        0.2728   0.2551
Trees M5P                       0.1964   0.2807
Fig. 3 Comparative analysis of the RMSE with PSO-ER, PSO-QR, PSO-RR
Table 2 Comparison of actual dataset with the reduced dataset based on MAE

Type of tree algorithm (MAE)   PSO-ER   Actual set
Random Tree                    0.055    0.1
Random Forest                  0.0696   0.079
REP Tree                       0.1379   0.121
Trees M5P                      0.1064   0.143
4.2 Comparison of PSO-ER, PSO-QR, and PSO-RR Here, the algorithm PSO-ER is analyzed and compared with the other existing hybridized algorithms, namely, PSO-QR and PSO-RR methods. The reduced set and the actual set are classified using the WEKA tool. The data presented in Tables 3 and 4 reveals the accuracy in classification of reduced subset data using Hybridized Particle Swarm optimization with Quick Reduct (PSO-QR), Particle Swarm Optimization with Relative Reduct (PSO-RR), and the proposed Particle Swarm Optimization with Entropy (PSO-ER) (Fig. 4).
Table 3 Comparison of RMSE with PSO-ER, PSO-QR, and PSO-RR

Types of tree-based classification algorithms based on RMSE   Reduced set
                                                              PSO-ER   PSO-QR   PSO-RR
Random Tree                                                   0.1799   0.2078   0.291
Random Forest                                                 0.1917   0.211    0.2633
REP Tree                                                      0.2728   0.335    0.2978
Trees M5P                                                     0.1964   0.2325   0.2954

Table 4 Comparison with mean absolute error

Types of tree-based classification algorithms based on MAE    Reduced set
                                                              PSO-ER   PSO-QR   PSO-RR
Random Tree                                                   0.055    0.076    0.112
Random Forest                                                 0.069    0.090    0.117
REP Tree                                                      0.137    0.194    0.153
Trees M5P                                                     0.106    0.134    0.175

Fig. 4 Comparative analysis of the MAE with PSO-ER, PSO-QR, PSO-RR

5 Conclusion and Future Enhancement In this research work, the PSO-ER hybridized algorithm is analyzed and compared with the PSO-QR and PSO-RR algorithms, and the effectiveness of the results is
discussed in terms of the reduced subset and the original dataset. Experimental results for minimizing the classification error are analyzed and presented. The PSO-ER algorithm considers all the features, searches in all possible ways, calculates the fitness function in all aspects, and produces the best reduced subset. Preprocessing is performed, the prominent features are selected, and the effectiveness is analyzed. The proposed algorithm chooses random particles rather than the entire set and explores in all possible directions, which helps it converge toward the global optimum. The experimental results show that the classification error increases when the PSO-QR and PSO-RR methods are used, so the classification accuracy decreases in those cases; hence, the proposed algorithm is well suited to large clinical databases. The effectiveness of the proposed method is clearly revealed, and the experimental results also show the significance of using PSO-ER for large clinical datasets. As a future enhancement, the proposed algorithm can be applied to gene databases and image databases.
References 1. Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301–312. 2. Pal, S. K., De, R. K., & Basak, J. (2000). Unsupervised feature evaluation: A neuro-fuzzy approach. IEEE Transactions on Neural Networks, 11(2), 366–376. 3. De Castro, P.A., De França, F.O., Ferreira, H.M., & Von Zuben, F.J. (2007). Applying biclustering to perform collaborative filtering. In Seventh IEEE International Conference on Intelligent Systems Design and Applications (ISDA 2007) (pp. 421–426). 4. Velayutham, C., & Thangavel, K. (2011). Unsupervised quick reduct algorithm using rough set theory. Journal of Electronic Science and Technology, 9(3), 193–201. 5. Skowron, A., & Stepaniuk, J. (1996), Tolerance approximation spaces. Fundamental Informaticae, 27(2, 3), 245–253.
6. https://www.Kaggle.com/fabdeljia/autism-screening-for-toddlers. 7. Inbarani, H. H., Azar, A. T., & Jothi, G. (2014). Supervised hybrid feature selection based on PSO and rough sets for medical diagnosis. Computer Methods and Programs in Biomedicine, 113(1), 175–185.
A Multimodal Biometrics Verification System with Wavelet Aderonke F. Thompson
Abstract Using multimodal biometric techniques, this study focuses on overcoming some of the limitations of existing unimodal biometric authentication techniques, such as spoofing and biometrics centralization. A lightweight and efficient linear algorithm with orthogonal features is used to design the multimodal biometric system, which comprises iris, nose, and ear. To ensure model effectiveness, we deployed a wavelet-transform multi-algorithm for image compression while maintaining image quality. We present the promising outcomes of a proposed fusion using the Haar Discrete Wavelet Transform, the Fast Walsh–Hadamard Transform, and Singular Value Decomposition combined through a sum-fusion technique. The simulation was carried out against benchmark algorithms, the Hough transform for the iris and PCA for the nose and ear, with a dataset acquired from volunteers. Keywords Biometrics · Multimodal · Wavelet · Multi-algorithm · Iris · Nose · Ear
1 Introduction The future of human recognition and verification systems in the digital ecosystem rests on intrinsic parts of the human body for resilient security, and this becomes realistic with the use of multimodal biometrics [1]. These traits can be deployed in overt and, most importantly, covert devices to circumvent the attempts of individuals to conceal their true identity [2]. Keys, tokens, badges, and access cards can be misplaced, stolen, replicated, or forgotten. Likewise, passwords, secret codes, and personal identification numbers (PINs) might be forgotten, altered, stolen, or compromised easily, either consciously or unconsciously. However, because biometric security operates through a physical personal characteristic, its susceptibility to the aforementioned shortcomings is negligible. Reference [3] emphasized the merits of a good biometric scanning system as speed, accuracy, dependability, user-friendliness, and cost-effectiveness. A. F. Thompson (B) The Federal University of Technology, Akure, Nigeria e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_65
Most biometric systems used in the real world are unimodal, i.e., they rely on a single biometric characteristic, which, unfortunately, can be defeated by modern (cyber) burglars. Generally, the inherent limitations of unimodal systems include a high error rate owing to deformation, aging, and noise [3, 4]. For instance, noisy data could incorrectly label an individual an impostor, resulting in a high False Rejection Rate (FRR). Besides, the reliability and usability of the biometric data may be insufficient; in such cases, a Failure-to-Enroll (FTE) error results. As an example, some individuals might have difficulty getting their fingerprints captured as a consequence of dirt or profuse sweat in the ridges of the fingerprint [3]. Similarly, an iris recognition system might not obtain the required iris information because of stretched-out eyelashes, floppy eyelids, or a pathological eye condition [4, 5]; this corroborates the treatment of missing data in multimodal biometrics as elucidated in [6]. According to [7], the feature extraction and matching components often do not achieve optimized unimodal system performance for the individual traits. The authors further stated that a distinct set of biometric features can be characterized using a biometric template for verification or identification [8]. A template's capacity is limited by the observed variations within each subject's feature set (intra-class variations) and the variations between different subjects' feature sets (inter-class variations) [9, 10]. Another challenge with unimodal biometric systems is spoof attacks. Behavioral traits such as voice [11] and signature [12, 13] are susceptible to spoof attacks. Likewise, the inscription of a ridge-like structure on synthetic materials such as gelatine and Play-Doh is a spoof attack on the trait, commonly carried out against fingerprints [11, 14]. However, [15] ascertained that targeted spoof attacks can compromise the security efficiency of the biometric system. Figure 1 depicts the human parts (with biometric properties) used for ensuring security against cyber burglars. In recent years, the world's financial community has been concerned with the escalating cases of automated teller machine (ATM) scams and illegal access to bank accounts, and these systems are being fortified with biometrics. Although fraudsters and scammers keep inventing new ways of committing their malicious acts [16], the integration of various biometric traits has reduced the losses accrued with earlier forms of security techniques [17, 18]. Financial technology (Fintech) uses digital technology to enhance the experiences of consumers, financial institutions, regulators, and users. The Fintech industry opines that a significant percentage of these losses could be eliminated by biometric scanning. Although some critics have alleged that biometrics can erode anonymity, multimodal biometrics would provide adequate security for the financial community and e-commerce [19]. Elsewhere, [20] highlighted the issues of biometric centralization and balkanization. The highlighted limitations of a unimodal biometric system can thus be solved through a multimodal verification system approach that ensures resilient security.
Therefore, this study proffers the design of a fusion algorithm for an iris–nose–ear multimodal biometrics (INEMB) verification system, implements the designed fusion algorithm, and appraises its performance vis-à-vis unimodal biometric systems. The contribution of this study overcomes the limitations often posed when capturing an
Fig. 1 Human parts with biometric properties for security against cybercrimes [3]
iris image: cooperation from the volunteers ensures suitable image acquisition, and there is a considerable reduction in resource and computational cost compared with the usually high cost of an iris system, without trading off its high performance, thanks to the new fusion algorithm. The choice of the biometric traits iris, nose, and ear is orthogonal, as it leverages their universal acceptance coupled with their ageless stability and accuracy in human identification; thus, one scanner is capable of capturing all the traits, which avoids the cost of additional scanners and reduces the total computational overhead. Multimodal biometrics for human identification is therefore recommended for use in low- and medium-income economies. The next section of this paper discusses related works on biometric modalities and their approaches. In section three, the architecture of the multimodal biometrics system and its approaches, in conjunction with the proposed fusion scheme model, are presented, while simulations of the proposed fusion scheme model alongside
results validation are detailed. Consequently, the last section highlights the study’s conclusion and direction for future research.
2 Related Works In [21], the authors developed a face–iris multimodal biometric identification system. This iris-based multimodal system was motivated by the iris being one of the most reliable biometric characteristics, with a unique texture that remains unchanged throughout adult human life. The focus of the study was to design an optimal and efficient face–iris multimodal biometric system, in addition to evaluating the performance of each unimodal trait. A multimodal biometric system was then projected by merging the two systems and choosing the best feature vectors, using both score- and decision-level fusion at the same time. A real database was used as a chimeric database to implement the system, and the experimental results show that the best rate of the proposed face–iris multimodal biometric system is obtained with min–max normalization and fusion with the max rule. Wavelet-based multimodal biometrics with score-level fusion using mathematical normalization is carried out in [22]. This work was inspired by the need to study the mathematical normalization technique on fingerprint, palmprint, and iris models for score-level fusion in a multimodal system, with various combinations of wavelet families in the unimodal systems. The proposed multimodal biometric verification system is based on the use of 6 different wavelet families and 35 respective wavelet family members in the feature-extraction stage of the unimodal system. The study centers on improving the accuracy of multimodal systems by varying the value of a mathematical constant (alpha) from 0 to 5000; a value of 1000 gives more accuracy and the lowest percentage of Equal Error Rate (EER). The study in [23] focuses on the development of deep multimodal biometric recognition using contourlet-derivative weighted rank fusion with human face, fingerprint, and iris images. The authors proffered a solution to the performance degradation that occurs in traditional biometric recognition systems, designing a framework capable of increasing the recognition rate while reducing the computational time and complexity localized in a high-dimensional temporal domain, by incorporating discriminative features in a multimodal biometric system with a Deep Contourlet Derivative Weighted Rank (DCD-WR) framework. The contourlet transform effectively overcomes the drawbacks of conventional transforms to obtain smooth contours of images, which aids the identification of more related features for image analysis. Local derivative ternary patterns are then applied to obtain a histogram for the three different traits. The face, fingerprint, and iris biometric samples are extracted from the CASIA Image Database, which includes many iris, face, fingerprint, palm print, multi-spectral palm, and handwriting images for biometric recognition. The system improved the recognition rate by 44%.
The conclusion from [24], on the evaluation of multimodal biometrics at different levels of face and palm print fusion schemes, is that an improved Genuine Accept Rate (GAR) is obtained by varying the False Acceptance Rate (FAR), and the fusion of face and palm print at the score level using the sum rule produced the best result, with a value of 97.5%. This study was spurred by the problems that unimodal systems are prone to, which usually increase the FAR and the False Reject Rate (FRR); in contrast, a good biometric system needs very low values of both FAR and FRR, which can only be achieved by a multimodal system. In addition, multibiometrics effectively addresses the noisy data of unimodal systems: when the biometric signal acquired from a single trait is corrupted by noise, the authentication may switch over to another biometric trait such as the fingerprint. Performance evaluation was carried out by first probing the unimodal face and palm print independently; the fusion of the face and palm print modalities was then considered at different levels, owing to the heterogeneity of the feature matrices from palm print and face. A wavelet-based image decomposition scheme was deployed at the sensor level to fuse palm print and face images; min–max, Z-score, and hyperbolic tangent (tanh) normalization techniques were used at the feature level; the sum, minimum, and maximum rules at the score level; and finally, at the decision level, the "AND" and "OR" rules were used to fuse the face and palm print decisions. Publicly available face and palm print biometric samples were used as benchmark databases. The authors in [25] proposed a multi-biometric system for security institutions using wavelet decomposition and a neural network. The traits are face, iris, and fingerprint, premised on their acceptability to users, since the capturing procedure is fast and easy and the traits exhibit notable uniqueness for each subject. The proposed model focuses on the ability of a neural network to learn and recognize the pattern of the fused feature vectors. The face, iris, and fingerprint feature vectors extracted in the previous stage were fused by concatenation to produce one feature vector. The model uses wavelet decomposition to solve two problems: the large feature space, which leads to higher memory requirements, and the heterogeneity of the features of each trait, by fusing them together without altering the quality of the feature vector. Identification was carried out using different samples of the three traits of face, iris, and fingerprint, with two samples of the three traits for ten people. The identity was obtained after ICP was used to perform matching between the saved template and the tested feature patterns. The model showed high accuracy, with an FAR of 0% and an FRR of 3%; its recognition rate was 85%. The authors in [26] presented three new concepts for a human identification system using multimodal biometrics with a phase-congruent facial feature approach that leverages resource optimization. The adopted method yielded a cost-effective decision fusion technique for the two separable face features with an edge-to-angle relationship, producing highly improved fusion outputs suitable for low-quality datasets such as UFI. The multimodal system recognition rate is 92%, in contrast to 57.7% for a unimodal system, with a Manhattan classifier.
3 Methodology
The 150 human subjects were volunteers, randomly selected from the university community of the lead author. The INE database contains the right eye, left eye, right ear, left ear, and nose for each test subject (through a common pipeline of multi-algorithms), resulting in 550 INEs. Each subject has 15 captured modalities, yielding a total of 8250 observations. Each dataset goes through three steps: pre-processing, signal processing, and analysis. The face is imaged with a frontal view using a 16.2 MP Sony digital camera (Sony Corporation, Tokyo, Japan). The image I is represented by a matrix of size (m, n). During preprocessing, dimension reduction of the image I(m, n) was done. The cropped image was geometrically transformed using isotropic scaling to obtain the region of interest, along with grayscale conversion, resulting in I(p) (where I(p) = 256 × 256). To eliminate inconsistent lighting effects, illumination adjustment of I(x, y) was done using histogram equalization. We considered an image pixel value r ≥ 0 to be an element of a random variable R with a continuous probability density function P_R(r) and cumulative probability distribution F_R(r) = P[R ≤ r]. Also, we considered the mapping function between the input and output images to be s = f(r). On equalizing the histogram of the output image, let P_S(s) be a constant. The gray levels were assumed to be between 0 and 1; then P_S(s) = 1 forms a uniform random variable. The mapping function for histogram equalization, Eq. 1, was uniformly distributed over (0, 1):

s = F_R(r) = \int_0^{r} P_R(r)\, dr \quad (1)
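As an illustration of this preprocessing step, the following is a minimal sketch (assuming OpenCV and NumPy are available; the file name is hypothetical, while the 256 × 256 target size follows the description above):

```python
import cv2
import numpy as np

def preprocess_face(path: str) -> np.ndarray:
    """Toy version of the preprocessing stage: grayscale conversion,
    isotropic scaling to 256 x 256, and histogram equalization for
    illumination adjustment (Eq. 1 via the empirical CDF)."""
    image = cv2.imread(path)                          # I(m, n)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # grayscale conversion
    scaled = cv2.resize(gray, (256, 256),             # isotropic scaling to I(p)
                        interpolation=cv2.INTER_AREA)
    return cv2.equalizeHist(scaled)                   # illumination adjustment

if __name__ == "__main__":
    roi = preprocess_face("subject_01_face.jpg")      # hypothetical file name
    print(roi.shape, roi.dtype)
```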
Next is the signal processing phase. To enhance the space–time trade-off of the unimodal and multimodal biometric systems and obtain feature vectors, a Haar wavelet decomposition of the image I(x, y) was performed at the second level. The Haar transform is given as Eq. 2:

B_n = H_n A_n \quad (2)

where A_n is an n × n matrix and H_n is the n-point Haar transform. The low–low decomposed sub-band was chosen and the features were stored, thus optimizing space. Furthermore, B_n was encoded using the Fast Walsh–Hadamard Transform (FWHT) to obtain unique features, C_n. The FWHT of a signal x(t) of length N is defined as presented in Eqs. 3 and 4:

y_n = \frac{1}{N} \sum_{i=0}^{N-1} x_i \, WAL(n, i) \quad (3)

x_i = \sum_{n=0}^{N-1} y_n \, WAL(n, i) \quad (4)
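A compact sketch of this feature-extraction chain is given below (assuming PyWavelets, SciPy, and NumPy; for brevity the Walsh–Hadamard step uses the naturally ordered Hadamard matrix rather than sequency-ordered Walsh functions, and the flattened sub-band is padded to a power-of-two length):

```python
import numpy as np
import pywt
from scipy.linalg import hadamard

def extract_features(image: np.ndarray):
    """Level-2 Haar decomposition -> keep the LL sub-band (Bn),
    then encode it with a Walsh-Hadamard transform (Cn) and an
    SVD (Dn), in the spirit of Eqs. 2-5."""
    coeffs = pywt.wavedec2(image.astype(float), "haar", level=2)
    ll = coeffs[0]                                    # low-low sub-band, Bn

    vec = ll.flatten()
    n = 1 << int(np.ceil(np.log2(vec.size)))          # next power of two
    vec = np.pad(vec, (0, n - vec.size))
    cn = hadamard(n) @ vec / n                        # Eq. 3 (natural ordering)

    dn = np.linalg.svd(ll, compute_uv=False)          # singular values, Dn
    return ll, cn, dn

if __name__ == "__main__":
    img = np.random.randint(0, 256, (256, 256))       # stand-in for a preprocessed sample
    ll, cn, dn = extract_features(img)
    print(ll.shape, cn.shape, dn.shape)
```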
where i = 0, 1, 2, .., N−1 and WAL(n, i) are Walsh functions. B_n was then encoded using Singular Value Decomposition to obtain the unique features, D_n. B_n is an m × n matrix whose entries K are real numbers; then there exists the factorization of Eq. 5:

M = U \Sigma V^{*} \quad (5)

where U is an m × m unitary matrix over K, Σ is an m × n diagonal matrix with non-negative real numbers on the diagonal, and the n × n unitary matrix V* denotes the conjugate transpose of the n × n unitary matrix V. The fusion of the extracted features of INE was done using the concatenation technique; the result is an N_fused × 15 feature vector.
The analysis phase: the multimodal verification system was formally posed as follows. Given a probe S_i of an unknown person, determine the verification I_i, i ∈ {1, 2, …, N, N + 1}, where I_1, I_2, …, I_N are the subjects enrolled in the database and I_{N+1} indicates the rejected case where no suitable subject can be determined for the user. N is the subject's index in the database, while N + 1 is the index of the subject to be verified, using Eq. 6. Thus,

S_v \in \begin{cases} I_i & \text{if } \max_i Z(S_i, B_T) \ge t,\ i = 1, 2, 3, \dots, N \\ I_{N+1} & \text{otherwise} \end{cases} \quad (6)
where B_T is the biometric template, S_i corresponds to identity I_i, Z is the function that measures the similarity between S_v (the subject to be verified against the database) and S_i (a subject already existing in the database), and t is a predefined threshold. Matching is done and the system outputs I_i. Matlab(R) was used for simulation of the system on the Windows operating system platform. The results were interpreted and analyzed, and appropriate graphical plots were made. Thus, the performance evaluation of the implemented INEMB verification system is based on the FRR, the FAR, and the Receiver Operating Characteristic (ROC) curve.
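The decision rule of Eq. 6 can be sketched as follows (a toy illustration assuming the similarity scores are already available as a NumPy array; the threshold value is arbitrary):

```python
import numpy as np

def verify(scores: np.ndarray, threshold: float) -> int:
    """Return the index (1-based) of the best-matching enrolled subject
    if its similarity score Z(S_i, B_T) reaches the threshold t,
    otherwise return N + 1 to signal rejection (Eq. 6)."""
    n = scores.size
    best = int(np.argmax(scores))
    return best + 1 if scores[best] >= threshold else n + 1

if __name__ == "__main__":
    z = np.array([0.12, 0.47, 0.91, 0.30])    # hypothetical similarity scores
    print(verify(z, threshold=0.6))           # -> 3 (accepted as subject 3)
```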
4 System Design 4.1 Design of a Verification Biometrics System According to Information Theory, biometric information is defined as the decrease in uncertainty about the identity of a person due to a set of biometric features measurements [27]. To interpret this definition, we refer to two instances: before a biometric
measurement, t_0, at which time we only know a person p is part of a population q, which may be the whole planet; and after receiving a set of measurements, t_1, we have more information and less uncertainty about the person's identity. Equation 7 is the formal representation of the aforementioned statements:

I_{bm}(t_i) = \begin{cases} 1, & \text{if } p_{bm} \in q,\ i = 1, 2, 3, \dots \\ 0, & \text{otherwise} \end{cases} \quad (7)
where I_{bm}(t_i) is the biometric information at a given time, p_{bm} denotes the biometric measurements, and q is the population or domain in which the measurement is being taken. The design utilizes a multimodal input classification that uses correlated biometric measurements. The system employs the INE modalities. The biometric data of a person is captured and compared with that person's biometric data stored in a database; that is, 1:1 matching verification was used in this study. Figure 2 shows the architecture of the system, which consists of three biometric subsystems. Each has an output of either "Biometrics Accepted" (BA) or "Biometrics Rejected" (BR). These outputs (features) are fed into a decision fusion module that gives the outcome of the multimodal verification system; with three modalities, eight possible decision vectors exist at the decision level.
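As a small illustration of this decision-level view, the sketch below enumerates the 2^3 = 8 possible accept/reject vectors for three modalities and applies "AND" and "OR" fusion rules (an assumption-level example; the paper's actual fusion module is not reproduced here):

```python
from itertools import product

# 1 = Biometrics Accepted (BA), 0 = Biometrics Rejected (BR)
for iris, nose, ear in product((0, 1), repeat=3):     # the 8 decision vectors
    fused_and = iris and nose and ear                 # strict "AND" rule
    fused_or = iris or nose or ear                    # permissive "OR" rule
    print((iris, nose, ear), "AND ->", int(fused_and), "OR ->", int(fused_or))
```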
Fig. 2 The architecture of the multimodal biometric verification system
4.2 Fusion Approaches
The fusion of the extracted features of INE was done using the concatenation technique; the result is an N_fused × 15 feature vector. Let B denote a biometrics modality and || denote concatenation. The formal representation of the fusion of the biometric modalities is B_fused = I_{x,y}^1 || I_{x,y}^2 || · · ·, where I_{x,y}^i is a biometric modality of the form expressed in Eq. 8:

I_{x,y}^{i} = B_E = \int_{-\infty}^{\infty} B_x^2 \, dt \quad (8)

where the energy function of the biometric sample, E, is f(x) = \int_{-\infty}^{\infty} HWS \, dt, \forall i = 1, 2, \dots. The resulting biometric test sample B_S is classified into one of the following two classes: w_0 (genuine) or w_1 (impostor). If x_1, x_2, and x_3 are the outputs (matching scores) of the subject, then the biometric sample B_S is assigned to class w_j, i.e., B_S ← w_j, as in Eq. 9:

P(w_j \mid x_1, x_2, x_3) = \max_{k=0,1} P(w_k \mid x_1, x_2, x_3) \quad (9)
where P(w_j | x_1, x_2, x_3) denotes the posterior probability of w_j given x_1, x_2, and x_3; these posterior probabilities, together with a threshold, are the fusion techniques employed. Thus, the biometric verification algorithm is presented in Algorithm I.
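A hedged sketch of this score-level decision step is shown below (assuming the three matching scores have already been normalized to [0, 1]; the genuine/impostor posteriors are modelled simply as the mean score and its complement, which is an illustration rather than the paper's actual estimator):

```python
import numpy as np

def classify_sample(scores, threshold=0.5):
    """Assign the fused sample to w0 (genuine) or w1 (impostor)
    by comparing illustrative posteriors, in the spirit of Eq. 9."""
    s = np.clip(np.asarray(scores, dtype=float), 0.0, 1.0)
    p_genuine = s.mean()               # stand-in for P(w0 | x1, x2, x3)
    p_impostor = 1.0 - p_genuine       # stand-in for P(w1 | x1, x2, x3)
    label = "genuine" if p_genuine >= max(p_impostor, threshold) else "impostor"
    return label, p_genuine

if __name__ == "__main__":
    print(classify_sample([0.82, 0.74, 0.91]))   # hypothetical iris/nose/ear scores
```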
4.3 Matching Scheme
The matching was done by computing the Normalized Mean Square Error (NMSE) using Eq. 10, while Eqs. 11 and 12 were employed to derive the Peak Signal-to-Noise Ratio (PSNR):

NMSE = \frac{1}{mn} \sum_{j=1}^{M} \sum_{k=1}^{N} \left( I_{(x,y)} - \dot{I}_{(x,y)} \right)^2 \quad (10)

where I is the original image, \dot{I}_{(x,y)} is an approximation of the processed image, and m, n are the dimensions of the image. The Peak Signal-to-Noise Ratio (PSNR) is given by

PSNR(dB) = 20 \times \log_{10} \frac{Max}{\sqrt{MSE}} \quad (11)

Max = 2^{n} - 1 \quad (12)
where Max is the maximum possible pixel value of the image samples, whereas n is the encoding unit of the image. Based on the similarity score, a threshold is set for a subject such that the FAR and FRR were not significantly affected.
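These two metrics can be computed directly, for example (a small NumPy sketch; 8-bit images are assumed, so Max = 255):

```python
import numpy as np

def nmse(original: np.ndarray, processed: np.ndarray) -> float:
    """Mean of squared differences over the m x n image (Eq. 10)."""
    diff = original.astype(float) - processed.astype(float)
    return float(np.mean(diff ** 2))

def psnr(original: np.ndarray, processed: np.ndarray, bits: int = 8) -> float:
    """Peak signal-to-noise ratio in dB (Eqs. 11 and 12)."""
    mse = nmse(original, processed)
    max_value = 2 ** bits - 1
    return float("inf") if mse == 0 else 20.0 * np.log10(max_value / np.sqrt(mse))

if __name__ == "__main__":
    a = np.random.randint(0, 256, (256, 256))
    b = np.clip(a + np.random.randint(-5, 6, a.shape), 0, 255)
    print(round(psnr(a, b), 2), "dB")
```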
4.4 Correlation
Pearson's method, described in [28], was employed for template matching, as expressed in Eq. 13:

r = \frac{\sum_i (x_i - x_m)(y_i - y_m)}{\sqrt{\sum_i (x_i - x_m)^2 \; \sum_i (y_i - y_m)^2}} \quad (13)
where x_i is the intensity of the ith pixel in image 1, y_i is the intensity of the ith pixel in image 2, x_m is the mean intensity of image 1, and y_m is the mean intensity of image 2. It follows from the Cauchy–Schwarz inequality that the coefficient lies between −1 and 1; the relationship weakens as the correlation approaches zero (i.e., closer to being uncorrelated). Thus, the closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
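For example, the template-matching score between two equally sized feature images can be obtained as follows (a NumPy sketch; np.corrcoef computes exactly the coefficient of Eq. 13):

```python
import numpy as np

def template_correlation(image1: np.ndarray, image2: np.ndarray) -> float:
    """Pearson correlation between two images of the same size (Eq. 13)."""
    x = image1.astype(float).ravel()
    y = image2.astype(float).ravel()
    return float(np.corrcoef(x, y)[0, 1])

if __name__ == "__main__":
    a = np.random.rand(64, 64)
    b = 0.8 * a + 0.2 * np.random.rand(64, 64)    # a noisy copy of a
    print(round(template_correlation(a, b), 3))   # close to +1 => strong match
```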
5 Simulation and Result Discussion
5.1 Proposed Algorithm Simulation
The multimodal verification system was formally posed as follows: given a probe S_i of an unknown person, Eq. 14 determines the verification I_i, i ∈ {1, 2, …, N, N + 1}, where I_1, I_2, …, I_N are the subjects enrolled in the database and I_{N+1} indicates the reject case where no suitable subject can be determined for the user. N is the subject's index in the database, while N + 1 is the index of the subject to be verified:

S_v \in \begin{cases} I_i & \text{if } \max_i Z(S_i, B_T) \ge t,\ i = 1, 2, 3, \dots, N \\ I_{N+1} & \text{otherwise} \end{cases} \quad (14)
where B_T is the biometric template, S_i corresponds to identity I_i, Z is the function that measures the similarity between S_v (the subject to be verified against the database) and S_i (a subject already existing in the database), and t is a predefined threshold. The dialog session is the first window that shows up when the application (software) is launched, either from the command line or from the "guide" option of MatLab. The dialog session allows the user to input the necessary login details. The result from this session is then passed on to the main application; upon verification of the supplied information, the main application launches appropriately. The software is divided into three sections, viz: tools, test and display, and fusion. The "Tools" section contains three important decision panes. From the first panel, the user can select the appropriate algorithm to be used for the analysis. Such algorithms include Coded Wavelet Processing (CWP) and the benchmark algorithm (HPW, using Hough transform, Principal Component Analysis, and wavelet decomposition). The option selected here determines the method of analysis used. The fusion of the extracted features of INE was done using a concatenation technique, which results in a 1 × (550 × 15) feature vector. These visual features of two subjects are shown in Fig. 3a and b.
5.2 Benchmark Algorithm Simulation The INE templates can be created for a particular person by choosing the folder in which the images are located. There are 15 images for each person: images 1–3 are the nose samples, 4–6 are for the right eye, 7–9 for the left eye, 10–12 for the right ear, and 13–15 for the left ear. When the iris is chosen and the folder of interest has been selected, the system creates the appropriate database as a .mat file which is saved in the current directory. This file contains the major features of the extracted iris. When the ear is selected, the cropping is done according to the algorithm specified
Fig. 3 a The fusion of extracted features of INE (line graph representation of the subject's iris, nose, and eye). b The fusion of extracted features of INE (contour representation of the subject's iris, nose, and eye)
above. Then, the result of the PCA (Eigen-ear) is saved in the directory. With the nose, after the Viola–Jones algorithm has been implemented for cropping each of the nose images, the 2D wavelet decomposition of the image is performed and the horizontal, vertical, and diagonal components of the extracted features are saved in the directory.
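A rough sketch of this benchmark pipeline for the nose modality is given below (assuming OpenCV and PyWavelets; the cascade file name `haarcascade_mcs_nose.xml` is a hypothetical placeholder, since OpenCV ships face/eye cascades by default and a nose cascade must be supplied separately):

```python
import cv2
import pywt

def nose_features(image_path: str, cascade_path: str = "haarcascade_mcs_nose.xml"):
    """Viola-Jones style detection to crop the nose, then a 2D wavelet
    decomposition whose detail components serve as features."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    detector = cv2.CascadeClassifier(cascade_path)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None
    x, y, w, h = boxes[0]
    crop = gray[y:y + h, x:x + w]
    _, (horizontal, vertical, diagonal) = pywt.dwt2(crop.astype(float), "haar")
    return horizontal, vertical, diagonal

if __name__ == "__main__":
    feats = nose_features("subject_01_nose_1.jpg")    # hypothetical file name
    print(None if feats is None else [f.shape for f in feats])
```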
5.3 Fusion Scheme Validation
Validation Model for the Proposed (CWP) and Benchmark (HPP) Algorithms. This section presents a formal fusion scheme validation of the three modalities, with the structure shown in Fig. 4. Fused: we observed that the iris dominates the nose and ear. Therefore, the nose and ear scores are taken together with the square of the iris score, giving the expressions in Eqs. 15 and 16.
Fig. 4 The fusion scheme validation
G = f^2 g + f^2 h \quad (15)

G = f^2 (g + h) \quad (16)

Using the linear convolution theorem, the equation becomes Eq. 17:

G = (f^2 * g) * (f^2 * h) \quad (17)
while

x(n) * h(n) = \sum_{k=-\infty}^{\infty} x(k)\, h(n - k) \quad (18)

y = \sum_{k=r-1}^{\sigma} x(k)\, h(n - k) \quad (19)
From Eq. (18), let a ⇒ (f^2 * g) and b ⇒ (f^2 * h):

a * b = \sum_{j=1}^{N} a(j)\, b(j - k + 1) \quad (20)
Solving equation (20) is computationally inefficient; therefore, the convolution theorem, which employs the DFT, was applied accordingly. Given an image f(x, y) of size M × N, where the forward Discrete Fourier Transform (DFT) is given as T(u, v), T can be expressed as in Eq. 21, and Eqs. 22–27 yield the modality values and the fusion:

T(u, v) = \sum_{x,y} f(x, y)\, g_{u,v}(x, y) \quad (21)

F(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} \quad (22)

f(u) \Rightarrow F(v) = \sum_{n=0}^{N-1} f(n)\, e^{-j 2\pi k n / N} \quad (23)

g(u) \Rightarrow G(v) = \sum_{n=0}^{N-1} g(n)\, e^{-j 2\pi k n / N} \quad (24)

h(u) \Rightarrow H(v) = \sum_{n=0}^{N-1} h(n)\, e^{-j 2\pi k n / N} \quad (25)

Applying the convolution theorem, the first part is

(f^2 * g)(n) = \mathrm{IDFT}\big(\mathrm{DFT}(f^2)\, \mathrm{DFT}(g)\big) \quad (26)

and the second part is

(f^2 * h)(n) = \mathrm{IDFT}\big(\mathrm{DFT}(f^2)\, \mathrm{DFT}(h)\big) \quad (27)
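The effect of Eqs. 26 and 27 can be reproduced numerically with an FFT, for example (a NumPy sketch; the vectors f, g, h are random stand-ins for the iris, nose, and ear score outputs):

```python
import numpy as np

def circular_convolve(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Convolution via the DFT, i.e. IDFT(DFT(a) * DFT(b)) as in Eqs. 26-27.
    Zero-padding both inputs would yield the linear convolution instead."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f, g, h = (rng.random(8) for _ in range(3))   # stand-in modality score vectors
    part1 = circular_convolve(f ** 2, g)          # (f^2 * g)(n), Eq. 26
    part2 = circular_convolve(f ** 2, h)          # (f^2 * h)(n), Eq. 27
    fused = circular_convolve(part1, part2)       # G = (f^2 * g) * (f^2 * h), Eq. 17
    print(np.round(fused, 3))
```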
5.4 Proposed Algorithm Fused Extracted Features Using the validation model, the fused extracted features of the CWP are given as expressed in Fig. 5. From Fig. 5, the range of the threshold improved significantly. If the range is 0.1–0.2, it implies that the feature fusion enhances the system performance. Therefore, by using the convolution-based model, the threshold value can be selected as low as possible, compared to the result of fused modalities of the CWP algorithm. Also, we observed that the time taken to run each algorithm is longer than that required for the convolution-based model.
Fig. 5 The fused features of iris, nose, and ear false acceptance/false rejection (FAR/FRR) versus the threshold graph
Fig. 6 The fused features of iris, nose, and ear
5.5 Benchmark Algorithm Fused Extracted Features
Using the validation model for the proposed CWP and benchmark (HPP) algorithms, the fused extracted features are shown in Fig. 6. The fusion of the extracted features of iris, ear, and nose was done using a concatenation technique; the result is a 1 × (550 × 15) feature vector. From Fig. 6, we observed that the threshold range of 0.1–0.6 is poor. We also noticed that it does not go beyond the scores the iris alone offered. Therefore, we infer that the iris is the major determinant, given the high correlation attributed to the iris data acquisition.
5.6 Proposed Versus Benchmark Algorithms
The CWP, in its forms from single modalities to fused modalities, revealed that the proposed wavelet-based algorithm performs at a more optimal level than the benchmark algorithm for both single and fused modalities. It was observed that the benchmark algorithm's threshold range is only average compared to that of the proposed wavelet-based algorithm. In Fig. 7a, the range of the iris threshold using the Walsh algorithm is 0.1–0.4, and any threshold value chosen above the range maximum may tend to increase the FAR. There is no significant improvement when SVD is used; its threshold range is very high (0.9), which shows that using SVD alone, based on its properties, is not an optimal choice. The nose modality, as depicted in Fig. 7b, has a threshold range of 0.1–0.8; it is nearly a worst case given the very high threshold.
Fig. 7 a Iris correlation with benchmark and wavelet algorithms. b Nose correlation with benchmark and wavelet algorithms. c Ear correlation with benchmark and wavelet algorithms
Fig. 8 Fused correlation: coded wavelet versus benchmark algorithms (correlation values per subject for the benchmark and coded wavelet algorithms)
Thus, it can be inferred that the nose is not at its best as a biometric feature with this method. The graph in Fig. 7b also shows a very high SVD threshold of 0.9, as observed for the other biometric modalities when SVD is used. The graph in Fig. 7c illustrates that the threshold range is 0.1–0.4, and any threshold value chosen above the range maximum may tend to increase the FAR; however, the SVD threshold range is again very high (0.9), which shows that using SVD only, due to its characteristics, is not the preferred choice. Consequently, Fig. 8 shows the threshold range, which improved to 0.1–0.3 when the wavelet algorithms are fused for the iris; the nose maximum dropped from 0.8 to 0.5, reducing its range to 0.1–0.5; and the ear dropped from 0.9 to 0.3, giving a range of 0.1–0.3. This clearly revealed that fusing the features improves the system performance of the three INE modalities.
6 Conclusion
In this study, we employed the Haar Discrete Wavelet Transform, the Fast Walsh–Hadamard Transform, and Singular Value Decomposition for the design of the human identity verification system. A total of 15 features of the iris (6), nose (3), and ear (6) were extracted. Accordingly, a benchmark algorithm based on the Hough transform for the iris and PCA for the nose and ear was also designed. The Coded Wavelet processing algorithm and the benchmark algorithm were implemented using Matlab. The findings from the simulation showed that the Coded Wavelet processing algorithm performed better than the benchmark algorithm. It was also observed that the iris modality has the highest dominance over the nose and ear biometrics, based on the PSNR, MSE, and correlation values obtained from the experimental results. To validate the experimental results, a model was developed using the convolution theorem. The observations from the simulation results establish the superior performance of the Coded Wavelet processing algorithm over the benchmark algorithm.
Therefore, this study overcomes the limitations inherently posed when capturing an iris image. Also, the financial cost of our proposed system is lower than that of the usual iris systems. Further, our proposed system offers a higher performance (99.5%) based on the new fusion algorithm, and the chosen biometric traits, iris, ear, and nose, are orthogonal. Therefore, only one scanner was needed and used, adding to the cost-effectiveness merit of multimodal biometrics for human identity. Thus, the INEMB is recommended for use irrespective of the demography and geolocation of the country.
References 1. Haider, S. A., Rehman, Y., & Usman Ali, S. M. (2020). Enhanced multimodal biometric recognition based upon intrinsic hand biometrics. Electronics, 9, 1916. doi:10.3390/electronics9111916. 2. Li, S. Z., & Jain, A. K. (2015). Encyclopedia of biometrics, New York, NY. USA: Springer. 3. Akhtar, Z., Hadid, A., Nixon, M.S., Tistarelli, M., Dugelay, J.-L., & Marcel, S. (2018). Biometrics: In search of identity and security (Q & A). IEEE MultiMedia, 25(3), 22–35. doi:10.1109/MMUL.2018.2873494. 4. Raju, A. S., & Udayashankara, V. (2018). A survey on unimodal, multimodal biometrics and its fusion techniques. International Journal of Engineering & Technology, 7(4), 689–695. doi:10.14419/ijet.v7i4.36.24224. 5. Zhou, Z., Zhu, P.-W., Shi, W.-Q., Min, Y.-L., Lin, Qi., Ge, Q.-M., Li, B., Yuan, Q., & Shao, Yi. (2020). Resting-state functional MRI study demonstrates that the density of functional connectivity density mapping changes in patients with acute eye pain. Journal of Pain Research, 13, 2103–2112. doi:10.2147/JPR.S224687. 6. Jain, A. K., Nandakumar, K., & Ross, A. (2016). 50 years of biometric research: Accomplishments challenges and opportunities. Pattern Recognition Letters, 79, 80–105. 7. Sabri, M., Moin, M. S., & Razzazi, F. (2019). A new framework for match on card and match on host quality based multimodal biometric authentication. Journal of Signal Processing Systems, 91, 163–177. doi:10.1007/s11265-018-1385-4. 8. Buciu, I., & Gacsadi, A. (2016). Biometrics systems and technologies: A survey. Interntional Journal of Computers Communications & Control, 11(3), 315–330. doi:10.15837/ijccc.2016.3.2556. 9. Sindt, C. (2016). “Don’t Overlook the Eyelid” in Review of Optometry, pp 44, March 2015. Accessed on 4 December, 2019. https://www.reviewofoptometry.com/CMSDocuments/2015/ 3/ro0315i.pdf. 10. Abderrahmane, H., Noubeil, G., Lahcene, Z., Akhtar, Z., & Dasgupta, D. (2020). Weighted quasi-arithmetic mean based score level fusion for multi-biometric systems. IET Biometrics, 9(3), 91–99, 5. doi:10.1049/iet-bmt.2018.5265. 11. Korshunov, P., & Marcel, S. (2017). Impact of score fusion on voice biometrics and presentation attack detection in cross-database evaluations. IEEE Journal of Selected Topics in Signal Processing, 11(4), 695–705. doi:10.1109/JSTSP.2017.2692389. 12. Tolosana, R., Vera-Rodriguez, R., Fierrez, J., & Ortega-Garcia, J. (2019). Reducing the template ageing effect in on-line signature biometrics. IET Biometrics, 8(6), 422–430, 11. doi:10.1049/iet-bmt.2018.5259. 13. Jaha, E. S. (2019). Augmenting gabor-based face recognition with global soft biometrics. In 2019 7th International Symposium on Digital Forensics and Security (ISDFS), (pp. 1–5). Barcelos, Portugal. doi:10.1109/ISDFS.2019.8757553.
14. Lin, W., Yang, W., Junbin, G., & Xue, L. (2017). Deep adaptive feature embedding with local sample distributions for person re-identification, Pre-Print-arXiv:1706.03160v2 [cs.CV] September 7, 2017 15. Matsuda, K., Ohyama, W., & Wakabayashi, T. (2017). Multilingual-signature verification by verifier fusion using random forests. In 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR) (pp. 941–946). Nanjing. doi:10.1109/ACPR.2017.156. 16. Jones, M. ‘3 ways scammers are using ATMs to steal from you. 25 May, 2018, Komando.com. 17. Anonno Razzak. The rise of biometrics, Illumination 6 May, 2020. http://www.fintechbd.com/ the-rise-of-biometrics/. 18. Roig, M. Contactless comes of age: How biometrics is taking cards to the next level, 18 May, 2020. https://www.fintechnews.org/contactless-comes-of-age-how-biometrics-is-takingcards-to-the-next-level/. 19. Naganuma, K., Suzuki, T., Yoshino, M., Takahashi, K., Kaga, Y., & Kunihiro, N. (2020). New secret key management technology for blockchains from biometrics fuzzy signature. In 2020 15th Asia Joint Conference on Information Security (AsiaJCIS), (pp. 54–58). Taipei, Taiwan. doi:10.1109/AsiaJCIS50894.2020.00020. 20. Storey, A. (2020). Where does biometrics sit in today’s security ecosystem?Biometric Technology Today, 2020(7), 9–11, ISSN 0969-4765. doi:10.1016/S0969-4765(20)30096-5. 21. Ammour, B., Boubchir, L., & Bouden et al. (2020). Face – iris multimodal biometric identification system, 3. 22. Sanjekar, P., & Priti, S. (2019). Wavelet based multimodal biometrics with score level fusion using mathematical normalization, April, 63–71. 23. Gunasekaran, K., Control J., Raja J. et al. (2019). Deep multimodal biometric recognition using contourlet derivative weighted rank fusion with human face, fingerprint and iris images, 1144. 24. Alghamdi, T. (2016). Evaluation of multimodal biometrics at different levels of face and palm print fusion schemes. Asian Journal of Applied Sciences, 9(3), 126–130. 25. Namjm, M., & Hussein, R. (2015). Multi-biometric system for security institutions using wavelet multi-biometric system for security institutions using wavelet decomposition and neural network, June, 2–7. 26. Hamd, M. H., & Rasool, R. A. (2020). Optimized multimodal biometric system based fusion technique for human identification. Bulletin of Electrical Engineering and Informatics, 9(6), 24112418, ISSN 2302-9285. doi: 10.11591/eei.v9i6.2632. 27. Mokross, B.-A., Drozdowski, P., Rathgeb, C., & Busch, C. (2019). Efficient identification in large-scale vein recognition systems using spectral minutiae representations. doi: 10.1007/9783-030-27731-4_9. 28. Berman, J. J. (2018). Indispensable tips for fast and simple big data analysis. In Jules J. Berman (Ed.), Principles and practice of big data (2nd ed.) (pp. 231–257). Academic Press, ISBN 9780128156094. doi:10.1016/B978-0-12-815609-4.00011-X.
IoT-Based Voice-Controlled Automation Anjali Singh, Shreya Srivastava, Kartik Kumar, Shahid Imran, Mandeep Kaur, Nitin Rakesh, Parma Nand, and Neha Tyagi
Abstract Since the advent of technology, the focus on automation has been tremendous. Any task requiring human efforts and intervention has already been automated through machines or is/will be in the channel of being automated. Home automation has come into view in recent years. It is a field where intelligent home appliances are connected over a network in which they can communicate with each other to provide an experience of ultimate comfort and luxury to the user. The proposed model extends this luxury to the not-so-affluent section of the society. We aim at creating a mediator device between the normal appliances and the user which would optimize the working according to the user’s liking. Taking three kinds of light bulbs in this project, and controlling them through voice commands, we hint towards our ultimate goal of a mediator device mentioned before. Keywords Home automation · Internet of Things · Bluetooth · Raspberry Pi · GSM
A. Singh · S. Srivastava · K. Kumar · S. Imran · M. Kaur (B) · N. Rakesh · P. Nand · N. Tyagi Department of Computer Science & Engineering, SET, Sharda University, Greater Noida, India S. Srivastava e-mail: [email protected] K. Kumar e-mail: [email protected] S. Imran e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_66
1 Introduction
Recently, a rise in the implementation of automation for real-life projects and for solving prevalent problems has been observed. These can be seen in households as well as educational premises. The main structure is backed by IoT, AI and ML, and integration with applications, websites and web apps has been observed. IoT is a system where digital and mechanical hardware is integrated with sensors which
can then interact with humans through websites, applications, etc. Hence, it can be extremely useful for collecting data and generating desired responses based on what the data represents. It is an evolving field, and even with all the existing research, it is welcoming more every day [1]. For IoT devices to reach the masses far and wide and be commonly used, they have to be valuable and convenient. The number of smart appliances in the market is increasing, and not many people are ready to switch and invest yet. Thus, a hybrid solution that provides the option of converting an existing device into a smart one, without even replacing the device, would be 'easy on the pockets and fulfill the purpose' [1]. The objective in its entirety is to develop a new product altogether which caters to the needs of today's market in terms of smart devices. Home automation is one technology that is on the rise, and a low-cost solution is the need of the hour. We have therefore come up with a solution that can universally control the appliances that have been voluntarily connected to the system. The first step or experiment in this direction is a system where we develop a voice-enabled application handling three different appliances of the same kind. Using voice assistants from a mobile application, we take in commands from the user, which then instruct the Raspberry Pi, through which all devices are connected, accordingly [2]. One of the major problems in this project's context is the choice of communication technology: Bluetooth is generally used for point-to-point networks and operates at about 720 Kbps, which is a very slow rate and not very effective for operating small devices remotely. In this project, a Wi-Fi connection is used; Wi-Fi is a more secure means of communication than Bluetooth and has a bandwidth of about 150 Mbps, which provides a way for remote access to the device and smooth functioning [3]. The other problem analysed is the storage of data, which has to be stored to perform other functions such as calculating the power consumption of the appliances individually; this can help to monitor the appliances and to reduce energy consumption. In this project, provision has been made to store the data so that it can be interpreted further for future use. A further addition to the list of problems for the voice automation module is dealing with noisy and disturbed environments. If only a voice mode, e.g., Google Assistant [4], is provided to control the devices, misinterpretation sometimes happens if the background is noisy. This project provides two options to control the devices, i.e., switch mode and voice mode. The user can control the devices with voice inputs as well as with the mobile application, which contains buttons to operate the appliances. The devices can be operated effectively and efficiently through the mobile application.
2 Problem Analysis The working of this system starts with the mobile application. It takes the input commands either as voice or switches (buttons in the app). Finally, the algorithm identifies the commands present in the input and also classifies the devices which users want to operate. Then it switches ON or OFF the devices according to the commands provided by the users. The user can also add timing using timers provided
Fig. 1 Overall workflow of the system
to operate a particular device at a particular time and place. The overall workflow of the system is shown in Fig. 1. The user can interact with the interface of the application and give commands in the forms listed in Sect. 3. One of the major problems in this project's context was deciding which communication technology we could make use of. Narrowing the choice down to Bluetooth and Wi-Fi, we observed that Bluetooth operates at a rate of about 720 Kbps and is generally used for point-to-point networks; our objective of controlling devices remotely would not have been fulfilled in this case. Thus, in this project a Wi-Fi connection is used. Wi-Fi is a more secure way of communication than Bluetooth and can support bandwidths of about 150 Mbps, which provides a way for remote access to the device and smooth functioning. The other problem analysed is the storage of data, which has to be stored to perform other functions such as calculating the power consumption of the appliances individually; this can help to monitor the appliances and to reduce energy consumption. In this project, provision has been made to store the data so that it can be interpreted further for future use. A further addition to the list of problems for the voice automation module is dealing with the noise and disturbance in the surrounding environment. If we provide only a voice mode to control the devices, misinterpretation could sometimes happen if the background is noisy. For this project, we have provided two options to control the devices, i.e., switch mode and voice mode. The user can control the devices with voice inputs as well as with the mobile application, which contains buttons to operate the appliances. The devices can be operated effectively and efficiently through the mobile application. The comparative analysis is presented in Table 1.
Table 1 Literature survey

1. Objective: control the movement of a vehicle using voice commands using IoT. Software/hardware requirements: NodeMCU DEVKIT (ESP8266), ultrasonic sensor HC-SR04, DC motor, motor driver L239D, Arduino IDE, IFTTT, Google Assistant, Adafruit.io. Algorithm/methodology: configuring Adafruit.io, configuring IFTTT, connections, programming the module. Findings: log into the Google account on the smartphone, bring up Google Assistant and feed the exact commands that were put into the IFTTT trigger; the vehicle moves as expected.

2. Objective: home automation. Software/hardware requirements: hardware: NodeMCU (ESP8266), relay board, ULN2803 IC; software: Blynk, IFTTT application. Algorithm/methodology: configuring the Blynk application, configuring IFTTT, connections. Findings: a cost-effective voice-controlled (Google Assistant) home automation system for controlling general appliances is proposed; it is highly reliable and efficient for aged people and differently abled persons on a wheelchair.

3. Objective: control and monitor the electrical appliances of the IoT lab on the CIT campus using Google Assistant or a chatbot. Software/hardware requirements: Blynk API, ThingSpeak, 2 A fuse, 4-channel OMRON SSR, current server, optocoupler, NodeMCU and Hi-Link power supply. Algorithm/methodology: when the connection is established, the status of the virtual pin in Blynk is set to LOW or HIGH; the changes in the virtual pin are reflected in the digital pin, so the connected device turns ON/OFF. Findings: remote control of separate appliances in the lab so that electricity consumption is minimum.

4. Objective: automatic lighting system to reduce energy consumption. Software/hardware requirements: Raspbian operating system/Python. Algorithm/methodology: the web camera compares the images of human patterns in the OpenCV software, and the appliances are turned OFF accordingly. Findings: a maximum of 50% of energy is conserved.

5. Objective: completely universal home automation control panel. Software/hardware requirements: hardware: Raspberry Pi Model 2, relay channel module boards, jumper cables; software: Raspbian, PuTTY, Nmap, FileZilla, Win32DiskImager, PHP, HTML. Algorithm/methodology: the website is used as the instruction manager, with buttons for ON/OFF. Findings: energy conservation.
Fig. 2 Proposed methodology
3 System Design
The working of this system starts with the mobile application. It takes the input commands either as voice or as switches (buttons in the app). The algorithm then identifies the commands present in the input and also classifies the devices which the user wants to operate. It then switches the devices ON or OFF according to the commands provided by the user. The user can also add timings, using the timers provided, to operate a particular device at a particular time and place. The overall workflow of the system is shown in Fig. 1. The user can interact with the interface of the application and give commands in the following forms:
(a) voice commands (translated and synthesized via the Web Speech API), (b) timers or timings set via the app (updated on the cloud/database), and (c) an SMS to set a timer or give any instruction (read by the SIM 800 module).
When the user interacts with the device modules, appliances, etc., the data is stored in the database in order to calculate approximate usage and provide statistics for assessment [5]. Figure 2 shows the proposed methodology. The system is described using the data flow diagram shown in Fig. 3.
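As a sketch of how the stored interaction data could be turned into approximate usage statistics (an assumption-level example; the event schema with device, state, and timestamp fields is ours, not the paper's):

```python
from datetime import datetime

# Hypothetical event log pulled from the database: (device, state, timestamp)
events = [
    ("red_led", "ON", datetime(2021, 3, 1, 18, 0)),
    ("red_led", "OFF", datetime(2021, 3, 1, 21, 30)),
    ("green_led", "ON", datetime(2021, 3, 1, 19, 0)),
    ("green_led", "OFF", datetime(2021, 3, 1, 20, 0)),
]

def usage_hours(log, device):
    """Sum ON->OFF intervals for one device to approximate its usage."""
    total, switched_on = 0.0, None
    for name, state, when in sorted(log, key=lambda e: e[2]):
        if name != device:
            continue
        if state == "ON":
            switched_on = when
        elif switched_on is not None:
            total += (when - switched_on).total_seconds() / 3600.0
            switched_on = None
    return total

print(usage_hours(events, "red_led"))   # -> 3.5 hours, usable for power estimates
```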
4 Implementation
The implementation of the proposed voice-controlled automation system is shown in Fig. 4, and Fig. 5 shows the pin diagram of the RPi. The system comprises the following components. Develop Application: a mobile application is developed in Flutter to remotely access the appliances. Connection to database: the system is provided with a database to store the data, allowing retrieval of the data for future use.
Fig. 3 Data flow diagram
Fig. 4 Proposed system implementation
Connection to Raspberry Pi: the various devices, such as the timer, relays and jumper wires, were connected to the Raspberry Pi 3 Model B+, and the circuit was established. Connect Application: the application was connected to the system for remote access to the devices. The project was implemented using a Raspberry Pi, a small single-board computer; the Raspberry Pi 3 Model B+ is used in the system. An operating system was installed on the Raspberry Pi, since it will not work without one. Four LEDs of different colours (red, yellow, green and orange) were connected on the breadboard. The breadboard was connected to the RPi via jumper wires to its GPIO pins. The Wi-Fi module was used to connect the mobile application to the device. The mobile application was developed using Flutter and the RPi was coded in Python. The whole system was thus implemented.
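A minimal sketch of the RPi-side control logic is shown below (assuming the RPi.GPIO and Flask libraries are available; the pin numbers and the /device/&lt;name&gt;/&lt;state&gt; route are illustrative choices, not the paper's actual code):

```python
from flask import Flask
import RPi.GPIO as GPIO

# Hypothetical mapping of LED names to BCM pin numbers
PINS = {"red": 17, "yellow": 27, "green": 22, "orange": 23}

GPIO.setmode(GPIO.BCM)
for pin in PINS.values():
    GPIO.setup(pin, GPIO.OUT, initial=GPIO.LOW)

app = Flask(__name__)

@app.route("/device/<name>/<state>")
def switch(name, state):
    """Turn a named LED ON or OFF in response to a command from the app."""
    if name not in PINS or state not in ("on", "off"):
        return "unknown device or state", 404
    GPIO.output(PINS[name], GPIO.HIGH if state == "on" else GPIO.LOW)
    return f"{name} switched {state}"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # reachable from the Flutter app over Wi-Fi
```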
Fig. 5 Pin diagram of Rpi
5 Testing and Results
Wi-Fi is a more secure means of communication than Bluetooth. A Wi-Fi connection is a more effective way to send and share video, audio and telemetry while sending and accepting remote control commands from an operator. The project has been connected to the database to exchange data between the devices, giving it wide future scope for calculating power consumption and other uses; the data has been stored for further interpretation. The project offers two options for giving inputs:
• Switch Mode: the app provides buttons to control the appliances, so that commands are not misinterpreted in case of disturbance.
• Voice Mode: the app accepts voice inputs as well.
Figures 6 and 7 depict the state when the whole connection was established and no commands were given by the user through the application. Figures 8 and 9 depict the state when the user gave the commands to turn ON the lights; the LEDs in the circuit light up. The outcome was exactly as intended: the devices are operated by the mobile application. Hence, the outcome was satisfactory, with about 90% accuracy.
Fig. 6 Application when no commands given
Fig. 7 Circuit when no commands given
6 Conclusion The prime objective of this system is to use the smartphone (mobile-based application) to control the home appliances effectively. The switch mode and voice mode
Fig. 8 Application when a command is given
Fig. 9 Circuit when a command is given
have been given in the mobile application to control and monitor the home appliances. The GSM module has been implemented so that text messages can be sent by the user and the system can be operated easily. A timer has been added so that the user can schedule the appliances to turn ON and OFF at a particular time. Suppose a user is stuck somewhere and wants a particular LED to switch ON at 6 a.m.; they can add a timer and the LED will glow automatically. The system uses Wi-Fi to operate, which enables a user to remotely access the device at any time and from any place; there is no range limitation. Users can easily interact with an Android phone/tablet. The user can send commands via the button or speech mode. The data are analysed by the application and sent over the network. The Raspberry Pi acts as a server, analyses the data and activates the GPIO (General-Purpose Input/Output) pins. In this way, the automation process is carried out. Using this as a reference, the system can be expanded further; for example, by accessing the stored data, we can calculate the power consumption of our home appliances and easily monitor and control energy consumption.
References
1. Poongothai, M., Sundar, K., & Vinayak Prabhu, B. (2018). Implementation of IoT based intelligent voice controlled laboratory using Google Assistant. International Journal of Computer Applications (0975–8887), 182(16).
2. Din, I. U., Hassan, S., Khan, M. K., Atiquzzaman, M., & Ahmed, S. H. The Internet of Things: A review of enabled technologies and future challenges.
3. Ali, M., & Hassan, A. M. (2018). Developing applications for voice enabled IoT devices to improve classroom activities. In 2018 21st International Conference of Computer and Information Technology (ICCIT), 21–23 December, 2018.
4. Gupta, M. P. (2018). Google Assistant controlled home automation. International Research Journal of Engineering and Technology, 5(5).
5. Kalyan Chenumalla et al. Google Assistant controlled home automation. In IEEE Vaagdevi Engineering College Student Branch.
6. Turchet, L., Fischione, C., & Essl, G. Internet of musical things vision and challenges.
7. Sachdeva, S., Macwana, J., Patela, C., & Doshia, N. (2019). Voice-controlled autonomous vehicle using IoT. In 3rd International Workshop on Recent Advances on Internet of Things: Technology and Application Approaches (IoT-T&A 2019), November 4–7, Coimbra, Portugal.
Trusted Recommendation Model for Social Network of Things Akash Sinha, Prabhat Kumar, and M. P. Singh
Abstract Recent advances in the computing infrastructure have led to the realization of advanced solutions for catering to the needs of human users. Obtaining recommendations from the network of things that are inspired by the social behavior of humans is becoming an important research topic. The main crux of such systems is to provide the users with recommendations of products and services by considering the opinions of the user's social circle. The social circle of the user is inferred from the various online social networking sites that the user is part of. However, while considering these systems, there is a need to evaluate not only the opinion of the user's social circle about the product/service provider but also the past experiences of the user with the providers, as well as the strength of the relationships of the user with other users in his social circle. The work proposed in this paper aims to provide trusted recommendations to the user by incorporating multiple aspects of the social behavior of the user. Theoretical evaluation of the proposed work clearly indicates the enhanced efficiency and trustworthiness of socially inspired recommendation systems. Keywords Trust · Social · Security · Recommendations · Internet of things
A. Sinha (B) · P. Kumar · M. P. Singh Computer Science and Engineering Department, National Institute of Technology Patna, Patna, India e-mail: [email protected] P. Kumar e-mail: [email protected] M. P. Singh e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_67
1 Introduction The rapid evolution of communication technology and computing abilities of the Information Technology (IT) infrastructure have paved the way for the development of promising solutions to cater to the day-to-day requirements of the user. Social Network of Things (SNoT) is an emerging paradigm that refers to a highly interconnected network of uniquely addressable objects communicating using standard communication protocols and utilizing the social dimensions of the human users for providing them the required service [1–3]. The key components of these SNoT systems are the users itself, information about the user’s social circle, intelligent server, and appropriate communication media [1]. These components work in collaboration to provide the desired service to the users. One of the important features of such SNoT systems is to provide suitable recommendations to its users about the right things, at the right place, and at right time [4]. This has led to the development of multiple recommendation engines that employ the properties of the social network of the users. However, a number of issues are related to the challenging requirement of obtaining recommendations from the social circle of human users [5]. For instance, contacting individual persons one by one for obtaining particular information about a topic or item is a time-consuming task. Moreover, even if the entire process is automated to facilitate less human intervention it may happen that most of the people in the social circle of the user may not have the desired information. For example, people residing in India who have never visited another country cannot provide suggestions about where to dine and where to stay in a foreign country. Another challenging issue of socially inspired recommendation systems is the ability to obtain credible information from the social network of the user. There exists a multitude of literature that emphasizes the requirement of suitable mechanisms to identify and minimize Sybil users and fake information on social media. Since most of the Online Social Networks (OSNs) are not centrally moderated it is hard to identify which users are malicious or which information is false without properly analyzing their behavior. This urges for the requirement of appropriate security strategies to identify and utilize only credible information from the OSNs. Minimizing the use of fake or irrelevant information requires modeling systems that analyze various aspects of the users or information before using them in the system [6–8]. In order to address the previously mentioned challenges, the work proposed in this paper aims to provide a trusted recommendation model for providing users with recommendations about a product or topic based on the opinions of the user’s social circle. Moreover, the system proposed in this work also considers the experiences of the user itself while recommending to him the required product or services. Another important aspect of the proposed system is that while considering the opinions of the user’s social circle for a recommendation, it also infers the strength of the relationship between the user and his friend. This allows the system to prioritize the opinions of the user’s close friends, who are assumed to provide legitimate information to the user, over other random people not existing in the social circle of the user. Further,
friends who may not be close to the user but have prior knowledge, interest, or experience about the topic for which the recommendation is sought for are preferred over those close friends who do not have any prior knowledge or opinions about the desired topic. Consideration of the previous experiences of the user himself ensures that the recommendations may not be solely guided by the false or fake opinions of others. The rest of the paper is organized as follows: Sect. 2 reviews the existing literature regarding the trustworthy recommendation models in social networks; Sect. 3 provides the details of the proposed system; Sect. 4 presents a theoretical discussion of the proposed system, and finally Sect. 5 presents the concluding remarks.
2 Related Works There exists a plethora of research in the domain of recommendation systems. However, all the works do not include the considerations of trusted connections of the user. Moreover, the inclusion of social networks in recommendation engines is still in its infancy. This can be attributed to the fact that existing OSNs do not publicly provide data of their user activities due to privacy concerns. Hence, it is necessary to review those recommendation systems that incorporate the principle of trust propagation in social networks. This section provides a review of selected literature regarding the recommendation systems employing the social connections of a user. The authors in [9] propose a method for a social recommendation that employs the concept of probabilistic matrix factorization [10]. The proposed approach considers both the records of ratings and the information of users’ social networks. The rating matrix is factorized using the hidden features of users and items while the trust matrix is factorized using latent factor and hidden features of the user. The proposed approach is shown to be scale linearly with respect to the number of observations. Further, the authors extend their work in [11] to incorporate the concept of trust propagation in their model to avoid the cold start problem. The use of trust propagation improved the prediction accuracy of their model as well. However, the interoperability of the model due to different feature sets of a user considered in this work is not feasible in real-life implementations. The authors in [12] propose a breadth-first search approach for providing recommendations to a user in the trust network. The proposed approach employs the ratings provided by other users that are at the shortest distance from the source user in the trust network. The ratings provided by such users are weighed and aggregated as per their trust scores with respect to the source user. For the indirect trust score, the trust value of the target user is evaluated with respect to its directly connected neighbors. These trust scores are weighted as per the direct trust values existing between the target user’s neighbor and the source user. The concept of MoleTrust [13] is inspired by TidalTrust. Recommendations are provided using the ratings provided by other users existing up to a maximum depth
in the trust network. The value of maximum depth is provided as an input and is independent of any other entity in the system. Backward exploration technique is employed to evaluate the trust score between any two users. The authors in [14] propose max flow trust metric “Advogato” for identifying the trusted users in an online community. The system requires the number of users to trust as input. The proposed system further requires the knowledge of the entire network structure in order to assign the weights to the network graph edges. One of the limitations of Advogato is that it only identifies the users that can be trusted and do not evaluate the trust degree. Thus, this approach does not differentiate between users who can be trusted more and users who are less trustworthy. Item-based and trust-based recommendations are combined in TrustWalker [15] for dealing with noisy data and considering a sufficient number of ratings. The approach employs random walk procedure for evaluating the confidence level in the computed predictions. The probability of using the rating of a similar item instead of a rating for the target item increases with the increasing length of the walk. Experimental evaluation depicts the efficiency of the proposed framework over other memory-based approaches.
3 Proposed Framework The system proposed in this work comprises a number of modules that work in conjunction to provide the desired results to the users. This section provides the details of all the modules of the proposed system. Figure 1 shows the outline of the proposed system. The “Profile Manager” module provides the user interface for gathering the details of the user and saving them on the user device. The profile information of the user
Fig. 1 Proposed system model
can include the following details: name, age, gender, occupation, current location, hometown, interests, transactions, etc. Transactions refer to the historical activity details of the users. The activities of a user can include liking or commenting on the socials posts of the friends, providing reviews and ratings of products and services, posting information about check-ins to a new place, change in occupation, etc. The activities performed using the proposed system are stored in the form of a tuple: . Here, U i refers to the user who is performing the activity, U j denotes the user’s friend whose posts have been liked or commented upon, C T denotes the category of activity, I includes the actual details of the activity such as comments, likes, etc., DT indicates the date of performing the activity and T id denotes the activity or transaction id. For activities other than liking posts, comments, messaging, U j may refer to the product/place/service provider name on which the user has provided the feedback. The proposed system categorizes every activity performed by the user using the system. The decision about which activity belongs to which category is inferred by the type of posts on which the user has performed the activity. For instance, if the user posts information about traveling to a new place or check-in in a hotel, etc., the activity of the user is placed under “Travel”. If the user purchases any product (say laptop), this activity is recorded under the category of “Electronics”, i.e., type of the product that the user purchased. If the user likes or comments upon a post of other users, then the category will be the topic of the post such as humor, politics, etc. Categorizing the activities of the user helps in identifying the interest or expertise of the user if such information has not been explicitly mentioned by the user in the Interest section. The working of the proposed system is guided by the social circle of the user and as such the system needs to have the details of the user’s friends. The work proposed in this paper is limited to the direct friends of the user and hence it does not consider the higher degree social circle of the user. The details of the user’s friends can be stored on the mobile of the user itself. This can be viewed as an extended version of the user’s phonebook contacts with each contact having the details included in their profile. The module “User Social Database” stores the details of the user’s friends. This database periodically synchronizes with the end devices of the user’s friends for updating their profile information. The “Search Engine” module is responsible for retrieving the list of products/places, etc. from the Internet as per the keyword and specifications provided by the user. Recommendation of a particular product or place requires the opinions of other users, who have either used the product or visited that place earlier. These opinions are can either be expressed in the form of ratings or comments. Comments can be used to obtain the sentiment score of the user. Evaluating the sentiment score of the user can be done using any of the methods existing in the literature such as [16]. The work presented in this paper does not include the details for obtaining the sentiment score from review and as such it considers only the final quantified value obtained from the reviews and ratings. Parsing the reviews and obtaining the opinion scores of the users is performed by the “Opinion Parser” module. 
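As an illustration of the transaction tuple described above, a minimal representation could look as follows (a sketch; the field values are hypothetical, while the tuple order follows the elements U_i, U_j, C_T, I, DT, T_id defined in the text):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Transaction:
    ui: str      # user performing the activity (U_i)
    uj: str      # friend or product/place/service provider acted upon (U_j)
    ct: str      # category of the activity (C_T), e.g. "Travel", "Electronics"
    info: str    # actual details of the activity (I): comment, like, rating, ...
    dt: date     # date of the activity (DT)
    tid: str     # transaction/activity id (T_id)

# A check-in post is categorized under "Travel"; a laptop purchase under "Electronics"
log = [
    Transaction("alice", "grand_hotel", "Travel", "check-in", date(2021, 2, 14), "T001"),
    Transaction("alice", "acme_store", "Electronics", "rating: 4/5", date(2021, 3, 2), "T002"),
]
interests = {t.ct for t in log}   # inferred interest/expertise categories
print(interests)
```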
There can be a possibility that the user who wants the recommendations may have his own opinion about any
of the products included in the retrieved set of products. In such cases, the self-opinion of the user also becomes an important parameter to consider. The "Opinion Parser" module, therefore, records the sentiment score of the self-opinion separately for eventual use in calculating the final recommendation score of the product. The opinion scores calculated by the "Opinion Parser" module are fed to the "Recommendation Engine" module, which is eventually responsible for calculating the recommendation score of the products, sorting the products as per the recommendation scores, and displaying the results to the user. In doing so, the "Recommendation Engine" utilizes the opinion scores of the following:
i. User's friends who have specified the nature of the product in their interest section and share strong relationship ties with the user (U_BS).
ii. User's friends who have specified the nature of the product in their interest section and share weak relationship ties with the user (U_BW).
iii. User's friends who have not specified the nature of the product in their interest section, but whose interest/expertise has been inferred from the Transaction Category, and who share strong relationship ties with the user (U_NS).
iv. User's friends who have not specified the nature of the product in their interest section, but whose interest/expertise has been inferred from the Transaction Category, and who share weak relationship ties with the user (U_NW).
v. Other users who are not on the friend list (OU).
vi. Self-opinion of the user (SO).
The above inputs can be combined to obtain the recommendation score of the required product or service as per Eq. 1, where α + β + γ + δ + η + θ = 1:

RS_i = \alpha \cdot U_{BS} + \beta \cdot U_{BW} + \gamma \cdot U_{NS} + \delta \cdot U_{NW} + \eta \cdot OU + \theta \cdot SO \quad (1)
The weights assigned to each attribute represent the priority or importance of the attribute in calculating the overall recommendation score. The order of priority of the weights is given by the following expression:

\theta = \alpha > \gamma > \beta > \delta > \eta \quad (2)
Expression 2 clearly indicates that the self-opinion of the user about any product is considered equally important to the opinions of his friends who have knowledge or interest in the product. The rest of the priorities have been logically set as per the opinions consideration performed by humans naturally. For missing attributes, their corresponding weights will be set to 0 (zero) so as to nullify the effect of missing values. The workflow of the entire system is presented in Fig. 2.
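A small sketch of Eqs. 1 and 2 follows (the weight values are illustrative choices that respect the stated ordering and sum to 1; a missing opinion source simply contributes 0, as described above):

```python
# Illustrative weights: theta = alpha > gamma > beta > delta > eta, summing to 1
WEIGHTS = {"SO": 0.30, "UBS": 0.30, "UNS": 0.15, "UBW": 0.12, "UNW": 0.08, "OU": 0.05}

def recommendation_score(opinions: dict) -> float:
    """Eq. 1: weighted combination of the six opinion sources; any source
    that is missing contributes 0 (its weight is effectively nullified)."""
    return sum(WEIGHTS[k] * opinions.get(k, 0.0) for k in WEIGHTS)

# A product with strong-tie friend opinions, other users' ratings, and a self-opinion
print(round(recommendation_score({"UBS": 0.9, "OU": 0.8, "SO": 0.4}), 3))
```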
Fig. 2 Workflow of the proposed system
4 Discussion

Owing to the limited availability of datasets suitable for evaluating the proposed system, new datasets need to be constructed that include the parameters relevant to the proposed work. This section, hence, theoretically analyzes the proposed system and establishes its validity by considering various use case scenarios applicable to the proposed work. Two different use cases have been analyzed in this work: (i) when the social circle of the user has positive opinions about the item/topic but the user himself has a negative opinion about that particular item/topic, and (ii) when neither the social circle of the user nor the user himself has any opinion of the item/topic.
Use Case 1: Consider that the user wants a recommendation for buying a new laptop. The proposed recommendation engine will try to analyze the opinions of the user's circle and of other users, as mentioned earlier in Sect. 3. Suppose that, for a laptop of a particular brand, the social circle of the user as well as other users have good opinions about the product. However, it may happen that the user himself has not had a good experience using a laptop of that brand. In such cases, the overall recommendation score for that item will be reduced, as the maximum priority or weight has been assigned to the self-opinion of the user. This will place the item at a lower position in the final recommended list.
Use Case 2: Consider a user visiting a new city, say London, who wants a recommendation of hotels where he can stay and dine. It may happen that neither the user nor any friend in his social circle has ever visited London. In such a case, the recommendation will be guided by the opinions of only those users who are not in the social circle of the user. This, to some extent, helps in avoiding the cold start problem in the proposed recommendation engine.
5 Conclusion

Recommendation engines have evolved from being just another feature of a system to an important necessity for users. The work proposed in this paper aims to build a model-based recommendation engine that incorporates multiple factors in order to provide trusted recommendations to the users. The working of the proposed recommendation system is driven by the principle of social behavior of human users and hence can be used to provide recommendations on social networking portals. The proposed work, in particular, will be beneficial for the emerging Social Network of Things ecosystem, where devices utilize the social connections of human users to provide suitable and required services to the users. Moreover, to cater to the need of providing reliable and trustworthy information to the users, the proposed model utilizes the concept of trust in evaluating the recommendations for a user. Incorporating the trust factor in the system ensures that the recommendations provided to the users are credible and reliable. The trust factor has been modeled using multiple factors such as the experience of the user's friends, the strength of their relationships with the user, the self-opinion of the user himself, etc. The strength of the proposed system is highlighted by analyzing the use case scenarios where this system can be used to provide credible and trustworthy recommendations to the users. The factors considered in the proposed model are not exhaustive in nature and, as such, the proposed work can be further extended by incorporating other relevant parameters for accurately modeling the context of the required recommendations.
References

1. Kleinberg, J. (2008). The convergence of social and technological networks. Communications of the ACM, 51(11), 66–72.
2. Atzori, L., Iera, A., Morabito, G., & Nitti, M. (2012). The social internet of things (SIoT)—when social networks meet the internet of things: Concept, architecture and network characterization. Computer Networks, 56(16), 3594–3608.
3. Afzal, B., Umair, M., Shah, G. A., & Ahmed, E. (2019). Enabling IoT platforms for social IoT applications: Vision, feature mapping, and challenges. Future Generation Computer Systems, 92, 718–731.
4. Sinha, A., Shrivastava, G., & Kumar, P. (2019). Architecting user-centric internet of things for smart agriculture. Sustainable Computing: Informatics and Systems, 23, 88–102.
5. Lye, G. X., Cheng, W. K., Tan, T. B., Hung, C. W., & Chen, Y. L. (2020). Creating personalized recommendations in a smart community by performing user trajectory analysis through social internet of things deployment. Sensors, 20(7), 2098.
6. Nitti, M., Girau, R., & Atzori, L. (2013). Trustworthiness management in the social internet of things. IEEE Transactions on Knowledge and Data Engineering, 26(5), 1253–1266.
7. Chen, R., Bao, F., & Guo, J. (2015). Trust-based service management for social internet of things systems. IEEE Transactions on Dependable and Secure Computing, 13(6), 684–696.
8. Lin, Z., & Dong, L. (2017). Clarifying trust in social internet of things. IEEE Transactions on Knowledge and Data Engineering, 30(2), 234–248.
9. Ma, H., Yang, H., Lyu, M. R., & King, I. (2008). SoRec: Social recommendation using probabilistic matrix factorization. In CIKM 2008 (pp. 931–940). ACM.
10. Salakhutdinov, R., & Mnih, A. Probabilistic matrix factorization. In NIPS 2008, Vol. 20.
11. Ma, H., King, I., & Lyu, M. R. Learning to recommend with social trust ensemble. In SIGIR 2009 (pp. 203–210).
12. Golbeck, J. (2005). Computing and applying trust in web-based social networks. Ph.D. thesis, University of Maryland, College Park.
13. Massa, P., & Avesani, P. Trust-aware recommender systems. In RecSys 2007, USA.
14. Levien & Aiken. (2002). Advogato's trust metric. http://advogato.org/trust-metric.html.
15. Jamali, M., & Ester, M. TrustWalker: A random walk model for combining trust-based and item-based recommendation. In KDD 2009.
16. Gaurav, K., & Kumar, P. (2017). Consumer satisfaction rating system using sentiment analysis. In Conference on e-Business, e-Services and e-Society (pp. 400–411). Cham: Springer.
Efficient Classification Techniques in Sentiment Analysis Using Transformers

Leeja Mathew and V. R. Bindu
Abstract Classification in sentiment analysis using transformers is a state-of-the-art method. The transformer model speeds up the training process with the help of an attention mechanism. The encoding and decoding architecture in transformers helps in language modeling for machine translation and text summarization. A pre-trained transformer model provides a trained model for downstream tasks on private datasets of various sizes. In this paper, we conduct a study on efficient classification techniques using transformers. The fine-tuning task of transformers is the focus of our work. The choice of hyperparameters for fine-tuning has a significant impact on the final results. The early stopping method for regularization eliminates overfitting during training. The performance of the efficient transformer models BERT, RoBERTa, ALBERT, and DistillBERT has been analyzed and compared with Long Short-Term Memory, and experimental results confirm the superiority of these models over LSTM.

Keywords Attention · Deep learning · Pretrained models · Sentiment analysis · Transformers · BERT
1 Introduction

Sentiment Analysis (SA) using Natural Language Processing (NLP) with the help of pretrained models [1] leads to many applications nowadays. From the perspective of methodology, SA can be broadly divided into three types: rule based, machine learning based, and deep learning based [2]. However, due to the tremendous
L. Mathew (B) · V. R. Bindu
School of Computer Sciences, Mahatma Gandhi University, Kottayam, Kerala, India
volume of information, data analysis is chaotic, time consuming, and computationally expensive. In this context, the pretrained model has a significant role nowadays. The transformer architecture efficiently accomplishes context-based bidirectional language modeling, a novel approach in NLP. The transformer is the best performing model; it consists of an encoder and a decoder connected through an attention mechanism [3]. It is used to find global dependencies among words and to speed up the training process. The parallel processing of this task has become a revolutionary change in language embedding. This method outperforms Long Short-Term Memory (LSTM), which had been regarded as an effective solution for the sequence prediction problem [4]. The rest of the paper is organized into four sections. We discuss the related study in Sect. 2. The proposed transformer method is explained in Sect. 3. Our results are presented and analyzed in Sect. 4. Finally, Sect. 5 includes the conclusion.
2 Related Study

Peters et al. [5] introduce an improved method for NLP tasks, especially syntax, semantics, and cross-linguistic context, by applying deep bidirectional Language Models (biLM). Radford et al. [6] propose a method for Natural Language Understanding, which includes textual entailment, semantic similarity assessment, and document classification. Even though a large number of unlabeled text corpora are available, labeled data for learning a specific task is rare, and it is a challenge for discriminatively trained models to perform well. The authors therefore introduce a framework based on generative pretraining and discriminative fine-tuning. They achieve improvements on commonsense reasoning, question-answering, and textual entailment of 8.9%, 5.7%, and 1.5%, respectively. Raffel et al. [7] demonstrate the effectiveness of transfer learning in NLP and build a framework that converts every language problem into a text-to-text format. They build the "Colossal Clean Crawled Corpus" and study pretraining objectives, architectures, unlabeled datasets, transfer approaches, etc. They achieve better results on question-answering, text summarization, text classification, etc. Logeswaran and Lee [8] propose a framework for learning sentence representations from unlabeled data using the distributional hypothesis, by reformulating the problem of predicting the sentence in which the context exists as a classification task. Effective learning of different encoding functions is based on vector representations. Their sentence-based learning outperforms the existing supervised and unsupervised representations in terms of training time. Munikar et al. [9] analyze sentiment classification for five classes based on the deep contextual pretrained language model BERT.
Alam et al. [10] design a multilingual transformer model for Bangla and English text across different domains such as sentiment analysis, emotion detection, news categorization, and authorship attribution. They obtain 5–29% improved accuracy for different tasks. Alzubi et al. [11] design a novel algorithm for ensembling classifiers which outperforms classical ensembles. Zhu et al. [12] propose a dependency graph enhanced dual-transformer network. They propose a dual-transformer structure that supports mutual reinforcement learning between the flat representation and the graph-based representation using the encoder and decoder architecture of the transformer. It has been shown that transformer-based pretrained language models are very effective for learning context-based language representations. The usual procedure is to take the output of the encoder's final layer for fine-tuning on downstream tasks. Yang and Zhao [13] improve this method by fusing the hidden representations, utilizing RoBERTa as the backbone encoder. This model gives improved performance on multiple natural language understanding tasks.
3 The Proposed Transformer Method

We propose a method named Transformer-based Sentiment Classification (TSC), depicted in Fig. 1, based on the state-of-the-art transformer architecture [3], which provides a significant improvement in context-based language embedding. Unlike other deep neural network architectures such as the RNN (Recurrent Neural Network), CNN (Convolutional Neural Network), or LSTM (Long Short-Term Memory), this method can learn dependencies among words efficiently. For the proposed TSC model, we have experimented with the four pretrained transformer models that obtained the highest accuracy among the 13 models shown in Fig. 2, and these are compared with LSTM. The experimented models, namely BERT, RoBERTa, ALBERT, and DistillBERT, are discussed in the following sections.
3.1 BERT (Bidirectional Encoder Representation from Transformers)

The model comprises the following tasks.
3.1.1 Tokenization
Tokenization means dividing the input into tokens/word pieces. For example, consider the following sequence: Who was Jim Henson? Jim Henson was a puppeteer.
Fig. 1 TSC model architecture
The BERT tokenizer converts the input sequence into '[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', and '[SEP]'. '[CLS]' and '[SEP]' are special tokens for classification and sentence separation in the BERT model.
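A minimal sketch of this tokenization step; the Hugging Face transformers library is used here purely for illustration, since the paper does not state which tokenizer implementation it relies on.

```python
from transformers import BertTokenizer

# WordPiece tokenizer of the bert-base-uncased checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encoding the two sentences as a pair adds [CLS] and the two [SEP] tokens.
ids = tokenizer.encode("Who was Jim Henson?", "Jim Henson was a puppeteer.")
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
#  'jim', 'henson', 'was', 'a', 'puppet', '##eer', '[SEP]']
```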
Fig. 2 Comparison chart (accuracy versus epoch for the 13 pretrained models: bert-base-uncased, bert-base-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, distilbert-base, distilbert-base-multilingual-uncased, albert-base-v1, albert-base-v2, albert-large-v1, roberta-base, distilroberta, xlnet-base-cased, and distilbert-base-multilingual)
3.1.2 Embedding in Transformer-Based Model
Embedding means converting text into a number (here vector form). There are three types of embedding in BERT as illustrated in Fig. 3.
Fig. 3 Embedding
(i) Token embedding: each token is replaced with a token-id, which maps to a high-dimensional (768-dimensional) vector.
(ii) Sentence/Segment embedding: each token is assigned to its corresponding sentence/segment and embedded accordingly.
(iii) Position embedding: each token is embedded with its position in the sentence/paragraph.
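The three embeddings are summed element-wise to form the encoder input for each token. The following NumPy sketch illustrates the idea; the random lookup tables stand in for BERT's learned embedding matrices and the token ids are arbitrary examples.

```python
import numpy as np

d_model, vocab, max_pos = 768, 30522, 512
rng = np.random.default_rng(0)

# Stand-ins for the learned lookup tables of BERT-base.
token_table   = rng.normal(size=(vocab, d_model))
segment_table = rng.normal(size=(2, d_model))     # sentence A / sentence B
pos_table     = rng.normal(size=(max_pos, d_model))

token_ids   = np.array([101, 2040, 2001, 102])    # example ids, e.g. [CLS] who was [SEP]
segment_ids = np.array([0, 0, 0, 0])
positions   = np.arange(len(token_ids))

# Encoder input: sum of the three embeddings, one row per token.
embeddings = (token_table[token_ids]
              + segment_table[segment_ids]
              + pos_table[positions])
print(embeddings.shape)                            # (4, 768)
```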
3.1.3 Transformer Encoding
The transformer is an attention-based architecture for NLP consisting of two components, namely an Encoder and a Decoder. BERT is a multi-layer bidirectional transformer encoder. BERT has mainly two model architectures: BERTBASE, with L = 12, H = 768, A = 12, and 110 M total parameters, and BERTLARGE, with L = 24, H = 1024, A = 16, and 340 M total parameters. We used the BERTBASE model for this work. The encoder consists of two sections: a self-attention layer and a feed-forward network. The detailed architecture [14] of the encoder is given in Fig. 4.

i Positional Encoding
Positional encoding makes use of the order of the text sequence. Since the model contains no recurrence and no convolutional neural network, sine and cosine functions of different frequencies are used:

pe(p, j) = sin(p / 10000^(2j/d))    (1)

pe(p, j + 1) = cos(p / 10000^(2j/d))    (2)
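Equations (1) and (2) can be evaluated directly. The following NumPy sketch builds the sinusoidal positional-encoding matrix (positions along the rows, embedding dimensions along the columns); the sizes chosen are just examples.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding of Eqs. (1)-(2):
    even columns use sin, odd columns use cos."""
    p = np.arange(max_len)[:, None]            # token positions
    j = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    angle = p / np.power(10000.0, j / d_model) # p / 10000^(2j/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

print(positional_encoding(512, 768).shape)     # (512, 768)
```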
ii Multi-Head Attention
The attention of all tokens is computed in parallel, generating multiple attention heads; this discards a token's similarity to itself. All attention (self-attention) heads are concatenated and multiplied with a weight matrix in order to form a single matrix that can be passed to the feed-forward neural network.

Self-Attention

Self-attention is used to find relationships between words in a sequence. The step-by-step procedure of self-attention is as follows.
1. Multiply the embedding vector (input) with the Query (WQ), Key (WK), and Value (WV) weight matrices in order to get the corresponding token's query (q), key (k), and value (v) vectors.
Fig. 4 Encoder architecture
2. Find the relationships of the query vector with other words by calculating the dot product of the query vector and the key vector, i.e., q·k.
3. Scale the above quantity by dividing it by the square root of the dimensionality of the key vector, which leads to more stable gradients.
4. Perform a softmax operation to find the relevant words.
5. Multiply the softmax value with the value vector to drown out irrelevant words. The sum of these new vectors is the attention output for the first token.
Equation (3) describes the above steps so that attention maps a query and a set of key-value pairs to an output.
Attention(q, k, v) = softmax(q·k^T / √d) ∗ v    (3)
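A compact NumPy sketch of the scaled dot-product attention of Eq. (3) for a single head; the weight matrices WQ, WK, and WV are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Scaled dot-product self-attention, Eq. (3)."""
    q, k, v = X @ WQ, X @ WK, X @ WV            # step 1: project every token
    scores = q @ k.T / np.sqrt(k.shape[-1])     # steps 2-3: q.k scaled by sqrt(d)
    weights = softmax(scores)                   # step 4: softmax over the keys
    return weights @ v                          # step 5: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(14, 768))                  # 14 embedded tokens
WQ, WK, WV = (rng.normal(size=(768, 64)) for _ in range(3))
print(self_attention(X, WQ, WK, WV).shape)      # (14, 64)
```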
iii Add and Normalize
Our model uses 12 attention heads; that is, 12 encoders (represented by the number N in Fig. 4) are used in training. Each head is used to project the input embedding into a different representation subspace. The feed-forward layer expects a single matrix, so all attention heads are concatenated and multiplied with a weight matrix W0 to obtain the resultant matrix (Z), which is passed on to the FFNN (Feed-Forward Neural Network). The procedure is repeated until the end of the corpus. A fully connected, normalized output layer for classification is obtained from the encoder. The GELU activation function is used in this model.
3.1.4 Fine-Tuning
The BERT model has two tasks: feature-based training and fine-tuning.

i Feature-based training
The training procedure of the BERT model is performed using the BooksCorpus (800 million words) and English Wikipedia (2,500 million words) in a self-supervised way. It is based on two methods: Masked Language Modeling (MLM) and Next Sentence Prediction [15].
MLM: Using this method, the original word corresponding to a masked word is predicted based on the context. Here, 15% of the words in the input sequence are selected at random; of these, 80% are replaced with the [MASK] token, 10% are replaced with a random token, and the remaining 10% are left as the original token.
Next sentence prediction: Using this method, each training input sequence is formed by taking two spans of text (say A and B) from the corpus. In 50% of the cases, B is the real next sentence that follows A, and in the other 50%, B is a random sentence from the corpus; the pairs are labeled IsNext and IsNotNext, respectively.
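The 15%/80%/10%/10% selection described above can be sketched as follows; this is a simplified illustration that ignores special tokens, and the [MASK] id and vocabulary size are assumptions taken from the standard BERT vocabulary.

```python
import numpy as np

def mlm_mask(token_ids, mask_id=103, vocab_size=30522, rng=None):
    """Select 15% of positions; of those, 80% -> [MASK],
    10% -> random token, 10% -> left unchanged."""
    rng = rng or np.random.default_rng(0)
    ids = token_ids.copy()
    selected = rng.random(len(ids)) < 0.15
    labels = np.where(selected, token_ids, -100)   # only selected slots are predicted
    roll = rng.random(len(ids))
    ids[selected & (roll < 0.8)] = mask_id         # 80%: replace with [MASK]
    rand = selected & (roll >= 0.8) & (roll < 0.9) # 10%: replace with a random token
    ids[rand] = rng.integers(0, vocab_size, rand.sum())
    return ids, labels                             # remaining 10% stay as-is

ids, labels = mlm_mask(np.arange(1000, 1020))
print(ids)
```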
ii Fine-tuning
For performing classification tasks in sentiment analysis, we add a classification layer on top of the transformer output using the [CLS] token. Label probabilities are calculated using softmax according to Eq. (4):

p = softmax(C W^T)    (4)
where C is the representation of the special [CLS] token and W is the fine-tuning parameter matrix. We set the batch size, maximum sequence length, number of training epochs, and learning rate as hyperparameters for finding the classification accuracy.
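The fine-tuning head of Eq. (4) is a linear projection of the [CLS] representation followed by a softmax. A NumPy sketch (C and W are random stand-ins for the learned quantities):

```python
import numpy as np

def classify_cls(C, W):
    """Eq. (4): label probabilities p = softmax(C W^T)."""
    logits = C @ W.T                     # one logit per class
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
C = rng.normal(size=768)                 # final hidden vector of the [CLS] token
W = rng.normal(size=(2, 768))            # one row per sentiment class
print(classify_cls(C, W))                # e.g. [p_negative, p_positive]
```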
3.2 RoBERTa (a Robustly Optimized BERT Pretraining Approach)

RoBERTa is an improved model compared to BERT in NLP. Its training mechanism removes BERT's next sentence prediction, and dynamic masking further improves the accuracy. Larger mini-batches and learning rates are chosen at training time, which leads to better performance on downstream tasks. The developers trained RoBERTa for a longer time on large datasets, namely BOOKCORPUS plus English Wikipedia (the same as BERT), CC-NEWS, OPENWEBTEXT, and STORIES [16]; the entire dataset is about 160 GB of text. This model attains state-of-the-art results on SQuAD, GLUE, and RACE. RoBERTa has mainly two model architectures: RoBERTa-base (L = 12, H = 768, A = 12, 125 M total parameters) and RoBERTa-large (L = 24, H = 1024, A = 16, 355 M total parameters). We have chosen the RoBERTa-base model in our work. Our downstream task on the movie review dataset obtained better accuracy than BERT.
3.3 ALBERT (a Lite BERT)

The longer training time of the previous models is reduced by implementing parameter reduction techniques called factorized embedding parameterization [17] and cross-layer parameter sharing. Instead of BERT's next sentence prediction, sentence order prediction is performed here by introducing a self-supervised loss. This model achieves state-of-the-art results on GLUE, SQuAD, and RACE. ALBERT has mainly four model architectures: ALBERT-base (L = 12 (repeating), 128 embedding, H = 768, A = 12, 11 M total parameters), ALBERT-large (L = 24 (repeating), 128 embedding, H = 1024, A = 12, 17 M total parameters), ALBERT-xlarge (L = 24 (repeating), 128 embedding, H = 2048, A = 16, 58 M total parameters), and ALBERT-xxlarge (L = 12 (repeating), 128 embedding, H = 4096, A = 64, 223 M total parameters).
3.4 DistilBERT, a Distilled Version of BERT

DistilBERT is a general-purpose smaller language representation model. In this model, the size is reduced by 40% relative to BERT and training is made 60% faster by using distillation
techniques. The environmental cost can be reduced accordingly [18]. This model can operate on-device in real time and can be fine-tuned for several downstream tasks, achieving good results. In our IMDb task as well, we obtain a good result faster. The training mechanism is based on a student architecture that distills knowledge from a teacher model (here, BERT). DistilBERT mainly consists of the DistilBERT-base model (L = 6, H = 768, A = 12, 66 M total parameters). It is thus a faster, cheaper, and lighter model than the others.
4 Results and Discussion

4.1 Experimental Setup

The model is implemented in the Python programming language; the Keras ktrain library is used for our fine-tuning tasks. We fine-tuned for 10 epochs and performed early stopping based on each task's evaluation metric on the dev set. The learning rate is varied among 3e−05, 1.5e−05, and 7.5e−06 to obtain the minimum loss. The rest of the hyperparameters remain the same as during pretraining. We experimented with various maximum lengths (64, 128, 256, and 512) and obtained the best results with a maximum length of 512. The batch size is set to 12; it is observed that increasing the batch size does not lead to a better result. The next section shows the detailed results.
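A sketch of this fine-tuning setup with ktrain's Transformer wrapper is given below. The data-loading step is not described in the paper, so tiny placeholder reviews stand in for the IMDb data, and the checkpoint name and the early-stopping patience are assumptions; batch size 12, maximum length 512, and the 3e−05 learning rate follow the text.

```python
import ktrain
from ktrain import text

MODEL = "bert-base-uncased"  # also applicable to roberta-base, albert-base-v2, etc.

# Placeholder data; in the paper the IMDb movie reviews are used instead.
x_train = ["a wonderful, moving film", "utterly boring and predictable"] * 8
y_train = ["pos", "neg"] * 8
x_test, y_test = x_train[:4], y_train[:4]

t = text.Transformer(MODEL, maxlen=512, class_names=["neg", "pos"])
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)

learner = ktrain.get_learner(t.get_classifier(), train_data=trn,
                             val_data=val, batch_size=12)

# Up to 10 epochs at lr = 3e-5; training stops early once the
# validation loss stops improving (early stopping for regularization).
learner.autofit(3e-5, 10, early_stopping=3)
learner.validate(class_names=t.get_classes())
```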
4.2 Results

This section summarizes the experimental results achieved by our model using the IMDb dataset. Figures 5, 6, 7, and 8 show learning rate plots for the BERT-base, RoBERTa-base, ALBERT-base, and DistillBERT-base models, respectively.

Fig. 5 BERT-base
Table 1 Time analysis

Model               Execution time (Hrs)
BERT-base           8.05
RoBERTa-base        7.29
ALBERT-base         7.97
DistillBERT-base    4.20
Table 2 Comparison of epoch-wise accuracy

Models        Epoch 1   Epoch 2   Epoch 3   Epoch 4   Epoch 5   Epoch 6   Epoch 7
LSTM          0.8562    0.8642    0.8734    0.8627    0.8637    0.8674    0.8601
BERT-base     0.9322    0.9383    0.9360    0.9348    0.9353    0.9356    0.9365
RoBERTa       0.9536    0.9554    0.9555    0.9551    0.9562    0.9547    -
ALBERT        0.9213    0.9279    0.9279    0.9307    0.9301    0.9299    0.9289
DistillBERT   0.9278    0.9294    0.9316    0.9307    0.9296    0.9284    0.9305
Fig. 6 RoBERTa-base
The efficient GPU time using an Intel Core i3 laptop is analyzed in Table 1. Epoch-wise fine-tuning accuracy is recorded in Table 2 and displayed in Fig. 9.
5 Conclusion

We have presented an efficient TSC architecture for sentiment classification and analyzed the performance of four transformer models (BERT, RoBERTa, ALBERT, and DistillBERT) on the IMDb dataset. The early stopping method for regularization included in our model avoids overfitting in the training step. The hyperparameters, a maximum length of 512 and a batch size of 12, have improved our model. RoBERTa performs best with 95% accuracy, while the BERT model secures 93% accuracy. ALBERT and DistillBERT achieve accuracy close to that of BERT.
Fig. 7 ALBERT-base
Fig. 8 DistillBERT-base
The ALBERT model has a lower processing time compared to BERT. The DistillBERT model has been found to be 52% faster, with accuracy closest to that of the BERT model. These transformer models outperform LSTM, which had been regarded as the most effective solution to the sequence prediction problem, when comparing the epoch-wise accuracy. In future, a better model may be built with efficient processing speed as well as classification accuracy.
Fig. 9 Fine-tuning comparison chart with LSTM (accuracy versus epoch for bert-base-cased, distilbert-base, albert-base-v1, roberta-base, and LSTM)
Acknowledgements The authors acknowledge the support extended by DST-PURSE Phase II, Govt. of India.
References

1. Mathew, L., & Bindu, V. R. (2020). A review of natural language processing techniques for sentiment analysis using pre-trained models. In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC) (pp. 340–345). IEEE. https://doi.org/10.1109/ICCMC48092.2020.ICCMC-00064.
2. Liu, R., Shi, Y., Ji, C., & Ji, M. (2019). A survey of sentiment analysis based on transfer learning. IEEE Access, 7, 85401–85412. https://doi.org/10.1109/ACCESS.2019.2925059.
3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).
4. Fei, L., Zhou, K., & Weihua, Ou. (2019). Sentiment analysis of text based on bidirectional LSTM with multi-head attention. IEEE Access, 7, 141960–141969.
5. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
6. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/languageunderstandingpaper.pdf.
7. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
8. Logeswaran, L., & Lee, H. (2018). An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.
9. Munikar, M., Shakya, S., & Shrestha, A. (2019). Fine-grained sentiment classification using BERT. In 2019 Artificial Intelligence for Transforming Business and Society (AITB) (Vol. 1, pp. 1–5). IEEE. https://doi.org/10.1109/AITB48515.2019.8947435.
10. Alam, T., Khan, A., & Alam, F. (2020). Bangla text classification using transformers. arXiv preprint arXiv:2011.04446.
11. Alzubi, O. A., et al. (2020). An optimal pruning algorithm of classifier ensembles: Dynamic programming approach. Neural Computing and Applications, 1–17.
12. Zhu, J., et al. (2020). Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823.
13. Yang, J., & Zhao, H. (2019). Deepening hidden representations from pre-trained language models for natural language understanding. arXiv preprint arXiv:1911.01940.
14. https://medium.com/dissectingbert/dissecting-bert-part-1-d3c3d495cdb3.
15. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
16. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
17. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
18. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Ultra-Wideband Scattered Microwave Signal for Classification and Detection of Breast Tumor Using Neural Network and Statistical Methods

Mazhar B. Tayel and Ahmed F. Kishk
Abstract Medical screening methods have become very important in the diagnosis of diseases and in assisting therapeutic treatment. Early detection of breast cancer is considered a critical factor in reducing the mortality rate of women. Various alternative breast screening modalities are being investigated to improve breast cancer detection. The use of Ultra-Wideband (UWB) radar for screening breast cancer and determining the location of tumors in the breast with the help of artificial intelligence methods is a new technique for the detection and localization of cancer. This work focuses on an experimental dataset for breast tumor detection and localization using Neural Network (NN) techniques, i.e., a Feed-Forward backpropagation Neural Network (FFNN). NNs provide an important way to obtain solutions for real-life problems, and combined with learning algorithms they offer a promising technology. Dataset clustering for fast training of the networks is used in this work. Features are extracted using the Fast Fourier Transform (FFT), the Discrete Cosine Transform (DCT), and Principal Component Analysis (PCA). Artificial Neural Networks (ANN) are trained with a learning algorithm, given sampled data and the desired output, to construct NNs that perform different tasks depending on the training received. The performance of the NN is calculated, and statistical calculations are used to visualize the performance of the NN trained on the given datasets.

Keywords Ultra-Wideband (UWB) · Dielectric properties of tissue · Artificial Neural Networks (ANN)
M. B. Tayel, Faculty of Engineering, EED, Alexandria University, Alexandria, Egypt
A. F. Kishk (B), EED, Alexandria Higher Institute of Engineering & Technology (AIET), Alexandria, Egypt
1 Introduction

A tumor causes problems to the human body and can be dangerous if not treated at an early stage. Breast cancer screening methods include many ways to image the breast, such as Mammography, Computer-Aided Detection (CAD), Ultrasound Imaging, Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), Optical Imaging and Spectroscopy (OIS), Thermography, Electrical Impedance Imaging, Electronic Palpation, and Electrical Potential Measurements. UWB technology also benefits the fields of public protection, construction, engineering, science, medicine, consumer applications, information technology, multimedia entertainment, and transportation, with major applications in medical areas, e.g., medical monitoring and medical imaging [1]. The idea of using microwaves for breast tumor detection has gained considerable attention and extensive study from a variety of research groups due to its many benefits, including low cost, harmless radiation, and ease of use compared to current techniques, as seen in Table 1. Breast tumor detection using UWB signals is recommended since the electrical properties (conductivity and permittivity) of human body tissue are affected by the transmitted frequency, as seen in Table 2 [2].

Table 1 Comparison of different screening methods

Screening method               Problems
X-ray mammography              High false-positive rate; high false-negative rate; ionizing radiation; breast compression; sensitivity decreases with high-density tissue
Ultrasound                     High false-negative rate; low resolution; higher cost than X-rays; operator skill
MRI                            High false-positive rate; contrast agent; very expensive
Positron emission tomography   Radioactive tracers to produce images of tumors
Table 2 Dielectric properties of the breast at 4 GHz [1, 2, 4]

Tissue   Permittivity (F/m)   Conductivity (S/m)
Tumor    50                   1.20
Fat      5.14                 0.14
Skin     37.9                 1.49
Fig. 1 UWB band spectrum [7]
UWB is a short-range radio communication technology involving the generation and transmission of radio-frequency energy that spreads over a very large frequency range, which may overlap several frequency bands allocated to radio communication services [3]. The Federal Communications Commission (FCC) has put regulations in place for the use of UWB technology [4]. UWB technology typically has intentional radiation from the antenna with either a −10 dB fractional bandwidth of more than 20% or a bandwidth of at least 500 MHz. UWB operates at very low power levels and can support applications involving multiple users at high data rates (e.g., short-range Wireless Personal Area Networks (WPANs) at data rates greater than 100 Mbit/s) [3]. A maximum mean Equivalent Isotropic Radiated Power (EIRP) of −41.3 dBm/MHz can be used from 3.1 GHz up to 10.6 GHz, as shown in Fig. 1 [4]. UWB medical screening methods have become very important in diagnosing diseases and assisting therapeutic treatment. Early detection of breast cancer is considered a critical factor in reducing the mortality rate of women [4]. Various alternative breast screening modalities are being investigated to improve breast cancer detection. UWB screening of the breast cancer location with the help of artificial intelligence techniques has become a new trend in the medical field [5]. The proposed NN is used to classify the breast tumor and its location. The trained NN is given the received signals and features extracted using three algorithms: the Fast Fourier Transform (FFT), the Discrete Cosine Transform (DCT), and Principal Component Analysis (PCA); the NN is then trained using the features extracted from the scattered signals. The performance is then tested for each algorithm, and statistical information (e.g., mean, variance, standard deviation, entropy) is gathered.
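The three feature-extraction routes and the accompanying statistics can be sketched as follows. The paper carries out these steps in MATLAB2019b; the snippet below is only an illustrative Python equivalent, and the signal length and the number of retained principal components are assumptions.

```python
import numpy as np
from scipy.fft import dct
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
signals = rng.normal(size=(118, 6400))        # stand-in for the scattered UWB signals

# FFT features: magnitude spectrum of each scattered signal.
fft_features = np.abs(np.fft.rfft(signals, axis=1))

# DCT features: type-II discrete cosine transform of each signal.
dct_features = dct(signals, type=2, norm="ortho", axis=1)

# PCA features: project the signals onto the leading principal components.
pca_features = PCA(n_components=100).fit_transform(signals)

# Statistics of the kind reported later (mean, variance, standard deviation).
print(signals.mean(), signals.var(), signals.std())
print(fft_features.shape, dct_features.shape, pca_features.shape)
```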
Table 3 Difference between benign and malignant tumors [5, 6]

Benign tumors: don't invade nearby tissue; can't spread to other parts of the body; normally do not return after they are removed; usually have a smooth, regular shape; under a pathologist's microscope, the shape, chromosomes, and DNA of the cells appear normal; unlikely to recur if removed or to require further treatment such as radiation or chemotherapy.
Malignant tumors: invade nearby tissue; can spread to other parts of the body; can return after being removed; may have an irregular shape; may recur after removal, sometimes in areas other than the original site; may require aggressive treatment, including surgery, radiation, chemotherapy, and immunotherapy medications.
2 Types of Breast Cancer

Cancers are divided into noninvasive and invasive. Noninvasive cancer means the growth of cells is concentrated in a small region and has not yet spread to nearby tissues, whereas invasive cancer has spread and is therefore more dangerous and harder to treat. A tumor is a mass of abnormal tissue and is divided into two types: noncancerous (i.e., 'benign') and cancerous (i.e., 'malignant'). Indicators to distinguish between benign and malignant tumors include shape, size, stiffness, and viscosity. A malignant tumor is more viscous than a benign one. The dielectric properties (permittivity and conductivity), as well as the density, of malignant tumors are higher because of their higher water and blood contents [6]. The differences between benign and malignant tumors are seen in Table 3.
3 Data Collection

The UWB frequency at which the signal is transmitted is 4.7 GHz. The set of signals scattered from the phantom given in [8] consists of 118 signals: 108 signals with a tumor location (x, y, z) and 10 without a tumor, in which case x, y, and z are assigned −1 [8]. The dataset signals represent the training data imported into the FFNN. The data is individually transformed with the FFT algorithm, the DCT algorithm, and the PCA algorithm, respectively, for the training process. A group of 39 test signals is applied for testing the three algorithms; it consists of 36 signals with a tumor and 3 signals without a tumor. The PulsOn UWB transmitter (TX) and receiver (RX) are connected to a PC using an Ethernet hub and can be controlled through the PC, as shown in Fig. 2 [2]. This device operates at a center frequency of 4.7 GHz with a 3.2 GHz bandwidth. The size of the tumor is 2.5 mm. The phantom consists of pure petroleum jelly used to mimic breast fatty tissue, as seen in Table 4 [5, 9].
Fig. 2 Configuration of test setup [9]
Table 4 Dielectric properties of the used materials at 4.7 GHz [9]

Breast phantom        Material                       Permittivity (F/m)   Conductivity (S/m)
Fatty breast tissue   Vaseline                       2.36                 0.012
Tumor                 Various water to flour ratio   15.2–37.3            2.1–4.0
Skin                  Glass                          3.5–10               10^−11–10^−15
4 Result of Training FFNN Using FFT, DCT, and PCA for Feature Extraction

Figure 3 shows the proposed block diagram for the three algorithms, i.e., the steps for classifying and detecting the breast tumor location using FFT, DCT, and PCA in MATLAB2019b. The data stored for training are the signals scattered from the built breast phantom [9]. A Feed-Forward Neural Network (FFNN) is then constructed, which has three layers: input, hidden, and output. The input layer is used only to transfer the input data to the hidden layer nodes; it acts as a repeater for the input data. The hidden layer consists of 150 neurons, calculated according to the Shibata and Ikeda [10] equation for determining the number of neurons in the hidden layer:

NH = √(NI × NO)    (1)

where NI, NH, and NO are the number of input, hidden, and output neurons, respectively.
The output layer consists of three neurons, which are the (x, y, z) coordinates of the tumor location in the breast phantom. The NN parameters used during the training process are listed in Table 5.
Fig. 3 Block diagram of the proposed FFNN classifier

Table 5 Neural network parameters

Neural network parameter in MATLAB            Values
Number of data in the database for training   118
Number of data in the database for test       39
Number of samples per input                   6400 × 1
Target                                        (x, y, z) location
Number of input layers                        1
Number of hidden layers                       1
Number of output layers                       1
Number of neurons per hidden layer            150
Number of neurons per output layer            3
Type of performance function                  MSE
Network goal                                  0.00001
Training epochs                               600
Training function                             Traingdx
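The paper trains this network in MATLAB with the traingdx function. Purely as an illustration of the Table 5 configuration, the following scikit-learn sketch sets up a comparable feed-forward regressor; it is not the authors' toolchain, and the random arrays stand in for the real feature matrices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(118, 6400))      # stand-ins for the 118 training signals
y_train = rng.uniform(size=(118, 3))        # (x, y, z) tumor locations as targets
X_test = rng.normal(size=(39, 6400))        # 39 test signals

# One hidden layer of 150 neurons, MSE objective, up to 600 epochs,
# momentum plus an adaptive learning rate, loosely mirroring traingdx.
net = MLPRegressor(hidden_layer_sizes=(150,),
                   solver="sgd", momentum=0.9,
                   learning_rate="adaptive", learning_rate_init=0.01,
                   max_iter=600, tol=1e-5, random_state=0)
net.fit(X_train, y_train)
locations = net.predict(X_test)             # estimated (x, y, z) per test signal
print(locations.shape)                      # (39, 3)
```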
Table 6 Performance of the FFNN simulation using FFT algorithm
Performance
Value
P1
0.0118423789129858 (best performance)
P2
6.03249629619895
P3
10.0629516640744
P4
0.458939505998817
P5
4.42813873749897
P6
8.03572395987964
P7
5.56060955441658
P8
17.7973134968194
P9
3.87892183832209
P10
10.3369152726144
P11
4.86018027840713
P12
9.33341875945502
P13
1.68309092004919
P14
12.6897255423410
P15
6.50882595248705
P16
2.46879816354843
P17
8.34185053029358
P18
10.0158053141539
P19
0.717192038045963
P20
5.41881414393784
P21
2.82100261982881
P22
12.0397595033299
P23
6.87242307684039
869
Table 5 lists the number of signals used for training and testing, the number of samples per scattered signal, the desired outputs for each input signal (the target being the tumor location in the phantom), the number of input, hidden, and output neurons of the created NN, the type of performance function, the network goal, the number of epochs, and the training function. The training function used is traingdx, which updates the weight and bias values according to gradient descent with momentum and an adaptive learning rate. As given in Table 6, the network with the lowest performance value is the best network and gives the actual location of the tumor. The performance is calculated using a function available in MATLAB2019b after testing the created neural network. The scattered received signal with a tumor is shown in Fig. 4, and the scattered received signal without a tumor is shown in Fig. 5; the sampled signal is then used to create the FFNN
Fig. 4 Scattered received signal with tumor [8]
Fig. 5 Scattered received signal without tumor [8]
for the training process. Statistical calculations for the test data signal and the trained data signal, where the mean, variance, and standard deviation are calculated for both, are given in Table 7. The architecture of the neural network in MATLAB2019b for training the FFNN with FFT feature extraction and traingdx as the training function is shown in Fig. 6. The Neural Network panel displays the structure of the neuron arrangement and the input, hidden, and output patterns. The Algorithms panel shows the details of the algorithms used for the complete training process; gradient descent with momentum and an adaptive learning rate has been used. To analyze performance, the mean square error (MSE) was used
Table 7 Statistical calculations for sample test signal and trained data for the FFT algorithm, and total training time

Mean of signal in the database                     1.4553 × 10^5
Mean of the test signal                            1.4704 × 10^5
Mean error                                         1510 (1.03%)
Variance of signal in the database                 7.6028 × 10^11
Variance of the test signal                        7.4384 × 10^11
Variance error                                     1.644 × 10^10 (2.1%)
Standard deviation of the signal in the database   8.7194 × 10^5
Standard deviation of the test signal              8.6246 × 10^5
Standard deviation error                           9480 (1.08%)
Training time in seconds                           7.872490878 × 10^2
during training. The number of epochs assumed is 600. The progress panel displays the details of this training process, indicating the number of iterations currently running (285 iterations for the trained network) and the time taken to complete the training, which is 39 s at the end of training. The performance, indicating how much the errors were minimized during training, is shown in Fig. 7: the X-axis indicates the number of iterations (285 epochs), the Y-axis represents the MSE for each iteration, and the curve on the graph represents the training results. A sample of the training performance progress for the trained network is shown in Fig. 7, where the best performance appears at epoch 285 (8.8673 × 10^−6). The training performance is measured using the Mean Square Error (MSE) between the actual output of the neural network and the desired output. In Fig. 8, the gradient is the value of the backpropagation gradient at each iteration on a logarithmic scale. Epoch 285 means that the bottom of the local minimum of the goal function has been reached. Figure 9 represents the training regression, which is the relationship between the actual output and the desired output of the neural network.
In the DCT algorithm, a feed-forward neural network is created in MATLAB2019b and trained using the DCT-extracted features. The signal is tested using the test database and the performance is calculated, with the best performance given in Table 8, which yields the classified signal and tumor location. The statistical calculations for this case are given in Table 9, which includes the mean, variance, and standard deviation of the tested and trained data. The architecture of the neural network in MATLAB2019b for training the FFNN with DCT feature extraction and traingdx as the training function is shown in Fig. 10. The number of epochs is assumed to be 600; see Eq. 1. Figure 11 represents the performance graph, which allows users to know the status of the training process. The best performance appears at epoch 258 (9.8947 × 10^−6), where the training performance is measured using the Mean Square Error (MSE) between the actual output of the neural network and the desired output. Epoch 258 means that the bottom of the local minimum of the goal function has been reached. A sample of the neural network training state using the DCT algorithm is shown in Fig. 12. Figure 13 represents the training regression, which is the relationship between the actual output and the desired output of the neural network for the DCT algorithm.
872
M. B. Tayel andA. F. Kishk
Fig. 6 Sample of the neural network training process and FFT algorithm
In PCA algorithm, feature is extracted using PCA for dimension reduction of data that allow data compression and hence faster processing and low storage memory. Feed-forward neural network using MATLAB is created and trained to classify the location of tumor, with the best performance which is given in Table 10, which will give the classified signal and tumor location. Statistical calculation is given in Table 11, which includes mean, variance, and standard deviation of tested data. The architecture of neural network using MATLAB for the training of FFNN using PCA algorithm is given in Fig. 14. Figure 15 represents the performance graph; best performance appears at epoch 120 and 9.54 × 10−6 . Figure 16 represents a sample of feed-forward neural network training state using the PCA algorithm. Figure 17 represents training regression
Ultra-Wideband Scattered Microwave Signal for Classification …
Fig. 7 Sample of neural network training performance and FFT algorithm
Fig. 8 Sample of neural network training state and FFT algorithm
873
874
M. B. Tayel andA. F. Kishk
Fig. 9 Sample of neural network training regression and FFT algorithm
which is the relationship between the actual output and the desired output of neural network for PCA algorithm.
5 Conclusions

UWB screening is a well-motivated microwave technique for breast tumor detection. The UWB-based detection technique offers advantages over other detection methods, such as low cost, being noninvasive, and being non-ionizing. Proper data analysis for classification using the FFNN gives better performance and speeds up the training process. The study shows that using the FFT algorithm for feature extraction gives better performance compared with the DCT and PCA algorithms. However, the training time for the PCA algorithm is better than for the FFT and DCT algorithms, and the number of features required for training with the PCA algorithm is also lower. Compared to [8], the performance is enhanced using PCA for feature extraction; hence, the proposed FFNN using the PCA algorithm is improved as compared to [8]. PCA provides effective processing of data and an efficient way of compressing
Ultra-Wideband Scattered Microwave Signal for Classification … Table 8 Performance of the FFNN simulation using DCT algorithm
Table 9 Table of statistical calculations for sample test signal and trained data using DCT algorithm
875
Performance
Value
P1
0.0123217941009013 (best performance)
P2
9.79274063685165
P3
19.0610111451186
P4
3.55436197937391
P5
1.25390061702620
P6
14.2924867191725
P7
2.37379591085487
P8
10.7226447839646
P9
0.258122898071586
P10
14.9912369730774
P11
4.50309094138571
P12
7.06958109413300
P13
13.5226324560448
P14
15.6899128340643
P15
1.51054194057211
P16
15.9029090272936
P17
1.51625579951612
P18
6.31776379922306
P19
5.08331499071611
P20
8.97584841353502
P21
4.38590925020738
P22
1.70021490315963
P23
0.539116526008565
Mean of signal in the database
1.6574 × 103
Mean of sample test signal
1.6731 × 103
Mean error
15.7 (0.94%)
Variance of signal in the database
1.1936 × 108
Variance of the sample test signal
1.1680 × 108
Variance error
2.56 × 106 (2.14%)
Standard deviation of the signal in the database
1.0925 × 104
Standard deviation of sample test signal 1.0808 × 104 Standard deviation error
117 (1.07%)
Total training time in second
8.303470494 × 102
876
M. B. Tayel andA. F. Kishk
Fig. 10 Sample of neural network training process using DCT algorithm
and hence speeds up the learning process of the proposed FFNN structure as given in Table 12.
Ultra-Wideband Scattered Microwave Signal for Classification …
Fig. 11 Sample of neural network training regression using DCT algorithm
Fig. 12 Sample of neural network training state using DCT algorithm
877
878
M. B. Tayel andA. F. Kishk
Fig. 13 Sample of neural network training regression using DCT algorithm
Ultra-Wideband Scattered Microwave Signal for Classification … Table 10 Performance of the FFNN simulation using PCA algorithm
Table 11 Table of statistical calculations for sample test signal and trained data using PCA algorithm
879
Performance
Value
P1
0.0118445784962093 (best performance)
P2
7.16289939582103
P3
20.6977935026925
P4
28.3968927422920
P5
4.40789901935253
P6
36.2828019433748
P7
8.95105950678067
P8
9.40096413269188
P9
6.58621466833301
P10
5.27871532473491
P11
34.5832038794441
P12
13.9401608745710
P13
20.6888524939909
P14
10.9685040537834
P15
3.90755991506430
P16
14.8080176762745
P17
2.11805506011642
P18
0.601791003577186
P19
20.1042028428448
P20
1.65400041382903
P21
4.90571581696167
P22
13.8458696325103
P23
2.69090667579248
Mean of signal in the database
0.1525
Mean of the sample test signal
0.1545
Mean error
2 × 10−3 (1.31%)
Variance of signal in the database
0.1841
Variance of the sample test signal
0.1835
Variance error
6 × 10−4 (0.32%)
Standard deviation of the signal in the database
0.4291
Standard deviation of the sample test signal
0.4283
Standard deviation error
8 × 10−4 (0.189%)
Total training time
30.814604 s
880
M. B. Tayel andA. F. Kishk
Fig. 14 Sample of neural network training process using PCA algorithm
Ultra-Wideband Scattered Microwave Signal for Classification …
Fig. 15 Sample of neural network training regression using PCA algorithm
Fig. 16 Sample of neural network training state using PCA algorithm
881
882
M. B. Tayel andA. F. Kishk
Fig. 17 Sample of neural network training regression using PCA algorithm

Table 12 Comparison of three algorithms

Parameters                     FFT algorithm   DCT algorithm   PCA algorithm
MSE                            0.01184         0.01232         0.01184
Training time (×10^2 s)        7.87            8.30            0.3081
Mean error (%)                 1.03            0.94            1.31
Variance error (%)             2.1             2.14            0.32
Standard deviation error (%)   1.08            1.07            0.189
References

1. Grosenick, D., Rinneberg, H., Cubeddu, R., & Taroni, P. (2016). Review of optical breast imaging and spectroscopy. Journal of Biomedical Optics, 21(9), 091311.
2. Alshehri, S. A., & Khatun, S. (2009). UWB imaging for breast cancer detection using neural network. Progress in Electromagnetics Research C, 7, 79–93.
3. ITU-R. (2018). R-REC-SM.1755: Characteristics of ultra-wideband technology, no. 2006.
4. Niemelä, V., Haapola, J., Hämäläinen, M., & Iinatti, J. (2017). An ultra wideband survey: Global regulations and impulse radio research based on standards. IEEE Communications Surveys and Tutorials, 19(2), 874–890.
5. Wang, L. (2017). Early diagnosis of breast cancer. Sensors (Switzerland), 17(7).
6. Benign and malignant tumors: How do they differ? Available: https://www.healthline.com/health/cancer/difference-between-benign-and-malignant-tumors#key-differences.
7. Ultra-Wide Band. Accessed March 27, 2020. Available: https://www.itu.int/itunews/manager/display.asp?lang=en&year=2004&issue=09&ipage=ultrawide&ext=html.
8. Hassan, N. A., Yassin, A. H., Tayel, M. B., & Mohamed, M. M. (2016). Ultra-wideband scattered microwave signals for detection of breast tumors using artificial neural networks. In 2016 3rd International Conference on Artificial Intelligence and Pattern Recognition, AIPR 2016 (pp. 137–142).
9. (2011). Progress in Electromagnetics Research, 116, 221–237.
10. Vujičić, T., & Matijević, T. (2016). Comparative analysis of methods for determining number of hidden neurons in artificial neural network. In Central European Conference on Information and Intelligent Systems (pp. 219–223).
Retraction Note to: Using Bidirectional LSTMs with Attention for Categorization of Toxic Comments Zubin Tobias and Suneha Bose
Retraction Note to: Chapter "Using Bidirectional LSTMs with Attention for Categorization of Toxic Comments" in: A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_49. The Editor has retracted this conference paper because it contains material that substantially overlaps with the following article [1]. All authors agree to this retraction. [1] Baumer, M., & Ho, A. (2018). Toxic Comment Categorization using Bidirectional LSTMs with Attention. In CS224n Final Project Reports.
The retracted version of this chapter can be found at https://doi.org/10.1007/978-981-16-2594-7_49
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7_71
C1
Author Index
A Acharjya, Debi Prasanna, 343 Agarkhed, Jayashree, 497 Agarwal, Gauri, 713 Agarwal, Ishank, 701 Ahmad, Tanvir, 203 Ahmed, Zeyad A. T., 313 Ajay Ram, B., 649 Ajitha, 239 Akter, Laboni, 71 Alagh, Richa, 691 Alhichri, Haikel, 365 Aljebry, Amel F., 411 Al-madani, Ali Mansour, 313 Alqahtani, Yasmine M., 411 Alquzi, Sahar, 365 Alsellami, Belal, 355 Alsubari, Saleh Nagi, 313 Aman, 275 Anandhi, T., 143 Anfal, Farheen, 43 Anitha, J., 297 Anoop, H. A., 431 Anshula, 619 Aruna, P., 473 Avasthi, Anupama, 213 Avasthi, Sandhya, 343 Aysha, Saima, 737
B Baig, Mirza Moiz, 509 Bajaj, Yugam, 159 Bazi, Yakoub, 365 Belfin, R. V., 297
Bhakuni, Abhijit Singh, 691 Bharadwaj, Aniket, 565 Bhatia, Manjot Kaur, 757 Bhati, Sheetal, 193 Bhavya, K. R., 387 Bhowmik, Sudipa, 135 Bindu, V. R., 849 Bisht, Narendra, 691 Bongale, Pratiksha, 633 Boraik, Omar Ali, 553 Bose, Suneha, 595
C Chakraborty, Pinaki, 193 Chandrashekara, S., 607 Chauhan, Ritu, 343 Chavva, Subba Reddy, 399
D Dabas, Mayank, 727 Dagdi, Himanshu, 275 Dahiya, Nishthavan, 727 Deepa, D., 143 Deshmukh, Prapti D., 355 Dhondge, Swaraj, 249 Doja, M. N., 203 Dubey, Ishika, 679
E Ema, Romana Rahman, 43, 57
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1387, https://doi.org/10.1007/978-981-16-2594-7
885
886
Author Index
F Ferdib-Al-Islam, 71
Kumar, Sukhwant, 135 Kumawat, Soma, 485
G Gangwar, Kushagra, 679 Garg, Kumkum, 485 Garg, Neeraj, 525 Gatate, Veeranna, 497 Gayatri, Krishna S., 213 George Washington, D., 81 Gopinath, Geethu, 213 Gosain, Anjana, 459 Gowri, S., 239 Goyal, Ashwin, 123 Goyal, Kartik, 575 Gupta, Aakanshi, 105
L Lakshmanarao, A., 11, 649 Lavanya, B., 443 Le, Thanh, 1 Lingareddy, S. C., 387 Loganathan, D., 285
I Imran, Shahid, 827 Islam, Tajul, 43, 57
J Jadhav, Mukti E., 313 Jagdale, Jayashree, 249 Jagdish, 275 Jain, Astha, 585 Jain, Rachna, 123 Jain, Shreya, 459 Jinila, Bevish, 239 Jose, Jithina, 239 Juliet, Sujitha, 797 Juneja, Pradeep Kumar, 691
K Kantharaj, V., 387 Kaur, Harjeet, 375 Kaur, Ishleen, 203 kaur, Jasanpreet, 263 Kaur, Mandeep, 827 Kaur, Tajinder, 785 Kaushik, Nainika, 757 Kirubakaran, E., 797 Kishk, Ahmed F., 863 Kumar, A. S. Mahesh, 607 Kumar, Ambeshwar, 173 Kumar, Anil, 785 Kumar, Bijendra, 29 Kumari, Latesh, 575 Kumar, Kartik, 827 Kumar, Prabhat, 839
M Maiya, Anirudh, 95 Malakar, Priyanka, 135 Malhotra, Puru, 159 Mallikarjunaswamy, M. S., 607 Manikandan, R., 173 Mathew, Leeja, 849 Maurya, Sudhanshu, 691 Mishra, Shruti, 263
N Nagrath, Preeti, 123 Naiem, Rakshanda, 263 Naim, Forhad An, 325 Nainan, Aishwarya, 297 Nand, Parma, 827 Nanthini, N., 473 Narayanan, Praveena, 285 Nguyen, Quoc Hung, 1 Nirmala, C. R., 633 Noorunnisha, M., 81
P Palivela, Hemant, 633 Parida, Priyansi, 771 Phulli, Kritika, 105 Poddar, Prerana G., 431 Poonguzhali, S., 143 Pradhan, Chittaranjan, 771 Pradhan, Rahul, 575, 679, 713 Pranto, Sk. Arifuzzaman, 43, 57 Puri, Kartik, 123 Pushparaj, Pratish, 727 Puviarasan, N., 473
R Rahman, Md. Masudur, 43, 57 Raihan, M., 57 Rakesh, Nitin, 827
Author Index Rani, J. A. Esther, 797 Rathi, Bhawana, 213 Rautela, Kamakshi, 691 Ravikumar, M., 553 Reshmy, A. K., 81 Revathi, G. P., 183 Rizvi, Syed Afzal Murtaza, 659 S Sachdeva, Shruti, 29 Saini, Anu, 585 Sangam, Ravi Sankar, 399 Sanjana, R. R., 633 Santosh Kumar, D. J., 649 Sasipriya, G., 443 Satao, Madhura, 249 Satapathy, Santosh Kumar, 285 Saxena, Ankur, 263 Sen, Pushpita, 135 Senthil Kumar, D., 81 Shahajalal, Md., 57 Sharathkumar, S., 285 Shareef, Ahmed Abdullah A., 313 Sharma, Deepanshu, 105 Sharma, Rohan, 183 Sharma, Shashi, 485 Sharma, Shweta, 459 Shashidhara, R., 387 Shewale, Rashmi, 249 Shidaganti, Ganeshayya, 275 Shreya, J. L., 585 Shylaja, S. S., 95 Singh, Aditi, 375 Singh, Anjali, 827 Singh, Deepti, 713 Singh, Hukum, 619 Singh, Jainendra, 537 Singh, M. P., 839 Sinha, Akash, 839 Sivasangari, A., 143, 239 Sonekar, Shrikant V., 509
887 Sonti, V. J. K. Kishor, 143 Srinivasa Ravi Kiran, T., 11 Srisaila, A., 11 Srivastava, Shreya, 827 Sudarsan, Shreyas, 183 Sulaiman, Norrozila, 411, 421 Sunori, Sandeep Kumar, 691 Suvarna, Kartik, 183
T Tandon, Chahat, 633 Taneja, Manik, 525 Taneja, Shweta, 193 Tarun, Shrimali, 737 Tawfik, Mohammed, 313 Tayel, Mazhar B., 863 Telkar, Arpita, 633 Thomas, Lycia, 297 Thompson, Aderonke F., 807 Tobias, Zubin, 595 Truong, Viet Phuong, 1 Tyagi, Neha, 827
V Varshney, Jyoti, 575 Vashisht, Rohit, 659 Vimali, J. S., 239 Vo, Ha Quang Dinh, 1
Y Yadav, Arun Kumar, 565 Yadav, Divakar, 565 Yadav, Ikesh, 275
Z Zaheeruddin, 537 Zoraida, B. Smitha Evelin, 797