English Pages XXIX, 1180 [1152] Year 2021
Advances in Intelligent Systems and Computing 1165
Deepak Gupta · Ashish Khanna · Siddhartha Bhattacharyya · Aboul Ella Hassanien · Sameer Anand · Ajay Jaiswal Editors
International Conference on Innovative Computing and Communications Proceedings of ICICC 2020, Volume 1
Advances in Intelligent Systems and Computing Volume 1165
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Nikhil R. Pal, Indian Statistical Institute, Kolkata, India Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba Emilio S. Corchado, University of Salamanca, Salamanca, Spain Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil Ngoc Thanh Nguyen , Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **
More information about this series at http://www.springer.com/series/11156
Deepak Gupta · Ashish Khanna · Siddhartha Bhattacharyya · Aboul Ella Hassanien · Sameer Anand · Ajay Jaiswal
Editors
International Conference on Innovative Computing and Communications Proceedings of ICICC 2020, Volume 1
Editors
Deepak Gupta, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Ashish Khanna, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, Karnataka, India
Aboul Ella Hassanien, Department of Information Technology, Faculty of Computers and Information, Cairo University, Giza, Egypt
Sameer Anand, Department of Computer Science, Shaheed Sukhdev College of Business Studies, University of Delhi, Rohini, Delhi, India
Ajay Jaiswal, Department of Computer Science, Shaheed Sukhdev College of Business Studies, University of Delhi, Rohini, Delhi, India
ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-981-15-5112-3 ISBN 978-981-15-5113-0 (eBook) https://doi.org/10.1007/978-981-15-5113-0 © Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Dr. Deepak Gupta would like to dedicate this book to his father Sh. R. K. Gupta and his mother Smt. Geeta Gupta for their constant encouragement, to his family members including his wife, brothers, sisters and kids, and to the students close to his heart. Dr. Ashish Khanna would like to dedicate this book to his mentors Dr. A. K. Singh and Dr. Abhishek Swaroop for their constant encouragement and guidance, and to his family members including his mother, wife and kids. He would also like to dedicate this work to his (Late) father Sh. R. C. Khanna with folded hands for his constant blessings. Prof. (Dr.) Siddhartha Bhattacharyya would like to dedicate this book to his father Late Sh. Ajit Kumar Bhattacharyya, his mother Late Smt. Hashi Bhattacharyya, his beloved wife Rashni, and his colleagues Jayanta Biswas and Debabrata Samanta.
Prof. (Dr.) Aboul Ella Hassanien would like to dedicate this book to his wife Azza Hassan El-Saman. Dr. Sameer Anand would like to dedicate this book to his Dada Prof. D. C. Choudhary, his beloved wife Shivanee and his son Shashwat. Dr. Ajay Jaiswal would like to dedicate this book to his father Late Prof. U. C. Jaiswal, his mother Brajesh Jaiswal, his beloved wife Anjali, his daughter Prachii and his son Sakshaum.
ICICC-2020 Steering Committee Members
Patrons Dr. Poonam Verma, Principal, SSCBS, University of Delhi Prof. Dr. Pradip Kumar Jain, Director, National Institute of Technology Patna, India
General Chairs Prof. Dr. Siddhartha Bhattacharyya, Christ University, Bengaluru Dr. Prabhat Kumar, National Institute of Technology Patna, India
Honorary Chairs Prof. Dr. Janusz Kacprzyk, FIEEE, Polish Academy of Sciences, Poland Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava, Czech Republic
Conference Chairs Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt Prof. Dr. Joel J. P. C. Rodrigues, National Institute of Telecommunications (Inatel), Brazil Prof. Dr. R. K. Agrawal, Jawaharlal Nehru University, Delhi
Technical Program Chairs Prof. Dr. Victor Hugo C. de Albuquerque, Universidade de Fortaleza, Brazil Prof. Dr. A. K. Singh, National Institute of Technology, Kurukshetra Prof. Dr. Anil K Ahlawat, KIET Group of Institutes, Ghaziabad
Editorial Chairs Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi Dr. Arun Sharma, Indira Gandhi Delhi Technical University for Women, Delhi Prerna Sharma, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Conveners Dr. Ajay Jaiswal, SSCBS, University of Delhi Dr. Sameer Anand, SSCBS, University of Delhi Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi Dr. Gulshan Shrivastava, National Institute of Technology Patna, India
Publication Chairs Prof. Dr. Neeraj Kumar, Thapar Institute of Engineering and Technology Dr. Mohamed Elhoseny, University of North Texas Dr. Hari Mohan Pandey, Edge Hill University, UK Dr. Sahil Garg, École de technologie supérieure, Université du Québec, Montreal, Canada
Publicity Chairs Dr. M. Tanveer, Indian Institute of Technology, Indore, India Dr. Jafar A. Alzubi, Al-Balqa Applied University, Salt, Jordan Dr. Hamid Reza Boveiri, Sama College, IAU, Shoushtar Branch, Shoushtar, Iran
Co-convener Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India
Organizing Chairs Dr. Kumar Bijoy, SSCBS, University of Delhi Dr. Rishi Ranjan Sahay, SSCBS, University of Delhi
Organizing Team Dr. Gurjeet Kaur, SSCBS, University of Delhi Dr. Aditya Khamparia, Lovely Professional University, Punjab, India Dr. Abhimanyu Verma, SSCBS, University of Delhi Dr. Onkar Singh, SSCBS, University of Delhi Kalpna Sagar, KIET Group of Institutes, Ghaziabad
Preface
We are delighted to announce that Shaheed Sukhdev College of Business Studies, New Delhi, in association with the National Institute of Technology Patna and the University of Valladolid, Spain, has hosted the eagerly awaited International Conference on Innovative Computing and Communication (ICICC-2020). The third edition of the conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, with abstracts received from more than 3,200 authors from different parts of the world. The committee of professionals dedicated to the conference strove to achieve a high-quality technical program with tracks on Innovative Computing, Innovative Communication Network and Security, and Internet of Things. All the tracks chosen for the conference are interrelated and are very popular among the present-day research community; consequently, a great deal of research is under way in these tracks and their related sub-areas. In keeping with the word 'innovative' in its name, the conference targeted out-of-the-box ideas, methodologies, applications, expositions, surveys and presentations that help to advance the current state of research. More than 800 full-length papers were received, with contributions focused on theoretical work, computer simulation-based research, and laboratory-scale experiments. Amongst these manuscripts, 196 papers have been included in the Springer proceedings after a thorough two-stage review and editing process. All the manuscripts submitted to ICICC-2020 were peer-reviewed by at least two independent reviewers, who were provided with a detailed review proforma. The comments from the reviewers were communicated to the authors, who incorporated the suggestions in their revised manuscripts. The recommendations of the two reviewers were taken into consideration while selecting a manuscript for inclusion in the proceedings.
The exhaustiveness of the review process is evident, given the large number of articles received addressing a wide range of research areas. The stringent review process ensured that each published manuscript met rigorous academic and scientific standards. It is an exalting experience to finally see these elite contributions materialize into two book volumes as the ICICC-2020 proceedings, published by Springer and entitled International Conference on Innovative Computing and Communications.
The articles are organized into two volumes under some broad categories covering machine learning, data mining, big data, networks, soft computing, and cloud computing, although, given the diverse areas of research reported, a clean categorization was not always possible. ICICC-2020 invited six keynote speakers, who are eminent researchers in the field of computer science and engineering, from different parts of the world. In addition to the plenary sessions on each day of the conference, fifteen concurrent technical sessions were held every day to accommodate the oral presentation of around 195 accepted papers. Keynote speakers and session chair(s) for each of the concurrent sessions were leading researchers from the thematic area of the session. A technical exhibition was held on all three days of the conference, displaying the latest technologies, expositions, ideas and presentations. The delegates were provided with a book of extended abstracts to quickly browse through the contents, participate in the presentations and make the work accessible to a broad audience. The research part of the conference was organized in a total of 45 special sessions. These special sessions provided the opportunity for researchers working in specific areas to present their results in a more focused environment. An international conference of such magnitude and the release of the ICICC-2020 proceedings by Springer have been the remarkable outcome of the untiring efforts of the entire organizing team. The success of an event undoubtedly involves the painstaking efforts of several contributors at different stages, dictated by their devotion and sincerity. Fortunately, since the beginning of its journey, ICICC-2020 has received support and contributions from every corner. We thank all who have wished the best for ICICC-2020 and contributed by any means towards its success.
The edited proceedings volumes by Springer would not have been possible without the perseverance of all the steering, advisory and technical program committee members. The organizers of ICICC-2020 owe thanks to all the contributing authors for their interest and exceptional articles. We would also like to thank the authors of the papers for adhering to the time schedule and for incorporating the review comments. We wish to extend our heartfelt acknowledgment to the authors, peer-reviewers, committee members and production staff whose diligent work gave shape to the ICICC-2020 proceedings. We especially want to thank our dedicated team of peer-reviewers who volunteered for the arduous and tedious step of quality checking and critiquing the submitted manuscripts. We wish to thank our faculty colleagues Mr. Moolchand Sharma and Ms. Prerna Sharma for extending their enormous assistance during the conference. The time they spent and the midnight oil they burnt are greatly appreciated, and we will ever remain indebted to them. The management, faculty, and administrative and support staff of the college have always extended their services whenever needed, for which we remain thankful to them.
Lastly, we would like to thank Springer for accepting our proposal for publishing the ICICC-2020 conference proceedings. The help received from Mr. Aninda Bose, the acquisition senior editor, in the process has been very useful.
Rohini, India: Ashish Khanna, Deepak Gupta, Sameer Anand, Ajay Jaiswal
Bengaluru, India: Siddhartha Bhattacharyya
Giza, Egypt: Aboul Ella Hassanien
Organizers, ICICC-2020
About This Book
The International Conference on Innovative Computing and Communication (ICICC-2020) was held on 21–23 February at Shaheed Sukhdev College of Business Studies in association with the National Institute of Technology Patna and the University of Valladolid, Spain. The conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, with papers received from more than 3200 authors from different parts of the world. Only 195 papers have been accepted and registered, an acceptance ratio of 24%, to be published in two volumes of the prestigious Springer Advances in Intelligent Systems and Computing (AISC) series. This volume includes a total of 97 papers.
Contents
Systematic Analysis and Prediction of Air Quality Index in Delhi . . . . Kanika Bhalla, Sangeeta Srivastava, and Anjana Gosain
1
Proof of Game (PoG): A Proof of Work (PoW)’s Extended Consensus Algorithm for Healthcare Application . . . . . . . . . . . . . . . . . . . . . . . . . Adarsh Kumar and Saurabh Jain
23
Analysis of Diabetes and Heart Disease in Big Data Using MapReduce Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Manpreet Kaur Saluja, Isha Agarwal, Urvija Rani, and Ankur Saxena
37
User Interface of a Drawing App for Children: Design and Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Savita Yadav, Pinaki Chakraborty, and Prabhat Mittal
53
Rank Aggregation Using Moth Search for Web . . . . . . . . . . . . . . . . . . Parneet Kaur, Gai-Ge Wang, Manpreet Singh, and Sukhwinder Singh
63
A Predictive Approach to Academic Performance Analysis of Students Based on Parental Influence . . . . . . . . . . . . . . . . . . . . . . . . Deepti Sharma and Deepshikha Aggarwal
75
Handling Class Imbalance Problem in Heterogeneous Cross-Project Defect Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rohit Vashisht and Syed Afzal Murtaza Rizvi
85
A Heterogeneous Dynamic Scheduling Minimize Energy—HDSME . . . Saba Fatima and V. M. Viswanatha
97
A Generic Framework for Evolution of Deep Neural Networks Using Genetic Algorithms . . . . Deepraj Shukla and Upasna Singh
117
Online Economy on the Move: The Future of Blockchain in the Modern Banking System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anushree A. Avasthi
129
A Novel Framework for Distributed Stream Processing and Analysis of Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shruti Arora and Rinkle Rani
147
Turbo Code with Hybrid Interleaver in the Presence of Impulsive Noise . . . . V. V. Satyanarayana Tallapragada, M. V. Nagabhushanam, and G. V. Pradeep Kumar
163
Performance Evaluation of Weighted Fair Queuing Model for Bandwidth Allocation . . . . Shapla Khanam, Ismail Ahmedy, and Mohd Yamani Idna Idris
175
Modified Bio-Inspired Algorithms for Intrusion Detection System . . . . Moolchand Sharma, Shachi Saini, Sneha Bahl, Rohit Goyal, and Suman Deswal
185
An Analysis on Incompetent Search Engine and Its Search Engine Optimization (SEO) . . . . Nilesh Kumar Jadav and Saurabh Shrivastava
203
Infotainment System Using CAN Protocol and System on Module with Qt Application for Formula-Style Electric Vehicles . . . . Rahul M. Patil, K. P. Chethan, Rahul Ramaprasad, H. K. Nithin, and Srujan Rangayyan
215
A Novel Approach for SQL Injection Avoidance Using Two-Level Restricted Application Prevention (TRAP) Technique . . . . Anup Kumar, Sandeep Rai, and Rajesh Boghey
227
Prediction of Cardiovascular Disease Through Cutting-Edge Deep Learning Technologies: An Empirical Study Based on TENSORFLOW, PYTORCH and KERAS . . . . Mudasir Ashraf, Syed Mudasir Ahmad, Nazir Ahmad Ganai, Riaz Ahmad Shah, Majid Zaman, Sameer Ahmad Khan, and Aftab Aalam Shah
239
Study and Analysis of Time Series of Weather Data of Classification and Clustering Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rashmi Bhardwaj and Varsha Duhoon
257
Convection Dynamics of Nanofluids for Temperature and Magnetic Field Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rashmi Bhardwaj and Meenu Chawla
271
A Lightweight Secure Authentication Protocol for Wireless Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nitin Verma, Abhinav Kaushik, and Pinki Nayak
291
Movie Recommendation Using Content-Based and Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Priyanka Meel, Farhin Bano, Agniva Goswami, and Saloni Gupta
301
Feature Selection Algorithms and Student Academic Performance: A Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chitra Jalota and Rashmi Agrawal
317
MobiSamadhaan—Intelligent Vision-Based Smart City Solution . . . . . Mainak Chakraborty, Alik Pramanick, and Sunita Vikrant Dhavale
329
Students’ Performance Prediction Using Feature Selection and Supervised Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . Juhi Gajwani and Pinaki Chakraborty
347
Modified AMI Modulation Scheme for High-Speed Bandwidth Efficient Optical Transmission Systems . . . . . . . . . . . . . . . . . . . . . . . . . Abhishek Khansali, M. K. Arti, Soven K. Dana, and Manoranjan Kumar
355
A Safe Road to Health: Medical Services Using Unmanned Aerial Vehicle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. Monica Dev and Ramachandran Hema
367
Building English–Punjabi Parallel Corpus for Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simran Jolly and Rashmi Agrawal
377
Performance of MIMO Systems with Perfect and Imperfect CSI . . . . . Divya Singh, Shelesh Krishna Saraswat, and Aasheesh Shukla
387
Concealing the Confidential Information Using LSB Steganography Techniques in Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ravi Kumar and Namrata Singh
397
FastV2C-HandNet: Fast Voxel to Coordinate Hand Pose Estimation with 3D Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . Rohan Lekhwani and Bhupendra Singh
413
Hybrid Apparel Recommendation System Based on Weighted Similarity of Brand and Colour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Priyanka Meel, Puneet Chawla, Sahil Jain, and Utkarsh Rai
427
Abusive Comments Classification in Social Media Using Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. R. Janardhana, Asha B. Shetty, Madhura N. Hegde, Jayapadmini Kanchan, and Anjana Hegde
439
To Find the Best-Suited Model for Sentiment Analysis of Real-Time Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ritesh Dutta
445
Comparative Analysis of Different Balanced Truncation Techniques of Model Order Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ankur Gupta and Amit Kumar Manocha
453
Development of a Real-Time Pollution Monitoring System for Green Auditing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ankush Garg, Bhumika Singh, Ekta, Joyendra Roy Biswas, Kartik Madan, and Parth Chopra
465
Understanding and Implementing Machine Learning Models with Dummy Variables with Low Variance . . . . . . . . . . . . . . . . . . . . . Sakshi Jolly and Neha Gupta
477
Assessment of Latent Fingerprint Image Quality Based on Level 1, Level 2, and Texture Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Diwakar Agarwal and Atul Bansal
489
Breast Cancer Detection Using Deep Learning and Machine Learning: A Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alpna Sharma, Barjesh Kochar, Nisheeth Joshi, and Vinay Kumar
503
Tweets About Self-Driving Cars: Deep Sentiment Analysis Using Long Short-Term Memory Network (LSTM) . . . . . . . . . . . . . . . Anandi Dutta and Subasish Das
515
A Comparison of Pre-trained Word Embeddings for Sentiment Analysis Using Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P. Santosh Kumar, Rakesh Bahadur Yadav, and Sunita Vikrant Dhavale
525
Enhanced and Efficient Multilayer MAC Protocol for M2M Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mariyam Ouaissa, Mariya Ouaissa, and Abdallah Rhattoy
539
Comparative Study of the Support Vector Machine with Two Hyper-Parameter Optimization Methods for the Prediction of FSHD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Babita Pandey and Devendra Kumar Pandey
549
Missing Value Imputation Approach Using Cosine Similarity Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wajeeha Rashid, Sakshi Arora, and Manoj Kumar Gupta
557
Text-Independent Voice Authentication System Using MFCC Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nandini Sethi and Dinesh Kumar Prajapati
567
Design of a Monopole Antenna for 5.8 GHz ISM Band Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . M. Sudhakar, Doondi Kumar Janapala, and M. Nesasudha
579
Recognition of Brain Tumor Using Fully Convolutional Neural Network-Based Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ambeshwar Kumar and R. Manikandan
587
SD-6LN: Improved Existing IoT Framework by Incorporating SDN Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rohit Kumar Das, Arnab Kumar Maji, and Goutam Saha
599
Hybrid RF/MIMO-FSO Relaying Systems Over Gamma–Gamma Fading Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anu Goel and Richa Bhatia
607
Enhancement of Degraded CCTV Footage for Forensic Analysis . . . . . A. Vinay, Aditya Lokesh, Vinayaka R. Kamath, K. N. B. Murty, and S. Natarajan
617
Sign Language Recognition Using Microsoft Kinect . . . . . . . . . . . . . . . Simran Kaur, Akshit Gupta, Ashutosh Aggarwal, Deepak Gupta, and Ashish Khanna
637
Peer-to-Peer Communication Using LoRa Technology . . . . . . . . . . . . . Farzil Kidwai, Aakash Madaan, Sahil Bansal, and Aaditya Sahu
647
Empirical Analysis of Classification Algorithms in Data Stream Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Aastha Masrani, Madhu Shukla, and Kishan Makadiya
657
Classifying and Measuring Hate Speech in Twitter Using Topic Classifier of Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F. H. A. Shibly, Uzzal Sharma, and H. M. M. Naleer
671
IoT-Based System to Measure Soil Moisture Using Soil Moisture Sensor, GPS Data Logging and Cloud Storage . . . . Ayushi Johri, Anchal, Rishi Prakash, Anurag Vidyarthi, Vivek Chamoli, and Sharat Bhardwaj
679
CAD Diagnosis by Predicting Stenosis in Arteries Using Data Mining Process . . . . Akansha Singh and Ashish Payal
689
Analyzing Mental Health Diseases in a Spanish Region Using Software Based on Graph Theory Algorithms . . . . Susel Góngora Alonso, Andrés de Bustos Molina, Sofiane Hamrioui, Miguel López Coronado, Manuel Franco Martín, Ashish Khanna, and Isabel de la Torre Díez
701
Bug Localization Using Multi-objective Approach and Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Akanksha Sood, Shubharika Sharma, Ashish Khanna, Ajay Tiwari, Deepak Gupta, Vaibhav Madan, and Srinath Doss
709
SMART EV: Dynamic Pricing and Waiting Period Prediction at EV Charging Station . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tanish Bhola and Satvik Kaul
725
In Silico Analysis of Protein–Protein Interactions Between Estrogen Receptor and Fungal Laccase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nawaid Zaman, Akansha Shukla, Shazia Rashid, and Seneha Santoshi
737
Exploring Blockchain Mining Mechanism Limitations . . . . . . . . . . . . . Noha M. Hamza, Shimaa Ouf, and Ibrahim M. El-Henawy
749
Emotion Recognition Using Q-KNN: A Faster KNN Approach . . . . . . Preeti Kapoor and Narina Thakur
759
Integrating Behavioral Analytics with LSTM to Get Stock Predictions with Increased Accuracy . . . . Chetan Sirohi, Sakar Jain, Jatin Jha, and Visharad Vashist
769
Blockchain-Based Boothless E-Voting System . . . . Ngangbam Indrason, Wanbanker Khongbuh, and Goutam Saha
779
Eye Blinking Classification Through NeuroSky MindWave Headset Using EegID Tool . . . . Mridu Sahu, Praveen Shukla, Aditya Chandel, Saloni Jain, and Shrish Verma
789
Position and Velocity Errors Reduction Using Advanced TERCOM Aided Inertial Navigation System . . . . Rajendra Naik Bhukya, Swetha Kodepaka, Vennela Dharavath, and Ravi Boda
801
LightBC: A Lightweight Hash-Based Blockchain for the Secured Internet of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fabiola Hazel Pohrmen and Goutam Saha
811
Analysis of Breast Cancer for Histological Dataset Based on Different Feature Extraction and Classification Algorithms . . . . . . . . . . . . . . . . . Chetna Kaushal and Anshu Singla
821
Design of High-Speed FPGA Based CASU Using CBNS Arithmetic: Extension to CFFT Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Madhumita Mukherjee and Salil Kumar Sanyal
835
Security-Based LEACH Protocol for Wireless Sensor Network . . . . . . A. Sujatha Priyadharshini and C. Arvind
855
Depression Detection Among Social Media Users Using Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Prashant Verma, Kapil Sharma, and Gurjit Singh Walia
865
Routing in Internet of Things Using Cellular Automata . . . . . . . . . . . . Ehsan Heidari, Ali Movaghar, Homayun Motameni, and Esmaeil Homayun
875
Robust Approach for Emotion Classification Using Gait . . . . . . . . . . . Sahima Srivastava, Vrinda Rastogi, Chandra Prakash, and Dimple Sethi
885
Utility of Game Theory in Defensive Wireless Sensor Networks (WSNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohd Adnan, Toa Yang, Sayekat Kumar Das, and Tazeem Ahmad
895
Security in Cloud Computing for Sensitive Data: Challenges and Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marley Boniface Paul and Uzzal Sharma
905
Delay Based Multi-controller Placement in Software Define Tactical Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sayekat Kumar Das, Han Zhenzhen, and Mohd Adnan
919
Human Emotion Recognition from Spontaneous Thermal Image Sequence Using GPU Accelerated Emotion Landmark Localization and Parallel Deep Emotion Net . . . . Chirag Kumar Kyal, Harsh Poddar, and Motahar Reza
931
Face Recognition-Based Automated Attendance System . . . . Kunjal Shah, Dhanashree Bhandare, and Sunil Bhirud
945
Dark web Activity on Tor—Investigation Challenges and Retrieval of Memory Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arjun Chetry and Uzzal Sharma
953
Application of Stochastic Approximation for Self-tuning of PID in Unmanned Surface Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rupam Singh and Bharat Bhushan
965
State of Charge Estimation Using Data-Driven Techniques for Storage Devices in Electric Vehicles . . . . . . . . . . . . . . . . . . . . . . . . Rupam Singh, Mohammed Ali Khan, and V. S. Bharath Kurukuru
975
Elman and Jordan Recurrence in Convolutional Neural Networks Using Attention Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sweta Kumari, S. Aravindakshan, and V. Srinivasa Chakravarthy
983
mHealth for Mental Health . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mohit Saxena, Anveshita Deo, and Ankur Saxena
995
Optimization Algorithm-Based Artificial Neural Network Control of Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vishal Srivastava and Smriti Srivastava
1007
CuraBand: Health Monitoring and Warning System . . . . . . . . . . . . . . Sopan Phaltankar, Kirti Tyagi, Meghna Prabhu, Pranav Jaguste, Shubham Sahu, and Dhananjay Kalbande
1017
A Hybridized Auto-encoder and Convolution Neural Network-Based Model for Plant Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sk Mahmudul Hassan and Arnab Kumar Maji
1027
Driver Fatigue and Distraction Analysis Using Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rachit Rathi, Amey Sawant, Lavesh Jain, and Sukanya Kulkarni
1037
Multimodal Medical Image Fusion Based on Discrete Wavelet Transform and Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jayant Bhardwaj, Abhijit Nayak, and Deepak Gambhir
1047
A Primer on Opinion Mining: The Growing Research Area . . . . . . . . Mikanshu Rani and Jaswinder Singh
1059
Bow-Tie Shaped Meander Line UWB Antenna for Underwater Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sajjan Kumar Jha, Priyadarshi Suraj, and Ritesh Kr. Badhai
1077
Dorsal Hand Vein-Biometric Recognition Using Convolution Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rajendra Kumar, Ram Chandra Singh, and Shri Kant
1087
Modern Automobile Adaptive Cruise Control . . . . . . . . . . . . . . . . . . . R. Bharathi, Sunanda Dixit, and R. Bhagya
1109
KNN-Based Classification and Comparative Study of Multispectral LISS-III Satellite Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anand Upadhyay, Dhwani Upadhyay, and Axali Gundaraniya
1123
Hybrid Data-Level Techniques for Class Imbalance Problem . . . . . . . Anjana Gosain, Arushi Gupta, and Deepika Singh
1131
Drug Repositioning Based on Heterogeneous Network Inference . . . . . K. Deepthi and A. S. Jereesh
1143
Study and Classification of Recommender Systems: A Survey . . . . . . . Mugdha Sharma, Laxmi Ahuja, and Vinay Kumar
1153
Design of Ensemble Learning Model to Diagnose Malaria Disease Using Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . Raghavendra Kumar, Akhil Gupta, and Ashish Mishra
1165
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1177
About the Editors
Dr. Deepak Gupta is an eminent academician who balances lectures, research, publications, consultancy, community service, and Ph.D. and postdoctoral supervision, among other roles. With 12 years of teaching experience and two years in industry, he focuses on rational and practical learning. He has contributed extensively to the literature in the fields of human–computer interaction, intelligent data analysis, nature-inspired computing, machine learning and soft computing. He has served as Editor-in-Chief, Guest Editor and Associate Editor of SCI and various other reputed journals. He completed his postdoc at Inatel, Brazil, and his Ph.D. at Dr. APJ Abdul Kalam Technical University. He has authored/edited 33 books with national/international publishers (Elsevier, Springer, Wiley, Katson). He has published 105 scientific research publications in reputed international journals and conferences, including 53 papers in SCI-indexed journals of IEEE, Elsevier, Springer, Wiley and others. He is the convener and organizer of the ‘ICICC’ Springer conference series.
Dr. Ashish Khanna has 16 years of expertise in teaching, entrepreneurship, and research and development. He received his Ph.D. degree from the National Institute of Technology, Kurukshetra, and completed his M.Tech. and B.Tech. at GGSIPU, Delhi. He completed his postdoc at the Internet of Things Lab at Inatel, Brazil, and the University of Valladolid, Spain. He has published around 45 SCI-indexed papers in IEEE Transactions, Springer, Elsevier, Wiley and other reputed journals, with a cumulative impact factor above 100. He has around 100 research articles in top SCI/Scopus journals, conferences and book chapters. He is co-author of around 20 edited books and textbooks. His research interests include distributed systems, MANET, FANET, VANET, IoT and machine learning, among others. He is the originator of Bhavya Publications and Universal Innovator Lab.
Universal Innovator is actively involved in research, innovation, conferences, startup funding events and workshops. He has served the research field as a Keynote Speaker/Faculty Resource Person/Session Chair/Reviewer/TPC member/
postdoctorate supervision. He is the convener and organizer of the ICICC conference series. He is currently working at the Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, under GGSIPU, Delhi, India. He is also serving as Series Editor for the Elsevier and De Gruyter publishing houses.
Dr. Siddhartha Bhattacharyya is currently serving as a Professor in the Department of Computer Science and Engineering of Christ University, Bangalore. He is a co-author of 5 books and the co-editor of 50 books and has more than 250 research publications in international journals and conference proceedings to his credit. He has two PCTs to his credit. He has been a member of the organizing and technical program committees of several national and international conferences. His research interests include hybrid intelligence, pattern recognition, multimedia data processing, social networks and quantum computing. He is also a certified Chartered Engineer of the Institution of Engineers (IEI), India. He is on the Board of Directors of the International Institute of Engineering and Technology (IETI), Hong Kong. He is a privileged inventor of NOKIA.
Dr. Aboul Ella Hassanien is the Founder and Head of the Egyptian Scientific Research Group (SRGE). He has more than 1000 scientific research papers published in prestigious international journals and over 50 books covering such diverse topics as data mining, medical images, intelligent systems, social networks and smart environment. Prof. Hassanien has won several awards, including the Best Researcher of the Youth Award of Astronomy and Geophysics of the National Research Institute, Academy of Scientific Research (Egypt, 1990). He was also granted the 2004 scientific excellence award in humanities from the University of Kuwait and received the University Award for scientific superiority (Cairo University, 2013). He was also honored in Egypt as the best researcher at Cairo University in 2013.
He also received the Islamic Educational, Scientific and Cultural Organization (ISESCO) prize on Technology (2014) and the State Award for excellence in engineering sciences (2015). He was awarded the medal of Sciences and Arts of the first class by the President of the Arab Republic of Egypt in 2017. Professor Hassanien was awarded the international Scopus Award for meritorious research contribution in the field of computer science (2019).
Dr. Sameer Anand is currently working as an Assistant Professor in the Department of Computer Science at Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He received his M.Sc., M.Phil. and Ph.D. (Software Reliability) from the Department of Operational Research, University of Delhi. He is a recipient of the ‘Best Teacher Award’ (2012) instituted by the Directorate of Higher Education, Government of NCT, Delhi. His research interests include operational research, software reliability and machine learning. He has completed an Innovation project from the University of Delhi. He has worked in different capacities in international conferences. Dr. Anand has published several papers in reputed journals such as IEEE Transactions on Reliability, International Journal of
Production Research (Taylor & Francis) and International Journal of Performability Engineering. He is a member of the Society for Reliability Engineering, Quality and Operations Management. Dr. Sameer Anand has more than 16 years of teaching experience.
Dr. Ajay Jaiswal is currently serving as an Assistant Professor in the Department of Computer Science of Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He is co-editor of two books/journals and co-author of dozens of research publications in international journals and conference proceedings. His research interests include pattern recognition, image processing and machine learning. He has completed an interdisciplinary project titled ‘Financial Inclusion-Issues and Challenges: An Empirical Study’ as Co-PI. This project was awarded by the University of Delhi. He obtained his master's degree from the University of Roorkee (now IIT Roorkee) and his Ph.D. from Jawaharlal Nehru University, Delhi. He is a recipient of the Best Teacher Award from the Government of NCT of Delhi. He has more than nineteen years of teaching experience.
Systematic Analysis and Prediction of Air Quality Index in Delhi
Kanika Bhalla, Sangeeta Srivastava, and Anjana Gosain
Abstract Pollution refers to the adulteration of the atmosphere with substances that interfere with nature and hence affect human health, in some cases critically. Of late, the quality of air in urban places has been found to be unfavorable, and Delhi is no exception. The National Air Quality Standards stress that the main pollutants are particulate matter (PM). We have carried out a thorough analysis of past air quality data for certain areas of Delhi-NCR. Broadly, the results reveal that the Air Quality Index (AQI) was high during the winters, low during the monsoons and average during the summers. The analysis was done to identify the variations in air quality and their effects on human health. We have further applied this data to an algorithm that predicts future air quality, so that the predicted data can be used to take measures to control air pollution and prevent the resulting hazards. This paper considers the air quality data for the three most polluted areas of Delhi, i.e. ITO, Anand Vihar and Jahangirpuri. If the future air quality in a given area can be predicted, the major issues of air pollution can be curtailed beforehand and critical health-related problems can be prevented.
Keywords Pollution · Prediction · Air quality index analysis · Holt–Winter algorithm · Air quality index prediction
K. Bhalla (B) · A. Gosain
Guru Gobind Singh Indraprastha University, New Delhi, India
e-mail: [email protected]
A. Gosain
e-mail: [email protected]
S. Srivastava
University of Delhi, New Delhi, India
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_1
1 Introduction
One of the major concerns about air pollution is its effect on human health. Various studies have concluded that particulate matter is the major toxic component of polluted air and that it has led to various cardiovascular diseases, pulmonary diseases and even death [1]. The United Nations (UN) Environment Programme has estimated that around 1.1 billion people around the world breathe unhealthy air [2]. The World Health Organization (WHO) has estimated that urban air pollution is responsible for approximately 800,000 deaths and 4.6 million lost life-years each year, with premature deaths accounting for 88% of the deaths [3]. The main concern is that two-thirds of these deaths have occurred in Asia, largely because of urbanization, industrialization, socioeconomic development and population growth in the region [4]. India is one country that has shown an increase in all of these factors, but the growth has also had adverse effects in the form of pollution. Air quality has always been severe in metropolitan cities around the world, and Delhi is one of them [5]. According to the Census of India, the population of Delhi has crossed 25 million, occupying an area of 1484 km² (2011 Census of India). A study conducted by WHO in 1991–1994 concluded that the average total suspended particulate level was five times the WHO annual average standard. The study found that air pollution in Delhi can cause deaths at a younger age. The National Air Quality Index (AQI) was launched in New Delhi on 17 September 2014 under Swachh Bharat Abhiyan. The AQI considers eight pollutants, PM10, PM2.5, NO2, SO2, CO, O3, NH3 and Pb, for which various Air Quality Standards have been prescribed [6]. In 1997, the Ministry of Environment and Forests, India, carried out another analysis that reviewed the environmental situation and the concerns about worsening conditions in Delhi. The study estimated that around 3000 metric tons of various air pollutants were being emitted daily in Delhi, of which vehicular pollution contributed 67% and coal-based thermal power plants around 12%. From 1989 to 1997, this trend rose sharply and was monitored closely by the Central Pollution Control Board. The release of carbon monoxide by vehicles rose to 92% in 1996. In 2014, Delhi was declared the most polluted city in the world [7], and it has continued to be one to this day. A great deal of work has been done in the past to analyse and predict the AQI in Delhi so that further deterioration of air quality can be prevented. The Government of Delhi has also proposed many solutions in the past 3–4 years to ensure better air quality. This paper puts forth an analysis of the available AQI data and a prediction based on it, so that preventive measures can be taken beforehand.
The rest of the paper is organized as follows: Sect. 2 discusses the literature review; Sect. 3 presents the pollutant data collection for the city of Delhi; Sect. 4 specifies the data analysis results; Sect. 5 discusses the prediction of future pollution data; and Sect. 6 gives the conclusion and future scope.
2 Literature Review
A lot of research has been done in the past to analyse and predict the air quality index using various methods, along with its future consequences. Many researchers have studied the effects of air pollution on human health. Md. Senaul Haque and Singh [8] have shown through a survey that an increase in the concentrations of pollutants increases the risk of respiratory infections. They used the exceedance factor and took four pollutants into account to classify pollution into four categories: critical, high, moderate and low. Segala et al. have shown that prevailing levels of winter air pollution, which are below the international air quality standards, had consistent and measurable effects on children with mild to moderate asthma [9]. Marilena Kampa and Elias Castanas have given a detailed view of the effects of air pollution on human health, taking into account all the particles responsible for air pollution. They describe the chronic and acute health effects that can affect various organs and systems of the human body, and they shed light on both the short-term and long-term effects of human exposure to air pollution [10].
Another research focus has been the effect of different constituents of air pollution on human health and the varying intensities of these effects. Valavanidis et al. have investigated the critical characteristics of particulate matter (PM) and determined their biological effects. They review in depth the ways in which the composition and size of PM can cause adverse effects on body cells, which in turn can cause oxidative DNA damage [11]. Rizwan, Nongkynrih and Gupta have also given an insight into the effects of air pollution on human health and have suggested various control measures [12]. Anderson J, Thundiyil J and Stolbach A have reviewed various effects of PM that can worsen the respiratory system, as well as the short-term and long-term effects of PM depending on its composition [13]. Overall, the research listed above establishes that pollution is a major threat to human health.
Along with the research on the health effects of air pollution, various researchers have focussed on predicting it using different algorithms. Anikender Kumar and Pramila Goyal have used the principal component regression technique to forecast the daily AQI from the previous day's AQI. They considered four seasons and predicted accordingly. However, they predict seasonally for only one station of Delhi, i.e. ITO, and the model does not give good results except for the winter season [14]. Another study identified the transport sector as one of the factors that contributes a major share to the growth of air pollution: S. K. Goyal, S. V. Ghatge, P. Nema and S. M. Tamhane monitored the increase in NO2 concentration from 1997 to 2003 and found that it rose drastically and added to air pollution [15]. Dananjay Ghei and Renuka Sane have used a regression technique to forecast air pollution during the festival of Diwali, but their model does not consider exogenous factors such as seasonality, construction or crop burning. A further limitation is that they have taken into account only one station, Anand Vihar, where the AQI is low during the Diwali season because it is an industrial area and the factories remain closed during this season [16].
The problem with the above air quality prediction models is that they cover either a small part of the city, only one of the factors, or only one season. We therefore focus on a larger area and on all seasons, which are the main contributing factors. The research presented above shows that an increase in the pollution level can severely affect human health, so a way of reducing pollutants is needed. One solution is prevention, which is possible if pollution can be predicted. The prediction should be accurate and should consider the major polluting factors and pollutants. In our case, the pollutants taken into consideration are PM10 and PM2.5.
2.1 Pollutants: A Study on the Effects on Human Health
The composition of the atmosphere changes mainly because of the combustion of fossil fuels for energy generation, and this energy is required for various activities such as transportation [10]. Pollutants differ in their chemical composition, reactions and impacts on living beings, but they share certain similarities that allow them to be classified into the following four categories:
• Gas emission pollutants
• Persistent organic pollutants
• Particulate matter
• Metals
Gas emission pollutants contribute to a very large extent to the changes in the composition of the atmosphere. They are primarily formed by fossil fuel combustion [17]. Examples of these pollutants are CO, SO2, NOx, ozone and VOCs.
CO: It results from incomplete combustion. Burning of petrol and road transport are the contributors to this toxic gas. It has severe effects on the cardiovascular system because it binds to haemoglobin, which reduces the transfer of oxygen to different organs of the body, mainly the heart and brain.
SO2: It results from the combustion of sulphur-containing fossil fuels, mainly heavy oils and coal. It can produce acid rain when it combines with water. It can affect the respiratory system, resulting in various lung diseases and asthma [18].
NOx: Nitrogen oxides result from the oxidation of nitrogen during combustion at very high temperature. Their major source is vehicular traffic. They cause breathing problems due to lung irritation, as well as respiratory diseases [19].
Ozone: It is mainly formed by a sunlight-initiated process of continuous reactions involving VOCs and NO2. Inhaling ozone can worsen respiratory problems such as asthma, reduces the body's immunity against various respiratory problems and may even damage the lungs [20].
VOCs: Volatile organic compounds, which comprise benzene, m-xylene, ethylbenzene, chlorobenzene, 1,2-dichlorobenzene and 1,4-dichlorobenzene, result from fuel combustion and road transport emissions. They primarily affect the respiratory system and can also result in haematological problems and even cancer.
Persistent organic pollutants (dioxins, pesticides, PCBs, furans) are mainly formed by the incomplete combustion of chlorine-containing material (plastic). They are deposited in water and soil but are insoluble in both. They affect human health by entering the food chain, where they accumulate because they bind stably to body lipids [21]. As a result, they can cause problems in the digestive system and even permanent liver damage.
Metals (mercury, vanadium, manganese and lead) enter the atmosphere mainly through industrial waste and combustion. Metals are already present in the earth's crust and are essential to the human body in limited concentrations, as they maintain its metabolic reactions. As the concentration rises, they can become toxic [22]. They can have a high impact on the human body, affecting the nervous system and the urinary system. They can also affect the developing foetus: if a mother is exposed to these metals, especially lead, it can cause miscarriage or reduced foetal growth [23].
Particulate matter is the term for the varied types of particles that form a mixture suspended in the atmosphere. These particles can be solid or liquid and vary in composition and size [24]. They are added to the atmosphere by vehicular transport, construction sites, industries, wind-blown dust and fire. Depending on their size, they are classified as PM2.5 and PM10.
PM10: These are particles with a diameter of 10 µm or less. They are deposited directly in the throat and nose, which may lead to various cardiopulmonary problems. The elderly and young children who already have medical conditions are more likely to be adversely affected by exposure to high PM10 concentrations. A study carried out in New Zealand suggested that PM10 can result in nearly 900 premature deaths each year.
PM2.5: These are particles with a diameter of about 2.5 µm. They accumulate in the respiratory tract and cause adverse effects on health. Man-made sources of PM2.5 are more important to consider than natural sources, because natural sources contribute only a very small percentage of the total concentration. Road vehicles are one of the most important contributors. Studies have shown that the concentration of PM2.5 is higher in areas near roads than in background areas. Industrial emissions are also contributors, along with the areas nearby. In addition to these primary or direct emitters of PM2.5 particles, secondary or indirect sources also contribute. These secondary sources are mainly formed by the chemical reactions of gases such as nitrogen oxides (NO2, NOx and NO) and sulphur dioxide (SO2).
The above studies show that each pollutant enters the atmosphere through various sources and has harmful effects on human health, and that, of all the pollutants, PM10 and PM2.5 are the most harmful. In the next section, we present the collection of data for the city of Delhi and the areas that we have selected to analyse the AQI. We also present the AQI assessment chart, which gives a detailed categorization for the assessment of AQI.
3 Pollutant Data Collection for the City of Delhi
3.1 Area Selected
As already stated, Delhi is one of the most polluted cities in the world. Hence, an analysis of the city is needed to find the reasons for its categorization as the most polluted city. We thoroughly analysed all the stations of the city and, out of those, selected three stations, i.e. Anand Vihar, Jahangirpuri and ITO, as these are the parts of Delhi where the air is most polluted. Anand Vihar and Jahangirpuri are industrial or peripheral areas, and ITO is a station with heavy traffic. Last year, the Delhi Pollution Control Committee (DPCC) added many new air quality monitors. Data collected from these stations shows a steep increase in pollutant levels that were earlier within limits [25]. In this work, we have done a month-wise analysis of these three stations for the year 2018, tried to identify the new factors causing the pollution and, based on the results, identified preventive measures that could be taken beforehand.
3.2 Method of Pollutant Data Collection
• Pollutants chosen for study
As stated above, AQI considers eight pollutants, PM10, PM2.5, NO2, SO2, CO, O3, NH3 and Pb, for which various Air Quality Standards have been prescribed. Out of these eight, we have chosen PM10 and PM2.5, because their impact is categorized as the most hazardous among all eight: they have adverse effects on foetal growth and on human health, and they can cause many pulmonary and cardiovascular diseases and even death.
• Sources of data
Data sets are available from various sources, such as the UCI machine learning repository, but these do not validate the authenticity of the data, and complete data sets were not available in the repository. Hence, we have fetched the raw values of the pollutants and given the annual effect. The data is fetched from two sources: the Central Pollution Control Board (CPCB) and the US Embassy based in Chanakyapuri. The pollutant levels at various stations in Delhi-NCR are determined by the National Air Monitoring Programme (NAMP). The data can be accessed openly, and it is a government-authenticated repository from which the raw values can be fetched day-wise. The daily averages of PM10 and PM2.5 were collected from CPCB for different monitoring stations. Figure 1 shows the various monitoring stations of Delhi-NCR.
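The step from day-wise raw values to a month-wise view can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the paper: the record layout (station, date, value), the function name monthly_means and the sample numbers are all assumptions made for the example, and the values are placeholders rather than CPCB measurements.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_means(daily_records):
    """Average daily values per (station, year, month).

    daily_records: iterable of (station, day, value) tuples, where
    value is a daily average concentration or None if missing.
    The tuple layout is an assumption made for this sketch.
    """
    buckets = defaultdict(list)
    for station, day, value in daily_records:
        if value is None:
            continue  # skip days with no reading
        buckets[(station, day.year, day.month)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Placeholder values for illustration only, not measured data.
sample = [
    ("ITO", date(2018, 1, 1), 300.0),
    ("ITO", date(2018, 1, 2), 320.0),
    ("ITO", date(2018, 2, 1), 250.0),
    ("Anand Vihar", date(2018, 1, 1), None),
    ("Anand Vihar", date(2018, 1, 2), 410.0),
]
result = monthly_means(sample)
```

Skipping missing days rather than treating them as zero keeps a monitor outage from lowering a monthly mean; whether that choice matches the original processing is not stated in the paper.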
3.3 Air Quality Index Assessment
Table 1 gives a detailed view of the air quality index (AQI) assessment for various pollutants [26]. It lists the break points for all the pollutants. CPCB has prescribed six categories, i.e. “Good, Satisfactory, Moderately Polluted, Poor, Very Poor and Severe”, for all eight pollutants that it considers harmful. For a pollutant to be categorized as good, the upper break point is 50 for PM10, 30 for PM2.5, 50 for ozone and 40 for NO2, and 200, 0.5 and 1.0 for NH3, Pb and CO, respectively. Each pollutant has a break point for every category. The most dangerous category is the severe one. The situation becomes alarming when the value of any pollutant moves from the satisfactory to the moderately polluted category, because such pollutant levels shorten the life span. They also lead to sudden deaths and heart attacks, especially in the case of PM10 and PM2.5, which directly attack every organ of the human body [27].
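The break-point lookup described above can be sketched as follows for the two pollutants studied in this paper. The upper break points are copied from Table 1; the function name aqi_category and the use of a binary search are choices made for this illustration, not part of the CPCB specification. Values that fall between two listed ranges (for example, 50.5 for PM10) are assigned to the higher category, which is an assumption of the sketch.

```python
from bisect import bisect_left

CATEGORIES = [
    "Good",
    "Satisfactory",
    "Moderately polluted",
    "Poor",
    "Very poor",
    "Severe",
]

# Upper break points of the first five categories, taken from Table 1.
# Any concentration above the last value falls in "Severe".
UPPER_BREAK_POINTS = {
    "PM10": [50, 100, 250, 350, 430],
    "PM2.5": [30, 60, 90, 120, 250],
}


def aqi_category(pollutant, concentration):
    """Return the Table 1 category for a pollutant concentration."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    limits = UPPER_BREAK_POINTS[pollutant]
    # bisect_left returns the index of the first limit >= concentration,
    # so a value equal to a limit stays in the lower category.
    return CATEGORIES[bisect_left(limits, concentration)]
```

For instance, aqi_category("PM2.5", 95) falls in the 91–120 band of Table 1 and is therefore classed as "Poor".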
Fig. 1 Locations of ambient air quality monitoring stations
Table 1 AQI assessment chart
AQI category          AQI range   PM10      PM2.5     NO2       O3        SO2        NH3         Pb        CO
Good                  0–50        0–50      0–30      0–40      0–50      0–40       0–200       0–0.5     0–1.0
Satisfactory          51–100      51–100    31–60     41–80     51–100    41–80      201–400     0.5–1.0   1.1–2.0
Moderately polluted   101–200     101–250   61–90     81–180    101–168   81–380     401–800     1.1–2.0   2.1–10
Poor                  201–300     251–350   91–120    181–280   169–208   381–800    801–1200    2.1–3.0   10–17
Very poor             301–400     351–430   121–250   281–400   209–748   801–1600   1200–1800   3.1–3.5   17–34
Severe                401–500     >430      >250      >400      >748      >1600      >1800       >3.5      >34
Children are the most affected by air pollution because they breathe more air and spend more time playing outside. Hence, greater exposure to PM10 and PM2.5 causes irreversible damage to their lungs, which are still developing. Studies have shown that 40% of children in Delhi have reduced lung capacity [28]. Air pollution also affects their brains and can lead to autism and a lower intelligence quotient. In 2015, air pollution caused 6.5 million premature deaths globally, making it the leading cause of premature deaths. Older adults are at higher risk as well. Many studies show that when particulate matter levels are high, older adults are more likely to be hospitalized, and some may die of aggravated heart or lung disease. Doctors prescribe exercise for healthy living. In a polluted environment, however, exercise can do more harm than good. Exercise and physical activity cause people to breathe faster and more deeply and to take more particles into their lungs. People who exercise early in the morning are at the greatest risk because PM2.5 levels are high in the early mornings. The above assessment clearly indicates the reason for choosing PM10 and PM2.5 as the major pollutants for our study. In the next section, we present the results of the analysis. We identify the various factors that contribute to the increase in the pollutant level in the air, analyse the data and present the results.
4 Data Analysis Results

4.1 Factor Identification

Although the values vary seasonally, they remain in the poor category throughout the year as per the AQI assessment chart for the city of Delhi. According to CPCB [29], the factors responsible for the increase in PM10 and PM2.5 are as follows:

• Traffic: It plays an important role in the increase in pollution. Vehicles are a source of the harmful gases that form smoke, and traffic rises during January, February, October, November and December because of festivals such as New Year, Lohri and Diwali and because of marriages. Pollution therefore increases along with traffic during this period.
• Construction: Construction also contributes to air pollution because it increases the amount of dust particles in the air.
• Crop Burning: It is one of the contributors to air pollution during October and November.
• Winters: During this season, dust particles and pollutants in the air are unable to disperse because the winds are stagnant. The pollutants therefore remain locked in one place, resulting in smog.
• Garbage Dumps and Industrial Wastes: They also increase air pollution and build up smog.
• Population: Delhi’s overpopulation is also one of the reasons for the pollution.
4.2 Analysis and Results

The raw monthly values of PM10 and PM2.5 were collected over a period of one year at three stations in Delhi: Anand Vihar, Jahangirpuri and ITO. We considered only these three stations because CPCB has declared them hotspots. The raw values are shown in Table 2 for PM10 and Table 3 for PM2.5. The data are also plotted to allow a thorough month-wise analysis of the AQI, in Fig. 2 for PM10 and Fig. 3 for PM2.5. From these plots, we identify the reasons for the variations and propose a solution.

Table 2 Actual PM10 (Source: CPCB)

| Month | Anand Vihar | Jahangirpuri | ITO |
|---|---|---|---|
| Jan | 429.4019 | – | 294.7342 |
| Feb | 353.1544 | 395.1204 | 195.42 |
| Mar | 330.003 | 319.1645 | 170.2423 |
| Apr | 315.351 | 325.5603 | 201.307 |
| May | 311.3903 | 307.5713 | 156.6052 |
| Jun | 296.9014 | 34.427 | 237.2079 |
| Jul | 200.9416 | 31.25633 | 106.7842 |
| Aug | 147.5921 | 30.82677 | 92.105 |
| Sep | 199.9112 | 28.559 | 83.01448 |
| Oct | 423.2571 | 26.25258 | 205.298 |
| Nov | 471.1287 | 21.264 | 270.489 |
| Dec | 475.2253 | 15.59733 | 305.526 |

Table 3 Actual PM2.5 (Source: CPCB)

| Month | Anand Vihar | Jahangirpuri | ITO |
|---|---|---|---|
| Jan | 261.6381 | – | 206.0323 |
| Feb | 161.2682 | 216.41714 | 117.7204 |
| Mar | 119.1808 | 150.73194 | 88.23226 |
| Apr | 115.366 | 115.42167 | 81.485 |
| May | 107.3823 | 112.642 | 76.1829 |
| Jun | 111.5933 | 0.9966667 | 159.0379 |
| Jul | 70.123 | 4.1296667 | 54.10679 |
| Aug | 35.93065 | 3.1774194 | 49.75786 |
| Sep | 56.17846 | 1.8406667 | 38.933 |
| Oct | 166.4619 | 2.7035484 | 129.8 |
| Nov | 254.0023 | 2.9883333 | 202.5407 |
| Dec | 293.1313 | 10.427667 | 220.2583 |
Fig. 2 Analysis of PM10
Fig. 3 Analysis of PM2.5
A thorough analysis of the graphs reveals the following. At the Anand Vihar station, the values are highest from October to January and lowest from July to September. The likely reason is a seasonal phenomenon: the values are high during the winter season and low during the monsoon. During winter, the pollution level worsens for two main reasons. First, atmospheric conditions trap pollutants closer to the surface of the earth and reduce the rate at which they disperse, which increases their concentration. Second, stubble burning in Punjab during this time increases PM2.5, adding to regular sources such as construction dust and vehicular emissions. During monsoons, the pollution
levels drop as the monsoon winds wash away the dust and bring the levels to satisfactory. At the ITO station, the values are high throughout the year because ITO is one of the busiest stations in terms of traffic; vehicular traffic accounts for a share of around 20–30% throughout the year. The analysis shows that the values are highest from October to January because of the winter season, followed by a dip during February, March and May. The reason for this dip is the issuance of the Air (Prevention and Control of Pollution) Act, 1981 to the NCR states (Haryana, Uttar Pradesh, Rajasthan and NCT of Delhi) for the control of air pollution in NCR and NCT of Delhi during the month of February. The values are lowest from July to September, as at Anand Vihar, because the monsoon rains settle the pollutants into the soil and lower the overall pollution level. At Jahangirpuri, the pollution levels are highest during February, March, April and May. The major contributor is that, being an industrial area, it has a large share of emissions, about 25–43% throughout the year [30]. Figure 2 for PM10 and Fig. 3 for PM2.5 show a major dip in June, after which the values stay roughly constant and low through the summer and monsoon seasons, during which the hot and dry winds carry the pollutant particles away. Figures 2 and 3 show the month-wise analysis for all three stations. The value for Jahangirpuri is null for January because this station was installed and connected to the CPCB portal only from February.
Therefore, the above values show that the concentrations of PM10 and PM2.5 are highest during the winter season, average during the summer season and lowest during the monsoon and post-monsoon period; the factors responsible for the increase were identified in Sect. 4.1. The government can use this analysis to find ways to reduce the pollution and keep it within safe limits. One such measure is artificial rain, which we consider the most appropriate because it has been used in places where air pollution is very high, for example in Beijing under its “weather modification program” [31]. In the next section, we present the Holt–Winters model, which we use to predict the pollutant values.
5 Prediction of Future Pollution

Data analysis has shown that certain areas in Delhi are in the hazardous zone throughout the year. The factors responsible are mostly human activities like
garbage dumps, industrial wastes, traffic and various festivals. Some of these factors are almost constant, varying little within a year, while others show high variation over the year and can be called seasonal. Effort needs to be put into predicting the AQI as accurately as possible. Traffic is considered seasonal because it increases during festivals such as Diwali, Holi, New Year and Lohri, mainly from October to January and in March. Crop burning takes place in October and November every year, which makes it seasonal as well, and winter itself is seasonal. There are thus both constant and varying factors affecting the air pollution level in Delhi. We can therefore ignore the constant factors and consider the highly varying seasonal factors when predicting the AQI for the next year, which may help in finding a way to reduce the seasonal pollution in Delhi. We now proceed to predict the pollution levels for the forthcoming seasons from the previous year’s data using the Holt–Winters algorithm. If this can be done accurately, it can support measures to reduce air pollution. Much research has been done on predicting air pollution values; here, we use a model that has already been used to predict air pollution in China, where it predicted the results accurately [32]. We use this model to predict the values for Delhi, and the experimental results show that it forecasts with high accuracy.
5.1 Holt–Winters Model

Triple exponential smoothing, popularly known as the Holt–Winters method, is an algorithm for forecasting values in a series that is seasonal, that is, one that repeats after some period of time. The Holt–Winters forecasting algorithm allows users to smooth a time series and use the smoothed data to forecast areas of interest. Exponential smoothing assigns exponentially decreasing weights to historical data, so that older data carry less weight. It is a way to model and predict the behaviour of a sequence of values over time, i.e. a time series. Holt in 1957 and his student Winters in 1960 extended Holt’s method to capture seasonality. The Holt–Winters seasonal method builds the forecast equation (4) on three smoothing equations (1)–(3):

$$L_t = \alpha \frac{Y_t}{S_{t-M}} + (1 - \alpha)(L_{t-1} + T_{t-1}) \tag{1}$$

$$T_t = \beta (L_t - L_{t-1}) + (1 - \beta) T_{t-1} \tag{2}$$

$$S_t = \gamma \frac{Y_t}{L_t} + (1 - \gamma) S_{t-M} \tag{3}$$

$$F_{t+k} = (L_t + k T_t)\, S_{t-M+k} \tag{4}$$

where
• Y is the observation
• L is the smoothed observation
• T is the trend factor
• S is the seasonal index
• F is the forecast at k periods ahead
• M is the number of periods in a season, and α, β and γ are smoothing constants between 0 and 1
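The recursions in Eqs. (1)–(4) can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name `holt_winters_multiplicative`, the requirement of at least two full seasons of data, and the initialization (level from the first-season mean, trend from the difference of the first two season means, seasonal indices from the first season) are all choices made here, since the text does not describe how the model was initialized or how α, β and γ were selected.

```python
def holt_winters_multiplicative(y, m, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with multiplicative Holt-Winters.

    y       : list of positive observations (Y in Eqs. 1-4)
    m       : season length (M in Eqs. 1-4)
    alpha, beta, gamma : smoothing constants in [0, 1]
    Assumption: the first two seasons are used for initialization.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    first_mean = sum(y[:m]) / m
    second_mean = sum(y[m:2 * m]) / m
    level = first_mean                          # initial L
    trend = (second_mean - first_mean) / m      # initial T
    season = [y[i] / first_mean for i in range(m)]  # initial S, one per phase

    for t in range(m, len(y)):
        s_old = season[t % m]                   # S_{t-M}
        prev_level = level
        # Eq. (1): deseasonalized observation blended with previous level + trend
        level = alpha * (y[t] / s_old) + (1 - alpha) * (prev_level + trend)
        # Eq. (2): change in level blended with previous trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Eq. (3): detrended observation blended with last season's index
        season[t % m] = gamma * (y[t] / level) + (1 - gamma) * s_old

    n = len(y)
    # Eq. (4): extrapolate level and trend, then re-apply the seasonal index
    return [(level + k * trend) * season[(n - 1 + k) % m]
            for k in range(1, horizon + 1)]
```

With monthly data the season length would be `m = 12`, so this particular initialization needs two years of observations; other initializations are possible when only one year is available.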
5.2 Application of Holt–Winters to Predicted Results and Discussion

• PM10 and PM2.5 Prediction for Anand Vihar

The graphs compare the actual and predicted values of PM10 and PM2.5 for the Anand Vihar station. The predicted values are very close to the actual values of PM10 and PM2.5, and the curves have the same shape, i.e. similar dips and rises over the seasons (Fig. 4). The curve is highest in the winter season and lowest during the monsoon season, especially in July and August. This indicates the accuracy of the Holt–Winters predictions: as analysed in the previous section, the level of pollution rises steeply from October to January, and the model predicts the same (Fig. 5).
Fig. 4 Anand Vihar PM10
Fig. 5 Anand Vihar PM2.5
• PM10 and PM2.5 Prediction for Jahangirpuri

The graph compares the actual and predicted values of PM10 for the Jahangirpuri station (Fig. 6). It indicates that the Holt–Winters algorithm applied to our data set gives very accurate results for PM10 in the Jahangirpuri area. As already discussed, the curve is highest from February to May, and our model predicts the same (Fig. 7).

• PM10 and PM2.5 Prediction for ITO
Fig. 6 Jahangirpuri PM10
Fig. 7 Jahangirpuri PM2.5
The graph compares the actual and predicted values of PM10 for the ITO station. It shows a minor variation in the month of February. This difference between the predicted and actual values is due to the measures taken by the government during this period (Fig. 8). The dip in the actual value is due to the slew of measures rolled out by the government to keep pollution levels under control. Other measures, such as the ban on pet coke and furnace oil, also yielded results. Meteorological conditions
Fig. 8 ITO PM10
were also favourable. Rising temperatures and wind aided air circulation, which helped disperse the pollutants (Fig. 9). The air quality trends and the predictions are compared on a monthly basis for all the monitoring stations for the year 2018. Tables 4 and 5 show the predicted results for PM10 and PM2.5. They can be compared with the actual values in Tables 2 and 3, which shows that the predicted results are close to the actual values of both PM10 and PM2.5. The remaining variation is due to meteorological conditions, which are natural and over which we have no control.
Fig. 9 ITO PM2.5
Table 4 Predicted PM10

| Month | Anand Vihar | Jahangirpuri | ITO |
|---|---|---|---|
| Jan | 448.1877 | 0 | 298.2968 |
| Feb | 325.9507 | 391.9233 | 204.2408 |
| Mar | 296.7551 | 312.4862 | 167.2586 |
| Apr | 255.5257 | 309.0192 | 193.0571 |
| May | 249.7176 | 238.7624 | 139.4273 |
| Jun | 238.2811 | 34.62981 | 204.9342 |
| Jul | 178.3122 | 29.48557 | 86.02366 |
| Aug | 138.1622 | 30.67395 | 70.88972 |
| Sep | 162.9665 | 28.8062 | 75.61234 |
| Oct | 280.6133 | 25.99982 | 142.2878 |
| Nov | 473.8113 | 21.03014 | 269.8857 |
| Dec | 439.322 | 15.23243 | 269.8857 |
Table 5 Predicted PM2.5

| Month | Anand Vihar | Jahangirpuri | ITO |
|---|---|---|---|
| Jan | 259.4 | 0 | 204.7515 |
| Feb | 174.4386 | 220.4402 | 123.726 |
| Mar | 88.72924 | 147.8084 | 86.23433 |
| Apr | 26.89802 | 112.9687 | 81.4834 |
| May | 80.93116 | 72.51968 | 74.71031 |
| Jun | 71.00612 | 1.416132 | – |
| Jul | 63.125 | 4.183256 | 53.08436 |
| Aug | 27.50141 | 3.35444 | 40.87963 |
| Sep | 38.77863 | 3.303213 | 46.05773 |
| Oct | 81.94339 | 1.811218 | 162.7164 |
| Nov | 260.7208 | 3.13555 | 188.9563 |
| Dec | 254.0812 | 10.41423 | 191.5154 |

Unplaced value from the source (listed after the October row): 96.57521
The value for Jahangirpuri is zero for January because this station was installed and connected to the CPCB portal only from February. The results show that the Holt–Winters method gives values very close to the actual values of the pollutants. Our model takes all the above factors into consideration to give better results for the coming years, the most important being seasonality: as we have shown, the air pollution values vary seasonally and are lowest in the monsoon season.
6 Conclusions

It is important to use an effective model for forecasting the AQI for various pollutants, so that the Delhi government can take the necessary preventive measures before it is too late and citizens bear the consequences. The analysis in this paper indicates that the proposed model accounts for seasonal trends and effects and can produce accurate forecasts. As stated earlier, an increase in the level of particles in the air can lead to many health-related problems in human beings; pollution values therefore need to be predicted beforehand so that the government can take preventive measures to bring them at least within the moderate limit. This calls for a model that considers seasonal effects, which makes a seasonal algorithm well suited to this case.
References

1. D.K. Dockery, P. Arden, Acute respiratory effects of particulate air pollution. Annu. Rev. Public Health 15, 107–113 (1994)
2. UNEP, Environmental threats to children: children in the new millennium. United Nations Environmental Programme, UNICEF; WHO, Geneva, Switzerland (2002)
3. WHO, The World Report 2002-Reducing Risks, Promoting Healthy Life; World Health Organization, Geneva, Switzerland (2002)
4. J. De, Development, environment and urban health in India. Geography 92, 158–160 (2007)
5. S. Jain, K. Mukesh, Urban air quality in Mega cities: a case study of Delhi city using vulnerability analysis
6. Press Information Bureau Government of India Ministry of Environment, Forest and Climate Change, 17 Oct 2014
7. TOI, Delhi has the worst air pollution in the world: WHO, The Times of India, 7 May 2014. Chauhan C. Delhi world’s most polluted city: Study, Hindustan Times, 8 May 2014 (2014). https://www.hindustantimes.com/india/delhi-world-s-most-polluted-city-study/storyKqiz2WDZ8muWya6MJpbGPM.html
8. M. Haque, R.B. Singh, Air pollution and human health in Kolkata, India: a case study. Climate 5(4), 77 (2017)
9. C. Segala, B. Faurox, J. Just, L. Pascaul, A. Grimfield, F. Neukirch, Short term effect of winter air pollution on respiratory health of asthmatic children in Paris. ERS J. Ltd. 11, 677 (1998)
10. M. Kampa, E. Castanas, Human health effects of air pollution. Sci. Dir. Environ. Pollut. 151, 362–367 (2008)
11. A. Valavanidis, K. Fiotakis, T. Vlanchogianni, Airborne particulate matter and human health: toxicological assessment and importance of size and composition of particles for oxidative damage and carcinogenic mechanisms. J. Environ. Sci. Health Part C 26, 339–362 (2008)
12. S.A. Rizwan, B. Nongkynrih, S.K. Gupta, Air pollution in Delhi: its magnitude and effects on health. Ind. J. Commun. Med. 38, 4–8 (2013)
13. J. Anderson, J. Thundiyil, A. Stolbach, Clearing the air: a review of the effects of particulate matter air pollution on human health. J. Med. Toxicol. 8, 166–175 (2012)
14. A. Kumar, P. Goyal, Forecasting of air quality in Delhi using principal component regression technique. Atmosph. Pollut. Res. 2(4), 436–444 (2011)
15. S.K. Goyal, S. Ghatge, V.P. Nema, S.M. Tamhane, Understanding urban vehicular pollution problem vis-a-vis ambient air quality – case study of a megacity (Delhi, India). Environ. Monit. Assess. 119, 557–569 (2006)
16. D. Ghei, R. Sane, Estimates of air pollution in Delhi from the burning of firecrackers during the festival of Diwali. Plos One 13, 8 (2018)
17. K. Katsouyanni, Ambient air pollution and health. Br. Med. Bull. 68, 143 (2003)
18. J.R. Balmes, J.M. Fine, D. Sheppard, Symptomatic bronchoconstriction after short term inhalation of sulfur dioxide. Am. Rev. Resp. Dis. 136, 1117 (1987)
19. J. Kagawa, Evaluation of biological significance of nitrogen oxides exposure. Tokai J. Exp. Clin. Med. 10, 348 (1985)
20. S.K. Rastogi, B.N. Gupta, T. Husain, H. Chandra, N. Mathur, B.S. Pangtey, S.V. Chandra, N. Garg, A cross-sectional study of pulmonary function among workers exposed to multimetals in the glass bangle industry. Am. J. Ind. Med. 20, 391 (1991)
21. A. Schecter, L. Birnbaum, J.J. Ryan, J.D. Constable, Dioxins: an overview. Environ. Res. 101, 419 (2006)
22. L. Jarup, Hazards of heavy metal contamination. Br. Med. Bull. 68, 167 (2003)
23. L.M. Schell, M.V. Gallo, M. Denham, J. Ravenscroft, Effects of pollution on human growth and development: an introduction. J. Physiol. Anthropol. 25, 103 (2006)
24. U. Poschl, Atmospheric aerosols: composition, transformation, climate and health effects. Angew. Chem. Int. Ed. Engl. 44, 7520 (2005)
25. TOI, 12 areas in Delhi where you can never breathe clean air. The Times of India, 28 January 2018
26. National Air Quality Index, Control of Urban pollution series, CUPS/82/2014–15, Central Pollution Control Board, Ministry of Environment control and Climate change
27. India: Delhi pollution level deteriorates to ‘hazardous’ category. https://www.aljazeera.com/news/2018/11/india-delhi-pollution-level-deteriorates-hazardous-category181105085836714.html
28. TOI, https://timesofindia.indiatimes.com/home/environment/pollution/40-of-Delhischoolkids-fail-lung-capacity-test-Study/articleshow/47156480.cms
29. TOI, https://timesofindia.indiatimes.com/life-style/health-fitness/health-news/top-8-maincauses-for-air-pollution-in-delhi/articleshow/61626744.cms
30. HT. https://www.hindustantimes.com/india-news/delhi-has-a-complex-air-pollution-problem/story-xtLhB9xzNYeRPp0KBf9WGO.html
31. H. Xu, W. Yao, A numerical study of the Beijing extreme rainfall of 21 July 2012 and the impact of topography, Hindawi Publishing Corporation. Adv. Meteor. 2015, Article ID 980747, 12 p (2015)
32. L. Wu, X. Gao, Y. Xiao, S. Liu, Y. Yang, Using grey holt winter model to predict the air quality index for cities in China. Nat. Hazards 88(2), 1003–1012 (2017)
Proof of Game (PoG): A Proof of Work (PoW)’s Extended Consensus Algorithm for Healthcare Application

Adarsh Kumar and Saurabh Jain
Abstract Advances in blockchain technologies during the past decade have attracted tremendous interest from academia, the research community and industry. A blockchain network is a peer-to-peer, decentralized and immutable distributed ledger system for transactional records. With the increase in the number of blockchain-based applications, it has become a powerful technology for decentralized data processing and for consensus-mechanism-based blockchain networks. In this work, single- and multi-player bit-challenging and incentivized consensus mechanisms for blockchain networks are used to propose a “proof-of-game (PoG)” protocol for resource-variant blockchain networks. Bit-verifier PoG is designed to be a memory-dependent and CPU-independent mechanism for time efficiency and resource independence. The results show that the number of blocks mined using this protocol is proportional to the number of participants associated with the blocks. Further, the priority of a blockchain increases exponentially with the number of blocks mined, and the number of blocks mined decreases exponentially with an increase in computational challenge.

Keywords Blockchain · Consensus algorithm · Proof of concepts · Trusted environments · Cryptography · Cryptocurrency
A. Kumar (B) · S. Jain, School of Computer Science, University of Petroleum and Energy Studies, Bidholi, Dehradun, India

© Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_2

1 Introduction

Blockchain, developed by Satoshi Nakamoto in 2008, acts as a transaction ledger for cryptocurrencies. According to [1], there are 2238 cryptocurrencies today. Ten popular ones are bitcoin, ethereum, XRP, litecoin, bitcoin cash, eos, binance coin, bitcoin sv, tether and stellar. A blockchain is constructed as a sequence of interconnected blocks; interconnections and blocks are thus integral components of a blockchain. Decentralization, transparency, open-source code, autonomy, immutability and anonymity are the key elements of blockchain design. According to Zheng et al. [2], the block structure consists of a block header and a block body. The block header contains the name of the block, the block version number, the block’s hash value, the previous block’s hash value, the consensus algorithm’s target difficulty value, the creation time (timestamp) of the block, the Merkle root of the transactions and a nonce value (i.e., a random counter value). The block body contains multiple transactions and their records. A block may contain a large number of transactions, depending on the block and transaction sizes. A transaction contains a transaction header and a payload. The transaction header consists of the transaction hash value, the block number/index, the transaction number in the block, the creation time (timestamp) of the transaction, the sender’s and receiver’s identifications and a digital signature over the transaction’s hash value. Payloads contain multiple data chunks. Figure 1 shows a typical structure of a large number of blocks interconnected in a blockchain. Each of these interconnected blocks has multiple transactions. For example, Fig. 1 shows four participants (alice, bob, trent and carol) who want to establish transactions with the owner of a block (den). Each of the four participants signs its transaction, and the signature is verified with the participant’s public key. Each transaction inside a block contains a transaction amount (TA), an unspent amount (UA) and a gas amount (GA). The hashes of transactions between two or more parties constitute the leaf nodes of the Merkle root tree. Each parent node in the Merkle root tree is computed by hashing its children’s hashes. Figure 2 shows an example of a 4-level Merkle root tree formed in this way.
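The construction of a Merkle root from transaction hashes can be sketched as follows. This is a minimal illustration rather than the authors' implementation: SHA-256, the hex-string concatenation of the two child hashes, and the duplication of the last node when a level has an odd number of entries are assumptions, and the names `merkle_levels` and `merkle_root` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of SHA-256 (assumed hash function for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def merkle_levels(messages):
    """Return every level of the tree, from leaf hashes up to the root."""
    if not messages:
        raise ValueError("at least one message is required")
    levels = [[sha256_hex(m) for m in messages]]   # leaf level
    while len(levels[-1]) > 1:
        current = list(levels[-1])
        if len(current) % 2 == 1:
            # Assumption: an unpaired node is paired with a copy of itself.
            current.append(current[-1])
        # Each parent is the hash of its two children's hashes.
        parents = [sha256_hex((current[i] + current[i + 1]).encode())
                   for i in range(0, len(current), 2)]
        levels.append(parents)
    return levels


def merkle_root(messages):
    """Root hash of the Merkle tree built over the messages."""
    return merkle_levels(messages)[-1][0]
```

For eight messages the sketch produces levels of 8, 4, 2 and 1 hashes, i.e. a 4-level tree, and changing any single message changes the root.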
There are eight transaction messages (M0 to M7) from which the 4-level Merkle root tree is built. Figures 1 and 2 show a simplified block and transaction structure of a blockchain. The consensus algorithm’s difficulty level, however, is an important parameter for a tamper-proof blockchain network. It is measured using various concepts. For example, proof of stake model (PSM),
Fig. 1 Block structure and blockchain construction
Fig. 2 Merkle root tree in block
proof of elapsed time (PET), Byzantine fault tolerance algorithm (BFTA), practical Byzantine fault tolerance algorithm (PBFTA), SIEVE consensus protocol (SCP), cross fault tolerance (XFT), federated Byzantine agreement (FBA), ripple consensus protocol (RCP), Stellar Consensus Protocol (SCP), etc., are popular. In this work, another consensus difficulty concept, “Proof-of-Game (PoG)”, is introduced for both resourceful and resource-constrained devices in the blockchain network. PoG is derived from game theory concepts that have been incorporated into blockchain networks in recent years. PoG allows participants to create new blocks inside the blockchain based on the level of the game they solve. PoG can be a single-player or a multi-player game. A single-player PoG is preferable if participants want to construct a blockchain in which a single stakeholder is selected to permit new blocks to join the blockchain network (similar to a specialized case of PSM), whereas multi-player PoG has multiple participants play with the participant interested in inserting a block in the blockchain network. The level and difficulty of the game played in PoG vary with the availability of resources. A high difficulty level means that more resources are publicly visible and that the level of confidence in the blockchain network is high.

The rest of this work is organized as follows: Sect. 2 reviews state-of-the-art consensus algorithms for both private (permissioned) and public (limited-permission or permissionless) blockchain technology, and explores in depth the use of game theory in blockchain technology. Section 3 introduces the PoG concept for blockchain technology. Section 4 explains the PoG-integrated architecture for the healthcare system. Section 5 presents statistical and experimental analyses of the proposed consensus algorithm. Finally, Sect. 6 concludes this work by summarizing the contributions.
2 Literature Survey

Several consensus algorithms have been proposed in the literature [3, 4]. According to [3], a consensus mechanism suitable for a secure and sustainable blockchain network should include incentive-compatible sub-stages with tolerance to Byzantine and unfaithful faults, avoid adversaries with accumulated computational power (e.g., botnets, mining pools, etc.) and maintain a balance between processing throughput and network scalability. Consensus algorithms and the game theory concepts derived for them are discussed as follows:

• Proof of Work (PoW): In PoW, a cryptographic challenge is put to the challenger instead of charging real money. In the blockchain, a challenger who solves the challenge is allowed to add a block to the blockchain. The idea of PoW did not originate with blockchain; it is used in various other applications [5], such as stopping spam emails, preventing unauthorized access and measuring the capabilities of a challenger. PoW is used in various blockchain technologies. For example, the main network of Ethereum (Homestead) uses a PoW consensus model called EthHash [6]. EthHash is used to counter the mining centralization seen in Bitcoin, in which a large number of ASICs were used to perform hashing operations at very high rates [6]. This allowed corporations with powerful computing resources to control the Bitcoin network by creating mining pools. PoW is a time-consuming process in which each block generator has to cycle through nonce values with an algorithm. Back’s “hash-cash” [7] is another widely known example of proof-of-work. In “hash-cash,” the challenge for a sender (challenger) is to apply an algorithm that produces a string whose cryptographic hash starts with a certain number of zeroes. Although this operation is expensive to complete, it is cheap to verify.
The “hash-cash” puzzle can be bound to a specific set of email recipients through a nonce, timestamp and recipient email address. If the freshness factor generated from the nonce, timestamp and recipient email address is validated as a string value, then the email is sent to the specific recipients. “Hash-cash” has similarly been extended to various other applications. Automated metering of websites is another extension of “hash-cash” [8]. In [9], PoW is used to mitigate distributed denial-of-service attacks. In [10], it is used in a P2P network for uniform user behavior using an incentive-based game. In [11], PoW is used for cloud and fog computing networks with a novel mathematical model that needs fewer iterations to converge to the consensus solution. The proposed approach is efficient in terms of time and memory consumption and suitable for the Internet of Things (IoT). Apart from PoW, various consensus algorithms such as PSM, PET, BFTA, PBFTA, SCP, XFT, FBA, RCP and SCP are popular for putting a challenge to the challenger.
• Game Theory in Blockchain: In recent studies [12], game theory has been applied in various blockchain scenarios. In [12], a supervised machine learning algorithm combined with game theory is proposed for detecting and stopping majority attacks in the blockchain. Supervised machine learning helps intelligent
software agents in activity-based anomaly detection, which is extended to detect majority (population-size) attacks in a blockchain. The Nakamoto protocol [4] for permissionless blockchain networks proposes incentivizing the participants through token supply and transaction tipping. Stone [13] presented a game-theoretical analysis of consensus nodes prioritizing blocks according to their sizes. In this analysis, it is observed that consensus nodes prefer large blocks over small ones because of the long delay in incorporation and validation, which may leave a block isolated. This isolation can lead to Denial of Service (DoS) or Distributed Denial-of-Service (DDoS) attacks. A large gas price for a large block size is also not preferable for blockchain because of the physical constraints of a network [13]. Additionally, various game-based models, for example evolutionary games, hierarchical games and auctions, have been adopted for the payoff function of miners [14–19].
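The “hash-cash” style puzzle described in the Proof of Work item above, where a solution is expensive to find but cheap to verify, can be sketched as follows. This is an illustration only: SHA-256 is used here as an assumed hash function, the challenge string format and the helper names (`leading_zero_bits`, `solve`, `verify`) are hypothetical, and difficulty is counted in leading zero bits.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()   # zeros before the first set bit
        break
    return bits


def _digest(challenge: str, nonce: int) -> bytes:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()


def solve(challenge: str, difficulty_bits: int) -> int:
    """Expensive side: try nonces until the hash has enough leading zeros."""
    for nonce in count():
        if leading_zero_bits(_digest(challenge, nonce)) >= difficulty_bits:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Cheap side: a single hash evaluation checks the claimed solution."""
    return leading_zero_bits(_digest(challenge, nonce)) >= difficulty_bits
```

A challenge string could combine a timestamp and a recipient address, for example `solve("2020-01-01:bob@example.org", 12)`; each additional difficulty bit roughly doubles the expected work for the solver while verification stays a single hash.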
3 Proposed PoG Approach

This section gradually derives the concept of PoG. Algorithm 1 shows the steps followed in constructing a blockchain. The developed blockchain accepts the block number, user data, previous block hash and timestamp, and produces the block’s hash value. The SHA256 hashing algorithm is used for linking the blockchain. GBlock() and PreviousHASH() are the two functions used for creating the genesis block and calculating the hash value of the previous block. Blockchain() and AddBlock() are used for developing a blockchain and adding blocks to it. Lines 62 to 67 show the steps followed in creating blocks and adding them to the blockchain. Algorithm 2 is an extension of Algorithm 1 with PoW. In PoW, various challenges [6–11] can be put to the challenger before allowing him to add a block to the blockchain. Algorithm 2 shows an example of PoW with the generation of a hash and an operation. The hash value can be generated using any hashing algorithm, whereas the operation should be modular arithmetic in a finite field. The generated hash value is verified by existing blockchain participants using the ValidateChallenge() function. If the challenge is generated within a specified time period, verified by blockchain participants and linked with a correct block, then the process of generating the blockchain using PoW starts, in lines 35 to 39.
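Since the listing of Algorithm 1 is not reproduced in this text, the following minimal sketch shows one way the described steps could look: each block stores its number, user data, previous block hash and timestamp, and SHA-256 links consecutive blocks. The names `genesis_block`, `add_block` and `is_valid` are hypothetical stand-ins loosely mirroring GBlock() and AddBlock(), and the JSON serialization of the hashed fields is an assumption.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    index: int
    data: str
    previous_hash: str
    timestamp: float
    hash: str


def block_hash(index, data, previous_hash, timestamp):
    """SHA-256 over the block fields (serialization format is an assumption)."""
    payload = json.dumps([index, data, previous_hash, timestamp])
    return hashlib.sha256(payload.encode()).hexdigest()


def make_block(index, data, previous_hash, timestamp=None):
    ts = time.time() if timestamp is None else timestamp
    return Block(index, data, previous_hash, ts,
                 block_hash(index, data, previous_hash, ts))


def genesis_block():
    """First block of the chain; it has no real predecessor."""
    return make_block(0, "genesis", "0" * 64)


def add_block(chain, data):
    """Append a block that points at the hash of the current last block."""
    last = chain[-1]
    chain.append(make_block(last.index + 1, data, last.hash))


def is_valid(chain):
    """Check that every block links to its predecessor and was not altered."""
    for prev, cur in zip(chain, chain[1:]):
        if cur.previous_hash != prev.hash:
            return False
        if cur.hash != block_hash(cur.index, cur.data,
                                  cur.previous_hash, cur.timestamp):
            return False
    return True
```

A chain would be started with `chain = [genesis_block()]` and extended with `add_block(chain, "some user data")`; a PoW check in the spirit of Algorithm 2 could be placed in front of `add_block`, but that step is left out of this sketch.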
A. Kumar and S. Jain
Proof of Game (PoG): A Proof of Work (PoW)’s …
Now, a group of participants may delegate rights to one participant or a group of participants for performing the "adding blocks" activity. The participant or group authorized to perform this activity is known as the Authority, and the concept is called "Proof-of-Authority" (PoA). Algorithm 3 shows the steps followed in PoA before creating a blockchain. In PoA, the signatures of the participants authorized to perform this activity are verified. Although PoA is an additional overhead on top of PoW, it does not require all participants to be active before allowing any participant to add a block. A minimum threshold of active participants is sufficient for accepting a new block into the blockchain. The PoA consensus mechanism can be extended with "Proof-of-Ownership" (PoO), as shown in Algorithm 4. In PoO, it is assumed that a blockchain can be owned by a specific set of participants, and those participants are allowed to create new blocks with or without the use of a challenge. If a new participant is interested in becoming part of the blockchain, then its identity is verified against ownership credentials before it is appended to the stakeholders list.
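The threshold rule of PoA can be sketched as follows. This is an illustration under stated assumptions rather than the authors' Algorithm 3: HMAC-SHA256 with per-authority secret keys stands in for real digital signatures (a deployment would normally use asymmetric signatures), and the names `sign` and `authorized_to_add` are hypothetical.

```python
import hashlib
import hmac


def sign(secret_key: bytes, message: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 of the block under the authority's key."""
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def authorized_to_add(block: bytes, signatures: dict, authority_keys: dict,
                      threshold: int) -> bool:
    """Accept the block if at least `threshold` known authorities signed it.

    `signatures` maps an authority name to the signature it submitted;
    `authority_keys` maps each authorized name to its key.
    """
    valid = 0
    for name, signature in signatures.items():
        key = authority_keys.get(name)
        if key is not None and hmac.compare_digest(signature, sign(key, block)):
            valid += 1
    return valid >= threshold
```

Only signatures from listed authorities count, so inactive authorities do not block progress as long as the threshold is met.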
Algorithm 5 presents the PoG concept introduced in the blockchain. In PoG, existing participants play a game with the participant proposing a new block, based on the availability of resources. Initially, all participants are considered to be honest and ready to share their computational power. If a new participant is interested in adding a block, then it has to reveal its computational power, signature, and block information. The existing blockchain participants start the game according to the resources of the new participant. If the new participant has enough resources, then it is put in the resourceful category and a game of heavy computations is played. Otherwise, the participant is put in the resource-constrained category and a lightweight game is preferred. In the resourceful category, a multiplayer game asks the new participant to verify a random bit and its position in a challenge. The challenge could be derived from PoW, PoA, or PoO. In this work, the challenge is implemented for these three consensus algorithms (PoW, PoA, and PoO); however, it may be extended to other algorithms. In the lightweight game, a random bit position of the hash value is verified. If the new participant is able to verify the challenge, then it is considered the winner and is allowed to add the block. The honesty of participants is the key to the success of this consensus mechanism; therefore, a historical fair-play record needs to be maintained in order to increase the reliability and security of the existing blockchain.
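A bit-verification round of this kind might look like the sketch below. It is a simplified illustration, not the authors' Algorithm 5: the power threshold, the number of bit positions asked in each category (16 versus 1), and the function names are assumptions, and the "heavy" game is modeled only as a larger number of positions on the same SHA-256 hash.

```python
import hashlib
import random


def hash_bits(block: bytes) -> str:
    """The 256-bit SHA-256 digest of the block as a string of '0'/'1'."""
    return "".join(f"{byte:08b}" for byte in hashlib.sha256(block).digest())


def make_challenge(declared_power, power_threshold, rng,
                   heavy_bits=16, light_bits=1):
    """Existing participants pick the category and the random bit positions."""
    if declared_power >= power_threshold:
        category, count = "resourceful", heavy_bits
    else:
        category, count = "resource-constrained", light_bits
    return category, rng.sample(range(256), count)


def respond(block: bytes, positions):
    """The new participant reports the hash bits at the requested positions."""
    bits = hash_bits(block)
    return [bits[p] for p in positions]


def verify(block: bytes, positions, answer) -> bool:
    """Existing participants recompute the bits; a match wins the game."""
    return answer == respond(block, positions)
```

A participant that wins the round would then be allowed to append its block; recording the outcome per participant is one way to keep the fair-play history mentioned above.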
4 PoG Integrated Proposed Healthcare Application Model
The healthcare industry faces numerous challenges in its systems, including billing fraud, mistakes, interruption of operations, leakage of information, and long transaction processing times. A blockchain-based billing system can address these problems. With blockchain technology, transparency and provisions to audit records are available for all processing that involves any type of payment. Further, a well-defined, policy-based automated payment processing system can be derived using smart contracts. As the healthcare industry depends heavily on insurance, a smart contract developed for providing health insurance to patients would yield a useful and robust system. Blockchain-based
Fig. 3 PoG integrated smart contract for healthcare system
billing and insurance applications will provide a reliable source of trustworthy information at reduced cost and within the stipulated time period. As a blockchain provides immutable records, the chances of fraud or attacks will be much lower. Figure 3 shows a high-level scenario of the PoG consensus algorithm integrated into the healthcare system. There are four major parties in this system: the patient side, the hospital side, the blockchain network side, and the audit team. The hospital initiates the payment request whenever a patient arrives at the hospital. After a security amount is deposited, the blockchain network executes the PoG-based smart contract and confirms the transaction as per hospital policies. If the smart contract confirms the transaction, then a permanent record of this transaction is stored in the transaction block of the blockchain network. The data from the transaction block is forwarded to the audit department for verification of any transaction. After random verifications by the audit department, the blockchain network makes the e-payment to the hospital and sends a confirmation message to the patient side.
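The message flow between the four parties can be summarized in a short sketch. The function name `settle_visit`, the callback parameters, and in particular the rejection branches are assumptions added for illustration; the text above only describes the successful path.

```python
def settle_visit(bill, deposit, policy_allows, audit_passes):
    """Walk one visit through the flow and return (paid, log).

    `policy_allows` and `audit_passes` are caller-supplied checks that stand in
    for the hospital policy in the smart contract and the audit team's review.
    """
    log = ["hospital: payment request initiated"]
    if deposit <= 0:
        log.append("network: no security deposit, contract not executed")
        return False, log
    log.append("network: PoG-based smart contract executed")
    if not policy_allows(bill):
        log.append("contract: transaction rejected by hospital policy")
        return False, log
    log.append("network: transaction recorded in transaction block")
    log.append("audit: transaction block forwarded for verification")
    if not audit_passes(bill):
        log.append("audit: verification failed, payment withheld")
        return False, log
    log.append("network: e-payment made to hospital")
    log.append("network: confirmation sent to patient")
    return True, log
```

For example, `settle_visit(1200, 200, lambda b: b <= 5000, lambda b: True)` follows the full path from payment request to patient confirmation.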
5 Analysis
In this section, the behavior of PoG is analyzed for attack detection.
• Majority-Attack Detection: After the successful forking of several blocks, it becomes exponentially unlikely for an attacker to break PoG and produce a tampered chain. This can be shown using the Chernoff bound and independence as follows:
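Let M represent the population size of honest participants that use PoG for consensus establishment, and let m represent the population size of dishonest participants able to break PoG to create a tampered blockchain. Both sides try to maximize their chance of success as follows:

$$\mathrm{Chance}^{\mathrm{Success}}_{M} = \mathrm{Maximize}\,\{\mathrm{Uniform}(0,1)\}^{M} \tag{1}$$

$$\mathrm{Chance}^{\mathrm{Success}}_{m} = \mathrm{Maximize}\,\{\mathrm{Uniform}(0,1)\}^{m} \tag{2}$$

Further, if there are N blocks created, then the probability of successful blockchain construction is

$$P^{h}_{\mathrm{construction}} = \prod_{i=1}^{h}\left(\mathrm{Chance}^{\mathrm{Success}}_{M} - \mathrm{Chance}^{\mathrm{Success}}_{m}\right) \tag{3}$$

According to the Chernoff bound and independence, the probability of an attacker tampering with a blockchain can be bounded as follows, where $\mathrm{Chance}^{\mathrm{Success}}_{N} = \sum_{t=1}^{N}\left(\mathrm{Chance}^{\mathrm{Success}}_{M}(t) - \mathrm{Chance}^{\mathrm{Success}}_{m}(t)\right)$ accumulates the difference of Eq. (3) over N blocks and $s$ is the free parameter of the bound:

$$
\begin{aligned}
P_{\mathrm{tamper}}\left(\mathrm{Chance}^{\mathrm{Success}}_{N} \le 0\right)
&\le \min_{s>0} E\left[e^{-s\,\mathrm{Chance}^{\mathrm{Success}}_{N}}\right] \\
&= \min_{s>0} \prod_{t=1}^{N} E\left[e^{-s\,\mathrm{Chance}^{\mathrm{Success}}_{M}(t)}\right] E\left[e^{s\,\mathrm{Chance}^{\mathrm{Success}}_{m}(t)}\right] \\
&= \min_{s>0} \left(E\left[e^{-s\,\mathrm{Chance}^{\mathrm{Success}}_{M}(t)}\right] E\left[e^{s\,\mathrm{Chance}^{\mathrm{Success}}_{m}(t)}\right]\right)^{N}
\end{aligned}
\tag{4}
$$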
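To get a feel for Eq. (4), the bound can be evaluated numerically. The sketch below rests on assumptions that go beyond the text: each side's chance in a round is taken to be the maximum of M (or m) independent Uniform(0, 1) draws, the inner expectations are taken in the standard Chernoff form with exponent −s for the honest side and +s for the attacker, and the minimization over s is done on a coarse grid. The function names are hypothetical.

```python
import math


def mgf_of_max(k, s, steps=1000):
    """E[exp(s*X)] for X = max of k i.i.d. Uniform(0,1) draws.

    X has density k * x**(k-1) on [0, 1]; the integral is approximated
    with the midpoint rule.
    """
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * width
        total += math.exp(s * x) * k * x ** (k - 1)
    return total * width


def per_block_factor(M, m, s):
    """Product of the two inner expectations for one block and a given s > 0."""
    return mgf_of_max(M, -s) * mgf_of_max(m, s)


def tamper_bound(M, m, n_blocks):
    """Minimize the per-block factor over a grid of s and raise it to n_blocks."""
    s_grid = [0.1 * i for i in range(1, 101)]  # s in (0, 10]
    best = min(per_block_factor(M, m, s) for s in s_grid)
    return best ** n_blocks


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(n, tamper_bound(10, 3, n))
```

With more honest than dishonest participants, the per-block factor drops below 1 for some s, so the bound shrinks geometrically in the number of blocks; with the sizes reversed, no s > 0 brings the factor below 1 and the bound is uninformative.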
Since M > m, there exists an s > 0 such that the product of the inner expectations is less than 1. In conclusion, the probability that an attacker succeeds decreases as the number of blocks increases.
• The Priority of Blocks in the Blockchain: As the number of players playing the game within the blocks of a blockchain increases, so does the priority of the blockchain. Each participant associated with any block of a blockchain is equally likely to play a game. Thus, the expected number of blocks mined by N participants is proportional to the total number of participants associated with blocks, as shown in Fig. 4.
• Block Round Time and Confirmation Time: A block's round and confirmation times are found to be much lower than 10 min (Bitcoin's block time) because of variation in the availability of resources during challenge generation and confirmation. Compared with the block propagation time (6.5 s) observed in the Bitcoin network, the proposed scheme has a comparable time period because a block is propagated only if the game round executes successfully. The block mining process includes the block round time and the confirmation time. Figure 5 shows the variation in the number of blocks mined with an increase in the number of participants associated with blocks. Results show that the number of blocks mined is
Fig. 4 Variation in blockchain priority level with an increase in the number of blocks mined per participant association (vertical axis: blockchain priority level, 0 to 12; horizontal axis: number of blocks mined per participant association, 1 to 10)
Fig. 5 Variation in the number of blocks mined with an increase in the number of participants associated with blocks (calculated using statistical formula; vertical axis: number of blocks mined, 0 to 250; horizontal axis: number of blocks, 10 to 100; series: number of participants associated with each block = 1 and = 2)
Fig. 6 Variation in the number of blocks mined per second with variation in block challenge (series: average number of blocks mined per second for 1, 5, 10, and 15 participants; vertical axis ticks: 0 to 12)
comparatively higher for two participants than for one participant. This difference increases with an increase in the number of blocks mined. Figure 6 shows the variation in the number of blocks mined per second with variation in the block challenge. Multiple challenges are considered for comparative analysis. For example, a 256-bit hash output is considered with 1, 5, 10, 15, 20, and 25 bits of verification. A comparative analysis of the number of blocks against increasing bit verification and modular arithmetic operations shows that the number of blocks mined decreases exponentially with an increase in the computational challenge on a system with an Intel Core i5-7200U CPU @ 2.50 GHz processor, 4 GB RAM, and a 64-bit operating system.
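The exponential trend is what a simple probabilistic model predicts. The sketch below is a back-of-the-envelope model, not a reproduction of the measurements: it assumes that each attempt matches k verified bits independently with probability 2^(-k), and the attempt rate of one million per second is an arbitrary illustrative value.

```python
def success_probability(bits_verified):
    """Probability that one attempt matches all verified bits (assumed model)."""
    return 0.5 ** bits_verified


def expected_blocks_per_second(bits_verified, attempts_per_second):
    """Expected number of successful rounds per second at a fixed attempt rate."""
    return attempts_per_second * success_probability(bits_verified)


if __name__ == "__main__":
    # Bit counts taken from the text; the attempt rate is hypothetical.
    for bits in (1, 5, 10, 15, 20, 25):
        print(bits, expected_blocks_per_second(bits, 1_000_000))
```

Under this model, every additional verified bit halves the expected block rate, so moving from 5 to 10 bits reduces it by a factor of 32.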
6 Conclusion
In this work, the PoG concept is introduced with a specific emphasis on designing methodologies for both resourceful and resource-constrained networks. PoG puts challenges to a challenger if the challenger is willing to give prior information about the availability of its resources; otherwise, memory access time is assumed to provide resource-independent results. Further, PoG is a bit generator and verifier game between existing and new participants. The new participant is considered the winner and allowed to participate in the blockchain if the bits are verified within the stipulated time period. The experimental scenario shows that the number of blocks mined in a PoG-integrated blockchain decreases exponentially with an increase in the number of bits verified and the number of participants. The honesty of the participant is found to be the
backbone of success for the PoG consensus mechanism. In the future, PoG will be extended to statistical PoG (SPoG), which will include an honesty verification model applied before and after playing a game for adding new blocks.
References
1. All Cryptocurrencies. https://coinmarketcap.com/all/views/all/
2. Z. Zheng, S. Xie, H. Dai, X. Chen, H. Wang, An overview of blockchain technology: architecture, consensus, and future trends, in 2017 IEEE International Congress on Big Data (BigData Congress). IEEE (2017), pp. 557–564
3. W. Wang, D.T. Hoang, Z. Xiong, D. Niyato, P. Wang, P. Hu, Y. Wen, A Survey on Consensus Mechanisms and Mining Management in Blockchain Networks. arXiv preprint arXiv:1805.02707 (2018), pp. 1–33
4. S. Nakamoto, Bitcoin: A Peer-to-Peer Electronic Cash System. http://bitcoin.org/bitcoin.pdf (2008)
5. B. Laurie, R. Clayton, Proof-of-work proves not to work; version 0.2, in Workshop on Economics and Information Security
6. A. Baliga, Understanding blockchain consensus models, in Persistent (2017)
7. A. Back, Hashcash (1997). http://www.cypherspace.org/adam/hashcash/
8. M.K. Franklin, D. Malkhi, Auditable metering with lightweight security, in Financial Cryptography (1997), pp. 151–160
9. D. Mankins, R. Krishnan, C. Boyd, J. Zao, M. Frentz, Mitigating distributed denial of service attacks with dynamic resource pricing, in Proceedings of 17th Annual Computer Security Applications Conference (ACSAC 2001) (2001)
10. A. Serjantov, S. Lewis, Puzzles in P2P systems, in 8th CaberNet Radicals Workshop, Corsica, Oct 2003
11. G. Kumar, R. Saha, M.K. Rai, R. Thomas, T.H. Kim, Proof-of-work consensus approach in blockchain technology for cloud and fog computing using maximization-factorization statistics. IEEE Internet Things J. (2019)
12. S. Dey, Securing majority-attack in blockchain using machine learning and algorithmic game theory: a proof of work, in 2018 10th Computer Science and Electronic Engineering (CEEC). IEEE (2018), pp. 7–10
13. A. Stone, An examination of single transaction blocks and their effect on network throughput and block size. Self-published paper (Jun. 2015) [Online]. Available: http://ensocoin.org/resources/1txn.pdf
14. X. Liu, W. Wang, D. Niyato, N. Zhao, P. Wang, Evolutionary game for mining pool selection in blockchain networks, in IEEE Wireless Communications Letters (2018), pp. 1–1
15. Z. Xiong, S. Feng, D. Niyato, P. Wang, Z. Han, Optimal pricing based edge computing resource management in mobile blockchain, in 2018 IEEE International Conference on Communications (ICC), Kansas City, Kansas, May 2018
16. Y. Jiao, P. Wang, D. Niyato, Z. Xiong, Social welfare maximization auction in edge computing resource allocation for mobile blockchain, in 2018 IEEE International Conference on Communications (ICC), Kansas City, Kansas, May 2018
17. N. Houy, The bitcoin mining game. Ledger J. 1(13), 53–68 (2016)
18. N. Houy, The Bitcoin Mining Game. Available at SSRN 2407834 (2014)
19. A. Kiayias, E. Koutsoupias, M. Kyropoulou, Y. Tselekounis, Blockchain mining games, in Proceedings of the 2016 ACM Conference on Economics and Computation. ACM (2016), pp. 365–382
Analysis of Diabetes and Heart Disease in Big Data Using MapReduce Framework Manpreet Kaur Saluja, Isha Agarwal, Urvija Rani, and Ankur Saxena
Abstract MapReduce is a programming model used for processing large datasets, such as Big Data, and producing simplified, summarized collections from them. In this paper, we apply MapReduce to the healthcare system to support effective and timely decision making. In life science, an enormous amount of data is generated on a daily basis, and much of the storage is occupied by routinely generated data on non-communicable diseases such as high blood pressure and diabetes. Diabetes and heart-related diseases are among the most common diseases around the world. These non-communicable diseases are usually caused by changing lifestyles, diet, and stress. As the number of patients increases, so does the number of patient records to be handled. This huge amount of data is made useful through Big Data techniques, one of which is MapReduce. By applying MapReduce, the huge volume of clinical data is filtered into different categories that can be easily read for future reference. The main purpose of our paper is to filter various parameters of diabetes as well as heart diseases, putting clinical data to use in developing medical intelligence for a patient-centered healthcare system. In our dataset, people in the 45–54 age group are the most prone to diabetes, whereas heart-related disorders mostly affect people above the age of 65.
Keywords Big data · MapReduce · Hadoop · Diabetes · Heart disease · Word count
M. K. Saluja · I. Agarwal · U. Rani · A. Saxena (B)
Amity Institute of Biotechnology, Amity University, Noida, Uttar Pradesh, India
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_3

1 Introduction
Big Data is defined as a method in which huge amounts of structured, semi-structured, and unstructured data are handled computationally [1]. In other words, big data is a set of records so large that it becomes too complex for conventional database system tools or data processing programs [2]. In today's scenario,
M. K. Saluja et al.
there is a need to handle a huge amount of data, and all this data can be handled using the technique called big data [3]. For example, the huge volume of data in hospitals helps doctors give a detailed description of a patient's disease. Before big data was introduced, the data collected by hospitals could not give a detailed description of a patient's disease, and therefore preventive measures could not be taken. Due to the rapid growth of such data, solutions need to be studied and provided in order to handle these datasets and extract value and knowledge from them [4]. Furthermore, decision-makers need to be able to gain valuable insights from such rapidly changing data, ranging from daily transactions to customer interactions and social network data [5]. Such value can be provided by big data analytics, which is the application of advanced analytics techniques to big data [6]. This paper aims to analyze some of the different analytics methods and tools that can be applied to big data, and the opportunities provided by the application of big data analytics in various decision domains [7, 8]. How do we know which data is big data? It is characterized by the five V's [9] (Fig. 1).
(1) Volume: The volume of data is so huge that complex data can no longer be managed conventionally [10].
(2) Variety: Several types of data are generated by different kinds of sources. The data takes three forms: structured, semi-structured, and unstructured [11].
(3) Velocity: The speed at which all this variety of data accumulates [12].
(4) Value: The biggest problem is how to extract useful data, which is called value. We first need to mine useful content from the data and then perform analysis on it, ensuring that the analyzed data is of some value [13].
(5) Veracity: When data is dumped, there is a chance of losing some valuable data in the process, so we need to mine it again so that valuable data is not lost [14, 15].
In the recent era of information technology, large sets of data are available that need to be filtered and put to use, and this is possible through data analytics using big data techniques. Big data is being used in many industries, such as medicine and health care, retail and transport, education services, banking and securities, and also in government sectors. This extensive use of big data has enabled researchers to develop more and more new techniques to simplify its use. Some examples of big data in today's scenario are [16]:
(a) Social media: When we use social media such as Instagram, Facebook, and Twitter, a lot of data is generated because millions of people use these platforms at the same time. Even a like or the sharing of a video creates data, and big data techniques are needed to maintain this huge amount of data [17].
Analysis of Diabetes and Heart Disease in Big Data …
Fig. 1 Five parameters of big data
(b) Electronic health records (EHRs): This is the most widespread application of big data in medicine [18, 19]. Every patient has their own digital record, which includes demographics, medical history, allergies, laboratory test results, etc. Records are shared through secure information systems and are available to providers from both the public and private sectors. EHRs can also trigger warnings and reminders when a patient should get a new laboratory test, or track prescriptions to see whether a patient has been following the doctor's orders.
(c) Curing cancer: Medical researchers can use huge amounts of data on treatment plans and recovery rates of cancer patients in order to find trends and treatments that have the highest rates of success in the real world [20, 21] (Fig. 2).
Fig. 2 Real-world sectors where big data plays a major role
Data in life sciences and health care is expected to grow exponentially in the coming years and will be beyond the capability of traditional methods of data management and data analytics [22, 23]. In addition, healthcare reimbursement models are changing; meaningful use and pay for performance are emerging as critical new factors in today's healthcare environment [24, 25]. It is vitally important for life sciences and healthcare organizations to acquire the available tools, infrastructure, and techniques to leverage this huge quantity of data effectively, or else risk losing potentially millions of dollars in revenue and profits [26, 27]. Figure 3 illustrates how data from various tests is collected, namely prescription data, diagnostic data, sequencing data, and outcome data (from clinical trials or hospitals), to determine which diseases patients are suffering from and which preventive steps they should take. Different patients are then given different drugs according to their diseases: even if two patients have the same disease, the same drug may work for one patient but not for the other, so drugs are given according to how each patient's body responds. A personalized healthcare strategy is then adopted, which divides people into different groups according to their diseases. One parameter to observe is whether the patient's health has improved or worsened over the past few days; if it is worsening, then the doctor needs to carry out more checkups and change the drug.
Diabetes and Heart Diseases
These two chronic diseases are among the most common and the most fatal. Diabetes is a medical condition in which the blood glucose level rises above 140 mg/dl, whereas heart disease refers to a condition in which the heart does not function normally. It is also called cardiovascular disease, which refers to a
Fig. 3 Big data in lifestyle and health care
condition in which a blockage or narrowing of blood vessels causes a heart attack, chest pain, or stroke. The continued presence of these diseases in a person can lead to death, so they must be treated, but it is difficult to keep routine records at a global level. This is where big data comes into use, and the use of big data in predictive analysis has increased extensively. Diabetes involves a large number of genes and environmental factors that can affect its intensity, so a large number of medications are available. Each body also responds to each medication differently, so it becomes difficult to predict all the possible mechanisms. This problem is currently being addressed using big data and tools such as Hadoop and MapReduce. Research on and prediction of heart disease are being improved using big data analytics through a growing number of high-resolution studies, the development of models to predict intensity and contributing factors, real-time response to daily problems, drug surveillance, the formulation of personalized medicines, and quality-control measurements.
MapReduce is the major processing component of Hadoop and is used to filter a huge amount of data [28]. It takes a complex set of data and refines it so that the output is less complicated and easier to understand [29]. The map step takes a collection of data and converts it into another set of data in which individual elements are broken down into tuples (key-value pairs) [30, 31]. The reduce step then takes the output from the map as its input and combines those tuples into a smaller set of tuples [32]. As the order of the name MapReduce implies, the reduce task is always performed after the map task [21, 32]. Hadoop is an open-source software framework that manages data processing and storage for big data applications in clustered systems. It is one of the major technologies frequently used to support advanced analytics tasks, including predictive analytics, data mining, and machine learning applications. Hadoop can handle various types of structured and unstructured data, giving users more flexibility for processing and analyzing data [19]. Figure 4 explains the MapReduce process. First, the data is input; then, the map method filters and sorts the data, and the reduce method performs a summary operation on the output of the map step. In this way, the data is filtered and the desired output is obtained (Fig. 5).
2 Literature Review
The need for a national study of diabetes in India, covering a representative sample of the country's population from both urban and rural areas, led to the Indian Council of Medical Research-India Diabetes (ICMR-INDIAB) study. The purpose of ICMR-INDIAB is to establish the national and state-specific prevalence of diabetes and prediabetes in India. The study reports the prevalence of diabetes and prediabetes in 15 states of India and explores variations in diabetes and prediabetes phenotypes by state, rural and urban setting, and individual characteristics [33].
Fig. 4 Process of how MapReduce works
Fig. 5 Current use of big data to improve the diabetic healthcare system
The key to cardiovascular disease management is to evaluate large numbers of datasets, compare them, and mine them for information that can be used to predict, prevent, manage, and treat conditions such as heart attacks. Big data analytics is recognized in the corporate world for its valuable use in controlling, contrasting, and managing large datasets, and it can be applied successfully to the prediction, and thereby the prevention, of cardiovascular disease (Fig. 6).
3 Methodology
For this paper, we collected data on people suffering from heart disease and diabetes. The data was collected on the basis of different parameters: age, BMI, physical exercise, sex, and alcohol consumption. We divided age into four sub-groups (1–18, 19–44, 45–64, and 65+) and BMI into five sub-groups (underweight, normal weight, overweight, obese, and extremely obese). We then applied the MapReduce method to filter the collected data. The filtering process is described below. Figure 7 depicts the body mass index (BMI) data. The five BMI categories are coded as follows: underweight is denoted by the letter A, normal weight by B, overweight by C, obese by D, and extremely obese by E. The first step is to input the data; the map method then filters and sorts it, and the reduce method produces the desired output from the mapped data. Similarly, Fig. 8 depicts alcohol consumption and physical exercise, which could affect heart disease. The complete execution process involves compiling the data with readable keys and then running the WordCount program in Hadoop with specific Map and Reduce classes. The compiled program is packaged as a JAR file, which is not human readable. The JAR file is then submitted as a Hadoop job, and we get a readable output, which is further analyzed for various purposes. Hadoop can run MapReduce programs written in various languages; we used Java to write the program (Fig. 9). The steps performed by WordCount are:
a. Splitting of data: The input is split into fixed-size pieces, and each split is consumed by a single map task. The splitting parameter could be anything, for example, a space, comma, semicolon, or newline ('\n').
b. Mapping: In the execution phase, the data in each split is passed to the mapping function, which produces the output values. In this way, the input collection of data is transformed into another form of data in which every element is broken into tuples.
c. Intermediate splitting: The whole process runs in parallel on different clusters. In order to group the data in the "Reduce Phase," records with the same KEY have to be on the same cluster.
Fig. 6 Prevalence of diabetes and GDP per capita by state
d. Reduce: This phase mostly consists of grouping.
e. Combining: This is the last phase, in which all the splits are combined to form the desired result.
All the above steps do not have to be written individually; rather, we only have to add a few specific commands for the MapReduce program to execute. We just have to
Fig. 7 Dataset of BMI (A = underweight, B = normal weight, C = overweight, D = obese and E = extremely obese)
Fig. 8 Dataset of alcohol consumption and physical exercise (where F is the amount of alcohol consumption and G is the physical work done by the people)
write down the splitting parameters, the map function logic, and the reduce function logic; all other commands are executed automatically. The MapReduce algorithm uses three fundamental steps:
(1) Map function
(2) Shuffle function
(3) Reduce function
The role and responsibility of each function in the MapReduce algorithm are discussed below.
Fig. 9 Process of MapReduce execution
Map function: The map function is the first step in MapReduce. It takes the input tasks and divides them into smaller sub-tasks, and then performs the required computation on each sub-task in parallel. This step performs the following two sub-steps:
• Splitting
• Mapping
The splitting step takes the input dataset from the source and divides it into smaller sub-datasets. The mapping step takes those smaller sub-datasets and performs the required computation on each of them.
Shuffle function: This is the second step in MapReduce. This function is also known as the "Combine Function." It includes two steps:
• Merging
• Sorting
Reduce function: This is the last step in MapReduce. It performs only one step, the reduce step: it takes the list of sorted pairs from the shuffle function as input and applies the reduce function (Fig. 10).
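The three functions can be mimicked in a few lines of single-process Python, which may help in following the data flow. This is only a sketch of the idea: it is not the authors' Java program and does not use Hadoop, the function names are hypothetical, and the comma delimiter and the letter-coded input are illustrative choices.

```python
from itertools import groupby
from operator import itemgetter


def split_input(text, parts):
    """Splitting: cut the input lines into at most `parts` roughly equal pieces."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_split(lines, sep=","):
    """Mapping: emit a (key, 1) tuple for every token in one split."""
    return [(token.strip(), 1)
            for line in lines
            for token in line.split(sep)
            if token.strip()]


def shuffle(mapped_splits):
    """Shuffle: merge the tuples of all splits and sort them by key."""
    merged = [pair for split in mapped_splits for pair in split]
    return sorted(merged, key=itemgetter(0))


def reduce_pairs(sorted_pairs):
    """Reduce: sum the values of each run of identical keys."""
    return {key: sum(value for _, value in group)
            for key, group in groupby(sorted_pairs, key=itemgetter(0))}


def word_count(text, parts=2):
    """Run split, map, shuffle and reduce in sequence on the given text."""
    mapped = [map_split(split) for split in split_input(text, parts)]
    return reduce_pairs(shuffle(mapped))
```

In a real cluster the map calls run on different nodes and the shuffle moves tuples with the same key to the same reducer; here the same steps simply run one after another.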
Fig. 10 Algorithm of MapReduce
4 Results and Discussion
The graph shown in the result below (drawn with R, using all the parameters as input) is the graph of diabetes with all the parameters, that is, age, BMI, sex, alcohol consumption, physical exercise, etc. In this graph, the data from different regions is collected and merged into a single graph, whereas the graph shown in the literature review presents the data of 15 states individually in the context of rural and urban areas. The diabetes data consists of five parameters, and these parameters are further arranged into sub-groups. For example, age is divided into 1–18, 19–44, 45–64, and 65+, and BMI is divided into underweight, normal weight, overweight, obese, and extremely obese. The data is then filtered using the WordCount algorithm of the MapReduce method, which divides the patients into different groups: patients suffering from the same disease are put into one category, patients of the same age into another group, and so on (Fig. 11).
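The grouping into sub-groups can be sketched as a key-generating map step followed by a count. The age bands below are the ones stated in the text; the BMI cut-offs (18.5, 25, 30, 40) are commonly used thresholds assumed here because the paper does not list its own, the letters A to E follow the coding of Fig. 7, and the sample records in the usage note are invented for illustration.

```python
from collections import Counter


def age_group(age):
    """Age bands used in the paper: 1-18, 19-44, 45-64, 65+."""
    if age <= 18:
        return "1-18"
    if age <= 44:
        return "19-44"
    if age <= 64:
        return "45-64"
    return "65+"


def bmi_code(bmi):
    """Letter codes A-E; the numeric cut-offs are assumed, not from the paper."""
    if bmi < 18.5:
        return "A"  # underweight
    if bmi < 25:
        return "B"  # normal weight
    if bmi < 30:
        return "C"  # overweight
    if bmi < 40:
        return "D"  # obese
    return "E"      # extremely obese


def map_patient(record):
    """Map step: emit one (key, 1) tuple per category the patient falls into."""
    yield ("age:" + age_group(record["age"]), 1)
    yield ("bmi:" + bmi_code(record["bmi"]), 1)


def count_categories(records):
    """Reduce step: sum the emitted ones per key, as WordCount does per word."""
    counts = Counter()
    for record in records:
        for key, one in map_patient(record):
            counts[key] += one
    return dict(counts)
```

For instance, `count_categories([{"age": 52, "bmi": 31.0}, {"age": 70, "bmi": 23.5}])` returns one count each for the keys `age:45-64`, `age:65+`, `bmi:D`, and `bmi:B`.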
Fig. 11 Result of diabetes
5 Conclusion
The main aim of this paper is to highlight the use of MapReduce in the healthcare system and to ease the work of sorting medical data and using it for future reference. A huge amount of data is filtered through MapReduce, and the patients can be divided into different categories according to their diseases. For example, patients with the same disease are kept in one group, whereas patients who are given the same drug are kept in another group, because patients diagnosed with the same disease cannot always be given the same drug: every drug acts differently on different bodies. This paper mainly addresses the basic parameters of the human body that are involved in diabetes and heart diseases. These parameters can further be used for other conditions such as cancer, thyroid disorders, and early aging. Before big data was introduced, the role of data in the treatment of patients was limited. Hospitals would collect data such as patient name, age, height, weight, disease description, family history, and medical reports; such data provides limited information about which disease the person is suffering from and which drug is to be given. With big data, detailed information is available, with the help of which doctors can more easily diagnose the patient and suggest preventive measures against a particular disease.
Analysis of Diabetes and Heart Disease in Big Data …
6 Future Scope MapReduce in big data has enabled us to make predictions through mathematical and programming approaches in different healthcare systems for future reference. Predictive analytics with the help of big data in the medical field will help save a lot of resources and time for people in the near future. It will enable patients to obtain treatment for minor diseases and will also help with health tracking. With the help of machine learning and artificial intelligence, big data will be able to remove more and more human errors and to support the treatment of high-risk patients with continuous care, by digitizing their records and making it easier for doctors to detect the root cause and eliminate it. With all this, big data will also help to reduce costs in the medical sector. Acknowledgements We wish to show our gratitude and sincere thanks to our mentor Dr. Ankur Saxena, Amity Institute of Biotechnology, Noida, for his guidance, assistance, insights, and expertise. We also express our sincere thanks to the Amity Institute of Biotechnology, Amity University, Noida, for providing us this opportunity and a great platform to work on.
User Interface of a Drawing App for Children: Design and Effectiveness Savita Yadav, Pinaki Chakraborty, and Prabhat Mittal
Abstract Drawing is an integral part of child rearing. Nowadays, children use mobile apps to draw along with traditional mediums. We need to understand how children interact with drawing apps in order to develop apps that can facilitate their artistic and cognitive development. In this study, we tried to understand how children aged two to eight years interact with drawing apps and how the user interface of such apps should be designed. We developed a mobile app specifically for young children. The app provides children with a fixed-sized canvas, a limited color palette and a few other features. We provided the app to 90 children between two and eight years of age and observed them when they used the app to draw. We observed that the two- and three-year-old children did not explore the features of the app and scribbled randomly, typically with a single color. The four-to-six-year-old children drew with multiple colors, were able to use the ‘Undo’ and ‘Redo’ options and liked to revisit their previous drawings. The seven- and eight-year-old children drew figures resembling real-world objects and were able to narrate stories about what they were drawing. The seven- and eight-year-old children were, however, frustrated by the lack of colors in the palette and the simplicity of the app. We conclude that age-appropriate drawing apps may be helpful in child development. Keywords User interface · Smartphone · Drawing app · Child
S. Yadav (B) · P. Chakraborty Department of Computer Science and Engineering, Netaji Subhas University of Technology, New Delhi, India e-mail: [email protected]
P. Mittal Department of Commerce, Satyawati College (Evening) University of Delhi, Delhi, India
© Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_4
1 Introduction User interface is a medium through which a user interacts with a system. The user interface provides the user with multiple functionalities to translate her/his goal into a sequence of executable tasks. In addition, the user interface makes provisions
to effectively represent the result of the execution of the tasks to the user. User interface plays an indispensable role in determining the success of an interaction of a human with a system. Knowledge of a particular domain and related tasks plays a pivotal role while designing the user interface. Over a period of time, the interaction styles supported by the user interface have undergone significant changes. Use of smartphones by adults and children alike has altered the way we interact with a system. Interactions, nowadays, are more dynamic, frequent and mobile in nature. A smartphone lends itself to this type of interaction and is hence a very popular device [1]. Drawing is one of the most cherished activities for children [2]. A lot of emphasis is laid on drawing classes in the school curriculum for young children. Children perform better when learning is supplemented with drawing. Children typically draw using crayons and chalks. Children use their imagination to produce works of art. Smartphones are also used by children to accomplish drawing-related assignments. A significant number of drawing apps are now available. Such apps are an interesting addition to the various platforms available for drawing [3]. Drawing apps provide a flexible and conducive environment supporting reversibility of all the actions. These drawing apps provide children with ample scope to experiment and visualize. However, drawing apps will be useful only if they have a proper user interface [4]. The user interface of drawing apps for children needs thorough planning to accommodate their needs and demands. Such apps should try to find a fine balance between structured and exploratory methods of imparting knowledge. The interface should not inundate the children with unnecessary information. The visual constructs on the screen should be relatable to the daily activities of children. Icons should follow an appropriate color scheme to make them visually appealing.
The menu design should be linear to allow sequential access. The app should facilitate multimodal interaction, in which the child has the flexibility to provide input using multiple methods like touch, speech and text. Use of textual instructions should be minimal. Cues with the help of animated and audio instructions are encouraged. Another crucial aspect in the design of drawing apps for children is feedback. Children are eager and impatient at times, and it is thus of paramount importance that the user interface provides instantaneous feedback in response to any action. Completion of a task may be met with a reward or a surprise. In this paper, we report our experiment in which we designed a drawing app with a simple user interface and observed how two-to-eight-year-old children use it. Our results will help in understanding children’s interaction with smartphones and developing mobile apps suitable for children.
2 Related Work The digital and non-digital resources can coexist to provide an enhanced contributory environment for development of different facets of creativity among children [5]. Ready-to-use digital images offer the benefit of consuming without incurring
expenses on material [6]. However, the attitude of the caregivers of young children also influences how well technological interventions are received [7]. Drawing as an activity is enjoyable and informative for children [8]. Children often use drawing to express themselves without inhibitions [8–10]. Effort is needed to create educational opportunities for children at home as well as at school to strengthen them artistically [8]. Drawing apps are used by children to scribble and draw [11]. Children gain the skills to use drawing apps with age. Drawing increases the attentiveness of children [10, 12]. Moreover, children are able to recall their observations and use them while drawing [9, 10]. Child development can be positively influenced by the use of drawing apps, and some drawing apps have been developed specifically for children [13, 14]. Emphasis must be laid on understanding the limitations of digital media such as smartphones and tablets as drawing tools for children, and on how children circumvent those constraints [15]. Recent research points out that children are able to adapt and draw using a drawing app on a tablet [16] and on a smartphone [11]. Children’s ability to adapt to the constraints is not much affected by the mode or the media used for drawing [17]. It has also been observed that hard-to-obtain rewards and penalties motivate children to excel in their usage of drawing apps [18].
3 Materials and Methods 3.1 The App We developed a drawing app for children named Baby’s Drawing App [19]. The app provides a simple, attractive and engaging platform for children to hone their drawing skills. The display icon of the app is colorful and appealing. The interface of the app is simple and provides a child with a hassle-free environment to play around (Fig. 1a). The app displays all the relevant options on the screen such as ‘Start New Drawing’, ‘Erase’, ‘Undo’ and ‘Redo’ with the help of self-explanatory and relatable icons. A click on an icon is followed by an auditory response specifying the function of the clicked icon. The organization of the menu options on the top portion of the screen leaves ample canvas for a child to scribble and draw. The canvas is white and cannot be resized. The color palette is judiciously chosen to be minimal so as not to confuse a child. Each of the colors is depicted by an amusing like-colored icon. As a child picks a color, a confirmatory audio is played to reaffirm the color to the child. Soothing melody plays in the background as the child engages in drawing with her/his finger. The app automatically saves the drawings of the child, and the last three drawings can be reloaded. The app makes minimal use of textual instructions. Moreover, textual instructions are supplemented with audio. The app makes use of animated icons in place of plain text to help a child in responding to questions asked by the app (Fig. 1b). A child can undo any accidental touch on the canvas by clicking the ‘Undo’ icon. A child can exit the app at any time as the app does not bother the
child with unnecessary dialogs. Baby’s Drawing App may be downloaded for free from Google Play.
Fig. 1 Baby’s Drawing App: a the user interface, with the labelled elements Start new drawing, Erase, Undo, Redo, Reload drawings, Color palette and Canvas, and b the use of animation to facilitate dialog
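The behaviour described above (undoing an accidental touch, redoing it, and reloading the last three drawings) can be modelled with two stacks and a bounded queue. The sketch below is a hypothetical model written for illustration; the class and method names are assumptions, and the paper does not describe how the app itself is implemented.

```python
from collections import deque


class DrawingSession:
    """Hypothetical model of the undo/redo and auto-save behaviour."""

    def __init__(self, max_saved=3):
        self.strokes = []                     # strokes currently on the canvas
        self.redo_stack = []                  # strokes removed by 'Undo'
        self.saved = deque(maxlen=max_saved)  # only the most recent drawings

    def draw(self, stroke):
        self.strokes.append(stroke)
        self.redo_stack.clear()  # a new stroke invalidates the redo history

    def undo(self):
        if self.strokes:
            self.redo_stack.append(self.strokes.pop())

    def redo(self):
        if self.redo_stack:
            self.strokes.append(self.redo_stack.pop())

    def start_new_drawing(self):
        # Save the current canvas automatically, then clear it.
        if self.strokes:
            self.saved.append(list(self.strokes))
        self.strokes = []
        self.redo_stack = []
```

Because the queue is created with `maxlen=3`, saving a fourth drawing silently discards the oldest one, which matches the statement that the last three drawings can be reloaded.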
3.2 Study Design The study was conducted on three age groups of children in a play-cum-primary school in Dwarka, New Delhi, in the month of July 2019. Children aged two and three years formed the first age group, the second age group comprised four-to-six-year-old children, and children aged seven and eight years constituted the third age group. Every group consisted of thirty children (fifteen females and fifteen males). Each child was provided with Baby’s Drawing App on a smartphone and was instructed to draw using the app. Children were not tutored initially on how to use the app. Every child was given five minutes to draw using the app. However, if a child was unable to comprehend the features of the app or could not draw anything in the first minute, then assistance was provided to her/him. We observed the children draw using the app and recorded their usage pattern. In each age group, we noted the number of children who were interested in using the app, the number of children who could use the app without training, the number of children who could use all the features of the app, the number of children who struggled to use the features of the app even after training and the number of children who complained about the scarcity of options in the color palette. The drawings of the children were saved for future analysis.
3.3 Statistical Analysis We used SPSS version 26.0 to analyze the usage pattern of the children. We used chi-square test in crosstab format to compare the number of children in each group who displayed a particular behavioral trait.
4 Results 4.1 Quantitative Results We observed that 90% of the children aged two and three years were ecstatic about using the app, and this percentage rose to 100% in the two older age groups (Table 1). The app could be used by 60% of the two- and three-year-old children without any help in initiation, and this percentage improved with age. We found that 93% of the children aged four to six years and all children aged seven and eight years needed no help in using the app. The different features of the app like ‘Start New Drawing’, ‘Erase’, ‘Undo’, ‘Redo’ and ‘Load Previous Painting’ could be used by 70% of the two- and three-year-old children. The proficiency to use all the features of the drawing app improved with age, as 87% of the four-to-six-year-old children and 97% of the seven- and eight-year-old children excelled in it. Around 17% of the children in the first age group and 3% of the children in the second age group struggled to understand the use of the ‘Redo’ and ‘Load Previous Painting’ features of the app. An interesting observation made during the course of the experiment was that all children in the first age group were satisfied by the number of colors in the palette, but 23% of the children in the second age group and 87% of the children in the third age group complained about too few choices of color for drawing. The result of the chi-square test confirmed that the behavioral traits were significantly different for the children in the three age groups (P < 0.05).
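The chi-square statistics reported in Table 1 can be re-derived from the counts alone. The following is a minimal sketch that assumes a Pearson chi-square test on a 3 × 2 table (children who showed the trait versus those who did not, with 30 children per group); the function names are illustrative and the code is not the SPSS procedure used in the study. For two degrees of freedom the P-value has the closed form exp(−χ²/2), so no statistics library is needed.

```python
import math


def chi_square(yes_counts, group_size):
    """Pearson chi-square for a (groups x 2) yes/no contingency table.

    Assumes equal group sizes and that both columns have a non-zero total.
    """
    rows = [(yes, group_size - yes) for yes in yes_counts]
    total = group_size * len(rows)
    column_totals = [sum(row[j] for row in rows) for j in range(2)]
    statistic = 0.0
    for row in rows:
        for j in range(2):
            expected = group_size * column_totals[j] / total
            statistic += (row[j] - expected) ** 2 / expected
    dof = (len(rows) - 1) * (2 - 1)
    return statistic, dof


def p_value_two_dof(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 d.f."""
    return math.exp(-statistic / 2)


# First row of Table 1: children interested in using the app (27, 30, 30).
stat, dof = chi_square([27, 30, 30], group_size=30)
p = p_value_two_dof(stat)
```

Rounded to the precision used in Table 1, this gives a statistic of 6.21 with 2 d.f. and a P-value of 0.045 for the first row; the other rows can be checked by passing their counts in the same way.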
Table 1 Behavior of the children while using Baby’s Drawing App (number of children out of 30 in each age group)
Observation | Two- and three-year-old children | Four-to-six-year-old children | Seven- and eight-year-old children | Chi-square statistic | d.f. | P-value
How many children were interested in using the app? | 27 | 30 | 30 | 6.21 | 2 | 0.045*
How many children could use the app without training? | 18 | 28 | 30 | 20.98 | 2 | < 0.001*
How many children could use all the features of the app? | 21 | 26 | 29 | 8.29 | 2 | 0.016*
How many children struggled to use the features of the app even after training? | 5 | 1 | 0 | 7.50 | 2 | 0.024*
How many children complained about the scarcity of options in the color palette? | 0 | 7 | 26 | 51.96 | 2 | < 0.001*
*P < 0.05
4.2 Analysis of the Children’s Drawings The children in the age group of two and three years scribbled mostly with a single color. The children in this age group did not explore the options provided by the app much, and their scribbling resembled random lines (Fig. 2). The four-to-six-year-old children were in a better position to use the different color options, and their drawings precisely reflected real-world entities (Fig. 3). Children in this age group were more in command of their actions and could decide when to use the ‘Undo’ and ‘Redo’ options. They made fewer mistakes, such as clicking on wrong icons and pressing icons repeatedly, and had fewer accidental touches. They also showed interest in revisiting their previous drawings and used the ‘Load Previous Painting’ option for the same. The seven- and eight-year-old children had expertise in drawing myriad objects which may exist in reality or may be a figment of their own imagination (Fig. 4). The children in this age group used sequences of drawings to narrate a story and liked to boast about their accomplishments.
Fig. 2 Samples of drawings of two- and three-year-old children
Fig. 3 Samples of drawings of four-to-six-year-old children
Fig. 4 Samples of drawings of seven- and eight-year-old children
5 Discussion Drawing is an integral part of a child’s growing up years. Over a period of time, the popularity of smartphones has led to a surge in the development of drawing apps. Drawing apps are used by children to augment their traditional drawing platforms like
paper and crayons [11]. The interface design of such drawing apps should provide an easy-to-understand environment, linear menus and hassle-free navigation. Children can start using drawing apps at the age of two years. Initially, they scribble randomly on the screen of the smartphone. As children grow up, their scribbling becomes more controlled. Later, as children reach the age of four years, they transition from the scribbling stage to the preschematic stage [20]. This was reaffirmed in our study. Although Baby’s Drawing App initially appealed to the children aged seven and eight years, their interest soon faded. These children were dissatisfied by the less challenging task and wanted a format that could also earn them rewards or surprise them with virtual goodies. Children of seven years and above fall in the schematic stage [20]. Traditional methods of drawing using paper, drawing sheets and a variety of colors offer children a chance to visualize and physically transform their creative imagination. Likewise, drawing apps provide children with a hands-on experience of drawing, as all the icons can be manipulated directly with a finger. Drawing apps have the benefit that materials need not be sourced each time children want to draw. The work of children can be preserved for a longer duration with the save option of a drawing app than as physical hard copies, which are susceptible to wear and tear. Drawing apps do not take punitive measures against children for any wrong action and ensure reversibility of all tasks. The interface of a drawing app should be designed with knowledge of the different developmental and artistic stages of children. A design that is too complicated or too simple may not provide substantive support for the creative skills of children.
6 Conclusion Drawing apps for smartphones cannot replace conventional drawing materials like thick tempera paints, crayons and colored chalks, but they can certainly be used as a supplementary art medium. Drawing apps for smartphones can be easily customized to provide individualized learning. Drawing apps with a simple, easy-to-use interface and age-appropriate features can be used to stimulate children and enhance their creativity. Drawing apps also help in better documentation of the artwork. Parents and teachers can later refer to the drawings to create better individualized learning opportunities. Acknowledgements The authors thank Rishabh Singh, Ravi Kumar, Ravi Prakash and Robin Ratan for their help in developing Baby’s Drawing App.
References
1. S. Eisen, A.S. Lillard, Young children’s thinking about touchscreens versus other media in the US. J. Child. Media 11, 167–179 (2017)
2. J.A. Di Le, Interpreting children’s drawings. Brunner-Routledge (1983)
3. S. Yadav, P. Chakraborty, Using smartphones with suitable apps can be safe and even useful if they are not misused or overused. Acta Paediatr. 107, 384–387 (2018)
4. C.L. Chau, Positive technological development for young children in the context of children’s mobile apps. Doctoral dissertation, Tufts University (2014)
5. N. Kucirkova, M. Sakr, Child-father creative text-making at home with crayons, iPad collage & PC. Thinking Skills and Creativity 17, 59–73 (2015)
6. M. Sakr, V. Connelly, M. Wild, Imitative or iconoclastic? How young children use ready-made images in digital art. Int. J. Art Des. Educ. 37, 41–52 (2018)
7. J. Matthews, J. Jessel, Very young children and electronic paint: the beginning of drawing with traditional media and computer paintbox. Early Years 13, 15–22 (1993)
8. E. Burkitt, R. Jolley, S. Rose, The attitudes and practices that shape children’s drawing experience at home and at school. Int. J. Art Des. Educ. 29, 257–270 (2010)
9. N.S. Frisch, Drawing in preschools: a didactic experience. Int. J. Art Des. Educ. 25, 74–85 (2006)
10. R. Jolley, Z. Zhang, How drawing is taught in Chinese infant schools. Int. J. Art Des. Educ. 31, 30–43 (2012)
11. S. Yadav, P. Chakraborty, Children aged two to four are able to scribble and draw using a smartphone app. Acta Paediatr. 106, 991–994 (2017)
12. M. Posner, M.K. Rothbart, B.E. Sheese, J. Kieras, How arts training influences cognition, in C. Asbury, B. Rich (eds.), Learning, Arts, and the Brain. Dana Press (2008), pp. 1–10
13. O.E. Kural, E. Kılıç, Intelligent mobile drawing platform, in Proceedings of the Twenty-fourth Signal Processing and Communication Application Conference (2016), pp. 1765–1768
14. V. Ramnarain-Seetohul, D. Beegoo, T. Ramdhony, Case study of a mobile based application for kindergarten schools in Mauritius, in Proceedings of the IEEE International Conference on Emerging Technologies and Innovative Business Practices for the Transformation of Societies (2016), pp. 81–84
15. S.R.M. Shukri, A. Howes, How do children adapt strategies when drawing on a tablet? in Extended Abstracts of the ACM CHI Conference on Human Factors in Computing Systems (2014), pp. 1177–1182
16. S.R.M. Shukri, Children Adapt Drawing Actions to Their Own Motor Variability and to the Motivational Context for Action. Doctoral dissertation, University of Birmingham (2016)
17. S.R.M. Shukri, A. Howes, Children adapt drawing actions to their own motor variability and to the motivational context of action. Int. J. Hum. Comput. Stud. 130, 152–165 (2019)
18. S.R.M. Shukri, A. Howes, Reward conditions modify children’s drawing behaviour. Lect. Notes Comput. Sci. 10645, 455–465 (2017)
19. S. Yadav, P. Chakraborty, Smartphone apps can entertain and educate children aged two to six years but should be used with caution. Acta Paediatr. 107, 1834–1835 (2018)
20. V. Lowenfeld, W.L. Brittain, Creative and Mental Growth, 4th ed. Macmillan (1964)
Rank Aggregation Using Moth Search for Web Parneet Kaur, Gai-Ge Wang, Manpreet Singh, and Sukhwinder Singh
Abstract Metasearch engines play a crucial role in providing useful information for the requested query. At the back end, the rank aggregation (RA) module is perhaps the most important component, which merges the output derived from distinct search engines. The primary goal of this problem is to assign a relevance rank to similar documents from different search engines in order to select the best optimized document. With this application in mind, we propose a moth search algorithm (MSA)-based approach along with two distance measure methods. Spearman’s footrule and Kendall tau distance measures are optimized and then applied to assign ranks to the documents by different rank aggregation methods. Experimentally, the MSA approach is shown to outperform the conventional genetic algorithm. Keywords Rank aggregation · Distance measures · Optimization strategies
P. Kaur (B) · S. Singh Ajay Kumar Garg Engineering College, Ghaziabad, India e-mail: [email protected] S. Singh e-mail: [email protected] G.-G. Wang College of Information Science and Engineering, Ocean University of China, Qingdao, China e-mail: [email protected] M. Singh Department of Information and Technology, GNDEC, Ludhiana, India e-mail: [email protected] P. Kaur · S. Singh Department of Electrical and Electronics Engineering, AKGEC, Ghaziabad, India © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_5
1 Introduction The problem of ranking now faces new challenges in producing genuine and interesting datasets in different arenas, such as word association, spam fighting, human polling, gaming and the stock market [2]. Searching for the desired information, however, is a tedious and rather complex activity of the twenty-first century [15]. A user query composed of different keywords returns a list of documents from each of the different search engines. These lists differ in the ordering of the documents and, on average, never meet the user’s satisfaction level [8]. The problem of formulating a consensus ranking list from multiple lists is known as rank aggregation. Rank aggregation methods are divided into distributional, heuristic and stochastic methods [16]. Rank aggregation techniques are further divided into score-based, position-based and learning-based models. Score-based methods were discussed in connection with the problem of ranking fusion in [21]. Dwork et al. [8] first discussed probabilistic methods such as Markov chains for applications like metasearch, spam reduction, aggregating ranking functions, search engine comparison and word association. Beg [3] did notable work by implementing a GA-based technique to reduce the complexity of the NP-hard partial footrule optimal aggregation (PFOA) method for metasearch. An enhanced Shimura technique, which uses the operator membership function (OWA), was proposed in [4] for rank aggregation. Akritidis [1] highlights the effective rank aggregation methods of metasearch based on QuadRank and designed the QuadSearch metasearch engine, which provides access to crawler-dependent search engines. Waad [22] discussed a feature selection model that was applied to metaheuristic methods in order to improve the prediction accuracy in comparison with traditional rank aggregation techniques.
Pihur and Datta [20] implemented rank aggregation in R for use in DNA analysis. During the past few years, continued progress has been made in developing optimized, metaheuristic-based rank aggregation techniques. In the same vein, we explore this area further in search of better aggregation methods. In this paper, we first introduce five aggregation methods: scaled footrule, Borda, mean-by-variance (MBV), PageRank and Markov chain. We then apply the proposed moth search algorithm (MSA) and compare it with a genetic algorithm (GA) to check the performance of the selected search engines. Precision, recall and F-measure are also computed on the output to assess performance.
2 Rank Aggregation (RA) Methods for MetaSearch
Aggregation methods are categorized as order-based or score-based. In the former, such as the Borda method [6], only the order information of the objects is available, whereas in the latter, such as MBV, Markov and PageRank, the input rankings are associated with scores. The Markov chain method given in [14] sorts the candidates by the number of pair-wise majority contests each candidate wins, and it is widely used for metasearch problems. The PageRank method was adopted from [19], since it is one of the novel methods used for ranking Web pages. Two commonly used distance measures, the Kendall tau distance and Spearman's footrule distance [7], are applied so that the aggregated ranking comes as close as possible to the base rankings, which helps in deriving an optimized ranking. Kendall tau counts the pair-wise disagreements between two lists, whereas Spearman's footrule sums the element-wise differences in rank. In this paper, the main objectives considered are execution time, minimization of the distance value and derivation of the optimized rank list.
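The two distance measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: the function names are hypothetical, both lists are assumed to be full rankings of the same items without repeats, and the normalisation to the range [0, 1] is an assumption made so that values are comparable across list lengths.

```python
from itertools import combinations


def kendall_tau_distance(a, b):
    """Fraction of item pairs that the two rankings order differently."""
    pos_a = {item: rank for rank, item in enumerate(a)}
    pos_b = {item: rank for rank, item in enumerate(b)}
    discordant = sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    n = len(a)
    return discordant / (n * (n - 1) / 2)  # divide by the number of pairs


def spearman_footrule_distance(a, b):
    """Sum of element-wise rank differences, divided by its maximum value."""
    pos_a = {item: rank for rank, item in enumerate(a)}
    pos_b = {item: rank for rank, item in enumerate(b)}
    total = sum(abs(pos_a[item] - pos_b[item]) for item in a)
    n = len(a)
    return total / ((n * n) // 2)  # floor(n^2 / 2) is reached by a reversed list


base = [1, 2, 3, 4]
candidate = [2, 1, 3, 4]  # one adjacent swap
print(kendall_tau_distance(base, candidate))
print(spearman_footrule_distance(base, candidate))
```

A candidate aggregated list would typically be scored by averaging one of these distances over all base rankings returned by the search engines.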
3 Genetic Algorithm (GA)
The GA approach originated with John Holland [11] and forms a solid base of evolutionary computing. It was first applied to search and machine learning in [10, 23]. In our rank aggregation problem, we applied the genetic approach to determine the optimal distance value. In the initial step, the fitness score was derived to form the objective function. In the next step, the population was generated from the aggregated rank lists produced by the Borda, MBV, Markov, PageRank and scaled footrule methods. Only the Borda aggregated lists are given in this paper (Table 1); the aggregated lists for the remaining methods were derived with the same methodology and used to formulate the results. The results were then obtained after 500 generations of GA, as shown in Tables 3 and 4. The GA-based Kendall tau and Spearman's footrule distance values for RA are taken from [12] and compared with those of the proposed MSA approach in Tables 3 and 4 (see also Table 2).
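As an illustration of how a Borda aggregated list of the kind used to seed the population can be produced, the following sketch applies a plain Borda count. It is a hedged example rather than the authors' procedure: the function name is hypothetical, every input list is assumed to rank the same set of documents, and ties are broken by the order of the first list.

```python
def borda_aggregate(rankings):
    """Aggregate full rankings with a Borda count (best item first)."""
    items = rankings[0]
    n = len(items)
    scores = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position  # top item earns n points, last earns 1
    # sorted() is stable, so tied items keep the order of the first list
    return sorted(items, key=lambda item: -scores[item])


# Three hypothetical search-engine result lists over documents 1..3
engine_lists = [[1, 2, 3], [2, 1, 3], [2, 3, 1]]
print(borda_aggregate(engine_lists))
```

The aggregated list can then be used as one individual of the initial population and scored with a distance-based fitness function.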
4 Proposed Moth Search Algorithm (MSA)-Based Aggregation Approach
In the MSA, the optimization task is modelled on the phototaxis and Lévy flights of moths toward a light source along the shortest path, as described in detail in [9]. The straight flight of a moth toward the light is formulated as follows. For moth i, the flight is given by Eq. (1):
$$x_i^{t+1} = \lambda \times \left(x_i^{t} + \varphi \times \left(x_{best}^{t} - x_i^{t}\right)\right) \tag{1}$$
Table 1 Aggregated list for rank aggregation methods
RA method | Aggregated list
Borda method | tquery1 = {11 15 8 13 7 6 5 9 3 12 10 4 14 1 2 16 17 18 19 20}
 | tquery1a = {17 15 9 13 2 6 8 1 12 11 5 16 4 10 3 7 14 18 19 20}
 | tquery1b = {1 3 15 17 20 5 4 8 10 11 2 9 16 19 6 7 8 14 11 13}
 | tquery2 = {3 13 7 4 9 5 10 12 2 8 11 6 1 15 14 16 17 18 19 20}
 | tquery2a = {4 9 13 18 7 6 11 3 10 5 12 14 15 1 2 16 17 8 19 20}
 | tquery2b = {20 19 18 17 16 15 14 13 12 11 10 9 7 8 2 6 5 4 3 1}
 | tquery3 = {1 3 11 19 12 4 10 2 6 9 14 15 16 13 8 5 7 17 18 20}
 | tquery3a = {18 20 17 16 19 4 15 14 12 11 7 9 6 10 13 3 2 1 5 8}
 | tquery3b = {16 12 7 5 4 11 3 8 20 10 9 15 13 18 1 17 2 14 6 19}
Table 2 Main queries and new logical operator-based queries
Query No. | Query
Q1 | Civil rights movement
Q1a | Civil rights and movement
Q1b | Civil rights or movement
Q2 | Query expansion normalization
Q2a | Query expansion and normalization
Q2b | Query expansion or normalization
Q3 | Florida franchise tax board
Q3a | Florida franchise and tax board
Q3b | Florida franchise or tax board
where $x_{best}^{t}$ is the best moth (document) at generation t, φ is the acceleration factor, which is set to the golden ratio, and λ is a scale factor. Alternatively, a moth may fly to a final position beyond the light source, which is formulated in Eq. (2):
$$x_i^{t+1} = \lambda \times \left(x_i^{t} + \frac{1}{\varphi} \times \left(x_{best}^{t} - x_i^{t}\right)\right) \tag{2}$$
For simplicity, the position of moth i is updated by either Eq. (1) or Eq. (2), each with a probability of 50%.
Algorithm: Moth search algorithm (Gai-Ge Wang, 2016)
Steps of the proposed moth search algorithm for rank aggregation
Begin
Step 1. Initialization. Set the generation number t = 1; initialize the population P of NP moths randomly using the uniform distribution; set the maximum generation MaxGen, the max walk step S_max, the index β and the acceleration factor φ.
Step 2. Fitness evaluation. Evaluate each moth individual according to its position.
Step 3. While t < MaxGen do
    Sort all the moth individuals as per their fitness.
Step 4. for i = 1 to NP/2 (for all moth individuals in Subpopulation 1) do
        Generate x_i^{t+1} by performing Lévy flights (Gai-Ge Wang, 2016);
    end for i
Step 5. for i = NP/2 + 1 to NP (for all moth individuals in Subpopulation 2) do
Step 6.     if rand > 0.5 then
            Generate x_i^{t+1} by Eq. (1);
        else
            Generate x_i^{t+1} by Eq. (2).
        end
Step 7. end for i
Step 8. Evaluate the population as per the newly updated positions;
Step 9. t = t + 1.
    end while
End
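The steps above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the implementation used in the paper: the objective is a continuous toy function rather than a ranking distance, the Lévy step is drawn with Mantegna's method as a stand-in for the flight defined in [9], the scale factor λ is drawn uniformly from [0, 1), the step scale S_max / t² is an assumed schedule, and the best-so-far solution is tracked separately. All function and parameter names are hypothetical.

```python
import math
import random

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, used as the acceleration factor


def levy_step(beta, rng):
    """Heavy-tailed random step via Mantegna's method (assumed stand-in)."""
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0, sigma_u)
    v = max(abs(rng.gauss(0, 1)), 1e-12)  # guard against division by zero
    return u / v ** (1 / beta)


def moth_search(fitness, dim, n_moths=20, max_gen=100, s_max=1.0, beta=1.5, seed=0):
    """Minimise `fitness` over real vectors of length `dim`."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_moths)]
    best = min(pop, key=fitness)[:]
    for t in range(1, max_gen + 1):
        pop.sort(key=fitness)           # Step 3: fittest moths first
        x_best = pop[0][:]
        half = n_moths // 2
        alpha = s_max / t ** 2          # assumed shrinking step scale
        new_pop = []
        for i, x in enumerate(pop):
            if i < half:                # Step 4: subpopulation 1, Levy flights
                new_pop.append([xj + alpha * levy_step(beta, rng) for xj in x])
            else:                       # Steps 5-7: subpopulation 2, Eq. (1) or Eq. (2)
                lam = rng.random()      # assumed distribution for the scale factor
                factor = PHI if rng.random() > 0.5 else 1 / PHI
                new_pop.append(
                    [lam * (xj + factor * (bj - xj)) for xj, bj in zip(x, x_best)]
                )
        pop = new_pop                   # Step 8: evaluate the updated positions
        candidate = min(pop, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate[:]
    return best


best = moth_search(lambda x: sum(v * v for v in x), dim=3)
```

For rank aggregation, the fitness function would instead return the Kendall tau or Spearman's footrule distance of a candidate list, which requires an encoding from moth positions to permutations that is not specified here.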
In this paper, we take the MS method [9] and use the population set from Table 1. The MS algorithm divides the population into two equal subpopulations (subpopulation 1 and subpopulation 2). Next, the fitness function is formulated on the basis of the Kendall tau and Spearman's footrule [7] distance measures, respectively, as previously derived for GA in [12]. The distances covered by the moths in subpopulation 1 are then calculated, which give optimum results similar to those of subpopulation 2. Finally, ranks are assigned to the moths (documents) on the basis of the minimum distance value for the different rank aggregation methods, as reported in Tables 3 and 4. Documents with the minimum distance values lie high in the list and are given higher ranks.
5 Results and Discussion
This section reports the experimental evaluation of the proposed MSA-based RA measure, which is implemented in the MATLAB/Simulink environment, and compares its performance with that of the genetic-based RA measure. The dataset originates from [12]. It comprises thirty-seven keywords, from which three queries are taken at random as shown in Table 2. The evaluation was conducted by submitting the queries to five general-purpose search engines: Google, AltaVista, Deeperweb, Excite and HotBot. The best performing lists shown in Table 1 are obtained from the queries given in Table 2 for each rank aggregation method, namely Borda, mean-by-variance, Markov chain MC4, PageRank and scaled footrule, as referred to in [13].
5.1 Comparative Analysis of MSA and GA Approach
The set of aggregated queries is used with the RA methods (Borda, mean-by-variance, Markov chain MC4, PageRank and scaled footrule). The optimized distance values are calculated using the nature-inspired soft computing algorithm MSA. Table 3 gives the optimized Kendall tau distance values for the proposed MSA and GA at 500 generations. Similarly, the Spearman's footrule distance values for both GA and the proposed MSA at 500 iterations are given in Table 4. In Table 3, the proposed MSA gives lower distance values than GA in all cases except Q1a with the Borda method (0.226 with GA versus 0.23 with MSA). For example, for query Q1 with the Borda method, the Kendall tau distance is 0.210 with GA and 0.20 with the proposed MSA.
5.2 Query Analysis for Best Optimized Aggregated List
As per the query analysis of all six main queries of Table 1, different aggregated list-based results have been formulated as shown in Tables 2, 3 and 4. For example, for Q1, the variants Q1a and Q1b are derived using a Boolean word association technique, as indicated in Table 1. Using the Kendall tau-based measure on queries 1, 1a and 1b, the best result in terms of distance is obtained on query 1 for the Borda RA algorithm with the MSA approach, as indicated in Tables 3 and 4. The list of documents for query 1 is taken as the final aggregated list, and its URLs are given in Table 5.
Q1 = {11 15 8 13 7 6 5 9 3 12 10 4 14 1 2 16 17 18 19 20}
Likewise, among query 2, query 2a and query 2b, query 2b gives the best optimized result in terms of distance, as shown in Tables 3 and 4, and in terms of the performance evaluation parameters of precision, recall and F-measure for the RA-based scaled footrule. Further, query 3a is the best among its set of queries for RA-based PageRank under the MSA approach, as shown in Tables 3 and 4. Thus, for query 2a and query 3a, the lists of URLs that are optimized, or closest to satisfying the user query, are as follows:
tquery2a = {4 9 13 18 7 6 11 3 10 5 12 14 15 1 2 16 17 8 19 20}
tquery3a = {18 20 17 16 19 4 15 14 12 11 7 9 6 10 13 3 2 1 5 8}
Table 3 Comparison of the proposed MSA approach and GA: Kendall tau distance obtained after aggregating the ranks using soft computing algorithms with 500 generations, for each rank aggregation method
Query No. | Borda GA | Borda MSA | MBV GA | MBV MSA | Markov chain GA | Markov chain MSA | PageRank GA | PageRank MSA | Scaled footrule GA | Scaled footrule MSA
Q1 | 0.210 | 0.20 | 0.320 | 0.14 | 0.300 | 0.21 | 0.320 | 0.23 | 0.336 | 0.21
Q1a | 0.226 | 0.23 | 0.320 | 0.21 | 0.367 | 0.23 | 0.330 | 0.20 | 0.370 | 0.22
Q1b | 0.280 | 0.21 | 0.305 | 0.18 | 0.420 | 0.23 | 0.300 | 0.22 | 0.330 | 0.20
Q2 | 0.305 | 0.17 | 0.331 | 0.18 | 0.233 | 0.21 | 0.310 | 0.22 | 0.360 | 0.18
Q2a | 0.370 | 0.21 | 0.345 | 0.23 | 0.268 | 0.18 | 0.310 | 0.18 | 0.360 | 0.21
Q2b | 0.380 | 0.18 | 0.332 | 0.18 | 0.350 | 0.17 | 0.240 | 0.22 | 0.340 | 0.22
Q3 | 0.389 | 0.22 | 0.342 | 0.18 | 0.380 | 0.20 | 0.330 | 0.21 | 0.400 | 0.16
Q3a | 0.384 | 0.20 | 0.331 | 0.21 | 0.370 | 0.23 | 0.350 | 0.17 | 0.310 | 0.19
Q3b | 0.240 | 0.18 | 0.341 | 0.21 | 0.280 | 0.17 | 0.310 | 0.21 | 0.300 | 0.20
Table 4 Comparison of the proposed MSA approach and GA: Spearman's footrule distance obtained after aggregating the ranks using soft computing algorithms with 500 generations, for each rank aggregation method
Query No. | Borda GA | Borda MSA | MBV GA | MBV MSA | Markov chain GA | Markov chain MSA | PageRank GA | PageRank MSA | SF GA | SF MSA
Q1 | 0.510 | 0.31 | 0.420 | 0.32 | 0.460 | 0.28 | 0.301 | 0.32 | 0.435 | 0.26
Q1a | 0.520 | 0.29 | 0.400 | 0.31 | 0.450 | 0.27 | 0.380 | 0.30 | 0.450 | 0.27
Q1b | 0.530 | 0.31 | 0.400 | 0.31 | 0.380 | 0.30 | 0.402 | 0.28 | 0.480 | 0.30
Q2 | 0.480 | 0.31 | 0.500 | 0.27 | 0.390 | 0.31 | 0.400 | 0.32 | 0.520 | 0.27
Q2a | 0.530 | 0.30 | 0.480 | 0.29 | 0.520 | 0.32 | 0.535 | 0.31 | 0.450 | 0.29
Q2b | 0.520 | 0.27 | 0.580 | 0.30 | 0.440 | 0.31 | 0.440 | 0.28 | 0.520 | 0.34
Q3 | 0.570 | 0.25 | 0.430 | 0.29 | 0.500 | 0.29 | 0.470 | 0.31 | 0.520 | 0.31
Q3a | 0.530 | 0.32 | 0.520 | 0.31 | 0.470 | 0.25 | 0.460 | 0.29 | 0.420 | 0.30
Q3b | 0.520 | 0.29 | 0.423 | 0.25 | 0.560 | 0.31 | 0.430 | 0.32 | 0.550 | 0.31
6 Conclusion and Future Scope
This paper proposes an MSA-based RA measure whose potency is compared with that of the GA-based RA approach. Parameters such as execution time, precision, recall and F-measure are also applied to refine and analyse the final optimized list. It is observed that the MSA-based RA yields better results than the GA-based RA measure. Further, applying a binary word association technique, such as AND and OR, to the set of defined queries generates improved search results. The analysis indicates that queries such as query 1a, query 2b and query 3a generate better results than the individual or independent queries query 1, query 2 and query 3. Also, the derived lists came closer to the non-aggregated lists. Finally, the Kendall tau distance measure, when applied with MSA-based RA techniques, helps in designing good-quality metasearch engines that deliver more refined, user-preferred query results.
Table 5 Optimized aggregated list-based URLs
Document No. | Optimized aggregated list-based URLs of Q1 for Borda | New rank
11 | https://en.wikipedia.org/wiki/Civil_rights_movement | 1
15 | https://www.britannica.com/event/American-civil-rightsmovement | 2
8 | https://www.adl.org/education/resources/backgrounders/civilrights-movement | 3
13 | https://www.khanacademy.org/humanities/us-history/postwarera/civil-rights-movement/a/introduction-to-the-civil-rightsmovement | 4
7 | https://www.aarp.org/politics-society/history/info-2018/civilrights-events-fd.html | 5
6 | https://www.gilderlehrman.org/history-now/essays/civil-rightsmovement-major-events-and-legacies | 6
5 | https://www.learningtogive.org/resources/civil-rights-movement | 7
9 | https://hechingerreport.org/students-ignorant-civil-rightsmovement/ | 8
3 | https://www.docsoffreedom.org/student/readings/the-civil-rightsmovement | 9
12 | https://prospect.org/justice/civil-rights-movement-politicsmemory/ | 10
10 | https://www.infoplease.com/history/us/civil-rights-timeline | 11
4 | https://www.civilrightsteaching.org/civil-rights-movementhistory-resources | 12
14 | https://www.ducksters.com/history/civil_rights/african-american_civil_rights_movement.php | 13
1 | https://www.theatlantic.com/politics/archive/2017/06/the-fightfor-health-care-is-really-all-about-civil-rights/531855/ | 14
2 | https://www.bbc.co.uk/bitesize/guides/zcpcwmn/revision/1 | 15
16 | https://memory.loc.gov/ammem/aaohtml/exhibit/aopart9.html | 16
17 | https://www.aa.com.tr/en/life/mother-of-civil-rights-movementremembered-worldwide/1624437 | 17
18 | https://www.loc.gov/collections/civil-rights-history-project/articles-and-essays/youth-in-the-civil-rights-movement/ | 18
19 | https://www.historyextra.com/period/20th-century/timeline-theamerican-civil-rights-movement-of-the-1950s-and-1960s/ | 19
20 | https://history.house.gov/Exhibitions-and-Publications/BAIC/Historical-Essays/Keeping-the-Faith/Civil-Rights-Movement/ | 20
References
1. L. Akritidis, D. Katsaros, P. Bozanis, Effective rank aggregation for metasearching. J. Syst. Softw. 84, 130–143 (2011)
2. K. Arrow, Social Choice and Individual Values (Wiley, New York, 1951)
3. M.M.S. Beg, N. Ahmad, Soft computing techniques for rank aggregation on the World Wide Web. J. Int. Inf. Syst. 6(1), 5–22 (2003)
4. M.M.S. Beg, N. Ahmad, Study of rank aggregation for world wide web. J. Study Fuzziness Soft Comput. 137, 24–46 (2004)
5. M.M.S. Beg, N. Ahmad, Fuzzy logic based rank aggregation methods for the world wide web, in Proceedings of the International Conference on Artificial Intelligence in Engineering and Technology (Malaysia, 2002), pp. 363–368
6. J.C. Borda, Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences (1781)
7. P. Diaconis, R. Graham, Spearman's footrule as a measure of disarray. J. Roy. Stat. Soc. B 39(2), 262–268 (1977)
8. C. Dwork, R. Kumar, M. Naor, D. Sivakumar, Rank aggregation methods for the web, in Proceedings of Tenth ACM International Conference on World Wide Web (2001), pp. 613–622
9. G.-G. Wang, Moth search algorithm: a bio-inspired metaheuristic algorithm for global optimization problems. Memetic Comput. 1–14 (2016). https://doi.org/10.1007/s12293-016-0212-3
10. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison Wesley, 1989)
11. J. Holland, Genetic algorithm and adaptation, in Adaptive Control of Ill-Defined Systems, Vol. 16 of the Series NATO Conference Series (1975), pp. 317–333
12. P. Kaur, M. Singh, G.S. Josan, Comparative analysis of rank aggregation techniques for metasearch using genetic algorithm. Springer J. Educ. Inf. Technol. 21, 1–19 (2016)
13. P. Kaur, M. Singh, G.J. Singh, S.S. Dhillon, Rank aggregation using ant colony approach for metasearch. J. Soft Comput. Springer 22(3), 4477–4492 (2018)
14. A. Laughlin, J. Olson, D. Simpson, A. Inoue, Page ranking refinement using fuzzy sets and logic, in Proceedings of the 22nd Midwest Artificial Intelligence and Cognitive Science Conference (Cincinnati, Ohio, USA, 2011), pp. 40–46
15. A.N. Langville, C.D. Meyer, Who's #1?: The Science of Rating and Ranking (Princeton University Press, Princeton, NJ, USA, 2012)
16. L. Xue, W. Guanghua, A. Xiao, Comparative study of rank aggregation methods for partial and top ranked lists in genomic applications. Brief. Bioinf. 20(1), 178–189 (2019)
17. M.H. Montague, J.A. Aslam, Models of metasearch, in Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR, 2001), pp. 276–284
18. G. Napoles, Z. Dikopoulou, E. Papgeorgiou, R. Bello, K. Vanhoof, Aggregation of partial rankings: an approach based on the Kemeny ranking problem. Adv. Comput. Intell. Lect. Notes Comput. Sci. 9095, 343–355 (2015)
19. S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, in Proceedings of Seventh International World Wide Web Conference (1998)
20. V. Pihur, S. Datta, Rank aggregation, an R package for weighted rank aggregation. A Report by Department of Bioinformatics and Biostatistics, University of Louisville (2014). http://vpihur.com/biostat
21. M.E. Renda, U. Straccia, Web metasearch: rank vs. score based rank aggregation methods, in Proceedings of ACM SAC (2003), pp. 841–846
22. B. Waad, B.B. Atef, L. Mohamed, Feature selection by rank aggregation and genetic algorithms, in Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (Vilamoura, Algarve, Portugal, 2013), pp. 74–81
23. L. Yan, C. Gui, W. Du, Q. Guo, An improved pagerank method based on genetic algorithm for web search, in Proceedings of Advanced Control Engineering and Information Science by Procedia Engineering 15 (Elsevier, 2011), pp. 2983–2987. https://doi.org/10.1016/j.proeng.2011.08.561
A Predictive Approach to Academic Performance Analysis of Students Based on Parental Influence
Deepti Sharma and Deepshikha Aggarwal
Abstract This analysis examines the level of parental influence on the academic performance of college students. A sample of around 400 students was randomly selected, and the data was examined using Python. The results indicate that parental factors such as parental education, the jobs of the parents and the facilities and environment at home have a considerable influence on children and affect their academic performance. The results show a positive impact on students' academic performance when they receive adequate support at home. The findings of the study have been used to propose a predictive model for student academic performance. Through this study, we highlight several factors that parents, teachers and peers need to consider in order to support students in their academic development. The results show that various aspects of family influence are positively correlated with actual academic performance. The factors considered include size of family, cohabitation status of parents, education of mother and father, job of mother and father, and Internet and paid classes at home. The effect of each factor is analysed, and a predictive model is proposed based on the analysis.
Keywords Parental influences · Academic performance · Correlation · Regression
D. Sharma (B) · D. Aggarwal Department of Information Technology, Jagan Institute of Management Studies, Delhi, India e-mail: [email protected] D. Aggarwal e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_6
1 Introduction
Education plays a crucial part in the economic growth of any country. The basis of education is performance at the primary school as well as the higher education level [1]. Parents are the most influential factor in the academic performance of any student. Academic performance involves studying effectively, gaining knowledge, understanding concepts and ideas and scoring considerable marks in examinations [2]. The success of a student is generally judged by performance in examinations. Parents play a major role in guiding and motivating children to perform in their studies [3]. Factors such as the sharing of ideas, parental support and parental approval affect students' abilities and skills in their studies. The involvement of parents and their interest in their children's education have a considerable impact on the children's ability to focus on studies and perform well in academics [4]. The education of parents also has a significant effect on student performance, as educated parents are considered capable of guiding their children efficiently [5]. Children who are supported and encouraged by their family with facilities such as paid tuition classes and Internet at home also do well in their studies. The home environment likewise affects the performance of students at the school as well as the college level [6]. This research aims to analyse the influence of parental participation and upbringing on the academic performance of children. The proposed methodical evaluation supports the aims of this study, which are:
1. To analyse existing prediction methods and identify their gaps.
2. To identify the factors used in analysing students' performance.
3. To identify the parent-related variables that can affect students' academic performance.
2 Related Work
Brown and Iyengar [7] examined the role of parents in the lives of teenagers. Parenting styles were examined through aspects such as parental control, gender, education of parents, ethnicity and diversity. A firm yet kind style of parenting makes children positive towards their performance, and students achieve higher grades in school if their parents take an active role in their lives. Wong [8] examined the impact of parental involvement and autonomy support on children's academic performance. Around 171 teenagers with different backgrounds, societies and levels of parental education were interviewed. The research measured the participants' responses in relation to the effect of parental involvement, and found that parental involvement had a positive influence on the academic performance of children. Cassidy and Conroy [9] analysed how parents' interactions with their children help shape their self-confidence and educational achievement. Self-esteem and confidence increase with parental support, which also helps children perform well in exams. Spera [10] found that children perform well in academics if they have positive, loving, motivated and focused parents at home. Life at college becomes easier and better if the atmosphere at home is friendly and supportive.
Students' performance has been predicted using data mining techniques [11]. Prediction algorithms are used to identify the most important features in student data. To predict a student's academic performance, a classification algorithm based on association rule mining has been used, with a dataset comprising admission test scores, enrolment options and some socio-demographic attributes [12]. Educational data mining has also been applied to model academic attrition.
3 Methodology
The data of 395 students is selected to examine the impact of parental factors on the academic performance of students. The factors taken into account to perform the analysis are summarized in Table 1.
3.1 Formula
Grades ∼ Size_family + Cohab_status + Edu_mother + Edu_father + Guardian + Family_support + Paid_classes + Internet_home + Family_bond
Table 1 Factors for academic performance
Factors | Description
Size_family | Family size
Cohab_status | Cohabitation status of parents
Edu_mother | Mother's education
Edu_father | Father's education
Job_mother | Mother's job
Job_father | Father's job
Guardian | Guardian of student
Family_support | Family educational support
Paid_classes | Extra paid classes
Internet_home | Internet access at home
Family_bond | Family relationship's quality
Marks_sem1 | First semester marks
Marks_sem2 | Second semester marks
Marks_sem3 | Third semester marks
Fig. 1 Factors related to parental influence on academic performance of students (Family Size, Cohab_Status, Mother_Education, Father_Education, Guardian, Family Support, Paid Classes, Internet at Home and Family Bond feed into Parental Influence, which in turn affects Academic Performance)
Various aspects of family are combined to denote the parental influence as shown in Fig. 1.
3.2 Data Interpretation
The marks of the three semesters are averaged to obtain the GAvg field, which is then converted to student grades (Fig. 2). The grades in Fig. 2 are very close to a normal distribution, with a mode at 11 (the grade scale ranges from 0 to 20), and show no noticeable skewness.
Fig. 2 Distribution of the final grades
4 Results and Discussions
The plots above show that not all independent factors in the dataset are related to the dependent variable "grade". Feature selection, a form of dimensionality reduction [13], is therefore needed so that only the relevant factors are kept. There are different ways to do this; in the current research we use a simple measure, the correlation coefficient [14]. Its value lies between −1 and +1 and is used to identify the factors relevant for predicting the grades. To limit the number of variables, those with the greatest correlation (either negative or positive) with the final grade can be considered. The correlation is computed in Python as:
data.corr()['grades'].sort_values()
As Table 2 shows, the job of the mother and father, guardian, cohabitation status and family support are negatively correlated with the grade: as these factors increase, the grade decreases. Other factors, namely the education of the mother and father, extra classes, family_bond, internet_home and size of the family, have a positive correlation with the grade. Table 2 is shown diagrammatically in Fig. 3. The impact of father's education on student academic performance is shown in Fig. 4.
Table 2 Correlation between factors and grade
Factor | Correlation with grade
Job_mother | −0.154082
Job_father | −0.139351
Guardian | −0.049467
Family_support | −0.040009
Cohab_status | −0.016170
Paid_classes | 0.004246
Family_bond | 0.021554
Size_family | 0.026272
Internet_home | 0.102206
Edu_father | 0.107908
Edu_mother | 0.210590
Marks_sem3 | 0.645108
Marks_sem2 | 0.701935
GAvg | 0.727617
Marks_sem1 | 0.745264
Grades | 1.000000
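The correlation-based ranking behind Table 2 can be illustrated with a small, library-free sketch. The data values below are made up for illustration and are not taken from the study, and the helper names are hypothetical. With pandas, `DataFrame.corr()` computes the same Pearson coefficient by default.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def rank_factors(data, target):
    """Return (factor, correlation) pairs sorted from most negative to most positive."""
    correlations = {
        name: pearson(values, data[target])
        for name, values in data.items()
        if name != target
    }
    return sorted(correlations.items(), key=lambda pair: pair[1])


# Made-up toy records for five students
data = {
    "Edu_mother": [1, 2, 3, 4, 4],
    "Job_mother": [4, 3, 3, 1, 2],
    "Grades": [8, 10, 12, 15, 14],
}
for factor, r in rank_factors(data, "Grades"):
    print(f"{factor}: {r:+.3f}")
```

Factors at either end of the sorted list have the strongest linear association with the grade and are the natural candidates to keep.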
Fig. 3 Correlation between different factors and grades
Fig. 4 Distribution of grades based on father’s education
The grade distribution on the basis of mother's education is shown in Fig. 5. The combined influence of parents' education on the academic performance of their children is shown in Fig. 6. This analysis suggests that the education level of the mother has a stronger influence on student performance than that of the father.
Fig. 5 Distribution of grades based on mother's education
Fig. 6 Distribution of grades based on parental education
5 Predictive Modelling
We built the predictive model using linear regression. The data is divided into training and test sets, with 90% of the data used for training. The naive baseline is the median prediction. The mean absolute error (MAE) and root-mean-square error (RMSE) of this baseline are:
Median Baseline MAE: 2.5250
Median Baseline RMSE: 3.3129
The MAE is easy to interpret: it is the average absolute difference between the predicted and the true values. The RMSE penalizes larger errors more heavily and is commonly used in regression tasks. Our model focuses on linear regression, but we have compared its results with other machine learning techniques such as support vector machines, random forests, gradient boosting and other tree-based methods. Table 3 lists six different models along with the naive baseline.
Table 3 MAE and RMSE values
Model | MAE | RMSE
Gradient boosted | 3.22006 | 3.82207
ElasticNet regression | 3.26509 | 3.8931
Random forest | 3.35906 | 4.15262
Extra trees | 3.6038 | 4.63584
SVM | 3.16753 | 3.7626
Linear regression | 3.15587 | 3.76197
Baseline | 3.2963 | 3.94277
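The median baseline and the two error measures used in this section can be sketched as follows. The grades in the example are made up for illustration and do not reproduce the values reported above; the function names are hypothetical.

```python
from math import sqrt
from statistics import median


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error: squaring gives large errors more weight."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up grades on the 0-20 scale
train_grades = [10, 12, 11, 9, 14, 13, 11]
test_grades = [8, 11, 15]

baseline = median(train_grades)             # naive prediction for every student
predictions = [baseline] * len(test_grades)

print("Median Baseline MAE: ", mae(test_grades, predictions))
print("Median Baseline RMSE:", rmse(test_grades, predictions))
```

A trained model is worth keeping only if it achieves lower MAE and RMSE on the test set than this baseline.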
6 Conclusions
Linear regression performs best on both measures (MAE and RMSE). Fig. 7 also shows that all models perform almost equally well, demonstrating that machine learning algorithms can be applied to this problem. Overall, the linear regression method performs best, although SVM and gradient boosting also perform well. The factors that have a positive correlation with the grade, namely the education of the mother and father, extra classes, family_bond, internet_home and size of the family, were used to develop the predictive model using linear regression.
Fig. 7 Comparison of machine learning models
Regression statistics
Statistic | Value
Multiple R | 0.547
R square | 0.4763
Adjusted R square | 0.4771
Standard error | 1.982396
Observations | 395
The linear regression of performance on the education of the mother and father, extra classes, family_bond, internet_home and size of the family gives a multiple R of 0.547 and an adjusted R square of 0.4771. Thus, about 47% of the variance in performance (adjusted R square) is explained by these aspects of parental influence. The p-value of the regression is p = 0.017, which is less than 0.05. It can therefore be concluded that these variables are significant predictors of academic performance.
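For reference, the standard textbook definition of the adjusted R square reported above is given below. It is included only to clarify the statistic, under the assumption that the usual definition applies; n denotes the number of observations (395 here), p the number of predictors in the model, and R² the ordinary coefficient of determination.

```latex
\bar{R}^{2} = 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - p - 1}
```

The correction term (n − 1)/(n − p − 1) penalizes models that add predictors without a matching gain in explained variance.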
References 1. A.M. Shahiri, W. Husain, N.A. Rashid, A review on predicting student’s performance using data mining techniques. Procedia Comput. Sci. 72, 414–422 (2015) 2. T.J. Moore, S.M. Asay, Family Resource Management (Sage Publications, Inc., Thousand Oaks, CA, 2008) 3. S. Kalaivani, B. Priyadharshini, Analyzing student’s academic performance based on data mining approach. Int. J. Innovative Res. Comput. Sci. Technol. 5(1), 194–197 (2017) 4. K.F. Li, D. Rusk, F. Song, Predicting student academic performance, in Proceedings of 2013 Seventh International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS), Taichung, Taiwan (2013) 5. A. Daud, N.R. Aljohani, R.A. Abbasi, et al., Predicting student performance using advanced learning analytics, in Proceedings of the 26th International Conference on World Wide Web, Companion, Perth, Australia (2017) 6. E.A. Amrieh, T. Hamtini, I. Aljarah, Mining educational data to predict student’s academic performance using ensemble methods. Int. J. Database Theory Appl. 9(8), 119–136 (2016) 7. L. Brown, S. Iyengar, Parenting styles: the impact on student achievement. Marriage Fam. Rev. 43(1), 14–38 (2008) 8. M. Wong, Perceptions of parental involvement and autonomy support: their relations with self-regulation, academic performance, substance use and resilience among adolescents. North Am. J. Psychol. 10(3), 497–518 (2008) 9. C. Cassidy, D. Conroy, Children’s self-esteem related to school- and sport-specific perceptions of self and others. J. Sport Behav. 29(1), 3–26 (2006) 10. C. Spera, A review of the relationship among parenting practices, parenting styles, and adolescent school achievement. Educ. Psychol. Rev. 17(2), 125–146. https://doi.org/10.1007/s10648005-3950-1 (2005) 11. A.M. Shahiria, W. Husaina, N.A. Rashida, A review on predicting student’s performance using data mining techniques. Sci. Direct, 414–422 (2015)
D. Sharma and D. Aggarwal
12. A. Elbadrawy, A. Polyzou, Z. Ren, M. Sweeney, G. Karypis, H. Rangwala, Predicting Student Performance Using Personalized Analytics, April 2016 (IEEE, 2016), pp. 61–69 13. M.M.A. Tair, A.M. El-Halees, Mining educational data to improve students’ performance: a case study, Int. J. Inf. 2(2), 140–146 (2012) 14. I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, Burlington, MA, USA, 2016)
Handling Class Imbalance Problem in Heterogeneous Cross-Project Defect Prediction Rohit Vashisht and Syed Afzal Murtaza Rizvi
Abstract Software Defect Prediction (SDP) is one of the key tasks in the testing phase of the Software Development Life Cycle (SDLC). It discovers the modules that are more susceptible to defects and therefore require significant testing, so that these flaws can be identified early and the extra cost of software development can be cut down. Much research has been performed on Cross-Project Defect Prediction (CPDP), which seeks to predict defects in a target application that lacks historical defect information, or has only limited defect information, in order to construct an efficient generalized model for forecasting defects in a software project. The proposed research work focuses on defect prediction using a heterogeneous metric set, i.e., there are no common metrics between the source and the target applications. This paper also discusses the Class Imbalance Problem (CIP), which occurs in a dataset because of the disproportionate number of favorable and unfavorable cases. If trained on an imbalanced dataset, a classifier will produce biased outcomes. We use the Adaptive Boosting (AdaBoost) method to handle CIP in Heterogeneous Cross-Project Defect Prediction (HCPDP), and the experimental findings demonstrate significant improvements once CIP has been handled. Keywords Class imbalance · Heterogeneous metrics · Cross-software project · Software defect prediction
R. Vashisht (B) · S. A. M. Rizvi Jamia Millia Islamia, Delhi, India e-mail: [email protected] S. A. M. Rizvi e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_7
1 Introduction
The testing stage of the SDLC is the most important one, as it consumes a large portion of the overall expense of a project. It is therefore crucial to manage this phase in every development project. This raises the question: “How can the expense of the testing stage be reduced in order to limit the actual project cost?” Software Defect Prediction (SDP) addresses this issue by predicting defects from historical defect data. But how can we predict the faults in a software project that does not have enough previous defect information to build an effective defect prediction model? This issue is resolved by CPDP, which tries to predict defects in a target application using a classifier trained on another software project’s defect data [1]. CPDP collects common metrics from both the source application (whose defect data is used to train the model) and the target application (for which defect prediction is done) [2]. However, this raises a question about the validity of a defect prediction model built on a common metric set, because a uniform metric set may not include some substantive metrics required to construct a strong defect prediction model [3]. Heterogeneous CPDP, in turn, faces the challenge of finding the exact common metric set between the source and the target datasets. For example, when projects are written in different languages, it is hard to match metrics that are language-specific [4]. The class imbalance issue has been acknowledged in many applications, such as detecting unreliable telecommunications clients, detecting oil spills in satellite radar images, learning word pronunciations, text classification, risk management, finding and filtering data, and medical diagnosis. From the application’s point of view, the imbalance problem can be classified into two categories: naturally imbalanced data (e.g., credit card fraud and rare illness) and data that is not naturally imbalanced [5]. Similarly, if a defect prediction model is trained on an imbalanced defect dataset, in which the number of faulty instances differs greatly from the number of non-faulty instances, the performance of the model will be poor. We should therefore first address the CIP in an imbalanced dataset before training the classification model, so that the predictive model can yield better outcomes.
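To make the effect of class imbalance concrete, the following minimal Python sketch uses a hypothetical toy dataset (95 non-faulty and 5 faulty modules, values chosen only for illustration) and a trivial predictor that always outputs the majority class. The helper names `accuracy`, `recall` and `imbalance_ratio` are illustrative and do not come from the paper.

```python
# Hypothetical toy data: 95 non-faulty modules (label 0), 5 faulty modules (label 1).
labels = [0] * 95 + [1] * 5

# A degenerate "classifier" that always predicts the majority (non-faulty) class.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive (faulty) instances that were predicted positive."""
    actual_pos = sum(t == positive for t in y_true)
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return true_pos / actual_pos if actual_pos else 0.0


def imbalance_ratio(y_true, positive=1):
    """Size of the larger class divided by the size of the smaller class."""
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    return max(pos, neg) / min(pos, neg)


print("imbalance ratio:", imbalance_ratio(labels))
print("accuracy:", accuracy(labels, predictions))
print("recall on faulty modules:", recall(labels, predictions))
```

The accuracy looks high although no faulty module is found, which is why accuracy alone is a poor guide on imbalanced defect data.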
The rest of this paper is structured as follows: Sect. 2 briefly describes HCPDP, its modeling components and CIP; Sect. 3 reviews the HCPDP related work along with some initial CPDP studies; Sect. 4 explains the proposed research work; Sect. 5 summarizes the experimental results; and finally, Sect. 6 concludes with the major findings and future work.
2 Basic Terminologies
2.1 Heterogeneous Cross-Project Defect Prediction (HCPDP) Model
The HCPDP strategy predicts defects across projects using only heterogeneous metrics that show a comparable distribution in their values. Sometimes, two heterogeneous metrics show the same value distribution in the two project datasets. We use such metrics as common metrics to train the required defect prediction model [3].
The HDP model starts with a pair of datasets, i.e., a source project, whose defect data is used to train the defect prediction model, and a target project (with limited past defect information), for which defect prediction is done. We made both datasets dimensionally equal (column-wise only) for the purpose of metric matching: we chose the best q features (assuming q < n) from the n features of the source dataset using a feature selection method [6]. Thus, feature selection is the very first step in HCPDP modeling. The second stage is metric matching, i.e., finding the highly correlated metrics between the two datasets so that matched metric pairs can be extracted to create the training dataset [7, 8]. Finally, the classification model is trained on the resulting training dataset, and a performance report is produced for defect prediction in the target software. The HDP model and its constituents are shown in Fig. 1.
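The metric-matching stage can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it compares metric distributions with the two-sample Kolmogorov-Smirnov statistic [7] computed from the standard library, pairs metrics greedily one-to-one, and uses a hypothetical cut-off `max_distance`. All metric names and values are made up.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The largest gap between two step functions is attained at an observed value.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def match_metrics(source, target, max_distance):
    """Greedy one-to-one matching of source and target metrics.

    source, target: dicts mapping a metric name to its list of values.
    max_distance: hypothetical cut-off; pairs with a larger KS statistic are dropped.
    """
    candidates = sorted(
        (ks_statistic(s_vals, t_vals), s_name, t_name)
        for s_name, s_vals in source.items()
        for t_name, t_vals in target.items()
    )
    used_s, used_t, matches = set(), set(), []
    for d, s_name, t_name in candidates:
        if d <= max_distance and s_name not in used_s and t_name not in used_t:
            matches.append((s_name, t_name, d))
            used_s.add(s_name)
            used_t.add(t_name)
    return matches


# Made-up metric values for a source and a target project.
source = {"A1": [1, 2, 3, 4, 5], "A2": [10, 20, 30, 40, 50]}
target = {
    "B1": [11, 19, 31, 42, 48],
    "B2": [1.5, 2.5, 3.5, 4.5, 5.5],
    "B3": [100, 200, 300, 400, 500],
}
for s_name, t_name, d in match_metrics(source, target, max_distance=0.5):
    print(s_name, "<->", t_name, "KS statistic:", round(d, 3))
```

The matched pairs would then define the columns of the training dataset; a correlation-based score could replace the KS statistic without changing the surrounding logic.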
Fig. 1 HDP model (recoverable labels: Dataset 1, source software project (m × n); Dataset 2, target software project (p × q); feature selection technique, chi-square test (CST), assuming q < n)
Table 4 Metric matching with CCV
Training dataset feature: A5
(Training/testing) dataset metrics with CCV > 0.3: B4 (0.351711), B7 (0.309785), B11 (0.318607), B14 (0.345769), B16 (0.34521), B18 (0.367582), B20 (0.360618), B21 (0.349393)
References 1. D. Han, I.P. Hoh, S. Kim, T. Lee, J. Nam, Micro interaction metrics for defect prediction, in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (ACM, New York, USA, 2011) 2. P. He, B. Li, Y. Ma, Towards cross-project defect prediction with imbalanced feature sets. CoRR, abs/1411.4228 (2014) 3. W. Fu, S. Kim, T. Menzies, J. Nam, L. Tan, Heterogeneous defect prediction, in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE (ACM, New York, NY, USA, 2015), pp. 508–519 4. A.B. Bener, T. Menzies, J. Di Stefano, B. Turhan, On the relative value of cross- company and within-company data for defect prediction. Empirical Softw. Eng. 14, 540–578 (2009) 5. X. Guo, Y. Yin, C. Dong, G. Yang, G. Zhou, On the class imbalance problem, in Fourth International Conference on Natural Computation (School of Computer Science and Technology, Shandong University, Jinan, 250101, China, 2008) 6. M.W. Mwadulo, A review on feature selection methods for classification tasks. Int. J. Comput. Appl. Technol. Res. 5(6), 395–402 (2015) 7. F.J. Massey, The Kolmogorov-Smirnov test for goodness of fit. J. Am. Stat. Assoc. 46(253), 68–78 (1951) 8. C. Spearman, The proof and measurement of association between two things. Int. J. Epidemiol. 39(5), 1137–1150 (2010) 9. N. Rout, D. Mishra, M.K. Mallick, Handling imbalanced data: a survey, in International Proceedings on Advances in Soft Computing, Intelligent Systems and Applications, Advances in Intelligent Systems and Computing, vol. 628. https://doi.org/10.1007/978-981-10-5272-9_39 (2018) 10. https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18 11. L.C. Briand, W.L. Melo, J. Wurst, Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans. Softw. Eng. 28, 706–720 (2002) 12. A.B. Bener, T. Menzies, J.S. Di Stefano, B. 
Turhan, On the relative value of cross- company and within-company data for defect prediction. Empirical Softw. Eng. 14(5), 540–578 (2009) 13. Z. Xu, P. Yuan, T. Zhang, Y. Tang, S. Li, Z. Xia, HDA: cross project defect prediction via heterogeneous domain adaptation with dictionary learning. IEEE Access 6, 57597–57613 (2018) 14. W. Fu, T. Menzies, X. Shen, Tuning for software analytics: is it really necessary? Inf. Softw. Technol. 76, 135–146 (2016) 15. https://www.toppr.com/guides/business-mathematics-and-statistics/correlation-andregression/karl-pearsons-coefficient-correlation/ 16. J.E.T. Akinsola, F.Y. Osisanwo, O. Awodele, J.O. Hinmikaiye, O. Olakanmi, J. Akinjobi, Supervised machine learning algorithms: classification and comparison. Int. J. Comput. Trends Technol. (IJCTT) 48(3), 128–138 (2017) 17. https://machinelearningmastery.com/gentle-introduction,gradient-boosting-algorithmmachine-learning/ 18. M.J. Justin, M.K. Taghi, Survey on deep learning with class imbalance. J Big Data 27(6), 1–54 (2019) 19. S. Maheshwari, R.C. Jain, R.S. Jandon, A review of class imbalance problem: analysis and potential solution. Int. J. Comput. Trends Technol. (IJCTT) 14(6), 3 (2017) 20. F. Rayhan, S. Ahmed, A. Mahbub, M.R. Jani, S. Shatabda, D.M. Farid, C.M. Rahman: ME boosting: mixed estimators with boosting for imbalance data classification. arXiv:1712. 06658v2[cs.LG], 13 January 2018
A Heterogeneous Dynamic Scheduling Minimize Energy—HDSME Saba Fatima and V. M. Viswanatha
Abstract Hardware resources are currently expensive, and acquiring hardware only to execute a small operation is too costly. Cloud computing addresses this by providing resources to clients on a pay-as-used basis, charging only for the time used or the operations executed. Because cloud resources are heterogeneous, using them takes a lot of time. Many techniques have been designed to schedule server resources, both homogeneous and heterogeneous. To make resource scheduling efficient and faster, we perform it at run-time. Despite the many techniques implemented by researchers throughout the world, scheduling algorithms are still not as efficient as expected. In this paper, we therefore implement a new technique, HDSME (Heterogeneous-Dynamic-Scheduling-Minimize-Energy), which schedules heterogeneous resources at run-time in order to minimize the energy and the cost of the operation. In this technique, the resources are the Virtual Machines present on the server. We minimize the cost and energy used by computation, communication and reconfiguration. The outcomes of our technique are compared with those of the existing technique “A Pareto-Based-Technique-for-CPU-Provisioning-of-Scientific-Workflows-on-Clouds”. We use Montage with 1000 resources to evaluate the performance of HDSME and of the existing technique, and to compare their energy utilization and cost.
S. Fatima (B) Department of Electronics and Communication Engineering, Visvesvaraya Technological University, Belagavi, India e-mail: [email protected] V. M. Viswanatha Department of Electronics and Communication Engineering, S. L. N. College of Engineering, Raichur, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_8
Keywords Cloud computing · Virtual machines · DVFS · Heterogeneous resources of CPU and GPU frequencies · Energy and cost efficiency
1 Introduction
Rapid application development is an important aspect of the Information Technology sector, and the time and effort put into software deployment must be as small as possible. This calls for the use of Cloud Computing, an emerging trend that is widely used for purposes such as storage, memory sharing, sharing of computational capacity and hardware resource sharing over a network such as the Internet. It provides resources to individuals as well as organizations as a service that can be used at any time or place, according to the user’s demand and convenience. This saves time and cost for users, as they do not need to own the resources they require and can use the service at will. Cloud Computing supports on-demand self-service, pay-per-use and dynamically scalable storage services over the Internet. One of the Cloud Computing services is Infrastructure as a Service (IaaS). IaaS provides infrastructure components such as repositories, firewalls, networks, load balancing and other computing resources to its customers, and is sometimes referred to as Hardware as a Service. Features and functionalities of IaaS include instant VM launching, live VM migration, load balancers, bandwidth provisioning, virtualized computational and storage volumes, and VM, storage and network management. Load balancing is the process of distributing the total load over the individual nodes of the shared system, in order to improve the response time of jobs and, at the same time, use resources effectively. It avoids a state in which some nodes are overloaded while others are lightly loaded. The workload can be memory capacity, interruption or CPU demand. At the same time, data centers will carry an unparalleled degree of heterogeneous workload combination, hosting distributed services from many different users in the same infrastructure.
It also enables the execution of large-scale distributed applications in datacenters comprising tens or hundreds of thousands of machines [1]. One problem that has drawn significant research effort over the last years is scheduling, i.e., how to allocate hardware resources to these applications [2–5]. Most cloud schedulers (CS) were built for homogeneous hardware components and do not handle the heterogeneity of both hardware and software. Current datacenters have added accelerators such as graphics processing units (GPUs) [6, 7], different types of hardware (such as memory and cores), hardware extensions (such as Intel SGX) [8] and FPGAs [9]. Many applications use the cloud for security and for storing databases, and they differ in performance, execution time, latency and size [10–14]. The software running in the cloud is therefore heterogeneous as well. The role of the CRM (cloud resource manager) is to finish tasks quickly while using cloud resources optimally. Over the past decades, various CSAs (cloud scheduling algorithms) have been introduced to increase the performance of the CRM. The
main aim of a CSA is to decrease the total time needed for the complete execution of an application and to find the most appropriate resources to allocate. Static resource allocation and scheduling schemes require the resource requirements and a description of the environment in advance. This causes unanticipated variance between the start and end times of jobs. Dynamic schemes, in contrast, make decisions at run-time using the platform specification and the information available about the platform. Static algorithms use the dependencies, the allocation decisions and the platform description for scheduling. In detail, such an algorithm takes into account the communication time between any pair of resources, the congestion that disturbs the communication, and the time taken to execute all jobs. Dynamic algorithms make resource allocation and scheduling decisions at run-time, once the main application is executing. These decisions mainly depend on the heterogeneity of the software and hardware platform, information about the resources, the memory location of the data, and the total number of jobs on the resources. With this much information available about the resources, we should make use of it. HPC clusters are typically operated for batch jobs via PBS-like systems such as TORQUE [15]. Another option is to maintain a large cluster as a multi-user environment based on virtualization, treating the cluster as a private cloud (e.g., Eucalyptus [16]). In this case, users do not need to be aware [17, 18] of the features of the hardware: in a private cloud, load balancing and resource management are handled in an integrated way, and users see a service-oriented infrastructure. Finally, if the large cluster is not sufficient to handle the resource requirements, external resources can be borrowed from a hybrid cloud.
Users can go through this process frequently with no authentication [18]. The merging of the cloud model with heterogeneous HPC was established by the Cluster GPU instances of Amazon EC2 [19]. Adding GPUs to cloud cluster environments is just the beginning of this process. Other vendors such as Hoopoe [20] and Nimbix [21] provide cloud-based services for GPU computation. Several current projects, such as gVirtuS [22], GViM [23], rCUDA [24] and vCUDA [25], address the problem of letting applications that run in virtual machines use GPUs by intercepting library calls on the client and redirecting them to the CUDA runtime on the server. These frameworks rely on the scheduling mechanism delivered by the CUDA runtime, and they let applications run on the GPU in sequence. These two points lead to poor performance and low resource utilization. In a recent paper, Ravi et al. [26] consider GPU sharing through the parallel execution of kernels called by multiple applications. All of the above existing systems have a few limitations. First, they assume that the whole memory requirement of the applications mapped onto the same GPU fits the memory capacity of the device. As the data sets of the applications grow and resource sharing increases, this assumption no longer holds. Second, the introduced architectures bind applications statically to GPUs (i.e., they do not support dynamic remapping of applications). This not only leads to sub-optimal scheduling, but also forces the complete software to restart when a GPU failure occurs, and prevents efficient load balancing when GPU machines are added or removed.
Grid computing treats the network as a distributed computing platform. Like grid computing, the cloud also provides distributed resources such as software, database storage, computing power and virtualization, which help end users complete their work in a short time and at reduced cost. Compared with the cloud, utility computing charges depend on usage, and it serves current on-demand requirements. The cloud offers a huge amount of resources at minimum operational cost. Virtualization is the main building block that provides all on-demand requirements to the end user. In all current systems, the unit that is assigned and reassigned is a virtualized server called a virtual machine. Such a server can be used anywhere and at any time with an Internet connection. Nowadays, a growing number of jobs use virtual machines owing to the growth of virtualization and cloud computing. Cloud computing is a developing and attractive on-demand technology: users can develop their applications and dispatch them to the cloud, which provides security for every application, so that users can extend their home clusters. This places high demands on workflow management systems, whose responsibility is to increase performance at low cost. A number of optimization algorithms have been proposed to optimize cost and performance. In this paper, we introduce a new methodology called “A Heterogeneous Dynamic Scheduling Minimize Energy” (HDSME). In this methodology, we dynamically schedule the heterogeneous resources of the virtual machines on the servers, taking special care to minimize the energy used while the complete execution process is performed on the servers’ virtual machines. The rest of the paper is organized as follows. Section 2 presents related work, Sect.
3 presents the architecture of our proposed methodology and the complete model of the system, Sect. 4 presents the results and analysis, and finally Sect. 5 concludes the work.
2 Related Work
In this section, we review the related research works, which provide the background on the topic, highlight the importance of our research and allow us to compare our performance with previous work. In recent years, scheduling the heterogeneous resources of virtual machines on servers has become a trend, with the aim of using server resources efficiently and reducing their cost while minimizing energy usage. “A taxonomy of scheduling in shared system for general purpose” was introduced by Casavant and Kuhl [27]. Their categorization covers static and dynamic, global and local, non-cooperative and cooperative, and non-shared and shared scheduling, as well as several methods to solve the problem, for instance optimal,
exploratory, assumed and sub-optimal. This categorization is complete in some aspects and is still usable today. However, the present state of shared systems calls for additional features in this taxonomy. Static scheduling techniques for assigning workloads structured as DAGs (Directed Acyclic Graphs) to multiple processors were surveyed by Ahmad and Kwok [28]. The authors introduced a simplified taxonomy of the problem and categorized and described 27 scheduling techniques. These static scheduling techniques for multiple processors are also applicable to shared systems. In this paper, we look at the evolution of those techniques and their categories, considering heterogeneous systems, run-time scheduling techniques, scheduling techniques in novel shared environments and new scheduling techniques. Since Ahmad and Kwok’s research, other surveys and papers addressing the scheduling problem for parallel systems have been published. Many of them target heterogeneous shared systems [29], which Kwok and Ahmad regarded as one of the most difficult directions to follow [28]. In [30], Hamscher et al. introduced “Scheduling of tasks techniques for grid computing”. The authors introduce basic scheduling frameworks, namely hierarchical, centralized and decentralized. For every scheduling framework, they evaluate four processor selection techniques and three scheduling algorithms, namely FCFS (First-Come-First-Served), Backfilling and Random. After this work, several dynamic scheduling techniques were implemented to handle the dynamic behavior of the grid. In [31], Feitelson et al. claim that, in general, a review of parallel task scheduling is required.
The researchers also give a brief introduction to task scheduling in grids, distinguishing between grids and parallel computers. They identify cross-domain load balancing and co-assignment as two important concerns when scheduling within grids. In this research, we propose a categorization of schedulers in shared systems, which includes a general view of grid computing techniques. Additionally, we point out new requirements of the developing model and its differences from grid computing. In [32], Wieczorek et al. introduce a classification of the workflow scheduling problem that considers multiple optimization criteria on grid computing platforms. The researchers split the multi-criteria scheduling problem into five divisions, namely the scheduling criteria, the resource model, the task model, the scheduling process and the workflow model. Each division describes the problem from a different viewpoint. These divisions are elaborated to classify existing works in detail and to identify how current research can be extended in each division. In [33], MARS is a MapReduce technique designed with the help of GPUs to increase the speed of several web applications. However, one disadvantage of running MapReduce on a GPU is memory. A GPU has comparatively little memory, which may not be enough to handle large web-dependent problems, where the data to be processed is in the order of GBs or TBs. Thus, more tests are required to obtain more accurate outcomes.
For MapReduce problems, Ravi et al. introduced a method for scheduling tasks at run-time [34]. Complete programs are split into blocks that are then distributed over the devices. This technique is useful because it requires little input from the user and needs no profiling. The basic idea is simple: a master-slave framework assigns new blocks once the executing elements have finished their previous task. However, the disadvantage of this model is the choice of block size. Their test outcomes show large differences in performance for different block sizes. Selecting the optimal size is left for later research [34]. In [35], Grewe et al. used OpenCL and Clang [36] to build an infrastructure that analyzes programs and extracts static code features in order to partition those programs across devices. The main contribution of that paper is a machine-learning-based compiler, which predicts the best partitioning of the workload from these static code features. A two-layer predictor is used to partition the workloads. It is a fine-grained technique for the scheduling problem, but, according to the researchers themselves, most programs are handled by the first layer of the predictor, which calls the need for the second layer into question. Moreover, the feature selection is not fully satisfactory, and any modification of the static features requires re-training the complete framework, which, as with other machine learning techniques, is computationally very costly.
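The master-slave idea described for run-time block scheduling can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the scheme of [34]: blocks are equally sized, each device has a fixed hypothetical speed in blocks per second, and the next block always goes to the device that becomes free first.

```python
import heapq


def dynamic_block_schedule(num_blocks, device_speeds):
    """Simulate handing out equally sized blocks to whichever device is free first.

    device_speeds: blocks per second for each device (hypothetical values).
    Returns (makespan_in_seconds, blocks_processed_per_device).
    """
    # Min-heap of (time at which the device becomes free, device index).
    free_at = [(0.0, i) for i in range(len(device_speeds))]
    heapq.heapify(free_at)
    counts = [0] * len(device_speeds)
    makespan = 0.0
    for _ in range(num_blocks):
        t, dev = heapq.heappop(free_at)        # earliest available device
        finish = t + 1.0 / device_speeds[dev]  # time to process one block
        counts[dev] += 1
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, dev))
    return makespan, counts


# Example: a slow device (1 block/s) and a fast device (2 blocks/s) share 12 blocks.
makespan, counts = dynamic_block_schedule(12, [1.0, 2.0])
print("makespan:", makespan, "blocks per device:", counts)
```

Because faster devices return for new work more often, they end up with more blocks without any profiling. The sketch ignores transfer overhead, which is one reason the block size matters in practice.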
3 Architecture of HDSME and Model of the System
In this section, we present the cloud data center model used in HDSME. In particular, we describe the modules of the implemented architecture, which is shown in Fig. 1. We also describe how the architecture reduces the combined communication and computing energy for processing the data, its essential modules and, finally, its energy-aware module.
3.1 The Reference Architecture of HDSME
The architecture of HDSME is made of two modules. In Fig. 1, the bottom part is the front-end of the data center, which controls the incoming tasks and configures the infrastructure of the cloud data center. The upper part comprises the computing resources, i.e., the servers connected by the data center network. Every server has a virtualization section, in which the VMs run, and a physical section, which provides the actual computing resources for processing the incoming tasks. Every server communicates with the front-end modules over a communication link through the data center network, as shown in Fig. 1; the interaction within the heterogeneous cluster is realized via message passing.
Fig. 1 Architecture of HDSME
In the front-end of the data center, a gateway collects the incoming tasks. It is followed by a control section with two sub-modules, the Separator of Data Block (SDB) and the VMC (Virtual-Machine-Controller). The SDB separates the incoming tasks into data blocks and distributes them to M VMs over end-to-end communication links (i.e., the bi-directional link between SDB and VMC in Fig. 1). The VMC controls the virtualization section at run-time to map the available resources onto the Virtual Machines.
3.2 Resources Required for Computation
The computing resources in the HDSME architecture are the servers of the data center, which host the Virtual Machines. The cost of computation is evaluated from the energy each VM uses while processing the data blocks of the dispatched tasks. In this paper, we assume that the Virtual Machines on a server can share its resources, which is the usual approach for managing private clouds. Our algorithm aims to handle large computational demands with a small number of large VMs instead of a large number of small VMs. This leads us to use just a single Virtual Machine on each server. A server hosting a VM has three states:
• Active: the server is ON and executes tasks.
• Non-active: the server is ON, but no task is executed.
• Off: the server is powered OFF.
Switching servers ON and OFF is the job of the data center’s long-term management system. As switching a server OFF and ON takes a long time, we introduce a new state, the idle state, for every Virtual Machine. In this state, the VM processes no data and uses very little energy. In our algorithm, we assume that the decision to switch servers OFF and ON is taken by a long-term server linking strategy, which should guarantee that enough servers are available to execute the expected peak of incoming tasks for a given time. The long-term strategy should therefore handle the daily patterns typical of the tasks of many Internet-based services. Several solutions for long-term linking strategies already exist. Our work focuses instead on the rapid changes of task intensity that happen within fractions of a second, which cannot be addressed by existing server linking solutions.
3.2.1 Tasks Design
Tasks are modeled as a sequence of data blocks that users send to the servers for processing. We assume applications that interpret the content of the data, i.e., multimedia data; for instance, tracking movement or recognizing faces in images or video. We denote the size of a data block by $L_b$ (in bits). Each block is split into N parts, where N is the number of active VMs; the parts are independent of the content and are delivered in parallel over end-to-end connections, so that each VM executes its own part. Multimedia data processing is characterized by the Quality-of-Service requirement defined in the SLA, expressed as the maximum time permitted for processing the data. Therefore, the total time to process a data block (i.e., the communication delay plus the task execution time) must be less than $S_s$ seconds.
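The deadline condition above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the block is split equally among the VMs and that the delay of each part is its transfer time plus its processing time; the function and parameter names are hypothetical and are not taken from the HDSME implementation.

```python
def block_meets_deadline(block_bits, proc_rates, link_rates, deadline_s):
    """Check whether one data block can be processed within the SLA deadline.

    Assumptions (illustrative only): the block of `block_bits` bits is split
    equally among the VMs, and the delay of a part is its transfer time
    (part / link rate) plus its execution time (part / processing rate).
    Rates are in bit/s, the deadline in seconds.
    """
    n_vms = len(proc_rates)
    part_bits = block_bits / n_vms
    delays = [part_bits / link + part_bits / proc
              for proc, link in zip(proc_rates, link_rates)]
    worst_delay = max(delays)  # the slowest VM determines the block delay
    return worst_delay < deadline_s, worst_delay


ok, worst = block_meets_deadline(
    block_bits=8e6,
    proc_rates=[4e6, 8e6],
    link_rates=[8e6, 8e6],
    deadline_s=2.0,
)
print(ok, worst)
```

Taking the maximum over the VMs reflects that the parts are processed in parallel, so the block is finished only when the slowest part is done.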
3.2.2 Classification of Servers and Virtual Machines
In energy-aware resource allocation, the characteristics of a standard server x can be given as:

$$\left\{ \mathrm{freq}_x^{I},\ \mathrm{freq}_x^{highest},\ E_x^{I},\ P_a(x),\ C_e(x) \right\}, \quad x = 1, 2, 3, \ldots, N \tag{1}$$

Here, $\mathrm{freq}_x^{I}$ is the CPU frequency in the idle state, $\mathrm{freq}_x^{highest}$ is the maximum CPU frequency, $E_x^{I}$ is the energy usage of the CPU in the idle state, and the factors $P_a(x)$ and $C_e(x)$ denote the gates'
A Heterogeneous Dynamic Scheduling Minimize Energy—HDSME
percentage that is active and the effective capacitance load, respectively. Because we assume that each VM is hosted by exactly one server, the energy-related characteristics of the x-th physical server apply directly to the x-th Virtual Machine hosted on it. Additionally, to keep the notation simple, we assume a heterogeneous data center and drop the index x of the server and VM from the notation in what follows. We assume that the x-th VM can process multimedia data using the software deployed in the cloud data center. We denote the maximum processing rate by $R_p^{highest}$, in bit/s. In what follows, we assume software for which the CPU frequency of the server is linearly related to $R_p^{highest}$. Therefore, the processing rate $R_p^{highest}$ scales with the CPU frequency $\mathrm{freq}^{highest}$.
3.3 Usage of Power in HDSME
In this section, we describe the power model in detail. It consists of three cost models:
• Cost for Computation ($C_{comp}$)
• Cost for Reconfiguration ($C_{reconf}$)
• Cost for Communication ($C_{comm}$).
3.3.1 Cost for Computation in HDSME
The DVFS (Dynamic Voltage and Frequency Scaling) technique is applied to the VMs to reduce power usage by lowering their operating frequencies. Each VM can operate at more than one processing frequency, and each frequency is active only for some interval of time. DVFS provides a set of discrete frequencies: $\mathrm{freq}_{disc}$ is the number of frequency levels between the lowest and highest frequencies of each DVFS-capable VM processor. Taking the inactive (idle) state as the minimum frequency, we can write:

$$\mathrm{freq}^{highest} = \mathrm{freq}_Z > \mathrm{freq}_{Z-1} > \mathrm{freq}_{Z-2} > \cdots > \mathrm{freq}_1 > \mathrm{freq}^{I} = \mathrm{freq}_0 \tag{2}$$
The time needed to change the frequency is on the order of a few tens of microseconds in recent DVFS implementations for multi-core computing platforms. DVFS therefore supports changing the frequency dynamically, so that the processing rate follows the requirements of the data center. In practice, in a DVFS-enabled VM CPU, the x-th Virtual Machine executes at frequency $\mathrm{freq}_y$ for $s_{xy}$ seconds. Hence $R_{p,y}\, s_{xy}$ is the amount of data (in bits) processed during that interval. Each Virtual Machine selects its own set of frequencies while processing its share of the incoming data. Other parts of the system, for instance the bus and memory, work at a constant frequency and use the same amount of energy in both idle and active (running) states. For this reason, we focus on the energy of the CPU. The dynamic energy usage $E^{run\text{-}time}$ of the CPU of a server, and of the Virtual Machine it hosts, running at frequency freq is:

$$E^{run\text{-}time} = P_a \, C_e \, \mathrm{freq} \, v_s^{2} \tag{3}$$
Here, freq is the CPU frequency, $v_s$ is the supply voltage, $C_e$ is the effective capacitance load, and $P_a$ is the percentage of active gates. The supply voltage and the frequency are related by Eq. (4):

$$\mathrm{freq} = C \cdot \frac{v_s^{-1}}{(v_s - v_{s\tau})^{-2}} = C \cdot \frac{(v_s - v_{s\tau})^{2}}{v_s} \tag{4}$$
Here, $v_{s\tau}$ is the threshold voltage, which is much smaller than the supply voltage $v_s$, and C is a constant. Combining Eqs. (3) and (4), the dynamic energy usage can be expressed as a cubic function of the frequency freq. Therefore, introducing $E^{I} \ge 0$ as the energy usage while the Virtual Machine is not running any task, the overall computation cost becomes:

$$\mathrm{SUM}_{C\text{-}comp}(x) = \sum_{y=0}^{Z} \bar{P}_a \, C_e \, s_{xy} \cdot \frac{1}{\mathrm{freq}_y^{-3}}, \quad x = 1, 2, \ldots, N \tag{5}$$

Here, $\bar{P}_a = C^{-1} P_a$; $s_{xy}$ is the time during which the CPU of the x-th Virtual Machine operates at frequency $\mathrm{freq}_y$; x ranges from 1 to N and y from 0 to Z, so each Virtual Machine has Z + 1 discrete frequency levels.
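As a worked illustration of Eq. (5), the following Python sketch sums the per-frequency contributions for one Virtual Machine. It is only a sketch under the stated definitions: the function name, argument names, and the numeric values are hypothetical, and the term $1/\mathrm{freq}_y^{-3}$ is evaluated as $\mathrm{freq}_y^{3}$.

```python
def computation_cost(p_a_bar, c_e, freqs, durations):
    """Evaluate Eq. (5) for one VM.

    p_a_bar   : scaled percentage of active gates (P_a-bar in the text)
    c_e       : effective capacitance load
    freqs     : discrete frequencies freq_0 .. freq_Z
    durations : time s_xy spent at each frequency, same order as `freqs`
    """
    if len(freqs) != len(durations):
        raise ValueError("freqs and durations must have the same length")
    # 1 / freq**(-3) is the same as freq**3, i.e. the cubic dependence on frequency.
    return sum(p_a_bar * c_e * s * f ** 3 for f, s in zip(freqs, durations))


# Hypothetical VM with three frequency levels (idle level set to 0 here).
cost = computation_cost(p_a_bar=0.5, c_e=2.0,
                        freqs=[0.0, 1.0, 2.0],
                        durations=[1.0, 2.0, 3.0])
print(cost)
```

Because the cost grows with the cube of the frequency, spending longer at a lower frequency is cheaper than a short burst at a high one, as long as the deadline is still met.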
3.3.2 Cost for Reconfiguration in HDSME
The task of the VMC is to apply suitable frequency scaling techniques that let each server hosting a Virtual Machine dynamically scale its processing frequency $\mathrm{freq}_x$. Changing the frequency from $\mathrm{freq}_1$ to $\mathrm{freq}_2$ incurs an energy cost $\mathrm{SUM}_{C\text{-}reconf}(\mathrm{freq}_1; \mathrm{freq}_2)$, in Joules. Although the exact form of $\mathrm{SUM}_{C\text{-}reconf}(\mathrm{freq}_1; \mathrm{freq}_2)$ depends on the selected DVFS technique and on the underlying physical CPUs, any practical function $\mathrm{SUM}_{C\text{-}reconf}(\mathrm{freq}_1; \mathrm{freq}_2)$ usually satisfies the following three properties:
1. The function $\mathrm{SUM}_{C\text{-}reconf}(\mathrm{freq}_1; \mathrm{freq}_2)$ depends on the absolute frequency gap $|\mathrm{freq}_1 - \mathrm{freq}_2|$;
2. $\mathrm{SUM}_{C\text{-}reconf}(\cdot)$ vanishes when $\mathrm{freq}_1$ and $\mathrm{freq}_2$ are equal, and it is non-decreasing in $|\mathrm{freq}_1 - \mathrm{freq}_2|$;
3. It is jointly convex in $\mathrm{freq}_1$, $\mathrm{freq}_2$.
A simple practical model that satisfies these properties is:

$$\mathrm{SUM}_{C\text{-}reconf}(\mathrm{freq}_1; \mathrm{freq}_2) = \varepsilon_c \cdot \frac{1}{(\mathrm{freq}_1 - \mathrm{freq}_2)^{-2}} \quad \text{(Joule)} \tag{6}$$
Here, $\varepsilon_c$ (in Joules/Hz²) is the energy cost of switching the frequency by one unit. Typical values of $\varepsilon_c$ for current DVFS-based virtualized computing platforms are small, within a few hundred µJ/MHz². The frequencies $\mathrm{freq}_1$ and $\mathrm{freq}_2$ are restricted to the discrete frequency levels available to each Virtual Machine. We assume that the number of discrete frequencies is the same for every Virtual Machine. Depending on the part of the task assigned to it, each Virtual Machine operates on a subset of its discrete frequencies, which we call the active discrete frequencies. The switching cost in HDSME is divided into an inner cost and an outer cost. The inner cost is the reconfiguration cost of switching between the active discrete frequencies of VM(x), whereas the outer cost depends on the difference between the last active discrete frequency used for the current task and the first active discrete frequency of the next incoming task. For instance, consider a Virtual Machine with 5 discrete frequencies, of which 3 form the set of active discrete frequencies for the allocated part of the task. We then determine the internal and external frequency changes and compute the reconfiguration energy using Eq. (6).
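The inner and outer switching costs described above can be sketched in Python as follows. This is an illustrative reading of Eq. (6), assuming that the VM steps through its active frequencies in the given order; the function names and the example values are hypothetical.

```python
def reconf_cost(eps_c, freq_1, freq_2):
    """Eq. (6): 1 / (freq_1 - freq_2)**(-2) equals (freq_1 - freq_2)**2."""
    return eps_c * (freq_1 - freq_2) ** 2


def switching_costs(eps_c, active_freqs, next_first_freq):
    """Return (inner, outer) reconfiguration costs for one VM.

    inner : cost of moving between consecutive active frequencies
            while processing the current part of the task
    outer : cost of moving from the last active frequency of the current
            task to the first active frequency of the next task
    """
    inner = sum(reconf_cost(eps_c, a, b)
                for a, b in zip(active_freqs, active_freqs[1:]))
    outer = reconf_cost(eps_c, active_freqs[-1], next_first_freq)
    return inner, outer


# Hypothetical VM: 3 active frequencies out of 5 available ones.
inner, outer = switching_costs(eps_c=0.5,
                               active_freqs=[1.0, 2.0, 4.0],
                               next_first_freq=1.0)
print(inner, outer)
```

The quadratic form makes one large frequency jump more expensive than several small ones that cover the same range, which is consistent with properties 1 to 3.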
3.3.3 Cost for Communication in HDSME
In HDSME, each Virtual Machine communicates with the scheduler through a dedicated, contention-free link that operates at a transmission rate of $TrnsRate_x$ (bits per second), where x = 1 to N. We assume that the link is symmetric and bidirectional. Additionally, we assume that one-way switching and transmission on the x-th link consumes a power of $E_x^{C\text{-}comm}$, in Watts. $E_x^{C\text{-}comm}$ can be written as $E_x^{C\text{-}comm} \equiv E_{TotalTime}^{C\text{-}comm}(x) + E_{TrnsRate}^{C\text{-}comm}(x)$, where $E_{TotalTime}^{C\text{-}comm}(x)$ is the power needed for one-way switching and transmission and $E_{TrnsRate}^{C\text{-}comm}(x)$ is the power required by the receiver circuit. The actual value of $E_x^{C\text{-}comm}$ depends on the switching unit, on the noise affecting the x-th link, and on the required dependency. In the following, we assume that the set of link energies $\{E_x^{C\text{-}comm},\ x = 1 \text{ to } N\}$ is given.
Regarding the actual value of $E_x^{C\text{-}comm}$: to minimize deployment cost, current data centers use physical servers connected by very fast Ethernet switches and run TCP on top of them to obtain reliable end-to-end connections. On this basis, data-center-oriented versions of the protocol have been introduced, which allow the managed end-to-end transmission links to operate without congestion 99.9% of the time while guaranteeing the same end-to-end throughput as the original protocol. Thus, the energy cost of communication for the implemented method can be simplified as:

$$E_x^{C\text{-}comm}(TrnsRate_x) = \delta_x \left( TRT_x \cdot TrnsRate_x \right)^{2} + E_x^{I}, \quad x = 1 \text{ to } N \tag{7}$$
Here,

$$\delta_x = R_{gain}^{-1} \left( SHS^{-1} \sqrt{\frac{2\theta}{3}} \right)^{2}, \quad x = 1 \text{ to } N$$

where SHS is the Size-of-the-Highest-Segment, in bits; θ ∈ {1, 2} is the number of acknowledged parts; $R_{gain}$ is the coding gain-to-receive noise energy ratio of the x-th end-to-end link; $TRT_x$ is the mean round-trip time of the x-th end-to-end link; and $E_x^{I}$ is the idle energy cost of the x-th link. The resulting one-way transmission delay is:

$$TrnsDelay(x) = \sum_{y=1}^{Z} R_{p,y}\, s_{xy} / TrnsRate_x \tag{8}$$
Hence, the corresponding one-way communication energy $\mathrm{SUM}_{C\text{-}comm}(x)$ is:

$$\mathrm{SUM}_{C\text{-}comm}(x) = E_x^{C\text{-}comm}(TrnsRate_x) \cdot \left( \sum_{y=1}^{Z} R_{p,y}\, s_{xy} / TrnsRate_x \right), \quad \text{in Joules} \tag{9}$$

In particular, the energy used for end-to-end communication does not affect the computation policy and is independent of it.
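Equations (7) to (9) can be chained as in the following Python sketch. It treats $\delta_x$ as a given input rather than computing it, and all function names, argument names, and numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def link_power(delta_x, trt_x, rate_x, idle_power):
    """Eq. (7): power drawn by the x-th link at transmission rate rate_x."""
    return delta_x * (trt_x * rate_x) ** 2 + idle_power


def transmission_delay(proc_rates, durations, rate_x):
    """Eq. (8): bits processed at each frequency level (R_p,y * s_xy),
    summed over y = 1..Z and divided by the link rate."""
    bits = sum(r * s for r, s in zip(proc_rates, durations))
    return bits / rate_x


def communication_energy(delta_x, trt_x, rate_x, idle_power,
                         proc_rates, durations):
    """Eq. (9): link power multiplied by the one-way transmission delay."""
    return (link_power(delta_x, trt_x, rate_x, idle_power)
            * transmission_delay(proc_rates, durations, rate_x))


# Hypothetical link and workload values.
energy = communication_energy(delta_x=2.0, trt_x=0.5, rate_x=4.0,
                              idle_power=1.0,
                              proc_rates=[2.0, 4.0],
                              durations=[1.0, 0.5])
print(energy)
```

With these values the link power is 9.0, the delay is 1.0 s, and the resulting energy is 9.0 J.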
4 Results and Analysis
We implemented the algorithm in Java, using Eclipse Neon 3, on a machine with 8 GB of RAM and 500 MB of free hard-disk space. In the simulation, we use the Montage workflow with 1000 resources.
Table 1 Table representing the data of number of VMs versus amount of execution time in seconds

| Number of virtual machines | Execution time (in seconds) |
| --- | --- |
| 5 | 2555.85 |
| 15 | 1087.66 |
| 25 | 787.69 |
| 35 | 657.03 |
| 45 | 589.57 |
| 55 | 553.87 |
| 65 | 512.96 |
| 75 | 498.75 |
| 85 | 468.00 |
We vary the number of Virtual Machines from 5 to 85 in steps of 10. The number of hosts is set equal to the number of Virtual Machines. Table 1 reports the execution time of the simulation for each number of Virtual Machines, for Montage with 1000 resources. As the number of Virtual Machines increases, the execution time decreases, from 2555.85 s with 5 Virtual Machines to 468.00 s with 85 Virtual Machines. The graph in Fig. 2 plots the execution time against the number of Virtual Machines. Table 2 lists the cost, in Pounds (£) and in INR, for the same numbers of Virtual Machines, and the graph in Fig. 3 plots these cost values.
Fig. 2 Graph representing the level of execution time with respect to the number of virtual machines (x-axis: Number of Virtual Machines; y-axis: Execution Time (in seconds))
Table 2 Table representing the cost in Pounds with respect to different number of virtual machines

| Virtual machines | Cost (£) | Cost (INR) |
| --- | --- | --- |
| 5 | 0.025930863 | 2.35 |
| 15 | 0.022353241 | 2.03 |
| 25 | 0.022353704 | 2.03 |
| 35 | 0.022695563 | 2.06 |
| 45 | 0.022353457 | 2.03 |
| 55 | 0.022353464 | 2.03 |
| 65 | 0.022353464 | 2.03 |
| 75 | 0.022353984 | 2.03 |
| 85 | 0.022353284 | 2.03 |
Fig. 3 Graph representing the cost for executing task for different number of VMs (plot title: Cost in Pounds for different number of Virtual Machines; x-axis: Number of Virtual Machines; y-axis: Cost in Pounds (£))
The graph in Fig. 3 shows the cost of executing the workload for different numbers of Virtual Machines. The cost is given in Pounds (£). To assess whether HDSME, our algorithm for dynamically scheduling heterogeneous resources, is efficient, we compare its results with those of an existing algorithm, "A Pareto-Based Technique for CPU Provisioning of Scientific Workflows on Clouds". Table 3 lists the execution time of the proposed and the existing algorithm for Montage with 1000 resources.

Table 3 Representation of execution time for both existing and proposed system with respect to different number of virtual machines

| Virtual machines | Execution time of existing algorithm (in seconds) | Execution time of proposed algorithm (in seconds) |
| --- | --- | --- |
| 5 | 2585.7 | 2555.85 |
| 15 | 1108.44 | 1087.66 |
| 25 | 807.96 | 787.69 |
| 35 | 676.74 | 657.03 |
| 45 | 604.74 | 589.57 |
| 55 | 562.44 | 553.87 |
| 65 | 528.6 | 512.96 |
| 75 | 498.96 | 498.75 |
| 85 | 482.04 | 468.00 |

For every number of Virtual Machines, HDSME completes the simulation in less time than the existing algorithm; the reduction ranges from about 0.04% (75 Virtual Machines) to about 3.0% (65 Virtual Machines). The graph in Fig. 4 plots both series to make the difference visible. Table 4 lists the cost of the existing and the proposed techniques for different numbers of Virtual Machines, and the graph in Fig. 5 presents the same comparison graphically.
Fig. 4 Graph showing the difference between the existing and proposed algorithm for the execution time (plot title: Execution Time for Montage 1000; x-axis: Number of Virtual Machines; y-axis: Execution time (in seconds); series: Existing, Proposed)
Table 4 Table representing the values of cost for both existing and proposed techniques for different number of virtual machines

| Virtual machines | Cost price by using existing technique (in Pounds £) | Cost price by using existing technique (in INR) | Cost price by using proposed HDSME technique (in Pounds £) | Cost price by using proposed HDSME technique (in INR) |
| --- | --- | --- | --- | --- |
| 5 | 0.649 | 58.87 | 0.025930863 | 2.35 |
| 15 | 0.589 | 53.43 | 0.022353241 | 2.03 |
| 25 | 0.522 | 47.35 | 0.022353704 | 2.03 |
| 35 | 0.489 | 44.36 | 0.022695563 | 2.06 |
| 45 | 0.443 | 40.18 | 0.022353457 | 2.03 |
| 55 | 0.413 | 37.46 | 0.022353464 | 2.03 |
| 65 | 0.3524 | 31.97 | 0.022353464 | 2.03 |
| 75 | 0.305228571 | 27.69 | 0.022353984 | 2.03 |
| 85 | 0.258057143 | 23.41 | 0.022353284 | 2.03 |
Fig. 5 Graph representing the cost for existing and proposed techniques for different number of virtual machines (plot title: Cost for Montage 1000; x-axis: Number of Virtual Machines; y-axis: Cost (£); series: Existing, Proposed)
The graph in Fig. 5 shows that the cost of the proposed technique (i.e., HDSME) is lower than the cost of the existing technique (i.e., A Pareto-Based Technique for CPU Provisioning of Scientific Workflows on Clouds) for every number of Virtual Machines: about £0.022 to £0.026 against £0.258 to £0.649 (Table 4). Thus, the graphs and tables indicate that, for Montage with 1000 resources, HDSME improves on the existing technique in both execution time and cost.
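The size of the improvement can be derived directly from the values reported in Tables 3 and 4. The following Python sketch computes the percentage reduction in execution time and the ratio of existing to proposed cost for each number of Virtual Machines; the variable and function names are illustrative, and the numbers are copied from the tables.

```python
vms = [5, 15, 25, 35, 45, 55, 65, 75, 85]

# Execution times in seconds (Table 3).
time_existing = [2585.7, 1108.44, 807.96, 676.74, 604.74, 562.44, 528.6, 498.96, 482.04]
time_proposed = [2555.85, 1087.66, 787.69, 657.03, 589.57, 553.87, 512.96, 498.75, 468.00]

# Costs in Pounds (Table 4).
cost_existing = [0.649, 0.589, 0.522, 0.489, 0.443, 0.413, 0.3524, 0.305228571, 0.258057143]
cost_proposed = [0.025930863, 0.022353241, 0.022353704, 0.022695563, 0.022353457,
                 0.022353464, 0.022353464, 0.022353984, 0.022353284]


def percent_reduction(old, new):
    """Relative reduction of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old


time_reduction = [percent_reduction(o, n) for o, n in zip(time_existing, time_proposed)]
cost_ratio = [o / n for o, n in zip(cost_existing, cost_proposed)]

for n, t, c in zip(vms, time_reduction, cost_ratio):
    print(f"{n:>2} VMs: time reduced by {t:5.2f}%, existing cost is {c:5.2f}x proposed")
```

On these figures the execution-time gain is small (at most about 3%), while the reported cost differs by roughly one order of magnitude.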
5 Conclusion
In this work, we have implemented a new scheduling algorithm, HDSME. The algorithm dynamically selects efficient frequency configurations of the GPU and CPU for executing tasks on a selected number of cloud resources, i.e., Virtual Machines. Its main aim is to minimize the cost and energy of scheduling by dynamically allocating heterogeneous cloud resources (i.e., Virtual Machines on the servers). We used Montage with 1000 resources to analyze the cost and execution time of our algorithm, and compared the outcomes with those of an existing algorithm, "A Pareto-Based Technique for CPU Provisioning of Scientific Workflows on Clouds". In this comparison, HDSME achieved lower execution time and lower cost than the existing algorithm.
References 1. J. Dean, The Rise of Cloud Computing Systems. https://youtu.be/4_BeSgiNoQ0. Oct. 2015. SOSP History Day 2. P. Delgado, F. Dinu, A.-M. Kermarrec, W. Zwaenepoel, Hawk: hybrid datacenter scheduling, in 2015 USENIX Annual Technical Conference (USENIX ATC 15) (USENIX Association, Santa Clara, CA, 2015), pp. 499–510 3. A. Goder, A. Spiridonov, Y. Wang, Bistro: scheduling data-parallel jobs against live production systems, in Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference USENIX ATC ’15 (USENIX Association, Berkeley, CA, USA, 2015), pp. 459–471 4. I. Gog, M. Schwarzkopf, A. Gleave, R.N.M. Watson, S. Hand, Firmament: fast, centralized cluster scheduling at scale. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (USENIX Association, GA, 2016), pp. 99–115 5. R. Grandl, S. Kandula, S. Rao, A. Akella, J. Kulkarni, Graphene: packing and dependencyaware scheduling for data-parallel clusters, in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (USENIX Association, GA, 2016), pp. 81–97 6. Softlayer GPU Accelerated Computing. http://www.softlayer.com/GPU 7. Amazon EC2 Pricing. https://aws.amazon.com/ec2/pricing/ (2016) 8. A.M. Caulfield, E.S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, et al., A cloudscale acceleration architecture, in, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2016), pp. 1–13 9. F. Mckeen, I. Alexandrovich, A. Berenzon, C.V. Rozas, H. Shafi, V. Shanbhogue, U.R. Savagaonkar, Innovative instructions and software model for isolated execution, in Proceedings of the 2nd International Workshop on Hardware and Architectural Support for Security and Privacy, HASP ’13 (2013) 10. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R.H. Katz, S. Shenker, I. Stoica, Mesos: a platform for fine-grained resource sharing in the data center, in NSDI, vol. 11 (2011), pp. 22–22. 11. M. Isard, V. Prabhakaran, J. 
Currey, U. Wieder, K. Talwar, A. Goldberg, Quincy: fair scheduling for distributed computing clusters, in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, SOSP ’09 (ACM, New York, NY, USA, 2009), pp. 261–276
114
S. Fatima and V. M. Viswanatha
12. V.K. Vavilapalli, A.C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, et al., Apache Hadoop YARN: yet another resource negotiator, in Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13 (ACM, New York, NY, USA, 2013), pp. 5:1–5:16 13. A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, J. Wilkes, Large-scale cluster management at Google with Borg, in Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15 (ACM, New York, NY, USA, 2015), pp. 18:1–18:17 14. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica, Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling, in Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10 (ACM, New York, NY, USA, 2010), pp. 265–278 15. TORQUE Resource Manager. http://www.clusterresources.com/products/TORQUE-resourcemanager.php 16. Eucalyptus. http://www.eucalyptus.com 17. H. Lim, S. Babu, J. Chase, S. Parekh, Automated control in cloud computing: challenges and opportunities, in Proceedings of ACDC ’09 (ACM, New York, NY, USA, 2009), pp. 13–18 18. P. Marshall, K. Keahey, T. Freeman, Elastic site: using clouds to elastically extend site resources, in Proceedings of CCGrid 2010 (2010), pp. 43–52 19. Amazon EC2 Instances: http://aws.amazon.com/ec2/ 20. Nimbix Informatics Xcelerated: http://www.nimbix.net 21. Hoopoe: http://www.hoopoe-cloud.com 22. V. Gupta et al., GViM: GPU-accelerated virtual machines, in Proceedings of HPCVirt ’09 (ACM, New York, NY, USA, 2009), pp. 17–24 23. L. Shi, H. Chen, J. Sun, vCUDA: GPU accelerated high performance computing in virtual machines, in Proceedings of IPDPS ’09 (Washington, DC, USA, 2009), pp. 1–11 24. J. Duato et al., rCUDA: Reducing the number of GPU-based accelerators in high performance clusters, in Proceedings of HPCS’10, pp. 224–231 25. G. Giunta, R. Montella, G. Agrillo, G. 
Coviello, A GPGPU transparent virtualization component for high performance computing clouds, in Proceedings of Euro-Par 2010 (Heidelberg, 2010) 26. V. Ravi, M. Becchi, G. Agrawal, S. Chakradhar, Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework, in Proceedings of HPDC ’11 (ACM, New York, NY, USA, 2011), pp. 217–228 27. T.L. Casavant, J.G. Kuhl, A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Trans. Softw. Eng. 14 (2), 141–154 28. Y.K. Kwok, I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput. Surv. 31(4), 406–471 29. C. Jiang, C. Wang, X. Liu, Y. Zhao, A survey of job scheduling in grids, in Proceedings of the Joint 9th Asia-Pacific Web and 8th International Conference on Web-Age Information Management Conference on Advances in Data and Web Management, APWeb/WAIM’07 (Springer, Berlin, Heidelberg, 2007), pp. 419–427 30. V. Hamscher, U. Schwiegelshohn, A. Streit, R. Yahyapour, Evaluation of jobscheduling strategies for grid computing, in GRID ’00 Proceedings of the First IEEE/ACM International Workshop on Grid Computing (Springer, London, UK, 2000), pp. 191–202 31. D.G. Feitelson, L. Rudolph, U. Schwiegelshohn, Parallel job scheduling — a status report, in Proceedings of the 10th International Conference on Job Scheduling Strategies for Parallel Processing, JSSPP’04 (Springer, Berlin, Heidelberg, 2005), pp. 1–16 32. Wolfgang, M. Wieczorek, A. Hoheisel, R. Prodan, Taxonomies of the multicriteria grid workflow-scheduling problem, in Grid middleware and services (Springer US, 2008), pp. 237–264 33. B. He, W. Fang, Q. Luo, N.K. Govindaraju, T. Wang, Mars: a mapreduce framework on graphics processors, in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (ACM, 2008), pp. 260–269 34. V.T. Ravi, W. Ma, D. Chiu, G. 
Agrawal, Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations, in Proceedings of the 24th ACM International Conference on Supercomputing (ACM, ICS ’10, 2010), pp. 137–146
A Heterogeneous Dynamic Scheduling Minimize Energy—HDSME
115
35. D. Grewe, M.F.P. O’Boyle, A static task partitioning approach for hetero-geneous systems using OpenCL, in Proceedings of the 20th International Conference on Compiler Construction (Springer, 2011), pp. 286–305 36. S. Naroff, Clang: New LLVM C Front-end, (2007). http://llvm.org/devmtg/2007-05/09-NaroffCFE.pdf
A Generic Framework for Evolution of Deep Neural Networks Using Genetic Algorithms Deepraj Shukla and Upasna Singh
Abstract This paper presents an approach for evolving deep neural networks (DNNs) using genetic algorithms. Deep artificial neural networks are generally used for classification tasks. Depending on the problem at hand, designers decide how many layers to use, how many nodes each layer should have, which activation functions to use in each layer, etc. Genetic algorithms take their name and their evolutionary process from the biological world. Here, we use a basic genetic algorithm to automatically build a deep neural network suited to the given classification task. The evolved networks are evaluated on their classification accuracy, and the genetic algorithm finds the best-suited DNN architecture automatically.
Acronyms
DNN Deep neural network
ANN Artificial neural network
CNN Convolutional neural network
AE Autoencoders
GA Genetic algorithm
D. Shukla (B) · U. Singh
Defence Institute of Advance Technology, Pune, India
e-mail: [email protected]
U. Singh
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_9

1 Introduction
In the current era of data abundance, data has become easier to obtain, which has opened many avenues for data-based models to perform tasks that were previously impossible. Deep learning techniques, in particular, perform better and better as more data becomes available. Among the many data-driven models, deep learning models such as autoencoders (AE), convolutional neural networks (CNN), long short-term memory networks (LSTM), recurrent neural networks (RNN), and generative adversarial networks (GAN) are used in many fields for tasks such as object recognition, detection, tracking, time-series activity modeling, and object classification. Deep learning models are used in various real-time applications, viz. detection of objects for border surveillance, parking area management, traffic data analysis (TDA), automatic driving of cars, and human activity recognition. These models have reached very high levels of accuracy on many tasks. However, deep neural network models are still designed manually, relying on human expertise and trial and error. Various approaches have been proposed to design deep neural networks automatically and have achieved good results, but most of them either assume a fixed-depth DNN or do not optimize all hyperparameters of a DNN. Here, we propose a generic framework for the evolution of DNNs using a genetic algorithm and report results on the MNIST [4] dataset for the automatic evolution of the DNN structure with selected hyperparameters; the approach is generic and can be extended to optimize any number of hyperparameters. The paper is organized as follows. Section 2 gives the basics of DNNs, Sect. 3 briefly describes the steps of a genetic algorithm, Sect. 4 describes the proposed methodology, including the generic chromosome structures and the one used for the evolution of DNNs, Sect. 5 describes the experiment and results, and Sect. 6 presents the conclusion and the future scope of work.
2 Basics of Deep Neural Networks
The perceptron is the basic processing unit of deep neural networks. It has multiple inputs $x_i$, each associated with a weight $w_i$; the weighted inputs are summed, and a nonlinear function such as a step function is applied to produce the final output. A single perceptron can classify linearly separable inputs, but nonlinearly separable inputs require multiple perceptrons. For complex functions, multiple layers are used, each with multiple neurons; such networks are commonly known as deep neural networks (DNNs). Each DNN structure has several parameters, known as hyperparameters, which can be tuned to obtain the best performance for the task at hand. Examples of hyperparameters are:
1. Hidden layer count
2. Number of neurons in a particular hidden layer
3. Type of activation in each hidden layer
4. Batch size
5. Number of epochs for training
6. Drop percentage for each hidden layer, etc.
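The perceptron described at the start of this section can be written in a few lines of Python. The sketch below is illustrative only: the bias term and the AND-gate weights are assumptions added to make the example concrete and are not taken from the paper.

```python
def perceptron(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a step activation.

    The bias term is an assumption added for the example; the text only
    mentions the weighted sum and the step function.
    """
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total >= 0 else 0


# A linearly separable problem: logical AND of two binary inputs.
and_weights, and_bias = [1.0, 1.0], -1.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron([a, b], and_weights, and_bias))
```

No choice of weights lets this single unit compute XOR, which is not linearly separable; that is the case where several perceptrons arranged in layers are needed.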
2.1 Deep Neural Networks
In a deep neural network, the arrangement of neurons in hidden layers and the number of neurons, together with the other hyperparameters, is called the structure of the DNN. DNNs can extract complex features from their inputs and classify the inputs based on them. They have become popular because data is now abundantly available and because they are easy to use: there is no need to preprocess the inputs, since a DNN learns the features of the inputs automatically. Although DNNs can be very accurate, their architecture is mostly decided by experience and trial and error, which is inefficient. The limited availability of experts is one of the main reasons DNNs are still not deployed in many problem areas, even though they are applicable in almost every field of science.
2.2 DNN Hyperparameters
A DNN has many hyperparameters, which are generally chosen by its designers through expertise and trial and error. The architecture of a DNN depends on these hyperparameters, so they have to be chosen carefully. They include the number of hidden layers, the number of neurons in each layer, the type of activation used in each layer, the dropout layers, the drop percentage, the optimization function, etc. Although all of these hyperparameters can be encoded in a gene structure to construct the chromosome for the evolution, we restrict ourselves to two of the most important hyperparameters to demonstrate the concept.
2.3 Dropout Layer
Dropout is a regularization technique for reducing overfitting in neural network models, proposed by Srivastava et al. in their 2014 paper, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" [1]. With dropout, randomly selected neurons are ignored during the training of the model. The neurons are dropped at random at every training step, according to the defined drop percentage. The contribution of the dropped neurons to the activation of downstream neurons is temporarily removed on the forward pass of the data, and no weight updates are applied to those neurons on the backward pass of error propagation.
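The forward and backward behaviour of dropout described above can be sketched in plain Python. The sketch works on lists of activations and gradients, uses hypothetical function names, and omits the rescaling of the kept activations that many libraries apply ("inverted dropout"), since the text does not describe it.

```python
import random


def dropout_forward(activations, drop_fraction, rng):
    """Zero each activation with probability `drop_fraction`.

    Returns the masked activations and the mask (1 = kept, 0 = dropped),
    so the same mask can be reused on the backward pass.
    """
    mask = [0 if rng.random() < drop_fraction else 1 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask


def dropout_backward(gradients, mask):
    """Dropped neurons receive no gradient, hence no weight update."""
    return [g * m for g, m in zip(gradients, mask)]


rng = random.Random(0)  # fixed seed so the example is repeatable
hidden = [0.3, 1.2, 0.7, 0.9, 0.1, 0.5]
out, mask = dropout_forward(hidden, drop_fraction=0.5, rng=rng)
grads = dropout_backward([1.0] * len(hidden), mask)
print(mask, out, grads)
```

A new mask is drawn at every training step, so a different random subset of neurons is silenced each time; at inference time no mask is applied.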
2.4 Applications of DNNs
Deep neural networks (DNNs) are used in many types of applications, including classification, object detection, feature extraction, and compression with autoencoders. Because features no longer need to be extracted manually, DNNs require less expertise to use and have therefore found widespread use in many areas of human activity.
3 Genetic Algorithms
A genetic algorithm (GA) is a stochastic search and optimization method from the field of evolutionary computation. GA is inspired by Charles Darwin’s theory of natural evolution. The algorithm is an adaptation of the process of natural selection, in which the fittest individuals (candidates) of a population are selected for reproduction in order to produce offspring of improved fitness for the next generation. The process starts with the selection of the fittest individuals from a population, based on predefined fitness criteria and the fitness scores the individuals achieve. Selected parents produce offspring which inherit the characteristics of the parents and become part of the next generation. The best performer of the population gets more chances to reproduce and, under elite selection, may even pass directly into the next generation. Parents with better fitness have a better chance to survive, and their offspring are expected to be fitter than the parents, with an even better chance of surviving in the environment. This process is repeated for the number of generations defined for the evolution, and at the end the individual with the best fitness score is chosen. This notion can be applied to search and optimization problems: we consider a set of candidate solutions for a problem and select the individuals which, through the evolution mechanism, result in a population of better and fitter individuals. There are five phases in a genetic-algorithm-based evolution mechanism:
1. Population initialization
2. Mutation
3. Selection
4. Crossover
5. Fitness function.
3.1 Population Initialization
The GA process begins with a set of randomly generated individuals, called a population. Each individual is a candidate solution to the problem in the selected solution space. Every individual is characterized by a set of parameters (variables) known as genes. Genes are joined into a string to form a chromosome (a solution), and a chromosome can be translated into a phenotype, here the architecture of a neural network. The genes of an individual are represented as a string over some alphabet, as numbers, etc. Depending on the coding scheme selected, we can use real-number encoding or binary encoding, i.e., a string of 1s and 0s. Chromosomes may be of fixed or variable length depending on the problem domain. Here, we use a variable-length chromosome structure, as it has a better probability of finding deeper, more efficient network architectures.
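As a rough illustration of variable-length initialization, the sketch below builds a population of five chromosomes, each a list of (layer number, number of neurons) genes. The function names and the lower bounds of 1 for layers and neurons are assumptions made for this example; the upper bounds follow the value ranges the paper lists for its experiments.

```python
import random

def random_chromosome(rng, max_layers=10, max_neurons=1024):
    """Build one variable-length chromosome of (layer_no, neurons) genes."""
    n_layers = rng.randint(1, max_layers)
    return [(i, rng.randint(1, max_neurons)) for i in range(1, n_layers + 1)]

def init_population(size, rng):
    """Generate `size` random individuals of differing depth."""
    return [random_chromosome(rng) for _ in range(size)]

rng = random.Random(42)
population = init_population(5, rng)
for chromosome in population:
    print(len(chromosome), "layers:", chromosome)
```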
3.2 Selection
The second phase in GA is selection. In this phase, the fittest individuals of the population are selected and pass their genes to the next generation; fitter candidates get more chances of reproduction. Individuals may be selected in multiple ways, such as roulette wheel selection, rank-based selection, etc. Here, we use an elitism-based method followed by a fitness-proportionate selection method. Pairs of individuals (parents) are selected based on their fitness and the selection method. We choose the best-fitness individual as the first parent and randomly select another individual as the second, to keep genetic diversity in the population. Although various selection methods are possible, we use a simple one here to reduce computational complexity.
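One possible reading of this selection scheme is sketched below: the individual with the lowest loss becomes the first parent, and the second parent is drawn from the rest by roulette wheel. Weighting candidates by 1/loss is an assumption of this sketch (the fitness used later in the paper is a loss, so lower is better), and `select_parents` is a hypothetical name.

```python
import random

def select_parents(population, losses, rng, eps=1e-9):
    """Pick (elite, mate): elite has the lowest loss, mate is roulette-selected."""
    order = sorted(range(len(population)), key=lambda i: losses[i])
    elite_idx, rest = order[0], order[1:]
    # Lower loss means higher fitness, so weight each candidate by 1 / loss.
    weights = [1.0 / (losses[i] + eps) for i in rest]
    mate_idx = rng.choices(rest, weights=weights, k=1)[0]
    return population[elite_idx], population[mate_idx]

population = ["net_a", "net_b", "net_c", "net_d"]
losses = [0.9, 0.1, 0.5, 0.3]
print(select_parents(population, losses, random.Random(3)))
```

The random second parent is what preserves diversity; a purely elitist choice of both parents would tend to collapse the population onto one architecture.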
3.3 Crossover
Crossover is the mechanism used to generate the new population from the existing parents, and it is the most significant phase in genetic-algorithm-based evolution. There are many crossover methods in the literature, such as single-point crossover, two-point crossover, multi-point crossover, etc.; for simplicity, we chose single-point crossover for the experiment. For each pair of selected parents which have to produce offspring, a crossover point is chosen at a random position within the chromosome structure such that it is a valid, common crossover point for both parent chromosomes. Two offspring are generated by exchanging the genes of the parents, using the crossover point as the division and attachment point. An example is shown in Fig. 1.
Fig. 1 Single-point crossover mechanism
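A minimal sketch of single-point crossover on variable-length chromosomes follows. To keep the cut point valid for both parents, it is drawn within the shorter chromosome; renumbering the layer index after recombination is an assumption added so that children remain well-formed, and the helper names are hypothetical.

```python
import random

def renumber(genes):
    """Reassign consecutive layer numbers after genes have been recombined."""
    return [(i, neurons) for i, (_, neurons) in enumerate(genes, start=1)]

def single_point_crossover(parent_a, parent_b, rng):
    """Swap tails at one cut point that is valid in both parents."""
    shortest = min(len(parent_a), len(parent_b))
    if shortest < 2:
        # No interior cut point exists; return copies of the parents.
        return list(parent_a), list(parent_b)
    cut = rng.randint(1, shortest - 1)
    child_a = renumber(parent_a[:cut] + parent_b[cut:])
    child_b = renumber(parent_b[:cut] + parent_a[cut:])
    return child_a, child_b

a = [(1, 2), (2, 4), (3, 6), (4, 5), (5, 7)]
b = [(1, 64), (2, 32), (3, 16)]
child_a, child_b = single_point_crossover(a, b, random.Random(1))
print(child_a)
print(child_b)
```

Note that with parents of different length the children exchange lengths as well as genes, which is one way depth can vary across generations.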
3.4 Mutation
In randomly selected new offspring of the current population, some genes can be subjected to mutation with a defined probability. Altering genes in this way maintains diversity within the population, keeps the search away from local optima, and prevents premature convergence of the optimization process. We use the addition of a layer as the mutation mechanism in our approach, so that the search space for the best-suited DNN is not restricted to a fixed number of layers. Depending on the gene and chromosome structure chosen for the evolution, mutation can also take the form of
1. changing the number of neurons in a layer
2. changing the activation type in a layer
3. changing the drop percentage in a layer, etc.
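The layer-addition mutation can be sketched as follows. The 10% default rate matches the mutation probability reported in the results; the function name and the bound on neurons are assumptions for illustration.

```python
import random

def mutate_add_layer(chromosome, rng, rate=0.10, max_neurons=1024):
    """With probability `rate`, append one hidden-layer gene of random width."""
    if rng.random() >= rate:
        return list(chromosome)  # unchanged copy
    new_gene = (len(chromosome) + 1, rng.randint(1, max_neurons))
    return list(chromosome) + [new_gene]

child = [(1, 2), (2, 4)]
print(mutate_add_layer(child, random.Random(0), rate=1.0))
```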
3.5 Fitness Function
The fitness function is predefined and measures the fitness of an individual in the current population, i.e., the ability of the individual to compete with the other individuals of the population. The fitness function gives a fitness score to each individual in the population, and the likelihood that an individual is selected for reproduction is usually based on this score. Various fitness evaluation mechanisms are available in the literature, such as percentage loss, root mean square error, etc. Here, we use categorical cross-entropy loss as the loss/fitness function to compare the DNN models. The categorical cross-entropy loss is given by
$$L(y, z) = -\sum_{j=0}^{M} \sum_{i=0}^{N} y_{ij} \log(z_{ij})$$
Here, y denotes the actual values and z the predicted values, and the sum runs over the entire distribution.
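The loss above can be computed directly as a double sum. The sketch below assumes one-hot rows in y and probability rows in z, and clips predictions at a small `eps` to avoid log(0); both the clipping and the function name are assumptions of this example. Many libraries report the mean over samples rather than the sum used here.

```python
import math

def categorical_cross_entropy(y_true, z_pred, eps=1e-12):
    """Sum of -y * log(z) over all samples and classes."""
    total = 0.0
    for y_row, z_row in zip(y_true, z_pred):
        for y, z in zip(y_row, z_row):
            total -= y * math.log(max(z, eps))  # clip to avoid log(0)
    return total

y = [[1, 0, 0], [0, 1, 0]]                      # actual labels (one-hot)
z = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]]        # predicted probabilities
loss = categorical_cross_entropy(y, z)
print(loss)  # equals -(log 0.9 + log 0.7)
```

Only the predicted probability of the true class contributes to each sample's term, so a confident correct prediction drives the loss toward zero.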
4 Proposed Methodology
In the proposed methodology, we encode the DNN architecture into a chromosome and generate a random population. To determine the architecture of the DNN, we choose the hyperparameters [2] associated with each layer of the network, together with their probable values (to limit the search space). Generally, DNN architecture hyperparameters are determined by human intuition, previous experience, or trial and error. Table 1 lists some of the important hyperparameters associated with the deep neural network [3] and the probable values used for the experiments in this paper. A chromosome structure can be translated into a DNN architecture (phenotype). This paper uses a genetic algorithm to automatically determine the best-fit values of these hyperparameters for the given dataset. When a genetic algorithm searches for the best values of the randomly generated hyperparameters of a DNN architecture, the most crucial step is the encoding of the architecture in a gene and chromosome structure. In other words, the problem should be formulated in such a way that it is suitable for genetic-algorithm-based evolution. The variables used in this tuning process are the selected DNN hyperparameters, training hyperparameters, etc. Hyperparameters evolved to their best possible values generate the best DNN architecture according to the fitness function and should provide the highest classification and prediction accuracy for the given MNIST dataset, subject to a predetermined threshold.

Table 1 Hyperparameters for DNN
Parameter                        Value range
No. of hidden layers             (0–10)
No. of neurons at each layer     (0–1024)
Activation used at each layer    (Sigmoid, Softmax, ReLU)
Drop connection                  10, 20, 50
Optimizer                        (Adam, RMS, SGD)
Batch size                       128, 256, 512
No. of epochs                    10, 30, 50
4.1 Proposed Chromosome Structure
For the purposes of experimentation, a simple example-based encoding is used as a proof of concept, considering the compute and time available. A simple binary encoding is used for the gene and chromosome, from which the DNN architecture and hyperparameters are then represented (only for the selected hyperparameters). For example, the number of hidden layers and the number of neurons per hidden layer are selected. Once the chromosome is available, the hyperparameter values are extracted from the GA chromosome and the corresponding DNN architecture is constructed. Figure 2 shows a random DNN architecture in the defined chromosome format. The generic gene and chromosome structure is given below; for the experiment, only two parameters (layer number and number of neurons in the layer) are used:
gene_0 = (batch size, no. of epochs, optimizer)
gene_i = (hidden layer no., no. of neurons in this layer, activation, drop connection)
chromosome_i = [gene_1, gene_2, gene_3, ..., gene_i], i = 1, 2, 3, 4, ..., n
Fig. 2 Generic gene structure
For example, a chromosome with five layers having 2, 4, 6, 5, and 7 neurons, respectively, can be encoded as chromosome = [(1,2), (2,4), (3,6), (4,5), (5,7)].
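To make the genotype-to-phenotype step concrete, the sketch below decodes the example chromosome into the layer widths of a fully connected network. The input width of 784 and output width of 10 are assumptions that match MNIST (28 x 28 pixel images, ten digit classes); `decode` and `count_parameters` are hypothetical helper names.

```python
def decode(chromosome, n_inputs=784, n_outputs=10):
    """Translate a chromosome into the list of layer widths it encodes."""
    hidden = [neurons for _, neurons in sorted(chromosome)]
    return [n_inputs] + hidden + [n_outputs]

def count_parameters(widths):
    """Weights plus biases of a fully connected network with these widths."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(widths, widths[1:]))

chromosome = [(1, 2), (2, 4), (3, 6), (4, 5), (5, 7)]
widths = decode(chromosome)
print(widths)  # [784, 2, 4, 6, 5, 7, 10]
print(count_parameters(widths))
```

The list of widths is all a deep learning library needs to build the corresponding stack of dense layers, so the decoder stays independent of any particular framework.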
4.2 Proposed Framework
Various encoding methods can be used to construct the chromosome structure, and a suitable one can be selected depending on the number of hyperparameters chosen for evolution. We generate an initial population of five members, each with a random number of layers and a random number of neurons in each hidden layer. Each individual of the population is evaluated using the selected fitness function. If a member meets the predefined fitness criterion, the evaluation stops, as we have arrived at the desired structure; otherwise, the next generation is created. Parents are chosen by the selection method, and the crossover mechanism uses them to generate child individuals, which form the new population. Some individuals are mutated to address the global search issue. The resulting population is again evaluated member by member for fitness. If the number of generations is exhausted without the desired output criterion (accuracy or loss) being met, the search space is widened: a new random hidden-layer gene is added at the end of the chromosome to increase the depth of the DNN, and the evolution is restarted. The inserted gene may be any gene from the pool of available genes except the base gene (gene 0) (Fig. 2). Once an individual reaches the required fitness level, the GA is stopped and the best-fit individual is selected as the final DNN architecture.
Fig. 3 Chromosome structure
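The control flow of the framework in Sect. 4.2 can be sketched as a short loop. Training a DNN per candidate is out of scope for a small example, so `toy_loss` is a purely hypothetical stand-in for the validation loss; the population size of five and the 10% mutation rate follow the paper, while the generation budget, restart limit and neuron bound are assumptions.

```python
import random

def toy_loss(chromosome):
    """Hypothetical stand-in for validation loss (not a trained DNN)."""
    return 1.0 / (1.0 + sum(n for _, n in chromosome))

def random_gene(position, rng, max_neurons=64):
    return (position, rng.randint(1, max_neurons))

def evolve(loss_fn, rng, pop_size=5, generations=10, threshold=0.02,
           mutation_rate=0.10, max_restarts=20):
    population = [[random_gene(i + 1, rng) for i in range(rng.randint(1, 3))]
                  for _ in range(pop_size)]
    for _ in range(max_restarts):
        for _ in range(generations):
            ranked = sorted(population, key=loss_fn)
            best = ranked[0]
            if loss_fn(best) <= threshold:
                return best                      # fitness criterion met
            children = [best]                    # elitism: keep the best
            while len(children) < pop_size:
                mate = rng.choice(ranked[1:])    # random second parent
                cut = rng.randint(1, min(len(best), len(mate)))
                child = best[:cut] + mate[cut:]  # single-point crossover
                if rng.random() < mutation_rate:
                    child = child + [random_gene(0, rng)]  # add a layer
                children.append([(i + 1, n) for i, (_, n) in enumerate(child)])
            population = children
        # Generation budget spent: widen the search by deepening every member.
        population = [c + [random_gene(len(c) + 1, rng)] for c in population]
    return min(population, key=loss_fn)

best = evolve(toy_loss, random.Random(0))
print(best, toy_loss(best))
```

In a real run, `loss_fn` would decode the chromosome, train the network for the configured epochs and return the validation loss, which dominates the cost of the whole procedure.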
4.3 Fitness Evaluation
The fitness of an individual depends on the particular aspect of the problem that we are trying to optimize. For a simple demonstration, we consider the MNIST dataset, and the fitness function is the quality of the prediction measured by the categorical cross-entropy function. For the best DNN structure, the accuracy should be high and the log loss should approach zero. We therefore minimize the loss until it falls below a defined acceptable threshold. We use a softmax layer at the output and measure the categorical cross-entropy loss to select the best candidate in the population.
5 Results
For the experiment, we used a simple chromosome structure with only the number of layers and the number of neurons per layer as the hyperparameters to optimize. We used MNIST, one of the most widely used datasets in DNN experiments, and evolved the DNN structure for classification accuracy. We measure the categorical cross-entropy loss on the validation subset of the dataset and stop generating candidates when the validation loss reaches the user-defined threshold. The results show that the accuracy of the evolved DNN architecture is very high, with the validation loss approaching zero within just a few generations of evolution, yielding an automatically generated DNN architecture that can be used for this problem. We used an initial population of five candidates, each with a random number of layers, a mutation probability of 10%, 30 training epochs, a batch size of 256, and ReLU as the default activation for all hidden layers; the output layer uses softmax activation. We used single-point crossover to generate children from parent candidates, and the mutation mechanism was the addition of a layer with a random number of neurons, since deeper networks generally have better abstraction capabilities. The chromosome length is variable so that deeper network architectures can be covered if needed. As can be seen from the final loss values in Fig. 4, the final candidate achieved a log loss near 0.02, which indicates a very good architecture for the given MNIST classification problem.
Fig. 4 Log loss values
6 Conclusion
In this paper, we have presented a generic framework for the evolution of a DNN, together with its chromosome structure, which can be used to find the best model for a given problem using evolutionary genetic algorithms. We have shown a sample chromosome structure that evolves the DNN using only two basic parameters, the layer number and the number of neurons in the layer, for the MNIST classification problem. The results show that even with few hyperparameters in the gene, very good validation loss values (categorical cross-entropy loss, or log loss) can be achieved. Hence, this approach is a promising step toward evolving DNN architectures for various types of problems. The number of hyperparameters selected and their values can be problem specific, and various encoding methods can be used to construct the chromosome structure. Given the problem dataset, the algorithm can generate a near-optimum DNN architecture. The proposed chromosome encoding and evolution method can also be used for the evolution of CNNs and RNNs.
References
1. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
2. A. Al-hyari, S. Areibi, Design space exploration of convolutional neural networks based on evolutionary algorithms (2017)
3. A. Bhandare, D. Kaur, Designing convolutional neural network architecture using genetic algorithms, in International Conference on Artificial Intelligence (ICAI’18), Department of EECS, The University of Toledo, Toledo, OH, USA
4. Y. LeCun, C. Cortes, MNIST Handwritten Digit Database (2010)
Online Economy on the Move: The Future of Blockchain in the Modern Banking System
Anushree A. Avasthi
Abstract The Indian banking system is a complex hierarchical structure engaged in financial intermediation. In its role as a financial intermediary, it performs several transactions like cash deposits, cash withdrawals and investments on a daily basis and is an important institution to assess a nation’s economic growth. However, with the constantly growing amount of data and the increasing susceptibility of banks to online fraud, banks now need to shift to a more robust, efficient and secure platform to carry out their functions. One such platform is provided by the blockchain structure. Blockchain is a decentralized, public ledger system which has gained immense popularity and recognition for its efficiency, security and transparency since its inception in 2008. This paper aims to make required changes in the blockchain model prevalent today to make it relevant for use in the banking system and evaluates the advantages of the blockchain system over the current system employed in consortium banking and payments through an assessment framework. This paper also tries to devise types of blockchain platforms which can encompass most of the functionalities of the banking system existing today. The paper is divided into three parts. The first part aims to devise a suitable consensus protocol which can be used to make blockchain suitable for banks, the second part discusses types of blockchain platforms applicable in the banking sector and the third part employs use cases in payments and consortium banking to demonstrate the advantages of blockchain.
A. A. Avasthi (B) Computer Science Department, Shiv Nadar University, Greater Noida, India e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_10
1 Introduction
1.1 Banking System Prevalent in India Today
The banking system prevalent in India today comprises complex financial intermediation, with several functions and constraints on every transaction undertaken. The system involves sharing and modifying information at different levels, both manually and digitally. Moreover, the Indian banking system has a complex functional hierarchy in which each player performs a different role and satisfies the needs of different target groups [1, 2]. The structural hierarchy of the Indian banking system is briefly explained in Fig. 1.
Fig. 1 Hierarchy in the banking system of India
The banking system plays a vital role in boosting the economy, ensuring secure and authentic transactions of funds, and ensuring safe and reliable investments. Banks also provide various utility functions and play a vital role in safe-keeping a person’s assets [2]. Banks’ functions range from ensuring proper transactions, to keeping assets safe, to giving out loans, to ensuring social welfare. Banks also act as the point through which transactions between different banks, people and countries are carried out. Banks are thus the backbone of a nation’s economic strength and growth [3]. A few important functions of a bank are depicted in Fig. 2. However, this hierarchy in the banking structure currently relies heavily on manual bookkeeping and on information storage on single servers with limited or no backups. With the amount of data growing continuously, the current mechanism for storing data is likely to fail in the near future. Moreover, the complexity of the hierarchical structure makes banks susceptible to fraud and tampering of information at different levels of the hierarchy. The following subsection briefly discusses the problems faced by banking systems today.
1.2 Problems Faced by the Banking Systems Today
Banks have been the centre of economic growth for many years. However, the data held by banks is increasing because of more efficient intercommunication between banks, digitization that eases transactions, and growing public awareness of bank accounts driven by government concessions and incentives. As a result, banking systems are now susceptible to new and more threatening problems [4].
Fig. 2 Functions of banks in India
Some of the problems the banking system faces due to the growing data are as follows:
1. Increasing cost of operations.
2. Increasing susceptibility to security attacks on centralized servers.
3. Ensuring transparency when data has to be shared with several parties is a challenge.
4. Crisis in management.
5. Growing data and management has made banks susceptible to fraudulent transactions.
6. Difficulty in maintaining backup servers and databases, causing a threat of loss of information in case of security attacks or hacks.
A primary cause of these problems is the extensive manual processing and the corresponding documentation required at each stage of a transaction. This not only duplicates data but also increases the time required to complete a transaction and compromises the efficiency of the system. A possible solution to the growing problems faced by banking systems is the implementation of blockchain in banks.
1.3 Blockchain
Blockchain was introduced in 2008 by a person or group working under the pseudonym Satoshi Nakamoto [5]. Nakamoto’s consensus protocol underlies the blockchain system, which has attracted various IT companies and industrialists to explore conducting transactions on a public ledger for transparency, efficiency and security. Currently, blockchain is applied mainly in online currency systems, for example in transactions with bitcoins, ethers, etc. However, the idea behind blockchain and its implementation are being studied in various countries with a view to applying it in the banking system. Shifting to blockchain can help banks deal with the problems of scalability, throughput, speed and security that they face today [6, 7]. The basic functioning of the blockchain system is shown in Fig. 3. In a blockchain, various participants agree to take part in a chain of transparent transactions. Once a participant (or node) initiates a transaction, a request to proceed is sent to all the other participants. Based on the consensus protocol used, the other participants validate the transaction, which is then added as a block to the public ledger (a hash pointer to which is available at all nodes). In this way, blocks of validated and verified transactions are added to a chain to form a blockchain. All transactions can be viewed at any point by any node in the system, but they cannot be changed because blocks are immutable. One major advantage of a public ledger is that it helps to solve the problem of double spending, as illustrated in Fig. 4.
Fig. 3 Blockchain model
Fig. 4 Problem of double spending
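The hash-pointer chaining and the double-spending check described in Sect. 1.3 can be illustrated with a small single-process sketch. The `Ledger` class, its field names and the use of SHA-256 over a JSON encoding are assumptions for illustration; as a simplification, each coin identifier may be spent only once, and no consensus or networking is modelled.

```python
import hashlib
import json

def block_hash(block):
    """Deterministic SHA-256 digest of a block's contents."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

class Ledger:
    def __init__(self):
        self.chain = [{"index": 0, "prev": "0" * 64, "tx": None}]  # genesis
        self.spent = set()

    def add_transaction(self, coin_id, sender, receiver):
        """Append a block unless the coin was already spent."""
        if coin_id in self.spent:
            return False  # double spend rejected
        block = {"index": len(self.chain),
                 "prev": block_hash(self.chain[-1]),  # hash pointer
                 "tx": {"coin": coin_id, "from": sender, "to": receiver}}
        self.chain.append(block)
        self.spent.add(coin_id)
        return True

    def is_valid(self):
        """Every block must point at the hash of its predecessor."""
        return all(cur["prev"] == block_hash(prev)
                   for prev, cur in zip(self.chain, self.chain[1:]))

ledger = Ledger()
print(ledger.add_transaction("c1", "alice", "bob"))    # accepted
print(ledger.add_transaction("c1", "alice", "carol"))  # rejected: already spent
print(ledger.is_valid())
```

Because each block stores the hash of the one before it, altering an earlier transaction changes that hash and invalidates every later pointer, which is the sense in which the blocks are immutable.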
An important feature of the blockchain system is the consensus protocols which are essential to validate and add a transaction as a block to the blockchain. The following subsection discusses various consensus protocols currently used on the blockchain platforms.
1.4 Blockchain Consensus Protocols
Blockchain is a distributed-ledger-based system that requires consensus protocols to validate transactions in a chain. Two prominent consensus protocols currently used in blockchain are proof of work (PoW) and proof of stake (PoS). In proof of work, the mining and validation of blocks, i.e. transactions, depend on the computation speed of a node. A node must solve a cryptographic puzzle in order to mine a block. The node that solves the puzzle fastest validates the transaction and is rewarded with a transaction fee [8]. In proof of stake, a participant validates blocks based on the coins (in our case, currency or assets) he or she holds: the more coins a participant possesses, the more power that participant is given in the chain. This removes the need for huge computing power to validate blocks and gives the power to validate a transaction to the member with the highest stake in the chain [9].
Objective
This paper builds on the concepts of blockchain and extends its application to the banking system in order to increase the system’s efficiency, security and transparency.
It also explores alternate consensus protocols to increase the performance of the proposed system. Hence, the study was carried out with the following objectives:
1. To study the changes that can be made to blockchain to make it relevant to the Indian banking system.
2. To devise different types of blockchain platforms to integrate most functionalities of a bank.
3. To create an assessment framework to assess the parameters and to decide the feasibility of moving from the conventional banking system to a blockchain platform.
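The two protocols described in Sect. 1.4 can be contrasted with a minimal sketch. The proof of work puzzle is modelled as finding a nonce whose SHA-256 digest starts with a given number of zero hex digits, and proof of stake as choosing a validator with probability proportional to stake; both formulations, the function names and the difficulty values are assumptions for illustration rather than the paper's implementation.

```python
import hashlib
import random

def proof_of_work(block_data, difficulty=3):
    """Find a nonce whose SHA-256 digest starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

def pick_validator(stakes, rng):
    """Choose a validator with probability proportional to its stake."""
    names = list(stakes)
    return rng.choices(names, weights=[stakes[n] for n in names], k=1)[0]

nonce, digest = proof_of_work("alice pays bob 5", difficulty=3)
print(nonce, digest)
print(pick_validator({"bank_a": 50, "bank_b": 30, "bank_c": 20}, random.Random(0)))
```

Each extra zero digit multiplies the expected number of hash attempts by 16, which is the source of the computing cost attributed to proof of work, whereas the stake-based choice needs a single random draw.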
2 Methodology Adopted
2.1 Designing Two-Hop Blockchain Consensus Model
The consensus protocols applied in blockchain systems so far are not suitable for the banking system, since they do not verify the authority and identity of the members. After studying various protocols, it was concluded that proof of authority (PoA), which validates blocks based on the true identity of the miner as established by government-issued identity cards, should be used along with proof of work in alternate cycles to validate blocks (transactions). The primary function of the proof of work cycle is to choose a miner (i.e. the member who solves the cryptographic puzzle first), and the subsequent proof of authority cycle verifies the identity of the miner selected in the PoW round. Hence, a two-hop model combining PoW and PoA was devised and is suggested to validate transactions and increase the efficiency of the system. The advantage of the two-hop model over a single consensus protocol is that proof of work on its own requires a significant amount of computing power and wastes energy, whereas employing proof of work only in alternate cycles reduces the computing load. Moreover, banks need to ensure that a transaction is authentic, and proof of authority requires the miner to provide his/her actual identity on the network, which makes blockchain technology suitable for implementation in the banking system. The structure of the two-hop model was devised and its cycles are explained in order to suggest a more efficient, faster and more authentic protocol for use in banks.
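A rough sketch of one PoW cycle followed by one PoA cycle is given below. It assumes that all miners hash at the same rate, so the miner needing the fewest nonce attempts "solves the puzzle first", and it models the identity check as a lookup in a registry of verified identities; `attempts_to_solve`, `two_hop_validate` and the registry format are hypothetical names introduced for this example.

```python
import hashlib

def attempts_to_solve(miner, tx, difficulty=2):
    """Number of nonces a miner tries before its digest has the zero prefix."""
    target = "0" * difficulty
    nonce = 0
    while not hashlib.sha256(
            f"{miner}|{tx}|{nonce}".encode()).hexdigest().startswith(target):
        nonce += 1
    return nonce

def two_hop_validate(tx, miners, verified_ids):
    """Hop 1 (PoW) picks a miner; hop 2 (PoA) checks that miner's identity."""
    winner = min(miners, key=lambda m: attempts_to_solve(m, tx))
    if winner not in verified_ids:
        return None  # identity check failed: the block is not added
    return {"tx": tx, "miner": winner, "identity": verified_ids[winner]}

registry = {"bank_a": "ID-001", "bank_b": "ID-002"}  # hypothetical verified IDs
print(two_hop_validate("pay 10", ["bank_a", "bank_b"], registry))
```

Separating the two hops keeps the identity registry out of the puzzle itself: the PoW round only decides who proposes the block, and the PoA round decides whether that proposer is allowed to.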
2.2 Types of Blockchain Platforms Applicable to the Banking System
Any transaction can be broadly categorized into one of three frameworks based on the degree of centralization and the participants involved:
1. Type 1: Degree of centralization: Decentralized. Participants: Open to isolated, independent transactions that do not require the involvement of banks.
2. Type 2: Degree of centralization: Multi-centralized. Participants: Multiple banks and investors.
3. Type 3: Degree of centralization: Centralized. Participants: Banks and associated clients.
This categorization covers all types of transactions, whether or not they are undertaken by a bank. The Type 1 framework uses the public blockchain, as used for cryptocurrencies today. The Type 2 framework is useful for heavy and expensive multi-party transactions that require funding and inputs from various members and need to be overseen by several banks. The Type 3 framework works like the contemporary banking system, in which all information and the authority to complete a transaction rest with a bank. Based on these three frameworks, three different blockchain platforms were developed and compared on various parameters such as degree of centralization, type of participants, consensus protocol used, maintenance of records, advantages, and incentives provided for participation in the chain.
2.3 Assessment of the Blockchain Platform
The advantages of blockchain over the current system prevalent in banks were evaluated against various assessment criteria, and this framework was used to compare the current systems employed in consortium banking and payments in order to evaluate the applicability of blockchain to these banking functions. The assessment criteria used are as follows:
Assessment Criteria:
1. Intermediary
Assessment factors:
• Latency (delay) due to the involvement of intermediary.
• Lack of trust amongst parties.
• Cost incurred due to intermediary.
• Efficiency of intermediary.
How does blockchain perform?
• Latency is reduced, as no intermediary is required and all transactions are real time.
• Since blockchain is a public ledger, all vital information pertaining to all the members in the network is available, thus ensuring trust amongst them.
• Absence of an intermediary removes its cost.
• Efficiency of the intermediary is not an issue, since no intermediary is involved in the process.
Conclusion: Since blockchain is a distributed ledger, it reduces the latency and cost of the intermediary involved. It is a public ledger and ensures transparency. The blockchain system primarily does not involve an intermediary, and the trust between the parties is maintained by transparency and authentic bookkeeping through smart contracts.
2. Transparency
Assessment factors:
• Are there multiple participants involved?
• Does the system require transparency for all participants?
• Is there a method to decide who gets to view what information?
How does blockchain perform?
• Blockchain is fit for use even when multiple participants are involved.
• Blockchain generally provides transparency for all nodes in the network.
• Based on the type of blockchain platform selected for the network, the level of transparency can be adjusted and implemented.
Conclusion: The blocks (validated transactions) added to the chain can be easily viewed by all members of the chain, thus ensuring transparency and trust. The hash/pointers of the records written on the blockchain are irreversible and immutable, which eliminates the possibility of fraudulent modification.
3. Information storage
Assessment factors:
• Is data stored at multiple locations and has backups?
• Is the data stored consistent?
• Is data restoration troublesome in case of a hazard?
How does blockchain perform?
• Data is stored at multiple locations, and all nodes have a hash pointer to the last block added to the chain.
• Data stored is consistent and immutable.
• Data can be easily restored from backups in case of loss of information from one source.
Conclusion: Blockchain’s distributed ledger and consensus mechanism help ensure data consistency across multiple participants. Thus, data can be restored from different sources in case of a hazard and does not rely on information storage at a central server.
4. Manual processing
Assessment factors:
• Is cost of reconciliation high?
• Is documentation absolutely necessary?
• At how many levels is the same document maintained?
How does blockchain perform?
• Blockchain has an in-built feature of developing smart contracts for transactions; thus, the cost of reconciliation is not high.
• Blockchain maintains the smart contract at every stage of a transaction, and thus, manual documentation is not necessary.
• Blockchain keeps the smart contract viewable by all members of the network but immutable, to avoid fraudulent changes.
Conclusion: Blockchain maintains an automated audit trail of transactions, thus eliminating the need for manual processing for data validation.
5. Trust
Assessment factors:
• Is there trust amongst the participants in the chain?
• Is there a risk of fraudulent transactions?
• How many banks and participants are communicating with each other during the transaction?
How does blockchain perform?
• Most of the important information about all the members in the network is available.
• Transactions are immutable, which reduces the possibility of fraud.
• Multiple banks and participants can communicate on a blockchain platform.
Conclusion: Smart contracts are used for codification of business rules, reconciliation and validation, thus reducing manual processing.
6. Time sensitivity
Assessment factors:
• Do the transactions need to be real time?
How does blockchain perform?
• All transactions are real time.
A. A. Avasthi
Conclusion: Blockchain employs real-time settlement of records, thus improving customer experience, reducing risks and enabling faster turnaround time of transactions.
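The immutability claimed under the transparency and information storage factors rests on hash pointers: each block stores the hash of its predecessor, so editing an old record breaks every later link. The following is a minimal sketch of that idea only, not the platform assessed in this paper; the block layout, the all-zero placeholder hash for the first block, and the function names are assumptions made for illustration.

```python
import hashlib
import json


def block_hash(index, data, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    """Link records so that each block stores the hash of its predecessor."""
    chain = []
    prev_hash = "0" * 64  # assumed placeholder hash for the first block
    for index, data in enumerate(records):
        current = block_hash(index, data, prev_hash)
        chain.append({"index": index, "data": data, "prev": prev_hash, "hash": current})
        prev_hash = current
    return chain


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    prev_hash = "0" * 64
    for block in chain:
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev"]):
            return False
        prev_hash = block["hash"]
    return True


ledger = build_chain(["A pays B 10", "B pays C 4", "C pays A 1"])
tampered = [dict(block) for block in ledger]
tampered[1]["data"] = "B pays C 400"  # edit a record after the fact
```

Calling `is_valid(ledger)` accepts the untouched chain, while `is_valid(tampered)` rejects the edited copy because the stored hash of the second block no longer matches its contents.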
3 Results and Discussion

3.1 Designing Two-Hop Blockchain Consensus Model
Different consensus protocols were studied and compared. PoW, PoA and PoS blocks were run for 100 iterations, and the following results were obtained:
1. In Proof of Work: Simulation is carried out on three servers with varying resources using SimBlock to simulate a proof of work environment. The simulation was done on a private blockchain environment. In this research, synthetic transactions are used to generate 100 transactions with 100 wallets with public key and private key using Diffie–Hellman key exchange. Average runtime was 5.6 s.
2. In Proof of Stake: A basic proof of stake cycle was coded, and the average execution time over 100 rounds was calculated as 3.6 s.
3. In Proof of Authority: A basic proof of authority cycle was coded, and the average execution time over 100 rounds was calculated as 2.8 s (Figs. 5, 6 and 7).

Fig. 5 Average runtime from SimBlock
Fig. 6 Sample output format from PoS code
Fig. 7 Sample output format from PoA code
Proof of authority is more efficient than proof of stake and is preferred over PoS in the two-hop model for the following reasons:
1. PoA has a smaller average runtime.
2. PoA does not require a transaction fee.
3. PoA removes the problem of anonymity on the Web, which banks require, since validators have known identities.
4. PoS means having the chain controlled by the most powerful entity in the chain, which is not desirable.
The two-hop model combining PoW and PoA is depicted in Fig. 8. As the figure shows, a suitable miner is first selected by the proof of work cycle (the member that solves a cryptographic puzzle first), and the identity of the selected miner is then validated using the proof of authority consensus method in the next cycle. After the completion of one cycle of PoW and one cycle of PoA, the transaction is added as a block to the chain. The same steps are followed to validate and add further transactions.
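The two hops described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the code used for the reported runtimes: the `AUTHORIZED` set, the `DIFFICULTY` target, the nonce limit, and the rule that the member needing the fewest attempts "solves first" are all hypothetical stand-ins.

```python
import hashlib

# Hypothetical set of identities approved by the network (assumption).
AUTHORIZED = {"bank_a", "bank_b"}
DIFFICULTY = 3  # number of leading zero hex digits required (assumption)


def solve_puzzle(member, payload, max_nonce=200_000):
    """Hop 1 (PoW): search for a nonce whose hash meets the difficulty target."""
    for nonce in range(max_nonce):
        digest = hashlib.sha256(f"{member}|{payload}|{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce
    return None


def select_miner(members, payload):
    """Pick the member that needs the fewest attempts, standing in for 'solves first'."""
    attempts = {member: solve_puzzle(member, payload) for member in members}
    solved = {member: nonce for member, nonce in attempts.items() if nonce is not None}
    return min(solved, key=solved.get) if solved else None


def two_hop_add(chain, members, transaction):
    """Append the transaction only if the PoW winner also passes the PoA check."""
    miner = select_miner(members, transaction)
    if miner is None or miner not in AUTHORIZED:  # Hop 2 (PoA): identity validation
        return False
    chain.append({"tx": transaction, "miner": miner})
    return True
```

In this sketch a transaction whose PoW winner fails the authority check is simply rejected; a fuller design would presumably fall back to the next eligible miner.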
Fig. 8 Two-hop blockchain model combining PoW and PoA
Proof of stake and proof of authority can also be used in alternate cycles, in which case the miner is chosen as the member with the maximum stake in the chain. With this more reliable consensus protocol, blockchain can be extended to the banking system.
3.2 Types of Blockchain Platforms Applicable to the Banking System
Three blockchain platforms, namely public blockchain, consortium blockchain and private blockchain, were compared on the basis of various parameters [10], and the results are summarized in Table 1.

Public blockchain is the type of blockchain system currently in use wherein no banks are involved and all transactions can be viewed by all the members of the chain. In some public blockchains, it is also possible to view the assets of other members in the network. This type of blockchain generally uses the proof of work consensus protocol for validation of blocks, and thus, the verification of the true identity of members remains a major point of contention in its use.

In consortium banking, two or more banks come together to take part in transactions. This type of blockchain is mainly used when various influential parties having alliances with different banks come together to fund a project. All participating banks in this case come together to form a network. Blockchain helps ensure transparency between the banks to avoid the possibility of fraud and to eliminate distrust amongst the participants.

A private blockchain essentially behaves like a bank. All information is held by the banks, which act as the central authority. The information is generally partially disclosed or not disclosed at all (Table 1).

Table 1 Assessment and comparison of different types of blockchain platforms
Parameter | Public blockchain | Consortium blockchain | Private blockchain
Degree of centralization | Decentralized | Multi-centralized | Centralized
Participants | Public without the need of involvement of banks | Specific group of investors and banks forming an alliance | A central authority to manage and proctor all transactions
Consensus protocol (to validate transactions) | Proof of work, or proof of work and proof of authority in alternate cycles | Proof of stake, or proof of stake and proof of authority in alternate cycles | Specific to the needs of the chain and the discretion of the central authority
Records maintained by | All participants of the chain have a personal copy of the records | Decided by the members who choose a leader bank for bookkeeping | The central authority involved
Advantage | Self-established credit and transparent | Efficient and reduces costs | Traceable and easy to establish
Prominent incentive | Needed; members receive a fraction of the transaction fee for the transactions they validate on the chain. Third-party members not involved in the transaction validate the transaction | Optional and primarily based on the nature of partnership and funding involved | Not needed and it primarily works like a normal banking system

Hence, upon framework analysis, it can be concluded that most of the primary functions of the bank can now be shifted to any of the types of blockchain listed in Table 1.
3.3 Assessment of the Blockchain Platform
Based on the proposed assessment criteria, blockchain can be implemented in the banking sector through the use cases briefly explained below. Out of a bank's functionalities, consortium banking and payments have been chosen for implementation of blockchain since they are crucial to a bank's functioning and are more susceptible to fraud and security breaches. Moreover, these functionalities of a bank also face the problem of fast-growing data and decreasing efficiency. Thus, the assessment framework is applied to consortium banking and payments. The results of the assessment are as follows:

Use cases:
1. Consortium banking: Consortium banking involves multiple financially sound and trusted parties coming together to fund and invest in a project. This branch of banking can involve multiple participants as well as multiple banks. However, trust and authentic transactions remain a big hurdle in consortium banking, especially due to the extensive documentation and manual processing involved [11]. Moreover, recent incidents of fraud have raised a major concern about security and trust in consortium banking. Blockchain can provide an appropriate solution to this problem (Table 2). Based on the assessment criteria chosen, we can conclude the following about implementation of consortium blockchain in consortium banking (Table 3): blockchain has an advantage over the current system in five of the six cases, and thus, it is useful to shift consortium banking onto the blockchain system.

Table 2 Comparison of current state of consortium banking and how blockchain can help
Current problems | How can blockchain help?
Time-consuming process: selection of suitable members and formation of a financially sound group can be a time-consuming and tedious job | Automated selection criteria for syndicate formation in programmable smart contracts
High intermediary fees: agents needed at high fees to administer the process | The entire process is automated and transparent and does not require intermediaries
Manual processing: manual processing of documents and agreements is a time-consuming process and leaves a lot of scope for fraud and tampering of contracts | Agreements, contracts, terms and conditions documents, etc., are digitized on the blockchain and validations and checks are automated
Documentation and efforts: manual processing can lead to errors and documentation duplication | Immutability feature of the blockchain eliminates need for multiple copies of the same documents being held
Transparency and consensus | Blockchain ensures that every member is able to view the activities of the chain and is able to contribute
Delayed settlement cycles | Blockchain can facilitate near real-time loan funding and payment settlements using smart contracts

Table 3 Evaluation of assessment framework on consortium banking
Factor | Is it a problem in the current system employed? | Will blockchain be able to help?
Intermediary | Yes, an intermediary is currently used to store and pass information between the participating members | Yes, since blockchain does not need the involvement of an intermediary
Transparency | Yes, due to the presence of a leader bank there is a possibility that some information may be concealed from other participants | Yes, since blockchain ensures transparency for all the members in a network
Information storage | No, since bookkeeping is done by the leader bank decided by the group | No, not required
Manual processing | Yes, due to the hierarchical structure of banks, heavy manual documentation is required at each level | Yes, blockchain maintains codified smart contracts thus reducing manual processing
Trust | Yes, since there is a possibility of concealing information and tampering with documentation, it is difficult to maintain trust amongst the participants | Yes, since all vital information about all members in the chain is visible, blockchain ensures trust amongst the participants
Time sensitivity | Yes, however, it depends on the type of transaction | Yes, all transactions on blockchain are real time

2. Payments: Payment processes are difficult to track despite the introduction of KYC, especially when they involve transfer of money from one bank to another. Blockchain will not only help automate the process but will also ensure transparency in the process (Table 4). Based on the assessment criteria chosen, we can conclude the following about implementation of private blockchain in banking (Table 5): blockchain has an advantage over the current system in four of the six cases, and thus, it is useful to shift to blockchain for payments.

Table 4 Comparison of current state of payments and how blockchain can help
Current problems | Can blockchain help?
Manual documentation | Automated documentation
Time-consuming process | Real-time settlement of transactions
Transparency | Every member would know what stage the payment has reached in the chain
Inability to track throughout the process | Real-time checking of the process to avoid fraud
Table 5 Evaluation of assessment framework on payments
Factor | Is it a problem in the current system employed? | Will blockchain be able to help?
Intermediary | No, since daily payments and withdrawals do not require intermediary | No, not required
Transparency | Yes, transparency is essential to enable clients track the progress of payments or withdrawals | Yes, blockchain is a public ledger; thus, all vital information is visible to all the members in the network
Information storage | No, information can be stored with the banks of personal accounts | No, not required
Manual processing | Yes, there is lot of manual processing involved | Yes, blockchain keeps a track of all transactions
Trust | Yes, people need to trust their banks and be assured that their money is in safe hands | Yes, all transactions on blockchain are codified and immutable; thus, people can be assured that no tampering can be done with the data or with their assets
Time sensitivity | Yes, transactions need to be fast | Yes, transactions on blockchain are real time
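The "five of six" and "four of six" counts quoted for the two use cases can be checked by tallying the "Will blockchain be able to help?" column of Tables 3 and 5. The snippet below is only a bookkeeping sketch; the dictionary names and the `tally` helper are hypothetical, and the Yes/No values are transcribed from the tables.

```python
# "Will blockchain be able to help?" answers transcribed from Table 3.
TABLE_3_CONSORTIUM = {
    "Intermediary": True,
    "Transparency": True,
    "Information storage": False,
    "Manual processing": True,
    "Trust": True,
    "Time sensitivity": True,
}

# The same column transcribed from Table 5.
TABLE_5_PAYMENTS = {
    "Intermediary": False,
    "Transparency": True,
    "Information storage": False,
    "Manual processing": True,
    "Trust": True,
    "Time sensitivity": True,
}


def tally(assessment):
    """Return (factors where blockchain helps, total factors assessed)."""
    helped = sum(1 for answer in assessment.values() if answer)
    return helped, len(assessment)
```

`tally(TABLE_3_CONSORTIUM)` gives 5 of 6 and `tally(TABLE_5_PAYMENTS)` gives 4 of 6, matching the counts stated in the text.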
4 Conclusion
It can be concluded from the paper that the growing problem of information storage and security faced by banks today can be addressed by implementing blockchain in banks. It can also be concluded that by adopting proof of authority and proof of work in alternate cycles as the chosen consensus protocol, blockchain can be made fit for use in the banking system. Used together, proof of work and proof of authority offer increased efficiency and security, which makes them feasible for use in banks. Moreover, based on the assessment framework devised, it can be concluded that blockchain is advantageous to implement in consortium banking and payments.

5 Future Work
Future work includes applying the assessment framework to more functionalities of the banking system and developing a more efficient consensus protocol for the blockchain system to further reduce the computational power needed.
References
1. S. Chand, Indian Banking System: Structure and other Details (with Diagrams). Available at http://www.yourarticlelibrary.com/banking/indian-banking-system-structure-andother-details-with-diagrams/23495 (2006)
2. O.P. Agarwal, Modern Banking of India (Himalaya Publishing House, 2008)
3. R. Feldman, Top 4 Digital Transformation Challenges Banks Face. Available at https://www.compucom.com/blog/top-4-digital-transformation-challenges-banks-face (2018)
4. S. Nakamoto, Bitcoin: A Peer-to-Peer Electronic Cash System. Available at https://bitcoin.org/bitcoin.pdf (2008)
5. M. Teing, What is Blockchain Technology? Available at https://blockgeeks.com/guides/whatis-blockchain-technology/ (2019)
6. L. Fan, H.-S. Zhou, A scalable proof of stake blockchain in the open setting, in Eurocrypt 2018 (2018)
7. T. Duong, L. Fang, H.-S. Zhou, 2 Hop Blockchain: Combining Proof of Work and Proof of Stake Securely. Available at https://eprint.iacr.org/2016/716.pdf (2016)
8. S. Khalil, R. Masood, M.A. Shibli, Use of bitcoin for internet trade, in Encyclopedia of Information Science and Technology, 4th edn (IGI Global, 2018), pp. 2869–2880. Web 19 Apr. 2019. https://doi.org/10.4018/978-1-5225-2255-3.ch251
9. T. Zhang, H. Pota, C.C. Chu, R. Gadh, Real-time renewable energy incentive system for electric vehicles using prioritization and cryptocurrency. Appl. Energy 226, 582–594 (2018)
10. J. Kagan, Available at www.investopedia.com/terms/c/consortium-bank.asp (2018)
A Novel Framework for Distributed Stream Processing and Analysis of Twitter Data
Shruti Arora and Rinkle Rani

Abstract Data is ubiquitously generated each microsecond from heterogeneous sources. These sources are connected devices in the form of mobiles, telemetry machines, sensors, clickstreams, weather forecasting devices, and many more. Social media platforms are no less a firehose of streaming data. Every microsecond, millions of tweets on Twitter and posts and uploads on Facebook contribute to the volume of data. Twitter's most important characteristic is that it lets users tweet about events, situations, feelings, opinions, or even something totally new, in real time. The motivation of the work arises from Twitter's easily available API for capturing tweets and from earlier research work on data analytics of streaming social data. Many workflows have been developed to handle Twitter data streams, but the research area of streaming data is still in its infancy. This study attempts to develop a real-time framework for processing, transformation, storage, and visualization of Twitter data streams. The framework integrates open-source technologies to realize the proposed architecture: Apache Kafka for ingestion of tweets, Apache Spark Streaming for data stream processing, and the Cassandra NoSQL database for storage of tweets. The work aims to analyze the ingestion and processing time of the tweets in real time using the proposed integrated framework. The proposed framework shows the minimum average scheduling delay and processing time for different Twitter data modules such as print tweets, average tweet length, and most trending hashtag.

Keywords Streaming · Resilient distributed dataset · DStream · FlatMap
S. Arora (B) · R. Rani Department of Computer Science and Technology, Thapar Institute of Engineering and Technology, Patiala, India e-mail: [email protected] R. Rani e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_11
1 Introduction
The proliferation of networking around the world has produced a tremendous volume of data. Social networking services such as Twitter, Instagram, and Facebook gave rise to the research area of big data. Big data is big not only in size, as the name suggests, but also in terms of its 7V's characteristics: Volume, Velocity, Variety, Variability, Veracity, Visualization, and Value. The term streaming data is not new either. Big data is a natural product of advanced digital artifacts and their applications. Sensors and social media networks are examples of present-day digital technologies that have permeated our daily lives. The prevalence of these technologies in everyday life has raised human-to-human, human-to-machine, and machine-to-machine communication to unprecedented levels, yielding the enormous volumes of data known as big data [1]. Streaming data is a variant of big data in which the data is in a continuously flowing state. In general terms, stream processing refers to data transformations performed over an unbounded set of data. More realistically, once a stream processing system is turned on, there is no fixed end: it keeps processing data as it arrives until the job is turned off because it is no longer needed. The traditionally available frameworks and techniques for handling big data were developed to suit the batch requirements of data processing. In contrast to traditional database models, where large volumes of data are first stored and only afterwards processed by queries, stream processing handles inbound data while it is in flight, as it streams through the server. Stream processing also connects to external data sources, enabling applications to incorporate selected data into the application flow or to update an external database with processed data.

Some open-source technologies have also been developed for real-time or near real-time processing of streaming data, but each falls short of at least one of the 7V's requirements of big streaming data. Open-source technologies such as Apache Spark, Kafka, Flume, Samza, Spark Streaming, and Flink process large amounts of information on the fly. In this work, we integrate open-source technologies to handle and transform Twitter streams and to analyze them with respect to the time and origin of the tweets. The challenge in doing so, however, is the limited context provided by tweets, which results from the length restriction imposed on a tweet (originally 140 characters, 280 since 2017). On top of that, the majority of the information propagated on Twitter is irrelevant for the event detection task, and the noise generated by spammers, together with informal language and spelling and grammatical errors, adversely affects the event detection process. The two primary difficulties are designing fast mining methods for data streams and promptly detecting changing concepts and data distributions, given the highly dynamic nature of data streams [2].

The main contributions of this research paper are:
• A streaming data architecture based on the Kappa architecture is developed. It incorporates only the speed layer, unlike the Lambda architecture, which also contains a batch layer that is not required for Twitter social media data.
• The architecture enables the ingestion of Twitter streams through Apache Kafka.
• The processing of streaming data is carried out through Apache Spark. Further, the transformation of the data is done for various modules, such as average tweet length analysis, written in Scala and run on Spark.
• The final results are stored in a NoSQL database, Apache Cassandra.

The rest of the paper is structured as follows. Section 2 presents an extensive literature survey of Twitter-centric frameworks for ingesting, processing, storing, and visualizing streaming data. The proposed framework is described in Sect. 3, which covers setting up the Twitter streaming API for capturing the stream of tweets and the integration of Spark with Kafka and Apache Cassandra. Illustrations and analysis of the work using the proposed framework are given in Sect. 4, followed by a conclusion in Sect. 5.
2 Related Work
For the purpose of a focused discussion on the related work, we classify various stream processing technologies based on the architecture they are built upon.

2.1 Study of Architectures for Stream Processing
(a) Lambda Architecture: This streaming data architecture (Marz and Warren 2015) is a design proposed by Nathan Marz that aims to process huge data sources while limiting latency. The Lambda architecture is split into three layers, described below:
Batch layer: This layer is in charge of the transformations, computations, and aggregations required over the existing data. It incorporates new data as it arrives and processes it after batching it according to certain parameters.
Serving layer: This layer generates views for specific queries. These views can be precomputed and cached in order to speed up querying.
Speed layer: This layer is the remedy when batch processing suffers from latency problems or when data is ingested over short periods of time. As mentioned, batch processes can take a long time to refresh the data in batch views. The speed layer therefore provides up-to-date query results in real time by using real-time technologies.
By combining these two paradigms, the batch layer and the speed layer, the Lambda architecture offers a robust big data approach with a mix of low latency and up-to-date information [3].

(b) Kappa Architecture: The Kappa architecture is a simplification of the Lambda architecture that omits the batch layer. Event-driven and latency-critical stream processing applications typically use this simplified architecture. Instead of operating on a master dataset, all transformations are performed in the speed layer and recorded in an append-only, immutable log.
These architectures only outline the requirements of a stream processing framework. The technologies built on the Lambda and Kappa architectures are further classified as open source and commercial.
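The difference between the two architectures can be illustrated with a toy event counter. This is a minimal sketch under stated assumptions: the event names, the split of the log into "already batched" and "recent" parts, and the function names are hypothetical and not taken from any of the cited systems.

```python
from collections import Counter


def lambda_query(batch_view, speed_view):
    """Lambda: merge the precomputed batch view with the real-time speed view."""
    return batch_view + speed_view  # Counter addition sums the counts per key


def kappa_query(event_log):
    """Kappa: derive the view by replaying the single append-only log."""
    view = Counter()
    for event in event_log:
        view[event] += 1
    return view


event_log = ["login", "click", "click", "login", "click"]
batch_view = Counter(event_log[:3])  # events already processed by the batch layer
speed_view = Counter(event_log[3:])  # recent events not yet in the batch view
```

Both queries return the same counts here; the Lambda version maintains two code paths and two views, whereas the Kappa version keeps one path over one log.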
2.2 Study of Frameworks for Stream Processing
Real-time or near real-time frameworks raise additional technical difficulties that batch systems cannot handle. These scenarios require frameworks that can update their computations as new data arrives, work within the required time constraints, and cope with memory limitations [4].

Mitch and Hari in [5] described the architecture of current database management platforms as a pull-based model for accessing data and stream-based data management as a push-based model. The authors characterized the pull-based model as a client (the active party) submitting a query to the server (the passive party), which returns an answer. In stream-based applications, by contrast, data is pushed to a system that must evaluate queries in response to detected events. The authors considered various challenges faced by distributed stream computation and presented Aurora, a centralized stream processing system. Aurora assumes that a single administrative domain is present. It is built on the fundamentals of a dataflow system and uses the box-and-arrow paradigm. Medusa is an extension of Aurora built for service delivery among autonomous participants.

Boehm et al. [6] presented an up-to-date overview of machine learning on Spark, the engine extensions required to fully exploit Spark, and optimized solutions to implementation challenges specific to Spark. Some of the major challenges studied are memory handling [7] and lazy evaluation. The key Spark-related optimizations are automatic integration of RDD caching and repartitioning, as well as partitioning-preserving operations that minimize data scans and shuffles. The authors' experiments show that hybrid execution plans are crucial for performance.

Maarala et al. [8] worked on the processing of big trajectory data, using Twitter feeds to obtain the count of vehicles on each street segment and to detect congested street segments in real time. The authors correlated user mobility data with individuals' Twitter activity so as to obtain geo-tagged tweets created on congested road segments [9]. The traffic status (dynamic travel time, slowdown, congestion, etc.) and events (accidents, roadwork, etc.) that may trigger tweets were identified, as such tweets may offer more information about traffic conditions. For obtaining the output, other real-time streaming data can be used, such as the sensors integrated in smartphones, accelerometers, and vehicle-embedded sensors accessible through the OBD-II interface.

Carbone et al. [10] described the Flink ecosystem as consisting of a kernel with a runtime component used for distributed streaming dataflow. On the layer above, specialized components are provided: the DataSet API for batch processing, the DataStream API for stream processing, FlinkML for machine learning, and Gelly for graph processing.

Kleppmann et al. [11] described a hybrid framework of Kafka and Samza and addressed the state management problem in stream processing by creating a small number of general-purpose abstractions. Samza implements durable state through the abstraction of a key-value store. Samza provides a Java interface called StreamTask, which is implemented by application code. R. Ranjan [12] noted that Apache Samza has a parallel distributed architecture and supports Java Virtual Machine (JVM) languages.

Spark and Kafka were initially not designed to work together, but integrating them in the proposed framework has proven efficient for ingestion and processing.
2.2.1 Apache Spark
Spark is a distributed processing platform designed to process data through functional transformations on distributed collections of big data. Spark is not just another tool; it introduced a new programming model and a new fundamental abstraction. Uber, Amazon, Yahoo, Baidu, Pinterest, Spotify, Alibaba, Shopify, and Verizon are some of the companies where Spark is already being used in production. It improves on MapReduce by generalizing the MapReduce model under a single abstraction, and it performs efficiently on interactive data mining problems. Distributed computing now appears in almost every field, including machine learning. Hadoop was designed to transform a dataset and then move on to other data. It ignores historical data, as it is not designed for look-back computation.

Core Issues in Traditional Approaches:
• Interactive and iterative algorithms are handled inefficiently, and these inefficiencies result from not keeping large datasets in memory.
• Keeping data in memory can improve efficiency for iterative algorithms as well as for interactive data mining.
• The idea of Spark, and especially of the RDD, originated from this single thought: find a way to keep the data in memory, and when the data is too big, find a way to keep it in distributed memory.
2.2.2 Apache Kafka
Kafka is a publish/subscribe messaging system. A cluster can be set up entirely of Kafka servers whose only job is to store all incoming messages from publishers, which might be a set of Web servers or sensors, for some period of time and to publish them to anyone who wants to consume them. Messages are associated with a topic, which represents a specific stream, for example, a topic of weblogs from a given application or a topic of sensor data from a given system. Consumers generally subscribe to one or more topics, and they receive the data as it is published. A consumer that goes offline can catch up with the data from some point in the past. An important characteristic of Kafka is its ability to manage, very efficiently, multiple consumers that might be at different points in the same stream. Kafka uses a pull model: each consumer retrieves messages at the maximum rate it can sustain and thus avoids being flooded by messages pushed faster than it can handle. Kreps et al. [13] presented the Kafka messaging system for real-time log processing at LinkedIn Corp. Numerous specialized log aggregators have been developed over the last few years. One of them is Scribe, developed by Facebook. In Scribe's architecture, each frontend machine can send log data over sockets to a set of Scribe machines, where the log entries are aggregated and periodically uploaded into HDFS [14] or an NFS device.
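The notion of several consumers sitting at different points of the same stream, each pulling at its own pace, can be illustrated without a broker. The sketch below is an in-memory stand-in only: `Topic` and `Consumer` are hypothetical classes written for this example and are not the Kafka client API.

```python
class Topic:
    """An in-memory stand-in for one topic: an append-only list of messages."""

    def __init__(self):
        self._log = []

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset, max_messages):
        """Return up to max_messages starting at the given offset."""
        return self._log[offset:offset + max_messages]


class Consumer:
    """Tracks its own position in the topic, independently of other consumers."""

    def __init__(self, topic, offset=0):
        self.topic = topic
        self.offset = offset

    def poll(self, max_messages=2):
        """Pull up to max_messages, then advance this consumer's own offset."""
        batch = self.topic.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


weblogs = Topic()
for line in ["GET /", "GET /about", "POST /login"]:
    weblogs.publish(line)

fast = Consumer(weblogs)
late = Consumer(weblogs)  # e.g. a consumer that was offline and starts from offset 0
first_batch = fast.poll()
remaining = fast.poll()
replayed = late.poll(max_messages=3)
```

Because each consumer owns its offset, `late` can replay all three messages after `fast` has already read them, and neither is sent more than it asked for.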
2.3 Study of In-Memory Computing Framework
Traditional approaches handle interactive and iterative algorithms inefficiently because they do not keep large datasets in memory. Zhang et al. explained the in-memory cluster computing feature of Spark [15] and described the distributed memory management platform for Spark. Cluster computing is a combination of infrastructure and a software platform, where the software platform is responsible for distributing tasks among multiple nodes. Zaharia et al. [16] proposed a new programming model called discretized streams (DStreams) that offers a high-level functional programming API, strong consistency, and efficient fault recovery. DStreams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster. The authors prototyped DStreams in an extension to the Spark cluster computing system called Spark Streaming, which lets users seamlessly intermix streaming, batch, and interactive queries.
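The core idea behind a discretized stream, chopping a continuous stream into a sequence of small batches by time interval, can be sketched in a few lines. This is plain Python for illustration, not the Spark Streaming API; the `discretize` function, the event tuples, and the choice to skip empty intervals are assumptions of this sketch.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length intervals.

    Intervals that receive no events are skipped in this simplified version.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    return [batches[key] for key in sorted(batches)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
micro_batches = discretize(events, batch_interval=1.0)
```

Each element of `micro_batches` plays the role of one small batch that a batch engine can then process with its usual operators.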
3 Proposed Framework

3.1 Setting up Twitter Account

In this research work, Scala and Spark shell scripting are used. Scala is a popular choice for programming Spark Streaming because Spark itself is written in Scala and it is considerably easier to use than Java. Scala runs on top of the Java Virtual Machine (JVM) and can therefore access Java classes. It is built on the concepts of functional programming. The modules for analyzing Twitter streams for applications such as finding popular hashtags, clustering similar tweets, and saving tweets to the local filesystem are implemented in Scala, using the Scala IDE extension of Eclipse for ease of execution.

Live tweets are streamed by setting up a Twitter developer account to obtain the authorization tokens needed to receive live Twitter updates programmatically as they happen. This is done by logging into https://apps.twitter.com. After logging in, an application is set up by filling in the required information (name, description, Web site, etc.) to use the Twitter API. Once the application is set up, the keys and access tokens are generated. Four credentials are required for authorization: Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret. These credentials are then used in the code to access live Twitter feeds, and the various modules are built once this step is completed. The Twitter streams are stored in the integrated NoSQL database Cassandra for subsequent analysis.

Once Twitter is set up to send tweets to the system, the tweets are accessed as Resilient Distributed Datasets (RDDs). The discretized stream (DStream) delivers a sequence of RDDs, and each RDD is accessed as it comes in. To achieve this, the foreachRDD function is used, which exposes both the individual RDD from the DStream and its timestamp. The timestamp is used here as the file name under which the tweets of that batch are saved.

Repartitioning consolidates the RDD into a single partition. The RDD might be distributed across the cluster, but here access to all of the data at once is required. This is especially important when writing the stream to a database: if a database connection is created to output the results while the data is distributed over multiple machines, the connection may not be valid on all of those machines. Therefore, everything is first consolidated into a single partition. The RDD is also cached, because caching is desirable whenever more than one action is performed on an RDD: every time Spark encounters an action, it evaluates the directed acyclic graph of transformations to obtain the data, and caching avoids repeating that computation. The result is obtained in the console in batches of 1 s.
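The consolidate-then-save step described above can be sketched without a Spark cluster. The following is a minimal pure-Python emulation, not Spark code: `save_batch`, the `tweets_<timestamp>.txt` naming scheme and the list-of-lists representation of partitions are assumptions made for illustration only.

```python
import os
import tempfile


def save_batch(partitions, batch_time_ms, out_dir):
    """Emulate one micro-batch: merge partitions, then save under a timestamp name.

    `partitions` is a list of lists of tweet strings (a stand-in for an RDD
    whose data is spread over several partitions).
    """
    # Stand-in for repartitioning to a single partition: gather everything.
    consolidated = [tweet for part in partitions for tweet in part]
    # The merged list is materialised once and reused by both "actions"
    # below (the emptiness check and the write), which is the role of cache().
    if not consolidated:
        return None  # nothing arrived in this batch interval
    path = os.path.join(out_dir, f"tweets_{batch_time_ms}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(consolidated))
    return path


# Usage: one batch whose tweets sit in two partitions (hypothetical data).
with tempfile.TemporaryDirectory() as out_dir:
    saved = save_batch([["tweet a", "tweet b"], ["tweet c"]], 1588000000000, out_dir)
    saved_name = os.path.basename(saved)
    with open(saved, encoding="utf-8") as handle:
        saved_lines = handle.read().splitlines()
    empty_result = save_batch([[], []], 1588000001000, out_dir)
```

In an actual Spark Streaming job the same shape would appear inside the function passed to foreachRDD, with the RDD and the batch time supplied by Spark.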
S. Arora and R. Rani
3.2 Integrating with Apache Kafka

This section covers integrating Spark Streaming with real-world systems from which data may need to be consumed. Kafka is a high-throughput distributed messaging system that is scalable, durable, and robust. It is a publish/subscribe messaging system: it publishes messages received from many different producers, and consumers subscribe to given topics to listen for those messages, for example, weblogs broadcast as they are dropped into Apache access log directories. It is a reliable way to send data across a cluster of computers, with Kafka acting as a broker between producers and consumers. As of Spark 1.3, Spark Streaming can connect directly to Kafka. Earlier, it had to connect to the Zookeeper hosts that sat on top of Kafka, which left a lot of room for things to go wrong and for messages to get lost. The direct connection is a much more reliable mechanism because there is one fewer intermediary. The Spark-Streaming-Kafka package is not built in; it must be downloaded from Maven, copied to the Spark lib folder, and then imported as an additional external jar.
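The publish/subscribe idea can be illustrated with a toy in-memory broker. This is only a conceptual sketch under stated assumptions: `ToyBroker` and `ToyConsumer` are hypothetical names, and nothing here uses or mirrors the real Kafka client API. Each topic is modelled as an append-only list, and each consumer remembers the offset it has read up to.

```python
from collections import defaultdict


class ToyBroker:
    """Keeps one append-only message log per topic."""

    def __init__(self):
        self.logs = defaultdict(list)

    def publish(self, topic, message):
        """Append a message and return its offset within the topic."""
        self.logs[topic].append(message)
        return len(self.logs[topic]) - 1

    def read_from(self, topic, offset):
        """Return every message of the topic from `offset` onwards."""
        return self.logs[topic][offset:]


class ToyConsumer:
    """Subscribes to one topic and tracks its own read position."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.broker.read_from(self.topic, self.offset)
        self.offset += len(messages)  # advance past what was delivered
        return messages


broker = ToyBroker()
consumer = ToyConsumer(broker, "weblogs")
broker.publish("weblogs", "GET /index.html 200")
broker.publish("weblogs", "GET /missing 404")
broker.publish("tweets", "not for this consumer")
first_poll = consumer.poll()
second_poll = consumer.poll()
```

Because the consumer, not the broker, owns the offset, a consumer that restarts can resume from a stored position, which is the property that makes a direct connection less prone to losing messages.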
3.3 Integrating with Apache Cassandra

Cassandra is a popular choice of output sink for Spark Streaming. It is a NoSQL database [7] based on key-value data, and it is distributed, fast, and reliable. Tables are created with specific queries in mind to keep access fast. Data for a given key can be sent to more than one host, and Cassandra takes care of the resulting redundancy. In Spark Streaming, raw input can be transformed into a key-value RDD in which the first element of each tuple is the key and the second is the value. This works efficiently with Cassandra because the key-value RDD maps directly onto its data model.
4 Illustrations

4.1 Setting up Spark

• Spark is built using the Scala programming language, and Scala runs on top of the JVM for compilation and execution; therefore, installing Java is the first step in getting started with Spark. The Java Development Kit (JDK) can be installed from oracle.com, followed by setting up a couple of environment variables in the Windows operating system.
Fig. 1 Live twitter streams through Spark
• The second step is to install Spark, which can be downloaded from spark.apache.org (Fig. 1). Spark release 2.0 or later is downloaded so that the latest libraries are installed and configured. The package type should be pre-built for Hadoop 2.7 or later.
• To run on Windows, a stand-in for the Hadoop distribution known as winutils.exe must be present. It can be downloaded from an external source on the Internet, and it assures Spark that Hadoop is present.
• Next comes the installation of the Scala IDE, which can be downloaded from scalaide.org and provides an eclipse.exe file with Scala built in.
The Spark installation can be verified by running the command spark-shell in a command prompt opened as administrator.
4.2 Stream Live Tweets with Spark Streaming

The jar files required specifically for Twitter applications are twitter4j-core-4.0.4.jar, twitter4j-stream-4.0.4.jar, and dstream-twitter_2.11-0.1.0-SNAPSHOT.jar. These external jars are placed in the Spark 'jars' folder, where all the configuration, libraries, and jar files reside, and are included in the application. Once the configuration is done, an application module that prints the tweets is run locally using all CPU cores and one-second batches of data, as illustrated in Fig. 1.
Fig. 2 Average tweet length analysis
5 Module Analysis

5.1 Print Tweets Module Analysis

The application for printing the tweets in real time using Apache Spark was executed, and the result was obtained in the Scala IDE.
5.2 Average Tweet Length Analysis

One further manipulation that can be performed on the Twitter streams is tracking the average tweet length. This can be done after establishing a connection with the database, or by aggregating data from the tweets as they come in. Only the text of each incoming tweet stored in the DStream is extracted, and the lengths of all the tweets are then accumulated over time. A new DStream named lengths is created by mapping the status DStream with status.map(status ⇒ status.length()), which transforms each status string coming in from the status DStream into an integer length, i.e., the number of characters in that tweet. The new DStream obtained from this transformation therefore contains just the length of each tweet. Two quantities are tracked: the total number of tweets and the total number of characters, illustrated in Fig. 3. As new data comes in, the lengths DStream is aggregated with the previous data, and each RDD is accessed individually. In this module, the timestamp is ignored; only the RDD is accessed, and the information it contains is counted (Fig. 2).
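The running aggregation described above can be sketched in plain Python. This is an illustrative emulation rather than the authors' Scala module: `TweetLengthTracker` is a hypothetical name, and each call to `update` stands in for one RDD delivered by the lengths DStream.

```python
class TweetLengthTracker:
    """Accumulates the two tracked totals across micro-batches."""

    def __init__(self):
        self.total_tweets = 0
        self.total_chars = 0

    def update(self, batch):
        """Process one batch (a list of tweet texts)."""
        # Counterpart of status.map(status => status.length()).
        lengths = [len(status) for status in batch]
        if lengths:  # an empty batch leaves the totals untouched
            self.total_tweets += len(lengths)
            self.total_chars += sum(lengths)

    def average(self):
        """Average tweet length over everything seen so far."""
        if self.total_tweets == 0:
            return 0.0
        return self.total_chars / self.total_tweets


tracker = TweetLengthTracker()
for batch in (["hello", "spark streaming"], [], ["abc"]):
    tracker.update(batch)
running_average = tracker.average()
```

Keeping only the two totals means the average can be refreshed after every batch without retaining any tweet text.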
Fig. 3 Most popular hashtag module
Fig. 4 Streaming statistics of print tweets module
5.3 Most Popular Hashtag Analysis

In this module, the most popular hashtag is tracked from a stream of tweets over a sliding window of 5 min. Here, the filter operation on a DStream is introduced, and flatMap is used instead of map. The tweets are split into individual words using the flatMap operation. A map has a one-to-one relationship: every row of one DStream is transformed into exactly one row of another DStream. With flatMap, the number of output results can differ from the number of inputs. Since an entry is required for every word in a row, the resulting DStream has many more entries. Subsequently, everything that is not a hashtag is eliminated using the filter() function, illustrated in Fig. 4. The requirement is not only to count the hashtags by reduction but also to do so over a sliding window of time. In this work, the window looks back over the past five minutes; otherwise, the data would pile up forever. For this purpose, the reduceByKeyAndWindow function is applied to the hashtagKeyValues DStream. It accepts not only a function that combines the individual values of a given key but also, as an optimization, an inverse function that removes the values leaving the window.
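The flatMap, filter and windowed-reduce steps can be emulated in plain Python. The sketch below is an assumption-laden illustration, not the Spark implementation: `WindowedHashtagCounter` is a hypothetical name, the window is measured in batches rather than minutes, and the subtraction of the expired batch plays the part of the inverse reduce function.

```python
from collections import Counter, deque


class WindowedHashtagCounter:
    """Counts hashtags over the most recent `window_batches` micro-batches."""

    def __init__(self, window_batches):
        self.window_batches = window_batches
        self.batches = deque()    # per-batch counts still inside the window
        self.counts = Counter()   # running totals for the whole window

    def add_batch(self, tweets):
        # flatMap: one tweet yields many words.
        words = [word for tweet in tweets for word in tweet.split()]
        # filter: keep hashtags only.
        tags = [word for word in words if word.startswith("#")]
        batch_counts = Counter(tags)
        self.batches.append(batch_counts)
        self.counts.update(batch_counts)           # add the new batch
        if len(self.batches) > self.window_batches:
            expired = self.batches.popleft()
            self.counts.subtract(expired)          # remove the batch that left
            self.counts = +self.counts             # drop tags whose count hit zero

    def most_popular(self, n=1):
        return self.counts.most_common(n)


counter = WindowedHashtagCounter(window_batches=2)
counter.add_batch(["#spark is fast", "learning #spark and #kafka"])
counter.add_batch(["#kafka #kafka rocks"])
top_after_two = counter.most_popular()
counter.add_batch(["#scala"])  # the first batch now slides out of the window
top_after_three = counter.most_popular()
```

With one-second batches, a five-minute window would correspond to `window_batches=300` in this emulation.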
6 Results

An application for printing the tweets was executed on the Spark cluster in batches of 1 s for 58 s 445 ms. During this execution, 58 batches were completed, and 2076 records were processed. The scheduling delay during this run was 0 ms, and the processing time was 34 ms. The streaming statistics for this application are illustrated in Fig. 4. An application to track the average length of a tweet also ran in batches of 1 s and was executed for 1 min 18 s. It completed 78 batches and processed 2544 records. The application encountered an average scheduling delay of 7 ms and took 64 ms of processing time. The statistics in Fig. 5 indicate that Apache Spark achieves high-speed stream processing in the distributed environment. An application for the most popular hashtag was run in batches of 1 s and was executed for 26 s 9 ms. Within this time frame, it completed 22 batches and processed 809 records. The streaming statistics show an average scheduling delay of 60 ms and an average processing time of 503 ms; the status of the micro-batch processing is shown in Fig. 6.

Fig. 5 Streaming statistics of average length of tweets module
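The reported figures can be put on a common footing with a little arithmetic. The sketch below derives records per batch and the fraction of each 1 s batch interval spent processing; it assumes that the quoted processing times are per-batch averages, which the text does not state explicitly.

```python
BATCH_INTERVAL_MS = 1000  # all three modules ran with 1 s batches

# Figures as reported in this section.
runs = {
    "print tweets": {"batches": 58, "records": 2076, "processing_ms": 34},
    "average tweet length": {"batches": 78, "records": 2544, "processing_ms": 64},
    "most popular hashtag": {"batches": 22, "records": 809, "processing_ms": 503},
}


def summarize(run):
    """Derive per-batch load and how much of the interval processing uses."""
    return {
        "records_per_batch": run["records"] / run["batches"],
        "busy_fraction": run["processing_ms"] / BATCH_INTERVAL_MS,
    }


summary = {name: summarize(run) for name, run in runs.items()}
for name, stats in summary.items():
    print(f"{name}: {stats['records_per_batch']:.1f} records/batch, "
          f"{stats['busy_fraction']:.1%} of the batch interval busy")
```

A streaming job keeps up with its input as long as the processing time stays below the batch interval; on these numbers, the hashtag module uses about half of each interval, while the other two use well under a tenth.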
Fig. 6 Streaming statistics of most popular hashtag module
7 Conclusion

In this paper, a time-efficient framework is proposed for analyzing Twitter streaming data. The main idea is to study Twitter data analytics by ingesting, processing, storing, and visualizing tweets in real time and by performing further transformations on them, such as tracking the average tweet length and detecting the most popular hashtag. The concept of streaming data is studied in depth, and experiments are carried out by integrating Spark with Kafka and with the distributed storage framework Apache Cassandra. In-memory computing is also explored in this work. The proposed work shows low processing times and scheduling delays for the Twitter modules, namely print tweets, average tweet length, and most popular hashtag. In future, the work may be extended to run the k-means clustering algorithm in a distributed fashion so that it can utilize the power of Spark clusters. Furthermore, the proposed work can be applied to applications such as intrusion detection and market rate analysis.
References
1. A.M.S. Osman, A novel big data analytics framework for smart cities. Future Gener. Comput. Syst. 91, 620–633 (2019)
2. M. Kholghi, M. Keyvanpour, An analytical framework for data stream mining techniques based on challenges and requirements 3, 2507–2513 (2011)
3. Q. Lin, B.C. Ooi, Z. Wang, C. Yu, Scalable distributed stream join processing, in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 811–825 (2015)
4. H. García-González, D. Fernández-Álvarez, J.E. Labra-Gayo, P. Ordóñez de Pablos, Applying big data and stream processing to the real estate domain. Behav. Inf. Technol. 1–9 (2019)
5. M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Çetintemel, Y. Xing, S. Zdonik, Scalable distributed stream processing, in Proceedings of the Conference on Innovative Data Systems Research (CIDR), pp. 2021–2025 (2003)
6. M. Boehm, M.W. Dusenberry, D. Eriksson, A.V. Evfimievski, F.M. Manshadi, N. Pansare, B. Reinwald, F.R. Reiss, P. Sen, A.C. Surve, S. Tatikonda, SystemML: declarative machine learning on Spark. Proc. VLDB 9(13), 1425–1436 (2015)
7. G. Wang, J. Tang, The NoSQL principles and basic application of Cassandra model, in The Proceedings of International Conference on Computer Science and Service System (CSSS), pp. 1332–1335 (2012)
8. A.I. Maarala, M. Rautiainen, M. Salmi, S. Pirttikangas, J. Riekki, Low latency analytics for streaming traffic data with Apache Spark, in Proceedings of IEEE International Conference on Big Data, pp. 2855–2858
9. M. Hasan, M.A. Orgun, R. Schwitter, Real-time event detection from the Twitter data stream using the TwitterNews+ framework. Inf. Proc. Manag. 56, 1146–1165 (2019)
10. P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink: stream and batch processing in a single engine. IEEE Data Eng. Bull. 28–38
11. M. Kleppmann, J. Kreps, Kafka, Samza and the Unix philosophy of distributed data. Bull. Tech. Committee Data Eng. 38, 4–14 (2015)
12. R. Ranjan, Streaming big data processing in datacenter clouds. IEEE J. Cloud Comput. 1, 78–83 (2014)
13. J. Kreps, Kafka: a distributed messaging system for log processing, in Proceedings of the NetDB, pp. 1–7 (2011)
14. M.M. Rathore, A. Ahmad, A. Paul, A. Daniel, Hadoop based real-time big data architecture for remote sensing earth observatory system, in The Proceedings of 6th International Conference on Computing, Communications and Networking Technologies (ICCCNT), pp. 1–7
15. T. Jiang, Q. Zhang, R. Hou, L. Chai, S.A. McKee, Z. Jia, N. Sun, Understanding the behavior of in-memory computing workloads, in IISWC IEEE International Symposium on Workload Characterization, pp. 22–30 (2014)
16. M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, in The Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10–11 (2010)
Turbo Code with Hybrid Interleaver in the Presence of Impulsive Noise

V. V. Satyanarayana Tallapragada, M. V. Nagabhushanam, and G. V. Pradeep Kumar
Abstract In this paper, a numerical density function of impulsive symmetric α-stable noise is presented and simulated in the turbo coding environment. The density of impulsive symmetric α-stable noise has no closed-form expression, making it difficult to realize in its original form. Symmetric α-stable noise can be used to approximate many kinds of noise sources when the parameters of the function are set to suitable values. In this paper, one of the parameters is set to a specific value so that the stable distribution approximates the Cauchy distribution, and the resulting noise is applied to turbo codes. In addition, an improved interleaver structure is presented in which multiple interleavers of smaller capacity replace a single large interleaver. Simulation results show that the proposed hybrid interleaver outperforms the existing techniques.

Keywords Bit error rate · Cauchy distribution · Interleaver · Stable distribution · Turbo codes
V. V. Satyanarayana Tallapragada (corresponding author), Department of ECE, Sree Vidyanikethan Engineering College, Tirupati, India
M. V. Nagabhushanam · G. V. Pradeep Kumar, Department of ECE, Chaitanya Bharathi Institute of Technology, Hyderabad, India
© Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_12

1 Introduction

Impulsive noise is a kind of noise that occurs at specific instants of time. Its duration is very short, and its strength is very high. This kind of noise arises from electromagnetic interference, explosions, switching noise, dropouts, keyboard clicks, surface degradation of audio disks, adverse conditions and synchronization issues in the communication link. Noise is characterized by its cumulative distribution and probability density functions. The problem with impulsive noise, however, is that there is no closed-form expression for a probability function that approximates its characteristics. Impulsive noise corrupts a sequence of bits, which is generally referred to as a burst error. The length of the burst depends on the duration of the impulsive noise. The impulsive noise can be removed, and the noisy signal enhanced, using an impulsive noise filter. In addition to wireless communication, impulsive noise is significant in control system applications, pattern recognition and industrial automation. Traditionally, a median filter is employed to remove impulsive noise. Because the performance of the median filter on impulsive noise is not satisfactory, dedicated impulsive noise filters have been devised. The design of such filters is based on the properties of the noise and of the original signal under consideration. The distinct features of the noise, the signal, the signal generation model and the channel characterization help in handling impulsive noise in an optimal way. The contributions of this paper are as follows.
• Implementation of symmetric α-stable noise.
• Implementation of turbo code in the presence of symmetric α-stable noise.
2 Symmetric Alpha-Stable Noise (SαS)

The properties of impulsive noise can be revealed through the analysis of stable distributions. A stable random variable is one whose distribution is stable. Stable distributions are also called Lévy alpha-stable distributions [1, 2], after Paul Lévy, who was the first to study them extensively and to develop their detailed mathematical model. They are characterized by four parameters: stability α, skewness β, scale c and location μ. The ranges of these parameters are (0, 2], [−1, 1], (0, ∞) and (−∞, ∞), respectively. The density and distribution functions are shown in Figs. 1 and 2, respectively.
Fig. 1 Density function of stable random variable with β=0, c=1, μ=0 and α=0.5, c=1, μ=0 [2]
Fig. 2 Distribution function of stable random variable with β=0, c=1, μ=0 and α=0.5, c=1, μ=0 [2]
These four parameters completely determine the distribution of the random variable. The most important of them is the stability parameter α, whose range is 0 < α ≤ 2. The upper bound, α = 2, corresponds to the normal distribution, and α = 1 (with β = 0) corresponds to the Cauchy distribution. The stable family can therefore be considered a superset of the normal distribution. By the central limit theorem, a suitably normalized sum of random variables with finite variance tends to a normal variable as the number of variables becomes very large. If the variance is not finite, however, the limit is a stable distribution that need not be normal. The Polish-born French and American mathematician Benoit B. Mandelbrot referred to these distributions as 'stable Paretian distributions' [3–5], after the Italian engineer Vilfredo Pareto. The maximally skewed distributions with stability parameter in the range 1 < α < 2 are referred to as 'Pareto–Lévy distributions.' According to Mandelbrot, the Pareto–Lévy distributions describe commodity and stock prices better than the normal distribution that is commonly used [6]. A non-degenerate distribution is called stable if it satisfies the following property [7]: for independent copies X_1 and X_2 of a random variable X and any constants a > 0 and b > 0, the combination aX_1 + bX_2 has the same distribution as kX + d for some constants k > 0 and d. It is worth noting that the normal, Cauchy and Lévy distributions satisfy this condition; hence, they are all special cases of the stable distribution. Stable distributions form a family of distribution functions defined by location and scale parameters and two shape parameters. The probability density function of a general stable distribution cannot be expressed analytically, but an analytic expression can be given for the characteristic function. The density function can be obtained as the Fourier transform of the characteristic function ϕ(t) [8]:

\[ f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi(t)\, e^{-ixt}\, dt \]

The random variable X is stable if its characteristic function can be written as

\[ \varphi(t; \alpha, \beta, c, \mu) = \exp\bigl( it\mu - |ct|^{\alpha} \, (1 - i\beta\, \mathrm{sgn}(t)\, \Phi) \bigr), \]

where Φ = tan(πα/2) for α ≠ 1 and Φ = −(2/π) log|t| for α = 1.
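As a numerical check on the two formulas above, the density can be recovered from the characteristic function by direct integration. The sketch below treats only the symmetric case β = 0, where the Φ term vanishes; the function names, the truncation limit `t_max` and the step count are assumptions chosen for illustration, and a simple trapezoidal rule stands in for whatever numerical method the authors used.

```python
import cmath
import math


def sas_char_fn(t, alpha, c=1.0, mu=0.0):
    """Characteristic function of a symmetric (beta = 0) alpha-stable law."""
    return cmath.exp(1j * t * mu - abs(c * t) ** alpha)


def sas_density(x, alpha, c=1.0, mu=0.0, t_max=40.0, steps=20000):
    """Approximate f(x) = (1 / 2*pi) * integral of phi(t) * exp(-i*x*t) dt.

    The integral is truncated to [-t_max, t_max] and evaluated with the
    trapezoidal rule; the imaginary parts cancel, so only real parts are summed.
    """
    h = 2.0 * t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = -t_max + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * (sas_char_fn(t, alpha, c, mu) * cmath.exp(-1j * x * t)).real
    return total * h / (2.0 * math.pi)


# alpha = 1 should reproduce the Cauchy density 1 / (pi * (1 + x^2)).
cauchy_at_0 = sas_density(0.0, alpha=1.0)
cauchy_at_1 = sas_density(1.0, alpha=1.0)
# alpha = 2 with c = 1 gives a normal density with variance 2*c^2 = 2.
normal_at_0 = sas_density(0.0, alpha=2.0)
```

For other values of α no closed form is available for comparison, which is the situation in which such a numerical density becomes useful.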
3 Interleaver Design

In this paper, the turbo code is simulated to handle the effect of SαS noise. Implementing or modeling the SαS distribution in general is difficult, so an attempt has been made to simulate SαS noise by taking specific cases. These cases are simulated in the turbo code environment. Another contribution of this paper is the proposal of a novel interleaver architecture. Three cases are presented: no interleaver, a block interleaver, and a combination of block interleavers of smaller dimension [9, 10]. Bandwidth places many constraints on the type of channel and the services it provides, on channel requirements and on noise [11, 12]. The energy or power requirement depends directly on the number of bits transmitted, as this accounts for the total power transmitted, and on the error rate. Energy efficiency is a main concern in wireless settings. In this paper, turbo codes are tested and explored for handling noise that can be well represented as SαS noise [13, 14]. The turbo code has encoding and decoding parts. The encoder section of the turbo code is shown in Fig. 3. In simple terms, the interleaver takes an input of a specific length and outputs the same bits in a different order. The order appears random but is in fact pseudorandom. Interleaving mixes the bits in a pseudorandom way so that the weight of the bit stream is changed. The conversion of burst errors into simpler, isolated errors is shown in Fig. 4.

Fig. 3 Internal structure of encoder section of turbo coding system
Fig. 4 Operation of interleaver
An error detection code only detects errors; as a result, the received frame must be transmitted again, which creates additional delay and gives rise to jitter, which in turn causes annoyance in on-demand applications. Hence, the use of error-correcting codes is inevitable. Error-correcting codes are more complex than error detection schemes because the distance constraint is tight. To maintain a weight that satisfies the correction capacity, the number of parity bits added is very high, and the code rate is low. However, this complexity enables forward error handling, which never requires asking the sender for retransmission. The advantage of an interleaver is that a coding scheme with lower error detection and correction capability suffices to handle the same quantity of errors. In Fig. 4, the top stream of bits represents the case with no interleaving. Assume that the bits shown in red are initially 0s. In the case of impulse noise, because of its high energy, these are received as 1s. Although it is not used in practice in turbo codes, assume a block code for the purpose of description, and let the number of bits in a codeword be 7. In the no-interleaver case, an error-correcting code capable of correcting 4 bits is required. Now, observe the interleaved stream of bits, for which a simple block interleaver is employed. The interleaved stream shows that in the first 7 bits, i.e., the first frame, only two bits are in error. Hence, the error correction capability needed is reduced from 3 to 2. Many interleaver schemes are available in the literature; the block interleaver and the 3GPP interleaver are the basic ones. In this paper, the hybrid interleaver proposed in [9] is presented and applied to the simulated SαS noise. The block diagram of the hybrid interleaver is given in Fig. 5. The advantage of the hybrid interleaver is that the bits are reordered more thoroughly than by a direct application of a single interleaver. This is more effective when the frame length is large. In Fig. 5, a frame of 16 bits is divided into 4 sub-frames. Each sub-frame is applied to a block interleaver of size 4, and the output nibbles are then block interleaved again, as shown logically in Fig. 5.
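The two-stage idea can be sketched in a few lines of Python. This is an illustrative reading of the description, not the scheme of [9] itself: the function names are hypothetical, each 4-bit sub-frame is assumed to pass through a 2 × 2 block interleaver, and the second stage is assumed to be a 4 × 4 block interleaver over the concatenated outputs. A block interleaver is taken here to write a frame row by row and read it column by column.

```python
def block_interleave(bits, rows, cols):
    """Write the frame row by row into a rows x cols array, read it column by column."""
    if len(bits) != rows * cols:
        raise ValueError("frame length must equal rows * cols")
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]


def block_deinterleave(bits, rows, cols):
    """Undo block_interleave: the inverse is the same operation with swapped dimensions."""
    return block_interleave(bits, cols, rows)


def hybrid_interleave(bits, sub_len=4, sub_rows=2):
    """Assumed two-stage scheme: small block interleavers, then one across sub-frames."""
    n_sub = len(bits) // sub_len
    stage1 = []
    for k in range(n_sub):
        sub_frame = bits[k * sub_len:(k + 1) * sub_len]
        stage1.extend(block_interleave(sub_frame, sub_rows, sub_len // sub_rows))
    # Stage 2: each sub-frame is one row; reading by columns mixes the sub-frames.
    return block_interleave(stage1, n_sub, sub_len)


# Use bit positions 0..15 as the "bits" so the permutation is visible.
frame = list(range(16))
mixed = hybrid_interleave(frame)
# A burst hitting 4 consecutive transmitted positions...
burst_positions = mixed[4:8]
# ...lands in these 4-bit sub-frames of the original frame.
burst_sub_frames = sorted(position // 4 for position in burst_positions)
round_trip = block_deinterleave(block_interleave(frame, 4, 4), 4, 4)
```

In this example the four corrupted positions fall into four different sub-frames, so each sub-frame sees a single error instead of one sub-frame absorbing the whole burst.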
Fig. 5 Block diagram of hybrid interleaver [9]
4 Simulation Results

In this section, the simulation results of the turbo code using the hybrid interleaver are presented. Cauchy noise was simulated using the SαS distribution. The previous sections noted that many noise distributions may be approximated using the SαS distribution; here, Cauchy noise is simulated to illustrate this. Three cases of interleaving were considered. In the first case, no interleaver is employed, which results in poor mixing of the bit stream and hence a high bit error rate. In the second, a block interleaver is employed, in which interleaving is done using a row-wise read and column-wise write operation. In the third, a hybrid interleaver is employed, which uses block interleavers of smaller dimension. Frame lengths ranging from 2 to 16,384 are used. In this section, results for three of these sizes are presented: 1024, 4096 and 16,384. Tables 1, 2 and 3 present the performance of the turbo code with a frame of length 1024. Turbo decoding is an iterative process. Termination can be triggered either when a minimum error rate is reached or when a maximum number of iterations is reached. In this paper, the latter is implemented, and a total of 8 iterations are run. Figure 6 shows the error rate performance of the turbo code in the different cases with a frame of length 1024, and Fig. 7 shows that with a frame of length 4096.
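To make the Cauchy noise setting concrete, the sketch below simulates only the uncoded case: antipodal symbols of unit amplitude disturbed by Cauchy noise (SαS with α = 1) and decided by sign. It is not the authors' simulation. The scale parameter `gamma`, the unit-amplitude mapping and the function names are assumptions, and no attempt is made to relate `gamma` to the Eb/N0 axis of the tables, since that mapping is not given in the text.

```python
import math
import random


def cauchy_noise(gamma, rng):
    """Draw one Cauchy sample with scale gamma by inverse-CDF sampling."""
    return gamma * math.tan(math.pi * (rng.random() - 0.5))


def uncoded_ber(gamma, n_bits, seed=0):
    """Simulated bit error rate of sign detection under Cauchy noise."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_bits):
        bit = rng.getrandbits(1)
        symbol = 1.0 if bit else -1.0          # antipodal mapping
        received = symbol + cauchy_noise(gamma, rng)
        decided = 1 if received > 0 else 0     # hard decision on the sign
        errors += decided != bit
    return errors / n_bits


def analytic_ber(gamma):
    """Probability that Cauchy noise pushes a unit-amplitude symbol across zero."""
    return 0.5 - math.atan(1.0 / gamma) / math.pi


simulated = uncoded_ber(gamma=0.2, n_bits=200000)
expected = analytic_ber(gamma=0.2)
```

Because the Cauchy tail decays only polynomially, the error rate falls slowly as `gamma` shrinks, which is consistent with the uncoded rows of the tables changing little across the range shown.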
Table 1 Bit error rate with no interleaver on a frame of length 1024 in Cauchy noise environment

Eb/N0 (dB)     2       3       4       5       6       7       8       9       10
Uncoded        0.092   0.079   0.075   0.068   0.067   0.064   0.060   0.053   0.052
Iteration 1    0.093   0.073   0.073   0.060   0.050   0.043   0.040   0.033   0.031
Iteration 2    0.091   0.070   0.069   0.057   0.051   0.043   0.037   0.031   0.027
Iteration 4    0.089   0.068   0.068   0.054   0.049   0.041   0.039   0.029   0.027
Iteration 8    0.078   0.068   0.067   0.052   0.046   0.041   0.038   0.027   0.026
Table 2 Bit error rate with block interleaver on a frame of length 1024 in Cauchy noise environment

Eb/N0 (dB)     2        3        4        5        6        7        8        9        10
Uncoded        0.0767   0.0729   0.0738   0.0706   0.0691   0.0641   0.0645   0.0527   0.0526
Iteration 1    0.0760   0.0684   0.0542   0.0482   0.0399   0.0340   0.0260   0.0178   0.0090
Iteration 2    0.0785   0.0620   0.0510   0.0376   0.0205   0.0182   0.0059   0.0043   0.0015
Iteration 4    0.0795   0.0588   0.0487   0.0318   0.0120   0.0056   0.0028   0.0007   0.0003
Iteration 8    0.0804   0.0605   0.0428   0.0234   0.0096   0.0040   0.0035   0        0.0003
Table 3 Bit error rate with hybrid interleaver on a frame of length 1024 in Cauchy noise environment

Eb/N0 (dB)     2        3        4        5        6        7        8        9        10
Uncoded        0.0916   0.0756   0.0724   0.0682   0.0646   0.0589   0.0567   0.0561   0.0535
Iteration 1    0.0899   0.0726   0.0608   0.0436   0.0350   0.0316   0.0279   0.0127   0.0125
Iteration 2    0.0912   0.0659   0.0501   0.0302   0.0155   0.0154   0.0081   0.0010   0.0005
Iteration 4    0.0860   0.0580   0.0436   0.0250   0.0072   0.0035   0.0008   0        0
Iteration 8    0.0778   0.0595   0.0438   0.0227   0.0007   0.0032   0        0        0
Fig. 6 Error rate performance of turbo code with no interleaver case, block and hybrid interleaver with a frame of length 1024 bits
Fig. 7 Error rate performance of turbo code with no interleaver case, block and hybrid interleaver with a frame of length 4096 bits
5 Conclusion

In this paper, the stable distribution is modeled and used to approximate the Cauchy distribution. Of the four parameters that characterize the stable distribution, the stability parameter α is set to 1 to approximate the Cauchy distribution. The noise is added to a signal in the turbo code environment. Different architectural models of turbo coders are tested. Interleaver structures are explored to obtain a weight balance among the code words at the decoder. The hybrid interleaver presented has shown better performance in generating code words that are well suited to decoding, resulting in a lower error rate even at a lower signal-to-noise ratio.
References
1. B. Mandelbrot, The Pareto–Lévy law and the distribution of income. Int. Econ. Rev. (1960)
2. P. Lévy, Calcul des probabilités (1925)
3. B. Mandelbrot, Stable Paretian random functions and the multiplicative variation of income. Econometrica (1961)
4. B. Mandelbrot, The variation of certain speculative prices. J. Bus. (1963)
5. E.F. Fama, Mandelbrot and the stable Paretian hypothesis. J. Bus. (1963)
6. B. Mandelbrot, New methods in statistical economics. J. Polit. Econ. 71(5), 421–440 (1963)
7. J.P. Nolan, Stable Distributions: Models for Heavy Tailed Data
8. S. Kyle, Stable Distributions. www.randomservices.org
9. J.K. Sunkara, S. Sri Manu Prasad, T. Venkateswarlu, Turbo code using a novel hybrid interleaver. Int. J. Latest Trends Eng. Technol. 7(3), 124–130 (2016)
10. V.V. Satyanarayana Tallapragada, J.K. Sunkara, K.L. Narasihimha Prasad, D. Nagaraju, Fast turbo codes using sub-block based interleaver, in Proceeding of Second International Conference on Circuits, Controls and Communications, pp. 200–205, Dec 2017, Bangalore
11. V.V. Satyanarayana Tallapragada, G.V. Pradeep Kumar, J.K. Sunkara, Wavelet packet: a multirate adaptive filter for de-noising of TDM signal, in International Conference on Electrical, Electronics, Computers, Communication, Mechanical and Computing (EECCMC), Jan 2018, Vellore
12. J.K. Sunkara, E. Navaneethasagari, D. Pradeep, E. Naga Chaithanya, D. Pavani, D.V. Sai Sudheer, A new video compression method using DCT/DWT and SPIHT based on accordion representation. I.J. Image Graph. Signal Proc. 28–34 (2012)
13. T.V.V. Satyanarayana, Prospective approach using power estimation technique in wireless communication systems. Nat. Conf. Pervasive Comput. 172–174 (2008)
14. M. Adam, W. Rafał, Springer Handbooks of Computational Statistics, ed. by J.E. Gentle, W.K. Härdle (Springer, Berlin, Heidelberg), pp. 1025–1059. https://doi.org/10.1007/978-3642-21551-3_34
Performance Evaluation of Weighted Fair Queuing Model for Bandwidth Allocation

Shapla Khanam, Ismail Ahmedy, and Mohd Yamani Idna Idris
Abstract Today’s communication is based on packet transmission, which raises the challenge of allocating the available bandwidth in line with Quality of Service (QoS) requirements. Various bandwidth allocation methods are used to allocate network resources; they need to schedule incoming packets and then serve them according to their individual requirements. This paper investigates the WFQ scheduler and proposes an iterative mathematical model to allocate average bandwidth. The paper aims to present the results of our investigation into the QoS support mechanisms available in MPLS for different traffic types.

Keywords NGN · Bandwidth allocation · WFQ · QoS
S. Khanam (corresponding author) · I. Ahmedy · M. Y. I. Idris, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia
© Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_13

1 Introduction

Recently, IP-based next-generation networks (NGNs) have been deployed with highly demanding applications, for instance video on demand and voice over IP (VoIP), which require special Quality of Service (QoS) guarantees [1]. The packet departure order of different flows is determined and controlled by the network nodes, and the QoS is largely determined by the scheduling or queuing disciplines adopted by the systems or networks. The generalized processor sharing (GPS) technique is regarded as an ideal solution for guaranteeing flow-based fair performance among network flows so that the QoS requirements can be assured. Various algorithms based on GPS, namely weighted fair queueing (WFQ), worst-case fair weighted fair queueing (WF2Q), virtual clock, self-clock fair queueing (SCFQ), credit-based SCFQ scheduler (CB-SCFQ) and deficit round-robin (DRR), have been proposed to provide various levels of performance [2–5]. Several proposals have been investigated to achieve fairness in next-generation wireless networks [2, 6, 7]. However, these attempts have been limited to providing an equal distribution of bandwidth to different network flows. Other protocols were developed to achieve fairness in multi-hop network topologies. Mechanisms developed in [7] achieve weighted fairness with the aim of maximizing network throughput; the approach uses the concept of a “flow contention graph,” which mainly focuses on the network topology. In other work [8], a model was proposed for weighted fair queuing that does not take the network topology into consideration. Several studies have addressed fair bandwidth sharing on a shared communication link [2, 4, 9, 10]. The existing approaches, however, have limitations in terms of dynamic queue utilization or reassignment of unused bandwidth. In our bandwidth allocation model, we use an iterative mathematical approach for WFQ. The iterative WFQ model makes proper use of unassigned bandwidth by redistributing it to different priority classes. The approach takes into consideration weight analysis, stability of the system, queue length and traffic delay. Average packet length, link speed and packet arrival rate are used to calculate the bandwidth assignments. The rest of this paper is organized as follows. Section 2 presents the details of the WFQ iterative bandwidth allocation model. The simulation setup and the performance under QoS constraints are discussed in Sects. 3 and 4. Finally, Sect. 5 concludes our research with potential future work.
2 Iterative Bandwidth Allocation Model

The WFQ algorithm for bandwidth allocation models is mainly proposed in Multiprotocol Label Switching (MPLS) traffic engineering and was first introduced in [11]. WFQ shares the output bandwidth among the queues fairly, based on the weights assigned to them. The scheduler assigns every packet a virtual finish time, as if the packets were served bit by bit, and forwards packets in order of their finish times. This section presents WFQ scheduling as an iterative method of bandwidth allocation. According to the WFQ algorithm, the total output bandwidth BT is allocated fairly with respect to packet size or weight as well as the priority in the queue [12]. The available bandwidth is allocated among the service classes and their waiting queues according to the assigned weights; any bandwidth that remains unused can be shared among the queues whose requirements are not yet met. Now, we consider a network node with q waiting queues (priority classes). Let w i be the pre-assigned weight of flow or queue i. Each packet has a size, and packets arrive at the queue at a specific rate. Hence, for a priority queue, the input bandwidth can be expressed as follows:
Performance Evaluation of Weighted Fair …
\[ \mathit{bw}_i = r_i \cdot l_i, \qquad i = 1, 2, 3, \ldots \tag{1} \]
where $\mathit{bw}_i$ is the input bandwidth, $r_i$ is the packet arrival rate to the queue and $l_i$ is the average packet size. Let us consider the situations that can arise in the first step of the bandwidth distribution. $B_i$ is the bandwidth allocated to queue $i$, and $B_T$ is the total available output bandwidth (written $b_i$ and $b_T$ in the equations), which is allocated among all the priority classes. If, for all queues, the weighted share of the total available output bandwidth is less than or equal to the input bandwidth, or for all queues it is greater than or equal to the input bandwidth, no sharing is required and every flow is assigned bandwidth according to this distribution. In either case, the allocation process stops at the first iteration [12]. The remaining bandwidth will remain unused (2) or the queues are not satisfied with the assigned bandwidth (3).

\[ \frac{b_T \, w_i}{\sum_{j=1}^{q} w_j} \le \mathit{bw}_i \tag{2} \]

\[ \frac{b_T \, w_i}{\sum_{j=1}^{q} w_j} \ge \mathit{bw}_i \tag{3} \]
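The algorithm assigns each queue the lower of its input bandwidth and its weighted share of the output bandwidth. The first iteration is written as Eq. 4.

\[ b_{i,1} = \min\left( \mathit{bw}_i,\; b_T \cdot w_i \cdot \frac{1}{\sum_{j=1}^{q} w_j} \right) \tag{4} \]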
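As a worked illustration of Eq. 4, consider three queues with made-up values that are not taken from the paper: $b_T = 10$, weights $w = (1, 1, 2)$ and input bandwidths $\mathit{bw} = (1, 6, 8)$, all in the same arbitrary bandwidth unit.

```latex
% Hypothetical inputs (not from the paper): b_T = 10, w = (1, 1, 2), bw = (1, 6, 8)
\begin{align*}
\sum_{j=1}^{3} w_j &= 1 + 1 + 2 = 4 \\
b_{1,1} &= \min\left(1,\; 10 \cdot 1 \cdot \tfrac{1}{4}\right) = \min(1,\ 2.5) = 1 \\
b_{2,1} &= \min\left(6,\; 10 \cdot 1 \cdot \tfrac{1}{4}\right) = \min(6,\ 2.5) = 2.5 \\
b_{3,1} &= \min\left(8,\; 10 \cdot 2 \cdot \tfrac{1}{4}\right) = \min(8,\ 5) = 5 \\
b_T - \sum_{j=1}^{3} b_{j,1} &= 10 - (1 + 2.5 + 5) = 1.5
\end{align*}
```

In this example queue 1 is satisfied after the first iteration, while 1.5 units of bandwidth are still unallocated and queues 2 and 3 still demand more than they received.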
However, if some queues require more bandwidth than they were assigned, further iterations take place. In the worst case, the distribution of the total available bandwidth is completed only at the $k$th iteration; the process continues and, in the extreme situation, may take $q - 1$ steps to meet the bandwidth demands. The next iteration is expressed as:

\[ b_{i,k} = \min\left( \mathit{bw}_i,\; b_{i,k-1} + \left( b_T - \sum_{j=1}^{q} b_{j,k-1} \right) \times \frac{w_i}{\sum_{j=1}^{q} w_j \, \min\left( \mathit{bw}_j - b_{j,k-1},\, 1 \right)} \right) \tag{5} \]
The iteration continues from the second iteration up to the $q$th iteration. The calculation must stop once the bandwidth requirements of all the queues are satisfied; otherwise it may result in a division by zero, because every term $\min(\mathit{bw}_j - b_{j,k-1}, 1)$ in the denominator of Eq. 5 is then zero. The iteration terminates if the total bandwidth is completely allocated to the queues (6) or all queues are satisfied with the assigned bandwidth (7).

\[ b_T = \sum_{i=1}^{q} b_{i,k}, \qquad i = 1, 2, 3, \ldots, q \tag{6} \]

\[ b_{i,k} = \mathit{bw}_i \tag{7} \]
The iteration also terminates, with the above conditions met, if no bandwidth reallocation occurs at the next iteration:

\[ b_{i,k} = b_{i,k-1}, \qquad i = 1, 2, 3, \ldots, q \tag{8} \]
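The iterative allocation described by Eqs. 4 to 8 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' NS2 implementation: the function name, argument names and the tolerance `eps` are hypothetical, and the term `min(deficit, 1)` is taken literally from Eq. 5, so it behaves as a 0/1 indicator only when every unsatisfied queue lacks at least one bandwidth unit.

```python
def wfq_iterative_allocation(b_total, weights, demands, eps=1e-9):
    """Share b_total among queues following Eqs. 4-8 (illustrative sketch).

    weights[i] is w_i and demands[i] is bw_i = r_i * l_i (Eq. 1).
    All names and the tolerance eps are assumptions made for this example.
    """
    q = len(weights)
    w_sum = sum(weights)
    # Eq. 4: first iteration, weighted share capped by the demand.
    alloc = [min(demands[i], b_total * weights[i] / w_sum) for i in range(q)]

    for _ in range(q - 1):  # the text bounds the extra work by q - 1 steps
        leftover = b_total - sum(alloc)
        # Denominator of Eq. 5: satisfied queues contribute min(0, 1) = 0.
        denom = sum(weights[j] * min(demands[j] - alloc[j], 1) for j in range(q))
        # Stop when everything is allocated (Eq. 6) or every queue is
        # satisfied (Eq. 7); the latter also avoids the division by zero.
        if leftover <= eps or denom <= eps:
            break
        # Eq. 5: redistribute the leftover, never exceeding the demand.
        new_alloc = [
            min(demands[i], alloc[i] + leftover * weights[i] / denom)
            for i in range(q)
        ]
        # Eq. 8: no change between iterations means nothing is left to do.
        if all(abs(a - b) <= eps for a, b in zip(new_alloc, alloc)):
            break
        alloc = new_alloc
    return alloc


if __name__ == "__main__":
    # Hypothetical example: 10 units, weights (1, 1, 2), demands (1, 6, 8).
    print(wfq_iterative_allocation(10, [1, 1, 2], [1, 6, 8]))
```

With these made-up inputs the first iteration leaves 1.5 units unallocated, and the second iteration hands them to the two unsatisfied queues in proportion to their weights, after which the whole output bandwidth is in use.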
3 Simulation Setup

The WFQ bandwidth allocation model is simulated in Network Simulator 2 (NS2). Figure 1 shows a typical network model that consists of a set of sending nodes (1, 2, 3 and 4) and a set of receiving nodes (6, 7, 8 and 9). The sending and receiving nodes communicate over a shared bottleneck link between node 0 and node 5. The Markovian queue model (M/M/1) is used because the system has a Poisson arrival process and an exponential service time distribution. The platform and simulation parameters are summarized in Tables 1 and 2, respectively. Figure 2 shows the number of packets generated in terms of packet interval time (s) and packet size (B).
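For reference, the standard steady-state results for an M/M/1 queue are summarized below, since they relate to the stability, queue length and delay aspects the model considers. The symbols $\lambda$ (packet arrival rate) and $\mu$ (service rate) are notation introduced here for illustration; they are not used in the paper.

```latex
% Standard M/M/1 results; \lambda and \mu are notation introduced for this note.
\begin{align*}
\rho &= \frac{\lambda}{\mu} < 1 && \text{(utilization; the queue is stable only if } \rho < 1\text{)} \\
L &= \frac{\rho}{1 - \rho} && \text{(mean number of packets in the system)} \\
W &= \frac{1}{\mu - \lambda} && \text{(mean time a packet spends in the system)} \\
L &= \lambda W && \text{(Little's law)}
\end{align*}
```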
Fig. 1 Network simulation model
Table 1 Platform parameters

| Parameter | Value |
|---|---|
| OS | Unix/Ubuntu 14.04 LTS |
| Simulator | NS-2 version 2.35 |
| Network topology | Wired mesh topology |
| Simulation period | 60 s |
| Maximum allocated bandwidth | 100 MB |
| Number of nodes | 10 |
Table 2 Simulation parameters

| Parameter | Voip_0 | Voip_1 | Data_0 | Data_1 | Data_2 |
|---|---|---|---|---|---|
| Packet size (bytes) | 660 | 800 | 1000 | 5000 | 8000 |
| Data rate (kb) | 8 | 64 | 32 | 96 | 128 |
| Start time (s) | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| End time (s) | 50.5 | 30.5 | 55.5 | 45.5 | 15.5 |
Fig. 2 Packets generated value 0.001 s
4 Performance Evaluation

We present simulation results for the number of generated packets, the packet delay (for both data and VoIP), the average throughput, the average packet loss and the average packet arrival ratio of the network. A deterministic queuing model, D/D/1/∞, is used to generate packets at a constant, fixed interval while keeping the packet size fixed.
Delay: The average delay is one of the most significant performance metrics of a data network. The delay measurements are presented below. Figure 3 depicts the average delay of data packets over time (s). As can be seen, the transmission delay of the data packets is slightly higher at the beginning of the simulation, but after a few seconds it settles to a steady average. Three data flows are evaluated: data_0 (1000 bytes), data_1 (5000 bytes) and data_2 (8000 bytes). The figure shows that the flow with the largest packet size has the highest transmission delay. It can be concluded that as the data size increases, the average delay increases correspondingly.

Fig. 3 Average delay (data); y-axis: delay, x-axis: time (s)

Voice over IP (VoIP) is real-time voice transmission. Delay is far less tolerable for voice than for data; thus, VoIP is given higher priority in the queue. Figure 4a, b shows the average transmission delay for VoIP_0 (660 B) and VoIP_1 (800 B). In both figures, the delay is slightly higher at the beginning for both VoIP packet sizes, then drops sharply after a few seconds and remains steady.

Fig. 4 a Delay VoIP_0. b Delay VoIP_1; y-axis: delay, x-axis: time (s)

Throughput: Network throughput is the average rate of successful message transfer through the medium. Figure 5 shows the average throughput for data_0, VoIP_0 and VoIP_1. For data_0 and VoIP_0, the throughputs are almost the same from the beginning to the end of the simulation and are very low compared with VoIP_1. For VoIP_1, the throughput rises to 0.42 kbps, which indicates that the larger the VoIP packet size, the higher the throughput.

Fig. 5 Average throughput

Packet loss and packet arrival ratio: Figure 6 presents the average packet loss for the different applications listed in Table 2. The average packet arrival ratio is presented in Fig. 7 for the various configured data and VoIP applications.

Fig. 6 Average packet loss

Fig. 7 Average packet arrival ratio
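The metrics reported in this section could be computed from per-packet send and receive records along the following lines. This is only a sketch under stated assumptions: the paper does not give its exact definitions or its trace format, so the record layout, the function name and the reading of "packet arrival ratio" as delivered packets divided by sent packets are all assumptions made for this example.

```python
def flow_metrics(sent, received, duration_s):
    """Compute per-flow QoS metrics from hypothetical packet records.

    sent:     {packet_id: (send_time_s, size_bytes)}
    received: {packet_id: receive_time_s} for packets that arrived
    The record layout is an assumption, not the NS2 trace format.
    """
    delays = [received[pid] - sent[pid][0] for pid in received]
    delivered_bytes = sum(sent[pid][1] for pid in received)
    n_sent, n_recv = len(sent), len(received)
    return {
        # Mean end-to-end delay of the delivered packets.
        "avg_delay_s": sum(delays) / len(delays) if delays else 0.0,
        # Successfully delivered bits per second over the observation window.
        "throughput_bps": delivered_bytes * 8 / duration_s,
        # Fraction of sent packets that never arrived.
        "loss_ratio": (n_sent - n_recv) / n_sent,
        # Assumed meaning of "packet arrival ratio": delivered / sent.
        "arrival_ratio": n_recv / n_sent,
    }


if __name__ == "__main__":
    # Made-up records: four 1000-byte packets, the last one is lost.
    sent = {1: (0.0, 1000), 2: (1.0, 1000), 3: (2.0, 1000), 4: (3.0, 1000)}
    received = {1: 0.5, 2: 1.5, 3: 2.5}
    print(flow_metrics(sent, received, duration_s=4.0))
```

Under these definitions the loss ratio and the arrival ratio always sum to one, so in practice only one of them needs to be measured.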
5 Conclusion

This paper examines a queuing and scheduling algorithm, the WFQ algorithm. The model was simulated on the NS2 simulator, and several QoS performance measures were analyzed. In this iterative model, we used D/D/1 and M/M/1 input traffic. The model measures the current sending and receiving rate of every data stream. The average delay, packet arrival ratio, packet loss ratio and average throughput were calculated and compared.
References

1. F. Bensalah, M.E.L. Hamzaoui, Quality of service performance evaluation of next-generation network, in International Conference on Computer Applications and Information Security, pp. 1–5 (2019)
2. V. Inzillo, A.A. Quintana, A self clocked fair queuing MAC approach limiting deafness and round robin issues in directional MANET, in 2019 Wireless Days, pp. 1–6 (2019)
3. C.H. Dasher, Regulating Content Streams From A Weighted Fair Queuing Scheduler Using Weights Defined For User Equipment Nodes (2017)
4. Z. Guo, Simulation and analysis of weighted fair queueing algorithms in OPNET. 2009 Int. Conf. Comput. Model. Simul. 1, 114–118 (2009)
5. H. Zhang, J.C. Bennett, WF2Q: worst-case fair weighted fair queueing. IEEE INFOCOM 96 (1996)
6. M. Dighriri, G.M. Lee, T. Baker, 5G Cellular Network Packets Traffic. Springer Singapore
7. H. Luo, S. Lu, V. Bharghavan, New model for packet scheduling in multihop wireless networks. Proc. Ann. Int. Conf. Mob. Comput. Networking, MOBICOM, 76–86 (2000)
8. H. Luo, S. Lu, A topology-independent wireless fair queueing model in ad hoc networks. IEEE J. Sel. Areas Commun. 23(3), 585–597 (2000)
9. Z. Jiao, B. Zhang, W. Gong, H. Mouftah, A Virtual Queue-Based Back-Pressure Scheduling Algorithm for Wireless Sensor Networks (2015)
10. E. Inaty, R. Raad, CDMA-based dynamic power and bandwidth allocation (DPBA) scheme for multiclass EPON: a weighted fair queuing approach 10(2), 52–64 (2018)
11. A. Demers, S. Keshav, S. Shenker, Analysis and simulation of a fair queueing algorithm. Internetw. Res. Exp. 1 (1990)
12. T. Balogh, M. Medvecký, Average bandwidth allocation model of WFQ. Model. Simul. Eng. 2012 (2012)
Modified Bio-Inspired Algorithms for Intrusion Detection System

Moolchand Sharma, Shachi Saini, Sneha Bahl, Rohit Goyal, and Suman Deswal
Abstract The Internet has become a critical part of everyone's life. It has made everyday tasks such as paying bills and booking shows or flights a matter of a few clicks. A network is a means of connecting people or organizations, or of connecting parties within an organization. As technology has advanced, attackers have also become more sophisticated, and network security has therefore become a serious concern for individuals and organizations. An intrusion detection system (IDS) is one of many solutions for strengthening a system. It is like a security alarm that goes off whenever it detects an attack: it keeps the network traffic under observation, looks for suspicious activity and raises a warning when such an event is found. The primary task of an IDS is to know what kind of traffic represents which type of attack. In this paper, we propose an approach to strengthening the IDS using bio-inspired algorithms, namely binary firefly optimization (BFA), binary particle swarm optimization (BPSO), and cuttlefish optimization. We introduce a modified cuttlefish algorithm (MCA) and compare it with the other two algorithms. As shown later in the paper, the modified cuttlefish algorithm achieves significantly better accuracy than the other two bio-inspired algorithms. We use the CICIDS2017 dataset, which incorporates benign traffic and the most recent attacks and resembles real-world data to a great extent. It contains the labeled flows from the network traffic analysis based on the source and destination ports, source and destination IPs, protocols, timestamp and attack.

Keywords Intrusion detection system · Bio-inspired algorithms · Binary firefly optimization · Binary particle swarm optimization · Cuttlefish optimization · CICIDS2017 · Decision tree · Random forest classifier · K-nearest neighbor

M. Sharma (B) · S. Saini · S. Bahl · R. Goyal
Maharaja Agrasen Institute of Technology, Delhi, India
e-mail: [email protected]
S. Saini e-mail: [email protected]
S. Bahl e-mail: [email protected]
R. Goyal e-mail: [email protected]
S. Deswal
Deenbandhu Chhotu Ram University of Science & Technology, University in Murthal, Sonipat, Haryana, India
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_14
1 Introduction

A cyber-attack aims to destroy or steal from computer information systems, computer networks, infrastructures, or personal computer devices. Intrusion detection helps to detect vulnerabilities in a system. An intrusion detection system comprises three parts, namely the event generator, the analysis engine, and the response manager [1]. The event generator primarily provides the data. Data sources are broadly of four types, namely network-based monitors, target-based monitors, application-based monitors, and host-based monitors. The analysis engine is responsible for analyzing the data for patterns of attacks or any breach. Two analysis approaches exist, namely signature-based detection and anomaly detection [2]. We take the anomaly detection approach because it can detect new and unique attacks. The response manager is put into action only when an intrusion is found; it is responsible for informing someone or something in the form of a response. We work on the analysis engine. Even though intrusion detection systems watch networks for potential attacks, they can produce false alarms. Thus, fine-tuning the IDS is very important. This means designing the IDS so that it can tell how regular traffic differs from possibly malicious activity. There are various ways to train an IDS. In this paper, we use multiple bio-inspired algorithms to obtain an optimized result. Bio-inspired algorithms are metaheuristics that imitate natural forces to solve optimization problems, creating a new area in computation [3]. Numerous research activities have been performed in this domain in past years. Optimization helps in addressing problems effectively. It involves finding an alternative solution with the highest achievable performance or the best cost-effectiveness under the given conditions, by giving importance to the desired factors and neglecting the undesired ones.
Various optimization algorithms exist, among which bio-inspired algorithms are known to be among the most efficient [4]. We use several bio-inspired algorithms to optimize the feature selection process in the intrusion detection system and compare their results. Feature selection consists of selecting the features that contribute most to the prediction variable or output of interest. It strongly affects the performance of a machine learning model: irrelevant or partially relevant features can hurt model performance and decrease its accuracy. This process is essential when the dataset has a vast number of features. We do not need to use every feature at our disposal; we use only those features that improve the model. This not only helps improve accuracy but also reduces the training time. There are mainly two approaches to feature selection. In the filter approach, the feature selection process is independent of the learning algorithm. In the wrapper approach, the performance of the learning algorithm is evaluated iteratively to assess the quality of the selected features [5].

The paper is organized as follows. Information and concepts regarding the bio-inspired algorithms used are given in Sect. 2. The proposed approach and the experiment performed are discussed in Sect. 3. The experimental results are described in Sect. 4. Lastly, a conclusion is stated and the future scope is explored.
2 Literature Review

Nature is a vast source of insight for solving complicated and hard problems, as it exhibits robust, diverse, dynamic, and fascinating phenomena. It consistently finds the optimal solution to a problem and maintains a balance among its constituents. Nature-inspired algorithms are metaheuristics that imitate this concept to solve optimization problems, thus creating a new area in computation [3]. Optimization aims to solve problems in the best possible manner. It focuses on finding a method that not only satisfies all constraints but also achieves the highest possible performance. Bio-inspired algorithms are an effective way to achieve this optimization [6].

Particle swarm optimization (PSO) is a nature-inspired optimization algorithm that resembles evolutionary algorithms because it also uses a group of particles, corresponding to individuals, to perform the search [7]. Each particle in PSO represents a feasible solution to the problem being studied. A particle adjusts its position, by continually updating its velocity, toward the best position previously found in the swarm, flying around a multidimensional search space until a new location is reached. BPSO is a discrete binary version of PSO that is used in this paper to decide whether or not to include a feature during feature selection. In binary PSO, each position component is a bit that is in one state or the other, and the (transformed) velocity is interpreted as the probability of that bit being 1. Thus, the velocity is restricted to the range [0, 1], and the particle's movement in the state space is limited to 0 and 1 in each dimension [8].

The firefly algorithm is a bio-inspired global stochastic approach developed by Yang for optimization problems. It is also a metaheuristic approach and is based on a population of fireflies.
Each firefly represents a potential solution in the search space. The algorithm mimics the mating and light-flash-based information exchange mechanisms of fireflies [9]. The approach rests on three idealized rules. First, all fireflies are considered unisex, so any firefly can be attracted to any other regardless of sex. Second, attractiveness is proportional to light intensity; thus, for any two flashing fireflies, the less bright one moves toward the brighter one. Finally, the light intensity of a firefly is determined by the fitness function [10]. The attractiveness between fireflies decreases as the distance between them increases. A firefly moves randomly when no brighter firefly is present in its surroundings.

The cuttlefish algorithm falls under the category of metaheuristic optimization methods [11]. It uses the color-changing behavior of cuttlefish to solve global optimization problems. Cuttlefish use camouflage, hiding by changing color to match their environment. Three skin layers govern this behavior: the chromatophores, iridophores, and leucophores [12]. The chromatophore layer is a group of cells, located under the skin of the cuttlefish, each comprising an elastic saccule that holds a pigment. Between 15 and 25 muscles are attached to this saccule. Iridophores are layered platelets that lie below the chromatophores; in some species they are chitinous and in others protein-based. They work by reflecting light. The white spots that appear on some species of cuttlefish, squid, and octopus are due to the leucophores, which are flat, branched cells that scatter and reflect the incoming light [13].
3 Proposed Work/Methodology

We present a modified version of the cuttlefish algorithm and then perform a comparative study of the three bio-inspired algorithms mentioned above, i.e., modified CFA, BPSO, and BFA. All of the algorithms that we use follow the wrapper approach, meaning that a learning algorithm is used during training to evaluate the performance of the selected feature subset.
3.1 Binary Particle Swarm Optimization (BPSO)

We represent all the features in binary form, where a selected feature is assigned '1' and an unselected feature is assigned '0'. Each particle works on a different string of features. Initially, features are selected randomly, and each particle then finds its best possible local solution. If this solution is also the best among all the particles, it is selected as the global best. The process is repeated for all particles over multiple iterations. We chose a fixed number of iterations as the stop condition.
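One update step of a binary particle could look like the sketch below. It follows the commonly used binary PSO scheme in which a sigmoid turns the velocity into a bit probability, which is an assumption here because the paper's pseudocode is not available; the function names are hypothetical, and the default values of `w`, `c1`, `c2` and `vmax` simply mirror the BPSO parameter table.

```python
import math
import random


def sigmoid(v):
    """Map a real-valued velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_step(position, velocity, pbest, gbest,
              w=0.5, c1=0.5, c2=0.5, vmax=4.0, rng=random):
    """Update one particle; every list holds one entry per feature.

    position, pbest and gbest are 0/1 feature masks. This is a sketch of
    the standard sigmoid-based binary PSO, not the authors' exact code.
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        # Pull toward the particle's own best and the swarm's best mask.
        v = w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-vmax, min(vmax, v))  # keep the velocity within [-vmax, vmax]
        # The feature is selected with probability sigmoid(v).
        bit = 1 if rng.random() < sigmoid(v) else 0
        new_velocity.append(v)
        new_position.append(bit)
    return new_position, new_velocity


if __name__ == "__main__":
    rng = random.Random(0)
    pos, vel = bpso_step([0, 1, 0, 1], [0.0] * 4, [1, 1, 0, 0], [1, 0, 0, 1], rng=rng)
    print(pos, vel)
```

In a wrapper setting, the fitness of each new mask would then be obtained by training and scoring a classifier on the selected features, and the personal and global bests would be updated accordingly.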
3.2 Binary Firefly Algorithm (BFA)

Fireflies are initialized with randomly selected features. In our model, every firefly in the swarm is initialized with a binary sequence, just as in the binary particle swarm algorithm. The brightness of a firefly is proportional to the value of the fitness function of the optimization problem.
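The movement of a dimmer firefly toward a brighter one could be sketched as follows. The attractiveness formula $\beta_0 e^{-\gamma r^2}$ is the standard one from the firefly algorithm, but the binarization used here (Hamming distance, copying bits with probability equal to the attractiveness, and random bit flips with probability `alpha`) is only one possible choice and is an assumption, as are the function names; the defaults mirror the BFA parameter table.

```python
import math
import random


def attractiveness(distance, beta0=0.25, gamma=0.20):
    """Standard firefly attractiveness, which decays with distance."""
    return beta0 * math.exp(-gamma * distance ** 2)


def move_towards(dim, bright, beta0=0.25, gamma=0.20, alpha=0.20, rng=random):
    """Move the dimmer mask `dim` toward the brighter mask `bright`.

    Both arguments are 0/1 feature masks. The bit-copy and bit-flip rules
    are assumptions made for this sketch, not the authors' method.
    """
    distance = sum(a != b for a, b in zip(dim, bright))  # Hamming distance
    beta = attractiveness(distance, beta0, gamma)
    moved = []
    for own_bit, bright_bit in zip(dim, bright):
        bit = own_bit
        if rng.random() < beta:   # attraction: adopt the brighter firefly's bit
            bit = bright_bit
        if rng.random() < alpha:  # random walk: flip the bit
            bit = 1 - bit
        moved.append(bit)
    return moved


if __name__ == "__main__":
    rng = random.Random(1)
    print(move_towards([0, 0, 1, 0], [1, 1, 1, 0], rng=rng))
```

Because the attractiveness shrinks with the squared distance, fireflies whose feature masks differ in many positions influence each other only weakly.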
3.3 Modified Cuttlefish Algorithm (MCFA)

The original cuttlefish algorithm was designed to work on a mathematical function, or possibly a single feature. We have modified it so that it can be used with multiple attributes for feature selection. MCFA consists of two main processes: the visibility process and the reflection process. The reflection process models the reflection of light, and the visibility process models the matching of patterns. Together these processes help in finding the optimal global solution: each new solution is the result of both the reflection and the visibility process. The color-changing procedure at the different layers of the cuttlefish is the basis for selecting the best features [14, 15]. In the feature selection process, we use the random forest classifier (RFC) with MCFA. We compute its accuracy and use it in the fitness function to examine each feature's importance. The final output of the feature selection process, i.e., the best possible subset of features, is also evaluated by RFC to calculate the classification accuracy. The fitness function favors the feature subset that has the fewest features and gives the highest classification accuracy.
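A fitness function with the two goals stated above (high classification accuracy, few features) might be sketched as a weighted sum. The weighted form, the `weight` value and the `evaluate_accuracy` callback are assumptions made for illustration; the paper states only that RFC accuracy is used in the fitness function, and in practice the callback would train and score a random forest on the selected columns.

```python
def subset_fitness(mask, evaluate_accuracy, weight=0.9):
    """Score a 0/1 feature mask; higher is better.

    evaluate_accuracy(selected_indices) -> accuracy in [0, 1] is a
    hypothetical callback standing in for the RFC evaluation.
    """
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty subset cannot be classified
    accuracy = evaluate_accuracy(selected)
    sparsity = 1.0 - len(selected) / len(mask)  # 0 when every feature is used
    # Assumed trade-off: mostly accuracy, with a small reward for fewer features.
    return weight * accuracy + (1.0 - weight) * sparsity


if __name__ == "__main__":
    # Toy evaluator: pretend that only feature 0 carries useful signal.
    def toy_accuracy(selected):
        return 0.95 if 0 in selected else 0.60

    print(subset_fitness([1, 0, 0, 0], toy_accuracy))
    print(subset_fitness([1, 1, 1, 1], toy_accuracy))
```

With the toy evaluator, the mask that keeps only the informative feature scores higher than the mask that keeps all features, even though both reach the same accuracy.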
4 Implementation and Results

All three algorithms described above were implemented and their results observed. Initially, classification was performed without feature selection using four classifiers: k-nearest neighbors, random forest, decision tree, and support vector machine. Then, the binary particle swarm, binary firefly, and cuttlefish algorithms were implemented with the random forest classifier.
4.1 Dataset

Several benchmark datasets are used for the evaluation of intrusion detection systems. Researchers have used datasets such as KDD99, DARPA98, NSL-KDD, ADFA13, and ISC2012. The major problem with such datasets is that real-world attacks differ from the artificially created ones. This paper uses the intrusion detection evaluation dataset CICIDS2017, which was created with the generation of realistic background traffic as the main priority [16]. A set of criteria was identified as necessary for building a reliable dataset, and most previous IDS datasets did not consider all of them: metadata, complete capture, heterogeneity, feature set, attack diversity, available protocols, complete interaction, labeled dataset, complete traffic, and complete network configuration. The dataset contains around 78 attributes. Six different attack scenarios were considered in the making of the dataset, namely infiltration of the network from inside, unsuccessful infiltration of the network from inside, denial of service (DoS), a collection of Web application attacks, brute force attacks, and recent attacks. Some of the attributes in the dataset are as follows:

i. Destination port number
ii. Duration of packet flow
iii. Total number of forward packets
iv. Header length
v. Length of packets
vi. Synchronization flag
vii. Acknowledgment flag
viii. Push flag
ix. Urgent flag
x. Forward packet to backward packet flow ratio
4.2 Machine Learning Classifiers

Classification is the process of arranging or grouping things into different classes or categories. A classifier learns the trends and patterns from a training set that contains pre-labeled data, i.e., data that is already classified.

(i) K-Nearest Neighbors (KNN). It is a simple algorithm that is widely used for both classification and regression problems. It is based on feature similarity: an object is assigned the class chosen by the majority vote of its k nearest neighbors (data points). No model is built in this classification. It is also called a lazy algorithm because it does not generalize from the training data, which means that the training phase is absent or very short. All, or nearly all, of the training data is used during the testing phase.

(ii) Decision Tree Classifier (DTC). The decision tree classifier uses a tree representation to solve problems. It is a flowchart-like structure in which each internal node represents a feature, each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is known as the root node. The tree learns to partition the data based on attribute values. The partitioning takes place recursively until all outcomes are reached or a specific stop condition is met. Its flowchart-like structure helps in visualization and easy decision making.

(iii) Random Forest Classifier (RFC). The random forest classifier is an ensemble of decision trees. The trees are mostly trained with the "bagging" method, which refers to the technique of combining learning models for better accuracy. A number of decision trees are created from random subsets of the training set. The results of the different trees are then aggregated to decide the final output for the object in question.

(iv) Support Vector Machine (SVM). The support vector machine (SVM) uses a hyperplane to separate the classes. It can be used for both classification and regression but is mainly used for classification problems.

In Table 1, we specify the parameters used in the classifiers. Parameters used in BPSO, BFA, and MCFA are specified in Tables 2, 3, and 4, respectively.
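The majority-vote rule described for KNN above can be illustrated with a short, dependency-free sketch. The function name, the Euclidean distance and the toy data are assumptions made for this example; they are not the classifier configuration used in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # "Lazy" learning: there is no training step, distances are computed on demand.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up two-feature flows: three benign and three attack samples.
    points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
    labels = ["benign", "benign", "benign", "attack", "attack", "attack"]
    print(knn_predict(points, labels, (0.2, 0.1), k=3))
    print(knn_predict(points, labels, (5.5, 5.5), k=3))
```

Because every prediction scans the whole training set, the cost of KNN is paid at test time, which matches the description of it as a lazy algorithm.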
Table 1 Input parameters for classifiers

| Classifier | Parameters |
|---|---|
| KNN-1 | n_neighbors = 30 |
| KNN-2 | n_neighbors = 300 |
| Random forest 1 | n_estimators = 10, random_state = 0 |
| Random forest 2 | n_estimators = 100, random_state = 0 |
| SVM | gamma = 'auto', probability = True |

Table 2 Input parameters for BPSO

| Parameter | Value |
|---|---|
| Population size | n = 20 |
| Maximum number of iterations | max_iter = 200 (default: max_iter = 300) |
| Move rate | w1 = 0.5 |
| Constants | c1 = 0.5, c2 = 0.5 (default: c1 = 1, c2 = 1) |
| Limit search range of vmax | vmax = 4 |

Table 3 Input parameters for BFA

| Parameter | Value |
|---|---|
| Population size | n = 20 |
| Maximum iteration | max_iter = 25 (default: max_iter = 300) |
| Gamma | gamma = 0.20 |
| Alpha | alpha = 0.20 |
| Beta | beta = 0.25 |

Table 4 Input parameters for MCFA

| Parameter | Value |
|---|---|
| Population size | n = 76 |
| Number of iterations | its = 10 |
| Constants for group 1 | r1 = 2, r2 = −1, v = 1 |
| Constants for group 2 | v1 = 1.5, v2 = −1.5, r = 1 |
| Constants for group 3 | v1 = 1, v2 = −1, r = 1 |
5 Results

We first performed classification with the classifiers on the complete feature set, i.e., without any feature selection. The feature importance computed by the classifiers is shown in Figs. 1, 2, and 3. We then implemented the three bio-inspired algorithms; the resulting feature importance is shown in Figs. 4, 5, and 6. The feature importance is the mean over all the iterations performed for each algorithm: 20 for BPSO, 15 for BFA, and 10 for MCFA. The accuracies of the classification performed with and without feature selection are shown in Figs. 7 and 8. The number of features selected by each of the applied bio-inspired algorithms is shown in Fig. 9.

Fig. 1 Decision tree

Fig. 2 Random forest 1 (no. of trees = 10)

Fig. 3 Random forest 2 (no. of trees = 100)
Fig. 4 Binary particle swarm optimization
Fig. 5 Binary firefly algorithm
6 Conclusion

Intrusion detection systems are a critical part of network security. The primary purpose of an IDS is to detect suspicious behavior on the network and protect it from attacks. The IDS needs to learn what an attack looks like and what regular traffic looks like. We have presented a method to improve this learning process. We modified the cuttlefish algorithm and used it for feature selection along with the random forest classifier. We also compared this algorithm with two other bio-inspired algorithms, namely the binary particle swarm optimization algorithm and the binary firefly algorithm. Initially, classification was performed without feature selection using multiple classifiers, namely KNN, decision tree, and random forest. After feature selection, we observed an increase in classification accuracy compared with classification without feature selection. Feature selection improved the accuracy significantly and also decreased the computation time. In the future, a hybrid of two bio-inspired algorithms could be used for intrusion detection, as such hybrids can provide results with higher accuracy in less computational time.

Fig. 6 Modified cuttlefish algorithm

Fig. 7 Accuracy without feature selection
Fig. 8 Accuracy with feature selection
Fig. 9 Number of features selected
References

1. P.U. Kadam, M. Deshmukh, Various approaches for intrusion detection system: an overview 2(11), Nov 2014. Accessed on 15 Jan 2019. [Online]. Available http://www.ijircce.com/upload/2014/november/38X_Various.pdf
2. Basic intrusion detection system. elprocus.com, Accessed on 16 Jan 2019. [Online]. Available https://www.elprocus.com/basic-intrusion-detection-system/
3. S. Binitha, S.S. Sathya, A survey of bio-inspired optimization algorithms. Int. J. Soft Comput. Eng. IJSCE. 2(2). ISSN: 2231-2307, May 2012
4. N. Pazhaniraja, P.V. Paul, G. Roja, K. Shanmugapriya, B. Sonali, A study on recent bio-inspired optimization algorithms, in 2017 Fourth International Conference on Signal Processing, Communication and Networking (ICSCN). https://doi.org/10.1109/icscn.2017.8085674
5. J. Kaliappan, R. Thiagarajan, K. Sundararajan, Intrusion detection using artificial neural networks (ANN) with the best set of features. Int. Arab J. Inf. Technol. (2015)
6. N. Pazhaniraja, P.V. Paul, G. Roja, K. Shanmugapriya, B. Sonali, A study on recent bio-inspired optimization algorithms, in 2017 Fourth International Conference on Signal Processing, Communication, and Networking (ICSCN). https://doi.org/10.1109/icscn.2017.8085674 (2017)
7. H. Nezamabadi-pour, M. Rostami-shahrbabaki, M.M. Farsangi, Binary particle swarm optimization: challenges and new solutions. J. Comput. Soc. Iran CSI Comput. Sci. Eng. JCSE 6(1-A), 21–32 (2008)
8. M.A. Khanesar, M. Teshnehlab, M.A. Shoorehdeli, A novel binary particle swarm optimization, in 2007 Mediterranean Conference on Control and Automation. https://doi.org/10.1109/med.2007.4433821 (2007)
9. R.F. Najeeb, B.N. Dhannoon, A feature selection approach using binary firefly algorithm for network intrusion detection system. ARPN J. Eng. Appl. Sci. 13(6). ISSN 1819-6608, Mar 2018
10. V. Subha, D. Murugan, Opposition-based firefly algorithm optimized feature subset selection approach for fetal risk anticipation. Mach. Learn. Appl. Int. J. MLAIJ 3(2), https://doi.org/10.5121/mlaij.2016.3205, June 2016
11. A.S. Eesa, A.M.A. Brifcani, Z. Orman, Cuttlefish algorithm – a novel bio-inspired optimization algorithm. Int. J. Sci. Eng. Res. 4(9) (2013)
12. M.E. Riffi, M. Bouzidi, Discrete cuttlefish optimization algorithm to solve the traveling salesman problem, in Third World Conference on Complex Systems (WCCS). https://doi.org/10.1109/icocs.2015.7483231
13. Y. Arshak, A. Eesa, A new dimensional reduction based on cuttlefish algorithm for human cancer gene expression, in International Conference on Advanced Science and Engineering (ICOASE). Kurdistan Region, Iraq. https://doi.org/10.1109/icoase.2018.8548908 (2018)
14. A.S. Eesa, Z. Orman, A.M.A. Brifcani, A novel feature-selection approach based on the cuttlefish optimization algorithm for intrusion detection systems. Expert Syst. Appl. 42(5), 2670–2679 (2015)
15. M. Suganthi, V. Karunakaran, Instance selection and feature extraction using cuttlefish optimization algorithm and principal component analysis using a decision tree. Cluster Comput. (2018)
16. I. Sharafaldin, A. Gharib, A.H. Lashkari, A.A. Ghorbani, Towards a reliable intrusion detection benchmark dataset. J. Softw. Network. 177–200. https://doi.org/10.13052/jsn2445-9739.2017.009
An Analysis on Incompetent Search Engine and Its Search Engine Optimization (SEO) Nilesh Kumar Jadav and Saurabh Shrivastava
Abstract Search engine optimization (SEO) is the practice of optimizing a website so that it ranks better than its competitors in search results; the ultimate goal of SEO is to bring traffic to the brand's website. In earlier times there were few websites, so competition among them was limited. Today the number of websites is enormous, and many people resort to unethical methods to raise their page rank. This paper covers the weak points of SEO and of search engines. It analyzes how search engines fall short when crawling webpages, and how this shortfall, together with the resulting page rank, affects the SEO effort of website owners.
Keywords SEO · Search engines · Internet · Websites · Page rank
N. K. Jadav (B) · S. Shrivastava
Department of Computer Engineering, MEF Group of Institutions, Rajkot, Gujarat 360003, India
© Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_15
1 Introduction
The Internet is a hub of information, and most of this information is held in websites. These websites carry many kinds of content, which we collectively call multimedia information. At present, approximately 644 million websites are active on the Internet. With so many websites, users cannot easily tell good sites from bad ones, so sites need to be ranked so that a user gets a useful result for each request. In the early 1990s, when search engines first began retrieving information from websites, more and more websites were connected to the Internet every day. Information was easy to find, but at a cost: many of those websites did not carry high-quality information, and website owners used black hat SEO to improve their rankings and attract potential users. The authors of [1] were the first to introduce “page rank,” which retrieves information based on quality rather than on keywords alone. In [2], the author reviewed SEO and its relevant techniques and analyzed various factors by which SEO can be improved. The work in [3] explains Google's Hummingbird update for calculating ranks and compares various Google search algorithms. To maintain a website's rank, we have to perform optimization, specifically search engine optimization. Search engine optimization is the process of bringing more traffic to a website with the help of different SEO strategies, which are elaborated in Sect. 2.
2 Search Engine Optimization
Keywords are an essential aspect of both SEO and search engines. When a user submits a request to a search engine, it is important to put suitable keywords in the request; likewise, from the SEO side, it is vital to include good keywords in the website. Figure 1 shows the working of the Google search algorithm. Initially, the user enters keywords in the Google search bar. Spiders take those keywords to the Google database center (which holds millions of websites), where they follow a matching procedure: they match the query keywords against millions of websites and extract the websites that contain the same keywords. The extracted results are saved in a result folder. Inside the result folder, the websites are in random order and cannot be distinguished from one another; therefore, Google provides a page ranking algorithm that works closely with SEO [4] and assigns a rank to each website, after which the results are arranged in ranked order.
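The two steps described above (match the query keywords against indexed pages, then order the matches by rank) can be sketched as follows. This is only a minimal illustration and not Google's actual algorithm; the `INDEX` data, the `search` function, and the rank values are hypothetical.

```python
# Hypothetical mini-index: URL -> (set of page keywords, page rank on a 0-10 scale).
INDEX = {
    "https://a.example": ({"seo", "ranking", "google"}, 7),
    "https://b.example": ({"cooking", "recipes"}, 9),
    "https://c.example": ({"seo", "keywords"}, 4),
}


def search(query, index=INDEX):
    """Return URLs sharing at least one keyword with the query, highest rank first."""
    terms = set(query.lower().split())
    # Matching step: keep pages whose keyword set overlaps the query terms.
    matches = [url for url, (keywords, _) in index.items() if terms & keywords]
    # Ranking step: order the otherwise unordered matches by page rank.
    return sorted(matches, key=lambda url: index[url][1], reverse=True)


results = search("SEO ranking")
```

The page with the highest rank in the index is not returned here because it shares no keyword with the query, which mirrors the "match first, rank second" order in the text.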
Fig. 1 Google search result
2.1 Page Ranking
Two types of page ranking are followed in SEO: Alexa rank and page rank (PR). The difference between them is that Alexa rank gives a global rank ranging from 0 to millions, whereas PR gives a website a range-based rank from 0 to 10. Page rank is an algorithm used by Google to rank websites in its search results. The highest rank is 10 and the lowest is 0. A website with page rank 10 has the best resources and information, while a website with rank 0 does not have much reliable information; therefore, the audience prefers websites with a higher PR. Xiang [5] suggested improved techniques to raise the efficiency of page rank and fetch coherent results. Furthermore, SEO practitioners use different styles of optimization techniques to increase the rank; however, this is a slow process when the information or resources are not of trending concern. Page rank can be calculated as

$$PR(x) = \sum_{y \in B_x} \frac{PR(y)}{L(y)} \qquad (1)$$

where PR(x) is the page rank of website x, which depends on the page rank PR(y) of each website y in B_x, the set of websites that link to x, and L(y) is the number of outgoing links from website y.
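Equation (1) defines each rank in terms of other ranks, so in practice it is evaluated iteratively. The sketch below repeats the update of Eq. (1) on a small, made-up link graph. The function name, the graph, and the iteration count are assumptions for illustration; the ranks here sum to 1 rather than being mapped to the 0 to 10 scale, and the damping factor used in the full PageRank formulation is omitted because Eq. (1) does not include it.

```python
def simple_page_rank(links, iterations=100):
    """Iterate Eq. (1): PR(x) = sum of PR(y) / L(y) over pages y linking to x.

    `links` maps each page to the list of pages it links to. Every page is
    assumed to appear as a key and to have at least one outgoing link.
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}  # uniform starting ranks
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for source, targets in links.items():
            share = rank[source] / len(targets)  # PR(y) / L(y)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simple_page_rank(graph)
```

For this graph the iteration settles at PR(A) = 0.4, PR(B) = 0.2, and PR(C) = 0.4, which satisfies Eq. (1) term by term: for example, PR(C) = PR(A)/2 + PR(B)/1.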
2.2 On-Page Optimization
On-page optimization deals with what the site is about. It concerns the meta tags (title, heading, keywords) included in the website so that crawlers can easily index it. The title and heading tags describe the website; they should be unique and topical so that crawlers can trace them easily [6]. Keywords play a very important role in indexing: with better and more suitable keywords, users can reach the website more easily. Moreover, on-page optimization [7] deals with the content of a webpage. Content is what makes a website worthwhile, because it is the only reason users visit; a website that provides trustworthy content is on the right path in terms of SEO. Once the content is highly appreciated, it is widely linked from various resources, either as links or as references. The website uniform resource locator (URL) also falls into the on-page category. A URL has a particular structure (http://en.wikipedia.org/wiki/Super_Mario_World); if a website does not follow that structure, it will be ignored by various search engines. On-page SEO [8] can be improved significantly by improving the site's linking behavior (linking back to category and subcategory pages); there should be proper navigational links that lead back and forth between the pages of a website. Image alt text helps users understand a specific image when the image cannot be viewed or displayed properly.
(A) Page Titles: Page titles are short descriptions of webpages that appear at the top of the browser window. Every webpage has its own title, which signifies the information the page contains. There is no strict rule, but a title should be 50–60 characters long so that it is displayed properly. Do not overuse keywords in the title tag, as this leads to over-optimization (Table 1).
(B) Meta Description: Meta descriptions are short summaries that appear with a result when a relevant keyword is searched on a search engine; they let a user learn about the content before clicking the link (Table 2).
(C) URL Structure: The uniform resource locator (URL) is the web address that users use to enter the website. It also specifies the protocol used by the domain, such as HTTPS, HTTP, or FTP.
Table 1 Rules for page titles
S. No.  Rules for page titles
1  Title length should be 50–60 characters
2  Do not stuff the title with keywords
3  The title should be unique and should appeal to a large audience
4  Relevant keywords should be placed first in the title
5  A page title should exist on every page
6  Use lowercase and uppercase characters appropriately

Table 2 Rules for meta descriptions
S. No.  Rules for meta descriptions
1  Use a description of around 155 characters
2  Include a “call-to-action” to improve your bounce rate
3  Use a focused keyword
4  The description should match the content
5  The description should be unique
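Some of the page title rules in Table 1 are mechanical enough to check automatically. The sketch below covers three of them (length, keyword stuffing, and keyword position). The function name `check_title`, the messages, and the "more than one occurrence counts as stuffing" threshold are assumptions made for illustration, not rules taken from the paper.

```python
def check_title(title, keyword):
    """Return a list of Table 1 rule violations for a page title."""
    problems = []
    # Rule 1: length of 50-60 characters.
    if not 50 <= len(title) <= 60:
        problems.append("length should be 50-60 characters")
    # Rule 2: no keyword stuffing (assumed here: the keyword occurs at most once).
    words = title.lower().split()
    if words.count(keyword.lower()) > 1:
        problems.append("keyword repeated (possible stuffing)")
    # Rule 4: the relevant keyword should come first.
    if not title.lower().startswith(keyword.lower()):
        problems.append("keyword should come first")
    return problems


good = check_title("SEO basics: a practical guide to ranking your new website", "SEO")
bad = check_title("Buy shoes shoes shoes", "shoes")
```

Here `good` is an empty list, while `bad` reports all three problems.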
(D) Heading Tags: As the name suggests, heading tags create headings in the content, to set one piece of content apart from another and to divide paragraphs into digestible sections. There are six levels of heading tags, from H1 to H6, with H1 being the highest level and H6 the lowest (Tables 3 and 4).
(E) Keyword Density: Keyword density is the ratio of the number of times a keyword appears on a webpage to the total number of words on that page, expressed as a percentage (Tables 5, 6, 7, 8, 9, 10, and 11).

Table 3 Rules for URL structure
S. No.  Rules for URL structure
1  The URL should contain applicable keywords
2  The URL should be structured and easy to read
3  Remove any symbols and signs other than hyphens and underscores
4  Use “canonical” URLs: if the same page is reachable under different URLs, either redirect them or canonicalize them
5  The URL should match the webpage content and title
6  The URL should be short, and the number of folders should be limited, since many folders give the impression that the site is bigger than it actually is
7  Take care of case sensitivity while creating URLs, and note that the backslash (\) is treated differently on Windows and on Linux
8  Use suitable tracking parameters
9  Use paginated URLs
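Rules 2 and 3 of Table 3 (readable URLs, no symbols other than hyphens and underscores) are often applied by deriving the URL path from the page title. The sketch below shows one way to do this; the `slugify` name is hypothetical, and lowercasing everything is an assumption made here to sidestep the case-sensitivity issue in rule 7.

```python
import re


def slugify(title):
    """Build a readable URL path segment from a page title."""
    slug = title.strip().lower()
    slug = re.sub(r"\s+", "-", slug)         # whitespace becomes hyphens
    slug = re.sub(r"[^a-z0-9_-]", "", slug)  # drop symbols other than - and _
    slug = re.sub(r"-{2,}", "-", slug)       # collapse repeated hyphens
    return slug.strip("-")


slug = slugify("  SEO & Page Rank: 10 Tips ")
```

The example title becomes `seo-page-rank-10-tips`, which keeps the keywords and removes the ampersand and colon.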
Table 4 Rules for heading tags
S. No.  Rules for heading tags
1  Use your keywords in heading tags
2  The webpage should follow the hierarchy of <h1> to <h6> tags
3  Use the <h1> tag only once, to describe the topic of the page
4  <h1> should be 20–70 characters in length
Table 5 Rules for keyword density
S. No.  Rules for keyword density
1  Do not overuse or stuff keywords in the content
2  Use keyword variants
3  Use keywords in the necessary places, such as the meta description, title, content, URL, and social media
4  Should be 20–70 characters in length
Table 6 Rules for image optimization
S. No.  Rules for image optimization
1  Images should have reasonable and correct names
2  Images should use a preferred and popular format (GIF, JPEG, or PNG)
3  Use responsive images
4  Never use stock images; it is better to create images from scratch
5  File size should be reasonable: small, but with high dots per inch (DPI)
6  Include short captions for your images
7  Always host your images with an ALT tag
8  If a website has many images, it is a good step to create an image sitemap
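Rule 7 of Table 6 (every image carries ALT text) can be verified automatically. The following sketch uses Python's standard `html.parser` module to list images that have no `alt` attribute or an empty one. The class name and the sample HTML are made up for illustration, and treating an empty `alt` as a violation is an assumption based on the rule as stated in the table.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag that lacks a non-empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if not attributes.get("alt"):
                self.missing.append(attributes.get("src", ""))


html = '<p><img src="logo.png" alt="Company logo"><img src="chart.png"></p>'
finder = MissingAltFinder()
finder.feed(html)
```

After parsing, `finder.missing` contains only `chart.png`, the image without ALT text.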
Table 7 Rules for meta tags
S. No.  Rules for meta tags
1  The title tag should be placed inside the <head> tag
2  The description tag can range from 150 to 180 characters in length
3  Important keywords should be placed at the beginning
4  All images must have an ALT tag
$$\text{Keyword Density} = \frac{\text{Number of Keywords}}{\text{Total Number of Words}} \times 100$$
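As a worked example of the keyword density formula, the sketch below counts how often a keyword occurs among the words of a text. The tokenization rule (lowercase letters, digits, and apostrophes) and the function name are assumptions; real SEO tools may count words differently.

```python
import re


def keyword_density(text, keyword):
    """Percentage of words in `text` that equal `keyword` (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) * 100 / len(words)


sample = "SEO tips: good SEO starts with content, not with tricks."
density = keyword_density(sample, "seo")
```

The sample has 10 words, 2 of which are the keyword, so the density is 2 × 100 / 10 = 20%.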
Table 8 Rules for internal linking
S. No.  Rules for internal linking
1  Use links that are relevant to the user and the content
2  SEO also depends on the number of internal links used on a page
3  Use “no-follow” links so that search engines will not index such pages, for example, a login page
4  Use anchor text wisely when creating internal links
Table 9 Rules for posting long content
S. No.  Rules for posting long content
1  Visual aids such as slideshows, infographics, GIFs, and videos are helpful when creating long texts
2  The ideal length of a blog post is 1600 words
3  The ideal width of a paragraph is 40–55 characters
4  The ideal length of a domain name is 8 characters
5  Know the target audience and formulate the content accordingly
6  Always use active voice while writing lengthy content
7  Technical jargon should be used only to a limited extent, since the web serves both technical and non-technical readers; choose keywords and text wisely
Table 10 Rules for external linking
S. No.  Rules for external linking
1  Before adding an external link, test the domain first; if it is malicious, do not include it
2  Link to pages with a higher page rank
3  The anchor text should match the external link
4  Always fix broken links (external or internal)
Table 11 Rules for sharing buttons
S. No.  Rules for social sharing buttons
1  Find a preferred location to place social media buttons
2  Do not overuse social sharing buttons; for example, there is no use for them on a contact-us page
3  Always use correctly mapped social buttons
4  Icons and colors should match the original social buttons
5  Place social media buttons according to their popularity and trend
6  Sharing buttons should be precisely mapped to their links
2.3 Off-Page Optimization
Off-page optimization examines how prominent a website is among other websites. It depends heavily on link building, where links to the website must be obtained from other resources. It is closely related to Internet marketing, where the website has to be explicitly marketed on different platforms. The work in [9] shows many important and interactive techniques to improve the optimization result from the user's perspective. There are various ways of doing off-page SEO to market or optimize a website; articles and guest blogging are particularly relevant options. The majority of optimization [10] is done on social media, as it is the fastest-growing market for sharing assets. Off-page optimization is a long-term process and takes time to show improvement; however, it gives a more efficient result compared to on-page optimization. Moreover, off-page optimization is vulnerable to “black hat SEO,” which uses black hat hacking activities to explicitly build links among competitor sites. It has been observed that long articles and blogs bore or frustrate users, who then jump to another website, which worsens the bounce rate of the website. Infographics are a cutting-edge form of content marketing in which all the content is shown in a graphical image; this attracts users to both the content and the images and can eventually improve the bounce rate. Website optimization using traditional methods is unsophisticated, so the rules for optimization need to be upgraded to enhance result efficiency. Rules for different types of off-page SEO are discussed in this section using the tables below (Tables 12, 13, 14, 15, 16, and 17).

Table 12 Rules for guest post
S. No.  Rules for guest post
1  Never guest post for spam, and never include spammy links in your post
2  A post should contain a minimum of 4 links and can go up to 8 links
3  Content should be owned by the author, so that no other author can copy it
4  Content should contain at least 500 words
Table 13 Rules for blogging
S. No.  Rules for blogging
1  The blogging template should be satisfactory and free of glitches
2  Choose a popular domain that is known by many
3  Write engaging, long, quality content
4  The website must be registered with Google Webmaster for faster indexing
5  Carry your on-page SEO over to the blog
6  Interact with your audience; always respond to their comments
Table 14 Rules for social media sharing
S. No.  Rules for social media sharing
1  Frequent sharing can be treated as spam; leave a gap of a few hours between shares
2  Respond to every comment to interact with your audience
3  Hashtags are important and trending; use them in your shared posts
4  Social media is for making your product a brand, so entertain your audience in ways that lead to selling your product
5  Never ask explicitly for likes, comments, and shares
6  Use more of the available character count in your tweets
7  Stop irrelevant and unnecessary tagging of people without permission
8  Following anyone and everyone and joining any group can dilute your brand and damage its reputation
Table 15 Rules for blog commenting
S. No.  Rules for blog commenting
1  Never use links, especially links to your own webpage, in the blog comment section
2  Use your real identity and username while commenting, as it results in seamless networking
3  Always share your blog content with other blogs that have a similar audience
4  Communicate and link up with other blog writers working in the same domain to build your own network
Table 16 Rules for visual content marketing
S. No.  Rules for visual content marketing
1  A post should contain at least 2–3 images, which gives good engagement
2  Instead of long text, a post should include an infographic that gives a summarized view of the post
3  Share your visuals on other social media to build them into a brand
4  Keywords are a crucial part of any SEO, so make sure they stay with your visuals
5  Visuals should be of high quality and good resolution
Table 17 Rules for web 2.0
S. No.  Rules for web 2.0
1  A post should contain at least 2–3 images, which gives good engagement
2  Instead of long text, a post should include an infographic that gives a summarized view of the post
3  Share your visuals on other social media to build them into a brand
4  Keywords are a crucial part of any SEO, so make sure they stay with your visuals
5  Visuals should be of high quality and good resolution
3 Proposed Problem: Differing Google Results on Different Platforms
The way search engines work is somewhat distressing: our analysis found many cases in which search engines do not handle problems well [11]. Consider a general case in which a webpage has been optimized for SEO and has a good rank in the range 0–10. When we enter the webpage's keywords in a search engine and test on two browsers, we get different results in the two browsers, and this is not acceptable, at least from the perspective of SEO. In this paper, we discuss five major problems of search engines that lead to their ineffectiveness. The paper also relates Internet data and its usage problem to the use of search engines, or of the Internet more generally.
3.1 Differing Search Results in Different Browsers
While analyzing the Google search engine, we found that Google's search results appear different in different browsers. We took several trending keywords, compared the results across browsers, and found that different browsers show the results in a different order. On the result page, the first two links are much the same in all browsers, but after that the results change drastically: after the third or fourth link, the links are not in the same order, and sometimes links that appear in Chrome do not appear in Firefox or Opera. The result is therefore dreadful for websites that have gone through good SEO but cannot achieve presence in the results of different browsers. From the perspective of a vendor who has done SEO for a website, this result is devastating: even after effective optimization [12], the results differ between browsers, so the audience forms a varied view of the website. Users cannot properly decide whether the content is effective, since one browser shows the PR rank as 5 and another shows it as 3 for the same website (Fig. 2).
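A comparison of this kind can be made systematic by recording the ordered result URLs from each browser and reporting which entries are missing or sit at a different position. The sketch below is a minimal illustration of that bookkeeping; the function name and the result lists are hypothetical and are not the data collected in this paper.

```python
def compare_rankings(results_a, results_b):
    """Summarize how two ordered result lists differ."""
    common = [url for url in results_a if url in results_b]
    return {
        # URLs that one list shows and the other does not.
        "only_in_a": [url for url in results_a if url not in results_b],
        "only_in_b": [url for url in results_b if url not in results_a],
        # Shared URLs that sit at a different position in each list.
        "moved": [url for url in common
                  if results_a.index(url) != results_b.index(url)],
    }


# Hypothetical top-5 results for one keyword in two browsers.
chrome = ["a.example", "b.example", "c.example", "d.example", "e.example"]
firefox = ["a.example", "b.example", "d.example", "c.example", "f.example"]
diff = compare_rankings(chrome, firefox)
```

In this made-up case the first two links agree, the third and fourth swap places, and the fifth link differs entirely, which is the pattern described above.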
Fig. 2 Differing search results in different browsers, and a user's perspective when comparing both results
3.2 Differing Search Results in Multilingual Mode
The second analysis concerns a limit of SEO, as it is quite demanding to do SEO in multilingual mode. Searches in other languages are comparatively fewer than searches in English; still, we observed something that cannot be ignored. When you search for the keyword “Miss India 2016,” or any other keyword, in English, you get one set of results, but as soon as you change the language mode, the Google result page shows distinct and differently ordered results. Therefore, once again, the results for a website that has done SEO to get a high page rank are shuffled into a disordered manner. When you compare the two results, one in English and the other in a preferred language (which depends on your location), you see a radical change in the results (Fig. 3). While observing the results, we also found that the search differs when we use private browsing, as shown in Fig. 4. The results vary, and some websites are not indexed at all. This is a complication for the website owner, since the website is indexed in one browser but not in another; to get a better page rank, the website should be indexed in every browser. In addition to the normal browser, we also tested the same keyword with an anonymous, private browser, and there was an extreme change in the website ordering; a few websites did not even appear in the results. It is possible that the anonymous browser uses a VPN and proxy, so that the search route differs between browsers; however, the problem still remains for the website owner, who does not get the proper SEO result.
Fig. 3 Differing search results in multilingual mode
Fig. 4 Differing search results in multilingual mode in private browsing
Fig. 5 Differing search results in low-end devices
3.3 Differing Search Results in Low-End Devices
While analyzing the results, we again made the striking observation that the search results for the keyword “Education System in India” differ from browser to browser and also from device to device. The result on our PC differs from the search result on a mobile device. In addition, the results change when we compare browsing in Chrome, Firefox, and Puffin: the links are not in the same order, and some of them again are not indexed properly (Fig. 5).
4 Conclusion
This paper surveys different aspects of SEO that have to be considered in order to deliver effective search results to the audience. The problem lies in the order in which the search results appear in the browser. Website owners carry out rigorous optimization using SEO, which is a lengthy and time-consuming process; furthermore, the outcome of that optimization, the “page rank,” takes time to take effect and is therefore a crucial result for any website owner. We have seen that even when a website's page rank is high, this is not reflected in the results returned by the Google search engine, and hence it is a failure of SEO. This survey is limited to the Google search engine. Three different cases were examined, in which a specific, trending keyword was entered into the search engine and the results were compared across different browsers, such as Internet Explorer, Google Chrome, and Firefox, and across different platforms: on a mobile device, in multilingual mode, and in anonymous browsing. The results were found to be slightly mismatched and unordered when compared across browsers. The inconsistency is not huge, but from the perspective of SEO it is appreciable and has to be acknowledged. This paper also gives an introduction to the ingredients of fruitful SEO: on-page and off-page optimization are discussed, and several rules are noted down that can be applied to any website to get an effective optimization result.
References
1. L. Page, S. Brin, R. Motwani, T. Winograd, The Pagerank Citation Ranking: Bringing Order to the Web, in Technical Report, Stanford Digital Library Technologies Project (1998)
2. E. Ochoa, An Analysis of the Application of Selected Search Engine Optimization Techniques and their effectiveness on Google’s Search Ranking Algorithm (California State University, Northridge, 2012)
3. A. Kakkar, R. Majumdar, A. Kumar, Search engine optimization: a game of page ranking, in 2015 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11 Mar 2015
4. M. Cui, S. Hu, Search engine optimization research for website promotion, in International Conference of Information Technology, Computer Engineering and Management Sciences, Nanjing, Jiangsu, China, 24–25 Sept 2011
5. L. Xiang, Research and improvement of pagerank sort algorithm based on retrieval results, in 7th International Conference on Intelligent Computation Technology and Automation, Changsha, China, 25 Oct 2014
6. X. Zhu, Z. Tan, SEO keyword analysis and its application in website editing system, in 8th International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, China, 21–23 Sept 2012
7. F. Wang, Y. Li, Y. Zhang, An empirical study on the search engine optimization technique and its outcomes, in 2nd International Conference on Artificial Intelligence, Management Science and Electronic Commerce (AIMSEC), Dengleng, China, 8–10 Aug 2011
8. V. Patil, A. Patil, SEO: on-page + off-page analysis, in International Conference on Information, Communication, Engineering and Technology (ICICET), Pune, India, 29–31 Aug 2018
9. J. Lemos, A. Joshi, Search engine optimization to enhance user interaction, in International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, 10–11 Feb 2017
10. J. Killoran, How to use search engine optimization techniques to increase website visibility. IEEE Trans. Prof. Commun. 56(1), 50–66 (2013)
11. S. Duk, D. Bjelobrk, M. Čarapina, SEO in e-commerce: balancing between white and black hat methods, in 36th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, May 2013
12. A. Patil, V. Madhukar Patil, Search engine optimization technique importance, in IEEE Global Conference on Wireless Computing and Networking (GCWCN), Lonavala, India, 23–24 Nov 2018
Infotainment System Using CAN Protocol and System on Module with Qt Application for Formula-Style Electric Vehicles
Rahul M. Patil, K. P. Chethan, Rahul Ramaprasad, H. K. Nithin, and Srujan Rangayyan
Abstract This paper discusses the challenges involved in building an in-vehicle infotainment (IVI) system, the need for one, and the advantages of choosing the CAN protocol for the proposed system. For a formula-style electric car, the main objective of the designed infotainment system is to read various real-time data from the different sensors present in the car and to provide this information to the driver. This information is also logged and can be viewed by the team to assess the safety and performance of the various components of the car and of the car as a whole. The paper further describes the main features of the System on Module (SoM) chosen for this system and the motivation for, and advantages of, using the CAN protocol for intra-vehicular communication. In addition, the design of the interface using the Qt framework and other elements involved in the design of the system are discussed. The system has been tested and validated both in workshop conditions and on track and has proven to be a robust real-time system. The applications of such a system in making crucial data on the vehicle's dynamics available to the driver and the crew are explored. The paper also sheds some light on how valuable it is to have such a system installed in a student-built formula-style electric car.
Rahul M. Patil and K. P. Chethan contributed equally to this work.
About Team Chimera: Started in the year 2006, Team Chimera is the first group in India to successfully transcend into the automotive sector of hybrid and electric technology. We are a multi-dimensional team comprising students with backgrounds in various engineering disciplines such as Electrical and Electronics, Mechanical, and Computer Science. Team Chimera has, to date, undertaken various ventures, which include the E-volve Triple Powered Vehicle (TPV), a Hybrid Auto Rickshaw, a plug-in hybrid from the electric vehicle REVA, and, most recently, building a Formula Hybrid race car and a Formula Electric race car from scratch.
R. M. Patil (B) · K. P. Chethan · R. Ramaprasad · H. K. Nithin · S. Rangayyan
Team Chimera, RV College of Engineering, Bengaluru, India
© Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_16
1 Introduction
Over the past couple of decades, we have witnessed plenty of development in several industries and disciplines, the automotive and information technology industries in particular. These industries have grown in parallel, with great advancements being made in both. In recent times, however, the automotive industry has not focused only on delivering basic transportation facilities to people. Vehicles have evolved from being exclusively modes of transport to much more, with the change in people's lifestyles and the advancement in technology being the principal catalysts for this change. We can now experience infotainment (a combination of information and entertainment) in our vehicles. It plays several key roles in areas such as safety, entertainment, navigation, and comfort, as a result of which several large corporations are investing heavily in infotainment systems. The in-vehicle infotainment (IVI) system is an essential component of vehicles and has been for quite a few years. Because of their importance in delivering information and entertainment to drivers and passengers, IVI systems can be found in almost all types of vehicles. These infotainment systems are divided into several categories, classified by factors such as complexity, size, function, and cost. In the majority of commercially manufactured vehicles, the installed infotainment systems are sophisticated, with functions such as entertainment (AM and FM radio, multimedia support for audio and video, etc.), advanced vehicular functions (in-vehicle environment control, parking assistance, voice assistants, etc.), and other supporting functions (smartphone pairing, etc.). The features and technology needed to provide such functions make these systems extremely costly and power-hungry. Therefore, for a formula-style electric car to perform optimally, the system proposed in this paper has been designed and implemented. It is suitable for limited power supply conditions and proves to be very economical. The system has been developed for a student formula-style electric car and has been designed in a plug-and-play fashion for maximum modularity and ease of use. It uses the CAN protocol for communication and the Qt framework for the main user interface. Its main function is to provide the driver with relevant real-time information and to log this data for future testing and analysis.
2 Motivation With the significant advancement in automotive technology, mainly in the fields of automotive electronics, electronic control of the processes that occur during the operation of the automobile, and the safety features present, the need to develop an efficient networking system to facilitate communication between the various electronic subsystems and the electronic control unit (ECU) arises.
In a formula-style automobile where the speeds are high and the vehicle has to constantly push its limits, an infotainment system that can monitor the various aspects related to the vehicle performance and provide these vitals to the driver and the team provides an extra edge in terms of safety and performance. Through the real-time monitoring and display of various data that indicate the performance of the vehicle, the driver and the team can identify the presence of any safety issue and where optimization is required to improve performance. As a student formula team, the priority is to learn, design, develop, and implement as many of the systems as possible in-house. Thus arises the motivation to develop and implement an in-vehicle infotainment system. The practical applications of such a system are numerous and are crucial to helping the team read and understand various values from the car such as the accumulator voltage, the current being drawn, speed, etc. These values are read from the motor controller and displayed on the screen, thus helping the driver understand better the functioning of the car’s different aspects. The team also benefits from this data as it makes reading the vitals of the car extremely simple. Without an infotainment system, important data like the maximum speed achieved during a test run, maximum current drawn, and range of accumulator voltage would be extremely difficult to approximate.
3 Design Older infotainment systems lacked effective user interfaces and were especially troublesome to test. A significant change from an analog infotainment display to a digital one was made in [1]. Such an interface enormously improves the driver–vehicle interface, and this greatly influenced the selection of a high-resolution LCD screen. The design decision to use the CAN infrastructure was based on the discussion of vehicle applications of the CAN protocol in [2]. In the vehicle industry, the CAN protocol enabled the transition from stand-alone systems to integrated, network-controlled systems. The protocol succeeded mainly because it is a low-cost, modular, standardized protocol for the control of real-time systems. The fundamental working of the proposed system draws significant inspiration from [3], where the authors developed a low-cost microcontroller-based wireless controller area network for a solar car, a case very similar to our formula-style vehicle. That system was developed mainly to obtain data such as velocity, temperature, and battery power in order to monitor the condition of the solar car and ensure that it runs in optimum condition. The proposed system additionally collects relevant information about the motor and the motor controller. Galarza et al. [4] emphasize the importance of a good balance between road safety and driver experience when designing infotainment systems. They designed and evaluated an adaptive human–machine interface (HMI) that reduced the driver’s interaction to a minimum and managed the information flow. Proposed HMI principles were followed while designing the adaptive system, and the participants in the study gave positive feedback on it. The safety features integrated into the proposed system are motivated by the study of an automotive safety system developed using the CAN protocol and a master–slave sensor architecture [5]. Finally, the decision to use the Qt framework is based on the work in [6], wherein a GUI design in the Qt framework is proposed for the educational microcomputer Edulent. That paper highlights the benefits of employing the Qt framework over Visual C++. The Qt framework supports 2D and 3D graphics, SQL, XML, and unit testing. It also supports many compilers and is cross-platform, which makes it well suited to the development process. The developed application can be tested before it is deployed on the target, enabling early validation. It also integrates well with our high-resolution LCD screen, giving the driver a much improved experience. A smart parking system is implemented by Patil et al. [7] by interfacing a set of sensors to microcontrollers, using the CAN protocol for communication. The design of an infotainment cluster, which the driver uses to interpret the vehicle’s behaviour through the displayed information, is discussed in [8]. To implement an embedded GUI in the cluster, the Qt framework is used in a Linux environment and run on a microcontroller device.
4 Specifications of the Automobile Developed The formula-style electric vehicle developed by the team runs on a lithium-ion battery pack, which powers a three-phase induction motor through a motor controller. The control circuitry is powered by a control voltage source obtained by isolating and stepping down the main accumulator voltage (Fig. 1).
4.1 Accumulator The accumulator (the main battery pack) is made up of 30 cells connected in series. The cells are of the LiFePO4 type. Each cell has a nominal voltage of 3.2 V and a capacity of 60 A h. The complete configuration thus has a nominal voltage of 96 V and a capacity of 60 A h. The maximum nominal current is 120 A, and the total capacity is 6.84 kWh.
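The series-pack arithmetic above can be checked with a short sketch. It uses only the cell figures given in the text; the function names are illustrative. With these nominal values the stored energy comes to 96 V × 60 A h = 5.76 kWh. The quoted 6.84 kWh would correspond to roughly 3.8 V per cell, so that figure was presumably computed from a different per-cell voltage, which the text does not state.

```python
# Pack-level figures for a series string of identical cells.
CELLS_IN_SERIES = 30
CELL_NOMINAL_VOLTAGE_V = 3.2
CELL_CAPACITY_AH = 60.0


def pack_nominal_voltage(cells: int, cell_voltage_v: float) -> float:
    """In a series string the cell voltages add; the A h capacity does not."""
    return cells * cell_voltage_v


def pack_energy_kwh(voltage_v: float, capacity_ah: float) -> float:
    """Stored energy in kWh is voltage (V) times capacity (A h) over 1000."""
    return voltage_v * capacity_ah / 1000.0


voltage = pack_nominal_voltage(CELLS_IN_SERIES, CELL_NOMINAL_VOLTAGE_V)
energy = pack_energy_kwh(voltage, CELL_CAPACITY_AH)
print(f"nominal voltage: {voltage:.1f} V, nominal energy: {energy:.2f} kWh")
```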
4.2 Motor As mentioned before, the battery pack drives the motor through a motor controller. The motor controller takes in the DC voltage of the battery pack and converts it into the three-phase AC voltage required to drive the three-phase induction motor. The motor controller used is specifically designed for the particular motor present in the car. The three-phase induction motor is rated at 65 horsepower at 5100 RPM. It can provide a maximum continuous power of 9 kW and a peak power of 48.47 kW for up to 2 min. It provides a nominal torque of 98 Nm and a maximum torque of 111 Nm.
Fig. 1 Illustration of the main systems in student-built formula-style electric vehicle
4.3 Control and Safety Circuits The control and safety circuits run on a 12 V supply provided by an isolated step-down DC–DC converter, which converts the accumulator voltage into the 12 V supply. The safety and control circuits include:
• Shutdown circuit: Consists of several shutdown switches, both manually operated and automatic, any of which shuts down the motor controller when triggered.
• Insulation monitoring device (IMD): Detects any breakdown in high-voltage insulation or any short circuit between the high-voltage and low-voltage circuits, and shuts down the motor controller if one is present.
• Brake system plausibility device (BSPD): Detects any difference in the position of the two adjacent brake pedals; if the difference exceeds permissible limits, a shutdown is triggered.
• Tractive system active light (TSAL): Indicates whether the high-voltage tractive system is on; if the system is on, a red blinking light is seen.
• Ready-to-drive sound (RTDS): Indicates that the car is in a ready-to-drive state by making a distinct sound for a specified period of time.
5 Methodology and Setup Signals from the motor controller include battery voltage, instantaneous current drawn, motor temperature, brake and throttle values, all of which are transmitted over a CAN bus. The system uses the CAN protocol as it provides a robust way to communicate these signals between the devices, sensors, and the SoM. It is also an extremely popular vehicle bus standard for communication, designed for such a system that lacks a host computer. The information transmitted by the motor controller and other devices is essential to both the driver of the vehicle and the off-track personnel. Therefore, there arises a need for a visual interface to display this vital information in real time. For the rest of the team to understand the vehicle’s status to bring about improvement in performance, the system also needs to log data for future analysis. To fulfill these tasks, the system utilizes the open-source Qt toolkit. Qt is a cross-platform embedded and desktop software development toolkit, which provides an extensive C++ framework for developing user interfaces. The reason why Qt best suits the application is the cross-platform support it provides. This essentially means that the code can be tested in the development environment and then can be deployed on the embedded device without any code changes, while retaining the efficiency that comes with the native code. The setup for the proposed work includes the electrical and electronic subsystems that form the basic system driving the motor, which is designed according to the regulations imposed by FSAE (EV), combined with a telemetry subsystem to handle the processing of data and conveying information in the cockpit. The main electrical components in the car include the power supply and the motor controller (Curtis 1238-7601 AC Motor Controller 96 V 650 A). The telemetry subsystem includes a SoM coupled with a 7-inch LCD to convey information. 
The SoM used in this setup is the phyBOARD-Mira, a product of PHYTEC Embedded Pvt. Ltd., shown in Fig. 2. It has an i.MX 6 DualLite processor clocked at 1.2 GHz, 1 GB of RAM, and built-in CAN support, and it runs a custom flavor of Linux developed by PHYTEC Embedded Pvt. Ltd. The motor controller takes inputs such as the brake and accelerator levels and battery parameters like voltage, current, and temperature; based on these, it drives the motor and generates messages for consumption by the telemetry system. These messages are sent over a CAN bus, which the telemetry system uses as the primary source of the data it publishes. The CAN bus provides a fast and robust means of carrying these messages between the telemetry system and the motor controller. It also provides the flexibility to add more sensors or nodes to the bus without any additional transport media. The bus is terminated with a 120 Ω resistor at each end, which it requires in order to operate reliably. The messages are transmitted at a rate of 2.5 Mbaud, which is sufficient for the proposed system.
Fig. 2 phyBOARD-Mira i.MX 6 (view from top and bottom)
Fig. 3 Generic CAN messages from the motor controller: (a) transmitted information from default mailbox addresses 0x601 and 0x602; (b) system bits configuration
The motor controller sends CAN frames (Fig. 3) that are received by the CAN interface built into the phyBOARD-Mira. These frames are processed by the Qt application running on the SoM, with the help of the CAN protocol libraries provided as part of the Qt framework. The application uses an event emitter–listener mechanism to monitor frames on the CAN bus and maintains a queue of the frames received. These frames are processed, and the extracted data is used to update the UI elements in real time (Fig. 4a). The application displays this UI on a 7-inch LCD. During initial test runs, there was a large latency between the reception of the CAN signals from the motor controller and their display on the screen. This was because the data from the CAN frames was being processed and stored in a database at the same time. The problem was solved by flushing the incoming queue at regular intervals and using transactions to commit batches of data to the
database safely. The database is a local log file stored in the SoM’s memory, which is accessed after test runs for analysis. To sum up, the CAN messages transmitted by the controller are consumed by the CAN module present on the board. These messages are read into the OS by the installed CAN drivers and are thereby made available to any software process running on the board. To present the CAN data visually, a Qt application is developed and run on the board. This application renders data such as RPM, battery voltage, current drawn, temperature, motor temperature, brake and accelerator levels, and any faults present (as indicated by the motor controller) on the LCD. This gives the driver instantaneous information about the state of the electrical and electronic subsystems and, more importantly, about faults that may be a threat to safety. This information is logged into a database continuously and analyzed later to obtain insight into the performance of the car in various conditions. Going ahead, one feature that would add value to the system is cloud storage of the data, so that the pit crew can monitor it in real time while the car is running. A web server can be developed that displays the generated data graphically. Since the CAN bus supports many devices on the same bus, more sensors can be added, such as tire pressure monitors and accelerometers, which will provide more information that can be used to improve various areas of the car.
Fig. 4 Design of user interface: (a) minimalistic GUI that appears on the dashboard; (b) dashboard from the driver’s point of view
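The receive, queue, and batch-commit flow described above can be sketched as follows. This is a minimal illustration in Python rather than the Qt/C++ application itself. The payload layout in `PAYLOAD`, the table schema, the batch size, and the use of SQLite as the log store are all assumptions made for the example; the controller's real message layout is the one shown in Fig. 3 and is not reproduced here.

```python
import sqlite3
import struct
from collections import deque

# HYPOTHETICAL payload layout, chosen only for illustration (little-endian):
#   uint16 motor RPM | uint16 battery voltage in 0.1 V |
#   int16 current in 0.1 A | uint8 motor temperature in deg C | uint8 fault code
PAYLOAD = struct.Struct("<HHhBB")


def decode(can_id: int, data: bytes) -> tuple:
    """Turn one 8-byte CAN payload into a row of engineering values."""
    rpm, volt_raw, curr_raw, temp_c, fault = PAYLOAD.unpack(data)
    return (can_id, rpm, volt_raw / 10.0, curr_raw / 10.0, temp_c, fault)


class FrameLogger:
    """Queues decoded frames and writes them to the log in batches."""

    def __init__(self, db_path: str = ":memory:", batch_size: int = 25):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS log (can_id INTEGER, rpm INTEGER, "
            "voltage REAL, current REAL, temp_c INTEGER, fault INTEGER)"
        )
        self.queue = deque()
        self.batch_size = batch_size

    def on_frame(self, can_id: int, data: bytes) -> tuple:
        """Frame-received handler: decode, enqueue, flush when the batch is full."""
        row = decode(can_id, data)
        self.queue.append(row)
        if len(self.queue) >= self.batch_size:
            self.flush()
        return row  # the caller would use this row to refresh the UI

    def flush(self) -> None:
        """Commit all queued rows in a single transaction, then empty the queue."""
        if not self.queue:
            return
        with self.db:  # commits on success, rolls back if an insert fails
            self.db.executemany(
                "INSERT INTO log VALUES (?, ?, ?, ?, ?, ?)", list(self.queue)
            )
        self.queue.clear()

    def row_count(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM log").fetchone()[0]


logger = FrameLogger(batch_size=2)
frame = PAYLOAD.pack(3000, 960, -125, 45, 0)
print(logger.on_frame(0x601, frame))  # decoded row that would feed the display
```

Committing one transaction per batch rather than per frame keeps storage writes off the per-frame path, which is the effect the latency fix above relies on.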
6 Results and Discussion Choosing the right system involves striking a balance between efficiency and cost-effectiveness. Judged by these criteria, the proposed infotainment system is a strong choice. The infotainment system displays voltage and current values in real time and thus provides a clear picture of the power usage trend. This greatly helps in setting up the battery recharge
schedules. Critical parameters such as motor temperature and controller temperature alert the driver to safety issues like overheating, undervoltage, and overvoltage conditions. The display of RPM and speed helps the driver avoid overshooting the safety limits, preventing possible damage to the car and any consequent harm to the driver. All of these help in improving the health of the battery along with the health of the car itself. This information is easily observable by the driver on the vehicle’s dashboard (Fig. 4b). The performance of the system has been checked by logging the CAN signal data received by the current system and comparing it with the data received by another test setup. This test setup comprises an Arduino coupled with an MCP2515 CAN controller and an MCP2551 high-speed CAN transceiver. The number of packets received, their order, and the payload data contained in them are some of the parameters considered for comparison. The results confirmed that the current system received packets accurately with minimal loss and latency. Certain displayed parameters such as speed are derived from the motor’s RPM, which is received from the motor controller in CAN frames. These derived values, calculated in real time, were checked against those obtained from standard measuring devices, and the two agreed to one decimal place. Logged data obtained from test runs performed in real-time conditions is plotted to gain insight into the vehicle’s performance (Fig. 5). The plot shows real-time data processed by the application, which is displayed on the LCD screen for the driver. The metrics logged are plotted for a test run that lasted approximately 80 s. The motor controller transmits roughly 50 CAN frames every second via the CAN bus, of which 25–30 frames are displayed and logged every second by the proposed system.
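The derivation of speed from motor RPM mentioned above can be sketched as follows. The text does not give the drivetrain parameters, so the gear ratio and wheel diameter used here are hypothetical placeholders, and the function name is illustrative.

```python
import math


def speed_kmh(motor_rpm: float, gear_ratio: float, wheel_diameter_m: float) -> float:
    """Vehicle speed from motor RPM, assuming a fixed reduction and no wheel slip."""
    wheel_rpm = motor_rpm / gear_ratio              # reduction between motor and wheel
    metres_per_minute = wheel_rpm * math.pi * wheel_diameter_m
    return metres_per_minute * 60.0 / 1000.0        # m/min -> km/h


# HYPOTHETICAL drivetrain values, not taken from the paper.
GEAR_RATIO = 4.0
WHEEL_DIAMETER_M = 0.5

print(f"{speed_kmh(4000, GEAR_RATIO, WHEEL_DIAMETER_M):.1f} km/h")
```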
The gap between the frames received and the frames processed is accounted for by factors such as the frame processing time of the APIs in the user-space application, which involves frequent graphical rendering updates. Although the application is designed to be efficient, using asynchronous logic wherever required, the graphical elements, the frequent memory accesses caused by database logging, and the display updates introduce a slight overhead. The logged data was verified against the test setup and other monitoring devices and showed minimal deviation in accuracy and latency. Even with dropped frames, the proposed system’s rate of rendering information from
Fig. 5 Plot with data logged during test run in real-time conditions
the received frames is observed to be more than sufficient for practical conditions, be it for analysis or for the driver to monitor car parameters during real-time runs. In general, vehicles (especially formula type) developed by student teams or similar research groups tend not to include integrated infotainment systems, primarily because the entertainment aspect is deemed unnecessary and is considered a non-compulsory expense. Hence, it is usually substituted with a system that does the bare minimum, for example, a single voltmeter that reports only one parameter. For any other essential data collection, traditional measuring devices are used, which may only be usable when the car is stationary. The system described here instead provides an all-in-one solution with multi-parameter gathering capability, while also being economical. Since it can be fitted in the car permanently, the requirement of real-time measurement and logging of parameters is also satisfied. The UI that the driver sees on the LCD screen is created with the Qt framework. A simple yet informative three-color scheme (blue representing safe, yellow representing alert, and red representing danger) is used in the UI to highlight the status of each parameter (Fig. 4a). The progress bar (analog) representation of speed gives the driver an immediate rough impression of the car’s state well before he/she reads the actual speed. The placement of elements on the dashboard UI and the fonts of numbers and text were decided (after several surveys and test runs) so that the UI fits a wide variety of screens while maintaining the elegance and sharpness of modern-day infotainment systems. With a glance at the screen on the dashboard, any driver can get a quick summary of the vitals of the car.
This draws only a fraction of the driver’s attention, leaving him/her more time to concentrate on the track rather than worrying about, or remaining unaware of, his/her own safety and that of the vehicle.
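The three-color status scheme described above can be sketched as a simple threshold mapping. The color meanings come from the text; the threshold values and the names `Status` and `classify` are hypothetical and serve only to illustrate the idea.

```python
from enum import Enum


class Status(Enum):
    SAFE = "blue"
    ALERT = "yellow"
    DANGER = "red"


def classify(value: float, alert_at: float, danger_at: float) -> Status:
    """Map a reading to a status; higher readings are assumed to be worse."""
    if value >= danger_at:
        return Status.DANGER
    if value >= alert_at:
        return Status.ALERT
    return Status.SAFE


# HYPOTHETICAL motor-temperature limits in deg C, not taken from the paper.
MOTOR_TEMP_ALERT_C = 80.0
MOTOR_TEMP_DANGER_C = 100.0

for reading in (45.0, 85.0, 110.0):
    status = classify(reading, MOTOR_TEMP_ALERT_C, MOTOR_TEMP_DANGER_C)
    print(f"{reading:.0f} C -> {status.name} ({status.value})")
```

A parameter for which low values are dangerous, such as battery voltage, would need the comparison reversed or a separate lower limit.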
Fig. 6 Team Chimera’s formula-style electric car on track at Formula Green 2018
7 Conclusion We have designed a robust infotainment system for a formula-style electric car that provides real-time monitoring information about the car and logs the data for future analysis by the team. The results of our tests show the capability of CAN as an effective communication medium between the various devices in an automotive system, with minimal loss and latency. The proposed system uses the CAN protocol to obtain the most important data that requires monitoring. The system can be further improved by adding a wide range of sensors to the car to monitor tire pressure, obtain acceleration data, and so on in real time. The presence of a SoM in our system makes this possible because of the modularity it offers. Additionally, with the advent of IoT and of wireless sensors with a small physical and electrical footprint, the whole system can be integrated with the cloud so that real-time data is continuously sent to the cloud and monitored by the team off-track in the pit. Such a system greatly benefits the team because the car could be monitored from remote locations; for example, when the team participates in events (Fig. 6), a group of individuals can monitor the car’s performance from the workshop itself. The proposed system therefore provides extensibility through the modularity of the hardware used, the robustness of the communication medium, and the cross-platform support of the software, making it well suited to a student formula team in which the car undergoes continuous incremental development. Acknowledgements The authors would like to express their gratitude to Ashraful K. and his colleagues from PHYTEC Embedded Pvt. Ltd. for their quick and efficient technical support. We also thank them for the resources they provided us with, such as the phyBOARD-Mira and sensors.
The authors would like to thank all the members of Team Chimera, RV College of Engineering, for their efforts toward the design and production of the formula-style electric vehicle.
References
1. S. Vijayalakshmi, Vehicle control system implementation using CAN protocol. Int. J. Adv. Res. Electr. Electron. Instrument. Eng. 2(6) (2013)
2. K. Johansson, M. Törngren, L. Nielsen, Vehicle applications of controller area network, in Handbook of Networked and Embedded Control Systems (2005)
3. G.Ch. Hock et al., Development of wireless controller area network using low cost and low power consumption ARM microcontroller for solar car application, in Proceedings of the IEEE International Conference on Control System Computing and Engineering (2011)
4. M.A. Galarza, T. Bayona, J. Paradells, Integration of an adaptive infotainment system in a vehicle and validation in real driving scenarios. Int. J. Veh. Technol. Article ID 4531780 (2017)
5. P.A. Wagh, R.R. Pawar, S.L. Nalbalwar, A review on automotive safety system using CAN protocol. 4(2) (2017). ISSN (Print): 2393-8374, (Online): 2394-0697
6. I. Mezei, Cross-platform GUI for educational microcomputer designed in Qt, in Proceedings of the IEEE East-West Design Test Symposium (EWDTS) (2017), pp. 159–162
7. R.M. Patil, N.R. Vinay, P.D. Application-based smart parking system using CAN bus. Indones. J. Electr. Eng. Comput. Sci. 12(2), 759–764 (2018)
8. P. Murali, K. Daniel, H. Ulrich, System design of a modern embedded Linux for in-car applications. Embedded World (2017)
A Novel Approach for SQL Injection Avoidance Using Two-Level Restricted Application Prevention (TRAP) Technique Anup Kumar, Sandeep Rai, and Rajesh Boghey
Abstract The IT world is advancing rapidly in e-commerce, artificial intelligence, machine learning, and many other areas. The technology stack has changed considerably over the past 2–3 years. One notable advancement is the evolution of e-commerce sites and other sites where user input is required. This has made these sites more vulnerable to a type of attack termed the SQL injection attack, in which executable SQL code is passed through the inputs. SQL injection attacks are among the easiest and highest-impact attacks on an application. These attacks work in several ways, such as appending a true statement, modifying existing data, and using a union query to pull the whole data set. They have the potential to take down an entire application or delete critical information from the database. Infinite loops can also be appended in the form of functions, which severely affects the whole application infrastructure. User input cannot be removed from the Internet ecosystem, as it is a basic need of a website; it is also the most exploited channel for attacking a website. A review of the research in this area shows that the majority of preventive techniques either work in a single tier or increase the complexity of the system just to implement the technique. In this paper, we propose a two-level restricted application prevention (TRAP) technique for SQL injection prevention, which leads to a robust and time-efficient two-tier defense system against SQL injections with comparatively minimal impact on the application.
A. Kumar (B) · S. Rai · R. Boghey Department of CSE, TIT (Excellence), Bhopal, India e-mail: [email protected] S. Rai e-mail: [email protected] R. Boghey e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_17
1 Introduction Web applications face serious challenges from different types of application security threats. The more modular a Web application is, the more entry points it offers for such threats. In the world of microservices and microarchitecture, each application has multiple modules that are loosely coupled and independent of each other. These modules interact with each other using either API requests or form submissions, so it becomes very difficult to tell whether a request has come from an authentic source. All applications should be made highly secure to deal with the threats that have cropped up in these microservices environments. As shown in Fig. 1, application security threats can be categorized as cross-site scripting attacks, SQL injection attacks, denial of service attacks, buffer overflows, and session hijacking attacks. In a cross-site scripting (XSS) attack, the attacker sends harmful dynamic code into the application, and this code runs in the client’s browser. It happens mainly on the client side and is typically executed through JavaScript. SQL injection attacks, as noted above, are executed by appending malicious SQL to the input parameters required by the Web application. Denial of service attacks ensure that authorized users do not get access to the application. These attacks may make the server unreachable by flooding it with a huge number of requests, thereby making the whole server unresponsive. A buffer overflow attack targets the buffers used in the application. An application has multiple buffers to which it writes data for use in processing. Attackers send inputs that are larger than the buffer meant to hold them; if the application does not check the input size, the excess data overwrites valid information adjacent to the buffer. In session hijacking, the attacker somehow gets access to the cookie of an ongoing session and starts acting as an authorized user.
Fig. 1 Types of application security threats
SQL injection attacks have multiple after-effects on the system, such as denial of service, deletion/updation of sensitive data, and unauthorized access. These are illustrated in more detail below. Figure 2 shows how an SQL injection attack floods the system with a huge number of requests and takes down the database. It eventually cuts off the database server from the Web server, thereby taking down the whole application. This is one of the most severe attacks, as all requests fail and the application cannot respond to any incoming request. Since the whole database goes down, it has to be restarted to bring the system back up. This leads to a huge monetary loss for the organization owning the application. One can imagine the magnitude of the loss if websites like Amazon or Flipkart go down for an hour or more.
Fig. 2 Denial of service SQL injection attack
In Fig. 3, the database updation/deletion attack is explained. It shows how an SQL injection attack can penetrate the DB and modify or delete sensitive information. Take the example of a bank account and OTP. If an attacker modifies the table holding user information and substitutes his/her own mobile number for that of the original user, he/she can use it for telecalling fraud to make transactions. In a similar manner, if the weather forecasting dataset itself is wiped from a meteorological department’s DB, it will lead to chaos. This attack is therefore quite harmful and can hamper users in their day-to-day life. It is especially dangerous because, until the compromised data comes to the notice of the authorities, attackers keep manipulating the data and compromising the system.
Fig. 3 Database updation/deletion SQL injection attack
Fig. 4 Unauthorized access SQL injection attack
The unauthorized access attack shown in Fig. 4 is the most dangerous of all SQL injection attacks. In this attack, the attacker gets hold of the whole system and can use the entire application as a playground. The attacker can do literally anything if the compromised role is an admin role. All aspects of the application are compromised, and the application can be made inaccessible to all legitimate users. The application remains at the mercy of the attacker, who can demand anything in exchange for restoring it.
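The unauthorized access scenario above can be illustrated with a small, self-contained sketch. It shows a tautology payload bypassing a login check that is built by string concatenation, next to a parameterized query that treats the same payload as plain data. The table, credentials, and function names are invented for the example, and parameterized queries are shown only as a standard baseline defense; they are not the TRAP technique proposed in this paper.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, password TEXT, role TEXT)")
db.execute("INSERT INTO users VALUES ('admin', 's3cret', 'admin')")


def login_vulnerable(name: str, password: str):
    """Builds the SQL by concatenation, so input can change the query structure."""
    query = (
        "SELECT role FROM users "
        f"WHERE name = '{name}' AND password = '{password}'"
    )
    return db.execute(query).fetchone()


def login_parameterized(name: str, password: str):
    """Passes input as bound parameters, so it is only ever treated as data."""
    return db.execute(
        "SELECT role FROM users WHERE name = ? AND password = ?",
        (name, password),
    ).fetchone()


payload = "' OR '1'='1"  # appends an always-true condition to the WHERE clause

print(login_vulnerable("admin", payload))     # a row is returned without the password
print(login_parameterized("admin", payload))  # None: the payload matches no password
```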
2 Related Work Three techniques are proposed in [1]: a query rewriting approach, an encoding-based approach, and an assertion-based approach. The first approach works in two phases, static and dynamic. In the static phase, input parameters are inserted into an input table and then appended to the main query using select statements from the input table. The dynamic phase checks the grammar of the insert statements created for insertion into the input table. The second approach encodes the parameters and decodes them in the middle tier. The third approach makes assertions against DB values before firing the actual query in the DB. Tokenization of queries using Heisenberg analysis is the approach proposed in [2]. Each query is divided into blocks and stored in the system, and every incoming query is matched against the tokens. Another technique, Honeypot, is also proposed, which deals with hashed passwords. An overview of different types of SQL injection attacks and preventive measures is provided in [3]. An instruction set randomization (ISR)-based technique is proposed in [4]. In this technique, the file containing the system queries is read and all keywords are appended with random numbers. These queries can be interpreted by a proxy server, which translates them into the actual queries before sending them to the DB server. A Java-based approach in which a filter based on regular expressions is used to detect malicious queries is discussed in [5]. This is a static approach with very limited effect on SQL injection attacks. Research in this field is not limited to conventional methods; researchers have also tried neural network and machine learning based techniques. One such technique, which uses Java’s Neuroph API to find SQL injection attacks, is presented in [6]. Front-end- and back-end-based approaches are described by the authors of [7]. Encryption/decryption techniques are used for user authentication and inputs.
Query generator and validator modules are proposed to check query constraints. In the front-end phase, the client needs to register once, and his/her details are stored at the back end by generating a secret key. When the user tries to log in, the user name is matched with the stored secret key, which was generated from the client’s name at the time of registration. The back-end phase maintains a dynamic table that keeps the predefined queries in the form of tokens. The KPIs of the input query are matched at runtime with the KPIs of the predefined query. Different types of SQL injections are prevalent across systems. Some of these types are presented in the survey reports in [8, 9]. The authors of [10] have proposed a hash-based mechanism to check these attacks. Like various other techniques, it works mainly on securing front-end authentication, here by generating a hash-based fingerprint of a registered user with a hashing algorithm on the client side. All queries are allowed once the user is authenticated. A three-tier SQL injection prevention technique is explained in [11]. It involves a complex implementation of logic at all three tiers, which eventually leads to a long turnaround time. A proxy server is also proposed in the methodology, which tends to reduce malicious calls but increases the infrastructure cost.
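The keyword-randomization idea attributed to [4] above can be illustrated with a simplified sketch. This is not the implementation from [4]: the keyword list, the numeric key, and the function names are assumptions, and a real proxy would parse SQL properly instead of matching words with a regular expression.

```python
import re

# Small illustrative keyword set; a real deployment would cover the full grammar.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "UNION",
            "INSERT", "UPDATE", "DELETE", "DROP"}
WORD = re.compile(r"[A-Za-z_]+\d*")  # a word, optionally followed by a numeric key


def randomize(template: str, key: str) -> str:
    """Tag every SQL keyword in a trusted, developer-written template with the key."""
    def tag(match):
        word = match.group(0)
        return word + key if word.upper() in KEYWORDS else word
    return WORD.sub(tag, template)


def derandomize(query: str, key: str) -> str:
    """Proxy step: strip the key from tagged keywords and reject untagged ones."""
    def untag(match):
        word = match.group(0)
        if word.endswith(key) and word[: -len(key)].upper() in KEYWORDS:
            return word[: -len(key)]
        if word.upper() in KEYWORDS:
            raise ValueError(f"untagged keyword in query: {word}")
        return word
    return WORD.sub(untag, query)


KEY = "123"  # HYPOTHETICAL key; in practice it would be random and kept secret
tagged = randomize("SELECT role FROM users WHERE name = '{name}'", KEY)

print(derandomize(tagged.format(name="alice"), KEY))  # restored to plain SQL
try:
    derandomize(tagged.format(name="x' OR '1'='1"), KEY)
except ValueError as err:
    print("rejected:", err)  # the injected OR carries no key
```

One limitation of this simplified version is that a keyword appearing legitimately inside user data, such as the word "or" in a free-text field, is also rejected.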
A. Kumar et al.
SQL replacing and distance calculation using the Levenshtein method are used in [12]. The distance between the predefined query and the replaced query gives an idea of the correctness and authenticity of the query. Multiple levels of SQL injection vulnerabilities are explained in [13]. The implementation also requires Web server and DB server level changes. Script matching and a client-side rule-based implementation are discussed in [14]. On the other hand, a predefined query dictionary-based method is used in [15].
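The distance check described for [12] can be illustrated with a minimal sketch. The helper `mask_literals`, the placeholder character and the example queries are assumptions made for illustration and are not taken from [12]; the code only shows how an edit distance between a predefined query and an incoming query with its values replaced could flag a structural change.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def mask_literals(query: str) -> str:
    """Hypothetical helper: replace quoted strings and bare numbers with '?'."""
    query = re.sub(r"'[^']*'", "?", query)
    return re.sub(r"\b\d+\b", "?", query)


# Predefined query (assumed form) and two incoming queries.
template = "SELECT * FROM users WHERE name = ?"
benign = mask_literals("SELECT * FROM users WHERE name = 'alice'")
attack = mask_literals("SELECT * FROM users WHERE name = '' OR '1'='1'")

print(levenshtein(template, benign))  # distance 0: structure unchanged
print(levenshtein(template, attack))  # non-zero distance: structure changed
```

A distance of zero means the incoming query has the same structure as the predefined one; any positive distance indicates that the input altered the query text beyond its parameter values.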
3 Problem Statement
In previous approaches, different preventive methods have been proposed. Some techniques focus on front-tier validation and some rely on middle-tier-based prevention. Some use more advanced methods, e.g., neural networks, while others use basic methods such as randomization. Some methods provide a three-tier implementation, but their infrastructure cost is quite high. Some techniques rely on external mechanisms such as the Honeypot mechanism, hashing, and Levenshtein distance. Since these are external mechanisms, the techniques that depend on them also inherit their drawbacks. Looking at the previously proposed methods, we can summarize their drawbacks as follows:
• Most of the techniques, such as SQL rewriting and encryption-based approaches, focus mainly on a single tier, i.e., one of the front-end, middle, and DB tiers. If an attack is not caught by that tier, the application is compromised. This leads to a single point of failure
• Single-tier-based approaches are more prone to SQL injection attacks than multitier-based approaches, as a single tier may account for fewer permutations of exploitation in terms of SQL patterns
• Some approaches have been proposed as multitier, but those also involve additional infrastructure, which adds to the cost
• Advanced techniques which make use of neural networks and machine learning unnecessarily increase the complexity of a simple application as well as its infrastructure and runtime costs.
4 Proposed Work
4.1 Introduction to Proposed Work
This paper proposes a blended approach which combines two existing approaches, the SQL filter and SQL rewriting. These approaches are quite simple to implement and are powerful in their own tiers of the application. Two-level restricted application
A Novel Approach for SQL Injection Avoidance Using Two-Level …
prevention (TRAP) technique is proposed, which secures the middle tier and the DB tier. The purpose of this technique is to ensure multi-tier security using simple techniques, with no additional infrastructure cost and with as little rewriting of the original code as possible. The technique is compared with its constituent techniques on a standalone basis in terms of prevention of various SQL injection attacks. In a nutshell, the aim is to provide a simple, robust, and cost-efficient technique to prevent SQL injection attacks.
4.2 Two-Level Restricted Application Prevention (TRAP) Technique Architecture
The proposed TRAP technique works at two levels: the middle tier and the DB tier. At the first level, i.e., the middle tier, a pattern matching filter is implemented, and at the second level, an SQL rewriting mechanism along with grammar validation is implemented. These two levels of security ensure that only valid inputs are passed to the table for fetching details. Requests containing invalid inputs are rejected outright, and a validation error is sent back as the response. In this way, all infected parameters are kept in check and prevented from entering the application and compromising the data. The step-by-step working of these security levels is as follows:
• Level 1: Pattern matching filter
– This level contains the logic to match inputs against predefined keyword patterns of SQL
– In the ideal scenario, input parameters are not expected to contain predefined SQL-specific keywords. This level restricts such inputs
– Pattern matching is carried out using a predefined keyword regular expression
– The regular expression does not allow any independent keyword in the parameters. By independent, we mean that a keyword can be part of a continuous string but cannot exist as an independent entity
• Level 2: Query rewriting and analysis
– This level has two phases: (1) the query rewriting phase and (2) the syntax analyzer phase
– In phase (1), all input parameters are taken and converted into the form of insert statements
– When an input parameter contains a tautology-based query, for example, 1=1, its concatenation effect is neutralized by inserting it into the table
– Before insertion into the table, the syntax of the insert statements is verified using a predefined syntax grammar
– Once the syntax is verified and the parameters are stored in the table in the DB, the main query is modified to fetch these parameters from the DB
Fig. 5 TRAP technique architecture
– Main query is modified by replacing parameter values with select statements from DB table containing parameter values (Fig. 5).
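The Level 1 filter can be sketched as a keyword regular expression with word boundaries. This is only an illustration: the paper does not list its predefined keywords, so `SQL_KEYWORDS` and the function name `passes_level1` are assumptions.

```python
import re

# Assumed keyword list; the paper does not enumerate its predefined keywords.
SQL_KEYWORDS = ["select", "insert", "update", "delete", "drop",
                "union", "or", "and", "sleep", "waitfor"]

# \b word boundaries reject a keyword only when it stands alone, so a keyword
# embedded in a longer continuous string (e.g. "selection") is still accepted.
KEYWORD_PATTERN = re.compile(r"\b(?:%s)\b" % "|".join(SQL_KEYWORDS),
                             re.IGNORECASE)


def passes_level1(value: str) -> bool:
    """Return True when the input contains no independent SQL keyword."""
    return KEYWORD_PATTERN.search(value) is None


for sample in ["selection", "Anderson", "x' OR '1'='1",
               "1 UNION SELECT password FROM users", "1=1"]:
    print(repr(sample), passes_level1(sample))
```

Note that a bare tautology such as 1=1 contains no keyword and therefore passes this filter; in the architecture described above it is left to Level 2 to neutralize it.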
4.3 Two-Level Restricted Application Prevention (TRAP) Technique Working Methodology
When a user sends a request from the UI/API, it is validated against a predefined pattern of SQL keywords. All input parameters are validated against the predefined pattern serially within a loop. After validation, valid requests move forward to the next step and invalid requests are rejected with a reason. In the DB, an INPUT table is maintained to store the input parameters associated with each request. To store these input parameters in the DB, corresponding insert statements are created. These statements are validated against a custom grammar defined in the system. This grammar is similar to a CFG, i.e., a context-free grammar. If found valid against this grammar, the statements are executed, which stores the inputs in the DB. Invalid statements are rejected, and a response is sent back to the user with the corresponding error details. Next, the stored input values are fetched from the DB and the original query is recreated using these values. Finally, after all checks and validations, the original query is executed with proper inputs. Figure 6 depicts the whole workflow of the technique.
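The Level 2 part of this workflow can be sketched with Python's built-in `sqlite3` module. Everything named here is an assumption for illustration: the tables `users` and `input_params`, the function names, and in particular the single regular expression that stands in for the paper's custom grammar, whose rules are not given.

```python
import re
import sqlite3

# Stand-in for the custom grammar: exactly one INSERT with an integer request
# id, a parameter name, and one quoted literal containing no quote character.
INSERT_GRAMMAR = re.compile(
    r"INSERT INTO input_params \(req_id, name, value\) "
    r"VALUES \(\d+, '\w+', '[^']*'\)"
)


def build_insert(req_id: int, name: str, value: str) -> str:
    """Query rewriting phase: turn an input parameter into an insert statement."""
    return ("INSERT INTO input_params (req_id, name, value) "
            f"VALUES ({req_id}, '{name}', '{value}')")


def run_request(conn, req_id, username):
    """Return matching rows, or None when the insert statement is rejected."""
    stmt = build_insert(req_id, "username", username)
    if INSERT_GRAMMAR.fullmatch(stmt) is None:   # syntax analyzer phase
        return None
    conn.execute(stmt)
    # Main query rewritten so that the value is fetched from the input table
    # by a sub-select instead of being concatenated into the query text.
    rewritten = ("SELECT id FROM users WHERE name = "
                 "(SELECT value FROM input_params "
                 "WHERE req_id = ? AND name = 'username')")
    return conn.execute(rewritten, (req_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE input_params (req_id INTEGER, name TEXT, value TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

print(run_request(conn, 1, "alice"))          # legitimate input
print(run_request(conn, 2, "1=1"))            # tautology stored as plain data
print(run_request(conn, 3, "x' OR '1'='1"))   # rejected by the grammar check
```

In this sketch the tautology is stored as an ordinary string and compared as data, so it matches no user, while the input containing a quote produces an insert statement that does not conform to the grammar and is rejected before anything is executed.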
Fig. 6 TRAP technique methodology flowchart
5 Results and Discussion
We compared the SQL rewriting technique and the TRAP technique in terms of the time taken to handle SQL injections for union-based attacks and time-based attacks. We took three test sets of 100, 500, and 1000 requests for each of these attacks and simulated the attacks using JMeter.
5.1 Time Comparison of Time-Based Attacks
The TRAP technique takes much less time than the SQL rewriting technique to handle time-based SQL injection attacks. Across all test sets, the TRAP technique takes 82% less time than the SQL rewriting technique to handle a time-based attack. The Java-based filter is not included in this comparison because it is not able to prevent this attack. So, the TRAP technique fares well in terms of both time and prevention of time-based attacks. The results are depicted graphically in Fig. 7.
5.2 Time Comparison of Union-Based Attacks
From the results obtained for the prevention time of union-based attacks (as shown in Fig. 8), it can be inferred that the TRAP technique is 82% better in terms of execution time than the SQL rewriting technique. Again, this attack is not prevented by the Java filter-based approach. Hence, the TRAP technique is better in terms of execution time and prevention than the other two techniques considered.
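A percentage such as the one quoted above is a relative reduction with respect to the baseline technique. The sketch below shows only the arithmetic; the function name is an assumption and the timings are toy values chosen for illustration, not measurements from this work.

```python
def percent_time_saved(baseline_ms: float, candidate_ms: float) -> float:
    """Relative reduction of the candidate versus the baseline, in percent."""
    if baseline_ms <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# Toy values only: a baseline of 200 ms and a candidate of 50 ms.
print(percent_time_saved(200.0, 50.0))
```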
Fig. 7 Time comparison of time-based attacks
Fig. 8 Time comparison of union-based attacks
5.3 KPI Comparison of TRAP Technique with Existing Techniques
Since many techniques have been proposed to handle SQL injection, the following table compares the TRAP technique with existing techniques on some KPIs:

Technique          Autodetection  Autoprevention  Identification of all input sources  Modify code base  Complexity  Implemented in this work
WebSSARI           Yes            No              Yes                                  No                O(n)        No
SQLrand            Yes            Yes             No                                   Yes               O(n)        No
JDBC-Checker       Yes            No              No                                   No                O(n)        No
AMNESIA            Yes            Yes             Yes                                  No                O(2^n)      No
CANDID             Yes            Yes             Yes                                  No                O(n)        No
DIGLOSSIA          Yes            Yes             No                                   No                O(n)        No
SQL Rewriting      Yes            Yes             Yes                                  Yes               O(n)        Yes
Java Based Filter  Yes            Yes             Yes                                  No                O(n)        Yes
TRAP               Yes            Yes             Yes                                  Yes               O(n)        Yes
6 Conclusion and Future Work
SQL injection attacks are quite common and dangerous in terms of impact. The various techniques proposed to handle these attacks differ from each other in a variety of ways, such as implementation complexity, tier of implementation, and execution time for handling specific attacks. We have proposed the TRAP technique, which operates in multitier mode by taking individual advantages
of two existing techniques, each of which operates alone in its own tier. The TRAP technique neither requires any new hardware nor proposes complex solutions based on neural networks or machine learning. We have seen that the TRAP technique saves around 82% of the time needed to prevent time-based and union-based attacks. The front-end tier can also be integrated with this technique so that injection prevention is achieved in all application tiers.
References
1. B. Ahuja, A. Jana, A. Swarnkar, R. Halder, On preventing SQL injection attacks. Adv. Comput. Syst. Secur. 395, 49–64 (2015)
2. A.S. Sai Lekshmi, V.S. Devipriya, An Emulation of SQL Injection Disclosure and Deterrence (2017)
3. Z. Fei, S. Bhattacharjee, E.W. Zegura, M.H. Ammar, SQL injection: types, methodology, attack queries and prevention, in 3rd International Conference on Computing for Sustainable Global Development (INDIACom) (2016)
4. P. Chen, J. Wang, L. Pan, H. Yu, Research and implementation of SQL injection prevention method based on ISR, in 2nd IEEE International Conference on Computer and Communications (2016)
5. L. Qian, Z. Zhu, L. Hu, S. Liu, Research of SQL injection attack and prevention technology, in International Conference on Estimation, Detection and Information Fusion (ICEDIF 2015) (2015)
6. N. Patel, N. Shekokar, Implementation of pattern matching algorithm to defend SQLIA, in International Conference on Advanced Computing Technologies and Applications (ICACTA2015) (2015)
7. S. Som, S. Sinha, R. Kataria, Study on SQL Injection attacks: mode, detection and prevention. Int. J. Eng. Appl. Sci. Technol. 1(8), 23–29 (2016). ISSN No. 2455-2143
8. K. Elshazly, Y. Fouad, M. Saleh, A. Sewisy, A survey of SQL injection attack detection and prevention. J. Comput. Commun. 2014(2), 1–9 (2014)
9. Z.S. Alwan, M.F. Younis, Detection and prevention of SQL injection attack: a survey. Int. J. Comput. Sci. Mob. Comput. (IJCSMC) 6(8), 5–17
10. K. D’silva, J. Vanajakshi, K.N. Manjunath, S. Prabhu, An effective method for preventing SQL injection attack and session Hijacking, in 2nd IEEE International Conference On Recent Trends in Electronics Information & Communication Technology (RTEICT), India, 19–20 May 2017
11. W. Rajeh, A. Abed, A novel three-Tier SQLi detection and mitigation scheme for cloud environments, in International Conference on Electrical Engineering and Computer Science (ICECOS) (2017)
12. R. Latha, E. Ramaraj, SQL injection detection based on replacing the SQL query parameter values. Int. J. Eng. Comput. Sci. 4(8), 13786–13790 (2015). ISSN: 2319-7242
13. H. Hu, Research on the technology of detecting the SQL injection attack and non-intrusive prevention in WEB system, in AIP Conference Proceedings 1839, 020205 (2017)
14. A.S. Dikhit, K. Karodiya, Result evaluation of field authentication based SQL injection and XSS attack exposure, in IEEE International Conference on Information, Communication, Instrumentation and Control (ICICIC-2017) (2017)
15. A. Yasin, N. A. Zidan, SQL injection prevention using query dictionary based mechanism. Int. J. Comput. Sci. Inform. Secur. (IJCSIS) 14(6) (2016)
Prediction of Cardiovascular Disease Through Cutting-Edge Deep Learning Technologies: An Empirical Study Based on TENSORFLOW, PYTORCH and KERAS
Mudasir Ashraf, Syed Mudasir Ahmad, Nazir Ahmad Ganai, Riaz Ahmad Shah, Majid Zaman, Sameer Ahmad Khan, and Aftab Aalam Shah
Abstract In the healthcare system, predictive modelling for risk estimation of cardiovascular disease is an extremely challenging yet unavoidable task. Therefore, the attempt to clinically examine medical databases through conventional and leading-edge machine learning technologies is considered a valuable, accurate and, more importantly, economical substitute for medical practitioners. In this research study, we first exploited both individual learning algorithms and ensemble approaches, including BayesNet, J48, KNN, multilayer perceptron, Naïve Bayes, random tree and random forest, for prediction purposes. Among these classifiers, J48 attained the highest accuracy, 70.77%. We then employed newer techniques comprising TENSORFLOW, PYTORCH and KERAS on the same dataset, acquired from the Stanford online repository. The empirical results demonstrate that KERAS achieved a prediction accuracy of 80%, higher than the entire set of machine learning algorithms under investigation. Furthermore, based on the improvement in prediction accuracy for cardiovascular disease, a novel prediction model was proposed after conducting performance analysis on both approaches (conventional and cutting-edge technologies). The principal objective behind this research study was the pursuit of fitting approaches that can lead to better prediction accuracy and more reliable diagnostic performance for cardiovascular disease.
Keywords Machine learning · TENSORFLOW · PYTORCH · KERAS · Prediction · Classifiers
M. Ashraf (B) · S. M. Ahmad · R. A. Shah · S. A. Khan · A. A. Shah
Division of Animal Biotechnology, Center of Bioinformatics, FVSc and AH Shuhama, SKUAST-K, Srinagar 190006, India
e-mail: [email protected]
N. A. Ganai
SKUAST-K, Srinagar, India
M. Zaman
IT&SS, University of Kashmir, Srinagar, India
© Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_18
M. Ashraf et al.
1 Introduction
In the modern era, cardiovascular diseases are reported to be among the most dangerous diseases in terms of mortality at the global level, causing millions of deaths each year. In the coming years, cardiovascular disease is expected to remain a major challenge for healthcare investigators, given the projected population growth and demographic transitions. As reported by the World Health Organization (WHO), approximately 17 million people die due to cardiovascular diseases, typically heart attacks and strokes [1]. Various research findings and assessments indicate that the number of deaths due to cardiovascular diseases will rise in both developed and developing countries in the coming years [2]. It is therefore essential to record the significant traits that contribute to cardiovascular diseases. In this direction, data analytic applications and soft-computing tools can be productive, aiding physicians in making better and faster decisions by applying different mining techniques to cardiovascular datasets. Several applications and technologies yield novel information from data repositories; among them, data mining techniques are widely employed to discover useful, novel and hidden knowledge from miscellaneous data sources, including medical, pedagogical and corporate ones [3]. Moreover, machine learning techniques have produced outstanding results by predicting output classes with significant accuracy [4, 5]. Cardiovascular risk prediction modelling has assumed prime significance in clinical studies and medical services. To ensure that health impacts are evaluated in a timely manner, computational and technical health risk estimation models need to be developed in the realm of health sciences. Today, pharmaceutical and clinical settings are data intensive.
Clinical administrative authorities routinely collect patient data, including patient descriptions, laboratory tests, medical reports, physician notes and other related data. With substantial developments in recording and storage technologies, and with the growing demand for soft-computing health-based recording systems in hospitals, it is now possible to examine massive amounts of data to discover valuable information. Therefore, the pursuit of approaches to evaluate cardiovascular data, and thereby devise prognosis and diagnosis mechanisms to improve cardiovascular outcomes in individuals, is a continuous quest. Developing a predictive modelling system to estimate cardiovascular risk in patients has become an extremely challenging task for researchers in healthcare informatics. Furthermore, managing several types of healthcare implications with the desired prediction accuracy and error-free outcomes demands soft-computing and intelligent systems to automate healthcare informatics. These intelligent systems have a significant role in diagnosis, prognosis and decision-making. The application of different data mining models enables a smart diagnosis process while extracting meaningful information from large datasets. Classification techniques and intelligent heart disease prediction mechanisms can lead to accurate and economical healthcare facilities. Abundant research studies have been conducted using various techniques, viz. classification, regression, clustering, artificial neural
Prediction of Cardiovascular Disease Through Cutting-Edge Deep …
networks, support vector machines, k-nearest neighbours and ensemble approaches, to predict heart disease. However, no cutting-edge technologies have been utilized hitherto to further examine clinical data and thereby improve prediction accuracy in the medical field. Therefore, in this research study, deliberate attempts have been made to exploit leading-edge techniques in the realm of healthcare prediction services, including TENSORFLOW, KERAS and PYTORCH, to achieve better results. The study proposes a prediction model for healthcare services, particularly for cardiovascular disease, based on the above-mentioned state-of-the-art technologies to generate better and more reliable predictions.
2 Related Work
In this section, the machine learning techniques on which heart disease prediction models have been built are reviewed and analysed. The broad spectrum of techniques and methodologies available in data mining is one of its key strengths and can be applied to various medical science problems [6]. A number of successful research studies have predicted cardiovascular diseases with techniques such as clustering, association rule mining, regression and classification. These methods have been applied to improve the diagnosis of diseases with noteworthy accuracy and a minimal error rate. From the existing literature, it is noticeable that classification plays a more important role in predicting heart diseases than other approaches. The literature also shows that abundant research studies have explored either individual machine learning algorithms or ensemble procedures in pursuit of better accuracy and reliable prediction models. Assari et al. proposed a study for early detection of heart diseases by analysing different risk factors associated with heart patients. A number of learning classifiers were employed on the selected dataset; among them, SVM attained the highest accuracy, 84.33% [7]. The model was then developed based on the discovered rules and the heart disease indices thal, ca and cp, which were noted to be effective indices. Rajahoana et al. explored a UCI dataset with 12 features, applied artificial neural networks to the data and achieved an accuracy of 85.66% in predicting heart disease [8]. Raihan et al. proposed a risk prediction system for heart attack based on data mining and statistical approaches [9]. The dataset comprised 10 attributes and 917 instances, and the model attained an accuracy of 86%.
Kanchan and Kishor applied various machine learning algorithms using principal component analysis for heart disease prediction [10]. Nikhar et al. identified factors accountable for heart diseases based on machine learning approaches [11]. The proposed framework specified the basic characteristics that should be checked to confirm the occurrence of heart disease. Fatima and Pasha conducted a survey on disease diagnosis based on machine learning algorithms. The researchers observed
that Naïve Bayes is among the finest classification algorithms and is strongly scalable [12]. Shilna and Navya applied k-means clustering and miscellaneous data mining techniques to predict heart disease [13]. Typically, the dimensionality of a heart dataset is large; thus, selecting the relevant attributes becomes a challenging task for obtaining a better diagnosis [14, 15]. In this regard, various ensemble and hybrid approaches have been applied to achieve promising and accurate predictions on heart disease datasets. Abdullah and Rajalaxmi employed an RF classifier to predict Coronary Heart Disease (CHD) [16]. The investigation was conducted on factors such as angina, acute myocardial infarction and bypass graft surgery. The empirical examination showed that the ensemble approach produced effective and accurate results in predicting coronary heart disease. Lafta et al. at the University of Southern Queensland formulated an intelligent prediction mechanism using a time series prediction algorithm [17]. The model produced exceptional prediction results on patients' medical data and was subsequently used by medical practitioners as a decision-support system. Parthiban and Subramanian designed a prediction system for heart disease based on a neuro-fuzzy inference method [18]. The capabilities of a neural network adaptive algorithm and fuzzy logic were integrated with a genetic algorithm (GA) to predict heart disease. The performance of the combined neuro-fuzzy model was evaluated on performance measures, and the results revealed realistic potential in forecasting heart disease. Anbarasi et al. presented an improved prediction mechanism for heart disease through GA using feature subset selection [19]. The predictive analysis involved 13 attributes, and GA was then employed to acquire an optimal feature subset.
Moreover, Naïve Bayes, decision tree and clustering were deployed for empirical analysis on 909 data instances. The developed predictive model was tested with the k-fold cross-validation method and produced significant results. Yan and Zheng proposed a real-coded genetic-algorithm-based structure for the diagnosis of heart diseases while exploiting critical clinical feature subsets [20]. The prediction system was modelled on 352 heart disease instances for the diagnosis of five major heart diseases, and the model demonstrated reasonably high accuracy for all practical purposes. The literature review makes evident that a wide variety of research studies have been conducted in the field of medical sciences for the prediction and diagnosis of heart diseases using algorithms and methods such as classification, artificial neural networks, regression analysis, clustering, KNN, decision trees and ensemble approaches. However, there is a considerable research gap in clinical sciences: no research study has hitherto addressed cardiovascular prediction using cutting-edge technologies, viz. TENSORFLOW, KERAS and PYTORCH. Therefore, novel approaches need to be applied in the field of cardiovascular prediction. The problem statement of this study is to develop an intelligent prediction model for predicting cardiovascular disease based on leading-edge technologies such as TENSORFLOW, KERAS and PYTORCH.
3 Empirical Examination of Conventional and Deep Learning Practices
The patient data associated with cardiovascular disease was collected from the Stanford healthcare repository. The dataset contained a total of 304 records with 10 attributes, including Systolic Blood Pressure (SBP), tobacco, Low-density Lipoprotein (LDL), adiposity, famhist, typea, obesity, alcohol, age and Coronary Heart Disease (CHD), which are highlighted in the snapshot of the data in Fig. 1. CHD records the outcome for the patient as one of two Boolean values, 1 and 0, wherein 1 designates that the patient is prone to the risk of cardiovascular failure and 0 indicates no such
Fig. 1 Snapshot of the cardiovascular dataset acquired from Stanford online repository
risk. The dataset was subjected to several machine learning algorithms, comprising BayesNet, J48, k-nearest neighbour, multilayer perceptron, Naïve Bayes, random forest and random tree, to predict cardiovascular disease more accurately. In the later phase, novel and cutting-edge technologies were applied to the same data to predict cardiovascular disease, and the performance of the conventional machine learning algorithms was then compared with that of the leading-edge technologies, including TENSORFLOW, KERAS and PYTORCH. For all techniques, the dataset was split into two sets, viz. training and test sets, with 70% of the data used for training and 30% for testing. The principal objective behind this study was to predict cardiovascular disease with greater accuracy so that the diagnosis of diseases in the healthcare system can be improved considerably, consequently reducing treatment costs by providing an initial diagnosis on time. In addition, the empirical results and data visualizations shown in the tables below were obtained using Python 3.8 and Anaconda.
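The 70/30 split can be sketched in plain Python as follows. The shuffling, the fixed seed and the function name are assumptions; the text does not say how the records were assigned to the two sets.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices stand in for the 304 patient records of the dataset.
records = list(range(304))
train, test = train_test_split(records)
print(len(train), len(test))
```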
3.1 Performance of Conventional Classifiers in Predicting the Disease
To evaluate the effectiveness of all classifiers on the cardiovascular data, various parameters, including Correctly Classified (CC), Incorrectly Classified (IC), True Positive Rate (TPR), False Positive Rate (FPR), precision, recall, F-measure, Matthews Correlation Coefficient (MCC), Receiver Operating Characteristic area (ROC area) and Precision–Recall Curve area (PRC area), are presented in Table 1. From Table 1, it is noticeable that J48 performed best, predicting the correct class with 70.77% accuracy; Naïve Bayes follows closely with a prediction accuracy of 70.56%. The other five classifiers, viz. BayesNet (67.96%), KNN (63.20%), multilayer perceptron (67.31%), random forest (67.53%) and random tree (62.33%), produced the results shown in Table 1.
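For reference, most of the measures listed above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard single-class definitions with hypothetical counts; it does not reproduce the values of Table 1, whose TPR and recall columns coincide with CC, which is what class-weighted averaging would produce.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard measures for the positive class of a binary classifier."""
    total = tp + fp + tn + fn
    tpr = tp / (tp + fn)                      # true positive rate = recall
    fpr = fp / (fp + tn)                      # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "CC": (tp + tn) / total,              # correctly classified fraction
        "IC": (fp + fn) / total,              # incorrectly classified fraction
        "TPR": tpr, "FPR": fpr, "precision": precision,
        "recall": tpr, "F-measure": f_measure, "MCC": mcc,
    }


# Hypothetical counts chosen only to exercise the formulas.
metrics = binary_metrics(tp=40, fp=10, tn=35, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch assumes that every denominator except the MCC one is non-zero, i.e., that both classes occur and at least one positive prediction is made.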
3.2 Exploitation of Cutting-Edge Technologies Encompassing TENSORFLOW, PYTORCH and KERAS
In this section, state-of-the-art technologies were employed on the cardiovascular dataset with the prime purpose of improving the prediction accuracy of our results. After application of TENSORFLOW, PYTORCH and KERAS, it was observed that there was substantial improvement in prediction accuracy in predicting
Table 1 Prediction performance of various classifiers

Classifier name        CC       IC       TPR    FPR    Precision  Recall  F-measure  MCC    ROC area  PRC area
BayesNet               67.96%   32.03%   0.680  0.370  0.686      0.680   0.682      0.305  0.727     0.733
J48                    70.77%   29.22%   0.708  0.393  0.698      0.708   0.701      0.331  0.667     0.660
K-Nearest Neighbour    63.20%   36.79%   0.632  0.471  0.623      0.632   0.626      0.166  0.587     0.598
Multilayer perceptron  67.31%   32.68%   0.673  0.414  0.667      0.673   0.670      0.265  0.692     0.698
Naïve Bayes            70.56%   29.43%   0.706  0.341  0.710      0.706   0.708      0.360  0.747     0.750
Random forest          67.532%  32.467%  0.675  0.445  0.661      0.675   0.664      0.247  0.717     0.730
Random tree            62.337%  37.662%  0.623  0.485  0.613      0.623   0.617      0.144  0.569     0.583
Fig. 2 Screenshot of accuracy attained based on TENSORFLOW
Fig. 3 How accuracy varied as epoch size was increasing
Fig. 4 Results generated using PYTORCH
Fig. 5 Association between achieved accuracy and epochs
the class labels. Each technique was run for 500 epochs with a learning rate of 0.01 to achieve the best possible results. Figures 2, 3, 4, 5, 6 and 7 illustrate the performance of the respective techniques, viz. TENSORFLOW, PYTORCH and KERAS. Figure 2 shows a screenshot of the results, which include the mean squared error (MSE), the training accuracy and the actual accuracy achieved by TENSORFLOW. The overall accuracy achieved by TENSORFLOW in predicting the outcome class is 70.96%. Figure 3 shows the relationship between epochs and the accuracy achieved by TENSORFLOW during testing with the test data. The accuracy rose sharply until it reached 66%; thereafter it fluctuated and at certain points remained constant. The maximum accuracy achieved was 70.96%. PYTORCH was examined under the same settings as TENSORFLOW, viz. epochs (500), learning rate (0.01), training data (70%) and test data (30%), as can be seen in Fig. 4. For PYTORCH, we calculated loss and accuracy during the training phase and val_loss and val_accuracy during the test stage. PYTORCH achieved an accuracy of 78.91% in predicting cardiovascular disease, performing better than both TENSORFLOW and the conventional classifiers whose results are given in Table 1. Figure 5 contrasts the training and testing accuracy of PYTORCH at each epoch. As the figure shows, the prediction accuracy peaked at around 76% during training, whereas during testing the maximum accuracy the model achieved was 78.91% in predicting the correct class labels from the given cardiovascular
Fig. 6 Screenshot of results accomplished based on KERAS
dataset. However, in the preliminary phase the performance changed progressively, and after approximately 25 epochs it dropped and remained constant. After applying KERAS to our data, as shown in Fig. 6, the model produced a prediction accuracy of 80% in predicting the exact instances. This accuracy is higher than that of the classifiers, viz. BayesNet, J48, KNN, multilayer perceptron, Naïve Bayes, random tree and random forest, and of the deep learning tools, viz. TENSORFLOW and PYTORCH, deployed in the earlier sections.
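The roles of the two training settings (500 epochs, learning rate 0.01) and of the accuracy-versus-epoch curves can be illustrated with a deliberately simple stand-in. The sketch below is not the TENSORFLOW, PYTORCH or KERAS models used in this study: it trains a plain logistic regression by stochastic gradient descent on synthetic data, and all names and data are assumptions.

```python
import math
import random


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))      # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, row):
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


def accuracy(weights, bias, X, y):
    hits = sum((predict_proba(weights, bias, row) >= 0.5) == bool(label)
               for row, label in zip(X, y))
    return hits / len(y)


def train(X, y, epochs=500, lr=0.01):
    """Stochastic gradient descent on the log-loss; returns accuracy per epoch."""
    weights, bias = [0.0] * len(X[0]), 0.0
    history = []
    for _ in range(epochs):
        for row, label in zip(X, y):
            err = predict_proba(weights, bias, row) - label  # dLoss/dlogit
            weights = [w - lr * err * v for w, v in zip(weights, row)]
            bias -= lr * err
        history.append(accuracy(weights, bias, X, y))
    return weights, bias, history


# Synthetic two-feature data with a linear class boundary (not patient data).
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if a + b > 0 else 0 for a, b in X]

weights, bias, history = train(X, y)
print(len(history), history[0], history[-1])
```

The list `history` holds one accuracy value per epoch, which is the kind of series plotted in the accuracy-versus-epoch figures.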
Fig. 7 Distinction between accuracy and epoch in case of KERAS
Table 2 Results generated by leading-edge technologies
Classifier name    Accuracy (%)
TENSORFLOW         70.9
PYTORCH            78.9
KERAS              80
From Fig. 7, it is apparent that KERAS achieved high accuracy during both the training and testing phases. Furthermore, its prediction accuracy rose consistently throughout both phases and reached a maximum of 80% in predicting the correct class labels during testing (Table 2). The proposed model shown in Fig. 8 gives a comprehensive illustration of the training and testing of the leading-edge technologies used for cardiovascular prediction. In the current study, we used three widespread techniques, viz. TENSORFLOW, PYTORCH and KERAS, to generate predictions. Based on the experimental results and the examination of these three methods, we propose a prediction model to encourage researchers to explore more deep learning tools so as to improve prediction accuracy in the near future.
3.3 Data Visualization Data visualization plays an important role in expressing data effectively and can thereby contribute to developing an accurate and robust prediction system. Figures 9, 10 and 11 show the relationships among different attributes using a colour plot, a boxplot and a heat plot, respectively. In Fig. 12, an artificial neural network is proposed to predict cardiovascular disease based on the attributes shown in the input layer. The network comprises an input layer, three hidden layers and an output layer (where ‘P’ signifies the presence of cardiovascular disease and ‘A’ its absence). The healthcare data was fed as input to the artificial neural network model, with the classifier running internally as its core, and the classifier produced good prediction accuracy.
Fig. 8 Prediction model for cardiovascular disease
Fig. 9 Visualization of data based on attributes
Fig. 10 Boxplot visualization of various attributes
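The forward pass of a network with the shape described above (an input layer, three hidden layers and one output unit mapped to ‘P’ or ‘A’) can be sketched as follows. This is a minimal sketch under stated assumptions and not the authors' implementation: the layer widths (13, 8, 8, 4, 1), the ReLU and sigmoid activations, the random untrained weights and the 0.5 decision threshold are all hypothetical choices made for illustration.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(inputs, layer, activation):
    """Apply one dense layer: activation(W x + b)."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, layers):
    """Forward pass: three hidden ReLU layers, then one sigmoid output unit."""
    hidden = features
    for layer in layers[:-1]:
        hidden = dense(hidden, layer, relu)
    probability = dense(hidden, layers[-1], sigmoid)[0]
    return ("P" if probability >= 0.5 else "A"), probability

rng = random.Random(42)
sizes = [13, 8, 8, 4, 1]   # hypothetical widths: input, three hidden, output
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
label, probability = predict([0.1] * 13, layers)
print(label, round(probability, 3))
```

Because the weights are random, the printed label carries no diagnostic meaning; the sketch only shows how the attribute vector flows through the layers to a ‘P’/‘A’ decision.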
Fig. 11 Attributes with heat plot
Fig. 12 Proposed artificial neural network model
4 Conclusion An intelligent algorithm can be a productive tool and can contribute effectively to improving precision in the treatment of diseases. In this study, we used conventional learning algorithms, ensembles and cutting-edge technologies to gain better insight into heart disease and to achieve significant prediction accuracy on cardiovascular data. The data, obtained from the Stanford online healthcare repository, was initially subjected to techniques such as BayesNet, J48, KNN, multilayer perceptron, Naïve Bayes, random tree and random forest. Among these learning classifiers, J48 produced a noteworthy prediction accuracy of 70.77% in classifying the correct class labels. KERAS, however, generated the highest prediction accuracy, 80%, in diagnosing heart disease, outperforming not only the other leading-edge technologies (TENSORFLOW and PYTORCH) but also the traditional learning classifiers explored earlier. Based on the application of these machine learning techniques and tools, a prediction model was proposed. Moreover, data visualization plots were used to express the attributes of the dataset more effectively so that they can contribute to the prediction mechanism. In future, researchers can look for approaches to improve the scalability and accuracy of prediction systems for cardiovascular disease. Typically, the dimensionality of heart disease data is high; thus, identifying and selecting the significant attributes is an extremely challenging task for better diagnosis of heart disease. Acknowledgements We acknowledge the support provided by the bioinformatics infrastructure facility under the BTIS Net program of DBT, Government of India.
Study and Analysis of Time Series of Weather Data of Classification and Clustering Techniques Rashmi Bhardwaj and Varsha Duhoon
Abstract Climate on Earth is chaotic; hence, the meteorological department faces the tough challenge of forecasting weather with maximum accuracy. Weather forecasting is very important as it is the first step in preparing for upcoming natural hazards that threaten life. The objective of the paper is to study weather parameters using different clustering and classification techniques and to select the best-suited model on the basis of least error and least time taken. Weather parameters such as maximum and minimum temperature, wind speed, rainfall, relative humidity, evaporation, bright sunshine hours, average temperature and average humidity from 1 January 2017 to 30 September 2018 for the Delhi region are studied. Naive Bayes provides better results than the other classifiers on the basis of the statistical outcomes Kappa statistic, MAE, RMSE, RAE and RRSE. Clustering is the technique used for grouping the weather parameters into groups having similarity, and classification refers to forecasting the weather parameters from the input data by training, validating and testing on the data set. K-means clustering, EM clustering and hierarchical clustering are the clustering methods used in the study, and on the basis of the least time taken, it is concluded that K-means clustering is the most efficient clustering technique. Keywords Classification · Clustering · Naive Bayes · J48 · LIBSVM · K-means clustering · EM clustering · Hierarchical clustering · Data mining
R. Bhardwaj (B) · V. Duhoon University School of Basic and Applied Sciences (USBAS), Non-Linear Research Lab, Guru Gobind Singh Indraprastha University (GGSIPU), Dwarka, New Delhi 110078, India e-mail: [email protected] V. Duhoon e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_19
1 Introduction The meteorological department faces challenges in weather forecasting because of the chaotic nature of weather; hence, scientists and mathematicians have long worked rigorously in this field, generating new models and algorithms to predict weather with maximum accuracy. The paper studies different clustering and classification techniques for understanding the weather parameters and hence predicting them accurately and in less time. In classification, the objective is to forecast the target class by analysing the input data; classification is a supervised learning technique. The training data set is used to find proper boundaries for each target class, and these boundary conditions are then used to predict the target class; this process is called classification. The process and objective of clustering differ from those of classification. Clustering groups similar items together under the condition that no dissimilar items are placed in the same group. Clustering is an unsupervised learning technique and uses various measures to place similar values in one group, hence analysing the relationship between the objects. Classification and clustering differ in that classification has prior knowledge of the classes, whereas clustering does not; the data needed in classification is labelled, whereas clustering uses unlabelled data; and classification is a predictive method, whereas clustering is a descriptive method.
Bhardwaj and Duhoon studied various methods for weather forecasting using data mining techniques, including MLP, Gaussian process, SMO regression and linear regression, among which the best-suited tool was selected using the statistical measures CC, MAE, RMSE, RRSE and RAE; they also studied the ANFIS-SUGENO model, using the subtractive clustering technique to form the membership functions for forecasting minimum and maximum temperature; they further studied the behaviour of temperature from 1981 to 2015 by calculating the Hurst exponent, fractal dimension and predictability index using the R/S method [1–3]. Bhardwaj applied wavelet and fractal methods [4]. Bhardwaj and Srivastava studied cloud analysis methods of hydrometeors from radar data [5]. Bhardwaj et al. studied the T-170 model for prediction [6]. Durai and Bhardwaj predicted rainfall using multi-model techniques [7]. Erdem and Shi studied the ARMA model to predict wind speed and direction [8]. Finamore et al. applied data mining techniques to predict wind speed [9]. Geetha Ramani et al. studied and applied data mining techniques [10]. Paras developed a weather forecasting model [11]. Jyothis and Ratheesh predicted rainfall using data mining tools [12]. Moertini Veronica studied and applied the C4.5 algorithm [13]. Naik et al. applied ANN to predict weather [14]. Nadali et al. applied a fuzzy expert system to predict success [15]. Samya and Rathipriya applied ANN to predict time series data [16]. Shivang and Sidhar applied machine learning tools to predict weather at different places [17]. Vaibhavi et al. applied different approaches to predict weather [18]. Wang and Mujib applied a cloud computing technique to predict weather [19].
2 Methodology See Flowchart 1.
2.1 Rotation Estimation Rotation estimation, also called sample-based testing, is a technique for analysing the behaviour of a data set by repeatedly training and testing on the input values. Consider an input data set of n values divided into A subsets, each containing an equal number of values. Each of the A subsets is used in turn as the testing set while the remaining subsets are used for training, and the results of the A runs are averaged. The result is hence more accurate, since the whole data set is used for both testing and training.
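The rotation scheme can be sketched in Python as below. This is a hedged illustration rather than the procedure used in the study: the function names `rotation_folds` and `rotation_estimate`, the toy data, and the choice of a training-mean "model" scored by mean absolute error are all assumptions.

```python
def rotation_folds(n_values, n_subsets):
    """Yield (train_indices, test_indices) for each of the A equal-sized subsets."""
    if n_values % n_subsets != 0:
        raise ValueError("n_values must be divisible by n_subsets")
    size = n_values // n_subsets
    indices = list(range(n_values))
    for a in range(n_subsets):
        test = indices[a * size:(a + 1) * size]
        train = indices[:a * size] + indices[(a + 1) * size:]
        yield train, test

def rotation_estimate(values, n_subsets, fit, score):
    """Average the test score over all A train/test rotations."""
    scores = []
    for train, test in rotation_folds(len(values), n_subsets):
        model = fit([values[i] for i in train])
        scores.append(score(model, [values[i] for i in test]))
    return sum(scores) / len(scores)

# Toy usage: the "model" is the training mean, the score is mean absolute error.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fit_mean = lambda train: sum(train) / len(train)
mae = lambda model, test: sum(abs(v - model) for v in test) / len(test)
print(rotation_estimate(data, 3, fit_mean, mae))
```

Every value appears in exactly one testing subset, so the averaged score uses the whole data set for both training and testing.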
Flowchart 1 Application of the clustering and classification technique
2.2 Rotation Estimation-Based Naive Bayes Naive Bayes is a classification technique for binary and multi-class classification problems. The Naive Bayes classifier is based on Bayes theorem together with the naive assumption that the attributes are independent of each other. Bayes theorem: let $A = (a_1, a_2, a_3, \ldots, a_n)$ be the attribute vector and $Y_k$ the class variable. Bayes theorem states that
\[ P(Y_k \mid A) = \frac{P(A \mid Y_k)\,P(Y_k)}{P(A)}; \quad k = 1, 2, 3, 4, \ldots, J \tag{1} \]
such that $P(Y_k \mid A)$: posterior probability; $P(A \mid Y_k)$: likelihood; $P(Y_k)$: prior probability of the class; $P(A)$: prior probability of the predictor. Using the chain rule, the likelihood $P(A \mid Y_k)$ decomposes as:
\[ P(a_i \mid a_{i+1}, \ldots, a_n, Y_k) = P(a_i \mid Y_k) \tag{2} \]
\[ P(A \mid Y_k) = P(a_1 \mid a_2, a_3, \ldots, a_n, Y_k) \cdots P(a_{n-1} \mid a_n, Y_k)\,P(a_n \mid Y_k) \tag{3} \]
Naive conditional independence:
\[ P(a_i \mid a_{i+1}, \ldots, a_n, Y_k) = P(a_i \mid Y_k) \tag{4} \]
\[ P(A \mid Y_k) = P(a_1, a_2, a_3, \ldots, a_n \mid Y_k) \tag{5} \]
Further,
\[ P(A \mid Y_k) = \prod_{i=1}^{n} P(a_i \mid Y_k) \tag{6} \]
Posterior probability is:
\[ P(Y_k \mid A) = \frac{P(Y_k) \prod_{i=1}^{n} P(a_i \mid Y_k)}{P(A)} \tag{7} \]
Naive Bayes model: $P(A)$ is constant, and we have $P(Y_k \mid A) \propto P(Y_k) \prod_{i=1}^{n} P(a_i \mid Y_k)$, where $\propto$ denotes positive proportionality. The predicted class is the class value $Y_k$ that gives the maximum of $P(Y_k) \prod_{i=1}^{n} P(a_i \mid Y_k)$. Further, it can be formulated as:
Flowchart 2 Application of the Navie Bayes model
\[ \hat{Y} = \arg\max_{Y_k} \; P(Y_k) \prod_{i=1}^{n} P(a_i \mid Y_k) \tag{8} \]
$P(Y_k)$ is calculated as the relative frequency of class $Y_k$ in the training data (Flowchart 2).
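The decision rule in Eq. (8) can be illustrated with a small Python sketch. It is a minimal example under stated assumptions, not the classifier used in the study: attributes are treated as categorical, both $P(Y_k)$ and $P(a_i \mid Y_k)$ are estimated as plain relative frequencies without smoothing, and the attribute values and class names are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Y_k) and the counts behind P(a_i | Y_k) as relative frequencies."""
    class_counts = Counter(labels)
    priors = {y: c / len(labels) for y, c in class_counts.items()}
    # counts[(i, y)][value] = times attribute i took `value` in class y
    counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, value in enumerate(row):
            counts[(i, y)][value] += 1
    return priors, counts, class_counts

def predict(row, model):
    """Return argmax_k P(Y_k) * prod_i P(a_i | Y_k)."""
    priors, counts, class_counts = model
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            score *= counts[(i, y)][value] / class_counts[y]
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy usage with hypothetical discretised weather attributes.
rows = [("hot", "dry"), ("hot", "humid"), ("cool", "humid"), ("cool", "dry")]
labels = ["no_rain", "rain", "rain", "no_rain"]
model = train_naive_bayes(rows, labels)
print(predict(("hot", "humid"), model))  # rain
```

$P(A)$ never appears in the code because it is the same for every class and so does not change the arg max.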
2.3 Rotation Estimation-Based J48 This algorithm is applied in order to forecast the required target variable. J48 is an extension of ID3. The tree is generalized repeatedly until equilibrium is reached. The algorithm for J48 works as follows:
• If the instances belong to the same class, the tree is a leaf, which is labelled with that class.
• The potential information is measured for each attribute, and the gain in information that would result from a test on the attribute is calculated.
• The best attribute, found using the selection criterion, is selected for branching.
“Entropy” measures the disorder of the data and is calculated as follows:
abs(β−α)2 N abs(γ −α)2 N
(9)
Flowchart 3 Application of the J48 decision tree model
iterating over all possible values of $\vec{A}$. Conditional entropy is given by:
\[ \mathrm{ENTROPY}(K \mid \vec{A}) = \frac{A_K}{\vec{A}} \log \frac{A_K}{\vec{A}} \tag{10} \]
GAIN is given by: $\mathrm{GAIN}(K, \vec{A}) = \mathrm{ENTROPY}(\vec{A}) - \mathrm{ENTROPY}(K \mid \vec{A})$. Gain maximization is the primary objective. The classification process includes training on the data set and forming the tree. Pruning helps to reduce the errors produced by specialization to the training set; pruning generalizes the tree (Flowchart 3).
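The attribute-selection step can be illustrated with the sketch below. It uses the standard textbook definitions of Shannon entropy and information gain, which are an assumption here and may differ in notation and sign convention from Eqs. (9) and (10); the attribute and class values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy usage: the attribute separates the two classes perfectly.
outlook = ["sunny", "sunny", "rainy", "rainy"]
label = ["dry", "dry", "wet", "wet"]
print(entropy(label))                    # 1.0
print(information_gain(outlook, label))  # 1.0
```

A tree builder would evaluate `information_gain` for every candidate attribute and branch on the one with the largest value.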
2.4 Rotation Estimation-Based LIBSVM LIBSVM implements the “one-against-one” approach for multi-class classification. If X is the number of classes, then X(X − 1)/2 classifiers are constructed, each trained on data from two classes. The following classification problem is solved for the pth and qth classes:
\[ \min_{z^{pq},\, b^{pq},\, \xi^{pq}} \; \frac{1}{2}\,(z^{pq})^{T} z^{pq} + c \sum_{j} \xi_j^{pq} \tag{11} \]
subject to $(z^{pq})^{T}\phi(x_j) + b^{pq} \ge 1 - \xi_j^{pq}$ if $x_j$ is in the pth class; $(z^{pq})^{T}\phi(x_j) + b^{pq} \le -1 + \xi_j^{pq}$ if $x_j$ is in the qth class; $\xi_j^{pq} \ge 0$. Given the data, for any x the objective is to estimate $P_p = P(y = p \mid x)$, $p = 1, \ldots, k$, by first estimating the pairwise class probabilities $r_{pq} \approx P(y = p \mid y = p \text{ or } q,\, x)$. If $\hat{f}$ is the decision value at x, then we assume
\[ r_{pq} \approx \frac{1}{1 + e^{A\hat{f} + B}} \tag{12} \]
A and B are calculated by minimising the negative log likelihood of the training data.
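Two pieces of this scheme are easy to check numerically: the number of pairwise problems and the sigmoid of Eq. (12). The sketch below is illustrative only and does not call LIBSVM; the function names, the class names and the values chosen for A and B are hypothetical.

```python
import math
from itertools import combinations

def pairwise_problems(classes):
    """All unordered class pairs (p, q) used by a one-against-one scheme."""
    return list(combinations(classes, 2))

def pairwise_probability(decision_value, a, b):
    """Sigmoid estimate r_pq = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

pairs = pairwise_problems(["low", "medium", "high", "extreme"])
print(len(pairs))                                   # 6 = 4 * 3 / 2
print(pairwise_probability(0.0, -1.0, 0.0))         # 0.5
print(pairwise_probability(2.0, -1.0, 0.0) > 0.5)   # True
```

With a negative A, a larger decision value in favour of class p gives a pairwise probability closer to 1.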
2.5 K-Means Clustering In KNN classification, the output is a class membership. The KNN classifier can be viewed as assigning each of the k nearest neighbours a weight 1/k and all other points a weight 0. More generally, the jth nearest neighbour is assigned a weight $\phi_{nj}$ such that $\sum_{j=1}^{n} \phi_{nj} = 1$. Let $\alpha_n$ denote the weighted nearest neighbour classifier with weights $\{\phi_{nj}\}_{j=1}^{n}$. The optimal weighting scheme $\{\phi^{*}_{nj}\}_{j=1}^{n}$ is taken as:
\[ \phi^{*}_{nx} = \frac{1}{\beta^{*}}\left[1 + \frac{\varepsilon}{2} - \frac{\varepsilon}{2\,\beta^{*\,2/d}}\left\{x^{1+2/d} - (x-1)^{1+2/d}\right\}\right], \quad x = 1, 2, 3, 4, \ldots, \beta^{*} \tag{13} \]
\[ \phi^{*}_{nx} = 0; \quad x = \beta^{*} + 1, \ldots, n \tag{14} \]
where $\beta^{*} = n^{4/(d+4)}$. Further, the iterative clustering algorithm moves towards a local optimum in each iteration. It functions as follows:
(1) Specify the number of clusters K.
(2) Assign every value to a cluster.
(3) Find the cluster centres.
(4) Re-allocate every value to the closest cluster centre.
(5) Re-calculate the cluster centres using $\nu_i = \frac{1}{c_i}\sum_{j=1}^{c_i} x_j$, where $c_i$ is the number of data values in the ith cluster.
(6) Repeat Steps 4 and 5 until the best possible output is achieved, at which point the algorithm terminates.
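Steps 1 to 6 can be sketched for one-dimensional data as follows. This is a minimal sketch under stated assumptions: the initial assignment (equal-sized chunks of the sorted values) and the stopping rule (centres unchanged) are hypothetical choices, and the temperatures are made-up values rather than data from the study.

```python
def k_means(values, k, max_iterations=100):
    """Cluster 1-D values into k groups following Steps 1 to 6."""
    # Steps 2-3: initial assignment by sorted position, then initial centres.
    ordered = sorted(values)
    size = len(ordered) / k
    centres = []
    for i in range(k):
        chunk = ordered[int(i * size):int((i + 1) * size)]
        centres.append(sum(chunk) / len(chunk))
    for _ in range(max_iterations):
        # Step 4: re-allocate every value to the closest centre.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Step 5: re-calculate each centre as the mean of its members.
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        # Step 6: stop once the centres no longer move.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

temperatures = [12.0, 13.0, 11.5, 30.0, 31.5, 29.0]   # hypothetical values
centres, clusters = k_means(temperatures, k=2)
print([round(c, 2) for c in centres], clusters)
```

Each pass costs only one distance computation per value and centre, which is consistent with K-means being the fastest of the three clustering methods compared later.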
2.6 EM Clustering The model-based approach applies models to the clusters and attempts to optimise the fit between the data and the model. It works as follows:
• A component (Gaussian) is chosen at random with probability $p(\alpha)$;
• A point is sampled from $N(\mu_i, \sigma^2 I)$;
• The algorithm calculates the combination of Gaussians that models the time series.
The Expectation Maximisation (EM) algorithm for a mixture of Gaussians functions as follows:
(1) Initialise parameters: $\delta_0 = \{\alpha_1^{(0)}, \alpha_2^{(0)}, \ldots, \alpha_k^{(0)}, p_1^{(0)}, p_2^{(0)}, \ldots, p_k^{(0)}\}$
(2) E-step:
\[ p(w_j \mid x_k, \xi_t) = \frac{p(x_k \mid w_j, \xi_t)\, p(w_j \mid \xi_t)}{p(x_k \mid \xi_t)} \tag{15} \]
(3) M-step:
\[ \xi_i^{t+1} = \frac{\sum_k p(w_i \mid x_k, \xi_t)\, x_k}{\sum_k p(w_i \mid x_k, \xi_t)} \tag{16} \]
\[ p_i^{t+1} = \frac{\sum_k p(w_i \mid x_k, \xi_t)}{R} \tag{17} \]
where R is the number of records.
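The E-step and M-step can be sketched for a one-dimensional mixture as below. This is an illustrative sketch, not the implementation used in the study: it assumes a fixed, shared standard deviation `sigma`, a fixed number of iterations, and hypothetical starting values and data; the variable names are chosen for readability and do not follow the symbols of Eqs. (15) to (17).

```python
import math

def gaussian(x, mean, sigma):
    """Density of N(mean, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_mixture(values, means, weights, sigma=1.0, iterations=50):
    """EM for a 1-D Gaussian mixture with a fixed, shared sigma."""
    means, weights = list(means), list(weights)
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            joint = [w * gaussian(x, m, sigma) for m, w in zip(means, weights)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: responsibility-weighted means and mixing weights.
        for i in range(len(means)):
            r_i = sum(r[i] for r in resp)
            means[i] = sum(r[i] * x for r, x in zip(resp, values)) / r_i
            weights[i] = r_i / len(values)
    return means, weights

values = [9.8, 10.1, 10.3, 19.7, 20.0, 20.4]   # hypothetical records
means, weights = em_mixture(values, means=[8.0, 22.0], weights=[0.5, 0.5])
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

Every iteration evaluates a density for each record and component, which is one reason EM tends to be slower than K-means on the same data.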
2.7 Hierarchical Clustering This algorithm builds a hierarchy of clusters. Initially, each data value is assigned to a cluster of its own. The two nearest clusters are then merged into one cluster, and the algorithm stops when only one cluster remains. Two clusters are merged on the basis of the distance between them, which is calculated using the Euclidean distance:
\[ d(\alpha, \beta) = \sqrt{\sum_{i} (\alpha_i - \beta_i)^2} \tag{18} \]
(1) Start with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
(2) Find the least dissimilar pair of clusters (A), (B) in the current clustering, according to d[(A), (B)] = min d[(x), (y)], where the minimum is taken over all pairs of clusters (x), (y).
(3) Increase the sequence number: m = m + 1. Merge clusters (A) and (B) into a single cluster to form the next clustering m; set the level of this clustering to L(m) = d[(A), (B)].
(4) Update the proximity matrix by removing the rows and columns related to clusters (A) and (B) and adding a row and column for the newly formed cluster. The proximity between the new cluster, denoted (A, B), and an old cluster (k) is:
\[ d[(k), (A, B)] = \min\{\, d[(k), (A)],\; d[(k), (B)] \,\} \tag{19} \]
(5) If all objects are in one cluster, stop. Otherwise, go back to Step 2.
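The merging procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions: instead of maintaining and updating a proximity matrix, it recomputes the minimum pairwise distance between clusters on each pass, which gives the same merges as the min rule of Eq. (19); the points are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Merge the two nearest clusters until one remains; return the merge log."""
    clusters = [[p] for p in points]          # Step 1: every point on its own
    merges = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the smallest distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Steps 3-4: merge the pair and record the level of the merge.
        merged = clusters[i] + clusters[j]
        merges.append((d, merged))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
for level, members in single_linkage(points):
    print(round(level, 3), members)
```

The nested search over all cluster pairs is repeated after every merge, so the running time grows quickly with the number of records.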
3 Results and Discussions Daily data of temperature, rainfall, bright sunshine, evaporation, relative humidity and wind speed for Delhi (longitude 77° 09′ 27″, latitude 28° 38′ 23″ N, altitude 228.61 m) were taken from 1 January 2017 to 30 September 2018. The statistical analysis of the testing and training parameters of temperature, rainfall, bright sunshine, evaporation, relative humidity and wind speed was carried out using the different models. The statistical outcomes Kappa statistic (KS), mean absolute error (MAE), root mean square error (RMSE), relative absolute error (RAE) and root relative squared error (RRSE) of Naive Bayes, J48 and LIBSVM at different folds of cross-validation were compared to identify the model with the least error. Tables 1 and 2 give the statistical analysis on the basis of KS, MAE, RMSE, RAE and RRSE, calculated at different folds using rotation estimation. Naive Bayes was the comparatively better classification tool for further forecasting of the parameters. From Tables 1 and 2, it can be concluded that the results are comparatively better at fold 10; hence, the model will give better results at n = 10. Tables 3 and 4 give the statistical analysis of KS, MAE, RMSE, RAE and RRSE of Naive Bayes, J48 and LIBSVM for daily minimum and maximum temperature, respectively; again, Naive Bayes was the comparatively better classification tool. Figures 1 and 2 give a graphical comparison of these statistical parameters for Naive Bayes, J48 and LIBSVM for daily minimum and maximum temperature, where Naive Bayes shows the best results (Table 5).
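For reference, four of the error measures named above can be computed as in the sketch below. The formulas are the commonly used definitions (RAE and RRSE are taken relative to a baseline that always predicts the mean of the actual values); it is an assumption that the tool used in the study applies exactly these definitions, and the numbers are hypothetical.

```python
import math

def error_metrics(actual, predicted):
    """Return MAE, RMSE, RAE (%) and RRSE (%) for two equal-length lists."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline: always predicting the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": 100.0 * abs_err / base_abs,
        "RRSE": 100.0 * math.sqrt(sq_err / base_sq),
    }

actual = [10.0, 12.0, 14.0, 16.0]       # hypothetical observed temperatures
predicted = [11.0, 12.0, 13.0, 17.0]    # hypothetical forecasts
print(error_metrics(actual, predicted))
```

Under these definitions, an RAE or RRSE close to or above 100% means the model does about as well as, or worse than, the mean baseline.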
Table 1 Daily maximum temperature, statistical analysis on the basis of KS, MAE, RMSE, RAE, RRSE of Naive Bayes, J48, LIBSVM on the basis of different folds of cross-validation
Model        Folds (n)  KS      MAE     RMSE    RAE (%)   RRSE (%)
Naive Bayes  10         0.0257  0.0115  0.0771  98.6344   100.9862
Naive Bayes  20         0.0219  0.0115  0.0772  98.6803   101.680
Naive Bayes  40         0.0185  0.0115  0.0773  98.7443   101.2357
Naive Bayes  50         0.0185  0.0115  0.0774  98.7691   101.2946
J48          10         0.0000  0.0113  0.1064  97.1016   139.306
J48          20         0.0002  0.0113  0.1064  97.0786   139.3734
J48          40         0.0001  0.0113  0.1064  97.0626   139.3363
J48          50         0.0001  0.0113  0.1064  97.0594   139.3120
LIBSVM       10         0.0040  0.0116  0.0769  99.5872   100.6783
LIBSVM       20         0.0041  0.0116  0.0770  99.6028   100.8519
LIBSVM       40         0.0041  0.0116  0.0770  99.6093   100.8556
LIBSVM       50         0.0041  0.0116  0.0770  99.6106   100.8564
Table 2 Daily minimum temperature, statistical analysis on the basis of KS, MAE, RMSE, RAE, RRSE of Naive Bayes, J48, LIBSVM on the basis of different folds of cross-validation
Model        Folds (n)  KS       MAE     RMSE    RAE (%)   RRSE (%)
Naive Bayes  10         0.0128   0.0098  0.0706  99.5734   100.3858
Naive Bayes  20         0.0111   0.0098  0.0706  99.612    100.4520
Naive Bayes  40         0.0095   0.0099  0.0707  99.638    100.4909
Naive Bayes  50         0.0095   0.0099  0.0707  99.6462   100.5029
LIBSVM       10         −0.0013  0.0098  0.0988  98.6723   140.4830
LIBSVM       20         −0.0077  0.0098  0.0991  99.3127   140.9057
LIBSVM       40         −0.0013  0.0099  0.0993  99.6316   141.1227
LIBSVM       50         −0.0097  0.0098  0.0992  99.4638   141.0020
J48          10         −0.0001  0.0099  0.0722  99.6656   102.7245
J48          20         −0.0125  0.0099  0.0712  99.9686   102.1828
J48          40         −0.0128  0.0099  0.0712  99.9736   101.7255
J48          50         −0.0046  0.0099  0.0712  99.9745   101.1860
Table 3 Daily minimum temperature, statistical analysis on the basis of KS, MAE, RMSE, RAE, RRSE of Naive Bayes, J48, LIBSVM (n = 10)
Statistical parameter            Naive Bayes  J48      LIBSVM
Kappa statistics                 0.0128       −0.0001  −0.0013
Mean absolute error              0.098        0.0099   0.0098
Root mean square error           0.0706       0.0722   0.0988
Relative absolute error (%)      99.5734      99.6656  98.6723
Root relative squared error (%)  100.3858     102.724  140.483
Table 4 Daily maximum temperature, statistical analysis on the basis of KS, MAE, RMSE, RAE, RRSE of Naive Bayes, J48, LIBSVM (n = 10)
Statistical parameter            Naive Bayes  J48       LIBSVM
Kappa statistics                 0.0257       0.000     0.004
Mean absolute error              0.0115       0.0113    0.0116
Root mean square error           0.0771       0.1064    0.0769
Relative absolute error (%)      98.6344      97.1016   99.5872
Root relative squared error (%)  100.9862     139.3068  100.6783
Fig. 1 Statistical parameter comparison of minimum temperature (chart of Kappa statistics, mean absolute error, root mean square error, relative absolute error and root relative squared error for Naive Bayes, J48 and LIBSVM)
Fig. 2 Statistical parameter comparison of maximum temperature (chart of the same five statistical parameters for Naive Bayes, J48 and LIBSVM)
Table 5 Time taken to cluster the daily maximum and minimum temperature, average temperature, rainfall, bright sun shine, evaporation, average relative humidity, wind speed, relative humidity
Time taken (s)                      K-means  EM    HC
Time taken (all parameters)         0.01     19.8  1.61
Time taken (minimum temperature)    0.03     8.82  4.58
Time taken (maximum temperature)    0.01     9.45  4.68
Fig. 3 Comparison of time taken in seconds by the K-means, hierarchical clustering and expectation maximization clustering (chart of time taken for all parameters, minimum temperature and maximum temperature)
It is observed that, among all the techniques, K-means takes the least time to cluster the data. The graphical comparison of the time taken in seconds is given in Fig. 3, from which it can be seen that, among the clustering techniques, K-means takes the least time.
4 Conclusion Clustering and classification techniques are computationally effective. The objective of the paper was to study the time series of weather parameters and to analyse which of the clustering and classification techniques is appropriate. The main focus of the research was to select the model with the least time taken and the least error. Among the classification techniques, Naive Bayes proved to be the most efficient, and among the clustering techniques, K-means was the most efficient. Naive Bayes can be used to predict the weather parameters with reduced error in the output. K-means clusters data with similar properties efficiently and hence gives an outlook on the pattern of the time series. Prediction of weather parameters is important not only for agricultural production but also for the common people, tourists, the aviation sector and other sectors that are directly or indirectly affected by the day-to-day temperature. Weather forecasts are also important for the government to analyse future conditions and make appropriate policies.
Acknowledgements The authors are thankful to Guru Gobind Singh Indraprastha University for providing financial support and research facilities.
References
1. R. Bhardwaj, V. Duhoon, Weather forecasting using soft computing technique, in International Conference on Computing, Power, and Communication Technologies (GUCON), IEEE Explore Digital Library (2018), pp. 1111–1115
2. R. Bhardwaj, V. Duhoon, Real time weather parameter forecasting using Anfis-Sugeno. Int. J. Eng. Adv. Technol. (IJEAT) 9(1), 461–469 (2019). (Blue Eyes Intelligence Engineering & Sciences Publication (BEIESP))
3. R. Bhardwaj, V. Duhoon, Time series analysis of heat stroke. JANANBHA Vijnana Parishad India 49(1), 01–10 (2019)
4. R. Bhardwaj, Wavelets and fractal methods with environmental applications, in Mathematical Models, Methods and Applications, ed. A.H. Siddiqi, P. Manchanda, R. Bhardwaj (2016), pp. 173–195
5. R. Bhardwaj, K. Srivastava, Real time Nowcast of a Cloudburst and a Thunderstorm event with assimilation of Doppler Weather Radar data. Nat. Hazards 70(2), 1357–1383 (2014)
6. R. Bhardwaj, A. Kumar, P. Maini, S.C. Kar, L.S. Rathore, Bias-free rainfall forecast and temperature trend-based temperature forecast using T-170 model output during the monsoon season. Meteorol. Appl. Royal Meteorol. Soc. 14(4), 351–360 (2010)
7. V.R. Durai, R. Bhardwaj, Forecasting quantitative rainfall over India using multi-model ensemble technique. Meteorol. Atmos. Phys. 126(1–2), 31–48 (2014)
8. E. Erdem, J. Shi, ARMA based approaches for forecasting the tuple of wind speed and direction. Appl. Energy, Elsevier 88, 1405–1414 (2011)
9. R. Finamore, V. Calderaro, V. Galdi, A. Piccolo, G. Conio, S. Grasso, A day-ahead wind speed forecasting using data-mining model: a feed-forward NN algorithm, in IEEE International Conference on Renewable Energy Research and Applications (2015), pp. 1230–1235
10. R. Geetha Ramani, L. Balasubramanian, S.G. Jacob, Automatic prediction of Diabetic Retinopathy and Glaucoma through retinal image analysis and data mining techniques, in Machine Vision and Image Processing (MVIP), 2012 International Conference (2012), pp. 149–152
11. S.M. Paras, A simple weather forecasting model using mathematical regression. Indian Res. J. Extension Educ. 12(4), 161–168 (2016)
12. J. Jyothis, T.K. Ratheesh, Rainfall prediction using data mining techniques. Int. J. Comput. Appl. 83 (2013)
13. S. Moertini Veronica, Towards the use of C4.5 algorithm for classifying banking dataset. Integeral 8(2) (2013)
14. A.R. Naik, P.M. Shafi, S.P. Kosbatwar, Weather prediction using error minimization algorithm on feedforward artificial neural network, in Intelligent Computing, Networking, and Informatics. Advances in Intelligent Systems and Computing, ed. D. Mohapatra, S. Patnaik (Springer, 2014), p. 243
15. A. Nadali, E.N. Kakhky, H.E. Nosratabadi, Evaluating the success level of data mining projects based on CRISP-DM methodology by a Fuzzy expert system. Electron. Comput. Technol. (ICECT) 6, 161–165 (2011)
16. R. Samya, R. Rathipriya, Predictive analysis for weather prediction using data mining with ANN: a study. Int. J. Comput. Intell. Inf. 6(2), 150–154 (2016)
17. J. Shivang, S.S. Sidhar, Weather prediction for Indian location using machine learning. Int. J. Pure Appl. Mathe. 118(22), 1945–1949 (2018)
18. M. Vaibhavi, P. Vibha, V. Jean-Francois, H. Srikantha, Improving global rainfall forecasting with a weather type approach in Japan. Hydrol. Sci. J. 62(2), 167–181 (2017)
19. Z.L. Wang, A.M. Mujib, The weather forecast using data mining research based on cloud computing. J. Phys. Conf. Series 910 (2017)
Convection Dynamics of Nanofluids for Temperature and Magnetic Field Variations Rashmi Bhardwaj and Meenu Chawla
Abstract This paper studies the nonlinear stability and convection dynamics of nanofluids under temperature and magnetic field variations, and the effect of these variations on the electrical conductivity of the nanofluids. The system, in Cartesian coordinates, comprises a fluid layer in a chamber subjected to an external magnetic field, gravity and heating. The partial differential equations are obtained from the conservation equations for energy and momentum; these equations are then converted into a three-dimensional nonlinear system of differential equations similar to the Lorenz equations. Using time series and stability analysis, the effect of temperature and magnetic force, through the Rayleigh and Hartmann numbers, on the transition to chaos is investigated for aluminum trioxide (Al2O3), titanium dioxide (TiO2), zinc oxide (ZnO), silicon dioxide (SiO2) and copper oxide (CuO) nanofluids. A kind of magnetic cooling is observed, indicated by the stabilization of chaos in nanofluid convection as the applied field, and hence the Hartmann number, increases. As the Rayleigh number increases, the system transits from the stable to the chaotic stage, and once the chaotic phase begins, system stability cannot be restored by controlling the Rayleigh number. Among all the nanofluids, CuO resists the chaotic stage for the longest time in response to an increase in temperature, and Al2O3 requires the least increase in magnetic field intensity to be restored from the chaotic to the stable phase. It is concluded that variations in temperature and magnetic field cause the system to transit from the stable to the chaotic state and back to the stable state, and that the electrical conductivity of the different nanofluids decreases and increases, respectively. This phenomenon has wide application in pharmacy, biosciences, health sciences, environment and all fields of engineering. Keywords Phase portrait · Chaotic phase · Rayleigh number · Hartmann number · Nanofluids
R. Bhardwaj · M. Chawla (B) Non-Linear Dynamics Research Lab, University School of Basic and Applied Sciences, Guru Gobind Singh Indraprastha University, Dwarka, Delhi 110078, India e-mail: [email protected] R. Bhardwaj e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_20
1 Introduction
Over the last few decades, the chaotic behavior of dynamical systems has attracted growing attention. Chaos in convection dynamics plays a vital role in the evolution of dynamical systems in electrical, mechanical, magneto-mechanical, biological, chemical-reaction and fluid-flow settings. The route to chaos in a fluid layer was studied extensively by Lorenz [1], who used the Bénard problem to model the unpredictability of weather convection. Lorenz considered a two-dimensional fluid cell cooled from above and heated from below and, on that basis, derived a three-dimensional system of ordinary differential equations as a model of fluid convection. Studies of different problems then began to adopt the ideas introduced by Lorenz. The critical stage associated with the loss of linear stability of the Lorenz equations was modeled and discussed by Sparrow [2]. Kimura et al. [3] used a spectral scheme to study convection in a saturated porous layer and observed the transition from the steady state to chaos as the Rayleigh number increases. Pecora and Carroll [4] studied the synchronization of chaotic systems. The term nanofluid was proposed by Choi, who defined it as a suspension of solid nanoparticles, typically 1 to 100 nm in size, in a base fluid [5]. The base fluid is commonly H2O or an organic fluid such as ethylene glycol. The greatly enhanced mass and heat transfer in nanofluids is attributed to the size of the suspended nanoparticles and has been discussed in detail by several authors [6–9]. Because of their greater stability and thermal conductivity compared with conventional heat transfer fluids or suspensions of micro-sized particles, nanofluid mechanics has become a new challenge in heat transfer.
Particular attention has been given to nanofluid transport properties over the last few years because of their remarkable technological uses [10]. Nanofluids are increasingly applied in modern cooling systems, electromechanical systems and thermal control systems such as evaporators and heat exchangers. Very little theoretical work has been carried out on the electrical conductivity of H2O-based nanofluids. Vadasz et al. [11] analyzed the transitions in natural convection in porous media. Ganguly et al. [12] and Modesto Lopez and Biswas [13] discussed the increase in electrical conductivity with particle fraction. Idris and Hashim [14] showed that convective motion in a fluid layer can be slowed down by a magnetic field and observed the transition from steady (stable) to chaotic states via a Hopf bifurcation. Odibat et al. [15] discussed chaos synchronization by linear control of two identical systems and demonstrated that synchronization of two different fractional-order chaotic systems can be achieved through active control. Kadri et al. [16] analyzed vertical magneto-convection of an H2O-based Al2O3 nanofluid in a square cavity and concluded that the magnetic field affects the flow and heat transfer of nanofluids. Rahmannezhad et al. [17] numerically investigated the effect of a magnetic field on mixed convection of nanofluids in a lid-driven cavity,
and, from the numerical results, observed that the heat transfer rate decreased as the Hartmann number increased. The work cited above has focused on the chaotic behavior of convection in nanofluids, but the combined effects of the Rayleigh and Hartmann numbers on the chaotic behavior of convection in different nanofluids have not been addressed. This paper studies the effects of heating and magnetic field on nanofluid convection through variation of the Rayleigh and Hartmann numbers, which are associated with changes in temperature and in the magnitude of the magnetic field, respectively. The system comprises a rectangular chamber in Cartesian coordinates containing an electrically conducting nanofluid layer that is subject to gravity and heated from below. A Lorenz-like system is obtained by applying a truncated Galerkin approximation, and the chaotic behavior of the system is discussed in terms of the variation of the Rayleigh and Hartmann numbers, i.e., of temperature and magnetic field magnitude, for different nanofluids in order to compare their convection dynamics. The significance of the study lies in applications such as reactors, where nanofluids are used as coolants and high temperatures can lead to chaos in fluid convection. An alternative mechanism for stabilizing nanofluid convection is therefore essential for controlling it when the flow becomes chaotic. As a result of this disorder, electrical conductivity decreases. Thus, to maintain the electrical conductivity of the flow, chaos has to be prevented.
2 Mathematical Modeling
Consider a very small chamber, described in a Cartesian coordinate system, through which electrically conducting nanofluids pass. The horizontal fluid layer is exposed to heating and to a magnetic field, in addition to gravity. The two long walls are maintained at temperatures $T_H$ and $T_C$, respectively, while the short walls are thermally insulated. The vertical z-axis is collinear with gravity, i.e., $\hat{e}_g = -\hat{e}_z$. A steady electromagnetic field (B) is applied normal to the hot side of the chamber, as shown in Fig. 1.
Fig. 1 Simplified diagram of the chamber with nanofluids
The change in fluid density is caused by the variation in temperature, which results from heat conduction and from the interaction of the magnetic field with the convective motion. The Reynolds number is taken to be small, so that the induced electromagnetic field is negligible. In Darcy's equation, the time-derivative term is retained for small values of the Prandtl number. Darcy's law governs the flow of the fluid, and the Boussinesq approximation accounts for the density variations. The equations of conservation of mass, momentum, electric charge and energy in laminar flow are

$$\nabla \cdot V = 0 \tag{1}$$

$$\frac{\partial V}{\partial t} + V \cdot \nabla V + \frac{1}{\rho_{nf}} \nabla p = \vartheta_{nf} \nabla^{2} V + E \times B - \frac{(\rho\beta)_{nf}}{\rho_{nf}}\, g\, (T - T_c) \tag{2}$$

$$\nabla \cdot E = 0 \quad \text{and} \quad E = E_c\,(-\nabla \phi + V \times B) \tag{3}$$

$$\frac{\partial T}{\partial t} + V \cdot \nabla T = \alpha_{nf} \nabla^{2} T \tag{4}$$

where V = velocity, p = pressure, T = temperature, β = coefficient of thermal expansion, $\phi$ = electric potential, ϑ = viscosity of the fluid, E = density of electric current, $E_c$ = electrical conductivity, B = electromagnetic field, $\rho_{nf}$ = effective density, $\alpha_{nf}$ = thermal diffusivity, $(\rho\beta)_{nf}$ = thermal expansion coefficient, g = gravity, $T - T_c$ = temperature difference and $\vartheta_{nf}$ = kinematic viscosity of the nanofluid.
The effective density of the nanofluid, $\rho_{nf}$, is given as:

$$\rho_{nf} = (1 - \phi)\,\rho_f + \phi\,\rho_{np} \tag{a}$$

where $\phi$ is the solid volume fraction of nanoparticles ($\phi = 0.05$), $\rho_f$ is the density of the fluid and $\rho_{np}$ is the density of the nanoparticle. The thermal diffusivity of the nanofluid, $\alpha_{nf}$, is given as:

$$\alpha_{nf} = \frac{k_{nf}}{(\rho C_p)_{nf}} \tag{b}$$

where $k_{nf}$ is the thermal conductivity and $(\rho C_p)_{nf}$ is the heat capacitance of the nanofluid. The thermal conductivity of the nanofluid, $k_{nf}$, is given as:

$$k_{nf} = (1 - \phi)\,k_f + \phi\,k_{np} \tag{c}$$

where $k_f$ is the thermal conductivity of the fluid and $k_{np}$ is the thermal conductivity of the nanoparticle.
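The mixing rules above can be evaluated directly. The following is a minimal Python sketch, not the authors' implementation: it applies the linear volume-fraction rule as printed in Eqs. (a) and (c), uses the same rule for the heat capacitance (Eq. (d) of the paper), and takes the water and Al2O3 values from Table 1. The function and dictionary names are hypothetical, and the tabulated ratios α, β and ϑ are not recomputed here.

```python
def mix(phi, fluid_value, particle_value):
    """Linear volume-fraction mixing rule, as printed in Eqs. (a), (c) and (d)."""
    return (1.0 - phi) * fluid_value + phi * particle_value


def nanofluid_properties(phi, fluid, particle):
    """Effective density, heat capacitance, conductivity and diffusivity.

    `fluid` and `particle` are dicts with keys rho (kg m^-3),
    k (W m^-1 K^-1) and cp (J kg^-1 K^-1); the names are illustrative.
    """
    rho_nf = mix(phi, fluid["rho"], particle["rho"])            # Eq. (a)
    rho_cp_nf = mix(phi, fluid["rho"] * fluid["cp"],
                    particle["rho"] * particle["cp"])            # Eq. (d)
    k_nf = mix(phi, fluid["k"], particle["k"])                  # Eq. (c)
    alpha_nf = k_nf / rho_cp_nf                                 # Eq. (b)
    return {"rho": rho_nf, "rho_cp": rho_cp_nf, "k": k_nf, "alpha": alpha_nf}


# Values taken from Table 1 (water and Al2O3), with phi = 0.05 as in the text.
WATER = {"rho": 997.1, "k": 0.613, "cp": 4179.0}
AL2O3 = {"rho": 3970.0, "k": 40.0, "cp": 765.0}

props = nanofluid_properties(0.05, WATER, AL2O3)
print(props)
```

Keeping a single `mix` helper makes it explicit that Eqs. (a), (c) and (d) share one functional form, so a different conductivity model could be swapped in for Eq. (c) without touching the rest.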
The heat capacitance of the nanofluid, $(\rho C_p)_{nf}$, is given as:

$$(\rho C_p)_{nf} = (1 - \phi)\,(\rho C_p)_f + \phi\,(\rho C_p)_{np} \tag{d}$$

where $(\rho C_p)_f$ is the heat capacitance of the fluid and $(\rho C_p)_{np}$ is the heat capacitance of the nanoparticle. The thermal expansion coefficient of the nanofluid, $(\rho\beta)_{nf}$, is given as:

$$(\rho\beta)_{nf} = (1 - \phi)\,(\rho\beta)_f + \phi\,(\rho\beta)_{np} \tag{e}$$
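where $(\rho\beta)_f$ is the thermal expansion coefficient of the fluid and $(\rho\beta)_{np}$ is the thermal expansion coefficient of the nanoparticle. In Eq. (3), the electric potential satisfies $\nabla^{2}\phi = 0$ in the steady state, and the Lorentz force then reduces to a standard damping term. The following transformations are used to make Eqs. (1)–(4) nondimensional, with dimensional quantities marked by an asterisk:

$$V = \frac{h}{\alpha_f}\,V^{*}, \qquad t = \frac{\alpha_f}{h^{2}}\,t^{*}, \qquad (x, y, z) = \frac{1}{h}\,(x^{*}, y^{*}, z^{*}), \qquad T\,\Delta T_c = T^{*} - T_c,$$

with the pressure P and the magnetic field B likewise made dimensionless. Here V = (u, v, w) is the dimensionless velocity, P the dimensionless pressure, t the dimensionless time, $\Delta T_c = T_H - T_C$ the characteristic temperature difference, B the dimensionless magnetic field and h the scaling factor. Simplifying Eq. (2) and taking the curl of both sides to eliminate the pressure, we get

$$\frac{\partial(\nabla \times V)}{\partial t} + \nabla \times (V \cdot \nabla V) - \frac{\vartheta_{nf}}{\alpha_f}\,\nabla \times \nabla^{2} V + \frac{E_c B^{2} h^{4}}{\alpha_f}\,(\nabla \times V) = -\frac{(\rho\beta)_{nf}}{\rho_{nf}}\,\frac{g\,\Delta T_c\,h^{3}}{\alpha_f^{2}}\,(\nabla \times T) \tag{5}$$

We assume the following boundary conditions:
1. For the temperature, T = 0 at z = 1 and T = 1 at z = 0,
2. Stress-free condition: $\partial u/\partial z = \partial v/\partial z = \partial^{2} w/\partial z^{2} = 0$ at z = 0 and z = 1,
3. Impermeability condition: $V \cdot \hat{e}_n = 0$ at z = 0 and z = 1.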
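In this paper, the flow is assumed to be two-dimensional, i.e., the component v of the velocity V vanishes and the remaining components u and w depend on x, z and t only. Therefore, we can introduce a stream function $\varphi(x, z, t)$ such that

$$u = \frac{\partial \varphi}{\partial z} \quad \text{and} \quad w = -\frac{\partial \varphi}{\partial x},$$

and Eq. (5) can then be transformed as

$$\left[\frac{1}{Pr}\left(\frac{\partial}{\partial t} + \frac{\partial \varphi}{\partial z}\frac{\partial}{\partial x} - \frac{\partial \varphi}{\partial x}\frac{\partial}{\partial z}\right) - \vartheta \nabla^{2} + \gamma\right] \nabla^{2} \varphi = -\beta\,Ra\,\frac{\partial T}{\partial x} \tag{6}$$

and

$$\frac{\partial T}{\partial t} + \frac{\partial \varphi}{\partial z}\frac{\partial T}{\partial x} - \frac{\partial \varphi}{\partial x}\frac{\partial T}{\partial z} = \alpha\left(\frac{\partial^{2} T}{\partial x^{2}} + \frac{\partial^{2} T}{\partial z^{2}}\right) \tag{7}$$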
where the ratio of thermal diffusivities is $\alpha = \alpha_{nf}/\alpha_f$; the ratio of kinematic viscosities is $\vartheta = \vartheta_{nf}/\vartheta_f$; the ratio of kinematic viscosity to thermal diffusivity is $Pr = \vartheta_f/\alpha_f$ (the Prandtl number); the ratio of thermal expansion coefficients is $\beta = (\rho\beta)_{nf}/(\rho\beta)_f$; $\gamma = E_c B^{2} h^{4}/\vartheta_f$; $Ra = \beta_f\, g\, \Delta T\, h^{3}/(\alpha_f \vartheta_f)$ (the Rayleigh number); and $Ha = B\,(E_c/\vartheta_f)^{1/2}$ (the Hartmann number). The quantities α, β, γ, Pr and ϑ are constant and dimensionless. Now we take the stream function $\varphi$ and the temperature T in the form

$$\varphi = a_{11}(t)\,\sin(kx)\,\sin(\pi z) \quad \text{and} \quad T = 1 - z + b_{11}(t)\,\cos(kx)\,\sin(\pi z) + b_{02}(t)\,\sin(2\pi z),$$

where k is a free parameter, taken to be a wave number. The dimensionless amplitudes $a_{11}$, $b_{11}$ and $b_{02}$ (x and z being dimensionless) are the variables used for the simulation of the system. This representation is equivalent to a Galerkin expansion of the solution in the x- and z-directions. Substituting these expressions into Eqs. (6) and (7), performing the integration

$$\int_{0}^{k/\pi}\!\!\int_{0}^{1} f(x, z)\,dz\,dx$$

and then applying the orthogonality of the Galerkin basis, we obtain
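$$\frac{da_{11}}{dt} = -Pr\,\vartheta\,(k^{2} + \pi^{2})\left[\frac{\beta k\,Ra}{\vartheta\,(k^{2} + \pi^{2})^{2}}\,b_{11} + a_{11}\left(1 + \frac{\gamma}{\vartheta\,(k^{2} + \pi^{2})}\right)\right] \tag{8}$$

$$\frac{db_{11}}{dt} = -a_{11}k + 2\pi k\,a_{11}b_{02} - \alpha\,b_{11}\,(k^{2} + \pi^{2}) \tag{9}$$

$$\frac{db_{02}}{dt} = \frac{\pi k}{2}\,a_{11}b_{11} - 4\pi^{2}\alpha\,b_{02} \tag{10}$$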
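As a consistency check on Eq. (8), the projection can be carried out by hand. The sketch below assumes that Eq. (6) reads $[\frac{1}{Pr}(\partial_t + \varphi_z \partial_x - \varphi_x \partial_z) - \vartheta\nabla^2 + \gamma]\nabla^2\varphi = -\beta\,Ra\,\partial_x T$, and it uses the shorthand $K = k^2 + \pi^2$, which is introduced here only for brevity and does not appear in the paper.

```latex
\begin{aligned}
\nabla^{2}\varphi &= -K\,a_{11}\sin(kx)\sin(\pi z), \qquad
\nabla^{2}\nabla^{2}\varphi = K^{2}a_{11}\sin(kx)\sin(\pi z),\\
\left(\varphi_z\partial_x - \varphi_x\partial_z\right)\nabla^{2}\varphi
  &= -K\left(\varphi_z\varphi_x - \varphi_x\varphi_z\right) = 0,\\
\frac{\partial T}{\partial x} &= -k\,b_{11}\sin(kx)\sin(\pi z),\\
-\frac{K}{Pr}\frac{da_{11}}{dt} - \vartheta K^{2}a_{11} - \gamma K a_{11}
  &= \beta\,Ra\,k\,b_{11},\\
\frac{da_{11}}{dt}
  &= -Pr\,\vartheta K\left[\frac{\beta k\,Ra}{\vartheta K^{2}}\,b_{11}
     + a_{11}\left(1 + \frac{\gamma}{\vartheta K}\right)\right].
\end{aligned}
```

The advective term drops out because $\nabla^2\varphi$ is proportional to $\varphi$ for a single mode, which is why the first amplitude equation is linear in $a_{11}$ and $b_{11}$.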
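Now, we introduce the rescaled time variable $\tau$ and the rescaled amplitudes $\tilde{a}_{11}$, $\tilde{b}_{11}$, $\tilde{b}_{02}$ by the formulas

$$\tau = t\,(k^{2} + \pi^{2}), \qquad \tilde{a}_{11} = \frac{k\,a_{11}}{k^{2} + \pi^{2}}, \qquad \tilde{b}_{11} = \frac{Ra\,k^{2}\,b_{11}}{(k^{2} + \pi^{2})^{3}}, \qquad \tilde{b}_{02} = \frac{Ra\,k^{2}\,b_{02}}{(k^{2} + \pi^{2})^{3}}.$$

Then, Eqs. (8)–(10) become

$$\frac{d\tilde{a}_{11}}{d\tau} = -Pr\,\vartheta\left[\frac{\beta}{\vartheta}\,\tilde{b}_{11} + \tilde{a}_{11}\left(1 + \frac{\gamma}{\vartheta\,(k^{2} + \pi^{2})}\right)\right] \tag{11}$$

$$\frac{d\tilde{b}_{11}}{d\tau} = -\tilde{a}_{11}R + 2\pi\,\tilde{a}_{11}\tilde{b}_{02} - \alpha\,\tilde{b}_{11} \tag{12}$$

$$\frac{d\tilde{b}_{02}}{d\tau} = \frac{\pi}{2}\,\tilde{a}_{11}\tilde{b}_{11} - \alpha\lambda\,\tilde{b}_{02} \tag{13}$$

where $\lambda = 4\pi^{2}/(k^{2} + \pi^{2})$ and $R = Ra\,k^{2}/(k^{2} + \pi^{2})^{3}$ is the rescaled Rayleigh number implied by the above rescaling. Finally, the variables $X_1$, $X_2$ and $X_3$ are defined from $\tilde{a}_{11}$, $-\tilde{b}_{11}$ and $\tilde{b}_{02}$, respectively, through scale factors involving $\alpha\vartheta/\beta$, S and L (Eqs. (14) and (15)), where $L = 1 + \gamma/(\vartheta\,(\pi^{2} + k^{2}))$. From Eqs. (11) to (15), we obtain the following system of ordinary differential equations: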
$$\dot{X}_1 = Pr\,\vartheta L\,(Q X_2 - X_1) = d\,(Q X_2 - X_1) \tag{16}$$

$$\dot{X}_2 = U X_1 + N X_1 X_3 - \alpha X_2 = U X_1 - a X_2 + N X_1 X_3 \tag{17}$$

$$\dot{X}_3 = \alpha\lambda\,(G X_1 X_2 - X_3) = w\,(G X_1 X_2 - X_3) \tag{18}$$

where the overdot denotes $d/d\tau$. Further, $d = Pr\,\vartheta L$, $a = \alpha$ and $w = \alpha\lambda$, while the coefficients G, Q, U and N are combinations of α, β, ϑ, S and L. For the considered parameters, a = α > 0, w > 0, d > 0 and N > 0. Thus, as a result of the Galerkin approximation, the governing Eqs. (1)–(4) are replaced by a system of three first-order nonlinear ordinary differential equations, Eqs. (16)–(18), similar to the well-known Lorenz equations.
2.1 Linear Stability Analysis
Setting the right-hand sides of Eqs. (16)–(18) to zero, we obtain three critical points: (0, 0, 0), which exists for any choice of the parameters a, d, w, G, Q, N and U, and

$$\left(\left(\frac{a - UQ}{NG}\right)^{1/2},\ \frac{1}{Q}\left(\frac{a - UQ}{NG}\right)^{1/2},\ \frac{a - UQ}{NQ}\right), \qquad \left(-\left(\frac{a - UQ}{NG}\right)^{1/2},\ -\frac{1}{Q}\left(\frac{a - UQ}{NG}\right)^{1/2},\ \frac{a - UQ}{NQ}\right)$$

if $(a - UQ)/(NQ) > 0$. For the problem under consideration, Q < 0, G < 0 and U < 0, in accordance with the values of the fixed parameters α, β, ϑ, L and S. Thus, the assumption necessary for the simulation in what follows is (a − UQ) < 0. To determine the stability of these points, we examine the characteristic equation of the Jacobian matrix:

$$\det(J - \lambda I) = \begin{vmatrix} -d - \lambda & dQ & 0 \\ U + N X_3 & -a - \lambda & N X_1 \\ wG X_2 & wG X_1 & -w - \lambda \end{vmatrix} = 0$$
The characteristic equation has the form $\lambda^{3} + c_1\lambda^{2} + c_2\lambda + c_3 = 0$, with coefficients $c_1$, $c_2$ and $c_3$ depending on the coordinates of the given fixed point. By the Routh–Hurwitz criterion, the characteristic polynomial $\lambda^{3} + c_1\lambda^{2} + c_2\lambda + c_3$ has all roots with negative real parts iff $c_1 > 0$, $c_3 > 0$ and $c_1 c_2 - c_3 > 0$. Satisfying these conditions implies asymptotic stability of the given point. We examine each of the points separately.
The Trivial Fixed Point $(X_1, X_2, X_3) = (0, 0, 0)$

$$c_1 = d + a + w; \qquad c_2 = ad + dw + aw - dQU; \qquad c_3 = dw\,(a - QU)$$

For asymptotic stability of the fixed point (0, 0, 0), the following conditions have to be satisfied simultaneously.
1. As d + w + a > 0, we have $c_1 > 0$. Thus, condition-1 is satisfied.
2. As QU > a (and dw > 0), a − QU cannot be greater than zero, so condition-2, $c_3 > 0$, is violated and can never be satisfied, since a < QU for the problem under study.
Thus, the Routh–Hurwitz stability condition is inconsistent with the assumption a < QU adopted for the problem, and the trivial fixed point (0, 0, 0) is always unstable.
The Nontrivial Fixed Points

$$(X_1, X_2, X_3) = \left(\left(\frac{a - UQ}{NG}\right)^{1/2},\ \frac{1}{Q}\left(\frac{a - UQ}{NG}\right)^{1/2},\ \frac{a - UQ}{NQ}\right);$$

$$(X_1, X_2, X_3) = \left(-\left(\frac{a - UQ}{NG}\right)^{1/2},\ -\frac{1}{Q}\left(\frac{a - UQ}{NG}\right)^{1/2},\ \frac{a - UQ}{NQ}\right)$$

As G < 0, the nontrivial equilibrium points exist for QU > a. For these points,

$$c_1 = d + a + w; \qquad c_2 = w\,(d + UQ); \qquad c_3 = 2dw\,(QU - a)$$

and the corresponding system of conditions is as follows:
1. $c_1 > 0$, i.e., d + a + w > 0 (condition-3),
2. $c_3 > 0$, which holds for UQ > a, d > 0 and w > 0 (condition-4),
3. $c_1 c_2 > c_3$, i.e., $[d + w + a]\,[wQU + dw] > 2dw\,(-a + QU)$, which gives $QU < (QU)_c$ for d > w + a (condition-5), where

$$(QU)_c = \frac{d\,(3a + w + d)}{d - w - a}.$$

$(QU)_c$ is the critical value of QU on which the evolution of the system depends. According to condition-5, these fixed points show different behaviors, defined as follows: stable phase if $QU < (QU)_c$; limit cycle phase if $QU = (QU)_c$; and chaotic phase if $QU > (QU)_c$.
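The Routh–Hurwitz test for the nontrivial fixed points is easy to automate. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, and the numerical values in the usage line (a = 1, d = 10, w = 8/3) are the classical Lorenz stand-ins, not the coefficients computed in the paper.

```python
def routh_hurwitz_nontrivial(a, d, w, QU):
    """Coefficients c1, c2, c3 of the characteristic cubic at the nontrivial fixed points."""
    c1 = d + a + w
    c2 = w * (d + QU)
    c3 = 2.0 * d * w * (QU - a)
    return c1, c2, c3


def critical_QU(a, d, w):
    """Critical value (QU)_c = d(3a + w + d)/(d - w - a); requires d > w + a."""
    if d <= w + a:
        raise ValueError("the critical value is defined only for d > w + a")
    return d * (3.0 * a + w + d) / (d - w - a)


def is_stable(a, d, w, QU):
    """True if conditions 3, 4 and 5 all hold, i.e. the nontrivial points are stable."""
    if QU <= a:
        return False  # the nontrivial fixed points require QU > a
    c1, c2, c3 = routh_hurwitz_nontrivial(a, d, w, QU)
    return c1 > 0 and c3 > 0 and c1 * c2 - c3 > 0


# Illustrative stand-in values (classical Lorenz), not taken from the paper.
print(critical_QU(1.0, 10.0, 8.0 / 3.0))
```

With these stand-in values the critical value reduces to 470/19, the familiar threshold of the classical Lorenz system, which gives some confidence that the expression for $(QU)_c$ is of the right form.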
3 Numerical Simulation
Numerical simulation in MATLAB is used to study the effect of the external magnetic field and of temperature on convection in different nanofluids. The simulation of Eqs. (16)–(18) is carried out with the initial condition $[X_1, X_2, X_3] = [1, 1, 1]$ at τ = 0, with k = 1 and $\tau_{max} = 200$, i.e., over the time span [0, 200], based on the thermophysical characteristics of the different nanoparticles and of water stated in Table 1. Using the values of $k_{np}$ and $k_f$ given in Table 1, the values of the parameters $k_{nf}$, β, α, ϑ and k are obtained. As discussed above, k is a free parameter; it appears in the definitions of λ and L and hence in all the coefficients d, w, N, Q, G and U. The values Pr = 10, k = 2.2, H = 0.25 and λ = 8/3 are used for the computation. The values of Ra and Ha at which the different phases of nanofluid convection are observed for Al2O3, TiO2, CuO, SiO2 and ZnO nanofluids are listed in Tables 2 and 3. When the Rayleigh number is increased, a phase transition from the stable to the chaotic stage is observed, and on further increasing the temperature, i.e., the value of R, the chaotic stage continues. However, on raising the Hartmann number during the chaotic stage, stability is restored for the nanofluids. The mesh plots for Al2O3, TiO2, CuO, SiO2 and ZnO nanofluid convection are shown in Fig. 2, and the surface and phase plots of the stable, critical and chaotic phases with increasing Rayleigh number and Hartmann number are shown in Figs. 3 and 4.

Table 1 Thermophysical characteristics of water and different nanoparticles
Substance   ρ (kg m−3)   k (W m−1 K−1)   Cp (J kg−1 K−1)   β        α       ϑ
H2O         997.1        0.613           4179              –        –       –
Al2O3       3970         40              765               0.834    1.166   0.989
TiO2        4250         8.954           686.2             0.825    1.145   0.977
CuO         6500         18              540               0.7548   1.150   0.890
SiO2        2200         1.4             745               0.898    1.078   1.0724
ZnO         5600         13              495.2             6.151    1.154   0.922

Table 2 Phases for different nanofluids with variation in parameter R (stable to chaotic phase)
Substance   Stable spiral phase   Limit cycle phase   Chaotic phase
Al2O3       ≤24.8                 24.9–33.9           ≥34.0
TiO2        ≤24.4                 24.5–33             ≥33.1
CuO         ≤23.1                 23.2–32.9           ≥33.0
SiO2        ≤25.7                 25.8–31.5           ≥31.6
ZnO         ≤3.0                  3.1–4.1             ≥4.2

Table 3 Phases for different nanofluids with variation in Ha to control chaos (chaotic to stable phase)
Substance   R for chaotic phase   Chaotic phase (Ha)   Limit cycle phase (Ha)   Stable spiral phase (Ha)
Al2O3       40                    ≤0.8                 0.9–0.11                 ≥1.2
TiO2        34                    ≤0.6                 0.61–0.98                ≥0.99
CuO         34                    ≤0.53                0.54–1.0                 ≥1.1
SiO2        32                    ≤0.52                0.53–0.86                ≥0.87
ZnO         5                     ≤0.84                0.85–1.10                ≥1.2

The effect of increasing the magnetic field in controlling the chaotic stage of nanofluid convection for Al2O3, TiO2, CuO, SiO2 and ZnO nanofluids is shown at different values of the Hartmann number. It is observed that as the parameters β and ϑ increase, the value of R up to which the system retains stability also increases. An increase in α likewise increases the value of R up to which the system is stable, and it also affects the value of Ha beyond which the stability of the system is restored from the chaotic phase at high values of R. It is observed that among all the nanofluids, CuO resists the chaotic stage longest in response to an increase in temperature, and Al2O3 requires the least increase in magnetic field intensity to be restored from the chaotic to the stable phase.
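The simulation setup described in this section (initial state [1, 1, 1], time span [0, 200]) can be sketched with a fixed-step fourth-order Runge–Kutta integrator. This is a minimal Python illustration, not the authors' MATLAB code: the step size, the function names and the coefficient values in `LORENZ_LIKE` are assumptions (the coefficients are classical Lorenz stand-ins chosen only so that the sketch runs, since the paper's values of Q, U, N and G are not reproduced here).

```python
def rhs(state, p):
    """Right-hand side of the Lorenz-like system, Eqs. (16)-(18)."""
    x1, x2, x3 = state
    return (
        p["d"] * (p["Q"] * x2 - x1),
        p["U"] * x1 - p["a"] * x2 + p["N"] * x1 * x3,
        p["w"] * (p["G"] * x1 * x2 - x3),
    )


def rk4_step(state, h, p):
    """Advance the state by one classical fourth-order Runge-Kutta step of size h."""
    def shifted(base, slope, scale):
        return tuple(b + scale * s for b, s in zip(base, slope))

    k1 = rhs(state, p)
    k2 = rhs(shifted(state, k1, h / 2.0), p)
    k3 = rhs(shifted(state, k2, h / 2.0), p)
    k4 = rhs(shifted(state, k3, h), p)
    return tuple(
        s + h / 6.0 * (s1 + 2.0 * s2 + 2.0 * s3 + s4)
        for s, s1, s2, s3, s4 in zip(state, k1, k2, k3, k4)
    )


def integrate(p, state=(1.0, 1.0, 1.0), tau_max=200.0, h=0.01):
    """Integrate from tau = 0 to tau_max and return the list of visited states."""
    trajectory = [state]
    for _ in range(int(round(tau_max / h))):
        state = rk4_step(state, h, p)
        trajectory.append(state)
    return trajectory


# Hypothetical coefficients (classical Lorenz stand-ins), not the paper's values.
LORENZ_LIKE = {"d": 10.0, "Q": 1.0, "U": 28.0, "a": 1.0,
               "N": -1.0, "w": 8.0 / 3.0, "G": 3.0 / 8.0}

trajectory = integrate(LORENZ_LIKE)
print(len(trajectory), trajectory[-1])
```

Plotting the three components of `trajectory` against each other would give phase portraits of the kind shown in Figs. 3 and 4; replacing the stand-in dictionary with coefficients computed from Table 1 is left to the reader.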
Fig. 2 Mesh plots of chaotic state for nanofluids Al2O3, TiO2, CuO, SiO2 and ZnO at Ha = 0.5. a Chaotic attractor for Al2O3 at R = 34.0; b chaotic attractor for TiO2 at R = 37.0; c chaotic attractor for CuO at R = 40.0; d chaotic attractor for SiO2 at R = 35.0; e chaotic attractor for ZnO at R = 5.0. Panels for each nanofluid: X-Y-Z mesh plot; X-Y, X-Z and Y-Z mesh plots; mesh time series of X, Y and Z against t
Fig. 3 Surface plots of three phases (stable, critical and chaotic states) for nanofluids Al2O3, TiO2, CuO, SiO2 and ZnO with increasing R
Fig. 4 Phase plots for Al2O3, TiO2, CuO, SiO2 and ZnO with increasing Ha restoring stability. Panels show the chaotic, critical and stable states of each nanofluid at Ha values between 0.5 and 2.5
4 Conclusion
In this work, the convection of different nanofluids in a rectangular chamber exposed to temperature and magnetic field variation has been studied. Using the equations of conservation of mass, energy and momentum, the partial differential equations of fluid convection are obtained and then transformed into a three-dimensional system of ordinary differential equations using the Galerkin approximation for nonlinear analysis of the nanofluid convection. From the stability analysis of the system, the critical condition beyond which the system becomes chaotic has been obtained. The stable, critical and chaotic phases of nanofluid convection are observed for Al2O3, TiO2, CuO, SiO2 and ZnO nanofluids through phase portraits, time series, mesh plots and surface plots. As the Rayleigh number, i.e., the temperature, increases, the system passes from the stable stage to the chaotic stage through the critical stage, where a limit cycle is observed. Once the chaotic phase begins, chaos continues to grow with increasing Rayleigh number. It is also observed that when the magnetic intensity is increased in the chaotic stage by increasing the Hartmann number, the system moves toward stability, and this is possible for the chaotic phase at large values of the Rayleigh number. This shows the occurrence of magnetic cooling, which controls the chaotic phase of nanofluid convection, and it is significant in different areas of application. It is concluded that as the parameters β and ϑ increase, the value of R up to which the system retains stability also increases; an increase in α likewise increases the value of R up to which the system is stable, and it also affects the value of Ha beyond which the stability of the system is restored from the chaotic phase at high values of R.
It is observed that among all the nanofluids, convection in CuO resists the chaotic stage longest in response to an increase in temperature, and Al2O3 requires the least increase in magnetic field intensity to be restored from the chaotic to the stable phase.
Acknowledgements Authors are thankful to GGSIPU, Delhi (India), for providing research facilities and grant.
References
1. E.N. Lorenz, Deterministic non-periodic flow. J. Atmos. Sci. 20, 130–141 (1963)
2. C. Sparrow, The Lorenz Equations: Bifurcations, Chaos and Strange Attractors (Springer, New York, 1982)
3. S. Kimura, G. Schubert, J.M. Straus, Route to chaos in porous-medium thermal convection. J. Fluid Mech. 166, 305–324 (1986)
4. L.M. Pecora, T.L. Carroll, Synchronization in chaotic systems. Phys. Rev. Lett. 64, 821–824 (1990)
5. S.U.S. Choi, Enhancing thermal conductivity of fluids with nanoparticles, in Developments Applications of Non-Newtonian Flows, ed. by D.A. Siginer, H.P. Wang, FED-vol. 231/MD-vol. 66 (ASME, New York, 1995), pp. 99–105
6. Y.M. Xuan, Q. Li, Heat transfer enhancement of nanofluids. Int. J. Heat Fluid Flow 21, 58–64 (2000)
7. J.A. Eastman, S.U.S. Choi, S. Li, W. Yu, L.J. Thompson, Anomalously increased effective thermal conductivities of ethylene-glycol based nanofluids containing copper nanoparticles. Appl. Phys. Lett. 78, 718–720 (2001)
8. H. Xie, J. Wang, T. Xi, Y. Liu, F. Ai, Q. Wu, Thermal conductivity enhancement of suspensions containing nano-sized alumina particles. J. Appl. Phys. 91, 4568–4572 (2002)
9. S.K. Das, N. Putra, P. Thiesen, W. Roetzel, Temperature dependence of thermal conductivity enhancement for nanofluids. J. Heat Trans. Trans. ASME 125, 567–574 (2003)
10. P. Keblinski, J.A. Eastman, D.G. Cahill, Nanofluids for thermal transport. Mater. Today 8, 36–44 (2005)
11. J.J. Vadasz, J.E.A. Roy-Aikins, P. Vadasz, Sudden or smooth transitions in porous media natural convection. Int. J. Heat Mass Trans. 48, 1096–1106 (2005)
12. S. Ganguly, S. Sikdar, S. Basu, Experimental investigation of the effective electrical conductivity of aluminum oxide nanofluids. Powder Technol. 196, 326–330 (2009)
13. L.B. Modesto Lopez, P. Biswas, Role of the effective electrical conductivity of nano suspensions in the generation of TiO2 agglomerates with electro spray. J. Aerosol Sci. 41, 790–804 (2010)
14. R. Idris, I. Hashim, Effects of a magnetic field on chaos for low Prandtl Number convection in porous media. Nonlinear Dyn. 62, 905–917 (2010)
15. Z.M. Odibat, N. Corson, M.A. Aziz-Alaoui, C. Bertelle, Synchronization of chaotic fractional-order systems via linear control. Int. J. Bifurcation Chaos (2010)
16. S. Kadri, R. Mehdaoui, M. Elmir, Vertical Magneto-Convection in square cavity containing a Al2O3 + water nanofluid: cooling of electronic compounds. Energy Procedia 18, 724–732 (2012)
17. J. Rahmannezhad, A. Ramezani, M. Kalteh, Numerical investigation of magnetic field effects on mixed convection flow in a nanofluid-filled lid-driven cavity. Int. J. Eng. 26(10), 1213–1224 (2013)
A Lightweight Secure Authentication Protocol for Wireless Sensor Networks Nitin Verma, Abhinav Kaushik, and Pinki Nayak
Abstract Wireless sensor networks (WSNs) are becoming very popular, and the security of such networks is therefore crucial. Asymmetric-key cryptography increases computational overhead, and WSN systems are not designed to handle such costs. Most WSN systems therefore rely on symmetric-key cryptography to avoid this overhead. However, such systems are vulnerable to many security attacks, such as sybil attacks. In this paper, we introduce a new protocol based on mutual authentication between a Sensor and a User that uses asymmetric cryptography with low overhead. Information sharing between the Sensor Node and the User uses the elliptic-curve Diffie–Hellman protocol. Further, the protocol protects against masquerade attacks, replay attacks, sinkhole attacks, sybil attacks, etc. The protocol is implemented and verified using the AVISPA software.
Keywords Authentication protocol · Wireless sensor network · Smart card · Elliptic-curve Diffie–Hellman · Mutual authentication · Timestamp · AVISPA
1 Introduction
A wireless sensor network (WSN) is a collection of wireless sensor devices that communicate through a medium such as air, water, or space. Data on temperature, light intensity, sound frequency, humidity, pressure, and many other quantities can be collected using a WSN. WSNs can also be used for monitoring military and airport areas.
N. Verma (B) · A. Kaushik · P. Nayak Amity School of Engineering and Technology, New Delhi, India e-mail: [email protected] A. Kaushik e-mail: [email protected] P. Nayak e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_21
One of the most important applications of WSNs is in the field of the Internet of things (IoT). In IoT, communication takes place over a public network, and it is therefore vital that the data being transported is secured and is not tampered with in any way. Security plays a crucial role in WSNs because, in most cases, once the nodes are deployed they cannot be maintained and observed manually and have to be configured remotely. Security mechanisms have to be implemented to maintain the confidentiality, integrity, and authenticity of the data. Implementing suitable authentication protocols is challenging for the following reasons:
(a) Low memory of the devices
(b) Low computing power
(c) Low energy supply.
The main challenge in WSNs is to design a protocol that provides better security with minimum overhead on the nodes [1]. Some important types of attacks in WSNs [2] are:
(1) Masquerading attack: The attacker hides his own identity and tries to access the data flowing in the network by using an identity that is authorized to do so.
(2) Man-in-the-middle (MITM) attack: A third party intercepts the messages flowing between the sender and the receiver using sniffing techniques. The attacker can also replay a message by impersonating the sender.
(3) Replay attack: This form of attack is very similar to the man-in-the-middle attack. In MITM, the attacker generally eavesdrops and captures the message for later use. In a replay attack, however, the attacker uses the information to attack the network itself.
(4) Jamming attack: A jamming attack is a form of denial of service (DoS) and occurs at the physical layer of the OSI model. The normal flow of traffic is disrupted by interference with the radio frequencies on which the network is established.
(5) Selective forwarding: This is a network layer attack. As the number of nodes in the network increases, the probability of a malicious node entering the network also increases. Such a node takes part in network communication, transmitting only selected packets of information and discarding the rest.
(6) Sinkhole attack: This attack is also performed on the network layer and is very similar to selective forwarding, but a sinkhole attack is active in nature while selective forwarding is passive.
(7) Sybil attack: The adversary uses multiple identities, which are usually forged or stolen, for communication.
2 Literature Review
In 1981, Lamport [3] first suggested a password-based authentication mechanism for untrusted networks. However, the protocol was susceptible to the stolen-verifier attack because of its dependency on the password table. Thereafter, many protocols were put forth for user authentication and key management. Key management plays a crucial role in data transmission, as the data transported over the network has to be fully encrypted so that no malicious party is able to read or manipulate it. Many key management techniques have been proposed in this regard [4–6]. In 2009, Das [7] proposed an authentication protocol for WSNs. The protocol was very efficient and had very low complexity. It was based on hash function operations and also involved temporary credentials, i.e., timestamps, for verification. However, the protocol was susceptible to attacks such as DoS and node capture. In 2010, Khan and Alghathbar [8] proposed a scheme that improved on Das's scheme. They addressed the problems of unsafe passwords and mutual authentication by introducing pre-shared keys and masked passwords. Vaidya et al. [2] identified the drawbacks of Khan and Alghathbar's scheme and proposed a further scheme. Xue et al. [9] proposed a new authentication protocol based on a temporal-credential system and a key management protocol for WSNs. Hash function and XOR operations reduced the computational complexity of the system. Its drawbacks were the stolen-verifier and insider attacks. A hash-function-based protocol for user authentication and key management in ad hoc wireless sensor networks was offered by Turkanovic et al. [10]. In 2006, Wong et al. proposed a scheme for non-static WSNs. The drawbacks found were forgery and replay attacks.
3 Proposed Scheme
The scheme consists of three independent participants, i.e., the User who requests a resource, the Gateway Node which is involved in data transmission, and the Sensor Node which provides the data to the User. The notations used in the protocol are:
HU: Represents the hash of U.
SHU: Represents the salted hash of U.
SHID: Represents the salted hash of User ID.
SHGID: Represents the salted hash of Gateway ID.
SHSID: Represents the salted hash of Sensor ID.
KSN: Special key agreed between the Sensor Node and the User Node.
KGN: Special key agreed between the User and the Gateway.
RID: Represents the Request ID.
TK: Represents the token.
ETKS: Encrypted token which only the Sensor Node can decrypt.
N. Verma et al.
EHeaderG: Encrypted header which only the Gateway Node can decrypt.
Ts: Represents the timestamp.
RM: Response message for the corresponding RID from the Sensor.
Special symbols used in the protocol are:
||: Represents the concatenation operation.
⊕: Represents the XOR operation.
3.1 User Registration Phase
The User selects a set of random numbers, a UserID, a UserPassword, and a UserKey which is known only to the User. The User sends this information to the Registrar. The Registrar concatenates a separate Secret Key to each of the numbers selected by the User and calculates the hash of each result. Let the results be x and y. The Registrar then calculates X as the UserID concatenated with the first hashed variable, the whole XOR’ed with the UserPassword. The Registrar also calculates Y as the UserPassword concatenated with the other hashed variable, the whole XOR’ed with the UserID. In the next step, X and Y are concatenated and the result is XOR’ed with the UserKey concatenated with the hash of X and the hash of Y. The result is hashed and salted to improve security and then written onto the Smart Card along with X and Y. This phase is shown in Fig. 1.
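The registration computation above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the choice of SHA-256, the zero-padding used when XOR operands differ in length, the position of the salt, and the names h, xor and register are all assumptions made for the example, since the text does not fix them.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash function h(.). SHA-256 is an assumption; the paper names no hash."""
    return hashlib.sha256(data).digest()


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR. Assumption: the shorter operand is zero-padded on the right."""
    n = max(len(a), len(b))
    return bytes(p ^ q for p, q in zip(a.ljust(n, b"\x00"), b.ljust(n, b"\x00")))


def register(user_id, password, user_key, r1, r2, k1, k2, salt):
    """Return the values the Registrar would write to the Smart Card."""
    x = h(r1 + k1)                          # first random number || first Secret Key
    y = h(r2 + k2)                          # second random number || second Secret Key
    X = xor(user_id + x, password)          # (UserID || x) XOR UserPassword
    Y = xor(password + y, user_id)          # (UserPassword || y) XOR UserID
    Z = xor(X + Y, user_key + h(X) + h(Y))  # (X || Y) XOR (UserKey || h(X) || h(Y))
    digest = h(salt + Z)                    # salted hash; salt placement is assumed
    return {"X": X, "Y": Y, "digest": digest}


# Illustrative values only; a real deployment would use random numbers and salt.
card = register(b"alice", b"pw-1", b"key-A", b"r1", b"r2", b"K1", b"K2", b"salt")
print(card["digest"].hex())
```

Because the digest depends on the UserKey, a Smart Card holder who does not know the UserKey cannot reproduce it, which is the property the login check relies on.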
3.2 User Login Phase
The following operations take place. The UserID is concatenated with Y (read from the Smart Card) and the result is XOR’ed with the UserPassword; let the result be X. Similarly, the UserPassword is concatenated with X and the result is XOR’ed with the UserID; let the result be Y. The same operations as in the registration phase are then performed, and the User is verified and logged in if the hashes match. This is shown in Fig. 2.
3.3 Authentication Phase
Data is encrypted for transmission over the channel using the elliptic-curve Diffie–Hellman (ECDH) algorithm. In the first part of the authentication phase, the User generates a random number, called a nonce; let it be N1. N2 is calculated as KGN XOR’ed with the salted hash of the GatewayID, i.e., SHGID, which is again XOR’ed with the timestamp.
Fig. 1 User registration phase
Fig. 2 Login phase
N3 is calculated as the XOR of two variables: KSN concatenated with the SessionKey, and RID. B is calculated as N1 concatenated with N2, which is again concatenated with N3; the resultant is XOR’ed with SHSID and RID. Token TK is calculated as SHID concatenated with SHGID, which is concatenated with SHSID; the resultant is XOR’ed with the SessionKey concatenated with B. The Header follows the format of the header in HashCash. The first field, represented as M, is the number of leading bits that are 0. The next field is the timestamp. The third is N1, and the fourth is the SessionKey, followed by RID and SHSID. The last field is the counter. The token and the header are encrypted. The token can be decrypted by the Sensor Node only, and the header can be decrypted by the Gateway Node.
The Gateway Node decrypts the header and verifies it. N2 is again calculated as the XOR of KGN, SHGID, and the timestamp. Variable C is calculated as N1 concatenated with N2. Variable E is calculated as SHID concatenated with SHGID, which is again concatenated with SHSID. A new header is formed at the Gateway Node. Here, the first field of the header, represented by N, is the number of leading 0 bits. The second field is the SessionKey, followed by RID, E, C, SHID, SHGID and the counter. The Gateway Node sends the new header, which can be decrypted by the Sensor Node, together with the encrypted token that was received from the User Node.
The Sensor Node decrypts the header and the encrypted token. N3 is calculated as KSN concatenated with the SessionKey, the resultant XOR’ed with RID. Variable P is calculated as C concatenated with N3, the resultant XOR’ed with SHSID, which is again XOR’ed with RID. The Sensor then calculates the token as E XOR’ed with the SessionKey concatenated with P. If the newly calculated token matches the token received, the User Node is verified.
In the second part of the authentication phase, communication takes place from the Sensor Node to the User Node. The Sensor Node first calculates the variable P as C concatenated with N3, the resultant XOR’ed with SHSID, which is again XOR’ed with the corresponding RM. A new token is generated as E XOR’ed with the SessionKey concatenated with the variable P. The User Node calculates a variable H as N1 concatenated with N2 concatenated with N3, the resultant XOR’ed with SHSID and the RM received from the Sensor Node. The token is calculated as SHID concatenated with SHGID concatenated with SHSID, the resultant XOR’ed with the SessionKey concatenated with the variable H.
This token is matched with the token received from the Sensor Node. If the tokens match, the Sensor Node is verified. This is shown in Figs. 3 and 4.
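The token check in the first part of the authentication phase can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: SHA-256 with a fixed demo salt stands in for the salted hashes, XOR operands of different lengths are zero-padded, N3 is taken as (KSN || SessionKey) ⊕ RID on both sides, and the function names and sample values are hypothetical. Encryption of the token and header is omitted.

```python
import hashlib


def xor(*parts: bytes) -> bytes:
    """XOR any number of byte strings; shorter ones are zero-padded (assumption)."""
    out = bytearray(max(len(p) for p in parts))
    for p in parts:
        for i, byte in enumerate(p):
            out[i] ^= byte
    return bytes(out)


def salted_hash(value: bytes, salt: bytes = b"demo-salt") -> bytes:
    """Stand-in for SHID, SHGID and SHSID; SHA-256 and the salt are assumptions."""
    return hashlib.sha256(salt + value).digest()


def user_token(n1, n2, n3, shid, shgid, shsid, session_key, rid):
    b = xor(n1 + n2 + n3, shsid, rid)                  # B = (N1||N2||N3) XOR SHSID XOR RID
    return xor(shid + shgid + shsid, session_key + b)  # TK


def sensor_token(c, e, ksn, session_key, shsid, rid):
    n3 = xor(ksn + session_key, rid)                   # N3 recomputed at the Sensor Node
    p = xor(c + n3, shsid, rid)                        # P = (C||N3) XOR SHSID XOR RID
    return xor(e, session_key + p)                     # E XOR (SessionKey||P)


# Hypothetical sample values, for illustration only.
shid, shgid, shsid = (salted_hash(v) for v in (b"user-1", b"gateway-1", b"sensor-1"))
kgn, ksn, session_key = b"K" * 16, b"S" * 16, b"session-key-0001"
rid, ts, n1 = b"RID-0001", b"1700000000", b"\x01" * 16
n2 = xor(kgn, shgid, ts)                               # N2 = KGN XOR SHGID XOR Ts
n3 = xor(ksn + session_key, rid)

tk_user = user_token(n1, n2, n3, shid, shgid, shsid, session_key, rid)
c, e = n1 + n2, shid + shgid + shsid                   # C and E as formed by the Gateway
tk_sensor = sensor_token(c, e, ksn, session_key, shsid, rid)
print("User verified:", tk_user == tk_sensor)
```

The two tokens agree because C || N3 equals N1 || N2 || N3 and E equals SHID || SHGID || SHSID, so P reproduces B whenever the Sensor Node holds the same KSN, SessionKey and RID.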
Fig. 3 Authentication phase (User Node to Sensor Node)
Fig. 4 Authentication phase (Sensor Node to User Node)
4 Experimental Results
We simulated the proposed protocol using the AVISPA tool and the SPAN Protocol Animator to verify its security. The AVISPA tool has four back-ends, i.e., verification tools:
1. SATMC—SAT-based Model Checker.
2. OFMC—On-The-Fly Model Checker.
3. TA4SP—Tree Automata Tool based on Automatic Approximations for the Analysis of Security Protocols.
4. CL-AtSe—Constraint-Logic-based Attack Searcher.
CL-AtSe is designed to operate on a pre-defined, bounded number of loops. If there are no loops in the protocol, CL-AtSe analyzes the whole specification. However, if loops are present in the protocol specification, a maximum bound on the number of iterations has to be provided for complete analysis. Only when the number of loops is bounded can CL-AtSe search for attacks and determine whether the protocol is safe. Figure 5 shows the results obtained when the proposed protocol is tested with the CL-AtSe attack searcher.
Fig. 5 Obtained results
4.1 Security Analysis
1. Masquerading attack—the proposed protocol handles the masquerading attack using the Smart Card and the keys.
2. Replay attack—data is encrypted and time-stamped. This ensures freshness of the information and reduces the chance of a successful replay attack.
3. Denial of service—to handle the threat of denial of service, the concept of a header is used. Such headers are also used in electronic mail transmission.
4. Selective forwarding—to handle the selective forwarding attack, tokens are used. These tokens discourage the Gateway Node from behaving maliciously. Since the tokens are encrypted, a malicious Gateway Node cannot read or modify their contents.
5. Sinkhole attack—tokens are also used to prevent the sinkhole attack.
6. Sybil attack—the Sybil attack is mitigated with the help of the two passwords of the User.
7. Mutual authentication—the data sent to the Sensor Node is used to verify the User and vice versa.
8. Non-repudiation—the ECDH algorithm uses a public–private key pair. This pair of keys can be used to handle non-repudiation.
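The denial-of-service defence in item 3 relies on a HashCash-style header, whose idea can be sketched as follows. This is an illustrative sketch, not the authors' header format: the colon-separated layout, the use of SHA-1 (the hash used by the original Hashcash scheme), and the names mint_header and verify_header are assumptions. The sender must search for a counter that makes the hash of the header start with M zero bits, while the receiver verifies with a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint_header(m, timestamp, n1, session_key, rid, shsid):
    """Search for a counter so that SHA-1(header) has at least m leading zero bits."""
    prefix = ":".join([str(m), timestamp, n1, session_key, rid, shsid])
    for counter in count():
        header = f"{prefix}:{counter}"
        if leading_zero_bits(hashlib.sha1(header.encode()).digest()) >= m:
            return header


def verify_header(header: str) -> bool:
    """Cheap check performed by the receiver: one hash, compared against field M."""
    m = int(header.split(":", 1)[0])
    return leading_zero_bits(hashlib.sha1(header.encode()).digest()) >= m


# Hypothetical field values; M = 12 keeps the search to a few thousand hashes.
header = mint_header(12, "1700000000", "a1b2c3", "sk01", "RID-0001", "sensor-hash")
print(header, verify_header(header))
```

The cost asymmetry is the point of the design: each additional zero bit roughly doubles the sender's expected work, which makes flooding the Gateway Node with requests expensive, while verification stays constant.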
5 Conclusion
In this paper, we have proposed a secure authentication protocol for wireless sensor networks. To ensure data integrity, confidentiality and non-repudiation, and to prevent MITM attacks, the elliptic-curve Diffie–Hellman (ECDH) algorithm is used. We have used two passwords along with a Smart Card, which also enhances security. Hashing and salting are used to ensure that an attacker is not able to access the identities of the User Node, the Gateway Node or the Sensor Node. Headers and tokens are used to prevent attacks such as selective forwarding, sinkhole and denial of service. Timestamps are used to ensure freshness of the information in the network and to mitigate attacks such as the replay attack. The protocol is verified using the AVISPA tool and the SPAN protocol animator. Since the protocol is lightweight, it can be applied in WSNs. The enhanced security enables the protocol to authenticate both the Users and the Sensors, resulting in mutual authentication.
References
1. S. Kumari, H. Om, Authentication protocol for wireless sensor networks applications like safety monitoring in coal mines. Comput. Netw. 104, 137–154 (2016)
2. B. Vaidya, D. Makrakis, H.T. Mouftah, Improved two-factor user authentication in wireless sensor networks, in IEEE 6th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob) (IEEE, 2010), pp. 600–606
3. L. Lamport, Password authentication with insecure communication. Commun. ACM 24(11), 770–772 (1981)
4. L. Eschenauer, V. Gligor, A key management scheme for distributed sensor networks, in CCS ’02: Proceedings of the 9th ACM Conference on Computer and Communications Security (New York, USA, 2002)
5. H. Chan, A. Perrig, D. Song, Random key pre-distribution schemes for sensor networks, in Proceedings of the IEEE Security and Privacy Symposium (2003)
6. R. Watro, D. Kong, S.F. Cuti, C. Gardiner, C. Lynn, P. Kruus, TinyPK: securing sensor networks with public key technology, in SASN ’04 (New York, USA, 2004), pp. 59–64
7. M.L. Das, Two-factor User authentication in wireless sensor networks. IEEE Trans. Wirel. Commun. 8, 1086–1090 (2009)
8. M.K. Khan, K. Alghathbar, Cryptanalysis and security improvements of two–factor User authentication in wireless sensor networks. Sensors 10(3), 2450–2459 (2010)
9. K. Xue, C. Ma, P. Hong, R. Ding, A temporal-credential-based mutual authentication and key agreement scheme for wireless sensor networks. J. Netw. Comput. Appl. 36, 316–323 (2013)
10. M. Turkanovic, B. Brumen, M. Hölbl, A novel User authentication and key agreement scheme for heterogeneous adhoc wireless sensor networks, based on the internet of things notion. AdHoc Netw. 20, 96–112 (2014). https://doi.org/10.1016/j.adhoc.2014.03.009
Movie Recommendation Using Content-Based and Collaborative Filtering
Priyanka Meel, Farhin Bano, Agniva Goswami, and Saloni Gupta
Abstract Today, we want to spend our money wisely and judiciously. To do so, we either turn to reviews of the things we want to try or, better, check recommendations based on our previous experiences. This is where the need for a recommendation system comes in. As the online entertainment industry and the e-commerce markets grow rapidly, so does the need for efficient recommendation engines and algorithms that help companies generate revenue. This paper proposes a hybrid collaborative and content-based filtering algorithm to benefit the online entertainment market, especially the online movie market. It combines the advantages of semantics-based and frequency-based filtering with a collaborative approach that predicts the ratings of every movie. The paper presents the results of the proposed hybrid algorithm alongside other known filtering techniques and algorithms used to recommend movies. In addition, we compare our method with methods already implemented in this field and try to overcome their disadvantages to the best of our knowledge. Our method gives promising results for the top 5–10 movie recommendations for a user query string.
Keywords Recommendation engine · E-commerce · Frequency · Semantics · Collaborative · Ratings · Hybrid algorithm
P. Meel · F. Bano · A. Goswami · S. Gupta (B)
Delhi Technological University, New Delhi 110042, India
e-mail: [email protected]
P. Meel e-mail: [email protected]
F. Bano e-mail: [email protected]
A. Goswami e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_22
P. Meel et al.
1 Introduction
With the increase in the number of entertainment and e-commerce Web sites, people are increasingly watching movies and television shows and purchasing products online. However, the Web sites present so much information that it becomes difficult for people to find what is useful and relevant. Hence, recommender systems play a very important role in making the user's buying experience hassle-free. Recommender systems apply machine learning algorithms to predict the rating or preference a buyer would give to an item and suggest the top-n items according to the predicted ratings. Different companies are working on recommendation algorithms so that the existing algorithms can be improved. Recommendation system algorithms can be content-based or collaborative-based [1]. Companies like Amazon generate about 35% of their revenue from product recommendation, which amounted to more than $40 billion in 2016, according to statistics. Similarly, Netflix values its recommendations at approximately $1 billion per year. Netflix compares the searching and watching patterns of a user and provides movie recommendations. It suggests movies that have features similar to those of films the user has rated highly. In this paper, we use a content-based approach on both the titles and tags of the movies, and a collaborative approach to find the ratings of all the movies by all the users for product recommendation. This paper describes methods for recommending relevant movies to users. Different approaches have been used on text for developing recommendation systems: collaborative-based approaches like singular value decomposition [2, 3], and content-based approaches such as tf-idf [4] (which stands for term frequency–inverse document frequency) and Word2vec [5], developed by Google in 2013.
These approaches are being modified in different ways to improve the recommendations. Some examples where recommendation engines are used are ‘you might also like’ on Amazon.com, youtube.com suggestions and Netflix’s movie recommendations, as shown in Table 1 [6, 7]. Recommending movies and products makes the Web sites easier to use for viewers and buyers. It also generates profit for the companies, as it encourages customers to buy more products, watch more movies and spend more time on the Web sites.
Table 1 Fields where recommendation systems are used, with examples
Field         Examples
Movies        IMDb, Netflix and BookMyShow
E-commerce    Amazon, eBay and Flipkart
Videos        YouTube
Games         Xbox
Music         Pandora and Spotify
1.1 What Is a Recommender Engine?
A recommender system is an automated data-filtering process that presents only the relevant results to users by removing superfluous and redundant data [1, 8]. Some real-world examples are the recommendation of products on Amazon.com and the recommendation of movies on Netflix. All recommendations depend on the domain and on the characteristics of the available data. The recommender system was developed to bridge the gap between the collection of information and its analysis by removing redundant data and presenting the essential data to the user. A recommender system can produce recommendations in one of two ways: (i) content-based filtering and (ii) collaborative filtering.
Content-based Filtering: This is the most common approach when designing a recommender system. This method filters the data based on the item description and on data collected by tracking the user’s preferences [9]. The user’s search history is taken into consideration, and the recommendation is based on it. The data is provided to the system either implicitly, when the user clicks a link, or explicitly, when the user rates an item, and the recommender system then uses this data to recommend products. Candidate items are compared with the items that the user rated earlier. The suggestions are updated after every user activity; for instance, when the user gives a positive rating to a specific product, the probability of that product appearing in upcoming recommendations increases. For example, if a user likes a page containing words such as pen drive, RAM and mobile, then content-based filtering will recommend Web pages related to electronics. Content-based filtering tries to recommend items based on their similarity.
A content-based recommender engine works with the data the user provides, either implicitly by clicking on a link or explicitly by adding tags and comments; from this, a user profile is created, which is in turn used to recommend products. The more inputs the user provides or actions the user takes, the better the recommender engine becomes.
Collaborative Filtering: Here, the behavior of users who are related to a given user is taken into consideration when recommending a product to that user. For example, if user A likes an item and user B does not like that item, and user C is more closely related to user A, then that item is recommended to user C. So, in this filtering, the ratings of other close users are taken into consideration, as are the closest items that another user has liked. Recommendations are given to users based on what other users of the same group have preferred. The algorithm calculates the similarities between different items using various similarity measures, and these measures are then used to predict the rating for a user-item pair that is absent from the dataset. The prediction is based not on similar people’s ratings alone but on the weighted average over all people. The weight is based on the similarity between two users, that is, a person and the person for whom the prediction or recommendation has to be made, and the Pearson correlation coefficient can be used to measure this similarity.
The following contributions are made in this paper:
• The method uses a hybrid of collaborative and content-based filtering.
• We use a tf-idf weighted Word2vec model instead of the standard tf-idf or Word2vec model.
• The SVD method is applied, and the Euclidean distance is calculated to finally produce recommendations.
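The similarity-weighted prediction described under collaborative filtering can be sketched as follows. This is a generic textbook-style illustration, not the method finally adopted in this paper: the mean-centring of ratings, the toy ratings dictionary, and the names pearson and predict are assumptions made for the example.

```python
from math import sqrt


def pearson(a: dict, b: dict) -> float:
    """Pearson correlation over the movies that both users have rated."""
    common = [m for m in a if m in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[m] for m in common) / len(common)
    mean_b = sum(b[m] for m in common) / len(common)
    num = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    den = sqrt(sum((a[m] - mean_a) ** 2 for m in common)) * sqrt(
        sum((b[m] - mean_b) ** 2 for m in common)
    )
    return num / den if den else 0.0


def predict(ratings: dict, user: str, movie: str) -> float:
    """Predict a missing rating as the user's mean plus a similarity-weighted
    average of the other users' deviations from their own means (assumed variant)."""
    target = ratings[user]
    mean_u = sum(target.values()) / len(target)
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or movie not in r:
            continue
        w = pearson(target, r)
        num += w * (r[movie] - sum(r.values()) / len(r))
        den += abs(w)
    return mean_u if den == 0 else mean_u + num / den


# Toy data: C agrees with A and disagrees with B, mirroring the example in the text.
ratings = {
    "A": {"m1": 5, "m2": 4, "m3": 1, "m4": 5},
    "B": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
    "C": {"m1": 5, "m2": 5, "m3": 2},
}
print(predict(ratings, "C", "m4"))
```

Here user B still contributes to the prediction for C, but with a negative weight, so B's dislike of m4 pushes C's predicted rating up rather than down. In practice the result would be clipped to the rating scale.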
2 Related Work
A significant amount of research has been carried out in the field of collaborative filtering recommendation systems. One such approach is applied to theatrical movie releases [10]. It provides a framework that learns to measure distances in a product space to predict purchase probabilities. The approach uses a deep neural network architecture trained on past customer purchases and on dense representations of movie plots. Initial experiments show material gains over traditional frequency–recency models. Another work presents a collaborative method for recommending movies with cuckoo search [11]. That article discusses a novel recommender model that uses k-means clustering with a cuckoo search optimization algorithm, applied to the MovieLens dataset. Performance is measured with metrics such as MAE (mean absolute error), RMSE (root-mean-squared error), SD (standard deviation) and t-value; the approach achieved high accuracy and delivered accurate, personalized movie recommendations with different cluster numbers. If the initial partition does not function well, the efficiency of this approach can decrease. In another paper, a collaborative filtering function is designed as a new and improved system for recommending movies. This is achieved by combining an asymmetric approach to ascertaining similarity with matrix factorization and Tyco (typicality-based collaborative filtering) [12]. This combination, known as the HYBRTyco system, is applied to the MovieLens and Ciao datasets. In the asymmetric approach, the similarity between consumers A and B is not necessarily equal to the similarity between B and A. Matrix factorization represents objects (movies) and users by vectors of factors derived from the rating patterns of the objects.
Tyco generates clusters of movies of the same genre; the degree of typicality (a measure of how much a movie belongs to that genre) of each movie in a cluster is considered and subsequently measured for each consumer in a genre. HYBRTyco uses the Pearson correlation coefficient, linear regression and gradient descent optimization in combination with regularization, and produces improved results, as shown by its lower MAE and MAPE values. It has yet to be implemented on a large dataset with lower execution time.
Table 2 Datasets used in various papers
Dataset: The training dataset is prepared for the Siamese network, using customer purchase data to build the Cartesian product of the purchase history of each customer and the complete list of movies in the dataset. Each row in the dataset represents a possible combination of a customer, a movie purchased by that customer, and a movie from the movie universe. Papers (in which used): [10]
Dataset: The MovieLens dataset was used to test the approach; it contains 0.1 million ratings on 1683 movies provided by 943 users. Every user has rated at least 20 movies. The ratings range between 1 (very bad) and 5 (excellent). Papers (in which used): [11, 12]
Dataset: The MovieLens dataset was extracted from the MovieLens Research Project Web site. The dataset consists of one million anonymous movie ratings. It also has user demographic information and a list of about three thousand movies that the users can rate. Papers (in which used): [13]
Dataset: The movie dataset used in one of the recommendation systems consists of 3 × 10^5 random movies taken from IMDb. Movies with fewer than five votes, or not reviewed by a single individual, are filtered out. The data is partitioned into three sets of equal size. Every movie is described using 13 features. Papers (in which used): [14]
Another approach uses standard user demographics such as gender, age and occupation in a collaborative filtering recommendation engine [13]. Recommendations are made based on the nearest neighbors’ best-rated movies. The recommendations are also made time-sensitive to keep up with the changing tastes of the user. The results are evaluated using three metrics: precision, recall and F-measure. More demographic information, such as nationality, race, location, mother tongue and languages spoken, could be used to improve the recommendations. All the above approaches use collaborative filtering in their recommendation systems, whereas our method proposes a hybrid of collaborative and content-based filtering (Table 2).
3 Proposed Work
3.1 Overview
User ratings are an important aspect of movie recommendation, besides the comments and tags given by users. A rating tells us about the quality of a movie and about the kind of movie a user likes. The problem with user ratings is that they are incomplete, because no user rates each and every movie; this is where the singular value decomposition algorithm is used. The algorithm looks for patterns in the ratings already given by different users and, based on these, predicts the rating of a particular movie by a particular user. As a result, we obtain ratings of all the movies by all the users. This algorithm is the collaborative filtering technique used in this recommendation system, because it uses the information of other users to predict the results. We also use the tf-idf weighted Word2vec technique, which considers both the frequency of the genres and of the words in the title of the movie, and captures the semantic aspect through the Word2vec algorithm. Thus, our proposed algorithm combines the ratings predicted by the SVD algorithm with the frequency-based and semantic-based features produced by tf-idf weighted Word2vec; this is done by assigning weights to both according to the user’s preferences. The architectural model is shown in Fig. 1.
3.2 Data Acquisition
The data was obtained from the movielens.org Web site. The dataset consists of three files: movies.csv, which contains the movie id, the movie title and the genre tags; tags.csv, which contains additional tags given to a movie by a user; and ratings.csv, which contains the rating given to a movie by a particular user.
3.3 Data Cleaning
Three data files are used in this paper, all downloaded from movielens.org [15]. A total of 138,493 users rated movies between January 9, 1995, and March 31, 2015; the dataset was generated on March 31, 2015, and updated on October 17, 2016. All the randomly selected users had rated a minimum of 20 movies each. No geographic locations of the users were used. Each user has a unique id, and no other information is provided. The data is contained in three files, ‘movies.csv’, ‘ratings.csv’ and ‘tags.csv’. The total number of rows, i.e., the total number of films, in the movies.csv file is 27,278. The genres in movies.csv are ‘|’ separated, so we split them and add the genre tags to a list, which is later converted into vectors so that the proposed algorithm can calculate the distances between movies. The movies.csv dataset also contains a genre called ‘(no genres listed)’; this inconsistency would cause an error later, so we removed this genre tag from the dataset. The algorithm for removing ‘|’ and ‘(no genres listed)’ is as follows:
• Iterate through each genre tag in the movies.csv file.
• Split the text at ‘|’.
• For each word in the split text, check whether it is equal to ‘(no genres listed)’; if it is not, append it to an initially empty string.
• Add the final string to a list.
Fig. 1 Architectural model: Dataset (already available: ratings.csv, tags.csv, movies.csv) → Data cleaning (stop-word removal, tokenization, stemming, etc.) → Feature extraction (tf-idf weighted Word2vec algorithm) → Feature scaling and selection → Singular value decomposition (compute all ratings) → Further operations such as multiplication and addition as per the algorithm → Euclidean distance (calculate pairwise distances) → Sort the distances in ascending order and recommend the movies with the least distance
The Tag Data Structure (tags.csv)
All the corresponding movie tags are present in the file called ‘tags.csv’. Each row of the file after the header row contains a tag given to a particular movie by a particular user. The header row has the following format: userId, movieId, tag, timestamp. The rows are ordered first by userId and then by movieId. Users gave the tags according to what they felt after watching the movie, so each tag is a single word or a short phrase whose meaning and purpose are user-relevant. There was also a column named ‘timestamp’ which showed
the number of seconds that had elapsed since January 1, 1970, when the user gave the corresponding tag. This file is very important, as it describes what different users thought about the movie; as the number of tags grows, the description becomes richer and the recommendation more accurate. We accumulated the tags that different users gave to a particular movie and appended them to the genre-tag list built from the movies.csv dataset. The algorithm is as follows:
• Iterate through all the data points of the tags.csv file.
• Store the movieId tagged by the user, and store the tags, split at spaces, in a list.
• Iterate through the movies.csv dataset; if the movieId matches the stored movieId, append the stored tag list to the corresponding list created earlier while iterating through the movies.csv file.
We now have a list of tags for every movie, but this list contains many useless items, such as punctuation marks like ‘!’, ‘[’ and ‘]’, and many uninformative words that would make the results inconsistent. So, we remove all these items. Figures 2 and 3 show a few rows of the tags.csv and ratings.csv files after the data cleaning step, respectively.
Data Preprocessing: Stop words such as ‘down’, ‘had’, ‘few’, ‘both’, ‘after’, ‘my’, ‘was’, ‘because’, ‘during’, ‘about’, ‘again’ and ‘y’ were obtained from the NLTK library and removed by iterating over the dataset, as such words are not useful for recommendation.
Fig. 2 Data of tags.csv after data cleaning
Fig. 3 Data of rating.csv after data cleaning
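The cleaning steps described in this section (splitting genres at ‘|’, dropping ‘(no genres listed)’, appending user tags, and removing punctuation and stop words) can be sketched as below. This is a minimal sketch under assumptions: the rows are given as in-memory tuples rather than read from the CSV files, a small hard-coded stop-word set stands in for the NLTK list, lower-casing is added for consistency, and the function names are illustrative.

```python
import string

# Stand-in for NLTK's English stop-word list (assumption: only a small subset is shown).
STOP_WORDS = {"down", "had", "few", "both", "after", "my", "was",
              "because", "during", "about", "again", "y"}
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def clean_genres(genre_field: str) -> list:
    """Split the '|'-separated genre field and drop the placeholder genre."""
    return [g.lower() for g in genre_field.split("|") if g != "(no genres listed)"]


def clean_tag(tag: str) -> list:
    """Strip punctuation, split at spaces and remove stop words."""
    words = tag.lower().translate(PUNCTUATION_TABLE).split()
    return [w for w in words if w not in STOP_WORDS]


def build_movie_tags(movies, tags) -> dict:
    """Map each movieId to its genre tags followed by all cleaned user tags."""
    movie_tags = {movie_id: clean_genres(genres) for movie_id, _title, genres in movies}
    for _user_id, movie_id, tag, _timestamp in tags:
        if movie_id in movie_tags:
            movie_tags[movie_id].extend(clean_tag(tag))
    return movie_tags


# Rows shaped like movies.csv (movieId, title, genres) and tags.csv
# (userId, movieId, tag, timestamp); the values are made up for illustration.
movies = [(1, "Toy Story (1995)", "Adventure|Animation|Children"),
          (2, "Unknown Film", "(no genres listed)")]
tags = [(7, 1, "Pixar!", 1139045764),
        (7, 1, "[fun] after my childhood", 1139045800),
        (9, 2, "was boring", 1139046000)]
movie_tags = build_movie_tags(movies, tags)
print(movie_tags)
```

Keeping the per-movie tags in a dictionary keyed by movieId avoids the nested iteration over movies.csv for every tag row that the bullet points describe, while producing the same lists.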
3.4 The Proposed Algorithm
Computing the tf-idf weighted Word2vec model
In average Word2vec, the tf-idf weight of every word is taken to be 1. Here, for every item T(i) in the list of titles and genres, we run the Word2vec model on each word W(j) to obtain a 300-dimensional vector, multiply that vector by the tf-idf value of the word, sum the resulting vectors cell by cell, and divide the result by the sum of the tf-idf values of the words W(j) in the title. In this way, we get the final vector for each title. Figure 4 shows the encoded data of the ratings.csv file after removing the timestamp column. The algorithm is as follows:
1. Computing tf-idf:
• tf stands for term frequency and is calculated as the number of times a word appears in a title divided by the total number of words in that title.
• idf is measured for a word W(j) with respect to a given data corpus D; IDF(Wj, D) is equal to the logarithm of the number of titles in the data corpus D divided by the number of titles that contain the given word Wj.
• Next, we multiply both values; thus, the tf-idf value is high if a word appears many times in a given title and few times in the data corpus D.
2. Computing Word2vec:
• Let us denote a particular title as T(i), consisting of ‘k’ words, w1 to wk. For every word Wj, we run the Word2vec model to get a 300-dimensional vector.
• For each title T(i), we take the 300-dimensional vector of each word Wj, derived in the previous step, and multiply it by the tf-idf value of the word.
• We then create a 300-dimensional vector in which each cell is the sum of the values of the corresponding cell over all words, divided by the sum of the tf-idf values of the words.
Fig. 4 Encoded data of ratings.csv file after removing the timestamp column
Computing absent ratings using singular value decomposition
Before applying the SVD algorithm, we encoded the data. As the data in the ratings.csv file was not ordered, we encoded it and ordered it first by userId and then by movieId. The userIds, which started at 1 in the original ratings.csv data, were changed to start from 0. The resulting encoded data had userIds in increasing order and, for each userId, movieIds in increasing order. Next, to train our model, we stored the number of unique userIds and the number of movieIds present in the dataset. We chose an embedding size of 100, which means that we create two component matrices, of sizes (number of users × 100) and (number of movies × 100). These matrices were initially filled with random values, for which we used the ‘torch’ library. The model was trained for 100 epochs with a learning rate of 0.01.
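The tf-idf weighted Word2vec averaging in steps 1 and 2 above can be sketched in plain Python. The sketch makes several assumptions: a tiny dictionary of three-dimensional vectors stands in for a pretrained 300-dimensional Word2vec model, term frequency is normalised by the length of the title, words without an embedding are skipped, and the function names are illustrative.

```python
import math
from collections import Counter


def tfidf_weights(titles):
    """Return one {word: tf-idf} dict per tokenised title."""
    n_titles = len(titles)
    doc_freq = Counter(word for title in titles for word in set(title))
    weights = []
    for title in titles:
        counts = Counter(title)
        weights.append({
            word: (c / len(title)) * math.log(n_titles / doc_freq[word])
            for word, c in counts.items()
        })
    return weights


def weighted_title_vector(weights, embeddings, dim):
    """Sum of (tf-idf * word vector) divided by the sum of the tf-idf values."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word, w in weights.items():
        vec = embeddings.get(word)
        if vec is None:          # word missing from the embedding model: skipped
            continue
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    return [v / weight_sum for v in total] if weight_sum else total


# Toy corpus and toy 3-d embeddings (a real model would supply 300-d vectors).
titles = [["toy", "story"], ["toy", "soldiers"], ["love", "story"]]
embeddings = {"toy": [1.0, 0.0, 0.0], "story": [0.0, 1.0, 0.0],
              "soldiers": [0.0, 0.0, 1.0], "love": [1.0, 1.0, 0.0]}
weights = tfidf_weights(titles)
vectors = [weighted_title_vector(w, embeddings, 3) for w in weights]
print(vectors[0])
```

Dividing by the sum of the weights keeps the title vector on the same scale as the word vectors, so titles of different lengths remain comparable when Euclidean distances are computed later.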
We first took the dot product of the corresponding row of the movie component matrix and the corresponding row of the user component matrix to obtain the predicted value of the associated cell. Next, we took the original rating given to that movie by that user and calculated the loss from the difference between the two values. If the original rating is higher than the prediction, the values in the corresponding rows of the movie component matrix and the user component matrix are adjusted so that the prediction increases in the next iteration. The model was trained with the Adam optimization algorithm, an extension of stochastic gradient descent that is widely used in natural language processing and computer vision. It can be used instead of classical stochastic gradient descent to update the weights of the network iteratively based on the training data. The name Adam is derived from ‘adaptive moment estimation’; the method was presented by Diederik Kingma and Jimmy Ba in 2015 in their paper titled ‘Adam: A Method for Stochastic Optimization’ [16]. Stochastic gradient descent uses a single learning rate, termed alpha, for all weight updates, and this rate does not change during training. In Adam, by contrast, a learning rate is maintained for every network weight (parameter) and adapted separately as learning unfolds. Adam combines the benefits of two other extensions of gradient descent: root-mean-square
Movie Recommendation Using Content-Based and Collaborative …
Fig. 5 Comparison of Adam with other optimization algorithms for training a multilayer perceptron, taken from ‘Adam: A Method for Stochastic Optimization’, 2015
propagation (RMSProp), which maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients of the weights, so that it performs well on non-stationary and online problems, and the adaptive gradient algorithm (AdaGrad), which also keeps a per-parameter learning rate and performs better on problems with sparse gradients, such as natural language problems. Adam realizes the advantages of both algorithms: rather than adapting the per-parameter learning rates based only on the average of the squared gradients (the uncentered variance), as in RMSProp, Adam also makes use of the average of the gradients themselves (the mean). Specifically, the algorithm calculates exponential moving averages of the gradient and of the squared gradient, and the parameters beta1 and beta2 control the decay rates of these moving averages. Figure 5 shows the comparison of Adam with other algorithms.
Merging the two algorithms
Coming back to the implementation, we trained the model with this optimizer for 100 epochs at a fixed learning rate of 0.01. In each epoch, we set the gradients to zero before the start of backpropagation, because PyTorch accumulates gradients on subsequent backward passes. Next, we compute d(loss)/d(x) for every parameter x that has requires_grad = True; then optimizer.step updates the value of x using its gradient. After training, the model can predict the rating of any movie by any user, because the component matrices have been fitted with very little error. The overall error came out to be approximately 0.87. Next, we stored all the ratings for a particular movie in a list and computed their average; thus, we get a list of average ratings, one per movie.
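The training procedure described above (random component matrices, dot-product prediction, gradients reset every epoch, Adam update) can be illustrated without PyTorch. The sketch below is not the authors' code: it is a pure-Python factorization with a hand-written Adam step, and it assumes a mean squared error loss, a tiny embedding size and hypothetical toy ratings. The name `train_factorization` and all of its parameters are illustrative.

```python
import math
import random

def train_factorization(ratings, n_users, n_movies, dim=4, epochs=100,
                        lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Fit user and movie component matrices to (user, movie, rating) triples."""
    rng = random.Random(seed)
    users = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_users)]
    movies = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(n_movies)]
    params = users + movies                    # the row lists are shared, not copied
    first = [[0.0] * dim for _ in params]      # moving average of the gradient
    second = [[0.0] * dim for _ in params]     # moving average of the squared gradient
    history = []
    for step in range(1, epochs + 1):
        grads = [[0.0] * dim for _ in params]  # gradients start from zero each epoch
        loss = 0.0
        for u, m, rating in ratings:
            predicted = sum(a * b for a, b in zip(users[u], movies[m]))
            error = predicted - rating
            loss += error * error
            for k in range(dim):
                grads[u][k] += 2.0 * error * movies[m][k] / len(ratings)
                grads[n_users + m][k] += 2.0 * error * users[u][k] / len(ratings)
        history.append(loss / len(ratings))    # mean squared error before the update
        for row, grad, m1, m2 in zip(params, grads, first, second):
            for k in range(dim):
                m1[k] = beta1 * m1[k] + (1.0 - beta1) * grad[k]
                m2[k] = beta2 * m2[k] + (1.0 - beta2) * grad[k] ** 2
                m_hat = m1[k] / (1.0 - beta1 ** step)   # bias-corrected estimates
                v_hat = m2[k] / (1.0 - beta2 ** step)
                row[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return users, movies, history

# Hypothetical toy data: (userId, movieId, rating) with ids starting at 0.
toy_ratings = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0), (1, 2, 2.0), (2, 1, 4.0)]
users, movies, history = train_factorization(toy_ratings, n_users=3, n_movies=3)
```

A missing rating is then estimated as the dot product of the relevant user row and movie row. In PyTorch the same loop would typically rely on `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()` instead of the manual update.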
We then append the rating value to the previously calculated tf-idf weighted Word2vec vector and assign weights according to preference. For example, if somebody wants similarly rated movies, we give more weight to the rating part of the combined vector; otherwise, we give equal weights to the tf-idf weighted Word2vec vector and the rating. If the user wants the suggestions ordered by rating, we additionally sort the predicted movies by their ratings.
Here, we have given equal weights to the tf-idf weighted Word2vec vector and the rating value. For example, if a weight of 0 is given to the tf-idf weighted Word2vec vector and some weight to the rating value, then movies with ratings similar to that of the query movie are shown; when equal weights are given, movies with the same genres and tags and with similar ratings are shown.
3.5 The Algorithm
The algorithm is as follows:
1. Clean the data, and insert the tags into a list.
2. Use the tf-idf weighted Word2vec model to turn the movie titles and the tags into 300-dimensional vectors.
3. Compute the rating matrix, which predicts the rating of every movie by every user, using the singular value decomposition algorithm.
4. Calculate the average rating for each movie.
5. Take the 300-dimensional tf-idf weighted Word2vec vector and multiply it by the weight given to it; also take the average rating and multiply it by the weight given to the rating; combine the two and divide by the sum of the weights.
6. Calculate the pairwise Euclidean distance between the query movie and all other movies, and recommend the 200 movies that are closest to the query movie.
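The weighting and nearest-neighbour steps at the end of this algorithm can be sketched as follows. The sketch reads "combine" as appending the weighted average rating to the weighted text vector, which is one interpretation of the description; the function names, the two-dimensional text vectors and the movies other than the query are hypothetical.

```python
import math

def combine(text_vector, avg_rating, w_text=1.0, w_rating=1.0):
    """Append the weighted rating to the weighted text vector, scaled by the weight sum."""
    total = w_text + w_rating
    combined = [w_text * value / total for value in text_vector]
    combined.append(w_rating * avg_rating / total)
    return combined

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(query, vectors, top_n=200):
    """Titles of the `top_n` movies closest to `query`, nearest first."""
    distances = [(euclidean(vectors[query], vector), title)
                 for title, vector in vectors.items() if title != query]
    distances.sort()
    return [title for _, title in distances[:top_n]]

# Hypothetical toy catalogue: title -> (text vector, average rating).
catalogue = {
    "Toy Story": ([1.0, 0.0], 4.0),
    "Movie A": ([0.9, 0.1], 3.5),
    "Movie B": ([0.0, 1.0], 4.0),
}
equal = {t: combine(v, r) for t, (v, r) in catalogue.items()}
rating_only = {t: combine(v, r, w_text=0.0) for t, (v, r) in catalogue.items()}
by_content_and_rating = recommend("Toy Story", equal)
by_rating = recommend("Toy Story", rating_only)
```

With equal weights the movie with the most similar text vector ranks first; with a text weight of 0 the ranking depends on the average rating alone.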
4 Related Technologies
Term frequency–inverse document frequency algorithm (tf-idf)
tf-idf is a weight used in text mining and information retrieval. It measures the importance of a particular word to a document in a data corpus [4]. The importance increases with the frequency of the word in the document and is offset by how often the word occurs across the corpus. Variations of the tf-idf weighting scheme are used by search engines to rank the relevance of a document to a user query. tf-idf can also be used for stop-word filtering and text classification.
• Term frequency (tf): It measures the frequency of a term in a document; the count of the term is divided by the length of the document.
• Inverse document frequency (idf): It measures how informative a term is. It is the logarithm of the total number of documents divided by the number of documents containing that term.
Average Word2vec Algorithm
Word2vec is a text-processing two-layer neural network [5]. For every text corpus as
input, the output is a set of vectors, known as the feature vectors of the words in that corpus. Word2vec turns text into a numerical form that deep neural networks can process, although it is not itself a deep neural network.
TF-IDF Weighted Word2vec Algorithm
Average Word2vec and tf-idf weighted Word2vec are closely related: average Word2vec effectively takes the tf-idf value of every word to be 1, whereas in tf-idf weighted Word2vec, for each word W(j), we take its 300-dimensional Word2vec vector, multiply it by the tf-idf value of the word, add up the corresponding cells over all words and finally divide by the sum of the tf-idf values of the words to get the final vector for the given title [9].
Singular Value Decomposition Algorithm
Singular value decomposition decomposes a matrix into constituent parts, which reduces the matrix size and makes calculations simpler [17–21]. SVD can be used to compute a matrix inverse, for data reduction in machine learning, and for image compression and data de-noising.
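For reference, the standard form of the decomposition and the two uses mentioned above can be written as follows. The notation is the usual textbook one and is not taken from the paper: A is an m × n matrix, the columns u_i of U and v_i of V are orthonormal, and Σ is diagonal with singular values σ_1 ≥ σ_2 ≥ … ≥ 0.

```latex
A = U \Sigma V^{\top}, \qquad
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top}, \qquad
A^{+} = V \Sigma^{+} U^{\top}
```

Here A_k keeps only the k largest singular values and is the best rank-k approximation of A in the least-squares sense, which is what data reduction and image compression exploit. A^+ is the pseudo-inverse, where Σ^+ is obtained by inverting the non-zero singular values and transposing; it coincides with the ordinary inverse when A is square and non-singular.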
5 Result Analysis
To the best of our knowledge, the approaches described so far did not consider some of the aspects that we tried to cover in our approach. One of the deep learning-based approaches considered customer purchase history and a dense representation of movie plots but did not consider user ratings of movies as a factor for the recommendation, which our approach does. User ratings play a pivotal role in a recommendation system [10]. Another method used the k-means clustering algorithm with cuckoo search for movie recommendation [11]. It inherits the restrictions of the k-means algorithm: the efficiency of such an approach depends largely on the initial partitioning, and the chosen value of ‘k’ affects the performance of the system. Thus, one has to follow a brute-force approach to decide the value of ‘k’ and use advanced versions of k-means to select improved initial centroids (called k-means seeding). We studied all these aspects and decided to choose SVD as the machine learning algorithm, since its results might not be affected by such factors. One of the approaches discussed above used Pearson correlation [12], and another used only user-based collaborative filtering, which recommends an item to the user based on the opinions of other like-minded users about that item [13]. While the former is tedious and time-consuming to calculate, the latter suffers from scalability and data sparsity problems, which lead to the cold start problem.
Fig. 6 Spatial position of the movies
In Fig. 6, the plotted graph shows that all the movies recommended using Word2vec lie close to the query movie (here, ‘Toy Story’). While designing and modeling the proposed method, we tried to overcome the limitations mentioned above by using algorithms that compute the results considerably faster, and we also tried to extract the maximum number of features from the data present in the dataset. We also used a standard dataset, so that our work can be compared with other work that has already been done or will be done in this field.
6 Conclusion and Future Work
The above outcomes show the benefits of using the tf-idf weighted Word2vec algorithm over using only tf-idf or only Word2vec. Using semantic-based filtering such as Word2vec, we can capture the user’s behavior and preferences and thus predict future needs, and by using frequency-based filtering we obtain recommendations based on the text of the title and the tags previously added by the user. The singular value decomposition algorithm is very efficient. Its basis is hierarchical, ordered by relevance, and it tends to perform quite well for most datasets. Better results are produced when we add the average rating value to the tf-idf weighted vector, which enables the user to give preference to ratings, to semantics or to a mixture of the two. As future work, the recommendations could be improved by analyzing the cover picture of the movie, classifying its image properties using a convolutional neural network and then giving recommendations based on similarity to the image of the query movie. Many current techniques in collaborative filtering are unable to handle large datasets; also, many users give very few ratings, and this problem cannot be easily
dealt with, so here we could use probabilistic matrix factorization [22]. This algorithm performs very well on datasets that are very large, sparse and imbalanced. We could also clean the data so that, by applying clustering algorithms like k-means, we can find all the movies belonging to a particular genre.
References
1. C.C. Aggarwal, An introduction to recommender systems, in Recommender Systems (Springer International Publishing, Cham, 2016), pp. 1–28
2. Singular value decomposition - Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Singular_value_decomposition
3. M.E. Wall, A. Rechtsteiner, L.M. Rocha, Singular value decomposition and principal component analysis (2003)
4. M. Tajbakhsh, J. Bagherzadesh, Microblogging hash tag recommendation system based on semantic TF-IDF: Twitter use case, in 2016 I. 4th International, and U. 2016, ieeexplore.ieee.org. (2016)
5. Y. Goldberg, O. Levy, word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method (2014)
6. Recommender Systems in Practice - Towards Data Science. [Online]. Available: https://towardsdatascience.com/recommender-systems-in-practice-cef9033bb23a
7. The 4 Recommendation Engines That Can Predict Your Movie Tastes. [Online]. Available: https://medium.com/@james_aka_yale/the-4-recommendation-engines-that-can-predict-your-movie-tastes-bbec857b8223
8. G. Adomavicius, A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, computer.org (2005)
9. P. Meel, A. Goswami, Inverse document frequency-weighted Word2Vec model to recommend apparels, in 2019 6th International Conference on Signal Processing and Integrated Networks, SPIN 2019, (2019), pp. 1–7
10. M. Campo, J.J. Espinoza, J. Rieger, A. Taliyan, Collaborative metric learning recommendation system: application to theatrical movie releases
11. R. Katarya, O.P. Verma, An effective collaborative movie recommender system with cuckoo search. Egypt. Informat. J. Elsevier
12. R. Katarya, O.P. Verma, Effective collaborative movie recommender system using asymmetric user similarity and matrix factorization, in International Conference on Computing, Communication and Automation (ICCCA), ieeexplore.ieee.org (2016)
13. V. Subramaniyaswamy, R. Logesh, M. Chandrashekhar, A. Challa, V. Vijayakumar, A personalised movie recommendation system based on collaborative filtering. Int. J. High Perform. Comput. Netw. 10(1–2), 54–63 (2017)
14. S. Debnath, N. Ganguly, P. Mitra, Feature weighting in content based recommendation system using social network analysis, in Proceedings of the 17th International Conference on, and Undefined 2008, Citeseer
15. MovieLens. [Online]. Available: https://movielens.org/
16. D.P. Kingma, J.A. Ba, Adam: a method for stochastic optimization (2014)
17. H. Björnsson, S.A. Venegas, A manual for EOF and SVD analyses of climatic data, in Report, and U. 1997, geog.mcgill.ca
18. R.A. Horn, C.R. Johnson, Topics in Matrix Analysis (Cambridge University Press, 1991)
19. G.W. Stewart, On the early history of the singular value decomposition. SIAM Rev. 35(4), 551–566 (1993)
20. G. Golub, W. Kahan, Calculating the singular values and pseudo-inverse of a matrix. J. Soc. Ind. Appl. Math. Ser. B Numer. Anal. 2(2), 205–224 (1965)
21. J. Demmel, W. Kahan, Accurate singular values of bidiagonal matrices. SIAM J. Sci. Stat. Comput. 11(5), 873–912 (1990)
22. R. Salakhutdinov, A. Mnih, Probabilistic matrix factorization. Adv. Neural Inf. Process. Syst. 20 - Proc. 2007 Conf., pp. 1–8, 2009
Feature Selection Algorithms and Student Academic Performance: A Study
Chitra Jalota and Rashmi Agrawal
Abstract In the present state of affairs, the aim of every educational organization is to improve the academic achievement of its students. Educational data mining (EDM) is an emerging field of research, and it helps academic institutions predict the academic performance of students. Educational datasets are the basis of various predictive models, and the quality of these models can be improved by using feature selection (FS). To obtain the required benefits from the available data, tools for analysis and prediction are needed, and machine learning and data mining are well suited to this purpose. In educational data mining, feature selection plays a vital role in improving the accuracy of prediction models and the quality of educational datasets. Feature selection algorithms remove irrelevant information from educational data repositories so that the accuracy of the classifier increases and its output can be used for better decisions. The best feature selection algorithm must therefore be chosen. In this paper, two feature selection approaches, namely correlation feature selection (CFS) and wrapper-based feature selection, are used to demonstrate the importance of selecting a feature subset for a classification problem. The paper presents a detailed investigation of these feature selection algorithms in combination with classification algorithms on a given dataset. We report results for different numbers of features obtained from the feature selection algorithms and classifiers, which will help researchers find the best combination of feature selection algorithm and classifier. The results indicate that SMO and J48 have the highest accuracy measures with the correlation feature selection algorithm, while Naïve Bayes has the highest accuracy measures with the wrapper subset feature selection algorithm, for predicting high, medium and low grades of students.
C. Jalota (B) · R. Agrawal
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, India
© Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_23
Keywords Feature selection algorithms · J48 · Random forest · Correlation feature selection · Wrapper feature selection
1 Introduction
Education is the key factor in the development of any nation, and the quality of education is the most essential ingredient for bringing change to society. To improve the educational process, we have to explore the hidden information in the huge amount of data kept in the databases of academic institutions. Students’ academic performance can be evaluated by many techniques, such as data mining and machine learning. With the help of students’ performance, the reputation of the organization and the future of its students can be assessed in advance. Various prediction models are used for monitoring the academic performance of students; with these models and machine learning techniques, it is possible to forecast the academic performance of students. It is, however, difficult to determine the features that are pertinent to academic performance. The advantages of applying academic performance prediction to a student database are as follows:
• Improvement in the enrollment quality of an institution
• Better future planning for students
• Help in optimizing decisions related to studies.
To build a student performance prediction model, a few selected features are required from the given dataset. Feature selection algorithms help to select the most appropriate features, and prediction results differ depending on which of these algorithms is used. With feature selection algorithms, redundancy can be avoided and the most relevant features can be included without loss of information. Feature selection algorithms are applied during data preprocessing, which is the first step in creating a machine learning model. In this step, we select a subgroup of features for the prediction model. These algorithms reduce computational complexity and can thus increase the accuracy of the results.
Feature selection algorithms can be divided into three main categories:
(1) Filter Method: In this method, each feature is scored by a statistic that is computed without training a model. In the simplest case, we compute the variance of each feature and assume that a feature with high variance carries the most important information. A major drawback of this variance-based scoring is that it does not take the relationship between the feature variable and the target variable into account. Other filter statistics include Pearson correlation, Chi-square, etc.
(2) Wrapper Method: In this method, we take a subgroup of features and train a model on it. Based on the results obtained with the previous model, features are added to or removed from the subset. Computation on such
type of methods is very expensive, e.g., forward selection, backward selection, etc.
(3) Embedded Method: It is a hybrid method which merges the qualities of filter and wrapper methods. Algorithms used for this method have their own built-in feature selection, e.g., Lasso and Ridge regression.
The main focus of this study is the filter feature selection algorithm. Past research also shows that feature selection is used frequently in machine learning for various research works. Researchers have used various feature selection algorithms on academic performance data to get the best results. In this study, a student dataset is used to find the most promising combination of classification algorithm and feature selection algorithm.
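As a small illustration of the filter category, the following Python sketch scores features without training any model, once by variance and once by absolute Pearson correlation with the target. It is a sketch under stated assumptions rather than the procedure used in this study: the feature names and values are hypothetical, and the class labels are assumed to be encoded as numbers (0 = low, 1 = medium, 2 = high).

```python
import math

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0
    return cov / (spread_x * spread_y)

def rank_by_correlation(columns, target):
    """Feature names ordered by |correlation with the target|, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: three features and a numerically encoded grade.
columns = {
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [3, 1, 2, 3, 1, 2],
    "constant": [5, 5, 5, 5, 5, 5],
}
grade = [0, 0, 1, 1, 2, 2]
ranking = rank_by_correlation(columns, grade)
variances = {name: variance(values) for name, values in columns.items()}
```

The variance score looks only at the feature itself, whereas the correlation score also uses the target, which is the distinction drawn in the description of the filter method above.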
2 Literature Survey
Many researchers have applied data mining and machine learning to the prediction of students’ academic performance.
Jalota and Agrawal [1] used Weka to find the most influential classifier for predicting students’ academic performance on an educational dataset. Approximately 480 instances were collected to obtain the performance results.
Kavipriya and Karthikeyan [2] used the Boolean intersection operator to select best-fit features from the historical academic data of education systems and proposed a hybrid method for feature extraction using improved ant colony optimization (ACO) and an improved genetic algorithm (GA). ACO, GA and the Boolean operator were used as baseline algorithms to calculate the accuracy of the feature selection methods, and their results were compared with those of the proposed algorithm.
Hussain and Dahan [3] evaluated student academic performance on the basis of two kinds of parameters, academic and personal. Four classification algorithms, J48, PART, BayesNet and random forest, were used to measure the accuracy. The authors also used the Apriori algorithm to find some rules.
Zaffar et al. [4] analyzed and evaluated various feature selection algorithms using Weka and found no remarkable difference in accuracy between them. The authors conclude that principal component analysis with the random forest classifier showed better results.
Kavipriya and Karthikeyan [5] analyzed feature selection algorithms in order to inspect the strengths and weaknesses of the most influential feature selection methods. Algorithms used for feature selection should choose the significant variables and eliminate the irrelevant and inconsistent variables that degrade the accuracy of the classification algorithm. Classification algorithms and feature selection algorithms are applicable to two-class datasets and to multiclass datasets.
Kavipriya and Karthikeyan [6] presented some notable data mining methods with which student performance prediction systems can be improved. The first technique presented was an improved version of the K-means algorithm combined with a support vector machine, which reduces the number of support vectors needed for classification; the other was a two-way system that combines the benefits of several classifiers to improve efficiency.
Zaffar et al. [7] evaluated and analyzed various feature selection algorithms with classifiers such as BayesNet, Naïve Bayes and multilayer perceptron (MLP). Of the feature selection methods considered, the most significant was principal components, which showed comparatively better results when used with the random forest classifier. Of all the classifiers used, MLP was the best to some extent on a student dataset.
Mueen et al. [8] applied three data mining classification algorithms (Naïve Bayes, neural network and decision tree) to an educational dataset. The prediction performance of the three classifiers was calculated and compared. The results show that the Naïve Bayes classifier was the best of the three, with an overall prediction accuracy of 86%.
Figueira [9] extracted three main features from the Moodle database to predict the future grades of students and performed various statistical analyses on these features. By combining the features with principal component analysis, a decision tree was obtained, with which grades can be predicted in three intervals. An LMS could benefit from incorporating this methodology, which could then be used during a course.
Sivakumar et al. [10] designed and developed an improved ID3-based decision tree algorithm to predict whether students will continue or drop out of their studies. The model was built with the help of the conventional ID3 algorithm, Renyi entropy, information gain and an association function. The major reasons and relevant factors for student dropout could also be detected.
Veerabhadrappa and Rangarajan [11] introduced an algorithm for information filtering based on the Pearson χ2 test, with which feature selection was tested. It is beneficial for multi-dimensional data where the sample set is not large. Four feature selection algorithms (FCBF, CorrSF, ReliefF and ConnSF) were used to assess the quality of the selected features.
Sandya et al. [12] used fuzzy logic to develop a new feature extraction method. In this method, a fuzzy system generates a fuzzy score, which is the basis for extracting the most relevant features. The results showed that the fuzzy score method was able to extract the most relevant features and that classification accuracy also improved.
Kajdanowic et al. [13] developed a new method for feature extraction that combines class labels and network structure information to calculate new features. Important features were extracted using this method, which was shown to identify significant and relevant variables and to give a small improvement in classification accuracy.
Gladis et al. [14] used principal component analysis and linear discriminant analysis to find the most relevant features. A subset of the newly obtained features was then used with a support vector machine (SVM) classifier, which improved the classification accuracy.
Jalota and Munjal [15] proposed a segmentation and grid module and a feature extraction module using the K-means clustering and K-nearest neighbor algorithms, from which a neighborhood module is generated to build a content-based image retrieval (CBIR) system. To characterize all sides of every image, the neighborhood color analysis module is important. CBIR gives very good output, and a number of issues can be optimized by using this technique.
3 Proposed Methodology
The present paper is an extended version of our previous work on the analysis of data mining using classifiers, by Jalota and Agrawal [1]. We used the Kalboard 360 educational dataset, which was gathered using a learning management system (LMS). This dataset comprises 480 instances with 16 features. The aim of this study is to assess different feature selection algorithms and their performance in combination with classification algorithms. The study addresses the following two questions:
Q1: Which feature selection algorithm is most important for predicting whether students achieve a low, middle or high grade?
Q2: Which combination of feature subset, chosen by the feature selection algorithms, and classifier is most appropriate for predicting whether students achieve a low, middle or high grade?
To answer these research questions, an educational dataset of students was taken from a reliable source, and various feature selection algorithms were applied to it. The feature selection algorithms are evaluated on the basis of the accuracy obtained with them. Numerous classification algorithms are applied to the feature subsets they select, and performance is evaluated over all combinations of feature selection algorithm and classifier. Figure 2 shows the architecture of feature selection and classification algorithms for this paper. The above-mentioned dataset is processed with a combination of the feature selection algorithms (CfsSubsetEval, WrapperSubsetEval) and various classification algorithms (Fig. 1).
3.1 Experimental Setup
In this paper, we used the Waikato Environment for Knowledge Analysis (WEKA), version 3.8, developed at the University of Waikato in New Zealand, as the data mining tool.
Fig. 1 Types of feature selection algorithms
Fig. 2 Architecture of feature selection and classification algorithm
WEKA is written in Java. It provides a large number of machine learning algorithms for analysis and decision making.
3.2 Feature Selection Algorithm and Classifiers
Data preprocessing is the first step of any data mining task. During data preprocessing, the most important technique is feature selection, which is effective in reducing dimensionality, removing irrelevant data and improving learning accuracy. It is also an essential element of the machine learning process. Feature selection is also known as variable selection, attribute selection and attribute subset selection. This paper uses the following algorithms for feature selection:
• Correlation Feature Selection Algorithm
• Wrapper Subset Feature Selection Algorithm
CfsSubsetEval: A feature selection technique that measures the correlation between nominal features, so numeric features are first discretized. It is a fully automatic algorithm, which means that the user does not have to specify a threshold value or the number of features to be selected. It is a filter, and therefore it does not incur a high evaluation cost.
WrapperSubsetEval: In this method, feature subsets are scored with the help of a predictive model. To train a model, which is then tested on a hold-out set, each new subset
could be used. The score of a subset is calculated by counting the number of mistakes made on the hold-out set. Because this method trains a new model for each subset, its computation cost is very high.
To calculate the prediction accuracy obtained with the different feature subsets, we use classification algorithms. In this paper, we have used nine classification algorithms.
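The wrapper idea described above, scoring a feature subset by the number of mistakes a model makes on a hold-out set, can be sketched in Python as follows. This follows the textual description only and is not WEKA's WrapperSubsetEval implementation; the 1-nearest-neighbour classifier, the greedy forward search and the toy rows are all assumptions chosen to keep the example short.

```python
def predict_1nn(train_rows, train_labels, row, subset):
    """Label of the closest training row, comparing only the features in `subset`."""
    nearest = min(
        range(len(train_rows)),
        key=lambda i: sum((train_rows[i][f] - row[f]) ** 2 for f in subset),
    )
    return train_labels[nearest]

def holdout_errors(subset, train, holdout):
    """Number of hold-out rows misclassified when only `subset` is used."""
    train_rows, train_labels = train
    hold_rows, hold_labels = holdout
    return sum(
        predict_1nn(train_rows, train_labels, row, subset) != label
        for row, label in zip(hold_rows, hold_labels)
    )

def forward_selection(n_features, train, holdout):
    """Greedily add the feature that lowers the hold-out error count the most."""
    selected = []
    best_errors = len(holdout[1]) + 1          # worse than any achievable score
    remaining = list(range(n_features))
    while remaining:
        errors, feature = min(
            (holdout_errors(selected + [f], train, holdout), f) for f in remaining
        )
        if errors >= best_errors:              # no improvement: stop searching
            break
        selected.append(feature)
        remaining.remove(feature)
        best_errors = errors
    return selected, best_errors

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
train = ([[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [0.9, 1.1]],
         ["low", "low", "high", "high"])
holdout = ([[0.2, 1.0], [0.8, 5.0]], ["low", "high"])
selected, errors = forward_selection(2, train, holdout)
```

Every candidate subset requires a fresh evaluation of the model, which is why the cost grows quickly with the number of features.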
4 Results and Discussions
This paper reports the performance evaluation of two feature selection algorithms (correlation feature selection and wrapper feature selection) on the above-mentioned student dataset. F-measure, recall, precision and prediction accuracy are the bases for the performance evaluation of these algorithms. The outcomes of these feature selection techniques with nine different classifiers are given in Tables 1, 2, 3, 4, 5 and 6. Results acquired by the above-mentioned

Table 1 Result of CfsSubsetEval with different classifiers for high grade prediction
Feature selection–classification algorithm   F-measure   Recall   Precision
CFS-BayesNet                                 0.687       0.718    0.658
CFS-JRip                                     0.676       0.683    0.669
CFS-J48                                      0.823       0.803    0.843
CFS-MLP                                      0.801       0.810    0.793
CFS-Naïve Bayes                              0.671       0.711    0.635
CFS-OneR                                     0.520       0.451    0.615
CFS-RandomForest                             0.7533      0.697    0.773
CFS-SMO                                      0.778       0.803    0.755
CFS-SimpleLogistic                           0.760       0.746    0.774
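The three measures reported in the tables are related: under the usual definition, the F-measure is the harmonic mean of precision and recall. The short Python check below assumes that definition (the function name `f_measure` is illustrative) and applies it to the CFS-BayesNet row of Table 1.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# CFS-BayesNet row of Table 1: precision 0.658, recall 0.718.
bayesnet_f = round(f_measure(0.658, 0.718), 3)   # 0.687, as reported in the table
```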
Table 2 Result of CfsSubsetEval with different classifiers for low grade prediction
Feature selection–classification algorithm   F-measure   Recall   Precision
CFS-BayesNet                                 0.803       0.819    0.788
CFS-JRip                                     0.811       0.811    0.811
CFS-J48                                      0.823       0.858    0.790
CFS-MLP                                      0.808       0.811    0.805
CFS-Naïve Bayes                              0.791       0.866    0.728
CFS-OneR                                     0.727       0.701    0.754
CFS-RandomForest                             0.844       0.874    0.816
CFS-SMO                                      0.849       0.843    0.856
CFS-SimpleLogistic                           0.817       0.827    0.808
C. Jalota and R. Agrawal
Table 3 Result of CFSSubsetEval with different classifiers for average performance

| Feature selection–classification algorithm | F-measure | Recall | Precision |
|---|---|---|---|
| CFS-BayesNet | 0.649 | 0.621 | 0.679 |
| CFS-JRip | 0.663 | 0.659 | 0.668 |
| CFS-J48 | 0.725 | 0.711 | 0.739 |
| CFS-MLP | 0.756 | 0.749 | 0.763 |
| CFS-Naïve Bayes | 0.598 | 0.540 | 0.671 |
| CFS-OneR | 0.627 | 0.697 | 0.570 |
| CFS-RandomForest | 0.740 | 0.749 | 0.731 |
| CFS-SMO | 0.757 | 0.744 | 0.770 |
| CFS-SimpleLogistic | 0.736 | 0.739 | 0.732 |
Table 4 Result of WrapperSubsetEval with different classifiers for high performance

| Feature selection–classification algorithm | F-measure | Recall | Precision |
|---|---|---|---|
| Wrapper-BayesNet | 0.697 | 0.704 | 0.690 |
| Wrapper-JRip | 0.630 | 0.606 | 0.656 |
| Wrapper-J48 | 0.667 | 0.676 | 0.658 |
| Wrapper-MLP | 0.685 | 0.690 | 0.681 |
| Wrapper-Naïve Bayes | 0.706 | 0.718 | 0.694 |
| Wrapper-OneR | 0.480 | 0.465 | 0.496 |
| Wrapper-RandomForest | 0.702 | 0.697 | 0.707 |
| Wrapper-SMO | 0.708 | 0.690 | 0.726 |
| Wrapper-SimpleLogistic | 0.691 | 0.669 | 0.714 |
Table 5 Result of WrapperSubsetEval with different classifiers for low performance

| Feature selection–classification algorithm | F-measure | Recall | Precision |
|---|---|---|---|
| Wrapper-BayesNet | 0.832 | 0.819 | 0.846 |
| Wrapper-JRip | 0.808 | 0.827 | 0.789 |
| Wrapper-J48 | 0.779 | 0.764 | 0.795 |
| Wrapper-MLP | 0.824 | 0.827 | 0.820 |
| Wrapper-Naïve Bayes | 0.828 | 0.890 | 0.774 |
| Wrapper-OneR | 0.650 | 0.606 | 0.700 |
| Wrapper-RandomForest | 0.805 | 0.795 | 0.815 |
| Wrapper-SMO | 0.816 | 0.858 | 0.779 |
| Wrapper-SimpleLogistic | 0.824 | 0.827 | 0.820 |
Table 6 Result of WrapperSubsetEval with different classifiers for average performance

| Feature selection–classification algorithm | F-measure | Recall | Precision |
|---|---|---|---|
| Wrapper-BayesNet | 0.695 | 0.697 | 0.693 |
| Wrapper-JRip | 0.651 | 0.659 | 0.644 |
| Wrapper-J48 | 0.648 | 0.649 | 0.646 |
| Wrapper-MLP | 0.687 | 0.682 | 0.692 |
| Wrapper-Naïve Bayes | 0.673 | 0.635 | 0.717 |
| Wrapper-OneR | 0.536 | 0.569 | 0.506 |
| Wrapper-RandomForest | 0.693 | 0.701 | 0.685 |
| Wrapper-SMO | 0.687 | 0.678 | 0.698 |
| Wrapper-SimpleLogistic | 0.698 | 0.711 | 0.685 |
feature selection (FS) algorithms and classifiers are presented in the tables. Each table lists the feature selection–classification combination together with its precision, recall and F-measure values. The results in Tables 1, 2 and 3 show the range of correctness measurements for the nine classifiers with the CFSSubsetEval feature selection algorithm. Figure 3 graphically shows the results achieved with the CFSSubsetEval feature selection algorithm. As we can see from Table 1 and Fig. 3, the classifier OneR has the lowest performance, and the classifier J48 has the highest level of accuracy with CFSSubsetEval for the prediction of high grade. Similarly, Tables 2 and 3 and Figs. 4 and 5 show that SMO has the highest level of accuracy with CFSSubsetEval for the prediction of low grade and average grade. The results in Tables 4, 5 and 6 show the range of accuracy measurements of the WrapperSubsetEval feature selection algorithm with the nine classifiers. Figure 6
Fig. 3 Performance of CFSSubsetEval for prediction of high performance
Fig. 4 Performance of CFSSubsetEval for prediction of low performance
Fig. 5 Performance of CFSSubsetEval for prediction of average performance
Fig. 6 Performance of WrapperSubsetEval for prediction of high performance
Fig. 7 Performance of WrapperSubsetEval for prediction of low performance
Fig. 8 Performance of WrapperSubsetEval for prediction of average performance
graphically shows the results achieved with the WrapperSubsetEval feature selection algorithm. As we can see from Tables 4, 5 and 6 and Figs. 7 and 8, the classifier OneR has the lowest performance, and the classifier Naïve Bayes has the highest level of accuracy with WrapperSubsetEval for the prediction of high, low and average grades.
5 Conclusion and Future Scope
This paper presents a comprehensive investigation of two feature selection algorithms and evaluates their performance using an educational dataset. The results given above clearly indicate that SMO and J48 have the highest accuracy measures with the correlation feature selection algorithm, while
Naïve Bayes has the highest accuracy measures with the wrapper subset feature selection algorithm for predicting the students’ grade, i.e., high, medium or low. In the future, more than one dataset can be used to evaluate the feature selection results through different feature selection algorithms such as correlation-based, info gain, gain ratio and many more. Hybrid algorithms for feature selection on various educational datasets for the prediction of students’ academic performance can also be included as future work.
MobiSamadhaan—Intelligent Vision-Based Smart City Solution Mainak Chakraborty , Alik Pramanick , and Sunita Vikrant Dhavale
Abstract MobiSamadhaan is an artificial intelligence (AI)-based approach toward a universal mobility solution for the problems arising from the daily hassles of traffic and unbalanced transportation, catering to the needs of India’s smart cities mission. In this paper, we propose vision-based solutions for tracking, forecasting, and surveillance through spatio-temporal analysis of traffic data using deep learning and machine learning techniques for commuters and Indian transport authorities. The method includes an accurate estimation of the vehicle and passenger density for analysis of mass movement, traffic congestion, CO2 emissions, and the demand-supply gap, as well as a vision-based low-cost parking status prediction system. Because safety and security are vital components of smart city mobility management, we also implement a human activity detection system, which achieved 91.2% accuracy on the UCF-101 dataset. The findings of this study show that MobiSamadhaan can be adopted as a pilot solution as part of smart city development.
Keywords Smart city · Mobility challenges · Activity detection · Car parking · Carbon emission estimation · Traffic jam detection
Awarded the prestigious Smart India Hackathon-2019 software edition.
M. Chakraborty (B) · A. Pramanick · S. V. Dhavale, Defence Institute of Advanced Technology (DIAT), Girinagar, Pune 411025, India
© Springer Nature Singapore Pte Ltd. 2021. D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_24

1 Introduction
A smart city is committed to strengthening the performance and quality of its public services through the utilization of Information and Communication Technologies (ICT) such as artificial intelligence-based approaches. Indian transportation can be categorized as a complex problem because it is difficult to model the system and the users’ behavior based on activities, traffic flow, comfort level, demand-supply gap, existing infrastructure, and uncertainty. People squander their crucial time due to the daily hassles of traffic, which directly threatens the economy, health, and well-being of the country. Limited availability of transport resources may cause buses to be overcrowded and uncomfortable for passengers during peak hours. This problem can be solved by minimizing the supply-demand gap and accurately predicting daily traffic congestion. However, the country may face chaotic scenarios including congested footpaths, illegal parking, and criminal activities because of the lack of adequate monitoring and poor traffic management policies [5]. In computer vision tasks, especially image classification [19], object detection [13], activity recognition [26], and feature extraction [40], the convolutional neural network (CNN) [20] has been quite impressive, with performance comparable to that of humans. Deep learning-based applications for solving smart city-related challenges are explored by Wang et al. [34] and Chen et al. [8]. Koesdwiady et al. [18] employ a deep belief network (DBN) on the Caltrans Performance Measurement System (PeMS) [7] and MesoWest project [32] datasets to predict traffic flow and weather conditions. To improve accuracy, they utilized decision-level data fusion. Huang et al. [15] proposed a model that combines a DBN at the bottom layer for unsupervised traffic flow feature extraction with multitask regression at the top layer for traffic flow prediction on PeMS and highway entrance-exit station datasets. Xu et al. [37] used long short-term memory (LSTM) [14] to predict the future taxi demand of a zone based on its present and past demand.
In their work, the model is trained on the New York City taxi trip dataset [29], which contains timestamps (pick-up and drop-off) and GPS coordinates of 600 million taxi trips. Davis et al. [9] forecast the taxi demand-supply gap in different areas of Bengaluru, India, based on time series modeling. A multilevel clustering technique is utilized to increase their model accuracy.
The increase in air pollution and emissions of carbon dioxide from wasteful fuels are responsible for global warming. The proper estimation of mobility-related emissions in terms of pollutants and greenhouse gases is, therefore, essential for the development of smart cities. Morris et al. [22] proposed a vision-based solution to estimate emissions of CO, CO2, NOx, and HC using a vehicle-specific power approach based on the vehicle type, speed, and acceleration. In their research, they use the Comprehensive Modal Emissions Model and the EPA MOVES Model to estimate vehicle emissions in real time. Kakouei et al. [17] presented a mathematical model to estimate vehicles’ CO2 emissions in the capital city of Iran.
Recent research reports that almost 30% of the traffic in cities comes from cars looking for parking, which takes several minutes for each vehicle [4]. To solve this problem, many authors provided solutions based on deep learning techniques. Amato et al. [1] proposed real-time detection of car parking occupancy using a CNN classifier running on the onboard camera. In their research, they introduced the CNRPark-EXT [1] dataset, which is composed of 144,965 labelled patches. They trained and
validated their model, mAlexNet, a modified version of AlexNet [19], on the CNRPark [2], CNRPark-EXT and PKLot [10] datasets. Valipour et al. [33] proposed a novel method for parking lot vacancy detection using a pre-trained VGG CNN [27] fine-tuned on the PKLot dataset.
Human safety is a key element in smart city development and can be achieved through constant monitoring of human activity. Various deep learning-based approaches have been proposed by researchers to learn human behavior patterns in a video for national security, crowd analysis, ambient assisted living, and anomaly detection. Donahue et al. [11] proposed a new architecture, long-term recurrent convolutional networks (LRCN), which combines a CNN (CaffeNet, VGGNet) and an LSTM as frame-level spatial and temporal feature extractors for activity recognition, image captioning, and video description on the UCF-101 dataset [28]. Yeung et al. [38] introduced MultiTHUMOS, an extended version of the THUMOS dataset with multi-label action annotation, and MultiLSTM, a recurrent model. They used VGG16 pre-trained on ImageNet and fine-tuned on MultiTHUMOS for frame-level spatial appearance, and MultiLSTM for modeling temporal features. Tran et al. [31] proposed the C3D CNN model by replacing 2D convolutional kernels with spatio-temporal (3D) convolutional kernels for action recognition. Another approach by Jing et al. [16] used 3D CNNs to extract appearance, short-term, and long-term temporal information from stacked RGB, optical flow, and dynamic images. Finally, the softmax scores of the three models were weighted to predict the action. Wu et al. [36] proposed a two-stream approach that extracts spatial and motion features from single RGB and stacked optical flow images using VGG19 and CNNM, respectively. They used two layers of LSTM for temporal modeling in both the spatial and motion streams. Finally, they fused the outputs of the two LSTM streams for action prediction.
Motivated by the above observations, we developed “MobiSamadhaan”, a real-time automated vision-based monitoring and mobility solution, to deal with mobility challenges. To summarize, the main contributions of this paper are as follows:
• Design of a vision-based model “MobiSamadhaan” using AI techniques to gather and analyze anonymized data of people’s movement in a city.
• The demand patterns for a specific mode of transport are estimated to minimize the supply-demand gap.
• The mass movement across long distances is analyzed for smart travel decisions.
• A traffic jam detection algorithm is proposed.
• By analysis of video footage, the estimation of carbon footprint trails in transportation is presented.
• A real-time parking occupancy status estimation method is proposed.
• A single stream-based activity recognition model to detect human activity by correlating spatio-temporal features is designed.
2 Proposed Model
We propose a novel vision-based monitoring and mobility solution for India’s smart city mission. The intuition behind the work is to keep daily city travel smooth and secure. The model consists of (1) Object detection module, (2) Object tracking module, (3) Transport demand supply gap detection and forecasting module, (4) Mass movement detection and forecasting module, (5) Traffic jam detection module, (6) Carbon emission estimation module, (7) Advanced car parking module, and (8) Human activity detection module. Figure 1 shows an overview of the proposed framework.
2.1 Object Detection Module
This module collects real-time video data from a roadside camera to detect vehicles and human objects. A YOLOv3 [24] model pre-trained on the COCO dataset [21] is utilized for object detection. It is well known for the fast and accurate detection of objects from 80 different categories [24]. YOLOv3 predicts 3D tensors at three scales from real-time test video frames to detect objects of different sizes, downsampling the input frame by strides of 32, 16, and 8, respectively, as shown in Fig. 2. The shape of each 3D tensor is N × N × [Number of boxes × (Box coordinates + Objectness score + Number of classes)], where N × N is the feature map size after downsampling the frame [24]. After each detection layer, the feature map is upsampled by a factor of 2 to obtain feature maps of identical size. Each 3D tensor contains an objectness score based on logistic regression, class confidences, and bounding box coordinates. For each bounding box, the YOLOv3 network predicts 4 coordinates $(t_x, t_y, t_w, t_h)$. If the cell is offset by $(c_x, c_y)$ from the top-left corner of the image and $(p_w, p_h)$ are the prior width and height of the bounding box, then the prediction is given by Eq. 1 [24].

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h} \tag{1}$$
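Eq. 1 and the tensor shape described above can be illustrated with a short sketch. It assumes three boxes per grid cell at each scale (the YOLOv3 default, not stated explicitly in this section) and a 416 × 416 input; the function names and the example values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(t, cell, prior):
    """Apply Eq. 1: map raw outputs (tx, ty, tw, th) to a box in grid-cell units."""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx      # box center x, offset from the cell corner
    by = sigmoid(ty) + cy      # box center y
    bw = pw * math.exp(tw)     # prior width scaled by exp(tw)
    bh = ph * math.exp(th)     # prior height scaled by exp(th)
    return bx, by, bw, bh

def output_shape(input_size, stride, boxes_per_cell=3, num_classes=80):
    """N x N x [boxes * (4 box coordinates + 1 objectness score + classes)]."""
    n = input_size // stride
    return (n, n, boxes_per_cell * (4 + 1 + num_classes))

print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 4), prior=(3.5, 2.0)))
for stride in (32, 16, 8):
    print(stride, output_shape(416, stride))
```

With zero raw outputs the decoded box sits at the center of its cell and keeps the prior size, which is a convenient sanity check for an implementation of Eq. 1.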
The objectness score should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior [24]. If a bounding box prior is not the best one but overlaps a ground truth object by more than a threshold, the prediction is disregarded. Each box predicts the classes that the bounding box may contain using multi-label classification. Independent logistic classifiers are used instead of softmax for better performance. For class prediction, the cross-entropy loss function is used during training. The YOLOv3 network is a hybrid of the network used in YOLOv2, Darknet-19, and residual networks [24]. The network consists of 53 convolutional layers with no fully connected layer. It uses successive 3 × 3 and 1 × 1 convolutional layers [24] with some shortcut connections and can process images of arbitrary size. Effect of
Fig. 1 Overview of the proposed framework
Fig. 2 YOLOv3 basic flow diagram
shadows and night-time vision problems on accuracy is eliminated using the robust YOLO framework.
2.2 Object Tracking Module
The object tracking module is based on the Simple Online and Realtime Tracking (SORT) algorithm [6] for online and real-time multiple object tracking (MOT) in a video sequence. A unique identification number is assigned to each object detected by YOLOv3 in a given frame. If the object leaves the scene in subsequent frames, its ID is dropped. The SORT tracker approximates the inter-frame displacement of every object with a linear constant velocity model that is independent of the movement of the camera and other objects. The state of each target is modeled as $X = [u, v, s, r, \dot{u}, \dot{v}, \dot{s}]^T$, where $(u, v)$ are the horizontal and vertical pixel locations of the center of the target, $s$ represents the area of the target bounding box, and $r$ represents the aspect ratio of the bounding box [6]. When a detection is associated with a target, the target state is updated using the detected bounding box, and the velocity components are solved optimally with Kalman filtering. Otherwise, the state is predicted by the linear velocity model without correction. The Hungarian method is utilized for frame-by-frame data association with an association metric that measures the overlap of the bounding boxes [6]. Furthermore, the performance of the SORT tracker is outstanding in terms of precision and accuracy, and the tracker updates at 260 Hz, which is more than 20 times faster than other modern trackers.
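The constant velocity model used when no detection is associated with a target can be sketched as below. This is only the prediction step under stated assumptions: the aspect ratio is taken as r = width / height, the Kalman covariance update and the Hungarian association are omitted, and the function names are hypothetical.

```python
import math

def predict_state(state):
    """One constant-velocity step: center and area advance by their rates, r stays fixed."""
    u, v, s, r, du, dv, ds = state
    return [u + du, v + dv, s + ds, r, du, dv, ds]

def state_to_bbox(state):
    """Convert (u, v, s, r) to corner form (x1, y1, x2, y2)."""
    u, v, s, r = state[:4]
    w = math.sqrt(s * r)   # s = w * h and r = w / h, so w = sqrt(s * r)
    h = s / w
    return (u - w / 2, v - h / 2, u + w / 2, v + h / 2)

state = [100.0, 50.0, 800.0, 2.0, 3.0, -1.0, 0.0]
predicted = predict_state(state)
print(predicted)
print(state_to_bbox(predicted))
```

In this example the target moves 3 pixels right and 1 pixel up per frame while keeping a 40 × 20 box.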
2.3 Transport Demand Supply Gap Detection and Forecasting Module
To calculate the density of passengers standing at a bus stop at a given time, the model uses a region of interest (ROI) approach. If a person enters the virtual region and waits for a threshold time, the passenger counter is incremented. We sample the density of passengers at a threshold interval for real-time demand estimation and transport demand forecasting. Linear regression analysis is adopted to predict
the number of buses required to fulfill the demand. It finds the line that best fits the passenger density. The best-fitting line is the one with the lowest possible total prediction error, where the error of a point is its distance from the regression line. The model establishes a relationship between time and density as in Eq. 2.

$$y = m x + c, \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}, \quad m = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \tag{2}$$

Here, $x$ is the predictor (independent) variable, $y$ is the dependent variable, $m$ is the slope of the line, $c$ is the intercept of the line (for the least-squares fit, $c = \bar{y} - m\bar{x}$), and $\bar{x}$, $\bar{y}$ are the sample means of the $x$ and $y$ values. The model is optimized by minimizing the sum of squared errors, $Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ and $\hat{y}_i$ are the actual and predicted outputs.
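The fit in Eq. 2 can be written out directly. The sketch below is a minimal illustration: the intercept uses the standard least-squares expression c = ȳ − m x̄, and the hourly density samples and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept following Eq. 2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    m = numerator / denominator
    c = y_bar - m * x_bar
    return m, c

def sum_squared_error(xs, ys, m, c):
    """Q: sum of squared differences between actual and predicted outputs."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

hours = [1, 2, 3, 4]      # hypothetical sampling times
density = [3, 5, 7, 9]    # hypothetical passenger densities
m, c = fit_line(hours, density)
print(m, c, m * 5 + c, sum_squared_error(hours, density, m, c))
```

Here the samples lie exactly on a line, so the fit recovers slope 2 and intercept 1 with zero error and forecasts a density of 11 for the fifth hour.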
2.4 Mass Movement Detection and Forecasting Module
Accurate detection of the mass movement of vehicles helps in making smart travel decisions, alleviating traffic congestion, reducing carbon emissions, improving traffic operation efficiency, and providing route guidance. In the proposed model, we first calculate the density of vehicles in a given region at a particular time. Using the line of interest (LOI) approach, a virtual line is placed over the road in both directions (Up and Down) on each video frame to obtain the density of vehicles. If a detected vehicle’s centroid intersects the virtual line (Up or Down) and the vehicle identification number is not present in the temporal intersection list, then the counter for that vehicle category is incremented. The centroid $(\bar{x}, \bar{y})$ of an object is calculated by Eq. 3.

$$\bar{x} = \frac{x_1 + (x_1 + \mathrm{BBox}_{\mathrm{Width}})}{2}, \quad \bar{y} = \frac{y_1 + (y_1 + \mathrm{BBox}_{\mathrm{Height}})}{2} \tag{3}$$
where $(x_1, y_1)$ is the top-left coordinate of the bounding box (BBox). The calculated vehicle density is stored at a threshold interval to predict the mass movement at that location. For mass movement prediction, linear regression (Eq. 2) is applied with respect to time and vehicle density.
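The LOI counting rule can be sketched as follows. This is an illustration under assumptions that are not in the text: the virtual line is horizontal at height `line_y`, a centroid "intersects" it when it lies within a small pixel tolerance, and each detection arrives as a tuple with a tracker-assigned ID. All names are hypothetical.

```python
def centroid(x1, y1, width, height):
    """Eq. 3: center of a bounding box given its top-left corner and size."""
    return ((x1 + (x1 + width)) / 2, (y1 + (y1 + height)) / 2)

def count_crossings(detections, line_y, tolerance=5):
    """Count each tracked object once, per category, when its centroid meets the line.

    detections: iterable of (object_id, category, x1, y1, width, height).
    """
    counted_ids = set()   # plays the role of the temporal intersection list
    counts = {}
    for obj_id, category, x1, y1, w, h in detections:
        _, cy = centroid(x1, y1, w, h)
        if abs(cy - line_y) <= tolerance and obj_id not in counted_ids:
            counted_ids.add(obj_id)
            counts[category] = counts.get(category, 0) + 1
    return counts

detections = [
    (1, "car", 10, 80, 40, 30),    # centroid y = 95, on the line
    (1, "car", 10, 84, 40, 30),    # same ID again, not recounted
    (2, "bus", 100, 20, 60, 40),   # centroid y = 40, far from the line
    (2, "bus", 100, 78, 60, 40),   # centroid y = 98, on the line
    (3, "car", 200, 200, 40, 30),  # never reaches the line
]
print(count_crossings(detections, line_y=100))
```

Keeping the set of already counted IDs is what prevents a slow vehicle that stays near the line for several frames from being counted more than once.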
2.5 Traffic Jam Detection Module
Here, we propose a simple approach to estimate the traffic congestion time and type (high, moderate, or low) based on real-time road traffic video samples from various locations. The suggested solution first accurately predicts traffic congestion and then sends notifications to the traffic control room to reduce the adverse effect of slow traffic flows. The proposed approach is summarized in Algorithm 1.
Algorithm 1 Calculate traffic congestion
Input: Real-time video footage
Output: Traffic congestion time
for each frame ∈ Video do
    if Object detections in frame ≠ None then
        for each Object ∈ frame do
            if (Object category = “vehicle”) then
                if (Object centroid (x̄, ȳ) ∩ Virtual line) and (ObjectID ∉ old_obj_id_list) then
                    old_obj_id_list ← ObjectID
                end if
            end if
        end for
    end if
    if (Vehicle_count = Length(old_obj_id_list)) and (Object detections in frame ≠ None) and (Length(old_obj_id_list) ≠ 0) then
        frame_count ← frame_count + 1
    else
        frame_count ← 0 {New object ID appended to old_obj_id_list}
        Alert(“Road Clear”)
    end if
    if (frame_count > Threshold_frame_no) then
        congestion_time ← (frame_count − Threshold_frame_no) / Threshold_frame_no
        if (Low_threshold_time ≤ congestion_time ≤ Moderate_threshold_time) then
            Alert(“Low traffic congestion”)
        else if (Moderate_threshold_time < congestion_time ≤ High_threshold_time) then
            Alert(“Moderate traffic congestion”)
        else if (congestion_time > High_threshold_time) then
            Alert(“High traffic congestion”)
        end if
    end if
    Vehicle_count ← Length(old_obj_id_list)
end for
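Algorithm 1 can be sketched in Python as below. The sketch is an interpretation rather than the authors' code: the comparison operators follow one plausible reading of the printed algorithm, each frame is assumed to be a list of `(object_id, category, on_line)` tuples produced by the detection and tracking modules, and the default values (360 frames; 1, 2 and 5) are the ones reported in the experiments. At the reported 6 FPS, 360 frames correspond to about one minute, so the congestion time is expressed in multiples of that window.

```python
def congestion_alerts(frames, threshold_frames=360, low=1.0, moderate=2.0, high=5.0):
    """Return one alert (or None) per frame, following the steps of Algorithm 1."""
    seen_ids = []        # old_obj_id_list
    vehicle_count = 0
    frame_count = 0
    alerts = []
    for detections in frames:
        for obj_id, category, on_line in detections:
            if category == "vehicle" and on_line and obj_id not in seen_ids:
                seen_ids.append(obj_id)
        no_new_vehicle = vehicle_count == len(seen_ids)
        if no_new_vehicle and len(detections) > 0 and len(seen_ids) > 0:
            frame_count += 1     # vehicles are present but none crossed the line
            alert = None
        else:
            frame_count = 0      # a new ID crossed the line, or the frame is empty
            alert = "Road Clear"
        if frame_count > threshold_frames:
            congestion_time = (frame_count - threshold_frames) / threshold_frames
            if low <= congestion_time <= moderate:
                alert = "Low traffic congestion"
            elif moderate < congestion_time <= high:
                alert = "Moderate traffic congestion"
            elif congestion_time > high:
                alert = "High traffic congestion"
        alerts.append(alert)
        vehicle_count = len(seen_ids)
    return alerts

# Hypothetical stream: one vehicle crosses the line, then stays in view without moving on.
frames = [[(1, "vehicle", True)]] + [[(1, "vehicle", False)] for _ in range(13)]
print(congestion_alerts(frames, threshold_frames=2))
```

A small `threshold_frames` is used in the example only to keep the stream short; the alert level rises from low to moderate to high as the stalled frames accumulate.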
2.6 Carbon Emission Estimation Module
In this section, we introduce a novel vision-based method to calculate carbon emission over a region using the mathematical model proposed by Kakouei et al. [17]. The total number of vehicles of each type passing through an area, the average fuel consumption (L/km), and the amount of driving per day by each vehicle (km) are taken into account in the calculation of fuel (diesel or petrol) consumption over time. Specific gravity (kg/m3), calorific power (kcal/kg), and emission factor (tCO2/TJ) for both diesel and petrol are considered for the calculation of ambient CO2 emissions [17]. Fuel consumption (L) and carbon emission (tons) of vehicles are obtained by Eq. 4 [17].
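$$\begin{aligned} \text{Fuel}_{\text{consumption}} &= \text{Avg. fuel consumption} \times \text{No. of vehicles of each type} \times \text{Amount of driving per day by each vehicle} \\ \text{Emission (CO}_2\text{)} &= \text{Fuel}_{\text{consumption}} \times \text{Fuel specific gravity} \times \text{Fuel calorific power} \times \text{Fuel emission factor} \end{aligned} \tag{4}$$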
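Eq. 4 multiplies quantities given in L, kg/m3, kcal/kg and tCO2/TJ, so a numerical implementation needs unit conversions that the equation leaves implicit. The sketch below makes them explicit as assumptions (1 m3 = 1000 L and 1 kcal = 4186.8 J); the function names are hypothetical, and the example uses the car values reported in Table 1.

```python
LITRES_PER_M3 = 1000.0
TJ_PER_KCAL = 4.1868e-9   # assumed conversion: 1 kcal = 4186.8 J = 4.1868e-9 TJ

def fuel_consumption_litres(avg_l_per_km, n_vehicles, km_per_day):
    """First line of Eq. 4: fuel used per day by one vehicle type."""
    return avg_l_per_km * n_vehicles * km_per_day

def co2_tonnes(fuel_litres, specific_gravity_kg_m3, calorific_kcal_kg, emission_tco2_tj):
    """Second line of Eq. 4 with the unit conversions written out."""
    mass_kg = (fuel_litres / LITRES_PER_M3) * specific_gravity_kg_m3
    energy_tj = mass_kg * calorific_kcal_kg * TJ_PER_KCAL
    return energy_tj * emission_tco2_tj

# One car: 0.1 L/km, 40 km per day, petrol properties 737 kg/m3, 11,464 kcal/kg, 69.3 tCO2/TJ
fuel = fuel_consumption_litres(0.1, 1, 40)
print(fuel, co2_tonnes(fuel, 737, 11464, 69.3))
```

Under these assumptions a single car accounts for roughly 0.0098 t (about 9.8 kg) of CO2 per day; the total for a region is the sum over the counted vehicles of each type.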
2.7 Advanced Car Parking Module
The proposed car parking module identifies the parking spaces and predicts their occupancy status from real-time video samples. Hand-crafted virtual bounding boxes are drawn over the parking lot on each frame to localize the parking spaces. An intersection over union (IoU) approach is adopted to predict the occupancy status. A parking space is treated as empty if the IoU score of the parking space bounding box and the car bounding box is higher than a threshold value (T). IoU scores and empty spaces are calculated as in Eq. 5.

$$\mathrm{IoU} = \frac{\text{Intersection of car and parking space bounding box}}{\text{Union of car and parking space bounding box}}, \qquad \text{Empty} = \begin{cases} \text{True} & \text{if } \mathrm{IoU}_{\text{score}} \geq T \\ \text{False} & \text{if } \mathrm{IoU}_{\text{score}} < T \end{cases} \tag{5}$$
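The IoU score in Eq. 5 can be computed as in the following sketch. It assumes axis-aligned boxes in corner form (x1, y1, x2, y2) and only reports whether the score reaches the threshold; how that boolean is mapped to a space status follows Eq. 5 and is left to the caller. The function names and the default threshold of 0.6 (the value used in the experiments) are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def meets_threshold(space_box, car_box, threshold=0.6):
    """True when the IoU of a parking space box and a car box is at least the threshold."""
    return iou(space_box, car_box) >= threshold

space = (0, 0, 10, 10)
print(iou(space, (5, 0, 15, 10)), meets_threshold(space, (1, 0, 11, 10)))
```

Clamping the overlap width and height at zero makes disjoint boxes return an IoU of 0 instead of a spurious positive value.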
2.8 Human Activity Detection Module
This section proposes a model to detect various activities of persons by correlating spatio-temporal features in a video. The proposed model combines a deep spatial feature extractor (CNN) with a model that can recognize and analyze temporal dynamics (LSTM) to address sequence learning problems. In this work, the MobileNetV2 CNN model proposed by Sandler et al. [25], pre-trained on ImageNet, is utilized as a frame-level feature extractor. The outputs of the CNN, i.e., the feature descriptions of the individual frames, are stitched together to form a sequence of 30 extracted feature vectors, which is passed to an LSTM. Instead of extracting features from each frame and then passing them to the LSTM, as in Donahue et al. [11], this approach stacks the features into sequences, and it achieved higher accuracy and reduced time complexity. Given a sequence of spatial features $(x_1, x_2, x_3, \ldots, x_T)$ as the input of an LSTM cell at times $t = 1$ to $T$, the LSTM calculates the cell activations that map the input to the output sequence $(y_1, y_2, y_3, \ldots, y_T)$ using Eq. 6.
$$\begin{aligned} i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\ f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\ o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\ g_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \tag{6}$$

where $\sigma$ is the sigmoid function, $i$ the input gate, $f$ the forget gate, $o$ the output gate, $g$ the input modulation gate, $c$ the internal memory, $h$ the hidden state, $W$ a weight matrix, $b$ a bias, and $\odot$ element-wise multiplication. Intuitively, the LSTM can read and write information to its internal memory so that information can be maintained and processed over time. The hidden state $h_t$ of an LSTM cell is used to model the frame-level activity at time $t$ based on past behaviors. The output from the last layer of the LSTM is given to a fully connected layer with a softmax classifier followed by a dense layer with the ReLU activation function for activity prediction. A dropout layer is added after the dense layer, and the LSTM dropout mask is applied to all sample inputs to prevent overfitting.
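A single step of Eq. 6 can be written in plain Python to make the gate computations concrete. This is a minimal sketch for illustration only: the weights are stored as nested lists, the parameter layout (`params[gate] = (W_x, W_h, b)`) and the function names are hypothetical, and no training or dropout is included.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(W_x, W_h, b, x, h_prev, activation):
    """activation(W_x x + W_h h_prev + b), applied element-wise."""
    wx, wh = matvec(W_x, x), matvec(W_h, h_prev)
    return [activation(a + r + bias) for a, r, bias in zip(wx, wh, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step following Eq. 6; params maps 'i', 'f', 'o', 'g' to (W_x, W_h, b)."""
    i = gate(*params["i"], x, h_prev, sigmoid)
    f = gate(*params["f"], x, h_prev, sigmoid)
    o = gate(*params["o"], x, h_prev, sigmoid)
    g = gate(*params["g"], x, h_prev, math.tanh)
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical cell with 3 inputs and 2 hidden units, all weights and biases zero.
zeros = ([[0.0] * 3 for _ in range(2)], [[0.0] * 2 for _ in range(2)], [0.0, 0.0])
params = {name: zeros for name in ("i", "f", "o", "g")}
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
print(h, c)
```

With zero weights every sigmoid gate equals 0.5 and the modulation gate is 0, so the step simply halves the previous memory, which gives an easy hand check of the update equations.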
3 Experiment and Results
The proposed model is implemented using Python 3.6, OpenCV, and the PyTorch deep learning API. All experiments were carried out on an Intel Core i5-2400 at 3.10 GHz with 12 GB of memory, an Nvidia GTX 1050 Ti GPU (4 GB), CUDA Toolkit (version 9), and cuDNN (version 7.05). In these experiments, video footage was collected from 5 different congested locations in Pune, India (refer to Fig. 3a), during different rush hours. The traffic video samples were captured by a Redmi Note 7 Pro (Neptune Blue, 64 GB, 4 GB RAM; camera features: 48 MP, f1.79, 1.6 µm (4-in-1); 5 MP, f2.2, 1.12 µm; Primary 6P Lens, Secondary 3P Lens, PDAF, AI Dual Camera) mounted on a tripod for optimal performance. Frames are extracted from each video using OpenCV and resized to 416 × 416. Initially, a YOLOv3 PyTorch implementation pre-trained on the COCO dataset is utilized for object detection. In this experiment, a rate of 6 FPS is achieved when processing the sample videos. The object detector module detects objects (humans and vehicles) in a given frame and forwards the list of bounding boxes around the objects to the object tracking module (based on the SORT algorithm), which tracks them in consecutive frames by assigning unique identification numbers. A virtual line is drawn across the lane in each frame based on the camera position and road divider. The counter for a vehicle category is incremented if the centroid of a vehicle bounding box of that category intersects the virtual line. To detect and forecast the mass movement at a location at a given time, as described in Sect. 2.4, the calculated density of vehicles is recorded at 1 h intervals.
(a) Five locations in Pune, India from where video samples were collected. Image courtesy of Google Maps.
(b) Status of traffic condition and carbon emission of a location
(c) Parking occupancy status
(d) Real-time carbon emission estimation depending on categorized vehicle
(e) Road with Traffic congestion
Fig. 3 Visualization of the results using the vision-based monitoring and mobility solution
(f ) Daily carbon footprint pattern
(g1) Bus stand
(g2) Density of passengers
(g3) Linear regression with respect to passenger density
(g) Transport demand supply gap estimation, forecasting and density of passenger over time
(h1) Linear regression with respect to time and vehicle density (Up and Down)
(h2) Result of predicted mass movement
(h) Mass movement detection and forecasting
Fig. 3 (continued)
Table 1 Vehicle’s average fuel consumption and running per day with their chemical property

| Type of vehicle | A_D (km) | AFC (L/km) | FSg (kg/m3) | FCp (kcal/kg) | FEf (tCO2/TJ) |
|---|---|---|---|---|---|
| Bus | 210 | 0.25 | 885 | 10,700 | 74.1 |
| Car | 40 | 0.1 | 737 | 11,464 | 69.3 |
| Motorbike | 33 | 0.04 | 737 | 11,464 | 69.3 |
| Truck | 425 | 0.25 | 885 | 10,700 | 74.1 |

Where A_D = Amount of driving per day by each vehicle, AFC = Avg. fuel consumption, FSg = Fuel specific gravity, FCp = Fuel calorific power, and FEf = Fuel emission factor.
Figure 3h illustrates the results of the mass movement module. A hand-crafted virtual box is placed over the region of the bus stand on each video frame, and the passenger counter is incremented if a passenger waits there for 2 min. We stored the calculated density of passengers at 1 h intervals for detecting and forecasting the transport demand, as described in Sect. 2.3. The number of buses required to fulfill the demand is computed by considering the standard seating capacity (40) of Indian public transport buses [23]; the results are shown in Fig. 3g. If the proposed algorithm described in Sect. 2.5 observes no vehicle movement in 360 consecutive frames, the situation is considered a traffic jam. Traffic congestion types (Low, Moderate, and High) are determined based on three different thresholds: 1–2, 2–5, and more than 5 min, respectively. Figure 3b, e presents the outputs of the proposed traffic jam detection module. The mathematical equation given in Sect. 2.6 is applied to estimate carbon emission depending on the type of vehicle. The total carbon emission of a region is computed by considering the values given in Table 1 [3, 17, 30]. Figure 3b, d, f shows the results of CO2 emission at a location. The parking space occupancy status is calculated by adopting the technique discussed in Sect. 2.7. A parking space is considered vacant if the IoU score is greater than 0.6, and the occupancy status of the parking lot is updated at 5 min intervals (refer to Fig. 3c). For continuous traffic analysis, we stored the parking occupancy status, traffic congestion time, carbon emission, and transport demand details of a location on the server. A common user can access information on nearby parking places and traffic stoppage time by browsing the Web link. The proposed model works well for real-time CCTV footage in low-light conditions. It also pops up an alert message and generates an alarm when an abnormal condition occurs (high carbon emission or traffic congestion).
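The bus requirement mentioned above can be illustrated with a one-line calculation. The sketch assumes that the required number of buses is the passenger density divided by the seating capacity of 40 and rounded up; the rounding rule and the function name are assumptions, since the text only states that the seating capacity is taken into account.

```python
import math

def buses_required(passenger_density, seating_capacity=40):
    """Smallest number of buses whose seats cover the given number of passengers."""
    return math.ceil(passenger_density / seating_capacity)

for passengers in (0, 1, 80, 95):
    print(passengers, buses_required(passengers))
```

The same function can be applied to a forecast density from the regression model to estimate the demand for a future interval.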
The UCF-101 [28] dataset, which has 13,320 videos in 101 action categories, is used for the activity detection described in Sect. 2.8. The dataset is divided into two parts: 70% for training and 30% for testing. Frames are extracted and resized to (224 × 224 × 3) using OpenCV. The Keras implementation of MobileNetV2 [25], pre-trained on ImageNet, is used to extract stacked spatial features from the RGB video frames; any other CNN could be chosen for the same purpose. MobileNetV2’s last prediction layer is removed because we want frame-level features, not predictions. The extracted feature matrices of dimension (30, 1280) are passed to the LSTM model, which predicts the action by analyzing the temporal dynamics. We trained the model with a learning rate of 1e−4, a decay of 1e−5, a batch size of 32, and 1000 epochs with early stopping set to 100. For action prediction, the cross-entropy loss function is used, and the model is optimized with the Adam optimizer during training. LSTM models with 2048, 1280, 512, and 256 hidden units were compared; 1280 hidden units gave a 1.09% increase in performance over the others on our validation set. The output of the LSTM’s last layer is given to a fully connected layer with a softmax classifier followed by a dense layer with the ReLU activation function for activity prediction. To prevent overfitting, the dropout is set to 0.5. The model performance is compared with previously published state-of-the-art findings on the UCF-101 dataset in Table 2. Figure 3 shows the experimental results.

Table 2 Comparison of the model performance with previously published state-of-the-art findings on UCF-101 dataset

Method                 Accuracy (%)
Donahue et al. [11]    82.9
Zha et al. [39]        89.6
Tran et al. [31]       86.7
Jing et al. [16]       88.6
Wang et al. [35]       88.2
Ours                   91.2
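The training schedule of 1000 epochs with an early-stopping value of 100 can be read as patience-based early stopping: training halts once the validation loss has failed to improve for 100 consecutive epochs. That reading, and the helper below, are assumptions for illustration; the paper does not state which quantity was monitored.

```python
def epochs_run(val_losses, max_epochs=1000, patience=100):
    """Return the number of epochs completed before training stops.

    val_losses is a hypothetical list of per-epoch validation losses.
    Training stops after `patience` consecutive epochs without a new best
    loss, or after `max_epochs`, whichever comes first.
    """
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch
    return min(len(val_losses), max_epochs)
```

With this rule, a run whose best validation loss occurs at epoch 3 and never improves afterwards would stop at epoch 103 rather than continuing to epoch 1000.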
4 Conclusion
In this paper, we presented a real-time vision-based mobility solution for smart city management. The proposed method can successfully estimate traffic congestion time and carbon footprint trails. The proposed framework is capable of discovering parking occupancy status and demand patterns based on the density of passengers, and of predicting mass movement through the discovery of traffic flow knowledge. The human activity detection module is capable of recognizing human activities by analyzing video footage. The proposed model was evaluated on real video samples collected from different locations in Pune, India, and also produced promising results in low-light conditions. Traffic density prediction, as well as demand prediction, may be affected by factors such as vehicle types, road types, traffic lights, driver behavior, weather conditions [12], events, and festive seasons. Roadway shapes may also affect the predictions. Multi-camera data fusion, calibration, and optimal placement of cameras are also significant challenges. In our future studies, the proposed framework will be improved to detect various suspicious human activities, road signs, road accidents, and accident- or congestion-prone regions.

Acknowledgements The authors would like to acknowledge the valuable feedback and suggestions of Tata Motors and the Smart India Hackathon (SIH) 2019 committee. The authors would also like to thank NVIDIA for the GPU grant for carrying out this research work.
References
1. G. Amato, F. Carrara, F. Falchi, C. Gennaro, C. Meghini, C. Vairo, Deep learning for decentralized parking lot occupancy detection. Expert Syst. Appl. 72, 327–334 (2017)
2. G. Amato, F. Carrara, F. Falchi, C. Gennaro, C. Vairo, Car parking occupancy detection using smart camera networks and deep learning, in 2016 IEEE Symposium on Computers and Communication (ISCC). IEEE (2016), pp. 1212–1217
3. A. Raheja, Dtc: modern and dependable. http://commercialvehicle.in/dtc-modern-anddependable/. 27 Oct 2016
4. R. Arnott, E. Inci, An integrated model of downtown parking and traffic congestion. J. Urban Econ. 60(3), 418–442 (2006)
5. M. Batra, Urban transport news. https://urbantransportnews.com/parking-and-traffic-issuesirking-major-indian-cities/ (2019)
6. A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtime tracking, in 2016 IEEE International Conference on Image Processing (ICIP). IEEE (2016), pp. 3464–3468
7. California Department of Transportation, Sacramento, CA, USA, Caltrans performance measurement system. http://pems.dot.ca.gov/ (2015)
8. Q. Chen, W. Wang, F. Wu, S. De, R. Wang, B. Zhang, X. Huang, A survey on an emerging area: deep learning for smart city data. IEEE Trans. Emerg. Top. Comput. Intell. (2019)
9. N. Davis, G. Raina, K. Jagannathan, A multi-level clustering approach for forecasting taxi travel demand, in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). IEEE (2016), pp. 223–228
10. P.R. De Almeida, L.S. Oliveira, A.S. Britto Jr., E.J. Silva Jr., A.L. Koerich, Pklot-a robust dataset for parking lot classification. Expert Syst. Appl. 42(11), 4937–4949 (2015)
11. J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 2625–2634
12. Federal Highway Administration, Washington, DC, USA, How do weather events impact roads? http://www.ops.fhwa.dot.gov/weather/q1_roadimpact.htm. Accessed 23 Sep 2019
13. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 580–587
14. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
15. W. Huang, G. Song, H. Hong, K. Xie, Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Trans. Intell. Transport. Syst. 15(5), 2191–2201 (2014)
16. L. Jing, Y. Ye, X. Yang, Y. Tian, 3d convolutional neural network with multi-model framework for action recognition, in 2017 IEEE International Conference on Image Processing (ICIP). IEEE (2017), pp. 1837–1841
17. A. Kakouei, A. Vatani, A.K.B. Idris, An estimation of traffic related CO2 emissions from motor vehicles in the capital city of Iran. Iran. J. Environ. Health Sci. Eng. 9(1), 13 (2012)
18. A. Koesdwiady, R. Soua, F. Karray, Improving traffic flow prediction with weather information in connected cars: a deep learning approach. IEEE Trans. Veh. Technol. 65(12), 9508–9517 (2016)
19. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012), pp. 1097–1105
20. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
21. T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft Coco: Common Objects in Context, in European Conference on Computer Vision (Springer, Berlin, 2014), pp. 740–755
22. B.T. Morris, C. Tran, G. Scora, M.M. Trivedi, M.J. Barth, Real-time video-based traffic measurement and visualization system for energy/emissions. IEEE Trans. Intell. Transport. Syst. 13(4), 1667–1678 (2012)
23. T. Motors, How Tata Motors Buses Have Helped in Smart City Solutions. https://www.buses.tatamotors.com/blog/how-tata-motors-buses-have-helped-in-smart-city-solutions/, 15 Nov 2018
24. J. Redmon, A. Farhadi, Yolov3: An Incremental Improvement. arXiv (2018)
25. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.C. Chen, Mobilenetv2: Inverted Residuals and Linear Bottlenecks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 4510–4520
26. K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in Advances in Neural Information Processing Systems (2014), pp. 568–576
27. K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556 (2014)
28. K. Soomro, A.R. Zamir, M. Shah, Ucf101: A Dataset of 101 Human Actions Classes from Videos in the Wild. arXiv preprint arXiv:1212.0402 (2012)
29. Taxi and Commission, Nyc Taxi and Limousine Commission (tlc) trip record data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
30. The Times of India, Avg distance covered by trucks up by 100-150 km/day post gst. https://timesofindia.indiatimes.com/business/india-business/avg-distance-coveredby-trucks-up-by-100-150-km/day-post-gst/articleshow/62316504.cms (Dec 31, 2017, 17:27 IST)
31. D. Tran, L.D. Bourdev, R. Fergus, L. Torresani, M. Paluri, C3d: generic features for video analysis. CoRR, abs/1412.0767 2(7), 8 (2014)
32. University of Utah, Salt Lake City, UT, USA. Mesowest status update log. http://mesowest.utah.edu/ (2015)
33. S. Valipour, M. Siam, E., in IEEE 3rd World Forum on Internet of Things (WF-IoT). IEEE (2016), pp. 655–660
34. L. Wang, D. Sng, Deep Learning Algorithms with Applications to Video Analytics for a Smart City: A Survey. arXiv preprint arXiv:1512.03131 (2015)
35. X. Wang, C. Qi, F. Lin, Combined trajectories for action recognition based on saliency detection and motion boundary. Signal Process. Image Commun. 57, 91–102 (2017)
36. Z. Wu, X. Wang, Y.G. Jiang, H. Ye, X. Xue, Modeling spatial-temporal clues in a hybrid deep learning framework for video classification, in Proceedings of the 23rd ACM international conference on Multimedia. ACM (2015), pp. 461–470
37. J. Xu, R. Rahmatizadeh, L. Bölöni, D. Turgut, Real-time prediction of taxi demand using recurrent neural networks. IEEE Trans. Intell. Transport. Syst. 19(8), 2572–2581 (2017)
38. S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, L. Fei-Fei, Every moment counts: dense detailed labeling of actions in complex videos. Int. J. Comput. Vis. 126(2–4), 375–389 (2018)
39. S. Zha, F. Luisier, W. Andrews, N. Srivastava, R. Salakhutdinov, Exploiting image-trained CNN Architectures for Unconstrained Video Classification. arXiv preprint arXiv:1503.04144 (2015)
40. W. Zhao, S. Du, Spectral-spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 54(8), 4544–4554 (2016)
Students’ Performance Prediction Using Feature Selection and Supervised Machine Learning Algorithms
Juhi Gajwani and Pinaki Chakraborty
Abstract Educational data mining involves finding patterns in educational data, which can be obtained from various e-learning systems or gathered using traditional surveys. In this paper, our focus is to predict the academic performance of a student based on certain attributes of an educational dataset. The attributes can be demographic, behavioural or academic. We propose a method to classify a student’s performance based on a subset of behavioural and academic parameters using feature selection and supervised machine learning algorithms such as logistic regression, decision tree and naïve Bayes classifier, and ensemble machine learning algorithms such as boosting, bagging, voting and random forest classifier. To select the attributes, we plotted various graphs and determined the attributes that were most likely to affect and improve prediction. Experiments with different algorithms show that ensemble machine learning algorithms provide the best results on our dataset, with an accuracy of up to 75%. Such prediction has widespread applications, such as assisting students in improving their academic performance, customizing e-learning courses to better suit students’ needs and providing tailor-made solutions for different groups of students.
Keywords Educational data mining · Students’ performance prediction · Ensemble machine learning algorithms · Classification · Feature selection
J. Gajwani (B) · P. Chakraborty
Division of Computer Engineering, Netaji Subhas University of Technology, New Delhi, India
e-mail: [email protected]
P. Chakraborty
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_25

1 Introduction
Prediction of students’ academic performance is an important application of educational data mining. Educational data mining involves finding patterns in educational data, which can be obtained from various e-learning systems or gathered using traditional surveys. In this paper, our focus is to classify the academic performance of a student into different categories based on certain attributes of an educational dataset. The attributes can be demographic, behavioural or academic. Demographic attributes include “nationality”, “gender”, “place of birth” and “parent responsible for student”. Behavioural attributes include “participation of student in discussion groups”, “raising hands in class”, “visiting resources” and “viewing announcements”. Academic attributes include “grade”, “semester” and “courses attended”. We identify a subset of these attributes and use various supervised learning techniques to predict student performance in terms of a percentage category. One application of students’ academic performance prediction is to assist students in improving their grades. Such systems can be used to make students aware of their performance prior to the final evaluation. Another application is aiding the university management in formulating educational policies. Prediction can also be used to improve e-learning systems and to tailor these systems to the needs of different groups of students. Our approach was to identify a subset of attributes that were most likely to affect the students’ grades directly, thus making the prediction more accurate. For this, we plotted graphs of these attributes against the grades obtained by the students and analysed the relationship between them. Thereafter, we used different supervised learning algorithms to classify the students into one of three grade categories, and we tried to identify the algorithm with the highest prediction accuracy.
2 Related Work
Researchers have proposed various methods to predict student performance. Some of these studies focus on feature selection, while others propose ways to predict student performance and compare how accurately different machine learning techniques predict it. Punlumjeak and Rachburee [1] presented a comparative study of four algorithms for selecting a subset of features: support vector machine, information gain, genetic algorithms, and minimum redundancy and maximum relevance. They used four supervised classifiers, viz. decision tree, naïve Bayes, k-nearest neighbour and neural network. Ajibade et al. [2] also provided a comparative analysis of various feature selection algorithms to improve classifier prediction accuracy. These algorithms included correlation feature selection, the ReliefF method, Kullback–Leibler divergence, sequential backward selection, sequential forward selection and differential evolution. On the other hand, Amrieh et al. [3] predicted performance based solely on the behavioural attributes of the dataset. They used information gain to select a subset of attributes for prediction and ensemble machine learning algorithms for data analysis. Mueen et al. [4] performed various experiments to find an algorithm that can classify educational data with maximum accuracy. Three algorithms, viz. naïve Bayes, neural network and decision tree, were used for analysis. The study concluded that naïve Bayes gave the best results, with an accuracy of about 86%. Al-Shabandar et al. [5] proposed a method to predict learning outcomes in Massive Open Online Courses (MOOCs) [6]. Their research showed that the random forest algorithm gave the best results. Some studies have also proposed the use of learning analytics in this field.
3 Methodology
We obtained the dataset from Kaggle (“https://www.kaggle.com/aljarah/xAPIEdu-Data/data”). The dataset contains information about various types of attributes related to students that might affect their educational performance. It comes from a learning management system (LMS) known as Kalboard 360. The data were collected using a learner activity tracker tool called experience API (xAPI). The collected features include nationality, gender, place of birth, grade levels, educational stages, field of study, parent responsible for student, semester, number of times a student raised hand, viewed announcements, visited resources and participated in discussion groups, results of parent answering survey, parent school satisfaction measurement and number of days the student was absent. These features were classified into three categories, viz. academic background features, demographic features and behavioural features. The dataset was cleaned before use, i.e. tuples with missing values were removed. The 500 records contained a total of 20 missing values across different features; removing the affected records left 480 records. We selected a subset of features from the dataset to improve the accuracy of prediction. To select the attributes, we plotted various graphs and determined the attributes that were most likely to affect and improve prediction. Each graph is a plot of the students’ percentage category against the respective attribute. Using these plots, we identified the features that were most relevant to performance prediction. After feature selection, we used different supervised learning algorithms to classify students’ performance into one of three categories, viz. low-level, middle-level and high-level, in terms of percentage. The low-level interval comprises values from 0 to 69, the middle-level interval comprises values from 70 to 89 and the high-level interval comprises values from 90 to 100. We experimented with several supervised machine learning algorithms including decision trees, logistic regression and naïve Bayes classifier. Apart from these basic algorithms, we conducted experiments using ensemble machine learning techniques including bagging, boosting, voting classifier and random forest. Implementation was done using Python’s scikit-learn library. We used 70% of the records for training and 30% for testing.
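The class intervals and the train/test split described above can be made concrete with a short sketch. The function names are hypothetical, and treating fractional percentages below 70 as low-level is an assumption, since the text lists only the integer ranges 0 to 69, 70 to 89 and 90 to 100.

```python
def percentage_category(percentage):
    """Map a percentage to the low / middle / high class used as the target."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if percentage < 70:
        return "low"
    if percentage < 90:
        return "middle"
    return "high"


def split_sizes(n_records, test_fraction=0.30):
    """Return (training records, test records) for a given test fraction."""
    n_test = round(n_records * test_fraction)
    return n_records - n_test, n_test


# The 480 cleaned records divide into 336 training and 144 test records.
train_size, test_size = split_sizes(480)
```

The test-set size of 144 is what makes the misclassification counts and accuracy percentages reported later in the paper line up.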
4 Results
To select the attributes, we plotted various graphs and determined the attributes that were most likely to affect and improve prediction. First, we plotted the students getting high, medium or low grades against the number of times they raised hands in class (Fig. 1). The plot clearly shows that students who raised hands between 0 and 20 times are more likely to get a lower percentage, whereas students who raised hands between 60 and 90 times are likely to get a higher percentage. Next, we plotted the students getting high, medium or low grades against the attribute participated in discussion groups (Fig. 2). We observed that students can be classified on the basis of this attribute when it is used together with some other attributes to avoid ambiguity. Next, we plotted the students getting high, medium or low grades against the attribute visited resources (Fig. 3). This attribute can be a good fit for use in classification. We plotted similar graphs for the number of times the students viewed announcements (Fig. 4) and the number of days the students were absent (Fig. 5).

Fig. 1 Plot of student category against number of times hands raised
Fig. 2 Plot of student category against participation in discussion groups
Fig. 3 Plot of student category against visited resources
Fig. 4 Plot of student category against viewed announcements
Fig. 5 Plot of student category against student absent days

Finally, the attributes used for classification were number of times raised hand, participated in discussion groups, visited resources, viewed announcements, days of absence and parent responsible for the student. We used the following techniques for prediction and made some useful observations.
• Decision Tree. We implemented decision trees using either the Gini index or entropy (information gain) as the splitting criterion. With the Gini index, the total number of misclassified samples was 38. Information gain favours smaller partitions with many distinct values. The total number of misclassified samples using this approach was 45.
• Logistic Regression. Logistic regression is used to classify a set of values into a discrete set of classes. The total number of misclassified samples using this approach was 38.
• Naïve Bayes Classifier. Naïve Bayes classifiers use Bayes’ theorem to classify data items. The total number of misclassified samples using this approach was 39.
• Ensemble Machine Learning Algorithms. Ensemble methods either use several homogeneous weak learners, i.e. basic classification algorithms such as decision trees, and train each of them in separate ways, or they use several heterogeneous weak learners, and finally aggregate the results of all the learners using different techniques. The learners can be trained in parallel, or they can be trained sequentially so that the results of one learner are used to improve the predictions of the others. There are several types of algorithms for aggregating the results of the weak learners, aimed at reducing variance, decreasing bias and improving predictions. These methods are bagging, boosting and stacking. The choice of aggregation algorithm depends on the basic algorithms used for classification. For example, deep decision trees tend to overfit and hence have high variance, so bagging can be used to aggregate different trees trained separately.
• Sequential Ensemble Methods. In these methods, the basic predictors are generated sequentially, i.e. one after another. Successive predictors use the results of the former predictors in order to improve performance.
• Parallel Ensemble Methods. In these methods, the basic predictors are generated in parallel, i.e. simultaneously. The results of several prediction algorithms are combined to improve prediction.
• Bagging. Bagging stands for bootstrap aggregation. This is a type of parallel ensemble method in which several samples of the data are generated and used to train different models. For example, different trees can be trained using different subsets of the data. The total number of misclassified samples using this approach was 38.
• Boosting. Boosting involves creating several predictive models sequentially and using the results of the previous models to improve the predictions of successive models. The total number of misclassified samples using this approach was 36.
• Voting. Voting involves generating a number of models (typically of differing types) and using statistical techniques to make the final prediction. In hard voting, the final prediction is the class chosen by a simple majority vote of the basic predictors. Soft voting arrives at the result by averaging the probabilities calculated by the individual algorithms. The total number of misclassified samples using this approach was 37.
• Random Forest. In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set. In addition, instead of using all the features, a random subset of features is selected, further randomizing the tree. As a result, the bias of the forest increases slightly, but due to the averaging of less correlated trees, its variance decreases, resulting in an overall better model. The total number of misclassified samples using this approach was 37.
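Two ideas from the list above, the impurity measures behind the decision-tree criteria and the difference between hard and soft voting, can be illustrated with a small standard-library sketch. This is not the scikit-learn code used in the experiments; the function names and the three-classifier example are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits; information gain is the drop in entropy."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def hard_vote(predictions):
    """Return the class label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the per-class probabilities and return the most probable class."""
    classes = probabilities[0].keys()
    averaged = {
        c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes
    }
    return max(averaged, key=averaged.get)


# Hypothetical outputs of three classifiers for one student.
probs = [
    {"low": 0.51, "middle": 0.00, "high": 0.49},
    {"low": 0.52, "middle": 0.00, "high": 0.48},
    {"low": 0.05, "middle": 0.00, "high": 0.95},
]
labels = [max(p, key=p.get) for p in probs]  # each classifier's own top class
hard_result = hard_vote(labels)
soft_result = soft_vote(probs)
```

In this example two classifiers narrowly prefer "low", so hard voting returns "low", while the third classifier's confident "high" dominates the averaged probabilities and soft voting returns "high". This is one way the two voting rows of a results table can differ.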
After applying the various predictive algorithms to the dataset, we obtained different results for different machine learning algorithms. The results are summarized in Table 1.
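The accuracy figures in Table 1 are consistent with the misclassification counts if the test set holds 144 records, i.e. 30% of the 480 cleaned records. The test-set size is inferred from the split described in the methodology rather than stated explicitly, so the check below is a sketch under that assumption.

```python
TEST_SET_SIZE = 144  # assumption: 30% of the 480 cleaned records

# Misclassification counts as reported in Table 1.
misclassified = {
    "Decision tree-Gini index": 38,
    "Decision tree-entropy index": 45,
    "Logistic regression": 38,
    "Naive Bayes": 39,
    "Bagging": 38,
    "Boosting-gradient": 36,
    "Boosting-Ada": 39,
    "Voting-hard": 40,
    "Voting-soft": 39,
    "Voting-weighted": 37,
    "Random forest": 37,
}


def accuracy_percent(n_misclassified, n_test=TEST_SET_SIZE):
    """Accuracy in percent, rounded to two decimals as in Table 1."""
    return round(100.0 * (n_test - n_misclassified) / n_test, 2)


accuracies = {name: accuracy_percent(n) for name, n in misclassified.items()}
```

Under this assumption, 36 errors correspond to 75.00%, 37 to 74.31%, 38 to 73.61%, 39 to 72.92%, 40 to 72.22% and 45 to 68.75%, matching the reported values.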
Table 1 Summary of results

Classifier                     Number of misclassified samples   Accuracy (%)
Decision tree-Gini index       38                                73.61
Decision tree-entropy index    45                                68.75
Logistic regression            38                                73.61
Naïve Bayes                    39                                72.92
Bagging                        38                                73.61
Boosting-gradient              36                                75.00
Boosting-Ada                   39                                72.92
Voting-hard                    40                                72.22
Voting-soft                    39                                72.92
Voting-weighted                37                                74.31
Random forest                  37                                74.31

5 Conclusion
In this paper, student performance was evaluated using various machine learning algorithms after feature selection. We experimented with supervised learning algorithms such as decision trees, logistic regression and naïve Bayes classifier, and ensemble machine learning algorithms such as boosting, bagging and random forest classifier. Ensemble machine learning algorithms further improved the accuracy of prediction by combining basic algorithms. Boosting, voting and random forest gave the best results on our dataset. Boosting achieved a prediction accuracy of 75%, while voting and random forest both achieved an accuracy of 74.31%. Moreover, gradient boosting shows better results than AdaBoost. We conducted experiments showing that eliminating any of the selected features reduces the accuracy of the classifier. Prediction of students’ academic performance is an important application of educational data mining. It has widespread applications such as assisting students in improving their academic performance, customizing e-learning courses to better suit students’ needs and providing tailor-made solutions for different groups of students. We believe that further research on prediction of students’ performance is necessary to enhance the academic ecosystem.
References
1. W. Punlumjeak, N. Rachburee, A comparative study of feature selection techniques for classify student performance, in Proceeding of the Seventh International Conference on Information Technology and Electrical Engineering (2015), pp. 425–429
2. S.S.M. Ajibade, N.B. Ahmad, S.M. Shamsuddin, An heuristic feature selection algorithm to evaluate academic performance of students, in Proceedings of the Tenth Control and System Graduate Research Colloquium (2019), pp. 110–114
3. E.A. Amrieh, T. Hamtini, I. Aljarah, Mining educational data to predict student’s academic performance using ensemble methods. Int. J. Database Theory Appl. 9, 119–136 (2016)
4. A. Mueen, B. Zafar, U. Manzoor, Modeling and predicting students’ academic performance using data mining techniques. Int. J. Mod. Educ. Comput. Sci. 8, 36–42 (2016)
5. R. Al-Shabandar, A. Hussain, A. Laws, R. Keight, J. Lunn, N. Radi, Machine learning approaches to predict learning outcomes in Massive open online courses, in Proceedings of the International Joint Conference on Neural Networks (2017), pp. 713–720
6. P. Sra, P. Chakraborty, Opinion of computer science instructors and students on MOOCs in an Indian university. J. Educ. Technol. Syst. 47, 205–212 (2018)
Modified AMI Modulation Scheme for High-Speed Bandwidth Efficient Optical Transmission Systems
Abhishek Khansali, M. K. Arti, Soven K. Dana, and Manoranjan Kumar
Abstract An advanced modulation format based on the alternate mark inversion (AMI) signaling scheme is proposed for high-speed optical communication systems. The proposed format, modified AMI, is derived by combining AMI with a vestigial sideband (VSB) scheme. To evaluate its performance, the proposed scheme is compared with modified Manchester, modified RZ, and modified NRZ in terms of receiver sensitivity and spectral efficiency at equal spectral width. The optical filter that applies VSB filtering to the AMI signal is optimized so that the spectral width of the modified AMI signal is reduced, which improves the spectral efficiency of the modified AMI scheme.
Keywords Alternate mark inversion (AMI) · Manchester · Non-return-to-zero (NRZ) · Return-to-zero (RZ) · Vestigial sideband (VSB) · Wavelength division multiplexing–passive optical networks (WDM-PON)
A. Khansali · M. K. Arti (B) · S. K. Dana
AIACT&R, Geeta Colony, Delhi 110031, India
e-mail: [email protected]
A. Khansali
e-mail: [email protected]
S. K. Dana
e-mail: [email protected]
M. Kumar
ADGITM, Shastri Park, Delhi 110053, India
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_26

1 Introduction
Free-space optical (FSO) communication is a promising wireless technology for high-data-rate transmission at optical frequencies [1, 30]. FSO is more secure than commonly used RF wireless telecommunications because of the high directionality of the laser beam [2–4]. Furthermore, an FSO communication system offers high compatibility and inherent security, without signal interference, for broadband data access. Today, information in the form of digital data can make practical use of high-speed optical systems to perform complicated tasks [5, 6]. In optical communication, performance depends on the choice of coding technique [7, 8], and an efficient line code is selected for a particular task. Although digital communication has many advantages over analog communication, it has two drawbacks: average DC components and synchronization issues [9, 10]. The Manchester coding technique addresses these drawbacks because it ensures that the digital signal does not stay at logic ‘1’ or ‘0’ for a long time [11, 12]. The AMI coding technique requires less bandwidth than the Manchester coding scheme and also has no DC component problem [13]. AMI is an attractive scheme because it has zero DC content and a clock component for synchronization, which makes it advantageous for high-speed communication systems. In AMI, successive logic ‘1’ bits are sent with alternating polarity, while logic ‘0’ is sent as the zero level. In wavelength division multiplexing–passive optical network (WDM-PON) systems, PSK and Manchester coding have been used to build a robust system that overcomes interference noise [14]. In [15], an offset of the Manchester line coding scheme on carrier distribution is considered [6, 16]. Such coding helps WDM-PON communication systems overcome Rayleigh noise, compared with modified NRZ. In radio-over-fiber systems, Manchester and return-to-zero (RZ) coding enable wavelength reuse in the uplink [17].
However, in very high speed of optical communication scheme, the standardized passive optical networks have accomplished various improvements in data rates because of the requirement of exponentially raised bandwidth which will be directed by HDTV, video conferencing, and cloud computing [18]. Therefore, the next-generation passive optical network stage II is a combination of time with wavelength-multiplexed passive optical network, that has been suggested as a worthwhile single wavelength to the data rate above 10 Gb/s [19]. Having this advancement, it is given in [19], and the confirmation of 40 Gb/s TDM passive optical network accomplished 42 km by optical duobinary technique to overcome the complication at optical network unit having a 25 Gb/s avalanche photodiode detector and equalizer for better bandwidth receiver. Manchester modulation also has a limitation to its application in optical communication system due to broader spectral bandwidth that deteriorates its spectral efficiency and reduces transmission distance [15, 20]. With the help of Manchester and VSB filter combined a new scheme is proposed to generate new modified AMI scheme for improving spectral efficiency and receiver sensitivity. VSB filter is used to suppress the sideband by optical BPF called as VSB during whole modulation information. For the performance of proposed AMI format, the bandwidth of VSB filter and center frequency is very essential factor [21]. This study proposed and realized a modified AMI scheme for immense speed optical communication scheme. Same spectral width is assumed for comparison of this modified technique with Manchester scheme, NRZ, and RZ scheme. The results
Modified AMI Modulation Scheme for High-Speed Bandwidth …
show that the spectral efficiency and the received power of the modified AMI scheme improve considerably over those of the modified Manchester technique.
2 Generation of AMI Code In high-speed optical transmission systems, AMI coding is robust against the DC problem because each logic ‘1’ is transmitted with a polarity opposite to that of the previous logic ‘1’. AMI code can be generated in the electrical domain as well as in the optical domain. Figure 1 shows the generation method used here. The required blocks are a pseudorandom bit sequence generator, which produces the binary sequence, a duobinary precoder, an NRZ pulse generator, an electrical time delay, and an electrical subtractor. In the proposed scheme, the AMI signal is generated in the electrical domain with these blocks. The pseudorandom bit sequence generator has built-in logic circuits that generate a binary sequence which is random in nature and difficult to predict. The duobinary precoder uses an XOR function together with a delay on the generated bit sequence. The NRZ block forms a sequence of non-return-to-zero electrical pulses coded by the input binary signal. The electrical time delay delays the entire NRZ pulse sequence, and finally the delayed and undelayed signals pass through the electrical subtractor, which generates the electrical AMI signal. Alternate mark inversion (AMI) is a clock-synchronized line coding scheme that represents binary 1s with alternating positive and negative values. A neutral zero
Fig. 1 Generation of AMI code
A. Khansali et al.
voltage is used to represent binary 0. Modified AMI codes are digital communication schemes that maintain system synchronization. The alternating line code avoids the build-up of a DC voltage level along the cable. This is an advantage because the cable can then carry a small DC current to power intermediate equipment such as line repeaters. Because of these characteristics, AMI avoids a problem that affects other coding techniques, namely a transmitted signal that remains constant for a long time, as shown in Fig. 2. If the transmitted signal stays at logic ‘1’ or logic ‘0’ for an extended period, very low frequencies, referred to as DC components, are formed. This is one of the difficulties encountered in digital transmission, since some media cannot transmit the DC component and it becomes problematic in AC-coupled systems. The problem can be eliminated with the AMI coding scheme, which removes the residual DC component through its three transmission levels. Another difficulty in digital transmission is synchronization: for data to be transmitted and received successfully, the transmitter and receiver must be synchronized, or the received data will not be decoded correctly. One method is to send an external clock signal with the data. However, this method is undesirable when the number of links between parts of a system has to be minimized. Another method uses an internal timing signal instead of an external clock signal, but the internal clock frequencies can then differ, especially when the transmitter and receiver are in different surroundings. Figure 2 shows the simulated output waveforms of the AMI, NRZ, RZ, and Manchester coding schemes for comparison.
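The block chain of Sect. 2 (precoder, NRZ, time delay, subtractor) can be illustrated with a minimal Python sketch. It assumes that the precoder is an XOR with one-bit delayed feedback, that the time delay equals one bit period, and that the precoder starts in state 0; none of these details is stated explicitly in the text. The function names `ami_encode` and `ami_via_precoder` are hypothetical and are not part of the simulation setup.

```python
def ami_encode(bits):
    """Direct AMI: 0 -> 0, each 1 -> alternating +1 / -1 (first mark positive)."""
    level = 1
    symbols = []
    for bit in bits:
        if bit:
            symbols.append(level)
            level = -level  # the next mark uses the opposite polarity
        else:
            symbols.append(0)
    return symbols


def ami_via_precoder(bits):
    """AMI from the block chain: XOR precoder, one-bit delay, subtractor."""
    previous = 0  # assumed initial precoder state
    symbols = []
    for bit in bits:
        current = bit ^ previous            # precoder output, used as NRZ level 0 or 1
        symbols.append(current - previous)  # NRZ signal minus its one-bit-delayed copy
        previous = current
    return symbols


bits = [1, 0, 1, 1, 0, 0, 1]
print(ami_encode(bits))
print(ami_via_precoder(bits))
# The marks cancel in pairs, so the sum of all symbols is 0 or 1
# and the DC content stays small.
print(sum(ami_encode(bits)))
```

Under these assumptions both functions produce the same three-level sequence, which is why a subtractor fed by a precoded NRZ signal and its delayed copy can act as an AMI generator.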
3 Vestigial Sideband Filtering VSB filtering is a useful method for enhancing spectral efficiency in optical systems. It brings improvements in transmission distance and in wavelength division multiplexing (WDM) operation. An optical filter suppresses part of one unwanted sideband of the output signal while the modulation information is preserved [22]. The optical band-pass filter that removes part of this sideband is therefore central to the performance of the scheme. The resulting reduction in the spectral width of the optical signal is important for increasing bandwidth efficiency. When the bandwidth of the vestigial sideband filter is reduced too far, however, the signal is degraded by narrow filtering [23]. The filter used for VSB filtering has a narrow width. Because the signal power is confined to a modest spectral bandwidth, performance depends on the detuning of the center frequency of the VSB filter [16]. When the detuning frequency is smaller, the VSB filter bandwidth must be larger to pass the high-frequency components. Furthermore, increasing the bandwidth of the optical BPF above the optimum allows more noise to pass through the filter and hence reduces the receiver sensitivity [24]. Therefore, the detuning of
Fig. 2 Comparison of signals waveform in different coding scheme. a–d Waveforms in the electrical domain of conventional AMI, Manchester, NRZ and RZ signals, respectively (panel titles: Manchester, NRZ, RZ, AMI; axes: Time, s ×10^-9; Amplitude, a.u.)
center frequency and the optical filter bandwidth had to be optimized to implement the VSB filtering. With the vestigial sideband filter used in the modified AMI coding technique, the bandwidth suppression in the optical filter enhances the bandwidth efficiency of the AMI modulation scheme. In VSB filtering, the bandwidth required of the optical BPF is linearly related to the center frequency of the filter, so changing one while leaving the other unchanged alters the behavior of the system. The main function of vestigial sideband filtering is to reduce the spectral bandwidth of the optical signal, since the proposed AMI signal has an extensive spectral bandwidth. At the same time, the suppression of bandwidth in the optical filter commonly reduces the signal power [25]. The bandwidth of the optical BPF and its detuning frequency therefore have to be optimized for VSB filtering [16]. Figure 3 shows the effect of vestigial sideband filtering by comparing the modulated AMI signal after filtering with the modulated conventional AMI signal before the VSB filter is applied. As illustrated in Fig. 3, the amplitude of the output waveform after vestigial sideband filtering is not equal to that of the waveform before filtering, because of the suppression that the signal has undergone. The reduced power level affects the sensitivity of the detector. It can be raised by optical amplification, for example with an erbium-doped fiber amplifier after the vestigial sideband filter, which increases the power level and extends the transmission distance.
Fig. 3 Effect of vestigial sideband filtering on modulated waveform (traces: waveform before VSB, waveform after VSB; axes: Time, s ×10^-9; Power, W ×10^-3)
4 Simulation Setup and Results Figure 4 shows the simulation model of the proposed AMI modulation scheme in an EDFA-free communication system. OptiSystem and MATLAB are used to analyze the system performance. The bit rate is 10 Gb/s with a pseudorandom binary bit sequence. The proposed AMI signal is generated in the electrical domain: the pseudorandom bit sequence passes through the duobinary precoder, is converted into NRZ pulses, and then passes through the electrical time delay and the electrical subtractor. The electrical AMI signal modulates a CW laser diode operating at a wavelength of 1553 nm through a Mach–Zehnder modulator with an extinction ratio of 30 dB. The resulting optical AMI signal is launched into the single-mode fiber with 0 dBm launch power and is followed by the optical BPF used for VSB filtering. VSB filtering performs better at the receiver end than at the transmitter end, because a sideband suppressed at the transmitter side rebuilds itself very quickly through nonlinear behavior in the optical fiber during transmission. The principal purpose of vestigial sideband filtering in the proposed scheme is to suppress the spectral bandwidth of the optical AMI signal [16], which raises the bandwidth efficiency of the modified AMI technique. In the proposed technique, the optimized spectral width of the vestigial sideband filter is 20 GHz at a bit rate of 10 Gb/s, and the center frequency is detuned by 20 GHz, which results in optimal performance of the scheme. The center frequency can be detuned toward either the lower or the upper sideband of the modified AMI signal. The signal is detected by a PIN photodiode receiver preceded by a variable optical attenuator that varies the received power.
The responsivity and dark current of PIN photodiode are 0.8 A/W and 10 nA,
Fig. 4 Complete simulation setup of proposed modified AMI scheme
respectively. The detected modified AMI electrical signal is then passed through a Gaussian-type electrical LPF with a cut-off frequency of 27.5 GHz and on to a BER analyzer. The BER analyzer is used to obtain the BER as a function of the ratio of energy per bit to noise power spectral density. In a digital communication system, the bit error rate is the number of bit errors per unit time, whereas the bit error ratio is the number of bit errors divided by the total number of bits transferred during a specified time interval. The bit error ratio can be regarded as an estimate of the bit error probability.
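The distinction between bit error ratio and bit error rate can be made concrete with a short Python sketch. It is only an illustration of the two definitions; the function names and the example bit sequences are hypothetical, and the BER analyzer in the simulation does not necessarily work by direct counting.

```python
def bit_error_ratio(sent_bits, received_bits):
    """Number of bit errors divided by the number of transferred bits."""
    if len(sent_bits) != len(received_bits):
        raise ValueError("sequences must have the same length")
    if not sent_bits:
        raise ValueError("at least one bit is required")
    errors = sum(s != r for s, r in zip(sent_bits, received_bits))
    return errors / len(sent_bits)


def bit_error_rate(sent_bits, received_bits, bit_rate_bps):
    """Bit errors per second: the error ratio scaled by the line bit rate."""
    return bit_error_ratio(sent_bits, received_bits) * bit_rate_bps


sent = [1, 0, 1, 1, 0, 0, 1, 0]
received = [1, 0, 0, 1, 0, 0, 1, 1]  # two of the eight bits are wrong
print(bit_error_ratio(sent, received))
print(bit_error_rate(sent, received, 10e9))  # 10 Gb/s, as in the setup
```

A sequence this short only illustrates the arithmetic; estimating the low error ratios of interest in optical links would require far more bits than errors.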
4.1 Spectral Efficiency One of the principal issues in increasing the capacity of an optical transmission system is how to enhance the system spectral efficiency adequately. In high-speed optical communication systems, spectral efficiency is therefore the central factor [23]. The system spectral efficiency depends on the spectral bandwidth of the line coding technique: compressing the spectral bandwidth of a line code raises its spectral efficiency. The spectral bandwidth of the proposed AMI modulation scheme is compressed compared with the modified Manchester, RZ, and NRZ line codes [12, 27]. The optical spectra of the RZ, NRZ, modified AMI, and modified Manchester coding schemes are shown in Fig. 5. The reduction in optical spectral bandwidth of the proposed modified AMI scheme results in a substantial rise in its spectral efficiency.
Fig. 5 Variance in optical spectral bandwidth of modified AMI, Manchester, NRZ and RZ modulation schemes (traces: NRZ, RZ, modified Manchester, modified AMI; axes: Frequency, Hz ×10^14, from 1.93 to 1.932; Spectrum, dB, from −100 to 0)
Table 1 For fiber length of 0 km
S. no.   VSB filter bandwidth (GHz)   Power received (dBm)
1        5                            −3.905
2        10                           −3.509
3        15                           −3.346
4        20                           −3.254
5        25                           −3.207
6        30                           −3.169

Table 2 For fiber length of 20 km
S. no.   VSB filter bandwidth (GHz)   Power received (dBm)
1        5                            −4.313
2        10                           −3.920
3        15                           −3.719
4        20                           −3.662
5        25                           −3.606
6        30                           −3.569
4.2 Receiver Sensitivity The launched power is 0 dBm, as stated at the start of this section. For the comparison, the carrier-to-signal power ratio (CSPR) is assumed to be the same for all techniques. The received power is −7.573 dBm for modified Manchester and −7.240 dBm for modified AMI, so the power at the receiver end is about 0.33 dB higher for modified AMI. The received power also depends on the bandwidth of the vestigial sideband (VSB) filter: as the filter bandwidth increases, the power received at the receiver end increases as well [26–29]. The relation is monotonic rather than strictly proportional, since each additional 5 GHz of bandwidth yields a smaller gain in received power. Tables 1 and 2 compare VSB filter bandwidth and received power for the ideal case (0 km) and for 20 km of fiber with an attenuation of 0.02 dB/km.
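The figures in Tables 1 and 2 can be cross-checked against the stated fiber attenuation with a few lines of Python. The sketch below is only a consistency check on the tabulated values; it assumes that fiber loss in dB is attenuation times length and that no other loss differs between the two cases. The helper names are hypothetical.

```python
BANDWIDTH_GHZ = [5, 10, 15, 20, 25, 30]
POWER_0KM_DBM = [-3.905, -3.509, -3.346, -3.254, -3.207, -3.169]   # Table 1
POWER_20KM_DBM = [-4.313, -3.920, -3.719, -3.662, -3.606, -3.569]  # Table 2
ATTENUATION_DB_PER_KM = 0.02
FIBER_LENGTH_KM = 20


def fiber_loss_db(attenuation_db_per_km, length_km):
    """Loss of a fiber span in dB (attenuation times length)."""
    return attenuation_db_per_km * length_km


def dbm_to_mw(power_dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (power_dbm / 10)


expected_loss = fiber_loss_db(ATTENUATION_DB_PER_KM, FIBER_LENGTH_KM)
for bandwidth, p0, p20 in zip(BANDWIDTH_GHZ, POWER_0KM_DBM, POWER_20KM_DBM):
    tabulated_loss = p0 - p20
    print(f"{bandwidth:2d} GHz: tabulated loss {tabulated_loss:.3f} dB, "
          f"expected {expected_loss:.3f} dB, "
          f"received {dbm_to_mw(p20):.3f} mW at 20 km")
```

The difference between the two tables stays close to the 0.4 dB expected from 0.02 dB/km over 20 km, and the received power rises with filter bandwidth in both cases.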
5 Conclusion In this paper, we introduced and analyzed an advanced modulation technique called the modified AMI scheme. In this scheme, an optical BPF performs the VSB filtering operation. In next-generation high-speed optical communication systems, spectral efficiency will be the principal issue, particularly in DWDM passive
optical networks. We observe a reasonable rise in the efficiency of the proposed AMI scheme with increasing bandwidth of the vestigial sideband filter and detuning of the center frequency. The significant increase in spectral efficiency and the better receiver sensitivity make the proposed AMI technique a promising format for very high-speed, bandwidth-efficient optical communication systems.
References
1. N. Dong-Nhat, M.A. Elsherif, A. Malekmohammadi, Investigations of high-speed optical transmission systems employing absolute added correlative coding (AACC). Opt. Fiber Technol. 30, 23–31 (2016)
2. S.-K. Liaw, K.-Y. Hsu, J.-G. Yeh, Y.-M. Lin, Y.-L. Yu, Impacts of environmental factors to bidirectional 2 × 40 Gb/s WDM free-space optical communication. Opt. Commun. 396, 121–133 (2017)
3. I.B. Djordjevic, S. Zhang, T. Wang, OAM-based physical-layer security enabled by hybrid free-space optical-terahertz technology, in 13th International Conference on Advanced Technologies Systems and Services in Telecommunications (TELSIKS) (2017), pp. 317–320
4. D. Wu, C. Chen, Z. Ghassemlooy et al., Short-range visible light ranging and detecting system using illumination light emitting diodes. IET Optoelectron. 10, 94–99 (2016)
5. M. Elsherif, A. Malekmohammadi, Performance enhancement of mapping multiplexing technique utilising dual-drive Mach–Zehnder modulator for metropolitan area networks. IET Optoelectron. 9, 108–115 (2015)
6. P. Saxena, A. Mathur, M.R. Bhatnagar, Performance of optically pre-amplified FSO system under gamma-gamma turbulence with pointing errors and ASE noise, in Proc. IEEE 85th Veh. Technol. Conf. (2017), pp. 1–5
7. T.-L. Wang, I.B. Djordjevic, Physical-layer security in free-space optical communications using Bessel-Gaussian beams, in Photonics Conference (IPC) (IEEE, 2018), pp. 1–2
8. H.S. Khallaf, H.M.H. Shalaby, J.M. Garrido-Balsells, S. Sampei, Performance analysis of a hybrid QAM-MPPM technique over turbulence-free and gamma-gamma free-space optical channels. IEEE/OSA J. Opt. Commun. Netw. 9(2), 161–171 (2017)
9. M.R.B. Jaiswal, Free-space optical communication: A diversity-multiplexing tradeoff perspective. IEEE Trans. Inf. Theory 65(2), 1113–1125 (2019)
10. T. Peyronel, K.J. Quirk, S.C. Wang, T.G. Tiecke, Luminescent detector for free-space optical communication. Optica 3(7), 787–792 (2016)
11. J.N. Roy, J.K. Rakshit, Manchester code generation scheme using microring resonator based all optical switch, in Proc. Int. Conf. 9th Int. Symp. On Communication Systems, Networks and Digital Sign (Tripura, India, 2014), pp. 1118–1122
12. Q. Huang et al., Secure free-space optical communication system based on data fragmentation multipath transmission technology. Opt. Exp. 26(10), 13536–13542 (2018)
13. J. Abu-Ghalune, M. Alja’fari, Parallel data transmission using new line code methodology. Int. J. Netw. Commun. 6(5), 98–101 (2016)
14. Z. Li, Y. Dong, Y. Wang et al., A novel PSK-Manchester modulation format in 10-Gb/s passive optical network system with high tolerance to beat interference noise. IEEE Photon. Technol. Lett. 17, 1118–1120 (2005)
15. J. Xu, X. Yu, W. Lu et al., Offset manchester coding for Rayleigh noise suppression in carrier-distributed WDM-PONs. Opt. Commun. 346, 106–109 (2015)
16. Y. Yang et al., Multi-aperture all-fiber active coherent beam combining for free-space optical communication receivers. Opt. Exp. 25(22), 27519–27532 (2017)
17. Y. Lu, X. Hong, J. Liu et al., IRZ-manchester coding for downstream signal modulation in an ONU-source-free WDM-PON. Opt. Commun. 284, 1218–1222 (2011)
18. A. Malekmohammadi, M.K. Abdullah, A.F. Abas et al., Absolute polar duty cycle division multiplexing (APDCDM); technique for wireless communications, in Proc. Int. Conf. on Computer and Communication Engineering 2008 (Kuala Lumpur, Malaysia, 2008), pp. 617–620
19. B. Batsuren, H.H. Kim, C.Y. Eom et al., Optical VSB filtering. N. Wang, X. Song, J. Cheng, V.C.M. Leung, Enhancing the security of free-space optical communications with secret sharing and key agreement. IEEE/OSA J. Opt. Commun. Netw. 6(12), 1072–1081 (2017)
20. Y. Dong, Z. Li, C. Lu et al., Improving dispersion tolerance of manchester coding by incorporating duobinary coding. IEEE Photon. Technol. Lett. 18, 1723–1725 (2006)
21. J. Lee, S. Kim, Y. Kim et al., Optically preamplified receiver performance due to VSB filtering for 40-Gb/s optical signals modulated with various formats. J. Lightwave Technol. 21, 521–527 (2003)
22. J. Zhang, J. Wang, Y. Xu, M. Xu, F. Lu, L. Cheng, J. Yu, G.-K. Chang, Fiber–wireless integrated mobile backhaul network based on a hybrid millimeter-wave and free-space-optics architecture with an adaptive diversity combining technique. Opt. Lett. 41(9), 1909–1912 (2016)
23. G. Kaur, H. Singh, Design & investigation of 32 × 10 GBPS DWDM-FSO link under different weather condition. Int. J. Adv. Res. Comput. Sci. 8(4), 79–82 (2017)
24. J. Geisler et al., Demonstration of a variable data-rate free-space optical communication architecture using efficient coherent techniques. Opt. Eng. 55(11) (2016)
25. K. Ahmed, S. Hranilovic, C-RAN uplink optimization using mixed radio and FSO fronthaul. J. Opt. Commun. Netw. 10(6), 603–612 (2018)
26. B. Batsuren, H.H. Kim, C.Y. Eom, J.J. Choi, J.S. Lee, Optical VSB filtering of 12.5 GHz spaced 64 × 12.4 Gb/s WDM channels using a pair of Fabry-Perot filters. J. Opt. Soc. Korea 17, 63–67 (2013)
27. I.O. Festus, D.-N. Nguyen, M. Amin, Modified manchester modulation format for high-speed optical transmission systems. IET Optoelectron. 12(4), 202–207 (2018)
28. K.I.A. Sampath, K. Takano, Phase-shift method-based optical VSB modulation using high-pass Hilbert transform. IEEE Photon. J. 8, pp. 1–14 (2016)
29. H.Y. Chen, J. Lee, N. Kaneda et al., Comparison of VSB PAM4 and OOK signal in an EML-based 80-km transmission system. IEEE Photon. Technol. Lett. 29, 2063–2066 (2017)
30. M.R. Bhatnagar, Z. Ghassemlooy, Performance analysis of gamma-gamma fading FSO MIMO links with pointing errors. IEEE/OSA J. Lightwave Technol. 34(92), 2158–2169 (2016)
A Safe Road to Health: Medical Services Using Unmanned Aerial Vehicle
M. Monica Dev and Ramachandran Hema
Abstract Deaths from traffic accidents in India are rising, mostly because ambulances take too long to reach the crash site. An unmanned aerial vehicle offers a solution. The medical emergency drone concept fills the gap left by the ambulance delay: a hex-copter takes the shortest path to reach the patient quickly and deliver first aid. Like a rescue emergency drone, it provides faster, real-time medical support at the crash site, so that the emergency can be assessed quickly and accurately and bystanders can be assisted in treating the injured person. The medical service aims at locating the accident spot, determining the live status of the person, and delivering a medical emergency kit within a short time. Manual calculations are used to estimate the time the drone needs to reach the spot from the nearest hospital in both rural and urban areas. The live status of the person can be viewed by hospitals using IOT, and video guidance is provided through an on-board camera fixed to the hex-copter. A system of this type can save many lives.
Keywords Unmanned aerial vehicle · Rescue emergency drone · IOT · Video guidance · Hex-copter
M. Monica Dev · R. Hema (B) Translational Engineering, Government Engineering College, Barton Hill, Kerala, India e-mail: [email protected] M. Monica Dev e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_27
1 Introduction Unmanned aerial vehicles (UAVs) have been used since World War I to avoid risking soldiers' lives. UAVs are now used in various fields, including surveying, aerial photography, parcel delivery, rescue emergency, and surveillance. Accidents are the main cause of non-natural deaths. According to an RTI survey conducted in India, 480,652 accidents happened in 2018 and 30% of the people involved died because of the lack of first aid. The average time for an ambulance to reach the
hotspot is 9–10 min, whereas it takes hours in non-accident (i.e., rural) zones [1–3]. A seriously injured person must receive medical attention within minutes, and a delayed response reduces the chance of surviving an accident. In many cases, the major flaw is the undue delay caused by identifying the location and by traffic. Drones in medical logistics would be time- and cost-effective compared with road transport, because they reach the destination quickly and reduce the cost of public service delivery, although they have weaknesses such as limited payload. Medical applications need a fast response to accidents or natural disasters, where drones can supply medical aid during the emergency and can also transport blood and organs within hospitals [4, 5]. This can save many lives. In India, many start-ups are focusing on the medical business through aerial supply of medicines or rescue services. Drones are fast and capable of vertical take-off and landing. With such a drone, a safety band can report the live location of a person when he or she presses the emergency button, and the drone then follows the GPS points. The ambulance crew can view the live aerial video from the drone and avoid traffic-prone routes [6, 7]. Once the hex-copter reaches the location, it monitors the condition of the person and raises an alarm so that the ambulance crew or people nearby can help. It can also carry a paramedic first aid kit, which a bystander can give to the person before the ambulance arrives.
1.1 Technology Behind the Drone for Medical Services The hex-copter is designed to serve the medical field. It is fitted with three sensors: a temperature sensor to measure the body temperature of the patient, a respiratory sensor to measure the breathing rate, and a pulse rate sensor to measure the heart rate. The concept is derived from the TU Delft ambulance drone, built by former student Alec Momont. It is an autonomous drone designed to deliver a defibrillator to victims of cardiac arrest [8]. The bystander can give CPR by following the video on the monitor mounted on the drone. The rescue emergency drone (RED) builds on the same idea. Delivery drones work in the same way with some changes in specifications. The design of a UAV includes the calculation of thrust and payload, which can be done with the online tool eCalc. The estimated weight of the frame and payload, together with the motor and battery specifications, is entered as input data, and the flying time, battery usage, and stability are calculated from it. The values are checked against manual thrust calculations that include the weights of the frame, battery, camera, sensors, and payload.
1.2 Determination of Payload The number of accidents happening annually and the number of deaths due to ambulance delay are taken from the RTI survey of accidents in India. Drones currently working in various fields, and their designs as determined by the application, have been studied. The need for speedy medical assistance to the bystander has been analyzed. The speed and the distance of travel have been noted, and the thrust is calculated.
1.3 Thrust Calculation
Total Thrust = 2 × weight of drone + weight of carried object + [Total from above × 0.20]/Number of motors needed. (1)
The total thrust is calculated as per (1). As a rule of thumb, a 3600 g drone needs 7200 g of thrust to get off the ground, plus an additional 1000 g to hover. Dividing the 7200 g of lift-off thrust by the number of motors (six, as it is a hex-copter) gives 1200 g of thrust per motor.
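The rule-of-thumb arithmetic above can be reproduced with a minimal Python sketch. It covers only the 2:1 thrust-to-weight rule and the division by the number of motors, not the full expression in (1); the function names and the `thrust_to_weight` parameter are hypothetical.

```python
def required_thrust_g(total_weight_g, thrust_to_weight=2.0):
    """Lift-off thrust in grams under the 2:1 thrust-to-weight rule of thumb."""
    return total_weight_g * thrust_to_weight


def thrust_per_motor_g(total_weight_g, motor_count, thrust_to_weight=2.0):
    """Share of the lift-off thrust that each motor has to deliver."""
    if motor_count <= 0:
        raise ValueError("motor_count must be positive")
    return required_thrust_g(total_weight_g, thrust_to_weight) / motor_count


print(required_thrust_g(3600))      # total lift-off thrust for a 3600 g drone
print(thrust_per_motor_g(3600, 6))  # per-motor share for a hex-copter
```

For the 3600 g example this gives 7200 g in total and 1200 g per motor, matching the values in the text.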
2 Materials and Methods The present work uses the arrangement shown in Fig. 1. Six motors operate under the commands received from electronic speed controllers, and the battery provides the power to drive the motors. An ultrasonic sensor, an ARM proto board, and a flight controller direct the hex-copter to the desired location.
2.1 Peripherals of the Hex-copter A UAV is designed according to the application in which it will operate, so the peripherals depend on the design of the hex-copter. The basic parts are the motors, propellers, electronic speed controllers, the transmitter and receiver for flight control, and the battery that supplies the motors. Secondary parts include the flight controller, camera, and monitor. Motor design is an important factor in a drone's efficiency. We have used outrunner-type brushless motors, which are more efficient, more reliable, and quieter than brushed motors and are widely used for drones. Knowing the size and weight of the propeller
Fig. 1 Detailed layout of the hex-copter driven by six motors each receiving inputs from electronic speed controllers
helps to estimate the overall motor thrust available for the payload, so that the drone lifts off cleanly and can hover in mid-air. The propellers at the front of the craft are called tractor props because they pull it through the air. The present work has used drone propellers made of plastic, which has better quality compared to carbon fiber. The pusher props are at the back and push the UAV forward. During stationary level flight, the motor torque is cancelled out by the contra-rotating props. We have used a 2.4 GHz, 4-camera quadcopter with a frame size of 180 mm, a propeller size of 4 inches, and a wheel base of 36 mm. It uses six 2300 KV brushless motors, each with a maximum speed of 10,350 rpm, and 3.7 V, 380 mAh batteries for drive power. Speed is controlled by an 8-bit low-power microcontroller with a 10-bit ADC. The maximum payload is 2.2 kg. The electronic speed controllers (ESCs) connected to the motors supply suitably modulated current and regulate the motor speed to produce the correct rates of spin for lifting and maneuvering. ESCs usually come with a battery eliminator circuit (BEC), which allows the flight controller and transceiver components to be powered from the ESC rather than directly from the battery. Lithium batteries are most commonly used to power quadcopters because of their high energy density and high discharge capability. A standard lithium polymer (Li-Po) cell has a nominal voltage of 3.7 V. Li-Po cells have a more linear discharge, which makes it easier to gauge the remaining flight time. A battery with a larger capacity provides a longer flight time, so high-capacity (mAh) batteries can be used with high payloads for longer flights (Fig. 2).
We have used the APM 2.8, a complete open-source autopilot system whose technology allows the user to turn any fixed-wing, rotary-wing, or multirotor
Fig. 2 Hex-copter frame design
vehicle into a fully autonomous vehicle. It performs programmed GPS missions with waypoints. The flight controller is best placed at the center of the drone and can be calibrated using the Mission Planner software. APM 2.8 can also plan the mission from the given waypoints, which keeps the flight orderly. The ARM proto board replaces the traditional receiver. It is a development platform that takes input from the ultrasonic sensor and sets an autonomous flight path by replicating receiver signals. It alters the course if the sensor input indicates a nearby object with which the hex-copter may collide. The hex-copter is fitted with FPV cameras, which are small, light, reasonably priced, and efficient. The FPV camera is mounted on the drone and sends real-time video to the control room on the ground through a video transmitter. It shows the location over which the drone is flying, the waypoints, and the aerial view. Depending on the drone's characteristics, the FPV transmitter sends the live video signal to the remote control screen and to a smartphone. FPV technology also allows the drone to fly higher and farther, up to miles away. The present work employs ThingSpeak, an open IOT platform on which up-to-the-minute temperature, humidity, and power usage data can be collected. We have also used MATLAB to analyze and visualize the data. The results obtained in real time from the IOT platform can be viewed anywhere on a mobile phone through the ThingSpeak app from the Play Store.
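As a rough illustration of how sensor readings could be pushed to ThingSpeak, the Python sketch below builds a request URL for the platform's HTTP update endpoint without sending it. The mapping of field 1 to body temperature, field 2 to pulse rate, and field 3 to respiration rate, the placeholder key, and the function name `build_update_url` are assumptions made for this example; the channel layout actually used in this work is not given in the text.

```python
from urllib.parse import urlencode

THINGSPEAK_UPDATE_URL = "https://api.thingspeak.com/update"


def build_update_url(write_api_key, readings):
    """Build a ThingSpeak update URL from a {field_number: value} mapping."""
    params = {"api_key": write_api_key}
    for field_number in sorted(readings):
        if not 1 <= field_number <= 8:  # a channel offers fields 1 to 8
            raise ValueError("field numbers must be between 1 and 8")
        params[f"field{field_number}"] = readings[field_number]
    return f"{THINGSPEAK_UPDATE_URL}?{urlencode(params)}"


# Assumed layout: field1 = temperature (deg C), field2 = pulse (bpm),
# field3 = respiration (breaths per minute). The key is a placeholder.
url = build_update_url("YOUR_WRITE_API_KEY", {1: 36.8, 2: 72, 3: 16})
print(url)
```

Keeping URL construction separate from the network call makes the request easy to inspect before a device on the drone actually transmits it.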
2.2 Medical Payload The medical payload of the drone is a small first aid kit that contains basic essential medications, an oxygen mask, and a defibrillator. The first aid box also contains adhesive tapes, antiseptic liquid for cleaning wounds and hands, gloves, non-adhesive wound pads, resealable plastic bags (oven and sandwich), and a pocket mask for CPR.
2.3 Arduino Board Design The Arduino microcontroller board has 14 digital I/O pins and six analog input pins. Programs and interface circuits can be created with very little effort to read switches and other sensors and to control motors and lights. An important feature of the Arduino is that a control program created on the host PC can be downloaded to the board, where it then runs automatically. The circuit diagram for connecting the pins to the sensors is shown in Fig. 3. An Arduino with the ATmega328P is used to connect the sensors. The pulse and respiration sensors are connected to analog pins A3 and A0, and the temperature sensor is connected to a digital pin. A 16 × 2 LCD connected to the board displays the data. The board is powered through the USB connection to the computer or, for stand-alone operation, by a battery. The sensor readings are collected and saved for further use through the IOT platform.
Fig. 3 Sensors connected to the Arduino microcontroller for medical services
3 Results and Discussion As soon as the hospital receives a distress call, a drone is dispatched with a medical payload chosen according to the type of care the patient needs. The hex-copter tracks the location and reaches the destination. The bystander receives instructions through the camera mounted on the hex-copter and then uses the temperature, pulse, and respiration sensors to determine the body temperature, pulse rate, and respiration rate of the patient. These data are sent through the ThingSpeak platform to the doctor at the hospital, who can then recommend further assistance for the patient. We used the hex-copter to track the patient successfully and instructed the bystander to deliver the information to the hospital using the three sensors. Further assistance was given on the advice of the doctor at the hospital. Figure 4 shows the pulse rate of the patient at different time intervals. The pulse rate increases and, after a particular point, decreases. Since the pulse rate is not steady, the patient needs medical attention. The doctor checks these data and can advise the bystander on the medical aid to be given to the patient. Figure 5 shows the body temperature of the person at different time intervals. The data for a particular date and time can be viewed. The body temperature is uploaded to the IOT platform every minute, and the platform creates a graphical representation for monitoring the real-time condition of the patient at the crash site. Figure 6 shows the respiratory rate of the patient at one-minute intervals. There is a significant drop in respiratory rate, which slowly recovers through the interventions of the bystander.
Fig. 4 Pulse rate monitored at various time intervals at the hospital from the measured data from pulse rate sensor
M. Monica Dev and R. Hema
Fig. 5 Body temperature monitored at various time intervals at the hospital with the measured data from temperature sensor
Fig. 6 Respiration rate monitored at various time intervals at the hospital with the measured data from respiration sensor
4 Conclusion
Drones for medical services will carry the sensors, follow the waypoints, monitor the condition of the patient, and deliver medication to the bystander with the help of video guidance. Extensive research in this area is already under way in some places, such as Delft, Netherlands, for the benefit of the medical industry. Studies are also ongoing on using drones for the surveillance of wards and for delivering medicines within the hospital, as the doctor does not need to visit the patients
until there is an emergency. The open IoT platform is useful for monitoring the condition of a patient from anywhere, and the data can also be stored and used for future purposes. By delivering the first aid kit to the bystander, the prototype enables basic first aid for the victim before the arrival of the ambulance. Many companies and start-ups are focusing on drone delivery of food and medicine in order to save time and energy. Drones of this kind will make life simpler and easier. Traffic delays, fuel usage, and pollution can also be reduced when drones are used for these operations. If we can get a drone to a person suffering a heart attack more quickly than an ambulance, we can save many more lives.
Building English–Punjabi Parallel Corpus for Machine Translation Simran Jolly and Rashmi Agrawal
Abstract A parallel corpus is needed for many natural language processing tasks, such as machine translation and multilingual document classification. The parallel corpus for the English–Punjabi language pair is sparse in volume because of the semantic differences between the two languages and because Punjabi is a low-resource language. In this paper, a parallel corpus for machine translation is created and evaluated using sentence alignment permutation metrics. Multiple translation corpora and human assessment together validate the automatic evaluation metrics, which are important for the development of machine translation systems. The corpora considered are movie dialogues taken from the Wikipedia dumps. Further, the metrics that define the corpora more accurately are identified. The quality of the corpus is verified using performance metrics based on distance metrics.
Keywords Natural language processing (NLP) · Natural language understanding (NLU) · Corpus · Punjabi · English · Sentence alignment · Application programming interface (API)
S. Jolly (B)
Manav Rachna International Institute of Research and Studies, Faridabad, India
e-mail: [email protected]
R. Agrawal
FCA, Manav Rachna International Institute of Research and Studies, Faridabad, India
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_28
1 Introduction
Machine translation is a subset of multilingual natural language processing that leverages parallel corpora, that is, collections of large texts in different languages. A parallel corpus is a collection of texts in two different languages aligned to each other for natural language processing tasks. The accuracy of a translation system depends largely on the sentence alignment in the parallel corpus. Parallel corpora are bodies of text that play a crucial role in machine translation and NLP, and they are an important resource for the translation process. Languages such as the Indian languages have limited corpora available due to their rich morphology. Languages are divided into two groups: low-resource languages and high-resource languages. Low-resource languages are those with no available parallel corpora or NLP resources such as POS taggers and grammars. Hence, monolingual comparable corpora are collected for these languages from different sources on the Web. The factors affecting translation with a parallel corpus are: (a) density of the resources (parallel corpora); (b) agreement differences between the languages; (c) sentence alignment issues. In our work, we try to improve the machine translation task by aligning the sentences in the corpora using a fuzzy string-matching approach. The corpora can be developed by taking sentences from the Web and from government libraries (TDIL). A parallel corpus is basically a translation corpus (Tiedemann [1]) that contains text and its equivalent word or token in tabular form, and it can be bilingual or multilingual. Such corpora are the resources used for acquiring and training data in statistical machine translation systems, as applied by Liu et al. [2]. Given a parallel corpus in English and a morphologically rich language, the projection between the two languages can be shown by word-to-word alignment.
The rest of the paper is structured as follows. Related and previous work is described in Sect. 2. Section 3 describes the corpus alignment metrics currently used for alignment in the translation task. Section 4 presents the methodology adopted and the application of the fuzzy string-matching algorithm. Section 5 presents the evaluation and results. Section 6 gives the conclusion and future work.
2 Related Work
The main issue to be addressed in machine translation systems is the creation of a parallel corpus. In this section, different sentence alignment models are discussed along with their limitations for the creation of a parallel corpus. The three widespread families of sentence alignment models are length-based, lexicon-based, and hybrid models. Length-based models rely on the length of the sentences. Gale and Church [3] applied a character-based method to align sentences in languages with similar length correlation, considering only 1:1 alignments. Distant languages could not be aligned using this algorithm. Tiedemann [1] applied a different technique for building the parallel corpus, using a time overlap approach; this approach built a corpus in XML format for 29 languages. The Indian Language Corpora Initiative (ILCI), handled by a consortium of participating institutions in India, has developed a multilingual corpus for different languages. The parallel corpora were built by having humans translate sentences into the desired language. The corpora were taken from different Web sources and agencies. A parallel corpus for the health domain was built by Choudhary and Jha [4].
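The intuition behind the length-based models mentioned above can be sketched in a few lines of Python. This is a deliberately simplified filter, not the probabilistic dynamic-programming method of Gale and Church: it only keeps 1:1 sentence pairs whose character lengths are comparable. The function names and the 0.5 threshold are hypothetical choices for illustration.

```python
def length_ratio(source, target):
    """Ratio of the shorter to the longer sentence length, in characters."""
    longer = max(len(source), len(target))
    if longer == 0:
        return 1.0  # two empty sentences are trivially comparable
    return min(len(source), len(target)) / longer


def plausible_pairs(source_sentences, target_sentences, min_ratio=0.5):
    """Keep 1:1 pairs (same index) whose character lengths are comparable.

    min_ratio is a hypothetical threshold; a real aligner would model the
    length relationship statistically instead of using a fixed cut-off.
    """
    return [
        (s, t)
        for s, t in zip(source_sentences, target_sentences)
        if length_ratio(s, t) >= min_ratio
    ]
```

A filter of this kind also illustrates the limitation noted above: for language pairs whose sentence lengths correlate poorly, correct pairs are rejected and a lexicon-based or hybrid model becomes necessary.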
Comparable corpora are not considered good for asymmetrical nonparallel alignments, as noted by Cartoni et al. [5], because they consist of source-language and target-language texts that address the same topic in different ways. As the frames inside such corpora are not fully aligned, it is difficult to use them for sentence alignment in the machine translation task. Comparable corpora are considered best for symmetrical alignments, as they are more balanced and contain comparable text. Parallel corpora can be used for machine translation tasks, as they may be unidirectional, bidirectional, or multidirectional in nature. Hence, considering the advantages and disadvantages of both kinds of corpora, a parallel corpus with a large number of sentences is built and then evaluated using distance metrics. Many text alignment tools and libraries have also been implemented in Python, such as BLEU metrics [11], based on parallel text alignment and sentence length, by the University of Zurich in 2013. Liu [6] built a Chinese–Hungarian parallel corpus using an automatic alignment approach, which used a word dictionary for the Chinese–English and English–Chinese pairs. In the bilingual corpus, the exact alignment is based on the statistical properties of the corpus and on the probability of words from the lexicon. The approach was followed for less correlated languages such as English and Chinese, as applied by Liu et al. [2]. The limitation of this research was that as the corpus size increases, the sentence length ratio decreases. Another available open-source tool, Hunalign [12], improved on the Gale and Church algorithm by comparing sentence lengths. This method failed for languages with low length correlations. Yeka et al. [7] used the standard machine translation approach to build a parallel corpus for the English–Hindi language pair. The sentence-based alignment model is a combination of length-based and lexicon-based models.
The sentence alignment methods were further improved using vocabulary-based approaches by Braune and Fraser [8]. Asymmetrical alignment was introduced using bipartite graphs, which were fast compared to the other approaches described above. Kumar and Goyal [9] developed a parallel corpus for the Hindi–Punjabi machine translation task using a language-pair lexicon that helped improve the accuracy of the task. The corpora were manually evaluated and were successful for languages with similar agreement. The drawback of the corpus created was its high latency, as it was evaluated using Moses and GIZA++.
3 Evaluation Metrics for Parallel Corpus
The parallel corpus generated by the InterText tool (Pavel [10]) needs to be evaluated using different text distance metrics. These metrics are applied to the parallel corpora, and the BLEU score is calculated individually for each metric. The metrics are described as follows:
1. Hamming Distance: This metric measures the dissimilarity between two sentences of equal (or nearly equal) length by comparing the strings character by character. The normalized Hamming value ranges from 0 to 1, where 0 means a perfect match and 1 means no match. The following equation is used for Hamming distance computation:
\[ d(a, b) = \sum_{k=0}^{N-1} [a_k \neq b_k] \tag{1} \]
In the above equation, $d(a, b)$ is the Hamming distance between string $a$ and string $b$, $a_k$ is the character of string $a$ at position $k$, $b_k$ is the character of string $b$ at position $k$, and $N$ is the common length of the strings. Dividing $d(a, b)$ by $N$ gives the value in the range 0 to 1.
2. Levenshtein Distance: In this metric, the distance between two strings is obtained from the insertions, substitutions, and deletions needed to transform one string into the other. The following expression is used for the Levenshtein-based similarity computation:
\[ \frac{|a| + |b| - \mathrm{Lev}_{a,b}(i, j)}{|a| + |b|} \]
where $|a|$ and $|b|$ are the lengths of the phrases being compared and $\mathrm{Lev}$ is the Levenshtein distance. Based on this metric, the sentences in the aligned corpus are compared and the best translation pair is chosen among the sentences.
3. Jaccard Token Matching: This metric splits the sentences into tokens and measures the similarity of the two sentences using the Jaccard coefficient, which ranges from 0 to 1. The Jaccard coefficient is given as:
\[ J(X, Y) = \frac{\min(x, y)}{\max(x, y)} \]
4. Regex: A regular expression (regex) is a pattern language used in natural language processing to represent patterns or sequences of text. For example, the pattern '\s+' matches one or more whitespace characters. Word boundaries ('\b') are used in regex to detect where a pattern starts and ends in a sentence; e.g., '\bcat' matches 'cat' at the start of a word, as in 'cat mat', but not inside a word such as 'concat'.
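The metrics listed above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the Hamming value is normalized by the string length, and the Jaccard coefficient is computed in its common set form |X ∩ Y| / |X ∪ Y| over whitespace-separated tokens, which is an assumption about how the coefficient is applied to sentences.

```python
import re


def hamming_normalized(a, b):
    """Fraction of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    if not a:
        return 0.0
    return sum(ca != cb for ca, cb in zip(a, b)) / len(a)


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """(|a| + |b| - Lev(a, b)) / (|a| + |b|); 1.0 means identical strings."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - levenshtein(a, b)) / total


def tokens(sentence):
    """Split a sentence on runs of whitespace using the regex \\s+."""
    return [t for t in re.split(r"\s+", sentence.strip()) if t]


def jaccard(x, y):
    """Set-based Jaccard coefficient over the tokens of two sentences."""
    sx, sy = set(tokens(x)), set(tokens(y))
    if not sx and not sy:
        return 1.0
    return len(sx & sy) / len(sx | sy)
```

For a candidate sentence pair, the ratio and coefficient can be computed side by side and the pair with the highest similarity kept as the best translation pair.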
4 Methodology
The detailed working of the corpus alignment flow is described in this section and in Fig. 1:
Fig. 1 Methodology
1. Preprocessing: The linguistic preprocessing of the corpus is the most vital step in corpus creation. In this process, the English corpus is first extracted into a text file. The corpus is then tokenized, formatted, and converted into a pickle file after removing its out-of-vocabulary words. The same procedure is followed for Punjabi corpus extraction. After the English corpus is extracted, part of it is translated using the Google API. The comparison is then done sentence-wise, and the alignment ratio is computed. Finally, a simple ratio is calculated by the fuzzy string-matching algorithm.
2. Fuzzy String-Matching Algorithm: The fuzzy string-matching algorithm conducts the evaluation by matching the similarity of sentences using the token ratio. It basically finds the distance between the characters of two tokens. The smallest distance is then selected for the evaluation. It is an approximate string-matching algorithm that finds the token ratio of the phrases in the sentences.
3. Alignment symmetrisation: The approach of alignment symmetrisation aligns the corpus in both directions by taking the union and intersection of the alignments for the two languages being translated. A language-pair translation model developed by IBM is used here; it estimates the conditional probability P(e|p) of any sentence e in English given the Punjabi sentence p. Figure 1 shows translation
from English to Punjabi and from Punjabi to English in asymmetrical form. These templates are called alignments, and they are asymmetrical because of the difference in word order between the two languages. To make these alignments symmetrical, the translations are merged bidirectionally by taking the union or intersection of the individual alignments of the language pair. This symmetrical alignment is called phrase-based symmetrical sentence alignment, and it helps in the accurate extraction of phrases, as shown in Fig. 2.
Alignment symmetrisation is the most vital step in building a parallel corpus. It models the association between two languages either by word-to-word mapping or wordiness (sentence alignment models). The alignment component controls the transition between source and target states, and the translation component controls the emission of target words from those states. In this model, the alignment of one word depends on the alignment of the previous word and the next word.
\[ P(p_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} P(a_j \mid a_{j-1}) \, P(p_j \mid e_{a_j}) \]
In the above equation e-word, J-state, 1 < I …
… m), using the various model order reduction techniques. The transfer function used in the paper is of nth order and is given by Eq. (29), where n = 8. For the study of MOR of a higher-order system to obtain its reduced-order approximation, the eighth-order system with the following transfer function [8, 19] is selected for carrying out the comparative analysis:
\[ T(s) = \frac{18s^7 + 514s^6 + 5982s^5 + 36380s^4 + 122664s^3 + 185760s^2 + 185760s + 40320}{s^8 + 36s^7 + 546s^6 + 4536s^5 + 22449s^4 + 67284s^3 + 118124s^2 + 1095840s + 40320} \tag{29} \]
The order reduction is carried out using MATLAB, and the eighth-order system given in Eq. (29) is reduced to second order using the Hankel norm approximation, Schur decomposition, normalized co-prime factorization (NCF), and balanced stochastic truncation (BST) techniques. Their effects are then compared on the basis of peak overshoot, settling time, and steady-state error when a step input is applied to the system.
Fig. 1 Block diagram for a closed-loop system
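The paper performs this comparison in MATLAB. As a rough, dependency-free illustration of the three comparison parameters, the Python sketch below simulates the unit-step response of a second-order model G(s) = (b1·s + b0)/(s² + a1·s + a0) with a fixed-step Runge–Kutta scheme and extracts the metrics. The definitions used here are assumptions: overshoot is measured relative to the final value, settling time uses a 2% band, and steady-state error is taken against a unit reference. MATLAB's own routines or the authors' definitions may differ, so the numbers this sketch produces need not coincide with those reported in the paper.

```python
def step_response(b1, b0, a1, a0, t_end=20.0, dt=0.001):
    """Unit-step response of G(s) = (b1*s + b0) / (s^2 + a1*s + a0).

    Controllable canonical form: x1' = x2, x2' = -a0*x1 - a1*x2 + u,
    y = b0*x1 + b1*x2, integrated with classical fixed-step RK4.
    """
    def deriv(x1, x2):
        return x2, -a0 * x1 - a1 * x2 + 1.0  # u = 1 (unit step)

    x1 = x2 = 0.0
    times, outputs = [0.0], [0.0]
    steps = int(round(t_end / dt))
    for n in range(1, steps + 1):
        k1 = deriv(x1, x2)
        k2 = deriv(x1 + 0.5 * dt * k1[0], x2 + 0.5 * dt * k1[1])
        k3 = deriv(x1 + 0.5 * dt * k2[0], x2 + 0.5 * dt * k2[1])
        k4 = deriv(x1 + dt * k3[0], x2 + dt * k3[1])
        x1 += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
        x2 += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
        times.append(n * dt)
        outputs.append(b0 * x1 + b1 * x2)
    return times, outputs


def step_metrics(b1, b0, a1, a0, band=0.02):
    """Overshoot (%), settling time (s) and steady-state error of the step response."""
    times, outputs = step_response(b1, b0, a1, a0)
    final = b0 / a0  # DC gain; meaningful only for a stable model
    peak = max(outputs)
    overshoot = max(0.0, (peak - final) / final * 100.0)
    settling = 0.0
    for t, y in zip(times, outputs):
        if abs(y - final) > band * abs(final):
            settling = t  # last instant at which the output is outside the band
    return {
        "final_value": final,
        "overshoot_percent": overshoot,
        "settling_time": settling,
        "steady_state_error": 1.0 - final,  # assumed: error against a unit reference
    }
```

Calling `step_metrics(b1, b0, a1, a0)` with the coefficients of each reduced model gives one row of a comparison of the kind described above; the same routine cannot be applied to the eighth-order system without extending the state-space form.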
A. Gupta and A. K. Manocha
3 Results and Discussion
The system represented by Eq. (29) is reduced to a second-order approximation using each of the four techniques mentioned, and the transfer functions obtained after reduction are given in Table 1. To compare the original and reduced systems, the results are simulated for both using MATLAB. The Bode plot of the original and reduced systems is shown in Fig. 2. The magnitude and phase plots show that the frequency domain responses of the original system and of the reduced systems obtained by all four techniques are almost the same, with no significant error. The gain margin and phase margin of the original and reduced systems are approximately the same. This shows that the frequency response of the system does not change much when the order of its transfer function is reduced from eight to as low as two. Hence, the original system can be replaced by its reduced approximation.
Table 1 Transfer function obtained after applying reduction techniques
Sr. No.  Technique used               Reduced-order transfer function
1        Hankel norm approximation    (15.73s + 4.649)/(s^2 + 6.456s + 5.349)
2        Schur decomposition          (17.92s + 5.342)/(s^2 + 7.534s + 5.501)
3        NCF technique                (17.6s + 4.908)/(s^2 + 7.206s + 5.308)
4        BST technique                (18s + 5.141)/(s^2 + 7.528s + 5.369)
Fig. 2 Bode plot of given original and obtained reduced systems obtained by all techniques
Comparative Analysis of Different Balanced Truncation …
Further, the original and reduced systems are compared through their time domain behavior with a step input applied at the input port of the system. The time domain behavior obtained using MATLAB is shown in Fig. 3. The step responses in Fig. 3 show that many parameters of the reduced and original systems are the same. Rise time and peak overshoot are almost the same in all cases, but the settling time and steady-state error of the reduced system change for all four techniques. The settling time, peak overshoot, and steady-state error of the reduced systems obtained by the four techniques are given in Table 2, which shows that the peak overshoot is almost the same for all four techniques, whereas the steady-state error is smallest for the NCF technique, whose settling time is slightly higher than that of the Hankel norm approximation. The BST technique has higher
Fig. 3 Transient response of given original and obtained reduced systems obtained by all techniques
Table 2 Value of settling time, peak overshoot, and steady-state error for all techniques
Sr. No.  Technique                    Peak overshoot MP (%)  Settling time (s)  Steady-state error
1        Hankel norm approximation    53.70                  6.71               0.127
2        Schur decomposition          54.73                  7.79               0.072
3        NCF technique                53.91                  7.53               0.025
4        BST technique                54.12                  9.68               0.042
settling time than the other three techniques, which is not desirable. The Hankel norm approximation technique has the smallest settling time but the highest steady-state error, which makes it less useful.
4 Conclusion
The study of the four techniques considered in this paper shows that the behavior of the reduced-order system is more convenient to study than that of the original. The Bode plot analysis shows that reducing the order of a higher-order system has little effect on the frequency domain characteristics of the original system. The transient response of the original and reduced systems, on the other hand, shows that some of the parameters change, which introduces an error between the original and reduced systems. This error must be reduced in order to obtain a close approximation of the original system. So, either by reducing the order less aggressively or by choosing the appropriate technique among the four considered in the paper, the most suitable reduced-order approximation of the given higher-order system can be obtained. This study can be extended in the future to develop a new technique that has the least impact on the system indices with improved performance parameters.
References
1. A.C. Antoulas, D.C. Sorensen, S. Gugercin, A survey of model reduction methods for large-scale systems. Contemp. Math. (2006)
2. B.C. Moore, Principal component analysis in linear systems: controllability, observability and model reduction. IEEE Trans. Autom. Control AC 26(1), 17–32 (1981)
3. K. Glover, All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds. Int. J. Control 39(6), 1115–1193 (1984)
4. U.B. Desai, D. Pal, A transformation approach to stochastic model reduction. IEEE Trans. Autom. Control AC 29(12), 1097–1100 (1984)
5. M.G. Safonov, R.Y. Chiang, Model reduction for robust control: a Schur relative error method. Int. J. Adapt. Control Signal Process. 2, 259–272 (1988)
6. M.G. Safonov, R.Y. Chiang, A Schur method for balanced-truncation model reduction. IEEE Trans. Autom. Control 34(7), 729–733 (1989)
7. Y. Shamash, Continued fraction methods for the reduction of discrete-time dynamic systems. Int. J. Control 20(2), 267–275 (1974)
8. Y. Shamash, Linear system reduction using PADE approximation to allow retention of dominant modes. Int. J. Control 21(2), 257–272 (1975)
9. T.C. Chen, C.Y. Chang, Reduction of transfer functions by the stability-equation method. J. Franklin Inst. 308(4), 389–404 (1979)
10. T.N. Lucas, Factor division: a useful algorithm in model reduction. IEE Proc. 130(6), 362–364 (1983)
11. S. Mukherjee, M.R.C. Satakshi, Model order reduction using response matching technique. J. Franklin Inst. 342, 503–519 (2005)
12. T. Reis, T. Stykel, Stability analysis and model order reduction of coupled systems. J. Math. Comput. Model. Dyn. Syst. 13(5), 413–436 (2007)
13. O. Alsmadi, Z. Abo-Hammour, D. Abu-Al-Nadi, S. Saraireh, Soft computing techniques for reduced order modelling: review and application, in Intelligent Automation and Soft Computing (2015)
14. M.A. Le, G. Grepy, Introduction to transfer and motion in fractal media: the geometry of kinetics. Solid State Ionics 9, 10(Part 1), 17–30 (1983)
15. C.B. Vishakarma, R. Prasad, MIMO system reduction using modified pole clustering and genetic algorithm. Modell. Simul. Eng. (2009)
16. B. Philip, J. Pal, An evolutionary computation based approach for reduced order modeling of linear systems, in IEEE International Conference on Computational Intelligence and Computing Research, Coimbatore, pp. 1–8 (2010)
17. A. Sikander, R. Prasad, Linear time invariant system reduction using mixed method approach. Appl. Math. Modell. (2015)
18. A. Narwal, R. Prasad, A novel order reduction approach for LTI systems using cuckoo search and Routh approximation, in IEEE International Advance Computing Conference (IACC), Bangalore, pp. 564–569 (2015)
19. S.K. Tiwari, G. Kaur, An improved method using factor division algorithm for reducing the order of linear dynamical system. Sadhana 41(6), 589–595 (2016)
20. A. Narwal, R. Prasad, Optimization of LTI systems using modified clustering algorithm. IETE Techn. Rev. (2016)
21. A. Sikander, R. Prasad, A new technique for reduced-order modelling of linear time-invariant system. IETE J. Res. (2017)
22. X. Cheng, J. Scherpen, Clustering approach to model order reduction of power networks with distributed controllers. Adv. Comput. Math. (2018)
23. S.K. Tiwari, G. Kaur, Enhanced accuracy in reduced order modeling for linear stable/unstable system. Int. J. Dyn. Control (2019)
Development of a Real-Time Pollution Monitoring System for Green Auditing Ankush Garg, Bhumika Singh, Ekta, Joyendra Roy Biswas, Kartik Madan, and Parth Chopra
Abstract In this work, we introduce a solution to monitor and record the levels of various pollutants generated by textile industries. It is implemented as a wireless sensor network built from ESP12 Wi-Fi module units with various industry-grade sensors that record air, water, and sound pollutants. In the simplest terms, the sensor unit senses pollutants and their levels in parts per million and updates Firebase in real time. The Web app dashboard fetches the data and generates a score for each factory. The Web app can also be accessed by authorities to oversee all industries in an area. The dashboard can be used to view notifications and stay updated on pollution levels. A CEPI score can be generated from the data to classify industries better.
Keywords ESP12E · Sensor · Google’s firebase · Web app
A. Garg · B. Singh · Ekta
Department of Computer Science Engineering, Maharaja Surajmal Institute of Technology, Guru Gobind Singh Indraprastha University, C-4 Janakpuri, New Delhi, India
A. Garg · B. Singh · Ekta · J. R. Biswas (B) · K. Madan · P. Chopra
Department of Electronics and Communication Engineering, Maharaja Surajmal Institute of Technology, Guru Gobind Singh Indraprastha University, C-4 Janakpuri, New Delhi, India
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021
D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_36
1 Introduction
The pollution board of India classifies industrial sectors into four categories depending on the pollution they cause: white, green, orange, and red, with white being the most eco-friendly and red the most hazardous. A score called the comprehensive environmental pollution index (CEPI) is assigned to each sector and decides its category. Almost all textile industries fall under the red category. The score is based on the Air EPI, Water EPI, and Sound EPI. The rigidity of this system acts as a demotivation for smaller-scale industries. Thus, a real-time monitoring system needs to be devised to generate a better CEPI score, which can be visualized on a dashboard and, on a larger scale, used for better administrative purposes.
The ESP has built-in analog-to-digital conversion for reading sensors that are easily available and widely used. These include the MQ7 and MICS 5524 in the air module, pH and temperature sensors in the water module, and a sound sensor. As the environment changes, the values generated are updated over Wi-Fi. When the data arrives, the dashboard application can generate notifications for early action, so the factory owner is able to know exactly what is going wrong. The person having authority over all industries of an area also gets to know which factories are at fault. Additionally, the national dashboard can be used to draw general information about the implementation of national schemes, pollution hotspots, etc.
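The notification step described above can be sketched as a simple limit check on each incoming reading. The field names and the numeric limits below are hypothetical placeholders chosen for illustration; they are not taken from the paper or from any regulatory norm, and a real dashboard would load the applicable limits from configuration.

```python
# Hypothetical (low, high) alert limits per measured quantity; None means
# that side is unbounded. Real limits must come from the applicable norms.
HYPOTHETICAL_LIMITS = {
    "co_ppm": (None, 50.0),
    "ph": (6.5, 8.5),
    "water_temp_c": (None, 40.0),
    "sound_db": (None, 75.0),
}


def check_reading(reading, limits=HYPOTHETICAL_LIMITS):
    """Return a list of human-readable alerts for one sensor reading.

    `reading` maps a quantity name to its latest value; quantities that
    are missing from the reading are skipped.
    """
    alerts = []
    for name, (low, high) in limits.items():
        if name not in reading:
            continue
        value = reading[name]
        if low is not None and value < low:
            alerts.append(f"{name} below limit: {value} < {low}")
        if high is not None and value > high:
            alerts.append(f"{name} above limit: {value} > {high}")
    return alerts
```

An empty list means that no notification is needed; otherwise each string can be shown on the factory dashboard and forwarded to the area authority.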
2 Literature Survey
The author explains the categorization of textile industries on the basis of the CEPI score, where CEPI stands for comprehensive environmental pollution index. Based on this score, textile industries are graded as low, moderate, high, or critical in pollution generation [1]. The author explains the existing CEPI criteria and the revised CEPI criteria. In the revised methodology, factors at the source, pathway, and receptors are considered. Factors at the source include the presence of toxic materials and the scale of industrial activities. At the pathway, the level of exposure to pollutants is considered. Receptors consist of high-risk elements that cause human health issues such as asthma and cancer [2]. In this paper, the author provides a detailed analysis of monitoring quality in real time using low-cost equipment and sensors. The author discusses the use of the cloud for storing data and how it can be cost-effective in the long run [3].
2.1 Sensors
The author describes the utility of wireless sensors and nodes for meeting industry standards and communicating real-time data. A detailed description is provided of how different sensors can be used to measure different substances such as CO2, NO, and SO2, and to measure excess amounts of pollutants in the air [4]. The Firebase documentation covers how to use it and how to get started with different technology stacks, including connecting to Firebase and using its NoSQL database with JavaScript; all the features are explained in detail [5]. In this document, the usage of pH sensors is described in detail, along with how different pH sensors are used in industry [6]. The rules imposed on textile industries in India are provided by the Government of India, and all future work will take these rules into account [7]. Data on air and water contamination published by the Government of India is used to train the ML model precisely [8, 9]. The author describes the distribution of pollution over a district and quantifies it by measuring CEPI to overcome subjectivity in assessing human health factors [10]. The author explains the use of IoT to monitor pollution in air and water. Various sensors are
discussed for this purpose [11]. The author explains the use of IoT to monitor pollution with a varied range of hardware equipment. This equipment is explained in detail, along with its functioning, to convey the basic circuit required for implementation [12]. In this paper, the author provides a detailed analysis of monitoring quality in real time using low-cost equipment and sensors. The author discusses the use of the cloud for storing data and how it can be cost-effective in the long run; the availability of cheap sensors is also discussed [13]. The author describes the utility of wireless sensors and nodes for meeting industry standards and communicating real-time data [14].
2.2 Circuit Description
A circuit description of the electrical equipment is given, which can be used for documentation and further implementation [15]. In this paper, the usage of MQ7 sensors is described in detail, along with how they can be used to measure the quantity of CO (in ppm) accurately over a large range. The author describes how the quantity of one effluent in air affects the nature of other gases [16]. The use of the NodeMCU as a Wi-Fi module is shown, along with how it can power other modules or nodes connected to it; this gives a basic idea of how self-powered systems can be formed [17]. A pollution monitoring system developed using Arduino is proposed, designed to monitor and analyze air quality in real time and to log data [18]. Calibration of the MQ7 is performed to detect hazardous gases such as carbon monoxide, one of the major components emitted by industries [19]. A module is designed using the LM386 integrated audio amplifier with a precision rectifier to detect noise in the signal [20]. The author gives a detailed description of the data preprocessing techniques used for data mining. Raw data is usually susceptible to null values, noise, incompleteness, inconsistency, and outliers, so it is important for the data to be processed prior to mining. Preprocessing is an essential first step to enhance data efficiency [21]. The author conducts an elaborate study of preprocessing in Web log mining [22].
2.3 Monitoring Readings
The author describes various factors that contribute to water pollution and how low-cost equipment can be designed to measure it [23]. All the factors contributing to air pollution are discussed, and the techniques for measuring air effluents are demonstrated theoretically [24]. The author focuses on the application of Firebase with Android, with the aim of making the concept familiar in terms of terminology, benefits and shortcomings [25]. The author proposes a system that can send
A. Garg et al.
text-based messages and media files such as audio, images and videos over the Internet between two users on the network in real time. The author uses Android OS and Google Firebase to drive the information transfer, demonstrating the particularities of both the OS and the employed service [26].
3 Proposed Solution
The solution proposed in this paper relies on the Internet to address the problems described above. It combines low-power equipment such as the ESP12E with industrial sensors. The proposed solution ensures the following:
• Machine-generated values ensure unadulterated readings.
• The sensors are placed at critical points to ensure true data.
• Independent nodes avoid system failures.
• Inbuilt communication modules are used.
• A device failure can be easily detected.
A. DS18B20: It is a temperature sensor that measures temperatures from −55 to +125 °C with an accuracy of ±0.5 °C over the −10 to +85 °C range. As it works on the 1-Wire protocol, numerous sensors can be controlled from a single microcontroller pin. It can be seen as a single-bus digital temperature sensor with characteristics such as miniature size, a wide range of applicable voltages and high-resolution temperature measurement (Fig. 1).
B. ESP01: The model is designed using NodeMCU (ESP01), MQ7, MICS 5524, DS18B20 and sound sensors. The ESP01 is a small module that adds Wi-Fi connectivity to an Arduino or other IoT board. It has a full TCP/IP stack and is available at a very low price. The fact that it has hardly any external components on the module makes it inexpensive in volume as well (Fig. 2). NodeMCU is an open-source IoT platform that includes firmware running on the ESP8266 Wi-Fi SoC by Espressif Systems and hardware based on the ESP-12 module. By default, the term “NodeMCU” refers to the firmware rather than the development kits (Fig. 3).
C. MQ7: The MQ7 is a carbon monoxide sensor suitable for sensing CO concentrations in air. It can detect CO concentrations from 20 to 2000 ppm and outputs its reading as an analog voltage. The operating temperature ranges from −10 to 50 °C, and the sensor draws less than 150 mA at 5 V. Connecting 5 V across the heating (H) pins keeps the sensor warm enough to work properly. Connecting a 5 V supply to either of the pins makes the sensor output an analog voltage on the remaining pins. The sensitivity of the detector depends on the resistive
Development of a Real-Time Pollution Monitoring System … Fig. 1 DS18B20
Fig. 2 ESP-01 setup
Fig. 3 NODE MCU
Fig. 4 MQ7
load placed between the ground and output pins. Calibration is based on the resistance value and the equation given in the datasheet (Fig. 4).
D. MICS 5524: It is a general-purpose metal oxide sensor. Its basic principle is that its resistance changes with the amount of oxygen associated with the gas. It is a MEMS-based sensor generally used to detect indoor leakage of CO and natural gas; it is also appropriate for indoor air quality inspection, breath checking and early fire detection. The sensor is sensitive to carbon monoxide (up to 1000 ppm), ammonia (up to 500 ppm), ethanol (10–500 ppm), H2 (up to 1000 ppm), and methane/propane/iso-butane (over 1000 ppm). However, it cannot tell which gas it has detected. To use it, one must power it with 5 V DC and read the analog voltage off the output pin. When gases are detected, the analog voltage increases in proportion to the amount of gas detected (Fig. 5).
E. Sound Sensor: The sound sensor determines the sound intensity in the environment. The module consists of an amplifier based on the LM386 and
Fig. 5 MICS 5524
Fig. 6 Sound sensor
a microphone. The output from the module is analog and can be sampled and tested by an Arduino or any other compatible board (Fig. 6).
F. Google’s Firebase: Firebase, currently under Google, is a platform that provides a real-time database, app hosting and authentication services. In 2016, Firebase also launched its cloud messaging service, and a year later it launched Cloud Firestore. Firebase is therefore a complete package of analytics, development and growth services (Fig. 7).
G. Web App: A Web-based application, commonly known as a Web app, is a computer program with a client–server architecture in which the client (including the UI and client-side logic) runs in a Web browser. Progressive Web apps use modern Web technologies to deliver good user experiences, and are:
• Reliable: they load almost instantly and work even under uncertain network conditions.
• Fast: they respond fluidly, with smooth animations and no hassle in scrolling.
• Engaging: they feel natural on any device because of an immersive user experience (Fig. 8).
Fig. 7 Google’s firebase new logo
Fig. 8 Progressive Web app
4 System Architecture and Working
As shown in Fig. 9, the ESP12E takes input from the various sensors and updates the Firebase real-time database over Wi-Fi using the Firebase library available for Arduino
Fig. 9 System architecture overview
compatible boards, written in embedded C. The Web app refreshes itself after a set time period, fetches data from Firebase in every cycle and updates the dashboard accordingly. This allows the app to monitor pre-set limits and generate an alert whenever a value exceeds its limit. The alert can be sent to multiple recipients in the form of a notification. The air module includes the MQ7, the MICS5524 and an IR module. The ESP12E reads the values from the MQ7 and MICS5524 one at a time via a multiplexer. The gases covered include carbon monoxide, nitrogen oxides, sulfur oxides, VOCs and ammonia. The IR sensor is used to detect the presence of smoke. The water module includes a pH meter to check the pH of the discharge, and the temperature serves as the basis for checking BOD, COD and conductivity. The sound module checks whether the noise has exceeded the set limit and sends a Boolean accordingly. All this data is sent to the database every 5 s, as per the directions of the pollution control board, while the dashboard refreshes every 2 s (Figs. 10 and 11).
Fig. 10 Sensor reading from pH sensor
Fig. 11 Sensor readings from MICS5524
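The alert step of the monitoring cycle described in this section can be sketched as follows. This is a minimal illustration in Python, not the firmware or Web app code used in the paper; the limit values, the field names and the `check_alerts` helper are hypothetical, since the paper does not list the actual pre-set values.

```python
# Hypothetical pre-set limits; the paper does not state the actual values.
THRESHOLDS = {"co_ppm": 50.0, "ph_min": 6.0, "ph_max": 9.0}


def check_alerts(reading, thresholds=THRESHOLDS):
    """Return alert messages for one set of readings fetched from the database."""
    alerts = []
    if reading["co_ppm"] > thresholds["co_ppm"]:
        alerts.append("CO above limit")
    if not thresholds["ph_min"] <= reading["ph"] <= thresholds["ph_max"]:
        alerts.append("pH out of range")
    # The sound module only reports a Boolean: True when the noise limit was exceeded.
    if reading["noise_exceeded"]:
        alerts.append("Noise above limit")
    return alerts


# One hypothetical record, as the dashboard might fetch it in a refresh cycle.
sample = {"co_ppm": 72.5, "ph": 7.1, "noise_exceeded": True}
print(check_alerts(sample))
```

In a deployment along the lines described above, such a check would run on every dashboard refresh, and any returned messages would be forwarded to the recipients as notifications.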
5 Conclusion and Future Work
This paper presents a cost-effective smart system that monitors, in real time, the effluents generated by one of the most hazardous industries and helps to manage the data through easy visualization techniques. The system also maintains complete transparency among the government, its pollution checking bodies and the industries, regardless of their location. In addition, the system offers textile managers a marketing advantage, as it categorizes industries into different zones, namely white, green, orange and red. The sequence white > green > orange > red runs from the most eco-friendly to the most environmentally hazardous industry. This simple color tag would help them market themselves more widely.
References
1. https://cpcb.nic.in/openpdffile.php?id=TGF0ZXN0RmlsZS9MYXRlc3RfMTIwX0RpcmVjdGlvbnNfb25fUmV2aXNlZF9DRVBJLnBkZg==
2. http://www.indiaenvironmentportal.org.in/files/file/CEPI_COMMENTS.pdf
3. S.R. Enigella, H. Shahnasser, Real time air quality monitoring. In: 2018 10th International Conference on Knowledge and Smart Technology (KST), IEEE, Thailand, August 2018
4. P. Movva Pavani, T. Rao, Real time pollution monitoring using wireless sensors. In: 2016 IEEE 17th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), University of British Columbia, Vancouver, Canada, 13th–15th October 2016
5. https://firebase.google.com/docs/database/web/start
6. https://sensorex.com/ph-sensors-3
7. http://www.indiaenvironmentportal.org.in/files/file/Guidelines_textile_industry_draft.pdf
8. https://data.gov.in/keywords/air-pollution
9. https://data.gov.in/dataset-group-name/water-quality
10. R. Rajamanickam, S. Nagan, Assessment of CEPI of Kuruchi Industrial Sector: a case study. Thiagarajar College of Engineering, Madurai, Tamil Nadu, India (2018)
11. L. Mohan Joshi, IOT based air and sound pollution monitoring system. Int. J. Comput. Appl. (0975–8887) 178(7) (2017)
12. A. Singh, D. Pathak, P. Pandit, S. Patil, P.C. Golar, IOT based air and sound pollution monitoring system. Int. J. Adv. Res. Electr. Electron. Instrum. Eng. 6(3) (2018)
13. T. Anuradha, C.R. Bhakti, D. Pooja, IoT based low cost system for monitoring of water quality in real time. Int. Res. J. Eng. Technol. (IRJET) 5(5) (2018)
14. V. Dhoble, N. Mankar, S. Raut, M. Sharma, IOT based air pollution monitoring and forecasting system using ESP8266. JSRSET 4(7) (2018)
15. K. Chaitanya, K. Shruti, B. Siddhi, M.M. Raste, Sound and air pollution monitoring system. Int. J. Sci. Eng. Res. 8(2) (2017)
16. S. Karamchandani, A. Gonsalves, D. Gupta, Pervasive monitoring of carbon monoxide and methane using air quality prediction. In: 3rd International Conference on Computing for Sustainable Global Development, 16th–18th March 2016
17. V. Vasantha Pradeep, V. Ilaiyaraja, Analysis and control the air quality using NodeMCU. Int. J. Adv. Res. Ideas Innov. Technol. 4(2) (2018)
18. K. Okokpujie, A smart air pollution monitoring system. Int. J. Civil Eng. Technol. 9(9) (2018)
19. C. Nagaraja, Calibration of MQ-7 and detection of hazardous carbon mono-oxide concentration in test canister. Int. J. Adv. Res. 4(1) (2018)
20. G. Han, H. Li, H. Chen, Y. Sun, J. Zhang, S. Wang, Z. Liu, Design and implementation of the noise sensor signal conditioning (2015)
21. W.S. Bhaya, Review of data preprocessing techniques in data mining. J. Eng. Appl. Sci. (2017)
22. S. Peng, Q. Cheng, Research on data preprocessing process in the web log mining. In: 2009 First International Conference on Information Science and Engineering
23. T. Anuradha, C.R. Bhakti, D. Pooja, IoT based low cost system for monitoring of water quality in real time. Int. Res. J. Eng. Technol. (IRJET) 5(5) (2018)
24. V. Dhoble, N. Mankar, S. Raut, M. Sharma, IOT based air pollution monitoring and forecasting system using ESP8266. JSRSET 4(7) (2018)
25. C. Khawas, P. Shah, Application of firebase in android app development-a study. Int. J. Comput. Appl. 179 (2018)
26. N. Chatterjee, S. Chakraborty, A. Decosta, A. Nath, Real-time communication application based on android using google firebase. Int. J. Adv. Res. Comput. Sci. Manag. Stud. 6(4) (2018)
Understanding and Implementing Machine Learning Models with Dummy Variables with Low Variance
Sakshi Jolly and Neha Gupta
Abstract Machine learning is gaining importance in daily life as a means of making predictions from data. The data must be handled in an adequate format; the information and insights gathered from it are identified by the rules we generate, and these rules must stay semantically consistent over time and across requirements. Dummy variables are used to handle categorical variables, which by default have the object type in modeling. Such variables cannot be used directly in a prediction model, so we need to understand the purpose and type of the data collected. The information gathered is further used to identify the objects of the model, and the features gathered affect the accuracy of the model. In machine learning, we compute the categorical variables based on back propagation, and feature selection plays a vital role in managing accuracy. Regression analysis and classification analysis differ in their usage of dummy variables. In this chapter, we do not replace the variables with dummy values; instead, we add a new feature holding the dummy variables. There is a major difference between implementing a classification model and a regression model with the same features. We achieved the highest accuracy of 91% with DBScan as the clustering mechanism.
Keywords Machine learning · Deep learning · Prediction · Feature selection · Modeling
S. Jolly (B) · N. Gupta Faculty of Computer Applications Department, MRIIRS, Faridabad, India e-mail: [email protected] N. Gupta e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_37
S. Jolly and N. Gupta
1 Introduction
Machine learning is the concept of learning from past experience, understanding the present state of a problem and estimating its future, whether the data is continuous or discrete. The type of data makes no difference to how it is implemented, but understanding the type of data acquired from the repository is the most important thing. The scenarios gathered from real-world problems relate to our own lives. Real-time problems differ, and their solutions must be identified from our own perspective; in the same way, we need to identify what is useful for solving the problem in real-time scenarios. Dummy variables are the instances used in modeling: instead of removing a categorical variable, we manipulate and manage it with dummy values [1–3]. In regression analysis, dummy variables play a crucial part; they are expressed as Boolean variables so that they can be understood and used in a better prediction model. Such variables take the value 0 or 1. For instance, if gender is a feature to be used, a one-hot encoder converts every male into 1 and every female into 0. In our project, we use a clustering technique that yields group 1 and group 2, and a dummy value is added as 0 for group 1 and 1 for group 2. We add the dummy value; we do not replace the original. The main question is why dummy variables are needed in our modeling. The reason is that dummy values do not affect data quality: we add them to the existing values rather than replacing them. Different methodologies can help to identify the importance of dummy values, and in this chapter we design the models under two scenarios.
In the first scenario, the model is created without dummy variables; this is the existing system. Here we achieved a certain accuracy, and this accuracy affects the other part, which is the proposed model [4]. In the proposed architecture, we consider a few dummy variables, as mentioned before, to identify the accuracy with different distance metrics. For instance, consider Fig. 1, which shows DBScan without dummy variables; the plot contains noisy information and fails to carry all the information provided for the modeling. In modeling, we have to consider the features required for modeling with dummy variables, and one-hot encoding converts the categorical variables into numerical variables. With this method, we achieved different accuracy levels depending on the algorithm applied to the data. These algorithms are discussed in the following sections [5]. The next section discusses all the algorithms used in this problem; after that, we explain the accuracy achieved on the data without dummy variables [6]. The subsequent section deals with the algorithm of the proposed system, which uses dummy variables and utilizes the categorical variables effectively. The final sections present the sample output of the proposed system and conclude.
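The idea of adding dummy values instead of replacing the original ones can be illustrated with a short sketch. The Python code below is only an assumption-laden example: the record layout, the column names `gender_dummy` and `group_dummy`, and the helper `add_dummy_columns` are hypothetical and do not come from the chapter.

```python
def add_dummy_columns(rows):
    """Return copies of the rows with dummy features added; originals stay untouched."""
    gender_code = {"M": 1, "F": 0}                # male -> 1, female -> 0, as in the text
    group_code = {"group 1": 0, "group 2": 1}     # cluster group -> dummy value
    out = []
    for row in rows:
        new_row = dict(row)                       # copy, so existing values are kept
        new_row["gender_dummy"] = gender_code[row["gender"]]
        new_row["group_dummy"] = group_code[row["cluster"]]
        out.append(new_row)
    return out


# Hypothetical records; "cluster" stands for the group found by a clustering step.
records = [
    {"gender": "M", "age": 34, "cluster": "group 1"},
    {"gender": "F", "age": 29, "cluster": "group 2"},
]
encoded = add_dummy_columns(records)
print(encoded[1])
```

Because each row is copied before the new keys are written, the categorical values remain available next to their numerical dummy counterparts.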
Understanding and Implementing Machine Learning Models …
Fig. 1 EM modeling with some noise information
2 Models Used
Modeling is the art of realizing a novel concept in a model, and a list of algorithms is used in this architecture. By utilizing different distance metrics, we obtained the most accurate result with one of the algorithms, as explained later in this section.
a. EM Model
The EM model is the expectation maximization algorithm, which deals with the highest likelihood of the occurrence of an event. Here, an iterative method is used to capture and maximize the data likelihood. Figure 1 illustrates the EM clustering model. In the image, the points of the cluster are not all connected; in this scenario, capturing the accuracy is a problem [7].
b. Hierarchical Clustering
The data points are clustered in the form of dendrograms, which plot the variables at various positions in the work space with respect to their impact on the data; the accuracy differs from case to case. Figure 2 shows a sample dendrogram for hierarchical clustering.
c. K-Means
K-means is another clustering methodology, based on the nearest distance between data points; it captures the category of the information plotted in a graph or any other visual representation of the data points. Figure 3 represents the K-means algorithm [8–11].
Fig. 2 Dendrogram for hierarchical clustering
Fig. 3 K-means clustering mechanism based on distance matrices
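As an illustration of the nearest-distance idea behind K-means, a minimal Python sketch of Lloyd's algorithm is given below. It is not the implementation used in the chapter; the function name `kmeans`, the choice of the first k points as initial centroids, the Euclidean distance and the sample data are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with Euclidean distance; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional data with two well-separated groups.
data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.9, 1.1), (8.1, 7.9)]
labels, centroids = kmeans(data, k=2)
print(labels)
```

Seeding with the first k points keeps the sketch deterministic; practical implementations usually prefer randomized or k-means++ style initialization.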
d. One-hot encoding
This is the common operation performed to handle categorical variables. These are variables of object type that need to be converted before the data is plotted. When such a feature needs to be plotted, it is converted from the object type to a numerical type using one-hot encoding. Figure 4 shows a sample of one-hot encoding [12].
Fig. 4 One-hot encoding
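The conversion from object type to numerical type can be sketched in a few lines of Python. The helper `one_hot` and the sample values are hypothetical and serve only to show the shape of the encoded output; library encoders offer the same operation with more options.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))               # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                          # exactly one position is set
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]         # hypothetical categorical feature
categories, encoded = one_hot(colors)
print(categories)
print(encoded)
```

Each row contains a single 1, so the encoded feature carries the category membership without implying any order between the categories.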
The next section explains the implementation of the existing system, which consists of implementing the DBScan algorithm without dummy variables.
3 Existing System
The existing system consists of implementing the DBScan algorithm without dummy variables. For instance, consider the algorithm with and without dummy variables; we have to check the case without conversion to dummy variables. In some cases, important attributes such as gender must be considered. For example, consider designing a regression model that takes a person's gender, age and weight to predict height. If the categorical variables are not handled, for example by converting every M into 1 and every F into 0, the prediction model cannot be built. If we skip these dummy variables, we may lose model accuracy. The existing system also has another kind of implementation: if dummy variables are implemented, what causes errors in the model, and how can they be handled [13, 14]? The dummy variable trap is the concept to focus on while modeling. The more dummy variables we use, the less accuracy we get. This matters most when a categorical variable has k categories. If there are k categories, only k − 1 dummy variables should be included, as the categories are indexed from 0 to k − 1 [15, 16]. The main problem of working without dummy variables is high variance and low similarity, which causes the highest error rate in the model. For example, Fig. 5 shows the implementation of the model without dummy variables, with the highest variance. This high variance leads to the highest failure rate in the model. For that purpose, we need to use DBScan with an optimal value for epsilon. This concept is explained in the next section, on the proposed system, with the sample output [17].
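The k − 1 rule behind the dummy variable trap can be made concrete with a small sketch. The Python code below is illustrative only; the helper `dummy_encode`, its `drop_first` parameter and the sample levels are assumptions made for this example.

```python
def dummy_encode(values, drop_first=True):
    """Encode categories as 0/1 columns; optionally drop the first one as baseline."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    rows = [[1 if v == cat else 0 for cat in kept] for v in values]
    return kept, rows


levels = ["A", "B", "C", "A"]                      # hypothetical feature with k = 3

# With all k columns, every row sums to 1, so any one column can be derived
# from the others; this redundancy is the dummy variable trap.
all_cols, full = dummy_encode(levels, drop_first=False)
print([sum(row) for row in full])

# With k - 1 columns, the dropped category is represented by an all-zero row.
kept, reduced = dummy_encode(levels)
print(kept, reduced)
```

Dropping one column loses no information, because the baseline category is implied whenever all remaining dummy columns are 0.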
Fig. 5 High variance with low similarity
The causes of high variance are as follows.
i. Model failure
ii. Repeated model generation
iii. Including a large number of unwanted dummy variables
iv. Excluding the required features
v. High cost function
These factors lead to the failure of the model, and repeated generation of the same model can lead to the biggest mistake of the regression analysis.
4 Proposed System
The proposed system implements the model with the highest accuracy using different algorithms and checks the model with the required number of dummy variables. In this scenario, we can implement the dummy variables and assign them a specific notation to understand where and what to use [18]. The implementation follows the algorithm given below [19].
Step 1 Start the program
Step 2 DBSCAN(D, epos, MinPots)
Step 3 C = 0
Step 4 for each unvisited point P1 in dataset D
Step 5   mark P1 as visited
Step 6   NeighborPots = regionQuery(P1, epos)
Step 7   if sizeof(NeighborPots) < MinPots
Step 8     mark P1 as NOISE
Step 9   else
Step 10    dum = next cluster
Step 11    expandCluster(P1, NeighborPots, dum, epos, MinPots) //Function Call
Step 12 expandCluster(P1, NeighborPots, dum, epos, MinPots) //Function Definition
Step 13   add P1 to cluster dum
Step 14   for each point P′ in NeighborPots
Step 15     if P′ is not visited
Step 16       mark P′ as visited
Step 17       NeighborPots′ = regionQuery(P′, epos)
Step 18       if sizeof(NeighborPots′) >= MinPots [20]
Step 19         NeighborPots = NeighborPots joined with NeighborPots′
Step 20     if P′ is not yet member of any cluster
Step 21       add P′ to cluster dum
regionQuery(P, epos)
Step 22   return all points within P's
Step 23   epos-neighborhood (including P)
Distance: $d(i,j) = \frac{1}{m}\sum_{f=1}^{m} d_{ij}^{(f)}$ (adding the dummy value and applying the distance function to the optimal value)
Step 24 for each point pp in NP
Step 25   NPP = regionQuery(pp, epos)
Step 26   NP = NP joined with NPP
Step 27 end while
Step 28 if pp is not yet member
Step 29   add pp to cluster dum
Step 30 end class
Step 31 }
Step 32 regionQuery(p, epos)
Step 33   return all points within pp's epos-neighborhood (including p), where $\mathrm{dist}(d_x, d_y) = \sqrt{(d_{x,1} - d_{y,1})^2 + \dots + (d_{x,n} - d_{y,n})^2}$; condition-based clustering: $\mathrm{dist}(d_x, d_y) < EPS$
Step 34 Stop the Program
There is a static pseudocode for the sample implementation as follows:
MinMaxClustering(dataset)
Find min and max points of the dataset.
Step 3: if d(min,max) =MinPots
    form cluster
    mark as visited
else
    mark all the points as NOISE
else if d(min,max) >= 2(epos)+1
    MinMaxBisection(min,max,dataset)
    MinMaxClustering(dataset1)
    MinMaxClustering(dataset2)
else
{
    for each point P in dataset
        if P is visited
            continue with the next point
        NeighborPots = getNeighborPots(P,epos)
        if size-of(NeighborPots) < MinPots
            mark P as NOISE
}
The above pseudocode implements DBScan clustering using the min and max points. The distance between the min and max points is compared with the epos value, and the dataset is compared with MinPots; when the condition holds, the points are marked as noise, otherwise the else branch runs and finds NeighborPots. The results are presented in the next section. Figure 6 indicates the low variance of the DBScan implementation, which achieved the highest accuracy with the dummy variables and the optimal value for epsilon.
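For readers who prefer runnable code, the DBSCAN steps listed in this section can be sketched in Python as follows. This is a plain textbook-style sketch under stated assumptions, not the authors' implementation: it uses the Euclidean distance with the strict condition dist < eps, the names `region_query` and `dbscan` are chosen here for illustration, and the sample points are hypothetical. Dummy features would simply appear as additional numerical coordinates of each point.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) < eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1                       # start the next cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:                       # expand the cluster
            j = queue.pop()
            if labels[j] == NOISE:         # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep growing
    return labels


# Hypothetical data: two dense groups and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))
```

The choice of eps strongly affects the outcome, which is consistent with the emphasis placed in this chapter on finding an optimal value for epsilon.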
5 Results
The results reflect different approaches with different inputs. The inputs considered help to understand each method of implementing different dummy variables and different algorithms with different features. The result is that DBScan with an optimal value for epsilon has a large impact on understanding the need to handle the dummy variables, as shown below (Table 1). Figure 6 indicates that using the optimal value for epsilon has a considerable impact on the implementation of machine learning in different prediction models. Table 2 gives an overview of the accuracy achieved in the implementation of the proposed architecture [21].
Fig. 6 DBScan with low variance
Table 1 Comparison chart
Dataset   WOCIL   OCIL   WKM   EWKM   Proposed method   Difference
Credit    74      69     75    77     86.4              12 ± 14%
Heart     81      76     73    75     87.2              6 ± 7%
Table 2 Accuracy measure table
S. No   Clustering algorithm      Accuracy, %   Computing time, s
1       EM clustering             69.5          3.2
2       Hierarchical clustering   56.1          0.06
3       K-means                   60.82         0.02
4       Cob web                   42.68         0.02
5       DBScan                    91            0.01
6       Farthest first            57.44         0.01
7       Filtered first            60.82         0.02
8       Make density-based        61.8          0.03
Fig. 7 Result of different comparisons in the form of dummy variables (series: Credit, Heart; categories: WOCIL, OCIL, WKM, EWKM, Proposed method; vertical axis: 0 to 100)
6 Conclusion
Machine learning provides many features that can be used to implement different scenarios in real-time applications. In this work, we implemented dummy variables successfully with DBScan, increasing the probability of identifying related items in the model, which helps in understanding the real importance of feature selection and extraction. In feature selection, we consider the variables that give the highest similarity and the lowest variance, and we achieved 91% accuracy with DBScan by implementing and handling the dummy variables.
References
1. K.B. To, L.M. Napolitano, Common complications in the critically ill patient. Surg. Clinics North Amer. 92(6), 1519–1557 (2012)
2. C.M. Wollschlager, A.R. Conrad, Common complications in critically ill patients. Disease-a-Month 34(5), 225–293 (1988)
3. S.V. Desai, T.J. Law, D.M. Needham, Long-term complications of critical care. Critical Care Med. 39(2), 371–379 (2011)
4. N.A. Halpern, S.M. Pastores, J.M. Oropello, V. Kvetan, Critical care medicine in the United States: addressing the intensivist shortage and image of the specialty. Critical Care Med. 41(12), 2754–2761 (2013)
5. A.E.W. Johnson, M.M. Ghassemi, S. Nemati, K.E. Niehaus, D.A. Clifton, G.D. Clifford, Machine learning and decision support in critical care. Proc. IEEE 104(2), 444–466 (2016)
6. S. Saria, D. Koller, A. Penn, Learning individual and population level traits from clinical temporal data, in Neural Information Processing Systems (NIPS), Predictive Models Personalized Medicine Workshop, 2010
7. O. Badawi et al., Making big data useful for health care: a summary of the inaugural MIT critical data conference. JMIR Med. Informat. 2(2), e22 (2014)
8. C.K. Reddy, C.C. Aggarwal, Healthcare data analytics, vol. 36 (CRC Press, Boca Raton, FL, USA, 2015)
9. D. Gotz, H. Stavropoulos, J. Sun, F. Wang, ICDA: a platform for intelligent care delivery analytics, in Proceedings of AMIA Annual Symposium, 2012, pp. 264–273
10. A. Perer, J. Sun, MatrixFlow: temporal network visual analytics to track symptom evolution during disease progression, in Proceedings of AMIA Annual Symposium, 2012, pp. 716–725
11. S. Jolly, N. Gupta, Handling mislaid/missing data to attain data trait, published in IJITEE. ISSN: 2278–3075, 8(12), 4308–4311 (2019)
12. S. Jolly, N. Gupta, Higher dimensional data access and management with improved distance metric access for higher dimensional non-linear data, published in IJRTE 8(4) (2019)
13. Y. Mao, W. Chen, Y. Chen, C. Lu, M. Kollef, T. Bailey, An integrated data mining approach to real-time clinical monitoring and deterioration warning, in Proceedings of 18th ACM SIGKDD International Conference on Knowledge Discovery Data Mining, 2012, pp. 1140–1148
14. J. Wiens, E. Horvitz, J.V. Guttag, Patient risk stratification for hospital-associated C. Diff as a time-series classification task, in Proceedings of Advances in Neural Information Processing Systems, 2012, pp. 467–475
15. S. Jolly, N. Gupta, Extemporizing the data trait, published in IJETT 58(2), April issue
16. S. Jolly, N. Gupta, Data quality outflow in cloud computing, published in INDIACom-2016 at IEEE Xplore
17. R. Dürichen, M.A.F. Pimentel, L. Clifton, A. Schweikard, D.A. Clifton, Multitask Gaussian processes for multivariate physiological time-series analysis. IEEE Trans. Biomed. Eng. 62(1), 314–322 (2015)
18. S. Jolly, N. Gupta, AI proposition for crypt information management with maximized em modelling, published in IJEAT, November issue
19. M. Ghassemi et al., A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data, in Proceedings of AAAI Conference on Artificial Intelligence, 2015, pp. 446–453
20. S. Jolly, N. Gupta, An overview on evocations of data quality at ETL stage, March 15. https://www.researchgate.net/publication/276922204
21. I. Batal, H. Valizadegan, G.F. Cooper, M. Hauskrecht, A pattern mining approach for classifying multivariate temporal data, in Proceedings of IEEE International Conference on Bioinformatics Biomedicine (BIBM), 2011, pp. 358–365
Assessment of Latent Fingerprint Image Quality Based on Level 1, Level 2, and Texture Features
Diwakar Agarwal and Atul Bansal
Abstract Matching a latent print obtained at a crime scene against the database stored at law enforcement agencies is the most important forensic application. The performance of an automated latent fingerprint matcher is limited by the unwanted appearance or poor quality of the latent prints. For this reason, latent fingerprint investigators are needed for feature markup and quality value determination. However, the reliability and consistency of manual assessment are significantly affected by various factors involved in the forensic examination. This paper proposes an algorithm to determine latent fingerprint image quality through feature extraction followed by a k-means classifier. The feature vector consists of the ridge clarity values, the number of extracted minutiae, the average quality of the minutiae, the area of the convex hull enclosing all minutiae and textural feature values. Experimental results show that the identification rate of the minutiae-based latent fingerprint matcher improves after query latent prints of unacceptable quality are rejected.
Keywords Fingerprint quality · K-means · Latent fingerprints · Ridge clarity · Texture
D. Agarwal (B) · A. Bansal (B) Electronics & Communication Engineering, GLA University, Mathura 17 Km Stone, NH-2, Mathura-Delhi Road, Uttar Pradesh, India e-mail: [email protected] A. Bansal e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_38
1 Introduction
Latent fingerprints have been one of the prominent forms of evidence in forensic investigation for more than a century. Law enforcement agencies often collect and record the tenprints of detained criminals in two forms: rolled and plain (Fig. 1 shows an example) [1]. These fingerprints are either inked or scanned and used as reference (exemplar) fingerprints for latent fingerprint identification. The acquisition of the
489
490
D. Agarwal and A. Bansal
Fig. 1 Three types of fingerprint images: a rolled, b plain, and c latent [5]
reference fingerprint is supervised by the operator, so that the poor quality impression is recaptured in order to ensure the quality of the fingerprint. Unlike rolled and plain fingerprints, latent fingerprints are smeared, blurred, have small area, and inherently corrupted by nonlinear distortion. The bad and poor quality latent prints lead to two significant drawbacks: (i) The performance of an automated latent fingerprint identification is greatly affected [2], and (ii) in ACE-V method [3], an erroneous decision about the amount of information present in the fingerprint may lead to miss the opportunity to identify the actual suspect and may waste the examiner’s time for fruitless comparison [4]. Determination of the latent quality may improve the identification performance by rejecting unacceptable form of the query latent prints collected from the crime scene. Also, the latent quality aids the forensic experts in knowing whether the sufficient amount of information such as minutiae and pores is present in the query latent or not. This paper presents a method which is based on minutiae and texture features to quantify the latent fingerprint quality. The three main contributions of this paper are as follows: (i) An unsupervised classification of the latent prints in the unlabeled dataset is performed with the help of k-means classifier; (ii) an automated latent quality assessment is proposed which avoids manual feature markup by the forensic experts; and (iii) rejection of poor quality latents considerably improves the identification rate of the Level 2 (minutiae) feature-based automated latent fingerprint matcher. In this work, fingerprint ridge clarity, number of extracted minutiae, average quality of the minutiae, area of the convex hull including all minutiae and texture feature values of each latent are determined and concatenated to form the feature vector. The experimental database so used is IIITD Latent Database [6].
Assessment of Latent Fingerprint Image Quality Based …
2 Related Work
A number of algorithms for estimating latent fingerprint image quality are present in the literature. Existing algorithms are distinguished by the type of features that are extracted and annotated by the latent examiner. Fronthaler et al. [7] quantified several fingerprint impairments such as noise, blur, and lack of structure by studying the fingerprint's oriented tensor features. Their algorithm was designed for high-quality fingerprints, not for latent fingerprints. Hicklin et al. [8] were the first to develop Latent Quality Assessment Software (LQAS), which was used by the forensic community to annotate the ridge clarity and quality map. Ulery et al. [4] developed a model of the relationship between latent value determination and the features marked by certified latent examiners. Several features mentioned in the guidelines of the extended feature set [9] were manually marked. Experimental results showed that minutiae count and value determination are strongly associated with each other and that both suffer from a lack of reproducibility. Yoon et al. [10] proposed the latent fingerprint image quality (LFIQ) algorithm for the identification of latents in “lights-out” mode. The algorithm was primarily based on the fingerprint ridge clarity map and the total number of detected minutiae. Rejecting 50% of the poor quality latents increased the rank-100 latent identification rate from 69 to 86%. The LFIQ was further enhanced in [2] by including other features such as ridge structure connectivity, minutiae reliability, and position of the finger. Although the modified LFIQ resulted in good identification performance, it still relied on manual marking of the minutiae, which is both laborious and subjective. Sankaran et al. [11] developed an automated local ridge clarity and local quality estimation algorithm. Emphasis was put on Level 1 features, which were incorporated through the 2-D linear symmetric structure tensor and the orientation field. Sankaran et al. [12] investigated the quality of latents lifted from eight different surfaces. Cao et al. [13] proposed a fully automatic method for latent value determination based on minutiae, ridge quality, ridge flow, and core and delta points. The classification accuracy reached 79.5%, better than the 76.9% of Yoon et al. [10].
3 Feature Extraction for Latent Fingerprint Quality Assessment
In general, the features that can be used in determining latent fingerprint quality are categorized into qualitative and quantitative features. Qualitative features measure the sufficiency of ridge structure in the given query latent for further examination, whereas quantitative features measure the sufficiency of Level 2 and Level 3 features for the purpose of identification. In this paper, qualitative features are represented by the ridge clarity map and texture. The quantitative features are based on Level 2 features and are represented by the number of extracted minutiae, the average quality of the minutiae, and the size of the convex hull including all minutiae. Table 1 summarizes the types of features, which are later concatenated to form the feature vector.

Table 1 Feature vector used for quality assessment
S. No | Type of features | Number of feature values | Description
1 | Fingerprint ridge clarity | 64 | Ridge clarity values of the 8 × 8 local ridge clarity map, which is computed after 32 × 32 block-wise partitioning of the 256 × 256 input latent fingerprint image
2 | Texture | 512 | Each 32 × 32 block of the 256 × 256 input latent fingerprint image is filtered by 8 Gabor filters and generates 8 texture values per block
3 | Number of extracted minutiae | 1 | Total number of minutiae extracted in the query latent
4 | Average quality of the minutiae | 1 | Average minutiae quality computed from the ridge clarity map
5 | Size of the convex hull including all minutiae | 1 | Area covered by all minutiae
3.1 Ridge Clarity Feature
The ridge clarity of the given latent fingerprint image I is defined as the quality of the local ridge pattern, which is captured by the 2-D structure tensor [14]. Some regions of the latent have a good interleaving pattern of ridges and valleys, whereas other regions are smudgy and blurred. The second-order structure tensor is able to capture a uniform geometric pattern [15] if one is present in the fingerprint. Thus, the fingerprint ridge clarity is one of the discriminative qualitative features, represented by the ridge clarity map RC. The determination of RC involves the following steps.
1. Smooth I (256 × 256) to obtain I* by applying a Gaussian filter of size 3 × 3 with standard deviation 0.5.
2. Compute the first-order derivatives, i.e., the gradients ∂x and ∂y, by using the Sobel operator at each pixel of I*.
3. Compute the 2-D structure tensor J as given by (1) at each pixel of I*.
$$J = \begin{bmatrix} \partial_x^2 & \partial_x \partial_y \\ \partial_x \partial_y & \partial_y^2 \end{bmatrix} \tag{1}$$
4. Since the structure tensor contains ridge orientation information, its eigenvalues can be used to analyze the existence of local linear symmetries in an image. Two eigenvalues μ1 and μ2, with μ1 > μ2, are computed from the structure tensor J. It was reported in [16] that the larger eigenvalue indicates the local edge strength. Figure 2 shows the maximum eigenvalue response at each pixel.
5. Divide the maximum eigenvalue response into 64 blocks of size 32 × 32 arranged in an 8 × 8 grid. Each block EV_b is averaged over its N_b pixels to obtain the 8 × 8 ridge clarity map, as given in (2). The average eigenvalue of the block EV_b is the ridge clarity feature value of that block.
$$RC = \frac{1}{N_b} \sum_{i,j} EV_b(i,j) \tag{2}$$
Fig. 2 Maximum eigenvalue response of two different latents: a and c the latents, b and d their maximum eigenvalue responses; brighter regions show good ridge clarity
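The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the helper names (`gaussian_kernel`, `filter2d`, `ridge_clarity_map`) and the edge-replicating border handling are assumptions, and no local averaging of the tensor entries is applied because the text does not specify one. Under that reading the larger eigenvalue reduces to the squared gradient magnitude.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def gaussian_kernel(size=3, sigma=0.5):
    """Normalized 2-D Gaussian kernel (3 x 3, sigma 0.5 as in step 1)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()


def filter2d(img, kernel):
    """Same-size 2-D correlation; borders are edge-replicated (an assumption)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0],
                                           dx:dx + img.shape[1]]
    return out


def ridge_clarity_map(image, block=32):
    """Return the block-averaged maximum-eigenvalue map (8 x 8 for 256 x 256)."""
    img = np.asarray(image, dtype=float)
    smooth = filter2d(img, gaussian_kernel(3, 0.5))      # step 1
    gx = filter2d(smooth, SOBEL_X)                       # step 2
    gy = filter2d(smooth, SOBEL_Y)
    jxx, jxy, jyy = gx * gx, gx * gy, gy * gy            # step 3, Eq. (1)
    # Step 4: larger eigenvalue of the symmetric 2 x 2 tensor, in closed form.
    mu1 = 0.5 * (jxx + jyy + np.sqrt((jxx - jyy) ** 2 + 4.0 * jxy ** 2))
    # Step 5: average each block x block cell, Eq. (2).
    h, w = mu1.shape
    return mu1.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
```

Flattening the returned 8 × 8 array gives the 64 ridge clarity values listed in Table 1.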
3.2 Texture Feature
Rao and Jain [17] observed that the ridge and valley pattern of a fingerprint is arranged as an oriented texture field. Most textured images differ significantly in their limited number of dominant spatial frequencies [18–20]. Jain et al. [21] stated that different texture regions can be distinguished by decomposing the texture into several spatial frequencies and orientations. Texture inherently carries both local and global information present in the fingerprint, which makes it a desirable qualitative characteristic for measuring latent fingerprint quality. In this paper, the algorithm implemented by Jain et al. [21] is used with slight modifications to extract texture features. Jain et al. [21] extracted the core point as the reference point in the plain scanned fingerprint for neighborhood tessellation in order to represent the texture. Unlike the fingerprints used by Jain et al. [21], latent fingerprints are of poor quality and small in size; therefore, it is difficult to extract the core point in latent fingerprints. In this work, the input image is partitioned into equal-size square blocks rather than tessellated into circular sectors around the core point. The steps required for the computation of texture features are given below.
1. Generate a bank of eight Gabor filters of the same frequency with different orientations, i.e., from 0° to 157.5° in steps of 22.5°. The frequency, computed by applying the method of Hong et al. [22], is 0.14 pix⁻¹. Figure 3 shows the spatial convolution kernel of the first four Gabor filters.
2. An input latent fingerprint image of size 256 × 256 is filtered by the eight Gabor filters to form a set of eight filtered images. Figure 4 shows the Gabor-filtered images at four different orientations.
3. Each filtered image in the set is partitioned into non-overlapping equal-size blocks of 32 × 32, resulting in 64 cells per filtered image.
4. The absolute average intensity of each cell in each of the eight filtered images is taken as a feature value of that cell. Thus, each cell is represented by 8 values, which are later concatenated column-wise to form the texture feature vector TF of 512 values (64 × 8).
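The texture steps above can be sketched as follows. Several details are assumptions made only for illustration: the Gaussian envelope width (`sigma=4.0`), the kernel size (33 × 33), FFT-based circular convolution, the reading of "absolute average intensity" as the mean of the absolute filter response, and the order in which the 8 values per cell are concatenated. Only the frequency (0.14 pix⁻¹), the eight orientations, and the 32 × 32 blocks come from the text.

```python
import numpy as np


def gabor_kernel(theta_deg, freq=0.14, sigma=4.0, size=33):
    """Real (even-symmetric) Gabor kernel; sigma and size are assumed values."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    t = np.deg2rad(theta_deg)
    xr = x * np.cos(t) + y * np.sin(t)
    yr = -x * np.sin(t) + y * np.cos(t)
    envelope = np.exp(-0.5 * (xr ** 2 + yr ** 2) / sigma ** 2)
    return envelope * np.cos(2.0 * np.pi * freq * xr)


def fft_filter(img, kernel):
    """Circular convolution via FFT, with the kernel centre moved to (0, 0)."""
    h, w = img.shape
    kh, kw = kernel.shape
    padded = np.zeros((h, w))
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(padded)))


def texture_features(image, block=32):
    """Return 512 values: 8 orientation responses for each of the 64 cells."""
    img = np.asarray(image, dtype=float)
    img = img - img.mean()          # remove the DC offset before filtering
    h, w = img.shape
    per_orientation = []
    for k in range(8):              # 0, 22.5, ..., 157.5 degrees
        response = np.abs(fft_filter(img, gabor_kernel(22.5 * k)))
        cells = response.reshape(h // block, block,
                                 w // block, block).mean(axis=(1, 3))
        per_orientation.append(cells)
    # Shape (8, 8, 8): cell row, cell column, orientation.
    return np.stack(per_orientation, axis=-1).reshape(-1)
```

For an image whose ridges vary only along the horizontal axis at about 0.14 cycles per pixel, the strongest response in an interior cell comes from the 0° filter, which is a quick sanity check on the orientation convention.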
Fig. 3 Real part of the spatial convolution kernel at frequency 0.14 pix⁻¹ with orientations a θ = 0°, b θ = 22.5°, c θ = 45°, and d θ = 67.5°

Fig. 4 Gabor-filtered images at frequency 0.14 pix⁻¹ with orientations 0°, 22.5°, 45°, and 67.5° of the two latents shown in Fig. 2: a–d Gabor-filtered images of the latent shown in Fig. 2a, e–h Gabor-filtered images of the latent shown in Fig. 2c
3.3 Minutiae Feature
Minutiae properties are the most significant features in latent fingerprint identification. Although the average number of minutiae for latent prints in NIST SD27 [23] is only 27, compared with 106 for their mated rolled prints [24], the minutiae information cannot be disregarded. In this paper, minutiae-based properties, namely the total number of extracted minutiae N_m, the average quality of the minutiae Q_m, and the size of the convex hull including all minutiae C_m, are considered as quantitative feature values. Minutiae in the latent fingerprint are detected by applying the algorithm proposed by Abraham et al. [25]. The main purpose of using this algorithm is to remove spurious minutiae by using contextual and orientation information around the detected minutiae. Figure 5 shows the extracted minutiae and the convex hull enclosing those minutiae. The average quality of the minutiae obtained from the ridge clarity map is given by (3).

$$Q_m = \frac{1}{N_m} \sum_{i=1}^{N_m} RC\left(b_x^i, b_y^i\right) \tag{3}$$

where (b_x^i, b_y^i) is the block-wise position of the ith minutia and N_m is the number of detected minutiae.
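Given a list of minutiae positions and a ridge clarity map, the three quantitative features can be computed as in the sketch below. Minutiae detection itself (the algorithm of Abraham et al. [25]) is not reproduced. The function names, the (x, y) integer pixel convention, and the mapping of a pixel to its block by integer division by 32 are assumptions made for illustration; the hull uses Andrew's monotone chain and the area uses the shoelace formula.

```python
def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def minutiae_features(minutiae, rc_map, block=32):
    """Return (N_m, Q_m, C_m) for minutiae given as (x, y) integer pixels.

    rc_map is indexed as rc_map[block_row][block_col] (an assumption).
    """
    n_m = len(minutiae)
    if n_m == 0:
        return 0, 0.0, 0.0
    # Eq. (3): mean ridge clarity of the blocks that contain the minutiae.
    q_m = sum(rc_map[y // block][x // block] for x, y in minutiae) / n_m
    c_m = polygon_area(convex_hull(minutiae))
    return n_m, q_m, c_m
```

As a usage example, five minutiae at the corners and centre of a 64 × 64 pixel square give N_m = 5 and a hull area of 4096 square pixels, with Q_m equal to the mean of the five looked-up block values.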
Fig. 5 Minutiae extraction and convex hull of two latents shown in Fig. 2a and c extracted minutiae of the latent in Fig. 2a, b and d extracted minutiae of the latent in Fig. 2c
4 K-Means Clustering-Based Classification
Clustering refers to partitioning a group of objects into subgroups according to the properties of each object. The k-means clustering algorithm [26] assigns every data point to the nearest cluster by computing the distance between each data point and the “k” given cluster centers. Latent fingerprint quality assessment is treated as a two-class classification problem: each latent fingerprint in the given dataset is assigned to one of two clusters, namely latent with quality and latent without quality, represented by LWQ and $\overline{\mathrm{LWQ}}$, respectively. Each latent fingerprint is represented by the feature vector x given in (4), which is the concatenation of the estimated qualitative and quantitative features.

$$x = (RC, TF, N_m, Q_m, C_m) \tag{4}$$
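A minimal sketch of two-cluster k-means (Lloyd's algorithm) on such feature vectors is given below. It is not the authors' implementation: the random initialization, the squared Euclidean distance, the stopping rule, and the function name `kmeans` are assumptions, and the paper does not state whether the 579 feature values (64 + 512 + 3) are rescaled before clustering.

```python
import numpy as np


def kmeans(X, k=2, n_iter=100, seed=0):
    """Lloyd's algorithm; returns labels, centers, and within-cluster sums."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    within = np.array([d[labels == j, j].sum() for j in range(k)])
    return labels, centers, within
```

The third return value is a per-cluster sum of squared point-to-center distances, which is one common definition of the within-cluster sum of distances; whether it matches the quantity reported later in Table 2 is not stated in the text.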
Table 2 Within-cluster sum of distance of two clusters
Cluster index | Within-cluster sum of distance
1 | 30.74
2 | 74.56
5 Experimental Results
5.1 Database
Latent fingerprint quality assessment is evaluated on the IIITD Latent Database [6]. This database consists of multiple instances of the latent prints of all 10 fingers of each of 15 subjects, lifted by using a brush and black powder dusting process. It contains 1046 latent prints on two different backgrounds, card and tile, along with 150 mated plain scanned fingerprints. Since the database is unlabeled, an experimental database of 150 latent fingerprints has been formed by selecting 75 fingerprints of agreed quality (not too blurred) and 75 fingerprints of disagreed quality (not too fine) on the card background. Thus, the experimental database consists of a latent dataset of 150 fingerprints and a reference dataset of 150 mated fingerprints.
5.2 Clustering Performance
The classification of the 150 latent fingerprints into one of the two classes LWQ and $\overline{\mathrm{LWQ}}$ is performed by using a k-means classifier with two clusters. Due to the absence of external class labels, the clustering performance is evaluated with internal validation methods, i.e., the within-cluster sum of distances and the silhouette plot. The within-cluster sum of distance of the two clusters is reported in Table 2. As shown by the silhouette plot in Fig. 6, the silhouette value of the first cluster is greater than 0.8, which shows that its points are well matched to their own cluster. However, some points in the second cluster are misclustered, as indicated by their negative silhouette values.
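For reference, the silhouette value of a point compares its mean distance a to the other members of its own cluster with its mean distance b to the nearest other cluster, s = (b − a) / max(a, b). The sketch below computes it with plain Euclidean distance; the distance metric and the function name are assumptions, since the text does not say which metric was used for Fig. 6.

```python
import numpy as np


def silhouette_values(X, labels):
    """Per-point silhouette values s = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster is given silhouette 0
        # a: mean distance to the other members of the same cluster.
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean()
                for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Values close to 1 indicate a point that sits well inside its cluster, and a negative value indicates a point that is on average closer to another cluster than to its own.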
Fig. 6 Silhouette plot of two clusters between cluster indices and silhouette value

5.3 Classification Accuracy
The classification accuracy of the classifier is determined from the confusion matrix, which requires both predicted (LWQ and $\overline{\mathrm{LWQ}}$) and true (LWQ and $\overline{\mathrm{LWQ}}$) quality class labels. The predicted class label of each latent is given by its cluster index (1 for LWQ and 2 for $\overline{\mathrm{LWQ}}$), which is returned by the k-means clustering algorithm. In this paper, the true class label of each latent is obtained from the matching performance of the minutiae-based latent matching system. A latent fingerprint in the database belongs to the class LWQ if the matching score with its mated plain fingerprint places the mate within rank-100; otherwise, it belongs to the class $\overline{\mathrm{LWQ}}$. The confusion matrix and the classification accuracy are reported in Tables 3 and 4, respectively.

Table 3 Confusion matrix of LWQ and $\overline{\mathrm{LWQ}}$ classification from k-means classifier. LWQ and $\overline{\mathrm{LWQ}}$ belong to true class labels
 | LWQ | $\overline{\mathrm{LWQ}}$
LWQ | 42 | 29
$\overline{\mathrm{LWQ}}$ | 33 | 46

Table 4 Classification accuracy of LWQ and $\overline{\mathrm{LWQ}}$
LWQ | 59.1%
$\overline{\mathrm{LWQ}}$ | 58.2%
Total classification accuracy | 58.6%
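The values in Table 4 can be checked against the counts in Table 3 with a few lines of arithmetic. The sketch below assumes that the diagonal entries of Table 3 are the correctly classified latents and that each per-class value is the diagonal entry divided by its row sum; the helper names are illustrative. Truncated to one decimal place, the three results agree with Table 4.

```python
import math

# Counts from Table 3: first row/column LWQ, second row/column the other class.
confusion = [[42, 29],
             [33, 46]]


def row_accuracy(matrix, i):
    """Diagonal entry of row i divided by the row total."""
    return matrix[i][i] / sum(matrix[i])


def total_accuracy(matrix):
    """Sum of the diagonal divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def truncate_pct(value):
    """Percentage truncated (not rounded) to one decimal place."""
    return math.floor(value * 1000) / 10


print(truncate_pct(row_accuracy(confusion, 0)))   # 59.1  (42 / 71)
print(truncate_pct(row_accuracy(confusion, 1)))   # 58.2  (46 / 79)
print(truncate_pct(total_accuracy(confusion)))    # 58.6  (88 / 150)
```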
5.4 Latent Identification Performance
The proposed latent quality measure is evaluated by determining the latent identification performance. Since a poor quality latent carries less quantitative and qualitative information, it obtains a low score against its mated fingerprint in the reference database. This places the mate at a high (poor) rank and eventually decreases the identification rate. Rejecting the latent prints that the classifier assigns to $\overline{\mathrm{LWQ}}$ increases the identification rate. A rejection rate of 52% is achieved. The latent identification rate before and after rejection is shown by the Cumulative Match Characteristic (CMC) curve in Fig. 7. Rejecting 52% of the latent fingerprints in the database increases the rank-100 latent identification rate from 64 to 70%.

Fig. 7 CMC curve of the 150 latent fingerprints shows better identification rate after rejecting 52% latents of unacceptable quality
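The effect of rejection on a CMC curve can be illustrated with a short sketch. A CMC curve gives, for each rank k, the fraction of query latents whose true mate appears within the top k candidates. The function names, the string class labels, and the toy ranks below are hypothetical and are not data from the paper.

```python
def identification_rate(ranks, k):
    """Fraction of queries whose mate is retrieved at rank k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def cmc_curve(ranks, max_rank):
    """Identification rate for every rank from 1 to max_rank."""
    return [identification_rate(ranks, k) for k in range(1, max_rank + 1)]


def reject_poor_quality(ranks, predicted_labels, keep_label="LWQ"):
    """Keep only the queries that the quality classifier accepted."""
    return [r for r, q in zip(ranks, predicted_labels) if q == keep_label]


# Hypothetical example: rank of the true mate for six query latents.
ranks = [1, 3, 120, 150, 2, 90]
predicted = ["LWQ", "LWQ", "not-LWQ", "not-LWQ", "LWQ", "LWQ"]

before = identification_rate(ranks, 100)
after = identification_rate(reject_poor_quality(ranks, predicted), 100)
```

In this toy case four of six mates fall within rank 100 before rejection, and all four retained latents do after rejection, which mirrors the direction of the change reported in Fig. 7.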
6 Conclusion
An algorithm for latent fingerprint image quality assessment is proposed which is based on Level 1, Level 2, and texture features. Qualitative features are represented by the ridge clarity map and texture. Quantitative features are represented by the total number of extracted minutiae, the average quality of the minutiae, and the size of the convex hull including all minutiae. The latent images in the IIITD Latent Database are classified with a k-means classifier into two classes, i.e., latent with quality and latent without quality. A classification accuracy of 58.6% is achieved for unsupervised classification. Rejecting the 52% of latent fingerprints classified as unacceptable improves the rank-100 identification rate of the minutiae-based latent fingerprint matcher from 64 to 70%. By eliminating poor quality latent prints, the rejection may prevent forensic experts from making erroneous decisions about the suspect. As future work, latent fingerprint image quality systems can be developed based on Level 3 features and a more distinctive feature set.
References
1. D. Maltoni, D. Maio, A.K. Jain, S. Prabhakar, Handbook of Fingerprint Recognition, 2nd edn. (Springer-Verlag, Heidelberg, 2009)
2. S. Yoon, K. Cao, E. Liu, A.K. Jain, LFIQ: latent fingerprint image quality, in BTAS (2013), pp. 1–8
3. D.R. Ashbaugh, Quantitative-Qualitative Friction Ridge Analysis: An Introduction to Basic and Advanced Ridgeology (CRC Press, Boca Raton, 1999)
4. B.T. Ulery, R.A. Hicklin, G.I. Kiebuzinski, M.A. Roberts, J. Buscaglia, Understanding the sufficiency of information for latent fingerprint value determinations. Forensic Sci. Int. 1, 99–106 (2013)
5. A.K. Jain et al., On matching latent fingerprints, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (2008)
6. A. Sankaran, T.I. Dhamecha, M. Vatsa, R. Singh, On matching latent to latent fingerprints, in International Joint Conference on Biometrics (IJCB) (IEEE, 2011), pp. 1–6
7. H. Fronthaler et al., Fingerprint image-quality estimation and its application to multialgorithm verification. IEEE Trans. Inf. Forensics Secur. 3(2), 331–338 (2008)
8. R.A. Hicklin, J. Buscaglia, M.A. Roberts, Assessing the clarity of friction ridge impressions. Forensic Sci. Int. 226(13), 106–117 (2013)
9. R.A. Hicklin, Guidelines for Extended Feature Set Markup of Friction Ridge Images. Working Draft Version 0.3 (2009) [Note: This document has been formalized in “Markup Instructions for Extended Friction Ridge Features”, version 1.0, March 2012]
10. S. Yoon, E. Liu, A.K. Jain, On latent fingerprint image quality, in Proceedings IWCF (2012), pp. 67–82
11. A. Sankaran, M. Vatsa, R. Singh, Automated clarity and quality assessment for latent fingerprints, in IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS) (IEEE, 2013)
12. A. Sankaran et al., Latent fingerprint from multiple surfaces: database and quality analysis, in IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS) (2015)
13. K. Cao et al., Automatic latent value determination, in IEEE International Conference on Biometrics (ICB) (2016)
14. J. Bigun, T. Bigun, K. Nilsson, Recognition by symmetry derivatives and the generalized structure tensor. IEEE Trans. Pattern Anal. Mach. Intell. 26(12), 1590–1605 (2004)
15. F. Alonso-Fernandez, J. Fierrez, J. Ortega-Garcia, J. Gonzalez-Rodriguez, H. Fronthaler, K. Kollreider, J. Bigun, A comparative study of fingerprint image-quality estimation methods. IEEE Trans. Inf. Forensics Secur. 2(4), 734–743 (2007)
16. J. Weickert, Anisotropic Diffusion in Image Processing, vol. 1 (Teubner, Stuttgart, 1998)
17. A.R. Rao, R.C. Jain, Computerized flow field analysis: oriented texture fields. IEEE Trans. Pattern Anal. Mach. Intell. 14, 693–709 (1992)
18. A.C. Bovik, M. Clark, W.S. Geisler, Multichannel texture analysis using localized spatial filters. IEEE Trans. Pattern Anal. Mach. Intell. 12, 55–73 (1990)
19. J. Bigun, G.H. Granlund, J. Wiklund, Multidimensional orientation estimation with applications to texture analysis and optical flow. IEEE Trans. Pattern Anal. Mach. Intell. 13, 775–790 (1991)
20. A.K. Jain, F. Farrokhnia, Unsupervised texture segmentation using Gabor filters. Pattern Recognit. 24(12), 1167–1186 (1991)
21. A.K. Jain, A. Ross, S. Prabhakar, Fingerprint matching using minutiae and texture features, in Proceedings of IEEE International Conference on Image Processing, vol. 3 (2001)
22. L. Hong, Y. Wan, A.K. Jain, Fingerprint image enhancement: algorithm and performance evaluation. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 777–789 (1998)
23. NIST Special Database 27, Fingerprint Minutiae from Latent and Matching Tenprint Images [Online]. Available: http://www.nist.gov/srd/nistsd27.cfm
24. A.A. Paulino, J. Feng, A.K. Jain, Latent fingerprint matching using descriptor-based Hough transform. IEEE Trans. Inf. Forensics Secur. 8(1), 31–45 (2012)
25. J. Abraham, P. Kwan, J. Gao, Fingerprint matching using a hybrid shape and orientation descriptor, in State of the Art in Biometrics (IntechOpen, 2011)
26. S.P. Lloyd, Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129–137 (1982)
Breast Cancer Detection Using Deep Learning and Machine Learning: A Comparative Analysis Alpna Sharma, Barjesh Kochar, Nisheeth Joshi, and Vinay Kumar
Abstract Nowadays, breast cancer is one of the most frequently diagnosed deadly cancers in women. The cancer leads to uncontrolled growth of cells, usually present in the form of a tumor or lump in the affected area and accompanied by skin hardness, redness, and irritation. In this paper, an attempt has been made to detect breast cancer with a deep neural network using the Wisconsin breast cancer dataset. Further, the results are compared with machine learning techniques such as support vector machine and linear regression.
Keywords Convolutional neural network (CNN) · AlexNet · GoogLeNet · Deep learning
1 Introduction
The term breast cancer refers to the uncontrolled growth of breast cells that form stagnation in milk ducts. Its symptoms include the formation of a hard mass with an irregular boundary in the breast that may change the shape of the breast. Other symptoms are swelling of the breast [1], irritation of the breast skin and nipple [2], leakage of fluid other than milk from the nipples, and redness of the breast skin.
A. Sharma (B) · N. Joshi Department of Computer Science, Apaji Institute, Banasthali University, Vanasthali, India e-mail: [email protected] N. Joshi e-mail: [email protected] B. Kochar · V. Kumar Vivekananda School of Information Technology, VIPS, New Delhi, India e-mail: [email protected] V. Kumar e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 D. Gupta et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1165, https://doi.org/10.1007/978-981-15-5113-0_39
Since deep learning came into existence, different deep learning architectures have been designed and proposed by various authors [3], such as AlexNet [4], GoogLeNet, and state-of-the-art CNNs [5]. Deep learning architectures have proved far better than classical machine learning architectures, whether for object detection or for classification. The main merit of deep networks is that they learn features themselves, so features need not be extracted using analytical methods [6]. Sample images are taken as input from the Wisconsin breast cancer dataset and rescaled so that they can be fed into a convolutional neural network (CNN) structure named AlexNet, which is a deep learning model. The rescaled images are fed into the CNN, and results are obtained at different magnification factors such as 40x, 100x, and 200x. The image analysis is done by deep learning as well as by machine learning techniques such as support vector machines (SVM) and linear regression. The linear regression model is applied to predict the accuracy after determining the tumor-affected portions [7]. The high misclassification rate obtained with only one feature, i.e., intensity, can be improved by adding more features such as window mean and standard deviation [8]. With the introduction of more features, the segmentation generated by various rules is also improved. The dataset is analyzed, and results are shown accordingly.
The paper is organized as follows. Section 2 presents deep learning architectures and concise information on machine learning algorithms. Section 3 describes the dataset and its parameters. Section 4 presents experimental results that compare the accuracy of the CNN and the machine learning techniques. Section 5 concludes the paper, followed by references.
2 Methodology This section describes the deep learning architectures.
2.1 Convolutional Neural Networks (CNNs)
A deep learning model such as a CNN can be trained for breast cancer recognition [9]. Deep learning models can handle large amounts of data [8]. They comprise neural networks, which in turn consist of different layers such as convolutional, pooling, flattening, and fully connected layers. The convolutional and pooling layers are used for feature extraction. CNNs perform task-specific feature extraction very well [10].
Fig. 1 General view of AlexNet [3]
(a) AlexNet
AlexNet is a convolutional neural network (CNN) used for object identification and classification. It consists of eight layers, of which the first five are convolutional and the last three are fully connected. The first layer filters the input image with 96 kernels at a stride of four pixels. Its output is the input to the second layer, which filters it with 256 kernels. The third, fourth, and fifth layers are connected without any intervening pooling or normalization (Fig. 1).
(b) GoogLeNet
GoogLeNet [11] estimates accuracy by finding an ideal local structure and building a multi-layer network from it. Pooling layers are positioned between modules in order to extract features from the given dataset. It also uses classifiers that apply rules to extract features from values.
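The effect of the stride in the first AlexNet layer can be made concrete with the standard output-size formula for a convolution, floor((n + 2p − k) / s) + 1. The sketch below is only an illustration: the 227 × 227 input size and the 11 × 11 kernel size are taken from the commonly cited AlexNet configuration and are not stated in this paper, and the function name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (or pooling) layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# First convolutional layer: 96 kernels, stride 4 (from the text above);
# 227-pixel input and 11-pixel kernels are assumed values.
first_layer = conv_output_size(227, kernel_size=11, stride=4)
print(first_layer, "x", first_layer, "x", 96)   # 55 x 55 x 96
```

With these assumed values the first layer produces 96 feature maps of 55 × 55 pixels, which is why rescaling the input images to a fixed size is required before they are fed to the network.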
2.2 Machine Learning Algorithms
(a) Support Vector Machines (SVM)
SVM was developed by Vapnik [12] and is a machine learning model that comprises training, testing, and evaluation of the performance [13]. The support vector machine (SVM) was designed principally for binary classification. Its prime objective is to identify the optimal hyperplane f(w, x) = w · x + b separating two classes in a given dataset having input features x ∈ R^p and labels y ∈ {−1, 1}. SVM applies the optimization formula stated below to extract deep features [13]:

$$\min \frac{1}{p} A + C \sum_{i=1}^{p} B \tag{1}$$

where A is the Manhattan function, B is the cost function, and C is an arbitrary or selected value.
(b) Linear Regression
Linear regression is used as a classifier on the dataset in order to predict accuracy on the basis of various parameters in the breast cancer dataset. It is given by the following equation [14]:

$$h(x) = \sum_{i=0}^{p} D$$

where D denotes the threshold range in which the equation satisfies the dataset.
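To make the SVM objective more concrete, the sketch below evaluates it for a given weight vector. It rests on assumptions that go beyond the text: A is read as the Manhattan (L1) norm of w, B as the hinge loss max(0, 1 − y_i(w · x_i + b)) of training sample i, and p as the number of training samples. The function names are illustrative.

```python
def decision_value(w, x, b):
    """Hyperplane function f(w, x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def l1_svm_objective(w, b, X, y, C=1.0):
    """(1/p) * ||w||_1 + C * sum of hinge losses (assumed reading of Eq. (1))."""
    p = len(X)
    manhattan = sum(abs(wi) for wi in w)
    hinge = sum(max(0.0, 1.0 - yi * decision_value(w, xi, b))
                for xi, yi in zip(X, y))
    return manhattan / p + C * hinge


def predict(w, x, b):
    """Class label in {-1, 1} from the sign of the decision value."""
    return 1 if decision_value(w, x, b) >= 0 else -1
```

A sample that lies on the correct side of the hyperplane with a margin of at least 1 contributes nothing to the sum, so only misclassified or low-margin samples raise the objective.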
3 Dataset
Problem Identification
Breast cancer occurs due to the abnormal growth of cells in breast tissues, which forms a tumor. Tumors are classified as benign (mild, not very harmful) or malignant (harmful or cancerous).
Identification of data sources
Detection of breast cancer is done on the Wisconsin diagnostic breast cancer dataset [15]. The dataset has ten attributes. The dataset and its associated libraries are loaded into deep learning studio, and the parameters are set. The set of libraries used in the implementation of the model is (Figs. 2 and 3):
ggplot2,e1071,dplyr,reshape2,corrplot,caret,pROC,gridExtra,grid,ggfortify,purrr, nnet,doParallel,registerDoParallel,foreach,iterators,parallel
Command to load raw dataset in deep learning studio is: Cancer.rawdata