English Pages 704 [669] Year 2022
Lecture Notes in Electrical Engineering 840
Sagaya Aurelia Somashekhar S. Hiremath Karthikeyan Subramanian Saroj Kr. Biswas Editors
Sustainable Advanced Computing Select Proceedings of ICSAC 2021
Lecture Notes in Electrical Engineering Volume 840
Series Editors
Leopoldo Angrisani, Department of Electrical and Information Technologies Engineering, University of Napoli Federico II, Naples, Italy
Marco Arteaga, Departament de Control y Robótica, Universidad Nacional Autónoma de México, Coyoacán, Mexico
Bijaya Ketan Panigrahi, Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India
Samarjit Chakraborty, Fakultät für Elektrotechnik und Informationstechnik, TU München, Munich, Germany
Jiming Chen, Zhejiang University, Hangzhou, Zhejiang, China
Shanben Chen, Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Tan Kay Chen, Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore
Rüdiger Dillmann, Humanoids and Intelligent Systems Laboratory, Karlsruhe Institute for Technology, Karlsruhe, Germany
Haibin Duan, Beijing University of Aeronautics and Astronautics, Beijing, China
Gianluigi Ferrari, Università di Parma, Parma, Italy
Manuel Ferre, Centre for Automation and Robotics CAR (UPM-CSIC), Universidad Politécnica de Madrid, Madrid, Spain
Sandra Hirche, Department of Electrical Engineering and Information Science, Technische Universität München, Munich, Germany
Faryar Jabbari, Department of Mechanical and Aerospace Engineering, University of California, Irvine, CA, USA
Limin Jia, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Alaa Khamis, German University in Egypt El Tagamoa El Khames, New Cairo City, Egypt
Torsten Kroeger, Stanford University, Stanford, CA, USA
Yong Li, Hunan University, Changsha, Hunan, China
Qilian Liang, Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA
Ferran Martín, Departament d’Enginyeria Electrònica, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
Tan Cher Ming, College of Engineering, Nanyang Technological University, Singapore, Singapore
Wolfgang Minker, Institute of Information Technology, University of Ulm, Ulm, Germany
Pradeep Misra, Department of Electrical Engineering, Wright State University, Dayton, OH, USA
Sebastian Möller, Quality and Usability Laboratory, TU Berlin, Berlin, Germany
Subhas Mukhopadhyay, School of Engineering & Advanced Technology, Massey University, Palmerston North, Manawatu-Wanganui, New Zealand
Cun-Zheng Ning, Electrical Engineering, Arizona State University, Tempe, AZ, USA
Toyoaki Nishida, Graduate School of Informatics, Kyoto University, Kyoto, Japan
Federica Pascucci, Dipartimento di Ingegneria, Università degli Studi “Roma Tre”, Rome, Italy
Yong Qin, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China
Gan Woon Seng, School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Joachim Speidel, Institute of Telecommunications, Universität Stuttgart, Stuttgart, Germany
Germano Veiga, Campus da FEUP, INESC Porto, Porto, Portugal
Haitao Wu, Academy of Opto-electronics, Chinese Academy of Sciences, Beijing, China
Walter Zamboni, DIEM - Università degli studi di Salerno, Fisciano, Salerno, Italy
Junjie James Zhang, Charlotte, NC, USA
The book series Lecture Notes in Electrical Engineering (LNEE) publishes the latest developments in Electrical Engineering - quickly, informally and in high quality. While original research reported in proceedings and monographs has traditionally formed the core of LNEE, we also encourage authors to submit books devoted to supporting student education and professional training in the various fields and application areas of electrical engineering. The series covers classical and emerging topics concerning:
• Communication Engineering, Information Theory and Networks
• Electronics Engineering and Microelectronics
• Signal, Image and Speech Processing
• Wireless and Mobile Communication
• Circuits and Systems
• Energy Systems, Power Electronics and Electrical Machines
• Electro-optical Engineering
• Instrumentation Engineering
• Avionics Engineering
• Control Systems
• Internet-of-Things and Cybersecurity
• Biomedical Devices, MEMS and NEMS
For general information about this book series, comments or suggestions, please contact [email protected].
To submit a proposal or request further information, please contact the Publishing Editor in your country:
China: Jasmine Dou, Editor ([email protected])
India, Japan, Rest of Asia: Swati Meherishi, Editorial Director ([email protected])
Southeast Asia, Australia, New Zealand: Ramesh Nath Premnath, Editor ([email protected])
USA, Canada: Michael Luby, Senior Editor ([email protected])
All other Countries: Leontina Di Cecco, Senior Editor ([email protected])
** This series is indexed by EI Compendex and Scopus databases. **
More information about this series at https://link.springer.com/bookseries/7818
Editors Sagaya Aurelia Department of Computer Science CHRIST (Deemed to be University) Bengaluru, Karnataka, India
Somashekhar S. Hiremath Department of Mechanical Engineering Indian Institute of Technology Madras Chennai, Tamil Nadu, India
Karthikeyan Subramanian Department of Information Technology University of Technology and Applied Science Sultanate of Oman, Oman
Saroj Kr. Biswas Department of Computer Science Engineering National Institute of Technology Silchar Silchar, Assam, India
ISSN 1876-1100 ISSN 1876-1119 (electronic) Lecture Notes in Electrical Engineering ISBN 978-981-16-9011-2 ISBN 978-981-16-9012-9 (eBook) https://doi.org/10.1007/978-981-16-9012-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
A true leader has the confidence to stand alone, the courage to make tough decisions, and the compassion to listen to the needs of others. He does not set out to be a leader, but becomes one by the equality of his actions and the integrity of his intent.
Douglas MacArthur
This book is dedicated to such a leader, Dr. Joy Paulose, Head of the Department of Computer Science, for his compelling vision, comprehensive plan, and relentless implementation.
Committees
Organizing Committee
• Dr. (Fr.) Abraham V. M., Vice-Chancellor
• Dr. (Fr.) Joseph C. C., Pro-Vice-Chancellor
• Dr. Anil Joseph Pinto, Registrar
• Dr. George Thomas C., Dean – Sciences
• Dr. T. V. Joseph, Associate Dean Sciences

Organizing Chair
• Dr. Joy Paulose, Head, Departments of Computer Science and Statistics

Organizing Secretary
• Dr. Kirubanand V. B., Department of Computer Science

Conveners
• Dr. Manjunatha Hiremath, Department of Computer Science
• Dr. Arokia Paul Rajan R., Department of Computer Science
• Dr. Azarudheen S., Department of Statistics

Communication / Publicity Committee
• Dr. Vinay M. (In charge)
• Dr. Gobi R
• Dr. Kavitha R
• Dr. Rohini V

Executive Committee
• Dr. Tulasi B. (In charge)
• Prof. Roseline Mary R
• Dr. Peter Augustin D
• Dr. Prabu P
• Dr. Azarudheen S

Financial Committee
• Prof. Deepthi Das (In charge)
• Dr. Saleema J S

Keynote and Session Committee
• Dr. Senthilnathan T. (In charge)
• Dr. Deepa V. Jose
• Dr. Vijayalakshmi A
• Dr. Prabu M
• Dr. Hemlata Joshi

Online Sessions Committee
• Dr. Anita H. B. (In charge)
• Dr. Vaidhehi V
• Dr. Ramamurthy B
• Dr. Sandeep J

Publication Committee
• Dr. Sagaya Aurelia P. (In charge)
• Dr. Nizar Banu P K
• Dr. Rajesh R
• Dr. Niju P Joseph
• Dr. Nimitha John

Registration Committee
• Dr. Jayanta Biswas (In charge)
• Dr. Debabrata Samanta
• Dr. Monisha Singh
• Dr. Smitha Vinod
• Dr. Nagaraja M S

Report Committee
• Ms. Lakshmi S. (In charge)
• Ms. Amrutha K
• Mr. Febin Antony
• Ms. Jinsi Jose
• Ms. Kavipriya K
• Ms. Lavanya R
• Ms. Mino George
• Ms. Sherly Maria
• Mr. Siddiq Pasha
• Ms. Smera C
• Mr. Libin Thomas

Review Committee
• Dr. Arish P. (In charge)
• Dr. Sreeja C S
• Dr. Ummesalma M
• Dr. Shoney Sebastian
• Dr. Basanna Veeranna Dhandra

Sponsorship/Hospitality Committee
• Dr. Beaulah Soundarabai P. (In charge)
• Dr. Saravanan K. N
• Dr. Rupali Sunil Wagh
• Dr. Saravanakumar K
• Dr. Sharon Varghese A

Website Committee
• Dr. Arul Kumar N. (In charge)
• Dr. Chandra J
• Dr. Sivakumar R
• Dr. Nismon Rio R

Advisory/Review Committee
Dr. Ajay K. Sharma, Vice-Chancellor, IKG Punjab Technical University, Punjab, India
Dr. Arockiasamy Soosaimanickam, Dean, Department of Information Systems, The University of Nizwa, Sultanate of Oman
Dr. S. Karthikeyan, Head, Department of Information Technology, The University of Technology and Applied Sciences, Sohar, Oman
Dr. Rajkumar Buyya, Redmond Barry Distinguished Professor, Director, Cloud Computing and Distributed Systems (CLOUDS) Lab, School of Computing and Information Systems, The University of Melbourne, Australia
Dr. Inder Vir Malhan, Professor and Head, Mathematics, Computers and Information Science, Central University of Himachal Pradesh, India
Dr. Jose Orlando Gomes, Professor, Department of Industrial Engineering, Federal University, Brazil
Dr. Richmond Adebiaye, Associate Professor, College of Science and Technology, Department of Informatics and Engineering Systems, University of South Carolina Upstate, Spartanburg, USA
Dr. Ahmad Sobri Bin Hashim, Senior Lecturer, CIS Department, Faculty of Applied Sciences, University Teknologi Petronas, Malaysia
Dr. John Digby Haynes, Honorary Professor, Discipline of Business Information Systems, The University of Sydney Business School, The University of Sydney, Australia
Dr. Subhash Chandra Yadav, Professor and Head, Centre for Computer Science and Technology, Central University, Jharkhand, India
Dr. Rajeev Srivastava, Professor and Head, Department of Computer Science and Engineering, Indian Institute of Technology, Varanasi, India
Dr. Inder Vir Malhan, Professor and Head, Department of Mathematics, Computers and Information Science, Central University, Himachal Pradesh, India
Dr. Abhijit Das, Professor, Department of Computer Science and Engineering, Indian Institute of Technology, West Bengal, India
Dr. Muralidhara B. L., Professor, Department of Computer Science and Application, Jnana Bharathi Campus, Bangalore University, Bengaluru, India
Dr. K. K. Shukla, Professor, Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India
Dr. Pabitra Mitra, Professor, Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India
Dr. Hanumanthappa M., Professor, Department of Computer Science and Application, Jnana Bharathi Campus, Bangalore University, Bengaluru, India
Dr. Subhrabrata Choudhury, Professor, Department of Computer Science, National Institute of Technology, West Bengal, India
Dr. Tandra Pal, Professor, Department of Computer Science and Engineering, National Institute of Technology, West Bengal, India
Dr. Dilip Kumar Yadav, Professor and Head, Department of Computer Applications, National Institute of Technology, Jharkhand, India
Dr. P. Santhi Thilagam, Professor, Department of Computer Science and Engineering, National Institute of Technology, Karnataka, India
Dr. Annappa, Professor, Department of Computer Science and Engineering, National Institute of Technology, Karnataka, India
Dr. R. Thamaraiselvi, Associate Professor and Head, Department of Computer Applications, Bishop Heber College, Tiruchirappalli, India
Dr. P. Mukilan, Associate Professor, Department of Electrical and Computer Engineering, College of Engineering and Technology, Bule Hora University, Ethiopia
Dr. Gnanaprakasam Thangavel, Associate Professor, Department of Computer Science of Engineering, Gayatri Vidya Parishad College of Engineering (Autonomous), Vishakhapatnam, India
Dr. Tanmay De, Associate Professor, Department of Computer Science, National Institute of Technology, West Bengal, India
Dr. Saravanan Chandran, Associate Professor, Department of Computer Science, National Institute of Technology, West Bengal, India
Dr. Rupa G. Mehta, Associate Professor, Department of Computer Science and Engineering, Sardar Vallabhbhai National Institute of Technology, Gujarat, India
Dr. Bibhudatta Sahoo, Associate Professor and Head, Department of Computer Science and Engineering, National Institute of Technology, Rourkela, India
Dr. Baisakhi Chakraborty, Associate Professor, Department of Computer Science and Engineering, National Institute of Technology, West Bengal, India
Dr. Rajiv Misra, Associate Professor, Department of Computer Science and Engineering, Indian Institute of Technology, Bihar, India
Dr. Somashekhar S. Hiremath, Associate Professor, Department of Mechanical Engineering, Indian Institute of Technology Madras, India
Dr. S. K. V. Jayakumar, Associate Professor, Department of Ecology, and Environmental Sciences, Pondicherry University, Puducherry, India
Dr. S. Ravi, Associate Professor, School of Engineering and Technology, Pondicherry University, Puducherry, India
Dr. Diptendu Sinha Roy, Associate Professor, Department of Computer Science and Engineering, National Institute of Technology, Meghalaya, India
Dr. Pankaj Kumar Sa, Associate Professor, Department of Computer Science and Engineering, National Institute of Technology, Rourkela, India
Dr. Shashirekha H. L., Associate Professor, Department of Computer Science, Mangalore University, India
Dr. Haobam Mamata Devi, Associate Professor, Manipur University, Manipur, India
Dr. A. S. V. Ravi Kanth, Associate Professor and Head, Department of Mathematics, National Institute of Technology, Haryana, India
Dr. Neeta Nain, Associate Professor, Department of Computer Science and Engineering, Malaviya National Institute of Technology Jaipur, India
Dr. G. Sobers Smiles David, Associate Professor and Head, PG Department of Computer Science, Bishop Heber College, India
Dr. Manikandan K., Associate Professor, School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
Dr. Nilamadhab Mishra, Senior Assistant Professor, School of Computer Science and Engineering, Vellore Institute of Technology, Bhopal
Dr. Mohammad Gouse, Assistant Professor, Department of Information Technology, Catholic University, Erbil, Kurdistan Region, Iraq
Dr. M. Bagyaraj, Assistant Professor, College of Natural and Computational Sciences, Debre Berhan University, Ethiopia
Dr. E. George Dharma Prakash Raj, Assistant Professor, School of Computer Science Engineering and Applications, Bharathidasan University, Tiruchirappalli, India
Dr. A. Manimaran, Assistant Professor, Department of Computer Applications, Madanapalle Institute of Technology and Science, Andhra Pradesh, India
Dr. N. Rukma Rekha, Assistant Professor, School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
Dr. M. T. Somashekara, Assistant Professor, Department of Computer Science and Application, Jnana Bharathi Campus, Bangalore University, Bengaluru, India
Dr. Arup Kumar Pal, Assistant Professor, Department of Computer Science and Engineering, Indian Institute of Technology (Indian School of Mines), Jharkhand, India
Dr. Virender Ranga, Assistant Professor, Department of Computer Engineering, National Institute of Technology, Kurukshetra, India
Dr. Ranjita Das, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Mizoram, India
Dr. Pravati Swain, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Goa, India
Dr. Anagha Bhattacharya, Assistant Professor, Electrical and Electronics Engineering, National Institute of Technology, Mizoram, India
Dr. K. Jairam Naik, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Chhattisgarh, India
Dr. Venkatanareshbabu Kuppili, Assistant Professor, Computer Science and Engineering, National Institute of Technology, Goa, India
Dr. Ripon Patgiri, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Assam, India
Dr. Khundrakpam Johnson Singh, Assistant Professor, Computer Science and Engineering, National Institute of Technology, Manipur, India
Dr. Ramesh Kumar, Assistant Professor, Department of Electrical and Electronics Engineering, National Institute of Technology, Mizoram, India
Dr. B. Shameedha Begum, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, India
Dr. Pradeep Singh, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Chhattisgarh, India
Dr. Chandrasekhar Perumalla, Assistant Professor, School of Electrical Sciences, Indian Institute of Technology, Bhubaneswar, India
Dr. Sk Subidh Ali, Assistant Professor, Department of Computer Science and Engineering, Indian Institute of Technology, Chhattisgarh, India
Dr. Dharavath Ramesh, Assistant Professor, Department of Computer Science and Engineering, Indian Institute of Technology, Dhanbad, India
Dr. Swapan Debbarma, Assistant Professor, Computer Science and Engineering, National Institute of Technology, Tripura, India
Dr. Arka Prokash Mazumdar, Assistant Professor, Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, India
Dr. Jyoti Grover, Assistant Professor, Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, India
Dr. Vijay Verma, Assistant Professor, Department of Computer Engineering, National Institute of Technology, Haryana, India
Dr. Pournami P. N., Assistant Professor, Computer Science, National Institute of Technology, Kerala, India
Dr. Nonita Sharma, Assistant Professor, Computer Science and Engineering, National Institute of Technology, Punjab, India
Dr. K. P. Sharma, Assistant Professor, Department of Computer Science and Engineering, Dr. B. R. Ambedkar National Institute of Technology, Punjab, India
Dr. B. B. Gupta, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Haryana, India
Dr. Kuldeep Kumar, Assistant Professor, Computer Science and Engineering, National Institute of Technology, Jalandhar, India
Dr. Dinesh Kumar Tyagi, Assistant Professor, Department of Computer Science and Engineering, Malaviya National Institute of Technology, Jaipur, India
Dr. Badal Soni, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Assam, India
Dr. Mahendra Pratap Singh, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Karnataka, India
Mr. Pragati Singh, Assistant Professor, Department of Electronics and Communication Engineering, National Institute of Technology, Mizoram, India
Dr. Deepak Gupta, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Arunachal Pradesh, India
Dr. Saroj Kumar Biswas, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Silchar, India
Dr. Partha Pakray, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Silchar, India
Dr. Dipti P. Rana, Assistant Professor, Department of Computer Science and Engineering, Sardar Vallabhbhai National Institute of Technology, Gujarat, India
Dr. Bidhan Malakar, Assistant Professor, Department of Electrical Engineering, JIS College of Engineering, West Bengal, India
Dr. Malaya Dutta Borah, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Silchar, India
Dr. S. Jaya Nirmala, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, India
Dr. Subhasish Banerjee, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Arunachal Pradesh, India
Dr. Rajneesh Rani, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Punjab, India
Dr. Vikram Singh, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Haryana, India
Dr. Jamal Hussain, Assistant Professor, Department of Mathematics and Computer Science, Mizoram University, Aizawl, India
Dr. Layak Ali, Assistant Professor, School of Electrical and Computer Engineering, Central University, Karnataka, India
Dr. V. Kumar, Assistant Professor, Department of Computer Science, Central University, Kerala, India
Dr. Th. Rupachandra Singh, Assistant Professor, Department of Computer Science, Manipur University, Manipur, India
Dr. Prabu Mohandas, Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Kerala, India
Dr. Sujata Pal, Assistant Professor, Department of Computer Science and Engineering, Indian Institute of Technology, Punjab, India
Dr. Surjit Singh, Assistant Professor, Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India
Dr. Hima Bindu K., Assistant Professor, Department of Computer Science and Engineering, National Institute of Technology, Andhra Pradesh, India
Dr. A. S. Mokhade, Assistant Professor, Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Maharashtra, India
Dr. A. Nagaraju, Assistant Professor, Department of Computer Science, Central University, Ajmer, India
Dr. M. A. Saifullah, Assistant Professor, School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
Dr. Avatharam Ganivada, Assistant Professor, School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
Dr. P. Shanthi Bala, Assistant Professor, School of Engineering and Technology, Pondicherry University, Puducherry, India
Dr. A. Martin, Assistant Professor, School of Mathematics and Computer Sciences, Central University, Tamil Nadu, India
Dr. A. Saleem Raja, Faculty, IT Department, University of Technology and Applied Science-Shinas, Al-Aqr, Shinas, Sultanate of Oman
Preface
Welcome to ICSAC 2021, the 3rd International Conference on Sustainable Advanced Computing, hosted by the Departments of Computer Science and Statistics, CHRIST (Deemed to be University), Bangalore Central Campus, Bangalore, and co-hosted by the ACM Student Chapter and the Computer Society of India. The conference is generally held every two years. ICSAC 2020 was held online for the first time since the conference’s inception, owing to the COVID-19 outbreak. The conference’s theme is “Sustainable Advanced Computing for Current Society.” Cloud computing, IoT, Big Data, Machine Learning (ML), and Robotics are essential enablers of the latest changes in current society and the market. The conference aims to bridge Indian and overseas researchers, practitioners, and educators, with a particular focus on innovative models, theories, techniques, algorithms, methods, and domain-specific applications and systems, from both the technical and social aspects of Sustainable Advanced Computing. The conference gave participants an opportunity to discuss recent developments in Computer Science and the challenges faced by the community in the 21st century, and it featured a series of exciting and rich activities, including invited keynotes and technical presentations. We solicited papers on all aspects of research, development, and applications of cloud computing, IoT, Big Data, Machine Learning (ML), and Robotics. We received 347 research articles from all over India and from countries such as Sri Lanka, Malaysia, the Sultanate of Oman, and the United States of America. To ensure the quality of the conference, we had national and international academic experts and professionals as our reviewers. There were 158 reviewers, of whom 15 were from other countries across the globe. Assigning each manuscript to the right reviewers and receiving their valuable reviews on time was a professional and skill-based exercise for the review committee.
After thorough scrutiny, 54 papers were selected for presentation and publication. Hence, the acceptance ratio of this conference is 13.87%. The accepted papers were presented in 8 different sessions, covering major research domains such as Artificial Intelligence, Machine Learning, Cloud Computing, Data Analytics, Data Security, Networks, Blockchain, Virtual and Augmented Reality, and the Internet of Things.
I hope this conference served as a hub and knowledge repository for researchers to share their latest research ideas. We wish that future editions of ICSAC reach even greater heights.
Sagaya Aurelia
Department of Computer Science
CHRIST University
Bangalore, India
Acknowledgements
No one who achieves success does so without acknowledging the help of others. The wise and confident acknowledge this help with gratitude. –Alfred North Whitehead
On behalf of the International Conference on Sustainable Advanced Computing 2021 committee and the entire fraternity of the institution, I first extend my most sincere thanks to the Almighty God for giving us the strength, knowledge, and good health to conduct this international conference successfully. I am happy to acknowledge Dr. Fr. Abraham Vettiyankal Mani, our respected Vice-Chancellor; Rev. Fr. Jose C. C., Pro-Vice-Chancellor; Dr. Joseph Varghese, Director, Center for Research; our beloved Registrar, Dr. Anil Joseph Pinto; and all the management authorities for giving us permission and logistical support to conduct this international conference in a hybrid mode, which made this event a benchmark in the history of CHRIST (Deemed to be University). I also extend my sincere gratitude to our beloved Dean of Sciences, Dr. George Thomas, and Associate Dean of Sciences, Dr. T. V. Joseph, for their kind support throughout the conference. I offer my wholehearted thanks to Dr. Joy Paulose, Organizing Chair and Head of the Department of Computer Science, for being the backbone of this entire conference. I would be failing in my duty if I did not remember Mrs. Kamiya Katter and Mr. Rajaneesh from Springer, without whom this publication would not have been possible. I sincerely thank Dr. Bonny Banerjee, Associate Professor, Institute of Intelligent Systems and Department of Electrical and Computer Engineering, University of Memphis, USA, for delivering a thought-provoking talk during the inaugural ceremony of the conference. Special recognition goes to Dr. Balasubramanian, Vice President of Software Engineering, Corporate and Investment Bank, Singapore, for accepting our invitation and gracing the valedictory function with his enlightening speech.
I humbly acknowledge all the eminent keynote speakers in various domains of Computer Science: Dr. S. S. Iyengar, Professor, Florida International University, USA; Dr. Subhrabrata Choudhury, Professor, National Institute of Technology, WB; Dr. Dilip Kumar Yadav, Professor, National Institute of Technology, Jamshedpur; Prof. Sheikh Iqbal Ahamed, Marquette University, USA; Dr. Saiyed Younus, University of Kalamoon, Syria; Dr. Somashekhar Hiremath, IIT, Chennai; and Dr. Saroj Biswas, National Institute of Technology, Silchar. I also appreciate and acknowledge the session chairs: Dr. Vishwanathan, VIT Chennai; Dr. Rajeev Mente, Solapur University; Dr. Jayapandian, Christ University Kengeri Campus; Dr. Prasanna, VIT Vellore; Dr. Ravindra Hegade, Central University of Karnataka; Dr. Seetha Lakshmi, SRM Chennai; and Dr. Uma Rani, SRM, Chennai. I also acknowledge all the external and internal reviewers for providing helpful comments and improving the quality of the papers presented in the forum. My sincere gratitude goes to Fr. Justin Varghese, Director, IT Support, and to all the IT support staff members, student volunteers, and non-teaching staff, who worked day and night to make this event a grand success. I thank all the advisory committee members, coordinators, committee heads and members, and the faculty of the Computer Science and Statistics Departments. Their names are listed below. With a deep sense of appreciation, I acknowledge all the authors who submitted papers and congratulate those whose papers were accepted.
Sagaya Aurelia
Department of Computer Science
CHRIST University
Bangalore, India
Keynote Speakers
International
Dr. Sheikh Iqbal Ahamed, Ph.D., Director, Ubicomp Research Lab, and Professor and Chair, Department of Computer Science, Marquette University, USA
Dr. S. S. Iyengar, ACM Fellow, IEEE Fellow, AAAS Fellow, NAI Fellow, AIMBE Fellow, Distinguished University Professor, School of Computing and Information Sciences, Florida International University, USA
Dr. Said Eid Younes, Associate Professor, Department of Information Technology, College of Engineering, University of Kalamoon, Syria

National
Dr. Subhrabrata Choudhury, Professor, Department of Computer Science and Engineering, National Institute of Technology, Durgapur, West Bengal
Dr. Dilip Kumar Yadav, Professor, Department of Computer Applications, National Institute of Technology, Jamshedpur
Dr. Somashekhar S. Hiremath, Associate Professor, Manufacturing Engineering Section, Department of Mechanical Engineering, IIT Madras
Dr. Saroj Kr. Biswas, Assistant Professor (Grade I), Department of Computer Science and Engineering, National Institute of Technology, Silchar
ICSAC 2021 Reviewers
Dr. Manjaiah D. H., Professor, Department of Computer Science, Mangalore University, Mangalore Karnataka, Mangalore, Karnataka
Dr. Karthikeyan J., Associate Professor, School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu
Dr. Nandhakumar, Assistant Professor, Department of Computer Science and Applications, Vivekanandha College of Arts and Sciences for Women, Namakkal, Tamil Nadu
Dr. Iyappa Raja M., Associate Professor, School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu
Dr. Balamurugan, Assistant Professor, Department of Computer Science, Bharathidasan Government College for Women, Puducherry
Dr. Suresh K., Assistant Professor, Computer Science, Amrita Viswapeetham University, Mysore Campus, Mysore
Dr. Ramaprabha, Professor, Department of Information Technology, Nehru Arts and Science College, Coimbatore, Tamil Nadu
Dr. Antoinnete Aroul Jayanthi, Associate Professor, Department of Computer Science, Pope John Paul II College of Education, Puducherry
Dr. Anand R., Associate Professor, CSA, Reva University, Bangalore, Karnataka
Dr. Ruban S., Associate Professor, Software Technology, St. Aloysius College, Mangalore, Karnataka
Dr. Uma, Professor and Head, Computer Science, SRM Institute of Science and Technology, Chennai, Tamil Nadu
Dr. Manju Jose, Assistant Professor and Programme Manager, Department of Computing, Middle East College, KOM, Muscat, Oman
Dr. A. Adhiselvam, Head, Department of Computer Applications, S. T. E. T Women’s College, Mannargudi, Tamilnadu
Dr. Karthikeyan Subramaniam, HoD-IT, Information Technology, University of Technology and Applied Sciences—Suhar Campus, Sohar, Oman
Dr. D. Kamalraj, Assistant Professor, Computer Science, Bharathidasan Government College for Women, Puducherry
Dr. Alby S., Associate Professor, Department of Computer Applications, Sree Narayana Gurukulam College of Engineering, Ernakulam, Kerala Dr. Prasanna Mani, Associate Professor (Sr.), School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu Dr. C. Shalini, Associate Professor, Computer Science, The National College, Jayanagar, Bangalore, Karnataka Dr. Andhe Dharani, Professor, Master of Computer Applications, R. V. College of Engineering, Bengaluru, Karnataka Dr. Parshuram Mahadev Kamble, Assistant Professor, Department of Computer Science, School of Computer Science, Central University of Karnataka, Kalaburagi, Karnataka Dr. Ganesan R., Professor, Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu Dr. Gururaj Mukarambi, Assistant Professor, Department of Computer Science, School of Computer Science, Central University of Karnataka, Kalaburagi, Karnataka Dr. Cyril Anthoni, ICT Program Manager, Information Systems, Bahrain Polytechnic, Isa Town, Bahrain, Bahrain Dr. F. Sagayaraj Francis, Professor, Department of Computer Science and Engineering, Pondicherry Engineering College, Puducherry Dr. Vishwanathan, Professor, Computer Science and Engineering, Vellore Institute of Technology, Chennai, Chennai, Tamil Nadu Dr. Prabhakar C. J., Associate Professor, Department of Computer Science, Kuvempu University, Shimoga, Karnataka Dr. A. Anthonisan, Professor, Computer Science, St. Joseph University, Dimapur, Nagaland, Nagaland Dr. Sridevi U. K., Assistant Professor(SL.Gr), Applied Mathematics and Computational Sciences, PSG College of Technology, Peelamedu, Coimbatore-641005, Coimbatore, Tamil Nadu Dr. Kagale Madhuri Raghunath, Assistant Professor, Department of Computer Science, School of Computer Sciences, Central University of Karnataka, Kalaburagi, Karnataka Dr. Biku Abraham, Professor, Computer Applications, Saintgits College of Engineering, Pathamuttom P.O., Kottayam, Kerala Dr. 
Yogish Naik G. R, Associate Professor, Computer Science and MCA, Kuvempu University, Jnana Sahyadri, Shankaraghatta, Karnataka Dr. S. Arockiasamy, Acting Dean, Associate Professor, Department of Information Systems, College of Economics, Management and Information Systems, University of Nizwa, Sultanate of Oman Dr. S. Manimekalai, Head, PG and Research, Computer Science, Theivanai Ammal College for Women, Villupuram, Tamil Nadu Dr. K. Anandapadmanabhan, Dean, Computer Science, Sri Vasavi College (SFW), Erode, Tamil Nadu Dr. Muneeswaran V., Assistant Professor Senior, School of Computing Science and Engineering, VIT Bhopal University, Sehore, Madhya Pradesh
Dr. Suresha, Professor, Department of Computer Science, University of Mysore, Mysore, Karnataka Dr. Shauiaya Gupta, Assistant Professor, School of Computer Science, The University of Petroleum and Energy Studies, Dehradun, Uttarakhand Dr. Kamal Raj, Assistant Professor, Computer Science, Bharathidasan Government College for Women, Puducherry Dr. Doreswamy, Professor and Chairman of Statistics, Department of Computer Science, Mangalore University, Mangalore Karnataka, Mangalore, Karnataka Dr. J. Frank Ruban Jeyaraj, Associate Professor and Director, MCA, The American College, Madurai, Tamil Nadu Dr. B. H. Shekar, Professor, Department of Computer Science, Mangalore University, Mangalore Karnataka, Mangalore, Karnataka Dr. A. Sarvanan, Associate Professor, Department of Computer Science, CIT Engineering College coimbatore, Coimbatore, Tamil Nadu Dr. Thamilselvan P., Assistant Professor, Department of Computer Science, Bishop Heber College, Tiruchirappalli, Tamil Nadu Dr. M. Prasanna, Associate Professor, Department of Computer Science, VIT, Vellore, Tamil Nadu Prof. Ganapat Singh Rajput, Professor, Department of Computer Science, Karnataka State Akkamahadevi Womens University Vijayapura, Bijapur, Karnataka Dr. Rohini Bhusnurmath, Assistant Professor, Department of Computer Science, Karnataka State Akkamahadevi Womens University Vijayapura, Bijapur, Karnataka Prof. Ramesh K., Assistant Professor, Department of Computer Science, Karnataka State Akkamahadevi Womens University Vijayapura, Bijapur, Karnataka Dr. Suresh K., Assistant Professor, Computer Science, Kuvempu University, Mysore, Karnataka
Contents
Machine Learning

Predictive Analytics for Real-Time Agriculture Supply Chain Management: A Novel Pilot Study
Suja Panicker, Sachin Vahile, Sandesh Gupta, Sajal Jain, Lakshita Jain, and Shraddha Kamble

Analysis and Prediction of Crop Infestation Using Machine Learning
A. Mohamed Inayath Hussain, K. R. Lakshminarayanan, Ravi Dhanker, Sriramaswami Shastri, and Milind Bapu Desai

ECG-Based Personal Identification System
A. C. Ramachandra, N. Rajesh, and N. Rashmi

Explorations in Graph-Based Ranking Algorithms for Automatic Text Summarization on Konkani Texts
Jovi D’Silva and Uzzal Sharma

Demography-Based Hybrid Recommender System for Movie Recommendations
Bebin K. Raju and M. Ummesalma

Holistic Recommendation System Framework for Health Care Programs
K. Navin and M. B. Mukesh Krishnan

A Synthetic Data Generation Approach Using Correlation Coefficient and Neural Networks
Mohiuddeen Khan and Kanishk Srivastava

Cepstral Coefficient-Based Gender Classification Using Audio Signals
S. Sweta, Jiss Mariam Babu, Akhila Palempati, and Aniruddha Kanhe

Reinforcement Learning Applications for Performance Improvement in Cloud Computing—A Systematic Review
Prathamesh Vijay Lahande and Parag Ravikant Kaveri

Machine Learning Based Consumer Credit Risk Prediction
G. S. Samanvitha, K. Aditya Shastry, N. Vybhavi, N. Nidhi, and R. Namratha

Hate Speech Detection Using Machine Learning Techniques
Akileng Isaac, Raju Kumar, and Aruna Bhat

Real-Time Traffic Sign Detection Under Foggy Condition
Renit Anthony and Jayanta Biswas

Deep Learning Waste Object Segmentation for Autonomous Waste Segregation
Abhishek Rajesh Pawaskar and N. M. Dhanya

Deep Learning Techniques to Improve Radio Resource Management in Vehicular Communication Network
Vartika Agarwal and Sachin Sharma

Malabar Nightshade Disease Detection Using Deep Learning Technique
Imdadul Haque, Mayen Uddin Mojumdar, Narayan Ranjan Chakraborty, Md. Suhel Rana, Shah Md. Tanvir Siddiquee, and Md. Mehedi Hasan Ashik

A Constructive Deep Learning Applied Agent Mining for Supported Categorization Model to Forecast the Hypokinetic Rigid Syndrome (HRS) Illness
N. Priyadharshini, Juliet Rozario, and S. Delight Mary

High-Utility Pattern Mining Using ULB-Miner
Sallam Osman Fageeri, S. M. Emdad Hossain, S. Arockiasamy, and Taiba Yousef Al-Salmi

Recent Progress in Object Detection in Satellite Imagery: A Review
Kanchan Bhil, Rithvik Shindihatti, Shifa Mirza, Siddhi Latkar, Y. S. Ingle, N. F. Shaikh, I. Prabu, and Satish N. Pardeshi

Aspergillus Niger Fungus Detection Using Deep Convolutional Neural Network with Principal Component Analysis and Chebyshev Neural Network
Vanitha Venkateswaran and Sornam Madasamy

Autism Spectrum Disorder Classification Based on Reliable Particle Swarm Optimization Denoiser
G. Rajesh and S. Pannir Selvam

Automated News Summarization Using Transformers
Anushka Gupta, Diksha Chugh, Anjum, and Rahul Katarya

Generating Stylistically Similar Vernacular Language Fonts Using English Fonts
A. Ram Newton, G. A. Dhanush, T. P. V. Krishna Teja, M. Prathilothamai, and B. Siva Ranjith Kumar

Optimization of Artificial Neural Network Parameters in Selection of Players for Soccer Match
J. Vijay Fidelis and E. Karthikeyan

Organ Risk Prediction Using Deep Learning and Neural Networks
Simran Bafna, Achyut Shankar, Vanshika Nehra, Sanjeev Thakur, and Shuchi Mala

Image Processing (Image/Video) and Computer Vision

Automated Vehicle License Plate Recognition System: An Adaptive Approach Using Digital Image Processing
Saidur Rahman, Tahmina Aktar Trisha, and Md. Imran Hossain Imu

Detection of Image Forgery for Forensic Analytics
Chintakrindi Geaya Sri, Shahana Bano, Vempati Biswas Trinadh, Venkata Viswanath Valluri, and Hampi Thumati

Video Steganography Based on Edge Detector and Using Lifting Wavelet Transform
Meenu Suresh and I. Shatheesh Sam

A Vision-Based System Facilitating Communication for Paralytic Patients
R. Yaminee, Shatakshi Sinha, Neha Gupta, and A. Annis Fathima

Novel Fall Prevention Technique in Staircase Using Microsoft Kinect
Abhinav Sharma and Rakesh Maheshwari

Modified LSB Algorithm Using XOR for Audio Steganography
H. Hemanth, V. S. Hrutish Ram, S. Guru Raghavendran, and N. Subhashini

Region-Based Stabilized Video Magnification Approach
Sanket Yadav, Prajakta Bhalkare, and Usha Verma

Natural Language Processing

Analysing Cyberbullying Using Natural Language Processing by Understanding Jargon in Social Media
Bhumika Bhatia, Anuj Verma, Anjum, and Rahul Katarya

Impeachment and Weak Supervision
Samridh Khaneja

Devanagari to Urdu Transliteration System with and Without AIRAAB
Anjani Kumar Ray, Shamim Fatma, and Vijay Kumar Kaul

Communication System

Comparison of Radar Micro Doppler Signature Analysis Using Short Time Fourier Transform and Discrete Wavelet Packet Transform
A. R. Vandana and M. ArokiaSamy

Review of Security Gaps in Optimal Path Selection in Unmanned Aerial Vehicles Communication
M. Kayalvizhi and S. Ramamoorthy

Internet of Things

A Multilayered Hybrid Access Control Model for Cloud-Enabled Medical Internet of Things
Navod Neranjan Thilakarathne, H. D. Weerasinghe, and Anuradhi Welhenge

Effective Data Delivery Algorithm for Mobile IoT Nodes
Dharamendra Chouhan, Anupriya Singh, N. N. Srinidhi, and J. Shreyas

An IoT-Based Model for Pothole Detection
C. Anandhi and Kavitha Rajamohan

Network Lifetime Enhancement Routing Algorithm for IoT Enabled Software Defined Wireless Sensor Network
J. Shreyas, Dharamendra Chouhan, M. Harshitha, P. K. Udayaprasad, and S. M. Dilip Kumar

Network and Network Security

An Optimized Algorithm for Selecting Stable Multipath Routing in MANET Using Proficient Multipath Routing and Glowworm Detection Techniques
Barzan Abdulazeez Idrees, Saad Sulaiman Naif, Mohammad Gouse Galety, and N. Arulkumar

A Framework to Restrict COVID-19 like Epidemic Converting to Pandemic Using Blockchain
Faraz Masood and Arman Rasool Faridi

Application of Blockchain in Different Segments of Supply Chain Management
Dhairya K. Vora, Jash H. Patel, Dhairya Shah, and Prathamesh Mehta

Detection of DoS Attacks on Wi-Fi Networks Using IoT Sensors
Irene Joseph, Prasad B. Honnavalli, and B. R. Charanraj

DPETAs: Detection and Prevention of Evil Twin Attacks on Wi-Fi Networks
Fanar Fareed Hanna Rofoo, Mohammed Gouse Galety, N. Arulkumar, and Rebaz Maaroof

An Integrated Approach of 4G LTE and DSRC (IEEE 802.11p) for Internet of Vehicles (IoV) by Using a Novel Cluster-Based Efficient Radio Interface Selection Algorithm to Improve Vehicular Network (VN) Performance
Shaik Mazhar Hussain, Kamaludin Mohamad Yusof, Rolito Asuncion, Shaik Ashfaq Hussain, and Afaq Ahmad

A Review on Synchronization and Localization of Devices in WSN
M. N. Varun Somanna, J. Sandeep, and Libin Thomas

Data Analysis, Education Technology, Operating System and Applications of Information Technologies

Volume Analytics for an International Berry Supplier
Keya Choudhary Ganguli, Prerna Bhardwaj, and Khushboo Agarwal

SCAA—Sensitivity Context Aware Anonymization—An Automated Hybrid PPUDP Technique for Big Data
A. N. Ramya Shree, P. Kiran, R. Rakshith, and R. Likhith

Adopting Distance Learning Approaches to Deliver Online Creative Arts Education During the COVID-19 Pandemic
Kamani Samarasinghe and Rohan Nethsinghe

A Survey of Deadlock Detection Algorithms
Palak Vij, Sayali Nikam, and Amit Dua

A Survey on Privacy-Preserving Data Publishing Methods and Models in Relational Electronic Health Records
J. Jayapradha and M. Prakash

An Efficient Biometric Approach of the Attendance Monitoring System for Mahatma Gandhi National Rural Employment Guarantee Act (MGNREGA)
A. SrikantReddy, I. N. N. Krishna Sai, A. Mahendra, R. Ganesh, and Yalamanchili Sangeetha
About the Editors
Sagaya Aurelia Over her 20-year career in the teaching profession, Mrs. Sagaya Aurelia has worked with leading engineering colleges and universities in India and overseas, and she is currently working at CHRIST University. She completed her diploma and bachelor’s degree in electronics and communication, her master’s in information technology, and her doctorate in computer science and engineering. She has also completed a Post Graduate Diploma in Business Administration and a diploma in E-commerce Application Development. She has published several papers in international and national journals and conferences and won the best paper award for many of them. She serves as a member of ACM, ISTE, the International Economics Development Research Center, and the International Association of Computer Science and Information Technology. She was awarded the Peace Leader and Peace Educator award from the United Nations Global Compact’s global mission.

Somashekhar S. Hiremath is an Associate Professor in the Department of Mechanical Engineering, Indian Institute of Technology Madras. He has over two decades of teaching and research experience. He worked as a visiting professor at the Ecole Centrale de Nantes, France, and Warsaw University of Technology, Poland. He also worked as a Director of Tamil Nadu Small Industries Corporation Limited, popularly known as TANSI, Chennai. Dr. Somashekhar has guided 11 doctoral students, 5 M.S. (by research) students, and 43 master’s students. He has 2 patents, 56 journal papers, 130 conference papers, 3 edited books, 3 monographs, and 4 book chapters to his credit. His research areas include mechatronic system design, robotics, micromachining, fluid power systems, system simulation and modeling, finite element modeling, and precision engineering.

Karthikeyan Subramanian is currently heading the Department of Information Technology at the University of Technology and Applied Sciences, Oman. He completed his B.Sc. and M.Sc. from Bharathidasan University.
He completed his Ph.D. in computer science and engineering from Alagappa University. Dr. Subramanian served as Editor-in-Chief of the International Journal of Computer Science and System Analysis. He has guided 13 Ph.D. students. His interests are cryptography and network security, key management and encryption techniques, and sensor network security.

Saroj Kr. Biswas is currently associated with the National Institute of Technology (NIT) Silchar and has over ten years of experience. He received his B.Tech. from Jalpaiguri Government Engineering College, his M.Tech. from the National Institute of Technical Teachers’ Training and Research, Kolkata, and his Ph.D. from NIT Silchar. Dr. Biswas has supervised 2 Ph.D. students, is presently guiding 6 Ph.D. students, and has published 36 research papers in peer-reviewed journals of national and international repute. His research interests are artificial intelligence, sentiment analysis, and image processing.
Machine Learning
Predictive Analytics for Real-Time Agriculture Supply Chain Management: A Novel Pilot Study Suja Panicker, Sachin Vahile, Sandesh Gupta, Sajal Jain, Lakshita Jain, and Shraddha Kamble
Abstract Predictive analytics is a field of research aimed at discovering future trends and patterns from past data. With the proliferation of big data over the past decade, analytics has gained significance in almost all domains and has contributed substantially to insightful knowledge. In this paper, we present the initial walkthrough of a novel predictive analytics approach in the field of agriculture, aimed at enhancing and strengthening the currently feeble agriculture ecosystem. The important challenges we faced include the vast heterogeneity of the data and its highly scattered nature, with differences in timelines, structure, and area. Hence, integration of data is also an important contribution of the current work. Our fundamental motivation has been to develop a novel, efficient platform for bridging the gap between the numerous geographically dispersed producers and consumers, thereby not only increasing business value for the producers but also providing suitable recommendations, based on predictive analytics, that can be pivotal in future market decisions. In the current work, we have experimented with data provided by government sources and performed in-depth data analysis and visualization to infer insights that can be communicated to the producers, thereby directly impacting the markets and society. Our main contribution is the development of a novel predictive analytics-based platform for real-time agriculture supply chain management.

Keywords Big data · Machine learning · Agriculture supply chain · Predictive analytics · Decision support · Data visualization
S. Panicker (B) · S. Gupta · S. Jain · L. Jain · S. Kamble School of CET, MIT WPU, Pune, India e-mail: [email protected] S. Vahile HungreeBee LLP, Pune, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_1
1 Introduction
The rich agrarian culture and diverse regional climate of India have been pivotal in its significant contribution to the global food basket. India has a major lead globally in the following food segments: grapes, pulses, honey, milk, and vegetables [1]. The government reports pertaining to the three-year export statement [2] indicate India’s role as a major contributor of several agricultural products. Quantitative statistics for 2019–2020 indicate the following gross exports: 31,745 quantity of floriculture, 2,659,482 quantity of fresh fruits/vegetables, 1,124,532.61 of processed fruits/vegetables, 1,645,386.63 of animal products, 2,506,883.59 of other processed foods (such as cocoa and groundnuts), and 10,214,201.23 of cereals (such as rice, wheat, and maize). These enormous numbers indicate the compelling need to incorporate technological innovations through machine learning, predictive analytics, and web design/development for real-time contribution to current agriculture supply chain management.
1.1 Background
A supply chain is the flow and movement of goods from the producers to the final consumers. Supply chain management is the integrated planning, implementation, coordination, and control of all agri-business processes and activities necessary to produce and deliver, as efficiently as possible, products that satisfy consumer preferences and requirements. Understanding supply chain management is crucial before attempting to contribute to it. A typical agricultural supply chain includes farmers, suppliers, processors, distributors, consumers, and other stakeholders. As the number of stakeholders involved in the supply chain increases, the chain becomes riskier and more complex with respect to interpretability, compatibility, and the incorporation of innovative, technology-driven, and effective processes.
1.2 Motivation
The agriculture supply chain is beset by several issues that are faced by its numerous stakeholders. All of these challenges can be converted into opportunities by using digital technologies to enhance agricultural productivity and efficiency, starting from the individual producer and gradually extending to districts, states, and eventually the nation as a whole. Hence, the use of machine learning techniques and data analytics in this vast domain of agriculture is invaluable globally. We present some existing challenges in this field, which motivated the current project:
• The basic agriculture supply chain suffers from inefficiency at almost every stage.
• Lack of proper infrastructure for procuring agricultural produce from the farm gate to the consumer has led to huge losses in transit.
• The farmers hardly benefit from any price rise, while the many layers of intermediaries enjoy high margins.
2 Related Works
The global research community is showing rapidly increasing interest in automating or semi-automating processes and stages of agriculture supply chain management using machine learning, big data analytics, or Web design and development. Image processing has been widely used in research on identifying diseases in roots, fruits, or leaves, where machine learning-based techniques such as convolutional neural networks [3], k-means [4], and neural networks [5] have been applied. The use of machine learning in predicting soil type, soil moisture, crop pattern, etc., has been examined in [6, 7] with SVM, logistic regression, decision tree, and kNN classifiers. Predictive analytics using random forest regression and smart basket analysis is elaborated in [8]. Along similar lines, quality research has been done in the past, but the use of machine learning in improving the supply chain ecosystem is yet to be explored to the fullest, which motivates the current work. In [4], statistical analysis is performed on data received from semi-structured questionnaires, with a sample of 42 banks and 185 farmers from the Punjab (India). The objective was to assess access to finance from the producers’ point of view. A highlight of the latest research is presented in Table 1, with the research gaps thereof.
3 Research Gap
We list below some top-rated apps along with the research gaps for each of them.
• Agribuzz [13]: Agribuzz offers only limited searching and filtering of the feed, so there is no provision for selecting a required product. It is purely an explanatory app. Users cannot buy or sell through the app and must enter contacts manually to connect with sellers and buyers.
• GramSevakisan App [14]: It is solely a stakeholder app and requires no login. Users cannot save, buy, or sell anything, which is the main drawback of this application.
• Krishi Gyan App [15]: It is a monolingual (Hindi) application that displays only crop disease content.
• Khetibadi App [16]: The Khetibadi app gives farming tips. It is essentially just an informative app.
Table 1 Highlights of existing research

Reference no.: [9]
Publication year: 2019
Features:
1. Elderly farmers can distribute and deliver their products
2. Online buyers via the mobile application will search and purchase products for delivery
3. Distributors or transport contractors will transport the products of elderly farmers to consumers
Advantages:
1. Allows automatic matching between buyers and sellers
2. One-stop service supply chain focusing on marketing product distribution
3. Evaluation based on index of item objective congruence (IOC)
4. 97% accuracy found in the app as per the paper
5. Automatic search for distributors available to deliver agricultural products within a specified area
Gaps: Uncertainty in content consistency and inappropriateness for use as a feature in the mobile application

Reference no.: [10]
Publication year: 2017
Features:
1. Analysis of farmers’ awareness about MSP and its impact on diversification of crops
2. Prediction by probit model: MSP needs to be backed up by effective procurement
3. The relationship between farmers’ awareness about MSP and decision to go for crop specialization using Heckman selection model
Advantages:
1. Easy analysis of the awareness of the farmers
2. Ease of policy making for farmers regarding MSP
Gaps: During the 2018–19 Union Budget, the government announced that MSP would be kept at 1.5 times the C.P

Reference no.: [11]
Publication year: 2019
Features:
1. Machine learning techniques enhance data-driven decision making in agricultural supply chains
2. Systematic literature review based on 93 papers on machine learning applications in agricultural supply chains
3. Machine learning applications support the development of sustainable agriculture supply chains
Advantages: The study reveals considerable benefits to the ASC that have developed the ML capability, implying that the adoption of ML in decision making is beneficial
Gaps:
1. ISI Web of Science was used as the database for searching the papers for conducting the SLR
2. Important research papers are not included in the ISI Web of Science database. The SLR covers a timeframe of 19 years, i.e., 2000–2019
3. Future studies can also explore the extent of the ML application to ASCs in different regions across the world and provide a comparative assessment

Reference no.: [12]
Publication year: 2020
Features: To examine how bank managers make lending decisions to marginal farmers in India
Advantages: Results indicate that the Indian farming sector is a complex and multidimensional one
Gaps: A clearer picture would have appeared if the data set considered for the survey was huge
• AgriApp App [17]: AgriApp offers a buying option for agri-based equipment and the products used in growing crops.
• Digital Mandi App [18]: The drawback of the Digital Mandi app is its login issues.
• Mandi Trades App [19]: Mandi Trades allows sign-in through a mobile number only.
From the literature survey and the summary of apps presented above, we note that our current work will be valuable to agriculture, as it will provide a real-time interface between producers and consumers, benefiting both parties equally and thereby enhancing the entire agriculture ecosystem.
4 Proposed Work
Our solution is to combine several features in one application, bringing together the strengths of all of the above apps while overcoming their shortcomings with some additional features. We plan to enhance the Heckman model [10] with reduced bias for predicting the MSP for particular crops. The current work is part of an industry-supported internship project at HungreeBee Pvt. Ltd. [20] and will be deployed as an Android app to be used in real time by the producers and consumers associated with HungreeBee for initial testing. Later, after rigorous testing, our work will be deployed as a full-fledged system with an improved agriculture supply chain, powered by machine learning and big data analytics. Features to be included in our app are multilingual support, real-time trade, and speech-to-text/text-to-speech conversion, so that different segments of society can use the app efficiently. Figure 1 illustrates the proposed work in finer detail and was developed using Gliffy [21].
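For readers unfamiliar with the Heckman selection model mentioned above, its standard textbook two-equation form can be sketched as follows. This is the generic formulation, not necessarily the specification used in [10] or the enhanced variant planned here. The symbols (z_i, x_i, gamma, beta, rho, sigma) are introduced only for illustration, and the error terms are assumed to be jointly normal with the selection error scaled to unit variance.

```latex
\begin{align}
  s_i &= \mathbf{1}\{\, z_i^\top \gamma + u_i > 0 \,\}
      && \text{(selection equation)} \\
  y_i &= x_i^\top \beta + \varepsilon_i, \quad \text{observed only if } s_i = 1
      && \text{(outcome equation)} \\
  \mathbb{E}[\, y_i \mid s_i = 1 \,] &= x_i^\top \beta
      + \rho\, \sigma_\varepsilon\, \lambda(z_i^\top \gamma),
      \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
      && \text{(selection-corrected mean)}
\end{align}
```

Here phi and Phi are the standard normal density and distribution function, and lambda is the inverse Mills ratio. A regression on the selected sample that omits the lambda term is biased whenever rho is nonzero, which is the bias the two-step procedure is designed to correct.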
Fig. 1 Low level diagram of proposed work

5 Technology Stack
Certain apps depend on a set of fundamental libraries whose presence on the system is an essential prerequisite. Since the producers may or may not have access to devices that can run heavy applications, or the knowledge needed to install certain packages on their devices in advance, there is a need to develop a cross-platform application with Web support. This can be done using Flutter, which can provide a portable, efficient, high-quality, and smooth user experience across modern browsers. Flutter is a layered system, where each layer consists of independent libraries, each dependent on the underlying layer. At the bottom, we have an embedder that is platform-specific. At the core of Flutter is the Flutter engine, which is mostly written in C++ and supports the primitives necessary to support all Flutter applications. The Flutter engine is exposed to the Flutter framework through dart:ui [22]. Generally, developers interact with Flutter through the Flutter framework, which provides a modern, reactive framework written in the Dart language. Dart compiles to native machine code for mobile and desktop as well as to JavaScript for the Web. There are several Flutter UI toolkits available in the market, one of the latest being Liquid_ui [22]. Liquid is an open-source UI toolkit for developing apps specifically in Flutter. It has a powerful grid system, text processor, forms, extensive prebuilt components, and other additional utilities.

Database: We have selected the Firebase real-time database [23] as the most suitable database framework in this scenario. The Firebase real-time database is a cloud-hosted database. Data is stored as JSON and synchronized in real time to every connected client. When we build cross-platform apps with its SDKs, all the clients share one real-time database instance and automatically receive updates with the latest data.

Firebase ML kit [23]: To perform analysis based on different crops and their production data, we need to use certain machine learning algorithms and incorporate them into the application. Firebase ML kit is a powerful and easy-to-use package that brings Google’s machine learning expertise to the Flutter app and facilitates image recognition, text recognition, landmark recognition, image labeling, and language identification. With prior knowledge of ML, we can also utilize the APIs provided by Firebase ML kit to use our customized TensorFlow models with the application.
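To make the JSON storage model described for the real-time database more concrete, the following minimal sketch shows how a produce listing might be laid out as a JSON tree and addressed by a slash-separated path. It uses only the Python standard library and does not call any Firebase SDK. The key names (`listings`, `listing_001`, `crop`, `producer_id`, `quantity_kg`), the helper `child_at`, and all values are hypothetical placeholders, not the schema used in the project.

```python
import json

# Hypothetical listing tree; every key and value here is a placeholder,
# not the schema or data used by the authors.
tree = {
    "listings": {
        "listing_001": {
            "crop": "rice",
            "producer_id": "producer_001",
            "quantity_kg": 500,
        }
    }
}


def child_at(node, path):
    """Walk a slash-separated path through nested dicts,
    similar in spirit to addressing a child node in a JSON tree."""
    for key in path.strip("/").split("/"):
        node = node[key]
    return node


# Serialize to JSON text (what would travel over the wire) and restore it.
payload = json.dumps(tree, sort_keys=True)
restored = json.loads(payload)

print(child_at(restored, "listings/listing_001/crop"))  # prints: rice
```

Keeping each listing under its own key means a client interested in one listing only needs to read that subtree, which is one plausible reason to prefer a shallow, keyed layout for data that is synchronized to many clients.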
S. Panicker et al.
6 Experimental Work
We have used the real-time data provided by the Indian Government in its annual reports on agriculture [24]. Though rich, the data is heterogeneous and vast, which makes it complex to work with. We have worked on the data collected over the past 50–60 years and have cleansed and normalized it. Details of the real data:
• Foodgrains data: all-India area, production and yield along with coverage, with a dimensionality of (69, 5)
• Rice data: all-India area, production and yield along with coverage, with a dimensionality of (69, 5)
• Total foodgrains data: state-wise yields, with a dimensionality of (36, 15).
Since this data was not easily interpretable, it could not be used directly for any analysis. We performed a detailed analysis of the area of interest to address this problem and to facilitate use of this data in further tasks. Visualizing data graphically helps in drawing inferences. We explored some data visualizations extracted from the past 5 years report [24]. This data was received in the form of an annual report disseminated by the Government of India, and it presents extensive statistics about all-India food grains. The important features were area (in million hectares), production (in million tons), yield (kg/hectare) and area under irrigation (%), spanning 69 years from 1950 to 2019. To visualize the data, an open-source tool named Tabula [25] was used to extract the required tables in CSV format, which were further processed with Python libraries (Seaborn [26] and Matplotlib [27]) and Tableau [28]. Figure 2 is a graphical representation of the data sheets. The plot shows yield rising from 522 kg/hectare in 1950 to 2299 kg/hectare in 2019. This implies that the overall yield has increased by an average of 25.38 kg/hectare per year. Overall, the yield has increased by 340% in the last 70 years.
Fig. 2 Food grains: all India yield in Kg/hectare from 1950 to 2019
Fig. 3 Foodgrains: all India production in million tons from 1950 to 2019
The overall production over the 70 years from 1950 to 2019 has increased from 50.8 million tons to 284.9 million tons, an uneven rise of almost 460%, as illustrated in Fig. 3. As per the statistics, about one-third of the total land area in India is used for agriculture. Arable land area has increased from 100 million hectares to 124 million hectares over the period 1950 to 2019, as illustrated in Fig. 4, which plots area against year. A gradual increase in production with area, and with area under irrigation, in the initial years can be deduced from Figs. 5 and 6, respectively. Production relates directly to the area under irrigation, but many factors other than area affect the net production.
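The growth figures quoted in this section can be rechecked with a few lines of arithmetic. The sketch below is only an illustration: the endpoint values are the ones stated in the text, while the function names and the 70-year span used for the average are assumptions made here.

```python
def percent_increase(start: float, end: float) -> float:
    """Relative growth from start to end, in percent."""
    return (end - start) / start * 100.0


def average_annual_increment(start: float, end: float, years: int) -> float:
    """Average absolute change per year over the given span."""
    return (end - start) / years


# Endpoints for 1950 and 2019 as stated in the text.
SERIES = {
    "yield (kg/hectare)": (522.0, 2299.0),
    "production (million tons)": (50.8, 284.9),
    "area (million hectares)": (100.0, 124.0),
}
SPAN_YEARS = 70  # assumption: the "last 70 years" wording of the text

for name, (start, end) in SERIES.items():
    growth = percent_increase(start, end)
    step = average_annual_increment(start, end, SPAN_YEARS)
    print(f"{name}: +{growth:.1f}% overall, {step:.2f} per year on average")
```

With these inputs the yield growth comes out at roughly 340% and about 25.4 kg/hectare per year, and the production growth at roughly 460%, in line with the values reported above.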
Fig. 4 Foodgrains: all India area in million hectare from 1950 to 2019
Fig. 5 Production versus area
Fig. 6 Production versus area under irrigation
7 Future Scope
• The proposed solution fits the requirements of both stakeholders and addresses the problems they face.
• Incorporating a payment gateway feature into the exchange of crops will be considered.
• Further, a feed page offering real-time Q&A can be provided.
• Agriculture-related news would be a useful supplement.
• Along with the crops, various farming equipment and products could be integrated into one platform.
8 Conclusion
In the current work, we propose the use of machine learning and data analytics for the semi-automation of the vast, complex and vital agriculture supply chain management ecosystem. The proposed work is based on an internship project and shall be deployed for real-time use at HungreeBee Pvt. Ltd and its associates. We have presented the widespread lacunae in the current ecosystem and proposed a novel framework that benefits the producer and the consumer equally, by addressing several research gaps in existing work. We shall explore cognitive computing and natural language processing in addition to data analytics so as to offer novel features such as multilingual support and text-to-speech (and vice versa) conversion for the producers. Our work shall be tremendously useful to society due to its direct implications in the field of agriculture.
References
1. https://www.ibef.org/exports/agriculture-and-food-industry-india.aspx Accessed on 31 Jan 2021
2. http://agriexchange.apeda.gov.in/indexp/exportstatement.aspx Accessed on 31 Jan 2021
3. Diop PM, Takamoto J, Nakamura Y, Nakamura M (2020) A machine learning approach to classification of Okra. 2020 35th international technical conference on circuits/systems, computers and communications (ITC-CSCC). Nagoya, Japan, pp 254–257. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9183312&isnumber=9182785
4. Mrunmayee D, Ingole AB (2015) Diagnosis of pomegranate plant diseases using neural network. In: 2015 Fifth national conference on computer vision, pattern recognition, image processing and graphics (NCVPRIPG). IEEE. https://doi.org/10.1109/NCVPRIPG.2015.7490056
5. Shirahatti J, Patil R, Akulwar P (2018) A survey paper on plant disease identification using machine learning approach. 2018 3rd international conference on communication and electronics systems (ICCES). Coimbatore, India, pp 1171–1174. https://doi.org/10.1109/CESYS.2018.8723881
6. Heung B, Ho HC, Zhang J, Knudby A, Bulmer CE, Schmidt MG (Nov 2015) An overview and comparison of machine-learning techniques for classification purposes in digital soil mapping. Geoderma 266:62–77. https://doi.org/10.1016/j.geoderma.2015.11.014
7. Feyisa GL, Palao L, Nelson A, Win KT, Htar KN, Gumma MK, Johnson DE (March 2016) A participatory iterative mapping approach and evaluation of three machine learning algorithms for accurate mapping of cropping patterns in a complex agro-ecosystems. Adv Remote Sens 5:1–17. https://doi.org/10.4236/ars.2016.51001
8. Sharma R, Kapoor R, Bhalavat N, Oza C (2020) Predictive agricultural demand insights using machine learning. In: 2020 4th international conference on trends in electronics and informatics (ICOEI) (48184). Tirunelveli, India, pp 533–539. https://doi.org/10.1109/ICOEI48184.2020.9142978
9. Nuanmeesri S (2019) Mobile application for the purpose of marketing, product distribution and location-based logistics for elderly farmers. Appl Comput Inf. ahead-of-print. https://doi.org/10.1016/j.aci.2019.11.001
10. Aditya KS, Subash SP, Praveen KV, Nithyashree ML, Bhuvana N, Sharma A (2017) Awareness about minimum support price and its impact on diversification decision on farmers of India. https://doi.org/10.1002/app5.197
11. Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A (2020) A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res 119. https://doi.org/10.1016/j.cor.2020.104926
12. Sandhu N (2020) Dynamics of banks' lending practices to farmers in India. Journal of Small Business and Enterprise Development. ISSN 1462-6004. https://doi.org/10.1108/JSBED-05-2019-0161
13. https://play.google.com/store/apps/details?id=com.vinayaenterprises.www.agribuzz&hl=en&gl=US Accessed on 31 Jan 2021
14. https://play.google.com/store/apps/details?id=com.metalwihen.gramseva.kisan&hl=en_IN&gl=US Accessed on 31 Jan 2021
15. https://play.google.com/store/apps/details?id=com.mixorg.krishidarshan.activities&hl=en_IN&gl=US Accessed on 31 Jan 2021
16. https://play.google.com/store/apps/details?id=com.app.khetibadi&hl=en_IN&gl=US Accessed on 31 Jan 2021
17. https://play.google.com/store/apps/details?id=com.criyagen&hl=en_IN&gl=US Accessed on 31 Jan 2021
18. https://play.google.com/store/apps/details?id=com.maswadkar.digitalmandi&hl=en_IN&gl=US Accessed on 31 Jan 2021
19. https://play.google.com/store/apps/details?id=com.manditrades&hl=en_IN&gl=US Accessed on 31 Jan 2021
20. https://www.hungreebee.com Accessed on 31 Jan 2021
21. https://www.gliffy.com/ Accessed on 31 Jan 2021
22. http://www.liquid-ui.com/ Accessed on 31 Jan 2021
23. https://firebase.google.com/ Accessed on 31 Jan 2021
24. Agricultural Statistics at a Glance (2019) Government of India. Accessed on 31 Jan 2021. https://eands.dacnet.nic.in/PDF/At%20a%20Glance%202019%20Eng.pdf
25. https://tabula.technology/ Accessed on 31 Jan 2021
26. https://seaborn.pydata.org/ Accessed on 31 Jan 2021
27. https://matplotlib.org Accessed on 31 Jan 2021
28. https://www.tableau.com Accessed on 31 Jan 2021
Analysis and Prediction of Crop Infestation Using Machine Learning A. Mohamed Inayath Hussain, K. R. Lakshminarayanan, Ravi Dhanker, Sriramaswami Shastri, and Milind Bapu Desai
Abstract
The contribution of agriculture to the GDP of India is 27.82% [1], compared to the world average of 10.5% [1]. With a growing population and shrinking land availability for cultivation, there needs to be a focus on increasing the yield per hectare. As per FAO statistics, the loss of yield in crops due to pests is approximately 20%. The Indian government has embarked on a journey to increase the farmer's per capita income through various measures. We have attempted to help in this journey through a data-driven approach to the problem of predicting pest incidence in crops. We configured machine learning techniques to predict pest infestation in multiple districts of India for selected crops such as rice, cotton and maize. The models were developed using data about crops, soil nutrient availability and weather patterns. We used various analytical models, including multiple linear regression, random forest, support vector machines and time series models, for predicting and classifying pest infestation. Random forest classification was the best amongst all the models in terms of accuracy (88%) [Table 7]. However, based on the sensitivity (89.5%) [Table 7] and F1-score (88.85) [Table 7], we recommend the support vector machine as the most suitable algorithm. The features that were critical to predicting infestation were soil deficiency in potash, maximum temperature, cloud cover, wind speed and relative humidity.
Keywords Pest infestation · Predictive analytics · Soil nutrients deficiency imputation · Random forest · Time series · Support vector machine · Rice · Cotton · Maize
A. M. I. Hussain · K. R. Lakshminarayanan (B) · R. Dhanker · S. Shastri · M. B. Desai Great Learning, Bengaluru, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_2
A. M. I. Hussain et al.
Table 1 Loss due to infestation of crops in India
Crop | Actual production (million tons) (a) | Approx. estimate of loss in yield, total (million tons) | Approx. estimate of loss in yield (%) | Production in absence of loss due to pest | Value of estimated losses (b)
Rice | 93.1 | 31 | 25 | 124.1 | 164,300
Wheat | 71.8 | 3.8 | 5 | 75.6 | 23,560
Maize | 13.3 | 4.4 | 25 | 17.1 | 21,340
Sugarcane | 300.1 | 75 | 20 | 375.1 | 46,540
Cotton | 10.1 | 10.1 | 50 | 20.2 | 287,600
(a) Production at million bales at 170 kg each
(b) Based on the MSP fixed by the Govt. of India for 2001–02. Production and MSP figures from anonymous (2003)
1 Introduction
India is one of the leading countries in terms of production of staple crops; however, its yield per hectare is the lowest amongst the BRICS (Brazil, Russia, India, China and South Africa) nations, as per Food and Agriculture Organization (FAO) statistics. The loss in yield is attributed to various factors, ranging from farming practices and climatic conditions to pest infestation. In farming, infestation refers to persistent pests that infiltrate crops, affecting the plant's general health and development. These pests usually occur in the form of slugs, ants, spiders and various types of insects and are influenced by certain climatic conditions. Pests are responsible for destroying one-fifth of the global food crops annually (Table 1). According to the FAO, the total loss of food grains is to the tune of 1.3 billion tons per year. If we consider an average crop loss of 20% [1] and the present gross value of our agriculture production of Rs. 7 lakh crore [1], the loss is to the tune of Rs. 1.4 lakh crore [1]. Climatic conditions play a significant role in the development of a crop. Changes in average temperatures, rainfall, humidity and climatic extremes (e.g. heat waves) have a direct effect on the incidence of pests and diseases. With changing climatic conditions, better prediction of pest incidence should pave the way for better control of pests and, in turn, increase the yield.
2 Objective
Indigenous pest infestation is nothing new to an agrarian economy. However, non-indigenous pests and locusts sometimes invade our crops. In 2019, within 9 months, the fall armyworm infested almost ten states, causing extensive damage to crops.
Timely prediction of the infestation, with advance notification to the farmers, would have helped in better pest management. In this paper, we propose a two-step solution that treats pest infestation in certain districts, for a few identified crops and a specific week of the year, both as a binary classification problem and as a regression problem.
3 Literature Review
Gondal and Khan [2] studied how to determine the pest from plant parts such as leaves, flowers or stems. A comprehensive study by Xiao et al. [3] considered the climatic factors influencing pests, using the Apriori algorithm to find association rules between weather factors and the occurrence of crop pests. Patil et al. [4] used the factors mentioned above to recognize healthy and unhealthy plants. Arivazhagan et al. [5] extracted shape and colour features from the affected region. There is also a detailed study by Shanmuga Priya and Abinaya [6] on the incidence of pests as influenced by weather parameters. In one of the approaches studied, Tripathy et al. [7] detected whiteflies using classification as the learning approach.
4 Materials and Methods
4.1 Data Summary
Pest infestation was studied under two main categories of parameters: climate and soil. The data was also categorized by the two main seasonal crop types in India: Kharif and Rabi. The data collected from multiple sources had 3094 observations with 23 features (Table 2). The data had one target variable (pest infestation in units).
4.2 Data Imputation: Soil Nutrients Deficiency
The soil deficiency parameters of nitrogen, phosphorus and potassium (NPK) did not have entries for every district across all the years and had to be imputed. We predicted the NPK ratio at the state level based on NPK consumption at the national level. With the predicted values of state-level NPK consumption, we derived the correlation with NPK deficiency at the district level and imputed the values for deficiency. In the equations below, y is the state-level consumption of a nutrient and x is the national-level consumption of the same nutrient. The regression equation for state-level nitrogen consumption (1) and state-level
Table 2 Data summary [1]
Dimension: 3094 × 23
Target variable: Pest infestation (Units)
Categorical variables: Name of pest (7), Districts (6), State crop (9), Growth stage (3), Season (2)
Numerical variables: Evaporation rate (mm), Rainfall (mm), Sunshine (Hrs.), Max temperature (C), Min temperature (C), Relative humidity (%), Cloud cover (%), Soil nutrient deficiency in nitrogen (%), phosphorous (%) and potash (%), Wind speed (Kmph)
Time series variables: Week (52), Month (12), Year (1959–2019)
Table 3 Summary output of linear regression (District: Guntur)
Parameter (Consumption) | R-squared | Standard error | t-value | Pr(>|t|)
Nitrogen (N) | 0.874 | 0.503 | 3.285 | 0.046
Phosphorus (P) | 0.543 | 0.599 | 1.221 | 0.309
phosphorus consumption (2) for the state of Andhra Pradesh were
$$y = 1.645 + 0.434x \tag{1}$$
$$y = 0.732 + 0.514x \tag{2}$$
We derived similar equations for every state in the data set. The R-squared values obtained for the linear regression models are shown in Table 3. We then established the relationship between state-level NPK consumption and district-level deficiency, and used this regression to impute the missing NPK deficiency for each district and year. The average MAPE for the deficiency parameters ranged from 2% to 10% across the districts in the data set.
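The first imputation step and the MAPE check can be sketched as follows. This is only an illustration under stated assumptions: the intercepts and slopes are the Andhra Pradesh values from Eqs. (1) and (2), whereas the function names and the national-level input values are hypothetical and not taken from the study.

```python
# (intercept, slope) pairs reported for Andhra Pradesh in Eqs. (1) and (2).
STATE_MODELS = {
    "nitrogen": (1.645, 0.434),
    "phosphorus": (0.732, 0.514),
}


def predict_state_consumption(nutrient: str, national_consumption: float) -> float:
    """Apply y = intercept + slope * x for the chosen nutrient."""
    intercept, slope = STATE_MODELS[nutrient]
    return intercept + slope * national_consumption


def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical national-level nitrogen consumption values, for illustration only.
national_nitrogen = [10.0, 12.0, 14.0]
state_nitrogen = [predict_state_consumption("nitrogen", x) for x in national_nitrogen]
print(state_nitrogen)
```

The same pattern would apply to the second step (state-level consumption to district-level deficiency), for which the fitted coefficients are not listed in the text.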
4.3 Dependent Variable Standardization
As per the 1983 FAO research paper, pest incidence is classified into three types:
(a) Degree of prevalence of pests: the percentage of the area infested with pests or diseases.
(b) Intensity of pest incidence or infestation: the percentage of plants infested by the pest or disease.
(c) Severity of the infestation: the count of eggs, larvae or pests in the infested area, or the percentage of the leaf infested.
We had to standardize the pest infestation to a single unit of measure (UOM) based on severity, as the values in the original data were scaled differently. We adopted the following methodology to standardize the infestation data for every crop–district–year–week number combination:
• For a UOM of number per x plants: the infestation index was standardized using the total number of plants (based on the planting pattern of the crop in the district).
• For a UOM of number per x leaves: the same approach was used, with the number of leaves per plant.
• For a UOM of number per pheromone light trap: pest infestation is monitored through pheromone traps placed at specific intervals in the field. The standardized index was calculated using the number of traps per acre in a district for a crop.
After the standardization exercise, all the infestation data was in a single UOM.
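The three conversions above can be illustrated with a small sketch. The text does not give the exact formulas, so the choice of "count per acre" as the common unit, the function names and all numbers below are assumptions made purely for illustration.

```python
def per_plants_to_per_acre(count: float, x_plants: int, plants_per_acre: float) -> float:
    """Scale a count observed on x plants to a whole acre (assumed common unit)."""
    return count / x_plants * plants_per_acre


def per_leaves_to_per_acre(count: float, x_leaves: int,
                           leaves_per_plant: float, plants_per_acre: float) -> float:
    """Scale a count observed on x leaves via leaves per plant and plants per acre."""
    return count / x_leaves * leaves_per_plant * plants_per_acre


def per_trap_to_per_acre(count_per_trap: float, traps_per_acre: float) -> float:
    """Scale a pheromone-trap catch by the trap density in the field."""
    return count_per_trap * traps_per_acre


# Hypothetical inputs: 5 larvae on 10 plants, 4000 plants per acre.
print(per_plants_to_per_acre(5, 10, 4000))
# Hypothetical inputs: 12 moths per trap, 2 traps per acre.
print(per_trap_to_per_acre(12, 2))
```

Whatever common unit is chosen, the point is that each source UOM needs its own scaling factor before values from different districts and crops can be compared.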
4.4 Outlier Treatment
Only four predictors contained outliers: maximum temperature (Max. Temp), relative humidity (RH1), wind speed (WS) and sunshine hours (SSH). The outliers in climatic features were assumed to be seasonal; hence, we did not carry out any outlier treatment (Fig. 1).
Fig. 1 Relative humidity distribution
Fig. 2 Correlation of predictor variables
4.5 Correlation of Numerical Variables
We had 23 predictor variables in the data set, and hence, we studied the correlation plot (Fig. 2) to see whether some of the features could be dropped. As expected, sunshine hours (SSH) were negatively correlated with cloud cover (CC) and relative humidity (RH2). We also ran a principal component analysis (PCA) on the data set and found that the features could be reduced to two components; however, we used the original data set.
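A correlation check of this kind can be sketched in a few lines. The Pearson coefficient is assumed here, since the text does not name the coefficient used, and the weekly readings are hypothetical values chosen only to show the expected negative relationship between SSH and CC.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length, non-constant samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Pairwise Pearson correlations between all named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical weekly readings (illustration only, not data from the study).
weekly = {
    "SSH": [8.0, 6.5, 4.0, 2.5, 1.0],      # sunshine hours
    "CC": [10.0, 30.0, 55.0, 70.0, 90.0],  # cloud cover (%)
    "RH2": [40.0, 52.0, 60.0, 71.0, 85.0], # relative humidity (%)
}
matrix = correlation_matrix(weekly)
print(round(matrix["SSH"]["CC"], 3))
```

Pairs with a coefficient close to +1 or −1 are the natural candidates for dropping one of the two features.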
4.6 Regression Models and Outcome
We used multiple regression models (random forest, linear regression and time series linear regression) to predict the level of infestation for a pest and crop in a district. The pest incidence data showed a significant spread in the years between 2000 and 2010 for multiple districts and crops. It also showed a higher spread from August to October across various crops, indicating typical cropping-cycle behaviour of the pests. A closer look at the weekly distribution showed a cluster of data points from week 30 to week 52, which indicated a potential seasonality in the infestation (Fig. 3). We developed individual time series models for each pest, district and crop combination (six different models out of 11 combinations) (Table 4). Some of the combinations were not modelled as we had data only for a single year, as shown in Table 4. For each of the models, we ran a prediction algorithm using Holt-Winters,
Fig. 3 Akola cotton Thrips plot of time series data
Table 4 Observations across the timescale
Description | Observations | Years
Gall midge adult rice Cuttack | 520 | 10
Yellow stem borer rice Cuttack | 364 | 7
Spodoptera cotton Guntur | 260 | 5
Thrips cotton Akola | 260 | 5
American bollworm larva cotton Guntur | 208 | 4
American bollworm larva cotton Akola | 208 | 4
Mealybug cotton Coimbatore | 104 | 2
FAW maize Coimbatore, American bollworm larva cotton Nagpur and Parbhani, Spodoptera cotton Akola | 52 | 1
Fig. 4 Holt-winters model output for Akola Thrips cotton
ARIMA, time series with linear regression and random forest regression, and compared the performances of the algorithms on accuracy parameters. The Holt-Winters method and ARIMA had better prediction accuracy; however, these methods could not explain the importance of the predictor variables in the prediction. A sample output of the Holt-Winters model is shown in Fig. 4. We used a 'pmax' function to correct the negative forecasts from the HW algorithm. We also employed the technique of time series with linear regression (TSLM). A comparative account of the accuracy parameters of the different time series algorithms is given in Fig. 5. We inferred that the accuracy based on adjusted R² values ranges from 7 to 58% across the different models. We employed a train–test split of the time series data as indicated in Table 5. Since the data had intermittent year values, the test data
Fig. 5 Comparative accuracy parameters across time series model
Table 5 Training and test split of time series data
Model: pest region crop | Training data | Test data
Gall midge adult Cuttack rice | 1963–1969 | 1973–1975
Yellow stem borer Cuttack rice | 1995–1998 | 1999–2000
Spodoptera Guntur cotton | 1996–1999 | 2002
Thrips Akola cotton | 2005–2008 | 2009
American bollworm Guntur cotton | 1991–1993 | 1995
American bollworm Akola cotton | 2005 | 2007–2009
values were not continuous with the training data, and hence, the algorithm had to forecast 1–2 years ahead to ascertain the test accuracy. We compared the performance of the Holt-Winters [HW], random forest [RF] and time series with linear regression [TSLM] algorithms. The inference from the output was that the random forest regression model had the highest adjusted R² value across the three algorithms, as illustrated in Fig. 6. The values ranged from 30 to 50%. We derived the relative importance of the predictor variables (Table 6) from the time series linear regression algorithm.
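Two of the steps described in this section lend themselves to a short sketch: clamping negative forecasts (the role played by the 'pmax' function) and computing adjusted R². The code below is an illustration under assumptions: the lower bound of zero, the function names and the sample numbers are chosen here and are not taken from the study.

```python
def clamp_negative(forecasts: list[float], floor: float = 0.0) -> list[float]:
    """Element-wise maximum with a floor, analogous to R's pmax(forecasts, 0)."""
    return [max(value, floor) for value in forecasts]


def r_squared(actual: list[float], predicted: list[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical weekly forecasts; an infestation count cannot be negative.
raw_forecasts = [-2.5, 0.0, 3.0, 7.5]
print(clamp_negative(raw_forecasts))

# Hypothetical example: R^2 of 0.5 from 52 weekly observations and 3 predictors.
print(adjusted_r_squared(0.5, n_obs=52, n_predictors=3))
```

Adjusted R² penalizes additional predictors, which makes it a fairer basis than plain R² when models with different numbers of climatic and soil features are compared.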
Fig. 6 [Left] Accuracy performance across algorithms (pest: Thrips, region: Akola, crop: cotton). [Right] pest: gall midge adult, region: Cuttack, crop: rice
Table 6 Variable importance of predictors
Model | Significant feature | Remarks
Gall midge adult Cuttack rice | Def K | Increase in Def. of K increases infestation
Yellow stem borer Cuttack rice | CC | Increase in cloud cover increases infestation
Spodoptera Guntur cotton | RH2, SSH, Def N, Def P | Infestation increases with increase in Def. of P and reduces with increase in RH2, SSH and Def. of N
Thrips Akola cotton | Seasonality | Only seasonality impact
American bollworm Guntur cotton | RF, WS, EVP, Def N | Increase in WS and EVP increases the infestation, and increase in RF and Def. of N reduces it
American bollworm Akola cotton | RF, SSH, WS, EVP | Increase in SSH and WS increases the infestation, and increase in RF and EVP reduces it
The predictor variables rainfall (RF), sunshine hours (SSH), cloud cover (CC) and deficiency in potash (K), apart from seasonality, had the most significant impact on the prediction. Climatic conditions had a significant influence on the prediction of pests across the models. Since the regression models did not yield good prediction accuracy, we modelled the problem as a classification problem.
5 Classification Algorithms for Pest Prediction
We used the supervised learning algorithms of logistic regression (LR), support vector machine (SVM), random forest (RF) and time series analysis with linear regression (TSLM) to predict the infestation as a binary classifier. LR, RF and SVM are techniques that can be used when the decision variable is binary in nature. These models predict the decision variable as a function of the predictor variables. Given the crop and the district of cultivation, our model predicted the type of pest and, as a binary classifier, whether an infestation would occur.
5.1 Classification Modelling Techniques and Outcomes
Firstly, we built a binary classification model, which predicts whether an infestation occurs. Secondly, we built a multi-class classification model for pest type. For the logistic regression, we needed to encode the categorical variables. There are two different ways of encoding categorical variables. Say one categorical variable has N values: one-hot encoding converts it into N variables, while dummy encoding converts it into N − 1 variables. If we have K categorical variables, each of which has N values, one-hot encoding ends up with KN variables, while dummy encoding ends up with KN − K variables. The ratio of imbalance was 28%, and applying SMOTE to the data did not improve the results, so we went ahead with the original data. The following hyperparameters were tuned for RF (trees: 401, mtry: 8, nodesize: 10, stepFactor: 1.25, improve: 0.0001). The prepared data is not linearly separable, so SVM was used with a kernel; this model behaves like a black box in which the feature importance cannot be determined. We used the default hyperparameters with the 'radial' kernel. The model with the highest F1-score was taken to be the best model; hence, the F1-score was used as the evaluation metric in preference to accuracy or sensitivity, and SVM turned out to be the best model on this basis. Logistic regression was used with default parameters to understand the significance of each variable on the dependent variable. Soil deficiency in K, relative humidity, cloud cover and stage of crop are the variables which influence the pest infestation. A comparative account of the various accuracy parameters for the different models is given in Table 7.
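The difference between one-hot and dummy encoding can be made concrete with a minimal sketch. The helper names are hypothetical; the season labels come from the data summary, while the growth-stage labels are placeholders, since the text only states that there are three stages.

```python
def one_hot(values: list[str], categories: list[str]) -> list[list[int]]:
    """One indicator column per category (N columns for N categories)."""
    return [[1 if value == category else 0 for category in categories] for value in values]


def dummy(values: list[str], categories: list[str]) -> list[list[int]]:
    """Drop the first category as the baseline (N - 1 columns for N categories)."""
    return [[1 if value == category else 0 for category in categories[1:]] for value in values]


def encoded_width(k_variables: int, n_values: int, drop_first: bool) -> int:
    """Total columns for K variables with N values each: KN, or KN - K with a baseline."""
    return k_variables * (n_values - 1 if drop_first else n_values)


seasons = ["Kharif", "Rabi"]
stages = ["stage 1", "stage 2", "stage 3"]  # placeholder labels for the 3 growth stages

print(one_hot(["Rabi", "Kharif"], seasons))   # two columns per row
print(dummy(["Rabi", "Kharif"], seasons))     # one column per row
print(encoded_width(2, 3, drop_first=False), encoded_width(2, 3, drop_first=True))
```

Dummy encoding is usually preferred for logistic regression because the dropped baseline column avoids perfect collinearity with the intercept.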
Table 7 Model accuracy for different algorithms. (Units in percentage)
Metrics | RF train | RF test | SVM train | SVM test | GLM train | GLM test
Accuracy | 94.86 | 88.26 | 88.90 | 84.27 | 76.81 | 73.00
Sensitivity | 93.40 | 82.30 | 82.25 | 89.59 | 63.33 | 54.25
Specificity | 95.41 | 90.41 | 91.61 | 71.87 | 80.53 | 78.31
F1-score | 90.89 | 78.81 | 81.11 | 88.85 | 54.20 | 47.00
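The four metrics compared in Table 7 all derive from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not the confusion matrices behind Table 7.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, sensitivity (recall), specificity, precision and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # share of real infestations that were predicted
    specificity = tn / (tn + fp)   # share of non-infestations correctly rejected
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

A false negative here means a missed infestation, which is why sensitivity and the F1-score carry more weight than raw accuracy when the models are compared.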
6 Conclusion
Based on the comparison of model parameters across random forest (RF), support vector machine (SVM) and logistic regression (GLM), the following are the key inferences:
• Random forest had the highest accuracy at 88%, while SVM achieved an accuracy of 84%.
• SVM had a better sensitivity at 89% compared with random forest at 82%.
• The decision to select the prediction model was based on our inference of the cost of a false negative (not predicting the pest when there was an infestation), which would lead to higher damage to crops. We recommend the SVM model since it performed better than the other two models on the F1-score.
• Across the accuracy parameters, GLM scored the lowest.
Our model would be able to provide a better prediction of a certain pest, which would enable the farmers to take the necessary preventive actions. Also, through our work, we can provide advance information to fertilizer companies for targeted marketing of their products. There is further scope for enhancing the model through detailed data collection and through neural network-based models that use image recognition of infested crops.
References
1. Infestation data: http://www.fao.org/home/en/, https://icar.nta.nic.in/WebInfo/Public/Home.aspx, https://www.cabi.org/, https://www.crida.in/, Climatic condition data: https://www.indiawaterportal.org/, https://data.gov.in/
2. Gondal MD, Khan YN, Early pest detection from crop using image processing and computational intelligence
3. Xiao, Li W, Kai Y, Chen P, Zhang J, Wang B, Occurrence prediction of pests and diseases in cotton based on weather factors by long short-term memory networking
4. Patil AJ, Sivakumar S, Witt R, Forecasting disease spread to reduce crop losses
5. Arivazhagan S, Shebiah RN, Ananthi S, Varthini SV, Detection of the unhealthy region of plant leaves and classification of plant leaf diseases using texture features
6. Shanmuga Priya S, Abinaya M, Feature selection using random forest technique for the prediction of pest attack in cotton crops
7. Tripathy AK, Adinarayana J, Sudharsan D, Merchant SN, Desai UB, Vijayalakshmi K, Raji Reddy D, Sreenivas G, Ninomiya S, Hirafuji M, Kiura T, Tanaka K, Data mining and wireless sensor network for agriculture pest/disease predictions
ECG-Based Personal Identification System A. C. Ramachandra, N. Rajesh, and N. Rashmi
Abstract
A biometric system uses biosignal data, which is specific to each person, as features for recognition. Among biometric data, heartbeat-related ECG signals can be used for identifying individuals and diagnosing diseases. Also, ECG measuring devices are easier to miniaturize than those for other biosignals. In this paper, a fiducial approach to an ECG-based personal identification system (ECGPIS) is proposed. The proposed method consists of data collection, preprocessing, feature extraction and classification phases. In the first stage, datasets are obtained from the ECG-ID database. In the second stage, noise is reduced using a series of filters along with the wavelet transform. In the third stage, features are extracted and the mean fragment is obtained, which is used in the classification stage. Classification is done using a support vector machine (SVM), which gives the system high accuracy. The results obtained indicate that the proposed method is efficient and robust.
Keywords Biometric · ECG signals · Feature extraction · Machine learning
1 Introduction
Biometric recognition provides a great tool for protection. Identification by biological characteristics such as irises, fingerprints and faces is the major type of biometric technology currently in use. Nevertheless, such characteristics can be falsified and manipulated and then used for various crimes, such as piracy incidents involving replication of the iris, facial video piracy and fake fingerprints in electronic passports. Studies on the electromyogram (EMG), electrocardiogram (ECG) and electroencephalogram (EEG) are therefore being actively conducted to avoid
A. C. Ramachandra (B) · N. Rajesh · N. Rashmi Nitte Meenakshi Institute of Technology, Bangalore, Karnataka, India e-mail: [email protected] N. Rajesh e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_3
27
28
A. C. Ramachandra et al.
Fig. 1 ECG signal pulse [9]
this type of piracy. For intention identification, situation analysis and personal identification, EEG signal measures the flow of electricity generated by the brain activity. For movement recognition and disease diagnosis, EMG signal measures the flow of microcurrent generated when a muscle moves and ECG signal measures the change in electrical potential that is related to heartbeats, and it can be used for persona identification [1–4]. Due to their position of the heart and physical conditions, ECG signal is unique for every individual. The shape of the ECG waveform offers significant information about the function and rhythm of the heart that represents the present state of the heart [5–8]. P wave, a T wave and a QRS complex are the components of P-QRS-T characteristic waves in the ECG signal as shown in Fig. 1. For ECG, according to the method used for extraction of the features, based on ECG, these biometric systems can be classified into fiducial [10, 11] or non-fiducial systems [12–14]. The detection of ECG wave components, P-QRS-T is done for fiducial-based approach. The crucial task is detection of R peak which is the main characteristic wave in the ECG signal. R peak is taken as a reference point, the other wave components are detected, i.e., P, Q, S, T points after finding the location of R peak. The characteristics that represent the fiducial features are slope, amplitude, time, distance and other [15] characteristics. The AC/DCT method is used in nonfiducial approach which creates a predefined length of window of samples. In this approach, the samples which are found in the window are combined into sum of products. The ECG-based personal identification system that uses fiducial approach is proposed in this paper. The following steps are considered in the fiducialbased approach, namely dataset collection, preprocessing, feature extraction and classification.
ECG-Based Personal Identification System
2 Literature Survey K. Bashar proposed a method that combines EEG and ECG signals with multiple classifiers using wavelet-domain statistical features. The EEG and ECG signal features are combined at the feature level by simple addition of the feature vectors [9]. The fused wavelet packet statistics vectors with a fused classifier produce the highest average F-score when compared with a single feature with a fused classifier or with a single classifier. H. Ko et al. proposed an algorithm named the adjusted (Qi ∗ Si) algorithm to address the difficulty of detecting the Q value. The Q and S points are close to the location of the R peak, but determining the exact features is impossible because the R-value range is not fixed [16]. Hence, the algorithm fixes the R peak location from the distances of Q and S and established thresholds, allowing the user to be identified. E. Rabhi and Z. Lachiri proposed a fiducial approach that consists of three steps. After the preprocessing step, the second step extracts Hermite polynomial expansion coefficients and morphological descriptors from each heartbeat [10]. A hidden Markov model and a support vector machine are used for classification, which achieved the highest identification rate. D. Meltzer and D. Luengo proposed a fiducial approach that extracts a set of 15 features from the ECG signal for personal identification and selects some important features from the dataset. An SVM classifier is used, which obtained the highest accuracy with the fiducial approach [11]. M. Abdelazez et al. proposed a non-fiducial biometric system in order to study the effect of body posture [12]. The ECG signal varies with posture changes. Three system variants are considered in this approach: a posture-aware system, a single-posture system and a multi-posture system. They observed that taking posture into account can improve system performance, as reflected in the average true positive rate (TPR) of the posture-aware and multi-posture systems.
3 Proposed Model Figure 2 shows the block diagram of the proposed ECGPIS. It includes the following main steps: loading of the ECG signal, preprocessing, feature extraction and, finally, classification.
A. Data Acquisition
The datasets are collected from the public ECG-ID database, which is available on PhysioNet. It contains recordings [17] from 90 persons (44 male and 46 female) acquired with a single-lead ECG sensor. Each recording is sampled at 500 Hz and is 20 s long. Since only a small number of subjects in this database have more than two recordings, the first two recordings of each subject are considered.
Fig. 2 Block diagram of ECGPIS. Blocks, in order: Database; Preprocessing Stage (Noise removal from ECG Signal, Smoothing of ECG Signal); Feature Extraction (Peaks Detection, PQRST fragments, PQRST fragment mean); Classification; Identification
B. Preprocessing Stage
Preprocessing eliminates the most common noises that may occur in a signal and delivers a clean signal from which the features can be extracted. Three types of unwanted noise are present in the ECG signal: baseline wander caused by breathing, muscle noise due to movement and power line noise due to the alternating current. For the preprocessing step, the following combined approach was selected. First, to correct the baseline drift, the raw signal is decomposed to the ninth level using the db8 wavelet; the wavelet coefficients at the different levels are then shrunk by soft thresholding; finally, the thresholded wavelet coefficients are reconstructed with the inverse discrete wavelet transform (IDWT) to obtain the denoised signal. Next, frequency-selective filters are applied to remove the power line noise and high-frequency components: an adaptive bandstop filter and a Butterworth low-pass filter. The adaptive bandstop filter, with Ws = 50 Hz, notches out the power line frequency caused by the alternating current. The Butterworth filter, which has a flat (ripple-free) frequency response, removes the remaining high-frequency noise components; it uses Wp = 40 Hz, Ws = 60 Hz, Rp = 0.1 dB and a stopband attenuation of Rs = 30 dB. Finally, the signal is smoothed with N = 5, where N is the value used to produce the smoothed preprocessed signal.
C. Feature Extraction
Feature extraction reduces the dimensionality of the ECG signal: an initial collection of raw data is reduced to more manageable groups for processing. Such large datasets are characterized by a large number of variables that take many computational resources to process. Feature extraction is performed by calculating the amplitudes, lengths and angles of the ECG data using fiducial characteristics that capture complete patterns.
Peaks Detection The ECG signal feature extraction begins with the detection of the QRS complex using filters whose output has the same size as the preprocessed signal. The chain begins with DC shift cancelation and normalization, followed by low-pass and high-pass filtering to increase the signal-to-noise ratio; a derivative filter is then used to obtain the QRS slope information. Squaring and moving-window integration are used to enhance the dominant peaks, that is, the QRS peaks in the ECG signal. The moving-window integrator extracts information about the duration of the QRS wave and produces a vector of ones and zeros of the same size as the preprocessed signal, in which the ones mark the QRS complexes. This vector is divided into a left and a right vector, which are used to find the start of the Q peaks and the S peaks. The R peak is the maximum value between the left and right vectors, which makes it easy to detect. The Q and S peaks are the minimum values between the left vector and the R location and between the R location and the right vector, respectively. A window of samples is then created to the left of the Q peak, and the maximum sample in it is the P peak. Similarly, a window of samples is created to the right, near the S peak, and the maximum sample in it is the T peak.
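The derivative, squaring and moving-window integration chain described above can be sketched in a few lines. This is a minimal pure-Python illustration and not the authors' MATLAB implementation: the function names, the window width of 15 samples and the relative threshold of 0.3 are assumptions chosen for the example, and the input is assumed to be already preprocessed.

```python
def moving_window_integral(x, width):
    """Mean of the most recent `width` samples at each position (causal window)."""
    out, running = [], 0.0
    for i, v in enumerate(x):
        running += v
        if i >= width:
            running -= x[i - width]
        out.append(running / min(i + 1, width))
    return out


def detect_r_peaks(signal, width=15, rel_threshold=0.3):
    """Return the sample indices of the R peaks in a preprocessed ECG signal."""
    # Derivative filter: emphasises the steep slopes of the QRS complex.
    diff = [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]
    # Squaring makes every value non-negative and enhances large slopes.
    squared = [d * d for d in diff]
    # Moving-window integration merges each QRS complex into a single lobe.
    mwi = moving_window_integral(squared, width)
    threshold = rel_threshold * max(mwi)   # assumed, illustrative threshold rule
    # Vector of ones and zeros: the ones mark samples that belong to a QRS region.
    mask = [1 if v > threshold else 0 for v in mwi]

    peaks, start = [], None
    for i, m in enumerate(mask + [0]):     # trailing 0 closes a region at the end
        if m and start is None:
            start = i                      # left edge of a QRS region
        elif not m and start is not None:
            lo = max(0, start - width)     # look back because the window is causal
            hi = min(i, len(signal))
            # The R peak is the maximum sample inside the region.
            peaks.append(max(range(lo, hi), key=lambda k: signal[k]))
            start = None
    return peaks
```

The same region boundaries could then be reused to search for the Q and S minima on either side of each R location, as the text describes.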
Now the P, Q, R, S and T peaks and their locations have been detected, as well as the R interval in the ECG signal.
PQRST Fragment In this step, the most important fragments that describe the signal are selected so that they can represent it in the identification process. Several checks are made on each PQRST fragment to decide whether to keep it. The R interval of each PQRST fragment is checked: the number of samples in the R interval should lie between 35 and 65. If this condition is not satisfied, the PQRST fragment is rejected. Next, the amplitudes, distances and means are calculated for each PQRST fragment. Then, for each pair of successive PQRST fragments, the RR distances and the median and mean of the RR distances are calculated. The RR threshold is the minimum of the mean and the median of the RR distances, which gives better system performance. The final step in PQRST fragment selection is to select the most important and similar fragments by applying a weighted sum, restrictions and conditions to the ECG fragments; the PQRST fragment with the highest weight is chosen to represent the signal.
PQRST Fragment Mean The P-QRS-T fragments are extracted for each ECG record. The extracted PQRST fragments are processed to enhance their similarity. The modified superset features are then extracted in order to obtain the most important peaks in the fragment mean of the ECG signal.
D. Classification
Classification generates match scores by comparing the stored templates with the extracted features. The support vector machine, a supervised learning algorithm, is a powerful technique for pattern classification. When linear separation is not possible, nonlinear kernel functions can be applied, such as polynomial, radial and quadratic kernels. A kernel takes the data as input and transforms it into the required form: the kernel function returns the inner product between two points in a suitable feature space. For simple, small learning problems, classifier performance has been improved by using the polynomial kernel function and sequential minimal optimization (SMO).
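To make the role of the kernel concrete, the sketch below writes out the linear, polynomial and radial basis function kernels as plain Python functions. The parameter names and default values (`degree`, `coef0`, `gamma`) are illustrative assumptions, since the paper does not report the kernel parameters it used; a quadratic kernel corresponds to the polynomial kernel with `degree=2`.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # Plain inner product in the original feature space.
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    # Inner product in the space of monomials up to the given degree.
    return (dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    # Similarity decays with the squared Euclidean distance between x and y.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Each function returns a single similarity value for a pair of feature vectors, which is all an SVM needs from the kernel; the feature space itself is never built explicitly.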
4 Results and Discussion Figure 3a shows the ECG-ID signal of person 1, taken from PhysioNet. It contains powerline noise, baseline wander and other high-frequency distortions, which are removed using the wavelet transform and a series of filters to produce the noise-free signal shown in Fig. 3b.
Fig. 3 a ECG-ID signal before preprocessing. b ECG-ID signal after preprocessing
Figure 4 shows the signal with all the P, Q, R, S and T points detected, and the lower plot shows a zoomed view of the signal (Fig. 5). The algorithms were developed, and the experiments carried out, in MATLAB. The ECG-ID database contains the records of 90 persons; 2 recordings of each person were collected, for a total of 180 recordings. There were 100 training records and 80 testing records. The algorithms were tested on the ECG-ID database. Table 1 shows the number of correctly classified subjects, i.e., the true positives (TP), and the number of incorrectly classified subjects, i.e., the false positives (FP). Three measurements were used to compute the performance of the approach. Accuracy (A), given in Eq. (1), is the percentage of correctly classified subjects. Precision (P), given in Eq. (2), is the fraction of correctly
Fig. 4 ECG-ID signal with the detected peaks (P, Q, R, S, and T ) and a zoomed ECG-ID preprocessed signal
Fig. 5 a P-QRS-T fragment mean of the ECG-ID signal. b All the PQRST fragments plotted together, i.e., the most important fragments, which represent the signal in the identification process (axis label: the number of samples)
Table 1 The classification results for the ECG-ID database with the different kernels for classification
Database              Classifier   TP   FP   P       ER       A (%)
ECG-ID, 90 subjects   SVM-P        69   11   0.86    0.1375   86.25
ECG-ID, 90 subjects   SVM-L        65   15   0.81    0.1875   81.3
ECG-ID, 90 subjects   SVM-RBF      74   6    0.925   0.075    92.5
TP: true positive, FP: false positive, P: precision, ER: error rate, A: accuracy. SVM-P: polynomial kernel, SVM-L: linear kernel, SVM-RBF: radial basis function kernel
Fig. 6 Accuracy graph of ECG-ID Database for different kernels
classified subjects among all the records. Error rate (ER), given in Eq. (3), is the fraction of incorrectly classified subjects among all the records. Total samples taken are 281 (Fig. 6).
$$A = \frac{TP}{TP + FP} \times 100 \tag{1}$$
$$P = \frac{TP}{TP + FP} \tag{2}$$
$$ER = \frac{FP}{TP + FP} \tag{3}$$
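As a quick check, Eqs. (1) to (3) can be applied to the TP and FP counts listed in Table 1. The short sketch below does only that arithmetic; the function and variable names are assumptions introduced for the example.

```python
def metrics(tp, fp):
    """Accuracy (%), precision and error rate from Eqs. (1)-(3)."""
    total = tp + fp
    accuracy = tp / total * 100   # Eq. (1)
    precision = tp / total        # Eq. (2)
    error_rate = fp / total       # Eq. (3)
    return accuracy, precision, error_rate


# TP and FP counts as listed in Table 1 (80 test records per classifier).
table1 = {"SVM-P": (69, 11), "SVM-L": (65, 15), "SVM-RBF": (74, 6)}

for name, (tp, fp) in table1.items():
    a, p, er = metrics(tp, fp)
    print(f"{name}: A = {a:.2f}%, P = {p:.4f}, ER = {er:.4f}")
```

The unrounded accuracies are 86.25, 81.25 and 92.5%, so the 81.3 and the precision values 0.86 and 0.81 in Table 1 are rounded forms of the same results.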
5 Conclusion An ECG-based personal identification system is proposed in this paper. The proposed approach contains data acquisition, preprocessing, feature extraction and classification phases. For training and testing, the ECG signals are acquired from the ECG-ID dataset. The fiducial-based approach is used for extracting the features, and the extracted features are sent to the SVM classifier, which classifies the data by comparing the test data with the training data. The proposed approach provides a powerful ECG signal classification. With the help of this approach, one can develop biometric software for identifying different individuals. Further, the performance of the algorithm could be improved by fusing the non-fiducial and fiducial approaches to achieve a higher classification accuracy.
References
1. Zhang Y, Zhao Z, Guo C, Huang J, Xu K (2019) ECG biometrics method based on convolutional neural network and transfer learning. International conference on machine learning and cybernetics
2. Byeon Y-H, Pan S-B, Kwak K-C (2020) Ensemble deep learning models for ECG-based biometrics. Cybernetics & informatics (K&I)
3. Lee J-N, Kwak K-C (2019) Personal identification using a robust eigen ECG network based on time-frequency representations of ECG signals. IEEE Access
4. Cordeiro R, Gajaria D, Limaye A, Adegbija T, Karimian N, Tehranipoor F (2020) ECG-based authentication using timing-aware domain-specific architecture. IEEE transactions on computer-aided design of integrated circuits and systems
5. Bajare SR, Ingale VV (2019) ECG based biometric for human identification using convolutional neural network. 10th international conference on computing, communication and networking technologies
6. Samona Y, Pintavirooj C, Visitsattapongse S (2017) Study of ECG variation in daily activity. 10th biomedical engineering international conference
7. Hirayama Y, Takashina T, Watanabe Y, Fukumoto K, Yanagi M, Horie R, Ohkura M (2019) Physiological signal-driven camera using EOG, EEG, and ECG. 8th international conference on affective computing and intelligent interaction workshops and demos
8. Khoma V, Pelc M, Khoma Y (2018) Artificial neural network capability for human being identification based on ECG. 23rd international conference on methods and models in automation and robotics
9. Bashar K (2018) ECG and EEG based multimodal biometrics for human identification. Int Conf Syst, Man, Cybern, Miyazaki, Japan 4345–4350
10. Rabhi E, Lachiri Z (2018) Personal identification system using physiological signal. Middle East Conf Biomed Eng Tunis 153–158
11. Meltzer D, Luengo D (2019) Fiducial ECG-based biometry: comparison of classifiers and dimensionality reduction methods. International conference on telecommunications and signal processing. Budapest, Hungary, pp 552–556
12. Abdelazez M, Sreeraman S, Rajan S, Chan AC (2018) Effect of body posture on non-fiducial electrocardiogram based biometric system. International instrumentation and measurement technology conference. Houston, TX, pp 1–5
13. Farago P, Groza R, Ivanciu L, Hintea S (2019) A correlation-based biometric identification technique for ECG, PPG and EMG. International conference on telecommunications and signal processing. Budapest, Hungary, pp 716–719
14. Hejazi M, Al-Haddad SAR, Hashim SJ, Aziz AFA, Singh YP (2017) Non-fiducial based ECG biometric authentication using one-class support vector machine. Signal processing: algorithms, architectures, arrangements, applications. Poznan, pp 190–194
15. Naraghi ME, Shamsollahi MB (2011) ECG based human authentication using wavelet distance measurement. 4th international conference on biomedical engineering and informatics (BMEI)
16. Ko H et al (2019) ECG-based advanced personal identification study with adjusted (Qi * Si). IEEE Access 7:40078–40084
17. Kaul A, Arora AS, Chauhan S (2012) ECG based human authentication using synthetic ECG template. IEEE international conference on signal processing, computing and control
Explorations in Graph-Based Ranking Algorithms for Automatic Text Summarization on Konkani Texts Jovi D’Silva and Uzzal Sharma
Abstract The work presented in this paper is an attempt at exploring the field of automatic text summarization and applying it to the Konkani language, which is one of the low-resource languages in the automatic text summarization domain. Low-resource languages are those that have no or only very limited existing resources available, such as data, tools, language experts, and so on. We examine popular graph-based ranking algorithms and evaluate their performance in performing unsupervised automatic text summarization. The text is represented as a graph, where the vertices represent sentences and the edges between a pair of vertices represent a similarity score computed using a similarity measure. The graph-based ranking algorithms then rank the most relevant vertices (or sentences) to include in a summary. This paper also examines the impact of using weighted undirected or directed graphs on the output of the summarization system. The dataset used in the experiments was specially constructed by the authors using books on Konkani literature, and it is written in Devanagari script. The results of the experiments indicate that the graph-based ranking algorithms produce promising summaries of arbitrary length without needing any resources or training data. These algorithms can be readily extended to other low-resource languages to get favorable results. Keywords Automatic text summarization · Konkani text summarization · TextRank · Graph-based text summarization · Low-resource languages
J. D’Silva (B) · U. Sharma Computer Science Engineering, Assam Don Bosco University, Assam, India U. Sharma e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_4
1 Introduction Automatic text summarization (ATS) is the process in which a computer system automatically summarizes the contents of a text document provided to it [1]. Research pertaining to automatic summarization of text has gained a lot of popularity because of the huge amount of text available on the Internet. It is not possible for
a reader to read all the documents in their entirety to conclude whether the given article contains all the relevant information that they are looking for when browsing the internet. Therefore, if the readers are given a summary of an article, it would save them a lot of reading time and help them move over to the next article faster. Considering the huge number of text documents available on any topic, it is manually impossible to summarize all these documents to provide the readers with summaries. Hence, it is critical that ATS systems be developed to overcome such a challenge and provide a summary to the readers saving a lot of their time. Substantial research in ATS is pursued in the world’s most popularly used languages like English; however, the languages that are less popular with most of the population are not a very common choice for such research. These less popular languages also have very limited resources existing in the language to begin with any kind of research in natural language processing (NLP). Therefore, such languages come under the category of ‘low-resource’ languages. The low-resource nature makes it even more challenging to work with any such language as there are limited tools, data, and/or language experts readily available in the language. We extend the research in ATS to one such low-resource language, Konkani, which is spoken by a minority of the population along the west coast of India. The dataset used for this research contains Konkani folk tales from five rare books and was constructed specifically for this research in ATS. Folk tales are small stories that are passed from one generation to another, usually verbally, and comprise some form of morals and good teachings. The techniques used for automatic summarization of documents come under two categories: language-independent and language-dependent techniques [2]. 
Language-dependent techniques depend on tools available in a language such as lexicons of words for the removal of stop-words, language-specific dictionaries for lemmatization, and so forth [3]. However, a language-independent system is not dependent on any such tools or knowledge [3]. Graph-based ranking algorithms belong to the category of language-independent techniques, and they provide a way of identifying the significance of a graph’s vertex [2, 4]. The algorithms aid in ranking web pages by considering the structure of the links on the World Wide Web. The significance of a vertex is analyzed based on data that is collected globally and is calculated recursively from the complete graph. This graph-based modeling can be applied to a variety of text ranking and text summarization methods to get a better understanding of the context of the information presented in a document. In this paper, we examine the application of two graph-based ranking algorithms for performing language-independent single document unsupervised sentence-based text extraction on a dataset of Konkani language literature. Thereafter, we explore the impact of using undirected, forward-directed, and backward-directed graphs on the outcome of text summarization. ATS in Konkani language is a relatively unexplored domain. The paper is organized as follows; Sect. 2 gives the related work in graph-based ATS. Section 3 highlights the graph-based ranking algorithms used in this work. Section 4 outlines the representation of the textual data in the form of graphs. Section 5
highlights the similarity measure, and Sect. 6 gives the details of the dataset. The methodology used in this work is given in Sect. 7, followed by the evaluation and results in Sect. 8. The conclusion of the study is given in Sect. 9.
2 Related Work In 2004, R. Mihalcea introduced the approach of using graph-based ranking algorithms for summarizing text documents [4]. The approach was later termed ‘TextRank,’ a ranking model based on graphs [5]. Erkan and Radev proposed the ‘LexRank’ algorithm, which computes the significance of sentences from a sentence graph based on the concept of ‘eigenvector centrality’ [6]. Thakkar et al. presented a method for automatic extractive summarization employing a graph-based ranking algorithm and a shortest path algorithm by constructing undirected weighted graphs using the input text documents [7]. Hariharan et al. proposed enhancements to the graph-based models and also made use of directed forward and directed backward graphs [8]. Agrawal et al. proposed an approach using directed weighted graphs that represented sentences as nodes, and the edges carried weight, which was computed by examining the relation between a pair of adjacent nodes [9]. Thereafter, a ranking algorithm indicated the most relevant sentences that were considered critical to include in the summary [9]. In 2018, Sarwadnya and Sonawane proposed a system for summarizing Marathi text documents by building undirected graphs of the sentences and using ‘TextRank’ for ranking the relevant sentences based on their scores [10]. In the work presented by K Usha Manjari, extractive summarization of Telugu text documents is performed using the TextRank algorithm [11]. The architecture of the proposed system has four major steps: processing of the text documents; an evaluation step, which involves graph generation and ranking of the sentences; selection of the sentences; and summary generation. K. Agrawal used graph-based techniques for summarizing legal documents [12]. Mamidala et al. presented a heuristic approach for summarizing Telugu text documents [13].
They proposed an improved method for scoring the sentences based on event and named entity scores. Kothari et al. presented ‘GenNext,’ a model for generating extractive summaries for multi-genre documents of user reviews using graph-based method [14]. The proposed model shows higher accuracy compared to other models used for the said purpose. Gupta et al. proposed a graph-based extractive summarization model for summarizing biomedical texts [15]. They used sentence embeddings in combination with TextRank and PageRank algorithms for constructing the model. To summarize, graph-based methods have been effectively used for text summarization, though they have been used in popular languages such as English. This method has been effectively used in other Indian languages such as Marathi and Telugu as well. Graph-based methods have also been used for summarizing other genres of textual data, such as user reviews, legal documents, and biomedical texts.
As compared to other approaches used for text summarization, the merits of graph-based approaches include being completely unsupervised, language-independent, and domain-independent. They can also produce coherent summaries with no redundant information. The drawbacks of graph-based approaches are that they do not consider the importance of words in a document and may not be able to differentiate between sentences that are semantically similar. The similarity measure has a significant impact on how the graph-based methods perform in the selection of sentences [16]. Despite this, graph-based methods produce good-quality summaries of arbitrary length with no training data, a property that is particularly helpful when working with low-resource languages. The graph-based approaches construct a graph of text units, such as the sentences in a document, where one sentence recommends another sentence of importance for the formation of the summary. This represents a powerful and robust approach to automatic text summarization.
3 Graph-Based Ranking Algorithms Graph-based ranking algorithms provide a method for determining the ordering of the vertices of a graph based on their significance. The score that determines a vertex’s significance is computed for every vertex by analyzing the information deduced from the graph’s other vertices and edges. We review the ‘HITS’ and ‘PageRank’ algorithms, which have been proven to work efficiently in ranking tasks. These algorithms perform well when applied to text-based ranking and also work with weighted directed or undirected graphs [4, 5]. Traditionally, page ranking algorithms work on unweighted graphs, but they can be adapted to consider weighted edges as well [4, 5]. A directed graph ‘G’ can be represented as G = (V, E), where ‘V’ is the set of all the vertices (nodes) in G and ‘E’ is the set of all the edges (links) in G. For any vertex V_i, let In(V_i) denote the predecessors of V_i and Out(V_i) the successors of V_i [4, 5].
3.1 HITS In 1999, J.M. Kleinberg et al. introduced HITS, which stands for hyperlink-induced topic search [17]. HITS is a graph-based algorithm used for link analysis and for grading web pages on the World Wide Web. The algorithm works on the concepts of ‘authority’ and ‘hub.’ The authority score is computed from the incoming links to a web page, and the hub score from the outgoing links from a web page. Equations (1) and (2) give the mathematical representations of authority and hub [4].
$$HITS_A(V_i) = \sum_{V_j \in In(V_i)} HITS_H(V_j) \tag{1}$$
$$HITS_H(V_i) = \sum_{V_j \in Out(V_i)} HITS_A(V_j) \tag{2}$$
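Equations (1) and (2) can be iterated directly on an edge list. The sketch below is a minimal illustration in plain Python rather than the implementation used in the experiments: the edge weights and the normalization after each pass are assumptions added here (with all weights equal to 1 the updates reduce to the two equations above, and normalization only keeps the scores from growing without bound).

```python
import math


def hits(edges, n, iterations=50):
    """edges: list of (source, target, weight) tuples; vertices are 0..n-1."""
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Eq. (1): authority of a vertex sums the hub scores of its predecessors.
        new_auth = [0.0] * n
        for src, dst, w in edges:
            new_auth[dst] += w * hub[src]
        # Eq. (2): hub score of a vertex sums the authorities of its successors.
        new_hub = [0.0] * n
        for src, dst, w in edges:
            new_hub[src] += w * new_auth[dst]
        # Assumed normalization step so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth)) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub)) or 1.0
        auth = [v / a_norm for v in new_auth]
        hub = [v / h_norm for v in new_hub]
    return auth, hub
```

For a sentence graph, the vertices would be sentence positions and the weights the similarity scores between sentence pairs.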
3.2 PageRank In 1998, Brin and Page devised ‘PageRank’ for the analysis of links to determine the significance of a web page [18]. The algorithm consolidates the influence of incoming and outgoing links of a web page to produce a single value given by Eq. (3) [4].
$$PR(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|} \tag{3}$$
In the above equation, ‘d’ is the damping factor, which is set to 0.85 in the experiments presented in this paper. The HITS and PageRank algorithms begin by assigning an arbitrary value to every node in the graph and then iterate until the change between iterations falls below a stipulated threshold. After the algorithm completes, the value associated with every vertex indicates the significance or ‘power’ of that vertex [4, 5].
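The iteration just described can be sketched as follows. This is an illustrative pure-Python version, not the code used for the experiments: it uses the weighted form of the update, in which each incoming score is scaled by the edge weight divided by the total outgoing weight of the source vertex, and that form reduces to Eq. (3) when all weights are 1. The function name, the convergence tolerance and the iteration cap are assumptions.

```python
def pagerank(weights, d=0.85, tol=1e-6, max_iter=200):
    """weights[j][i] is the weight of the edge from vertex j to vertex i (0 = no edge)."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]   # total outgoing weight per vertex
    scores = [1.0] * n                        # arbitrary initial value
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if weights[j][i] > 0
            )
            new_scores.append((1 - d) + d * incoming)
        # Stop once no score changes by more than the tolerance.
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores
```

Representing the graph as a dense weight matrix keeps the sketch short; it suits the small, fully connected sentence graphs of a single document.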
4 Representing Textual Data as a Graph The foremost task when applying a graph-based algorithm on textual data is the construction of a graph, either undirected or directed. The distinct units of text in a document are depicted as the nodes of a graph and a weighted edge interlinks any two nodes based on the relationship between them. The ranking algorithm iterates until convergence is attained and the vertices are sorted according to their ranks [4, 5].
4.1 Directed Graph According to Metcalf and Casey, a relationship between any two vertices in a directed graph is a one-way relationship. If a directed graph is envisioned, then the nodes are interlinked with arrows. The arrow illustrates that the node where the arrow originates has a relationship with the destination node where the arrow points. This relationship indicated by the arrow need not be true conversely [19].
The graph can be depicted as a ‘forward-directed graph’ or a ‘backward-directed graph’ [4, 5, 20]. A forward-directed graph is formed of edges that connect sentences that appear first in a document to the sentences that appear later. A backward-directed graph is built with edges that link sentences to those that appear prior to them [2]. If ‘i’ and ‘j’ are sentence positions, then forward-directed graphs have edge weights only when i < j, and backward-directed graphs have edge weights only when i > j [4, 5, 20].
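The i < j and i > j conditions translate into a simple mask over a sentence similarity matrix. The following sketch assumes a symmetric matrix of precomputed similarity scores; the function name and the `direction` argument are hypothetical and used only for illustration.

```python
def directed_weights(similarity, direction):
    """Keep only forward (i < j) or backward (i > j) edges of a similarity matrix.

    similarity[i][j] is the score between sentences i and j;
    the result holds the weight of the edge from sentence i to sentence j.
    """
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    n = len(similarity)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            keep = i < j if direction == "forward" else i > j
            if keep:
                weights[i][j] = similarity[i][j]
    return weights
```

A forward-directed graph therefore occupies the upper triangle of the matrix and a backward-directed graph the lower triangle.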
4.2 Undirected Graph We construct a fully connected undirected graph in which there is an edge from each vertex to all the other vertices in the graph. Mihalcea et al. noted that graph-based ranking algorithms have traditionally been applied to directed graphs; however, they can also be applied to undirected graphs. In an undirected graph, the out-degree of a vertex is equal to its in-degree [2, 3, 12]. This means that every undirected edge in the graph is replaced with two directed edges pointing in opposite directions. This captures the information of both the forward-directed and backward-directed graphs.
5 Similarity Measure A relationship between any two vertices can be established by computing similarity scores. The objective of the similarity measure is to reveal the overlap between any two sentences by counting the tokens common to the two sentences and then normalizing by the lengths of the two sentences. If $S_i$ and $S_j$ are two sentences in a document, then let $N_i$ denote the number of words that form the sentence $S_i = W_1^i, W_2^i, \ldots, W_{N_i}^i$. The similarity between $S_i$ and $S_j$ can be computed with Eq. (4) [4].

$$\mathrm{Similarity}(S_i, S_j) = \frac{\left|\{W_k \mid W_k \in S_i \;\&\; W_k \in S_j\}\right|}{\log(|S_i|) + \log(|S_j|)} \tag{4}$$
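Eq. (4) can be sketched in a few lines. The function name `overlap_similarity` is hypothetical, and two choices here are assumptions rather than details stated in the text: sentence length is taken as the number of tokens, and the overlap counts distinct shared tokens. The natural logarithm is used; a different base would only rescale all weights by a constant.

```python
import math
from typing import Sequence


def overlap_similarity(s_i: Sequence[str], s_j: Sequence[str]) -> float:
    """Common-token overlap normalized by the log lengths of both sentences."""
    if not s_i or not s_j:
        return 0.0  # an empty sentence shares nothing
    denominator = math.log(len(s_i)) + math.log(len(s_j))
    if denominator == 0.0:
        # Two one-token sentences: both log terms are zero, so the
        # ratio is undefined; treat the pair as unrelated.
        return 0.0
    common = set(s_i) & set(s_j)
    return len(common) / denominator
```

For example, the sentences "the cat sat" and "the dog sat down" share two tokens, giving 2 / (log 3 + log 4).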
6 Dataset The authors of this paper designed and framed the dataset used for these experiments as there was an evident lack of any such dataset in Konkani language literature at the time the research was being carried out. The authors have presented a Konkani dataset
that has a collection of 71 hand-picked folk tales from five books written by various authors in the Devanagari script. The detailed process of compiling the dataset is described in [21].
7 Sentence Extraction Methodology The procedure for extraction of sentences from a text document is as stated below:
• Every input document goes through a phase called ‘tokenization,’ where the text document is broken down into smaller units or sentences.
• The sentences are then ‘cleaned’ of all punctuation to compute the similarity measure. However, the punctuation is retained in the final output summary.
• The cleaned text units form the vertices of a graph. Edges connect the vertices based on the relationship shared between any two vertices; this is defined by the similarity measure. The resulting graph can be represented as a fully connected undirected graph, a directed forward graph, or a directed backward graph.
• Graph-based ranking algorithms are then applied to the constructed graph to rank the graph vertices. These ranked vertices are organized in descending order of their ranking scores. Since the vertices are in fact sentences of the original input document, we get a ranking of the sentences.
• The top sentences from the list obtained in the previous step are then chosen to be a part of the output summary. The summary length is then restricted to a fixed count of 300 words for evaluation.
An example of a fully connected undirected graph is shown in Fig. 1. We consider a document containing five sentences numbered from 0 to 4. The edges are weighted using the similarity measure mentioned in Sect. 5. The values in red indicate the computed PageRank values; the vertices are ranked based on these values, and the sentences that correspond to the top-ranked vertices are selected to form the summary.
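The ranking and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses the weighted PageRank update common in TextRank-style systems with an assumed damping factor of 0.85, the names `weighted_pagerank` and `select_summary` are hypothetical, and the way the 300-word limit is enforced (stop before the first sentence that would exceed the budget, then restore document order) is one possible reading of the last step.

```python
from typing import List, Sequence


def weighted_pagerank(
    weights: Sequence[Sequence[float]],
    damping: float = 0.85,
    tol: float = 1e-6,
    max_iter: int = 100,
) -> List[float]:
    """Iterate score[i] = (1 - d) + d * sum_j score[j] * w[j][i] / out[j]."""
    n = len(weights)
    if n == 0:
        return []
    out_sum = [sum(row) for row in weights]  # total outgoing weight per vertex
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0.0
            )
            new_scores.append((1.0 - damping) + damping * incoming)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


def select_summary(
    sentences: Sequence[str], scores: Sequence[float], max_words: int = 300
) -> List[str]:
    """Take top-ranked sentences within the word budget, in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length > max_words:
            break  # the next-best sentence no longer fits
        chosen.append(i)
        used += length
    return [sentences[i] for i in sorted(chosen)]
```

The weight matrix passed to `weighted_pagerank` is assumed to hold the similarity of sentence i to sentence j at position [i][j], with zeros where the chosen graph type has no edge.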
8 Evaluation and Results We performed automatic extractive summarization of Konkani documents using the unsupervised, language-independent graph-based algorithms PageRank and HITS. We used the ROUGE Toolkit for evaluation. It works on N-gram statistics and has been found to correlate well with human evaluations [22]. Here, the system-generated 300-word summary is compared against two human-generated reference summaries of 300 words each. The ROUGE metric is used to quantify the overlap of unigrams, bi-grams and the ‘Longest Common Subsequence’ (LCS) between the reference
Fig. 1 A sample graph using five sentences and PageRank
summaries and the system-produced summaries. ROUGE-1 refers to unigram scores, ROUGE-2 to bi-gram scores, and ROUGE-L to Longest Common Subsequence (LCS) scores. Table 1 depicts the ROUGE-1 unigram values, Table 2 shows the ROUGE-2 bi-gram values, and Table 3 displays the ROUGE-L LCS values. Each row in the tables represents the scores of an individual graph ranking summarization system used with undirected graphs (UD), directed forward graphs (DF), or directed backward graphs (DB). ‘Precision’ measures how much of the system summary is relevant, that is, also present in the reference summary. ‘Recall’ measures how much of the reference summary the system summary has captured. The F1-score takes both precision and recall into consideration and reports a single score. The best performing system is highlighted using bold text in the tables.

Table 1 ROUGE-1 uni-gram scores

| System | Precision | Recall | F1-score |
|---|---|---|---|
| **PageRank-commontokens-UD** | **0.32849** | **0.32823** | **0.32834** |
| PageRank-commontokens-DF | 0.32304 | 0.32285 | 0.32293 |
| PageRank-commontokens-DB | 0.32414 | 0.32398 | 0.32404 |
| HITS-Hubs-commontokens-UD | 0.31460 | 0.31417 | 0.31437 |
| HITS-Hubs-commontokens-DF | 0.31751 | 0.31692 | 0.31719 |
| HITS-Hubs-commontokens-DB | 0.31768 | 0.31710 | 0.31737 |
| HITS-Authorities-commontokens-UD | 0.31460 | 0.31417 | 0.31437 |
| HITS-Authorities-commontokens-DF | 0.31768 | 0.31710 | 0.31737 |
| HITS-Authorities-commontokens-DB | 0.31751 | 0.31692 | 0.31719 |

Table 2 ROUGE-2 bi-gram scores

| System | Precision | Recall | F1-score |
|---|---|---|---|
| **PageRank-commontokens-UD** | **0.08688** | **0.08680** | **0.08683** |
| PageRank-commontokens-DF | 0.08526 | 0.08525 | 0.08525 |
| PageRank-commontokens-DB | 0.08645 | 0.08641 | 0.08643 |
| HITS-Hubs-commontokens-UD | 0.08416 | 0.08402 | 0.08409 |
| HITS-Hubs-commontokens-DF | 0.08332 | 0.08317 | 0.08324 |
| HITS-Hubs-commontokens-DB | 0.08136 | 0.08120 | 0.08127 |
| HITS-Authorities-commontokens-UD | 0.08416 | 0.08402 | 0.08409 |
| HITS-Authorities-commontokens-DF | 0.08136 | 0.08120 | 0.08127 |
| HITS-Authorities-commontokens-DB | 0.08332 | 0.08317 | 0.08324 |

Table 3 ROUGE-L LCS scores

| System | Precision | Recall | F1-score |
|---|---|---|---|
| **PageRank-commontokens-UD** | **0.31984** | **0.31960** | **0.31970** |
| PageRank-commontokens-DF | 0.31717 | 0.31699 | 0.31706 |
| PageRank-commontokens-DB | 0.31797 | 0.31781 | 0.31786 |
| HITS-Hubs-commontokens-UD | 0.30645 | 0.30602 | 0.30622 |
| HITS-Hubs-commontokens-DF | 0.31024 | 0.30967 | 0.30994 |
| HITS-Hubs-commontokens-DB | 0.31028 | 0.30971 | 0.30997 |
| HITS-Authorities-commontokens-UD | 0.30645 | 0.30602 | 0.30622 |
| HITS-Authorities-commontokens-DF | 0.31028 | 0.30971 | 0.30997 |
| HITS-Authorities-commontokens-DB | 0.31024 | 0.30967 | 0.30994 |

From the data depicted in the tables above, it can be noted that systems based on PageRank performed better than those that used HITS. Among the systems that ran on HITS, we observed that, on ROUGE-1 and ROUGE-L, the ‘HITS-Hubs’ scored systems performed best on ‘directed backward’ graphs and the ‘HITS-Authorities’ scored systems performed best on ‘directed forward’ graphs. We note that graph-based techniques do not make use of any language-dependent resources and are completely unsupervised. They can also construct promising summaries by ranking sentences using information derived from the sentences themselves. According to Mihalcea [4], because graph-based systems produce rankings of sentences, they can produce summaries of arbitrary length, as can be seen in the results of our experiments, which produced summaries of 300 words.
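The precision, recall, and F1 figures reported in Tables 1 and 2 follow the usual N-gram overlap definitions, which can be sketched as below. This is a simplified illustration with hypothetical function names, assuming a single reference summary and pre-tokenized input; the ROUGE Toolkit itself handles multiple references and further options that are not reproduced here.

```python
from collections import Counter
from typing import Dict, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))


def rouge_n(
    system: Sequence[str], reference: Sequence[str], n: int = 1
) -> Dict[str, float]:
    """N-gram precision, recall, and F1 of a system summary against one reference."""
    sys_counts, ref_counts = ngrams(system, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((sys_counts & ref_counts).values())
    sys_total = sum(sys_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / sys_total if sys_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision divides the overlap by the size of the system summary and recall divides it by the size of the reference summary, so the two coincide when both summaries have the same length, which is consistent with the near-identical precision and recall values in the tables for 300-word summaries.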
9 Conclusion We explored graph-based ranking algorithms and applied them to the area of ATS in the Konkani language. Graph-based methods perform well as language-independent models for text summarization [4]. In particular, we noted that the ‘PageRank’ algorithm outperformed ‘HITS’ in our experiments using a fully connected undirected graph. Graph-based systems can produce promising summaries of arbitrary length using no other language resources or training data. Rather, each text unit recommends another text unit, and this information, derived from the input text itself, helps generate the ranking of the sentences and form the summary. In addition, language-independent models, such as graph-based methods, can be easily extended to other low-resource languages like Konkani and yet produce promising summaries.
References
1. Lloret E, Palomar M (2012) Text summarisation in progress: a literature review. Springer, pp 1–41
2. D’Silva J, Sharma U (2019) Automatic text summarization of Indian languages: a multilingual problem. J Theor Appl Inf Technol 97(11):3026–3037
3. Saleh AA, Weigang L (2017) Language independent text summarization of western European languages using shape coding of text elements. 2017 13th international conference on natural computation, fuzzy systems and knowledge discovery (ICNC-FSKD). pp 2221–2228. https://doi.org/10.1109/FSKD.2017.8393116
4. Mihalcea R (2004) Graph-based ranking algorithms for sentence extraction, applied to text summarization. Proceedings of the ACL 2004 on interactive poster and demonstration sessions. p 20
5. Mihalcea R, Tarau P (2004) TextRank: bringing order into text. Proceedings of the conference on empirical methods on natural language processing (EMNLP). pp 404–411
6. Erkan G, Radev DR (2004) LexRank: graph-based lexical centrality as salience in text summarization. J Artif Intell Res 22:457–479
7. Thakkar K, Dharaskar RV, Chandak MB (2010) Graph-based algorithms for text summarization. 3rd International conference on emerging trends in engineering & technology. pp 516–519. https://doi.org/10.1109/ICETET.2010.104
8. Hariharan S, Srinivasan R (Feb 2010) Enhancements to graph based methods for single document summarization. IACSIT Int J Eng Technol 2(1). ISSN: 1793-8236
9. Agrawal N, Sharma S, Sinha P, Bagai S (Feb 2015) A graph based ranking strategy for automated text summarization. DU J Undergraduate Res Innovation
10. Sarwadnya V, Sonawane S (2018) Marathi extractive text summarizer using graph based model. Fourth international conference on computing communication control and automation (ICCUBEA). https://doi.org/10.1109/iccubea.2018.8697741
11. Manjari KU (2020) Extractive summarization of Telugu documents using TextRank algorithm. 2020 Fourth international conference on I-SMAC (IoT in social, mobile, analytics and cloud) (I-SMAC). pp 678–683. https://doi.org/10.1109/I-SMAC49090.2020.9243568
12. Agrawal K (2020) Legal case summarization: an application for text summarization. 2020 International conference on computer communication and informatics (ICCCI). pp 1–6. https://doi.org/10.1109/ICCCI48352.2020.9104093
13. Mamidala et al (2021) A heuristic approach for Telugu text summarization with improved sentence ranking. Turkish J Comput Math Educ 12(3):4238–4243
14. Kothari K, Shah A, Khara S, Prajapati H (2021) GenNext—an extractive graph-based text summarization model for user reviews. In: Tuba M, Akashe S, Joshi A (eds) ICT systems and sustainability. Advances in intelligent systems and computing, vol 1270. Springer, Singapore. https://doi.org/10.1007/978-981-15-8289-9_11
15. Gupta S, Sharaff A, Nagwani NK (2022) Biomedical text summarization: a graph-based ranking approach. In: Iyer B, Ghosh D, Balas VE (eds) Applied information processing systems. Advances in intelligent systems and computing, vol 1354. Springer, Singapore. https://doi.org/10.1007/978-981-16-2008-9_14
16. El-Kassas WS, Salama CR, Rafea AA, Mohamed HK (2021) Automatic text summarization: a comprehensive survey. Expert Syst Appl 165:113679
17. Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM (JACM) 46(5):604–632
18. Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Networks ISDN Syst 30(1–7):107–117. https://doi.org/10.1016/s0169-7552(98)00110-x
19. Metcalf L, Casey W (2016) Graph theory. Cybersecurity Appl Math 67–94. https://doi.org/10.1016/b978-0-12-804452-0.00005-1
20. Mihalcea R, Radev DR (2011) Graph-based natural language processing and information retrieval. Cambridge University Press, Cambridge
21. D’Silva J, Sharma U (2019) Development of a Konkani language dataset for automatic text summarization and its challenges. Int J Eng Res Technol. Int Res Publ House 12(10):18913–18917. ISSN 0974-3154
22. Lin CY (July 2004) ROUGE: a package for automatic evaluation of summaries. Proceedings of the workshop on text summarization branches out (WAS 2004). Barcelona, Spain, pp 74–81
Demography-Based Hybrid Recommender System for Movie Recommendations Bebin K. Raju and M. Ummesalma
Abstract Recommender systems have been explored with different research techniques including content-based filtering and collaborative filtering. A main issue is the cold start problem: how recommendations are to be made for a new user on the platform. There is a need for a system which has the ability to recommend items suited to the user’s demographic category by considering the collaborative interactions of similar categories of users. The proposed hybrid model addresses the cold start problem using collaborative, demography, and content-based approaches. The base algorithm for the hybrid model, SVDpp, produced a root mean squared error (RMSE) of 0.92 on the test data.
Keywords Collaborative filtering · Content filtering · Demography filtering · SVDpp · Cold start
1 Introduction We often rely on recommendations in different situations. We give importance to other people’s opinions while choosing any item. The reviews given by users during an application’s beta test act as feedback and recommendations to the developers. While hiring an employee, the human resource manager looks into the candidate’s reference list to learn what their mentors or colleagues have to say about the candidate. The same concept has been used by the e-commerce and online streaming industries to show recommendations to their users. E-commerce platforms like Amazon present recommendations like “people who liked this also like,” “similar products,” and “Top picks for the user”. Netflix also has a recommendation engine which shows “people who watched this also watched,” “similar movies,” and “Top picks for the user”.
B. K. Raju (B) · M. Ummesalma Computer Science Department, Christ Deemed to be University, Karnataka, Bengaluru, India e-mail: [email protected] M. Ummesalma e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_5
Recommendation systems are reliable systems that make predictions based on the end users’ characteristics and behaviors. The recommendations shown are personalized by tracking both the implicit and explicit user activities. Different algorithms have been introduced to solve specific problems in the field of recommendation systems. Users of any existing platform most often do not provide explicit ratings, for instance, for a movie they have watched. Due to this, the user rating matrix is mostly sparse. The collaborative filtering approach tries to estimate the missing entries from the existing data by incorporating the idea of what other users who watched the same movie have also watched. Content-based filtering methods, on the other hand, recommend similar movies. They take no information from user–user interactions but rely on the information of a movie and its related movies. This filtering technique can be used to solve the cold start problem for a new user on the platform by recommending items within a category. One example is a content-based movie recommender system based on temporal user preferences, which uses a user-centered framework that incorporates the content attributes of rated movies (for each user) into a Dirichlet process mixture model to infer user preferences and provide proper recommendations [1]. Demography-based filtering extracts demographic details of a user, such as age, sex, and location, to recommend movies under their demographic category. Adding demography to collaborative filtering has been around for a long time. A combination of demographic data with collaborative filtering for automatic music recommendation is a classic example [2]. Hybrid systems have also been introduced, combining the collaborative, content, and demography-based algorithms to get an improved set of recommended items.
The systematic literature review of hybrid recommender systems in [3] explores the literature on different state-of-the-art hybrid recommender systems. Neural networks and deep learning techniques have also been experimented with for building hybrid systems. A typical example is the hybrid movie recommender system designed using neural networks to yield high-accuracy predictions [4]. Extracting and combining demographic data with hybrid systems increases the system’s predictive power. A cascading hybrid recommendation approach that combines the rating, feature, and demographic information about items [5] details how adding features boosts the recommender engine. The proposed system integrates these methods to develop a hybrid movie recommender system which has the capability to recommend movies to new users according to their user demographic, by incorporating the tastes of its existing user base, as an alternative and effective solution for the cold start problem. The background theory section provides an overview of the techniques used and explains the research gap which led to the design of this hybrid recommender. The proposed hybrid algorithm section covers the details of system design and implementation, including the different algorithms that were tried to get the best results for the base collaborative modeling. The experimental results contain the model’s validation results, using the root mean squared error (RMSE). We conclude by presenting the constraints and future scope of the proposed work.
2 Background Theory Collaborative filtering algorithms are used to predict user ratings from a limited number of actual ratings. Estimating the ratings with the collaborative filtering technique on a large dataset is a memory-intensive task. DECORS, a simple and efficient demographic collaborative recommender system for movie recommendation [6], uses collaborative filtering: it first partitions the users based on demographic attributes and then clusters the partitioned users with the k-means clustering algorithm based on the user rating matrix. There are several methods to estimate these ratings and find similar users using the collaborative approach. Some of these methods are KNN-inspired algorithms, matrix factorization-based algorithms, slope one, and co-clustering. Content-based recommenders are best suited for recommending items of similar taste. They do not depend upon explicit ratings from the end users; instead, they only look at the list of items which the user prefers and recommend a list of items with similar categories. The collaborative and content-based approaches can be combined to make a hybrid recommender system. In the improved hybrid recommender system of [7], predictions are combined: neighborhood-based collaborative filtering is applied together with a content recommender with clustered demographics to produce the recommendations. Demographic details are personal details which can be collected from users. They may include details like their name, age, occupation, sex, and location. Reference [8] presented a novel hybrid method that combines user–user collaborative filtering with the attributes of demographic filtering. Recommendation systems can be modeled to personalize recommendations by considering the users’ demographic details.
Reference [9] used a personalized recommender system based on users’ information in folksonomies, which identifies user preferences and groups them into quadrants of similar categories. Reference [10] evaluated a multifaceted hybrid recommender model to see how applying demographic details adds to the recommender. The cold start problem arises when the system is not able to recommend items to the users. This is mostly seen when a new user logs into the system, since the system does not have any prior information about their taste in content. Reference [11] resolved the data sparsity and cold start problems in a collaborative filtering recommender system using linked open data to find enough information about new users. Reference [12] explored user demographic attributes such as age, gender, and occupation for solving the cold start problem in recommender systems. Reference [13] compared the differences and similarities between collaborative filtering and content-based filtering. This study focuses on how a hybrid system can be built by combining collaborative filtering, demography-based filtering, and content-based filtering so as to give a hybrid solution to the cold start problem. The Surprise library is used to model and evaluate the base algorithms. Surprise (simple Python recommendation system engine) is an easy-to-use Python Scikit for recommender systems [14]. A movie and book recommender system has been implemented using the Surprise recommendation kit by [15].
Table 1 MovieLens dataset description

| Dataset name | Size | Null values | Description |
|---|---|---|---|
| User details | 943 × 5 | No | User demographic details like id, age, sex, occupation, zip code |
| Movie details | 1682 × 24 | Yes | 18 movie categories with their movie ID and titles |
| Ratings | 100,000 × 4 | No | The ratings given by each user to the movies |
3 Proposed Model 3.1 Dataset The GroupLens research project provides an open dataset collected through the MovieLens Web site (movielens.umn.edu) from September 19th, 1997 through April 22nd, 1998. This dataset is available for research and development [16]. For the proposed model, we use the MovieLens 100K dataset. The data consists of 100,000 ratings (1–5) on 1682 movies from 943 users. Each user has rated at least 20 movies, and simple demographic details like age, occupation, and zip code are available for each user. There are three different datasets, for user details, movie details, and ratings. Table 1 provides the details of the datasets used for building the proposed model.
3.2 Data Preparation Exploratory data analysis is performed to understand the dataset under study. The dataset has no outliers. Users are extracted with the condition of 20 ratings per movie, which dropped some users even though the source specifies that all users have more than 20 ratings. The users are separated into categories with respect to their age group. The MovieLens data has a minimum user age of 7 and a maximum age of 73; based on this, we split users into six categories by age group: Children, Teenager, Young Adult, Adult, Middle Aged, and Elderly. The Surprise library is used to split the data into train and test sets. The entire ratings dataset, with the user id, movie id, and ratings, is split using fivefold cross-validation.
3.3 System Design The system is modeled in three phases. The first phase builds a collaborative filtering model to generate initial sets of recommendations, from which we
Fig. 1 Hybrid system block diagram
extract the features using the demography-based filtering technique. These intermediate results are passed on to a content-based recommender to get the final recommendations. The block diagram of the hybrid model is given in Fig. 1. In order to build a hybrid model, a combination of content-based filtering and collaborative filtering has been used on user data, item data, and demographic data. For collaborative filtering, the SVDpp algorithm is used. It is a slightly modified version of the singular value decomposition (SVD) algorithm, which uses a matrix factorization technique to decrease the number of features in a dataset by lowering the space dimension from N to K. The rating matrix is represented with each row representing a user and each column representing a movie. The ratings that users give to items are the entries of this matrix. The dataset used has implicit ratings, which means that the ratings are not directly provided by every user but are instead gathered from available streams. The literature on the Netflix Prize shows that the SVDpp algorithm gave more accurate recommendations when used with implicit ratings data. For content-based filtering, cosine similarity has been used. The result obtained is compared with those of other models in Table 3; the result for the proposed method was found to be better than its counterparts. The ratings data of user-item interactions is used in the collaborative modeling phase, where different algorithms are experimented with. The algorithm with the least root mean squared error (RMSE) is chosen as the final model, which can be used to
predict recommendations for all the users in the database. From the list of recommended movies for each user, we extract the movies of all the users who match the new user’s demographic age category. The extracted users and movies will be closely related to the new user’s demographics. The content-based recommender tries to find similar movies within the extracted list of movies, which gives the final recommendations list that has the flavors of similar users and similar movies.
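The demographic extraction step in this pipeline can be sketched as a simple pooling operation. This is a hedged illustration only: the function name `pool_candidates`, the dictionary layout, and the user ids and categories in the usage note are hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, List


def pool_candidates(
    recommendations: Dict[int, List[int]],
    user_category: Dict[int, str],
    target_category: str,
) -> List[int]:
    """Collect movies recommended to existing users in the target category.

    Movies are returned once each, in first-seen order.
    """
    seen = set()
    pooled: List[int] = []
    for user_id, movies in recommendations.items():
        if user_category.get(user_id) != target_category:
            continue  # user belongs to another demographic category
        for movie_id in movies:
            if movie_id not in seen:
                seen.add(movie_id)
                pooled.append(movie_id)
    return pooled
```

For instance, with made-up data `recommendations = {1: [10, 20], 2: [20, 30], 3: [40]}` and `user_category = {1: "Adult", 2: "Adult", 3: "Teenager"}`, a new user in the "Adult" category would receive the candidate pool `[10, 20, 30]`, which the content-based step would then rank.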
4 Proposed Hybrid Algorithms 4.1 Collaborative Filtering-Based Models The collaborative filtering model is applied to get the initial set of recommendations. Different algorithms were experimented with to get a low root mean squared error (RMSE) for the initial base recommender. The algorithms used to produce the collaborative results are normal predictor, baseline only, KNN basic, KNN with means, KNN with Z score, KNN baseline, SVD, SVDpp, non-negative matrix factorization (NMF), slope one, and co-clustering. The best performing algorithm, SVDpp, is chosen to model the base collaborative filtering model. Singular value decomposition (SVD), which became famous during the 2009 Netflix Prize, is the core algorithm that performs collaborative filtering. When baselines are not used, it is equivalent to probabilistic matrix factorization. The prediction $\hat{r}_{ui}$ is set as:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u \tag{1}$$
If user u is unknown, then the bias $b_u$ and the factors $p_u$ are assumed to be zero. The same applies for item i with $b_i$ and $q_i$ [17]. To estimate all the unknowns, we minimize the following regularized squared error:

$$\sum_{r_{ui} \in R_{train}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda\left(b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2\right) \tag{2}$$
The minimization is performed using stochastic gradient descent, which iterates over all the ratings of the training set for the specified number of epochs. Singular value decomposition plus plus (SVDpp) is an extension of the SVD algorithm that takes implicit ratings into account when making predictions [17]. An implicit rating indicates that a user u rated an item j, regardless of the rating value. The prediction $\hat{r}_{ui}$ is set as:
Table 2 Feature extraction for demography-based filtering

| Age group | Category |
|---|---|
| 7–14 | Children |
| 15–19 | Teenager |
| 20–24 | Young adult |
| 25–44 | Adult |
| 45–64 | Middle Aged |
| 65–79 | Elderly |
$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right) \tag{3}$$
Just like SVD, the parameters are learned using a stochastic gradient descent (SGD) on the regularized squared error with learning rates set to 0.005 and regularization terms set to 0.02.
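The SVDpp prediction rule can be sketched directly from its definition: the user factors are shifted by the normalized sum of the implicit item factors before the dot product with the item factors. The function name `svdpp_predict` and the plain-list representation are hypothetical choices for illustration; this is not the Surprise library's internal code, and no clipping to the 1–5 rating scale is applied.

```python
import math
from typing import Sequence


def svdpp_predict(
    mu: float,
    b_u: float,
    b_i: float,
    q_i: Sequence[float],
    p_u: Sequence[float],
    y_rated: Sequence[Sequence[float]],
) -> float:
    """mu + b_u + b_i + q_i . (p_u + |I_u|^(-1/2) * sum of y_j for j in I_u).

    y_rated holds one implicit factor vector y_j per item the user rated.
    """
    n_factors = len(q_i)
    implicit = [0.0] * n_factors
    if y_rated:
        norm = 1.0 / math.sqrt(len(y_rated))  # |I_u|^(-1/2)
        for y_j in y_rated:
            for f in range(n_factors):
                implicit[f] += norm * y_j[f]
    dot = sum(q_i[f] * (p_u[f] + implicit[f]) for f in range(n_factors))
    return mu + b_u + b_i + dot
```

When `y_rated` is empty the implicit term vanishes and the function reduces to the plain SVD prediction of Eq. (1).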
4.2 Demography-Based Filtering Demography-based filtering is applied to the results of the collaborative filtering model. For example, the information can be filtered with respect to the children’s category so that only the movie recommendations of users who belong to the child age category are extracted. The method splits users into six types with respect to their age group: Children, Teenager, Young Adult, Adult, Middle Aged, and Elderly. The demographic information of the new user is extracted on the basis of their age group. Details related to feature extraction in demography-based filtering are given in Table 2.
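The age-group mapping can be sketched as a small lookup. The names `AGE_BANDS` and `age_category` are hypothetical. The band boundaries follow the age groups listed for the six categories, with one labeled assumption: the Middle Aged band is taken as 45–64, the contiguous gap between the Adult (25–44) and Elderly (65–79) bands.

```python
from typing import List, Tuple

# (lowest age, highest age, category); bounds are inclusive.
AGE_BANDS: List[Tuple[int, int, str]] = [
    (7, 14, "Children"),
    (15, 19, "Teenager"),
    (20, 24, "Young Adult"),
    (25, 44, "Adult"),
    (45, 64, "Middle Aged"),  # assumed band, see the note above
    (65, 79, "Elderly"),
]


def age_category(age: int) -> str:
    """Map a user's age to a demographic category."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside the supported range 7-79")
```

The youngest (7) and oldest (73) users in the dataset fall into the Children and Elderly categories respectively.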
4.3 Content-Based Filtering The extracted information of movies is matched together for similarities using the cosine similarity matrix; the scores are sorted and arranged to produce the final set of recommendations which has the collaborative, demographic, and the content-based characteristic of recommendations.

$$\mathrm{Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \tag{4}$$
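Cosine similarity between two movie feature vectors can be sketched as below. The function name is hypothetical, and the choice of features is an assumption: a natural candidate in this dataset is the binary indicator vector over the movie categories, but the text does not specify which features were used.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)
```

Two movies with identical category vectors score 1 (up to floating-point rounding), and two movies with no category in common score 0.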
5 Experimental Evaluation Recommendation systems can be evaluated using precision, recall, root mean squared error, and mean absolute error. The most commonly used evaluation metric is the root mean squared error, which is the square root of the average of the squared differences between the observed and the predicted values. The results obtained for the proposed model and the other techniques are shown in Table 3. The performance of the system depends on how well the collaborative filtering algorithm performs. Table 4 gives the collaborative filtering result. From the list of all the collaborative algorithms, SVDpp outperformed all the other algorithms with an RMSE of 0.92 on the test data, followed by the KNN baseline algorithm with 0.93. Table 5 highlights an example of the final recommendation. The result of the collaborative filtering technique has the recommendations for all the users in the database. The demographic filter extracts the list of movies of the users who match the new user. The content-based filtering, after applying the cosine similarity on the extracted movie list, outputs the final set of recommendations for the user. The final recommendations produced by the system incorporate the flavors of similar categories of users in the existing database, so that for a new user who watched Copycat (1995) the system can show a recommendation list under “Popular among the category” or “Top picks for this category”.

Table 3 Collaborative filtering algorithms with test results

| Algorithm | Test RMSE |
|---|---|
| SVDpp | 0.920371 |
| KNN baseline | 0.930184 |
| SVD | 0.938123 |
| Baseline only | 0.943795 |
| Slope one | 0.944184 |
| KNN with Z score | 0.951147 |
| KNN with means | 0.951736 |
| NMF | 0.962818 |
| Co-clustering | 0.967763 |
| KNN basic | 0.979026 |
| Normal predictor | 1.516743 |

Table 4 List of movies recommended for every user by the collaborative filtering algorithm

| User id | Recommended list of movies |
|---|---|
| 192 | [515, 127, 108, 255, 948] |
| 758 | [50, 98, 276, 654, 313, 209, 14, 23, 171, 170] |
| 833 | [179, 56, 192, 135, 197, 182, 512, 479, 488, 475] |
| 727 | [408, 1, 258, 168, 114, 12, 176, 230, 222, 191] |
| 497 | [22, 12, 28, 195, 181, 265, 431, 173, 176, 89] |
| 116 | [187, 298, 484, 268, 199, 137, 315, 116, 607, 307] |
| 450 | [357, 520, 15, 435, 136, 127, 480, 196, 28, 132] |
| 207 | [15, 318, 316, 22, 265, 357, 258, 96, 282, 191] |

Table 5 Final recommendations for Copycat (1995) after demography and content-based filtering

| Final recommendation list |
|---|
| Copycat (1995) |
| Twelve Monkeys (1995) |
| White Balloon, The (1995) |
| Batman Forever (1995) |
| Billy Madison (1995) |
| Disclosure (1994) |
| Priest (1994) |
| Quiz Show (1994) |
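The RMSE used to compare the collaborative filtering algorithms follows the standard definition, sketched below with a hypothetical function name. The sample values in the usage note are made up for illustration and are not taken from the experiments.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the mean squared difference between two rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared_error / len(actual))
```

For observed ratings `[4, 3, 5]` and predictions `[3, 3, 3]`, the squared errors are 1, 0, and 4, so the RMSE is the square root of 5/3. Because errors are squared before averaging, a few large misses raise the RMSE more than many small ones.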
6 Conclusion and Future Scope The hybrid system developed can be used to make recommendations to existing as well as new users based on the existing users’ characteristics. The experimentation on collaborative filtering models helped to select the model with the least RMSE. The demography and content filtering methods extract personalization from the users, and since the recommendations made for new users are inherited from the existing users’ lists of movies and their categories, the system can also be used as a general recommender. Benchmark datasets with geographic details are not publicly available; we will be collecting more details with geographic features to suggest movies based on a user’s current geographic location for a more personalized experience. This system architecture performs well on platforms with a good number of existing users. The performance of the hybrid model depends strongly on the base collaborative algorithm, so a different algorithm that reduces the RMSE can yield a better recommender solution. Increasing the number of demographic features also adds more weight and personalization.
B. K. Raju and M. Ummesalma
References
1. Cami BR, Hassanpour H, Mashayekhi H (2017) A content-based movie recommender system based on temporal user preferences. 2017 3rd Iranian conference on intelligent systems and signal processing (ICSPIS). pp 121–125
2. Yapriady B, Uitdenbogerd AL (2005) Combining demographic data with collaborative filtering for automatic music recommendation. Int Conf Knowl-Based Intell Inf Eng Syst 201–207
3. Çano E, Morisio M (2017) Hybrid recommender systems: a systematic literature review. Intell Data Anal 21(6):1487–1524
4. Christakou C, Vrettos S, Stafylopatis A (2007) A hybrid movie recommender system based on neural networks. Int J Artif Intell Tools 16(05):771–792
5. Ghazanfar MA, Prugel-Bennett A (2010) A scalable, accurate hybrid recommender system. Third Int Conf Knowl Discov Data Min 2010:94–98
6. Sridevi M, Rao RR (2017) Decors: a simple and efficient demographic collaborative recommender system for movie recommendation. Adv Comput Sci Technol 10(7):1969–1979
7. Chikhaoui B, Chiazzaro M, Wang S (2011) An improved hybrid recommender system by combining predictions. IEEE Workshops Int Conf Adv Inf Networking Appl 2011:644–649
8. Alshammari G, Kapetanakis S, Alshammari A, Polatidis N, Petridis M (2019) Improved movie recommendations based on a hybrid feature combination method. Vietnam J Comput Sci 6(03):363–376
9. Jelassi MN, Ben Yahia S, Mephu Nguifo E (2013) A personalized recommender system based on users’ information in folksonomies. Proceedings of the 22nd international conference on world wide web. pp 1215–1224
10. Santos EB, Garcia Manzato M, Goularte R (2014) Evaluating the impact of demographic data on a hybrid recommender model. IADIS Int J WWW/Internet 12(2)
11. Natarajan S, Vairavasundaram S, Natarajan S, Gandomi AH (2020) Resolving data sparsity and cold start problem in collaborative filtering recommender system using linked open data. Expert Syst Appl 149:113248
12. Safoury L, Salah A (2013) Exploiting user demographic attributes for solving cold-start problem in recommender system. Lect Notes Softw Eng 1(3):303–307
13. Glauber R, Loula A (2019) Collaborative filtering versus content-based filtering: differences and similarities. arXiv preprint arXiv:1912.08932
14. Hug N (2020) Surprise: a python library for recommender systems. J Open Source Softw 5(52):2174
15. GS A et al (2020) A movie and book recommender system using surprise recommendation
16. Harper FM, Konstan JA (2015) The movielens datasets: history and context. Acm transactions on interactive intelligent systems (tiis) 5(4):1–19
17. Koren Y, Bell R, Volinsky C (2009) Matrix factorization techniques for recommender systems. Computer 42(8):30–37
Holistic Recommendation System Framework for Health Care Programs
K. Navin and M. B. Mukesh Krishnan
Abstract Government health care programs for disease surveillance, control, and prevention adopt a strategized technique: collecting information from various sources, monitoring the trend of disease outbreaks, initiating actions, and detecting the response to decide the further course of action. The main components for implementing such programs are integration and decentralization of surveillance activities at the zonal and subzonal levels; developing and deploying human resources ranging from field workers to experts; strengthening public health lab facilities; and, most importantly, embracing digital information and communications technology (ICT), the key component that binds all the others together for effective implementation of the program. The proposed work explains the adoption of ICT through a holistic framework that depicts the strategy and techniques for implementing the program. It discusses a digitalized strategy for data collection and compilation, data analysis through integrated decision support modules, and proper presentation and dissemination of information. The key element of the framework is the use of various recommendation system techniques for information filtering, decision support, and data dissemination at the different stages of data handling. It suggests selecting different recommendation techniques for different needs, such as predicting disease dynamics and providing decision support on the likelihood of the next outbreak, the rate of spread, the demographic details of the spread, etc. The framework proposes smartphone-based mobile intervention to support the program, and an API-based data exchange architectural style for gathering and disseminating data across various sources and endpoints.
Keywords Public health · Health surveillance · Health care analytics · Health recommendation systems · m-Health
K. Navin (B) · M. B. M. Krishnan SRM Institute of Science and Technology, Kattankulathur, Chennai, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_6
K. Navin and M. B. M. Krishnan
1 Introduction
Information and communication technology (ICT) makes it possible to connect computing systems that collect, store, and share data and present it as information. Quality of life has improved considerably through the many applications served over the internet [1]. With the evolution of technology over the years and the advent of mobile computing and IoT, data can be collected from anybody or anything, anywhere and at any time [2, 3]. Applying modern data science to data collected for a specific domain enables us to extract useful information and to provide recommendations and decision support for problems in that domain. Public healthcare is one such application domain that has begun to adopt these technologies to improve people’s quality of life [4]. Community-based health care programs [5] provide vital services to clinicians, helping to ease diagnostics and to establish effective treatments for diseases. Community health care support involves follow-up during treatment, rehabilitation support, and educational and motivational information for the identified individuals or community [5, 6]. General wellness community-based health care programs need to be given equal importance alongside disease control programs, which would have a greater impact on improving the quality of life in a community [7]. Strategizing health care programs with a holistic approach that combines wellness programs with disease control programs would bring more fruitful results in combating health scares. The challenges lie in the extent of outreach, the need for experts, the budget for funding, cultural differences, language barriers, social barriers, and effective implementation of the programs [8]. The solution lies in effectively adopting current technology concepts and practices, which would eliminate or minimize these challenges.
Technology refers to the set of techniques, processes, and methods that improve an outcome [8]. Advances in technology are emerging alongside the development of information and communication technology (ICT). In the health sector, internet and web technologies need to be incorporated into innovative health care IT frameworks that support health care programs [9]. Techniques from e-health, m-health, telehealth, telemedicine, health informatics, and health care analytics need to be combined in such frameworks to supplement the health care programs [7, 9]. Adopting these technologies in the framework provides effective decision support and recommendations not only to health care decision authorities but also to the general public. This would improve the outcome of the proposed community health care programs in bringing wellness to public health. This paper proposes an IT solution framework for the identified model of community-based public health care programs, backed by an evidence-based approach. The framework promotes the collection of different types of data from community-based programs through smartphones. It also facilitates fetching health care data and supporting data, such as weather forecasts, from government agencies and established private agencies through APIs.
The framework incorporates a machine learning-based analytical tool shelf that health experts can apply to the data, and it gives end-users decision support through the recommendation system. The framework further includes a dashboard for pushing survey questionnaires to intended users, collecting incident reports, and presenting visualizations for epidemiologists and health experts.
2 Background and Motivation
Technological advancement in health care systems is on an unprecedented upward trend, and personalized medicine is a near-future prospect [10]. Such advancements must reach the community so that the spread of disease is controlled effectively and curbed through preventive and predictive measures. The role of public health is crucial: it should design health programs that adopt modern techniques to curb the spread of disease and improve quality of life. Developing countries like India have huge populations spread across vast landscapes. The challenge lies in collecting data as well as in taking appropriate action to implement preventive and control measures [11]. In recent times, public health departments have come under pressure to address threats from bioterrorism and the emergence of new infectious diseases, both of which are serious threats to public health [11, 12]. Public health agencies need to embrace modern IT solutions built on public health informatics so that disease surveillance systems can trigger swift and appropriate countermeasures. Public health informatics is a sub-domain of public health that involves computerized data collection, analytics, and decision-making [13]. Smart mobile devices can collect surveillance data and receive informative messages instantly, anywhere and at any time.
3 The Proposed Framework
3.1 The Community Support Program
A community-supported program is designed to encourage as many people as possible to use the health program, so that health care record samples can be collected from an area. The mobile app-based health program is intended to reach a large public spread across a vast geographical area. This, in turn, enables swift and effective surveillance of health status among the public. The mobile app also helps to reach the public immediately with awareness and educational messages. Most importantly, the IT infrastructure does the job of health experts in computing information that is customized for an individual, a community group, or the general public. The system will be most effective in handling emergencies caused by epidemic or pandemic outbreaks. The IT framework supporting the community-based program performs health surveillance using what can be called a hybrid model, combining active and passive surveillance methodologies. Further, the program is designed to provide m-Health services such as an on-demand remote health assistant through video conferencing and bot-based interaction for live decision support for end-users. In the current technology-driven world, data is like gold dust, and this mobile intervention-supported, IT-based community health program is an ideal vehicle for collecting temporal and spatial health care data. Data that is temporal and spatial in nature is of paramount importance to the proposed public health care program model, as shown in Fig. 1. These data can be used to understand behavioral patterns among people, which become determinants of health trends. The spread of an epidemic in a place can be analyzed from the prevailing data, or even forecast. The proposed system also combines external data such as weather, population, and demographic structure with prevailing health care data to produce effective decision-support information. The program starts with identified, trained community health workers in each village, who will be provided with smartphones that have the health program app installed. Through promotions, a free health support app will be provided to all intended people.
Fig. 1 Health care records represented in the temporal and spatial domain
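As an illustration of what temporal and spatial health care records might look like in practice, the sketch below keys each report by place and date and counts reports per village and ISO week. The record fields (`village`, `reported_on`, `symptom`) and the function name are assumptions made for this example; the framework itself does not prescribe a schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class HealthRecord:
    village: str        # spatial key (hypothetical; could equally be a GPS cell)
    reported_on: date   # temporal key
    symptom: str        # reported health indicator


def weekly_counts(records, symptom):
    """Count reports of one symptom per (village, ISO year, ISO week)."""
    counts = Counter()
    for record in records:
        if record.symptom == symptom:
            iso = record.reported_on.isocalendar()
            counts[(record.village, iso[0], iso[1])] += 1
    return counts
```

A week-over-week rise in one of these counts for a single village is the kind of signal that the analytical module could pass on for analysis or forecasting.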
3.2 IT Infrastructure
The heart of the IT infrastructure is a scalable compute service running on cloud instance-based application servers, as shown in Fig. 3. The application servers are connected to scalable structured database services. The data processing module, which is closely coupled to the database, performs data preprocessing, including data cleaning, data integration, and data transformation, and supports feature selection, so that the health care recommendation system has a suitable environment for analytical processing. The application servers host database-driven service modules that connect to endpoint devices such as desktops, smartphones, and IoT devices to collect health care data from end-users and sometimes from the environment. The collected data are stored in the cloud database. The data-driven service modules also collect survey questionnaires and health diagnostic patterns from health experts or epidemiological experts and store them in the database. Separate mobile-based and web-based portals collect data from hospitals and local health clinics as part of smart electronic data collection in the data-driven service module. Figure 2 depicts a use-case scenario for the IT architecture model of the proposed public health program, portraying the roles of the various stakeholders and the data flowing from them. The analytical recommendation module consists of a tool shelf of machine learning services that can be chosen according to the data and the recommendation approach needed to obtain decision support for a particular scenario. The knowledge representation module takes the recommendation system results and constructs information for an individual, a group (a community), or the overall public. The message communication module segregates messages and pushes them to the endpoint devices (e.g., smartphones).
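The preprocessing steps named above (cleaning, integration, transformation) could look roughly like the following. This is only a sketch under assumed field names (`village`, `symptom`, `temperature_c`) and an assumed external lookup table; a production data processing module would work against the database rather than in-memory lists.

```python
REQUIRED_FIELDS = ("village", "symptom", "temperature_c")


def clean(records):
    """Data cleaning: drop rows with missing values and normalize text fields."""
    cleaned = []
    for row in records:
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue  # incomplete report, skip it
        cleaned.append({
            "village": row["village"].strip().lower(),
            "symptom": row["symptom"].strip().lower(),
            "temperature_c": float(row["temperature_c"]),
        })
    return cleaned


def integrate(records, village_info):
    """Data integration: attach external per-village attributes when available."""
    return [{**row, **village_info.get(row["village"], {})} for row in records]


def min_max_scale(records, field):
    """Data transformation: rescale one numeric field to the range [0, 1]."""
    if not records:
        return []
    values = [row[field] for row in records]
    low, high = min(values), max(values)
    span = high - low
    return [
        {**row, field: (row[field] - low) / span if span else 0.0}
        for row in records
    ]
```

Chaining the three functions, as in `min_max_scale(integrate(clean(raw), info), "temperature_c")`, yields rows that a feature selection step could consume.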
The visualization module works with the knowledge representation module to present information as charts, graphs, tables, etc. The proposed framework has web service client modules that connect to public APIs on government agencies’ websites to bring in relevant features, such as geographical information, climate, and the general population distribution of an area, that could subtly contribute to identifying health trends in a place. Health experts will use a web portal to generate the survey questions used for public health surveillance. The survey questions are pushed to the intended registered mobile clients and can be static or bot-driven. Each client has a health app that receives notifications when survey questionnaires are pushed to it, and the client answers the questionnaires through the app. The mobile device is not a mere data collector or message-passing medium; it also provides edge compute capability to the cloud framework and acts as a personal health assistant. Participants from hospitals and health centers provide formal reports. Community workers and laypeople provide informal reports by sending back answers to the survey questionnaires. The survey result patterns are the evidence-based syndrome inputs to the computing system in the framework. They can also help capture disease dynamics during epidemic outbreaks.
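The survey round trip described above can be sketched as a small data exchange: a questionnaire is serialized for pushing to clients, and the returned answers are tallied into syndrome counts per village. The payload layout, the yes/no answer format, and the function names are hypothetical; they stand in for whatever message format the messaging module would actually use.

```python
import json
from collections import Counter


def build_survey_payload(survey_id, questions):
    """Serialize a questionnaire (assumed shape: {question_id: question text})."""
    return json.dumps({"survey_id": survey_id, "questions": questions})


def tally_syndromes(responses, survey_id):
    """Count 'yes' answers per (village, question_id) for one survey.

    Each response is assumed to look like
    {"survey_id": ..., "village": ..., "answers": {question_id: True/False}}.
    """
    counts = Counter()
    for response in responses:
        if response["survey_id"] != survey_id:
            continue  # answer belongs to a different survey
        for question_id, answer in response["answers"].items():
            if answer is True:
                counts[(response["village"], question_id)] += 1
    return counts
```

The resulting counts are one possible form of the evidence-based syndrome inputs that the analytical module consumes.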
Fig. 2 Use-case and data flow scenario for the proposed framework to support public wellness health care program
3.3 Health Care Recommendation System
A health care recommendation system can be described as a decision support system that provides recommendations based on health care information and serves both health care professionals and the general public [14–16]. The nature of the data and its large number of features make it difficult to design an optimized model for this type of health care framework [17, 18]. The ideal scenario is to use an AutoML-based analytical framework for recommendation systems [19]. The challenge for health experts or epidemiologists lies in feature selection, in choosing an appropriate model, and in tuning it. The framework helps greatly with these challenges by automating the machine learning pipeline. The holistic health care recommendation system needs to apply diagnostic, descriptive, and predictive analytics in different scenarios, such as diagnosing disease symptoms, describing the prevailing health status from historical health data, and predicting health issues and epidemic outbreaks from health trends derived statistically from the collected data [20]. Recommendation system designs are broadly classified into content-based filtering, collaborative filtering, and hybrid models that combine the techniques of both [21–23]. Content-based filtering is typically modeled to provide recommendations to a user or entity by comparison with the content of the user or entity profile for the chosen domain [21, 24]. In the public health care domain, the health indicator parameters are analyzed against a known or desirable pattern of health features, termed a profile. This approach provides analytical information and is termed a diagnostic recommendation system. Collaborative filtering is typically modeled to provide recommendations by finding similarities among historical data collected from users or entities of the chosen domain [25, 26]. The public health care domain adopts this technique by using historical data of health indicator parameters, taken from individuals or from a group representing a community. The prevailing health care data is analyzed against past results to provide recommendations or decision support. Because the available historical health indicator data is temporal and spatial in nature, the recommendation system can make predictive recommendations. For example, past disease occurrences, health trends, etc. help to predict disease outbreaks in a place. If the patterns and features of historical data and past results are correlated with the prevailing health indicator parameters to provide recommendations, the approach is termed a hybrid model.
Fig. 3 Health care recommendation system framework for proposed public wellness health care program
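The two filtering styles can be contrasted with a small sketch. The first function compares current indicators with a desirable profile (the content-based, diagnostic case); the second finds the most similar historical record and returns its outcome (the collaborative case). The indicator names, the range-based profile, and the use of cosine similarity are assumptions chosen for illustration, not a prescribed design.

```python
from math import sqrt


def deviations_from_profile(indicators, profile):
    """Content-based (diagnostic): indicators outside their desirable range.

    profile is assumed to map indicator name -> (low, high).
    """
    flagged = {}
    for name, (low, high) in profile.items():
        value = indicators.get(name)
        if value is not None and not (low <= value <= high):
            flagged[name] = value
    return flagged


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_outcome(current, history, keys):
    """Collaborative: outcome of the historical record closest to `current`.

    history is assumed to be a non-empty list of
    {"indicators": {...}, "outcome": ...} records.
    """
    current_vec = [current[k] for k in keys]
    best = max(
        history,
        key=lambda rec: cosine_similarity(
            current_vec, [rec["indicators"][k] for k in keys]
        ),
    )
    return best["outcome"]
```

A hybrid model in the sense used above would combine both signals, for example by running the similarity search only on communities whose indicators already deviate from the profile.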
3.4 Mobile App for Health Care Program
The health care program provides an officially administered app for community members and community workers. A separate mobile app lets community workers monitor their assigned community areas to check treatment adherence, educate people and encourage them to provide survey data, and offer health assistance. An adaptive UI design model is recommended, since the app is intended for the general public to use as a health assistant. The mobile app acts as the endpoint of the IT framework supporting the public health program. The app verifies users before enabling them to use it. Once the terms and policy are accepted, the user is authenticated each time the app is used. The app requires an internet connection and GPS to be switched on. It receives push notifications from the messaging module of the cloud-hosted health program IT framework. A push notification may carry a payload of survey questionnaires, which health experts create in a dedicated portal accessible only to them. The user responds to the questionnaires, and the answers are taken as input data by the analytical module in the framework. The mobile app is populated with educational, awareness, and motivational messages customized for individuals, for a community group, or for everyone. These messages are governed by the decision support system module and by the health experts who approve them. The customized messages are addressed to individuals or groups through their registered mobile numbers and GPS coordinates. If an intended user who is part of the health care program is identified as a patient, their basic health issue is stored as an electronic record in the proposed IT framework.
The smart surveillance feature of the IT framework keeps track of the patient’s adherence to treatment, informing and reminding the patient of scheduled health check-up visits. The mobile app receives personalized settings from the cloud computing module based on the patient record; these enable patients to set reminders for taking medicine, to make emergency calls to a community support person or health expert near their area, and to be tracked through geolocation. The mobile app features an on-demand video conferencing facility for connecting to hospitals or primary health care centers. The app also allows users to upload medical data such as images or documents to aid diagnostics.
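Addressing messages to a group by GPS coordinates can be illustrated with a radius query: given the location of an incident, select the registered mobile numbers within a chosen distance. The sketch uses the standard haversine great-circle formula; the user record fields (`mobile`, `lat`, `lon`) and the function names are assumptions for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def recipients_within(users, center, radius_km):
    """Mobile numbers of users located within radius_km of center (lat, lon)."""
    lat0, lon0 = center
    return [
        user["mobile"]
        for user in users
        if haversine_km(lat0, lon0, user["lat"], user["lon"]) <= radius_km
    ]
```

The same query could serve the emergency-call feature by finding the community support person or health expert nearest to a patient.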
4 Challenges
The first challenge is designing community support around public health surveillance and support programs. Identifying the benefits and implications of the proposed program’s outcome calls for a field research-based approach. Further challenges involve developing the m-Health intervention for the proposed program. The challenges to be studied include general resistance to change, unreliable technologies, non-uniform availability of technology, and lack of end-user education [27–29]. From this perspective, we attempt to analyze some of these challenges from the viewpoint of an engineer who would like to design and implement an m-Health intervention to enhance the quality of health care delivery processes. Mobile health (m-Health) interventions [30, 31] may overcome constraints due to limited clinician time, poor patient adherence, and the inability to provide meaningful interventions at the most appropriate time. However, technological capability does not equate to user acceptance and adoption. A multi-methodological research approach is needed to achieve the best usability and user experience across the variety of users. This approach will combine action research, field study, and applied research to develop a context-aware, personalized mobile intervention system, which will be a vital cog in the success of the program. The key challenge of this framework is designing machine learning models for the recommendation system [32, 33]. The models need to analyze data to provide near-accurate information that can be diagnostic, predictive, and prescriptive [34, 35]. The analytical module requires multiple models to handle multivariate data whose characteristics are general, temporal, and spatial. Optimizing the developed models, and the choice of models itself, would be carried out through applied research.
5 Conclusions
The overall objective of the proposed framework is to construct a smart health-based IT framework that backs the community-based surveillance and support program. The system has to be built by applying methodological research. It is expected to improve health awareness among the public, achieve effective diagnostics through the collection of different dimensions of data, and bring clarity to decision support for the authorities. The program would also support the public through smartphone-based telehealth check-ups with health experts. It is intended to improve the health status of the public by providing health awareness and assisting remedial measures through current technological support. This would, in turn, result in a healthy community.
References
1. Klironomos I, et al (2015) Improving quality of life through ICT for the facilitation of daily activities and home medical monitoring. Stud Health Technol Inf 217:759–66
2. Nevado-Peña D, et al (Nov 2019) Improving quality of life perception with ICT use and technological capacity in Europe. Technol Forecast Soc Change 148:119734. ScienceDirect. https://doi.org/10.1016/j.techfore.2019.119734
3. Damant J, et al (March 2013) The impact of ICT services on perceptions of the quality of life of older people. J Assistive Technol 7(1):5–21. https://doi.org/10.1108/17549451311313183
4. Islam MS, et al (May 2018) A systematic review on healthcare analytics: application and theoretical perspective of data mining. Healthcare 6(2). PubMed Central. https://doi.org/10.3390/healthcare6020054
5. Community-based care—an overview, ScienceDirect topics. https://www.sciencedirect.com/topics/medicine-and-dentistry/community-based-care. Accessed 6 April 2020
6. Community-based intervention—an overview, ScienceDirect topics. https://www.sciencedirect.com/topics/medicine-and-dentistry/community-based-intervention. Accessed 6 April 2020
7. Fry CE, et al (Jan 2018) Evaluating community-based health improvement programs. Health Affairs 37(1):22–29. Healthaffairs.org (Atypon). https://doi.org/10.1377/hlthaff.2017.1125
8. Huffman LC, Paul H (2009) Wise. “Chapter 18—Neighborhood and community.” In: William BC, Saunders WB et al (eds) Developmental-behavioral pediatrics, Fourth edn. 170–81. ScienceDirect. https://doi.org/10.1016/B978-1-4160-3370-7.00018-3
9. Negash S, et al (April 2018) Healthcare information technology for development: improvements in people’s lives through innovations in the uses of technologies. Inf Technol Dev 24(2):189–97. Taylor and Francis+NEJM. https://doi.org/10.1080/02681102.2018.1422477
10. Heier E (7 Nov 2019) Health information technology. Types Healthc Softw 2020. https://www.selecthub.com/medical-software/7-categories-healthcare-information-technology/
11. Mohanan M et al (Oct 2016) Quality of health care in India: challenges, priorities, and the road ahead. Health Affairs 35(10):1753–58. healthaffairs.org (Atypon). https://doi.org/10.1377/hlthaff.2016.0676
12. Wani NUH (2013) Health system in India: opportunities and challenges for enhancements. IOSR J Bus Manag 9(2):74–82. https://doi.org/10.9790/487X-0927482
13. PHII. https://www.phii.org/defining-public-health-informatics. Accessed 9 April 2020
14. Farid SF (Sept 2019) Conceptual framework of the impact of health technology on healthcare system. Frontiers Pharmacol 10. PubMed Central. https://doi.org/10.3389/fphar.2019.00933
15. Gräßer F, et al (2017) Therapy decision support based on recommender system methods. J Healthc Eng. https://doi.org/10.1155/2017/8659460
16. Sahoo AK, et al (May 2019) DeepReco: deep learning based health recommender system using collaborative filtering. Computation 7(2):25. https://doi.org/10.3390/computation7020025
17. DiClemente R, et al (Feb 2019) Need for innovation in public health research. Am J Public Health 109(S2):S117–20. https://doi.org/10.2105/AJPH.2018.304876
18. Innovative ideas for addressing community health needs, from the center for rural health. https://ruralhealth.und.edu/projects/community-health-needs-assessment/innovative-ideas. Accessed 30 March 2020
19. AutoML vision beginner’s guide, cloud autoML vision. Google cloud, https://cloud.google.com/vision/automl/docs/beginners-guide. Accessed 9 April 2020
20. Wiesner M, Daniel P (March 2014) Health recommender systems: concepts, requirements, technical basics and challenges. Int J Environ Res Public Health 11(3):2580–607. PubMed Central. https://doi.org/10.3390/ijerph110302580
21. Pincay J, et al (2019) Health recommender systems: a state-of-the-art review. 2019 sixth international conference on eDemocracy & eGovernment (ICEDEG). IEEE, pp 47–55. https://doi.org/10.1109/ICEDEG.2019.8734362
22. Deng H (5 Dec 2019) Recommender systems in practice. Medium. https://towardsdatascience.com/recommender-systems-in-practice-cef9033bb23a
23. Comprehensive guide to build recommendation engine from scratch. Analytics Vidhya (21 June 2018) https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python/
24. Ricci F, (ed) (2011) Recommender systems handbook. Springer
25. Kim S-J, et al (Jan 2016) Food recommendation system using big data based on scoring taste adjectives. Int J U- and e-service, Sci Technol 9(1):39–52. https://doi.org/10.14257/ijunesst.2016.9.1.05; King M, Zhu B (1998) Gaming strategies. In: Tang S, King M (eds) Path planning to the west, vol II. Xian, Jiaoda Press, pp 158–176
26. Almohsen KA, Al-Jobori H (Dec 2015) Recommender systems in light of big data. Int J Electr Comput Eng (IJECE) 5(6):1553. https://doi.org/10.11591/ijece.v5i6.pp1553-1563; Yorozu Y, Hirano M, Oka K, Tagawa Y (Aug 1987) Electron spectroscopy studies on magneto-optical media and plastic substrate interface. IEEE Translated J Magn Japan 2:740–741 [(1982) Digest 9th annual conference magnetics Japan, p. 301]
27. Shahzad M, et al (Dec 2019) A population-based approach to integrated healthcare delivery: a scoping review of clinical care and public health collaboration. BMC Public Health 19(1):708. https://doi.org/10.1186/s12889-019-7002-z
28. Feroz A, et al (Dec 2020) Using mobile phones to improve community health workers performance in low-and-middle-income countries. BMC Public Health 20(1):49. https://doi.org/10.1186/s12889-020-8173-3
29. Yang X, Carrie LK (July 2019) A systematic review of mobile health interventions in China: identifying gaps in care. J Telemed Telecare. https://doi.org/10.1177/1357633X19856746
30. Saha S, et al (2020) Addressing comprehensive primary healthcare in Gujarat through MHealth intervention: early implementation experience with TeCHO+ programme. J Family Med Primary Care 9(1):340. https://doi.org/10.4103/jfmpc.jfmpc_835_19
31. MHealth an overview, ScienceDirect topics. https://www.sciencedirect.com/topics/psychology/mhealth. Accessed 6 April 2020
32. Yong PL (ed) et al (2010) The healthcare imperative: lowering costs and improving outcomes: workshop series summary. National Academies Press
33. Gaikwad A, A presentation on title approval ‘A design of framework of recommender system for medical services’. www.academia.edu, https://www.academia.edu/27807269/A_Presentation_on_Title_Approval_A_Design_of_Framework_of_Recommender_System_for_Medical_Services_. Accessed 9 April 2020
34. Eleks Labs (17 Oct 2014) Data science: unlocking the power of recommender systems. https://labs.eleks.com/2014/10/data-science-in-action-unlocking-the-power-of-recommendersystems.html
35. Gaitanou P, et al (2014) The effectiveness of big data in health care: a systematic review. In: Sissi C, et al (ed) Metadata and semantics research. Springer International Publishing, pp 141–53. https://doi.org/10.1007/978-3-319-13674-5_14
A Synthetic Data Generation Approach Using Correlation Coefficient and Neural Networks Mohiuddeen Khan and Kanishk Srivastava
Abstract Modern machine learning models rely on large amounts of sample data to deliver robust performance. Many studies have shown that, given a sufficiently large amount of data, even simple models perform with high accuracy on very complex tasks. Having more data, whether in terms of examples or features, enables better insights, applications and approaches. Researchers in the medical sciences require an abundance of data to build intelligent models for disease classification. Generative adversarial networks (GANs) provide one way to generate data to tackle these problems. In this paper, we explain an approach to generating synthetic data using machine learning techniques and correlation coefficients. We used the Pearson correlation coefficient to determine the predictive relationship among the independent variables. An artificial neural network (ANN) model was created and deployed to generate data based on the relationships indicated by the correlation coefficients among the independent variables and with the dependent variable. The approach was tested on the Boston House Price dataset to generate synthetic data for different attributes. To study the appropriateness and cogency of the generated data, we compared two models: one trained on the real data only, and the other trained on real and synthetic data. The study showed an improvement in accuracy and in generalization to new instances for the model trained with additional synthetic data. The results indicate that the data generated by the neural networks are of sufficient quality to be used for various purposes, including data analysis and statistics.
Keywords Synthetic data · Machine learning · Neural networks · Generative adversarial networks · Pearson correlation coefficient · Data generation
M. Khan (B) Department of Computer Engineering, Aligarh Muslim University, Aligarh, India e-mail: [email protected]
K. Srivastava Department of Electrical and Computer Engineering, Portland State University, Portland, USA e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_7
1 Introduction
1.1 Need for More Data
Deep learning algorithms are extensively used to extract highly complex data features with high levels of abstraction. This ability allows these algorithms to make sense of unlabeled data, but it comes at the cost of the huge amount of data that must be fed to them. In practice, more training data generally gives better test accuracy. However, collecting more data is not feasible for every problem that we solve using deep learning.
1.2 Data Generation
A GAN is a type of neural network that can generate new data from scratch. Researchers have used various GAN-based approaches for generating data [1]. Autoencoders have also been used to generate data. An autoencoder consists of an encoder network and a decoder network connected by a bottleneck (the latent space). One study showed that a recurrent neural network (RNN) can generate highly accurate synthetic data [2]. Many studies have aimed to improve the accuracy of models that lack training data. One widely used approach is data augmentation, a set of techniques that increases the diversity of the data available for training models without collecting new data, and thereby improves accuracy by increasing the dataset size [3].
2 Theory and Background
Correlation coefficients are used in statistics to measure the relationship between two variables [4]. In this study, we used the correlation heatmap described below, which is built from the values of the correlation coefficient. We used artificial neural networks (ANNs) to model the relationships among the variables.
2.1 Correlation Coefficient and Heatmap
A correlation heatmap uses colored cells, typically on a monochromatic scale, to show a 2D correlation matrix between pairs of variables. The color of each cell encodes the value of the correlation coefficient for the corresponding pair of variables. This enables us to quickly identify patterns and to recognize anomalies. Pearson's correlation coefficient is a statistic that measures the linear relationship, or association, between two continuous variables. It is widely used for measuring the association between variables of interest because it is based on covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship. When applied to a population, the Pearson correlation coefficient is represented by ρ (rho) and is given by Eq. (1):

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{1}$$
where $\sigma_X$ is the standard deviation of $X$, $\sigma_Y$ is the standard deviation of $Y$, and cov is the covariance. The formula for ρ can be expressed in terms of mean and expectation, as given by Eq. (2):

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} \tag{2}$$

where $\mu_X$ is the mean of $X$, $\mu_Y$ is the mean of $Y$, and $E$ is the expectation. The following identities hold:

$$\mu_X = E[X], \qquad \mu_Y = E[Y]$$
$$\sigma_X^2 = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2$$
$$\sigma_Y^2 = E\left[(Y - E[Y])^2\right] = E[Y^2] - (E[Y])^2$$
$$E[(X - \mu_X)(Y - \mu_Y)] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$$
Fig. 1 Pearson coefficient correlation heatmap
Expressed purely in terms of expectations, ρ becomes

$$\rho_{X,Y} = \frac{E[XY] - E[X]E[Y]}{\sqrt{E[X^2] - (E[X])^2}\,\sqrt{E[Y^2] - (E[Y])^2}} \tag{3}$$
When the value of the coefficient approaches 1, the two variables have a strong positive linear relationship: an increase in one variable is accompanied by an increase of a fixed proportion in the other. A value near or equal to −1 indicates that an increase in one variable is accompanied by a decrease of a fixed proportion in the other. A correlation coefficient near or equal to 0 means that an increase in one variable is not associated with a consistent increase or decrease in the other, i.e., there is no linear relationship between them. A correlation heatmap containing the coefficient values (Fig. 1) was plotted for all the variables to extract these dependencies.
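As an illustration of Eq. (3), the following minimal sketch estimates the Pearson coefficient by replacing each expectation with a sample mean. The function name `pearson_r` and the toy data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation via Eq. (3), with expectations replaced by sample means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # E[XY] - E[X]E[Y]
    var_x = np.mean(x ** 2) - np.mean(x) ** 2            # E[X^2] - (E[X])^2
    var_y = np.mean(y ** 2) - np.mean(y) ** 2            # E[Y^2] - (E[Y])^2
    return cov_xy / np.sqrt(var_x * var_y)

# Toy data (hypothetical): one increasing and one decreasing linear relation, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_pos = 2.0 * x + 1.0      # perfectly increasing in x
y_neg = -0.5 * x + 3.0     # perfectly decreasing in x
y_noise = rng.normal(size=200)

print(pearson_r(x, y_pos))    # close to +1
print(pearson_r(x, y_neg))    # close to -1
print(pearson_r(x, y_noise))  # small in magnitude
```

The result should agree with `numpy.corrcoef(x, y)[0, 1]` up to floating-point error, since the normalization factors cancel in the ratio.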
2.2 Artificial Neural Network
Artificial neural networks are biologically inspired computational networks. An ANN consists of multiple artificial neurons that are connected to form a network. Equation (4) shows the mathematical formulation of a single neuron. A feedforward neural network consists of multiple neurons arranged in multiple layers. The first layer, or input layer, is connected to the output layer through hidden layers.

$$y(x) = f\left(\sum_{i=0}^{n} w_i x_i + b\right) \tag{4}$$

where $x_i$ ($i = 0, \dots, n$) are the inputs to the neuron, $y(x)$ is its single output, $w_i$ are the weights, $b$ is the bias, and $f$ is the activation function.
Fig. 2 ReLU function
The elementary feedforward network implements a nonlinear transformation of the input data in order to approximate the output data. The activation function ($f$) determines whether a neuron should be activated, based on the weighted sum of its inputs plus the bias [5]. The main purpose of activation functions is to introduce non-linearity into the output of a neuron. Different activation functions are given in Eqs. (5), (6) and (7). We used the ReLU activation function [6] for our network. Figure 2 shows the plot of the ReLU function.

ReLU:
$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases} \tag{5}$$

Sigmoid:
$$f(x) = \frac{1}{1 + e^{-x}} \tag{6}$$

tanh:
$$f(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{7}$$
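The single neuron of Eq. (4) and the activation functions of Eqs. (5) to (7) can be sketched as follows. The input, weight and bias values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def relu(x):
    """Eq. (5): 0 for x < 0, x otherwise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Eq. (6): 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Eq. (7): 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def neuron(x, w, b, f=relu):
    """Eq. (4): activation applied to the weighted sum of the inputs plus the bias."""
    return f(np.dot(w, x) + b)

# Hypothetical values: the weighted sum is 0.4 - 0.6 - 0.1 = -0.3, plus bias 0.1 gives about -0.2.
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.3, -0.2])
b = 0.1

print(neuron(x, w, b, relu))     # negative pre-activation, so ReLU outputs 0.0
print(neuron(x, w, b, sigmoid))  # slightly below 0.5
print(neuron(x, w, b, tanh))     # slightly below 0
```

With a negative pre-activation, ReLU switches the neuron off entirely, while the sigmoid and tanh functions still return a small non-zero value.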
3 Algorithm
This section contains a step-by-step explanation of our approach. On the chosen task, the approach yields a considerable improvement. Predicted data generated by the neural networks can be used to improve models by increasing the amount of training data and helping the model adapt to new instances. The dependency among the independent (predictor) variables, measured with the Pearson correlation coefficient, can be used to generate more data.
By studying the dependency of an independent variable on the other variables, we can predict new instances of that variable using prediction algorithms such as neural networks or regression. The newly created samples can help with problems related to high bias or underfitting, and in fields that rely on the study of data. Below we describe, step by step, the procedure and the techniques we followed.
3.1 Preparing Dataset
The new approach was tested on the Boston Housing dataset. The dataset contains 14 attributes; to show our approach and results more clearly, we eliminated some attributes and kept the following for the experiment:
• NOX: nitric oxides concentration (parts per 10 million)
• RM: average number of rooms per dwelling
• AGE: proportion of owner-occupied units built prior to 1940
• DIS: weighted distances to five Boston employment centers
• LSTAT: % lower status of the population
• MEDV: median value of owner-occupied homes in $1000's
where MEDV is the target value to be predicted from the other attributes. A simple neural network with 3 layers was deployed to train on NOX, RM, AGE, DIS and LSTAT to predict MEDV.
3.2 Heatmap Implementation
A heatmap of Pearson correlation coefficients was created with the seaborn library [7] for the independent variables (training variables) and the target variable, to find the dependency of each attribute on the others. Figure 1 shows the heatmap for the dataset. Reading from the heatmap, for each target we kept for training only the attributes whose Pearson coefficient with the target was less than −0.3 or greater than 0.3, because the relationship between two variables vanishes as the Pearson correlation coefficient ($r$) approaches zero. Attributes whose coefficient falls in the range of Eq. (8) were excluded:

$$-0.3 \le r \le 0.3 \tag{8}$$
For predicting AGE, we excluded RM because its coefficient ($r$) falls between −0.3 and +0.3. The predicted values are generated from the training attributes by the simple neural network architecture described below. The approach is illustrated by generating data for the AGE attribute; the other attributes are generated in the same way. The dependency of AGE on the other attributes can be visualized with plots created using matplotlib [8], shown in Fig. 3. As the plots show, the relationship (negative or positive) becomes stronger as the coefficient value (Fig. 1) approaches −1 or +1.

Fig. 3 Variation of AGE with variables
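The selection rule of Sect. 3.2 can be sketched as below: compute the correlation matrix and keep, for a given target, only the columns with |r| > 0.3. The helper `select_predictors`, the column names and the toy data are hypothetical; the paper's own code is not shown, and real use would pass the Boston Housing columns instead.

```python
import numpy as np

def select_predictors(data, names, target, threshold=0.3):
    """Return the columns whose Pearson coefficient with `target` satisfies |r| > threshold."""
    corr = np.corrcoef(data, rowvar=False)   # columns are variables
    t = names.index(target)
    return [name for j, name in enumerate(names)
            if j != t and abs(corr[t, j]) > threshold]

# Toy data (hypothetical): one strongly related column and one unrelated column.
rng = np.random.default_rng(0)
n = 500
target = rng.normal(size=n)
strong = -0.8 * target + 0.1 * rng.normal(size=n)   # strongly negatively correlated
weak = rng.normal(size=n)                            # independent of the target

data = np.column_stack([target, strong, weak])
names = ["target", "strong", "weak"]

selected = select_predictors(data, names, "target")
print(selected)
```

The matrix returned by `np.corrcoef` is the same quantity that a heatmap such as Fig. 1 renders as colored cells.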
3.3 Neural Network Architecture
The neural network architecture is deliberately chosen so that it is not highly accurate and yields only moderate accuracy. Moderate accuracy is needed to prevent the model from replicating the original samples, which may occur at high accuracy (>90%). Four sequential layers of 50, 100, 50 and 1 neurons (input layer, hidden layer, hidden layer, output layer), with the ReLU activation function, were deployed. Following the heatmap, RM was excluded from the training set used to predict AGE because its correlation coefficient with AGE is −0.24.
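To make the layer sizes concrete, the sketch below builds the 50–100–50–1 stack as plain matrix multiplications and runs one forward pass. It is an untrained network with random weights; the He-style initialization, the linear output layer and the four input features (for example NOX, DIS, LSTAT and MEDV when predicting AGE) are assumptions made for illustration, since the paper does not give these details.

```python
import numpy as np

LAYER_WIDTHS = [50, 100, 50, 1]   # layer sizes quoted in Sect. 3.3

def init_params(n_inputs, widths, rng):
    """Create one (weights, bias) pair per layer; the initialization scheme is an assumption."""
    params = []
    fan_in = n_inputs
    for width in widths:
        W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, width))
        b = np.zeros(width)
        params.append((W, b))
        fan_in = width
    return params

def forward(X, params):
    """Forward pass: ReLU on every layer except the last, which is left linear (assumed)."""
    a = X
    for i, (W, b) in enumerate(params):
        z = a @ W + b
        a = z if i == len(params) - 1 else np.maximum(0.0, z)
    return a

rng = np.random.default_rng(0)
X = rng.normal(size=(506, 4))          # stand-in for 506 rows of 4 predictor attributes
params = init_params(X.shape[1], LAYER_WIDTHS, rng)
y_hat = forward(X, params)

n_params = sum(W.size + b.size for W, b in params)
print(y_hat.shape, n_params)
```

A network of this size has roughly ten thousand trainable parameters for only 506 rows, which is one reason to watch its accuracy for signs of memorization, as the section notes.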
Table 1 ANN accuracy and elimination table

| Target variable | Eliminated variable(s) | Accuracy (%) |
|---|---|---|
| NOX | None | 79.38 |
| RM | AGE, DIS | 69.40 |
| AGE | RM | 78.82 |
| DIS | RM, MEDV | 83.18 |
| LSTAT | None | 82.90 |
| MEDV | DIS | 85.50 |
4 Results and Comparison with Previous Works
The trained model is now ready to predict new values. We generated 506 new values by feeding the initial dataset into the model. The generated values did not replicate the original values because of the model's moderate accuracy. The neural network achieved an accuracy of 78.82% in predicting AGE values from NOX, DIS, LSTAT and MEDV. The predicted values were concatenated with the original 506 training instances of AGE, resulting in 1012 samples.
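The generate-and-concatenate step can be sketched as follows. To keep the example short and dependency-free, an ordinary least-squares regressor is swapped in for the paper's neural network, and the data are random stand-ins rather than the Boston Housing columns; both substitutions are assumptions of this sketch.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept (a stand-in for the ANN used in the paper)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Predict with the coefficients returned by fit_linear."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Hypothetical stand-in data: 506 rows, 4 predictors, one noisy target attribute.
rng = np.random.default_rng(0)
n = 506
X = rng.normal(size=(n, 4))
y = X @ np.array([1.5, -2.0, 0.7, 0.3]) + rng.normal(scale=1.0, size=n)

coef = fit_linear(X, y)
y_synth = predict_linear(coef, X)        # 506 predicted (synthetic) target values
y_aug = np.concatenate([y, y_synth])     # original followed by synthetic values

print(y_aug.shape)
```

Because the predictor is imperfect, the synthetic values differ from the originals, which is the property the paper relies on when it avoids a highly accurate model.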
4.1 Generating Values of Other Attributes
The same neural network architecture was applied to the other attributes, viz. NOX, RM, DIS, LSTAT and MEDV. This gave the results displayed in Table 1. The predicted values form the new samples of synthetic data and were concatenated with the original values to feed the new model, making it more robust and helping it generalize to new instances.
4.2 Testing
To test the quality and cogency of the synthetic (predicted) data, two artificial neural network models with target variable MEDV were compared. The first model was trained on the 506 original training instances, and the second model was trained on the original and synthetic data, i.e., 1012 training instances. To isolate the effect of the new data, the configuration of the two neural networks was kept the same. Each model had 3 hidden layers with ReLU activation and was trained for 50 epochs. The first model achieved an accuracy of 81.42%, whereas the second model, with more training data, achieved an accuracy of 86.47%. This shows that the approach can also be used to improve the accuracy of models that perform poorly because of a lack of training data.
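The paper does not state how accuracy is computed for this regression task. One common choice, assumed here purely for illustration, is the coefficient of determination R², which can be reported as a percentage. The function name `r2_score` is chosen for this sketch.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (an assumed accuracy measure)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r2_score([1, 2, 3], [1, 2, 3]))  # perfect prediction
print(r2_score([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

Under this measure, comparing the two models amounts to evaluating both on the same held-out MEDV values and comparing their scores.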
4.3 Comparison with Previous Works
Several works have generated synthetic data that matched or outperformed the original data when used with classification methods. For instance, one work used CGANs to generate synthetic data and obtained similar results for different classification methods on the actual and synthetic data [9]. Another work introduced a machine learning-based approach to generate healthcare data and improved the metrics when models were trained on the generated synthetic data compared to the original data [10]. In our approach, we likewise obtained better metrics with the generated synthetic data than with the original data alone.
5 Conclusions
In this paper, an approach for generating synthetic data was implemented using artificial neural networks and the Pearson correlation coefficient. The conclusions from the research are as follows:
(i) Correlation coefficients can be used to find relationships and dependencies among independent variables and also with dependent variables.
(ii) The dependencies can further be used for generating one variable from the others, by modeling the variation of one with respect to the others using prediction techniques.
(iii) An artificial neural network was used as the prediction technique to generate one variable from the others, using the relationships and dependencies indicated by the correlation coefficients.
(iv) The generated data can be used for research work requiring data analysis and for training models that require more data.
The quality of the generated data can be improved further by trying different correlation coefficients [11] and prediction techniques other than neural networks, such as regression. Also, regression models such as those in [12] can be made to generalize better to new instances by feeding them the synthetic data.
6 Future Work
The above approach can be used in areas of research where data are insufficient, including bioinformatics [13], where the study of data is very important. Many studies have been conducted to increase the amount of data in the biomedical sciences. This data generation technique can help machine learning models that lack training data by supplying an abundance of data. Further study and research can also be conducted to find relationships between independent variables, to make these models more robust and to enhance the quality of the data.
References
1. Han C, et al (2018) GAN-based synthetic brain MR image generation. 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018). Washington, DC, USA, pp 734–738. https://doi.org/10.1109/ISBI.2018.8363678
2. Behjati R, Arisholm E, Bedregal M, Tan C (2019) Synthetic test data generation using recurrent neural networks: a position paper. 2019 IEEE/ACM 7th international workshop on realizing artificial intelligence synergies in software engineering (RAISE). Montreal, QC, Canada, pp 22–27. https://doi.org/10.1109/RAISE.2019.00012
3. Moreno-Barea F, Jerez J, Franco L (2020) Improving classification accuracy using data augmentation on small data sets. Expert Syst Appl 161:113696. https://doi.org/10.1016/j.eswa.2020.113696
4. Schober P, Boer C, Schwarte L (2018) Correlation coefficients: appropriate use and interpretation. Anesth Analg 126:1. https://doi.org/10.1213/ANE.0000000000002864
5. Kurtulus D (2009) Ability to forecast unsteady aerodynamic forces of flapping airfoils by artificial neural network. Neural Comput Appl 18:359–368. https://doi.org/10.1007/s00521-008-0186-2
6. Agarap AF (2018) Deep learning using rectified linear units (ReLU)
7. Waskom M, the seaborn development team (Sep 2020) mwaskom/seaborn
8. Hunter JD (May–June 2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55
9. Vega B, Rubio-Escudero C, Riquelme J, Nepomuceno-Chamorro I (2020) Creation of synthetic data with conditional generative adversarial networks. https://doi.org/10.1007/978-3-030-20055-8_22
10. Dahmen J, Cook D (2019) SynSys: a synthetic data generation system for healthcare applications. Sensors 19:1181. https://doi.org/10.3390/s19051181
11. Akoglu H (Sep 2018) User's guide to correlation coefficients. Turk J Emerg Med 18(3):91–93. https://doi.org/10.1016/j.tjem.2018.08.001. PMID:30191186; PMCID:PMC6107969
12. Khan M, Srivastava K (2020) Regression model for better generalization and regression analysis. 30–33. https://doi.org/10.1145/3380688.3380691
13. Calimeri F, Marzullo A, Stamile C, Terracina G (2017) Biomedical data augmentation using generative adversarial neural networks. 626–634. https://doi.org/10.1007/978-3-319-68612-7_71
Cepstral Coefficient-Based Gender Classification Using Audio Signals S. Sweta, Jiss Mariam Babu, Akhila Palempati, and Aniruddha Kanhe
Abstract In this paper, audio signal-based gender classification using cepstral coefficients is presented. The audio signals are divided into small frames after the noise removal stage, and Mel-frequency cepstral coefficients and statistical coefficients are extracted to create training and testing data sets. The performance of the system is analyzed by comparing the results of different methods. Comparison of the results using MFCC coefficients and statistical coefficients confirms that the proposed method outperforms the existing classification methods.
Keywords Short-time Fourier transform (STFT) · Mel-frequency cepstral coefficients (MFCC) · Wavelet denoiser · Statistical coefficients · Support vector machine (SVM)
1 Introduction
Gender recognition is the process of identifying the gender of a speaker from a speech utterance. So far, humans are considered the most accurate detectors of a speaker's gender [1]. The topic has attracted considerable attention over the past few years. Text-independent speaker recognition using LSTM-RNN and speech enhancement is presented in [2], where MFCC coefficients are extracted and used to train an LSTM-RNN model that achieves a classification accuracy of 95.3%. In [3], a deep learning approach for speaker recognition is presented in which the MFCCs are transformed to DSFs and classification is performed using deep belief networks. In [4], a performance analysis of wavelet thresholding methods for denoising audio signals of some Indian musical instruments is presented. In [5], classification of audio signals using statistical parameters in the time and wavelet transform domains, with different wavelet thresholding methods, is presented.
S. Sweta (B) · J. M. Babu · A. Palempati · A. Kanhe Department of Electronics and Communications Engineering, National Institute of Technology, Karaikal, Puducherry, India
A. Kanhe e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_8
In this paper, gender classification from an audio signal is presented, in which a wavelet denoiser is first applied to remove noise, and statistical parameters are computed from Mel-frequency cepstral coefficients (MFCC). Gender classification is performed by training a support vector machine (SVM) classifier on the statistical parameters. The performance of the algorithm is evaluated using sensitivity, accuracy and specificity. The rest of the paper is organized as follows: Sect. 1 is the Introduction, Sect. 2 discusses the Proposed Work, and Sects. 3 and 4 present the Results and Conclusions, respectively, followed by the References.
2 Proposed Work
The audio-based gender classification is performed using statistical features computed from MFCC coefficients. The block diagram of the proposed algorithm is presented in Fig. 1. According to the block diagram, the input audio signal first goes to the denoising stage, which removes noise by comparison with a threshold. After this stage, MFCC coefficients are extracted for each audio clip, and the statistical parameters are calculated from them. With the statistical parameters computed from the MFCCs, training and testing data sets are created for classification using SVM.
Fig. 1 Block diagram of proposed classification algorithm
The dataset consists of audio clips recorded in different environments. Most of the clips contained noise and required a denoising stage before any further processing. Each clip is therefore first passed to MATLAB's wavelet signal denoiser; once the noise is suppressed, MFCC parameters are extracted from the signal. From these MFCC parameters, the statistical parameters are calculated. Using the statistical parameters and the MFCC parameters, training and testing sets are formed, and classification is performed with the SVM classifier.
2.1 Wavelet Signal Denoiser
For a noisy signal, the algorithm computes the discrete wavelet transform (DWT) and then determines thresholds using one of several methods. After the wavelet coefficients are compared with the threshold, the inverse discrete wavelet transform is taken, which yields the denoised signal. MATLAB's wavelet signal denoiser offers several wavelet families, denoising methods and thresholding rules. Of the wavelet families, symlets gave the best results. Of the denoising methods, Universal Threshold, Minimax and Block James–Stein gave the best results. Of these three, Universal Threshold suppressed the noise to almost zero but also reduced the signal value. Of the soft and hard thresholding rules, soft gave better results in our experiments. Under the hard rule, coefficients whose magnitude is below the threshold are set to zero and the rest are left unchanged; under the soft rule, the remaining coefficients are additionally shrunk toward zero by the threshold value. In this paper, symlet wavelet functions with Minimax and Universal thresholds are used.
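The two thresholding rules and the universal threshold can be sketched as below. This is a generic illustration applied to a hypothetical array of wavelet detail coefficients, not a reproduction of MATLAB's denoiser; the noise estimate (median absolute value divided by 0.6745) is the usual textbook choice and is an assumption here.

```python
import numpy as np

def hard_threshold(c, t):
    """Hard rule: zero coefficients with |c| <= t, keep the rest unchanged."""
    c = np.asarray(c, dtype=float)
    return np.where(np.abs(c) > t, c, 0.0)

def soft_threshold(c, t):
    """Soft rule: zero coefficients with |c| <= t, shrink the rest toward zero by t."""
    c = np.asarray(c, dtype=float)
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def universal_threshold(detail_coeffs):
    """Universal threshold sigma * sqrt(2 ln N), with sigma estimated as median(|d|) / 0.6745."""
    d = np.asarray(detail_coeffs, dtype=float)
    sigma = np.median(np.abs(d)) / 0.6745
    return sigma * np.sqrt(2.0 * np.log(d.size))

# Hypothetical detail coefficients.
d = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])

print(hard_threshold(d, 1.0))   # small coefficients zeroed, large ones untouched
print(soft_threshold(d, 1.0))   # small coefficients zeroed, large ones reduced by 1.0
print(universal_threshold(d))
```

In a full denoiser these functions would be applied to the detail coefficients of each DWT level before the inverse transform is taken.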
2.2 Mel-Frequency Cepstral Coefficients: Features
MFCC features are widely used in speech recognition and speaker recognition. MFCCs were first used in monosyllabic speech recognition. MFCC computation mimics the human hearing system: the Mel filter bank is spaced linearly at lower frequencies and logarithmically at higher frequencies. The algorithm first breaks the audio into frames by applying a window to the signal. The short-time Fourier transform (STFT) is then applied to find the power spectrum of each frame. This is passed through a Mel-scale filter bank, and the logarithm of the result is taken. Finally, the discrete cosine transform (DCT) is applied to obtain the Mel-frequency cepstral coefficients. The STFT is given by

$$X_m(\omega) = \sum_{n=-\infty}^{\infty} x(n + mR)\, w(n)\, e^{-j\omega(n + mR)} \tag{1}$$
where $w(n)$ is the window function. The Mel value for any frequency $f$ is calculated as

$$\operatorname{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{2}$$
MFCCs can be calculated using

$$\hat{C}_n = \sum_{k=1}^{K} \left(\log \hat{S}_k\right) \cos\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right] \tag{3}$$

where $K$ is the number of Mel filter-bank channels, $\hat{C}_n$ is the $n$-th MFCC coefficient, and $\hat{S}_k$ is the output of the $k$-th filter. In MATLAB, mfcc() or the cepstral feature extractor can be used to find the MFCC features. This results in a 14 × n matrix, where n is the number of frames in the audio. Each column represents different features.
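The Mel mapping of Eq. (2) and the cosine-transform step of Eq. (3) can be sketched as follows. Equation (3) is read here with k indexing the filter-bank outputs and K their total number, which is an interpretation; the function names and the flat test spectrum are hypothetical, and the windowing, STFT and filter-bank stages are omitted.

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (2): mel(f) = 2595 * log10(1 + f / 700), with f in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. (2)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def cepstral_coefficients(filterbank_energies, n_coeffs):
    """Cosine transform of the log filter-bank outputs, following Eq. (3)."""
    S = np.asarray(filterbank_energies, dtype=float)
    K = S.size
    k = np.arange(1, K + 1)
    log_S = np.log(S)
    return np.array([np.sum(log_S * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, n_coeffs + 1)])

print(hz_to_mel(1000.0))             # close to 1000 mel by construction of the scale
print(mel_to_hz(hz_to_mel(4000.0)))  # round trip recovers the frequency

# A flat filter-bank output carries no spectral shape, so its coefficients vanish.
flat = np.full(20, 3.0)
print(np.round(cepstral_coefficients(flat, 13), 6))
```

Edges of a Mel filter bank are typically obtained by spacing points evenly on the Mel axis and mapping them back with `mel_to_hz`, which is why the filters are narrow at low frequencies and wide at high frequencies.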
2.3 Statistical Parameters
Each audio clip, of at most 20 s, was divided into several frames. For each frame, the statistical parameters are computed from the MFCC features. They include the mean, mode and median, which measure central tendency, and the standard deviation, variance, skewness, kurtosis, quantiles, interquartile range and moving absolute deviation, which describe dispersion and shape. Relative to a Gaussian distribution, skewness measures asymmetry and kurtosis measures tailedness; quantiles divide the data into intervals of equal probability.
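A minimal sketch of several of these statistics is given below, computed with population (biased) moment estimators. The function name `summary_statistics` and the toy feature matrix are assumptions for illustration; the mode and the absolute-deviation measure are left out, and the paper's MATLAB implementation may use different estimators.

```python
import numpy as np

def summary_statistics(values):
    """Central tendency, dispersion and shape statistics of a 1-D array."""
    v = np.asarray(values, dtype=float)
    mean = v.mean()
    std = v.std()                        # population standard deviation
    centered = v - mean
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    return {
        "mean": mean,
        "median": median,
        "std": std,
        "variance": v.var(),
        "skewness": np.mean(centered ** 3) / std ** 3,   # 0 for symmetric data
        "kurtosis": np.mean(centered ** 4) / std ** 4,   # 3 for a Gaussian
        "iqr": q3 - q1,
    }

stats = summary_statistics([1.0, 2.0, 3.0, 4.0, 5.0])
print(stats)

# Hypothetical stand-in for an MFCC matrix: 14 coefficients by 30 frames.
rng = np.random.default_rng(0)
features = rng.normal(size=(14, 30))
per_coefficient = [summary_statistics(row) for row in features]
print(len(per_coefficient))
```

Each dictionary can then be flattened into a fixed-length feature vector for the classifier, regardless of how many frames the clip contains.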
2.4 SVM Classifier
The support vector machine (SVM) is a machine learning algorithm used for classification problems. In MATLAB, it is implemented by first loading the file that holds all the extracted features. The features are separated into two groups: a training set and a testing set. The training labels y are partitioned for cross-validation (CV), and features are selected from this partition. The best hyperparameters are found for a model with a Gaussian kernel function. The test features are then passed to predict(), which returns the predicted labels, from which the accuracy is computed. SVM classification can also be performed using the Classification Learner app, in which the data can be loaded either from the workspace or directly from a file.
A confusion matrix is computed from the predicted values and the test values. From the confusion matrix, the true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn) are identified, and from these values the accuracy, specificity, sensitivity and precision are found. Each audio signal is first given to a preprocessing unit, where denoising is performed. After that, the MFCC features are extracted, from which the statistical parameters are found. Using these, a model is trained with SVM and tested.
2.5 Observations from Wavelet Denoiser
Among the thresholding methods available for wavelet denoising, Minimax and Universal Threshold gave the best results, that is, they suppressed noise without noticeably suppressing the information, as shown in Figs. 2 and 3. Figure 2 shows the denoised signal and the original signal using Minimax thresholding with the soft rule. Figure 3 shows the denoising of the same input signal using the Universal threshold with the soft rule.
Fig. 2 Denoising using minimax thresholding (red: denoised signal, blue: noisy signal input)
Fig. 3 Denoising using universal threshold (red: denoised signal, blue: noisy signal input)
3 Results
The performance of the proposed algorithm is evaluated by computing accuracy, precision, sensitivity and specificity. Accuracy, precision, sensitivity, specificity and miscalculation are calculated using the following formulas:

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fn + fp} \tag{4}$$

$$\text{Precision} = \frac{tp}{tp + fp} \tag{5}$$

$$\text{Sensitivity} = \frac{tp}{tp + fn} \tag{6}$$

$$\text{Specificity} = \frac{tn}{tn + fp} \tag{7}$$

$$\text{Miscalculation} = \frac{fn + fp}{tp + tn + fn + fp} \tag{8}$$

where tp is the number of true-class samples identified as true, tn is the number of false-class samples identified as false, fp is the number of false-class samples falsely classified as true, and fn is the number of true-class samples falsely classified as false (Fig. 4).

Fig. 4 Confusion matrix obtained when the model was trained using MFCC coefficients without using any denoising stage for different SVM kernels

It is observed from Table 1 that classification using MFCC parameters alone does not give the desired results. To improve the accuracy and precision, a denoising stage has been added, and statistical parameters for the same set of audio files have been included. Wavelet denoising is used because, according to the referred papers, it showed better results than the alternatives, and statistical parameters of the denoised signals are taken along with the MFCCs. Table 2 shows the results when Universal thresholding was used for denoising and the MFCCs and statistical coefficients of the denoised signals were used to train the model. It is observed from Table 3 that the accuracy and precision have increased compared to the results obtained when the classification was done using only MFCC parameters. The highest accuracy, 82.6%, was observed with the cubic kernel. From these values, it is observed that the accuracy and precision have increased, compared to the results obtained when the classification was done only using MFCC
Table 1 Classification results using MFCC coefficients without using any denoising stage for different SVM kernels

| Kernel function | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) | Miscalculation (%) |
|---|---|---|---|---|---|
| Linear | 64.5 | 64.9 | 66.9 | 61.97 | 35.5 |
| Quadratic | 73.4 | 73.2 | 75.8 | 70.8 | 26.6 |
| Cubic | 77.5 | 78.4 | 77.5 | 77.5 | 22.4 |
| Coarse Gaussian | 68.8 | 68.9 | 71.4 | 66.1 | 31.2 |
| Medium Gaussian | 78.3 | 78.8 | 78.9 | 77.7 | 21.6 |
| Fine Gaussian | 80.6 | 79.5 | 83.7 | 77.3 | 19.4 |
Table 2 Classification results using MFCC and statistical parameters with universal thresholding for different SVM kernels

| Kernel function | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) | Miscalculation (%) |
|---|---|---|---|---|---|
| Linear | 68.4 | 69.1 | 76.9 | 58.0 | 31.6 |
| Quadratic | 79.9 | 77.5 | 89.5 | 68.2 | 20.1 |
| Cubic | 82.6 | 81.2 | 88.9 | 73.5 | 17.4 |
| Coarse Gaussian | 66.6 | 64.9 | 85.7 | 43.1 | 33.4 |
| Medium Gaussian | 76.5 | 74.7 | 86.9 | 63.8 | 25.6 |
| Fine Gaussian | 77.8 | 76.5 | 86.2 | 67.5 | 22.2 |
Table 3 Classification results using MFCC and statistical parameters with minimax thresholding for different SVM kernels

| Kernel function | Accuracy (%) | Precision (%) | Sensitivity (%) | Specificity (%) | Miscalculation (%) |
|---|---|---|---|---|---|
| Linear | 68.4 | 69.2 | 76.9 | 57.9 | 31.6 |
| Quadratic | 79.9 | 77.5 | 89.4 | 68.1 | 20.1 |
| Cubic | 82.6 | 81.4 | 88.7 | 75.0 | 17.4 |
| Coarse Gaussian | 66.5 | 64.9 | 85.7 | 43.1 | 33.4 |
| Medium Gaussian | 76.4 | 74.4 | 86.8 | 63.4 | 23.6 |
| Fine Gaussian | 77.7 | 76.2 | 86.6 | 66.8 | 22.3 |
parameters. The highest accuracy, 82.6%, was observed with the cubic kernel. Minimax and Universal thresholding were used because they suppressed noise better than the other thresholding methods.
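The metrics of Eqs. (4) to (8) follow directly from the four confusion-matrix counts, as in the sketch below. The counts used in the example are hypothetical and are not taken from Fig. 4 or from the tables.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, specificity and miscalculation, Eqs. (4)-(8)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,        # Eq. (4)
        "precision": tp / (tp + fp),          # Eq. (5)
        "sensitivity": tp / (tp + fn),        # Eq. (6)
        "specificity": tn / (tn + fp),        # Eq. (7)
        "miscalculation": (fn + fp) / total,  # Eq. (8)
    }

# Hypothetical counts for 100 test clips.
metrics = classification_metrics(tp=40, tn=35, fp=15, fn=10)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.1f}%")
```

Accuracy and miscalculation always sum to 100%, which gives a quick consistency check on reported values.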
3.1 Comparison with Existing Work
MFCC features give better results when the signal is clean, but they do not perform well on noisy signals. For this reason, classification is currently often done by calculating statistical parameters of the audio itself. With just MFCC features and wavelet denoising, [2] obtained a recognition accuracy of 54% by training an LSTM-RNN. In [1], training a DBM on DSFs derived from MFCCs produced an error rate of 0.43% for female and 0.55% for male speakers. In [3], an accuracy of 67.70% was obtained using MFCC, 73.40% using Fourier parameters, and 86.40% using both. As these results show, the accuracy of the system should be improved. Hence, in this paper, both MFCC and statistical parameters, together with a denoising stage, are used to train an SVM model. The results improve on the MFCC-only baseline, with a maximum accuracy of 82.6% obtained with both Minimax and Universal thresholding.
4 Conclusion and Future Work
It is clear from the results that MFCC features do not perform well for noisy signals. To increase the accuracy and precision, a denoising stage is included before the MFCC features are calculated. Denoising was done with the wavelet signal denoiser using several methods. MFCC features are then extracted, followed by the statistical parameters, and these are used to train an SVM model. The highest accuracy, 82.6%, was observed with the cubic kernel for both Minimax and Universal thresholding. Future work includes enlarging the dataset and thereby increasing the accuracy and precision.
References
1. Alim SA, Rashid NKA (12 Dec 2018) Some commonly used speech feature extraction algorithms. From natural to artificial intelligence - algorithms and applications. Ricardo Lopez-Ruiz, IntechOpen
2. Hourri S, Kharroubi J (2020) A deep learning approach for speaker recognition. Int J Speech Technol 23:123–131. https://doi.org/10.1007/s10772-019-09665-y
3. El-Moneim SA, Nassar MA, Dessouky MI et al (2020) Text-independent speaker recognition using LSTM-RNN and speech enhancement. Multimed Tools Appl 79:24013–24028. https://doi.org/10.1007/s11042-019-08293-7
4. Verma N, Verma AK (2012) Performance analysis of wavelet thresholding methods in denoising of audio signals of some Indian musical instruments. Int J Eng Sci Technol 4.5:2040–2045
5. Lambrou T, Kudumakis P, Speller R, Sandler M, Linney A (1998) Classification of audio signals using statistical features on time and wavelet transform domains. Proceedings of the 1998 IEEE international conference on acoustics, speech and signal processing, ICASSP '98 (Cat. No.98CH36181), vol 6. Seattle, WA, USA, pp 3621–3624. https://doi.org/10.1109/ICASSP.1998.679665
6. Srinivas NSS, Sugan N, Kar N et al (2019) Recognition of spoken languages from acoustic speech signals using Fourier parameters. Circuits Syst Signal Process 38:5018–5067. https://doi.org/10.1007/s00034-019-01100-6
7. Harb H, Chen L (2005) Voice-based gender identification in multimedia applications. J Intell Inf Syst 24:179–198. https://doi.org/10.1007/s10844-005-0322-8
8. Yücesoy E (2020) Speaker age and gender classification using GMM super vector and NAP channel compensation method. J Ambient Intell Human Comput. https://doi.org/10.1007/s12652-020-02045-4
9. Krishna DN, Amrutha D, Reddy SS, Acharya A, Garapati PA, Triveni BJ (2020) Language independent gender identification from raw waveform using multi-scale convolutional neural networks. ICASSP 2020–2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). Barcelona, Spain, pp 6559–6563. https://doi.org/10.1109/ICASSP40776.2020.9054738
Reinforcement Learning Applications for Performance Improvement in Cloud Computing—A Systematic Review Prathamesh Vijay Lahande and Parag Ravikant Kaveri
Abstract The cloud gives users easy access to their data and applications from anywhere in the world over the Internet. As a result, the number of cloud users has grown rapidly, and the cloud sometimes suffers from major issues such as resource allotment problems and Virtual Machine problems. Artificial Intelligence is the field of study that gives a system the ability to think on its own. It includes Machine Learning, which provides statistical tools to explore and understand data. Supervised, Unsupervised, and Reinforcement Learning are the three types of Machine Learning. Reinforcement Learning is a type of Machine Learning in which the model uses limited past data and learns from new data. This paper presents a literature survey of different Reinforcement Learning algorithms and how they can be used to solve the issues of the cloud. Reinforcement Learning suits these problems because it provides automated learning through experience, similar to how human beings learn: if a problem arises on the cloud, the cloud learns from its past experiences and automatically improves itself to become more resilient to arising problems. Learning from experience differs from other Machine Learning techniques because no large data sets are required, and the learning is goal-oriented and improves through rewards. Reinforcement Learning methods and algorithms such as the Markov Decision Process, Q-Learning, and Double Q-Learning have significantly advanced the learning process of the agent. The major objective of this research is to implement these Reinforcement Learning algorithms on the cloud to solve some of the issues it faces. A total of 60 research papers are studied, providing a good pool of algorithms and techniques that can be used to solve cloud problems.
Keywords Cloud computing · Cloud reliability · Physical machines · Reinforcement learning · Service level agreements
P. V. Lahande (B) · P. R. Kaveri
Symbiosis Institute of Computer Studies and Research (SICSR), SIU, Pune, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_9
P. V. Lahande and P. R. Kaveri
1 Introduction
The term "cloud" in cloud computing is used as a metaphor for "the Internet" [1]. Cloud computing means providing services to the end user through the Internet [1, 2]. The cloud computing model offers convenient, anytime access to a pool of resources that includes servers, networks, applications, and services [1, 2]. Cloud computing builds on decades of research in fields such as virtualization using the CloudSim environment [3], resource sharing, and networking [4]. Clouds are used in three ways: IaaS, PaaS, and SaaS [1, 5, 6]. IaaS stands for Infrastructure as a Service, PaaS for Platform as a Service, and SaaS for Software as a Service [1, 5, 7]. Security and management in the cloud are very important today [8]. The security of the cloud depends directly on the security of cloud computing providers [8] (Figs. 1, 2 and 3).
Computer science includes the study of Artificial Intelligence, which is concerned with creating machines that are not only intelligent but also act and react like human beings [9]. Artificial Intelligence includes Machine Learning, which lets a machine learn automatically and improve from its past experiences without being programmed explicitly [10]. With respect to the feedback given to the learner, there are three types of learning: Supervised, Unsupervised, and Reinforcement Learning [9]. Reinforcement Learning is an approach in which the learning agent learns from its past experiences, which enables it to understand and adapt to the learning environment at a fast rate [9, 10]. If the agent makes a correct decision in the environment, it gets a positive reward; otherwise it gets a negative reward [9, 10]. Reinforcement Learning closely resembles how human beings learn, i.e., from past experiences [10].
Fig. 1 Client server architecture
Fig. 2 IaaS versus PaaS versus SaaS
Fig. 3 AI, ML, RL
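The reward mechanism described above can be sketched in a few lines. This is an illustrative toy, not a method from any surveyed paper: a hypothetical environment has two actions, one of which is "correct", and the agent keeps a running value estimate per action, updated from a +1 or -1 reward. The function name, the exploration rate `epsilon`, and the step size `alpha` are all assumptions.

```python
import random


def run_agent(episodes=500, epsilon=0.1, alpha=0.1, seed=0):
    """Learn action values from +1 / -1 rewards in a two-action toy environment."""
    rng = random.Random(seed)
    values = [0.0, 0.0]   # estimated value of action 0 and action 1
    correct_action = 1    # hypothetical: the environment rewards action 1
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.randrange(2)  # explore occasionally
        else:
            action = max(range(2), key=lambda a: values[a])  # exploit best estimate
        reward = 1.0 if action == correct_action else -1.0
        # Move the estimate a small step toward the observed reward.
        values[action] += alpha * (reward - values[action])
    return values


values = run_agent()
```

After training, the estimate for the rewarded action is the larger of the two, so the greedy choice settles on the correct decision without any labeled dataset.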
2 Literature Survey
Comparison Between Existing Surveys and Present Survey:
Criteria
Our Survey
[11] to [12]
Cloud provisioning
✔
✔
SLA
✔
Cloud elasticity
✔
✔
Cloud revenue ✔
✔
[13] to [10]
[14] to [15]
[16] to [17]
✔
✔
✔
[18] to [19]
[20] to [21]
✔
✔
✔
Cloud reliability
✔
✔
✔
Cloud efficiency
✔
Cloud performance
✔
✔
Virtual machine
✔
✔
Job scheduling
✔
Resource management
✔
Time
✔
✔
Cost
✔
✔
✔
✔
✔
✔ ✔
✔ ✔
✔
✔
✔ ✔
[22] to [23]
✔
✔
✔ ✔ ✔
✔ ✔
✔
Features, Drawbacks and Improvements:
Ref. No. | Year | Features | Drawbacks | Improvements
[11] | 2021 | JCETD (joint cloud-edge task deployment) | Resource management, task deployment | The task deployment problem is simulated as a deep RL process using the cloud-edge model. Tasks are efficiently and reasonably deployed in the cloud and edge center
[24] | 2021 | DTOMALB (distributed task allocation algorithm based on multi-agent and load balancing) | Resource allocation, response delay, load balancing | Algorithm is based on multi-agent that reduces resource wastage and does load balancing
[25] | 2021 | Adaptive resource allocation integrated with RL mechanism | Service cost, availability, VMs, SLAs | Integration of RL and adaptive resource provisioning
[26] | 2021 | Collaborative RL algorithm, node broadcast algorithm, delivery tree construction algorithm | Content delivery efficiency, QoS | Routing planning improves the content delivery efficiency of CCDNs
[27] | 2021 | Double deep Q-network (Double DQN) | – | The paper proposes an algorithm based on double DQN to jointly optimize the decision variables
[28] | 2020 | Task scheduling, resource utilization | Assumption is made that one VM cannot perform several tasks at the same time. Not all kinds of clusters are considered by the proposed system | More efficient task scheduling and optimized resource utilization of clusters
[29] | 2020 | Flexible job-shop scheduling problem (FJSP), self-learning genetic algorithm (SLGA), genetic algorithm (GA), RL | – | Proposed SLGA is better than its competitors in solving FJSP
[30] | 2020 | Serverless, auto-scaling, RL, Knative | System is applicable only to serverless environments, and in the modern era systems are composed of servers | Better resource allocation in serverless environments
[31] | 2020 | Cloud computing, deep learning, scalability | The number of attributes used for the application can be more. A limited dataset is used for the research | Improvement in cloud resource scheduling in the cloud data centre. ML algorithms can give significant results in predicting upcoming resource demands
[12] | 2019 | Cloud elasticity, service level agreements, cloud computing, adaptive resource provisioning, reinforcement learning | Preliminary experiments are not conducted or implemented on a real cloud but on the CloudSim environment | Experiments conducted show results that are superior to the threshold-based, static and other RL-based allocation schemes
[5] | 2018 | Cloud computing, cloud reliability, queuing theory | Methodology is developed on the CloudSim framework, which is used to simulate the real cloud environment, virtual resource allocation and other functions, and not on a real cloud environment | Improved and enhanced reliability in the cloud. Better scheduling of users' requests. Better job scheduling
[32] | 2018 | Cloud elasticity, resource management | A lot of time is consumed at the beginning for making the algorithm train and learn | Deep reinforcement learning automatically decides on requesting and releasing VM resources from the cloud provider and brings them into the NoSQL cluster according to user-defined policies and rewards. The agent is then compared to the state-of-the-art approaches and shows 1.6 times gains of rewards over its lifetime
[6] | 2018 | Cloud computing, cloud resource management | Simulations are performed using the Python programming language and not tested on real clouds | The algorithm learns from the previous results. This helps to give better results than the normal ones. The proposed deep reinforcement learning achieves up to 320% energy cost efficiency. This technique achieves up to runtime reduction of 144%
[33] | 2017 | Job scheduling, cloud computing performance | Not implemented on a real cloud | Better resource management at cloud level
[34] | 2017 | Resource provisioning | Can be applied to real clouds | Used for maximizing efficiency
[35] | 2017 | Resource provisioning, task scheduling | Performed on simulators and not on real clouds | Better penalty of performance in terms of scheduling of task delay. Service level agreements are met to a high extent
[36] | 2017 | Cloud resource allocation, power management | Performed on simulators and not on real clouds | Low cost, high performance
[18] | 2016 | RL, double Q-learning | Algorithm may complex learning of the RL algorithm | Double Q-learning faster Q-learning
[37] | 2015 | Resource management, cloud reliability | – | Effective cloud reliability, cloud is more adaptive and efficient
[13] | 2014 | Scaling applications in cloud, application re-dimensioning, auto-scaling elastic applications in cloud environment | Problems faced during traffic surges | Better resource utilization, lower cost for working on cloud, better elasticity in cloud, gives the correct number of resources on a pay-as-you-go service, better cost, better efficiency
[38] | 2014 | Cloud efficiency | Vadara is a predictive rather than a reactive approach | Better prediction of cloud elasticity
[14] | 2014 | Adaptability of cloud, cloud reliability | – | Client meets the cloud availability and reliability goals by making use of the provided service
[39] | 2014 | Cloud resource management, resource profiling | Paper comprises only literature work performed by the authors | –
[40] | 2014 | Migration of virtual machines | Simulations done using virtual machines and not on real cloud | Better migration of virtual machines
[22] | 2014 | Resource management | Work carried out on virtual machines and not on real cloud | Reduced traffic and workload on the virtual machines
[41] | 2013 | Live migration on cloud | Methodology should also focus on resource allocation on the cloud | Better cloud availability
[42] | 2013 | Fault tolerance as a service | – | Better load balancing of the cloud
[43] | 2013 | Atari 2600 games, efficiency and performance of games | RL (reinforcement learning) can be applied to other games as well | With the use of DRL (deep reinforcement learning), the performance and efficiency of games is much higher
[44] | 2013 | Resource management, cloud performance | Not implemented on a real cloud environment | Better performance in the cloud due to the improved management of resources of virtual machines
[19] | 2013 | Task scheduling | Applicable to real clouds | Used for lowering the cost
[20] | 2013 | Resource provisioning, task scheduling | Performed on simulators and not on real clouds | HARMONY can improve the data center energy efficiency by up to 25%
[45] | 2013 | Cloud resource provisioning | Work carried out in simulators and virtual machines | Proactive provision of resources done by predicting future user demands
[21] | 2013 | Resource allocation on data centers | Work carried out on virtual machines and not on real cloud | Better migration of VMs. Prediction of future needs
[15] | 2012 | Job scheduling problems | The unified reinforcement learning algorithm method can be applied to a real cloud and not just for solving virtual machine problems | Improved data efficiency in the learning algorithm
[46] | 2012 | VM placement | The proposed methodology focuses on only two parameters, CPU and I/O bandwidth requirements | –
[47] | 2011 | Slow convergence, cloud performance, time | More time for setting resources | Better cloud performance
[48] | 2011 | Job scheduling | Methodology should also focus on resource allocation | Problem of job scheduling is reduced
[49] | 2011 | Workload on cloud, dynamic scheduling, priority of task | Simulators used rather than using real cloud | Better workload balancing, better management of power on cloud
[48] | 2011 | Scalability issues on cloud, service level agreements | Poor timing and utility functions | Improves scalability issues
[50] | 2011 | Energy management | Model tested on simulator | Better results due to the use of adaptive reinforcement learning
[51] | 2011 | Application profiling | Can be applied to real clouds | Performance prediction model works as it says
[23] | 2011 | Virtual machine maintenance | Not implemented on the cloud. Implemented on virtual machine and not on real cloud. Can be applied to real clouds | Better performance and its maintenance
[52] | 2010 | Cloud provider revenue | Not implemented on the cloud. Implemented on virtual machine and not on real cloud. Can be applied to real clouds | High cloud provider revenue
[53] | 2010 | Virtual machines, physical machines | Not implemented on the cloud. Implemented on virtual machine and not on real cloud. Can be applied to real clouds | Optimal placement for VMs to the PMs
[54] | 2010 | Data, virtual machine | Not implemented on the cloud. Implemented on virtual machine and not on real cloud. Can be applied to real clouds | Improved data transfer time between the virtual machine and data
Results and Discussion (Comparative Table for Algorithm):
Ref. No. | Year | Algorithm | Parameters focuses on | Description/Analysis
[11] | 2021 | JCETD (joint cloud-edge task deployment) | Resource management, task deployment | The task deployment problem is simulated as a deep RL process using the cloud-edge model. Tasks are efficiently and reasonably deployed in the cloud and edge center
[24] | 2021 | DTOMALB (distributed task allocation algorithm based on multi-agent and load balancing) | Resource allocation, response delay, load balancing | Algorithm is based on multi-agent that reduces resource wastage and does load balancing
[25] | 2021 | Adaptive resource allocation integrated with RL mechanism | Service cost, availability, VMs, SLAs | Integration of RL and adaptive resource provisioning
[26] | 2021 | Collaborative RL algorithm, node broadcast algorithm, delivery tree construction algorithm | Content delivery efficiency, QoS | Routing planning improves the content delivery efficiency of CCDNs
[27] | 2021 | Double deep Q-network (Double DQN) | – | The paper proposes an algorithm based on double DQN to jointly optimize the decision variables, including the task offloading decision, the transmission power policy of VUE and the computational resources allocation policy of MEC servers
[28] | 2020 | RL | Average time delay of task, task congestion degree | Adjusts the cluster size according to cluster scale
[29] | 2020 | Self-learning genetic algorithm (SLGA) | Job-shop scheduling | Genetic algorithm (GA) is used for optimizing, and its arguments are adjusted based on RL
[30] | 2020 | Q-learning, Kubernetes-based framework | Serverless computing, auto-scaling | Paper investigates the applicability of the RL approach for optimizing auto-scaling for VM provisioning
[31] | 2020 | Machine learning, deep learning, RL | Cloud application performance | Explores the impact of ML algorithms on cloud application
[12] | 2019 | Q-learning, deep reinforcement learning | Cloud provisioning, service level agreements | Q-learning suffers from convergence at low speed and from issues about scalability, workload adjustments and meeting SLAs. Paper proposes deep reinforcement learning as useful for efficient cloud provisioning
[5] | 2018 | MDP, greedy scheduling policy, random scheduling policy | Dynamic changes in the cloud, scheduling algorithms in cloud | In the cloud environment, there are dynamic changes that make it difficult to have a reliable task scheduler. The solution to this problem would be to employ a task-driven scheduler. The scheduler can adapt effectively to changes, which can also be dynamic in nature, and still schedule the user requests
[32] | 2018 | DERP | Cloud elastic resource provisioning, resource management, resource provisioning | Deep reinforcement learning agent for the cloud elasticity problem, called DERP
[6] | 2018 | DRL (deep reinforcement learning) | Cost minimization, cloud resource provisioning, task scheduling | A key problem that service providers face is that of data energy cost minimization. Solutions to this problem are given using task scheduling or resource provisioning, but they too suffer from scalability issues. Use of deep reinforcement learning will enhance the cloud energy cost
[33] | 2017 | MDP_DT | Elastic management of cloud applications | MDP_DT is a novel full-model-based reinforcement learning method for elastic resource management
[34] | 2017 | NBRPTS (negotiation based resource provisioning task scheduling) algorithm | Resource provisioning, task scheduling, minimization of electricity bills, maximization of cost for cloud provider | Addresses the problem of scheduling of tasks and provisioning of resources under the service level agreements
[35] | 2017 | – | Resource provisioning, task scheduling | Energy awareness for the provisioning of resources and scheduling of tasks for the use of cloud systems
[36] | 2017 | Deep reinforcement learning, hierarchical framework | Cloud resource allocation, power management | A framework is introduced so that task scheduling and resource allocation happen successfully in the cloud devices using deep reinforcement learning. The proposed model comprises a global tier for VM resource allocation as well as a local one for distributed power management of local servers
[18] | 2016 | Deep reinforcement learning, double Q-learning | – | Double Q-learning uses the existing architecture and deep neural network of the DQN algorithm without requiring any additional parameters or networks
[37] | 2015 | Adaptive reinforcement learning | Cloud reliability, resource management | The amount of computing power provided by distributed systems is greater, but their reliability is often difficult to guarantee. With adaptive reinforcement learning, the cloud functions better, thereby making the client experience with the cloud better
[13] | 2014 | Auto-scaling | Auto scaling, cloud elasticity | Classifies auto-scaling techniques for applications into reinforcement learning, static threshold rules, control theory, time series analysis and queuing theory; these auto-scaling techniques can be used for elasticity of cloud environments
[38] | 2014 | Vadara | Cloud elasticity | Vadara is a framework that enables the creation of elasticity modules which are totally generic in nature and pluggable into the cloud framework
[14] | 2014 | – | VM failures, physical server failures, network failures | Paper focuses on virtual machine failures, physical server failures and network failures. These parameters are essential for enhancing the cloud performance
[39] | 2014 | – | Resource management | Paper aims to present a vocabulary/taxonomy for tools and profiling models. The paper presents its characteristics and its challenges when describing various tools and models
[40] | 2014 | Heuristics for VM migration | VM migration, traffic problems on the cloud, heuristics: CPU and memory | Without considering network traffic problems between the VMs, the migration of VMs can lead to various problems. Paper presents two heuristics for migration of virtual machines, namely CPU and memory
[22] | 2014 | Management techniques in IaaS | Resource provisioning, resource allocation, resource mapping, resource adaptation | Paper presents management techniques such as resource provisioning, resource allocation, resource mapping and resource adaptation. The research paper surveys the IaaS techniques and also guides further research
[41] | 2013 | – | Live migration on cloud, cloud availability | A comprehensive availability model is used to evaluate the use of live migration, which enables virtual machine migration with comparatively minimal interruption of service. Live migration is often performed on a time-based trigger
[42] | 2013 | – | Fault tolerance, cloud reliability, cloud load | Provides a method for managing fault tolerance in clouds as well as balancing the load on the cloud. This improves cloud availability and cloud reliability, improving the users' experience
[55] | 2013 | – | CPU utilization | CPU utilization time patterns of several MapReduce applications
[43] | 2013 | Deep RL, double Q-learning | Atari games | DRL is used for playing and enhancing the vintage Atari games
[44] | 2013 | Tiramola, NoSQL | Job scheduling, cloud computing performance | Elastic provisioning of resources using NoSQL by the use of Tiramola
[19] | 2013 | – | Resource provisioning, scheduling optimized framework | Research focuses on the problem of cloud system global optimization by reducing the cost and maximizing the efficiency while the SLAs are satisfied
[20] | 2013 | Harmony, K-means clustering algorithm | Dynamic capacity provisioning | A resource management system, heterogeneous in nature, for dynamic capacity provisioning
[45] | 2013 | Prediction model of cloud for benchmark web application; support vector regression, neural networks, linear regression | VM scaling, SLA agreements, web applications, cloud prediction model | For effective scaling of VMs in cloud computing, servers need to be provisioned before they are actually required because of the time a VM needs. One way to solve this problem is to know the future demands. This is developed using three ML techniques: support vector regression (SVR), neural networks (NN), linear regression (LR)
[21] | 2013 | Autonomic resource controller | Resource provisioning in data centers, resource allocation | The research paper proposes an autonomic resource controller that is used to control the allocation of resources for data centers
[15] | 2012 | Unified reinforcement learning | Cloud management | Reinforcement learning used for solving cloud management problems
[17] | 2012 | ImageNet Classification | Neural networks | –
[46] | 2012 | Survey and method for placing a VM to a PM | Assigning VMs to PMs or nodes | The main problem is the smaller number of physical machines. The problem of mapping VM to PM can be viewed as an NP-hard problem
[47] | 2011 | Speedy Q-learning | Less time consuming, convergence, cloud efficiency, cloud performance | Speedy Q-learning converges faster than the existing Q-learning method because at each step it uses two successive estimates of the action-value function, which makes its space complexity twice that of standard Q-learning
[48] | 2011 | Ordinal sharing learning | Job scheduling, load balancing | With proper job scheduling in the cloud, the cloud will run effectively, thereby providing a good user experience to the client
[49] | 2011 | – | Distributed systems | Paper focuses on a hierarchical scheduler that exploits the multiple core architecture. This scheduler can perform effective scheduling
[48] | 2011 | Reinforcement learning | Job scheduling, job scheduling problems, dynamic resource allocation problems | Ordinal sharing learning tackles the problem of scalability by the use of an ordinal strategy of learning
[50] | 2011 | Adaptive reinforcement learning | Scheduling |
[51] | 2011 | Canonical correlation analysis with the use of application profiling | Cloud resource management | One of the important techniques for effective resource management is application profiling. Paper presents a new application profiling technique using CCA. It also presents a performance prediction model as an application of such a technique
[52] | 2010 | – | Cloud provider revenue | The research paper focuses on how to increase the cloud provider revenue by minimizing the electricity costs
[53] | 2010 | Linear programming and quadratic programming | Mapping of VM to PM, problem of VM placement in data centers | The research paper focuses on mapping of VM to the PM. It provides approaches based on linear and quadratic programming techniques that can significantly improve over the theoretical bounds and solve the problem of VM placement in data centers
[23] [54] | 2011 2010 | VM placement and VM migration method | Virtual machine placement, virtual machine migration | The research paper proposes a virtual machine placement and virtual machine migration approach to minimize the data transfer time consumption
[56] | 2009 | RL | Online web system | Reinforcement learning used for online web system auto-configuration
[57] | 2009 | VCONF | Virtual machine | Reinforcement learning based approach called VCONF for automating the process of VM configuration
[58] | 2009 | Virtual putty | Placing of VM on PM, memory, network, disk and power consumption | The research paper characterizes the VM in terms of its local virtual footprint. It presents three general principles for minimizing VM physical footprint w.r.t. memory, disk, power consumption and network
[59] | 2008 | Cloud management middleware | Resource allocation, placing of virtual machines on physical machines | Paper comprises a cloud management middleware that, based on the observations and predictions of client requests, migrates components across multiple client data centers
[60] | 2008 | Two stage heuristic algorithms | Bin-item incompatibility constraints | Paper presents a two-stage algorithm that considers the bin-item incompatibility constraints that are implicit in server consolidation issues
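The Q-Learning variants that recur in the table (Q-Learning, Double Q-Learning, Speedy Q-Learning) differ mainly in their update rules. The block below gives the standard tabular forms for orientation; it is not taken from the survey. The symbols are introduced here as assumptions: $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the reward, and $s'$ the next state. Double DQN, as in [18] and [27], applies the same decoupling idea to neural networks rather than tables.

```latex
% Q-learning: one table Q
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

% Double Q-learning: two tables Q^A and Q^B, whose roles are swapped at random;
% one table selects the action, the other evaluates it
Q^{A}(s,a) \leftarrow Q^{A}(s,a) + \alpha \left[ r + \gamma\, Q^{B}\!\left(s', \arg\max_{a'} Q^{A}(s',a')\right) - Q^{A}(s,a) \right]

% Speedy Q-learning: keeps the two successive estimates Q_{k-1} and Q_k
Q_{k+1}(s,a) = Q_k(s,a) + \alpha_k \left[ \mathcal{T}_k Q_{k-1}(s,a) - Q_k(s,a) \right]
             + (1-\alpha_k) \left[ \mathcal{T}_k Q_k(s,a) - \mathcal{T}_k Q_{k-1}(s,a) \right],
\qquad \mathcal{T}_k Q(s,a) = r + \gamma \max_{a'} Q(s'_k,a')
```

The Speedy Q-Learning rule makes visible why its space requirement is twice that of standard Q-Learning: both $Q_{k-1}$ and $Q_k$ must be stored.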
Figure 4 shows the area-wise contribution of the main topics related to cloud computing and Reinforcement Learning. These topics include Q-Learning and its variations (namely Q-Learning, Double Q-Learning, and Speedy Q-Learning), Reinforcement Learning, Markov Decision Process, Scheduling, and Virtual Machines. Figure 5 shows the work done, i.e., the number of research papers, in each of these areas.
Fig. 4 Area wise contribution
Fig. 5 Area versus No. of research papers
From these two figures we can say that Reinforcement Learning is a technology that can improve the efficiency and effectiveness of a system. Beyond efficiency and effectiveness, it aims to maximize the performance of a system and to sustain the improvement over a longer period of time. Long-term results can be obtained when Reinforcement Learning is applied to cloud computing issues and problems. It is important to understand that the Reinforcement Learning model is very similar to how human beings learn. Human beings make several mistakes and errors while learning and later improve by learning from experience. Reinforcement Learning works in the absence of a training dataset because it learns from past experience. Reinforcement Learning can outperform human beings in many tasks, and hence the researcher wishes to apply it to the cloud. When this is done, related concepts such as Scheduling, Virtual Machines, and Provisioning also come along. A major advantage of Reinforcement Learning is that it does not require a dataset. Datasets are usually small at the beginning, but they tend to grow after a while, and maintaining them is costly. Since Reinforcement Learning does not require any dataset, the effort and cost of maintaining and labeling datasets are absent. With the implementation of Reinforcement Learning in cloud computing, we can achieve better Scheduling of Resources and better working of Virtual Machines.
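To make the idea of learning from experience without a training dataset concrete, the following is a minimal sketch of tabular Q-Learning on a toy scheduling task. The two-VM environment, its latencies, and all names and hyperparameters (`step`, `SLOW_LATENCY`, `alpha`, `gamma`, `epsilon`) are illustrative assumptions and are not taken from the surveyed papers.

```python
import random

# Hypothetical toy model: the state is the queue length of a "fast" VM.
N_STATES = 3          # queue length of the fast VM: 0, 1 or 2
ACTIONS = (0, 1)      # 0 = send the task to the fast VM, 1 = send it to the slow VM
SLOW_LATENCY = 2      # assumed fixed latency of the slow VM


def step(state, action):
    """Return (next_state, reward); the reward is the negative task latency."""
    if action == 0:
        latency = state + 1                        # wait behind the queued tasks
        next_state = min(state + 1, N_STATES - 1)  # the fast VM's queue grows
    else:
        latency = SLOW_LATENCY
        next_state = max(state - 1, 0)             # the fast VM drains one task
    return next_state, -latency


def q_learning(steps=50_000, alpha=0.1, gamma=0.9, epsilon=0.3, seed=0):
    """Learn a Q-table from interaction only; no labelled dataset is used."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    state = 0
    for _ in range(steps):
        # Epsilon-greedy: explore sometimes, otherwise exploit current estimates.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])
        next_state, reward = step(state, action)
        # Q-Learning update: move the estimate toward reward + discounted best next value.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Pick the highest-valued action in every state."""
    return [max(ACTIONS, key=lambda a: row[a]) for row in q]


q_table = q_learning()
policy = greedy_policy(q_table)
print(policy)
```

Under these assumed dynamics, solving the Bellman equations by hand suggests that the best rule is to use the fast VM only when its queue is empty and the slow VM otherwise; the sketch is meant to show that the agent can arrive at such a rule from rewards alone.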
3 Conclusion
As the cloud proves useful, the number of cloud users increases because they rely more on it for storage and for accessing applications, and as a result the cloud sometimes suffers from major issues. These include problems in Resource Allocation and Management, Resource Provisioning, Virtual Machine Management, Service Level Agreements, Revenue of the Cloud, etc. The researcher wishes to implement Reinforcement Learning in the cloud, where the agent is the cloud itself, which will learn to reduce the mentioned problems on its own by using its past experiences. After analyzing different Reinforcement Learning algorithms, the finding is that Reinforcement Learning is a powerful learning technique, and algorithms such as Q-Learning, Deep Q-Learning, and Speedy Q-Learning can be applied to the cloud to make it more resilient to these problems. The objective of this research paper was to study the different Reinforcement Learning algorithms and conduct a literature survey. This paper also shows how Reinforcement Learning can enhance the learning capability of a system in which it is implemented. With Reinforcement Learning implemented on the cloud, the issues and problems faced are expected to be reduced. The cloud will learn from past experiences and thereby learn how to tackle problems if they arise in the future. Future work will be to test these Reinforcement Learning algorithms on cloud simulators and then implement them in practice on cloud platforms such as Microsoft Azure, Google Cloud, etc.
Acknowledgements I would like to convey my special thanks to my parents, Mr. Vijay Ramdas Lahande and Ms. Jyoti Vijay Lahande, and to my family and friends who helped me with this research paper. I would like to convey my sincere thanks to all of them.
References 1. Mell P, Grance T (2009) The NIST definition of cloud computing 2. Vaquero LM, Rodero-Merino L, Caceres J, Lindner M (2008) A break in the clouds—towards a cloud definition 3. Calheiros RN, Ranjan R, De Rose CAF, Buyya R (2009) CloudSim—a novel framework for modeling and simulation of cloud computing infrastructures and services 4. Vouk MA (2008) Cloud computing—issues, research and implementations. J Comput Inf Technol—CIT 16
5. Balla HAMN, Sheng CG, Weipeng J (2018) Reliability enhancement in cloud computing via optimized job scheduling implementing reinforcement learning algorithm and queuing theory. 1st International conference on data intelligence and security 6. Cheng M, Li J, Nazarian S (2018) DRL-cloud-deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers 7. Aceto G, Botta A, de Donato W, Pescape A (2013) Cloud monitoring-A survey 8. Rittinghouse JW, Ransome JF (2016) Cloud computing-implementation, management, and security 9. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning a survey 10. Sutton RS, Barto AG (1998) Reinforcement learning—an introduction 11. Dong Y, Xu G, Zhang M, Meng X (2021) A high-efficient joint cloud-edge aware strategy for task deployment and load balancing. IEEE Access 12. John I, Bhatnagar S, Sreekantan A (2019) Efficient adaptive resource provisioning for cloud applications using reinforcement learning. IEEE 4th international workshops on foundations and applications of self* systems (FAS*W) 13. Lorido-Botrán T, Miguel-Alonso J (Dec 2014) A review of auto-scaling techniques for elastic applications in cloud environments. Art J Grid Comput 14. Liu X, Tong W, Zhi X, ZhiRen F, WenZhao L (2014) Performance analysis of cloud computing services considering resources sharing among virtual machines 15. Xu C-Z, Rao J, Bu X (2012) URL-A unified reinforcement learning approach for autonomic cloud management 16. Lin L, Zhang Y, Huai J (2007) Sustaining incentive in grid resource allocation—a reinforcement learning approach 17. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks 18. van Hasselt H, Guez A, Silver D (2016) Deep reinforcement learning with double Q-learning. Proceedings of the Thirtieth AAAI conference on artificial intelligence (AAAI-16) 19. 
Gao Y, Wang Y, Gupta SK, Pedram M (2013) An energy and deadline aware resource provisioning, scheduling and optimization framework for cloud systems 20. Zhang Q, Zhani MF, Boutaba R, Hellerstein JL (2013) Harmony-dynamic heterogeneity-aware resource provisioning in the cloud 21. Elprince N (2013) Autonomous resource provision in virtual data centers 22. Manvi SS, Shyam GK (2014) Resource management for infrastructure as a service (IaaS) in cloud computing-A survey. J Network Comput Appl 23. Mohammadi E, Karimi M, Heikalabad SR (2011) A novel virtual machine placement in cloud computing. Aust J Basic Appl Sci 5(10):1549–1555. ISSN 1991-8178 24. Zhang Z, Li C, Peng SL, Pei X (2021) A new task offloading algorithm in edge computing. EURASIP J Wirel Commun Network 25. Kumar VP, Prakash KB (2021) Adaptive resource management utilizing reinforcement learning technique in inter-cloud environments. IOP Conf Ser: Mater Sci Eng 26. He M, Lu D, Tian J, Zhang G (2021) Collaborative reinforcement learning based route planning for cloud content delivery networks. IEEE Access 27. Li D, Xu S, Li P (2021) Deep reinforcement learning-empowered resource allocation for mobile edge computing in cellular V2X networks. Sens Commun Syst Enabling Auton Veh 28. Che H, Bai Z, Zuo R, Li H (2020) A deep reinforcement learning approach to the optimization of data center task scheduling. Hindawi 29. Chen R, Yang B, Li S, Wang S (2020) A self-learning genetic algorithm based on reinforcement learning for flexible job-shop scheduling problem. Comput Ind Eng 30. Schuler L, Jamil S, Kuhl N (2020) AI-based resource allocation: reinforcement learning for adaptive auto-scaling in serverless environments. Cornell University 31. Sharkh MA, Xu Y, Leyder E (2020) CloudMach: cloud computing application performance improvement through machine learning. IEEE Canadian conference on electrical and computer engineering 32. 
Bitsakos C, Konstantinou I, Koziris N (2018) DERP-a deep reinforcement learning cloud system for elastic resource provisioning. IEEE International conference on cloud computing technology and science (CloudCom)
33. Lolos K, Konstantinou I, Kantere V, Koziris N (2017) Elastic management of cloud applications using adaptive reinforcement learning. IEEE international conference on big data (BIGDATA) 34. Li J, Wang Y, Lin X, Nazarian S, Pedram M (2017) Negotiation-based resource provisioning and task scheduling algorithm for cloud systems 35. Li H, Li J, Yao W, Nazarian S, Lin X, Wang Y (2017) Fast and energy-aware resource provisioning and task scheduling for cloud systems 36. Liu N, Li Z, Xu J, Xu Z, Lin S, Qiu Q, Tang J, Wang Y (2017) A hierarchical framework of cloud resource allocation and power management using deep reinforcement learning 37. Hussin M, Hamid NAWA, Kasmiran KA (2014) Improving reliability in resource management through adaptive reinforcement learning for distributed systems. J Parallel Distrib Comput 38. Loff J, Garcia J (2014) Vadara: predictive elasticity for cloud applications. IEEE 6th international conference on cloud computing technology and science 39. Weingärtner R, Bräscher GB, Westphall CB (2014) Cloud resource management-A survey on forecasting and profiling models. J Netw Comput Appl 47(2015):99–106 40. Akula GS, Potluri A (2014) Heuristics for migration with consolidation of ensembles of virtual machines 41. Torquato M, Romero P, Maciel M, Araujo J, Matos R (2013) Availability study on cloud computing environments-live migration as a rejuvenation mechanism 42. Jhawar R, Piuri V, Santambrogio M (2013) Fault tolerance management in cloud computing-a system-level perspective. IEEE 43. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning 44. Konstantinou I, Boumpouka C, Sioutas S (2013) Automated, elastic resource provisioning for NoSQL clusters using Tiramola. Conference paper 45. Bankole AA (2013) Cloud client prediction models for cloud resource provisioning in a multitier web application environment 46. 
Madhukumar SD, Anju KS, Joseph AM, John I, Subha N, Thomas RA (2012) Virtual machine placement based on VM-PM compatibility 47. Azar MG, Munos R, Ghavamzadeh M, Kappen HJ (2011) Speedy Q-learning 48. Wu J, Xu X, Zhang P, Liu C (2011) A novel multi-agent reinforcement learning approach for job scheduling in grid computing 49. Hussin M, Lee Y-C, Zomaya A (Dec 2011) Priority-based scheduling for large-scale distribute systems with energy awareness. Conference paper 50. Hussin M, Lee YC, Zomaya AY (2011) Efficient energy management using adaptive reinforcement learning-based scheduling in large-scale distributed systems. International conference on parallel processing 51. Do AV, Chen J, Wang C, Lee YC, Zomaya AY, Zhou BB (2014) Profiling applications for virtual machine placement in clouds. IEEE 4th international conference on cloud computing 52. Mazzucco M, Dyachuk D, Deters R (2010) Maximizing cloud providers revenues via energy aware allocation policies 53. Bellur U, Madhu Kumar SD (2010) Optimal placement algorithms for virtual machines 54. Piao J, Yan J (2010) A network-aware virtual machine placement and migration approach in cloud computing 55. Rizvandi NB, Taheri J, Zomaya AY, Moraveji R (2013) Automatic tunning of mapreduce jobs using uncertain pattern matching analysis 56. Bu X, Rao J, Xu C-Z (2009) A reinforcement learning approach to online web system autoconfiguration 57. Rao J, Bu X, Xu C-Z, Wang L, Yin G (2009) VCONF-a reinforcement learning approach to virtual machine auto-configuration 58. Sonnek J, Chandra A (2009) Virtual putty-reshaping the physical footprint of virtual machines 59. Malet B, Pietzuch P (2008) Resource allocation across multiple cloud data centres 60. Gupta R, Bose SK, Sundarrajan S, Chebiyam M, Chakrabarti A (2008) A two stage heuristic algorithm for solving the server consolidation problem with item-item and bin-item incompatibility constraints. IEEE international conference on services computing
Machine Learning Based Consumer Credit Risk Prediction
G. S. Samanvitha, K. Aditya Shastry, N. Vybhavi, N. Nidhi, and R. Namratha
Abstract In the financial domain, estimating credit risk is of great significance in management activities such as lending money to those who need it. Credit risk (CR) refers to the possibility of a loss incurred because of a borrower's failure to repay a loan on time. Credit risk management is the practice of minimizing losses through prior knowledge of the bank's capital and loan loss reserves at any given time. This has long been an issue faced by many financial institutions. CR assessment plays a key part in the decision process. It is a classification problem in which the main aim is to correctly classify customers into one of a finite number of classes. Appraising CR datasets to decide whether to issue a loan or reject the customer's application is an arduous task that requires deep analysis of the customer credit dataset or the data provided by the customer. Numerous CR analysis techniques are used to assess CR on consumer datasets. In this work, we examine different approaches to CR analysis using Machine Learning (ML) techniques, namely k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), and Multiple Linear Regression (MLR). Results demonstrated that the MLR technique gave better CR prediction than k-NN and SVM on the given dataset.
Keywords Credit risk analysis · Machine learning · Regression
G. S. Samanvitha · K. A. Shastry (B) · N. Vybhavi · N. Nidhi · R. Namratha Nitte Meenakshi Institute of Technology, Bengaluru, India e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 S. Aurelia et al. (eds.), Sustainable Advanced Computing, Lecture Notes in Electrical Engineering 840, https://doi.org/10.1007/978-981-16-9012-9_10
1 Introduction
CR denotes the probability of loss that results from a borrower's failure to repay a loan or fulfill contractual commitments. Essentially, it is the risk a lender faces of not receiving the amount lent, which disrupts cash flows and increases collection costs. Additional cash flows can be set aside to provide extra cover for CR. Even though estimating who will default is difficult, effective assessment and management of CR can significantly reduce the severity of losses. Interest payments by the borrower are the creditor's reward for assuming CR. When lenders provide loans such as credit card loans or housing loans, there is a considerable risk that the borrower may not repay the debt. Likewise, when an organization extends credit to a customer, there is a substantial risk that the client may not pay their invoices. CR can also arise when the issuer of a bond/agreement significantly delays a payment, or when an insurance company does not pay a valid claim. CR is computed based on a client's ability to repay the loan under the original terms. To assess the CR of a client loan, lenders focus on the 5 Cs: the history of the borrower, repayment capacity, capital, the conditions of the loan, and the related security. Banks constantly incur losses due to the non-repayment of debts. Hence, each bank needs its own CR assessment system. However, certain banks have not been able to develop software that accurately forecasts client default, leading to heavy losses. Loan defaults still occur, typically in commercial lending [1–4]. According to a survey report of the Federal Reserve, oil and gas companies incurred losses of around $40 billion in 2016. The bond default rate reached a high of around 19% during 2016. Failures to repay loans occurred in every industry during that year. Therefore, accurate prediction of commercial CR is a significant area of research that helps stabilize the economic environment. Software that distinguishes good customers from bad ones is a major factor for any bank when granting loans. In such a case, the accuracy of the software in predicting CR becomes significant, as the stability of the bank depends on it.
A bank that takes no risks is as vulnerable as one that takes too many risks [2–5]. In view of these issues, we have developed a CR model using regression that predicts the CR accurately for the given dataset. Given the substantial number of companies/startups that have emerged in microcredit and peer-to-peer lending, we have developed a CR system that estimates a client's risk of defaulting on a loan. The work aims to forecast whether a client/customer will be able to repay the loan, based on the input attributes of the bank/loan dataset. The main objectives are:
• To determine the borrower's capacity to repay the loan.
• To ensure that the lender lends to borrowers who have the capacity to repay, so that the lender does not suffer an enormous loss.
The rest of this paper is organized as follows. Section 2 discusses the related work. Section 3 presents the proposed work. Section 4 provides the experimental results. The paper ends with the conclusion and future work.
2 Related Work
This section surveys some recent works in the domain of CR analysis. The work in [6] analyzes the performance of three ML methods, viz. random forest, boosting, and neural networks, against the RiskCalc software's benchmark GAM model. The ML approaches delivered accuracy comparable to that of the GAM model. These alternative techniques are well equipped to capture non-linear relationships. The main goal in [7] is to forecast whether a customer will experience a serious delinquency of ninety days or more in the subsequent 2 years. The authors used the Kaggle dataset "Give me some credit" for experimentation. The data comprised more than one lakh (100,000) customers, each described by 10 features. The authors found that not all 10 features were important for CR prediction. They observed that classification trees were more suitable for this type of problem, since trees successively evaluate decision criteria based on feature subsets. A model was implemented by hybridizing the gradient boosting technique (GBT model) and trees. The authors observed that the GBT model had higher prediction accuracy. In [8], the authors discuss various techniques for CR analysis. They used Bayesian classifier, decision tree, k-NN, K-Means, and SVM algorithms to evaluate credit datasets for better and more reliable CR analysis. Two credit datasets, from Germany and Australia, were used to measure the performance of the ML algorithms and for ensemble learning. The datasets have two classes, "good" and "bad", denoting creditors whose loans are approved and not approved, respectively. The authors compared the performance of different types of classifiers and inferred from the comparison table that the ELM classifier gave better accuracy than the other classifiers.
3 Proposed Work
This section describes the proposed work of predicting the consumer credit risk using Linear Regression. The proposed technique consists of the following steps:
• Dataset collection
• Feature selection
• Exploratory Data Analysis
• Model Training
• Accuracy Testing
• Analysis of results
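The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic (a single hypothetical feature, such as a debt-to-income ratio), a one-feature least-squares fit stands in for Multiple Linear Regression, SVM is omitted, and the helper names (`impute_mean`, `split`, `fit_linear`, `predict_knn`) are invented for this sketch. The label convention (0 for fully paid, 1 for charged off) and the mean imputation of missing values follow the description in this section.

```python
import random
from statistics import mean


def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def split(xs, ys, test_ratio=0.25, seed=0):
    """Shuffle the indices and cut them into a training part and a testing part."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    tr, te = idx[:cut], idx[cut:]
    return [xs[i] for i in tr], [ys[i] for i in tr], [xs[i] for i in te], [ys[i] for i in te]


def fit_linear(xs, ys):
    """Ordinary least squares for one feature: y ~ intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope


def predict_linear(model, x, threshold=0.5):
    """Threshold the regression output to obtain a 0/1 loan status."""
    intercept, slope = model
    return 1 if intercept + slope * x >= threshold else 0


def predict_knn(xs, ys, x, k=3):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
    return 1 if 2 * sum(label for _, label in nearest) > k else 0


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Mean imputation on a tiny made-up table with missing entries.
raw = [[1.0, None], [3.0, 4.0], [None, 8.0]]
clean = impute_mean(raw)

# Synthetic, well-separated data: 0 = fully paid, 1 = charged off.
rng = random.Random(1)
xs = [rng.uniform(0.1, 0.3) for _ in range(20)] + [rng.uniform(0.7, 0.9) for _ in range(20)]
ys = [0] * 20 + [1] * 20

x_tr, y_tr, x_te, y_te = split(xs, ys)
model = fit_linear(x_tr, y_tr)
lr_acc = accuracy(y_te, [predict_linear(model, x) for x in x_te])
knn_acc = accuracy(y_te, [predict_knn(x_tr, y_tr, x) for x in x_te])
print(f"linear regression: {lr_acc:.2f}, k-NN: {knn_acc:.2f}")
```

Because the synthetic classes are deliberately well separated, both models are expected to score highly here; on a real loan dataset the accuracies would differ, and that comparison is what the procedure is meant to measure.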
The whole procedure can be separated into two parts, training and testing, and the dataset is divided accordingly. In the first part, training, we use the training data to train the chosen algorithm. The results are then passed to the model. In the second part, the model is tested: using the results generated previously, the model predicts the value of the target variable for the test data, and the accuracy of the predicted target variable is analyzed. The same procedure is repeated for all algorithms under consideration, namely the k-nearest neighbors algorithm, the Support Vector Machine algorithm, and the linear regression algorithm. The dataset for the CR model is collected from the Kaggle repository [9]. The loan status attribute is the target variable whose value is to be predicted; it takes the value 0 for loans that are fully paid and 1 for loans that will be charged off. The data is divided into training and testing sets, and the model is trained on the training set using the algorithm. Figure 1 depicts the CR analysis model. The model's accuracy is evaluated, and the attribute weights are adjusted to better fit the training data. The test data is then passed to the model to generate predictions. These predictions are analyzed to find the accuracy of the model. This process is repeated with different machine learning algorithms, and the resulting accuracies are compared to determine the best algorithm for calculating consumer credit risk. Figure 2 depicts the flowchart of the proposed CR analysis model. It starts with collecting the dataset and cleaning it. Mean imputation was used to handle missing values: each missing value was replaced with the mean (average) of the corresponding column. The cleaned dataset is then subjected to feature selection in consultation with a domain expert. Linear
Fig. 1 Proposed credit risk analysis model (blocks: Loan Dataset, Apply Linear Regression, Predicted Values, Accuracy of ML Model, Performance Evaluation)
Fig. 2 Flowchart of the proposed credit risk analysis model (blocks: Start, Loan Dataset, Feature Selection, Train using Linear Regression, Test using Linear Regression, Prediction Analysis, Results, Stop)
Table 1 Employment length key values
Emp_length
Key