Lecture Notes in Electrical Engineering Volume 1053
Series Editors

Leopoldo Angrisani, Department of Electrical and Information Technologies Engineering, University of Napoli Federico II, Napoli, Italy
Marco Arteaga, Departament de Control y Robótica, Universidad Nacional Autónoma de México, Coyoacán, Mexico
Samarjit Chakraborty, Fakultät für Elektrotechnik und Informationstechnik, TU München, München, Germany
Jiming Chen, Zhejiang University, Hangzhou, Zhejiang, China
Shanben Chen, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Tan Kay Chen, Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore
Rüdiger Dillmann, University of Karlsruhe (TH) IAIM, Karlsruhe, Baden-Württemberg, Germany
Haibin Duan, Beijing University of Aeronautics and Astronautics, Beijing, China
Gianluigi Ferrari, Dipartimento di Ingegneria dell'Informazione, Sede Scientifica Università degli Studi di Parma, Parma, Italy
Manuel Ferre, Centre for Automation and Robotics CAR (UPM-CSIC), Universidad Politécnica de Madrid, Madrid, Spain
Faryar Jabbari, Department of Mechanical and Aerospace Engineering, University of California, Irvine, CA, USA
Limin Jia, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China
Janusz Kacprzyk, Intelligent Systems Laboratory, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Alaa Khamis, Department of Mechatronics Engineering, German University in Egypt El Tagamoa El Khames, New Cairo City, Egypt
Torsten Kroeger, Intrinsic Innovation, Mountain View, CA, USA
Yong Li, College of Electrical and Information Engineering, Hunan University, Changsha, Hunan, China
Qilian Liang, Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA
Ferran Martín, Departament d'Enginyeria Electrònica, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
Tan Cher Ming, College of Engineering, Nanyang Technological University, Singapore, Singapore
Wolfgang Minker, Institute of Information Technology, University of Ulm, Ulm, Germany
Pradeep Misra, Department of Electrical Engineering, Wright State University, Dayton, OH, USA
Subhas Mukhopadhyay, School of Engineering, Macquarie University, NSW, Australia
Cun-Zheng Ning, Department of Electrical Engineering, Arizona State University, Tempe, AZ, USA
Toyoaki Nishida, Department of Intelligence Science and Technology, Kyoto University, Kyoto, Japan
Luca Oneto, Department of Informatics, Bioengineering, Robotics and Systems Engineering, University of Genova, Genova, Italy
Bijaya Ketan Panigrahi, Department of Electrical Engineering, Indian Institute of Technology Delhi, New Delhi, Delhi, India
Federica Pascucci, Dipartimento di Ingegneria, Università degli Studi Roma Tre, Roma, Italy
Yong Qin, State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing, China
Gan Woon Seng, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Joachim Speidel, Institute of Telecommunications, University of Stuttgart, Stuttgart, Germany
Germano Veiga, FEUP Campus, INESC Porto, Porto, Portugal
Haitao Wu, Academy of Opto-electronics, Chinese Academy of Sciences, Haidian District Beijing, China
Walter Zamboni, Department of Computer Engineering, Electrical Engineering and Applied Mathematics, DIEM—Università degli studi di Salerno, Fisciano, Salerno, Italy
Junjie James Zhang, Charlotte, NC, USA
Kay Chen Tan, Department of Computing, Hong Kong Polytechnic University, Kowloon Tong, Hong Kong
The book series Lecture Notes in Electrical Engineering (LNEE) publishes the latest developments in Electrical Engineering—quickly, informally and in high quality. While original research reported in proceedings and monographs has traditionally formed the core of LNEE, we also encourage authors to submit books devoted to supporting student education and professional training in the various fields and applications areas of electrical engineering. The series covers classical and emerging topics concerning:

• Communication Engineering, Information Theory and Networks
• Electronics Engineering and Microelectronics
• Signal, Image and Speech Processing
• Wireless and Mobile Communication
• Circuits and Systems
• Energy Systems, Power Electronics and Electrical Machines
• Electro-optical Engineering
• Instrumentation Engineering
• Avionics Engineering
• Control Systems
• Internet-of-Things and Cybersecurity
• Biomedical Devices, MEMS and NEMS
For general information about this book series, comments or suggestions, please contact [email protected]. To submit a proposal or request further information, please contact the Publishing Editor in your country:

China: Jasmine Dou, Editor ([email protected])
India, Japan, Rest of Asia: Swati Meherishi, Editorial Director ([email protected])
Southeast Asia, Australia, New Zealand: Ramesh Nath Premnath, Editor ([email protected])
USA, Canada: Michael Luby, Senior Editor ([email protected])
All other Countries: Leontina Di Cecco, Senior Editor ([email protected])

** This series is indexed by EI Compendex and Scopus databases. **
Malaya Dutta Borah · Dolendro Singh Laiphrakpam · Nitin Auluck · Valentina Emilia Balas Editors
Big Data, Machine Learning, and Applications Proceedings of the 2nd International Conference, BigDML 2021
Editors Malaya Dutta Borah Department of Computer Science and Engineering National Institute of Technology Silchar Silchar, Assam, India
Dolendro Singh Laiphrakpam Department of Computer Science and Engineering National Institute of Technology Silchar Silchar, Assam, India
Nitin Auluck Department of Computer Science and Engineering Indian Institute of Technology Ropar Rupnagar, Punjab, India
Valentina Emilia Balas Department of Automation and Applied Informatics Aurel Vlaicu University of Arad Arad, Romania
ISSN 1876-1100 ISSN 1876-1119 (electronic) Lecture Notes in Electrical Engineering ISBN 978-981-99-3480-5 ISBN 978-981-99-3481-2 (eBook) https://doi.org/10.1007/978-981-99-3481-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Contents
Android Application-Based Security Surveillance Implementing Machine Learning . . . 1
Henakshi Das and Preetisudha Meher

Realtime Object Distance Measurement Using Stereo Vision Image Processing . . . 9
B. N. Arunakumari, R. Shashidhar, H. S. Naziya Farheen, and M. Roopa

An Insight on Drone Applications in Surveillance Domain . . . 21
M. Swami Das, Gunupudi Rajesh Kumar, and R. P. Ram Kumar

Handwritten Mixed Numerals Classification System . . . 33
Krishn Limbachiya and Ankit Sharma

IoT Based Smart Farm Monitoring System . . . 45
Ankuran Das, Hridaydeep Bora, Jugasmita Kashyap, Chinmoy Bordoloi, and Smriti Priya Medhi

An Extensive Review of the Supervised Learning Algorithms for Spiking Neural Networks . . . 63
Irshed Hussain and Dalton Meitei Thounaojam

Multitask Learning-Based Simultaneous Facial Gender and Age Recognition with a Weighted Loss Function . . . 81
Abhilasha Nanda and Hyun-Seung Yang

Visualizing Crime Hotspots by Analysing Online Newspaper Articles . . . 89
M. Trupthi, Prerana Rajole, and Neha Dinesh Prabhu

Applications of Machine Learning for Face Mask Detection During COVID-19 Pandemic . . . 101
Sarfraz Fayaz Khan, Mohammad Ahmar Khan, and Rabiah Al-Quadah
A Cascaded Deep Learning Approach for Detection and Localization of Crop-Weeds in RGB Images . . . 121
Rohit Agrawal and Jyoti Singh Kirar

Ensemble of Deep Learning Enabled Tamil Handwritten Character Recognition Model . . . 135
R. Thanga Selvi

A Comparative Study of Loss Functions for Deep Neural Networks in Time Series Analysis . . . 147
Rashi Jaiswal and Brijendra Singh

Learning Algorithm for Threshold Softmax Layer to Handle Unknown Class Problem . . . 165
Gaurav Jaiswal

Traffic Monitoring and Violation Detection Using Deep Learning . . . 175
Omkar Sargar, Saharsh Jain, Sravan Chittupalli, and Aniket Tatipamula

Conjugate Gradient Method for finding Optimal Parameters in Linear Regression . . . 187
Vishal Menon, V. Ashwin, and G. Gopakumar

Rugby Ball Detection, Tracking and Future Trajectory Prediction Algorithm . . . 199
Pranesh Nangare and Anagha Dangle

Early Detection of Heart Disease Using Feature Selection and Classification Techniques . . . 209
R. S. Renju and P. S. Deepthi

Gun Detection System for Surveillance Cameras Using HOG-Assisted KNN Classifier . . . 221
Lucy Sumi and Shouvik Dey

Optimized Detection, Classification, and Tracking with YOLOV5, HSV Color Thresholding, and KCF Tracking . . . 235
Aditya Yadav, Srushti Patil, Anagha Dangle, and Pranesh Nangare

COVID-19 Detection Using Chest X-ray Images . . . 247
Gautham Santhosh, S. Adarsh, and Lekha S. Nair

Comparative Analysis of LDA Algorithm for Low Resource Indian Languages with Its Translated English Documents . . . 257
D. K. Meghana, K. Kiran, Saleha Nida, T. B. Shilpa, P. Deepa Shenoy, and K. R. Venugopal

Text Style Transfer: A Comprehensive Study on Methodologies and Evaluation . . . 269
Nirali Parekh, Siddharth Trivedi, and Kriti Srivastava
Classification of Hindustani Musical Ragas Using One-Dimensional Convolutional Neural Networks . . . 285
Rutuparn Pawar, Shubham Gujar, Anagha Bidkar, and Yogesh Dandawate

W-Tree: A Concept Correlation Tree for Data Analysis and Annotations . . . 299
Prakash Hegade, Kishor Rao, Utkarsh Koppikar, Maltesh Kulkarni, and Jinesh Nagda

Crawl Smart: A Domain-Specific Crawler . . . 313
Prakash Hegade, Ruturaj Chitragar, Raghavendra Kulkarni, Praveen Naik, and A. S. Sanath

Evaluating the Effect of Leading Indicators in Customer Churn Prediction . . . 327
Sharath Kumar, Nestor Mariyasagayam, and Yuichi Nonaka

Classification of Skin Lesion Using Image Processing and ResNet50 . . . 341
Adarsh Pradhan, Subhojit Saha, Abhinay Das, and Santanu Barman

Data Collection and Pre-processing for Machine Learning-Based Student Dropout Prediction . . . 355
Sheikh Wakie Masood and Shahin Ara Begum

Nested Named-Entity Recognition in Multilingual Code-Switched NLP . . . 369
Ashwin Patil and Utkarsh Kolhe

Deep Learning-Based Semantic Segmentation of Blood Cells from Microscopic Images . . . 381
S. B. Asha and G. Gopakumar

A Partitioned Task Offloading Approach for Privacy Preservation at Edge . . . 395
R. Ramprasad, S. Pradhiksha, K. Sundarakantham, Rajashree R. Harine, and Shalinie S. Mercy

Artificial Intelligence in Radiological COVID-19 Detection: A State-of-the-Art Review . . . 403
Abhishek Kumar, Pinki Roy, Arnab Kumar Mishra, and Sujit Kumar Das

Anomaly Detection in SCADA Industrial Control Systems Using Bi-Directional Long Short-Term Memory . . . 415
M. Nakkeeran and V. Anantha Narayanan

Implementing Autonomous Navigation on an Omni Wheeled Robot Using 2D LiDAR, Tracking Camera and ROS . . . 437
Atharva Bhorpe, Pratik Padalkar, and Pawan Kadam
Analysis of Deep Learning Models for Text Summarization of User Manuals . . . 451
Mihir Kayastha, Megh Khaire, Malhar Gate, Param Joshi, and Sheetal Sonawane

Modelling Seismic Performance of Reinforced Concrete Buildings Within Response Spectrum Framework . . . 465
Praveena Rao and Hemaraju Pollayi

A Survey on DDoS Detection Using Deep Learning in Software Defined Networking . . . 479
M. Franckie Singha and Ripon Patgiri

Segmentation of Dentin and Enamel from Panoramic Dental Radiographic Image (OPG) to Detect Tooth Wear . . . 495
Priyanka Jaiswal and Sunil Bhirud

Revisiting Facial Key Point Detection—An Efficient Approach Using Deep Neural Networks . . . 511
Prathima Dileep, Bharath Kumar Bolla, and E. Sabeesh

A Hybrid Framework Using Natural Language Processing and Collaborative Filtering for Performance Efficient Feedback Mining and Recommendation . . . 527
Kathakali Mitra and P. D. Parthasarathy

Facial Recognition-Based Automatic Attendance Management System Using Deep Learning . . . 545
Saranga Pani Nath, Manditjyoti Borah, Debojit Das, Nilam Kumar Kalita, Zakir Hussain, and Malaya Dutta Borah

Application of Infrared Thermography in Assessment of Diabetic Foot Anomalies: A Treatise . . . 555
N. Christy Evangeline and S. Srinivasan

A Survey and Classification on Recommendation Systems . . . 569
Manika Sharma, Raman Mittal, Ambuj Bharati, Deepika Saxena, and Ashutosh Kumar Singh

Analysis of Synthetic Data Generation Techniques in Diabetes Prediction . . . 587
Sujit Kumar Das, Pinki Roy, and Arnab Kumar Mishra

Beyond Information Exchange: An Approach to Deploy Network Properties for Information Diffusion . . . 601
Soumita Das, Anupam Biswas, and Ravi Kishore Devarapalli

Sentiment Analysis on Worldwide COVID-19 Outbreak . . . 615
Rakshatha Vasudev, Prathamesh Dahikar, Anshul Jain, and Nagamma Patil
Post-Vaccination Risk Prediction of COVID-19: Machine Learning Approach . . . 627
Anjali Agarwal, Roshni Rupali Das, and Ajanta Das

Offensive Language Detection in Under-Resourced Algerian Dialectal Arabic Language . . . 639
Oussama Boucherit and Kheireddine Abainia

A Comparative Analysis of Modern Machine Learning Approaches for Automatic Classification of Scientific Articles . . . 649
Kongkan Bora, Nihar Jyoti Baishya, Chinmoy Jyoti Talukdar, Deepali Jain, and Malaya Dutta Borah

A Review of Machine Learning Algorithms on Different Breast Cancer Datasets . . . 659
E. Jenifer Sweetlin and S. Saudia

The Online Behaviour of the Algerian Abusers in Social Media Networks . . . 675
Kheireddine Abainia

Interactive Attention AI to Translate Low-Light Photos to Captions for Night Scene Understanding in Women Safety . . . 689
A. Rajagopal, V. Nirmala, and Arun Muthuraj Vedamanickam

AI Visualization in Nanoscale Microscopy . . . 707
A. Rajagopal, V. Nirmala, J. Andrew, and Arun Muthuraj Vedamanickam

Convolutional Gated MLP: Combining Convolutions and gMLP . . . 721
A. Rajagopal and V. Nirmala

Unique Covariate Identity (UCI) Detection for Emotion Recognition Through EEG Signals . . . 737
V. S. Bakkialakshmi and T. Sudalaimuthu

A Simple and Effective Method for Segmenting Lung Regions from CT Scan Images Using K-Means . . . 751
Yumnam Kirani Singh

Risk-Based Portfolio Optimization on Some Selected Sectors of the Indian Stock Market . . . 765
Jaydip Sen and Abhishek Dutta
About the Editors
Dr. Malaya Dutta Borah is an Assistant Professor in the Department of Computer Science and Engineering, National Institute of Technology Silchar, Assam, India. Her research areas include data mining, blockchain technology, cloud computing, e-Governance, and machine learning. She has authored several book chapters and over 44 papers in reputed journals and conferences, and has been awarded two Australian patents. She has edited several books published by reputed international publishers.
Dr. Dolendro Singh Laiphrakpam is an Assistant Professor in the Department of Computer Science & Engineering, National Institute of Technology Silchar, Assam, India. He received his Master's degree from the National Institute of Technology Agartala, India, and his Ph.D. from the National Institute of Technology Manipur, India. His research interests are cryptography, cryptanalysis, watermarking, and steganography. He has published several research papers in reputed journals. He is a life member of the Cryptography Research Society of India and has acted as chair of various international conferences.
Dr. Nitin Auluck is an Associate Professor in the Computer Science and Engineering Department of the Indian Institute of Technology Ropar. He obtained his B.Tech. in Electrical and Electronics Engineering from Poojya Dodappa Appa College of Engineering, Gulbarga, in 1998 and a Ph.D. in Computer Science and Engineering from the University of Cincinnati, USA, in 2005. His research interest is in fog/edge computing. He has published several research papers in top-tier journals and conferences. Currently, he is serving as an Editor for the journal Concurrency and Computation: Practice and Experience (CCPE), published by Wiley.

Prof. Valentina Emilia Balas is currently a Full Professor in the Department of Automatics and Applied Software at the Faculty of Engineering, "Aurel Vlaicu" University of Arad, Romania. She holds a Ph.D. Cum Laude in Applied Electronics and Telecommunications from the Polytechnic University of Timisoara. Dr. Balas is the author of more than 350 research papers in refereed journals and international conferences. Her research interests are in intelligent systems, fuzzy control, soft computing, smart sensors, information fusion, modeling, and simulation. Dr. Balas is the Director of the Intelligent Systems Research Centre at the Aurel Vlaicu University of Arad and the Director of the Department of International Relations, Programs, and Projects at the same university. She was past Vice-president (responsible for Awards) of the IFSA International Fuzzy Systems Association Council (2013–2015), is a Joint Secretary of the Governing Council of the Forum for Interdisciplinary Mathematics (FIM), a multidisciplinary academic body in India, and is a recipient of the "Tudor Tanasescu" Prize from the Romanian Academy for contributions in the field of soft computing methods (2019).
Android Application-Based Security Surveillance Implementing Machine Learning Henakshi Das and Preetisudha Meher
Abstract Home automation systems are usually built to control appliances even when the owner is away from home. Our system is designed to be cost-effective and powerful, using the Internet of Things, and can be controlled through our Android-based mobile application. A camera inside the home automatically identifies intruders as part of an intelligent access control system. Once an intruder is detected by the infrared sensor, the system compares the intruder's face with the existing data set; if a match occurs, the system accepts the face, otherwise an alert notification is sent to the owner's Android device. The owner can then switch the appliances (lights and fans) on or off using the mobile application [4]. This paper combines machine learning and image processing, which are powerful modern technologies, and deals with the integration of all three components: the Android application, machine learning, and home automation.

Keywords Home automation · Infrared sensor · Raspberry Pi 4 · Android application · Computer vision
H. Das · P. Meher (B)
National Institute of Technology, Arunachal Pradesh 791119, India
e-mail: [email protected]
H. Das
URL: https://www.nitap.ac.in

1 Introduction

In today's world, there is a great demand for automated systems. This system provides facilities to control appliances remotely from an Android device. A smart home can provide privacy and security. Computer vision technology is used for facial recognition, content organization, and much more. The problem of smart home automation arose a few decades ago, when scientists and engineers around the world developed solutions such as automatic light switches and voice control devices.
In order to control intelligent detection switches, results are currently obtained using an array of IR sensors [1]. Today, a homeowner can monitor and control home appliances using a smartphone, and many companies offer new systems as apps that make the use and control of smart homes easy [2]. The user can access the system from anywhere, provided the microcontroller has an internet connection. The Raspberry Pi 4 board is the latest Pi board; it is small in size and also acts as a server. The idea of automating each machine in the home was developed many years ago, starting with connecting two power lines to a battery and closing the circuit through a load such as a lamp. It was later developed further by different organizations, each creating its own automation systems with different devices such as sensors, controllers, actuators, buses, and interfaces. At present, most automation systems use a combination of hardwired and wireless systems for controlling appliances [3]. In this paper, we have integrated computer vision technology for face detection with an Android application in a smart home system [7]. Our system can also control the speed of the fan and the brightness of the light.
2 Proposed System

Home automation plays an important role in our protection and convenience. The main benefit of the system is that it can be controlled from an Android device anywhere in the world. The proposed system provides the facility to control home appliances such as fans and lights [9]. The Raspberry Pi 4 board is a basic, simple all-in-one chipboard that forms the main base of the system. The motor driver L298N module acts as an interface between the motors and the circuit. When an intruder enters the home, the infrared sensor, which detects infrared radiation, gets activated and sends a signal to the Raspberry Pi 4 processing unit. The camera then takes a snapshot of the intruder, and the system compares the intruder's face with the existing data set. If a match occurs, the system accepts the face; otherwise, it sends an alert notification to the owner's Android device. An Android application was built with the Android Studio platform, a unified environment for building apps for Android devices, using the Kotlin and Java programming languages. The Android application is used to control the home appliances. The system architecture and circuit diagram are shown in Figs. 1 and 2, respectively.
Fig. 1 System architecture

Fig. 2 Circuit diagram of the system

3 Components Required

3.1 Hardware Requirements

(1) Raspberry Pi 4
(2) IR sensor
(3) L298N motor driver
(4) DC motor
(5) LED
(6) Battery
(7) Android device
3.2 Software Requirements

(1) Android Studio
(2) Python 3.6.5
(3) My home automation app
4 Hardware Implementation

Here, the Raspberry Pi 4 acts as the controlling unit of the system. The Pi collects data from the infrared sensor, which detects infrared radiation whenever an intruder comes within its range. The motor driver L298N is used to control DC motors and stepper motors; it controls both the speed and the rotation direction of the DC motor, using the H-bridge technique for rotation. Apart from this, two 5V DC motors (toy fans) and an LED are used as home appliances (Fig. 3).
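To make the H-bridge control concrete, the sketch below shows one way to drive an L298N channel from the Pi with the RPi.GPIO library. The pin numbers and duty-cycle values are illustrative assumptions, not values given in the paper.

```python
# Minimal sketch of L298N motor control from a Raspberry Pi using RPi.GPIO.
# Pin assignments (BCM numbering) are assumed: IN1/IN2 wire to the L298N
# direction inputs, ENA to its enable pin.
import time
import RPi.GPIO as GPIO

IN1, IN2, ENA = 23, 24, 18          # assumed BCM pins

GPIO.setmode(GPIO.BCM)
GPIO.setup([IN1, IN2, ENA], GPIO.OUT)

pwm = GPIO.PWM(ENA, 1000)           # 1 kHz PWM on the enable pin sets speed
pwm.start(0)

def run_motor(forward=True, duty=60):
    """Set rotation direction via the H-bridge inputs and speed via PWM duty."""
    GPIO.output(IN1, GPIO.HIGH if forward else GPIO.LOW)
    GPIO.output(IN2, GPIO.LOW if forward else GPIO.HIGH)
    pwm.ChangeDutyCycle(duty)

try:
    run_motor(forward=True, duty=60)    # fan forward at ~60% speed
    time.sleep(5)
    run_motor(forward=False, duty=30)   # reverse at ~30% speed
    time.sleep(5)
finally:
    pwm.stop()
    GPIO.cleanup()
```

Swapping the logic levels on IN1/IN2 reverses the H-bridge current path, which is what flips the rotation direction, while the PWM duty cycle on the enable pin scales the effective motor voltage.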
5 Working of the System

Phase 1: Home automation implementing the Internet of Things. The Raspberry Pi 4 processing unit is initialized [10], followed by the infrared sensor and the other components in the circuit. Once the setup is complete, the system checks the WiFi network connectivity, automatically connects to the Google Firebase database, and initializes with it [8]. The system then reads the infrared sensor data and updates it in the Firebase database. Next, we configure our application with the Firebase database identity and generate a key. After the application is properly connected to the Firebase database, it reads back the sensor data from Firebase, and the program logic controls the home appliances automatically. We can also control them manually through the application interface, which provides on and off switches for home appliances such as lights and fans.
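A minimal sketch of this Phase 1 loop is shown below, assuming the pyrebase library on the Pi; the pin numbers, Firebase configuration placeholders, and database paths ("home/ir_sensor", "home/light") are illustrative assumptions, not values from the paper.

```python
# Sketch of the Phase 1 loop: read the IR sensor, mirror its state to
# Firebase, and drive an appliance from the switch state the app writes.
import time
import pyrebase
import RPi.GPIO as GPIO

IR_PIN, LIGHT_PIN = 17, 27           # assumed BCM pins

config = {
    "apiKey": "<api-key>",           # placeholders from the Firebase console
    "authDomain": "<project>.firebaseapp.com",
    "databaseURL": "https://<project>.firebaseio.com",
    "storageBucket": "<project>.appspot.com",
}
db = pyrebase.initialize_app(config).database()

GPIO.setmode(GPIO.BCM)
GPIO.setup(IR_PIN, GPIO.IN)
GPIO.setup(LIGHT_PIN, GPIO.OUT)

while True:
    # Publish the sensor reading so the Android app can display it.
    db.child("home").child("ir_sensor").set(GPIO.input(IR_PIN))
    # Read back the switch state written by the app and apply it.
    light_on = db.child("home").child("light").get().val()
    GPIO.output(LIGHT_PIN, GPIO.HIGH if light_on else GPIO.LOW)
    time.sleep(1)
```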
Fig. 3 Experimental setup of the system
Fig. 4 Flowchart of the system
Phase 2: Home automation applying machine learning. In the second phase of the project, we concentrate mainly on machine learning based on computer vision. The Raspberry Pi is in surveillance mode, and the familiar faces of the family are already stored as a data set in Google Firebase. Once the surveillance Pi camera detects a human face, the program fetches the image and extracts its features. The system then queries this image against the stored data set. If the image features match any entry in the data set, the system remains at ease. If there is a mismatch, the system generates an alert message via the Firebase messaging system and sends it to the reliable person's mobile or to the application; it can also ring an alert buzzer in the circuit. The workflow of the system is shown in the flowchart (Fig. 4).
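The sketch below illustrates this surveillance step with OpenCV. The Haar Cascade detector is what the paper names; the LBPH recognizer (from the opencv-contrib package), the model file name, and the confidence threshold are assumptions added for illustration, since the paper does not specify how faces are matched against the data set.

```python
# Sketch of the Phase 2 check: detect a face with a Haar Cascade and
# compare it against known family faces before raising an alert.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recognizer = cv2.face.LBPHFaceRecognizer_create()
recognizer.read("family_faces.yml")        # assumed pre-trained model file

def check_frame(frame, threshold=70.0):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):
        label, confidence = recognizer.predict(gray[y:y + h, x:x + w])
        if confidence > threshold:         # in LBPH, lower confidence = better match
            return "intruder"              # trigger the Firebase alert here
    return "ok"
```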
Fig. 5 Developed interface of the Android application
6 Software Implementation

The main aim of developing the Android application is to allow control signals to be sent from an Android phone over WiFi, providing facilities such as device control and device monitoring. With this application, one can control home appliances such as lights and fans. The Android package is the file format used by the Android operating system [5, 6] (Fig. 5).
7 Results and Discussion

We have successfully assembled our system, and the prototype is shown in the figures below. The basic aim of the system is to provide safe and efficient home automation. The system provides the facility to control home appliances through the designed Android application. Using computer vision and a Haar Cascade classifier, we are able to detect an intruder's face and, finally, send an alert notification to the owner's Android device. This work is based on the implementation of the Raspberry Pi 4, an IR sensor, a camera, and the Android platform with Java and XML. The assembled system and results are shown below (Figs. 6, 7, 8 and 9).
Fig. 6 Intruder detection by the system
Fig. 7 Image detection using Haar Cascade
Fig. 8 Designed Android application
Fig. 9 Alert message received
8 Conclusion

The main motive of this work is to monitor the home and keep it safe and secure. This paper describes the design, implementation, and integration of the Android application, home automation, and computer vision. Our system eliminates many disadvantages of earlier systems through certain modifications over them. Image processing has been used to improve the accuracy of the automation system. The system is implemented in the Python programming language and provides a safeguard against possible intruders. The proposed system can send a message alert to the owner's Android device, which works as an alert signal for the homeowner. Our Android application helps in controlling home appliances. The overall configuration is very low cost and can be easily implemented.
References

1. Hasnain, Rishabh S, Mayank P, Swapnil G (2019) Smart home automation using computer vision and segmented image processing. In: International conference on communication and signal processing (ICCSP)
2. Karimi K, Krit S (2019) Smart home-smartphone systems: threats, security requirements and open research challenges. In: International conference of computer science and renewable energies (ICCSRE). Agadir, Morocco
3. Dhami HS, Chandra N, Srivastava N, Pandey A (2017) Raspberry Pi home automation using android application. Int J Adv Res, Ideas Innov Technol
4. Sarkar S, Gayen S, Bilgaiyan S (2018) Android based home security systems using Internet of Things (IoT) and firebase. In: International conference on inventive research in computing applications (ICIRCA). Coimbatore, India
5. Raju KL, Chandrani V, Begum SS, Devi MP (2019) Home automation and security system with node MCU using Internet of Things. In: International conference on vision towards emerging trends in communication and networking (ViTECoN). Vellore, India
6. Islam A (2018) Android application based smart home automation system using Internet of Things. In: 3rd international conference for convergence in technology (I2CT). Pune
7. Jaihar J, Lingayat N, Vijaybhai PS, Venkatesh G, Upla KP (2020) Smart home automation using machine learning algorithms. In: International conference for emerging technology (INCET). Belgaum, India
8. ElShafee A, Hamed KA (2012) Design and implementation of a Wi-Fi based home automation system. World Academy of Science, Engineering and Technology
9. Vinay Sagar KN, Kusuma SM (2015) Home automation using Internet of Things. Int Res J Eng Technol (IRJET) 2(03):1965–1970
10. Paul S, Anthony A, Aswathy B (2014) Android based home automation using Raspberry Pi. IJCAT-Int J Comput Technol 1(1):143–147
Realtime Object Distance Measurement Using Stereo Vision Image Processing B. N. Arunakumari, R. Shashidhar, H. S. Naziya Farheen, and M. Roopa
Abstract In recent years, great progress has been made on 2D and 3D image understanding tasks, such as object detection and instance segmentation. Among recent trends in technology, driverless cars are making a difference in daily life. The basic principle in these driverless cars is object detection and localization using multiple video cameras and LIDAR, which is one of the current trends in research and development, so this work attempts to achieve the same on a small scale using the available resources. In the proposed method, the stereo images are first captured with a dual-lens camera, and the RGB images are then converted into grayscale images. The third step applies a global threshold to separate the background and uses morphological operations to obtain images of the same size. Blob detection is used to detect the points and regions of interest in the image. The fourth step measures the object distance and size using the pinhole camera formula. Further, in the proposed work, an effort is made to determine the linear distance between the camera and the object from the pictures taken by the camera; typically, stereo images are used for the computation. Binocular stereopsis, or stereo vision, is the capability to derive information about how far away objects are, based solely on the relative positions of the object in the two eyes. It depends on both sensory and motor capabilities, using the same principle the human brain employs: taking two images of the same object from two different, linearly separated positions. The frame rate of the system can reach a maximum of 15 frames per second, which can be considered acceptable for most autonomous systems, so the method works in realtime. An effective convolutional matching technique between embeddings is used for localization, bringing the system close to LIDAR-like centimeter-level accuracy at about 97%.

Keywords Linear regression · Linear distance · Stereovision · LIDAR · Stereo image

B. N. Arunakumari (B)
Department of Computer Science and Engineering, BMS Institute of Technology and Management, Bengaluru, Karnataka 560064, India
e-mail: [email protected]
R. Shashidhar
Department of Electronics and Communication Engineering, JSS Science and Technology University, Mysuru, Karnataka 570006, India
e-mail: [email protected]
H. S. Naziya Farheen
Department of Electronics and Communication Engineering, Navkis College of Engineering Hassan, Hassan, Karnataka 573217, India
M. Roopa
Department of Electronics and Communication Engineering, Dayananda Sagar College of Engineering, Bengaluru 560078, India
1 Introduction

Nowadays, state-of-the-art object detection and instance segmentation have made significant progress on 2D images. However, beyond 2D bounding boxes or pixel masks, 3D understanding is in high demand in real-world applications such as housekeeping robots, autonomous driving and advanced driver assistance systems (ADAS), and augmented reality (AR). With the rapid development of 3D sensors deployed on mobile devices and autonomous vehicles, 3D data capture and processing is gaining more and more attention. By studying the rotation and translation parameters, 3D object detection and localization with respect to a coordinate system classifies the object category and estimates 3D minimum bounding boxes of solid objects from digitized 3D sensor data; here we attempt to find the linear distance between the camera and the object from the pictures taken by the camera. There are two ways to compute the object-to-camera distance: (i) from the camera focal length and the given object size, or (ii) from the point where the object meets the ground and the height of the camera. Unlike in approach (i), the dimension of the object or image is unspecified in approach (ii). The proposed work uses a stereo camera for the computation of the object-to-camera distance. Binocular stereopsis is the capability to derive an impression of depth through the superimposition of two images of the same object taken from slightly different angles [1]. It depends on the fusion of visual stimuli from corresponding retinal images into a single image (sensory fusion) and the ability to sustain a single image with corrective eye movements that carry the fovea to the essential place (motor fusion); following the same principle the human brain employs, we take two images of the same object from two different, linearly separated positions. Once we have the left and right images of the object, we apply the algorithm to compute the linear distance.
2 Related Work

In the past few years, a couple of techniques have been established for calculating the object-to-camera distance. These methods are categorized into two types: contact and noncontact approaches. In the contact approach, most of the devices
have been used to calculate distance. The weakness of this approach is that the contact can damage the object being measured. For the noncontact approach, many solutions have been proposed, namely laser and ultrasonic reflection. The drawback of these techniques is that they depend on the reflectivity of the target, which can fail in noncontact measurement [2–6]. In [7], the authors developed an efficient method for object-to-camera distance measurement using a fuzzy stereo matching algorithm. The algorithm uses a window size of 7 × 7 pixels within a search range that varies from −3 to +3 for accurate computation of disparity. This approach is a good choice with respect to processing speed and accuracy; however, for realtime applications like autonomous vehicles and robot navigation the rectification process is necessary, and it is neglected in the proposed stereo vision algorithm. In [3], the authors proposed a search detection method that uses depth and edge data, along with its hardware architecture, for realtime object detection. This method improved the detection accuracy with respect to the sliding window method, and its hardware realization on a Field Programmable Gate Array (FPGA) can be further improved via an application-specific integrated circuit (ASIC) implementation or application-specific optimization. However, the work does not present the implementation of three major tasks of the algorithm, namely disparity calculation, classification, and edge detection, nor the influence of the algorithms on hardware architecture and performance [8, 9]. According to [10], eight to thirteen disparity images per second can be calculated with 3D reconstruction on the M6000 GPU, attaining a mean square error of 0.0267 m² in calculating distances up to 10 m. The GPU implementation significantly speeds up the calculation, and 3D depth information for unstructured outdoor environments can be produced in realtime, enabling suitable close-range mapping of the environment. In addition, the authors have made an attempt to estimate the motion of a stereo camera by decreasing re-projection errors between two successive frames using random sample consensus, with no prior knowledge or extra sensors needed to estimate the motion of the camera. As many parameters are added to optimize the process, the motion accuracy is enhanced [11, 12]. However, it is observed that, compared to other steps of the stereo matching method, the random-sample-consensus-based motion estimation is comparatively inefficient. In this paper, an attempt has been made to develop a robust and efficient method to solve such problems using available resources, just by taking pictures. Furthermore, authors have made attempts to improve the accuracy and speed of depth estimation through depth from focus and depth from defocus in combination with stereoscopy [12, 13]. However, stereo will remain the superior method in terms of cost if enlarging the baseline is less expensive than enlarging the lens aperture, since enhancements of depth from focus or depth from defocus by algorithmic improvements are restricted in common implementations by the small aperture dimension. In addition, the system does not prevent the occlusion and matching problems of monocular focusing and defocusing systems. However, there is a possibility to extend the range of objects that can be segmented to foodstuffs, animals, and other items.
A novel method to calculate low-latency dense depth maps by means of a single CPU core of a mobile phone was proposed in [14]. This method can effectively handle occlusions for applications such as AR photos, AR realtime navigation, and
shopping, using only the existing monocular color sensor. However, this method has the usual constraints of monocular depth estimation systems. When the relative pose between the current frame and the selected key frame is uncertain, various components can be affected. Another constraint comes from hardware limitations, which can manifest, for instance, as motion blur and shutter artifacts. Lastly, as is usual in passive stereo work, weakly textured areas are predominantly ambiguous, and inference over them often leads to inappropriate results. In addition, the estimation of the depth map is a significant step in 3D data generation for 3D display methods. From the existing methods it has been observed that most of the literature works on monocular, sequence, and stereo images, and depth map estimation depends on depth from focus. However, it is difficult to estimate the depth of objects at distances outside the focused areas, and it is also challenging to estimate the depth map from defocused images and the 3D conversion from it [15, 16]. Furthermore, authors have attempted to estimate low-complexity depth and confidence maps based on block distinction analysis for close-to-the-pixel implementation and all-in-focus image rendering. In this method, to build all-in-focus images, a depth criterion with error elimination and noise filtering was proposed using a simple median filter of size 21 × 21 [17]. Median filtering of the unified depth maps is used to check whether the measured pixel is uncorrupted, and if it is, the next iteration is executed based on a 3 × 3 window size. Thus, the low complexity of the depth estimation method is at stake if the iterations are executed repeatedly, depending on the size of the image. Further, attempts have been made to find a distance map for an image, solved capably on the linear array with reconfigurable pipeline (LARP) optical bus system [18]. Because of the elevated bandwidth of an optical bus, numerous efficiently implemented data-movement operations are effectively resolved on the LARP model; hence, this algorithm is completely scalable. However, a real parallel computer founded on the LARP optical bus system may not exist in the near future, so the algorithms on the LARP optical bus system model are not robust. In [19], the authors developed a methodology for distance map estimation for two-dimension to three-dimension conversion. The developed method was tested under various conditions, namely multiple objects, static cameras, a very dynamic foreground, motion behind the foreground, and minor motion in the background. Therefore, this method remains valid for two-dimension to three-dimension conversion in three-dimensional displays. However, such a method has the limitation that a motionless foreground image inside a view is likely to be treated as background as a result of inaccuracy in distance approximation, harming truthful stereo vision. Moreover, a survey on indoor and outdoor mobile robot navigation, covering structured and unstructured environments, has been presented [20]. If the objective is to send a mobile robot from one structured area to another structured area, we believe there is adequate accumulated capability in the research today to build a mobile robot that could do that in a typical structure. However, if the objective is to achieve task-driven mobile navigation,
we are still in the early days. Moreover, an automatic method has been developed to estimate the motion of a stereo camera from continuous stereo image pairs. The method is based only on the stereo camera, and no other sensor or prior knowledge is required to extract point features in the pair of images. Minimizing the re-projection error of successive images leads to good correspondence of the features. The main advantage of this method is that it designs an adapted feature descriptor that makes the feature-matching procedure particularly fast while preserving matching reliability. However, the method included only a few features in the optimization process to improve the accuracy of motion estimation [21].
3 Proposed Methodology

The first phase of the work is image acquisition using a stereo camera. Once the image is obtained, the next phase converts the color space to grayscale, removing the computation over the three RGB layers, and the image is resized to reduce the computation by 24-fold. The image is then ready for the subsequent phases. A median filter is applied to eliminate noise in the image, and thresholding then translates the image into a binary image. Once the binary image is obtained, blob detection is carried out to find the coordinates of the object of interest in the image. After these steps are carried out on the two images obtained from the camera, the difference between the pixel values is used to compute the disparity, and hence the distance, as shown in Fig. 1. Fig. 2 shows the flowchart of the proposed method, illustrating its step-by-step procedure.

Stereo image capture: Images are captured with a Redmi Note 4 mobile camera. Its dual-lens camera yields an effective focal length of 26 mm (in 35 mm full-frame equivalent terms), and its 1/2.9-inch sensor has 1.25 µm pixels. The two images (left and right) are captured by moving the camera along the y axis while keeping the x and z axes constant.

Fig. 1 Object detection and localization
Fig. 2 Flowchart of the proposed method: Stereo Image Capture, Image Pre-processing, Object Detection and Segmentation, Object Distance and Size Measurement
3.1 Pre-processing of an Image

Pre-processing steps are carried out to meet the requirements of the system. The image is resized to the required dimension. To reduce computational complexity and increase response time, the RGB image is transformed into a grayscale image. The resulting grayscale image is then passed through a median filter, a non-linear digital filtering technique used to remove noise from an image or signal.
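A minimal OpenCV sketch of this pre-processing stage follows; the target size, the 5 × 5 median kernel, and the file names are illustrative assumptions, since the paper does not report these values.

```python
# Sketch of the pre-processing stage: resize, grayscale conversion, and
# median filtering of the left and right stereo views.
import cv2

def preprocess(path, size=(640, 480), ksize=5):
    img = cv2.imread(path)
    img = cv2.resize(img, size)                    # resize to cut computation
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # drop the three RGB layers
    return cv2.medianBlur(gray, ksize)             # non-linear noise removal

left = preprocess("left.jpg")                      # assumed file names
right = preprocess("right.jpg")
```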
3.2 Object Detection and Segmentation

Background elimination is carried out using a thresholding technique; global thresholding is used to separate the background. Morphological operations are then applied. These process images based on shapes: a structuring element is applied to an input image, producing an output image of the same size, in which the value of each pixel is computed by comparing the corresponding pixel in the input image with its neighbors. Blob detection is used to detect points and regions in the image that differ in properties such as brightness or color compared to their surroundings. Its main aim is to provide complementary information about regions, which is not obtained from edge detectors or corner detectors, and it is used to obtain regions of interest for further processing. After object recognition, the pixel coordinates of the object in both the left and right images are determined. The difference between the pixel coordinates of the left and right images gives the disparity.
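Continuing the previous sketch, the code below shows one way to realize this stage. Otsu's method is used here as one concrete global threshold, and the object is taken to be the largest foreground blob; both are assumptions, as the paper does not specify which global threshold or blob-selection rule it uses.

```python
# Sketch of the segmentation stage: global threshold, morphological
# opening to clean the mask, and blob extraction via connected components.
import cv2
import numpy as np

def largest_blob_centroid(gray):
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    n, _, stats, centroids = cv2.connectedComponentsWithStats(binary)
    # Label 0 is the background; assume at least one foreground blob exists
    # and pick the largest one as the object of interest.
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return centroids[largest]                      # (x, y) pixel coordinates

xl = largest_blob_centroid(left)[0]                # object column, left view
xr = largest_blob_centroid(right)[0]               # object column, right view
disparity = xl - xr                                # pixel disparity
```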
3.3 Object Distance and Size Measurement Using Stereo Camera

Once the disparity is found, its value is used in the formula z = (b·f)/d to find the distance, where z is the object distance, d is the disparity, f is the focal length of the camera, and b is the baseline distance between the two camera positions. Once the distance is known, the approximate dimension of the object can be found using the pinhole camera formula discussed in Sect. 3.4.
3.4 Mathematical Substantiation for Object Detection and Localization

Figure 3 shows the mathematical proof of the object detection and localization, where d is the disparity, i.e., d = (xl − xr), f is the focal length of the camera, and b is the baseline distance between the two camera positions. Triangle APL is similar to triangle CDL, so f/z = xl/x; triangle BRP is similar to triangle EFR, so f/z = xr/(x − b). Hence:

f/z = xl/x = xr/(x − b)   (1)
x − b = z·xr/f   (2)
z·xl/f − b = z·xr/f   (3)
z(xl − xr)/f = b   (4)
z = b·f/(xl − xr) = b·f/d   (5)
z = b·f/d   (6)
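A worked application of Eq. (6) and the pinhole size relation follows. The baseline (15 cm) and focal length (62.40 px) are the values reported in Sect. 4; the example disparity and blob width are made up for illustration.

```python
# Recover object distance from pixel disparity (Eq. 6) and approximate
# object size from the pinhole camera model.
def object_distance(disparity_px, baseline_cm=15.0, focal_px=62.40):
    """z = b*f/d, with b in cm and f, d in pixels, so z comes out in cm."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a valid match")
    return baseline_cm * focal_px / disparity_px

def object_size(pixel_extent, distance_cm, focal_px=62.40):
    """Pinhole model: real extent = pixel extent * z / f."""
    return pixel_extent * distance_cm / focal_px

z = object_distance(9.36)    # 15 * 62.40 / 9.36 = 100 cm
w = object_size(31.2, z)     # a 31.2 px wide blob maps to 50 cm at 1 m
print(z, w)
```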
4 Result and Discussion

Two images (left and right) of an LCD monitor were taken with the Redmi Note 4 camera at positions separated by 15 cm. The focal length of the camera in pixels is 62.40. The disparity is computed to find the distance of the object. The final result is shown in Fig. 4: (a) is the original image and (b) the processed image. The main objective of the work is to detect objects using a stereo camera and to localize the detected objects in a coordinate system, and these objectives were met with the mathematical substantiation above. The main applications of the work are in self-driving cars, autonomous robots, and survey drones. Table 1 summarizes different approaches for object detection and localization and their accuracy levels; our proposed method shows better accuracy (97%) compared with the existing methodologies.
Fig. 3 Detection and localization of object
5 Conclusion

In our proposed methodology, an attempt has been made to compute the linear distance between the camera and the object using a stereo camera, and the detected objects are further localized in a coordinate system in an economically feasible way. The proposed work and its objectives were successfully completed with an accuracy of about 97%. In the end, we were able to implement the basic concepts of image processing practically and to explore specific functions and the mathematics behind them. Major applications of this work are in self-driving cars, autonomous robots, and survey drones. Recommended future work is to detect multiple objects, to detect both moving and still objects, and to classify the objects.
Fig. 4 a Original image (top), b processed image (bottom)
Table 1 Comparison of different object detection and localization systems

Method: Realtime 3D depth estimation and measurement of un-calibrated stereo and thermal images [14]
Accuracy (%): 90.54
Remarks: In this approach, the authors perform rectification of uncalibrated stereo images, realtime depth estimation, and realtime distance measurement using stereo images, an Intel Compute Stick, a webcam, and a thermal camera. SAD, the triangulation method, epipolar constraints, and the disparity map are used for 3D rectification, depth estimation, and distance measurement of the object.

Method: Distance estimation of colored objects in image [21]
Accuracy (%): 92.1
Remarks: Determination of accurate HSV values for the lower and upper limits requires a separate experiment. The detection of objects can also be imperfect due to lighting, which causes colors in the image to differ from the original. Imperfect detection of objects by color must be taken into account when calculating the distance, because the size of the object in the image then differs.

Method: Realtime object distance measurement using stereo vision image processing (proposed method)
Accuracy (%): 97
Remarks: Two images (left and right) of the LCD monitor are taken from a Redmi Note 4 camera separated by 15 cm. The focal length of the camera in pixels is 62.40.
References

1. Ma Y, Li Q, Chu L, Zhou Y, Xu C (2021) Real-time detection and spatial localization of insulators for UAV inspection based on binocular stereo vision. Rem Sens 13(2):230. https://doi.org/10.3390/rs13020230
2. Garcia MA, Solanas A (2004) Estimation of distance to planar surfaces and type of material with infrared sensors. In: Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004, vol 1, pp 745–748. https://doi.org/10.1109/ICPR.2004.1334298
3. Culshaw B, Pierce G, Jun P (2003) Non-contact measurement of the mechanical properties of materials using an all-optical technique. IEEE Sens J 3(1):62–70. https://doi.org/10.1109/JSEN.2003.810110
4. Klimkov YM (1996) A laser polarimetric sensor for measuring angular displacements of objects. In: Proceedings of European meeting on lasers and electro-optics, pp 190–190. https://doi.org/10.1109/CLEOE.1996.562308
5. Gulden P, Becker D, Vossiek M (2002) Novel optical distance sensor based on MSM technology. Sensors, vol 1. IEEE, pp 86–91. https://doi.org/10.1109/ICSENS.2002.1036994
6. Carullo A, Parvis M (2001) An ultrasonic sensor for distance measurement in automotive applications. IEEE Sens J 1(2):143–. https://doi.org/10.1109/JSEN.2001.936931
7. Chowdhury M, Gao J, Islam R (2016) Distance measurement of objects using stereo vision. In: Proceedings of the 9th hellenic conference on artificial intelligence (SETN '16). Association for Computing Machinery, New York, NY, USA, Article 33, pp 1–4. https://doi.org/10.1145/2903220.2903247
8. Othman NA, Salur MU, Karakose M, Aydin I (2018) An embedded real-time object detection and measurement of its size. Int Conf Artif Intell Data Process (IDAP) 2018:1–4. https://doi.org/10.1109/IDAP.2018.8620812
9. Kyrkou C, Ttofis C, Theocharides T (2013) A hardware architecture for real-time object detection using depth and edge information. ACM Trans Embed Comput Syst 13(3), Article 54 (December 2013), 19 p. https://doi.org/10.1145/2539036.2539050
10. Singh D (2019) Stereo visual odometry with stixel map based obstacle detection for autonomous navigation. In: Proceedings of the advances in robotics 2019 (AIR 2019). Association for Computing Machinery, New York, NY, USA, Article 28, pp 1–5. https://doi.org/10.1145/3352593.3352622
11. Mou W, Wang H, Seet G (2014) Efficient visual odometry estimation using stereo camera. In: 11th IEEE International conference on control & automation (ICCA), pp 1399–1403. https://doi.org/10.1109/ICCA.2014.6871128
12. Wadhwa N, Garg R, Jacobs DE, Feldman BE, Kanazawa N, Carroll R, Movshovitz-Attias Y, Barron JT, Pritch Y, Levoy M (2018) Synthetic depth-of-field with a single-camera mobile phone. ACM Trans Graph 37(4), Article 64 (August 2018), 13 p. https://doi.org/10.1145/3197517.3201329
13. Acharyya A, Hudson D, Chen KW, Feng T, Kan C, Nguyen T (2016) Depth estimation from focus and disparity. IEEE Int Conf Image Process (ICIP) 2016:3444–3448. https://doi.org/10.1109/ICIP.2016.7532999
14. Iqbal JLM, Basha SS (2017) Real time 3D depth estimation and measurement of un-calibrated stereo and thermal images. Int Conf Nascent Technol Eng (ICNTE) 2017:1–6. https://doi.org/10.1109/ICNTE.2017.7947959
15. Valentin J, Kowdle A, Barron JT, Wadhwa N, Dzitsiuk M, Schoenberg M, Verma V, Csaszar A, Turner E, Dryanovski I, Afonso J, Pascoal J, Tsotsos K, Leung M, Schmidt M, Guleryuz O, Khamis S, Tankovitch V, Fanello S, Izadi S, Rhemann C (2018) Depth from motion for smartphone AR. ACM Trans Graph 37(6), Article 193 (November 2018), 19 p. https://doi.org/10.1145/3272127.3275041
16. Kulkarni JB, Sheelarani CM (2015) Generation of depth map based on depth from focus: a survey. Int Conf Comput Commun Control Autom 2015:716–720. https://doi.org/10.1109/ICCUBEA.2015.146
17. Emberger S, Alacoque L, Dupret A, de Bougrenet de la Tocnaye JL (2017) Low complexity depth map extraction and all-in-focus rendering for close-to-the-pixel embedded platforms. In: Proceedings of the 11th international conference on distributed smart cameras (ICDSC 2017). Association for Computing Machinery, New York, NY, USA, pp 29–34. https://doi.org/10.1145/3131885.3131926
18. Pan Y, Li Y, Li J, Li K, Zheng SQ (2002) Efficient parallel algorithms for distance maps of 2D binary images using an optical bus. IEEE Trans Syst Man Cybernet Part A Syst Humans 32(2):228–236. https://doi.org/10.1109/TSMCA.2002.1021110
19. Mulajkar RM, Gohokar VV (2017) Development of methodology for extraction of depth for 2D-to-3D conversion. In: 2017 Second international conference on electrical, computer and communication technologies (ICECCT), pp 1–5. https://doi.org/10.1109/ICECCT.2017.8117848
20. Desouza GN, Kak AC (2002) Vision for mobile robot navigation: a survey. IEEE Trans Pattern Anal Mach Intell 24(2):237–267. https://doi.org/10.1109/34.982903
21. Zhang J, Chen J, Lin Q, Cheng L (2019) Moving object distance estimation method based on target extraction with a stereo camera. In: 2019 IEEE 4th International conference on image, vision and computing (ICIVC), pp 572–577. https://doi.org/10.1109/ICIVC47709.2019.8980940
An Insight on Drone Applications in Surveillance Domain

M. Swami Das, Gunupudi Rajesh Kumar, and R. P. Ram Kumar
Abstract An Unmanned Aerial Vehicle (UAV), also referred to as a drone, is a rapidly developing technology. A drone requires critical infrastructure elements: tools, a ground station, communication links, servers, and application services. Existing systems have drawbacks such as vulnerability, lack of safety, risk, and insecurity. To overcome these limitations, the application of drones in surveillance uses ground-to-ground, ground-to-air, and air-to-air communication. The significant features of a drone include takeoff, landing, travelling with a payload, recording data, functional operations, and application services. A system with a high-resolution camera can record data in a specific area using the Global Positioning System (GPS), embedded systems, controllers, and the concerned application(s). Drones are indispensable in surveillance for evidence collection during forensic studies and police investigations. The work can be extended to intelligent navigation for mission-critical applications in the military and defense. Keywords Drone · Surveillance · Unmanned aerial vehicle · Security · Investigations · IoT
M. S. Das (B)
CVR College of Engineering, Hyderabad, India
e-mail: [email protected]

G. R. Kumar
Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering & Technology, Hyderabad, Telangana, India

R. P. Ram Kumar
Gokaraju Rangaraju Institute of Engineering and Technology, Hyderabad, India

1 Introduction

UAVs can be used to sense and respond over a specified time in domains such as agriculture, remote sensing, military, mapping, and monitoring systems. UAVs have developed rapidly over the past few decades for surveillance, mapping, military,
medical, and other applications. Drones are used in outdoor spaces and require critical infrastructure, elements, and tools to operate real-time applications. Drones collect sources of evidence in the investigation process. Later, the recorded data are used by forensic examiners to extract information and make decisions based on the data. In critical situations, drone applications continue to extract and collect evidence, for example while monitoring terrorist groups during surveillance activities. Drones are used by police in crime scene investigations. They can be controlled in real time using sensors, microprocessors, elements, and application services. Moreover, mini UAVs are becoming indispensable in application domains including communication, public services, photography, surveillance, and even agriculture. Drones are categorized as authorized or unauthorized. Authorized drones can fly over certain heights in applications such as communication, photography, surveillance, agriculture, and public services. The deployment of drones faces critical challenges, namely, (1) safety of the airspace around airports; (2) risk to assets, that is, sensitive sites, government houses, and nuclear power plants; and (3) drones carrying explosives and other dangerous chemicals, or capturing military secrets through photographs, with the associated technical and operational requirements. The majority of studies identify three countermeasure classes, namely, (a) warning techniques, which detect devices using cameras and radars to give early warnings; (b) spoofing techniques, which send false signals; and (c) jamming techniques, which disrupt the control and navigation information integrated with the automatic driving system. An anti-drone defense system has smart sensors providing remote detection, tracking, and related activities; it includes electronic scanning radar, tracking, classification, and an intelligent directional Radio Frequency (RF) inhibitor. The rest of the manuscript is organized as follows: Sect. 2 presents the literature survey, Sect. 3 the proposed model, Sect. 4 the discussion, and Sect. 5 the conclusion and future scope.
2 Literature Survey

2.1 Related Work

A UAV platform comprises a sensing module, a data-processing module, and an operation module, configured according to the application concerned. In agriculture, drones with automation sensors are used to collect data in real time, followed by operational services that provide water management and pesticide application. Significant UAV applications include modern agricultural management (crop management, farming, cultivation, fertilization, irrigation, pesticides, harvesting, and specific crop management), environmental monitoring, and even crime investigations. A typical system consists of a UAV with GPS and a Raspberry Pi microprocessor, integrated with a camera to capture aerial images. The data sensor enables the collection and sharing of data for processing. UAVs are utilized simultaneously to collect data
in distributed environments to gain knowledge of irrigation water requirements and deficiencies, and even the current status of crop health. Further, image-processing techniques are used to generate the best feasible results. Drone technology is used in surveillance activities by police to collect evidence [1]. Bhoopesh Kumar Sharma et al. [2] showed that UAVs were used in remote sensing, monitoring, and related applications, with a study on crime scene investigation in forensic science. Jean-Paul Yaacoub et al. [3] studied the usage of drones for multipurpose applications, i.e., military, civilian, and terrorist, with the use of communication links, hardware, and smartphones. Drone application technologies such as aerial photography and entertainment were studied in [4]. Kangunde et al. [5] studied the processing of drones to enhance performance in real-time response applications, monitoring, agriculture, and flight control. Drones are used to deliver packages and are used by the public in various applications [6]. Altawy et al. [7] studied the physical and cyber security threats and applications of drones. Katsigiannis et al. [8] studied UAV technology, applications in agriculture, and a multi-sensor UAV imaging system used for water management in agriculture. Kulbacki et al. [9] studied drone applications in robotic processes and remote sensing in agriculture, along with surveillance and security purposes in defense applications, including aerial security and surveillance in all sectors [10]. Ding et al. [11] studied a drone surveillance system using IoT. Milas et al. [12] studied drone systems in environmental applications and remote sensing. Masum et al. [13] studied UAVs used in security and military applications. Hartmann et al. [14] studied UAVs in the military and defense; the operation of drones on military patrol enhanced surveillance and diminished dangerous situations. Dilshad et al. [15] studied applications of drones in smart homes and smart cities through video surveillance for object tracking and detection, video summarization, and target monitoring in traffic and disaster management. Hodge et al. [16] studied navigation algorithms using sensors integrated with drones to solve hazardous problems. Kandrot et al. [17] studied UAV technology applications in coastal management, including surveillance of illegal fishing activities. Kangunde et al. [18] studied real-time aspects of drone implementation in real-time flight control systems. Nader et al. [19] studied UAVs in aerospace traffic management for low-altitude operations, security, data, and protection. Upadhyay et al. [20] proposed the design and development of a drone application for aerial targets, signal processing, and radar detection. Tan et al. [21] studied object detection using the YOLOv3 network architecture.
2.2 Problem Definition

Design and develop a model for surveillance drones, together with a process for collecting evidence, and analyze and suggest recommendations accordingly.
2.3 Key Challenges

UAV platforms come in two airframe classes: rotary platforms (multirotors, spinning like a helicopter) and fixed-wing platforms (like an aeroplane), both reading data and mapping resources using sensors in domains such as surveillance, military, and agriculture. The key challenges incurred in UAV operation are the following:

(1) Tracking moving objects: the UAV controller is an embedded, computer-controlled flight control system used to monitor moving objects.
(2) Real-time operating system, sensor input, and feedback control system.
(3) Scheduling and prioritization of mobile device(s), communication to the ground system, and real-time scheduling.
(4) Security and safety: remote operations, automatic navigation, and flight propulsion with cost optimization.
(5) Privacy, security, and remote status.
(6) Stakeholders: drone specifications for the concerned application(s), with regulations, working technologies, policies, and government guidelines.
(7) Communication: drone coverage area, communication sensors, and data.
(8) Software: regulatory approval for drone application software and its operations.
3 The Model

A drone can be operated under manual or automated control. Drones might be misused over territories, sensitive locations, military centers, and other areas. To operate a drone, one must follow the permitted operating height, keep away from aircraft, airports, and helicopters, and adhere to safety rules or face prosecution. The Ground Control Station operates to control and monitor the flight controller. The drone's central processing unit takes control over the data link, the wireless link used to control the drone depending on range, control, and distance, following the industrial wireless protocol between the drone and the ground station. The drone uses wireless communication to communicate with ground stations through passive and active modes, with a choice of network security, including cellular and other wireless communication networks, and drone-to-satellite links through GPS. Figure 1 shows the best heavy-payload drone types, summarized from [8]: (1) multi-rotor drones, based on the vertical takeoff and landing principle; (2) fixed-wing drones, which travel guided at high speed with a heavy payload; and (3) hybrid-wing drones, fixed/rotary-wing craft able to fly rapidly using four rotors. A UAV is controlled remotely using a mobile phone, computer, or tablet. UAV control can be divided into three categories, namely, (a) remote pilot, i.e., direct human remote operation; (b) remote supervised control, able to launch and carry out assigned functions with human intervention; and (c) fully autonomous, a static automation control system that completes a mission without human intervention. Advantages of these drones include operation-controlled vehicles, RPA-intensive skills developed over long training periods, and crash avoidance.
Fig. 1 Best heavy payload drone [8]
A crash-avoidance system navigates around objects and returns to base on a programmed route. UAV routing always prefers safe routes to avoid accidents and dangers. Other significant uses are as follows: multipurpose police usage, such as an aerial bird's-eye view of suspects and responding to traffic to prevent accidents in emergencies; traffic monitoring; tracking escapees; forensic search and rescue; anti-rioting; and military operations covering surveillance, target-attack radar, intelligence, and command, control, communication, computer, and surveillance roles. A drone can be operated by children and adults through an HD camera integrated with Wi-Fi-enabled operations. A Wi-Fi-enabled drone establishes a connection with the network and then shares the gathered information as needed. Drones are used in surveillance for tasks like capturing photos and videos at a distance. Drones are regularly attacked, with vulnerabilities ranging from GPS spoofing and Denial of Service (DoS) attacks to endpoint compromises, in which network hackers locate the target while the operator is unaware of the flying device.
3.1 Architecture/Model

Surveillance and security drones are deployed in army and police applications. The controlling system can be operated manually or remotely. For security reasons, the deployable drones' functionalities, such as takeoff and landing and path-planning operations, are controlled automatically, even 24/7. The architecture of a UAV is shown in Fig. 2. It has four elements controlling the overall operations, namely, (1) the unmanned aircraft; (2) the flight controller; (3) the ground control station, where the user operates the drone's operations and functions; and (4) the data links, which use satellite, GPS, or Wi-Fi wireless communication signals to control the navigation, speed, and operations of the drone. The usage of drones in both military and
Fig. 2 High-level Architecture of a UAV System ([7] and [15])
commercial applications carries vulnerabilities, with attackers controlling the service, denying service, or disabling a user from normal operations. The high-level architecture of the UAV system comprises a drone, GPS elements, a high-resolution camera, and a main control computer with a flight interface, a data interface, and a server control unit. The drone systems are connected to the ground station and satellite using a communication link. Certain rules need to be followed for the proper usage of drones, regulating drone safety and security, risk analysis, and vehicle-hijacking methods and detection. Drones are also classified by connectivity, namely, RF drones and Wi-Fi drones. RF-controlled drones depend on the temperature of the surrounding area, the connected frequency, and the RF controller. Drones used in military and commercial applications operate over long distances for navigation operations. Wi-Fi hijack attacks and GPS spoofing are vulnerabilities that include jamming, packet sniffing, authentication attacks, and video replay. Alternative attack testing includes jamming, smartphone Wi-Fi analyzers, replay attacks, and commercial and military capture-and-test setups. Further, military drones have the ability to use encryption, authentication, network passwords, and protected video and receiver feeds, and with these features, drones are fit for use in wars.
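To summarize the four architectural elements described above in a compact form, the following Python sketch models them as plain data classes. All class and field names here are illustrative assumptions for exposition; they do not come from any drone SDK or from the system described in this paper.

```python
from dataclasses import dataclass, field

@dataclass
class DataLink:
    """Element 4: satellite/GPS/Wi-Fi link carrying navigation and control signals."""
    medium: str                 # e.g. "GPS", "satellite", "Wi-Fi" (assumed labels)
    encrypted: bool = False     # military drones typically encrypt their feeds

@dataclass
class FlightController:
    """Element 2: onboard unit governing navigation, speed, takeoff, and landing."""
    autonomy: str = "remote-pilot"  # or "remote-supervised", "fully-autonomous"

@dataclass
class GroundControlStation:
    """Element 3: station where the operator monitors and commands the drone."""
    operator: str = "authorized-user"

@dataclass
class SurveillanceUAV:
    """Element 1: the unmanned aircraft with its camera and GPS payload."""
    controller: FlightController
    ground_station: GroundControlStation
    links: list = field(default_factory=list)  # one or more DataLink objects

# Example: a Wi-Fi-controlled surveillance drone under remote supervision
uav = SurveillanceUAV(
    controller=FlightController(autonomy="remote-supervised"),
    ground_station=GroundControlStation(),
    links=[DataLink(medium="Wi-Fi"), DataLink(medium="GPS")],
)
```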
3.2 Components/Elements—Protocols

The drone's elements include communication, functions, a camera, and other supporting elements. Dedrone (an automatic anti-drone security product) is an airspace-security platform that detects, classifies, and mitigates threats; it enables airspace surveillance with automatic alarms and notifications. For example, the US Army's Enhanced Area Protection and Survivability system is a missile-based rocket used in defense; its features are command-guided launch, tracking radar, a fire-control computer, and
the Boeing Compact Laser Weapon System. This laser weapon system is used to acquire, identify, and track a target and destroy it accurately. Further, the device is connected to the system with IoT connectivity, communication, a service provider, and even the network operator. Amateur-drone surveillance nodes are capable of storing and analyzing data. The various functionalities of the surveillance process include the following: (a) detection, followed by robust detection, covering altitude, video, radar, short-range, and thermal detections, with data fusing and quick detection; (b) localization, i.e., 3D and passive localization, such as the parameter-estimation process, which serves as the basis of tracking and control operations; (c) tracking, i.e., context-aware tracking, trajectory filtering, and behavior prediction; and (d) controlling, i.e., directional jamming, safe catching, and drone classification. The UAV provides solutions at a faster rate. The system functions with two cameras: an optical camera used to create an extremely accurate, detailed map, and a thermal camera used in search-and-rescue operations, representing the ground control, imagery, and GPS location, and monitoring usage. The sensors are used to find the altitude and performance of a system in different environments. The sensors are integrated with real-time operating systems, feedback control, scheduling prioritization, mobile phones, communication, controllers, fuzzy-logic control, integral-derivative control, and neural networks. The drone is used in various real-time control applications by varying (1) motor-speed movement, (2) multi-tasking-enabled position, (3) feedback orientation, and (4) path planning. The digital-crime UAV forensic investigation process comprises (1) the preparation phase, identifying evidence; (2) the examination phase, capturing and identifying the data storage; and (3) the reporting and analysis phase, extracting data and stored images.
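The surveillance functionalities (a) to (d) and the three forensic phases above can be read as two small, ordered state spaces. The sketch below encodes them as Python enums purely for illustration; the names and descriptions are taken from the text, while the structure itself is an assumption, not part of any original system.

```python
from enum import Enum

class SurveillanceStep(Enum):
    """The four surveillance functionalities (a)-(d) described above."""
    DETECTION = "altitude/video/radar/short-range/thermal detection with data fusing"
    LOCALIZATION = "3D and passive localization via parameter estimation"
    TRACKING = "context-aware tracking, trajectory filtering, behavior prediction"
    CONTROLLING = "directional jamming, safe catching, drone classification"

class ForensicPhase(Enum):
    """The three phases of the digital-crime UAV forensic process."""
    PREPARATION = "identify evidence"
    EXAMINATION = "capture and identify the data storage"
    REPORTING_AND_ANALYSIS = "extract data and stored images"

# A surveillance run visits the steps in order; the forensic phases follow it
for step in SurveillanceStep:
    print(f"{step.name}: {step.value}")
for phase in ForensicPhase:
    print(f"{phase.name}: {phase.value}")
```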
3.3 Working Principle

The working principle of the drone, with its elements and operations, is presented in this section. The ground station supports the processing of drone data, including logs, image processing, and recovery. Aerial photos from the investigation provide the best representation for evidence collection, data representation, and submission to the court. Evidence such as fingerprints, within the scope of policing tasks, is collected at crime scenes. Planning evidence collection by drone, and understanding the crime scene through quality-assured photos of the scene's impact, is extremely valuable when surveying scenes. Evidence collected through the UAV can efficiently be barcoded, making it easy to identify and separate from the rest of the evidence in the examination phase. Finally come quality assurance, maintenance, and custody of evidence, based on the extracted UAV data analyzed with machine learning and AI techniques in the forensic study and investigation process. The surveillance drone system is used by authorized users. Algorithm 1 gives the drone surveillance working operations.
Algorithm 1: Drone surveillance working operations
Input:
$$H(t) = \begin{cases} 1, & t > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

The value of $\delta_m$ is calculated for later use in Eq. (4):

$$\delta_m = \frac{z_m^{d} - z_m}{\sum_{j=1}^{q} \sum_{k=1}^{r} w_{jm}^{k}\, \xi\!\left(z_m - y_j - d^{k}\right) \left( \frac{1}{z_m - y_j - d^{k}} - \frac{1}{\tau} \right)} \qquad (4)$$

Now, the change in the weights between the input and hidden layers, $\Delta w_{ij}^{k}$, is given in Eqs. (5) and (6):

$$\Delta w_{ij}^{k} = -\eta \times \delta_j \times \xi\!\left(y_j - x_i - d^{k}\right) \qquad (5)$$

$$\delta_j = \frac{\sum_{m=1}^{s} \delta_m \sum_{k=1}^{r} w_{jm}^{k}\, \xi\!\left(z_m - y_j - d^{k}\right) \left( \frac{1}{z_m - y_j - d^{k}} - \frac{1}{\tau} \right)}{\sum_{i=1}^{p} \sum_{k=1}^{r} w_{ij}^{k}\, \xi\!\left(y_j - x_i - d^{k}\right) \left( \frac{1}{y_j - x_i - d^{k}} - \frac{1}{\tau} \right)} \qquad (6)$$
The final change in the synaptic weights, $\Delta w_{jm}^{k}$, calculated using the value of $\delta_j$ given by Eq. (6), is added to the initial synaptic weights $w_{jm}^{k}$ to obtain the new synaptic weights. This is how training happens in the gradient-based approach for SNN. There are $i = 1, 2, 3, \ldots, I$ input neurons, $j = 1, 2, 3, \ldots, J$ hidden neurons, and $m = 1, 2, 3, \ldots, M$ output or readout neurons present in the network. The challenge of discontinuity was solved to some extent by Bohte et al. [5], who introduced probably the first popular supervised learning algorithm to train an SNN connected in a feed-forward fashion, naming it SpikeProp [5]. The exciting part of SpikeProp is its close analogy with the most popular backpropagation algorithm of ANN. SpikeProp eliminates discontinuity by allowing a single spike-time while discarding later spikes. Although SNN, if implemented efficiently, can smoothly handle non-linear classification problems without the need for hidden layer(s) and hidden neuron(s), SpikeProp used hidden layers and thereby suffered a heavy computational cost: hidden layers increase the synaptic load in the architecture, and, as a result, more computational power is required. SpikeProp uses the population coding scheme [19] combined with the concept of time-to-first-spike firing [19], i.e., in every neuron, the first firing time is more important than the later ones. The utilisation of time-to-first-spike eliminates the discontinuity problem by omitting the later spike-times and considering only the early spike-time. It is observed that first spike-times are the most relevant in terms of information carried [51]. Thus, the input, hidden, and readout (i.e., output) neurons are restricted to fire only a single spike. The SRM [19] is selected as the neuron model providing the dynamics of the membrane potential. In addition, the synapses are connected in a one-to-one fashion between every pair of SRM neurons. The error direction was investigated in SpikeProp by finding the slope, since the usage of time-to-first-spike as given in [5] turns the discrete nature of spiking into a continuous one. Although SpikeProp was a success to some extent, it lags behind in propping up weights if a (postsynaptic) neuron no longer fires a spike after receiving the input stimuli. Moreover, inhibitory neurons are not investigated properly, and only a minimal value of the learning rate is used. In [50], QuickProp and Rprop improve SpikeProp to some extent, and it is observed that the small learning rate or step size used in SpikeProp can be increased to a large value while still leading to successful training to a certain extent. Note that the convergence rate in online mode, the biological plausibility (since the synapses are not well explored, they are less similar to biological neurons), and the computational cost of SpikeProp are the flaws of this algorithm. In [6, 21, 50, 68, 91, 92], the convergence rate and multiple-spiking behaviour are further investigated, which makes SpikeProp more generalised and speedier than the previous version in [5]. However, the major problem of SpikeProp being a gradient-based supervised learning algorithm persists, that is, stagnation at a local minimum, which is a problem with any gradient-based optimisation algorithm. The surges or sudden jumps present in any optimisation algorithm that uses the gradient rule to determine the error direction disturb the algorithm's consistency. In addition, SpikeProp did not consider a mixture of inhibitory and excitatory neurons because, in this case, there is always a threat to the convergence of the algorithm; it is also a barrier when we want a synapse model that is biologically more realistic. In [68], Shrestha et al. also explore some demerits of this kind, along with the problem of formulating the loss function. Some other gradient-based supervised learning algorithms use a slightly different concept by utilising the extended delta learning rule developed in [53, 55]. In those algorithms, each spike-train is convolved with a suitable kernel function, which distinguishes them from the others. The gradient-descent-based SPAN algorithm proposed in [54] uses the concept of spike-pattern association, working with a single synapse connected in the form of an α-shaped synaptic curve; an exciting feature is that it uses the area under the curve to compute the overall loss in the network while training. However, the aforementioned common problem persists.
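To make the update rules concrete, here is a small numerical sketch of one SpikeProp-style weight update following Eqs. (4)-(6). It is an illustrative reconstruction under stated assumptions, not a reference implementation: the kernel is taken as $\xi(t) = (t/\tau)e^{1-t/\tau}$, the spike-times are assumed to be already produced by a forward pass, and all array names and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, S, R = 3, 4, 2, 3        # input/hidden/output neuron counts; delays per synapse
tau, eta = 7.0, 0.01           # kernel time constant and learning rate

def xi(t):
    # Spike-response kernel xi(t) = (t/tau) * exp(1 - t/tau) for t > 0, else 0
    return np.where(t > 0, (t / tau) * np.exp(1 - t / tau), 0.0)

def dxi(t):
    # Derivative d xi / dt = xi(t) * (1/t - 1/tau) for t > 0, else 0
    ts = np.where(t > 0, t, 1.0)          # mask to avoid division by zero
    return np.where(t > 0, xi(ts) * (1.0 / ts - 1.0 / tau), 0.0)

x = rng.uniform(0.0, 4.0, P)              # input spike-times x_i
y = rng.uniform(6.0, 10.0, Q)             # hidden spike-times y_j (from a forward pass)
z = rng.uniform(14.0, 18.0, S)            # actual output spike-times z_m
z_d = z + rng.uniform(-1.0, 1.0, S)       # desired output spike-times z_m^d
d = np.arange(1.0, R + 1.0)               # synaptic delays d^k
w_hid = rng.uniform(0.0, 1.0, (P, Q, R))  # hidden-layer weights w_ij^k
w_out = rng.uniform(0.0, 1.0, (Q, S, R))  # output-layer weights w_jm^k

lag_out = z[None, :, None] - y[:, None, None] - d   # z_m - y_j - d^k, shape (Q, S, R)
lag_hid = y[None, :, None] - x[:, None, None] - d   # y_j - x_i - d^k, shape (P, Q, R)

# Eq. (4): output-layer deltas; Eq. (6): hidden-layer deltas
delta_m = (z_d - z) / np.einsum('jmk,jmk->m', w_out, dxi(lag_out))
delta_j = (np.einsum('m,jmk->j', delta_m, w_out * dxi(lag_out))
           / np.einsum('ijk,ijk->j', w_hid, dxi(lag_hid)))

# Eq. (5) applied per layer: delta_w = -eta * delta * xi(lag)
w_out += -eta * delta_m[None, :, None] * xi(lag_out)
w_hid += -eta * delta_j[None, :, None] * xi(lag_hid)
print(delta_m, delta_j)
```

A real implementation would recompute the spike-times from the updated weights each epoch and guard the divisions against near-zero denominators, which this sketch omits for brevity.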
Therefore, it becomes necessary to move in a different direction in search of another approach. This approach is primarily based on the concept of Hebbian learning, especially asymmetric Hebbian learning, which is discussed in the next section.
2.2 Asymmetric Supervised Hebbian Learning

Spike-Time-Dependent Plasticity (STDP) is a biological process that optimises the information-processing mechanism among neurons. It is considered the asymmetric form of Hebbian learning: it adjusts the synaptic efficacy, or weights, between neurons based on the relative timing of a neuron's output spike and its input spikes. The temporal correlation between pre- and postsynaptic spiking neurons is taken into consideration. Plasticity generally means change; here the meaning maps to the change of the synapse (in terms of synaptic efficacy). Like any other synaptic-plasticity mechanism, along with the development and fine-tuning of neuronal circuits during the brain's development phase, it is believed that STDP handles learning and the storing of corresponding information inside the brain [4, 69]. It partially explains activity-dependent development in terms of two different concepts: Long-Term Potentiation (LTP) and Long-Term Depression (LTD). When the repeated presynaptic spike arrives a few milliseconds before the postsynaptic spikes, it is referred to as LTP. On the other hand, when the repeated presynaptic spike comes after the postsynaptic spikes, it is referred to as LTD. The learning window, which is also called the STDP function, varies for different synapse models. The rapid change in the learning window's value forces the time scale to be represented in milliseconds. Although STDP primarily learns in an unsupervised manner and is considered a partial learning algorithm, most researchers combine STDP with a concept called anti-STDP to train in a supervised manner. Various supervised algorithms for SNN have been developed using STDP; however, only a few are successful to some extent, both computationally and biologically. The time difference $\Delta t$ between a presynaptic spike ($t_{pre}$) and a postsynaptic spike ($t_{post}$) is represented as $\Delta t = t_{pre} - t_{post}$. The change in synaptic weights for an excitatory synapse, $\Delta w_{excitatory}$, is given in Eq. (7); the exponentially decaying shape shown in Fig. 1 indicates the dependency on the time difference of spikes, i.e., $\Delta t$:

$$\Delta w_{excitatory} = \begin{cases} A^{+} \exp\!\left(\dfrac{\Delta t}{\tau^{+}}\right), & \forall \Delta t < 0 \\ -A^{-} \exp\!\left(-\dfrac{\Delta t}{\tau^{-}}\right), & \forall \Delta t > 0 \\ 0, & \Delta t = 0 \end{cases} \qquad (7)$$
where $A^{+}$ and $A^{-}$ represent constant values (usually taken as 1.0) for LTP and LTD, respectively, and the time constants $\tau^{+}$ and $\tau^{-}$ shape the curves for LTP and LTD, respectively. In [70], a learning algorithm for SNN is proposed in which STDP and anti-STDP are used to fit the algorithm into a supervised paradigm. In this algorithm, multiple spiking activity is used, where each spiking neuron can fire multiple spikes at different time steps. The architecture of the network is feed-forward with hidden layers. The demerit of the algorithm is the neglect of the precise spike-times produced by the neurons in the hidden layers at training time. Qiang et al. proposed an algorithm that uses temporal coding to represent real-valued continuous information in the form of discrete spikes to train an SNN in a supervised manner [93]. In [81], Aboozar et al. proposed a biologically plausible supervised learning algorithm called BPSL, which is capable of firing multiple target spikes from a spiking neuron. Although it is referred to as biologically plausible, essential biological elements are not properly implemented in the synapse model.
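A minimal sketch of the learning window in Eq. (7) follows. The parameter values are illustrative defaults, not taken from the text, and the negative sign on the LTD branch is an assumption made to match the depression curve of Fig. 1.

```python
import numpy as np

def stdp_window(dt, a_plus=1.0, a_minus=1.0, tau_plus=20.0, tau_minus=20.0):
    """Weight change for an excitatory synapse, Eq. (7), with dt = t_pre - t_post (ms).

    dt < 0 (pre fires before post) gives LTP; dt > 0 gives LTD; dt == 0 gives 0.
    """
    dt = np.asarray(dt, dtype=float)
    ltp = a_plus * np.exp(dt / tau_plus)      # left curve of Fig. 1
    ltd = -a_minus * np.exp(-dt / tau_minus)  # right curve of Fig. 1 (assumed negative)
    return np.where(dt < 0, ltp, np.where(dt > 0, ltd, 0.0))

# A presynaptic spike 10 ms before the postsynaptic one is potentiated (~ +0.61),
# while the reverse ordering is depressed (~ -0.61)
print(stdp_window(-10.0), stdp_window(10.0))
```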
Fig. 1 The learning window for STDP (relation between the synaptic weight change Δw and the spike-time difference Δt), where LTP is represented by the left curve and LTD by the right curve
supervised learning patterns. In [56, 80], a supervised learning algorithm is proposed where the concept of STDP and anti-STDP is used to make the algorithm work as supervised learning. It is well-known that STDP primarily works in an unsupervised fashion. It is not considered a fully functional learning algorithm due to its plasticity updating mechanism, which changes the sign of synaptic efficacy instead of updating a fair value based on all presynaptic neurons’ spike firing times. It is a barrier to STDP-based supervised learning. Note that the Hebbian approach-based supervised learning algorithm has a common problem, which is the continuous change in synapse parameters even if neuron fire spikes exactly match the target spikes. Thus, there is a need for some extra work for adding additional learning rules or constraints to the original algorithm to provide stability. Moreover, in supervised Hebbian learning, all undesired timings of the spike are usually suppressed by the “teaching signal” during the training phase. Therefore, corelation happens only between pre- and postsynaptic spikes, around the desired timings of the spike. Since this type of corelation is absent in all other circumstances, synaptic strength cannot be weakened even if a neuron fires spikes at undesired times during the testing phase. It is observed from the literature that spiking neurons have the ability to successfully classify non-linear patterns into their respective target classes without using any hidden layer(s), and this powerful feature of spiking neurons is not implemented in the aforementioned learning algorithms except SEFRON proposed in [38]. SEFRON did not use any hidden layer. However, it was successful in classifying the non-linear patterns, thereby decreasing the synaptic load. It explores the computational power to a certain extent by utilising a single spiking neuron. However, we analysed and observed that the number of network parameters could be reduced to half, keeping the classification accuracy unhampered, which we experimented successfully.
2.3 Learning with Remote Supervision Ponulak et al. [62] proposed a distinguished learning algorithm called ReSuMe that is based on the concept of “remote supervision”. It is argued that ReSuMe eliminates the significant drawbacks found in the supervised Hebbian learning approach. Apart from this, ReSuMe also implements some exciting features. The primary principle is to impose the input-output characteristics into the SNN for yielding the target spike trains in response to the corresponding input spikes. Unlike supervised Hebbian learning, ReSuMe does not directly feed the desired signals to the learning neurons. Nevertheless, it can co-determine the synaptic connection’s plasticity. The algorithm ReSuMe also uses the supervised Hebbian approach for learning, but its “remote supervision” feature primarily distinguishes it from the others that use the supervised Hebbian learning approach. The concept of “remote supervision” is biologically justifiable based on an experimentally observed neurophysiological phenomenon—heterosynaptic plasticity [63, 75, 87]. The working rule of ReSuMe is briefly explained in Eq. (8):
$$\frac{d}{dt}\,w(t) = \left[ S^{d}(t) - S^{l}(t) \right] \left[ a + \int_{0}^{\infty} W(s)\, S^{in}(t - s)\, ds \right] \qquad (8)$$
where $S^{d}(t)$, $S^{l}(t)$, and $S^{in}(t)$ represent the desired, the learning-neuron (output), and the input spike trains, respectively. The parameter $a$ denotes the amplitude of the non-correlated contribution to $\frac{d}{dt}w(t)$, and the convolution in Eq. (8) is the Hebbian-like modification of $w$. The variable $s$ represents the time delay between spikes at the synaptic sites, over which the integral kernel $W(s)$ is defined, as shown in Eq. (8). A positive value of $a$ corresponds to excitatory synapses, where the shape of $W(s)$ becomes similar to the STDP rule, and a negative value of $a$ corresponds to inhibitory synapses, where the shape of $W(s)$ becomes similar to the anti-STDP rule. An exciting merit of ReSuMe is its independence from the spiking neuron model; it can therefore work with a variety of spiking neuron models. Also, ReSuMe can learn target temporal as well as spatio-temporal spikes efficiently, and it converges quickly towards the optimum value. Some algorithms explore ReSuMe further: in [77, 78], the ReSuMe algorithm is investigated with synaptic delays added, although the delays used are static constant values rather than random ones. In addition, in [79], multiple neurons are successfully trained using the training rules of ReSuMe instead of a single neuron. However, ReSuMe has several disadvantages despite these advantageous features: it claims to be suitable for online learning, but due to its fixed network topology it is not adaptive to incoming stimuli; it is unable to predict inputs after only a single use of the training patterns; and although ReSuMe is biologically plausible, its local behaviour restricts its learning ability. Another exciting supervised learning algorithm that works on the ReSuMe principle, developed to train SNN, is the Chronotron, proposed by Florian et al. in [16]. The Chronotron is experimented with using three different models: the first is gradient-descent learning (called gradient-descent E-learning), where the delta learning rule is used; the second is I-learning, where gradient-descent E-learning and the ReSuMe learning rule are combined; and the third is the ReSuMe learning rule itself. In the Chronotron, supervised learning is implemented using a sophisticated distance metric called the Victor-Purpura distance, an exciting feature of the algorithm. However, the Chronotron trains the synaptic efficacies in batch mode with a fixed network topology, making it unsuitable for online learning.
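As an illustration of Eq. (8), the sketch below evaluates the instantaneous weight derivative for one synapse on discretised spike trains. The exponential kernel $W(s)$, the amplitude values, and all array names are assumptions made for the example; a real ReSuMe implementation would choose $W(s)$ to match the STDP (excitatory) or anti-STDP (inhibitory) rule and integrate this derivative over the whole simulation.

```python
import numpy as np

def resume_dw(t_idx, s_d, s_l, s_in, a=0.01, a_w=1.0, tau=10.0, dt=1.0):
    """Instantaneous dw/dt of Eq. (8) at step t_idx, on discretised spike trains.

    s_d, s_l, s_in: 0/1 arrays for the desired, learning-neuron (output),
    and input spike trains; W(s) = a_w * exp(-s / tau) is an assumed kernel.
    The integral over s is truncated at t, by which the kernel has decayed.
    """
    s = np.arange(t_idx + 1) * dt              # delays s = 0, dt, ..., t
    W = a_w * np.exp(-s / tau)                 # assumed STDP-like kernel W(s)
    conv = np.sum(W * s_in[t_idx::-1]) * dt    # approximates int W(s) S_in(t - s) ds
    return (s_d[t_idx] - s_l[t_idx]) * (a + conv)

# Toy usage: accumulate the weight change over a 100-step episode
rng = np.random.default_rng(1)
s_in = (rng.random(100) < 0.05).astype(float)
s_d = (rng.random(100) < 0.05).astype(float)
s_l = (rng.random(100) < 0.05).astype(float)
dw_total = sum(resume_dw(t, s_d, s_l, s_in) for t in range(100))
print(dw_total)
```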
2.4 Learning with Metaheuristics

Heuristic methods are used as powerful and comprehensive tools for solving challenging optimisation problems. Although heuristics provide well-balanced solutions relatively close to the global optimum at affordable cost and time, their
design and development become complicated as they depend on "problem-specific" characteristics [59]. Therefore, to solve the aforementioned flaw, metaheuristics came into existence [22]. Metaheuristics are "problem-agnostic" rather than "problem-specific" and have become remarkably popular in many optimisation areas, such as the development of learning algorithms for ANN. However, the power of metaheuristics is far less explored and experimented with in the case of SNN. In this section, the metaheuristic approaches used to train SNN are briefly discussed. Metaheuristics such as evolutionary algorithms are mathematically simpler, can work on real numbers directly, and do not waste time encoding these real numbers into other formats; therefore, most complex classification problems benefit from this strategy. In [67], an evolving network of spiking neurons based on the Thorpe model [82], called eSNN, is proposed. The advantages of eSNN include fast real-time simulation achieved at a low computational cost in a large network architecture. Also, without retaining past data, the model can accumulate knowledge as data arrive. The usage of fuzzy rules for building the inference engine is an exciting feature of eSNN. However, eSNN has several disadvantages, such as the "infinite repository" problem: for each new arrival of patterns in online fashion, its repository of neurons grows without bound. Also, due to the use of rank-order-averaged synaptic weights, eSNN cannot handle input patterns having the same rank (despite having different spike-times), and the rank order can also increase the number of neurons in the network, which may lead to the loss of relevant stored information. In [14], the synaptic efficacies of an SNN were optimised to reduce the overall network error using evolutionary techniques in which the concept of "self-regulation" is appropriately implemented, regulating the learning process (the algorithm is called SRESN). Based on the currently stored knowledge, the output-layer neurons can automatically evolve from the training patterns. SRESN can add a neuron, change network parameters, or forgo learning from samples based on the class-specific and sample knowledge stored in the network; thus, SRESN works in a "self-regulatory" mode of learning. This method has both online and offline modes of training. However, SRESN does not use synaptic delays, an essential factor, obtaining better computational cost while compromising biological plausibility. Evolutionary methods are also used to improve the gradient-based SpikeProp algorithm [5] using the Particle Swarm Optimisation (PSO) technique [41]; the result is referred to as SpikeProp-PSO. It enhances the learning process of SpikeProp using the angle-driven dependency-learning rule, but it increases the computational cost. Also, it is biologically less plausible, since the biological elements present in synapses are neglected. Differential Evolution (DE) [74] is a powerful optimisation tool known for its simplicity and good performance; it is combined with eSNN [67] to develop
another supervised learning algorithm called DEPT-ESNN [66]. The primary goal of DEPT-ESNN is to select the optimum values of eSNN parameters such as the modulation factor, similarity factor, and threshold. In DEPT-ESNN, DE plays a vital role by providing suitable values for the mentioned eSNN parameters adaptively rather than by trial and error. The advantages of DEPT-ESNN include its simple implementation and generalisation, but the biological elements present in synapses are not considered, which makes DEPT-ESNN biologically less plausible. Although metaheuristic approaches are a bit time-consuming and can work with a single spiking scheme, they have many advantages that are not achievable using other optimisation approaches. Therefore, there is a need for more exploration of metaheuristic approaches to develop an efficient learning algorithm compatible with SNN. Note that other powerful metaheuristics, such as the Genetic Algorithm (GA) [3, 26, 33] and Grey Wolf Optimisation (GWO) [52], have been neither explored nor successfully experimented with directly, in a way that provides a properly balanced trade-off between computational cost and biological plausibility, for SNN trained in the supervised manner. The aforementioned supervised learning algorithms, irrespective of the approach used, did not explore the synapse model thoroughly, as found in the literature. Although some algorithms, such as [5, 77–79], use synaptic delays [40], those are constant synaptic delays, and wherever a mixture of excitatory and inhibitory neurons is observed, it is not appropriately implemented like the GABA switch [17, 45]. Synaptic delays are significant where biological plausibility is concerned. In the GABA-switch mechanism, switching from an excitatory neuron to an inhibitory one, and vice versa, happens randomly. The robustness of an algorithm is tested against noise, and in the biological process, the presence of noise while sharing information among neurons is evident [15]; therefore, a model should be robust to be biologically plausible, which is little explored as far as SNN are concerned. Another important phenomenon observed in a biological neuron is the spontaneous firing of spikes [27, 42], which is almost entirely neglected in the synapse models of SNN architectures. Moreover, there is a lack of a balanced trade-off between computational cost and biological plausibility in almost all the aforementioned supervised learning algorithms developed to train an SNN topology. A balanced trade-off between computational cost and biological plausibility is essential in the case of SNN because, if the computational complexity is very high, it is difficult to handle high-dimensional datasets.
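To indicate how a metaheuristic can tune such parameters, here is a generic sketch in the spirit of DEPT-ESNN using SciPy's differential_evolution. The objective function and the parameter bounds are placeholders standing in for a real train-and-validate loop over an eSNN; only the optimiser call itself is real SciPy API.

```python
from scipy.optimize import differential_evolution

def validation_error(params):
    mod, sim, thr = params  # modulation, similarity, and threshold factors
    # Placeholder objective: a real one would train an eSNN with these
    # parameters and return its validation error on held-out data.
    return (mod - 0.9) ** 2 + (sim - 0.5) ** 2 + (thr - 0.6) ** 2

# Assumed admissible ranges for the three eSNN parameters
bounds = [(0.1, 1.0), (0.1, 1.0), (0.1, 1.0)]

result = differential_evolution(validation_error, bounds, seed=0, maxiter=50)
print("best parameters:", result.x, "objective:", result.fun)
```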
Tables 1 and 2 show a brief summary of the gradient- and STDP-based supervised learning algorithms, and of the remote-supervision and metaheuristic-based supervised learning algorithms, respectively.
Table 1 A brief summary of gradient- and STDP-based supervised learning algorithms

Approach: Gradient
Algorithm: SpikeProp [5] and variants [6, 21, 50, 68, 91, 92]
Advantages: (1) Able to solve complex non-linear classification problems. (2) Computationally powerful for classification. (3) QuickProp improves the convergence speed using momentum; Rprop also seems to speed up SpikeProp.
Disadvantages: (1) When a postsynaptic neuron stops firing/responding to its corresponding input patterns, there is no mechanism by which the synaptic weights can be "propped up". (2) Even though neurons fire at most one spike due to the time-to-first-spike encoding, the synaptic load is very high; mathematically challenging. (3) Only excitatory neurons with a simple synapse model are used, and arbitrary values of synaptic delays are used, a barrier to biological plausibility.
Common problem: gradient-based optimisation algorithms may be stuck at a local minimum.

Approach: STDP
Algorithm: SWAT [88]
Advantages: (1) Uses a dynamic synapse model. (2) Can handle large non-linear datasets.
Disadvantages: (1) Huge synaptic load; it is computationally very costly. (2) A larger number of network parameters to adjust; very few biological elements are used in the synapse model.

Algorithm: Tempotron [24]
Advantages: (1) Applicable to a wide range of input classes, and flexible to the information-encoding scheme.
Disadvantages: (1) Suitable only for a single-layered network topology. (2) Lack of precise spike-timing information due to the output being restricted to either 0 or 1 during a predetermined interval; less biologically plausible.

Algorithm: SEFRON [38]
Advantages: (1) Lower computational cost. (2) Fewer network parameters to adjust, as there are no hidden layer(s).
Disadvantages: (1) Stability and robustness are not assured. (2) The computational complexity can be reduced by half keeping the classification accuracy unhampered; less biologically plausible.

Algorithm: Others [56, 70, 80, 81, 89, 93], [77–79]
Advantages: (1) Can be used for a wide range of classification problems, including large datasets. In [77–79], synaptic delays are considered. [93] performs well with the MNIST dataset.
Disadvantages: (1) The learning rule is based on STDP, not a fully functional supervised learning algorithm. The synaptic delays considered in [77–79] are not random, and a higher value of constant synaptic delays can affect learning. In [70], spikes fired by hidden neurons are neglected while training.
Common problem: STDP is not considered a fully supervised learning algorithm; also, stability cannot be guaranteed.
Table 2 A brief summary of remote-supervision and metaheuristic-based supervised learning algorithms

Approach: Remote supervision
Algorithm: ReSuMe [62]
Advantages: (1) Solves the flaw in SpikeProp and is independent of the spiking neuron model. (2) Efficiently learns target temporal and spatio-temporal spike patterns. (3) Quickly converges towards optimised values.
Disadvantages: (1) Due to the fixed network topology, not adaptive to incoming stimuli. (2) After only a single use of the training patterns, it is impossible to predict inputs. (3) Moderately biologically plausible, but local behaviour restricts its learning ability.

Algorithm: Chronotron [16]
Advantages: (1) Supervised learning is implemented using a sophisticated distance metric called the Victor-Purpura distance.
Disadvantages: (1) Trains the synaptic efficacies in batch mode with a fixed network topology, making it unsuitable for online learning.
Common problem: the network topology must be fixed before training, which is not adaptive.

Approach: Metaheuristic
Algorithm: PSO-SpikeProp [2]
Advantages: (1) Enhanced learning process of SpikeProp using the angle-driven dependency-learning rule.
Disadvantages: (1) Poor performance; increased computational cost, and the biological elements present in synapses are not considered.

Algorithm: DEPT-ESNN [66]
Advantages: (1) Simple implementation and generalisation.
Disadvantages: (1) Moderate performance; biologically not plausible.

Algorithm: eSNN [67]
Advantages: (1) Fast real-time simulation at low computational cost in the case of large network architectures. (2) Without retaining past data, the model is capable of accumulating knowledge as data arrive. (3) The fuzzy rule generation is an exciting feature.
Disadvantages: (1) For each new arrival of patterns in online fashion, its repository of neurons grows infinitely. (2) Due to the usage of rank-order-averaged synaptic weights, it cannot handle input patterns having the same rank. (3) The rank order might lead to an increase in the number of neurons in the network, which can lead to information loss.
Common problem: very time-consuming; not guaranteed to find the optimal solution, but finds a near-optimal solution.
3 Conclusion

The supervised learning algorithms discussed in this paper, irrespective of the approach used, did not explore the synapse model thoroughly, as found in the literature. Although some algorithms, such as [5, 77–79], use synaptic delays [40], those are constant synaptic delays, and wherever a mixture of excitatory and inhibitory neurons is observed, it is not appropriately implemented like the GABA switch [17, 45]. Synaptic delays are significant where biological plausibility is concerned. In the GABA-switch mechanism, switching from an excitatory neuron to an inhibitory one, and vice versa, happens randomly. The robustness of an algorithm is tested against noise, and in the biological process, the presence of noise while sharing information among neurons is evident [15]. A model should be robust to be biologically plausible, which is little explored in the case of SNN. Another important phenomenon observed in a biological neuron is the spontaneous firing of spikes [27, 42], which is almost entirely neglected in the synapse models of SNN architectures. Although metaheuristic approaches are a little time-consuming and can work with a single spiking scheme, they have many advantages that are not achievable using other optimisation approaches. Therefore, there is a need for more exploration of metaheuristic approaches to develop an efficient learning algorithm compatible with SNN. Note that other powerful metaheuristics, such as the Genetic Algorithm (GA) [3, 26, 33] and Grey Wolf Optimisation (GWO) [52], are rarely explored and experimented with directly, providing a properly balanced trade-off between computational cost and biological plausibility, for SNN trained in the supervised manner, with the exceptions of [35] and [36].
References

1. Abbott LF (1999) Lapicque's introduction of the integrate-and-fire model neuron (1907). Brain Res Bull 50(5–6):303–304
2. Ahmed FY, Shamsuddin SM, Hashim SZM (2013) Improved spikeprop for using particle swarm optimization. Math Probl Eng
3. Baluja S, Caruana R (1995) Removing the genetics from the standard genetic algorithm. In: Machine learning proceedings. Elsevier, pp 38–46
4. Bi GQ, Poo MM (2001) Synaptic modification by correlated activity: Hebb's postulate revisited. Annu Rev Neurosci 24(1):139–166
5. Bohte SM, Kok JN, La Poutre H (2002) Error-backpropagation in temporally encoded networks of spiking neurons. Neurocomputing 48(1–4):17–37
6. Booij O, tat Nguyen H (2005) A gradient descent rule for spiking neurons emitting multiple spikes. Inf Process Lett 95(6):552–558
7. Brunel N, Van Rossum MC (2007) Lapicque's 1907 paper: from frogs to integrate-and-fire. Biol Cybern 97(5–6):337–339
8. Cariani PA (2004) Temporal codes and computations for sensory representation and scene analysis. IEEE Trans Neural Netw 15(5):1100–1111
9. Cassidy A, Sawada J, Merolla P, Arthur J, Alvarez-Icaza R, Akopyan F, Jackson B, Modha D (2016) Truenorth: a high-performance, low-power neurosynaptic processor for multi-sensory perception, action, and cognition. In: Proceedings of the government microcircuits applications and critical technology conference. Orlando, FL, USA, pp 14–17
10. Choquet D, Triller A (2013) The dynamic synapse. Neuron 80(3):691–703
11. Comsa IM, Fischbacher T, Potempa K, Gesmundo A, Versari L, Alakuijala J (2020) Temporal coding in spiking neural networks with alpha synaptic function. In: ICASSP 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 8529–8533
12. Davies M, Srinivasa N, Lin TH, Chinya G, Cao Y, Choday SH, Dimou G, Joshi P, Imam N, Jain S et al (2018) Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 38(1):82–99
13. DeBole MV, Taba B, Amir A, Akopyan F, Andreopoulos A, Risk WP, Kusnitz J, Otero CO, Nayak TK, Appuswamy R et al (2019) Truenorth: accelerating from zero to 64 million neurons in 10 years. Computer 52(5):20–29
14. Dora S, Subramanian K, Suresh S, Sundararajan N (2016) Development of a self-regulating evolving spiking neural network for classification problem. Neurocomputing 171:1216–1229
15. Faisal AA, Selen LP, Wolpert DM (2008) Noise in the nervous system. Nat Rev Neurosci 9(4):292–303
16. Florian RV (2012) The chronotron: a neuron that learns to fire temporally precise spike patterns. PLOS ONE 7:1–27. https://doi.org/10.1371/journal.pone.0040233
17. Ganguly K, Schinder AF, Wong ST, Poo MM (2001) Gaba itself promotes the developmental switch of neuronal gabaergic responses from excitation to inhibition. Cell 105(4):521–532
18. Gerstner W (1995) Time structure of the activity in neural network models. Phys Rev E 51(1):738
19. Gerstner W, Kistler WM (2002) Spiking neuron models: single neurons, populations, plasticity. Cambridge University Press
20. Gerstner W, Kistler WM, Naud R, Paninski L (2014) Neuronal dynamics: from single neurons to networks and models of cognition. Cambridge University Press
21. Ghosh-Dastidar S, Adeli H (2009) A new supervised learning algorithm for multiple spiking neural networks with application in epilepsy and seizure detection. Neural Netw 22(10):1419–1431
22. Glover F (1977) Heuristics for integer programming using surrogate constraints. Decision Sci 8(1):156–166
23. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
24. Gütig R, Sompolinsky H (2006) The tempotron: a neuron that learns spike timing-based decisions. Nat Neurosci 9(3):420–428
25. Han J, Moraga C (1995) The influence of the sigmoid function parameters on the speed of backpropagation learning. In: International workshop on artificial neural networks. Springer, pp 195–201
26. Haupt RL, Ellen Haupt S (2004) Practical genetic algorithms. Wiley Online Library
27. Häusser M, Raman IM, Otis T, Smith SL, Nelson A, Du Lac S, Loewenstein Y, Mahon S, Pennartz C, Cohen I et al (2004) The beat goes on: spontaneous firing in mammalian neuronal microcircuits. J Neurosci 24(42):9215–9219
28. Hewitt J (2014) Darpa's new autonomous quadcopter is powered by a brain-like neuromorphic chip. https://www.extremetech.com/extreme/193532-darpas-new-autonomous-quadcopter-is-powered-by-a-brain-like-neuromorphic-chip, online. Accessed 5 Nov
29. Hodgkin AL, Huxley AF (1952) A quantitative description of membrane current and its application to conduction and excitation in nerve. J Physiol 117(4):500–544
30. Hodgkin AL, Huxley AF, Katz B (1952) Measurement of current-voltage relations in the membrane of the giant axon of loligo. J Physiol 116(4):424
31. Hodgkin AL, Huxley AF (1952) The components of membrane conductance in the giant axon of loligo. J Physiol 116(4):473
32. Hodgkin AL, Huxley AF (1952) Currents carried by sodium and potassium ions through the membrane of the giant axon of loligo. J Physiol 116(4):449
33. Holland JH et al (1992) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press
34. Huber J, Lisiński P, Kasiński A, Kaczmarek M, Kaczmarek P, Mazurkiewicz P, Ponulak F, Wojtysiak M (2004) Therapeutic effects of spinal cord and peripheral nerve stimulation in patients with the movement disorders. Artif Organs 28(8):766
35. Hussain I, Thounaojam DM (2020) Spifog: an efficient supervised learning algorithm for the network of spiking neurons. Sci Rep 10(1):1–11
36. Hussain I, Thounaojam DM (2021) Wolif: an efficiently tuned classifier that learns to classify non-linear temporal patterns without hidden layers. Appl Intell 51(4):2173–2187
37. Izhikevich EM (2003) Simple model of spiking neurons. IEEE Trans Neural Netw 14(6):1569–1572
38. Jeyasothy A, Sundaram S, Sundararajan N (2018) Sefron: a new spiking neuron model with time-varying synaptic efficacy function for pattern classification. IEEE Trans Neural Netw Learn Syst 30(4):1231–1240
39. Kasiński A, Ponulak F (2006) Comparison of supervised learning methods for spike time coding in spiking neural networks. Int J Appl Math Comput Sci 16(1):101–113
40. Katz B, Miledi R (1965) The measurement of synaptic delay, and the time course of acetylcholine release at the neuromuscular junction. Proc R Soc London Ser B Biol Sci 161(985):483–495
41. Kennedy J, Eberhart R (1995) Particle swarm optimization. In: Proceedings of ICNN'95 international conference on neural networks, vol 4. IEEE, pp 1942–1948
42. Kerschensteiner D (2014) Spontaneous network activity and synaptic development. Neurosci 20(3):272–290
43. Kistler WM, Gerstner W, Hemmen JLV (1997) Reduction of the hodgkin-huxley equations to a single-variable threshold model. Neural Comput 9(5):1015–1045
44. Lapicque L (1907) Recherches quantitatives sur l'excitation electrique des nerfs traitee comme une polarization. Journal de Physiologie et de Pathologie Generale 9:620–635
45. Lee SW, Kim YB, Kim JS, Kim WB, Kim YS, Han HC, Colwell CS, Cho YW, Kim YI (2015) Gabaergic inhibition is weakened or converted into excitation in the oxytocin and vasopressin neurons of the lactating rat. Molecular Brain 8(1):1–9
46. Lobo JL, Del Ser J, Bifet A, Kasabov N (2020) Spiking neural networks and online learning: an overview and perspectives. Neural Netw 121:88–100
47. Maass W (1997) Networks of spiking neurons: the third generation of neural network models. Neural Netw 10(9):1659–1671
48. Maass W (1997) Noisy spiking neurons with temporal coding have more computational power. In: Advances in neural information processing systems 9: proceedings of the 1996 conference, vol 9. MIT Press, p 211
49. Maass W, Bishop CM (2001) Pulsed neural networks. MIT Press
50. McKennoch S, Liu D, Bushnell LG (2006) Fast modifications of the spikeprop algorithm. In: The 2006 IEEE international joint conference on neural network proceedings. IEEE, pp 3970–3977
51. Minneci F, Kanichay RT, Silver RA (2012) Estimation of the time course of neurotransmitter release at central synapses from the first latency of postsynaptic currents. J Neurosci Methods 205(1):49–64
52. Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61
53. Mohemmed A, Schliebs S, Matsuda S, Kasabov N (2011) Method for training a spiking neuron to associate input-output spike trains. In: Engineering applications of neural networks. Springer, pp 219–228
54. Mohemmed A, Schliebs S, Matsuda S, Kasabov N (2012) Span: spike pattern association neuron for learning spatio-temporal spike patterns. Int J Neural Syst 22(04):1250012
55. Mohemmed A, Schliebs S, Matsuda S, Kasabov N (2013) Training spiking neural networks to associate spatio-temporal input-output spike patterns. Neurocomputing 107:3–10
56. Mostafa H (2017) Supervised learning based on temporal coding in spiking neural networks. IEEE Trans Neural Netw Learn Syst 29(7):3227–3235
An Extensive Review of the Supervised Learning Algorithms …
79
57. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: ICML 58. Natschläger T, Ruf B (1998) Spatial and temporal pattern analysis via spiking neurons. Netw: Comput Neural Syst 9(3):319–332 59. Parejo JA, Ruiz-Cortés A, Lozano S, Fernandez P (2012) Metaheuristic optimization frameworks: a survey and benchmarking. Soft Comput 16(3):527–561 60. Paugam-Moisy H, Bohte SM (2012) Computing with spiking neuron networks. Handb Nat Comput 1:1–47 61. Pei J, Deng L, Song S, Zhao M, Zhang Y, Wu S, Wang G, Zou Z, Wu Z, He W et al (2019) Towards artificial general intelligence with hybrid tianjic chip architecture. Nature 572(7767):106–111 62. Ponulak F, Kasi´nski A (2010) Supervised learning in spiking neural networks with resume: sequence learning, classification, and spike shifting. Neural Comput 22(2):467–510 63. Bi GQ (2002) Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol Cybern 87(5–6):319–332 64. Rigelsford J (2001) Control of movement for the physically disabled. Ind Robot: Int J 65. Rullen RV, Thorpe SJ (2001) Rate coding versus temporal order coding: what the retinal ganglion cells tell the visual cortex. Neural Comput 13(6):1255–1283 66. Saleh AY, Shamsuddin SM, Hamed HNA (2017) A hybrid differential evolution algorithm for parameter tuning of evolving spiking neural network. Int J Comput Vis Robot 7(1–2):20–34 67. Schliebs S, Kasabov N (2013) Evolving spiking neural network-a survey. Evol Syst 4(2):87–98 68. Shrestha SB, Song Q (2015) Adaptive learning rate of spikeprop based on weight convergence analysis. Neural Netw 63:185–198 69. Sjostrom PJ, Rancz EA, Roth A, Hausser M (2008) Dendritic excitability and synaptic plasticity. Physiol Rev 88(2):769–840 70. Sporea I, Grüning A (2013) Supervised learning in multilayer spiking neural networks. Neural Comput. 25(2):473–509 71. Stanic U, Davis R, General consideration in the clinical application of electrical stimulation. International FES Society web page. http://www.ifess.org 72. Stein RB (1965) A theoretical analysis of neuronal variability. Biophys J 5(2):173–194 73. Stein RB (1967) Some models of neuronal variability. Biophys J 7(1):37–68 74. Storn R, Price K (1997) Differential evolution-a simple and efficient heuristic for global optimization over continuous spaces. J Glob Optim 11(4):341–359 75. Bonhoeffer T, Staiger V, Aertsen AM (1989) Synaptic plasticity in rat hippocampal slice cultures: Local hebbian conjunction of pre and postsynaptic stimulation leads to distributed synaptic enhancement. Proc Nat Acad Sci USA 86(20):8113–8117 76. Taherkhani A, Belatreche A, Li Y, Cosma G, Maguire LP, McGinnity TM (2020) A review of learning in biologically plausible spiking neural networks. Neural Netw 122:253–272 77. Taherkhani A, Belatreche A, Li Y, Maguire LP (2015) Dl-resume: a delay learning-based remote supervised method for spiking neurons. IEEE Trans Neural Netw Learn Syst 26(12):3137–3149 78. Taherkhani A, Belatreche A, Li Y, Maguire LP (2015) Edl: an extended delay learning based remote supervised method for spiking neurons. In: International conference on neural information processing, pp 190–197 79. Taherkhani A, Belatreche A, Li Y, Maguire LP (2015) Multi-dl-resume: Multiple neurons delay learning remote supervised method. In: 2015 international joint conference on neural networks (IJCNN), pp 1–7 80. 
Taherkhani A, Belatreche A, Li Y, Maguire LP (2018) A supervised learning algorithm for learning precise timing of multiple spikes in multilayer spiking neural networks. IEEE Trans Neural Netw Learn Syst 29(11):5394–5407 81. Taherkhani A, Belatreche A, Li Y, Maguire LP et al (2014) A new biologically plausible supervised learning method for spiking neurons. In: ESANN 82. Thorpe S, Delorme A, Rullen RV (2001) Spike-based strategies for rapid processing. Neural Netw 14(6):715–725. https://doi.org/10.1016/S0893-6080(01)00083-1
80
I. Hussain and D. M. Thounaojam
83. Thorpe S, Fize D, Marlot C (1996) Speed of processing in the human visual system. Nature 381(6582):520–522 84. Tsodyks MV, Markram H (1996) Plasticity of neocortical synapses enables transitions between rate and temporal coding. In: International conference on artificial neural networks. Springer, pp 445–450 85. Vazquez RA, Cachón A (2010) Integrate and fire neurons and their application in pattern recognition. In: 2010 7th international conference on electrical engineering computing science and automatic control. IEEE, pp 424–428 86. Vreeken J (2003) Spiking neural networks, an introduction 87. Hui-zhong WT, Zhang LI, Bi GQ, Poo MM (2000) Selective presynaptic propagation of longterm potentiation in defined neural networks. J Neurosci 20(9):3233–3243 88. Wade JJ, McDaid LJ, Santos JA, Sayers HM (2010) Swat: a spiking neural network training algorithm for classification problems. IEEE Trans Neural Netw 21(11):1817–1830 89. Wang J, Belatreche A, Maguire L, McGinnity TM (2014) An online supervised learning method for spiking neural networks with adaptive structure. Neurocomputing 144:526–536 90. Wang X, Lin X, Dang X (2020) Supervised learning in spiking neural networks: a review of algorithms and evaluations. Neural Netw 125:258–280. https://doi.org/10.1016/j.neunet.2020. 02.011, https://www.sciencedirect.com/science/article/pii/S0893608020300563 91. Xu Y, Yang J, Zhong S (2017) An online supervised learning method based on gradient descent for spiking neurons. Neural Netw 93:7–20 92. Xu Y, Zeng X, Han L, Yang J (2013) A supervised multi-spike learning algorithm based on gradient descent for spiking neural networks. Neural Netw 43:99–113 93. Yu Q, Tang H, Tan KC, Yu H (2014) A brain-inspired spiking neural network model with temporal encoding and learning. Neurocomputing 138:3–13
Multitask Learning-Based Simultaneous Facial Gender and Age Recognition with a Weighted Loss Function
Abhilasha Nanda and Hyun-Seung Yang
Abstract Traditionally, researchers train facial gender and age recognition models separately using deep convolutional networks. However, in the real world, it is crucial to build a low-cost and time-efficient multitask learning system that can recognize both tasks simultaneously. In multitask learning, the synergy among the tasks creates imbalance in the loss functions and influences their individual performances. This imbalance among the task-specific loss functions leads to a drop in accuracy. To overcome this challenge and achieve better performance, we propose a novel weighted sum of loss functions that balances the loss of each task. We train our method for the recognition of gender and age on the publicly available Adience benchmark dataset. Finally, we evaluate our method on the VGGFace and FaceNet architectures using the Adience test set and achieve better performance than previous architectures. Keywords Multitask learning · Facial gender and age recognition · Deep learning · Convolutional neural networks · Weighted loss functions · Center loss
1 Introduction Gender and age recognition from facial images has gained importance over the years in various fields of machine learning and computer vision [1]. The objective is to identify these facial traits correctly from images in the wild [2]. To perform any facial attribute classification task, researchers use either handcrafted features or learned features; some even use a fusion of both. In recent years, deep CNN-based features have proved to outperform handcrafted features [3]. Convolutional Neural Networks (CNNs) consist of many deep layers that perform A. Nanda (B) Korea Research Institute of Ships and Ocean Engineering, Daejeon, South Korea e-mail: [email protected] URL: https://www.kriso.re.kr H.-S. Yang Korea Advanced Institute of Science and Technology, Daejeon, South Korea © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_7
Fig. 1 Multitask learning. The images show simultaneous recognition of facial gender and age by our method on the Adience dataset [5]
high-level, complex tasks like estimation and recognition of various facial attributes. In multitask learning, deep CNNs with more parameters and intermediate layers are used for training and evaluating various complex tasks simultaneously [4]. Using only one CNN with multiple layers results in a large number of parameters that not only require a lot of time and space but also yield poor accuracy. Therefore, most recognition tasks are performed separately using different CNN models (Fig. 1). However, deploying separate deep CNN models for complex tasks such as gender and age recognition also consumes a lot of time and space [4]. Therefore, there is a demand for a low-cost and time-efficient single multitask learning model [6] that can simultaneously recognize various facial attributes. One of the major challenges in multitask learning models is the imbalance in the loss functions [7]. The synergy that the different tasks share creates imbalance in their loss functions and affects their overall performance [8]. Therefore, in this paper, we propose a novel weighted sum of loss functions that balances the losses of each high-level task. We train our method for simultaneous recognition of gender and age on the publicly available Adience benchmark dataset [5]. Finally, we evaluate our models on the VGGFace [9] and FaceNet [10] architectures. We achieve 83.45% gender accuracy and 52.16% age accuracy on the FaceNet architecture. In addition, we achieve 80.72% gender accuracy and 53.33% age accuracy on the VGGFace architecture.
2 Literature Review Recently, multitask learning has been studied widely, and various approaches have been proposed for different estimation and recognition tasks. Multitask learning is an emerging trend in the fields of machine learning, natural language processing and human-computer interaction. Earlier multitask learning studies included various handcrafted approaches. Zhu et al. [11] proposed a model based on a mixture of trees with a shared pool of parts for combined face detection, pose estimation and landmark estimation. Strezoski et al. [12] introduced a method called Selective Sharing, in which a multitask model learned the inter-task relatedness from secondary latent features. With the advancement of deep learning, more and more CNN-based multitask learning methods have been proposed. Ranjan et al. [4] proposed a method to
simultaneously identify faces, estimate pose, recognize gender and localize facial landmarks. In addition, Hu et al. [13] deployed a multitask learning model that automatically learned which layers to share across various recognition tasks. Multitask learning is an approach to solving several tasks using one deep model. It optimizes multiple loss functions for individual tasks performed by the same model for an efficient outcome [14]. Parameters in multitask learning are shared either with a hard parameter sharing approach or a soft parameter sharing approach [15]. In hard parameter sharing, the hidden layers are shared by all the tasks but the output layer is split to achieve multiple task-specific results, whereas in soft parameter sharing, each task has its own convolutional layers and separate parameters with a shared output layer [16]. Multitask learning is used in many machine learning applications, like natural language processing, speech recognition, computer vision and retrieval [16], and is typically performed with one or more performance objectives that are optimized [17]. This lets users exploit the effectiveness of multitask learning and benefit from it. Many recent deep learning approaches use multitask learning explicitly or implicitly as part of the model, sharing parameters through the hard or soft approach [16]; hard parameter sharing is the most commonly used of the two. One of the main challenges faced by multitask learning is handling multiple task-specific losses [8]. In contrast to single-task learning, which has a single loss function, multitask learning has several task-specific loss functions [18]. There is always a dominant loss among them, and only the task dominating the other losses converges well with better accuracy; the remaining tasks do not contribute much to the learning and do not perform efficiently [19]. To overcome this challenge, we propose a weighted sum of loss functions that balances the losses shared by the individual recognition tasks and ensures better accuracy in multitask learning. The Multi-Input Multi-Output [15] architecture is a type of soft parameter sharing. In this architecture, each recognition model is trained separately and the models share only the final classifier layers; see Fig. 2. The prediction time of this model is reduced to that of predicting with one model instead of two. However, the
Fig. 2 Multi-input multi-output architecture
Fig. 3 Single-input multi-output architecture
training time and space are still equivalent to those of the single-task learning architecture. More commonly, multitask learning architecture is based on a single-input multi-output structure [8], which is a type of hard parameter sharing. The convolutional layers are shared and information is learned jointly; see Fig. 3. In the case of gender and age recognition, the output layer splits in two, one branch per task. We build both the multi-input multi-output and the single-input multi-output architectures using FaceNet [10] and VGGFace [9]; see Tables 1 and 2. We fine-tune on the Adience training set and evaluate over five folds on the Adience test set [5]. We observe a drop in accuracy of almost 10% in the single-input multi-output architecture. However, it trains only one model rather than multiple models and has fewer parameters. Therefore, it is both time- and cost-effective compared to the multi-input multi-output architecture. A minimal sketch of such a hard-parameter-sharing model follows.
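For concreteness, the following is a hedged sketch of a hard-parameter-sharing (single-input multi-output) model in the Keras functional API: one shared convolutional trunk that splits into a gender head and an age head. The trunk sizes are illustrative assumptions, not the exact FaceNet/VGGFace backbones used in the experiments; only the two-headed structure is the point.

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(224, 224, 3))

# Shared trunk: every task reuses these parameters (hard sharing).
x = layers.Conv2D(32, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
shared = layers.Dense(128, activation="relu")(x)

# Task-specific heads: the output layer splits in two, one branch per task.
gender_out = layers.Dense(2, activation="softmax", name="gender")(shared)
age_out = layers.Dense(8, activation="softmax", name="age")(shared)  # 8 Adience age groups

model = Model(inputs, [gender_out, age_out])
model.compile(
    optimizer="adam",
    loss={"gender": "categorical_crossentropy", "age": "categorical_crossentropy"},
    loss_weights={"gender": 0.001, "age": 1.0},  # per-task weights, as proposed in Sect. 3
    metrics=["accuracy"],
)

Because the trunk is shared, only one model is trained and stored, which is where the time and cost savings over the multi-input multi-output variant come from.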
3 Proposed Method Since the Single-Input Multi-Output approach is time- and cost-effective but has lower accuracy, we propose a weighted sum of loss functions for this approach to achieve better network convergence. Gender recognition is a binary classification problem, whereas age recognition is a multiclass classification problem [20]. In multitask learning, gender recognition, being the binary classification problem, converges better than age recognition and behaves as the dominant task. Therefore, the loss functions for gender and age should be assigned in such a way that age is optimized more strongly than gender [5]. However, this can degrade the loss function for gender recognition through large error gradients. Therefore, we combine the gender classification loss with center loss [21] to make the gender recognition task more discriminative (Tables 1 and 2).

Table 1 Experimental results on ADIENCE dataset for MIMO

Network    Gender (%)   Age (%)
FaceNet    90.25        60.00
VGGFace    91.27        62.16

Table 2 Experimental results on ADIENCE dataset for SIMO

Network    Gender (%)   Age (%)
FaceNet    81.40        52.25
VGGFace    80.02        52.39
Center loss helps in reducing the intra-class variations and inter-class similarities of class samples. It minimizes the distance among class samples and their corresponding class centers. We combine softmax loss [22] with center loss and a weight value for the total gender loss. The equation for the total gender loss is

$$L_G = \lambda_G \frac{e^{x_j}}{\sum_{k=1}^{|x|} e^{x_k}} + \frac{1}{2} \sum_{i=1}^{m} \left\lVert x_i - c_{y_i} \right\rVert_2^2 \qquad (1)$$
where $L_G$ is the total gender loss and $\lambda_G = 0.001$ is the weight term for the gender recognition task. The first part of the equation is the softmax loss and the second part is the center loss. Although age recognition could benefit from center loss in reducing intra-class variations and inter-class similarities, center loss tends to ignore the ordinal relationships among individual samples in age classes [23]. Therefore, we do not assign center loss to age recognition and use only softmax loss with a weight value for better optimization. The equation for the total age loss is

$$L_A = \lambda_A \frac{e^{x_j}}{\sum_{k=1}^{|x|} e^{x_k}} \qquad (2)$$
where $L_A$ is the total age loss and $\lambda_A = 1$ is the weight term for the age recognition task; the expression is a softmax loss function. We finally combine the total gender loss, weighted as 0.001, with the total age loss, weighted as 1. This way, the multitask learning balances the loss functions for both the gender and age recognition tasks. The equation for the proposed loss is as follows:

$$L_{total} = L_G + L_A \qquad (3)$$
where $L_{total}$ is the total weighted sum of losses, $L_G$ is the total gender loss and $L_A$ is the total age loss; 0.001 is the weight term for the gender recognition task and 1 is the weight term assigned to age. We optimize this total objective function for simultaneous gender and age recognition. From experiments, we find that a weight value of 0.001 for gender gives the best result in our multitask network. The network architecture for our proposed method is shown in Fig. 4.
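Under the assumption that softmax cross-entropy stands in for the softmax-loss term, Eqs. (1)-(3) can be sketched in TensorFlow as below. The class centers c_y are kept in a variable nudged toward the batch features, a simplified, eager-mode version of the center-loss update of Wen et al. [21]; the dimensions and the update rate alpha are illustrative, not the paper's exact values.

import tensorflow as tf

NUM_GENDERS, FEAT_DIM = 2, 128
LAMBDA_G, LAMBDA_A = 0.001, 1.0                            # weights from Eqs. (1)-(2)
centers = tf.Variable(tf.zeros([NUM_GENDERS, FEAT_DIM]))   # class centers c_y

def center_loss(features, labels, alpha=0.5):
    """(1/2) * sum_i ||x_i - c_{y_i}||^2, with a running (non-gradient) center update."""
    batch_centers = tf.gather(centers, labels)
    # Move each used center a little toward its batch features (simplified update).
    centers.scatter_sub(tf.IndexedSlices(alpha * (batch_centers - features), labels))
    return 0.5 * tf.reduce_sum(
        tf.square(features - tf.stop_gradient(batch_centers)), axis=1)

def total_loss(gender_logits, gender_feats, gender_labels, age_logits, age_labels):
    softmax_g = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=gender_labels, logits=gender_logits)
    l_gender = LAMBDA_G * softmax_g + center_loss(gender_feats, gender_labels)   # Eq. (1)
    l_age = LAMBDA_A * tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=age_labels, logits=age_logits)                                    # Eq. (2)
    return tf.reduce_mean(l_gender + l_age)                                      # Eq. (3)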
Fig. 4 Single-input multi-output architecture with proposed loss
4 Implementation Details We train the proposed method on the ADIENCE 5-fold dataset and evaluate it by averaging the accuracy on the Adience test set over the 5 folds. Adience is a publicly available benchmark dataset developed for facial gender and age recognition. The images were collected in the wild, with a total of 26,580 samples, and carry both gender and age labels [5]. For preprocessing, we use the cropped and aligned images and augment them with horizontal flips and rotations, as sketched below. Finally, we resize the images to 169 × 169 for the FaceNet architecture and 224 × 224 for the VGGFace architecture. We deploy our model on a general-purpose computer with one GPU capable of running TensorFlow under the Linux operating system. Our framework is implemented with Keras in Python.
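A minimal sketch of this preprocessing with Keras' ImageDataGenerator follows; the directory path, rotation range and batch size are assumptions for illustration, since the paper does not state them.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMG_SIZE = (224, 224)          # VGGFace input; use (169, 169) for the FaceNet model

datagen = ImageDataGenerator(
    rescale=1.0 / 255.0,
    horizontal_flip=True,      # augmentation used in the paper
    rotation_range=15,         # rotation amount not stated; 15 degrees assumed
)

train_gen = datagen.flow_from_directory(
    "adience/train",           # hypothetical directory of cropped, aligned images
    target_size=IMG_SIZE,      # resize on the fly
    batch_size=32,
    class_mode="categorical",
)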
5 Results Fine-tuning the FaceNet model gives a gender accuracy on the Adience test set of 83.45% and an age accuracy of 52.16%. Fine-tuning the VGGFace model gives a gender accuracy of 80.72% and an age accuracy of 53.33% (Table 3). We observe that our multitask method with the proposed loss function achieves better performance than the baseline in both gender and age recognition. Lee et al. [24] report higher gender recognition performance, but their method suffers from a dominant gender loss that degrades the age recognition performance.

Table 3 Comparison with other multitask methods on ADIENCE dataset

Network              Gender (%)   Age (%)
FaceNet (proposed)   83.45        52.16
VGGFace (proposed)   82.72        53.33
Lee [24]             85.16        44.26
Baseline [25]        82.52        44.14
6 Conclusion We experimented on both the Multi-Input Multi-Output and Single-Input Multi-Output architectures. From the results, we infer that the Multi-Input Multi-Output architecture yields better performance, but at the expense of large training time and space. Therefore, to build a low-cost and time-efficient multitask model, we propose the weighted sum of loss functions for the Single-Input Multi-Output architecture. From experiments, we observe better convergence of both the gender and age recognition tasks. The proposed weighted sum of loss functions helps various classification tasks by balancing their losses in a rule-based manner. We train and evaluate our method on the Adience dataset and achieve better accuracy than previous works. Acknowledgements We thank the Korea Research Institute of Ships and Ocean Engineering as well as the Korea Advanced Institute of Science and Technology for the infrastructure and resources provided to us to complete this paper.
References
1. Mollahosseini A, Chan D, Mahoor MH (2016) Going deeper in facial expression recognition using deep neural networks. In: 2016 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 1–10
2. Liu Z, Luo P, Wang X, Tang X (2015) Deep learning face attributes in the wild. In: Proceedings of the IEEE international conference on computer vision, pp 3730–3738
3. Bekhouche SE, Dornaika F, Benlamoudi A, Ouafi A, Taleb-Ahmed A (2020) A comparative study of human facial age estimation: handcrafted features vs. deep features. Multimed Tools Appl 79(35):26605–26622
4. Ranjan R, Patel VM, Chellappa R (2017) Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Trans Pattern Anal Mach Intell 41(1):121–135
5. Eidinger E, Enbar R, Hassner T (2014) Age and gender estimation of unfiltered faces. IEEE Trans Inf Forensics Secur 9(12):2170–2179
6. Zhang C, Zhao P, Hao S, Soh YC, Lee BS (2016) Rom: a robust online multi-task learning approach. In: 2016 IEEE 16th international conference on data mining (ICDM). IEEE, pp 1341–1346
7. Kendall A, Gal Y, Cipolla R (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7482–7491
8. Thung KH, Wee CY (2018) A brief review on multi-task learning. Multimed Tools Appl 77(22):29705–29725
9. Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition
10. Schroff F, Kalenichenko D, Philbin J (2015) Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 815–823
11. Zhu X, Ramanan D (2012) Face detection, pose estimation, and landmark localization in the wild. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 2879–2886
12. Strezoski G, van Noord N, Worring M (2019) Learning task relatedness in multi-task learning for images in context. In: Proceedings of the 2019 international conference on multimedia retrieval, pp 78–86
13. Hu G, Liu L, Yuan Y, Yu Z, Hua Y, Zhang Z, Shen F, Yang Y (2018) Deep multi-task learning to recognise subtle facial expressions of mental states. In: Proceedings of the European conference on computer vision (ECCV), pp 103–119
14. Swersky K, Snoek J, Adams RP (2013) Multi-task bayesian optimization
15. Ruder S, Bingel J, Augenstein I, Søgaard A (2019) Latent multi-task architecture learning. Proc AAAI Conf Artif Intell 33(01):4822–4829
16. Ruder S (2017) An overview of multi-task learning in deep neural networks. arXiv:1706.05098
17. Kang Z, Grauman K, Sha F (2011) Learning with whom to share in multi-task feature learning. In: ICML
18. Liu S, Johns E, Davison AJ (2019) End-to-end multi-task learning with attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1871–1880
19. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? arXiv:1411.1792
20. Bekhouche SE, Ouafi A, Benlamoudi A, Taleb-Ahmed A, Hadid A (2015) Automatic age estimation and gender classification in the wild
21. Wen Y, Zhang K, Li Z, Qiao Y (2016) A discriminative feature learning approach for deep face recognition. In: European conference on computer vision. Springer, Cham, pp 499–515
22. Liu W, Wen Y, Yu Z, Yang M (2016) Large-margin softmax loss for convolutional neural networks. In: ICML 2(3):7
23. Pan H, Han H, Shan S, Chen X (2018) Revised contrastive loss for robust age estimation from face. In: 2018 24th international conference on pattern recognition (ICPR). IEEE, pp 3586–3591
24. Lee JH, Chan YM, Chen TY, Chen CS (2018) Joint estimation of age and gender from unconstrained face images using lightweight multi-task CNN for mobile applications. In: 2018 IEEE conference on multimedia information processing and retrieval (MIPR). IEEE, pp 162–165
25. Levi G, Hassner T (2015) Age and gender classification using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 34–42
Visualizing Crime Hotspots by Analysing Online Newspaper Articles
M. Trupthi, Prerana Rajole, and Neha Dinesh Prabhu
Abstract With improvements in technology, India is growing at a fast pace, which has led to a great deal of urbanization. However, instead of reducing, the crime rate has increased over the past couple of years. The general public must be informed of how safe an area is so that they may take appropriate actions to protect themselves. Every day, many local crimes are published in internet news articles, but not everyone has the time to read them all. They contain information that can be used to determine the safety of a location. Thus, in this paper, we propose an end-to-end solution based on Natural Language Processing to inform users of the crime rate in their area. We create a model that analyzes crimes mentioned in local news articles and collects data such as the location and the incident type. The model uses Named Entity Recognition to extract the locations and the crimes that have occurred. To take advantage of the benefits of transfer learning, we built the model using Google's BERT framework. It was trained on CoNLL-2003 with custom modifications and was put to the test using real-time data gathered from several online news outlets' crime articles. Our model has an F1 score of 83.87% and a validation accuracy of 96%. The information collected via the internet was visualized on a heat map using the bokeh package. We display metrics such as the name of the location, the number of crimes that occurred in that area and the most recent crime, which provides a quick overview and benefits our users. Keywords Named entity recognition · BERT · Bi-directional encoder representation from transformers · Citizen safety · Heat map
M. Trupthi · P. Rajole · N. D. Prabhu (B) Chaitanya Bharathi Institute of Technology, Hyderabad, Telangana 500075, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_8
1 Introduction India is growing at a rapid rate as a result of technological advancements, which have resulted in a significant amount of urbanization. In recent years, however, crime has increased rather than dropped. Crimes such as theft and murder are reported in the newspapers on a daily basis. According to a study, a total of 51.5 lakh cognizable crimes were recorded nationally in 2019, with 32.2 lakh Indian Penal Code crimes and 19.4 lakh Special and Local Laws crimes. The crime rate per 100,000 rose from 383.5 in 2018 to 385.5 in 2019, a 1.6 percent annual rise over the 50.7 lakh cases registered in 2018. Violent crimes such as murder, kidnapping, assault and death by negligence accounted for more than a fifth of all reported crimes (10.5 lakh). In 2019, 18,051 cases were filed under the Hyderabad Police Commissionerate limits, compared to 16,012 cases in 2018. Based on publicly available data, i.e., news articles and newspapers, it can be observed that some locations have a higher crime rate than others. A compilation of this data can be referred to by the general public for safety and security purposes. This study offers a model that employs Named Entity Recognition (NER), a natural language processing technique, to collect critical information such as the location of the crime and the type of crime that happened from online news articles. Following the collection of this data, we create a heat map to display the density of crimes in a form easily understood by the public reader. The paper is organized as follows. Section 2 contains a brief summary of related works. Section 3 gives definitions and brief descriptions of the important ideas needed to understand the proposed solution. Section 4 contains a summary of the approaches utilized to implement the prototype. The findings of the experiment on real-world data are shown in Sect. 5. The system's shortcomings are highlighted, and the future scope of this research is discussed in Sect. 6.
2 Related Work Data about where crimes have occurred and how many have occurred is the basic information required to take the necessary precautions to prevent crimes in the future. There has been a great deal of study of the various ways in which this data can be acquired. An efficient way to map the crime rate for thefts occurring in a city in Italy is presented in [1]. The described method uses a parser to extract important information from each article and an automatic text analysis tool to determine what the stolen goods were. The model described in [2] uses a combination of feature engineering and the concept of n-grams to extract entities like the location of the crime. The method described in the paper employs a naive Bayes classifier to classify the type of crime.
Another popular technique involves using the meta tags of news articles to get the location of the crime and then utilizing word embeddings to create categories for the crime, as mentioned by Bondielli et al. [3]. An alternate method mentioned in [4] uses feature vectors and the notion of term frequency and inverse document frequency to extract the entities from newspaper articles. Arulanandam et al. [5] mention utilizing Named Entity Recognition in conjunction with a conditional random field to recognize the sentences in an article that contain the location of the crime. The two surveys [6, 7] summarize the techniques that can be used to perform crime analysis. In [8] the author uses a document clustering technique to extract important information about the crime from documents. The survey paper by Nasridinov et al. [9] evaluated the existing machine learning approaches using a real-time dataset consisting of crimes in South Korea. Ku et al. [10] use Natural Language Processing to process a dataset consisting of witness and police reports to extract the crime that has occurred. In [11] the proposed method uses an SVM classifier to determine whether an article is a crime article and then uses NER to extract the required entities. Jie et al. propose an LSTM-CRF model in [12] to encode complete dependency trees and capture the required properties for the task of named entity recognition (NER). The model proposed in [13] follows a neural network approach, i.e., an attention-based bidirectional Long Short-Term Memory with a CRF (Conditional Random Field) layer, for performing the NER task.
3 Definition of Terms

3.1 Named Entity Recognition Named Entity Recognition, abbreviated as NER, is the process of identifying and extracting named entities from sentences. We then proceed to categorize these entities into pre-defined classes. Named entity recognition is a subtask of information extraction that aims to discover and classify named entities referenced in unstructured text.

3.2 BERT BERT stands for Bi-directional Encoder Representation from Transformers and is a machine learning technique introduced by Google [14]. It follows the concept of self-attention to compute the output. The major advantage of this model is that it takes the context of the sentence into consideration for each occurrence of a given word. It is pre-trained on unlabeled data extracted from Wikipedia (2,500 million words) and Books Corpus (800 million words) using two unsupervised prediction tasks: Next Sentence Prediction and Masked Language Modeling. The model uses attention to concentrate on other terms in the input that are closely connected to the word in question. The model is made up of stacked Transformer encoders. In the transformer, the attention layer computes multiple times in parallel; each computation is termed an attention head (Fig. 1).

Fig. 1 Architecture of pre-trained BERT model
3.3 BIO Scheme The BIO scheme, which stands for beginning, inside and outside, is the tagging format followed in this paper.

• B—prefix before a tag indicates that the tag is the beginning of a chunk.
• I—prefix before a tag indicates that the tag is inside a chunk.
• An O tag indicates that a token belongs to no chunk (outside).

The different classes in the BIO scheme are:

geo  Geographical Entity
org  Organization
per  Person
gpe  Geopolitical Entity
tim  Time indicator
art  Artifact
eve  Event
nat  Natural Phenomenon
4 Proposed System

4.1 Web Scraping We collected the data from a local English newspaper called "Siasat". A package called "kora", built on top of Selenium, was used for the web scraping task. Siasat has a crime section listing 20 articles, which was used as the base link. We wrote a Python script to crawl through each linked article, starting from the base page. This script is scheduled to run every week from the start date, with a gap of 7 days, using the Python scheduler package.
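The sketch below illustrates this crawl-and-schedule loop using requests, BeautifulSoup and the schedule package rather than the kora/Selenium setup used in the paper; the URL, the CSS selector and the storage helper are assumptions about Siasat's markup, not its actual structure.

import time
import requests
import schedule                    # pip install schedule
from bs4 import BeautifulSoup

BASE_URL = "https://www.siasat.com/crime/"     # crime-section base page (assumed)

def save_article(url, text):
    # Append each article to a simple tab-separated store for later NER.
    with open("articles.tsv", "a", encoding="utf-8") as f:
        f.write(f"{url}\t{text}\n")

def crawl_crime_articles():
    page = requests.get(BASE_URL, timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.select("article a[href]"):          # hypothetical selector
        article = requests.get(link["href"], timeout=30)
        body = BeautifulSoup(article.text, "html.parser")
        text = " ".join(p.get_text(strip=True) for p in body.find_all("p"))
        save_article(link["href"], text)

schedule.every(7).days.do(crawl_crime_articles)          # run weekly, as in the paper
while True:
    schedule.run_pending()
    time.sleep(3600)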
4.2 Dataset To train the model we used the CoNLL-2003 dataset with custom modifications to suit the model during training. The dataset consists of sentences from various English articles, not just articles related to crime, for training and validation of the model. The labeling scheme followed by the dataset is the BIO scheme. The dataset has 4 columns (see the loading sketch after this list):

• Sentence #—holds the sentence number
• Word—holds the word
• POS—holds the part-of-speech tag for the corresponding word
• Tag—contains the BIO tag of the word
Consider the following sentence: "The Ramagundam Police caught Ramarao Reddy who was involved in murder." The sentence is labeled as shown in Table 1 when it follows the BIO scheme (Fig. 2).

Table 1 Example of BIO scheme

Word          Tag
The           O
Ramagundam    B-GPE
Police        B-ORG
caught        O
Ramarao       B-PER
Reddy         I-PER
who           O
was           O
involved      O
in            O
murder        B-EVE
.             O
Fig. 2 Architecture of proposed system
4.3 Location and Crime Extraction

Tokenization and pre-processing. The model extracts the locations and crimes using named entity recognition. The first step in this process is tokenization, in which each sentence in the article is broken down into tokens, for each of which an entity is predicted. Although the dataset carries a part-of-speech tag, it is not used, as the model requires only the BIO tag to perform named entity recognition. All the pre-defined entities in the dataset are stored in a dictionary along with index numbers, which are used to predict the tags.

Fine-tuning. After loading the pre-trained model, it is fine-tuned to perform named entity recognition; References [15–17] describe how to fine-tune a pre-trained model. The BERT model can be modified to perform NER tasks by adding a classification layer that predicts the NER label from the output vector of each token. The pre-trained model has 12 layers. Weight decay is used to prevent the weights from becoming too large after each weight update; a stochastic optimization method adjusts the weight decay term by decoupling weight decay from the gradient update. A learning rate scheduler reduces the learning rate at each epoch to lighten the training load. The fine-tuning is done using the AdamW function as the optimizer (Fig. 3). A condensed training sketch follows at the end of this section.

Training the model. Only 80% of the dataset is used for training, while the remainder is used for validating the model. The inputs to the model are multi-dimensional matrices holding the tokens and the tag for each token. During training, gradient clipping is used to deal with the exploding gradients problem; its main aim is to rescale the gradient if it becomes too large. The number of epochs the model is trained for is directly proportional to the number of sentences in our dataset. At the end of each epoch, the average training loss is stored to assess the performance of the model.

Fig. 3 Modified BERT architecture to perform NER

Validating the model. The output of the model is stored in the form of logits, which are non-normalized predictions that are then passed to the softmax activation. The function generates a vector containing the normalized probability values for every pre-defined tag, and the model assigns the tag with the highest probability to the token.

Saving the model. Along with the model, the vocabulary file and the final weights of the model are also saved. The model is saved using Python's pickle feature.

Geocoding and Heat Map Generation. The location and the crime committed are saved in a dataset after being extracted from the article. To get the coordinates of the location we use the Nominatim package. Nominatim is built on osm2pgsql, software used to import OpenStreetMap data into a PostgreSQL database. If the location already exists in the dataset, only the crime count and the most recent crime are updated. To create the heat map, the Google Maps API is used along with the bokeh package. A tooltip displays the location name, the number of crimes and the most recent crime that has taken place for a particular location.
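The whole fine-tuning recipe above can be condensed as in the following sketch, which uses the HuggingFace transformers library: BERT with a token-classification layer, AdamW with weight decay, a linear learning-rate scheduler, per-batch gradient clipping, and a softmax over the logits at validation time. The tag2idx dictionary, the train_loader of tokenized batches and the validation tensors are assumed to have been prepared beforehand; hyperparameter values are illustrative, not the paper's exact settings.

import torch
from transformers import BertForTokenClassification, get_linear_schedule_with_warmup

EPOCHS = 4   # the model in Sect. 5 was trained for 4 epochs

model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(tag2idx))          # adds the classification layer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0,
    num_training_steps=len(train_loader) * EPOCHS)       # lowers the lr as training runs

for epoch in range(EPOCHS):
    model.train()
    total = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        out = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
        out.loss.backward()
        # Gradient clipping: rescale gradients that have grown too large.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        total += out.loss.item()
    print(f"epoch {epoch}: average train loss {total / len(train_loader):.4f}")

# Validation: logits -> softmax -> highest-probability tag per token.
model.eval()
with torch.no_grad():
    logits = model(input_ids=val_ids, attention_mask=val_mask).logits
    pred_tags = torch.softmax(logits, dim=-1).argmax(dim=-1)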
5 Experimental Results The state of Telangana was selected as the subject for the analysis of the model. We first collected crime articles from the local English newspaper—Siasat. The model was trained for 4 epochs (Fig. 4).
Fig. 4 Learning curve of the model
5.1 Evaluation of Model

In Named Entity Recognition, the F1 score is used to evaluate the performance of trained models, and the evaluation is done per entity rather than per token. The following criteria have been used to assess the model:

F1 score. The F1 score is calculated as follows, where P is the precision and R is the recall. Our model has an F1 score of 83.877%.

F1 = 2 * ((P * R)/(P + R))    (1)

Precision is the number of True Positives divided by the total number of True Positives and False Positives. Precision can be regarded as a metric for how accurate a classifier is.

P = TP/(TP + FP)    (2)

Recall is the number of True Positives divided by the number of True Positives and the number of False Negatives. Recall can be thought of as a measure of a classifier's completeness.

R = TP/(TP + FN)    (3)
The precision, recall, and F1 score of each entity, as well as their support, rounded to 2 decimal places, are recorded in the classification report (see Table 2). Validation Accuracy. To measure how the model performs on unseen data, we consider the validation accuracy. Using BERT for the task, we obtained an accuracy of 96.14%.
Table 2 Classification report of the model

               Precision   Recall   F1 score   Support
B-eve          0.96        0.94     0.95       1131
B-geo          0.90        0.86     0.88       12,590
B-gpe          0.93        0.96     0.94       3429
B-org          0.73        0.77     0.75       7316
B-per          0.82        0.79     0.80       5890
B-tim          0.83        0.85     0.84       4555
Micro avg      0.84        0.84     0.84       35,094
Macro avg      0.71        0.73     0.72       35,094
Weighted avg   0.71        0.84     0.84       35,094
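Per-entity scores of this kind can be reproduced with the seqeval library, which evaluates complete BIO-tagged entities rather than individual tokens; the two tag sequences below are toy placeholders, not the model's actual outputs.

from seqeval.metrics import accuracy_score, classification_report, f1_score

true_tags = [["B-geo", "O", "B-per", "I-per", "O"]]   # gold labels, one list per sentence
pred_tags = [["B-geo", "O", "B-per", "O", "O"]]       # model predictions

print("F1:", f1_score(true_tags, pred_tags))
print("Accuracy:", accuracy_score(true_tags, pred_tags))
print(classification_report(true_tags, pred_tags))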
5.2 Visualization of Results

Extraction of Articles. As mentioned in Sect. 4.1, crime articles were extracted from the crime section of a local English newspaper called "Siasat". Packages such as "kora" and Selenium were used, and the script was scheduled to run weekly.

Extraction of Entities. The extracted data was fed to the model, which recognizes the "location" and "crime type" using NER and saves them in a dataset. The coordinates of these locations were obtained via geocoding and then fed to the Google Maps API for visualization.

Visualization of Results. The gathered data, namely the location and type of crime, are displayed on a dynamic heat map. Each hotspot has a color and size that change depending on the data collected. The size of a hotspot on the map is determined by the number of crimes, which helps the user judge how safe a region is in comparison with others. Aside from the circle's size, the color of the circle changes in proportion to the number of incidents (Fig. 5). A tooltip displays the location name, the number of crimes and the most recent crime that has taken place for a particular location (Fig. 6).
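A sketch of this geocode-and-plot step follows, using geopy's Nominatim wrapper and bokeh's Google Maps support; the API key, the example location and the size scaling are placeholders rather than the project's actual values.

from geopy.geocoders import Nominatim
from bokeh.models import ColumnDataSource, GMapOptions, HoverTool
from bokeh.plotting import gmap, show

geolocator = Nominatim(user_agent="crime-hotspots")      # OSM-backed geocoder
loc = geolocator.geocode("Ramagundam, Telangana")        # may return None if not found

source = ColumnDataSource(dict(
    lat=[loc.latitude], lng=[loc.longitude],
    name=["Ramagundam"], crimes=[12], recent=["theft"],
    size=[2 * 12],                                       # hotspot size scales with crime count
))

opts = GMapOptions(lat=17.38, lng=78.48, map_type="roadmap", zoom=7)
fig = gmap("GOOGLE_MAPS_API_KEY", opts, title="Crime hotspots")   # placeholder key
fig.circle(x="lng", y="lat", size="size", fill_alpha=0.6, source=source)
fig.add_tools(HoverTool(tooltips=[("Location", "@name"),
                                  ("Crimes", "@crimes"),
                                  ("Most recent", "@recent")]))
show(fig)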
6 Conclusion and Future Scope The method described in this paper uses online crime news articles to assess city regions in terms of crime rate in order to determine how secure a location is. We utilized the CoNLL-2003 dataset to train the model, with some adjustments to fit the model.
Fig. 5 Heat map displaying crime hotspots collected via recent news articles
Fig. 6 Tooltip used for quick overview
The proposed model was built by fine-tuning the pre-trained, state-of-the-art transformer-based model BERT. The model employed the BIO tagging scheme to recognize the location and type of crime. Further, the prototype incorporates heat maps to visualize the number of crimes per location and display the crime hotspots. The model proposed in this paper achieved a validation accuracy of 96.14%, compared to the 91.2% of the model by Bondielli et al. [3]. The F1 score obtained by the proposed model is 83.87%, a significant improvement over the 79% of the previous work [3]. After validating the model, it was used to extract the crime location and the crime committed from real-world, real-time crime news articles for the state of Telangana.
Table 3 GPU time comparison

GPU model             Tesla T4   Tesla K80   Tesla P4
Training time (min)   59         120         93
Another essential factor to consider while training the model was the amount of time available. By exploiting the GPU supplied by the environment, the model's training time was lowered to under an hour, at 59 min. Three randomly allocated GPUs were evaluated while training the model (Table 3). Our first goal for the future is to improve the process of extracting crime types and locations. RoBERTa, which introduces dynamic masking so that the masked tokens change across training epochs, can be used to improve the model's performance. This prototype can be expanded into a full-fledged application with database connectivity to manage records, remove duplicates, and support queries about a specific crime to be presented on the map. Once the aforementioned improvements are made, it can be extended to additional Indian states.
References
1. Po L, Rollo F (2018) Building an urban theft map by analysing newspaper crime reports. In: 2018 13th international workshop on semantic and social media adaptation and personalization (SMAP), September 2018. https://doi.org/10.1109/SMAP.2018.8501866
2. Saldana M, Escobar C, Galvez E, Torres D, Toro N (2020) Mapping of the perception of theft crimes from analysis of newspaper articles online. In: 15th Iberian conference on information systems and technologies (CISTI). IEEE. https://doi.org/10.23919/CISTI49556.2020.9141154
3. Bondielli A, Ducange P, Marcelloni F (2020) Exploiting categorization of online news for profiling city areas. In: 2020 IEEE conference on evolving and adaptive intelligent systems (EAIS), May 2020. https://doi.org/10.1109/EAIS48028.2020.9122777
4. Das P, Das AK (2017) Crime analysis against women from online newspaper reports and an approach to apply it in dynamic environment. In: 2017 international conference on big data analytics and computational intelligence (ICBDAC). IEEE. https://doi.org/10.1109/ICBDACI.2017.8070855
5. Arulanandam R, Savarimuthu BTR, Purvis MA (2014) Extracting crime information from online newspaper articles. In: The second Australasian web conference (AWC 2014)
6. Thongsatapornwatana U (2016) A survey of data mining techniques for analysing crime patterns. In: 2016 second Asian conference on defense technology (ACDT). IEEE. https://doi.org/10.1109/ACDT.2016.7437655
7. Revathy K, Satheesh Kumar J. Survey of data mining techniques on crime data analysis. Int J Data Min Tech Appl 1:47–49. https://doi.org/10.20894/IJDMTA.102.001.002.006
8. Bsoul Q, Salim J, Zakaria LQ (2013) An intelligent document clustering approach to detect crime patterns. In: The 4th international conference on electrical engineering and informatics (ICEEI 2013). Elsevier. https://doi.org/10.1016/j.protcy.2013.12.311
9. Nasridinov A, Park Y-H (2014) A study on performance evaluation of machine learning algorithms for crime dataset. In: Networking and communication 2014. https://doi.org/10.14257/astl.2014.66.22
10. Ku CH, Iriberri A, Leroy G (2008) Crime information extraction from police and witness narrative reports. In: 2008 IEEE conference on technologies for homeland security. IEEE. https://doi.org/10.1109/THS.2008.4534448
11. Hassan M, Rahman MZ (2017) Crime news analysis: location and story detection. In: 20th international conference of computer and information technology (ICCIT), pp 1–6. https://doi.org/10.1109/ICCITECHN.2017.8281798
12. Jie Z, Lu W (2019) Dependency-guided LSTM-CRF for named entity recognition. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, pp 3862–3872
13. Luo L, Yang Z, Yang P, Zhang Y, Wang L, Lin H, Wang J (2018) An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics 34(8):1381–1388. https://doi.org/10.1093/bioinformatics/btx761
14. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: 31st conference on neural information processing systems (NIPS 2017). arXiv:1706.03762
15. Sun C, Qiu X, Xu Y, Huang X (2020) How to fine-tune BERT for text classification. arXiv:1905.05583v3
16. Yu S, Su J, Luo D (2019) Improving BERT-based text classification with auxiliary sentence and domain knowledge. IEEE Access 7:176600–176612. https://doi.org/10.1109/ACCESS.2019.2953990
17. Wang Y, Sun Y, Ma Z, Gao L, Xu Y, Sun T (2020) Application of pre-training models in named entity recognition. arXiv:2002.08902v1
Applications of Machine Learning for Face Mask Detection During COVID-19 Pandemic Sarfraz Fayaz Khan, Mohammad Ahmar Khan, and Rabiah Al-Quadah
Abstract The Covid-19 pandemic has forced us to adapt to a new lifestyle. The World Health Organization (WHO) recommends that people adhere to public health experts' guidelines to fight the spread of Covid-19. The most essential Covid-19 guideline has been the use of face masks, which has been enforced throughout the globe and has proven to contain the spread of the corona virus. The proposed study aims at examining the detection of mask usage by people through a machine learning approach. The research employs a binary classification problem to detect and separate people wearing masks from people not wearing masks. Three machine learning models, namely InceptionV3, VGGNet and ResNet, have been adapted in this research for pre-processing the input images. Similarly, XGBoost, Random Forest and fully connected DNN models have been used for decoding and classification. Performance evaluation has also been carried out for the different models, and a comparison of their performances is part of the research. The results showed that ResNet + Fully Connected DNN is the best among the developed models, with a precision of 99.73%, an accuracy of 99.7%, an F1 score of 99.69% and a recall of 99.66%. Keywords Machine learning · COVID-19 · Masks · Without masks · Binary classifications
S. F. Khan (B) SAT, Algonquin College, 1385 Woodroffe Ave, Ottawa, ON C211K2G1V8, Canada e-mail: [email protected] M. A. Khan Department of MIS-CCBA, Dhofar University Salalah, Salalah, Sultanate of Oman e-mail: [email protected] R. Al-Quadah Department of Computer Science and Software Engineering, Concordia University, Mackay Street, Montreal, QC 2070H3G 2J1, Canada e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_9
1 Introduction The COVID-19 pandemic has affected almost all countries across the globe. The number of infected and deceased patients has been increasing at an alarming rate in all the affected nations. Forecasting techniques can assist in designing efficient strategies and making productive decisions. Several public health studies have proven that wearing face masks is an effective physical intervention against disease transmission, which directly helps contain the spread of the virus. This research aims to detect the usage of masks by people during the COVID-19 pandemic. It is a binary classification problem in which we attempt to detect whether a person is wearing a mask or not. Covid-19 has become the new norm: people globally have been subject to psychological and physiological effects of the infectious disease [1], and prevention techniques such as wearing masks, washing hands, maintaining social distance and staying home [2] are currently practiced to minimize its impact. However, some people, whether because they follow their culture to the core or because they lack resources, either refuse to wear masks or do not have the opportunity to wear them [3]. Others believe that wearing masks has no effect against the corona virus, since doctors and staff get infected even while wearing them [4]; improper awareness and knowledge of how to wear masks have also led people to believe that masks cannot prevent diseases like the corona virus [5]. Hence, identifying people with and without masks has become a new focus of study through machine learning and deep learning, using algorithms, neural network models and classifiers. Research on face-image detection adopts techniques such as image resolution with classifiers along with neural networks, models like VGG-16, ResNet50, AlexNet, InceptionNet, etc., together with feature extraction and face detection algorithms/techniques such as MLP, NB, KNN and the decision tree J48 [6]. The existing techniques take faces or face images as inputs and detect the "region of interest" (ROI) specified by the researchers; for instance, in this study the ROI is "masks", used to identify and classify faces into with-mask and without-mask classes. The classifiers based on the algorithm recognize the people with and without masks and compare the results with the original inputs. Thus, the classifier divides the images, and the best model, with the highest accuracy, precision, recall and F1 score, is weighed and assessed so that it can be ranked and saved for future research. By developing a neural-network-based architectural model, this research contributes a more accurate and precise model that can identify and categorize face images with and without masks across all age groups and both genders through machine learning, especially during the pandemic, to minimize the risk of infection in areas without proper prevention methods. The existing relevant studies focus upon face images, identify the faces with and without masks, and categorize them through a
specific moderator. In this research, we aim at identifying, categorizing and storing the images with and without masks so that the data can be retrieved at any time in the future.
2 Literature Review Existing research on face identification and face recognition is reviewed here, with a particular focus on Covid-19's impact and on wearing masks as prevention. Wong et al. [7], Esposito et al. [8], Matuschek et al. [9] and Li et al. [10] have focused upon Covid-19 and the impact of wearing masks as a preventive technique against the infectious disease. These studies analyzed the effect and benefits of wearing masks during lockdown and found that masks were effective, contrary to the beliefs of people who lack knowledge of how to wear them properly. However, the studies also found that risks were higher when wearing masks that cover both nose and mouth, resulting in respiratory compromise, especially in patients with lung infections and other obstructive respiratory diseases. Nevertheless, wearing masks was proven to be an effective measure in close-contact and medical-care scenarios in infected regions. Hence, recent investigators have been focusing on developing algorithms and models to detect and classify people into with-mask and without-mask classes to prevent the further spread of the disease among the masses in places like hospitals, shops, educational institutions, transportation, etc. Dhankar [11] focused upon facial emotions and how they are recognized and classified through a combined VGG-16 and ResNet50 neural network model. The study mainly focused on 7 emotions (disgust, sadness, anger, happiness, surprise and fear, along with a 'neutral' emotion) and found that the model achieved an efficiency of 92.4%, higher than existing models, showing that the research was a success. Lin et al. [12] investigated face detection through image segmentation algorithms via Mask R-CNN. The study found that ResNet is more effective than VGGNet in computing speed. It aimed at identifying images with low light, faces in backgrounds and faces far from the bounding box. The findings show that the developed model improves the accuracy rate, keeps the spatial locations accurate and preserves the gradient. Khan et al. [13] studied and evaluated ResNet models and image recognition through performance assessment toward cancer identification. The study concluded that the ResNet model is efficient, a good fit for prediction tasks and rapid in computing with lower loss, which suits higher accuracy and precision, unlike models that produce greater loss. Daniya et al. [14] studied Gini Index (GI)- and regression-based decision tree methods along with classifiers to find the best applications in classification and weigh them by performance. The study revealed that GI in classification- and regression-based studies provides the degree of impurity
with a probability distribution, which increases the performance of neural-network-based models in varied applications, providing an optimal tree as the outcome. Cabani et al. [15] analyzed existing models and algorithms for detecting masked faces and classifying them into "correct" and "incorrect" classes, focusing on the mouth, nose and chin as detected landmarks and on 12 other marked areas as annotated landmarks for identifying how a mask is worn. According to the investigators [16], many people who wear masks either neglect to wear them properly (e.g., children aged 2–14 years and elderly people above 50 years) or wear them with discomfort (e.g., people with breathing diseases or suffering from suffocation) instead of protecting themselves by wearing the masks correctly. The authors therefore developed a model and algorithm to detect masked images and classify them into correct and incorrect mask-wearing classes in order to monitor and create awareness among the masses. The lack of adequate and successful face-mask-detection literature during Covid-19 has paved the way for the investigator to develop a model that detects face masks while aiming for higher performance metrics than existing studies. Thus, face recognition, face detection, image segmentation and region of interest are adopted as techniques in the object-identification-based algorithm developed for detection. Classifiers like XGBoost, Random Forest, etc. in DNN- and CNN-based models have been shown to increase the efficiency of face recognition models built with ResNet, AlexNet, InceptionNet, VGGNet and so on. This study adopts the respective architectures and classifiers to detect and classify the images into with-mask and without-mask classes through the developed models.
3 Research Background on Developed Algorithm

The following approaches are adopted here for rapid, efficient, accurate and precise estimation/prediction through classifier-based neural network models. This research takes face images as input rather than performing real-time face detection, so the process varies slightly from existing research. The CNN architectures adopted for model development are:
3.1 VGGNet

Among CNN models, VGGNet (VGG-16) is generally identified as a deeper network capable of extracting higher-level features than other models [17]. The 16-layer VGGNet used by researchers is an improvement over AlexNet, replacing the larger 7 * 7 and 5 * 5 convolution kernels of earlier networks with stacks of smaller ones [18]. VGGNet is generally considered rapid and efficient with an SVM classifier for face detection and classification.
3.2 ResNet

Though VGGNet has wide scope and was initially identified as the new trend, ResNet50 has since become the new trend: it has deeper layers, is efficient, and can be more accurate than VGGNet [19]. ResNet50 is also considered dynamic, rapid in computing, compatible and robust throughout the convolution processes across its many (deeper) layers. Although its deeper layers enable refined filtering of images and the pooling-out of impurities, its drawbacks are the loss of original image resolution and vanishing gradients; these can be overcome by combining ResNet with classifiers and activation functions.
3.3 InceptionNet

InceptionNet V3 has recently been adopted by researchers for studies such as face identification, face recognition and facial expression analysis. Its basic role in such work is dimensionality reduction. Although the accuracy of the InceptionNet V3 model is high, compared with ResNet and VGGNet it offers investigators speed and good overall performance rather than precision [20].

The pre-processing of the face images as inputs is done through the following steps:

Step 1 Acquire the inputs from the databank;
Step 2 Resize the images to 150 * 150;
Step 3 Split the datasets into train datasets and test datasets;
Step 4 Build the architectures and train them on the datasets, respectively;
Step 5 Label the images 'with masks' as "1" and 'without masks' as "0" classes;
Step 6 Import the classifier techniques for the developed model;
Step 7 Identify and detect the faces with and without masks; label the outcomes and classify them into respective folders for performance assessment of the models built;
Step 8 Based on the performance metrics, save the best model.

Note that the steps involved are the same for each model; only step 6 differs across the models, namely VGGNet, InceptionNet and ResNet, each paired with a Fully Connected DNN, a Random Forest classifier and an XGBoost classifier.
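The paper gives no source code for these steps; the following is a minimal Python sketch of Steps 1–3 and 5, assuming an OpenCV/scikit-learn workflow. The directory layout (data/with_mask, data/without_mask) and the split ratio are illustrative assumptions, not the authors' settings.

```python
# Hypothetical sketch of the pre-processing pipeline (Steps 1-3 and 5 above).
import os

import cv2
import numpy as np
from sklearn.model_selection import train_test_split

def load_dataset(root="data"):
    """Load and resize images to 150 x 150, labeling with-mask as 1, without-mask as 0."""
    images, labels = [], []
    for folder, label in [("with_mask", 1), ("without_mask", 0)]:
        for fname in os.listdir(os.path.join(root, folder)):
            img = cv2.imread(os.path.join(root, folder, fname))
            if img is None:          # skip unreadable files
                continue
            images.append(cv2.resize(img, (150, 150)))
            labels.append(label)
    return np.array(images), np.array(labels)

X, y = load_dataset()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # Step 3: train/test split
```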
4 Proposed Neural Network and Its Architectural Framework

The adopted neural networks, the developed system flowchart, the testing and training dataset approaches, and the encoding and decoding approaches are described below.
4.1 Proposed System

The research aims at detecting faces in input images from Kaggle, where the developed algorithm and models identify and classify the images through several steps (refer Fig. 1) to find the best model for the developed approach. The model could be reused for similar tasks, or extended with more advanced techniques into a hybrid model in the future. The proposed system consists of steps in which the images are acquired as inputs, pre-processed and divided into two datasets for testing and training. The datasets then pass through an encoding phase, for which the CNN models ResNet, InceptionNet and VGGNet are used. Subsequently, the encoded images from the three models are passed through classifiers such as the Random Forest
Fig. 1 System flowchart
classifier, the Fully Connected DNN and the XGBoost classifier for the detection and classification of images with and without masks. Later, based on the predicted and original image values, the best models are examined and weighed through performance scores (accuracy, precision, F1 score and recall), and the outcomes are assessed to validate the best model.
4.2 Data Acquisition

Research may be conducted with primary datasets (data acquired by the researcher through surveys, interviews and reports) together with secondary datasets (existing studies), or with secondary datasets alone, provided the data are reliable, valid and appropriate to the proposed aim and objectives. The data for the proposed examination were gathered from GitHub and Kaggle. The images were accessed for the research through the website: https://www.kaggle.com/omkargurav/face-mask-dataset. Exactly 1776 images from GitHub and 5777 from Google were acquired, then filtered and pre-processed. The total dataset comprises 7553 RGB images with and without masks (i.e. 3725 with and 3828 without masks), stored in two Kaggle folders for further access. The investigator searched Google for relevant images using keywords such as "Female Adult face images with masks", "Female Adult face images without masks", "Male Adult face images with masks", "Male Adult face images without masks", "Face images with Masks" and "Face images without Masks". The investigation initially targeted adults aged 21–50, but to train and test the model, all age groups (5 years and above) were finally included in the demographic profile of the datasets obtained.

Justification: The acquisition of the gathered datasets is justified by studies such as Gurucharan [21] and Loey et al. [22], which likewise used face images from secondary sources to detect faces with and without masks through CNN models and algorithms.
4.3 Testing and Training Datasets

The testing of the developed model and the training of the datasets proceed through the following steps. Figure 2 represents the testing and training of datasets in the proposed research: the images are filtered/cleansed (pre-processed), passed through the encoder models VGGNet, InceptionNet and ResNet50, and then passed via the decoders (Random Forest classifier, XGBoost classifier and Fully Connected DNN) for prediction/estimation.
[Flowchart: input face-images as dataset → pre-processing of images → encoding approaches (VGGNet, ResNet and InceptionNet) → decoding approaches (Fully Connected DNN, Random Forest and XGBoost classifier) for predictions → classifying the images produced through the models' predictions → comparing the outcomes to finalize the best model]
Fig. 2 Training approach on acquired datasets
The outcomes are weighed through the performance metrics of accuracy, recall, precision and F1 score. Based on the outcomes, the best model is chosen and saved.
4.4 Encoding Approaches

(a) VGGNet: The VGG-16 architecture consists of 13 convolutional layers and 3 fully connected layers, with 5 max-pooling layers interleaved. The convolutions use small 3 * 3 filters. The CNN layers are designed such that the first block has depth 64, the second 128, the third 256, and the fourth and fifth 512 each, with max-pooling filters in between. At the end, three layers are designed: two fully connected DNNs (4096 and 1000 units) and a softmax layer for classification. Input images are passed through this framework after pre-processing via the weight layers. Once the datasets are obtained from the training phase, they pass through the CNN layers with ReLU activation, are filtered, then pass through a fully connected DNN detection stage and finally through the softmax classifier. Thus the images are obtained and filtered, mask detection is performed, and the output is classified by this model.

(b) ResNet50: The ResNet50 architecture comprises four stages, where the input image attributes (height and width, along with a channel width of 3; for instance 224 * 224 * 3 as shown in the figure, although in this study the images are sized 150 * 150 * 3) form the first convolutional process. The initial CNN process in ResNet50 starts with a 7 * 7 convolution and 3 * 3 pooling kernel sizes.
Following this process, three residual blocks form the subsequent layers of stage 1, with 3 layers each, where the kernels are of sizes 64, 64 and 256. The curved arrows represent identity connections, whereas the dashed arrows represent convolution operations in the residual blocks (RB). Each RB is executed with stride 2, so the input's height and width are halved while the channel width doubles as the stages progress. Once the process reaches the final average-pooling layer and the fully connected DNN (1000 neurons), the outcome is obtained.

(c) InceptionNet: This is a CNN architecture belonging to the Inception family, introduced by Szegedy et al. [23]. It includes asymmetric and symmetric blocks consisting of convolutions, max-pooling, average-pooling, dropouts, concatenations, softmax and fully connected layers. Softmax is utilized in this model to estimate the loss, and the InceptionNet model has been found to reach a 78.1% accuracy rate at 170 epochs [24]. The model has been identified as efficient in face recognition and image identification/classification research, and the InceptionNet model is therefore adopted for the proposed research.
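As a rough illustration of this encoding stage, the three backbones can be loaded with ImageNet weights and used as fixed feature extractors. The paper does not give the exact cut points or settings, so the choices below (global average pooling, frozen weights, Keras applications API) are assumptions.

```python
# Sketch: the three CNN backbones as frozen encoders over 150 x 150 x 3 inputs.
from tensorflow.keras.applications import VGG16, ResNet50, InceptionV3
from tensorflow.keras.applications.vgg16 import preprocess_input as vgg_pre
from tensorflow.keras.applications.resnet50 import preprocess_input as res_pre
from tensorflow.keras.applications.inception_v3 import preprocess_input as inc_pre

SHAPE = (150, 150, 3)  # the image size used in this study
ENCODERS = {
    "vggnet": (VGG16(weights="imagenet", include_top=False,
                     pooling="avg", input_shape=SHAPE), vgg_pre),
    "resnet": (ResNet50(weights="imagenet", include_top=False,
                        pooling="avg", input_shape=SHAPE), res_pre),
    "inceptionnet": (InceptionV3(weights="imagenet", include_top=False,
                                 pooling="avg", input_shape=SHAPE), inc_pre),
}

def encode(name, images):
    """Return one pooled feature vector per image from the chosen backbone."""
    model, pre = ENCODERS[name]
    return model.predict(pre(images.astype("float32")))
```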
4.5 Decoding Approaches

(a) Fully Connected DNN (Deep Neural Nets): In the FCDNN, a series of layers is connected such that every output dimension depends on every input dimension. Through this technique, the processed images are decoded and compared against the original images to estimate reliability and accuracy. In this research, the ResNet, InceptionNet and VGGNet models are combined with the deep net layers, so the images are passed through the layers for efficient processing and filtered outcomes with higher accuracy. The FCDNN layers follow the convolutions in the developed models, and the images are thus processed, filtered, trained, classified and detected as outcomes through decoding in Python.

(b) Random Forest Classifier: The RFC represents a group of decision trees built on randomly chosen training subsets. From the obtained classes, the RFC combines the votes from the various decision trees to assign the test objects to a final class. Here the RFC performs the following processes:

Step 1 Choose random samples from the given datasets;
Step 2 Build a decision tree for every sample and acquire a prediction outcome from each tree;
Step 3 Conduct a voting process over the predicted outcomes;
Step 4 Select the prediction with the most votes and save the model.
(c) XGBoost Classifier: The XGBC is also a decision tree based ML technique in Python, consisting of decision trees with gradient boosting aimed at performance and speed. It is therefore adopted in the research to classify and detect images rapidly and to outperform the other models. Thus, by encoding and decoding the images with the developed models, the research aims at detecting and classifying images with and without face masks. These three architectures were selected because ResNet50, VGG-16 and InceptionNet V3 are ascertained to be among the most accurate and precise NN architectures. They are combined with the FC-DNN, XGBoost and Random Forest classifiers to determine the model with the highest accuracy and precision for identifying face images through segmentation and categorization by the developed algorithms. The research thus informs future investigators about the best NN architecture based model for image segmentation.
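Continuing the encoding sketch above, the decoding/classification stage can be approximated as follows. The hyperparameters are illustrative rather than the paper's actual settings, and encode, X_train, y_train, etc. come from the earlier sketches.

```python
# Sketch: Random Forest and XGBoost heads on top of the encoded features.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

feats_train = encode("resnet", X_train)
feats_test = encode("resnet", X_test)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(feats_train, y_train)

xgb = XGBClassifier(n_estimators=100, eval_metric="logloss")
xgb.fit(feats_train, y_train)

print("RF accuracy: ", rf.score(feats_test, y_test))
print("XGB accuracy:", xgb.score(feats_test, y_test))
```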
5 Statistical Techniques and Software Adopted

5.1 Statistical Tools

(a) Confusion matrix: A 2 * 2 confusion matrix has been adopted here to derive the scores from the classification of the images. The confusion matrix contains TN (True Negative), TP (True Positive), FN (False Negative) and FP (False Positive) values; the total number of samples is divided into estimated/predicted values against the original values. In this research, "1" represents the masks class and "0" the without-masks class in the heat-map confusion matrix, where 0:0 represents True Negative, 0:1 False Negative, 1:1 True Positive and 1:0 False Positive on the x:y axes respectively. The accuracy, precision, recall and F1 score are calculated through the standardized formulae:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
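A short sketch of how the four metrics follow from the 2 * 2 confusion matrix; scikit-learn is used purely for illustration, and y_test and the fitted rf come from the classification sketch above.

```python
# Computing accuracy, recall, precision and F1 from the confusion matrix entries.
from sklearn.metrics import confusion_matrix

y_pred = rf.predict(feats_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
```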
Based on the above metric estimation, the performances of the developed models are weighed and assessed to rank the best model.

(b) Categorical Cross-Entropy: Categorical cross-entropy in neural network oriented classification, especially in face recognition research, has proved effective and efficient in providing loss values that maximize the probability of the predicted values [25]. The following formula is used to calculate the loss per epoch:

$$-\sum_{x=1}^{N} y_{i,x} \log m_{i,x} \quad (1)$$
(c) Gini Index: Research on face images over the past 20 years has advanced with improved image resolutions and color variants to identify and classify gender, age, ethnicity and emotions; researchers have therefore been developing more varied algorithms with improvements and more rapid estimation and classification. The following formula is utilized in this research:

$$\text{Gini Index} = 1 - \sum_{i=1}^{m} (N_i)^2 \quad (2)$$
The GI, known as the decision tree classification technique, basically labels every node with a class; likewise, the nodes, including both non-terminal internal nodes and root nodes, are classified and divided into varied categorizations based on their attribute test conditions. Splitting datasets is normally driven by the impurity degrees of the child nodes [26].

(d) Information Gain: The information gain of the attributes and variables involved is calculated from the total number of samples, the child nodes and the impurity criterion. The following formula is used to estimate the IG:

$$IG(A_n, d) = C(A_n) - \frac{S_{\text{left}}}{S}\, C(A_{\text{left}}) - \frac{S_{\text{right}}}{S}\, C(A_{\text{right}}) \quad (3)$$

where
d = feature split-on;
A_n = parent node dataset;
A_left = left child node dataset;
A_right = right child node dataset;
C = impurity criterion (entropy or Gini index);
S_left = total samples in the left child node dataset;
S_right = total samples in the right child node dataset;
S = total samples.
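Equations (2) and (3) can be computed directly. The sketch below is a generic illustration on a toy label array, not code from the paper.

```python
# Gini impurity (Eq. 2) and information gain (Eq. 3) for a binary split.
import numpy as np

def gini(labels):
    """1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """IG = C(parent) - (S_left/S) C(A_left) - (S_right/S) C(A_right)."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

labels = np.array([1, 1, 0, 0, 1, 0])  # toy parent node
print(information_gain(labels, labels[:3], labels[3:]))
```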
Thus the statistical tools have been adopted and used in the study for face image inputs, through the identification and detection of masks, with "with masks" and
"without masks" as the classes under focus, through the developed neural network models and binary classification based algorithm techniques.
5.2 Software Used

The software used in the research comprises Python (language), an object detection API, Anaconda (platform), and TensorFlow and Keras (libraries). Development was done in the Jupyter Notebook environment. Machine learning (ML) together with deep learning (DL) in neural networks (NN) is a recent trend in the research literature, where the advanced technologies have differing characteristics and attributes. Among programming languages, Python is known for its simplicity, efficiency, adaptability and availability. TensorFlow with Keras works as an effective real-time identification platform that can identify and classify multiple objects rapidly, especially in a single frame [27]. Python is extensible, modular, user-friendly and compatible with Theano, the Microsoft Cognitive Toolkit and TensorFlow [28]. Object detection and classification were done in Jupyter Notebook. The performances of the various models are also compared as part of the work, and the accuracy, precision, recall and F1 score values obtained are plotted graphically using the Matplotlib library to visualize the results clearly and identify the best model.
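The grouped bar charts in Figs. 4–6 can be reproduced with a few lines of Matplotlib; the sketch below uses the ResNet rows of Table 3 and is only one possible plotting layout, not the authors' plotting code.

```python
# Grouped bar chart of the ResNet performance metrics (values from Table 3).
import matplotlib.pyplot as plt
import numpy as np

metrics = ["Accuracy", "Precision", "Recall", "F1 Score"]
fc_dnn = [99.7, 99.73, 99.66, 99.69]
random_forest = [99.53, 99.4, 99.66, 99.53]
xgboost = [99.6, 99.53, 99.66, 99.60]

x = np.arange(len(metrics))
plt.bar(x - 0.25, fc_dnn, width=0.25, label="Fully Connected DNN")
plt.bar(x, random_forest, width=0.25, label="Random Forest")
plt.bar(x + 0.25, xgboost, width=0.25, label="XGBoost")
plt.xticks(x, metrics)
plt.ylim(98.5, 100)
plt.ylabel("Score (%)")
plt.legend()
plt.show()
```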
6 Results

Based on the adopted techniques and methodologies, the results are presented in three sections: data analysis, findings and confusion matrix outcomes.
6.1 Data Analysis

The research attempted identification and classification through an object detection technique combined with binary classification. People with and without masks were targeted as objects and classified by the developed machine learning models, i.e. deep learning oriented neural networks. The gathered datasets were cleansed, pre-processed and split into testing datasets, on which the algorithm was tested for effectiveness, and training datasets, on which it was then applied for improvement. The following images (refer Fig. 3) are examples from the testing phase.
Fig. 3 Testing datasets
The same method and technique were used for training the datasets, and the results were obtained through the developed models to estimate the precision, accuracy, recall and F1 scores.
6.2 Findings

Through the developed models, the outcomes were obtained as "with masks" and "without masks" classifications (refer Fig. 6), where the accuracy and precision were high; the developed models are thus effective and successful. The best of the three developed models is assessed and weighed through its scores and performance outcomes in the latter section (Figs. 4 and 5).

[Bar chart: performance metrics (accuracy, precision, recall, F1 score) of the Fully Connected DNN combined with VGGNet, InceptionNet and ResNet; the values are listed in Table 3.]
Fig. 4 Performance metrics of fully connected DNN with VGGNet, ResNet and InceptionNet
[Bar chart: performance metrics of the Random Forest classifier combined with VGGNet, InceptionNet and ResNet; the values are listed in Table 3.]
Fig. 5 Performance metrics of random forest with VGGNet, ResNet and InceptionNet
[Bar chart: performance metrics of the XGBoost classifier combined with VGGNet, InceptionNet and ResNet; the values are listed in Table 3.]
Fig. 6 Performance metrics of XGB classifier with VGGNet, ResNet and InceptionNet
Inference: From Table 1 it is clear that the developed model efficiently classifies the input face images into the two classes desired/predicted by the researcher. Thus the developed algorithm is a good fit for the research and the models are successful.
Table 1 Predictions by the final best performing model

| Image | Predicted class |
|---|---|
| (sample image 1) | Mask |
| (sample image 2) | Without mask |
| (sample image 3) | Mask |
| (sample image 4) | Without mask |
| (sample image 5) | Mask |
| (sample image 6) | Without mask |
| (sample image 7) | Mask |
| (sample image 8) | Without mask |
| (sample image 9) | Mask |
| (sample image 10) | Without mask |
Table 2 Confusion matrix values for the classifier technique of developed CNNs

| S no. | Models | TP | FP | TN | FN |
|---|---|---|---|---|---|
| 1 | VGGNet + fully connected DNN | 1486 | 4 | 1496 | 14 |
| 2 | VGGNet + random forest classifier | 1489 | 12 | 1488 | 11 |
| 3 | VGGNet + XGBoost classifier | 1488 | 9 | 1491 | 12 |
| 4 | ResNet + fully connected DNN | 1495 | 4 | 1496 | 5 |
| 5 | ResNet + random forest classifier | 1495 | 9 | 1491 | 5 |
| 6 | ResNet + XGBoost classifier | 1495 | 7 | 1493 | 5 |
| 7 | InceptionNet + fully connected DNN | 999 | 9 | 991 | 1 |
| 8 | InceptionNet + random forest classifier | 998 | 13 | 987 | 2 |
| 9 | InceptionNet + XGBoost classifier | 998 | 9 | 991 | 2 |
6.3 Confusion Matrix Outcomes

Table 2 represents the values of the 2 * 2 confusion matrices. Inference: The confusion matrix outcomes show that the estimated TP and FN values of the ResNet model remained the same across the three classification techniques, with moderate differences in the TN and FP values. Thus the ResNet model is found to be more effective than the InceptionNet and VGGNet models in the proposed research.
6.4 Evaluation Metrics

The evaluation of the research is carried out through performance metrics, where the accuracy, F1 score, recall and precision are calculated for the three models (VGGNet, ResNet and InceptionNet), each with a Fully Connected DNN and with the Random Forest and XGBoost classifiers.

Performance Metrics: Table 3 represents the outcomes (accuracy, recall, F1 score and precision) of the adopted CNN and classifier models, used to assess the best model, i.e. the one that most efficiently classifies faces/images with and without masks. Figure 4 represents the outcomes of the Fully Connected DNN technique, comparing InceptionNet, VGGNet and ResNet. Figure 5 represents the outcomes of the Random Forest classifier, and Fig. 6 the outcomes of the XGBoost classifier, with the same three architectures compared for better results.
Table 3 Outcomes of the performance metrics

| S no. | Models | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|---|
| 1 | VGGNet + fully connected DNN | 99.4 | 99.7 | 99.0 | 99.3 |
| 2 | VGGNet + random forest classifier | 99.23 | 99.2 | 99.26 | 99.23 |
| 3 | VGGNet + XGBoost classifier | 99.3 | 99.39 | 99.2 | 99.29 |
| 4 | ResNet + fully connected DNN | 99.7 | 99.73 | 99.66 | 99.69 |
| 5 | ResNet + random forest classifier | 99.53 | 99.4 | 99.66 | 99.53 |
| 6 | ResNet + XGBoost classifier | 99.6 | 99.53 | 99.66 | 99.60 |
| 7 | InceptionNet + fully connected DNN | 99.5 | 99.10 | 99.9 | 99.5 |
| 8 | InceptionNet + random forest classifier | 99.25 | 98.71 | 99.8 | 99.25 |
| 9 | InceptionNet + XGBoost classifier | 99.45 | 99.10 | 99.8 | 99.45 |
Inferences: From Table 3 and the bar graphs (Figs. 4, 5 and 6), the best model for detecting faces with and without masks is the "Fully Connected DNN model with ResNet", whose outcomes are the most accurate (99.7%) and precise (99.73%), with a higher F1 score (99.69%) than the other models developed. However, the recall score of ResNet with FCDNN (99.66%) is lower than that of InceptionNet with Fully Connected DNN (99.9%); precision, accuracy and F1 score nevertheless remain highest in the ResNet with FCDNN model.
7 Conclusion

The research aimed at identifying and detecting faces from databank inputs; the algorithm was developed and trained on these inputs to classify the images as with or without masks. The developed models were based on ResNet, VGGNet and InceptionNet, each combined with a Fully Connected DNN, a Random Forest classifier and an XGBoost classifier. The study was primarily initiated to find and distinguish people wearing masks from those without; alternatively, future researchers could adopt this study for comparative analysis of the ratio of people with and without masks using a newly developed algorithm and CNN. As per the findings of the proposed research and developed architecture, the most reliable and accurate CNN architecture is ResNet, followed by InceptionNet and VGGNet respectively. Through the adopted statistical techniques, the outcomes were calculated and estimated with the performance metric evaluation technique and the 2 * 2 confusion matrix; the findings revealed that the Fully Connected DNN as classifier is the most reliable and accurate for detection, with 99.7% accuracy and 99.73% precision, while the F1 score was 99.69% and the recall score was
99.66%; the recall score was, however, higher (99.9%, almost 100%) for InceptionNet with the Fully Connected DNN. The study contributes facts and knowledge on developing a binary classifier for the detection and classification of face image inputs into with-mask and without-mask classes. It also points toward the development of advanced binary-classifier-based algorithms, where researchers could combine two or more (hybrid) techniques to build classifiers that identify and detect faces based on ethnicity, age, gender, etc. Thus, the developed algorithm and model were successful, with high accuracy and precision.

Data Availability The gathered data were acquired for research purposes through the link mentioned above and used for the proposed study as-is. They were secondary resources and were not gathered primarily by the research investigator.

Conflict-of-Interest (COI): The author declares that there is no COI in the proposed research.
References

1. Scheid JL, Lupien SP, Ford GS, West SL (2020) Commentary: physiological and psychological impact of face mask usage during the COVID-19 pandemic. Int J Environ Res Public Health 17(6655):1–12
2. Chua MH, Cheng W, Goh SS et al (2020) Face masks in the new COVID-19 normal: materials, testing and perspectives. Res Sci Partner J 1–40
3. Howard J, Huang A, Li Z et al (2020) Face masks against COVID-19: an evidence review. PNAS, pp 1–9
4. Qin B, Li D (2020) Identifying facemask-wearing condition using image super-resolution with classification network to prevent COVID-19. Sensors 20(5236):1–23
5. Leung NHL, Chu DKW, Shiu EYC et al (2020) Respiratory virus shedding in exhaled breath and efficacy of face masks. Nat Med 26(676):676–680
6. Dino HI, Abdulrazzaq MB (2020) A comparison of four classification algorithms for facial expression recognition. Polytech J 10(1):74–80
7. Wong SH, Jeremy YC, Leung C-H et al (2020) COVID-19 and public interest in face mask use. Am J Respir Crit Care Med 202(3):453–455
8. Esposito S, Principi N, Leung CC et al (2020) Universal use of face masks for success against COVID-19: evidence and implications for prevention policies. Eur Respir J. https://doi.org/10.1183/13993003.01260-2020. Accessed 27 Mar 2021
9. Matuschek C, Moll F, Fangerau H et al (2020) Face masks: benefits and risks during the COVID-19 crisis. Eur J Med Res 25(32):1–8
10. Li T, Liu Y, Li M, Qian X, Dai SY (2020) Mask or no mask for COVID-19: a public health and market study. PLoS ONE 15(8):e0237691, 1–17
11. Dhankar P (2019) ResNet-50 and VGG-16 for recognizing facial emotions. Int J Innov Eng Technol (IJIET) 13(4):126–130
12. Lin K, Zhao H, Lv J et al (2020) Face detection and segmentation based on improved mask R-CNN. Discr Dyn Nat Soc 2020:1–11
13. Khan RU, Zhang X, Kumar R, Opoku E (2020) Evaluating the performance of ResNet model based on image recognition. Conf Paper Electr Sci Technol 2020:86–90
14. Daniya T, Geetha M, Kumar KS (2020) Classification and regression trees with Gini index. Adv Math Sci J 9(10):8237–8247
15. Cabani A, Hammoudi K, Benhabiles H, Melkemi M (2020) MaskedFace-Net—a dataset of correctly/incorrectly masked face images in the context of Covid-19, pp 1–5
16. Haischer MH, Belifuss R, Hart MR, Opielinski L et al (2020) Who is wearing a mask? Gender, age and location-related differences during the COVID-19 pandemic. PLoS ONE 15(10):e0240785, 1–12
17. Guan Q, Wang Y, Ping B et al (2019) Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: a pilot study. J Cancer 10(20):4876–4882
18. Chen H, Haoyu H (2015) Face recognition algorithm based on VGG network model and SVM. J Phys: Conf Ser 1229:1–8
19. Li B, Lima D (2021) Facial expression recognition via ResNet-50. Int J Cogn Comput Eng 2:57–64
20. Maeda-Gutiérrez V, Galvan-Tejada CE, Zanella-Calzada LA et al (2020) Comparison of convolutional neural network architectures for classification of tomato plant diseases. Appl Sci 10(1245):1–15
21. Gurucharan MK (2020) COVID-19: face mask detection using TensorFlow and OpenCV. https://towardsdatascience.com/covid-19-face-mask-detection-using-tensorflow-and-opencv-702dd833515b. Accessed 27 Mar 2021
22. Loey M, Manogaran G, Taha MHN, Khalifa NEM (2021) Fighting against COVID-19: a novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection. Sustain Cities Soc 65:102600, 1–8
23. Szegedy C, Vanhoucke V, Ioffe S, Shlens J (2015) Rethinking the inception architecture for computer vision, pp 1–10. https://arxiv.org/pdf/1512.00567v1.pdf. Accessed 24 Mar 2021
24. Nelli A, Nalige K, Abraham R, Manohar R (2020) Landmark recognition using Inception-V3. Int Res J Eng Technol (IRJET) 7(5):6475–6478
25. Zhou Y, Zhang M, Wang X et al (2019) MPCE: a maximum probability based cross entropy loss function for neural network classification. IEEE Access 7:146331–146341
26. Tripathi PM, Verma K, Verma LK, Parveen N (2013) Facial expression recognition using data mining algorithm. J Econ Bus Manage 1(4):343–346
27. Zhao Z-Q, Zheng P, Xu S-T, Wu X (2017) Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst 1–21
28. Urkude V, Pandey P (2019) A deep machine learning neural network for real time object classification using Keras & Tensorflow. Int J Tech Innov Mod Eng Sci (IJTIMES) 5(7):266–271
A Cascaded Deep Learning Approach for Detection and Localization of Crop-Weeds in RGB Images Rohit Agrawal and Jyoti Singh Kirar
Abstract Weeds compete with crops in the fields, lowering crop yield with losses of up to 80%. Efficient use of chemical herbicides is desired to reduce harmful effects on the environment, which requires the location of the weeds to be known. In this paper, we present a deep learning approach capable of detecting and localizing weeds in RGB images, trained using the publicly available Open Sprayer dataset. The adopted methodology consists of a classification step using a pre-trained 2D convolutional neural network and a Random Forest classifier, which predicts the presence of weeds in an RGB image. If presence is predicted, the weeds are localized by cascading a segmentation step using a U-Net architecture. The proposed architecture can classify the presence of weeds in an image with an accuracy of 91.19% and predict the location of weeds by generating binary masks, with a mean Dice score of 0.879 on the publicly available Open Sprayer dataset. Keywords Semantic segmentation · Weed detection · Precision agriculture · Image recognition · Fully convolutional network · Random forest classifier
1 Introduction

Environmentally sustainable technologies such as sustainable natural resource management and precision agriculture are essential for holistic rural development. Precision agriculture is technology used to enhance farming techniques to achieve the desired crop production rate by collecting field data in a non-destructive way, analyzing the data, and thus making smarter, implementable decisions in the field. Among these practices, site-specific weed management is effective

R. Agrawal, Shiv Nadar University, Greater Noida, Uttar Pradesh 201314, India, e-mail: [email protected]
J. Singh Kirar (B), Banaras Hindu University, Varanasi, Uttar Pradesh 221005, India, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_10
in decreasing herbicide costs, optimizing weed control, and preventing unnecessary environmental contamination [15, 30]. Controlling weeds is an important task because they compete with crops in the field, contributing to lower crop yield, with overall losses estimated at up to 80% [32]. Chemical weed control is the treatment preferred by most conventional farmers, owing to the greater precision required by mechanical and thermal weed control. However, excessive use of herbicides can have hazardous effects on the environment; thus, to limit the use of herbicides, farmers are required to inspect their fields manually and apply only the dosages their fields need. Identifying the target area in a vast field is a laborious task. This paper explores a deep learning approach for automating this task, capable of detecting the location of weeds and thus reducing the workload of farmers.
2 Literature Review

A number of research works have been reported in this area. Herrera et al. [17] proposed a method for discriminating between monocot and dicot weeds based on color segmentation, morphological operations, and well-known shape descriptors and classifiers, which are common operations in image processing. Oliveira et al. [31] also used morphological operations and the Hough Line Transform to detect rows of crops in RGB images captured by Unmanned Aerial Vehicles (UAVs). To extract more information, NIR channels were added and explored by Kodagoda et al. [20], and weed classification was performed based on spectral properties; multispectral images have similarly been used in [34]. Kumar et al. [23] proposed a modified intuitionistic fuzzy c-means algorithm (MIFCM), which achieved superior performance compared to other methods. With the advent of machine learning for better pattern recognition, a new family of methods emerged for image classification. Kaur et al. [19] performed segmentation of weeds in images using the K-means algorithm [25], followed by rule-based classification using Support Vector Machines (SVM). A comparison of traditional machine learning methods for the analysis of plant diseases was performed in [2]. Convolutional Neural Networks (CNNs) have been used extensively for image analysis ever since their superior performance in the ILSVRC challenge of 2012 [22]. CNNs came to be used heavily as feature extractors for the classification of weeds [1, 5] due to their ability to extract complex patterns from data; these features were then used for classification with traditional machine learning techniques such as SVMs and Random Forests. With further advances in image analysis, pixel-level problems such as object localization and semantic segmentation began to be explored. Fully convolutional networks (FCNs) were proposed as encoder-decoder CNN architectures that performed very well at pixel-level classification for semantic segmentation tasks, and were thus used extensively for creating weed cover maps [18, 27, 28, 35]. One of the most popular FCN architectures is known as SegNet [4].
The work proposed in [3, 8] brought together methods from deep CNNs and probabilistic graphical models to address the task of semantic segmentation: the responses at the final DCNN layer were combined with a fully connected Conditional Random Field (CRF) model. The method was tested on the PASCAL VOC 2012 semantic image segmentation task, reaching 71.6% IoU accuracy on the test set; however, the impact of the CRF model on the results depended highly on the quality of the CNN's output prediction. More recent approaches to weed localization use the U-Net architecture [12, 33] instead of SegNet due to its superior performance. In addition to RGB image data, a depth dimension has also been explored in [13] for the detection and localization of weeds, which improved performance significantly. Motivated by the above research, this paper proposes a deep learning based model for the detection and localization of docks (weeds) in an image. The image data are pre-processed and fed into a pre-trained 2D Convolutional Neural Network (CNN). The CNN works as a feature extractor capable of extracting complex features from an RGB image. Binary classification is performed on the extracted features using the Random Forest classifier [24]; its output indicates which of the two classes, dock (weed) or not dock (not weed), is present in an RGB image. If an image is predicted to belong to the dock (weed) class, it is further processed by a U-Net model that performs semantic segmentation on the RGB image and generates a binary mask over the predicted location of the weeds. An improvement to the segmentation output is also attempted by post-processing it with a fully connected Conditional Random Field (DenseCRF) model [21].
3 Proposed Work The flow diagram of the proposed methodology is shown in Fig. 1. The whole architecture is divided into the following steps as defined below.
Fig. 1 Proposed work flow diagram
Fig. 2 Samples from the open sprayer dataset
(a) Image with weeds. (b) Image without weeds.
3.1 Data Description and Pre-processing

In this work, the publicly available Open Sprayer Dataset [14] is considered. It consists of broad-leaved dock images captured in a field with extensive grass cover. We consider the grass as the "crop" and the undesirable leaves as the "weeds". Samples from the chosen dataset are shown in Fig. 2. The dataset contains images belonging to two classes, dock (weed) and not dock (not weed), of size (256, 256, 3), where the values correspond to the image height, image width and number of channels (RGB) respectively. The dataset is unbalanced, as the number of images containing weeds is almost three times smaller than the number of images without weeds. Therefore, the images containing weeds are augmented with a mixture of horizontal flipping, rescaling and translation to generate extra samples. Each image is further cropped and resized to a fixed size of (250, 250, 3) to match the required input dimensions of the pre-trained 2D CNN used below. Finally, each image is pixel-wise scaled to the range [−1, 1] using Eq. 1, since the InceptionV3 architecture discussed below requires inputs scaled to this range [38]:

$$x_{\text{norm}} = 2 \times \frac{x}{255} - 1 \quad (1)$$
where x is the value of a pixel in one channel of an image and x_norm is the corresponding value scaled to the range [−1, 1].
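Equation 1 amounts to a one-line NumPy operation; the sketch below is an illustration, not the authors' code.

```python
# Pixel-wise scaling of uint8 images from [0, 255] to [-1, 1], as in Eq. 1.
import numpy as np

def scale_to_unit_range(image):
    return 2.0 * (image.astype(np.float32) / 255.0) - 1.0
```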
3.2 Weeds Classification Step

The weed classification step consists of two substeps: (1) extracting features from the input image, and (2) performing binary classification on the extracted features to distinguish between docks and not docks.

Feature Extraction. Convolutional Neural Networks (CNNs) are a very powerful method for extracting complex features from images; however, they can be very computationally expensive to train from scratch. This work uses a technique called transfer learning.
Fig. 3 Comparison of various pre-trained CNN architectures [7]
Transfer learning uses the weights of a pre-trained CNN as the initial weights of a new CNN. The InceptionV3 model is used as the pre-trained CNN due to the good trade-off between accuracy and inference time it offers, as shown in Fig. 3. It is trained on the huge ImageNet image database [10], which makes it proficient at image recognition over a wide variety of images. To use InceptionV3 as a feature extractor, the fully connected layers at the end, which contain weights trained on the ImageNet database, are removed. This leaves the average-pooling layer as the final layer, which outputs the image features as a vector of size 2048. These features can then be fed into any classifier of our choice.

Classification. In this work, detecting the presence of weeds in an image is formulated as a binary classification problem, since we deal with only two classes: weed and not weed. Binary classification is performed on the extracted features using a Random Forest classifier. Random Forests have advantages over other methods: the bagging technique helps de-correlate the numerous decision trees and makes the averaged result more robust. The performance of the classifier is evaluated using Accuracy, Area under the Receiver Operating Characteristic curve (AUC-ROC score), Average Precision, and F1-Score.
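A hedged sketch of this classification pipeline: InceptionV3 with the top removed yields 2048-dimensional features that feed a Random Forest. The variables train_images, train_labels, test_images and test_labels are placeholders, the scale_to_unit_range helper comes from the earlier sketch, and the hyperparameters are assumptions.

```python
# InceptionV3 (ImageNet weights, top removed, global average pooling) as a
# 2048-d feature extractor, followed by a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.applications import InceptionV3

extractor = InceptionV3(weights="imagenet", include_top=False,
                        pooling="avg", input_shape=(250, 250, 3))

train_feats = extractor.predict(scale_to_unit_range(train_images))
test_feats = extractor.predict(scale_to_unit_range(test_images))

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(train_feats, train_labels)
print("Test accuracy:", clf.score(test_feats, test_labels))
```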
3.3 Weeds Segmentation Step

The training for the weed segmentation step consists of two substeps: (1) creating binary pixel masks for each image containing weeds (OpenSprayerSeg), and (2) training fully convolutional models to perform semantic segmentation and predict the location of weeds in an image.
Table 1 Samples from the constructed dataset (OpenSprayerSeg)
[Sample images: Original Image 1, Original Image 2, Original Image 3]
An additional improvement to the segmentation outputs, via post-processing with a DenseCRF model, has also been explored.

OpenSprayerSeg: Dataset for Semantic Segmentation. The Open Sprayer Dataset [14] considered in this paper does not contain the pixel-wise annotated masks required to perform semantic segmentation, so we manually created a separate dataset (OpenSprayerSeg). It builds upon the Open Sprayer Dataset by taking 4172 RGB images containing weeds and creating corresponding binary masks. The binary masks are 2-dimensional arrays of size 256 × 256 whose values are either 1 or 0, where 0 corresponds to crop pixels and 1 to weed pixels. The Open Sprayer Dataset contains only 1173 images with weeds, so these images and their masks were augmented by horizontal flipping, vertical flipping and rotation to reach the 4172 samples. The masks were created using GIMP [37], a free and open-source image manipulation program. Our dataset is publicly available at https://github.com/agrawal-rohit/OpenSprayerSegDataset. Samples from the created dataset are shown in Table 1.

Semantic Segmentation using the U-Net architecture. To perform semantic segmentation, this study uses the U-Net architecture [33], which builds upon the concepts of several fully convolutional networks (FCNs). It comprises several convolutional
layers stacked in an encoder-decoder format. The encoder gives a compressed feature representation of the image by downsampling it with pooling operations; the decoder then applies transposed convolutions to the compressed features to upsample the image back to its original size, which provides the segmented output. The model takes an RGB image of size (256, 256, 3) as input, classifies each pixel of the input image as belonging either to weeds or to crops, and outputs a binary segmentation mask of size (256, 256, 1). The segmentation generated by the U-Net is further fed into a DenseCRF model to improve its quality using surrounding pixel values. In this work, the model is optimized by gradient descent to minimize the Dice loss given in Eq. 3. The Dice loss is computed by inverting the Dice score (Eq. 2), which measures the overlap between the prediction and the ground truth:

$$D_{\text{score}} = \frac{2 \sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2} \quad (2)$$

$$D_{\text{loss}} = 1 - D_{\text{score}} \quad (3)$$
where N represents the total number of pixels in each image (256 × 256 = 65536), and p_i and g_i are corresponding pixel values of the prediction and the ground truth, respectively. The values of p_i and g_i are either 0 or 1, indicating whether the pixel corresponds to weeds (value of 1) or not (value of 0).
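One way to implement Eqs. (2)-(3) as a Keras-compatible loss is sketched below; the smoothing constant is an implementation detail added to avoid division by zero and is not part of the paper's formulation.

```python
# Dice loss over flattened binary masks, following Eqs. (2)-(3).
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    y_true = tf.cast(tf.reshape(y_true, [-1]), tf.float32)
    y_pred = tf.cast(tf.reshape(y_pred, [-1]), tf.float32)
    intersection = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(tf.square(y_true)) + tf.reduce_sum(tf.square(y_pred))
    dice_score = (2.0 * intersection + smooth) / (denom + smooth)
    return 1.0 - dice_score
```

Such a function can be passed directly to a Keras model, e.g. model.compile(optimizer="adam", loss=dice_loss).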
4 Results and Analysis

In this paper, a cascaded deep learning approach is proposed that is capable of detecting the location of weeds in RGB images of a crop field. The first subsection describes the experimental setup; results are then reported in the subsequent subsections for classification and for semantic segmentation.
4.1 Experimental Setup

A methodology proposed in [12] used a segmentation neural network to extract regions of crops and weeds (blobs), which were further classified by another CNN. The authors collected data from a tilled sunflower farm in Italy, as shown in Fig. 4. The images in that data have sufficient separation between the areas of interest (crops and weeds), which makes it easier to extract information from them; datasets of this type [16] are therefore usually considered for addressing the problem of crop-weed localization. However, soil tilling has been identified as one of the biggest contributors to soil degradation, which has led farmers to adopt no-till farming
Fig. 4 Sunflower farm dataset used in [12]
practices for sustainable agriculture, due to their added benefits [11]. This results in fields where separating crops from weeds is not easily possible: grass and other plant growth may cover major portions of the field and act as additional sources of noise. The Open Sprayer Dataset we propose to use emulates such no-till fields, in contrast to the other datasets in this domain, which correspond to tilled fields. The Open Sprayer Dataset is split into a training set of 6,027 images and a test set of 670 images, used respectively to train and evaluate the Random Forest classifier. The training set contains 4851 not-dock images (images without weeds) but only 1176 dock images (images containing weeds); augmenting the under-represented weed images yields 4475 dock samples alongside the 4851 not-dock samples. All experiments are performed on an Nvidia GTX 1060 graphics card.
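The class balancing described above could be done, for example, with Keras' ImageDataGenerator. The flip/shift/zoom ranges below are stand-ins for the paper's "horizontal flipping, rescaling, and translation", and dock_images is a placeholder array of the minority-class images.

```python
# One round of augmentation for the minority dock (weed) class.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(horizontal_flip=True,
                               width_shift_range=0.1,   # translation
                               height_shift_range=0.1,  # translation
                               zoom_range=0.1)          # rescaling

flow = augmenter.flow(dock_images, batch_size=len(dock_images), shuffle=False)
augmented = next(flow)  # one randomly transformed copy of every dock image
```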
4.2 Classification

The performance of the Random Forest classifier is compared with the following classifiers: Support Vector Machines [36], Decision Trees [6], and a fully connected neural network layer with softmax activation. The softmax activation converts the outputs of the network into probabilities, which indicate which of the two classes, dock (weed) or not dock (not weed), is present in an image. Decision Trees are the simplest among the various tree-based methods used for predictive analysis: they are non-parametric, require little data cleaning, and are not influenced by outliers, but they tend to have low predictive power. SVMs attempt to find a hyperplane that separates the data, turning linearly inseparable data into linearly separable data using various kernels. A comprehensive comparison of such supervised methods on different datasets was conducted in [29]. The performance of the four classifiers on the test set is reported in Table 2; the ROC curves and Precision-Recall curves are plotted in Fig. 5a, b respectively. Random Forest outperforms the others on all measures, performing slightly better than SVM and Decision Trees, while SVM and Decision Trees show comparable performance in terms of AUC score, F1-score and precision.
Table 2 Performance of weed classification step

| Feature extractor + Classifier | Test Accuracy (%) | AUC-ROC Score | F1-score | Average Precision |
|---|---|---|---|---|
| CNN + Support Vector Machine | 55.22 | 0.879 | 0.944 | 0.957 |
| CNN + Dense Layer + Softmax Activation | 90.59 | 0.782 | 0.641 | 0.905 |
| CNN + Random Forest | 91.19 | 0.890 | 0.946 | 0.957 |
| CNN + Decision Trees | 90 | 0.853 | 0.937 | 0.936 |
Fig. 5 Performance curves: a ROC curves, b Precision-Recall curves
From the ROC curves in Fig. 5, it can be observed that the curve of the Random Forest is more bowed to the left than the other curves, thus being closest to the performance of a perfect model. The ROC curves for SVM and Random Forest start out very similar; however, the SVM curve falls off after TPR = 0.3, where the increase in FPR for every increment in TPR grows, so it lies below the ROC curve of Random Forest. Both curves saturate at a value of TPR = 1 after FPR = 0.5. The softmax classifier performs worst, with the lowest ROC curve of all the classifiers, saturating at TPR = 1 only after FPR = 0.7. Figure 5 also displays the Precision-Recall curves, where Random Forest and SVM are closest to the ideal model performance: both curves slope downward and lie very close to each other, with the Random Forest curve slightly higher. The Random Forest curve ends at Precision = 0.91, while the SVM curve is just below it at Precision = 0.9. The softmax classifier again performs worst, with the lowest Precision-Recall curve of all: it decreases suddenly at Recall = 0.01, increases until Recall = 0.08, then remains roughly constant at Precision = 0.9, and finally decreases to Precision = 0.87.
From our observations, we can infer that the Random Forest algorithm performs the best as the final classifier of the Weed Classification module in our project.
4.3 Semantic Segmentation

The adopted U-Net architecture is compared with several widely used fully convolutional models: a shallow version of U-Net, FCN-32s [26], and FCN-16s [26]. The shallow U-Net model contains fewer convolution filters and less dropout than the standard U-Net model. Each FCN model uses a VGG-16 backbone; FCN-32s uses regular convolutions, whereas FCN-16s uses atrous (dilated) convolutions. Atrous convolutions have been shown to decrease blurring in semantic segmentation maps and can extract long-range information without the need for pooling [9]. The outputs for a few samples of the test set are displayed in Fig. 6. Dice scores for the results, with and without DenseCRF post-processing, are analyzed and reported in Table 3. The FCN models utilize skip connections from earlier layers to reconstruct accurate segmentation boundaries by learning back relevant features that are lost during downsampling. It was observed that results with regular convolution operations were far superior to those with atrous convolutions: both FCN networks predict the location of the weeds accurately, but with atrous convolutions several adjacent areas are also included in the mask, which lowers performance, since atrous convolutions make it harder to reconstruct fine boundary information when upsampling at the decoder. The U-Net architecture builds upon the FCN concept, retains fine-grained information better, and has thus achieved much better results.
Table 3 Performance of weed segmentation step (Dice score / F1-Score)

| Model | Before DenseCRF (Min) | Before (Mean) | Before (Max) | After DenseCRF (Min) | After (Mean) | After (Max) |
|---|---|---|---|---|---|---|
| U-Net | 0 | 0.875 | 1.0 | 0 | 0.879 | 1.0 |
| Shallow U-Net | 0 | 0.705 | 1.0 | 0 | 0.731 | 1.0 |
| FCN_32s (VGG-16, normal convolutions) | 0 | 0.79 | 0.984 | 0 | 0.799 | 0.989 |
| FCN_16s (VGG-16, atrous convolutions) | 0 | 0.681 | 0.998 | 0 | 0.695 | 1.0 |
Fig. 6 Sample outputs of the weed segmentation step
The Shallow U-Net model uses a dropout rate of 0.2 at the bottleneck layer and a smaller number of convolution filters, which leads to the main areas of interest being predicted but also to several false-positive pixels, decreasing overall performance. The standard U-Net model achieves the best performance by addressing these issues. It can also be noted from Table 3 that DenseCRF post-processing improved the output quality of all the considered models, although the mean Dice score shows no stark improvement; thus, no substantial merit was observed from post-processing outputs with a DenseCRF model.
5 Conclusion

In this paper, we create a new dataset for image segmentation by extending the Open Sprayer Dataset and manually labeling the weeds. A cascaded deep-learning-based approach is then proposed that detects and localizes weeds in an RGB image, and it is evaluated on this dataset. A pre-trained 2D CNN extracts features from the RGB images, and a Random Forest classifier predicts whether weeds are present with an accuracy of 91.19%. Moreover, the fully convolutional U-Net architecture is used for semantic segmentation to detect the location of weeds in these images, achieving a mean Dice score of 0.879 even though the weeds are extensively overlapped by the crops. We also attempted to use a DenseCRF model to improve the segmentation outputs produced by the U-Net model; however, no drastic increase in performance was observed. Future work includes jointly training the classification and segmentation steps and evaluating the approach in other domains.
References

1. Abdullahi HS, Sheriff R, Mahieddine F (2017) Convolution neural network in precision agriculture for plant image recognition and classification. In: 2017 seventh international conference on innovative computing technology (Intech). IEEE, London, pp 1–3
2. Akhtar A, Khanum A, Khan SA, Shaukat A (2013) Automated plant disease analysis (APDA): performance comparison of machine learning techniques. In: 2013 11th international conference on frontiers of information technology. IEEE, pp 60–65
3. Arnab A, Zheng S, Jayasumana S, Romera-Paredes B, Larsson M, Kirillov A, Savchynskyy B, Rother C, Kahl F, Torr PH (2018) Conditional random fields meet deep neural networks for semantic segmentation: combining probabilistic graphical models with deep learning for structured prediction. IEEE Signal Process Mag 35(1):37–52
4. Badrinarayanan V, Kendall A, Cipolla R (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495
5. Bah MD, Hafiane A, Canals R (2018) Deep learning with unsupervised data labeling for weed detection in line crops in UAV images. Remote Sens 10(11):1690
6. Brodley CE, Utgoff PE (1995) Multivariate decision trees. Mach Learn 19(1):45–77
7. Canziani A, Paszke A, Culurciello E (2016) An analysis of deep neural network models for practical applications
8. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062
9. Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587
10. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition. IEEE, pp 248–255
11. Derpsch R, Friedrich T, Kassam A, Li H (2010) Current status of adoption of no-till farming in the world and some of its main benefits. Int J Agric Biol Eng 3(1):1–25
12. Fawakherji M, Youssef A, Bloisi D, Pretto A, Nardi D (2019) Crop and weeds classification for precision agriculture using context-independent pixel-wise segmentation. In: 2019 third IEEE international conference on robotic computing (IRC). IEEE, pp 146–152
13. Gai J, Tang L, Steward BL (2019) Automated crop plant detection based on the fusion of color and depth images for robotic weed control. J Field Robot
14. Armstrong G (2018) Open sprayer images: a collection of broad leaved dock images for weed sprayer. https://www.kaggle.com/gavinarmstrong/open-sprayer-images
15. Gerhards R, Oebel H (2006) Practical experiences with a system for site-specific weed control in arable crops using real-time image analysis and GPS-controlled patch spraying. Weed Res 46(3):185–193
16. Haug S, Ostermann J (2015) A crop/weed field image dataset for the evaluation of computer vision based precision agriculture tasks. In: Computer vision—ECCV 2014 workshops, pp 105–116
17. Herrera P, Dorado J, Ribeiro Á (2014) A novel approach for weed type classification based on shape descriptors and a fuzzy decision-making method. Sensors 14(8):15304–15324
18. Huang H, Deng J, Lan Y, Yang A, Deng X, Zhang L (2018) A fully convolutional network for weed mapping of unmanned aerial vehicle (UAV) imagery. PLoS ONE 13(4):e0196302
19. Kaur S, Pandey S, Goel S (2018) Semi-automatic leaf disease detection and classification system for soybean culture. IET Image Process 12(6):1038–1048
Kodagoda S, Zhang Z, Ruiz D, Dissanayake G (2008) Weed detection and classification for autonomous farming. Intell Prod Mach Syst 21. Krähenbühl P, Koltun V (2011) Efficient inference in fully connected crfs with gaussian edge potentials. Adv Neural Inf Process Syst 109–117 22. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 1097–1105
A Cascaded Deep Learning Approach for Detection and Localization …
133
23. Kumar D, Verma H, Mehra A, Agrawal R (2019) A modified intuitionistic fuzzy c-means clustering approach to segment human brain mri image. Multimed Tools Appl 78(10):12663– 12687 24. Liaw A, Wiener M et al (2002) Classification and regression by randomforest. R News 2(3):18– 22 25. Likas A, Vlassis N, Verbeek JJ (2003) The global k-means clustering algorithm. Pattern Recognit 36(2):451–461 26. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431– 3440 27. Lottes P, Behley J, Chebrolu N, Milioto A, Stachniss C (2019) Robust joint stem detection and crop-weed classification using image sequences for plant-specific treatment in precision farming. J Field Robot 28. Ma X, Deng X, Qi L, Jiang Y, Li H, Wang Y, Xing X (2019) Fully convolutional network for rice seedling and weed image segmentation at the seedling stage in paddy fields. PloS one 14(4):e0215676 29. Manchanda S, An empirical comparison of supervised learning processes. Int J Eng 1(1):21 30. Nordmeyer H (2006) Patchy weed distribution and site-specific weed control in winter cereals. Precis Agric 7(3):219–231 31. Oliveira HC, Guizilini VC, Nunes IP, Souza JR (2018) Failure detection in row crops from uav images using morphological operators. IEEE Geosci Remote Sens Lett 15(7):991–995 32. Rao A, Chauhan B (2015) Weeds and weed management in india-a review 33. Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 234–241 34. Sa I, Chen Z, Popovi´c M, Khanna R, Liebisch F, Nieto J, Siegwart R (2017) weednet: Dense semantic weed classification using multispectral images and mav for smart farming. IEEE Robot Autom Lett 3(1):588–595 35. Sa I, Popovi´c M, Khanna R, Chen Z, Lottes P, Liebisch F, Nieto J, Stachniss C, Walter A, Siegwart R (2018) Weedmap: a large-scale semantic weed mapping framework using aerial multispectral imaging and deep neural network for precision farming. Remote Sens 10(9):1423 36. Suykens JA, Vandewalle J (1999) Least squares support vector machine classifiers. Neural Process Lett 9(3):293–300 37. Team G et al (2019) GIMP: GNU image manipulation program. GIMP Team 38. Xia X, Xu C, Nan B (2017) Inception-v3 for flower classification. In: 2017 2nd international conference on image, vision and computing (ICIVC). IEEE, pp 783–787
Ensemble of Deep Learning Enabled Tamil Handwritten Character Recognition Model

R. Thanga Selvi
Abstract Recently, the digitalization of handwritten characters has become a hot research topic and finds applicability in different domains. At the same time, recognition of Tamil handwritten characters is a tedious task compared to other languages. Therefore, this paper presents a new ensemble deep learning-based Tamil handwritten character recognition (EDL-THCR) model. The EDL-THCR model recognizes and classifies Tamil handwritten characters. In addition, a data preprocessing stage uses a bilinear interpolation technique to normalize the images. Besides, an ensemble of capsule network (CapsNet) and VGGNet models is used for the feature extraction process. Finally, a softmax layer is employed to classify the Tamil characters in an effective way. A comprehensive experimental analysis is carried out on a benchmark dataset, and the results portray the better performance of the EDL-THCR technique. Keywords Deep learning · Tamil · Handwritten character recognition · Ensemble models · Learning rate
R. Thanga Selvi (B), Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India. e-mail: [email protected]

1 Introduction Over the past few years, optical character recognition (OCR) has become more important, since the need to convert scanned images into computer-readable formats such as text documents has grown with increasing applications. OCR challenges such as distortion, lighting variation, differences in font size, and blurring of the printed character images have increased the requirement for OCR in this study. The major disadvantage noted in OCR is that the intra-class variation is large because of the huge number of images available for the process [1]. OCR systems are currently under development for most of the major languages, and the Tamil language is no exception. The OCR process has become really hard in the case of Indian
languages like Tamil. Similar to other languages, the Tamil language poses its own complicated problems to the developer of a Tamil character recognition (TCR) system [2]. In recent years, significant research has been carried out toward the development of effective TCR systems [3].

Tamil is one of the ancient Indian languages and is mainly used in Southern India, Malaysia, and Sri Lanka. The basic unit of Tamil script is the syllable [4]. The Tamil script has 18 consonants and 12 vowels, as well as a special character (Ayudha Ezhuthu). The combination of consonants and vowels creates a total of 216 composite characters, giving an overall total of 247 characters. In addition, 5 consonants borrowed from Sanskrit, when combined with the vowels, produce another sixty composite characters, making an overall total of 307 characters. The comprehensive Tamil character set can be represented as a combination of 156 distinct characters. Tamil handwritten characters are highly complex to identify when compared to other languages [5]. This is due to the fact that Tamil letters have more modifiers and angles. In addition, the Tamil script has a large character set: an overall total of 247 characters consisting of one special character, 12 vowels, 18 consonants, and 216 compound characters [6]. The challenges faced at the time of recognition are the number of strokes and holes, curves in characters, different writing styles, and sliding characters. Among the different branches of handwritten character recognition (HTCR), it is less demanding to identify English letters and numerals than Tamil characters [7]. Several inter-class similarities are present in the Tamil language, and many letters resemble one another; hence recognition becomes a key challenge. These current challenges of TCR systems stimulated us to perform this study on a TCR system. Although deep learning (DL) methods have been extensively employed for handwritten character recognition in several languages such as English, Arabic, and Chinese, almost all the work done in Tamil until now has employed conventional methods. A standard approach to HTCR employing conventional machine learning (ML) techniques involves character segmentation, preprocessing, feature extraction, and classification, and then predicts the new character.

This paper presents a new ensemble deep learning-based Tamil handwritten character recognition (EDL-THCR) model. The EDL-THCR model recognizes and classifies Tamil handwritten characters. In addition, a data preprocessing stage uses a bilinear interpolation technique to normalize the images. Besides, an ensemble of capsule network (CapsNet) and VGGNet models is used for the feature extraction process. Finally, a softmax layer is employed to classify the Tamil characters in an effective way. A comprehensive experimental analysis is carried out on a benchmark dataset, and the results portray the better performance of the EDL-THCR technique.
2 Related Works Raj and Abirami [8] handle feature extraction and investigate three methods of feature prediction for capturing features from Tamil characters that vary in shape and style. Location-based instances, shape, and shape ordering are the features predicted from the characters. The main characteristics of that study are a stripped tree-based hierarchical structure that handles the shape features of the characters, a Z-ordering algorithm that addresses structured ordering, and lastly a PM-Quadtree representation that handles the location extraction of character features. A hierarchical classification method based on an SVM model is employed to predict the characters from their features through a divide-and-conquer process. In Deepa and Rao [9], image-to-image matching with feature analysis is investigated without employing ML methods. The presented method provided better performance for each character class on the standard database available for Tamil, the HP Labs offline Tamil handwritten character database. The presented classifier achieved a detection rate of 90.2% when using the entire dataset. Kavitha and Srimathi [10] employed an advanced CNN model for identifying handwritten Tamil characters in offline mode. CNNs differ from conventional HTCR methods in that the features do not have to be extracted manually. They employed the isolated handwritten Tamil character dataset proposed by HP Labs India, trained a CNN from scratch on Tamil characters in offline mode, and attained better accuracy on the training and testing databases. Prakash and Preethi [11] proposed a ConvNet framework for offline isolated TCR systems. Initially, the work recognized all 247 characters in Tamil text using 124 unique symbols. The presented method has two FC and convolution layers using the ReLU activation function. A softmax function is employed in the last layer to compute the class likelihood. Lincy and Gayathri [12] proposed a new TCR method using two key procedures, i.e., preprocessing and recognition. The preprocessing stage encloses RGB binarization with thresholding, grayscale conversion, morphological operations, linearization, and image complementation. Then, the preprocessed image after linearization is subjected to recognition through an optimally configured CNN model. Specifically, the FC layers and weights are fine-tuned with a novel SALA method, i.e., a theoretical development of the typical LA model. Kowsalya and Periasamy [13] utilize an efficient TCR model. The presented model contains four major procedures: preprocessing, segmentation, feature extraction, and recognition. In the preprocessing stage, the input image is fed into a skew detection technique, a binarization process, and Gaussian filters. Next, segmentation is performed, where character and line segmentation are implemented. From the segmented outputs, the features are extracted. Then, in the recognition phase, the Tamil characters are identified through an optimal ANN model, where the conventional NN model is adapted using an optimization method in which the weights are enhanced through an EHO model.
Inunganbi et al. [16] developed a model on the Mayek27 dataset using a convolutional neural network and a segmentation algorithm, which gives an accuracy of 91.12%. Hazra et al. [17] proposed a unique CNN model for the Mayek27 dataset. Inunganbi et al. [18] proposed a recognition model using a deep neural network and obtained a recognition rate of 98.86%.
3 The Proposed EDL-THCR Technique In this study, a novel EDL-THCR model is presented to recognize and classify Tamil handwritten characters. The EDL-THCR model involves three processes, namely preprocessing, feature extraction, and classification. The working of these modules is elaborated in the following subsections.
3.1 Preprocessing The images are bi-level, with a white (255) background and a black (0) foreground. The images are of different sizes and are size-normalized to 64 × 64 using bilinear interpolation and then scaled to the [0, 1] range. Training was carried out on two groups of inputs, one with the original images and the other with inverted images (foreground as 1 and background as 0), but there were no important differences with respect to accuracy or training time.
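As an illustration of this preprocessing step, the following is a minimal sketch using OpenCV's bilinear interpolation; the 64 × 64 target size and [0, 1] scaling follow the description above, while the function name and file path are hypothetical.

```python
import cv2
import numpy as np

def preprocess_character(path):
    """Load a bi-level character image, resize it to 64x64 with bilinear
    interpolation, and scale the pixel values into the [0, 1] range."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # white (255) background, black (0) foreground
    img = cv2.resize(img, (64, 64), interpolation=cv2.INTER_LINEAR)
    return img.astype(np.float32) / 255.0

normalized = preprocess_character("sample_character.png")  # hypothetical path
inverted = 1.0 - normalized   # second input group: foreground as 1, background as 0
```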
3.2 Ensemble of DL Models The preprocessed image is fed into the ensemble of DL models which incorporates CapsNet and VGGNet techniques. The ensemble process helps to effectively determine the feature vectors.
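The chapter does not specify how the two models' features are fused; the sketch below assumes a simple concatenation of the per-image feature vectors, with both extractor callables left as hypothetical placeholders.

```python
import numpy as np

def ensemble_features(image, vgg_extractor, capsnet_extractor):
    """Concatenate the feature vectors produced by two extractors
    (fusion by concatenation is an assumption, not stated in the text)."""
    vgg_feat = vgg_extractor(image)       # e.g., flattened VGG19 feature maps
    caps_feat = capsnet_extractor(image)  # e.g., capsule activation vectors
    return np.concatenate([vgg_feat.ravel(), caps_feat.ravel()])
```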
3.2.1 VGGNet Model
VGGNet is a typical kind of deep CNN that is frequently utilized for feature extraction and transfer learning. One of the most widely used VGGNets is VGG19, which has 19 weight layers (16 convolutional layers and 3 fully connected (FC) layers). VGG19 utilizes a sequence of 3×3 convolutional kernels to extract image features and increases the number of feature channels with deeper convolutional layers. Assume that $W_i$ and $b_i$ denote the weight and bias of the $i$th convolutional layer; the feature is extracted as
$X_i^{out} = \sigma(W_i * X_i^{in} + b_i)$  (1)
where $X_i^{in}$ and $X_i^{out}$ respectively signify the input and output feature maps and $\sigma$ refers to the rectified linear unit (ReLU). In all convolutional layers, the stride is fixed to one. To avoid an explosion of computation, VGG19 uses max pooling layers to reduce the size of the feature maps. In the FC layers, every node of a given layer is linked with every node of the previous layer, mapping the distributed feature representation to the instance label space as

$Y = FC_3(FC_2(FC_1(P(X_{16}^{out}))))$  (2)

where $FC(\cdot)$ denotes the operation of an FC layer and $P(\cdot)$ denotes the max pooling operation. At the end of VGG19, the softmax layer generates the classification outcome for an image:

$Y_j = \dfrac{e^{z_j}}{\sum_{c=1}^{C} e^{z_c}}$  (3)

where $Y_j$ denotes the probability of the $j$th node, and $z_j$ and $C$ respectively represent the output of the $j$th node and the number of classes [14]. Compared to other kinds of CNNs, VGG19 increases the depth of the network and implements different arrangements of several convolutional layers and non-linear activation layers, which is helpful for extracting accurate features. In this case, unlike in an image classification task, only the convolutional layers and max pooling layers of the pre-trained VGG19 are utilized as a preprocessing technique for extracting deep feature maps from the image.
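A minimal Keras sketch of such a truncated VGG19 feature extractor is shown below; the ImageNet weights and the 64 × 64 × 3 input shape are assumptions, as the text does not state them.

```python
import numpy as np
from tensorflow.keras.applications import VGG19

# Only the convolutional and max-pooling layers are kept
# (include_top=False drops the FC and softmax layers), so the
# network acts purely as a deep feature-map extractor.
feature_extractor = VGG19(weights="imagenet", include_top=False,
                          input_shape=(64, 64, 3))

batch = np.random.rand(1, 64, 64, 3).astype("float32")  # stand-in for a preprocessed image
deep_features = feature_extractor.predict(batch)         # deep feature maps
```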
3.2.2 CapsNet Model
Capsules are groups of neurons whose outputs are interpreted as distinct features of the same entity, and whose activations are the activation vectors. Each capsule has a pose matrix, which indicates the presence of a particular object at a given location, and an activation probability, which is implied by the length of the vector. When an input image is rotated, the activation vector also changes, but its length remains the same. Figure 1 illustrates the structure of CapsNets. In the proposed framework, an initial capsule layer (the reshaped and squashed output of the final convolution layer) and a final class capsule layer are utilized. Each capsule forecasts the parent capsule's output, and when the forecast is consistent with the parent capsule's actual output, the coupling coefficient between these two capsules increases. When $u_i$ refers to the output of capsule $i$, its forecast for parent capsule $j$ is defined in Eq. (4).
Fig. 1 Framework of CapsNet model
$\hat{u}_{j|i} = W_{ij} u_i$  (4)

where $\hat{u}_{j|i}$ refers to the forecast vector for the $j$th capsule, and $W_{ij}$ refers to the weight matrix learned from the backward pass [15]. The softmax function is utilized for computing the coupling coefficient $c_{ij}$ according to the degree of conformity between the capsule in the layer below and the parent capsule, in what is called the "iterative dynamic routing model", as shown in Eq. (5).

$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}$  (5)

$b_{ij}$ denotes the log prior probability that capsule $i$ is coupled with capsule $j$, and it is initially fixed to zero when the agreement procedure begins routing. The input vector of parent capsule $j$ is then calculated as in Eq. (6).

$s_j = \sum_i c_{ij} \hat{u}_{j|i}$  (6)

Eventually, the non-linear squashing function is utilized to normalize the output vector of the capsule, preventing its length from exceeding 1. This length represents the probability that the capsule has detected the given feature. Each capsule's final output is defined from its input vector as illustrated in Eq. (7).

$v_j = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2} \dfrac{s_j}{\|s_j\|}$  (7)

where $s_j$ refers to the total input to capsule $j$ and $v_j$ signifies the output. According to the agreement between $v_j$ and $\hat{u}_{j|i}$, the log probability is updated during routing. So, the updated log probabilities are estimated as in Eq. (8).

$b_{ij} = b_{ij} + \hat{u}_{j|i} \cdot v_j$  (8)

The routing coefficient toward parent capsule $j$ is strengthened by the dynamic routing method under the influence of $\hat{u}_{j|i} \cdot v_j$. Therefore, more information is sent by child capsule $i$ to the parent capsules whose output $v_j$ is more similar to its forecast $\hat{u}_{j|i}$.
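To make Eqs. (4)–(8) concrete, here is a minimal NumPy sketch of the squashing function and the iterative dynamic routing loop; the tensor shapes and the three routing iterations are assumptions.

```python
import numpy as np

def squash(s):
    """Eq. (7): keep the orientation of s, map its length into [0, 1)."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-9)

def dynamic_routing(u_hat, iterations=3):
    """u_hat[i, j] is child capsule i's prediction for parent j (Eq. 4),
    with shape (num_child, num_parent, dim)."""
    num_child, num_parent, _ = u_hat.shape
    b = np.zeros((num_child, num_parent))                     # log priors, start at zero
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # Eq. (5): softmax over parents
        s = (c[..., None] * u_hat).sum(axis=0)                # Eq. (6): weighted sum per parent
        v = squash(s)                                         # Eq. (7)
        b = b + (u_hat * v[None, ...]).sum(axis=-1)           # Eq. (8): agreement update
    return v
```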
3.3 Softmax Layer In the last stage, softmax (SM) classification is utilized for classifying the handwritten characters using the features from the preceding model. Because handwritten character recognition is a multi-class classification problem, SM classification is utilized as the final output layer of the DL techniques:

$softmax(y_i) = \dfrac{e^{y_i}}{\sum_{i=1}^{I} e^{y_i}}$  (9)

where $y_i$ refers to the $i$th component of the feature vector, $\sum_{i=1}^{I} softmax(y_i) = 1$, and $I$ denotes the dimension of the final vector.
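A small numeric illustration of Eq. (9): the exponentiated scores are normalized so that the outputs sum to 1 (subtracting the maximum is a standard numerical-stability step, not part of the equation).

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))  # stability shift; does not change the result
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)        # ≈ [0.659, 0.242, 0.099], sums to 1
```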
4 Performance Validation In order to estimate the efficiency of the proposed work, a Tamil character handwriting dataset collected from distinct persons has been employed. The total number of images considered for the presented technique is 100–150. Here, 80% of the images were utilized to train the model and 20% of the images were utilized for testing. The dataset was created from varying persons: the Tamil character dataset covers more than 50 writing styles, and each image contains one Tamil character. Figure 2 shows the sample test images.
Fig. 2 Sample images of a Tamil character
Table 1 Results analysis of EDL-THCR technique in terms of recognition rate

Image   | Neural Network | NN + EHO | EDL-THCR
Image 1 | 0.8411 | 0.9252 | 0.9487
Image 2 | 0.8763 | 0.9278 | 0.9519
Image 3 | 0.8750 | 0.9271 | 0.9536
Image 4 | 0.8878 | 0.9286 | 0.9628
Average | 0.8701 | 0.9272 | 0.9543
Table 1 reports the recognition rate analysis of the EDL-THCR technique on the test images. The results portray that the EDL-THCR technique has accomplished better recognition outcomes than the existing techniques. On test image 1, the EDL-THCR technique has resulted in a higher recognition rate of 0.9487, whereas the NN and NN-EHO techniques have obtained lower recognition rates of 0.8411 and 0.9252. Also, on test image 2, the EDL-THCR model has resulted in a maximum recognition rate of 0.9519, whereas the NN and NN-EHO methods have achieved minimum recognition rates of 0.8763 and 0.9278. Along with that, on test image 3, the EDL-THCR approach has resulted in a superior recognition rate of 0.9536, whereas the NN and NN-EHO techniques have gained lower recognition rates of 0.8750 and 0.9271. Moreover, on test image 4, the EDL-THCR technique has resulted in an increased recognition rate of 0.9628, whereas the NN and NN-EHO techniques have obtained minimum recognition rates of 0.8878 and 0.9286. Figure 3 depicts the accuracy graph analysis of the EDL-THCR approach. The figure shows that the training and testing accuracy improve with a rise in epoch count and that the training accuracy is noticeably superior to the testing accuracy.
Fig. 3 Accuracy graph analysis of EDL-THCR model
Fig. 4 Loss graph analysis of EDL-THCR model
Figure 4 demonstrates the loss graph analysis of the EDL-THCR model. The figure exhibits that the training and testing loss become minimal with a higher epoch count and that the training loss is found to be lower than the testing loss. Finally, a brief comparative recognition rate analysis of the EDL-THCR technique with recent methods is provided in Fig. 5. The figure reports that the KNN, SOM, RBN, NN, Quad Tree, and FNN techniques have obtained lower recognition rates of 0.6512, 0.8859, 0.8962, 0.8701, 0.8611, and 0.8508 respectively. At the same time, the NN-EHO and SVM techniques have resulted in moderate recognition rates of 0.9272 and 0.9162 respectively. However, the EDL-THCR technique has outperformed the previous methods with a higher recognition rate of 0.9543. Therefore, the EDL-THCR technique has emerged as an effective tool to recognize Tamil handwritten characters.
Fig. 5 Recognition rate analysis of EDL-THCR model
5 Conclusion In this study, a novel EDL-THCR model is presented to recognize and classify Tamil handwritten characters. In addition, a data preprocessing stage uses a bilinear interpolation technique to normalize the images. Besides, an ensemble of CapsNet and VGGNet models is used for the feature extraction process. Finally, a softmax layer is employed to classify the Tamil characters in an effective way. A comprehensive experimental analysis is carried out on a benchmark dataset, and the results portray the better performance of the EDL-THCR technique. Therefore, the EDL-THCR model can be utilized as an effective tool for recognizing Tamil handwritten characters. In future, advanced hybrid DL models can be used for the feature extraction process.
References
1. Sampath AK, Gomathi N (2017) Decision tree and deep learning based probabilistic model for character recognition. J Cent S Univ 24:2862–2876
2. Chacko BP, Vimal Krishnan VR, Raju G, Babu Anto P (2012) Handwritten character recognition using wavelet energy and extreme learning machine. Int J Mach Learn Cybern 3:149–161
3. Ajantha Devi V, Santhosh Baboo S (2014) Embedded optical character recognition on Tamil text image using raspberry pi. Int J Comput Sci Trends Technol (IJCST) 2(4):127–132
4. Canziani A, Paszke A, Culurciello E (2016) An analysis of deep neural network models for practical applications. ArXiv preprint arXiv:1605.07678
5. Bhattacharya U, Ghosh SK, Parui S (2007) A two stage recognition scheme for handwritten Tamil characters. In: Ninth international conference on document analysis and recognition (ICDAR 2007). IEEE, pp 511–515
6. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
7. Liu CL, Yin F, Wang DH, Wang QF (2013) Online and offline handwritten Chinese character recognition: benchmarking on new databases. Pattern Recognit 46(1):155–162
8. Raj MAR, Abirami S (2020) Structural representation-based off-line Tamil handwritten character recognition. Soft Comput 24(2):1447–1472
9. Deepa RA, Rao RR (2020) A novel nearest interest point classifier for offline Tamil handwritten character recognition. Pattern Anal Appl 23(1):199–212
10. Kavitha BR, Srimathi C (2019) Benchmarking on offline handwritten Tamil character recognition using convolutional neural networks. J King Saud Univ-Comput Inf Sci
11. Prakash A, Preethi S (2018) Isolated offline Tamil handwritten character recognition using deep convolutional neural network. In: 2018 international conference on intelligent computing and communication for smart world (I2C2SW). IEEE, pp 278–281
12. Lincy RB, Gayathri R (2021) Optimally configured convolutional neural network for Tamil handwritten character recognition by improved lion optimization model. Multimed Tools Appl 80(4):5917–5943
13. Kowsalya S, Periasamy PS (2019) Recognition of Tamil handwritten character using modified neural network with aid of elephant herding optimization. Multimed Tools Appl 78(17):25043–25061
14. Han B, Du J, Jia Y, Zhu H (2021) Zero-watermarking algorithm for medical image based on VGG19 deep convolution neural network. J Healthc Eng
15. Panigrahi S, Das J, Swarnkar T (2020) Capsule network based analysis of histopathological images of oral squamous cell carcinoma. J King Saud Univ-Comput Inf Sci
16. Inunganbi S, Choudhary P, Manglem K (2021) Meitei Mayek handwritten dataset: compilation, segmentation, and character recognition. Vis Comput 37:291–305
17. Hazra A, Choudhary P, Inunganbi S et al (2021) Bangla-Meitei Mayek scripts handwritten character recognition using convolutional neural network. Appl Intell 51:2291–2311
18. Inunganbi S, Choudhary P, Manglem K (2020) Manipuri handwritten character recognition by convolutional neural network. In: Nain N, Vipparthi S, Raman B (eds) Computer vision and image processing. CVIP 2019. Communications in computer and information science, vol 1148. Springer, Singapore
A Comparative Study of Loss Functions for Deep Neural Networks in Time Series Analysis

Rashi Jaiswal and Brijendra Singh
Abstract Currently, deep neural networks are widely used for analyzing temporal data. These networks can adapt their architecture to specific needs and deliver good performance. Researchers and developers frequently update their architecture to meet the requirements, but this process can be quite time-consuming. A key attribute of DNN architecture is the loss function, which plays a crucial role in calculating gradients. Most research and applications in time series analysis use the mean squared error (MSE) loss function. In this paper, we aim to explore existing loss functions to address the challenge of selecting the appropriate loss function for DNNs. We conduct experiments on time series datasets to evaluate the impact of different loss functions on DNN model performance. Our findings indicate that the Huber loss function outperforms other loss functions in time series analysis. Additionally, we discuss the potential for custom loss functions as future work, beyond the limitations of existing methods. Keywords Deep neural network · Neural network architecture · Loss functions · Recurrent networks · Convolutional networks
R. Jaiswal (B) · B. Singh, University of Lucknow, Lucknow, U.P., India. e-mail: [email protected]

1 Introduction In recent years, deep neural networks have been extensively used in research and applications to achieve more accurate results. The development of deep learning models has advanced rapidly. DNNs have the ability to adapt their architecture to meet specific requirements [1]. They are used in various domains to solve a wide range of problems [2], such as prediction [3, 4], forecasting [5, 6], speech separation [7], emotion recognition [8], expert system development [9], image recognition [10], and image restoration [11], among others. The architecture of DNNs can be modified by adjusting specialized layers, number of nodes, iterations, optimizers, loss functions, and activation functions, among other attributes [12]. The loss function is a crucial
attribute of neural networks, as it is used to calculate the prediction error of the model. Most commonly, researchers and developers use the mean squared error (MSE) loss function in time series computation [13]. In this paper, we aim to address the issue of selecting the appropriate loss function for time series analysis by conducting an empirical analysis of various loss functions in DNNs. This paper provides a theoretical and empirical study of DNN loss functions by identifying the best loss function for regression tasks in time series. The training process of a DNN is measured through the loss function in the form of an error in the predicted value. A comparative study of different loss functions is conducted to determine the best loss function for time series analysis through experiments. The study is conducted using DNN models on four time-series datasets, including two univariate and two multivariate time series, to validate the results of the paper. The notations and abbreviations used in this paper are listed in Table 1.

Table 1 Details of notations or abbreviations

Notation/abbreviation | Description
DNN | Deep neural network
AR | Auto regressive
MA | Moving average
ARMA | Auto regressive moving average
ARIMA | Auto regressive integrated moving average
SARIMA | Seasonal auto regressive integrated moving average
CNN | Convolutional neural network
RNN | Recurrent neural network
LSTM | Long short-term memory
Bi-LSTM | Bidirectional long short-term memory
GRU | Gated recurrent unit
MSE | Mean square error
MAE | Mean absolute error
MSLE | Mean square logarithmic error
MAPE | Mean absolute percentage error
RMSE | Root mean square error
R2 | R square

This paper is organized into seven sections: Sect. 1 introduces the problem of selecting the appropriate loss function for deep neural networks. Section 2 reviews the existing literature on loss functions in regression tasks on time-series datasets. The significance of loss functions and their role in DNN architecture are discussed in detail in Sect. 3. The different types of loss functions are explored in Sect. 4. In Sect. 5, we present the results of our experimental study on selected time-series datasets, and evaluate the performance of DNN models using performance scores. Based on the
experimental results, Sect. 6 presents a comparative analysis of the different selected loss functions with a detailed discussion. Finally, Sect. 7 concludes the paper.
2 Related Work Time series analysis and forecasting have been traditionally done using methods such as AR, MA, ARMA, ARIMA, SARIMA, etc. [14, 15]. These techniques have some limitations, such as being a statistical regression-based approach and being less accurate for long-term predictions. In the era of machine learning, various advanced models (such as random forest and gradient boost trees) have been developed to improve the accuracy of predictions in sequential data [16] and time-series data [17]. Deep neural network-based methods have also been developed to train machines to improve accuracy [18]. In DNNs, two types of networks are used for prediction: convolutional neural networks and recurrent neural networks. Convolutional networks are popular for solving high-dimensional data or image classification-based problems. However, since time-series data is sequential, only 1D convolutional neural networks are used. Tang et al. have proposed the use of 1D CNN for time series classification with an inbuilt feature selection process [19]. The concept of recurrent neural networks was developed in 1986 to solve problems in time series analysis, such as sequential long-term prediction and the vanishing gradient problem. The first type of recurrent network was discovered by John Hopfield in 1982. RNNs are used for time series prediction because they can process the sequence of inputs in DNNs, but they also have the vanishing gradient problem [20]. To address this issue, new DNN models have been developed as variants of RNNs, such as LSTM, GRU, and sub-variants like Bi-LSTM, which have been proposed for time series analysis. The architecture of DNNs is different from traditional models as it allows for customization of the model architecture to meet specific needs for greater flexibility [21]. Different loss functions have been used by various researchers to solve various problems, such as the vanishing gradients issue in the DNN process and computing the prediction error, training, and testing loss in the learning process. Scharf et al. [13] have conducted a convergence analysis between MSE and cross-entropy and found that MSE is more robust for regression. Janocha et al. [22] have studied different loss functions in DNN classifiers to identify good loss functions through an empirical study on other loss functions. Ghosh et al. [23] have provided some sufficient conditions on loss functions for noise tolerance in multi-class classification. Wang et al. [24] have discussed and analyzed 31 loss functions in five aspects: classification, regression, unsupervised learning, object detection, and face recognition. Yu et al. [25] have optimized the loss function in convolutional neural networks for classification tasks for real-time risk analysis. Zhao et al. [10] have focused on various loss functions, beyond just L2 loss, to show their importance in image restoration by comparing several losses without changing the model architecture. El Jurdi et al. [26] have surveyed loss functions for medical image segmentation and provided the outcome as the high priority loss function for the same task. Zabihzadeh has proposed
an approach that combines losses to improve the generalizability of deep metric learning methods [27]. Khammar et al. [28] have proposed a novel approach to fit fuzzy regression models with different loss functions. Ma et al. [29] have worked on different robust models in the presence of noisy labels and proposed a framework (Active Passive Loss: APL) to solve the problem of underfitting. Based on our literature review, we found that most researchers have focused on classification tasks with loss functions in DNNs. This paper aims to compare different deep learning models in time series to analyze the effects of loss functions within their architecture for regression tasks. Specifically, we focus on the problem of determining which loss function is better for calculating the prediction error, through a comparative study of regression loss functions in DNNs. In the next section, we will discuss the details of DNN architecture and its importance in relation to the loss function.
3 Deep Neural Network and Its Architecture Deep neural networks have been developed to perform machine-learning tasks and solve decision-making problems by simulating the human brain [1]. DNNs can perform both classification and regression tasks. While there are various regression methods available [30], this paper focuses specifically on deep neural network models. There are two types of deep neural networks: convolutional neural networks and recurrent neural networks, which are used for learning and predicting future values. The details of CNNs and RNNs will be provided in the following sub-sections.
3.1 Convolutional Neural Networks CNN or convolutional neural network, is a powerful deep model that is often used with multi-dimensional data or visual imagery. The architecture of CNNs includes kernels as filter layers for feature selection and input, which use convolutional operations. CNNs can also be used with sequential data using a 1D CNN architecture [19] and for multi-dimensional data using 2D/3D CNN. Currently, 1D-CNNs are used to achieve more accurate results for predictions, and for hybridization with recurrent models in the computation of time-series data [31]. They can also be used for image and video-based applications. The architecture of CNN is depicted in Fig. 1, where the important layers, such as input, output, flattening, and pooling layers, are represented.
Fig. 1 Convolutional neural network architecture
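As an illustration of a 1D CNN applied to a sliding window of a time series, the following Keras sketch is a minimal regression model; the window length, filter count, and kernel size are assumptions rather than values taken from the paper.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

# A window of 50 past observations with one feature, predicting the next value.
model = Sequential([
    Conv1D(64, kernel_size=3, activation="relu", input_shape=(50, 1)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(1),                     # single regression output
])
model.compile(optimizer="adam", loss="mse")
```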
3.2 Recurrent Neural Networks and Their Variants RNN. The recurrent neural network (RNN) is a type of deep neural network, first proposed in 1986, that processes sequential data [32]. RNNs work on sequences of vectors as input and output. The architecture of RNNs has connections between passes and connections through time, where nodes connect along the sequence to the previous and next layers. RNNs have a lower computational cost because they share parameters across different timestamps. RNNs can be used for time series, text, and audio-based applications [33, 34]. However, RNNs have the problems of vanishing gradients and exploding gradients, observed through the loss function used in their architecture, which motivated the development of their variants. The architecture of an RNN is depicted in Fig. 2, where three layers are shown: the input, hidden (recurrent), and output layers. The number of hidden layers can be increased as needed, but this makes the network model architecture more complex. LSTM. As previously mentioned, RNNs have difficulties in learning long-term dependencies. To solve this problem and deal with the vanishing gradient issue [20], LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber in the 1990s [35]. LSTM is an extension of RNNs, where the architecture has been changed to enable long-term pattern learning. LSTM has successfully solved complex problems that RNNs struggled with. However, the architecture of LSTM is complex and takes more time in the computation process. GRU. To make the computation process faster and more accurate, GRU was proposed by Kyunghyun Cho et al. in 2014 [36]. As a variant of RNNs and an
Fig. 2 Recurrent neural network architecture
updated version of LSTM, GRUs are gated mechanisms that form a streamlined version of LSTM. The architecture of GRU is built with a combination of two gates: the update gate and the reset gate. While LSTM and GRU are both variants of RNNs, GRU performs faster than LSTM by using fewer external gating signals. Bi-LSTM. In the bidirectional LSTM model, two LSTMs are applied to the input to create an extended version of LSTM. The first LSTM takes the input in its original order (forward layer) and the second LSTM processes it in reverse (backward layer). The use of bidirectional LSTMs improves the performance and computation speed of LSTM models [37], but also increases the complexity of the model. Similarly, bidirectional RNNs [38] and bidirectional GRUs can be used as needed, with the same architecture and working strategy. Hybrid models can also be built using the same loss functions for various purposes in time series analysis [39]. In the next section, we focus on loss functions, the attribute of DNNs used to measure the difference between actual and predicted values in regression tasks on time series.
4 Loss Functions of DNNs in Regression As previously discussed, loss functions are an essential aspect of neural network architecture. They are used to determine the loss in the prediction process as gradients in neural networks, which can arise for various reasons such as noisy data, limited data, a low learning rate, and training and testing errors. Loss functions are used to calculate the gradients in neural network models, which are then used to update the weights on nodes in neural networks. Deep neural networks (DNNs) are used to solve various types of problems, including regression, classification, and unsupervised tasks. This paper focuses on loss functions used for regression tasks [40] and their impact on DNN models in time series analysis [41]. The following sub-sections provide detailed information on the different regression loss functions in DNN architecture, along with their mathematical representations. MSE (L2 Loss). Mean Square Error is used for the regression task, where the loss is calculated by taking the mean of squared differences between the actual (target) and predicted values [42, 43]. This is also called L2 loss. Setting the derivative of the L2 loss to zero gives an efficient way to find the solution.

$MSE = \dfrac{\sum_{i=1}^{n} (y_i - y_i^p)^2}{n}$  (1)
MAE (L1 Loss). Mean Squared Error loss is easy to solve, but outliers are better handled by the absolute error; MAE is more robust to outliers [42]. The loss is minimized when the prediction is exactly equal to the true value. The demerit of the mean absolute error is that its gradients remain large even for small loss values, which is not good for learning. This is also known as L1 loss.

$MAE = \dfrac{\sum_{i=1}^{n} |y_i - y_i^p|}{n}$  (2)
MAPE. Mean Absolute Percentage Error is a measure of prediction accuracy in statistical forecasting methods [44]. Here, the difference between the actual value and the forecast value is divided by the actual value. The drawback of MAPE is that it cannot be used when there are zero actual values.

$MAPE = \dfrac{100}{n} \sum_{t=1}^{n} \left| \dfrac{A_t - F_t}{A_t} \right|$  (3)
MSLE. Mean Squared Logarithmic Error loss first calculates the natural logarithm of each predicted value and then calculates the MSE. It overcomes the problem of large differences for large predicted values [45], so it is used when the values are normally distributed. It is helpful in regression models when predicting unscaled quantities directly.

$L(y, y^p) = \dfrac{1}{N} \sum_{i=0}^{N} (\log(y_i + 1) - \log(y_i^p + 1))^2$  (4)
Cosine Similarity. The cosine of the angle between two non-zero vectors can be derived using the Euclidean dot product to measure the similarity between them [46]. It reflects a relative rather than an absolute property through the comparison of individual vector dimensions. It can be used to solve real-world tasks, though its time complexity is quadratic.

$\cos\theta = \dfrac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|} = \dfrac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \sqrt{\sum_{i=1}^{n} b_i^2}}$  (5)
Huber Loss. Huber loss is less sensitive to outliers than the mean square error loss. It works like the absolute error but becomes differentiable and quadratic near zero [47]. The two regimes of the Huber loss depend on the parameter delta: it approaches MAE as delta approaches zero and MSE as delta grows toward infinity. Choosing an appropriate delta is the critical challenge.

$L_\delta(y, f(x)) = \begin{cases} \frac{1}{2} (y - f(x))^2 & \text{for } |y - f(x)| \le \delta \\ \delta |y - f(x)| - \frac{1}{2} \delta^2 & \text{otherwise} \end{cases}$  (6)
Log-Cosh Loss. Log-cosh is used for regression tasks and is smoother than the L2 loss. Log-cosh loss is the logarithm of the hyperbolic cosine of the prediction error [46]. It overcomes the limitations of L1 and L2 by taking advantage of both loss functions.
$L(y, y^p) = \sum_{i=1}^{n} \log(\cosh(y_i^p - y_i))$  (7)

Table 2 Details of selected loss functions of DNNs in time series for regression task

Symbol | Name | Equation
MSE | Mean square error (L2 loss) | $MSE = \sum_{i=1}^{n} (y_i - y_i^p)^2 / n$
MAE | Mean absolute error (L1 loss) | $MAE = \sum_{i=1}^{n} |y_i - y_i^p| / n$
MAPE | Mean absolute percentage error | $MAPE = \frac{100}{n} \sum_{t=1}^{n} |(A_t - F_t)/A_t|$
MSLE | Mean squared logarithmic error | $L(y, y^p) = \frac{1}{N} \sum_{i=0}^{N} (\log(y_i + 1) - \log(y_i^p + 1))^2$
CS | Cosine similarity | $\cos\theta = \vec{a} \cdot \vec{b} / (\|\vec{a}\| \|\vec{b}\|)$
Huber | Huber loss | $L_\delta(y, f(x)) = \frac{1}{2}(y - f(x))^2$ for $|y - f(x)| \le \delta$; $\delta |y - f(x)| - \frac{1}{2}\delta^2$ otherwise
Log-C | Log-Cosh loss | $L(y, y^p) = \sum_{i=1}^{n} \log(\cosh(y_i^p - y_i))$
Quantile Loss. For predicting an interval instead of a point-wise prediction, the quantile loss is very useful because it handles uncertainty [28]. Instead of using linear regression for handling the residuals, the quantile loss function is better for non-linear or quantile regression-based models, providing sensible prediction intervals [48].

$L_\gamma(y, y^p) = \sum_{i:\, y_i < y_i^p} (\gamma - 1) \cdot |y_i - y_i^p| + \sum_{i:\, y_i \ge y_i^p} \gamma \cdot |y_i - y_i^p|$  (8)
It is difficult to choose the appropriate loss function from the regression loss functions mentioned above. In the following sections of this paper, an experimental study is conducted to address this issue. Table 2 provides a summary of the selected loss functions for the regression task, and the experimental study compares these selected regression loss functions.
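To make the formulas above concrete, here is a minimal NumPy sketch of the selected losses; the quantile loss is written in its standard pinball form, and the default delta and gamma values are assumptions.

```python
import numpy as np

def mse(y, p):      return np.mean((y - p) ** 2)                      # Eq. (1)
def mae(y, p):      return np.mean(np.abs(y - p))                     # Eq. (2)
def mape(y, p):     return 100.0 * np.mean(np.abs((y - p) / y))       # Eq. (3); y must be non-zero
def msle(y, p):     return np.mean((np.log1p(y) - np.log1p(p)) ** 2)  # Eq. (4)
def log_cosh(y, p): return np.sum(np.log(np.cosh(p - y)))             # Eq. (7)

def huber(y, p, delta=1.0):                                           # Eq. (6)
    r = np.abs(y - p)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2,
                            delta * r - 0.5 * delta ** 2))

def quantile(y, p, gamma=0.5):                                        # Eq. (8), pinball form
    r = y - p
    return np.sum(np.where(r < 0, (gamma - 1) * r, gamma * r))
```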
5 Experiments The details of the experimental setup, including the selected time-series datasets, DNN models, loss functions, and performance metrics, are provided in the following sub-sections.
Table 3 Details of selected time series (temporal datasets)

Datasets | Instances | Attributes | Dataset type
Gold price dataset (DS1) | 10,788 | 2 | Univariate
Metro traffic volume dataset (DS2) | 48,205 | 6 | Multivariate
Daily min. temperature dataset (DS3) | 3651 | 2 | Univariate
London merged dataset (DS4) | 17,415 | 10 | Multivariate
5.1 Experimental Setup To conduct a comparative study of different selected loss functions in the time series prediction process, the Anaconda IDE and the Python programming language with the sk-learn and pandas libraries were used for the experiments in this paper. The experiments were performed on a Windows-based operating system (Windows 10) with 12 GB of RAM and an i5 processor. The setup needed to be robust enough to handle the computational demands of the neural network and to efficiently process the large datasets in order to save time.
5.2 Datasets Selection For the experimental study on the selected DNN models, four time-series datasets were chosen: Gold Price (DS1), Metro interstate traffic volume (DS2), Daily minimum temperature (DS3), and London merged (DS4). These datasets were obtained from open-source repositories such as the UCI Machine Learning Repository [49] and Kaggle Data [50]. Both univariate datasets (DS1 and DS3) and multivariate datasets (DS2, and DS4) were used to validate the outcomes on different types of time-series data. The details of the selected datasets are provided in Table 3.
5.3 Selected Models and Loss Functions There are two types of tasks in time series analysis: regression and classification. Here, we focus on the regression problem. To conduct our experiments, we have used recurrent neural network models including RNN, LSTM, Bi-LSTM, and GRU, and various regression loss functions. The architecture of the DNNs used in our experiments consists of one input layer, two middle or hidden layers, and one output layer. The models have been trained using 50 input nodes, 100 epochs, and the 'relu' activation function, with one dropout and one dense layer, and the 'adam' optimizer. The selected loss functions for the experiments are listed in Table 2.
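A minimal Keras sketch consistent with this setup is given below; it follows the stated 50 nodes, 'relu' activation, one dropout and one dense layer, and the 'adam' optimizer, while the dropout rate, window length, and exact layer arrangement are assumptions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_model(loss="huber", window=50, n_features=1):
    """Swap the loss string ('mse', 'mae', 'mape', 'huber', 'log_cosh', ...)
    to reproduce the comparison; replace LSTM with SimpleRNN, GRU, or
    Bidirectional(LSTM(...)) for the other recurrent variants."""
    model = Sequential([
        LSTM(50, activation="relu", input_shape=(window, n_features)),
        Dropout(0.2),   # dropout rate assumed; the paper only states one dropout layer
        Dense(1),       # regression output
    ])
    model.compile(optimizer="adam", loss=loss)
    return model

model = build_model("huber")
# model.fit(X_train, y_train, epochs=100)  # 100 epochs, as in the stated setup
```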
5.4 Performance Evaluation Data preprocessing was done, in which we performed feature selection by analyzing the correlation between the features of the dataset and imputed any missing values. Since we selected regression models, performance metrics were calculated to measure their performance score. The Root Mean Squared Error (RMSE) and R-Squared (R2) metrics were used for this evaluation. RMSE indicates the error and R2 represents the accuracy of the regression models. Negative R-Squared values were considered as zero in the comparative analysis. The results of the performance scores in terms of RMSE and R-Squared for the DNN model-based regression analysis are presented in Tables 4, 5, 6, and 7, which were obtained from the experiments. After conducting the experimental study, a thorough analysis of the results is presented in the next section. The results, as shown in Tables 4, 5, 6, and 7, are analyzed to evaluate the performance of DNNs in performing the regression task, and to determine the most suitable loss function for improving the performance of the deep learning model.
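A minimal sketch of the scoring described above, using scikit-learn; clamping negative R-squared values to zero follows the convention stated in this section.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # error magnitude
    r2 = max(r2_score(y_true, y_pred), 0.0)             # negative R-squared treated as zero
    return rmse, r2
```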
6 Comparative Analysis of Loss Functions in Time Series The architecture and loss functions for deep neural networks used in regression analysis are outlined in Sects. 3 and 4. The experimental setup and chosen deep neural networks with various loss functions are detailed in Sect. 5. We have depicted the results using a graphical representation, with the x-axis representing the loss functions and the y-axis displaying the performance score (R2) for the different selected DNNs, as shown in Tables 4, 5, 6, and 7. The performance of deep neural network (DNN) loss functions was evaluated using the R-Square (R2) measurement. Both convolutional (1D CNN) and recurrent neural networks (RNN and its variants) were used for time series analysis. We applied the selected loss functions to various deep recurrent models (RNN, LSTM, Bi-LSTM, GRU) for regression purposes. The results of the experiments are summarized in Tables 4, 5, 6, and 7, where we found that the Huber loss function performed the best among all the selected loss functions for DNN models. The results are visualized in Figs. 3, 4, 5, and 6, but we excluded negative R2 values for ease of understanding and analysis. The graphs show the results of both univariate (DS1, DS3) and multivariate (DS2, DS4) time series analysis. Our experiments showed that the univariate datasets performed well with three loss functions: MSLE (in RNN), MSE (in RNN variants), and Huber loss. However, in multivariate datasets, the performance of MAE, MSE, and Log-cosh was similar to that of the Huber loss. Overall, the Huber loss had the best performance score compared to other loss functions. Based on our findings, we conclude that the Huber loss is the most effective loss function among the ones tested.
Table 4 Performance score for DS1 (Gold Price Dataset)

Loss functions | RMSE (RNN) | RMSE (LSTM) | RMSE (Bi-LSTM) | RMSE (GRU) | R-Square (RNN) | R-Square (LSTM) | R-Square (Bi-LSTM) | R-Square (GRU)
MSE | 34.3878 | 29.3902 | 22.5151 | 32.9826 | 0.9593 | 0.9702 | 0.9826 | 0.9626
MAE | 34.6588 | 38.6073 | 19.8576 | 32.8999 | 0.9587 | 0.9487 | 0.9864 | 0.9628
MAPE | 49.0892 | 41.7221 | 19.9908 | 24.5528 | 0.9170 | 0.9400 | 0.9863 | 0.9793
MSLE | 17.2786 | 22.568 | 21.1846 | 14.8348 | 0.9874 | 0.9825 | 0.9846 | 0.9924
CS | 1510.216 | 932.5053 | 215.3456 | 585.8018 | −77.50 | −28.93 | −0.596 | −10.81
Huber | 29.2362 | 17.4227 | 16.5748 | 16.5748 | 0.9827 | 0.9896 | 0.9905 | 0.9917
Log-C | 34.7017 | 21.9937 | 21.0868 | 35.0649 | 0.9586 | 0.9834 | 0.9847 | 0.9576
Table 5 Performance score for DS2 (Metro Traffic volume dataset)

Loss functions | RMSE (RNN) | RMSE (LSTM) | RMSE (Bi-LSTM) | RMSE (GRU) | R-Square (RNN) | R-Square (LSTM) | R-Square (Bi-LSTM) | R-Square (GRU)
MSE | 493.7687 | 510.9257 | 505.1993 | 522.0358 | 0.9370 | 0.9325 | 0.9340 | 0.9296
MAE | 473.8199 | 536.6699 | 488.2672 | 498.6131 | 0.9420 | 0.9256 | 0.9384 | 0.9358
MAPE | 681.3534 | 736.9298 | 633.8246 | 654.6635 | 0.8801 | 0.8598 | 0.8963 | 0.8893
MSLE | 532.4113 | 566.8014 | 503.3173 | 513.7518 | 0.9268 | 0.9170 | 0.9346 | 0.9318
CS | 48,123.55 | 21,927.3 | 32,353.55 | 22,504.0 | −540.0 | −123.2 | −269.3 | −129.8
Huber | 476.5263 | 481.1330 | 500.6393 | 514.495 | 0.9414 | 0.9402 | 0.9353 | 0.9365
Log-C | 482.1983 | 538.4953 | 485.9521 | 488.8483 | 0.9400 | 0.9251 | 0.9390 | 0.9360
Table 6 Performance score for DS3 (Daily Min. temperature dataset)

Loss functions | RMSE (RNN) | RMSE (LSTM) | RMSE (Bi-LSTM) | RMSE (GRU) | R-Square (RNN) | R-Square (LSTM) | R-Square (Bi-LSTM) | R-Square (GRU)
MSE | 2.3389 | 2.3093 | 2.3100 | 2.3370 | 0.6740 | 0.6822 | 0.6820 | 0.6746
MAE | 2.3739 | 2.3159 | 2.3207 | 2.2999 | 0.6643 | 0.6804 | 0.6791 | 0.6848
MAPE | 12.6823 | 12.0825 | 12.1483 | 12.227 | −8.584 | −7.699 | −7.794 | −7.904
MSLE | 2.3128 | 2.4083 | 2.3184 | 2.3215 | 0.6813 | 0.6544 | 0.6798 | 0.6789
Cosine S | 3.3566 | 11.5166 | 7.1256 | 10.2229 | 0.3287 | −6.903 | −2.025 | −5.227
Huber | 2.3186 | 2.2949 | 2.3113 | 2.3054 | 0.6797 | 0.6862 | 0.6817 | 0.6834
Log-C | 2.35932 | 2.3457 | 2.3115 | 2.3435 | 0.6683 | 0.6722 | 0.6816 | 0.6728
Table 7 Performance score for DS4 (London merged dataset)

Loss functions | RMSE (RNN) | RMSE (LSTM) | RMSE (Bi-LSTM) | RMSE (GRU) | R-Square (RNN) | R-Square (LSTM) | R-Square (Bi-LSTM) | R-Square (GRU)
MSE | 306.8758 | 298.7129 | 264.7119 | 278.2944 | 0.9261 | 0.9299 | 0.9449 | 0.9392
MAE | 334.0848 | 287.9983 | 293.6033 | 291.8207 | 0.9124 | 0.9349 | 0.9323 | 0.9331
MAPE | 1632.5357 | 977.5944 | 862.6609 | 983.8704 | −1.093 | 0.2497 | 0.4157 | 0.2399
MSLE | 533.4380 | 284.6098 | 277.9015 | 281.0592 | 0.7765 | 0.9364 | 0.9394 | 0.9380
CS | 922.0342 | 1484.2531 | 2689.377 | 958.4075 | 0.3325 | −0.73 | −4.679 | 0.2788
Huber | 310.9188 | 264.5646 | 264.8258 | 273.3244 | 0.9241 | 0.9451 | 0.9449 | 0.9414
Log-C | 321.5534 | 271.6278 | 316.7998 | 277.3920 | 0.9188 | 0.9421 | 0.9212 | 0.9396
useful insights for developers and researchers, as it can save time when building DNN architectures and analyzing time-series data. In this paper, we thoroughly investigated the existing loss functions in time series analysis for regression tasks. The results are presented in Tables 4, 5, 6, and 7, and visualized in Figs. 3, 4, 5, and 6. In the future, custom loss functions could also be created to compute losses during the training and testing phases of DNN learning processes, which would be more suitable for specific problems. This study focused solely on selecting the appropriate loss function for deep learning models on time-series datasets to solve regression problems.

Fig. 3 DS1 R-Square details
Fig. 4 DS2 R-Square details
Fig. 5 DS3 R-Square details
Fig. 6 DS4 R-Square details
7 Conclusion DNNs have various parameters that make them flexible, such as activation functions, loss functions, and the number of layers. The loss function is a particularly important parameter for determining the difference between predicted and actual values. Different researchers have developed and used various loss functions to solve specific
problems, but choosing the right one for regression is a time-consuming process. In this paper, we focus on how to select appropriate DNN loss functions in time series analysis for regression. We conducted experiments to compare different regression loss functions. Our results show that the Huber Loss outperforms other loss functions. This paper could be extended to analyze other loss functions for different tasks. In the future, custom loss functions could be developed to improve results by reducing computational complexity and time.
References
1. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. The MIT Press
2. Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu M-L, Chen S-C, Iyengar SS (2018) A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR) 51:1–36
3. Dong Q, Lin Y, Bi J, Yuan H (2019) An integrated deep neural network approach for large-scale water quality time series prediction. In: 2019 IEEE international conference on systems, man and cybernetics (SMC), pp 3537–3542
4. Rezaei H, Faaljou H, Mansourfar G (2021) Stock price prediction using deep learning and frequency decomposition. Expert Syst Appl 169:114332
5. Livieris IE, Stavroyiannis S, Pintelas E, Pintelas P (2020) A novel validation framework to enhance deep learning models in time-series forecasting. Neural Comput Appl 32:17149–17167
6. Torres JF, Hadjout D, Sebaa A, Martínez-Álvarez F, Troncoso A (2021) Deep learning for time series forecasting: a survey. Big Data 9:3–21
7. Li X, Wu X, Chen J (2019) A spectral-change-aware loss function for DNN-based speech separation. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6870–6874
8. Jain N, Kumar S, Kumar A, Shamsolmoali P, Zareapoor M (2018) Hybrid deep neural networks for face emotion recognition. Pattern Recogn Lett 115:101–106
9. Singh B, Jaiswal R (2021) Automation of prediction method for supervised learning. In: 11th international conference on cloud computing, data science & engineering (Confluence). IEEE, Noida, India, pp 816–821
10. Zhao H, Gallo O, Frosio I, Kautz J (2015) Loss functions for neural networks for image processing. ArXiv preprint arXiv:1511.08861
11. Zhao H, Gallo O, Frosio I, Kautz J (2016) Loss functions for image restoration with neural networks. IEEE Trans Comput Imaging 3:47–57
12. Geron A (2019) Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: concepts, tools, and techniques to build intelligent systems. O'Reilly Media
13. Scharf LL, Demeure C (1991) Statistical signal processing: detection, estimation, and time series analysis. Prentice Hall
14. Mahalakshmi G, Sridevi S, Rajaram S (2016) A survey on forecasting of time series data. In: 2016 international conference on computing technologies and intelligent data engineering (ICCTIDE'16), pp 1–8
15. Yamak PT, Yujian L, Gadosey PK (2019) A comparison between ARIMA, LSTM, and GRU for time series forecasting. In: Proceedings of the 2019 2nd international conference on algorithms, computing and artificial intelligence, pp 49–55
16. Dietterich TG (2002) Machine learning for sequential data: a review. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR). Springer, Berlin, Heidelberg, pp 15–30
17. Mitsa T (2010) Temporal data mining. CRC Press
18. Chollet F (2018) Deep learning with Python. Manning Publications Co.
162
R. Jaiswal and B. Singh
19. Tang W, Long G, Liu L, Zhou T, Jiang J, Blumenstein M (2020) Rethinking 1d-cnn for time series classification: a stronger baseline. ArXiv preprint arXiv:2002.10061 20. Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A field guide to dynamical recurrent neural networks. IEEE Press 21. Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE (2017) A survey of deep neural network architectures and their applications. Neurocomputing 234:11–26 22. Janocha, Czarnecki (2017) On loss functions for deep neural networks in classification. ArXiv preprint arXiv:1702.05659 23. Ghosh A, Kumar H, Sastry PS (2017) Robust loss functions under label noise for deep neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence 31(1):2017 24. Wang Q, Ma Y, Zhao K, Tian Y (2020) A comprehensive survey of loss functions in machine learning. Ann Data Sci 1–26 25. Yu R, Wang Y, Zou Z, Wang L (2020) Convolutional neural networks with refined loss functions for the real-time crash risk analysis. Transp Res Part C: Emerg Technol 119:102740 26. El Jurdi R, Petitjean C, Honeine P, Cheplygina V, Abdallah F (2021) High-level prior-based loss functions for medical image segmentation: a survey. Comput Vis Image Underst 210:103248 27. Zabihzadeh D (2021) Ensemble of Loss Functions to Improve Generalizability of Deep Metric Learning methods. ArXiv preprint arXiv:2107.01130 28. Khammar AH, Arefi M, Akbari MG (2021) A general approach to fuzzy regression models based on different loss functions. Soft Comput 25:835–849 29. Ma X, Huang H, Wang Y, Romano S, Erfani S, Bailey J (2020) Normalized loss functions for deep learning with noisy labels. In: International Conference on Machine Learning, pp. 6543– 6553 30. Fernández-Delgado M, Sirsat MS, Cernadas E, Alawadi S, Barro S, Febrero-Bande M (2019) An extensive experimental survey of regression methods. Neural Netw 111:11–34 31. Zhang Z, Dong Y (2020) Temperature forecasting via convolutional recurrent neural networks based on time-series data. Complexity 2020 32. Medsker LR, Jain LC (2001) Recurrent neural networks. Design and Applications 5:64–67 33. Bisong E (2019) Recurrent Neural Networks (RNNs). In: Building machine learning and deep learning models on Google cloud platform, Springer, p. 443–473 34. Petneházi G (2019) Recurrent neural networks for time series forecasting. ArXiv preprint arXiv: 1901.00069 35. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–1780 36. Dey R, Salem FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), IEEE, pp. 1597–1600 37. Siami-Namini S, Tavakoli N, Namin AS (2019) The performance of LSTM and BiLSTM in forecasting time series. In: 2019 IEEE International Conference on Big Data (Big Data), IEEE, pp. 3285–3292 38. Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45:2673–2681 39. Singh B, Jaiswal R (2021) Impact of hybridization of deep learning models for temporal data learning. In 2021 IEEE 8th Uttar Pradesh section international conference on electrical, electronics and computer engineering (UPCON), IEEE 40. Gutiérrez PA, Perez-Ortiz M, Sanchez-Monedero J, Fernandez-Navarro F, Hervas-Martinez C (2015) Ordinal regression methods: survey and experimental study. IEEE Trans Knowl Data Eng 28:127–146 41. 
Cherkassky V, Ma Y (2004) Comparison of loss functions for linear regression. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Vol. 1, pp. 395–400 42. Qi J, Du J, Siniscalchi SM, Ma X, Lee C-H (2020) On mean absolute error for deep neural network based vector-to-vector regression. IEEE Signal Process Lett 27:1485–1489
A Comparative Study of Loss Functions for Deep Neural Networks …
163
43. Sangari A, Sethares W (2015) Convergence analysis of two loss functions in soft-max regression. IEEE Trans Signal Process 64:1280–1288 44. Cirstea RG, Micu DV, Muresan GM, Guo C, Yang B (2018) Correlated time series forecasting using multi-task deep neural networks. In: Proceedings of the 27th ACM International conference on information and knowledge management, pp. 1527–1530 45. Yoo S, Kang N (2021) Explainable artificial intelligence for manufacturing cost estimation and machining feature visualization. Expert Syst Appl 115430 46. Chen P, Chen G, Zhang S (2018) Log hyperbolic cosine loss improves variational auto-encoder 47. Meyer (2021) An alternative probabilistic interpretation of the Huber loss. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5261–5269 48. Yi C, Huang J (2017) Semismooth newton coordinate descent algorithm for elastic-net penalized huber loss regression and quantile regression. J Comput Graph Stat 26:547–557 49. Dataset-Source, UCI Machine Learning Repository, https://archive-beta.ics.uci.edu/, UCI Machine Learning. Last Accessed Sept. 2021 50. Dataset-Source, Kaggle Data, https://www.kaggle.co, Kaggle, Last Accessed Sept. 2021
Learning Algorithm for Threshold Softmax Layer to Handle Unknown Class Problem

Gaurav Jaiswal
Abstract Neural networks are mostly trained with predefined-class training data in supervised learning. But when unknown test data (outside the predefined classes) are classified by a trained neural network, they are always misclassified into the predefined classes, and thus the misclassification rate of the trained network increases. To tackle this problem, a Threshold Softmax Layer (TSM) and its learning algorithm are proposed, in which a normalized probability of each output class of the neural network is calculated and a threshold value is updated for each class during the threshold learning process. If the maximum normalized probability of a test sample does not cross the threshold value of the corresponding class, the sample is classified into the unknown class. The TSM layer with a neural network is evaluated on three UCI benchmark datasets (Glass, Yeast and Wine quality) and successfully handles the unknown class problem with reduced misclassification error. Keywords Threshold softmax layer · Unknown class · Open set classification · Neural network · Softmax function · Misclassification error
1 Introduction

Neural network-based classification models such as MLP, BPNN, CNN and ELM are widely used in various fields [26, 37], e.g. computer vision [7, 21, 29], the health sector [1, 24], the industrial sector [23, 33], OCR [8, 17], biometrics [15], the financial sector [27, 34], etc., for classification of real-world data. These models are generally trained on specific predefined-class data for specific tasks. One of the main limitations of these classification models is that they always classify any unknown input data into a predefined output class, whether the input belongs to a predefined class or not [26, 37]. This increases the misclassification rate of the classification model.
Fig. 1 Unknown class classification problem
To understand the classification problem, suppose a classifier is trained with three predefined classes of training data (dog, cat and mouse). When the input domain of the classifier is closed to the predefined classes, the classifier performs well. If input data (bird), other than the predefined classes, is given to this trained classifier, the bird data will be classified into one of the predefined classes (dog, cat and mouse), even though the bird input does not belong to any predefined class. Therefore, the misclassification error of the classifier increases. This problem is depicted in Fig. 1. Traditional classifiers only work in closed class environments to avoid such misclassification errors for unknown class data. For the generalization of classifiers to open set environments, various researchers have explored the open set classification domain and developed various models to handle the unknown class problem [2, 3, 18, 30, 31]. In this paper, a threshold softmax layer is introduced and a threshold learning algorithm is proposed for handling the unknown class problem in classification. The proposed threshold softmax layer (TSM layer) and learning algorithm reduce the misclassification rate of the neural network. The layer is based on a threshold on the normalized probability of each class. The threshold value works as the decision criterion for classification into the unknown class. This layer is deployed in the neural network to improve its unknown class classification capability. This paper is organized as follows: Sect. 2 reviews the related work in the literature. Section 3 describes the concepts of the threshold softmax layer with its application to handle unknown test data. The TSM layer is implemented and the improved neural network is evaluated on three UCI benchmark classification datasets in Sect. 4. Finally, the paper is concluded in Sect. 5. Symbols used in this paper are given in Table 1.
Table 1 Symbols and descriptions

Sr. No.  Symbol                    Description
1        Y = [y1, y2, ..., yk]     Output vector
2        σ(Y)                      Softmax function
3        e^{y_i}                   Un-normalized probability of the ith value (y_i) of output vector Y
4        [X_n, Y_n]                n training samples
5        T_k                       Threshold vector of length k
6        T_m                       mth value of threshold vector
7        O                         Output score of neural network
8        p                         Confidence value
9        i, j, k, m                Indices of vectors
2 Related Work

Various classification models have been proposed to handle unknown classes for open set classification. In the literature, Gorte et al. [11] first considered this problem and proposed a non-parametric classification algorithm using a posteriori probability vectors. Gupta et al. [13] introduced a binary tree structure called the class discovery tree for dealing with unknown classes. Scheirer et al. [31] formalized the open set recognition problem, open space risk and openness of classification, and proposed the 1-vs-set machine. Afterwards, Scheirer et al. [30] proposed a novel Weibull-calibrated SVM (W-SVM) which is based on compact abating probability and binary SVM. Rattani et al. [28] used binary and multiclass W-SVM for fingerprint spoof detection. Costa et al. [4] extended SVM using a decision boundary carving algorithm for source camera attribution and device linking. Jain et al. [16] proposed Pi-SVM for estimating the un-normalized posterior probability of class inclusion. Li et al. [22] extended nearest neighbour using transduction confidence, while Júnior et al. [20] used a distance ratio for open set classification. Bendale et al. proposed the nearest non-outlier algorithm [2] and the OpenMax layer [3]. Ge et al. [9] extended OpenMax by applying generative adversarial networks (GANs); GANs are used to generate fake unknown data [19, 36] for training the classifier. Zhang et al. [38] simplified the open set recognition problem into a set of hypothesis testing problems and proposed a sparse representation-based model using extreme value theory. Günther et al. [12] suggested thresholding extreme value machine probabilities to handle the open set face recognition problem. Shu et al. [32] proposed a joint open classification model with a sub-model for classifying known and unknown classes. Neira et al. [25] proposed a newly designed open set graph-based optimum path forest classifier using genetic programming and majority voting fusion techniques. Xiao et al. [35] gave the idea of a compact binary feature generated by an ensemble binary classifier. Geng et al. [10] introduced a hierarchical Dirichlet process-based model which does not overly depend on training samples. Hassen et al. [14] proposed
a loss function-based neural network representation for open set recognition. Different from existing models, our proposed threshold softmax layer is an extension of the neural network which enables a simple neural network to classify unknown classes.
3 Threshold Softmax Layer

The TSM layer combines two processes, i.e. the softmax function [6] and threshold learning. The architecture of a neural network with the TSM layer is shown in Fig. 2. Descriptions of the softmax function [6], threshold learning and handling the unknown class problem using the TSM layer are given in Sects. 3.1, 3.2 and 3.3, respectively.
3.1 Softmax Function

The softmax function is a neural transfer function used for calculating the normalized probability of each output class for a given input sample [6]. It is defined as follows. Let Y = [y_1, y_2, ..., y_k] be the output vector of the k nodes of the output layer of the neural network. The softmax function σ(Y) calculates the predicted normalized probability of each class:

$$\sigma : Y = [y_1, y_2, \ldots, y_k] \rightarrow [0, 1]^k \quad (1)$$

The un-normalized exponential probability is calculated by raising e to the power of each y_i:

$$prob_i = e^{y_i}, \quad \text{for } i = 1, 2, \ldots, k \quad (2)$$
Fig. 2 Architecture of Neural network with TSM layer
Fig. 3 Illustration of softmax function
The normalized probability of each class is calculated by dividing each un-normalized probability by the sum of all classes' un-normalized probabilities:

$$\sigma(Y)_i = \frac{e^{y_i}}{\sum_{j=1}^{k} e^{y_j}}, \quad \text{for } i = 1, 2, \ldots, k \quad (3)$$
The softmax function can thus be understood as computing the class probability for each input sample. The working of the softmax function is illustrated in Fig. 3.
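To make the computation concrete, the following is a minimal NumPy sketch of the softmax function of Eqs. (1)–(3); subtracting the maximum before exponentiating is a standard numerical-stability trick added in this sketch, not part of the definition above.

import numpy as np

def softmax(y):
    # Normalized probability of each class (Eqs. 2 and 3).
    # Subtracting the max does not change the result of Eq. (3)
    # but avoids overflow in the exponentials.
    e = np.exp(y - np.max(y))
    return e / e.sum()

# Output vector Y of a k-node output layer
Y = np.array([1.3, 5.1, 2.2, 0.7])
print(softmax(Y))  # entries lie in [0, 1] and sum to 1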
3.2 Threshold Learning Algorithm

The TSM layer of the neural network learns threshold values for each class during the threshold learning process. When the softmax function computes the normalized probability σ(Y_k) of each class for a given output layer vector Y_k, the threshold learning process updates the threshold value in the threshold vector T_k for each class by taking the minimum of the normalized probability of the actual class. A confidence value p (range 0.0–1.0) is taken from the user and is used to prevent the threshold value from approaching 0 when noisy or outlier data occur. The final threshold value is determined by taking the maximum of the confidence value (p) and the learned threshold value. The working of threshold learning is given in Algorithm 1.
3.3 Handling Unknown Class Problem Using TSM Layer

The threshold vector (T_k) obtained by the threshold learning process contains the threshold value of each predefined class. The threshold value of each class works as the decision criterion for classification. If any unknown sample is given to the trained improved neural network, the output layer predicts the output score. In the TSM layer, the normalized
Algorithm 1 Algorithm for Threshold learning
Input: Training samples [X_n, Y_n] with k classes, Confidence value (p)
Output: Threshold vector T_k
1: Initialize threshold vector [T_k] to [1]^k
2: for each sample [X_i, Y_i] in [X_n, Y_n] do
3:   Calculate feed-forward output score [O_i] of the trained neural network
4:   Calculate the normalized probability σ([O_i]) using the softmax function
5:   O_max = max(σ([O_i]))
6:   Update T_m in T_k for m = class value of Y_i
7:   T_m = min(O_max, T_m(prev))
8: end for
9: for each T_m in T_k do
10:  Update T_m = max(T_m, p)
11: end for
class probability is calculated using the softmax function [6]. If the maximum normalized probability is greater than the corresponding threshold value, the corresponding class value is assigned; otherwise, the unknown class value is assigned. The procedure for handling unknown test data is given in Algorithm 2. For example, suppose that for a given sample the normalized probability vector is [0.0345, 0.5293, 0.2542, 0.1820] and the learned threshold vector for confidence value (p) 0.5 is [0.5, 0.7823, 0.8246, 0.8185]. The maximum normalized class probability 0.5293 is less than the corresponding threshold value 0.7823, so 'unknown class' is assigned to the sample.

Algorithm 2 Algorithm for Handling Unknown Class
Input: Unknown sample [X], Threshold vector [T_k] of k predefined classes
Output: Class_value
1: Calculate feed-forward output score [O] of the trained neural network for input [X]
2: Calculate normalized probability σ([O]) using the softmax function
3: Find i such that σ([O])_i = max(σ([O])), for i = 1, 2, ..., k
4: if σ([O])_i ≥ T_i then
5:   Assign i to Class_value
6: else
7:   Assign 'Unknown Class' to Class_value
8: end if
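A compact Python sketch of Algorithms 1 and 2 is given below, reusing the softmax helper from Sect. 3.1; it assumes the feed-forward scores of the already-trained network are available as a matrix and that class labels are 0-indexed.

import numpy as np

def learn_thresholds(scores, labels, k, p=0.5):
    # Algorithm 1: scores is the (n, k) matrix of feed-forward output
    # scores of the trained network, labels the (n,) integer class labels.
    T = np.ones(k)                      # step 1: initialize T_k to [1]^k
    for O, m in zip(scores, labels):
        O_max = softmax(O).max()        # steps 4-5
        T[m] = min(O_max, T[m])         # steps 6-7: minimum over the actual class
    return np.maximum(T, p)             # steps 9-11: clip with the confidence value

def classify(score, T):
    # Algorithm 2: assign a predefined class index or the unknown class.
    probs = softmax(score)
    i = int(np.argmax(probs))
    return i if probs[i] >= T[i] else 'Unknown Class'

On the worked example above, classify would compare the maximum normalized probability 0.5293 against the threshold 0.7823 and return 'Unknown Class'.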
4 Experiment and Results

4.1 Experimental Setup

A neural network with the TSM layer was implemented in Python and evaluated on three standard UCI machine learning datasets: Glass, Yeast and Wine quality (white) [5]. Each dataset is divided into train and test data in two ratios: dataset A (ratio
Table 2 Description of benchmark datasets and their experimental configuration

Property           Glass [5]                 Yeast [5]                 Wine quality [5]
No. of attributes  9                         8                         11
No. of classes     6                         10                        11
No. of samples     214                       1484                      4898
DS A Train (80%)   5 class data were taken   8 class data were taken   8 class data were taken
DS A Test (20%)    all class data were taken all class data were taken all class data were taken
DS B Train (70%)   5 class data were taken   8 class data were taken   8 class data were taken
DS B Test (30%)    all class data were taken all class data were taken all class data were taken
80:20) and dataset B (ratio 70:30). The training set of each dataset is modified by removing the samples of some particular classes, so that the samples of those classes can be treated as unknown-class data. Training samples of five out of six classes are taken in the Glass dataset. In the Yeast dataset, training samples of eight out of ten classes are taken. Training samples of eight out of eleven classes are taken in the Wine quality (white) dataset. In the test data, samples of all classes are taken. Table 2 shows the properties of each dataset and the experimental configuration of the training and test data.
4.2 Experimental Results

We first train a neural network with the training data of each dataset. The neural network is evaluated on the test data of each dataset, which also contains the unknown samples. The model predicts a predefined class value for the unknown-class samples. This experiment shows that misclassification errors are higher due to the unknown-class samples in the test data. Therefore, the neural network with the TSM layer (same neural network configuration) is trained with the training data of each dataset with confidence value (p) 0.5. Here, the determination of the confidence value (p) parameter is critical: if a confidence value of 0 is chosen, most of the unknown-class samples are classified into predefined classes; if a confidence value of 1 is chosen, all samples are classified into the unknown class. So the default value 0.5 is chosen. Now, the neural network with the TSM layer is evaluated on the test dataset. Details of the experimental results and the improvement in the reduction of misclassification error for each dataset are tabulated in Table 3. The comparison of performance based on the misclassification error of the neural network (NN) and the neural network with TSM layer (NN+TSM) is shown in Fig. 4, which clearly shows that NN+TSM reduces the misclassification error.
Fig. 4 Comparison of performance of NN and NN+TSM methods
Table 3 Details of experimental results of all datasets

                         Misclassification error (%)
Dataset  Split           Neural network (NN)  Proposed (NN+TSM)  Improvement (%)  Average improvement (%)
Glass    DS A (80:20)    16.28                11.63              28.56            27.27
Glass    DS B (70:30)    18.05                14.06              25.98
Yeast    DS A (80:20)     9.76                 6.40              34.43            33.17
Yeast    DS B (70:30)    10.56                 7.19              31.91
Wine     DS A (80:20)    11.73                 7.45              36.49            36.57
Wine     DS B (70:30)    11.98                 7.83              36.64
Table 3 shows that the average improvements for the Glass, Yeast and Wine quality datasets are 27.27, 33.17 and 36.57% respectively. These improvements show that test data that include unknown samples are successfully classified, and misclassification errors are reduced.
5 Conclusion

The proposed threshold learning algorithm and Threshold Softmax layer (TSM layer) are implemented, and their application to handle the unknown class problem is illustrated. The experimental results show that the TSM layer reduces the misclassification error of the neural network by 27.27, 33.17 and 36.57% for the Glass dataset, Yeast dataset
and Wine quality dataset respectively. The threshold learning of this layer successfully handles the unknown class problem. The proposed work extends classification models from their specific closed input domains to the open set domain.
References 1. Agatonovic-Kustrin S, Beresford R (2000) Basic concepts of artificial neural network (ann) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 22(5):717–727 2. Bendale A, Boult T (2015) Towards open world recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1893–1902 3. Bendale A, Boult TE (2016) Towards open set deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1563–1572 4. Costa FDO, Silva E, Eckmann M, Scheirer WJ, Rocha A (2014) Open set source camera attribution and device linking. Pattern Recognit Lett 39:92–101 5. Dheeru D, Karra Taniskidou E (2017) UCI machine learning repository. http://archive.ics.uci. edu/ml 6. Duch W, Jankowski N (1999) Survey of neural transfer functions. Neural Comput Surv 2(1):163–212 7. Egmont-Petersen M, de Ridder D, Handels H (2002) Image processing with neural networks-a review. Pattern Recognit 35(10):2279–2301 8. Ganis M, Wilson CL, Blue JL (1998) Neural network-based systems for handprint ocr applications. IEEE Trans Image Process 7(8):1097–1112 9. Ge Z, Demyanov S, Chen Z, Garnavi R (2017) Generative openmax for multi-class open set classification. In: British machine vision conference 2017. British machine vision association and society for pattern recognition 10. Geng C, Chen S (2018) Hierarchical dirichlet process-based open set recognition. arXiv:1806.11258 11. Gorte B, Gorte-Kroupnova N (1995) Non-parametric classification algorithm with an unknown class. In: International symposium on computer vision, 1995. Proceedings. IEEE, pp 443–448 12. Günther M, Cruz S, Rudd EM, Boult TE (2017) Toward open-set face recognition. In: Conference on computer vision and pattern recognition (CVPR) workshops. IEEE 13. Gupta C, Wang S, Dayal U, Mehta A (2009) Classification with unknown classes. In: International conference on scientific and statistical database management. Springer, pp 479–496 14. Hassen M, Chan PK (2020) Learning a neural-network-based representation for open set recognition. In: Proceedings of the 2020 SIAM international conference on data mining. SIAM, pp 154–162 15. Jain LC, Halici U, Hayashi I, Lee S, Tsutsui S (1999) Intelligent biometric techniques in fingerprint and face recognition, vol 10. CRC Press 16. Jain LP, Scheirer WJ, Boult TE (2014) Multi-class open set recognition using probability of inclusion. In: European conference on computer vision. Springer, pp 393–409 17. Jaiswal G (2014) Handwritten devanagari character recognition model using neural network. Int J Eng Dev Res 901–906 18. Jaiswal G (2021) Performance analysis of incremental learning strategy in image classification. In: 2021 11th international conference on cloud computing, data science and engineering (Confluence). IEEE, pp 427–432 19. Jo I, Kim J, Kang H, Kim YD, Choi S (2018) Open set recognition by regularising classifier with fake data generated by generative adversarial networks. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2686–2690 20. Júnior PRM, de Souza RM, Werneck RDO, Stein BV, Pazinato DV, de Almeida, WR, Penatti OA, Torres RDS, Rocha A (2017) Nearest neighbors distance ratio open-set classifier. Mach Learn 106(3):359–386
21. Kannojia SP, Jaiswal G (2018) Ensemble of hybrid cnn-elm model for image classification. In: 2018 5th international conference on signal processing and integrated networks (SPIN). IEEE, pp 538–541 22. Li F, Wechsler H (2005) Open set face recognition using transduction. IEEE Trans Pattern Anal Mach Intell 27(11):1686–1697 23. Lu CH, Tsai CC (2008) Adaptive predictive control with recurrent neural network for industrial processes: an application to temperature control of a variable-frequency oil-cooling machine. IEEE Trans Ind Electron 55(3):1366–1375 24. Miller A, Blott B et al (1992) Review of neural network applications in medical imaging and signal processing. Med Biol Eng Comput 30(5):449–464 25. Neira MAC, Júnior PRM, Rocha A, Torres RDS (2018) Data-fusion techniques for open-set recognition problems. IEEE Access 6:21242–21265 26. Paliwal M, Kumar UA (2009) Neural networks and statistical techniques: a review of applications. Expert Syst Appl 36(1):2–17 27. Raghupathi W, Schkade LL, Raju BS (1991) A neural network application for bankruptcy prediction. In: Proceedings of the twenty-fourth annual hawaii international conference on system sciences, vol 4. IEEE, pp 147–155 28. Rattani A, Scheirer WJ, Ross A (2015) Open set fingerprint spoof detection across novel fabrication materials. IEEE Trans Inf Forensics Secur 10(11):2447–2460 29. Schalkoff RJ (1989) Digital image processing and computer vision, vol 286. Wiley, New York 30. Scheirer WJ, Jain LP, Boult TE (2014) Probability models for open set recognition. IEEE Trans Pattern Anal Mach Intell 36(11):2317–2324 31. Scheirer WJ, de Rezende Rocha A, Sapkota A, Boult TE (2013) Toward open set recognition. IEEE Trans Pattern Anal Mach Intell 35(7):1757–1772 32. Shu L, Xu H, Liu B (2018) Unseen class discovery in open-world classification. arXiv:1801.05609 33. Widrow B, Rumelhart DE, Lehr MA (1994) Neural networks: applications in industry, business and science. Commun ACM 37(3):93–106 34. Wong BK, Selvi Y (1998) Neural network applications in finance: a review and analysis of literature (1990–1996). Inf Manag 34(3):129–139 35. Xiao H, Sun J, Yu X, Wang L (2018) Compact binary feature for open set recognition. In: 2018 13th IAPR international workshop on document analysis systems (DAS). IEEE, pp 235–238 36. Yu X, Sun J, Naoi S (2018) Generative adversarial networks for open set historical chinese character recognition. Electron Imaging 2018(2):1–5 37. Zhang GP (2000) Neural networks for classification: a survey. IEEE Trans Syst, Man, Cybern, Part C (Applications and Reviews) 30(4):451–462 38. Zhang H, Patel VM (2017) Sparse representation-based open set recognition. IEEE Trans Pattern Anal Mach Intell 39(8):1690–1696
Traffic Monitoring and Violation Detection Using Deep Learning

Omkar Sargar, Saharsh Jain, Sravan Chittupalli, and Aniket Tatipamula
Abstract The traffic density on roads has been increasing rapidly for the past few decades, which has in turn been reflected in the increase in traffic violations and accidents. Official reports from various governments and private entities bolster the fact that the current methods for traffic monitoring are ill-equipped to deal with the huge traffic density [1, 2]. These methods, which traditionally included the deployment of traffic police personnel at a select few junctions where the traffic density is high, ignore the majority of the other roads. Traffic monitoring systems that exploit image processing, computer vision and deep learning techniques thus come out to be a viable and optimal solution to monitor traffic and detect violations. These systems can easily be integrated with the architecture of law enforcement to penalize violators in real time. The proposed method, which utilizes YOLOv3 and SORT, is effective and accurate in detecting several violations like over-speeding, wrong-way driving, signal jumping, driving without a helmet and triple seat violation. It also helps to keep track of the count of vehicles, their types and the number of axles for multi-axle vehicles, thus asserting itself as a novel and indigenous solution to a widely recognized problem. Keywords Traffic monitoring · Traffic violations · Computer vision · Deep learning · CNN · YOLOv3 · Object tracking
1 Introduction

The humongous growth in the number of vehicles has led to many severe problems, both logistical and environmental. The traditional manual checking of traffic violations does not hold up to the high volumes of traffic. Several studies [1] have shown that manual checking is far from an optimal method for monitoring traffic violations for a variety of reasons: the inability to monitor large volumes of traffic, the difficulty of monitoring for multiple violations simultaneously, sub-optimal use of manpower, and even corruption. Consequently, there is a pressing need to automate this process to make it more streamlined and cover all the flaws of the traditional system. This need led to the birth of newer systems which primarily focused only on detecting over-speeding; but these systems are costly, sub-optimal and easily avoided. These drawbacks further pushed researchers, and traffic violation systems based on deep learning came into existence; these latest systems are able to perform multiple tasks like counting vehicles and detecting over-speeding and signal jumping. The newer systems, which utilize the power of traditional image processing along with artificial intelligence and deep learning, increase the efficiency of the process and reduce the operating costs significantly. The video captured from the camera is processed to detect violations in real time, and the same video can also be used as evidence of the violation.
2 Proposed Method

The proposed system is able to run multiple traffic monitoring tasks on multiple video streams. The system flowchart can be seen in Fig. 1. Starting the application with video streams initializes the networks. The loading of a particular network depends on the configured video analytic tasks for that video stream. Since tasks like ANPR, axle detection and helmet detection are triggered only after certain events, such as a violation or entering a region of interest, the batch size for those tasks is defaulted to 1. The batch size for the YOLO network, which is responsible for detecting vehicles and people, is decided by the available memory on the device. The processing starts by creating batches for the YOLO network and getting the detections on the frames. The detections are then tracked using the SORT tracker, and with each tracker we store useful information about that object, such as whether it has crossed a particular ROI or the number of axles detected, which is required for the monitoring tasks and for display purposes (a sketch of this per-object bookkeeping follows below). The specifics of each task are explained in their respective sections. For readability, the flowchart corresponds to a single video stream; however, multiple video streams can be configured and loaded in the system. Adding a new video stream requires reconfiguration of the networks and selection of ROIs for that stream.
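The per-object bookkeeping described above can be sketched as follows; the field names are illustrative assumptions for exposition, not the authors' actual data structure.

# Illustrative per-track state kept alongside each SORT track ID
# (field names are assumptions, not the paper's code).
tracker_state = {}  # track_id -> info dict

def update_track(track_id, centroid):
    info = tracker_state.setdefault(track_id, {
        'centroids': [],     # position history for line-crossing tests
        'counted': False,    # already crossed the counting line?
        'axle_counts': [],   # votes from the axle-detection network
    })
    info['centroids'].append(centroid)
    return info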
Fig. 1 System Flowchart
2.1 YOLO

YOLO (You Only Look Once) is a convolutional neural network (CNN) for performing real-time object detection [2, 3]. As its name suggests, YOLO needs to look at the input image only once to predict what objects are present and where. 'YOLOv3 predicts an objectness score for each bounding box using logistic regression; the score is 1 if the bounding box prior overlaps a ground-truth object by more than any other bounding box prior.' The bounding boxes are predicted at 3 different scales: 13 × 13, 26 × 26 and 52 × 52. This helps the last scale benefit from all the prior computation as well as fine-grained features from early on in the network (Fig. 2). YOLOv3 has a total of 53 convolutional layers, called Darknet-53. This network is much more powerful than Darknet-19 yet still more efficient than ResNet-101 or ResNet-152. The output generated by YOLOv3 has the same height and width as the input image; the depth is calculated by

Depth = (4 + 1 + class probabilities) × 3   (1)

For example, for the 80-class COCO dataset this gives (4 + 1 + 80) × 3 = 255 output channels per scale.
Fig. 2 All the compared methods have similar mean average precision, but with a huge gap between them for processing times. Since our project is highly reliant on fast processing, we chose to use YOLOv3-416 which takes 29 ms to process the COCO data
2.2 SORT

Simple Online and Realtime Tracking (SORT) is a pragmatic approach to multiple object tracking where the main focus is to associate objects efficiently for online and real-time applications [4]. Besides detection, the SORT tracker is composed of 3 key components: propagating object states into future frames, associating current detections with existing objects, and managing the lifespan of tracked objects. The SORT tracker outperforms several other multi-object trackers (MOT) like TBD, SMOT, MDP, TDAM, etc. with regard to speed and accuracy.
2.3 Vehicle Counting

One of the most basic tasks of traffic monitoring is to count the number of vehicles that pass through a road. Counting logic also forms the basis for detecting violations like speed violation, triple seat violation, etc. (Fig. 4). Once an object is detected, it has to be tracked throughout the subsequent frames. As discussed, this part is done by the SORT algorithm: in each frame we get the ID of the object (Object_ID), and by doing this in every frame we have the history of the coordinates of the object. We then place a fixed line on the frame and check whether the line segment formed between the current centroid and the previous centroid of the same object intersects the fixed line. If they intersect, the object is counted and marked as counted, so that if for some reason the same object crosses the line again it will not be counted twice. An important condition is that the object ID must remain the same [5]. Figure 3 summarizes the intersection-of-lines logic. We can see clearly that only in the case of intersecting segments are the orientations of (p1, q1, p2) and (p1, q1, q2) different, with the orientations of (p2, q2, p1) and (p2, q2, q1) also different. For all other cases, one of the two orientation pairs is the same (Fig. 3).
Fig. 3 Intersection of lines logic
Fig. 4 Vehicle counting, Wrong side detection and Speed Detection
Algorithm 1 Vehicle Counting
Require: 4 points (line segment p1q1 and line segment p2q2)
A ← orientation of (p1, p2, q2)
B ← orientation of (q1, p2, q2)
C ← orientation of (p1, q1, p2)
D ← orientation of (p1, q1, q2)
if A ≠ B and C ≠ D then
  Lines are intersecting
else
  Lines are not intersecting
end if
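A plain-Python sketch of the orientation test and the general-case intersection check of Algorithm 1 follows (the collinear special cases are omitted here):

def orientation(p, q, r):
    # 0 = collinear, 1 = clockwise, 2 = counter-clockwise
    val = (q[1] - p[1]) * (r[0] - q[0]) - (q[0] - p[0]) * (r[1] - q[1])
    return 0 if val == 0 else (1 if val > 0 else 2)

def segments_intersect(p1, q1, p2, q2):
    a = orientation(p1, p2, q2)   # side of p1 w.r.t. segment p2q2
    b = orientation(q1, p2, q2)   # side of q1 w.r.t. segment p2q2
    c = orientation(p1, q1, p2)   # side of p2 w.r.t. segment p1q1
    d = orientation(p1, q1, q2)   # side of q2 w.r.t. segment p1q1
    return a != b and c != d

# A vehicle is counted when the segment between its previous and current
# centroid crosses the fixed counting line (P1, P2).
def crossed_line(prev_centroid, centroid, P1, P2):
    return segments_intersect(prev_centroid, centroid, P1, P2)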
2.4 Violations

2.4.1 Speed Detection
Speed limit violation is one of the most common traffic violations. According to the National Crime Records Bureau (NCRB), over 80% of fatalities in road accidents in India over the course of 2014–2020 happened due to over-speeding and reckless driving. The proposed method helps detect speed violations and can help penalize offenders in real time. It is based on the Visual Average Speed Computer and Recorder (VASCAR) [6], a common method employed by several law agencies to detect over-speeding. The VASCAR method is simply based on the equation

$$\text{speed} = \frac{\text{distance}}{\text{time}} \quad (2)$$
Applied manually, the VASCAR method is susceptible to human error and hence is not a reliable indicator of over-speeding. Our method employs the same logic used by VASCAR: we consider the time taken by a vehicle to cross a specific distance and then calculate the speed of the vehicle. Since our system is to be deployed on fixed closed-circuit television systems, we can easily choose an ROI. The system then calculates the time taken for a vehicle to pass through the ROI. We transform the ROI so as to account for the perspective view of the CCTVs, which increases the accuracy of the system. The system allows the maximum speed limit to be changed, making it more generalized and robust. If a vehicle is flagged as over-speeding, appropriate action is taken. Since the entry and exit of a vehicle in the ROI are automatically determined by the system, the errors previously caused by human intervention are eliminated.
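A minimal sketch of this check is shown below; the frame rate, the real-world length of the perspective-corrected ROI, and the speed limit are illustrative assumptions that must be configured per camera.

def estimate_speed_kmph(entry_frame, exit_frame, fps, roi_length_m):
    # Eq. (2): speed = distance / time, converted from m/s to km/h
    elapsed_s = (exit_frame - entry_frame) / fps
    return (roi_length_m / elapsed_s) * 3.6

speed = estimate_speed_kmph(entry_frame=120, exit_frame=150, fps=30, roi_length_m=20.0)
if speed > 60.0:  # configurable maximum speed limit
    print('Over-speeding: %.1f km/h' % speed)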
2.4.2 Wrong Side Detection
Driving in the wrong direction may happen for many reasons: the driver is inexperienced, is trying to take a shortcut, is intentionally breaking the rules, etc. Whatever the reason, driving in the wrong direction causes serious accidents and must be monitored. This part uses the algorithm explained in the counting section. When the vehicle crosses a line, we use 3 values to determine whether the vehicle is moving in the correct direction or violating the rules: (1) point 1 of the line (P1), (2) point 2 of the line (P2), and (3) the centroid of the vehicle before crossing the line. P1 and P2 are not in a particular order, so we first need to determine the orientation of the line with respect to the horizontal axis. If the angle is between 45° and 135°, the line is considered vertical; otherwise it is considered horizontal. For a vertically oriented line, vehicles move from left to right or from right to left, and for a horizontally oriented line, vehicles move from the top of the screen to the bottom or vice versa. After finding the orientation, the points are compared to bring them into a common notation, and then the position of the line relative to the previous centroid is found. Depending on the results given by the function, it can be said whether the vehicle is following the rules or committing a violation.
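The following is a simplified sketch of this direction check, assuming image coordinates (y grows downwards) and a configured allowed direction for the lane:

import math

def line_is_vertical(P1, P2):
    # A line at 45-135 degrees to the horizontal axis is treated as vertical.
    angle = abs(math.degrees(math.atan2(P2[1] - P1[1], P2[0] - P1[0])))
    return 45.0 <= angle <= 135.0

def is_wrong_way(P1, P2, prev_centroid, allowed):
    # allowed: 'left_to_right', 'right_to_left', 'top_to_bottom' or 'bottom_to_top'
    if line_is_vertical(P1, P2):
        came_from_left = prev_centroid[0] < (P1[0] + P2[0]) / 2.0
        moving = 'left_to_right' if came_from_left else 'right_to_left'
    else:
        came_from_top = prev_centroid[1] < (P1[1] + P2[1]) / 2.0
        moving = 'top_to_bottom' if came_from_top else 'bottom_to_top'
    return moving != allowed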
2.4.3 Signal Jumping
Red light jumping is also one of the foremost problems faced in India; there are some junctions where the traffic lights are altogether ignored. The proposed method helps detect red light jumping in a simple yet precise way. Yet again we exploit the fixed CCTVs to our advantage: we know the precise location of the traffic lights in the video frames, and thus we do not need to waste computational resources detecting them. For determining the status of the traffic lights, we find the dominant colour in the ROI around the traffic light. At any given time only one of the traffic lights (red, orange or green) glows, and thus finding the dominant colour is quite easy. For determining the dominant colour in an image, all the colours in the image are clustered with the help of k-means clustering. The dominant colour is simply determined by choosing the cluster to which the maximum number of pixels is assigned. Once the system is able to determine the status of the traffic lights, it is quite a trivial task to detect red light jumping. A line is drawn on the road indicating where vehicles must stop at a traffic signal; if any vehicle crosses this line during a red light, it is flagged and appropriate action is taken. Thus our approach gives a very simple yet elegant way to detect red light jumping. Since we do not use any neural network and rely on basic techniques, we are able to achieve very accurate results with very low use of computational resources.
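The dominant-colour test can be sketched with OpenCV's k-means as below; the ROI coordinates are site-specific assumptions, and mapping the returned BGR centre to red/orange/green is left as a simple threshold step.

import cv2
import numpy as np

def dominant_colour(frame, roi, k=3):
    # Cluster the pixels of the fixed traffic-light ROI and return the
    # centre of the largest cluster (a BGR triple).
    x, y, w, h = roi
    pixels = frame[y:y + h, x:x + w].reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centres = cv2.kmeans(pixels, k, None, criteria, 10,
                                    cv2.KMEANS_RANDOM_CENTERS)
    return centres[np.bincount(labels.flatten()).argmax()]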
2.4.4 Axle Detection
Axle detection is used at places where only restricted types of vehicles are allowed; the proposed method helps in detecting violations by vehicles which are not allowed on the road. The camera angle is an important aspect here: the whole vehicle, along with its wheels, has to be clearly visible. The recommended setup is to mount the camera facing the ground at 45° such that the whole side view is visible, and the field of view of the camera must be such that the vehicle stays in the scene for at least a few frames. Initially, basic image processing algorithms like Canny edge detection [7] with the Hough transform [8] were used to extract the circular structure of the wheel, but this method requires a lot of manual tuning of parameters and the detections were not consistent. So we propose a robust algorithm which can efficiently and reliably detect axles in varying conditions: we train a tiny YOLO network which specializes in detecting axles. This network has high accuracy and was trained on a dataset of about 1000 images. It would be inefficient to detect axles at every frame, so to reduce computation this network is used only when the vehicle is inside the chosen ROI. The ROI is divided into 3 lines as shown in the figure. As soon as a vehicle crosses a line, the image of the vehicle is cropped and sent to the tiny network, which detects the wheels and counts them. This count is stored for each vehicle. The same process is followed for the other 2 ROI lines. After the last ROI line is crossed, the axle count is determined: the count that occurred the most number of times is taken as the axle count of the vehicle. This helps reduce the wrong inferences sometimes caused by occlusions (Figs. 5 and 6).
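The majority vote over the three ROI crossings can be written in a few lines; the example counts are illustrative.

from collections import Counter

def final_axle_count(axle_counts):
    # axle_counts: per-vehicle votes from the tiny YOLO network,
    # e.g. [2, 3, 3] recorded at the three ROI lines.
    return Counter(axle_counts).most_common(1)[0][0]

print(final_axle_count([2, 3, 3]))  # -> 3; an occlusion at one line is outvoted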
Fig. 5 Axle detection system
Fig. 6 Result as stored by the system. Riders without helmet are tagged as violated
2.4.5 Helmet Detection
Existing works on helmet violation [9, 10] operate in two stages. The first stage detects the motorbike, and the second stage considers approximate crops where the helmet would be present. However, the size of the region of interest depends on the camera placement and requires tuning the width and height of the crops. The proposed method has the potential to generalize to any video perspective. The algorithm is triggered when the bike crosses the region of interest. The detection of motorbike, person and helmet is done by the YOLO network. The parameter 'threshold' refers to the minimum intersection area for the helmet to get associated with the person.

Algorithm 2 Helmet Detection
Require: Detection list of persons and helmets
A ← motorbike's centroid crosses the ROI line
B ← person with maximum intersection with motorbike
C ← helmet with maximum intersection with rider
if A = True then
  rider ← B
  compute C
  if intersection area > threshold then
    rider is wearing a helmet
  else
    rider is not wearing a helmet
    trigger ANPR
  end if
end if
Currently, the helmet violation is detected only for the rider. Future scope would be to detect helmet violation for all the persons on the bike.
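The intersection-area association shared by this algorithm and the triple seat algorithm of the next section can be sketched as follows; boxes are assumed to be (x1, y1, x2, y2) pixel corners.

def intersection_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def associate_rider(bike_box, person_boxes):
    # The rider is the person with maximum intersection with the motorbike.
    return max(person_boxes, key=lambda p: intersection_area(bike_box, p))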
2.4.6 Triple Seat Detection
We propose a novel method to detect triple seat violation. It leverages the detection of people and motorbikes from YOLO. We test this algorithm on videos of Indian traffic. The algorithm can detect the number of riders, thus facilitating counting the passengers on top of triple seat detection. The algorithm is triggered whenever the bike crosses the region of interest. It takes as input the list of detections of all persons in the frame and the motorbike that crossed the line. The rider is the person with the maximum area of intersection with the bike. Additional passengers are determined by calculating the area of intersection with the bike and the rider. The algorithm can be further optimized for greater FPS on edge devices.

Algorithm 3 Triple Seat Detection
Require: Detection list of persons and motorbike
A ← motorbike's centroid crosses the ROI line
B ← person with maximum intersection with motorbike
if A = True then
  rider ← B
  associate remaining bikes with their riders and remove them from the person list
  for p in remaining person list do
    calculate intersection with the motorbike and rider
    if intersection area > threshold then
      passengerCount++
    end if
    if passengerCount > 2 then
      triple seat violation
      trigger ANPR
    end if
  end for
end if
2.4.7 ANPR
Number plate recognition is useful for maintaining logs of vehicles violating any of the traffic rules. The problem of ANPR [11, 12] consists of three parts: (i) detecting vehicles using YOLOv3 (passed from the violation detection methods), (ii) localizing the number plate using wpodNet, and (iii) recognizing characters using YOLOv3-tiny. A sketch of the full three-stage pipeline is given at the end of this section.
Localizing number plates This phase plays a major role in ANPR, since it decides the image of characters that will be passed on to the recognizer. Detecting number plates using contours or the Hough transform fails due to variations in lighting and noise in real-life data. The shear in images also requires a perspective transformation to straighten the characters. Tackling these issues separately based on contours or edges is not the ideal way for practical cases. wpodNet is capable of detecting and undistorting number plates using a single convolutional neural network. Transfer learning was used to fine-tune the network on Indian number plates. The network outputs the four corner coordinates of the number plate in the image. The number plate is then cropped and warped before being passed to the recognizing module.
Recognizing characters A YOLOv3-tiny network is used for recognizing the characters on the number plate. The network was trained on 20% real-life number plates and 80% artificial grayscale data generated using character images in different number plate fonts. Following are some of the key points of the dataset: • Single-line as well as multi-line number plates according to the Indian number plate format • Random rotation, translation and shear on individual characters as well as the whole number plate • Number plates with both light and dark backgrounds • Adding random combinations of Gaussian, Poisson, speckle, and salt-and-pepper noise • Introducing blur and brightness changes
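The three stages can be strung together as in the sketch below; crop, localize_plate, warp_to_rectangle and recognize_chars are hypothetical wrappers around the trained networks described above, not existing library calls.

def anpr(frame, vehicle_box):
    vehicle_crop = crop(frame, vehicle_box)           # stage 1 output (YOLOv3 detection)
    corners = localize_plate(vehicle_crop)            # stage 2: four plate corners (wpodNet)
    plate = warp_to_rectangle(vehicle_crop, corners)  # undo shear/perspective
    return recognize_chars(plate)                     # stage 3: YOLOv3-tiny -> plate string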
3 Conclusion

Detection of traffic violations through closed-circuit television is a quite complex and challenging task. This paper proposes a system to detect multiple traffic violations using computer vision and deep learning. We were successful in creating a system to detect multiple traffic violations from a single input source with acceptable accuracy.
4 Future Scope

With the pressing need to automate the task of traffic violation detection and monitoring due to the humongous growth in traffic density over the last few decades, systems like ours could help effectively in traffic management. Higher volumes of
data could be gathered and used to train the system so as to increase its accuracy. Also, parallel computation could be employed to make the system work in real time.
References 1. Ministry of Urban Development: TRAFFIC MANAGEMENT AND INFORMATION CONTROL CENTRE (TMICC) (2016) 2. Sundar S et al (2007) Ministry of Shipping, Road Transport Highways, Department of Road Transport Highways: Report of the Committee on Road Safety and Traffic Management 3. Pao J et al (2019) The comparison of automatic traffic counting and manual traffic counting. In: IOP conference series: materials science and engineering 4. Redmon J et al (2015) You only look once: unified, real-time object detection. arXiv:1506.02640 [cs.CV] 5. Redmon J et al (2018) YOLOv3: an incremental improvement. arXiv:1804.02767 [cs.CV] 6. Bewley A et al (2016) Simple online and realtime tracking. arXiv:1602.00763 [cs.CV] 7. Boe B (2006) Line segment intersection algorithm. https://bryceboe.com/2006/10/23/linesegment-intersection-algorithm/ 8. Townson R (1973) Visual average speed computer and recorder (vascar). Police Res Bull 9. Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell vol. PAMI-8 6:679-698. https://doi.org/10.1109/TPAMI.1986.4767851 10. Hough P (1962) Method and means for recognizing complex patterns 11. Boonsirisumpun N et al (2018) Automatic detector for bikers with no helmet using deep learning. In: 2018 22nd international computer science and engineering conference (ICSEC), pp 1–4. https://doi.org/10.1109/ICSEC.2018.8712778 12. Li Y et al (2020) Deep learning-based safety helmet detection in engineering management based on convolutional neural networks. Adv Civ Eng Article ID 9703560:10. https://doi.org/ 10.1155/2020/9703560 13. Badr A (2011) Automatic number plate recognition system. Ann Univ Craiova, Math Comput Sci Ser 38(1):62–71. ISSN: 1223-6934 14. Kashyap A et al (2018) Automatic number plate recognition. In: 2018 international conference on advances in computing, communication control and networking (ICACCCN), pp 838–843. https://doi.org/10.1109/ICACCCN.2018.8748287
Conjugate Gradient Method for finding Optimal Parameters in Linear Regression

Vishal Menon, V. Ashwin, and G. Gopakumar
Abstract Linear regression is one of the most celebrated approaches for modeling the relationship between independent and dependent variables in a prediction problem. It has applications in a number of domains including weather data analysis, price estimation, bioinformatics, etc. Various computational approaches have been devised for finding the best model parameters. In this work, we explore and establish the possibility of applying the conjugate gradient method for finding the optimal parameters of a regression model, demonstrated on the house price prediction problem using the Boston dataset. The efficiency of the conjugate gradient method over the pseudo-inverse and gradient descent methods in terms of computational requirements is discussed. We show that the weights obtained by the conjugate gradient method are accurate and that the parameter vector converges to an optimal value in relatively fewer iterations compared to gradient descent techniques. Hence, the conjugate gradient method proves to be a faster approach for a linear regression problem in the ordinary least squares setting. Keywords Machine learning · Linear regression · Conjugate gradient method · Gradient descent · Boston house price prediction
1 Introduction

In recent years machine learning has become a field of eminence: many daily-life challenges are now being solved by machine learning algorithms. The root cause of this upsurge is the capability of ML algorithms to go beyond human thinking. From simple logical reasoning, ML develops more complex patterns capable of providing solutions unimaginable by humans. One such example is Google's AlphaGo, a computer program capable of playing the popular Chinese game "Go." Researchers stated that AlphaGo was capable of making moves which
even world-renowned champions couldn't think of [12]. In a nutshell, we can say that ML has the capability of thinking outside the box. One such ML algorithm is linear regression, a supervised learning algorithm. Regression models a target prediction value based on independent variables. It has various applications including weather data analysis [4], sentiment analysis [15], performance prediction [2], aerodynamics [27], price estimation [22], bioinformatics [6, 8], etc., and many variants are popular in the literature [9, 25]. In almost all practical situations, we can model the dependent variable meaningfully from the independent variables. For example, the sales of products in a supermarket depend upon the popularity index, the season of the year, availability, festivals during the year, etc. Thus a good model predicting the sales of different products could be used by the owner to control the supply chain, thereby maximizing the profit. The model selection process has proved to be one of the main aspects of predictive modeling. Once a particular model is fixed, the best parameters that make up the model are computed using an optimization algorithm chosen on the basis of several factors like time complexity, convergence, and computational requirements. The main aim of our machine learning model will be to find the best-fitting parameters that minimize the recorded cost function on the training dataset. For a linear regression model, the traditional method of computing the optimal parameters is the gradient descent optimization approach. In the basic setting, batch gradient descent [16] is employed, where the model parameter is updated in each epoch and each update demands one pass through all training samples (Eq. 9). In this research work, we analyze different computational techniques to find the optimal parameters using a basic linear regression model. The traditional gradient descent method proves to be less efficient in finding the optimal parameters, as the number of iterations varies depending on the initial parameter vector and learning rate. Hence, we provide the necessary theory to show that these shortcomings can be tackled by the conjugate gradient method, which requires exactly "N" steps to find the "N-D" optimal parameter vector. In practical applications, as we are looking for a parameter vector that performs decently well on the validation set, we may get a decent solution in fewer than "N" iterations. The remainder of this manuscript is organized as follows. Section 2 discusses different computational approaches used in the literature to find the optimum model parameters for regression. Section 3 provides the theoretical background behind linear regression and the conjugate gradient method. The details of the dataset used in this study and the different experiments conducted to establish the merit of these computational methods, followed by results and discussion, are provided in Sect. 4. Finally, the paper is concluded in Sect. 5.
2 Related Works

As discussed in the previous section, linear regression is a popular ML technique used to model the relationship between the independent variables (features) and the output variable [5, 24, 28, 30, 31]. It has profound applications ranging from
weather prediction [4], price and performance estimation [2, 18, 22, 29], medical research [6, 8], bioinformatics [14], etc. The most critical job in linear regression is fixing the right model. Once the model is fixed, the algorithm can give the best parameters for the model. The simplest linear regression problem tries to find the right parameter vector θ that relates the feature matrix X to the target variable Y. Thus we are looking for the best parameter vector θ for the relation Xθ = Y. In the literature, there are different computational techniques to find these parameters. The simplest is based on the normal equation (Eq. 12) [3, 17]. However, the normal-equation-based method (Eq. 12) involves a matrix inversion, which means it is going to be prohibitively expensive when we need to consider a large number of features, which is often the case [1, 7, 19]. The gradient descent-based techniques [13, 20, 26] are free from this issue and are increasingly popular in this field. Here, in order to find a better parameter vector (θ) that minimizes a convex cost function J(θ), we move from the current vector in the opposite direction of the gradient, as decided by a learning rate α (Eq. 13). The demerit of the method is that the number of iterations required to reach the optimal parameters varies depending on the learning rate chosen and the initial parameter vector used, as demonstrated in Table 3.
3 Linear Regression

Linear regression attempts to model the target variable Y using the linear relation Xθ = Y, where θ is the unknown parameter vector. Hence, solving a system of equations for the parameter vector becomes the fundamental objective of a linear regression problem. However, often we will be dealing with an overdetermined system, with inherent noise in the observed features, which makes the solution non-trivial. For establishing the applicability of different computational techniques to find the right parameter vector for the linear regression problem, we experiment on the Boston housing dataset (Sect. 4.1), where the task is to build a regression model to predict the cost of a building from several features. Our fundamental objective is to develop a relation between MEDV (final cost) and the other parameters in the dataset. This relation can be shown as

$$\begin{bmatrix} x_{11} & \ldots & 1 \\ x_{21} & \ldots & 1 \\ \vdots & & \vdots \\ x_{n1} & \ldots & 1 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_0 \end{bmatrix} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{bmatrix} \quad (1)$$

Equation 1 is of the form Xθ = Ŷ, where the N × D dimensional matrix X holds N data samples each with D features, θ is the D × 1 parameter vector, and Ŷ is the predicted value of the output variables. As the objective is to find an approximate solution to the above equation, we intend to find the parameter vector that minimizes
the mean squared error (J) between the predicted and output variables, which can be written in vector form as in Eq. 2:

$$J = \frac{1}{N}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 = \frac{\|Y - \hat{Y}\|^2}{N} \quad (2)$$

Hence, this gives us an unconstrained optimization problem. The solution of such an optimization is the local minimum of the cost function, and the given problem can further be represented as a convex optimization problem, owing to the positive (semi-)definite nature of its Hessian matrix:

$$MSE(J) = \frac{\|Y - \hat{Y}\|^2}{N} = \frac{[Y - \hat{Y}]^T[Y - \hat{Y}]}{N} \quad (3)$$

$$J = \frac{[Y - X\theta]^T[Y - X\theta]}{N} \quad (4)$$

The convex nature of the function can be confirmed by proving that the Hessian matrix of the MSE (Eq. 4) is positive semi-definite:

$$\frac{dJ}{d\theta} = \frac{-2}{N}(Y - \hat{Y})X \quad (5)$$

$$H = \frac{d}{d\theta}\left[\frac{-2}{N}(Y - X\theta)X\right] = \frac{2}{N}XX^T \quad (6)$$

The Hessian matrix H in Eq. 6 is positive semi-definite since $z^T H z \geq 0 \;\forall z$, as seen in Eq. 8. Writing $v = X^T z$,

$$z^T H z = \frac{2}{N}z^T X X^T z = \frac{2}{N}(X^T z)^T(X^T z) = \frac{2}{N}v^T v = \frac{2}{N}\|v\|_2^2 \quad (7)$$

As the norm of a vector cannot be negative, $\|v\|_2^2 \geq 0$, and therefore

$$z^T H z = \frac{2}{N}\|v\|_2^2 \geq 0 \quad (8)$$
Thus, MSE (Eq. 2) forms a convex function and the parameter vector that minimizes J (the global minimum) can easily be found by setting the gradient to zero:

$$\frac{\partial J}{\partial \theta} = \frac{-2}{N}\sum_{i=1}^{n}(y_i - \hat{y}_i)x_i = \frac{-2}{N}(Y - \hat{Y})X \quad (9)$$

$$\frac{dJ}{d\theta} = 0 \;\Rightarrow\; (Y - \hat{Y})X = 0 \;\Rightarrow\; X^T Y - X^T X\theta = 0 \quad (10)$$

$$\therefore\; X^T Y = X^T X\theta \quad (11)$$

$$\therefore\; \theta_o = \left(X^T X\right)^{-1} X^T Y \quad (12)$$
Thus the optimal model parameter vector (θo) which minimizes MSE(J) can be found by the pseudo-inverse (Moore–Penrose inverse) as shown in Eq. 12. Note that Eq. 12 involves matrix inversion, and in many practical applications we will be dealing with matrices having a large number of predictor variables [1, 7, 19]. This means that the cost of finding the model parameters using Eq. 12 is going to be prohibitively high in such cases. Gradient descent-based techniques can be used to counter this problem. The gradient descent technique [13, 20, 26] finds the correct parameters without involving any matrix inversion. Since the cost function J is convex, the method is guaranteed to converge to the optimum parameter vector. The gradient descent method involves moving in the opposite direction of the gradient at each iteration, in order to find the optimal parameters, as shown in Eq. 13:

$$\theta_{new} = \theta_{old} - \alpha \frac{dJ}{d\theta_{old}} \quad (13)$$
In the equation above (Eq. 13), the learning rate α decides the speed with which we move from the current parameter vector (convergence). A learning rate that is too high can even lead to divergence (as shown in Fig. 1a). Divergence for the chosen learning rate can easily be identified by inspecting the value of the cost function across two successive iterations: the cost, being a convex function (refer Eq. 8), should always decrease for a good learning rate. Once a good learning rate is chosen, gradient descent will always converge; however, a smaller learning rate causes slower convergence (refer Fig. 1b). The gradient computed for the function requires a pass through all training samples, as shown in Eq. 9; that is, the method used is batch gradient descent. Although it is computationally cheaper than the methods based on normal equations (the pseudo-inverse method), gradient descent for parameter updation has a drawback.
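To make the update rule concrete, the following is a minimal NumPy sketch of batch gradient descent for this cost function; the function name, default learning rate and stopping tolerance are illustrative choices, not the authors' code.

```python
import numpy as np

def batch_gradient_descent(X, Y, theta0, alpha=0.01, max_iters=1000, tol=1e-6):
    """Update theta against the full training set each epoch (Eqs. 9 and 13)."""
    N = X.shape[0]
    theta = theta0.astype(float)
    for _ in range(max_iters):
        grad = (-2.0 / N) * X.T @ (Y - X @ theta)    # MSE gradient over all samples
        theta_new = theta - alpha * grad             # move opposite to the gradient
        if np.linalg.norm(theta_new - theta) < tol:  # stop when updates become tiny
            return theta_new
        theta = theta_new
    return theta
```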
Fig. 1 a Large learning rate leads to drastic updates causing divergent behavior b Small learning rate requires many updates before reaching minima
In normal gradient descent, the number of iterations required to converge to the right parameters depends on the chosen initial vector and the learning rate α. For our dataset, we have experimented with different initial parameter vectors, and the results are provided in Sect. 4.3. The conjugate gradient method [11, 23] is a special technique that can be used to solve a linear system of equations Aθ = b, if A is symmetric positive definite (SPD). It can be shown that the solution vector θ is going to be the parameter vector that minimizes the convex optimization function given in Eq. 14 [13, 20]. The one-line proof for the same is given in Eq. 15, where we have used the SPD nature of the matrix A:

$$f(\theta) = \frac{1}{2}\theta^T A\theta - \theta^T b + c \quad (14)$$

$$\frac{df}{d\theta} = 0 \;\Rightarrow\; A\theta - b = 0 \;\Rightarrow\; A\theta = b; \quad \text{if } A^T = A \text{ and } A > 0 \quad (15)$$
The conjugate gradient method [11, 23] can converge to the optimal solution in exactly "D" steps for a D-dimensional parameter vector [11], and it makes use of the best learning rate in each iteration [21]. The pseudo-code for the algorithm is shown in Algorithm 1. As shown in the pseudo-code, the parameters are updated in each iteration using the optimum learning rate αk:

$$\theta_{(k+1)} = \theta_{(k)} + \alpha_k d_{(k)} \quad (16)$$
We propose to use the conjugate gradient method for finding the optimal parameter vector, since for practical applications XᵀX is a symmetric ((XᵀX)ᵀ = XᵀX) and positive (semi-)definite matrix.
Algorithm 1 Conjugate Gradient Method
1: Set k = 0 and select initial parameter vector θ(0)
2: g(0) = ∇f(θ(0)) = Aθ(0) − b
3: if g(0) = 0 then
4:   stop
5: else
6:   d(0) = −g(0)
7: end if
8: αk = − (g(k)ᵀ d(k)) / (d(k)ᵀ A d(k))
9: θ(k+1) = θ(k) + αk d(k)
10: g(k+1) = ∇f(θ(k+1))
11: if g(k+1) = 0 then
12:   stop
13: end if
14: βk = (g(k+1)ᵀ A d(k)) / (d(k)ᵀ A d(k))
15: d(k+1) = −g(k+1) + βk d(k)
16: Set k = k + 1, go to Step 8
The regression problem given in Eq. 11 can thus be reformulated into the Aθ = b form, where A = XᵀX and b = XᵀY.
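As an illustration of Algorithm 1 applied to this reformulation, here is a minimal NumPy sketch; the function name and stopping tolerance are our own assumptions, not the paper's implementation.

```python
import numpy as np

def conjugate_gradient(A, b, theta0=None, tol=1e-10):
    """Solve A @ theta = b for SPD A, as in Algorithm 1."""
    theta = np.zeros_like(b, dtype=float) if theta0 is None else theta0.astype(float)
    g = A @ theta - b          # gradient of f(theta) = 1/2 theta^T A theta - theta^T b
    d = -g                     # first direction is steepest descent
    for _ in range(len(b)):    # converges in at most D steps for a D-dim problem
        if np.linalg.norm(g) < tol:
            break
        alpha = -(g @ d) / (d @ A @ d)    # optimal step length along d (Step 8)
        theta = theta + alpha * d         # Eq. 16
        g = A @ theta - b
        beta = (g @ A @ d) / (d @ A @ d)  # conjugacy coefficient (Step 14)
        d = -g + beta * d
    return theta

# Usage on the regression problem: theta_opt = conjugate_gradient(X.T @ X, X.T @ Y)
```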
4 Results and Discussion

This section summarizes the results of the experiments conducted using different computational techniques to find the optimal model parameters in linear regression. A brief description of the dataset and the features used in this study is provided in Sects. 4.1 and 4.2, respectively, followed by the experimental outcome in Sect. 4.3.
4.1 Boston Housing Dataset

The Boston Housing Dataset [10] consists of housing values in suburbs of Boston. The dataset has 506 instances and 13 attributes (12 continuous and 1 binary-valued). The dataset does not have any missing values. More details on the dataset are given in Table 1.
Table 1 Overview of the Boston Dataset

| Total number of samples | Number of features | Number of numerical features | Number of categorical variables | Number of missing features |
|---|---|---|---|---|
| 506 | 13 | 12 | 1 (CHAS) | 0 |
Fig. 2 Correlation Matrix of Boston House estimation dataset
4.2 Selecting Features From the Dataset

In order to find the right parameters for predicting the cost using the Boston dataset, we make use of selected features based on a correlation analysis of all the features. The correlation analysis revealed that, out of 13 features, the top 5 features (ZN, CHAS, RM, DIS, B) are the most important, as reflected by their high correlation values with our target variable MEDV (median value of owner-occupied homes in $1000s), as shown in Fig. 2. Note that the relatively high correlation values are indicated by the darker shades for these features in the row and column of the MEDV feature. We have also compared the model performance by considering all features; the results are provided in Sect. 4.3.
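A hedged sketch of this correlation-based selection step, assuming the dataset is loaded in a pandas DataFrame `df` with the standard Boston column names:

```python
import pandas as pd

def top_correlated_features(df, target="MEDV", k=5):
    """Rank features by absolute correlation with the target and keep the top k."""
    corr = df.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# e.g. top_correlated_features(df) might return ['ZN', 'CHAS', 'RM', 'DIS', 'B']
```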
4.3 Result Analysis

In order to compare the accuracy and computational requirements of the conjugate gradient method, we performed the following experiments:

• Analyzed the effectiveness of the model parameters in terms of the MSE and the norm of the final parameter vector for all 3 methods: pseudo-inverse, batch gradient descent, and conjugate gradient method, as shown in Table 2.
• Analyzed the mean squared error and the number of iterations taken to converge to the right parameters using the batch gradient descent and conjugate gradient methods considering:
  – 5 relevant features that show a good correlation with the target variables (Table 3).
  – All features (Table 4).
Table 2 MSE and norms of the parameter vector found using various models

| Model | MSE (5 features) | Norm (5 features) | MSE (all features) | Norm (all features) |
|---|---|---|---|---|
| Pseudo-inverse | 19.6389 | 24.7827 | 19.5794 | 23.5260 |
| Batch gradient descent | 21.6390 | 23.1947 | 29.4010 | 20.2504 |
| Conjugate gradient method | 21.5661 | 23.2704 | 20.7954 | 23.5237 |
Table 3 MSE and number of iterations for the batch and conjugate gradient methods considering five features from the dataset

| Input vector | Batch GD: MSE | Batch GD: iterations | Conjugate gradient: MSE | Conjugate gradient: iterations |
|---|---|---|---|---|
| v1 | 21.6389 | 110 | 21.56 | 3 |
| v2 | 21.6492 | 111 | 21.59 | 3 |
| v3 | 21.62 | 112 | 21.57 | 3 |
| v4 | 21.63 | 110 | 21.55 | 3 |
| v5 | 21.61 | 113 | 21.47 | 3 |
Table 4 MSE and number of iterations for the batch and conjugate gradient methods considering all features from the dataset

| Initial vector | Batch GD: MSE | Batch GD: iterations | Conjugate gradient: MSE | Conjugate gradient: iterations |
|---|---|---|---|---|
| v1 | 29.40 | 39 | 20.80 | 8 |
| v2 | 28.87 | 40 | 20.79 | 8 |
| v3 | 29.27 | 40 | 20.80 | 8 |
| v4 | 29.43 | 39 | 20.75 | 8 |
| v5 | 28.81 | 40 | 20.76 | 8 |
In Table 2, we observed that the MSE using the pseudo-inverse has the least value in both cases (i.e., considering the 5 strongly correlated features and taking all features), while the MSEs of the other 2 models have almost the same values. Since the cost function is convex, and since all methods resulted in comparable norms and close MSE values, it is reasonable to believe that these methods converged to the same solution. In Table 3, we have performed the analysis of these algorithms using five different random vector initializations. In all the test cases, when compared to batch gradient descent, the cost function converged to the minimum in relatively fewer iterations when using the conjugate gradient algorithm.
Fig. 3 a Graph for MSE versus Iterations (Taking most correlated features); b Graph for MSE versus Iterations (Taking all features)
The average number of iterations required by the batch gradient descent algorithm is 111.2, which is very high compared to the average number of iterations required by the conjugate gradient algorithm (3). The mean squared error obtained is also slightly lower when using the conjugate gradient algorithm compared to batch gradient descent. Theoretically, the conjugate gradient method must converge to the optimal parameter vector in exactly 5 steps [17] for all the test cases given in Table 3, but we get a decent solution even in fewer iterations. This can be seen from the results provided in Table 3, where it takes only 3 iterations to converge to a decent solution (on the validation set) irrespective of the initially chosen vector. Similarly, in Table 4, when considering all the feature vectors, we get results similar to those obtained when using only five features. The average number of iterations taken to converge to the minimum is 40 when using the batch gradient descent algorithm and 8 when using the conjugate gradient method; this confirms that the conjugate gradient method takes fewer iterations than gradient descent techniques, irrespective of the number of features taken. As mentioned in the last paragraph, it can be proven [17] that the conjugate gradient method will take exactly "N" steps to converge to the optimal N-dimensional parameter vector. For the results in Table 4, we had used 14 features, and the average number of iterations to find the best parameters was 8; clearly, this did not exceed 14 iterations in any trial, as indicated by the theory [23]. The above results can be further confirmed by plotting a graph of the number of iterations vs the cost/error values. As we can see in Fig. 3a, the cost value is minimized at around 100 iterations for batch gradient descent, whereas the conjugate gradient method provided a decent solution even with 3 iterations. This shows the efficacy of finding regression parameters using the conjugate gradient method in similar settings. A similar result can be found in Fig. 3b, where we used all features in the price prediction.
5 Conclusion

In this research work, we propose to use the conjugate gradient method for finding the optimal parameters for linear regression in an ordinary least squares setting. As the conjugate gradient method demands symmetric positive definite matrices, we have reformulated the linear regression problem as XᵀXθ = XᵀY and identified that it can be reposed as Aθ = b, where A = XᵀX is a symmetric positive (semi-)definite matrix. The manuscript provides the necessary theory, proof, and experimental results on Boston house price prediction to show the effectiveness of the conjugate gradient method in finding the optimal parameters for the linear regression model. Unlike the pseudo-inverse method, the proposed approach does not involve matrix inversion, which is important especially when dealing with a large number of features. Contrary to gradient descent, the proposed approach converges to the N-D parameter vector in exactly "N" iterations, irrespective of the initial parameter vector. Hence, this method proves to be a faster and more effective technique to solve linear regression problems.
References

1. Bradley JK, Schapire RE (2008) Filterboost: regression and classification on large datasets. In: Platt J, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems, vol 20. Curran Associates Inc., pp 185–192
2. Devasia T, Vinushree TP, Hegde V (2016) Prediction of students performance using educational data mining. In: 2016 international conference on data mining and advanced computing (SAPIENCE), pp 91–95. https://doi.org/10.1109/SAPIENCE.2016.7684167
3. Fletcher R (1968) Generalized inverse methods for the best least squares solution of systems of non-linear equations. Comput J 10(4):392–399. https://doi.org/10.1093/comjnl/10.4.392
4. Fowdur T, Beeharry Y, Hurbungs V, Bassoo V, Ramnarain-Seetohul V, Lun ECM (2018) Performance analysis and implementation of an adaptive real-time weather forecasting system. Internet Things 3–4:12–33
5. Freedman D (2005) Statistical models: theory and practice. https://doi.org/10.1017/CBO9781139165495
6. Gayathri B, Sruthi K, Menon KAU (2017) Non-invasive blood glucose monitoring using near infrared spectroscopy. In: 2017 international conference on communication and signal processing (ICCSP), pp 1139–1142. https://doi.org/10.1109/ICCSP.2017.8286555
7. Gemulla R, Nijkamp E, Haas PJ, Sismanis Y (2011) Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. KDD '11, Association for Computing Machinery, New York, NY, USA, pp 69–77. https://doi.org/10.1145/2020408.2020426
8. Godfrey K (1985) Simple linear regression in medical research. N Engl J Med 313(26):1629–1636. https://doi.org/10.1056/NEJM198512263132604
9. Harikumar S, Reethima R, Kaimal MR (2014) Semantic integration of heterogeneous relational schemas using multiple L1 linear regression and SVD. In: 2014 international conference on data science engineering (ICDSE), pp 105–111. https://doi.org/10.1109/ICDSE.2014.6974620
10. Harrison D, Rubinfeld DL (1978) Hedonic housing prices and the demand for clean air. J Environ Econ Manag 5(1):81–102
11. Hestenes MR (1980) Conjugate gradient algorithms. In: Conjugate direction methods in optimization. Springer New York, pp 231–318
12. Holcomb SD, Porter WK, Ault SV, Mao G, Wang J (2018) Overview on DeepMind and its AlphaGo Zero AI. In: Proceedings of the 2018 international conference on big data and education. ICBDE '18, Association for Computing Machinery, New York, NY, USA, pp 67–71. https://doi.org/10.1145/3206157.3206174
13. Háo DN, Lesnic D (2000) The Cauchy problem for Laplace's equation via the conjugate gradient method. IMA J Appl Math 65(2):199–217. https://doi.org/10.1093/imamat/65.2.199
14. Jung Klaus SFHM (2017) Multiple linear regression for reconstruction of gene regulatory networks in solving cascade error problems. Adv Bioinf 94–95
15. Naveenkumar KS, Vinayakumar R, Soman KP (2019) Amrita-CEN-SentiDB 1: improved Twitter dataset for sentimental analysis and application of deep learning, pp 1–5. https://doi.org/10.1109/ICCCNT45670.2019.8944758
16. Kershaw DS (1977) The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations. J Comput Phys
17. Kershaw DS (1978) The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations. J Comput Phys 26(1):43–65
18. Kodiyan AA, Francis K (2019) Linear regression model for predicting medical expenses based on insurance data. https://doi.org/10.13140/RG.2.2.32478.38722
19. Loh P-L, Wainwright MJ (2011) High-dimensional regression with noisy and missing data: provable guarantees with non-convexity. In: Shawe-Taylor J, Zemel R, Bartlett P, Pereira F, Weinberger KQ (eds) Advances in neural information processing systems, vol 24. Curran Associates Inc., pp 2726–2734
20. Lubis FF, Rosmansyah Y, Supangkat SH (2014) Gradient descent and normal equations on cost function minimization for online predictive using linear regression with multiple variables. In: 2014 international conference on ICT for smart society (ICISS), pp 202–205. https://doi.org/10.1109/ICTSS.2014.7013173
21. Luenberger DG, Ye Y (2008) Conjugate direction methods. In: Linear and nonlinear programming. Springer US, New York, pp 263–284
22. Madhuri CR, Anuradha G, Pujitha MV (2019) House price prediction using regression techniques: a comparative study. In: 2019 international conference on smart structures and systems (ICSSS), pp 1–5. https://doi.org/10.1109/ICSSS.2019.8882834
23. Polyak B (1969) The conjugate gradient method in extremal problems. USSR Comput Math Math Phys 9(4):94–112
24. Prion S, Haerling K (2020) Making sense of methods and measurements: simple linear regression. Clin Simul Nurs 48:94–95. https://doi.org/10.1016/j.ecns.2020.07.004
25. Reddy MR, Kumar BN, Rao NM, Karthikeyan B (2020) A new approach for bias-variance analysis using regularized linear regression. In: Jain LC, Virvou M, Piuri V, Balas VE (eds) Advances in bioinformatics, multimedia, and electronics circuits and signals. Springer Singapore, Singapore, pp 35–46
26. Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv:1609.04747
27. Sathyadevan S, Chaitra MA (2015) Airfoil self noise prediction using linear regression approach. In: Jain LC, Behera HS, Mandal JK, Mohapatra DP (eds) Computational intelligence in data mining, vol 2. Springer India, New Delhi, pp 551–561
28. Seal HL (1967) Studies in the history of probability and statistics. XV: the historical development of the Gauss linear model. Biometrika 54(1/2):1–24
29. Sharmila Muralidharan KP (2018) Analysis and prediction of real estate prices: a case of the Boston housing market. Issues Inf Syst 5:109–118
30. Weisberg S (2005) Applied linear regression. Wiley series in probability and statistics. Wiley, New York
31. Yan X (2009) Linear regression analysis: theory and computing. World Scientific Publishing Co
Rugby Ball Detection, Tracking and Future Trajectory Prediction Algorithm

Pranesh Nangare and Anagha Dangle
Abstract This paper presents a custom object detection and tracking algorithm for position estimation and trajectory prediction of a moving rugby ball. The approach combines the accuracy of object detection provided by a custom-trained YOLOv5 model with the speed of the KCF tracker to perform a linear trajectory prediction of the ball. A Kalman filter is used to ensure optimal estimation of the current position and to increase the accuracy of the predicted future trajectory. Multi-threading is implemented to concurrently detect and track the ball in consecutive frames, resulting in a computationally efficient approach.

Keywords Image processing · Computer vision · Convolutional neural network · Kalman filter
1 Introduction

Ball detection and trajectory prediction have been widely used for analysing results in various sports events. This use of technology has paved the way for research in the related field, contributing to its success in the commercial scope. Hawkeye is a leading innovator in sports technology; it provides systems for tracking and predicting ball movement in a variety of sports, including cricket, football, tennis, rugby union, volleyball and ice hockey. Motivated by the need to create an object detection system for a rugby ball for mobile robots, this work provides a combined approach targeted towards constrained hardware environments, in the specific task of rugby ball detection. The task of ball detection is not as easy as other detection problems. When a ball is thrown and moves at high velocity, its image becomes blurry and elliptical. Due to shadows and illumination variations, the ball's perceived colour varies, and it is especially tough when the ball is partially obscured. When the ball is seen as a single object, traditional ball detection algorithms, such as those based on
variations of the circular Hough transform [1], work well. However, a deep neural network is required to detect the ball in complex environments. The underlying architecture of deep neural networks is highly optimised for execution on graphics processing units (GPUs). However, GPUs may be unavailable in particular domains, forcing these operations to run on CPUs. One such domain is mobile robotics, where size, weight and energy consumption constraints may limit the robot's hardware to merely CPUs, constraining the performance of deep learning systems. To overcome this limitation, we have proposed a method combining the slower detection model with a faster tracking algorithm, optimising the entire process. Tracking algorithms such as KCF [2], Boosting [3] and MOSSE [4] are mostly used for the general object tracking task. These algorithms require an initial ROI input, which is generally given manually by selecting the ROI from the first frame. These trackers provide fast operation but tend to lose track of objects owing to minor changes in shape, colour and shading. In such cases, DNN detection can help trackers maintain the course of the trajectory by accurately detecting the object. Alongside detection and tracking, it is also important to predict the accurate position of the ball and its future trajectory for the robot to take responsive actions. The generated trajectory equations are ideal in nature, and a Kalman filter [5] is used for accurate estimation. The Kalman filter is used to deal with two different scenarios: when the ball is identified, the Kalman filter predicts its state for the current video frame before correcting it with the newly detected object position, creating a filtered location; when the ball is absent from the frame, the Kalman filter predicts the ball's present position based only on its prior state. This paper is mainly divided into three broad sections. The first section focuses on creating an image dataset of rugby balls, preprocessing the collected data in the required format and training the custom object detection model on YOLOv5 [6]. The second section explains combining the tracking algorithm with the trained detection model using a multithreading approach for processing the frames smoothly. The last part explains the algorithm using the Kalman filter for trajectory prediction. Corresponding results obtained in each section are also recorded and a final conclusion is proposed for the problem.
2 Methodology

2.1 Section I

This section deals with training and deployment of the YOLOv5 [6] model for detection of the ball in the frame. The initial step is to create an image dataset with different orientations of the ball. As this system was to be used by a mobile robot, practical scenarios needed to be considered, for which videos of balls thrown by a person were recorded from the viewpoint of a robot. Due to these considerations, the ball in the successive frames was captured in different orientations and lighting conditions.
Fig. 1 Metrics
Training a YOLOv5 model requires the data to be in a specified format, with each image having a corresponding XML file containing the bounding box coordinates and the label of the object to be detected. Software services like LabelBox [7] and Roboflow [8] are frequently used to create such XML files by taking a video file as input and loading each frame one by one (consecutively). This process of creating files by selecting the desired objects from individual images can be tedious and time-consuming, considering the size of the dataset. Hence, to simplify the process, we developed a Python script in which the video file is taken as input; the desired object is selected manually from the initial frame, and its position in the consecutive frames is tracked by a tracking algorithm. This generates a stream of images which are saved in a folder along with their corresponding XML files. This approach reduced the time of preprocessing the dataset by approximately 87% compared to the traditional method. The preprocessed dataset was trained using a YOLOv5 model, the most recent addition to the YOLO family of models. YOLO was the first object detection model to combine bounding box prediction and object classification in a single end-to-end differentiable network. YOLOv5 is the first YOLO model to be developed in the PyTorch [9] framework, making it significantly lighter and easier to use. YOLOv5s was chosen based on the model's benchmarks. The default training parameters proved adequate without changes. The dataset consisted of blurred and rotated images for better fitting. The benchmarks of training are shown in Fig. 1. As seen in the graphs, the precision and mAP values increase as the training progresses. mAP is used to find the percentage of correct predictions in the model; the mAP and precision values can further be used for comparison with other network architectures. After training of the model, testing and validation were done to check the results on varied input. Figure 2 shows the detection results on the input video, along with the associated label and bounding box.
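The paper does not list the labelling script itself; the following is a hedged sketch of how such tracker-assisted annotation can work with OpenCV (the legacy tracking module from opencv-contrib is assumed, and write_voc_xml is a hypothetical helper that serializes a Pascal-VOC-style annotation).

```python
import cv2

def auto_label(video_path, out_prefix, label="rugby_ball"):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    box = cv2.selectROI("select ball", frame)        # manual ROI on the first frame only
    tracker = cv2.legacy.TrackerKCF_create()
    tracker.init(frame, box)
    idx = 0
    while ok:
        found, box = tracker.update(frame)           # tracker follows the ball
        if found:
            cv2.imwrite(f"{out_prefix}_{idx:04d}.jpg", frame)
            write_voc_xml(f"{out_prefix}_{idx:04d}.xml", label, box)  # hypothetical helper
        ok, frame = cap.read()
        idx += 1
    cap.release()
```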
Fig. 2 Detection
2.2 Section II

This part focuses on the integration of the tracking and detection parts of the code. The tracking and detection of the ball are carried out on parallel threads, hence optimising the overall process. The tracking algorithm takes the bounding box points as input in the initial step, and tracking of the object in that bounding box is done in consecutive frames. Running detection alone on every frame of the video yields better accuracy but increases processing time, whereas tracking algorithms have a high speed of execution but hindered accuracy. Hence, as a trade-off, the ideal solution was to combine detection and tracking to get both reasonable speed and accuracy. The results were tested on a GTX 1660Ti; hence, the timing results may vary for CPUs and other GPU cards. The tracking algorithm used was the Kernelized Correlation Filter (KCF), originally proposed in [10]. The OpenCV implementation of this tracker makes the integration in code easy. According to the paper, this tracker has outperformed top-ranking trackers such as Struck [11] and TLD [12] on a 50-video benchmark.
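A simplified sketch of this two-thread arrangement is given below; the shared-state design, the model handle (a YOLOv5-style model whose results expose .xyxy) and the KCF tracker mirror the description above but are our assumptions, not the authors' exact code.

```python
import threading
import cv2

shared = {"box": None}          # latest detection as (x, y, w, h)
lock = threading.Lock()

def detector_loop(frames, model):
    # Slow thread: the detector periodically refreshes the shared box
    for frame in frames:
        det = model(frame).xyxy[0]
        if len(det):
            x1, y1, x2, y2 = [int(v) for v in det[0][:4]]
            with lock:
                shared["box"] = (x1, y1, x2 - x1, y2 - y1)

def tracker_loop(frames):
    # Fast thread: KCF runs on every frame, seeded from the detector's output
    tracker = None
    for frame in frames:
        with lock:
            box = shared["box"]
        if box is not None and tracker is None:
            tracker = cv2.legacy.TrackerKCF_create()
            tracker.init(frame, box)
        elif tracker is not None:
            ok, box = tracker.update(frame)   # ok is False when the track is lost
```

Because the tracker never waits on the detector, frames are processed without the stalls seen in serial detect-then-track pipelines.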
2.3 Section III

After successful detection and tracking of the ball, formulation of the future trajectory is carried out. This trajectory can be calculated by ballistic trajectory equations. But as these equations are ideal, they do not consider the noise in the system, which introduces errors in predictions. As our system is linear in nature, we use a Kalman filter [5], which is used for estimation of unknown variables in systems with statistical noise.
$$A = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad B = \begin{bmatrix} \frac{dt^2}{2} & 0 \\ 0 & \frac{dt^2}{2} \\ dt & 0 \\ 0 & dt \end{bmatrix}$$

Fig. 3 State transition matrices
The Kalman filter uses a system's dynamic model and data measured over time (the position of the ball) to form a better estimate of the state. The dynamic model in our case is the ballistic trajectory equation, whose state transition matrices are shown in Fig. 3. The kinematic equations are as follows:
$$v_x \leftarrow v_x + a_x\,dt \quad (1)$$

$$v_y \leftarrow v_y + a_y\,dt \quad (2)$$

$$x \leftarrow x + v_x\,dt + \frac{1}{2}a_x\,dt^2 \quad (3)$$

$$y \leftarrow y + v_y\,dt + \frac{1}{2}a_y\,dt^2 \quad (4)$$
From the above equations, the state transition matrices (Fig. 3) can be stated: matrix A is the dynamic law and matrix B is the control matrix. Matrix B contains the controls for the input variable, which in our case is the acceleration in the y direction. The required states of the system are the position and velocity of the object, but only the position of the object is observable in our case (Fig. 4). Matrix u holds the values of the inputs, i.e. acceleration. Matrix P is the initial uncertainty of the state variables; as the state variables are initially unknown, the values of i_x, i_y, i_vx, i_vy are very high, in the range of 10^6. Given the transformations A and B, and noise approximated using covariance matrices Q and R, the state estimate and its covariance are updated with a new observation z as follows:

Predicted state estimate: $\hat{x}_k^- = A\hat{x}_{k-1}^+ + Bu$ (5)

Predicted error covariance: $P_k^- = AP_{k-1}^+A^T + Q$ (6)

Measurement residual: $y_k = z_k - H\hat{x}_k^-$ (7)

Kalman gain: $K_k = P_k^-H^T\left(R + HP_k^-H^T\right)^{-1}$ (8)

Updated state estimate: $\hat{x}_k^+ = \hat{x}_k^- + K_ky_k$ (9)

Updated error covariance: $P_k^+ = (I - K_kH)P_k^-$ (10)

where variables with (ˆ) are estimates, superscripts (−) and (+) denote prior and updated estimates respectively, and superscript (T) denotes the matrix transpose.
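For illustration, a minimal NumPy sketch of one predict/update cycle (Eqs. 5–10) is given below, using the A, B, H matrices defined above; Q and R are assumed process and measurement noise covariances.

```python
import numpy as np

def kalman_step(x, P, z, A, B, u, H, Q, R):
    # Predict (Eqs. 5-6)
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    # Update with the measured ball position z (Eqs. 7-10)
    y = z - H @ x_pred                                       # measurement residual
    K = P_pred @ H.T @ np.linalg.inv(R + H @ P_pred @ H.T)   # Kalman gain
    x_new = x_pred + K @ y                                   # filtered state
    P_new = (np.eye(len(x)) - K @ H) @ P_pred                # updated uncertainty
    return x_new, P_new
```

When the ball is not detected in a frame, only the predict half is applied, matching the two scenarios described in the introduction.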
3 Results

The results (refer Figs. 5 and 6) were tested on different videos from various orientations and lighting conditions, on a CPU as well as a GPU. The accuracy was comparable in both cases; however, there was a considerable difference in the timing of the process. The stand-alone detection code takes around 0.75 s on the CPU (Intel i5 6th gen, 6 GB RAM) to detect the ball with around 85–95% accuracy. On the same device, the detection code with tracking in two separate threads takes approximately 0.052 s. Our proposed method has reduced the time required to process one frame by 93% compared to the traditional detection technique. This improved performance has significant importance for low-computation hardware such as mobile robots, edge computing and low-end devices. In typical systems, detection and tracking are implemented in a serial configuration, i.e. detect in the first frame and then track for the following 'n' frames; the detection frame takes longer to process than the tracking frames, often causing the video to stall. However, because we configured this detection and tracking process in two different threads, detection data is shared with the tracking thread
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \qquad S = \begin{bmatrix} x \\ y \\ v_x \\ v_y \end{bmatrix} \qquad u = \begin{bmatrix} 0 \\ -9.8 \end{bmatrix} \qquad P = \begin{bmatrix} i_x & 0 & 0 & 0 \\ 0 & i_y & 0 & 0 \\ 0 & 0 & i_{v_x} & 0 \\ 0 & 0 & 0 & i_{v_y} \end{bmatrix}$$

Fig. 4 Matrices required for Kalman computation: observation matrix H, state matrix S, input matrix u and uncertainty matrix P
Fig. 5 Ball detection and trajectory prediction
Fig. 6 Prediction and estimation
while the tracking thread runs constantly without pauses, resulting in seamless processing with no delays in between. We continually compute the ball's location farther into the future, which raises the uncertainty of the ball's position in the longer term, as illustrated by the increasing radius of the red circles in Fig. 5. Figures 5 and 6 depict the expected and actual trajectory of the ball. The prediction has been accurate, with a 25–30 cm difference between the expected and actual trajectories.
4 Conclusion

The results are quite robust in the case of normal orientation, i.e. a side view of the ball trajectory; however, the results are not as desired for a front-view orientation of the camera. Though detection and tracking have comparable results for different orientations, the work on trajectory prediction needs to be more flexible in terms of the view in which the video is shot. Overall, the results were sufficient for our application and can be reproduced from the code in our GitHub repository.
5 Future Scope

The future scope of the project is to test this developed system on different hardware systems to improve efficiency in every case. The work on making the trajectory prediction algorithm more robust is ongoing. The current algorithm is tested on 2D videos, and the future plans are to work with 3D views.
References

1. Pedersen SJK (2009) Circular Hough transform. In: Encyclopedia of biometrics
2. Henriques JF, Caseiro R, Martins P, Batista J (2014) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596
3. Grabner H, Grabner M, Bischof H (2006) Real-time tracking via online boosting. BMVC 1(5):6
4. Bolme DS, Beveridge JR, Draper BA, Lui YM (2010) Visual object tracking using adaptive correlation filters. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, pp 2544–2550
5. Julier SJ, Uhlmann JK (1997) A new extension of the Kalman filter to nonlinear systems. In: International symposium aerospace/defense sensing, simulation and controls. Signal processing, sensor fusion, and target recognition, vol 3, p 182. Bibcode:1997SPIE.3068..182J. https://doi.org/10.1117/12.280797
6. Jocher G, Stoken A, Borovec J, Chaurasia A, Changyu L, Hogan A, Hajek J et al (2020) Ultralytics/YOLOv5: v5.0-YOLOv5-P6 1280 models, AWS, Supervisely and YouTube integrations (v5.0). Zenodo. https://doi.org/10.5281/zenodo.4679653. Accessed 30 Sept 2020
7. Labelbox (2021) Labelbox. https://labelbox.com
8. Roboflow (2021) Roboflow. https://roboflow.com/
9. Paszke A et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems, p 32. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
10. Henriques JF, Caseiro R, Martins P, Batista J (2015) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596. https://doi.org/10.1109/TPAMI.2014.2345390. PMID: 26353263
11. Hare S, Saffari A, Torr P (2011) Struck: structured output tracking with kernels. In: ICCV
12. Kalal Z, Mikolajczyk K, Matas J (2010) Tracking-learning-detection. In: TPAMI. https://doi.org/10.1109/TPAMI.2011.239
Early Detection of Heart Disease Using Feature Selection and Classification Techniques

R. S. Renju and P. S. Deepthi
Abstract Cardiovascular diseases have been recognized as one of the major causes of death in humans. Most of the time, the increase in the death rate is due to delays in detecting heart disease; early detection would help save more lives. Since the early detection of heart disease considers many features and a large volume of data, machine learning techniques can significantly help predict heart disease in its early stages. In this work, three major feature selection techniques have been deployed, one before each classifier, to acquire better performance and accuracy. The dataset has been thoroughly examined and processed, and the subset of traits that play a significant role in the prediction of heart disease has been extracted. The classification methods used to classify the retrieved features aided in improving accuracy.

Keywords Cardiovascular disease · ANOVA · SVM · Step forward feature selection · Random forest · Lasso regularization · AdaBoost
1 Introduction

The combined disorders of both the heart and blood vessels are called Cardiovascular Diseases (CVD). Detecting the risk level of heart disease by analysing various risk features is a time-consuming process for clinicians and is highly prone to errors. Machine learning has a predominant role in the prediction of chronic diseases by analyzing the available data; the model can give the doctor an insight for diagnosis and further treatment. A large volume of medical data is available now, but the most tedious task is to select the weighted features which can accurately predict heart disease. As per the review of clinical decision support systems for heart disease prediction, researchers have been using machine learning techniques like K-nearest neighbours, Naive Bayes, decision trees, random forest, logistic regression, SVM, XGBoost, etc. Only marginal success has been achieved through these predictive models for heart disease, so a more accurate and crisp model which incorporates the most contributing features for the prediction of heart disease is essential.
The proposed work uses a combination of feature selection and classification techniques. Three classification methods, along with three independent feature selection techniques, are used to build models for the early detection of heart disease. The accuracy of the three independent models has been analysed and the best model is chosen for prediction. The feature selection methods are the ANOVA univariate test, step forward feature selection and Lasso regularization; the classification methods are Support Vector Machine (SVM), Random Forest and AdaBoost. The curated dataset is taken from five research databases and includes 11 common features related to heart disease. The remainder of the paper is organized as follows. The review of related research is covered in Sect. 2. The proposed methodology is discussed in Sect. 3. The results and discussion are presented in Sect. 4, and the paper is concluded in Sect. 5.
2 Literature Review

Researchers are currently experimenting with a variety of machine learning techniques for the deployment of prediction models in order to improve the accuracy of forecasting disease onset early. The authors of [1] gathered medical data from the Kaggle website and used it to test several classification methods such as K-nearest neighbours, random forest, decision tree and Naive Bayes. In Ref. [2], the authors used the Cleveland dataset with 14 attributes, preprocessed it, and applied K-nearest neighbour, decision tree, Naive Bayes and random forest, with K-nearest neighbour being the top algorithm with the highest accuracy rate. In Ref. [3], decision tree, SVM, logistic regression, K-nearest neighbour and random forest were deployed; logistic regression achieved the highest accuracy among these trained models for the early prediction of heart disease. In Ref. [4], the authors developed prediction models using 13 features, and accuracy was measured for various traditional machine learning modelling techniques; the highest accuracy was achieved by a hybrid random forest linear model in comparison with existing methods. In Ref. [5], the authors utilized a heart disease dataset and preprocessed the data to remove irrelevant entries; six machine learning methodologies were implemented to forecast the occurrence of heart disease, and random forest was found to give higher accuracy than the other machine learning algorithms. The authors of Ref. [6] proposed an effective heart disease prediction system using density-based spatial clustering of applications with noise to identify and remove outliers, a hybrid synthetic minority oversampling technique with nearest neighbour to balance the data distribution, and XGBoost for the prediction of heart disease, which yielded better accuracy than traditional machine learning algorithms. In Ref. [7], the authors performed feature extraction techniques like principal component analysis and linear discriminant analysis on the Cleveland
heart disease dataset. The ensemble boosting and bagging algorithms implemented on the extracted data achieved higher accuracy than the traditional classification algorithms. An accuracy-based weighted ageing classifier ensemble was used in Ref. [8] to model a homogeneous ensemble, which outperformed other classic machine learning techniques.
3 Materials and Methods

3.1 Dataset

This work uses a curated dataset created by integrating datasets that were previously available independently. It consists of 11 features and is the largest available heart disease dataset, with 1190 instances. The five datasets used are as follows: Cleveland: 303, Hungarian: 294, Switzerland: 123, Long Beach VA: 200, Statlog (Heart) Data Set: 270. The features (6 nominal variables and 5 numerical variables) are as follows: Age, Sex, Chest Pain Type, Resting bps, Cholesterol, Fasting blood sugar, Resting ECG, Max heart rate, Exercise angina, Old peak and ST slope, plus the Target variable.
3.2 Methodology

The heart disease dataset is preprocessed to remove unwanted data and outliers, and is then split into train and test sets. This work incorporates three feature selection methods along with three machine learning algorithms to acquire better accuracy in the prediction of heart disease. The proposed methodology is depicted in Fig. 1 and detailed in the following subsections.
3.2.1 Data Preprocessing
Model performance depends on the quality of data fed to the model. The main preprocessing steps performed in this work are as follows:

1. Identify/remove/replace missing values from the data
2. Handle skewed data
3. Detect and remove outliers

Exploratory data analysis involves univariate and bivariate analysis.
Fig. 1 Proposed methodology
3.2.2 Methods
Feature selection using a filter, a wrapper and an embedded method is done, and different classifiers are used. The methods are detailed below.

ANOVA univariate test with SVM

Analysis of Variance (ANOVA) is a univariate filter method for determining the relationship between two variables. It assumes that the variables and the target have a linear relationship. The feature set with higher F values will be chosen:

$$F = \frac{\chi_1^2/(n_1 - 1)}{\chi_2^2/(n_2 - 1)} \quad (1)$$
where χ₁, χ₂ are chi distributions and n₁, n₂ are their respective degrees of freedom. The feature set chosen following the ANOVA test will be fed into the SVM. The algorithm locates the hyperplane in n-dimensional space that clearly categorizes the data points. The main goal is to find the hyperplane with the maximum margin between the data points of both classes. The data points closest to the hyperplane are termed support vectors; the margin of the classifier can be maximized by using these support vectors. The bigger the marginal distance, the more generalized the model will be.
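A minimal scikit-learn sketch of this ANOVA-filtered SVM, assuming f_classif as the ANOVA F-test and k = 11 as reported later in the paper; hyperparameters are illustrative.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

anova_svm = Pipeline([
    ("anova", SelectKBest(score_func=f_classif, k=11)),  # keep highest-F features
    ("svm", SVC(kernel="rbf")),                          # classify on the reduced set
])
# anova_svm.fit(X_train, y_train); anova_svm.score(X_test, y_test)
```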
Step Forward Feature Selection with Random Forest (RF)

This is a wrapper method, also known as sequential forward feature selection. It is an iterative method: in the first step, all features are evaluated individually, out of which the best feature, having the best performance with the chosen classifier, is selected. In the second step, all possible combinations of the selected feature with the remaining features are tested, and the best pair, which produces the highest algorithmic performance, is chosen. The process continues by adding one feature at a time in each iteration until the preset criterion is reached. In this work, the feature set selected through step forward feature selection is input to the random forest algorithm, which is an ensemble bagging technique. It consists of many decision trees; the training dataset is distributed across the decision trees, the output from each decision tree is combined using majority voting, and the final output is derived. The node importance using Gini impurity, supposing two child nodes, is calculated by the equation below:

$$Ni_j = W_j C_j - W_{left(j)} C_{left(j)} - W_{right(j)} C_{right(j)} \quad (2)$$

where Ni_j is the importance of node j, W_j is the weighted number of samples reaching node j, C_j is the impurity value of node j, and left(j) and right(j) are the child nodes from the left and right split on node j. The significance of each feature can then be calculated as follows:

$$Fi_i = \frac{\sum_{j:\,\text{node } j \text{ splits on feature } i} Ni_j}{\sum_{k \in \text{all nodes}} Ni_k} \quad (3)$$

where Fi_i is the importance of feature i. The normalized value can be expressed as follows:

$$Normfi_i = \frac{Fi_i}{\sum_{j \in \text{all features}} Fi_j} \quad (4)$$

The final feature importance is calculated as the average over all trees:

$$RFfi_i = \frac{\sum_{j \in \text{all trees}} Normfi_{ij}}{T} \quad (5)$$
where RFfi_i is the importance of feature i calculated from all the trees in the RF model, Normfi_ij is the normalized importance of feature i in tree j, and T is the total number of trees. This method helps to overcome overfitting, avoids the disadvantages of single decision trees and improves precision. New updates in the dataset will not affect the overall performance, since the new data spreads over the decision trees uniformly. This is one of the most widely accepted algorithms for the prediction of heart disease.
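A minimal sketch of this wrapper method using scikit-learn's SequentialFeatureSelector (available from version 0.24); the hyperparameter values are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

rf = RandomForestClassifier(n_estimators=100, random_state=0)
sfs = SequentialFeatureSelector(rf, n_features_to_select=11,
                                direction="forward", cv=5)  # add one feature per step
# sfs.fit(X_train, y_train)
# X_train_sel = sfs.transform(X_train)   # best feature subset
# rf.fit(X_train_sel, y_train)           # final model on the selected features
```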
Lasso Regularization (L1 Regularization) with AdaBoost

This is an embedded method that combines the advantages of both filter and wrapper methods. Regularization methods are commonly used in machine learning to avoid overfitting, making the model robust to noise and more generalized; they penalize the parameters of a model to reduce its freedom. Lasso regularization shrinks some of the feature coefficients to zero. This means that those features are multiplied by zero when predicting the target, so they can be removed, since they do not contribute to the final prediction. The equation for Lasso is

$$\sum_{i=1}^{n}\left(y_i - \sum_{j} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j| \quad (6)$$

where λ is the amount of shrinkage. λ = 0 indicates that all features are considered, which is equivalent to linear regression, where only the residual sum of squares is used to build the predictive model. λ = ∞ indicates that no feature is considered; that is, as λ approaches infinity, more and more features are eliminated. Bias increases as λ increases, and variance increases as λ decreases. In this work, the feature set selected through Lasso regularization is input to the AdaBoost (adaptive boosting) algorithm, which is an ensemble boosting technique. This algorithm first builds a model from the training data and then creates a second model to correct the errors of the first. This process of adding models continues until the training set is predicted correctly or the maximum number of models is reached. Weighted samples are used to prepare the training data for the weak classifiers, and each weak model is trained and added successively using the weighted training data. After training a weak classifier, the weights of the training samples are updated using the equation below:

$$D_{t+1}(i) = \frac{D_t(i)\exp\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t} \quad (7)$$

where D_t is the weight at the previous level, α_t is the weight assigned to classifier t, h_t(x) is the output of weak classifier t for input x, and Z_t is the sum of all the weights. This procedure is repeated until a predetermined number of weak learners has been created, or until the training dataset can no longer be improved.
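A minimal sketch of this embedded selection feeding AdaBoost, using scikit-learn; the alpha value stands in for λ and is an assumed starting point to be tuned.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

lasso_ada = Pipeline([
    ("select", SelectFromModel(Lasso(alpha=0.01))),  # drop zero-coefficient features
    ("ada", AdaBoostClassifier(n_estimators=100)),   # boost on the surviving features
])
# lasso_ada.fit(X_train, y_train); lasso_ada.score(X_test, y_test)
```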
4 Results and Discussion

The nominal feature entries in the dataset are encoded into corresponding categorical variables. After feature encoding, the presence of missing values is checked. Exploratory data analysis revealed that resting blood pressure and cholesterol have outliers, with
a minimum value of 0; cholesterol also has an outlier on the upper side, with a maximum value of 603. The dataset is found to be balanced, with 629 heart disease patients and 561 normal patients. The relationship between the features age and gender and the target variable is examined thereafter. As seen in Fig. 2, the percentage of males in this dataset is far larger than that of females, and the average age of the patients is around 55. In addition, males account for more patients with heart disease than females. Chest pain is considered the major visible symptom of heart disease, so the distribution of chest pain type is checked; Fig. 3 shows the output. According to the graph, 76% of patients experience asymptomatic chest discomfort. Asymptomatic heart attacks, also known as silent myocardial infarction, account for 40–50 percent of heart disease mortality in India [9]. The distribution of resting ECG in the dataset is plotted in Fig. 4. In most research publications, the ST segment/heart rate slope (ST/HR slope) has been advocated as a more accurate ECG criterion for identifying severe coronary artery disease. Figure 5 shows that upsloping is a positive sign, with 74% of normal individuals having it, while 72.97% of cardiac patients have a flat slope.
Fig. 2 Age and gender distribution
Fig. 3 Distribution of chest pain
Fig. 4 Distribution of Rest ECG
Fig. 5 Distribution of ST Slope
Figure 6 depicts the outliers. Some of the patients have zero cholesterol, and one patient has both zero cholesterol and zero resting blood pressure, which could be due to missing records. The Z-score method is used to detect the outliers, with the threshold value set to 3 for removing data points; a total of 17 outliers were found in the dataset and removed. The correlation between the different features (input variables) and the target variable is plotted in Fig. 7. The data set is split into training and testing sets in an 80%/20% ratio. Before splitting, the categorical variables are encoded as dummy variables, and the feature and target variables are segregated. Using min–max normalization, all numerical values are normalized to the range from 0 to 1. After data preprocessing and normalization, the number of instances in the dataset was reduced to 1172, and the number of features, including dummy variables, rose to 15, plus one target variable. The resulting data set, including train and test data, is given as input for feature selection.
Fig. 6 Visualization of outliers
Fig. 7 Checking correlation of dataset
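A condensed sketch of the preprocessing described above (z-score outlier removal, dummy encoding, an 80/20 split and min–max normalization); the DataFrame `df` and its column names are assumptions about the curated dataset, not taken verbatim from the paper.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

num_cols = ["age", "resting bp s", "cholesterol", "max heart rate", "oldpeak"]
z = np.abs(stats.zscore(df[num_cols]))
df = df[(z < 3).all(axis=1)]                    # drop rows with z-score outliers

X = pd.get_dummies(df.drop(columns="target"))   # nominal features -> dummy variables
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = MinMaxScaler().fit(X_train)            # normalize numeric values to [0, 1]
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```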
Fig. 8 Feature importance score of 14 Input variables
ANOVA + SVM: First, the filter method ANOVA is used for selecting the best features contributing towards the prediction of heart disease. The top 11 features were selected based on the feature importance scores generated by ANOVA. A bar graph (Fig. 8) is plotted to get an idea of how many features should be selected for the SVM classification.

SFS + RF: The sequential feature selector, along with the random forest classifier, selects the best k features, and those selected features are used to train the model using the Random Forest algorithm.

Lasso Regularization + AdaBoost: Using Lasso regularization, the coefficients of unweighted features shrink to zero and those features are eliminated. In this work, after Lasso regularization 4 features were removed, and the remaining 11 features were used to train the model using the AdaBoost algorithm.

Performance Analysis: The performance of all methods was evaluated using accuracy, precision, recall and F-score. Classification accuracy measures how well a classification model performs by dividing the number of correct predictions by the total number of predictions. Let TP denote the True Positives, TN the True Negatives, FP the False Positives and FN the False Negatives:

Accuracy = (TP + TN)/(TP + TN + FP + FN) (8)
Precision is the fraction of relevant instances among the retrieved instances:

Precision = TP/(TP + FP) (9)
Table 1 Comparison of performance

| Feature selection | Classification | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|---|
| ANOVA analysis | Support vector machine | 82.55% | 83% | 83% | 83% |
| Step forward feature selection | Random forest | **89.78%** | 90% | 90% | 90% |
| Lasso regularization | AdaBoost | 84.68% | 90% | 82% | 86% |

Bold indicates highest accuracy
Bold indicates highest accuracy
Recall is the percentage of relevant examples that have been retrieved out of a total number of relevant instances. Recall = TP/(TP + FN)
(10)
F-Score: The F-score is calculated by dividing the total precision and recall by two times the precision times recall. F - score = (2 ∗ Precision ∗ Recall)/(Precision + Recall)
(11)
The different metrics used for comparison for all the methods is depicted in Table 1. It is clear that Random forest with Step forward feature selection has higher accuracy of 89.78% amongst all.
5 Conclusion The main aim of this work was to develop an efficient model, using the best set of features and machine learning algorithms to predict heart disease. Exploratory analysis of the heart disease dataset was performed and different feature selection techniques in combination with SVM, Random Forest and AdaBoost were implemented. A subset of 11 best features was found, and Random forest algorithm gave the highest testing accuracy of 89.78%.
References 1. Basha N, Kumar A, Krishna G, Venkatesh (2019) Early detection of heart syndrome using machine learning technique. In: 4th International Conference on Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques (ICEECCOT), IEEE, pp. 387–391 2. Shah D, Patel S, Bharti SK (2020) Heart disease prediction using machine learning techniques. SN Comp Sci 1(6):1–6 3. Kogilavani SV, Harsitha K (2020) Heart disease prediction system using Machine Learning Techniques. Int J Adv Sci Technol 29:78–87
220
R. S. Renju and P. S. Deepthi
4. Mohan SK, Thirumalai C, Srivastava G (2019) Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 7:81542–81554 5. Malavika G, Rajathi N, Vanitha V and Parameswari P (2020) Heart disease prediction using machine learning algorithms. Biosci Biotech Res Comm 13(11):24–27 6. Fitriyani NF, Syafrudin M, Alfian G, Rhee J (2020) HDPM: an effective heart disease prediction model for a clinical decision support system. IEEE Access 8:133034–133050 7. Gao X-Y, Ali A, Hassan HS, Anwar EM (2021) Improving accuracy for analyzing heart diseases prediction based on the ensemble method. Complexity 6663455:10 pages 8. Mienyea ID, Sun Y, Wang Z (2020) An improved ensemble learning approach for the heart disease prediction risk. Inform Med 20(100402)ISSN 2352-9148 9. https://www.maxhealthcare.in/blogs/rise-cases-asymptomatic-heart-attacks-amongst-middleaged-people (accessed on August 2021).
Gun Detection System for Surveillance Cameras Using HOG-Assisted KNN Classifier Lucy Sumi and Shouvik Dey
Abstract Mass shootings have become a norm in public places claiming thousands of innocent civilian lives. Firearm-related violence has been rampant in the last few decades and therefore, needs to be addressed immediately. This research aims to propose a weapon detection system that combines image processing techniques with the most suitable machine learning classifier. The experimental study has been executed in twofold: by providing a comparative analysis with previous study and with other existing algorithms as well. Results of the trained model on the dataset provide an accuracy of 98.7% which is significantly better than others published recently. Keywords Weapon detection · Object detection · Sliding windows · Image processing · Machine learning · K-nearest neighbors (KNN)
1 Introduction With the booming usage of technology, Closed-Circuit Television (CCTV) has become a vital part of security and surveillance used by the law-enforcing authorities in various aspects. CCTV is used for monitoring [1], controlling, pedestrian detection [2, 3], identifying people armed with weapons, detecting malicious activities, or event detection [4, 5]. Continuous monitoring requires a painstaking attention to details manually and may sometimes miss out minute information which could be important to detect a malicious activity or dangerous object. Not only is it a tedious task but also consumes time, energy, and human resource. That is where weapon detection comes into play. Lately, surveillance videos has been used in various applications, such as human detection [6, 7], human attribute recognition [8] and pedestrian detection [9], L. Sumi (B) · S. Dey National Institute of Technology Nagaland, Chumukedima 797103, India e-mail: [email protected] S. Dey e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_18
vehicle counting, license plate recognition [10], people counting, car recognition [11, 12], road accidents [13], etc. According to the timeline representation in [14], this year has already been a bad year for firearm-related violence. Several unethical activities like robbery, assault, mass shooting, kidnapping, and other criminal activities have become prevalent globally. The FBI reported that, during the years 2000–2013, there were 160 shooting incidents that resulted in the deaths of 1043 innocent people in the USA [15]. Such practices have become a norm and, hence, need to be countered using the best of emerging technologies. Object detection in particular is very helpful in detecting objects, behavior, or human activity. Over the last few decades, object detection has received a significant amount of attention and is considered one of the primary and demanding problems in the field of computer vision [16]. While some researchers have come up with gunshot detection [17], it is much better if weapons are detected before a life-threatening event occurs instead of waiting for a gunshot to happen. Most weapon detection is based on X-ray images, mainly used in airports, banks, and railway stations, whose usability is reduced for non-metallic guns. To the best of the authors' knowledge, most existing works did not use a real-time gun video dataset but merely downloaded images from the internet and adopted neural networks that come with merits and demerits. Therefore, this research makes a comparative analysis and attempts to introduce an automatic gun detection system (particularly for pistols) on a real-time gun video dataset, fusing image processing techniques and machine learning, which has not been reported in previous studies. It also resolves some of the existing challenges with respect to previously published works. The rest of the paper is organized as follows: Sect. 2 presents the state of the art with a brief discussion of some challenges and limitations of existing systems. Section 3 introduces the proposed model, while Sect. 4 presents experimental results and analysis. Finally, Sect. 5 concludes the paper.
2 State of the Art
Out of the many techniques and algorithms for object detection, some of the traditional approaches used were shape descriptors [18] and edge detectors [19]. Weapon detection is one of the interesting applications of object detection. Verma et al. presented a Harris interest point detector for detecting guns, in which unrelated objects were removed by color-based segmentation with K-means clustering, while the gun was located in the segmented images by blending the Fast Retina Keypoint (FREAK) descriptor with the Harris detector [20]. Although the authors claimed that their system is robust to partial occlusion and capable of detecting multiple guns in an image, the color-based technique has the disadvantage of objects blending into the background and is not robust to illumination change. Buckchash et al. proposed an object detection algorithm for detecting knives based on a client–server architecture model, using FAST (Features from Accelerated Segment Test) and multi-resolution analysis. The authors claimed their approach
to be scalable and to achieve parallelism, since all the bulky computations are done in the cloud [21]. However, these classical methods demand a high level of human supervision. Most research related to weapon detection addresses millimetric or X-ray images using traditional machine learning techniques [22, 23]. Millimeter scanners yield high false positives from buttons and folds in clothes. They are limited to metallic objects only, and are thus incapable of detecting several materials, such as plastics. They are also unable to process raw images or extract features from them automatically, which is impractical given the kind of outcome expected of a gun detection system. On the contrary, automated weapon detection systems are capable of extracting features from raw images spontaneously. Some authors used neural networks for handgun detection in video, with Faster Region-based Convolutional Neural Network (Faster R-CNN) and Convolutional Neural Network (CNN) models [24–27], while others used YOLO and its variant, YOLOv3-based algorithms [28]. These region-based techniques extract all the possible windows of the input image as candidates and extract features which help in localizing and classifying the object. Deep Learning (DL) is known for automatically extracting features from images and is more resilient when it comes to feature learning and representation, unlike conventional object detection approaches. Weapon detection can basically be classified into concealed and non-concealed. Despite several attempts made on concealed weapon detection [29, 30], only a limited number of studies have addressed non-concealed weapons. Kamal et al. [31] introduced automatic detection of different types of pistols and guns, implemented using Transfer Learning (TL), which uses a pre-trained network. The authors used the Internet Movie Firearm Database (IMFDB), a widely used benchmark database for weapon detection, and applied two DL approaches, namely GoogLeNet and AlexNet. Another weapon detection system adopted a hybrid technique applying image and material tests with fuzzy logic [32], while the Active Appearance Model (AAM) is widely used in medical images [33]. Glowacz et al. introduced AAM-based knife detection for security applications in luggage scanning at railways and airports. Knives are usually distinguished by the tip of their blade, which makes their key points easier to detect; the same does not hold for guns, due to their variety of sizes and shapes. Therefore, detecting knives is easier with this technique. However, it is important to point out that AAM converges only to objects positioned at the same angles, and hence, it is not invariant to rotation [34]. Lastly, Castillo et al. [35] presented a model that resolves the problem of detecting weapons in videos caused by the weapon's surface reflectance. To build a resilient system, the authors used a CNN with DaCoLT (Darkening and Contrast at Learning and Test stages), where they focused only on variants of sharp-edged weapons like kitchen knives, machetes, razors, carved knives, and daggers. Some weapon detection works focused on guns, while others focused on knives [36]. Combining Dominant Edge Directions (DED) and the Histogram of Oriented Gradients (HOG) detector, a knife detection scheme was proposed which improves detection time. Some of the existing systems mentioned above are robust but still have inadequacies, which are discussed as follows:
2.1 Challenges
1. Various challenges like occlusion, view orientation, rotation, and noise exist in images or video frames. Occlusion is one of the most challenging problems in object detection and may be classified as self-occlusion, inter-object occlusion, and background occlusion.
2. Weapons come in different shapes, sizes, colors, features, and orientations, making them even more difficult to detect, as the approaches to these issues differ and cannot be covered by one particular technique.
3. In systems that use color-based segmentation, objects may blend and get camouflaged with the background.
4. Additionally, a weapon is usually not close to the surveillance camera, making firearm detection even more difficult as the object becomes tiny.
5. Weapons like pistols and knives are usually made of steel, whose surface reflects or illuminates under varying lighting conditions. This leaves the shape of the object in the frame blurry and distorted, leading to difficulty in detection.

While the aforementioned challenges exist, the issues of blurring, distortion, reflectance, and illumination on objects, which hinder the successful detection of objects (guns) in images, have been resolved in our proposed system. It is also protected from erroneous results (due to the blending of targeted objects with the background), since Canny edge detectors were used to distinctly separate the object from the background. Various image processing techniques were applied to pre-process the images for training and testing, and the results have proven superior to existing values in the literature.
3 Proposed Model
The proposed model uses an existing gun video database [38] created by a group of researchers [37]. It has been compared against existing algorithms to find the best combination of feature extractor and classifier. As depicted in Fig. 1, frames were first extracted from the video and converted to grayscale to reduce their size. Next, frame differencing was applied to find the difference between the current and the previous frame, the latter usually known as the "background image". This detects moving items in the foreground against a static background. Basically, it subtracts the two frames, which is mathematically denoted by
$P[F_g(t)] = P[C(t)] - P[B_g]$  (1)
Fig. 1 Proposed model for Gun detection system using HOG-assisted KNN Classifier
where P denotes the pixel value at a particular frame, C(t) is the current frame obtained at time t, Fg is the foreground at time t, and Bg is the background image. A low-pass Gaussian blur is used for removing noise from the image and is represented by the following equations:

In 1D: $GB(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}$  (2)

In 2D: $GB(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}$  (3)

where σ denotes the standard deviation of the distribution.
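The pre-processing chain of Eqs. (1)–(3), together with the Canny edge step described next, maps directly onto a few OpenCV calls. The following is a minimal sketch, not the authors' actual code; the Gaussian kernel size and the Canny thresholds are assumptions, since the paper does not report them:

```python
import cv2

def preprocess(prev_gray, frame):
    # RGB-to-grayscale conversion reduces the frame size
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Frame differencing, Eq. (1): foreground = |current - background|
    fg = cv2.absdiff(gray, prev_gray)
    # Low-pass Gaussian blur, Eqs. (2)-(3); kernel size is an assumption
    blurred = cv2.GaussianBlur(fg, (5, 5), 0)
    # Canny edge detection separates the object from the background;
    # these thresholds are assumptions, not values from the paper
    edges = cv2.Canny(blurred, 50, 150)
    return gray, edges
```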
Further, Canny edge detection has been applied on the "foreground images", followed by a sliding windows approach. The sliding window is one of the traditional techniques used for object detection [37], and it has been applied here because it helps in localizing precisely "where" the object is in the image, at different scales and positions. While merits and demerits exist, various neural network techniques are also used in place of this conventional technique. A sliding window is a rectangular region with a desired height and width that slides across the target image. Its two important parameters are window size and step size. A fixed window size of 50 × 50 pixels and a step size of 20 was initialized. The step size is the number of pixels we intend to "skip" across the x and y directions. These values were chosen after exhaustive experiments, and because looping over each and every pixel of an image would be computationally very expensive (for instance, applying a classifier at each window, i.e., with a step size of 1). Sequentially, each extracted window was segregated as either gun or non-gun (see Fig. 2). Windows showing the gun were put in a gun folder, and other objects, such as the door, ceiling, body of the human, etc., were put into a non-gun folder. The dataset containing 88,052 slides was further divided into train and test sets, with 70% of the images for training and 30% for testing.

Fig. 2 a Windows segregated as Gun (Positive); b Windows segregated as Non-gun (Negative)

Thereafter, the Histogram of Oriented Gradients (HOG) feature descriptor is applied. This descriptor concentrates on the shape and structure of an object, which is attained by extracting information about its orientation and gradients (in other words, direction and magnitude). We divided a slide into small regions of 6 × 6 cells, where each cell has 10 × 10 pixels. The gradient magnitude within a cell is calculated by

$GM = \sqrt{GM_x^2 + GM_y^2}$  (4)

where GMx and GMy are the gradients, i.e., small changes along the x and y directions. The direction or orientation for each pixel is evaluated by

$\tan(\theta) = GM_y / GM_x$  (5)

Lastly, the angle's value is given by

$\theta = \arctan(GM_y / GM_x)$  (6)

Intuitively, each pixel of a cell contributes to angular bins according to its weighted gradient. This generates a histogram for every region individually; hence, the name Histogram of Oriented Gradients.
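As an illustration of the sliding-window extraction described above, a minimal generator using the paper's 50 × 50 window and step size of 20 might look as follows (a sketch, not the authors' code):

```python
def sliding_windows(image, win=50, step=20):
    """Yield (x, y, patch) for win x win windows slid across the image."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            yield x, y, image[y:y + win, x:x + win]
```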
The K-Nearest Neighbors (KNN) algorithm, otherwise known as a lazy machine learning algorithm, is applied on the outputs of these slides. While K refers to the number of neighbors, the term "lazy" comes from the fact that it does not build a model from the training data points, but stores all the available cases and classifies new data on the basis of similarity measures. Some of the similarity measures are the Hamming, Euclidean, and Minkowski distances. Here, the Euclidean distance was used, which is given by the equation:
$d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$  (7)
where X and Y are represented by the feature vectors X = (x1, x2, …, xm) and Y = (y1, y2, …, ym), and m is the dimensionality of the feature space. Finally, the KNN classifier classifies a test input image as gun or non-gun. Figure 3 shows the output image of each step in the sequence of the algorithm.
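Putting the HOG descriptor and the KNN classifier together is straightforward with scikit-image and scikit-learn. The sketch below is an illustration under stated assumptions: only the 10 × 10-pixel cells and the Euclidean metric come from the paper; the number of orientation bins, the block normalisation, and the value of k are ours, and train_patches, train_labels, and test_patches stand for the segregated window folders:

```python
from skimage.feature import hog
from sklearn.neighbors import KNeighborsClassifier

def hog_features(patches):
    # 10x10-pixel cells follow the paper; orientations and block size are assumptions
    return [hog(p, orientations=9, pixels_per_cell=(10, 10),
                cells_per_block=(1, 1)) for p in patches]

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # k = 5 is an assumption
knn.fit(hog_features(train_patches), train_labels)   # labels: 1 = gun, 0 = non-gun
predictions = knn.predict(hog_features(test_patches))
```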
4 Experimental Results and Analysis
The metrics used for performance evaluation of the system are summarized below:
1. True Positive (TP): Observation that is actually positive and is predicted as positive.
2. False Negative (FN): Observation that is actually positive but is predicted as negative.
3. True Negative (TN): Observation that is actually negative and is predicted as negative.
4. False Positive (FP): Observation that is actually negative but is predicted as positive.
5. Accuracy: The ratio of correct predictions to all predictions, denoted by Accuracy = (TP + TN) / (TP + FP + FN + TN).
6. True Positive Rate (TPR): The rate of positive images accurately predicted to all the actual positive images. It represents how good a model is at predicting the positive class when the actual outcome is positive, and is denoted by TPR = TP / (TP + FN). TPR is also referred to as sensitivity.
7. False Positive Rate (FPR): Otherwise known as the false alarm rate, it evaluates what percentage of cases is predicted positive when the actual outcome is negative. It is denoted by FPR = FP / (FP + TN).
Fig. 3 a Extracted image from video; b RGB to grayscale conversion; c Frame difference; d Low-pass Gaussian blurring; e Canny edge; f Sliding windows; g Slide categorized as gun; h Slide categorized as non-gun
8. Specificity: The rate of the total number of true negatives to all the negative events, denoted by Specificity = TN / (TN + FP).
9. Precision: Determines the actually correct predictions out of all positive predictions; in other words, it tells the preciseness or exactness of the classifier. A higher precision value denotes a more accurate classifier. It is represented by Precision = TP / (TP + FP).
10. Recall: Determines the correct positive predictions out of all observations that are actually positive. It measures the completeness of a classifier and is also known as sensitivity. A higher recall value denotes that more cases are covered. It is represented by Recall = TP / (TP + FN).
11. F1-Score: The weighted average of recall and precision, given by

$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$
12. PR Curve: A graph showing the correlation between precision and recall.

Table 1 shows the results of the various existing algorithms trained and tested on the dataset. A comparative analysis has been made in two ways: (a) comparing results with previous works by other researchers using the same dataset [25, 37] and (b) comparing the proposed model's results, i.e., HOG-assisted KNN, against other existing algorithms, namely Support Vector Machine (SVM), HOG-assisted SVM, Naïve Bayes, random forest, and KNN. The HOG-assisted KNN gives a better result than the existing ones, with an accuracy of 98.7% and sensitivity and specificity values of 98% and 99%, respectively. Table 2 shows comparative results of precision, recall, F1-score, and support values, where 0 and 1 represent the two classes (absence and presence of gun, respectively). It is evident that the proposed system achieves the best results overall.

Table 1 Comparative results of sensitivity, specificity, and accuracy of the existing algorithms versus the proposed model

Algorithm                   Sensitivity   Specificity   Accuracy
[23]                        0.35          0.96          n/a
[36]                        0.35–0.95     0.95–0.99     n/a
SVM                         0.91          0.97          0.95
HOG + SVM                   0.65          0.98          0.89
Naïve Bayes                 0.97          0.51          0.66
Random forest               0.97          0.98          0.97
KNN                         0.90          0.98          0.96
HOG + KNN (proposed work)   0.98          0.99          0.987
Table 2 Comparative values of precision, recall, F1-score, and support of the existing algorithms versus the proposed model

Algorithm                   Class   Precision   Recall   F1-score   Support
SVM                         0       0.95        0.92     0.93       210
                            1       0.96        0.97     0.97       398
HOG + SVM                   0       1.00        0.65     0.79       199
                            1       0.86        1.00     0.92       409
Naïve Bayes                 0       0.49        0.97     0.65       198
                            1       0.98        0.51     0.67       405
Random forest               0       0.96        0.97     0.98       225
                            1       0.98        0.99     0.98       381
KNN                         0       1.00        0.91     0.95       199
                            1       0.96        0.97     0.98       406
HOG + KNN (proposed work)   0       0.97        0.99     0.98       202
                            1       1.00        0.98     0.99       406
The graphs shown in Fig. 4 present the Precision–Recall Curve (PRC) of all the algorithms, wherein the x-axis shows recall (sensitivity) and the y-axis shows precision. The plot of the proposed model displays a better curve and shows superior outcomes, as evident from the graphs below.
5 Conclusions and Future Work
As the literature survey shows, there is still ample scope to explore in weapon detection due to the limited studies, and the proposed model presents an improved automatic gun detection system based on the fused concepts of image processing and machine learning techniques. It not only gives better results but also resolves some of the existing challenges, such as blurring, distortion, and illumination on objects, which hinder the successful detection of an object (in this case, a gun) in images. The broad objective of this research is to contribute to ensuring the safety of citizens in public places by detecting guns in real time using surveillance cameras. This would aid in averting potentially dangerous gun-firing situations in places such as banks, schools, malls, parks, theatres, etc. An available gun video dataset was tested on existing algorithms, and a comparison was made with previous works using the same dataset. While they achieved 95–99% specificity and 35–95% sensitivity, respectively, the proposed system outperformed all the existing models, giving an accuracy of 98.7% with sensitivity and specificity values of 98% and 99%, respectively. It also gave an interesting outcome, with an average precision–recall curve value of 99%. Different orientations and angles of the gun were considered for successful detection of the object in surveillance video.
Fig. 4 Precision–Recall curves of existing versus proposed systems: a SVM; b HOG-SVM; c Naïve Bayes; d Random forest; e KNN; f HOG-KNN
Some major challenges concerning this particular research domain were highlighted. However, the database used in this research is not fully realistic, since it was recorded in a confined environment, i.e., inside a lab. Therefore, this research domain can be further explored with versatile datasets captured in non-confined environments, using suitable techniques to build a robust system in the future. This work can also be extended to other kinds of weapons like knives, or to different variants of guns (such as rifles, machine guns, revolvers, etc.), while also keeping in mind the detection speed in real time.
References
1. Maalouf A, Larabi M, Nicholson D (2014) Offline quality monitoring for legal evidence images in video-surveillance applications. Multimed Tools Appl 73:189–218
2. Zhang Y, Shen Y, Zhang J (2019) An improved tiny-yolov3 pedestrian detection algorithm. Optik 183:17–23
3. Bastian B, C V J (2019) Pedestrian detection using first- and second-order aggregate channel features. Int J Multimed Inf Retr 8:127–133
4. Velastin S, Boghossian B, Silva M (2006) A motion-based image processing system for detecting potentially dangerous situations in underground railway stations. Transp Res C 14(2):96–113
5. Lavee G, Khan L, Thuraisingham B (2007) A framework for a video analysis tool for suspicious event detection. Multimed Tools Appl 35:109–123
6. Hu W, Tan T, Wang L, Maybank S (2004) A survey on visual surveillance of object motion and behaviors. IEEE Trans Syst Man Cybern Part C 34(3):334–352
7. Zhang J, Gong S (2009) People detection in low-resolution video with non-stationary background. Image Vis Comput 27(4):437–443
8. Ke X, Liu T, Li Z (2020) Human attribute recognition method based on pose estimation and multiple-feature fusion. SIViP 14:1441–1449
9. Saeidi M, Ahmadi A (2020) A novel approach for deep pedestrian detection based on changes in camera viewing angle. SIViP 14:1273–1281
10. Rung H, Chen C (2019) Automatic license plate recognition via sliding-window darknet-YOLO deep learning. Image Vis Comput 87:47–56
11. Baran R, Glowacz A, Matiolanski A (2015) The efficient real and non-real time make and model recognition of cars. Multimed Tools Appl 74:4269–4288
12. Xu B, Wang B, Gu Y (2020) Vehicle detection in aerial images using modified YOLO. In: 19th International Conference on Communication Technology, pp 1669–1672. IEEE, Xian, China
13. Gour D, Kanskar A (2019) Optimized-YOLO: algorithm for CPU to detect road traffic accident and alert system. Int J Eng Res Technol 8(9):160–163
14. Number of mass shootings in the United States between 1982 and May 2021, https://www.statista.com/statistics/811487/number-of-mass-shootings-in-the-us/, last accessed 2021/09/21
15. Department of Justice, Federal Bureau of Investigations: A study of active shooter incidents in the United States between 2000 and 2013, https://www.fbi.gov/file-repository/active-shooter-study-2000-2013-1.pdf/view, last accessed 2019/03/13
16. Zou Z, Shi Z, Guo Y, Ye J (2019) Object detection in 20 years: a survey. Computer Vision and Pattern Recognition, pp 1–39
17. Rodríguez A, Julián P, Castro L, Alvarado P, Hernández N (2011) Evaluation of gunshot detection algorithms. IEEE Trans Circuits Syst I 58(2):363–373
18. Bober M (2001) MPEG-7 visual shape descriptors. IEEE Trans Circuits Syst Video Technol 11(6):716–719
19. Canny J (1986) A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell 8(6):679–698
20. Tiwari R, Verma GK (2015) A computer vision based framework for visual gun detection using Harris interest point detector. Proc Comput Sci 54:703–712
21. Buckchash H, Raman B (2017) A robust object detector: application to detection of visual knives. In: International Conference on Multimedia & Expo Workshops (ICMEW), pp 633–638. IEEE, Hong Kong, China
22. Flitton G, Breckon T, Megherbi N (2013) A comparison of 3D interest point descriptors with application to airport baggage object detection in complex CT imagery. Pattern Recogn 46(9):2420–2436
23. Uroukov I, Speller R (2015) A preliminary approach to intelligent X-ray imaging for baggage inspection at airports. Signal Process Res 4:1–11
24. Olmos R, Tabik S, Herrera F (2018) Automatic handgun detection alarm in videos using deep learning. Neurocomputing 275:66–72
25. Grega M, Matiolanski A, Guzik P, Leszczuk M (2016) Automated detection of firearms and knives in a CCTV image. Sensors 16(1):1–16
26. Verma GK, Dhillon A (2017) A handheld gun detection using Faster R-CNN deep learning. In: Proceedings of the 7th International Conference on Computer and Communication Technology (ICCCT-2017), pp 84–88. ACM, India
27. Hernández F, Tabik S, Lamas A, Olmos R, Fujita H, Herrera F (2020) Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: application in video surveillance. Knowl-Based Syst 194:1–10
28. Pang L, Liu H, Chen Y, Miao J (2020) Real-time concealed object detection from passive millimeter wave images based on the YOLOv3 algorithm. Sensors 20(6):1–15
29. Zhang J, Xing W, Xing M, Sun G (2018) Terahertz image detection with the improved faster region-based convolutional neural network. Sensors 18(7):1–19
30. Kaur A, Kaur L (2017) Concealed weapon detection from images using SIFT and SURF. In: International Conference on Green Engineering and Technologies (IC-GET), IEEE, Coimbatore, India
31. Mohamed M, Taha A, Zaye H (2020) Automatic gun detection approach for video surveillance. Int J Sociotechnology Knowl Dev 12(1):49–66
32. Ineneji C, Kusaf M (2019) Hybrid weapon detection algorithm using material test and fuzzy logic system. Comput Electr Eng 78:437–448
33. Beichel R, Bischof H, Leberl F, Sonka M (2005) Robust active appearance models and their application to medical image analysis. IEEE Trans Med Imaging 24(9):1151–1169
34. Glowacz A, Kmieć M, Dziech A (2015) Visual detection of knives in security applications using active appearance models. Multimed Tools Appl 74:4253–4267
35. Castillo A, Tabik S, Perez F, Olmos R, Herrera F (2019) Brightness guided pre-processing for automatic cold steel weapon detection in surveillance videos with deep learning. Neurocomputing 330:151–161
36. Kmieć M, Glowacz A (2015) Object detection in security applications using dominant edge directions. Pattern Recogn Lett 52:72–79
37. Grega M, Łach S, Sieradzki R (2013) Automated recognition of firearms in surveillance video. In: International Multi-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), pp 45–50. IEEE, San Diego, USA
38. Gun video database, http://kt.agh.edu.pl/grega/guns/, last accessed 2019/09/11
Optimized Detection, Classification, and Tracking with YOLOV5, HSV Color Thresholding, and KCF Tracking
Aditya Yadav, Srushti Patil, Anagha Dangle, and Pranesh Nangare
Abstract This paper presents a detection and tracking approach for estimating the position of pots and the angle of arrows on the ground. The pots are detected by HSV color thresholding and then classified on the basis of local positional parameters, such as the distance of the pots from the robot and the relative position of the pots with respect to the robot. This is a computationally efficient solution for simple regular objects like pots. The Kalman filter specifically provides better depth estimates, and thereby the positions of pots, even when the pot tables are overlapping. The detection–tracking algorithm for small objects like arrows combines the detection accuracy of a custom-trained YOLOv5 model with the speed of the KCF tracker to improve the overall results. Multithreading is used to concurrently detect and track the arrows in consecutive frames, producing a computationally efficient approach compared to standalone detection with YOLOv5. This paper also describes an approach to effectively obtain depth information of an object using an Intel RealSense D435i depth camera. Keywords Image processing · Computer vision · HSV color thresholding · YOLOV5 · Kalman filter · Object detection · KCF tracking · Depth frame processing · Intel RealSense D435i depth camera
1 Introduction
Archery is a famous recreational sport of shooting arrows, a task of great precision. Performed manually, there are many problems, such as deciding the target position, shooting position, angle, a moving target, etc. Automation is the need of the hour, and automating the entire process indirectly means increasing accuracy. This served as motivation to create a detection system through which the robot could detect the target pots into which an arrow had to be thrown, identify the arrows fallen on the ground, and accurately hit the target. This task was not easy, as the system was to be used by a mobile robot. Practical scenarios like different lighting conditions needed to be considered, so the HSV color thresholding [12] ranges were set so as to detect pots in varying lighting conditions. The HSV color space detected all the pots in the scene distinctly, except for overlapping pots of the same color. As stated in [7], the color algorithm proves to be the most reliable and has the lowest processing cost, and was thus chosen for this application. Traditional model training did not give satisfactory results for detecting small objects like arrows; hence, a convolutional neural network was essential. To overcome the speed limitation, a method is proposed that combines the slower detection model with a faster tracking algorithm, optimizing the entire process. Tracking and detection run on parallel threads to achieve the required accuracy and speed. Tracking algorithms such as KCF [1], CSRT [2], and MOSSE [3] are mostly used for general object-tracking tasks. Paper [6] describes various metrics with their success rates and FPS (Frames Per Second) for different tracking algorithms; KCF was preferred as it best suited our needs. These algorithms require an initial ROI input, which is usually given by manually selecting the ROI from the first frame. In the proposed algorithm, this initial ROI is provided to the tracker by the detection module. A D435i RealSense depth camera is used to provide the color frame along with the depth frame. The algorithms for detection and tracking were tested on an Nvidia Jetson Nano, interfacing with an Arduino Mega over a serial port. Apart from detection and tracking, it is important to estimate the correct position and distance of the pot under various circumstances. The Kalman filter [4] is used for accurate estimation in the following scenario: as multiple pots are detected in the scene, some pots may not always be clearly classifiable for tracking and angle processing. The Kalman filter estimates the positions and depths of those partially classifiable pots using the data provided by the color and depth frames. The contribution of this paper is to propose and compare two traditional methods, color thresholding and a deep learning model, for object detection. These algorithms play significant roles in the specific applications highlighted in the paper. The research also provides examples of how to perceive and utilise depth frames from the D435i depth camera.
This paper is mainly divided into three broad sections. The first section focuses on the detection and classification of pots, pot table overlapping, motion tracking, and the angle of the pot table. The second section deals with YOLOv5 [5] model training and detection, multithreading, and the angle of the arrow. The third section covers the extraction of depth with the D435i depth camera.
2 Methodology

2.1 Section I
This section deals with the detection, classification, and motion tracking of pots. With motion tracking, the relative angle of the pot table with respect to the robot is estimated. The motion of the pots is tracked along the horizontal axis (x-axis) of the pot table together with the depth information (z-axis) of the pots.

Detection: Pots are detected on the basis of the HSV color space. A range of HSV values was selected so that red and blue pots are detected in various lighting conditions. A mask denoting the ROI was extracted from the image for the red and the blue pots each. These areas were further sorted to assist classification. Detection in a color space provides a computationally effective solution compared with deep learning methods; a minimal sketch of this masking step is given after the pot-table angle derivation below.

Classification: For the classification of pots, the robot's position on the arena (known with the help of the designated sensors) together with the fixed positions of the pots on the arena was used. This classification also improved the accuracy of color-space detection: since the relative positions of the pots and the robot are known, only the required areas were considered for masking, thereby disregarding other similarly colored objects. Area-wise sorting is also done for classification.

Pot table overlapping: Each pot table had a blue and a red pot. Some pot tables could rotate about their vertical axis. These rotations and the relative position of the robot caused a few pots to overlap in color space. To resolve this, if one of the pots (red or blue) of a pot table was distinctly classifiable, the other pot's position was estimated from it. Increasing the height of the camera apparatus on the robot also reduces overlapping, as does focusing only on ROI pots by moving the robot so that those pots do not overlap (Fig. 1).

Fig. 1 Detect and classify pots (Here, the second pot table is selected as ROI; two pots are undetected on purpose)

Motion Tracking: Motion tracking of the pots provides the angle of the pot table relative to the robot. After detecting and classifying pots, the horizontal motion of the pots as viewed from the robot and the varying distance of the pots from the robot were the two motion variables; both were obtained from the circular motion of the pot table as viewed from the top. The tangential and centripetal acceleration vectors are decomposed into two perpendicular vectors, which are then mapped to the x-axis and z-axis accelerations; the same is done with the tangential velocity components.

The angle of the pot table: The distance of one of the completely classifiable pots from the robot, obtained from the depth camera (d), along with the calculated distance of the center of the table from the robot (fixed for one position) (D), is used to obtain the angle of the pot table relative to the robot (Fig. 2). With r the radius of the pot table,

$\text{Angle} = \sin^{-1}(|D - d| / r)$
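As a concrete illustration of the detection step described at the start of this subsection, the HSV masking can be sketched with OpenCV as below. The HSV ranges shown are placeholders, not the tuned values from the work, and red in practice needs two hue ranges because it wraps around the hue axis:

```python
import cv2
import numpy as np

def detect_pots(frame, lo, hi):
    """Mask one pot colour in HSV space; return bounding boxes sorted by area."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, lo, hi)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    return sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)  # area-wise sort

# Placeholder range for blue pots; real values are tuned per lighting condition
blue_boxes = detect_pots(frame, np.array([100, 120, 50]), np.array([130, 255, 255]))
```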
Optimized Detection, Classification, and Tracking with YOLOV5, HSV …
Fig. 2 Angle of pot table Fig. 3 Vector mechanics for the motion of pot table
r = radius of pot table. θ = current angle. ω = angular velocity of pot table. α = angular acceleration of pot table. Acceleration and velocity formulas:
$a_x = r\alpha\sin\theta - \omega^2 r\cos\theta$
$a_z = r\alpha\cos\theta + \omega^2 r\sin\theta$
$v_x = \omega r\sin\theta$
$v_z = \omega r\cos\theta$

The discrete-time motion equations used by the Kalman filter are:

$v_x \leftarrow v_x + a_x\,dt$
$v_z \leftarrow v_z + a_z\,dt$
$x \leftarrow x + v_x\,dt + \tfrac{1}{2}a_x\,dt^2$
$z \leftarrow z + v_z\,dt + \tfrac{1}{2}a_z\,dt^2$

F is the state-transition model and B is the control-input model:

$F = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad B = \begin{bmatrix} \tfrac{1}{2}dt^2 & 0 \\ 0 & \tfrac{1}{2}dt^2 \\ dt & 0 \\ 0 & dt \end{bmatrix}$

The observation matrix and the control (acceleration) input are

$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \sin\theta & 0 \\ 0 & 0 & 0 & \cos\theta \end{bmatrix}, \quad a = \begin{bmatrix} r\alpha\sin\theta - \omega^2 r\cos\theta \\ r\alpha\cos\theta + \omega^2 r\sin\theta \end{bmatrix}$

and the state vector is $\mu = \begin{bmatrix} x & z & \omega r & \omega r \end{bmatrix}^T$.

Given the transition models F and B, and noise approximated by the covariance matrices Q and R, the state estimate, in the form of its mean and covariance matrix P, is updated with a new observation z as follows:

Predicted state estimate: $\hat{x}_k^- = F\hat{x}_{k-1}^+ + Bu$
Predicted error covariance: $P_k^- = F P_{k-1}^+ F^T + Q$
Measurement residual: $y_k = z_k - H\hat{x}_k^-$
Kalman gain: $K_k = P_k^- H^T (R + H P_k^- H^T)^{-1}$
Updated state estimate: $\hat{x}_k^+ = \hat{x}_k^- + K_k y_k$
Updated error covariance: $P_k^+ = (I - K_k H) P_k^-$

where hats denote estimates, the superscripts (−) and (+) denote prior and updated estimates, respectively, and the superscript T denotes the matrix transpose. This Kalman filter specifically provides better depth estimates, even when the pot table is overlapping.
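A minimal NumPy sketch of one predict/update cycle with the matrices above follows; dt and the noise covariances Q and R are assumptions, since the paper does not report its values:

```python
import numpy as np

dt = 1 / 30.0                                    # frame interval; FPS is an assumption
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
              [0, 0, 1, 0], [0, 0, 0, 1]], float)        # state-transition model
B = np.array([[dt**2 / 2, 0], [0, dt**2 / 2],
              [dt, 0], [0, dt]], float)                   # control-input model

def kalman_step(x, P, u, z, H, Q, R):
    """One predict/update cycle for the pot state [x, z, vx, vz]."""
    x = F @ x + B @ u                            # predicted state estimate
    P = F @ P @ F.T + Q                          # predicted error covariance
    y = z - H @ x                                # measurement residual
    K = P @ H.T @ np.linalg.inv(R + H @ P @ H.T) # Kalman gain
    x = x + K @ y                                # updated state estimate
    P = (np.eye(len(x)) - K @ H) @ P             # updated error covariance
    return x, P
```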
2.2 Section II

Training and detection using YOLOv5: This section covers the training and deployment of the YOLOv5 model for detecting the head and the tail of an arrow. The first stage is to develop an image dataset with arrows placed on the ground at various angles and orientations. Since this system was going to be used by a mobile robot, realistic scenarios had to be considered, and videos of arrows on the arena had to be recorded from the robot's perspective. For this reason, the arrow was captured at various angles and under various lighting conditions in subsequent frames. The data must be in a particular format for YOLOv5 training, with each image having its own XML file containing the bounding box coordinates and the label of the object to be detected. Services like LabelBox [9] and Roboflow [10], which accept a video file as input and load each frame sequentially, are frequently used to build such XML files. Because selecting objects in each individual image for file creation can be tedious and time-consuming, we created a Python script that takes the video file as input. After the desired object is manually selected in the first frame, a tracking method follows its position in subsequent frames, and a stream of images is created, each of which is saved with its XML file; a sketch of this script is given below. Compared to the manual method, this effectively reduced preprocessing time by around 87%.
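A minimal sketch of such an auto-annotation script is shown below. The file names, the single-object assumption, and the exact XML layout are ours, not the authors'; in recent OpenCV builds the KCF constructor may live under cv2.legacy or cv2.TrackerKCF.create():

```python
import cv2
from xml.etree import ElementTree as ET

video = cv2.VideoCapture("arrows.mp4")          # input video; name is an assumption
ok, frame = video.read()
box = cv2.selectROI("select object", frame)     # manual selection on the first frame
tracker = cv2.TrackerKCF_create()
tracker.init(frame, box)

idx = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    found, (x, y, w, h) = tracker.update(frame)
    if not found:
        continue                                # skip frames where tracking fails
    cv2.imwrite(f"frame_{idx}.jpg", frame)
    # Write a minimal Pascal-VOC-style XML with the tracked bounding box
    ann = ET.Element("annotation")
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = "arrow"
    bb = ET.SubElement(obj, "bndbox")
    for tag, v in zip(("xmin", "ymin", "xmax", "ymax"), (x, y, x + w, y + h)):
        ET.SubElement(bb, tag).text = str(int(v))
    ET.ElementTree(ann).write(f"frame_{idx}.xml")
    idx += 1
```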
A YOLOv5 model was used to train on the preprocessed dataset. In YOLO, bounding box prediction and object categorization are combined in a single end-to-end differentiable network. YOLOv5 is substantially lighter and easier to use than previous versions, and the model's benchmarks led to its selection. The parameters were optimal during training and did not need to be changed. For improved fitting, the dataset included blurred and rotated images (Fig. 4).

Fig. 4 Metrics

Tracking objects, multithreaded with detection:

Fig. 5 Detection
This part of the work focuses on multithreading the tracking and detection parts of the code. To improve the overall process, the head and the tail of the arrow are tracked and detected using parallel threads. In the first stage, the bounding box points detected by YOLOv5 are taken as input, and the object in that bounding box is tracked in sequential frames. Standalone detection is accurate, but it adds time to the process; tracking algorithms execute fast, but their precision is lower. As a result, the optimal option was to combine detection and tracking in order to achieve both speed and accuracy. The findings were tested on an Nvidia Jetson Nano (NVIDIA Maxwell architecture with 128 NVIDIA CUDA cores, 0.5 TFLOPS (FP16)). The tracking algorithm was a Kernelized Correlation Filter (KCF); this tracker's OpenCV implementation makes its integration into the code easy (Fig. 5).

Angle of an arrow on the ground: The detection model provides bounding boxes of the head and the tail of the arrow, which initialize the trackers, with a separate tracker for each of the head and the tail. Considering the tail as the origin, the angle of the arrow as the head rotates around the tail is evaluated. The top-left coordinates of the head and tail boxes are used as two points, and the angle of the arrow is evaluated by the slope-point method. This angle lies in the plane of the frame captured by the camera and ranges from 0 to 360 degrees as viewed. As we were required to align the robot to the arrow, this angle sufficed to assist alignment (Fig. 6). A combined sketch of the two threads and the angle computation is given below.
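The following skeleton illustrates the two-thread arrangement and the slope-point angle computation. The detect() wrapper, the frame source, and the on_angle callback are placeholders; the paper does not publish its code:

```python
import math
import threading
import cv2

lock = threading.Lock()
trackers = {}                 # "head"/"tail" -> KCF tracker, refreshed by detection

def detection_loop(detect, get_frame):
    """Slow thread: each YOLOv5 detection reinitialises the running trackers."""
    while True:
        frame = get_frame()
        for label, box in detect(frame):        # detect() wraps the YOLOv5 model
            t = cv2.TrackerKCF_create()
            t.init(frame, box)
            with lock:
                trackers[label] = t

def tracking_loop(get_frame, on_angle):
    """Fast thread: track head and tail every frame and emit the arrow angle."""
    while True:
        frame = get_frame()
        with lock:
            results = {k: t.update(frame) for k, t in trackers.items()}
        if all(k in results and results[k][0] for k in ("head", "tail")):
            (hx, hy, _, _), (tx, ty, _, _) = results["head"][1], results["tail"][1]
            # Slope-point method, tail as origin; 0-360 degrees in the image plane
            on_angle(math.degrees(math.atan2(hy - ty, hx - tx)) % 360)

threading.Thread(target=detection_loop, args=(detect, get_frame), daemon=True).start()
threading.Thread(target=tracking_loop, args=(get_frame, on_angle), daemon=True).start()
```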
Fig. 6 Angle of an arrow on the ground with tracking after detection
2.3 Section III

Intel RealSense D435i depth camera: The Intel RealSense D435i depth camera provides an RGB color channel and a depth channel, which extracts 3D information from the image. This depth channel is used for object classification and recognition in the scene. Motion tracking and the angle of the pot table are evaluated using the depth of the pots from the camera. The Kalman filter also estimates motion in the z-direction based on the depth of the image.

Extraction of depth from the depth frame: Python supports the pyrealsense2 module, which allows configuring the RealSense camera with the system. The color frame is extracted together with the depth frame; an object is detected in the color frame, which is mapped onto the depth frame to obtain depth information. For an accurate and reliable distance, the mean of the depths over a square of size (2h/8) × (2w/8) at the center of the object is computed, where h and w are the height and the width of the object's bounding box, respectively. This provides accurate depth, as the center of an object does not have a high depth gradient, unlike the edges of the object against the background.
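A minimal pyrealsense2 sketch of the aligned-frame and centre-patch averaging described above (the depth-scale conversion at the end is an assumption; in practice it is read from the device):

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
pipeline.start(rs.config())
align = rs.align(rs.stream.color)        # align depth frame to the RGB field of view

frames = align.process(pipeline.wait_for_frames())
depth = np.asanyarray(frames.get_depth_frame().get_data())

def object_depth(depth, box):
    """Mean depth over the central (2h/8) x (2w/8) patch of a bounding box."""
    x, y, w, h = box
    cx, cy = x + w // 2, y + h // 2
    patch = depth[cy - h // 8:cy + h // 8, cx - w // 8:cx + w // 8]
    return float(patch.mean()) * 0.001   # ~1 mm per unit is the typical depth scale
```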
3 Results
Object detection in the HSV color space, with local positional parameters, provides an accurate, reliable, and low-processing-cost solution. The angle is fairly accurate when depth is measured with the depth-sensor camera. The Kalman filter provided an estimate of depth and position when direct classification was not possible due to overlapping objects. This approach also works for localization of the robot on the field. The results of detection and training were tested on different videos with various orientations and lighting conditions on a CPU, as well as on the GPU of the Nvidia Jetson Nano. The accuracy and speed considerably increased when using the GPU CUDA cores. The detection-with-tracking code takes around 0.031 s on the Nvidia Jetson Nano to detect the head and tail in a single frame with around 85–95% accuracy, whereas standalone detection on the Nvidia Jetson Nano takes around 0.5 s per frame. The results were also evaluated on a laptop (Intel i5 6th Gen, 6 GB RAM): standalone detection took 0.75 s, whereas detection with tracking took 0.052 s. Our method reduced computational time by around 93% compared to traditional methods. This improved performance is significant for low-computation hardware such as mobile robots, edge computing, and low-end devices. In typical systems, detection and tracking are carried out on one thread, where tracking is done after detection. As detection is computationally heavier than tracking, the process takes a nonuniform time for execution. This approach is ineffective for real-time systems where continuous object detection is required. We propose an efficient approach where detection and tracking run on two separate threads. As tracking occupies a separate thread, the process executes uniformly. Whenever the detection thread detects the object, at its own pace, it reinitializes the already running tracker thread. Reinitializing is required, as the tracker may lose track under some conditions. Intel's RealSense depth camera provided accurate object depth with millimeter precision and a minimum distance of 15 cm. The depth at the edges of an object varies considerably because of the high depth gradient between the background and the object; thus, the proposed method of averaging the depth data at the center provides a fairly accurate object depth. Officially, Intel [11] states that the depth accuracy is within 2% at a depth of 2 m. Also, the fields of view of the RGB view and the depth view are different: 87° × 58° for the depth view and 69° × 42° for the RGB color view. To get accurate depth mapping, the frames must therefore be aligned using the align function provided by the pyrealsense2 library, which gives fairly accurate depth up to 4 m.
4 Conclusion
The results are quite robust in the case of color-space detection and classification, but depending on color and location, a similarly colored object in the vicinity can be misinterpreted as a target object. Also, the Kalman filter needs to track the motion of an object before estimating it, and thus cannot produce estimates for agile, high-FPS motions and frequent changes in the object's state. Detection with tracking runs smoothly over CUDA cores, as the detection frequently reinitializes the trackers and the trackers rarely lose track. However, in unstable motions, when the tracker loses track frequently, object information is lost in many frames.
5 Future Scope
The future scope of this project is to test the developed system on different hardware platforms to improve efficiency in every case. Work on making the detection and tracking process more robust is ongoing, and reducing the number of layers of the network architecture to increase the speed of detection itself is in progress.
References
1. Henriques JF, Caseiro R, Martins P, Batista J (2014) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596
2. Farkhodov K, Lee SH, Kwon KR (2020) Object tracking using CSRT tracker and RCNN. Bioimaging, pp 209–212
3. Bolme D, Beveridge J, Draper B, Lui Y (2010) Visual object tracking using adaptive correlation filters. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition
4. Gunjal P, Gunjal B, Shinde H, Vanam S, Aher S (2018) Moving object tracking using Kalman filter. In: International Conference on Advances in Communication and Computing Technology (ICACCT)
5. Liu C, Tao Y, Liang J, Li K, Chen Y (2018) Object detection based on YOLO network. In: IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC)
6. Brdjanin A, Dardagan N, Dzigal D, Akagic A (2020) Single object trackers in OpenCV: a benchmark. In: International Conference on Innovations in Intelligent Systems and Applications (INISTA)
7. Tannús J (2020) Comparison of OpenCV tracking algorithms for a post-stroke rehabilitation exergame. In: 22nd Symposium on Virtual and Augmented Reality (SVR)
8. Hatwar R, Kamble S, Thakur N, Kakde S (2018) A review on moving object detection and tracking methods in video. Int J Pure Appl Math 118(16):511–526
9. Labelbox, Online, 2021. Available: https://labelbox.com
10. Roboflow, Online, 2021. Available: https://roboflow.com
11. Intel RealSense, Online, 2021. Available: https://www.intelrealsense.com/depth-camera-d435i/
12. Barba-Guamán L, Calderon-Cordova C, Quezada-Sarmiento PA (2017) Detection of moving objects through color thresholding. In: Iberian Conference on Information Systems and Technologies (CISTI)
COVID-19 Detection Using Chest X-ray Images
Gautham Santhosh, S. Adarsh, and Lekha S. Nair
Abstract COVID-19 is an infectious respiratory disease discovered in Wuhan, China, which later turned into a pandemic. The disease is spreading at a rate higher than what the world is prepared for, and hence, there is a huge shortage of testing and resources. To overcome this situation, the artificial intelligence community has been working hard to use advanced technology to detect the presence of the novel coronavirus. In this paper, we propose an ensemble 3-class classifier model with a stochastic hill-climbing optimisation algorithm for detecting infection in chest X-ray images. The novelty of our work involves the selection of an optimal feature set from a set of handcrafted features and VGG16 features using an optimisation technique, followed by soft-voting-based ensemble classification. The proposed model achieved an overall F1-score of 0.997. Our dataset has chest X-ray images of all age groups and provides a more reliable and consistent result that can be used for the timely detection of COVID-19. Keywords VGG-16 · Ensemble · GLCM · LBP
1 Introduction
On January 30, 2020, COVID-19 was declared a Public Health Emergency of International Concern, as it took less than 30 days to spread across China. On March 11, 2020, it was declared a pandemic by the WHO, as by then it had already spread to all parts of the world. The most affected country is the United States of
America, closely followed by India, with 3,74,35,835 and 3,22,05,973 total cases, respectively. The total global death toll has crossed 43,73,091 as of August 15, 2021. Coronavirus is considered a respiratory disease with symptoms like malaise, dry cough, headache, sore throat, etc. Up to now, 7 out of 40 different mutants in the coronavirus family have been found capable of spreading among humans, showing symptoms such as the common cold. Reverse transcription-polymerase chain reaction (RT-PCR) on samples (respiratory or blood) is the main indicator of the coronavirus's presence according to the guidelines from the WHO. But this process takes a considerably long time for the detection of the virus, and it also has a lower chance of detecting the virus at all. Due to the long-term complications that arise from the rapid multiplication of the virus, there is a high chance of yielding a false-positive result [1]. It is evident from previous studies that the health workers involved in the testing phase can get infected, as they are exposed to the saliva of the infected person. As doctors and nurses have a higher chance of getting infected by the virus, RT-PCR cannot be considered a cost-effective method for the detection of the novel coronavirus. Digital radiography scans the body and helps in the diagnosis of tumours, pneumonia, lung infections, etc. Computerised tomography (CT) is an advanced form of digital radiography that provides much clearer images of organs, bones and tissues, but it is comparatively expensive and not available everywhere. This is why X-ray is used more by physicians: it is faster, easier and more affordable than CT. Each X-ray has to be manually examined by a radiologist for the presence of the virus, which is a tedious task as the number of cases increases at a rapid rate. The image quality of chest X-rays has certain drawbacks, like low contrast, blurred boundaries and overlapping organs. Hence, we need to automate this part of the process, which has also gained attention from researchers worldwide. Deep learning is a family of machine learning techniques primarily geared towards automatic feature extraction and classification of images; it is applied in medical image recognition and segmentation. The various deep learning applications developed in the last 5 years allow researchers to conduct an easy and reliable analysis of X-ray scans [2] (Fig. 1).
Fig. 1 Block diagram of the proposed system
2 Related Works
Two commonly used methods to detect the presence of the coronavirus are scanning the patient's blood for antibodies and searching for the viral RNA in nasal swabs. The former detects the presence only a few days after symptoms begin, and the latter can take a few hours. Chest radiography is one of the most frequently used imaging methods to visualise and quantify the structural and functional consequences of chest diseases [3]. Chest X-ray and CT imaging play a crucial role in the early detection and treatment of the disease caused by COVID-19 [4]. A patient affected by COVID can be easily distinguished from patients with other lung diseases with the help of chest X-rays [5]. AI methods and X-rays together can provide an accurate method for the detection of the disease [6, 7]. In addition, many attempts have been made to explore CNN-based classification models for the detection of pneumonia and the distinction between its two major types, viral and bacterial, with the aim of enabling rapid referral of children with an urgent need for intervention [8, 9]. Tumour detection has been done by feeding GLCM features extracted from mammograms, which are X-ray images, into an SVM classifier [10, 11]. DarkCovidNet [12] is a CNN model built on the DarkNet model [13] that proposed a 2-class classification (COVID-19 and no-findings) method for the detection of coronavirus using chest X-ray images; it includes fewer layers and (progressively increasing) filters than DarkNet. The article [14] focuses on the advantages of chest X-ray-based analysis and mentions that radiographic findings could be used to observe the consequences of the disease in the long run. Article [15] proposed the COVIDX-Net framework, which automatically identifies COVID-19 in X-ray images based on deep learning classifiers. In this paper, we address a multi-class prediction scenario (COVID-19, Normal and Pneumonia).
3 Methodology
This experiment is performed in two parts: first without optimising the features, and then after applying an optimisation technique to the features. This section is divided into five parts, starting with the dataset preparation, followed by the preprocessing technique. The handcrafted and deep features are extracted in Sect. 3.3 and are then used in the classifier proposed in Sect. 3.4. Finally, the optimisation of the features is explained in Sect. 3.5, and the classifier from Sect. 3.4 is used on the optimised feature set.
Fig. 2 Images from the dataset
3.1 Dataset Preparation
Dataset preparation is the first step in developing any diagnostic/prognostic tool. Data was collected from three Kaggle repositories. The first is the award-winning COVID-19 dataset created by a research team along with doctors across the globe [16, 17]; this repository contains 3616 COVID-19-positive images, 10,192 normal images and 1345 viral pneumonia images. The second dataset has CXR images of paediatric patients of age one to five from the Guangzhou Women and Children's Medical Center [18]; it contains 6432 chest X-ray images tagged as COVID-19, Pneumonia and Normal. The third dataset, in addition to the previous ones, has images tagged as viral and bacterial pneumonia. These three datasets were combined, and 3000 images in total were taken at random, consisting of 1000 normal images, 1000 COVID-positive images and 1000 pneumonia images. The dataset was split 70:30, giving 2100 training and 900 testing samples (Fig. 2).
3.2 Preprocessing
The images from the dataset were resized to 224 × 224 × 3 to make use of the ImageNet weights. In order to maintain a dataset of images with consistent contrast, the CLAHE algorithm was applied to all 3000 images to improve their contrast, which in turn improves the recognition rate.
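A minimal OpenCV sketch of this step follows; the CLAHE clip limit and tile grid are assumptions, since the paper does not report them:

```python
import cv2

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def preprocess_xray(path):
    img = cv2.resize(cv2.imread(path), (224, 224))  # 224 x 224 x 3 for ImageNet weights
    # Apply CLAHE on the luminance channel only, then convert back to BGR
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
    lab[:, :, 0] = clahe.apply(lab[:, :, 0])
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```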
3.3 Feature Extraction
The extraction of features is one of the most critical steps in the training process, as it contributes to the efficiency and accuracy of the classification. To train our model, we used the traditional handcrafted feature extraction method and combined it with deep features obtained from the VGG-16 model. Handcrafted features use statistical and textural properties to detect similar patterns; they make the actual model easier to understand and less ambiguous. We use two handcrafted feature extraction techniques here, namely Local Binary Patterns (LBP) and the Grey-Level Co-occurrence Matrix (GLCM). The LBP technique extracts texture by labelling each pixel based on thresholding its neighbouring pixels [19]; in our case, it provides two statistical features, energy and entropy. GLCM texture analysis, which statistically extracts attributes from greyscale co-occurrence matrices, is also implemented [20, 21]; it yields five features: contrast, dissimilarity, homogeneity, energy and correlation. The deep features were extracted using the VGG-16 model, pretrained on the ImageNet dataset and fine-tuned to the requirements of this experiment. The output obtained from the VGG model has dimension 1 × 25,088 for each input image. We used the PCA algorithm to reduce this high-dimensional vector to a low-dimensional vector of dimension 1 × 1000. This reduces the computational requirements, improves efficiency and also prevents the highly correlated vectors from producing a homogeneous set of features. After applying the PCA algorithm, we are left with 1000 features, which are then combined with the previously obtained 7 handcrafted features, bringing the total number of features in our model to 1007.
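The seven handcrafted features can be sketched with scikit-image as below. In older scikit-image releases the GLCM functions are spelled greycomatrix/greycoprops; the LBP radius, neighbour count, and histogram binning are assumptions:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

def handcrafted_features(gray):
    """5 GLCM properties + LBP energy and entropy = the 7 handcrafted features."""
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256, normed=True)
    feats = [graycoprops(glcm, p)[0, 0] for p in
             ("contrast", "dissimilarity", "homogeneity", "energy", "correlation")]
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=10, density=True)
    hist = hist[hist > 0]
    feats.append(float(np.sum(hist ** 2)))               # LBP energy
    feats.append(float(-np.sum(hist * np.log2(hist))))   # LBP entropy
    return np.array(feats)
```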
3.4 Classifier
For our experiment, we use three classifiers, trained on the features obtained by combining the handcrafted and deep features from the previous section. The first is a support vector machine classifier with the Radial Basis Function (RBF) kernel. The second is a decision tree classifier with Gini impurity as the criterion, a maximum depth of 3, minimum samples per leaf and minimum samples split of 2, and cost-complexity pruning of 0.1. The third is a soft-voting ensemble that combines the above two machine learning classifiers to produce an optimal result with the least possible errors on the prepared dataset: it sums the predicted probability for each label and predicts the label with the largest summed probability, which is what makes the voting "soft". The 2100 training samples are run individually through each of the three classifiers with the features obtained above.
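A minimal scikit-learn sketch of the three classifiers; the hyperparameters follow the text above, and `X_train`/`y_train` stand for the 2100 × 1007 feature matrix and labels:

```python
# Soft-voting ensemble of an RBF-kernel SVM and a pruned decision tree.
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

svm = SVC(kernel="rbf", probability=True)       # probability=True enables soft voting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              min_samples_split=2, ccp_alpha=0.1)
ensemble = VotingClassifier(estimators=[("svm", svm), ("dt", tree)], voting="soft")
ensemble.fit(X_train, y_train)                  # assumed training split from Sect. 3.1
print(ensemble.score(X_test, y_test))           # assumed 900-sample test split
```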
Fig. 3 Confusion matrices indicating the results of the ensemble classifier
3.5 Optimised Feature Selection
We optimise the feature set obtained in Sect. 3.3 to improve the overall performance of the ensemble classifier above; the optimisation also reduces the computational cost of the experiment. The feature set is optimised with a stochastic hill-climbing algorithm, which extracts the most efficient subset of the 1007 extracted features; this subset is then used to train the classifiers. The hill-climbing algorithm takes the dataset and a subset of features as input and returns an estimated model accuracy from 0 (worst) to 1 (best), so this is a maximisation problem. Each feature in the dataset is considered independently and probabilistically flipped (toggling the inclusion or exclusion of the corresponding column), with the flipping probability as a hyperparameter. Finally, the subset of features that returns the maximum accuracy for the given model is selected as the optimised feature set. The algorithm is first run with the support vector machine as the classifier, yielding a set of 469 features that maximises SVM accuracy. The process is repeated for the decision tree classifier, yielding another optimised set of 512 features. As the two sets can share features, we take the union of the selected sets, giving 748 unique features out of the initial 1007. This optimal feature set is then used to train all three classifiers from Sect. 3.4 and the results are compared (Fig. 3). A sketch of the selection procedure follows.
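An illustrative implementation of stochastic hill climbing over a binary feature mask; the flip probability, iteration count and cross-validation setup are assumed hyperparameters, not values from the paper:

```python
# Stochastic hill climbing: probabilistically flip feature columns, keep the
# candidate subset whenever cross-validated accuracy does not decrease.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def hill_climb(model, X, y, n_iter=100, p_flip=0.05, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape[1]) < 0.5                  # random initial subset
    best = cross_val_score(model, X[:, mask], y, cv=3).mean()
    for _ in range(n_iter):
        cand = mask ^ (rng.random(X.shape[1]) < p_flip)  # flip inclusion/exclusion
        if not cand.any():
            continue
        acc = cross_val_score(model, X[:, cand], y, cv=3).mean()
        if acc >= best:                                  # keep the better subset
            mask, best = cand, acc
    return mask, best

svm_mask, _ = hill_climb(SVC(kernel="rbf"), X_train, y_train)
dt_mask, _ = hill_climb(DecisionTreeClassifier(max_depth=3), X_train, y_train)
union = svm_mask | dt_mask      # union of the two optimised subsets, as in the text
```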
4 Results and Discussions
In this study, we perform the experiment in two parts. In the first part, we use the extracted features without applying any optimisation technique to them. The experiment was performed individually on the SVM classifier, the decision tree classifier and the voting-based ensemble classifier. We use the F1-score
as the evaluation metric here, which is the harmonic mean of precision and recall: precision measures how many of the positive predictions are actually positive, and recall measures how many of the actual positives were correctly predicted as positive. Testing was done on the 900 testing samples obtained in Sect. 3.1. The SVM classifier gave an overall F1-score of 0.839, the decision tree classifier an overall F1-score of 0.945, and the ensemble classifier without the optimised feature set an overall F1-score of 0.995. In the second part, to improve the performance, the proposed feature optimisation technique was run individually with respect to each classifier, and the experiment was repeated with the optimised feature set. As expected, the SVM classifier gave better results, taking the F1-score to 0.967; the decision tree classifier's results remained the same at an F1-score of 0.945; and the F1-score of the ensemble classifier with the stochastic hill-climbing optimisation algorithm was 0.997. The results from the support vector machine classifier, the decision tree classifier and the proposed ensemble classifier are given in Table 1, once without optimising the features and once with the optimised feature set, so that the two cases can be compared.

Table 1 Results with and without optimisation of the feature set (precision and recall per class)

(a) SVM classifier

| Classes   | Precision (w/o opt.) | Recall (w/o opt.) | Precision (with opt.) | Recall (with opt.) |
|-----------|----------------------|-------------------|-----------------------|--------------------|
| NORMAL    | 0.694 | 0.967 | 0.993 | 0.987 |
| PNEUMONIA | 0.965 | 0.932 | 0.956 | 0.954 |
| COVID     | 0.970 | 0.626 | 0.955 | 0.970 |
| Accuracy  | 0.840 |       | 0.967 |       |
| F1-score  | 0.839 |       | 0.967 |       |

(b) Decision tree classifier

| Classes   | Precision (w/o opt.) | Recall (w/o opt.) | Precision (with opt.) | Recall (with opt.) |
|-----------|----------------------|-------------------|-----------------------|--------------------|
| NORMAL    | 1.000 | 1.000 | 1.000 | 1.000 |
| PNEUMONIA | 0.860 | 0.997 | 0.860 | 0.997 |
| COVID     | 0.996 | 0.843 | 0.996 | 0.843 |
| Accuracy  | 0.945 |       | 0.945 |       |
| F1-score  | 0.945 |       | 0.945 |       |

(c) Proposed ensemble classifier

| Classes   | Precision (w/o opt.) | Recall (w/o opt.) | Precision (with opt.) | Recall (with opt.) |
|-----------|----------------------|-------------------|-----------------------|--------------------|
| NORMAL    | 1.000 | 0.990 | 1.000 | 0.997 |
| PNEUMONIA | 0.990 | 0.997 | 0.997 | 0.997 |
| COVID     | 0.997 | 1.000 | 0.997 | 1.000 |
| Accuracy  | 0.995 |       | 0.997 |       |
| F1-score  | 0.995 |       | 0.997 |       |
5 Conclusion
The symptoms of COVID-19 closely resemble those of pneumonia, and the difference can be revealed by various imaging tests, which provide a faster means of detecting COVID-19 and help to control the spread of the disease. Here, we detect COVID-19 using handcrafted features combined with deep features obtained from VGG-16, classifying chest X-ray images as COVID-positive, Pneumonia or Normal with a voting-based ensemble classifier. We compared the results of the individual classifiers, with and without the optimisation technique, against the proposed ensemble classifier. Our model surpassed the existing methods for COVID-19 detection that we compared against: the model from [22] achieved an overall accuracy of 94.2%, whereas our model achieved 99.7%. Our model also beats the solution proposed in [23], which uses an SVM classifier to classify chest X-ray images. With the optimised feature set, our model obtained an overall F1-score of 0.997 and can play a crucial role in the timely diagnosis of COVID-19.
References
1. (2020) Advantages and disadvantages of RT-PCR in COVID 19. European Journal of Molecular & Clinical Medicine 7(1):1174–1181
2. Zhang Z, Cui P, Zhu W (2020) Deep learning on graphs: a survey. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2020.2981333
3. Magree H, Russell F, Sa'aga R, Greenwood P, Tikoduadua L, Pryor J, Waqatakirewa L, Carapetis J, Mulholland E (2005) Chest X-ray-confirmed pneumonia in children in Fiji. Bull World Health Organ 83(6):427–433
4. Zu ZY, Jiang MD, Xu PP, Chen W, Ni QQ, Lu GM, Zhang LJ (2020) Coronavirus disease 2019 (COVID-19): a perspective from China. Radiology 296(2):E15–E25. https://doi.org/10.1148/radiol.2020200490
5. Gopal K, Varma PK (2020) Cardiac surgery during the times of COVID-19. Indian J Thorac Cardiovasc Surg 36:548–549. https://doi.org/10.1007/s12055-020-01006-y
6. Sathyadevan S, Nair RR (2015) Comparative analysis of decision tree algorithms: ID3, C4.5 and random forest. In: Jain L, Behera H, Mandal J, Mohapatra D (eds) Computational intelligence in data mining – volume 1. Smart innovation, systems and technologies, vol 31. Springer, New Delhi. https://doi.org/10.1007/978-81-322-2205-7_51
7. Ahmed K, Gouda N (2020) AI techniques and mathematical modeling to detect coronavirus. J Instit Eng (India): B 1–10. https://doi.org/10.1007/s40031-020-00514-0
8. Kermany D et al (2018) Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172(5):1122–1131
9. Rajaraman S, Candemir S, Kim I, Thoma G, Antani S (2018) Visualization and interpretation of convolutional neural network predictions in detecting pneumonia in pediatric chest radiographs. Appl Sci 8(10):1715
10. Unni A, Eg N, Vinod S, Nair LS (2018) Tumour detection in double threshold segmented mammograms using optimized GLCM features fed SVM. In: 2018 International conference on advances in computing, communications and informatics (ICACCI), pp 554–559. https://doi.org/10.1109/ICACCI.2018.8554738
11. Ancy CA, Nair LS (2018) Tumour classification in graph-cut segmented mammograms using GLCM features-fed SVM. In: Bhateja V, Coello Coello C, Satapathy S, Pattnaik P (eds) Intelligent engineering informatics. Advances in intelligent systems and computing, vol 695. Springer, Singapore. https://doi.org/10.1007/978-981-10-7566-7_21
12. Ozturk T, Talo M, Yildirim EA, Baloglu UB, Yildirim O, Rajendra U (2020) Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput Biol Med 121:103792. https://doi.org/10.1016/j.compbiomed.2020.103792
13. Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: IEEE conference on computer vision and pattern recognition (CVPR) 2017, pp 6517–6525
14. Yasin R, Gouda W (2020) Chest X-ray findings monitoring COVID-19 disease course and severity. Egypt J Radiol Nucl Med 51(1):193. https://doi.org/10.1186/s43055-020-00296-x
15. Hemdan EE, Shouman M, Karar M (2020) COVIDX-Net: a framework of deep learning classifiers to diagnose COVID-19 in X-ray images. arXiv:2003.11055
16. Chowdhury MEH et al (2020) Can AI help in screening viral and COVID-19 pneumonia? IEEE Access 8:132665–132676. https://doi.org/10.1109/ACCESS.2020.3010287
17. Rahman T, Khandakar A, Qiblawey Y, Tahir A, Kiranyaz S, Abul SB, Islam MT, Al S, Zughaier SM, Khan MS, Chowdhury MEH (2021) Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput Biol Med 132:104319. https://doi.org/10.1016/j.compbiomed.2021.104319
18. Kermany DS, Zhang K, Goldbaum M (2018) Labeled optical coherence tomography (OCT) and chest X-ray images for classification
19. Chan YH, Zeng YZ, Wu HC, Wu MC, Sun HM (2018) Effective pneumothorax detection for chest X-ray images using local binary pattern and support vector machine. J Healthc Eng 2018:2908517. https://doi.org/10.1155/2018/2908517
20. Haralick RM, Shanmugam K, Dinstein I (1973) Textural features for image classification. IEEE Trans Syst Man Cybern SMC-3(6):610–621. https://doi.org/10.1109/TSMC.1973.4309314
21. Patel V, Shah S, Trivedi H, Naik U (2020) An analysis of lung tumor classification using SVM and ANN with GLCM features
22. Khan AI, Shah JL, Bhat MM (2020) CoroNet: a deep neural network for detection and diagnosis of COVID-19 from chest X-ray images. Comput Methods Programs Biomed 196:105581. https://doi.org/10.1016/j.cmpb.2020.105581
23. Garlapati K, Kota N, Mondreti YS, Gutha P, Nair AK (2021) Detection of COVID-19 using X-ray image classification. In: 2021 5th International conference on trends in electronics and informatics (ICOEI), pp 745–750. https://doi.org/10.1109/ICOEI51242.2021.9452745
Comparative Analysis of LDA Algorithm for Low Resource Indian Languages with Its Translated English Documents D. K. Meghana , K. Kiran , Saleha Nida , T. B. Shilpa , P. Deepa Shenoy , and K. R. Venugopal
Abstract Nowadays, social media acts as an information medium that generates a huge amount of data. Across age groups, people around the globe use social media to share their thoughts and emotions in the form of tweets, comments, etc., and this information can be used as a source of data for research. It generates a humongous amount of data not only in English but also in many other languages. In this paper, we focus our research on native Indian languages, namely Kannada, Tamil, and Telugu. Latent Dirichlet Allocation (LDA) is an algorithm that can process a large set of text data for topic modeling, clustering words and topics in a document. We analyze the performance of the LDA method on native-language data. The Kannada and Tamil datasets work better with respect to coherence, while the translated English version of the Telugu dataset is optimal compared with the original Telugu data. With respect to perplexity, LDA works better on the native-language datasets. Keywords LDA · Native languages · Topic modeling
1 Introduction
Social media like Twitter, YouTube, Instagram, etc., act as information sources for trend identification. Extracting topics from text to find such trends (i.e., the motivations behind every trend) is incredibly important. Generally, high-quality annotated data is available in English, whereas non-English data (such as the native Indian languages) is difficult to get. As a result, interest has grown in several areas in summarizing web-based information, including keyword extraction and topic generation, in a non-English or target language [1].
Topic modeling identifies the concepts represented in a set of documents and determines the topics addressed by each document [2]. It clusters groups of similar words and expressions by scanning the set of documents and recognizing the word and phrase patterns within them. Traditional topic models are more effective on English data, and very little work has been done for other languages, especially low-resource languages, on extracting topics from documents. This is unfortunate, because non-English resources contain a significant amount of information that could be useful if mined properly. In this paper, we examine results for three Indian languages (Kannada, Tamil, and Telugu) on Twitter "tweets". The algorithm traditionally used to solve the topic modeling problem is Latent Dirichlet Allocation (LDA). In this paper, the LDA algorithm is used to analyze which language among Kannada, Tamil, and Telugu gives the best result for topic modeling by comparing the original tweets with the translated tweets to induce the most accurate topics.
2 Literature Review
2.1 Importance of Indian Native Languages
Native Indian languages have a history of thousands of years. In India, over 43 million people speak Kannada, 81 million use Telugu in their everyday life, and 69 million use Tamil for their linguistic communication. Several social media sites such as Facebook, Twitter, and YouTube are designed so that people can use these native languages as their interface, which generates plenty of data in these languages. However, little research has been done on these languages, leaving the generated data largely unexploited. In this paper, we focus on how these languages can help in creating services for the people who use them. For example, the popular e-commerce site Amazon also provides an interface in these languages, making it easy for people to access the various services the site provides. Our research is helpful in recommending the most relevant products to customers who use these native languages, thus creating better customer relationship management for service providers.
2.2 Twitter
Twitter, launched in 2006, is among the fastest-growing social media platforms for sharing people's opinions. More than 18.8 million people use Twitter in India.
Twitter is considered one of the major platforms for people to voice their opinions and to follow recent trends in the world. It is used by governments, corporations, and common people to give information to society. This information covers several topics, which need to be extracted from the documents.
2.3 Topic Modeling
Topic modeling is one of the major research trends in the field of Natural Language Processing. It extracts words from a document that describe that document; such a set of words is called a topic. Topic modeling operates on "words", the smallest entities indexed in a document; a "document" is an arrangement of words, and a "corpus" is a collection of documents [3, 4]. Simply put, each document in the corpus contains its own proportions of the topics discussed, according to the words contained in it [5]. Among previous works, [6] performs topic modeling on e-news articles in Punjabi using the LDA algorithm. Reference [7] explains topic modeling for a Hindi corpus through clustering of the semantic space using word2vec. Reference [8] uses LDA, LSI, and NMF on a Hindi corpus for topic modeling. In [9], LDA is applied to a Telugu corpus to find the topics in a document, but the results are not evaluated against or compared with the same tweets in English. Many other works address topic modeling in other languages, but to the best of our knowledge, topic modeling has not been applied to the Kannada language with results compared against English documents.
2.4 Latent Dirichlet Allocation (LDA)
LDA is a topic modeling algorithm that analyzes documents and extracts the topics from them; it can be used on large document collections [6]. For each document, LDA generates a list of topics that it connects, processes, clusters, and summarizes. A Dirichlet distribution is used to obtain the per-document topic distribution, and in the generative process the words of a document are allocated to different topics. In LDA, the topics, the assignment of each word to a topic in each document, and the per-document topic distributions are all hidden (latent), hence the name; the documents are the only observed objects.
3 Methodology
In this section, we explain the different phases of our approach. Before applying the LDA algorithm to the data, we need to clean and process it; Python is used for the preprocessing and the LDA implementation. The research consists of different phases: data collection, data cleaning, data preprocessing, parameter selection, and evaluation using coherence score and perplexity, as shown in Fig. 1.
3.1 Data Collection
In the first phase of the research, we studied how the LDA algorithm is implemented in Python for different languages. In many works, documents are converted to English before the LDA algorithm is applied, and we observed that not much research had been done on native Indian languages. Tweets in native languages such as Kannada, Tamil, and Telugu were downloaded from various sources. We collected around 1000 tweets of Kannada, Telugu, and Tamil movie reviews and translated them into English using Google Translate for comparison.
3.2 Data Cleaning
The tweets downloaded from Twitter contain raw data, including emoticons, punctuation, special symbols, duplicate tweets, and retweets. This information is not required for the analysis and may affect the results. Hence, the tweets are cleaned by removing user ids, noise, and duplicate tweets using Python libraries; an illustrative pass is sketched below.
Fig. 1 Stages of research
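A minimal cleaning sketch with regular expressions; the exact rules used in the paper are not listed, so these patterns (and `raw_tweets`) are assumptions:

```python
# Strip URLs, user ids, hashtags and punctuation, then drop duplicate tweets.
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", "", text)            # URLs
    text = re.sub(r"[@#]\w+", "", text)                     # user ids and hashtags
    text = re.sub(r"[^\w\s]", "", text, flags=re.UNICODE)   # punctuation/emoticons;
                                                            # \w keeps native-script letters
    return re.sub(r"\s+", " ", text).strip()

tweets = list(dict.fromkeys(clean_tweet(t) for t in raw_tweets))  # dedupe, keep order
```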
3.3 Data Pre-processing
The basic pre-processing steps include removing punctuation, tokenization, removing stop words, lower-casing, and lemmatization for English. Some of these steps are not applicable to the native languages, as they have different grammar and semantics [10, 11]. For the native languages, we used the iNLTK Python library and carried out the following pre-processing steps [9].
a. Removing punctuation: the remove punctuation() function was used to remove punctuation, special characters, HTML tags, numbers that have no effect on the meaning of the sentence, etc.
b. Tokenization: in this step, the sentences were broken down into small chunks called tokens. The tokenize() function imported from the iNLTK library breaks the raw text into words [10].
The dictionary (id2word) and the corpus are given as inputs to the LDA algorithm. For this, we used the Gensim library. Gensim creates a unique id for each word in the document, and the corpus consists of (word-id, word-frequency) pairs; for example, (0, 1) means that word 0 appears once in the document. This is used as the input by the LDA model, as sketched below.
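A sketch of building the LDA inputs, assuming iNLTK's `tokenize(text, language_code)` interface ('kn' for Kannada) and Gensim's `Dictionary`/`doc2bow` API; `tweets` is the cleaned list from the previous step:

```python
# Tokenize native-language tweets and build the id2word dictionary and corpus.
from inltk.inltk import tokenize
from gensim.corpora import Dictionary

texts = [tokenize(t, "kn") for t in tweets]        # token lists per tweet
id2word = Dictionary(texts)                        # unique id for every word
corpus = [id2word.doc2bow(doc) for doc in texts]   # (word-id, word-frequency) pairs
```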
3.4 Parameter Selection
The LDA algorithm is governed by two hyper-parameters, alpha and beta: alpha controls the topic probabilities within each document, and beta controls the topic–word distribution. It also requires parameters such as the number of topics and the number of iterations, and the performance of the LDA algorithm depends mainly on these parameters. To check the performance of LDA, we experimented with different values for them.
3.5 Experiments Using LDA
LDA is a statistical model in which unobserved groups explain why some data are similar. A "bag-of-words" approach is used, in which each document is a vector of word counts. Each topic is a probability distribution over words, and each document is a probability distribution over topics. The plate diagram of the LDA model is shown in Fig. 2 [12].
Fig. 2 Plate diagram for LDA [12]
• K—number of topics.
• N—number of words in the document.
• M—number of documents to analyze.
• α—Dirichlet-prior concentration parameter of the per-document topic distribution.
• β—Dirichlet-prior concentration parameter of the per-topic word distribution.
• φ(k)—word distribution for topic k.
• θ(i)—topic distribution for document i.
• z(i, j)—topic assignment for w(i, j).
• w(i, j)—jth word in the ith document.
• φ and θ are Dirichlet distributions; z and w are multinomials.
The generative story of LDA begins with a Dirichlet prior over topics. Each topic is a multinomial distribution over the vocabulary, parameterized by φk, and is usually visualized through its highest-probability words; a pedagogical label is often used to identify the topic. LDA can be divided into two parts: the distribution over topics and the distribution over words. LDA has two goals: I. Selecting a topic from its distribution for each document. II. Sampling a word from the distribution over the words associated with the chosen topic, repeating the process for all the words in the document. Since placing a document in a single topic requires all its words to have high probability under that topic, the second goal becomes hard. Conversely, putting very few words in each topic means many topics must be assigned to cover the document's words, making the first goal hard. LDA resolves this trade-off by finding groups of tightly co-occurring words.
3.6 Evaluation
Running the LDA algorithm gives results in terms of coherence and perplexity. The experiment is repeated for different values of the number of iterations, number of topics,
and alpha values. Coherence and perplexity values are noted for these parameters, and the results are evaluated by plotting the coherence scores and perplexity against the different parameters; a sketch of this loop follows.
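An illustrative evaluation sweep with Gensim; the specific topic counts and alpha values here are assumed placeholders, and `corpus`, `id2word`, and `texts` come from the pre-processing step:

```python
# Train LDA for each parameter setting and record c_v coherence and the
# per-word log-perplexity bound (more negative is treated as better here).
from gensim.models import LdaModel, CoherenceModel

results = []
for k in (5, 10, 15):                        # assumed numbers of topics
    for alpha in (0.1, 0.5, 0.9):            # alpha grid between 0 and 1
        lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                       alpha=alpha, iterations=100, random_state=0)
        coh = CoherenceModel(model=lda, texts=texts, dictionary=id2word,
                             coherence="c_v").get_coherence()
        ppl = lda.log_perplexity(corpus)     # held-out likelihood bound
        results.append((k, alpha, coh, ppl))
```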
4 Experiments
4.1 Data Collection and Data Preprocessing
The data for this experiment was downloaded from Twitter, where people usually give feedback about products or movies in the form of tweets. These reviews were downloaded using Indian-language keywords. The dataset consisted of at least 1000 tweets for each of the regional languages (Kannada, Tamil, and Telugu), and the same tweets were translated into English using Google Translate. After cleaning the data (removing duplicate tweets, user ids, etc.), the tweets were tokenized and all capital letters were converted to lowercase. All emoticons, HTML tags, and URLs were removed, along with the common English "stop words". For the native-language dataset, tokenization was done and all punctuation and hashtags were removed; repeated tweets and retweets were discarded. iNLTK, an NLTK-based library for Indian languages, was used for lemmatization and POS tagging.
4.2 Model Hyperparameter Selection
After data collection and pre-processing for all languages, the model parameters were selected. α and β are the two hyper-parameters used: α values were varied between 0 and 1 (0.1, 0.2, …, 0.9) and β was left at its default value. Different values for the other parameters, the number of iterations and the number of topics, were also selected.
4.3 Evaluation
After the hyperparameters were selected, experiments were run for each language to check the effectiveness of its dataset under the LDA model, and the optimal model was chosen for each experiment. The results were evaluated by plotting coherence scores and perplexity for the different languages against the selected parameters.
Coherence Scores and Perplexity
Evaluation of topic modeling measures how well words are assigned to topics; perplexity and coherence are the quantitative metrics used. Perplexity is the traditional metric for topic models and is also called "held-out likelihood". However, studies on the correlation between traditional topic-model evaluation metrics and the "coherence" of generated topics found that typical performance measures may not accurately reflect the interpretability of the top words within a topic [13, 14]. Coherence [15, 16] is another method for topic modeling evaluation; topic coherence combines a number of measures into a framework and computes a score that captures the degree of semantic similarity between the high-scoring words in a topic. Figures 3, 4, and 5 show the coherence scores and perplexity plotted against three parameters: number of iterations, number of topics, and alpha values. For an optimal model, coherence values should be positive and high, whereas perplexity values should be more negative. The graphs compare these values, for each of the considered regional languages, with the translated English tweets. From Fig. 3, we observe that the coherence scores are higher for Kannada against all three parameters; hence the model works better on the original tweets than on the translated English tweets. The perplexity values for both Kannada and English are also shown in the figure, and they are more negative for the original tweets, again indicating that the model is optimal for the original Kannada tweets. Figure 4 summarizes the optimal models for Tamil and the translated English tweets. The coherence scores are higher for Tamil against all three parameters, indicating that the model is optimal for the original Tamil
Fig. 3 Analysis of coherence and perplexity scores for Kannada and English against the parameters (number of iterations, number of topics, and alpha values)
Fig. 4 Analysis of coherence and perplexity scores for Tamil and English against the parameters (number of iterations, number of topics, and alpha values)
Fig. 5 Analysis of coherence and perplexity scores for Telugu and English against the parameters (number of iterations, number of topics, and alpha values)
tweets rather than the translated ones. We also observe that the perplexity values are more negative for the Tamil tweets; as mentioned above, the more negative the value, the more optimal the model. Thus, the model works better on the original tweets. In Fig. 5, we see that with respect to coherence the model works better on the translated tweets than on the original Telugu tweets, since the coherence scores are higher for the translated English tweets against all three parameters. However, with respect to perplexity the model is more optimal for the native language, Telugu, as the perplexity values are more negative for the Telugu dataset than for the translated English dataset.
5 Conclusions
In this paper, we performed a comparative analysis of topic modeling for low-resource Indian languages against their translated English documents. Using the LDA model, we compared tweets in the native languages with the corresponding English data. The comparison was done by visualizing coherence scores and perplexity values against different parameters (number of topics, number of iterations, and alpha values), which generated different patterns. We found that the coherence scores are higher, and the perplexity values lower, for the original Kannada and Tamil tweets than for the English tweets; for Telugu, the coherence values are higher for English while the perplexity values are lower for the Telugu dataset. We conclude that with respect to coherence the LDA model is optimal on the data in its original form for Kannada and Tamil, and on English in the case of Telugu; with respect to perplexity, all the native-language datasets are optimal. We used the LDA algorithm largely unchanged; future work can modify the algorithm for more optimal results. The work can also be extended with sentiment analysis on the generated topic models, across various cross-lingual and cross-domain platforms.
References
1. Hong L, Dom B, Gurumurthy S, Tsioutsiouliklis K (2011) A time-dependent topic model for multiple text streams. In: Proc 17th ACM SIGKDD int conf knowl discovery data mining (KDD), pp 832–840
2. Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, Valletta, Malta. ELRA, pp 45–50. http://is.muni.cz/publication/884893/en
3. Hong L, Davison BD (2010) The empirical study of topic modeling in Twitter. In: Proceedings of the first workshop on social media analytics. ACM, pp 80–88
4. Sridhar VKR (2015) Unsupervised topic modeling for short texts using distributed representations of words. In: Proceedings of NAACL-HLT, pp 192–200
5. Chang O, Gerrish S, Wang C, Boyd-Graber JL, Blei DM (2009) Reading tea leaves: how humans interpret topic models. In: Advances in neural information processing systems, pp 288–296
6. Verma A, Gahier AK (2015) Topic modeling of E-news in Punjabi. Indian J Sci Technol
7. Panigrahi SS, Panigrahi N, Paul B (2018) Modelling of topic from Hindi corpus using Word2Vec. In: Second international conference on advances in computing, control and communication technology (IAC3T)
8. Ray SK, Ahmad A, Aswani Kumar C (2019) Review and implementation of topic modeling in Hindi. Appl Artif Intell 33(11):979–1007
9. https://medium.com/nirupampratap/topic-modeling-using-lda-on-telugu-articles-a31e367ca229
10. Murthy KN (2003) Automatic categorization of Telugu news articles. Department of Computer and Information Sciences, University of Hyderabad, Hyderabad. http://languagetechnologies.uohyd.ac.in/knm-publications/il_text_cat.pdf
11. Jayashree R (2011) An analysis of sentence-level text classification for the Kannada language. In: International conference of soft computing and pattern recognition (SoCPaR), pp 147–151
12. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
13. Loper E, Bird S. The natural language toolkit. University of Pennsylvania, Philadelphia, PA, USA
14. Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on web search and data mining. ACM, pp 399–408
15. Newman D, Lau JH, Grieser K, Baldwin T (2010) Automatic evaluation of topic coherence. In: Proceedings of Human Language Technologies: the 11th annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), Los Angeles, USA, pp 100–108
16. Aletras N, Stevenson M (2013) Evaluating topic coherence using distributional semantics. In: Proceedings of the tenth international workshop on computational semantics (IWCS-10), Potsdam, Germany, pp 13–22
Text Style Transfer: A Comprehensive Study on Methodologies and Evaluation Nirali Parekh , Siddharth Trivedi , and Kriti Srivastava
Abstract Text Style Transfer (TST) rewords a sentence from one style (e.g., polite) to another (e.g., impolite) while conserving the meaning and content. This domain has attracted the attention of many researchers as it makes natural language generation (NLG) tasks more user-oriented. TST finds its applications widely in industry such as conversational bots and writing assistance tools. With the success of deep learning, a plethora of research works on style transfer based on machine learning have been proposed, developed, and tested. This systematic review presents the past work on TST clustered into categories based on machine learning and deep learning algorithms. It briefly explains the various subtasks within TST and assembles its publicly available datasets. It also summarizes the automatic and manual evaluation practices used for style transfer tasks and finally, sheds some light on current challenges and points towards promising future directions for research in the TST domain. Keywords Text style transfer · Deep learning · Natural language generation · Natural language processing · Neural networks
1 Introduction
In the domains of deep learning and artificial intelligence, style transfer has recently been a hot topic of research and development. It deals with transferring the style (or attributes) of source content into a target style. Substantial research involving state-of-the-art algorithms [7, 44] developed for image style transfer has demonstrated astounding results, and style transfer techniques have recently influenced the audio domain, establishing methods for music style transfer [4]. Such works have
proved the potential of deep learning techniques for style transfer in generating artificial style-transferred content. In the domain of text and linguistics, style is a highly subjective term that is often used interchangeably with "attribute". The style-specific characteristics of a text vary across situations, while the style-independent content is maintained. For instance, "If you have any further requirements, please do not hesitate to contact me" is used in a formal setting and can easily be paraphrased to "Let me know if you need anything else" for usage in an informal context. The goal in a Text Style Transfer (TST) problem is to generate style-controlled text in a target style while preserving the semantics and content of the source text. TST models can be formulated [12] as p(x′ | a′, x), where x is a source text with attribute value a and x′ is the style-transferred text with target attribute value a′. TST methods have developed from classic replacement- and template-based approaches to neural network-based strategies as deep learning has progressed. TST can also be formulated as a Natural Language Generation (NLG) problem, as it extends NLG techniques while manipulating the attributes of the text. A wide range of deep learning techniques are employed for TST tasks, such as adversarial learning, sequence-to-sequence learning, and reinforcement learning-based methods, which are covered in detail in Sect. 3. Experimentation on TST techniques is largely organized by subtasks within the domain; a list of some common subtasks in TST is given in Table 1.
Table 1 Subtasks within text style transfer

| Subtask    | Attribute transfer     | Example |
|------------|------------------------|---------|
| Formality  | Formal → Informal      | "Please accept our apologies for any inconvenience" → "We're sorry for the trouble" |
| Sentiment  | Positive → Negative    | "I admire my college professors a lot" → "I hate my college professors" |
| Politeness | Polite → Impolite      | "Sorry, I'm a bit busy right now" → "Leave me alone" |
| Simplicity | Complicated → Simple   | "Can I acquire assistance in deciphering this conundrum?" → "Can you help me solve this problem?" |
| Gender     | Masculine → Feminine   | "My wife went to the mall to buy a skirt" → "My husband went to the mall to buy a shirt" |
| Authorship | Shakespearean → Modern | "Hast thou slain tybalt?" → "Have you killed tybalt?" |
Table 2 Publicly available datasets for TST

| Subtask         | Dataset         | Size   | Parallel | Domain                    | Annotation |
|-----------------|-----------------|--------|----------|---------------------------|------------|
| Formality       | GYAFC           | 52 K   | ✓        | Yahoo answers (online)    | Manual     |
| Politeness      | Politeness      | 1.39 M | ✗        | Emails                    | –          |
| Gender          | Yelp            | 2.5 M  | ✗        | Restaurant reviews        | –          |
| Humor & romance | Flickr style    | 5 K    | ✓        | Image captions            | Manual     |
| Biasedness      | Wiki neutrality | 181 K  | ✓        | Wikipedia (online)        | Automatic  |
| Toxicity        | Twitter         | 58 K   | ✗        | Tweets (online)           | –          |
| Toxicity        | Reddit          | 224 K  | ✗        | Politics threads          | –          |
| Authorship      | Shakespeare     | 18 K   | ✓        | Literature, SparkNotes    | Automatic  |
| Sentiment       | Yelp            | 150 K  | ✗        | Restaurant reviews        | –          |
| Sentiment       | Amazon          | 277 K  | ✗        | Clothing reviews          | –          |
| Fluency         | SWBD            | 192 K  | ✓        | Telephonic conversations  | Manual     |
| Politics        | Political       | 540 K  | ✓        | Facebook posts            | –          |
Impactful applications of TST in NLP research, like paraphrasing, as well as commercial uses such as AI-assisted writing tools, have driven the burgeoning interest of NLP researchers in this domain. The rapid growth of TST research has produced a variety of datasets, implementation algorithms, and evaluation metrics, but at the same time the field lacks a sense of standardization. The objective of this review paper is to present an account of the various corpora and TST methodologies, in order to facilitate further research and uniformity in the field of TST. The contributions of this work are:
1. We conduct a comprehensive survey that reviews recent works on TST based on machine learning.
2. We describe the various machine learning architectures and evaluation metrics used in Text Style Transfer.
3. We provide a systematic summary of contributions and evaluation practices in Tables 3, 4, and 5.
Table 3 Summary of some previous works in TST—their methodologies, data sources, and evaluation practices—Part 1 (overall, style transfer accuracy, content preservation, and fluency are automatic evaluation metrics)

| References | ML method | Contribution | Subtask | Data | Overall | Style transfer accuracy | Content preservation | Fluency | Human evaluation |
|---|---|---|---|---|---|---|---|---|---|
| Fu et al. [6] | Adversarial learning | Multi-decoder and style embedding | Paper–news title, sentiment transfer | He and McAuley | – | LSTM sigmoid classifier | Cosine distance | – | ✓ |
| John et al. [14] | Adversarial learning | Incorporates auxiliary and adversarial objectives | Sentiment transfer | Yelp reviews, Amazon reviews | Geometric mean of STA, WO and 1/PPL | CNN classifier | Cosine similarity, unigram word overlap | Perplexity by trigram language model | ✓ |
| Lai et al. [18] | Adversarial learning | Word-level conditional architecture & two-phase training | Sentiment transfer, tense transfer | Yelp, Amazon reviews, Yelp tense | – | CNN classifier | BLEU | Perplexity by bi-directional LSTM | ✓ |
| Zhao et al. [43] | Adversarial learning | Adversarially regularized autoencoders (ARAE) | Sentiment transfer, topic transfer | SNLI corpus, Yahoo dataset, Yelp reviews | – | FastText classifier | BLEU | Perplexity | ✓ |
| Jhamtani et al. [11] | Sequence-to-sequence | Dictionaries mapping Shakespearean to modern words | Old–modern English | Shakespeare dataset | BLEU, PINC | – | – | – | ✗ |
Table 4 Summary of some previous works in TST—their methodologies, data sources, and evaluation practices—Part 2

| References | ML method | Contribution | Subtask | Data | Overall | Style transfer accuracy | Content preservation | Fluency | Human evaluation |
|---|---|---|---|---|---|---|---|---|---|
| Xu et al. [39] | Sequence-to-sequence | Translation model with a language model | Paraphrasing | Shakespeare dataset | – | Cosine similarity, language model, logistic regression | BLEU | – | ✓ |
| Carlson et al. [2] | Sequence-to-sequence | Encoder–decoder recurrent neural networks | Old–modern English | Various English bible versions | BLEU, PINC | – | – | – | ✗ |
| Li et al. [20] | Keyword replacement | Delete, retrieve, generate | Normal–romantic, sentiment transfer | Yelp review, Amazon review, captions dataset | – | LSTM-based classifier | BLEU | – | ✓ |
| Sudhakar et al. [37] | Keyword replacement | Transformer that leverages DRG framework | Sentiment transfer, gender transfer, political slant | Yelp, Amazon reviews, captions, gender, political dataset | GLEU | FastText style classifier | BLEU | Perplexity by GPT-2 | ✓ |
| Li et al. [19] | Unsupervised learning, back-translation | TGLS (text generation by learning from search) framework | Paraphrase generation, formality transfer | GYAFC | – | Classifier based on RoBERTa features | BLEU, iBLEU | Perplexity by GPT-2 | ✓ |
Table 5 Summary of some previous works in TST—their methodologies, data sources, and evaluation practices—Part 3

| References | ML method | Contribution | Subtask | Data | Overall | Style transfer accuracy | Content preservation | Fluency | Human evaluation |
|---|---|---|---|---|---|---|---|---|---|
| Xu et al. [38] | Keyword replacement, reinforcement learning | Cycled reinforcement learning method | Sentiment transfer | Amazon reviews, Yelp reviews | G-score: geometric mean of ACC and BLEU | CNN classifier | BLEU | – | ✓ |
| Jain et al. [10] | Unsupervised learning | Encoder–decoder architecture reinforced through auxiliary modules | Formality transfer | Emails, English prose essays | – | Encoder-based neural classifier | Cosine similarity | Perplexity by 4-gram back-off model using KenLM | ✓ |
| Rabinovich et al. [30] | Back-translation | Personalized SMT models in automatic translation | Gender transfer | TED talks transcripts, Europarliament corpus | – | SVM classifier | – | – | ✓ |
| Mir et al. [24] | Adversarial learning, keyword replacement | Focused on 3 models: CAAE, ARAE, DAR | Sentiment transfer | Yelp reviews | – | Earth mover's distance | BLEU, METEOR, word mover's distance | Neural logistic regression classifiers | ✓ |
| Madaan et al. [23] | Adversarial learning | Tag-and-generate: tagger extracts content & generator converts style | Politeness, normal–romantic, sentiment, gender, political | Enron Email, captions, Yelp reviews, Amazon reviews | – | LSTM classifier | BLEU, METEOR, ROUGE | – | ✓ |
The organization of this paper is as follows: Sect. 2 discusses some of the publicly available datasets; Sect. 3 explores the machine learning algorithms used in TST research; Sect. 4 presents the metrics used for evaluating TST algorithms; and finally, Sect. 5 concludes and discusses some open issues and research scope in TST.
2 Datasets
2.1 Parallel and Non-parallel Data
The datasets available for TST are broadly classified into two categories based on the training data used.
Parallel. In parallel data, texts in both the source and target styles are available, so straightforward machine translation techniques such as sequence-to-sequence can be used. The issue with parallel data is that it is not readily available for the various sub-styles, and data collection is very expensive.
Non-parallel. Costly and scarce parallel data has led researchers to utilize non-parallel data for TST. A non-parallel corpus consists of non-matching texts from the source and target styles. This category of data is readily available in various styles, and hence a large proportion of work on TST uses non-parallel datasets.
2.2 Available Benchmark Datasets for TST
Table 2 records various publicly available datasets, distinguished by subtask, size, whether they are parallel or non-parallel, and, for parallel datasets, the annotation method.
3 Methodologies
3.1 Parallel
Style transfer by means of a style-parallel corpus can be treated as a monolingual machine translation task.
Sequence to Sequence. A seq2seq model, also known as the encoder–decoder architecture, converts sequences from one domain to another. As the basis of many machine translation algorithms, seq2seq neural networks have been used in many TST works on parallel datasets: the model is trained so that the encoder's input is text in the source style and the output is the corresponding text in the target style. As shown in Fig. 1, RNN layers act as the encoder, processing the input sentence of the source style to recover
Fig. 1 Sequence-to-sequence attention-based model
the state that serves as context for the decoder; another stack of RNNs acting as the decoder then predicts the output sentence in the target style. Xu et al. [39] present some early work on the task of rephrasing text in a particular style. Jhamtani et al. [11] employ a seq2seq neural network to convert modern English to Shakespearean text, using dictionaries mapping modern and Shakespearean English within a basic encoder–decoder architecture. Carlson et al. [2] used a seq2seq model with an attention mechanism to convert prose into Bible-style text. Some later works have experimented with techniques like data augmentation along with sequence-to-sequence models [13, 25]. Nikolov et al. [25] work on text simplification from regular Wikipedia to a simpler version: they collect two pseudo-parallel corpus pairs from sources like technical articles and news websites and propose an unsupervised technique, LHA, for extracting pseudo-parallel sentences from the sources. However, the seq2seq approach requires a parallel corpus and is hence challenging due to its scarcity. A minimal sketch of the basic encoder–decoder setup follows.
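A minimal GRU encoder–decoder in PyTorch, illustrating the setup of Fig. 1 without attention; vocabulary sizes, dimensions and the random token ids are placeholders:

```python
# Teacher-forced seq2seq: the encoder's final hidden state conditions the decoder.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))       # h summarises the source-style text
        dec, _ = self.decoder(self.tgt_emb(tgt_ids), h)  # decode conditioned on h
        return self.out(dec)                             # logits over the target vocabulary

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 12))                    # dummy source-style token ids
tgt = torch.randint(0, 8000, (2, 12))                    # dummy target-style token ids
loss = nn.CrossEntropyLoss()(model(src, tgt).reshape(-1, 8000), tgt.reshape(-1))
```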
3.2 Non-parallel
Keyword Replacement. Certain keywords in a text are often indicative of the sentiment and tone underneath: words like "wow" and "excellent" have a positive meaning, while "awful" and "worst" have a negative connotation. Hence, some works [20, 37] use machine learning models to replace these words using natural language generative models. An initial work in the keyword replacement approach is the Delete–Retrieve–Generate framework [20]. First, the model identifies all the style-connotated words, like "bad" or "rude", in the source sentences (e.g., of negative sentiment). It then eliminates these words so that only content-indicative words, such as "hotel" or "shirt", are left. Next, reference texts similar to the content words are retrieved from the target-sentiment corpus (here, e.g., positive). The model then extracts the style-attributed words of the target corpus in the same way and combines them with the previously extracted content words; this combining is done with seq2seq models. A sketch of the delete step is given below.
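An illustrative version of the "delete" step from the Delete–Retrieve–Generate framework [20]: words whose relative frequency in one style corpus far exceeds the other are treated as style-attribute words. The threshold and smoothing values are assumed hyperparameters:

```python
# Identify style-connotated words by their frequency ratio between two corpora,
# then strip them from a sentence so only content words remain.
from collections import Counter

def attribute_words(pos_docs, neg_docs, threshold=5.0, smoothing=1.0):
    pos, neg = Counter(), Counter()
    for d in pos_docs:
        pos.update(d.lower().split())
    for d in neg_docs:
        neg.update(d.lower().split())
    salient = set()
    for w in set(pos) | set(neg):
        ratio = (pos[w] + smoothing) / (neg[w] + smoothing)
        if ratio > threshold or ratio < 1.0 / threshold:
            salient.add(w)                 # style-attribute word ("wow", "awful", ...)
    return salient

def delete(sentence, salient):
    # keeps only content-indicative words such as "hotel" or "shirt"
    return " ".join(w for w in sentence.split() if w.lower() not in salient)
```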
The authors of [41] use attention mechanisms with deep learning architectures for NLG tasks, as shown in Fig. 2. Recent works have explored hybrid architectures that combine keyword replacement methods with cycled reinforcement losses to iteratively transfer the style of the text while maintaining the content. Since the replaced words can be visually inspected by a human, this imparts explainability to the models: the modified parts of the text can be examined to understand the performance of such techniques.
Adversarial Learning. Another effective method for TST is adversarial learning, which separates the text's style and content information in order to transfer the text to the target style. An early work proposed by Fu et al. [6] uses an adversarial framework of two models, shown in Fig. 3. The first model consists of a single encoder and multiple decoders: the encoder representations learn the style of the input text, which in turn trains multiple decoders to decipher the representation and output text in the target style. In the second model, the encoder behaves as in the first model, while the decoder produces the target style by concatenating the encoder's embedding with a style-parameter representation. As with the previous methods, a variety of hybrid architectures have been proposed for adversarial frameworks. Recent works have studied autoencoders for TST models that use adversarial learning to refine the performance
Fig. 2 Overview of the model proposed by [41] using self-attention and keyword replacement
Fig. 3 Two adversarial learning-based models proposed by [6]
Fig. 4 Overview of the model proposed by [8] where rewards are returned to the generator
of style transfer tasks. For example, some works [3, 14, 18, 43] implemented a cycle-consistency loss, where the generated text is fed back to the model and the output sentence is compared with the original input text; the loss is thus cycled back to the model to generate more accurate transferred sentences.
Reinforcement Learning. Reinforcement learning works on the idea that reward functions, rather than loss functions, guide the decisions of deep learning models: the parameters of a reinforcement-based model change so as to maximize the estimated reward of the output style-transferred text. An attention-based model proposed by [8] uses an encoder–decoder architecture to perform TST; its generator-and-evaluator method is shown in Fig. 4. Three rewards guide the output text towards the desired style: style classifiers, semantic models, and language models provide style, semantic, and language feedback to the model, respectively. In [22], the authors propose a dual reinforcement learning framework in which one seq2seq model learns the source-to-target style mapping and a second seq2seq model learns the target-to-source parameters. This dual task provides style and reconstruction rewards, whose average is used as feedback to the model. Thus, reinforcement learning is used for style transfer without any parallel corpus.
Backtranslation. Back-translation means translating a text of the target style into the source style and mixing both the original source and the back-translated text to train the model. The authors of [30] researched factors, like gender, that are often obscured in tasks like machine translation. The back-translation method used for TST by [29] is shown in Fig. 5: their approach first rephrases the sentences and retains only the content words, thereby eliminating the style information. Using an NMT model, they translate English text into another language and then back into English. It is proposed with the understanding that the stylistic
Fig. 5 Back-translation method and style classifier used by [29]
properties are lost during the back-and-forth translation while the content is preserved. In [19], the authors propose a two-stage text generation framework: in the first stage, simulated annealing search generates pseudo-input sentences and a generator is trained via supervised learning; in the second stage, iterations of a beam search process improve the model. The authors of [42] employ a two-fold framework in which a pseudo-parallel dataset is first formed using word embeddings and latent-representation similarities, so that this unsupervised style transfer task can be handled by back-translation-based deep learning models; a style classifier is utilized to enhance the performance of the model. Like the methods using adversarial models, back-translation models also prove less efficient in style–content agreement. An illustrative round trip is sketched below.
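An illustrative round-trip translation (English → French → English) using pretrained MarianMT checkpoints from the Transformers library; any NMT pair would do, and the claim that style is washed out is the cited papers' finding, not a property guaranteed by this snippet:

```python
# Translate a batch of sentences out to a pivot language and back again.
from transformers import MarianMTModel, MarianTokenizer

def round_trip(texts,
               fwd="Helsinki-NLP/opus-mt-en-fr",
               bwd="Helsinki-NLP/opus-mt-fr-en"):
    out = texts
    for name in (fwd, bwd):
        tok = MarianTokenizer.from_pretrained(name)
        model = MarianMTModel.from_pretrained(name)
        batch = tok(out, return_tensors="pt", padding=True)
        out = tok.batch_decode(model.generate(**batch), skip_special_tokens=True)
    return out

print(round_trip(["This hotel was absolutely wonderful!"]))
```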
3.3 Unsupervised Methods
The previous models, using parallel or non-parallel data, were proposed for supervised style transfer settings. Recent works have also explored machine learning techniques in purely unsupervised settings, where no labeled style–text corpus is provided to the model; relatively few works perform TST in this way [10, 31, 35]. An initial study by Radford et al. [31] made use of the characteristics of RNNs and LSTMs, training such models on the UTF-8 bytes of the input text; this preprocessing allowed the researchers to identify and modify neuron-level embeddings of the deep learning models. The text formalization performed by Jain et al. [10] uses an unsupervised method with an unlabeled corpus: external language processing tools called scorers provide style information, the information learned by the encoder–decoder is backpropagated to the model to compute the loss, and the output scores determine the formality level of the resulting output text. Shen et al. [35] proposed adversarial autoencoders (AAE) with denoising models for mapping similar latent
representations; these models perform sentiment transfer by computing a vector from those representations. There is still much research potential in NLG-based unsupervised methods that can be extrapolated to other TST subtasks.
4 Evaluation Techniques
Measuring the true efficacy of TST models is one of the most challenging tasks in the domain. At present, there are no standard automatic evaluation metrics that are followed conventionally, and no metric yet exists that can outperform human evaluation. Since human evaluation can be onerous and expensive, there is a compelling need for appropriate automatic evaluation practices. The evaluation metrics proposed for TST mainly focus on three aspects of a model's effectiveness:
1. Style transfer strength: the ability to convert a source style to the desired target style.
2. Content preservation: the extent to which the original content is preserved.
3. Fluency (naturalness): the ability to generate fluent sentences.
A TST model should perform deftly on all three criteria; underperformance on any of them renders the model ineffective. For instance, if an algorithm transfers the positive-sentiment sentence "The teachers in this university are excellent" to the negative-sentiment sentence "The waiters in the restaurant were very rude", it does not preserve the original meaning and is therefore inadequate.
4.1 Automatic Evaluation
Style transfer strength. This criterion measures how well the style of a given text is transferred to the target style. In most past works, transfer strength is tested using pre-trained classifiers. Many previous papers [13, 22, 34] use TextCNN, a sentence-level text classifier trained over pre-trained word vectors, proposed by [16]. An LSTM classifier, first used for this task in [6], measures transfer strength in [8, 9]. Some works [5, 21] use fastText [15] for the style classification task; it matches the performance of the deep learning methods while being faster. An alternative metric proposed in [24] computes the Earth Mover's Distance [32] between the source text and the style-transferred output; it handles even non-binary classification and exhibited a higher
correlation with human evaluation than the fastText and TextCNN classifiers.
Content preservation. The metrics under this criterion measure the extent to which the original content is preserved after style transfer. The most widely used metric for content similarity is the BLEU score [26], originally proposed for evaluating machine translation. [33, 38] compute the BLEU score between the source text and the transferred text to measure content preservation in transferred sentences. In [6], the authors calculate the cosine similarity between the source and transferred sentence embeddings, leveraging the pre-trained GloVe word embeddings [27]. Another popular metric for this task is the METEOR score [1], used in [23, 36]. [24] proposes style masking or style removal, i.e., removing or masking the style-attribute words from the source and transferred sentences before applying the content preservation metrics. In [40], the authors compare 14 content similarity metrics over 2 style transfer datasets and suggest that Word Mover's Distance (WMD) [17] and the L2 distance based on ELMo [28] are the best-performing metrics for measuring content preservation in style transfer tasks.
Fluency (naturalness). For any natural language generation model to be effective, it must produce fluent, human-like text. Most commonly, as in [10, 14, 37], a language model is trained and used to compute a perplexity score (PPL), where lower scores indicate higher fluency. Although researchers have relied on perplexity scores to evaluate the fluency of TST models, [24] showed that perplexity exhibits a very low correlation with human-evaluated scores, and that adversarial neural classifiers should instead be employed for evaluating naturalness. A sketch of two of these automatic checks follows.
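An illustrative implementation of two common checks, assuming NLTK for BLEU and a pretrained GPT-2 from the Transformers library for perplexity; neither is the official metric code of any cited paper:

```python
# BLEU between source and transferred text (content preservation) and GPT-2
# perplexity of the output (fluency; lower is more fluent).
import torch
from nltk.translate.bleu_score import sentence_bleu
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def content_bleu(source: str, transferred: str) -> float:
    return sentence_bleu([source.split()], transferred.split())

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss   # mean token negative log-likelihood
    return float(torch.exp(loss))         # PPL = exp(mean NLL)

print(content_bleu("the food was great", "the food was terrible"))
print(perplexity("The waiters in the restaurant were very rude."))
```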
4.2 Human Evaluation

Many previous works [6, 8, 34, 42] have incorporated human evaluation alongside the automatic evaluation metrics discussed above. Generally, a common procedure is followed in which evaluators are asked to rate randomly selected style-transferred outputs. The rating scale varies across works, but the approach is essentially the same, with higher scores indicating better performance. The human raters can provide overall scores or score the three criteria, i.e., style transfer strength, content preservation, and fluency, separately. Human evaluation of style transfer, apart from being expensive and cumbersome, can often be subjective, depending on how the rater interprets the styles under consideration; because of this subjectiveness it is often not comparable across different methods. Notwithstanding these drawbacks, human evaluation remains extremely important in the contemporary TST landscape, alongside automatic evaluation, for studying the correlation between various automatic metrics and human judgment.
It will play an important role in the evolution of automatic metrics for TST tasks and in setting benchmarks for future comparison.
5 Discussion and Conclusion

Research in the field of TST is challenged by a few obstacles, such as the scarcity of publicly available benchmark parallel corpora and the absence of standardized automatic evaluation metrics. Currently, no automatic evaluation method surpasses human evaluation, which is expensive and cumbersome. A prospective research direction in TST is therefore the use of unsupervised machine learning methods to overcome the absence of large parallel corpora. Also, while a significant portion of existing work applies to English corpora, the potential to extend TST to other languages should not be neglected. In this paper, we have provided a comprehensive review of the existing literature and the numerous machine learning methods employed for the task. The review also discussed various benchmark datasets and evaluation practices. This paper can serve as a reference for NLP researchers and aims to provide an all-inclusive understanding of TST to facilitate and promote further research in this field.
References
1. Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. Association for Computational Linguistics, Ann Arbor, Michigan, pp 65–72
2. Carlson K, Riddell A, Rockmore D (2018) Evaluating prose style transfer with the bible. R Soc Open Sci 5(10):171920
3. Chen L, Dai S, Tao C, Shen D, Gan Z, Zhang H, Zhang Y, Carin L (2018) Adversarial text generation via feature-mover's distance. arXiv preprint arXiv:1809.06297
4. Cífka O, Şimşekli U, Richard G (2020) Groove2groove: one-shot music style transfer with supervision from synthetic data. IEEE/ACM Trans Audio Speech Lang Process 28:2638–2650
5. Dai N, Liang J, Qiu X, Huang X (2019) Style transformer: unpaired text style transfer without disentangled latent representation. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics
6. Fu Z, Tan X, Peng N, Zhao D, Yan R (2018) Style transfer in text: exploration and evaluation. In: Proceedings of the AAAI conference on artificial intelligence 32(1)
7. Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
8. Gong H, Bhat S, Wu L, Xiong J, Hwu WM (2019) Reinforcement learning based text style transfer without parallel training corpus. In: Proceedings of the 2019 conference of the North. Association for Computational Linguistics
9. Gröndahl T, Asokan N (2019) Effective writing style imitation via combinatorial paraphrasing. arXiv:1905.13464
10. Jain P, Mishra A, Azad AP, Sankaranarayanan K (2019) Unsupervised controllable text formalization. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 6554–6561
11. Jhamtani H, Gangal V, Hovy E, Nyberg E (2017) Shakespearizing modern language using copy-enriched sequence-to-sequence models. arXiv preprint arXiv:1707.01161
12. Jin D, Jin Z, Hu Z, Vechtomova O, Mihalcea R (2020) Deep learning for text style transfer: a survey. arXiv:2011.00416
13. Jin Z, Jin D, Mueller J, Matthews N, Santus E (2019) IMaT: unsupervised text attribute transfer via iterative matching and translation. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics
14. John V, Mou L, Bahuleyan H, Vechtomova O (2018) Disentangled representation learning for non-parallel text style transfer. arXiv preprint arXiv:1808.04339
15. Joulin A, Grave E, Bojanowski P, Mikolov T (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, vol 2, Short Papers. Association for Computational Linguistics, Valencia, Spain, pp 427–431
16. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics
17. Kusner M, Sun Y, Kolkin N, Weinberger K (2015) From word embeddings to document distances. In: Bach F, Blei D (eds) Proceedings of the 32nd international conference on machine learning. Proceedings of Machine Learning Research, vol 37. PMLR, Lille, France, pp 957–966
18. Lai CT, Hong YT, Chen HY, Lu CJ, Lin SD (2019) Multiple text style transfer by using word-level conditional generative adversarial network with two-phase training. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 3579–3584
19. Li J, Li Z, Mou L, Jiang X, Lyu MR, King I (2020) Unsupervised text generation by learning from search. arXiv preprint arXiv:2007.08557
20. Li J, Jia R, He H, Liang P (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437
21. Liu Y, Neubig G, Wieting J (2021) On learning text style transfer with direct rewards. In: Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics
22. Luo F, Li P, Zhou J, Yang P, Chang B, Sun X, Sui Z (2019) A dual reinforcement learning framework for unsupervised text style transfer. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence. International Joint Conferences on Artificial Intelligence Organization
23. Madaan A, Setlur A, Parekh T, Poczos B, Neubig G, Yang Y, Salakhutdinov R, Black AW, Prabhumoye S (2020) Politeness transfer: a tag and generate approach. In: Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics
24. Mir R, Felbo B, Obradovich N, Rahwan I (2019) Evaluating style transfer for text. In: Proceedings of the 2019 conference of the North. Association for Computational Linguistics
25. Nikolov NI, Hahnloser RH (2018) Large-scale hierarchical alignment for data-driven text rewriting. arXiv preprint arXiv:1810.08237
26. Papineni K, Roukos S, Ward T, Zhu WJ (2001) BLEU. In: Proceedings of the 40th annual meeting on association for computational linguistics—ACL'02. Association for Computational Linguistics
27. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics
28. Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, Volume 1 (Long Papers). Association for Computational Linguistics
29. Prabhumoye S, Tsvetkov Y, Salakhutdinov R, Black AW (2018) Style transfer through back-translation. arXiv preprint arXiv:1804.09000
30. Rabinovich E, Mirkin S, Patel RN, Specia L, Wintner S (2016) Personalized machine translation: preserving original author traits. arXiv preprint arXiv:1610.05461
31. Radford A, Jozefowicz R, Sutskever I (2017) Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444
32. Rubner Y, Tomasi C, Guibas LJ (1998) A metric for distributions with applications to image databases. In: Sixth international conference on computer vision (IEEE Cat. No. 98CH36271). IEEE, pp 59–66
33. Shang M, Li P, Fu Z, Bing L, Zhao D, Shi S, Yan R (2019) Semi-supervised text style transfer: cross projection in latent space. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics
34. Shen T, Lei T, Barzilay R, Jaakkola T (2017) Style transfer from non-parallel text by cross-alignment. In: Proceedings of the 31st international conference on neural information processing systems. NIPS'17, Curran Associates Inc., Red Hook, NY, USA, pp 6833–6844
35. Shen T, Mueller J, Barzilay R, Jaakkola T (2020) Educating text autoencoders: latent representation guidance via denoising. In: International conference on machine learning. PMLR, pp 8719–8729
36. Shetty R, Schiele B, Fritz M (2018) A4NT: author attribute anonymity by adversarial training of neural machine translation. In: 27th USENIX security symposium (USENIX Security 18). USENIX Association, Baltimore, MD, pp 1633–1650
37. Sudhakar A, Upadhyay B, Maheswaran A (2019) Transforming delete, retrieve, generate approach for controlled text style transfer. arXiv preprint arXiv:1908.09368
38. Xu J, Sun X, Zeng Q, Zhang X, Ren X, Wang H, Li W (2018) Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics
39. Xu W, Ritter A, Dolan WB, Grishman R, Cherry C (2012) Paraphrasing for style. In: Proceedings of COLING 2012, pp 2899–2914
40. Yamshchikov IP, Shibaev V, Khlebnikov N, Tikhonov A (2021) Style-transfer and paraphrase: looking for a sensible semantic similarity metric. arXiv:2004.05001
41. Zhang Y, Xu J, Yang P, Sun X (2018) Learning sentiment memories for sentiment modification without parallel data. arXiv preprint arXiv:1808.07311
42. Zhang Z, Ren S, Liu S, Wang J, Chen P, Li M, Zhou M, Chen E (2018) Style transfer as unsupervised machine translation. arXiv:1808.07894
43. Zhao J, Kim Y, Zhang K, Rush A, LeCun Y (2018) Adversarially regularized autoencoders. In: International conference on machine learning. PMLR, pp 5902–5911
44. Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE international conference on computer vision (ICCV), pp 2242–2251
Classification of Hindustani Musical Ragas Using One-Dimensional Convolutional Neural Networks Rutuparn Pawar, Shubham Gujar, Anagha Bidkar, and Yogesh Dandawate
Abstract Ragas are a melodic progression of notes used in Indian classical music. They are believed to have mental and physiological enriching qualities and are used in Raga music therapy. Identification of Ragas necessitates a great deal of expertise, since there are instances where two or more Ragas have very similar characteristics, making them difficult to identify. An accurate classifier would be an indispensable tool for Indian classical music learners and enthusiasts alike. This paper proposes a One-Dimensional Convolutional Neural Network (1D-CNN) to classify Ragas in the Hindustani variant of Indian classical music using the raw audio waveform. We compare our model with an Artificial Neural Network (ANN) trained on audio features extracted from the audio files using traditional signal processing techniques. The original dataset, generated and annotated by an expert, consists of audio files for 12 Ragas played on 4 instruments. An augmented dataset of 12,000 samples was created from the original dataset using slight pitch variation. The ANN trained on audio features and the 1D-CNN trained on raw audio achieve accuracies of 97.04% and 98.67%, respectively.

Keywords Classification · ANN · 1D-CNN · Indian classical music
1 Introduction

Raga, also known as Raag, is a structured collection of notes that lies at the foundation of Indian classical music and makes up its melodic structure. The concept of a Raga is similar to that of a melody in Western music but has more intricate features [1]. Indian classical music has evolved over the centuries and was divided into two main branches: Hindustani classical music, which is more prominent in Northern India,
and Carnatic classical music, which is more prominent in Southern India. Both branches of Indian classical music have slight variations in Ragas. Our work focuses on Hindustani Ragas played on musical instruments. There exist approximately 0.4 million Ragas in Hindustani classical music. Many Ragas reflect moods and sentiments that even the most novice listener can recognize [2]. Ragas have been demonstrated to have mental health-enriching qualities [3]. Identifying Ragas requires practice to master, since it demands composite auditory perception; a beginner therefore finds it difficult to identify Ragas. A Raga classifier can be employed to identify Ragas for beginners, to recommend music containing particular Ragas selected based on the time of day, or to create a music collection focused on a certain Raga. The following points discuss the inherent characteristics of Ragas.
• A Raga must have at least five notes and at most all seven notes.
• There is no fixed starting note for a Raga, since the notes used in a Raga are on a relative scale.
• Variations in Ragas are observed due to improvisation by musicians from different ethnic backgrounds.
• Every Raga has a primary predominant note known as the Vadi and a secondary predominant note known as the Samvadi, which partly helps in distinguishing between two Ragas.
• Each Raga has an ascending note pattern known as "Aaroh" and a descending note pattern known as "Avroh". Multiple Ragas use the same Aaroh and Avroh, making it a feature that cannot independently differentiate between Ragas [4].
• Gamakas are variations in Ragas, such as oscillations around a note. For each Raga, only certain types of variations are permitted, making Gamakas a strongly correlated characteristic for identification [5].
• A Pakad is a sequence of notes that carries the essence of the Raga [6].
Raga classification is a challenging task, since there are many cases where Ragas have the same or similar notes/structure yet produce substantially distinct musical effects owing to their inherent characteristics. Deep learning is a great medium for modeling unstructured spatial and temporal data due to its learning capacity and end-to-end training. Deep learning networks have the ability to learn intricate and rich features from unstructured data, making it easier to perform the task at hand accurately. A critical challenge in Raga identification is the unavailability of large, consistent datasets, which are essential for effective training of a deep neural network. We collected a dataset of Ragas and augmented it to create 12,000 raw audio samples, which we then used to train a 1D-CNN. We compare the 1D-CNN with an ANN trained on audio features extracted from the raw audio samples. The convolution layers in a CNN have the inherent ability to learn effective feature representations, enabling better identification of Ragas. The paper is structured as follows. Section 2 discusses previous research work in Raga classification, while Sects. 3 and 4 describe the dataset and methodology, respectively. We describe our results in Sect. 5 and conclude the paper in Sect. 6.
2 Related Work

In the past two decades, much notable research has been done in Western music on the classification of musical notes/keys/chords, identification of music genre, and more [7]. Since Ragas are an integral component of Indian classical music, their classification will help the development of Raga-related applications. An early attempt to classify the Yaman Kalyan and Bhupali Ragas was made by Pandey et al. [8] using Hidden Markov Models (HMM) and Pakad matching. An ensemble of HMMs was built by Sinith et al. [9], which uses pitch tracking followed by a Fibonacci series-based pitch distribution for inflection detection. In [10], the authors proposed a neural network with features based on the presence of notes, Aaroh, and Avroh to identify Ragas; however, multiple Ragas have very similar Aaroh and Avroh, so these are not ideal features for Raga identification. Manjabhat et al. [11] proposed a neural network for the classification of 17 Carnatic Ragas from the CompMusic dataset [5]. The ANN was trained using a feature vector of parameters extracted from 12 significant peaks in the probability density function of the pitch profile. Previous experimentation on a subset of the original dataset showed 87% and 92% accuracy for a K-Nearest Neighbor (KNN) classifier and a Support Vector Machine (SVM) classifier, respectively, trained on handcrafted features, namely onset, zero-crossing rate (ZCR), chromagram, pitch, lower energy, spectral centroid, and spectral roll-off [12]. Another experiment on a larger subset of the dataset showed that the detection accuracy of a multi-class SVM for some Ragas can be improved using hand-picked features from nth-order derivatives of the normalized audio [6]. A similar experiment was conducted by Joshi et al. [13] for two Ragas, namely Yaman and Bhairavi, from the Hindustani variant of Indian classical music. They extracted more audio features and trained KNN and SVM classifiers to get test accuracies of 96% (mean of four tests) and 95%, respectively. They also state that logistic regression performed worse than KNN and SVM. A recent study [14] on the original dataset using ensemble classifier models, namely ensemble bagged tree and ensemble subspace KNN, with fundamental frequency, spectral centroid, spectral kurtosis, and the mean of MFCC variants as features, showed accuracies of 96.32% and 95.83%, respectively, indicating that ensemble models are effective for Raga classification. Kumar et al. [15] proposed an approach using a nonlinear SVM with a linear combination of kernels resulting from a pitch class profile, which represents the pitch value distribution, and a 4-gram histogram of notes, which represents the occurrence of brief note sequences. The two kernels effectively captured the correlations in Raga audio data. The model was trained on a custom Raga dataset consisting of 4 Ragas and on the CompMusic dataset [5] consisting of 10 Ragas, showing test accuracies of 97.3% and 83.39% for the respective datasets. An improvement was achieved by adding an n-gram histogram as a time-domain feature, indicating that time-domain information is crucial for the effective classification of Ragas. In [1], Bhat et al. trained an artificial neural network, a convolutional neural network, a bidirectional long short-term memory (Bi-LSTM) RNN, and a decision tree-based ensemble model with XGBoost for the
classification of 14 widely available Ragas from the CCIA dataset, using hand-crafted features similar to [15]. They performed Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (TSNE) on their dataset to show that TSNE performs better than PCA in reducing its dimensionality. A higher accuracy was achieved by the Bi-LSTM than by the other models. The approach requires features to be extracted, which cannot represent the characteristics of Ragas to their fullest potential. To learn temporal patterns in Raga music data, Madhusudhan et al. [16] used an LSTM-RNN. They trained their network on small sequences sampled from the audio with randomized origins, after separating vocals and instruments using Mad-TwinNet. Results on the entire audio show accuracies of 88.1% and 97% for the CompMusic Carnatic dataset and its 10-Raga subset, respectively. Madhusudhan et al. [17] also trained a five-layer deep CNN. The authors treat the audio as monophonic and perform sub-sequencing similar to [16] before applying a tonic shift-based data augmentation. The CNN was trained on two datasets of Ragas, one with three Ragas in two tonics and the other with seven Ragas in eight tonics, achieving test accuracies of 77.1% and 72.8%, respectively. The accuracy improved as the subsequence size was increased, indicating that there is room for improvement by incorporating as much audio data as computationally possible. In [4], pitches extracted from the raw audio are plotted and then used to train a 2D-CNN, essentially converting the audio classification problem into an equivalent image classification problem. The pitch plot images contain information in both the time and frequency domains, which is ideal, but they miss minute low-level features that could contribute towards higher accuracy. Most of the prior methods use pitch-based techniques devised from music signal processing [7], which often miss temporal information and are highly error-prone, whilst other methods largely rely on preprocessing the audio and on handcrafted features, limiting their performance. Furthermore, feature engineering demands a substantial understanding of the subject, and the features need to be redesigned when the problem at hand changes, for instance, when more Ragas, or Ragas played on more instruments, are added to the dataset. Using larger subsequences of audio has shown improvement in accuracy [16], which indicates that providing more audio data assists in learning better features; hence, we train a 1D-CNN using the entire available audio waveform. The 1D-CNN learns rich features that help in a more accurate classification of Ragas.
3 Dataset

The dataset was generated by recording audio during live performances using a single instrument in a quiet environment. Each audio file contains an entire Raga within a duration of 20 s. Figure 1 shows the raw audio waveforms for the 12 Ragas in the dataset. The audio files were down-sampled from 44.1 kHz to a sampling rate of 16 kHz to reduce the data fed to the 1D-CNN. The audio files were also converted from stereo to mono and stored as .wav files without loss in audio quality.
Fig. 1 Raw audio waveforms of Ragas in the dataset
Down-sampling was done using the librosa [18] Python package, and conversion of stereo audio to mono was done using the pydub Python module. There might have been distortion in the audio while converting from stereo to mono if the channels happened not to be in sync. The original dataset consists of 3833 audio clips for 12 Ragas, distributed as shown in Table 1. Since the ANN and 1D-CNN are supervised learning algorithms, the Ragas were annotated by an expert in Indian classical music. Since deep learning models require a large amount of data to improve generalization, we created synthetic data from the available dataset. The data augmentation was done by slightly pitch-shifting the samples so that each Raga-instrument pair has 250 samples, resulting in a total of 12,000 samples from the original dataset (a sketch of this preprocessing appears at the end of this section). The minor pitch variation correlates with the alterations introduced by artists during performance. In addition, Schlüter et al. [19] showed that slight variations in the pitch of audio data help in better training of the model compared with other augmentation methods such as adding noise, time stretching, and many more. A musical octave in Western music consists of 12 semitones, namely C, C#, D, D#, E, F, F#, G, G#, A, A#, and B, where a pound sign denotes a sharp tone. Similarly, Indian classical music is structured into 12 half-tones, of which seven are basic tones, namely Sa, Re, Ga, Ma, Pa, Dha, and Ni, with four flat tones, Re, Ga, Dha, and Ni, and one sharp tone, Ma. An equivalent mapping of notes in Indian classical music to notes in Western music, along with the notation we use for each note, can be found in Table 2, and the basic details of the Ragas in the dataset can be found in Table 3.
Table 1 Composition of the original dataset

Raga/instrument | Flute | Santoor | Sarod | Sitar | Total
Ahir Bhairav | 89 | 64 | 70 | 56 | 279
Bageshree | 149 | 87 | 56 | 72 | 364
Bhairav | 52 | 56 | 73 | 59 | 240
Bhimplas | 124 | 109 | 123 | 73 | 429
Bihag | 81 | 63 | 143 | 71 | 358
Lalit | 117 | 57 | 56 | 83 | 313
Madhuwanti | 68 | 56 | 106 | 64 | 294
Malkauns | 115 | 110 | 60 | 56 | 341
Miya Ki Todi | 40 | 51 | 87 | 77 | 255
Puriya Kalyan | 80 | 64 | 76 | 53 | 273
Shuddha Sarang | 66 | 54 | 58 | 60 | 238
Yaman | 111 | 81 | 184 | 73 | 449
Total | 1092 | 852 | 1092 | 797 | 3833
The Ragas were played on four instruments, namely flute, santoor, sarod, and sitar. The following 12 Ragas were chosen for this research: Ahir Bhairav, Bageshree, Bhairav, Bhimplas, Bihag, Lalit, Madhuwanti, Malkauns, Miya Ki Todi, Puriya Kalyan, Shuddha Sarang, and Yaman. The Ragas were selected in such a fashion that they can be played/sung one after the other [20].
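The preprocessing and augmentation pipeline described above can be sketched as follows, using librosa and pydub as the paper does; the file names and pitch-shift amounts are our own illustrative assumptions, not values stated by the authors.

# Sketch of the preprocessing and pitch-shift augmentation described in
# Sect. 3. File names and shift amounts are illustrative assumptions.
import librosa
import soundfile as sf
from pydub import AudioSegment

# Stereo-to-mono conversion with pydub, saved losslessly as .wav.
AudioSegment.from_wav("raga_clip.wav").set_channels(1).export(
    "raga_clip_mono.wav", format="wav")

# Load at 16 kHz; librosa resamples from the original 44.1 kHz on load.
y, sr = librosa.load("raga_clip_mono.wav", sr=16000)

# Slight pitch variations (fractions of a semitone) to create new samples.
for i, steps in enumerate([-0.5, -0.25, 0.25, 0.5]):
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
    sf.write(f"raga_clip_shift_{i}.wav", shifted, sr)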
4 Methodology

4.1 Artificial Neural Network (ANN)

We trained an Artificial Neural Network (ANN) as a baseline for comparison. Training an ANN directly on raw audio is not feasible due to computation constraints; hence, we pass audio features [21], namely Zero-Crossing Rate (ZCR), spectral centroid, spectral bandwidth, spectral roll-off, chromagram, and Mel frequency cepstral coefficients (MFCC), to the ANN while training. The rate at which the sign of an audio waveform changes is known as the Zero-Crossing Rate (ZCR). The spectral centroid is the frequency at which most of a signal's energy is concentrated. Spectral roll-off denotes the frequency below which a specified percentage of the spectrum's magnitude lies. The chroma vector is a spectral energy representation created by binning short-time DFT coefficients into 12 bins. MFCCs represent the structure of the audio waveform; they are commonly used handcrafted features in automatic speech recognition, but they are also useful for music audio [22]. The features were extracted using the librosa Python library [18] and saved in a CSV file. We trained a fully connected ANN consisting of three hidden layers, with ReLU as the activation function for all layers except the output layer, which uses the softmax activation function. The input layer has 36 neurons to accept the 36-element z-score normalized feature vector, while the output layer contains 12 neurons to represent the 12 Ragas in our dataset. The first, second, and third hidden layers have 256, 128, and 64 neurons, respectively. Dropout layers with a dropout rate of 20% were used to prevent the model from overfitting. We used the Adam optimizer for training, along with the sparse categorical cross-entropy loss. Figure 2b illustrates the network architecture of the ANN model; a sketch of the feature extraction and network follows below.
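The following is a minimal sketch of this baseline. The exact composition of the 36-element feature vector is not spelled out in the paper, so the breakdown below (4 spectral means + 12 chroma means + 20 MFCC means = 36) is one plausible assumption; the network itself follows the stated 256/128/64 architecture with 20% dropout.

# Sketch of the feature extraction and baseline ANN described above.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

def extract_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    feats = [
        librosa.feature.zero_crossing_rate(y).mean(),
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        librosa.feature.spectral_bandwidth(y=y, sr=sr).mean(),
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),
    ]
    feats += librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1).tolist()
    feats += librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1).tolist()
    return np.asarray(feats)  # 36 values, to be z-score normalized

def build_ann(num_features: int = 36, num_classes: int = 12) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(num_features,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model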
Table 2 Mapping of notes in Indian classical music to Western music

Octave | Basic swara (note; Komal = Flat, Teevra = Sharp) | Equivalent notation in Western music | Notation used by us in Table 3
Lower | Ni | B | Ni
Middle | Sa | C | Sa
Middle | Re (Komal) | C#, Db | RE
Middle | Re | D | Re
Middle | Ga (Komal) | D#, Eb | GA
Middle | Ga | E | Ga
Middle | Ma | F | Ma
Middle | Ma (Teevra) | F#, Gb | MA'
Middle | Pa | G | Pa
Middle | Dha (Komal) | G#, Ab | DHA
Middle | Dha | A | Dha
Middle | Ni (Komal) | A#, Bb | NI
Middle | Ni | B | Ni
Higher | Sa | C | Sa"
Table 3 Basic details of the Ragas in our dataset

Raga name | Aaroh (ascending order of swara) | Avroh (descending order of swara) | Komal–Teevra swara (flat–sharp tone) | Missing notes in Raga | Recommended time of performance
Bhairav | Sa re Ga Ma Pa dha Ni Sa" | Sa" Ni dha Pa Ma Ga re Sa | Re Dha komal | Re Dha vibrating, Ga-Ma-re swara combination | 6 am
Ahir Bhairav | Sa re Ga Ma Pa Dha ni Sa" | Sa" ni Dha Pa Ma Ga re Sa | Re Ni komal | – | 6 am to 9 am
Miya Ki Todi | Sa re ga Ma! Pa dha Ni Sa" | Sa" Ni dha Pa Ma! ga re Sa | Re Ga Dha komal and Teevra Ma | – | 9 am to 12 pm
Shuddha Sarang | Ni Sa Re Ma! Pa Ni Sa" | Sa" Ni Dha Pa Ma! Pa Ma Re Sa Ni Sa | Both Ma (Ma and Ma!) | Ga | 12 pm
Bhimplas | ni Sa ga Ma Pa ni Sa" | Sa" ni Dha Pa Ma ga Re Sa, Pa ni Sa | Ga Ni komal | Re and Dha in Aaroh | 12 pm to 3 pm
Madhuwanti | Ni Sa ga Ma! Pa Ni Sa" | Sa" Ni Dha Pa Ma! ga Ma! ga Re Sa Ni Sa | Ga komal and Ma Teevra | Re and Dha in Aaroh | 3 pm to 4 pm
Puriya Kalyan | Ni re Ga Ma! Dha Ni re" Sa" | Sa" Ni Dha Pa Ma! Ga re Sa | Re komal and Teevra Ma | Pa in Aaroh | 6 pm
Yaman | Ni Re Ga Ma! Dha Ni Sa" | Sa" Ni Dha Pa Ma! Ga Re Sa | Teevra Ma | – | 6 pm to 9 pm
Bihag | Sa Ga Ma Pa Ni Sa" | Sa" Ni Dha Pa Ma!, Ga Ma Ga, Re Sa | Ma Shuddha and Teevra | Re Dha in Aaroh | 9 pm to 12 am
Bageshree | Sa ga Ma Dha ni Sa" | Sa" ni Dha Ma Pa Dha Ma ga Re Sa | Ga Ni komal | Re, Pa in Aaroh | 12 am
Malkauns | Sa ga Ma dha ni Sa" | Sa" ni dha Ma ga Sa | Ga Dha Ni komal | Re Pa | 12 am to 3 am
Lalit | Ni re Ga Ma! dha Ni Sa" | Sa" Ni dha re, Ma Ga re Sa | Re Dha komal; both Ma (Shuddha and Teevra) | Pa | 3 am to 6 am
Fig. 2 a Illustration of the 1D-CNN network architecture. b Illustration of the ANN network architecture. c Train and validation accuracy versus epoch plot for 1D-CNN. d Confusion matrix for 1D-CNN
4.2 1D Convolutional Neural Network (1D-CNN)

Two-dimensional convolutional neural networks have shown exceptional feature learning capabilities in image classification tasks. A one-dimensional Convolutional Neural Network (1D-CNN) is highly effective for the classification of one-dimensional time series data [23–25]. We pass raw audio data into a 1D-CNN, which is more likely to learn better features than handcrafted ones. 1D-CNNs have shown great performance in tasks such as automatic speech recognition, electrocardiogram monitoring, and structural damage detection for infrastructure and machine parts such as bearings [26]. The advantage of utilizing an end-to-end 1D-CNN for audio classification is that it can learn directly from raw data, eliminating the need for domain knowledge to perform feature engineering manually. A 1D-CNN will learn features that are more significant for the identification of Ragas. We created a TensorFlow [27] audio data pipeline to efficiently pass the raw audio data to the 1D-CNN model for training. For faster convergence, the starting weights of the 1D-CNN are generated via Glorot initialization [28]. The first layer plays an important role when using raw audio as input [29]: it needs a receptive field equivalent to the sampling rate, since the receptive field determines the region of the input space from which the features are derived. We observed that omitting dropout layers for regularization and fully connected layers coerces the model
to learn, while deeper CNN networks show improvement in accuracy, in line with the observations in [23, 24]. Since the network expects a fixed input size, a 20 s audio clip must be fed in for inference. If an audio clip is shorter than 20 s, zeros can be padded; zero padding is a valid approach since CNNs exhibit translational invariance, i.e., they can find the features of a particular Raga irrespective of the order in which they appear [17]. If an audio clip is longer than 20 s, it is broken down into 20 s chunks, which may overlap; all of the chunks are fed to the 1D-CNN for inference, and the final prediction is made using max voting. The model consists of six 1D convolutional layers, each with ReLU as the activation function. The first five convolutional layers are each followed by max-pooling with a pool size of 4 and a stride of 4 to aggregate the features of the preceding convolution layer. Instead of the typical flatten layer, we use global average pooling to aggregate the output of the last convolutional layer, since we do not use fully connected layers, in order to coerce the model to learn. Finally, a 12-neuron output layer with a softmax activation function gives the model's prediction. To provide a good initial receptive field, we use a kernel size of 160 for the first convolution layer and, to maintain the effective receptive field through the model, we increase the number of filters as we perform more convolutions on the audio data. A moderate kernel size of 9 was used for all convolutional layers except the first, since a large kernel size results in less interpretation of the data. The network was trained using the Adam variant of stochastic gradient descent, with sparse categorical cross-entropy as the loss function. Figure 2a illustrates the network architecture of the 1D-CNN model; a minimal sketch follows below.
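The Keras sketch below follows that description: six Conv1D/ReLU layers (kernel size 160 for the first, 9 thereafter), max-pooling (pool 4, stride 4) after the first five, global average pooling, and a 12-way softmax head. The filter counts, which the paper does not state, are assumptions.

# Sketch of the described 1D-CNN; filter counts are assumed, since the
# paper specifies only that they increase through the network.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_1d_cnn(input_len: int = 20 * 16000,
                 num_classes: int = 12) -> tf.keras.Model:
    filters = [16, 32, 64, 128, 256, 512]  # assumed increasing progression
    model = models.Sequential()
    model.add(layers.Input(shape=(input_len, 1)))
    for i, f in enumerate(filters):
        kernel = 160 if i == 0 else 9  # wide first-layer receptive field
        model.add(layers.Conv1D(f, kernel, activation="relu", padding="same"))
        if i < 5:  # the first five conv layers are followed by max-pooling
            model.add(layers.MaxPooling1D(pool_size=4, strides=4))
    model.add(layers.GlobalAveragePooling1D())  # replaces flatten + dense
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model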
5 Experimentation and Results

We trained an ANN and a novel 1D-CNN using handcrafted audio features and the raw audio waveform, respectively. The dataset was divided into train and test sets with a ratio of 80:20 in a random fashion. We also created a validation set to ensure that the models do not overfit while training. Accuracy is an appropriate evaluation metric since the dataset is well balanced. A single evaluation, however, is not enough to judge the models, since neural networks are stochastic; hence, we performed five trials with different train and test sets, yielding mean accuracies of 97.04% and 98.67% for the ANN and 1D-CNN, respectively. Figure 2c, d show the accuracy versus epoch plot during training and the confusion matrix for the predictions of the 1D-CNN, respectively. Table 4 shows the class-wise evaluation metrics. The 1D-CNN performed slightly better than the ANN, KNN, SVM, and Bi-LSTM, since it was able to capture temporal features from raw audio more effectively and map them to individual Ragas. Furthermore, the 1D-CNN classifier was able to classify multiple Ragas in the dataset with better accuracy despite the fact that some of them have very similar musical structures. In line with the results of [16], which indicate that larger subsequences result in better accuracy, our 1D-CNN model performed better because we pass it the entire audio clip.
Table 4 Class-wise evaluation metrics for 1D-CNN

Raga | Precision | Recall | F1-score | Support
0—Ahir Bhairav | 0.95 | 0.99 | 0.97 | 213
1—Bageshree | 0.97 | 1 | 0.98 | 221
2—Bhairav | 0.99 | 0.96 | 0.98 | 190
3—Bhimplas | 0.99 | 0.98 | 0.99 | 195
4—Bihag | 0.97 | 0.98 | 0.97 | 195
5—Lalit | 0.98 | 0.96 | 0.97 | 216
6—Madhuwanti | 0.99 | 0.95 | 0.97 | 174
7—Malkauns | 0.99 | 0.99 | 0.99 | 207
8—Miya Ki Todi | 0.98 | 0.99 | 0.98 | 209
9—Puriya Kalyan | 0.96 | 0.98 | 0.97 | 179
10—Shuddha Sarang | 0.98 | 0.96 | 0.97 | 202
11—Yaman | 0.97 | 0.97 | 0.97 | 199
Accuracy | | | 0.98 | 2400
Macro-average | 0.98 | 0.98 | 0.98 | 2400
Weighted average | 0.98 | 0.98 | 0.98 | 2400
6 Conclusion

An improvement in accuracy was attained using the 1D-CNN for the classification of Ragas, indicating that it can learn a rich feature representation of the raw audio data. Though a 1D-CNN requires more computation than an ANN, it is capable of learning intricate features which help in the identification of Ragas, making it an ideal classifier for a dataset with a large number of similar Ragas. The classifier will assist music information retrieval (MIR) systems as well as music recommendation systems for Indian classical music. Further research directions for Raga classification would involve using a Bi-LSTM model on the entire raw audio and using ensembles of classifiers for more accurate classification. A much more interesting research direction would be to build Raga-based generative networks for Indian classical music.

Acknowledgements We express our gratitude to Mr. Deepak Desai, a sitarist and music expert, for sharing his knowledge in music and his efforts in annotating the dataset.
References
1. Bhat A, Vijaya Krishna A, Acharya S (2020) Analytical comparison of classification models for Raga identification in Carnatic classical instrumental polyphonic audio. SN Comput Sci 1(6):1–9
2. Balkwill L-L, Thompson WF (1999) A cross-cultural investigation of the perception of emotion in music: psychophysical and cultural cues. Music Percept 17(1):43–64
3. Valla JM, Alappatt JA, Mathur A, Singh NC (2017) Music and emotion—a case for north Indian classical music. Front Psychol 8:2115
4. Anand A (2019) Raga identification using convolutional neural network. In: 2019 second international conference on advanced computational and communication paradigms (ICACCP). IEEE, pp 1–6
5. Computational models for the discovery of the world's music. https://compmusic.upf.edu/datasets. Last accessed 8 July 2021
6. Bidkar AA, Deshpande RS, Dandawate YH (2018) A novel approach for selection of features for North Indian classical raga recognition of instrumental music. In: 2018 international conference on advances in communication and computing technology (ICACCT). IEEE, pp 499–503
7. Muller M, Ellis DPW, Klapuri A, Richard G (2011) Signal processing for music analysis. IEEE J Sel Top Signal Process 5(6):1088–1110
8. Pandey G, Mishra C, Ipe P (2003) TANSEN: a system for automatic Raga identification. In: IICAI, pp 1350–1363
9. Sinith MS, Tripathi S, Murthy KVV (2020) Raga recognition using Fibonacci series based pitch distribution in Indian classical music. Appl Acoust 167:107381
10. Shetty S, Achary KK (2009) Raga mining of Indian music by extracting arohana-avarohana pattern. Int J Recent Trends Eng 1(1):362
11. Samsekai Manjabhat S, Koolagudi SG, Rao KS, Ramteke PB (2017) Raga and tonic identification in Carnatic music. J New Music Res 46(3):229–245
12. Kumari P, Dandawate Y, Bidkar A (2018) Raga analysis and classification of instrumental music. In: International conference on advances in communication and computing technology (ICACCT)
13. Joshi D, Pareek J, Ambatkar P (2021) Indian classical Raga identification using machine learning
14. Bidkar AA, Deshpande RS, Dandawate YH (2021) A North Indian Raga recognition using ensemble classifier. Int J Electr Eng Technol (IJEET) 12(6):251–258
15. Kumar V, Pandya H, Jawahar CV (2014) Identifying Ragas in Indian music. In: 2014 22nd international conference on pattern recognition. IEEE, pp 767–772
16. Madhusudhan ST, Chowdhary G (2019) DeepSRGM: sequence classification and ranking in Indian classical music with deep learning. In: 20th international society for music information retrieval conference, ISMIR 2019. International Society for Music Information Retrieval, pp 533–540
17. Madhusudhan ST, Chowdhary G (2018) Tonic independent Raag classification in Indian classical music
18. McFee B, Raffel C, Liang D, Ellis DPW, McVicar M, Battenberg E, Nieto O (2015) librosa: audio and music signal analysis in python. In: Proceedings of the 14th python in science conference, vol 8, pp 18–25
19. Schlüter J, Grill T (2015) Exploring data augmentation for improved singing voice detection with neural networks. In: ISMIR, pp 121–126
20. Rajopddhye V (2002) Sangeet Shastra. Gandharv Mahavidyalaya Publication
21. Giannakopoulos T, Pikrakis A (2014) Introduction to audio analysis: a MATLAB® approach. Academic Press
22. Logan B (2000) Mel frequency cepstral coefficients for music modeling. In: ISMIR, vol 270, pp 1–11
23. Dai W, Dai C, Qu S, Li J, Das S (2017) Very deep convolutional neural networks for raw waveforms. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 421–425
24. Wang Z, Yan W, Oates T (2017) Time series classification from scratch with deep neural networks: a strong baseline. In: 2017 international joint conference on neural networks (IJCNN). IEEE, pp 1578–1585
25. Tang W, Long G, Liu L, Zhou T, Jiang J, Blumenstein M (2020) Rethinking 1D-CNN for time series classification: a stronger baseline. arXiv preprint arXiv:2002.10061
26. Kiranyaz S, Avci O, Abdeljaber O, Ince T, Gabbouj M, Inman DJ (2021) 1D convolutional neural networks and applications: a survey. Mech Syst Signal Process 151:107398
27. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M et al (2016) TensorFlow: a system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp 265–283
28. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, pp 249–256
29. Ravanelli M, Bengio Y (2018) Speaker recognition from raw waveform with SincNet. In: 2018 IEEE spoken language technology workshop (SLT). IEEE, pp 1021–1028
30. Koduri GK, Gulati S, Rao P (2011) A survey of raaga recognition techniques and improvements to the state-of-the-art. Sound Music Comput 38:39–41
W-Tree: A Concept Correlation Tree for Data Analysis and Annotations Prakash Hegade, Kishor Rao, Utkarsh Koppikar, Maltesh Kulkarni, and Jinesh Nagda
Abstract As human beings develop and study new topics to understand and conceptualize their surroundings, the need for records and documentation arises. The internet has in recent times proven to be one of the significant contributors to this philosophy. The abundance of data presented to users online naturally prompts W-questions; the what, when, who, which, where, and other W-questions stand as the inspiration to build a W-Tree. W-Tree is a record book that stores this large amount of data and provides the additional functionality of linking related topics and studying the relationships between them. By providing the topics related to a specific concept, W-Tree aims to let users learn and understand the topic in connection and correlation with other concepts. Knowing all the topics related to the one in focus, and its origin, helps the user understand both the specifics of that topic and the domain in general. W-Tree also provides the user with annotations for each article in a three-tuple format. The paper presents the analysis of W-Tree over five prominent domains using Wikipedia data.

Keywords Annotations · Data correlation · W-Tree · Wikipedia
1 Introduction

From a social standpoint, hierarchies govern the world [1]. Class structure and stratification have been at the center of evolution, leading to the organization and formation of categories. The concept of narrowing the scope with each subsequent step makes space for everyone and everything, and is of exceptional importance in comprehending any data [2]. Biological taxonomy, the postal address system, and computer networks all function in hierarchies. Technology and data have grown in tandem, and this drastic change in availability only generates more data due to the flexibility of conducting extensive research reinforced by technological resources. With 2.5 quintillion bytes of data generated
every day [3], the need to name newly developed phenomena, and in turn categorize them, becomes crucial. To provide context to this hierarchy, annotating the data helps deliver the content efficiently with the help of the said hierarchy and other data representation choices. The nature of the annotations, and the best way to compress data without losing the essence of its context, is an essential concern, along with mapping the overall hierarchy built over time, with each concept stitched into this tapestry of data [4]. Considering the prospective multi-relational nature of the data [5], boiling it down to dyads or triads in the form of a tree is the most challenging area of this model. Certain limitations are drawn and rules are enforced to keep the output comprehensible and straightforward. This multifaceted data brings in graph theory [6] and sociomatrices to ease the transition from textual data to effective representation. Knowledge systems have limited applications beyond being used as a reference to build ontologies [7, 8]. While helpful in that regard, W-Tree is a comprehensive representation of otherwise wordy content, making it accessible to everyone. Its generalized approach, combined with annotations, suits the urgency with which data is consumed today. A leaf-to-root data-consumption path is provided for the user's convenience. The Wikipedia database is the perfect match for the model we intend to achieve [9]. The five principal root articles of Wikipedia, namely Philosophy, Mathematics, Science, Logic, and Biography, are used as points of origin to demonstrate the W-Tree. This paper is further divided into the following sections. Section 2 presents the literature survey, Sect. 3 presents our W-Tree model design and deliberations, Sect. 4 presents the results and discussion, and Sect. 5 presents the conclusion.
2 Literature Survey

The design and representation of data have instigated relationships and unified views [10]. Data, when perceived in association with other data, gains meaning and depth, also adding correlation. Data has been mined to infer hidden knowledge and relations [11], and the mining techniques have been extended to several domains [12]. Mining methods, applications, and tools have been explored and researched [13]. Along with the mining techniques that have provided channels to explore data using design techniques and methodologies, the data itself has grown exponentially both on and off the web [14]. New data content and improved web service interfaces for the contemporary web have been discussed [15]. The growth has been witnessed in the data, tools, techniques, designs, and all co-domains directly or indirectly related to data. The web's evolution with linked data and its progress into a global data space have provided a pathway for a global information space [16]. A decentralized vision for linked data has been discussed [17]. Systems have been designed to help novice users work with linked data [18]. Semi-structured and structured data on the web have been analyzed [19], extraction methods for these data have been discussed [20], and semi-structured data models and implementation issues have been worked on [21].
The research in data and information quality has seen significant strides over time [22]. Data visualization has paved a pathway to new data insights and demonstrations [23]. Data analysis and interpretation methods have evolved; artificial intelligence hypes and failures have been reflected upon [24]. Strategic interpretations of existing data have been discussed [25]. Massive data sets and the frontiers of computational feasibility have been discussed for decades [26]. Data still houses several challenges to be solved, even after the advent of big data and machine learning [27]. Many data repositories now host data for knowledge, analysis, presentation, awareness, and experimental purposes. One such repository is Wikipedia, which has been our area of interest. With growing data needs and perspectives, Wikipedia serves as a good source of information. Efforts have been made to make social sciences more scientific [28]. The structured data on Wikipedia has been analyzed [29]. Wikipedia has been explored via many dimensions and domains; it has been measured [30]. Models have been designed to link with Wikipedia [31]. The risks have been compiled [32] and meanings have been mined [33]. The articles present in the system have been evaluated with models for quality assessment [34, 35]. Semantic Wikipedia has also been a viewpoint of discussion [36]. The system has further been explored for implementing numerous applications and models. The pages have been cautiously deliberated to develop a natural question answering system [37]. Strategic analyses for Industry 4.0 have been made using the data [38]. Wikipedia has also been explored for semantic relations: an approach for measuring semantic similarity between Wikipedia concepts has been discussed [39], and WordNet and Wikipedia have been combined to calculate semantic similarity [40]. Even with all these efforts, the web, data, tools, and Wikipedia deliver an opportunity to find evocative relationships across the data and build a hierarchy of the related data concepts and structures. In this paper, we use the Wikipedia databases to generate the W-Tree. The "W" here stands for the why, who, what, which, where, and other W-questions, attempting to realize the data in systematic graph and tree presentations. While W-Tree could also mean Wiki-Tree, it also presents a modest model of annotations to perceive data systematically for a machine-readable semantic web.
3 W-Tree Model

The thirst for deriving insights from Wikipedia, now regarded as a corpus for knowledge extraction, has led to work on properties within it such as semantic relatedness and structural disambiguation of the database, to name a few. The same reasons motivate us to solve this problem and create something that helps other users use Wikipedia data efficiently. This section presents our model with its design and deliberations.
3.1 Design Principles

The design principles for W-Tree are as follows:
• To represent the data as a concept tree for operational efficiency.
• To establish relationships between the data concepts.
• To build annotations for the analyzed data.
3.2 Data Set

The aim was to have a data set sufficient to provide the relationships relevant to the project's needs. Five prominent domains were selected to establish the concept relation: Philosophy, Mathematics, Science, Logic, and Biography. Most articles on Wikipedia have one of these as their root article. The project's initial phase aims to build a principled theoretical foundation and derive principles from the analysis; we hence considered Wikipedia to be an appropriate dataset for the purpose. The functions and methods applied are generic and do not constrain themselves to the specifics of the data under consideration. They can thus be used on any data and are not limited to Wikipedia, opening many avenues for the idea to grow and adapt, eventually becoming a resourceful and helpful utility. Around 50–100 further links are crawled from the seed URL of each of the five considered domains for the model analysis.
3.3 Model Design

A query on the dataset results in a concept tree and an annotation. The overall system model can be seen in Fig. 1. The articles on Wikipedia are usually long, and most users prefer a summary or cliff notes. A module for annotation is designed to reduce an extensive amount of information to a few lines while maintaining the strength of the data. The major challenge in annotating articles on Wikipedia is the versatility of the articles on the site: the topics cover a wide range, and a generic method of annotating an article might work perfectly for one topic while looking improper for others. The annotations give a brief and concise summary of the article for the user's benefit. Considering all this, a model has been proposed, as shown in Fig. 2a. An annotation comprises three parts: an introduction to the topic, similar articles, and the category of the article. This is derived from the content of the given article and the data present in its reference articles. The subsections of the article also contribute to the annotation. The Wikipedia data from one of its special pages, called "What Links Here", is the source of incoming links to the article. These, along with the textual summary from info-boxes, help build the annotations. Lists of the links present on the page are also maintained. The dictionary that contains the list of links becomes the input
Fig. 1 W-Tree system model showing the flow of data
Fig. 2 Flow diagrams: a the graph and tree generation module; b the annotation module
for the model that generates graphs and trees based on the relationships between the articles. The sub-module provides the relationship between two articles and gives a sense of the strength of that relationship. The associations can be used to draw a path between any related root articles. The flow diagram for the tree and graph generator module can be seen in Fig. 2b.
3.4 Algorithms

The gathered data is checked for similarity, and the strength of each relationship is measured. A graph of relatedness is produced, and a spanning tree is generated from its adjacency matrix. This section presents the algorithms for the graph and tree representations.

Algorithm Match(node1, node2)
// Input: 2 topics
// Output: Boolean
// Description: Tests whether two topics are strongly related
link1 ← node1.links()
link2 ← node2.links()
matches ← link1 ∩ link2
if matches.size ≥ threshold then
    return true
else
    return false
Algorithm Graph(source)
node ← Wikipedia.page(source)
while node ≠ NULL do
    node1, node2 ← node.Adjacent()
    if Match(node1, node2) = true then
        node1.link ← node2
return node
Algorithm Tree(source)
node ← Wikipedia.page(source)
tree ← empty tree
while node ≠ NULL do
    for i ← 1 to node.links().size do
        if Match(node, node.links[i]) = true then
            tree.link ← node.links[i]
return tree
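As a rough Python rendering of Match, under the assumption that the links of each article are available in the dictionary built in Sect. 4.2 and using the threshold of 5 common links applied later in Sect. 4.3:

# Rough Python rendering of the Match algorithm: two articles are related
# when they share at least `threshold` links. `links` is the article-to-links
# dictionary built in Sect. 4.2; the threshold of 5 follows Sect. 4.3.
def match(links: dict, a: str, b: str, threshold: int = 5) -> bool:
    common = set(links.get(a, [])) & set(links.get(b, []))
    return len(common) >= threshold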
4 Results and Discussion

The designed system traverses from the root article given as input to its links and produces a reached list. In the next phase, this list of articles is scraped for data using the customized functions to retrieve the inbound links and the textual summary. The links help build an adjacency matrix, from which a graph is drawn. The graph visualizes the data, taking the articles as nodes, with edges representing relationships between them. The adjacency matrix is subjected to a minimum spanning tree algorithm to generate a matrix that is free of cycles while covering all the nodes. This matrix is drawn to generate the tree structure of the dataset. For each of the articles/nodes present in the graph, a three-line annotation is generated, giving the nearest domain to which the article belongs, a one-line summary/definition of the article, and the list of strongly related articles. A detailed description of each of the modules is presented in the following sub-sections. The code was implemented using Python and Python libraries.
4.1 Scraping the Articles

Using a Python library called wikipedia, the traversal is initiated with the selected root article. In the case of ambiguity in the name, we go with the nearest article; for example, there is no article named exactly Science, so we use Philosophy of Science. Around 50–100 articles are extracted from the root article by selecting the first link in the current article and traversing to the next. The following is an example of the list obtained for the Philosophy domain: ['Greek', 'existence', 'reason', 'knowledge', 'values', 'mind', 'language', 'Pythagoras', 'Philosophical methods', 'questioning', 'critical discussion', 'rational argument', 'philosopher', 'Ancient Greek', 'Aristotle', 'natural philosophy', 'astronomy', 'medicine', 'physics', 'Newton', 'Mathematical Principles of Natural Philosophy', 'psychology', 'sociology', 'linguistics', 'economics', 'metaphysics', 'existence', 'reality', 'epistemology', 'knowledge', 'belief', 'ethics', 'moral value', 'logic', 'rules of inference', 'conclusions', 'true', 'premises', 'philosophy of science', 'political philosophy', 'philosophy of language', 'philosophy of mind', 'Academic Skeptic', 'Cicero', 'chemistry', 'cosmology', 'social sciences', 'value theory', 'mathematics'].
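A minimal sketch of this crawl with the wikipedia package is shown below; note that `page.links` is returned alphabetically by the API, so following `links[0]` only approximates the first-in-article traversal the system performs.

# Sketch of the crawl using the `wikipedia` package. page.links is
# alphabetized by the API, so this approximates rather than reproduces
# the first-in-article traversal described above.
import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

def crawl(root: str, limit: int = 50) -> list:
    reached, title = [], root
    while len(reached) < limit:
        try:
            page = wikipedia.page(title, auto_suggest=False)
        except (DisambiguationError, PageError):
            break
        reached.append(page.title)
        if not page.links:
            break
        title = page.links[0]
    return reached

# articles = crawl("Philosophy")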
Fig. 3 Article names (left) from the Philosophy domain and their connectivity graph (right)
4.2 Building a Data Dictionary The discovered articles in the previous module are parsed further using the BeautifulSoup API. The special page of each article called “What Links Here” is scraped with the name-space component as the article. The number of links is limited to 50 for ease of processing, and each link can have some sub-links along with it. The article name and the links directing it to the specified article are stored in a Python dictionary, the key being the article’s name and value being a list of links.
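A sketch of this scrape is shown below; the Special:WhatLinksHere URL and its limit/namespace query parameters are standard MediaWiki, but the container element id and the helper names are our assumptions about the page markup, not the authors' code:

import requests
from bs4 import BeautifulSoup

def inbound_links(article, limit=50):
    # "What Links Here" special page, restricted to the article namespace (0)
    url = "https://en.wikipedia.org/wiki/Special:WhatLinksHere/" + article
    html = requests.get(url, params={"limit": limit, "namespace": 0}).text
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find(id="mw-whatlinkshere-list")  # assumed container id
    if listing is None:
        return []
    return [a.get_text() for a in listing.find_all("a")][:limit]

reached_articles = ["Philosophy", "Epistemology"]  # e.g., output of the scraping module
data_dictionary = {name: inbound_links(name) for name in reached_articles}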
4.3 Graph Generation

A graph is generated by taking each of the 20 articles as nodes and creating an edge between them if there are 5 or more links common between 2 articles. For each of the five considered domains, a 20 × 20 matrix was generated, with the value set to 1 if an edge existed and 0 otherwise. The matrix notation was used because it handles co-relation queries related to business use cases efficiently. A graph is plotted using the matrix, as seen in Fig. 3. In the graph, the number of links common between any two articles acts as the edge weight (not shown in the figure) and is used in the next module. A minimum spanning tree algorithm is applied on the adjacency matrix, and another matrix is generated, including all the nodes and removing the cycles. A tree representation is shown in Fig. 4. The difference between the graph and the tree is that, by removing all the cycles, the multi-relational model is reduced to dyadic or triadic relationships.
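The spanning-tree step can be sketched with SciPy; this is a sketch, not the authors' implementation, and inverting the common-link weights is our assumption so that the minimum spanning tree keeps the strongest relationships:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# common_links: assumed 20 x 20 matrix, common_links[i][j] = number of links
# shared by articles i and j (0 if fewer than the threshold of 5)
common_links = np.zeros((20, 20))
weights = np.array(common_links, dtype=float)
inverted = np.where(weights > 0, 1.0 / weights, 0.0)
mst = minimum_spanning_tree(inverted).toarray()
tree_adjacency = (mst > 0).astype(int)  # cycle-free matrix covering the connected nodes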
4.4 Aggregating Five Domains The discovered articles in the previous module were parsed further using the BeautifulSoup API. Another attempt was made to check the relationship between all the
Fig. 4 Tree representation of the connectivity graph in Fig. 3
Fig. 5 Graph (left) and tree (right) representation of articles in all the five domains
100 articles extracted to see the relationship across the 5 clusters chosen. The graph and tree connection is shown below in Fig. 5. As evident in the figure, most concepts in all five domains are closely related. Some concepts stand as isolated nodes in the graph and the tree connecting only to the root node.
4.5 Annotations

Every article discovered initially is represented using a 3-tuple annotation. The first element relates to the topic. The second presents the sub-section of the article classification, with the sub-section closest to the article added. The third presents topics similar to the current topic; this may include user tags of similar articles to read further, to comprehensively understand all the correlated topics.
Fig. 6 Annotation module results taking Albert Einstein as the article in focus
By providing these annotations, the user has an answer to the following questions: “What is the topic?”, “Where does it come from?”, and “What are the topics related to it?” An example can be seen in Fig. 6.
4.6 Overall Analysis

The tree representation and annotations together speak volumes about an article under consideration. Dividing the data into clusters, as Wikipedia itself does, helped in understanding the relationships between the clusters and the articles that act as stepping stones for transitioning from one cluster to another. These clusters are shown in Fig. 7. The overlap between these distinctions, quite fundamental in some cases, can be noted from the figure. For example, the relationship between Philosophy and Science can be seen in the article Ambiguity. Again, drawing a parallel between the Philosophy and Science clusters, a strong relationship is established between Benjamin Franklin and Animism, a connection that can only be understood with proper research on the belief and on Franklin's life story. While the graph surfaces the more concealed, non-obvious relationships, the annotations are straightforward. They give us three tags: the cluster the article belongs to, three closely related articles (for further exploration), and the article's first line to provide a general idea of what it is about.
4.7 Time and Space Complexity Analysis See Table 1.
Fig. 7 Inter-relation between the five domains chosen and the articles common between them
Table 1 Algorithms overview

Algorithm | Time complexity | Space complexity
Graph building (depth-first search) | O(n + m) | O(n²)
Minimum spanning tree building (Prim's algorithm) | O(n²) | O(n²)
Our algorithm | O(n²) | O(n²)

n = number of nodes (articles) in the graph; m = number of edges (relations) in the graph
5 Conclusion

The research leverages the plethora of anchor links present in each article of a Wikipedia page and takes directions that are largely unexplored. A strength metric of the relationship between two articles can be modeled, and a path can be drawn out that reveals more about the specific topic. The main objective of this project was to organize Wikipedia data in a structured format. Even though we have taken significant steps toward grouping similar topics and proper categorization, the chosen categories are limited in number. Efforts can be made to run the modules on much larger samples to obtain a knowledge system that is much more robust than the current model. With a larger sample size, a larger set of categories will be needed to divide the articles into. These changes would make the system broadly operable and functional.
References

1. Cooley A (2011) Logics of hierarchy. Cornell University Press
2. Siponen M, Klaavuniemi T (2019) Narrowing the theory's or study's scope may increase practical relevance. In: Proceedings of the annual Hawaii international conference on system sciences. University of Hawai'i at Manoa
3. Data Never Sleeps 5.0. https://www.socialmediatoday.com/news/how-much-data-is-generated-every-minute-infographic-1/525692/
4. Goldberg D et al (1992) Using collaborative filtering to weave an information tapestry. Commun ACM 35(12):61–70
5. Spyropoulou E, De Bie T (2011) Interesting multi-relational patterns. In: 2011 IEEE 11th international conference on data mining. IEEE; Chen PP (1976) The entity-relationship model—toward a unified view of data. ACM Trans Database Syst (TODS) 1(1):9–36
6. Corbett D (2004) Interoperability of ontologies using conceptual graph theory. In: International conference on conceptual structures. Springer, Berlin, Heidelberg
7. Li F, Liao L, Zhang L, Zhu X, Zhang B, Wang Z (2020) An efficient approach for measuring semantic similarity combining WordNet and Wikipedia. IEEE Access 8:184318–184338
8. Liu D, Gong Y, Fu J, Yan Y, Chen J, Jiang D, Lv J, Duan N (2020) Rikinet: reading Wikipedia pages for natural question answering. arXiv preprint arXiv:2004.14560
9. Yao J, Zerida N (2007) Rare patterns to improve path-based clustering of Wikipedia articles. In: Pre-proceedings of the initiative for the evaluation of XML retrieval, pp 224–231
10. Tan PN, Steinbach M, Kumar V (2016) Introduction to data mining. Pearson Education India
11. Padhy N, Mishra D, Panigrahi R (2012) The survey of data mining applications and feature scope. arXiv preprint arXiv:1211.5723
12. Lei-da Chen TS, Frolick MN (2000) Data mining methods, applications, and tools. Inf Syst Manag 17(1):67–68
13. Abiteboul S, Manolescu I, Rigaux P, Rousset MC, Senellart P (2011) Web data management. Cambridge University Press
14. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L (2021) PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res 49(D1):D1388–D1395
15. Heath T, Bizer C (2011) Linked data: evolving the web into a global data space. Synth Lect Semant Web: Theory Technol 1(1):1–36
16. Polleres A, Kamdar MR, Fernández JD, Tudorache T, Musen MA (2020) A more decentralized vision for linked data. Semant Web
17. Oh S, Yoo S, Kim Y, Song J, Park S (2021) Implementation of a system that helps novice users work with linked data. Electronics
18. Atzeni P, Mecca G, Merialdo P (1997) Semistructured and structured data in the web: going back and forth. ACM SIGMOD Rec 26(4):16–23
19. Arasu A, Garcia-Molina H (2003) Extracting structured data from web pages. In: Proceedings of the 2003 ACM SIGMOD international conference on management of data, 9 June 2003, pp 337–348
20. Dickson MS, Asagba PO (2020) The semi-structured data model and implementation issues for semi-structured data. Int J Innov Sustain 3:47–51
21. Shankaranarayanan G, Blake R (2017) From content to context: the evolution and growth of data quality research. J Data Inf Qual (JDIQ) 8(2):1–28
22. Aparicio M, Costa CJ (2015) Data visualization. Commun Des Q Rev 3(1):7–11
23. Slota SC, Fleischmann KR, Greenberg S, Verma N, Cummings B, Li L, Shenefiel C (2020) Good systems, bad data?: interpretations of AI hype and failures. Proc Assoc Inf Sci Technol 57(1):e275
24. Eliaz K, Spiegler R, Thysen HC (2021) Strategic interpretations. J Econ Theory 192:105192
25. Wegman EJ (1995) Huge data sets and the frontiers of computational feasibility. J Comput Graph Stat 4(4):281–295
26. Bhadani AK, Jothimani D (2016) Big data: challenges, opportunities, and realities. In: Effective Big Data management and opportunities for implementation. IGI Global, pp 1–24
27. Quan-Hoang V, Anh-Vinh L, Viet-Phuong L, Phuong-Hanh H, Manh-Toan H (2020) Making social sciences more scientific: literature review by structured data. MethodsX 7:100818
28. Moreira J, Neto EC, Barbosa L (2021) Analysis of structured data on Wikipedia. Int J Metadata Semant Ontol 15(1):71–86
29. Voss J (2005) Measuring Wikipedia
30. Milne D, Witten IH (2008) Learning to link with Wikipedia. In: Proceedings of the 17th ACM conference on information and knowledge management, 26 Oct 2008, pp 509–518
31. Denning P, Horning J, Parnas D, Weinstein L (2005) Wikipedia risks. Commun ACM 48(12):152
32. Medelyan O, Milne D, Legg C, Witten IH (2009) Mining meaning from Wikipedia. Int J Hum Comput Stud 67(9):716–754
33. Hu M, Lim EP, Sun A, Lauw HW, Vuong BQ (2007) Measuring article quality in Wikipedia: models and evaluation. In: Proceedings of the sixteenth ACM conference on information and knowledge management, 6 Nov 2007, pp 243–252
34. Wilkinson DM, Huberman BA (2007) Cooperation and quality in Wikipedia. In: Proceedings of the 2007 international symposium on Wikis, 21 Oct 2007, pp 157–164
35. Völkel M, Krötzsch M, Vrandecic D, Haller H, Studer R (2006) Semantic Wikipedia. In: Proceedings of the 15th international conference on World Wide Web, 23 May 2006, pp 585–594
36. Bonaccorsi A, Chiarello F, Fantoni G, Kammering H (2020) Emerging technologies and industrial leadership. A Wikipedia-based strategic analysis of Industry 4.0. Expert Syst Appl 160:113645
37. Hussain MJ, Wasti SH, Huang G, Wei L, Jiang Y, Tang Y (2020) An approach for measuring semantic similarity between Wikipedia concepts using multiple inheritances. Inf Process Manage 57(3):102188
38. Shibaki Y, Nagata M, Yamamoto K (2010) Constructing large-scale person ontology from Wikipedia. In: Proceedings of the 2nd workshop on The People's Web meets NLP: collaboratively constructed semantic resources
39. Krötzsch M, Thost V (2016) Ontologies for knowledge graphs: breaking the rules. In: International semantic web conference. Springer, Cham
40. Jiang Y et al (2017) Wikipedia-based information content and semantic similarity computation. Inf Process Manage 53(1):248–265
Crawl Smart: A Domain-Specific Crawler Prakash Hegade, Ruturaj Chitragar, Raghavendra Kulkarni, Praveen Naik, and A. S. Sanath
Abstract The internet, used by billions of people, consists of an estimated one billion websites and is explored by individuals with diverse needs and intents. The search engines that answer internet users' queries evaluate websites on numerous parameters to sort the links from most to least relevant. It has become a pick-and-shovel task to extract the most relevant information for a given concept or user query. Classical crawlers that use traditional crawling techniques pull irrelevant data along with the relevant data, resulting in ineffective use of CPU time, memory, and resources. This paper proposes a knowledge-aware crawling system, Crawl Smart, which learns from its own crawling experiences and improves the crawling process in future crawls. The project's key focus is a methodology that deploys a unique data structure to overcome the challenges of maintaining visited pages and finding relations between the crawled pages once they are in the knowledge base, which helps the crawler preserve focus. The data structure design, annotations, similarity measures, and knowledge base supporting Crawl Smart are detailed in the paper. The paper presents results comparing the knowledge-aware crawler with a traditional crawler, assuring better results when used on large-scale data.

Keywords Annotations · Crawler · Domain-specific · Knowledge-aware · Similarity
1 Introduction

The internet has grown to an extent where any information the user is looking for is readily available, and the challenge is to extract only the most relevant information about a given concept. Be it a search engine or a user, the relevant data needs to be parsed and extracted from the humongous web. Web crawlers, used mostly by search engines, are software programs that discover and index websites [1]. Along with
establishing relationships between the web pages, crawlers are used to collect and process information that can be used to classify web documents and provide insights into the collected data [2]. The process of crawling and parsing also leads to the extraction of irrelevant information from the web. For example, classical crawlers adopt a breadth-first mechanism [3], i.e., searching all the links of a parent link; they extract both relevant and irrelevant data from the web and hence waste CPU time, memory, and resources. To resolve these challenges, topic-specific crawlers, also known as focused crawlers [4], were introduced. These are better than classical crawlers at producing accurate data for a given concept. A user who wants to crawl a website first needs an entry point. In the early days, users had to submit a website to search engines to tell them it was online. Now, users can quickly build links to their website and make it visible to the search engines on the web. A crawler works with a list of initialized links called seed URLs. These are passed on to a fetcher that extracts the content of the web page of each URL, which is then moved on to a link extractor that parses the HTML and extracts all the links present in it. These links are passed to a store processor, which stores them, and a page filter that sends all the interesting links to a URL-seen module. The URL-seen module checks whether the URL has already been seen before in this crawling process; if not, it is sent back to the fetcher. This iterative process continues until the fetcher gets no links left to fetch [5]. The retrieval of information is design-dependent and does not always result in styles that adapt to the workflow. A smart crawler, or knowledge-aware crawler, on the other hand, is context-dependent. The majority of re-crawling techniques assume the availability of unlimited resources and zero operating cost. In reality, the resources and budget are limited, and it is impossible to crawl every source at every point of time. This brings up the idea of providing the crawler with a mechanism with which it can store its experience/knowledge about web pages and use it when needed. We aim to develop and deploy a knowledge-aware crawler focusing on the context needed by the user. This is done by integrating various modules, from maintaining crawled data in an appropriate structure to finding similarities and building a knowledge base. We then contrast it with traditional crawlers to quantify the gains in future crawls. This paper is further divided into the following sections. Section 2 presents the literature survey; Sect. 3 presents our Crawl Smart model design and deliberations; Sect. 4 presents the results and discussion; and Sect. 5 presents the conclusion.
2 Literature Survey

Internet documents contain the latest and most relevant content, textual as well as multimedia (images, videos, audio, etc.), which is essential for constructing an encyclopedia. With the help of such a dynamic encyclopedia, we can easily crawl through the web pages from the search engine results and find what is needed without
the user interaction of filtering and combining data from individual search results. In data science, mining [6] can be the key to finding relevant keywords from millions of web pages without having to read everything. Irrespective of the industry, annotation tools are the key to automatically indexing data, synthesizing text, or creating a tag cloud using the most representative keywords [7]. With automatic annotation extraction, we can analyze as much data as we need. We could manually read the whole text and define key terms, but this takes a long time [8]. Automating this task allows us to focus on other parts of the project. Extraction of annotations can automate workflows, such as adding tags to web content, saving us a lot of time. It also provides actionable, data-driven insights to help make more informed decisions while crawling [9]. Manual procedures to extract statistics from textual information may be challenging for giant tasks where resources are limited, as they usually are. Computer-assisted techniques appear to be a promising alternative, letting researchers complete specific tasks at great speed. Every manual, automatic, or semi-automatic technique for analyzing textual information has its own set of benefits and costs that fluctuate depending on the task at hand [10]. Over time, crawlers based on numerous criteria, such as parallel and topical web crawlers, have been designed [11–13]. A general evaluation framework for topical crawlers has been discussed [14]. Measures to improve the performance of focused web crawlers have been deliberated [15]. Crawlers used in developing search engines have been surveyed [16]. Studies have been made on different types of web crawlers [17]. The behaviors of web crawlers have been modeled [18]. Advanced web crawlers have been discussed [19]. Crawlers have also been designed based on inference and on contextual inference [20, 21]. Finding helpful information on an extensively distributed internet network requires an effective search strategy. The spread and dynamic nature of web resources pose a significant challenge for search engines in keeping their web content index up to date, because they must search the internet periodically [22]. There are many technical challenges faced in designing a crawler [23]. Some of them are avoiding frequent visits to the same link, avoiding redundant downloading of web pages and thus utilizing network bandwidth efficiently, avoiding bot traps, maintaining the freshness of the database, bypassing login pages and captchas, crawling non-textual content of HTML pages, etc. Following a broad literature survey, we address the following challenges through this project work: avoiding multiple visits to the same link, avoiding redundant downloading of web pages, and deciding whether to crawl a page before opening it.
3 Crawl Smart Model

This section presents the design and deliberations of Crawl Smart. Our objective is to design and develop a knowledge-aware crawler that learns the web over time by storing the context of the web pages it visits in the form of knowledge and then using it to improve crawling in the future.
3.1 Design Principles

The design principles of the crawler are:
• To design a data structure for contemporary crawler challenges,
• To annotate crawled data,
• To establish relationships among the crawled data, and
• To build a knowledge system with parsed data.
3.2 Model Design The system model is presented in Fig. 1. The crawler first looks for the query tag given as input, in the knowledge base. If the tag is mapped to a link, that corresponding link is used to start crawling. Else,
Fig. 1 Crawl smart system model
Fig. 2 Data structures to store crawled links: Trie and Hash Table
a random link belonging to the domain of the input query tag is chosen from the pool of seed links, and crawling is initialized. For every crawled link, the crawler first checks if there are any tags associated with the link in the knowledge base. This step avoids annotating links that were already visited in the previous crawling. If tags are available, it extracts the tags and uses them for the similarity metric. Else, it extracts text from the link and annotates it. Here, a copy of the tags is inserted into the knowledge base for future reference. Upon evaluating the similarity metric, if it is above the threshold, then the link is considered for the next iteration; otherwise, the link is discarded.
3.3 Data Design The data structure designed for this work has two major components and is represented in Fig. 2. Trie. It is a rooted tree, where a Parent Node is linked to one or more nodes, called the Child Nodes. The root node has no parent node, and the Leaf nodes have no child nodes. Apart from these properties of a tree, every child node of a Trie has a unique integer ID concerning its parent. Hash Table. The Hash Table is a map, mapping the node value entering the Trie to a unique key generated by the hash function. The size of the Hash Table is kept dynamic.
3.4 Abstract Data Type Representation The Abstract Data Type (ADT) Representation of the data structure can be described as shown in Fig. 3.
Fig. 3 Abstract data type representation
The objective of this work is to design a data structure that can efficiently store the relationship between the crawled data. Now, the web can be seen as a hierarchy of tens of millions of web pages, resembling the Trie data structure, making it suitable to store crawled data along with their relationships. Also, by the concept of hashing, we can make sure that no link is inserted more than once in the trie. The two components of the data structure interact with each other to uniquely identify a particular value placed in the Trie using a hash key. The hash function, which generates a unique hash key for every input, is recursively defined as:

Algorithm Hash(node)
//Input: node
//Output: hash value of the node
//Description: computes a hash key for every input node
if node = root then
    return string(0)
X ← number of children of node's parent
return Hash(parent) + delimiter + string(X + 1)
Algorithm getNode(key, delimiter)
//Input: key, delimiter
//Output: node
//Description: searches for a node with the given key
node ← root
IDs ← split the key by the delimiter into a list
drop the first element of the IDs list
for id in IDs do
    node ← id-th child of node
    if node = null then
        exit for loop
return node
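A minimal Python sketch of the Trie/Hash Table pair is given below; the class and method names are ours, and "/" is assumed as the delimiter:

class TrieNode:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent
        self.children = []

    def hash_key(self, delimiter="/"):
        # Root hashes to "0"; a child appends its 1-based position under its parent
        if self.parent is None:
            return "0"
        position = self.parent.children.index(self) + 1
        return self.parent.hash_key(delimiter) + delimiter + str(position)

class CrawlTrie:
    def __init__(self):
        self.root = TrieNode("root")
        self.table = {}    # hash key -> node
        self.seen = set()  # values already inserted, to reject duplicate links

    def insert(self, parent, value):
        if value in self.seen:
            return None  # no link is inserted more than once
        node = TrieNode(value, parent)
        parent.children.append(node)
        self.seen.add(value)
        self.table[node.hash_key()] = node
        return node

    def get_node(self, key, delimiter="/"):
        node = self.root
        for part in key.split(delimiter)[1:]:  # drop the root's own ID
            index = int(part) - 1
            if index >= len(node.children):
                return None
            node = node.children[index]
        return node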
Theorem 1 The Hash(node) function generates a unique hash value for every node.

Proof Every node in the Trie must be either a root node or a child node, and not both at a time. Now consider two nodes x and y from a Trie with at least two nodes.

Case 1 x and y are directly connected.
If x is a parent of y, then hash(y) = hash(x) + delimiter + someID ⇒ hash(y) ≠ hash(x).
If y is a parent of x, then hash(x) = hash(y) + delimiter + someID ⇒ hash(x) ≠ hash(y).
Thus, hash(x) ≠ hash(y).

Case 2 x and y are siblings, i.e., they have a common parent.
Let z be the parent of x and y. Now,
hash(x) = hash(z) + delimiter + ID1
hash(y) = hash(z) + delimiter + ID2
Since ID1 ≠ ID2, hash(x) ≠ hash(y).

Case 3 x and y are not related to each other.
Let x1 be the parent of x and y1 be the parent of y.
– If x1 and y1 are directly connected, then hash(x1) ≠ hash(y1), by Case 1. Since hash(x) = hash(x1) + delimiter + ID1 and hash(y) = hash(y1) + delimiter + ID2, hash(x) ≠ hash(y).
– If x1 and y1 are siblings, i.e., if they have a common parent, then hash(x1) ≠ hash(y1), by Case 2, and hence hash(x) ≠ hash(y).
– If x1 and y1 are not related to each other, then hash(x1) ≠ hash(y1), the recursion terminating in Case 1 or Case 2. Since
hash(x) = hash(x1) + delimiter + ID1 and hash(y) = hash(y1) + delimiter + ID2, it follows that hash(x) ≠ hash(y).
Thus, hash(x) ≠ hash(y) for all distinct x and y, i.e., the Hash(node) function generates a unique hash value for every node.
3.5 Annotations

Data annotation is similar to tagging, which allows users to organize information by combining it with tags or keywords. We annotate the textual content of a web page to come up with metadata that briefly explains the context of the web page. When the same web page is encountered again, these annotations help us know the context of the page without parsing the entire page once again. Since more than 80% of the textual data from the web is unstructured, i.e., not categorized in a predetermined way, it is extremely difficult to analyze and process. Hence, the crawler extracts keywords from the web to analyze the context of the web page more effectively. Our annotation module can be seen in Fig. 4. The model uses the Term Frequency–Inverse Document Frequency (TF-IDF) model that comes from language modeling theory. TF-IDF is presented in Eq. 1:

    W_ij = t_ij × log(N / df_ij)    (1)

Here, t_ij is the term frequency score, df_ij is the document frequency score, N is the total number of documents (sentences, in our module), and W_ij is the weight of a tag in the text. Although a tag can have multiple words, and there are resources in modeling theory to analyze them, they are heavily time- and resource-consuming and hence out of the scope of this work.
Fig. 4 Annotation module
Algorithm getAnnotations(text, N)
//Input: text, N
//Output: dictionary
//Description: dictionary of the top N tags from text, mapped to their score
refinedWords ← getTotalWords(text)
sentences ← getTotalSentence(text)
tfScore ← calculateTF(refinedWords, length(refinedWords))
idfScore ← calculateIDF(refinedWords, sentences)
for word in tfScore do
    if idfScore[word] = 0 then
        tfIdfScore[word] ← 0
    else
        tfIdfScore[word] ← tfScore[word] × idfScore[word] × 10
return tfIdfScore
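A runnable sketch of this scoring follows, treating each sentence as a document for the IDF term (our reading of Eq. 1); the ×10 scaling and the top-N cut follow the pseudocode above:

import math
import re
from collections import Counter

def get_annotations(text, n):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    tf = Counter(words)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for s in sentences if word in s.lower())
        if df == 0:
            continue
        # Eq. 1: W = tf * log(N / df), scaled by 10 as in getAnnotations
        scores[word] = (count / len(words)) * math.log(len(sentences) / df) * 10
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)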
3.6 Similarity Module

The similarity module is presented in Fig. 5. The major components of this module are WordNet, spaCy, and the knowledge base; the first two are derived from natural language processing. The input to the system is a tag provided by the user. This tag is matched against the annotations present in the knowledge base; if the match threshold is crossed, the respective link in the knowledge base is returned as the result. If no annotations in the knowledge base match the entered tag, a new crawl is initiated with the input query as a tag.

Fig. 5 Similarity module
Algorithm isSimilarWordnet(word, sent)
//Input: word, sent
//Output: Boolean value
//Description: checks if the input tags are similar using WordNet
for synset in wordnet.synsets(word) do
    for lemma in synset.lemmaNames() do
        if lemma in sent then
            return true
return false
Algorithm isSimilarSpacy(word1, word2)
//Input: word1, word2
//Output: Boolean value
//Description: checks if the input tags are similar using spaCy
word ← word1 + " " + word2
tokens ← nlp(word)
token1, token2 ← tokens[0], tokens[1]
if token1.similarity(token2) ≥ threshold then
    return true
return false
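Both checks can be sketched in Python as shown below; the spaCy model name and the 0.5 threshold are illustrative assumptions, not values from the paper:

from nltk.corpus import wordnet
import spacy

nlp = spacy.load("en_core_web_md")  # a model with word vectors is assumed

def is_similar_wordnet(word, sent):
    # True when any WordNet lemma of `word` appears in the sentence/tag text
    for synset in wordnet.synsets(word):
        for lemma in synset.lemma_names():
            if lemma in sent:
                return True
    return False

def is_similar_spacy(word1, word2, threshold=0.5):
    # Cosine similarity between the two tags' word vectors
    tokens = nlp(word1 + " " + word2)
    return tokens[0].similarity(tokens[1]) >= threshold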
3.7 Knowledge Base

The crawler has a knowledge base where all the previously crawled links, along with their annotations (tags), are maintained. Since the user provides the input query as a word, the crawler should look for links with tags matching the input query in the knowledge base; so the tags should be mapped to their respective links. During crawling, the crawler looks for the tags of a particular link in the knowledge base; so the links should be mapped to their respective tags. Also, a link can have more than one tag, and it needs to be mapped to each of them. Similarly, a tag can be a part of one or more links, and it should be mapped to each of them. Thus, the knowledge base is a bidirectional multi-mapping. The internal storage is presented in Fig. 6.
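A minimal sketch of such a bidirectional multi-mapping in Python, with names of our choosing:

from collections import defaultdict

class KnowledgeBase:
    # Bidirectional multi-mapping: a link maps to many tags and a tag to many links
    def __init__(self):
        self.link_to_tags = defaultdict(set)
        self.tag_to_links = defaultdict(set)

    def add(self, link, tags):
        for tag in tags:
            self.link_to_tags[link].add(tag)
            self.tag_to_links[tag].add(link)

    def tags_of(self, link):
        return self.link_to_tags.get(link, set())

    def links_of(self, tag):
        return self.tag_to_links.get(tag, set())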
4 Results and Discussion

The input to the system is a keyword given as a query by the user. The system needs to start with a link, also known as a seed link, which is in the context of the input query, and crawl all the further links using the algorithm described in the
Fig. 6 The knowledge base
previous sections. Adhering to the system capacity, power constraints, and scope, we scale down the input tests and give some relaxations in the performance metrics, under the following assumption: the user makes no spelling errors in the input query. The results of the traditional crawler and our crawler are tabulated in Table 1. We have used the website geeksforgeeks to tabulate the results, as the site allows bots to visit the respective pages (rules as stated as of 01 December 2021). The system stops crawling once it has crawled at most 10 links. Five more queries were considered in this test, only on the smart crawler. In this test, two successive iterations were conducted on the same set of queries to compare the performance of the crawler in both the absence and the presence of links in the knowledge base. The results obtained were plotted as shown in Figs. 7 and 8. The time comparison can be seen in Fig. 9. Observations: The smart crawler was able to fetch more relevant links than the traditional crawler for the same set of queries. Also, the smart crawler took an average crawling time of (67.96 ± 60.092) seconds to execute a query in the first iteration, while it took an average of (2.3 ± 1.4) seconds in the second iteration. The difference observed in the average time taken to crawl the links for a particular query indicates that the presence of the context of a web page in the knowledge base successfully reduced the crawling time by preventing the crawler from going through the content of the web page again.
Table 1 Results for the input query “sorting”

Traditional crawler | Smart crawler
www.geeksforgeeks.org/sorting-algorithms/?ref=ghm | www.geeksforgeeks.org/sorting-algorithms/?ref=ghm
#main | www.geeksforgeeks.org/sorting-algorithms/
www.geeksforgeeks.org/ | www.geeksforgeeks.org/recursive-bubble-sort/
www.geeksforgeeks.org/topic-tags/ | www.geeksforgeeks.org/insertion-sort/
www.geeksforgeeks.org/company-tags | www.geeksforgeeks.org/recursive-insertion-sort/
www.geeksforgeeks.org/analysis-of-algorithms-set-1-asymptotic-analysis/?ref=ghm | www.geeksforgeeks.org/radix-sort/
www.geeksforgeeks.org/analysis-of-algorithms-set-2-asymptotic-analysis/?ref=ghm | www.geeksforgeeks.org/timsort/
www.geeksforgeeks.org/analysis-of-algorithms-set-3asymptotic-analysis/?ref=ghm | www.geeksforgeeks.org/pigeonhole-sort/
www.geeksforgeeks.org/analysis-of-algorithems-little-o-and-little-omega-notations/?ref=ghm | www.geeksforgeeks.org/cycle-sort/
www.geeksforgeeks.org/lower-and-upper-bound-theory/?ref=ghm | www.geeksforgeeks.org/cocktail-sort/
www.geeksforgeeks.org/analysis-of-algorithms-set-4-analysis-of-loops/?ref=ghm | www.geeksforgeeks.org/stooge-sort/
Fig. 7 Iteration 01: crawler in the absence of knowledge base
Fig. 8 Iteration 02: crawler in the presence of knowledge base
Fig. 9 Comparison of time required
5 Conclusion and Future Scope

The objective of the work is to design a crawler which is aware of a link or URL page and can decide whether to crawl the link without opening or requesting the HTML page of the link. This has been achieved through the knowledge base, which acts as a memory for the crawler. Regardless of whether a newly opened and annotated link is used in the current crawling process, it is added to the knowledge base. This saves the time spent requesting, downloading, and annotating a web page that has already been downloaded in the past. Thus, the crawler learns the web gradually with time and improves its performance as more and more links get added to the knowledge base, similar to a machine learning model.
References

1. Kausar MA, Dhaka VS, Singh SK (2013) Web crawler: a review. Int J Comput Appl 63(2)
2. Najork M (2017) Web crawler architecture
3. Gupta P, Johari K (2009) Implementation of web crawler. In: 2009 second international conference on emerging trends in engineering & technology. IEEE, pp 838–843
4. Mukherjea S (2000) WTMS: a system for collecting and analyzing topic-specific web information. Comput Netw 33:457–471
5. Mirtaheri SM, Dinçktürk ME, Hooshmand S, Bochmann GV, Jourdan GV, Onut IV (2014) A brief history of web crawlers
6. Tan PN, Steinbach M, Kumar V (2016) Introduction to data mining. Pearson Education India
7. Science Direct (2019) Machine learning for email spam filtering. https://www.sciencedirect.com/science/article/pii/S2405844018353404. Last accessed 24 Sept 2021
8. Alani H, Kim S, Millard DE, Weal MJ, Hall W, Lewis PH, Shadbolt NR (2003) Automatic ontology-based knowledge extraction from web documents. IEEE Intell Syst 18(1):14–21
9. Erdmann M, Maedche A, Schnurr H-P, Staab S (2000) From manual to semi-automatic semantic annotation: about ontology-based text annotation tools. In: Proceedings of the COLING-2000 workshop on semantic annotation and intelligent content, pp 79–85
10. Cardie C, Wilkerson J (2008) Text annotation for political science research. Taylor & Francis, pp 1–6
11. Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web, pp 124–135
12. Menczer F, Pant G, Srinivasan P, Ruiz ME (2001) Evaluating topic-driven web crawlers. In: Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, pp 241–249
13. Menczer F, Pant G, Srinivasan P (2004) Topical web crawlers: evaluating adaptive algorithms. ACM Trans Internet Technol (TOIT) 4(4):378419
14. Srinivasan P, Menczer F, Pant G (2005) A general evaluation framework for topical crawlers. Inf Retr 8(3):417–447
15. Batsakis S, Petrakis EG, Milios E (2009) Improving the performance of focused web crawlers. Data Knowl Eng 68(10):1001–1013
16. Deshmukh S, Vishwakarma K (2021) A survey on crawlers used in developing search engine. In: 2021 5th international conference on intelligent computing and control systems (ICICCS). IEEE, pp 1446–1452
17. Chaitra PG, Deepthi V, Vidyashree KP, Rajini S (2020) A study on different types of web crawlers. In: Intelligent communication, control and devices. Springer, Singapore, pp 781–789
18. Menshchikov AA, Komarova AV, Gatchin YA, Kalinkina ME, Tkalich VL, Pirozhnikova OI (2020) Modeling the behavior of web crawlers on a web resource. J Phys: Conf Ser 1679(3):32–43
19. Patel JM (2020) Advanced web crawlers. In: Getting structured data from the internet. Apress, Berkeley, CA, pp 371–393
20. Hegade P, Shilpa R, Aigal P, Pai S, Shejekar P (2020) Crawler by inference. In: 2020 Indo–Taiwan 2nd international conference on computing, analytics and networks (Indo-Taiwan ICAN). IEEE, pp 108–112
21. Hegade P, Lingadhal N, Jain S, Khan U, Vijeth KL (2021) Crawler by contextual inference. SN Comput Sci 2(3):1–2
22. Sharma G, Sharma S, Singla H (2016) Evolution of web crawler its challenges. Int J Comput Technol Appl 9:53–57
23. Mahajan R, Gupta SK, Bedi MR (2013) Challenges and design issues in search engine and web crawler. Ed Comm 42
Evaluating the Effect of Leading Indicators in Customer Churn Prediction Sharath Kumar, Nestor Mariyasagayam, and Yuichi Nonaka
Abstract Customer churn prediction is needed as it is one of the preventive solutions employed to retain high-value customers with better prospects for future sales. This churn is preventable if service providers can identify the root cause of churn using data analysis. However, since the objective is to retain those customers after prediction, it is imperative that a significant lead time is available for the service providers to engage with their customers and react in a positive way to retain them. So, early detection of churn candidates is critical for the success of such applications. It is our hypothesis that this additional lead time to engage can be derived from analyzing data sources that have the characteristic quality of being leading indicators rather than lagging indicators. In this paper, we attempt to address the issue by modeling leading-indicator sources of temporal information that are relevant to the customer, namely sentiment data and socio-economic data. We also evaluate the importance of using such data sources to address the problem of having a longer time horizon to react and respond in customer churn prediction applications. We present the results of experiments using open datasets that have been adopted to evaluate our hypothesis. Our study shows that customer sentiment and socio-economic indicators are statistically significant (P-value < 0.05) and improve churn prediction accuracy by up to 20% compared to conventional approaches.

Keywords Time series analysis · Data mining · Boosting and ensemble methods · Customer churn analysis
1 Introduction

Churn prediction involves classifying or rating customers who are likely not to transact or use a service in the future. Several churn prediction models have been adopted across industry verticals [2]. Due to changing customer expectations, a proportion of customers typically tend to leave their existing service provider unless the service is constantly updated to suit their needs. This churn results in revenue loss and adds acquisition expense to maintain market share. For any service provider, retaining existing customers is usually easier and less expensive than gaining new ones. In terms of revenue impact [21], it takes roughly six new customer acquisitions to make up for one lost customer. Some studies indicate that more than 30% of customers are likely to move from their current service provider every year due to changing consumer expectations [18]. The churn is preventable if service providers can identify early warnings to detect customers who are likely to leave and find the root cause of the churn, so that appropriate action can be taken to retain them. Typically, customer sales or operational data is used to develop such predictive churn management solutions [2]. As the data comes from a single source, there is a possibility of a lack of strong or influential features for churn behavior detection, which increases misclassification error, because other business conditions such as customer feedback and location-specific indicators are not captured by the operational data. Even in the case of accurate predictions, the lead time to target such likely-to-leave customers is minimal, because the operational or sales data reflects the behavior of customers after they have already decided to stop using the service. For example, a group of unsatisfied restaurant customers will have made up their minds after a bad service experience, resulting in the loss of repeat customers; even if the restaurant identifies through post-sales analysis that it is losing those customers, it could be too late to retain them, because the sales data acts as a lagging indicator. However, there is a possibility that such unsatisfied customers leave a review comment immediately after their bad experience. This could act as a leading indicator for those customers' next revisit decision and, if utilized properly, could give the advantage of lead time to retain them. Previous works [3, 8, 17] tried to address this issue by developing standard churn classifiers considering multiple time horizons and identifying influential indicators for each time horizon. These methods focused on optimizing the training data required to improve churn prediction rather than on increasing the lead time. To solve these issues, we looked at analyzing additional sources of temporal information that are relevant to the customer: sentiment data and socio-economic data. Also, by utilizing feature synthesis and cross feature engineering to derive influential features from transaction data, we predict customer churn with increased lead time. In this paper, we develop models and conduct experiments to answer the following research questions:
• RQ1: How effective are the leading indicators derived from sentiment and socio-economic data in predicting churn?
• RQ2: Can we obtain a longer lead time (i.e., time horizon) to react and respond to customer churn than is possible with traditional sales or operational data?
We present the results of experiments using open datasets that have been adopted to evaluate our hypothesis. The rest of the paper is organized as follows. In Sect. 2, we discuss the literature on churn prediction models. Following this, in Sect. 3, we present the open datasets used and the experimental methodology applied to answer the research questions. In Sect. 4, we discuss our results. Finally, in Sect. 5, we conclude with a summary and possible directions for future work.
2 Literature Survey

2.1 Churn Prediction Using Multiple Data Sources

In the past, solutions utilized a single source [6, 9, 16] of operational or sales data to make predictive churn decisions. At times, these data were analyzed to target pricing promotions [16] to increase profit and implicitly retain customers, or to provide RFM-based segments [6] of customers in terms of their expected contribution to the business. Some methods derived multiple indicators from the same data source, such as the flow of transactions over time and the spend per transaction [9]. Though these methods were simple and quick to model, they were prone to misclassification error because other business conditions, such as customer feedback and location-specific indicators, were not captured by the operational data. Gradually, as the barrier of having data in silos was overcome, techniques [1, 23] to analyze multiple data sources that have a relationship to the existing operational or sales data were developed for improved decision-making. For instance, in [1], the authors tackle the problem of class imbalance between churners and non-churners by utilizing a combination of subscriber data and call log information. Verbeke et al. [23] used social network information for customer churn prediction by combining both call detail records and customer-related information. The primary motivation of these techniques was to solve the issue of a lack of strong or influential features and improve the prediction accuracy. Still, while such data sources are useful for improving just-in-time predictions, they do not necessarily provide sufficient lead time for the service providers to react or engage with the customer to retain them. Similar to the above works, we focus on improving the classification accuracy of churn predictions using multiple sources. However, there are two major differences in our work. First, we check the statistical significance of non-operational data sources such as customer sentiment and socio-economic data to identify strong input variables for predicting churn. Second, using the significant input features, we show an improvement in classification accuracy by adding such data to existing operational data.
2.2 Early Churn Detection

Some techniques are customized to be application specific. Joana et al. [8] considered the prioritization of customers for targeted retention campaigns. Their approach leveraged around six machine learning models to identify churn up to six months in advance and to identify which factors were most influential in different time periods for such retention campaigns. In [7], the authors considered uplift modeling instead of predicting customer churn, because uplift models allow prediction of an outcome under various decision-making scenarios using the controllable variables of a business. This is important to consider because it helps provide recommended actions based on such controllable variables after identifying a potential churn candidate. In general, much of the work focuses on predicting churn considering multiple time horizons and identifying influential indicators for each time horizon. In our work, however, we focus on deriving early-warning churn signals from leading indicators to predict churn. By utilizing such influential features in addition to operational data, we predict customer churn accurately with increased lead time.
3 Experimental Setup

In this section, we describe the experiments on open datasets that have been adopted to answer our research questions. In Sects. 3.1 and 3.2, we briefly describe the experimental setup and conditions, i.e., the dataset and methodology, respectively. In Sect. 4, we present the results of our experiments along with the discussion and detailed analysis.
3.1 Dataset

In this paper, we use two open datasets, namely the banking telemarketing dataset shown in Table 1 [19] and the synthetic credit card transaction dataset shown in Table 2 [20], to address RQ1 and RQ2, respectively. The banking telemarketing dataset provides aggregated historical marketing data to understand which customers to target for term deposit subscriptions in the future. It also contains additional data sources such as socio-economic and customer relationship data, which can be utilized to verify the hypothesis in RQ1. The dataset includes 41188 records, capturing 21 observations from several data sources containing both lagging (demographic and temporal data) and leading indicators (customer sentiment and socio-economic data). To answer RQ2, we needed transaction-level multivariate data, because the banking marketing data was aggregated. To overcome this, we used the open synthetic credit card transaction dataset [20]. The dataset has 24 million records spanning over 30
Table 1 Banking telemarketing dataset

Category | Attributes
Demographic indicators | Age, job, marital status, education, default, housing, and loan
Temporal data | Contact, month, day of the week, duration
Customer relationship data | Campaign, past days, previous, past outcome
Socio-economic data | Employment variation rate, consumer price index, consumer confidence index, number of employees

Table 2 Synthetic credit card transaction data

Category | Attributes
Transaction data | Transaction day, month, year, amount, card, transaction error data
Merchant industry data | Merchant category code (MCC)
Geo data | City, state, zip code
Fraud data | Fraudulent transaction data
years from 1990 to 2020. Each record has 12 fields capturing transactions, merchant industry, location-specific information, and fraudulent activity data. Refer to Table 2 for more details. This dataset contains historical time series transaction performance of merchants using which we can derive leading indicators of churn and perform prediction experiments considering various lead time windows. In total, we observed 100000 merchants making at least one transaction from the last 30 years and many of these merchants have stopped making transactions for a long time, i.e., they have churned. This data is utilized to develop merchant churn prediction models considering multiple lead time windows such as 1 and 3 months.
3.2 Methodology

We applied hypothesis testing methods to answer RQ1. The hypothesis based on our assumption that multi-dimensional data affects customer churn can be stated as: “Socio-econometric indicators will have an impact on customer churn”. Since we are using open data, we modify this hypothesis considering the banking marketing data (see Table 1): “Customers subscribe to banking services when there are better socio-economic indicators like the consumer confidence index or price index”. In
hypothesis testing, we test a statistical sample with the goal of accepting or rejecting a null hypothesis. The Null Hypothesis (H0) can be defined as H0: “Consumer confidence index has no effect on subscription”. Similarly, we define the Alternate Hypothesis (Ha) as Ha: “Customers subscribe to banking services when there is a better consumer confidence index”. We checked whether the socio-economic indicators of the group of customers who subscribed are significantly different from those of the group who unsubscribed. This can be assessed using suitable test statistics, chosen based on the distribution of the data [22]. For normally distributed data, the most commonly used test for this problem is Student's t-test. The test statistic is given in Eq. 1:

    t = (m_A − m_B) / √(S²/n_A + S²/n_B)    (1)

Here, m_A and m_B represent the means of the groups of customers who subscribed (A) and unsubscribed (B) to the term deposit, respectively, and n_A and n_B represent the sizes of groups A and B. S² is an estimator of the common variance of the two samples, given by Eq. 2:

    S² = (Σ(x − m_A)² + Σ(x − m_B)²) / (n_A + n_B − 2)    (2)

where the first sum runs over the observations of group A and the second over those of group B.
In t-test statistics, if the absolute value of t is greater than the critical value, then the difference is significant; otherwise, it is not significant. Typically, the value of the significance level is 0.05 [22]. After the significance test, we trained two models: (1) a baseline model with only demographic and temporal data (see Table 1) and (2) a model that includes significant indicators from customer relationship and socioeconomic data in addition to baseline data. To evaluate the effectiveness of significant indicators in predicting churn, we benchmarked the results of the two models by using methods such as logistic regression, decision tree, and random forest. To answer RQ2, we developed a model to predict merchant churn in advance by deriving influential transaction behavioral patterns from multiple time horizons. For this purpose, we used synthetic transaction data (see Table 2). This dataset contains historical transaction performance merchants using which we can define churn. Here, we do the analysis considering different training and observation periods (input time window) for analysis to detect churn candidates as early as three months. An example illustration is shown in Fig. 1 for the purpose of reader’s convenience. For a given observation period (see Fig. 1), we extracted features from multivariate transaction data (see Table 2) by using deep feature synthesis [15] to capture operational data, i.e., lagging indicators. Also, we derive additional churn indicators such as increasing competition in a geographic area(no of competitors) or location which can help in predicting churn earlier. These can be considered as leading indicators.
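Both tests are available in SciPy; the sketch below is illustrative, with synthetic placeholder arrays standing in for the two groups' indicator values:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
subscribed = rng.normal(-40.0, 4.0, 200)     # placeholder indicator values, group A
unsubscribed = rng.normal(-41.5, 4.0, 2000)  # placeholder indicator values, group B

t_stat, p_t = stats.ttest_ind(subscribed, unsubscribed)  # normally distributed data
w_stat, p_w = stats.ranksums(subscribed, unsubscribed)   # skewed data (Wilcoxon rank-sum)
print(p_t < 0.05, p_w < 0.05)  # True -> reject H0 at the 5% significance level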
Fig. 1 Churn prediction with significant lead time: an example of predicting churn three months in advance
In addition, we benchmark the performance of commonly used classification algorithms in the banking sector [2] to show the improvement in lead time and accuracy obtained by adding leading indicators.
4 Experiments, Results, and Discussion

4.1 Effect of Leading Indicators

In this section, we present the results of hypothesis testing of two additional data sources in the banking marketing dataset, namely customer relationship and socio-economic data, for the churn prediction process. More specifically, we analyzed the influence of two factors on banking service subscription [19]:
• Consumer confidence index: an economic indicator used to understand consumer spend patterns.
• Call center response time: the customer's call duration during the marketing campaign.
To check their significance, we selected the appropriate test statistic considering the data type, data distribution, minimum number of data points required, etc. The distribution of the data is shown in Fig. 2a, b. Figure 2a shows the distribution of the consumer confidence index score for around 20 thousand customer samples; the distribution looks normal, so the t-test can be used. Figure 2b shows the distribution of call duration in seconds for around 5000 customer samples; this distribution is heavily skewed (i.e., not normal), so we choose the Wilcoxon test for hypothesis testing. Using the appropriate test statistics, we validate both hypotheses. Results are shown in Table 3. For both hypotheses, the p-value is less than the 5% significance level, so we reject the null hypothesis and consider the consumer confidence index and call center response time to be significant; we will utilize these indicators as additional features to improve prediction accuracy.
Fig. 2 Data distribution: a consumer confidence index; b call center response time

Table 3 Hypothesis test results

Hypothesis | Data and test statistic used | Results (P-value)
Customers subscribe to banking services when there is a better consumer confidence index | Sample data: consumer confidence index and subscription info of 20000 customers. Test statistic used: t-test | 0.0347 × 10⁻¹¹
There is a relationship between call center response time or duration and service subscription | Sample data: average call duration time in seconds of 5000 customers. Test statistic used: Wilcoxon rank-sum and signed-rank test | 0.03279 × 10⁻¹²
In order to verify that these significant indicators have an impact on churn predictions, we develop a model to predict whether customers will subscribe to the deposit service (see Table 1) by using three widely used supervised learning algorithms in the banking sector [2]. For model development, we use data of around 41188 customers, of whom 4640 subscribed to the deposit and the remaining 36548 did not. We used a 70:30 ratio for training and testing. First, we developed a baseline model (i.e., conventional method) using only demographic and
Table 4 Modeling results to answer RQ1

Model | Method | Precision | Recall | F-score | Accuracy | Balanced accuracy
Conventional method | Logistic regression | 0.46 | 0.07 | 0.12 | 0.88 | 0.53
Conventional method | Decision tree | 0.30 | 0.25 | 0.27 | 0.84 | 0.59
Conventional method | Random forest | 0.49 | 0.16 | 0.24 | 0.88 | 0.56
Proposed method | Logistic regression | 0.66 | 0.42 | 0.51 | 0.91 | 0.69
Proposed method | Decision tree | 0.63 | 0.48 | 0.55 | 0.91 | 0.72
Proposed method | Random forest | 0.65 | 0.51 | 0.57 | 0.91 | 0.74
temporal data information of customers (as shown in Table 1), and then we developed a more robust model (i.e., proposed method) considering customer sentiment and socio-economic data. Finally, we do a comprehensive evaluation of both models using appropriate, commonly used evaluation measures such as accuracy, precision, recall, F-score, and balanced accuracy. Since the dataset has a class imbalance issue (4640 subscribe vs. 36548 do not subscribe), we primarily check balanced accuracy to assess model improvements. Table 4 compares the results of the baseline model and the more robust model based on multiple data sources, which include leading indicators. From the results, we observe an improvement in the range of 20–25% in terms of balanced accuracy for the model with leading indicators when compared to the baseline model without any leading indicators. Figure 3 shows the model results in terms of the receiver operating characteristic (ROC) curve for the various methods mentioned above. In terms of performance, the random forest method gives consistently better results than the decision tree and other models in both the existing and new approaches. This is because random forests aggregate many decision trees to limit overfitting as well as reduce error due to bias, and therefore yield better results compared to the other methods. We can also evaluate the performance of classification algorithms using the AUC metric (area under the ROC curve). As shown in Fig. 3, for the baseline data, logistic regression provides better performance based on AUC. But overall, the random forest method provides better performance considering not only AUC but also other metrics such as precision, recall, and F-score.
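This benchmark can be sketched with scikit-learn; the sketch below is ours, with synthetic placeholders (X_base, X_full, y) standing in for the feature matrices built from the telemarketing dataset:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)                    # placeholder subscribe / not-subscribe labels
X_base = rng.normal(size=(1000, 8))             # demographic + temporal features
X_full = np.hstack([X_base, rng.normal(size=(1000, 4))])  # + leading indicators

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(),
}

for label, X in [("baseline", X_base), ("baseline + leading indicators", X_full)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)
    for method, model in models.items():
        model.fit(X_tr, y_tr)
        score = balanced_accuracy_score(y_te, model.predict(X_te))
        print(f"{label} / {method}: balanced accuracy = {score:.2f}")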
[ROC curve panels: true positive rate vs. false positive rate for Logistic Regression, Decision Tree, and Random Forest, plotted once using baseline data and once using baseline + leading indicators data.]
Fig. 3 Comparison of model performance when using baseline data and after adding customer sentiment and socio-economic data as leading indicators
4.2 Lead Time Analysis

To evaluate the problem of having a longer lead time (i.e., time horizon) for predicting churn, we used a synthetic credit card transaction dataset (see Table 2). The dataset has 24M customer transactions with 12 fields capturing merchant transaction data spanning 30 years (from 1990 to 2020), as well as merchant industry, locations, card holders' buying patterns, fraudulent behavior, etc. In total, we observed 100000 merchants making at least one transaction over the last 30 years, and many of these merchants have stopped making transactions for a long time, i.e., they have churned. So, using the historical transaction performance of merchants, we defined churn and developed prediction models considering multiple lead time windows (see Fig. 1). To develop the churn prediction models, we use churn classification methods common in the banking sector [2], comprising seven standard classifiers: Random Forest [4], Logistic Regression [24], Decision Tree [5], K-Nearest Neighbors [14], Gradient Boosting [11], AdaBoost [10], and Multilayer Perceptron [12, 13]. For model training and testing, we use data of around 1500 customers who made transactions in 2019. From the data, we observed that around 300 customers left the bank or stopped making transactions at the end of 2019. Here, we developed two models, namely a model with one-month lead time and a more useful model with three-month lead time, as shown in Fig. 4b; merchants who stopped making transactions after Nov 2019 are considered churn candidates. For a given input train window, we extract features from the transaction data by using deep feature synthesis [16] and cross-feature engineering, capturing operational data (i.e., lagging indicators) as well as leading indicators by combining other data sources. For example, rising competition from other merchants in an area
Fig. 4 Time windows used for churn prediction models: (a) prediction with 1-month lead time; (b) prediction with 3-month lead time

Table 5 Lagging and leading features derived

Operational features | Leading indicators
Transaction count | New competitor in the location
Transaction amount | Transaction trend indicators (growth/decline)
Avg transaction amount | Fraudulent transactions
Month on month transaction variables | Service issues
or location can be derived from cross-feature engineering between location and changes in the time series data. This can help in predicting churn earlier. Overall, we derive 25 features; a few examples are shown in Table 5. We used them to train models to predict churn with one-month and three-month lead times and compare the results. We also benchmark early churn prediction using only transaction data as a baseline, and then add leading indicators derived from other factors such as competition, fraud, and service-related issues to evaluate the effect on churn prediction accuracy. As shown in Tables 6 and 7, the model which uses both baseline and leading indicators performs better on most evaluation measures than the baseline which uses only operational (lagging) indicators. In some models, we observed an improvement of up to 15% in precision and recall and a 5% increase in accuracy. In general, model accuracy goes down as we add more lead time when predicting churn. However, our results show that the performance of the three-month lead time model is quite close to that of the one-month lead time model. This provides sufficient lead time for service providers to react or engage with customers to retain them.
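Deep feature synthesis [16] stacks aggregation primitives over related tables to produce lagging features like those in Table 5. A hedged sketch using the open-source featuretools library, one common implementation of the technique; the toy schema and column names are assumptions, not the authors' data model:

```python
import featuretools as ft
import pandas as pd

# Toy transaction log: one row per transaction, keyed by merchant.
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "merchant_id": ["m1", "m1", "m2", "m2"],
    "amount": [120.0, 80.0, 40.0, 55.0],
    "timestamp": pd.to_datetime(
        ["2019-01-05", "2019-02-07", "2019-01-20", "2019-03-02"]),
})

es = ft.EntitySet(id="churn")
es = es.add_dataframe(dataframe_name="transactions", dataframe=transactions,
                      index="transaction_id", time_index="timestamp")
# Derive a merchants table so aggregations roll up per merchant.
es = es.normalize_dataframe(base_dataframe_name="transactions",
                            new_dataframe_name="merchants",
                            index="merchant_id")

# DFS stacks primitives (count, sum, mean, trend, ...) to build per-merchant
# lagging features such as transaction count and average transaction amount.
feature_matrix, feature_defs = ft.dfs(
    entityset=es, target_dataframe_name="merchants",
    agg_primitives=["count", "sum", "mean", "trend"], max_depth=2)
print(feature_matrix.head())
```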
Table 6 Churn prediction model results with one-month lead time (Con: conventional method, Pro: proposed method)

Model | Accuracy (Con/Pro) | AUC (Con/Pro) | Precision (Con/Pro) | Recall (Con/Pro) | F-score (Con/Pro)
Random forest | 0.80/0.85 | 0.82/0.86 | 0.48/0.70 | 0.31/0.38 | 0.38/0.49
MLP | 0.81/0.82 | 0.77/0.81 | 0.53/0.52 | 0.33/0.59 | 0.41/0.55
Logistic regression | 0.84/0.85 | 0.87/0.87 | 0.61/0.68 | 0.44/0.44 | 0.51/0.53
K-Nearest neighbors | 0.76/0.73 | 0.61/0.61 | 0.29/0.26 | 0.18/0.21 | 0.23/0.23
Gradient boosting | 0.82/0.85 | 0.84/0.85 | 0.55/0.66 | 0.44/0.46 | 0.49/0.54
Decision tree | 0.75/0.78 | 0.64/0.67 | 0.37/0.44 | 0.46/0.51 | 0.41/0.47
AdaBoost | 0.82/0.83 | 0.83/0.83 | 0.53/0.57 | 0.46/0.41 | 0.49/0.48
Table 7 Churn prediction model results with three-month lead time (Con: conventional method, Pro: proposed method)

Model | Accuracy (Con/Pro) | AUC (Con/Pro) | Precision (Con/Pro) | Recall (Con/Pro) | F-score (Con/Pro)
Random forest | 0.79/0.82 | 0.80/0.83 | 0.54/0.62 | 0.26/0.49 | 0.35/0.55
MLP | 0.77/0.77 | 0.75/0.81 | 0.47/0.49 | 0.33/0.59 | 0.39/0.53
Logistic regression | 0.81/0.83 | 0.85/0.85 | 0.66/0.65 | 0.31/0.44 | 0.42/0.52
K-Nearest neighbors | 0.72/0.76 | 0.61/0.65 | 0.24/0.43 | 0.12/0.35 | 0.16/0.39
Gradient boosting | 0.79/0.81 | 0.81/0.84 | 0.54/0.57 | 0.32/0.47 | 0.40/0.52
Decision tree | 0.72/0.75 | 0.58/0.64 | 0.36/0.43 | 0.33/0.44 | 0.34/0.43
AdaBoost | 0.77/0.78 | 0.78/0.81 | 0.48/0.51 | 0.34/0.39 | 0.40/0.44
5 Conclusion and Future Work

In this work, we studied the effectiveness of leading indicators on customer churn prediction. We specifically analyzed sentiment and socio-economic data to derive leading indicators. We found that they are statistically significant (p-value below the 5% significance level).

$S = \{\, S_i \mid \mathrm{rank}(S_i) > \mathrm{rank}(S_{i+1}),\ i \in \mathbb{N} \,\}$
Here, rank(Si) denotes the similarity of section Si to the user query QU; higher is better.
3.3 Summarizer

It takes in the section provided by the Retriever and generates a summary for it. The Summarizer module uses PEGASUS for generating summaries. The modular design of the component enables us to switch between different models with ease. Apart from PEGASUS, the XLNet, BERT, and GPT-2 models have also been explored within the scope of this study for summarization of user manuals.
PEGASUS is an abstractive model presented by [13]. It employs a new self-supervised objective that utilizes important parts of the source text as targets for text generation. In doing so, the model can be trained with minimal supervision. To select a target, sentences are masked out from the source text; the masking is done using important terms. It also uses a "length modifier" parameter to adjust summary length, which enables the system to generate summaries of the desired granularity.
BERT is a model presented by [4]. It takes a different approach toward language processing: it uses bi-directional attention rather than traditional forward attention to achieve better results, which helps in understanding contexts better. It also eliminates the need for task-specific architectures by utilizing pre-trained representations.
GPT-2 is a large transformer-based language model. It demonstrates a way to train a general language model with minimal supervision, and despite being a generalized model, it performs well on various tasks. BERT only works on encoding mechanisms, whereas GPT-2 combines encoding as well as decoding to generate a language model.
XLNet is a generalized extractive model. It captures bi-directional context by means of permutation language modeling. Furthermore, it uses autoregressive pretraining without corrupting source texts with additional tokens, thus overcoming the limitations of BERT by reducing noise.
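As a hedged illustration of how a PEGASUS summarizer can be driven, the sketch below uses the Hugging Face transformers API rather than the authors' actual code; the checkpoint name and the mapping of the paper's "length modifier" onto max_length are assumptions made for the example.

```python
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"  # assumed checkpoint for illustration
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def summarize(section_text: str, length_modifier: int = 1) -> str:
    """Generate an abstractive summary; a larger modifier allows a longer one."""
    inputs = tokenizer(section_text, truncation=True, padding="longest",
                       return_tensors="pt")
    summary_ids = model.generate(**inputs,
                                 max_length=64 * length_modifier,
                                 num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summarize("Press and hold the power button for three seconds to ..."))
```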
4 Performance Analysis

4.1 Dataset

The data consists of user manuals taken from multiple sources [7, 8] belonging to various domains such as software, hardware, medical, gaming, etc. Diverse data helps mitigate the problem of overfitting. The data is first cleaned to remove noise and inconsistencies. Extractive and abstractive datasets have been created for the evaluation of the respective models. The abstractive dataset is machine annotated and is later reviewed and manually corrected. The extractive dataset is manually annotated, where each sentence is given a score based on importance. Table 1 shows additional information about the dataset used.
4.2 Dataset Statistics

Table 1 Entities in the dataset

Entity | Count
Manuals | 10
Sections | 200
Pages | 984
Lines | 45032
Words | 223224
Images | 320

4.3 Analysis of Extractive Models

The extractive models were analyzed with a focus on two parameters: the length of the generated summary, as shown in Fig. 5, and the generation time, as shown in Fig. 6. The performed analysis is as follows:
Fig. 5 Average summary length
Fig. 6 Average summary generation time
1. It was observed that the XLNet model performs best in terms of providing a summary of reasonable length in a comparatively shorter time.
2. It is also much smaller in size than BERT and GPT-2, which are resource heavy and slower to report summaries.
4.4 PEGASUS Summary Generation and Load Time Analysis

These tests were performed with the intention of assessing the feasibility of PEGASUS in a real-time setting. A cold start is an execution state wherein an entity (a function or object of interest) has not yet been loaded into memory. A cold start affects overall response time significantly if left unaccounted for, since the entity must first be loaded before it can be used. If an entity has already been loaded into memory, the execution is referred to as a warm start, and load times are minimal. Tests were run to find the impact of cold and warm starts on run time. It was observed that loading the PEGASUS model is the biggest bottleneck in the pipeline: on a cold start, it takes about 48 s to load the model, as shown in Table 2, which is significant when dynamic real-time applications are considered. Similar tests were run for summary generation. However, as shown in Table 3, cold start has no observable effect on summary generation times. It was also observed that the generation time increases with the length modifier.
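A minimal sketch of how such cold/warm-start timings can be measured (this measurement harness is an assumption for illustration, not the authors' test setup):

```python
import time
from transformers import PegasusForConditionalGeneration

def timed(label, fn):
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.6f} s")
    return result

# Cold start: the first load pulls the model weights into memory.
model = timed("cold load",
              lambda: PegasusForConditionalGeneration.from_pretrained(
                  "google/pegasus-xsum"))

# Warm start: the object already lives in memory; re-referencing it is
# near-instant, matching the microsecond-scale warm figure in Table 2.
timed("warm access", lambda: model)
```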
Table 2 Effect of cold start on PEGASUS load time

Start type | Load time
Cold | 48.09 s
Warm | 2.5 µs

Table 3 Effect of length modifier on PEGASUS summary generation time

Start type | Generation time (s), length modifier 1 | Length modifier 5 | Length modifier 10
Cold | 0.615 | 1.232 | 1.734
Warm | 0.634 | 1.234 | 1.402

4.5 Key Observations

After performing the analysis, the following was observed:
1. For summarization of user manuals, both abstractive and extractive methodologies are applicable, each having its advantages and disadvantages.
2. Abstractive models like PEGASUS generate more human-like summaries and preserve grammatical flow better, but have a heavy resource requirement.
3. Extractive models, on the other hand, are quicker, but produce summaries which might have gaps in them.
4. Extractive models preserve the meaning of the original text by selecting important sentences from the given content.
5. Abstractive models, in contrast, provide a better grammatical flow, making it easy for the user to understand the summaries.
6. Hence, further analysis of both these methods for the said use case was required, in order to select the best possible method for the system.
4.6 Hit Ratio

Hit ratio [11] analysis is a way of quantifying the similarity between the summary generated by a model and the reference summary. Here we look at the summaries and try to identify common sequences. The result, as shown in Table 4, is a score (between 0 and 1) that indicates how close the summary of a model is to the reference summary. Based on the results in Table 4, it can be observed that XLNet has the highest hit ratio, which indicates that it works better for the selected dataset.

$\mathrm{Hit\ Ratio}\ (H_R) = \dfrac{|S_G \cap S_R|}{|S_R|}$

where $S_G$ is the summary generated by the model and $S_R$ is the reference summary.
4.7 Statistical Analysis

Summary Overlap Analysis: Overlap [11] analysis is a way of quantifying the similarity between two different models. First, common sequences between the summaries are identified. The result is a score (between 0 and 1) that indicates how close the summary of one model is to the other. Based on the results in Table 5, it can be observed that GPT-2, being a general language model, generates summaries similar to both BERT and XLNet.

$\mathrm{Overlap} = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$

where $S_1$ is the summary generated by the first model and $S_2$ is the summary generated by the second model.
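A hedged sketch of both metrics, treating each summary as a set of word n-grams — one plausible reading of "common sequences", since the exact sequence granularity is not specified by the paper:

```python
def ngrams(text: str, n: int = 2) -> set:
    """Represent a summary as the set of its word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def hit_ratio(generated: str, reference: str, n: int = 2) -> float:
    """|SG ∩ SR| / |SR|: fraction of reference sequences recovered."""
    g, r = ngrams(generated, n), ngrams(reference, n)
    return len(g & r) / len(r) if r else 0.0

def overlap(summary_a: str, summary_b: str, n: int = 2) -> float:
    """|S1 ∩ S2| / |S1 ∪ S2|: Jaccard similarity between two summaries."""
    a, b = ngrams(summary_a, n), ngrams(summary_b, n)
    return len(a & b) / len(a | b) if (a | b) else 0.0

ref = "hold the power button for three seconds to turn off the device"
gen = "press the power button for three seconds to power off"
print(hit_ratio(gen, ref))
print(overlap(gen, ref))
```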
Table 4 Hit ratios of extractive models

Model | Hit ratio
BERT | 0.210738
GPT-2 | 0.199205
XLNet | 0.238537
Table 5 Overlap between summaries of the extractive models

 | BERT | XLNet | GPT-2
BERT | – | 0.187733 | 0.279104
XLNet | | – | 0.240435
GPT-2 | | | –
5 Results

5.1 Rouge Score Analysis

The results in Table 6 indicate that the extractive models have higher ROUGE scores [6]. This is due to the fact that the reference summaries used for evaluating PEGASUS are more extractive in nature. Of the extractive models, XLNet performs the best overall and outperforms the other models on ROUGE-2. GPT-2 has better ROUGE-L scores, whereas BERT is the lowest-performing extractive model. Although PEGASUS has lower ROUGE scores, it produces the best summaries in terms of continuity and adequacy of information content from the source text; as a result, its summaries are more human-like.
Table 6 Rouge scores

Metric | | BERT | GPT-2 | XLNet | PEGASUS
Rouge-1 | F | 0.52664 | 0.52355 | 0.55598 | 0.4320
Rouge-1 | P | 0.59209 | 0.56687 | 0.58757 | 0.51564
Rouge-1 | R | 0.47982 | 0.49469 | 0.53309 | 0.38455
Rouge-2 | F | 0.34154 | 0.32895 | 0.36726 | 0.18307
Rouge-2 | P | 0.38577 | 0.35976 | 0.38827 | 0.22049
Rouge-2 | R | 0.31005 | 0.30819 | 0.35194 | 0.16316
Rouge-L | F | 0.50265 | 0.4858 | 0.50542 | 0.38611
Rouge-L | P | 0.51499 | 0.48702 | 0.506793 | 0.49705
Rouge-L | R | 0.49800 | 0.49541 | 0.509038 | 0.31806
6 Conclusion and Future Work

User manuals pose a challenge as they present data in different formats such as tables, diagrams, etc., and the system needs to differentiate these elements from textual content while creating a summary. Since the structure varies from manual to manual, automatic processing becomes difficult. To counter this, we created a system that utilizes multiple pipelines, each specialized in a single task, and integrated them to get summaries relevant to a user-provided query. ROUGE scores and hit ratio indicate XLNet as the best extractive model; summary length and time analysis further validate this result. Overlap analysis shows GPT-2 and BERT to be generating similar summaries. Although PEGASUS is behind XLNet in terms of ROUGE scores, we observed that it generates summaries that are generally preferred over the others due to better readability and fewer information gaps. The future scope of this system includes implementing a general language model that can understand queries. Furthermore, automatic selection of the summary that is most relevant to the provided query is another feature that can be added. The biggest scope is integrating the model with a chat-bot, which would further enhance interaction with the system.
References
1. Artifex Software I. PyMuPDF: a lightweight PDF, XPS, and e-book viewer, renderer, and toolkit. https://github.com/pymupdf/PyMuPDF
2. Britz D, Goldie A, Luong MT, Le Q (2017) Massive exploration of neural machine translation architectures. arXiv:1703.03906
3. Chaput M (2007) Whoosh: Python-based search engine
4. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
5. Erera S, Shmueli-Scheuer M, Feigenblat G, Nakash OP, Boni O, Roitman H, Cohen D, Weiner B, Mass Y, Rivlin O et al (2019) A summarization system for scientific documents. arXiv:1908.11152
6. Lin CY (2004) ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out. Association for Computational Linguistics, Barcelona, Spain, pp 74–81. https://aclanthology.org/W04-1013
7. Manuals online: free library for manuals. http://www.manualsonline.com/
8. Manualslib: the ultimate manuals library. https://www.manualslib.com/
9. Pdf.js: a portable document format (PDF) viewer, built with HTML5. https://github.com/mozilla/pdf.js
10. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I et al (2019) Language models are unsupervised multitask learners. OpenAI Blog 1(8):9
11. Sonawane S, Kulkarni P, Deshpande C, Athawale B (2019) Extractive summarization using semigraph (ESSG). Evol Syst 10(3):409–424
12. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV (2019) Xlnet: generalized autoregressive pretraining for language understanding. Adv Neural Inf Process Syst 32 13. Zhang J, Zhao Y, Saleh M, Liu P (2020) Pegasus: pre-training with extracted gap-sentences for abstractive summarization. In: International conference on machine learning. PMLR, pp 11328–11339
Modelling Seismic Performance of Reinforced Concrete Buildings Within Response Spectrum Framework Praveena Rao
and Hemaraju Pollayi
Abstract This paper deals with the modelling and analysis of reinforced concrete buildings for seismic performance within the response spectrum framework using a deep learning toolbox in 64-bit MATLAB R2021a. The response of a building subjected to earthquake ground accelerations is of paramount importance for designing earthquake-resistant structures. Huge losses of life and property have motivated extensive research in the field of seismic prediction and analysis for accurate results. Artificial Intelligence (AI) and Machine Learning (ML) techniques are thus finding a wide variety of applications in seismic analysis for gaining new insights. The available seismic data has increased exponentially in size, and AI has emerged as a solution for the challenging task of processing such overwhelming time-history earthquake data sets. The response spectrum method of seismic analysis is widely used as it computes peak displacements and member forces. In the present work, ground motion recordings of the El Centro earthquake, one of the most studied earthquake records, are considered as input data sets along with two other earthquakes of the Indian subcontinent, namely, the Bhuj earthquake and the India–Myanmar earthquake. The response spectra are developed for multi-degrees-of-freedom (MDOF) systems based on Newmark's method for linear systems. The ground acceleration data of the three earthquake records are used as inputs, and the peak displacement, base shear and strain energy are computed. Numerical examples presented illustrate the effectiveness of the deep learning toolbox in MATLAB for determining the seismic performance of reinforced concrete buildings. Keywords Seismic modelling · Time series data · Response spectrum · Peak ground acceleration
P. Rao (B) · H. Pollayi Department of Civil Engineering, GITAM Deemed to be University, Hyderabad, TS, India e-mail: [email protected] H. Pollayi e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_36
1 Introduction

This paper describes a methodology for the dynamic analysis of reinforced concrete buildings subjected to earthquake loads. In the first part, an extensive database is created for two different earthquakes in India, based on the available seismograph data. Buildings are subjected to random ground motions at the base, which give rise to inertia forces that cause stresses in the building; this is displacement-type loading. Ground motion recordings of the El Centro earthquake, Bhuj earthquake and India–Myanmar earthquake are used to create the database for the analysis of multi-degrees-of-freedom (MDOF) systems based on Newmark's method for linear systems. The ground acceleration data of the three earthquake records are used as inputs, for which the peak displacement, base shear and strain energy are computed. India has witnessed some of the largest earthquakes in the world. The mapping of 20 years of India's earthquake data is plotted using Python as shown in Fig. 1, and some of the data sets in the figure are used in the present work. The northeastern regions along the Himalayan belt are prone to huge earthquakes of magnitudes greater than 8.0 Mw, mainly due to tectonic movements. On the 26th of January, 2001, India was shaken by a terrible earthquake in the Kutch region, Bhuj (Gujarat); its tremors were felt across the country and even in adjoining countries like Pakistan, Nepal and Bhutan. The earthquake greatly affected 21 districts of Gujarat state and a population of 1.58 crores. Approximately 30,000 lives were lost and about 166000 people were injured. Extensive liquefaction was witnessed in large parts of the affected areas, resulting in the failure of several earth dams, masonry arches, RC bridges, railroads, highway embankments and multistorey buildings, causing large-scale death and devastation across an area up to 300 km from the epicentre. Strong ground motion with a peak acceleration of 0.11 g was recorded at only one station, located at a building in Ahmedabad City. The intensity distribution heatmap of the Bhuj earthquake, 2001, in terms of the Modified Mercalli Intensity (MMI) scale, plotted in QGIS in Fig. 2, shows the impact of the earthquake witnessed across the country. Ground movement during earthquakes generates highly complex waves which produce translational ground motion in different directions coupled with rotational motions along arbitrary axes. Accelerographs, which detect strong motions, are switched on automatically upon sensing ground motion accelerations beyond a threshold value of about 0.005 g and can record three translational components of ground acceleration. Earthquake magnitude reflects the amount of elastic energy released during the earthquake, indicated by a real number on the Richter scale (e.g. 8.2), whereas earthquake intensity indicates the extent of shaking experienced during the earthquake, denoted by Roman numerals (e.g. VII). The seismic waves generated during earthquakes are recorded as seismograms, which capture ground motion parameters as a function of time. The acceleration data from the seismograms are used to control floor displacements and design adequately resistant structures for future earthquake events. George and Vagelis [1] have explained the detailed structure and algorithms for different spectra in a programme. Different software applications were used to compute and compare resulting response spectra for 11 earthquake strong
Fig. 1 Mapping of 20 years earthquake data in Python
motion data. Reza and Rahimeh [3] have presented a numerical model for response spectra as per Eurocode 8 for the seismic analysis of structures. They implemented a five-storey moment-resisting 3D structure and carried out eigenvector analysis to determine the vibration mode shapes for undamped free vibrations and the natural frequencies of the structure with lumped mass and stiffness matrices. Rith [4] constructed response spectra based on Newmark's method for a numerical example of an MDOF system using the response spectrum analysis approach in MATLAB. The inputs of the programme are the ground motion acceleration data and the MDOF structure parameters, including mass, stiffness, damping ratio, etc., for different mode shapes. Mostafa [5] simulated the Gulf of Suez earthquake (2013), which had a moderate ground motion magnitude (ML 5.1). With the help of stochastic techniques, strong ground motion parameters (PGD and PSA)
Fig. 2 Intensity distribution heatmap of Bhuj earthquake, 2001 in terms of MMI scale
were simulated for an earthquake located in the north of the Red Sea; simulations for cities such as Ras Gharib, Sokhna, Hurghada and Zafarana were generated and validated against the accelerograph recordings found at the epicentre. Mehani et al. carried out a study on the base forces, story drifts and absolute displacements of structures with the linear response spectrum method and nonlinear pushover analysis as per the Eurocodes. A case study of an 8-storey building was performed, and the results of the linear analysis were compared with nonlinear static analysis as per the Algerian seismic design code specifications in force, RPA99/version 2003, and the ETABS 2013 programme.
2 Framework for Response Spectrum Analysis

The response spectrum method is a well-known method of seismic analysis that estimates the dynamic response of structures subjected to earthquake motions. To design earthquake-resistant structures, especially in places with high seismic activity, the time histories of recorded ground motion parameters such as acceleration, displacement and velocity are considered. The structural response can be determined from the combination of the different mode shapes, modal natural frequencies and modal masses. The response spectrum for all single-degree-of-freedom systems is the
maximum response (maximum displacement, velocity, acceleration, etc.) plot with a specified load function. The abscissa of this plot is the natural time period ($T_n$) or frequency of the system, while the ordinate represents the maximum response of the system under the impact of earthquake ground acceleration [11]. The general linearized equations of motion for a Multiple-Degree-of-Freedom (MDOF) system in forced vibration are given by Eq. (1):

$[M]\{\ddot{u}\} + [C]\{\dot{u}\} + [K]\{u\} = -[M]\{r\}\,\ddot{u}_g(t) \quad (1)$
where mass matrix = [M], stiffness matrix = [K], damping matrix = [C], and {r} is the influence coefficient vector. The MDOF system of equations can be transformed into the modal equations. The vibration modes and natural frequencies can be determined from the following characteristic equation:

$\left([K] - \omega_i^2 [M]\right)\{\phi_i\} = 0 \quad (2)$
where $\omega_i^2$ are the eigenvalues of the ith mode, $\phi_i$ is the eigenvector or mode shape of the ith mode, and $\omega_i$ is the natural frequency of the ith mode. Here i = 1, 2, ..., n and n is the number of DOFs. The floor displacements and story drifts can be computed using

$u_{jn} = \Gamma_n \phi_{jn} D_n \quad (3)$

$\Delta_{jn} = \Gamma_n \left(\phi_{jn} - \phi_{j-1,n}\right) D_n \quad (4)$
where $D_n = D(T_n, \xi_n) = A_n/\omega_n^2$ is the deformation spectrum ordinate corresponding to natural period $T_n$ and damping ratio $\xi_n$, $j$ is the mode shape number, and $n$ is the floor number. The modal participation factor $\Gamma_n$ for the ith mode can be expressed as

$\Gamma_i = \dfrac{\{\phi_i\}^T [M] \{r\}}{\{\phi_i\}^T [M] \{\phi_i\}} \quad (5)$

The equivalent static lateral forces $f_{jn}$ can be computed using

$f_{jn} = \Gamma_n M_j \phi_{jn} A_n \quad (6)$
For the design of structures, the peak values of displacements and forces are required, which can be computed by the modal combination rules for MDOF systems. The peak value of the total response can be determined according to the SRSS (Square Root of Sum of Squares) rule:

$r_{\max} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} r_i \, \alpha_{ij} \, r_j} \quad (7)$
Fig. 3 Flow-chart for seismic analysis of RCC buildings within response spectrum framework
The complete analysis of linear MDOF systems mentioned above is depicted as a flow-chart in Fig. 3. In 1959, Newmark developed numerical methods to compute the deformation response of linear SDOF systems. Special cases: (1) Average acceleration method ($\gamma = \tfrac{1}{2}$, $\beta = \tfrac{1}{4}$); (2) Linear acceleration method ($\gamma = \tfrac{1}{2}$, $\beta = \tfrac{1}{6}$).

1.0 Initial calculations

1.1
$\ddot{u}_0 = \dfrac{p_0 - c\,\dot{u}_0 - k\,u_0}{m} \quad (8)$

1.2 Select $\Delta t$.

1.3
$\hat{k} = k + \dfrac{\gamma}{\beta \Delta t}\, c + \dfrac{1}{\beta (\Delta t)^2}\, m \quad (9)$

1.4
$a = \dfrac{1}{\beta \Delta t}\, m + \dfrac{\gamma}{\beta}\, c \quad (10)$

$b = \dfrac{1}{2\beta}\, m + \Delta t \left(\dfrac{\gamma}{2\beta} - 1\right) c \quad (11)$

Calculations for every time step are then performed to determine the response [12].
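A minimal sketch of this time-stepping scheme for a linear SDOF system (the average acceleration case; the incremental update formulas inside the loop follow the standard Newmark formulation in [12] rather than anything spelled out in this chapter, and the zero ground-motion array is a placeholder):

```python
import numpy as np

def newmark_sdof(m, c, k, p, dt, gamma=0.5, beta=0.25, u0=0.0, v0=0.0):
    """Newmark time stepping for a linear SDOF system (incremental form)."""
    n = len(p)
    u, v, a = np.zeros(n), np.zeros(n), np.zeros(n)
    u[0], v[0] = u0, v0
    a[0] = (p[0] - c * v0 - k * u0) / m                       # Eq. (8)
    k_hat = k + gamma / (beta * dt) * c + m / (beta * dt**2)  # Eq. (9)
    A = m / (beta * dt) + gamma / beta * c                    # Eq. (10)
    B = m / (2 * beta) + dt * (gamma / (2 * beta) - 1) * c    # Eq. (11)
    for i in range(n - 1):
        dp_hat = (p[i + 1] - p[i]) + A * v[i] + B * a[i]
        du = dp_hat / k_hat
        dv = (gamma / (beta * dt)) * du - (gamma / beta) * v[i] \
             + dt * (1 - gamma / (2 * beta)) * a[i]
        da = du / (beta * dt**2) - v[i] / (beta * dt) - a[i] / (2 * beta)
        u[i + 1], v[i + 1], a[i + 1] = u[i] + du, v[i] + dv, a[i] + da
    return u, v, a

# Example: the SDOF system of Sect. 3 under a ground acceleration record ag
# (m/s^2), with effective load p(t) = -m * ag(t) and the 0.02 s step of Table 2.
m, k, xi = 100.0, 60.0e3, 0.02
wn = np.sqrt(k / m)            # ~24.495 rad/s, matching Table 3
c = 2 * xi * m * wn
ag = np.zeros(1560)            # replace with the El Centro record
u, v, a = newmark_sdof(m, c, k, -m * ag, dt=0.02)
```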
3 Numerical Examples

The numerical examples deal with the response spectrum analysis of single-degree-of-freedom (SDOF) and multiple-degree-of-freedom (MDOF) systems. The El Centro (a city and the county seat of Imperial County, California, United States) earthquake (6.9 Mw) record (North–South) of May 18, 1940, is given in Table 2. This data set is used in the present work as a benchmark alongside two other earthquake data sets: the Bhuj/Kachchh earthquake (7.7 Mw) of January 26, 2001, and the India–Myanmar earthquake (7.3 Mw) of August 6, 1988. A six-storey MDOF system with floor-wise mass and floor-wise stiffness is given in Table 1. The considered MDOF system is subjected to the three earthquake ground motion accelerations separately and analyzed within the response spectrum framework. To validate the results of the present analysis, an SDOF system with mass (m) 100 kg, stiffness (k) 60.0 × 10³ N/m and damping ratio (ξ) 0.02 is analyzed for the benchmark El Centro ground motion. The ground motion acceleration curve and the displacement at floor 1 for mode 1 are shown in Fig. 4(a) and (b). The results of the present analysis are compared with the analytical solutions for an SDOF system as shown in Table 3; it is observed that the present framework results match the analytical results well. Figures 5, 6 and 7 show the displacement, velocity and acceleration for the SDOF system, respectively. The maximum displacement occurs at 2.5 s and is about −0.02 mm, as can be observed from Fig. 5. From Fig. 6, it is observed that the maximum velocity occurs at 2.5 s and is about ±0.45 mm/s. The maximum acceleration occurs at 2.5 s and is about +12 mm/s², as can be observed from Fig. 7. The mode 1, mode 2 and mode 3 displacements at the different floors of the MDOF system are analyzed within the response spectrum framework as shown in Figs. 8, 9
Table 1 Floor-wise mass and stiffness values for the MDOF system

Floor no. | Mass (kg) | Stiffness (N/m)
1 | 10000 | 16357.5 × 10³
2 | 10000 | 16357.5 × 10³
3 | 10000 | 16357.5 × 10³
4 | 10000 | 16357.5 × 10³
5 | 10000 | 16357.5 × 10³
6 | 5000 | 16357.5 × 10³

Table 2 Ground motion data of EL Centro earthquake

S. no. | Time (s) | Acceleration (g)
1 | 0.000000 | 0.006300
2 | 0.020000 | 0.003640
3 | 0.040000 | 0.000990
4 | 0.060000 | 0.004280
5 | 0.080000 | 0.007580
6 | 0.100000 | 0.010870
... | ... | ...
1551 | 31.000000 | –0.000510
1552 | 31.020000 | –0.000440
1553 | 31.040000 | –0.000380
1554 | 31.060000 | –0.000320
1555 | 31.080000 | –0.000250
1556 | 31.100000 | –0.000190
1557 | 31.120000 | –0.000130
1558 | 31.140000 | –0.000060
1559 | 31.160000 | 0.000000
1560 | 31.180000 | 0.000000
Fig. 4 For single-degree-of-freedom system: (a) Ground acceleration versus time and (b) Modal displacement at floor 1
Table 3 Comparison of the different parameters with the analytical solution

Parameter | Present work | Analytical solution
Natural frequency (rad/s) | 24.495 | $\omega = \sqrt{k/m} = \sqrt{60.0 \times 10^3 / 100} = 24.495$
Natural period (s) | 0.257 | $T = 2\pi/\omega = 2\pi/24.495 = 0.257$
Base shear (kN) | 1.13987 | $V = k\,u = 1.13987$
Max. strain energy (N m) | 10.82752 | $U = \tfrac{1}{2} k\,u^2 = 10.82752$
Fig. 5 Displacement for the single-degree-of-freedom system
and 10. It is observed that in mode 1, the building as a whole moves back and forth about the stationary point, whereas in mode 2 and mode 3, some part of the building moves to one side while the other part moves to the other side of the mean position at the stationary state. Finally, Fig. 11 shows snap-shots of the deformation shapes of the six-storey building at various time instants, taken from the displacement animation over the duration of the earthquake. The building moves back and forth about the stationary mean position, and the maximum displacement occurs on the top floors, at about ±0.1 m, as can be observed from Fig. 11.
4 Conclusions

This paper describes a methodology for modelling the seismic performance of reinforced concrete buildings within the response spectrum framework using MATLAB R2021a. Numerical examples are presented to demonstrate the profound effect of the developed
Fig. 6 Velocity for the single-degree-of-freedom system
Fig. 7 Acceleration for the single-degree-of-freedom system
framework for the SDOF and MDOF systems. For the SDOF system, the results from the present analysis framework are compared with the analytical solutions, and it is observed that the present framework results match the analytical results well. The maximum displacement is about −0.02 mm, the maximum velocity is about ±0.45 mm/s and the maximum acceleration is about +12 mm/s², all occurring at 2.5 s. For the MDOF system using the El Centro ground motion data, the building is analyzed in the present work within the framework. In mode 1, the building as a whole moves back and forth about the stationary point, but in mode 2 and mode 3,
Fig. 8 Mode 1 displacements at the different floors for EL Centro Ground Motion
Fig. 9 Mode 2 displacements at the different floors for EL Centro Ground Motion
Fig. 10 Mode 3 displacements at the different floors for EL Centro Ground Motion
some part of the building moves to one side, whereas the other part moves to the other side of the vertical line at the stationary point. The maximum displacement occurs on the top floors and is about ±0.1 m.
Fig. 11 Snap-shots of the deformation shapes of the six storey building at various time instants
References
1. George P, Vagelis P (2018) OpenSeismoMatlab: a new open-source software for strong ground motion data processing. Heliyon 4:e00784
2. Pengcheng J, Amir HA (2019) Artificial intelligence in seismology: advent, performance and future trends. Geosci Front 11(3):739–744
3. Reza L, Rahimeh R (2020) Three-dimensional numerical model for seismic analysis of structures. Civ Eng Arch 8(3):237–245
4. Rith M (2016) Seismic analysis: response spectrum analysis method with MATLAB. Technical report
5. Mostafa T (2016) Simulation of strong ground motion parameters of the 1 June 2013 Gulf of Suez earthquake, Egypt. NRIAG J Astron Geophys 6:30–40
6. McCalpin JP, Thakkar MG (2003) 2001 Bhuj-Kachchh earthquake: surface faulting and its relation with neotectonics and regional structures, Gujarat, western India. Ann Geophys 46(5)
7. Freeman SA (2007) Response spectra as a useful design and analysis tool for practicing structural engineers. ISET J Earthq Technol 44(1)
8. Susan EH, Stacey M, Roger B, Gail MA (2002) The 26 January 2001 M 7.6 Bhuj, India, earthquake: observed and predicted ground motions. Bull Seism Soc Am 92(6):2061–2079
9. Raghu Kanth STG, Iyengar RN (2007) Estimation of seismic spectral acceleration in peninsular India. J Earth Syst Sci 116(3):199–214
10. Sumer C, Dinesh K, Bal KR (2010) Estimation of strong ground motions for 2001 Bhuj (Mw 7.6), India earthquake. Appl Geophys 166:1317–1330
11. Mario P (1987) Structural dynamics: theory and computation, 2nd edn. CBS Publisher, New Delhi
12. Anil KC (1995) Dynamics of structures: theory and applications to earthquake engineering. Prentice-Hall, New Jersey
A Survey on DDoS Detection Using Deep Learning in Software Defined Networking M. Franckie Singha and Ripon Patgiri
Abstract In this era of the internet, cyber attacks are one of the most prominent issues all over the world. The distributed denial of service (DDoS) attack is one such attack that has a catastrophic effect and is hard to detect, even in software defined networking (SDN). SDN is an emerging field in the area of computer networks. In this paper, we discuss the current trends in detecting DDoS with the help of deep learning in an SDN environment. Deep learning has gained popularity in recent years due to its efficient feature detection and dimensionality reduction in classifying data with maximum accuracy. We have analyzed the deep learning models and their mechanisms, the performance metrics, and the datasets from the various published papers. Keywords Software defined networking · SDN · Deep learning · Distributed denial of service · DDoS · Attack detection
1 Introduction

Software Defined Networking (SDN) is an emerging paradigm that has enhanced network traffic maintainability using resources at a low cost, without wasting much energy, by building a centralized controller. SDN can monitor each node and link in a vast network. Despite its utility, SDN has inherent security problems due to its centralized architecture: once the controller is compromised, the whole network may come down, or the attacker can control the devices connected to the controller. The distributed denial of service (DDoS) attack is one of the most dangerous and challenging attacks to detect in centralized networks such as SDN. DDoS sends large amounts of data by flooding packets, eventually consuming device resources. Moreover, it can install random flows in the flow table, forcing the switch to miss

M. F. Singha (B) · R. Patgiri
National Institute of Technology, Silchar, Cachar 788010, Assam, India
e-mail: [email protected]
R. Patgiri
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_37
the attack. Various research efforts have been made to handle the DDoS attack in the SDN environment, and numerous researchers have already proposed defenses against DDoS attacks in the SDN architecture. Since SDN has a centralized view of the network, it can monitor the overall data flow in the network. By collecting flow statistics, SDN can implement various policies installed in the controller, based on which it can detect attacks. Although such an approach is beneficial, it is not efficient and fails to detect zero-day attacks. As machine learning research has advanced, machine learning algorithms have been leveraged to detect diverse attacks such as DDoS, and various models have been developed to detect such attacks in the SDN environment. Recently, deep learning has been extensively used to detect anomalies in SDN due to its accuracy. In this paper, we focus mainly on deep learning-based intrusion detection mechanisms in the SDN environment. The paper's organization is as follows: Sect. 2 discusses the basics of SDN and the DDoS attack, Sect. 3 discusses the deep learning-based DDoS detection mechanisms, Sect. 4 analyses these mechanisms, and the paper ends with the conclusion.
2 DDoS Attack on SDN

DDoS attacks on the SDN environment are made mainly at the data plane, the control plane, and the southbound interface connecting the two planes. The responsibility of the data plane is to forward a data packet to the controller when there is no flow match in the switch; the controller then installs the flow rules in the flow table. The attacker takes advantage of this feature by sending a large number of attack packets, which modify the existing flow rules with new flows and saturate the flow table entries. This modification causes normal packets to drop, which eventually leads to the unavailability of the devices in the network. The southbound interface connects the data plane with the control plane. This channel has limited bandwidth, which the attacker can exploit by sending many packet-in messages to the controller. The DDoS attack on the controller is the most common one. The controller is the central point of the whole network, where a single point of failure can occur: a volumetric attack can cause processing delays that bring down the whole network. Figure 1 shows a simple taxonomy of DDoS attacks in SDN and their solutions using deep learning.
2.1 SDN Architecture

SDN is a rising field in the world of computer networking that removes the concept of vertical integration. It separates the control plane from the data plane of a network device, making the devices simple packet-forwarding elements. The data plane only forwards the packets, and the controller manages the forwarding devices.
Fig. 1 Taxonomy of DDoS detection in SDN with deep learning approach
The controller is the single point that controls all the devices in the network. The controller is centralized, installing network policies at the forwarding devices and monitoring the whole network, making it easier for network administrators to handle the network. SDN is broadly classified into three parts as shown in Fig. 2, namely the management layer, the control layer, and the data layer. The controller defines the functioning of the forwarding devices, and these two planes communicate with each other via an API called the southbound API; OpenFlow is one of the most popular southbound APIs. The key feature of SDN is that a programmer can manage how the network should behave by programming the device functionality. The programming part is done in the management layer using network applications and passed down to the controller, where it is implemented. The data plane is responsible for the execution of the program. The management plane manages the network policies embedded in the devices by the control plane, and the data plane executes them. A northbound API abstracts the control plane from the management plane, whereas a southbound API like OpenFlow hides the complexity of the data plane. OpenFlow is a technology that comes under the roof of SDN. A southbound API, e.g., OpenFlow, manages the data transfer between the control and data planes. OpenFlow [4] is a standardized API that facilitates the data transfer between the devices
Fig. 2 SDN Architecture depicting its three planes and interfaces
and the software-based controller in an SDN architecture. In vendor-specific devices, it is not possible to modify the source code, but with SDN and OpenFlow, one can handle the flow of packets in the network by programming the control plane. An OpenFlow switch consists of a flow table and a secure channel that communicates with the controller through the OpenFlow protocol. The flow table consists of flow entries based on which the switch forwards packets. The secure channel enables the OpenFlow switch to communicate with the controller; this interface controls the configuration and management of the devices, and the exchange of all OpenFlow messages between the switch and the controller happens via this interface. Figure 3 represents the architecture of the OpenFlow switch. There can be one or more flow tables or group tables in an OpenFlow switch. The controller can update, delete, and add flows to the flow table using the OpenFlow protocol. Flow matching starts with the first table, and if a match is not found, it may continue to the next flow table in case the switch has more than one. When a flow is matched, the corresponding action associated with that flow is taken; if not, the switch acts according to the table-miss flow configuration. The flow table has three main components: header fields, counters, and actions.
• Header field: The header fields contain 12 tuples that are used to match incoming packets. Flows can match on one or more fields.
• Counter: Every table, flow, queue, or port maintains its own counter. A network administrator can implement these OpenFlow counters in software.
Fig. 3 An OpenFlow architecture depicting a switch and a controller connected by a OpenFlow protocol
Fig. 4 An OpenFlow switch depicting multiple flow tables
• Actions: One or more actions are associated with every flow in the flow table. If no action is specified, the default action is to drop the packet. Some of the required actions specified in OpenFlow 1.0 are as follows:
– ALL: Send the packet out all interfaces, not including the incoming interface.
– CONTROLLER: Encapsulate and send the packet to the controller.
– LOCAL: Send the packet to the switch's local networking stack.
– TABLE: Perform actions in the flow table. Only for packet-out messages.
– IN PORT: Send the packet out the input port.
– FLOOD: Flood the packet along the minimum spanning tree, not including the incoming interface.
– NORMAL [5]: An egress-only port; this logical port allows the switch to function like a traditional Ethernet switch. According to the protocol functional specification, this port is only supported by a hybrid switch.
The OpenFlow switch with multiple flow tables is shown in Fig. 4, and Fig. 5 shows a flow chart of the progress of a packet through multiple tables.
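The table-miss/CONTROLLER behaviour described above can be seen concretely in a controller application. A hedged sketch using the Ryu OpenFlow controller framework (one widely used controller, not one prescribed by this survey): it installs a priority-0 table-miss entry whose action is CONTROLLER, so unmatched packets are encapsulated and sent up.

```python
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import CONFIG_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3

class TableMissInstaller(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    @set_ev_cls(ofp_event.EventOFPSwitchFeatures, CONFIG_DISPATCHER)
    def switch_features_handler(self, ev):
        dp = ev.msg.datapath
        ofp, parser = dp.ofproto, dp.ofproto_parser
        # Table-miss entry: match everything, send unmatched packets
        # to the controller (the CONTROLLER action described above).
        match = parser.OFPMatch()
        actions = [parser.OFPActionOutput(ofp.OFPP_CONTROLLER,
                                          ofp.OFPCML_NO_BUFFER)]
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                             actions)]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=0,
                                      match=match, instructions=inst))
```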
Fig. 5 A flow chart of packet flow in an OpenFlow switch from one table to another
2.2 Various Deep Learning Algorithms

Convolutional neural network (CNN): CNN is a feed-forward deep learning algorithm with multiple hidden layers that helps in classifying the input data. Figure 6 represents a CNN architecture. There are four essential layers in a CNN:
– Convolution layer: It is the first layer, which performs the convolution operation using several filters. The data and the filters are represented in the form of matrices. The convolved feature matrix is generated using the convolution operation, which is a dot product of the data matrix and the filter matrix.
– Pooling layer: It is the second layer, which helps in reducing dimensionality. This layer uses multiple filters to identify the important features in the data.
– Flattening: It converts the 2D array from the above layer into a 1D array.
– Fully connected layers: This layer is responsible for the final classification of the data.
Recurrent Neural Network (RNN): Unlike CNN, an RNN saves the output information of a layer and feeds it back to the layer as an input to predict the output of that layer. An RNN has a looping component that allows information to flow from one stage to the next. Thus, an RNN can handle sequential data by memorizing the previous input along with the current input. Figure 7 shows the conversion of a feed-forward neural network to an RNN, while Fig. 8 shows the dependency of the output "k(t)" of a layer on the previous input "k(t – 1)" and the current input "x(t)" at time "t".
Fig. 6 Processing of input data using various hidden layers of CNN
Fig. 7 Conversion of feed-forward NN to RNN
Fig. 8 RNN calculating output information based on previous input and current input
Long Short-Term Memory (LSTM): RNN suffers from the vanishing gradient problem, in which information is lost as learning progresses when the supplied data sequence is very long. LSTM helps in countering this problem. LSTMs are a special kind of RNN that can retain information long enough to make a correct prediction. Figure 9 shows the architecture of the LSTM model. LSTM is composed of three gates: the forget gate, the input gate, and the output gate.
– Forget gate: It is the first step, which determines which information to retain or forget. This is done by the sigmoid function, considering the previous hidden state (h_{t−1}) and the current input (x_t).
Fig. 9 LSTM architecture showing the sigmoid and tanh function
– Input gate: It is the second step, composed of a sigmoid function and a tanh function, which determines which information is to be added to the current state. The sigmoid function decides which values to keep, with "0" being the least important and "1" the most important, and the tanh function decides the level of importance (–1 to 1) of the value.
– Output gate: The output gate determines the final output. The sigmoid function determines which part of the current state comes out as the output; the current state is then passed through the tanh function and multiplied by the output of the sigmoid function.
Gated Recurrent Unit (GRU): It is also a special kind of RNN that can solve the vanishing gradient problem, and it is much simpler than LSTM. It uses two vectors or gates, a reset gate and an update gate. The update gate decides which information is the most relevant to add, and the reset gate determines the least important past information to forget.
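Since the surveyed detectors are mostly built with Keras/TensorFlow (Sect. 4), a hedged sketch of a small LSTM-based binary flow classifier of the kind these papers describe follows; the layer sizes, the 8-feature input, and the window length are illustrative assumptions, not values from any single surveyed paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input: windows of 10 consecutive flow-statistic vectors with 8 features
# each (e.g., packet counts, byte counts, flow duration);
# output: probability that the window is DDoS traffic.
model = models.Sequential([
    layers.Input(shape=(10, 8)),
    layers.LSTM(64),                        # summarizes the flow window
    layers.Dropout(0.2),                    # dropout rate used in [8, 13]
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(DDoS)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=20, validation_split=0.1)
```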
2.3 Background of DDoS Attack

Distributed denial of service (DDoS) attacks fall under denial of service (DoS) attacks. This kind of attack is a threat to network security; its aim is to target the network with fake traffic, thereby consuming all its network resources. DDoS does not aim to breach network security but to block legitimate users from using the service by jamming the network with a huge amount of traffic. Every network has a definite amount of resources allocated to it, and exceeding that limit brings the whole network down. The central concept behind DDoS is flooding network traffic in the disguise of a legitimate user. Moreover, a DDoS attack can be multi-vector,
a combination of more than one DDoS attack. DDoS attacks can be broadly classified into three categories:
– Volumetric attack: This kind of attack aims to consume the network bandwidth. Typical examples are ICMP floods and UDP floods.
– Protocol-based attack: Its main aim is to saturate the server resources. Typical examples are Ping of Death and SYN floods.
– Application-based attack: Its main aim is to overload the server with a huge number of requests. Typical examples are HTTP floods and DNS query floods.
The slow DDoS attack is one of the most serious and hardest-to-detect attacks.
3 Deep Learning-Based Intrusion Detection Techniques in SDN

In recent years, researchers have shifted their interest from traditional machine learning to deep learning. Deep learning has been extensively used in face recognition, image detection, recommendation systems, etc. The fundamental reasons behind such a shift are that deep learning has shown better performance in prediction and in handling large amounts of data. As the data increases, the complexity of analyzing the features increases. Traditional machine learning approaches need expert domain knowledge for selecting the relevant features or during feature extraction. Deep learning, on the other hand, does not require human intervention in feature extraction procedures. Unlike machine learning, a deep learning model learns features incrementally through its hidden layers, learning the low-level features first and then the higher-level features. With this in mind, various researchers have successfully used deep learning models for detecting cyber attacks like DDoS. The autoencoder is a neural network used as a base in designing the deep learning models in [1, 2, 7], as it helps in efficient dimensionality reduction for complex datasets. In [7], Quamar Niyaz et al. proposed a network intrusion detection system that uses a stacked autoencoder (SAE)-based deep learning approach to detect multi-vector DDoS attacks. Multiple sparse autoencoders are stacked on top of each other to form the SAE, such that the output of one layer acts as the input of the next layer. A sparse autoencoder is a neural network having three layers: an input layer, a hidden layer, and an output layer. The input and output layers are composed of "M" nodes, and the hidden layer is composed of "N" nodes; the M nodes in the input represent the "M" features. The activation function at the output and hidden layers is a sigmoid function. The implementation is done by building a model composed of three modules: the traffic collector and flow installer, the feature extractor, and the traffic classifier. The authors analyzed each packet while computing the flows and restricted the use of sFlow to minimize false positives. They use a sensor network, NSL-KDD, and a KDD-Cup99 dataset, from which 68 features are extracted and eventually reduced using the SAE. The accuracy in detecting individual DDoS attacks is 95.65%, and the accuracy reaches 99.82% for detecting malicious and attack classes.
It gives better results compared to a Softmax classifier and a neural network, with accuracies of 94.30% and 95.23%, respectively. The ROC curve for the SAE shows above 90% true positives and below 5% false positives. Mahmoud Said Elsayed et al., in the paper [2], propose a model to handle DDoS attacks. It uses an autoencoder in which each layer is a simple Recurrent Neural Network (RNN) layer. Since the autoencoder is a feed-forward neural network, it suffers from lossy compression; this problem is addressed by the RNN with its cyclic connections. The dataset used is CICDDoS2019, where the feature extraction is performed by the RNN-autoencoder model, extracting 80 features using CICFlowMeter, and Softmax regression classifiers are used to classify malicious and normal data. The model, with a learning rate of 0.0001, gives an accuracy of 99% with precision, recall, and F1-score of 0.99 each. Nisha Ahuja et al., in the paper [1], use various deep learning models to handle DDoS attack detection, of which the MLP-SAE model outperforms the others. The SAE is used along with the MLP to address the dimensionality problem of the MLP, as the SAE helps in significant dimension reduction. The evaluation uses a custom SDN-based dataset; the accuracy, precision, recall, and F1-score are 99.75%, 99.69%, 99.94%, and 99.82%, respectively. Many papers [3, 9, 11, 16] have used CNN with other deep learning models to achieve better performance. In paper [11], Yang Qin et al. use a model based on CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) to detect anomalies. The CNN extracts the most relevant features from the dataset, while the RNN, with a hidden layer of 300 neurons, keeps track of which previous information is to be saved. The core of the RNN in this paper is the GRU (Gated Recurrent Unit), which solves the gradient disappearance problem of the RNN. The model has been tested on a custom dataset (Sim_data) and the CTU-13 dataset; the accuracy on Sim_data and CTU-13 is 99.84% and 99.86%, respectively. Paper [3] uses an ensemble deep learning model to detect DDoS attacks. Various ensembled models (CNN+CNN, RNN+RNN, and LSTM+LSTM) and a hybrid model (RNN+LSTM) are tested on the CICIDS2017 dataset, of which the ensemble CNN model outperforms all. The CNN architecture contains three 2D-convolutional layers, two max-pool layers, one flatten layer, and three fully connected layers; ReLU is used as the activation function in the hidden layers and sigmoid at the output layer. Compared to the other models proposed in the paper, the ensemble CNN shows better accuracy, i.e., 99.45%, at the cost of high training and testing time and a CPU usage of 6.025%. In paper [9], Beny Nugraha et al. use CNN with LSTM (Long Short-Term Memory) to handle slow DDoS attacks, which are hard to detect as they resemble normal data. The model uses a 1D-CNN layer followed by three more layers to enhance the extraction of the most relevant features. An LSTM layer is then used to level up the learning process by storing the previous information of the data, with ReLU as the activation function. The last layer is fully connected, with a sigmoid function to classify DDoS attacks and normal traffic. The authors used a custom dataset to test the model, where the accuracy of DDoS detection is 99.998%, with a learning rate of 0.0005 and a dropout rate of 0.3. In paper [16], Jiushuang Wang et al. proposed a deep learning model to detect DDoS attacks in an SDN-IoT environment. A simple CNN is used, where the model is tested with a varying number of layers.
The model 3C2P2F, indicating three convolutional layers, two pooling layers, and two fully connected layers, showed
better performance than the other models suggested in the paper when tested with a custom dataset. Tuan A Tang et al. use a DNN in [13] and GRU-RNN in [14, 15] on the NSL-KDD dataset to model a network intrusion detection system in an SDN environment. All three papers use flow-based anomaly detection techniques with six features. The accuracy achieved with the DNN is 75.75%. In [14], the model contains one input layer, three hidden layers (6, 4, 2), and one output layer, and gives an accuracy of 89% with a 0.001 learning rate, while the model in [15] uses one input layer, three hidden layers (5, 4, 3), and two output layers, and gives an accuracy of 90% with a learning rate of 0.001. Chuanhuang Li, in paper [6], proposed a deep learning-based model to detect and defend against DDoS attacks in an SDN environment. The detection model uses a CNN and two RNNs (a forward layer and a backward layer), along with GRUs to solve the gradient problem of the RNN; the DDoS defense model is based on LSTM. The DDoS detector model is evaluated with the ISCX2012 dataset against various DL network models, and the model with 6 LSTM/GRU layers outperforms the others with an accuracy of 98% in detecting and defending against the attack. In papers [10, 12], the models are purely based on LSTM. Rojalina Priyadarshini et al., in [10], proposed a deep learning SDN-based DDoS defense mechanism in a fog environment. A total of 192 features are extracted from the Hogzilla dataset. The LSTM model is tested with different parameters, and the one with 128 neurons in two hidden layers and a dropout rate of 0.2 performs the best, with an accuracy of 98.88%. Shaopeng Guan et al., in [12], propose a DDoS detection model for the SDN controller. Based on Renyi entropy, an anomaly in the network is detected, and a BiLSTM model is used on eight DDoS features to classify the DDoS attack. The BiLSTM contains a forward LSTM for forward feature extraction, a backward LSTM for backward feature extraction, and a softmax classifier for classification. The entropy threshold, when set to 1.2000, gives an accuracy of 98.88%. Table 1 shows the deep learning models used in the various papers along with the datasets.
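As a hedged illustration of the entropy-based first stage described for [12] — Renyi entropy over destination-address frequencies in a traffic collection window — the sketch below flags a window whose entropy drops below a threshold; the order α = 2 and the toy traffic are assumptions, while the 1.2000 threshold is the value reported in [12].

```python
import math
from collections import Counter

def renyi_entropy(items, alpha=2.0):
    """Renyi entropy of order alpha over the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1.0:  # limit case: Shannon entropy
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)

# Destination IPs seen in one collection window (3 s in [12]); a flood
# concentrated on one victim drives the entropy down.
window = ["10.0.0.5"] * 80 + ["10.0.0.7"] * 10 + ["10.0.0.9"] * 10
h = renyi_entropy(window)
THRESHOLD = 1.2000  # entropy threshold reported in [12]
if h < THRESHOLD:
    print(f"entropy {h:.4f} below threshold -> possible DDoS, run BiLSTM stage")
else:
    print(f"entropy {h:.4f} -> traffic looks normal")
```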
4 Analysis

4.1 Based on Methodology

Data collection and pre-processing of the data. Almost all the papers discussed in this survey include pre-processing of the dataset of collected flow statistics to evaluate their proposed models. The time interval for the collection of flow statistics is chosen by the experts. A longer collection interval may lead to failure of the switch and controller due to extensive accumulation of data, while a shorter interval increases the load at the controller as the detection frequency rises. The optimum time interval in papers [8, 15] is set to 1 s, while in [9, 12] it is set to 3 s. The feature extraction methodologies are either pre-defined, as discussed in [15], or use CICFlowMeter, as in [2, 3, 17]. The collected data or datasets need to be scaled, as the generated information is of different scales.
Table 1 Deep learning model and the dataset used in training and testing

Paper   DL model           Dataset
[1]     SAE+MLP            Custom
[2]     RNN+AutoEncoder    CICDDoS2019
[3]     CNN                CICIDS2017, ISCX
[6]     CNN+RNN+LSTM       ISCX
[7]     SAE                Custom
[9]     CNN+LSTM           Custom
[10]    LSTM               CTU-13, ISCX
[11]    CNN+RNN            CTU-13
[12]    LSTM+RNN           Custom
[13]    DNN                NSL-KDD
[14]    GRU+RNN            NSL-KDD
[15]    GRU-RNN            NSL-KDD
[16]    CNN                Custom
Normalization techniques like max-min and Z-score are used. Encoding of the categorical and numeric data is also done based on the model used.

Deep learning model. Most of the deep learning frameworks are implemented using Keras and TensorFlow. The deep learning models for handling DDoS detection are mostly based on SAE, CNN, RNN, and LSTM, used standalone or in combination to meet the desired goal. CNN is a simple model based on the feed-forward technique; it can efficiently perform feature detection and dimensionality reduction for datasets with extensive features. RNN is used extensively in models that feed on data collected in sequential order, as in [2, 10, 15], since it can preserve previous information to predict the outcome. RNN has an inherent problem of vanishing gradients; to mitigate it, LSTM or GRU is combined with RNN in the models of [6, 9, 11, 12, 15]. The most widely used activation functions, which preserve the non-linearity in the data and allow fast learning, are the Sigmoid function and ReLU (Rectified Linear Unit).

Tuning parameters. The model's effectiveness also depends on the values of the various parameters used while designing it. Parameters like the learning rate and dropout rate affect the efficiency of the model. To prevent over-fitting of the neural network and to increase efficiency, the dropout rate is tuned to 0.2 in [8, 13], while in [9] it is set to 0.3. Various learning rates are tested in [2, 13], where 0.0001 is selected as the optimum, and 0.001 in [8, 15], while 0.0005 showed better performance in [9].
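As an illustration of the scaling step described above, the following is a minimal sketch (the feature matrix X and library choice are ours, not taken from the surveyed papers) of max-min normalization and Z-score standardization:

```python
import numpy as np

def min_max_scale(X):
    # Rescale each feature column to the [0, 1] range
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def z_score_scale(X):
    # Centre each feature column to zero mean and unit variance
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example: scale a small matrix of flow statistics (values illustrative)
X = np.array([[100.0, 0.2], [250.0, 0.8], [175.0, 0.5]])
print(min_max_scale(X))
print(z_score_scale(X))
```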
4.2 Based on Evaluation Techniques

Evaluation techniques used in the models are the standard metrics: Accuracy, Precision, F1-Score, Recall, and Receiver Operating Characteristics (ROC). These parameters are calculated from a confusion matrix, which is a compelling predictive analysis tool. In the matrix, True Positive (TP) is a result where the model accurately predicts the positive class, True Negative (TN) is a result where the model accurately predicts the negative class, False Positive (FP) is a result where the model predicts the positive class when the true class is negative, and False Negative (FN) is a result where the model predicts the negative class when the true class is positive.

– Accuracy: It is used to find the percentage of values that are correctly categorized. It indicates how frequently the classifier is correct. Accuracy = (TP + TN)/(TP + TN + FP + FN)
– Precision: Precision is used to evaluate the model’s ability to correctly identify positive values. Precision = TP/(TP + FP)
– F1-Score: It is a harmonic mean of precision and recall. It comes in handy when there is a need to balance precision and recall. F1-Score = (2 ∗ Recall ∗ Precision)/(Recall + Precision)
– Recall: It is used to figure out how well the model can predict true positive values. Recall = TP/(TP + FN)
– ROC: ROC is a graph showing how well the model classifies the positive and negative classes. The area under the curve measures the aggregate performance of the model: the larger the area, the better the performance. The Accuracy, Recall, Precision, Learning Rate, and Dropout Rate of the papers discussed in this survey are compiled in Table 2. It is observed that lowering the learning rate increases the performance metrics: learning rates of 0.001 and 0.0001 give better performance than the other rates. Evaluating the model with a dropout rate of 0.2 has shown better performance in some of the papers. Table 3 shows the performance metrics of the models where the learning rate and dropout rate are not explicitly mentioned.
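The four scalar metrics above follow directly from the confusion-matrix counts; a minimal sketch (the counts are hypothetical, for illustration only):

```python
def classification_metrics(tp, tn, fp, fn):
    # Direct translation of the formulas above
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f1

# Hypothetical counts for a DDoS/normal classifier
print(classification_metrics(tp=980, tn=950, fp=20, fn=50))
```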
4.3 Based on Dataset

The choice of dataset plays a vital role in defining the efficiency of the model in detecting intrusions. The most used datasets in intrusion detection are KDD-Cup99, NSL-KDD, CTU-13, ISCX, CICIDS2017, and CICDDoS2019.
Table 2 Performance metrics with explicit learning rate

Paper   Accuracy (%)   Recall (%)   Precision (%)   F1-score (%)   L-rate   DR    ROC (%)
[2]     99             99           99              99             0.0001   –     98.8
[8]     99.78          99.99        99.76           99.87          0.001    0.2   –
[9]     99.998         100          99.989          99.994         0.0005   0.3   –
[10]    98.34          –            –               –              –        0.2   –
[13]    75.75          76           83              75             0.001    –     86
[15]    89             89           89              89             0.001    –     87
Table 3 Performance metrics without explicit learning rate

Paper   Accuracy (%)   Recall (%)   Precision (%)   F1-score (%)
[1]     99.75          99.94        99.69           99.82
[3]     99.45          99.64        99.57           99.61
[5]     98.30          98           98              98
[6]     99             –            –               –
[7]     95.65          –            –               90
[11]    98.86          99.76        –               99.76
[12]    98.88          –            –               –
[14]    89             90           91              90
In current scenarios, KDD-Cup99 is very rarely used due to its inherent problem of redundancy. NSL-KDD has been used in recent studies; it contains 125,973 training samples and 22,334 test samples. Each sample has 41 features, categorized into three types: basic features, content-based features, and traffic-based features. Since the SDN OpenFlow protocol cannot access the content-based features, a mix of basic and traffic-based features is used. The CTU-13 dataset is also used for training and testing models by various researchers. It contains traffic generated by a botnet, prepared at CTU University, Czech Republic, in 2011. It has 13 captures that contain normal and background traffic, where each capture has a specific malware attack. ISCX is a popular dataset used in network intrusion detection. It contains real traffic generated by multi-stage attacks to maintain anomalies in the dataset. CICIDS2017 is an updated version of ISCX that resolves various traffic-related issues present in other datasets, such as lack of diversity in traffic and unavailability of known attacks. It contains both benign and up-to-date attacks along with network traffic analysis. This dataset includes real network traffic with eleven criteria, unlike ISCX, which covers only four criteria. The recent increase in DDoS attacks has made researchers develop datasets that can capture DDoS attacks better; well-developed datasets help in developing efficient detection models. CICDDoS2019 is a compilation of attacks carried out over TCP or UDP protocols. It contains reflection-based and exploitation-based DDoS attacks, with 80 features
along with network traffic analysis. The training set contains 12 DDoS attacks, and the test set contains 7 DDoS attacks. Custom datasets are prepared by using various tools to generate traffic. Data are collected either from real-world traffic or by creating a virtual environment using Mininet and a controller. Tcpdump, port mirroring, and Scapy are mostly used to generate normal traffic, while hping3 and Scapy are widely used for generating DDoS attacks.
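A minimal sketch of such a virtual testbed, assuming Mininet is installed and an SDN controller is listening locally (the topology, controller address, and traffic commands are illustrative, not taken from any surveyed paper):

```python
from mininet.net import Mininet
from mininet.node import RemoteController
from mininet.topo import SingleSwitchTopo

def build_testbed(num_hosts=4):
    # One switch with num_hosts hosts, managed by an external controller
    topo = SingleSwitchTopo(k=num_hosts)
    net = Mininet(topo=topo,
                  controller=lambda name: RemoteController(
                      name, ip='127.0.0.1', port=6653))
    net.start()
    h1, h2 = net.get('h1', 'h2')
    # Normal traffic can be generated with ping/iperf; flow statistics
    # are then polled from the controller at a fixed interval (1-3 s)
    print(h1.cmd('ping -c 3 %s' % h2.IP()))
    net.stop()

if __name__ == '__main__':
    build_testbed()
```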
5 Conclusion

In this work, we have discussed the recent trends in detecting DDoS attacks using deep learning. A general walk-through of SDN and DDoS is also given to cover their basics. The methodology used in the various papers is analyzed, and the models are studied to show the reasons behind their selection in the respective papers. The use and importance of tuning parameters while developing a model is also discussed, and the evaluation techniques applied for analyzing a model's efficiency are studied. The selection of the dataset used for training a deep learning model substantially affects the model's efficiency in classifying normal and attack data; some of the most used datasets are therefore also discussed in this paper.
References

1. Ahuja N, Singal G, Mukhopadhyay D (2021) DLSDN: deep learning for DDoS attack detection in software defined networking. In: 2021 11th international conference on cloud computing, data science & engineering (Confluence). IEEE, pp 683–688
2. Elsayed MS, Le-Khac NA, Dev S, Jurcut AD (2020) DDoSNet: a deep-learning model for detecting network attacks. In: 2020 IEEE 21st international symposium on "A world of wireless, mobile and multimedia networks" (WoWMoM). IEEE, pp 391–396
3. Haider S, Akhunzada A, Mustafa I, Patel TB, Fernandez A, Choo KKR, Iqbal J (2020) A deep CNN ensemble framework for efficient DDoS attack detection in software defined networks. IEEE Access 8:53972–53983
4. Lara A, Kolasani A, Ramamurthy B (2013) Network innovation using OpenFlow: a survey. IEEE Commun Surv Tutor 16(1):493–512
5. Lee TH, Chang LH, Syu CW (2020) Deep learning enabled intrusion detection and prevention system over SDN networks. In: 2020 IEEE international conference on communications workshops (ICC workshops). IEEE, pp 1–6
6. Li C, Wu Y, Yuan X, Sun Z, Wang W, Li X, Gong L (2018) Detection and defense of DDoS attack based on deep learning in OpenFlow-based SDN. Int J Commun Syst 31(5):e3497
7. Niyaz Q, Sun W, Javaid AY (2016) A deep learning based DDoS detection system in software-defined networking (SDN). arXiv:1611.07400
8. Novaes MP, Carvalho LF, Lloret J, Proença ML Jr (2021) Adversarial deep learning approach detection and defense against DDoS attacks in SDN environments. Futur Gener Comput Syst 125:156–167
9. Nugraha B, Murthy RN (2020) Deep learning-based slow DDoS attack detection in SDN-based networks. In: 2020 IEEE conference on network function virtualization and software defined networks (NFV-SDN). IEEE, pp 51–56
10. Priyadarshini R, Barik RK (2019) A deep learning based intelligent framework to mitigate DDoS attack in fog environment. J King Saud Univ-Comput Inf Sci
11. Qin Y, Wei J, Yang W (2019) Deep learning based anomaly detection scheme in software-defined networking. In: 2019 20th Asia-Pacific network operations and management symposium (APNOMS). IEEE, pp 1–4
12. Sun W, Li Y, Guan S (2019) An improved method of DDoS attack detection for controller of SDN. In: 2019 IEEE 2nd international conference on computer and communication engineering technology (CCET). IEEE, pp 249–253
13. Tang TA, Mhamdi L, McLernon D, Zaidi SAR, Ghogho M (2016) Deep learning approach for network intrusion detection in software defined networking. In: 2016 international conference on wireless networks and mobile communications (WINCOM). IEEE, pp 258–263
14. Tang TA, Mhamdi L, McLernon D, Zaidi SAR, Ghogho M (2018) Deep recurrent neural network for intrusion detection in SDN-based networks. In: 2018 4th IEEE conference on network softwarization and workshops (NetSoft). IEEE, pp 202–206
15. Tang TA, Mhamdi L, McLernon D, Zaidi SAR, Ghogho M, El Moussa F (2020) DeepIDS: deep learning approach for intrusion detection in software defined networking. Electronics 9(9):1533
16. Wang J, Liu Y, Su W, Feng H (2020) A DDoS attack detection based on deep learning in software-defined internet of things. In: 2020 IEEE 92nd vehicular technology conference (VTC2020-Fall). IEEE, pp 1–5
17. Yungaicela-Naula NM, Vargas-Rosales C, Perez-Diaz JA (2021) SDN-based architecture for transport and application layer DDoS attack detection by using machine and deep learning. IEEE Access 9:108495–108512
Segmentation of Dentin and Enamel from Panoramic Dental Radiographic Image (OPG) to Detect Tooth Wear

Priyanka Jaiswal and Sunil Bhirud
Abstract The healthcare domain is a very important research field undergoing rapid technological advancement. This study considers a specialized field of oral health care, dentistry, which is a branch of medicine dealing with the anatomy, development, and diseases of the teeth. In dentistry, dental panoramic radiography (DPR) images have recently attracted growing attention in the diagnosis process due to their reliable confirmation of clinical findings. Conventionally, diagnosis is done with the help of dental radiographs and clinical examination of patients, performed manually by a dentist according to the available infrastructure and knowledge. These manual approaches motivate researchers to develop and apply machine learning and image processing techniques to interpret dental radiographs. To interpret radiographs automatically and to speed up the diagnosis process, segmentation and enhancement of an image play a very significant role in the initial phase of processing. Segmentation of a radiograph is important to separate the different parts of the tooth anatomy, but it is a major problem while processing an image due to variation in the size, shape, and arrangement of teeth, which vary from one person to another. The main motive of this work is to apply different image enhancement and segmentation techniques to panoramic (OPG) x-rays through which isolation of dentin and enamel can be done. It is an essential and primary step for finding the tooth wear index and determining tooth structure loss. This paper also discusses several image enhancement and segmentation techniques applied to panoramic (OPG) radiographs, and their results are evaluated to check the performance, efficiency, and feasibility of the available techniques for the stated problem.
P. Jaiswal (B) · S. Bhirud Department of Computer Engineering and Information Technology, Veermata Jijabai Technological Institute, Mumbai 400019, India e-mail: [email protected] S. Bhirud e-mail: [email protected] P. Jaiswal Department of IT, YCCE, Nagpur 441110, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_38
Keywords Dental x-ray image · Image enhancement · Panoramic radiograph · Orthopantomogram (OPG) · Segmentation · Tooth wear (Erosive) · Edge detection
1 Introduction

In the Indian population, the number of seniors over age 65 has exceeded 150 million, and this aging population of geriatric (elderly) people has become a global issue concerning the quality of general health [1]. The World Health Organization (WHO) has emphasized the significance of oral health as a crucial element of general health and quality of life [2]. Aging and oral problems have a long, interrelated history. Dental radiographs play a substantial role in the diagnosis of oral diseases, and understanding the internal structure of a tooth is important for an efficient diagnosis. Segmentation plays a precise role in separating the different parts of a tooth image. Here we have considered a panoramic (OPG) dental x-ray image for separating edges from the tooth to uncover dentin and enamel. There are numerous problems coupled with automatic segmentation of teeth from dental radiographic images due to unclear features of the radiograph, varying tooth size, complex structure, non-uniform background, etc. In dental imaging, primarily three types of dental x-ray images are used for diagnosis, as shown in Fig. 1: bitewing, periapical, and panoramic dental images [24]. Bitewing mainly captures the coronal half of the teeth, periapical covers the apical area of the teeth, and the panoramic x-ray image covers the complete upper and lower jaw with the teeth and associated bones. They can be standardized easily and are repeatable [24]. This study considers the panoramic x-ray image (OPG) due to its profound characteristics and popularity. All the x-ray images used in this study were collected from various hospitals in the Maharashtra region of India and analyzed by dentists before processing. The reason for not considering color images in this study is discussed in the next section, which covers the necessity of detecting erosive wear in dentistry and gives a clear idea of why automatic detection of tooth wear is needed. Teeth have a specific anatomy consisting of several cusps and fissures to serve the function of chewing food like a mortar and pestle.
Fig. 1 Types of dental radiographic images
Fig. 2 Different categories of tooth wear [4]
The approximating cusps work as the pestle and the fissures as the mortar, crushing the food coming in between. If these anatomical landmarks are worn out due to wear, the crushing mechanism gets disturbed, and food slips from between the teeth. This causes a decrease in chewing efficiency, ultimately leading to nutritional deficiency and decreased quality of life. Tooth wear, if diagnosed early, can be treated conservatively, and the patient can be counseled to prevent further damage. Dental wear has several types: dental attrition, erosion, abrasion, and abfraction [3], as shown in Fig. 2, of which dental attrition, abrasion, and erosion are commonly found in the Indian population. Tooth wear, also known as tooth surface loss, represents pathological tooth tissue loss caused by a disease process other than dental caries. Observing erosive wear prominently in radiographs requires expertise in the domain. All types of dental wear have their specific patterns and causative factors, following which they can be diagnosed. Figure 3 shows a patient's sample OPG image and color image to convey the severity of the disease, comparing an OPG image of a patient suffering from severe wear with that of a normal patient of the same age and gender. The visibility of dentin and enamel is apparent in the color image shown in Fig. 3; however, we cannot use it for analysis due to its limited view, which is not feasible for automatic diagnosis, whereas the OPG covers all 32 teeth. This implementation aims at segmentation of the dental components comprising dentin, enamel, and pulp, as this is the preliminary stage of ranking the severity of wear using a panoramic radiograph. The following discussion describes how enhancement algorithms affect panoramic radiographs (OPG) and their helpfulness in dentistry. Image enhancement algorithms are usually used to adjust images, either for visual interpretation or as preprocessing for other algorithms like edge detection or segmentation. This is a very important phase of processing because obtaining a superior-quality image is necessary for accurate interpretation. In [5], enhancement of a dental radiograph using Canny edge detection is done: a digital OPG x-ray is given as input, this grey-level image is passed through anisotropic filtering, histograms of the median-filtered image are generated, and CLAHE is applied before edge detection in simple mode. In [6], enhancement of panoramic dental radiographs using multi-scale mathematical morphology is done; in this experiment, data of 598 patients with complete dentition and with missing teeth are considered. Contrast enhancement techniques are used to improve the visual quality of an image.
Fig. 3 Clinical picture and dental panoramic image of patient suffering from severe tooth wear
Here a unique algorithm for contrast, detail, and edge enhancement and multiple feature extraction of the brightness and darkness of panoramic x-rays (OPG) is introduced, called MSTHGR. This method has a 22% error rate and 99% confidence rate, respectively. As per [7], adaptive histogram equalization (AHE) and its derivatives are used for improving the quality of an image; however, this method does not give a good and satisfactory accuracy ratio. To overcome this, a solution is given in [8], where HE techniques along with morphological operations are used to improve the contrast of OPG x-ray images. As per the literature in [9–13], some evolutionary algorithms are also available for dental x-ray enhancement, such as the Particle Swarm Optimization algorithm (PSO), the Cuckoo Search algorithm (CSA), and the Artificial Bee Colony algorithm. These evolutionary algorithms do not need selection, crossover, and mutation operations and are also helpful for obtaining optimized histogram values. In [8], the author describes image enhancement techniques augmenting dental pantomograms (OPG) to discriminate bone pathology, as it is tough to evaluate solely by medical clinicians. To identify a cyst or tumor from an OPG, preprocessing is done using an image enhancement technique based on gray-level transformation. It is concluded that a negative-transformation image can provide additional information over the original image in medical image processing. Histogram modification is used to get a graphical representation of the number of pixels and the frequency of occurrence of each grey level in an image. The subsequent study concerns the different segmentation methods, as segmentation of teeth is a significant problem due to variation in the size, shape, and arrangement of teeth, which vary from one person to another. Segmentation is a challenging task due to the existence of analogous types of body tissue, non-uniform gaps, etc. Broadly, dental x-ray segmentation techniques are categorized into three groups: region-based, boundary-based, and pixel-based. Figure 4 represents a summary of the grouping of different x-ray image segmentation techniques [13–17].
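As an illustration of the CLAHE enhancement referred to above, a minimal sketch using OpenCV (file names and parameter values are assumptions, not those of the cited works):

```python
import cv2

# Load an OPG radiograph as a grayscale image (hypothetical file name)
img = cv2.imread('opg_xray.png', cv2.IMREAD_GRAYSCALE)

# Contrast limited adaptive histogram equalization: clipLimit bounds the
# contrast amplification, tileGridSize sets the local regions
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)

cv2.imwrite('opg_xray_clahe.png', enhanced)
```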
Fig. 4 Types of dental image segmentation
In this paper, an analysis of the dataset collected from different patients is presented. Different image enhancement techniques are applied to the panoramic radiographs to compare the results on the in-house dataset. Finally, segmentation of dentin, enamel, and pulp from a panoramic image (OPG) is explored using Otsu's thresholding and the Canny edge detection algorithm.
2 Experimental Method

To date, many applications have been implemented for detecting several tooth diseases, such as caries detection, chronological age estimation from the tooth, tooth decay prediction, periapical pathosis, periapical disease detection, gender estimation, detection of trabecular bone for osteoporosis, etc. [18–22]. While implementing these systems or techniques to detect or diagnose diseases, different types of x-ray images are considered, specifically panoramic, bitewing, and periapical. As per the literature study, the bitewing radiograph is mostly used because of its simplicity; however, it has limited coverage compared to the OPG.
Fig. 5 Analysis of different categories of wear
Apart from all the mentioned diseases and techniques, we approach the new concept of detection of wear in dentistry, which is one of the crucial diseases among all. To support this fact, we analyzed and collected data of patients from different clinics with the help of questionnaires, which also covered the clinical examination done by doctors. In this questionnaire, we collected 250 observations from the age group of 30 to 60 years. The results are given in Fig. 5, which shows the different categories of wear and their percentages of occurrence. In the Indian population, attrition is found in around 47.9% of cases. This creates the necessity of automatic detection of wear and its different types [10, 22]. While experimenting to find the tooth wear index, the first step is to separate pulp, dentin, and enamel in an image. In this experimentation work, we have not used publicly available datasets because they are unsuitable for the stated research problem. To solve this issue, we approached dentists to collect images specific to the task. The experiment is therefore performed on panoramic dental radiographs (OPG) collected from different medical colleges in India, mainly situated in the Maharashtra region. The images are selected and labeled with clinical assistance to perform further processing. To extend the scope of data collection and feasibility of the work, conventional image processing techniques are applied. The objective of this experiment is to separate the pulp chamber and the demarcation of enamel and dentin. The implementation process is broadly divided into three basic steps, described below: (1) Enhancement, (2) Segmentation, and (3) Edge detection.
2.1 Image Pre-processing

After selecting images that indicated tooth wear from the dataset, pre-processing is done. It is used initially to remove unwanted information from the image. An anisotropic diffusion filter is used to improve image quality without reducing the precise information available in the image [23]. It creates progressively blurred images based on the diffusion operation. A Gaussian filter is used to lessen
the image noise. The anisotropic output is the product of the image and the Gaussian model.
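A minimal sketch of this kind of edge-preserving smoothing, using the classical Perona–Malik diffusion scheme (parameter values are assumed; the paper's exact filter settings are not stated):

```python
import numpy as np

def anisotropic_diffusion(img, niter=10, kappa=30.0, gamma=0.1):
    # Iteratively smooth the image while the edge-stopping function
    # suppresses diffusion across strong intensity gradients
    img = img.astype(np.float64)
    for _ in range(niter):
        # Finite differences toward the four neighbours
        dn = np.roll(img, -1, axis=0) - img
        ds = np.roll(img, 1, axis=0) - img
        de = np.roll(img, -1, axis=1) - img
        dw = np.roll(img, 1, axis=1) - img
        # Conduction coefficients: small across strong edges
        cn = np.exp(-(dn / kappa) ** 2)
        cs = np.exp(-(ds / kappa) ** 2)
        ce = np.exp(-(de / kappa) ** 2)
        cw = np.exp(-(dw / kappa) ** 2)
        img += gamma * (cn * dn + cs * ds + ce * de + cw * dw)
    return img
```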
2.2 Image Enhancement

Image enhancement algorithms are generally used to fine-tune images, either for visual interpretation or as preprocessing for further algorithms such as edge detection or segmentation [11, 23]. This is a significant phase of processing because obtaining a high-quality image is essential for accurate interpretation. The following techniques are applied to images from the dataset to check their feasibility and performance. Contrast stretching, also known as normalization, is used to improve the image quality and to highlight hidden information in the black and white regions of an image. The contrast stretching technique is capable of refining low-resolution dental radiographic images [24] and is used to provide a better-quality view of a dental image [4]. A superior view of an image can be achieved by varying the contrast or dynamic range of the image [25]. To do the stretching, we have to stipulate the upper and lower pixel value bounds to which the image will be normalized. Every pixel P is scaled by the following equation:

P_out = (P − c) × (b − a)/(d − c) + a    (1)
where a is the lower limit, b the upper limit, c the existing lowest pixel value, and d the existing highest pixel value. Figures 6, 7, 8 and 9 show the visual results of a panoramic dental radiograph enhanced with contrast stretching, contrast limited adaptive histogram equalization (CLAHE), histogram equalization (HE), and gamma correction (GC). They present greater definition at the edges of the teeth overall and good visualization of the diverse structures that compose them, such as enamel, dentin, and the pulp chamber. In the enhanced image, the root canals at the root level could be observed with better definition.
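A minimal sketch of Eq. 1 (the target bounds are assumed to be the full 8-bit range):

```python
import numpy as np

def contrast_stretch(img, a=0, b=255):
    # Rescale pixels from the existing range [c, d] to [a, b] per Eq. 1
    c, d = float(img.min()), float(img.max())
    out = (img.astype(np.float64) - c) * (b - a) / (d - c) + a
    return out.astype(np.uint8)
```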
2.3 Segmentation

Image segmentation is another important phase of processing in order to quantify panoramic radiographs. There are some difficulties in segmentation due to the variation of shape and intensity of teeth within the same radiograph, which also varies from one image to another. Extracting features from a panoramic radiograph is a challenging step, but it is essential, as these features are further useful in a dental diagnosis system.
Fig. 6 a Original image and its histogram b Enhanced image with contrast stretching and its histogram
Fig. 7 Enhanced image with HE and histogram of HE-enhanced image
Fig. 8 Enhanced image with CLAHE and histogram of CLAHE-enhanced image
Fig. 9 Enhanced image with GC and histogram of GC-enhanced image
To get the desired output in this work, i.e., to separate the dentin, enamel, and pulp from an image, Otsu's thresholding and an edge detection algorithm are implemented.
2.3.1 Otsu Method
The Otsu technique computes the threshold value T automatically based on the input image. Otsu's method attempts to find a threshold value that minimizes the weighted within-class variance. Since variance is the spread of a distribution about its mean, minimizing the within-class variance tends to make the classes compact [17]. A stepwise summary of the working of Otsu's algorithm is given below.

Step-1: Calculate the normalized histogram of the image:

p_i = n_i / (MN), i = 0, ..., L − 1    (2)

Step-2: Calculate the cumulative sums:

P1(k) = Σ_{i=0}^{k} p_i, k = 0, ..., L − 1    (3)

Step-3: Calculate the cumulative means:

m(k) = Σ_{i=0}^{k} i · p_i, k = 0, ..., L − 1    (4)

Step-4: Calculate the global intensity mean:

m_G = Σ_{i=0}^{L−1} i · p_i    (5)

Step-5: Calculate the between-class variance:

σ²_B(k) = [m_G · P1(k) − m(k)]² / (P1(k) · [1 − P1(k)]), k = 0, ..., L − 1    (6)

Step-6: Get the Otsu threshold k*, the value of k for which σ²_B(k*) is a maximum; if this maximum is not unique, obtain k* by averaging the values of k corresponding to the various maxima detected (Figs. 10 and 11).

Step-7: Compute the separability measure:

η(k*) = σ²_B(k*) / σ²_G    (7)
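The steps above translate directly into code; a minimal sketch for an 8-bit grayscale image (the NumPy implementation is ours, not from the paper):

```python
import numpy as np

def otsu_threshold(img):
    L = 256
    hist, _ = np.histogram(img.ravel(), bins=L, range=(0, L))
    p = hist / hist.sum()                # Step 1: normalized histogram
    P1 = np.cumsum(p)                    # Step 2: cumulative sums
    m = np.cumsum(np.arange(L) * p)      # Step 3: cumulative means
    mG = m[-1]                           # Step 4: global intensity mean
    denom = P1 * (1 - P1)
    denom[denom == 0] = np.finfo(float).eps   # avoid division by zero
    sigma_b = (mG * P1 - m) ** 2 / denom      # Step 5: between-class variance
    # Step 6: average the arg-maxima if the maximum is not unique
    return int(np.mean(np.flatnonzero(sigma_b == sigma_b.max())))
```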
Simple Thresholding at 0.3
Multiple threshoding(Between 26-230) Otsu - Optimal Segmented Image
Original Image
Simple Thresholding at 0.6
Badly illuminated Image
Simple Thresholding at 0.3
Multiple threshoding(Between 26-230) Otsu - Optimal Segmented Image
Simple Thresholding at 0.6
Badly illuminated Image
Otsu - Segmentation for bad illuminated Image
Otsu - Segmentation for bad illuminated Image
(a) Result of OPG on complete image structure (b) Result of Ostu’s thresholding on extracted tooth Fig. 10 Result of Ostu’s thresholding algorithm
Fig. 11 Result of Canny edge detection algorithm (panels: grey-scaled image and edge-detected image)
Fig. 12 Histogram of Canny edge detection before and after processing
2.3.2 Working of Canny Edge Detection Algorithm
The Canny edge detection method is one of the most robust gradient-based techniques. It consists of a linear filter with a Gaussian kernel, used to smooth the noise present in the image. It then calculates the direction and strength of the edge for each pixel in the smoothed image by differentiating the image in the vertical and horizontal directions. The gradient magnitude is computed as the root of the sum of squares of the derivatives, and the gradient direction as the arctangent of the ratio of the derivatives. In the final stage, each pixel's edge strength is set to zero if it is not greater than the edge strength of the two neighbouring pixels in the gradient direction. The remaining pixels after this procedure are considered candidate edge pixels, and an adaptive thresholding method is applied to the resulting edge magnitude image to find the final edge map [25]. The steps of the Canny edge detection algorithm are given below (Fig. 12).

Step-1: Calculate f_x and f_y:

f_x = ∂/∂x (f ∗ G) = f ∗ ∂G/∂x = f ∗ G_x    (7)

f_y = ∂/∂y (f ∗ G) = f ∗ ∂G/∂y = f ∗ G_y    (8)

G(x, y) is the Gaussian function. G_x(x, y) is the derivative of G(x, y) with respect to x:

G_x(x, y) = (−x/σ²) G(x, y)    (9)

G_y(x, y) is the derivative of G(x, y) with respect to y:

G_y(x, y) = (−y/σ²) G(x, y)    (10)

Step-2: Calculate the gradient magnitude:

magn(i, j) = √(f_x² + f_y²)    (11)

Step-3: Apply non-maxima suppression.
Step-4: Apply hysteresis edge linking.
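A minimal sketch of these four steps using OpenCV (file name and threshold values assumed): the Gaussian smoothing of Step-1 is applied explicitly, while cv2.Canny performs the gradient computation, non-maxima suppression, and hysteresis linking.

```python
import cv2

img = cv2.imread('opg_xray.png', cv2.IMREAD_GRAYSCALE)
# Step-1: Gaussian smoothing before differentiation
blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.4)
# Steps 2-4: gradients, non-maxima suppression, hysteresis edge linking
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)
cv2.imwrite('opg_edges.png', edges)
```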
3 Results and Discussion

The previous section demonstrates the several image enhancement and segmentation techniques applied to the panoramic (OPG) x-ray image to find the desired output. To achieve this, two image segmentation methods are implemented, Otsu's threshold-based image segmentation and Canny edge detection, to separate the pulp chamber, enamel, and dentin from a panoramic image. After applying the image enhancement, image segmentation, and feature extraction techniques, the following parameters can be represented from the dentist's viewpoint using image processing techniques: the pulp chamber, the demarcation of enamel and dentin, and anomalies on the enamel or occlusal surface are visible. As shown in Fig. 13, we get the separated edges, which help to understand the internal structure of the teeth with the separation of dentin, pulp, and enamel. This analysis and implementation will help to extend our work towards defining points on the dentin edges and calculating the dentin thickness using the Smith and Knight index to score the severity and subcategory of the disease. Building on this initial work, in the future we can compare the thickness of enamel and dentin with the normal standard range to determine the loss of tooth structure due to wear.
4 Conclusion

This paper presents a framework for processing a panoramic (OPG) x-ray image that indicates tooth wear. For indexing the severity of tooth wear in a patient, demarcation of the enamel, dentin, and pulp chamber is a primary step. To achieve the desired output, an anisotropic diffusion filter is first used to remove the noise.
Fig. 13 Demarcation of dentin and enamel in X-ray image
For image enhancement, different techniques are applied, such as contrast stretching, contrast limited adaptive histogram equalization (CLAHE), histogram equalization (HE), and gamma correction (GC). After analyzing the histograms, it is found that gamma adjustment and CLAHE give better results. After getting an enhanced image, Otsu's thresholding algorithm is applied to separate pulp, dentin, and enamel. When the complete image is given to this algorithm, it does not give satisfactory results, so a cropped portion of a single tooth, covering the area affected by the disease, is extracted from the image with the help of a dentist. Given as input to the algorithm, this yields satisfactory results compared to the previous attempt, but it cannot be used for future work, because gaining better accuracy in calculating the tooth wear score requires the complete image. Due to the restricted output of Otsu's thresholding method, the Canny edge detection algorithm is used to evaluate the images further. It gives better results than Otsu's thresholding method: it separates the pulp chamber, the demarcation of enamel and dentin, and the visibility of the occlusal surface to effectively identify tooth wear. An expert dentist clinically evaluated the visual outcomes obtained by the algorithms. In the future, we will focus on comparing the thickness of enamel and dentin with the normal standard range to determine the loss of tooth structure due to wear. This will help the dentist to provide a value for loss of tooth structure in quantitative form [22].
References

1. Abdi S, Spann A, Borilovic J et al (2019) Understanding the care and support needs of older people: a scoping review and categorisation using the WHO international classification of functioning, disability and health framework (ICF). BMC Geriatr 19:195
2. Garg S, Dasgupta A, Maharana SP, Mallick N, Pal B (2019) A study on impact of oral health on general health among the elderly residing in a slum of Kolkata: a cross-sectional study. Indian J Dent Res 30:164–169
3. Sykes LM, Uys A (2021) Dental images—their use and abuse. SADJ 75:584–590. https://doi.org/10.17159/2519-0105/2020/v75no10a9
4. Cai Q, Wang H, Li Z, Liu X (2019) A survey on multimodal data-driven smart healthcare systems: approaches and applications. IEEE Access 7. ISSN 2169-3536. https://doi.org/10.1109/ACCESS.2019.2941419
5. Lussi A, Ganss C (eds) (2014) Erosive tooth wear. Monogr Oral Sci, vol 25. Karger, Basel, pp 46–54. https://doi.org/10.1159/000359937
6. Obuchowicz R, Nurzynska K, Obuchowicz B, Urbanik A, Piórkowski A (2020) Caries detection enhancement using texture feature maps of intraoral radiographs. Oral Radiol 36:275–287. https://doi.org/10.1007/s11282-018-0354-8
7. Singh M, Purohit M, Khare S, Kaushik BK (2015) FPGA based implementation of real-time image enhancement algorithms for electro-optical surveillance systems. pp 1–6. https://doi.org/10.1109/ECTICon.2015.7207055
8. Georgieva VM, Mihaylova AD, Petrov PP (2017) An application of dental x-ray image enhancement. In: 2017 13th international conference on advanced technologies, systems and services in telecommunications (TELSIKS). pp 447–450
9. Georgieva VM, Mihaylova AD, Petrov PP (2017) An application of dental X-ray image enhancement, pp 447–450. https://doi.org/10.1109/TELSKS.2017.8246321
10. Jaiswal P, Bhirud S (2021b) Classification and prediction of oral diseases in dentistry using an insight from panoramic radiographs and questionnaire. In: 2021 5th international conference on information systems and computer networks (ISCON), Mathura, India, pp 1–6. https://doi.org/10.1109/ISCON52037.2021.9702402
11. Ahmad SA, Taib MN, Khalid NEA, Taib H (2012) An analysis of image enhancement techniques for dental x-ray image interpretation. Int J Mach Learn Comput 2(3):292
12. Datta S, Chaki N (2020) Dental x-ray image segmentation using marker based watershed technique in neutrosophic domain. In: 2020 international conference on computer science, engineering and applications (ICCSEA). pp 1–5. https://doi.org/10.1109/ICCSEA49143.2020.9132957
13. Rad AE, Rahim MS, Kumoi R, Norouzi A (2012) Dental X-ray image segmentation and multiple feature extraction. TELKOMNIKA Indones J Electr Eng 11(10). https://doi.org/10.13140/2.1.2109.5361
14. Fariza A, Arifin AZ, Astuti ER, Kurita T (2019) Segmenting tooth components in dental x-ray images using Gaussian kernel-based conditional spatial fuzzy c-means clustering algorithm. Int J Intell Eng Syst 12:108–117. https://doi.org/10.22266/ijies2019.0630.12
15. Silva B, Pinheiro L, Oliveira L, Pithon M (2020) A study on tooth segmentation and numbering using end-to-end deep neural networks
16. Setianingrum AH, Rini AS, Hakiem N (2017) Image segmentation using the Otsu method in dental X-rays. In: 2017 second international conference on informatics and computing (ICIC). pp 1–6. https://doi.org/10.1109/IAC.2017.8280611
17. Muramatsu C, Morishita T, Takahashi R, Hayashi T, Nishiyama W, Ariji Y, Zhou X, Hara T, Katsumata A, Ariji E, Fujita H (2020) Tooth detection and classification on panoramic radiographs for automatic dental chart filing: improved classification by multi-sized input data. Springer
18. Hwang JJ, Jung YH, Cho BH, Heo MS (2019) An overview of deep learning in the field of dentistry. Imaging Sci Dent. Published online. https://doi.org/10.5624/isd.2019.49.1.1
19. Chandra A, Yadav OP, Narula S, Dutta A (2016) Epidemiology of periodontal diseases in Indian population since last decade. J Int Soc Prev Community Dent 6:91. https://doi.org/10.4103/2231-0762.178741
20. Sela EI (2013) Segmentation on the dental periapical X-ray images for osteoporosis screening. Int J Adv Comput Sci Appl 4(7):147–151
21. MegalanLeo LME, Kalpalatha Reddy T (2020) Layer wise segmentation of dental X-ray images. Eur J Mol Clin Med 7(3). ISSN 2515-8260
22. Jaiswal P, Bhirud S (2021a) Study and analysis of an approach towards the classification of tooth wear in dentistry using machine learning technique. In: 2021 IEEE international conference on technology, research, and innovation for betterment of society (TRIBES), Raipur, India, pp 1–6. https://doi.org/10.1109/TRIBES52498.2021.9751650
23. Supriyanti R, Setiadi AS, Ramadhani Y, Widodo HB (2016) Point processing method for improving dental radiology image quality. Int J Electr Comput Eng (IJECE) 6:1587–1594. https://doi.org/10.11591/ijece.v6i4.9986
24. Jaiswal P, Katkar V, Bhirud SG (2022) Multi oral disease classification from panoramic radiograph using transfer learning and XGBoost. Int J Adv Comput Sci Appl (IJACSA) 13(12). https://doi.org/10.14569/IJACSA.2022.0131230
25. Gayathri V, Menon HP, Viswa A (2014) Challenges in edge extraction of dental x-ray images using image processing algorithms—a review. Int J Comput Sci Inf Technol
26. Lakshmi MM, Chitra P (2020) Tooth decay prediction and classification from X-ray images using deep CNN. In: 2020 international conference on communication and signal processing (ICCSP), pp 1349–1355. https://doi.org/10.1109/ICCSP48568.2020.9182141
27. Lakhani K, Minocha B, Gugnani N (2016) Analyzing edge detection techniques for feature extraction in dental radiographs. Perspect Sci 8. https://doi.org/10.1016/j.pisc.2016.04.087
Revisiting Facial Key Point Detection—An Efficient Approach Using Deep Neural Networks

Prathima Dileep, Bharath Kumar Bolla, and E. Sabeesh
Abstract Facial landmark detection is a widely researched field of deep learning, as it has a wide range of applications in many fields. These key points are distinguishing characteristic points on the face, such as the centres of the eyes, the inner and outer corners of the eyes, the centre of the mouth, and the tip of the nose, from which human emotions and intent can be explained. The focus of our work has been on evaluating transfer learning models such as MobileNetV2 and NASNetMobile, along with custom CNN architectures. The objective of the research has been to develop deep learning models that are efficient in terms of model size, parameters, and inference time, and to study the effect of augmentation, imputation, and fine-tuning on these models. It was found that while augmentation techniques produced lower RMSE scores than imputation techniques, they did not affect the inference time. The MobileNetV2 architecture produced the lowest RMSE and inference time. Moreover, our results indicate that manually optimized CNN architectures performed similarly to Keras auto-tuned architectures; however, manually optimized architectures yielded better inference time and training curves.

Keywords Inference time · Efficient transfer learning · Deep learning · MobileNetV2 · NasNetMobile · Custom CNN · Keras autotuner
P. Dileep (B) Upgrad Education Pvt. Ltd, Mumbai, India e-mail: [email protected] B. K. Bolla Salesforce, Hyderabad, India e-mail: [email protected] E. Sabeesh Liverpool John Moores University, London, England © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_39
1 Introduction

The face is critical in visual communication. Numerous nonverbal messages, such as human identity, intent, and emotion, can be automatically extracted from the face. Localization of facial key points is required in computer vision to extract nonverbal cues of facial information automatically. The term "facial appearance" refers to the distinct patterns of pixel intensity around or across facial landmarks or key points. These key points represent critical features on a human face, such as the eyes, nose, eyebrows, and lips, from which information about a person's emotion or intent can be identified. Once correctly identified, they can be used to train deep learning algorithms to perform various classification tasks. Their applications include computer interaction, entertainment, drowsiness detection, biometrics, emotion detection, security surveillance, and a range of medical applications. However, the practical applications of these models depend on their inference speed and their deployability on edge and mobile devices with lower computational power. This research aims to evaluate various transfer learning and custom models in terms of inference time and model size to test their deployability on edge/mobile devices. In this work, we used the Facial Key Point Detection dataset from Kaggle. The dataset consists of the training variables and 15 target variables, the facial key points representing various facial features. Deep learning models using custom and transfer learning architectures such as ResNet50, MobileNetV2, and NASNetMobile have been built using baselines and by combining various augmentation techniques to identify the ideal model. Additionally, the architectures have been evaluated in terms of parameter count, disk requirements, and inference timings to determine their suitability for deployment on computationally less intensive devices. We have compared our results with other state-of-the-art architectures and found that our models have higher efficiency, hence achieving the objective of this research.
2 Literature Review

Facial landmark detection algorithms can be classified into three broad categories [1] based on how they model facial appearance and shape: holistic, Constrained Local Model (CLM), and regression-based. Holistic methods mainly include Active Appearance Models (AAM) [2] and fitting algorithms. AAM works on the principle of learning from the whole face patch and involves the concept of PCA, wherein learning takes place by calculating the difference "I" between the greyscale image and an instance of the model; the error is reduced by learning the parameters like any conventional machine learning algorithm. CLM methods are slightly better than the holistic approaches as they learn from both the global face pattern and the local appearance around the nearby facial key points. They can be probabilistic or deterministic. They consist of two steps [3]: an initial step where the landmarks are
located independently of the other landmarks; in the second step, while updating the parameters, the locations of all the landmarks are updated simultaneously. In regression-based approaches, there is no initial localization of the landmarks; instead, the images are mapped directly to the coordinates of these landmarks, and the learning is done directly. These methods may be direct or cascaded. However, with the advent of deep learning, convolutional neural networks have replaced conventional regression methods with state-of-the-art results; these methods are faster and more efficient. Convolutional neural networks using LeNet have been used in many state-of-the-art works, and the principles of LeNet have been used to build many custom architectures, which have shown reduced training time [4] and reduced RMSE scores. The performance of a machine learning model also depends mainly on the type of algorithm being used. Some of the popular datasets [1] on which deep learning algorithms have been used with promising results are BU-4DFE with 68 landmark points (RMSE 5.15), AFLW with 53 landmark points (RMSE 4.26), AFW with five landmark points (RMSE 8.2), LFPW with 68 landmark points (RMSE 5.44), and Ibug 300-W with 67 landmark points (RMSE 5.54). Most of the deep learning algorithms have utilized methods such as the Task Constrained Deep Convolutional Network (TCDCN) [5], HyperFace [6], 3-Dimensional Dense Face Alignment (3DDFA) [7], and Coarse-to-Fine Auto-Encoder techniques (CFAN) [8] to achieve relatively high accuracies. The Inception architecture [9] has been used on a similar Kaggle dataset, achieving an RMSE score of 2.91. ResNet has also been used in the work done by [10], achieving an RMSE of 2.23. Similar work done using the LeNet architecture [4] achieved an RMSE score of 1.77. As more evidence was produced favouring custom architectures, the focus was directed to building custom CNN networks for facial key point detection. A comparative study was done [11] using both custom and transfer learning architectures; custom architectures were able to achieve lower RMSE scores (1.97). A similar custom model consisting of 14 layers [12] produced an RMSE score of 1.75. As evident above, attaining higher accuracy by making deep learning algorithms more efficient and precise has been the target of various studies. Tuning of deep learning models is also critical to achieving high accuracies; the Keras tuner library [13] has been widely used for this. The efficiency of fine-tuning has been further established in the classification of plant leaf disease [14], where architectures such as ResNet50, DenseNet121, InceptionV4, and VGG16 have been fine-tuned. Lightweight models such as MobileNetV2 and NASNetMobile have been gaining popularity recently due to the ease of their deployability. MobileNetV2 [15] utilizes the concept of depth-wise separable convolution to reduce the number of training parameters without affecting the accuracy of a model. Such models are ideal for tasks such as recognition of palm prints [16], breast mammogram classification [17], and identification of proper wearing of face masks [18]. Models similar to MobileNetV2 have also been built to achieve comparable accuracy with fewer parameters, as in the case of PeleeNet [19].
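To illustrate why depth-wise separable convolution shrinks models, a minimal sketch (the input shape and channel counts are assumed for illustration) contrasting a standard convolution with its depthwise-separable counterpart:

```python
from tensorflow.keras import layers, models

# Standard 3x3 convolution: 32*3*3*64 + 64 = 18,496 parameters
standard = models.Sequential([
    layers.Input((96, 96, 32)),
    layers.Conv2D(64, 3, padding='same'),
])

# Depthwise 3x3 (320 params) + pointwise 1x1 (2,112 params) = 2,432
separable = models.Sequential([
    layers.Input((96, 96, 32)),
    layers.DepthwiseConv2D(3, padding='same'),
    layers.Conv2D(64, 1),
])

standard.summary()
separable.summary()
```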
3 Research Methodology

3.1 Dataset Description

The dataset for this paper has been taken from the Kaggle competition [20]. There are 7049 images in this dataset, and 15 facial key points representing various parts of the face, such as the eyebrows, eyes, nose, and lips, in the training dataset. These facial key points represent the target variables. The test dataset consists of 1783 images. The dataset consists of images of size 96 × 96 with a single channel (grayscale images). The distribution of null values is shown below in Fig. 1: 69.64% of the data points contain at least one null value among the facial key points, while 30.36% of the images contain all key points.
3.2 Image Pre-processing

As mentioned below in the models' section, transfer learning architectures such as MobileNetV2 and NASNetMobile are used on this dataset along with custom-designed CNN architectures. These pre-trained networks require the input image to be in a three-channel format, and NASNetMobile requires the image size to be 224 × 224 × 3. Hence, the images are converted to the appropriate format. The raw image, along with the corresponding facial key points, is shown in Fig. 2.
3.3 Imputation Techniques

Forward fill and K-Nearest Neighbour (KNN) imputation. Forward fill is an imputation technique where subsequent null values are filled with the previous valid
Fig. 1 Class imbalance (of 7049 images, 4909 (69.64%) contain at least one null value and 2140 (30.36%) contain none)
Fig. 2 Visualization of images
Fig. 3 Rotation, brightness, shift, and Random noise augmentation
observations. KNN imputes a missing value from the nearest neighbours of a particular data point.
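A minimal sketch of both strategies on the Kaggle training file (file and column names follow that dataset; the imputer settings are assumptions):

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.read_csv('training.csv')
coords = df.drop(columns=['Image'])   # 30 key point coordinate columns

# Forward fill: propagate the previous valid observation downward
ffill_coords = coords.ffill()

# KNN imputation: fill each missing value from the 5 most similar rows
knn_coords = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(coords),
                          columns=coords.columns)
```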
3.4 Data Augmentation Figure 3 depicts a few of the augmentations used in this paper, such as random rotation, brightness, shift, and noise. These procedures were applied offline on the dataset’s non-null subset.
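A minimal sketch (parameter values assumed) of two of these augmentations; note that brightness and noise, unlike rotation and shift, leave the key point coordinates unchanged:

```python
import numpy as np

def augment_brightness(img, factor=1.2):
    # Scale pixel intensities, clipping back to the valid 8-bit range
    return np.clip(img * factor, 0, 255).astype(np.uint8)

def augment_noise(img, sigma=5.0):
    # Add zero-mean Gaussian noise
    noisy = img + np.random.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```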
3.5 Inference Time

The inference times of the various models have been calculated on 100 images. This can be defined as shown in Eq. 1:

Inference time on 100 images = Inference time on the total test dataset / Number of test images in the test dataset    (1)
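A minimal sketch of this measurement (the model object and test batch are assumed):

```python
import time

def inference_time_on_100(model, test_images):
    # Time prediction over the test set, then express the per-image
    # time of Eq. 1 scaled to 100 images
    start = time.perf_counter()
    model.predict(test_images, verbose=0)
    elapsed = time.perf_counter() - start
    return elapsed / len(test_images) * 100  # seconds per 100 images
```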
3.6 Loss Functions

The current problem is framed as a regression model where the target variable is a continuous numeric variable, and the loss function used here is the mean squared error, defined by the following equation:

Mean Squared Error = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²    (2)
3.7 Evaluation Metrics

The evaluation metric used in this regression problem is the root mean squared error (RMSE), as shown in Eq. 3:

Root Mean Squared Error = √((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²)    (3)
3.8 Model Architecture

Two different kinds of models have been built here: custom models and transfer learning models, namely MobileNetV2 and NASNetMobile. Tuning is done using the Keras tuner library.

Custom Models. Three different custom models have been built using a baseline architecture, manual tuning, and Keras auto-tuning. The custom models are tuned sequentially to arrive at the best-performing model in terms of RMSE scores. The models' parameter counts are listed in Table 1. Additionally, complete fine-tuning of the transfer learning architectures was performed. Tuning results in a reduction in the number of parameters: manually tuned custom models have the fewest parameters with an insignificant difference in RMSE scores, as seen in Fig. 6. Further, the tuned models are smaller than the non-tuned models, with manually tuned models having the smallest size (1.0 MB). The architectures of the manually tuned and the Keras-tuned models are shown below in Fig. 4.

MobileNetV2 and NASNetMobile. Transfer learning architectures such as MobileNetV2 and NASNetMobile have been customized to solve our regression problem. The original weights from the ImageNet classification have been used. The topmost softmax classification layer has been replaced with a GAP + regression (Dense) layer to predict the facial key points.
Revisiting Facial Key Point Detection—An Efficient Approach Using … Table 1 Parameter/model size comparison of all architectures
Custom models
Total parameters
517
Model size (MB)
Baseline CNN model
1,890,366
7.6
Manually optimized CNN
235,834
1.0
Keras optimized CNN–No imputation
306,750
1.27
Keras optimized CNN–Forward fill
246,478
1.03
Keras optimized CNN–KNN imputed
246,062
1.58
Keras optimized CNN–Augmentation
364,318
1.50
MobilenetV2
2,257,984
9.66
NasNetMobile
4,301,426
18.48
The models are experimented with using the original baseline weights of ImageNet and by completely fine-tuning all the layers of the architecture, to evaluate the RMSE scores and inference time on prediction. The model architectures are shown in Fig. 5.
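A minimal sketch of this setup (input size and optimizer assumed), with the softmax top replaced by GAP and a 30-unit linear regression head (15 key points × 2 coordinates):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import MobileNetV2

base = MobileNetV2(input_shape=(96, 96, 3), include_top=False,
                   weights='imagenet')
base.trainable = True   # False for the baseline, True for full fine-tuning

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(30),   # linear output: (x, y) for each of 15 key points
])
model.compile(optimizer='adam', loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
```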
4 Results

The results of the experiments are explained in the following subsections, covering the evaluation of RMSE scores, model size, and number of parameters.
4.1 Evaluation of RMSE Scores

RMSE scores on the test dataset have been calculated for both custom and transfer learning models, as shown in Figs. 6 and 7.

Huge Parameters of Baseline Models. Among the custom models, the initial baseline model was created using the conventional architecture without tuning the layers. Figure 6 shows that the custom baseline model outperformed the manually optimized and Keras fine-tuner optimized models; however, manually optimized models performed similarly to Keras fine-tuner optimized models. It is worth noting that both fine-tuned MobileNetV2 and NASNetMobile trained on augmented data exhibit a 4–5× improvement in RMSE scores compared to their non-fine-tuned counterparts (Fig. 7). Surprisingly, compared to its non-fine-tuned counterpart, fine-tuned MobileNetV2 demonstrated a 2× improvement in RMSE on KNN imputed data.
Fig. 4 Manually tuned CNN (Left) and Keras Tuned CNN architecture (Right)
Supremacy of Models Trained on Augmented Data. As seen in Tables 2 and 3, augmentation of custom models results in a significant increase in the performance of the models. A sharp decrease in the RMSE scores on the fine-tuned model shows that augmentation performs better than any imputation technique.
4.2 Evaluation of Model Size and Parameters Among all the models built, manually tuned custom models have the least number of parameters (235 K) and least model size against Keras auto-tuned custom models trained on different kinds of imputation techniques and augmentation (Fig. 8).
Fig. 5 MobilenetV2 (left) and NasnetMobile architecture (right)
Fig. 6 RMSE scores of custom CNN models
Fig. 7 RMSE scores of transfer learning models
Table 2 Comparison of RMSE performance of transfer learning models

| Models | No imputation | Forward fill imputation | KNN imputation | Aug |
|---|---|---|---|---|
| MobileNetV2 baseline model | Similar performance | Similar performance | + | + |
| MobileNetV2 fine-tuned | ++ | +++ | Similar performance | + |
| NasNet baseline model | Similar performance | Similar performance | | |
| NasNet Model fine-tuned | | | | +++ |
Table 3 Inference time analysis

| Model | No impute (sec) | Forward Fill (sec) | KNN Impute (sec) | Aug (sec) |
|---|---|---|---|---|
| CNN baseline model | 1.99 | 1.97 | 1.98 | 2.01 |
| CNN Manual tuned model | 1.4 | 1.33 | 1.34 | 1.34 |
| CNN Keras tuned | 2.72 | 1.52 | 4.19 | 3.58 |
| MobileNetV2–Baseline | 0.89 | 0.86 | 1 | 0.83 |
| MobileNetV2–Fine tuned | 0.84 | 0.82 | 0.82 | 0.88 |
| NasNetMobile–Baseline | 8.46 | 8.4 | 8.17 | 7.87 |
| NasNetMobile–Fine tuned | 7.95 | 7.96 | 8.01 | 7.68 |
However, in the case of augmentation, the Keras auto-tuned models slightly outperform the manually tuned custom models, at the cost of an increased number of parameters and model size.
Fig. 8 Model parameters versus Model size—All models
4.3 Inference Time Analysis The practical performance depends on the speed at which an inference can be made on the test dataset with the least computational requirements. Table 3 shows the inference time on 100 images for the various models on a Colab CPU. Architectural Efficiency in Inference Time. The inference time of a model depends on both the number of parameters and the architecture. Among all models, MobileNetV2 has the quickest inference. The enormous number of training parameters (twice that of MobileNetV2) accounts for NASNetMobile's increased inference times. The manually tuned models come in second. Compared with the custom CNN models, MobileNetV2 has ten times the number of parameters yet runs twice as fast. Augmentation does not affect the inference time in a regression scenario, as seen from the analysis.
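The paper does not describe its timing harness; one plausible way to reproduce the Table 3 style of measurement, assuming a Keras model and a batch of 100 test images, is sketched below.

```python
import time

def inference_time_seconds(model, images):
    """Wall-clock time to predict on a batch of test images (e.g. 100 images
    on a Colab CPU, as in Table 3)."""
    start = time.perf_counter()
    model.predict(images, verbose=0)
    return time.perf_counter() - start
```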
4.4 Evaluation of Training Curves Training curves for the various models are shown below to identify the best-performing model in this scenario. MobileNetV2 versus NASNetMobile. Figures 9 and 10 show that NASNetMobile has better training curves than the MobileNetV2 architecture for all imputation techniques and augmentation, which may be attributed to NASNetMobile's larger parameter count. However, when considering inference times, RMSE scores, and parameter counts, MobileNetV2 outperforms NASNetMobile.
Fig. 9 MobileNetV2—No impute, Forward Fill, KNN impute, Augmentation (Top to bottom)
Fig. 10 NASNetMobile—No impute, Forward Fill, KNN impute, Augmentation (Top to Bottom)
Fig. 11 Custom CNN Manual Tuned (Left) versus Custom CNN Keras Tuned (Right)
Manual Tuning versus Keras Auto-Tuning. Manually tuned models exhibit more stable training curves than Keras auto-tuned models, as illustrated in Fig. 11.
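For readers unfamiliar with the auto-tuning side of this comparison, the sketch below shows how a Keras Tuner random search over a small custom CNN can be set up [13]. The search space here is hypothetical; the paper does not list the exact ranges it explored.

```python
import keras_tuner as kt
import tensorflow as tf
from tensorflow.keras import layers

def build_model(hp):
    # Hypothetical hyperparameter ranges, for illustration only.
    model = tf.keras.Sequential([
        layers.Conv2D(hp.Int("filters_1", 16, 64, step=16), 3,
                      activation="relu", input_shape=(96, 96, 1)),
        layers.MaxPooling2D(),
        layers.Conv2D(hp.Int("filters_2", 32, 128, step=32), 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(hp.Int("dense_units", 64, 256, step=64), activation="relu"),
        layers.Dense(30),  # 15 (x, y) facial key points
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

tuner = kt.RandomSearch(build_model, objective="val_loss", max_trials=10)
# tuner.search(x_train, y_train, validation_split=0.2, epochs=20)
```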
4.5 Visualization of Test Images Figure 12 shows various augmented models’ predictions of facial key points. Varying performances by different models are observed in the images below. However, the images only represent a sample of the total test dataset, and hence no meaningful conclusion can be drawn.
Fig. 12 Facial Key Point Predictions by CNN manual tuned/Keras auto-tuned, MobilenetV2 and NASNetMobile on Augmentation
5 Conclusion In this work, we conducted experiments on the facial key point detection dataset by building custom CNN models optimized manually and using the Keras fine-tuner. Further, transfer learning architectures, both non-fine-tuned and fine-tuned MobileNetV2 and NASNetMobile, were used as baselines to evaluate the custom-built CNN architecture. In addition, we compared the effectiveness of imputation and augmentation. The conclusions of our work are summarized below: • Manually optimized custom CNN models outperform or are comparable to auto-tuned Keras optimized models. Manually tuned custom CNN models may be ideal when considering training curves, model size, and model parameters. • MobileNetV2 outperforms all other models with the fastest inference times but slightly compromises on model size and parameters. • In both custom CNN and transfer learning models, augmented models have lower RMSE scores, showing that augmentation is superior to imputation.
• Furthermore, there is no significant difference in performance between baseline non-tuned and baseline completely fine-tuned models, demonstrating that transfer learning models must be fine-tuned selectively, in terms of the number of layers, for a given dataset. • The experiments show that architectural efficiency significantly impacts model performance and inference time, as exemplified by the MobileNetV2 architecture, which uses depth-wise separable convolutions. • Moreover, our models have the lowest RMSE compared to other state-of-the-art architectures [4, 10–12], and to our knowledge, this is one of the very few studies that evaluated models on size, inference time, parameters, and RMSE.
References

1. Wu Y, Ji Q (2019) Facial landmark detection: a literature survey. Int J Comput Vision 127(2):115–142. https://doi.org/10.1007/s11263-018-1097-z
2. Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685. https://doi.org/10.1109/34.927467
3. Zadeh A, Lim YC, Baltrušaitis T, Morency L-P. Convolutional experts constrained local model for 3D facial landmark detection
4. Agarwal N, Krohn-Grimberghe A, Vyas R (2017) Facial key points detection using deep convolutional neural network—NaimishNet, pp 1–7. [Online]. Available http://arxiv.org/abs/1710.00977
5. Zhang Z, Luo P, Loy CC, Tang X. Facial landmark detection by deep multi-task learning
6. Ranjan R, Patel VM, Chellappa R (2016) HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. [Online]. Available http://arxiv.org/abs/1603.01249
7. Zhu X, Liu X, Lei Z, Li SZ (2019) Face alignment in full pose range: a 3D total solution. IEEE Trans Pattern Anal Mach Intell 41(1):78–92. https://doi.org/10.1109/TPAMI.2017.2778152
8. Zhang J, Shan S, Kan M, Chen X. Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment
9. Mao C (2016) Facial keypoints detection with inception structure, pp 3–5
10. Wu S, Xu J, Zhu S, Guo H (2018) A deep residual convolutional neural network for facial keypoint detection with missing labels. Signal Processing 144:384–391. https://doi.org/10.1016/j.sigpro.2017.11.003
11. Shi S (2017) Facial keypoints detection, pp 1–28. [Online]. Available http://arxiv.org/abs/1710.05279
12. Gao R (2018) Facial keypoints detection with deep learning. Journal of Computers 13(12):1403–1410. https://doi.org/10.17706/jcp.13.12.1403-1410
13. Introduction to the Keras Tuner | TensorFlow Core. https://www.tensorflow.org/tutorials/keras/keras_tune. Accessed 24 Nov 2020
14. Too EC, Yujian L, Njuki S, Yingchun L (2019) A comparative study of fine-tuning deep learning models for plant disease identification. Comput Electron Agric 161:272–279. https://doi.org/10.1016/j.compag.2018.03.032
15. Howard AG et al (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. [Online]. Available http://arxiv.org/abs/1704.04861
16. Michele A, Colin V, Santika DD (2019) MobileNet convolutional neural networks and support vector machines for palmprint recognition. Procedia Computer Science 157:110–117. https://doi.org/10.1016/j.procs.2019.08.147
17. Transfer learning in breast mammogram abnormalities classification with MobileNet and NasNet
18. Qin B, Li D (2020) Identifying facemask-wearing condition using image super-resolution with classification network to prevent COVID-19. Sensors (Switzerland) 20(18):1–23. https://doi.org/10.3390/s20185236
19. Wang RJ, Li X, Ling CX (2018) Pelee: a real-time object detection system on mobile devices. NeurIPS, pp 1–10
20. Facial keypoints detection | Kaggle. https://www.kaggle.com/c/facialkeypoints-detection/data. Accessed 28 Jun 2020
A Hybrid Framework Using Natural Language Processing and Collaborative Filtering for Performance Efficient Feedback Mining and Recommendation

Kathakali Mitra and P. D. Parthasarathy

K. Mitra (B) · P. D. Parthasarathy, Department of CSIS, WILPD, Birla Institute of Technology and Science, Pilani, India. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_40
Abstract Product development insights may be found in user reviews on app stores, product forums, and social media. This feedback is often regarded as the "voice of the users" and has been the subject of much recent research, which aims to create systems that can automatically extract, filter, analyze, and report the relevant feedback data in near real time. As per our survey results, this user feedback often does not reach the concerned organization promptly due to the volume, veracity, and velocity of feedback from multiple channels. In this research work, we propose an automatic engine, based on sentiment analysis and social media mining, which can be used for better product recommendation and automatic routing of relevant feedback to the product development teams. Our proposed solution is scheduled to run at regular intervals, pulling dynamic reviews in an optimized manner with lower time complexity and higher efficiency. The reviews are collated from distributed platforms, followed by the building of a domain classification engine on the principles of TF-IDF and a supervised classifier. This system is used to classify the reviews of the respective enterprises. A sentiment analysis system is built using combined rule-based mining and supervised learning models, which makes use of polarity to classify the feedback as positive or negative. If the polarity is negative, the feedback gets routed to the concerned enterprise for immediate action; if the polarity is positive, it is passed to a user-based collaborative filtering engine which acts as a recommendation system. Keywords Sentiment analysis · Lexicons · User-based collaborative filtering · TF-IDF vectorizer · Supervised classifier · Feedback routing · Recommendation
1 Introduction Users of software leave a lot of reviews, comments, appreciations, and disappointments about the products and services they use on social media [1, 2]. Online feedback from app stores, social media, and product forums has been shown to
include crucial product development insights that may be used to propel the growth of the product being created [2–4]. This type of feedback is sometimes referred to as the "voice of the users" [5]. Due to the various avenues of social media input, only a small portion of this feedback reaches the relevant organization. In this proposed study, we employ sentiment analysis and social media mining to create an autonomous engine that can function as a routing agent and a recommendation engine. The following research questions drove our investigation: RQ1: How much time does it take for a feedback to reach the concerned organization? Are feedbacks collated from multiple channels? RQ2: How much effort goes into identification, classification, and resolution of a feedback? After answering these research questions, we offer an extendable, resilient framework for autonomous feedback processing. This paper's structure is as follows. The second section examines the related work. The study techniques utilized and the responses to our research questions are presented in Sect. 3. In Sect. 4, the proposed solution is given; it is explained further in Sects. 5 and 6. The results of our work are presented in Sect. 7, and the paper is concluded in Sect. 8.
2 Literature Review Numerous investigations have been conducted globally to understand the usage of user feedback. Leveraging social media to gather user feedback for software development [8] describes how organizations gather and act on social media feedback. Social media as a main source of customer feedback: alternate to customer satisfaction surveys [9] discusses how customer satisfaction surveys have been replaced by social media feedback. Exploring the integration of social media feedback for user-oriented product development [10] describes how feedback from end-users can be used for better product development planning. Data is distributed over different platforms, and the reviews are gathered in order to perform sentiment analysis. Typically, sentiment analysis tasks are carried out at several levels, such as word or phrase level, sentence level, document level, and aspect level. Prior work on Twitter sentiment analysis has typically relied solely on noisy labels or distant supervision, such as using emoticons to identify the sentiment of tweets and to train supervised classifiers. Subsequently, several researchers applied feature engineering in combination with a variety of machine learning techniques, such as Naive Bayes (NB) and Support Vector Machines (SVM), to enhance classification performance. Because obtaining sentiment labels for large unstructured records is very expensive, and because Twitter expressions are unstructured, informal, and dynamic in nature, a few researchers developed unsupervised strategies based entirely on lexicons, as well as on the combination of lexicons and emotional signals. In line with this trend, recent research has revealed that the results of Twitter sentiment analysis are widely used in a
variety of social applications, including political elections, emergency management, and so on. Different weighting mechanisms are available, the most popular being TF-IDF and the raw term frequency weighting scheme. TF-IDF converts the text documents into a sparse Bag-of-Words matrix over the corpus and then performs different operations and modelling on the obtained matrix. The TF-IDF vectorization concept is used in text summarization, domain classification, etc. [15]. Recommendation systems are usually built on a collaborative filtering or content-based approach. Depending on whether it uses a user-based or item-based approach, collaborative filtering tries to provide suggestions based on similar neighbouring customers or items. The content-based filtering technique focuses on a consumer profile that is created from the contents of items rated by the consumer, and the system then recommends a list of items that fit that profile. Researchers [10] reported on the relative performance of collaborative filtering and content-based filtering: in most cases, collaborative filtering outperformed content-based filtering, while in other cases the reverse was observed [16]. In our proposed work, we focused on user-based collaborative filtering based on customer feedback. A more detailed investigation is done in this work by conducting interviews and user surveys to find the answers to RQ1 and RQ2. Details of the methodology and the results for RQ1 and RQ2 are described in the next section.
3 Methodology and Survey Results A survey was carried out using an online questionnaire to better understand the answers to RQ1 and RQ2. The online nature of the survey allowed a large number of software professionals (792) to be surveyed to answer the research questions and to determine whether there exists an automatic way to gather multi-channel feedback and route it in real time.
3.1 Survey Design The survey, as seen in Table 1, consisted of 10 multiple-choice questions, categorized into three main sections. The first series of questions (Q1–Q5) collects demographic information, while set two (Q6 and Q7) and set three (Q8–Q10) collect the information required to answer the research questions RQ1 and RQ2. Participants were also questioned on their views on having an automated engine for feedback collection, analysis, and routing. Table 1 displays the complete list of questions and answer options; the table contains abbreviated responses to each question.
Table 1 Excerpt of questions used in the survey

| Serial number | Belongs to RQ | Actual question with options |
|---|---|---|
| Q1 | Demographics | Have you worked or presently work in the IT software industry? No/Yes, I work or have worked as an IT developer/support services/software and support manager/others, please specify |
| Q2 | Demographics | How old are you currently? Over 60 Y/55–59 Y/45–54 Y/35–44 Y/25–34 Y/18–24 Y. Y stands for years |
| Q3 | Demographics | Years of industry experience? 0–5 Y/6–10 Y/11–15 Y/16–20 Y/21–25 Y/Over 25+ Y. Y stands for years |
| Q4 | Demographics | Your gender? Man/woman/prefer to not answer |
| Q5 | Demographics | Which region do you belong to (ethnicity)? Asian/European/African/Pacific people/Latin America/if others, please specify |
| Q6 | RQ1 | How much time does it take for a feedback to reach your organization? Real time/0–1 H/1–3 H/3–10 H/10–24 H/more than 24 H/more than 1 week. H stands for hours |
| Q7 | RQ1 | Are feedbacks automatically collated from multiple channels? Yes/No |
| Q8 | RQ2 | How much effort goes into identification, classification, and resolution of user feedbacks? 0–2 h/day, 2–4 h/day, more than 4 h/day |
| Q9 | RQ2 | Do you have an automated process for identification, classification and routing of feedback received from social media? Yes/No |
| Q10 | All | Do you have a recommendation system in place based on the user feedback? Yes/No |
3.2 Recruiting Participants Convenience sampling was used to recruit participants, as this proved to be the easiest way to involve a good number of participants in a fair amount of time. Participation in the survey was marketed through multiple channels (described below) and encouraged with the opportunity to win a $25 cash voucher. The Qualtrics platform [4] was primarily used for the online survey. The authors posted a link to the Qualtrics survey on Facebook, Twitter, LinkedIn, e-mail, and Telegram groups. The authors also distributed a hard copy of the survey at technology conferences and software engineering workshops during Jan.–June 2021, and the completed hard copy responses were manually combined with the online survey results. The survey was available for a three-month duration.
3.3 Survey Participants Across all channels of data collection, 792 participants completed the survey fully. All respondents stated having worked in the IT industry in some form (either currently or in the past), as the survey was circulated only among such potential groups.
3.4 Survey Analysis An investigation of the proportion of respondents in each user group was carried out to address the research questions RQ1 and RQ2 alluded to in the introduction. As shown in Fig. 1, it was evident from the survey results that less than 10% of organizations get feedback in real time or within one hour. In more than 75% of the respondents' organizations, there is no automatic mechanism to collate multi-channel feedback, and more than 78% of them do not have an intelligent recommendation system based on positive feedback.
Fig. 1 Results of survey
It is evident from the results that there is a dire need for an automated engine which identifies, classifies, and analyzes the sentiment of user feedback and routes it in real time to organizations. The same engine can act as a recommendation system when the feedback is positive. In the subsequent sections, we propose a novel approach for real-time feedback routing and recommendation using sentiment analysis.
4 Proposed Work This section presents our proposed work and emphasizes a strategy for routing the relevant reviews to the specified organization. Millions of user reviews are registered on different social media platforms, and the turnaround time for these reviews to reach the enterprise can be up to 48 h, delaying the registration of issues and the ability to act on them. The proposed solution is scheduled to run at fixed intervals (an interval of 1 h) and is designed to work in an optimized manner with lower time complexity and greater efficiency (see Fig. 2). The development method is broadly classified into 4 parts: Review Domain Classification, Sentiment Analysis, Feedback Routing, and Recommendation Engine. The reviews are extracted from distributed platforms through different APIs on a schedule at fixed intervals. The documents/reviews are pre-processed through tokenization, stop word removal, numeric removal, punctuation removal, lemmatization, and lexical normalization to be further used for analytical purposes. These documents are compared against domain lexicons to understand the corresponding
Fig. 2 Overall architecture of the proposed work
enterprises to which the reviews point. For unknown domains, a TF-IDF vectorizer is used. Since classifiers cannot be built directly on text data, different weighting schemes convert the text data into numerical form for the models to work on. We have used TF-IDF vectorization, which weights each term by combining its frequency in document d with the inverse of the number of documents in the collection that contain the term t. This generates a sparse matrix converting text data to numerical data. The vectorized matrix, with the unique terms in the dataset as its columns, along with message length and punctuation count, is treated as the feature set for building the classification model. A supervised classifier using ensemble learning and the bagging technique has been used to predict the unknown domain for the given reviews. The tweets are then passed into the sentiment analysis engine to extract the polarity and subjectivity associated with each tweet. Lexical resources coupled with supervised modelling techniques are proposed in our technique, which have shown higher efficiency in the area of sentiment analysis. The negative polarity tweets are routed to the respective software development teams of the corresponding enterprises. The positive polarity tweets are passed into the recommendation engine. Product/service lexicons are used to associate the positive tweets with a respective product. User-based collaborative filtering is used for displaying the best reviewed products to the customers in a dynamic manner. The architectural overview describing the overall process is shown in Fig. 2.
4.1 Data Collection and Data Preprocessing With the increase in demand for products, high sales, and the rise of e-commerce platforms, data is distributed over different data sources. The increase in sales in today's market, accompanied by people's freedom to express opinions and reviews on different social platforms, has further contributed to the growth of enormous data lakes. Furthermore, internet usage has risen rapidly in tandem with developing technology, with users actively using online review sites, social networks, and personal blogs to voice their thoughts. Social media platforms such as Twitter, Facebook, Instagram, LinkedIn, and Quora are becoming ever more essential sources of information for businesses. People actively engage in events, online forums, and conversations in society by expressing their ideas and making comments. This way of sharing knowledge and emotions through social media encourages businesses and enterprise companies to gather more information about their companies and products and about how well-known they are among the public, allowing them to make more informed decisions about how to run their businesses effectively. In our work, we have used Python libraries like Tweepy, Twitter, and TextBlob to extract the tweets using an OAuth token and consumer key/token. The authentication helps extract tweets and data securely from the Twitter APIs. The data collected from social media like Twitter contains varied information ranging from reviews, comments, wishes, work-related information, cultural interests, political views, and
other related information. Our proposed work focuses on product and service reviews along with user complaints. A Twitter API wrapper function is used for extracting the desired tweets. The given model comprises n different domains/enterprises for which the customer reviews are extracted. The search query contains a set of domain- and product-specific keywords, combined using the AND operator, which is passed into the cursor function to extract only the required information. Using the cursor and these keywords, the tweets are extracted, and based on the different hashtags and other identification methods, the corresponding domain name is found. The NLTK library is used to perform the text pre-processing. The documents are broken into tokens using the word_tokenize() function supported by NLTK. Stop words are the most frequently used words in a language; they add little information for analytical tasks and contribute little to selecting documents matching a user need, so they are removed as part of text pre-processing. Numeric and special characters are often present in documents but likewise do not contribute to the efficiency of the model and are hence removed. Lemmatization refers to converting words to their base form (lemma) with the help of a vocabulary and morphological analysis, normally aiming to remove inflectional endings only. Lexical normalization is done to convert nonstandard word forms and domain-specific entities in a review, such as "Plzz" and "Happpyyy", into canonical forms using two publicly available lexical dictionaries.
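A minimal sketch of this preprocessing chain is shown below. The two lexical dictionaries used for normalization are not named in the paper, so the repeated-character squeeze here is only a crude stand-in for that step.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(review):
    review = re.sub(r"[^A-Za-z\s]", " ", review)     # remove numerics and punctuation
    review = re.sub(r"(.)\1{2,}", r"\1\1", review)   # crude normalization: "Happpyyy" -> "Happyy"
    tokens = word_tokenize(review.lower())           # tokenization
    return [LEMMATIZER.lemmatize(t)                  # lemmatization
            for t in tokens if t not in STOP_WORDS]  # stop word removal
```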
4.2 Domain Classification of Reviews The reviews extracted in the previous step belong to various organizations and enterprises. This step focuses on the strategy to classify the tweets to a particular domain. Domain identification of a particular tweet is a crucial pre-requisite in feedback routing systems, as it helps route the specified user feedback to the specified organization. The main steps involved in domain classification are: data extraction, data tokenization, data preprocessing, TF-IDF vectorization, and classification. Data extraction is an important aspect of any machine learning algorithm since it forms the base for building efficient models and deriving better accuracy scores. Data is gathered from different sources, data dictionaries, and through web scraping. In the proposed work, the data comprises the different tweets that we obtained using the Twitter APIs [1]. The NLTK library is used to perform the text pre-processing. Classifiers cannot be built directly on text data; hence, different weighting schemes convert the text data into numerical form for the models to work on. We have used TF-IDF vectorization, whose weighting combines the frequency of a term in document d with the inverse of the number of documents in the collection in which the term t occurs: Model_Pipeline(TF-IDF Vectorizer, Classifier). Any document in a vector space model can be expressed by its term counts. For example, for the documents D1: T1 T2 T3 T5 T5 T9 T9 T3 T5 and D2: T2 T5 T6 T5 T8, the count matrix over the vocabulary T1–T9 is:

|    | T1 | T2 | T3 | T4 | T5 | T6 | T7 | T8 | T9 |
|----|----|----|----|----|----|----|----|----|----|
| D1 | 1  | 1  | 2  | 0  | 3  | 0  | 0  | 0  | 2  |
| D2 | 0  | 1  | 0  | 0  | 2  | 1  | 0  | 1  | 0  |
This is a count vectorizer model which considers the raw term frequency in a document. It runs into certain challenges, such as bias toward stop words and other frequently occurring terms in the document; hence the proposed and widely used solution is Term Frequency—Inverse Document Frequency.
The TF-IDF weighting technique is given by the mathematical formula

Tf-idf(t, d) = (1 + log(tf)) ∗ log(N/df)    (1)

where tf is the frequency of occurrence of the term t in document d, N is the total number of documents in the collection, and df is the number of documents that contain the term t. The corpus comprises all the tweets/user reviews, and the training data comprises tweets and their respective domains. After applying the TF-IDF vectorizer, the corpus is represented as a Bag of Words (all unique terms, in any given order). For feature selection, we add 2 more factors, document length and punctuation count:

X (Independent Variables) = Document Length (x1) + Punctuation Count (x2) + Bag of Words (x3 … xi)    (2)

Y (Target Variable) = Domain    (3)
The corpus is then split into training and testing data, and different classifiers are built on the training dataset. Models are fit on the training data, and prediction is done on the test data. Different accuracy scores are obtained by applying different modelling techniques. In the proposed work, the domain of a user review is identified before performing the sentiment analysis on that review.
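A minimal scikit-learn sketch of the Model_Pipeline(TF-IDF Vectorizer, Classifier) idea is given below. The toy reviews and domain labels are invented for illustration, a random forest stands in for the unnamed bagging ensemble, and the extra document-length and punctuation-count features of Eq. (2) are omitted for brevity.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

reviews = ["the delivery was late again", "love this phone, great battery"]  # toy data
domains = ["logistics", "electronics"]                                       # toy labels

pipeline = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # sublinear_tf applies the 1 + log(tf) weighting of Eq. (1)
    RandomForestClassifier(n_estimators=100),
)
pipeline.fit(reviews, domains)
print(pipeline.predict(["battery drains too fast"]))
```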
4.3 Sentiment Analysis This section of our proposed work helps in determining sentiments from user reviews and opinions on social media platforms, followed by classifying them as positive, negative, or neutral polarity. The sentiment of a text can be explicit or implicit. For explicit SA, the sentiment is directly stated in the user review, for example: "The product was a bad one." For implicit SA, the user review implies the sentiment, for example: "The product stopped working after a week." The SA works in 2 steps: subjectivity classification and determining sentiment polarities. In the suggested model, the subjectivity refers to the specific opinion, emotion, or judgment, and the polarity refers to the emotion expressed for the given subject. Our work is built on hybrid modelling, which is a combination of rule-based classifiers and model classifiers. The suggested rule classifier is used to sort user evaluations into positive, negative, and unknown categories. A comparative study of supervised learning classifiers was carried out for the unknown class, and the modelling approach producing the best accuracy was applied. Lexical Resources for Sentiment Analysis For the proposed model, a strong lexicon is generated. These lexicons include tokens like "Good", "brilliant", "extraordinary", "loved", "hated", "bad", etc. Publicly available sentiment lexicons like EffectWordNet and the WordStat sentiment dictionary are used. The sentiment lexicons classified as either positive or negative are extracted separately, followed by counting of the sentiment-bearing words in the rule-based classifier. Rule-Based Classifier A rule-based classifier is often built to identify a specified set of patterns that are most probably connected to the various classifications. Every rule comprises two parts: an antecedent dealing with a word pattern and a consequent dealing with a class label. A rule can be defined as follows: Rk: if a1 is xk1 and … and an is xkn, then Class = Ck (k = 1, 2, …, N), where Rk is a rule label, xk1 is an antecedent set, Ck is a consequent class, k is a rule index, and N indicates the total number of rules. The unsupervised rule-based classifier expresses the sentiment analysis problem as a multi-class classification problem whose class labels can be positive, negative, or unknown. We define the following set of rules based on the occurrence of positive and negative sentiment-bearing words:
– N1: if NPW > 0 and NNW = 0, then Class = Positive
– N2: if NNW > 0 and NPW = 0, then Class = Negative
– N3: if NPW − NNW > 0, then Class = Positive
– N4: if NNW − NPW > 0, then Class = Negative
where NPW is the count of positive words and NNW is the count of negative words. These rules are applied sequentially, and the corresponding classes are obtained based on them: each user review is classified as positive, negative, or unknown. Supervised Learning for Classifying the Unknown Class Labels Training data with a small number of seed words is passed into a supervised model to extract the polarity.
This trained model is then applied to the user reviews/tweets to extract the polarity of the target document. A comparative study is performed to select the best performing trained supervised model. All tweets of polarity "Unknown" are passed into the pickled trained classifier model with the highest accuracy to categorize each tweet as either positive or negative. A Naïve Bayes classifier is used for the sentiment analysis of the positive and negative feedback.
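Rules N1–N4 translate directly into code. The following is a minimal sketch using a toy excerpt of the lexicons; the paper draws its word lists from EffectWordNet and the WordStat sentiment dictionary.

```python
POSITIVE = {"good", "brilliant", "extraordinary", "loved"}  # excerpt only
NEGATIVE = {"bad", "hated", "worst", "broken"}              # excerpt only

def rule_based_polarity(tokens):
    """Apply rules N1-N4 on counts of positive (NPW) and negative (NNW) words."""
    npw = sum(t in POSITIVE for t in tokens)
    nnw = sum(t in NEGATIVE for t in tokens)
    if npw > 0 and nnw == 0:  # N1
        return "Positive"
    if nnw > 0 and npw == 0:  # N2
        return "Negative"
    if npw - nnw > 0:         # N3
        return "Positive"
    if nnw - npw > 0:         # N4
        return "Negative"
    return "Unknown"          # handed off to the trained supervised classifier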
5 Routing Feedback This section focuses on the routing of collected feedback to the respective support teams of the enterprise organizations. The feedback collected can either be positive feedback that emphasizes the best products and services received, or negative feedback that focuses on a negative user experience. The routing strategies are based on the polarity obtained from the sentiment analysis system. For a positive polarity (>0), the feedback appears either as a recommendation or as an important callout. For a neutral polarity (=0), no action is taken and no routing occurs. For a negative polarity (<0), the feedback is routed to the concerned enterprise for immediate action.

Application of Infrared Thermography in Assessment of Diabetic Foot …

N. C. Evangeline and S. Srinivasan

Subjects with VPT > 20 V at the hallux were considered to have neuropathy, and when compared with normal subjects (31.4 ± 1.92 °C), neuropathic subjects had a higher MFT (32.73 ± 1.48 °C). The measurement of VPT at a single point does not represent the severity of the neuropathy condition in the subjects, which warrants diffuse VPT measurements over the entire foot to ascertain the severity of the condition and its relationship with MFT. However, the conclusive results are resourceful in understanding the thermal distribution in DPN subjects. Diffuse VPT measurements at 6 points in the plantar foot (hallux, third toe, 3rd metatarsal head, medial arch, lateral arch, and heel) were performed by Barriga ES et al. to select DPN subjects for a study between normal subjects and diabetics with and without neuropathy. The subjects were put under a cold stress test by immersing the foot in cold water at 13 °C for 5 min, after which thermal video of the foot was captured for 15 min continuously to understand the thermal recovery rate between
Fig. 3 Plot of temperature showing 4.8 times difference between ulcerated and unulcerated foot— study performed by Armstrong et al. [15]
the three classes. The percentage change in temperature at the 6 points over the 15 min time span was plotted, showing that the normal subjects had a higher recovery rate than the other two classes. The authors concluded that the normative range of these values can be used in screening subjects who are at risk of developing DPN in the future. The physiological justification for the delayed thermal recovery among neuropathic subjects can be found in the conclusions of Flynn MD and Tooke JE, namely that the degeneration of thermoreceptors is a contributing factor which in turn affects thermoregulation in the foot.
5 Thermal Analysis of Foot with Peripheral Arterial Disease The cutaneous temperature distribution bears a direct relationship with the heat from the subcutaneous blood circulation in the foot [2]. Peripheral Arterial Disease (PAD) is one of the prevalent comorbidities of diabetes mellitus; it refers to the narrowing of blood vessels, which in turn leads to reduced perfusion in the limbs. This alteration in blood perfusion may be characterized by means of thermal imaging. Though PAD, unlike DPN, rarely leads to foot ulcers directly, the arterial insufficiency results in delayed healing of ulcers [20]. A study [21] performed on diabetics with and without PAD showed a characteristic reduction in the temperature of the feet in PAD subjects, especially in the toes and metatarsal regions. The authors concluded that infrared thermography can aid in providing broad information about the metabolic and circulatory conditions of diabetics, especially in the limbs. Fujiwara Y et al. performed an experiment to study the thermal recovery time in patients with type 2 diabetes mellitus against normal subjects over their lower extremities. The subjects were exposed to cold stress by immersing their feet in cold water at 0 °C for 10 s, after which five thermal measurements were made at 5 min intervals, finding that the skin temperature recovery rate among diabetic subjects with PAD was lower than in the other subjects, using the formula
Rt = ((T0 − Tt)/T0) × 100 (%)    (1)
where Tt indicates the fall in skin temperature t minutes after exposure. The results were associated with contributory factors such as peripheral arterial sclerosis and blood coagulation fibrinolysis by means of supporting clinical studies. However, the results of the study were not sufficient to assess the severity of the damage caused to the blood vessels (Table 1). Yet another thermographic analysis on normal subjects and diabetic subjects with and without PAD [23] was performed to measure the mean foot temperature among the classes. The mean foot temperature in subjects with angiopathy, a disease of the blood vessels, was found to be lower than in normal subjects—26.9 °C ± 2.0 °C in PAD patients and 27.7 °C ± 2.0 °C in normal subjects.
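As a worked example of Eq. (1), with illustrative numbers only: a site that reads 30.0 °C before cold stress and 27.0 °C t minutes afterwards has a recovery value of 10%.

```python
def recovery_rate(t0_temp, tt_temp):
    """Eq. (1): percentage fall of skin temperature t minutes after exposure."""
    return (t0_temp - tt_temp) / t0_temp * 100

print(recovery_rate(30.0, 27.0))  # 10.0 (%), illustrative values only
```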
Table 1 Temperature difference between ipsilateral and contralateral foot in control and diabetic subjects—study performed by Ilo et al. [23]

| Site of measurement (plantar) | Healthy control group (n = 93): difference between feet in °C, mean (SD) | Diabetes group (n = 110): difference between feet in °C, mean (SD) |
|---|---|---|
| Distal site on lateral side | 0.73 (0.6) | 1.77 (1.7) |
| Distal site on medial side | 0.70 (0.6) | 1.79 (1.8) |
| Middle | 0.59 (0.4) | 1.13 (1.1) |
| Proximal site on lateral side | 0.79 (0.7) | 1.53 (1.5) |
| Proximal site on medial site | 0.83 (0.7) | 1.70 (1.6) |
Thus, it can be concluded that IRT can be helpful in understanding vascular health and the thermal variations in the lower limb.
6 Factors Affecting Acquisition of Thermal Images—for Plantar Foot Analysis Bodies at a temperature above absolute zero emit a spectrum of electromagnetic radiation at different wavelengths, which can be termed their 'thermal signature'. Ideally this is called blackbody radiation, while real objects can be termed 'grey bodies'. In human beings, most of this radiation is observed to be in the infrared range, which spans wavelengths from 750 nm to 1 mm. While, according to Planck's law, the intensity of the radiation an object emits increases with its temperature, according to Wien's displacement law the wavelength of the peak emission is inversely proportional to the object's temperature, given by the equation

λpeak = b/T    (2)
where b is Wein’s constant = 2.897771955 × 10 − 3 m K and T is the temperature in Kelvin.
6.1 Emissivity On the other side of the coin, in technologies like Infrared imaging, which is a single wavelength technology, yet another factor called emissivity plays a crucial role in
assessing the accurate temperature of the object. The Stefan–Boltzmann law defines the power radiated from the object (grey body) in terms of its temperature T and is given by the equation

E = εσT⁴    (3)
where σ = 5.67 × 10⁻⁸ W m⁻² K⁻⁴ is the Stefan–Boltzmann constant and ε is the emissivity. Emissivity in IR cameras is defined as the ratio of the infrared radiation emitted by an object to that emitted by a black body at the same temperature, and it is greatly dependent on the material properties. Human skin has an emissivity of 0.98 (on a scale from 0 to 1), while aluminium foil has an emissivity of 0.03. Emissivity therefore varies across different temperature measurement applications and has to be set before image acquisition; otherwise, the temperature of the object will be read in error [24]. Thus, emissivity correction is the most important factor to be set (at 0.98) while acquiring the plantar foot thermal images for the study.
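As a worked example of Eq. (3): skin at 306 K with ε = 0.98 radiates about 488 W/m²; mis-setting the emissivity would scale this radiance and hence bias the temperature the camera infers.

```python
SIGMA = 5.67e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def radiated_power_wm2(temp_kelvin, emissivity=0.98):
    """Stefan-Boltzmann law, Eq. (3), for a grey body such as human skin."""
    return emissivity * SIGMA * temp_kelvin ** 4

print(radiated_power_wm2(306.15))  # ~488 W/m^2
```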
6.2 Reflectivity Materials with very low emissivity, like polished metals, have a higher tendency to reflect the ambient IR rays while being poor at emitting their own. Emissivity and reflectivity are therefore inversely related; for example, aluminium foil, which has poor emissivity, has a reflectivity of 0.97. In other words, as emissivity decreases, the measurement made by the IR camera reflects not the object itself but mostly its surrounding objects.
6.3 Ambient Temperature and Humidity Ambient temperature and humidity have a huge influence on the temperature measurement of objects: an ambient temperature higher than the object's can mask out hotspots in the object, while a lower ambient temperature cools the object down [24]. Therefore, the ambient temperature is a factor to be set at the time of thermal image acquisition, so that the camera compensates automatically, alongside avoiding external sources that might affect the temperature. Sources like sunlight, a tungsten light bulb in the room, or air circulation from a fan or air conditioner can be contributing factors that affect proper temperature measurement. Humidity affects the transmittance of the IR radiation to the IR detector in the camera; hence, the relative humidity of the ambience is yet another parameter to be set while acquiring thermal images.
6.4 Angle of Incidence The best angle at which to measure the IR radiation is along the cone of maximum emissivity rather than a strictly perpendicular angle [25]. IR rays are not emitted equally in all directions by objects, and for most materials the 60 degree reception angle is the best for capturing the emission. When this angle is decreased towards becoming parallel to the surface, the emissivity drops significantly. Therefore, to fix the IR camera at a specific distance, angle, and height, it is suggested that the camera be mounted on a tripod stand for more accurate measurements.
6.5 Patient Side Factors Apart from recruiting subjects for the study based on the inclusion and exclusion criteria, to measure the plantar foot temperature the subjects are to be kept barefooted in the supine position for 15 min, based on the study by Sun et al., as this enables the cutaneous thermal distribution to attain thermal equilibrium (Fig. 4). Periyasamy et al. [26] performed a study to analyze the effect of standing plantar pressure on the foot of non-diabetic and diabetic subjects and concluded that the plantar pressure distribution parameter (PR) was higher in diabetics than in control subjects. Also, the walking cadence had an effect on the rate of change of foot temperature in diabetics [27], which was further related to the shear pressure due to walking. The authors observed that diabetics had a comparatively higher rate of change of temperature (4.62 ± 2.00 °C) than normal subjects while walking, which decreased again and settled as the subjects rested for 20 min. Thus, the given rest duration helps subjects acclimatize to the ambience, as the foot temperature becomes steady while they rest barefooted (Table 2).
Fig. 4 Plantar foot thermal image acquisition setup using IR Camera
Table 2 Mean foot temperature at various time intervals demonstrating thermal equilibrium being achieved at 15 min with no significance beyond 15 min based on p value (p > 0.05)—results of the study by Sun et al. [11]

| Measurement (°C, mean ± SD) | At 5th minute | At 10th minute | At 15th minute | At 20th minute | At 25th minute |
|---|---|---|---|---|---|
| Mean plantar temperature | 28.6±0.8 (p = 0.002) | 28.1±0.8 (p = 0.003) | 27.8±0.9 (p = 0.74) | 27.8±1.0 (p = 0.62) | 27.7±0.9 |
| Mean plantar temperature (repeat) | 28.3±0.3 (p = 0.003) | 27.8±0.9 (p = 0.51) | 27.9±0.8 (p = 0.61) | 27.8±0.9 (p = 0.77) | 27.8±0.8 |
| Mean forehead temperature | 35.2±0.5 (p = 0.62) | 34.9±0.3 (p = 0.68) | 35.0±0.3 (p = 0.73) | 35.1±0.4 (p = 0.58) | 34.8±0.3 |
| Temperature difference – sole versus forehead | | 6.8±0.9 | 7.1±0.8 | 7.1±0.8 | 7.1±0.9 |
7 Conclusion This literature survey delineates the methods and outcomes of plantar foot thermal analysis in diabetic subjects prior to the onset of foot ulcers, thereby being instrumental in understanding the pathophysiology that can be indicative of pre-ulcerous conditions, which in turn can help in early detection of foot ulcers. Also, the brief discourse on the external factors that affect image acquisition and foot thermal distribution in subjects helps to explicate preliminary aspects to be considered during data collection for plantar foot studies in diabetics. The literature discussed evidently explains that high blood glucose levels may cause damage to peripheral nerves and calcification of blood vessels, leading to poor protective sensation in the foot and reduced oxygen perfusion to the tissues. This consecutively leads to diabetic foot complications, which are manifested through variations in the temperature pattern of the plantar foot. Therefore, IRT, a non-contact, non-invasive imaging modality, can provide options for direct, prompt monitoring and understanding of foot health by explaining the pathogenesis of DFS in diabetic subjects. In conclusion, thermography proves to be a useful technique in evaluating the risk of diabetic foot ulcer development and can be suitable for mass screening.
References

1. Ziegler D, Mayer P, Wiefels K, Gries AF (1988) Assessment of small and large fiber function in long-term type 1 (insulin-dependent) diabetic patients with and without painful neuropathy. Pain 34(1):1–10. https://doi.org/10.1016/0304-3959(88)90175-3. Erratum in: Pain 1988 Sep;34(3):322. PMID: 3405615
2. Bharara M, Cobb JE, Claremont DJ (2006) Thermography and thermometry in the assessment of diabetic neuropathic foot: a case for furthering the role of thermal techniques. Int J Low Extrem Wounds 5(4):250–260. https://doi.org/10.1177/1534734606293481. PMID: 17088601
3. Gatt A, Falzon O, Cassar K, Ellul C, Camilleri KP, Gauci J, Mizzi S, Mizzi A, Sturgeon C, Camilleri L, Chockalingam N, Formosa C (2018) Establishing differences in thermographic patterns between the various complications in diabetic foot disease. Int J Endocrinol 2018:9808295. https://doi.org/10.1155/2018/9808295
4. Boulton AJ, Vileikyte L, Ragnarson-Tennvall G, Apelqvist J (2005) The global burden of diabetic foot disease. Lancet (London, England) 366(9498):1719–1724. https://doi.org/10.1016/S0140-6736(05)67698-2
5. Shaydakov ME, Diaz JA (2017) Effectiveness of infrared thermography in the diagnosis of deep vein thrombosis: an evidence-based review. J Vasc Diagn Interv. https://doi.org/10.2147/JVD.S103582
6. Kopsa H, Czech W, Schmidt P, Zazgornik J, Pils P, Balcke P (1979) Use of thermography in kidney transplantation: two year follow up study in 75 cases. Proc Eur Dial Transplant Assoc 16:383–387
7. Brånemark PI, Fagerberg SE, Langer L, Säve-Söderbergh J (1967) Infrared thermography in diabetes mellitus: a preliminary study. Diabetologia 3:529–532
8. Nagase T, Sanada H, Takehara K, Oe M, Iizaka S, Ohashi Y, Oba M, Kadowaki T, Nakagami G (2011) Variations of plantar thermographic patterns in normal controls and non-ulcer diabetic patients: novel classification using angiosome concept. J Plast Reconstr Aesthet Surg 64(7):860–866. https://doi.org/10.1016/j.bjps.2010.12.003. Epub 2011 Jan 22. PMID: 21257357
9. Attinger CE, Evans KK, Bulan E, Blume P, Cooper P (2006) Angiosomes of the foot and ankle and clinical implications for limb salvage: reconstruction, incisions, and revascularization. Plast Reconstr Surg 117(7 Suppl):261S–293S. https://doi.org/10.1097/01.prs.0000222582.84385.54
10. Adam M, Ng EY, Oh SL, Heng ML, Hagiwara Y, Tan JH, Tong JW, Acharya UR (2018) Automated characterization of diabetic foot using nonlinear features extracted from thermograms. Infrared Phys Technol 89:325–337. https://doi.org/10.1016/j.infrared.2018.01.022
11. Sun PC, Jao SH, Cheng CK (2005) Assessing foot temperature using infrared thermography. Foot Ankle Int 26(10):847–853. https://doi.org/10.1177/107110070502601010. PMID: 16221458
12. Armstrong DG, Lavery LA (1997) Monitoring healing of acute Charcot's arthropathy with infrared dermal thermometry. J Rehabil Res Dev 34(3):317–321. PMID: 9239625
13. Hernandez-Contreras D, Peregrina-Barreto H, Rangel-Magdaleno J, Ramirez-Cortes J, Renero-Carrillo F, Avina-Cervantes G (2015) Evaluation of thermal patterns and distribution applied to the study of diabetic foot. In: 2015 IEEE International Instrumentation and Measurement Technology Conference (I2MTC) proceedings, pp 482–487. https://doi.org/10.1109/I2MTC.2015.7151315
14. Jeffcoate WJ, Harding KG (2002) Diabetic foot ulcers. The Lancet:1545–1551. https://doi.org/10.1016/S0140-6736(03)13169-8
15. Armstrong DG, Holtz-Neiderer K, Wendel C, Mohler MJ, Kimbriel HR, Lavery LA (2007) Skin temperature monitoring reduces the risk for diabetic foot ulceration in high-risk patients. Am J Med 120(12):1042–1046. https://doi.org/10.1016/j.amjmed.2007.06.028. Erratum in: Am J Med 2008 Dec;121(12). https://doi.org/10.1016/j.amjmed.2008.09.029. PMID: 18060924
16. Chan AW, MacFarlane IA, Bowsher DR (1991) Contact thermography of painful diabetic neuropathic foot. Diabetes Care 14(10):918–922. https://doi.org/10.2337/diacare.14.10.918. PMID: 1773693
17. Bagavathiappan S, Philip J, Jayakumar T, Raj B, Rao PN, Varalakshmi M, Mohan V (2010) Correlation between plantar foot temperature and diabetic neuropathy: a case study by using an infrared thermal imaging technique. J Diabetes Sci Technol 4(6):1386–1392. https://doi.org/10.1177/193229681000400613
18. Barriga ES, Chekh V, Carranza C, Burge MR, Edwards A, McGrew E, Zamora G, Soliz P (2012) Computational basis for risk stratification of peripheral neuropathy from thermal imaging. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2012:1486–1489. https://doi.org/10.1109/EMBC.2012.6346222
19. Flynn MD, Tooke JE (1995) Diabetic neuropathy and the microcirculation. Diabet Med: J Br Diabet Assoc 12(4):298–301. https://doi.org/10.1111/j.1464-5491.1995.tb00480.x
20. Frykberg RG, Zgonis T, Armstrong DG, Driver VR, Giurini JM, Kravitz SR, Landsman AS, Lavery LA, Moore JC, Schuberth JM, Wukich DK, Andersen C, Vanore JV (2006) Diabetic foot disorders: a clinical practice guideline (2006 revision). J Foot Ankle Surg 45(5 Suppl). https://doi.org/10.1016/s1067-2516(07)60001-5
21. Brånemark PI, Fagerberg SE, Langer L, Säve-Söderbergh J (1967) Infrared thermography in diabetes mellitus: a preliminary study. Diabetologia 3(6):529–532. https://doi.org/10.1007/BF01213572
22. Fujiwara Y, Inukai T, Aso Y, Takemura Y (2000) Thermographic measurement of skin temperature recovery time of extremities in patients with type 2 diabetes mellitus. https://doi.org/10.1055/s-2000-8142
23. Ilo A, Romsi P, Mäkelä J (2020) Infrared thermography and vascular disorders in diabetic feet. J Diabetes Sci Technol 14(1):28–36. https://doi.org/10.1177/1932296819871270
24. Thermal Imaging Guidebook—FLIR. https://www.flirmedia.com/MMC/THG/Brochures/T820264/T820264_EN.pdf
25. Mohammadi E, Ghaffari M, Behdad N (2020) Wide dynamic range, angle-sensing, long-wave infrared detector using nano-antenna arrays. Sci Rep 10:2488. https://doi.org/10.1038/s41598-020-59440-2
26. Periyasamy R, Anand S, Ammini AC (2013) Prevalence of standing plantar pressure distribution variation in north Asian Indian patients with diabetes mellitus: a study to understand ulcer formation. Proc Inst Mech Eng H 227(2):181–189. https://doi.org/10.1177/0954411912460806
27. Reddy PN, Cooper G, Weightman A, Hodson-Tole E, Reeves ND (2017) Walking cadence affects rate of plantar foot temperature change but not final temperature in younger and older adults. Gait Posture 52:272–279. https://doi.org/10.1016/j.gaitpost.2016.12.008
A Survey and Classification on Recommendation Systems

Manika Sharma, Raman Mittal, Ambuj Bharati, Deepika Saxena, and Ashutosh Kumar Singh

M. Sharma (B) · R. Mittal · A. Bharati · D. Saxena · A. K. Singh, Department of Computer Application, National Institute of Technology, Kurukshetra, Haryana 136119, India. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024. M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_44
Abstract In today’s modern world, the data is growing exponentially and the traditional systems are not able to fulfil the user’s requirements. To fulfil the needs of the users, various companies like Amazon, Netflix, etc. are using recommender systems which recommend content or various type of data on the basis of the user’s previous activities and interactions with the system. In the recommender system, mainly three approaches are present, i.e., content-based, collaborative filtering and knowledge-based approaches. Due to their wide applicability, recommender systems have become an area of active research and in this context, this paper furnishes a survey and comparative discussion of existing approaches. The survey draws a conclusion on how different recommendation techniques are cooperating with today’s growing technology trends and also discusses the challenges faced by them. Keywords Collaborative filtering · Content-based filtering · Hybrid methods · Knowledge-based filtering · Recommender system
1 Introduction With increasing technological advancements, there are tonnes of data available on the internet nowadays, which makes it completely tiresome for users to browse the products of their choice. Also, it has become a difficult task for digital service providers to engage users for the maximum possible time on their applications. This is where the recommender system comes into the picture. Recommender systems recommend content or various types of data on the basis of the user's previous activities and interactions with the system. Different application domains like movies, music, books, news, etc. are adopting recommender systems. Some examples of recommender systems are product recommendations on Amazon, Netflix suggestions for movies and TV shows, recommendations for videos
on YouTube, music on Spotify, and many more. Most recommendation systems show recommendations to the user to provide a better user experience. The system suggests to users the material of their choice and liking from a vast set of goods, based on a description of their wants. Such systems help users interact better with the application and thus increase the amount of time spent by the user on that application. Which mobile phone should I buy? What is the best holiday destination for me and my family? Which movies should I watch? Which book should I rent or buy? Which song should I listen to? This list can easily be expanded with many similar questions in which some decision has to be made. Bad decisions might lead to wasted time and money. Traditionally, people have used a variety of strategies to solve such problems, like surfing the internet, taking suggestions from a friend, or simply following other people. However, we have all experienced situations in which these methods do not work well. Good advice is difficult to receive, in most cases time-consuming to obtain, and even then often of questionable quality. Now imagine a system that actually gives great quality advice and, most importantly, can be trusted. Most of us have already come across recommender systems in one way or another. Imagine, for instance, you go to an online store to shop for something. After making the purchase, you rate the product as 9/10. So, the next time you visit the store, you will see similar products being recommended to you under the "similar items" category, or you could see a section titled "people who bought this also bought". These sections consist of products that resemble the recently bought product or products bought by users with similar interests. The software system which determines which products are shown to the user is a recommender system. Most service providers aim to increase customer engagement on their applications. But with the tonnes of data available online, users have a hard time not only searching for the things they want but even figuring out what they want in the first place. So, the recommender system amplifies the user experience, which in return strengthens the relationship between the user and the service provider, and that is what Fig. 1 illustrates. Recommender systems can be both personalized and impersonalized. The above example is of a personalized recommender system; in other words, every visitor sees a different list of products depending on their interest. In contrast, many stores or web portals show their top-selling or most viewed products to everyone; theoretically, this could be interpreted as an instance of impersonalized recommendation. The provision of personalized recommendations, however, requires that the system knows something about every user, and every recommender system develops and maintains a user model for making recommendations. How that user model is acquired depends on the particular technique used. Despite the popularity of recommendation algorithms, some research questions remain unanswered. What is the current state of the art in recommendation systems? What are the most common approaches used to implement and evaluate recommendation systems? What are the research challenges in developing a recommendation system?
Fig. 1 Recommendation System
The rest of the paper is structured as follows. In Sect. 1.1, we discuss the classification of the major techniques available for recommendation systems. The contribution of this paper is explained in Sect. 1.2. Next, in Sect. 2, we provide a literature survey of various traditional and advanced approaches that will help future research in this area; the same section also presents a well-structured table comparing various existing techniques. In Sect. 3, we discuss the emerging trends and future scope in the area of recommendation systems. Finally, Sect. 4 concludes the paper.
1.1 Classification

Recommendation systems can be classified into content-based, collaborative filtering, knowledge-based, and hybrid approaches. Figure 2 depicts this classification of the major techniques available for recommendation systems. These techniques are widely used to build recommender systems and have been successfully applied in many domains.

Content-Based Filtering Method Content-based methods, illustrated in Fig. 3, only have to analyse the items and user profiles to make a recommendation. They recommend content on the basis of the user's browsing history, clicks, and viewed products. This approach can propose unrated items and is based entirely on the user's own ratings; however, it does not work for new users who have not yet rated anything. A content-based approach does not recommend items that are unexpected to a user (serendipitous items), and it fails if the system cannot distinguish content that the user does not like. A minimal sketch of this idea follows.
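As a concrete illustration (our own sketch, not code from any surveyed system; item titles and descriptions are invented), the snippet below represents items by TF-IDF vectors of their textual metadata and ranks them by cosine similarity to an item the user liked, using scikit-learn:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy item catalogue; descriptions stand in for any item metadata.
items = {
    "Inception": "sci-fi thriller dreams heist",
    "Interstellar": "sci-fi space exploration drama",
    "The Notebook": "romance drama love story",
}

titles = list(items.keys())
# Represent each item by the TF-IDF vector of its description.
tfidf = TfidfVectorizer()
item_vectors = tfidf.fit_transform(items.values())

def recommend(liked_title, top_n=2):
    """Rank all other items by cosine similarity to a liked item."""
    idx = titles.index(liked_title)
    sims = cosine_similarity(item_vectors[idx], item_vectors).ravel()
    ranked = np.argsort(-sims)
    return [titles[i] for i in ranked if i != idx][:top_n]

print(recommend("Inception"))  # items with the most similar content
```

Because only the liked item's own features are used, the sketch also makes the limitation above concrete: it can never surface items dissimilar to what the user already consumed.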
Fig. 2 Classification of recommendation systems
Fig. 3 Content-based filtering
Collaborative Filtering Method Collaborative methods, illustrated in Fig. 4, work by finding similarities between different users and recommending the products those similar users consumed. The two major classes of collaborative methods are the memory-based approach and the model-based approach. The memory-based approach works in essentially three steps: measuring the similarity between the training users and the target user, identifying the target user's closest neighbours (i.e., users who are highly similar to the target user), and generating a final list of recommendations; a minimal sketch follows Fig. 4. Instead of using the data directly, the model-based approach considers the user's rating behaviour: the rating data is used to estimate the parameters of a model, which leads to better accuracy and performance. Recommendation by collaborative filtering depends on user behaviour and is content-independent. Because suggestions are based on user similarity rather than item similarity, it also produces serendipitous recommendations. The problem with this approach, however, is that it cannot recommend items to new users (the cold start problem). The method also struggles to recommend items to users whose interests are special and differ from those of the majority: such users neither agree nor disagree consistently with the rest of the users, making it difficult to produce appropriate results (the grey sheep problem).
Fig. 4 Collaborative filtering
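The three memory-based steps can be made concrete with a small sketch (a toy illustration with an invented rating matrix, not code from any surveyed system):

```python
import numpy as np

# Toy user-item rating matrix (rows: users, cols: items; 0 = unrated).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    mask = (u > 0) & (v > 0)          # compare only co-rated items
    if not mask.any():
        return 0.0
    return float(u[mask] @ v[mask] /
                 (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask])))

def predict(target, item, k=2):
    # Step 1: similarity between the target user and all training users.
    sims = np.array([cosine(R[target], R[u]) if u != target else -1.0
                     for u in range(len(R))])
    # Step 2: nearest neighbours who actually rated the item.
    neigh = [u for u in np.argsort(-sims) if R[u, item] > 0][:k]
    # Step 3: similarity-weighted average rating as the prediction.
    w = sims[neigh]
    return float(w @ R[neigh, item] / w.sum()) if w.sum() > 0 else 0.0

print(predict(target=0, item=2))   # predicted rating of user 0 for item 2
```

A model-based approach would instead fit parameters (e.g., latent factors) to the rating matrix offline rather than scanning it at prediction time.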
The sparsity problem, which refers to a situation in which the data is too sparse to find parallels in consumer interests, is another fundamental flaw of this strategy. Sparse data means that consumers have rated only a small number of items, making it difficult to recommend enough items.

Knowledge-Based Filtering Method Both collaborative and content-based recommender methods have their advantages and strengths. However, there are many situations where these approaches are not the best choice. Real estate, automobiles, financial services, and other expensive luxury goods are examples of such scenarios: people typically do not buy these items very frequently, so collaborative filtering and content-based methods will not perform well because of the low number of available ratings and the low level of user-system interaction. Furthermore, in more complicated product domains such as real estate, clients frequently wish to specify their requirements explicitly, such as the property's location or colour. Knowledge-based recommender systems can address all of these issues. To put it another way, the knowledge-based approaches shown in Fig. 5 do not require any kind of rating to recommend products; instead, the recommendation process is based on similarities between customer requirements and item descriptions, or on constraints expressing user requirements.

Fig. 5 Knowledge-based filtering

Hybrid Method Because each of the aforementioned approaches has its own set of advantages and disadvantages, hybrid methods, as shown in Fig. 6, combine the benefits of different approaches to create a system that performs well in a wide range of applications. Current systems employ advanced algorithms to handle issues such as sparsity in the data; approaches like clustering and normalization are used to address sparsity. Demographic and association rule mining techniques have been employed and found effective in addressing the cold start problem. K-nearest neighbours (KNN) and the frequent-pattern tree (FPT) have been coupled to produce quality suggestions, overcoming the disadvantages of existing approaches. The one problem with traditional hybrid systems is that they use users' past information to recommend content. Say a user who has been using an application based on a hybrid system for a long time suddenly stops using it; when the user revisits the website after a while, the system will recommend items based on the interests captured earlier, which may no longer be relevant. The simplest hybrids are realized as a weighted combination of the component scores, sketched below.

Fig. 6 Hybrid approach
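The weighted blend below is purely illustrative (scores, item names, and the weight alpha are invented); in practice alpha would be tuned, or adapted per user to favour the content score in cold-start situations:

```python
def hybrid_score(content_score, collab_score, alpha=0.5):
    """Weighted hybrid: blend content-based and collaborative scores.

    alpha = 1.0 falls back to pure content-based filtering (useful for
    cold-start items); alpha = 0.0 is pure collaborative filtering.
    """
    return alpha * content_score + (1 - alpha) * collab_score

# Candidate items scored by both components (scores assumed in [0, 1]).
candidates = {"item_a": (0.9, 0.2), "item_b": (0.4, 0.8), "item_c": (0.6, 0.6)}
ranked = sorted(candidates,
                key=lambda it: hybrid_score(*candidates[it]),
                reverse=True)
print(ranked)  # items ordered by blended relevance
```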
1.2 Contributions of the Paper

Although the techniques used in today's recommender systems were created more than a decade ago, the field is still under active research, because the internet has become a vital part of everyone's life while new technologies emerge daily. The main objective of this paper is to collect the different existing techniques in one place and compare them on various parameters. The paper draws conclusions on how different recommendation techniques are keeping up with today's growing technology trends and discusses the challenges they face. At the end, the paper also suggests a new hybrid technique that can resolve some limitations of existing techniques.
2 Literature Survey

A lot of work has been done in the area of recommendation systems across various publishing forums. In this section, we present an extensive study of various traditional and advanced approaches and provide a survey that will help future research in this area. Further, Table 1 compares the many approaches that have been offered thus far.

Table 1 Comparison of existing works (approach, strength, weakness)

Simovic [3], 2018: Collaborative filtering. Strength: no domain knowledge necessary. Weakness: grey sheep, data sparsity, and cold start problems.
Dieu et al. [4], 2020: TrustSVD model, matrix factorization techniques. Strength: higher accuracy. Weakness: takes more time in the optimization of the target function.
Xiao et al. [14], 2017: Content-based and collaborative filtering. Strength: controls the cold start problem to some extent. Weakness: data sparsity, scalability.
Yang et al. [5], 2013: Collaborative filtering, matrix factorization based on social information. Strength: higher accuracy, controls the cold start problem to some extent. Weakness: data sparsity, scalability.
Umamaheswari and Akila [29], 2019: KNN, neural network for pattern recognition. Strength: higher accuracy. Weakness: increased complexity.
Tarnowski et al. [30], 2017: K-nearest neighbor, multi-layered neural network. Strength: higher accuracy; KNN is easy to implement. Weakness: KNN does not work well with large datasets.
Iniyan et al. [31], 2020: Convolutional neural network, content-based approach. Strength: higher accuracy, fixes the cold start problem. Weakness: increases the threat to users' privacy and security.
Bokde et al. [6], 2015: Collaborative filtering using matrix factorization models. Strength: reduces the level of sparsity, handles large databases, handles the cold start problem. Weakness: low accuracy; model complexity increases with the size of the dataset.
Kim et al. [7], 2021: Collaborative filtering, support vector machine classifier. Strength: increase in accuracy (average accuracy is 87.2%). Weakness: data sparsity; increases the threat to users' privacy and security.
Feng et al. [15], 2020: Systematic literature review as the primary method of data collection. Strength: helps researchers identify and classify good approaches. Weakness: does not propose any new model for news recommendation.
Portugal et al. [16], 2017: Systematic literature review as the primary method of data collection. Strength: helps researchers identify and classify good ML algorithms for recommendation systems. Weakness: studies on requirements and design, as well as late stages such as maintenance, are lacking.
Geetha et al. [17], 2018: Content-based and collaborative filtering, Pearson correlation, weighted mean. Strength: higher accuracy, extensible to other domains, handles the cold start problem to some extent. Weakness: data sparsity, scalability.
Shahbazi and Byun [1], 2020: Content-based filtering, XGBoost classifier. Strength: handles the grey sheep problem, no data sparsity problem. Weakness: cold start problem; cannot expand the user's interest.
Haruna et al. [8], 2017: Collaborative filtering. Strength: does not require an a priori user profile, personalized recommendation, expands the user's interest area. Weakness: grey sheep, cold start, and data sparsity problems.
Reddy et al. [2], 2019: Content-based filtering. Strength: handles the grey sheep problem, gives personalized recommendations. Weakness: cold start problem, no serendipitous recommendations.
Phorasim and Yu [9], 2017: Collaborative filtering, k-means clustering. Strength: provides serendipitous recommendations, high accuracy, fast. Weakness: cold start, grey sheep, and data sparsity problems.
Tian et al. [18], 2019: K-means clustering, collaborative and content-based filtering. Strength: reduces the data sparsity problem to some extent, gives serendipitous recommendations. Weakness: cold start problem.
Osadchiy et al. [19], 2018: Pairwise association rules. Strength: no problem with cold starts or data sparsity. Weakness: grey sheep problem; a big data set is required.
Kumar et al. [20], 2016: Tensor decomposition, semantic similarity, k-means clustering. Strength: uses multiple domains, reduces the data sparsity problem. Weakness: cold start problem; relies completely on the auxiliary domain.
Alotaibi and Vassileva [10], 2016: Collaborative filtering, explicit social networks. Strength: alleviates the cold start problem to some extent, improves accuracy. Weakness: grey sheep problem, dynamic behaviour of users.
James et al. [32], 2019: Face detection, emotion recognition, classification. Strength: no cold start problem, adaptable to dynamic user behaviour, handles the grey sheep problem. Weakness: no serendipitous recommendations; relies fully on the user's emotions.
Shah and Sahu [21], 2015: Association mining technique. Strength: better accuracy. Weakness: grey sheep problem.
Lucas et al. [22], 2013: Clustering, association mining, CBA-fuzzy algorithm. Strength: fuzzy logic helps minimize the sparsity problem. Weakness: new methods could be proposed to further enhance the accuracy and performance of the system.
Ye et al. [23], 2019: Frequent-pattern tree (FPT), k-nearest neighbors (KNN). Strength: overcomes the cold start problem. Weakness: reviews and user-generated ratings may not be available for user profiling.
Bhatt et al. [25], 2014: Content-based filtering, collaborative filtering, hybrid method. Strength: better performance. Weakness: reliable integration and efficient calculation remain open issues.
Wang et al. [24], 2018: Content-based filtering, collaborative filtering, sentiment analysis. Strength: high user engagement on the application, high efficiency. Weakness: limited user involvement.
Badarneh and Alsakran [11], 2016: Clustering, KNN, association mining. Strength: better for predicting customer behavior. Weakness: too complex.
Lin et al. [26], 2018: Sparse linear method (SLIM), regularization. Strength: better performance. Weakness: accuracy can be improved.
Hande et al. [27], 2016: Content-based filtering, collaborative filtering, matrix factorization. Strength: improved efficiency and overall performance of the system. Weakness: a hybrid recommender system based on clustering and similarity could be built for improved performance.
Hong et al. [28], 2014: Keyword extraction. Strength: eradicates the cold start problem. Weakness: requires grouping of research papers according to specific subjects.
Juan et al. [12], 2019: Collaborative filtering. Strength: accurate and effective results. Weakness: heterogeneous and uneven data, data sparsity.
Bouraga et al. [13], 2014: Knowledge-based approach. Strength: overcomes the cold start and grey sheep problems, no large data set required, more reliable recommendations. Weakness: the knowledge acquisition task is challenging; development and maintenance are costly.
2.1 Content-Based Filtering Method

The product recommendation system based on content-based filtering proposed in [1] uses the machine learning algorithm XGBoost to recommend items to users on the basis of their previous activities and the click information collected from their user profiles. A content-based filtering approach has been used by Reddy et al. in [2] to build a movie recommendation system that recommends items to users based on past behaviour. It also makes recommendations based on genre similarity: if a user rates a movie highly, movies of similar genres can be recommended as well.
2.2 Collaborative Filtering Method

A smart library recommender system has been proposed in [3], which recommends books and other resources to users and enhances the educational system. The paper presents a model that collects data from various sources; after collection, the data is processed and analysed. The model then performs collaborative filtering, and at the end of the process the user receives a recommendation list containing items of greater interest and precision.

Based on the TrustSVD model and matrix factorization techniques, Anh Nguyen Thi Dieu et al. provided a new methodology in [4] to analyse item ratings and feed the implicit effect of item ratings into the recommendation system. The experimental results indicated that this model outperformed the standard matrix factorization approach by 18% and the multi-relational matrix factorization method by 15%.

Yang et al. [5] present a survey of collaborative filtering-based social recommender systems. The authors give a quick explanation of the task of recommender systems and of standard methods that do not employ social network information, and then show how social network information can be used as an additional input to increase accuracy. Bokde et al. [6] present a survey of the matrix factorization model in collaborative filtering methods.
Matrix factorization characterizes both items and users by vectors of factors deduced from item rating patterns, and a high correspondence between user factors and item factors leads to a recommendation. Matrix operations are scalable and economical, and they also alleviate the problem of high sparsity.

Tae-Yeun Kim et al. in [7] proposed a recommendation system model that recognizes six human emotions. The model is built by merging collaborative filtering with speech-based emotional information recognition received in real time from users. It mainly consists of an emotion classification module, an emotion collaborative filtering module, and a mobile application. Thayer's extended two-dimensional emotion model is used as the emotional model, and an SVM classifier recognizes the patterns in the emotional information contained in the optimized feature vectors. This model gives users more accurate recommendations because of the added emotional information.

Haruna et al. [8] proposed a collaborative strategy for a research article recommender system that uses publicly accessible contextual data to identify hidden connections between research papers in order to personalize recommendations. Regardless of the research field or the user's expertise, this system gives individualized recommendations. Phorasim et al. [9] employed collaborative filtering to create a movie recommender system that uses the k-means clustering technique to categorize users by their interests and then discovers similarities among users to produce a recommendation for the active user. The proposed model attempts to improve the time taken to recommend an item.

A fusion of recommendations using explicit social relations (friends and co-authors) with recommendations using implicit social relations (similarity between users) has been proposed in [10] to increase user coverage with minimal loss of recommendation accuracy. This approach attempts to alleviate some drawbacks of collaborative filtering, such as the cold start problem, and has increased recommendation accuracy. Badarneh and Alsakran in [11] described a different recommendation method based on a collaborative approach combined with association mining rules to discover course patterns. It works by providing recommendations according to other people with similar interests. The system requires the stated minimum support, a defined minimum confidence, and the course dataset as inputs; it derives course association rules and then uses these rules to generate a recommendation list. To achieve better performance, high confidence values are chosen.

Juan et al. [12] developed a hybrid collaborative technique that combines the KNN model and the XGBoost model and leverages the scores predicted by the model-based personalized recommendation algorithm as features, to overcome the data sparsity and cold start concerns in personalized recommendation systems. The algorithmic principle behind XGBoost is to pick samples and features to construct rudimentary classification models; the goal is to learn from previous data and generate a new model. As matrix factorization underlies several of these systems, a minimal sketch follows.
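The sketch below is a toy example of the factorization idea (our own illustration of SGD-trained latent factors, not the exact algorithm of any surveyed paper):

```python
import numpy as np

# Observed (user, item, rating) triples from a toy dataset.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0)]
n_users, n_items, k = 3, 3, 2          # k latent factors

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user factor vectors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factor vectors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                  # prediction error
        pu = P[u].copy()                       # update both factors jointly
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

# Predicted rating of user 2 for item 0 (never rated by that user).
print(P[2] @ Q[0])
```

The L2 regularization term keeps the factors small on sparse data, which is one reason matrix factorization tolerates high sparsity better than neighbourhood methods.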
2.3 Knowledge-Based Filtering Method

In [13], Sarah Bouraga et al. established a classification framework for knowledge-based recommendation systems that differentiates such systems by their characteristics. The framework attempts to make it easier to identify existing knowledge-based recommendation systems, and the paper's approach aims to make it easier to develop new, better ones. The authors suggested three classification dimensions constituting the framework. The first is the 'Recommendation Problem and Solution', which describes what the knowledge-based system is supposed to solve. The second is the 'User Profile', which specifies the characteristics that must be present for a tailored recommendation to be delivered. The third is the 'Degree of Automation', which determines whether human intervention is required and, if so, to what extent. After conducting the survey, the authors found that knowledge-based systems avoid the cold start, new item, and grey sheep problems and need no large historical data set. However, the task of acquiring knowledge is difficult, and the system's development and maintenance are costly.
2.4 Hybrid Method

Jun Xiao et al. in [14] developed a model that recommends online courses to learners in different countries. The model uses a combination of content-based and collaborative filtering approaches and comprises three main modules: the data support module, the combinational algorithm recommendation engine module, and the new source recommendation module. Through these modules, the model overcomes limitations such as the cold start problem.

A news recommender system is proposed in [15]. The paper attempts to list the problems in news recommendation systems. The authors briefly reviewed the literature, using reviews as the primary method of data collection, and included manuscripts from standard platforms published between 2006 and 2019. After studying all the papers, the authors found that only 13% applied a collaborative filtering approach for news recommendation during the previous decade; to avoid its limitations, a hybrid approach is adopted.

A review of the use of machine learning algorithms in recommender systems is presented in [16]. The main goal of the paper is to identify the different ML algorithms being used in recommender systems and to help new researchers conduct research appropriately. The results confirm that little research effort has focused on hybrid approaches, with plenty of room for research in semi-supervised and reinforcement learning for recommender systems; neural network and k-means algorithms are also under-researched for recommender system development.

Geetha et al. in [17] proposed a recommendation system for movies that overcomes the cold start problem. It mainly follows collaborative filtering, content-based
filtering, demographics-based filtering, and hybrid approaches, and attempts to overcome the drawbacks of each individual approach. A recommendation system for libraries based on a hybrid approach has been proposed in [18]; comparative experiments demonstrate that the hybrid approach provides more accurate recommendations than any individual filtering approach. Osadchiy et al. in [19] proposed an algorithm that builds a model of collective preferences independently of personal user interests. Using pairwise association rules, this novel approach eliminates the need for a complex system of ratings and removes several challenges faced by content-based and collaborative filtering approaches.

A cross-domain recommendation approach has been used in [20], which exploits knowledge from other domains (e.g., web series) containing additional user preference data to improve recommendation in the target application domain (e.g., toys). The system utilizes semantic similarity measurements of common information to determine how domains are related. Shah and Sahu in [21] proposed a hybrid model combining content and collaborative filtering with an association mining technique to boost the effectiveness of recommender systems. The study explores other hybridization strategies, such as the weighted method, which is used to partially overcome the constraints of the individual methods, and also discusses possible solutions to problems like the cold start problem.

A personalized hybrid system has been proposed in [22] to engage users for longer by recommending products of their choice. The paper presents a model for tourism developed by combining classification and association, known as the associative classification method. It also employs fuzzy logic to enhance the quality of recommendations; fuzzy logic further helps to minimize the sparsity problem.

Ye et al. in [23] presented a review paper describing various recommendation methods. Broadly, recommender systems are classified into three categories: content-based, collaborative, and hybrid. Collaborative filtering works by finding similarities between different users and recommending the products they used; content-based methods only have to analyze the items and user profiles; hybrid systems combine the benefits of both to create a system that operates well in a wide range of applications.

A sentiment-enhanced hybrid recommender system introduced in [24] extends the hybrid model by performing sentiment analysis on the result. By understanding the sentiments behind user reviews, the system can make informed decisions on which product to recommend. The model is implemented on the Spark platform to meet the needs of mobile services and achieves higher efficiency than existing hybrid models. One limitation of this method is that a movie without enough reviews is very difficult to recommend to any user.

A personalized paper recommendation approach is presented in [25]. Unlike existing methods, it does not rely entirely on content or collaborative methods but takes the advantages of both into consideration to build a hybrid model. This work combines K-nearest neighbours (KNN)
and the frequent-pattern tree (FPT) to deliver excellent suggestions to researchers, overcoming the disadvantages of existing techniques; the system overcomes the cold start problem. As far as the requirements for intelligent recommendations in smart education are concerned, [26] used a sparse linear method (SLIM) for top course recommendations. The approach works by extracting the inner structure and content of the available courses. The original student/course matrix is extracted from a college's learning dataset, and the final recommendation for each student is completed by sorting the courses not yet taken in descending order, the first few courses in the list forming the final recommendation. MovieMender, a movie recommendation system proposed in [27], aims to help users find movies matching their interests without difficulty. It uses a web crawler to build the database; the dataset is pre-processed and a user-rating matrix is derived. Content-based filtering is applied to each user-rating pair to create the matrix, and collaborative filtering uses matrix factorization to determine the relationships between items and user entities in order to provide recommendations for an active user. Kwanghee Hong et al., in [28], proposed a personalized recommendation system for research papers that extracts the keywords given in the papers. These words are then counted throughout the document, and any word with a count above the average is considered a keyword for the paper. This work compensates for the shortcomings of existing systems, which are unable to handle user profile information in a sensitive and secure manner.
2.5 Emotion-Based Recommendation System

Umamaheswari et al., in [29], presented a model for emotion recognition from speech, created using a combination of a pattern recognition neural network and the k-nearest neighbour method. The authors classified and evaluated previously created systems and found that the proposed system achieves higher accuracy than previously used techniques such as the Gaussian mixture model and the hidden Markov model. In [30], Tarnowski et al. describe a paradigm for recognizing seven primary emotional states from facial expressions: neutral, joy, surprise, anger, sadness, fear, and disgust. The authors use coefficients describing elements of facial expression as features for the model; the k-nearest neighbour classifier and an MLP neural network are then used for classification. For random data division, this model produces good classification results, with accuracies of 96% (KNN) and 90% (MLP). Iniyan et al. in [31] suggested a recommendation system model that suggests content to users depending on their current mood or emotion. The facial expression recognition technique in this model employs a convolutional neural network (CNN) to extract features from the face image, followed by artificial neural networks; the extracted features are fed to a classifier, which outputs the recognized expression. By adding one more real-time variable to the system, this model corrects a flaw in the older approach and improves its accuracy.
In [32], James et al. concentrated on identifying human emotions in order to create an emotion-based music recommender system. This method avoids the time-consuming and tiresome effort of manually categorizing or dividing music into various lists, and it aids the creation of an ideal playlist based on an individual's emotional characteristics.
3 Emerging Trends and Future Scope

The domain of recommendation systems is perennial in nature, as the internet's exponential expansion has made it difficult to obtain relevant information in a reasonable amount of time. We believe that the future of recommendation systems will reach far beyond business and will have a much greater impact on our daily lives. The ideal recommendation system would know us better than we know ourselves, making the decision-making required at every step of life effortless and quick so that we can spend our precious time on more productive things.

Several approaches and methods are already in use, as discussed in this paper, but they have their own challenges, such as the cold start problem, which concerns new users who have no browsing history yet: the system must provide recommendations without relying on any previous actions. Recommender systems involve the use of the entire user profile, including likes and dislikes, which may pose a challenge to the user's privacy. Making precise and accurate recommendations from a large amount of data can also delay the response time. Furthermore, in any recommender system, predicting the user's interest is tricky, since interests keep changing over time. To address these issues, researchers have proposed modifications such as combining k-nearest neighbours (KNN) and the frequent-pattern tree (FPT) to provide quality recommendations, or a sentiment-based hybrid system that applies sentiment analysis to the generated recommendation list to improve the accuracy and performance of current systems.

Continuing in that direction, we aim to further improve the accuracy of these models by introducing a new variable: emotional information. Emotional data can be used effectively in recommender systems because it depicts a user's current emotional state. To improve user satisfaction, the recommendation system must recognize and represent the user's individual traits and circumstances, such as personal preferences and feelings. Emotional information can generally be collected by two methods: speech recognition and facial expression recognition. This approach not only removes the cold start problem but can also recommend new and serendipitous items to users, extending their area of interest. Traditional systems use users' past records to recommend content, which sometimes becomes irrelevant; the proposed method also fixes this problem.
4 Conclusion

Due to the vast amount of unstructured data on the internet, recommender systems remain a fascinating research area. Such systems let users access their preferred content without having to go through all the available services, so a good recommendation system can remove information barriers for users while increasing business outcomes. This paper classifies the types of recommender systems, such as content-based, collaborative, knowledge-based, and hybrid systems, along with their workings, applications, and limitations. Currently, hybrid methods are the most popular; they combine two or more methods to provide accurate recommendations to users. A survey has also been conducted to list the methods proposed to overcome the existing limitations. It is expected that, in the near future, further innovation will produce much better systems than the existing ones.
References

1. Shahbazi Z, Byun Y-C (2019) Product recommendation based on content-based filtering using XGBoost classifier. Int J Adv Sci Technol 29:6979–6988
2. Reddy S, Nalluri S, Kunisetti S, Ashok S, Venkatesh B (2019) Content-based movie recommendation system using genre correlation. Smart Intell Comput Appl:391–397
3. Simović A (2018) A big data smart library recommender system. Library Hi Tech
4. Dieu ANT, Vu TN, Le TD (2021) A new approach item rating data mining on the recommendation system. SN Computer Science:1–6
5. Yang X, Guo Y, Liu Y, Steck H (2014) A survey of collaborative filtering based social recommender systems. Comput Commun 41:1–10
6. Bokde D, Girase S, Mukhopadhyay D (2015) Matrix factorization model in collaborative filtering algorithms: a survey. Procedia Computer Science 49:136–146
7. Kim T-Y, Ko H, Kim S-H, Kim H-D (2021) Modeling of recommendation system based on emotional information and collaborative filtering. Sensors 21(6):1997
8. Haruna K, Ismail MA, Damiasih D, Sutopo J, Herawan T (2017) A collaborative approach for research paper recommender system. PLoS ONE 12(10)
9. Phorasim P, Yu L (2017) Movies recommendation system using collaborative filtering and k-means. Int J Adv Comput Res 7(29):52
10. Alotaibi S, Vassileva J (2016) Personalized recommendation of research papers by fusing recommendations from explicit and implicit social network. UMAP (Extended Proceedings)
11. Badarneh AA, Alsakran J (2016) An automated recommender system for course selection. Int J Adv Comput Sci Appl 7(3):166–175
12. Juan W, Xin LY, Ying WC (2019) Survey of recommendation based on collaborative filtering. J Phys: Conf Ser 1314(1)
13. Bouraga S, Jureta I, Faulkner S, Herssens C (2014) Knowledge-based recommendation systems: a survey. Int J Intell Inf Technol (IJIIT) 10(2):1–19
14. Xiao J, Wang M, Jiang B, Li J (2018) A personalized recommendation system with combinational algorithm for online learning. J Ambient Intell Humaniz Comput 9(3):667–677
15. Feng C, Khan M, Rahman AU, Ahmad A (2020) News recommendation systems: accomplishments, challenges & future directions. IEEE Access 8:16702–16725
16. Portugal I, Alencar P, Cowan D (2018) The use of machine learning algorithms in recommender systems: a systematic review. Expert Syst Appl 97:205–227
17. Geetha G, Safa M, Fancy C, Saranya D (2018) A hybrid approach using collaborative filtering and content based filtering for recommender system. J Phys: Conf Ser 1000(1):012101
18. Tian Y, Zheng B, Wang Y, Zhang Y, Wu Q (2019) College library personalized recommendation system based on hybrid recommendation algorithm. Procedia CIRP 83:490–494
19. Osadchiy T, Poliakov I, Olivier P, Rowland M, Foster E (2019) Recommender system based on pairwise association rules. Expert Syst Appl 115:535–542
20. Kumar V, Shrivastva KMP, Singh S (2016) Cross domain recommendation using semantic similarity and tensor decomposition. Procedia Comput Sci 85:317–324
21. Shah JM, Sahu L (2015) A hybrid based recommendation system based on clustering and association. Bin J Data Min & Netw 5(1):36–40
22. Lucas JP, Luz N, Moreno MN, Anacleto R, Figueiredo AA, Martins C (2013) A hybrid recommendation approach for a tourism system. Expert Syst Appl 40(9):3532–3550
23. Ye BK, Tu YJ, Liang TP (2019) A hybrid system for personalized content recommendation. J Electron Commer Res 20(2):91–104
24. Wang Y, Wang M, Xu W (2018) A sentiment-enhanced hybrid recommender system for movie recommendation: a big data analytics framework. Wirel Commun Mob Comput 2018:1–9
25. Bhatt B, Patel PPJ, Gaudani PH (2014) A review paper on machine learning based recommendation system. Development 2:3955–3961
26. Lin J, Puc H, Li Y, Lian J (2018) Intelligent recommendation system for course selection in smart education. Procedia Computer Science 129:449–453
27. Hande R, Gutti A, Shah K, Gandhi J, Kamtikar V (2016) MovieMender: a movie recommendation system. Int J Eng Sci & Res Technol
28. Hong K, Jeon H, Jeon C (2013) Personalized research paper recommendation system using keyword extraction based on user profile. J Converg Inf Technol 8(16):106
29. Umamaheswari J, Akila A (2019) An enhanced human speech emotion recognition using hybrid of PRNN and KNN. In: International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), pp 177–183
30. Tarnowski P, Kołodziej M, Majkowski A, Rak RJ (2017) Emotion recognition using facial expressions. Procedia Computer Science 108:1175–1184
31. Iniyan S, Gupta V, Gupta S (2020) Facial expression recognition based recommendation system. Int J Adv Sci Technol 29(3):5669–5678
32. James HI, Arnold JJA, Ruban JMM, Tamilarasan M, Saranya R (2019) Emotion based music recommender system. Int Res J Eng Technol (IRJET) 6(3):2096–2101
Analysis of Synthetic Data Generation Techniques in Diabetes Prediction

Sujit Kumar Das, Pinki Roy, and Arnab Kumar Mishra
Abstract The problem of inadequate and class-imbalanced data is one of the major problems in classification tasks. Applying synthetic data generation (SDG) approaches to handle class imbalance can therefore be useful in improving Machine Learning (ML) classifiers' performance. The aim of this work is to explore various SDG approaches to improve diabetes prediction using the Pima Indian Diabetes Dataset (PIDD). We also propose a hybrid SDG approach, named SSVMSMOTE, that combines the ideas of the popular SDG techniques Synthetic Minority Oversampling TEchnique (SMOTE) and SVM-SMOTE (Support Vector Machine-Synthetic Minority Oversampling TEchnique). The idea is to divide the training data into equal halves and apply SMOTE and SVM-SMOTE separately to the sub-training samples. The approach successfully overcomes the limitations of SMOTE and SVM-SMOTE. A set of classifiers, namely Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Logistic Regression (LR), Gaussian Naive Bayes (GNB), AdaBoost (AB), Extreme Gradient Boosting (XGB), Gradient Boosting (GB), and Light Gradient Boosting (LGM), are trained on the combined resampled training data and tested on a held-out test set. The experiments show that the boosting classifiers XGB and GB outperformed the other considered classifiers. Further, the XGB classifier, with the help of the proposed SDG technique, achieved the highest average accuracy of 0.9415. The proposed approach also achieved promising results in terms of other important evaluation metrics such as F-scores, AUC, sensitivity, and specificity. Such impressive results suggest the proposed approach's applicability in real-life decision-making processes.

Keywords Diabetes · Oversampling · Classification · Synthetic data generation · Decision-making
S. K. Das (B) · P. Roy · A. Kumar Mishra Department of Computer Science and Engineering, National Institute of Technology Silchar, Assam 788010, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_45
1 Introduction

Diabetes occurs when the pancreas cannot produce insulin or the body cannot make good use of the insulin produced [1]. Diabetes leads to several severe, life-threatening complications if it is not diagnosed on time [2], including diabetic eye disease, cardiovascular disease, complications in pregnancy, and diabetic foot ulcers [3, 4]. According to the International Diabetes Federation (IDF), 463 million adults in the age group of 20–79 years are living with diabetes, and this figure will rise to 700 million by the year 2045 [5]. Unfortunately, nearly 232 million diabetes patients are undiagnosed for reasons such as costly diagnosis and inadequate medical expertise.

The use of machine learning techniques in designing automatic decision-making systems is a major research area, and several approaches now provide faster, more reliable, and cost-effective solutions in various problem areas. Likewise, applying ML techniques to disease identification is one of the research problems explored in recent years [6–9]. However, the literature suggests that ML algorithms are more effective when data are adequate and balanced [10–12], whereas medical data are commonly scarce and imbalanced.

In this work, a diabetes-related dataset (PIDD) [13] is used to explore synthetic sample generation and create a more balanced dataset. The PIDD consists of eight (8) independent attributes and one (1) dependent attribute. The total number of samples is 768, of which 268 are diabetes-positive cases and 500 are negative cases, so the positive samples are roughly half the number of negative cases. Due to this class imbalance, classifiers are usually biased toward the majority class, resulting in wrong predictions and low performance. Therefore, various SDG techniques are explored in this work to observe improvements in prediction results. Also, a two-stage data generation method, SSVMSMOTE, is proposed, which uses SMOTE [14] and SVMSMOTE [15] to generate synthetic samples of the minority class. The primary objective of the work is to enhance classifiers' prediction capabilities by generating suitable synthetic samples in the feature space.

The rest of the paper is organized as follows: Sect. 2 includes a brief discussion of related works; the methodology is discussed in Sect. 2.1; and the results and discussion are included in Sects. 2.8 and 3, respectively. Finally, the work is concluded in Sect. 4.
2 Related Works

The use of ML approaches in diabetes prediction has become quite popular in the last decade. The prediction of ML classifiers has been improved by various preprocessing steps; one of them is the generation of synthetic data samples to give classifiers more knowledge for better learning. García-Ordás et al. [16] proposed a DL pipeline to predict diabetes cases on the PIDD. The pipeline uses a variational auto-encoder to generate synthetic samples, a sparse auto-encoder for feature augmentation, and a DL-based classifier for the prediction task. The approach achieved a promising accuracy of 93.31%; however, the work is limited by the small number of samples in the studied dataset. In another work, Pradipta et al. [17] proposed the Radius-SMOTE oversampling approach, which consists of three stages. First, the minority samples are filtered using KNN. Second, a safe radius is calculated from any random minority sample to its nearest majority sample data points. Finally, synthetic samples are generated within the safe radius distance. The random forest classifier was used to perform classification with the oversampled examples, and the highest accuracy on the PIDD was 86.4%. However, the proposed oversampling approach is vulnerable to small disjunct samples and ignores them. Leguen-deVarona et al. [18] proposed a modified version of the SMOTE technique that uses the covariance matrix instead of KNN to generate synthetic samples. The oversampling method SMOTE-Cov [3] has two variants, one generating synthetic samples within the interval of each attribute and the other capable of generating samples outside the interval. In another work, Zang and Jian [19] modified the SMOTE algorithm by integrating it with AdaBoost. The approach, named WSMOTEBoost, added weight improvements at the SMOTE boundary level and central samples. The sampling weight is determined using the Euclidean distance and the iterative sample weight in AdaBoost, so samples with high values are sampled more often than samples with low values.
2.1 Method

To perform the prediction, preprocessing techniques are first used to handle missing values in the dataset: missing values are replaced by the median value of the respective attribute. Second, after splitting the data into training and testing sets (80:20), data generation techniques, namely SMOTE, SVMSMOTE, ADASYN, SMOTETomek, and SMOTEENN, are used to handle the imbalanced class problem in the dataset. These methods are applied only to the training samples, keeping the test set separate. Third, multiple ML-based classifiers are trained on the resampled training set and tested on the held-out test set to observe their performance in predicting diabetes vs. non-diabetes cases. Finally, a two-stage data generation technique, comprising SMOTE on half of the training samples and SVMSMOTE on the other half, is proposed to generate more suitable synthetic data samples in the feature space. The proposed approach is shown in Fig. 1. The considered ML classifiers are trained on the samples combined from the two oversampling techniques, and the proposed model is likewise evaluated on predictions for the held-out test set. All experiments are performed in the Google Colaboratory platform without GPU settings. To observe the average behavior of the prediction results of the ML classifiers on the considered SDG methods, each experiment is repeated 50 times, and the results are evaluated in terms of the mean values of multiple important evaluation metrics.
Fig. 1 Proposed model diagram (train/test split, equal halving of the training set, SMOTE and SVMSMOTE oversampling, and evaluation of the DT, RF, KNN, LR, GNB, AB, XGB, GB, and LGM classifiers on the test set)
2.2 SMOTE Oversampling

The Synthetic Minority Oversampling TEchnique (SMOTE) was introduced to overcome the overfitting problem that arises in random oversampling. The idea of SMOTE-based oversampling is to use the k nearest neighbors of minority samples to produce synthetic minority samples around them. In general, the number of oversampled points is set such that the distributions of minority and majority samples become equal. The approach has proved successful because the synthetic samples it creates lie close to the minority samples in the feature space: a new point is interpolated between a minority sample and one of its neighbours, as the sketch below illustrates. The SMOTE-based sample distribution is shown as a scatter plot in Fig. 2b, while Fig. 2a shows the training samples before any data generation technique is applied.
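A minimal sketch of this interpolation step (our own illustration; library implementations such as imbalanced-learn's SMOTE handle neighbour search and sample counts internally):

```python
import numpy as np

def smote_sample(X_min, k=3, n_new=10, seed=0):
    """Generate synthetic minority points by interpolating toward
    one of the k nearest minority neighbours (core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        # Distances to all minority samples; pick one of the k nearest
        # (index 0 is the point itself, so it is skipped).
        d = np.linalg.norm(X_min - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        x_nn = X_min[rng.choice(neighbours)]
        lam = rng.random()                 # interpolation factor in [0, 1]
        synthetic.append(x + lam * (x_nn - x))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
print(smote_sample(X_min, k=2, n_new=3))
```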
2.3 SVM-SMOTE Oversampling

SVM-SMOTE, sometimes known as Borderline-SMOTE SVM, uses the SVM algorithm to create synthetic minority samples. In this approach, an SVM is trained on the original training samples to approximate the borderline area in the feature space with the help of the support vectors. Therefore, more data is synthesized away from the region of class overlap: the synthesized minority samples are generated where the classes are separated from each other. The SVMSMOTE-based oversampled data distribution is shown in Fig. 2c.
2.4 ADASYN Oversampling

The adaptive synthetic sampling method (ADASYN) [20] is an oversampling approach based on the density of the minority samples: the number of synthetic samples generated is inversely proportional to the local density of the minority class in the feature space.
Fig. 2 Scatter plots after using various SDG techniques (panels a–f)
Thus, more synthetic minority-class samples are generated in regions where the minority density is low than in regions where it is high. ADASYN generates synthetic samples using the following steps:

1. Calculate the imbalance degree d = N_min / N_maj, where N_min and N_maj are the numbers of minority and majority samples in a training dataset D_t with M samples (X_i, Y_i).
2. If d ≤ d_x, where d_x is the threshold of maximum tolerated imbalance, then:
   • The number of synthetic samples to generate is N_syn = (N_maj − N_min) × β, where β sets the balance level of the synthetic samples.
   • For each minority sample X_i, find its K nearest neighbors and compute the ratio R_i = Δ_i / K, where Δ_i is the number of majority samples among those neighbors.
   • Normalize R̂_i = R_i / Σ R_i, so that R̂ forms a density distribution.
   • The number of synthetic samples to generate for each minority sample is G_i = R̂_i × G, where G is the total number of minority samples that need to be generated.
   • For each minority sample X_i, generate G_i synthetic samples: pick a random minority sample X_u from the K nearest neighbors of X_i and create the synthetic point S_i = X_i + (X_u − X_i) × λ, where λ ∈ [0, 1].

The result of ADASYN-based oversampling is shown in a scatter plot in Fig. 2d. A small sketch of the density rule follows.
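The sketch below illustrates computing R_i and the per-sample counts G_i (our own illustration; function and parameter names are ours, with the minority class coded as 1):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_counts(X, y, beta=1.0, k=5):
    """Per-minority-sample synthetic counts G_i, following the ADASYN
    density rule: harder (majority-surrounded) points get more samples."""
    X_min = X[y == 1]
    G = int((np.sum(y == 0) - np.sum(y == 1)) * beta)  # total to generate
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1: self included
    _, idx = nn.kneighbors(X_min)
    # R_i: fraction of majority points among the k neighbours of x_i.
    r = np.array([np.mean(y[row[1:]] == 0) for row in idx])
    r_hat = r / r.sum() if r.sum() > 0 else np.full(len(r), 1 / len(r))
    return np.round(r_hat * G).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1])
print(adasyn_counts(X, y, k=3))   # how many points to synthesize per sample
```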
2.5 SMOTETomek Data Generation

SMOTETomek [21] is a hybrid approach that uses undersampling and oversampling together. The SMOTETomek-generated samples are shown in Fig. 2e as a scatter plot. Oversampling in this approach is carried out by SMOTE, which generates synthetic samples by interpolating between each minority sample and its nearest minority data points. Tomek links, on the other hand, are used for undersampling; this works as a modified condensed nearest neighbors undersampling technique. A Tomek link is defined as follows: let samples X_i and X_j belong to the minority and majority class, respectively. If no sample X_k satisfies either of the following conditions, the pair (X_i, X_j) is called a Tomek link:

d(X_i, X_k) < d(X_i, X_j)   (1)

d(X_j, X_k) < d(X_i, X_j)   (2)
The steps of SMOTETomek are therefore (a sketch of the Tomek-link check follows the list):

1. Initiate SMOTE: choose any minority sample.
2. Find the distance (d) between the chosen sample and each of its k nearest neighbors.
3. Multiply the difference by a random number in the range 0–1 and add the result to the chosen minority sample to create a new synthetic sample.
4. Repeat steps 2 and 3 until a suitable ratio of majority to minority class samples is reached (stop SMOTE).
5. Initiate Tomek: select any majority-class sample.
6. If the selected sample's nearest neighbor is a minority-class sample, i.e., the two form a Tomek link, remove the sample (stop Tomek).
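The Tomek-link condition of Eqs. (1) and (2) can be checked directly, as in this sketch (illustrative only; imbalanced-learn's TomekLinks automates this over the whole dataset):

```python
import numpy as np

def is_tomek_link(X, y, i, j):
    """Check Eqs. (1)-(2): x_i and x_j (opposite classes) form a Tomek
    link iff no third point is closer to either of them than they are
    to each other."""
    if y[i] == y[j]:
        return False
    d_ij = np.linalg.norm(X[i] - X[j])
    for k in range(len(X)):
        if k in (i, j):
            continue
        if (np.linalg.norm(X[i] - X[k]) < d_ij or
                np.linalg.norm(X[j] - X[k]) < d_ij):
            return False               # some x_k violates (1) or (2)
    return True
```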
2.6 SMOTEENN Synthetic Data Generation

Similar to SMOTETomek, SMOTEENN [22] combines SMOTE for oversampling with Edited Nearest Neighbors (ENN) for undersampling. A SMOTEENN-based oversampled example is shown as a scatter plot in Fig. 2f. ENN works by finding the k nearest neighbors of every observed sample; if the majority of the k nearest neighbors do not match the observation's class, the observation is removed. SMOTEENN's working principle, step by step, is as follows (a sketch of the ENN step follows the list):

1. Initiate SMOTE: choose any minority sample.
2. Find the distance (d) between the chosen sample and each of its k nearest neighbors.
3. Multiply the difference by a random number in the range 0–1 and add the result to the chosen minority sample to create a new synthetic sample.
4. Repeat steps 2 and 3 until a suitable ratio of majority to minority class samples is reached (stop SMOTE).
5. Initiate ENN: choose the value of K. In this work, K is taken as 3 (the default).
6. For each observation, find its K nearest neighbors within the dataset and determine the majority class among them.
7. Compare the observation's class with the majority class of its K nearest neighbors; if they differ, remove the observation.
8. Repeat steps 5 to 7 until the classes are balanced (stop ENN).
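A sketch of the ENN filtering step (our own illustration following the standard ENN rule, under the assumption that only the disagreeing sample is dropped; X and y are NumPy arrays with integer class labels):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbours step: drop any sample whose class
    disagrees with the majority vote of its k nearest neighbours."""
    knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)  # +1: self
    keep = []
    for i in range(len(X)):
        neigh = knn.kneighbors(X[i:i + 1], return_distance=False)[0]
        neigh = neigh[neigh != i][:k]        # exclude the sample itself
        votes = np.bincount(y[neigh], minlength=2)
        keep.append(votes.argmax() == y[i])  # keep only agreeing samples
    keep = np.array(keep)
    return X[keep], y[keep]
```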
2.7 Proposed Approach of SDG

The idea in this work is to generate more representative synthetic samples by combining the SMOTE and SVMSMOTE approaches. First, the training samples are divided into two equal halves. Second, SMOTE oversampling is applied to the first half, generating minority samples along the line segments between randomly chosen minority samples and their k nearest neighbors. Third, the SVM-SMOTE oversampling method is applied to the remaining half, which generates minority samples near the boundary region between the two classes. Finally, the separately generated synthetic samples are combined to obtain a balanced and more representative dataset; a minimal sketch of this pipeline follows. The scatter plot of the data samples generated by the proposed approach is shown in Fig. 3. The synthetic minority samples generated by the proposed approach therefore have the characteristics of both the SMOTE and SVM-SMOTE methods, and the suitable representation of the data distribution in the feature space demonstrates the significance of the proposed approach for generating synthetic samples.
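A minimal sketch of the two-stage pipeline (assuming the scikit-learn, imbalanced-learn, and xgboost packages; the split ratios follow the paper, while all other hyperparameters are illustrative rather than the exact experimental settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE, SVMSMOTE
from xgboost import XGBClassifier

def ssvmsmote_fit(X, y, seed=0):
    """Sketch of the SSVMSMOTE idea: oversample one half of the training
    data with SMOTE, the other half with SVM-SMOTE, then train on the
    recombined, balanced set and score on the held-out test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    # Divide the training set into two equal halves.
    X1, X2, y1, y2 = train_test_split(
        X_tr, y_tr, test_size=0.5, stratify=y_tr, random_state=seed)
    X1r, y1r = SMOTE(random_state=seed).fit_resample(X1, y1)
    X2r, y2r = SVMSMOTE(random_state=seed).fit_resample(X2, y2)
    # Combine the two resampled halves into one training set.
    X_res = np.vstack([X1r, X2r])
    y_res = np.concatenate([y1r, y2r])
    clf = XGBClassifier(eval_metric="logloss", random_state=seed)
    clf.fit(X_res, y_res)
    return clf.score(X_te, y_te)     # accuracy on the held-out test set
```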
Fig. 3 Scatter plot after using the proposed SSVMSMOTE oversampling
2.8 Results

The experimental results of diabetes prediction are reported in multiple stages. The average accuracy, F1-scores, and AUC values are tabulated for each SDG method under consideration. Table 1 shows the classifiers' performance on diabetes prediction after applying SMOTE oversampling to the training samples. The Extreme Gradient Boosting (XGB) and Gradient Boosting (GB) classifiers achieved the highest results in terms of accuracy, F-scores, and AUC values when SMOTE oversampling was used. Similarly, Table 2 reports results for the SVMSMOTE-based oversampling approach. It is notable that SVMSMOTE-based oversampling shows similar characteristics,
Table 1 Average accuracy, F1-scores, and AUC value after SMOTE oversampling

Classifiers  Accuracy  F1-scores  AUC
DT    0.8537   0.7958   0.8441
RF    0.9048   0.8642   0.9006
KNN   0.8485   0.7970   0.8491
LR    0.8124   0.7539   0.8144
GNB   0.8249   0.7714   0.8267
AB    0.9071   0.8695   0.9036
XGB   0.9119   0.8752   0.9119
GB    0.9132   0.8788   0.9096
LGM   0.9085   0.8696   0.9030
Table 2 Average accuracy, F1-scores, and AUC value after SVMSMOTE oversampling

Classifiers  Accuracy  F1-scores  AUC
DT    0.8571   0.7974   0.8461
RF    0.9061   0.8678   0.9033
KNN   0.8467   0.7973   0.8495
LR    0.8038   0.7456   0.8119
GNB   0.8158   0.7618   0.8183
AB    0.9023   0.8630   0.9018
XGB   0.9128   0.8798   0.9129
GB    0.9103   0.8768   0.9086
LGM   0.9087   0.8717   0.9054
Table 3 Average accuracy, F1-scores, and AUC value after ADASYN oversampling

Classifiers  Accuracy  F1-scores  AUC
DT    0.8555   0.7992   0.8486
RF    0.8994   0.8608   0.8989
KNN   0.8083   0.7617   0.8278
LR    0.7814   0.7244   0.7917
GNB   0.8122   0.7504   0.8162
AB    0.9024   0.8641   0.9029
XGB   0.9084   0.8733   0.9107
GB    0.9087   0.8742   0.9083
LGM   0.9064   0.8683   0.9065
and the XGB classifier again outperforms the other classifiers under consideration. The prediction results for the ADASYN-based oversampling method are shown in Table 3; here the XGB and GB classifiers give the best results among the considered classifiers. In Table 4, prediction results after applying SMOTETomek SDG are shown in terms of average accuracy, F1-scores, and AUC value; the results show that the boosting-based classifiers XGB and GB again achieved the highest values. The SMOTEENN-based prediction results of all considered classifiers are shown in Table 5. With this approach, a different dominant classifier is observed: LGM achieves the highest results across all considered evaluation metrics, but its results are not as promising as those of XGB and GB under the other SDG techniques; the highest average accuracy achieved by the LGM classifier is 0.8757. The results of the proposed SDG are shown in Table 6 in terms of average accuracy, F1-scores, and AUC value. With this approach, almost all considered classifiers achieve improved results compared to the previous SDG techniques. The highest results were achieved by the XGB classifier, with an average accuracy,
Table 4 Average accuracy, F1-scores, and AUC value after SMOTETomek oversampling

Classifiers  Accuracy  F1-scores  AUC
DT    0.8572   0.8000   0.8486
RF    0.9032   0.8630   0.8980
KNN   0.8471   0.7914   0.8457
LR    0.8224   0.7609   0.8239
GNB   0.8242   0.7645   0.8228
AB    0.9075   0.8677   0.9024
XGB   0.9123   0.9093   0.9107
GB    0.9118   0.8793   0.9098
LGM   0.9085   0.8684   0.9039
Table 5 Average accuracy, F1-scores, and AUC value after applying SMOTEENN

Classifiers  Accuracy  F1-scores  AUC
DT    0.8411   0.7876   0.8425
RF    0.8732   0.8311   0.8782
KNN   0.8415   0.7921   0.8474
LR    0.8264   0.7718   0.8341
GNB   0.8164   0.7688   0.8306
AB    0.8605   0.8136   0.8648
XGB   0.8619   0.8153   0.8677
GB    0.8533   0.8063   0.8590
LGM   0.8757   0.8295   0.8790
F1-scores, and AUC value of 0.9415, 0.9189, and 0.9383, respectively. Apart from average accuracy, F1-scores, and AUC value, other important evaluation metrics in disease prediction systems are sensitivity and specificity. Therefore, a sensitivity versus specificity bar graph of the dominant classifier (XGB) is shown in Fig. 4. The graph shows a similar characteristic: the proposed SDG technique has outperformed the other data generation techniques considered in this work.
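The chapter does not reproduce its evaluation code, but the pipeline it describes (oversample only the training samples, train each classifier, and average accuracy, F1 and AUC) can be sketched with scikit-learn, imbalanced-learn and XGBoost. Everything below (the names X, y, evaluate_with_sdg, the fold count) is an illustrative assumption, not the authors' implementation:

```python
# Sketch of the oversample-then-evaluate pipeline; X, y are assumed to be
# numpy arrays holding the preprocessed Pima Indian Diabetes features/labels.
import numpy as np
from imblearn.over_sampling import SMOTE, SVMSMOTE, ADASYN
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from xgboost import XGBClassifier

def evaluate_with_sdg(X, y, sampler, n_splits=5, seed=42):
    """Oversample only the training folds, then report mean accuracy/F1/AUC."""
    accs, f1s, aucs = [], [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        X_res, y_res = sampler.fit_resample(X[train_idx], y[train_idx])
        clf = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)
        proba = clf.predict_proba(X[test_idx])[:, 1]
        pred = (proba >= 0.5).astype(int)
        accs.append(accuracy_score(y[test_idx], pred))
        f1s.append(f1_score(y[test_idx], pred))
        aucs.append(roc_auc_score(y[test_idx], proba))
    return np.mean(accs), np.mean(f1s), np.mean(aucs)

# e.g. evaluate_with_sdg(X, y, SMOTE(random_state=42))
# or   evaluate_with_sdg(X, y, SVMSMOTE(random_state=42))
```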
3 Discussion
Generating synthetic data from an imbalanced dataset to improve classifier performance is a major research area. Suitable data generation in the feature space to enhance prediction performance is explored in this work. The scatter plots in Fig. 2 show that each of these methods has some advantages and disadvantages. The minority samples generated by SMOTE sometimes become a cause of overfitting. Although SVM-SMOTE overcame this problem of SMOTE by generating
Table 6 Average accuracy, F1-scores, and AUC value after SSVMSMOTE (proposed) data generation

Classifiers | Accuracy | F1-scores | AUC
DT | 0.8483 | 0.7873 | 0.8387
RF | 0.9093 | 0.8727 | 0.9051
KNN | 0.8524 | 0.7963 | 0.8512
LR | 0.8135 | 0.7530 | 0.8184
GNB | 0.8164 | 0.7549 | 0.8161
AB | 0.9044 | 0.8675 | 0.9023
XGB | 0.9415 | 0.9189 | 0.9383
GB | 0.9111 | 0.8748 | 0.9108
LGM | 0.9109 | 0.8732 | 0.9064
Fig. 4 Sensitivity and specificity graph based on different SDG methods (bar chart of sensitivity vs. specificity of the XGB classifier; x-axis: oversampling approaches SMOTE, SVMSMOTE, ADASYN, SMOTETomek, SMOTEENN and SSVMSMOTE)
more data around the decision boundary, it is important to balance these two ideas of SMOTE and SVM-SMOTE to generate more suitable samples in the feature space. Therefore, in this work, an SDG approach is introduced using both approaches. The results show that classifiers trained on SSVMSMOTE-based data generation make noticeably more correct predictions than those trained with the other considered approaches. Further, among the classifiers used in this work, boosting approaches performed better than the other classifiers. A likely reason for their dominance is that when the data distribution in the feature space is more generalized, ensemble approaches like AB, GB, and XGB give better performance. The SDG by the proposed approach has given equal importance to generating data based on K-nearest neighbors and support vectors. It
Table 7 Comparison with SOTA works on PIDD

Authors | Data generation | Accuracy (%)
García-Ordás et al. [16] | Variational autoencoder | 93.31
Pradipta et al. [17] | Radius-SMOTE | 86.4
Nnamoko and Korkontzelos [23] | IQRd+SMOTEd | 89.5
Proposed approach | SSVMSMOTE | 94.15
results in more generalized samples to solve the imbalance and data insufficiency problems. The validation of the proposed approach is also achieved by comparing the results in Table 7 with some recent SOTA works on PIDD. The comparison table shows that the proposed approach outperformed them with the highest accuracy of 94.15%.
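The excerpt does not spell out the exact blending rule behind SSVMSMOTE. One plausible reading of "equal importance to K-nearest neighbors and support vectors" is to draw half of the required synthetic minority samples from SMOTE and half from SVMSMOTE; the sketch below implements that interpretation with imbalanced-learn and should be treated as an assumption, not the authors' code:

```python
# Hedged sketch of a two-stage SMOTE + SVMSMOTE combination ("SSVMSMOTE"-style).
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE, SVMSMOTE

def ssvmsmote_like(X, y, seed=42):
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    # Each sampler closes half of the class-size gap.
    half_target = {minority: counts[minority] + deficit // 2}
    Xs, ys = SMOTE(sampling_strategy=half_target, random_state=seed).fit_resample(X, y)
    Xv, yv = SVMSMOTE(sampling_strategy=half_target, random_state=seed).fit_resample(X, y)
    # imblearn appends synthetic rows after the originals, so keep the original
    # data once plus the synthetic tails of both samplers.
    n = len(y)
    return np.vstack([X, Xs[n:], Xv[n:]]), np.concatenate([y, ys[n:], yv[n:]])
```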
4 Conclusion
Class imbalance is a major problem that causes a significant fall in the prediction performance of classifiers. Researchers have introduced various methods to handle data imbalance. In this work, we have explored multiple SDG techniques to handle data imbalance in the Pima Indian Diabetes dataset. Also, a two-stage approach is proposed to generate more generalized samples in the feature space. Several classifiers were trained on the samples generated by SSVMSMOTE and their performance evaluated in terms of various important evaluation metrics. The results show that the proposed SDG technique helped improve diabetes vs. non-diabetes prediction. It has also been observed that the ensemble boosting classifiers are more prominent in giving the best results compared to other classifiers. Specifically, the Extreme Gradient Boosting and Gradient Boosting classifiers have outperformed the rest and achieved the highest results among the considered classifiers. The highest average accuracy, F-scores, and AUC value achieved by the XGB classifier after using the proposed SSVMSMOTE SDG method are 0.9415, 0.9189, and 0.9383, respectively. The comparison with SOTA works shows that the proposed SSVMSMOTE has helped in achieving better prediction accuracy. Although the proposed approach has given promising results, it has a few limitations, which can be addressed for further improvement. Firstly, in this work, we have not used any feature engineering such as feature extraction or elimination; using such techniques to find the most relevant features can help improve prediction results further. Secondly, it is important to explore the proposed SDG technique on other related datasets. We plan to do so in future work.
References
1. Das SK, Mishra A, Roy P (2018) Automatic diabetes prediction using tree based ensemble learners. In: Proceedings of international conference on computational intelligence and IoT (ICCI IoT)
2. Das SK, Roy P, Mishra AK (2021) Deep learning techniques dealing with diabetes mellitus: a comprehensive study. In: Health informatics: a computational perspective in healthcare. Springer, Singapore, pp 295–323
3. Das SK, Roy P, Mishra AK (2021) Recognition of ischaemia and infection in diabetic foot ulcer: a deep convolutional neural network based approach. Int J Imaging Syst Technol
4. Das SK, Roy P, Mishra AK (2021) DFU_SPNet: a stacked parallel convolution layers based CNN to improve Diabetic Foot Ulcer classification. ICT Express
5. IDF diabetes facts and figures. https://idf.org/aboutdiabetes/what-is-diabetes/facts-figures.html. Accessed 10 Oct 2021
6. Mishra AK et al (2020) Identifying COVID19 from chest CT images: a deep convolutional neural networks based approach. J Healthc Eng 2020
7. Mishra AK et al (2021) Breast ultrasound tumour classification: a machine learning-radiomics based approach. Expert Syst, e12713
8. Jain D, Mishra AK, Das SK (2021) Machine learning based automatic prediction of Parkinson's disease using speech features. In: Proceedings of international conference on artificial intelligence and applications. Springer, Singapore
9. Das SK, Roy P, Mishra AK (2021) Fusion of handcrafted and deep convolutional neural network features for effective identification of diabetic foot ulcer. Concurr Comput Pract Exp, e6690
10. Namasudra S (2020) Fast and secure data accessing by using DNA computing for the cloud environment. IEEE Trans Serv Comput
11. Namasudra S et al (2020) Securing multimedia by using DNA-based encryption in the cloud computing environment. ACM Trans Multimed Comput Commun Appl (TOMM) 16(3s):1–19
12. Sharma P, Borah MD, Namasudra S (2021) Improving security of medical big data by using Blockchain technology. Comput Electr Eng 96:107529
13. PIMA diabetes dataset. https://data.world/uci/pima-indians-diabetes. Accessed 05 Oct 2021
14. Chawla NV et al (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
15. Nguyen HM, Cooper EW, Kamei K (2011) Borderline over-sampling for imbalanced data classification. Int J Knowl Eng Soft Data Parad 3(1):4–21
16. García-Ordás MT et al (2021) Diabetes detection using deep learning techniques with oversampling and feature augmentation. Comput Methods Programs Biomed 202:105968
17. Pradipta GA et al (2021) Radius-SMOTE: a new oversampling technique of minority samples based on radius distance for learning from imbalanced data. IEEE Access 9:74763–74777
18. Leguen-deVarona I et al (2020) SMOTE-Cov: a new oversampling method based on the covariance matrix. In: Data analysis and optimization for engineering and computing problems. Springer, Cham, pp 207–215
19. Zhang Y, Jian X (2021) Unbalanced data classification based on oversampling and integrated learning. In: 2021 Asia-Pacific conference on communications technology and computer science (ACCTCS). IEEE
20. He H et al (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE
21. Batista GE, Bazzan ALC, Monard MC (2003) Balancing training data for automated annotation of keywords: a case study. WOB
22. Batista G, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29
23. Nnamoko N, Korkontzelos I (2020) Efficient treatment of outliers and class imbalance for diabetes prediction. Artif Intell Med 104:101815
Beyond Information Exchange: An Approach to Deploy Network Properties for Information Diffusion
Soumita Das, Anupam Biswas, and Ravi Kishore Devarapalli
Abstract Information diffusion in Online Social Networks is a new and crucial problem in the social network analysis field and requires significant research attention. Efficient diffusion of information is of critical importance in diverse situations such as pandemic prevention, advertising, and marketing. Although several mathematical models have been developed to date, previous works have lacked systematic analysis and exploration of the influence of the neighborhood on information diffusion. In this paper, we propose the Common Neighborhood Strategy (CNS) algorithm for information diffusion, which demonstrates the role of the common neighborhood in information propagation throughout the network. The performance of the CNS algorithm is evaluated on several real-world datasets in terms of diffusion speed and diffusion outspread and compared with several widely used information diffusion models. Empirical results show CNS enables better information diffusion both in terms of diffusion speed and diffusion outspread. Keywords Information diffusion · Common neighborhood · Diffusion speed · Diffusion outspread
S. Das (B) · A. Biswas · R. Kishore Devarapalli
Department of Computer Science and Engineering, National Institute of Technology, Silchar 788010, Assam, India
e-mail: [email protected]
A. Biswas e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_46

1 Introduction
Online Social Networks (OSNs) play a key functional role in modern information sharing. Hence, they are used extensively in information diffusion research to examine real-world information diffusion processes. As information sharing in such networks occurs through social contacts, the underlying network properties play a significant role in information diffusion. Recently, extensive research has been conducted to understand the role of network properties in the dynamics of information diffusion in OSNs [1–4].
In particular, the propagation of information across these OSNs provides indications of diverse events and situational awareness, such as marketing prediction, social consciousness, and terrorist activities [5]. It is important to understand how information will spread throughout the network in the future. Hence, various information diffusion models have been developed to understand information diffusion behavior in the real world. These models are broadly classified into two categories: predictive models and epidemic models. Predictive models consider the network structure to analyze dispersion; for instance, the popularity of a book among a population can be predicted by using such a model. Epidemic models consider influence to model epidemiological processes; in particular, the spread of malware is modeled by epidemic models. Several variants of predictive and epidemic models have been introduced to date. For example, the Independent Cascade (IC) model, Linear Threshold (LT) model and Game Theory (GT) model are sender-centric, receiver-centric and profit-centric predictive models, respectively, whereas the SI (Susceptible Infected), SIR (Susceptible Infected Removed), SIS (Susceptible Infected Susceptible) and SIRS (Susceptible Infected Removed Susceptible) models are epidemic models [6–9]. Overall, prior work on information diffusion has focused either on the wideness of information diffusion, i.e. the total number of infected nodes in each iteration, or on the efficiency of information diffusion, i.e. the total time required to reach a steady state. The balance between these two important aspects has been underexplored. A good understanding of information diffusion outspread is important in situations where information needs to be propagated among a wider section of the population; for example, wider spread is required in marketing a product. Diffusion speed, in contrast, is important in applications where the rate of information flow is to be prioritized; for instance, on election day, a get-out-the-vote campaign depends on diffusion speed. In light of these significant applications of information diffusion on real lives, we have proposed the CNS algorithm.
1.1 Basic Idea
In Online Social Networks (OSNs), individuals share hundreds or even thousands of connections which associate them with friends, colleagues, family, etc. However, all of these connections are not equally strong. Studies suggest that strong connections reside within densely connected nodes [1, 10–12]. Identifying these strong connections for information diffusion facilitates initiating information exchange between strongly connected individuals, which in turn triggers exchange of information between the corresponding densely connected neighbors. As information propagation in OSNs expands through social contacts, exposure to these multiple sources facilitates rapid diffusion of information throughout the social network. For example, friendship relationships in OSNs may influence a group of users to download an App or software compatible with their friends' to stay in touch. Therefore, investigating and deploying strong connections/relationships for information
diffusion between densely connected neighbors in OSNs is important to analyze its effectiveness in information propagation throughout the network.
1.2 Contributions
In this paper, a novel network property-based information diffusion algorithm is presented. The contributions of this paper are as follows:
– Utilization of network properties in the diffusion of information throughout the network is addressed, demonstrating the importance of the common neighborhood in information propagation.
– The efficiency of network properties in faster and wider information propagation is illustrated by evaluating and comparing the proposed algorithm with several popular information diffusion models based on diffusion speed and diffusion outspread.
2 Method
In this section, we discuss our proposed information diffusion algorithm, called Common Neighborhood Strategy (CNS). Information diffusion in social networks spreads through social contacts, which indicates that the underlying network structure plays a significant role in the information diffusion pattern. We have considered a network property, namely the common neighborhood, to examine its effect on the dynamics of information diffusion. Before proceeding further, let us first formalize some of the most frequently used terms. Suppose we have a graph G(V, E) where V indicates the set of nodes and E indicates the set of edges. For any connected node pair (v, u) ∈ V, an edge ev,u indicates a connection from node v to node u, and the set of neighbors of node v is represented by Γ(v).
2.1 Common Neighborhood
For a connected node pair (v, u), the common neighborhood ρv,u is defined in terms of the common neighbors of nodes v and u. A higher common neighborhood score indicates greater interaction frequency and hence higher similarity. Depending on the number of common neighbors shared by the connected node pair, the common neighborhood ρv,u is computed by considering the following cases:
Case a. Nodes v and u do not share any common neighbor, i.e. |Γ(v) ∩ Γ(u)| = ∅; then the common neighborhood of the connected node pair (v, u) is defined by

$$\rho_{v,u} = \begin{cases} 1, & \text{if } |\Gamma(v)| = 1 \text{ or } |\Gamma(u)| = 1 \\ 0, & \text{if } |\Gamma(v)| > 1 \text{ and } |\Gamma(u)| > 1 \end{cases} \qquad (1)$$
In Eq. 1, ρv,u = 1 if either |Γ(v)| = 1 or |Γ(u)| = 1, because in this case v ↔ u is the only available path for information exchange; hence, this path will have maximum interaction frequency for any kind of interaction between nodes v and u.
Case b. Nodes v and u share at least one common neighbor, i.e. |Γ(v) ∩ Γ(u)| ≥ 1; then the common neighborhood of the connected node pair (v, u) is defined by

$$\rho_{v,u} = |\Gamma(v) \cap \Gamma(u)| + |\Gamma(v) \cap \Gamma(z)| + |\Gamma(u) \cap \Gamma(z)| + |\sigma_{vu}| + |\Gamma(w) \cap \Gamma(z)|, \quad \forall (w,z) \in (\Gamma(v) \cap \Gamma(u)) \text{ if } e_{w,z} \in E,\ w \neq z \qquad (2)$$
There are five terms in Eq. 2. The first term |Γ(v) ∩ Γ(u)| indicates the number of common neighbors shared by nodes v and u; the second term |Γ(v) ∩ Γ(z)| and the third term |Γ(u) ∩ Γ(z)| indicate the number of common neighbors shared by nodes v and u with a common neighbor of v and u, respectively; the fourth term |σvu| indicates the number of connections shared by common neighbors of v and u; and the fifth term |Γ(w) ∩ Γ(z)| indicates the number of common neighbors shared by common neighbors of v and u.
Neighborhood Density: For a connected node pair (v, u), the neighborhood density is used to compute the density of its closely connected neighbors. This density score is utilized to measure the tie strength of a connected node pair; a greater tie strength score indicates higher proximity. For any connected node pair (v, u), the neighborhood density/tie strength is computed using the common neighborhood ρv,u and is defined by

$$\Phi_{v,u} = \frac{\rho_{v,u}}{\max_{u \in \Gamma(v)} \rho_{v,u}} \qquad (3)$$
In Eq. 3, the term in the denominator indicates the maximum common neighborhood score shared by v with any of its neighboring nodes. The denominator makes the tie strength Φv,u asymmetric. If Φv,u = 1, the tie strength shared by nodes v and u is maximum, whereas if ρv,u = 0, then Φv,u = 0, meaning that if the common neighborhood of a node pair is 0, its tie strength is 0. In this context, a tie between node pair (v, u) is represented by the edge ev,u ∈ E indicating a connection from node v to node u. U contains the list of edges in graph G having maximum tie strength score.
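Equations (1)–(3) translate almost directly into code. The sketch below uses networkx; because Eq. (2) quantifies over connected common-neighbor pairs (w, z), the z- and (w, z)-dependent terms are summed over all such pairs here, which is one plausible reading of the notation rather than a confirmed one:

```python
# Sketch of the common neighborhood rho (Eqs. 1-2) and tie strength Phi (Eq. 3).
import networkx as nx
from itertools import combinations

def common_neighborhood(G, v, u):
    cn = set(G[v]) & set(G[u])
    if not cn:  # Case a: no shared common neighbor
        return 1 if (G.degree(v) == 1 or G.degree(u) == 1) else 0
    # Case b: first term |CN(v,u)|, remaining terms over connected pairs (w, z) in CN
    rho = len(cn)
    for w, z in combinations(cn, 2):
        if G.has_edge(w, z):
            rho += 1                               # contributes to sigma_vu (4th term)
            rho += len(set(G[v]) & set(G[z]))      # second term
            rho += len(set(G[u]) & set(G[z]))      # third term
            rho += len(set(G[w]) & set(G[z]))      # fifth term
    return rho

def tie_strength(G, v, u):
    denom = max(common_neighborhood(G, v, w) for w in G[v])
    return common_neighborhood(G, v, u) / denom if denom else 0.0

G = nx.karate_club_graph()
print(tie_strength(G, 2, 1))  # Phi_{2,1} on the Karate network of Fig. 1
```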
2.2 Common Neighborhood Strategy
Here, we present the Common Neighborhood Strategy (CNS) algorithm, developed to investigate the effectiveness of the common neighborhood for information diffusion. The proposed CNS algorithm comprises three aspects. Let us consider that the diffusion process starts from a node v ∈ V. The first aspect is to identify the adjacent node/nodes of v with which it shares maximum tie strength,
i.e. if Φv,u = 1, it indicates that node v shares maximum tie strength with node u. Then, information diffusion is initiated from node v to node u and is defined by
$$\tau_1 = \sum_{u \in \Gamma(v),\ e_{v,u} \in U} \phi_{v \to u} \qquad (4)$$
Equation 4 indicates that node v exchanges information with the neighboring node/nodes for which Φv,u = 1, u ∈ Γ(v). Here, φv→u indicates that information exchange takes place from node v to node u. After identification of node u, which shares maximum tie strength with v, the second aspect is to exchange information with the nodes contributing to the common neighborhood ρv,u of the connected node pair (v, u). Let us assume that F contains the list of nodes contributing to the common neighborhood of node pair (v, u). Then, the second aspect is defined by

$$\tau_2 = \sum_{i,j \in F,\ i,j \in V,\ i \in \Gamma(v),\ j \in \Gamma(u)} (\phi_{v \to i} + \phi_{u \to j}) \qquad (5)$$
Nodes v and u exchange information with their neighbors that contribute to the common neighborhood of the connected node pair (v, u), as indicated by the terms φv→i and φu→j, respectively, in Eq. 5. The third aspect is to identify those neighboring nodes of v which have not yet participated in the information diffusion process and which share maximum tie strength with v, i.e. ev,z ∉ U but ez,v ∈ U where z ∈ Γ(v), and is defined by

$$\tau_3 = \phi_{v \to z}, \quad \forall z \in \Gamma(v),\ e_{z,v} \in U \qquad (6)$$
Equation 6 indicates that if node z shares its maximum tie strength score with node v and has not participated in information exchange before, then node v exchanges information with node z. Ultimately, the combination of the three aspects presented in Eqs. 4, 5 and 6 gives the information diffusion score adopted by the CNS algorithm. Information diffusion from node v as defined by the CNS algorithm is given by

$$I_v = \tau_1 + \tau_2 + \tau_3 \qquad (7)$$
Equation 7 indicates that information diffusion from node v is a combination of τ1, τ2 and τ3. The proposed information diffusion-based CNS algorithm is designed considering the common neighborhood concept. A step-by-step illustration of the diffusion process of the CNS algorithm is presented in Fig. 1. Let us assume that node 2 initiates the diffusion process. In Step 1, node 2 identifies the adjacent edge having the maximum tie strength score. Here, node 2 shares maximum tie strength with node 1, indicated by the blue arc in subfigure (a). Then, in Step 2, node 2 activates node 1. In Step 3, all the nodes contributing to the maximum tie strength score of node pair (2,1) are targeted for activation. At the end of iteration one, nodes 1 and 9 identify the adjacent edges having maximum tie strength score, i.e. arcs (1,5) and (9,33), respectively.
Fig. 1 Demonstration of the CNS algorithm using the Karate network; panels (a)–(l) show Steps 1–12 across iterations one to three. Red colored nodes indicate active nodes, while gray colored nodes indicate inactive nodes; blue arcs refer to strong ties; red arcs represent connections where activated nodes try to activate their inactive neighbors; peach arcs represent edges propagated once; and peach nodes indicate nodes that have already participated in information diffusion
Thereafter, in iterations two and three, all the steps followed in iteration one are repeated to activate inactive neighbors, and ultimately, the diffusion pattern obtained by incorporation of the CNS algorithm is shown in subfigure (l) of Fig. 1 by peach arcs. Pseudocode of the CNS method is shown in Algorithm 1.
Algorithm 1: CNS
Input: Social network, G = (V, E)
Output: List of activated nodes A
1: CNS(G)
2: for each node v in G do
3:     if ev,u in U then
4:         // Compute information diffusion using Eq. (7)
5:     end if
6: end for
7: return A
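A runnable interpretation of Algorithm 1, reusing the tie_strength helper sketched earlier, could look as follows. The three blocks inside the loop mirror Eqs. (4)–(6); since the pseudocode leaves iteration details implicit, this is a hedged reconstruction rather than the authors' exact procedure:

```python
# Hedged sketch of the CNS diffusion loop (aspects 1-3 of Eqs. 4-6).
def cns_diffusion(G, source):
    active = {source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:
            # Aspect 1 (Eq. 4): neighbours u with maximal tie strength Phi_{v,u} = 1
            for u in (u for u in G[v] if tie_strength(G, v, u) == 1.0):
                # Aspect 2 (Eq. 5): nodes contributing to the common
                # neighbourhood of the pair (v, u)
                for t in ({u} | (set(G[v]) & set(G[u]))) - active:
                    active.add(t)
                    next_frontier.append(t)
            # Aspect 3 (Eq. 6): inactive neighbours z for which v is the
            # maximum-tie-strength partner (e_{z,v} in U)
            for z in G[v]:
                if z not in active and tie_strength(G, z, v) == 1.0:
                    active.add(z)
                    next_frontier.append(z)
        frontier = next_frontier
    return active  # the list of activated nodes A
```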
3 Experimental Analysis
Evaluation of information diffusion results is necessary to determine the performance of information diffusion models/algorithms. In this section, we describe the evaluation strategies, namely the information diffusion models along with their parameters, diffusion
Fig. 2 Number of iterations required by each of the algorithms for information diffusion (bar chart comparing CNS, IC and SI on the Karate, Les Misérables, Jazz and Polblogs datasets)
speed evaluation, diffusion outspread evaluation, and datasets. We have considered the Independent Cascade (IC) model with edge threshold set to 1 and the Susceptible Infected (SI) model with infection probability set to 0.50 for comparative analysis. The IC model has been utilized to examine the role of tie strength in information diffusion behavior, while the SI model has been considered to investigate the role of influence in information propagation in OSNs. These two models are selected for comparative analysis because tie strength and influence are positively correlated and hence crucial for examining diffusion performance. The comparative analysis of the representative diffusion models with the CNS algorithm is conducted considering diffusion speed and diffusion outspread. In this context, diffusion speed refers to 'how fast the dispersion occurs', whereas diffusion outspread indicates 'how widely information propagates'. For the evaluation of diffusion speed, we have examined the number of iterations required to reach a steady state and the fraction of nodes covered per iteration. Furthermore, to evaluate diffusion outspread, we have utilized graph properties such as density, diameter, average distance and average degree. Comparative graphical results based on diffusion speed and diffusion outspread obtained by the CNS algorithm and the representative information diffusion models on real-world datasets such as Karate, Les Misérables, Jazz and Polblogs obtained from the SNAP [17] repository are presented in this section. The details about these datasets are listed in Table 1. It is to be mentioned here that all the
Table 1 Real-world datasets used in the experiments; K indicates average degree

Name | |V| | |E| | K | Network description
Karate | 34 | 78 | 4.59 | Zachary's Karate Club [13]
Les Misérables | 77 | 254 | 6.60 | Les Misérables network [14]
Jazz | 198 | 2742 | 27.70 | Jazz musician network [15]
Polblogs | 1224 | 16718 | 27.31 | Polblogs network [16]
results presented here for the CNS algorithm, IC model and SI model assume a common source/infected node to initiate the diffusion process, to maintain the uniformity of our analysis.
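The SI baseline with infection probability 0.50 is simple enough to simulate from scratch; the sketch below tracks the fraction of nodes covered per iteration, i.e. the quantities behind Figs. 2 and 3. It assumes a connected graph and is not tied to any particular diffusion library:

```python
# Minimal SI-model baseline (infection probability 0.5) on a connected graph.
import random
import networkx as nx

def si_simulation(G, source, beta=0.5, rng=None):
    rng = rng or random.Random(0)
    infected = {source}
    coverage = []  # fraction of nodes covered after each iteration
    while len(infected) < G.number_of_nodes():
        newly = {u for v in infected for u in G[v]
                 if u not in infected and rng.random() < beta}
        infected |= newly
        coverage.append(len(infected) / G.number_of_nodes())
    return coverage

print(si_simulation(nx.karate_club_graph(), source=2))
```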
3.1 Result Analysis
The graphical results presented in Figs. 2 and 3 relate to diffusion speed. Here, Fig. 2 shows the total number of iterations taken to complete the diffusion process and Fig. 3 shows the fraction of nodes covered in each iteration. A smaller total number of iterations and a larger fraction of nodes covered per iteration are expected for faster diffusion. As can be seen from Fig. 2, the SI model takes the maximum number of iterations to complete the diffusion process for all the representative datasets. Therefore, the CNS algorithm is clearly better than the SI model. Next, we need to compare the performance of the CNS algorithm and the IC model. From Fig. 2, it is observed that the CNS algorithm and the IC model give almost similar performance in terms of total number of iterations on small datasets such as Karate, but the difference in their performance becomes apparent on larger datasets, as shown by Polblogs. As the CNS algorithm takes the least number of iterations compared to the IC model on large datasets, it is inferred that the CNS algorithm is better than the IC model in terms of the total number of iterations
Fig. 3 Fraction of nodes covered per iteration (line plots of CNS, IC and SI on Karate, Les Misérables, Jazz and Polblogs)
taken to complete the diffusion process. Next, considering Fig. 3, the curves that terminate early indicate that the respective information diffusion model/algorithm completes in fewer iterations than the SI model for all the representative datasets. Clearly, the SI model covers the least fraction of nodes per iteration, so the CNS algorithm is certainly better than the SI model in terms of fraction of nodes covered per iteration. Additionally, it can be observed that the CNS algorithm completes the diffusion process in three to four iterations for all the datasets and covers the maximum fraction of nodes within the first two iterations, compared to the IC model. In particular, for the Polblogs dataset it is clearly visible that the CNS algorithm covers the maximum fraction of nodes in the first two iterations in comparison to the IC model. As the CNS algorithm gives the best performance in comparison to the IC and SI models based on the total number of iterations required to complete the diffusion process and the fraction of nodes covered per iteration, the CNS algorithm is better than the representative diffusion models in terms of diffusion speed. We next perform a comparative analysis of the CNS algorithm and the representative information diffusion models in terms of the diffusion horizon. In this context, the diffusion horizon refers to the area covered by the diffusion process, measured per iteration; it is examined to determine the diffusion outspread. The results shown in Figs. 4, 5, 6 and 7 relate to the diffusion horizon. Figure 4 shows the diameter of the diffusion horizon per iteration. A higher diameter of the diffusion horizon indicates maximum eccentricity from all nodes and hence larger diffusion outspread.
Fig. 4 Diameter of diffusion horizon per iteration (CNS, IC and SI on Karate, Les Misérables, Jazz and Polblogs)
Fig. 5 Average distance within diffusion horizon per iteration (CNS, IC and SI on the same four datasets)
Fig. 6 Density within diffusion horizon per iteration (CNS, IC and SI on the same four datasets)
Fig. 7 Average degree of nodes within diffusion horizon per iteration (CNS, IC and SI on the same four datasets)
First of all, it is clearly visible from the subfigures of Fig. 4 that the SI model gives the least diameter of diffusion horizon in all iterations for all the representative datasets. Therefore, the CNS algorithm is certainly better than the SI model in terms of diameter of diffusion horizon per iteration. Next, comparing the CNS algorithm and the IC model, both approaches give almost similar performance on small datasets. But for large datasets such as Polblogs, it can be seen that the CNS algorithm gives the maximum diameter of diffusion horizon per iteration compared to the IC model. Therefore, the CNS algorithm is better than the representative diffusion models in terms of diameter of diffusion horizon per iteration. Next, Fig. 5 is used to evaluate the average distance within the diffusion horizon. A higher average distance covered per iteration indicates a larger average shortest path distance and hence wider diffusion outspread. Similar to the diameter results, the performance of the SI model is poor in terms of average distance within the diffusion horizon, whereas comparison of the CNS algorithm and the IC model indicates that the CNS algorithm covers the maximum average distance in the first two iterations, which also results in faster diffusion. Therefore, the CNS algorithm excels in comparison to the representative diffusion models in terms of average distance within the diffusion horizon. Next, Fig. 6 represents the density within the diffusion horizon per iteration. Density is defined by 2m/n(n − 1), where m indicates the number of edges and n the number of nodes; it measures the portion of potential connections in a network that are actual
connections. Density is expected to decrease with an increasing number of iterations for wider diffusion outspread, because the fraction of nodes covered in the initial iterations is expected to be high. As the fraction of nodes covered per iteration is maximum for the CNS algorithm, its performance is the best compared to the SI and IC models in terms of density within the diffusion horizon. Figure 7 indicates the average degree of nodes within the diffusion horizon per iteration. The criterion for faster and wider diffusion is to target the maximum degree nodes first; hence, the average degree is expected to rise and then fall with an increasing number of iterations. As can be seen from Fig. 7, the CNS algorithm gives the maximum average degree in the first two iterations for all the datasets compared to the IC and SI models, which indicates the best diffusion outspread of the CNS algorithm in terms of average degree of nodes within the diffusion horizon per iteration. Therefore, from these results, it is concluded that the CNS algorithm performs best among the representative diffusion models in terms of diffusion speed and diffusion outspread.
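All four diffusion-horizon quantities can be read off the subgraph induced by the nodes activated so far. A minimal networkx sketch, assuming the activated set induces a connected subgraph, is:

```python
# Per-iteration diffusion-horizon metrics over the activated subgraph.
import networkx as nx

def horizon_metrics(G, active_nodes):
    H = G.subgraph(active_nodes)
    n, m = H.number_of_nodes(), H.number_of_edges()
    return {
        "diameter": nx.diameter(H),                        # maximum eccentricity
        "avg_distance": nx.average_shortest_path_length(H),
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0,  # 2m/n(n-1), as in the text
        "avg_degree": 2 * m / n,
    }
```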
4 Conclusion
In this paper, we developed a network property-based information diffusion algorithm called CNS, considering both dense and sparse networks. It utilizes common neighborhood information to compute a tie strength score, based on the concept that strong ties reside within densely connected nodes. Extensive experiments on several real-world datasets show that the CNS algorithm achieves the best diffusion speed and diffusion outspread compared with the IC and SI models. Therefore, it is inferred that network properties play a significant role in the dynamics of information diffusion. In the future, we will deploy social aspects in combination with network property aspects to design information diffusion methods and examine their significance.
References
1. Goldenberg J, Libai B, Muller E (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark Lett 12(3):211–223
2. Watts DJ (2011) A simple model of global cascades on random networks. In: The structure and dynamics of networks. Princeton University Press, pp 497–502
3. Galstyan A, Cohen P (2007) Cascading dynamics in modular networks. Phys Rev E 75(3):036109
4. Peng H, Nematzadeh A, Romero DM, Ferrara E (2020) Network modularity controls the speed of information diffusion. Phys Rev E 102(5):052316
5. Bakshy E, Rosenn I, Marlow C, Adamic L (2012) The role of social networks in information diffusion. In: Proceedings of the 21st international conference on World Wide Web, Lyon, France, Apr 16–20, 2012, pp 519–528
6. Das S, Biswas A (2021) Deployment of information diffusion for community detection in online social networks: a comprehensive review. IEEE Trans Comput Soc Syst 8(5):1083–1107
7. Li M, Wang X, Gao K, Zhang S (2017) A survey on information diffusion in online social networks: models and methods. Information 8(4):118
8. Chen Z, Taylor K (2017) Modeling the spread of influence for independent cascade diffusion process in social networks. In: 2017 IEEE 37th international conference on distributed computing systems workshops (ICDCSW), Jun 5–8, 2017. IEEE, Atlanta, GA, USA, pp 151–156
9. Bhattacharya S, Sarkar D (2021) Study on information diffusion in online social network. In: Proceedings of international conference on frontiers in computing and systems, Jalpaiguri Government Engineering College, West Bengal, India, Nov 24, 2021, pp 279–288
10. Granovetter MS (1973) The strength of weak ties. Am J Sociol 78(6):1360–1380
11. Centola D, Macy M (2007) Complex contagions and the weakness of long ties. Am J Sociol 113(3):702–734
12. Onnela JP, Saramäki J, Hyvönen J, Szabó G, Lazer D, Kaski K, Kertész J, Barabási A-L (2007) Structure and tie strengths in mobile communication networks. Proc Natl Acad Sci 104(18):7332–7336
13. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33(4):452–473
14. Knuth DE (1993) The Stanford GraphBase: a platform for combinatorial algorithms. In: SODA, Austin, Texas, USA, Jan 25–27, 1993, pp 41–43
15. Gleiser PM, Danon L (2003) Community structure in jazz. Adv Complex Syst 6(04):565–573
16. Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on link discovery, Chicago, Illinois, Aug 21–25, 2005, pp 36–43
17. SNAP datasets: Stanford large network dataset collection. http://snap.stanford.edu/data. Accessed 21 Oct 2021
Sentiment Analysis on Worldwide COVID-19 Outbreak
Rakshatha Vasudev, Prathamesh Dahikar, Anshul Jain, and Nagamma Patil
Abstract Sentiment analysis has proved to be an effective way to easily mine public opinions on issues, products, policies, etc. One of the ways this is achieved is by extracting social media content. Data extracted from social media has proven time and again to be the most powerful source material for sentiment analysis tasks. Twitter, which is widely used by the general public to express their concerns over daily affairs, can be the strongest tool to provide data for such analysis. In this paper, we use the tweets posted regarding the COVID-19 pandemic for a sentiment analysis study and for sentiment classification using the BERT model. Due to its transformer architecture and bidirectional approach, this deep learning model can easily be preferred as the best choice for our study. As expected, the model performed very well in all the considered classification metrics and achieved an overall accuracy of 92%. Keywords COVID-19 · Sentiment analysis · Opinion mining · Word embedding · BERT · Classification · Fine tuning
R. Vasudev (B) · P. Dahikar · A. Jain · N. Patil
National Institute of Technology Karnataka, Mangaluru, India
e-mail: [email protected]
N. Patil e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_47

1 Introduction
Machine Learning is the science of making computers act without being explicitly programmed. Machine Learning has gained importance in the past decade and its popularity is only going to increase in the near future. All of us use machine learning regularly without realizing it, and it is very useful in making decisions such as classifying given data. Sentiment analysis is a part of machine learning which helps us analyze the sentiment of a piece of text; we can use it for classification tasks. The use of sentiment analysis on social media data to gauge public opinion is widely accredited. It helps in processing huge amounts of data in order to
get the sentiments or opinions of the people about a context. Traditional sentiment analysis can miss out on highly valued insights. The advancements in deep learning can provide us with sophisticated models to classify the data being used for sentiment analysis by providing them with contextual meaning. For our study, we have used the BERT (Bidirectional Encoder Representations from Transformers) [17] model for classification of tweets into their sentiments, represented by three class labels: negative (denoted by 0), neutral (denoted by 1) and positive (denoted by 2). We have also used word clouds of tweets to plot the most frequently used terms in the tweets. These plots give us an accurate visual representation of the most prominently used words in the tweets, and such representations can help create awareness.
2 Literature Survey
Ji et al. [1] address the issue of spreading public concern about epidemics. They used Twitter messages, trained and tested on three machine learning models, namely Naive Bayes (NB), Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM), to obtain the best results. Alessa et al. [2] reviewed the existing solutions that track the influenza flu outbreak in real time with the use of weblogs and social networking sites; the paper concludes that social networking sites can provide better predictions when used to conduct real-time analysis. Adhikari et al. [3] combined word embeddings, Term Frequency-Inverse Document Frequency (TF-IDF) and word n-grams with various algorithms for data mining and deep learning such as SVM, NB and RNN-LSTM; Naive Bayes along with TF-IDF performed the best compared to the other methods used. Rastogi et al. [4] used decomposition (Normalization Form and Compatibility decomposition) for preprocessing, the NLTK (Natural Language Toolkit) package for tokenization, and a Twitter preprocessor to remove tags, hashtags, reserved words, URLs and mentions. TF-IDF and bag of words were used to find the most frequent words in the corpus, and VADER (Valence Aware Dictionary and sEntiment Reasoner), which also handles emojis, was used for sentiment analysis. For classification, this paper used SVM and the BERT model. Pokharel et al. [5] used the Tweepy Python library for data collection; the necessary fields are scraped and TextBlob is used for checking the polarity of each tweet (positive, negative or neutral). Singh et al. [6] used artificial intelligence (AI) techniques for prediction of epidemic outbreaks. Two approaches are used in that paper: a societal approach and an epidemiology approach. The societal approach includes analyzing public awareness about the epidemic using the collected tweets and then performing sentiment analysis on them; the computational epidemiology approach includes analysis and prediction of future trends based on medical datasets. Kabir and Madria [7] built a real-time web application, a COVID-19 tweets data analyzer. They collected data from March 5, 2020 and kept fetching tweets using the tweepy package of Python. In this paper, the authors performed sentiment analysis on trending topics to understand human emotions.
They also provide a clean dataset named the coronaVis Twitter dataset, based on the United States. Nemes et al. [8] analyzed the signs and sentiments of Twitter users based on the main trends using NLP (Natural Language Processing) and sentiment analysis with an RNN (Recurrent Neural Network); the trained model determined the emotional polarity of tweets (including ambiguous ones) with very high accuracy. Wang et al. [9] proposed a fine-tuned BERT model for the classification of the sentiments of posts and used TF-IDF to extract topics of posts with different sentiments; negative sentiments of the posts are used to predict the epidemic. Based on our survey, we have concluded that existing models struggle to evaluate language complexities like double negatives, words with multiple meanings and context-free representations, and they require a huge set of training data for sentiment analysis. Our work focuses on conducting a sentiment analysis to help people make informed decisions by knowing what is happening around the globe, and also on developing a sentiment classification model that performs well even with limited data regarding COVID-19.
3 Architecture
The Twitter data collected for sentiment analysis is analyzed and assigned class labels based on sentiment. The data is also analyzed to create word clouds of locations and of the most frequently used words from tweets. The class labels are analyzed to get an idea of the distribution of sentiments of the tweets. The tweets are preprocessed to remove punctuation, stop words and other unnecessary data. The data is then used for sentiment classification using BERT (Bidirectional Encoder Representations from Transformers). The model performance is evaluated using classification metrics (see Fig. 1).
4 Methodology
4.1 Data Collection and Analysis
The dataset for the sentiment analysis was obtained from Kaggle. It consisted of about 170k tweets from all over the world about COVID-19. The data frame ultimately prepared consisted of tweet id and tweets extracted from the dataset for our analysis purpose. 100 recent tweets specific to India were also analyzed using word clouds. The tweets in the dataframe were assigned class labels using TextBlob to make this a supervised learning problem. TextBlob, a Python library, is widely used for various textual data processing tasks, one of which is sentiment analysis of texts. The sentiment of a TextBlob object is returned in a tuple which consists of polarity
Fig. 1 Workflow/design of the proposed model
Fig. 2 Data distribution: (a) class distribution, (b) sentiment distribution
and subjectivity. The polarity property is considered for class label generation. The range of polarity values is [−1, 1]. If the polarity of a tweet was less than 0, it was assigned class label 0 (negative tweet); if the polarity was equal to 0, it was assigned 1 (neutral tweet); otherwise the class label was 2 (positive tweet). From Fig. 2a, we can see that the data has an imbalanced class distribution: the number of negative tweets is a little lower than the other two. But this problem can be handled by using metrics which evaluate the model class wise. The sentiment distribution of the tweets was plotted, and from Fig. 2b, we can see that the majority of the tweets are distributed among neutral and positive sentiments.
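The labelling rule just described is a few lines with TextBlob; the dataframe name below is an assumption:

```python
# Polarity-based labelling: < 0 -> 0 (negative), == 0 -> 1 (neutral), > 0 -> 2 (positive).
from textblob import TextBlob

def label_tweet(text: str) -> int:
    polarity = TextBlob(text).sentiment.polarity  # value in [-1, 1]
    if polarity < 0:
        return 0
    return 1 if polarity == 0 else 2

# df["label"] = df["tweet"].apply(label_tweet)   # assuming a pandas dataframe of tweets
```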
4.2 Data Pre-processing
The Twitter data was preprocessed using a message cleaning pipeline to remove unnecessary data. The punctuation in the tweets was removed using string.punctuation, all stopwords were removed using NLTK's stopwords list, and finally all video links and hyperlinks were removed using Python regex. The pre-processed data is shown in Fig. 3.
Fig. 3 Pre-processed data
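A minimal sketch of such a cleaning pipeline, using string.punctuation, NLTK's stopword list and a regex for links as described above:

```python
# Message-cleaning pipeline: strip links, punctuation and stopwords.
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", "", text)                      # hyperlinks
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return " ".join(w for w in text.split() if w.lower() not in STOP)
```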
4.3 Word Cloud Analysis
A word cloud of the locations of the tweets was plotted for each class to analyze the severity and pattern of the epidemic, as shown in Fig. 4. We can see that countries like India, the United States and South Africa are where most tweets are from, indicating that the Twitter activity of people from these countries is very high. This also means that people from these regions are very concerned about the situation. Also, the most frequently used words from each class were plotted in a word cloud. This gives us an idea of how people are reacting to the epidemic and their sentiments toward it. It can also give us important information on precautions to be taken at the earliest in case of a pandemic, in regions it may not yet have affected. We can see from Fig. 5a the most frequent words in positive tweets.
Fig. 4 Word cloud of locations of tweets (positive class)
Fig. 5 Word cloud of frequently used words in tweets: (a) positive class, (b) negative class
Fig. 6 Word cloud of frequently used words in latest tweets specific to India: (a) positive class, (b) negative class
Words like good, safe, great and vaccine tell us that people are trying to stay positive minded, are concerned about vaccines, and want to stay healthy during the pandemic. It also raises concerns over wearing masks, schools opening, lockdown, etc. We can also see words such as tested, meaning people are aware of the importance of testing and are getting tested. In Fig. 5b, we can see words like government and country, giving us an idea that people are expressing views on the government's actions toward the pandemic. We can derive such information from the frequently used words using word clouds. We have also considered the latest tweets from India for the word clouds. From Fig. 6a, we can see words like vaccination, happy, great, fun roll, etc., indicating the situation might be under control. In Fig. 6b, we can see words like new cases, active and variant, expressing concerns over new variants that might be spreading.
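The plots themselves can be produced with the common wordcloud package (one plausible choice; the paper does not name its plotting library), with the dataframe columns below being assumptions:

```python
# Word cloud of the most frequent terms in one sentiment class.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_wordcloud(texts, title):
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(texts))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

# plot_wordcloud(df[df.label == 2]["clean_tweet"], "Positive class")
```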
4.4 Sentiment Classification Using BERT
We propose to use BERT (Bidirectional Encoder Representations from Transformers) to train our model, which will classify the tweets into their sentiments. The reasons for using BERT for this classification task were:
– Sentiment classification tasks always require a huge set of data for model training. Since BERT is already pre-trained on billions of words from the web, it eliminates the need for a huge dataset for model training; fine tuning gives the desired results for our classification.
– It works in two directions simultaneously (bidirectionally). Other language models look for the context of a word either from the left or from the right, but BERT is bidirectionally trained, which means the words can have deeper context and hence the classification performance can be improved.
4.4.1 Fine Tuning BERT for Classification
The preprocessed dataset, which consisted of 100,439 tweets, was used for the BERT model. The dataset was divided into train and validation sets using train-test-split with 20% test size, as shown in Fig. 7. Here, the dataset for classification consists only of the tweet and label columns. The BERT base uncased tokenizer from Hugging Face's transformers library, whose model has 12 layers, 768 hidden nodes, 12 attention heads and 110 million parameters, was used for tokenization of the tweets. Once the tweets are encoded using the tokenizer, we can get the input features for BERT model training, which are input ids and attention masks (both of which we can get from the encoded data). Input ids are the integer sequences of the sentences. Attention masks are lists of binary values (0s and 1s) representing which tokens should be given attention by the model and which should not. There is also one more input feature required by the BERT model, which is the labels. All the integer input features (both the train and validation sets) are converted to tensor datasets, which are used to get train and validation data loaders. An optimizer (AdamW) and a scheduler (linear schedule with warmup) are defined to control the learning rate through the epochs. The BERT base uncased BertForSequenceClassification model from the transformers library is defined for training. In training mode, the model is trained batch wise with the input features. The loss obtained from the outputs is backward propagated and the other parameters (optimizer and scheduler) are updated. At the end of each epoch, the validation data loader is evaluated in model evaluation mode; this also returns the validation loss, predictions and true values. So at the end of each epoch,
Fig. 7 Train and validation sets
we can analyze the training and validation losses. The model is saved using torch. The saved model is to be loaded with the same parameters as the model originally defined, so that all the keys match perfectly. Once the model is loaded, its performance can be evaluated by various metrics using the validation set data loaders.
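Condensed into code, the training loop described above looks roughly as follows (Hugging Face transformers plus PyTorch); train_texts, train_labels and the hyperparameters are illustrative assumptions, not the authors' exact settings:

```python
# One-epoch BERT fine-tuning sketch for 3-class tweet sentiment.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import (BertTokenizer, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# train_texts: list[str], train_labels: list[int] -- assumed to exist.
enc = tokenizer(train_texts, padding=True, truncation=True, max_length=128,
                return_tensors="pt")
train_ds = TensorDataset(enc["input_ids"], enc["attention_mask"],
                         torch.tensor(train_labels))
loader = DataLoader(train_ds, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, 0, len(loader))  # one epoch

model.train()
for input_ids, attention_mask, labels in loader:
    optimizer.zero_grad()
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()   # back-propagate the classification loss
    optimizer.step()
    scheduler.step()
torch.save(model.state_dict(), "bert_covid_sentiment.pt")
```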
4.5 Evaluation Metrics
The metrics used for evaluating the classification task were as follows: accuracy, precision, recall and F1-score. All these metrics are based on the confusion matrix.
– True Positives (TP): the cases which are predicted positive and are actually positive.
– True Negatives (TN): the cases which are predicted negative and are actually negative.
– False Positives (FP): the cases which are predicted positive but are actually negative.
– False Negatives (FN): the cases which are predicted negative but are actually positive.
4.5.1 Precision
Precision represents what percentage of predicted positives are actually positive and is calculated by
Precision = TP/(TP + FP)  (1)

4.5.2 Recall
Recall represents what percentage of actual positives are predicted correctly and is calculated by
Recall = TP/(TP + FN)  (2)

4.5.3 F1-Score
F1-score is a measure of the accuracy of a model on the dataset and is the harmonic mean of precision and recall:
F1-score = (2 * Precision * Recall)/(Precision + Recall)  (3)

4.5.4 Accuracy
Accuracy represents how often our classifier is correct, i.e., it is defined as the percentage of predictions that are correct, and is calculated by
Accuracy = (TP + TN)/(TP + TN + FN + FP)  (4)

4.5.5 Macro Average
Macro average is the arithmetic mean of all the values irrespective of the proportion of each label in the dataset. For example, if we want to find the macro-average precision of n classes, with individual precisions p1, p2, p3, ..., pn, then the macro-average precision (MAP) is the arithmetic mean of all of these:
MAP = (p1 + p2 + p3 + ... + pn)/n  (5)

4.5.6 Weighted Average
Weighted average is the average that accounts for the proportion of each label in the dataset; weights are assigned to each label based on its proportion. For example, if we want the weighted-average precision of n classes, with precisions p1, p2, p3, ..., pn and assigned weights w1, w2, w3, ..., wn, then the weighted-average precision (WAP) can be calculated as
WAP = (p1*w1 + p2*w2 + ... + pn*wn)/n  (6)

Table 1 Metrics evaluated class wise

Classes | Accuracy | Precision | Recall | F1-score
Positive | 0.974 | 0.96 | 0.90 | 0.93
Neutral | 0.900 | 0.90 | 0.97 | 0.94
Negative | 0.832 | 0.88 | 0.83 | 0.86

Table 2 Weighted- and macro-average results of the metrics

Macro average: Precision 0.91 | Recall 0.90 | F1-score 0.91
Weighted average: Precision 0.92 | Recall 0.92 | F1-score 0.92
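In practice, every class-wise and averaged number in Tables 1 and 2 can be obtained in a single call; y_true and y_pred below are assumed to come from the validation data loader:

```python
# Per-class precision/recall/F1 plus macro and weighted averages.
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred,
                            target_names=["negative", "neutral", "positive"],
                            digits=2))
```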
5 Results and Analysis
The BERT model was trained on a GPU (CUDA enabled) with 12.72 GB RAM in Google Colab. The model was trained for one epoch, and the classification report and class-wise accuracies were evaluated; these are tabulated in Tables 1 and 2. The classification report (from sklearn.metrics) gives us the class-wise precision, recall and F1-score and also the macro and weighted averages of these metrics. We can observe that the weighted-average scores are better than the macro-average scores. This is because the number of neutral and positive tweets in our dataset is slightly higher than that of negative tweets. The overall accuracy of the model was 92%. This indicates that our model can be used over time to classify huge amounts of textual data on COVID-19-related issues quite accurately. The model also performed well by achieving the desired class-wise accuracy, precision, recall and F1-score; both positive and neutral classes had values above 0.9 in all these metrics.
6 Conclusion and Future Work
In this paper, we have proposed machine learning-based approaches to sentiment analysis and sentiment classification specifically for the worldwide COVID-19 pandemic. We can analyze the effect of the pandemic using Twitter data and also analyze the response of people to the epidemic using word clouds, which plot the most frequently used words from the tweets. These word clouds also help us know how different regions are being affected by the pandemic and can create awareness among people to prevent it. The model was trained with pre-trained BERT for classification and performed very well, with 92% accuracy, 0.91 macro-average precision and F1-score, 0.90 macro-average recall, and 0.92 weighted-average precision, recall and F1-score. The use of BERT for sentiment analysis has certainly given the best results and makes our model a reliable one. Sentiment analysis, though useful, is always subjective; opinions differ from person to person, and it is very difficult to correctly contextualize sentiments such as sarcasm and negation. Further, this project can be automated to continuously fetch tweets to feed as data for the model while ensuring that there is no overfitting. BERT can be further fine-tuned to get better results.
References
1. Ji X, Chun S, Wei Z, Geller J (2015) Twitter sentiment classification for measuring public health concerns. Soc Netw Anal Min 5
2. Alessa A, Faezipour M (2018) A review of influenza detection and prediction through social networking sites. Theor Biol Med Model 15
3. Adhikari ND, Kurva VK, Suhas S, Kushwaha JK, Nayak AK, Nayak SK, Shaj V (2018) Sentiment classifier and analysis for epidemic prediction. Comput Sci Inf Technol (CS and IT)
4. Rastogi N, Keshtkar F (2020) Using BERT and semantic patterns to analyze disease outbreak context over social network data. In: Proceedings of the 13th international joint conference on biomedical engineering systems and technologies
5. Pokharel B (2020) Twitter sentiment analysis during COVID-19 outbreak in Nepal. SSRN Electron J
6. Singh R, Singh R (2021) Applications of sentiment analysis and machine learning techniques in disease outbreak prediction-a review. Mater Today Proc
7. Kabir M, Madria S (2020) CoronaVis: a real-time COVID-19 tweets data analyzer and data repository. https://arxiv.org/abs/2004.13932
8. Nemes L, Kiss A (2020) Social media sentiment analysis based on COVID-19. J Inf Telecommun 5:1–15
9. Wang T, Lu K, Chow K, Zhu Q (2020) COVID-19 sensing: negative sentiment analysis on social media in China via BERT model. IEEE Access 8:138162–138169
10. Siriyasatien P, Chadsuthi S, Jampachaisri K, Kesorn K (2018) Dengue epidemics prediction: a survey of the state-of-the-art based on data science processes. IEEE Access 6:53757–53795
11. Pham Q, Nguyen D, Huynh-The T, Hwang W, Pathirana P (2020) Artificial intelligence (AI) and big data for coronavirus (COVID-19) pandemic: a survey on the state-of-the-arts. IEEE Access 8:130820–130839
12. Manguri K, Ramadhan R, Amin P. Kurd J Appl Res. http://doi.org/10.24017/kjar
13. Agarwal A, Xie B, Vovsha I, Rambow O, Passanneau R (2011) Sentiment analysis of Twitter data. In: Proceedings of the workshop on languages in social media, pp 30–38
14. Rajput N, Grover B, Rathi V (2020) Word frequency and sentiment analysis of twitter messages during Coronavirus pandemic. https://arxiv.org/abs/2004.03925
15. Kruspe A, Häberle M, Kuhn I, Zhu X (2020) Cross-language sentiment analysis of European Twitter messages during the COVID-19 pandemic. https://aclanthology.org/2020.nlpcovid19-acl.14
16. Boon-Itt S, Skunkan Y (2020) Public perception of the COVID-19 pandemic on Twitter: sentiment analysis and topic modeling study. JMIR Public Health Surveill 6:e21978
17. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. https://arxiv.org/abs/1810.04805
Post-Vaccination Risk Prediction of COVID-19: Machine Learning Approach Anjali Agarwal, Roshni Rupali Das, and Ajanta Das
Abstract COVID-19 reminds everyone that a viral infection can be serious and even lethal. Prior to the COVID-19 epidemic, people paid no attention to slight fevers, sore throats, or sneezing. With the passage of time, people were persuaded of the fatality rate and forced to limit themselves to their own homes. However, the arrival of vaccinations and double dosages encouraged individuals to frequent workplaces, banks, stores, marketplaces, and so on out of necessity. Developing a vaccine is a tough undertaking owing to the unique characteristics of the COVID-19 virus, and it is difficult to promise that a vaccination will give complete protection against infection. As a result, the post-vaccination risk for COVID-19 is always present. The goal of this work is to predict the post-vaccination risk of COVID-19 illness using machine learning approaches and real datasets. This research also shows that a small number of vaccinated persons get infected with COVID-19 illness. It is therefore indicated that if a few issues persist for more than 2 days, a doctor should be consulted for proper care. Keywords Prediction · COVID-19 · Machine learning · Vaccination
A. Agarwal · R. R. Das · A. Das (B) Amity Institute of Information Technology, Amity University, Kolkata 700135, India e-mail: [email protected]

1 Introduction

A few years after the WHO's formal declaration, the COVID-19 pandemic has had far-reaching global repercussions. COVID-19 is no longer pandemic, but rather endemic, with over 651,247 individuals worldwide having died as a result of the disease. There is currently no particular therapy or cure for COVID-19, so living with the condition and its symptoms is unavoidable. This fact has placed huge pressure on the world's inadequate healthcare systems, particularly in underdeveloped countries. Even though there is no effective, efficient, and medically tested anti-viral agent or authorized vaccine to eliminate the COVID-19 disease outbreak, there are alternative solutions that may decrease the enormous burden on not only
limited medical systems but also the economic sector; the most promising entail embracing non-clinical approaches such as machine learning, data mining, deep learning, and artificial intelligence. These options might help COVID-19 pandemic victims with diagnosis and forecasting [1]. Machine learning has received a lot of attention from researchers over the last several decades because of its ability to solve complicated real-world problems [2]. Natural language processing, health care, commercial applications, intelligent robotic design, gaming, and image processing are just a few of the important study fields where ML may be used. Prediction is one of the most common applications of machine learning. Several ML algorithms have been employed in applications such as weather forecasting and illness diagnostics to anticipate future occurrences. Numerous studies have been conducted to determine how ML may predict illnesses such as cardiovascular disease, heart disease, coronary artery disease, and breast cancer. Beyond these illnesses, whose risks and diagnoses are well known, COVID-19 now represents a truly global health catastrophe for humanity as well as a significant new issue to be faced. Vaccines are still being developed by scientists today, but prevention and early detection remain the most effective strategies to protect individuals in the interim. Vaccines have already been delivered in millions of doses in several nations. However, the beneficial effects of these vaccinations are likely to be observed later than predicted. In these situations, the only method to halt the development of COVID-19 is to diagnose it quickly. However, depending just on visible symptoms makes it difficult to determine whether a person is infected with COVID-19. The safety and dependability of vaccines are critical for preventing the spread of infectious diseases. There has been a modest but considerable number of documented adverse responses to the new COVID-19 vaccinations. The goal of this study is to discover probable common causes of such adverse responses to enable methods that decrease patient risk by classifying and characterizing individuals who are at risk of such reactions using patient data. We looked at patient medical records as well as data on post-vaccination effects and outcomes. The objective of this paper is to develop a model that uses machine learning (ML) techniques to detect symptoms of COVID-19-infected individuals more accurately after vaccination. Age, gender, fever, headache, nausea/vomiting, chest pain, difficulty breathing, fast heartbeat (tachycardia), leg swelling, and previous history of post-vaccine side effects are the factors included in the suggested diagnostic approach. Our suggested technique can estimate the likelihood of infection with the COVID-19 virus based on these factors, which are represented as the ML model's features. The suggested framework consists of three well-known ML algorithms: support vector machine, random forest, and logistic regression. The main goal of this study is to examine the effectiveness of all three algorithms and determine which one is the best fit for our proposed tool. This technique is assessed using a variety of experimental analytic measures, including accuracy, precision, recall, and F1-score. The experimental findings demonstrate that the suggested approach can predict the
existence of COVID-19 with greater than 91% accuracy. Section 2 of this article consists of the literature review to aid in the study of the research topic. Section 3 outlines the suggested approach’s structure and thorough methodology. Section 4 shows the assessment findings, including testing accuracy, training accuracy, and accuracy gap %, to pick the optimal algorithm for the proposed tool’s training dataset. This section also includes the confusion matrix for each model that employs a certain classifier. Furthermore, a graphical depiction of each machine learning algorithm’s comparative analysis is presented with precision, recall, and F1-Score. Section 5 brings the paper to a conclusion.
1.1 Motivation A breakthrough infection occurs when someone gets COVID-19 after being completely vaccinated. This implies that the vaccination elicited a weaker immunological response in the recipient. However, most of these infections are mild or asymptomatic. Breakthrough infections can occur with any vaccination. Breakthrough infections are, however, distinct from re-infection. According to experts, the immune system typically takes two weeks to produce antibodies after immunization. There have been reports of persons [3] contracting an infection shortly before or after receiving their injection. In such situations, the illness would have taken root long before the vaccine's full effect could be felt. According to an Indian Council of Medical Research (ICMR) study [4], re-infection is defined as two positive tests separated by at least 102 days, with one intermediate negative test. According to data provided by the ICMR, just 2–4 persons out of a vaccinated group of 10,000 are infected, which is a small number. When fully vaccinated persons have symptoms, they are usually milder than those of unvaccinated persons. This implies they are considerably less likely than unvaccinated persons to be hospitalized or die, but they still face the risk of COVID-19.
1.2 Related Work In this study [1], researchers used the PA view of chest X-ray scans of COVID-19 patients as well as healthy participants. They examined the performance of deep learning-based CNN models after cleaning up the pictures and performing data augmentation. They compared the accuracy of the Inception V3, Xception, and ResNet models. The Xception model has the best accuracy, i.e., 97.97%, for identifying chest X-ray pictures. This paper focuses solely on potential techniques for categorizing COVID-19-infected people and makes no medical claims. This study [2] developed supervised machine learning models for COVID-19 disease using classification algorithms such as logistic regression, decision tree, support vector machine, naive Bayes, and artificial neural network for positive and
negative COVID-19 cases in Mexico. The decision tree algorithm has the greatest accuracy of 94.99%, the support vector machine model has the best sensitivity of 93.34%, and the naive Bayes model has the maximum specificity of 94.30%, according to the results of the effectiveness analysis of the methods. Through categorization techniques, this study [5] identifies profiles of individuals who may require further monitoring and care, e.g., immunization at a site with access to full clinical assistance to prevent unfavorable effects. With an accuracy score of more than 85%, machine learning models utilizing medical records were also able to forecast individuals who were most likely to receive complication-free vaccinations. The authors of this work [6] used various supervised machine learning techniques to create a model to assess and predict the existence of COVID-19. The results demonstrate that the support vector machine with the Pearson VII universal kernel outperforms the alternative algorithms, with 98.81% accuracy and a mean absolute error of 0.012. This work [7] illustrates the capacity of machine learning models to predict the number of incoming patients impacted by COVID-19. Specifically, four common forecasting models were employed in this work to anticipate the dangerous variables of COVID-19: linear regression (LR), least absolute shrinkage and selection operator (LASSO), support vector machine (SVM), and exponential smoothing (ES). The findings show that ES outperforms all other models, followed by LR and LASSO, while SVM performs badly in all prediction situations. In this study [8], they explore ML and DL techniques based on AI for COVID-19 diagnosis and therapy. This survey provides a thorough review of existing state-of-the-art techniques for ML and DL researchers and the broader health community, as well as descriptions of how ML, DL, and data may enhance the status of COVID-19, as well as further studies to minimize COVID-19 outbreaks. The AdaBoost method is used to boost a fine-tuned random forest framework in this research [9]. The model predicts the intensity of the case and the likely result, recovery, or death, based on the COVID-19 patient's geographical, travel, health, and demographic data. On the dataset used, the model has 94% accuracy and an F1-score of 0.86. Here [10], researchers used a deep learning method to forecast mortality in those who tested positive based on their underlying health problems, age, gender, and other variables. Logistic regression, naive Bayes, SVM, and random forest algorithms are used. This model can estimate whether a COVID-19-verified patient is likely to be dead or alive. The results show that the deep learning model beats other machine learning models in predicting uncommon events. In this work [11], a prospective study of the applicability of intelligent systems such as ML, DL, and others in resolving COVID-19 epidemic concerns is presented. The main goal of this study is to understand the importance of smart approaches such as ML and DL for the COVID-19 pandemic; discuss the efficiency and impact of these methods on COVID-19 prognosis, the progress in the development of the
type of ML and advanced ML methods for COVID-19 prognosis; and analyze the impact of data types and the nature of the information. In this study [12], the author used machine learning methods to more effectively diagnose COVID-19-infected individuals. This technique is assessed using a variety of experimental analytic measures, including accuracy, precision, recall, and F1-score. The acquired experimental findings demonstrate that the proposed approach can predict the existence of COVID-19 with a high degree of accuracy, above 97%.
2 Problem Statement The healthcare business has grown to be worth billions of dollars. The healthcare industry creates massive amounts of data daily, which may be used to extract information for projecting future illness for a patient based on treatment history and health data. The growing usage of electronic health records has resulted in an avalanche of new patient data, which is a treasure for learning more about how to improve human health. In this case, data mining and machine learning techniques are used. Based on a patient's treatment history and health information, the method described above is utilized to predict COVID-19 breakthrough infection risks. Furthermore, the goal and scope of this study are to provide early treatment to patients based on the prediction of COVID illness risk even after receiving both doses of the COVID vaccination, minimizing risk and chaos.
3 Proposed Architecture and Methodology Having the dataset of patients who are at risk of contracting COVID after immunization, the proposed work suggests the following methods for evaluating the risk post-COVID vaccination. The architecture of the suggested approach is depicted in Fig. 1, which includes data collection, a stage in which the authors collect information from medical practitioners to train the model; data processing to clean the data of any arbitrary values; and splitting the dataset into two parts, training data and testing data, to train the model using the information the authors already have. The dataset is then pre-processed using the Scikit-learn package and the Python programming language [13].
3.1 Data Collection The dataset's primary objective is to train the model to determine if a patient is at risk of COVID-19 illness even after two doses, based on the health measures and symptoms in the dataset. Following the collection of data, data pre-processing is required to
Fig. 1 Proposed architecture
convert the raw data into organized and clean data for the model's preparation and operation. In this experiment, the authors of this study enlisted the help of medical professionals to collect a few key symptoms and risk factors for COVID. This dataset contains 20,000 post-COVID vaccination patient records in which the patients are at risk of contracting COVID even after taking both doses of vaccine. All patients, regardless of age or gender, are considered. This dataset comprises 10 key symptoms and risk variables that play a significant role after receiving two doses of COVID-19 vaccination and can help the model determine if the user is at risk of contracting COVID-19 or not. These factors aid the model in determining if the user is at risk of getting a certain ailment depending on the information provided. The sample dataset is shown below, through four patient records in Table 1, for a better understanding of the parameters and their values.
3.2 Data-Preprocessing This is the data pre-processing stage where we have transformed all categorical variables into corresponding numerical variables to training the model more effectively. There are too many levels in a category variable. The model’s performance is lowered
Table 1 Dataset sample for post-COVID vaccine symptoms

Parameters                                      Patient 1 data   Patient 2 data   Patient 3 data   Patient 4 data
Age                                             56               72               33               42
Gender                                          0                0                1                1
Fever                                           100.0            98.4             99.9             102.6
Headache                                        1                3                2                0
Nausea/Vomiting                                 0                1                0                0
Chest pain                                      1                1                0                1
Difficulty in breath                            1                1                0                0
Fast Heartbeat (Tachycardia)                    0                1                1                0
Leg swelling                                    1                0                1                0
Previous history of post-vaccine side effects   0                1                1                0
Outcome                                         0                1                1                0
as a result of this. So, Label Encoder converts non-numerical labels (or nominal categorical variables) into numerical labels. The range of numerical labels is always 0 to n_classes − 1. In the dataset of Table 1, for the parameters Nausea, Chest Pain, Difficulty in Breath, Fast Heartbeat (Tachycardia), Leg Swelling, and Previous History of Post-Vaccine Side Effects, the value 1 stands for "Yes", whereas the value 0 stands for "No". The data recorded for Fever is the thermometer reading in °F. For Gender, 1 stands for "Male" and 0 for "Female". For Headache, 0 stands for Absent, i.e., no headache; 1 stands for Mild, i.e., a headache lasting 12–24 h; 2 stands for Moderate, i.e., a headache lasting 24–72 h; and 3 stands for Severe, i.e., a headache lasting more than 72 h. These are the independent variables. The target variable is the Outcome, where 0 stands for no risk and 1 stands for consulting a doctor.
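As a rough illustration of this encoding step, the sketch below applies scikit-learn's LabelEncoder to two hypothetical columns; the column names and raw values are ours, not the authors' actual data.

```python
# A minimal sketch of label encoding with scikit-learn; illustrative data only.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Headache": ["Absent", "Severe", "Moderate", "Mild"],
})

# LabelEncoder maps each categorical level to an integer in [0, n_classes - 1].
# By alphabetical order this happens to match the paper's scheme:
# Female -> 0, Male -> 1; Absent -> 0, Mild -> 1, Moderate -> 2, Severe -> 3.
for col in ["Gender", "Headache"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```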
3.3 Splitting of Dataset To evaluate the performance of the suggested model, the dataset is split, as is standard in machine learning. The dataset is divided into two portions in this method, referred to as the training dataset and the testing dataset. The training set is the data that the authors use to train the model and allow it to analyze and learn, whereas test data is used only to evaluate the model's performance. Testing data is unseen data for which predictions must be made, whereas training data output is available for modeling. Test data is a final, real-world check on an unseen dataset to make sure the machine learning algorithm was correctly trained. Because the model is designed to predict future targets that are unknown, the authors require some unseen data for the model to predict on, which is referred to as the testing dataset. As shown in
Fig. 1, the dataset is split randomly in the ratio 80:20 for training and testing using the train_test_split function included in the Scikit-learn library.
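A minimal sketch of this 80:20 split follows; the synthetic X and y below are placeholders standing in for the encoded feature matrix and the Outcome column.

```python
# Sketch of the 80:20 random split described above (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 10)        # 100 patients, 10 symptom/risk features
y = np.random.randint(0, 2, 100)   # 0 = no risk, 1 = consult a doctor

# 80% for training, 20% for testing, as in Fig. 1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 10) (20, 10)
```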
3.4 Classification A classifier is a mathematical function that assigns incoming data to the correct category. It can be challenging to decide which machine learning algorithm is appropriate for a particular dataset. There are several algorithms to choose from, and it is critical to understand the benefits and drawbacks of each before deciding which is best for your model. As a consequence, trial and error is the most effective method for arriving at the most efficient algorithm. In this paper, we classify whether a user is at risk of suffering from COVID after COVID vaccination or not.
3.5 Performance Evaluation Metrics A confusion matrix is a metric used to assess the effectiveness of a machine learning system. By examining the confusion matrix, one can determine the model's accuracy from the diagonal values, which count the correct classifications. The authors of this study used the confusion matrix function from the Scikit-learn package to assess the accuracy. Equations (1)–(5) are used to calculate the precision, recall, and accuracy values, which are then used to evaluate the confusion matrix. The authors have previously proposed an automated sentiment analysis tool that incorporates the confusion matrix as well as all the other metrics in the study [14].
Recall: The proportion of positive instances recognized as positive among the total number of positive examples.
Precision: The percentage of real positive cases among the examples classified as positive by the model.
F1-Score: The F-score, also known as the F1-score, is a measure of how accurate a model is on a given dataset. It is used to assess binary classification algorithms that categorize examples as either "positive" or "negative."
Accuracy: The percentage of correct predictions made by our model. The accuracy of a machine learning model is a metric for determining which model is best at finding correlations and patterns between variables in a dataset.
Training Accuracy: A model's correctness on the instances on which it was built.
Testing Accuracy: How well the model performs on instances it has not seen before.
Accuracy Gap Percentage: The percentage difference between training accuracy and testing accuracy is known as the gap percentage.

Precision = TP ÷ (TP + FP)    (1)
Recall = TP ÷ (TP + FN)    (2)
F1-Score = (2 × Precision × Recall) ÷ (Precision + Recall)    (3)
Support = σ(X + Y) ÷ Total    (4)
Accuracy = (TP + TN) ÷ (TP + TN + FP + FN)    (5)
Accuracy Gap % = (Training Accuracy − Testing Accuracy) × 100    (6)
where true positives occur when you predict that an observation belongs to a particular class and it does; true negatives occur when you predict that an observation does not belong to a class and it truly does not; false positives arise when you incorrectly predict that an observation belongs to a particular class when it does not; and false negatives arise when you incorrectly predict that an observation does not belong to a particular class when it does.
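The sketch below shows how Eqs. (1)–(6) can be computed from a scikit-learn confusion matrix; the example predictions are illustrative, and the 0.91/0.87 accuracies are the values quoted in the later discussion, not output of the authors' code.

```python
# Sketch of Eqs. (1)-(6) computed from a binary confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1, 0]   # illustrative ground truth
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]   # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                                  # Eq. (1)
recall    = tp / (tp + fn)                                  # Eq. (2)
f1        = 2 * precision * recall / (precision + recall)   # Eq. (3)
accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (5)
print(precision, recall, f1, accuracy)

# Eq. (6): gap between training and testing accuracy, in percent.
train_acc, test_acc = 0.91, 0.87
gap_pct = (train_acc - test_acc) * 100                      # 4.0
```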
4 Comparative Results and Discussions 4.1 Support Vector Machine A Support Vector Machine (SVM) is a method for classifying objects based on a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm produces an optimal hyperplane that categorizes fresh instances. In two-dimensional space, this hyperplane is a line that divides the plane into two halves, with one class on each side. It can handle both linear and nonlinear problems and is useful for a wide range of applications. The idea behind SVM is basic: the algorithm separates the classes by drawing a line or hyperplane. The authors obtained the maximum training and testing accuracies of 91% and 87%, respectively, using the SVM algorithm, making it the best method among those evaluated. The accuracy gap of 4% is quite small, indicating that the model constructed is neither overfitting nor underfitting.
4.2 Logistic Regression Logistic regression is a classification method used when the response variable is categorical. The goal of logistic regression is to discover a link between characteristics and the likelihood of a specific outcome. Linear regression can be quite useful when trying to predict a continuous output value from a linear relationship. However, the output values of a logistic regression are between 0 and 1, i.e., a probability. As a result, logistic regression does not work for continuous output values outside the range 0–1. The logit function is used in logistic regression to help derive a relationship between the dependent and independent variables by forecasting probabilities. Training and testing accuracies of 82% and 81%, respectively, are obtained using the logistic regression technique, which is quite efficient. The accuracy gap of 1% is the least of all the models, suggesting that the model built is neither overfitting nor underfitting.
4.3 Random Forest Random forest is a supervised learning method. It creates a "forest" out of an ensemble of decision trees, which are generally trained using the "bagging" approach. Many decision trees make up a random forest algorithm. While growing the trees, it adds extra randomness to the model: when splitting a node, it looks for the best feature within a random subset of features rather than the single most significant feature. As a result, there is a lot of diversity, which leads to a better model. By taking the average or mean of the output from several trees, the algorithm determines the outcome based on the decision trees' predictions. Using this algorithm, the authors achieved training and testing accuracies of 82% and 81%, respectively. The 1% accuracy difference is relatively modest, indicating that the model built is neither overfitting nor underfitting. According to the authors, the values for true positives, true negatives, false positives, and false negatives are not particularly high in the six confusion matrices above, indicating that the model has done fairly well. Table 2 and Fig. 2 present the training and testing accuracies of the various methods, along with a graphical depiction. The smaller the accuracy gap, the more accurate the model: it means the performance of the trained model differs least from the performance on the test set. As a result, one must ensure that testing and training accuracies do not differ significantly. The minimal accuracy gap is 1% here, indicating that the model is effective. Although the greatest accuracy gap, for SVM, is 4%, its training accuracy of 91% and testing accuracy of 87% are both quite high.
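A hedged sketch of the comparison summarized in Table 2 follows: the three classifiers are trained with scikit-learn defaults and the train/test accuracy gap is printed. The data here is synthetic, since the authors' patient dataset is not public, so the printed numbers will not match Table 2.

```python
# Sketch comparing SVM, logistic regression, and random forest on
# synthetic stand-in data; mirrors the structure of Table 2.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 10))          # stand-in for the 10 symptom features
y = rng.integers(0, 2, 1000)        # stand-in for the Outcome column
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=0)

models = {
    "Support Vector Machine": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f} "
          f"gap={(train_acc - test_acc) * 100:.1f}%")
```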
Table 2 Comparative analysis of all three algorithms

Algorithm                Training accuracy   Testing accuracy   Accuracy gap %
Support Vector Machine   0.91                0.87               0.04
Logistic Regression      0.82                0.81               0.01
Random Forest            0.82                0.81               0.01
Fig. 2 Graphical representation of all algorithms based on accuracy
5 Conclusion COVID-19 seems to be an endemic problem, similar to other infectious illnesses such as HIV/AIDS, tuberculosis, measles, and hepatitis. The World Health Organization (WHO) has designated this disease as a public health crisis of global concern. COVID-19 is spread by physical interaction with an infectious person through coughs or sneezes, and there is no medically authorized vaccine or medicine to cure it. Non-clinical methodologies, such as ML technologies, are being used as an alternative method of diagnosis and prognosis for COVID-19 pandemic patients in almost all countries around the world, to complement and reduce the huge burden on limited healthcare systems and revive the deeply affected economic sector. In this article, ML models for COVID-19 infection were created using the SVM, random forest, and logistic regression learning methods. Eighty percent of the dataset was utilized to train the models, while the remaining 20% was used to test them. In terms of accuracy, the model built with SVM was the best of all the models developed, with a score of 91%. Random forest and logistic regression models, on the other hand, emerged as the top models in terms of responsiveness and sensitivity, with 82% training accuracy.
The approach will primarily forecast the risk variables after COVID vaccination and will be able to give some assistance to doctors who must decide whether or not to treat a person who has tested positive for COVID-19. This virus, however, is still capable of having a significant influence on the quality of life of afflicted persons. In the future, a model that not only forecasts the risk after vaccination but also forecasts the intensity of illness development should be developed.
References

1. Jain R, Gupta M, Taneja S, Jude Hemanth D (2021) Deep learning based detection & analysis of COVID-19 on chest X-ray image. Appl Intell 51(3):1690–1700
2. Muhammad LJ, Algehyne EA, Sharif Usman S, Ahmad AA, Chakraborty C, Alh Mohammed I (2021) Supervised machine learning models for prediction of COVID-19 infection using epidemiology dataset. SN Comput Sci 2(1):1–13
3. https://www.cdc.gov/coronavirus/2019-ncov/vaccines/effectiveness/why-measure-effectiveness/breakthrough-cases.html
4. https://www.indiatoday.in/coronavirus-outbreak/vaccine-updates/story/you-can-get-covid19-even-after-getting-two-doses-vaccine-but-no-need-panic-all-questions-faqs-1796654-2021-04-30
5. Ahamad MM, Aktar S, Jamal Uddin M, Rashed-Al-Mahfuz M, Azad AKM, Uddin S, Alyami SA et al (2021) Adverse effects of COVID-19 vaccination: machine learning and statistical approach to identify and classify incidences of morbidity and post-vaccination reactogenicity. medRxiv
6. Villavicencio CN, Escudero Macrohon JJ, Inbaraj XA, Jeng J-H, Hsieh J-G (2021) COVID-19 prediction applying supervised machine learning algorithms with comparative analysis using WEKA. Algorithms 14(7):201
7. Rustam F, Ahmad Reshi A, Mehmood A, Ullah S, On B-W, Aslam W, Sang Choi G (2020) COVID-19 future forecasting using supervised machine learning models. IEEE Access 8:101489–101499
8. Alafif T, Muneeim Tehame A, Bajaba S, Barnawi A, Zia S (2021) Machine and deep learning towards COVID-19 diagnosis and treatment: survey, challenges, and future directions. Int J Environ Res Public Health 18(3):1117
9. Iwendi C, Kashif Bashir A, Peshkar A, Sujatha R, Moy Chatterjee J, Pasupuleti S, Mishra R, Pillai S, Jo O (2020) COVID-19 patient health prediction using boosted random forest algorithm. Front Public Health 8:357
10. Li Y, Horowitz MA, Liu J, Chew A, Lan H, Liu Q, Sha D, Yang C (2020) Individual-level fatality prediction of COVID-19 patients using AI methods. Front Public Health 8:566
11. Nayak J, Naik B, Dinesh P, Vakula K, Kameswara Rao B, Ding W, Pelusi D (2021) Intelligent system for COVID-19 prognosis: a state-of-the-art survey. Appl Intell 51(5):2908–2938
12. Rehman MU, Shafique A, Khalid S, Driss M, Rubaiee S (2021) Future forecasting of COVID-19: a supervised learning approach. Sensors 21(10):3322
13. Machine learning in Python (n.d.) https://scikit-learn.org/
14. Das A, Agarwal A, Das R (2021, in press) Evaluation of social human sentiment analysis using machine learning algorithms. Accepted and presented at the 2nd international conference on computer engineering and communication systems (ICACECS 2021), held on 13th–14th August 2021
Offensive Language Detection in Under-Resourced Algerian Dialectal Arabic Language Oussama Boucherit and Kheireddine Abainia
Abstract This paper addresses the problem of detecting offensive and abusive content in Facebook comments, where we focus on Algerian dialectal Arabic, which is one of the under-resourced languages. The latter has a variety of dialects mixed with different languages (i.e., Berber, French, and English). In addition, we deal with texts written in both Arabic and Roman scripts (i.e., Arabizi). Due to the scarcity of works on the same language, we have built a new corpus comprising more than 8.7 k texts manually annotated as normal, abusive, or offensive. We have conducted a series of experiments using the state-of-the-art classifiers of text categorization, namely: BiLSTM, CNN, FastText, SVM, and NB. The results showed acceptable performances, but the problem requires further investigation of linguistic features to increase the identification accuracy. Keywords Offensive language · Abusive language · Social media · Algerian dialectal Arabic · Facebook
O. Boucherit (B) · K. Abainia PIMIS Laboratory, Department of Electronics and Telecommunications, Université 8 Mai 1945, 24000 Guelma, Algeria e-mail: [email protected] K. Abainia e-mail: [email protected]

1 Introduction

Due to the rapid development of social media, online communication has become easier. However, this raises some concerns, such as the use of offensive language that may harm people mentally and physically. Offensive language can take various forms, including hate speech, bullying, disrespect, abuse, and violence. This behavior may lead to depression, which has a negative impact on people's health and relationships,
and it may lead to suicide as well. To overcome this issue, several researchers have worked on the automatic detection of such content in a myriad of languages of the world. Several Arabic studies have been carried out to detect offensive language, but on social media people often use dialectal Arabic, which differs from official Arabic. There is a scarcity of research works on Algerian Arabic, because the latter is known for a complicated linguistic structure and contains several terms borrowed from different languages such as French, Italian, Spanish, and Turkish. All these issues, among others, make automatic processing of such texts more difficult. In this paper, we investigate automatic offensive language detection in under-resourced Algerian dialectal Arabic, where we propose a new corpus (i.e., DziriOFN) for this task due to the lack of works on the same dialectal Arabic. The created corpus has been annotated by five native speakers at the sentence level, wherein the texts are labeled as offensive, abusive, or normal. In addition, we have evaluated the state-of-the-art tools of text categorization such as SVM, NB, CNN, BiLSTM, and FastText. The experimental results showed that the conventional tools produce acceptable results, but they require further investigation to enhance the accuracy.
2 Related Work In this section, we highlight some research works on offensive language detection, covering related works in Latin and Arabic languages. For Latin languages, a statistical classifier based on sentiment analysis was proposed by Gitari et al. [8], wherein the authors proposed a model to detect subjectivity. The proposed model not only detects whether a sentence is subjective, but can also identify and score the polarity of sentiment expressions. Another statistical classifier was proposed, for which the authors collected over 1 M tweets to detect hate speech in the Italian language [18]. A machine learning approach based on feature selection over meta-data was proposed to deal with automatic offensive language detection on Twitter [5]. In particular, the authors used SVM and NB classifiers, where the latter outperformed the SVM (i.e., 92% accuracy in contrast to 90%). Another classifier combining SVM and MLP (multilayer perceptron) was proposed in [15], where the authors used Stochastic Gradient Descent (SGD) for feature selection. They experimented with three Indo-European languages (i.e., English, German, and Hindi), and the results showed that the proposed framework was more suitable for Hindi than for English and German. Greek offensive language identification was addressed by creating a new, manually annotated Twitter corpus (OGTD) containing 4.7 k texts [17]. The authors tested different ML and DL approaches such as SVM, SGD, NB, and LSTM. The experimental results on OGTD showed that LSTM (with attention) outperformed the conventional ML approaches (i.e., 0.89 F1-score).
A hate speech detection approach established a lexical baseline for discriminating between hate speech and profanity on a standard dataset [20]. In [3], a Twitter corpus of 16 k texts for hate speech identification was manually annotated, on which various DL approaches were evaluated. Another study using DL approaches was proposed for hate speech detection in the Indonesian language [19]. The authors evaluated different feature models, where textual features produced promising results (87.98% F1-score). Some studies have been carried out on the Arabic language. For instance, an approach for detecting cyberbullying in Arabic texts was proposed by Haidar et al. [11], where the approach focused on preventing cyberbullying attacks. In particular, it uses NLP techniques to identify and process Arabic words, and ML classifiers to detect the bullying content. A dataset for Arabic hate speech detection with 9.3 k annotated tweets was proposed by Raghad and Al-Khalifa [2]. The authors experimented with several DL and ML models to detect hate speech in Arabic tweets. The results showed that CNN-GRU produced the best performances (0.79 F1-score). A multitask approach for Arabic offensive language and hate speech detection using DL, transfer learning, and multitask learning was proposed in [7]. Otiefy et al. [16] experimented with offensive language identification on multiple Twitter datasets, where the authors evaluated several ML and DL models. They used a combination of character and word n-grams in a linear SVM model, which produced the best performance among the other baseline models [16]. A multitask approach to detect Arabic offensive language was proposed by [6], where the proposal could be trained with a small training set. The proposed approach ranked second among others in both tasks of the shared task (i.e., 90% and 82.28% F1-score). Solving the problem of out-of-vocabulary words in the Arabic language was addressed for detecting offensive language, where the authors presented a model with character-level embeddings [1]. An approach for abusive language detection on Arabic social media (dialectal Arabic) was proposed, where two datasets were introduced [13]. The first contains 1,100 manually labeled dialectal tweets, and the second contains 32 k comments that the moderators of popular Arabic newswires deemed inappropriate. The authors proposed a statistical approach based on a list of obscene words, and the produced results were around 60% F1-score. A hate speech and abusive language detection approach for the Levantine dialect was proposed by Mulki et al. [14], where the authors experimented with both binary classification and multi-class classification employing SVM and NB classifiers. Different ML approaches and an ensemble classifier were used to deal with offensive language identification in dialectal Arabic [12]. The conducted study showed an interesting impact of preprocessing on this task, as well as good performances of the ensemble classifier in contrast to standard ML algorithms. Another work focused on Tunisian hate speech and abusive speech, in order to create a benchmarked dataset (6 k tweets) of online Tunisian toxic content [10]. The authors evaluated two ML approaches (i.e., NB and SVM), where the NB classifier outperformed the SVM (92.9% accuracy).
Table 1 Details of the annotation rounds

Annotator      1st round   2nd round   3rd round
Annotator #1   6,000       0           4,258
Annotator #2   4,258       0           6,000
Annotator #3   0           3,000       0
Annotator #4   0           3,000       0
Annotator #5   0           4,258       0
3 Corpus To the best of our knowledge, there is only one similar corpus proposed for offensive language (i.e., hate speech) on Algerian dialectal Arabic [9]. This corpus was proposed for hate speech detection against women and was crawled from the Youtube social media platform. Unfortunately, the corpus regroups only 3.8 k texts labeled as "not hateful" and "hateful", where the latter has 792 compiled texts. We think that this is not enough to train machine learning classifiers to correctly recognize hateful texts, because we cannot cover different writing possibilities (i.e., dialectal texts). We have created a new corpus1 (i.e., DziriOFN) for the same task, but it targets offense detection in general and is not addressed to a specific target (e.g., women). Our corpus was crawled from the Facebook social media platform, as the latter is considered the primary communication medium used by the Algerian community.
3.1 Data Collection Firstly, we have selected a set of public pages and groups related to sports and politics. Among the posts, we have selected the ones addressing harmful and provoking subjects that involve more interactions. More specifically, most of the subjects contain conflicts and controversies in news, religion, ethnicity, and football. The Algerian community is conservative, and some provoking subjects may involve offensiveness while expressing opinions. Because of Facebook policies, public users are restricted from gathering public or private data from this platform. In this regard, three Javascript scripts2 were created to automate the data collection instead of doing the task manually. The scripts unhide all the comments of a given post, unhide the second part of long comments, and retrieve all the comments with their information (i.e., user names, profile links, comment texts, number of reactions, and number of replies). Empty comments or comments with only images and emojis were ignored. Overall, we have crawled 10,258 comments written in Arabic script, Roman script (i.e., Arabizi), or both.

1 https://github.com/xprogramer/DziriOFN
2 https://github.com/xprogramer/fb-cmt-crawl
3.2 Data Annotation To annotate our corpus, five Algerian native speakers were involved in this task using an in-house crowdsourcing platform. They were instructed to attribute one label to each text among three labels, i.e., offensive, abusive, or normal, as well as one language label among three labels, i.e., Arabic (MSA), dialect, or mixed between Arabic and dialect. A text is ignored if it is not offensive and completely written in another language (French, English, or Berber), or if it is unclear (not understood) or difficult to label. We have defined offensive texts as texts that contain hate, aggressiveness, bullying, harassment, or violence, or that offend a target (someone, a group of people, or an entity in general). On the other hand, we have defined abusive texts as texts containing swear words or sexual/adult content. The annotation was performed in three rounds (Table 1). In the first one, annotator #1 annotated the first 6,000 texts and annotator #2 annotated the remaining 4,258 texts. In the second round, annotator #3 annotated the first 3,000 texts, annotator #4 annotated the following 3,000 texts, and annotator #5 annotated the remaining 4,258 texts. Finally, in the third round, if there was a conflict between the two first rounds, annotator #2 annotated the first 6,000 texts (only conflicting labels) and annotator #1 annotated the remaining 4,258 texts (only conflicting labels). Overall, from the collected data, 1,509 texts were ignored (for various reasons), 3,227 texts were labeled as offensive, 1,334 texts were labeled as abusive, and 4,188 texts were labeled as normal (Table 2). Table 3 summarizes the inter-annotator agreements between the different annotators, where the third annotator refers to the ensemble of second-round annotators (i.e., annotators #3, #4, and #5). It is noticed that normal texts present a high level of agreement between the different annotators, because it is easier to differentiate such texts than the other categories. In addition, the first and the second annotators agreed on most of the abusive and offensive texts (around 83% and 72.5%, respectively). By comparing Tables 3 and 4, the third annotator correctly labeled more normal texts in contrast to the two others. Conversely, the third annotator (i.e., annotators #3, #4, and #5) did not correctly label the offensive and abusive texts (around 27.5% and 16.9%), while annotator #1 correctly labeled most of the abusive texts (around 91.2%) and annotator #2 correctly labeled most of the offensive texts (around 93.6%).

Table 2 Corpus description

Category          Number of texts
Offensive texts   3,227
Abusive texts     1,334
Normal texts      4,188
Ignored texts     1,509
Total             10,258
Table 3 Inter-annotator agreement between different annotators

Category    1st versus 2nd   1st versus 3rd   2nd versus 3rd
Normal      3,449            3,603            4,008
Abusive     1,108            85               141
Offensive   2,341            583              305

Table 4 Labels overlapping between final labels and annotator labels

Category    1st annotator   2nd annotator   3rd annotator
Normal      3,635           4,021           4,175
Abusive     1,199           1,249           226
Offensive   2,944           2,645           887
Thus, the third annotator (the second-round ensemble) highly contributed to the annotation of normal texts, while annotators #1 and #2 contributed to the annotation of abusive and offensive texts (and normal texts as well). It is obvious that data annotation depends on the gender, age, and culture of the annotators, and perhaps the third set of annotators is familiar with offensive texts and considered them normal texts. It is also noticed that some abusive texts were labeled as offensive by the third set of annotators, which is due either to a lack of training (not being well trained) or to them considering these texts offensive according to their beliefs.
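As a rough sketch, per-category agreement counts of the kind reported in Tables 3 and 4 could be computed by counting, for each label, the texts on which two annotation rounds coincide; the label lists below are illustrative.

```python
# Sketch: per-category agreement counts between two annotators
# (illustrative labels, not the actual DziriOFN annotations).
from collections import Counter

ann1 = ["normal", "offensive", "abusive", "normal", "offensive"]
ann2 = ["normal", "offensive", "offensive", "normal", "normal"]

# For each category, count the texts on which both annotators agree.
agreement = Counter(a for a, b in zip(ann1, ann2) if a == b)
print(agreement)   # e.g. Counter({'normal': 2, 'offensive': 1})
```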
4 Experimental Results We have evaluated several ML and DL classifiers on our corpus. For machine learning, we have used SVM, Multinomial NB, and Gaussian NB with the default settings of the scikit-learn toolkit. We have applied standard preprocessing steps (i.e., removing punctuation marks) and applied the TF-IDF technique to weigh the word frequencies. For the deep learning classifiers, we have used 512 filters with filter sizes [3, 4, 5] and a dropout rate of 0.5 in the CNN. Similarly, we have built the BiLSTM model with one hidden layer and a dropout of 0.5. However, we have used the default settings of FastText [4]. Overall, we have conducted two series of experiments, i.e., binary classification and multi-class classification. In the first one, we merged offensive and abusive comments together, while, in the second experiment, we considered each category independently (i.e., three classes). The corpus was split into a training set and a test set (90% and 10%, respectively). Table 5 summarizes the accuracies produced by SVM, Multinomial NB, and Gaussian NB trained with two classes and three classes. Both SVM and Multinomial NB reported high accuracies in both experiments, while Gaussian NB was the worst and degraded considerably in three-class classification.
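A minimal sketch of this ML baseline setup is given below, assuming scikit-learn defaults, TF-IDF word features, and a 90:10 split; the placeholder texts stand in for the DziriOFN comments, and note that Gaussian NB needs a dense matrix.

```python
# Sketch of the ML baselines: TF-IDF features fed to SVM and NB models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, GaussianNB

# Placeholder comments and labels standing in for the DziriOFN corpus.
texts = ["comment one", "another comment", "some text here", "more text"] * 25
labels = [0, 1, 0, 1] * 25          # e.g. 0 = normal, 1 = offensive/abusive

X = TfidfVectorizer().fit_transform(texts)   # TF-IDF word uni-grams
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.10,
                                          random_state=0)

for clf in (SVC(), MultinomialNB()):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, clf.score(X_te, y_te))

# GaussianNB requires a dense array rather than a sparse TF-IDF matrix.
gnb = GaussianNB().fit(X_tr.toarray(), y_tr)
print("GaussianNB", gnb.score(X_te.toarray(), y_te))
```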
Table 5 Test results of ML models trained with two and three labels

Category       SVM     Multinomial NB   Gaussian NB
Two labels     0.744   0.752            0.710
Three labels   0.669   0.662            0.518

Table 6 Test results of DL models trained with two and three labels

Category       CNN     BiLSTM   FastText
Two labels     0.523   0.520    0.716
Three labels   0.347   0.400    0.648
From Table 6, the deep learning classifiers produce low performances compared to the machine learning classifiers. However, FastText highly outperformed CNN and BiLSTM, but produced slightly reduced accuracy compared with Multinomial NB and SVM. Moreover, the deep learning classifiers perform better in binary classification than in multi-class classification (three classes). It is obvious that as the number of classes decreases, the identification accuracy increases, and vice versa. The reason for the low performances produced by the DL classifiers is that such classifiers require a huge training set, and they cannot extract features well in the absence of a standard orthography (different writing possibilities). Indeed, as we experiment with word uni-grams, we cannot cover all writing variants of the words; with character n-grams the accuracy may increase, because however the spelling of a word changes, its variants keep some common characters (generally the vowels change). In addition, it is sometimes difficult to spot abusive and offensive texts even for humans in the absence of strongly offensive words, which is why automatic algorithms sometimes cannot differentiate between normal and offensive texts. Finally, SVM and Multinomial NB hold the best performances among the others, because the words have been weighted with TF-IDF, which gives low weights to stop words and common words used in any text category, while unique words (or terms) receive high weights.
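To illustrate the character n-gram idea, the sketch below builds TF-IDF features over character n-grams; the Arabizi spelling variants are invented examples, and the point is only that such variants share n-gram features that word uni-grams would miss.

```python
# Sketch: character n-gram TF-IDF is more robust to spelling variants
# than word uni-grams when there is no standard orthography.
from sklearn.feature_extraction.text import TfidfVectorizer

variants = ["hbibi", "habibi", "h'bibi"]   # illustrative Arabizi variants
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))
X = vec.fit_transform(variants)

# The variants share many character n-gram features (e.g. "bi", "bib"),
# whereas word uni-grams would treat them as three unrelated tokens.
print(len(vec.get_feature_names_out()), "char n-gram features")
```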
5 Conclusion In this work, we have addressed the problem of offensive language identification in under-resourced Algerian dialectal Arabic, for which we have created a new corpus for the addressed problem because of the scarcity of works carried out in the same language. The corpus was crawled from Facebook social media (commonly used social media in Algeria), where 10,258 comments have been gathered from public pages and groups related to sensitive topics. Five annotators were involved in the annotation task by following a general guideline, and the texts were labeled into one of the three categories, i.e., offensive, abusive, and normal. We have evaluated the
state-of-the-art tools of text categorization such as SVM, Multinomial NB, Gaussian NB, CNN, BiLSTM, and FastText, where we have carried out two sets of experiments, i.e., binary classification and multi-class classification. In the first one, we merged offensive and abusive texts into the same category (offensive), while, in the second experiment, we kept the categories independent. The experimental results showed that the SVM and Multinomial NB classifiers outperformed all the other classifiers in both experiments (binary and multi-class classification). The results were acceptable, but the algorithms require further investigation to improve the accuracy, because word uni-grams cannot cover all the different writing possibilities. As future work, we plan to build a larger corpus while enhancing and adding new rules to the preprocessing. Moreover, we plan to propose a new rule-based algorithm to detect offensive and abusive language effectively.
References

1. Alharbi AI, Lee M (2020) Combining character and word embeddings for the detection of offensive language in Arabic. In: Proceedings of the 4th workshop on open-source Arabic corpora and processing tools, with a shared task on offensive language detection, May, pp 91–96
2. Alshaalan R, Al-Khalifa H (2020) Hate speech detection in Saudi twittersphere: a deep learning approach. In: Proceedings of the fifth Arabic natural language processing workshop, December, pp 12–23
3. Badjatiya P, Gupta S, Gupta M, Varma V (2017) Deep learning for hate speech detection in tweets. In: Proceedings of the 26th international conference on world wide web companion, April, pp 759–760
4. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
5. De Souza GA, Da Costa-Abreu M (2020, July) Automatic offensive language detection from twitter data using machine learning and feature selection of metadata. In: 2020 international joint conference on neural networks (IJCNN), pp 1–6
6. Djandji M, Baly F, Antoun W, Hajj H (2020, May) Multi-task learning using AraBert for offensive language detection. In: Proceedings of the 4th workshop on open-source Arabic corpora and processing tools, with a shared task on offensive language detection, pp 97–101
7. Farha IA, Magdy W (2020, May) Multitask learning for Arabic offensive language and hate-speech detection. In: Proceedings of the 4th workshop on open-source Arabic corpora and processing tools, with a shared task on offensive language detection, pp 86–90
8. Gitari ND, Zuping Z, Damien H, Long J (2015) A lexicon-based approach for hate speech detection. Int J Multim Ubiquit Eng 10(4):215–230
9. Guellil I, Adeel A, Azouaou F, Boubred M, Houichi Y, Moumna AA (2021) Sexism detection: the first corpus in Algerian dialect with a code-switching in Arabic/French and English. arXiv preprint. arXiv:2104.01443
10. Haddad H, Mulki H, Oueslati A (2019) T-HSAB: a Tunisian hate speech and abusive dataset. In: International conference on Arabic language processing, October, pp 251–263
11. Haidar B, Chamoun M, Serhrouchni A (2017) A multilingual system for cyberbullying detection: Arabic content detection using machine learning. Adv Sci Technol Eng Syst J 2(6):275–284
12. Husain F (2020) Arabic offensive language detection using machine learning and ensemble machine learning approaches. arXiv preprint. arXiv:2005.08946
13. Mubarak H, Darwish K, Magdy W (2017) Abusive language detection on Arabic social media. In: Proceedings of the first workshop on abusive language online, August, pp 52–56
14. Mulki H, Haddad H, Ali CB, Alshabani H (2019) L-HSAB: a Levantine twitter dataset for hate speech and abusive language. In: Proceedings of the third workshop on abusive language online, August, pp 111–118
15. Nayel HA, Shashirekha HL (2019) DEEP at HASOC2019: a machine learning framework for hate speech and offensive language detection. In: FIRE (working notes), December, pp 336–343
16. Otiefy Y, Abdelmalek A, Hosary IE (2020) WOLI at SemEval-2020 Task 12: Arabic offensive language identification on different Twitter datasets. arXiv preprint. arXiv:2009.05456
17. Pitenis Z, Zampieri M, Ranasinghe T (2020) Offensive language identification in Greek. arXiv preprint. arXiv:2003.07459
18. Santucci V, Spina S, Milani A, Biondi G, Di Bari G (2018) Detecting hate speech for Italian language in social media. In: EVALITA 2018, co-located with the fifth Italian conference on computational linguistics (CLiC-it 2018), vol 2263
19. Sutejo TL, Lestari DP (2018) Indonesia hate speech detection using deep learning. In: 2018 international conference on Asian language processing (IALP), November, pp 39–43
20. Zampieri M, Malmasi S, Nakov P, Rosenthal S, Farra N, Kumar R (2019) Predicting the type and target of offensive posts in social media. arXiv preprint. arXiv:1902.09666
A Comparative Analysis of Modern Machine Learning Approaches for Automatic Classification of Scientific Articles Kongkan Bora, Nihar Jyoti Baishya, Chinmoy Jyoti Talukdar, Deepali Jain, and Malaya Dutta Borah

Abstract Automatic classification of scientific articles is very beneficial for the scientific research community, helping determine whether a journal is appropriate for a given article or not. Specifically, it helps editor(s) pre-screen submissions at the editor's desk. In such a scenario, modern machine learning approaches can help automatically classify scientific articles based on their abstracts. In this work, we classify scientific articles based on their category, and hence a comparative analysis is performed where several deep learning and machine learning-based approaches are analyzed. Our experimental results suggest that the domain-specific pre-trained model SciBert helps in improving the classification performance significantly. Keywords Text classification · Scientific articles · Pre-trained model · Bert · Scibert
K. Bora · N. Jyoti Baishya · C. Jyoti Talukdar · D. Jain (B) · M. Dutta Borah Department of CSE, NIT Silchar, Assam 788010, India e-mail: [email protected] K. Bora e-mail: [email protected] N. Jyoti Baishya e-mail: [email protected] C. Jyoti Talukdar e-mail: [email protected] M. Dutta Borah e-mail: [email protected]

1 Introduction

Recently, there has been a boom in the number of scientific articles submitted to journals and conferences. These scientific communications need to be peer-reviewed so that the quality of the scientific articles can be validated. An initial screening is the first step of peer review, which is usually performed by the editor(s). The initial screening
is done to check the appropriateness of the paper (aim and scope), plagiarism, grammar, language, and template mismatch. An automated system would be beneficial for both the editor(s) and the author(s), so that the appropriateness of the journal can be checked. An automated system can be used to identify the in-scope and out-of-scope articles in the initial screening, so that out-of-scope papers are rejected. Without an automated system, the editor(s) must spend a substantial amount of time checking whether the submitted paper is appropriate for further proceedings (review) or not. Many times, good-quality papers get rejected only because they are out of scope [1]. With the advancement of new research leading to rapid growth in the number of articles submitted to journals, conferences, and even archival repositories, a great opportunity lies in the field of scientific text classification through the development of Natural Language Processing (NLP) techniques [2, 3]. Beyond this, NLP techniques have broad applicability in the analysis of scientific structures [4, 5] and in information extraction from scientific articles [6]. In this paper, we perform a comparative analysis to classify scientific articles. Specifically, given an abstract, the aim is to identify its category as one of seven categories, which are defined in Sect. 2.1. Following the introduction, we describe the dataset and the adopted method in Sect. 2. Section 3 presents the results and an analysis of the obtained results. Finally, the conclusion is given in Sect. 4 with future research directions.
2 Data and Methods 2.1 Dataset Description The dataset was released as part of the Scope Detection of the Peer Review Articles (SDPRA 2021) shared task [7] and is publicly available as part of the Mendeley data (http://dx.doi.org/10.17632/njb74czv49.1). It consists of training and validation data. The training set contains 16,800 abstracts along with the category labels to which each abstract belongs; the validation set contains 11,200 abstracts with their category labels. There are seven categories, and the number of samples per category is given in Table 1. Since the organizers did not release a separate testing dataset, for our experiments we divide the validation data equally into valid and test sets of 5,600 samples each.
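A minimal sketch of the 50/50 split described above, assuming the validation abstracts and labels sit in a pandas DataFrame with a "category" column (the file and column names are illustrative, and the paper does not state whether its split was stratified):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

valid_df = pd.read_csv("validation.csv")  # hypothetical file name

# Splitting the 11,200-sample validation set into equal halves of 5,600.
# Stratifying on the label keeps the seven classes balanced in both halves.
valid_half, test_half = train_test_split(
    valid_df, test_size=0.5, stratify=valid_df["category"], random_state=42
)
print(len(valid_half), len(test_half))  # 5600 5600
```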
Table 1 Categorywise distribution of dataset

Category | Train | Validation
Computation and language (CL) | 2,740 | 1,866
Data structure and algorithms (DS) | 2,737 | 1,774
Cryptography and security (CR) | 2,660 | 1,835
Networking and computer architecture (NI) | 2,764 | 1,826
Logic in computer science (LO) | 1,811 | 1,217
Distributed and cluster computing (DC) | 2,042 | 1,355
Software engineering (SE) | 2,046 | 1,327
2.2 Methodology We formulate category identification as a multi-class classification problem and apply both Machine Learning (ML) and Deep Learning (DL) classifiers to it. The ML classifiers are Linear Support Vector Machine, Stochastic Gradient Descent (SGD), Random Forest, Logistic Regression (LR), K-Nearest Neighbors, Non-linear SVM, and XGBoost. The best-performing ML classifiers are selected for hyperparameter tuning, followed by a voting ensembling approach. Along with the ML classifiers, DL-based classifiers such as Bi-directional LSTM and RNN, and the domain-specific pre-trained model SciBert [8], are also considered. The ML and DL classifiers used in this work are discussed in Sects. 2.4 and 2.5, respectively.
2.3 Data Preprocessing Data preprocessing is a crucial step before any predictive modeling experiment. In this work, several data cleaning steps are applied: all non-alphabetic and numeric characters are removed from the dataset, all abstracts are converted to lower case, and newlines are stripped. We then perform lemmatization and split each sentence of an abstract into words, as sketched below.
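A minimal sketch of these cleaning steps. The paper does not name the library it used; NLTK is one common choice for lemmatization and tokenization, so the code below is an assumption-laden illustration rather than the authors' pipeline:

```python
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)    # tokenizer models
nltk.download("wordnet", quiet=True)  # lemmatizer data

lemmatizer = WordNetLemmatizer()

def preprocess(abstract):
    """Clean one abstract and return its list of lemmatized word tokens."""
    text = abstract.lower().replace("\n", " ")
    # Drop everything that is not a letter; the paper removes both
    # non-alphabetic and numeric characters.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(text)]
```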
2.4 ML Classifiers
• Logistic Regression (LR) [9]: LR is a classification algorithm based on the sigmoid function, which maps any real-valued number to a value between 0 and 1.
• Support Vector Machine (SVM) [10]: SVM is a supervised classification algorithm that tries to maximize the margin between classes. There are two variants: (1) Linear SVM, used when the data is linearly separable, and (2) Non-linear SVM, used when it is not.
• Random Forest [11]: A random forest is an ensemble of several decision trees, usually trained with the bagging method, whose idea is that combining several learning models improves the overall result.
• Stochastic Gradient Descent (SGD) [12]: The SGD classifier combines several binary classifiers through a one-versus-all (OVA) scheme to support multi-class classification. It performs discriminative learning of linear classifiers under several loss functions.
• eXtreme Gradient Boosting (XGBoost) [13]: XGBoost is an improvement over gradient-boosted decision trees, designed mainly for computational speed and model performance. In ML, boosting is a sequential ensemble technique that converts weak learners into strong learners so that the model's accuracy increases.
• K-Nearest Neighbor (KNN) [14]: KNN classifies data points based on the minimum distance between points, assigning a new data point to the class of its nearest neighbors.
We also perform hyperparameter tuning of the Linear SVM and SGD classifiers, since they achieve the best scores among all the individual classifiers, as shown in Table 2. Term frequency-inverse document frequency (TF-IDF) features are used to represent the text. In addition, we apply a voting ensembling approach (hard voting and soft voting) to classify scientific articles into their predefined categories; a sketch follows this list.
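A minimal scikit-learn sketch of this pipeline: TF-IDF features feeding a soft-voting ensemble of the two best individual classifiers, evaluated with the stratified tenfold cross-validation used in Sect. 3.1. Hyperparameter values are illustrative, not the paper's tuned settings, and X_train/y_train are assumed to hold the preprocessed abstracts and their labels:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

# Soft voting averages predicted class probabilities, so every member must
# expose predict_proba: SVC needs probability=True, and SGD needs a
# probabilistic loss such as log_loss.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", probability=True)),
        ("sgd", SGDClassifier(loss="log_loss")),
    ],
    voting="soft",
)
model = make_pipeline(TfidfVectorizer(), ensemble)

# cv=10 gives stratified tenfold cross-validation for classifiers.
scores = cross_val_score(model, X_train, y_train, cv=10)
print(scores.mean())
```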
2.5 DL Classifiers We apply several DL-based classifiers, namely Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), Bi-directional LSTM, and the pre-trained SciBert model [8], to the classification of scientific articles.
Fig. 1 RNN architecture
Fig. 2 Bi-LSTM architecture
2.5.1 RNN Model
We create an embedding layer of 128 dimensions with a vocabulary size of 10,000. The maximum length is set to 299, the maximum abstract length in the training dataset. The output of the embedding layer is flattened by adding a Global Max Pooling 1D layer. A softmax activation function is used at the output, with sparse categorical cross-entropy as the loss and Adam as the optimizer, as shown in Fig. 1.
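A minimal Keras sketch of this model, assuming inputs are integer token-id sequences padded to length 299 (e.g. via tf.keras.preprocessing.sequence.pad_sequences). The paper describes no intermediate layers, so none are added here:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=128),  # vocab 10,000, dim 128
    layers.GlobalMaxPooling1D(),            # flattens the embedded sequence
    layers.Dense(7, activation="softmax"),  # one unit per category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```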
2.5.2 Bi-LSTM
For this model, we use a Bi-LSTM layer of 64 units after the embedding layer. In another variant, we use two Bi-LSTM layers of 64 and 32 units after the embedding layer. Rectified linear unit (ReLU) activation is used in the hidden layers and softmax in the output layer, as depicted in Fig. 2.
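A minimal Keras sketch of the two-layer variant. The width of the ReLU hidden layer is not stated in the paper, so the value below is an assumption:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=128),
    # return_sequences=True is required so the second Bi-LSTM receives
    # the full sequence rather than only the last hidden state.
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(64, activation="relu"),    # hidden layer with ReLU (width assumed)
    layers.Dense(7, activation="softmax"),  # one unit per category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```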
2.5.3 SciBert Model
We use the domain-specific pre-trained SciBert model, which is based on the Bert model [15], itself based on transformers [16]. SciBert is trained on 1.14M papers from Semantic Scholar [12], a large-scale corpus of scientific text. It uses a domain-specific vocabulary (Sci-vocab) containing the most frequent words and subwords observed in scientific texts.
We use the scibert_scivocab_uncased pre-trained model hosted on Hugging Face (https://huggingface.co/allenai/scibert_scivocab_uncased). Fine-tuning is performed over the pre-trained model for the downstream multi-class classification task. Specifically, we employ the SciBert Base model in its uncased form, with 12 layers, a hidden size of 768, and 12 self-attention heads (110M total parameters), as pictorially depicted in Fig. 3. The steps for building a SciBert-based system are:
• Tokenization: To feed the text (abstracts) to SciBert, it must be split into tokens, and these tokens must be mapped to their indices in the tokenizer vocabulary (Sci-vocab). This requires the tokenizer included with SciBert; we use its "uncased" version.
• Data Preparation: The SciBert model requires the input data to be formatted as follows:
– [CLS] and [SEP] symbols are added at the start and end of each sentence, respectively. The first embedding, which corresponds to the [CLS] token, is used by the classifier for prediction, as shown in Fig. 3.
– Every sentence is padded or truncated to a constant length; in this work, the texts are truncated to a maximum length of 128 words/tokens.
– Attention masks are created to differentiate actual tokens from [PAD] tokens. The mask tells SciBert's self-attention mechanism not to attend to padded tokens while interpreting the sentences.
• Training phase: Finally, fine-tuning-based training is performed with an additional, initially untrained classification layer on top. The model is trained for 2 epochs with a batch size of 32, using the AdamW optimizer (W = Weight Decay), an improved version of Adam [17]. The learning rate is set to 5e-5, as suggested in Appendix A.3 of [15]. A sketch of these steps follows this list.
• Testing phase: With the trained model, we predict the labels on the testing dataset; the results are shown in Table 2.
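A minimal sketch of these steps with the Hugging Face transformers library. The tokenizer handles [CLS]/[SEP] insertion, padding/truncation to 128 tokens, and attention masks; the training details (AdamW, lr = 5e-5) follow the text, but the single update below is only an outline, and the variables abstracts and label_ids (lists of strings and integer labels 0-6) are assumed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=7
)

# Adds [CLS]/[SEP], pads/truncates to 128 tokens, builds attention masks.
enc = tokenizer(abstracts, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
labels = torch.tensor(label_ids)

outputs = model(**enc, labels=labels)  # one illustrative training step
outputs.loss.backward()
optimizer.step()
```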
3 Experimental Results and Analysis 3.1 Experimental Setup The experimental work was carried out on a 64-bit Windows machine with 16 GB RAM and a Core i5 processor, and Google Colaboratory (https://colab.research.google.com/) was used for the deep learning experiments. For the ML classifiers and the voting ensemble approach, we report the average of stratified tenfold cross-validation; random search cross-validation is used for hyperparameter tuning. We consider Precision, Recall,
Fig. 3 SciBert Base architecture
F1-score, and Accuracy as the evaluation metrics for all the classifiers considered in this work. Table 2 shows the performance of the individual classifiers, the voting ensemble approach, and the DL-based classifiers on the test data. Among the ML classifiers and voting ensembles, the soft voting approach performs best, achieving accuracy, precision, recall, and F1-score of 0.9254, 0.9253, 0.9254, and 0.9251, respectively. This follows from the hyperparameter tuning of SGD and SVM and the subsequent ensembling of these classifiers; in general, ensemble-based methods have proved effective at improving performance. Another observation from Table 2 is that, surprisingly, the DL-based classifiers do not perform as well as the ML classifiers and the ensembling approach. The SciBert model achieves the best scores overall, with an F1-score, recall, precision, and accuracy of 0.9267 each. Table 3 shows the categorywise performance of the SciBert model. From this table, we see that Computation and Language (CL) achieves the best scores in terms of F1-measure, recall, and precision.
Table 2 Performance of several classifiers on test data

Group | Classifiers | Accuracy | Precision | Recall | F1 score
Individual classifiers | Logistic Regression | 0.9161 | 0.9158 | 0.9161 | 0.9155
Individual classifiers | Random Forest | 0.8473 | 0.8517 | 0.8473 | 0.8412
Individual classifiers | Linear Support Vector Machine | 0.9239 | 0.9241 | 0.9239 | 0.9238
Individual classifiers | Stochastic Gradient Descent | 0.9239 | 0.9235 | 0.9239 | 0.9234
Individual classifiers | XGBoost | 0.8875 | 0.8871 | 0.8875 | 0.8868
Individual classifiers | Non-Linear Support Vector Machine | 0.9184 | 0.9183 | 0.9184 | 0.9180
Voting ensembling | Majority voting | 0.9246 | 0.9245 | 0.9246 | 0.9243
Voting ensembling | Soft voting | 0.9254 | 0.9253 | 0.9254 | 0.9251
DL classifiers | RNN | 0.9025 | 0.9025 | 0.9025 | 0.9025
DL classifiers | One-Layer Bi-LSTM | 0.8839 | 0.8839 | 0.8839 | 0.8839
DL classifiers | Multi-Layer Bi-LSTM | 0.8550 | 0.8590 | 0.8590 | 0.8590
Pre-trained | SciBert | 0.9267 | 0.9267 | 0.9267 | 0.9267

Table 3 Categorywise performance of SciBert model

Classes | Precision | Recall | F1 score
CL | 0.98 | 0.99 | 0.98
CR | 0.93 | 0.93 | 0.93
DC | 0.83 | 0.84 | 0.84
DS | 0.94 | 0.94 | 0.94
LO | 0.94 | 0.93 | 0.94
NI | 0.93 | 0.91 | 0.92
SE | 0.90 | 0.93 | 0.91
There are only a few wrong predictions for the CL class, as the confusion matrix in Fig. 4 also shows. Conversely, Distributed and Cluster Computing (DC) achieves the worst scores of all the categories, and Fig. 4 likewise shows many wrong predictions for this category.
Fig. 4 Confusion matrix for SciBert classifier
Between the ML-based ensemble classifier and the DL-based SciBert model, SciBert achieves the highest scores in terms of F-measure, recall, precision, and accuracy. The reason is that SciBert is a domain-specific Bert model trained on large-scale scientific articles.
4 Conclusion Detecting out-of-scope scientific articles becomes a crucial task when many papers are submitted to journals and conferences. In this work, we perform a comparative analysis using ML and DL-based classifiers. From the experimental analysis, we find that the domain-specific SciBert model gives the best scores of all the approaches. Comparing the ML classifiers with the ensembling approach, the ensembling approach performs best; to our surprise, the DL-based classifiers do not perform as well as either. The pre-trained approach outperforms the ML classifiers, the ensembling approach, and the DL-based classifiers. To further improve this kind of automated out-of-scope detection, other features such as topic-modeling features could be combined with SciBert features and fed to more advanced neural networks such as Graph Neural Networks (GNN).
References 1. Ghosal T et al (2018) Investigating domain features for scope detection and classification of scientific articles. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018), pp 7–12 2. Romanov A, Lomotin K, Kozlova E (2019) Application of natural language processing algorithms to the task of automatic classification of Russian scientific texts. Data Sci J 18(1) 3. Cox J, Harper CA, de Waard A (2017) Optimized machine learning methods predict discourse segment type in biological research articles. In: Semantics, analytics, visualization. Springer, pp 95–109 4. Ghosal T et al (2020) An empirical study of importance of different sections in research articles towards ascertaining their appropriateness to a journal. International conference on asian digital libraries. Springer. pp 407–415 5. Solovyev V, Ivanov V, Solnyshkina M (2018) Assessment of reading difficulty levels in Russian academic texts: approaches and metrics. In: J Intell Fuzzy Syst 34(5):3049–3058 6. Nasar Z, Jaffry S, Malik MK (2018) Information extraction from scientific articles: a survey. Scientometrics 117(3):1931–1990 7. Reddy SM, Saini N (2021) Overview and insights from scope detection of the peer review articles shared tasks 2021. In: PAKDD (workshops), pp 73–78 8. Beltagy I, Lo K, Cohan A (2019) Scibert: a pretrained language model for scientific text. arXiv:1903.10676 9. Cessie SL, Van Houwelingen JC (1992) Ridge estimators in logistic regression. J R Stat Soc: Ser C (Appl Stat) 41(1):191–201 10. Zhang T (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: ICML 2004: Proceedings of the twenty-first international conference on machine learning. OMNI Press, pp 919–926 11. Breiman L (2001) Random forests. Mach Learn 45(1):5–32 12. Ammar W et al (2018) Construction of the literature graph in semantic scholar. arXiv:1805.02262 13. Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785–794 14. Arya S et al (1998) An optimal algorithm for approximate nearest neighbour searching fixed dimensions. J ACM (JACM) 45(6):891–923 15. Devlin J et al (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 16. Vaswani A et al (2017) Attention is all you need. Adv Neural Inf Process Syst, 5998–6008 17. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv:1711.05101
A Review of Machine Learning Algorithms on Different Breast Cancer Datasets E. Jenifer Sweetlin and S. Saudia
Abstract Machine Learning (ML) algorithms are widely used in medical science, especially for classifying clinical data. Random Forest, Decision Tree, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Logistic Regression, and Multilayer Perceptron are some of the ML algorithms used for the classification and prediction of various diseases. This paper reviews 40 recent ML algorithms published for breast cancer classification and prediction, along with the associated data pre-processing and feature selection techniques. The paper identifies from the literature the pre-processing and feature selection steps and the ML algorithms used for breast cancer classification and prediction, and tabulates them by accuracy. It also briefly describes three common clinical breast cancer datasets used to train most such ML algorithms. The review helps prospective researchers identify different aspects of research on ML solutions for breast cancer datasets using suitable pre-processing and feature selection techniques. Keywords Accuracy · Pre-processing · Feature selection · WDBC · WBCD · SEER
1 Introduction In today's health-care sector, breast cancer is a leading life-threatening disease, and its rate of incidence is increasing alarmingly [1]. Breast cancer affects more than one million people worldwide every year, and thousands of people around the world die from the disease each year [2]. It is a dangerous condition in which certain abnormal cells in the breast start to divide uncontrollably [3]. E. J. Sweetlin (B) · S. Saudia CITE, Manonmaniam Sundaranar University, Tirunelveli, India e-mail: [email protected] S. Saudia e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_51
This abnormal growth of the breast cells deteriorates health steadily, progresses rapidly, and can lead to sudden death. The age and family history of patients are the main risk factors of the disease. Age determines the treatment to be given to patients and also their survivability [4]; as the age of a patient increases, the adverse effects of breast cancer become more prominent. The tumor type can be malignant or benign. A malignant tumor is a dangerous tumor that grows irregularly and rapidly, whereas a benign tumor is harmless and usually stays in the area where it started growing [5]. The other main types of breast cancer, according to the site of incidence, are invasive ductal carcinoma, invasive lobular carcinoma, and in situ ductal carcinoma. The most common is invasive ductal carcinoma, which begins in the milk ducts. Invasive lobular carcinoma originates in the lobules and can spread rapidly to lymph nodes and other parts of the body [6]. In situ ductal carcinoma is curable and is the earliest stage of breast cancer. Tumor stages are graded by clinicians as T1, T2, T3, and T4 in increasing order of tumor size. Patients at an earlier tumor stage can be treated more successfully, and their expected years of survival are higher; a larger tumor stage is a critical stage of the disease that complicates treatment and results in a poor survival rate [7]. Thus, early detection and prediction of breast cancer helps patients in terms of survivability [8]. Data mining, Machine Learning, and Deep Learning based techniques [9–11] are widely used in the diagnosis and prognosis of breast cancer. These techniques also help researchers identify relationships among the features involved in the incidence of breast cancer and predict the outcome of the disease using archived datasets [12]; the survival rate of breast cancer patients is also explored. Some of the widely studied ML algorithms used for classification are Random Forest, Decision Tree, K-Nearest Neighbor, Support Vector Machine, Naive Bayes, Logistic Regression, and Multilayer Perceptron [13–15]. These papers have used various data pre-processing and feature selection methods to improve classifier accuracy for the prediction and classification of breast cancer. The data pre-processing stage involves operations to remove missing values or to replace wrong values and outliers by imputation techniques [16]. The three main datasets used by the ML algorithms in the literature [23–62] are the Wisconsin Diagnostics Breast Cancer (WDBC) dataset [17], the Wisconsin Breast Cancer Database-Original (WBCD) [17], and the Surveillance, Epidemiology, and End Results (SEER) dataset [18]. The objectives of this paper are therefore to review and tabulate, by accuracy, the roles of 40 ML algorithms [23–62] trained on the WDBC, WBCD, and SEER datasets in the detection, classification, prediction, and prognosis of breast cancer and in predicting the survivability of breast cancer patients. The paper also identifies the corresponding data pre-processing and feature selection techniques used by these ML algorithms to improve accuracy. The rest of the paper is arranged in three sections: Sect. 2 explains the details of the three datasets WDBC, WBCD, and SEER; Sect. 3 tabulates the common ML algorithms in the literature used for breast cancer classification and survival rate prediction; and the paper ends with the conclusion in Sect. 4.
The Annexure provides the expansions of acronyms used in this review.
2 Datasets: WDBC, WBCD, and SEER The three popular breast cancer datasets used to train most of the ML algorithms [23–62] in the literature for breast cancer classification and survivability prediction are the Wisconsin Diagnostics Breast Cancer (WDBC) dataset [17], the Wisconsin Breast Cancer Database-Original (WBCD) [17], and the Surveillance, Epidemiology, and End Results (SEER) dataset [18]. These datasets are described in this section, prior to the review and tabulation of the common ML algorithms along with their pre-processing and feature selection techniques and the purpose of each study. The WDBC and WBCD datasets are used to train ML algorithms for classifying malignant and benign tumors, and the SEER dataset is used to train ML algorithms for predicting the survival rate of breast cancer patients.
2.1 Wisconsin Diagnostics Breast Cancer (WDBC) Dataset The Wisconsin Diagnostics Breast Cancer (WDBC) dataset is downloadable from the UCI machine learning repository [17]. The dataset is labeled and has 569 instances, of which 212 are malignant records and 357 are benign records. Each record has 32 attributes; the first two are the identification number of the patient record and the class label of the diagnosis, with values malignant and benign. The 30 remaining attributes comprise ten real-valued features together with, for each cell nucleus, their mean, standard error, and the mean of the three largest values. The nucleus size of tumor cells is stated in terms of the radius and area of the cells. The shape of the nucleus is described by the features smoothness, compactness, concavity, concave points, symmetry, and fractal dimension; size and shape together are expressed in terms of the perimeter [19]. These real-valued features for each cell nucleus are defined below as in [20, 21]:
• Radius—The radius of the nucleus. In the WDBC dataset, values range from 6.981 to 28.110.
• Area—The area of the nucleus. Values range from 143.50 to 2501.00.
• Smoothness—The local variation in the lengths of the radii of the cell. Values range from 0.052 to 0.163.
• Compactness—The ratio perimeter²/area − 1.0 of the nucleus. Values range from 0.019 to 0.345.
• Concavity—The severity of the concave portions of the contour of the nucleus. Values range from 0 to 0.426.
• Concave points—The number of concave portions in the contour of the nucleus. Values range from 0 to 0.201.
• Symmetry—A measure of the symmetry of the cell. Values range from 0.106 to 0.304.
• Fractal dimension—A measure indicating the regularity of the nucleus. Values range from 0.049 to 0.097.
• Perimeter—The perimeter of the cell. Values range from 43.790 to 188.500.
• Texture—The standard deviation of the gray-scale values of the cell. Values range from 9.710 to 39.280.
• Diagnosis—The class label, with values Malignant and Benign.
Null values are not present in the records of the WDBC dataset. The data in this dataset is used by the research works [23–40] tabulated in Table 1 for the classification of breast cancer types as malignant or benign.
2.2 Wisconsin Breast Cancer Database-Original (WBCD) Dataset The Wisconsin Breast Cancer Database-Original (WBCD) dataset is downloadable from the UCI machine learning repository [17]. The WBCD is a labeled dataset with 699 instances, of which 241 are malignant cases and 458 are benign records. Each record has 11 features, which take integer values and are defined below as in reference [22]:
• Sample code number—The identification number of each record.
• Clump Thickness—The number of cell layers in the sample under investigation. Values range from 1 to 10; normal cells tend to form a single layer, while cancer cells form multiple layers.
• Uniformity of cell size—The variation in the size of cancer cells. Values range from 1 to 10.
• Uniformity of cell shape—The variation in the shape of cancer cells. Values range from 1 to 10.
• Marginal Adhesion—The tendency of cells to remain close to each other. Values range from 1 to 10; cancer cells tend to stay apart from each other.
• Single Epithelial cell size—The size of the epithelial cells. Values range from 1 to 10; larger epithelial cells correspond to malignant cancer.
• Bare nuclei—Indicates whether the nucleus is surrounded by cytoplasm. The extent of cytoplasm around the nucleus is indicated by values from 1 to 10.
• Bland chromatin—A measure of the texture of the cells. Values range from 1 to 10.
Table 1 Comparison of ML algorithms trained using WDBC dataset

References and year | ML algorithms | Purpose | Pre-processing and feature selection | Accuracy (%)
[23] 2021 | SVM, MLP, DT (J48) | Breast Cancer (BC) prognosis | SMOTE, PSO, GS, GRS, WBFS-NB, KNN, J48, RF | J48 (GS + RF) & J48 (GS + J48)-98.8
[24] 2020 | RF, SVM, NB, DT, MLP, LR | Prediction of BC | Unsupervised DF | RF-95.78
[25] 2020 | KNN, NB, SVM (RBF), DT (CART) | Classification of BC | PCA | KNN-96.49
[26] 2020 | DT, RF | Prediction of BC | MMN, PCA | RF-97.52
[27] 2020 | LR, KNN, SVM, K-SVM, NB, DT, RF | Classification of BC | MMN, MO-DT, SS-RF | RF-98.6
[28] 2020 | LR, KNN, SVM, NB, DT, RF | Classification of BC | Normalization-SS | LR-98.5
[29] 2020 | KNN, DT, B-SVM, AB | Classification of BC | NCA | KNN-99.12
[30] 2019 | MLP, KNN, GNB, SVM, DT (CART) | Classification of BC | Standardization | MLP-99.12
[31] 2019 | RF, SVM, KNN, DT | Classification of BC | Normalization | RF-93.34
[32] 2019 | KNN, DT | BC identification | – | DT-99.78
[33] 2019 | SVM, KNN, DT, GB, RF, LR, AB, GNB, LDA | Classification of BC | SS, LV, UFS and RFE, PO-GP | F1 score: AB-98.24
[34] 2019 | SVM (RBF), ANN, NB | Classification of BC | Centering, scaling, FS-CFS, RFE, FE-PCA, LDA | SVM (RBF)-LDA, ANN-LDA-98.82
[35] 2019 | LR, SGD, NB, MLP, RDT, RDF, KNN, SVM (SMO) | Classification of BC | Mean, SE, worst or largest mean, HV and SV | F3 score: LR, SGD, MLP-HV-99.42
[36] 2019 | LR, KNN, NB, SVM, AB, GB, RF, XGB, MLP, DT (CART) | Prediction of BC | DT-XGB | SVM-97.89
[37] 2019 | RF, KNN, MLP | Classification of BC | Ranker-IGAE, BF, GRS-CFS, SEv, and WSE | RF, KNN-100
[38] 2018 | SVM, KNN, NB, MLP, DT (J48), RF | Classification of BC | Normalization, GR-FS, RS | RF-98.77
[39] 2014 | K-means-SVM | BC diagnosis | K-means algorithm | K-SVM-97.38
[40] 2011 | SVM (RBF, QK) | Classification of BC | ICA | SVM (QK)-94.4
• Normal nucleoli—Information about the size and count of the nucleoli inside the nucleus of the cell. Values range from 1 to 10.
• Mitosis—An indication of the mitosis taking place in the cell. Values range from 1 to 10.
• Diagnosis—The class label, with values Malignant and Benign.
The values 1–10 of all the variables above are the numeric values of the corresponding variables. The data in this dataset is used by the research works [41–54] tabulated in Table 2 for the classification of breast cancer types as malignant or benign.
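For readers who want to experiment, the WDBC data described in Sect. 2.1 also ships with scikit-learn (569 instances, 30 real-valued features, malignant/benign labels). A minimal sketch follows; the classifier choice is only an illustration of the kind of experiment surveyed in Tables 1 and 2, and the same workflow applies to WBCD once its missing Bare nuclei values are handled:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# load_breast_cancer is scikit-learn's copy of the WDBC dataset.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))
```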
2.3 Surveillance, Epidemiology, and End Results (SEER) Dataset The Surveillance, Epidemiology, and End Results (SEER) dataset of breast cancer patients provides population-based cancer statistics associated with the age of the patients, sex, race, year of diagnosis, and ethnicity. It is a labeled dataset with the label values alive and dead. The dataset is available from the SEER website [18], and the data is used by the research works [55–62] tabulated in Table 3 for predicting breast cancer survivability over 1, 5, or 10 years. Some of the SEER features commonly used by the authors [55–62] are defined below as in reference [55]:
• Cancer stage—Specifies where the cancer is located and how far it has spread across other organs of the body. Its categorical values are distant, regional, and localized.
• Nodes positive—Specifies how many lymph nodes are affected by cancer. Values lie between 0 and 45 nodes.
• Grade—Describes the extent of differentiation of cancer cells under the microscope. Its categorical values are poorly differentiated, moderately differentiated, and well differentiated.
• Age—The age at which the patient was diagnosed with breast cancer. Values range from 22 years to 85+ years.
• Primary site—The place where the cancer cells originated.
Table 2 Comparison of ML algorithms trained using WBCD dataset

References and year | ML algorithms | Purpose | Pre-processing and feature selection | Accuracy (%)
[41] 2021 | SVM, LR, KNN, WKNN, GNB | BC detection | Null values are removed | Weighted KNN-96.7
[42] 2020 | BN, SVM, DT, LR, RF, MLP | Classification of BC | AR | AR-SVM-98.0
[43] 2020 | SVM, NB, KNN, RF, DT, LR | Classification of BC | Null values are removed | SVM-97.07
[44] 2020 | AB.M1, DTa, J-Rip, DT (J48), Lazy (IBK, K-star), LR, MC, RT, MLP, NB, RF | Prediction of BC | – | RT, RF, Lazy (IBK, K-star)-99.14
[45] 2020 | SVM, KNN, RF, LR, ANN (MLP) | Prediction of BC | MV-mean, PCC | MLP-98.57
[46] 2019 | KNN, NB, SVM (RBF) | Classification of BC | IA are deleted | KNN, SVM (RBF)-96.85
[47] 2019 | SVM (SMO, LSVM), ANN (MLP, VP) | BC diagnosis | – | SVM-SMO-96.99
[48] 2018 | KNN, NB | Classification of BC | IA are deleted | KNN-97.51
[49] 2018 | RBF, MLP, PNN, BPN | Classification of BC | TF (TANSIG, LOGSIG, and PURELIN) | BPN-TANSIG-99.0
[50] 2018 | GA-ELM-NN | Classification of BC | Missing records are removed | GA-ELM-NN-97.28
[51] 2018 | NB, BLR, DT (J48, CART) | Prediction of BC | MV-median | DT (CART)-98.13
[52] 2015 | F-RNN | Classification of BC | CSE, RRA, F-RISM | F-RNN-99.71
[53] 2015 | WNB | BC detection | Discarded MV | WNB-98.54
[54] 2015 | ANN, DT, LR, GA | Prediction of BC | IG, GA | GA-98.78
• PR status—Positive if the cancer cells have progesterone receptors, negative otherwise.
• ER status—Positive if the cancer cells have estrogen receptors, negative otherwise.
• Surgery—The type of surgery the patient has undergone. Its categorical values include surgery performed, not performed, and not recommended.
• Radiation—The type of therapy used to treat breast cancer at any stage. In this dataset, its values are yes or no.
Table 3 Comparison of ML algorithms trained using SEER dataset

References and year | ML algorithms | Purpose | Pre-processing and feature selection | Accuracy (%)
[55] 2019 | ANN, LR | Predicting the survival of BC for 1-, 5-, and 10-year | GA, LASSO, RUS, SMOTE | 1-year: LR-LASSO (RUS)-84; 5- and 10-year: ANN-LASSO (SMOTE)-76.1 and 74.5
[56] 2012 | DT (C4.5) | Classification of BC | MV, duplicated, overridden or re-coded records are discarded | DT (C4.5)-93
[57] 2009 | DT (C5.0) | Predicting the survival of BC patients | Mapping relationship, FS-LR-BS | AUC: DT (C5.0)-76.78
[58] 2009 | ANN, BN, Hybrid BN | Predicting the survival of BC patients for 5-year | Statistic summaries | ANN-88.8
[59] 2008 | LR, ANN, NB, BN, DT-NB, DT-ID3, DT-J48 | Predicting the survival of BC patients | Unsuitable records are removed | LR-85.8
[60] 2008 | DT (C4.5), FDT | Predicting the survival of BC patients | MV, redundant, irrelevant, inconsistent data are removed | FDT-85.0
[61] 2006 | NB, ANN-BPN, DT (C4.5) | Predicting the survival of BC patients | Removed missing data | DT (C4.5)-86.7
[62] 2005 | ANN-MLP, LR, DT (C5.0) | Predicting the survival of BC patients | Semantic mapping | DT (C5.0)-93.6
• Race—Specifies the race of the patient (white, black, etc.).
• Tumor size—The size of the tumor. Values range from 0 to 230 mm.
• Sex—The patient's sex, male or female.
• Marital status—The marital status of the patient at the time of diagnosis (single, married, etc.).
• Behavior—The behavior of the tumor: malignant, benign, carcinoma in situ, etc.
• Lymph nodes examined—The number of lymph nodes investigated for diagnosis. Values range from 0 to 54.
• Survival Status—The class label, with values Alive and Dead.
These features are identified by the authors of the papers [55–62] as important predictors of breast cancer survival.
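An illustrative sketch of preparing such predictors for an ML pipeline. SEER data requires registration, so the file name and column names below are assumptions, not the actual SEER export schema:

```python
import pandas as pd

df = pd.read_csv("seer_breast_cancer.csv")  # hypothetical export

predictors = ["Cancer stage", "Grade", "Marital status", "Race",
              "Age", "Tumor size", "Nodes positive"]
# get_dummies one-hot encodes the categorical columns and passes
# numeric columns (Age, Tumor size, Nodes positive) through unchanged.
X = pd.get_dummies(df[predictors])
y = (df["Survival Status"] == "Alive").astype(int)  # binary survival label
```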
3 ML Algorithms for Breast Cancer Classification and Survival Rate Prediction In the literature, many ML algorithms have been designed for the classification of breast cancer types and the prediction of the years of survivability of breast cancer patients. In this section, the ML algorithms [23–62] that work on the three datasets WDBC, WBCD, and SEER are compared based on accuracy and tabulated in order of year of publication, together with the objective of each algorithm and the pre-processing and feature selection techniques applied to the training data, as shown in Tables 1, 2 and 3. The comparison in terms of accuracy identifies the effective ML algorithms in this domain. The accuracy of an ML model is defined as the ratio of true predictions to the total predictions made by the classifier. The accuracy % in the last column of each table is the value of the best ML algorithm as reported by the paper referred to in that row. The expansions of the acronyms in these tables are provided in the Annexure. Table 1 compares, by accuracy, the ML algorithms [23–40] designed for classifying breast cancer types as malignant or benign; the training dataset used by these algorithms is the WDBC dataset. From Table 1, it is inferred that the ML algorithms producing the highest accuracy for this task are RF, LR, DT, SVM, KNN, and MLP; these algorithms achieve the highest accuracy more often than the algorithms with which they are compared. The most frequent pre-processing and feature selection techniques applied to the training data are PCA and normalization. Higher accuracy is obtained from RF, LR, and MLP when normalization and standardization are used as pre-processing techniques. In the rows corresponding to references [33, 35], the papers report F1 and F3 scores instead of accuracy, so the best of those values is given in the table. The F1 score is the harmonic mean of precision and recall, while the F3 score is the more general Fβ measure with β = 3, which weights recall more heavily than precision.
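In symbols (these are the standard definitions of the metrics, not formulas taken from the reviewed papers):

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
F_{\beta} = (1+\beta^{2})\,
  \frac{\mathrm{Precision}\cdot\mathrm{Recall}}
       {\beta^{2}\cdot\mathrm{Precision} + \mathrm{Recall}}
```

where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative counts; F1 and F3 are obtained with β = 1 and β = 3, respectively.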
Table 2 compares, by accuracy, the ML algorithms [41–54] designed for classifying breast cancer types as malignant or benign; the training dataset used by these algorithms is the WBCD dataset. From Table 2, it is inferred that the most used ML algorithms for this task are SVM and KNN, and they also produce the highest accuracy. In most of the papers, missing values are removed before applying the ML algorithms. The highest accuracy is achieved by the F-RNN algorithm [52] when pre-processing and feature selection techniques such as CSE, RRA, and F-RISM are used. Table 3 compares, by accuracy, the ML algorithms [55–62] designed for predicting the survivability of breast cancer patients; the training dataset used by these algorithms is the SEER dataset. From Table 3, it is inferred that the most used ML algorithms for survivability prediction are DT and ANN, and the authors use a variety of pre-processing and feature selection techniques. Among all the ML algorithms examined, DT shows the best accuracy. In the row corresponding to Ref. [57], the paper reports the Area Under the Receiver Operating Characteristic Curve (AUC) instead of accuracy, so the best AUC value is given in its place. The ROC curve is drawn between the True Positive Rate and the False Positive Rate for different threshold values of the classifier, and the AUC measures the classifier's ability to distinguish between the positive and negative classes.
4 Conclusion The paper has identified, from the literature on clinical breast cancer data research, 40 ML algorithms published for the classification of breast cancer types and the prediction of the survivability of breast cancer patients. The three common breast cancer datasets WDBC, WBCD, and SEER used in the training phase of these ML algorithms are briefly described in terms of definitions, commonly used attributes, and attribute value ranges. The ML algorithms identified are tabulated in three tables according to the training dataset, ordered by year of publication, together with the accuracy of the best-performing ML algorithm in each paper; the pre-processing techniques and the purpose of each ML algorithm are also given. This paper would thus help prospective researchers identify the different aspects of research in the breast cancer domain in terms of ML algorithms and attributes for the classification of breast cancer types and the prediction of patient survivability. The work can be extended by reviewing similar papers published on different datasets.
Annexure: Expansion of Acronyms Mentioned in Sect. 3
AB-Ada Boost
LR-Logistic Regression
ANN-Artificial Neural Networks
LSVM-Library for SVM
AR-Association Rules
LV-Low Variance
BC-Breast Cancer
MC-Multiclass Classifier
BF-Best First
MLP-Multilayer Perceptron
BLR-Bayesian Logistic Regression
MMN-Min–Max Normalization
BN-Bayes Net
MO-Model Overfitting
BPN-Back Propagation Network
MV-Missing Values
BS-Backward Selection
NB-Naive Bayes
B-SVM-Binary SVM
NCA-Neighborhood Component Analysis
CART-Classification and Regression Tree
NN-Neural Network
CFS-Correlation-based Feature Selection
PCA-Principal Component Analysis
CSE-Consistency-based Subset Evaluation
PCC-Pearson Correlation Coefficient
DF-Discretization Filters
PNN-Probabilistic Neural Network
DT-Decision Tree
PO-Parameter Optimization
DTa-Decision Table
PSO-Particle Swarm Optimization
ELM-Extreme Learning Machine
PURELIN-Linear Transfer function
FE-Feature Extraction
QK-Quadratic Kernel
FDT-Fuzzy Decision Tree
RBF-Radial Basis Function
F-RISM-Fuzzy Rough Instance selection method
RDF-Random Decision Forest
F-RNN-Fuzzy Rough Nearest Neighbor
RDT-Random Decision Tree
FS-Feature Selection
RF-Random Forest
GA-Genetic Algorithm
RFE-Recursive Feature Elimination
GB-Gradient Boosting
RRA-Re-Ranking Algorithm
GNB-Gaussian Naive Bayes
RS-Random Sampling
GP-Genetic Programming
RT-Random Tree
GR-Gain Ratio
RUS-Random Under Sampling
GRS-Greedy Stepwise
SE-Standard Error
GS-Genetic Search
SEv-Subset Evaluation
HV-Hard Voting
SGD-Stochastic Gradient Descent
ID3-Iterative Dichotomiser 3
SMO-Sequential Minimal Optimization
IA-Irrelevant Attribute
SMOTE-Synthetic Minority Over-sampling Technique
ICA-Independent Component Analysis
SS-Standard Scaler
IG-Information Gain
SV-Soft Voting
IGAE-Information Gain Attribute Evaluation
SVM-Support Vector Machine
J-Rip-J-Repeated Incremental Pruning
TANSIG-Hyperbolic Tangent Sigmoid
KNN-K-Nearest Neighbor
TF-Transfer Function
K-SVM-Kernel SVM
UFS-Univariate Feature Selection
LASSO-Least Absolute Shrinkage and Selection Operator
VP-Voted Perceptron
Lazy-IBK-Instance-Based K-NN
WBFS-Wrapper-Based Feature Selection
Lazy K-star-KNN Star
WKNN-Weighted KNN
LDA-Linear Discriminant Analysis
WNB-Weighted Naive Bayes
LOGSIG-Log Sigmoid Activation Function
WSE-Wrapper Subset Evaluation
XGB-Extreme Gradient Boosting
References 1. Akram M, Iqbal M, Daniyal M, Khan AU (2017) Awareness and current knowledge of breast cancer. Biol Res 50(1):1–23 2. Kaur A, Kumari C, Bass S (2021) Breast cancer: the role of herbal medication. Modern Phytomorphol 15:6–75 3. Rajaguru H, Prabhakar SK (2017) Bayesian linear discriminant analysis for breast cancer classification. In: 2017 2nd international conference on communication and electronics systems (ICCES). IEEE, pp 266–269 4. Sweetlin EJ, Ponraj DN (2021) Comparative performance analysis of various classifiers on a breast cancer clinical dataset. In: Intelligence in big data technologies-beyond the hype. Springer Singapore, pp 509–516 5. Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan Y (2020) Optimizing the performance of breast cancer classification by employing the same domain transfer learning from hybrid deep convolutional neural network model. Electronics 9(3):445 6. Fondon I, Sarmiento A, Garcia AI, Silvestre M, Eloy C, Polonia A, Aguiar P (2018) Automatic classification of tissue malignancy for breast carcinoma diagnosis. Comput Biol Med 96:41–51 7. Sweetlin EJ, Saudia S (2021) Exploratory data analysis on breast cancer dataset about survivability and recurrence. In: 3rd international conference on signal processing and communication (ICPSC). IEEE, pp 304–308 8. Yue W, Wang Z, Chen H, Payne A, Liu X (2018) Machine learning with applications in breast cancer diagnosis and prognosis. Designs 2(2):13 9. Li Y, Chen Z (2018) Performance evaluation of machine learning methods for breast cancer prediction. Appl Comput Math 7(4):212–216 10. Kalafi EY, Nor NAM, Taib NA, Ganggayah MD, Town C, Dhillon SK (2019) Machine learning and deep learning approaches in breast cancer survival prediction using clinical data. Folia Biol 65(5/6):212–220 11. Salehi M, Razmara J, Lotfi S (2020) A novel data mining on breast cancer survivability using MLP ensemble learners. Comput J 63(3):435–447 12. Eltalhi S, Kutrani (2019) Breast cancer diagnosis and prediction using machine learning and data mining techniques: a review. IOSR J Dental Med Sci 18(4):85–94 13. Prastyo PH, Paramartha IGY, Pakpahan MSM, Ardiyanto I (2020) Predicting breast cancer: a comparative analysis of machine learning algorithms. Proc Int Conf Sci Eng 3:455–459 14. Ivaturi A, Singh A, Gunanvitha B, Chethan KS (2020) Soft classification techniques for breast cancer detection and classification. In: 2020 international conference on intelligent engineering and management (ICIEM). IEEE, pp 437–442 15. Gupta M, Gupta B (2018) A comparative study of breast cancer diagnosis using supervised machine learning techniques. In 2018 second international conference on computing methodologies and communication (ICCMC). IEEE, pp 997–1002 16. Wu X, Khorshidi HA, Aickelin U, Edib Z, Peate M (2019) Imputation techniques on missing values in breast cancer treatment and fertility data. Health Inf Sci Syst 7(1):1–8
17. UC Irvine Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets. Last Accessed 19 Oct 2021 18. SEER Homepage. https://seer.cancer.gov/data-software. Last Accessed 19 Oct 2021 19. Wolberg WH, Street WN, Heisey DM, Mangasarian OL (1995) Computerized breast cancer diagnosis and prognosis from fine-needle aspirates. Arch Surg 130(5):511–516 20. Street WN, Wolberg WH, Mangasarian OL (1993) Nuclear feature extraction for breast tumor diagnosis. In: Biomedical image processing and biomedical visualization, international society for optics and photonics, vol 1905, pp 861–870 21. Guo H, Zhang Q, Nandi KA (2008) Breast cancer detection using genetic programming. In: Proceedings of the first international conference on bio-inspired systems and signal processing, pp 334–341 22. Higa A (2018) Diagnosis of breast cancer using decision tree and artificial neural network algorithms. Int J Comput Appl Technol Res 7(1):23–27 23. Solanki YS, Chakrabarti P, Jasinski M, Leonowicz Z, Bolshev V, Vinogradov A, Jasinska E, Gono R, Nami M (2021) A hybrid supervised machine learning classifier system for breast cancer prognosis using feature selection and data imbalance handling approaches. Electronics 10(6):699 24. Prabadevi B, Deepa N, Krithika LB, Vinod V (2020) Analysis of machine learning algorithms on cancer dataset. In: 2020 international conference on emerging trends in information technology and engineering (ic-ETITE), pp 1–10 25. Kaklamanis MM, Filippakis ME, Touloupos M, Christodoulou K (2019) An experimental comparison of machine learning classification algorithms for breast cancer diagnosis. In: European, mediterranean, and middle eastern conference on information systems. Springer, Cham, pp 18–30 26. Ray S, AlGhamdi A, Alshouiliy K, Agrawal DP (2020) Selecting features for breast cancer analysis and prediction. In: 2020 international conference on advances in computing and communication engineering (ICACCE). IEEE, pp 1–6 27. Parhusip HA, Susanto B, Linawati L, Trihandaru S, Sardjono Y, Mugirahayu AS (2020) Classification breast cancer revisited with machine learning. Int J Data Sci 1(1):42–50 28. Balaraman S (2020) Comparison of classification models for breast cancer identification using Google Colab 29. Laghmati S, Cherradi B, Tmiri A, Daanouni O, Hamida S (2020) Classification of patients with breast cancer using neighbourhood component analysis and supervised machine learning techniques. In: 2020 3rd international conference on advanced communication technologies and networking (CommNet). IEEE, pp 1–6 30. Al Bataineh A (2019) A comparative analysis of nonlinear machine learning algorithms for breast cancer detection. Int J Mach Learn Comput 9(3) 31. Rajamohana SP, Umamaheswari K, Karunya K, Deepika R (2019) Analysis of classification algorithms for breast cancer prediction. In: Data management, analytics and innovation, advances in intelligent systems and computing. Springer 32. Sathiyanarayanan P, Pavithra S, Saranya MS, Makeswari M (2019) Identification of breast cancer using the decision tree algorithm. In: 2019 IEEE international conference on system, computation, automation and networking (ICSCAN). IEEE, pp 1–6 33. Dhahri H, Al Maghayreh E, Mahmood A, Elkilani W, Faisal Nagi M (2019) Automated breast cancer diagnosis based on machine learning algorithms. J Healthc Eng 1–11 34. Omondiagbe DA, Veeramani S, Sidhu AS (2019) Machine learning classification techniques for breast cancer diagnosis. In: IOP conference series: materials science and engineering 495 35. 
Assiri AS, Nazir S, Velastin SA (2020) Breast tumor classification using an ensemble machine learning method. J Imag 6(6):39 36. Unal HT, Basciftci F (2019) An empirical comparison of machine learning algorithms for predicting breast cancer. Bilge Int J Sci Technol Res 3(Special Issue):9–20 37. Al-Shargabi B, Al-Shami F (2019) An experimental study for breast cancer prediction algorithm. In: E-Learning and information systems (Data’19), association for computing machinery, Article 12, pp 1–6
38. Saygili A (2018) Classification and diagnostic prediction of breast cancers via different classifiers. Int Sci Vocat Stud J 2(2):48–56 39. Zheng B, Yoon SW, Lam SS (2014) Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Syst Appl 41(4):1476–1482 40. Mert A, Kilic N, Akan A (2011) Breast cancer classification by using support vector machines with reduced dimension. In: Proceedings ELMAR. IEEE, pp 37–40 41. Khorshid SF, Abdulazeez AM, Sallow AB (2021) A comparative analysis and predicting for breast cancer detection based on data mining models. Asian J Res Comput Sci 45–59 42. Ed-daoudy A, Maalmi K (2020) Breast cancer classification with reduced feature set using association rules and support vector machine. Netw Modell Anal Health Inf Bioinf 9:1–10 43. Shamrat FJM, Raihan MA, Rahman AS, Mahmud I, Akter R (2020) An analysis on breast disease prediction using machine learning approaches. Int J Sci Technol Res 9(02):2450–2455 44. Kumar V, Mishra BK, Mazzara M, Thanh DNH, Verma A (2020) Prediction of malignant and benign breast cancer: a data mining approach in healthcare applications. In: Advances in data science and management, Lecture Notes on Data Engineering and Communications Technologies. Springer 45. Islam MM, Haque MR, Iqbal H, Hasan MM, Hasan M, Kabir MN (2020) Breast cancer prediction: a comparative study using machine learning techniques. SN Comput Sci 1(5):1–14 46. Akbugday B (2019) Classification of breast cancer data using machine learning algorithms. In: Medical Technologies Congress (TIPTEKNO). IEEE, pp 1–4 47. Bayrak EA, Kirci P, Ensari T (2019) Comparison of machine learning methods for breast cancer diagnosis. In: 2019 scientific meeting on electrical-electronics and biomedical engineering and computer science (EBBT), pp 1–3 48. Amrane M, Oukid S, Gagaoua I, Ensari T (2018) Breast cancer classification using machine learning. In: Electric electronics, computer science, biomedical engineering’s meeting (EBBT) 49. Osmanovic A, Halilovic S, Abdel Ilah L, Fojnica A, Gromilic Z (2018) Machine learning techniques for classification of breast cancer. In: World congress on medical physics and biomedical engineering, IFMBE proceedings. Springer 50. Nemissi M, Salah H, Seridi H (2018) Breast cancer diagnosis using an enhanced extreme learning machine based-neural network. In: 2018 international conference on signal, image, vision and their applications. IEEE, pp 1–4 51. Singh SN, Thakral S (2018) Using data mining tools for breast cancer prediction and analysis. In: 2018 4th international conference on computing communication and automation (ICCCA). IEEE, pp 1–4 52. Onan A (2015) A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer. Expert Syst Appl 42(20):6844–6852 53. Karabatak M (2015) A new classifier for breast cancer detection based on Naive Bayesian. Measurement 72:32–36 54. Liou DM, Chang WP (2015) Applying data mining for the analysis of breast cancer data. In: Data mining in clinical medicine, pp 175–189 55. Simsek S, Kursuncu U, Kibis E, AnisAbdellatif M, Dag A (2019) A hybrid data mining approach for identifying the temporal effects of variables associated with breast cancer survival. In: Expert systems with applications 56. Rajesh K, Anand S (2012) Analysis of SEER dataset for breast cancer diagnosis using C4.5 classification algorithm. Int J Adv Res Comput Commun Eng 1(2) 57. 
Liu Y-Q, Wang C, Zhang L (2009) Decision tree based predictive models for breast cancer survivability on imbalanced data. In: 2009 3rd international conference on bioinformatics and biomedical engineering 58. Choi JP, Han TH, Park RW (2009) A hybrid Bayesian network model for predicting breast cancer prognosis. J Korean Soc Med Inf 15(1):49 59. Endo A, Shibata T, Tanaka H (2008) Comparison of seven algorithms to predict breast cancer survival. Biomed Soft Comput Human Sci 13:11–16
60. Umer Khan M, Pill Choi J, Shin H, Kim M (2008) Predicting breast cancer survivability using fuzzy decision trees for personalized healthcare. In: 30th annual international conference of the IEEE engineering in medicine and biology society 61. Bellachia A, Guvan E (2006) Predicting breast cancer survivability using data mining techniques. In: Scientific data mining workshop, in conjunction with the 2006 SIAM conference on data mining 62. Delen D, Walker G, Kadam A (2005) Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med 34(2):113–127
The Online Behaviour of the Algerian Abusers in Social Media Networks Kheireddine Abainia
Abstract Connecting to social media networks has become a daily task for the majority of people around the world, and the amount of shared information is growing exponentially. Thus, controlling the way in which people communicate is necessary, in order to protect them from disorientation, conflicts, aggression, etc. In this paper, we conduct a statistical study of cyber-bullying and abusive content in social media (i.e. Facebook), in which we try to characterize the online behaviour of abusers in the Algerian community. More specifically, we involved 200 Facebook users from different regions, drawn from a pool of 600, to carry out this study. The aim of this investigation is to help automatic abuse detection systems take decisions by incorporating the online activity. Abuse detection systems require a large amount of data to perform well on this kind of text (i.e. unstructured and informal texts), owing to the lack of a standard orthography and the variety of Algerian dialects and languages spoken.
Abusive content Cyber-bullying Algerian
1 Introduction The emergence of social media networks has substantially changed people’s life, where these kinds of websites became the parallel life of some people around the world. In Algeria, Facebook social media is the leading and the most visited social media website, in which people share news and moments, and express ideas as well. Accordingly, it becomes one of the main sources of information, news, education and culture [1]. As an outcome, some ill-intentioned people take advantage to
K. Abainia (&) PIMIS Laboratory, Department of Electronics and Telecommunications, Université 8 Mai 1945 Guelma, 24000 Guelma, Algeria e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_52
As an outcome, some ill-intentioned people take advantage of this to broadcast misinformation with different malicious intents (e.g. threatening governmental officers, celebrities, etc.). Although social media networks have a positive influence on people's lives, they may also have negative outcomes such as cyber-criminality. Indeed, it was reported by the Dailymail (http://www.dailymail.co.uk) that the British police receive one Facebook cyber-crime report every 40 min, and 1,200 cyber-crimes were logged in 2011 [1]. Several kinds of cyber-crime may lead to suicide or killing, such as fraud, cyber-bullying, stalking, robbery, identity theft, defamation and harassment. In this investigation, we conduct a statistical study to understand the behaviour of Algerian online abusers (i.e. cyber-bullies). The purpose of this study is to spot abuser accounts from their activities and interactions, in order to help automatic abuse detection systems make decisions. Abusive content may take different forms, such as hate speech, harassment, sexism and profanity. Golbeck et al. [9] defined five categories of harassment when they created a large corpus, namely the very worst, threats, hate speech, direct harassment and potentially offensive. In this work, we only address profanity (unreadable comments); harassment is not the purpose of this work. We particularly focus on the Algerian online community because, on the one hand, Algeria is rich in linguistic varieties, i.e. several Arabic dialects and languages. On the other hand, dialectal Arabic lacks a standard orthography and can be written in Arabic script or in Latin script (Arabizi). Generally, social media users simulate the phonetic pronunciation of words in their writing; for instance, we may find consecutive repetitions of the same letter in emotional passages, like the use of "bzzzzzzzzf" instead of "bzf" (meaning "a lot"). Besides the lack of standard orthography, we find the code-switching phenomenon, which consists of mixing two or more languages in the same sentence or conversation [11]. It usually occurs among multi-lingual speakers, depending on the situation and the topic [12]. Hence, in the Algerian community, we spot different scenarios of code-switching such as Arabic-French, Arabic-English, Arabic-French-English, Arabic-Berber, Berber-French, Berber-English, Berber-French-English and Arabic-Berber-French-English (very rare). Most works on abusive content detection use machine learning or deep learning tools trained on a dataset, or use linguistic features like a lexicon of abusive words. However, these methods may have drawbacks in the case of Algerian dialectal Arabic, because the task requires huge training data covering the different writing possibilities, and a large lexicon with different patterns of abusive words and sentences. To overcome these issues, we have conducted a statistical study to spot the behaviour of Algerian online abusers, so that we could predict potential user profiles that post abusive comments in social media networks. The ground truth of this study is based on several Facebook user profiles, where 200 different users from different regions, out of 600 Algerian abusers, were involved so that their activities could be studied.
1 http://www.dailymail.co.uk.
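The letter elongations mentioned above (e.g. “bzzzzzzzzf” for “bzf”) can be normalized before any lexicon lookup or training step. The following minimal Python sketch is our illustration, not part of the original study; note that it is an aggressive normalization, since legitimately doubled letters would also be collapsed:

```python
import re

def collapse_repeats(token: str, keep: int = 1) -> str:
    """Collapse runs of the same character, e.g. 'bzzzzzzzzf' -> 'bzf'.

    keep controls how many consecutive copies survive; keeping one copy
    maps elongated Arabizi tokens onto a canonical spelling.
    """
    return re.sub(r"(.)\1+", lambda m: m.group(1) * keep, token)

print(collapse_repeats("bzzzzzzzzf"))  # -> bzf
```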
2 Related Work

In this section, we highlight some research works carried out on abusive and harassing content, covering related works from both the sociological and the computational linguistics viewpoints. Al Omoush et al. [3] studied the impact of Arab cultural values on Facebook, where they investigated the motivation, attitudes, usage patterns and continuity of membership values in social networks. The study revealed that social media networks break down restriction barriers for young Arab people, who face a lot of cultural, social, religious, moral and political restrictions [3]. Awan studied online hate speech against the Islamic community in the United Kingdom [6]. The author analysed 500 tweets written by 100 different users, where the statistics showed that 72% of the tweets were posted by males. The author categorized offensive users into several categories, among which the reactive and accessory categories reported the highest counts (95 and 82 cases, respectively). Reactive users are persons who follow major incidents and take the opportunity, while the accessory category represents persons who join a hate conversation to target vulnerable people [6]. Later, the author studied the same phenomenon on Facebook by analysing 100 different Facebook pages [7]. The analysis found that 494 comments contained hate speech against the Muslim community in the UK, where 80% of the comments were written by males. In addition, the author categorized the offenders into five different categories, i.e. opportunistic, deceptive, fantasist, producer and distributor. Across the two studies, the offenders overall used the same keywords and hashtags, such as Muzrats, Paedo, Paki and ISIS [6, 7]. Lowry et al. [15] studied adults' cyber-bullying and its reasons, whereas most research works had focused on adolescents. The authors notably proposed an improved social media cyber-bullying model, incorporating the anonymity concept with several features in the learning process of cyber-bullying. The abusers' content in community question answering was studied in [13], where the authors focused on the flagging of inappropriate content by analysing 10 million flags related to 1.5 million users. The analysis found that most of the flags were correct, and that deviant users receiving more flags get more replies than ordinary users. Arafa and Senosy studied cyber-bullying patterns among 6,740 Egyptian university students in Beni-Suef [5]. The questionnaire responses showed that 79.8% of females receive harassment, while 51.8% of males receive flaming content. Moreover, anger, hatred and sorrow are feelings common to most of the victims. The data analysis showed that students from rural areas and medicine students are less exposed to cyber-bullying (they are polite and respect moral values). Conversely, students living in urban areas and sociology students are more exposed to cyber-bullying. A similar study has been conducted on a group of
Saudi students, where 287 students were selected as a case study among 300 questionnaire respondents [4]. Most cyber-bullying studies have addressed adolescents and teenagers. For instance, a group of 114 Arab teenagers living in Israel was selected to study cyber-bullying and its relationship with emotional aspects [10]. The study reported that 80% of the students are subject to different forms of cyber-bullying, such as spreading offensive rumours, harassment, humiliation regarding physical appearance, and sending sexual content. In addition, the study found that cyber-bullying victimization leads to loneliness and anxiety. Another study, comparing the cyber-bullying of Jewish and Arab adolescents in Israel, was conducted in [14]. The analysis showed that Jewish students use the Internet more frequently than Arab students, and consequently the latter are less exposed to cyber-criminality. Moreover, Arab females are more bullied than males, in contrast to Jewish students (for whom there is no difference between genders). In contrast to the above studies, Triantoro studied the cyber-bullying phenomenon in a group of 150 high school students in Jogjakarta [17]; this study showed that 60% of the students are not victims of cyber-bullying. Tahamtan and Huang conducted a statistical study on cyber-bullying prevention, for which they collected over 6,000 tweets (Tahamtan and Huang 2019). The authors concluded that parents and teachers should be trained to recognize the different aspects of cyber-bullying and how to prevent it in schools. Finally, Almenayes studied the relationship between cyber-bullying and depression, focusing particularly on gender and age [2]. The statistical study showed that females are more likely than males to suffer depression, while age is not a good predictor of depression.
3 Algerian Social Media Users

The emergence of social media networks in Algeria has considerably increased the code-switching phenomenon, which now also occurs among monolingual and illiterate people, because the use of foreign languages (e.g. French) is seen as a prestigious and elegant way to communicate [1]. Thus, monolingual people may acquire some French words from the spoken language and employ them within a sentence, as in “frr tjr bien et en bon senté” (the correct sentence is “frr tjr bien et en bonne santé”). Illustrating this expansion of code-switching, some ATM companies use Arabic-French code-switching to promote offers (Fig. 1).
Avantage IMTIYAZ 12Go INTERNET+HADRA wa SMS ILLIMITES li Djezzy+1300DA li echabakat e
Fig. 1 Offer promotion by an Algerian ATM company. The bold words are French words
3.1 Online Social Behaviour
In our previous work [1], we noticed that several Algerian users often use pseudonyms reflecting their personalities or imaginations. For instance, a user who feels resilient and strong may choose the pseudonym “Gladiator”, reflecting a strong personality. In addition, some users write their surnames or pseudonyms using accented characters (e.g. Ỡ, Ữ, Ἇ, Ợ, Ἧ, etc.) not belonging to the French character set, which contains 26 letters and a few accented vowels (é, è, à, ï and î). Several social media users spread fake information about themselves for various reasons (e.g. keeping anonymity). For instance, some Algerian users list a European country as their country of residence because they want to migrate there, and they constantly publish and share statuses and images about that country. Furthermore, some Algerian geeks list renowned universities and institutions (e.g. MIT) as their places of study [1]. Finally, we assume that there are two kinds of social media users: naive and rational. Naive users are generally illiterate and narrow-minded; they keep the same writing style and the same spoken language, and they believe everything. Conversely, rational users change their writing style depending on the context, do not believe everything, and are more polite and less aggressive.
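The study later records, for each abuser profile, whether the account name contains characters outside this French set. A minimal sketch of such a check (our illustration, not the authors' code; the tolerated separators are an assumption):

```python
# French letters plus the accented vowels mentioned above.
FRENCH_LETTERS = set("abcdefghijklmnopqrstuvwxyzéèàïî")
TOLERATED = set(" -'")  # separators commonly found in names

def has_special_chars(account_name: str) -> bool:
    """Flag names containing characters outside the French character set,
    as recorded for each abuser profile in this study."""
    return any(ch.lower() not in FRENCH_LETTERS | TOLERATED
               for ch in account_name)

print(has_special_chars("Mohamed Lakikza"))  # False
print(has_special_chars("Ỡmar"))             # True
```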
3.2 Writing Behaviour
The Algerian community code-switches frequently on social media (Fig. 2), especially between Arabic and French, because French is the second language. According to our previous analysis conducted on the DZDC12 corpus (2,400 texts written by males and females), females code-switch more frequently than males across all 12 cities, especially in Algiers, Blida, Tipaza, Annaba, Skikda and Oran, where the difference is clear [1]. In addition, it was noticed that misspelling errors are more commonly made by males, although males use fewer abbreviations. Algerian social media users write in Arabic script and Arabizi2 interchangeably, and the latter has an irregular orthography. In southern cities, however, the community most commonly uses the Arabic script.
2 Arabic words written in Latin script.
Fig. 2 Percentage of French words used by both genders across different cities. Black peaks are male percentages and gray peaks are female percentages
Figure 3 reports statistics about the word lengths used by both genders. Surprisingly, the two genders produce the same distribution, where the peaks drop to the ground after 13 letters. Four-letter and five-letter words produce the highest peaks, and single-letter words are also used in Algerian Arabizi [1]. In the latter case, some two-letter and three-letter words are often abbreviated to a single letter (e.g. “fi” or “fe”, meaning “in”, are abbreviated to “f”).
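A distribution such as the one in Fig. 3 can be reproduced from any tokenized corpus; the following is a minimal sketch (our illustration; loading DZDC12 itself is assumed to be done elsewhere):

```python
from collections import Counter

def word_length_distribution(comments):
    """Relative frequency of word lengths over a list of comments
    (tokens split on whitespace), as plotted in Fig. 3."""
    counts = Counter(len(token) for text in comments for token in text.split())
    total = sum(counts.values())
    return {length: count / total for length, count in sorted(counts.items())}

# Toy usage; the study uses the 2,400 Facebook comments of DZDC12.
print(word_length_distribution(["bzf f dar", "tjr bien hbibi"]))
```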
4 Algerian Online Abusers Behaviour

To conduct this study, we manually crawled 600 abusive Facebook comments, each written by a distinct user (i.e. 600 Facebook users); further comments written by already-included users were ignored. The comments were collected from public pages and groups related to different topics such as news, anti-system and politics, advertising, humour, music and rap, volunteering, stories and diverse subjects. Among the 600 comments, all were written by males except one, written by a female. From the collected data, we randomly selected 200 users to analyse their profiles and gather their information. For each, we saved the abuser's account name (and the profile link), the post text, the illustrative media (photo or video), the page category, the post subject (politics, news, joking, etc.), the post abusiveness (yes or no), the presence of a hashtag in the post (yes or no), the abusive comment, the abuser's agreement with the post (support or against) and the number of reactions to the abusive comment. In addition, we recorded whether the abuser's account name contains special characters (yes or no), the abuser's gender (male or female), the abuser's account activities (publication categories), the average number of reactions to the abuser's publications, the average number of comments on the abuser's publications, visited local regions (yes or no) and, finally, visited countries (yes or no).
Fig. 3 Distribution of word lengths over 2,400 Facebook comments in the DZDC12 corpus
4.1 Statistics
Among the 200 involved users, 43% use their real names and the remainder use pseudonyms. For the latter, we noticed surnames combined with nicknames (e.g. “Mohamed Lakikza”, “Samir Babelwad”, etc.) and pseudonyms reflecting celebrities and imaginations (e.g. “Prince Charmant”). Moreover, among the 43% of users with real names, three use special characters, whereas 29 of the remaining 57% (pseudonyms) do. Of the 32 users using special characters (both categories), 18 share selfies, 11 share sports images and news, and 11 share heartbroken statuses and images (Fig. 4). As depicted in Fig. 5, most of the abusive comments are found on anti-system and diverse pages (94 and 71 comments, respectively). The subject of the second category (i.e. diverse) is joking or politics, where the former may lead to sarcasm and abusiveness. Since the end of 2018, the Algerian political situation has been unstable, and most Algerian people protest against the governmental staff in the streets and on social media. Sharing political subjects stirs hate and anger, which may provoke the posting of abusive content. On the other hand, most of the Algerian community is religiously conservative, and seeing content contrary to their beliefs triggers anger and hate. For example, a video showing Algerian women dancing in the street provokes a hateful reaction, because according to Algerian beliefs women should not be exposed without a veil and should not perform immodest acts like dancing outside. Analysing the abuser profiles and their activities shows that, overall, they do not receive much feedback or many reactions from their connected peers, except for a few. In particular, 24.5% of the abusers receive more reactions (between 70 and 300), while the majority of the others receive between 0 and 10 reactions for their posts. Among the 24.5%, some receive up to 300 comments (they share selfies alone or with friends), and the majority receive between 10 and 20
Fig. 4 Statistics about the abusers using special characters in their names/pseudonyms (y-axis: number of users)
Fig. 5 Statistics about the page categories (y-axis: number of comments)
comments. Conversely, the users who do not receive many reactions (75.5%) also received almost no comments (0 comments in nearly all cases). Figure 6 depicts the number of abusive comments per post category, where a comment may belong to two subjects. From Fig. 6, we notice that political and joking subjects attract abusiveness because, as stated above, most Algerian people are against the current government. Moreover, educational and religious subjects also attract abusiveness. Algerian people are religiously conservative and attached to the principles of religion (i.e. Islam) and education; insulting their religion makes them angry and leads to abusiveness. Likewise, insulting moral principles or the quality of education may provoke negative (abusive) reactions. Finally, it is clear that joking subjects (e.g. sarcasm) raise abusiveness.
Fig. 6 Number of abusive comments per post category
4.2 Abusers' Activities on Their Accounts
Figure 7 depicts the number of abusers per activity, where the same abuser may engage in multiple activities (e.g. selfies, sadness, etc.). From the figure, most of the abusers share selfies (alone or with friends and family) and posts about football clubs and athletes. Starcevic et al. [16] studied the medical terminology of selfie addiction on social media, defining the phenomenon as a mental disorder consisting of an obsessive need to post selfies, which requires professional treatment. We assume that such abusers may have a mental illness and try to draw the attention of their peers. De Choudhury et al. [8] addressed depression prediction in social media, using a clinical survey model via Amazon Mechanical Turk. Among the negative emotions related to depression, depressed users share feelings of worthlessness, guilt, helplessness and self-hatred. On the other hand, we noticed another form of sadness, which consists of sharing religious posts, because conservative Islamic people cannot post profanity and abusiveness (prohibited by their beliefs). Overall, such users do not share joking, sports or political posts. Conversely, the abusers in our dataset share religious posts combined with political, joking, sports, sad and violent posts. This kind of abuser uses direct harassment to abuse others, and we classify these comments as “the very worst” according to [9]. It is worth mentioning that sports posts generally concern football (national and international clubs), which is considered the favourite sport of the Algerian community (and of the Arab world overall). Most frequently, the users share selfies taken in stadiums and pictures of their favourite players and clubs (e.g. Real Madrid, Barcelona). Algerian football supporters (especially those who go to the stadium) typically pick up abusive words and language, because abusive chanting is common in stadiums.
Fig. 7 Number of abusers per activity
We have noticed that almost all users sharing sports posts receive more reactions, i.e. up to 20 comments and 200 likes. Users sharing travel posts also receive more reactions, i.e. up to 15 comments and 100 likes. For the other activities, however, the number of reactions varies from one user to another. Thus, we could not draw a general conclusion, because the number of reactions depends on the user's ties and engagement.
4.3 Writing Style
Overall, the language script (Arabic or Latin) of the abusive comments is independent of the script of the admin's post. However, we noticed that most abusers write in Arabic script when replying to political subjects. Moreover, users whose names are written in Arabic script always write in that script. Beyond this, we cannot predict the language script from other information such as the pseudonym/name or the use of special characters in names. Among the 200 involved abusers, 78.5% are against the admin's posts, 15% agree and the remainder are neutral. Most of the comments against the posts directly target the admin when the latter is commenting on something (especially a political situation); when the post shares news, they instead target the subject (i.e. the person mentioned in the post content). From Fig. 8, most of the abusers write short comments (below eight words), and some write only two or three (abusive) words. Due to the irregular orthography of Arabizi and dialectal Arabic, each word obviously has various written variants. However, we can group the words around three base words, i.e. nik (meaning “fuck”), zebi (meaning “penis”) and ya3ti (meaning “prostitute”). In addition, some users sometimes obfuscate indecent
Fig. 8 Number of abusers per user’s activity
words by replacing some characters with others. For instance, the word “zebi” (زبي) becomes “shepi” (شبي), where the letter “z” is obfuscated with “sh” (pronounced as in “shop”). In another example, the word “tnaket” (تناكت) is sometimes obfuscated as “tnakhet” (تناخت). It is worth mentioning that abusers sometimes use classical Arabic to write an abuse, in which case it can be detected from the general context. For example, the second entry in Table 1 could be written as “يضربولو المؤخرة”, where the word “ترمة” is indecent and the word “المؤخرة” is familiar and more polite, but the sentence remains abusive. We notice from Table 1 that the 4th, 5th and 9th entries are the same sentence written differently; the same holds for the 3rd, 6th, 10th and 14th entries.
Table 1 Examples of some indecent words retrieved from our dataset

N°   Word (Arabic script or Arabizi)   Buckwalter transliteration
1    الكرفة                            Alkrfp
2    يضربولو ترمه                       yDrbwlw trmh
3    Nike moke                         –
4    hetchoun yemak                    –
5    hatchoun yamk                     –
6    nik moook                         –
7    zabo                              –
8    yeniko                            –
9    7atchoun yamakk                   –
10   nikamok                           –
11   9ahbon                            –
12   ازبي                              Azby
13   نتمنيكو                           ntmnykw
14   نيكو ماتكم                         nykw mAtkm
15   ولاد لقحاب                         wlAd lqHAb
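Because each abusive word admits many spelling variants (e.g. entries 4, 5 and 9 above), lexicon lookups benefit from fuzzy matching. The following is our minimal illustration using Python's standard difflib, not the authors' tooling; the seed lexicon is hypothetical, and character-level obfuscations (such as z → sh) would need an additional substitution table:

```python
import difflib

# Hypothetical seed lexicon built from the base words identified above.
SEEDS = ["nik", "zebi", "ya3ti", "hetchoun yemak"]

def closest_seed(variant: str, cutoff: float = 0.6):
    """Map a spelling variant to the closest seed word, or None.

    The similarity ratio tolerates the vowel swaps and doubled letters
    typical of Arabizi (e.g. entries 4, 5 and 9 of Table 1).
    """
    match = difflib.get_close_matches(variant.lower(), SEEDS, n=1, cutoff=cutoff)
    return match[0] if match else None

print(closest_seed("7atchoun yamakk"))  # -> hetchoun yemak
print(closest_seed("saha khouya"))      # -> None (benign comment)
```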
Finally, from the collected data, it is noticeable that the abusers use the same writing style and the same words when they interact with others. For instance, when a friend shares a selfie, the abuser generally writes the same comments: “rak 4444 hbibi” (meaning “you are the best, buddy”), “tjrs 4 frr” (meaning “my brother, you are always the best”), “dima zine hbibi” (meaning “buddy, you are always handsome”), etc.
5 Conclusion

In this investigation, we have addressed the problem of online cyber-bullying by focusing on Algerian online abusers. In particular, we have highlighted the social behaviour of the Algerian online community, as well as the abusers' writing style and online activities. The ground truth of this study is based on 200 abusive comments (containing profanity) written by 200 different users from different Algerian regions. In total, we manually collected 600 comments written by different Algerian Facebook users, ignoring further comments written by already-included users. Among the 600 comments, we arbitrarily selected 200 to inspect the user profiles and gather the related information. We found that males wrote all the abusive comments, except one comment written by a female. The statistics showed that 57% of the abusers use pseudonyms instead of real names, and 16% use special characters not belonging to the French character set. It was also noticed that political subjects in most cases involve cyber-bullying and profanity, because the Algerian community is against the current governmental staff. In addition, most abusers are addicted to publishing selfies and to sharing posts about sports, politics and negative feelings (e.g. sadness, heartbreak, etc.). Thus, the collected data suggests that most abusers are depressive, and we should be aware of this kind of user. Because of the irregular orthography of Arabizi and dialectal Arabic, building a lexicon of abusive words covering the different possibilities is tricky and requires further investigation. As a perspective for future work, we would like to explore the writing variation across different Algerian regions and build a lexicon for abusive content.
References

1. Abainia K (2020) DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus. Lang Resour Eval 54:419–455
2. Almenayes J (2017) The relationship between cyberbullying victimization and depression: the moderating effects of gender and age. J Social Netw 6(3):215–223
3. Al Omoush KS, Yaseen SG, Alma'Aitah MA (2012) The impact of Arab cultural values on online social networking: the case of Facebook. J Comput Human Behav 28(6):2387–2399
4. Al-Zahrani AM (2015) Cyberbullying among Saudi's higher-education students: implications for educators and policymakers. World J Educ 5(3):15–26
5. Arafa A, Senosy S (2017) Pattern and correlates of cyberbullying victimization among Egyptian university students in Beni-Suef, Egypt. J Egypt Public Health Assoc 92(2):107–115
6. Awan I (2014) Islamophobia and Twitter: a typology of online hate against muslims on social media. Policy Internet 6(2):133–150
7. Awan I (2016) Islamophobia on social media: a qualitative analysis of the Facebook's walls of hate. Int J Cyber Criminol 10(1):1–20
8. De Choudhury M, Gamon M, Counts S, Horvitz E (2013) Predicting depression via social media. In: Proceedings of the 7th international AAAI conference on weblogs and social media
9. Golbeck J, Ashktorab Z, Banjo RO, Berlinger A, Bhagwan S, Buntain C, Cheakalos P, Geller AA, Gergory Q, Gnanasekaran RK, Gunasekaran RR, Hoffman KM, Hottle J, Jienjitlert V, Khare S, Lau R, Martindale MJ, Naik S, Nixon HL, Ramachandran P, Rogers KM, Rogers L, Sarin MS, Shahane G, Thanki J, Vengataraman P, Wan Z, Wu DM (2017) A large labeled corpus for online harassment research. In: Proceedings of the 2017 ACM on web science conference, pp 229–233
10. Heiman T, Olenik-Shemesh D (2016) Computer-based communication and cyberbullying involvement in the sample of Arab teenagers. Educ Inf Technol 21(5):1183–1196
11. Joshi AK (1982) Processing of sentences with intra-sentential code-switching. In: Proceedings of the 9th conference on computational linguistics, 5–10 July, Prague, Czechoslovakia, pp 145–150
12. Kachru BB (1977) Code-switching as a communicative strategy in India. In: Saville-Troike M (ed) Linguistics and anthropology. Georgetown University Round Table on Languages and Linguistics. Georgetown University Press, Washington D.C.
13. Kayes I, Kourtellis N, Quercia D, Iamnitchi A, Bonchi F (2015) The social world of content abusers in community question answering. In: Proceedings of the 24th international conference on world wide web, international world wide web conferences steering committee, May 2015, pp 570–580
14. Lapidot-Lefler N, Hosri H (2016) Cyberbullying in a diverse society: comparing Jewish and Arab adolescents in Israel through the lenses of individualistic versus collectivist cultures. J Soc Psychol Educ 19(3):569–585
15. Lowry PB, Zhang J, Wang C, Siponen M (2016) Why do adults engage in cyberbullying on social media? An integration of online disinhibition and deindividuation effects with the social structure and social learning model. Inf Syst Res 27(4):962–986
16. Starcevic V, Billieux J, Schimmenti A (2018) Selfitis, selfie addiction, twitteritis: irresistible appeal of medical terminology for problematic behaviours in the digital age. Aust N Z J Psychiatry 52(5):408–409
17. Triantoro S (2015) Are daily spiritual experiences, self-esteem, and family harmony predictors of cyberbullying among high school student. Int J Res Stud Psychol 4(3):23–33
Interactive Attention AI to Translate Low-Light Photos to Captions for Night Scene Understanding in Women Safety A. Rajagopal, V. Nirmala, and Arun Muthuraj Vedamanickam
Abstract There is amazing progress in deep learning-based models for image captioning and low-light image enhancement. For the first time in the literature, this paper develops a deep learning model that translates night scenes to sentences, opening new possibilities for AI applications in the safety of visually impaired women. Inspired by image captioning and visual question answering, a novel ‘Interactive Image Captioning’ is developed. A user can make the AI focus on any chosen person of interest by influencing the attention scoring. Attention context vectors are computed from CNN feature vectors and user-provided start words. The encoder–attention–decoder neural network learns to produce captions from low-brightness images. This paper demonstrates how women safety can be enabled by researching a novel AI capability in the interactive vision–language model for perception of the environment at night. Keywords Attention neural networks · Vision–Language model · Neural machine translation · Scene understanding · Explainable AI
A. Rajagopal Indian Institute of Technology, Madras, India e-mail: [email protected] V. Nirmala (B) PG and Research Department of Physics, Queen Mary’s College, Chennai, Tamilnadu, India e-mail: [email protected]; [email protected] A. M. Vedamanickam National Institute of Technology, Tiruchirapalli, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_53
1 Introduction

1.1 Need and Significance

Research on scene understanding at night is important for developing deep learning applications that assist the visually impaired and promote women safety. In a scenario where a visually impaired girl strays into a vulnerable street at night, a real-time audio description of the scene from her smartphone camera can help her perceive the scene. Further, in this scenario, she can ask the AI to focus its attention on a salient object, motivating the novel idea of ‘AI-based Interactive Captioning of night scenes with a user-guided attention mechanism’. From the need to develop girl safety apps arises the need for research on vision–language models that can perceive night scenes with on-device ML inference on smartphones or wearables. The AI has to be functional on a visually impaired user's smartphone even when there is no internet. Also, the on-device ML inference has to be fast enough for real-time interaction with the user. This puts a bound on the size of the neural network, hence the need for research. The multimodal embedding approach of the multi-layer transformer model OSCAR is a powerful idea in vision–language modelling [1]. It helps achieve state-of-the-art performance, but on-device ML inference for such gigantic models is challenging. Hence, there is a need to explore smaller models.
1.2 Contributions

Three areas contributed in this paper are: (1) To the best of our knowledge, this is the first work to demonstrate the ability of a neural network to perceive night scenes and translate low-brightness images to text. (2) We propose ‘Interactive Captioning’, where the user can induce the AI to focus its attention on a user-specified object. (3) We apply the proposed neural network to the safety of women and blind people. For the first time in the literature, this paper proposes and implements a novel scene understanding AI with the following unique contributions. (1) An AI's ability to see and understand photos taken in low light or at night has applications in girl safety use cases. Hence, the first contribution is developing a deep learning-based vision–language model to understand night scenes, i.e. photos taken in low-light settings. Specifically, this paper develops and demonstrates a deep learning model that can translate any low-brightness image into a sentence. The proposed model architecture is shown in contrast with a brute-force approach in Fig. 2. A brute-force approach is a pipeline of first enhancing the image,
and then making inferences with an image captioning network. In this paper, we use end-to-end deep learning to train an encoder–attention–decoder on low-brightness images. Once the trained model is developed, we experiment with randomly downloaded images from the web. For example, a random nighttime image from the internet is translated into text as shown in Fig. 2. These experiments demonstrate the ability of this neural network model to describe any night scene, in spite of the low brightness of the image. (2) Attention modelling [2] is a powerful paradigm in deep learning and has been the foundation for much of the recent progress in the field. Transformer-based approaches [1, 3] are entirely based on attention modelling. In the ‘Attention is all you need’ paper, Google Brain argued that modelling attention lets neural networks learn in a way similar to how humans focus their attention on a particular salient object. While the usual norm is to let the image captioning neural network generate the entire sentence, this paper develops a new variation, shown in Fig. 1C, in which users can provide a start word and thereby inductively change the attention scores. Experiments show that the image captioning network can then focus its attention on a particular object chosen by the user and base the sentence on that salient object. This is the first time in the literature that interactive captioning for night scene comprehension is demonstrated. This new concept is shown in Fig. 3.
1.3 Novelty

The research gap is articulated in Fig. 2. There is amazing progress in two tasks:
• Image captioning and visual question answering.
• Low-light image enhancement and super-resolution.
Inspired by these advances, this paper develops a novel idea and demonstrates it. The novelty aspects are illustrated in Fig. 1:
A. Image captioning (established)
B. Low-light image captioning (proposed)
C. Interactive captioning (proposed)
D. Visual question answering (established).
While image captioning and visual question answering are well established in the literature, this paper contributes two novel areas: one is image captioning of photos when the illumination is low; the other is a variant of the concept of Visual Question Answering (VQA). There are three aspects of novelty in this work:
1. Novel deep learning modelling (Fig. 1)
2. The idea of interactive captioning (Figs. 6 and 7)
3. A novel application for girl safety (Figs. 4 and 6).
Fig. 1 Key contributions and novelty
1.4 Research Gap and Related Work

Since this paper lies at the intersection of low-light image enhancement and vision–language interaction modelling [1], this section presents research in both areas. While there is significant progress in each area, there are few publications at their intersection, which is where this paper sits. The literature on both the image captioning [4] and image enhancement [5] tasks can be generalised and represented by a generic architecture pattern, as illustrated in Fig. 2. The literature study shows that, while there are significant results in both image enhancement [5] and captioning [4–9], the research gap is to combine the two tasks into a single network for efficient on-device ML. Two different
Fig. 2 Research gap and architecture choices to fill the gap
approaches for modelling image captioning of night scenes are illustrated in Fig. 2. The two approaches are:
1. Brute-force approach: Combining the advancements in both image enhancement and image captioning can be achieved by a simple pipeline architecture such as the one shown in Fig. 2A. However, the real-time computation cost for on-device ML could be high with such a pipeline-based approach.
Fig. 3 Women safety applications of the proposed concept
Fig. 4 People with vision impairments can use the proposed AI
2. Proposed approach: Following the trend of end-to-end deep learning-based training, this paper takes the approach shown in Fig. 2B. Here, images with low brightness are directly used as input to the neural network.
Deep learning-based image enhancement is used in de-noising autoencoders, super-resolution and low-light image enhancement. At the heart of image enhancement [10], a UNet encoder–decoder-style architecture pattern is utilized to transform the image from one representation to another, as shown in Fig. 2. There have been significant advancements in image and video super-resolution, clearly indicating the potential of CNNs for learning patterns and transforming them. Skip connections along with CNN layers transform different features at various levels. The UNet-inspired architecture is well established for image-to-image translation [11]. Deep learning combining vision and language has seen tremendous progress, especially with the latest achievement of human-level parity in image captioning tasks. The multi-layer transformer architecture [12, 13] of OSCAR, which operates on multimodal embeddings, has shown a promising new direction [1]. However, on-device ML [14] requires a much smaller network given resource-constrained wearables and smartphone devices; hence the need for research on compact versions of vision–language modelling. The success in image captioning arises from three factors:
1. Multimodal embedding [15] rather than simply concatenating visual and word embeddings.
2. Attention context vectors [16] that compute the attention to various salient objects in the image.
3. Identifying the pairing of classes found in an image with words in the caption [17].
The various architecture patterns in the vision–language modelling literature are summarized in Fig. 5. To generalize, most vision–language translation tasks, such as image captioning, can be performed by the following approaches.
1. Encoder–decoder architecture pattern (2014): Here, the visual features are encoded by a CNN and the resulting thought vector is decoded by an RNN. The classic paper is ‘Show and Tell’ by Vinyals et al. [18]. The pattern is depicted in Fig. 5A; refer to Fig. 5B for this neural image captioning.
2. Encoder–attention–decoder (since 2015): Since the advent of sequence-to-sequence neural machine translation, a popular architectural choice has been to apply different forms of attention; significant progress in such translation tasks was possible due to attention. This architecture pattern is shown in Fig. 5C. The ‘Show, Attend and Tell’ paper [19] demonstrated attention-based image captioning.
Fig. 5 Architecture generalized from literature
3. Visual feature extractor in the architecture (2015–2021): The use of an ImageNet pre-trained feature extractor in image captioning networks has found significant adoption to date. The most recent choice has been ResNet, though Inception is used in many cases. The pre-trained Inception first extracts visual features, which are then input to the encoder [1].
Fig. 6 Literature’s first work on interactive image captioning for night scene understanding
4. Region feature extractor in the architecture (since 2016): The use of object detectors in the R-CNN framework has added localization data in addition to visual features. This is also used in OSCAR, as shown in Fig. 5G, H.
5. Transformer-based approach (2019–2021): With the advent of multi-head attention, the use of transformers for encoding and decoding has become popular in language modelling. For example, OSCAR uses transformers both for modelling and for text embedding using pre-trained BERT language models (Ref. Fig. 5E, H).
6. Multimodal embedding and transformer (2020–2021): OSCAR [1] owes its success to the idea of multimodal embedding and multi-layered multi-head attention (Ref. Fig. 5H).
Fig. 7 Neural network architecture
2 AI Architecture, Methods and Results

2.1 AI for Describing Night Scenes for Visually Impaired Users

This paper assumes significance because it opens doors to life-saving applications for women and the visually impaired community [20]. If one stands in the shoes of a visually impaired woman, the importance of developing this capability can be soundly defended. The significance of this work is in enabling safety use cases for visually impaired users (as illustrated in Figs. 4 and 6).
2.2 Novel Result: This Neural Network Translates Low-Brightness Images to Sentences!

The low illumination at night, combined with the exposure limits of cameras in consumer devices, means the quality of photos may not be clearly comprehensible to human eyes. Yet, can machines comprehend such photos taken in low light? The AI proposed in this paper is shown to translate night scenes into sentences that describe the scene. The experimental results show that this trained model can easily provide a caption for any randomly downloaded image from the internet. A screenshot of this capability is shown in Fig. 7, and the capability to translate a random night scene into a sentence is also demonstrated online. The proposed neural network architecture is inspired by the encoder–attention–decoder architecture established in the ‘Show, Attend and Tell’ paper. However, this feat of captioning night scenes is made possible by two novel ideas introduced in this paper. First, the deep learning model is trained on a modified MS-COCO dataset. Second, interactive user input directly influences the attention score computations. These two ideas are elaborated in later sections.
2.3 Dataset

To address the unavailability of a dataset for training image captioning in night environments, the paper synthetically creates a dataset from MS-COCO [17] by adjusting the brightness of all its images. The new dataset thus consists of modified MS-COCO images along with the corresponding captions provided by MS-COCO. The dataset used for training comprises pairs of {low-brightness image, five textual captions}. The modified dataset used in training the neural network is described in Fig. 8.
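As a concrete illustration, darkening MS-COCO images can be done with a standard image library. The following minimal Python sketch is our illustration; the exact darkening factor and folder names used by the authors are assumptions:

```python
from pathlib import Path
from PIL import Image, ImageEnhance

def darken_images(src_dir: str, dst_dir: str, factor: float = 0.2) -> None:
    """Write low-brightness copies of every JPEG in src_dir.

    factor < 1 darkens the image (0.2 keeps 20% of the original
    brightness); the MS-COCO captions are reused unchanged.
    """
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        image = Image.open(path).convert("RGB")
        ImageEnhance.Brightness(image).enhance(factor).save(out / path.name)

darken_images("train2014", "train2014_low_brightness")
```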
2.4 Novelty in AI Inference for Interactive Caption Generation

While the classical concept of deep learning-based automated caption generation is shown in Fig. 1A, the proposed concept of interactive caption generation is shown in Fig. 1C. A detailed illustration of the user experience for visually impaired women is given in Fig. 6. In a scenario where the user wishes to ‘instruct’ the AI to focus its attention on a particular object of interest, she can simply ask the AI to include the user-specified object in the caption to be generated. This is in contrast with the VQA model (Fig. 1D): VQA and the proposed concept require different dataset structures (refer to Table 1).
Fig. 8 Dataset preparation
Table 1 Different tasks in vision–language and their dataset requirements

                        VQA                           Interactive caption
Training dataset        X: Image, Question            X: Low-brightness image
                        Y: Answer                     Y: Sentence
AI app inference I/O    Inputs: Image, Question       Inputs: Low-light photo, word
                        Output: Predicted answer      Output: Sentence completion
2.5 HMI in Attention Model for Interactive Caption Generation

In addition to proposing and developing the deep learning model for translating night imagery into textual sentences, the paper explores a novel idea at the intersection of Human–Machine Interaction (HMI) and on-device ML inference for the image captioning use case. The architecture for injecting user input into the attention computation is presented in Fig. 9. As seen in this architecture, when the user input is combined with the visual features, the attention context vector focuses on the salient object provided in the user input. Thus, the attention scoring is influenced
by the human input, allowing for human-in-the-loop intervention in the attention-scoring-based caption generation mechanism. The result of this human input can be seen in Fig. 6, where the screenshot shows that the generated output depends on the input word provided by the user. The architecture consists of visual feature extraction using the convolutional layers of an ImageNet pre-trained Inception model, followed by the computation of attention alignments from a combination of the user input word and the visual features. Using an attention computation such as dot-product attention or Bahdanau attention, the attention context vector c is computed from the learnt attention weights a and the visual embeddings h, i.e. c_t = Σ_i a_{t,i} h_i. The attention RNN then uses this context to generate the text description. The novelty of this architecture lies in injecting the user-typed word into the attention computation. The experiments show promising results, as shown in the screenshots in Fig. 6 and the online demo at https://sites.google.com/view/lowlightimagecaption.
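The following minimal TensorFlow/Keras sketch shows one way a user word embedding can be concatenated with the decoder state before Bahdanau-style scoring, so that the softmax weights a_{t,i} shift toward image regions matching the user's interest. This is our illustration under stated assumptions; layer sizes, tensor shapes and names are hypothetical, as the paper's exact configuration lives in its open-source release:

```python
import tensorflow as tf

class UserGuidedBahdanauAttention(tf.keras.layers.Layer):
    """Bahdanau attention whose scores are conditioned on the decoder
    state concatenated with the embedding of a user-typed start word."""

    def __init__(self, units: int):
        super().__init__()
        self.W_feat = tf.keras.layers.Dense(units)   # projects CNN features h_i
        self.W_query = tf.keras.layers.Dense(units)  # projects [state; user word]
        self.V = tf.keras.layers.Dense(1)            # scalar score per region

    def call(self, features, hidden_state, user_word_emb):
        # features: (batch, regions, feat_dim) from Inception;
        # hidden_state: (batch, units); user_word_emb: (batch, emb_dim).
        query = tf.concat([hidden_state, user_word_emb], axis=-1)
        query = tf.expand_dims(query, 1)                      # (batch, 1, ·)
        scores = self.V(tf.nn.tanh(self.W_feat(features) + self.W_query(query)))
        weights = tf.nn.softmax(scores, axis=1)               # a_{t,i}
        context = tf.reduce_sum(weights * features, axis=1)   # c_t = sum a_i h_i
        return context, weights
```

The design choice here mirrors the paper's Fig. 9: the user word enters only the score computation, leaving the rest of the encoder–attention–decoder unchanged.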
Fig. 9 Human-in-the-loop attention decoder for interactive captioning of night images
Fig. 10 Comparison of three different models trained in different environments
2.6 Feasibility Towards Enabling Women Safety Apps

The feasibility of developing women safety apps is demonstrated in this paper. The potential of employing a deep learning-based approach to build vision–language models capable of comprehending night scenes is established experimentally. Early results show that the night scene captioning network offers a level of accuracy/loss similar to that of a captioning network trained on MS-COCO. The experimental loss results for three different networks are shown in Fig. 10; the loss is within a reasonable range (less than 5% difference) compared to the network trained on MS-COCO. Though this is a first baby step in the direction of safety applications for visually impaired women (Fig. 11), much more research is required to strengthen this idea, and there is a plethora of opportunities for the research community to explore more advanced AI ideas for enabling women safety applications [20]. This paper makes the source code available in open source at https://sites.google.com/view/lowlightimagecaption.
3 Conclusions

The paper identified a unique challenge in developing vision–language modelling for the understanding of low-light night scenes. We motivated the need for deep learning-based interactive image captioning models that can understand low-light scenes: smartphone AI-based night scene perception is a crucial capability for personal safety apps for visually impaired women and girls. Based on a literature review, we identified the opportunity to contribute at the intersection of two areas, namely image enhancement and image captioning. Though each of these topics has witnessed amazing progress, there are few publications at the intersection of image captioning and low-light image enhancement. From the literature review, the generic architecture pattern for image captioning is identified as an encoder–decoder with an attention mechanism at its heart, with recent focus on multi-head attention and multimodal embeddings to combine the visual and language modalities. Similarly, a UNet-style encoder–decoder architecture is found at the heart of image enhancement and super-resolution architectures. Attention-based weighting of visual features in an encoder–decoder architecture, together with multimodal embeddings, has shown promise in vision–language modelling tasks such as image captioning. Since girl safety applications require on-device ML on resource-constrained smartphones for real-time
Fig. 11 Attention visualization
AI, the choice of neural network architecture requires careful investigation. Though transformer-based multimodal embedding architectures offer state-of-the-art captioning, distillation is needed to make them affordable on smartphones, given the computation and memory requirements of multi-layer transformer-inspired models such as OSCAR. Hence the need to explore vision–language models that understand night scenes and are efficient enough for on-device ML, to enable the deployment of women safety AI on mainstream smartphones. To the best of our knowledge, some of the contributions in this paper are near-firsts in the literature. Key contributions are: (1) The paper proposed and demonstrated a deep neural network capable of translating low-brightness images into sentences that describe the night scene. Experiments on the trained model show this AI is able to caption any low-light image randomly downloaded from the internet. While further research is required, this paper establishes the potential of deep neural network-based vision–language modelling for on-device ML for night scene perception challenges. Thus, the paper establishes the potential of AI for the safety of women and visually impaired users, especially at night. (2) A new concept of interactive image captioning was explored and demonstrated. This concept allows a usage model similar to Visual Question Answering
(VQA). In VQA, the user inputs both an image and a question. The proposed interactive captioning AI achieves a similar user experience, where the user inputs both an image and a start word. This paper experimentally demonstrates this concept by integrating human interaction into the attention score layer of the neural network. Thus, when the user wishes to focus on a particular object of interest in the scene, she can prompt the attention mechanism to consider her wish during caption generation. This interesting human-in-the-loop approach in the attention decoder allows the user to influence the attention scoring weights. (3) To enable reproducibility of the results, the paper contributes the source code for both contributed ideas in open source. The source code is made available at https://sites.google.com/view/lowlightimagecaption. This paper also provides a detailed illustration of the neural network architecture, to support in-depth exploration by interested researchers. The potential of deep learning to understand night scenes was demonstrated in this work, opening the door to new possibilities for applications that promote the safety of visually impaired women. Using attention modelling, the paper developed an encoder–attention–decoder-based model that learns to interactively caption images of low brightness. Thus, an interactive image captioning AI for explainable night scene understanding is demonstrated experimentally. To encourage further research in vision–language modelling for promoting the safety of women, the source code is made available in open source.
References

1. Li X et al (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. arXiv:2004.06165 [cs], July. https://arxiv.org/abs/2004.06165. Accessed 17 Oct 2021
2. Anderson P et al (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, June. https://doi.org/10.1109/cvpr.2018.00636
3. Wang Z et al (2021) SimVLM: simple visual language model pretraining with weak supervision. arXiv preprint. arXiv:2108.10904
4. Aneja J, Deshpande A, Schwing AG (2018) Convolutional image captioning. In: 2018 IEEE/CVF conference on computer vision and pattern recognition, June. https://doi.org/10.1109/cvpr.2018.00583
5. Shi W et al (2021) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. arXiv:1609.05158 [cs, stat], September. https://arxiv.org/abs/1609.05158v2. Accessed 17 Oct 2021
6. Biswas R, Barz M, Sonntag D (2020) Towards explanatory interactive image captioning using top-down and bottom-up features, beam search and re-ranking. KI - Künstliche Intelligenz 34(4):571–584. https://doi.org/10.1007/s13218-020-00679-2
7. Rane C et al (2021) Image captioning based smart navigation system for visually impaired. In: 2021 international conference on communication information and computing technology (ICCICT). IEEE
8. Gurari D et al (2020) Captioning images taken by people who are blind. In: European conference on computer vision. Springer, Cham
9. Stefanini M, Cornia M, Baraldi L, Cascianelli S, Fiameni G, Cucchiara R (2021) From show to tell: a survey on image captioning. arXiv preprint. arXiv:2107.06912
10. Li C, Guo C, Chen CL (2021) Learning to enhance low-light image via zero-reference deep curve estimation. In: IEEE transactions on pattern analysis and machine intelligence, pp 1–1. https://doi.org/10.1109/tpami.2021.3063604
11. Li C et al (2021) Low-light image and video enhancement using deep learning: a survey. arXiv:2104.10729 [cs], June. https://arxiv.org/abs/2104.10729. Accessed 17 Oct 2021
12. Vaswani A et al (2017) Attention is all you need. https://arxiv.org/abs/1706.03762
13. Huang L et al (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision
14. Dhar S et al (2021) A survey of on-device machine learning: an algorithms and learning theory perspective. ACM Trans Internet of Things 2(3):1–49
15. Jia Z, Li X (2020) iCap: interactive image captioning with predictive text. In: Proceedings of the 2020 international conference on multimedia retrieval
16. Bahuleyan H et al (2017) Variational attention for sequence-to-sequence models. arXiv preprint. arXiv:1712.08207
17. Lin T-Y et al (2014) Microsoft COCO: common objects in context. In: Computer vision—ECCV 2014, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
18. Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator. https://arxiv.org/abs/1411.4555
19. Xu K et al (2015) Show, attend and tell: neural image caption generation with visual attention. http://proceedings.mlr.press/v37/xuc15.pdf. Accessed 16 Nov 2019
20. Qi Y et al (2020) Object-and-action aware model for visual language navigation. In: Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, Proceedings, Part X 16. Springer International Publishing
AI Visualization in Nanoscale Microscopy A. Rajagopal, V. Nirmala, J. Andrew, and Arun Muthuraj Vedamanickam
Abstract Artificial Intelligence (AI) and nanotechnology are promising areas for the future of humanity. While deep learning-based computer vision has found applications in many fields, from medicine to automotive, its application in nanotechnology can open doors for new scientific discoveries. Can we apply AI to explore objects that our eyes can't see, such as nanoscale-sized objects? An AI platform to visualize nanoscale patterns learnt by a deep learning neural network can open new frontiers for nanotechnology. The objective of this paper is to develop a deep learning-based visualization system for images of nanomaterials obtained by a scanning electron microscope (SEM). This paper contributes an AI platform to enable any nanoscience researcher to use AI in the visual exploration of nanoscale morphologies of nanomaterials. This AI is developed by a technique of visualizing intermediate activations of a Convolutional AutoEncoder (CAE). In this method, a nanoscale specimen image is transformed into its feature representations by a Convolutional Neural Network (CNN). The convolutional autoencoder is trained on the 100% SEM dataset from NFFA-EUROPE, and then CNN visualization is applied. This AI generates various conceptual feature representations of the nanomaterial. While deep learning-based image classification of SEM images is widely published in the literature, there are not many publications that have visualized deep neural networks of nanomaterials. This is significant for gaining insights from the learnings extracted by machine learning. This paper unlocks the potential of applying deep learning-based visualization on electron
A. Rajagopal Indian Institute of Technology, Madras, Tamil Nadu, India e-mail: [email protected] V. Nirmala (B) PG and Research Department of Physics, Queen Mary's College, Chennai 600004, Tamil Nadu, India e-mail: [email protected]; [email protected] J. Andrew Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India e-mail: [email protected] A. M. Vedamanickam National Institute of Technology, Tiruchirapalli, Tamil Nadu, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_54
microscopy to offer AI-extracted features and architectural patterns of various nanomaterials. This is a contribution to explainable AI for nanoscale objects, and to learning from otherwise black-box neural networks. This paper contributes an open-source AI with reproducible results at https://sites.google.com/view/aifornanotechnology. Keywords Explainable AI · Deep learning in microscopy · Convolutional AutoEncoder · CNN visualization · Nanomaterials
1 Introduction

1.1 The Opportunity: Applications of AI to New Fields

Artificial Intelligence (AI) and nanotechnology are transforming science and technology. The potential to apply AI is immense and rapidly progressing across many fields, but this potential is not yet widely exploited in nanoscience. An editorial in the reputed journal Nature Methods highlighted the potential of deep learning in microscopy [1]. Deep learning computer vision approaches have established a broad set of applications in many industries, from healthcare to self-driving cars. There is an untapped opportunity for the application of AI to sub-microscopic resolution images and nanoscale objects. To unlock this potential, this paper explores AI visualization techniques on nanoscale objects.
1.2 Research Gap

Deep learning in microscopy has significant potential, as per the editorial of Nature Methods (Editorial, [5]). As per ‘methods to watch’ in Nature Methods [14], an astonishing use of deep learning is not image analysis but “image transformation” [9]. As per a March 2021 topical review of deep learning in microscopy [6], the current literature is limited to tasks such as image classification, image segmentation and image reconstruction; tasks such as visualization of the CNN feature maps of nanomaterials are not well published. This paper contributes to filling this gap. To the best of our knowledge, this paper is the first in the literature to develop an AI visualization toolkit for nanomaterials. Specifically, this is the first work to apply the deep learning visualization techniques presented by Zeiler and Fergus [16] and [4] to nanoscale materials. While the CNN feature visualization proposed by Zeiler and Fergus [16] is widely popular in the machine learning literature, its application to nanomaterial datasets has not been well published so far.
1.3 Literature Review and Novelty

There is amazing progress in deep learning on microscopy of biological images obtained with an electron microscope. As per a recent article in Nature Communications, researchers have developed an AI toolkit for nanoscale bio-images [15]. Neural network-based morphological analysis of nanometre-scale objects is less established, as noted in a Nature Communications article [13]; this paper contributes to this aspect. While user-friendly AI tools have been developed and reported in reputed Nature journals by Berg et al. [2] and [15], these target nanoscale images of biological organisms. This paper also contributes an AI tool, but its distinguishing feature is CNN visualization across many nanomaterial classes such as MEMS devices, tips, patterned surfaces, particles and fibers. A Convolutional Neural Network (CNN)-based image classification task is reported in Nature Scientific Reports [8]. While Modarres et al. [8] developed image classification with a “black box” CNN classifier, this paper utilizes the same dataset but performs a different task: CNN-based visualization to understand what the deep learning “black box” is learning. As per a 2019 editorial (Editorial, [5]), understanding the deep learning “black box” is an active area of research. The dataset utilized in this paper is the 100% SEM dataset, a publicly available Scanning Electron Microscopy (SEM) dataset published in Nature Scientific Data by the NFFA-EUROPE project [1]. This paper assumes significance in the context of applying explainable AI to advance nanotechnology. A recent paper at the International Conference on Learning Representations (ICLR 2021) discusses why CNN feature visualizations are valuable techniques for explainable AI [3]. Many researchers support the view that feature maps such as the one presented in Fig. 2 are meaningful [11]. The “AI microscope” developed by the leading AI research lab OpenAI [12] at https://microscope.openai.com/models demonstrates the potential of explainable AI, and this paper extends this idea to nanomaterials. While electron microscopy has advanced nanotechnology research, the capability to view nanomaterials through an “AI lens” (illustrated in Fig. 1) is pioneered by this research paper. While Scanning Electron Microscopy (SEM) opened the doors for us to see previously unseen structures, this paper proposes an “AI lens” that opens the doors for the community to see previously unseen abstract structural features of nanomaterials.
1.4 Contributions The significance of the contributed Open-Access AI-based visualization toolkit, NanoAID, lies in democratizing access to deep learning for the benefit of the research community. All results are reproducible online at https://sites.google.com/view/aifornanotechnology, as this paper contributes this AI in open source.
Fig. 1 NanoAID.1.IP module offers an AI-based nanoscale imaging method. This method uses gradient ascent in input space to visualize intermediate activations of the filters of a convolutional autoencoder. The intuition of the proposed "AI lens" idea for intelligent imaging is illustrated in (G), in comparison with classical scanning electron microscopy (F). By visualizing intermediate activations of a convolutional autoencoder (P)-(Q)-(R), it is possible to re-represent a nanoscale specimen such as (A) into feature representations such as (B), (C), (D)
This paper contributes in the following ways:
1. The AI equips nanoscientists to study nanostructures as CNN filters transform an SEM image of a nanomaterial into its feature representation (example in Fig. 1). An intuitive concept of an "AI lens" is proposed in Fig. 1 and demonstrated in Fig. 3.
2. In addition, this AI equips the community to visualize the CNN-extracted architectural characteristics of a class of nanomaterial (example in Fig. 2).
3. Based on the "AI lens", applications such as counting the number of repeating structures in patterned nanomaterials (example in Fig. 5) can be further extended
Fig. 2 NanoAID.2 module explores the potential of AI to visualize patterns a neural network has learnt that are common to thousands of SEM images of a particular class of nanomaterial. By visualizing the patterns a convolutional filter responds to maximally, this AI demonstrates that it is possible to gain an understanding of the common features of a class of nanomaterials. The top section of the picture shows the general idea of using CNNs to learn common patterns. The middle section shows the output of NanoAID. The bottom section describes the neural network architecture used to generate the output. The output was generated in two steps. In Step 1, a CNN classifier (R) is trained on the SEM dataset as shown in (P). In Step 2, convolutional filters are visualized as shown in (Q)
Fig. 3 The intuition of “AI lens” is illustrated. A TensorFlow computation graph consisting of CNN layers and pooling/upsampling layers of a CAE is presented on the right side of this figure. By visualizing the output of intermediate ConvNet layers, the input SEM image is transformed into its feature representations, shown in green color images on the left
as future work. For example, features of an MEMS specimen can be highlighted by CNN filter computation as shown in Fig. 6.
2 Methods and Results 2.1 Open-Source AI Enabling Discoveries in the World of the Nanoscale Deep learning is well established at representing objects at human scale, such as the everyday objects found in the ImageNet dataset. Is it possible to extend this success of deep learning methods to explore nanoscale objects? Nanoscale structures with a length scale applicable to nanotechnology are usually cited as 1–100 nm; one nanometer is one-billionth of a meter, and a sheet of paper is about 100,000 nm thick. The goal of this paper is to enable every scientist to apply AI visualization to advance the nanoscience domain and to help future applications of AI in nanotechnology succeed. With this design goal, we created the nanoscientist's AI Discovery services, NanoAID, and contribute this effort as an Open-Source AI platform. All results are reproducible online and shared in the form of Google Colab URLs for ease of reproduction.
2.2 Seeing a Nanomaterial Through an "AI Lens" Result: This technique enables a researcher to view any given SEM image through the eyes of a neural network, specifically through the feature maps of an autoencoder. The autoencoder learns the most essential features of a nanomaterial because its training objective is to minimize the reconstruction loss; thus, by visualizing the intermediate activations of a convolutional autoencoder, it is possible to highlight the essential features of a nanomaterial specimen. As illustrated in Fig. 1, this method transforms any SEM image (A) into the feature maps shown in (B), (C), and (D). The transformed image is then visualized (R) and presented to the user. Essentially, the encoder network performs information distillation, re-representing the input image (A) from new perspectives (B), (C), and (D). Given that AI-based visualization is the primary output, the key results of this paper are in visual format; hence, the figures are best viewed in digital format. Method: In this method, a convolutional autoencoder is first trained on the 100% SEM dataset [1] to minimize the pixel-wise Mean Square Error (MSE) reconstruction loss between the input image (X) and the reconstructed image (Y) in Fig. 1. An appropriate bottleneck size for the latent vector (Z) was chosen to keep the reconstruction loss
at a reasonable level, verified by observing the reconstructed image. The encoder part (P) of the autoencoder is then extracted and saved as a neural network model. The User Interface (UI) at the URL allows the researcher to select the number of encoder layers used to transform an input image. Based on the UI selection, a neural network model is dynamically created with the selected depth using the TensorFlow/Keras Functional API [4]. Using this dynamically created model, the input SEM image is transformed through its successive activations. This CNN visualization method was introduced by Zeiler and Fergus [16] and implemented in the Keras deep learning framework by Chollet [4]. Researchers are beginning to apply this CNN visualization method elsewhere; for example, it was utilized for exploring an MRI dataset in Nature Scientific Reports [10].
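As a concrete illustration, the following is a minimal TensorFlow/Keras sketch of this pipeline; the layer widths, bottleneck size, and depth indexing are illustrative assumptions rather than the exact NanoAID configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoencoder(img_size=256):
    inp = layers.Input(shape=(img_size, img_size, 1))
    # Encoder (P): stacked convolution + downsampling blocks
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    z = layers.Conv2D(16, 3, activation="relu", padding="same", name="latent")(x)  # bottleneck (Z)
    # Decoder: upsample back to the input resolution (Y)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(z)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)
    out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)
    return Model(inp, out, name="cae")

cae = build_autoencoder()
cae.compile(optimizer="adam", loss="mse")  # pixel-wise MSE reconstruction loss
# cae.fit(sem_images, sem_images, epochs=50)  # sem_images: normalized SEM data

def encoder_view(model, depth):
    """Dynamically build a truncated model (Functional API) that outputs
    the activations at layer `depth`, as selected in the UI."""
    return Model(model.input, model.layers[depth].output)

# feature_maps = encoder_view(cae, depth=3).predict(sem_image_batch)
```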
2.3 Visualizing Nanoscale Patterns Learnt by CNN Result: As articulated in Table 1, one of the results is an AI platform for seeing commonly occurring patterns in a class of nanomaterials. A NanoAID AI module allows nanotechnologists to extract common structural patterns of MEMS and nanoparticles by coherent analysis of thousands of SEM images. This is demonstrated in Fig. 2. The problem was approached by training a convolutional neural network to learn the common features and then visualizing its ConvNet layers to inspect what was learnt. This AI equips nanoscientists to comprehend structural patterns that are typical of a class of nanomaterial. This enables explainable classification, in contrast to the black-box approach used in the Nature journal paper [8]. The NanoAID website https://sites.google.com/view/aifornanotechnology/ displays the patterns learnt by the AI. This has the potential to open up the science of understanding nanoscale morphology. With reference to Fig. 2, (A) and (B) show the patterns of nanoparticles and MEMS, respectively.

Table 1 Summary of results

| Results | Description | Reproducible results |
|---|---|---|
| (1) Contributed an AI platform for the benefit of the community | NanoAID opens doors to the opportunity to apply CNN visualization on nanomaterials | NanoAID website |
| (2) Demonstrated a novel "AI lens" imaging technique in SEM by applying AI | Nanomaterials are seen through CNN filters to visualize features (Figs. 1 and 3) | URL |
| (3) Explored machine-learnt patterns in nanomaterials | Visualize common patterns of nanomaterials by learning patterns from thousands of SEM images (Figs. 2 and 4) | URL |
| (4) Future applications | Patterns discovered by AI in MEMS and patterned surfaces (Figs. 5 and 6) | URL |
Fig. 4 Methods of training before visualization of architectural patterns. The paper experimented with three different methods as explained in this figure. The patterns learnt from ImageNet were retained by the transfer learning method (A2), while patterns of MEMS and nanoparticles were directly learnt in method (A3). The patterns (P2) and (P3) look very different although both networks were trained on the same dataset of MEMS and nanoparticles. The NanoAID website URL is https://sites.google.com/view/aifornanotechnology
Fig. 5 The nanotechnology community can utilize NanoAID to explore the world of nano, and any researcher can tap into the power of AI. For instance, looking at an MEMS specimen reveals new structures such as circuits, as shown in these AI-generated feature maps
These patterns were learnt by the CNN classifier (R) as it trained on thousands of representative SEM images of nanoparticles and MEMS from the 100% SEM dataset [1]. As seen in Fig. 2, the pattern (A) learnt from thousands of SEM images of nanoparticles such as (H) is similar to a human artist's depiction (C). On closer examination of Fig. 2, it can be noticed that the activations for MEMS occur in (K1, K3, K6) due to CNN filters (N1, N3, N6), while those for nanoparticles are driven by CNN filters (N2, N4, N5). This insight can be valuable for future applications that build upon explainable AI for nanoscience. Method: The paper also explored three different ways to train the neural network and visualize its learnings. As seen in Fig. 4, the patterns were visualized by experimenting with three training methods: classical transfer learning-based training (A1), transfer learning with fine-tuning of all layers (A2), and training a randomly initialized network (A3). The work by researchers at OpenAI demonstrates patterns learnt from ImageNet at https://microscope.openai.com/models, and this paper tailors this concept to nanomaterials such as MEMS and nanoparticles.
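For readers who wish to reproduce the visualization step (Q), the following is a hedged sketch of gradient ascent in input space, in the style introduced by Zeiler and Fergus [16] and implemented in Keras by Chollet [4]; the classifier handle `cnn_classifier` and the layer name are placeholders, not the exact NanoAID model.

```python
import tensorflow as tf

def visualize_filter(model, layer_name, filter_index,
                     img_size=128, steps=100, step_size=1.0):
    """Find the input pattern that maximally activates one conv filter."""
    layer = model.get_layer(layer_name)
    feature_extractor = tf.keras.Model(model.input, layer.output)
    # Start from a faint random gray image and ascend the activation.
    img = tf.Variable(tf.random.uniform((1, img_size, img_size, 1)) * 0.1 + 0.5)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_extractor(img)
            # Maximize the mean activation of the chosen filter.
            loss = tf.reduce_mean(activation[..., filter_index])
        grads = tape.gradient(loss, img)
        img.assign_add(step_size * tf.math.l2_normalize(grads))
    return img[0].numpy()

# pattern = visualize_filter(cnn_classifier, "conv2d_2", filter_index=5)
```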
Fig. 6 Nanotechnology researchers can extend the Open-Source NanoAID. For instance, insights such as the count of patterns and their orientation can be extracted from SEM images of patterned surfaces
3 Conclusions and Future Directions The paper contributed by demonstrating how to apply deep learning visualization to nanomaterial SEM images to gain new insights for the benefit of the interdisciplinary community, tapping into the power of deep neural networks. NanoAID democratizes the power of AI in one of the less published areas, neural network-based image transformation [9], within the trending field of deep learning in microscopy (Editorial, [5]). Democratization of AI is a recent trend, exemplified by the toolkit contributed in the highly reputed Nature Communications journal [15]. The contributions in this paper add a unique set of tasks to such toolkit-based approaches. While the literature contains other tasks listed in a recent review paper
by Ede [6], there was little literature on learning from what the learning machines themselves had learnt. Extracting nanoscale structural patterns by manual effort from thousands of images is challenging, and this paper demonstrated a novel approach to doing the same using the power of machine learning. Three different training techniques were experimented with for visualizing the patterns learnt by the model, as articulated in Fig. 4. The future of electron microscopy could lie in areas such as intelligent visualization that automatically highlights regions and patterns of interest. Concepts like the "AI lens" demonstrated in this work enable such progress in the field of deep learning in electron microscopy. This paper demonstrated the power of deep learning to re-represent a nanomaterial by its features. It was demonstrated that a nanomaterial SEM image can be transformed into its feature representation by computing through the CNN filters of an autoencoder. Further, it was possible to represent a nanomaterial at various levels of abstraction, as demonstrated in Fig. 5. A future research direction lies in exploring nanoscience using advances in visualization for representation learning, such as the reputed ICLR 2021 research by Borowski et al. [3]. NanoAID's contribution is democratizing access to deep learning visualization of nanoscale materials, and NanoAID is contributed as Open-Source AI. Reproducibility of results is enabled by the Google Colab URLs shared in Table 1. To further demonstrate extensibility of the Open-Source NanoAID, two reference applications are additionally contributed. The "AI lens" also highlighted never-before-seen patterns, and it was demonstrated to identify geometry and counts in another category of nanomaterial. The paper successfully demonstrated the application of CNN visualization techniques to nanomaterials using the publicly available SEM dataset [1]. The paper demonstrated how ConvNet filters in a CAE can transform an SEM image, thus helping uncover the nanoscale morphologies of different nanomaterials. The visual results show that essential features can be distilled by an autoencoder trained to optimize the reconstruction loss by adjusting the dimensions of the latent space, a concept introduced in Nature Machine Intelligence [7]. The paper opens doors to new perspectives on nanoscale features by visualizing the CNN activation maps of convolutional autoencoders. The paper's supplementary website is at https://sites.google.com/view/aifornanotechnology.
References
1. Aversa R, Modarres MH, Cozzini S, Ciancio R, Chiusole A (2018) The first annotated set of scanning electron microscopy images for nanoscience. Sci Data 5(1):1–10. https://doi.org/10.1038/sdata.2018.172
2. Berg S, Kutra D, Kroeger T, Straehle CN, Kausler BX, Haubold C, … Kreshuk A (2019) ilastik: interactive machine learning for (bio)image analysis. Nat Methods. https://doi.org/10.1038/s41592-019-0582-9
3. Borowski J, Zimmermann R, Schepers J, Geirhos R, Wallis T, Bethge M, Brendel W (2021) Exemplary natural images explain CNN activations better than feature visualizations. In: International conference on learning representations 2021. https://openreview.net/pdf?id=QO9-y8also
4. Chollet F (2018) Deep learning with Python. Manning, New York
5. Deep learning gets scope time (2019) Nat Methods 16(12):1195. https://doi.org/10.1038/s41592-019-0670-x
6. Ede JM (2021) Review: deep learning in electron microscopy. Mach Learn: Sci Technol. https://doi.org/10.1088/2632-2153/abd614
7. Editorial (2020) Into the latent space. Nat Mach Intell 2(3):151. https://doi.org/10.1038/s42256-020-0164-7
8. Modarres MH, Aversa R, Cozzini S, Ciancio R, Leto A, Brandino GP (2017) Neural network for nanoscience scanning electron microscope image recognition. Sci Rep 7(1). https://doi.org/10.1038/s41598-017-13565-z
9. Moen E, Bannon D, Kudo T, Graf W, Covert M, Van Valen D (2019) Deep learning for cellular image analysis. Nat Methods 16(12):1233–1246. https://doi.org/10.1038/s41592-019-0403-1
10. Oh K, Chung Y-C, Kim KW, Kim W-S, Oh I-S (2019) Classification and visualization of Alzheimer's disease using volumetric convolutional neural network and transfer learning. Sci Rep 9(1). https://doi.org/10.1038/s41598-019-54548-6
11. Olah C, Cammarata N, Schubert L, Goh G, Petrov M, Carter S (2020) Zoom In: an introduction to circuits. Distill 5(3):e00024.001. https://doi.org/10.23915/distill.00024.001
12. Schubert L, Petrov M, Carter S (2020, April 14) OpenAI Microscope. OpenAI Microscope website. https://microscope.openai.com/models. Accessed 31 July 2021
13. Schubert PJ, Dorkenwald S, Januszewski M, Jain V, Kornfeld J (2019) Learning cellular morphology with neural networks. Nat Commun 10(1). https://doi.org/10.1038/s41467-019-10836-3
14. Strack R (2018) Deep learning in imaging. Nat Methods 16(1):17. https://doi.org/10.1038/s41592-018-0267-9
15. von Chamier L, Laine RF, Jukkala J, Spahn C, Krentzel D, Nehme E, … Heilemann M (2021) Democratising deep learning for microscopy with ZeroCostDL4Mic. Nat Commun 12(1). https://doi.org/10.1038/s41467-021-22518-0
16. Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Computer vision—ECCV 2014, pp 818–833. https://doi.org/10.1007/978-3-319-10590-1_53
Convolutional Gated MLP: Combining Convolutions and gMLP A. Rajagopal and V. Nirmala
Abstract To the best of our knowledge, this is the first paper to introduce Convolutions to the Gated Multi-Layer Perceptron (gMLP), and it contributes an implementation of this novel deep learning architecture. Google Brain introduced gMLP in May 2021. Microsoft introduced Convolutions in Vision Transformers (CvT) in March 2021. Inspired by both gMLP and CvT, we introduce convolutional layers in gMLP. CvT combined the power of Convolutions and Attention; our implementation combines the best of convolutional learning with the spatially gated MLP. Further, the paper visualizes how CgMLP learns; visualizations show how CgMLP learns from features such as the outline of a car. While Attention was the basis of much recent progress in deep learning, gMLP proposed an approach that does not use Attention computation. In Transformer-based approaches, many Attention matrices need to be learnt using vast amounts of training data. In gMLP, fine-tuning for new tasks via transfer learning with smaller datasets can be challenging. We implement CgMLP and compare it with gMLP on the CIFAR dataset. Experimental results explore the generalization power of CgMLP, while gMLP tends to drastically over-fit the training data. To summarize, the paper contributes a novel deep learning architecture and demonstrates the learning mechanism of CgMLP through visualizations, for the first time in the literature. Keywords Attention in deep learning · Vision transformers · Gated MLP
A. Rajagopal Indian Institute of Technology, Madras, India e-mail: [email protected] V. Nirmala (B) PG and Research Department of Physics, Queen Mary’s College, Chennai, India e-mail: [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_55
1 Introduction 1.1 Related Literature and Research Directions Potential of Transformers for Multimodal Transformation Tasks The real power of Transformers [1, 2] lies in their ability to model multimodal content and to transform a representation from one modality to another. For example, the ability to translate an image into a caption via the multimodal embedding of OSCAR [3] is a classic demonstration of the power of transformers [4]. This is fundamental because a Transformer is modeled as a function of three trainable matrices and the input features, mapping to target features. Such transformation allows the representation of visual features and language features in a common embedding, thus allowing operations in a multimodal embedding space [2]. With such giant encoder-decoder learning machines, transformation of a multimodal feature input into another multimodal target output is possible. The availability of a vast corpus of multimodal content on the web allows for training such future multimodal Transformers in the same way BERT was trained on the text corpora of Wikipedia and books. OSCAR is one giant leap in architecture design toward a future of multimodal transformation tasks (Fig. 1). Journey Toward Multimodal Transformers The steps toward a pre-trained Transformer model for multimodal translation have already been established. With the success of Transformers in NLP [5],
Fig. 1 Novelty
Transformer architecture is employed in computer vision tasks [1, 6] and visual-language tasks [7]. For example, multiple layers of Transformers are used in state-of-the-art visual-language models such as Microsoft's OSCAR [3] and Google's SimVLM [7]. Another example is the Vision Transformer (ViT) [1] for image processing tasks, where an image is passed as a sequence of image patches with position embeddings to Transformers, in the same way a sentence is passed as a sequence of words to language Transformers [4]. The Potential of Transfer Learning in Language and Vision Transformer Models There is amazing progress in deep learning research. Since the introduction of the attention mechanism [4], attention-based modeling has been introduced in many architectures. With the advent of transformers [4], the ImageNet moment for NLP [5] arrived in 2018. Transformer networks are architected with layers of multi-headed attention and feedforward networks and often require a huge number of weights to be learnt. Hence, training a Transformer often requires significant data; but once trained, it can be easily adapted for new tasks. For example, pre-trained language models such as BERT and GPT-2 enabled rapid adaptation to new custom tasks by fine-tuning on small datasets with transfer learning. Further, few-shot learning became possible with 100-billion-parameter Transformer networks such as GPT-3. Similar pre-training and transfer learning strategies are also employed by CNNs: both BERT and Inception are pre-trained on large amounts of data but can be quickly fine-tuned for newer tasks with smaller datasets. The impact of transfer learning in both computer vision and natural language processing has enabled widespread adoption of AI into the mainstream. Strategies for Learning Visual Information Computer vision can be modeled by learning the following: 1. Long-range interactions between objects in the image/video: Modeling the interactions across objects and situations across frames of a video (or across the different regions of an image) can be realized by learning the attention or alignment between these objects, so a transformer is naturally apt to model such long-range dependencies. To model a video using Transformers, interactions between objects detected by Faster R-CNN across video frames can be modeled using Attention and positionally embedded objects. 2. Local neighborhood modeling: Since a set of pixels at a location is often related, as they represent the same object, modeling them is essential. Here, Convolutions with the right kernel sizes, where the receptive field is aligned to the size of the salient object, are the optimal modeling strategy. 3. Common features in the image:
Features such as textures that are common across an image can be learnt by sharing the learnable parameters across different parts of the image. This is again where Convolution layers come in handy. Designing better computer vision models by blending Convolutions and Attention to model both local and long-range interactions: Self-attention-based models have low inductive biases, whereas a CNN has inductive biases in two forms, locality and weight sharing, as explained above. So Transformer-based modeling is good when there is a significantly large volume of training data. Researchers introduced Convolutions into the Vision Transformer to define CvT [1] in March 2021. Facebook researchers also combined convolutional neural networks with vision transformers [8, 9] to create ConViT [10]. By introducing a gating parameter, ConViT automatically learns to depend upon either self-attention or convolution. Gated MLP: Do you need attention? Researchers at Google Brain investigated the extent of attention that is required by experimenting with a simpler architecture in the paper "Pay Attention to MLPs" [11]. The paper indicates that gMLP [11] can achieve the same accuracy as Transformers on tasks like sentiment classification, though it is slightly weaker on tasks involving pairs of sentences, owing to a lack of long-range interactions. Comparing CNN versus Attention versus gMLP: when to use each of them? Given that each architectural approach has its own inherent strengths, it is worth exploring how best to leverage them. 1. Mechanism of modeling: While a CNN is good at learning common features found across a video clip using weight-sharing kernels, Transformers are good at learning long-range interactions between objects found across different parts of the video, and gMLP is good at spatial interactions. 2. Best-suited tasks: Given their ability to model long-range interactions, the power of Transformers is ideally used for translating content from one representation to another. Interestingly, the content can be multimodal in format, as a multimodal embedding can be transformed by Transformers into another multimodal representation. So Transformers are inherently the best choice for tasks requiring multimodal transformation, like visual question answering. A practical example of a multimodal transformer application would be to take a grade-10 textbook as input and let the AI simplify the concepts to be understandable by grade-6 students. Given that CNNs are good at modeling spatially correlated data, they are a natural choice for any data that inherently contains spatial correlations, which can be found in images and short video clips. For example, a spatial correlation can be found across two images of a scene shot from two different angles, such as stereoscopic cameras or a pair of security cameras looking at the scene from different viewpoints. In this case of scenes from two cameras, an ideal choice would be gMLP, as it dynamically models spatial interactions.
3. Math model: Among the three models under consideration, Transformers have the largest number of learnable parameters. As shown in the equations in Table 1, three large matrices must be learnt in Transformers. In contrast, the weight-sharing nature of CNN filters allows them to learn faster than Transformers on relatively smaller datasets. A large number of parameters also means the amount of data required for training is quite large for Transformers, and the tendency to over-fit the training data is high for Transformers and gMLP. The power to generalize is the aim of AI, and hence validation accuracy, or accuracy after deployment, is a crucial consideration. The over-fitting nature of Transformers and gMLP means that Transformers may be the ideal choice for generation tasks such as GPT-2 text generation. In Transformers, the three weight matrices depend upon the input data X; in gMLP, the weight matrix does not depend upon the input data X. This could mean that gMLP has the potential to handle previously unseen data with a slightly different distribution during inference.

Learning to choose architecture dynamically The ConViT approach dynamically learns to choose one of the two blocks (Convolutions vs. Transformers), as shown in Fig. 2. In effect, it dynamically learns to use the best of both approaches (Convolution and Attention).

Table 1 Comparison of three approaches

| | CNN | Transformers | gMLP |
|---|---|---|---|
| Inductive biases? | High | Minimal | Very minimal |
| Mechanism of modeling | Locality and weight sharing | Long-range interactions | Spatial interactions |
| Tends to over-fit on training data? | No | Yes | Yes |
| Transfer learning to small datasets | Very good | Excellent (few-shot learning) | Not established |
| Math model | Feature map = X * W_kernel; X is the input image, W_kernel is the CNN filter weights | Attention = f(W_Q, W_K, W_V, X); W is dynamically generated from X | Linear gating output = X * (W_p X + b); W_p is a spatial projection matrix independent of X |
| Good for tasks | Classification, image-to-image transformation (e.g., UNet) | Multimodal content transformation (e.g., GPT-2 text generation, OSCAR captioning) | Classification |
Fig. 2 Learning to switch to create a network that leverages the best of both approaches
1.2 Research Gap/Novelty
1. Oct 2020: Computer vision tasks are modeled with Transformers (ViT) [9]
2. Mar 2021: Convolution is introduced to ViT (CvT) [1]
3. Mar 2021: Dynamic blend of Convolution and Transformer (ConViT) [10]
4. May 2021: gMLP is proposed as performing comparably to Transformers [11]
5. This work: Introduces convolution to gMLP
To the best of our knowledge, there are no publications at the time of writing that demonstrate the combination of 2D Convolution with gMLP. This research gap is presented in Fig. 1. In May 2021, Google Brain proposed gMLP, which performed comparably to Transformer-based approaches. The gMLP authors question the need for self-attention and aim to propose a smaller network without attention; gMLP models spatial interactions. Inspired by progress in combining Convolution with Transformers to build models like ConViT and CvT, this work introduces convolution into gMLP.
2 Methods and Results 2.1 Contributions The contributions in this paper are:
1. First to introduce convolutions in gMLP. The new model is henceforth referred to as "Convolutional gated MLP" or CgMLP in this paper.
2. Implements the Convolutional gated MLP.
3. Experimental comparison of CgMLP versus gMLP on the CIFAR dataset.
4. Investigates how CgMLP learns by visualizing the feature maps.
5. Contributes the source code of CgMLP with this paper.
2.2 Convolutional Gated MLP The neural network architecture of the convolutional gated MLP is depicted in Fig. 3. While gMLP takes in the 256 × 256 RGB image directly as 8 × 8 image patches, CgMLP accepts visual feature maps. By adding 2D convolution layers before the gMLP, low-level features are dynamically extracted and injected into the gMLP blocks. The beauty of this approach is that a large 256 × 256 image can be condensed into a smaller 16 × 16 feature map by the CNN layers, enabling the gMLP to process the entire image in one shot rather than the current approach of dealing with image patches. This is particularly useful for learning spatial interactions, given that gMLP does not use positional embeddings, and it opens the door to making sense of features such as the outline of the car shown in Fig. 6. The other benefit is that, since convolution and MaxPool can absorb a neighborhood of pixels within a 5 × 5 receptive field into a smaller feature, the gMLP can look at the entire image rather than at patches. CgMLP simply consists of initial 2D Convolution layers followed by higher gMLP layers, as shown in Fig. 5.
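The following is a minimal Keras sketch of this stacking on a CIFAR-scale input; the widths, depths, token counts, and kernel sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

class SpatialGatingUnit(layers.Layer):
    """Spatial gating s(X) = X1 * f_W(X2): split channels into halves and
    apply a learned linear projection along the token (spatial) axis."""
    def __init__(self, num_tokens):
        super().__init__()
        self.norm = layers.LayerNormalization()
        self.spatial_proj = layers.Dense(num_tokens, bias_initializer="ones")

    def call(self, x):
        u, v = tf.split(x, 2, axis=-1)
        v = self.norm(v)
        v = tf.transpose(v, [0, 2, 1])   # (batch, channels, tokens)
        v = self.spatial_proj(v)         # mix the token axis
        v = tf.transpose(v, [0, 2, 1])
        return u * v

def gmlp_block(x, num_tokens, d_model, d_ffn):
    shortcut = x
    x = layers.LayerNormalization()(x)
    x = layers.Dense(d_ffn, activation="gelu")(x)
    x = SpatialGatingUnit(num_tokens)(x)
    x = layers.Dense(d_model)(x)
    return layers.Add()([shortcut, x])

def build_cgmlp(img_size=32, num_classes=100, d_model=128, blocks=4):
    inp = layers.Input((img_size, img_size, 3))
    # Convolutional front end: extract low-level features, shrink the grid.
    x = layers.Conv2D(d_model, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)             # e.g., 32x32 -> 16x16
    tokens = (img_size // 2) ** 2
    x = layers.Reshape((tokens, d_model))(x)  # feature-map cells as tokens
    for _ in range(blocks):
        x = gmlp_block(x, tokens, d_model, d_ffn=2 * d_model)
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inp, out, name="cgmlp")

model = build_cgmlp()  # roughly the 1CNN-4gMLP variant compared in the paper
model.compile("adam", "sparse_categorical_crossentropy", ["accuracy"])
```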
2.3 Comparison of gMLP Versus CgMLP The three models under comparison are gMLP and two variants of CgMLP. The model architectures are shown in Fig. 4.
2.4 Experimental Results: gMLP Versus CgMLP Experiments show that CgMLP achieves equal or better validation accuracy than gMLP, as shown in Fig. 5. We trained three models on CIFAR-100; the three models are shown in Fig. 4. The gMLP model had four blocks of gMLP units. We then had two variations of CgMLP: one with one CNN layer introduced and the other with two. As seen in Fig. 5, CgMLP achieves competitive validation accuracy, and the results show that the right balance of CNN layers and gMLP is beneficial. gMLP stopped training before 60 epochs due to TensorFlow early stopping on validation accuracy, while CgMLP continued to train; this indicates the potential of CgMLP to generalize better. The experimental results of the comparison are listed in the table in Fig. 5. The experiments show that CgMLP can do better than gMLP: the 1CNN-4gMLP model achieves better accuracy than gMLP. This model consists of one CNN layer with 3 × 3 kernels and four layers of gMLP. The 2CNN-4gMLP model consists of two CNN layers with 5 × 5 filters and four layers of gMLP. The experiments showed that an ideal mix of CNN and gMLP blocks can yield the best performance.
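For completeness, the early-stopping behavior described above can be reproduced with a standard Keras callback; a minimal sketch follows, with the patience value as an assumption.

```python
import tensorflow as tf

# Stop training when validation accuracy stops improving, as reported for gMLP.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=5, restore_best_weights=True)
# history = model.fit(x_train, y_train, epochs=60,
#                     validation_data=(x_val, y_val), callbacks=[early_stop])
```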
Fig. 3 The building block of gMLP versus CgMLP
2.5 How Does the Convolutional Gated MLP Learn? To gain insights into the proposed neural network architecture, we visualized the feature maps as an image is passed through the various layers of the model. It is interesting to note that the model learnt to extract a certain set of visual features that are important for further downstream consumption. It is important to note that the Convolution
Fig. 4 Variants of CgMLP models, and visualization of how it works
Fig. 5 Validation accuracy comparison
layers are trained from scratch rather than borrowing the learnings from an ImageNet pre-trained model. This means the Convolution layers had the opportunity to focus on the features that are most important for the downstream gMLP blocks [11]. The CNN layers learned to re-represent the input image into certain other representations (Fig. 6).
Fig. 6 Visualization of features maps at different levels of the hierarchy
A few areas gain the attention of the CgMLP network: 1. Attention on salient objects: • Key object features, such as the focus on the flower rather than the background garden (Fig. 4); the face of the horse gets the focus rather than the background playground (Fig. 7). 2. Attention to the outline of the object: • While any low-level features could have been extracted, CgMLP extracted the outline of a car (Fig. 8). In contrast, a VGG-16 learns multiple features
Fig. 7 Visualization of feature maps for gMLP and CgMLP
such as color, textures, and edges. Due to the selection of a smaller number of filters in the CNN, the model learnt to focus on only the most important low-level features. 3. A receptive field that lets the gMLP focus on the whole rather than on patches: • The visual features from the CNN can be max-pooled to reduce the dimension of a 256 × 256 image to an 8 × 8 feature map, allowing the gMLP to process the entirety of the visual. This is important for improving accuracy. In gMLP, each patch is processed by a series of gMLP layers, and all the processed patches are finally pooled using a pooling layer. In contrast, CgMLP can process the entire 256 × 256 receptive field in one go. Tuning
Fig. 8 Channel projection versus spatial projection
the CgMLP to look at the entire image at one go can lead to improvements in accuracy.
2.6 gMLP Versus CgMLP: Spatial Interactions Versus Feature-Channel Interactions The gating unit in gMLP has a spatial projection layer. The gMLP authors defined the spatial gating unit as s(X) = X_1 * f_W(X_2), where the input tensor is split into two tensors X_1 and X_2. The size of the X_1 tensor is 64 × 256, or (number of patches × embedding dimension), when the number of image patches is 64. The spatial projection applies a trainable weight W as a 1 × 256 matrix (when the embedding dimension is set to 256). In CgMLP, this spatial projection can be flipped into a channel projection across the channels of the feature map, as shown in Fig. 8. If the CNN layer has 64 filters, there will be 64 channels in the feature map, and the input feature map X can be split into two tensors X_1 and X_2 in such a way that channel interactions are modeled. So in CgMLP, the feature-channel projection can use a trainable weight W as a 1 × 256 matrix. In short, CgMLP can learn channel interactions or spatial interactions based on the axis along which the tensors are split. This increases the fundamental
power of CgMLP to model either spatial interactions or channel interactions. So a network can consist of a combination of two variants of CgMLP:
• a CgMLP layer to model spatial interactions
• a CgMLP layer to model channel interactions
A sketch of the channel-gating variant follows.
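The following is a minimal Keras sketch of such a channel-gating unit, complementing the spatial gating sketched earlier; the split axis and dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ChannelGatingUnit(layers.Layer):
    """Gating u * f_W(v) where the trainable projection mixes the channel
    (feature-map) axis at each token, instead of mixing the token axis."""
    def build(self, input_shape):
        half = input_shape[-1] // 2
        self.norm = layers.LayerNormalization()
        # A Dense over the last axis mixes channels position-wise.
        self.channel_proj = layers.Dense(half, bias_initializer="ones")

    def call(self, x):
        u, v = tf.split(x, 2, axis=-1)   # split the channel axis into halves
        v = self.norm(v)
        return u * self.channel_proj(v)  # channel interactions gate the signal
```

A network can then alternate gMLP blocks built on the SpatialGatingUnit with blocks built on this ChannelGatingUnit, matching the two CgMLP variants listed above.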
2.7 gMLP Versus CgMLP: Inductive Biases The opportunity to model more specialized interactions is possible with CgMLP due to the factors listed in Table 3. A CgMLP neural network employs all five factors in combination to model the training data in computer vision tasks. For example, a Convolution-gMLP unit with spatial gating may be used in one layer, while other layers may include a Convolution-gMLP unit with channel gating. Thus, CgMLP is a combination of different layers, each specializing in modeling a certain aspect of the data. This vision of the CgMLP architectural basis is shown in Fig. 9 (Table 3).

Table 2 Accuracy of the proposed architecture

| Architecture | Model | Validation accuracy (top-5) |
|---|---|---|
| gMLP | gMLP | 77.6% |
| CgMLP (proposed) | 2CNN-4gMLP | 75.9% |
| CgMLP (proposed) | 1CNN-4gMLP | 88.6% |
Fig. 9 Explanation of how the two architectures differ
Table 3 Comparison of proposed architecture with gated MLP

| | CgMLP | gMLP |
|---|---|---|
| 1. Model spatial interactions | Yes | Yes |
| 2. Model channel interactions of feature map | Yes | No |
| 3. Learn to extract meaning from neighboring pixels | Yes | Partial |
| 4. Weight sharing to learn common features | Yes | No |
| 5. Model a smaller input embedding (by receptive field of CNN) | Yes | No |
2.8 Source Code Availability and Reproducible Results The source code of Convolution Gated MLP is contributed in open source at this URL, https://sites.google.com/view/convolutional-gated-mlp. The supplemental website also shows the visualization of how CgMLP works under the hood. The results are reproducible.
3 Summary Since the introduction of gMLP by Google Brain in May 2021, few publications have attempted to combine gMLP with convolutional neural networks. This paper spotted this gap in knowledge and proposed a novel architecture that combines 2D Convolution with gMLP as a sequence of neural network layers. The proposed Convolutional gated MLP (CgMLP) was implemented and trained on CIFAR-100. Experiments demonstrated the potential of blending Convolutions with gMLP and tend to indicate that CgMLP can generalize, whereas the tendency of gMLP to over-fit the training data was reported by the gMLP authors. Our experiment validated that CgMLP can train for more epochs than gMLP, as gMLP triggered automatic early stopping on validation accuracy during TensorFlow training. While gMLP's gating unit performs a spatial projection, CgMLP's gating unit can be made to project across the different channels of the CNN feature map; so while gMLP's mechanism models spatial interactions, CgMLP's mechanism can model feature-channel interactions. The advantages of blending 2D convolution into gMLP are many-fold. The CgMLP architecture allows the network to learn to attend to salient features and frees the gMLP blocks to look at the entire image through the receptive field of the CNN. This allows right-sizing of the neural network, opening the opportunity to tune the network configuration for optimizing the generalization power of the model. By searching the architecture search space, a design for an ideal network configuration containing Convolution and gMLP can emerge. This opportunity to tune for generalization power is the key direction this paper opens for future work.
References
1. Wu H et al (2021) CvT: introducing convolutions to vision transformers. arXiv:2103.15808 [cs]. https://arxiv.org/abs/2103.15808. Accessed 17 Oct 2021
2. Le H, Sahoo D, Chen NF, Hoi SCH (2019) Multimodal transformer networks for end-to-end video-grounded dialogue systems. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 5612–5623. https://doi.org/10.18653/v1/P19-1564
3. Li X et al (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In: Computer vision—ECCV 2020, pp 121–137. https://doi.org/10.1007/978-3-030-58577-8_8
4. Vaswani A et al (2017) Attention is all you need. In: NeurIPS 2017. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
5. Edwards C (2021) The best of NLP. Commun ACM 64(4):9–11. https://doi.org/10.1145/3449049
6. d'Ascoli S, Touvron H, Leavitt M, Morcos A, Biroli G, Sagun L (2021) ConViT: improving vision transformers with soft convolutional inductive biases. arXiv:2103.10697 [cs, stat]. https://arxiv.org/abs/2103.10697. Accessed 30 Oct 2021
7. Wang Z, Yu J, Yu AW, Dai Z, Tsvetkov Y, Cao Y (2021) SimVLM: simple visual language model pretraining with weak supervision. arXiv:2108.10904 [cs]. https://arxiv.org/abs/2108.10904. Accessed 30 Oct 2021
8. Xu Y, Zhang Q, Zhang J, Tao D (2021) ViTAE: vision transformer advanced by exploring intrinsic inductive bias. arXiv:2106.03348 [cs]. https://arxiv.org/abs/2106.03348. Accessed 30 Oct 2021
9. Dosovitskiy A et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 [cs]. https://arxiv.org/abs/2010.11929v1. Accessed 30 Oct 2021
10. d'Ascoli S et al (2021) ConViT: improving vision transformers with soft convolutional inductive biases. In: International conference on machine learning (ICML). PMLR, pp 2286–2296. https://proceedings.mlr.press/v139/d-ascoli21a
11. Liu H, Dai Z, So DR, Le QV (2021) Pay attention to MLPs. arXiv:2105.08050 [cs]. https://arxiv.org/abs/2105.08050. Accessed 30 Oct 2021
Unique Covariate Identity (UCI) Detection for Emotion Recognition Through EEG Signals V. S. Bakkialakshmi
and T. Sudalaimuthu
Abstract Affective computing has become one of the emerging technologies in the current arena, as most industries depend on consumers and their feedback. Opinion and emotional feedback play a major role in improving the quality of services provided by industry, and affective computing plays a vital role in analyzing such feedback. Human emotions can be derived in various ways, including from facial expressions and text; recent research uses biological signals to detect emotions. The emotions include anger, sadness, happiness, joy, disgust, and surprise, which generate promising parameters in the biological EEG signal. Electroencephalography (EEG)-based unique subject identification is evaluated using the presented system. Biological signals are prone to motion artifacts inside the body that distort the recordings. EEG signals are complex, with numerous oscillating points that are unique in certain cases. The proposed system focuses on capturing the impacted component in the brain wave data that produces the unique identification of subjects. The states of the brain wave data, namely alpha, beta, gamma, theta, and delta, are keenly monitored with the help of frequency-domain analysis through the discrete wavelet transform (DWT). In this paper, the analysis of subject-impacted factors is performed using a novel multinomial regression-based unique covariate identity (UCI) detection algorithm. The proposed system is also compared with state-of-the-art approaches in terms of accuracy, precision, and error rate. Keywords Affective computing · Emotion recognition · EEG analysis · Subject identification · Machine learning
V. S. Bakkialakshmi (B) SRM Institute of Science and Technology, Kattankulathur, SRM University, Chennai 602302, India e-mail: [email protected] T. Sudalaimuthu Hindustan Institute of Technology and Science, Hindustan University, Padur 603103, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 M. D. Borah et al. (eds.), Big Data, Machine Learning, and Applications, Lecture Notes in Electrical Engineering 1053, https://doi.org/10.1007/978-981-99-3481-2_56
1 Introduction 1.1 Electroencephalography (EEG) Electroencephalography (EEG)-based unique subject recognition models have been highlighted in recent years [1]. EEG signals hold numerous variations in the form of alpha (8–12 Hz), beta (12–30 Hz), gamma (30–140 Hz), delta (1–4 Hz), and theta (4–8 Hz) waves. On the other hand, EEG signals have complex points that are hard to recognize [2]. Raw EEG data is not clean: it contains unwanted junk values, inappropriate negative values, and sometimes missing values in the form of blanks. The frequent variations in the EEG peaks are used to measure the impacted features from the subject-identification point of view [1]. The emotion associated with the impacted parameters is analyzed, and automatic labeling of emotions is developed.
1.2 Methods EEG data analysis involves various methodologies since the structure of the data is complex [3]. Initially, a high-pass filter is used to remove the unwanted DC components from the raw EEG signal. One of the major problems with EEG signals is motion artifacts [4]; correction and elimination of motion artifacts play a prominent role in the filtering process. The signals are then divided into frames so that the analysis can iterate over shorter segments.
1.3 Feature Extraction In any kind of signal analysis [5], feature information conveys the unique, recognizable property of the measurement, which helps find the functional component associated with the classes [6]. To improve the quality of prediction, simplification of the data is mandatory: reduced complexity enables the prediction model to match the closest data accurately. Feature extraction procedures are used to simplify the data. Statistical parameters such as the mean, median, and kurtosis are measured to form unique recognition points.
1.4 Machine Learning Approaches EEG signals are biological data that help analyze brain waves and are strongly associated with affective computing parameters for emotion analysis [7]. Electrodes are used to detect the electrical activity of the brain. Visual analysis of EEG data is highly expensive and time-consuming when processing entire signals. Improvements in brain-machine interface models motivate automated detection models that use machine learning techniques [3]. EEG data decoding takes a supervised learning approach: the system is trained on a known dataset so that it can predict and identify unique parameters from existing data, and the model is then evaluated on a test dataset. Most methodologies analyze the correlated patterns within the dataset, and the trained models carry labels from the pre-trained dataset.
1.5 Linear Discriminant Analysis (LDA) Linear discriminant analysis (LDA) is commonly used for EEG data analysis; the goal of the model is to converge the feature points in the dataset by reducing high-dimensional data to low-dimensional data [8]. LDA analyzes the correlations within and between classes using scatter matrices for feature selection [9]. LDA achieves optimum performance when it can linearly separate the classes of the data. The feature points are scattered into groups, and LDA projects the correlated data onto linearly discriminative axes. Machine learning techniques such as this can automate the clinical analysis of EEG.
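As a hedged illustration of this dimensionality reduction, the following scikit-learn sketch projects placeholder EEG feature vectors onto two discriminant axes; the feature and class counts are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(300, 18)            # placeholder EEG feature vectors
y = np.random.randint(0, 3, size=300)  # placeholder class labels

# Project onto at most (n_classes - 1) linearly discriminative axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)        # high-dimensional -> 2-D
```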
1.6 Hidden Markov Model (HMM) The hidden Markov model (HMM) is a statistical model in which the system performs a sequential operation whose outcome depends on a known pattern of influencing data [10]. Markov models are used in EEG analysis to augment data when using deep learning models [5]. The set of possible observations is formulated using the Markov chain process.
1.7 Support Vector Machines (SVM) Support vector machines (SVM) are robust machine learning algorithms used for both regression and classification [11]. SVM assigns a label to each predicted category, which enables a predictive learning approach. SVM is normally
categorized into two types, namely linear SVM and non-linear SVM [12]. In a hyperplane over evenly scattered data, multiple lines can be formed to segregate the data into different categories or classes [5]. The data samples inside the boundaries have a unique association. In linear SVM, the data samples are divided into two parts [11]; in non-linear SVM, more than one boundary can be formed and a few samples may be left out. The dimension of the dataset depends on the unique features present in it [13]. SVM is influential in signal processing applications since the SVM feature space separates the unique values from the adjacent noise values.
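A minimal scikit-learn sketch of the two SVM types described above, applied to placeholder EEG feature vectors (the data shapes are assumptions):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.rand(200, 18)            # placeholder feature vectors
y = np.random.randint(0, 2, size=200)  # placeholder binary labels

linear_svm = SVC(kernel="linear").fit(X, y)  # single linear boundary
nonlinear_svm = SVC(kernel="rbf").fit(X, y)  # non-linear boundaries
```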
2 Literature Survey Spatial techniques help solve autocorrelation issues [1]. Spatial analysis of EEG data tackles spurious correlations using the unique subject details present in it; spatial data extraction is discussed in [1]. The system in [2] extracts the spatial and temporal dependencies of the EEG emotion data. A long short-term memory network provides a large span of learning rates and variations in the bias outcomes; it involves a continuous iterative learning process that demands accurate data and is based on the time intervals and associated time patterns present in the input. Video EEG to predict emotional interference using a live dataset is discussed in [2]. Quadratic discriminant analysis (QDA) is used as a unique procedure to handle the DEAP dataset, which holds data on emotions such as liking, dominance, arousal, and feeling low. The study helps find a unique way of setting the attributes.
3 System Tool The Signal Processing Toolbox of MATLAB 2017 [14] is utilized for performing the discrete wavelet transform (DWT). MATLAB is a high-level computing platform integrated with various application toolboxes available as plugins; the tools are used for high-level scientific computing and dynamic data analysis. The core idea behind the unique covariate identity is to extract the highly dynamic peak points of the EEG dataset. The DWT is configured with various test cases, using the Haar wavelet on EEG data collected from the DEAP dataset [15]. The frequency-domain component from the outcome of the Haar wavelet has the greatest impact on prediction accuracy.
3.1 Datasets Available DEAP is a standard dataset containing EEG and peripheral physiological signals of 32 people, recorded in a complete monitoring environment for 40 one-minute-long excerpts of music videos [15]. The recordings were annotated with the levels of arousal, valence, like/dislike, dominance, and familiarity reported by the subjects [16]. The dataset also contains extended face-video information for 22 of the individuals. Methods and results are also presented for single-trial classification of arousal, valence, and like/dislike, with assessments using the modalities of EEG, peripheral physiological signals, and multimedia content analysis. EEG signals were recorded using a Bio-semi Active Two system. MAHNOB is a multi-camera, high-quality dataset for emotion detection, with tagging of spontaneously captured real emotions [7]. DREAMER is another standard dataset with multi-modal capture of ECG and EEG data together for emotion detection (Table 1).
4 System Design The system design consists of blocks that cleanse the raw DEAP dataset in the preprocessing section. Preprocessing means cleaning the data by removing the unwanted junk values present in it: reading the input dataset in the form of a .csv file, reading the columns, and selecting the impacted attributes from the given columns (Fig. 1).

Table 1 Summary of various datasets available for emotion recognition [16]

| Database | Subjects | Stimuli | Duration | Device | Channels | Sampling frequency (Hz) | Features |
|---|---|---|---|---|---|---|---|
| DEAP | 32 | Music videos | 60 s | Bio-semi Active II | 32 | 512* | 230 |
| MAHNOB | 27 | Excerpts from movies | 34.9–117 s (M = 81 s) | Bio-semi Active II | 32 | 512* | 230 |
| DREAMER | 23 | Music videos | 65–393 s (M = 199 s) | Emotive EPOC | 14 | 128 | 105 |
Fig. 1 The architecture of the proposed UCI detection model. Both the DEAP dataset path and the real-time EEG data path are preprocessed and passed through DWT-based feature extraction (mean, median, kurtosis); the DEAP path feeds a multinomial regression, while the real-time path feeds a multinomial regression with the SOM optimizer, and the two outcomes are combined in the covariate analysis
4.1 DWT: Discrete Wavelet Transform The discrete wavelet is an oscillating function applicable to both frequency-domain [12] and time-domain processing [3]. The scaling and translation parameters of the wavelet function are used to detect the member function, which generates the approximation coefficients of the input signal. In time-domain analysis, the observation frame is small; in the frequency domain, the observation range increases [3]. The wavelet transform of the signal x(t) is given by

$$W_x(a, \tau) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - \tau}{a}\right) dt \quad (1)$$

where a is the scale displacement, τ represents the time displacement, and ψ(·) is the wavelet function associated with the input EEG signal [10]. Further, the high-frequency components are extracted from the EEG signal through the DWT. The signal is then summarized by its mean, median, and kurtosis. Kurtosis characterizes the shape of the sample distribution and is given by

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^4}{\left[\sum_{i=1}^{n} \left(Y_i - \bar{Y}\right)^2\right]^2} \quad (2)$$

where $Y_i$ is the i-th variable of the distribution, $\bar{Y}$ is the mean of the distribution, and n is the number of variables in the distribution.
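Although the paper's implementation is in MATLAB, this feature-extraction step can be sketched in Python for illustration; the 5-level Haar decomposition and the 128 Hz sampling rate (as in the preprocessed DEAP data) are assumptions.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis

def dwt_features(eeg_channel, wavelet="haar", level=5):
    # coeffs = [cA5, cD5, cD4, cD3, cD2, cD1]; the detail bands cover
    # progressively lower frequency ranges (roughly gamma down to theta/delta
    # at 128 Hz), and cA5 holds the lowest band.
    coeffs = pywt.wavedec(eeg_channel, wavelet, level=level)
    feats = []
    for band in coeffs:
        feats.extend([np.mean(band), np.median(band), kurtosis(band)])
    return np.asarray(feats)

# Example: one 60 s DEAP trial channel sampled at 128 Hz
# x = preprocessed_trial[channel_idx]   # shape (7680,)
# feature_vector = dwt_features(x)      # 18 values: 3 statistics x 6 bands
```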
4.2 MNR: Multinomial Regression Multinomial logistic regression is an extension of binary logistic regression that allows multiple outcome categories and finds the outcome from independent variables. The algorithm is used to
Fig. 2 MNR: multinomial regression
converge the frequently associated variables in the wavelet distributions. The model is used to predict the probabilities of the dependent categories. The feasible outcome of the multinomial regression takes one of two kinds: the nominal outcome directly conveys whether the distributed function belongs to arousal or valence, while the ordinal outcome provides only suggestion results that indicate the impacted attributes (Fig. 2). In multinomial regression, the dependent variable should be categorical; the other independent variables around it are factors or covariates. The dependent factors are categorical values and the independent values are continuous in nature. Based on certain functional operations with the hidden neurons, the output of the regression is determined.
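A hedged sketch of this classification step with scikit-learn's multinomial logistic regression; the feature matrix and labels below are placeholders standing in for the DWT statistics and emotion categories.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(320, 18)            # placeholder DWT feature vectors
y = np.random.randint(0, 4, size=320)  # placeholder emotion categories

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
mnr = LogisticRegression(multi_class="multinomial", solver="lbfgs",
                         max_iter=1000)
mnr.fit(X_tr, y_tr)
print("accuracy:", mnr.score(X_te, y_te))
# mnr.predict_proba(X_te) gives the per-class probabilities, i.e., the
# "suggestion results" of the ordinal outcome mentioned above.
```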
4.3 SOM: Self-Organized Mapping Optimizer The purpose of the SOM optimizer is to cross-validate the results obtained through the DWT-MNR model in terms of classification and recognition. The optimizer generates a unique identity code after the evaluation process using biased weight assignments to each block. The unique points in the biased weights expose the impacted attributes of the EEG signal. SOM helps identify the data mapping easily. The combination of the MNR model with cross-validation on the SOM forms the novel UCI search algorithm (Fig. 3).
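A minimal sketch of the SOM step using the third-party minisom package as a stand-in for the paper's optimizer; the grid size and iteration count are assumptions.

```python
import numpy as np
from minisom import MiniSom

features = np.random.rand(320, 18)  # placeholder DWT feature vectors

som = MiniSom(8, 8, input_len=features.shape[1], sigma=1.0, learning_rate=0.5)
som.random_weights_init(features)
som.train_random(features, num_iteration=1000)

# Map each sample to its best-matching unit; units that recur for the same
# subject hint at the unique identity codes discussed above.
bmus = np.array([som.winner(f) for f in features])
```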
Fig. 3 Self-organized mapping optimizer
UCI Search Algorithm: Pseudocode
Start
  x = EEG_data
  Y = preprocess(x)
  (fr, ky) = DWT(Y)                      // frequency components and wavelet coefficients
  Fea_metrics_1 = [kurtosis(ky), mean(ky), median(ky)]
  Fea_metrics_2 = SOM_opt(ky)            // SOM-optimized features
  D1 = MNR(Fea_metrics_1)
  D2 = MNR(Fea_metrics_2)
  Binary_assessment(D1, D2)
  attribute(sample_i)
  Repeat for each sample
End
5 Results and Discussion 5.1 Preprocessing of EEG Signals Figure 4 shows the frequency spectrum of a sample EEG data input. The spectrum shows a single EEG sample with a few harmonic frequencies alongside. The x-axis contains the number of samples in the particular frame; the y-axis shows the frequency component as a constant. The collected EEG data is a fast-flowing, large sequence of values; to visualize them, only a certain number of samples are plotted and fit into the graph (Fig. 5).
Fig. 4 A spectrum of sample EEG data input
Brain wave data are among the most complex forms of signal sequences; Fig. 5 shows how the Haar wavelet transform extracts the components from the sample EEG input. Figure 6 is the confusion matrix formed from the outcome of the proposed UCI algorithm. Since the training and testing models are developed using a static dataset, the prediction accuracy is good, around 95%. Further, the system needs to be checked on real-time data by collecting samples from volunteers. Table 2 summarizes the DEAP dataset [15], listing the different types of exposure given to the subject while the dataset was recorded: watching videos, images that attract the eyes, emotional stories, discussions that stimulate the analytic side of the brain, monitored complete silence, and, most commonly, music. The accuracy correlated with the relevant records is tabulated. Figure 7 shows the correlated results of providing the DEAP dataset [15] with emotion correlations such as liking, arousal, and dominance. Brain stimulus activities are very complex: stimulus triggers can occur at any point of the complete brainwave, and the peaks are usually sudden highs raised when the emotional point is triggered. Exposure to music, the silence state, discussions, stories, image visualizations, and videos are considered for the analysis. Table 3 compares the proposed technique with a state-of-the-art multi-modal fused emotion recognition technique using LSTM [2].
Fig. 5 Separation of alpha, beta, gamma, theta, and delta components after DWT
6 Challenges

The major challenge in the presented study arose during the testing procedures. Even after noise removal, the EEG signal retains artifacts that are sustained through the DWT, and the analysis model produces false-positive values at the initial stages. The stability of the system is likely to be improved. The presented system operates almost like a state machine: it first performs the DWT with the MNR model and then waits for the DWT with the SOM procedure to complete. The proposed unique covariate identifier algorithm that takes the final binary decision provides the impacted results. Nevertheless, improvements are suggested in the optimization area to reduce the processing time.
Fig. 6 Confusion matrix in unique covariate/identification process
Table 2 Summary of different input exposures and the accuracy obtained (DEAP dataset)

Type of exposure | Correlated results with accuracy
Videos | 0.95
Images | 0.92
Stories | 0.91
Discussion | 0.90
Silence | 0.95
Music | 0.95
(Bar chart: correlated emotions vs. accuracy on the DEAP dataset; y-axis: correlated ratio, ranging from 0.86 to 0.96)
Fig. 7 Results on emotion recognition on different subjects
Table 3 Comparison with the DEAP dataset

S. No | Concept reference | Dataset used | Methodology | Accuracy (%)
1 | LSTM [2] | DEAP dataset/EEG signals | LSTM, multi-modal fusion technique | 92
2 | Proposed UCI identification technique | DEAP dataset/EEG signals | MNR-SOM fusion model | 95
7 Conclusion

Affective computing is an interesting area of study for determining an in-depth understanding of emotions. EEG-based emotion recognition using a unique covariate identification algorithm is discussed here. The novel algorithm is derived from a multi-nominal regression model combined with the discrete (Haar) wavelet transform. The cross-validation optimizer is developed using self-organized mapping with a binary decision-making routine at the end. The proposed model treats the DEAP dataset as input and finds the maximum correlation and the most highly impacted covariate component present in the EEG dataset. The UCI algorithm finds the impacted component in the EEG dataset that changes the decision to a positive or negative outcome depending on the repeated occurrence of the component. The proposed model achieves an accuracy of 95% in recognizing emotions such as liking, discriminant, silence, and arousal. The presented study can be extended by testing on a real-time dataset collected from volunteers, and deep learning algorithms can be employed to compare real-time and static emotion recognition results.
References

1. Zhang T, Zheng W, Cui Z, Zong Y, Li Y (2019) Spatial-temporal recurrent neural network for emotion recognition. IEEE Trans Cybern 49(3):839–847. https://doi.org/10.1109/TCYB.2017.2788081
2. Wu D, Zhang J, Zhao Q (2020) Multimodal fused emotion recognition about expression-EEG interaction and collaboration using deep learning. IEEE Access 8:133180–133189. https://doi.org/10.1109/ACCESS.2020.3010311
3. Ji N et al (2019) EEG signals feature extraction based on DWT and EMD combined with approximate entropy. Brain Sci 9(8):201. https://doi.org/10.3390/brainsci9080201
4. Bakkialakshmi VS, Thalavaipillai S (2022) AMIGOS: a robust emotion detection framework through Gaussian ResiNet. Bull Electr Eng Inform 11(4):2142–2150. https://doi.org/10.11591/eei.v11i4.3783
5. Gannouni S, Aledaily A, Belwafi K, Aboalsamh H (2021) Emotion detection using electroencephalography signals and a zero-time windowing-based epoch estimation and relevant electrode identification. Sci Rep 11(1)
6. Islam KA, Tcheslavski GV (2015) Independent component analysis for EOG artifacts minimization of EEG signals using kurtosis as a threshold. In: 2015 2nd international conference on electrical information and communication technologies (EICT), pp 137–142. https://doi.org/10.1109/EICT.2015.7391935
7. Bakkialakshmi VS, Sudalaimuthu T (2022) Dynamic cat-boost enabled keystroke analysis for user stress level detection. In: International conference on computational intelligence and sustainable engineering solutions (CISES), Greater Noida, India, pp 556–560. https://doi.org/10.1109/CISES54857.2022.9844331
8. Subasi A, Ismail Gursoy M (2010) EEG signal classification using PCA, ICA, LDA, and support vector machines. Expert Syst Appl 37(12):8659–8666
9. Bhardwaj A, Gupta A, Jain P, Rani A, Yadav J (2015) Classification of human emotions from EEG signals using SVM and LDA classifiers. In: 2015 2nd international conference on signal processing and integrated networks (SPIN), pp 180–185. https://doi.org/10.1109/SPIN.2015.7095376
10. Wang M, Abdelfattah S, Moustafa N, Hu J (2018) Deep Gaussian mixture-hidden Markov model for classification of EEG signals. IEEE Trans Emerg Top Comput Intell 2(4):278–287. https://doi.org/10.1109/TETCI.2018.2829981
11. Jirayucharoensak S, Pan-Ngum S, Israsena P (2014) EEG-based emotion recognition using deep learning network with principal component-based covariate shift adaptation. Sci World J 2014:1–10
12. Hulliyah K et al (2021) Analysis of emotion recognition model using electroencephalogram (EEG) signals based on stimuli text. Turk J Comput Math Educ (TURCOMAT) 12(3):1384–1393
13. Bakkialakshmi VS, Sudalaimuthu T (2021) A survey on affective computing for psychological emotion recognition. In: 2021 5th international conference on electrical, electronics, communication, computer technologies, and optimization techniques (ICEECCOT). IEEE, pp 480–486. https://doi.org/10.1109/ICEECCOT52851.2021.9707947
14. Sudalaimuthu T, Bakkialakshmi VS (2021) Emo-Gem: an impacted affective emotional psychology analysis through Gaussian model using AMIGOS. J Positive Sch Psychol 6(3):6417–6424 (ISSN 2717-7564)
15. Koelstra S et al (2011) DEAP: a database for emotion analysis; using physiological signals. IEEE Trans Affect Comput 3(1):18–31
16. Arnau-Gonzalez P, Arevalillo-Herraez M, Katsigiannis S, Ramzan N (2021) On the influence of affect in EEG-based subject identification. IEEE Trans Affect Comput 12(2):391–401
A Simple and Effective Method for Segmenting Lung Regions from CT Scan Images Using K-Means Yumnam Kirani Singh
Abstract Proposed here is a simple and effective method for segmenting lung regions from CT-scan images. In this method, the CT-scan image in DICOM format is converted into an RGB image, which is then further converted into a gray image. The resulting grayscale image is then binarized using K-means, which automatically groups the pixels into two clusters: one of background pixels and the other of pixels belonging to the lung regions. From the cluster of pixels belonging to the lung regions, the left and right lungs can be properly separated. The K-means clustering used in the method is based on recursive averaging to avoid overflow errors while computing and updating cluster centers. Compared to traditional methods of lung region segmentation, it is much simpler and gives better results. It is also much simpler and faster than deep learning methods for lung region segmentation, as it does not require rigorous training with a large number of images. Keywords Recursive averaging · K-means clustering · Image binarization · Image segmentation · Lung region segmentation · CT-scan images · Edge images of lung regions · Nodules extraction
1 Introduction

The lung is one of the vital organs of our body, sustaining our respiration. Many respiratory diseases affect the lungs, for example, lung cancer, bronchitis, tuberculosis, and Covid-19 [1, 2]. Computed tomography (CT) imaging is widely used for capturing images of the lung. By analyzing the captured CT-scan image, radiologists can diagnose respiratory diseases affecting the lung by combining information from several hundreds of sagittal, coronal, and transverse CT slices. Such manual methods of diagnosing respiratory diseases from CT-scan images require experienced radiologists. Moreover, they are time-consuming,
labor-intensive, and quite prone to fatigue-induced human errors. With the advancement of computing and image processing technology, several researchers have been trying to develop automatic image analysis systems for the diagnosis of respiratory diseases from CT-scan images. Developing an automatic system for the diagnosis of respiratory diseases is challenging. The first challenge is feature extraction, which needs the separation or segmentation of lung regions from the CT-scan image. The second challenge is the development of an intelligent system for diagnosis of the diseases based on the inputs, i.e., the features extracted from CT-scan images. This paper deals with finding a solution to the first challenge, i.e., automatic segmentation of lung regions from the CT-scan images. Separating lung regions from a CT-scan image is quite challenging because of the presence of anatomic structures such as the colon, bowel gas, arteries, bronchioles, and the subcutaneous cavity that exhibit a similar grayscale to the lung region. Also, the intensity variation in the CT-scan image, especially in the lung regions, and the presence of pathological features, which increase the variability of image attributes across slices, make automatic segmentation more challenging [3]. Several automatic lung region segmentation methods have been developed in the last two decades, which can be categorized into traditional and machine learning methods [4, 5]. In traditional methods, the lung regions are separated by grayscale thresholding to generate a binary image on which several operations are performed to segment the lung regions. Other commonly used operations for segmentation include region growing, the watershed model, the active contour model (ACM), and morphological operations such as connected component analysis, erosion, and dilation. In region growing, a few seed pixels in the lung region are identified, and region growing techniques are then utilized to segment the lungs [6]. Nithila et al. proposed reconstructing the parenchyma to eliminate the mediastinum and thoracic wall and separating the parenchyma using a region-based ACM, which can be employed using selective binary and Gaussian filtering with a new signed pressure force function (SBGF-new SPF); a Fuzzy C-Means clustering technique was used for nodule segmentation. In [7], a three-stage approach for thorax extraction, lung segmentation, and boundary refinement has been proposed using thresholding, connected component analysis, and morphology to achieve fast and precise lung segmentation. In [8], gray-level slicing is performed based on histogram analysis to remove the background regions outside the white enclosure in the CT-scan image. The resulting image is inverted and binarized with the threshold computed by Otsu's method. The lung regions are then extracted using connected component analysis. The machine learning methods are broadly classified into two types: supervised learning methods and unsupervised learning methods. Supervised learning methods, which include deep learning, are most popularly used in lung region segmentation. Supervised learning methods, also known as classification models, require training on a large amount of data. The performance of supervised learning methods depends on how well the training models have been developed.
To reduce the requirement of large training data for image segmentation, a network and training strategy known as U-net, which relies on the strong use of data augmentation to use the available annotated samples more efficiently, has been proposed in [9]. Such a network can be trained end-to-end from very few images and gives better performance. Skourt et al. use the U-net
architecture in deep learning for lung region segmentation [4]. Surveys of the various deep learning techniques applied in recent research on medical image segmentation review their structures and methods and analyze their strengths and weaknesses [10, 11]. In [12], Osadebey et al. proposed a deep learning technique to implement an automatic segmentation system in the three stages adopted by traditional methods, i.e., pre-processing, processing, and post-processing. At the pre-processing stage, a CNN is used to identify and remove the CT slices having no lung regions to reduce false positives. In the processing stage, a CNN-based U-net is used to convert a grayscale CT slice to a binary image. To execute the clustering task, the input training images of the U-net are grayscale images, while the output training labels are K-means clustered images. At the post-processing stage, a new deep learning-based method is introduced to refine the contour of the lung regions. There are also several other approaches for lung region segmentation that do not directly fit either into the traditional methods or the machine learning methods. In these approaches, more attention has been paid to pre-processing with different transforms and filters for enhancing the CT-scan images. In [5], Liu et al. first denoise the CT scan by applying decomposition filters having contour-preserving properties and then use the wavelet transform in combination with morphological operations for segmentation of the lung regions. Javaid et al. proposed a computer-aided nodule detection method for the segmentation and detection of challenging nodules like juxtavascular and juxtapleural nodules [13]. In their method, lung regions are separated using intensity thresholding based on histogram analysis followed by a morphological closing operation. K-means clustering is applied for the initial detection and segmentation of potential nodules, which are then divided into six groups on the basis of their thickness and percentage connectivity with the lung walls. In [14], first, CT-scan images are denoised by a Wiener filter. Then, segmentation is performed by fusion of features extracted from the gray-level co-occurrence matrix (GLCM), which is a classic texture analysis method, and U-Net, which is a standard convolutional neural network (CNN). In [15], a novel CT lung segmentation method based on the integration of multiple strategies was proposed. Firstly, in order to avoid noise, the input CT slice was smoothed using the guided filter. Then, the smoothed slice was transformed into a binary image using an optimized threshold. Next, a region growing strategy was employed to extract thorax regions. Then, lung regions were segmented from the thorax regions using a seed-based random walk algorithm. The segmented lung contour was then smoothed and corrected with a curvature-based correction method on each axial slice. Finally, with the lung masks, the lung region was automatically segmented from a CT slice. In this paper, we propose K-means clustering based on a newly proposed recursive averaging method for automatic image binarization. The proposed K-means-based method avoids the overflow errors that quite often happen when we try to find the centroid, which requires computing the average of a large number of pixels. The traditional method of finding the average by computing the sum becomes problematic because the sum of thousands of pixel values quite often goes beyond the byte-size limitation of a variable.
The recursive averaging method avoids computing the sum for finding the average and hence the overflow error is eliminated. Also, the binarization using
K-means clustering does not require training, and its performance is as good as, and quite often better than, that of the binary image generated by the popular Otsu's thresholding method [16]. After generating binary images using K-means, the lung regions are segmented by applying logical XOR and AND operations, and morphological operations. The method has been found to give good performance for segmenting lung regions in binary form and edged form, as well as for extracting the internal structures inside the lung regions.
2 K-Means Clustering Using Recursive Averaging

2.1 Recursive Averaging

Averaging a sequence of N numbers is done by dividing the sum of all the numbers in the sequence by N. The value of the sum may become very large when the numbers in the sequence and the length of the sequence are large. This creates a problem when computing the average in a computer, because the sum has to be stored in a variable of a particular type, and we cannot store arbitrarily large numbers in a variable of any type. The average itself is no larger than the maximum of the numbers in the sequence. If we can compute the average recursively, we can avoid finding the total sum of the numbers and thus solve this memory problem.

Let x = [x1, x2, x3, ..., xN] be a sequence of N numbers, and let a1, a2, a3, ..., aN be the average values of the first 1, 2, 3, ..., N numbers of the sequence. To find the average recursively, we have to find an from the values of an−1 and xn. We know

a1 = x1, a2 = (x1 + x2)/2, a3 = (x1 + x2 + x3)/3

Now, we have to find a way to express a2 in terms of a1 and x2, and a3 in terms of a2 and x3. We can write

a2 = a1 + x2/k2 = (x1 + x2)/2    (1)
a3 = a2 + x3/k3 = (x1 + x2 + x3)/3    (2)

From (1) and (2), we can easily find k2 and k3 as

k2 = 2x2/(x2 − a1) and k3 = 3x3/(x3 − a2)

So, we can write

a2 = a1 + x2/k2, where k2 = 2x2/(x2 − a1)
a3 = a2 + x3/k3, where k3 = 3x3/(x3 − a2)

Proceeding in this way, we can write

an = an−1 + xn/kn, where kn = n·xn/(xn − an−1)

that is,

an = an−1 + (xn − an−1)/n
So, the recursive algorithm for finding the average value of a sequence of numbers x = [x1, x2, x3, ..., xN] is given below.

Algorithm: Recursive Averaging
Input: x = [x1, x2, x3, ..., xN].
Output: avg, the average of the input sequence.
Pseudo Code: recursiveAveraging

  avg = x1
  For i = 2 to N
    avg = avg + (xi − avg) / i
  End For

Most often, we want to find the average of a sequence of output numbers that we do not want to store in an array. In such cases, we can modify the above recursive algorithm to compute the average of the output sequence as each output sample becomes available. For this purpose, we need to keep a counter that is incremented each time an output sample is added, and the corresponding average value is computed as a result of the new addition. So, we need to specify the input sample, the previous average value, and the number of samples used so far to compute the average. Input: x: current sample, av: previous average,
n: counter for the number of samples available. Output: avg: current average.
Pseudo Code: recursiveAverage

  Set av = 0
  Set n = 0
  While a sample x is available
    n = n + 1
    d = x − av
    avg = av + d / n
    av = avg
  End While

This averaging method is suitable for computing the average of a real-time output sequence without storing the sample values in an array and without computing the sum.
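For concreteness, a direct Python rendering of the streaming pseudocode is shown below; it is a minimal sketch that maintains the running average per sample without ever forming a cumulative sum.

def recursive_average(samples):
    avg = 0.0
    for n, x in enumerate(samples, start=1):
        avg += (x - avg) / n   # a_n = a_{n-1} + (x_n - a_{n-1}) / n
    return avg

print(recursive_average([2, 4, 6, 8]))  # prints 5.0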
2.2 K-Means Clustering Using Recursive Averaging

When we compute cluster centers using the K-means algorithm, we usually store the distances of the pixels from the cluster centers. The new cluster centers are generated from the pixels nearest to the cluster centers. Storing the distance information of the pixels from the cluster centers requires a large amount of memory if the number of cluster centers is large. This creates an out-of-memory problem when we directly apply the K-means algorithm for image clustering in MATLAB for large images. To eliminate this memory problem, we use the recursive averaging technique proposed in this paper for computing the new cluster centers. This requires only three memory locations for storing the currently computed cluster centers and their indices, irrespective of the size of the image. At the same time, the number of computational operations is also reduced to some extent. Let us consider the case of two feature vectors of length N, which need to be clustered into k clusters. Let x = [x1, x2, x3, ..., xN] and y = [y1, y2, y3, ..., yN] be the two feature vectors of length N. Let Cx = [cx1, cx2, cx3, ..., cxk] and Cy = [cy1, cy2, cy3, ..., cyk] be the initial k cluster centers, which may be random or computed from the corresponding feature vectors. For each feature (xi, yi), we compute its distances from the cluster centers (Cxj, Cyj) and identify the nearest cluster. We compute the new cluster center for the corresponding nearest cluster using the recursive averaging method. We then test whether the newly computed cluster center is equal to the current cluster center against which the distances are measured. If they are not the same, the old cluster center is replaced by the new one. The process is repeated until the new cluster centers and the old cluster centers are the same.
Algorithm: K-Means Clustering
  Cx: random k cluster centres corresponding to the row index
  Cy: random k cluster centres corresponding to the column index
  Initialize xc to zeros            # size of xc equals size of Cx
  Initialize yc to zeros            # size of yc equals size of Cy
  Do
    Initialize c to k zeros         # counter recording the number of members in each cluster
    xc = Cx
    yc = Cy
    For i = 1 to N
      q = 0
      mn = 99999                    # a large number, initialized before the inner loop
      For j = 1 to k
        d = dist((x[i], y[i]), (xc[j], yc[j]))   # distance between (x[i], y[i]) and (xc[j], yc[j])
        If mn >= d
          mn = d
          q = j
        End If
      End For
      c[q] = c[q] + 1
      ax = xc[q]
      ay = yc[q]
      Cx[q] = recursiveAverage(x[i], ax, c[q])   # recursive average update of the centre
      Cy[q] = recursiveAverage(y[i], ay, c[q])
    End For
  While xc ≠ Cx or yc ≠ Cy
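A minimal Python sketch of the two-cluster case used for image binarization is given below; it follows the recursive-average update of the algorithm above, with a plain per-pixel loop and min/max seeding as illustrative simplifications.

import numpy as np

def kmeans_binarize(img, max_iter=50):
    pixels = img.ravel().astype(float)
    centers = np.array([pixels.min(), pixels.max()])  # two well-separated seeds
    for _ in range(max_iter):
        new = np.zeros(2)     # recursively averaged new centers
        cnt = np.zeros(2)     # per-cluster sample counters
        for p in pixels:
            q = np.argmin(np.abs(centers - p))        # nearest current center
            cnt[q] += 1
            new[q] += (p - new[q]) / cnt[q]           # recursive average update, no sums
        if np.allclose(new, centers):
            break
        centers = new
    # Label each pixel by its nearest final center to obtain the binary image
    return (np.abs(pixels - centers[0]) > np.abs(pixels - centers[1])).reshape(img.shape)

mask = kmeans_binarize(np.random.rand(64, 64))
print(mask.mean())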
3 Segmentation of Lung Regions from CT Scan Images

Lung region segmentation is the process of separating the lung region in the CT-scan image from the rest of the image. Traditionally, it is done by binarization of the image at an appropriate threshold. However, choosing an appropriate threshold is a difficult task unless the image has a bimodal histogram, in which case the valley between the two peaks is considered as the threshold for binarization. Most CT-scan images, though, are multi-modal in nature, having different intensity values corresponding to many different regions in the scan. So, Otsu's thresholding method, based on the maximization of the inter-class variation of the pixel intensity, is used to get automatic threshold values. However, the automatic threshold value generated by Otsu's method does not always give good binary images. As a result, researchers look for a better binarization method for image segmentation. With the recent advancement of
machine learning techniques, different binarization techniques have been suggested that treat binarization as a two-class problem: the background of the binary image as one class and the foreground as another. The issue with classification-based binarization is that it needs to be trained for the different types of images used in different applications. In other words, a binarization or image segmentation technique developed for lung region segmentation cannot be used for the segmentation of text from document images. A better option is to consider image binarization as a clustering problem, where the pixels are to be grouped into two clusters: one cluster belonging to the background pixels and another belonging to the foreground pixels. As the generation of the cluster centers is based on similarity measures, it generates an appropriate binary image for most images. One of the most effective and commonly used clustering methods is K-means clustering. For image clustering using K-means, the input image is converted to a 1-D signal by flattening the 2-D grayscale image. Then, two random cluster centers are generated that are initially far apart. The pixels are mapped to the cluster centers nearest to them, and the cluster centers are updated by averaging the pixels belonging to the respective clusters. The averaging is performed using recursive averaging so that no overflow error occurs while averaging a large number of pixel values. The process of mapping pixels to the updated cluster centers and updating the cluster centers continues till there is no change in the last two updated cluster centers, or the difference between them is less than a specified value. In the end, we get a binary image with pixel values corresponding to the two cluster centers. Figure 1a shows the grayscale image of the CT scan. The binary image generated using K-means is shown in Fig. 1b. It can be seen that the binary image contains the distinctive lung regions in black, enclosed inside the white region.
Fig. 1 a Original CT-scan image in gray. b Generated binary image using K-means
Once we get the binary image containing the lung regions, we need to separate the lung regions from the rest of the image. This is done by applying binary logical and morphological operations. We compute the boundary image by XORing the dilated binary image with the binary image. Then, we fill the holes in the generated boundary image to get the lung regions in white, which become the largest white component in the resulting binary image. As the lung regions have a much larger number of white pixels, we can easily separate out the lung regions from the rest of the binary image, as shown in Fig. 3. We then compute the rectangular bounding box enclosing the lung regions, and only the lung regions are separated out, as in Fig. 3b. By applying a binary AND operation between the binary images in Fig. 2a and Fig. 3a, we can extract the edged lung regions of Fig. 2a. The bounding-box region of the extracted edged image is then cut out to get the image shown in Fig. 4a. To separate out the different regions inside the lungs, the lung region in Fig. 3b is eroded by one pixel all around and then XORed with the image in Fig. 4a to remove the boundary of the lungs, as shown in Fig. 4b. Once we get the shapes of the objects or structures inside the lungs, we can perform shape analysis of the objects to identify the health issues related to the lungs. The algorithm for the described image segmentation is given below. Inputs: grayscale image X of the CT scan; N (=2), the number of clusters for K-means. Outputs: XL: lung regions enclosed in a minimum rectangular bounding box; XLe: edged lung regions enclosed in a minimum rectangular bounding box; XLo: outlines of the objects inside the lung region.
Fig. 2 a Edged lung regions. b Lung regions filled in white
Fig. 3 a Separated lung regions. b Lung regions in a bounding box
Fig. 4 a Edged lungs in a bounding box. b Internal components inside lungs
Steps:
1. Get X, the grayscale image of the CT scan.
2. Find Xb, the binary image of the grayscale image, using K-means.
3. Dilate the inverse of Xb by one pixel to get Xd.
4. Perform an XOR operation between Xb and Xd to get an edged image Xe.
5. Fill the holes in Xe to get the filled lung regions Xf.
6. Remove the white components in Xf having fewer than 3000 pixels to get the binary image Xg.
7. Find the minimum rectangular bounding box enclosing the lung regions.
8. Find the extracted lung regions inside the bounding box, XL.
9. Perform an AND operation between Xe and Xg and take the region inside the bounding box to get the edged lung regions, XLe.
10. Erode XL by one pixel all around and XOR it with XLe to get the objects inside the lung regions, XLo.
4 Experimental Results

The proposed image segmentation method has been used to segment the lung regions from the CT-scan images downloadable from the link in [17]. We have not applied it to other databases due to paucity of time. However, since the proposed algorithm is based on K-means clustering, which does not require training to perform the segmentation, it is expected to be equally effective for other CT-scan images. The dataset in [17] contains six different CT-scan images in .png format and some MATLAB files to perform segmentation based on Otsu's binarization and region growing methods, extracting only the lung parts of the image. In the proposed segmentation method, the given CT scan is clustered into two cluster centers using K-means clustering to generate a good binary image. The generated binary image is processed further with logical XOR and AND operations along with binary morphological dilation and erosion operations to segment the lung regions in three different output forms: the binary mask of the lung regions, the edged image of the lung regions, and the internal object shapes of the lung regions. Table 1 shows the segmented lung regions in the six different CT-scan images. The first column of the table is the lung region corresponding to the bounding-box-detected lung region shown in the second column. To save space, the original CT-scan images in gray are not shown, as they are much bigger in size than the detected lung regions. The third column shows the edged image of the detected lung regions along with the internal object shapes inside the lungs. The fourth column shows only the internal structures or object shapes inside the lungs. The third and fourth column outputs can be directly used as features in classification models to classify the types of diseases affecting the lungs.
5 Conclusions

A simple and effective method of image segmentation based on K-means clustering has been proposed in the paper. A recursive method of computing the average values of sequences to avoid overflow has also been proposed, and the algorithm for K-means clustering based on this recursive averaging is described. The proposed K-means clustering has been implemented for separating lung regions from CT-scan images, and it has been found that the proposed method can separate the lung regions from the CT-scan images. The paper also shows how the objects of interest, or nodules, in the lung regions can be separated out of the lung regions. These separated nodules can be conveniently used as inputs to an automatic lung cancer detection system.
Table 1 Segmented lung regions of CT-scan images (image columns: lung regions in original gray; in binary white; in edges; internal structures inside the lungs)
References

1. Pakdemirli E, Mandalia U, Monib S (2020) Positive chest CT features in patients with Covid-19 pneumonia and negative real-time polymerase chain reaction test. Cureus 12(8)
2. Pakdemirli E, Mandalia U, Monib S (2020) Characteristics of chest CT images in patients with Covid-19 pneumonia in London, UK. Cureus 12(9)
3. Kamble B, Sahu SP, Doriya R (2020) A review on lung and nodule segmentation techniques. In: Advances in data and information sciences. Springer, New York, pp 555–565
4. Skourt BA, El Hassani A, Majda A (2018) Lung CT image segmentation using deep neural networks. Proc Comput Sci 127:109–113
5. Liu C, Pang M (2020) Automatic lung segmentation based on image decomposition and wavelet transform. Biomed Signal Process Control 61:102032
6. Cascio D, Magro R, Fauci F, Iacomi M, Raso G (2012) Automatic detection of lung nodules in CT datasets based on stable 3D mass–spring models. Comput Biol Med 42:1098–1109
7. Sun L, Peng Z, Wang Z, Pu H, Guo L, Yuan G, Yin F, Pu T (2019) Automatic lung segmentation in chest CT image using morphology. In: 9th International symposium on advanced optical manufacturing and testing technologies: optoelectronic materials and devices for sensing and imaging, vol 10843, p 108431. International Society for Optics and Photonics
8. Khehrah N, Farid MS, Bilal S, Khan MH (2020) Lung nodule detection in CT images using statistical and shape-based features. J Imaging 6(2):6. https://doi.org/10.3390/jimaging6020006
9. Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention, pp 234–241. Springer
10. Mittal A, Hooda R, Sofat S (2017) Lung field segmentation in chest radiographs: a historical review, current status, and expectations from deep learning. IET Image Proc 11(11):937–952
11. Hesamian MH, Jia W, He X, Kennedy P (2019) Deep learning techniques for medical image segmentation: achievements and challenges. J Digit Imaging 32(4):582–596
12. Osadebey M, Andersen HK, Waaler D et al (2021) Three-stage segmentation of lung region from CT images using deep neural networks. BMC Med Imaging 21:112. https://doi.org/10.1186/s12880-021-00640-1
13. Javaid M, Javid M, Rehman MZU, Shah SIA (2016) A novel approach to CAD system for the detection of lung nodules in CT images. Comput Meth Prog Biomed 135:125–139
14. Pang T, Guo S, Zhang X, Zhao L (2019) Automatic lung segmentation based on texture and deep features of HRCT images with interstitial lung disease. BioMed Res Int
15. Shi Z, Ma J, Zhao M, Liu Y, Feng Y, Zhang M, He L, Suzuki K (2016) Many is better than one: an integration of multiple simple strategies for accurate lung segmentation in CT images. BioMed Res Int
16. Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9(1):62–66
17. http://www.di.unito.it/~farid/Research/hls.html
Risk-Based Portfolio Optimization on Some Selected Sectors of the Indian Stock Market

Jaydip Sen and Abhishek Dutta
Abstract Designing portfolios with optimum future return and risk has always proved to be a very difficult research problem, since the precise estimation of the future returns and volatilities of stocks poses a great challenge. This paper presents an approach to portfolio design using two risk-based methods, the hierarchical risk parity (HRP) and the hierarchical equal risk contribution (HERC). These two methods are applied to five important sectors of the National Stock Exchange (NSE) of India. The portfolios are built on the stock prices for the period January 1, 2016–December 31, 2020, and their performances are evaluated for the period January 1, 2021–November 1, 2021. The results show that the HRP portfolio's performance is superior to that of its HERC counterpart in all five sectors. Keywords Critical line algorithm · Portfolio optimization · Hierarchical risk parity · Return · Hierarchical equal risk contribution · Risk · Backtesting · Sharpe ratio
1 Introduction

Designing portfolios with optimum future return and risk has always proved to be a very challenging research problem that has attracted considerable interest and effort from the quantitative finance community. The goal of an optimum portfolio is to assign weights to its constituent capital assets in a way that maximizes its return while minimizing its risk. The classical mean-variance approach to portfolio optimization was originally proposed by Markowitz and attempted to solve portfolio optimization using the mean values and the covariance matrix of the past returns of the stocks [1]. The original algorithm proposed by Markowitz, known as the critical line algorithm (CLA), however, suffers from some major shortcomings. The major drawback of the CLA is its large error in the estimation of the future returns and
risks as estimated from the covariance matrix, and the subsequent adverse effect on the portfolio's performance on out-of-sample data. To mitigate the problems associated with the quadratic optimization approach used in the CLA algorithm, numerous alternative portfolio design approaches have been proposed in the literature. The hierarchical risk parity (HRP) and the hierarchical equal risk contribution (HERC) are risk-based portfolio design methods that have gained wide acceptance [2]. The three main problems that the CLA suffers from are instability of the covariance matrix, skewed (i.e., non-uniform) weight allocation to assets, and under-performance on out-of-sample data. Since the risk-based portfolio design approaches do not depend on the inverse covariance matrix of the stock returns, these approaches get rid of the instability problem. Hence, even if the covariance matrix of the stock returns is singular, the HRP portfolio is capable of yielding a good performance result on out-of-sample data. The HERC approach, on the other hand, further adapts the HRP method by making the cluster formation process among the stocks optimal and designing an allocation strategy in which each stock in a given cluster contributes equally to the overall risk of the portfolio. Risk-based portfolios such as the HRP and the HERC usually perform better than the CLA. However, no formal work has so far been carried out comparing their performances on the Indian stock market. The current work presents a systematic approach for a comparative evaluation of these two portfolio design approaches on five critical sectors of stocks of the NSE of India. The most significant stocks of the five sectors are identified from the NSE's report of October 29, 2021 [3]. The historical prices of these stocks over the past five years (January 1, 2016–December 31, 2020) are used for building the HRP and HERC portfolios for each sector. Both the training data (January 1, 2016–December 31, 2020) and the test data (January 1, 2021–November 1, 2021) are used for backtesting and identifying the best-performing portfolio for each sector. Extensive results are presented on the backtesting of the portfolios. The present work has three distinct contributions. First, it illustrates how to design robust portfolios using two approaches, the HRP and the HERC algorithms. Based on these two approaches, portfolios are built on the stocks of the five critical sectors listed in the NSE. The performance results of these portfolios can be used as a guide for investments in the Indian stock market. Second, a robust backtesting framework for evaluating the performances of the portfolios is proposed that takes into account the returns, risks, and Sharpe ratios associated with the portfolios. This has enabled the evaluation framework to identify the best-performing portfolio for each sector on both the in-sample and the out-of-sample records. Finally, the results of this work serve as a very reliable indicator of the current profitability and risks associated with the five sectors of the Indian stock market and can be gainfully utilized by investors in the Indian stock market. The paper is organized as follows. In Sect. 2, some existing works on portfolio design and stock price prediction are discussed briefly. Section 3 highlights the methodology followed. Section 4 presents the results of the two portfolio design approaches on the five sectors. Section 5 concludes the paper.
2 Related Work Due to the challenging nature of the problems and their impact on real-world applications, several propositions exist in the literature for stock price prediction and robust portfolio design for optimizing returns and risk in a portfolio. The use of predictive models based on learning algorithms and deep neural net architectures for stock price prediction is quite common [4–9]. Hybrid models are also demonstrated that combine learning-based systems with the sentiments in the unstructured nonnumeric contents on the social web [10–13]. The use of multi-objective optimization, principal component analysis, and metaheuristics have been proposed for portfolio design [14–18]. Estimating volatility in future stock prices using GARCH has also been done [19]. In the present work, two portfolio design approaches, the HRP and the HERC are illustrated for maximizing the return and optimizing the risk for five sectors of the NSE of India. For each sector, two portfolios are built and backtested based on the historical prices of the stocks over the past 5 years, and the portfolio yielding better results on the out-of-sample data is identified.
3 The Portfolio Design Methodology This section presents a discussion on the six-step portfolio design and testing methodology followed. The steps are as follows. (1) Choosing the sectors: The following sectors of NSE are chosen for analysis: (i) media, (ii) oil and gas, (iii) private banks, (iv) PSU banks, and (v) realty. Based on the NSE’s report of October 29, 2021, the most significant stocks from each sector are then selected [3]. (2) Acquiring the Data: The DataReader function defined in the pandas library of Python is used for scraping the historical prices of the stocks of the five sectors from the Yahoo Finance website. The stock price records for the period January 1, 2016–December 31, 2020, are used to build the portfolios for the sectors, while their testing is carried out on the stock records for the period January 1, 2021–November 1, 2021. The variable close is used for computing the portfolio return and risk, and the other variables are ignored. (3) Computation of return and risk of the stocks: The daily return for a stock is given by the percentage change in its successive daily close values. The daily returns of the stocks are computed using Python’s pct_change function. The yearly return and risk values are computed from the daily returns and their standard deviation. Since there is a general assumption of 250 working days in a year, the yearly return and risk are obtained by multiplying their corresponding daily values by a factor of 250 and the square root of 250, respectively.
(4) HRP portfolio design: The design of the HRP portfolio for the five sectors involves three steps in which clusters are formed and weights are allocated to the stocks. The steps are discussed in the following. Tree Clustering: In the first step, the HRP portfolio design approach carries out hierarchical clustering to form a tree of clusters. A hierarchy class in Python is designed for building the agglomerative hierarchical clusters. The hierarchy class creates a dendrogram based on the value of the single linkage metric that it receives. The single linkage value is returned by a linkage method that executes on the historical stock return values and computes the distance (usually the ward distance) between each pair of stocks. The clusters are formed hierarchically by the linkage method, and finally the formed clusters are depicted in a dendrogram. Quasi-Diagonalization: In this step, the entries in the covariance matrix of the stock returns are reorganized so that the larger values are brought near the diagonal while the smaller ones are placed further away from it. While the basis of the covariance matrix remains unaltered after the quasi-diagonalization step, assets with similar returns are brought nearer to each other in the correlation matrix while dissimilar ones are pushed further away. Recursive Bisection: The covariance matrix is transformed into a quasi-diagonal form after the completion of the quasi-diagonalization step. For a return matrix in quasi-diagonal form, a weight allocation strategy in the inverse ratio of the variances of the assets is proven to be optimal [2]. Two different weight allocation approaches are possible. In the bottom-up approach, the inverse of the variance of an adjacent pair of stocks is used for allocating weights within the cluster. In the top-down approach, the allocation of weights for a pair of contiguous stocks is made in the inverse ratio of their variances. (5) HERC portfolio design: The HERC portfolio design approach combines techniques of machine learning and the risk parity method of HRP for optimizing the return and risk [20]. The portfolio design approach consists of the following steps: (i) formation of tree clusters, (ii) identification of the optimal number of clusters, (iii) recursive bisection in a top-down manner, and (iv) naive risk parity-based allocation within clusters. Step (i) of the HERC method works in exactly the same way as the tree clustering step of the HRP method. While the HRP method does not involve finding an optimal number of clusters, in step (ii) the HERC method utilizes the gap index method for identifying the optimal number of clusters. Once the optimal cluster number is found, the HERC method employs the bisection step recursively in a top-down manner for determining the cluster weights. In step (iii), the algorithm bisects the cluster tree at a given level into two sub-cluster trees. The allocation of weights to the sub-clusters is done in proportion to their respective contributions to the risk of the aggregate cluster. This bisection and assignment of weights go on recursively till the allocation to all the clusters is done. In step (iv), intra-cluster weight assignment to the assets is done using a risk parity method that allocates weights in the inverse ratio of their risks.
(6) Backtesting the portfolios: The portfolios of each sector are finally backtested on the training and the test data. The evaluation of the portfolios is done on three metrics: cumulative return, annualized risk, and the Sharpe ratio. The Sharpe ratio on the test data is taken as the most critical metric for evaluation. A minimal code sketch covering steps (3)–(6) is given below.
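The sketch assumes the riskfolio-lib HCPortfolio API and synthetic placeholder prices (the actual study uses Yahoo Finance data fetched via the pandas DataReader); the linkage and codependence parameters are illustrative assumptions, not necessarily the authors' exact configuration.

import numpy as np
import pandas as pd
import riskfolio as rp

# Placeholder daily close prices (5 hypothetical stocks, roughly 5 years)
rng = np.random.default_rng(1)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0003, 0.02, (1250, 5)), axis=0)),
    columns=[f"STOCK{i}" for i in range(5)],
)

returns = prices.pct_change().dropna()       # step 3: daily returns
train, test = returns.iloc[:1000], returns.iloc[1000:]

port = rp.HCPortfolio(returns=train)         # steps 4-5: HRP and HERC weights
w_hrp = port.optimization(model="HRP", codependence="pearson",
                          rm="MV", linkage="single", leaf_order=True)
w_herc = port.optimization(model="HERC", codependence="pearson",
                           rm="MV", linkage="ward", max_k=10, leaf_order=True)

# step 6: cumulative return and annualized Sharpe ratio on the test data
for name, w in [("HRP", w_hrp), ("HERC", w_herc)]:
    pr = test @ w.squeeze()                  # daily portfolio returns
    cum_ret = (1 + pr).prod() - 1
    sharpe = pr.mean() / pr.std() * np.sqrt(250)
    print(name, round(cum_ret, 4), round(sharpe, 4))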
4 Performance Evaluation This section presents extensive results of the two portfolios’ compositions and their performances in the backtesting process. The HRP and HERC portfolios are implemented using the library riskfolio-lib in Python. The GPU runtime environment of Google Colab is used for building and testing the portfolio models. The parallel processing ability of the GPU environment is exploited in training and testing to minimize the computation time.
4.1 The Media Sector The ten most significant stocks of the media sector mentioned in the NSE’s report of October 29, 2021, are the following: Zee Entertainment Enterprises, PVR, Sun TV Network, TV18 Broadcast, Saregama India, Inox Leisure, Dish TV India, Nazara Technologies, Network 18 Media and Investments, and Hathway Cable and Datacom [5]. The cluster dendrograms of the media sector stocks are shown in Fig. 1. The HRP method created the following clusters: Cluster 1: Dish TV; Cluster 2: Sun TV and Inox Leisure; Cluster 3: Saregama and Zee; and Cluster 4: Network 18, Hathway, TV18 Broadcast, and PVR. The stock of Nazara could not participate in the clustering due to the inadequacy of records. On the other hand, the clusters created by the HERC method are Cluster 1: PVR, Hathway, and TV18 Broadcast; Cluster 2: Saregama, Zee, and Network18; Cluster 3: Inox Leisure and Sun TV; and Cluster 4: Dish TV.
Fig. 1 The dendrograms of clusters formed on the media sector stocks by the HRP (the fig on the left) and the HERC (the fig on the right) on the training data
Figure 2 depicts the weight allocations to the media sector stocks by the two portfolios. The HRP portfolio assigned the maximum weights (in percent) to the following three stocks: Inox Leisure (18.66), Sun TV (16.80), and Network 18 (15.12). The three stocks that were allocated the largest weights by the HERC portfolio are TV 18 Broadcast (30.57), Hathway (27.77), and PVR (18.18). While the HRP's allocation of weights is fairly uniform, that of the HERC looks very skewed. Figure 3 depicts the two portfolios' cumulative returns over the training and the test periods. The training (i.e., in-sample) data refers to the period January 1, 2016–December 31, 2020, while the test period is January 1, 2021–November 1, 2021. The HRP's return is higher during both the training and the test periods. Additionally, the Sharpe ratios of the HRP portfolio are also larger, as observed in Table 1. For the media sector, the HRP portfolio has performed better.
Fig. 2 The media sector portfolio composition: HRP portfolio (the fig on the left) and the HERC portfolio (the fig on the right)
Fig. 3 The cumulative returns of the media sector portfolios on the training records (the fig on the left) and the test records (the fig on the right)
Table 1 The media sector portfolio performance

Portfolio | Vol (train) | SR (train) | Vol (test) | SR (test)
HRP | 0.2683 | 0.4120 | 0.2719 | 2.5528
HERC | 0.3505 | 0.1195 | 0.3907 | 1.4323
4.2 The Oil and Gas Sector

The most important stocks in this sector, as mentioned in the NSE's report of October 29, 2021, are the following: Reliance Industries, Oil and Natural Gas Corporation, Bharat Petroleum Corporation, Adani Total Gas, Indian Oil Corporation, GAIL India, Hindustan Petroleum Corporation, Petronet LNG, Indraprastha Gas, Gujarat Gas, Castrol India, Gujarat State Petronet, Mahanagar Gas, Gulf Oil Lubricants, and Oil India [5]. The dendrograms formed by the two portfolio design approaches are depicted in Fig. 4. The clusters created by the HRP are Cluster 1: Indraprastha Gas; Cluster 2: Reliance; Cluster 3: Adani Total Gas, GAIL, Petronet, Castrol, Mahanagar Gas, Indian Oil, Gujarat State Petronet, Gulf Oil, Gujarat Gas, Hindustan Petroleum, ONGC, and Oil India; Cluster 4: Bharat Petroleum. The clusters formed by the HERC are as follows: Cluster 1: Indian Oil, Gulf Oil, Gujarat State Petronet; Cluster 2: Gujarat Gas, Hindustan Petroleum, ONGC; Cluster 3: Reliance, Adani Total Gas, Mahanagar Gas, GAIL, BPCL, Oil India, Castrol India, Petronet; and Cluster 4: Indraprastha Gas. The compositions of the two portfolios for the oil and gas sector are exhibited in Fig. 5. While the HRP assigned the largest weights to Indraprastha (13.39), Adani Total Gas (10.55), and Mahanagar Gas (9.11), the three stocks that were assigned the maximum weights by the HERC are Indraprastha Gas (16.53), Gujarat Gas (9.32), and Indian Oil (7.79). The weight allocation in HRP looks more uniform than that of the HERC portfolio. From Fig. 6, it is clear that the HRP portfolio's return is consistently higher in both cases. Moreover, the HRP's Sharpe ratios are also larger in both cases, as shown in Table 2. The results indicate the HRP's superior performance on the oil and gas sector stocks.
Fig. 4 The dendrograms of clusters formed on the oil and gas sector stocks by the HRP (the fig on the left) and the HERC (the fig on the right) on the training data
Fig. 5 The oil and gas sector portfolio composition: HRP portfolio (the fig on the left) and the HERC portfolio (the fig on the right)
Fig. 6 The cumulative returns of the oil and gas sector portfolios on the training records (the fig on the left) and the test records (the fig on the right)
Table 2 The oil and gas sector portfolio performance

Portfolio | Vol (train) | SR (train) | Vol (test) | SR (test)
HRP | 0.2217 | 1.1571 | 0.1673 | 2.2367
HERC | 0.2275 | 0.4037 | 0.1700 | 1.8309
4.3 The Private Sector Banks The following are the most significant stocks in the private sector banks as mentioned in the NSE’s report of October 29, 2021: ICICI Bank, HDFC Bank, Kotak Mahindra Bank, Axis Bank, IndusInd Bank, Bandhan Bank, Federal Bank, IDFC First Bank, City Union Bank, and RBL Bank [5]. Figure 7 depicts the dendrograms formed by the two portfolio strategies. The clusters created by the HRP are Cluster 1: RBL Bank; Cluster 2: IDFC First Bank; Cluster 3: Yes Bank, Axis Bank, HDFC Bank, IndusInd Bank, Kotak Bank, and Bandhan Bank; Cluster 4: Federal Bank and ICICI Bank. The clusters created by the HERC portfolio are as follows: Cluster 1: RBL Bank; Cluster 2: ICICI Bank and Federal Bank; Cluster 3: HDFC Bank. IndusInd Bank, Bandhan Bank, Kotak Bank, and IDFC First Bank; and Cluster 4: Axis Bank and Yes Bank.
Fig. 7 The dendrograms of clusters formed on the private sector banks’ stocks by the HRP (the fig on the left) and the HERC (the fig on the right) on the training data
Fig. 8 The private sector banks’ portfolio composition: HRP portfolio (the fig on the left) and the HERC portfolio (the fig on the right)
The portfolio compositions of the HRP and HERC portfolios for the private sector banks are shown in Fig. 8. The largest weights were assigned by the HRP portfolio to the following three stocks: Yes Bank (21.56), Axis Bank (19.70), and IndusInd Bank (11.37). However, the HERC allocated the maximum weights to the following stocks: RBL Bank (33.06), Axis Bank (32.66), and Yes Bank (23.85). The allocation in the HERC portfolio is highly skewed with three stocks being allocated around 90% of the total allocation. The cumulative returns of the two portfolios are depicted in Fig. 9 from which it is evident that during the period of the training data, the return of the HRP is lower. However, the HRP’s return is higher in the majority of the test period. The results presented in Table 3 make it clear that the HRP has exhibited a superior performance as it has yielded larger Sharpe ratios for both cases.
4.4 The PSU Banks The critical stocks in the PSU banks’ sector as mentioned in the NSE’s report of October 29, 2021, are the following: State Bank of India, Bank of Baroda, Canara Bank, Punjab National Bank, Union Bank of India, Bank of India, Indian Bank,
Fig. 9 The cumulative returns of the private sector banks’ portfolios on the training records (the fig on the left) and the test records (the fig on the right)
Table 3 The private sector banks portfolio performance

Portfolio | Vol (train) | SR (train) | Vol (test) | SR (test)
HRP | 0.3037 | 0.4555 | 0.2320 | 0.9961
HERC | 0.3872 | −0.0757 | 0.2402 | 0.0298
Indian Overseas Bank, Central Bank of India, Bank of Maharashtra, Jammu and Kashmir Bank, Punjab and Sind Bank, and UCO Bank [5]. The dendrogram of the HRP portfolio for the PSU banks' sector, shown in Fig. 10, consists of the following clusters: Cluster 1: Union Bank; Cluster 2: Canara Bank; Cluster 3: J&K Bank, SBI, Central Bank, UCO Bank, PNB, Bank of Baroda, Indian Bank, Bank of Maharashtra, Bank of India, and Punjab and Sind Bank; and Cluster 4: Indian Overseas Bank. The HERC's cluster composition is as follows: Cluster 1: Indian Bank, Bank of Baroda, PNB, UCO Bank, Central Bank, SBI, and J&K Bank; Cluster 2: Indian Overseas Bank, Punjab and Sind Bank, Bank of India, and Bank of Maharashtra; Cluster 3: Canara Bank; and Cluster 4: Union Bank. As evident from Fig. 11, J&K Bank (12.50), Bank of India (9.87), and Canara Bank (9.08) were assigned the highest weights in the HRP portfolio. However, HERC allocated the highest weights to J&K Bank (16.12), Union Bank (11.16), and SBI (11.13). Again, the allocation of HERC appears to be highly skewed. Figure 12 shows that while the cumulative returns of the HERC portfolio are marginally higher during the training data period, the HRP has produced a higher return in the test data period. From Table 4 it is seen that the HRP portfolio's Sharpe ratio is also higher for the test period. Since, for a portfolio, its performance on the test data is what matters, the HRP portfolio has exhibited better performance.
4.5 The Realty Sector

The important stocks in the realty sector, as mentioned in the NSE's report of October 29, 2021, are the following: Godrej Properties, DLF, Oberoi Realty, Phoenix Mills,
Fig. 10 The dendrograms of clusters formed on the PSU banks’ stocks by the HRP (the fig on the left) and the HERC (the fig on the right) on the training data
Fig. 11 The PSU banks’ portfolio composition: HRP portfolio (the fig on the left) and the HERC portfolio (the fig on the right)
Fig. 12 The cumulative returns of the PSU banks' portfolios on the training records (the fig on the left) and the test records (the fig on the right)

Table 4 The PSU banks' portfolio performance

Portfolio    Training records        Test records
             Vol       SR            Vol       SR
HRP          0.3345    −0.2501       0.3766    1.8843
HERC         0.3589    −0.2137       0.3681    1.8685
Fig. 13 The dendrograms of the clusters formed on the realty sector stocks by the HRP (the fig on the left) and the HERC (the fig on the right) on the training data
Fig. 14 The realty sector portfolio composition: HRP portfolio (the fig on the left) and the HERC portfolio (the fig on the right)
Prestige Estates Projects, Brigade Enterprises, Macrotech Developers, Indiabulls Real Estate, Sobha, and Sunteck Realty [5]. The dendrograms are shown in Fig. 13, and the portfolio compositions are exhibited in Fig. 14. While the three stocks assigned the largest weights by the HRP portfolio are Prestige (16.69), Indiabulls (15.01), and Oberoi (13.39), the HERC allocated the highest weights to Prestige (72.56), Godrej Properties (8.69), and Phoenix (5.22). As evident from Fig. 15 and the results presented in Table 5, the HRP portfolio has produced better results for the realty sector stocks, yielding higher returns and higher Sharpe ratios. Overall, except for the training case of the PSU banks, the HRP portfolios produced higher Sharpe ratios than their HERC counterparts. The HRP's performance is superior because the risk parity-based approach it adopts within the clusters has produced higher returns at a lower risk than the HERC's policy of equal risk contribution (a numerical illustration of the HERC's split rule is given after Table 5). In the presence of stocks with highly correlated returns, which is the case for some sectors, the HRP has been able to form a more diversified portfolio, thereby avoiding the instability issues associated with the covariance matrix of the stock returns. While it may not be wise to draw a general conclusion from five sectors alone, the results provide convincing evidence for the superiority of the HRP as a portfolio design approach.
Fig. 15 The cumulative returns of the realty sector portfolios on the training records (the fig on the left) and the test records (the fig on the right)
Table 5 The realty sector portfolio performance

Portfolio    Training records        Test records
             Vol       SR            Vol       SR
HRP          0.2605    1.0177        0.3153    2.1425
HERC         0.3248    0.7717        0.3421    1.3813
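To make the contrast noted above concrete, the following minimal sketch shows the equal-risk-contribution rule that the HERC applies between two sibling clusters of the dendrogram, ignoring cross-cluster correlation for simplicity. The function name and the numerical values are illustrative assumptions, not the authors' code.

```python
def herc_split_weights(risk_left: float, risk_right: float) -> tuple:
    """Equal risk contribution between two sibling clusters:
    w_left * risk_left == w_right * risk_right, with w_left + w_right == 1."""
    w_left = risk_right / (risk_left + risk_right)
    return w_left, 1.0 - w_left

# Example: a low-risk cluster (10% vol) vs a high-risk cluster (30% vol)
w_low, w_high = herc_split_weights(0.10, 0.30)
print(w_low, w_high)  # 0.75 and 0.25: the low-risk cluster gets three times the weight
```

Because each split hands most of the budget to the lower-risk side, a single dominant cluster can absorb a large share of the portfolio, which is consistent with the skewed HERC allocations observed for the private sector banks and the realty sector.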
5 Conclusion

This paper demonstrated two approaches to risk-based portfolio optimization applied to five important sectors of the NSE of India. The two portfolios, the HRP and the HERC, were built on the historical stock prices of the five sectors from January 1, 2016, to December 31, 2020. The backtesting was done over both the training period and the test period (January 1, 2021–November 1, 2021). The results clearly showed that the HRP exhibited the better performance, as its Sharpe ratios and returns were found to be higher for all five sectors on the test data. Future work includes studying the other sectors in order to arrive at a more general conclusion on the relative performance of the two portfolio approaches.
References

1. Markowitz H (1952) Portfolio selection. J Finan 7(1):77–91
2. de Prado ML (2016) Building diversified portfolios that outperform out of sample. J Portf Manage 42(4):59–69
3. NSE Website. http://www1.nseindia.com. Accessed 11 Nov 2021
4. Mehtab S, Sen J, Dutta A (2020) Stock price prediction using machine learning and LSTM-based deep learning model. In: Proceedings of SoMMA, pp 88–106
5. Sen J (2018) Stock price prediction using machine learning and deep learning frameworks. In: Proceedings of ICBAI, Bangalore, India
6. Mehtab S, Sen J (2020) Stock price prediction using convolutional neural networks on a multivariate time series. In: Proceedings of the 2nd NCMLAI, New Delhi, India
7. Bao W, Yue J, Rao Y (2017) A deep learning framework for financial time series using stacked autoencoders and long- and short-term memory. PLoS ONE 12(7)
8. Mehtab S, Sen J (2020) A time series analysis-based stock price prediction using machine learning and deep learning models. Int J Bus Forecast Market Intell 6(4):272–335
9. Mehtab S, Sen J (2020) Stock price prediction using CNN and LSTM-based deep learning models. In: Proceedings of IEEE DASA, Sakheer, Bahrain, pp 447–453
10. Mehtab S, Sen J (2019) A robust predictive model for stock price prediction using deep learning and natural language processing. In: Proceedings of 7th BAICONF, Bangalore, India
11. Audrino F, Sigrist F, Ballinari D (2020) The impact of sentiment and attention measures on stock market volatility. Int J Forecast 36(2):334–357
12. Carta SM, Consoli S, Piras L, Podda AS, Recupero DR (2021) Explainable machine learning exploiting news and domain-specific lexicon for stock market forecasting. IEEE Access 9:30193–30205
13. Chen M-Y, Liao C-H, Hsieh R-P (2019) Modeling public mood and emotion: stock market trend prediction with anticipatory computing approach. Comput Human Behav 101:402–408
14. Sen J, Mehtab S (2021) A comparative study of optimum risk portfolio and Eigen portfolio on the Indian stock market. Int J Bus Forecast Market Intell. Inderscience Publishers
15. Corazza M, di Tolo G, Fasano G, Pesenti R (2021) A novel hybrid PSO-based metaheuristic for costly portfolio selection problems. Ann Oper Res 304:104–137
16. Zhao P, Gao S, Yang N (2020) Solving multi-objective portfolio optimization problem based on MOEA/D. In: Proceedings of 12th ICACI, Dali, China, pp 30–37
17. Chen C, Zhou Y (2018) Robust multi-objective portfolio with higher moments. Expert Syst Appl 100:165–181
18. Ertenlice O, Kalayci CB (2018) A survey of swarm intelligence for portfolio optimization: algorithms and applications. Swarm Evol Comput 39:36–52
19. Sen J, Mehtab S, Dutta A (2021) Volatility modeling of stocks from selected sectors of the Indian economy using GARCH. In: Proceedings of IEEE ASIANCON, Pune, India
20. Raffinot T (2018) The hierarchical equal risk contribution portfolio. https://ssrn.com/abstract=3237540. https://doi.org/10.2139/ssrn.3237540