English Pages XVI, 504 [516] Year 2021
Advances in Intelligent Systems and Computing 1199
Chhabi Rani Panigrahi · Bibudhendu Pati · Prasant Mohapatra · Rajkumar Buyya · Kuan-Ching Li Editors
Progress in Advanced Computing and Intelligent Engineering Proceedings of ICACIE 2019, Volume 2
Advances in Intelligent Systems and Computing Volume 1199
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Nikhil R. Pal, Indian Statistical Institute, Kolkata, India Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba Emilio S. Corchado, University of Salamanca, Salamanca, Spain Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil Ngoc Thanh Nguyen , Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by SCOPUS, DBLP, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST), SCImago.
More information about this series at http://www.springer.com/series/11156
Chhabi Rani Panigrahi · Bibudhendu Pati · Prasant Mohapatra · Rajkumar Buyya · Kuan-Ching Li

Editors
Progress in Advanced Computing and Intelligent Engineering Proceedings of ICACIE 2019, Volume 2
Editors Chhabi Rani Panigrahi Department of Computer Science Rama Devi Women’s University Bhubaneswar, India Prasant Mohapatra Department of Computer Science University of California Davis, CA, USA Kuan-Ching Li Department of Computer Science and Information Engineering Providence University Taichung, Taiwan
Bibudhendu Pati Department of Computer Science Rama Devi Women’s University Bhubaneswar, India Rajkumar Buyya Cloud Computing and Distributed Systems (CLOUDS) Lab School of Computing and Information Systems, The University of Melbourne Melbourne, VIC, Australia
ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-981-15-6352-2 ISBN 978-981-15-6353-9 (eBook) https://doi.org/10.1007/978-981-15-6353-9 © Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
This volume contains the papers presented at the 4th International Conference on Advanced Computing and Intelligent Engineering (ICACIE 2019, www.icacie.com), held during 21–23 December 2019 at Rama Devi Women’s University, Bhubaneswar, India. There were 284 submissions, and each qualified submission was reviewed by a minimum of two Technical Program Committee members using the criteria of relevance, originality, technical quality, and presentation. The committee accepted 86 full papers for oral presentation at the conference, for an overall acceptance rate of 29%. ICACIE 2019 was an initiative of the organizers focusing on research and applications in advanced computing and intelligent engineering. The aim was also to present state-of-the-art scientific results, to disseminate modern technologies, and to promote collaborative research in the field. Researchers presented their work at the conference and had an excellent opportunity to interact with eminent professors, scientists, and scholars in their areas of research. All participants benefitted from discussions that facilitated the emergence of innovative ideas and approaches. Many distinguished professors, well-known scholars, industry leaders, and young researchers participated in making ICACIE 2019 an immense success. We also organized an industry panel discussion with invitees from software companies such as TCS, Infosys, and Cognizant, as well as entrepreneurs. We thank all the Technical Program Committee members and all reviewers/sub-reviewers for their timely and thorough participation in the review process. We express our sincere gratitude to Prof. 
Padmaja Mishra, Honourable Vice Chancellor and Chief Patron of ICACIE 2019, for allowing us to organize ICACIE 2019 on the campus and for her unending and timely support towards the organization of this conference. We would like to extend our sincere thanks to Prof. Bibudhendu Pati and Dr. Hemant Kumar Rath, General Chairs of ICACIE 2019, for their valuable guidance during the review of papers as well as in other aspects of the conference. We appreciate the time and efforts put in by the members of the local organizing team at Rama Devi Women’s University, Bhubaneswar, India,
especially the faculty members of the Department of Computer Science, student volunteers, and administrative staff, who dedicated their time and efforts to make ICACIE 2019 successful. We would like to extend our thanks to Dr. Subhashis Das Mohapatra for designing and maintaining the ICACIE 2019 website. We are very grateful to all our sponsors, especially the Department of Science and Technology (DST), Government of India, under the Consolidation of University Research for Innovation and Excellence in Women Universities (CURIE) project, for its generous support towards ICACIE 2019.

Bhubaneswar, India
Bhubaneswar, India
Davis, USA
Melbourne, Australia
Taichung, Taiwan
Chhabi Rani Panigrahi Bibudhendu Pati Prasant Mohapatra Rajkumar Buyya Kuan-Ching Li
About This Book
The book focuses on theory, practice, and applications in the broad areas of advanced computing techniques and intelligent engineering. This two-volume book includes 86 scholarly articles, accepted for presentation out of 284 submissions to the 4th International Conference on Advanced Computing and Intelligent Engineering, held at Rama Devi Women’s University, Bhubaneswar, India, during 21–23 December 2019. The first volume consists of 40 papers and the second volume contains 46 papers, for a total of 86 papers. The book brings together academic scientists, professors, research scholars, and students to share and disseminate their knowledge and scientific research related to advanced computing and intelligent engineering. It provides a platform for young researchers to learn about the practical challenges encountered in these areas of research and the solutions adopted. The book disseminates knowledge about innovative and active research directions in the field of advanced computing techniques and intelligent engineering, along with current issues and applications of related topics.
Contents
Advanced Machine Learning Applications

Prediction of Depression Using EEG: A Comparative Study . . . 3
Namrata P. Mohanty, Sweta Shree Dash, Sandeep Sobhan, and Tripti Swarnkar

Prediction of Stroke Risk Factors for Better Pre-emptive Healthcare: A Public-Survey-Based Approach . . . 12
Debayan Banerjee and Jagannath Singh

Language Identification—A Supportive Tool for Multilingual ASR in Indian Perspective . . . 25
Basanta Kumar Swain and Sanghamitra Mohanty

Ensemble Methods to Predict the Locality Scope of Indian and Hungarian Students for the Real Time: Preliminary Results . . . 37
Chaman Verma, Zoltán Illés, and Veronika Stoffová

Automatic Detection and Classification of Tomato Pests Using Support Vector Machine Based on HOG and LBP Feature Extraction Technique . . . 49
Gayatri Pattnaik and K. Parvathi

Poly Scale Space Technique for Feature Extraction in Lip Reading: A New Strategy . . . 56
M. S. Nandini, Nagappa U. Bhajantri, and Trisiladevi C. Nagavi

Machine Learning Methods for Vehicle Positioning in Vehicular Ad-Hoc Networks . . . 65
Suryakanta Nayak, Partha Sarathi Das, and Satyasen Panda

Effectiveness of Swarm-Based Metaheuristic Algorithm in Data Classification Using Pi-Sigma Higher Order Neural Network . . . 77
Nibedan Panda and Santosh Kumar Majhi

Deep Learning for Cover Song Apperception . . . 89
D. Khasim Vali and Nagappa U. Bhajantri
SVM-Based Drivers Drowsiness Detection Using Machine Learning and Image Processing Techniques . . . 100
P. Rasna and M. B. Smithamol

Fusion of Artificial Intelligence for Multidisciplinary Optimization: Skidding Track—Case Study . . . 113
Abhishek Nigam and Debi Prasad Ghosh

A Single Document Assamese Text Summarization Using a Combination of Statistical Features and Assamese WordNet . . . 125
Nomi Baruah, Shikhar Kr. Sarma, and Surajit Borkotokey

SVM and Ensemble-SVM in EEG-Based Person Identification . . . 137
Banee Bandana Das, Saswat Kumar Ram, Bibudhendu Pati, Chhabi Rani Panigrahi, Korra Sathya Babu, and Ramesh Kumar Mohapatra

A Self-Acting Mechanism to Engender Highlights of a Tennis Game . . . 147
Ramanathan Arunachalam and Abishek Kumar

Performance Evaluation of RF and SVM for Sugarcane Classification Using Sentinel-2 NDVI Time-Series . . . 163
Shyamal Virnodkar, V. K. Pachghare, V. C. Patil, and Sunil Kumar Jha

Classification of Nucleotides Using Memetic Algorithms and Computational Methods . . . 175
Rajesh Eswarawaka, S. Venkata Suryanarayana, Purnachand Kollapudi, and Mrutyunjaya S. Yalawar

A Novel Approach to Detect Emergency Using Machine Learning . . . 185
Sarmistha Nanda, Chhabi Rani Panigrahi, Bibudhendu Pati, and Abhishek Mishra

Data Mining Applications and Sentiment Analysis

A Novel Approach Based on Associative Rule Mining Technique for Multi-label Classification (ARM-MLC) . . . 195
C. P. Prathibhamol, K. Ananthakrishnan, Neeraj Nandan, Abhijith Venugopal, and Nandu Ravindran

Multilevel Neuron Model Construction Related to Structural Brain Changes Using Hypergraph . . . 204
Shalini Ramanathan and Mohan Ramasundaram
AEDBSCAN—Adaptive Epsilon Density-Based Spatial Clustering of Applications with Noise . . . 213
Vidhi Mistry, Urja Pandya, Anjana Rathwa, Himani Kachroo, and Anjali Jivani

Impact of Prerequisite Subjects on Academic Performance Using Association Rule Mining . . . 227
Chandra Das, Shilpi Bose, Arnab Chanda, Sandeep Singh, Sumanta Das, and Kuntal Ghosh

A Supervised Approach to Aspect Term Extraction Using Minimal Robust Features for Sentiment Analysis . . . 237
Manju Venugopalan, Deepa Gupta, and Vartika Bhatia

Correlation of Visual Perceptions and Extraction of Visual Articulators for Kannada Lip Reading . . . 252
M. S. Nandini, Nagappa U. Bhajantri, and Trisiladevi C. Nagavi

Automatic Short Answer Grading Using Corpus-Based Semantic Similarity Measurements . . . 266
Bhuvnesh Chaturvedi and Rohini Basak

A Productive Review on Sentimental Analysis for High Classification Rates . . . 282
Gaurika Jaitly and Manoj Kapil

A Novel Approach to Optimize Deep Neural Network Architectures . . . 295
Harshita Pal and Bhawna Narwal

Effective Identification and Prediction of Breast Cancer Gene Using Volterra Based LMS/F Adaptive Filter . . . 305
Lopamudra Das, Jitendra Kumar Das, and Sarita Nanda

Architecture of Proposed Secured Crypto-Hybrid Algorithm (SCHA) for Security and Privacy Issues in Data Mining . . . 315
Pasupuleti Nagendra Babu and S. Ramakrishna

A Technique to Classify Sugarcane Crop from Sentinel-2 Satellite Imagery Using U-Net Architecture . . . 322
Shyamal Virnodkar, V. K. Pachghare, and Sagar Murade

Performance Analysis of Recursive Rule Extraction Algorithms for Disease Prediction . . . 331
Manomita Chakraborty, Saroj Kumar Biswas, and Biswajit Purkayastha

Extraction of Relation Between Attributes and Class in Breast Cancer Data Using Rule Mining Techniques . . . 342
Krishna Mohan, Priyanka C. Nair, Deepa Gupta, Ravi C. Nayar, and Amritanshu Ram
Recent Challenges in Recommender Systems: A Survey . . . 353
Madhusree Kuanr and Puspanjali Mohapatra

Framework to Detect NPK Deficiency in Maize Plants Using CNN . . . 366
Padmashri Jahagirdar and Suneeta V. Budihal

Stacked Denoising Autoencoder: A Learning-Based Algorithm for the Reconstruction of Handwritten Digits . . . 377
Huzaifa M. Maniyar, Nahid Guard, and Suneeta V. Budihal

An Unsupervised Technique to Generate Summaries from Opinionated Review Documents . . . 388
Ashwini Rao and Ketan Shah

Scaled Document Clustering and Word Cloud-Based Summarization on Hindi Corpus . . . 398
Prafulla B. Bafna and Jatinderkumar R. Saini

Big Data Analytics, Cloud and IoT

Rough Set Classifications and Performance Analysis in Medical Health Care . . . 411
Indrani Kumari Sahu, G. K. Panda, and Susant Kumar Das

IoT-Based Modeling of Electronic Healthcare System Through Connected Environment . . . 423
Subhasis Mohapatra and Smita Parija

SEHS: Solar Energy Harvesting System for IoT Edge Node Devices . . . 432
Saswat Kumar Ram, Banee Bandana Das, Bibudhendu Pati, Chhabi Rani Panigrahi, and Kamala Kanta Mahapatra

An IoT-Based Smart Parking System Using Thingspeak . . . 444
Anagha Bhat, Bharathi Gummanur, Likhitha Priya, and J. Nagaraja

Techniques for Preserving Privacy in Data Mining for Cloud Storage: A Survey . . . 452
Ila Chandrakar and Vishwanath R. Hulipalled

A QoS Aware Binary Salp Swarm Algorithm for Effective Task Scheduling in Cloud Computing . . . 462
Richa Jain and Neelam Sharma

An Efficient Emergency Management System Using NSGA-II Optimization Technique . . . 474
V. Ramasamy, B. Gomathy, and Rajesh Kumar Verma

Load Balancing Using Firefly Approach . . . 483
Manisha T. Tapale, R. H. Goudar, and Mahantesh N. Birje

IoT Security, Challenges, and Solutions: A Review . . . 493
Jayashree Mohanty, Sushree Mishra, Sibani Patra, Bibudhendu Pati, and Chhabi Rani Panigrahi
About the Editors
Dr. Chhabi Rani Panigrahi is Assistant Professor in the P.G. Department of Computer Science at Rama Devi Women’s University, Bhubaneswar, India. She completed her Ph.D. in the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India. Her research interests include software testing and mobile cloud computing. She holds 19 years of teaching and research experience and has published several papers in international journals and conference proceedings. She is a Life Member of the Indian Society for Technical Education (ISTE) and a member of IEEE and the Computer Society of India (CSI).

Dr. Bibudhendu Pati is Associate Professor and Head of the P.G. Department of Computer Science at Rama Devi Women’s University, Bhubaneswar, India. He completed his Ph.D. at IIT Kharagpur. Dr. Pati has 21 years of experience in teaching and research. His interests include wireless sensor networks, cloud computing, big data, the Internet of Things, and network virtualization. He has published several papers in journals, conference proceedings, and books of international repute. He is a Life Member of the Indian Society for Technical Education (ISTE), a Life Member of the Computer Society of India, and a Senior Member of IEEE.

Prof. Prasant Mohapatra is serving as the Vice Chancellor for Research at the University of California, Davis. He is also a Professor in the Department of Computer Science and served as the Dean and Vice-Provost of Graduate Studies during 2016–18. He was the Department Chair of Computer Science during 2007–13. In the past, Dr. Mohapatra has held Visiting Scientist positions at Intel Corporation, Panasonic Technologies, the Institute for Infocomm Research (I2R), Singapore, and National ICT Australia (NICTA). Dr. Mohapatra received his doctoral degree from Penn State University in 1993 and received an Outstanding Engineering Alumni Award in 2008. He is also the recipient of a Distinguished Alumnus Award from the National Institute of Technology, Rourkela, India. Dr. 
Mohapatra received an Outstanding Research Faculty Award from the College of Engineering at the University of California, Davis. He received the HP Labs Innovation Awards in 2011, 2012, and 2013. He is a Fellow of the IEEE and a
Fellow of AAAS. Dr. Mohapatra’s research interests are in the areas of wireless networks, mobile communications, cyber security, and Internet protocols. He has published more than 350 papers in reputed conferences and journals on these topics. Dr. Mohapatra’s research has been funded through grants from the National Science Foundation, US Department of Defense, US Army Research Labs, Intel Corporation, Siemens, Panasonic Technologies, Hewlett Packard, Raytheon, and EMC Corporation.

Prof. Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University commercializing its innovations in cloud computing. He has authored over 650 publications and seven textbooks, including “Mastering Cloud Computing” published by McGraw Hill, China Machine Press, and Morgan Kaufmann for the Indian, Chinese, and international markets, respectively. Dr. Buyya is one of the most highly cited authors in computer science and software engineering worldwide (h-index = 120, g-index = 255, 76,800+ citations). He has been named in Clarivate Analytics’ (formerly Thomson Reuters) Highly Cited Researchers and “World’s Most Influential Scientific Minds” lists for three consecutive years since 2016. Dr. Buyya was recognized as Scopus Researcher of the Year 2017 with the Excellence in Innovative Research Award by Elsevier for his outstanding contributions to cloud computing. He served as founding Editor-in-Chief of the IEEE Transactions on Cloud Computing. He is currently serving as Editor-in-Chief of Software: Practice and Experience, a long-standing journal in the field established about 50 years ago.

Prof. Kuan-Ching Li is currently a Professor in the Department of Computer Science and Information Engineering at Providence University, Taiwan. 
He has served as the Vice-Dean of the Office of International and Cross-Strait Affairs (OIA) at the same university since 2014. Prof. Li is a recipient of awards from Nvidia, the Ministry of Education (MOE), Taiwan, and the Ministry of Science and Technology (MOST), Taiwan, as well as guest professorships from various universities in China. He received his Ph.D. from the University of Sao Paulo, Sao Paulo, Brazil, in 2001. His areas of research are networked and GPU computing, parallel software design, and performance evaluation and benchmarking. He has edited two books, Cloud Computing and Digital Media and Big Data, both published by CRC Press. He is a Fellow of the IET, a Senior Member of the IEEE, and a member of TACC.
Advanced Machine Learning Applications
Prediction of Depression Using EEG: A Comparative Study Namrata P. Mohanty(B) , Sweta Shree Dash, Sandeep Sobhan, and Tripti Swarnkar Department of Computer Science and Engineering, ITER, S’O’A (Deemed to be University), Bhubaneswar, India [email protected], [email protected], [email protected], [email protected]
Abstract. Depression, a cause of worldwide havoc, is on the rise in this era. Depression is not a specific disease in itself; rather, it is a determinant factor in the onset of numerous serious diseases. With the growth of automation and artificial intelligence, it has become possible to predict depression at a much earlier stage. Machine learning techniques are used in the classification of EEG signals for the prediction of different neurological problems. EEG signals record the brain’s electrical activity, in which abnormalities can be detected, making it easier to predict seizure formation or depression. The proposed work uses EEG signals for the analysis of brain waves and thereby the prediction of depression. In this paper, we compare two widely used benchmark models, k-NN and ANN, for the prediction of depression, achieving an accuracy of 85%. This method will help doctors and medical associates predict the disease before the onset of its extreme phase and assist them in providing the best possible treatment in time. Keywords: Depression · EEG · ANN · k-NN
1 Introduction

Depression is becoming one of the most widely spreading disability-causing disorders around the globe, expanding at a very fast pace and affecting more and more subjects. It can be triggered by various circumstances such as peer pressure, an acute disease, family issues, or career tensions. Mostly, it is associated with changes in brain waves and the formation of seizures due to a persistent feeling of stress and sadness [3]. One of the most effective ways of detecting it is by recording EEG signals. EEG is a noninvasive and low-cost way of measuring the brain’s electrical activity, which reveals abnormalities or deviations from normal brain waves and is therefore helpful in detecting depressive symptoms in a patient. In today’s world, developments in human–computer interaction (HCI) and its machine learning techniques have made it much more feasible to detect this complicated, disability-causing disorder, i.e., Major Depressive Disorder (MDD). © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_1
The main objective of this empirical study is to compare the performance of two benchmark classifiers in classifying depression from EEG signals, so that the approach can help doctors predict depression and provide the best preventive measures to patients before its onset. In this paper, we performed the experiment by feeding the dataset into two widely used classifiers, k-NN and ANN, and successfully classified depressed patients and normal subjects with an accuracy of 85%.
2 Literature Review

In 2008, Brahmi et al. [8] performed classification of EEG signals using a back-propagation neural network, achieving an accuracy of 93% and a specificity of 94%. They used neural networks and wavelet packet coefficients to distinguish among awake stage 1 + REM, stage 2, and Slow Wave Sleep (SWS) in EEG signals. In 2011, Hosseinifard et al. [10] performed linear and nonlinear feature extraction along with classification using the k-NN, LDA, and LR classifiers, obtaining an accuracy of 90% in classifying depressed patients and normal subjects. In 2017, Liao et al. [11] carried out the classification of depression using SVM, one of the standard and most efficient classifiers. In their experiment, a robust spectral-spatial EEG feature extractor was used to cope with the absence of biological and psychological markers, and an accuracy of 81.23% was obtained. In 2010, Chisci et al. [12] built a seizure prediction model for detecting seizure formation in the brain, which leads to depression as well as the associated diseases. Individuals with depression or anxiety are more likely to suffer the ill effects of epilepsy than those without depression or anxiety. Different brain regions, including the frontal, temporal, and limbic regions, are associated with the biological pathogenesis of depression in individuals with epilepsy [14, 17]. Machine learning techniques are of great help in the detection of epilepsy from the analysis of EEG signals [17]. In 2018, Acharya et al. focused on seizure formation and prediction, and in particular on how depression is related to seizure formation, which is generally due to sudden changes in the electrical activity of the brain [15]. 
In 2009, Piotr Mirowski et al. successfully investigated the efficiency of employing bivariate measures to predict seizures, which often co-occur with depression, with a sensitivity of 71% [19]. With the advancement of science and technology, machine learning tools and techniques can predict depression at a much earlier time, thereby keeping this disabling disorder at bay [20]. Machine learning algorithms learn, extract, identify, and map underlying patterns to identify groupings of depressed individuals without constraint [21].
3 Materials and Methods

Our main focus here is analyzing the two widely known classifiers which can be used in the field of medical science to predict depression from EEG signals at an early stage. Our implementation process follows the pipeline shown in Fig. 1.
[Fig. 1. Implementation procedure: EEG signals → data preprocessing (filtering, ICA) → feature extraction (minimum value, maximum value, mean value, and standard deviation) → classification (k-NN and ANN) → comparison of results (accuracy and time taken).]
The whole implementation process has been carried out in the system bearing the following specification: Processor: Intel(R) Pentium (R) CPU N3540 @2.16 GHz 2.16 GHz, Installed Memory (RAM): 8.00 GB, System Type: 64 bit Operating System, ×64-based processor. 3.1 Data Preprocessing Before preprocessing the data, we have to convert it to the .wav form in order to make it suitable for being used by the Matlab. Here, we have used the edf2wav online browser [1], for the conversion of the EEG signals to the .wav format. The .wav format is then imported by the Matlab for data preprocessing and further implementations. The first step in the preprocessing is the filtering of the raw EEG signals. In this step, the signals of the specified frequencies in the range of 0–30 Hz containing the alpha, beta, theta, gamma, and the delta waves get selected and the rest are rejected. The next step is the Independent Component Analysis (ICA) was performed which helps in the removal of the artifacts such as the eye blinking, etc., from the selected wave range. Further, the real values of the data were obtained from the preprocessed EEG signals, which makes it easier to extract the specific features for the classification process. The obtained dataset is of size 12.6 MB containing 50 samples having 10240 data points each. 3.2 Feature Extraction Feature extraction basically refers to identifying any uniquely recognized patterns from a group of classified data in order to predict its outcomes. These are meant to reduce the amount of loss of information that has been fed to the system and at the same time, it simplifies the implementation process due to the reduction in the amount of data. From the obtained EEG signals it has been observed that physiological features were highly correlated with the state of arousal among two subjects. 
A feature can be considered significant and selected as input to the classifier if its absolute correlation for physiological features across subjects is high [6]. Selecting highly correlated features helps to exclude features that are less relevant to the affective state and emotional expressions. Considering the above studies, statistical features, namely the minimum value, the maximum value, the mean value, and the standard deviation, were selected to represent the EEG signals.
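The four statistics above are computed per sample. As a minimal sketch (the authors' pipeline runs in MATLAB; this Python equivalent is only illustrative), each preprocessed sample of real values collapses to a four-element feature vector:

```python
from statistics import mean, stdev

def extract_features(sample):
    """Collapse one preprocessed EEG sample (its list of real values)
    into the four statistics used as classifier inputs."""
    return {
        "min": min(sample),
        "max": max(sample),
        "mean": mean(sample),
        "std": stdev(sample),   # sample standard deviation
    }

feats = extract_features([1.0, 2.0, 3.0, 4.0, 5.0])
# feats["min"] == 1.0, feats["max"] == 5.0, feats["mean"] == 3.0
```

Applied to the dataset described above, this turns each 10240-point sample into a 4-dimensional vector, which is the dimensionality reduction the text refers to.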
N. P. Mohanty et al.
3.3 Classification

Since we have labelled data, this research work falls under the supervised learning branch of machine learning. For our classification, we have taken two widely used classifiers: the k-Nearest Neighbor (k-NN) and the Artificial Neural Network (ANN).

3.3.1 k-Nearest Neighbor (k-NN)

k-NN has become one of the most popular classification techniques because of its easily interpretable methodology, high predictive power, and low computation time. In k-NN, an object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors [26].

3.3.2 Artificial Neural Networks

ANNs trace back to early models of biological neurons and draw inspiration from, among others, the visual-cortex studies of Nobel laureates Hubel and Wiesel. The model mimics the human brain: it processes information and produces output much as a human brain does. It is one of the most effective classification tools and is frequently used for disease classification and prediction, especially in the medical world [27]. In our paper, we feed the EEG signals to a neural network model where they are processed through two hidden layers to a single output layer, using the rectified linear unit (ReLU) activation function for the hidden layers and the binary sigmoid activation function for the output layer.

3.4 Performance Measures

Performance measures of the models were then calculated in terms of ROC, accuracy, and time taken. Time taken is one of the prime aspects on which the performance has been judged: the less time a model takes to operate, the more efficient it is considered to be.
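The plurality vote described in Sect. 3.3.1 can be sketched in a few lines (a toy illustration, not the actual experiment; the cluster data stand in for the extracted EEG feature vectors, and the ANN is analogous but omitted here):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a plurality vote among the k training
    points nearest to it under Euclidean distance."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Two toy clusters standing in for extracted feature vectors.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
```

For example, `knn_predict(train_X, train_y, (0.5, 0.5))` votes among the three lower-left points and returns class 0.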
Prediction of Depression Using EEG: A Comparative Study

4 Results and Discussions

EEG records the brain signals over a particular period of time, which then shows whether the obtained signals match the normal waves or not. We concentrate on five types of brain signals based on their frequency ranges, i.e., delta, theta, alpha, beta, and gamma. Signals are measured for a short duration, i.e., 20–40 min, and are produced by the continuous electrical activity of neurons within the brain. Many EEG databases of both normal subjects and controls are available online and are widely used for research purposes. In our study, we have taken the DEAP dataset [5], which is freely available online for research purposes only.

In our experiment, we have taken the EEG signals for the analysis of the brain waves. Various preprocessing steps have been carried out on different platforms, such as edf2wav and the EEGLAB toolbox, to get filtered data, i.e., EEG signals free from artifacts. For further data preprocessing, the EEGLAB toolbox, available in the MATLAB platform, has been used, which makes EEG signal preprocessing much easier. Data preprocessing here includes filtering, epoch selection, and independent component analysis (ICA). Figure 2 shows the EEGLAB platform we have used for the EEG data preprocessing.
Fig. 2. EEGLAB toolbox: for EEG signal processing [2]
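The 0–30 Hz band selection performed in EEGLAB can be illustrated with a crude frequency-domain low-pass filter. This is only a stand-in sketch (the sampling rate and synthetic signal are assumptions, and real pipelines use proper FIR/IIR filters):

```python
import numpy as np

def lowpass_fft(signal, fs, cutoff_hz=30.0):
    """Zero every frequency bin above cutoff_hz and transform back,
    keeping only the 0-30 Hz band that carries the waves of interest."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

fs = 256                                   # assumed sampling rate (Hz)
t = np.arange(0, 2, 1.0 / fs)
# A 10 Hz "alpha" wave contaminated by 60 Hz mains noise.
raw = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)
clean = lowpass_fft(raw, fs)               # the 60 Hz component is removed
```

The 10 Hz component survives the filter while everything above 30 Hz is suppressed, mirroring the band selection described in Sect. 3.1.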
Then the real values are obtained from the filtered signals, which are then used for the feature extraction and classification processes. Four features were extracted, i.e., the minimum value, maximum value, mean value, and standard deviation, in order to obtain better classification results in the subsequent processes. On our system, k-NN took nearly 12 h to generate the confusion matrix, while ANN took nearly 15 h to compute the results for the 12.6 MB data file. In our experiment, the classification accuracies of k-NN and ANN on the train set are 83.2% and 87.5%, respectively. On the training data set, the accuracy obtained by ANN is higher than that of the k-NN classifier (Table 1 and Fig. 3). The test set accuracies of both k-NN and ANN show a similar pattern, as seen from Table 2 and Fig. 4, where ANN possesses higher accuracy than k-NN.
Table 1. Classification accuracies by selected techniques for train set (Fig. 3)

Techniques  Accuracy (%)
k-NN        83.2
ANN         87.5
Fig. 3. Classification accuracies by selected techniques for train set (set A)
Table 2. Classification accuracies by selected techniques for the test set (Fig. 4)

Techniques  Accuracy (%)
k-NN        74.6
ANN         80.3
Fig. 4. Classification accuracies by selected techniques for test set (set B)
We have performed the classification process using two well-known machine learning classifiers, the k-NN and the ANN, and obtained the confusion matrix for analysing the performance of the two models. In both cases, the true positive and true negative rates are higher for the ANN model. The accuracy of ANN is thus nearly 80.3%, compared to 74.6% for k-NN, as ANN has better processing capacity owing to its interconnected neurons, much like a human brain.
Fig. 5. Comparison of the accuracies between train set and test set
Due to the visible differences in the accuracies obtained from the two selected techniques, we have plotted a comparison graph to make it easier to select a particular technique for future research. From Fig. 5, we can notice that ANN gives a better accuracy rate than k-NN, a clear indication that more studies should be done on neural networks, which can be of tremendous help in the fields of medical and paramedical sciences.
5 Conclusion

Our work has demonstrated that neural networks have the potential to predict depression with greater accuracy than other machine learning techniques. Though ANN has given more accurate results than k-NN, it also takes more time; this could be reduced by lowering the dimensionality through more efficient feature selection and by running the model on a system with a faster processor and more RAM. Moreover, further research should be done on neural networks considering real-time data acquisition, including investigation and analysis of complex brain structures. Last but not least, depression is something that should not be taken lightly; proper check-ups by experienced professionals should be done in due time so as to curb it before the onset of its extreme phase.
References

1. European Data Format (EDF). http://www.edfplus.info
2. MathWorks—MATLAB and Simulink for Technical Computing. https://www.mathworks.com
3. Mallikarjun, H.M., Suresh, H.N.: Depression level prediction using EEG signals processing. In: International Conference on Contemporary Computing and Informatics (IC3I), pp. 928–933 (2014)
4. Biosemi EEG ECG EMG BSPM NEURO amplifiers systems. http://www.biosemi.com/faq/file_format.htm
5. https://www.eecs.qmul.ac.uk/mmv/datasets/deap/download.html
6. Khan, N.A., Jönsson, P., Sandsten, M.: Performance comparison of time-frequency distributions for estimation of instantaneous frequency of heart rate variability signals. Appl. Sci. 7(3), 221 (2017). https://doi.org/10.3390/app7030221
7. Gautam, R., Shimi, S.L.: Features extraction and depression level prediction by using EEG signals. Int. Res. J. Eng. Technol. (IRJET) 04(05) (2017)
8. Ebrahimi, F., Mikaeili, M., Estrada, E., Nazeran, H.: Automatic sleep stage classification based on EEG signals by using neural networks and wavelet packet coefficients. In: 30th Annual International IEEE EMBS Conference, Vancouver, British Columbia, Canada, August 20–24, 2008, pp. 1151–1154. https://doi.org/10.1109/iembs.2008.4649365
9. Knott, V., Mahoney, C., Kennedy, S., Evans, K.: EEG power, frequency, asymmetry and coherence in male depression. Psych. Res. Neuroimaging Sect. 106, 123–140 (2001)
10. Hosseinifard, B., Moradi, M.H., Rostami, R.: Classifying depression patients and normal subjects using machine learning techniques. In: 2011 19th Iranian Conference on Electrical Engineering, Tehran, pp. 1–1 (2011)
11. Liao, S.-C., Wu, C.-T., Huang, H.-C., Cheng, W.-T., Liu, Y.-H.: Major depression detection from EEG signals using kernel eigen-filter-bank common spatial patterns. Sensors (Basel) 17(6), 1385 (2017). https://doi.org/10.3390/s17061385
12. Chisci, L., Mavino, A., Perferi, G., Sciandrone, M., Anile, C., Colicchio, G., Fuggetta, F.: Real-time epileptic seizure prediction using AR models and support vector machines. IEEE Trans. Biomed. Eng. 57(5), 1124–1132 (2010). https://doi.org/10.1109/TBME.2009.2038990
13. Karim, H.T., Wang, M., Andreescu, C., Tudorascu, D., Butters, M.A., Karp, J.F., Reynolds, C.F., III, Aizenstein, H.J.: Acute trajectories of neural activation predict remission to pharmacotherapy in late-life depression. Neuroimage Clin. 8(19), 831–839 (2018). https://doi.org/10.1016/j.nicl.2018.06.006
14. Kwon, O.-Y., Park, S.-P.: Depression and anxiety in people with epilepsy. J. Clin. Neurol. 10(3), 175–188 (2014). https://doi.org/10.3988/jcn.2014.10.3.175
15. Acharya, U.R., Hagiwara, Y., Adeli, H.: Automated seizure prediction. Epilepsy Behav. 88, 251–261 (2018). https://doi.org/10.1016/j.yebeh.2018.09.030
16. Varatharajah, Y., Iyer, R.K., Berry, B.M., Worrell, G.A., Brinkmann, B.H.: Seizure forecasting and the preictal state in canine epilepsy. Int. J. Neural Syst. 27, 1650046 (2017)
17. Günay, M., Ensari, T.: EEG signal analysis of patients with epilepsy disorder using machine learning techniques. In: 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), Istanbul, pp. 1–4 (2018)
18. Kumar, P.N., Kareemullah, H.: EEG signal with feature extraction using SVM and ICA classifiers. In: International Conference on Information Communication and Embedded Systems (ICICES 2014), Chennai, pp. 1–7 (2014). https://doi.org/10.1109/icices.2014.7034090
19. Mirowski, P., Madhavan, D., LeCun, Y., Kuzniecky, R.: Classification of patterns of EEG synchronization for seizure prediction. Clin. Neurophysiol. 120(11), 1927–1940 (2009)
20. Jauhar, S., Krishnadas, R., Nour, M.M., Cunningham-Owens, D., Johnstone, E.C., Lawrie, S.M.: Is there a symptomatic distinction between the affective psychoses and schizophrenia? A machine learning approach. Schizophr. Res. 202, 241–247 (2018). https://doi.org/10.1016/j.schres.2018.06.070
21. Dipnall, J.F., Pasco, J.A., Berk, M., Williams, L.J., Dodd, S., Jacka, F.N., Meyer, D.: Why so GLUMM? Detecting depression clusters through graphing lifestyle-environs using machine-learning methods (GLUMM). Eur. Psych. 39, 40–50 (2017). https://doi.org/10.1016/j.eurpsy.2016.06.003
22. Liu, A., et al.: Machine learning aided prediction of family history of depression. In: 2017 New York Scientific Data Summit (NYSDS), New York, NY, pp. 1–4 (2017). https://doi.org/10.1109/nysds.2017.8085046
23. Sri, K.S., Rajapakse, J.C.: Extracting EEG rhythms using ICA-R. In: IEEE International Joint Conference on Neural Networks, IJCNN 2008 (IEEE World Congress on Computational Intelligence), pp. 2133–2138 (2008)
24. Malmivuo, J., Plonsey, R.: Bioelectromagnetism: Principles and Applications of Bioelectric and Biomagnetic Fields. Oxford University Press (1995)
25. Delorme, A., Makeig, S.: EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134(1), 9–21 (2004)
26. Wu, Y., Ianakiev, K., Govindaraju, V.: Improved k-nearest neighbor classification. Pattern Recogn. 35(10), 2311–2318 (2002)
27. Ho, C.K., Sasaki, M.: EEG data classification with several mental tasks. In: 2002 IEEE International Conference on Systems, Man and Cybernetics, vol. 6, p. 4 (2002)
28. About GSIL—Blekinge Institute of Technology—in real life. http://www.bth.se/com/gsil. Accessed 05 August 2012
Prediction of Stroke Risk Factors for Better Pre-emptive Healthcare: A Public-Survey-Based Approach Debayan Banerjee(B) and Jagannath Singh KIIT Deemed to be University, Bhubaneswar, India [email protected], [email protected]
Abstract. This work endeavours to explore the relation between certain behavioural traits and prevalent diseases among the sample population, reported in a public health survey, by means of machine learning techniques. Predictive models are developed to ascertain the statistical significance of these traits while also checking the fitness of the models to predict the diseases in a non-invasive way. Our study focuses on cardiovascular stroke using the BRFSS database of the CDC, USA. The proposed model achieves 0.71 AuC in predicting stroke from purely behavioural features. Further analysis reveals an interesting behavioural trait, the proper maintenance of an individual's work–life balance, alongside the three main conventional habits (regular physical activity, healthy diet, and abstinence from heavy smoking and drinking) as the most significant factors for reducing the risk of potential stroke.

Keywords: Stroke prediction · Behavioural features · Predictive model · Gradient boosting · BRFSS

1 Introduction
We propose to underline the significance of pre-emptive healthcare for cardiovascular stroke by discerning behavioural traits which may play a crucial role in the gradual development of health conditions that incline towards stroke. Behaviours that affect health negatively, such as lack of regular physical activity, lack of calibrated and nutritious food intake, and unrestrained tobacco use and alcohol consumption, may, if continued for long, result in health conditions that lead to stroke.1 Thus, to prevent the looming risk of stroke, positive behavioural changes are indispensable.

1 https://www.cdc.gov/stroke/behavior.htm

© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_2

In the United States (U.S.), stroke is the fifth leading cause of death, claiming one life out of every 20 deaths per year, with more than 0.7 million stroke patients per year [1]. Moreover, from the government's perspective, the U.S. spends more than $34 billion on stroke each year, which consists of the cost of personal healthcare services, medicines for stroke treatment and missed working days2 [1]. For an individual, this underpins the fact that little changes towards a healthier lifestyle can eliminate a very significant amount of expenditure on health.

In the present work, we aim to find out the relation between behavioural traits and the chance of stroke using machine learning (ML) techniques, and to further specify which traits dominate the list of possible risk factors regarding stroke. We apply our analysis to the Behavioral Risk Factor Surveillance System (BRFSS)3, the largest health-related telephonic survey of the United States, which contains a significant number of behavioural features. By purely behavioural features we mean those that are directly controllable or negotiable (or both) by an individual without any monetary requirement (excluding insurance). Behaviours influenced by social context, mental health, etc., are thus excluded, so that our results generalize as far as possible, stay relevant for mass awareness, and remain the least demography-constrained.

To achieve our objective, we first identify the possible risk factors contributing to stroke from a set of selected behavioural traits by using a GBM (Gradient Boosting Machine)-based predictive model. Then we analyse the impact of the individual features on the model outcome to prove the soundness of the features statistically. Our main contributions can be outlined through the following:

– We are able to discern a set of behavioural traits as possible risk factors regarding stroke, along with their comparative individual statistical significance in contributing to stroke.
– Apart from the conventional habits, our analysis identifies that maintaining a healthy work–life balance, as well as sharing household responsibilities in relevant cases, can be significantly beneficial to maintaining good health.
– Furthermore, our findings are based on the whole BRFSS dataset, signifying the whole U.S. Thus, not being constrained to specific demographics or states, our results provide a holistic view to anyone concerned with this matter: individuals and policymakers.
2 Related Works
Positive Behavioural Changes in Prevention of Stroke: As discussed in Sect. 1, chronic diseases such as stroke are most of the time preventable, and their chances of gradual development can be minimized by changing negative behavioural traits, leading to a healthy lifestyle. The Centers for Disease Control and Prevention (CDC), U.S., argues that a large percentage of stroke cases can be eliminated by eliminating the three main risk factors: unhealthy diet [24], excessive smoking [2] and lack of enough physical activity4 [26]. Other epidemiological studies also highlight the cardinal importance of quantifying the impact of lifestyle on public health, owing to the possibility of granular-level control by individuals themselves to counter chronic diseases such as stroke [3].

Machine Learning in Healthcare: Predicting diseases solely from reports of patient behaviours can be a difficult problem, especially with a survey dataset. Recent advances in machine learning algorithms have opened up opportunities to navigate this complex problem of disease prediction from health survey datasets and to determine risk factors behind the diseases in question. Decision trees are used in cases where individual prediction decisions need to be traceable [4]. A decision tree is composed of several weak learners [4], meaning that each learner is only slightly or weakly correlated with the true classification.5 Thus, the performance can be biased in favour of the majority class of the target if the dataset in contention already has much bias or variance or both. Hence, to improve on a purely decision-tree-based classifier, random forests [5] and gradient boosting [8] may be used. The gradient boosting algorithm [8] uses gradient descent, an iterative method that moves along the direction of steepest descent, defined by the negative of the gradient of the function, to find a local minimum of that function. Decision trees are used as weak learners in gradient boosting, where they are added one at a time, unlike in a random forest, where all trees are built independently without the gradient descent procedure that minimizes the loss while adding trees [7]. Overall, random forests are built to reduce variance [5], whereas gradient boosting reduces bias [8].

2 https://www.cdc.gov/stroke
3 https://www.cdc.gov/brfss/index.html
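The boosting procedure just described can be sketched minimally with depth-1 trees (stumps) on a single feature under squared loss, where the negative gradient is simply the residual. This is an illustrative toy, not H2O's GBM; all names are ours:

```python
def fit_stump(xs, residuals):
    """Depth-1 regression tree: the threshold split on one feature
    that minimizes squared error against the residuals."""
    best = None
    for thr in sorted(set(xs))[:-1]:   # the largest value gives an empty right side
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x <= thr else rmean

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    """Squared-loss boosting: each new stump is fitted to the residuals
    (the negative gradient) of the current ensemble, one at a time."""
    base = sum(ys) / len(ys)
    pred = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Toy 1-D example: a step function is recovered round by round.
model = gradient_boost([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1])
```

Each round only moves the prediction a fraction `lr` towards the residual, which is the bias-reducing, one-tree-at-a-time behaviour contrasted with random forests above.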
Existing Works on Stroke Prediction: Among the existing works on stroke prediction, Yang, Zhong et al. study the risks in state-level demographics [13]. Akdag et al. implement classification trees for finding the risk factors of hypertension from an observation conducted on hospital patients in Turkey [14]. Sunmoo Yoon et al. work on the prediction of disability, one of the results of stroke, and how the types of disability are associated and correlated with stroke [11]. Alkadry et al. detail the disparity in stroke awareness across demographics [9]. Howard et al. forward the importance of a self-reported or questionnaire-based approach in categorizing the general levels of risk among respondents [17]. Luo et al. also demonstrate the impact of stroke across demographics as reported in the BRFSS, using a regression model that checks whether there is a relation between the two [10]. Nuyujukian et al. employ a logistic regression model to show the association between length of sleeping hours and stroke across ethnicities [25]. To the best of our knowledge, there is no existing work that performs stroke prediction on the basis of the whole BRFSS dataset, nor any that focuses on purely behavioural features. Most of the other works either find correlations between certain features and stroke [19] or are restricted to only certain states [20]. The downside of finding only correlations is that a correlation merely denotes a number indicating the strength of the relationship between two variables, whereas prediction shows how saliently the target can be hypothesized from the predictors.

The Class Imbalance Problem and Its Proposed Solution: Apart from the model selection issues, class imbalance is a common problem that arises especially in medical datasets. Moreover, in the case of a public survey, this is aggravated by missing values and wrong responses about stroke that are not present in clinical data. Class imbalance often leads the resulting prediction model to learn with a bias towards the majority class. For example, if in a dataset the ratio of label counts between the majority and minority class is 25:1, then an accuracy-driven classifier may yield an accuracy of more than 90% simply by disregarding the minority class instances and classifying all instances as belonging to the majority class. This problem has been mentioned and worked upon by Wang and Yao [15], who propose sub-sampling as a way to get better results.

4 https://www.cdc.gov/stroke/behavior.htm
5 https://en.wikipedia.org/wiki/Boosting_(machine_learning)
3 Dataset Collection and Description
The dataset for this work is taken from the Behavioral Risk Factor Surveillance System (BRFSS)6 conducted by the CDC. Starting in 1984, BRFSS has collected data in all U.S. states and territories. This health survey is conducted over the telephone and covers most of the potential health risk factors, health-related behavioural practices and health conditions. The resulting dataset is shared with the public and is available for free [12]. CDC's official website7 publishes the relevant questionnaire and detailed dataset encoding for a very detailed and comprehensive understanding. In the present work, we use the BRFSS 2012 dataset, which records 475687 observations (See Footnote 7).
4 Feature Selection
From the survey features in BRFSS 2012 related to stroke that are purely behavioural, and excluding our target variable denoting a positive or negative response for stroke, we pick out 15 features as candidate risk factors, as discussed in Sect. 1 and the BRFSS Codebook (See Footnote 7). In Table 1, we present the variable names as coded in BRFSS.
6 https://www.cdc.gov/brfss/index.html
7 https://www.cdc.gov/brfss/annual_data/annual_2012.html
Table 1. BRFSS selected features

Index  BRFSS code  Meaning
1      CVDSTRK3    Ever told you had a stroke
2      USENOW3     Usage of tobacco products other than cigars
3      SSBSUGR1    Intake of sugar sweetened beverages excluding diet soda or diet pop over last 30 days
4      SSBFRUT1    Intake of sugar sweetened fruit drinks including fruit drinks made at home with added sugar over last 30 days
5      X.SMOKER3   Four-level smoker status: Everyday, Someday, Former, Non-smoker
6      X.TOTINDA   Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
7      AVEDRNK2    Total number of occasions of alcohol drinking in last 30 days
8      FRUIT1      Excluding fruit juice, intake frequency of fruits over last 30 days
9      FRUITJU1    Frequency of 100% PURE fruit juices excluding fruit-flavoured drinks with added sugar over past 30 days
10     FVBEANS     Intake frequency of beans (all types) over last 30 days
11     FVGREEN     Intake frequency of green-coloured vegetables (all types) over last 30 days
12     FVORANG     Intake frequency of orange-coloured vegetables (all types) over last 30 days
13     VEGETAB1    Not counting 10, 11 and 12, intake frequency of vegetables over last 30 days
14     SLEPTIME    Average sleeping hours per day
15     SCNTWRK1    Average work hours per week
16     X.CHLDCNT   Number of children in household (Categories: 0, 1, 2, 3, 4, 5)
According to Sect. 2, we do not find any other unique features mentioned in the survey data that pertains to our goal. In our work, the response for stroke is deemed positive when an individual’s answer is yes to the survey question: ‘Ever told you had a Stroke? ’ as mentioned in Table 1.
5 Feature Cleaning
To remove NaN (Not a Number) values, we replace them with the mean of the feature in the case of continuous values (e.g. SSBSUGR1 in Table 1). For categorical values (e.g. X.CHLDCNT in Table 1), we assume that they belong to the category reporting the most responses (i.e., for X.CHLDCNT the category is 0) (See Footnote 7). This step was found valid as the number of outliers was negligible according to Cook's distance [27].
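The two imputation rules above can be sketched as follows (an illustrative helper of our own naming, not BRFSS tooling):

```python
import math
from collections import Counter

def _is_nan(v):
    return isinstance(v, float) and math.isnan(v)

def impute(column, categorical=False):
    """Fill NaN entries: feature mean for continuous columns,
    most frequent category for categorical ones."""
    present = [v for v in column if not _is_nan(v)]
    if categorical:
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = sum(present) / len(present)
    return [fill if _is_nan(v) else v for v in column]

nan = float("nan")
cont = impute([1.0, nan, 3.0])                   # -> [1.0, 2.0, 3.0]
cat = impute([0, 0, nan, 1], categorical=True)   # -> [0, 0, 0, 1]
```

The categorical branch reproduces the X.CHLDCNT rule: the missing entry takes the modal category.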
6 Experimental Setup and Methodology

6.1 Experimental Setup
We use H2O,8 an open-source, in-memory, distributed machine learning (ML) and predictive analytics platform that allows the user to build ML models. The in-memory data storage of H2O especially helps with large databases, as H2O's internal database management system relies primarily on main memory rather than on disk storage, which is slower to access.

6.2 Choice of Machine Learning Algorithm
As discussed in Sect. 2, we choose GBM as our go-to ML algorithm for stroke prediction because it best handles high bias with low variance [8]. Figure 1 shows that there is a high class imbalance between positive and negative responses of stroke, implying the possibility of a heavily weighted bias in the model outcome towards the negative class, as discussed in Sect. 2 with reference to Wang and Yao [15]. Figure 2 shows that none of the selected features is strongly correlated (more than 50%) with any other, indicating very little variance; hence our choice of GBM, suited to low variance and high bias, is vindicated.

6.3 Model Building
We split the dataset into train, test and calibration sets in the ratio 89:10:1 (in percentage of the total number of observations or rows). Our predictors are all the features numbered 2 to 16 in Table 1, and the target variable is number 1, i.e. CVDSTRK3.

6.4 Dealing with Class Imbalance
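The downsampling applied in this section can be sketched as a simple random thinning of the majority class. This is an illustrative version of our own (the 1:4.74 target ratio is from our setup; function and variable names are ours, not H2O's API):

```python
import random

def downsample_majority(rows, labels, ratio=4.74, seed=42):
    """Randomly drop majority-class (label 0) rows so that the
    minority:majority ratio becomes roughly 1:`ratio`."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), int(len(minority) * ratio))
    keep = random.Random(seed).sample(majority, n_keep)
    idx = sorted(minority + keep)
    return [rows[i] for i in idx], [labels[i] for i in idx]

# A 1:25 imbalance reduced to roughly 1:4.74.
rows = list(range(2600))
labels = [1] * 100 + [0] * 2500
bal_rows, bal_labels = downsample_majority(rows, labels)
```

All minority rows are kept; only majority rows are sampled away, which is why a later calibration step is needed to undo the distortion of the class prior.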
As also discussed in Sect. 2, we apply downsampling to reduce the class imbalance reported in Fig. 1, bringing the ratio to 1:4.74 as shown in Fig. 3. Moreover, Fig. 2 suggests that there is no deterministic correlation between the selected features. Hence, this dataset does not suffer from the class overlapping problem, where one feature heavily influences another [1], and no particular downsampling method would be significantly better than another [16].

6.5 Individual Feature Predictive Power
Next, we construct a Generalized Linear Model (GLM) [23] working as a binary classifier for each individual feature to examine its power to predict the target variable, measured by the Area under Curve (AuC) score. Figure 4 depicts the result.

8 https://www.h2o.ai
Fig. 1. Label count of stroke—high class imbalance between positive and negative responses
Fig. 2. Correlation plot of selected features
Section 7 shows that our final GBM model gives an AuC of 0.71 with all 15 features as predictors, whereas Fig. 4 shows that the highest individual binary classifier AuC is just above 0.60, i.e. around 0.10 less than the final AuC. Hence, we can confirm that no single feature would dominate the predictive model if all are used together, allowing each feature to contribute without hindrance or bias, as none of these features is strongly correlated with our target variable.
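The AuC used to score both the single-feature classifiers and the final model can be computed directly from raw scores with the rank-based (Mann-Whitney) estimator; this sketch is ours, not H2O's internal routine:

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic:
    the probability that a randomly chosen positive instance is
    scored above a randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One positive ranked below one negative out of four pairs: AuC = 0.75.
example = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AuC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the jump from roughly 0.60 (best single feature) to 0.71 (all features) indicates a genuinely combined contribution.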
Fig. 3. Downsampled label count, showing the impact of downsampling on the label count
6.6 Calibration of Initial Result
As downsampling modifies the original distribution of the observations, Platt scaling is performed with the help of the calibration set mentioned in Sect. 6.3 to calibrate the outcome of the model as suggested by Platt [15].
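Platt scaling fits a sigmoid to (score, label) pairs from the calibration set so that raw scores map to calibrated probabilities. A minimal gradient-descent version of the idea is sketched below (names and hyperparameters are illustrative; H2O performs this internally):

```python
import math

def platt_scale(scores, labels, lr=0.1, epochs=2000):
    """Fit sigma(a*s + b) to calibration-set (score, label) pairs by
    gradient descent on the log loss, mapping raw model scores to
    calibrated probabilities."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s / n
            grad_b += (p - y) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

# Calibrate four held-out scores; the mapping is monotone in the score.
calibrate = platt_scale([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

Because the fitted map is monotone, the AuC is unchanged; only the probability values are corrected for the class prior distorted by downsampling.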
7 Model Accuracy

Due to the existing class imbalance, we use the Area under Curve (AuC) and the confusion matrix as our metrics to understand the model accuracy, since in such cases they are found to yield the best evaluation of a classifier, as suggested by Sokolova et al. [21]. Our model achieved 0.71 AuC with an accuracy of 68.08% (1 − Total Error). Total accuracy may also be calculated as per Eq. 1:

Accuracy = (TruePositive + TrueNegative) / (TruePositive + FalseNegative + TrueNegative + FalsePositive)   (1)

The associated confusion matrix is shown in Table 2.
Table 2. Resulting confusion matrix

                    Predicted no       Predicted yes      Error rates (*100)
Actual no           6388               2827               2827/9215 = 30.67
Actual yes          744                1231               744/1975 = 37.67
Error rates (*100)  744/7132 = 10.43   2827/4058 = 69.66  3571/11190 = 31.91
According to the confusion matrix (Table 2), the false positive rate (Actual: No, Predicted: Yes) is 30.67%, whereas the false negative rate (Actual: Yes, Predicted: No) is 37.67%. The process to calculate AuC from a confusion matrix is described in the work of Kumar et al. [22]. The associated Receiver Operating Characteristic curve is given in Fig. 5. The true positive rate is calculated as per Eq. 2:

True Positive Rate = 1 − FalseNegative / (TruePositive + FalseNegative)   (2)
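Equations 1 and 2 can be checked directly against the counts in Table 2:

```python
# Counts taken from Table 2 (the resulting confusion matrix).
tn, fp = 6388, 2827   # actual no:  predicted no / predicted yes
fn, tp = 744, 1231    # actual yes: predicted no / predicted yes

accuracy = (tp + tn) / (tp + fn + tn + fp)   # Eq. 1, approx 0.6809
tpr = 1 - fn / (tp + fn)                     # Eq. 2, approx 0.6233
fpr = fp / (fp + tn)                         # matches the 30.67% in Table 2
fnr = fn / (tp + fn)                         # matches the 37.67% in Table 2
```

The computed accuracy of roughly 68.09% agrees with the 68.08% (1 − Total Error) reported above up to rounding.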
Fig. 4. Binary GLM model AuC score for individual features, target: CVDSTRK3
Next, we inquire into the variable importance9 to understand the relative contribution of each feature to the prediction process. For each variable, its variable importance or relative influence is derived by considering two factors:

– whether that variable was chosen for splitting during the tree formation process;
– if so, the resulting improvement of the squared error over all trees.

The top five contributors in terms of variable importance are shown in Fig. 6. To sum up, Table 3 lists the features in descending order of variable importance with respect to our predictive model.

9 http://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html
Fig. 5. Receiver operating characteristic curve, resulting AuC = 0.71
Fig. 6. GBM variable importance plot
Table 3. Top contributors

Index  BRFSS code  Meaning
1      X.TOTINDA   Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
2      X.CHLDCNT   Number of children in household (Categories: 0, 1, 2, 3, 4, 5)
3      AVEDRNK2    Total number of occasions of alcohol drinking in last 30 days
4      X.SMOKER3   Four-level smoker status: Everyday, Someday, Former, Non-smoker
5      SCNTWRK1    Average work hours per week
8 Conclusion, Limitations and Future Work
We present a purely behavioural-features-based stroke prediction method using GBM, and we further analyse the feature importance (see Fig. 6) to obtain a set of risk factors for stroke. A significant point to note from Table 3 is that the feature X.CHLDCNT, the number of children in the household, is the second most important contributor in our prediction, whereas the feature SCNTWRK1, the average work hours per week, is the fifth. X.CHLDCNT is a potentially interesting feature because there are hidden responsibilities associated with looking after a child which gradually increase with the number of children. Furthermore, these two features, X.CHLDCNT and SCNTWRK1, clearly indicate that sharing responsibilities and managing and maintaining a healthy work–life balance may contribute significantly towards a healthy and long life.

As for the limitations, we selected each of our predictors, listed in Table 1, from a survey dataset. There is thus a possibility of wrong reports, missing values and so on, which is why we could not achieve a very high AuC or much-reduced error rates; this might be possible if the same questions were asked of hospital patients. In the future, as discussed above, we can look further into the true positive and false negative sets to figure out which features are preventing the false negative rate from coming down. To achieve this, the LIME analytic library [18] can be used. Moreover, it might be possible to suggest a modification and/or extension of the BRFSS survey questionnaire for more details on certain behavioural features.
References
1. Xiong, H., Wu, J., Liu, L.: Classification with class overlapping: a systematic study. In: Proceedings of the 1st International Conference on E-Business Intelligence (ICEBI 2010). Atlantis Press (2010)
2. Soerjomataram, I., de Vries, E., Engholm, G., Paludan-Müller, G., Brønnum-Hansen, H., Storm, H.H., Barendregt, J.J.: Impact of a smoking and alcohol intervention programme on lung and breast cancer incidence in Denmark: an example of dynamic modelling with Prevent. Eur. J. Cancer 46(14), 2617–2624 (2010)
3. Lafortune, L., Martin, S., Kelly, S., Kuhn, I., Remes, O., Cowan, A., Brayne, C.: Behavioural risk factors in mid-life associated with successful ageing, disability, dementia and frailty in later life: a rapid systematic review. PLoS One 11(2), e0144405 (2016)
4. Podgorelec, V., Kokol, P., Stiglic, B., Rozman, I.: Decision trees: an overview and their use in medicine. J. Med. Syst. 26(5), 445–463 (2002)
5. Ho, T.K.: Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
6. García, M.N., Herráez, J.C., Barba, M.S., Hernández, F.S.: Random forest based ensemble classifiers for predicting healthcare-associated infections in intensive care units. In: 13th International Conference on Distributed Computing and Artificial Intelligence, pp. 303–311. Springer, Cham (2016)
Prediction of Stroke Risk Factors for Better Pre-emptive Healthcare
7. Al-Janabi, S., Patel, A., Fatlawi, H., Kalajdzic, K., Al Shourbaji, I.: Empirical rapid and accurate prediction model for data mining tasks in cloud computing environments. In: 2014 International Congress on Technology, Communication and Knowledge (ICTCK), pp. 1–8. IEEE (2014)
8. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
9. Alkadry, M.G., Bhandari, R., Wilson, C.S., Blessett, B.: Racial disparities in stroke awareness: African Americans and Caucasians. J. Health Hum. Serv. Adm. 462–490 (2011)
10. Luo, W., Nguyen, T., Nichols, M., Tran, T., Rana, S., Gupta, S., Phung, D., Venkatesh, S., Allender, S.: Is demography destiny? Application of machine learning techniques to accurately predict population health outcomes from a minimal demographic dataset. PLoS One 10(5), e0125602 (2015)
11. Yoon, S., Gutierrez, J.: Behavior correlates of post-stroke disability using data mining and infographics. Br. J. Med. Med. Res. 11(5) (2016)
12. Oswald, A.J., Wu, S.: Objective confirmation of subjective measures of human well-being: evidence from the USA. Science 327(5965), 576–579 (2010)
13. Yang, Q., Zhong, Y., Ritchey, M., Loustalot, F., Hong, Y., Merritt, R., Bowman, B.A.: Predicted 10-year risk of developing cardiovascular disease at the state level in the US. Am. J. Prev. Med. 48(1), 58–69 (2015)
14. Akdag, B., Fenkci, S., Degirmencioglu, S., Rota, S., Sermez, Y., Camdeviren, H.: Determination of risk factors for hypertension through the classification tree method. Adv. Ther. 23(6), 885–892 (2006)
15. Wang, S., Yao, X.: Multiclass imbalance problems: analysis and potential solutions. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 42(4), 1119–1130 (2012)
16. Provost, F.: Machine learning from imbalanced data sets 101. In: Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets, vol. 68, pp. 1–3. AAAI Press (2000)
17. Howard, G., McClure, L.A., Moy, C.S., Howard, V.J., Judd, S.E., Yuan, Y., Long, D.L., Muntner, P., Safford, M.M., Kleindorfer, D.O.: Self-reported stroke risk stratification: reasons for geographic and racial differences in stroke study. Stroke 48(7), 1737–1743 (2017)
18. Ribeiro, M.T., Singh, S., Guestrin, C.: Why should I trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM (2016)
19. Pharr, J.R., Coughenour, C.A., Bungum, T.J.: An assessment of the relationship of physical activity, obesity, and chronic diseases/conditions between active/obese and sedentary/normal weight American women in a national sample. Public Health 156, 117–123 (2018)
20. Tshiswaka, D.I., Ibe-Lamberts, K.D., Fazio, M., Morgan, J.D., Cook, C., Memiah, P.: Determinants of stroke prevalence in the southeastern region of the United States. J. Public Health 1–8 (2018)
21. Sokolova, M., Lapalme, G.: A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 45(4), 427–437 (2009)
22. Kumar, R., Indrayan, A.: Receiver operating characteristic (ROC) curve for medical researchers. Indian Pediatr. 48(4), 277–287 (2011)
23. Jones, A.M.: Models for Health Care. University of York, Centre for Health Economics (2009)
24. Bailey, R.R., Phad, A., McGrath, R., Haire-Joshu, D.: Prevalence of five lifestyle risk factors among US adults with and without stroke. Disabil. Health J. 12(2), 323–327 (2019)
25. Nuyujukian, D.S., Anton-Culver, H., Manson, S.M., Jiang, L.: Associations of sleep duration with cardiometabolic outcomes in American Indians and Alaska Natives and other race/ethnicities: results from the BRFSS. Sleep Health (2019)
26. Howard, V.J., McDonnell, M.N.: Physical activity in primary stroke prevention: just do it! Stroke 46(6), 1735–1739 (2015)
27. Díaz-García, J.A., González-Farías, G.: A note on the Cook's distance. J. Stat. Plan. Infer. 120(1–2), 119–136 (2004)
Language Identification—A Supportive Tool for Multilingual ASR in Indian Perspective Basanta Kumar Swain1(B) and Sanghamitra Mohanty2 1 Department of Computer Science & Engineering, Government College of Engineering
Kalahandi, Bhawanipatna 766002, India [email protected] 2 Department of Computer Science & Application, Utkal University, Bhubaneswar 51004, India [email protected]
Abstract. In this research paper, we have engineered a multilingual automatic speech recognition engine for the Indian context by employing a language identification technique. India is a land of many tongues, and our vision is to develop a system that aids man–machine communication in Indian spoken languages. We have developed two vital models, namely language identification and speech recognition, using an array of pattern recognition techniques, viz. k-NN, SVM and HMM. We have tackled both the LID and the multilingual ASR tasks over three Indian spoken languages, namely Odia, Hindi and Indian English. For the LID task, we have experimented over both short and long durational isolated spoken words. Finally, we have integrated the LID module with the multilingual ASR in order to transcribe words in the desired spoken language. Keywords: Language identification · Multilingual ASR · k-NN · SVM · HMM
1 Introduction India is a multilingual as well as a multicultural country, and moreover, Indian speakers are multilingual by default. Multilingual speakers now outnumber monolingual speakers in India owing to globalization and easier communication. The total number of distinct Indian spoken languages stands at 121 as per the 2011 census data [1]. These 121 languages are classified into two groups, namely Scheduled Languages and non-Scheduled Languages, comprising 22 (Part A) and 99 (Part B) languages, respectively. The Scheduled Languages are those with more than 10,000 speakers; the non-Scheduled Languages have fewer than 10,000 speakers or cannot be identified on the basis of the available linguistic information. The 2011 census data show that, out of the twenty-two Scheduled Languages, Hindi is spoken by the majority of speakers, accounting for 43.63% of the total Indian population. Odia is also one of the twenty-two Scheduled Languages, and it is placed ninth in the descending-order list of Scheduled © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_3
Languages category. Odia speakers account for 3.10% of the total. In this research article, we have considered three Indian spoken languages: two from the Indo-Aryan family, namely Hindi, a majority spoken language, and Odia, in addition to English from the wider Indo-European family. In today's market, man–machine communication in the spoken context has a plethora of advantages over the traditional keyboard and pointing-device-based approach. But spoken man–machine communication is a daunting task compared to traditional communication owing to the lack of mechanisms for handling multilingual spoken instructions [2]. Automatic speech recognition is a key module of man–machine communication in spoken natural language. However, man–machine communication is an uphill task in the Indian context due to the use of multiple spoken languages in different regions. The proposed approach simplifies the job of man–machine communication by adopting a Language Identification (LID) technique prior to the involvement of Automatic Speech Recognition (ASR) [3]. The word accuracy rate of the ASR improves significantly with LID as a preprocessing block: LID acts as an aid to the ASR tool by providing acoustic and lexical information about the spoken language. In this research paper, the authors have emphasized the development of a man–machine interaction system that accepts inputs in Indian spoken languages. We have developed three isolated-word speech recognition engines for Odia, Hindi and Indian English using the Hidden Markov Model (HMM). LID provides the prior information needed to activate one of the speech recognition engines and establish man–machine communication in a specific spoken language. We have employed Support Vector Machine (SVM) and k-Nearest Neighbour (k-NN) algorithms for identification of the spoken language.
The developed systems can be extended to provide a bouquet of services in telecommunications, customer relation management, enterprise service automation, etc. in the Indian context. The paper is organized as follows: Sect. 2 presents the speech database for LID and ASR, Sect. 3 describes the speech feature vectors for LID and ASR, Sect. 4 covers the classifiers for LID and ASR, Sect. 5 illustrates the experimental results and Sect. 6 presents the conclusion and future work.
2 Speech Database for LID and ASR The speech database used in this research paper was collected from 95 speakers aged between 18 and 55 years. For the LID task, we set ten isolated words for each of the three languages, namely Hindi, Odia and Indian English; for ASR, we collected 30 isolated words each for Hindi and English and 40 isolated words for Odia across different domains. The voice files were recorded in laboratory and home environments using laptops and desktops connected to a head-mounted microphone, with the speech files set to mono, 16 bit, 16000 Hz, PCM [4]. Speakers were asked to repeat any isolated words whose recordings seemed inappropriate for the application. Most of the speakers who participated in the speech corpus collection for the LID and ASR tasks are able to speak multiple Indian languages. A detailed description of the speech database used in this research article is shown in Table 1.
Table 1. Description of the speech database for LID and ASR.

| Task | Spoken language | No. of spoken words | No. of speakers | Male | Female | Total no. of spoken words | Spoken word property |
|------|-----------------|---------------------|-----------------|------|--------|---------------------------|----------------------|
| LID  | Hindi           | 10 | 20 | 15 | 5  | 200  | 16 bit, 16 kHz, PCM |
|      | Odia            | 10 | 50 | 35 | 15 | 500  |                     |
|      | Indian English  | 10 | 20 | 15 | 5  | 200  |                     |
| ASR  | Hindi           | 30 | 25 | 18 | 7  | 750  | 16 bit, 16 kHz, PCM |
|      | Odia            | 40 | 50 | 35 | 15 | 2000 |                     |
|      | Indian English  | 30 | 20 | 15 | 5  | 600  |                     |
3 Speech Feature Vectors for LID and ASR

The collected speech database for LID and ASR is subjected to a parameterization process, carried out with the end objective of the research work in view. In this work, the authors have focused on two aspects, namely language identification and speech recognition. Recent research shows that language characteristics exist in speech utterances at different levels, such as phonotactic and acoustic-phonetic-prosodic [5–7]. We have used prosodic features for the LID task over Hindi, Odia and Indian English owing to their high language-discriminative power and low computational cost. The authors have used Jitter and Shimmer, as well as their variants, as prosodic features for spoken language identification at the level of word utterances. Equations 1 and 2 represent Jitter_Absolute and Jitter_Relative.

3.1 Jitter_Absolute Cum Jitter_Relative

Jitter_Absolute is determined as the average absolute difference over consecutive periods, and Jitter_Relative is the ratio of this quantity to the average period [8]:

Jitter_Abs = \frac{1}{N-1} \sum_{i=1}^{N-1} |T_i - T_{i+1}|,   (1)

Jitter_Rel = \frac{\frac{1}{N-1} \sum_{i=1}^{N-1} |T_i - T_{i+1}|}{\frac{1}{N} \sum_{i=1}^{N} T_i},   (2)

where N signifies the number of extracted f0 periods and T_i represents the f0 period lengths. Similarly, another prosodic feature, Shimmer_dB cum Shimmer_Relative, is extracted from the LID speech corpora.
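Equations 1 and 2 translate directly into code. A minimal pure-Python sketch, assuming the f0 period lengths (in seconds) have already been extracted by a pitch tracker such as Praat [8]:

```python
def jitter_abs(periods):
    """Eq. 1: mean absolute difference of consecutive f0 period lengths."""
    n = len(periods)
    return sum(abs(periods[i] - periods[i + 1]) for i in range(n - 1)) / (n - 1)

def jitter_rel(periods):
    """Eq. 2: Jitter_Absolute normalised by the average period length."""
    return jitter_abs(periods) / (sum(periods) / len(periods))

# A perfectly periodic voice has zero jitter.
print(jitter_abs([0.005] * 10))  # → 0.0
```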
3.2 Shimmer_dB Cum Shimmer_Relative

Shimmer_dB is evaluated with a simple mathematical formulation using the logarithmic function, and Shimmer_Relative is determined as the ratio of the average absolute difference of amplitudes in consecutive periods to the average amplitude. Equations 3 and 4 represent Shimmer_dB and Shimmer_Relative, respectively [9, 10]:

Shimmer_dB = \frac{1}{N-1} \sum_{i=1}^{N-1} |20 \log_{10}(A_{i+1}/A_i)|,   (3)

Shimmer_Rel = \frac{\frac{1}{N-1} \sum_{i=1}^{N-1} |A_i - A_{i+1}|}{\frac{1}{N} \sum_{i=1}^{N} A_i}.   (4)
The lexical and grammatical structure plays a crucial role in the development of a multilingual speech recognizer for a set of languages. We have extracted Mel Frequency Cepstral Coefficients (MFCC) of size thirty-nine from the multilingual ASR database.

3.3 MFCC as Multilingual ASR Feature Vectors

MFCC feature vectors are computed every 10 ms with an overlapping window of around 25 ms. MFCCs are constructed by applying the Discrete Cosine Transformation (DCT) to a log spectrum over around 20 frequency bins distributed non-linearly across the speech spectrum [4]. In this research work, we have also considered first-order (delta), \Delta y_t^s, and second-order, \Delta^2 y_t^s, regression coefficients in addition to the spectral coefficients. The delta parameter \Delta y_t^s is represented in Eq. 5:

\Delta y_t^s = \frac{\sum_{i=1}^{n} w_i (y_{t+i}^s - y_{t-i}^s)}{2 \sum_{i=1}^{n} w_i^2},   (5)

where n is the window width, w_i are the regression coefficients and y_t^s is the static feature vector. The second-order parameters, \Delta^2 y_t^s, are derived in a similar manner from differences of the delta parameters. Finally, all features are concatenated to form the feature vector y_t of length thirty-nine, represented in Eq. 6:

y_t = [\, y_t^{sT} \;\; \Delta y_t^{sT} \;\; \Delta^2 y_t^{sT} \,]^T.   (6)
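Under the common HTK-style assumption w_i = i (the paper does not state its weights, so this is illustrative), Eqs. 5 and 6 can be sketched as:

```python
import numpy as np

def delta(features, n=2):
    """First-order regression coefficients over a (T, D) static feature
    sequence (Eq. 5), assuming weights w_i = i and edge padding at the ends."""
    T = len(features)
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, n + 1))
    return np.array([
        sum(i * (padded[t + n + i] - padded[t + n - i]) for i in range(1, n + 1)) / denom
        for t in range(T)
    ])

def full_vector(static):
    """Eq. 6: stack static MFCCs with their deltas and delta-deltas,
    e.g. 13 static coefficients -> a 39-dimensional vector per frame."""
    d = delta(static)
    return np.hstack([static, d, delta(d)])
```

On a linearly increasing feature sequence the interior delta values equal the slope, which is a quick sanity check for the implementation.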
4 Proposed Algorithms for LID and Multilingual ASR Man–machine interaction in spoken language especially in Indian perspective is a challenging task because at first the machine should correctly predict the spoken language being uttered by the user, and in the very next phase the machine should correctly predict the content which is being spoken by the user. Authors have used k-NN as well as SVM for identification of languages and HMM for prediction of spoken language contents.
4.1 k-Nearest Neighbour Classifier for LID

The k-Nearest Neighbour algorithm can be categorized as a non-linear and non-parametric classification method. It rests on the very lucid principle that similar data lie close to each other in the search or data space. In other words, for every object in the test data set, k-NN finds the k objects in the training data that are closest to it (the nearest neighbours) [11, 12]. The label assignment is usually based on majority voting, i.e. the most frequent class among the k nearest neighbours of a given test object determines the class to which the test object should belong [13, 14]. The value of k dictates the number of closest training objects taken into account in the label decision. In this research paper, we have fixed k = 3 in order to identify a single language out of the three languages Hindi, Odia and Indian English. Euclidean distance is used to measure the distance between a training sample a and a test sample b, as represented in Eq. 7 [15–17]:

d_e(a, b) = \sqrt{\sum_{k=1}^{n_a} (p_{ak} - p_{bk})^2}.   (7)
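The majority-vote rule with the Euclidean distance of Eq. 7 fits in a few lines; the 2-D "prosodic" points and language labels below are toy values, not the paper's data:

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Majority vote among the k nearest training samples under the
    Euclidean distance of Eq. 7; `train` is a list of (vector, label) pairs."""
    dist = lambda a, b: math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    neighbours = sorted(train, key=lambda s: dist(s[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-D feature points for two of the three languages (illustrative only).
train = [((0.0, 0.0), "Odia"), ((0.0, 1.0), "Odia"), ((1.0, 0.0), "Odia"),
         ((5.0, 5.0), "Hindi"), ((5.0, 6.0), "Hindi"), ((6.0, 5.0), "Hindi")]
print(knn_predict(train, (0.2, 0.2), k=3))  # → Odia
```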
Figure 1 illustrates a k-NN classifier classifying two-dimensional data into two classes. The first (interior) octagon represents the region with three neighbours (k = 3) used for making the decision; in this case, the classified test sample belongs to the 'black circle' class. The second (exterior) octagon represents six neighbours (k = 6); in this second case, the classification result also belongs to the 'black circle' class. This is because black circles form the majority of the neighbours in both cases, i.e. for k = 3 and k = 6.
Fig. 1. Illustration of K-NN classifier.
4.2 Support Vector Machine (SVM) Classifier for LID

SVM approaches language identification by maximizing the margin, which makes the classifier robust in comparison to others, and by using kernel functions. The SVM classifier can handle inputs of very high dimensionality [18]. It takes input samples as pairs (x_i, y), i.e. features x_1, x_2, …, x_n together with the output class y; for a binary classifier, y takes the values (−1, +1). A test sample is classified on the basis of its position with respect to the hyperplane, which is chosen so that it correctly separates most of the training observations into the classes. Figure 2 shows two classes of data, represented by squares and circles, separated with the largest margin.
Fig. 2. Illustration of a binary SVM classifier.
The three hyperplanes (H_1, H_0, H_2) are represented in Eqs. 8, 9 and 10:

w x_i + b = +1,   (8)
w x_i + b = 0,   (9)
w x_i + b = -1.   (10)

Equations 8, 9 and 10 can be generalized into the forms of Eqs. 11, 12 and 13:

W^T X + b \ge 0,   (11)
W^T X + b = 0,   (12)
W^T X + b < 0,   (13)
where W is the weight vector, b represents the bias and X describes the input vector. SVM classifies the samples using the optimal hyperplane that maximizes the separating distance d. The distance between the two exterior planes (H_1, H_2) is 2/\|W\|, i.e. the gap between the two terminal hyperplanes. Maximizing the gap means minimizing \|W\|; hence, we can obtain the optimal hyperplane by minimizing \|W\|. SVM can also handle non-linear class boundaries using kernel functions [19]. A kernel is a function that quantifies the similarity of two observations. Equation 14 represents a linear kernel function:

K(x, x_i) = \sum_{j=1}^{p} x_j \, x_{ij}.   (14)
In this research work, we have also used the polynomial kernel and the Radial Basis Function (RBF) kernel for the LID task. Equations 15 and 16 represent the polynomial and RBF kernels, respectively:

K(x, x_i) = \left( 1 + \sum_{j=1}^{p} x_j \, x_{ij} \right)^d,   (15)

which is known as a polynomial kernel of degree d, where d is a positive integer, and

K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right).   (16)
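The three kernels of Eqs. 14–16 are essentially one-liners; a sketch (the function names are ours):

```python
import numpy as np

def linear_kernel(x, z):
    """Eq. 14: inner product of the two feature vectors."""
    return float(np.dot(x, z))

def poly_kernel(x, z, d=2):
    """Eq. 15: polynomial kernel of degree d."""
    return float((1 + np.dot(x, z)) ** d)

def rbf_kernel(x, z, sigma=1.0):
    """Eq. 16: Radial Basis Function kernel with width sigma."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2 * sigma ** 2)))
```

In a library such as scikit-learn these correspond to `SVC(kernel='linear' | 'poly' | 'rbf')`, with the RBF width expressed through the `gamma` parameter rather than sigma.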
4.3 Hidden Markov Model as Multilingual Speech Recognizer

The input audio waveforms of the Hindi, Odia and Indian English languages are collected using a microphone and represented as a sequence of fixed-size acoustic vectors Y = y_1, …, y_T in a process called feature extraction (described in Sect. 3). The decoder, represented in Eq. 17, attempts to find the sequence of words W = w_1, …, w_K (Hindi/Odia/Indian English) most likely to have generated Y [20–25]:

\hat{W} = \arg\max_{W} [\, p(W \mid Y) \,].   (17)

Bayes' rule is used to transform p(W|Y) into the equivalent problem of Eq. 18:

\hat{W} = \arg\max_{W} [\, p(Y \mid W) \, p(W) \,].   (18)
The likelihood p(Y|W) is determined by an acoustic model and the prior p(W) is determined by a language model [26–28].
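In the log domain, Eq. 18 reduces decoding to maximizing the sum of the acoustic and language-model log-scores. A schematic sketch follows; the candidate words and their scores are made-up placeholders, not trained HMM likelihoods:

```python
import math

def decode(log_acoustic, log_prior):
    """Return the candidate word W maximising log p(Y|W) + log p(W) (Eq. 18)."""
    return max(log_acoustic, key=lambda w: log_acoustic[w] + log_prior[w])

# Hypothetical per-word scores for one utterance (illustrative numbers only).
acoustic = {"pani": math.log(0.6), "bhata": math.log(0.4)}   # log p(Y|W)
prior = {"pani": math.log(0.3), "bhata": math.log(0.7)}      # log p(W)
print(decode(acoustic, prior))  # → bhata (the language model tips the decision)
```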
5 Experimental Results

The objective of this research work is to measure the performance of the isolated-word speech recognizer when the speech utterances are first classified automatically according to the language being spoken; the LID task is carried out at the outset of speech recognition. First, we measured LID performance using the k-NN algorithm, and then using the SVM classifier with different kernel functions. LID performance was also judged on short and long durational utterances of the spoken languages for each classifier. Figure 3 shows the distribution of the isolated-word speech database for the LID task over Hindi, Odia and Indian English. The experimental results (Tables 2 and 4) show that the language identification task yields better results over long durational utterances than over short durational ones for both the k-NN and SVM classifiers. The SVM classifier performed better than the k-NN algorithm, identifying the spoken language more accurately for both short and long durational utterances. Moreover, the RBF kernel (Table 3) gives a better accuracy rate than the polynomial and linear kernels. Table 5 presents the confusion matrix of the SVM classifier over long durational utterances for the languages under study. In this research article, the authors developed three speech recognition engines for Hindi, Odia and Indian English using the Hidden Markov Model; each engine has its own language model, pronunciation dictionary, transcription file and phone file. The LID module decides which speech recognition engine is activated to predict words in the language used by the speaker.
The performance of the multilingual speech recognizer is measured in terms of word accuracy, represented in Eq. 19:

\text{Word Accuracy} = \frac{N - D - S - I}{N} \times 100,   (19)
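Equation 19 as a small helper, with the variable names following the equation (N test words, D deletions, S substitutions, I insertions):

```python
def word_accuracy(n, d, s, i):
    """Eq. 19: word accuracy in percent for n test words with
    d deletion, s substitution and i insertion errors."""
    return (n - d - s - i) / n * 100

print(word_accuracy(1000, 50, 100, 30))  # → 82.0
```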
Fig. 3. Illustration of short and long durational isolated utterances of multiple languages (bar chart of word counts per language for short and long durations).
Table 2. Performance of the k-NN classifier on the LID task.

| Spoken language | Short Duration Accuracy (%) | Long Duration Accuracy (%) |
|-----------------|-----------------------------|----------------------------|
| Hindi           | 53.82 | 62.25 |
| Odia            | 54.15 | 64.36 |
| Indian English  | 51.06 | 52.37 |
Table 3. Performance of the SVM classifier on the LID task over different kernels.

| Spoken language | RBF Accuracy (%) | Polynomial Accuracy (%) | Linear Accuracy (%) |
|-----------------|------------------|-------------------------|---------------------|
| Hindi           | 64.33 | 62.45 | 62.11 |
| Odia            | 66.78 | 64.23 | 63.50 |
| Indian English  | 53.08 | 51.60 | 52.71 |
Table 4. Performance of the SVM classifier on the LID task.

| Spoken language | Short Duration Accuracy (%) | Long Duration Accuracy (%) |
|-----------------|-----------------------------|----------------------------|
| Hindi           | 57.40 | 66.33 |
| Odia            | 58.22 | 69.90 |
| Indian English  | 52.45 | 54.04 |
Table 5. Confusion matrix of the LID task for the SVM classifier over long durational utterances (%).

| Spoken language | Hindi | Odia  | Indian English |
|-----------------|-------|-------|----------------|
| Hindi           | 66.33 | 23.46 | 13.21 |
| Odia            | 25.00 | 69.90 | 5.10  |
| Indian English  | 21.21 | 24.75 | 54.04 |
where N, D, S and I denote the total number of test words and the numbers of deletion, substitution and insertion errors, respectively. Table 6 shows the performance of the multilingual ASR using HMM. Figure 4 plots the performance curve of the multilingual ASR for Hindi, Odia and Indian English in terms of word accuracy rate; the curve is drawn from the word accuracy rate of the multilingual ASR against 50, 60, 75 and 90% of the ASR speech database of the languages under study.
Table 6. Performance of the multilingual ASR using HMM.

| Spoken language | Word accuracy (%) | Word error (%) |
|-----------------|-------------------|----------------|
| Hindi           | 69.36 | 30.64 |
| Odia            | 78.18 | 21.82 |
| Indian English  | 62.54 | 37.46 |
Fig. 4. Word accuracy rate of multilingual ASR over different sizes of speech database.
6 Conclusion and Future Work

In this research work, we carried out a language identification task over two Scheduled Languages (Hindi and Odia) and one non-Scheduled Language, Indian English, using the k-NN algorithm and the SVM classifier. The SVM classifier yielded better results than the k-NN algorithm. Moreover, LID performance is much higher for long durational utterances of isolated words than for short durational ones, and the RBF kernel of the SVM produced better results than the polynomial and linear kernels. We developed three different ASR engines for Hindi, Odia and Indian English using HMM and integrated the LID module at the outset of the speech recognition engines, exploiting the potential of LID to work hand in hand with the multilingual ASR. The Odia speech recognition engine yielded better results than the other ASR engines, which may be due to the different sizes of the speech corpora used in training. In future, we will work on a single multilingual speech recognition system capable of recognizing any of the languages seen in the training phase.

Acknowledgments. A standard ethical committee has approved this data set, and the data set has no conflict of interest/ethical issues with any other public source or domain.
References
1. http://censusindia.gov.in/2011Census/C-16_25062018_NEW.pdf
2. O'Shaughnessy, D.: Speech Communications: Human and Machine, 2nd edn. Universities Press (2001)
3. Quatieri, T.F.: Discrete-Time Speech Signal Processing: Principles and Practice. Pearson Education, Third Impression (2007)
4. Rabiner, L.R., Schafer, R.W.: Digital Processing of Speech Signals, 1st edn. Pearson Education (2004)
5. Leena, M., Srinivasa Rao, K., Yegnanarayana, B.: Neural network classifiers for language identification using phonotactic and prosodic features. In: Proceedings of the International Conference on Intelligent Sensing and Information Processing (2005)
6. Tüske, Z., Pinto, J., Willett, D., Schlüter, R.: Investigation on cross- and multilingual MLP features under matched and mismatched acoustical conditions. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)
7. Lin, H., Deng, L., Yu, D., Gong, Y.F., Acero, A., Lee, C.H.: A study on multilingual acoustic modeling for large vocabulary ASR. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2009)
8. Praat software website. http://www.fon.hum.uva.nl/praat
9. Ferrer, L., Scheffer, N., Shriberg, E.: A comparison of approaches for modeling prosodic features in speaker recognition. In: International Conference on Acoustics, Speech, and Signal Processing (2010)
10. Martinez, D., Lleida, E., Ortega, A., Miguel, A.: Prosodic features and formant modeling for an i-vector based language recognition system. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6847–6851 (2013)
11. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics), 1st edn. Springer (2006); corr. 2nd printing edition, October 2007
12. Zissman, M.A., Berkling, K.M.: Automatic language identification. Speech Commun. 35, 115–124 (2001)
13. Malmasi, S., Dras, M.: Native language identification using stacked generalization. ArXiv e-prints, 1–33 (2017)
14. Ambikairajah, E., Li, H., Wang, L., Yin, B., Sethu, V.: Language identification: a tutorial. IEEE Circ. Syst. Mag. 11 (2011)
15. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Elsevier (2007)
16. Dehak, N., Torres-Carrasquillo, P.A., Reynolds, D.A., Dehak, R.: Language recognition via i-vectors and dimensionality reduction. In: Interspeech (2011)
17. http://www.technologyforge.net/WekaTutorials/
18. Mohanty, S., Swain, B.K.: Speaker identification using SVM during Oriya speech recognition. Int. J. Image Graph. Sig. Process. (2015)
19. Keshet, J., Bengio, S.: Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. Wiley (2009)
20. Wang, D., Zheng, T.F.: Transfer learning for speech and language processing. In: Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2015 Asia-Pacific. IEEE (2015)
21. Chorowski, J., Bahdanau, D., Cho, K., Bengio, Y.: End-to-end continuous speech recognition using attention-based recurrent NN: first results. arXiv preprint arXiv:1412.1602 (2014)
22. Kim, S., Seltzer, M.L.: Towards language-universal end-to-end speech recognition. arXiv:1711.02207 (2017)
23. Ma, B., Guan, C., Li, H., Lee, C.-H.: Multilingual speech recognition with language identification. In: ICSLP, USA (2002)
24. Jaitly, N., Nguyen, P., Senior, A., Vanhoucke, V.: Application of pretrained deep neural networks to large vocabulary speech recognition. In: Proceedings of Interspeech (2012)
25. Settle, S., Le Roux, J., Hori, T., Watanabe, S., Hershey, J.R.: End-to-end multi-speaker speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2018)
26. Cho, J., Baskar, M.K., Li, R., Wiesner, M., Mallidi, S.H., Yalta, N., Hori, T.: Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling. In: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 521–527 (2019)
27. Saon, G., Kurata, G., Sercu, T., Audhkhasi, K., Thomas, S., Dimitriadis, D., Cui, X., Ramabhadran, B., Picheny, M., Lim, L., Roomi, B., Hall, P.: English conversational telephone speech recognition by humans and machines. arXiv preprint arXiv:1703.02136 (2017)
28. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning with Applications in R. Springer, New York (2013)
Ensemble Methods to Predict the Locality Scope of Indian and Hungarian Students for the Real Time: Preliminary Results Chaman Verma1(B) , Zoltán Illés1 , and Veronika Stoffová2 1 Eötvös Loránd University, Budapest, Hungary
{chaman,illes}@inf.elte.hu 2 Trnava University, Trnava, Slovakia [email protected]
Abstract. In the present study, we presented ensemble classifier to predict the locality scope (National or International) of the student based on their motherland and sex toward Information and Communication Technology (ICT) and Mobile Technology (MT). For this, a primary dataset of 331 samples from Indian and Hungarian university was gathered during the academic year 2017–2018. The dataset contained 331 instances and 37 features which belonged to the four major ICT parameters attitude, development and availability, educational benefits and usability of modern ICT resources, and mobile technology in higher education. In addition to class balancing with Synthetic Minority Over-Sampling Technique (SMOTE), Adaptive Boosting (AdboostM1) and bagging ensemble technique is applied with Artificial Neural Network (ANN) and Random Forest (RF) classifiers in Weka tool. Findings of the study infer that the ANN achieved higher accuracy (92.94%) as compared to RF’s accuracy (92.25%). The author’s contribution is to apply ensemble methods with standard classifiers to provide more accurate and consistent results. On the one hand, with the use of bagging, the ANN achieved 92.94% accuracy, and on the other hand, AdboostM1 has also significantly improved the prediction accuracy and RF provided 92.25% accuracy. Further, the statistical T-test at the 0.05 significance level proved no significant difference between the accuracy of RF and ANN classifier to predict the locality scope of the student. Also, the authors found a significant difference between the CPU prediction time between bagging with ANN and AdboostM1 with RF. Keywords: Bagging · Locality scope · Prediction · Ensemble classifier · AdBoostM1
1 Introduction
Knowledge discovery in databases is commonly referred to as data mining; data preprocessing [1], pattern recognition, clustering [2], and classification are among its popular techniques. In educational data mining, Machine Learning (ML) nowadays plays a vital role in recognizing patterns. Previously, statistical analysis was not strong enough to find patterns in datasets, and differential analyses with T-test, Z-test, F-test, and ANOVA techniques were frequently applied [3–7]. Instead of traditional statistics, many researchers now use machine learning for predictive modeling, and machine learning classifiers are trending in predicting data patterns. With two independent variables, sex and motherland, the student's locality has been predicted [17]. In this study, the locality (national or international) is predicted from sex and motherland with the RF and the ANN with appreciable accuracy. The authors performed class balancing using SMOTE.
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_4
[Figure 1 (bar chart, instances n, balanced vs. unbalanced dataset): National instances remain 252 in every run; International instances rise from 79 (unbalanced) to 158 after the first SMOTE run and 316 after the second, giving 568 instances in total.]
Fig. 1. Class balancing using SMOTE.
Figure 1 shows that after the first run of SMOTE, International student instances increase to 158, and after the second run of SMOTE this count is enhanced to 316, which makes the classes balanced. After this, the new count of instances is 568 for training and testing. Cross-validation with k equal to 10 was applied without using any ensemble technique. On the one hand, Fig. 2a shows the highest prediction accuracy (91.02%), given by the ANN classifier as compared to the accuracy (89.6%) of the RF classifier; on the other hand, Fig. 2b displays the overall accurate prediction of the student's locality status with 10 folds. The uppermost prediction count over both classes (national and international) is 517 out of 568, an excellent accuracy provided by the ANN classifier. The authors are motivated to augment the classification accuracy using trending ensemble methods such as boosting and bagging. A normalized dataset is scattered into bootstrapped data samples to train and test in parallel, and accuracy is measured accordingly. After the implementation of the RF model with boosting, we found a significant increment (2.65%) in prediction accuracy [17]. Also, using multiple bootstrapped subsamples, the accuracy of the ANN algorithm is enhanced by 1.92% as compared to the study [17]. Further, the prediction difference is measured using the popular statistical T-test, which compares the mean accuracy of two ensemble classifiers. In this paper, the authors applied ensemble methods such as adaptive boosting and bagging with the RF and ANN to predict the locality scope (National or International) of the university's
[Figure 2a (bar chart): accuracy at 10 folds, ANN 91.02% vs. RF 89.6%. Figure 2b (bar chart): overall prediction counts, ANN 517 right/51 wrong, RF 509 right/59 wrong.]
Fig. 2. a Accuracy of RF and ANN without ensemble methods (Source [20]). b Overall prediction count without ensemble methods (Source [20])
student, assuming sex and motherland as independent variables along with the ICT variables. These ensemble classifiers are trained and tested with dynamic training ratios and cross-validation methods. Also, the statistical Student's T-test at the 0.05 significance level is applied to test for a significant difference between the prediction accuracies provided by the RF and the ANN with ensemble methods.
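The SMOTE class balancing shown in Fig. 1 is produced by Weka's SMOTE filter; as a rough illustration of the underlying idea only (synthetic points interpolated between a minority sample and one of its k nearest neighbours), a minimal pure-Python sketch with made-up 2-D points might look like this:

```python
import random
import math

def smote(minority, k=5, n_new=None, seed=0):
    """Generate synthetic minority samples along the line joining each
    chosen point to one of its k nearest neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    n_new = n_new if n_new is not None else len(minority)  # 100% oversampling
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # k nearest neighbours of p within the minority class
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # random number x in [0, 1)
        synthetic.append(tuple(pi + gap * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

# Hypothetical 2-D minority points, not the paper's 37-feature data.
minority = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3),
            (0.3, 0.25), (0.25, 0.15), (0.05, 0.1)]
new_points = smote(minority, k=3, n_new=6)
print(len(minority) + len(new_points))  # 6 real + 6 synthetic = 12
```

In the paper the same doubling is applied twice to the International class (79 → 158 → 316 instances).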
2 Related Work
Recent work on classifying student demographic features is reviewed in this section. The sex of European school students from ESSIE survey responses was predicted with binary logistic regression at an accuracy of 62% [8], which was enhanced up to 76% with the SVM and the RF [9]. Similar significant predictive models for European school teachers, principals, and students were also presented [10, 11]. For a real-time system, an age-group predictive model of students was presented [12]. Numerous variants of the decision tree were applied to predict the nationality of European school students, and the concept of real-time prediction models was suggested by calculating CPU time [13]. Further, students' attitude [14] and ICT development [15] were also predicted for real time. The consciousness level
toward trending technology was also measured for real time [16]. Also, a student's locality, rural or urban, was predicted in real time using classification techniques with feature selection methods [17]. Based on the student's sex and motherland, the locality status, national or international, was also predicted [18]. Several researchers have used prediction algorithms in the educational arena. The academic performance of students was classified with Naive Bayes and KNN with significant accuracy [19]. Based on an enrollment dataset, student success was predicted with various variants of the decision tree [20]. Further, to favor the real-time system, a few demographic features of students such as university [21], study level [22], nationality [23], and teachers' residence state [24] with feature selection [25] were also classified with machine learning.
3 Research Design and Methodology
This section describes the dataset, its preprocessing, the ensemble methods used, and the model performance metrics.
3.1 Dataset
A stratified random sampling method was used to gather the 331 primary samples with the help of Google Forms from an Indian and a Hungarian university. The initial dataset has a total of 331 instances and 46 features, of which 37 features are related to the four major ICT parameters and nine features relate to the students' demography.
3.2 Preprocessing
The dataset is normalized to a 0–1 scale using the Normalize filter. The response variable, locality status, is set as the class variable and encoded as National-1 or International-2; it is converted to nominal using the NumericToNominal filter. The gender variable has two values, encoded as Male-1 and Female-2. The country variable has two values, encoded as India-1 and Hungary-2. Class balancing is achieved with the help of SMOTE, which enhances the records belonging to the minority class (International). To create new samples, a random minority point is selected along with its k closest neighbors in the feature space. A vector is formed between the existing point and one of its k closest neighbors, multiplied by a random number x in the range 0–1, and added to the present point; the result is the new synthetic SMOTE point.
3.3 Ensemble Methods
Instead of using an individual classifier for prediction, ensemble methods such as bagging and boosting are most significant for augmenting the accuracy of a classifier, identifying data patterns more accurately and with more consistent results. We used bagging with the ANN classifier to enhance accuracy. Bagging is the
combination of both bootstrapping and aggregation. A normalized dataset is scattered into bootstrapped data samples to train and test in parallel, and accuracy is measured accordingly. Later, aggregation is performed to form the most significant predictor. Hence, with the use of bagging in collaboration with the ANN, the results are found to be more stable, with optimal accuracy of locality prediction; bagging also avoids the problem of overfitting and reduces variance. Unlike bagging, boosting builds ensemble models in a serial manner, in which each subsequent model trains on data points with increased weights for the points misclassified by the previous model. We can say that it learns from its mistakes by increasing the weight of misclassified data points. The authors used the boosting algorithm AdaBoostM1 with RF to enhance accuracy [18].
3.4 Performance Metrics
In order to assess the power of each model, the Root Mean Square Error (RMSE), the Receiver Operating Characteristic (ROC) curve, the F-score, and Cohen's kappa are compared appropriately.
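The bootstrap-and-aggregate procedure described in Sect. 3.3 can be sketched in a few lines. This is not the Weka implementation: it is an illustration using a trivial 1-nearest-neighbour base learner and hypothetical 2-D data in place of the paper's ANN and 37-feature dataset.

```python
import random
import math
from collections import Counter

def one_nn(train, x):
    """Trivial stand-in base learner: label of the nearest training point."""
    return min(train, key=lambda t: math.dist(t[0], x))[1]

def bagging_predict(data, x, n_models=25, seed=0):
    """Bagging: fit each model on a bootstrap resample (sampling with
    replacement), then aggregate the predictions by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in range(len(data))]  # bootstrap sample
        votes.append(one_nn(boot, x))
    return Counter(votes).most_common(1)[0][0]

# Hypothetical labelled points standing in for the locality-scope data.
data = [((0.0, 0.0), "national"), ((0.1, 0.2), "national"),
        ((1.0, 1.0), "international"), ((0.9, 1.1), "international")]
print(bagging_predict(data, (0.95, 1.0)))  # majority vote -> "international"
```

Averaging over many bootstrap resamples is what gives bagging its variance-reduction and overfitting-resistance mentioned above.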
4 Experiments and Result Discussions
This section concentrates on the experimental framework of the presented study. The authors used the data mining tool Weka 3.9.1, developed at the University of Waikato, New Zealand, and two of its applications, Explorer and Experimenter. On the one hand, the Explorer application is used for data preprocessing and to generate the confusion matrix and performance metrics; on the other hand, the Experimenter application is used to compare the accuracy and CPU training time of the classifiers using the T-test at the 0.05 significance level.
4.1 Experiment-I
A dynamic holdout method (training ratio) is applied to the balanced samples with the ANN and the RF. For this, three holdout ratios are considered, 50:50, 60:40, and 70:30, in a random manner. Figure 3 visualizes the prediction accuracy comparison of the ensemble classifiers at dynamic holdout ratios for locality scope prediction. The maximum accuracy of locality scope prediction (92.94%) is achieved by the ANN with the bagging technique at the 70:30 ratio, which significantly improves the standard ANN's accuracy (Fig. 2a). Hence, with the holdout testing method, bagging performs well with the ANN in the prediction task.
4.2 Experiment-II
Using even values of folds in the range of 2–10, temporary test sets are framed against the training set. Figure 4 evidences that, with k = 10, the AdaBoostM1 technique with RF attained the maximum prediction accuracy as compared to other k
[Figure 3 (bar chart): accuracy of the four ensemble classifiers (AdaBoostM1-RF, Bagging-RF, AdaBoostM1-ANN, Bagging-ANN) at the 50:50, 60:40, and 70:30 holdout ratios; the maximum is 92.94% for Bagging-ANN at 70:30.]
Fig. 3. Ensemble modeling using Holdout.
values. The minimum prediction accuracy (86.26%) is obtained by the ANN with bagging at 2 folds. The AdaBoostM1-RF accuracy (91.19%) is sustained at 4, 6, and 8 folds. The AdaBoostM1 method significantly enhances the RF standard classifier with k = 10 (Fig. 2a).
[Figure 4 (bar chart): accuracy of the four ensemble classifiers at k = 2, 4, 6, 8, and 10 folds; the maximum is 92.25% for AdaBoostM1-RF at k = 10, the minimum 86.26% for Bagging-ANN at k = 2.]
Fig. 4. Ensemble modeling using K-Fold.
4.3 Experiment-III
To compare the prediction time (in seconds) of the ensemble classifiers, the statistical T-test is applied. Table 1 shows that the T-test at the 0.05 level of significance found a significant difference between the CPU training times of the ensemble classifiers.

Table 1. CPU training time testing with T-test at 0.05 level of significance with 10-Fold

Dataset           (1) meta.AdaBoostM1-RF   (2) meta.Bagging-ANN
Final_Normalized  (100) 0.17               47.53 v
                  (v/ /*)                  (1/0/0)
The victory (v) symbol indicates that the bagging technique used with the ANN took the highest CPU time (47.53 s) to train the prediction model.

Table 2. Accuracy testing with T-test at 0.05 level of significance with 10-Fold.

Dataset           (1) meta.AdaBoostM1-RF   (2) meta.Bagging-ANN
Final_Normalized  (100) 91.44              90.21
                  (v/ /*)                  (1/0/0)
From Table 2, it is found that the locality scope dataset shows no major difference in accuracy at 10 folds. Hence, no statistically significant difference is discovered between the accuracy of the boosted RF model and the bagged ANN model. The absence of the difference symbols (v, *) shows that the bagging technique used with the ANN is not significantly different from AdaBoostM1 with the RF classifier.
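Weka's Experimenter performs a corrected paired T-test over repeated runs. As a simplified illustration of the underlying idea only, the sketch below computes a pooled two-sample t statistic over hypothetical per-fold accuracies (not the paper's raw runs) and compares it with the two-tailed critical value for 18 degrees of freedom at the 0.05 level:

```python
import math
import statistics as st

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic."""
    n1, n2 = len(a), len(b)
    sp2 = (((n1 - 1) * st.variance(a) + (n2 - 1) * st.variance(b))
           / (n1 + n2 - 2))
    return (st.mean(a) - st.mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical per-fold accuracies for the two ensembles (illustrative only).
adaboost_rf = [91.2, 92.0, 91.5, 90.8, 91.9, 92.3, 91.1, 91.6, 90.9, 91.7]
bagging_ann = [89.5, 92.4, 90.8, 91.9, 90.2, 92.6, 89.9, 91.5, 90.7, 92.0]
t = two_sample_t(adaboost_rf, bagging_ann)
# Two-tailed critical value for df = 18 at the 0.05 level (from t tables).
print(abs(t) > 2.101)  # False -> no significant accuracy difference
```

With these made-up runs the means differ slightly but the variability dominates, so the null hypothesis of equal accuracy is retained, mirroring the paper's Table 2 outcome.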
5 Prediction Count in Real Time
For real-time prediction, there is a need to measure both CPU training time and CPU testing time during the simulation. Various concepts of real-time prediction of students' demographic features, attitude, and awareness levels toward ICT and MT were presented in [11–16, 18]. To support the aim of the research, the authors calculated the prediction time for the locality scope of the student. Further, with the deployment of the presented predictive models, a user (teacher, head, principal) may predict the locality scope of students toward trending technology in higher-education institutions within the specified time. Figure 5 shows a pictorial representation of the results gained by the ensemble methods: the Y-axis represents real time (CPU training time), overall instances, accurately predicted international students, and accurately predicted national students, while the X-axis represents the ensemble methods with prediction counts.
[Figure 5 (bar chart, prediction count and time): per ensemble method, the accurate national and international prediction counts, overall instances, and real (CPU training) time. AdaBoostM1-RF: 240 national, 284 international, 568 overall, 0.17 s; Bagging-ANN: 69 national, 89 international, 170 overall, 47.53 s.]
Fig. 5. Real-time prediction count and time using ensemble classifiers.
Out of a total of 568 (balanced) instances, 240 accurate national students and 284 accurate international students are predicted with AdaBoostM1-RF in 0.17 s; hence, the right-prediction ratio is 524:568. The accurate prediction ratio of bagging with the ANN is 158:170 in 47.53 s.
6 Model Evaluation
Three major model appraisal parameters are shown in Table 3; they are essential to signify the power of the locality scope models. The strongest agreement among records in classifying the locality scope is seen in the kappa value of 0.86, computed with the bagged ANN.

Table 3. Performance metrics at the 10-Fold of ensemble classifier.

Ensemble classifier   Kappa statistic   F-Score   RMSE
AdaBoostM1-RF         0.85              0.92      0.26
Bagging-ANN           0.86              0.93      0.26
Also, the same classifier provides the highest F-score, 0.93, which represents a justifiable balance between recall and precision; this happens because of the application of bagging with the ANN. The authors found no noteworthy gap between the RMSE values of the ensemble classifiers.
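Cohen's kappa and the F-score in Table 3 are both derived from the confusion matrix. The sketch below computes them from a hypothetical balanced binary matrix chosen so the values come out in the same range as the Bagging-ANN row; it is not the paper's actual confusion matrix.

```python
def metrics(tp, fp, fn, tn):
    """Cohen's kappa and F-score from a binary confusion matrix."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n  # observed agreement (accuracy)
    # expected agreement by chance, from the marginal totals
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return kappa, f_score

# Hypothetical balanced matrix over 568 instances (284 per class).
kappa, f1 = metrics(tp=264, fp=20, fn=20, tn=264)
print(round(kappa, 2), round(f1, 2))  # 0.86 0.93
```

Because the classes are balanced, the chance agreement pe is 0.5, so kappa is simply rescaled accuracy here; with skewed classes the two diverge, which is why kappa is reported alongside accuracy.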
Fig. 6. ROC of AdBoostM1-RF at 10-Fold.
Figure 6 visualizes the ROC curve produced by the AdaBoostM1-RF ensemble classifier, which plots the true positive rate against the false positive rate of the real-time model at various thresholds for both classes of the student's locality scope. With dynamic cutoffs, the noteworthy sensitivity starts from 0.70 and ends at 0.95. Further, at the cutoff point 0.5, the sensitivity touches its highest point, 0.99, with an FP rate of 0.01. Hence, adaptive boosting has significantly improved the prediction accuracy attained by the individual RF [18]. Figure 7 shows that the ANN learning curve lies above both. This winning learner starts from the point 0.65 and closes near the point 0.99. Learning with exact prediction starts at the threshold 0.2, where the extreme value of the TP rate is 0.95 and the misclassification (FP) rate at the same point is 0.05. Accordingly, the bagging ensemble technique significantly enhanced the accuracy of the ANN in predicting the locality scope of the student in both countries.
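Each point on an ROC curve such as Figs. 6 and 7 is an (FP rate, TP rate) pair at one cutoff. A minimal sketch with hypothetical classifier scores (not the paper's model outputs):

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs: at each cutoff, predict positive when the
    model score is at least the threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical scores (probability of "international") and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    1,    0,    1,    0,    0,    0]
for fpr, tpr in roc_points(scores, labels, [0.25, 0.5, 0.75]):
    print(fpr, tpr)
```

Sweeping the threshold from high to low traces the full curve; a curve that hugs the top-left corner, as described for bagging-ANN, indicates high sensitivity at a low FP rate.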
7 Conclusion
To predict the locality scope of students, three experiments were conducted, following preprocessing and the measurement of time. Three well-known testing procedures were used in connection with statistical analysis. Using the holdout method, the bagging technique with the ANN achieved the highest accuracy (92.94%) at the 70:30 ratio, while the AdaBoostM1 technique with the ANN attained its highest accuracy (91.63%) at the 60:40 ratio. In K-fold with k = 10, AdaBoostM1 with RF attained the maximum accuracy (92.25%) and bagging with RF scored a maximum accuracy of 91.72%. It is also found that the ANN scored the highest accuracy of 92.94% as compared to RF's accuracy
Fig. 7. ROC of Bagging-ANN at 70:30 training ratio.
(92.25%). The results of the study reveal that the ANN outperformed the RF in predicting the locality scope of students. On the one hand, bagging significantly boosted the ANN's accuracy by 1.92%; on the other hand, AdaBoostM1 also considerably upgraded the prediction accuracy of the RF, by 2.65% [18]. The statistical analysis using the T-test at the 0.05 significance level proved no significant difference between the accuracies of the RF and ANN classifiers, while it found a significant difference in CPU time between bagging with ANN and AdaBoostM1 with RF. The authors recommend that this predictive model be deployed on the university's website for real-time prediction of the locality scope of students. Acknowledgments. The present study is funded by the Hungarian Government and co-sponsored by the European Social Fund under the project "Talent Management in Autonomous Vehicle Control Technologies (EFOP-3.6.3-VEKOP-16-2017-00001)." This chapter also forms part of the Ph.D. study of the first author.
References 1. Singhal, S., Jena, M.: A study on weka tool for data preprocessing, classification and clustering. Int. J. Innov. Technol. Explor. Eng. 2(6), 250–253 (2013) 2. Chauhan, R., Kaur, H.: Alam: Data clustering method for discovering clusters in spatial cancer databases. Int. J. Comput. Appl. 10(6.9), 14 (2010) 3. Verma, C., Dahiya, S.: Gender difference towards information and communication technology awareness in Indian Universities. SpringerPlus 5, 1–7 (2016) 4. Verma, C., Dahiya, S., Mehta, D.: An analytical approach to investigate state diversity towards ICT: a study of six universities of Punjab and Haryana. Indian J. Sci. Technol. 9, 1–5 (2016)
5. Verma, C.: Educational data mining to examine mindset of educators towards ICT knowledge. Int. J. Data Mining Emerg. Technol. 7, 53–60 (2017) 6. Verma, C., Stoffová, V., Illés, Z.: Analysis of situation of integrating information and communication technology in Indian higher education. Int. J. Inf. Commun. Technol. Educ. 7(1), 24–29 (2018) 7. Verma, C., Stoffová, V., Illés, Z.: Perception difference of Indian students towards information and communication technology in context of university affiliation. Asian J. Contemp. Educ. 2(1), 36–42 (2018) 8. Verma, C., Stoffová, V., Illés, Z., Dahiya, S.: Binary logistic regression classifying the gender of student towards computer learning in European schools. In: The 11th Conference of Ph.D. Students in Computer Science, p. 45. Szeged University, Hungary (2018) 9. Verma, C., Stoffová, V., Illés, Z.: An ensemble approach to identifying the student gender towards information and communication technology awareness in European schools using machine learning. Int. J. Eng. Technol. 7, 3392–3396 (2018) 10. Verma, C., Ahmad, S., Stoffová, V., Illés, Z., Dahiya, S.: Gender prediction of the European school's teachers using machine learning: preliminary results. In: International Advance Computing Conference, pp. 213–220. IEEE, India (2018) 11. Bathla, Y., Verma, C., Kumar, N.: Smart approach for real-time gender prediction of European school's principal using machine learning. In: The 2nd International Conference on Recent Innovations in Computing, pp. 159–175. Springer (2019) 12. Verma, C., Stoffová, V., Illés, Z.: Age group predictive models for the real time prediction of the university students using machine learning: preliminary results. In: IEEE International Conference on Electrical, Computer and Communication, pp. 1–7 (2019) 13. Verma, C., Ahmad, S., Stoffová, V., Illés, Z., Singh, M.: National identity predictive models for the real time prediction of European school's students: preliminary results.
In: IEEE International Conference on Automation, Computational and Technology Management, pp. 418–423, London (2019) 14. Verma, C., Illés, Z., Stoffová, V.: Attitude prediction towards ICT and mobile technology for the real-time: an experimental study using machine learning. In: The 15th International Scientific Conference eLearning and Software for Education, vol. 3, no. 1, pp. 247–252, Romania (2019) 15. Verma, C., Illés, Z., Stoffová, V.: Real-time prediction of development and availability of ICT and mobile technology in Indian and Hungarian University. In: The 2nd International Conference on Recent Innovations in Computing, pp. 605–615. Springer (2019) 16. Verma, C., Stoffová, V., Illés, Z.: Prediction of students' awareness level towards ICT and Mobile Technology in Indian and Hungarian University for the real-time: preliminary results. Heliyon 5, 1–7 (2019). Elsevier 17. Verma, C., Stoffová, V., Illés, Z.: Real-time prediction of student's locality towards information communication and mobile technology: preliminary results. Int. J. Recent Technol. Eng. 8(1), 580–585 (2019) 18. Verma, C., Stoffová, V., Illés, Z., Singh, M.: Prediction of locality status of the student based on gender and country towards ICT and Mobile Technology for the real-time. In: XXXII Didmattech, pp. 1–10. Trnava University, Slovakia (2019) 19. Koutina, M., Kermanidis, K.L.: Predicting postgraduate students' performance using machine learning techniques. In: IFIP Advances in Information and Communication Technology, p. 364 (2011) 20. Ayn, M.R.N., Garcia, M.T.C.: Prediction of university students' academic achievement by linear and logistic models. Span. J. Psychol. 11(1), 275–288 (2014) 21. Verma, C., Illés, Z., Stoffová, V., Singh, M.: ICT and Mobile Technology features predicting the University of Indian and Hungarian student for the real-time. In: IEEE System Modeling & Advancement in Research Trends, pp. 85–90 (2019)
22. Verma, C., Illés, Z., Stoffová, V.: Study level prediction of Indian and Hungarian students towards ICT and Mobile Technology for the real-time. In: IEEE International Conference on Computation, Automation and Knowledge Management, pp. 215–219, UAE (2020) 23. Verma, C., Illés, Z., Stoffová, V.: Real-time classification of national and international students for ICT and Mobile Technology: an experimental study on Indian and Hungarian University. In: The First International Conference on Emerging Electrical Energy, Electronics and Computing Technologies, pp. 1–9, J. Phys.: Conf. Ser. 1432, 012091, UK (2020) 24. Verma, C., Illés, Z., Stoffová, V.: Predictive modeling to predict the residency of teachers using machine learning for the real-time. In: Second International Conference on Futuristic Trends in Networks and Computing Technologies, pp. 592–601. Springer (2019) 25. Verma, C., Stoffová, V., Illés, Z.: Feature selection to identify the residence state of teachers for the real time. In: IEEE International Conference on Intelligent Engineering and Management, pp. 1–6, London (2020) (in press)
Automatic Detection and Classification of Tomato Pests Using Support Vector Machine Based on HOG and LBP Feature Extraction Technique Gayatri Pattnaik(B) and K. Parvathi KIIT Deemed to be University, Bhubaneswar, India [email protected], [email protected]
Abstract. Automatic detection and classification of insect pests has emerged as one of the interesting research areas in the agriculture sector, aiming to reduce damage due to pests. In the general pest-detection pipeline, feature extraction plays a significant role: it extracts features from the segmented image obtained by the segmentation process, and the extracted features are then passed to a classifier. In this work, we studied and implemented two feature extraction techniques, the Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP). The comparison shows that HOG performs better than its counterpart, with an accuracy of 97%. Here, we adopt SVM-based pest classification as a test case. Keywords: Feature extraction · HOG · LBP · SVM
1 Introduction
In the tropical, subtropical, and temperate regions of the world, crops are affected by a wide variety of diseases and pests, a problem aggravated by the effects of climate change. Changing climatic variables like humidity, temperature, and rainfall bring about pathogens, viruses, and pests which destroy crops. As a result, they pose a challenge to the food security and economic growth of a country. We focus our approach on the identification of pests of the tomato plant [1]. Most farmers have traditionally used pest management methods that spray chemicals all around the crop field, to the extent that some beneficial insects are killed as well. Another pest control method is the sticky trap, where pest insects are trapped and counted manually, a tedious and time-consuming task. To counter the problem, many techniques have been introduced and others are under research. Nowadays, image processing techniques are widely used to detect and classify pests using some model [2]. Image processing technology targets providing decision support for precise pesticide spraying so that labor cost can be reduced. In recent years, image processing
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_5
has been combined with machine learning, and the combination of the two has grown into a hotspot for research and application [3]. Recently, deep-learning-based Convolutional Neural Network (CNN) techniques have been implemented for the detection and classification of pests [4].
1.1 Previous Work
Hiary et al. [5] evaluated a software methodology for automatic detection and classification of disease-affected leaves. It consists of four main phases: 1. k-means clustering, 2. masking, 3. feature extraction, and 4. classification by a neural network. Bhadane et al. [6] described a software prototype for the detection of pests in images of infected leaves. The background subtraction method is used for object extraction, and the Sobel operator is used for edge detection. The main difficulty of the system arises when the colors of the pest and the leaf are almost similar; hence, to improve the accuracy, the threshold value should be chosen precisely. Barbedo [7] presented a survey of methods which use digital image processing techniques, including thresholding, regression analysis, color analysis, fuzzy logic, region growing, classification, neural networks, SVM, and self-organizing maps. Krishnan and Jabert [8] applied image processing techniques, including the k-means clustering and fuzzy c-means algorithms, for pest detection. Mainkar et al. [9] used image processing techniques with steps like image acquisition, image preprocessing, segmentation, feature extraction, and neural-network-based classification. Rajan and Radhakrishnan [10] explained image processing techniques for pest identification and plant disease detection. Ebrahimi et al. [11] discussed an automatic pest detection process for a strawberry greenhouse pest named thrips. The main objective was to detect thrips using the SVM classification method; the detection technique was also used to find parasites on the strawberry plant.
Tripathy and Maktedar [12] focused on various image processing methods followed by various classification techniques. Dey and Dey [13] focused on one of the most hazardous pests, the whitefly, and presented an automatic approach using image processing techniques. Noise removal and contrast enhancement are used to improve image quality before k-means-clustering-based segmentation. Then, feature extraction methods such as the Gray Level Run Length Matrix (GLRLM) and the Gray Level Co-occurrence Matrix (GLCM) were used. Finally, classifiers like the Support Vector Machine (SVM), Artificial Neural Network (ANN), Bayesian classifier, binary decision tree classifier, and K-Nearest Neighbor were used to distinguish whitefly pests in infected leaf images. Venugoban and Ramana [14] explained classification based on gradient-based feature extraction processes, namely the Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and the Histogram of Oriented Gradients (HOG); after comparison, HOG outperformed the others. Ojala et al. [15] discussed the Local Binary Pattern (LBP) feature extraction technique and how it is useful for further processing.
Labaña et al. [16] explained another method utilizing a convolutional neural network model for the treatment of affected crops, with images taken by the user at any time and anywhere. With the addition of further techniques, CNN and REST resulted in an accuracy of 90%. Boulent et al. [17] presented the major issues and shortcomings of work that used Convolutional Neural Networks to automatically identify crop diseases; the most important issue is a weak grounding in machine learning concepts, which leads to poor generalization.
2 Proposed Work
This paper emphasizes two feature extraction techniques, HOG and LBP, and presents a comparative study of algorithms used for feature extraction in pest detection on the tomato plant. As a baseline classifier, a multiclass SVM has been used. Figure 1 depicts an overview of the system, and Fig. 2 shows the flow graph of the proposed work.
[Figure 1 (block diagram): Input Images → Feature Extraction (HOG and LBP) → SVM classifier → Class Name of Pest.]
Fig. 1. An overview of system.
Fig. 2. Flowchart of proposed work.
3 Methods and Materials
There are three main steps in image processing: first, conversion of the collected or captured images into gray scale of size 256 × 256 for computer processing; second, preprocessing of the image; and third, display of the processed image. The feature extraction process extracts texture features for further processing. The extraction techniques we have used are HOG, LBP, and the concatenation of the HOG and LBP features, which is named Hybrid. The extracted features are fed into a classifier, a multiclass SVM model. Two separate datasets are required to develop a model: a training set and a test set. In this work, we have taken 959 images, of which 859 have been used for training and the remaining 100 for testing.
3.1 Image Processing Operations
An image has two dimensions and is represented by f(x, y), where "f" is the amplitude and x and y are the two spatial coordinates. For the purpose of automatic detection of pests on scanned leaves, image analysis has to be followed. The analysis follows the extraction of the pest from its leaf background by feature extraction techniques. Feature extraction computes attributes like color, shape, and size descriptors corresponding to each region. Finally, we obtain the information about pests and their features, processed through a multiclass SVM classifier [4].
The common feature extraction techniques are Speeded Up Robust Features (SURF), HOG, Scale Invariant Feature Transform (SIFT), LBP, sparse coding, autoencoders, restricted Boltzmann machines, Principal Component Analysis (PCA), Independent Component Analysis (ICA), and K-means. In this paper, we focus on the HOG and LBP feature extraction techniques.

3.2.1 Histogram of Oriented Gradients (HOG)

HOG evaluates a histogram of gradient orientations of the image. In this technique, the appearance and shape of an object can be characterized by the distributions of local intensity gradients without any prior knowledge of the gradient positions. In practice, the image window is divided into smaller cells or spatial regions, and each cell accumulates a local 1-D histogram of gradient directions or edge orientations over its pixels [12]. An overview diagram of HOG is given below in Fig. 3.
Automatic Detection and Classification of Tomato Pests
53
Fig. 3. An overview of HOG feature extraction.
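A minimal HOG sketch in NumPy follows; the cell size of 8 and 9 orientation bins are common defaults assumed here, and block normalization of the full descriptor is omitted for brevity:

```python
import numpy as np

def hog_features(gray, cell=8, bins=9):
    """Minimal HOG sketch: per-cell histograms of gradient orientation.

    Gradients via central differences; unsigned orientations binned into
    `bins` over [0, 180). Block normalization of the full HOG descriptor
    is omitted for brevity.
    """
    gray = gray.astype(float)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0

    h, w = gray.shape
    ch, cw = h // cell, w // cell
    feats = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            for k in range(bins):
                feats[i, j, k] = m[b == k].sum()   # magnitude-weighted bins
    return feats.ravel()

# A 256 x 256 image with 8 x 8 cells yields 32 * 32 * 9 = 9216 features.
```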
3.2.2 Local Binary Patterns (LBP)

In computer vision, LBP is used as a visual descriptor. It is a powerful technique for extracting texture features in comparison to other state-of-the-art techniques. In LBP, each pixel is related to its neighboring pixels' gray levels and coded as 1 or 0. LBP is a simple algorithm and one of the oldest, introduced in 1994. It computes rotation-invariant pattern statistics of individual pixels and corresponds to certain micro-features in the image; the pixel pattern is then treated as a feature detector. It can be used together with the HOG algorithm mentioned above to improve performance. Here, the examined window is divided into 32 × 32 cells. Each pixel of a cell is compared to its eight neighboring pixels: when the center pixel value is greater than the neighbor value, write "0", otherwise write "1". The histogram of each cell is then computed and normalized, and the histograms of all cells are concatenated into the feature vector.

3.2.3 Support Vector Machine (SVM)

The Support Vector Machine (SVM) is a supervised learning technique that finds a decision surface maximizing the margin between two classes, and it is widely used for pattern classification. In general, SVMs outperform other classifiers because of their generalization performance. When the data is linearly separable, the SVM optimization has a unique global minimum. A trained SVM produces a hyperplane, a subset of the Euclidean n-dimensional space that divides the space into two non-overlapping regions. SVMs were originally developed for solving binary classification problems and were later extended to multiclass pattern classification. In multiclass classification, each training point belongs to exactly one of the different classes, and the goal is to predict precisely to which class a new data point belongs. Four standard techniques exist for multiclass SVM problems.
Those are One-Versus-One (OVO), One-Versus-All (OVA), Directed Acyclic Graph (DAG), and Unbalanced Decision Tree (UDT). We used One-Versus-All (OVA) linear SVMs [13].
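The LBP coding and the Hybrid concatenation described above can be sketched as follows; the 8-bit neighbour ordering and the 256-bin per-cell histograms are conventional choices, and the resulting vector would then be fed to an OVA linear SVM (e.g., scikit-learn's LinearSVC, which trains one-vs-rest by default):

```python
import numpy as np

def lbp_image(gray):
    """8-neighbour LBP code image: bit = 1 where neighbour >= centre."""
    g = gray.astype(float)
    c = g[1:-1, 1:-1]
    # neighbours clockwise starting from the top-left pixel
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offs):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    return code

def lbp_histogram(gray, cell=32):
    """Concatenate normalized 256-bin LBP histograms over cell x cell regions."""
    code = lbp_image(gray)
    h, w = code.shape
    hists = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            block = code[i:i + cell, j:j + cell]
            hist, _ = np.histogram(block, bins=256, range=(0, 256))
            hists.append(hist / hist.sum())
    return np.concatenate(hists)

def hybrid_features(hog_vec, lbp_vec):
    """'Hybrid' descriptor from the paper: concatenation of HOG and LBP."""
    return np.concatenate([hog_vec, lbp_vec])
```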
4 Image Data and Performance Metrics

The dataset used to evaluate the performance of the SVM-based model includes images of tomato plant pests common to India, Mexico, and the Philippines. These images were obtained through Google search from different sites, such as https://nbair.res.in, https://apnikheti.com, and https://flickr.com. Each image in this dataset is a three-channel (RGB) color image of size 256 × 256. We collected 959 images in total, of which 859 have been used for training and the remaining 100 for the test set. Table 1 lists the names of the common tomato pests.

Table 1. List of names of pests.

Label  Name                           Training  Test  Total
1      Bactrocera litifrons                 80    10     90
2      Bemisia tabaci                       80    10     90
3      Chrysodeixis chalcites               94    10    104
4      Epilachna vigintioctopunctata        94    10    104
5      Helicoverpa armigera                 92    10    102
6      Icerya aegyptiaca                    80    10     90
7      Liriomyza trifolii                   88    10     98
8      Nesidiocoris tenuis                  91    10    101
9      Spodoptera litura                    80    10     90
10     Tuta absoluta                        80    10     90
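The split in Table 1 can be checked for consistency with the totals quoted in the text:

```python
# Per-class (training, test) counts transcribed from Table 1.
counts = {
    "Bactrocera litifrons": (80, 10), "Bemisia tabaci": (80, 10),
    "Chrysodeixis chalcites": (94, 10), "Epilachna vigintioctopunctata": (94, 10),
    "Helicoverpa armigera": (92, 10), "Icerya aegyptiaca": (80, 10),
    "Liriomyza trifolii": (88, 10), "Nesidiocoris tenuis": (91, 10),
    "Spodoptera litura": (80, 10), "Tuta absoluta": (80, 10),
}
n_train = sum(t for t, _ in counts.values())   # 859, as stated in the text
n_test = sum(s for _, s in counts.values())    # 100
n_total = n_train + n_test                     # 959
```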
In this paper, we conducted a performance evaluation from three different aspects: data, features, and model. The performance of the proposed model is evaluated using a confusion matrix and measured by overall accuracy, i.e., the number of correct classifications divided by the total number of classifications. It is calculated using Eq. (1):

Accuracy = (TP + TN)/(TP + TN + FP + FN)    (1)
where TP denotes True Positives, TN True Negatives, FP False Positives, and FN False Negatives.

Table 2. Performance metrics of the HOG and LBP feature extraction techniques.

Feature extraction technique  Overall accuracy (%)
HOG                           97
LBP                           96
Table 2 shows that HOG feature extraction achieved an accuracy of about 97%, whereas LBP achieved 96% and the Hybrid technique the same 96%. Hence, we choose HOG feature extraction for the next stage of work.
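Equation (1) generalizes to the multiclass case as correct classifications over all classifications, i.e., the trace of the confusion matrix divided by its total; a small sketch with an illustrative, made-up matrix:

```python
import numpy as np

def overall_accuracy(cm):
    """Overall accuracy from a multiclass confusion matrix.

    Correct classifications (the diagonal) over all classifications;
    in the binary case this reduces to Eq. (1),
    (TP + TN) / (TP + TN + FP + FN).
    """
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

# Illustrative 3-class confusion matrix (rows = actual, cols = predicted).
cm = [[48, 1, 1],
      [2, 47, 1],
      [0, 3, 47]]
acc = overall_accuracy(cm)   # 142 correct out of 150 -> about 0.947
```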
5 Conclusion and Future Work

In this paper, two emerging feature extraction methods (HOG and LBP) are discussed for the detection of pests. It is observed from Table 2 that HOG performs better than LBP on the tomato pest data. A multiclass SVM has been used for making predictions. In future work, other statistical or gradient feature extraction techniques can be applied with other classifiers for better results.
References

1. Fuentes, A., Yoon, S., Kim, S., Park, D.: A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. J. Sens. 17(9), 2022 (2017)
2. Miranda, J.L., Gerardo, B.D., Tanguilig III, B.T.: Pest detection and extraction using image processing techniques. J. Comp. Comm. Eng. 3(3), 189 (2014)
3. Xiao, D., Feng, J., Lin, T., Pang, C., Ye, Y.: Classification and recognition scheme for vegetable pests based on the BOF-SVM model. J. Agricult. Biol. Eng. 11(3), 190–196 (2018)
4. Alfarisy, A.A., Chen, Q., Guo, M.: Deep learning based classification for paddy pests & diseases recognition. In: International Conference on Mathematics and Artificial Intelligence, pp. 21–25. ACM (2018)
5. Al-Hiary, H., Bani-Ahmad, S., Reyalat, M., Braik, M., Rahamneh, Z.: Fast and accurate detection and classification of plant diseases. J. Comp. Appl. 17(1), 31–38 (2011)
6. Bhadane, G., Sharma, S., Nerkar, V.B.: Early pest identification in agricultural crops using image processing techniques. J. Elect. Elect. Comput. Eng. 2(2), 77–82 (2013)
7. Barbedo, J.G.A.: Digital image processing techniques for detecting, quantifying and classifying plant diseases. SpringerPlus 2(1), 660 (2013)
8. Krishnan, M., Jabert, G.: Pest control in agricultural plantations using image processing. IOSR J. Elect. Comm. Eng. (IOSR-JECE) 6(4), 68–74 (2013)
9. Mainkar, P.M., Ghorpade, S., Adawadkar, M.: Plant leaf disease detection and classification using image processing techniques. J. Inn. Emer. Res. Eng. 2(4), 139–144 (2015)
10. Rajan, P., Radhakrishnan, B.: A survey on different image processing techniques for pest identification and plant disease detection. J. Comput. Sci. Net. (IJCSN), 137–141 (2016)
11. Ebrahimi, M.A., Khoshtaghaza, M.H., Minaei, S., Jamshidi, B.: Vision-based pest detection based on SVM classification method. Comput. Electron. Agric. 137, 52–58 (2017)
12. Tripathi, M.K., Maktedar, D.D.: Recent machine learning based approaches for disease detection and classification of agricultural products. In: 2016 International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–6. IEEE (2016)
13. Dey, A., Bhoumik, D., Dey, K.N.: Automatic detection of whitefly pest using statistical feature extraction and image classification methods. Int. Res. J. Eng. Technol. 3(9), 950–959 (2016)
14. Venugoban, K., Ramanan, A.: Image classification of paddy field insect pests using gradient-based features. Int. J. Mach. Learn. Comput. 4(1) (2014)
15. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. (7), 971–987 (2002)
16. Labaña, F.M., Ruiz, A., Garcia-Sánchez, F.: PestDetect: pest recognition using convolutional neural network. In: 2nd International Conference on ICTs in Agronomy and Environment, pp. 99–108. Springer, Cham (2019)
17. Boulent, J., Foucher, S., Théau, J., St-Charles, P.L.: Convolutional neural networks for the automatic identification of plant diseases. Front. Plant Sci. 10 (2019)
Poly Scale Space Technique for Feature Extraction in Lip Reading: A New Strategy M. S. Nandini1(B) , Nagappa U. Bhajantri2 , and Trisiladevi C. Nagavi3 1 Department of Information Science & Engineering, NIE Institute of Technology, Mysuru,
Karnataka, India [email protected] 2 Department of Computer Science & Engineering, Government Engineering College, Chamarajanagara, Karnataka, India [email protected] 3 Department of Computer Science & Engineering, Jayachamaraja College of Engineering, JSS S&T U, Mysuru, Karnataka, India [email protected]
Abstract. Lip reading involves the extraction of visual speech information contained in the inner and outer lip contours. Visibility of the teeth and tongue during speech provides important speech cues. Particularly for fricatives, the place of articulation can often be determined visually, that is, for labiodentals (upper teeth on the lower lip), interdentals (tongue between the front teeth), and alveolars (tongue touching the gum ridge). Other speech information is contained in the protrusion and wrinkling of the lips. Feature extraction is a remarkable process in lip reading, as it holds an important role in lip reading classification. In Improved Speeded Up Robust Feature (ISURF) extraction, finding exact edges is difficult because of a higher false-corner ratio. With PSST, exact edges can be obtained with a reduced false-corner ratio. This paper presents PSST based on the Harris algorithm, which gives more precise edge detection under different illumination conditions. Keywords: ISURF · PSST · Feature extraction · Lip reading · Speech
1 Introduction

In recent years, computer vision has been a very active area of research. A computer vision system operates on camera images and is closely analogous to the human vision system [2, 7, 8], in which retinal images are processed. Within computer vision [16, 18], important research domains include forensic studies [13, 14], speech reading [19, 20], biometric speech and human recognition [15, 16], face recognition [10, 17], and lip reading. Lip reading is a technique of understanding speech by visually interpreting the movements of the lips; it is also a method of understanding unheard speech by interpreting the lip and facial movements of the speaker. There is no special hardware, like head-mounted © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_6
Poly Scale Space Technique for Feature Extraction
57
cameras, that supports robustness against online conditions such as different illumination, different image sizes, and reasonable freedom of movement for the speaker. Lip reading is most beneficial to deaf and mute people, and it is crucial for industry bodies engaged in voice dubbing, recording, aggregating, and disseminating speech. It plays a vital role in noisy environments, such as industrial machinery, where human voices fail to control the equipment in spite of the availability of high word-recognition-rate systems. The combination of even poor hearing and mediocre lip reading skill can create a competent auditory-visual speech perceiver.

In the lip reading process, feature extraction is a significant step. In lip image analysis, feature extraction is the middle stage after lip tracking and detection. During the classification stage of lip images, feature extraction appears in both the training phase and the testing phase. Different attributes of an object, such as shape, color, and texture, can be identified through feature extraction [3, 5, 6]. PSST is a refined form of the Improved SURF technique, which has been used for feature detection of mechanical parts, buildings, etc. Here, PSST is used in speech processing for lip reading to achieve exact edge detection.

Section 2 of the paper provides the literature survey. Section 3 describes the methodology, and Sect. 4 encompasses the details of the PSST procedure. Section 5 furnishes a summary of the experiments carried out and the output obtained, along with discussions. Section 6 encloses details about future work and the conclusion.
2 Literature Survey

Since feature extraction is central to object recognition, various feature extraction techniques have been proposed by many authors. D. G. Lowe proposed the Scale Invariant Feature Transform (SIFT) [4]. SIFT builds an image pyramid by filtering each layer with Gaussians of increasing sigma value and taking their differences. But SIFT [4, 9] does not cope well with scaling changes, and it is also time consuming. SIFT combined with Principal Component Analysis (PCA) is a technique built by Ke and Sukthankar [5], which emerged in order to refine SIFT: by decreasing dimensionality, the computational cost is reduced. Here, PCA is used in place of smoothed weighted histograms in order to normalize the gradient patch. The SURF algorithm was developed by Bay and Tuytelaars [8]; in it, the authors ensure high speed in feature extraction based on the Harris corner detection method [1]. Fast-SIFT (F-SIFT) was exercised by Chandrasekhar et al. [12]. Its feature vector is considerably smaller than the SIFT feature vector. F-SIFT is about as robust as SURF because it uses a K-D tree to represent and index the descriptors, and it achieves a speed-up compared to SIFT. It is also found to be faster and better in several aspects than SIFT, but at the cost of detecting fewer features for matching; thus F-SIFT needs improvement in the total number of features it identifies.

The Harris corner detector [1] detects corner points and partly overcomes the drawback, observed in F-SIFT, of missing some features.
58
M. S. Nandini et al.
It is capable of finding not only corner points but also image locations that have high gradient values in all directions. However, it is sensitive to scale; accordingly, it is observed from the literature that there is an increase in the false-corner-detection ratio in SURF [10, 11]. In Improved SURF, the false-corner ratio is reduced by increasing the threshold value, and the computational time for finding corners is also reduced. From the literature survey, it has been observed that a technique that detects corners more precisely is most needed for lip reading.
3 Overview of Proposed System Lip movements of any person are analyzed by sequence of steps indicated in Fig. 1.
Fig. 1. Methodology of proposed system.
Lip shape recognition begins by extracting video frames from the video database; then, preprocessing of every frame is carried out. Once preprocessing is completed, the frames are passed to face detection; then, lip detection and lip contour extraction are carried out. After lip tracking is complete, lip shape features are extracted by the proposed technique; these lip shapes are then compared with the stored trained shape patterns present in the knowledge base. Once the comparison is done, the processed lip shapes are classified into different groups of words.
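The stage ordering of Fig. 1 can be sketched as a simple function composition; the stage bodies below are placeholder stubs (a real system would use an actual face/lip detector, e.g., a Haar cascade, and the PSST feature extractor):

```python
import numpy as np

# Hypothetical stage implementations standing in for the real detectors.
def preprocess(frame):
    return frame / 255.0                      # normalize intensities

def detect_face(frame):
    return frame[8:-8, 8:-8]                  # crop stub in place of a detector

def detect_lips(face):
    return face[face.shape[0] // 2:]          # lower-half stub for the mouth region

def extract_features(lips):
    return lips.mean(axis=0)                  # stub for the PSST feature extractor

def lip_reading_pipeline(frames):
    """Apply the stages of Fig. 1, in order, to every video frame."""
    return [extract_features(detect_lips(detect_face(preprocess(f))))
            for f in frames]

frames = [np.random.randint(0, 256, (64, 64)).astype(float) for _ in range(5)]
feats = lip_reading_pipeline(frames)          # one feature vector per frame
```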
4 Proposed Methodology: Poly Scale Space Technique (PSST)

Real corners are detected by the Harris algorithm at large scales, but the drawback is that corner positioning is not precise. At small scales, corner positioning is accurate, but the false-corner ratio increases.

Step 1: This method refines the autocorrelation matrix M, and corners are determined by double discriminants. Scale space theory is built on scale transformations of the given image; it aims at obtaining scale space sequences of the image under various scales and then realizing edge and corner detection and feature extraction at different resolutions.

Step 2: The Gaussian kernel function under scale space is defined as

fout = k ⊗ fin    (1)
where fin is any signal and ⊗ denotes convolution. If Max fout < Max fin, then k is a scale-space kernel. The Gaussian kernel has the scale-invariance property, and it can be proved that the Gaussian kernel is the transformation that realizes scale invariance. The Gaussian kernel is defined as

G(x, σ) = (1/(√(2π)·σ)) · exp(−x^2/(2σ^2))    (2)

The scale space representation is acquired from the Gaussian filter and can be expressed in (x, σ) space, where x is the location parameter and σ is the scale parameter. This Poly scale space method is built on Gaussian theory.

Step 3: Here, the Poly scale space method alters the autocorrelation matrix M and adds fitness scaling to M:

M = u(x, y, σ1, σD) = σD^2 · w1 ⊗ | Ix^2 ⊗ wD     Ix·Iy ⊗ wD |
                                  | Ix·Iy ⊗ wD    Iy^2 ⊗ wD  |

The improved M is defined as

M = σD^2 · e^(−(u^2+v^2)/(2σ1^2)) ⊗ | Ix^2 ⊗ e^(−(u^2+v^2)/(2σD^2))     Ix·Iy ⊗ e^(−(u^2+v^2)/(2σD^2)) |
                                    | Ix·Iy ⊗ e^(−(u^2+v^2)/(2σD^2))    Iy^2 ⊗ e^(−(u^2+v^2)/(2σD^2))  |    (3)
where σ1 and σD represent the integration and differentiation scales, respectively, and the operator ⊗ represents convolution; σ1 = kσD, where k is the linear coefficient.

Step 4: We calculate the corner response function of each point. If the response function value R of a point (x, y) is a local maximum and is greater than a given threshold, then the point (x, y) is considered a corner point. Here, the local maximum of the corner response function is taken over image points under the same scale space as well as over the same points under different scale spaces.
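Steps 1–4 can be sketched as a scale-adapted Harris response in NumPy; the Gaussian kernel radii, κ = 0.04, and k = 2 are illustrative parameter choices, not values from the paper:

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel G(x, sigma) from Eq. (2), normalized to sum 1."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def smooth(img, sigma):
    """Separable Gaussian smoothing via two 1-D convolutions."""
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda m: np.convolve(m, k, mode="same"), 1, out)

def harris_response(img, sigma_d=1.0, k_lin=2.0, kappa=0.04):
    """Scale-adapted Harris response built from the matrix M of Eq. (3).

    sigma_d is the differentiation scale; the integration scale is
    sigma_1 = k_lin * sigma_d, matching sigma_1 = k * sigma_D in the text.
    """
    L = smooth(img.astype(float), sigma_d)
    Iy, Ix = np.gradient(L)
    s1 = k_lin * sigma_d
    # entries of M: derivative products smoothed at the integration scale
    Ixx = sigma_d**2 * smooth(Ix * Ix, s1)
    Iyy = sigma_d**2 * smooth(Iy * Iy, s1)
    Ixy = sigma_d**2 * smooth(Ix * Iy, s1)
    det = Ixx * Iyy - Ixy**2
    tr = Ixx + Iyy
    return det - kappa * tr**2   # corner response R; corners where R > threshold

# A white square on a black background: R should peak near its corners.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
R = harris_response(img)
```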
5 Experiments and Results

Around forty video samples were collected as a ground truth database. The group of words Appa, Amma, Akka, Anna, and Ajja is taken as input to experiment with the Improved SURF [7] and PSST strategies. To evaluate the proposed system, the following metrics are used.
(a) Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in the actual class.

Recall = TP/(TP + FN)    (4)

(b) Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.

Precision = TP/(TP + FP)    (5)
TP, TN, FP, and FN are defined as follows:
True Positives (TP): correctly predicted positive values; the actual class is yes and the predicted class is also yes.
True Negatives (TN): correctly predicted negative values; the actual class is no and the predicted class is also no.
False Negatives (FN): the actual class is yes but the predicted class is no.
False Positives (FP): the actual class is no but the predicted class is yes.

The numbers of samples used for training and testing are 30 and 10, respectively. The experimental results of applying Improved SURF and PSST on the group of words Amma, Appa, Ajja, Akka, and Anna are tabulated in Tables 1 and 2, respectively.

Table 1. Summary of results of applying Improved SURF on the words Amma, Appa, Ajja, Akka, and Anna

       Amma  Appa  Ajja  Akka  Anna  Precision (%)  Recall (%)
Amma     05    02    01    00    00          62.50      62.50
Appa     02    06    02    00    00          65.66      60.50
Ajja     00    00    05    02    01          50.00      62.50
Akka     00    06    02    05    02          62.50      55.55
Anna     01    01    00    01    05          62.50      62.50

Average precision: 61.83    Average recall: 60.71
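The per-word rates of Eqs. (4)–(5) follow from a confusion matrix (rows = actual word, columns = predicted word); the 3 × 3 matrix below is illustrative, not taken from the tables:

```python
import numpy as np

def per_class_precision_recall(cm):
    """Eqs. (4)-(5) per class: recall_i = cm[i,i]/row_i, precision_i = cm[i,i]/col_i."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # TP / (TP + FP), column-wise
    recall = tp / cm.sum(axis=1)      # TP / (TP + FN), row-wise
    return precision, recall

# Illustrative 3-word matrix (rows = actual word, columns = predicted word).
cm = [[6, 2, 0],
      [1, 7, 2],
      [1, 0, 6]]
p, r = per_class_precision_recall(cm)
avg_precision, avg_recall = p.mean(), r.mean()   # averages as in the tables
```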
With Improved SURF, the experimental outcome shows a precision rate of 61.83% and a recall rate of 60.71%. PSST demonstrates its superiority with a precision rate of 66.71% and a recall rate of 68.16%. The precision and recall rates of Improved SURF [20] and PSST are depicted in the graphs shown in Figs. 2 and 3, respectively. Hence, it is observed that there is an empirical increase in the precision rate by 5.88% and in the recall rate by 7.45% with PSST, due to the decrease in the false-corner ratio. The Improved SURF and PSST techniques applied on the word Amma are portrayed in Figs. 4 and 5, respectively. The experiment also attempts to emphasize visual evidence revealing the efficiency of the model. In other words, the visual appearance
Table 2. Summary of results of applying PSST on the group of words Amma, Appa, Ajja, Akka, and Anna

       Amma  Appa  Ajja  Akka  Anna  Precision (%)  Recall (%)
Amma     06    02    00    00    00          66.66      75.00
Appa     01    07    00    01    01          77.77      70.00
Ajja     00    00    06    02    01          60.00      65.66
Akka     01    00    02    05    01          62.50      62.50
Anna     01    00    02    00    06          66.66      66.66

Average precision: 66.71    Average recall: 68.16
Fig. 2. Precision versus Sample videos of Amma, Appa, Ajja, Akka, and Anna on applying Improved SURF and PSST.
comparison of these two figures helps to explore further avenues. In particular, a significant enhancement in corner detection accuracy is observed in Fig. 5.
6 Conclusions and Future Work

Lip reading is invaluable for the hard of hearing and the deaf, and dominant feature extraction is the most important stage for the classification of lip images. In the proposed Poly Scale Space Technique, a point (x, y) whose response function value R is a local maximum greater than the threshold value is considered a corner point. The corner points obtained are more precise than those obtained using Improved SURF [20].
Fig. 3. Recall versus Sample videos of Amma, Appa, Ajja, Akka, and Anna on applying Improved SURF and PSST
Fig. 4. Lip feature extraction and lip matching on applying Improved SURF
The average precision and recall rates are noticeably higher for PSST than for the Improved SURF technique. With this, we have succeeded in extracting invariant features for lip reading. Moreover, the effort can be extended to improve the precision and recall rates for the same word videos, to further enhance the accurate corner detection ratio.
Fig. 5. Lip feature extraction and lip matching on applying PSST
Acknowledgements. Data set used for the research work was collected from children of the Rotary west and parent association of deaf children trust, Bhogadi, Mysuru.
References

1. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the 4th Alvey Vision Conference, pp. 147–151 (1988)
2. Lowe, D.G.: Object recognition from local scale-invariant features. In: IEEE International Conference on Computer Vision, pp. 1150–1157. IEEE Computer Society (1999)
3. Agarwal, S., Roth, D.: Learning a sparse representation for object detection. In: European Conference on Computer Vision, vol. 4, Copenhagen, Denmark, pp. 113–130 (2002)
4. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
5. Ke, Y., Sukthankar, R.: PCA-SIFT: a more distinctive representation for local image descriptors. In: Proceedings of Conference on Computer Vision and Pattern Recognition, pp. 511–517 (2004)
6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 91–110. Springer, Netherlands (2004)
7. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)
8. Bay, H., Tuytelaars, T., Van Gool, L.: SURF: speeded up robust features. In: European Conference on Computer Vision, pp. 404–417 (2006)
9. Abdel-Hakim, A.E., Farag, A.A.: CSIFT: a SIFT descriptor with color invariant characteristics. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1978–1983 (2006)
10. Tuytelaars, T., Mikolajczyk, K.: Local invariant feature detectors: a survey. In: Foundations and Trends in Computer Graphics and Vision, pp. 177–280 (2008)
11. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
12. Chandrasekhar, V., Makar, M., Takacs, G., Chen, D., Tsai, S.S., Cheung, N.M., Grzeszczuk, R., Reznik, Y., Girod, B.: Survey of SIFT compression schemes. In: International Workshop on Mobile Multimedia Processing (2010)
13. Wang, G., Tao, L., Di, H., Ye, X., Shi, Y.: A scalable distributed architecture for intelligent vision system. IEEE Trans. Ind. Inform. 8, 91–99 (2012)
14. Liu, H., Chen, S., Kubota, N.: Intelligent video systems and analytics: a survey. IEEE Trans. Ind. Inf. 9, 1222–1233 (2013)
15. Lee, M.H., Park, I.K.: Performance evaluation of local descriptors for affine invariant region detector. In: Asian Conference on Computer Vision, pp. 630–643 (2014)
16. Salahat, E.N., Saleh, H.H.M., Salahat, S.N., Sluzek, A.S., AlQutayri, M., Mohammad, B., Elnaggar, M.I.: Object detection and tracking using depth data. US Patent App. 14/522,524, October 23, 2014
17. Tamura, S., Ninomiya, H., Kitaoka, N., Osuga, S., Iribe, Y., Takeda, K., Hayamizu, S.: Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 575–582. IEEE (2015)
18. Wand, M., Koutn, J., et al.: Lipreading with long short-term memory. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6115–6119. IEEE (2016)
19. Zimmermann, M., Ghazi, M.M., Ekenel, H.K., Thiran, J.P.: Visual speech recognition using PCA networks and LSTMs in a tandem GMM-HMM system. In: Asian Conference on Computer Vision, pp. 264–276. Springer (2016)
Machine Learning Methods for Vehicle Positioning in Vehicular Ad-Hoc Networks Suryakanta Nayak1 , Partha Sarathi Das2 , and Satyasen Panda2(B) 1 Department of Computer Science and Application, Utkal University, Bhubaneswar, Odisha,
India [email protected] 2 Department of ECE, GITA, Bhubaneswar, Odisha, India [email protected], [email protected]
Abstract. Unambiguous vehicular sensing is one of the most important aspects of autonomous driving in vehicular ad-hoc networks. Conventional techniques such as communication-based technologies (e.g., GPS) or reflection-based technologies (e.g., RADAR, LIDAR) have various limitations in detecting concealed vehicles in dense urban areas without line of sight, which may trigger serious accidents for autonomous vehicles. To address this issue, this paper proposes a machine learning method based on stochastic Gaussian process regression (SGP) to position vehicles in a distributed vehicular system using received signal vector (RSV) information. To estimate the test vehicle position and the respective position errors, the proposed SGP method records the RSV readings at neighboring locations with continuous approximation of the vehicle-to-vehicle (V2V) distance, angle of arrival (AoA), and path delay. The subsequent averaging of the training RSVs minimizes the effects of shadowing noise and multipath fading. The prediction performance of the proposed learning approach is measured in terms of the root mean square prediction error (RMSE) in a realistic environment. Finally, the prediction performance of the proposed learning method is compared with other existing fingerprinting methods for error-free location estimation of the vehicular network. Keywords: Machine learning · Stochastic Gaussian process regression · Root mean square error estimate · Vehicle positioning
1 Introduction Automated driving is an advanced research direction to reduce traffic congestion, car accidents, manual errors, and greenhouse emissions. One of the primary aspects of automated driving is vehicle positioning or vehicle-to-vehicle (V2V) relative positioning [1] which involves tracking the nearby vehicles of different sizes and trajectories, and then using that trajectory information for navigation and accident avoidance. There are various conventional approaches like global positioning system (GPS), light detection and ranging (LIDAR), and radio detection and ranging (RADAR), etc., [2] which are © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_7
66
S. Nayak et al.
used to collect the location information of nearby vehicles through V2V transmission. However, these techniques cannot guarantee the required positioning accuracy under different environmental conditions, for several reasons. In urban areas and tunnels, the performance of GPS systems is reduced by the slow data exchange process and by reliability issues. The performance of LIDAR and RADAR technologies is diminished in non-line-of-sight (NLOS) situations because they cannot see through large objects like buildings and trees to detect the concealed vehicles (CVs) in vehicular ad-hoc networks. There are various approaches to V2V positioning based on the exchange of position information between nearby vehicles for a smooth driving environment. For example, LIDAR emits sharp laser beams to scan the neighboring environment and produces a dynamic three-dimensional (3D) map for traveling. But LIDAR and similar technologies are ineffective under adverse weather conditions because of the limitations of light in penetrating rain, fog, and snow [3–5]. Also, the higher cost and steep data processing complexity of LIDAR make it unprofitable for V2V and vehicle-to-everything (V2X) positioning. In synchronous-transmission-oriented vehicle positioning approaches, the propagation delays (PDs) and spatial parameters of multiple paths are taken into account. Here, the observing vehicle (OV) estimates the position of nearby objects by measuring the PDs and frequency variations between the transmitted and received signals. The antenna arrays in the transceiver of the OV perform spatial filtering to measure spatial parameters such as the angle of arrival (AoA) and the time delay (TD). So, synchronization between the transmitter and receiver should be guaranteed to achieve near-optimal V2V and V2X positioning.
To overcome the above challenges, we propose a novel machine learning approach based on stochastic Gaussian process regression (SGP) to predict vehicle locations based on the received signal vector (RSV) data. The rest of the paper is organized as follows. Section 2 describes the model of vehicular positioning system. In Sect. 3, the detailed discussion on the proposed system model with machine learning method is provided. The simulation results and discussions are provided in Sect. 4. Finally, conclusions are drawn in Sect. 5.
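A minimal, NumPy-only sketch of GP regression from RSV-style fingerprints to position follows; the squared-exponential kernel, its hyperparameters, and the toy sinusoidal fingerprints are assumptions for illustration, not the paper's SGP configuration:

```python
import numpy as np

def rbf(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel between RSV feature rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_predict(X_train, y_train, X_test, noise=1e-2, ell=1.0):
    """GP regression posterior mean and variance for one output coordinate."""
    K = rbf(X_train, X_train, ell) + noise * np.eye(len(X_train))
    Ks = rbf(X_test, X_train, ell)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    var = rbf(X_test, X_test, ell).diagonal() - np.einsum(
        "ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, var

# Toy setup: 2-D RSV "fingerprints" observed at known positions along a road.
rng = np.random.default_rng(0)
positions = np.linspace(0, 5, 40)
rsv = np.c_[np.sin(positions), np.cos(positions)] \
      + 0.01 * rng.standard_normal((40, 2))
mean, var = gp_predict(rsv, positions, rsv[:5])   # predict held-out positions
```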
2 Vehicular System Modeling

In this work, we consider that both the CV and OV are equipped with multiple antenna clusters, scattered around the vehicles, to support V2V communication. The OV tries to detect the position and orientation of a CV hidden by various obstacles like trees, buildings, and other large objects. The multi-cluster CV array enables the OV to estimate the AoA and PD from multiple reflections from nearby objects. The vehicles are equipped with a Q-antenna-element uniform linear array (ULA) receiver, which tracks the reflected signals using vehicle-to-infrastructure (V2I) communications. The transmitted signals from the CV are modulated onto P + 1 subcarriers. Through multiple reflections, these transmitted signals impinge on the receiving antenna array at the OV from different AoAs αN and transmission delays (TDs) τN:

αN = [α1, α2, α3, ..., αN],    (1)
Machine Learning Methods for Vehicle Positioning in Vehicular
τN = [τ1 , τ2 , τ3 , . . . , τN ].
67
(2)
Hence, the channel frequency response (CFR) for the pth subcarrier and qth antenna element can be expressed as [6] hp,q =
N
βn e−j2π fc τn e−j2π f τn e−j2π ψq,p (αn ) .
(3)
n=1
Here, fc is the carrier frequency, f is the path gain parameter, and ψq,p (αn ) is the angular transformation based on array geometry. ψq,p (αn ) =
d (fc + pf )(q − 1)Sin(αn ) . C
(4)
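As a quick numerical sketch of (3)-(4), the snippet below evaluates the CFR for a synthetic two-path channel. The subcarrier spacing, path gains, delays, and angles are illustrative values assumed here, not parameters taken from the paper:

```python
import numpy as np

C = 3e8            # speed of light (m/s)
fc = 5.9e9         # carrier frequency, as in Table 1
df = 100e6 / 64    # assumed subcarrier spacing (bandwidth / subcarriers)
d = C / (2 * fc)   # half-wavelength element separation

def psi(q, p, alpha):
    # Angular transformation of Eq. (4)
    return d * (fc + p * df) * (q - 1) * np.sin(alpha) / C

def cfr(p, q, beta, tau, alpha):
    # Channel frequency response of Eq. (3), summed over the N paths
    h = 0j
    for b, t, a in zip(beta, tau, alpha):
        h += (b * np.exp(-2j * np.pi * fc * t)
                * np.exp(-2j * np.pi * p * df * t)
                * np.exp(-2j * np.pi * psi(q, p, a)))
    return h

beta = [1.0, 0.4]                           # illustrative path gains
tau = [50e-9, 120e-9]                       # illustrative propagation delays
alpha = [np.deg2rad(20), np.deg2rad(-35)]   # illustrative AoAs
h = cfr(p=3, q=2, beta=beta, tau=tau, alpha=alpha)
```

By the triangle inequality the CFR magnitude is bounded by the sum of the path gains, which gives a quick consistency check on the model.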
Here, d is the half-wavelength separation between neighboring antenna elements and C is the speed of light. Since the TD between antenna elements is far smaller than the PD, the estimated CFR can be expressed as

\tilde{h}_{p,q} = \sum_{n=1}^{N} \beta_{n,p} \, e^{-j2\pi p \Delta f \tau_n} \, e^{-j2\pi \psi_{q,p}(\alpha_n)} + W_{p,q}.  (5)

Here, \beta_{n,p} = \beta_n e^{-j2\pi f_c \tau_n} is the path gain parameter for all subcarriers and W_{p,q} is the residual noise component. For the different AoAs and antenna elements, the path gains and noise components can be collected as

\beta_p = [\beta_{1,p}, \beta_{2,p}, \ldots, \beta_{N,p}],  (6)

W_p = [W_{p,1}, W_{p,2}, \ldots, W_{p,Q}].  (7)
The channel estimate vector for all antenna elements at the pth subcarrier can be expressed as

h(p) = \sum_{n=1}^{N} S_p(\alpha_n) \, \beta_{n,p} \, e^{-j2\pi p \Delta f \tau_n} + N(p).  (8)

Here, S_p(\alpha_n) is the array steering vector for any angle \alpha_n,

S_p(\alpha_n) = \left[ e^{-j2\pi \psi_{1,p}(\alpha_n)}, e^{-j2\pi \psi_{2,p}(\alpha_n)}, \ldots, e^{-j2\pi \psi_{Q,p}(\alpha_n)} \right]^T,  (9)

and the Gaussian noise vector N(p) is represented as

N(p) = [N_1(p), N_2(p), \ldots, N_Q(p)]^T.  (10)
We assume that a known modulated signal u(t) is transmitted by the CV. The analog signal received at the OV is sampled at the time instants \{t_p = p T_s\}_{p=0}^{P-1}, where T_s is the sampling interval. The qth antenna acquires the samples

R_q(t_p) = \sum_{n=1}^{N} \beta_{n,p} \, S_p(\alpha_n) \, u(t_p - \tau_n) \, e^{-j2\pi p \Delta f \tau_n} + N_q(t_p),  (11)

for p = 0, 1, 2, 3, \ldots, M - 1. At the pth time index, the samples collected across all antenna elements can be consolidated into the digital received signal vector (RSV) R_c(t_p) for any CV c:

R_c(t_p) = [R_1(t_p), R_2(t_p), R_3(t_p), \ldots, R_Q(t_p)]^T,  (12)

R_c(t_p) = \sum_{n=1}^{N} h(p) \, u(t_p - \tau_n) + N(t_p).  (13)

For every CV c, the OV can form the N × 1 RSV R_c such that

R_c = [R_{1c}, R_{2c}, \ldots, R_{Nc}]^T.  (14)
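The sampling model (11)-(12) can be sketched for a toy two-path channel. The Gaussian pulse u(t), the sampling interval, and all channel parameters below are illustrative assumptions for demonstration only:

```python
import numpy as np

def u(t):
    # Illustrative known modulated pulse transmitted by the CV (50 ns Gaussian)
    return np.exp(-t**2 / (2 * (50e-9)**2))

def rsv_samples(Q, P, beta, tau, alpha, fc, df, d, C=3e8):
    """Digital received samples R_q(t_p) of Eq. (11), stacked per Eq. (12)."""
    Ts = 1.0 / (P * df)                 # assumed sampling interval
    R = np.zeros((P, Q), dtype=complex)
    for p in range(P):
        tp = p * Ts
        for q in range(Q):
            acc = 0j
            for b, t0, a in zip(beta, tau, alpha):
                # Eq. (4), with q zero-based here so (q-1) becomes q
                psi = d * (fc + p * df) * q * np.sin(a) / C
                acc += (b * np.exp(-2j * np.pi * fc * t0)   # beta_{n,p}
                          * u(tp - t0)
                          * np.exp(-2j * np.pi * p * df * t0)
                          * np.exp(-2j * np.pi * psi))
            R[p, q] = acc               # row p is the RSV R_c(t_p)
    return R

fc = 5.9e9
R = rsv_samples(Q=4, P=8, beta=[1.0, 0.3], tau=[40e-9, 90e-9],
                alpha=[np.deg2rad(15), np.deg2rad(-40)],
                fc=fc, df=100e6 / 64, d=3e8 / (2 * fc))
```

Each row of `R` is one digital RSV across the Q antenna elements, the quantity the learning stage in Sect. 3 consumes.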
3 System Model Based on Machine Learning

3.1 Training Stage

Consider o_x(\cdot) and o_y(\cdot) as the objective functions mapping the RSV R_c of any CV c in the V2V system to its two-dimensional (2D) location estimate coordinates (x_c, y_c), so that

x_c = o_x(R_c), \quad y_c = o_y(R_c), \quad \forall x_c, y_c.  (15)

We propose a supervised machine learning model to determine the objective functions o_x(\cdot) and o_y(\cdot). First, we train the learning model with the RSVs of various known CV locations. The trained model is then fed with the RSVs of test CVs as input vectors to obtain their positions. To predict the test CV locations from their RSVs, we apply the stochastic Gaussian process regression (SGP) method with time analysis [7, 8]. The objective function is assumed to be drawn from a zero-mean Gaussian process with predefined covariance function \delta(\cdot,\cdot), so any finite number of realizations of o_x(\cdot) and o_y(\cdot) follow a Gaussian distribution with zero mean and covariance \delta(\cdot,\cdot):

o_x(\cdot) = SGP(0, \delta(\cdot,\cdot)), \quad o_y(\cdot) = SGP(0, \delta(\cdot,\cdot)).  (16)

The covariance function \delta(\cdot,\cdot) models the covariance of the x and y coordinates of CV c and OV o in the V2V system as functions of their RSVs. We consider \delta(\cdot,\cdot) as the weighted aggregate of exponential, product, and delta functions of any two RSVs R_c and R_o:

\delta(R_c, R_o) = \gamma \, e^{-0.5 (R_c - R_o)^T D^{-1} (R_c - R_o)} + \mu R_c^T R_o + \sigma_{error}^2 \, \partial_{c,o}.  (17)

Here, \gamma e^{-0.5 (R_c - R_o)^T D^{-1} (R_c - R_o)} represents the dependence of \delta(R_c, R_o) on the separation between the RSVs R_c and R_o, \mu R_c^T R_o represents its dependence on the actual values of R_c and R_o, and \sigma_{error}^2 \partial_{c,o} represents the variance due to assessment errors in both coordinates of the 2D plane. The diagonal matrix D contains elements representing the distances that must be traveled along the different RSV dimensions before o_{xy}(R_c) and o_{xy}(R_o) become uncorrelated.
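In these terms, (17) is an anisotropic squared-exponential kernel plus a linear (dot-product) kernel plus a noise term. A minimal numpy sketch, with placeholder hyperparameters standing in for the learned \gamma, \lambda's, \mu, and \sigma_{error}^2:

```python
import numpy as np

def delta_kernel(Rc, Ro, gamma, lam, mu, sigma2_err, same=False):
    """Covariance of Eq. (17): exponential + product + delta terms.
    lam holds the diagonal of D, one length-scale per RSV dimension."""
    diff = Rc - Ro
    # gamma * exp(-0.5 (Rc - Ro)^T D^{-1} (Rc - Ro))
    expo = gamma * np.exp(-0.5 * np.sum(diff * diff / lam))
    prod = mu * float(np.dot(Rc, Ro))            # mu * Rc^T Ro
    return expo + prod + (sigma2_err if same else 0.0)  # + sigma^2 * delta_{c,o}

rng = np.random.default_rng(0)
Rc, Ro = rng.normal(size=4), rng.normal(size=4)  # two synthetic 4-dim RSVs
lam = np.ones(4)                                 # placeholder length-scales
k_cc = delta_kernel(Rc, Rc, 1.0, lam, 0.1, 0.01, same=True)
k_co = delta_kernel(Rc, Ro, 1.0, lam, 0.1, 0.01)
```

The kernel is symmetric in its two RSV arguments, as any valid covariance function must be.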
D = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_N \end{pmatrix},  (18)

with \partial_{c,o} = 1 for c = o and 0 otherwise. The parameters in (17) can be assembled into the [(N + 2) × 1] vector \varphi as

\varphi = [\gamma \ \lambda_1 \ \ldots \ \lambda_N \ \mu]^T.  (19)
We consider having I training positions. Introducing the I × 1 vectors of training x and y coordinates and the I × N matrix R of training RSVs,

X = [x_1 \ x_2 \ \ldots \ x_I]^T,  (20)

Y = [y_1 \ y_2 \ \ldots \ y_I]^T,  (21)

R = [r_1 \ r_2 \ \ldots \ r_I]^T.  (22)

Since the training coordinates form a finite set of o_x(\cdot) and o_y(\cdot) realizations for the training RSVs in R, the training coordinates are jointly Gaussian with zero mean and covariance \delta:

X \mid R, \ Y \mid R, \ \varphi = N(0, \delta),  (23)

where [\delta]_{ii'} = \delta(r_i, r_{i'}), \forall i, i' = 1, 2, \ldots, I. A random vector v that is stochastic Gaussian distributed with mean \zeta and covariance C is written v = N(\zeta, C), and its probability density function (PDF) is denoted N(v; \zeta, C). When both \zeta and v are deterministic k-dimensional vectors and C is a deterministic k × k matrix,

N(v; \zeta, C) = (2\pi)^{-\frac{k}{2}} |C|^{-\frac{1}{2}} e^{-\frac{1}{2}(v - \zeta)^T C^{-1} (v - \zeta)}.  (24)
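Equation (24) is the standard multivariate normal density; a minimal numpy check is that a 1-D standard normal evaluated at its mean equals 1/\sqrt{2\pi} \approx 0.3989:

```python
import numpy as np

def gaussian_pdf(v, zeta, C):
    # Multivariate normal density of Eq. (24)
    v, zeta = np.asarray(v, float), np.asarray(zeta, float)
    k = len(v)
    diff = v - zeta
    Cinv = np.linalg.inv(C)
    norm = (2 * np.pi) ** (-k / 2) * np.linalg.det(C) ** (-0.5)
    return float(norm * np.exp(-0.5 * diff @ Cinv @ diff))

p0 = gaussian_pdf([0.0], [0.0], np.eye(1))  # standard normal at its mean
```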
The learned parameter vector \hat{\varphi} can be derived through maximum likelihood as

\hat{\varphi} = \arg\max_{\varphi} \log p(X \mid R, Y \mid R, \varphi).  (25)
The objective problem (25) is non-convex in nature. To obtain a closed-form locally optimal vector \hat{\varphi}, we use the stochastic conjugate gradient ascent approach [9].

3.2 Prediction Stage

We consider the \bar{I} × 1 vectors \bar{X} and \bar{Y} of the test CVs' x and y coordinates, respectively, which are to be predicted from the \bar{I} × N matrix \bar{R} of test RSVs:

\bar{R} = [\bar{r}_1 \ \bar{r}_2 \ \ldots \ \bar{r}_{\bar{I}}]^T,  (26)

\bar{X} = [\bar{x}_1 \ \bar{x}_2 \ \ldots \ \bar{x}_{\bar{I}}]^T,  (27)

\bar{Y} = [\bar{y}_1 \ \bar{y}_2 \ \ldots \ \bar{y}_{\bar{I}}]^T.  (28)

We now apply the SGP method to predict the vehicle locations by forming the joint distribution of the training (X, Y) and test (\bar{X}, \bar{Y}) coordinate vectors:

[X \ \bar{X}]^T \mid R, \bar{R} = N\!\left( 0, \begin{pmatrix} \delta & \bar{\delta}^T \\ \bar{\delta} & \bar{\bar{\delta}} \end{pmatrix} \right),  (29)

[Y \ \bar{Y}]^T \mid R, \bar{R} = N\!\left( 0, \begin{pmatrix} \delta & \bar{\delta}^T \\ \bar{\delta} & \bar{\bar{\delta}} \end{pmatrix} \right),  (30)

where \delta, \bar{\delta}, and \bar{\bar{\delta}} are the covariance matrices between the training and test RSVs:

[\delta]_{ii'} = \delta(r_i, r_{i'}), \quad \forall i, i' = 1, 2, \ldots, I,  (31)

[\bar{\delta}]_{\bar{i}i} = \delta(\bar{r}_{\bar{i}}, r_i), \quad \forall \bar{i} = 1, \ldots, \bar{I}, \ i = 1, \ldots, I,  (32)

[\bar{\bar{\delta}}]_{\bar{i}\bar{i}'} = \delta(\bar{r}_{\bar{i}}, \bar{r}_{\bar{i}'}), \quad \forall \bar{i}, \bar{i}' = 1, 2, \ldots, \bar{I}.  (33)
By conditioning the joint Gaussian distribution (23), we find the predictive distribution of \bar{X} and \bar{Y} as

\bar{X} \mid X, \ \bar{Y} \mid Y, \ R, \bar{R} = N\big( (\zeta_x, \zeta_y), (C_x, C_y) \big).  (34)
The predictive mean (\zeta_x, \zeta_y) and the corresponding covariance (C_x, C_y) of the test coordinate vectors \bar{X} and \bar{Y} for the SGP approach are given as

\zeta_x = \bar{\delta}_x \delta_x^{-1} X, \quad \zeta_y = \bar{\delta}_y \delta_y^{-1} Y,  (35)

C_x = \bar{\bar{\delta}}_x - \bar{\delta}_x \delta_x^{-1} (\bar{\delta}_x)^T, \quad C_y = \bar{\bar{\delta}}_y - \bar{\delta}_y \delta_y^{-1} (\bar{\delta}_y)^T.  (36)

The predicted distribution of both coordinates for a certain test user \bar{i} can be achieved by marginalization of the joint predictive distribution in (34) as

[\bar{X}]_{\bar{i}} \mid X, \ [\bar{Y}]_{\bar{i}} \mid Y, \ R, \bar{r}_{\bar{i}} = N\big( [\zeta_x]_{\bar{i}}, [\zeta_y]_{\bar{i}}, [C_x]_{\bar{i}\bar{i}}, [C_y]_{\bar{i}\bar{i}} \big),  (37)

where

[\zeta_x]_{\bar{i}} = \big[\bar{\delta}_x \delta_x^{-1} X\big]_{\bar{i}} = \sum_{i=1}^{I} \delta(\bar{r}_{\bar{i}}, r_i) \big[\delta_x^{-1} X\big]_i,  (38)

[\zeta_y]_{\bar{i}} = \big[\bar{\delta}_y \delta_y^{-1} Y\big]_{\bar{i}} = \sum_{i=1}^{I} \delta(\bar{r}_{\bar{i}}, r_i) \big[\delta_y^{-1} Y\big]_i,  (39)

[C_x]_{\bar{i}\bar{i}} = \big[\bar{\bar{\delta}}_x - \bar{\delta}_x \delta_x^{-1} (\bar{\delta}_x)^T\big]_{\bar{i}\bar{i}} = \delta(\bar{r}_{\bar{i}}, \bar{r}_{\bar{i}}) - \sum_{i=1}^{I}\sum_{j=1}^{I} \delta(\bar{r}_{\bar{i}}, r_i)\big[(\delta_x)^{-1}\big]_{ij} \delta(r_j, \bar{r}_{\bar{i}}),  (40)

[C_y]_{\bar{i}\bar{i}} = \big[\bar{\bar{\delta}}_y - \bar{\delta}_y \delta_y^{-1} (\bar{\delta}_y)^T\big]_{\bar{i}\bar{i}} = \delta(\bar{r}_{\bar{i}}, \bar{r}_{\bar{i}}) - \sum_{i=1}^{I}\sum_{j=1}^{I} \delta(\bar{r}_{\bar{i}}, r_i)\big[(\delta_y)^{-1}\big]_{ij} \delta(r_j, \bar{r}_{\bar{i}}).  (41)

Since the mean of the joint Gaussian distribution is equal to its mode, the predictive mean (\zeta_x, \zeta_y) and corresponding covariance (C_x, C_y) provide the optimal estimate of (\bar{X}, \bar{Y}) for any test user \bar{i}.
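The predictive equations (35)-(36) can be sketched for one coordinate in a few lines of numpy. The kernel below is a simplified stand-in for (17), with placeholder hyperparameters, and the training data are synthetic:

```python
import numpy as np

def kernel(a, b, gamma=1.0, lam=1.0, mu=0.05):
    # Simplified stand-in for the covariance (17); noise is added separately
    return gamma * np.exp(-0.5 * np.sum((a - b) ** 2) / lam) + mu * np.dot(a, b)

def sgp_predict(R, X, Rbar, sigma2_err=1e-3):
    """Predictive mean and covariance of Eqs. (35)-(36) for one coordinate.
    R: I x N training RSVs, X: I training coordinates, Rbar: test RSVs."""
    K = np.array([[kernel(r, rp) for rp in R] for r in R]) + sigma2_err * np.eye(len(R))
    Kb = np.array([[kernel(rb, r) for r in R] for rb in Rbar])        # delta-bar
    Kbb = np.array([[kernel(ra, rb) for rb in Rbar] for ra in Rbar])  # delta-double-bar
    Kinv = np.linalg.inv(K)
    zeta = Kb @ Kinv @ X           # Eq. (35)
    C = Kbb - Kb @ Kinv @ Kb.T     # Eq. (36)
    return zeta, C

rng = np.random.default_rng(1)
R = rng.normal(size=(10, 4))                 # 10 synthetic training RSVs
X = R @ np.array([1.0, -0.5, 0.2, 0.0])      # synthetic x-coordinates
zeta, C = sgp_predict(R, X, R[:2])           # predict back at two training points
```

With a small noise term, the predictive mean nearly interpolates the training coordinates, and the predictive covariance stays positive semidefinite.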
4 Results and Discussions

The performance of the SGP method is evaluated against a theoretical lower bound on the attainable root mean square prediction error (RMSE). So, we determine the near-optimal RMSE performance of any unbiased estimator of the test vehicle's position coordinates:

RMSE = \sqrt{ \frac{ \sum_{\bar{i}=1}^{\bar{I}} \left( [\bar{X}]_{\bar{i}} - [\zeta_x]_{\bar{i}} \right)^2 + \left( [\bar{Y}]_{\bar{i}} - [\zeta_y]_{\bar{i}} \right)^2 }{ \bar{I} } }.  (42)

When the RSV data is obtained, the stochastic GP model is trained by solving the log-likelihood optimization problem using the conjugate gradient (CG) approach. Numerous trials are run with random values to obtain the near-optimal prediction performance of the proposed SGP method. The simulations for the proposed work are carried out in MATLAB 2015. We consider a 2D coordinate simulation space with a 500 m × 20 m sector of a crossed path, as presented in Fig. 1. Buildings and trees are located on the sidelines of the path with gaps of approximately 10 m between them on both sides. The lateral and longitudinal location errors should be less than 0.2 m and 0.8 m, respectively. The antenna arrays are designed as perfect quarters of a uniform linear array with half-wavelength-spaced elements at the carrier frequency f_c. The fixed beamforming vectors for every transmitter and receiver are selected such that the signals are transmitted and received in a 255° sector that is not blocked by any vehicle.
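The RMSE criterion (42) reported in the following figures reduces to a one-line computation; the two test-vehicle positions below are illustrative:

```python
import numpy as np

def rmse_2d(X_true, Y_true, zeta_x, zeta_y):
    # Root mean square position error of Eq. (42)
    sq = (np.asarray(X_true) - np.asarray(zeta_x)) ** 2 \
       + (np.asarray(Y_true) - np.asarray(zeta_y)) ** 2
    return float(np.sqrt(np.mean(sq)))

# One perfect prediction and one 3-4-5 miss give sqrt((0 + 25) / 2)
err = rmse_2d([0.0, 3.0], [0.0, 4.0], [0.0, 0.0], [0.0, 0.0])
```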
Fig. 1. Distributed vehicle scenario for LOS and NLOS channels
The channel gain of every link is calculated by considering free-space propagation at the respective wavelength. The typical parameters for the proposed V2V system are provided in Table 1. Figure 2 shows the RMSE performance with respect to the signal-to-noise ratio (SNR) for the proposed SGP method and other classical fingerprinting approaches for vehicle positioning, such as the SP method and the ML method [10]. The RMSE of the SGP method ranges from 0.3 m at minimum to 14.5 m at maximum as the SNR varies from 50 down to 5 dB, which is much smaller than that of the compared methods. This is primarily due to the fingerprinting
Table 1. Parameters for simulation studies

System parameter                   | Value
Carrier frequency                  | 5.9 GHz
Transmission bandwidth             | 100 MHz
Propagation path number            | 5
Power spectral density of AWGN     | 0.1
Size of CV, OV                     | 3 × 5 m²
Maximum distance between CV and OV | 30 m
Training vehicle locations         | 10
Noise power                        | −107.5 dBm
Receiver sensitivity               | −105.5 dBm
Transmit power                     | 21 dBm
uncertainty caused by the neighboring dense environment and the matching faults. The proposed SGP method nevertheless achieves significant gains in prediction performance due to the iterative training on the RSVs.

Fig. 2. Variation of RMSE with respect to SNR for different methods
As found in Fig. 3, the RMSEs of all prediction approaches decrease as the number of antenna elements increases, owing to the better AoA estimation afforded by the larger spatial dimensionality. At SNR = 40 dB, the RMSE of the proposed SGP method is 0.2 m at minimum with 20 antenna elements and 1.8 m at maximum with 2 elements, which is better than the other conventional methods.
Fig. 3. Variation of RMSE with different numbers of antenna elements for various methods
Figure 4 provides the RMSE performance with respect to traveling time at SNR = 40 dB. With increased travel time, the RMSE decreases for all methods because the prediction performance improves as more iterations are run. However, increased travel time also raises the system complexity due to multipath fading and shadowing [11]. The RMSE performance of the SGP method remains much better than that of the other compared methods as the traveling time increases.

Fig. 4. Variation of RMSE with increasing travel time for different methods
Figure 5 demonstrates the variation of RMSE with respect to the noise level for the different methods. The RMSE values of all methods increase with the noise level. The proposed SGP method has the minimum RMSE at every noise level, indicating better prediction accuracy in vehicle positioning.

Fig. 5. Plot of RMSE values at multiple noise levels for the different methods
5 Conclusion

This work considered a stochastic Gaussian process regression (SGP) learning framework to estimate vehicle positions from RSV information in a distributed vehicular network. The proposed SGP method uses the RSV data for both training and prediction in approximating the vehicle locations. We derived the root mean square error (RMSE) performance of the SGP method through simple algebraic operations on the predictive mean and covariance. The superior prediction performance of the proposed learning approach was demonstrated by comparison with existing fingerprinting techniques in terms of RMSE across various scenarios.
References

1. Cui, X., et al.: Vehicle positioning using 5G millimeter-wave systems. IEEE Access 4, 6946–6973 (2016)
2. Choi, J., et al.: Millimeter-wave vehicular communication to support massive automotive sensing. IEEE Commun. Mag. 54, 160–167 (2016)
3. Seo, H., et al.: LTE evolution for vehicle-to-everything services. IEEE Commun. Mag. 54, 22–28 (2016)
4. Panda, S.: Performance optimization of cell-free massive MIMO system with power control approach. Int. J. Electron. Commun. (AEU) 97, 210–219 (2018a)
5. Panda, S.: Performance improvement of optical CDMA networks with stochastic artificial bee colony optimization technique. Opt. Fiber Technol. 42, 140–150 (2018b)
6. Gaber, A., et al.: A study of wireless indoor positioning based on joint TDOA and DOA estimation using 2-D matrix pencil algorithms and IEEE 802.11ac. IEEE Trans. Wirel. Commun. 14(5), 2440–2454 (2015)
7. Rasmussen, C.E., et al.: Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA (2006)
8. Nocedal, J., et al.: Numerical Optimization. Springer Science+Business Media, New York, NY (2006)
9. Van Trees, H.L., et al.: Detection, Estimation, and Modulation Theory, Part I. Wiley, New York (1968)
10. Kupershtein, E., et al.: Single-site emitter localization via multipath fingerprinting. IEEE Trans. Signal Process. 61(1), 10–21 (2013)
11. Panda, S.: Performance improvement of clustered wireless sensor networks using swarm based algorithm. Wirel. Pers. Commun. 103(3), 2657–2678 (2018c)
Effectiveness of Swarm-Based Metaheuristic Algorithm in Data Classification Using Pi-Sigma Higher Order Neural Network

Nibedan Panda1,2(B) and Santosh Kumar Majhi1

1 Department of Computer Science and Engineering, Veer Surendra Sai University of Technology, Burla 768018, Odisha, India
[email protected], [email protected]
2 Department of Information Technology, Aditya Institute of Technology and Management, Tekkali 532201, Andhra Pradesh, India
Abstract. In this paper, the Salp Swarm Algorithm (SSA) is employed to train a Higher Order Neural Network (HONN) for the data classification task. In machine learning, training an artificial neural network is considered a difficult task and has recently drawn the attention of researchers. The difficulty of Artificial Neural Networks (ANNs) arises from their nonlinear nature and unknown set of initial parameters. Traditional training algorithms exhibit poor performance in terms of local optima avoidance and convergence rate, for which metaheuristic-based optimization emerges as a suitable alternative. The performance of the proposed SSA-based HONN method has been verified on various classification measures over benchmark datasets chosen from the UCI repository, and the outcome obtained by this method is compared with state-of-the-art evolutionary algorithms. From the reported outcomes, the proposed method outperforms the recent algorithms, which confirms its supremacy in terms of better exploration and exploitation capability.

Keywords: Classification · Salp swarm algorithm · PSNN · GA · DE · GWO · PSO
1 Introduction

In day-to-day life, classification is important, as we come across various situations where we must decide whether to accept or reject by applying the intelligence of past experience. It makes things simpler to find and recognise. Many attempts have been made to obtain better classification outcomes; most importantly, identifying the appropriate class is crucial for the success of a classification system. Classification belongs to the supervised learning approach, which assigns objects to corresponding classes and allocates class labels to unknown patterns. The prime concern of classification is to build a model from the training dataset. It follows a two-step approach: first the training or learning phase, and second the classifying phase. The goal of

© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_8
classification is to foretell the target class for each individual case present in the data [1]. Currently, the classification task exhibits its significance across all fields of engineering and science, such as genomic classification [2], document classification [3], textual classification [4], image classification [5], medical data classification [6], sentiment analysis classification [7], video classification [8], internet traffic classification [9], etc. The major techniques available for the classification task are the Artificial Neural Network (ANN) [10], Naïve Bayes classifier [11], Decision Tree method [12], K-Nearest Neighbour classifier [13] and Support Vector Machine [14]. Among all classifiers, ANNs optimised by nature-inspired evolutionary algorithms are popular across all fields for resolving classification tasks.

In 1943, McCulloch and Pitts first revealed the notion of ANNs as computing systems [15]. ANNs take their inspiration from biological nervous systems and are considered one of the greatest accomplishments in the field of artificial intelligence. The original aim of ANNs is to imitate the human capability of solving multifaceted real-life problems while adapting dynamically to the changing circumstances of the present environment. The main building block of an ANN is the neuron, or artificial neuron, also termed a processing element; neurons combine to form a highly interconnected interactive network that processes information in a manner equivalent to the neurons of the human brain. So an ANN can be thought of as an information processing system. The neurons are connected by means of connection links. Each link carries an assigned weight, which is multiplied with the supplied input. The output signal from the network is determined by applying different activations over the entire input. Generally, an ANN contains three layers: the input, hidden and output layers.
Basically, ANNs follow many learning approaches, among which the supervised and unsupervised approaches are the most popular. ANNs can be applied to problems in a wide diversity of areas such as data classification [16], image processing [17], pattern recognition [18], forecasting [18], data compression [19] and optimisation engineering [20]. The ANNs popular among researchers in the literature are the Feedforward Neural Network (FFNN) [21], Convolutional neural network [22], Recurrent neural network [23], Radial basis function neural network [24] and Modular neural network [25]. Apart from these, some Higher Order Neural Networks (HONN) are also available, such as the Functional Link Artificial Neural Network (FLANN) [26], Pi-Sigma Neural Network (PSNN) [27] and Jordan Pi-Sigma Neural Network (JPSNN) [28], which are popular in the research community for their versatility in handling difficult day-to-day problems. The higher attraction of researchers towards higher order neural networks rather than traditional neural networks is due to their underlying advantages: (i) better computational and learning capability, and (ii) less training time and lower implementation cost.

The training phase of ANNs is categorised into two types: supervised and unsupervised training approaches. In the supervised learning approach, feedback is supplied to the neural network from an external source, whereas in the unsupervised approach the neural network learns without taking any outside feedback. The efficacy of an ANN is heavily reliant on the learning pattern. Supervised learning proceeds with two approaches: the gradient-based or deterministic learning approach, and the evolutionary approach. Backpropagation is the most used gradient-based method. The advantages of the gradient-based approach
lie in its simplicity, adaptability and accuracy. But the major demerit of this type of method is that it can become stuck in local optima, with no randomness involved in choosing the parameters; the initial set of parameters fully determines the future outcome. The evolutionary-based approach, in contrast, functions wholly on randomness: the initial setup factors are chosen randomly and iterated multiple times to determine an improved outcome. The wide acceptance of the evolutionary metaheuristic approach is due to its capability to avoid local optima stagnation, at the cost of training time. However, selection of the right evolutionary metaheuristic algorithm leads to a better optimal outcome. A large number of nature-inspired evolutionary metaheuristic algorithms evolve day by day, owing to the increasing complexity of real-life optimization problems. Some of the popular metaheuristic algorithms are the Genetic Algorithm (GA) [29], Differential Evolution (DE) [30], Particle Swarm Optimisation (PSO) [31], Ant Colony Optimisation (ACO) [32], Artificial Bee Colony Optimisation (ABC) [33], Grey Wolf Optimisation (GWO) [34] and Bat Algorithm (BA) [35].

In the present work, we have used the newly developed swarm-based optimisation technique termed the Salp Swarm Algorithm (SSA) for training a HONN, namely the PSNN, to resolve data classification problems. The motivation for choosing SSA as the trainer is its higher exploration and exploitation capability for avoiding local optima stagnation. The inspiration for considering SSA as a trainer for higher order neural network training is that:

• No specific metaheuristic algorithm assures acquiring the global optimum.
• SSA has better explorative and exploitative strength in comparison to other current trainers.
The organisation of the paper is as follows: Sect. 2 presents an overview of the Pi-Sigma neural network, an outline of the SSA algorithm and the proposed SSA-optimised Pi-Sigma neural network. Section 3 presents the experimental setup, the result analysis, the specifics of the datasets used and the PSNN arrangement. The conclusion and future directions are presented in Sect. 4.
2 Material and Methods

This section describes the Pi-Sigma higher order neural network, the Salp Swarm Algorithm and the proposed SSA-optimised Pi-Sigma HONN.

2.1 Overview of Pi-Sigma Neural Network

The Pi-Sigma Neural Network (PSNN) was proposed in 1991 by Shin et al. as a feedforward neural network belonging to the category of Higher Order Neural Networks (HONN) [36]. This HONN avoids the rapid growth in the number of processing elements and associated weights [37]. The PSNN consists of three layers: the first is the input layer, the second is the hidden layer, which contains linear summation units, and the final layer is the output layer, which contains the product unit. The summing units present in the hidden layer use a linear activation function, and
the product unit uses a nonlinear activation function, the sigmoid. The network is named PSNN because it uses a product of sums of the initial input parameters. The training time of the PSNN is low because only one layer of weights has to be adjusted, i.e. the weights between the input layer and the hidden layer; the weights associated with the output layer are normally set to unity. The popularity of HONNs is due to their ability to perform nonlinear mapping of complex real-life problems. The PSNN produces more precise outcomes than other traditional ANNs due to its combination of multiplication and summation units.
Fig. 1. Basic Architecture of Pi-Sigma Neural Network
The outcome of the mth hidden layer is computed by Eq. (1), and the outcome of the network is calculated by Eq. (2):

T_m = b_m + \sum_{i=1}^{j} W_{im} X_i,  (1)

Y = \sigma\!\left( \prod_{m=1}^{j} T_m \right),  (2)
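A minimal numpy sketch of the forward pass (1)-(2); the all-zero weights below are illustrative, chosen so the product unit receives a zero net input:

```python
import numpy as np

def psnn_forward(x, W, b):
    """Pi-Sigma forward pass: T_m = b_m + sum_i W_im x_i (Eq. 1),
    then Y = sigmoid(prod_m T_m) (Eq. 2). W has shape (inputs, summing units)."""
    T = b + x @ W                       # linear summing units, Eq. (1)
    net = np.prod(T)                    # product unit
    return 1.0 / (1.0 + np.exp(-net))   # sigmoid output, Eq. (2)

x = np.array([1.0, 0.0, 1.0])
W = np.zeros((3, 2))   # illustrative all-zero weights
b = np.zeros(2)
y0 = psnn_forward(x, W, b)   # net input 0, so the sigmoid returns 0.5
```

Only the input-to-hidden weights `W` and biases `b` are trainable, mirroring the unity output weights described above.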
where \sigma characterises the nonlinear transfer function, W_{im} characterises the weights of the connections between the input layer and the hidden layer, and b_m characterises the biases. The classic design of a PSNN is presented in Fig. 1.

2.2 Salp Swarm Algorithm

In 2017, Mirjalili et al. established a new metaheuristic algorithm that relies on the swarming behaviour of salps [38]. Salps are marine animals with transparent barrel-shaped bodies belonging to the family Salpidae. Salps move forward by contracting and pumping water through their bodies. During the search for food, adult salps join together into a chain-like structure called the salp chain or salp swarm. These swarming activities of salps are mathematically modelled to solve optimization problems. In the salp chain, the total population is divided into two classes: the first represents the leader and the rest represent the followers. The salp chain is guided by
the leader salp from the front of the chain, and the others follow one another. In the deep ocean, during the food search, the leader salp updates its position according to the food source, and the followers update their positions sequentially with respect to one another. In the mathematical model, the leader updates its position by Eq. (3):

S_l^1 = \begin{cases} F_l + C_1\big((UB_l - LB_l)C_2 + LB_l\big), & C_3 \ge 0 \\ F_l - C_1\big((UB_l - LB_l)C_2 + LB_l\big), & C_3 < 0 \end{cases}  (3)

where S_l^1 is the position of the leader salp in the lth dimension, F_l is the position of the food source in the lth dimension, and UB_l and LB_l are the upper and lower bound values in the lth dimension. C_1 is the exploration and exploitation factor, computed as in Eq. (4):

C_1 = 2 e^{-\left(\frac{4i}{I}\right)^2}  (4)

where i denotes the current iteration and I denotes the maximum number of iterations. C_2 and C_3 are two random numbers in the interval [0, 1]. The position of the follower salps is updated by Eq. (5):

S_l^k = \frac{1}{2}\left( S_l^k + S_l^{k-1} \right), \quad k \ge 2,  (5)

where S_l^k and S_l^{k-1} are the positions of the kth and (k−1)th follower salps of the chain in the lth dimension.

2.3 Proposed SSA Optimised Pi-Sigma Neural Network

The crucial step in training a HONN using a swarm-based metaheuristic algorithm is formulating the problem in a form the metaheuristic approach is suited to solve. HONNs have gained wide acceptance among researchers due to their capability to resolve complicated, nonlinear problems. For obtaining optimum accuracy with any HONN, the most vital parameters are the weights and biases. The role of the SSA trainer is to generate an optimal set of random values for the assigned adjustable weights that provides optimum classification accuracy. The set of weights denoted for SSA is given in Eq. (6).
where k ≥ 2 slk = Position of kth follower salp belongs to chain in kth dimension. slk−1 = Position of (k−1)th follower salp belongs to chain in kth dimension. 2.3 Proposed SSA Optimised Pi-Sigma Neural Network The crucial step in training HONN using swarm-based metaheuristic algorithm is the formulation of problem, i.e. the metaheuristic approach is suitable to solve. HONN got its wide acceptance among researchers due to its capability to resolve complicated and nonlinear problems. For obtaining optimum accuracy using any HONN, the most vital parameters are weights and biases. The role of the SSA trainer is to generate an optimal set of random values for the assigned adjustable weights which will provide optimum classification accuracy. The set of weights denotes for SSA are in Eq. (6). W = w1,1 , w1,2 , . . . , wn,n
(6)
where n signifies the number of input nodes, wij signifies the weights associated from ith node to jth node. After successfully allocating initial variables, next important step is
82
N. Panda and S. K. Majhi
to set objective functions for SSA algorithms. Here in training HONN (PSNN), primary objective is to attain maximum classification accuracy in terms of training and testing samples. To evaluate the ANN (PSNN) one common performance evaluation metrics will be used, i.e. Root Mean Square Error (RMSE). It can be computed by finding the difference between the desired outcome and actual result observed from the HONN. The RMSE can be computed by Eq. 7. m k k 2 i=1 Ri − Ti (7) RMSE = m where m indicates the number of outputs, Ri indicates the reported output and Ti indicate the desired output. Figure 1 indicates the whole procedure of training PSNN by using SSA. The entire process may be thought of as, the SSA algorithm supplies random weights to the PSNN and in the return accepts reduced RMSE value. By using SSA, we should not expect absolute guarantee for attaining optimum result from the PSNN for a given dataset due to the heuristic nature of algorithm. But we iterate the best set of parameters through the PSNN, attended after each iteration. In the consequences, we expect reduced RMSE value over entire iterations. Finally, SSA will attain an outcome which is much better than as compared to the initial random solutions by iterating the algorithms for sufficient number of iterations. Figure 2 represents the supply of weights and biases by SSA to PSNN and accepts average RMSE for entire samples.
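The whole trainer described above can be sketched end-to-end on the XOR case of Tables 1-2 (8 samples, 3 attributes, a 3-3-1 PSNN). The parity target, weight bounds, population size, and the 0.5 split on C_3 are illustrative assumptions of this sketch, not specifications from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# XOR dataset of Tables 1-2: 8 samples, 3 attributes; the target is taken
# here as 3-bit parity, an assumption for illustration.
X = np.array([[0,0,0],[0,0,1],[0,1,0],[0,1,1],
              [1,0,0],[1,0,1],[1,1,0],[1,1,1]], dtype=float)
T = X.sum(axis=1) % 2                       # parity target

N_IN, N_SUM = 3, 3                          # the 3-3-1 PSNN structure of Table 1
DIM = N_IN * N_SUM + N_SUM                  # flattened weights + biases

def psnn_out(w, x):
    Wm = w[:N_IN * N_SUM].reshape(N_IN, N_SUM)
    b = w[N_IN * N_SUM:]
    return 1.0 / (1.0 + np.exp(-np.prod(b + x @ Wm)))

def fitness(w):
    pred = np.array([psnn_out(w, x) for x in X])
    return float(np.sqrt(np.mean((pred - T) ** 2)))   # RMSE of Eq. (7)

POP, ITER, LB, UB = 20, 100, -2.0, 2.0
S = rng.uniform(LB, UB, size=(POP, DIM))
S[0] = 0.0                                  # neutral baseline salp (RMSE 0.5)
food, food_fit = S[0].copy(), fitness(S[0])
for i in range(1, ITER + 1):
    for s in S:                             # keep the best solution found so far
        f = fitness(s)
        if f < food_fit:
            food, food_fit = s.copy(), f
    c1 = 2.0 * np.exp(-(4.0 * i / ITER) ** 2)         # Eq. (4)
    for d in range(DIM):                    # leader update, Eq. (3)
        c2, c3 = rng.random(), rng.random()
        step = c1 * ((UB - LB) * c2 + LB)
        S[0, d] = food[d] + step if c3 >= 0.5 else food[d] - step
    for k in range(1, POP):                 # follower update, Eq. (5)
        S[k] = 0.5 * (S[k] + S[k - 1])
    S = np.clip(S, LB, UB)
```

Each iteration supplies the candidate weight vectors to the PSNN and keeps the one with the lowest RMSE, mirroring the loop depicted in Fig. 2.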
Fig. 2. SSA-based Pi-Sigma Neural Network Trainer
3 Results and Discussions

In this section, the proposed SSA-based PSNN trainer has been benchmarked using 10 standard classification datasets taken from the UCI repository [39]. The chosen datasets are balloon, cancer, diabetes, glass, heart, iris, liver, seed, XOR and yeast. The
comprehensive depiction of the datasets used is reported in Table 1. Some standard performance evaluators have been used to scrutinise the effectiveness of the proposed model. The outcome of the SSA-based PSNN has been compared with recent metaheuristic-based trainers such as DE, GA, PSO and GWO.

3.1 Experimental Setup

MATLAB 2016a has been used to implement the proposed training algorithm along with the other trainers. The training and testing samples for all datasets have been chosen in the ratio 70:30; the detailed split for each dataset is reported in Table 2. For a fair comparison among all considered trainers, the maximum number of iterations is fixed to 100, and for the entire set of trainers the outcome is evaluated over 20 runs (Table 3).

3.2 Result Analysis

All the outcomes obtained are reported in the form of minimum RMSE, average, standard deviation, error rate and accuracy, along with three other performance evaluators: specificity, sensitivity and prevalence. Comparing the error rate and accuracy presented in Table 4, the SSA-based PSNN trainer exhibits superior performance over the other metaheuristic-based PSNN trainers: out of the 10 datasets used, SSA-PSNN shows supremacy in classification accuracy on 8 and gives the other trainers a tight challenge on the remaining two. The higher average outcome and lower standard deviation obtained by the SSA-PSNN trainer provide solid evidence of its avoidance of premature convergence towards local minima and its attainment of an optimal set of weights. The better local minima avoidance proves the superior explorative strength of the SSA algorithm as a higher order network trainer.

Table 1. Structure of PSNN for different datasets

Classification dataset | Number of attributes | PSNN structure
Balloon  | 4  | 4-4-1
Cancer   | 9  | 9-9-1
Diabetes | 8  | 8-8-1
Heart    | 13 | 13-13-1
Iris     | 4  | 4-4-1
Liver    | 6  | 6-6-1
Seed     | 7  | 7-7-1
Glass    | 9  | 9-9-1
XOR      | 3  | 3-3-1
Yeast    | 8  | 8-8-1
84
N. Panda and S. K. Majhi

Table 2. Description of used standard datasets

| Classification dataset | Number of attributes | Number of training samples | Number of test samples | Number of classes |
|---|---|---|---|---|
| Balloon | 4 | 20 | 16 | 2 |
| Cancer | 9 | 683 | 120 | 2 |
| Diabetes | 8 | 768 | 150 | 2 |
| Heart | 13 | 270 | 60 | 2 |
| Iris | 4 | 150 | 150 | 3 |
| Liver | 6 | 345 | 70 | 2 |
| Seed | 7 | 210 | 210 | 3 |
| Glass | 9 | 214 | 42 | 6 |
| XOR | 3 | 8 | 8 | 2 |
| Yeast | 8 | 1484 | 185 | 10 |
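The partition described in Sect. 3.1 can be reproduced with a simple shuffled holdout. This is an illustrative sketch (the function name and fixed seed are assumptions); the authoritative per-dataset counts are those reported in Table 2, which do not always land on an exact 70:30 split.

```python
import random

def holdout_split(samples, train_ratio=0.7, seed=42):
    """Shuffle a dataset and split it into train/test partitions."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)               # deterministic shuffle
    cut = int(round(train_ratio * len(samples)))
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test

# A 100-sample toy dataset splits 70:30.
train, test = holdout_split(list(range(100)))
```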
Table 3. Performance comparison of SSA w.r.t. DE, GA, GWO, PSO

| Dataset | Metric | DE | GA | GWO | PSO | SSA |
|---|---|---|---|---|---|---|
| Balloon | MIN RMSE | 0.5001 | 0.3266 | 0.1936 | 0.1934 | 0.3302 |
| | AVG | 0.5108 | 0.3448 | 0.3158 | 0.3099 | 0.3487 |
| | STD | 0.0082 | 0.0147 | 0.0828 | 0.1619 | 0.0110 |
| | SPECIFICITY | 0.3 | 0.6 | 0.6 | 0.7 | 0.5 |
| | SENSITIVITY | 0.8 | 0.6 | 0.8 | 0.8 | 0.8 |
| | PREVALENCE | 40 | 30 | 37.5 | 40 | 40 |
| Cancer | MIN RMSE | 0.5017 | 0.3540 | 0.3084 | 0.0396 | 0.3554 |
| | AVG | 0.5234 | 0.3569 | 0.3629 | 0.1825 | 0.3601 |
| | STD | 0.0093 | 0.0036 | 0.0408 | 0.0997 | 0.0027 |
| | SPECIFICITY | 0.96 | 0.57 | 0.96 | 0.01 | 0.96 |
| | SENSITIVITY | 0.95 | 1 | 0.98 | 1 | 1 |
| | PREVALENCE | 47.5 | 50 | 49.16 | 50 | 50 |
| Diabetes | MIN RMSE | 0.5000 | 0.3962 | 0.1464 | 0.0404 | 0.1786 |
| | AVG | 0.5072 | 0.4136 | 0.3234 | 0.1031 | 0.1880 |
| | STD | 0.0093 | 0.0148 | 0.1828 | 0.0293 | 0.0101 |
| | SPECIFICITY | 0.68 | 0.11 | 0.35 | 0.08 | 0.48 |
| | SENSITIVITY | 0.68 | 0.97 | 0.95 | 0.95 | 0.92 |

(continued)
Effectiveness of Swarm-Based Metaheuristic Algorithm in Data Classification

Table 3. (continued)

| Dataset | Metric | DE | GA | GWO | PSO | SSA |
|---|---|---|---|---|---|---|
| Diabetes | PREVALENCE | 34.28 | 48.57 | 47.85 | 47.85 | 46.42 |
| Heart | MIN RMSE | 0.5027 | 0.1476 | 0.0818 | 0.1920 | 0.0607 |
| | AVG | 0.5269 | 0.1614 | 0.5705 | 0.3093 | 0.0844 |
| | STD | 0.0188 | 0.0109 | 0.2764 | 0.0787 | 0.0145 |
| | SPECIFICITY | 0.29 | 0.07 | 0.14 | 0.03 | 0.14 |
| | SENSITIVITY | 0.96 | 1 | 1 | 1 | 1 |
| | PREVALENCE | 48.14 | 50 | 50 | 50 | 50 |
| Iris | MIN RMSE | 0.5001 | 0.4139 | 0.1735 | 0.2575 | 0.2830 |
| | AVG | 0.5117 | 0.4272 | 0.4605 | 0.4058 | 0.2957 |
| | STD | 0.0132 | 0.0096 | 0.3175 | 0.1003 | 0.0090 |
| Liver | MIN RMSE | 0.5019 | 0.4146 | 0.3430 | 0.1243 | 0.3429 |
| | AVG | 0.5079 | 0.4320 | 0.4978 | 0.2720 | 0.3469 |
| | STD | 0.0087 | 0.0116 | 0.1511 | 0.1555 | 0.0035 |
| | SPECIFICITY | 0.41 | 0.44 | 0.41 | 0.02 | 0.55 |
| | SENSITIVITY | 0.73 | 0.64 | 0.73 | 1 | 0.64 |
| | PREVALENCE | 36.76 | 32.35 | 36.76 | 50 | 32.35 |
| Seed | MIN RMSE | 0.5066 | 0.1331 | 0.1601 | 0.2426 | 0.1247 |
| | AVG | 0.5869 | 0.1636 | 0.3362 | 0.3867 | 0.1314 |
| | STD | 0.1493 | 0.0248 | 0.1423 | 0.1990 | 0.0054 |
| Glass | MIN RMSE | 0.3346 | 0.2929 | 0.6726 | 0.1893 | 0.0679 |
| | AVG | 0.3417 | 0.3225 | 0.7074 | 0.3672 | 0.1033 |
| | STD | 0.0100 | 0.0226 | 0.0305 | 0.1584 | 0.0389 |
| XOR | MIN RMSE | 0.5060 | 0.3579 | 0.1979 | 0.1768 | 0.3519 |
| | AVG | 0.5095 | 0.3645 | 0.3047 | 0.3637 | 0.3571 |
| | STD | 0.0036 | 0.0245 | 0.1520 | 0.2433 | 0.0036 |
| | SPECIFICITY | 0.25 | 0.25 | 0.75 | 0.25 | 0.75 |
| | SENSITIVITY | 0.75 | 0.75 | 0.75 | 1 | 0.75 |
| | PREVALENCE | 37.5 | 37.5 | 37.5 | 50 | 37.5 |
| Yeast | MIN RMSE | 0.5006 | 0.4562 | 0.2303 | 0.1040 | 0.1369 |
| | AVG | 0.5321 | 0.4607 | 0.5253 | 0.2489 | 0.1550 |
| | STD | 0.0238 | 0.0041 | 0.2641 | 0.2460 | 0.0201 |
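The paper does not restate the formulas behind the evaluators of Table 3; the confusion-matrix definitions below are the standard ones and are offered as a sketch of how such numbers arise for a two-class dataset (accuracy, error rate and prevalence expressed as percentages, matching the tables).

```python
def classification_metrics(y_true, y_pred):
    """Binary-classification evaluators of the kind reported in
    Tables 3 and 4, computed from the confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)
    return {
        "accuracy": 100.0 * (tp + tn) / n,          # correct predictions
        "error_rate": 100.0 * (fp + fn) / n,        # 100 - accuracy
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "prevalence": 100.0 * (tp + fn) / n,        # positive-class share
    }

# Toy labels: 2 TP, 2 TN, 1 FP, 1 FN out of 6 samples.
m = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```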
Table 4. Accuracy results obtained from standard datasets

| Dataset | Metric | DE | GA | GWO | PSO | SSA |
|---|---|---|---|---|---|---|
| XOR | Error rate (%) | 50 | 40 | 25 | 37.5 | 25 |
| | Accuracy (%) | 50 | 60 | 75 | 62.5 | 75 |
| Balloon | Error rate (%) | 45 | 40 | 30 | 25 | 35 |
| | Accuracy (%) | 55 | 60 | 70 | 75 | 65 |
| Cancer | Error rate (%) | 4.16 | 2.5 | 2.5 | 49.16 | 1.66 |
| | Accuracy (%) | 95.84 | 97.5 | 97.5 | 50.84 | 98.34 |
| Diabetes | Error rate (%) | 31.42 | 45.71 | 34.28 | 47.85 | 29.28 |
| | Accuracy (%) | 68.58 | 54.29 | 65.72 | 52.15 | 70.72 |
| Heart | Error rate (%) | 37.03 | 46.29 | 42.59 | 48.14 | 42.59 |
| | Accuracy (%) | 62.97 | 53.71 | 57.41 | 51.86 | 57.41 |
| Iris | Error rate (%) | 30 | 33.33 | 33.33 | 30 | 23.33 |
| | Accuracy (%) | 70 | 66.67 | 66.67 | 70 | 76.67 |
| Liver | Error rate (%) | 42.64 | 45.58 | 42.64 | 48.52 | 39.70 |
| | Accuracy (%) | 57.36 | 54.42 | 57.36 | 51.48 | 60.3 |
| Seed | Error rate (%) | 25.64 | 33.33 | 43.58 | 35.89 | 23.07 |
| | Accuracy (%) | 74.36 | 66.67 | 56.42 | 64.11 | 76.93 |
| Glass | Error rate (%) | 14.28 | 33.33 | 21.42 | 26.19 | 7.14 |
| | Accuracy (%) | 85.72 | 66.67 | 78.58 | 73.81 | 92.86 |
| Yeast | Error rate (%) | 9.18 | 10.81 | 9.72 | 3.78 | 8.10 |
| | Accuracy (%) | 90.82 | 89.19 | 90.28 | 96.22 | 91.9 |
4 Conclusion

In this paper, the SSA algorithm has been employed for the first time as an HONN trainer. The motivation for choosing SSA as the metaheuristic trainer is its inherent capability to escape local minima and its quicker convergence towards the global optimum. PSNN was chosen as the HONN because of its simplicity, its small number of adjustable parameters, and its enhanced capability to deal with complicated nonlinear problems. The proposed SSA-PSNN trainer has been compared with recent metaheuristic-based algorithms such as DE, GA, PSO and GWO. The reported outcomes show that the proposed trainer exhibits supremacy over the other training algorithms and obtains an optimal set of adjustable weights. From the merits discussed above, we may conclude that SSA can be used as a better training algorithm for higher order artificial neural networks, striking a good balance between exploration and exploitation.
As further progress, the SSA algorithm will be applied to generate the optimal number of hidden nodes for the PSNN. SSA will also be applied as a trainer over other available higher order networks.
Deep Learning for Cover Song Apperception

D. Khasim Vali(1) and Nagappa U. Bhajantri(2)

(1) Department of Computer Science and Engineering, Vidyavardhaka College of Engineering, Mysuru 570017, India, [email protected]
(2) Department of Computer Science and Engineering, Government College of Engineering, Chamarajanagara 571313, India, [email protected]
Abstract. In this work, we propose a cover song recognition system based on deep learning. From the literature, we understand that most existing works extract discriminative features from a pair of songs and compute a similarity or dissimilarity between them, based on the observation that there is a meaningful pattern between cover songs. This inspires reformulating the cover song recognition problem in a machine learning framework. Specifically, we build a cover song recognition system using a Convolutional Neural Network (CNN) and Mel Frequency Cepstral Coefficient (MFCC) features, after constructing a dataset composed of cover song pairs. The trained CNN yields the probability that two pieces of music are in a cover relation, given a cross-similarity matrix computed from them, and identifies the cover song by ranking on this probability. Experimental results show that the proposed approach achieves improved performance comparable to the state of the art.

Keywords: CNN · MFCC · Deep learning
1 Introduction

A cover song is a new rendition, by another musician, of a song already present in the music database. A cover reuses the tune and lyrics of the original song, but it is performed by new vocalists and instruments. Further musical features, for example key, tempo, and genre, can be reinterpreted by the new artist. Since the exclusive rights of the composition still belong to the writer of the original song, releasing a cover song without the permission of the original writer may cause a legal conflict. Another instance is music sampling, the procedure that reuses a piece of a recorded work. Sampling is widely viewed as a technique for reusing music today, but obtaining the original creator's approval for the reuse is a legal necessity. In other words, cover song identification is a task that aims to quantify the similarity between two songs. It can be used to prevent the infringement of copyright, and also to serve as an objective reference if there

© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_9
should arise a dispute. Over the last decade, several approaches [1–3] for cover song identification have been proposed. People generally recognize a cover through melodic or lyric similarity; however, separating the predominant melody from a mixed music signal is still not reliable, and extraction of the lyrics can be attempted only once the melody is separated.

1.1 Deep Learning—Basic Concepts

Recently, deep learning has emerged as a popular and adaptable area of machine learning. The advantage of deep neural networks over other machine learning and expert system frameworks lies in their ability to automatically learn hierarchical representations that are composite and abstract in nature; they have been used for classification, regression, and prediction tasks [4]. Deep learning representations have been successfully applied to a wide variety of problems, including natural language processing, image processing, robotics, information retrieval, and so on. Deep neural networks are extended versions of neural networks with many hidden layers. These layers play a significant role in learning features from the input datasets. In the music domain, similar hierarchical abstract features are extracted from a cover song in order to recognize the melody. These composite hidden representations can be extracted efficiently by the model if it has been trained on a massive amount of data with multidimensional input features. In this architecture, the initial input is preprocessed and later fed into the network, which is then trained on the songs present in the database. The hidden layers learn the structures and feed the learned features to the output units, as illustrated in Fig. 1.
The architecture of the Deep Neural Network (DNN) for cover song analysis is presented in Fig. 1.
Fig. 1. A conceptual architecture of DNN
In this work, we use a Convolutional Neural Network (CNN) to address the problem of detecting cover songs in music.
1.2 Convolutional Neural Network

The CNN structure contains an input layer, convolution layers, pooling layers, fully connected nodes, and a prediction layer. A stack of several convolution and pooling layers is used for large volumes of data, as in cover song recognition, which involves a large amount of unstructured data. A CNN is a typical kind of neural network inspired by how living organisms process natural image information in the brain. In this model, convolution is used to extract features from the given data, with more abstract features accumulated at each layer. The underlying mechanism was demonstrated by Hubel and Wiesel (1968), who established that visual cells act as local filters that scan natural images for patterns.
Fig. 2. Architecture of the CNN used for cover song identification
Consider the input data S, let N be the number of convolution filters, and let N1 be the maximum number of features. These filters are convolved with the given data to extract feature maps. The resulting dimensional vector D is directed to pooling, which selects appropriate structures from the sets of features. A graphical illustration of the CNN [5] is presented in Fig. 2. Figure 3 shows the basic operational steps used to train the CNN model [6].

1.3 Flow of Work

Here, we elaborate and motivate the strategy to identify a duplicate song. The system, which builds on the contemporary contributions, is addressed in Sect. 2 as related work. Section 3 then portrays the proposed solution to identical-tune identification on a real-world database. The outcomes are reported and discussed, and metrics are exercised over the quantitative output to reveal the performance potential; this is highlighted in Sect. 4. Finally, Sect. 5 concludes the attempt.
Audio Signal → MFCC Features → Embedding Layer → Convolution Layer → Pooling Layer → Dropout Layer → Output Layer

Fig. 3. Milestones for cover song detection using CNN
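As a concrete illustration of the Fig. 3 pipeline, the minimal NumPy forward pass below runs an MFCC matrix through one convolution, one max-pooling stage, and a sigmoid output unit. All sizes, the single kernel, and the ReLU nonlinearity are assumptions for illustration; dropout is omitted because it is inactive at inference time.

```python
import numpy as np

def conv2d_valid(x, k):
    """Single-channel 'valid' 2-D convolution (correlation, no kernel flip)."""
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def max_pool(x, size=2):
    """Non-overlapping size x size max pooling (trailing rows/cols dropped)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
mfcc = rng.standard_normal((12, 40))     # 12 MFCCs x 40 frames (toy input)
kernel = rng.standard_normal((3, 3))
feat = max_pool(relu(conv2d_valid(mfcc, kernel)))   # conv + pooling stages
dense_w = rng.standard_normal(feat.size)
prob = sigmoid(feat.ravel() @ dense_w)              # output unit
```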
2 Contemporary Effort

Acknowledging the limits in scalability and tractability of traditional cover detection systems, Bertin-Mahieux et al. [7] extended audio-fingerprinting-inspired hashing to compute jump-codes from beat-synchronous chroma [8] to detect covers. Audio fingerprinting [9] criteria, on the other hand, were successfully adopted in industry to find an almost identical copy of a given piece of music with high robustness to pitch, noise, distortions, and other transformations. Subsequently, audio fingerprinting strategies were explored in the works of [10, 11] for detecting live versions of cover songs. The authors of [1, 12] tried to achieve scalability and invariance by efficiently capturing some of this invariance in a low-dimensional feature space using sparse representations, where simple distance metrics give a notion of cover song similarity. The work [12] reported the highest MAP score on the large-scale cover identification problem, providing a natural alternative to the non-scalable MIREX-based evaluations. Similarly, another possible way to achieve scalability in cover detection is efficient database pruning on various features in a preliminary step, which potentially enables more specific audio-based methods to capture cover similarity in the later stages. Hence, the idea of a multi-stage architecture was explored in the works of [7, 14–16]. Moreover, besides chroma features, melody or pitch salience [17], chord profiles [15], self-similarity MFCC matrices [19], and cognition-inspired descriptors [19] were also explored
in the literature. A cover song similarity distance was computed using various sequence alignment approaches such as dynamic time warping [17], the Smith–Waterman algorithm [18], recurrence quantification analysis [19], cross-correlation plots [18], SiMPle [20], and information-theoretic measures [9]. A preprocessing step is usually applied to these tonal features before computing this distance, to make them invariant to the key or the tempo of the song. Key invariance can be obtained by the Optimal Transposition Index (OTI) [20] or by computing 2D Fourier transform magnitude coefficients of tonal feature vectors, disregarding the phase coefficients [1]. Further, tempo invariance can be achieved by using beat-synchronous chroma features [8]. In [10], the authors emphasized metric learning by projecting audio features into a high-dimensional space where simple distance measures capture cover song similarity. Recently, in [6], researchers trained a CNN with cross-similarity matrices obtained from Chroma Energy Normalized Statistics (CENS) features of cover or non-cover pairs to predict the probability that a reference song is a cover of the query song. The system proposed in [21, 22] remains the best performer in MIREX, with a mean average precision (MAP) of 0.75 on the MIREX mixed music collections. However, MIREX cover detection evaluation is conducted on a dataset of 1000 songs with pairwise comparisons, which is obviously non-scalable. Recently, there has been growing interest in applying domain-specific knowledge, such as from NLP, to traditional MIR tasks using multi-modal approaches, for example in genre classification [23] and music recommendation [24].
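The key-invariance property of 2D Fourier transform magnitudes cited from [1] follows from the Fourier shift theorem: a circular shift along the pitch axis (a transposition) only changes the phase of the 2D spectrum, never its magnitude. A quick NumPy check (the toy chroma matrix and the 3-semitone shift are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
chroma = rng.random((12, 64))               # 12 pitch classes x 64 beats
transposed = np.roll(chroma, 3, axis=0)     # key transposition = circular shift

mag = np.abs(np.fft.fft2(chroma))
mag_t = np.abs(np.fft.fft2(transposed))
# Discarding the phase makes the representation invariant to circular
# shifts along either axis, hence key-invariant for chroma.
```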
Even though NLP techniques were previously used in MIR studies for tasks such as lyrics alignment [25] and mood, tag, and genre classification [26], for the cover detection problem in particular, the reliability and accuracy of text-based approaches have not been studied or benchmarked except in the recent work of [27]. Surveying the contemporary community's effort has drawn our attention to the accumulated capability of researchers, witnessed through their strategies for identifying identical tunes so as to protect the effort of genuine music producers. Still, there is a dearth of literature in the broader area of deep-learning-based models that can separate the original tune from its mirrored versions. Hence, this work evolves a deep-learning-based approach to detect the cover song. In other words, the literature here converges into applying deep learning to detect the duplicate tune.
3 Proposed Convolution

The proposed system, shown in Fig. 4, is portrayed in three stages. Initially, the preprocessing stage converts the audio signal of each song into MFCC features, and these are fed into the deep learning classifier.
MFCC features → Deep learning classifier → Cover song identification

Fig. 4. The proposed scheme
3.1 Mel Frequency Cepstral Coefficients (MFCCs)

In a speaker recognition framework, the feature extraction module plays a significant role, essentially for two reasons. First, a speaker recognition framework needs to work with low-dimensional vectors in order to operate efficiently. Second, the feature extraction block removes unnecessary information carried in the speech frames and stresses the speaker-dependent components of the speech. MFCC remains the most popular feature for the speaker recognition task; it was introduced by Davis and Mermelstein in the 1980s and has remained state of the art since then. Before the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) were the primary features for Automatic Speech Recognition (ASR). MFCCs are based on the well-known variation of the human ear's critical bandwidth, with linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [28]. The following steps describe the calculation.

Algorithm Begins
Input: Speech sample, sample rate
Output: Matrix of MFCCs by frame
Parameters: window size = 20 ms, step size = 10 ms, nbins = 32, d = 12 (cepstra)
Step I: Compute the FFT power spectrum
Step II: Apply the m-channel Mel frequency filter bank
Step III: Convert to cepstra via the DCT
Algorithm Ends
The quality of the speaker recognition framework depends on extracting the right feature values. MFCC is the most widely used feature set in text-independent speaker recognition. Derivatives of MFCC, such as delta and delta-delta speech features, have been used to improve the efficiency of such a framework. In MFCC feature extraction, several parameters need to be settled; default values are often adopted for some of them, though these may not be the best choice. Here, the TIMIT database has been used; primarily, the number of MFCCs is kept at the default value of 12. The frame size is set to 20 ms, within the typical range of 10–30 ms. The frame overlap is set to half by default, which is well within the typical range of 25–70%. The number of filters is set to 32 by default. First, pre-emphasis of the speech is done with a factor of 0.97. The signal is then segmented into frames of length 20 ms with a shift of 10 ms. The Hamming window is applied to each frame, and the time signal is converted
into the frequency domain using the FFT. A set of 32 overlapping triangular filters is applied to the spectrum; these filters are uniformly spaced along the Mel frequency axis. The discrete cosine transform then converts the log-energy outputs of the filters into 12 mel-cepstral (MFCC) coefficients. Further, the first and second time derivatives of the MFCC components are obtained and appended to the vector.

3.2 Mel Frequency Warping

Psychophysical studies have shown that the human perception of the frequency content of a speech signal does not follow a linear scale. The human ear is more sensitive to low frequencies; for each tone with an actual frequency f, a subjective pitch is measured on the Mel scale. "Mel" is a contraction of the word melody, and its nominal unit relates to pitch. The Mel frequency is given by the equation Mel(f) = 2595 log10(1 + f/700). A filter bank with triangular bandpass frequency responses is applied in the frequency domain, corresponding to triangular-shaped windows on the spectrum. At low frequencies there is linear spacing between filters, whereas at high frequencies they are logarithmically spaced.
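The steps of Sects. 3.1 and 3.2 can be sketched end to end in NumPy. This is an illustrative reimplementation, not the authors' MATLAB code: the FFT length of 512 and the exact filter-bank construction are assumptions, while the pre-emphasis factor (0.97), 20 ms/10 ms framing, Hamming window, 32 triangular Mel filters, and 12 cepstra follow the text.

```python
import numpy as np

def mel(f):
    """Mel warping: Mel(f) = 2595 log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, rate, win=0.020, hop=0.010, nfilt=32, ncep=12, nfft=512):
    """Pre-emphasis, framing, Hamming window, FFT power spectrum,
    triangular Mel filter bank, log, DCT -> `ncep` cepstra per frame."""
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    flen, fhop = int(win * rate), int(hop * rate)
    nframes = 1 + max(0, (len(sig) - flen) // fhop)
    frames = np.stack([sig[i * fhop:i * fhop + flen] for i in range(nframes)])
    frames = frames * np.hamming(flen)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft        # power spectrum
    # Triangular filters spaced uniformly on the Mel axis.
    pts = inv_mel(np.linspace(mel(0.0), mel(rate / 2.0), nfilt + 2))
    bins = np.floor((nfft + 1) * pts / rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for j in range(nfilt):
        lo, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)      # falling edge
    log_e = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log filter-bank energies into cepstra.
    n = np.arange(nfilt)
    dct = np.cos(np.pi * np.outer(np.arange(ncep), (2 * n + 1) / (2.0 * nfilt)))
    return log_e @ dct.T                                          # (frames, ncep)

# One second of a 440 Hz tone at 16 kHz -> 99 frames of 12 coefficients.
feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0), 16000)
```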
4 Results and Discussion

We have experimented with real datasets to reveal the capability of the suggested criteria. The evolved method has been implemented in MATLAB R2013a on an Intel Pentium 4 Windows PC running at 2.99 GHz with 1 GB of RAM. A cross-similarity matrix (CSM) is defined between a pair of songs; in this experiment, we use the MFCC features of each song to calculate the CSM. A suitable dataset, covers30, has been created: it contains 30 sets of original and cover songs, spanning genres, styles, and live and recorded music, and is biased toward regional languages. Most songs have one cover version, though some have up to three. Similarly, the extended covers80 [29] was proposed at MIREX 2007 to benchmark cover song recognition systems. That dataset contains 80 sets of original and cover songs, 166 in total, comprising genres, styles, and live and recorded music; covers80 is predominantly oriented toward Western music. Table 1 shows filter banks with recognition rates. Further, for the experimentation, we created the covers50 song dataset by randomly picking songs from the available datasets covers30 and covers80 (Table 2).

Precisely, the work attempts to compute MFCC as the dominant feature on different datasets such as covers80 and covers30, and to reveal the performance of deep learning; the improved effort has been extended through a CNN. In the experimentation, songs are picked randomly from the database, and each experiment is conducted more than five times. The experimental results are witnessed through the maximum accuracy obtained in all cases, as shown in Table 2, which portrays the varied datasets with the same and different numbers of testing and training samples. The graphical representation of the accuracy of the different classifiers is shown in Fig. 5, where the CNN achieves better accuracy than the other classifiers on both the same and different datasets.
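The CSM described above reduces to a pairwise frame-distance matrix between the two songs' MFCC sequences. A minimal sketch follows; the Euclidean distance and the toy frame counts are assumptions, since the paper does not specify its distance measure.

```python
import numpy as np

def cross_similarity(A, B):
    """Cross-similarity matrix between two songs' MFCC sequences.
    A: (n, d) frames of song 1, B: (m, d) frames of song 2.
    Entry (i, j) is the Euclidean distance between frame i and frame j."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.sqrt(np.maximum(d2, 0.0))   # clip tiny negative round-off

rng = np.random.default_rng(0)
song1 = rng.standard_normal((90, 12))     # 90 frames x 12 MFCCs
song2 = rng.standard_normal((120, 12))    # 120 frames x 12 MFCCs
csm = cross_similarity(song1, song2)      # (90, 120) image fed to the CNN
```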
Reducing the filter size improves the recognition rates but increases the computational time of retrieval because of the increase in the number of filters. Accordingly, convolutional windows of sizes 16 × 16, 9 × 9, 5 × 5, and 5 × 5 are utilized for the conv1, conv2, conv3, and conv4 layers respectively. All convolutional layers are also exercised with uniform filter windows of sizes 32 × 32, 16 × 16, 9 × 9, and 5 × 5; Table 1 compares the performance of the different filter window sizes. Executing max pooling and mean pooling produces recognition rates of 93.33% and 90.84% respectively. To assess the strength and productivity of the developed CNN, it was contrasted with different classifiers, against which it exhibits better performance. For quicker recognition, we first investigated the AdaBoost classifier [30] and obtained very low classification rates. Replacing AdaBoost with a conventional artificial neural network (ANN) yielded better recognition rates. The recognition accuracy was further improved by replacing the ANN with a deep ANN, with a reported increase in recognition rate of 5%. A much better improvement of 4% in recognition accuracy and an increase of 15% in testing speed were observed in this work with the CNN. Although the CNN takes extra time for training, its testing takes far less computation time. Consequently, CNNs are a suitable tool for cover song identification, and there is broad scope for developing improved structures and faster training algorithms.
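Feature-map shrinkage under the two window schedules above is easy to trace with 'valid' convolution arithmetic. The 180 × 180 input resolution and the 2 × 2 max-pool after every convolution are assumptions for illustration, since the paper does not state the CSM input size.

```python
def valid_conv(h, w, k):
    """Output size of a 'valid' convolution with a k x k window."""
    return h - k + 1, w - k + 1

def pool(h, w, s=2):
    """Output size of non-overlapping s x s pooling."""
    return h // s, w // s

def stack_shapes(h, w, kernels):
    """Trace feature-map sizes through conv + 2x2 max-pool layers."""
    shapes = [(h, w)]
    for k in kernels:
        h, w = valid_conv(h, w, k)
        h, w = pool(h, w)
        shapes.append((h, w))
    return shapes

# The two window schedules compared in Table 1 (conv1..conv4).
small = stack_shapes(180, 180, [16, 9, 5, 5])
large = stack_shapes(180, 180, [32, 16, 9, 5])
```

The smaller windows leave a 6 × 6 final map against 3 × 3 for the larger ones, which illustrates why filter size trades recognition detail against computation.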
Fig. 5. Comparison of accuracy across classifiers
5 Conclusion

Here, we explored a CNN-based approach to audio cover song identification by employing a cross-similarity matrix. The features are extracted as MFCCs; the CNN is then trained on cross-similarity matrices, and likewise a deep learning classifier for songs is trained. The performance of the proposed system was compared
Table 1. Filter banks with recognition rates

| Exercise | Layers | Recognition rate |
|---|---|---|
| Convolutional filter window | 5 × 5, 5 × 5, 5 × 5, 5 × 5 | 95.54 |
| | 9 × 9, 9 × 9, 9 × 9, 9 × 9 | 93.73 |
| | 16 × 16, 16 × 16, 16 × 16, 16 × 16 | 90.15 |
| | 32 × 32, 32 × 32, 32 × 32, 32 × 32 | 89.86 |
Table 2. Comparison of accuracy with classifiers (recognition rates)

| Classifier | Random1 (covers30): Same | Random1 (covers30): Different | Random2 (covers80): Same | Random2 (covers80): Different | Random3 (covers50): Same | Random3 (covers50): Different |
|---|---|---|---|---|---|---|
| AdaBoost | 65.68 | 60.36 | 66.47 | 61.19 | 64.33 | 60.00 |
| ANN | 79.77 | 69.68 | 78.54 | 71.8 | 73.66 | 70.46 |
| Deep ANN | 87.34 | 78.89 | 88.75 | 81.1 | 86.56 | 78.33 |
| Proposed [CNN] | 95.12 | 89.03 | 95.89 | 90.15 | 93.22 | 89.00 |
with a deterministic strategy and a machine learning-based approach. Although the current study showed promising results, there is much room for improvement, particularly by finding a more suitable CNN design, tuning hyper-parameters, and increasing the size of the training data set with a flexible input feature length. Furthermore, embedding techniques are necessary for a large-scale search of cover songs. Exploration of these is left for future work.
References 1. Bertin-Mahieux, T., Ellis, D.P.W.: Large-scale cover song recognition using the 2D Fourier transform magnitude. Int. Soc. Music Inf. Retr. (2012) 2. Bertin-Mahieux, T., Ellis, D.P.W.: Large-scale cover song recognition using hashed chroma landmarks. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (2011) 3. Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR) (2011) 4. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009) 5. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. (2015). https://doi.org/10.1016/j.neunet.2014.09.003 6. Othman, E., Bazi, Y., Alajlan, N., Alhichri, H., Melgani, F.: Using convolutional features and a sparse autoencoder for land-use scene classification. Int. J. Remote Sens. 37, 2149–2167 (2016)
D. K. Vali and N. U. Bhajantri
7. Cai, K., Yang, D., Chen, X.: Two-layer large-scale cover song identification system based on music structure segmentation. In: 2016 IEEE 18th International Workshop on Multimedia Signal Processing, MMSP 2016 (2017) 8. Cano, P., Batle, E., Kalker, T., Haitsma, J.: A review of algorithms for audio fingerprinting. In Multimedia Signal Processing, 2002 IEEE Workshop on, pp. 169–173. IEEE (2002) 9. Chang, S., Lee, J., Keun Choe, S., Lee, K.: Audio cover song identification using a convolutional neural network (2017). arXiv preprint https://arxiv.org/abs/1712.00166 10. Chen, N., Li, W., Xiao, H.: Fusing similarity functions for cover song identification. Multimed. Tools Appl. 77(2), 2629–2652 (2018) 11. Ellis, D.P.W., Poliner, G.E.: Identifying cover songs’ with chroma features and dynamic programming beat tracking. In: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE (2007) 12. Foster, P., Dixon, S., Klapuri, A.: Identifying cover songs using information-theoretic measures of similarity. IEEE/ACM Trans. Audio, Speech Languag. Process. (TASLP) 23(6), 993–1005 (2015) 13. Heo, H., Kim, H.J., Kim, W.S., Lee, K.: Cover song identification with metric learning using distance as a feature. In: ISMIR (2017) 14. Humphrey, E.J., Nieto, O., Bello, J.P.: Data-driven and discriminative projections for largescale cover song identification. In: Proceedings of the 14th International Society for Music Information Retrieval Conference (2013) 15. Khadkevich, M., Omologo, M.: LargeScale cover song identification using chord profiles. In: Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR-2013) (2013) 16. Knees, P., Schedl, M.: Music similarity and retrieval: an introduction to audio-and web-based strategies, vol. 36. Springer (2016) 17. Knees, P., Schedl, M., Widmer, G.: Multiple lyrics alignment: automatic retrieval of song lyrics. 
In: International Society for Music Information Retrieval Conference (ISMIR) (2005) 18. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008) 19. Muller, M., Kurth, F., Clausen, M.: Audio matching via chroma-based statistical features. In: Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR) (2005) 20. Oramas, S., Nieto, O., Barbieri, F., Serra, X.: Multi-label music genre classification from audio, text, and images using deep features. In: International Conference on Music Information Retrieval (ISMIR) (2017) 21. Oramas, S., Nieto, O., Sordo, M., Serra, X.: A Deep multimodal approach for coldstart music recommendation. In: Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems—DLRS (2017) 22. Osmalskyj, J., Pirard, S., Van Droogenbroeck, M., Embrechts, J.J.: Efficient database pruning for large-scale cover song recognition. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 714–718 (2013) 23. Osmalskyj, J., Foster, P., Dixon, S., Jean-Jacques: Embrechts. Combining features for cover song identification. In: 16th International Society for Music Information Retrieval Conference (ISMIR) (2015) 24. Rafii, Z., Coover, B., Han, J.: An audio fingerprinting system for live version identification using image processing techniques. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 644–648 (2014) 25. Ravuri, S.V., Ellis, D.P.W.: Cover song detection: From high scores to general classification. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, Texas, USA (2010)
26. Salamon, J., Serra, J., Gomez, E.: Tonal representations for music retrieval: from version identification to query-by-humming. Int. J. Multimed. Inf. Retr. 2(1), 45–58 (2013) 27. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975) 28. Logan, B.: Mel frequency cepstral coefficients for music modeling. In: Proceedings of International Symposium on Music Information Retrieval (2000) 29. Ellis, D.P.W.: The “covers80” cover song data set (2007) 30. Tharwat, A.: AdaBoost classifier: an overview. https://doi.org/10.13140/RG.2.2.19929.01122 (2018)
SVM-Based Drivers Drowsiness Detection Using Machine Learning and Image Processing Techniques P. Rasna(B) and M. B. Smithamol Department of Computer Science and Engineering, LBS College of Engineering Kasargod (Govt. Undertaking), Kasaragod 671542, Kerala, India [email protected], [email protected]
Abstract. In this paper, we propose an efficient algorithm for driver drowsiness detection with an efficient alert system. The existing works mainly follow vehicle-based, physiological-based, and behavioral-based measures, and the works based on behavioral measures have mainly focused on eye movements, yawning, and head position. The proposed method uses more relevant and appropriate behavioral features, such as significant variation in the aspect ratio of the eyes, the mouth opening ratio, nose length bending, and the changes in the eyebrows, wrinkles, and ears due to drowsiness. A binary SVM classifier is used to classify whether the driver is drowsy or not. The inclusion of these features helped in developing a more efficient driver drowsiness detection system. The proposed system shows 97.5% accuracy and a 97.8% detection rate. Keywords: Behavior changes · Drowsiness · Wrinkle detection · Eyebrow variation
1 Introduction Drowsiness is a state of sleepiness that reduces the attention level of the driver because of lack of sleep, long continuous driving, or other medical conditions such as brain disorders [1–3]. A study in the United States (U.S.) showed that 37% of drivers surveyed admitted to falling asleep at the wheel during long drives. Figure 1 shows a study of accidents in India between 2012 and 2018 [1]. The main challenge in this area is the efficient extraction of features that results in better accuracy. In our system, using behavioral features such as facial landmarking gives better accuracy, and drivers do not feel any discomfort while driving with this system. The proposed method efficiently detects driver drowsiness by identifying more accurate facial landmark points and extracting a larger number of valid features. The method can also be extended to various real-life applications, such as chemical industries where drowsiness is induced by the chemical environment. This paper is presented as follows. Section 2 gives a brief description of related works. Section 3 gives the detailed system architecture and the working of the proposed © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_10
work. Section 4 provides the experimental analysis of the proposed work. Finally, Sect. 5 concludes with our analysis and offers insight into extending the work effectively to a real-time environment.
Fig. 1. Accidents study in India (2012–2018); y-axis: accident rate (%)
2 Literature Review Driver drowsiness is one of the main causes of the increase in road accidents. The existing work on driver drowsiness detection can be classified as follows: vehicle-based measures, physiological measures, and behavioral measures. 2.1 Vehicle-Based Measures Forsman et al. [4] and Wang et al. [5] proposed methods using driving behavior together with eye features. Here, vehicle speed, lateral position, and steering wheel angles are considered for detecting the drowsiness level. 2.2 Physiological-Based Measures These authors mainly focused on extracting physiological features. Hwang et al. [6] proposed a method using the electroencephalogram (EEG). Fujiwara et al. [7] used heart rate variability for detecting drowsiness; here, too, the EEG is used for detecting heart rate changes. The electrocardiogram has also been used [8] for detecting respiration changes through heart rate monitoring, and an electrocardiogram measuring heart rate variability is also used in [9]. 2.3 Behavioral-Based Measures The authors of [10] introduced a method using multidimensional facial features, such as mouth and eyelid movements, to detect drowsiness. The geometrical characteristics of these facial features are extracted and computed quickly to detect drowsiness. Subashree et al. [11] described a method using only eye features; nose features are added to the eye features for better accuracy in [12]. Charlotte Jacobe et al. [13] aim at not just detection but also prediction. Two artificial neural networks were developed: one detects the degree of drowsiness at
every minute, and the second predicts, every minute, the time required to reach a particular drowsiness level. Mandal et al. [14] specified a new feature, PERCLOS (percentage of eye closure). Verma et al. [15] describe a robust method that detects drowsiness or distraction of the driver during both day and night and issues an alarm signal. Another method [16] constantly monitors the eyes of the person and extracts signs of fatigue. Yong et al. [17] use similar methods and selected features but different algorithms: an AdaBoost algorithm is used for detecting the eyes and mouth, and a fuzzy algorithm combined with PERCLOS is used to judge the fatigue status. The authors of [18] present a different idea using two cameras and also use a GPS system to track the driving pattern; landmarking features are applied to find the eye region. The authors of [19] propose a novel approach based on the symmetry of the sclera around the iris, known as Iris-Sclera Pattern Analysis; its three stages are described, and the drowsiness state is determined using PERCLOS. The authors of [20] suggest a system in which a webcam records a video of the driver and sends it to a server. Image processing is applied to extract the frames using SVM and HOG. After the face is detected, facial landmarks such as the positions of the eyes, nose, and mouth are marked on the image, and from these landmarks the EAR, MOR, NLR, and head position are calculated. The papers [21–23] also use facial landmarking. Compared to physiological and vehicle-based measures, behavioral measures are easier to handle, so a driver drowsiness detection method using facial landmarks is proposed here. The inclusion of more relevant features in the proposed method enables us to achieve higher precision and accuracy compared to existing work.
3 System Architecture The proposed model detects driver drowsiness using facial landmarking. In this technique, after the face is detected, the landmark points of the eyes, nose, mouth, eyebrows, and ears are found, along with the wrinkles on the forehead; these points are indicators of drowsiness. The technique involves image acquisition, face detection, facial landmark marking, feature extraction, and classification. In image acquisition, the data are videos, and frames are extracted from them. After image acquisition, the face is detected using the Viola–Jones algorithm [16], and a bounding box is marked on the face. • Facial Landmark Marking The important points in the face can be identified as discussed in [17], and a sample is shown below. The existing system considered only the ears, mouth, and nose for detecting drowsiness. Equations are applied to the landmark points, and drowsiness is determined from the resulting values. Figures 2 and 3 show the facial landmark points in the existing system [20] and our proposed system, respectively. Table 1 details all landmark points considered in the proposed work.
SVM-Based Drivers Drowsiness Detection Using Machine Learning
Fig. 2. Present facial landmarks
Fig. 3. Modified facial landmarks
Table 1. Facial landmark points

Parts       Landmark points   Parts              Landmark points
Mouth       [38–49]           Left ear           [58–65]
Right eye   [22–27]           Forehead wrinkle   [1–10]
Left eye    [28–32]           Right eyebrow      [11–15]
Nose        [34–37]           Left eyebrow       [16–20]
• Feature Extraction After landmark detection, the face is divided into different elliptical regions. The Hough transform is used to find the elliptical shapes in the face [24]. Drowsiness changes the points of these elliptical shapes, as depicted in Fig. 4.
Fig. 4. Elliptical shape extraction at different parts of the face
The features Eye Aspect Ratio, Mouth Opening Ratio, and head bending are already described in the existing system [20]. Eye Aspect Ratio (EAR): this ratio relates the height and width of the eye:

EAR = ((p22 − p26) + (p23 − p27)) / (2(p24 − p25))   (1)

Mouth Opening Ratio (MOR): this ratio indicates the yawning of the driver:

MOR = ((p40 − p48) + (p41 − p47) + (p42 − p46)) / (3(p44 − p38))   (2)
A nose length ratio greater or less than a specific range of values indicates head bending and a change in the drowsiness level:

NLR = nose length (p37 − p34) / average nose length   (3)
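Reading each difference (pi − pj) in Eqs. (1)–(3) as the Euclidean distance between the corresponding landmark points, the three ratios can be sketched in plain Python (the landmark coordinates and helper names below are hypothetical, for illustration only):

```python
import math

def dist(a, b):
    # Euclidean distance between two landmark points (x, y)
    return math.hypot(a[0] - b[0], a[1] - b[1])

def eye_aspect_ratio(p):
    # Eq. (1): two vertical eye distances over twice the horizontal one
    return (dist(p[22], p[26]) + dist(p[23], p[27])) / (2 * dist(p[24], p[25]))

def mouth_opening_ratio(p):
    # Eq. (2): three vertical mouth distances over three times the width
    return (dist(p[40], p[48]) + dist(p[41], p[47]) + dist(p[42], p[46])) / (3 * dist(p[44], p[38]))

def nose_length_ratio(p, average_nose_length):
    # Eq. (3): current nose length relative to its calibrated average
    return dist(p[37], p[34]) / average_nose_length

# hypothetical landmark coordinates for a single frame
pts = {22: (0, 0), 26: (0, 2), 23: (1, 0), 27: (1, 2), 24: (-2, 1), 25: (3, 1),
       40: (0, 0), 48: (0, 3), 41: (1, 0), 47: (1, 3), 42: (2, 0), 46: (2, 3),
       44: (3, 1), 38: (0, 1), 37: (0, 0), 34: (0, 4)}
ear = eye_aspect_ratio(pts)        # (2 + 2) / (2 * 5) = 0.4
mor = mouth_opening_ratio(pts)     # (3 + 3 + 3) / (3 * 3) = 1.0
nlr = nose_length_ratio(pts, 5.0)  # 4 / 5 = 0.8
```

A low EAR then corresponds to closing eyes, a high MOR to yawning, and an NLR outside its calibrated range to head bending.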
The proposed model adds further features, described below: eyebrow variation, wrinkle detection, and ear shape variation. Eyebrow variation: The eyebrow changes can be computed using the circular Hough transform; changes in the circular shape containing the eyebrow indicate drowsiness. In the drowsy state, the length of the eyebrow reduces, which affects the overall shape of the circle, and in this way drowsiness can be determined. Wrinkle detection: In the drowsy state, more than three wrinkles may appear on the forehead, which is another indication of drowsiness [25]. Figure 5 shows the marked sleep lines versus wrinkles: part I (left) shows the sleep lines that appear in the drowsy state, and part II shows normal wrinkles. Using these changes, we can determine whether the person is drowsy. The wrinkles are found using a Hessian filter based on the directional gradient and the Hessian matrix, as discussed in [28]. In this method, a matrix is determined for every pixel of the extracted image frame; the eigenvalues of this matrix help decide whether a given point belongs to a ridge, irrespective of the ridge orientation at that point. To measure the reliability of the wrinkle detection, the Jaccard Similarity Index (JSI) is used [25].
Fig. 5. Sleep lines versus wrinkles
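The Hessian-based ridge test above can be sketched with NumPy: a 2 × 2 Hessian is estimated per pixel from finite differences, and its dominant eigenvalue responds to wrinkle-like ridges regardless of orientation. This is a simplified sketch, not the filter of [28]; `ridge_response` is a hypothetical helper, and the smoothing and thresholding steps are omitted.

```python
import numpy as np

def ridge_response(img):
    # largest-magnitude eigenvalue of the per-pixel 2x2 Hessian; high
    # values mark ridge/valley structures (e.g. forehead wrinkles)
    img = img.astype(float)
    gy, gx = np.gradient(img)     # first derivatives
    gyy, gyx = np.gradient(gy)    # second derivatives
    gxy, gxx = np.gradient(gx)
    # eigenvalues of [[gxx, gxy], [gyx, gyy]] via trace/determinant
    half_trace = (gxx + gyy) / 2.0
    disc = np.sqrt(np.maximum(half_trace**2 - (gxx * gyy - gxy * gyx), 0.0))
    lam1, lam2 = half_trace + disc, half_trace - disc
    return np.where(np.abs(lam1) >= np.abs(lam2), lam1, lam2)

# synthetic "forehead": bright skin with one dark wrinkle line on row 5
patch = np.ones((11, 11))
patch[5, :] = 0.0
resp = ridge_response(patch)
```

In this synthetic patch, the response along any interior column is strongest exactly on the dark wrinkle row, independent of the line's orientation.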
The Jaccard index J is calculated by dividing the intersection of A and B by the union of A and B, where A and B are annotations of different coders. Equation (4), referred from [25], is shown below:

J(A, B) = |A ∩ B| / |A ∪ B|   (4)
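For two coders' wrinkle annotations represented as sets of pixel coordinates, Eq. (4) is direct to compute (an illustrative sketch with made-up annotations):

```python
def jaccard(a, b):
    # Eq. (4): |A ∩ B| / |A ∪ B| over two annotation sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

coder_a = {(10, 4), (10, 5), (10, 6), (11, 6)}
coder_b = {(10, 5), (10, 6), (11, 6), (12, 6)}
jsi = jaccard(coder_a, coder_b)  # 3 shared pixels / 5 total = 0.6
```

A JSI close to 1 means the detected wrinkles agree closely with the reference annotation.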
Ear detection: The ear is another indicator of drowsiness. In the drowsy state, the lower tip of the ear makes small movements. The ear has a deformed elliptic shape, and the amount of deviation from the circular form is computed [26]. Figure 6 shows the general diagram of the proposed system.
Fig. 6. General diagram for the proposed system
• Classification Machine learning algorithms are used for classifying the data. In the setup phase, with the person in a normal condition, the EAR value is computed for more than 200 frames, and half of the maximum values are used to compute the threshold. A value greater than the threshold indicates that the eyes are open; an EAR value less than the threshold indicates that the eyes are closed. Likewise, the MOR threshold is calculated in the setup phase: a value greater than the threshold indicates that the mouth is open, and a lower value indicates that it is closed. Table 2 shows the threshold values of the computed features. The thresholds are computed as shown in Table 2, and the system is tested for verification and validation. The system flags drowsiness if one of the features is detected in a frame; if more than 8 frames show drowsiness [2] for at least one extracted feature, such as closed eyes, the system definitively detects drowsiness and issues a warning signal.
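The setup-phase thresholding and the 8-frame rule can be sketched as follows. This is an illustrative reading of the text: the threshold is taken as the mean of the larger half of the calibration values, the 8 frames are assumed consecutive, and the helper names are our own.

```python
def setup_threshold(values):
    # mean of the larger half of >= 200 calibration-frame values (e.g. EAR)
    top_half = sorted(values, reverse=True)[:len(values) // 2]
    return sum(top_half) / len(top_half)

def is_drowsy(frame_flags, min_frames=8):
    # alarm when more than `min_frames` consecutive frames show at least
    # one drowsiness indicator (e.g. EAR below its threshold)
    run = 0
    for flag in frame_flags:
        run = run + 1 if flag else 0
        if run > min_frames:
            return True
    return False

threshold = setup_threshold([0.40, 0.30, 0.20, 0.35])  # (0.40 + 0.35) / 2
alarm = is_drowsy([True] * 9)                          # 9 > 8 frames
```

Per-feature flags (eyes closed, mouth open, NLR out of range, and so on) would be OR-ed per frame before being fed to `is_drowsy`.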
Table 2. Threshold and properties of computed features

Feature (setup phase)   Threshold/property
EAR                     0.35
MOR                     0.6
NLR                     0.7 < NLR < 1.2
Eyebrow                 10 mm
Ear                     Elliptical shape
Wrinkles                No. of wrinkles

Text Localization
  … height/2
  top end ← find end(upper)
  bottom end ← find end(lower)
  if width(top end) <= 25: top end ← join(top end, top end − 6x)
  if width(bottom end) <= 25: bottom end ← join(bottom end, bottom end − 6x)
  rectangle.append(top end, bottom end)
  return top right img, bottom right img
End
A Self-Acting Mechanism to Engender Highlights of a Tennis Game
155
A two-layered Convolutional Neural Network is implemented to recognize the score from the Score-Box. The input image to this CNN is obtained from the Text Localization process.

CNN for Text Extraction
def tensor graph Text Extraction(RKL Tennis OCR Dataset):
Begin
  for filter in number of filters:
    W ← Random initialized weight vectors
    conv ← Conv2d(input, kernel, strides, padding)
    pooling ← Max pool(input, kernel size, strides, padding)
    activation ← Relu(conv)
    conv ← Conv2d(input, kernel, strides, padding)
    pooling ← Max pool(input, kernel size, strides, padding)
    activation ← Relu(conv)
    fullycon ← Fully connected(inp, num op, activation function)
    activation ← Relu(fullycon)
  combined outputs ← combine all filter outputs
  Predictions ← softmax(combined outputs)
  Loss ← compute loss(predictions)
  Accuracy ← compute accuracy(predictions)
  Training Graph ← Save Model()
  return Training Graph
End

Crowd Intensity Analysis
The input to the Crowd Intensity module is the audio clip name and the nth second from which the audio is to be processed. From the whole audio segment, the part ranging from second n − 1 to second n + 1 is clipped and stored in a temporary variable. The maximum amplitude of this temporary audio segment is calculated using Python's pydub library, whose max function returns the maximum amplitude of an audio segment. The range starts at second n − 1 because the crowd begins to react after the point has actually ended, and there is a time gap of a second or two between the actual point end and the score change. Hence, taking the crowd reaction into account from a second before the score change yields a more accurate comparison when choosing points for the highlights.

Crowd Intensity Analysis
def Crowd Intensity(Audio Clip, nth second):
Begin
  Audio ← open(Audio clip)
  Temp ← Extract Sub Segment(nth sec − 1, nth sec + 1)
  Amp ← Max Amplitude(Temp)
  return Amp
End
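Operationally, the same peak measurement can be sketched over raw PCM samples in plain Python (pydub's AudioSegment does this for encoded audio; the sample-array version and helper name below are our own simplification):

```python
def crowd_intensity(samples, rate, nth_second):
    # clip the window from (n - 1) s to (n + 1) s around the score change,
    # then take the peak absolute sample value as the crowd-reaction proxy
    start = max(0, (nth_second - 1) * rate)
    end = (nth_second + 1) * rate
    return max(abs(s) for s in samples[start:end])

# 5 s of silence at a toy 10 Hz sample rate, with a cheer inside the
# window around second 2 and a louder event outside it
samples = [0] * 50
samples[15] = 900    # falls in the [1 s, 3 s) window
samples[35] = -1000  # outside the window
peak = crowd_intensity(samples, rate=10, nth_second=2)  # 900
```

Points are then ranked for the highlights package by this per-point peak value.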
R. Arunachalam and A. Kumar

4 Result Analysis

4.1 Score-Box Detection
For testing the accuracy of score-box detection, 250 images were used, including 13 images from tennis matches with no score-box. For the 237 images that contained a score-box, an average true-positive probability of 93% was obtained; across all test images, 4 false-positive score-boxes were recognized, with 0 true-negative boxes, giving a precision of 0.98. Unnecessary frames are identified by the absence of a score-box. This is essential so that less unnecessary content is processed in further work and to reduce repetition of points in the highlights package, as the replay and practice segments contain the serve action of a player, which the system would otherwise wrongly interpret as the start of a point. The training set for score-box detection was constructed so that it included 7 prominent types of score-boxes: the ATP 250 series, the ATP 500 series, the Masters 1000 series, and the four Grand Slams. Any frame that falls under the seven categories in Table 1, with seldom change in the pixel values around the score-box in successive frames, achieves an accuracy of 99%. The lower accuracy of 94% in these categories occurs when the pixel values around the score-box vary strongly between successive frames, for example, in match segments shot with a Spidercam. Sometimes the score-box detection accuracy falls in the range 85–93%; this happens when a match outside the seven categories is given as input, because the system has been trained mostly with images from these seven categories. An example of this scenario is matches from the early 2000s, where the score-box is quite different from the ones present now. Because the CNN learns features of the score-box such as contrast difference, false positives occur in situations where the machine detects the sponsor's name, as in Fig. 3.
Such false positives are eliminated by supplying the approximate width and height of the score-box, so that any detected box outside this range of values is discarded.

4.2 Action Recognition
The 3D CNN model was trained on the Thetis dataset, a set of videos of prominent tennis actions [10]. The model is implemented using TensorFlow-GPU with processes distributed over multiple cores. The dataset has 12 tennis actions with 1980 videos in total and was split into 75% for training and 25% for testing; hence, the training set has 1485 action videos and the test set 495. To converge quickly, a momentum optimizer and definite (rather than random) weight initialization are applied. The weights of the network are initialized so that the neuron activation functions do not start out in saturated or dead regions: they are set to random values that are neither "too small" nor "too large". The momentum optimizer prevents the descent from moving too fast past the minimum and makes it more responsive to changes. When the local minimum is reached, i.e., the lowest
Fig. 3. False positive of score-box detection
point on the curve, the momentum is still high, and the optimizer does not slow down at that point; the high momentum could cause it to miss the minimum entirely and continue moving up. As expected, the model accuracy increased with the number of training steps, as shown in Fig. 4. From Table 2, the accuracy of the action recognition model increases as the growth rate and depth parameters increase. The growth rate (k) is the number of feature maps of a layer, and the depth (d) is the number of layers of the 3D CNN. Increasing k and d too much results in a very sophisticated network, which makes attaining the global minimum during training extremely difficult. The higher the growth rate and depth, the lower the batch size must be in order to speed up the training process. An accuracy of 90.56% is achieved with a batch size of 4, a crop size of (64, 128), a growth rate of 24, and a depth of 30, trained for 140 epochs.

4.3 Text Localization
During localization, all characters are recognized, i.e., the player names along with the score values. The player names can easily be eliminated by starting to slide the window from half of the width of the score-box. Varying the sliding window size up to half of the width and height causes the same region of the score-box to be recognized repeatedly; these detections have to be removed, as they represent the same region of space where digits have already been found. The current rectangular box may overlap with previously found boxes; this is detected by comparing the boundary values of the current box with the already-found rectangular boxes. If it overlaps, the outer rectangular box has to be removed. For example, in Fig. 7, 6 and 7 have already been found separately, but a rectangular box whose width contains both 6 and 7 is found again (Figs. 5 and 6).
Fig. 4. Accuracy of 3D CNN model over multiple training steps
Fig. 5. Box around score points
So, to remove the outer rectangle, compare the area of the box containing both 6 and 7 with the areas of the box containing 6 and the box containing 7, and skip the maximum-area rectangular box; this leaves the localized array containing the individual boxes for 6 and 7. Figure 8 indicates that this method of identifying white pixels on a black background finds only the score points and not the set points. To identify the set points, the image is binarized using THRESH BINARY INV, which converts the RGB image into one containing black pixels on a white background, so the score points are displayed as black pixels on a white background and the set points as white pixels on a black background. For this purpose, the loop is run twice, once with each binarized image: the first pass finds the score points and the second finds the set points, as shown in Fig. 9. This sometimes also finds characters on the net; these can be eliminated by checking all rectangular boxes after the localization process against some minimum area and removing the boxes that fall below it. This finds all the set and score points, but only the end values are required to detect a score change. The method sometimes finds A and D as individual boxes; to combine the two boxes, the distance of D to A and to 0 is compared, and D is joined with the box at the minimum distance from it, i.e., A. This method sometimes joins with the small rectangular box within some characters.
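The outer-box elimination step can be sketched as a containment test over axis-aligned boxes (x, y, w, h); the helper names and sample geometry below are our own, not from the paper:

```python
def contains(outer, inner):
    # True if `inner` lies fully inside `outer`; boxes are (x, y, w, h)
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def drop_outer_boxes(boxes):
    # discard any box that fully contains another detection, keeping the
    # smaller per-character boxes (e.g. '6' and '7' instead of '67')
    return [b for b in boxes
            if not any(o is not b and contains(b, o) for o in boxes)]

boxes = [(0, 0, 10, 12), (0, 0, 4, 12), (5, 0, 5, 12)]  # '67', '6', '7'
kept = drop_outer_boxes(boxes)  # only the two single-digit boxes survive
```

The same predicate, inverted, supports the later step of removing tiny spurious boxes found inside characters such as "A".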
Fig. 6. After overlapping and binarization problem
Fig. 7. Small box inside “A”
As shown in Fig. 10, the second loop sometimes finds some space inside A. In this situation, if boxes are drawn around A, within A, and around D individually, the box around D joins with the box within A. So, the areas of all localized boxes are compared, and any box smaller than some specific area is removed. After solving all these problems, the rightmost values along the top and bottom of the score-box are localized, as shown in Fig. 11, and are given to the OCR to detect score changes.
Fig. 8. Final score-box with score points
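Once the localized end values are read by the OCR frame after frame, a point boundary can be flagged as a simple change in the recognized string (an illustrative sketch; the helper and sample scores are hypothetical):

```python
def score_changed(prev, curr):
    # a point ends when the OCR'd rightmost score values differ between
    # successive frames
    return prev is not None and curr != prev

frames = ["15-30", "15-30", "15-40"]
changes = [score_changed(p, c) for p, c in zip([None] + frames, frames)]
# only the transition to "15-40" marks the end of a point
```

Events such as break, set, and match points follow from interpreting the recognized values rather than merely comparing them.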
4.4 Text Recognition

For the optical character recognition of the text in the score-box, an accuracy of 99.1% was achieved when the CNN model with a filter size of 5 × 5 and SAME padding was trained for 10 epochs with a batch size of 128 and a learning rate of 0.001; 36,472 point-score images from the tennis score-box were used for training, and 7327 images were used for testing. For introducing nonlinearity,
the ReLU activation function is used in the intermediate layers, and Table 1 shows that a Sigmoid output layer gives better accuracy than ReLU. No vanishing problem occurs in the ReLU activation function as x increases, and it converges about six times faster than the Tanh function.

Table 1. Performance comparison of OCR by varying hyperparameters

Padding   Activation function in output layer   Accuracy (%)
Same      Sigmoid                               99.1
Same      ReLU                                  96.5
Valid     ReLU                                  92.7
Valid     Sigmoid                               94.8

4.5 Crowd Intensity Analysis
While calculating the intensity after the end of a point, the chair umpire's score call is sometimes inadvertently included when computing the maximum amplitude of the three-second audio segment. This rare anomaly causes some mediocre points to be considered worthy and hence included in the highlights package. The anomaly can be completely avoided if the chair umpire's score call comes a second or two after the score change. It is rare in matches played after 2016, where the score-box is updated swiftly after the point has ended, leaving a one- or two-second gap before the umpire's score call; in matches from 2016 or earlier, the score change does not happen immediately after the point has actually finished, and this delay is one of the major factors causing the crowd intensity analysis anomaly.
5 Conclusion
Highlights for a tennis match are generated by taking into account the crowd's reaction. This is accomplished by first eliminating unnecessary frames in order to remove segments such as replays and practice points before the actual start of the match. Action recognition is then used to identify a player's serve, marking the start of a point, followed by text extraction from the detected score-box to find score changes, which mark the end of a point and events such as break points, set points, and match points. Finally, the crowd reaction after the end of each point is analyzed to assign priority levels to individual points. This method overcomes the flaws of recent methods used for highlights generation of a sports match. These include collating the
replay segments alone, which has two major flaws. First, the replay segments shown by the broadcaster involve frames at a slower rate than usual, and the replays carry no score-box, so the viewer is unaware of the situation in the match. Second, the technique includes segments of the match such as the pre-match practice, because replay-based methods focus on detecting the logo at the beginning and end of each replay segment.
R. Arunachalam and A. Kumar
Performance Evaluation of RF and SVM for Sugarcane Classification Using Sentinel-2 NDVI Time-Series Shyamal Virnodkar1(B) , V. K. Pachghare1 , V. C. Patil2 , and Sunil Kumar Jha2 1 Department of Computer Engineering & IT, College of Engineering, Pune, SPPU, India
{ssv18.comp,vkp.comp}@coep.ac.in 2 K. J. Somaiya Institute of Applied Agricultural Research, Saidapur, Karnataka, India
{patil.vc,jha.sunilkumar}@somaiya.com
Abstract. Sentinel-2 optical time-series images, acquired at high resolution, are well suited for cropland mapping, which is key to sustainable agriculture. The presented work was conducted in a heterogeneous region around Sameerwadi with the aim of classifying sugarcane crops, in two main groups, so as to produce a sugarcane field map using Sentinel-2 normalized difference vegetation index (NDVI) time-series data. The potential of two well-known machine learning (ML) classifiers, random forest (RF) and support vector machine (SVM), was investigated to identify seven classes (sugarcane, early sugarcane, maize, waterbody, fallow land, built-up, and bare land), and a sugarcane crop map was produced. Both classifiers were able to effectively separate sugarcane areas and other land covers from the time-series data. Our results show that RF achieved a higher overall accuracy (88.61%) than SVM, whose overall accuracy was 81.86%. This study demonstrates that the Sentinel-2 NDVI time-series, combined with RF and SVM, successfully classifies sugarcane crop fields. Keywords: Sentinel-2 · NDVI · RF · SVM · Sugarcane classification
1 Introduction Agriculture plays an important role in the economy of India. To attain sustainable agriculture practice, accurate crop mapping needs to be in place. Satellite imagery provides timely, accurate, and detailed spatial information about an agro-ecological environment [1]. Crop mapping using satellite imagery would help in providing essential and accurate information about the crops, useful to manage many agricultural resources [2]. However, crop classification using remote sensing data is a challenging task due to crop heterogeneity and similar reflectance in fields. Various machine learning algorithms have been successfully investigated for cropland mapping from single-date to time-series remote sensing images. The cropland mapping techniques applied to time-series images have been demonstrated to perform superior to single-date mapping techniques [3, 4]. For example, Muller [5] successfully differentiated cropland and pasture fields from Landsat © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_15
time series, and Zheng [6] applied the SVM model to time-series Landsat Normalized Difference Vegetation Index (NDVI) data for the identification of crop type. Time-series Landsat data have been explored with an ensemble classifier and with other ML methods such as SVM, neural networks, logistic regression, and extreme gradient boosting for land cover classification [7]. Senf [8] used Landsat time-series imagery and multi-seasonal MODIS to classify crops in savannah landscapes. Jia [9] researched the adequacy of phenological features computed from the MODIS NDVI time-series fused with NDVI data obtained from Landsat 8 for cropland mapping. MODIS-Terra Enhanced Vegetation Index (EVI) time series have been effectively used to derive phenological patterns for the classification of cotton, maize, soybean, and noncommercial crops in Brazil [10]. MODIS-Terra EVI has also been used to detect phenological stages, and MODIS NDVI to extract phenological information such as the start, peak, and end of the season of the rice crop [11]. Double cropping, single cropping, forest, and pastures were mapped using the patterns of vegetation dynamics identified from MODIS EVI data by Maus [12]. Landsat, MODIS, and Chinese HJ-1 time series have been successfully explored for sugarcane crop classification. Time-series Landsat 8 [13] and time-series Chinese HJ-1 CCD images [14] were used to automatically map sugarcane over large areas by applying object-based image analysis and data mining techniques. Sugarcane cropping practices, including crop type and harvest mode, were mapped using a Landsat 8 NDVI time-series by Mulianga et al. [15]. Time series of SPOT 5 images were integrated with a crop growth model and expert knowledge by El Hajj et al. [16] to deal with missing acquisitions or uncertain radiometric values in order to detect sugarcane harvest. Many studies have investigated the potential of a single-date Sentinel-2 image to classify crops, including sugarcane, using RF, SVM, and decision tree (DT) machine learning methods.
Furthermore, applying RF and DTW algorithms to time series of Sentinel-2 produced the best results for cropland mapping [17], but this has not yet been explored for sugarcane crop classification. So, considering the affordability and high spatial-temporal resolution of Sentinel-2 data, and the potential of RF and SVM approaches, this study aimed to evaluate the effectiveness of time-series Sentinel-2 images and the potential of RF and SVM on these data to classify the sugarcane crop against other land covers. The rest of the paper is organized as follows: Sect. 2 describes the study area and the data; Sect. 3 presents the proposed methodology; Sect. 4 discusses the results, followed by a conclusion.
2 Study Area and Data 2.1 Study Area The study area is located near Sameerwadi, Karnataka, India, at 16.3898° N and 75.0371° E (Fig. 1). Sameerwadi is a village situated in Mudhol taluka, Bagalkot district of Karnataka state in India. The study area covers four talukas, i.e. Mudhol, Jamkhandi, Raibag, and Gokak, and around 8 lakh acres of land. The area is at an altitude of 541 m above sea level with annual precipitation of around 545 mm. The climate is generally dry and the temperature ranges between 16.2 and 38.7 °C. Sugarcane is the main crop cultivated in this region, apart from maize, turmeric, and banana. Figure 1 depicts the study area.
Fig. 1. FCC image of the study area
2.2 Sugarcane Crop Cycle The phenology of sugarcane may provide valuable information for remote sensing classification in the study area. The phenological dynamics of the sugarcane crop throughout its biological cycle need to be well understood in order to interpret its spectral behavior, which is vital because of its great impact on classification accuracy. Depending on the planting date, sugarcane has three growth cycles in the study region, i.e. 12 months (early season), 14 months (mid-late season), and 18 months (late season). The 12-month crop is planted in January and February, the 14-month crop is planted between November and December, whereas the 18-month crop is planted during July–August. After the first harvest, the crop is regrown 3–4 times and harvested after every 12 months. This practice is referred to as 'ratoon'. In addition to this, it is important to take the growth stages and varieties of sugarcane into account in the classification task. Sugarcane has four growth stages, i.e. germination, tillering, grand growth, and maturity, with the varieties CO 86032, CO 91010, SNK 2005, and 265 grown in the study region. Due to these properties of sugarcane, a satellite image acquired on a particular date contains variations across fields, including different growth stages of the sugarcane crop, plant cane and ratoon cane, sugarcane varieties, and other crops cultivated for crop rotation purposes. This necessitates the use of multi-temporal images to perform the classification with the best accuracy. By appropriately utilizing time-series remote sensing images, the phenology of sugarcane can be used to separate sugarcane fields from other land covers, reducing the interference of similar spectra from other vegetation in the area and helping to increase the classification accuracy.
2.3 Data Sentinel-2, launched on June 23, 2015, is an Earth Observation (EO) mission from the EU Copernicus program that captures optical imagery at a high resolution of 10–60 m for services and applications in agriculture monitoring, land cover classification, water quality, and emergency management. It has 13 bands, of which one of the three visible bands (band 4) and the near-infrared band (band 8) were used in our study. The images were downloaded from the European Space Agency's (ESA) Sentinel Scientific Data Hub, which is open access. Five satellite tiles per month, needed to cover the study area, were obtained from January 20, 2019, to May 07, 2019, as listed in Table 1. The selected temporal images were free from cloud coverage and of good quality. The images were geo-referenced to the WGS 1984 UTM zone 43 N projection system. The EU Copernicus program provides images with geometrical and radiometrical corrections. All the images were atmospherically corrected using the Semi-automatic Classification Plugin (SCP) available in QGIS 2.18, distributed under the GNU GPL license.

Table 1. Sentinel-2 images used in the study

Image no.  Satellite    Date (dd/mm/yy)
1          Sentinel-2A  20/01/19 to 22/01/19
2          Sentinel-2A  24/02/19 to 26/02/19
3          Sentinel-2A  06/03/19 to 08/03/19
4          Sentinel-2A  10/04/19 to 12/04/19
5          Sentinel-2A  05/05/19 to 07/05/19
3 Methodology The proposed methodology is depicted in Fig. 2 and contains the following steps: (i) acquisition of Sentinel-2 temporal data, (ii) atmospheric correction of all the images, (iii) NDVI computation, (iv) preparation of an input image, (v) selection of training samples and generation of Region of Interest (ROI) files, (vi) classification using RF, (vii) classification using SVM, and (viii) classification accuracy assessment. 3.1 Data Acquisition and Preprocessing As listed in Table 1, Sentinel-2 images were obtained free of cost from the Copernicus website. All images were atmospherically corrected to reduce the effects of the atmosphere and produce surface reflectance values, which improves the usability and interpretability of the images.
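To make steps (iii) and (iv) of this workflow concrete, the following is a minimal NumPy sketch (a hypothetical illustration, not the authors' implementation) that computes NDVI from red (band 4) and NIR (band 8) reflectances and layer-stacks the per-date NDVI rasters, so that every pixel becomes a time-series feature vector:

```python
import numpy as np

def ndvi(red, nir, eps=1e-10):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel."""
    red = red.astype(np.float64)
    nir = nir.astype(np.float64)
    return (nir - red) / (nir + red + eps)  # eps guards against division by zero

def stack_ndvi_series(scenes):
    """Layer-stack per-date NDVI rasters into an (H, W, T) cube so that
    every pixel becomes a T-dimensional NDVI time-series feature vector."""
    return np.stack([ndvi(red, nir) for red, nir in scenes], axis=-1)

# Toy 2x2 reflectance rasters for two dates (band 4 = red, band 8 = NIR).
scenes = [
    (np.array([[0.1, 0.2], [0.1, 0.3]]), np.array([[0.5, 0.4], [0.6, 0.3]])),
    (np.array([[0.2, 0.2], [0.1, 0.2]]), np.array([[0.6, 0.5], [0.7, 0.4]])),
]
cube = stack_ndvi_series(scenes)
print(cube.shape)  # (2, 2, 2): height x width x number of dates
```

In practice the band arrays would come from the atmospherically corrected Sentinel-2 scenes, and the cube would be reshaped to (pixels, dates) before training a classifier.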
Fig. 2. The proposed methodology
3.2 Data Collection and Preparation of Training Set The classification was performed based on the NDVI values of the crops from January 2019 to May 2019. We selected NDVI as it has proven to be the best Vegetation Index (VI) in the literature for crop mapping [6, 7, 9, 11, 15]. NDVI is computed for all preprocessed images to obtain the NDVI time-series. The study area is then extracted from these images and layer stacked to generate a multispectral input image for the classification of sugarcane crops. Every pixel of the stacked image represents a vector containing the NDVI values corresponding to the considered images. Training Dataset: The training dataset was created through a field survey performed from January 2019 to May 2019. In this field campaign, ground truth data was recorded with a Global Positioning System (GPS) device (Montana 680) for sugarcane and maize crops. Apart from this, samples for the other classes were generated from visual interpretation based on expert knowledge. In total, 14 sugarcane polygons and 6 maize polygons surveyed in the field were used for training, and 40 polygons were generated for all other classes. In the study area, during the sugarcane growing cycle, various phenological stages of sugarcane fields may coincide on the same date, ranging from freshly harvested sugarcane, through sugarcane in different growth phases, up to grown-up sugarcane ready to harvest. We therefore attempted to collect samples of all sugarcane phenological stages, with the goal that all significant subclasses would be represented. The testing polygons are distinct from the training polygons. The polygons were selected from different agricultural parcels to account for many other factors such as soil, water source, climate, and cultivation practices. 3.3 Classification Random Forest: Random forest is a nonparametric ensemble method [18] based on Classification and Regression Trees (CART).
A classification tree iteratively splits the bootstrap data into pure subsets. Many such independent classification trees are generated by setting ntree and mtry hyperparameters. The ensemble’s final decision is taken
from the majority vote of the predictions of all the trees. RF has shown excellent performance in remote sensing applications [19–22] due to its capability to handle many input variables, run on large datasets, cope with outliers, and quantify the importance of predictive variables on final model performance [23, 24]. RF has also achieved significant accuracy in sugarcane classification [1, 2]. Support Vector Machine: SVM is a statistical learning method used for solving classification as well as regression problems. It makes no assumption about the distribution of the data and finds an optimal hyperplane between the two classes to be classified. It is basically a two-class classification method but can be extended to multiclass problems [25, 26]. SVM's ability to achieve high accuracy even with few training samples has made it very useful in remote sensing applications [6, 27]. SVM has proven to be one of the best ML methods in various remote sensing applications, which mainly include crop classification [26], biotic stress detection [28], yield estimation, and Land Use and Land Cover (LULC) mapping [25, 29, 30]. The sugarcane crop has a varying crop cycle and diverse planting and harvesting dates, which makes classification complex. We classified the Sentinel-2 NDVI time-series, using the ground truth data, into seven classes with the supervised RF and SVM classifiers. The classes are sugarcane (sugarcane crops older than six months), early sugarcane (sugarcane younger than six months), maize, waterbody, fallow land, built-up, and bare land. Both models were trained using the training dataset. Both are widely used models in crop classification and were tuned with hyperparameters to achieve maximum accuracy. Open-source R software was used to implement the RF and SVM classifiers. Then, recoding of the assigned classes was performed in post-classification through ENVI software.
This resulted in one early sugarcane class and one grown-up sugarcane class, forming the sugarcane map for the four talukas' region. 3.4 Accuracy Measures In remote sensing, accuracy measures validate the correctness and quality of the generated classification maps. The evaluation is performed through the overall accuracy and kappa coefficient measures, and the accuracy of an individual class is measured through the producer's and user's accuracy. Sometimes, the F1 score is used to determine class-wise accuracy [2]. In this work, the accuracy was determined with the overall accuracy and kappa coefficient measures.
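The authors implemented RF and SVM in open-source R; the fragment below is an equivalent, minimal scikit-learn sketch on synthetic NDVI-like time-series vectors. The seven classes, five acquisition dates, and RF hyperparameter values (ntree ≈ n_estimators = 500, mtry ≈ max_features = 2) mirror the paper; the data and the SVM settings are purely illustrative assumptions. It reports the two accuracy measures used here, overall accuracy and the kappa coefficient:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real training data: 7 land-cover classes,
# each sample described by a 5-date NDVI time-series vector.
rng = np.random.default_rng(0)
n_per_class, n_dates, n_classes = 60, 5, 7
X = np.vstack([
    rng.normal(loc=c / n_classes, scale=0.05, size=(n_per_class, n_dates))
    for c in range(n_classes)
])
y = np.repeat(np.arange(n_classes), n_per_class)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# RF with mtry = 2 (max_features), as tuned in the paper; SVM with an RBF kernel.
rf = RandomForestClassifier(n_estimators=500, max_features=2,
                            random_state=0).fit(X_tr, y_tr)
svm = SVC(kernel="rbf", C=10, gamma="scale").fit(X_tr, y_tr)

for name, model in [("RF", rf), ("SVM", svm)]:
    pred = model.predict(X_te)
    oa = accuracy_score(y_te, pred) * 100      # overall accuracy (%)
    kappa = cohen_kappa_score(y_te, pred)      # kappa coefficient
    print(f"{name}: OA = {oa:.2f}%, kappa = {kappa:.4f}")
```

On real data, X would hold the per-pixel NDVI vectors extracted from the training polygons, and the test pixels would come from the separate testing polygons.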
4 Results and Discussion Five Sentinel-2 tiles covering the study area were obtained for each of the five months, then mosaicked, and the ROI was cropped from the mosaicked image. Layer stacking of the NDVI images was then performed, and the resultant image was used for sugarcane and other land cover classification. The two well-known ML classifiers, RF and SVM, discriminated sugarcane and the other classes very well. RF obtained an overall accuracy of 88.61% with a kappa coefficient of 0.8387 (Table 2).
Table 2. Accuracy assessment of RF and SVM

       Overall accuracy (%)  Kappa coefficient
RF     88.61                 0.8387
SVM    81.86                 0.7623
The classified image using RF is shown in Fig. 3. The optimum accuracy was achieved by tuning the parameter mtry to the value 2. SVM achieved an overall accuracy of 81.86% with a kappa coefficient of 0.7623 (Table 2); the classified image is given in Fig. 4. From Tables 3 and 4, it is observed that the sugarcane, early sugarcane, built-up, and bare land classes are classified more accurately by RF than by SVM. The fallow land class achieved the lowest producer's accuracy with RF, and maize is the class least accurately classified by SVM.
Fig. 3. The classified image by RF
The total area classified into each of the classes by RF and SVM is presented in Fig. 5. After classifying the time-series image into seven classes, reclassification was performed, resulting in two sugarcane classes (early sugarcane and grown-up sugarcane), and a sugarcane map was generated, as shown in Figs. 6 and 7.
Fig. 4. The classified image by SVM
Table 3. Producer's and user's accuracy for RF

Class name       Reference totals  Classified totals  Number correct  Producer's accuracy (%)  User's accuracy (%)
Waterbody        10                11                 9               90.00                    81.00
Fallowland       13                8                  5               38.46                    62.50
Builtup          87                92                 85              97.70                    92.39
Sugarcane        29                29                 26              89.66                    89.66
Maize            10                8                  7               70.00                    87.50
Bareland         22                22                 20              90.91                    90.91
Early Sugarcane  30                31                 27              90.00                    87.10
5 Conclusion In this study, we evaluated the potential of RF and SVM to discriminate the sugarcane crop from other land covers using Sentinel-2 NDVI time-series images and a limited number of training polygons. We utilized Sentinel-2 images of five months, from January to May 2019, which cover two main phenological stages of sugarcane, i.e. tillering and grand growth; full January–December temporal coverage would be required for more precise crop classification. The achieved producer's and user's accuracies reach 97.70% and 92.39%, respectively. The RF classifier achieved 88.61% accuracy whereas SVM reached 81.86%, which indicates RF's
Table 4. Producer's and user's accuracy for SVM

Class name       Reference totals  Classified totals  Number correct  Producer's accuracy (%)  User's accuracy (%)
Waterbody        6                 7                  5               83.33                    71.43
Fallowland       34                25                 23              67.65                    92.00
Builtup          39                36                 33              84.62                    91.67
Sugarcane        43                47                 38              88.37                    80.85
Maize            15                11                 9               60.00                    81.81
Bareland         23                30                 22              95.65                    73.33
Early Sugarcane  33                37                 28              84.85                    75.68
Fig. 5. Class-wise coverage of the total area in hectares (bar chart comparing the area assigned by RF and SVM to each of the seven classes: waterbody, fallow land, built-up, sugarcane, maize, bare land, early sugarcane)
superiority in sugarcane classification in the study area. Thus, from the results, we conclude that our spectral-temporal approach for classification gave reliable discrimination between sugarcane and other land covers. Future investigations will evaluate different vegetation indices, such as GNDVI and EVI, from time-series data to discriminate all four phenological stages of sugarcane crops in the study area.
Fig. 6. Sugarcane map on RF-classified image
Fig. 7. Sugarcane map on SVM-classified image
Acknowledgments. The authors would like to thank the staff of KIAAR and GBL, Sameerwadi, Karnataka, India, for their support and efforts in collecting ground truth data of crop plots used as the training set in this study.
References 1. Everingham, Y.L., Lowe, K.H., Donald, D.A., Coomans, D.H., Markley, J.: Advanced satellite imagery to classify sugarcane crop characteristics. Agron. Sustain. Dev. 27(2), 111–117 (2007) 2. Saini, R., Ghosh, S.K.: Crop classification on single-date Sentinel-2 imagery using random forest and support vector machine. Int. Arch. Photogramm. Remote Sens. Spat. Inform. Sci. (2018) 3. Gomez, C., White, J.C., Wulder, M.A.: Optical remotely sensed time series data for land cover classification: a review. ISPRS J. Photogramm. Remote Sens. 116, 55–72 (2016) 4. Long, J.A., Lawrence, R.L., Greenwood, M.C., Marshall, L., Miller, P.R.: Object-oriented crop classification using multitemporal ETM + SLC-off imagery and random forest. GISci. Remote Sens. 50(4), 418–436 (2013) 5. Muller, H., Rufin, P., Griffiths, P., Siqueira, A.J.B., Hostert, P.: Mining dense landsat time series for separating cropland and pasture in a heterogeneous Brazilian savanna landscape. Remote Sens. Environ. 156, 490–499 (2015) 6. Zheng, B., Myint, S.W., Thenkabail, P.S., Aggarwal, R.M.: A support vector machine to identify irrigated crop types using time-series Landsat NDVI data. Int. J. Appl. Earth Obs. Geoinf. 34, 103–112 (2015) 7. Man, C.D., Nguyen, T.T., Bui, H.Q., Lasko, K., Nguyen, T.N.T.: Improvement of land-cover classification over frequently cloud-covered areas using landsat 8 time-series composites and an ensemble of supervised classifiers. Int. J. Remote Sens. 39(4), 1243–1255 (2018) 8. Senf, C., Leitao, P.J., Pflugmacher, D., van der Linden, S., Hostert, P.: Mapping land cover in complex Mediterranean landscapes using landsat: improved classification accuracies from integrating multi-seasonal and synthetic imagery. Remote Sens. Environ. 156, 527–536 (2015) 9. Jia, K., Liang, S., Zhang, N., Wei, X., Gu, X., Zhao, X., et al.: Land cover classification of finer resolution remote sensing data integrating temporal features from time series coarser resolution data. ISPRS J. Photogramm.
Remote Sens. 93, 49–55 (2014) 10. Boschetti, M., Stroppiana, D., Brivio, P.A., Bocchi, S.: Multi-year monitoring of rice crop phenology through time series analysis of MODIS images. Int. J. Remote Sens. 30(18), 4643–4662 (2009) 11. Arvor, D., Jonathan, M., Meirelles, M.S.P., Dubreuil, V., Durieux, L.: Classification of MODIS EVI time series for crop mapping in the state of Mato Grosso, Brazil. Int. J. Remote Sens. 32(22), 7847–7871 (2011) 12. Maus, V., Câmara, G., Cartaxo, R., Sanchez, A., Ramos, F.M., de Queiroz, G.R.: A time-weighted dynamic time warping method for land-use and land-cover mapping. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 9(8), 3729–3739 (2016) 13. Vieira, M.A., Formaggio, A.R., Renno, C.D., Atzberger, C., Aguiar, D.A., Mello, M.P.: Object-based image analysis and data mining applied to a remotely sensed landsat time-series to map sugarcane over large areas. Remote Sens. Environ. 123, 553–562 (2012) 14. Zhou, Z., Huang, J., Wang, J., Zhang, K., Kuang, Z., Zhong, S., Song, X.: Object-oriented classification of sugarcane using time-series middle-resolution remote sensing data based on AdaBoost. PLoS ONE 10(11), e0142069 (2015) 15. Mulianga, B., Begue, A., Clouvel, P., Todoroff, P.: Mapping cropping practices of a sugarcane-based cropping system in Kenya using remote sensing. Remote Sens. 7(11), 14428–14444 (2015) 16. El Hajj, M., Begue, A., Guillaume, S., Martine, J.-F.: Integrating SPOT-5 time series, crop growth modeling and expert knowledge for monitoring agricultural practices—The case of sugarcane harvest on Reunion Island. Remote Sens. Environ. 113(10), 2052–2061 (2009)
17. Belgiu, M., Csillik, O.: Sentinel-2 cropland mapping using pixel-based and object-based time-weighted dynamic time warping analysis. Remote Sens. Environ. 204, 509–523 (2018) 18. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001) 19. Mohite, J., Karale, Y., Pappula, S., TP, A. S., Sawant, S. D., & Hingmire, S.: Detection of pesticide (Cyantraniliprole) residue on grapes using hyperspectral sensing. In: Sensing for Agriculture and Food Quality and Safety IX, vol. 10217, p. 102170P (2017) 20. Poona, N., Van Niekerk, A., Ismail, R.: Investigating the utility of oblique tree-based ensembles for the classification of hyperspectral data. Sensors 16(11), 1918 (2016) 21. Yin, H., Pflugmacher, D., Li, A., Li, Z., Hostert, P.: Land use and land cover change in Inner Mongolia-understanding the effects of China’s re-vegetation programs. Remote Sens. Environ. 204, 918–930 (2018) 22. Loggenberg, K., Strever, A., Greyling, B., Poona, N.: Modelling water stress in a Shiraz Vineyard using hyperspectral imaging and machine learning. Remote Sens. 10(2), 202 (2018) 23. Rodriguez-Galiano, V.F., Ghimire, B., Rogan, J., Chica-Olmo, M., Rigol-Sanchez, J.P.: An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 67, 93–104 (2012) 24. Truong, Y., Lin, X., Beecher, C.: Learning a complex metabolomic dataset using random forests and support vector machines. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 835–840 (2004) 25. Mountrakis, G., Im, J., Ogole, C.: Support vector machines in remote sensing: a review. ISPRS J. Photogramm. Remote Sens. 66(3), 247–259 (2011) 26. Khobragade, A., Athawale, P., Raguwanshi, M.: Optimization of statistical learning algorithm for crop discrimination using remote sensing data. In: 2015 IEEE International Advance Computing Conference (IACC), pp. 570–574 (2015) 27. 
Foody, G.M., Mathur, A.: A relative evaluation of multiclass image classification by support vector machines. IEEE Trans. Geosci. Remote Sens. 42(6), 1335–1343 (2004) 28. Behmann, J., Mahlein, A.-K., Rumpf, T., Römer, C., Plümer, L.: A review of advanced machine learning methods for the detection of biotic stress in precision crop protection. Precision Agric. 16(3), 239–260 (2015) 29. Hawrylo, P., Bednarz, B., Wkezyk, P., Szostak, M.: Estimating defoliation of Scots pine stands using machine learning methods and vegetation indices of Sentinel-2. Eur. J. Remote Sens. 51(1), 194–204 (2018) 30. Warner, T.A., Nerry, F.: Does single broadband or multispectral thermal data add information for classification of visible, near-and shortwave infrared imagery of urban areas? Int. J. Remote Sens. 30(9), 2155–2171 (2009)
Classification of Nucleotides Using Memetic Algorithms and Computational Methods Rajesh Eswarawaka1(B) , S. Venkata Suryanarayana2 , Purnachand Kollapudi3 , and Mrutyunjaya S. Yalawar4 1
SREYAS Institute of Engineering and Technology, Hyderabad, India [email protected] 2 CVR College of Engineering, Hyderabad, India [email protected] 3 BV Raju Institute of Technology, Hyderabad, India [email protected] 4 CMR Engineering College, Hyderabad, India [email protected]
Abstract. This paper presents an approach to solving an optimization problem using clustering with a genetic algorithm. The central idea is to form clusters of patients' nucleotide data sets. The genetic algorithm is applied to this initial cluster population. The fitness function for the genetic algorithm is calculated using intra-cluster and inter-cluster distances, after which genetic crossover functions are applied. This procedure is iterated until the stopping condition is reached. The superiority of this algorithm is demonstrated by comparing its performance with Ant Colony Optimization and simulated annealing algorithms. Keywords: Clustering · Genetic algorithm · Genetic annealing · Hybrid genetic algorithm · Data mining

1 Introduction
Data mining is the procedure of deriving information from a larger set of raw data. Among the approaches used to mine data are classification and clustering [1]. Classification is a data mining procedure wherein a class label is allocated to unlabeled data vectors. In supervised classification, the function
Rajesh Eswarawaka is a Professor in the Department of Computer Science and Engineering, S. Venkata Suryanarayana is an Associate Professor in the Department of Information Technology, Purnachand Kollapudi is an Associate Professor in the Department of Computer Science and Engineering, and Mrutyunjaya S. Yalawar is an Assistant Professor in the Department of Computer Science and Engineering.
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_16
is retrieved from labeled data. In unsupervised classification, the function is retrieved from unlabeled data. This is clustering, the focus of this paper [2]. A particularly appropriate classification approach is the genetic algorithm (GA). Genetic algorithms facilitate a guided search over various existing models. They follow the natural evolution process, wherein the best characteristics of one generation are passed on to the subsequent generation by the simple process of reproduction. These characteristics are encoded in "chromosomes", which carry the parameters for the construction of the model. Genetic algorithms can be used in various areas of research; in this paper we use them in the area of medical research [3–5]. Use cases of SVM span various applications such as land cover study [6], rare species study [7], medical study [8], error diagnosis [9], character style study [10], speech analysis [11], radar signal study [12], and habitat study. The research undertaken herein combines a genetic algorithm with clustering to classify patients' nucleotide data sets. It is then shown, by means of various evaluation parameters, that the proposed hybrid algorithm performs better than various existing classification algorithms. The next section presents the background knowledge; it is followed by the proposed approach and the experimental results in Sect. 4. The final section consists of the conclusion and acknowledgement, followed by the references.
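To make the fitness criterion from the abstract concrete, the fragment below sketches one plausible formulation (a hypothetical illustration, not the authors' exact function): a chromosome encoding a cluster assignment is scored by the ratio of inter-cluster separation (minimum centroid-to-centroid distance) to mean intra-cluster distance, so compact, well-separated clusterings score higher:

```python
import numpy as np

def cluster_fitness(points, labels):
    """Hypothetical GA fitness for a clustering chromosome: ratio of the
    minimum inter-centroid distance to the mean intra-cluster distance."""
    centroids = {k: points[labels == k].mean(axis=0) for k in np.unique(labels)}
    # Mean distance of each point to its own cluster centroid (compactness).
    intra = np.mean([np.linalg.norm(p - centroids[k])
                     for p, k in zip(points, labels)])
    # Smallest distance between any two cluster centroids (separation).
    keys = list(centroids)
    inter = min(np.linalg.norm(centroids[a] - centroids[b])
                for i, a in enumerate(keys) for b in keys[i + 1:])
    return inter / (intra + 1e-12)

# Two obvious groups; a "good" chromosome separates them, a "bad" one mixes them.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
assert cluster_fitness(pts, good) > cluster_fitness(pts, bad)
```

In a GA, each individual would encode such a label vector, and this score would drive parent selection before crossover and mutation are applied.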
2 Background Knowledge

2.1 Overview of Genetic Algorithms
Natural evolution works on the principle of selecting the best individuals from the existing group based on various fitness criteria. Once the best individuals are selected, mutations and crossovers are performed to generate the subsequent generation. The strength of this natural evolution approach lies in the criteria used to select the best individual, as this determines the overall efficacy of the approach. Genetic algorithms use similar fitness-function criteria when searching for solutions to optimization problems [13]. In this approach, the stronger an individual is, the better its chance of being selected as a parent. A Genetic Algorithm (GA) has the following steps: 1. Initialization: The first step in implementing a genetic algorithm is to initialize an inceptive population. The initial population can be created by random generation, or novel approaches can be used to begin with a high-quality initial population. The method employed to generate the initial population is a vital factor in the final efficacy of the genetic algorithm. 2. Selection: The fitness function determines the selection of two parents from the available pool, where the higher the value of the fitness function, the higher the chance of being selected as a parent.
Classification of Nucleotides Using Memetic Algorithms
3. Reproduction: Crossover is performed on the two selected parents to obtain one or two children, and the children obtained are added back into the available pool of chromosomes. Later, the individuals that do not satisfy the fitness criteria are eliminated. Various operators can be applied, such as mutation (the state of an arbitrarily chosen element is reversed), recombination (chromosomes are partly combined), sorting based on fitness scores, and a convergence check wherein ways to end the reproduction are evaluated [16].
4. Genetic operator: This operator is applied after a crossover [15].
5. Replacement: The newly produced population is used for subsequent iterations. Finally, some parameter is used to ensure the GA does not loop infinitely.
Diverse problems in various applications have been solved using genetic algorithms [17–19].
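The five steps above can be sketched as a minimal, self-contained genetic algorithm. The bit-string encoding, tournament selection, one-point crossover, and the toy “OneMax” objective used below are illustrative assumptions, not the paper’s actual settings:

```python
import random

def genetic_algorithm(fitness, n_bits=16, pop_size=20, generations=50, p_mut=0.02):
    # 1. Initialization: a randomly generated inceptive population of bit-strings
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        next_pop = ranked[:2]  # keep the two best individuals (elitism)
        while len(next_pop) < pop_size:
            # 2. Selection: fitness-biased (tournament) choice of two parents
            p1 = max(random.sample(pop, 3), key=fitness)
            p2 = max(random.sample(pop, 3), key=fitness)
            # 3. Reproduction: one-point crossover produces a child
            cut = random.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]
            # 4. Genetic operator: bit-flip mutation applied after crossover
            child = [b ^ 1 if random.random() < p_mut else b for b in child]
            next_pop.append(child)
        # 5. Replacement: subsequent iterations use the newly produced population
        pop = next_pop
    return max(pop, key=fitness)

# Example run on the "OneMax" toy objective: maximize the number of 1-bits
best = genetic_algorithm(fitness=sum)
```

With these easy settings the returned string is usually all ones; in the clustering application the fitness function of Eq. (1) would take the place of `sum`.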
2.2 Overview of Simulated Annealing
Traditionally, annealing is used to eliminate defects from a metal crystal. In this process, the average potential energy per atom is lowered, so the whole approach can be viewed as a macroscopic energy-minimization scheme [20]. This process of annealing can be applied to a wide range of optimization problems in data mining. Herein, T is a variable or function. An approach is used to simulate the system at the parameter T, while the value of T is gradually reduced, or cooled down, from some very high value. This is called simulated annealing, and it is of high importance in high-dimensional cases that are intricate to handle by other approaches [21]. It is a local search algorithm capable of solving very complex optimization problems. It is easy to implement, and its convergence features have made it a very sought-after approach in recent times. It is used to solve both discrete and continuous problems [22, 23].
Algorithm 1 The approach is outlined in pseudo-code [24, 25]:
Require:
  Initialize solution p ∈ X
  Initialize temperature T
  Initialize k = 0
  do
    Generate a solution p'
    Compute Δf = f(p') − f(p)
    p = p + 1
  until p = Xk
  k = k + 1
while stopping criterion is not met
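A runnable version of the annealing loop outlined in Algorithm 1 might look as follows; the one-dimensional objective, geometric cooling schedule, and Metropolis acceptance rule are standard textbook choices assumed for illustration:

```python
import math
import random

def simulated_annealing(f, x0, T0=10.0, cooling=0.95, steps_per_T=100, T_min=1e-3):
    x, fx = x0, f(x0)
    best, fbest = x, fx
    T = T0
    while T > T_min:                               # outer cooling loop
        for _ in range(steps_per_T):               # simulate the system at T
            x_new = x + random.uniform(-1.0, 1.0)  # generate a neighbour p'
            delta = f(x_new) - fx                  # Δf = f(p') − f(p)
            # Metropolis rule: always accept improvements, and sometimes
            # accept worse moves so the search can escape local minima
            if delta < 0 or random.random() < math.exp(-delta / T):
                x, fx = x_new, fx + delta
            if fx < fbest:
                best, fbest = x, fx
        T *= cooling                               # gradually "cool" T down
    return best, fbest

# Minimize f(x) = (x − 3)^2, whose optimum is at x = 3
x_opt, f_opt = simulated_annealing(lambda v: (v - 3) ** 2, x0=-10.0)
```

At high T almost every move is accepted (a free exploration); as T shrinks, only improving moves survive, which is exactly the cooling behaviour described above.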
R. Eswarawaka et al.

2.3 Overview of Ant Colony Optimization
Ant Colony Optimization (ACO) is a sought-after technique for solving optimization problems. The ACO algorithm was initially proposed in 1991 [26]. It is used to find the optimal path in a graph and is based on the path-seeking behaviour of ants in pursuit of food. Ants follow a random-walk style of motion in search of food from their nest. If food is located on the path, they collect some of it and return to their nest, releasing pheromones on the way back. The greater the quantity and quality of the food, the stronger the pheromone scent. This very scent helps ants identify the shortest path to the destination. This strategy of using the smell of pheromones is also called stigmergy [27]. The ACO algorithm is a type of metaheuristic. It starts with a basic heuristic algorithm, which either begins from a null solution and keeps adding elements on its way to building the optimum solution, or starts with a basic solution and keeps refining it in iterative steps until an acceptable optimum solution is found [28]. A set of concurrent and asynchronous agents works toward finding an optimization solution in the computational domain, similar to a colony of ants. It has two components: trail and attractiveness. Based on subsequent trails, the optimum solution is reached. Trail evaporation is the process of reducing trails so that the number of trails does not reach an uncontrollable level. A daemon action is an optional strategy wherein the invocation of an optimization procedure facilitates reaching the optimization solution faster. The quasi-code for the ACO algorithm is [29]:
Algorithm 2 Ant Colony Optimization (ACO)
Require:
  while (optimization solution not found) do
    perform refinement of basic heuristic solution
    AntFormulation()
    TrailUpdate()
    DaemonActions()  (optional)
  end refinement
end while
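As a concrete illustration of the trail and attractiveness components described above, the following sketch applies ACO to a shortest-path search on a toy graph. The parameter values and the update rules (attractiveness = pheromone divided by edge length, deposit inversely proportional to path length, evaporation rate rho) are common textbook choices, not taken from the paper:

```python
import random

def aco_shortest_path(graph, src, dst, n_ants=20, n_iter=50, rho=0.5, q=1.0):
    """graph: dict mapping node -> {neighbour: edge length}."""
    tau = {(u, v): 1.0 for u in graph for v in graph[u]}  # pheromone trails
    best_path, best_len = None, float("inf")
    for _ in range(n_iter):
        found = []
        for _ in range(n_ants):          # concurrent, asynchronous agents
            node, path, seen = src, [src], {src}
            while node != dst:
                nxt = [v for v in graph[node] if v not in seen]
                if not nxt:              # dead end: this ant gives up
                    path = None
                    break
                # attractiveness: pheromone strength divided by edge length
                weights = [tau[(node, v)] / graph[node][v] for v in nxt]
                node = random.choices(nxt, weights=weights)[0]
                path.append(node)
                seen.add(node)
            if path is not None:
                length = sum(graph[u][v] for u, v in zip(path, path[1:]))
                found.append((path, length))
                if length < best_len:
                    best_path, best_len = path, length
        # trail evaporation keeps trails from growing uncontrollably
        for edge in tau:
            tau[edge] *= 1 - rho
        # shorter tours deposit more pheromone on their edges
        for path, length in found:
            for edge in zip(path, path[1:]):
                tau[edge] += q / length
    return best_path, best_len

toy = {0: {1: 1, 2: 4}, 1: {2: 1, 3: 5}, 2: {3: 1}, 3: {}}
path, length = aco_shortest_path(toy, src=0, dst=3)
```

On this toy graph the pheromone reinforcement quickly concentrates on the shortest route 0 → 1 → 2 → 3 of length 3.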
The Genetic Clustering Algorithm (GCA) is the Genetic Algorithm (GA) applied to the clustering problem in k partitions, when the value of k is known beforehand [30]. Tackling clustering tasks with a GA requires adaptations in areas such as the representation of the solution, the fitness function, the operators, and the values of the parameters. The pseudocode is as follows:
Algorithm 3 AntFormation()
Require: Algorithm 1
  B = NULL
  Determine P(B)
  while P(B) ≠ ∅ do
    C = ChooseFrom(P(B))
    s = refine solution by adding to C
    Initial state x(0) is a member of S
    Determine P(B)
  end while
Algorithm 4 Clustering by Genetic Algorithm
Require:
  Step 1: Initial population definition
  Step 2: Random generation
  do
    Step 3: Selection of parents based on fitness function
    Step 4: Crossover for reproduction
    Step 5: Mutations
  while stopping criterion (not much difference between the fitness values of the various parents) is not reached
2.4 Proposed Methodology
The Genetic Clustering Algorithm (GCA) is the Genetic Algorithm (GA) applied to the clustering problem in k partitions, when the value of k is known beforehand [30]. Tackling clustering tasks with a GA requires adaptations in areas such as the representation of the solution, the fitness function, the operators, and the values of the parameters. In this paper, we start with a population of solutions with random nucleotides. The fitness function used herein is calculated using the following equation:

FitnessFunction := Mean intra-cluster distance / Mean inter-cluster distance.  (1)
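Eq. (1) can be computed directly from a labelled point set; the Euclidean distance and the toy two-cluster data below are assumptions for illustration:

```python
import math
from itertools import combinations

def clustering_fitness(points, labels):
    """Eq. (1): mean intra-cluster distance divided by mean inter-cluster
    distance; lower values mean tighter, better-separated clusters."""
    intra, inter = [], []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        (intra if labels[i] == labels[j] else inter).append(math.dist(p, q))
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

# Two tight, well-separated clusters give a small fitness value
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = clustering_fitness(points, ["A", "A", "B", "B"])
bad = clustering_fitness(points, ["A", "B", "A", "B"])  # scrambled labels
```

Here `good` evaluates to about 0.14 while the scrambled assignment scores about 1.74, so the GA's selection step (which favours low fitness) prefers the first clustering.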
Later, based on the fitness function, the natural selection of the two best parents is done, and genetic crossovers are performed between them to create a new population. This process is repeated recursively until convergence of the best fitness value is detected [31, 32]. The pseudocode is as follows:
Algorithm 5 Clustering by Genetic Algorithm
Require:
  Step 1: Chromosome representation
  Step 2: Genesis
  do
    Step 3: Natural selection based on the fitness function
    Step 4: Genetic crossover
    Step 5: Random mutations
  while stopping criterion (not much difference between the fitness values of the various parents) is not reached
3 Performance Analysis

3.1 Environment Setting
A data set consisting of 100 patient records is taken. Each record is represented as a nucleotide comprising a value and an index. The index of each nucleotide stands for the patient’s number, and the value of each nucleotide stands for the patient’s cluster. Sample records for the population with random nucleotides are given below [33] (Tables 1 and 2):
Table 1. 14 patients' data sets of nucleotides

Key  Na   K    ALT  AST  WBC   RBC  Hgb   Hct
1    143  4    83   45   10.7  6    13.6  45.7
2    150  6.5  36   32   8.7   5.2  20    53.6
3    157  4.3  53   43   9.2   4.4  14    42.8
4    141  3.9  83   43   9.9   4.3  17.1  41.9
5    142  3.7  87   26   12.8  6.5  15.4  40.9
6    147  3.8  35   21   6.9   4.7  15.5  44.8
7    138  4.2  23   27   6.1   4.6  16.3  39.9
8    141  3.9  33   39   9.7   4.6  19    48
9    139  4.4  23   53   4.6   4.2  15.8  42.3
10   142  4.7  91   34   10    6.1  17.4  46.2
11   143  4.3  36   63   9.4   5.3  13.4  44.2
12   138  4    30   27   6.8   4.5  13.7  39.7
13   143  4    72   64   7.3   5.2  19.5  45
14   139  6.2  22   15   6.2   4.2  17.4  59.6
4 Stopping Criterion
As explained, the fitness function and the chances to breed under a biased selection method are calculated for each nucleotide. The values for sample records are given below. The top two parents with the maximum chances to breed are selected, and genetic crossover is performed with both parents. The genetic crossover for the above set of sample records is shown below. As per the proposed methodology, random mutations of the nucleotides are performed as follows (Figs. 1, 2, 3, 4 and 5).

Table 2. Comparison table of the three algorithms

Simulated annealing  Harmony algorithm  Ant colony optimization  Iterations
4.85/3.55            3.60/3.55          3.94/3.61                50
4.52/3.58            3.62/3.55          3.93/3.55                100
4.08/3.55            3.55/3.55          3.82/3.55                500
4.08/3.55            11.06/9.39         5.74/5.65                100
3.75/3.55            8.32/5.83          5.70/5.65                500
12.5/10.61           11.06/9.39         5.68/5.65                1000
11.0/8.68            8.32/5.83          5.66/5.65                5000
11.0/8.84            7.75/5.78          5.68/5.65                10000
9.84/8.10            7.16/5.78          5.66/5.65                100
9.87/8.31            6.70/5.72          5.66/5.65                500
Fig. 1. Three patients' nucleotides represented in the X, Y, and Z directions
Fig. 2. The representation of a cluster as a string
Fig. 3. Three parents A, B, C from the initial population of nucleotides
Each solution has a chance to breed equal to (1/Fitness) / Σ(1/Fitness).

Parent A (positions 01–14): C A B C A B C A B C B A C B
Fitness = 0.684 → 1/Fitness = 1.42 → chance ≈ 14%

Parent B (positions 01–14): A C A C B C B B C A B C B A
Fitness = 0.215 → 1/Fitness = 4.651 → chance ≈ 45%

Child (positions 01–14): C A B C A B C B C A B C B A
Fitness = ?

Fig. 4. Figure shows genetic crossover in parent A, parent B
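The biased selection rule shown in Fig. 4, chance to breed = (1/fitness) / Σ(1/fitness), can be sketched as follows; the last two fitness values in the pool are invented so that parents A and B come out near the figure's 14% and 45%:

```python
import random

def breeding_chances(fitnesses):
    """Chance to breed = (1/fitness) / sum of (1/fitness) over the pool.
    Since a lower Eq. (1) fitness means a better clustering, inverting it
    biases selection toward the better solutions."""
    inv = [1.0 / f for f in fitnesses]
    total = sum(inv)
    return [v / total for v in inv]

# First two values are parents A and B from Fig. 4; the remaining two are
# hypothetical pool members chosen to make the percentages match the figure
fitnesses = [0.684, 0.215, 0.300, 1.000]
chances = breeding_chances(fitnesses)

# roulette-wheel draw of two parents, weighted by their breeding chances
parents = random.choices(range(len(fitnesses)), weights=chances, k=2)
```

With this pool, `chances[0]` is about 0.14 and `chances[1]` about 0.45, matching parent A's 14% and parent B's 45% in the figure.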
This process is iterated until the stopping criteria are reached, as shown in the graph below. From the above table, inferences can be drawn through a comparative study of the results obtained by the different techniques.
Fig. 5. Detecting convergence of the best fitness value
5 Conclusion
In this paper, a hybrid approach based on the genetic algorithm is used to solve the optimization problem of clustering patients’ nucleotide data. A comparative study of the results obtained using the hybrid approach against those obtained by simulated annealing and ACO is performed. The research concludes that the hybrid optimization approach is more efficient than the simulated annealing and ant colony optimization methods, since it provides better values for 1000 iterations.
Acknowledgments. The authors would like to extend their gratitude to Sri T. V. Bala Krishna Murthy for valuable suggestions pertaining to the literature. Murthy is an accomplished educationalist, author, and a good administrator.
References
1. Adeniyi, D.A., Wei, Z., Yongquan, Y.: Automated web usage data mining and recommendation system using K-nearest neighbor (KNN) classification method. Appl. Comput. Inf. (2014)
2. Gacquer, D., Delcroix, V., Delmotte, F., Piechowiak, S.: Comparative study of supervised classification algorithms for the detection of atmospheric pollution. Eng. Appl. Artif. Intell. (2011)
3. Sarafraz, H., Sarafraz, Z., Hodaei, M., Sayeh, M.: Minimizing vehicle noise passing the street bumps using genetic algorithm. Appl. Acoust. (2015)
4. Contreras-Bolton, C., Gatica, G., Barra, C.R., Parada, V.: A multi-operator genetic algorithm for the generalized minimum spanning tree problem. Expert Syst. Appl. (2016)
5. Park, Y.-B., Yoo, J.-S., Park, H.-S.: A genetic algorithm for the vendor-managed inventory routing problem with lost sales. Expert Syst. Appl. (2016)
6. Gislason, P.O., Benediktsson, J.A., Sveinsson, J.R.: Random forests for land cover classification. Pattern Recognit. Lett. (2005)
7. Ozcift, A.: Random forests ensemble classifier trained with data resampling strategy to improve cardiac arrhythmia diagnosis. Comput. Biol. Med. (2011)
8. Genuer, R., Poggi, J.-M., Tuleau-Malot, C.: Variable selection using random forests. Pattern Recognit. Lett. (2010)
9. Zitouni, I., Kuo, H.-K.J., Lee, C.-H.: Boosting and combination of classifiers for natural language call routing systems. Speech Commun. (2003)
10. Cho, H.-J., Tseng, M.-T.: A support vector machine approach to CMOS-based radar signal processing for vehicle classification and speed estimation. Math. Comput. Modell. (2013)
11. Babu, P.H., Gopi, E.S.: Medical data classifications using genetic algorithm based generalized kernel linear discriminant analysis. Procedia Comput. Sci. (2015)
12. Amirov, A., Gerget, O., Devjatyh, D., Gazaliev, A.: Medical data processing system based on neural network and genetic algorithm. Procedia Soc. Behav. Sci. (2014)
13. Latifi, Z., Karimi, A.: A TMR genetic voting algorithm for fault-tolerant medical robot. Procedia Comput. Sci. (2015)
14. GloRib51.: Evaluation and comparison of genetic algorithm and bees algorithm for location–allocation of earthquake relief centers. Int. J. Disaster Risk Reduct. (1993)
15. GloRib51.: Evaluation and comparison of genetic algorithm and bees algorithm for location–allocation of earthquake relief centers. Int. J. Disaster Risk Reduct. (1993)
A Novel Approach to Detect Emergency Using Machine Learning

Sarmistha Nanda1(B), Chhabi Rani Panigrahi1, Bibudhendu Pati1, and Abhishek Mishra2

1 Department of Computer Science, Rama Devi Women’s University, Bhubaneswar, India
[email protected], [email protected], [email protected]
2 Department of Computer Science & Engineering, Indian Institute of Technology, Bhubaneswar, India
[email protected]
Abstract. Human activity is always a reflection of its external environmental conditions. If a group of people is in some emergency, then their activities and behaviour will be different from those under normal conditions. To detect an emergency, Human Activity Recognition (HAR) can play an important role. Human activities such as shouting, running here and there, crying, and searching for an exit door can be taken into consideration as emergency indicators. By detecting the emergency and its degree, the Emergency Management System (EMS) can manage the situation efficiently. In this work, we use machine learning algorithms such as Random Forest (RF), IBK, Bagging, J48 and MLP on the WISDM Smartphone and Smartwatch Activity and Biometrics Dataset for human activity recognition, and RF is found to be the best algorithm, with a classification accuracy of 87.1977%, among all the considered techniques. Keywords: Emergency management system · Human activity recognition · Machine learning · Random forest
1 Introduction
Circumstances interrupting normal procedures, combined with a dangerous, serious and unpleasant environment that demands immediate assistance in terms of relief, medical facilities, relocation, etc., are called an emergency [1]. Landslides, cyclones, earthquakes, droughts, bomb blasts, thefts, accidents, the release of poisonous chemical agents, etc., are some examples of emergencies. Emergencies can be categorised into two types: natural and man-made [2]. In the recent technically enriched era, most natural emergencies occur with prior
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_17
acknowledgement, but there is little or no possibility of prevention. Only some preventive measures can be taken, such as displacement of people and the availability of alerted workers from different disaster management organisations. Man-made emergencies are not forecastable most of the time. These include a fire in a shopping mall, theft, train derailment, poisonous gas release from a plant, etc. There are many disaster management organisations, such as the National Disaster Management Authority (NDMA) [3] and the Global Network of civil society organisations for Disaster Reduction (GNDR) [4], which provide help to affected people. During an emergency, there is a possibility of risk to life, property and the environment as well. Hence, early detection and efficient management are required to prevent loss. If the emergency can be detected within a minimum time span, then a lot of loss may be avoided by providing the required assistance. Detecting an emergency within a very short period after its occurrence is a big challenge, and it needs a lot of data analysis. On-body sensors and environmental sensors can be deployed in a particular area to identify the emergency. A number of emergency-specific parameters need to be identified, with a certain threshold value for each, so that when a parameter value crosses its threshold, an alarm and other important information may be sent to the concerned authority. Machine Learning (ML) plays a very important role in this type of data analysis on the collected sensor data stream. Many ML classification algorithms [5, 6], such as Random Forest, Multilayer Perceptron, Bagging and J48, have been developed and can be applied to solve this emergency identification problem. In this work, we have used the Random Forest, IBK, Bagging, J48 and MLP algorithms on human activity datasets to classify human activity, which can be helpful for detecting emergency situations in a particular region.

1.1 Motivation
An emergency is always associated with risk to life, property, the environment, etc. Preventive measures are always preferable, but otherwise the next step is early detection and efficient management of the emergency situation. If the emergency can be detected within a short time span, then a lot of loss can be avoided.

1.2 Contributions
The contributions of this paper are as follows:
– To study the WISDM Smartphone and Smartwatch Activity and Biometrics dataset [7].
– To apply different classification algorithms on the considered dataset to classify human activities.
The rest of the paper is organised as follows: Sect. 2 presents the related work. Section 3 describes the considered dataset along with the experimental set-up. Section 4 presents the results and discussion, and Sect. 5 concludes the paper.
2 Related Work
HAR plays an important role in many areas and is a highly active area of research in the recent scenario. To solve this problem, continuous environmental monitoring and activity modelling are necessary, and for that, data analysis is required. Activity recognition is done mainly in two ways: vision-based activity recognition and sensor-based activity recognition [8].
Vision-based activity recognition: In this process, actions are labelled from image sequences. These images can be retrieved from recorded videos or any other mode of image collection through human–computer interaction [9]. After data collection, the images can be classified using various image classification algorithms. Classifying these data is very challenging because the result depends on interpersonal differences, the property settings of the recordings, etc.
Sensor-based activity recognition: In this process, a number of sensors, such as gyroscopes, accelerometers, sound sensors, Bluetooth, etc., are used for collecting data, depending upon the requirements [10]. With respect to modality, the sensors can be divided into four types: (1) body-worn sensors: as the name suggests, these are found in wearable items, including smartphones, watches, bands, glasses, helmets, etc.; (2) object sensors: generally, these sensors are placed on objects to detect their movement; for example, a sensor can be attached to a spoon to detect an eating activity; (3) ambient sensors: these are environmental sensors, usually deployed in a smart environment, and various types are available with the capability of sensing temperature, pressure, etc.; (4) hybrid sensors: any combination of the above three types, with extended sensing features.
3 Human Activity Recognition
HAR is a wide area of research nowadays and is used for many applications such as intrusion detection, health care, young and elderly care, etc. [11–13]. In this work, we have considered HAR as a parameter to detect an emergency. The description of the dataset, along with the experimental set-up, is given in this section.

3.1 Dataset Description
There are many HAR datasets available in the literature, such as WISDM, OPPORTUNITY, the Ambient sensor dataset, etc. [14–16]. In this work, we consider the WISDM Smartphone and Smartwatch Activity and Biometrics Dataset, which was created by the Wireless Sensor Data Mining (WISDM) lab members of Fordham University. In this dataset, 18 daily living activities, such as walking, jogging, eating, brushing teeth, etc., are taken into consideration. These data were collected by smartphones and smartwatches using accelerometer and gyroscope
sensors. During data collection, 51 subjects were involved, contributing 54 minutes each, because every activity was performed for 3 minutes by each individual. Google Nexus 5/5X or Samsung Galaxy S5 smartphones and an LG G Watch with gyroscope and accelerometer sensors were used to collect the data. The raw data as well as ARFF-transformed data are available in this dataset repository. A brief description of the dataset is given in Table 1.

Table 1. Dataset summary

Attributes                    Values
Number of subjects            51
Number of activities          18
Each subject's contribution   3 minutes for each activity
Smart watch used              LG G Watch
Smart phone used              Google Nexus 5/5X or Samsung Galaxy S5
Sensor polling rate           20 Hz

3.2 Experimental Setup
The experiment was performed using the Waikato Environment for Knowledge Analysis (WEKA) tool [17, 18], version 3.9.3. Each of the considered algorithms was configured with ten-fold cross-validation as the test mode, which means the total data is partitioned into ten equal parts; one part is used as test data while the remaining nine parts serve as the training set, and the classification accuracy is calculated. This process is repeated ten times, using each part as the test set once. Finally, all the results are averaged to calculate the classification accuracy [19].
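The ten-fold procedure described above can be sketched in plain Python; the 1-nearest-neighbour classifier and the synthetic two-class data below are stand-ins for WEKA's learners and the WISDM data, assumed purely for illustration:

```python
import random

def ten_fold_accuracy(points, labels, k=10, seed=0):
    """Partition the data into k equal parts; each part serves once as the
    test set while the other k-1 parts form the training set, and the k
    fold accuracies are averaged -- the procedure described above."""
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for t in range(k):
        train = [j for f, fold in enumerate(folds) if f != t for j in fold]
        correct = 0
        for j in folds[t]:
            # 1-nearest-neighbour prediction on 1-D feature values
            nearest = min(train, key=lambda i: abs(points[i] - points[j]))
            correct += labels[nearest] == labels[j]
        accs.append(correct / len(folds[t]))
    return sum(accs) / k

# synthetic, well-separated data for two activity classes
rng = random.Random(1)
points = [rng.gauss(0, 0.5) for _ in range(50)] + [rng.gauss(5, 0.5) for _ in range(50)]
labels = ["walking"] * 50 + ["jogging"] * 50
accuracy = ten_fold_accuracy(points, labels)
```

Because every instance appears in exactly one test fold, the averaged accuracy is an estimate of performance on unseen data rather than on the training set.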
4 Results and Discussion
In this work, our aim was to detect an emergency after it occurs by using HAR. The detection parameter considered was human activity. Since the activity of a person during an emergency is always different from normal daily activity, we try to recognise the abnormal activities. The considered dataset contains both accelerometer and gyroscope sensor data for various activities. We took only the smartphone data collected through the accelerometer in our experiment. Here, the number of instances considered is 23074 and the number of attributes is 92. Different ML classification algorithms, such as Random Forest, IBK, Bagging, J48 and MLP, are applied on the considered dataset to classify the activities. The accuracy of correct classification obtained for Random Forest, IBK, Bagging, J48 and MLP is 87.1977, 85.0481,
82.9592, 77.2471 and 66.4688 percent, respectively. We have also calculated the relative absolute error and root relative squared error for all the implemented algorithms. A comparative analysis of each considered algorithm with respect to the percentage of Correctly Classified Instances (CCI) and Incorrectly Classified Instances (ICI) is shown in Fig. 1. For the considered dataset, Random Forest is found to be the best algorithm compared to IBK, Bagging, J48 and MLP. The number of CCIs for this method is 20120 out of 23074 total instances. The mean absolute error and root mean squared error for Random Forest are found to be 0.0336 and 0.1112, respectively.
Fig. 1. Classification accuracy analysis of various algorithms
We have also calculated the True Positive (TP) Rate, False Positive (FP) Rate, Precision, Recall, F-Measure, Matthews Correlation Coefficient (MCC), Receiver Operating Characteristic (ROC) area and Precision–Recall Curve (PRC) area for Random Forest; these are given in Table 2. The method to calculate each of the above parameters is given as follows.
True Positive: It represents the correctly identified instances. For example, the activity walking is identified as walking only.
False Positive: It represents the incorrectly identified instances. For example, the activity walking is identified as an activity other than walking.
Precision: It is also called the positive predictive value and is calculated as follows:

Precision = TP / (TP + FP)
Recall: It is also known as sensitivity and is calculated as follows:

Recall = TP / (TP + FN)
MCC: It is the Matthews correlation coefficient and is calculated as follows:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
ROC Area: It is the Receiver Operating Characteristic area for a certain class label of the dataset.
PRC Area: It is the Precision–Recall Curve area.

Table 2. Detailed accuracy by class using the Random Forest algorithm

TP rate  FP rate  Precision  Recall  F-measure  MCC    ROC area  PRC area  Class
0.954    0.005    0.922      0.954   0.938      0.935  0.998     0.984     Walking
0.963    0.003    0.945      0.963   0.954      0.951  0.998     0.985     Jogging
0.905    0.007    0.875      0.905   0.890      0.884  0.995     0.950     Stairs
0.865    0.004    0.919      0.865   0.891      0.885  0.988     0.937     Sitting
0.871    0.007    0.882      0.871   0.877      0.870  0.990     0.935     Standing
0.879    0.004    0.926      0.879   0.902      0.897  0.990     0.948     Kicking a Soccer Ball
0.888    0.005    0.914      0.888   0.901      0.985  0.993     0.953     Dribbling a Basketball
0.858    0.007    0.873      0.858   0.865      0.858  0.988     0.927     Catch a Tennis Ball
0.834    0.008    0.854      0.834   0.844      0.835  0.985     0.909     Typing
0.835    0.006    0.875      0.835   0.854      0.847  0.987     0.915     Writing
0.846    0.007    0.875      0.846   0.840      0.852  0.990     0.929     Clapping
0.852    0.008    0.866      0.852   0.859      0.851  0.988     0.925     Brushing Teeth
0.892    0.018    0.771      0.892   0.827      0.817  0.990     0.917     Folding Clothes
0.762    0.012    0.811      0.762   0.786      0.773  0.985     0.882     Eating Pasta
0.838    0.011    0.836      0.838   0.837      0.826  0.990     0.913     Eating Soup
0.897    0.003    0.948      0.897   0.922      0.918  0.986     0.951     Eating Sandwich
0.887    0.005    0.916      0.887   0.901      0.896  0.989     0.950     Eating Chips
0.882    0.016    0.756      0.882   0.814      0.805  0.991     0.926     Drinking from a Cup
The activity recognition accuracy is measured by creating a confusion matrix and the graph obtained is shown in Fig. 2.
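The per-class measures defined above follow directly from the confusion-matrix counts. The counts below are made up for illustration, and MCC is shown in its standard form, with the square root in the denominator:

```python
import math

def class_metrics(tp, fp, fn, tn):
    """Per-class measures computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f_measure, mcc

# hypothetical counts for a single activity class (e.g. "walking")
p, r, f1, mcc = class_metrics(tp=90, fp=10, fn=10, tn=890)
```

For these counts precision, recall and F-measure all come to 0.9, while MCC, which also accounts for the many true negatives, comes to about 0.889.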
Fig. 2. Activity wise classification accuracy
5 Conclusion and Future Work
In this work, we considered human activity as an emergency-indicator parameter, and for the detection of human activity, we used different classification algorithms. After implementing all the algorithms on the considered dataset, it was found that the Random Forest algorithm performs better than the other algorithms in terms of CCI. Since early emergency detection helps minimise losses, it would be worthwhile to build an emergency management system with automation capability. In recent years, a large percentage of the population has been using smart devices with cloud accessibility. If an application can be developed using an efficient ML algorithm to recognise human activities, then losses can be avoided.
References
1. https://www.merriam-webster.com/dictionary/emergency
2. Disaster, S.K.: Challenges and perspectives. Ind. Psychiatry J. 19(1), 1 (2010)
3. https://ndma.gov.in/en/. Last accessed 10 Dec 2019
4. https://www.gndr.org/. Last accessed 10 Dec 2019
5. Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 10(160), 3–24 (2007)
6. Khanum, M., Mahboob, T., Imtiaz, W., Ghafoor, H.A., Sehar, R.: A survey on unsupervised machine learning algorithms for automation, classification and maintenance. Int. J. Comput. Appl. 119(13) (2015)
7. Weiss, G.M., Yoneda, K., Hayajneh, T.: Smartphone and smartwatch-based biometrics using activities of daily living. IEEE Access 12(7), 133190–133202 (2019)
8. Ponce, H., Martínez-Villaseñor, M., Miralles-Pechuán, L.: A novel wearable sensor-based human activity recognition approach using artificial hydrocarbon networks. Sensors 16(7), 1033 (2016)
9. Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–90 (2010)
10. Wang, J., Chen, Y., Hao, S., Peng, X., Hu, L.: Deep learning for sensor-based activity recognition: a survey. Pattern Recognit. Lett. 1(119), 3–11 (2019)
11. Lara, O.D., Labrador, M.A.: A survey on human activity recognition using wearable sensors. IEEE Commun. Surv. Tutor. 15(3), 1192–209 (2012)
12. Sunny, J.T., George, S.M., Kizhakkethottam, J.J.: Applications and challenges of human activity recognition using sensors in a smart environment. IJIRST Int. J. Innov. Res. Sci. Technol. 2, 50–57 (2015)
13. Ranasinghe, S., Al Machot, F., Mayr, H.C.: A review on applications of activity recognition systems with regard to performance and evaluation. Int. J. Distrib. Sens. Netw. 12(8), 1550147716665520 (2016)
14. Chavarriaga, R., Sagha, H., Calatroni, A., Digumarti, S.T., Tröster, G., Millán, J.D., Roggen, D.: The opportunity challenge: a benchmark database for on-body sensor-based activity recognition. Pattern Recognit. Lett. 34(15), 2033–2042
15. Cook, D.J.: Learning setting-generalized activity models for smart spaces. IEEE Intell. Syst. 2010(99), 1 (2010)
16. Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L.: A public domain dataset for human activity recognition using smartphones. In: ESANN (2013)
17. Srivastava, S.: Weka: a tool for data preprocessing, classification, ensemble, clustering and association rule mining. Int. J. Comput. Appl. 88(10) (2014)
18. Singhal, S., Jena, M.: A study on WEKA tool for data preprocessing, classification and clustering. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 2 (2013)
19. Rodriguez, J.D., Perez, A., Lozano, J.A.: Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans. Pattern Anal. Mach. Intell. 32(3), 569–575 (2009)
Data Mining Applications and Sentiment Analysis
A Novel Approach Based on Associative Rule Mining Technique for Multi-label Classification (ARM-MLC) C. P. Prathibhamol, K. Ananthakrishnan, Neeraj Nandan(B) , Abhijith Venugopal, and Nandu Ravindran Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Amritapuri, India [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. In this paper, we have implemented an efficient and novel technique for multi-label class prediction using associative rule mining. Many research works on classification have been carried out on single-label datasets, but that is not sufficient for all real-world applications involving multi-label datasets, such as scene classification, text categorization, etc. Hence, we propose an algorithm for performing multi-label classification and solving the problems that arise in the domain of single-label classification. Our novel technique (ARM-MLC) aims to enhance the accuracy of any decision-making process. Here, in multi-label classification, based on our work, we aim to predict the multiple characteristics of the instances. Keywords: Multi-label classification · Associative rule mining · Rule set · Data mining · Frequent itemset · Hamming loss · Ordered list · Brute suppression
1 Introduction
Data classification is the task of assigning data with similar features to the class to which they belong. In other words, the main aim of any classification algorithm is finding the decision class to which an object belongs. Multi-label classification is similar to single-label classification; the only difference is that in multi-label datasets, an instance may be associated with more than one label at a time. In short, it is the process of classifying data that can have multiple labels. Such data are said to have multiple decision classes and are discussed with examples in [1].
The Apriori algorithm is a very commonly used algorithm in the domain of data mining for the extraction of frequent itemsets. It is all the more important because it also performs operations related to associative rule mining [2]. The field of data mining is progressing
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_18
196
C. P. Prathibhamol et al.
very rapidly, as new and faster algorithms are being introduced every day. Each of these new algorithms is more or less based on the Apriori algorithm or its variations [3] (Table 1).

Table 1. A multi-label classification table

Class   | L1 | L2 | L3 | L4
A, B    | 1  | X  | 3  | X
A, C, D | 5  | X  | 2  | Y
D       | 5  | X  | 3  | Y
B, D    | 4  | Y  | 2  | Z
Once association rule mining is carried out, the number of extracted rules can be of the order of millions [1]. The role of an expert here is to sort out and gather only the required rules from the full set of generated rules, and to extract them in such a way that they can be understood by a common man. Therefore, the rule set should be minimized for better consistency of the model. The common way of reducing the rule set size is to remove the rules which fall under a certain threshold value, signifying their low importance. But there are two drawbacks to this method: firstly, some important rules in dominant positions covering many cases may be eliminated, and secondly, there is no method to select the most optimal threshold value. Here, we adopt a process in which, if two rules are similar, we delete one of them [4]. We thus introduce a new method to foretell the class of objects whose labels are unknown.
2 Literature Survey

A bottom-up approach is used by the Apriori algorithm, an effective algorithm for mining frequent itemsets for Boolean association rules. In the bottom-up approach, frequent subsets are extended one item at a time. It is designed to operate on transactional databases [2]. The key property of this algorithm is that any subset of a frequent itemset must itself be frequent; here Li denotes the set of frequent i-itemsets, i.e., the itemsets that meet the minimum support. The Apriori algorithm basically operates in two phases, joining and pruning. The joining step generates candidate itemsets by combining smaller itemsets, and the prune step reduces their number, which helps avoid heavy computations. The joining step lists all the items with their support values, followed by the pruning step, which prunes the items below the minimal support value [3]. The Apriori algorithm obtains the frequent itemsets completely. First, it is used to find all frequent items in the dataset with the help of minimum support; frequent itemsets are the sets of items meeting minimum support. Support and confidence are the two important parameters used to obtain strong rules. The support of a rule is how often the rule occurs in the dataset, and the confidence of a rule is
A Novel Approach Based on Associative Rule
197
how often the rule holds true. Then, in the cross-joining step, all the items are cross-joined to obtain longer itemsets, which is again followed by the pruning step based on the minimum support value. After all the iterations complete, we obtain a list of all frequent itemsets [2]. Rule sets are generated from datasets using association rule mining algorithms. A rule set can contain an enormous number of rules, and generally there is considerable overlap between many of the discovered rules [1]. Associative rule mining is a critical part of any data mining task, as its results are visible and understandable even to inexpert users in the domain. But this comprehensibility decreases when a large combination of rules is extracted, so it is essential to reduce the rule set [5]. The traditional approach is to change the value of parameters such as confidence and support, but this method has many disadvantages. So, here we use an algorithm called brute suppression for decreasing the rule set's size, which is based on eliminating the rules that are similar to other rules [4]. The idea of this method is that if rules are alike, we can keep one and delete the rest [5]. The input to this algorithm is the rule set generated using Apriori, and the output is a reduced and valid set of rules. Now, we have to generate the frequent pattern tree (fp-tree) for the following dataset with minimum support; in our application context, 30% is the minimum support (Table 2).

Table 2. Transaction database for FP-growth

TR ID | Items
1     | E, A, D, B
2     | D, A, C, E, B
3     | C, A, B, E
4     | B, A, D
5     | D
6     | D, B
7     | A, D, E
8     | B, C
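The support and confidence definitions above can be sketched against the transactions of Table 2 as follows (the helper names are illustrative, not from the paper):

```python
# Transactions from Table 2 (TR IDs 1-8).
transactions = [
    {"E", "A", "D", "B"}, {"D", "A", "C", "E", "B"}, {"C", "A", "B", "E"},
    {"B", "A", "D"}, {"D"}, {"D", "B"}, {"A", "D", "E"}, {"B", "C"},
]

def support(itemset, db):
    """How often the itemset occurs: fraction of transactions containing it."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """How often antecedent -> consequent holds when the antecedent occurs."""
    return support(antecedent | consequent, db) / support(antecedent, db)

s = support({"B", "D"}, transactions)       # B and D co-occur in 4 of 8 transactions
c = confidence({"D"}, {"B"}, transactions)  # D occurs 6 times, {D, B} 4 times
```

A rule such as D -> B is "strong" when both of these values clear the chosen thresholds.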
The dataset contains eight transactions. Therefore, the minimum support count is 30% of 8, which is 2.4; taking the ceiling of this value for convenience gives 3. Now, identify the frequency of each unique item and record its priority in increasing order [6] (Table 3). Here, a lower priority number means higher priority, and a higher priority number means lower priority. Now, order the items within each transaction according to priority; the ordered list should be constructed in increasing order of the priority number (Table 4). With the ordered list of transactions according to their priority, and with the null node as the root, construct the fp-tree. Here, we have to
Table 3. FP-growth frequency-priority table

Items | Frequency | Priority
A     | 5         | 3
B     | 6         | 1
C     | 3         | 5
D     | 6         | 2
E     | 4         | 4
Table 4. Ordered items after prioritization

TR ID | Items         | Ordered items
1     | E, A, D, B    | B, D, A, E
2     | D, A, C, E, B | B, D, A, E, C
3     | C, A, B, E    | B, A, E, C
4     | B, A, D       | B, D, A
5     | D             | D
6     | D, B          | B, D
7     | A, D, E       | D, A, E
8     | B, C          | B, C
construct the leaf nodes for the null node with the ordered list for each transaction [7] (Fig. 1).
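The counting, thresholding, and reordering steps that produce Tables 3 and 4 can be sketched as below (a plain-Python illustration, not the authors' code; the alphabetical tie-break is an assumption that happens to match Table 3):

```python
from collections import Counter
from math import ceil

transactions = [
    ["E", "A", "D", "B"], ["D", "A", "C", "E", "B"], ["C", "A", "B", "E"],
    ["B", "A", "D"], ["D"], ["D", "B"], ["A", "D", "E"], ["B", "C"],
]

# Minimum support count: 30% of 8 transactions = 2.4, rounded up to 3.
min_count = ceil(0.3 * len(transactions))

freq = Counter(item for t in transactions for item in t)
frequent = {item: n for item, n in freq.items() if n >= min_count}

# Higher frequency -> lower priority number (ties broken alphabetically).
priority = {item: rank for rank, (item, _) in enumerate(
    sorted(frequent.items(), key=lambda kv: (-kv[1], kv[0])), start=1)}

def reorder(transaction):
    """Keep frequent items only, sorted by increasing priority number."""
    return sorted((i for i in transaction if i in frequent),
                  key=priority.__getitem__)

ordered = [reorder(t) for t in transactions]  # the 'Ordered items' column of Table 4
```

Inserting each reordered transaction below the null root, sharing common prefixes, yields the fp-tree of Fig. 1.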
3 Training Phase

The model is trained so that it can differentiate the various classes to which the data instances may belong, even though a data instance may be related to multiple labels at the same time. The training data contains objects with known class labels. The training process starts with preprocessing of the dataset, which is normalization followed by discretization. Normalization maps the numeric attributes of the dataset onto a common scale, without distorting the differences in the ranges of values or losing information; the data is thus mapped without losing its originality. Data discretization makes data evaluation and data management easier by converting a large number of continuous data values into a smaller set of categorical values. After preprocessing is performed on the dataset, we apply the Apriori algorithm to the attribute space of the dataset to
Fig. 1. Fp-tree
generate associative rules between the attributes. The Apriori algorithm generates a large set of rules, so to remove the redundant ones we apply a reduction algorithm called brute suppression. The basic idea of this algorithm is that if two rules cover the same items, the rule with lesser confidence is considered redundant and is removed from the rule set. This is followed by clustering. Once the similar or common rules are clustered, the fp-growth algorithm is applied directly to the label space of the records to create associative rules among the labels. With this, the training phase is completed (Fig. 2).
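A minimal sketch of the brute suppression step as described above: rules covering the same set of items collapse to the single rule with the highest confidence (the triple representation of a rule is an assumption for illustration):

```python
def brute_suppress(rules):
    """rules: (antecedent frozenset, consequent frozenset, confidence) triples.
    For each covered item set, keep only the highest-confidence rule."""
    best = {}
    for ante, cons, conf in rules:
        covered = frozenset(ante | cons)
        if covered not in best or conf > best[covered][2]:
            best[covered] = (ante, cons, conf)
    return list(best.values())

rules = [
    (frozenset({"A"}), frozenset({"B"}), 0.8),
    (frozenset({"B"}), frozenset({"A"}), 0.6),       # covers the same items {A, B}
    (frozenset({"A", "B"}), frozenset({"C"}), 0.7),  # different coverage, kept
]
reduced = brute_suppress(rules)  # the 0.6 rule is suppressed
```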
4 Testing Phase

In this phase, when the test data arrives, we perform the preprocessing steps (normalization and discretization) on the feature space of the test data. After preprocessing, we check the feature space rule and find the cluster number that was generated in
Fig. 2. Training phase
the training phase. After finding the cluster to which this rule belongs, we check the label-space rules generated for that specific cluster (using the fp-growth algorithm), and then we classify the labels (Fig. 3).
5 Experimental Result

Here, we use a popular measure called Hamming loss for evaluating the correctness and accuracy of our model. The fraction of labels that are predicted incorrectly by a model is called the Hamming loss [8]. In multi-label classification, the Hamming loss computes the Hamming distance between the true values and the predicted values; when normalized, its values always fall in the range 0–1. For each predicted label compared with the original label, we add 0 if it is predicted correctly and add 1 if it is predicted wrongly. Let 'D' be the number of instances predicted and 'L' be the number of labels. Then, the Hamming loss can be calculated using the formula (Fig. 4).
Fig. 3. Testing phase
Fig. 4. Equation for hamming loss
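The computation described above can be sketched as follows, assuming the formula in Fig. 4 is the standard normalized Hamming loss, i.e., the number of mismatched label slots divided by D × L:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted wrongly over D instances and L labels."""
    D, L = len(y_true), len(y_true[0])
    wrong = sum(t != p
                for ti, pi in zip(y_true, y_pred)
                for t, p in zip(ti, pi))
    return wrong / (D * L)

# Two instances, three labels each; one slot is mispredicted.
loss = hamming_loss([[1, 0, 1], [0, 1, 0]],
                    [[1, 1, 1], [0, 1, 0]])  # 1 wrong slot out of 6
```

A loss of 0 means every label of every instance was predicted correctly; lower is better.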
Multi-label datasets from the Mulan repository were taken for validating our proposed method ARM-MLC. Standard datasets such as the yeast, scene, and emotion datasets were considered for this purpose [5]. The yeast dataset has 2417 instances with 14 labels and 103 numeric attributes; its data relates to an organism's genes. The emotion dataset has 593 instances with 6 labels and 72 attributes; its data relates to music. The scene dataset contains 2407 instances with 294 attributes and 6 labels; its data is numeric and describes environmental scenes such as sky, sunrise, etc. In Table 5, we depict the results of our proposed method alongside state-of-the-art methods on the yeast dataset. For the multi-label classification problem, it is observed that our method performs well compared with the other methods mentioned in the table.
Table 5. Hamming loss and accuracy

Algorithms  | Hamming loss | Accuracy
ML-KNN      | 0.18         | 0.492
Binary-SVM  | 0.261        | 0.530
C4.5        | 0.259        | 0.423
Naive Bayes | 0.301        | 0.421
SMO         | 0.263        | 0.337
CLR         | 0.210        | 0.497
RAKEL       | 0.258        | 0.465
I-BLR       | 0.199        | 0.506
KNN         | 0.258        | 0.514
ARM-MLC     | 0.255        | 0.743
6 Conclusion

In the proposed method, an efficient solution for multi-label classification is explored. The multiple labels of an unlabelled instance are predicted accurately and efficiently using associative rule mining techniques, namely fp-growth and the Apriori algorithm. The rules generated by the Apriori algorithm are further reduced using a rule reduction technique called brute suppression, in which, if two rules cover the same items, they are reduced to one by removing the rule with lesser confidence. Using the reduced rules we cluster the dataset, and these clusters are then subjected to the fp-growth algorithm to generate label-to-label rules. When new test data arrives, we first find the cluster to which it belongs and then predict its labels from the label-to-label rules generated using fp-growth. The advantage of this proposed solution is the quick prediction of the multiple labels of a test instance by directly referring to the rules generated in the training phase. The complexity of applying a classification technique in the testing phase is eliminated in this approach, thus providing easy prediction of multiple labels.

Acknowledgments. We would like to express our gratitude towards the Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, Amritapuri Campus, for furnishing their valuable time for the completion of our project.
References

1. Haripriya, H., et al.: Multi-label prediction using association rule generation and simple k-means. In: 2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT). IEEE (2016)
2. Bodon, F.: A fast APRIORI implementation. FIMI, vol. 3 (2003)
3. Haripriya, H., et al.: Integrating Apriori with paired k-means for cluster fixed mixed data. In: Proceedings of the Third International Symposium on Women in Computing and Informatics. ACM (2015)
4. Hills, J., et al.: BruteSuppression: a size reduction method for Apriori rule sets. J. Intell. Inf. Syst. 40(3) (2013)
5. Athira, S., Poojitha, K., Prathibhamol, C.P.: An efficient solution for multi-label classification problem using Apriori algorithm (MLC-A). In: 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE (2017)
6. Borgelt, C.: An implementation of the FP-growth algorithm. In: Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations. ACM (2005)
7. Pramudiono, I., Kitsuregawa, M.: Parallel FP-growth on PC cluster. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg (2003)
8. Prathibhamol, C.P., Ashok, A.: Solving multi label problems with clustering and nearest neighbor by consideration of labels. In: Advances in Signal Processing and Intelligent Recognition Systems, pp. 511–520. Springer, Cham (2016)
Multilevel Neuron Model Construction Related to Structural Brain Changes Using Hypergraph Shalini Ramanathan and Mohan Ramasundaram(B) Department of Computer Science and Engineering, National Institute of Technology Tiruchirappalli, Tiruchirappalli 620015, Tamil Nadu, India {406916001,rmohan}@nitt.edu
Abstract. Newborn neurons in the human brain relocate from their place of birth to other regions of the brain. Designing a model to read the structural changes in the brain will help scientists understand more about the life cycle of neurons. In this paper, a hypergraph-based model for recognizing the structural changes during the birth and death of neurons was developed, and its performance was evaluated quantitatively with small-world network and robust connectivity measures. This neuron reconstruction model can operate as a treatment modality for brain diseases and disorders that affect the lives of millions of human beings.

Keywords: Hypergraph · Multilevel neuron · Brain disorder · Visualization · Communication network

1 Introduction
Neurons play a vital role in the human brain. They are the building blocks of the nervous system and the information carriers between the brain and the rest of the body. Neurons normally do not reproduce or replace themselves, so when they become damaged or die they cannot be replaced by the body. This causes brain disorders [1,2], which in turn affect people's ability to remember, speak, think, and make decisions. The human brain has billions of neurons of different types. Each neuron is connected to many other neurons, even in a small piece of brain tissue, forming a huge complex network with multiple types of neurons called a multilevel neuron network. Neurons' network connections are extremely entangled, and their detailed connectomes have not been inferred until now [3,4]. Corinne Teeter mentioned that there is a need for a simple linear model with a unique parameter to classify multiple types of neurons [5]. Olaf Sporns also stated that there is a strong demand for graph theory-based tools to analyze
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_19
Multilevel Neuron Model Construction Related
205
brain network data [6]. This shows the importance of designing a computer model to construct and compute neurons' structural changes. Such a model will help medical researchers study brain diseases and disorders, and it motivated us to create an efficient and effective model to visualize and evaluate the human brain structure [7]. Constructing the multilevel structure of the neuron network is a step toward constructing a working computer model of a brain. A hypergraph can represent great network correlation [8], so here the hypergraph is used to model the neurons' structural properties. A hypergraph-based multilevel neuron construction framework is introduced. It demonstrates the creation of a new neuron, the migration of the new neuron, the creation and discovery of paths between available neurons, the communication of multiple neurons with each other, and the death of a neuron. This framework works as a combination of visualization and evaluation tools: it helps to visualize the neurons, to compute effective communication, and to convey complex information in a simple way. This idea of modeling a brain will also help neuroscientists and neurologists to better view the functioning of neurons [8]. This paper is organized as follows. Section 2 describes the existing work related to neuron construction and hypergraphs. Section 3 introduces the hypergraph framework used to build the brain network. Section 4 discusses the results of the framework along with its performance measures, and also highlights the limitations faced during the implementation of the neuron model. Finally, Sect. 5 presents the summary of the work.
2 Related Work
Understanding brain circuits has been a key challenge over the last decade. Through advances in data processing in neural networks, many neuroscience challenges, like calculating brain pathways and evaluating complex network connectivity, were made easier, and many simulation research tools have emerged for the use of physicians [9]. A graph is a non-linear mathematical data structure: a collection of points and lines connecting subsets of those points. Graph theory has long been applied to neuroscience, but computational neuroscience still struggles with the real-world problem of the evolution of the brain network of nodes and its communications [10–12]. Nowadays, brain simulation has become an important part of the treatment modality of severe diseases like Alzheimer's, Parkinson's, etc. [13]. But simulation techniques need significant computing power, which has driven the development of supercomputers and programming frameworks like CUDA with GPUs. Neuropsychiatric disease is a burden for neurologists and neurosurgeons, with depression the leading cause of disability [14]. The latest technologies in the field of computer science create smart and intelligent medical products. These technologies can be fully utilized by contributing to the challenges of cognitive brain function, functional connectomes, etc. [15]. Neurologists can apply diagnosis and prevention of brain diseases by using these kinds of smart computer models [16]. The computer
206
S. Ramanathan and M. Ramasundaram
model of a brain often informs other fields of research, like drug discovery; this kind of brain model helps identify drugs for diagnosis and treatment methods, eventually leading to more advanced medical options available to patients at lower cost [17]. It is inferred from the review that there is active research on brain visualization using graph theory, focusing on nerve cell movements inside the brain through brain stimulation and functional imaging.
3 Method
The robust network of the brain can be viewed as a strong hypergraph structure. Mapping a hypergraph for the simulation of single-neuron and multiple-neuron construction is straightforward because a hypergraph can contain many nodes in a single edge [18,19]. In this paper, every brain region is considered a single edge, and the cells or neurons in that brain region are considered vertices. Here, the hypergraph framework simulates the life cycle of multilevel neuron construction shown in Fig. 1. A vertex represents a new neuron. Every vertex is created inside some edge, meaning that inside some region of the brain a neuron is born. Initially, the first vertex (n1) is created inside the first edge (r1). When the second vertex (n2) is created, a pathway is generated between the two neurons, which forms a neural circuit [20]. The second vertex can be inside the same first edge (r1) or inside a new edge (r2); if the two vertices lie in different edges, then the neurons form a message flow path between two dissimilar regions of the brain. Likewise, each next neuron (vertex) is created along with its region of the brain (edge). 'N' denotes the total count. Once the construction of the 'N' edges and vertices is over, hypergraph coloring is applied to differentiate the regions of the brain, as shown in Fig. 2. A transversal hypergraph is used for the information flow from one neuron to another along a pathway or highway. Removing a vertex (a neuron) without a pathway shows the death of that neuron. To differentiate the information flow within and across regions of the brain, communication of neurons within the same region is called a pathway and across different regions is called a highway [21]. The stem cell's lineage is also shown using the dual hypergraph.
4 Results and Discussion
The hypergraph-based neuron construction framework was implemented in Python 3.7 on a Windows 10 system with an Intel Core i7 CPU at 3.40 GHz and an NVidia GeForce GT 730 processor with 6 GB RAM. The visualization of the brain network was created in the LaTeX environment with the TikZ package [22]. The hypergraph is defined by an incidence matrix A = (aij), with columns representing edges r1, r2, ..., rm and rows representing vertices n1, n2, ..., nn, where (aij) = 0 if ni ∉ rj
Fig. 1. a New neuron—first vertex in the first edge. b Second new neuron—second vertex in the second edge. c First and second vertices with the first edge

Table 1. Total size of input elements

Number of vertices | Number of edges | Total communication pathways
8                  | 6               | 28
and (aij) = 1 if ni ∈ rj [23,24]. Neurons and regions of the brain are considered as vertices and edges. Each neuron connects with the other neurons to form pathways, as shown in Table 1.

Table 2. Input incidence matrix

     r1  r2  r3  r4  r5  r6
n1 |  0   0   0   0   1   0
n2 |  0   0   0   1   1   0
n3 |  1   0   0   1   0   0
n4 |  1   0   0   0   0   0
n5 |  1   1   0   0   0   0
n6 |  0   0   1   0   0   0
n7 |  0   0   1   1   0   1
n8 |  0   1   1   0   0   0
Fig. 2. a and b Constructed hypergraph structure in the LaTeX–TikZ environment
Table 2 represents a matrix ensemble of 8 neurons, 6 corresponding brain regions, and 28 pathways [25]. Suppose a neuron has died after a period of time; this is shown by removing its vertex. For example, in the above matrix, neuron (n7) is considered dead; removing vertex (n7) makes the edge (r6) meaningless, representing that dead neurons may cause a brain region to disappear from the original normal view of the brain. The matrix is then recreated as shown in Table 3, which means the hypergraph structure also changes. The hypergraph is used to represent the brain image. The hypergraph
will be used to predict the disorder of memory loss. Here, the original matrix represents the brain image of a normal subject, and the recreated matrix represents the reshaped brain image of an abnormal Alzheimer's disease subject [26]. It is identified from Table 4 that, due to the death of neuron (n7), first, the total number of communication pathways shrank, and second, there is no communication highway between regions (r4) and (r3). From this it is inferred that the memory loss of the Alzheimer's disease subject happened due to this dead neuron and the idle region of the brain [27].

Table 3. Recreated incidence matrix

     r1  r2  r3  r4  r5
n1 |  0   0   0   0   1
n2 |  0   0   0   1   1
n3 |  1   0   0   1   0
n4 |  1   0   0   0   0
n5 |  1   1   0   0   0
n6 |  0   0   1   0   0
n8 |  0   1   1   0   0
Table 4. Total size of input elements

Number of vertices | Number of edges | Total communication pathways
7                  | 5               | 21

4.1 Performance Measures
The correctness of the hypergraph creation in the visual environment was assessed quantitatively by means of the following two measures: (i) the small-world network measure [28] and (ii) robust connectivity measures [29]. The small-world network measure characterizes the structural properties of neighbors and non-neighbors of the hypergraph through the relation L ∝ log N. Robust connectivity measures provide a feasible and simple assessment of the interactions of a huge, intricate network [30].

4.2 Limitations
The number of edges and vertices in the hypergraph increases the number of connections in the network. This framework can handle 2500 communication pathways on a Core i7 processor and up to 6000 communication pathways on an NVidia GeForce GT 730 processor. Scaling the connection pathways further, to millions of nodes, would best be done by utilizing the supercomputers in India; nowadays, powerful mainframe computers with millions of cores are competing to operate like a human brain [31].
5 Conclusion
The neuron architecture was simulated with concepts from graph theory and visualized with an implementation in the Python and LaTeX environments. The statistical test results have shown that the visualization environment works in an effective and efficient manner. This framework will help neuroscientists to better understand and visualize neuron development and behavior. The framework could be improved by reusing the same paths for multiple communications.
References

1. Levine, D.S.: Theory of the brain and mind: visions and history. In: Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 191–203 (2019). https://doi.org/10.1016/B978-0-12-815480-9.00009-8
2. Einevoll, G.T., Destexhe, A., Diesmann, M., Grün, S., Jirsa, V., de Kamps, M., Migliore, M., Ness, T.V., Plesser, H.E., Schürmann, F.: The scientific case for brain simulations. Neuron 102, 735–744 (2019). https://doi.org/10.1016/J.NEURON.2019.03.027
3. Colombo, M.: Olaf Sporns: discovering the human connectome. Minds Mach. 24, 217–220 (2014). https://doi.org/10.1007/s11023-013-9334-2
4. van den Heuvel, M.P., Sporns, O.: A cross-disorder connectome landscape of brain dysconnectivity. Nat. Rev. Neurosci. 20, 435–446 (2019). https://doi.org/10.1038/s41583-019-0177-6
5. Teeter, C., Iyer, R., Menon, V., Gouwens, N., Feng, D., Berg, J., Szafer, A., Cain, N., Zeng, H., Hawrylycz, M., Koch, C., Mihalas, S.: Generalized leaky integrate-and-fire models classify multiple neuron types. Nat. Commun. 9, 709 (2018). https://doi.org/10.1038/s41467-017-02717-4
6. Sporns, O.: Graph theory methods: applications in brain networks. Dialogues Clin. Neurosci. 20, 111–121 (2018)
7. Lippert, T.: HPC for the human brain project. In: Proceedings of the 28th ACM International Conference on Supercomputing (ICS '14), pp. 1–1. ACM Press, New York (2014). https://doi.org/10.1145/2597652.2616584
8. Yadati, N., Nitin, V., Nimishakavi, M., Yadav, P., Louis, A., Talukdar, P.: Link prediction in hypergraphs using graph convolutional networks (2018)
9. Bhalla, S., Dura-Bernal, S., Suter, B.A., Gleeson, P., Cantarelli, M., Quintana, A., Rodriguez, F., Kedziora, D.J., Chadderdon, G.L., Kerr, C.C., Neymotin, S.A., McDougal, R.A., Hines, M., Shepherd, G.M., Lytton, W.W.: NetPyNE, a tool for data-driven multiscale modeling of brain circuits. https://doi.org/10.7554/eLife.44494.001
10. Biamonte, J., Faccin, M., De Domenico, M.: Complex networks from classical to quantum. Commun. Phys. 2, 53 (2019). https://doi.org/10.1038/s42005-019-0152-6
11. Fleischer, V., Radetz, A., Ciolac, D., Muthuraman, M., Gonzalez-Escamilla, G., Zipp, F., Groppa, S.: Graph theoretical framework of brain networks in multiple sclerosis: a review of concepts. Neuroscience 403, 35–53 (2019). https://doi.org/10.1016/j.neuroscience.2017.10.033
12. Bullmore, E., Sporns, O.: Complex brain networks: graph theoretical analysis of structural and functional systems. Nat. Rev. Neurosci. 10, 186–198 (2009). https://doi.org/10.1038/nrn2575
13. Lee, H., Kim, E., Ha, S., Kang, H., Huh, Y., Lee, Y., Lim, S., Lee, D.S.: Volume entropy for modeling information flow in a brain graph. Sci. Rep. 9, 256 (2019). https://doi.org/10.1038/s41598-018-36339-7
14. Stam, C.J.: Modern network science of neurological disorders. Nat. Rev. Neurosci. 15, 683–695 (2014). https://doi.org/10.1038/nrn3801
15. Bansal, K., Nakuci, J., Muldoon, S.F.: Personalized brain network models for assessing structure-function relationships. Curr. Opin. Neurobiol. 52, 42–47 (2018). https://doi.org/10.1016/J.CONB.2018.04.014
16. Lynn, C.W., Bassett, D.S.: The physics of brain network structure, function and control. Nat. Rev. Phys. 1, 318–332 (2019). https://doi.org/10.1038/s42254-019-0040-8
17. Shalini, R., Mohan, R.: Drugs relationship discovery using hypergraph. Int. J. Inf. Technol. Comput. Sci. 10, 54–63 (2018). https://doi.org/10.5815/ijitcs.2018.06.06
18. Mohan, R., Shalini, R.: Neuroinformatics Conference. https://abstracts.g-node.org/conference/NI2018/abstracts#/uuid/340cca06-1ea0-42bc-9a37-04f07828da89
19. Shalini, R., Mohan, R.: Diagnosis of Alzheimer's disease using hypergraph. In: G-Node (2018). https://doi.org/10.12751/incf.ni2018.0098
20. Ritz, A., Avent, B., Murali, T.M.: Pathway analysis with signaling hypergraphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 1042–1055 (2017). https://doi.org/10.1109/TCBB.2015.2459681
21. Wei, K., Cieslak, M., Greene, C., Grafton, S.T., Carlson, J.M.: Sensitivity analysis of human brain structural network construction. Netw. Neurosci. 1, 446–467 (2017)
22. Mertz, A., Slough, W.: Graphics with TikZ. PracTeX J. (2007)
23. Berge, C.: Hypergraphs: Combinatorics of Finite Sets. North Holland (1989)
24. Bretto, A.: Hypergraph Theory: An Introduction. Springer, Cham (2013)
25. Weisstein, E.W.: Incidence matrix. http://mathworld.wolfram.com/IncidenceMatrix.html
26. Mueller, S.G., Weiner, M.W., Thal, L.J., Petersen, R.C., Jack, C.R., Jagust, W., Trojanowski, J.Q., Toga, A.W., Beckett, L.: Ways toward an early diagnosis in Alzheimer's disease: the Alzheimer's Disease Neuroimaging Initiative (ADNI). Alzheimers Dement. 1, 55–66 (2005). https://doi.org/10.1016/j.jalz.2005.06.003
27. Naresh, K.: Alzheimer's disease and memory loss: a review (2016). https://doi.org/10.4172/2161-0460.1000259
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998). https://doi.org/10.1038/30918
29. Liu, J., Zhou, M., Wang, S., Liu, P.: A comparative study of network robustness measures. Front. Comput. Sci. 11, 568–584 (2017). https://doi.org/10.1007/s11704-016-6108-z
30. Golas, U.: Analysis and Correctness of Algebraic Graph and Model Transformations. Vieweg+Teubner, Wiesbaden (2011). https://doi.org/10.1007/978-3-8348-9934-7
31. Yoo, H.-J.: 1.2 Intelligence on silicon: from deep-neural-network accelerators to brain mimicking AI-SoCs. In: 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 20–26. IEEE (2019). https://doi.org/10.1109/ISSCC.2019.8662469
AEDBSCAN—Adaptive Epsilon Density-Based Spatial Clustering of Applications with Noise Vidhi Mistry, Urja Pandya, Anjana Rathwa, Himani Kachroo, and Anjali Jivani(B) Department of Computer Science and Engineering, The Maharaja Sayajirao University of Baroda, Baroda, India [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. The objectives of this research are to study the DBSCAN algorithm and engineer an enhancement to it that addresses its flaws. DBSCAN is criticized for its requirement to input two parameters, namely the epsilon radius (ε) and the minimum number of points (MinPts). It is difficult to know the optimum value of both parameters beforehand, and hence many trials are required until the desired clusters are obtained. Also, in a dataset, cluster density can vary, and DBSCAN fails to identify clusters when density variations are present. The proposed algorithm, Adaptive Epsilon DBSCAN (AEDBSCAN), generates epsilon dynamically in accordance with the neighborhood of a point and thereafter applies DBSCAN clustering with the corresponding epsilon to obtain the clusters. Experimental results obtained by testing AEDBSCAN on artificial datasets confirm that the proposed algorithm carries out multi-density clustering more efficiently than the original DBSCAN. Keywords: Data mining · DBSCAN · Density-based clustering · Multi-density · Adaptive epsilon
1 Introduction
Data clustering is an important unsupervised learning technique. It groups data into meaningful classes such that the intra-class similarity is high and the inter-class similarity is low. Data clustering is an integral tool for data mining and has found many applications in fields such as image processing, market analysis, and pattern recognition. Clustering being an active research field, many clustering techniques have been developed. Clustering techniques can be broadly classified into partitioning methods, density-based methods, hierarchical methods, and grid-based methods [3]. In order to compare different clustering methods, many clustering requirements have been put forward, some of them being scalability, the ability to deal with different types of attributes, discovery of arbitrarily shaped clusters, outlier detection, the ability to deal with high-dimensional data, and insensitivity to input data [4]. © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_20
214
V. Mistry et al.
Density-based clustering techniques are widely used as they satisfy many of the above-mentioned requirements. They can efficiently detect clusters of arbitrary shapes and different sizes, and noise is handled well by these methods. The basic idea behind density-based clustering is to separate low-density regions from high-density regions. Of all the density-based methods, DBSCAN is the most popular and possesses all the advantages of the density-based clustering family; it is the algorithm in focus for this research. However, DBSCAN also brings some disadvantages, such as the requirement to input two parameters and the fixing of a global density threshold, which results in a failure to detect clusters of varied density. In this paper, we propose a new algorithm, AEDBSCAN, which is based on DBSCAN but carries out clustering efficiently on multi-density cluster datasets. This algorithm automatically generates an epsilon for each point and, unlike DBSCAN, establishes a dynamic density threshold for clustering. Experimental results confirm that the proposed algorithm AEDBSCAN can identify clusters that have varying densities, whether globally or locally. The outline of the rest of the paper is as follows. Section 2 summarizes the original DBSCAN and the basic concepts associated with the algorithm; this section also gives the gist of a metric used to evaluate clustering techniques. Section 3 introduces the proposed algorithm AEDBSCAN, focusing on the details of its underlying principles and working. Experimental work and the analysis carried out to compare the performance of AEDBSCAN with the original DBSCAN are discussed in Sect. 4. Lastly, Sect. 5 concludes the findings and examines the further scope of the proposed algorithm.
2 Related Work and Basic Concepts
This section provides a summary of the widely used DBSCAN algorithm and its related basic concepts. Moreover, the silhouette width score, a metric used for the evaluation of clustering algorithms, is also briefly discussed in this section.
2.1 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
The DBSCAN algorithm is an efficient and widely used density-based clustering algorithm. The goal of the algorithm is to identify dense regions by measuring the number of data objects near a given point. The DBSCAN algorithm needs two input parameters, namely:
1. Epsilon radius (ε): the maximum distance between two data objects for them to be considered neighbors, that is, the radius of the neighborhood around a given point. The neighborhood formed by the epsilon radius ε is referred to as the ε-neighborhood of a data object.
2. Minimum number of points (MinPts): the minimum number of data points required to be present in the ε-neighborhood of a data object to form a cluster.
Every individual data object is classified as one of three types by the DBSCAN algorithm. These types are:
• Core point: a data object that has at least the minimum number of points (MinPts) in its ε-neighborhood.
• Border point: a data object that does not have the minimum number of points (MinPts) in its ε-neighborhood but has at least one core point in it.
• Noise point (outlier): a data object that is classified neither as a core point nor as a border point.
(Note that the terms data object and data point refer to the same thing.) Some of the key concepts related to DBSCAN are defined below:
1. Directly density reachable: a point P is directly density reachable from another point Q if P lies in the ε-neighborhood of Q and Q is a core point.
2. Density reachable: a point P is density reachable from another point Q if there exists a chain of points P1, …, Pn with P1 = Q and Pn = P such that Pi+1 is directly density reachable from Pi.
3. Density connected: points P and Q are density connected if there exists a core point O such that both P and Q are density reachable from O.
In DBSCAN, a cluster is defined in terms of density: it is a group of density-connected points. The DBSCAN algorithm performs clustering as follows [5]:
1. For each point pi, the distance between pi and all other points is calculated, and all points that lie in its ε-neighborhood are found. If the neighbor count is greater than or equal to MinPts, the point is marked as a core point and as visited.
2. Thereafter, for each core point that has no cluster assigned to it, a new cluster is created and assigned to it, and all its density-connected points are assigned the same cluster.
3. This process is repeated for all unvisited points remaining in the dataset.
4. At the end, any points that are not assigned a cluster are marked as noise points or outliers.
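The steps above can be sketched as a minimal pure-Python DBSCAN (all function and variable names here are our own, for illustration only; a production implementation would use a spatial index for the neighborhood queries):

```python
from collections import deque

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (-1 = noise, 0.. = cluster id)."""
    labels = [None] * len(points)            # None = unvisited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster_id += 1                      # new cluster seeded by core point i
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:                         # expand to all density-connected points
            j = queue.popleft()
            if labels[j] == -1:              # former noise becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels
```

Note how a point first marked as noise can later be relabeled as a border point when a cluster expands into it, exactly as in the definitions above.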
2.2 Silhouette Width Score
Apart from visual cues, a statistical validation tool is required to evaluate a cluster's quality. Among the available tools, the silhouette width score is widely preferred. It is used to interpret and validate consistency within clusters. The silhouette width score is calculated for each individual data object; it measures the similarity of that object to its own cluster (cohesion) and its dissimilarity to other clusters (separation). To compare the clusters formed by DBSCAN and the proposed algorithm, the average silhouette width score is used. The silhouette score lies in the range [−1, 1].
• If the score is close to +1, then it is very likely that the data object has been assigned to the correct cluster.
• If the score is close to −1, then it is very likely that it has been assigned to a wrong cluster.
• If the score is close to 0, then the data object most likely lies on or near the boundary between two neighboring clusters.
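The per-object score described above is s(i) = (b − a) / max(a, b), where a is the mean distance to the other members of the object's own cluster and b is the smallest mean distance to any other cluster. A small self-contained sketch (names ours):

```python
def silhouette_widths(points, labels):
    """Per-point silhouette width s(i) = (b - a) / max(a, b)."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        same = [j for j in clusters[lab] if j != i]
        if not same:                      # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        # a: cohesion (mean distance within own cluster)
        a = sum(dist(points[i], points[j]) for j in same) / len(same)
        # b: separation (smallest mean distance to another cluster)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores
```

Averaging the returned list gives the average silhouette width score used for the comparisons in Sect. 4.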
3 Proposed Algorithm
The proposed Adaptive Epsilon Density-Based Spatial Clustering of Applications with Noise (AEDBSCAN) algorithm attempts to further generalize the original DBSCAN by addressing its flaws.
3.1 Overview of Proposed Algorithm
First and foremost, DBSCAN requires the user to input two parameters, namely the epsilon radius (ε) and the minimum number of points (MinPts). The clustering produced by DBSCAN is sensitive to these two parameters. However, it is difficult to obtain the optimum values of both parameters without prior knowledge and a thorough analysis of the data, and many trials are required to arrive at optimal values for both ε and MinPts. Also, in a dataset, a cluster's density can vary globally or locally. Local density variation means there is a change in density within a single cluster. Global variation means that within a cluster the density may remain the same, but different clusters in the dataset have different densities. For DBSCAN, the density is defined by the two input parameters (ε and MinPts) and hence is constant for every data object. Having a constant predefined density threshold, DBSCAN is not able to handle global or local density variations, resulting in inefficient clustering of multi-density data [2]. The proposed algorithm AEDBSCAN eliminates the above-mentioned disadvantages of DBSCAN. Unlike DBSCAN, it requires only one input parameter, the minimum number of points (MinPts). AEDBSCAN computes an epsilon radius (ε) for each data object based on its distance from its neighbors; here, the neighbors of a data object are the nearest data objects, determined by the input parameter MinPts. Thereafter, the data objects are clustered using the computed epsilon and the input parameter MinPts, just like DBSCAN. With a single input parameter, AEDBSCAN frees the user from one input parameter, requiring fewer trials than DBSCAN to obtain the desired clustering.
Moreover, AEDBSCAN generates the epsilon radius dynamically for each individual data object in accordance with the spatial distribution of its neighbors. Due to this, the decision for a data object to belong to a cluster is made considering the nearby spatial distribution. In this manner, AEDBSCAN is able to efficiently identify clusters that have varying densities, whether globally or locally.
3.2 AEDBSCAN Algorithm
The following are the algorithmic steps for AEDBSCAN:
(1) Take a suitable MinPts as input.
(2) For each individual data point, calculate its distance to all remaining data points.
(3) Sort these distances in ascending order and select the top 2 × MinPts distances from the sorted list.
(4) Take the average of the selected distances; this acts as the ε-radius for that point.
(5) After dynamically obtaining the ε-radius for each individual data point, and using the minimum number of points provided earlier as an argument, the algorithm proceeds like the original DBSCAN algorithm.
(6) Since this algorithm may generate small clusters among noise points, mark all single-element or very small clusters as noise or outliers.
The working methodology of this algorithm is summarized in the flowchart in Fig. 1.
3.3 AEDBSCAN Pseudocode
The AEDBSCAN algorithm is identical to DBSCAN except that the ε-radius is calculated automatically for each individual point. The following is the pseudocode for calculating the ε-radius of a given point (P) in a dataset (D):
findEpsilon(D, P, MinPts)
    for each point P' in dataset D
        insert distance between P and P' into array Dist
    sort Dist in ascending order
    select the top 2*MinPts elements of Dist
    eps := mean of the selected elements
    return eps
Here, MinPts is the ‘minimum number of points’ argument passed initially by the user to the AEDBSCAN method.
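A direct Python transcription of this pseudocode (names are ours; note that, following the pseudocode literally, the zero distance from P to itself is included in the distance list):

```python
def find_epsilon(data, p, min_pts):
    """Adaptive eps for point p: mean of the 2*min_pts smallest distances
    from p to the points of the dataset (self-distance 0 included, as in
    the pseudocode above)."""
    dist = sorted(
        sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in data
    )
    nearest = dist[:2 * min_pts]
    return sum(nearest) / len(nearest)
```

This value is then passed, per point, to a DBSCAN-style expansion step in place of the single global ε.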
4 Experiments and Analysis
In this section, we evaluate the performance of AEDBSCAN on artificial datasets [1, 7] and compare it with the performance of the original DBSCAN. For simplicity, all the artificial datasets considered are 2-dimensional. As shown in Figs. 2, 3, 4 and 5, the first dataset DS1 consists of 300 data points and 3 clusters of varying density; global as well as local density variation can be seen in its clusters. The second dataset DS2 is the aggregation dataset, which contains thin bridges between clusters that are not well separated; it has 787 points and 7 different clusters. The third dataset DS3 consists of 524 data points and 21 spherical clusters having small local density variations. The fourth dataset DS4 is also a multi-density dataset, consisting of 398 data points and 7 clusters. As seen in Figs. 2 and 5, these datasets pose the challenge of forming clusters of varying density, whereas the dataset of Fig. 3 poses the challenge of detecting clusters that are not well separated (Fig. 4).
Fig. 1. Flow chart for AEDBSCAN
4.1 Comparison for Dataset DS1
Both DBSCAN and AEDBSCAN form three clusters, as shown in Figs. 6 and 7. In Fig. 7, DBSCAN is not able to keep up with the local density variation found in the second cluster (pink): the data becomes sparser as we move away from the center of the cluster, and DBSCAN marks those points as noise (light gray). However, AEDBSCAN takes the neighborhood of each data point into consideration and efficiently handles the local density variation, yielding three clusters.
Fig. 2. Dataset—DS1
Fig. 3. Dataset—DS2
Fig. 4. Dataset—DS3
Fig. 5. Dataset—DS4
In all figures, the gray circle surrounding each plotted data point illustrates the epsilon neighborhood of that point. For each cluster in Fig. 6, for AEDBSCAN, we can see differently sized epsilon neighborhoods due to the dynamic ε-radius generation.
Fig. 6. AEDBSCAN implementation on DS1
In Fig. 7, for DBSCAN, the epsilon neighborhood is the same size for all points, showing a fixed global density threshold for each data point.
4.2 Comparison for Dataset DS2
Figures 8 and 9 confirm that AEDBSCAN is able to efficiently cluster data whose clusters are not well separated. DBSCAN, with its global density threshold, fails and merges two clusters into one.
4.3 Comparison for Dataset DS3
Figures 10 and 11 clearly show that for AEDBSCAN, the epsilon neighborhoods of data points in closely packed clusters are small (red cluster in the fourth row from the top), whereas for loosely packed or sparsely populated clusters the epsilon neighborhoods are large (red cluster in the second row from the top). Also, upon careful observation and comparison of these two figures, the noise objects (light gray) detected by AEDBSCAN and DBSCAN are different. The noise objects in DBSCAN are the points that do not meet the globally fixed density, whereas in AEDBSCAN noise objects are marked only after considering the density of their neighborhood, i.e., points that are unable to cross their own local density threshold.
Fig. 7. DBSCAN implementation on DS1
Fig. 8. AEDBSCAN implementation on DS2
Fig. 9. DBSCAN implementation on DS2
Fig. 10. AEDBSCAN implementation on DS3
Fig. 11. DBSCAN implementation on DS3
4.4 Comparison for Dataset DS4
Figures 12 and 13 show that, with its global density threshold, DBSCAN fails to detect the multi-density clusters and generates only four clusters, whereas AEDBSCAN, with its dynamically calculated epsilon neighborhoods, generates six clusters. Apart from the visual comparison of the clustering by DBSCAN and AEDBSCAN, the quality of the clusters is also estimated using the average silhouette width score. The results are summarized in Table 1.
5 Conclusion and Future Scope
To conclude, the new algorithm AEDBSCAN reduces the required inputs from two to one and gives better clustering on multi-density datasets due to the dynamic computation of the epsilon radius. Moreover, the visual cues along with the silhouette score show that quality clusters are generated by the new AEDBSCAN algorithm. In the future, AEDBSCAN can be further generalized to work with high-dimensional real datasets. The algorithm can also be optimized to handle noise more efficiently, so that noise points do not form their own small clusters, and it can be made more time-efficient by using appropriate data structures to reduce computation time.
Fig. 12. AEDBSCAN implementation on DS4
Fig. 13. DBSCAN implementation on DS4
Table 1. Comparing DBSCAN with AEDBSCAN using the average silhouette width score

Sr. No | Dataset | DBSCAN Eps | DBSCAN MinPts | AEDBSCAN MinPts | Avg. silhouette score (DBSCAN) | Avg. silhouette score (AEDBSCAN)
1 | DS1 | 1.4 | 10 | 69 | 0.592 | 0.619
2 | DS2 | 3.5 | 5 | 20 | 0.519 | 0.547
3 | DS3 | 1.2 | 3 | 14 | 0.752 | 0.759
4 | DS4 | 0.67 | 2 | 6 | 0.435 | 0.341
References
1. Barton, T.: Clustering-benchmark. GitHub, https://github.com/deric/clustering-benchmark (2015)
2. Dang, S.: Performance evaluation of clustering algorithms using different datasets. IJARCSMS 3, 167–173 (2015)
3. Ghuman, S.S.: Clustering techniques: a review (2016)
4. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn., pp. 445–448 (2012)
5. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise (1996)
6. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis (1987)
7. Udacity: Data Scientist Nanodegree. GitHub, https://github.com/udacity/DSND_Term1 (2018)
Impact of Prerequisite Subjects on Academic Performance Using Association Rule Mining

Chandra Das1, Shilpi Bose1(B), Arnab Chanda1, Sandeep Singh1, Sumanta Das1, and Kuntal Ghosh2

1 Department of CSE, Netaji Subhash Engineering College, Kolkata 700152, India
[email protected]
2 MIU, ISI, Kolkata 700108, India
Abstract. Association rule mining is a popular approach for finding frequent itemsets in a database and then discovering the association rules that exist for those itemsets. It often turns out to be useful for exploring interesting relationships among the data. Students' educational information is one such important area where mining algorithms can be applied to uncover useful hidden information for improving academics. In this regard, association rule mining techniques have been used in the present work to study the importance of prerequisite subjects for the academic results of dependent subjects. The dataset used in this study contains subject-wise semester marks collected over eight semesters for 117 students of the Computer Science and Engineering bachelor course of a university in West Bengal. The study clearly reveals the significant impact of prerequisite subjects on students' academic results in dependent subjects. Keywords: Data mining · Education data mining · Association rule mining · Apriori algorithm
1 Introduction
Data mining [1] is an area of computer science which deals with several computational techniques to extract implicit and valuable information from large data repositories. The last two decades have witnessed the importance of this field increase day by day, and presently it has become an essential part of the process of knowledge discovery in databases (KDD). Data mining techniques have several applications [1, 2] in different fields such as e-commerce, fraud detection, intrusion detection, lie detection, customer relationship management, market basket analysis, telecommunication networks, the banking sector, inventory control, and bioinformatics. Apart from these areas, over the last decade researchers and experts have been using mining techniques widely in the field of education, and as a consequence a new field has emerged, named educational data mining [3, 4]. The main aim of educational data mining [3–5] is to apply mining techniques to extract useful and hidden information from educational datasets generated by various educational institutes in order to improve the education system. The ultimate goal of any © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_21
228
C. Das et al.
educational institution is to improve its students' performance, and to achieve this it is important to understand the students, the environment in which they learn, their needs according to the current scope, improvements to the course curriculum, etc. Using mining technology, it is possible to analyze several aspects of the educational system, such as students' academic performance, their learning growth, their employability, problems in the course curriculum, which lectures benefited the students, which workshops proved to be helpful, or which training proved to be most crucial for students. Among the well-known techniques, association rule mining [1] is a very important mining technique for finding the connection of one or more attributes to another attribute or attributes present in a dataset. Initially, association rules were used to find relationships among items across transactions for market basket analysis [1]. Later, the technique was used in different areas, and one of the promising areas is education. A number of works using association rule mining [6–16] have been performed in educational data mining, including analysis of students' behavior to identify their interest in taking part in different training programs [6, 7], assessment of education content [8–10], students' academic performance analysis [11, 12], finding dropout students [13, 14], and finding the reasons behind employability criteria [15, 16]. Here, in this paper, the importance of the prerequisite subjects of a subject is found using association rule mining techniques by analyzing the results of students in those subjects. A prerequisite of a specific subject is a subject whose knowledge is essential before learning of that particular subject begins. This type of analysis has a number of applications in the area of educational data mining.
Firstly, the analysis results can be used to improve the content of the course curriculum and the ordering of subjects within it, which will be beneficial for improving students' academic performance. Secondly, analysis results can help mentors guide students in selecting appropriate elective subjects according to their interests, which in turn helps with career selection. Moreover, students currently take part in different types of educational competitions, and this type of analysis will be beneficial for selecting a relevant competition according to their capabilities. To find the importance of prerequisite subjects, several analyses have been done [17–19], but this is the first time that association rule mining techniques are used to solve this task. To identify the importance of knowledge of the prerequisite subjects of a particular subject, the academic results of a number of students of a specific engineering discipline in an educational organization are analyzed, and significant results have been found. The paper is organized as follows. In Sect. 2, an overview of data mining, association rule mining, the Apriori-algorithm-based proposed work, and the dataset preparation is given. Section 3 presents the detailed analysis of the results. Finally, the conclusion and future scope are given in Sect. 4.
2 Proposed Work
The objective of this work is to find out the importance of the prerequisite subjects of a specific subject in a course. To exhibit this, the most popular association rule mining technique, Apriori [1], is applied to a dataset containing semester marks (collected
over eight semesters) of a batch of Computer Science and Engineering bachelor course students of a university, to check whether a dependent subject's semester marks have any dependency on the marks of its prerequisite subjects. In other words, our objective is to see whether a student's academic performance in a prerequisite subject affects his/her performance in a dependent subject. This will be very beneficial for improving the course curriculum settings. The following subsections give a brief overview of data mining and association rule mining, and describe the proposed work.
2.1 Overview of Data Mining
Data mining [1] refers to the process of discovering hidden knowledge in databases (KDD), which involves several data management aspects such as data pre-processing, data mining techniques, post-processing of discovered structures, visualization, and online updating. The steps involved in the process of knowledge extraction (KDD) from data are shown in Fig. 1. Some of the mining techniques [1] used in the KDD process are clustering, classification, association rule mining, prediction, etc.
Fig. 1. A brief overview of the steps involved in data mining
2.2 Association Rule Mining
Association rule mining is the process of finding frequent sets of attributes or items which co-occur in a dataset, and then finding rules from those frequent itemsets which satisfy certain constraints. Let I = {O1, O2, O3, …, Om} be the set of all possible items. A ⊆ I is a subset of I, and is considered a z-itemset if it contains z items, where z ≤ m. A transaction TR over I contains two pieces of information: the unique identification number of the transaction TR, and an itemset (associated with TR) which is a subset of I. Let T = {T1, T2, …, Tn} be a set of transactions over I in a database D, and let X and Y be two itemsets over I such that X ⊆ I, Y ⊆ I. The support of an itemset X is the fraction of transactions in the database D that contain the itemset X. It can be represented as:

supp(X) = (number of transactions in database D in which X is present) / (total number of transactions in database D)   (1)
An itemset is called frequent if its support is greater than a minimum support threshold α. An association rule is expressed as X ⇒ Y, with X ≠ ∅, Y ≠ ∅, X ∩ Y = ∅. It signifies that if a transaction contains itemset X then it will also contain itemset Y, with a minimum support S and minimum confidence C. The support of an association rule X ⇒ Y is the support of X ∪ Y (i.e., of transactions containing every item in X and Y); it is taken to be the probability P(X ∪ Y). The confidence of an association rule X ⇒ Y is the fraction of transactions in database D containing X that also contain Y. This is taken to be the conditional probability P(Y|X), i.e., the probability of finding the itemset Y in the transactions in which X is already present. It is calculated using the equation below:

Confidence(X ⇒ Y) = P(Y|X) = support_count(X ∪ Y) / support_count(X)   (2)
The rule is considered confident if its confidence is greater than a minimum confidence threshold β. Association rule mining is thus the process of finding strong association rules from the database D, where strong association rules satisfy both minimum support and minimum confidence. Apriori [1], a popular method for finding association rules from a database, is used here and is discussed below.
2.3 Apriori Algorithm
The Apriori algorithm, proposed by Agrawal and Srikant in 1994 [1], finds frequent itemsets by an iterative search procedure and then derives the corresponding association rules from those frequent itemsets. It works on the set of transactions present in a database. In the first iteration, it counts the occurrence of each item (itemsets of length 1) among all the transactions in the database and selects the frequent items with support ≥ α (the minimum support threshold). In the second iteration, it generates frequent itemsets of length 2 from the frequent itemsets of length 1 generated in the previous iteration. Thus, at the i-th iteration, it first generates the set of candidate itemsets Ci from the frequent itemsets generated at the (i−1)-th iteration, and then selects the frequent itemsets from Ci with support ≥ α. The algorithm terminates when Ci is empty. In each i-th iteration, frequent itemsets of size i are generated. After generating all of the frequent itemsets, all possible rules are created for each frequent itemset. Among those rules, the rules whose support and confidence are greater than or equal to the corresponding minimum support and minimum confidence are treated as valid association rules.
2.4 Proposed Method
The dataset used in this study contains subject-wise semester marks collected throughout the eight semesters of 117 students of the Computer Science and Engineering bachelor
course of a university of West Bengal. The course consists of different categories of subjects, such as humanities, basic science subjects, values and ethics, and core computer-science-related subjects. Only the core subjects, i.e., the subjects with a direct connection to the course, are considered for the analysis; all other categories of subjects are deleted from the dataset. The subjects are (1) Introduction to Computing and Problem-Solving using C, (2) Data Structures and Algorithms, (3) Analog and Digital Electronics, (4) Computer Organization, (5) Formal Language and Automata Theory, (6) Numerical Methods, (7) Computer Architecture, (8) Microprocessors and Microcontrollers, (9) Design and Analysis of Algorithms, (10) Object-Oriented Programming, (11) Database Management Systems, (12) Computer Networks, (13) Operating Systems, (14) Computer Graphics, (15) Software Engineering, (16) Compiler Design, (17) Artificial Intelligence, and (18) Data Warehousing and Data Mining. The resulting dataset consists of the marks of 117 students for 18 subjects, along with a roll number field for every student. Another aspect of this dataset is that the subjects are placed in chronological order according to the syllabus, i.e., in the same order as they appear semester-wise in the syllabus for that course. The description of each column of the dataset is given in Table 1.

Table 1. Description of the dataset

Column name | Description | Possible values
Roll no. | University roll number of students | Any number or string, unique for each row
Subject1 name and code | University marks of all 117 students for Subject1 throughout the course | Marks are given as a grade with a corresponding grade point by the university: O-10 (90–100%), E-9 (80–89%), A-8 (70–79%), B-7 (60–69%), C-6 (50–59%), D-5 (40–49%), F-4 (< 40%)
Subject2 name and code | University marks of all 117 students for Subject2 throughout the course | (as above)
… | … | …
Subject18 name and code | University marks of all 117 students for Subject18 throughout the course | (as above)
Initially, the result of a student for a particular subject is given as the grade (O–F) with its corresponding grade point (10–4) as per the guidelines of the university rules. To make it more precise and convenient for analysis, these seven different categories of grade points are integrated into three different levels. Thus, the grade points of any subject for all students have been converted into three categories: “Good,” “Average,” and “Poor” using a short logic (discussed below). This data transformation is done here so that all students’ performance in each subject can be categorized specifically considering subject weightage. It is already known that all subjects are not equally marks scoring. The subject wise marks of every student vary from subject to subject. It depends on several factors such as subject weightage (some subjects are tough and difficult to understand, while some subjects are easy), the quality of semester question paper (semester question may be tough or tricky or lengthy or simple for a subject), students’ likes or dislikes, etc. For example, if in a particular subject (let Subject 1) due to some reason all students’ grade point remain in the range of 4–6, while in another subject (let Subject 3) the grade point are in the range of 4–10 then the intention is to categorize the grade point 6 as “Good” and 4 as “Poor” for Subject 1 for comparative performance analysis. The categorization procedure is given below: • The minimum and maximum grade points have been found for each subject and stored in variables min and max, respectively. • The differences between the max and min have been divided by 3 and stored in a variable difference. • Now, the grade points are categorized for that subject into three groups as follows: 1. If the grade point lies in the range of [max − difference, max], then it is considered as “Good” 2. If the grade point lies in the range of [min + difference, min − difference], then it is considered as “Average” 3. 
If the grade point lies in the range [min, min + difference), it is considered "Poor".

After this transformation of the data values, a part of the dataset looks as shown in Fig. 2. Here, S1, S2, …, S18 represent subjects and C1, C2, …, C18 represent the corresponding subject codes; each row holds a student's Roll No. (e.g. 1578, 1623, …) followed by the Good/Average/Poor category obtained in each of the 18 subjects.

[Fig. 2 excerpt: rows of Roll No. with Good/Average/Poor entries under columns S1/C1 … S18/C18 (MARKS).]
Fig. 2. Dataset after preprocessing and transforming
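The per-subject binning described above can be sketched in a few lines of Python. This is a minimal illustration under the paper's stated procedure; the function name and the sample grade points are hypothetical, not the authors' actual code.

```python
def categorize(grade_points):
    """Map one subject's grade points (4-10) to Good/Average/Poor.

    Bins are derived per subject from that subject's own min/max,
    so a 6 can be "Good" in a hard subject and "Average" in an easy one.
    """
    lo, hi = min(grade_points), max(grade_points)
    diff = (hi - lo) / 3
    labels = []
    for gp in grade_points:
        if gp >= hi - diff:          # [max - difference, max] -> Good
            labels.append("Good")
        elif gp >= lo + diff:        # [min + difference, max - difference) -> Average
            labels.append("Average")
        else:                        # [min, min + difference) -> Poor
            labels.append("Poor")
    return labels

# Subject 1: grade points confined to 4-6, so 6 becomes "Good".
print(categorize([4, 5, 6, 5]))    # ['Poor', 'Average', 'Good', 'Average']
# Subject 3: grade points span 4-10.
print(categorize([4, 7, 10, 6]))   # ['Poor', 'Average', 'Good', 'Average']
```

A boundary value falling exactly on max − difference is resolved here toward the better category; the paper does not specify tie-breaking, so this is an assumption.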
After the preparation of the dataset, the Apriori algorithm is applied to it for different support count and confidence values to find strong association rules. When the Apriori algorithm is applied to our dataset, every student record is considered as a transaction
Impact of Prerequisite Subjects on Academic Performance
and the performance category (Good, Average, Poor) obtained in every subject is treated as an item. The entire process carried out in our case study is described in Fig. 3.
Fig. 3. Flowchart describing the workflow process
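The transaction/item encoding described above (one transaction per student, one item per subject–category pair) can be illustrated with a toy support/confidence computation in pure Python. The four transactions below are fabricated for illustration only; a real run would use a full Apriori implementation over the 117 student records.

```python
from itertools import combinations

# Each student record is one transaction; each item is a
# (subject, category) pair.
transactions = [
    {("S1", "Good"), ("S2", "Good"), ("S3", "Good")},
    {("S1", "Good"), ("S2", "Good"), ("S3", "Average")},
    {("S1", "Poor"), ("S2", "Poor"), ("S3", "Poor")},
    {("S1", "Good"), ("S2", "Good"), ("S3", "Good")},
]

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

ante = {("S1", "Good")}
cons = {("S2", "Good")}
print(support_count(ante | cons))   # 3
print(confidence(ante, cons))       # 1.0
```

Rules whose support count and confidence exceed the chosen thresholds (e.g. support 32, confidence 90% in the paper) are retained as strong association rules.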
3 Results

In this work, the Apriori algorithm is applied to a dataset consisting of the university marks of 18 selected core subjects of 117 B.Tech CSE students. The algorithm is run on this dataset for different support count values (20–40) and different confidence values (70, 80, 90%). Significant association rules are found for a support value of 32 and a confidence value of 90%. In total, 61 associations have been found between prerequisite subjects and corresponding dependent subjects, where the length of each association varies from 2 to 4. Some of the significant dependencies among prerequisite subjects and their corresponding dependent subjects extracted from the results are listed below:
1. Relationship between: ("Introduction to Computing and problem solving using C," "Numerical Methods")
['Good marks in Introduction to Computing'] => ['Good marks in Numerical Methods']: 97.3
['Average marks in Introduction to Computing'] => ['Average marks in Numerical Methods']: 94.74
['Poor marks in Introduction to Computing'] => ['Poor marks in Numerical Methods']: 97.3
2. Relationship between: ("Introduction to Computing and problem solving using C," "Data Structure and Algorithms")
['Good marks in Introduction to Computing'] => ['Good marks in Data Structure and Algorithms']: 97.3
['Average marks in Introduction to Computing'] => ['Average marks in Data Structure and Algorithms']: 94.74
['Poor marks in Introduction to Computing'] => ['Poor marks in Data Structure and Algorithms']: 94.59
3. Relationship between: (“Introduction to Computing and problem solving using C,” “Object Oriented Programming”) [‘Good marks in Introduction to Computing’] =>[‘Good marks in Object Oriented Programming’]: 97.3 [‘Average marks in Introduction to Computing’] =>[‘Average marks in Object Oriented Programming’]: 94.74 [‘Poor marks in Introduction to Computing’] =>[‘Poor marks in Object Oriented Programming’]: 94.59. 4. Relationship between: (“Formal Language and Automata Theory,” “Compiler Design”) [‘Good marks in Formal Language and Automata Theory’] =>[‘Good marks in Compiler Design’]: 94.74 [‘Average marks in Formal Language and Automata Theory’] =>[‘Average marks in Compiler Design’]: 94.74 [‘Poor marks in Formal Language and Automata Theory’] =>[‘Poor marks in Compiler Design’]: 94.44. 5. Relationship between: (“Introduction to Computing,” “Data Structure and Algorithm,” “Design and Analysis of Algorithm”) [‘Good marks in Introduction to Computing’, ‘Good marks in Data Structure and Algorithm’] =>[‘Good marks in Design and Analysis of Algorithm’]: 97.14 [‘Average marks in Introduction to Computing’, ‘Average marks in Data Structure and Algorithm’] =>[‘Average marks in Design and Analysis of Algorithm’]: 97.3 [‘Poor marks in Introduction to Computing’, ‘Poor marks in Data Structure and Algorithm’] =>[‘Poor marks in Design and Analysis of Algorithm’]: 100.0. 6. 
Relationship between: (“Analog and Digital Electronics,” “Computer Architecture,” “Computer Organisation,” “Microprocessors and Microcontrollers”) [‘Good marks in Analog and Digital Electronics’, ‘Good marks in Computer Organisation’] =>[‘Good marks in Computer Architecture’, ‘Good marks in Microprocessors and Microcontrollers’]: 94.59 [‘Good marks in Analog and Digital Electronics’, ‘Good marks in Computer Organisation’, ‘Good marks in Computer Architecture’] =>[‘Good marks in Microprocessors and Microcontrollers’]: 94.59 [‘Average marks in Analog and Digital Electronics’, ‘Average marks in Computer Organisation’] =>[‘Average marks in Computer Architecture’, ‘Average marks in Microprocessors and Microcontrollers’]: 94.74 [‘Average marks in Analog and Digital Electronics’, ‘Average marks in Computer Organisation’, ‘Average marks in Computer Architecture’] =>[‘Average marks in Microprocessors and Microcontrollers’]: 94.74 [‘Poor marks in Analog and Digital Electronics’, ‘Poor marks in Computer Organisation’] =>[‘Poor marks in Computer Architecture’, ‘Poor marks in Microprocessors and Microcontrollers’]: 94.59
['Poor marks in Analog and Digital Electronics', 'Poor marks in Computer Organisation', 'Poor marks in Computer Architecture'] => ['Poor marks in Microprocessors and Microcontrollers']: 94.59.

The above-mentioned results show a strong connection between the following prerequisite subjects and their corresponding dependent subjects:
• Introduction to Computing and problem-solving using C (prerequisite), Data Structure and Algorithm (prerequisite), and Design and Analysis of Algorithm (dependent)
• Introduction to Computing and problem-solving using C (prerequisite), Data Structure and Algorithm (dependent)
• Introduction to Computing and problem-solving using C (prerequisite) and Object-oriented Programming (dependent)
• Introduction to Computing and problem-solving using C (prerequisite) and Numerical Methods (dependent)
• Analog and Digital Electronics (prerequisite), Computer Organization (prerequisite), Computer Architecture (prerequisite), and Microprocessors and Microcontrollers (dependent)
• Formal Language and Automata Theory (prerequisite) and Compiler Design (dependent)

From these results, it can be said that a student who scores good marks in prerequisite subjects is very likely to score good marks in the corresponding dependent subjects. Similarly, scoring poor marks in a prerequisite subject is likely to affect the dependent subject too. Thus, the algorithm is able to verify that these subjects share a strong bond with one another, establishing their strong interdependency. A point to note is that we have only considered results with uniform quality, such as 'Good,' 'Good,' and 'Good' in three subjects or 'Poor,' 'Poor,' and 'Poor' in those same three subjects, and valid rules are constructed from those itemsets. An itemset with a mixture of 'Good,' 'Poor,' and 'Average' is not considered.
Similarly, if an association exists among three or more subjects, it may give rise to sub-associations in which two or more subjects from the larger group form an association. In that case, the largest group association is taken into consideration. The same principle can be applied to subjects for which no clue of interrelation has yet been found.
4 Conclusion and Future Scope

In this paper, association rule mining is used to identify the correlations and dependencies between a subject and its corresponding prerequisite subjects, which will benefit the academic decision-making process. The Apriori association rule mining technique is applied to students' results to find such dependencies. From the results, it has been found that the generated rules properly identify a subject's corresponding prerequisite subjects. In the future, other association rule mining-based techniques can be applied to identify the prerequisite subjects of a subject, and these methods can be applied to datasets from other disciplines.
A Supervised Approach to Aspect Term Extraction Using Minimal Robust Features for Sentiment Analysis

Manju Venugopalan (1), Deepa Gupta (1), and Vartika Bhatia (2)

(1) Department of Computer Science and Engineering, Amrita School of Engineering, Bengaluru, Amrita Vishwa Vidyapeetham, India. {v_manju,g_deepa}@blr.amrita.edu
(2) Department of Computer Science, Banasthali Vidyapith, Niwai, Tonk, India. [email protected]
Abstract. The instinct to know what others feel lays the foundation for the field of sentiment analysis, which extracts opinion from text data and categorizes it as positive, negative or neutral. Beyond a report of the consolidated sentiment, the end-user is more interested in knowing which product features are talked about and what the sentiment of the opinion holder is towards each feature/aspect, which leads to the task of aspect-level sentiment analysis. In this paper, the focus is on the aspect extraction sub-task of aspect-level sentiment analysis, which extracts the features of the product discussed in the reviews. The experiments are reported on the Bing Liu Customer Review Datasets consisting of five different categories: DVD, Canon, MP3, Nikon and Cell phones. The strength of the model lies in the fact that a simple classifier incorporating the handling of imbalanced data, using a minimal set of robust features, has been able to achieve results comparable with the state of the art in the aspect extraction task. The random forest classifier reported the best results across all domains, with an F-measure ranging from 85.3 to 89.1.

Keywords: Aspect term extraction · SMOTE · Supervised · Machine learning · Minimal features
1 Introduction

Sentiment analysis is a data mining technique used to extract opinion or sentiment from a product review [1–6]. There has been large growth in this sector in the last few years, with more users expressing their views online. The availability of social platforms such as Facebook and Twitter has made it easier for customers to express their opinions on the web. Online product reviews have a huge impact on customers and even influence their perspective of a given product. Products are often subjected to design improvisations based on user reviews. On social media, sentiment analysis helps to gain insight into the brand value of products [7–11]. Such analysis can also help an organization judge potential competitors. The later era of sentiment analysis raised

© Springer Nature Singapore Pte Ltd. 2021
C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_22
the demand for more fine-grained [12] results than consolidated user sentiments. Rather than a binary classification of reviews into positive or negative classes, the demand was for a deeper analysis that identifies the product features discussed in a review and determines the reviewer's sentiment orientation towards each of them. Aspect-level sentiment analysis is one such approach: every aspect or feature discussed in the review is extracted, and the sentiment associated with each of them is determined, giving rise to the sub-tasks of aspect term extraction and sentiment classification. For instance, in the review of a camera quoted as "it has an amazing picture quality but the battery life is very bad", the aspect term extraction sub-task extracts the aspect terms picture quality and battery life. The sentiment classification sub-task assigns a positive polarity to picture quality and a negative polarity to battery life. The current work proposes a supervised learning approach for the sub-task of aspect term extraction. The focus of the work is to identify the discriminative features that are relevant for identifying aspect terms and the best classifiers for the task. The highlights of the proposed work can be summarized as follows:
• Extracting the most discriminating features which are meaningful and convincing for the application
• Handling the imbalanced training data using an efficient balancing technique, the Synthetic Minority Oversampling Technique (SMOTE)
• Identifying a minimal set of robust features by using feature selection techniques
• Comparing the performance of different machine learning classifiers and hence choosing the most efficient classifier for the aspect extraction task
• Comparing the proposed system's performance with state-of-the-art systems that have reported results on the same dataset.
The following subsections of this paper are organized as follows.
Section 2 discusses the prominent works reported on aspect extraction. Section 3 gives a detailed explanation of the proposed approach. Section 4 describes the datasets, evaluation measures and baselines used. Section 5 presents the results with a detailed analysis, and finally, Sect. 6 gives the conclusions and future directions of the work.
2 Related Works Most of the approaches in aspect extraction belong to the genre of supervised or unsupervised/semi-supervised approaches. The current section confines the discussion to research works in these categories and a few works which have reported results on Bing Liu datasets. As the pioneering work in supervised approaches, Li et al. proposed an approach [13] for aspect term extraction on movie reviews working with feature words and opinion words. The dataset consisted of 880 reviews and the results were reported for fivefold cross-validation. Their proposed approach resulted in an improved performance in comparison to their baseline. A CRF-based approach [14] where opinion target extraction is designed as an information extraction task was a different approach in the field. Word
tokens, POS tags, the length of the path in the dependency tree, etc., are the features considered in the supervised model. The algorithm has been evaluated on four different datasets, yielding promising results with an average F-measure of 0.47 across the datasets. Akhtar et al. [15] proposed a unique particle swarm optimization (PSO)-based method for feature selection in aspect extraction. Using a CRF learning framework, they achieved F-measures of 0.81 and 0.72 on the Laptop and Restaurant domains, respectively, from the SemEval datasets. Unsupervised models are always in demand owing to the large volumes of data available online and the labour-intensive task of labelling them at the fine-grained level. The model proposed by Cheng et al. in [16] deals with identifying aspects and rating them based on each aspect and the weights placed on different aspects by the users. For aspect term extraction, they used topic modelling techniques which exploit word co-occurrence patterns. Samaneh and Martin proposed a joint model [17] for aspect term extraction and sentiment classification based on Factorized Latent Dirichlet Allocation (FLDA). Ivan and Ryan used multigrain topic models [18], extensions of LDA and PLSA, to extract ratable aspects from online reviews, showing better results than other topic models. They reported results in terms of ranking loss across the MP3 Player, Hotel and Restaurant domains. An unsupervised approach [19] using seed set expansion for aspect extraction has been attempted on a dataset of reviews collected from Indonesian restaurants. It employs a simple methodology of expanding the seed aspects using word embedding-based similarity, achieving an F-measure of 88.4 for aspect extraction. The proposed work experiments with the task of aspect term extraction on the Bing Liu dataset [20, 21].
This paragraph closely analyses research works that have reported results on the same dataset. Hu and Liu [21] used a combination of techniques (frequent feature generation, compactness pruning and association mining) for extracting the aspect terms. Association mining served as a filtering method to identify the itemsets with which the product names frequently occur and which are hence more likely to be aspects. Compactness pruning was used to filter meaningless multi-word aspects. They reported an average precision of 0.72 and a recall of 0.8 across all five domains. Soujanya et al. [22] used a seven-layer deep convolutional neural network to extract aspects. The inputs to the neural network model were word embedding representations for every word in the sentence, and the output layer had a neuron corresponding to every word. A 300-dimensional vector representation of each word based on word embeddings and POS tags was used as features. In addition, heuristic linguistic patterns and word embedding concepts were used. They reported an average precision of 0.901 and an average recall of 0.86 across all datasets. Bing Liu et al. [23] showed that if a system retains its information after performing aspect extraction, it can be used better by L-CRF (Lifelong learning CRF) than by any other CRF. Experimentation on different datasets showed the effectiveness of the approach, with a maximum F-measure of 0.79 reported on the DVD Player dataset. Most of the research works on aspect extraction in the supervised category are based on the sequence labelling approach. Pure machine learning methods for aspect extraction have not been explored much. The literature does not point to any works that have tried to rectify the issue of class imbalance when the aspect extraction task is designed as
a binary classification problem. The proposed approach attempts to fill this research gap by incorporating data balancing techniques and robust feature selection methods in a supervised approach, unlike most experimentations that have used sequence labelling classifiers. Our approach is novel in that it is a purely supervised approach where the focus is on using a minimal set of striking features and hence determining the best classifier to handle data across multiple domains.
3 Proposed Methodology

The main focus of the proposed work is to devise a supervised methodology to classify every token in a sentence as an aspect or not. The different stages in the proposed supervised approach for aspect term extraction are represented diagrammatically in Fig. 1.

3.1 Pre-processing Module

The training data is subjected to a pre-processing step before features are extracted. The proposed work focuses on explicit aspect term extraction, and hence the sentences in the training data characterized by the presence of explicit aspect terms are filtered. The sentences in the dataset are passed through a spell-check module using the PyEnchant implementation in Python (https://pypi.org/project/pyenchant/). Each sentence in the review text is subjected to word tokenization. The part-of-speech (POS) tag of each word in the sentence is determined using the Stanford Dependency Parser 3.9 (https://stanfordnlp.github.io/CoreNLP/). The output of the POS tagger for a sentence S = (w_1, w_2, ..., w_i, ..., w_n), where n is the number of tokens/words in the sentence, is of the form POS(S) = (pos_1, pos_2, ..., pos_i, ..., pos_n). The noun phrase chunks in the input sentence are extracted in a similar manner.

3.2 Feature Extraction Module

Feature extraction is the process of building derived values from the raw data that are expected to be informative. Optimized feature extraction is the key to effective model construction. The discriminating features which contribute to identifying aspects are extracted from the pre-processed reviews and are listed in the following subsections.

POS of the word. Features or aspects of a product that are reviewed are most likely to be a noun or part of a noun phrase, and hence the POS of the word can help in determining aspect terms. The POS tag of the current word (w_i), pos_i, is extracted from the pre-processed review; this forms the feature for all words in the pre-processed sentence.

Frequent aspect term.
The feature tries to assign a higher weightage to a word that occurs frequently in the training data as an aspect. For instance, if the token canon has occurred at least α times as an aspect in the training data, then its occurrence in test data
[Fig. 1: Training Data → Pre-Processing Module (Spell Check, Tokenization, POS Tagging, Chunking) → Feature Extraction → Data Balancing Module → Feature Selection → Model Building (selection of the best classifier) → Prediction: ASPECT / NON-ASPECT]
Fig. 1. Flow diagram for the proposed methodology
would most likely indicate an aspect. The frequent aspect term is designed as a binary feature taking the value 0 or 1. To determine this feature, we construct a Frequent Aspect List (FAL), the list of all aspects in the training data that have occurred more than the threshold value α times in the training dataset. The feature has value 1 if the current word is part of the frequent aspect list and 0 otherwise, as depicted in Eq. (1). The feature is extracted for all n tokens in the sentence.

$$f_1(w_i) = \begin{cases} 1, & w_i \in FAL \\ 0, & w_i \notin FAL \end{cases} \qquad \forall i,\ i = 1(1)n \tag{1}$$

Similarity-based Frequent Aspect Term. The frequent aspect term feature checks for the occurrence of a word in the frequent aspect list, but in general, aspect terms can take numerous forms. For example, in a restaurant domain, ambience and atmosphere would mean the same, but that associativity is not captured by the previous feature; hence a synonym-based approach better indicates a token's relatedness to an aspect term. Wordnet 3.0 (https://wordnet.princeton.edu/) is a large lexical resource that groups words into synsets based on their semantic and lexical relatedness. Using Wordnet, a list of frequent aspect terms and their noun synonyms is extracted, denoted the Similarity-based Frequent Aspect List (SFAL). Similarly, the current word and its noun synonyms are extracted to
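The construction of the FAL and the binary feature of Eq. (1) can be sketched as follows. The training annotations and the threshold α are toy values for illustration, not the paper's data.

```python
from collections import Counter

# Hypothetical training annotations: (token, is_aspect) pairs.
train_tokens = [("battery", 1), ("battery", 1), ("battery", 1),
                ("screen", 1), ("nice", 0), ("battery", 1)]
ALPHA = 3  # threshold alpha, chosen here for illustration

# FAL: tokens that occurred as an aspect at least ALPHA times.
aspect_counts = Counter(tok for tok, is_aspect in train_tokens if is_aspect)
FAL = {tok for tok, c in aspect_counts.items() if c >= ALPHA}

def f1(word):
    """Binary frequent-aspect-term feature, Eq. (1)."""
    return 1 if word in FAL else 0

print(f1("battery"), f1("screen"), f1("nice"))  # 1 0 0
```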
form another list, symbolized as SW_i. For the current word w_i, if SW_i and SFAL have anything in common, the value of this feature is 1, otherwise 0, as given in Eq. (2).

$$f_2(w_i) = \begin{cases} 1, & SW_i \cap SFAL \neq \emptyset \\ 0, & SW_i \cap SFAL = \emptyset \end{cases} \qquad \forall i,\ i = 1(1)n \tag{2}$$

IsNounPhrase. A word is more likely to be an aspect if it belongs to a noun phrase. To capture this, all the noun phrases in the sentence are identified, symbolized as NPL. The feature has value 1 if the current word is part of any noun phrase in the list NPL and 0 otherwise, as depicted in Eq. (3). The TextBlob package in Python (https://pypi.org/project/textblob/) has been used to extract the noun phrases.

$$f_3(w_i) = \begin{cases} 1, & w_i \in NPL \\ 0, & w_i \notin NPL \end{cases} \qquad \forall i,\ i = 1(1)n \tag{3}$$

Headword of Noun Phrase. This feature identifies the headword of the noun phrase to which the current word belongs. If the current token does not belong to a noun phrase, then the value of this feature is NULL. For example, if the current word is stylish and it belongs to the noun phrase the stylish camera, then the value of this feature is camera, the headword of the considered noun phrase. The feature reflects the higher probability of the headword of a noun phrase being an aspect. The feature is evaluated as depicted in Eq. (4), where NPL is the list of noun phrases in the current sentence.

$$f_4(w_i) = \begin{cases} \text{Headword}(NP), & w_i \in NPL \\ \text{NULL}, & w_i \notin NPL \end{cases} \tag{4}$$

Is-Part-of-Opinion Word. It is not always the case that every noun or noun phrase indicates an aspect. The probability of a term in a noun phrase being an aspect is high when a sentiment is expressed. This feature checks whether an opinion word (a word expressing an opinion or sentiment) is part of the noun phrase to which the current token belongs. If so, the feature takes the value 1, else 0, as shown in Eq. (5). For example, the noun phrase best restaurant contains the opinion word best, and hence restaurant is more likely to be an aspect.
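The noun-phrase features of Eqs. (3) and (4) can be sketched as below. The chunker output NPL is hard-coded here (in the paper it comes from TextBlob), and taking the last token of a chunk as its headword is a simplifying assumption.

```python
# Assumed noun-phrase chunks for one sentence (normally produced by a chunker).
NPL = [["the", "stylish", "camera"], ["battery", "life"]]

def f3(word):
    """1 if the word is part of any noun phrase, else 0 (Eq. 3)."""
    return 1 if any(word in np for np in NPL) else 0

def f4(word):
    """Headword of the containing noun phrase, here approximated as
    the chunk's last token; None (NULL) if outside any chunk (Eq. 4)."""
    for np in NPL:
        if word in np:
            return np[-1]
    return None

print(f3("stylish"), f4("stylish"))  # 1 camera
print(f3("bad"), f4("bad"))          # 0 None
```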
The opinion words in the sentence are identified using the generic opinion lexicons Bing Liu dictionary (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html) and SenticNet (http://sentic.net/). The noun phrases containing opinion words are filtered to form an opinionated noun phrase list, symbolized as ONPL. The feature value is derived as depicted in Eq. (5).

$$f_5(w_i) = \begin{cases} 1, & w_i \in ONPL \\ 0, & w_i \notin ONPL \end{cases} \qquad \forall i,\ i = 1(1)n \tag{5}$$

Distance from Opinion Word. A word that does not belong to a noun phrase gets no advantage from the feature depicted in Eq. (5). This feature determines the distance between the current token and its nearest noun phrase containing an opinion
word. If the word belongs to the noun phrase, then its distance is 0, and if there is no noun phrase present in the sentence, then the value of this feature is 99, a large value indicating the absence of a noun phrase. The closer a word is to a noun phrase containing an opinion word, the more likely it is to be an aspect.

Polarity orientation using PMI. The feature tries to capture any inclination of the current token towards positive or negative reviews. It is calculated as the magnitude of the difference between the PMI values of the current token towards the positive and negative classes, as depicted in Eq. (6), where C1 and C2 represent the positive and negative classes, respectively. Here, PMI (pointwise mutual information) is a statistical measure of the association of a word with positive and negative reviews based on its co-occurrence. PMI(C1) and PMI(C2) measure the inclination of the current token towards the positive and negative classes, respectively, and are calculated based on Eq. (7).

$$PMI(w_i) = |PMI(C_1) - PMI(C_2)| \qquad \forall i,\ i = 1(1)n \tag{6}$$

$$PMI(C_k) = \log \frac{P \times N}{(P + R) \times (P + Q)} \qquad k = 1, 2 \tag{7}$$
where P is the number of instances in which the current token w_i co-occurs with a particular class C_k (C_k being positive or negative), N is the total number of sentences, Q is the number of instances in which the word w_i occurs in a class other than C_k, and R is the number of instances belonging to class C_k which do not contain w_i. All these eight features are extracted for every token in the pre-processed training data, resulting in a feature matrix with a row corresponding to every token to be classified as an aspect or non-aspect. Since stop words are guaranteed not to be aspects, the rows whose current word is a stop word are deleted.
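Equations (6) and (7) can be sketched directly from these definitions. The toy corpus below and the guard against zero counts are assumptions for illustration; the paper does not specify how zero co-occurrence is handled.

```python
import math

# Toy corpus: (sentence tokens, class) with classes "pos"/"neg".
sentences = [({"good", "lens"}, "pos"), ({"lens", "sharp"}, "pos"),
             ({"lens", "blurry"}, "neg"), ({"bad", "grip"}, "neg")]

def pmi_feature(word):
    """|PMI(pos) - PMI(neg)| for a token, following Eqs. (6)-(7)."""
    N = len(sentences)
    scores = {}
    for cls in ("pos", "neg"):
        P = sum(1 for toks, c in sentences if word in toks and c == cls)
        Q = sum(1 for toks, c in sentences if word in toks and c != cls)
        R = sum(1 for toks, c in sentences if word not in toks and c == cls)
        # Guard against zero counts (an assumption; the paper is silent here).
        ratio = (P * N) / ((P + R) * (P + Q)) if P else 0.0
        scores[cls] = math.log(ratio) if ratio > 0 else 0.0
    return abs(scores["pos"] - scores["neg"])

print(pmi_feature("good"))
```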
To ensure that the synthetic instances are generated in the least application-specific manner, the operations are carried out in feature space rather than data space. Synthetic instances are generated along
the line segments joining any/all of the k nearest neighbours belonging to the minority class. Based on the percentage of oversampling required, any or all of the k nearest neighbours of a minority instance are used to generate new synthetic instances.

Feature Selection. Feature selection searches through combinations of features to find the best subset of attributes. It involves an attribute evaluator linked to a search method: the evaluator assigns a value to each feature subset, and the search method decides how the search space is explored. This process reduces model complexity and shortens training time. The Correlation-based Feature Subset Evaluator together with the Best First search method has been chosen for feature selection, to remove redundant and irrelevant features. The Correlation-based Subset Evaluator estimates the value of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them, as measured by Eq. (8), where Score_s is the value assigned by the algorithm to the feature subset, k is the number of features in the subset considered, ρ_fc is the average correlation between the features in the subset and the class, and ρ_ff is the average correlation between the features of the subset. The average correlation ρ is calculated using Eq. (9), where x and y are a feature–feature pair for ρ_ff and a feature–class pair for ρ_fc, and x̄ and ȳ are the mean values of x and y, respectively.

Score_s = kρ_fc / √(k + k(k − 1)ρ_ff)    (8)

ρ(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / √(Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)²)    (9)
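As an illustrative sketch (not the WEKA implementation the experiments rely on), the subset merit of Eq. (8), built on the Pearson correlation of Eq. (9), could be computed as follows; the function names and the assumption of numeric features and class labels are ours:

```python
import numpy as np

def pearson(x, y):
    """Average correlation of Eq. (9): Pearson correlation of two vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def cfs_merit(X, y):
    """Merit of a feature subset in the spirit of Eq. (8):
    Score = k * rho_fc / sqrt(k + k*(k-1) * rho_ff)."""
    k = X.shape[1]
    # average feature-class correlation rho_fc over the subset
    rho_fc = np.mean([abs(pearson(X[:, i], y)) for i in range(k)])
    # average pairwise feature-feature correlation rho_ff
    pairs = [abs(pearson(X[:, i], X[:, j]))
             for i in range(k) for j in range(i + 1, k)]
    rho_ff = np.mean(pairs) if pairs else 0.0
    return k * rho_fc / np.sqrt(k + k * (k - 1) * rho_ff)
```

A subset containing one feature perfectly correlated with the class obtains the maximal merit of 1; adding redundant features inflates ρ_ff in the denominator and drags the score down, which is exactly the trade-off the evaluator exploits.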
The Best First search method explores the space of attribute subsets by greedy hill-climbing augmented with a backtracking facility.

Prediction Using Classifiers. To predict the aspects, four algorithms, namely Naive Bayes, Support Vector Machine, Random Forest and Multilayer Perceptron, are chosen for experimentation. After pre-processing, feature extraction and data balancing, the training data is fed to these machine learning algorithms, and their capability to predict the aspect terms in unseen test data is evaluated. The best-performing classifier is incorporated into the proposed model. The following subsections discuss the datasets used for the experimentations and the trials with the various classifiers to determine the best classifier for the task.
4 Dataset Statistics and Evaluation Measures
The experimentations have been performed on the customer reviews of Hu and Liu 2004 [21] (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), which comprise reviews posted by users on products in the DVD, Canon, Mp3, Nikon and CellPhone domains. The data statistics for the five domains are presented in Table 1. The dataset has been annotated with aspects as well as their polarity orientations
A Supervised Approach to Aspect Term …
245
at the sentence level. The performance of the proposed model is evaluated and compared with baseline systems using the Precision, Recall and F-measure metrics for the "Aspect YES" class, as given in Eqs. (10), (11) and (12), respectively. The model performance is compared with two recent baselines that have reported results on the same Hu and Liu dataset.

Table 1. Data statistics

Dataset     #Sentences   #Explicit aspects
DVD         740          339
Canon       597          253
Mp3         1717         716
Nikon       347          178
CellPhone   546          296

P = #Correctly identified aspect terms / #Words predicted as aspect terms    (10)

R = #Correctly identified aspect terms / #Actual aspect terms    (11)

F1 = 2PR / (P + R)    (12)
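Under the (hypothetical) assumption that the predicted and gold aspect terms are available as sets, Eqs. (10)–(12) amount to the following sketch; the term sets in the usage note are illustrative, not drawn from the dataset:

```python
def aspect_metrics(predicted, actual):
    """Precision, Recall and F1 for the 'Aspect YES' class, Eqs. (10)-(12).

    predicted: set of terms the model labelled as aspects
    actual:    set of gold-standard aspect terms
    """
    correct = len(predicted & actual)  # correctly identified aspect terms
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, predicting {battery, screen, price} against gold aspects {battery, screen, lens, zoom} gives P = 2/3, R = 1/2 and F1 = 4/7.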
Baseline 1: Soujanya Poria et al. 2016 [22]. The model claimed to be the first deep learning approach for aspect term extraction. It follows a seven-layer deep convolutional neural network (CNN) combined with a set of linguistic patterns. A 300-dimensional word-embedding representation of each word and POS tags were used as features, and a set of heuristic linguistic patterns in rule form was integrated with the deep learning classifier.

Baseline 2: Bing Liu et al. 2017 [23]. The model is a CRF-based model for aspect term extraction, termed L-CRF, which trains the CRF classifier to leverage the knowledge gained from previous domains in a new domain. Dependency relations are generalized using components such as the type of the dependency relation, the governor word, the POS tag of the governor word, the dependent word and the POS tag of the dependent word; these dependency relations are mapped into dependency patterns. The work presented both cross-domain and in-domain results; the in-domain results have been taken as the second baseline.
5 Experimental Results and Analysis
The experimentations of the proposed system, with data balancing and feature selection incorporated, are performed on all five datasets and the results are reported for every domain. The pre-processing, feature extraction and classifier comparison phases have been implemented in Python 3. The SMOTE technique has been applied using the supervised filter implementation in WEKA to balance the training data; the oversampling percentage has been chosen such that the two class distributions become approximately balanced, and the parameter k (the number of nearest neighbours) is left at its default value of 5. The results of the experimentation on all four classifiers are reported using ten-fold cross-validation. The following subsections report the experimentation results, incorporating:
• a comparison of the performance of all four considered classifiers
• a detailed analysis of the strength and contribution of each feature across different domains
• a comparison of the proposed model's performance with the considered baselines.
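The SMOTE balancing step used in this setup (the experiments rely on the WEKA filter) can be sketched, idea only, in plain NumPy; the function name and the toy parameters are ours, with k = 5 as in the configuration above:

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples (SMOTE, Chawla et al. [24]).

    Each synthetic point lies on the segment between a minority sample and
    one of its k nearest minority-class neighbours, in feature space."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # pick a minority sample
        j = rng.choice(neighbours[i])          # pick one of its neighbours
        gap = rng.random()                     # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because every synthetic instance is a convex combination of two real minority instances, the oversampled class stays inside the region the minority class already occupies in feature space.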
5.1 Results of Classifier Comparisons Using the Proposed Model and Analysis
The performance of all four classifiers in the proposed system is given in Table 2. It can be observed that the best F-measure is reported by the Random Forest classifier across all the domains. The results on the Nikon dataset are higher than those of all other domains, with best values of 88.7, 89.4 and 89.1 for Precision, Recall and F-measure reported by the Random Forest classifier. The better results on the Nikon dataset can be explained by its less imbalanced nature, owing to the larger proportion of aspects in the training data for this domain, as reflected in Table 1; even though data balancing methods have been employed, there is a limit to the information contained in the generated synthetic instances. SVM has also exhibited superior performance, reporting the second-best results in four out of five domains. All the classifiers experimented with have exhibited consistent performance across all domains, which shows the strength of the model.

A careful analysis of the discriminating features selected by the feature selection approach, as displayed in Table 3, reveals that out of the eight features considered, four to six are the major contributors. The features POS of the word, IsNounPhrase and Polarity Orientation using PMI were major contributors across all five domains; the first two are syntactic features and the third a semantic feature. The features Frequent Aspect Term and Similarity-Based Frequent Aspect Term also made an impact across three domains, where the former captures information from the seed set extracted from the training data and the latter exploits semantic relatedness with the seed set. The Is-Part-of-Opinion-Word feature, however, contributed almost nothing.
Table 2. Proposed approach experimentations using different classifiers

Dataset     Classifier   P%     R%     F1 (%)
DVD         NB           80.5   73.7   76.9
            SVM          83.4   73.9   78.4
            MP           65.4   71.5   68.3
            RF           88.2   86.9   87.5
Canon       NB           71.8   86.3   78.4
            SVM          81.9   80.5   81.2
            MP           70.8   71.4   71.1
            RF           87.2   87.7   87.4
Mp3         NB           84.9   67.5   75.2
            SVM          84.0   64.5   72.9
            MP           51.0   40.2   44.9
            RF           90.3   86.2   88.2
Nikon       NB           82.5   77.4   79.9
            SVM          82.0   78.3   80.1
            MP           84.9   80.3   82.6
            RF           88.7   89.4   89.1
Cellphone   NB           71.1   88.2   78.8
            SVM          71.4   87.5   78.6
            MP           60.8   41.9   49.6
            RF           86.4   84.1   85.3
5.2 Comparison of Proposed Model with Baselines and Its Analysis
The Random Forest classifier showed the best performance in our classifier comparisons, and hence the proposed approach using the Random Forest classifier is taken as the proposed model. Figure 2 depicts a comparison of the proposed approach with the two baselines across the different domains. In spite of the Soujanya Poria et al. 2016 baseline being a complex CNN-based model, our approach has outperformed it on the DVD and Nikon datasets; even on the Canon and Mp3 datasets, the difference in performance is minimal. The proposed model has outperformed Bing Liu 2017 across all four domains for which it reports results (the CRF-based baseline did not report results on the Nikon dataset). In comparison to the baselines, the precision and recall values are more balanced in the proposed approach. The relative difference that the proposed approach achieves over the baselines is represented as ΔF, calculated using Eq. (13),
Table 3. Features selected across different datasets in the feature selection stage
where F1B and F1P are the F-measures of the baseline and the proposed approach, respectively:

ΔF = ((F1P − F1B) / F1B) × 100    (13)

Table 4 gives a quick comparison of the proposed approach with the baselines. The figures in Table 4 clearly suggest that the proposed approach has outperformed the baseline Bing Liu 2017, with ΔF ranging between approximately 8 and 12 points. The proposed model is almost on par with the baseline Soujanya Poria et al. 2016, as even in the losing domains the maximum difference is roughly 5 points. This analysis justifies the performance of the proposed model, a supervised approach using minimal robust features, in comparison to the complex models considered as baselines.
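The relative difference of Eq. (13) is a one-liner; the F-measures in the usage note below are illustrative values, not taken from Table 4:

```python
def delta_f(f1_proposed, f1_baseline):
    """Relative F-measure gain of Eq. (13), as a percentage of the baseline."""
    return (f1_proposed - f1_baseline) / f1_baseline * 100.0
```

For instance, a proposed F-measure of 88.0 against a baseline of 80.0 yields ΔF = 10.0; a negative ΔF indicates a domain where the baseline is ahead.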
6 Conclusion and Future Directions
The proposed approach contributes a model for aspect term extraction in the field of aspect-level sentiment analysis, helping to identify the product features discussed in a review. The proposed supervised approach has shown appreciable performance compared to state-of-the-art systems: the model has outperformed the CRF-based baseline across all domains and the CNN-based baseline in two domains. The model handles the imbalance in the training data using the SMOTE technique. The robust feature selection process incorporated has been able to identify a minimal number of relevant
[Figure 2 (bar chart): Precision (P), Recall (R) and F-measure (F), in %, per domain (DVD, Canon, Mp3, Nikon, Cellphone) for Soujanya Poria et al. (2016), Bing Liu (2017) and the Proposed Approach.]
Fig. 2. Comparison of the proposed approach with considered baselines
Table 4. Relative difference in F-measure (ΔF) of the proposed approach over baseline 1 and baseline 2

Dataset     Poria [22]   Liu [23]
DVD         0.35         10.61
Canon       −1.39        12.19
Mp3         −1.19        12.07
Nikon       4.98         Results not reported
Cellphone   −5.68        8.524
and discriminating features (four to six per domain) that are capable of producing results comparable to the considered baselines. The proposed methodology has also experimented with different classifiers to identify the best classifier for the model: the Random Forest classifier reported the best results across all domains, with an F-measure ranging from 85.3 to 89.1. All the classifiers experimented with exhibited consistent performance across the domains, which confirms the strength of the model. Further analysis of the poor performance of the Multilayer Perceptron, with trials varying the number of hidden layers and of neurons per layer, is left as future work. The approach has been tested across five domains that belong to a similar genre of electronic products; the model still needs to be evaluated on other varied domains such as destination reviews, hotels and kitchen appliances, to ascertain the performance of the minimal features considered. The impact of word-embedding-based features, which efficiently capture semantic information, also needs to be investigated.
References
1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retrieval 2(1–2), 1–135 (2008)
2. Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 271–278 (2004)
3. Liu, B.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5(1), 1–167 (2012)
4. Xu, W., Ying, T.: Semi-supervised target-oriented sentiment classification. Neurocomputing 337, 120–128 (2019)
5. Chen, X., et al.: Adversarial deep averaging networks for cross-lingual sentiment classification. Trans. Assoc. Comput. Linguist. 6, 557–570 (2018)
6. Sanagar, S., Gupta, D.: Adaptation of multi-domain corpus learned seeds and polarity lexicon for sentiment analysis. In: International Conference on Computing and Network Communications (CoCoNet), pp. 50–58 (2015)
7. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: LREC, vol. 10, pp. 433–441 (2010)
8. Kouloumpis, E., Wilson, T., Moore, J.D.: Twitter sentiment analysis: the good the bad and the omg! In: ICWSM, vol. 11, pp. 538–541 (2011)
9. Trung, D.N., Jung, J.J.: Sentiment analysis based on fuzzy propagation in online social networks: a case study on TweetScope. Comput. Sci. Inf. Syst. 11(1), 215–228 (2014)
10. Venugopalan, M., Gupta, D.: Exploring sentiment analysis on Twitter data. In: Eighth International Conference on Contemporary Computing (IC3), pp. 241–247 (2015)
11. Paltoglou, G., Thelwall, M.: Twitter, MySpace, Digg: unsupervised sentiment analysis in social media. ACM Trans. Intell. Syst. Technol. (TIST) 3(4), 1–19 (2012)
12. Guzman, E., Maalej, W.: How do users like this feature? A fine grained sentiment analysis of app reviews. In: 22nd International Requirements Engineering Conference (RE), pp. 153–162 (2014)
13. Zhuang, L., Jing, F., Zhu, X.-Y.: Movie review mining and summarization. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 43–50. ACM (2006)
14. Jakob, N., Gurevych, I.: Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1035–1045. Association for Computational Linguistics (2010)
15. Akhtar, M.S., et al.: Feature selection and ensemble construction: a two-step method for aspect based sentiment analysis. Knowl.-Based Syst. 125, 116–135 (2017)
16. Wang, H., Lu, Y., Zhai, C.X.: Latent aspect rating analysis without aspect keyword supervision. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 618–626. ACM (2011)
17. Moghaddam, S., Ester, M.: The FLDA model for aspect-based opinion mining: addressing the cold start problem. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 909–918. ACM (2013)
18. Titov, I., McDonald, R.: Modeling online reviews with multi-grain topic models. In: Proceedings of the 17th International Conference on World Wide Web, pp. 111–120. ACM (2008)
19. Sasmita, D., et al.: Unsupervised aspect-based sentiment analysis on Indonesian restaurant reviews. In: International Conference on Asian Language Processing (IALP), pp. 383–386. IEEE (2017)
20. Hu, M., Liu, B.: Mining opinion features in customer reviews. In: AAAI, vol. 4, pp. 755–760 (2004)
21. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177. ACM (2004)
22. Poria, S., Cambria, E., Gelbukh, A.: Aspect extraction for opinion mining with a deep convolutional neural network. Knowl.-Based Syst. 108, 42–49 (2016)
23. Shu, L., Xu, H., Liu, B.: Lifelong learning CRF for supervised aspect extraction. In: 55th Annual Meeting of the Association for Computational Linguistics, pp. 148–154 (2017)
24. Chawla, N.V., et al.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
Correlation of Visual Perceptions and Extraction of Visual Articulators for Kannada Lip Reading

M. S. Nandini1(B), Nagappa U. Bhajantri2, and Trisiladevi C. Nagavi3

1 Department of IS & Engineering, NIE Institute of Technology, Mysuru, Karnataka, India
[email protected]
2 Department of CS & Engineering, Government Engineering College, Chamarajanagara, Karnataka, India
[email protected]
3 Department of CS & Engineering, Jayachamaraja College of Engineering, JSS S&T U, Mysuru, Karnataka, India
[email protected]
Abstract. Visual articulators such as the teeth, lips and tongue are correlated with one another, and the correlations among their visual features are extracted as visual perceptions. The term visual perception here denotes the features used as parameters for representation learning and for describing visual information. These visual features are extracted and classified into different classes of Kannada words. The movements of the lips, tongue and teeth are captured by analysing the inner and outer portions of the lips together with the movements of the tongue and teeth; these parts are used jointly for feature extraction, as their features are correlated through resonances. The resonance information is extracted from every frame by analysing the correlations that exist among the articulators across the frame sequence of a video. The proposed visual perception method has yielded an accuracy of 82.83% on a dataset posing several benchmark challenges, including facial tilt, under which the correlation among teeth, tongue and lips may be weaker. We have thus constructed a new methodology for analysing and understanding these visual features. The Kannada words spoken by a person are indicated by assigning labels to the sequence of frames of a video in a specific pattern; once these patterns are extracted from a video, the system recognizes the lip movements as different classes of spoken words.

Keywords: Articulators annotation · Articulators correlation · Kannada words classification · Lip shapes
1 Introduction
The language a person uses depends on the context in which specific words occur [3, 6], and the analysis a person performs to understand any language [3, 9] is, in essence, natural language processing. The system
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_23
Correlation of Visual Perceptions and Extraction
253
is to be trained with machine learning algorithms to detect and analyse facial features, including the shape of the lips and the movements of the teeth and tongue. Visual perception involves feature representation learning, which helps in understanding lip movements and their correlation with the other mouth parts, the tongue and the teeth. The system is trained with visual feature descriptors that support the extraction of visual features and their use in understanding lip movements; the lip movements are then recognized in specific patterns at test time. The words spoken by a person correspond to a specific pattern, and these patterns, together with the features used during training, serve as a metric for recognizing lip movements as different words. Whatever language a person speaks, it can be understood from the lip movements; we have therefore designed a correlation measure of the similarity among lip movements, irrespective of the speaker. Recognition of lip movements for languages such as English has already been addressed in several research articles [15, 18, 19], so a system that understands the lip movements of the Kannada language is needed. Identifying lip movements for Kannada sentences is an important and challenging task, since the approach involves correlating feature information among the lips, teeth and tongue. In particular, when there is a small change in the shape of a lip, those
Fig. 1. Architecture of the proposed visual articulators and training for Kannada Lip reading
254
M. S. Nandini et al.
changes must be noted and compared with ground-truth object shapes so that the different lip shapes can be recognized. The architecture brought out here for visual articulators and for training Kannada lip reading is shown in Fig. 1. The hidden information of spoken language always depends on the correlated behaviour of the lips, tongue and teeth; these facial features depend on one another. Exploiting this hidden information is a challenging and difficult task, as it involves identifying each mouth part in which the dependencies exist whenever the lip movement changes over time. This fact has motivated the present research work towards the objective of identifying the correlation among the lips, tongue and teeth as visual articulators. The remainder of this article is organized as follows: Sect. 2 discusses lip-movement work for English and Kannada; Sect. 3 focuses on datasets and their lip-reading challenges; Sect. 4 presents the strategy of correlated visual perception features for understanding Kannada lip movements; Sect. 5 reports the results and analyses the performance of the method with respect to other contemporary efforts; Sect. 6 discusses the advantages of the work relative to existing methods; and Sect. 7 concludes the article with its contributions.
2 Related Work
An Active Shape Model (ASM) is a shape-constrained iterative fitting algorithm [4, 17]. The shape features considered here are ASM features extracted from an object as per [3, 5–7], also known as a point distribution model (PDM), which is obtained from the statistics of hand-labelled training data [15]. In this work, the point distribution model [10, 12, 14] is taken as the reference model for extracting lip features at every instance. Even when only small changes are observed in a facial feature such as the lip, the shape features are captured by annotating the facial information across the sequence of frames of a video. The conventional iterative algorithm [5] is used to align the set of training shapes. Given the aligned shape models, the mean shape can be calculated, and the axes that describe most of the variance about the mean shape can be determined using principal component analysis (PCA). Recent research papers have not addressed the need to identify facial lip movements for specific languages such as Kannada; since lip movement varies from language to language, the referred works have not contributed to regional languages. Hence, we focus our research attention on identifying lip movements for the Kannada language. In addition, we give importance to understanding lip shapes and the correlation among lips, teeth and tongue as a new way of identifying facial lip movements.
3 Datasets
The dataset consists of frames of videos with facial poses in different directions, as shown in Fig. 2. Even so, the system has been able to detect the facial features of a person and predict the spoken words by applying the proposed algorithm, as shown in Fig. 3.
Fig. 2. The images of a video used for identifying the correlation among articulators for the sentence "Neerannu Tharisu"
Fig. 3. Detected lips of a tilted face and the cropped lip portion are shown in (a) and (b), respectively; similarly, (c) and (d) show the lips together with the articulators detected from the mouth in different frames for the sentence "Neerannu Tharisu"
4 Proposed Extraction of Visual Articulators for Kannada Lip Reading
Consider labelled features of training data in a supervised learning problem that has access to the trained data.

A. Visual Articulators for Kannada Lip Reading
Supervised machine learning has access to training data of the form (x_i, y_i), where x_i is a feature vector corresponding to the feature label y_i assigned during the training phase. The training phase consists of three stages, in which the data extracted from every frame of a video is converted into a feature vector of the form of Eq. (1):

X_i = (x_i1, y_i1, x_i2, y_i2, x_i3, y_i3, ..., x_in, y_in)^T    (1)

Each pair (x_i1, y_i1) contained in the tuple is used to assign the label y_i1 to the data x_i1, where the labels indicate the names associated with the data or information extracted from the visual descriptions in the form of visual articulators. As we wish to extract detailed information for every articulator correlated with the others, it is necessary to relate each feature to the other features:

S(s, θ)[(x_ik, y_ik)] = ((s cos θ)x_ik − (s sin θ)y_ik, (s sin θ)x_ik + (s cos θ)y_ik)    (2)
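The scaled rotation of Eq. (2), applied to a set of landmark coordinates, can be sketched as follows; the function name and the (n, 2) point layout are our assumptions:

```python
import numpy as np

def similarity_transform(points, s, theta):
    """Apply the scaled rotation of Eq. (2) to an (n, 2) array of
    (x, y) landmark coordinates: scale by s, rotate by theta."""
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return points @ (s * rot).T
```

For example, scaling the point (1, 0) by s = 2 and rotating by θ = π/2 yields (0, 2), matching the two component expressions of Eq. (2).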
The terms (s cos θ)x_ik, together with the data, are used to obtain detailed information about the features associated with the labels and their shapes at various angles. The changes in the shape of the lip, along with the teeth and tongue, are correlated at different angles of the patterns. If the patterns of the different articulators match those seen during the training phase, the system understands that the features belong to different classes; otherwise, it recognizes at test time that the features belong to the same class.

M = (x_1 − S(s, θ)[x_2] − t)^T W (x_1 − S(s, θ)[x_2] − t)    (3)
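The weighted shape distance of Eq. (3) can be sketched as below; the interleaved (x, y, x, y, ...) flattening of the shape vectors and the choice of W as a diagonal weight matrix are our assumptions, not stated in the text:

```python
import numpy as np

def shape_distance(x1, x2, s, theta, t, W):
    """Weighted distance M of Eq. (3) between shape x1 and shape x2 after
    x2 is scaled/rotated per Eq. (2) and translated by t.
    x1, x2, t are flattened (x, y, x, y, ...) vectors; W weights residuals."""
    pts = x2.reshape(-1, 2)
    rot = s * np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
    aligned = (pts @ rot.T).ravel()   # S(s, theta)[x2]
    r = x1 - aligned - t              # residual between the two shapes
    return r @ W @ r
```

Two identical shapes with s = 1, θ = 0 and t = 0 give M = 0, and any misalignment increases M according to the weights in W.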
The matrix M of Eq. (3) is measured along different data dimensions to obtain the identity information of an image or of the frames of a video. The information gathered from every frame is used to cross-verify the patterns associated with the labels at different time steps of the frame sequence; the system is thereby trained and is helped in recognizing shapes at test time. The training stage is rigid and depends on the data extracted from the visual articulators, which is essential for measuring the accuracy of recognition of the shapes of the lips, teeth and tongue.

t = (x_t1, y_t1, x_t2, y_t2, x_t3, y_t3, ..., x_tn, y_tn)    (4)
The Eq. (4) represents the information gathered from frames that are used to match the patterns of information more suitable for recognition at varying instance of time. The time is an important parameter that helps in cognizing the pattern that are similar to the classes of operation that is required to be determined. X s = x¯s + Ps Bs
(5)
Ps = (P1 P2 P3 . . . Pn )
(6)
Equations (5) and (6) are used to measure the shape similarity in terms of dice similarity coefficient and Jaccard similarity index in addition to cosine similarity index used to measure the accuracy of recognition of shapes of frames in videos at different instance of time. (7) B p = PpT X p − X¯ p S p = S¯ p + Pp B p
(8)
Equations (7) and (8) represents shapes along with similarity index which is to be calculated from identity element to assign the similarity based shape information. Thereby, the resonance of correlation among different features extracted from frames are used as a part of recognition. (9) B p = PpT X p − X¯ p Mean(m) =
∞ 1 P(i, j) N2 i, j=1
(10)
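The point distribution model behind Eqs. (5)–(8), a mean shape plus PCA modes, with projection B_p = P^T(X_p − X̄_p) and reconstruction S_p = X̄ + P B_p, can be sketched as follows; the function names and the use of SVD for the PCA step are our choices:

```python
import numpy as np

def fit_pdm(shapes, n_modes):
    """Point distribution model: mean shape and principal modes, Eqs. (5)-(6).
    shapes: (m, d) array of aligned, flattened training shapes."""
    mean = shapes.mean(axis=0)
    # principal axes of variation about the mean shape (PCA via SVD)
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:n_modes].T       # P has shape (d, n_modes)

def project(x, mean, P):
    """Shape parameters B_p = P^T (X_p - mean), Eq. (7)."""
    return P.T @ (x - mean)

def reconstruct(b, mean, P):
    """Approximate shape S_p = mean + P B_p, Eq. (8)."""
    return mean + P @ b
```

With enough modes to cover the rank of the centred training set, projecting a training shape and reconstructing it recovers the shape exactly; truncating the modes gives the usual compact, constrained shape representation.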
Some of the statistical features used to measure the correlation of the visual articulators (teeth, tongue, and the inner and outer lip boundaries) serve as attributes for measuring the recognition rate of the shapes and their correlation features at varying instants of time. Equations (9) and (10) express the features whose standard deviation and correlation, in accordance with Eqs. (11) and (12), determine the features that are similar in nature but dissimilar in correlation.

Standard deviation (sd) = √( (1/N) Σ_{i,j=1}^{N} [p(i, j) − m]² )    (11)

Correlation = ( Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} (i · j) P_ij − μ_x μ_y ) / (σ_x σ_y)    (12)
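The statistics of Eqs. (10)–(12) can be sketched over a normalised N × N co-occurrence matrix P; the exact normalisation conventions (1/N² for the mean, 1/N under the square root, and GLCM-style marginal means μ_x, μ_y) are our reading of the garbling-prone source equations:

```python
import numpy as np

def cooccurrence_stats(P):
    """Mean, standard deviation and correlation of a normalised N x N
    co-occurrence matrix P, in the spirit of Eqs. (10)-(12)."""
    N = P.shape[0]
    i, j = np.indices(P.shape)
    m = P.sum() / N**2                             # Eq. (10)
    sd = np.sqrt(((P - m) ** 2).sum() / N)         # Eq. (11)
    # GLCM-style correlation, Eq. (12)
    mu_x, mu_y = (i * P).sum(), (j * P).sum()
    sigma_x = np.sqrt(((i - mu_x) ** 2 * P).sum())
    sigma_y = np.sqrt(((j - mu_y) ** 2 * P).sum())
    corr = ((i * j * P).sum() - mu_x * mu_y) / (sigma_x * sigma_y)
    return m, sd, corr
```

A uniform 2 × 2 matrix (all entries 0.25) gives mean 0.25, standard deviation 0 and correlation 0, the expected behaviour for a texture with no structure.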
1. Grouped Consonants: There are 25 grouped consonants, which are used very often in the Kannada language. We therefore designed the system with low-dimensional feature vectors built from these 25 consonants, each with 50 feature dimensions over the 3 colour channels (R-G-B) of the video data; combined, the grouped consonants contribute a total of 7500 features.
2. Miscellaneous Consonants: A similar set of consonants are the miscellaneous consonants, of which there are 10 types. These 10 consonants are likewise extracted with 50-dimensional features over the 3 R-G-B colour channels, contributing a total of 6000 features. Altogether, 7500 + 6000 = 13,500 feature dimensions are extracted from the frames of a video while the system is being trained.
The algorithm has certain advantages over other contemporary methods [1] that use various performance metrics to measure efficiency. The effort also has limitations: correlation is essentially required to measure the goodness of the approach with respect to the shapes in every frame of a video, and training the system needs video data along with audio to measure the correctness of the strategy.
B. Proposed Algorithm
Visual articulators and their shape-based information are extracted in the following steps.

Algorithm: Visual Perceptions for Kannada Lip Reading
Input: Video with the audio track removed
Description: The system determines the lip movements
Output: Lip movements recognized as Kannada sentences

Begin
Step 1 [Pre-processing]: Separate the video from the audio and remove unwanted information from the video.
Step 2 [Extract visual perceptions]:
  Step 2.1: Calculate the transform of Eq. (2).
  Step 2.2: Substitute the result of Step 2.1 into Eq. (3).
  Step 2.3: Extract the teeth shape along with the tongue shapes as per Eq. (4).
  Step 2.4: Compute the inner and outer lip boundaries as per Eq. (5).
Step 3 [Annotation]:
  Step 3.1: Assign labels to the lip shapes and determine the correlation values as per Eqs. (6) and (7).
  Step 3.2: Compute Eqs. (8) and (9).
Step 4 [Recognition (testing)]:
  Step 4.1: Compute Eq. (10).
  Step 4.2: Compute Eqs. (4) and (5).
  Step 4.3: Recognize words from the shapes of the lips, teeth and tongue.
End
C. Performance Evaluation
The parameters considered for evaluating the performance of the trained system are precision, recall and accuracy; the evaluation of the method with respect to other existing contemporary methods is shown in Table 1.

Precision = TP / (TP + FP)    (13)

Recall = TP / (TP + FN)    (14)

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (15)

The precision and recall of Eqs. (13) and (14) play a significant role in assessing the accuracy of recognizing the movement of the lips.
5 Results and Discussion
The efficacy of the exercise is verified by analysing the results against ground-truth values using metrics such as accuracy (Eq. (15)), precision and recall. The posterior time of the proposed method is also measured so as to assess efficacy, and the average precision of the method proves more precise and accurate than that of the existing methods. Training of the system requires input in the form of a video separated from its audio; the separated video is then processed to recognize the shapes of the lips and the other correlated parts. The training-versus-testing splits used are 90% of the statistical correlation features for training with the remaining 10% for testing, and similarly 85% versus 15%. The time aspects of the evolved criteria with respect to other works have been verified and are shown in the graphical representation of Fig. 6 in addition to the tabular representation; moreover, the work brought out here throws light on measuring posterior time with reference to other existing methods.

Table 1. Analysis of strategy for Kannada lip reading

Sl.no   Method                                               Accuracy (%)
1       Lan et al. [5]                                       68.46
2       Wand et al. [17]                                     52.50
3       Assael et al. [2]                                    74.50
4       Chung and Zisserman [9]                              78.55
5       Proposed visual perception representation learning   82.83
Correlation of Visual Perceptions and Extraction
261
Table 2. Analysis of approach for Kannada lip reading
Methods                        Jaccard similarity index   Dice similarity   Cosine similarity
Lan et al. [5]                 71.2                       69.15             70.1
Petridis et al. [19]           70.50                      71.21             72.5
Chorowski et al. [8]           71.18                      74.7              73.4
Graves [16]                    72.15                      73.41             74.5
King et al. [5]                69.51                      68.53             69.5
Krizhevsky et al. [18]         70.33                      79.83             80.5
Galatas et al. [13]            80.41                      81.32             81.3
Cooke et al. [11]              84.56                      88.92             88.3
Proposed visual articulators   81.36                      81.44             84.3
Table 3. Analysis of posterior time for Kannada lip reading
Sl.no   Folds    Time (ms)
1       Fold 1   65.12
2       Fold 2   64.19
3       Fold 3   78.44
4       Fold 4   79.51
5       Fold 5   78.56
Every video in the dataset is divided into an equal number of frames; the total number of frames extracted from a video is divided into 5 folds, each containing an equal number of frames. The performance has therefore been assessed in terms of posterior time, accuracy, and precision. The precision results obtained from the proposed approach are shown in Fig. 4; the posterior computational results are shown in Table 3, and their graphical representation is depicted in Fig. 5. It is clear from Table 2 that the similarity indexes of the visual articulators are much better than those of existing methods. Further, Table 3 indicates the time elapsed, in milliseconds, for detection and recognition of lip movements. Predicting certain characters remains challenging, as the method focuses on recognizing similar lip shapes for different words: even though some characters are spelt in a similar manner, their meanings differ, so some expressions were very difficult to predict. Similarly, some facial lip expressions must be identified through inherent shape changes; these changes tend to differ and are used to measure the lip shapes.
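The per-video fold construction described above can be sketched as follows; the frame list is a placeholder, not data from the dataset:

```python
# Sketch of the fold construction: the frames extracted from one video
# are partitioned into 5 folds of equal size.

def split_into_folds(frames, k=5):
    fold_size = len(frames) // k
    return [frames[i * fold_size:(i + 1) * fold_size] for i in range(k)]

frames = list(range(100))            # stand-in for 100 extracted frames
folds = split_into_folds(frames, 5)
print([len(f) for f in folds])       # [20, 20, 20, 20, 20]
```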
Fig. 4. Results of precision of visual perception
Fig. 5. Measure of posterior time aspects for prediction of Kannada sentences
The facial changes that take place for every spoken Kannada word, identified in the frames of a video, allow the system to compare the nearest correlation values during classification; thereby a person's lip movements can be determined. The process of determining the words annotated with lip shapes involves methods such as the Active Shape Model. The proposed effort defines a way to recognize certain words in order to recognize Kannada sentences. Precision is the metric used to measure how precisely the lip, teeth, and tongue shape information matches the ground truth results.
The ground truth results are necessary to measure the accuracy of the predicted values. The movement of the lips is also measured with other metrics, including accuracy and the time complexity of the proposed method over different folds of the frames of a video. The time taken by the method to process every frame of a video while annotating the shape-based statistical correlation information is calculated by comparing the evolved method with the ground truth results; thereby the time consumed by the approach, and hence its efficacy, can be verified. Various descriptions, i.e., information correlated with the lips, tongue, and teeth, are combined to measure the recognition accuracy of Kannada words. The performance obtained in the different folds is 83.14% in fold 1, 82.36% in fold 2, 83.46% in fold 3, 83.51% in fold 4, and 81.69% in fold 5, which yields an average accuracy of 82.83% over the entire set of frames in the dataset, as shown in Fig. 6 and Table 1. The dataset considered for lip reading has certain challenges, such as tilted faces with inappropriate lip shapes. The lip shapes of a child are not well defined, as they also involve some tilt or cross variation while speaking certain words. It is therefore necessary for the system to correlate and verify all the necessary information so as to handle any pattern of information.
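The reported average accuracy follows directly from the five per-fold values quoted above:

```python
# Average of the per-fold accuracies reported in the text (folds 1-5).
fold_accuracies = [83.14, 82.36, 83.46, 83.51, 81.69]
average = sum(fold_accuracies) / len(fold_accuracies)
print(round(average, 2))   # 82.83
```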
Fig. 6. Measure of accuracy for prediction
6 Conclusion
The proposed method for recognizing words from lip movement has made a few significant contributions, such as detecting facial features like the lips in the tilted faces of a person. These tilted faces, together with frontal faces, form a contribution toward detecting the lip portion in every frame for feature extraction. The purpose of recognizing lip movements, and thereby the language spoken by a person, is achieved with visual articulators. Future work is directed toward recognizing lip movements even when words are similar in expression but different in meaning. Thus, we developed an approach for recognizing lip movement for the Kannada language.
These research contributions include several new techniques that can be incorporated into other regional languages.
Acknowledgements. The dataset used for this research work was collected from children of the Rotary West and Parent Association of Deaf Children Trust, Bhogadi, Mysuru.
References
1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016
2. Assael, Y.M., Shillingford, B., Whiteson, S., de Freitas, N.: LipNet: sentence-level lipreading. arXiv:1611.01599v2, 2016
3. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR, 2015
4. Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: Advances in Neural Information Processing Systems, pp. 1171–1179, 2015
5. Lan, Y., Harvey, R., Theobald, B., Ong, E.-J., Bowden, R.: Comparing visual features for lipreading. In: International Conference on Auditory-Visual Speech Processing, pp. 102–106, 2009
6. Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: delving deep into convolutional nets. In: Proceedings of BMVC, 2014
7. Chorowski, J., Bahdanau, D., Cho, K., Bengio, Y.: End-to-end continuous speech recognition using attention-based recurrent NN: first results. arXiv:1412.1602, 2014
8. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems, pp. 577–585, 2015
9. Chung, J.S., Zisserman, A.: Lip reading in the wild. In: Proceedings of ACCV, 2016
10. Chung, J.S., Zisserman, A.: Out of time: automated lip sync in the wild. In: Workshop on Multi-view Lip-reading, ACCV, 2016
11. Cooke, M., Barker, J., Cunningham, S., Shao, X.: An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120(5), 2421–2424, 2006
12. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: Proceedings of CVPR, 2016
13. Galatas, G., Potamianos, G., Makedon, F.: Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In: Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 2714–2717. IEEE, 2012
14. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. ACM, 2006
15. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014
16. Graves, A., Jaitly, N., Mohamed, A.-R.: Hybrid speech recognition with deep bidirectional LSTM. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 273–278. IEEE, 2013
17. Wand, M., Koutník, J., et al.: Lipreading with long short-term memory. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6115–6119. IEEE, 2016
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106–1114, 2012
19. Petridis, S., Pantic, M.: Deep complementary bottleneck features for visual speech recognition. In: ICASSP, pp. 504–508, 2016
Automatic Short Answer Grading Using Corpus-Based Semantic Similarity Measurements

Bhuvnesh Chaturvedi and Rohini Basak
Jadavpur University, Kolkata, India
[email protected], [email protected]
Abstract. In this paper, we explore some unsupervised techniques for the task of Automatic Short Answer Grading. Three models are designed for the task at hand, with each model evaluating and grading students' responses individually. The proposed method shows a quite promising correlation between the scores generated by the method and those awarded by human scorers on a standard computer science dataset, with the overall correlation reaching a peak value of 0.805, which outperforms the state-of-the-art results reported on this dataset so far. Keywords: Automatic short answer grading · Semantic similarity · Dependency parser · POS tagger
© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_24

1 Introduction
The task of Automatic Short Answer Grading (ASAG) involves scoring a student answer (SA) based on a given reference (model) answer (RA or MA). Optionally, scoring schemes may also be provided to indicate the relative importance of different parts of the model answer. This is a complex natural language understanding task due to linguistic variations (the same answer could be written in different ways), the subjective nature of assessment (multiple possible correct answers or no correct answer), and the lack of consistency in human rating (non-binary scoring on an ordinal scale within a range). Earlier ASAG work required supervisors to grade key concepts and their variations using concept mapping, or to grade a fraction of student answers to train supervised learning algorithms. Much work has been done in the field of ASAG in recent years. These works can be mainly divided into two categories. The first is the supervised approach, in which human graders first grade a few students' answers, which in turn are used as training data for the algorithm. The second is the unsupervised approach, in which different text similarity measures are used to compare the model answer and students' answers. In this paper, we present an unsupervised approach, which consists of three models that score the students' responses individually. The first model starts with Parts-Of-Speech (POS) tagging of RA as well as SA, after which one word having a certain POS tag from RA is matched with all the words of SA having similar POS tags (for example, a noun from RA is matched with all nouns from SA), and finally the maximum similarity
score is taken. Such a similarity score is obtained for every word of RA, and finally an average of these scores is calculated to come up with the score that is to be awarded to the student. The second model subjects the RA and SA individually to a parser which converts each of them into a set of dependency triples (a triple consists of three parts: the relation, the governor, and the dependent words). The RA and SA triples are matched based on a set of predefined rules in which the relation part of the triple plays a significant role. The Stanford POS tagger and dependency parser are used for the purpose of POS tagging and dependency triple extraction in this model. The third and final model arranges the words of RA and SA in a two-dimensional matrix, with the words of RA (or SA, whichever has the smaller number of words) in rows and the words of SA (or RA) in columns. Once done, the similarity score between a pair of words (one taken from a row and the other taken from a column) is measured. Once all the scores are obtained, the highest score is taken into consideration, after which the row and column to which this score belongs are struck off from being considered again. Then the word pair with the next highest similarity score from the rest of the matrix is considered, and the process is repeated until all the words in the rows are considered. Finally, the average of these scores is taken over the number of rows to compute the final score that is to be awarded to the student. For evaluation, we have calculated the correlation between the scores obtained from the method and the scores awarded by human judges. The higher the correlation, the closer the method's grading is to that of the human judges. The main motivation behind this research is to tap into the field of ASAG, which is still full of possibilities.
ASAG applications are quickly gaining popularity due to the shift of education from classrooms to online or e-learning centers, and this has motivated us to come up with another approach to solve the problem of ASAG. Our research is a small contribution to this vast field, which is still growing and will continue to grow. Our method is not limited to a specific domain and can be used to evaluate responses from any course, be it from humanities, science, commerce, or any other field. Also, no extra burden is put on the grader, as all that needs to be done is to feed the model and students' responses into the method, and the algorithm will take care of the rest of the evaluation process. The rest of the paper is organized as follows. Section 2 discusses some related work reported in the literature on this task. Section 3 presents the detailed description of our method. Section 4 illustrates the working of the proposed models with a running example. The experimental results and the details of the dataset are provided in Sect. 5. Section 6 provides the error analysis of the method. Finally, Sect. 7 draws the conclusion and provides some avenues for future work.
2 Related Work
Many approaches have already been reported in the literature for the task of ASAG. Gomma et al. [1] presented an unsupervised approach which uses text-to-text similarity measures. Different string-based and corpus-based similarity measures were tested separately, and the results were then combined to obtain a final score.
Adams et al. [2] evaluated a variety of representative distributional-semantics-based approaches for the task of unsupervised grading and proposed an asymmetric method based on aligning word vectors that exploits properties of grading tasks. Aligning word vectors means moving the word vectors of one document in the vector space so that they can become word vectors of another document. This method allows words to move to multiple other words when a mismatch in document size occurs. Alotaibi et al. [3] combined different techniques into an approach for the task. In particular, an integrated method was presented which uses Information Extraction (IE) and Machine Learning (ML) techniques. The purpose of the approach was to assign a score depending on the percentage of correctness in the students' response, not just to classify the resultant mark as correct or incorrect. Roy et al. [4] proposed a technique based on the intuition that student answers to a question, as a collection, are expected to share more commonalities than any random collection of text snippets. If these commonalities can be identified from student answers, then they can be used to score the answers. Sultan et al. [5] presented a fast and simple supervised method for the task. From the given RA and SA, they extracted a number of text similarity features and combined them with key grading-specific constructs. Leacock et al. [6] presented the working of an automated short answer marking engine called C-rater. If a set consisting of all the possible correct student responses is considered, then the C-rater scoring engine operates as a paraphrase recognizer that identifies members of this set, i.e., it identifies whether the student response is a paraphrase of the correct concept or not. Basak et al.
[7] presented a method that casts the ASAG task as an RTE (Recognizing Textual Entailment) task, which determines whether the meaning of SA (or RA) is logically entailed by the RA (or SA): the better the entailment, the higher the score to be awarded to the student. To take the entailment decision between the RA and SA, they used the rule set from another of their works [8]. The matching rules are associated with different numerical scores, and the total score to be awarded was calculated in a well-specified manner. We have incorporated the matching rules illustrated in [8] in our method for the purpose of grading the students' answers.
3 The Proposed Method
The proposed method consists of three different models, each scoring the students' responses individually. The pre-processing steps for all the models are the same.
3.1 Pre-processing
Each RA and SA is individually subjected to the pre-processing steps first, which include splitting words joined by a slash or hyphen (e.g., "function/method" becomes "function method" and "run-time" becomes "run time") and removing unnecessary symbols (the dataset contains certain HTML tags which need to be removed). We also removed punctuation marks such as ",", ";", ".", and others, as their absence did not impact the scores generated by the method, but their presence causes the generation of unnecessary triples in
Model 2. After that, both RA and SA are converted to lowercase. The reason is that both Word2vec and fastText produce different similarity scores when the case differs compared to when the case is the same. Once pre-processing is done, the actual method of ASAG is invoked. The three models are explained in the subsequent subsections. One important thing to note is that, for all the models, we have shown the SA being evaluated with reference to the RA for cases where the SA is concrete while the RA is elaborate. However, for some cases, it was found that the RA is concrete and to the point, while the SA is elaborate and contains extra pieces of information. For such cases, we reversed the roles of RA and SA, i.e., we scored the RA with reference to the SA. This is a part of the process in each model and not a separate model in itself, i.e., before the actual method is invoked, the lengths of the RA and SA are checked and the decision is made as to whether the SA is to be scored based on the RA or the RA is to be scored based on the SA. This is applicable to all three models.
3.2 Model 1
The RA and SA are fed into the Stanford POS tagger [9]. As output from this phase, the RA and SA are generated with each word tagged with its corresponding POS (e.g., an input of "at the main function" will produce an output of "at/IN the/DT main/JJ function/NN," where NN signifies singular noun, JJ signifies adjective, IN signifies preposition or subordinating conjunction, and DT signifies determiner). Once the POS tagging is done, the response matching and scoring process starts. Each word with a particular POS tag in RA is compared with every word in SA having an equivalent POS tag (e.g., a word in RA with POS tag NN is matched with every word in SA with the equivalent POS tags NN, NNS, NNP, or NNPS) and the maximum similarity score is considered.
The similarity score between a given pair of words is determined by gensim's Word2vec model and Facebook's fastText model separately, where a 300-dimensional vector of each word is generated in both models and the similarity score is obtained by measuring the cosine of the angle between the two vectors. Once the maximum score is obtained for each word in RA, the overall similarity score is calculated by taking the average of all the maximum scores over the number of words in the RA. This overall similarity score is then multiplied by the full marks of the question to obtain the marks to be awarded to the student.
3.3 Model 2
This model starts with parsing the RA and SA into sets of dependency triples by subjecting each of them individually to the Stanford dependency parser [10]. Once the dependency parsing phase is over, the actual process of matching and scoring starts. The matching operation between the RA-SA triples is performed based on a set of predefined rules [8]. For simplicity, let us consider that the triples generated from RA and SA are of the forms RR (GR, DR) and RS (GS, DS) respectively, where RR and RS signify the relation parts, GR and GS are the governor nodes, and DR and DS are the dependent nodes. Here the ROOT node associated with a triple is ignored. The reason for excluding the root triple is that root is an extra relation generated just for the purpose
of representing the word from which the parse tree originates, and it has no other significance. Its absence does not impact the score, but its presence is found to lower the overall similarity score. The rules are described as follows:
3.3.1 Rule 1
If the SA triple RS (GS, DS) completely matches with an RA triple RR (GR, DR), i.e., RS matches with RR, GS matches with GR, and DS matches with DR, then a full score of 1 is assigned to the RA triple.
3.3.2 Rule 2
If the nodes DS and GS of an SA triple completely match with the nodes GR and DR, respectively, of an RA triple but the relations RS and RR do not match, then for the cases listed in Table 1 (only a few equivalent relation pairs are shown), a full score of 1 is assigned to that RA triple.

Table 1. Based on rules defined in [8]
RR (relation from RA triple)   Equivalent RS (relation from SA triple)
amod                           dobj, nsubjpass, nsubj, nn
rcmod                          nsubj, agent, dobj, nsubjpass
ccomp                          advcl
vmod                           nsubj

3.3.3 Rule 3
If the nodes GS and DS of an SA triple match with the nodes GR and DR, respectively, of an RA triple but the relations RS and RR do not match, then for the few equivalent relation pairs of Table 2, a full score of 1 is assigned to the RA triple.
3.3.4 Rule 4
If the relation RS exactly matches with the relation RR, and either the node pair GS and GR or the node pair DS and DR match exactly with each other, then the nodes DS and DR (or GS and GR) are compared with each other to obtain their similarity score. Then the average of these two scores (1 for the exactly matched node pair and the other obtained by comparing the remaining node pair) is taken, which gives the score that is to be assigned to this RA triple.
Table 2. Based on rules defined in [8]
RR (relation from RA triple)   Equivalent RS (relation from SA triple)
nsubj                          nsubjpass, agent, nn, csubj
amod                           poss, nn, prep_in, appos, partmod
dobj                           nsubjpass, iobj, prep_of
nn                             prep_of, amod, nsubj
3.3.5 Rule 5
There are some relations that, although they seem less important or insignificant, cannot be ruled out and need to be considered. These relations belong to the following set:
{aux, auxpass, cop, det, expl, mark, nn, prt, predet, cc}
If these relations are ignored, a lower score is generated; therefore they should also be taken care of. One thing to note is that this rule is only applicable if both RA and SA contain at least one triple whose relation belongs to the above set. If only RA or only SA contains a triple with a relation from the above set, then this rule is not applied. The words GS and GR and the words DS and DR are compared to obtain their similarity scores. The average of these two scores is taken, which gives the score to be assigned to the RA triple. If an RA triple matches with multiple SA triples, then the maximum of all the scores is considered and assigned to that triple. Once a score is assigned to each RA triple, the overall similarity score is calculated by taking the average of all the assigned scores over the number of triples in the RA. This overall similarity score is then multiplied by the full marks of the question to obtain the marks to be awarded to the SA.
3.4 Model 3
The RA and SA are broken down into their individual component words, which are then arranged in the form of a two-dimensional matrix. Whether the RA or the SA words are arranged in the rows or columns depends on the length, i.e., the number of words, of the answers. If the RA contains fewer words than the SA, its words are arranged row-wise while those of SA are arranged column-wise, and vice versa. After the arrangement, each word from a row is matched with every word in the column to obtain a similarity score. The gensim Word2vec model and Facebook's fastText model used previously are also employed here to obtain the similarity score.
Next, the scoring mechanism starts by taking the highest similarity score from the overall matrix and then striking out the corresponding row and column. For example, suppose a maximum score of, say, 0.98 is obtained at row 2 and column 3 of the matrix. Then this score is considered and it is made sure that none of the other scores belonging to row
2 and column 3 is considered in the next iteration. This is done to make sure that one word from RA is matched to one and only one word of SA. The next maximum score from the rest of the matrix, barring those contained in the struck-out rows and columns, is obtained and the corresponding row and column are crossed off. The entire process is repeated until the maximum score from every row has been considered. Finally, the average of these scores is computed over the number of rows to obtain an overall similarity score, which is then multiplied by the full marks carried by the question to obtain the marks to be awarded to the student. The idea of considering the maximum similarity score is inspired by a greedy approach to the N-Queens problem, where the target is to maximize the score to be awarded to the SA.
3.5 Special Case
Apart from the three models, we included a separate mechanism for scoring the special cases in which the RA or SA consists of only a single word. For example, for certain questions, the RA is simply "push" whereas the SA contains the word "push" or one of its synonyms along with some extra information. For such cases, we compared this single word in RA (or SA, if the roles of SA and RA are reversed) with every word in SA (or RA), took the maximum similarity score, and used it to calculate the score to be awarded to the SA in consideration. However, it is to be noted that this is not a separate model in itself; rather, it is part of all three models and is applied only for the special case mentioned here.
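As a rough sketch of how Model 2's rule-based triple matching might be implemented: the version below covers only simplified forms of Rules 1, 2, and 4, and `word_sim` is a toy stand-in for the Word2vec/fastText similarity, so the resulting scores differ slightly from the paper's (0.5 instead of 0.593 for the SA2-RA2 pair of Sect. 4.2):

```python
# Simplified sketch of Model 2's triple matching (Rules 1, 2 and 4 only;
# the full rule set in [8] is larger). A triple is (relation, governor,
# dependent); word_sim is a toy stand-in for the embedding similarity.

def word_sim(a, b):
    return 1.0 if a == b else 0.0  # toy similarity, not a trained embedding

def score_ra_triple(ra_triple, sa_triples):
    r_r, g_r, d_r = ra_triple
    best = 0.0
    for r_s, g_s, d_s in sa_triples:
        if (r_s, g_s, d_s) == (r_r, g_r, d_r):
            best = max(best, 1.0)                             # Rule 1: exact match
        elif g_s == d_r and d_s == g_r:
            best = max(best, 1.0)                             # Rule 2-style: swapped nodes
        elif r_s == r_r and g_s == g_r:
            best = max(best, (1.0 + word_sim(d_s, d_r)) / 2)  # Rule 4
        elif r_s == r_r and d_s == d_r:
            best = max(best, (1.0 + word_sim(g_s, g_r)) / 2)  # Rule 4
    return best

def model2_score(ra_triples, sa_triples, full_marks):
    per_triple = [score_ra_triple(t, sa_triples) for t in ra_triples]
    return sum(per_triple) / len(per_triple) * full_marks

# Non-ROOT triples from the push/pop running example:
ra = [("cc", "push", "and"), ("conj:and", "push", "pop")]
sa = [("cc", "pop", "and"), ("conj:and", "pop", "push")]
print(model2_score(ra, sa, 5))   # 3.75 with the toy similarity
```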
4 Running Example
Consider the following Q-RA-SA triple taken from Assignment 8 of the dataset:
Q: What are the two main functions defined by a stack?
RA: push and pop
SA: pop and push
The working is shown only with gensim's Word2vec. The approach is the same when using fastText, so its working is not shown here due to space constraints.
4.1 Model 1
The individual words along with their corresponding POS tags for both RA and SA are shown in Table 3. The scores obtained when the model is run with Word2vec are shown in Table 4. Now, "and" from SA matches only one word, "and", in RA, with a similarity score of 1.0, so this score is considered. For "push" from SA, there are two matching words, "push" and "pop", in RA with similarity scores of 1.0 and 0.185 respectively, among which 1.0 is the maximum, so this score is considered. For "pop" from SA also, there are two matching words, "pop" and "push", in RA with respective similarity scores of
Table 3. Words with POS tags
Words in RA   POS tags   Words in SA   POS tags
push          NN         pop           NN
and           CC         and           CC
pop           NN         push          NN
Table 4. Scores obtained with Word2vec
            Pop (NN)   And (CC)   Push (NN)
Push (NN)   0.185      –          1.0
And (CC)    –          1.0        –
Pop (NN)    1.0        –          0.185
1.0 and 0.185, among which 1.0 is the maximum, and this score is considered. Now that all these similarity scores are obtained, the average score is calculated as:

Score(Word2vec) = (1.0 + 1.0 + 1.0) / 3 = 1.0.   (1)

Since the question carries full marks of 5, the grade to be awarded to this SA is calculated as:

Grade(Word2vec) = 1.0 × 5 = 5.0.   (2)
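These Model 1 computations can be reproduced with a short sketch; the `SIM` lookup table below is a stand-in for the Word2vec model (only the 0.185 push/pop value comes from Table 4), and the equivalent noun tags NN/NNS/NNP/NNPS are reduced to exact tag equality:

```python
# Sketch of Model 1 on the running example. SIM stands in for Word2vec;
# equivalent POS tags are simplified to exact tag equality.

SIM = {("push", "push"): 1.0, ("pop", "pop"): 1.0, ("and", "and"): 1.0,
       ("push", "pop"): 0.185, ("pop", "push"): 0.185}

def model1_score(ra, sa, full_marks):
    total = 0.0
    for w_r, tag_r in ra:
        # compare each RA word only against SA words with a matching tag
        candidates = [SIM.get((w_r, w_s), 0.0)
                      for w_s, tag_s in sa if tag_s == tag_r]
        total += max(candidates) if candidates else 0.0
    return total / len(ra) * full_marks

ra = [("push", "NN"), ("and", "CC"), ("pop", "NN")]
sa = [("pop", "NN"), ("and", "CC"), ("push", "NN")]
print(model1_score(ra, sa, 5))   # 5.0
```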
Both the human graders awarded a score of 5 to this response, and the obtained score of 5.0 is the same as the scores awarded by the human judges.
4.2 Model 2
The RA and SA dependency triples of the above Q-RA-SA triple are provided in Table 5, while the matched token pairs, the rules by which they are matched, their similarity scores, and the average similarity scores are provided in Table 6. Since the triple pairs SA1-RA1 and SA4-RA4 are associated with the ROOT node and labeled by the root relation, they are not taken into consideration for the assignment of scores. Here, MR signifies the matching rule number, SS signifies the similarity score of the tokens being matched, and AS signifies the average of the similarity scores between the corresponding governors and dependents of RA and SA, which is the score assigned to the triple. Here every triple of RA is matched to only one triple of SA; therefore obtaining the maximum score for each RA triple is not required, since there is only one.
Table 5. RA and SA dependency triples
RA triples                      SA triples
RA1: root (ROOT-0, push-1)      SA1: root (ROOT-0, pop-1)
RA2: cc (push-1, and-2)         SA2: cc (pop-1, and-2)
RA3: conj:and (push-1, pop-3)   SA3: conj:and (pop-1, push-3)
RA4: root (ROOT-0, pop-3)       SA4: root (ROOT-0, push-3)
Table 6. Assignment of scores
SA triple matched with RA triple   Token pair being matched   MR   SS      AS
SA2-RA2                            push → pop                 4    0.185   0.593
                                   and → and                       1.0
SA3-RA3                            push → push                2    1.0     1.0
                                   pop → pop                       1.0
The average similarity score is calculated as:

Score(Word2vec) = (0.593 + 1.0) / 2 = 0.797.   (3)

The grade to be awarded to the SA is:

Grade(Word2vec) = 0.797 × 5 = 3.985 ≈ 4.0.   (4)
Both the human graders awarded a score of 5 to this response, and the obtained score of 4.0 is quite close to the scores awarded by the human judges.
4.3 Model 3
At first, the RA and SA words are arranged in the form of a two-dimensional matrix. If the RA contains fewer words than the SA, then the RA words are arranged in rows while the SA words are arranged in columns. On the other hand, if the SA contains fewer words than the RA, then the SA words are arranged in rows while the RA words are arranged in columns. Once this arrangement is done, the semantic similarity scores between each RA-SA word pair are calculated and stored in the matrix. The scores obtained when the model is run with Word2vec are arranged in the form of a two-dimensional matrix as shown in Table 7. The approach is the same when using fastText, differing only in the generated similarity scores and the overall score; therefore its working is not shown here. The algorithm to find the highest similarity score starts at row 0 and column 2. Since it contains the overall highest score, with value 1.0, it is taken, and row 0 and column 2 are struck off. The updated matrix is shown in Table 8.
Table 7. Scores obtained with Word2vec
        Pop     And     Push
Push    0.185   0.034   1.0
And     0.047   1.0     0.047
Pop     1.0     0.034   0.185
Table 8. Updated after the first iteration
        Pop     And     Push
Push    –       –       –
And     0.047   1.0     –
Pop     1.0     0.034   –
In the next iteration, the next highest score is taken, which is again 1.0, at row 1 and column 1; it is taken and the corresponding row and column are struck off. In the next and final iteration, the next highest score from the remaining matrix is taken, which is again 1.0, at row 2 and column 0; it is taken and the corresponding row and column are struck off. Since all the rows have been considered, the algorithm stops and the average similarity score is calculated as:

Score(Word2vec) = (1.0 + 1.0 + 1.0) / 3 = 1.0.   (5)

The grade to be awarded to the SA is:

Grade(Word2vec) = 1.0 × 5 = 5.0.   (6)
Both the human graders awarded a score of 5 to this response, and the obtained score of 5.0 is exactly the same as the scores awarded by the human judges.
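The greedy matching procedure walked through above can be sketched as follows. The similarity function is hardcoded here from the Table 7 values; in the actual models it would come from Word2vec or fastText (an assumption of this sketch).

```python
def grade_model3(ra_words, sa_words, sim, full_marks):
    # The shorter answer indexes the rows, the longer one the columns.
    rows, cols = ((ra_words, sa_words) if len(ra_words) <= len(sa_words)
                  else (sa_words, ra_words))
    matrix = {(r, c): sim(r, c) for r in rows for c in cols}
    total = 0.0
    for _ in rows:
        # Take the overall highest remaining score, then strike its row/column.
        (r, c), s = max(matrix.items(), key=lambda kv: kv[1])
        total += s
        matrix = {k: v for k, v in matrix.items() if k[0] != r and k[1] != c}
    # Average similarity, scaled up by the full marks of the question.
    return (total / len(rows)) * full_marks

# Pairwise scores from Table 7 (symmetric lookup).
TABLE7 = {("push", "push"): 1.0, ("push", "and"): 0.034, ("push", "pop"): 0.185,
          ("and", "and"): 1.0, ("and", "pop"): 0.047, ("pop", "pop"): 1.0}
sim = lambda a, b: TABLE7.get((a, b), TABLE7.get((b, a), 0.0))

print(grade_model3(["push", "and", "pop"], ["pop", "and", "push"], sim, 5))  # 5.0
```

Run on the worked example, each iteration picks a 1.0 match, reproducing the grade of 5.0 from Eqs. (5) and (6).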
5 Experiments and Results

Experiments were carried out for all three of our models on a standard computer science dataset consisting of questions, reference answers, and student responses taken from an undergraduate computer science course [10] (web.eecs.umich.edu/~mihalcea/downloads/). Table 9 presents a summary of the dataset used here for experimental purposes. The dataset consists of a total of 12 assignments with varying numbers of questions (9 assignments have 7 questions each, 1 has 4, and 2 have 10). Each question (Q) is provided with a reference answer (RA) as well as a set of student answers (SA). The breakdown of the number of questions, as well as the number of student responses per question and per assignment, is provided in
276
B. Chaturvedi and R. Basak
Table 9. Each student response was evaluated by two human graders, and the average of their scores is also provided. The performance of all three models was tested on the 12 assignments, and the scores produced by each model were compared with the scores of both examiners, as well as with their average, to compute the correlations reported in Tables 10 and 11. Table 10 presents the scores obtained using gensim's Word2vec model trained on the Google News corpus, while Table 11 presents the scores obtained using Facebook's fastText trained on Wikipedia articles. Each table shows the scores for all three of our models. Ex1 and Ex2 indicate the correlation between the scores produced by our method and the scores awarded by Examiner 1 and Examiner 2, respectively; Avg indicates the correlation between the scores produced by the method and the average scores of the two examiners.

Table 9. Statistics of the dataset being used

Assignment no.   No. of questions (Q)   No. of SA per Q   Total no. of SA
1                7                      29                203
2                7                      30                210
3                7                      31                217
4                7                      30                210
5                4                      28                112
6                7                      26                182
7                7                      26                182
8                7                      27                189
9                7                      27                189
10               7                      24                168
11               10                     30                300
12               10                     28                280
Total            87                     –                 2442
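The correlations in Tables 10 and 11 compare model-produced grades against examiner grades. Assuming Pearson's r (the paper does not name the coefficient), a minimal sketch with hypothetical score lists:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation between model scores and examiner scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

model_scores = [5.0, 4.0, 2.5, 3.0]   # hypothetical grades from one model
examiner_avg = [5.0, 4.5, 2.0, 3.5]   # hypothetical averaged human grades
print(round(pearson_r(model_scores, examiner_avg), 3))
```

In practice, each Ex1/Ex2/Avg cell in the tables is one such coefficient computed over all student responses of an assignment.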
As observed from the tables, Models 1, 2, and 3 exhibit their best performances, with correlations of 0.608, 0.434, and 0.646, on Assignments 1, 9, and 4, respectively, using Word2vec on the Google News corpus. With fastText on the English Wikipedia corpus, the three models exhibit their best performances, with correlations of 0.625, 0.601, and 0.666, on Assignments 1, 6, and 1, respectively. The overall correlation over the entire set of assignments was also computed; it reaches its highest value, 0.805, for Model 3 with Word2vec on the Google News corpus. Figures 1 and 2 compare the correlations for the three models using the Word2vec and fastText similarity measures, respectively. For the graphical representation, only the Avg of the scores awarded by the two examiners Ex1 and Ex2 is considered. From Figs. 1 and 2, it is observed that, except for Assignment 9, all three models perform
Table 10. Scores with Word2vec on Google News corpus

             Model 1               Model 2                Model 3
Assignment   Ex1   Ex2   Avg      Ex1    Ex2    Avg      Ex1   Ex2   Avg
1            0.557 0.565 0.608    0.325  0.373  0.377    0.555 0.554 0.602
2            0.218 0.393 0.319    0.041  0.188  0.113    0.224 0.396 0.324
3            0.416 0.426 0.475    0.045  0.045  0.051    0.297 0.211 0.299
4            0.468 0.576 0.546    0.239  0.149  0.223    0.534 0.646 0.619
5            0.548 0.357 0.539    0.032  0.017  0.030    0.532 0.415 0.552
6            0.426 0.426 0.579    0.241  0.405  0.365    0.323 0.475 0.453
7            0.375 0.497 0.468    0.072  −0.075 0.009    0.353 0.437 0.427
8            0.419 0.425 0.463    0.312  0.376  0.365    0.320 0.259 0.331
9            0.421 0.334 0.427    0.411  0.380  0.434    0.521 0.424 0.531
10           0.423 0.383 0.438    0.139  −0.008 0.103    0.294 0.252 0.300
11           0.374 0.434 0.425    0.219  0.265  0.252    0.358 0.486 0.429
12           0.386 0.438 0.425    0.156  0.187  0.169    0.239 0.269 0.251
Overall      0.652 0.785 0.448    0.395  0.466  0.221    0.639 0.805 0.400
Table 11. Scores with fastText on Wikipedia corpus

             Model 1               Model 2                Model 3
Assignment   Ex1   Ex2   Avg      Ex1   Ex2   Avg        Ex1   Ex2   Avg
1            0.625 0.582 0.656    0.483 0.477 0.521      0.632 0.594 0.666
2            0.253 0.371 0.330    0.222 0.397 0.324      0.351 0.478 0.442
3            0.434 0.416 0.484    0.395 0.315 0.413      0.429 0.347 0.451
4            0.524 0.472 0.545    0.319 0.269 0.325      0.535 0.496 0.561
5            0.398 0.346 0.427    0.109 0.068 0.106      0.471 0.444 0.519
6            0.356 0.593 0.536    0.325 0.601 0.520      0.368 0.624 0.559
7            0.421 0.539 0.518    0.309 0.357 0.361      0.386 0.503 0.478
8            0.443 0.429 0.482    0.329 0.281 0.346      0.364 0.342 0.393
9            0.244 0.167 0.239    0.154 0.112 0.153      0.223 0.162 0.221
10           0.460 0.402 0.472    0.408 0.339 0.413      0.491 0.429 0.503
11           0.559 0.561 0.612    0.479 0.531 0.537      0.558 0.577 0.615
12           0.456 0.459 0.488    0.294 0.238 0.282      0.398 0.377 0.414
Overall      0.686 0.783 0.461    0.609 0.713 0.335      0.687 0.801 0.456
Fig. 1. Comparison of correlation using word2vec
Fig. 2. Comparison of correlation using fasttext
better when using fastText instead of Word2vec. The reason is that fastText can build vectors even for words that are not in the corpus, by using the vectors of the components that make up the unknown word. It does so by breaking the word into character n-grams (the value of n is decided by the fastText algorithm, not by us) and then using the vector representations of these n-gram components to build the vector for the word. Hence, the presence of words such as "Enqueue," which is significant in computer science terminology but not in plain English, does not negatively affect the score awarded to the student.
One more thing to note is that the correlation values obtained for Assignments 5 and 7 are lower for Model 2 when Word2vec is used than when fastText is used. The reason lies in the different working principles of Word2vec and fastText. Word2vec uses the whole word to generate its corresponding vector, while fastText breaks the word into a group of n-grams, generates vectors for these individual n-grams, and finally adds them up to obtain a vector for the original word. Therefore, when a word that is not present in the corpus is supplied to the Word2vec model, it produces an erroneous resultant vector, which leads to a lower score when the word is compared with any other word. This is not the case with fastText: even if the input word is not present in its corpus, fastText still breaks the word into a group of n-grams, and these n-grams may be present in other words in the corpus and can thus be used to generate a vector for the word. For example, the word pair (n, array) received a similarity score of 0.010 from Word2vec, while fastText gave the same pair a score of 0.231. Frequent occurrence of such word pairs lowers the overall similarity score, which in turn lowers the grade awarded to the SA. Such cases occur far more often in Assignments 5 and 7 than in the rest of the dataset.
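fastText's subword behavior can be illustrated with a small sketch of its n-gram extraction. The boundary markers `<`/`>` and the default n-gram range of 3 to 6 follow fastText's documented scheme; real fastText additionally hashes the n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps the word in boundary markers before extracting n-grams.
    w = f"<{word}>"
    return {w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

# An out-of-vocabulary term such as "enqueue" still shares subwords with an
# in-vocabulary word like "queue", so fastText can build a usable vector for
# it by summing the n-gram vectors, while Word2vec has no vector for it.
shared = char_ngrams("enqueue") & char_ngrams("queue")
print(sorted(shared))
```

The overlap ("que", "queue", "ueue", and so on) is what lets fastText place "enqueue" near "queue" in the vector space even when "enqueue" never appeared in training.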
6 Error Analysis

Even though our models produce scores quite close to those of the human scorers, in some cases they produce low scores. One such case, taken from Assignment 8 of the dataset, is given below:

Q: What operations would you need to perform to find a given element on a stack?
RA: Pop all the elements and store them on another stack until the element is found, then push back all the elements on the original stack.
SA: StackPush() StackPop() StackIsEmpty()

Here the student has answered the question by giving the names of the operations, i.e., instead of writing "pop an element from the stack" the student has written the operation as the function name "StackPop()." A human scorer easily understands that this function signifies a pop operation; however, the method requires some context to understand this concept. The same goes for the other two functions: "StackPush()" signifies pushing an element onto the stack and "StackIsEmpty()" signifies checking whether the stack is empty. To handle such cases, this context needs to be fed separately into the method in order to explain the concept in some way.
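One hypothetical remedy, not part of the proposed method, is a preprocessing step that expands code-like tokens into natural-language descriptions before grading. The alias table below is invented for illustration.

```python
# Hypothetical alias table: expands function-name tokens in a student answer
# into the operations a human grader would infer from them.
ALIASES = {
    "stackpush()": "push an element onto the stack",
    "stackpop()": "pop an element from the stack",
    "stackisempty()": "check whether the stack is empty",
}

def expand_aliases(answer):
    # Replace each known token; leave everything else untouched.
    return " ".join(ALIASES.get(t.lower(), t) for t in answer.split())

print(expand_aliases("StackPush() StackPop() StackIsEmpty()"))
```

After expansion, the SA contains the words "pop", "push", and "stack", which the similarity matrix can match against the RA.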
7 Conclusions and Future Work

As evident from the correlation between the scores obtained from the proposed method and the scores awarded by the examiners, all three of our models have proven quite effective in assigning scores to the student answers sufficiently close to those awarded by the human examiners. Given a reference answer (RA), our models assign a similarity
score (in the range [0, 1]) to the student answer (SA), which is then scaled up by multiplying it by the full marks of the question (Q). Experiments were carried out with all three of our models on a standard ASAG dataset, and the correlation obtained between our scores and those awarded by the human examiners reveals that our models can competently award student responses scores sufficiently close to those of the human examiners. It is important to note that the performance of our method in terms of correlation surpasses the highest state-of-the-art result reported on this dataset so far. In some cases, we observed that the RA conveys the answer in very few words, i.e., as a gist, while the student conveys almost the same answer in more words by providing some extra information, which may or may not be related to the actual answer. In such cases, if the SA is evaluated with respect to the corresponding RA, this extra information in the SA may not be matched to anything in the RA, resulting in the SA being assigned a lower score than it should actually have been awarded. For such cases, reversing the roles of RA and SA, i.e., evaluating the RA with respect to the SA, is found to be justified, as the models then generate considerably higher scores nearer to the ones awarded by the human examiners. However, this assumes that the extra information provided by the student is correct; penalizing the student response for extra, incorrect, or irrelevant information is currently out of the scope of the proposed work. In the future, we will augment the method with a mechanism to determine whether the extra information is correct in the context of the answer, which will support the decision of whether to penalize the SA.
Apart from the rules used in our method for matching the RA and SA triples, the efficiency of our method is upper bounded by the similarity scores produced by the corpus-based semantic similarity measures, Word2vec and fastText. The similarity scores produced by these modules play a pivotal role in generating scores near those of the human graders; naturally, incorrect scores produced by them adversely impact the performance of the models.
References

1. Gomaa, W.H., Fahmy, A.A.: Short answer grading using string similarity and corpus-based similarity. Int. J. Adv. Comput. Sci. Appl. 3(11) (2012)
2. Adams, O., Roy, S., Krishnapuram, R.: Distributed vector representations for automatic short answer grading. In: Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications, pp. 20–29 (2016)
3. Alotaibi, S.T., Mirza, A.A.: Hybrid approach for automatic short answer marking. In: Proceedings of the 2012 Southwest Decision Sciences Institute Conference (SWDSI), pp. 581–589 (2012)
4. Roy, S., Dandapat, S., Nagesh, A., Narahari, Y.: Wisdom of students: a consistent automatic short answer grading technique. In: Proceedings of the 13th ICON-2016, p. 178 (2016)
5. Sultan, M.A., Salazar, C., Sumner, T.: Fast and easy short answer grading with high accuracy. In: NAACL-HLT, pp. 1070–1075. San Diego, California (2016)
6. Leacock, C., Chodorow, M.: C-rater: automated scoring of short-answer questions. Comput. Humanit. 37(4), 389–405 (2003)
7. Basak, R., Naskar, S.K., Gelbukh, A.: Short-answer grading using textual entailment. J. Intell. Fuzzy Syst. 36(5), 4909–4919 (2019)
8. Basak, R., Naskar, S.K., Pakray, P., Gelbukh, A.: Recognizing textual entailment by soft dependency tree matching. Computación y Sistemas 19(4), 685–700 (2015)
9. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60 (2014)
10. Mohler, M., Bunescu, R., Mihalcea, R.: Learning to grade short answer questions using semantic similarity measures and dependency graph alignments. In: HLT: 49th Annual Meeting of the ACL: Human Language Technologies, vol. 1, pp. 752–762 (2011)
A Productive Review on Sentimental Analysis for High Classification Rates

Gaurika Jaitly(B) and Manoj Kapil

CSE Department, Subharti University, Meerut, India
[email protected], [email protected]
Abstract. Sentiment mining is a key aspect of natural language processing, and sentiment analysis has attracted much attention in recent years. This paper discusses the problem of sentiment polarity tagging, one of the more difficult tasks in opinion/sentiment analysis. The overall practice of sentiment polarity tagging is reviewed, and some recent approaches to sentiment analysis are described in detail. The basic knowledge required to achieve effective sentiment analysis is also covered, along with applications that address sentiment analysis from various aspects. The paper briefly discusses various tactics for computing the polarity of sentiments and opinions, and several supervised practices for opinion mining are examined in terms of their advantages and disadvantages. Keywords: Opinion mining · Classification · Sentiment analysis · Error rate probabilities
1 Introduction

Sentiment refers to an attitude, assumption, or decision stimulated by feelings. Sentiment analysis, also known as opinion mining, deals with people's sentiments toward certain targets. The Web is a practical place to express such subjective information [1]. From a user's perspective, individuals are able to post their own content through various social media, such as forums, microblogs, and social networking sites. From a researcher's perspective, many social media sites release valuable APIs, stimulating data collection and exploration by researchers and developers. For example, Twitter currently provides three different APIs: the REST API, the Search API, and the Streaming API. Through the REST API, developers can gather status information and user statistics; the Search API allows developers to query specific content; and the Streaming API can collect content in real time. Additionally, developers can combine these APIs to build their own applications. Sentiment analysis therefore rests on a strong foundation of abundant online information [2, 3] (Fig. 1).

© Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_25
A Productive Review on Sentimental
283
Fig. 1. Generalized sentiments
Still, these forms of online information have some shortcomings that potentially hinder sentiment analysis. The first flaw is that, since people can freely post their own content, the quality of their opinions cannot be guaranteed. For instance, instead of sharing topic-related opinions, online spammers post spam on social media; some spam is meaningless, while some contains irrelevant opinions, also known as fake opinions. The second flaw is that ground truth for such online data is not always available [4]. A ground truth is essentially a label for a certain opinion, indicating whether the opinion is positive, negative, or neutral. The Stanford Twitter Sentiment corpus is one dataset that has ground truth and is publicly available; it contains about 1.4 million Twitter messages labeled positive, negative, or neutral [5]. Another dataset used in various research institutes is a set of reviews collected from Amazon. The flaws discussed above are somewhat mitigated in this dataset in two ways. First, every product review is checked before it can be posted on the site. Second, each review must carry a rating, which can be used as ground truth. The rating is based on a star scale, where the highest rating is 5 stars and the lowest rating is 1 star.
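A common convention (an assumption here, not specified by the corpus authors) maps this star scale to the three polarity labels:

```python
def star_to_polarity(stars):
    # Conventional thresholds; the exact cut-offs are an assumption,
    # not taken from the reviewed papers.
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"

print([star_to_polarity(s) for s in [5, 3, 1]])  # ['positive', 'neutral', 'negative']
```

Such a mapping turns the star ratings directly into ground-truth labels for training and evaluating sentiment classifiers.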
2 Applications of Sentimental Analysis

1. Bot systems
Sentiment analysis can support conversational agents. If there is a chatbot on a site, it can benefit from sentiment analysis, which helps train the machine to recognize and respond to the customer. For instance, sentiment analysis could detect when a chat needs a human agent, or route a conversation through to the company's team.
284
G. Jaitly and M. Kapil
2. Market Research
Opinion mining supports market research by using customers' opinions about services and aligning service quality and features with their expectations.
3. Product Evaluation
It also helps improve the customer experience; once a customer holds the product, it helps build brand loyalty, and customers can essentially become small influencers for the product. That is why opinion mining is helpful in evaluating the performance of products through sentiment analysis of customer reviews [6].
4. Disaster Management
It helps in crisis management. It involves monitoring customers and what is currently trending on social media, and helps avoid or mitigate the damage of a communication disaster. A crisis might arise from poor product quality, unacceptable customer service, or other serious social problems such as environmental damage, animal cruelty, or child labor practices in emerging markets.
5. Sales Revenue
The main benefit of exploiting sentiment analysis is increased sales revenue. Growth in sales revenue is the final consequence of effective marketing campaigns, better-quality products, and customer service, and can be supported using sentiment analysis [7] (Figs. 2 and 3).
Fig. 2. Generalized opinion cloud
Fig. 3. Generalized automatic sentimental classification model
3 Related Work

This section reviews recent research on sentiment and opinion mining aimed at high-quality evaluation in terms of classification and error rates. Liu [8] worked on sentiment analysis and opinion mining, the field of study that analyzes people's attitudes, feelings, evaluations, appraisals, and emotions from written language. This research belongs to the NLP area and is commonly framed as opinion mining, web data mining, and text mining; it also touches computer science, the management sciences, and society as a whole. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions, and microblogs; for the first time in human history, a huge volume of opinionated data is available for such work. Liu [9] worked on the opinion mining process that captures people's positive or negative views. Much of the existing research on textual data processing has focused on the extraction and retrieval of factual information, such as information retrieval, text classification, clustering, and other text mining and NLP tasks. Little work was done on processing opinions until recently; yet opinions are so important that when people need to make a decision, they want to hear others' opinions. Ahsan et al. [10] proposed concept representations for complex event recognition in images given limited training examples. The authors introduced a novel framework to discover event concept attributes from the web and used them to extract semantic features from images and organize them into social events with little training. Discovered concepts comprise a variety of objects, scenes, activities, and event sub-types, yielding a meaningful and compact representation for event images.
Web images are retrieved for each discovered event concept and used to train CNN feature classifiers. Extensive experiments on challenging event datasets show that their proposed approach performs well. Kenyon-Dean et al. [11] proposed a complicated class of sentiment for grouping such text and argued that its inclusion in the sentiment analysis framework will improve the quality of automated sentiment analysis systems as they are deployed in real-world situations. They motivated this argument by constructing and analyzing a new publicly available opinion mining dataset containing a large number of
tweets, each labeled five times, known as MTSA. They analyzed classifier behavior on the available data in terms of sentiment analysis, model complexity and design, how existing procedures would perform in the real world, and how researchers should handle difficult data. Che et al. [12] mined and classified user sentiment with an LSTM network model and applied it to perinatal depression screening. They used sentiment features for extraction and modeling in text-level sentiment analysis in specific domains and achieved good results, which were essentially consistent with the conclusions of the depression measure. This technique significantly shortens the screening time and reduces doctor-patient cost; it has positive implications for specific sentiment classification tasks and provides a setting for text-level sentiment analysis. Jimenez and Tortajada [13] developed classification models for detecting the likelihood of depression around the birth of a newborn child, thus permitting early intervention, and furthermore built a health application on the best-performing model so that parents who have given birth and clinicians can monitor their tests. Jindal and Singh [14] worked on an image sentiment prediction framework built with a CNN. Specifically, the framework retrains a network trained on large-scale data for object recognition to perform transfer learning. Extensive experiments were conducted on a manually labeled Flickr image dataset. To make use of the labeled records, they employed a progressive strategy of domain-specific fine-tuning of the deep network. The results show that the proposed CNN training can achieve better performance in image sentiment analysis than competing networks. Che et al.
[15] worked on a framework that performs a sentence compression phase before aspect-based sentiment analysis. Unlike previous sentence compression models for common news sentences, it removes sentiment-unnecessary information and condenses a complicated sentiment sentence into one that is simpler and easier to work on. They applied a discriminative conditional random field model, with certain special features, to automatically compress sentiment sentences. On Chinese corpora from four product domains, Sent_Comp significantly improves the performance of sentiment analysis; the features designed for it, especially the potential semantic features, are beneficial for sentiment sentence compression. You et al. [16] conducted their research with CNNs, focusing mainly on a suitable CNN design for image sentiment analysis. They obtained a large number of training samples by using an existing sentiment method to label Flickr images. To make use of such noisily machine-labeled data, they employed a progressive strategy to fine-tune the deep network. Furthermore, they improved the performance on Twitter images by inducing domain transfer with a small amount of manually labeled Twitter images. They conducted extensive experiments on manually labeled Twitter data; the results show that the proposed CNN can achieve better performance in image sentiment analysis than competing approaches. Souma et al. [6] classified news items against the observed standard return as positive or negative sentiment. They used Wikipedia and Gigaword corpus training data from 2015 and applied GloVe (global vectors for word representation) to the corpus to generate word vectors to use as inputs into a deep learning
system. They also examined high-frequency price ticks of individual index stocks over the period. They applied a combination of deep learning techniques, a recurrent neural network with long short-term memory units, to sequence the information, and tested the predictive power of their system on the 2013 news collection data. They found that the predictive accuracy of their procedure improves when they change from a random selection of positive and negative news to picking the news with the highest positive scores as positive news and the news with the highest negative scores as negative news to create their training data set (Table 1).

Table 1. Related work comparison

Liu [8]. Technique: deep learning. Worked on machine learning to recognize classifications based on sentiments and polarity for high recognition rates; used CNN classifier layers to train the system rapidly. Research gap: the system becomes complex due to the structural analysis.

Liu [9]. Technique: linear detectors. Worked on line detectors for medical image processing and on increasing the robustness of the system. Research gap: detection of the patterns is complex.

Ahsan et al. [10]. Technique: LSTM (neural). Worked on a neural training process, a form of supervised learning in which classification error rates are reduced through backpropagation. Research gap: the layered neuron structure needs heavy biasing for low error rates during the training phase.

Kenyon-Dean et al. [11]. Technique: logistic regression. Worked on text mining for sentiment analysis and discussed important issues that introduce complexity into the classification rates. Research gap: complex structure for the classification.

Che et al. [12]. Technique: LSTM (supervised learning). Worked on analyzing emotions based on user sentiments and discussed the corresponding performance evaluation. Research gap: high variance among the weight connections results in high standard deviation.

Jimenez and Tortajada [13]. Technique: Naive Bayes. Worked on emotion detection in cases of depression and implemented a biomedical application to achieve a high classification prediction rate. Research gap: pre-processing and segmentation of the images was challenging.

Jindal and Singh [14]. Technique: CNN (deep learning). Worked on deep learning for sentiment classification and evaluated the classification error rates. Research gap: organization of the CNN layers is somewhat complex.

Che et al. [15]. Technique: discriminant analysis. Worked on sentence compression to analyze sentiments in depth with low error rate probabilities. Research gap: dependency issues promote negative sentences to the true-positive level.

You et al. [16]. Technique: CNN. Worked on image sentiment analysis implemented on Twitter images in a transfer-domain scenario. Research gap: image analysis becomes complex due to dense neural networks.

Souma et al. [6]. Technique: deep learning. Worked on news sentiment analysis for Wikipedia and Gigaword corpus articles and evaluated positive and negative scores on the news. Research gap: normalization of the text was time-consuming due to the large dataset.
4 Methods Used for Opinion Mining

Various opinion mining procedures fall under the practices mentioned below in real-world setups. Supervised learning procedures depend on the existence of labeled training examples, and various types of supervised classifiers appear in the literature. The following subsections describe some of the most frequently used classifiers.

1. Probabilistic classifiers
These classifiers use mixture models for classification. The mixture model assumes that each class is a component of the mixture, and each mixture component is a generative model that gives the probability of sampling a particular term for that component. These classifiers are therefore also known as generative classifiers.

2. Naive Bayes
This is the simplest and most commonly used classifier. The model computes the posterior probability of a class based on the distribution of the words in the document. It works with feature extraction that ignores the position of the words in the text, and uses Bayes' theorem to predict the probability that a given feature
290
G. Jaitly and M. Kapil
set belongs to a particular label:
P(lab | F) = P(lab) × P(F | lab) / P(F)
Here, P(lab) is the prior probability of a label, i.e., the probability that a random feature set carries that label; P(F | lab) is the likelihood that the feature set F occurs given the label; and P(F) is the prior probability of the feature set itself [17].

3. Support Vector Machine (SVM) Classifiers
The chief principle of this classifier is to determine linear separators in the n-dimensional space that can separate the various classes. As can be seen in Fig. 4, there are two classes, denoted x and o, and three hyperplanes, denoted A, B, and C. Hyperplane A offers the best separation between the classes, because the normal distance to the nearest data points is the largest, so it represents the maximum margin of separation.
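As a concrete illustration of the Naive Bayes computation above, a minimal pure-Python sketch; the toy training data and the add-one (Laplace) smoothing are assumptions of this sketch:

```python
from collections import Counter

# Toy training corpus: (text, label) pairs.
train = [("good great film", "pos"), ("great acting", "pos"),
         ("bad boring film", "neg"), ("boring plot", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
priors = Counter()
for text, lab in train:
    priors[lab] += 1
    counts[lab].update(text.split())

vocab = {w for c in counts.values() for w in c}

def posterior(text, lab):
    # P(lab) * prod P(w | lab) with add-one smoothing; P(F) is omitted
    # because it cancels out when comparing labels.
    p = priors[lab] / sum(priors.values())
    denom = sum(counts[lab].values()) + len(vocab)
    for w in text.split():
        p *= (counts[lab][w] + 1) / denom
    return p

def classify(text):
    return max(counts, key=lambda lab: posterior(text, lab))

print(classify("great film"))  # pos
```

The position of each word is ignored, matching the bag-of-words assumption described above.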
Fig. 4. SVM Classification
Text data are ideally suited for SVM classification because of the sparse nature of text, in which few features are irrelevant, but the features tend to be correlated with one another and generally organized into linearly separable categories. An SVM can build a nonlinear decision surface in the original feature space by mapping the data instances non-linearly to an n-dimensional space where the classes can be separated linearly with a hyperplane.
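The separating-hyperplane idea can be illustrated with a perceptron, used here as a simpler stand-in for an SVM (a real SVM additionally maximizes the margin, which requires a quadratic-programming solver); the 2D points below are invented for illustration:

```python
def train_perceptron(points, labels, epochs=20, lr=0.1):
    # Learn weights w and bias b of a separating line w.x + b = 0.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):          # y in {-1, +1}
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:     # misclassified
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

# Class 'o' near the origin, class 'x' farther away (linearly separable).
pts = [(0, 0), (0, 1), (1, 0), (3, 3), (4, 3), (3, 4)]
ys = [-1, -1, -1, 1, 1, 1]
w, b = train_perceptron(pts, ys)
predict = lambda p: 1 if w[0] * p[0] + w[1] * p[1] + b > 0 else -1
print([predict(p) for p in pts])  # [-1, -1, -1, 1, 1, 1]
```

On separable data such as this, the perceptron converges to some separating hyperplane; the SVM of Fig. 4 would pick the one with the widest margin.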
A Productive Review on Sentimental
291
These classifiers are used in many applications, among them classifying reviews according to their quality. Several researchers have presented binary SVM-based approaches, treating the estimation of the quality of information in product reviews as a classification problem. They also adopted an information-quality framework to find feature-related information, working mainly on reviews of digital cameras and MP3 players. Their results showed that the method can accurately classify reviews with respect to their quality and significantly outperforms earlier approaches [18].
4. Decision tree classifiers
These provide a hierarchical decomposition of the training data space in which a condition on an attribute value is used to split the data. The condition is typically the presence or absence of one or more words. The partitioning of the data space is performed recursively until the leaf nodes contain a certain minimum number of records, which are then used for classification. Further variants use the similarity of documents to correlated sets of terms for splitting. Single-attribute splits use the presence or absence of particular words at a particular node to make the split, while similarity-based multi-attribute splits use frequent word clusters and the similarity of the documents to these clusters to perform the split [19].
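The presence/absence split described above can be sketched as a single decision stump chosen by information gain (the toy documents are invented for illustration):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Toy labeled documents: (set of words present, class label)
docs = [({"good", "plot"}, "pos"), ({"great", "cast"}, "pos"),
        ({"bad", "plot"}, "neg"), ({"bad", "cast"}, "neg")]

def gain(word):
    """Information gain of splitting on the presence of `word`."""
    labels = [lab for _, lab in docs]
    with_w = [lab for ws, lab in docs if word in ws]
    without = [lab for ws, lab in docs if word not in ws]
    split = (len(with_w) * entropy(with_w) +
             len(without) * entropy(without)) / len(docs)
    return entropy(labels) - split

vocab = {w for ws, _ in docs for w in ws}
best = max(vocab, key=gain)   # the word whose presence best separates the classes
```

In this toy corpus the word "bad" perfectly separates the two classes, so it maximizes the gain; a full decision tree repeats this choice recursively on each resulting subset.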
5 Research Gaps in Sentiment Analysis
The key reason for the earlier scarcity of research on sentiment and opinion mining is that little opinionated text was available before the World Wide Web (WWW). Before the Web, when a person needed to make a decision, he or she typically asked friends and family for opinions. When an organization wanted to find the views of the general public about its products and services, it conducted opinion polls, surveys, and focus groups. With the Web, however, and especially with the explosive growth of user-generated content on the Internet over the last few years, the world has been transformed. The Web has dramatically changed the way people express their views and opinions. People can now post reviews of products at commercial sites and express their views on almost anything in forums, discussion groups, and blogs, which are collectively known as user-generated content. This online behavior represents a new and measurable source of data with many practical applications. Now, if one wants to buy a product, he or she is no longer limited to asking friends and relatives, because countless product reviews on the Internet give the views of existing users of the product. For a business, it is no longer necessary to conduct surveys, organize focus groups, or employ outside consultants in order to find consumer opinions about its products and those of its competitors, because the user-generated content on the Internet already provides such information and indicates whether the product is good or not. Still, finding opinion sources and monitoring them on the Internet can be a difficult task, because there is a huge number of different sources and each source contains a huge volume of opinionated text. In many cases, opinions are hidden in long posts and blogs. It is difficult for a reader to find relevant sources, extract the related opinionated texts, read them, summarize them, and organize them into usable forms. Thus, automated opinion discovery and summarization systems are needed. Sentiment analysis, also known as opinion mining, is a very active topic these days. It is a challenging natural language processing (NLP) and text mining problem. Because of its tremendous practical applications, there has been explosive growth of both research in academia and applications in industry (Fig. 5).
Fig. 5. Gap analysis (true positive rate 68%, research gap 32%)
As can be noticed, most of the existing work is done on text corpus datasets and revolves around text classification, while very little work has been done on sentiment analysis using medical images. Such work would be a very big advantage for analyzing the condition of patients and would help doctors treat patients at critical stages.
6 Conclusion
In this review paper, we offered a systematic survey of the literature and the common trends in sentiment analysis, using clustering and classification with guided qualitative investigation. The survey traced the history of sentiment analysis, estimated the impact of sentiment analysis and its developments, covered the communities working on sentiment analysis and their most popular approaches, revealed which scenarios have been considered in the analysis of opinions and sentiments, and studied the original mechanisms and reviews in opinion and sentiment analysis. We found that research on sentiment analysis and opinion mining has studied public opinion in depth in the twenty-first century. The most valuable papers published in reputed journals are also compared and discussed, along with the drawbacks and advantages of the techniques used in each. It can be noticed that most of the work addresses text classification in terms of sentiments or opinions, while very little work considers images. In our proposed work, therefore, the main focus is sentiment analysis using image processing, and the results are evaluated in terms of high recognition rates and low error-rate probabilities.
References 1. Chen, T., Xu, R., He, Y., Wang, X.: Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Exp. Syst. Appl. 72, 221–230 (2017) 2. Baziotis, C., Pelekis, N., Doulkeridis, C.: Datastories at semeval-2017 task 4: Deep lstm with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754 (2017) 3. Cambria, E., Das, D., Bandyopadhyay, S., Feraco, A.: A Practical Guide to Sentiment Analysis. Springer International Publishing, Cham, Switzerland (2017) 4. Giatsoglou, M., Vozalis, M.G., Diamantaras, K., Vakali, A., Sarigiannidis, G., Chatzisavvas, K.C.: Sentiment analysis leveraging emotions and word embeddings. Exp. Syst. Appl. 69, 214–224 (2017) 5. Yadollahi, A., Shahraki, A.G., Zaiane, O.R.: Current state of text sentiment analysis from opinion to emotion mining. ACM Comput. Surv. (CSUR) 50(2) (2017) 6. Souma, W., Vodenska, I., Aoyama, H.: Enhanced news sentiment analysis using deep learning methods. J. Comput. Soc. Sci. 1–14 (2019) 7. Miner, G., Elder, J., Fast, A., Hill, T., Nisbet, R., Delen, D.: Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Academic Press (2012) 8. Lomas, A., Bee, J.L., Hextall, F.B.: A systematic review of worldwide incidence of nonmelanoma skin cancer. Br. J. Dermatol. 166(5), 1069–1080 (2012) 9. Liu, B.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol.s 5(1), 1–167 (2012) 10. Ahsan, U., Sun, C., Hays, J., Essa, I.: Complex event recognition from images with few training examples. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 669–678. IEEE (2017) 11. Dean, K.K., Ahmed, E., Fujimoto, S., Filteau, J.G., Glasz, C., Kaur, B., Lalande, A. et al.: Sentiment analysis: it’s complicated! 
In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long Papers), pp. 1886–1895 (2018) 12. Chen, Y., Zhou, B., Zhang, W., Gong, W., Sun, G.: Sentiment Analysis Based on Deep Learning and Its Application in Screening for Perinatal Depression. In: 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), pp. 451–456. IEEE (2018) 13. Serrano, J., Tortajada, S., Gómez, J.M.G.: A mobile health application to predict postpartum depression based on machine learning. Telemed. e-Health 21(7), 567–574 (2015) 14. Jindal, S., Singh, S.: Image sentiment analysis using deep convolutional neural networks with domain specific fine tuning. In: 2015 International Conference on Information Processing (ICIP), pp. 447–451. IEEE (2015)
15. Che, W., Zhao, Y., Guo, H., Su, Z., Liu, T.: Sentence compression for aspect-based sentiment analysis. IEEE/ACM Trans. Audio, Speech, Lang. Process. 23(12), 2111–2124 (2015) 16. You, Q., Luo, J., Jin, H., Yang, J.: Robust image sentiment analysis using progressively trained and domain transferred deep networks. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015) 17. Li, T., Li, J., Liu, Z., Li, P., Jia, C.: Differentially private Naive Bayes learning over multiple data sources. Inf. Sci. 444, 89–104 (2018) 18. Al-Smadi, M., Qawasmeh, O., Al-Ayyoub, M., Jararweh, Y., Gupta, B.: Deep Recurrent neural network versus support vector machine for aspect-based sentiment analysis of Arabic hotels’ reviews. J. Comput. Sci. 27, 386–393 (2018) 19. Saad, M.K., Ashour, W.M.: Arabic text classification using decision trees. Arabic Text Classif. Using Decis. Trees 2 (2010) 20. Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.P.: Tensor fusion network for multimodal sentiment analysis (2017). arXiv:1707.07250 21. Maynard, D., Bontcheva, K., Rout, D.: Challenges in developing opinion mining tools for social media. In: Proceedings of the@ NLP, pp. 15–22 (2012)
A Novel Approach to Optimize Deep Neural Network Architectures

Harshita Pal and Bhawna Narwal
Department of IT, IGDTUW, Delhi, India
[email protected], [email protected]
Abstract. Deep Neural Networks (DNNs) have generic layers and have emerged as a strong Machine Learning model, as they use a different approach for classifying objects and can also learn very complex models. This paper provides an analysis of various related approaches (model optimization techniques such as MobileNets, MorphNet, SqueezeNet, and so on) by which existing deep neural network architectures can be optimized. To aid the proposed optimization approach, we built a tool on CO-LAB to deeply understand the layer structure of each model and to visualize the models closely using feature maps, which highlight each feature clearly and visibly. The Keras and TensorFlow APIs are also used to understand model building. Keywords: CNN · Convolution · Deep neural networks · Image analysis · Visualization
1 Introduction
A fast progression toward Machine Learning (ML) and Artificial Intelligence (AI) is making images important data for these fields. With the advent of camera phones, images are easy to generate and handle, and are exactly the right type of data to feed into an ML model. It is easy for the human mind to identify an image but not for computers, and this is why deep neural networks came into the picture. In simple words, training a computer to identify a given image is possible only with the help of Deep Neural Networks (DNNs) [1]. Nowadays, this process of object detection is accelerated by the use of Graphics Processing Units (GPUs), which help minimize model training time. It is worth noting that object detection involves complex algorithms, but with Convolutional Neural Networks (CNNs) it is possible to train computers for object detection like humans. CNNs also allow mobile devices to perform training and object detection with fewer computations at much faster speed. The power of a CNN lies in object identification, as it extracts important features from the given visuals and works on the minute details of the image [2, 18–23]. It identifies sub-images from the given image, such as extracting a car, traffic lights, a tree, a human being, a zebra crossing, etc., and then combines these sub-images to form a larger image and identify it. The basic CNN architecture involves an input layer, feature extraction layers (from 1 to N), and an output layer (or classification layer), as depicted in Fig. 1. © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_26
296
H. Pal and B. Narwal
Fig. 1. General CNN architecture
2 Related Work
DNNs have generic layers and have emerged as a strong ML model, as they use a different approach for classifying objects and can also learn very complex models. In this section, some of the related works are presented. Szegedy et al. [1] used a multi-scale inference technique and presented a simple yet effective way of detecting objects (large in number and of different sizes) in an image at high resolution and with precise localization. The authors used a regression model and were successful in object identification (finding the bounding boxes of multiple objects and of their parts) with a limited number of computing resources; at the time of training the network, the focus is on every object and its mask. Fan et al. [2] came up with a new learning technique to track humans based on CNNs. Using offline training, the authors tracked key points and made the model learn spatial as well as temporal structures from given image pairs (adjoining frames). They estimated the location and scale of an object using the scale and location in the previous image together with the current and previous images. The proposed technique can be applied to objects other than humans as well. They designed a different CNN that tries to eliminate similar-object detections in an unorganized target environment. Howard et al. [3] explained the DNN architectures used on mobile phones, called MobileNets. They described a different type of convolution operation for MobileNets in order to perform faster computations as well as to reduce the model size so that it works on mobile phones, explaining how the convolutions are broken down into a depthwise convolution and a pointwise convolution. MobileNets work effectively for object identification, fine-grained categorization, face-attribute classification, and geo-localization.
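The depthwise/pointwise factorization described for MobileNets [3] can be illustrated by comparing parameter counts; the layer sizes below are assumed for illustration, not taken from the paper:

```python
def conv_params(k, m, n):
    """Parameters of a standard k x k convolution: M input, N output channels."""
    return k * k * m * n

def separable_params(k, m, n):
    """Depthwise k x k conv (one filter per input channel) + 1 x 1 pointwise conv."""
    return k * k * m + m * n

k, m, n = 3, 64, 128              # assumed kernel size and channel counts
std = conv_params(k, m, n)        # 3*3*64*128 = 73728 parameters
sep = separable_params(k, m, n)   # 3*3*64 + 64*128 = 8768 parameters
ratio = std / sep                 # roughly 1/n + 1/k^2 savings factor
```

For this configuration the factorized form uses more than eight times fewer parameters, which is the source of the speed and size gains on mobile devices.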
Gordon et al. [4] presented MorphNet for DNNs, which is simple to use and works fast with good performance in resource-constrained environments. The procedure is highly scalable and works well on large datasets, and it can accommodate increasing as well as decreasing network size without performance degradation. Romera et al. [5] came up with ERFNet, in which the residual layer was redesigned to make the DNN architecture run in real time. The technique makes use of residual connections and factorization of the convolution operation; this layer keeps the original accuracy of the model intact. Freeman et al. [6] worked on the efficiency of CNNs by reducing the model sizes of MobileNet and ShuffleNet. They tried to fill the gap with their model EffNet, which efficiently reduces the computations, gives faster inference, and maintains accuracy. Sandler et al. [7] proposed a new mobile architecture, MobileNetV2, that performs object detection through the SSDLite framework, uses Mobile DeepLabv3 to build the model, described the importance of removing non-linearities in narrow layers, and computed the trade-off between precision and the number of operations, measured through additions and multiplications along with latency and parameters. Rastegari et al. [8] proposed simple and effective methods for binary approximations of neural networks: (1) Binary-Weight-Networks, in which filters contain binary values, thus reducing memory consumption, and (2) XNOR-Networks, in which both the filters and the inputs to convolutional layers contain binary values. XNOR-Net performs convolutions using binary operations and is 58 times faster than existing convolutional networks. Giusti et al. [9] explained how the use of dynamic programming can speed up image classification, fast object identification, and segmentation. Miikkulainen et al. [10] proposed CoDeepNEAT, which builds a real-world application for automated captioning of images on a magazine website; the work made deep learning applicable in wider fields such as vision, language, and speech. Geng et al. [11] proposed a quantization scheme that reduces computation and storage by using bit-shifting and rounding operations.
Tests were performed on datasets such as ImageNet and KITTI, and the results revealed that the scheme achieves a good level of accuracy with lower hardware cost. Xu et al. [12] proposed a Co-CNN framework and explained that network optimization is an isolated learning process. The main motive behind this framework is to avoid overfitting, and its best feature is that it can optimize neural networks with the same or varied structures in parallel. Liu et al. [13] optimized deep neural network architectures by removing redundant parameters using sparse decomposition. Sparsity is obtained by exploiting inter-channel and intra-channel redundancy, which reduces the recognition loss; in addition, its CPU implementation provides great efficiency. Sun et al. [14] used a gene encoding mechanism and weight initialization values for deep neural network optimization. A new scheme was proposed to keep networks from getting stuck in local minima during backward gradient-based optimization, and a new evaluation scheme was proposed to speed up the heuristic search with less computation overhead. Zhao et al. [15] provided a review of the optimization of DNN architectures. They surveyed various tasks such as object detection, face detection, and pedestrian detection, thus providing a promising direction for using neural-network-based learning systems. Zhang et al. [16] designed ShuffleNet, which reduces computing requirements through pointwise group convolution and channel shuffle. Iandola et al. [17] proposed SqueezeNet and explained the advantages of smaller CNN architectures, such as less communication across servers at training time, less bandwidth required, and more feasible deployment on hardware with less memory. SqueezeNet achieves AlexNet-level accuracy using 5 convolution layers and 3 fully connected layers.
3 Proposed Approach
The proposed approach is a hybrid of two approaches: it iteratively alternates between a sparsifying regularizer and a uniform width multiplier, optimizing a neural network through a cycle of shrinking and expanding phases. The phases involved in the proposed approach are explained as follows.

3.1 Phase 1
Normalized inputs help in generating a symmetric cost function, which would otherwise be elongated and not as symmetric. The following series of steps provides a normalized input to the network:
1. Take the raw image numpy array as input.
2. Flatten the input numpy array.
3. Generate numpy arrays of the individual channels used in the input raw image.
4. Evaluate the mean of each individual channel.
5. Evaluate the standard deviation of each individual channel.
6. Subtract the channel mean and divide by the calculated standard deviation. The result is the normalized input.
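The steps above can be sketched with numpy as follows (the image size and channel-last layout are assumptions for illustration):

```python
import numpy as np

def normalize_per_channel(img):
    """Zero-mean, unit-variance normalization of each channel (steps 1-6)."""
    img = img.astype(np.float64)          # raw image numpy array of shape (H, W, C)
    out = np.empty_like(img)
    for c in range(img.shape[-1]):        # individual channel arrays
        channel = img[..., c].ravel()     # flattened channel
        mu = channel.mean()               # channel mean
        sigma = channel.std()             # channel standard deviation
        out[..., c] = (img[..., c] - mu) / sigma   # zero-mean, then divide by std
    return out

# Example: a random 4x4 RGB image
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 4, 3))
norm = normalize_per_channel(img)
```

After this step each channel of the input has zero mean and unit standard deviation, which is what keeps the cost-function contours symmetric.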
3.2 Phase 2
Observing the weight matrix of each layer (Fig. 2) allows us to reduce computations by analyzing whether it contains fixed-point or floating-point values (Fig. 3); if it contains only fixed-point values, the computations in the convolution operation can be reduced.

3.3 Phase 3
The training and classification process of deep learning models can be sped up by creating TF Records of the data (Fig. 4). This reduces very large datasets, such as ImageNet, into a set of files called TF Records. It serializes the image data and stores it in a set of files that can be read linearly in a faster way. TensorFlow processes such records with ease, and as a result the proposed model trains faster on huge datasets.
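A minimal sketch of the Phase 2 idea of converting layer weights to fixed-point values (the bit width and example weights are assumptions, not values from the paper):

```python
def to_fixed_point(weights, frac_bits=8):
    """Quantize float weights to fixed-point integers scaled by 2**frac_bits."""
    scale = 1 << frac_bits
    return [[round(w * scale) for w in row] for row in weights]

def from_fixed_point(q, frac_bits=8):
    """Recover approximate float weights from the fixed-point representation."""
    scale = 1 << frac_bits
    return [[v / scale for v in row] for row in q]

w = [[0.503, -0.122], [0.250, 0.875]]
q = to_fixed_point(w)          # integer-only weights for cheaper arithmetic
w_back = from_fixed_point(q)   # each entry is within half a quantization step
```

Once the weights are integers, the multiply-accumulate operations of the convolution can be performed in integer arithmetic, which is the computation reduction Phase 2 aims at.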
Fig. 2. Weight values of mobilenets layer
Fig. 3. After the conversion of weight values to fixed-point notation
Fig. 4. TF record creation
Fig. 5. Original input image
4 Experimental Evaluation
In this paper, we developed a visualization tool to closely observe and visualize the DNN architectures of various models, where each model has a different set of layers that highlight different features in the feature maps (Figs. 6, 7, and 8). The original input image is presented in Fig. 5 and has shape (224, 224, 3). The experimental setup details are presented in Table 1.
Fig. 6. Layer index: 2, layer name: block1conv2 visualization of VGG16
Fig. 7. Layer index: 3, layer name: Conv1bn visualization of mobilenet v1
Fig. 8. Layer index: 4, layer name: Conv1relu visualization of mobilenet v1
5 Conclusion and Future Work
In this paper, we provided a novel approach to optimize DNN architectures. The aim of this approach is to find a globally optimal solution in which the CNN has the lowest testing error with good performance. The datasets for the experiment contain a large number of attributes and samples, and the results reveal that the architecture of a DNN can be improved through the proposed approach. The proposed approach not only optimizes a DNN over the MNIST dataset but also provides highly accurate results for other datasets. Hence, the proposed approach can work in cases where an architecture requires a reduced set of layers while keeping efficiency, accuracy, and performance intact. In the future, the work can be extended to find more optimal connections in the same solution; in addition, the proposed approach can be tested at the hardware level.
Table 1. Experimental setup details

Detail                            Setup feature
Programming language              Python
Libraries, packages or APIs used  TensorFlow and Keras
Interface design (GUI)            A tool for visualizing feature maps (implemented for CO-LAB)
Datasets used                     CIFAR 10, IMAGENET
References 1. Szegedy, C., Toshev, A., Erhan, D.: Deep neural networks for object detection. In: Advances in Neural Information Processing Systems, pp. 2553–2561 (2013) 2. Fan, J., Xu, W., Wu, Y., Gong, Y.: Human tracking using convolutional neural networks. IEEE Trans. Neural Netw. 21(10), 1610–1623 (2010) 3. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications (2017). arXiv:1704.04861 4. Gordon, A., Eban, E., Nachum, O., Chen, B., Wu, H., Yang, T.J., Choi, E.: MorphNet: fast and simple resource-constrained structure learning of deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595 (2018) 5. Romera, E., Alvarez, J.M., Bergasa, L.M., Arroyo, R.: ERFNet: efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 19(1), 263–272 (2017) 6. Freeman, I., Roese-Koerner, L., Kummert, A.: EffNet: an efficient structure for convolutional neural networks. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 6–10. IEEE (2018) 7. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018) 8. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: European Conference on Computer Vision, pp. 525–542. Springer, Cham (2016) 9. Giusti, A., Cireşan, D.C., Masci, J., Gambardella, L.M., Schmidhuber, J.: Fast image scanning with deep max-pooling convolutional neural networks. In: 2013 IEEE International Conference on Image Processing, pp. 4034–4038. IEEE (2013) 10. Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Hodjat, B.: Evolving deep neural networks. In: Artificial Intelligence in the Age of Neural Networks and Brain Computing, pp. 293–312. Academic Press 11. Geng, X., Fu, J., Zhao, B., Lin, J., Aly, M.M.S., Pal, C., Chandrasekhar, V.: Dataflow-based joint quantization of weights and activations for deep neural networks (2019). arXiv:1901.02064 12. Xu, C., Yang, J., Gao, J.: Coupled-learning convolutional neural networks for object recognition. Multimed. Tools Appl. 78(1), 573–589 (2019) 13. Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814 (2015)
14. Sun, Y., Xue, B., Zhang, M., Yen, G.G.: Evolving deep convolutional neural networks for image classification. IEEE Trans. Evol. Comput. (2019) 15. Zhao, Z.Q., Zheng, P., Xu, S.T., Wu, X.: Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 16. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018) 17. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size (2016). arXiv:1602.07360

The parameter ρ > 0 is used to trade between performance and convergence speed and is known as the threshold parameter. When ρ >> e^2(k), the algorithm is the standard LMF with μ/ρ as the step size, and when ρ << e^2(k), it shrinks to the standard LMS algorithm. Thus the choice of this parameter becomes very important.
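The two limiting regimes of the LMS/F update can be checked numerically (the μ and ρ values below are illustrative, not those used in the paper):

```python
mu, rho = 0.05, 1e-3

def lmsf_gain(e):
    """Per-sample gain mu*e^3/(e^2 + rho) of the LMS/F update."""
    return mu * e**3 / (e**2 + rho)

e_big, e_small = 1.0, 1e-3            # e^2 >> rho and e^2 << rho regimes
lms_like = mu * e_big                 # standard LMS term mu*e
lmf_like = (mu / rho) * e_small**3    # standard LMF term (mu/rho)*e^3
```

For large errors the gain matches the LMS term to within a fraction of a percent, and for small errors it matches the LMF term, confirming the stated limiting behavior.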
310
L. Das et al.
The following steps are employed to implement the proposed VLMS/F algorithm. The weight vector w(k) is initially taken to be zero (6):

w(0) = [0 0 0 ... 0]   (6)

Then, for each iteration, the weight vector is updated by adding the term μ·e^3(k)/(e^2(k) + ρ)·U(k) to the current weight estimate. The input vector U(k) is expanded by the Volterra expansion as given in (7):

U(k) = [U(k) ... U(k - N) U^2(k) ... U(k)U(k - 1)]   (7)

The error signal, the difference between the desired signal and the output signal, is generated as (8):

e(k) = d(k) - U(k)w(k)   (8)

The updated weight vector is calculated as (9):

w(k + 1) = w(k) + μ·e^3(k)/(e^2(k) + ρ)·U(k)   (9)
Equation (9) is used in the proposed work for weight updating, which is the central building block of the LMS/F algorithm. Using this, all the datasets are tested for disease prediction.
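The update steps (6)–(9) can be sketched in Python as follows (this is an illustrative re-implementation, not the authors' MATLAB code; the filter order, μ, ρ, and the synthetic test system are assumptions):

```python
import random

def volterra_expand(u, k, N):
    """Second-order Volterra input vector: delayed samples plus their products (7)."""
    lin = [u[k - i] if k - i >= 0 else 0.0 for i in range(N + 1)]
    quad = [lin[i] * lin[j] for i in range(N + 1) for j in range(i, N + 1)]
    return lin + quad

def vlmsf(u, d, N=1, mu=0.05, rho=0.01):
    """Volterra LMS/F adaptive filter; returns the error sequence e(k)."""
    n_w = (N + 1) + (N + 1) * (N + 2) // 2       # linear + quadratic terms
    w = [0.0] * n_w                              # w(0) = [0 ... 0]        (6)
    errors = []
    for k in range(len(u)):
        U = volterra_expand(u, k, N)             # Volterra expansion      (7)
        y = sum(wi * ui for wi, ui in zip(w, U))
        e = d[k] - y                             # error signal            (8)
        g = mu * e**3 / (e**2 + rho)             # LMS/F gain term
        w = [wi + g * ui for wi, ui in zip(w, U)]  # weight update         (9)
        errors.append(e)
    return errors

# Identify a simple synthetic system d(k) = 0.5 u(k) + 0.3 u(k-1)
random.seed(1)
u = [random.uniform(-1, 1) for _ in range(4000)]
d = [0.5 * u[k] + 0.3 * (u[k - 1] if k > 0 else 0.0) for k in range(len(u))]
e = vlmsf(u, d)
mse_start = sum(x * x for x in e[:200]) / 200
mse_end = sum(x * x for x in e[-200:]) / 200
```

The error energy falls as the weights converge toward the system coefficients, which mirrors the role the error (here, the MSE) plays as the decision statistic in the following section.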
3 Results Analysis
In the proposed work, the VLMS/F adaptive-filter-based algorithm is implemented as a means to distinguish cancerous genes from healthy genes. The mean of the collected healthy breast gene (NM_130786.3) is regarded as the desired signal; the rest are taken as targets and filtered by applying the proposed adaptive filtering technique. The algorithm is tested on 41 collected known breast cancer and healthy genes, of which only 10 are shown in Table 2. The number of iterations for which the error becomes stable, so as to minimize the error, is near 400. The MSE values are computed for μ = 0.9 and ρ = 0.0005. Table 2 shows the calculated MSE values for both healthy and cancer cases. It is observed that the MSE obtained is < 0.1 when the target is a healthy gene and > 0.1 when the target is cancerous. Thus, the MSE value 0.1 is considered the standard gene identifier for the proposed algorithm, and decisions on whether a gene is healthy or cancer affected are made using this value. The flow diagram of the complete algorithm is depicted in Fig. 3. The proposed VLMS/F algorithm offers distinct ranges of MSE for breast healthy and breast cancer genes for this combination of μ and ρ; hence a suitable choice of these two parameters is significant in weight updating using the adaptive algorithm. In the case of a healthy target gene, the error attains zero within a few iterations, while more iterations are required in the case of a breast cancer target gene. For that reason,
Effective Identification and Prediction of Breast Cancer
311
Table 2. List of MSE values for breast cancer and healthy gene sets

Sl. no.  Breast healthy gene  MSE     Breast cancer gene  MSE
1        NM_013375.3          0.0321  NM_021094.3         0.333
2        NM_015407.4          0.0189  NM_001257386.1      0.405
3        NM_024684.2          0.0336  AF336980.1          0.327
4        NM_032548.3          0.0189  AF349467.1          0.429
5        NM_148912.2          0.0225  NM_007300.3         0.401
6        NM_015423.2          0.0291  AF012108.1          0.426
7        NM_024666.4          0.0134  AY436640.1          0.312
8        NM_130786.3          0.0299  BC072418.1          0.492
9        NM_138340.4          0.0252  NM_014567.3         0.507
10       NM_021243.2          0.0459  NM_053056.2         0.245
this MSE value acts as the criterion for predicting cancer and healthy genes in the gene identification system. Figure 4 shows the bar-chart representation of MSE values for both breast cancer (first 10 bars) and healthy (last 10 bars) gene sets. Comparison with the existing method: Roy and Barman [29] predicted hereditary mutational diseases (cancer) using FLANN-expanded and traditional LMS-based adaptive algorithms, taking the NMSE value as the deciding parameter for prediction of healthy and diseased data sets. Although it is an advanced technique, there is no clear distinction between the diseased genes and the healthy genes, and the NMSE values vary little from gene to gene. When the proposed algorithm is applied to the breast cancer genes also used by their algorithm, it gives a clear separation in terms of MSE for all collected genes (Fig. 4). This proves the efficacy of the proposed method.
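The decision rule can be sketched directly, using the 0.1 threshold stated above and a few MSE values taken from Table 2:

```python
THRESHOLD = 0.1  # MSE identifier value stated in the paper

def classify_gene(mse):
    """MSE < 0.1 -> healthy gene, otherwise cancer gene."""
    return "healthy" if mse < THRESHOLD else "cancer"

# A few accession/MSE pairs from Table 2
samples = {
    "NM_013375.3": 0.0321,   # breast healthy gene
    "NM_130786.3": 0.0299,   # breast healthy gene
    "NM_007300.3": 0.401,    # breast cancer gene
    "NM_053056.2": 0.245,    # breast cancer gene
}
labels = {acc: classify_gene(m) for acc, m in samples.items()}
```

Every listed gene falls on the expected side of the threshold, which is exactly the separation shown in the bar chart of Fig. 4.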
Fig. 3. The flow diagram for prediction of breast cancer/healthy genes (reference and target genes → Volterra LMS/F based adaptive filtering → mean square error (MSE); MSE < 0.1: healthy gene, MSE > 0.1: cancer gene)
Fig. 4. Bar plot of MSE values of cancer and healthy genes
4 Conclusion
The Volterra-expanded LMS/F filtering is thus shown to be a very efficient method for breast cancer identification and prediction, and by simulating it in MATLAB it can effectively be applied for this purpose. The MSE is used as a standard measure to determine the diseased nature of a gene: the MSE value reflects the discrimination between breast healthy and breast cancer genes and acts as the identification criterion. The proposed VLMS/F technique is simple, fast converging, and robust, and compared with the existing method it attains improved accuracy. The proposed work offers support for the potential detection of further mutational diseases in the future.
References 1. Mohapatra, P., Chakravarty, S., Dash, P.K.: Microarray medical data classification using kernel ridge regression and modified cat swarm optimization based gene selection system. Swarm Evol. Comput. 28, 144–160 (2016) 2. Vogelstein, B., Kinzler, K.W.: Cancer genes and the pathways they control. Nat. Med. 10, 789 (2004) 3. Stratton, M.R., Campbell, P.J., Futreal, P.A.: The cancer genome. Nature 458, 719 (2009) 4. Chou, K.-C.: Impacts of bioinformatics to medicinal chemistry. Med. Chem. (Los Angeles) 11, 218–234 (2015) 5. Zhou, Z.-H., Jiang, Y., Yang, Y.-B., Chen, S.-F.: Lung cancer cell identification based on artificial neural network ensembles. Artif. Intell. Med. 24, 25–36 (2002) 6. Qiao, G., Wang, W., Duan, W., Zheng, F., Sinclair, A.J., Chatwin, C.R.: Bioimpedance analysis for the characterization of breast cancer cells in suspension. IEEE Trans. Biomed. Eng. 59, 2321–2329 (2012) 7. Roy, T., Barman, S.: Performance analysis of network model to identify healthy and cancerous colon genes. IEEE J. Biomed. Health Inform. 20, 710–716 (2016) 8. Chen, J., Wang, S.T.: Nanotechnology for genomic signal processing in cancer research - a focus on the genomic signal processing hardware design of the nanotools for cancer research. IEEE Signal Process. Mag. 24, 111–121 (2007) 9. Meng, T., Soliman, A.T., Shyu, M.-L., Yang, Y., Chen, S.-C., Iyengar, S.S., Yordy, J.S., Iyengar, P.: Wavelet analysis in current cancer genome research: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 10, 1442–1459 (2013) 10. Chakraborty, S., Gupta, V.: DWT based cancer identification using EIIP. In: 2016 Second International Conference on Computational Intelligence and Communication Technology (CICT), pp. 718–723 (2016) 11. Gayathri, T.T.: Analysis of genomic sequences for prediction of cancerous cells using wavelet technique (2017) 12. Das, J., Barman, S.: Bayesian fusion in cancer gene prediction. Int. J. Comput. Appl. 5–10 (2014) 13.
Ghosh, A., Barman, S.: Prediction of prostate cancer cells based on principal component analysis technique. Procedia Technol. 10, 37–44 (2013) 14. Barman, S., Roy, M., Biswas, S., Saha, S.: Prediction of cancer cell using digital signal processing. Ann. Fac. Eng. Hunedoara. 9, 91 (2011) 15. Voss, R.F.: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68, 3805 (1992) 16. Akhtar, M., Epps, J., Ambikairajah, E.: Signal processing in sequence analysis: advances in eukaryotic gene prediction. IEEE J. Sel. Top. Signal Process. 2, 310–321 (2008) 17. Roy, S.S., Barman, S.: Polyphase filtering with variable mapping rule in protein coding region prediction. Microsyst. Technol. 23, 4111–4121 (2017) 18. Das, L., Das, J.K., Nanda, S.: Identification of exon location applying kaiser window and DFT techniques. In: 2017 2nd International Conference for Convergence in Technology (I2CT), pp. 211–216 (2017) 19. Nair, A.S., Sreenadhan, S.P.: A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation. 1, 197–202 (2006) 20. Rao, K.D., Swamy, M.N.S.: Analysis of genomics and proteomics using DSP techniques. IEEE Trans. Circuits Syst. I Regul. Pap. 55, 370–378 (2008) 21. Sahu, S.S., Panda, G.: Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach. Genomics. Proteomics Bioinformatics. 9, 45–55 (2011)
314
L. Das et al.
22. Das, L., Nanda, S., Das, J.K.: An integrated approach for identification of exon locations using recursive Gauss Newton tuned adaptive Kaiser window. Genomics. (2018) 23. Das, L., Nanda, S., Das, J.K.: A novel DNA mapping scheme for improved exon prediction using digital filters. In: 2017 2nd International Conference on Man and Machine Interfacing (MAMI), pp. 1–6 (2017) 24. Ahmad, M., Jung, L.T., Bhuiyan, A.-A.: A biological inspired fuzzy adaptive window median filter (FAWMF) for enhancing DNA signal processing. Comput. Methods Programs Biomed. 149, 11–17 (2017) 25. Haykin, S.S.: Adaptive filter theory. Pearson Education India (2005) 26. Subudhi, U., Sahoo, H.K., Mishra, S.K.: Harmonics and decaying DC estimation using Volterra LMS/F algorithm. IEEE Trans. Ind. Appl. 54, 1108–1118 (2017) 27. Malakar, B., Roy, B.: A novel application of adaptive filtering for initial alignment of Strapdown Inertial Navigation System. In: 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications, CSCITA 2014, pp. 189–194 (2014). https://doi.org/10.1109/CSCITA.2014.6839257 28. Sahoo, H.K., Subudhi, U.: Power system harmonics estimation using adaptive filters. Compend. New Tech. Harmon. Anal. 117 (2018) 29. Roy, S.S., Barman, S.: A non-invasive cancer gene detection technique using FLANN based adaptive filter. Microsyst. Technol. 1–16
Architecture of Proposed Secured Crypto-Hybrid Algorithm (SCHA) for Security and Privacy Issues in Data Mining

Pasupuleti Nagendra Babu1(B) and S. Ramakrishna2

1 Department of Computer Science, Rayalaseema University, Kurnool, India
[email protected]
2 Department of Computer Science, S. V. University, Tirupati, India
Abstract. Nowadays, there is a widespread urge for security due to the variety of attacks over the Internet. Researchers keep finding solutions, but new exploits emerge from time to time. This research provides security and privacy solutions for data mining using a Secured Crypto-Hybrid Algorithm. The proposed algorithm combines traditional algorithms, namely K-means clustering and the Local Outlier Factor algorithm, with the AES-256 encryption method on datasets for security analysis. Keywords: Dataset · Algorithm · Cryptography · Data mining
1 Introduction
Data mining is the domain of science which deals with discovering patterns inside data and providing useful information to end users [1–3]. With the vast developments in data science technology, protecting the data is the topmost priority. The growing prospects in data mining have exposed vulnerabilities in present-day network communications. Hackers keep finding new methods to exploit data, thereby causing heavy losses to industry. According to a statistical report from the USA, crores of pages around the globe are compromised and data is leaked every day [4]. The major attacks in data mining are denial of service, distributed denial of service, malware, botnet, spyware, probing, and ransomware, as shown in Fig. 1. DoS stands for Denial of Service, a type of attack used by hackers to attack a network and thereby stop its services to users [5]. DDoS stands for Distributed Denial of Service, an attack where hackers target a network server from multiple domains, creating havoc and resulting in the collapse of the network [6]. Malware is any software code designed to damage a computer, which can be a server or a network; it can be introduced in the form of executable code and scripts [7]. Spyware is a piece of software which intrudes into a user device over a network and steals information such as passwords and details of the target device's functioning [8]. © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_28
Fig. 1. Attacks in data mining (DoS/DDoS, malware, botnet, spyware, probing, ransomware)
A probing attack is one in which an intruder tries to enter the target device, monitor the computer, and obtain the required information [9]. Ransomware is software designed to encrypt the target computer and demand money for decryption; it is the most dangerous attack the present-day community is facing [10]. Our research focuses on proposing a new algorithm for combating these security threats, called the Secured Crypto-Hybrid Algorithm. In the next section, the architecture of the proposed algorithm is discussed, followed by conclusions.
2 Architecture of Proposed Secured Crypto-Hybrid Algorithm (SCHA)
The architecture of the proposed secured crypto-hybrid algorithm is based on anomaly detection algorithms, namely the K-means clustering algorithm and the Local Outlier Factor algorithm, combined with AES-256 cryptography [11]. We propose to use the WEKA tool [12] for security analysis of the datasets. The pseudo code and algorithm are explained below. The architecture and framework are divided into four phases, namely (a) dataset mining, (b) preprocessing, (c) training and testing, and (d) encryption and decryption, as shown in the flowchart in Fig. 2.
Fig. 2. Flowchart of the proposed Secured Crypto-Hybrid Algorithm (input data from KYOTO 2006+ and NSL-KDD → extraction of features and attributes from the datasets → K-means clustering algorithm + Local Outlier Factor algorithm → AES-256 encryption and decryption of data → output data: DoS/DDoS, malware, spyware)
Pseudo code for the proposed Secured Crypto-Hybrid Algorithm
1. Input: Data_m (m: malicious data), NSL-KDD, KYOTO 2006+.
2. Output: Data_det (det: detected data, which may be outlier, anomaly, or normal).
3. Steps: Load the data (NSL-KDD, KYOTO 2006+);
4. Input data ← preprocessing;
5. Extract ← data_features + data_attributes;
6. Initiate training sets (K-means + LOF) and testing sets (K-means + LOF);
7. Initiate DataEncryption and DataDecryption with the AES-256 key;
8. Compute the bounds using the formula; compute ← TP, FP, TN, FN;
9. Compute ← E_r, A_r, where E_r is the error rate and A_r is the accuracy rate;
10. Return Output: Data_det ← anomaly and outlier detection.
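The K-means clustering used in the training step can be sketched with the standard Lloyd iteration. The sample points, number of clusters, and iteration count below are illustrative assumptions, not values from the paper; a real run would cluster the extracted NSL-KDD/Kyoto features (e.g. via WEKA or scikit-learn):

```python
# Minimal stdlib K-means sketch: assign each point to its nearest center,
# then recompute each center as the mean of its assigned points.
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # illustrative initialization
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment step
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, members in enumerate(clusters):
            if members:                       # update step: center = mean
                centers[i] = tuple(sum(d) / len(members) for d in zip(*members))
    return centers, clusters

pts = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
centers, clusters = kmeans(pts, k=2)          # two well-separated groups
```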
a. Dataset Mining
In this phase, the datasets NSL-KDD [13] and KYOTO 2006+ [14] are chosen for evaluating the performance of the proposed algorithm. NSL-KDD is based on the KDD Cup 99 [15] dataset and provides 43 features for evaluation. The Kyoto 2006+ dataset, developed at Kyoto University, is pruned from various sources such as servers and honeypots around the globe; it has 24 features derived from KDD-99 and its own development.
b. Preprocessing
In this phase, terminated and unwanted noisy data is deleted. The target data is preprocessed to make it more suitable to use: lost values are managed, values are changed into hexadecimal form, etc. The data is converted into a form that eases access to the features and attributes in the datasets: 43 features are selected from NSL-KDD and 24 from KYOTO 2006+ (for example, source IP address and destination IP address). From these features, we extract the byte sequence into a byte array, create hexadecimal dumps of the binary features, and generate n-grams of the hexadecimal dumps using the WEKA tool [12]. Feature extraction is done by frequency extraction, and in feature reduction of the generated n-grams, the highest-frequency feature values are given importance while low values are neglected.
c. Training and Testing
In this phase, the K-means clustering [16] and Local Outlier Factor [17] algorithms are applied to the datasets. The training data is applied to the model, which is then assessed on the testing data using the performance benchmark metrics. The model with the best attributes is used for real-time forecasts. Different attacks, such as DoS, probing, and malware, are tested.
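The preprocessing idea (byte sequence → hexadecimal dump → n-grams, keeping only the highest-frequency n-grams) can be sketched as follows. The sample payload, n, and top-k values are illustrative assumptions; the paper performs this with the WEKA tool:

```python
# Sketch of the preprocessing pipeline: bytes -> hex dump -> n-grams,
# with feature reduction keeping only the most frequent n-grams.
from collections import Counter

def hex_ngrams(data: bytes, n: int = 2):
    hexdump = data.hex()                                  # hexadecimal dump
    return [hexdump[i:i + n] for i in range(len(hexdump) - n + 1)]

def top_features(data: bytes, n: int = 2, top_k: int = 3):
    counts = Counter(hex_ngrams(data, n))
    # high-frequency n-grams kept, low-frequency ones neglected
    return [gram for gram, _ in counts.most_common(top_k)]

feats = top_features(b"\x00\x00\x00\xff", n=2, top_k=2)   # "00" dominates
```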
d. Encryption and Decryption
In this phase, the resulting datasets are encrypted and decrypted block-wise with the 256-bit key Advanced Encryption Standard (AES) algorithm. Encryption and decryption provide security features to the dataset, so that deciphering it without the key is computationally infeasible, thereby eliminating attacks.
_____________________________________________________________________
Algorithm for the proposed Secured Crypto-Hybrid Algorithm
_____________________________________________________________________
Step 1: Initiate the datasets NSL-KDD and Kyoto 2006+.
Step 2: Extract the binary features from the datasets: 43 features from NSL-KDD and 24 features from Kyoto 2006+.
Step 3: Extract the byte sequence into a byte array and create hexadecimal dumps of the binary features using the WEKA tool.
Step 4: Generate n-grams of the hexadecimal dumps of the binary features using the WEKA tool.
Step 5: Perform feature extraction using frequency extraction; in feature reduction of the generated n-grams, the highest-frequency feature values are given importance and low values are neglected.
Step 6: Apply the K-means clustering algorithm for classification of the datasets. Given $m$ clusters with cluster points $s_1, s_2, \ldots, s_n$ and centers $s_{1,n}, s_{2,n}, \ldots, s_{k,n}$ at iteration $n$, calculate the new centers $s_{1,n+1}, s_{2,n+1}, \ldots, s_{k,n+1}$ as the mean of the points assigned to each cluster,
$$m_c = \frac{1}{s_n} \sum_{j=1}^{s_n} y_j,$$
where $s_n$ is the number of cluster points in the $n$th cluster. Repeatedly update the distance between each cluster point and the newly formed cluster center; if no cluster point changes its assignment, the process stops.
Step 7: Apply the LOF algorithm for classification of the datasets. Let $m_k(s)$ be the set of $k$ nearest neighbors of an object $s$, and let the $k$-distance be the distance from $s$ to its $k$th nearest neighbor. The approachable (reachability) distance is
$$\text{Approachable distance}_k(x, y) = \max\{k\text{-distance}(y),\ d(x, y)\}.$$
The local approachable density of an object $x$ is
$$\mathrm{lad}_k(x) = \frac{|m_k(x)|}{\sum_{y \in m_k(x)} \text{Approachable distance}_k(x, y)},$$
and the local outlier function comparing $x$ with its neighborhood is
$$\mathrm{LOF}_k(x) = \frac{\sum_{y \in m_k(x)} \mathrm{lad}_k(y)}{|m_k(x)| \cdot \mathrm{lad}_k(x)}.$$
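The LOF score defined in Step 7 can be computed directly. The sample points and k below are illustrative assumptions; a real run would score the extracted dataset features (e.g. via scikit-learn's LocalOutlierFactor):

```python
# Stdlib-only sketch of the Local Outlier Factor computation.
import math

def lof(points, k=2):
    def knn(x):                                # k nearest neighbors m_k(x)
        others = sorted((p for p in points if p != x),
                        key=lambda p: math.dist(x, p))
        return others[:k]
    def k_distance(x):                         # distance to kth neighbor
        return math.dist(x, knn(x)[-1])
    def reach_dist(x, y):                      # max{k-distance(y), d(x, y)}
        return max(k_distance(y), math.dist(x, y))
    def lad(x):                                # local approachable density
        nbrs = knn(x)
        return len(nbrs) / sum(reach_dist(x, y) for y in nbrs)
    return {x: sum(lad(y) for y in knn(x)) / (k * lad(x)) for x in points}

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
scores = lof(pts)
# the isolated point (10, 10) receives a clearly larger LOF score
```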
A score $\mathrm{LOF}_k(s) \approx 1$ indicates an inlier (density comparable to its neighbors), while $\mathrm{LOF}_k(s) > 1$ indicates an outlier.
Step 8: After the confusion matrix is prepared, the outcomes for calculation are True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). The error rate and accuracy are calculated by
$$E_r = \frac{FP + FN}{TP + TN + FP + FN}, \qquad A_r = \frac{TP + TN}{TP + TN + FP + FN}.$$
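The error-rate and accuracy formulas of Step 8 follow directly from the confusion-matrix counts; a minimal sketch (the counts below are illustrative assumptions):

```python
# Error rate E_r and accuracy A_r from confusion-matrix counts.
def rates(tp, fp, tn, fn):
    total = tp + tn + fp + fn
    return (fp + fn) / total, (tp + tn) / total   # (E_r, A_r)

er, ar = rates(tp=90, fp=5, tn=95, fn=10)
# by construction E_r + A_r = 1
```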
Step 9: For the dataset $D(\cdot)$, let $d$ be the block size of the block cipher, let $M_i$, $N_i$, and $E_i$ be bit strings of length $d$ (plaintext, ciphertext, and keystream blocks, respectively), and let $I_v$ be the initialization vector, a pseudorandom number of length $d$. Encryption and decryption of the dataset using the AES-256 key algorithm are then
Encryption of first block: $E_1 = D_K(I_v)$, $N_1 = E_1 \oplus M_1$
Encryption of general block: $E_i = D_K(E_{i-1})$, $N_i = E_i \oplus M_i \ \forall i \geq 2$
Decryption of first block: $E_1 = D_K(I_v)$, $M_1 = E_1 \oplus N_1$
Decryption of general block: $E_i = D_K(E_{i-1})$, $M_i = E_i \oplus N_i \ \forall i \geq 2$
End
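The chaining structure of Step 9 can be illustrated with a stdlib sketch. A toy byte-rotation function stands in for the AES-256 block cipher purely to show the keystream/XOR flow; the block size, key, and IV below are illustrative assumptions, and a real implementation would use an AES library:

```python
# Keystream/XOR chaining sketch: E_i = D_K(E_{i-1}), N_i = E_i xor M_i.
# `toy_block` is NOT AES -- it is a stand-in so the example stays stdlib-only.
BLOCK = 4  # illustrative block size d (AES actually uses 16-byte blocks)

def toy_block(key: bytes, block: bytes) -> bytes:   # stand-in for D_K
    return bytes((b + key[i % len(key)]) % 256 for i, b in enumerate(block))

def chain_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    out, e = b"", iv
    for i in range(0, len(data), BLOCK):
        e = toy_block(key, e)                       # next keystream block
        out += bytes(m ^ s for m, s in zip(data[i:i + BLOCK], e))
    return out

key, iv = b"k3y!", b"\x01\x02\x03\x04"
ct = chain_xor(key, iv, b"datasets")
pt = chain_xor(key, iv, ct)   # same keystream decrypts: M_i = E_i xor N_i
```

Because the keystream depends only on the key and IV, running the same function over the ciphertext recovers the plaintext, which is why the decryption equations mirror the encryption ones.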
3 Conclusions
In this research, the need for security in data mining was discussed. The architecture of the proposed algorithm, named the Secured Crypto-Hybrid Algorithm (SCHA), which combines K-means clustering and the Local Outlier Factor algorithm with the AES-256 encryption method on datasets for security analysis, was explained. The framework of the proposed algorithm was discussed in stages using a flowchart, pseudo code, and algorithm. Our future work will be to experiment with the proposed framework and evaluate the performance of SCHA on the datasets.
References
1. Kumar, S.R., Jassi, J.S., Yadav, S.A., Sharma, R.: Data-mining a mechanism against cyber threats: a review. In: 2016 International Conference on Innovation and Challenges in Cyber Security (ICICCS-INBUSH) (2016)
2. Yamini, O., Ramakrishna, S.: A study on advantages of data mining classification techniques. Int. J. Eng. Res. Technol. 4 (2015)
3. Jayakameswaraiah, M., Ramakrishna, S.: Development of data mining system to analyze cars using TkNN clustering algorithm. Int. J. Adv. Res. Comput. Eng. Technol. 3(7), 2365–2373. ISSN: 2278-1323
4. Data breach record of USA from 2005–2018. https://www.privacyrights.org/data-breaches
5. Kim, M., Jung, S., Park, M.: A distributed self-organizing map for DoS attack detection. In: 2015 Seventh International Conference on Ubiquitous and Future Networks, 7–10 July 2015, Sapporo, Japan, pp. 19–22
6. Kaur, P., Kumar, M., Bhandari, A.: A review of detection approaches for distributed denial of service attacks. Syst. Sci. Control Eng. 5(1), 301–320
7. Razak, M.F.A., Anuar, N.B., Salleh, R., Firdaus, A.: The rise of "malware": bibliometric analysis of malware study. J. Netw. Comput. Appl. 75, 58–76 (2016)
8. Wang, T.-Y., Horng, S.-J., Su, M.-Y., Wu, C.-H., Wang, P.-C., Su, W.-Z.: A surveillance spyware detection system based on data mining methods. In: 2006 IEEE Congress on Evolutionary Computation, Vancouver, BC, Canada, July 16–21, 2006, pp. 3236–3241
9. Paliwal, S., Gupta, R.: Denial-of-service, probing & remote to user (R2L) attack detection using genetic algorithm. Int. J. Comput. Appl. 60(19), 57–62 (2012)
10. Biryukov, A., Khovratovich, D.: Related-key cryptanalysis of the full AES-192 and AES-256. Lect. Notes Comput. Sci. 1–18 (2009)
11. Silva, J.A.H., Hernandez-Alvarez, M.: Large scale ransomware detection by cognitive security. In: IEEE Second Ecuador Technical Chapters Meeting (ETCM), 16–20 October 2017, Salinas, Ecuador
12. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software. ACM SIGKDD Explor. Newsl. 11(1), 10 (2009)
13. NSL-KDD. https://www.unb.ca/cic/datasets/nsl.html
14. Song, J., Takakura, H., Okabe, Y., Eto, M., Inoue, D., Nakao, K.: Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation. In: Proceedings of the First Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS '11) (2011)
15. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.: A detailed analysis of the KDD CUP 99 data set. In: Second IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA) (2009)
16. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7) (2002)
17. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), ACM, New York, NY, USA, pp. 93–104 (2000)
A Technique to Classify Sugarcane Crop from Sentinel-2 Satellite Imagery Using U-Net Architecture

Shyamal Virnodkar(B), V. K. Pachghare, and Sagar Murade

Department of Computer Engineering & IT, College of Engineering, SPPU, Pune, India
{ssv18.comp,vkp.comp}@coep.ac.in, [email protected]
Abstract. Satellite imagery data collected from various modern and older satellites finds applications in a variety of domains. One domain of great importance is agriculture. Satellite imagery can be used in agricultural applications to increase the precision and efficiency of farming, and is of great importance in applications such as disease detection, crop classification, weather monitoring, and farmland usage. In this paper, we propose a technique to classify sugarcane crops from satellite imagery utilizing a supervised machine learning approach. Unlike unsupervised models, this technique relies on ground truth data collected from the farm to train, test, and validate the model. The ground samples cover the four stages of the sugarcane growth cycle: germination, tillering, grand growth, and maturity. This collected information acts as input to the U-Net architecture, which extracts the features unique to sugarcane fields and classifies the sugarcane crop. Keywords: Remote sensing · Sentinel-2 · U-Net architecture · Augmentation
1 Introduction
Crop classification is a field in which an abundant amount of research has been carried out to distinguish a particular crop from the other crops present in its vicinity. However, this research has been conducted on major crops such as maize, wheat, rice, and paddy grown in other countries. Much less attention has been paid to sugarcane crop classification, as sugarcane is a very dynamic and semi-perennial crop: it has four phenological stages, plant and ratoon crop types, and various varieties, so it is essential to take these properties into account in the classification task. Moreover, this machine learning task is currently carried out mostly with pixel-wise classification techniques. The extreme move of the computer vision field away from hand-built image features and toward more automation motivated us to use a deep learning CNN architecture for better results. Our main aim is to classify the sugarcane crop from other crops, so we formulate a technique which uses a supervised machine learning architecture, the U-Net [1] model of CNN, to classify sugarcane fields from others. © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_29
The trouble with classifying a sugarcane field from other fields is knowing on what basis to perform the classification. We need to perceive features which are unique to the sugarcane crop, on the basis of which it can be distinguished from other crops. Nevertheless, the features of a sugarcane field that we see with the naked eye are not guaranteed to be visible in satellite imagery. Traditionally, classification can be achieved using normalized difference vegetation index (NDVI) values [2, 3] calculated from the satellite bands, along with other spectral indices, and even using information from the thermal band of the satellite imagery. The manual extraction of the specific features of sugarcane that contribute to classifying it against other crops is difficult, so we look for a technique for automatic extraction of features from satellite images using the U-Net architecture, a supervised machine learning architecture. U-Net is a convolutional neural network architecture for fast and precise segmentation of images, which works with fewer training sample images and yields more precise segments.
Challenges: Collecting precise ground truth data of sugarcane, which requires physical presence in the field and taking measurements by walking around it, is one of the challenges in the existing approach to classification from remote sensing data. Mapping ground truth data onto satellite imagery is another major challenge, as it is a time-consuming process and requires perfection.
Objective: The main objective of this study is to propose a technique to extract the features which are unique to the sugarcane crop field and to use these features to classify the sugarcane field from the other crops in its vicinity.
The rest of the paper is structured as follows: Sect. 2 presents a literature survey; Sect. 3 describes the methodology for sugarcane classification using U-Net architecture, followed by the conclusion in Sect. 4.
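The NDVI values mentioned above are computed per pixel from the near-infrared and red bands as (NIR − Red)/(NIR + Red); a minimal sketch, where the band reflectances are illustrative numbers and not Sentinel-2 measurements:

```python
# NDVI = (NIR - Red) / (NIR + Red); high values indicate dense vegetation.
def ndvi(nir: float, red: float) -> float:
    return (nir - red) / (nir + red) if (nir + red) else 0.0

dense_canopy = ndvi(nir=0.45, red=0.05)   # strongly vegetated pixel
bare_soil = ndvi(nir=0.30, red=0.25)      # weakly vegetated pixel
```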
2 Literature Survey
Land use and land cover (LULC) classification is one of the main applications of remote sensing data, and it is undergoing a transition from pixel-based land use classification to pixel-level semantic segmentation. The recent success of deep learning has inspired many researchers to employ deep neural networks for LULC classification [4] and other areas of remote sensing. Since 2012, Convolutional Neural Networks (CNNs) have been widely applied to image-wise classification, and CNN structures such as GoogLeNet [5], VGGNet [6], ResNet [7], and AlexNet [8] have proven fruitful. Special CNN structures, R-CNN [9], Mask R-CNN [10], Fast R-CNN [11], Faster R-CNN [12], and Fully CNN [13], were developed for pixel-wise semantic segmentation. Variants of fully convolutional neural networks (FCNNs) include SegNet [14], DeconvNet [15], and U-Net [1], used for a variety of applications such as crop area mapping [16] and urban planning. U-Net was first proposed for biomedical image segmentation by [1]. Building extraction from satellite imagery, one application of remote sensing classification, is
achieved with significant accuracy using deep learning. Reference [17] employed a deep residual network and proposed a new model, Res-U-Net, for building extraction. Building extraction from high-resolution imagery with a scale-robust CNN combining atrous convolution and multiscale aggregation was done by [18]. The potential of the U-Net convolutional network to identify and segment natural forest and eucalyptus plantations, and to estimate forest disturbance, has been investigated by [19]. U-Net not only works well for small-scale regions but also gives good accuracy over large landscape areas. This fact was investigated by [20] for identifying the presence and absence of trees and large shrubs in a large area, with an accuracy of around 90%; the authors produced woody vegetation maps with a relatively small training dataset, which is possible because of the U-Net approach to classification. High-resolution Sentinel-2 time series have been utilized to create land cover maps using the U-Net model, especially because of its capability to handle sparse annotation data [21]. As seen from different studies, U-Net achieved significant accuracy using optical remote sensing data; it also has the potential to classify SAR data well. Though SAR data is powerful for large-scale crop mapping because of its capability to deal with clouds, very few studies have used SAR data for crop classification. Reference [22] utilized multitemporal dual-polarization SAR data for large-scale crop mapping with complex crop planting patterns; the authors employed U-Net to predict the crop types and introduced a batch normalization algorithm in the U-Net model to handle the issues of unbalanced sample numbers and a large number of crops. RS image data extraction incorporates feature classification, which is a long-standing research problem in the RS domain.
Emerging deep learning techniques such as U-Net, AlexNet, and ResNet are effective methods to automatically find relevant contextual features and, moreover, improve image classification results. This not only expands the research being carried out but also increases the number of applications of satellite imagery in various other fields of agriculture. It will help especially in the case of sugarcane field discrimination, as sugarcane is one of the longest-living crops on the farm, which enables more features to be extracted from it over this significantly long period [23].
3 Methodology
Figure 1 shows the basic model of our technique to classify sugarcane crops using U-Net architecture. The satellite image is analyzed to find out how many bands it contains and to display each band separately. Extracting information about the dimensions (in meters) and the longitude and latitude of satellite images, along with coordinates collected as ground truth data, helps us plot polygons representing the sugarcane fields. We can then map the coordinates collected from a sugarcane field to the satellite image of the same field, after which we obtain an image having polygons over the chosen parts of the sugarcane crop. We then take the political map of the research study area, which forms our interactive display. Then, using Sentinel Hub, Sentinel-2 satellite imagery corresponding to the same area is acquired and overlain on the political map. After this, we take ground truth information about the sugarcane field that we want to classify and overlay it over the
Sentinel-2 image. As preparing an image with multiple polygons in it can be tedious, augmentation is performed on the images to increase the number of images available for further processing. For the proper working of the U-Net architecture, we need at least 5000 unique images of different sugarcane fields, which can be increased to up to 20000 using the augmentation method [1]. This set of 20000 images can be divided into training, testing, and validation sets. We use these images as input to the U-Net architecture, which extracts features of the sugarcane fields and helps us classify them [1].
Fig. 1. Proposed model for classification (satellite image visualization and preprocessing; data collection and preparation of sugarcane fields; interactive display; adding the Sentinel-2 layer to the political map; sugarcane ground truth images; augmentation; U-Net architecture; classified sugarcane fields)
3.1 Satellite Image Visualization and Preprocessing
The Rasterio package of Python can be used to perform analysis on Sentinel-2 imagery: the number of bands contained in the image can be found, and each band can be visualized individually. Figure 2a–c shows the use of the Rasterio package for this analysis. Sentinel-2 images are required to undergo atmospheric correction in order to remove atmospheric effects.
3.2 Data Collection and Preparation of Sugarcane Fields
Raw Sentinel-2 imagery is fetched from Sentinel Hub, a platform which makes earth observation imagery easily accessible for analysis, browsing, and visualization. Along with the Sentinel-2 image, we can also fetch the ground truth data of the sugarcane fields collected from the area of interest. This ground truth data can also be uploaded to Geopedia for runtime access. Geopedia is a web-based application for editing, searching, and viewing geographical data. The fetched
Fig. 2. a Blue band. b Green band. c Red band
ground truth data have polygons marked in the areas where sugarcane fields are present. The Rasterio package of Python can be used to obtain the dimensions of the Sentinel-2 image on the ground in meters and the longitude and latitude corresponding to each pixel. This information is useful when creating ground truth data, to map the polygons onto the sugarcane fields.
3.3 Interactive Display
Using the ipyleaflet package of Python, we can create an interactive display (Fig. 3) on the computer screen which allows us to interact with the map. Initially, the window contains a political map of the area whose coordinates are given in the argument list. We can navigate through this map using the mouse and perform zoom-in and zoom-out operations; separate buttons are also available on the window for these operations.
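The pixel-to-coordinate lookup described in Sect. 3.2 is an affine mapping from a raster's row/column indices to map coordinates (this is what Rasterio's `dataset.xy(row, col)` computes from the image's geotransform). A stdlib sketch, with the origin and pixel-size values below being illustrative and not taken from the study area:

```python
# Map a raster pixel (row, col) to the map coordinates of its centre,
# given a top-left origin and per-pixel ground sampling distance.
def pixel_to_coords(row, col, origin_x, origin_y, pixel_w, pixel_h):
    # pixel_h is negative for north-up rasters (y decreases downward)
    x = origin_x + (col + 0.5) * pixel_w   # centre of the pixel
    y = origin_y + (row + 0.5) * pixel_h
    return x, y

# 10 m Sentinel-2-like grid with an arbitrary UTM-style origin
x, y = pixel_to_coords(row=0, col=0,
                       origin_x=600000.0, origin_y=4100000.0,
                       pixel_w=10.0, pixel_h=-10.0)
```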
Fig. 3. Interactive display window
3.4 Adding the Sentinel-2 Layer to the Political Map
Using the Web Map Service (WMS), we can fetch the Sentinel-2 image at runtime and add it as a layer over the political map (Fig. 4). Once the Sentinel-2 image is added to the political map, we can see the actual satellite image of the corresponding area.
Fig. 4. Add Sentinel-2 layer to the map
3.5 Sugarcane Ground Truth Images
Using the Web Map Service (WMS), we can also fetch the ground truth data layer at runtime and add it on top of the Sentinel-2 imagery. The ground truth data has polygons corresponding to the sugarcane crop fields. As we zoom out, we can also see polygons in nearby areas where fields are present. The ground truth layer can be removed in order to get back the view of the Sentinel-2 imagery.
3.6 Augmentation
The Augmentor package of Python can be used to perform augmentation on an image, generating multiple copies of the same image but with each copy having a different view. Various operations such as flip, rotate right, rotate left, zoom in, and zoom out are performed with different probabilities. Augmentation is required because it is difficult to map the polygons onto the actual sugarcane fields; to work with a small number of images having polygons mapped onto them, we create multiple copies of the same image with different views. Figure 5a, b shows how the Augmentor package is used to generate multiple copies of the same image from different angles.
3.7 U-Net Architecture
U-Net is a convolutional neural network originally developed for biomedical image segmentation and is now used for a variety of applications. It performs fast and precise segmentation of images, works with fewer training images, and yields more precise segmentation. It consists of a contracting path and an expansive path [1]. During contraction, spatial information is reduced while feature information is increased. The expansive path combines the feature and spatial information through a sequence of convolutions and concatenations with high-resolution features from the contracting path. We give the ground truth data of the sugarcane crop to U-Net as input, and it automatically extracts the features which are unique to the sugarcane crop.
These extracted features can then be used to classify the sugarcane crop from the other crops in its surroundings.
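The flip and rotate augmentations of Sect. 3.6 can be sketched on a toy 2-D grid standing in for an image chip; the Augmentor package applies the same kinds of transforms to real images, with per-operation probabilities. The grid values below are illustrative:

```python
# Stdlib sketch of two augmentation operations on a 2-D grid.
def hflip(img):
    """Mirror the image left-right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col)[::-1] for col in zip(*img)]

chip = [[1, 2],
        [3, 4]]
# four distinct views of the same chip: original, mirrored, 90deg, 180deg
augmented = [chip, hflip(chip), rot90(chip), rot90(rot90(chip))]
```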
4 Conclusion
Thus, we propose a technique to classify sugarcane crops from Sentinel-2 satellite imagery using the U-Net architecture. This technique overcomes the traditional challenge of manually finding features related to the sugarcane crop and helps in extracting the features automatically. These features aid the classification process, where we want to distinguish between the sugarcane crop and other crops in the region of interest. The process is efficient in terms of the time and computation power needed, alongside providing more precise results than the traditional approach.
A Technique to Classify Sugarcane Crop
Fig. 5. a Original image. b Augmented images
References

1. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241 (2015)
2. Vinod, K.V.K., Kamal, J.: Development of spectral signatures and classification of sugarcane using ASTER data. Int. J. Comput. Sci. Commun. 1, 245–251 (2010)
3. Mulianga, B., Begue, A., Clouvel, P., Todoroff, P.: Mapping cropping practices of a sugarcane-based cropping system in Kenya using remote sensing. Remote Sens. 7(11), 14428–14444 (2015)
4. Rakhlin, A., Davydow, A., Nikolenko, S.I.: Land cover classification from satellite imagery with U-Net and Lovasz-Softmax loss. In: CVPR Workshops, pp. 262–266 (2018)
5. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
6. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
S. Virnodkar et al.
8. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
9. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 38(1), 142–158 (2015)
10. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 99(1), 770–778 (2017)
11. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
12. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
13. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
14. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
15. Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528 (2015)
16. Du, Z., Yang, J., Ou, C., Zhang, T.: Smallholder crop area mapped with a semantic segmentation deep learning method. Remote Sens. 11(7), 888 (2019)
17. Xu, Y., Wu, L., Xie, Z., Chen, Z.: Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sens. 10(1), 144 (2018)
18. Ji, S., Wei, S., Lu, M.: A scale robust convolutional neural network for automatic building extraction from aerial and satellite imagery. Int. J. Remote Sens. 40(9), 3308–3322 (2019)
19. Wagner, F.H., Sanchez, A., Tarabalka, Y., Lotte, R.G., Ferreira, M.P., Aidar, M.P.M., Aragao, L.E.O.C.: Using the U-Net convolutional network to map forest types and disturbance in the Atlantic rainforest with very high resolution images. Remote Sensing in Ecology and Conservation (2019)
20. Flood, N., Watson, F., Collett, L.: Using a U-Net convolutional neural network to map woody vegetation extent from high resolution satellite imagery across Queensland, Australia. Int. J. Appl. Earth Obs. Geoinf. 82, 101897 (2019)
21. Stoian, A., Poulain, V., Inglada, J., Poughon, V., Derksen, D.: Land cover maps production with high resolution satellite image time series and convolutional neural networks: Adaptations and limits for operational systems (2019)
22. Wei, S., Zhang, H., Wang, C., Wang, Y., Xu, L.: Multi-temporal SAR data large-scale crop mapping based on U-Net model. Remote Sens. 11(1), 68 (2019)
23. Falk, T., Mai, D., Bensch, R., Cek, O., Abdulkadir, A., Marrakchi, Y., et al.: U-Net: Deep learning for cell counting, detection, and morphometry. Nat. Methods 16(1), 67 (2019)
Performance Analysis of Recursive Rule Extraction Algorithms for Disease Prediction

Manomita Chakraborty(B), Saroj Kumar Biswas, and Biswajit Purkayastha

CSE Department, NIT Silchar, Silchar 788010, Assam, India
[email protected], [email protected], [email protected]
Abstract. Modern busy lifestyles are acting as a catalyst for the growth of various health-related issues among people. As a consequence, a massive amount of medical data is accumulating every day, and handling those data is becoming a challenging task for the medical community. In such a situation, if a system exists that can effectively analyze those data and retrieve the primary causes of a disease, then the disease can be prevented on time by taking the correct precautionary measures beforehand. Recently, machine learning algorithms have been receiving a lot of appreciation for building such expert systems, and neural networks are among those that have attracted many researchers due to their high performance. But the main obstacle hindering the application of neural networks in the medical domain is their black-box nature, i.e., their inability to make transparent decisions. As a solution to this problem, the rule extraction process is becoming very popular, as it can extract comprehensible rules from neural networks with high accuracy. Many rule extraction algorithms exist in the literature, but this paper mainly assesses the performance of algorithms that generate rules recursively from neural networks. Recursive algorithms repeatedly subdivide the subspace of a rule to increase accuracy, so they can provide comprehensible decisions along with high accuracy. Four medical datasets are collected from the UCI repository to assess the performance of the algorithms in diagnosing a disease. The results prove the effectiveness of recursive rule extraction algorithms in medical diagnosis. Keywords: Medical diagnosis · Neural networks · Classification · Rule extraction
1 Introduction

Along with its various advantages, the modern lifestyle also brings various health-related problems. People are falling prey to various life-threatening diseases due to the hectic and busy lifestyles they follow. As a result, data related to various medical problems are growing at a phenomenal rate, and it is becoming very difficult for the medical community to handle those huge, scattered data. In such a situation, a proper system is required that can analyse and extract useful knowledge from those data and recognize the major symptoms causing a disease, so that people can take preventive
© Springer Nature Singapore Pte Ltd. 2021
C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_30
measures beforehand by controlling the major symptoms. Such a system can act as an assistant to the medical community, helping physicians diagnose a disease easily. Various data mining tools exist to create such an expert system; machine learning algorithms in particular have recently been gaining a lot of attention for building expert systems, as they can provide accurate decisions [1]. Neural Networks (NNs) are one of them. They can provide accurate decisions on huge amounts of complex data. But the main obstacle hindering the application of NNs in the medical domain is their inherent black-box nature, i.e., their inability to explain the decisions they make. For diagnosing a disease effectively, transparency is a must: the primary causes of a disease should be known so that the disease can be diagnosed early and cured on time. As a solution to this problem, the rule extraction technique is used. Rule extraction expresses the knowledge learned by a network in the form of simple and understandable rules [2–4]. Extracting transparent rules from NNs is a deep-rooted topic, and a strong literature exists on it. Figure 1 shows an example of if-then rules generated using a trained network; this is the basic principle followed by any rule extraction algorithm. NNs are already known for their high generalization capability, and the rule extraction technique makes them well suited to medical diagnosis [5]. The generated rules can assist physicians in diagnosing a disease with great ease. For example, consider the rule: {IF blood sugar level == High, THEN class = diabetes}. Here, diabetes can be diagnosed by only looking at the blood sugar level, and it can further be controlled by controlling the blood sugar level.
Fig. 1. Schematic representation of the rule extraction process
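As a minimal illustration, rules extracted in this if-then form can be applied to a new patient record as follows; the attribute names and the single-rule set are hypothetical:

```python
def apply_rules(rules, patient, default="unknown"):
    """Fire the first if-then rule whose conditions all match the patient.
    Each rule is (list of (attribute, value) conditions, class label)."""
    for conditions, label in rules:
        if all(patient.get(attr) == value for attr, value in conditions):
            return label
    return default

# Hypothetical rule set mirroring the example in the text
rules = [([("blood_sugar_level", "High")], "diabetes")]
print(apply_rules(rules, {"blood_sugar_level": "High"}))    # diabetes
print(apply_rules(rules, {"blood_sugar_level": "Normal"}))  # unknown
```

Unlike the network itself, such a rule set makes the decision path visible: the matched antecedent names the symptom responsible for the diagnosis.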
The objective of this paper is to analyse the performance of rule extraction algorithms that generate rules recursively from neural networks for diagnosing a disease. Recursive rule extraction algorithms are considered to provide transparent and accurate rules because they divide the subspace of the rules until certain criteria are satisfied, in order to increase accuracy. This paper mainly assesses the performance of four recursive algorithms (Recursive Rule Extraction (Re-RX) [6], Continuous Re-RX [7, 8], Reverse Engineering Recursive Rule Extraction (RE-Re-RX) [9], and Eclectic Rule Extraction from Neural Network Recursively (ERENNR) [10]) for diagnosing a disease. Details of the algorithms are given in the succeeding sections. The algorithms are assessed based on different measures (accuracy and comprehensibility). Four medical datasets are collected from the UCI repository to study their performance.
The paper is organized as follows: Sect. 2 gives a brief introduction to the history of recursive rule extraction algorithms, Sect. 3 explains the framework of the recursive procedure and briefly describes the four recursive algorithms, Sect. 4 presents experimental results assessing the performance of the algorithms on four medical datasets, and Sect. 5 draws a conclusion.
2 Related Literature

Recursive Rule Extraction (Re-RX) [6], proposed by Setiono et al., is the first recursive algorithm for NNs. Many recursive rule extraction algorithms exist in the literature; most of them are variants of the Re-RX algorithm, and all are designed to increase the accuracy of the rule sets. The Re-RX algorithm constructs separate rules for discrete and continuous attributes. Re-RX starts by generating rules with the discrete attributes if present, or else generates rules with the continuous attributes. If the classification performance of a rule comprising discrete attribute(s) is not good enough, the algorithm divides the subspace of the rule recursively by calling the entire algorithm again. Re-RX polishes the rule by constructing new rules with the discrete attribute(s) not covered by it, or by generating a linear hyperplane with the continuous attributes. Various algorithms have been proposed based on Re-RX. The Ensemble Recursive Rule extraction (E-Re-RX) [11] algorithm constructs rules using Re-RX based on the ensemble method: it generates primary rules, then secondary rules using Re-RX, and finally merges all of them into the final rule set. The Three-MLP Re-RX [12] combines the rules generated from three multi-layer perceptrons based on Re-RX. Reverse Engineering Recursive Rule Extraction (RE-Re-RX) [9] extends Re-RX by substituting simple if-then classification rules for the linear hyperplane over continuous attributes. Continuous Re-RX [7, 8] does not treat continuous and discrete attributes separately; it generates hierarchical rules recursively for both types of attributes using the C4.5 decision tree. Re-RX with J48graft [8] uses J48graft in place of the C4.5 decision tree to construct rules. Sampling Re-RX [8] uses a sampling technique to pre-process the data before applying Re-RX, to improve the performance of the rules Re-RX generates.
The Sampling Re-RX with J48graft [13] algorithm combines the Re-RX with J48graft and Sampling Re-RX algorithms to construct rules. Eclectic Rule Extraction from Neural Network Recursively (ERENNR) [10] is another recursive rule extraction algorithm capable of generating simple rule sets. ERENNR extracts rules in the form of attribute data ranges and targets by analysing the nodes of a trained network.
3 Recursive Rules This section is divided into two subsections. The first subsection explains the common framework followed by the recursive rule extraction algorithms to generate rules and the last subsection presents the methodologies of the four recursive rule extraction algorithms in brief and shows the difference in the pattern of recursive rules generated by them.
3.1 Framework for Generating Recursive Rules

All the existing recursive rule extraction algorithms for NNs use the same procedure to subdivide the subspace of a rule, based on the support and error of the rule. The percentage of patterns covered by a rule forms its support, and the percentage of misclassified patterns among the patterns it covers forms its error. If the support and error exceed the preset thresholds, the subspace of the rule is divided recursively. The framework for the recursive procedure is depicted in Fig. 2, where Rr denotes the rth rule in a rule set R. The example in Fig. 3 shows how a rule r in R is subdivided recursively. Suppose rule r meets the criteria for support and error. The rule is divided using the patterns covered by it and the attributes not present in it; here, the subspace of rule r is divided into three rules. All three rules are again evaluated based on support and error, and the subspace of rule r1 is divided into three rules as it satisfies the required criteria. The process continues until no rule can be subdivided any more. All the rules in R are evaluated similarly.
Fig. 2. Framework for recursive procedure
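The support/error criterion above can be sketched in Python as follows; the rule representation and the `subdivide` callback are illustrative assumptions (the actual algorithms grow child rules with a decision tree or a hyperplane):

```python
def support(rule, patterns):
    """Fraction of patterns the rule's antecedent covers."""
    return sum(rule["covers"](p) for p in patterns) / len(patterns)

def error(rule, patterns):
    """Fraction of covered patterns the rule misclassifies."""
    covered = [p for p in patterns if rule["covers"](p)]
    if not covered:
        return 0.0
    return sum(p["label"] != rule["label"] for p in covered) / len(covered)

def refine(rules, patterns, subdivide, min_support=0.1, max_error=0.1):
    """Recursively subdivide any rule whose support and error both exceed
    the preset thresholds, using only the patterns that rule covers."""
    final = []
    for rule in rules:
        if support(rule, patterns) > min_support and error(rule, patterns) > max_error:
            covered = [p for p in patterns if rule["covers"](p)]
            final += refine(subdivide(rule, covered), covered, subdivide,
                            min_support, max_error)
        else:
            final.append(rule)
    return final

# Toy data: label is 1 exactly when x >= 5
patterns = [{"x": i, "label": int(i >= 5)} for i in range(10)]
root = [{"covers": lambda p: True, "label": 1}]  # covers everything, error 0.5

def subdivide(rule, covered):  # toy split standing in for a decision tree
    return [{"covers": lambda p: p["x"] >= 5, "label": 1},
            {"covers": lambda p: p["x"] < 5, "label": 0}]

final = refine(root, patterns, subdivide)  # two pure child rules remain
```

The root rule exceeds both thresholds, so it is subdivided once; each child classifies its covered patterns perfectly and is therefore terminal.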
3.2 Recursive Algorithms Re-RX [6], Continuous Re-RX [7, 8], RE-Re-RX [9], and ERENNR [10] are some of the important recursive algorithms for NNs that can generate transparent rules with good
Fig. 3. Example for recursive rule generation
accuracy. Though all the algorithms use the same criterion to divide the subspace of the rules recursively, they work differently. The details of the algorithms are given below:

Re-RX. This is the most extensively used rule extraction algorithm across different applications. The beauty of the algorithm lies in its pattern of rule generation: it generates rules independently for the discrete and the continuous attributes, and the generated rules follow a specific hierarchy. The algorithm begins by generating rules with the discrete attribute(s) using a decision tree if present, or else halts by constructing a linear hyperplane with the continuous attribute(s). Each rule generated for discrete attribute(s) is further refined by subdividing its subspace recursively if it satisfies a certain criterion; the rule is divided by calling the entire process again with the attribute(s) not enclosed within the region of the rule. The drawback of this algorithm is that the generated rules cannot retain the performance of the neural network if the continuous attributes are best explained by a nonlinear function.

Continuous Re-RX. This algorithm was proposed to estimate the potential of rules generated by the Re-RX algorithm comprising continuous attributes for thyroid diagnosis. It uses the same procedure for generating rules recursively as Re-RX but, unlike Re-RX, places no constraint on constructing rules with continuous and discrete
attributes. Rules are constructed using a decision tree algorithm (C4.5) for both types of attributes.

RE-Re-RX algorithm. This algorithm was proposed to overcome the drawback of the Re-RX algorithm related to continuous attributes: the linear hyperplane generated with the continuous attributes is replaced with an if-then rule. The antecedent part of the rule contains input data ranges, and the consequent part contains the target class. A reverse engineering technique is used to analyse the continuous attributes: the patterns classified in the presence of a continuous attribute and the patterns misclassified in its absence are used to calculate the data range of that attribute.

ERENNR algorithm. This algorithm constructs rules recursively in the form of ranges of attribute values and targets from a neural network with a single hidden layer. It uses an eclectic approach to analyse each node and generate global rules: nodes in the hidden layer are analysed by extracting data range matrices, and nodes in the output layer are analysed based on logical combinations of hidden nodes. Subsequently, starting from the output layer and proceeding back, a rule set is formed through a substitution process. The rules in the set are pruned if accuracy improves, and the region of a rule in the rule set is further subdivided recursively using the same procedure as Re-RX.

Table 1 shows the differences between the patterns of rules generated by the four algorithms. Suppose a, b, c, and d are four attributes, and m1 and m2 are two classes. Ci or Di denotes that attribute i is continuous or discrete, respectively, for the Re-RX and RE-Re-RX algorithms; SPVi denotes the split value for the ith attribute in the case of Continuous Re-RX; and lower_rangei and upper_rangei denote the lower and upper ranges of attribute i, respectively, for classifying patterns into a particular class in the case of the RE-Re-RX and ERENNR algorithms.
4 Results and Discussion

This part of the paper assesses the performance of the recursive algorithms for medical diagnosis. Four medical datasets, Echocardiogram, Statlog (Heart), Pima Indians Diabetes, and Thyroid, are collected from the UCI repository to study the effectiveness of the recursive algorithms. Table 2 shows the description of the datasets. The performance of the algorithms is evaluated based on the average testing accuracy over tenfold cross-validation and on comprehensibility (local and global). Local comprehensibility of a rule set is the number of conditions present in the rule set, and global comprehensibility is the number of rules present in the rule set; the lower the comprehensibility values, the better the rule set.

4.1 Results

The performances of the algorithms on the datasets are summarized below:

Echocardiogram. The Echocardiogram dataset contains details of whether a patient will survive for at least 1 year after a heart attack. Attributes 11 and 12 are removed as they are irrelevant. Table 3 and Fig. 4 show that in this case, all the algorithms
Table 1. Pattern of rules generated by Re-RX [6], Continuous Re-RX [7, 8], RE-Re-RX [9], and ERENNR [10] algorithms

Re-RX
  Pattern: If (Da == 1 && Db == 0) follows: If ((coefficient_c * Cc − coefficient_d * Cd) <= …), then class = m1; Else, class = m2
  Description: Discrete and continuous attributes are treated separately; continuous attributes appear only in a linear hyperplane

Continuous Re-RX
  Pattern: If (c >= SPVc && d < SPVd), then class = m1; Else, class = m2
  Description: No distinction between continuous and discrete attributes. Employs a decision tree algorithm for generating rules with both types of attributes

RE-Re-RX
  Pattern: If (Da == 1 && Db == 0) follows: If (Cc >= lower_rangec && Cd <= upper_ranged), then class = m1; Else, class = m2
  Description: The linear hyperplane over continuous attributes is replaced by if-then rules containing their data ranges

ERENNR
  Pattern: If (c >= lower_rangec && d < upper_ranged), then class = m1; Else, class = m2
  Description: No distinction between continuous and discrete attributes. Rules are constructed using a combination of attributes along with their significant data ranges with respect to output classes
performed well in diagnosing the disease, but the RE-Re-RX and ERENNR algorithms are more effective. Table 4 shows the comprehensibilities of the rule sets generated by the algorithms. Analysing the results, the rule sets generated by all the algorithms are more or less comprehensible, but among them the global comprehensibility of the rule set generated by ERENNR is better, and the local comprehensibilities of the rule sets generated by ERENNR and Continuous Re-RX are better. So, in this case the average performance of the ERENNR algorithm is better, and it can more effectively help to diagnose the disease.

Statlog (Heart). The dataset contains information on whether patients have heart disease or not. In this case, a huge difference can be seen between the algorithms in terms of rule set accuracy. Table 3 and Fig. 4 depict that the ERENNR algorithm performed
Table 2. Description of datasets
Datasets                 Total number of patterns   Number of attributes (including class)   Attribute property           Number of classes
Echocardiogram           132                        13                                       Categorical, Integer, Real   2
Statlog (Heart)          270                        14                                       Categorical, Real            2
Pima Indians Diabetes    768                        9                                        Integer, Real                2
Thyroid                  7200                       22                                       Continuous, Binary           3
much better in diagnosing the disease than the others. Analysing the results in Table 4, Continuous Re-RX is better in terms of local comprehensibility, and Continuous Re-RX and ERENNR performed better in terms of global comprehensibility. So, in this case too, the average performance of the ERENNR algorithm is better, and it can diagnose the disease more effectively.

Pima Indians Diabetes. The dataset contains details of whether patients have diabetes or not. Table 3 and Fig. 4 show that in this case too, the rule sets generated by the ERENNR algorithm perform better in diagnosing the disease. Table 4 shows that in terms of comprehensibility, the Re-RX and RE-Re-RX algorithms performed better. So, in this case, no single algorithm can be said to be better; all the algorithms are more or less effective in diagnosing the disease.

Thyroid. The dataset contains clinical details about patients with hyperthyroid (overactive, Class 1), hypothyroid (under-functioning, Class 2), and thyroid in the normal range (normally functioning, Class 3). The dataset is highly imbalanced: Class 1 represents 2.3% (166 patterns), Class 2 represents 5.1% (367 patterns), and Class 3 represents 92.6% (6667 patterns) of the dataset. Results in Table 3 and Fig. 4 depict that in this case, all the algorithms performed well in diagnosing the disease, but the ERENNR algorithm is more effective. Table 4 also shows that the comprehensibility of the rule set generated by ERENNR is better. So, the ERENNR algorithm can more effectively help to diagnose the disease.

Table 3. Average testing accuracies of tenfold cross-validation results (in %)

Datasets                 Re-RX   Continuous Re-RX   RE-Re-RX   ERENNR
Echocardiogram           93.33   95                 96.67      96.67
Statlog (Heart)          73.33   74.07              74.82      80.37
Pima Indians Diabetes    68.57   73.77              74.68      76.88
Thyroid                  91.09   92.58              91.72      93.47

Fig. 4. Graphical comparison

4.2 Discussion

All the results presented above show the effectiveness of the rule sets generated by the four recursive algorithms in diagnosing a disease. The algorithms can effectively assist the medical community in analysing huge medical datasets and representing the hidden knowledge as comprehensible and accurate rule sets. A disease can be diagnosed easily by looking at the major symptoms represented by the antecedents of the rules and further prevented by controlling them. Though all the algorithms are more or less effective, the results depict that in most cases the average performance of the ERENNR algorithm is better than that of the other recursive algorithms. The reason is the approach of rule extraction it follows: ERENNR uses an eclectic approach, which can produce good accuracy because it combines the advantages of the decompositional and pedagogical approaches to rule extraction. Its global comprehensibility is also better than that of the others, because the other three algorithms use a decision tree for rule generation, and decision trees generally generate a larger number of rules, whereas ERENNR does not use a decision tree.
Table 4. Local and global comprehensibility of rule sets

Datasets                 Algorithm           Local comprehensibility   Global comprehensibility
Echocardiogram           Re-RX               13                        6
                         Continuous Re-RX    5                         6
                         RE-Re-RX            7                         6
                         ERENNR              5                         5
Statlog (Heart)          Re-RX               22                        9
                         Continuous Re-RX    6                         5
                         RE-Re-RX            23                        9
                         ERENNR              8                         5
Pima Indians Diabetes    Re-RX               4                         2
                         Continuous Re-RX    18                        11
                         RE-Re-RX            4                         2
                         ERENNR              8                         5
Thyroid                  Re-RX               23                        11
                         Continuous Re-RX    14                        6
                         RE-Re-RX            23                        11
                         ERENNR              8                         3
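For reference, the two comprehensibility measures used in this section can be computed directly from a rule set; the rule representation below is an illustrative assumption:

```python
def comprehensibility(rule_set):
    """Local comprehensibility = total number of antecedent conditions;
    global comprehensibility = number of rules (lower is better)."""
    local_c = sum(len(conditions) for conditions, _ in rule_set)
    global_c = len(rule_set)
    return local_c, global_c

# Hypothetical two-rule set with three conditions in total
rules = [
    ([("blood_sugar", ">=", 126)], "diabetes"),
    ([("blood_sugar", "<", 126), ("bmi", "<", 25)], "healthy"),
]
print(comprehensibility(rules))  # (3, 2)
```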
5 Conclusion

With the change in lifestyle, people are more likely to fall prey to disease, so a proper medical diagnosis system will be very beneficial for both physicians and the public. If the major causes of a disease are known, the disease can be easily diagnosed and the correct medicine taken on time; a disease can also be prevented by controlling its major causes. Researchers are using various data mining techniques to create accurate diagnosis systems, and NNs are one of them. But the problem with NNs is their black-box nature, which hinders their application in the medical domain. As a solution to this problem, rules are generated from NNs; symbolic rules generated from NNs can be used to accurately diagnose a disease. This paper therefore focuses on rule extraction algorithms for disease diagnosis, especially recursive rule extraction algorithms, because recursive methods of rule generation are capable of generating accurate, comprehensible rules. The paper assesses the performance of four recursive algorithms, Re-RX, Continuous Re-RX, RE-Re-RX, and ERENNR, for disease diagnosis. Their performance is validated on four medical datasets collected from the UCI repository. Along with accuracy, the rule sets generated by the algorithms are also evaluated based on local and global comprehensibility. The results show that the performances of all the four recursive algorithms
are good, but among them the ERENNR algorithm can more accurately diagnose a disease with good comprehensibility. So, in a nutshell, recursive rule extraction algorithms are effective enough for medical applications.
References

1. Nithya, B., Ilango, V.: Predictive analytics in health care using machine learning tools and techniques. In: International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, pp. 492–499 (2017)
2. Biswas, S.K., Chakraborty, M., Purkayastha, B., Roy, P., Thounaojam, D.M.: Rule extraction from training data using neural network. Int. J. Artif. Intell. Tools 26(3) (2017)
3. Jivani, K., Ambasana, J., Kanani, S.: A survey on rule extraction approaches based techniques for data classification using neural network. Int. J. Futuristic Trends Eng. Technol. 1(1) (2014)
4. Augasta, M.G., Kathirvalavakumar, T.: Reverse engineering the neural networks for rule extraction in classification problems. Neural Process. Lett. 35, 131–150 (2012)
5. Bologna, G.: A study on rule extraction from neural networks applied to medical databases. Int. J. Neural Syst. 11(3), 247–255 (2001)
6. Setiono, R., Baesens, B., Mues, C.: Recursive neural network rule extraction for data with mixed attributes. IEEE Trans. Neural Netw. 19(2), 299–307 (2008)
7. Hayashi, Y., Nakano, S., Fujisawa, S.: Use of the recursive-rule extraction algorithm with continuous attributes to improve diagnostic accuracy in thyroid disease. Inform. Med. Unlocked 1, 1–8 (2015)
8. Hayashi, Y.: Application of a rule extraction algorithm family based on the Re-RX algorithm to financial credit risk assessment from a Pareto optimal perspective. Oper. Res. Perspect. 3, 32–42 (2016)
9. Chakraborty, M., Biswas, S.K., Purkayastha, B.: Recursive rule extraction from NN using reverse engineering technique. New Gen. Comput. 36(2), 119–142 (2018)
10. Chakraborty, M., Biswas, S.K., Purkayastha, B.: Rule extraction from neural network using input data ranges recursively. New Gen. Comput. 37(1), 67–96 (2019)
11. Hara, A., Hayashi, Y.: Ensemble neural network rule extraction using Re-RX algorithm. In: International Joint Conference on Neural Networks (IJCNN), pp. 1–6 (2012)
12. Hayashi, Y., Sato, R., Mitra, S.: A new approach to three ensemble neural network rule extraction using recursive-rule extraction algorithm. In: Proceedings of the 2013 International Joint Conference on Neural Networks (IJCNN), USA (2013)
13. Hayashi, Y., Yukita, S.: Rule extraction using Recursive-Rule extraction algorithm with J48graft combined with sampling selection techniques for the diagnosis of type 2 diabetes mellitus in the Pima Indian dataset. Inform. Med. Unlocked 2, 92–104 (2016)
Extraction of Relation Between Attributes and Class in Breast Cancer Data Using Rule Mining Techniques

Krishna Mohan1, Priyanka C. Nair1(B), Deepa Gupta1, Ravi C. Nayar2, and Amritanshu Ram2

1 Department of Computer Science & Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bengaluru, India
[email protected], {v_priyanka,g_deepa}@blr.amrita.edu
2 HealthCare Global Enterprises Ltd (HCG) Hospitals, Bangalore, India
[email protected], [email protected]
Abstract. Breast cancer is a rapidly growing cancerous disease and one of the leading causes of death among women. The early identification of breast cancer is essential for improving patients' prognosis. The proposed work aims at identifying the relationships between the attributes of breast cancer datasets obtained from HCG Hospital, Bengaluru (India). The work focuses on identifying the effect of attributes on three different classes, namely metastasis, progression, and death, using the Apriori algorithm, an association rule mining technique. To analyze the relation between the attributes and the values they take for a particular class, more detailed rules are generated using a decision tree-based rule mining technique. Rules are selected for each class based on specific thresholds set for confidence, lift, and support. Keywords: Machine learning · Breast cancer data · Association rule mining · Apriori algorithm · Decision tree-based rule mining technique
1 Introduction

Over recent decades, there has been a great revolution in computer science, and information technology is playing a major role in the healthcare industry, making health care an intelligent system for the prediction and early diagnosis of various kinds of diseases, such as cancer, heart disease, and diabetes [1]. One of the major life-threatening diseases is breast cancer. According to WHO reports [2], every year 2.1 million women are affected by breast cancer, and 627,000 women died due to breast cancer in 2018, accounting for about 15% of all cancer deaths among women. Numerous data mining techniques can be utilized successfully to recognize breast cancer occurrence and cancer status, and to extract the hidden patterns of breast cancer.
© Springer Nature Singapore Pte Ltd. 2021
C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_31
Association rule mining is a major approach through which relations between the attributes of breast cancer data can be explored. It gives a meaningful and efficient way to define and present dependencies among attributes in a dataset. The Apriori algorithm is the most widely used algorithm for association rule mining. The proposed work uses the Apriori algorithm to explore the dependencies between the attributes in the breast cancer data collected from HCG Hospitals. The work focuses on identifying relations between the attributes as well as between the attributes and the class. Relations between attributes and multiple classes, such as metastasis, progression, and death, have been analyzed in this study. Association rules are most useful when the aim is to explore completely new rules; in cases where rules with a specific decision or diagnosis need to be explored, decision tree-based rules are more appropriate [3]. Another major advantage of decision tree-based rules is that multiple attribute values are incorporated into the rules, which provides meaningful patterns for the doctors. Hence, the proposed work also focuses on decision tree-based rules generated from the data. The remaining sections of this paper are organized as follows: Sect. 2 reviews related works, Sect. 3 describes the data source, Sect. 4 presents the proposed system and the overall methods of the work, Sect. 5 analyses the sets of rules generated by the association rule mining algorithm and the decision tree-based rule mining technique, Sect. 6 discusses the general observations from the results, and Sect. 7 concludes and outlines future enhancements.
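Since the generated rules are filtered on support, confidence, and lift thresholds, a minimal computation of the three measures looks like the following; the toy transactions and item names are hypothetical:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the association rule
    antecedent -> consequent over a list of item sets."""
    n = len(transactions)
    n_ant = sum(1 for t in transactions if antecedent <= t)
    n_con = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n
    confidence = n_both / n_ant if n_ant else 0.0
    lift = confidence / (n_con / n) if n_con else 0.0
    return support, confidence, lift

# Toy patient records encoded as sets of attribute=value items (hypothetical)
tx = [{"ER+", "metastasis"}, {"ER+", "metastasis"},
      {"ER+", "no_metastasis"}, {"ER-", "no_metastasis"}]
s, c, l = rule_metrics(tx, {"ER+"}, {"metastasis"})
# support = 2/4 = 0.5, confidence = 2/3, lift = (2/3)/(2/4) = 4/3
```

A lift above 1 indicates the antecedent and consequent co-occur more often than expected under independence, which is why lift is used alongside support and confidence to rank the generated rules.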
2 Related Works Different data mining techniques have been explored on various kinds of healthcare data, with classification being a major task. Classification of breast cancer data obtained from HCG hospitals has been performed using different algorithms, with the multilayer perceptron found to perform best [4]. The dataset was imbalanced, which initially led to poor performance; this was solved by applying SMOTE to the dataset. The data consisted of four classes: metastasis, death, progression, and recurrence. The performance measures used were accuracy, sensitivity, and specificity. A few other studies attempt to classify the Wisconsin breast cancer dataset into recurring or non-recurring classes by applying classification algorithms such as Naïve Bayes, support vector machine, and K-Nearest Neighbor (KNN) classifiers [5–7]. Another work classifies breast cancer patients into biological groups for better prognosis by adding an ensemble classification stage after an ensemble clustering stage [8]. The results, verified by statistical tests and by clinical experts, show that the proposed framework performs better. An adaptive ensemble voting method has been proposed for the diagnosis of breast cancer using Wisconsin breast cancer data [9]: an ensemble of logistic regression and an artificial neural network is applied to the 16 best features obtained by a univariate feature selection method. A self-regulated Multilayer Perceptron Neural Network (MLNN) is another classifier used for breast cancer classification [10] on digital mammogram datasets. The classes included in the study are benign, malignant, and
344
K. Mohan et al.
normal. A model has been developed using a multilayer perceptron to classify breast cancer data into recurring and non-recurring classes [11]. The dataset included details of 286 patients of a medical center in Yugoslavia, of which 201 belong to the non-recurring class and 85 to the recurring class. The risk level of breast cancer in patients has been classified into T1, T2, T3, and T4 using a Bayesian Linear Discriminant Analysis (BLDA) classifier [12]; the data used comprise 82 breast cancer patients of the Department of Oncology of Sri Kuppuswamy Naidu Hospital, Coimbatore, India. Another work proposes to find the best-performing classifier by applying Principal Component Analysis (PCA) with minimal classification rules [13]: among the J48 decision tree, random tree, and reduced error pruning tree implemented on the Wisconsin breast cancer dataset, the J48 classifier was found to be the best. Another major emphasis is on exploring association rule mining to extract dependencies between attributes in a dataset. One study applies association rule mining to the SEER breast cancer dataset to examine the relationships between attributes for the recurrence of breast cancer [14]. Hidden relationships between the attributes of cardiovascular disease data obtained from the UCI Repository have been extracted using the Apriori algorithm [15]; out of the 40 rules generated, prominent rules were selected after analysis using support, confidence, and lift as measures. In another study, predictability-based collective class association rule mining is applied, which increases overall predictive performance by applying cross-validation and combining the resulting rules [16]; effective rules are identified from the generated candidate rules through cross-validation-based rule evaluation that includes rule ranking and rule pruning.
Identification of hidden patterns between the attributes of the most frequently occurring heart diseases, such as Unstable Angina (UA), Myocardial Infarction (MI), and Coronary Heart Disease (CHD), has been performed using Apriori, an association-based rule mining algorithm [17]. That work focuses on the relationship of attributes to the class, and hence class-based association rules are generated. Another work uses the Apriori algorithm to describe the behavior of leprosy patients during long-term medication with Multi-Drug Therapy (MDT) drugs [18]. Rules have also been extracted using the Apriori algorithm on breast cancer datasets to identify the relation between the attributes and the recurrence class [19, 20]. Association rule mining has further been applied to discover the relationships between the possible factors associated with the occurrence of skin melanoma [21]; the dataset used contains demographic and environmental factors obtained from the National Cancer Institute (NCI) SEER dataset and the Missouri Information for Community Assessment (MICA). Decision tree-based rules are another approach to extract rules from a dataset with a faster response. Decision tree-based rules have been generated from a diabetes dataset in [3]; the rules generated from the decision tree were found to be fewer in number and more efficient compared with association rules. Although many works apply machine learning to healthcare data, the literature shows no significant work on association rule mining to explore dependencies between the attributes of breast cancer data. The proposed work attempts to extract relations between the attributes in the breast cancer data. Since multiple attribute-value combinations cannot be fully explored in association rule mining, the proposed work also extracts dependencies of breast cancer attributes with the help of rules generated from a decision tree.
The work focuses on extracting the dependencies
between the attributes as well as dependencies between the attributes and the class. From the literature, it has been identified that the existing works focus on the effect of attributes of breast cancer data on a single class variable, which is generally the recurrence class. On the contrary, the proposed work discovers the relationship of the same attributes with three different classes, namely, metastasis, progression, and death.
3 Data Source The data were provided by HCG Hospital, Bangalore, without revealing patients' private information, and consist of clinical records of 1595 breast cancer patients, summarized in Table 1. The proposed work focuses on 11 attributes and 3 class variables, which are death, progression, and metastasis.

Table 1. Breast cancer dataset description

Agecat: 1 (below 50), 2 (above 50)
Cancer type: 1 (right), 2 (left), 3 (bilateral)
Treatment: 1 (surgery with some other treatment), 2 (chemotherapy, radiotherapy, and surgery), 3 (chemotherapy and radiotherapy), 4 (only surgery), 5 (only chemotherapy), 6 (only radiotherapy)
Stage: 1 (tumor size up to 2 cm), 2 (tumor size between 2 cm and 5 cm), 3 (tumor larger than 5 cm and spread to lymph nodes), 4 (tumor spread to nearby lymph nodes)
Status code: 1 (ER and PR positive, Her2 negative), 2 (ER and PR negative, Her2 positive), 3 (ER, PR, and Her2 all negative), 4 (all three hormone receptors positive)
Grade: 1 (well differentiated), 2 (moderately differentiated), 3 and 4 (poorly differentiated)
Surgery: 1 (surgery performed), 0 (surgery not performed)
Progesterone Receptor (PR): 1 (positive), 0 (negative)
Her2: 1 (present), 0 (absent)
Estrogen Receptor (ER): 1 (positive), 0 (negative)
Radiotherapy (Rt): 1 (radiotherapy performed), 0 (radiotherapy not performed)
Class Death: 1 (died within 5 years), 0 (survived for 5 years)
Class Metastasis: 1 (cancer has spread from the breast to other organs), 0 (cancer has not spread from the breast to other organs)
Class Progression: 1 (cancer develops to an advanced state in the breast), 0 (cancer does not develop to an advanced state)
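As an illustration of how one coded patient record of the kind described in Table 1 can be fed to an association rule miner, the sketch below converts a record into a "transaction" of attr=value items; the patient record and helper name are illustrative assumptions, not drawn from the actual HCG data.

```python
# Sketch: turning one coded patient record (per Table 1) into a "transaction"
# of attribute=value items, the input format Apriori-style miners expect.
# The record below is illustrative, not taken from the HCG dataset.

def record_to_transaction(record):
    """Convert a dict of coded attributes into a set of 'attr=value' items."""
    return {f"{attr}={value}" for attr, value in record.items()}

patient = {
    "agecat": 2,   # above 50
    "stage": 2,    # tumor between 2 cm and 5 cm
    "grade": 3,    # poorly differentiated
    "surgery": 1,  # surgery performed
    "Rt": 1,       # radiotherapy performed
    "Her2": 0,     # Her2 absent
    "death": 0,    # survived 5 years
}

transaction = record_to_transaction(patient)
print(sorted(transaction))
```

Each patient then becomes one transaction, and the whole dataset a transactional database over which frequent itemsets and rules can be mined.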
4 Proposed System The proposed work is based on the sets of rules extracted using the Apriori algorithm and a decision tree-based rule mining technique, through the following rule generation and extraction process. 4.1 Rule Generation/Extraction Process The proposed work uses data that were already cleaned in previous work [4], so rules are extracted from the preprocessed dataset, in which records with missing values have been removed in order to obtain accurate rules. The Apriori algorithm and a decision tree rule mining technique are implemented for rule discovery; both produce meaningful sets of rules for decision-making toward early diagnosis of disease. 4.2 Association Rule Mining Association rule mining is a data mining practice that finds associations between frequent patterns in large datasets, as well as causal relationships between the frequent patterns. In this mining structure, each mined rule must meet minimum support and confidence thresholds. An association rule has the form LHS => RHS, where LHS and RHS are two disjoint sets of patterns; such a rule states that the RHS set tends to appear when the LHS set appears. Two metrics, support and confidence, are used to measure the effectiveness and reliability of the rules, respectively. For a rule Item1 => Item2, the support, confidence, and lift values are calculated as shown in Eqs. (1), (2), and (3), respectively.

Support = P(Item1 ∩ Item2)  (1)

Confidence = P(Item1 ∩ Item2) / P(Item1)  (2)

Lift = Support / (Support(Item1) × Support(Item2))  (3)
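A minimal sketch of Eqs. (1)-(3) computed over a toy list of transactions; the transactions and item names are illustrative assumptions, not drawn from the HCG dataset.

```python
# Sketch: computing support, confidence, and lift for a rule LHS => RHS
# over a list of set-valued transactions, following Eqs. (1)-(3).

def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(lhs, rhs, transactions):
    s_lhs = support(lhs, transactions)
    s_rhs = support(rhs, transactions)
    s_both = support(lhs | rhs, transactions)   # Eq. (1): P(LHS ∩ RHS)
    confidence = s_both / s_lhs                 # Eq. (2)
    lift = s_both / (s_lhs * s_rhs)             # Eq. (3)
    return s_both, confidence, lift

transactions = [
    {"Her2=0", "surgery=1", "death=0"},
    {"Her2=0", "surgery=1", "death=0"},
    {"Her2=1", "surgery=0", "death=1"},
    {"Her2=0", "surgery=1", "death=0"},
]
print(rule_metrics({"Her2=0"}, {"death=0"}, transactions))
```

Rules whose metrics fall below chosen thresholds (e.g. the 0.6 support and 0.9 confidence used later in this paper) would simply be filtered out.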
Out of the different association rule mining algorithms, the Apriori algorithm, the most popular method, has been selected for the proposed work to identify the dependencies between attributes in the breast cancer data. The Apriori algorithm determines the frequent itemsets in a transactional database using a candidate generation method [22]. It uses a bottom-up approach to mine frequent itemsets for Boolean association rules, operating on a database holding transactions and the collection of items to be mined. Because the Apriori algorithm generates a huge number of rules from the dataset, significant and relevant rules need to be selected from them. 4.3 Decision Tree-Based Rule Mining Technique A decision tree is a classification algorithm that generates a tree, and rules as a model representation, for the different classes in a dataset. A decision tree makes decisions based on the values assigned to each node: an internal node corresponds to an attribute, and leaf nodes correspond to class labels. Decision trees handle both categorical and numerical data, and a set of attribute-to-class rules can be extracted from the tree. The proposed work uses the decision tree-based technique J48 to predict the target variable in the dataset; J48 creates a univariate decision tree. It is a statistics-based classifier with parameters such as the minimum number of objects, confidence factor, binary splits, and number of folds, where the confidence factor defaults to 0.25 for effective pruning. Decision tree-based rules have been generated for the proposed work because they produce more meaningful rules by considering the values taken by multiple attributes for a particular class.
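The bottom-up candidate generation described above can be sketched as follows. This is a minimal, illustrative Apriori for frequent itemsets only (rule generation would follow as a separate step), not the implementation used in the paper's experiments.

```python
from itertools import combinations

# Minimal Apriori sketch: bottom-up generation of frequent itemsets subject
# to a minimum support threshold. Illustrative only.

def apriori(transactions, min_support):
    n = len(transactions)
    freq = {}
    # L1: frequent single items
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    k = 1
    while current:
        for c in current:
            freq[c] = sum(c <= t for t in transactions) / n
        k += 1
        # Candidate generation: unions of pairs of frequent (k-1)-itemsets
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
    return freq  # maps frequent itemset -> support
```

The key Apriori property is that every subset of a frequent itemset must itself be frequent, which is why candidates of size k are built only from frequent itemsets of size k-1.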
5 Result Analysis To extract the relations between the attributes of the breast cancer dataset, the Apriori algorithm and a decision tree-based rule mining technique have been implemented in this work. 5.1 Rules from Apriori Algorithm The Apriori algorithm has been implemented, and the rules were filtered using threshold values of 0.9 for confidence, 0.6 for support, and 1 for lift. Based on these thresholds, 34 rules for the death class and 30 rules each for the progression and metastasis classes have been generated. A few significant rules for each of the three classes are shown in Table 2.

Table 2. Sampled rules generated from Apriori algorithm on breast cancer dataset

No.   Rule                                             Support  Confidence  Lift
1.    Attribute-Based Rules
1.1   {Rt = 1} => {surgery = 1}                        0.75     0.97        1.01
1.2   {Grade = 3} => {surgery = 1}                     0.62     0.95        1.00
1.3   {Her2 = 0} => {surgery = 1}                      0.74     0.96        1.00
1.4   {stage = 2} => {surgery = 1}                     0.72     0.96        1.00
2.    Death Class Rules
2.1   {stage = 2, death = 0} => {surgery = 1}          0.72     0.93        1.00
2.2   {stage = 2, treatment = 1} => {death = 0}        0.61     0.93        1.00
2.3   {Her2 = 0} => {death = 0}                        0.68     0.96        1.00
2.4   {treatment = 1, Rt = 1} => {death = 0}           0.62     0.93        1.00
2.5   {Grade = 3} => {death = 0}                       0.60     0.93        1.00
3.    Progression Class Rules
3.1   {surgery = 1, Rt = 1} => {progression = 0}       0.66     0.88        1.00
3.2   {stage = 2, surgery = 1} => {progression = 0}    0.64     0.89        1.00
3.3   {Her2 = 0} => {progression = 0}                  0.68     0.89        1.00
3.4   {stage = 2, progression = 0} => {surgery = 1}    0.64     0.97        1.01
3.5   {Her2 = 0, surgery = 1} => {progression = 0}     0.66     0.89        1.01
4.    Metastasis Class Rules
4.1   {surgery = 1, Rt = 1} => {mets = 0}              0.69     0.95        1.00
4.2   {stage = 2, surgery = 1} => {mets = 0}           0.69     0.96        1.00
4.3   {Her2 = 0, surgery = 1} => {mets = 0}            0.68     0.96        1.00
4.4   {Rt = 1, mets = 0} => {surgery = 1}              0.69     0.98        1.01
4.5   {treatment = 1} => {mets = 0}                    0.81     0.95        1.00

Rules 2.3 and 3.3 in Table 2 show that a negative Human Epidermal Growth Factor Receptor 2 (Her2) status implies no progression and no death for 5 years. These rules appear valid from medical sources, since it is known that a negative Her2 status helps prevent the cancer from developing to an advanced stage. Rule 4.1 in Table 2 shows that whenever surgery is performed in combination with radiotherapy, there is no metastasis. Other similarly meaningful rules extracted by the Apriori algorithm are listed in Table 2. 5.2 Rules from Decision Tree-Based Rule Mining Technique A decision tree has been created on the breast cancer dataset using the Weka tool, and rules have been identified from it. The values in brackets give the number of instances that follow each rule, and rules are selected based on the count of samples that belong to the rule. In total, 12, 13, and 15 rules have been generated for the death,
metastasis, and progression classes, respectively. Some of the sample rules generated from the decision tree for each class are shown in Table 3.

Table 3. Sampled rules generated from decision tree rule mining technique on breast cancer dataset

1. Death Class Rules
Rule 1 -> If (stage = 3), then death = 0 (254/21)
Rule 2 -> If (stage = 2) AND (treatment = 1) AND (Rt = 1) AND (Grade = 2) AND (Her2 = 0) AND (Cancer type = 2), then death = 0 (103/21)
Rule 3 -> If (stage = 2) AND (treatment = 1) AND (Rt = 0), then death = 0 (186/34)
Rule 4 -> If (stage = 2) AND (treatment = 1) AND (Rt = 1) AND (Grade = 3) AND (agecat = 1) AND (Her2 = 0), then death = 0 (141/20)
Rule 5 -> If (stage = 2) AND (treatment = 1) AND (Rt = 1) AND (Grade = 3) AND (agecat = 2) AND (Her status code = 3) AND (Cancer type = 2), then death = 1 (79/19)

2. Metastasis Class Rules
Rule 1 -> If (surgery = 1) AND (Rt = 0), then metastasis = 0 (280/5)
Rule 2 -> If (surgery = 0) AND (ER = 0) AND (PR = 1), then metastasis = 1 (33/1)
Rule 3 -> If (surgery = 1) AND (Rt = 1) AND (treatment = 1) AND (stage = 2) AND (Grade = 2) AND (agecat = 2) AND (PR = 1), then metastasis = 1 (114/13)
Rule 4 -> If (surgery = 1) AND (Rt = 1) AND (treatment = 1) AND (stage = 3) AND (ER = 1), then metastasis = 0 (90/2)
Rule 5 -> If (surgery = 1) AND (Rt = 1) AND (treatment = 1) AND (stage = 2) AND (Grade = 3) AND (Her status code = 3), then metastasis = 0 (78/6)

3. Progression Class Rules
Rule 1 -> If (stage = 3) AND (PR = 1), then progression = 0 (147/14)
Rule 2 -> If (stage = 2) AND (treatment = 1) AND (Rt = 1) AND (Grade = 2) AND (Her status code = 2) AND (Cancer type = 1), then progression = 1 (76/7)
Rule 3 -> If (stage = 2) AND (treatment = 1) AND (Rt = 0) AND (agecat = 1), then progression = 1 (72/7)
Rule 4 -> If (stage = 2) AND (treatment = 1) AND (Rt = 1) AND (Grade = 3) AND (Her status code = 1) AND (PR = 1) AND (agecat = 1), then progression = 0 (69/15)
Rule 5 -> If (stage = 2) AND (treatment = 1) AND (Rt = 1) AND (Grade = 3) AND (Her status code = 4) AND (agecat = 1), then progression = 1 (70/11)
Analysis of the extracted rules reveals the following: • Patients with a poorly differentiated grade were found to be given surgery as treatment. • A negative Her2 status indicates that the patient survives for at least 5 years after treatment. • When a patient undergoes both surgery and radiotherapy, metastasis and progression were found to be absent. Metastasis means the cancer spreads from the breast to other organs in the body, and progression means the cancer develops to an advanced state within the breast. • Whenever the stage was 2 (tumor size between 2 cm and 5 cm), the treatment was found to include surgery.
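As a sketch of how decision tree rules like those in Table 3 can be applied, the snippet below encodes two metastasis rules from Table 3 as condition dictionaries and keeps only rules whose supporting sample count meets a threshold, mirroring the count-based rule selection described above. The patient records and the 30-sample cutoff are illustrative assumptions.

```python
# Two metastasis rules transcribed from Table 3 as
# (conditions, prediction, supporting-sample count).
rules = [
    ({"surgery": 1, "Rt": 0}, ("metastasis", 0), 280),          # Rule 1 (280/5)
    ({"surgery": 0, "ER": 0, "PR": 1}, ("metastasis", 1), 33),  # Rule 2 (33/1)
]

def apply_rules(patient, rules, min_count=30):
    """Return the first prediction whose conditions all match the patient
    and whose supporting sample count meets the threshold."""
    for conditions, prediction, count in rules:
        if count >= min_count and all(patient.get(a) == v for a, v in conditions.items()):
            return prediction
    return None  # no rule fired

print(apply_rules({"surgery": 1, "Rt": 0, "stage": 2}, rules))
```

Representing rules this way makes the attribute-value combinations visible to clinicians, which is the advantage of decision tree-based rules emphasized in the paper.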
6 Discussion According to a breast cancer report by the American Cancer Society [23], the average age of diagnosis for stage 1 breast cancer is 52 years, and in 90% of stage 1, stage 2, and stage 3 cases, the treatment given is surgery followed by radiotherapy, after which the cancer does not metastasize, i.e., it does not spread from the breast to other organs in the body. The same kinds of rules are obtained among the generated association rules in most cases, with a high confidence value of 97%. The analysis of the rules generated using the Apriori algorithm and the decision tree shows that in most cases radiotherapy combined with surgery is performed for the treatment of breast cancer; notably, radiotherapy is common in all types of breast cancer treatment with a high confidence value. According to a report by the Breast Cancer Organization [24], in more than 80% of breast cancer cases the Progesterone Receptor (PR) is positive, which means the cancer cells grow in response to the progesterone hormone. The same observation is obtained from the analysis of the rules extracted using the decision tree-based rule generation technique.
7 Conclusion and Future Work This research work applies association rule mining for mining frequent itemsets and a decision tree for extracting the relations between attributes in a breast cancer dataset. The novelty of the work is that, along with predicting the class attributes of breast cancer, it attempts to suggest lab tests and medication for the diagnosed disease. The proposed work is built around diseases commonly occurring in India. A rule generation-based methodology is applied for extracting the relations among attributes modeling the lab tests and medication, which could help doctors and medical experts take effective and reliable decisions easily in order to diagnose the disease. It also analyzes how data mining and machine learning techniques can be exploited on a live dataset to extract and validate meaningful relations between attributes in healthcare data, which are very sensitive for human beings and their medication. The analysis has been performed on a live dataset consisting of three classes: death, progression, and metastasis. As an extension of the current work, it is planned to obtain more information on other attributes and classes. It is also planned to identify the effect of all attributes on classes such as recurrence (whether the patient's cancer is recurring or not), response (the patient's response to the treatment), and DFS (the number of years of disease-free survival by the patient). As DFS is a continuous variable, multiple regression techniques can be applied to the data for its prediction; similarly, response takes seven values, which could be explored using multiclass classification approaches. Acknowledgments. This research work is carried out with the data provided by HealthCare Global Enterprises Ltd (HCG) Hospitals, Bengaluru, India, without any direct involvement of
the patients. Ethical clearance has been taken from HealthCare Global Enterprises Ltd (HCG) Hospitals, Bengaluru, India.
References 1. Gupta, D., Khare, S., Aggarwal, A.: A method to predict diagnostic codes for chronic diseases using machine learning techniques. In: 2016 International Conference on Computing, Communication and Automation (ICCCA). IEEE (2016) 2. World Health Organization Report (last accessed on 11 September 2019). https://www.who.int/cancer/prevention/diagnosis-screening/breast-cancer/en/# 3. Zorman, M., et al.: Mining diabetes database with decision trees and association rules. In: Proceedings of 15th IEEE Symposium on Computer-Based Medical Systems (CBMS 2002). IEEE (2002) 4. Shastri, S.S., Nair, P.C., Gupta, D., Nayar, R.C., Rao, R., Ram, A.: Breast cancer diagnosis and prognosis using machine learning techniques. In: The International Symposium on Intelligent Systems Technologies and Applications, pp. 327–344. Springer, Cham (2017) 5. Asri, H., et al.: Using machine learning algorithms for breast cancer risk prediction and diagnosis. Procedia Comput. Sci. 83, 1064–1069 (2016) 6. Amrane, M., Oukid, S., Gagaoua, I., Ensari, T.: Breast cancer classification using machine learning. In: 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), pp. 1–4. IEEE (2018) 7. Bharati, S., Rahman, M.A., Podder, P.: Breast cancer prediction applying different classification algorithms with comparative analysis using WEKA. In: 2018 4th International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT). IEEE (2018) 8. Agrawal, U., Soria, D., Wagner, C., Garibaldi, J., Ellis, I.O., Bartlett, J.M.S., Cameron, D., Rakha, E.A., Green, A.R.: Combining clustering and classification ensembles: a novel pipeline to identify breast cancer profiles. Artif. Intell. Med. 97, 27–37 (2019) 9. Khuriwal, N., Mishra, N.: Breast cancer diagnosis using adaptive voting ensemble machine learning algorithm. In: 2018 IEEMA Engineer Infinite Conference (eTechNxT). IEEE (2018) 10. Ting, F.F., Sim, K.S.: Self-regulated multilayer perceptron neural network for breast cancer classification. In: 2017 International Conference on Robotics, Automation and Sciences (ICORAS). IEEE (2017) 11. Nurmaini, S., et al.: Breast cancer classification using deep learning. In: 2018 International Conference on Electrical Engineering and Computer Science (ICECOS). IEEE (2018) 12. Rajaguru, H., Prabhakar, S.K.: Bayesian linear discriminant analysis for breast cancer classification. In: 2017 2nd International Conference on Communication and Electronics Systems (ICCES). IEEE (2017) 13. Douangnoulack, P., Boonjing, V.: Building minimal classification rules for breast cancer diagnosis. In: 2018 10th International Conference on Knowledge and Smart Technology (KST). IEEE (2018) 14. Umesh, D.R., Ramachandra, B.: Association rule mining based predicting breast cancer recurrence on SEER breast cancer data. In: 2015 International Conference on Emerging Research in Electronics, Computer Science and Technology (ICERECT). IEEE (2015) 15. Khare, S., Gupta, D.: Association rule analysis in cardiovascular disease. In: 2016 Second International Conference on Cognitive Computing and Information Processing (CCIP). IEEE (2016) 16. Song, K., Lee, K.: Predictability-based collective class association rule mining. Expert Syst. Appl. 79, 1–7 (2017)
17. Sonet, K.M.M.H., et al.: Analyzing patterns of numerously occurring heart diseases using association rule mining. In: 2017 Twelfth International Conference on Digital Information Management (ICDIM). IEEE (2017) 18. Rachmani, E., et al.: Mining medication behavior of the completion leprosy's multi-drug therapy in Indonesia. In: 2018 International Seminar on Application for Technology of Information and Communication. IEEE (2018) 19. Pala, T., Yücedağ, İ., Biberoğlu, H.: Association rule for classification of breast cancer patients. Sigma 8(2), 155–160 (2017) 20. Kabir, Md.F., Ludwig, S.A., Abdullah, A.S.: Rule discovery from breast cancer risk factors using association rule mining. In: 2018 IEEE International Conference on Big Data (Big Data). IEEE (2018) 21. Shyu, R., Haithcoat, T., Becevic, M.: Spatial association mining between melanoma prevalence rates, risk factors, and healthcare disparities. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE (2017) 22. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), vol. 1215 (1994) 23. American Cancer Society. https://www.cancer.org/cancer/breast-cancer/treatment/treatment-of-breast-cancer-by-stage/treatment-of-breast-cancer-stages-i-iii.html 24. Breast Cancer Organization Report. https://www.breastcancer.org/symptoms/types/recur_metast/treat_metast/options/surgery
Recent Challenges in Recommender Systems: A Survey Madhusree Kuanr and Puspanjali Mohapatra(B) Department of Computer Science, IIIT, Bhubaneswar, India [email protected], [email protected]
Abstract. The recent revolutionary technology transformations in the internet domain have enabled us to move from static web pages, through the social networking web, to the ubiquitous computing web. In turn, this has enabled recommender systems to leave their infancy and mature while tackling the dynamic challenges arising for users. A recommender system anticipates user requirements before the user expresses them. Recommender systems in various domains prove their efficiency by providing appropriate recommendations according to the preferences of users. A recommender system is a software solution in different online applications that helps the user make appropriate decisions and also acts as a business tool in various domains. This article covers the various types of recommender systems as well as the strategies and recent challenging research issues to improve the capabilities of recommender systems. Keywords: Recommender systems · Sparsity · Popularity span · Preference relations · Absolute ratings
1 Introduction Recommender systems are software solutions that pick a number of products from a broad product pool and recommend them to the user based on the user's preferences. The first recommender system, Tapestry, was designed to recommend documents from newsgroups. Recommendation systems generate a recommendation list using various methods such as collaborative, content-based, hybrid, and knowledge-based recommendation approaches. A collaborative recommender system proposes to the user those products that have been liked in the past by individuals of comparable taste. It tries to discover the peers of a given query user and to recommend the items those peers liked in the past; various measures such as similarity and correlation are used to find the peers of a given query user. Examples of collaborative recommendation systems are Amazon.com, GroupLens [1], the video recommender [2], Ringo [3], and the PHOAKS system [4]. Memory-based and model-based algorithms are two distinct variants of collaborative algorithms. Memory-based algorithms are based on predicting ratings of items by considering the past items liked by the peers [5, 6]. Memory-based algorithms may not always be fast and scalable, and can be replaced by © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_32
354
M. Kuanr and P. Mohapatra
model-based algorithms. In model-based algorithms, a model is built from previous data and used for recommendation purposes. Content-based recommendation systems suggest products by discovering items comparable to the products that the query user enjoyed in the past. The content-based strategy is grounded in the ideas of information retrieval and information filtering, and measures like TF-IDF can be used in content-based approaches. The Fab system [7] and the Syskill & Webert system [8] are some examples of content-based recommender schemes. Hybrid recommender systems incorporate the features of both collaborative and content-based recommender systems [9]. Knowledge-based recommendation systems [10, 11] do not require rating information for recommendation purposes; the system produces the recommendation list based on its knowledge base. Knowledge-based systems can be categorized into case-based and constraint-based systems, which differ in the way they use the knowledge. The basic concept of a recommender system is depicted in Fig. 1.
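The memory-based collaborative idea described above can be sketched as a similarity-weighted rating prediction; the tiny ratings matrix and the user/item names are illustrative assumptions, not from any cited system.

```python
import math

# Sketch of a memory-based collaborative approach: predict a user's rating
# for an item from the ratings of cosine-similar peers.

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den

def predict(ratings, user, item):
    """Similarity-weighted average of peers' ratings for the item."""
    num = den = 0.0
    for peer, r in ratings.items():
        if peer != user and item in r:
            w = cosine(ratings[user], r)
            num += w * r[item]
            den += abs(w)
    return num / den if den else None

ratings = {
    "u1": {"i1": 5, "i2": 4},
    "u2": {"i1": 5, "i2": 4, "i3": 5},
    "u3": {"i1": 1, "i3": 2},
}
print(predict(ratings, "u1", "i3"))
```

Here "u1" agrees closely with "u2" and weakly with "u3", so the prediction for "i3" lands near u2's rating of 5, which is the neighborhood effect memory-based methods rely on.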
Fig. 1. Basic concept of a recommender system
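The TF-IDF measure mentioned above for content-based approaches can be sketched as follows; the toy corpus of item descriptions is an illustrative assumption, and a real system would typically use a library implementation.

```python
import math

# Sketch: TF-IDF weighting for content-based recommendation. Item
# descriptions are term lists; terms frequent in one description but rare
# across the catalog get higher weight.

def tf_idf(docs):
    n = len(docs)
    df = {}  # document frequency of each term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        tf = {t: doc.count(t) / len(doc) for t in set(doc)}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights  # one term->weight dict per document

docs = [
    ["thriller", "space", "space"],
    ["romance", "comedy"],
    ["thriller", "comedy"],
]
w = tf_idf(docs)
```

A content-based recommender would then compare the query user's profile vector against these weighted item vectors (e.g. by cosine similarity) to rank unseen items.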
The remainder of the paper is structured as follows. Section 2 discusses various recommendation techniques available for recommender systems. Section 3 describes seven recent recommendation challenges. Section 4 presents the various datasets available for recommender systems, and Sect. 5 illustrates the different parameters for recommender systems evaluation.
Recent Challenges in Recommender Systems …
355
2 Recommendation Techniques Recommendation systems can be used to generate a top-N list of recommendations or to predict the rating value for a specific product. Different recommendation approaches use different techniques, as listed in Table 1. The recommendation approaches can be broadly classified into four categories, as shown in Fig. 2.

Table 1. Recommendation techniques in recommender systems

1. Collaborative
   Heuristic-based techniques: nearest neighbor (cosine, correlation); clustering; graph theory
   Model-based techniques: Bayesian network; clustering; artificial neural networks; deep neural network; linear regression; probabilistic models
   Research examples: Yu, Tang (2018) [12]; Rohit, Singh (2017) [13]; Shakirova (2017) [14]
2. Content-based
   Heuristic-based techniques: TF-IDF; clustering
   Model-based techniques: Bayesian classifier; clustering; decision trees; artificial neural network; deep neural network
   Research examples: Walek, Spackova (2018) [15]; Bahulikar (2017) [16]; Pal, Parhi (2017) [17]
3. Hybrid
   Heuristic-based techniques: various voting schemes; linear combination of predicted ratings; incorporating one component as a part of the heuristic for the other
   Model-based techniques: building one unifying model; incorporating one component as a part of the model for the other
   Research examples: Devika, Subramaniyaswamy (2018) [18]; Kbaier, Masri (2017) [19]; Zhang, Liu (2016) [20]
4. Knowledge-based
   Knowledge base of: user feedback; ontology
   Research examples: Subbotin, Gladkova (2018) [21]; Wonoseto, Rosmansyah (2017) [22]; Tsai, Wuy, Hsuy (2017) [23]
Fig. 2. Types of recommendation approaches
3 Challenges in Recommender Systems There are several difficulties in recommender systems that must be addressed to enhance their effectiveness and performance. Some recent issues identified by various researchers in this regard are discussed in the following subsections. 3.1 Handling of Data Sparsity Data sparsity is one of the challenging issues for collaborative recommender systems: if an item is rated by very few people, even with very good ratings, it may not appear in the recommendation list. Sparsity can also lead to bad recommendations for users whose tastes are uncommon compared to other users, or for a fresh user. Demographic filtering [24] is one method in which demographic profile data such as gender, age, education, region code, and job data are used to calculate customer resemblance. Sparsity can also be addressed through associative retrieval frameworks and related spreading activation algorithms that find transitive connections among customers through previous operations and feedback [25]. Recent research has revealed that preference relations, instead of absolute ratings, can be used to handle the sparsity problem, because users with comparable tastes in products but distinct rating biases or practices may not have comparable rating patterns; this becomes more noticeable when the ratings for the items are fewer. For similar users, however, the preference relations over the items are expected to be similar. For example, [26] presents a memory-based collaborative recommender system that uses preference
Recent Challenges in Recommender Systems …
357
relations between items to boost the prediction accuracy for items with fewer ratings, evaluated on the MovieLens dataset.

3.2 Adaptiveness in Search Engines

The Internet is a huge pool of structured and unstructured text, audio, and video data. As a result, a user gets a large number of references to web pages for a standard search in a search engine, and extracting the most relevant information from that large set is tedious and time consuming. Hence, personalization of the search engine is an important means of providing relevant information to a user; it requires domain knowledge of the user to construct the user's profile. For example, [27] proposes a hybrid algorithm that combines a content-based algorithm with an access-based algorithm to provide user-based recommendations, and the system can adapt to changes in the user's interest.

3.3 Website Personalization

Personalization of a website is the process of creating the web pages most suited to the interests of the user. For example, [28] presents an adaptive framework that recommends the most appropriate web pages when the user visits a specific web page on a topic, where the relevant web pages are connected via hyperlinks to the parent web page. It performs "on the fly" personalization that is distinct for each customer. In the system, each website is modeled as a graph, each node represents an individual URL, and the edges are the links embedded in the HTML. It uses a backward breadth-first traversal to identify the sub-web pages relevant to the parent searched web page.

3.4 Time Awareness in Collaborative Recommender Systems

Collaborative filtering algorithms are most commonly used to provide recommendations to users across all domains, and they generally do not consider time as a factor while computing similar users and generating the recommendation list. But users' interests may vary over time.
So, the recommender system should consider both user dynamics and item dynamics while generating the recommendation list, by considering the users' most recent transactions. "Popularity span" is a time measure of the era during which an item stays popular. For example, [29] proposes algorithms that use time-of-purchase information to calculate user similarities and combine this information with the purchase behaviors of experts to generate the final recommendation.
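As an illustration of such time awareness (a generic sketch, not the specific algorithm of [29]; the half-life value and the `item -> (rating, timestamp)` data layout are assumptions), each rating can be weighted by an exponential decay on its age before computing user similarity:

```python
import math

def decay_weight(age_days, half_life=90.0):
    """Exponential decay: a rating loses half its weight every `half_life` days."""
    return 0.5 ** (age_days / half_life)

def time_weighted_similarity(ratings_u, ratings_v, now=1000):
    """Cosine-style similarity over co-rated items, each rating weighted by recency.
    ratings_* map item -> (rating, timestamp_in_days)."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    num = den_u = den_v = 0.0
    for item in common:
        ru, tu = ratings_u[item]
        rv, tv = ratings_v[item]
        wu, wv = decay_weight(now - tu), decay_weight(now - tv)
        num += (wu * ru) * (wv * rv)
        den_u += (wu * ru) ** 2
        den_v += (wv * rv) ** 2
    return num / math.sqrt(den_u * den_v)

# Two users agreeing on recent items count more than agreement on old ones.
u = {"A": (5, 990), "B": (3, 400)}
v = {"A": (4, 995), "B": (3, 950)}
print(round(time_weighted_similarity(u, v), 3))
```

With this weighting, a stale rating contributes little to the similarity, so the neighborhood of a user drifts toward users who agreed with them recently.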
3.5 Generation of Recommendations with Explanation

Traditional recommender systems generate a recommendation list without any transparent information on why a product is recommended, which in turn affects the user's willingness to accept those recommendations when making decisions. Recommender systems with detailed explanations [30, 31], by contrast, produce explainable recommendations (and dis-recommendations) by giving explanations at the feature level and providing suggestions for products by correlating them with product reviews.

3.6 Superiority of Preference Relation-Based Matrix Factorization Algorithms

Traditional recommender systems use absolute ratings of items to generate the recommendation list, but recent research advocates that preference relations outperform absolute ratings in producing better recommendations. Matrix factorization algorithms also perform well for recommender systems with sparse data. For instance, [32] is a collaborative recommendation algorithm based on matrix factorization that utilizes preference relations for better recommendation, implemented on the Netflix dataset. Rating prediction is another aspect of a recommender system: traditional rating-based systems use the absolute ratings of items to predict the rating a user will give to an item he/she has not previously rated. Recent research on rating prediction has disclosed some disadvantages of rating-based recommender systems. A preference relation-based matrix factorization method is used in [33], with a rating prediction algorithm that considers the relative ratings given by users to different pairs of items.

3.7 Decision Recommender Systems

Recommender systems can also be used to make appropriate decisions in an online auction process.
For example, [34] proposes a recommender system that helps a transporter decide which consignment to bid for from a set of available consignments. The system is formulated as a resource allocation problem, with state–action pairs among the features, and latitude and longitude parameters are used to represent each location.
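To make the preference relations of Sects. 3.1 and 3.6 concrete, a user's absolute ratings can be converted into pairwise relations. The following is a minimal sketch (illustrative only, not the factorization method of [32, 33]):

```python
from itertools import combinations

def preference_relations(user_ratings):
    """Map a user's absolute ratings {item: rating} to pairwise preferences:
    +1 if item i is preferred over item j, -1 if j over i, 0 if tied."""
    prefs = {}
    for i, j in combinations(sorted(user_ratings), 2):
        diff = user_ratings[i] - user_ratings[j]
        prefs[(i, j)] = (diff > 0) - (diff < 0)  # sign of the rating difference
    return prefs

print(preference_relations({"a": 5, "b": 3, "c": 3}))
# {('a', 'b'): 1, ('a', 'c'): 1, ('b', 'c'): 0}
```

Two users with different rating scales (say, one rating 2/4, the other 3/5) produce identical preference relations, which is exactly why such relations are more robust to rating bias than absolute scores.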
4 Datasets for Recommender Systems

The proper evaluation of a recommender system depends heavily on the dataset in its domain, so the dataset should be chosen carefully for the proper implementation of the recommender system. Certain popular datasets, such as MovieLens, Amazon, BookCrossing, Yahoo, and LastFM, are used by most researchers to implement their concepts in recommender systems. Data pruning is another important issue to consider while implementing a recommender system, as it affects the performance of the system. Most recommender system datasets are pruned, i.e., during dataset preparation some data are removed that a deployed recommender system would still have to handle. For example, the MovieLens dataset contains data only for users who have rated 20 or more movies.
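The MovieLens-style pruning just described can be sketched as follows (a generic illustration; the `(user, item, rating)` layout is an assumption, not GroupLens's actual pipeline):

```python
from collections import Counter

def prune_by_user_activity(ratings, min_ratings=20):
    """Keep only (user, item, rating) triples from users with at least
    `min_ratings` ratings -- the kind of pruning this section warns about."""
    counts = Counter(user for user, _, _ in ratings)
    return [r for r in ratings if counts[r[0]] >= min_ratings]

# A prolific user survives pruning; a one-rating user disappears entirely.
ratings = [("u1", f"m{i}", 4) for i in range(25)] + [("u2", "m1", 5)]
pruned = prune_by_user_activity(ratings)
print(len(ratings), "->", len(pruned))
```

Note how the low-activity user vanishes from the pruned dataset: a system evaluated only on such data never sees the cold-start users it would face in deployment.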
Likewise, by considering only a subset of the original dataset, many researchers prune the data themselves, sometimes retaining as little as 0.58 percent of the original data. A study [35] has been conducted to find out how often pruned data are used to evaluate recommender systems and how pruning affects system performance. The study found that 40 percent of researchers used pruned recommender system datasets for their work and 15 percent pruned the data themselves. The various datasets available for different domains of recommender systems are listed in Table 2.

Table 2. Datasets for recommender systems

| Sl. No. | Domain | Name of Dataset | Description | Link |
|---|---|---|---|---|
| 1 | Book | Book-Crossing | Collected by Cai-Nicolas Ziegler in a 4-week crawl (August/September 2004) from the Book-Crossing community | http://www2.informatik.uni-freiburg.de/~cziegler/BX/ |
| 2 | E-commerce | Amazon | Product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996–July 2014 | http://jmcauley.ucsd.edu/data/amazon/ |
| 3 | E-commerce | Retailrocket recommender system dataset | Behavior data, item properties, and a category tree, collected from a real-world e-commerce website | https://www.kaggle.com/retailrocket/ecommerce-dataset |
| 4 | Music | Amazon Music | Reviews and metadata from Amazon | http://jmcauley.ucsd.edu/data/amazon/ |
| 5 | Music | Yahoo Music | A snapshot of the tastes of the Yahoo! Music community for different musical artists | https://bit.ly/2Xy0sv7 |
| 6 | Music | LastFM (Implicit) | Social networking, tagging, and music artist listening information from the Last.fm online music system | https://grouplens.org/datasets/hetrec-2011/ |
| 7 | Music | Million Song Dataset | A collection of audio features and metadata for popular music tracks | http://millionsongdataset.com/ |
| 8 | Movies | MovieLens | Rating datasets collected and made available by GroupLens Research from their movie website | https://grouplens.org/datasets/movielens/ |
| 9 | Movies | Yahoo Movies | Ratings collected during normal interaction with Yahoo! services, as provided by users | https://webscope.sandbox.yahoo.com/catalog.php?datatype=r |
| 10 | Movies | Netflix | The official dataset used in the Netflix Prize competition | http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a |
| 11 | Games | Steam Video Games | User-id, game-title, behavior-name, and value are the properties of this dataset | https://www.kaggle.com/tamber/steam-video-games/data |
| 12 | Jokes | Jester | 4.1 million continuous ratings of jokes from 73,496 users | https://goldberg.berkeley.edu/jester-data/ |
| 13 | Food | Chicago Entree | A record of client experiences with the Entree Chicago restaurant recommendation system | http://archive.ics.uci.edu/ml/datasets/Entree+Chicago+Recommendation+Data |
| 14 | Anime | Anime Recommendations Database | User preference information from 73,516 users on 12,294 anime | https://www.kaggle.com/CooperUnion/anime-recommendations-database |
| 15 | Dating | Dating Agency | 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users, as dumped on April 4, 2006 | http://www.occamslab.com/petricek/data/ |
| 16 | Other | GroupLens Datasets | The GroupLens research laboratory deals with mobile and ubiquitous technology, digital libraries, and regional geographic information systems | https://grouplens.org/datasets/ |
| 17 | Other | Yahoo Research | Internet search, machine learning, microeconomics, media experience, and community systems are the topics covered by Yahoo Research | https://webscope.sandbox.yahoo.com/catalog.php?datatype=r |
| 18 | Other | Datasets for Machine Learning | Datasets meant for machine learning research | https://gist.github.com/entaroadun/1653794 |
| 19 | Other | Stanford Large Network Dataset Collection | More than 50 large network datasets make up this collection | https://snap.stanford.edu/data/ |
5 Parameters for Evaluation of Recommender Systems

Recommender systems play an important role in decision-making for both users and service providers, so they should be properly evaluated and tested before producing recommendations. Different quality measures exist for improving a recommender system, such as evaluating recommendations as sets, evaluating predictions, evaluating recommendations as ranked lists, and evaluating recommendations for diversity. Let u(c, s) and u_p(c, s) be the true rating and the rating predicted by the recommender system, respectively, and let W be the set of user–item pairs (c, s) for which the recommender system made predictions. The different evaluation metrics that can be considered for each type of quality measure are shown in Fig. 3. Surveys on different domains of recommender systems have revealed that some domains, such as sports and disaster management, still need efficient recommender systems to help their users, as shown in Fig. 4.
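With u(c, s) and u_p(c, s) as defined above, the standard prediction-evaluation metrics MAE and RMSE over the pair set W can be sketched as follows (toy values for illustration):

```python
import math

def mae(true, pred, pairs):
    """Mean absolute error over the user-item pairs W."""
    return sum(abs(true[p] - pred[p]) for p in pairs) / len(pairs)

def rmse(true, pred, pairs):
    """Root mean squared error over the user-item pairs W."""
    return math.sqrt(sum((true[p] - pred[p]) ** 2 for p in pairs) / len(pairs))

W  = [("c1", "s1"), ("c1", "s2"), ("c2", "s1")]
u  = {("c1", "s1"): 4, ("c1", "s2"): 2, ("c2", "s1"): 5}   # true ratings u(c, s)
up = {("c1", "s1"): 3.5, ("c1", "s2"): 2.0, ("c2", "s1"): 4.0}  # predictions u_p(c, s)
print(round(mae(u, up, W), 3), round(rmse(u, up, W), 3))
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two metrics can rank the same pair of systems differently.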
Fig. 3. Parameters for evaluation of recommender system
Fig. 4. Research on recommender systems in different domains
6 Conclusion

Recommender systems play a key role in choosing the best products for customers according to their preferences and help users in various domains make appropriate decisions. E-commerce recommender systems assist consumers in selecting the right item according to their preferences and also serve as a powerful business instrument reshaping the e-commerce environment. Recommender systems play an important role in the agriculture domain by recommending crops, fertilizers, pesticides, seeds, and agricultural equipment according to the location and preferences of the user. The health sector is another important domain that demands efficient recommender systems to recommend health information according to the needs of the patient. Recommender systems have also shown their efficiency in the financial
domain, which needs good recommender systems to produce recommendations for online banking, loans, insurance, real estate, stocks, asset allocation, portfolio management, etc. Recommender systems have shown their efficiency in social networks as well, by providing recommendations from different domains to different categories of users according to age, gender, etc. A social network is a platform where individuals are connected with others who share comparable career interests, private concerns, events, backgrounds, or real-life relationships. Identifying those similarities from users' activities remains challenging for researchers, because users do not always provide accurate profile information. This article has discussed various challenges present in recommender systems, along with recently proposed techniques to handle them. As recommender systems are important in various domains, recent and efficient recommendation techniques can be adopted to enhance their capability.
References

1. Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R., Riedl, J.: GroupLens: applying collaborative filtering to Usenet news. Commun. ACM 40(3), 77–87 (1997)
2. Hill, W., Stead, L., Rosenstein, M., Furnas, G.: Recommending and evaluating choices in a virtual community of use. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 194–201. ACM Press/Addison-Wesley Publishing Co. (1995)
3. Shardanand, U., Maes, P.: Social information filtering: algorithms for automating "word of mouth". In: CHI '95, pp. 210–217 (1995)
4. Terveen, L., Hill, W., Amento, B., McDonald, D., Creter, J.: PHOAKS: a system for sharing recommendations. Commun. ACM 40(3) (1997)
5. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43–52. Morgan Kaufmann Publishers Inc. (1998)
6. Delgado, J., Ishii, N.: Memory-based weighted majority prediction. In: SIGIR Workshop on Recommender Systems. Citeseer (1999)
7. Balabanović, M., Shoham, Y.: Fab: content-based, collaborative recommendation. Commun. ACM 40(3), 66–72 (1997)
8. Pazzani, M., Billsus, D.: Learning and revising user profiles: the identification of interesting web sites. Mach. Learn. 27(3), 313–331 (1997)
9. Soboroff, I., Nicholas, C.: Combining content and collaboration in text filtering. In: Proceedings of the IJCAI, vol. 99, pp. 86–91 (1999)
10. Burke, R.: Knowledge-based recommender systems. Encycl. Libr. Inform. Syst. 69(Supplement 32), 175–186 (2000)
11. Middleton, S.E., Shadbolt, N.R., De Roure, D.C.: Ontological user profiling in recommender systems. ACM Trans. Inform. Syst. (TOIS) 22(1), 54–88 (2004)
12. Yu, C., Tang, Q.J., Liu, Z., Dong, B., Wei, Z.: A recommender system for ordering platform based on an improved collaborative filtering algorithm. In: 2018 International Conference on Audio, Language and Image Processing (ICALIP), pp. 298–302 (2018)
13. Rohit, Singh, A.K.: Comparison of measures of collaborative filtering recommender systems: rating prediction accuracy versus usage prediction accuracy. In: 2017 International Conference on Innovations in Control, Communication and Information Systems (ICICCI), pp. 1–4 (2017)
14. Shakirova, E.: Collaborative filtering for music recommender system. In: 2017 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), pp. 548–550. IEEE (2017)
15. Walek, B., Spackova, P.: Content-based recommender system for online stores using expert system. In: 2018 IEEE First International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), pp. 164–165 (2018)
16. Bahulikar, S.: Analyzing recommender systems and applying a location based approach using tagging. In: 2017 2nd International Conference for Convergence in Technology (I2CT), pp. 198–202. IEEE (2017)
17. Pal, A., Parhi, P., Aggarwal, M.: An improved content based collaborative filtering algorithm for movie recommendations. In: Tenth International Conference on Contemporary Computing (IC3), Noida, pp. 1–3 (2017)
18. Devika, R.V.S.: A novel model for hospital recommender system using hybrid filtering and big data techniques, pp. 575–579 (2018). https://doi.org/10.1109/ismac.2018.8653717
19. Kbaier, M.E.B.H., Masri, H., Krichen, S.: A personalized hybrid tourism recommender system. In: 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), pp. 244–250. IEEE (2017)
20. Zhang, Y., Liu, X., Liu, W., Zhu, C.: Hybrid recommender system using semi-supervised clustering based on Gaussian mixture model. In: 2016 International Conference on Cyberworlds (CW), pp. 155–158. IEEE (2016)
21. Subbotin, S., Gladkova, O., Parkhomenko, A.: Knowledge-based recommendation system for embedded systems platform-oriented design. In: 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), vol. 1, pp. 368–373. IEEE (2018)
22. Wonoseto, M.G., Rosmansyah, Y.: Knowledge based recommender system and web 2.0 to enhance learning model in junior high school. In: 2017 International Conference on Information Technology Systems and Innovation (ICITSI), pp. 168–171. Bandung (2017)
23. Tsai, Y.T., Wu, C.S., Hsu, H.L., Liu, T., Chen, P.L., Liao, K.-T., et al.: A cross-domain recommender system based on common-sense knowledge bases. In: 2017 Conference on Technologies and Applications of Artificial Intelligence (TAAI), pp. 80–83. IEEE (2017)
24. Pazzani, M.J.: A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev. 13(5–6), 393–408 (1999)
25. Aggarwal, C.C., Wolf, J.L., Wu, K.L., Yu, P.S.: Horting hatches an egg: a new graph-theoretic approach to collaborative filtering. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 201–212. ACM (1999)
26. Desarkar, M.S., Sarkar, S., Mitra, P.: Aggregating preference graphs for collaborative rating prediction. In: Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 21–28. ACM (2010)
27. Aprilianti, M., Mahendra, R., Budi, I.: Implementation of weighted parallel hybrid recommender systems for e-commerce in Indonesia. In: 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pp. 321–326. IEEE (2016)
28. Goel, M., Sarkar, S.: Web site personalization using user profile information. In: International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, pp. 510–513. Springer, Berlin, Heidelberg (2002)
29. Bakr, Albayrak, S.: User based and item based collaborative filtering with temporal dynamics. In: 2014 22nd Signal Processing and Communications Applications Conference (SIU), pp. 252–255. IEEE (2014)
30. He, X., Chen, T., Kan, M.Y., Chen, X.: TriRank: review-aware explainable recommendation by modeling aspects. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pp. 1661–1670. ACM (2015)
31. Chelliah, M., Sarkar, S.: Product recommendations enhanced with reviews. In: Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 398–399. ACM (2017)
32. Desarkar, M.S., Saxena, R., Sarkar, S.: Preference relation based matrix factorization for recommender systems. In: International Conference on User Modeling, Adaptation, and Personalization, pp. 63–75. Springer, Berlin, Heidelberg (2012)
33. Desarkar, M.S., Sarkar, S.: Rating prediction using preference relations based matrix factorization. In: UMAP Workshops (2012)
34. Mallick, P., Sarkar, S., Mitra, P.: Decision recommendation system for transporters in an online freight exchange platform. In: 9th International Conference on Communication Systems and Networks (COMSNETS), Bangalore, pp. 448–453 (2017)
35. Beel, J., Brunel, V.: Data pruning in recommender systems research: best-practice or malpractice? In: 13th ACM Conference on Recommender Systems (RecSys) (2019)
Framework to Detect NPK Deficiency in Maize Plants Using CNN Padmashri Jahagirdar and Suneeta V. Budihal(B) School of ECE, KLETU, Hubballi, India [email protected], suneeta [email protected]
Abstract. A balanced level of nutrients is essential for the healthy growth of plants, and nutrient deficiency inhibits their growth. Nutrient-deficient plants need to be detected at an early stage so that proper fertilizers can be provided. In this paper, a framework is proposed that utilizes images of maize leaves deficient in nitrogen (N), phosphorus (P), and potassium (K); a set of such images forms the training dataset. This is a non-invasive way of detecting nutrient deficiency in plants. The collected training dataset of images is used to train the Inception V3 Convolutional Neural Network (CNN) model. Inception V3 uses transfer learning, a machine learning technique that collects the knowledge acquired while solving one problem and applies it to a related problem. The features of a maize leaf are therefore extracted by the initial pretrained layers of the CNN, which speeds up training while providing accurate and effective results. A given test image of a maize leaf is provided to the trained CNN model, which detects the nutrient deficiency in the leaf as nitrogen, phosphorous, or potassium deficient accordingly. This framework can be applied in agricultural development to help farmers and increase agricultural productivity.
Keywords: Nitrogen · Phosphorous · Potassium · Deficiency · CNN · Inception V3

1 Introduction
Agricultural productivity is one of the sources of economy for a developing country, and improving the economy requires reforming agricultural practices at the micro-level. For increased yield in crop production, nutrient deficiency detection in plants plays a major role, as nutrient deficiencies are common in plants. Detecting nutrient deficiency at an early stage supports the proper provision of fertilizers to plants.

© Springer Nature Singapore Pte Ltd. 2021
C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_33

Therefore,
loss in crops, and hence loss in agriculture, can be minimized. A maize plant is considered for experimentation; maize is one of the largest-selling crops, and maize plants are affected by deficiencies of nitrogen, phosphorus, and potassium, which inhibit healthy growth. Nitrogen promotes green, leafy growth; a deficiency results in yellowing and retarded growth and is a common cause of yellow leaves in the spring season. Potassium regulates both water uptake and the process that allows plants to utilize energy for photosynthesis during the daytime, and it supports flowering, fruiting, and general hardiness. Phosphorus is needed for healthy, strong roots and enhances plant growth. A balanced level of nutrients is thus required for healthy growth, and the type of nutrient deficiency must be detected so that proper fertilizers can be provided. Agriculture is an important occupation and a major source of economy, but it depends on the quality of the products grown by farmers, which in turn depends on the health of the plants. Leaves can be considered the fundamental part of the plant; since photosynthesis takes place through the leaves, they are used for nutrient deficiency detection. Therefore, a framework is developed to detect nutrient deficiencies in plants.

Authors in [1] have discussed nutrient deficiency detection for cotton plants. Images of the cotton plant are acquired, and the leaf part is extracted and enhanced. The images are further segmented using a statistical region merging algorithm, and a color histogram algorithm is used to detect the nutrient deficiency. Authors in [2] have discussed nutrient deficiency detection using linear or nonlinear mathematical models: images are acquired and segmented, features are extracted, and agricultural models are generated.
Authors in [3] have discussed nutrient deficiency detection using a fuzzy classifier. Images are acquired, preprocessed, and segmented; features are extracted; and the obtained feature vectors are classified by a suitable fuzzy classifier in accordance with the nutrient deficiencies. Authors in [4] have discussed nutrient deficiency detection using a random forest classifier: images are acquired and preprocessed, features are extracted, and the random forest classifier is trained to classify a given test image. Authors in [5] have discussed nutrient deficiency detection using artificial vision techniques, with which the nitrogen nutritional status of maize plants is diagnosed; the approach can be extended to detect phosphorous and potassium contents along with nitrogen. Authors in [6] have discussed a spectral way of determining nutritional status in citrus plants. Authors in [7] have discussed the estimation of leaf nitrogen concentration and rice chlorophyll content with a digital still color camera under natural light. Authors in [8] have discussed gathering images of rice samples using static scanning technology; the regionprops function in MATLAB is used to extract spectral and shape characteristic parameters from 32 images, employing an RGB mean value function. Authors in [9] have discussed the calculation of the amount of leaf nitrogen and chlorophyll in the bean plant. Various combinations
of spectral bands and vegetation indices from original, segmented, and reflectance images are considered for experimentation. Authors in [10] have mentioned the use of features for classification. Many techniques have been proposed to identify the amount of NPK deficiency in a plant at any stage of its growth, and it is observed that a learning-based CNN framework is desired to serve this purpose. Following are the contributions of this paper:

• A pretrained Inception V3 CNN is used in order to reduce human effort in feature extraction.
• Only the last layer of the CNN is trained, using bottleneck values, to detect the type of nutrient deficiency.

Section 2 provides the proposed framework. The proposed methodology is discussed in Sect. 3. Results and discussions are presented in Sect. 4. Conclusions drawn from the results are given in Sect. 5. The future scope of the framework is outlined in Sect. 6.
2 Proposed Framework
A CNN model is used in the proposed framework. The training dataset consists of images of nitrogen-deficient, phosphorous-deficient, and potassium-deficient maize leaves and is given to the CNN model, as shown in Fig. 1, in order to generate the trained model. A test image of a maize leaf is given as input to the trained model, which then correctly detects the type of nutrient deficiency in the maize plant. Leaves are the main indicators of nutrient deficiency in plants.
Fig. 1. Block diagram for nitrogen, phosphorus, and potassium deficiency detection in maize plants
The CNN is a deep, feedforward Artificial Neural Network (ANN), also called a ConvNet; such networks are used to analyze visual images. CNNs employ comparatively little pre-processing in comparison with other image categorization algorithms, which reduces human effort in feature extraction and is advantageous. As shown in Fig. 2, a basic CNN is made up of an input layer, a number of hidden layers, and an output layer. The hidden layers consist of convolutional, pooling, fully connected, and loss (final) layers.
Fig. 2. Basic architecture of CNN used for deficiency detection in maize plant
The convolutional layer is the main building block of a CNN model. It comprises a set of kernels, also called learnable filters, each having a small receptive field but extending through the complete depth of the input volume. During the forward pass, the two-dimensional activation map of each filter is calculated by computing the dot product of the entries of the filter and the input. The network thus learns filters that activate when they detect a particular type of feature at some spatial position in the input.

The next building block of a CNN is the pooling layer, a form of nonlinear downsampling. The most commonly used variant is max pooling, which divides the input image into a set of non-overlapping rectangles and outputs the maximum for each such sub-region. The key idea is that the exact location of a feature is less significant than its rough location relative to the remaining features. Pooling layers sit between successive convolutional layers in the CNN architecture; the most frequent form uses filters of size 2 × 2. Pooling has been a major component of CNNs for object detection.

ReLU stands for rectified linear units. This layer applies a non-saturating activation function, which increases the nonlinear properties of the decision function. ReLU is prescribed more frequently than other activation functions because it trains the network faster.

The next important layer of a CNN is the fully connected layer, in which neurons are connected to all activations in the previous layer; the activations are computed by matrix multiplication. The loss layer is normally the last layer and specifies how training penalizes the deviation between the predicted and true labels. Multiple loss functions can be employed here: softmax loss is used to estimate a single class out of K mutually exclusive classes, while the sigmoid cross-entropy loss function is used to predict K independent probability values.
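The layer operations described above can be sketched in NumPy (a toy single-filter illustration of convolution, ReLU, and 2 × 2 max pooling, not the maize-leaf network itself):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """2-D convolution (valid mode, CNN convention): dot product of the
    kernel with each image patch."""
    kh, kw = kernel.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    """Non-saturating activation: negative responses are clipped to zero."""
    return np.maximum(x, 0)

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling via a reshape trick."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])  # vertical-edge filter
feat = max_pool_2x2(relu(conv2d_valid(img, edge)))
print(feat.shape)  # (2, 2)
```

A 6 × 6 input convolved with a 3 × 3 filter yields a 4 × 4 activation map, which 2 × 2 pooling reduces to 2 × 2, illustrating how each stage shrinks the spatial dimensions while preserving the strongest responses.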
3 Proposed Methodology
After the training set is passed through the various CNN layers, the trained model is generated; it is based on learning color features through the CNN layers. A test image of a maize leaf is then given to the trained model, which detects the kind of nutrient deficiency in the test image based
on the features learned during the training process. Finally, the type of nutrient deficiency present in the given test image of the maize leaf is detected.

The training set, consisting of images of nitrogen-, phosphorous-, and potassium-deficient maize leaves, is given as input to the CNN in order to generate the trained model. The test image of a maize leaf is then given as input to the trained model in order to detect the kind of nutrient deficiency; the flow is shown in Fig. 5. The CNN architecture used is Inception V3, developed by the Google Brain Team. It uses transfer learning, a machine learning technique that takes a pretrained neural network and applies the knowledge acquired while solving one problem to a related problem. Inception V3 consists of two sections: the first extracts features, and the second performs classification. The classification section consists of fully connected and softmax layers, as shown in Fig. 3. The initial pretrained layers of the CNN extract features using a series of convolutions through a variety of filters, as shown in Fig. 4.
Fig. 3. Inception V3 CNN architecture, developed by the Google Brain Team; it uses transfer learning to reduce the effort required in feature extraction
3.1 Softmax Layer and Processing Using Inception V3
Softmax regression is used to train the last layer of Inception, as shown in Fig. 6, where probabilities are generated based on the evidence obtained from the nutrient-deficient images. The evidence is computed as a weighted sum of the pixel intensities plus a bias value. In Fig. 6, W_{1,1}, W_{2,2}, W_{3,3}, W_{1,2}, W_{2,1}, W_{3,1}, W_{1,3}, W_{2,3}, and W_{3,2} are weights and b_1, b_2, and b_3 are bias values.

evidence_i = \sum_j W_{i,j} x_j + b_i    (1)
Framework to Detect NPK Deficiency in Maize Plants Using CNN
371
Fig. 4. Feature extraction through a series of convolutions using a variety of filters
Equation (1) gives the evidence for a category i with input x; W_i are the weights, b_i is the bias, and j indexes the pixels of the input.

y = softmax(evidence)    (2)

softmax(x)_i = exp(x_i) / Σ_j exp(x_j)    (3)
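Equations (1)–(3) can be sketched directly in NumPy. The weight matrix, bias vector, and pixel values below are illustrative stand-ins, not the trained Inception V3 parameters.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; implements Eq. (3).
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Toy stand-ins: 3 classes (N, P, K deficiency), 4 "pixel" inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # weights W[i, j]
b = rng.normal(size=3)        # biases b[i]
x = rng.random(4)             # flattened pixel intensities

evidence = W @ x + b          # Eq. (1): evidence_i = sum_j W[i,j] x[j] + b[i]
y = softmax(evidence)         # Eq. (2): probabilities over the three classes
print(y, y.sum())
```

The printed probabilities are non-negative and sum to one, which is what makes them usable as prediction scores for the three deficiency classes.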
The softmax function is then used to calculate the probabilities, as shown in (2); the function itself is given by (3). The training dataset, consisting of nitrogen-, phosphorus-, and potassium-deficient maize leaves, is processed using the pretrained Inception V3 model. The initial step is to examine the complete set of images in the training set and compute the corresponding values of the bottleneck layer, the layer that precedes the final output layer that actually performs the classification. Retraining the bottleneck layer enables the Inception V3 CNN to identify the type of deficiency. A number of training steps is then executed, using either the default value of 4000 or a user-defined value. In each training step, images are selected and their bottleneck values are fed to the last layer. The predictions are compared with the actual labels, and the weights of the last layer are updated by backpropagation. The training accuracy is the percentage of nutrient-deficient images in the training set that are correctly labeled. The validation accuracy is the accuracy on a randomly chosen group of nutrient-deficient images held out during training; it is a more reliable estimate than the training accuracy. The Inception V3 CNN divides the data so that the training set forms 80%, the validation set forms 10%, and the remaining 10% is used as a testing set during training. This makes it possible to avoid overfitting and to tune the bottleneck values correctly.
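The 80/10/10 partition described above can be reproduced with a shuffled index split; `n_items` here is a stand-in for the actual number of leaf images.

```python
import numpy as np

def split_80_10_10(n_items, seed=42):
    """Shuffle item indices and partition into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    n_train = int(0.8 * n_items)
    n_val = int(0.1 * n_items)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = split_80_10_10(300)
print(len(train), len(val), len(test))  # 240 30 30
```

Because the split is done on shuffled indices, every image lands in exactly one of the three sets, which is what keeps the validation accuracy an honest estimate.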
372
P. Jahagirdar and S. V. Budihal
Fig. 5. Proposed methodology for nutrient deficiency detection
Fig. 6. Softmax Regression is used to train the last layer of Inception V3 CNN in order to detect the type of nutrient deficiency
4
Results and Discussions
After generating the bottleneck values and training the last layer of the Inception V3 CNN, the trained model is obtained. A given test image of a maize leaf is fed to the trained model, which detects the nutrient deficiency as nitrogen, phosphorus, or potassium deficient, as shown in Figs. 7, 8, and 9. When the test image is given to the trained model, its features are extracted and, depending on the bottleneck values, the type of nutrient deficiency is detected. The performance depends on the number of training steps and on the number of images in the dataset.
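Classification of a test image then reduces to taking the class with the highest prediction score; the score values below are made up for illustration.

```python
import numpy as np

classes = ["nitrogen deficient", "phosphorus deficient", "potassium deficient"]
scores = np.array([0.91, 0.06, 0.03])        # hypothetical prediction array for one test image

# The predicted deficiency is the class with the largest softmax score.
label = classes[int(np.argmax(scores))]
print(label)                                  # nitrogen deficient
```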
Fig. 7. Test image detected as nitrogen deficient by comparing prediction score using Inception V3. Since deficiency prediction score is high for nitrogen, the maize leaf is nitrogen deficient
In Fig. 7, the maize leaf is detected as nitrogen deficient. The deficiency score for nitrogen is higher than those for phosphorus and potassium, i.e., the prediction array has its highest value for nitrogen, so the input test image of the maize leaf is classified as nitrogen deficient.
Fig. 8. Test image detected as phosphorous deficient by comparing prediction score using Inception V3. Since deficiency prediction score is high for phosphorus, the maize leaf is phosphorus deficient
In Fig. 8, the maize leaf is detected as phosphorus deficient. The deficiency score for phosphorus is higher than those for nitrogen and potassium, i.e., the prediction array has its highest value for phosphorus, so the input test image of the maize leaf is classified as phosphorus deficient.
Fig. 9. Test image detected as potassium deficient by comparing prediction score using Inception V3. Since deficiency prediction score is high for potassium, the maize leaf is potassium deficient
In Fig. 9, the maize leaf is detected as potassium deficient. The deficiency score for potassium is higher than those for nitrogen and phosphorus, i.e., the prediction array has its highest value for potassium, so the input test image of the maize leaf is classified as potassium deficient.

Table 1. Variation in % accuracy with the number of training steps

Number of training steps | % Accuracy
100  | 40
500  | 60
1000 | 80
Table 1 shows the variation in percentage accuracy with the number of training steps, and Fig. 10 plots the percentage accuracy versus the number of training steps. The graph shows that accuracy increases with the number of training steps: with more training steps the network is trained more thoroughly and learns more detailed features of the maize leaves, such as color, shape, midrib, and texture. By learning these more detailed features, the CNN enables the last layer to correctly detect the type of nutrient deficiency in a maize leaf. Since Inception V3 uses transfer learning, the network can be trained more accurately by using more training steps.
Fig. 10. The percentage accuracy versus number of training steps
5
Conclusion
The above framework provides an engineering solution for improving agricultural production by automatically detecting nutrient deficiency in leaves at an early stage, so that deficiencies can be treated in time. Using the pretrained Inception V3 CNN model, features are extracted by the lower layers and bottleneck values are generated; these values are fed to the last layer of the CNN, which amounts to training the classifier and yields the trained model. A test image of a maize leaf given to this model is correctly classified by deficiency type. As the graph in Fig. 10 shows, the accuracy is 80% for 1000 training steps. Thus, a non-invasive way of detecting nutrient deficiency in maize plants is achieved.
6
Future Scope
After the type of nutrient deficiency has been detected, the framework can be extended to suggest appropriate fertilizers to address it. This can be done using supervised learning on a dataset that maps each type of nutrient deficiency to the corresponding fertilizer.
Stacked Denoising Autoencoder: A Learning-Based Algorithm for the Reconstruction of Handwritten Digits Huzaifa M. Maniyar(B) , Nahid Guard, and Suneeta V. Budihal School of Electronics and Communication, KLE Technological University, Vidyanagara, Hubballi 580031, India [email protected], [email protected] http://www.kletech.ac.in
Abstract. This paper presents a strategy for building a deep neural network by stacking autoencoder layers, each consisting of an encoder and a decoder, which are locally trained to denoise corrupted inputs and reconstruct an approximation of the original input. The resulting algorithm is a straightforward variation on stacking ordinary autoencoders. The task is essentially a machine learning classification problem aimed at reducing classification error, thereby closing the performance gap with deep belief networks and in the majority of cases surpassing them. Results show that the quality of the reconstruction depends on training parameters: increasing the number of epochs and the batch size lengthens the training period and increases the accuracy of the denoised reconstruction.
Keywords: Deep learning · Denoising autoencoder · Neural network

1 Introduction
Images are often corrupted during acquisition or by artificial editing. The main objective is to restore the original image from the noisy input given to the network and to improve recognition capability. Denoised images need to meet the required recognition accuracy in order to have greater visual clarity. Image denoising is a common and important preprocessing step for the majority of applications; the need arises because images contain noise introduced by acquisition techniques. This paper emphasizes denoising images so that they have better recognition capacity. Several other methods have been proposed for
c Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_34
378
H. M. Maniyar et al.
denoising of images, such as Support Vector Machines (SVM), Principal Component Analysis (PCA), spatial filters, and transform-domain methods, but the autoencoder has surpassed these traditional methods. The autoencoder provides better data projections, offers dimensionality reduction, and easily accounts for nonlinearities, thus giving better outcomes. Image denoising with an autoencoder can be achieved in various ways, and the recognition accuracy depends on tuning the desired parameters. Denoised images are obtained by training a neural network; for example, the network can be trained to denoise images by deliberately adding noise, so that it learns the denoising process.

1.1 Motivation
Many strategies have been developed for image denoising, and advances in technology now allow machine learning to be applied to image processing. Handwritten digits may vary in size, font, and style, yet they must remain identifiable. Noise in images is quite common and can be introduced in numerous ways; image acquisition techniques are often responsible, for example through incorrect lens aperture settings or faults in the imaging sensor. Images may also be blurred, making them difficult to analyze. The noise may take several forms, such as salt-and-pepper noise or white Gaussian noise. Whatever the type of noise, its removal is necessary and is one of the important image processing steps. Removing noise from images of handwritten digits helps recognize the digits correctly, for instance in traffic enforcement, where captured number-plate images may be corrupted by noise, or in identifying entities such as the house number of a person or a numeric unique ID.

Organization of the Paper. Section 2 surveys related work. Section 3 presents the proposed methodology, covering the functional block diagram, the algorithm, the required dataset, training strategies, and performance parameters. Section 4 discusses the implementation of the reconstruction and denoising of handwritten-digit images. Section 5 presents the obtained results with a detailed discussion and an analysis of the model.
2
Literature Survey
Various methods for the recognition of handwritten digits [1–4] have been proposed and high recognition rates are reported. On the other hand, many researchers addressed the recognition of digits including Arabic.
A method for the recognition of handwritten Arabic digits using SVM was developed by Mahmoud [5] on a database of around 21,120 samples; 30% of the database was used for testing and the remaining 70% for training, giving an average recognition accuracy of about 99.85%. In 2011, an improved technique based on local characteristics of Arabic digits [6] was developed for both printed and handwritten digits. The dataset contained 600 digits in total, of which 400 were used for training and the rest for testing, leading to an accuracy of 99%. A similar work [7] recognized Arabic numerals using neural networks and the backpropagation technique; for a small dataset, the accuracy was around 96%. A three-level classification technique based on support vector machines [8] uses fuzzy C-means and pixel features to classify Arabic digits. The dataset contained 3510 images in total, with 40% used for training and 60% for testing, giving an accuracy of 88%. Two methods to enhance the recognition rate for typewritten digits were developed in 2014 [9]: the first counts the edges of the given shape and combines the nodes, while the second uses fuzzy logic for pattern recognition, studying each shape and classifying it into a number; recognition results reached 95%. An algorithm was proposed to classify handwritten digits using a Bayesian network [10], with a discrete cosine transform used for feature extraction; around 60,000 images were used for training and 10,000 for testing, and the accuracy was around 85%. A multilayer perceptron was developed to achieve good denoising performance [13], and BM3D [12] is a widely used denoising technique.
A method extending the classic autoencoder as a building block for deep networks was introduced in [14]. The denoising autoencoder was developed in stacked form to increase accuracy [15], with the output of each layer fed as input to the layer above. An algorithm using a convolutional neural network was proposed for image denoising [16] with a small dataset. A sparse autoencoder was stacked for image denoising and inpainting, performed using K-SVD. A combination of stacked and sparse autoencoders was built [18] with multiple columns of an adaptive neural network to obtain a robust denoising system.

Contributions
– To design an autoencoder using convolutional layers.
– To remove noise from images of MNIST-based handwritten digits.
– To develop methods that can recover the signal even when noise levels are extremely high, where other denoising methods would fail.
3 Proposed Methodology
This section gives a brief description of the block diagram of the autoencoder and, more specifically, provides the outline for the design. The methodology includes design alternatives, i.e., the various ways of building an autoencoder, from which the best variant is finally chosen.

3.1 Functional Block Diagram of Autoencoder
The block diagram gives a brief description of the denoising and reconstruction of the handwritten digits. Initially, the original input block contains the original input images taken from the MNIST database. Noise is then added in order to enhance the training capacity of the autoencoder, and the noisy input images are sent to the denoising autoencoder. After receiving the input, the denoising autoencoder performs the denoising task and reconstructs an approximation of the original input by adjusting its parameters. A backpropagation algorithm is applied to tune the parameters according to the accuracy of the obtained output, and regularization is applied to minimize the cost function between the original and reconstructed images.
Fig. 1. Functional block diagram of autoencoder: denoising and reconstruction of images of MNIST-based handwritten digits
3.2
Algorithm for Reconstructing and Denoising Autoencoder
The autoencoder consists of several layers; the number of layers is chosen by the designer, and in general, as the number of layers increases, the output becomes closer to the original input. Training samples number 60,000 and testing samples
are around 10,000, giving a total of 70,000 samples. The training samples are fed through the various layers and the result is obtained. The training samples come from the MNIST handwritten dataset; they are deliberately perturbed and given as corrupted input to the hidden layers of the autoencoder, which performs the encoding and decoding process based on the chosen parameters. A reconstruction of the original input is obtained, which is an approximation of it; using the backpropagation algorithm, the reconstruction losses are reduced. A deep autoencoder, consisting of more than two layers of both encoder and decoder, is required to enhance the denoising performance and decrease the reconstruction losses. The aim is to obtain images without noise. The autoencoder provides dimensionality reduction of the input images: the noise-added images are given as input to the autoencoder, and the desired output is obtained from the decoder. The denoising approach rests on the assumption that the added noise is not part of the representation of the input image, so the noise is filtered out by the encoder and the compressed representation carries no information about it, yielding a clean, noiseless output.
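As a minimal, self-contained sketch of this train-to-denoise idea (a single encoder/decoder pair on made-up toy data rather than the full stacked network; sizes and learning rate are illustrative), the corrupted input is encoded, decoded, and the reconstruction error against the clean input is backpropagated:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "images": 32 samples of 8 binary pixels each, plus a corrupted copy.
X = (rng.random((32, 8)) > 0.5).astype(float)
Xn = np.clip(X + 0.3 * rng.normal(size=X.shape), 0.0, 1.0)

# One encoder layer and one decoder layer.
W1 = rng.normal(0, 0.5, (8, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.5, (16, 8)); b2 = np.zeros(8)

lr, losses = 1.0, []
for _ in range(500):
    H = sigmoid(Xn @ W1 + b1)          # encode the noisy input
    Xh = sigmoid(H @ W2 + b2)          # decode an approximation of the clean input
    losses.append(np.mean((Xh - X) ** 2))

    # Backpropagate the mean-squared reconstruction error.
    dXh = 2.0 * (Xh - X) / X.size
    dZ2 = dXh * Xh * (1 - Xh)
    dH = dZ2 @ W2.T
    dZ1 = dH * H * (1 - H)
    W2 -= lr * H.T @ dZ2;  b2 -= lr * dZ2.sum(axis=0)
    W1 -= lr * Xn.T @ dZ1; b1 -= lr * dZ1.sum(axis=0)

print(round(losses[0], 4), round(losses[-1], 4))  # reconstruction loss drops as training proceeds
```

The key point mirrored from the text: the loss is measured against the clean input X, not the corrupted input Xn, so the network is rewarded for discarding the noise.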
Fig. 2. Algorithm of stacked denoising autoencoder: It gives the step by step procedure followed in the proposed model
3.3 MNIST Dataset
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used to train image processing systems. The MNIST dataset resulted from "re-mixing" the samples from NIST's original datasets: the creators felt that, since NIST's training set was taken from American Census Bureau employees while the testing set was taken from American high school students, the original split was not well suited to machine learning problems. Furthermore, the black-and-white images from NIST were normalized to fit a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

3.4 Training Strategies
Since the output of the autoencoder is used to form an image, the network should yield non-negative outputs; hence, a ReLU activation is added at the last output layer of the decoder, which can also accelerate training. The Adam optimizer is chosen because it adjusts the learning rate for each weight parameter, which guarantees effective training. To ensure that the recovered data has low loss, a function J(w, b) is used to measure the difference between the input x and its reconstruction x̂:

J(w, b) = (1/m) Σ_{i=1}^{m} (x̂ − x)²    (1)

3.5 Performance Parameters
Deep neural networks are used as the encoding and decoding functions, and the cross-entropy function is chosen as the distance function between input and output. Specifically, convolutional neural networks (CNNs) are used, with the ReLU and SoftMax activations of PyTorch in the encoder and decoder networks, respectively. The autoencoder network can achieve image compression through the outputs of its hidden neurons, because the hidden layer learns a compressed feature representation of the input data; decompression is achieved by the output layer, whose size is the same as the input.

Activation functions. The ReLU (rectified linear unit) is the most common activation function used in neural networks. It returns 0 for negative values, and for any positive value x it returns the value itself:

f(x) = max(0, x)    (2)

It allows the model to account for nonlinearities very well.
The sigmoid function takes any real number and returns an output value in the range 0–1; it produces an S-shaped curve, returns a real-valued output, and its first derivative is non-negative. In SoftMax, the output range is also 0–1, and the sum of all the probabilities equals one. It returns the probability of each class, with the target class receiving a high probability: it computes the exponential of each input value and the sum of the exponentials of all input values, and the output is the ratio of the exponential of each input value to that sum.
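The three activation functions described above have one-line NumPy definitions; the test values are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # 0 for negatives, identity for positives

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # S-shaped curve, output in (0, 1)

def softmax(z):
    e = np.exp(z - np.max(z))            # shift by max for numerical stability
    return e / e.sum()                   # probabilities summing to one

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))                           # [0. 0. 3.]
print(softmax(z))                        # highest probability for the largest input
```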
4
Implementation Details
An autoencoder is an unsupervised machine learning algorithm that uses a neural network to learn an approximation to the identity function via the backpropagation algorithm. An autoencoder can be built in various ways, such as a denoising autoencoder, a stacked denoising autoencoder, or a convolutional denoising autoencoder. This paper emphasizes building a network-based model that combines all three, resulting in a stacked convolutional denoising autoencoder. To build this model, many aspects must be considered: the selection of parameters depends on optimization and on the accuracy of the model, and changes are made during the training period. Building an autoencoder able to reconstruct noiseless images of handwritten digits involves an autoencoder with connected layers. Autoencoding is a data-compression algorithm in which both the compression and decompression functions are learned automatically; they are data-specific and lossy. This means that, rather than being manually engineered by developers, in almost every context where autoencoders are used the functions are implemented automatically by the neural network.
Fig. 3. General representation of an autoencoder consisting of encoder, decoder, and compressed representation
The autoencoder is built for specific data, which means that the compression algorithm can compress only data similar to what it has been trained on; for example, an autoencoder trained on images of digits will fail to achieve its goal when given an image of a building. Autoencoders are lossy, so the obtained output is degraded with respect to the original input. Because the functions are learned automatically, autoencoders are easy to train. Building an autoencoder requires three functions: an encoding function, a decoding function, and a distance (loss) function. To develop the model, Google Colaboratory is used as the coding platform. All the necessary packages are installed first, and the dataset is downloaded directly from the cloud. After reshaping the images, the samples are shuffled and split into separate training and testing sets, and a noise factor is then added to the training data. For the neural network itself, multiple encoder and decoder layers are developed: three encoder layers and one decoder layer are stacked together, with the output of each layer given as input to the next.
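The shuffle, split, and noise-factor steps just described read, in outline, as follows; the array below stands in for the reshaped MNIST images, and the noise factor of 0.5 is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))                    # stand-in for normalized MNIST images

idx = rng.permutation(len(images))                    # shuffle, then split into train/test
x_train, x_test = images[idx[:80]], images[idx[80:]]

noise_factor = 0.5                                    # illustrative value
x_train_noisy = np.clip(x_train + noise_factor * rng.normal(size=x_train.shape), 0.0, 1.0)
print(x_train_noisy.shape, x_train_noisy.min() >= 0.0, x_train_noisy.max() <= 1.0)
```

Clipping after adding Gaussian noise keeps the corrupted pixels in the same [0, 1] range as the clean images, so the encoder sees valid pixel intensities.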
Fig. 4. Stacked autoencoder representing decoder and encoder layers
ReLU is used as the activation function in all the encoder layers. The ReLU function accounts for nonlinearities very well; essentially, it passes positive input values through and sets negative values to zero, which helps reduce the dimensionality and obtain a noiseless image.
The decoder layer uses the sigmoid activation function, which scales all negative values toward zero and emphasizes the higher values. Adadelta is used as the optimizer, and binary cross-entropy is used to reduce the reconstruction losses. An epoch is one complete pass of the training data through the network.
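A forward pass through the stacked arrangement just described — three ReLU encoder layers followed by a sigmoid decoder — can be sketched with random weights; the layer widths are illustrative stand-ins, not the paper's actual configuration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
widths = [784, 128, 64, 32]                          # input followed by three encoder widths
enc = [rng.normal(0, 0.05, (a, b)) for a, b in zip(widths[:-1], widths[1:])]
dec = rng.normal(0, 0.05, (widths[-1], 784))         # single decoder layer back to pixel space

x = rng.random((1, 784))                             # one flattened noisy 28x28 image
h = x
for W in enc:                                        # the output of each layer feeds the next
    h = relu(h @ W)
x_hat = sigmoid(h @ dec)                             # sigmoid keeps outputs in (0, 1)
print(x_hat.shape)                                   # (1, 784)
```

The sigmoid at the decoder output is what guarantees the reconstruction lands back in the valid pixel range, regardless of the encoder activations.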
5
Results and Discussion
The MNIST dataset is used to train the denoising autoencoder with a batch size of 128 noisy input images for 300 epochs. The noisy images are obtained by manually adding Gaussian noise at noise level 60 to the input images. By training the network to produce the original noiseless images as output, the denoised images are obtained. The results show that the quality of the denoised images is very good and almost the same as that of the noiseless images.
Fig. 5. Screenshot: reconstruction of denoised MNIST-based handwritten digit images
5.1
Result Analysis
The following tables show the result analysis carried out for different noise levels. Initially, the analysis is done by keeping the training and testing data constant and varying the noise level, beginning with a 10% noise level and going up to 100%. As Table 1 shows, the losses increase and the accuracy decreases as the noise level increases. In Table 2, the noise level is fixed at 50% and the number of training epochs increases from 10 to 100; as the epoch count increases, the accuracy increases and the reconstruction of the denoised image becomes more effective.
Table 1. Analysis: fixed training and testing samples, varying noise levels

Noise level (in %) | Losses | Accuracy (in %)
10  | 0.0708 | 92.92
20  | 0.0760 | 92.4
30  | 0.0834 | 91.66
40  | 0.0922 | 90.78
50  | 0.1013 | 89.87
60  | 0.1105 | 88.95
70  | 0.1195 | 88.05
80  | 0.1282 | 87.18
90  | 0.1367 | 86.33
100 | 0.1447 | 85.53
Table 2. Analysis: fixing the noise level at 50% and varying the number of training epochs

Epochs | Losses | Accuracy (in %)
10  | 0.1181 | 99.19
20  | 0.1082 | 89.18
30  | 0.1040 | 89.6
40  | 0.1013 | 89.87
50  | 0.0993 | 90.07
60  | 0.0977 | 90.023
70  | 0.0965 | 90.35
80  | 0.0955 | 90.45
90  | 0.0944 | 90.56
100 | 0.0936 | 90.64

6 Conclusion
Effective image denoising can be achieved with the autoencoder network simply by feeding it noisy images and training it so that the output images are similar to the original noiseless images. This method requires little image preprocessing work: the denoised image is obtained directly from the autoencoder.
References
1. Ali, S.S., Ghani, M.U.: Handwritten digit recognition using DCT and HMMs. In: Proc. 12th International Conference on Frontiers of Information Technology (FIT), pp. 303–306 (2014)
2. Niu, X.-X., Suen, C.Y.: A novel hybrid CNN-SVM classifier for recognizing handwritten digits. Pattern Recognition 45(4), 1318–1325 (2012)
3. Tissera, M.D., McDonnell, M.D.: Deep extreme learning machines: supervised autoencoding architecture for classification. Neurocomputing 174(A), 42–49 (2016)
4. Hanning, Y., Peng, W.: Handwritten digits recognition using multiple instance learning. In: Proc. IEEE International Conference on Granular Computing (GrC), pp. 408–411 (2013)
5. Mahmoud, S.A.: Arabic (Indian) handwritten digits recognition using Gabor-based features. In: Proc. International Conference on Innovations in Information Technology (IIT), pp. 683–687 (2008)
6. El Melhaoui, O., El Hitmy, M., Lekhal, F.: Arabic numerals recognition based on an improved version of the Loci characteristic. International Journal of Computer Applications 24(1), 36–41 (2011)
7. Selvi, P.P., Meyyappan, T.: Recognition of Arabic numerals with grouping and ungrouping using back propagation neural network. In: Proc. International Conference on Pattern Recognition, Informatics and Mobile Engineering (PRIME), pp. 322–327 (2013)
8. Takruri, M., Al-Hmouz, R., Al-Hmouz, A.: A three-level classifier: fuzzy C-means, support vector machine and unique pixels for Arabic handwritten digits. In: Proc. World Symposium on Computer Applications & Research (WSCAR), pp. 1–5 (2014)
9. Salameh, M.: Arabic digits recognition using statistical analysis for end/conjunction points and fuzzy logic for pattern recognition techniques. World of Computer Science & Information Technology Journal 4(4), 50–56 (2014)
10. AlKhateeb, J.H., Alseid, M.: DBN-based learning for Arabic handwritten digit recognition using DCT features. In: Proc. 6th International Conference on Computer Science and Information Technology (CSIT), pp. 222–226 (2014)
11. Abdleazeem, S., El-Sherif, E.: Arabic handwritten digit recognition. International Journal of Document Analysis and Recognition (IJDAR) 11(3), 127–141 (2008)
12. Dabov, K., et al.: Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing 16(8), 2080–2095 (2007)
13. Burger, H.C., Schuler, C.J., Harmeling, S.: Image denoising: can plain neural networks compete with BM3D? In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
14. Jain, V., Seung, S.: Natural image denoising with convolutional networks. In: Advances in Neural Information Processing Systems (2009)
15. Agostinelli, F., Anderson, M.R., Lee, H.: Adaptive multi-column deep neural networks with application to robust image denoising. In: Advances in Neural Information Processing Systems (2013)
16. Vincent, P., et al.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, 3371–3408 (2010)
17. Jain, V., Seung, S.: Natural image denoising with convolutional networks. In: Advances in Neural Information Processing Systems (2009)
18. Xie, J., Xu, L., Chen, E.: Image denoising and inpainting with deep neural networks. In: Advances in Neural Information Processing Systems (2012)
An Unsupervised Technique to Generate Summaries from Opinionated Review Documents Ashwini Rao(B) and Ketan Shah MPSTME, NMIMS University, Mumbai 400056, India [email protected]
Abstract. In the last few years, there has been a tremendous change in the way users behave over the net, driven largely by the growth of Web technology. Earlier, a user on the net played the role of an information consumer; today the role is more that of a data creator. This role change has benefitted politics, social network analysis, and financial market analysis, to name a few fields. Given this huge volume of created data, a mechanism that can automatically analyze and interpret opinionated data is badly needed. In this research direction, and unlike other summarization techniques, the paper proposes a novel unsupervised, domain-independent method for generating opinion summaries. The final summaries are generated at four levels, ranging from coarse to more granular ones. The proposed technique was tested on data sets from nine different domains. The experimental results clearly indicated that 70–75% of the summaries generated matched the manually selected ones. Keywords: Opinion summarization · Dependency parsers · Supervised and unsupervised learning
1 Introduction In the past few years, Natural Language Processing (NLP), along with Internet technologies, has seen tremendous growth. This has changed the way users look at and perceive information on the net. In this era of immense social networking, the opinions of users over the net now play a central role in our decision-making processes. A lot of research is being carried out to find innovative ways of managing and analyzing this user-generated content. Many changes are also observed in the behavior of people on these social networking sites: they are now open to collaborating and sharing their views and opinions without any hesitation. Various fields such as finance, healthcare, and online shopping are making use of this vast intelligence (in the form of opinions) collected over the net to make good business decisions. But the task is difficult to handle due to the noisy and unstructured characteristics of online data. This is the reason that many researchers are working © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_35
in this direction to develop automated techniques that can mine and analyze user-generated content. The subdomain of sentiment analysis that addresses the above problem is opinion summarization. In this research domain, various methods and techniques are being investigated to help users collect, analyze, and draw conclusions from huge collections of unstructured data. The growth of social networking sites such as Facebook, e-commerce review websites, and Twitter has also led to enormous wealth creation in terms of data. Analysis of this data can enable timely predictions and help in planning better business strategies. In this research direction, the paper proposes a novel technique for opinion summary generation. Unlike other approaches, the proposed technique is unsupervised as well as domain-independent and generates summaries at four different levels. The technique was evaluated on nine data sets from different domains. The organization of the paper is as follows. Related research in the area of opinion summarization is discussed in Sect. 2. The proposed model of opinion summary generation is outlined in Sect. 3. Section 4 discusses the data sets used and the experiments carried out. The future plan is outlined in Sect. 5.
2 Literature Survey In recent years, many researchers have been trying to find optimized solutions to the challenging problems that have emerged in the research domain of opinion summarization. One of these is the task of presenting summaries to the user. It is a complicated task, as a summary that is informative to one set of users may hold no value for others. So, many researchers have been working on techniques that display the summary of opinions in a form that the user can digest. Many studies have shown that it is better to aggregate opinions from many sources or persons instead of presenting the individual view of a single person. Hu and Liu [1] proposed one such type of structured summary, called aspect-based summaries, which displayed positive and negative opinions about the set of features that frequently occurred in the reviews. Liu et al. [2] worked further along the same lines and came up with summaries that compared features of one product with those of another. These summaries proved useful to users who wanted to compare products across multiple feature dimensions. Other interesting types of summaries are the ones which try to show the perception of the user about a product over time. These trend tracking summaries helped manufacturers draw their teams' attention toward products that were losing popularity among users. Many researchers [3–5] have presented their summaries in a traditional fashion: by taking as input the reviews of an individual user or a set of users, short and meaningful textual summaries were generated. These summaries did not gain much popularity, as no quantitative analysis could be performed on them and they were suitable only for human reading. Some researchers like Ku et al. [4] worked on approaches for selecting the important features that would make their place in the final summary. The term frequencies of aspects/features were computed, and an appropriate threshold
was used to filter out some of them. They presented summaries at a brief and a detailed level: the brief level had just a headline featuring the selected sentence, and the detailed one carried the sentiments of the previously selected sentences. Researchers [6] also presented summaries that highlighted the pros and cons of various features of a product. Another interesting summary was developed by Blair-Goldensohn et al. [7]. In their approach, sentences were first rated based on the importance of the features they contained; in the subsequent step, a few of these highly rated sentences were selected for the final summary based on their length. Some researchers [8] used graph-based methods for generating opinion summaries: several clusters based on aspects/features were first formed, and within each of these clusters a few sentences were selected as leaders, taking their informativeness and usefulness into consideration. A few have even worked on generating summaries at different granular levels [9]. They start with a word-level summary, ranking features by their importance. Next is the phrase level, which displays short phrases representing individual clusters. The final level displays highly ranked sentences, where usefulness and informativeness measures are used for sentence ranking.
3 Proposed Approach to Generate Opinion Summaries The paper proposes a novel way to generate opinion summaries at four distinct levels. In the research domain of opinion summarization, this phase of generating relevant opinion summaries is considered a challenging step by many researchers, mainly because concise, precise, and visually effective summaries are easily understood and effectively used by consumers for decision-making. Many of the researchers working in this direction have used graphical structures rather than a tabulated display to present the summary results. Also, many have used sentence ranking methods to select the sentences that become part of the final summary. The limitation of these techniques is that they are domain-dependent and require a good amount of training data before the model can be built. Figure 1 shows the proposed approach for generating opinion summaries. The process begins with the single and multiword features already extracted by applying the automated rule-based algorithm [10, 11]. Then, using the technique proposed in [12], relevant feature opinion pairs were extracted. These features and their corresponding opinion pairs are used in the proposed domain-independent model to generate opinion summaries at various levels, which differentiates this method from other summary generation techniques. As shown in Fig. 1, the initial step is to filter the features extracted in the earlier feature extraction phase [11]. It was found during experimentation that a large number of relevant features were extracted. But, for the purpose of generating an opinion summary, it is better to filter out a few of them, as the aim is to display only those review sentences that are important from the user's perspective. We proceed on the assumption that popular features are usually the ones mentioned frequently in the reviews. So, the first step is to find their frequency of occurrence and retain the top few features using an appropriate threshold.
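The frequency-based filtering step described above can be sketched as follows. This is an illustrative sketch, not the authors' exact implementation; the substring-matching heuristic and the `keep_ratio` parameter name are assumptions.

```python
from collections import Counter

def top_features(review_sentences, candidate_features, keep_ratio=0.2):
    """Retain only the most frequently mentioned candidate features.

    review_sentences: lowercased review sentence strings
    candidate_features: features from the earlier extraction phase
    keep_ratio: fraction of the ranked feature list to retain
    """
    counts = Counter()
    for sentence in review_sentences:
        for feature in candidate_features:
            if feature in sentence:  # count a mention of this feature
                counts[feature] += 1
    # Sort features by frequency of occurrence and keep the top slice
    ranked = [feature for feature, _ in counts.most_common()]
    kept = max(1, int(len(ranked) * keep_ratio))
    return ranked[:kept]
```

For the automobile data set reported later, this corresponds to keeping the top 20% of the sorted feature list (8 of roughly 45 features).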
Fig. 1. Proposed technique for summary generation
The next step is to extract the sentences containing the selected features. The orientation of these sentences is then determined using the SentiWordNet dictionary [13]. This dictionary is used because it has a huge collection of adjectives along with their orientation scores; the scores assigned to the words are based on the senses in which the words are used. Since a dependency parser was used to parse the review documents in the feature extraction phase [10, 11], the context of the adjective-bearing words is readily available for determining sentence orientation. Every sentence selected using the appropriate threshold is thus placed into a class with a positive, negative, or neutral polarity. With these polarities assigned to sentences, the summary is generated from a coarse level to a more granular one, to be used as per the user's need. At the first level, the summary reveals details regarding the popular features mentioned in the reviews, along with the polarity of these features. The summary at the next level is termed feature wise, as it reveals more details about the set of features selected previously: it shows the opinion, i.e., positive or negative, of the author toward every individual feature. The third level is called the average polarity wise summary. During experimentation, it was observed that the opinions for a few of the selected features were very strong even though their number remained small. To avoid losing such granular detail, which may generate better summaries, the average positive and negative polarity scores of every individual feature with its corresponding opinion were computed. This computed value was used to display the selected features as per their average polarity score.
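The polarity-class assignment can be sketched as below. The tiny lexicon is a stand-in for SentiWordNet [13] (its three entries reuse the positive/negative scores shown in the paper's sample sentences); the function names are assumptions, not the authors' code.

```python
# Stand-in lexicon: adjective -> (positive score, negative score).
# In the paper these scores come from the SentiWordNet dictionary;
# the values below mirror the sample sentences shown in Fig. 2.
OPINION_LEXICON = {
    "good": (0.75, 0.20),
    "shabby": (0.50, 0.85),
    "tolerable": (0.50, 0.50),
}

def sentence_orientation(opinion_words):
    """Place a sentence into a positive, negative, or neutral polarity class
    by summing the scores of its opinion-bearing words."""
    pos = sum(OPINION_LEXICON.get(w, (0.0, 0.0))[0] for w in opinion_words)
    neg = sum(OPINION_LEXICON.get(w, (0.0, 0.0))[1] for w in opinion_words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

In practice the opinion words would be the adjectives attached to each feature by the dependency parse, so the score lookup can be sense-aware.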
At the last level, the sentences bearing the features selected on the basis of their average polarity score in the previous level are displayed. The above-discussed technique for generating opinion summaries was experimented with on nine data sets belonging to various domains such as mobile phones, automobiles, the hotel industry, and software. The characteristics of the data sets used, along with the experimental results, are discussed in the next section.
4 Experiments Conducted The data sets used to evaluate the proposed technique were drawn from various review sites such as Amazon, CNET, TripAdvisor.com, and CarWale.com, to name a few. Some of them were the golden data sets used by many researchers [1, 14] to showcase their contributions in the research field of sentiment analysis, and a few were manually crawled. Each review was around 8–10 sentences long, with an average of eight tokens per sentence. 4.1 Results As discussed in the earlier section, the process of opinion summary generation using the proposed technique starts with the feature set and corresponding opinion pairs extracted previously [10–12]. During experimentation, it was observed that the number of relevant features was too large, so there was a need to limit their number. The literature survey revealed that sentence ranking [2, 4, 6, 7] was one of the popular techniques used to tackle the problem of dealing with a large feature vector. So, an initial step of filtering features was carried out to generate concise opinion summaries. As discussed in the earlier section, the frequency of occurrence of the previously selected features was computed, based on the assumption that popular features have more mentions in the reviews. For the automobile data set, around 45 features were extracted in the feature extraction phase. Of these, the ones with a high frequency of occurrence were the eight features constituting the top 20% of the sorted list. The review sentences corresponding to these selected features were then obtained. The selected top features of the automobile data set, with their frequency of occurrence, are shown in Table 1. The table clearly indicates that features like Seat Comfort, Leg space, Service Centre, and Sensor Systems are the most popular ones.
The next step, as discussed earlier, was to extract the sentences bearing these top-rated features and determine their polarity using the SentiWordNet dictionary [13]. The orientation of a few sample sentences containing the selected features is shown in Fig. 2. These sentences are then used in the subsequent steps to generate summaries that start at a coarse level and extend to a fine-grained level. The Level 1 summary is at a higher level and gives details regarding the top-rated features and the number of opinions that are positive, negative, or neutral about each feature. Figure 3 shows the overall summary of the automobile data set.
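Once each selected sentence carries a feature and a polarity class, the Level 1 (overall) and Level 2 (feature wise) summaries reduce to simple counting. A minimal sketch, with hypothetical function and variable names:

```python
from collections import Counter, defaultdict

def build_level1_and_level2(labelled_pairs):
    """labelled_pairs: (feature, orientation) tuples, one per selected sentence.

    Returns the Level 1 overall polarity counts and the
    Level 2 feature wise polarity counts."""
    # Level 1: total positive / negative / neutral feature opinion pairs
    overall = Counter(orientation for _, orientation in labelled_pairs)
    # Level 2: the same counts broken down per feature
    feature_wise = defaultdict(Counter)
    for feature, orientation in labelled_pairs:
        feature_wise[feature][orientation] += 1
    return overall, dict(feature_wise)
```

For the automobile data set, the Level 1 counts would correspond to the 30/31/23 bars of Fig. 3 and the Level 2 counts to the per-feature bars of Fig. 4.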
Table 1. Features of automobile data set selected for summarization

Features | Frequency of occurrence
Seat Comfort | 18
Leg space | 15
Service Centre | 14
Sensor Systems | 11
Safety Feature | 9
Brake System | 8
Price Tag | 5
Convenience Feature | 4
Fig. 2. Sample sentences of automobile data set with orientation:
- "The pricing is very high, even though it seems to give a tough competition" (pricing, high); scores 0.0, 0.425; POSITIVE ORIENTATION
- "They have service centres across the country and the service they provide is preety good for the charges laid out" (service centres, good); scores 0.75, 0.2; POSITIVE ORIENTATION
- "The design of the engine looks shabby, would be better with more sophsticated safety features" (design, shabby); scores 0.5, 0.85; NEGATIVE ORIENTATION
- "Seating at the back would have been much better but is still tolerable" (Seating, tolerable); scores 0.5, 0.5; NEUTRAL ORIENTATION

Fig. 3. Overall summary of the automobile data set (bar chart of the number of feature opinion pairs per polarity: 30 positive, 31 negative, 23 neutral)
The figure clearly indicates that there are around 30 review sentences bearing a positive opinion, 31 bearing a negative orientation, and 23 being neutral for the set of 8 top-rated features selected earlier. The next level of summary for the automobile data set, termed feature wise, is shown in Fig. 4. The number of positive, negative, and neutral opinions for each of the previously selected features is shown there.
Fig. 4. Feature wise summary of the automobile data set (grouped bar chart of the number of positive, negative, and neutral feature opinion pairs per feature)
The figure indicates that the features Leg space, Safety, Price tag, and Convenience have a negative orientation, whereas Service Centre remains neutral, and the orientation is positive for Seat comfort, Sensor systems, and Brake systems. The next level, termed the average polarity wise summary, presents much more detail than the previous two levels. Here, the average polarity scores computed for each of the previously selected feature opinion pairs are presented. The importance of this granular level of summary lies in situations where, for certain features, even though the number of positive/negative opinions may be high, their individual opinion strength may not be great. For example, Fig. 4 indicates that the features Safety and Leg space have more opinions with a negative orientation. But the results displayed in Table 2 and Fig. 5 indicate that, even though the number of negative opinions toward these features is high, they are still classified as positive. This is because the opinion strength of the words bearing positive sentiment is greater than that of the words with negative sentiment. Tackling such feature opinion pairs further increases the usefulness of the generated summary. The summary at the last level displays the actual review statements and is the most granular. At this level, for the set of selected features, the average strength of the corresponding opinions computed in the previous level is used: the review sentences selected for the final display are the ones whose polarity strength score is greater than the average score computed earlier. The Level 4 summary of the automobile data set is shown in Fig. 6. Using the proposed technique, a similar set of summaries was also generated for the eight remaining data sets.
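The Level 3 averaging and the Level 4 sentence selection can be sketched as follows. This is an assumed reading of the paper's description (selecting sentences whose polarity strength exceeds the feature's average score); names and the `max(avg_pos, avg_neg)` comparison are illustrative choices, not the authors' code.

```python
def average_polarity(feature_opinion_pairs):
    """feature_opinion_pairs: (feature, pos_score, neg_score) per pair.
    Returns {feature: (avg positive strength, avg negative strength)} (Level 3)."""
    totals, counts = {}, {}
    for feature, pos, neg in feature_opinion_pairs:
        p, n = totals.get(feature, (0.0, 0.0))
        totals[feature] = (p + pos, n + neg)
        counts[feature] = counts.get(feature, 0) + 1
    return {f: (p / counts[f], n / counts[f]) for f, (p, n) in totals.items()}

def level4_sentences(scored_sentences, averages):
    """Level 4: keep sentences whose polarity strength exceeds the feature's average.
    scored_sentences: (feature, sentence text, polarity strength) triples."""
    kept = []
    for feature, text, strength in scored_sentences:
        avg_pos, avg_neg = averages.get(feature, (0.0, 0.0))
        if strength > max(avg_pos, avg_neg):
            kept.append(text)
    return kept
```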
The summaries so obtained were not directly comparable with other similar approaches, for the following reasons that became evident during the literature survey.
Table 2. Average polarity wise summary of the automobile data set

Features | Positive polarity | Negative polarity
Seat Comfort | 0.80 | 0.63
Leg space | 0.79 | 0.71
Service Centre | 0.75 | 0.00
Sensor Systems | 0.51 | 0.52
Safety Feature | 0.67 | 0.56
Brake System | 0.63 | 0.00
Price Tag | 0.58 | 0.31
Convenience Feature | 0.88 | 0.00

Fig. 5. Average polarity strength wise summary of the automobile data set (bar chart of average positive and negative polarity strength per feature)
• Researchers evaluated the summaries they generated based on measures of informativeness, readability, and the non-redundancy of the summary content. • Researchers compared their generated summaries with those stored in manually extracted files. • Some researchers based their evaluation on the way the polarity of every feature was distributed. • A few researchers even argued that there can be no goodness measure for an opinion summary, as it is very subjective: a summary that one user finds informative may be of no value to others.
Fig. 6. Actual review statements summary of the automobile data set:

Seat Comfort
- I own this car since Aug-2012 and one of the key differences that I noticed, though this is ONLY by reading and by viewing photographs, is that the 3rd seating in this car seems to be better.
- My daily ride to office gets really irritating with the seat rattling in Ertiga.
- When I saw the vehicle today, except for the fact that I liked the ease of getting in and out of the third row, seating inside was horrible.
- In today's age what I can not digest is fixed head restraints, narrow and uncomfortable seats.
- The reclining and sliding seats mean that the vehicle would definitely be little more spacious and comfortable to use than its counterparts.
- Third row seats best for kids, that too for short drives.
- The most glaring of all are those wafer thin seat cushions evident from the snaps.

Leg space
- The model has a fixed middle row and the seating offers decent legroom for the third row.
- I was pretty dissatisfied with the car overall as it is not well packaged and I feel that vertical space is mismanaged, which could have created an altogether different experience otherwise.

Service Centre
- Maruti Suzuki's sales & service network remains the unmatched numero uno with more choice, less arrogant dealers and open market availability of genuine spares.
- They have service centres across the country and the service they provide is preety good for the charges laid out.
- Even when encountered with apparent faults that cannot be blamed on anything else but poor design, or costcutting, or willful deficiency of service, buyers and users are willing to forgive and forget.

Sensor Systems
- In this car the basic features like crash handling, airbags, parking sensors, speed sensitive and seat belt sensors, auto locking doors, crumple zones is just so bad, that its just not worth the money you pay for this big gaint.
- One thing that baffles me is the missing dog bar in Honda cars, I had an accident on the highway and the condenser was dead meat.
- A car costing me close to 7.5 L on the road, and not having even basic stuff such as a driver airbag, is shocking.

Safety Feature
- The only reason I dont want to go for this big car is because, somewhere in the back of my mind I had the feeling that safety features are virtually non existent in every car that Maruti makes in the sub lac bucket, hence whats the point in compromising safety.
- Between the Ertiga and the Mobilio, I think the Ertiga looks much better from the front, and in all it has the great attributes of a safe, spacious family car. A well deserved 5 stars.

Price Tag
- Pricing from Honda is disappointing compared to the product on offer.
- The pricing is very high, even though it seems to give a tough competition.
- Grow up Honda City, successful doesn't mean people will buy anything at any price you give.
5 Conclusion and Future Work As the various existing techniques for opinion summary generation are domain-dependent and supervised, the paper proposes a novel unsupervised, domain-independent approach to summary generation. The proposed technique generated summaries at four different levels, revealing details about individual features from a basic level to a more granular one. The Level 1 summary, termed overall, displayed the total number of feature opinion pairs along with their polarity for each selected feature. The Level 2 summary detailed, for every feature, the number of positive and/or negative opinions it carried. The Level 3 summary displayed the features based on their average polarity score. The last level displayed the statements as per the filtering done in the previous steps. The proposed technique was validated by running it on nine data sets belonging to different domains. The generated summaries were tested for readability and informativeness against files containing sentences manually selected for the final summary. It was observed that around 70–75% of the summaries generated using the proposed method matched the manual ones. Our future research direction is to find a way to customize summaries as per user needs. This can be made possible by integrating the opinion summarization system with user feedback and using this information effectively to build an intelligent model.
References 1. Hu, M., Liu, B.: Mining opinion features in customer reviews. In: AAAI 2004 (Vol. 4, No. 4, pp. 755–760) 2. Liu, B., Hu, M., Cheng, J.: Opinion observer: analyzing and comparing opinions on the web. In: Proceedings of the 14th International Conference on World Wide Web 2005 (pp. 342–351). ACM 3. Beineke, P., Hastie, T., Manning, C., Vaithyanathan, S.: Exploring sentiment summarization. In: Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications 2004 (Vol. 39). Palo Alto, CA: The AAAI Press 4. Ku, L.-W., Liang, Y.-T., Chen, H.-H.: Opinion extraction, summarization and tracking in news and blog corpora. In: Proceedings of AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs 5. Carenini, G., Cheung, J.C., Pauls, A.: Multi-document summarization of evaluative text. Comput. Intell. 29(4), 545–576 (2013) 6. Zhuang, L., Jing, F., Zhu, X.Y.: Movie review mining and summarization. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management 2006 (pp. 43–50). ACM 7. Blair-Goldensohn, S., Hannan, K., McDonald, R., Neylon, T., Reis, G.A., Reynar, J.: Building a sentiment summarizer for local service reviews. In: Proceedings of World Wide Web (WWW-2008) Workshop on NLP in the Information Explosion Era 2008 (Vol. 14, pp. 339–348) 8. Seki, Y., Eguchi, K., Kando, N., Aono, M.: Opinion-focused summarization and its analysis at DUC 2006. In: Proceedings of the Document Understanding Conference (DUC) 2006 (pp. 122–130) 9. Ganesan, K., Zhai, C.: Opinion-based entity ranking. Inf. Retrieval 15(2), 116–150 (2012) 10. Rao, A., Shah, K.: Model for improving relevant feature extraction for opinion summarization. In: Proceedings of IEEE International Advance Computing Conference (IACC) 2015 (pp. 1–5). IEEE 11. Rao, A., Shah, K.: An optimized rule based approach to extract relevant features for sentiment mining.
In: Proceedings of 3rd International Conference on Computing for Sustainable Global Development (INDIACom) 2016 (pp. 2330–2336). IEEE 12. Rao, A., Shah, K.: A domain independent technique to generate feature opinion pairs for opinion mining. In: WSEAS Transactions on Information Science and Applications, 2018 (Vol. 2, pp. 61–69) 13. Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: LREC 2010 (Vol. 10, No. 2010, pp. 2200–2204) 14. Zhu, L., Gao, S., Pan, S.J., Li, H., Deng, D., Shahabi, C.: Graph-based informative-sentence selection for opinion summarization. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2013 (pp. 408–412). ACM
Scaled Document Clustering and Word Cloud-Based Summarization on Hindi Corpus Prafulla B. Bafna(B) and Jatinderkumar R. Saini Symbiosis International Deemed University, Pune, Maharashtra, India [email protected]
Abstract. Managing a large number of textual documents is a critical and significant task that supports many applications, ranging from information retrieval to clustering of search engine results. The multilinguistic facility provided by websites makes Hindi a major language in the digital domain of information technology today. This work focuses on document management through document clustering for a big corpus and on the summarization of the resulting clusters. The objective is to overcome the scalability problem in managing the documents and to summarize the Hindi corpus by extracting tokens. The work performs better in terms of scalability and maintains consistent cluster quality for an incremental dataset. Most past and contemporary research works have targeted document management for English corpora, whereas the Hindi corpus has mostly been exploited for stemming, single-document summarization, and classifier design. Applying unsupervised learning to a Hindi corpus for summarization of multiple documents through a word cloud is still an untouched area. Technically speaking, the current work is an application of TF-IDF, cosine-based document similarity measures, and cluster dendrograms, in addition to various other Natural Language Processing (NLP) activities. Entropy and precision are used to evaluate the experiments carried out on different live and available/tested datasets, and the results prove the robustness of the proposed approach for the Hindi corpus. Keywords: Classification · Clustering · Document management · Hindi · Summarization · Word cloud
1 Introduction In online and offline systems, documents are continuously generated, stored, and accessed every day in large volumes. Categorizing documents according to their contents helps to retrieve documents on a particular topic. Most work on document management focuses on English corpora, but Hindi text on the web has come of age since the advent of Unicode standards in Indic languages. Hindi content has been growing by leaps and bounds and is now easily accessible on the web at large. Generally, researchers focus on Hindi text only for Natural Language Processing (NLP) activities like word identification, stemming, and summarization. Various data mining techniques like clustering and classification can be applied © Springer Nature Singapore Pte Ltd. 2021 C. R. Panigrahi et al. (eds.), Progress in Advanced Computing and Intelligent Engineering, Advances in Intelligent Systems and Computing 1199, https://doi.org/10.1007/978-981-15-6353-9_36
once the data is in a structured format. A document term matrix (DTM) converts text files into tabular form, where rows represent documents and terms are placed as columns. But a DTM suffers from the curse of dimensionality, because all the terms present in the corpus are considered while constructing it. Term Frequency–Inverse Document Frequency (TF-IDF) allows selecting significant terms based on their token weights. The cosine similarity measure is a prerequisite for applying a hierarchical algorithm to documents. Once the clusters are formed, they can be summarized through a word cloud. Word clouds enable text information to be visualized and understood in an easy way: they represent words from higher to lower frequency in big to small font sizes. It is a way of text summarization [1]. The proposed approach begins with the removal of stop words and finds the top N frequent terms using TF-IDF weights. The N-value is called a threshold, which is 50% of the maximum TF-IDF weight; it effectively removes all uninformative words. Considering these keywords, the cosine similarity measure and hierarchical clustering are applied to obtain document clusters. Entropy is used to validate cluster quality and, in turn, the N-value. Datasets having predefined classes are used to determine the precision of the experimental setup. To prove the betterment of the technique, it is applied to a live dataset. Once the clusters are formed, a word cloud is applied to summarize them. The approach is unique because of the following. 1. Multi-document word cloud-based summarization through unsupervised learning on a Hindi corpus is an untouched topic. 2. Various applications, for example, information retrieval, will benefit from the proposed approach by saving the time and effort required to read and manage an entire corpus. 3. The approach can process 500 documents having more than 15,000 words and hence proves its betterment in scalability. 4. Entropy is consistent even for large data sizes.
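The TF-IDF weighting and cosine similarity steps described above can be sketched in pure Python as below; tokenization, stop-word removal, and the actual dendrogram construction are assumed to have been done elsewhere, and this is an illustrative stand-in for the authors' implementation.

```python
import math
from collections import Counter

def tfidf_vectors(token_docs):
    """token_docs: list of token lists (e.g. Hindi tokens after stop-word removal).
    Returns one sparse {term: TF-IDF weight} mapping per document."""
    n_docs = len(token_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    vectors = []
    for doc in token_docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n_docs / doc_freq[t])
                        for t, c in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity between two sparse TF-IDF vectors; 1 - cosine can
    serve as the distance for hierarchical (dendrogram) clustering."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Feeding the pairwise 1 - cosine distances into an agglomerative clustering routine then yields the cluster dendrogram, and word-frequency counts within each cluster drive the word cloud.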
A Document Management System facilitates accessing documents in a fast and easy way and, in turn, increases work productivity. Grouping of documents is one of the most important steps toward document management, which helps in identifying replicas of documents, clustering search engine results, and so on. Different document management systems (DMS) exist for English corpora; a DMS for a Hindi corpus, though significant, has not been explored yet. "Businesses that invest in Hindi content today stand to gain a whole new set of consumers tomorrow, doing digital content in Hindi is not only preferred in India but around the world" [2]. In the paper, terms, dimensions, words, and tokens are used as synonyms, interchangeably. The organization of the paper is as follows. The work done by other researchers on the topic is presented as background in the next, that is, the second section. The third section presents the methodology, along with the experimental setup and results. The paper ends with the fourth section, the conclusion and future directions.
P. B. Bafna and J. R. Saini
2 Literature Review

This section reviews existing techniques for text preprocessing, stemming, summarization, word clouds, and so on.

2.1 Text Stemming and Text Summarization

Stemming is the extraction of the root form of a word from a token. It is a type of preprocessing that improves the performance of downstream NLP algorithms [3]. The stemming problem has been addressed in many contexts and by researchers in many disciplines; its main purpose is to reduce different grammatical and word forms, both inflectional and sometimes derivationally related, to a common base form. Stemming is widely used in information retrieval systems, where it improves performance and reduces the number of terms, thus decreasing the size of the index files. Different stemming algorithms for Indian and non-Indian languages, along with stemming methods, their accuracy, and their errors, have been surveyed [4]. A new stemmer named “Maulik” has been proposed for the Hindi language. It is based purely on the Devanagari script and uses a hybrid approach, a combination of brute force and suffix removal. The stemmer is both computationally inexpensive and domain-independent. By combining the brute-force and suffix-removal techniques, it also reduces the over-stemming and under-stemming problems observed in the Lightweight Stemmer for Hindi [5]. Most stemming algorithms follow a rule-based approach, and the performance of a rule-based stemmer is superior to some well-known methods such as brute force. Dictionary-based algorithms, including natural language processing approaches, are also used to build stemmers.
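The suffix-removal idea behind stemmers such as the Lightweight Stemmer and “Maulik” can be sketched as follows. This is a minimal illustration, not the published algorithms: the suffix list is a small invented subset, and the real stemmers use far more comprehensive, linguistically grounded rules:

```python
# Illustrative lightweight suffix-stripping for Hindi (Devanagari).
# SUFFIXES is a small, hypothetical subset chosen for this sketch only.
SUFFIXES = ["ियों", "ियां", "ाओं", "ाएं", "ों", "ें", "ीं", "ता", "ना", "ी", "े", "ा"]

def light_stem(word: str, min_stem_len: int = 2) -> str:
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            return word[: -len(suf)]
    return word  # no suffix matched; return the word unchanged

print(light_stem("किताबें"))  # plural suffix removed -> किताब
```

Because such rules are encoded per language, a stemmer of this kind is inherently language-dependent, which is the point made in the next paragraph.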
The purpose of such stemmers is to encode a wide range of language-related rules that have evolved over time; stemmers with comprehensive rules are therefore language-dependent. Hindi is a morphologically rich language: Hindi words have many morphological variants that express the same concept but differ in tense, plurality, etc. A lightweight stemmer has been proposed for Hindi, which conflates terms using a suffix list. The stemmer was evaluated by computing under-stemming and over-stemming figures on a corpus of documents [6]. Text summarization abstracts a document and presents the result to the user; it extracts the significant information from the given text. Manual summarization of documents is costly in terms of time, effort, etc., and automatic text summarization is a more accurate alternative. A deeper analysis of the text is needed to carry out summarization [7]. Clustering is generally preferred for grouping similar types of documents, and multiple algorithms have been suggested by various researchers. Hierarchical techniques are used for retrieving documents in a clustered manner. Several distance measures for hierarchical clustering have been listed by researchers, such as single link, average link, etc. [8].
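The single-link measure mentioned above defines the distance between two clusters as the distance between their closest members (average link would use the mean over all pairs). A minimal agglomerative sketch in plain Python, using a toy 1-D distance in place of the cosine-based document distance used later in the paper:

```python
# Minimal single-link agglomerative clustering (illustrative only; the
# paper's experiments use R). Points and distances here are toy 1-D values.
def single_link_cluster(points, num_clusters):
    clusters = [[p] for p in points]
    dist = lambda a, b: abs(a - b)  # toy distance; cosine distance in the paper
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the two closest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the nearest pair
        del clusters[j]
    return clusters

print(single_link_cluster([1.0, 1.2, 5.0, 5.1, 9.0], 3))
```

The sequence of merges is what a dendrogram records, and cutting the dendrogram at a chosen height yields the final clusters.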
Scaled Document Clustering and Word
Textual data is accumulating and growing tremendously. Reviews, web pages, etc. are examples of text data, and documents are used to store this information. To apply any clustering technique, the data must be in tabular form. Various techniques are available to represent such data, for example, the bag-of-words model [9]. But it suffers from the curse of dimensionality, as all terms in the corpus are considered, and high dimensionality degrades the performance of the algorithm. To reduce the dimensionality, only significant words should be considered; the document clustering process executes in less time if only the top significant words are selected. To further improve clustering, the text is preprocessed by removing stop words, etc. [1, 10]. TF-IDF is a popular technique that transforms text data into matrix form. The measure represents the significance of a token with respect to a document while considering the entire corpus, and it acts as a weighting unit in document processing. Although the term-frequency component grows with a word's count within a document, TF-IDF downweights the most commonly occurring words by offsetting each word's weight against its frequency across the entire corpus [11]. Entropy and precision are popular parameters for evaluating clustering: entropy validates the purity of the clustering results, and precision measures the accuracy of the clusters. Word clouds represent textual data graphically in the form of words. Words with higher frequency appear prominently, and the font size decreases as the word count lowers; a word cloud is very easy to understand and interpret [12]. It is a simple and effective visualization and summarization technique that helps in the domains of text mining, visualization techniques, and contextual data.
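The offsetting effect of the IDF component can be shown with a toy example (this is not the authors' implementation, which uses R tooling; documents and terms here are invented). A term that appears in every document receives an IDF of zero and drops out, while rarer, discriminative terms keep a positive weight:

```python
import math

# Invented three-document toy corpus (tokenized Hindi review snippets).
docs = [
    ["फिल्म", "अच्छी", "कहानी"],
    ["फिल्म", "खराब", "कहानी"],
    ["फिल्म", "संगीत", "अच्छा"],
]

def tf_idf(term, doc, docs):
    """Plain TF-IDF: term frequency scaled by inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)      # documents containing the term
    idf = math.log(len(docs) / df)              # 0 when the term is in every doc
    return tf * idf

# "फिल्म" occurs in all three documents, so its weight collapses to zero,
# while the rarer "खराब" keeps a positive, discriminative weight.
print(tf_idf("फिल्म", docs[0], docs))   # 0.0
print(tf_idf("खराब", docs[1], docs) > 0)  # True
```

This is why thresholding on TF-IDF weights, as the proposed approach does, tends to discard exactly the uninformative, corpus-wide words.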
A word cloud can be used effectively to focus on the needs and problems of customers and, in turn, to grow a business without reading the underlying text. Research scholars can use word clouds to interpret qualitative data faster. User comments entered on social media about a service, product, political party, etc. can be analyzed through a word cloud, and the overall essence of the product or service can be understood without going through the comments, thus saving time and effort [13, 14]. Word clouds on Hindi text have not been attempted yet. They will benefit people who prefer to use their regional language for day-to-day activities such as reading news articles, summarizing government documents, and so on [15–17].
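The frequency-to-font-size mapping that a word cloud performs can be sketched as follows. The linear scaling and font bounds here are hypothetical choices for illustration; real word-cloud renderers additionally handle layout and collision:

```python
from collections import Counter

def word_cloud_sizes(tokens, max_font=50, min_font=10):
    """Map each term to a font size proportional to its frequency."""
    counts = Counter(tokens)
    peak = counts.most_common(1)[0][1]          # frequency of the top term
    return {w: min_font + (max_font - min_font) * c / peak
            for w, c in counts.items()}

# Invented token stream: the most frequent word gets the largest font.
tokens = ["सेवा"] * 5 + ["उत्पाद"] * 3 + ["समीक्षा"]
sizes = word_cloud_sizes(tokens)
print(sizes["सेवा"])  # 50.0, the largest font
```

Reading only the largest words therefore gives the gist of a cluster, which is the summarization effect the text describes.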
3 Research Methodology

The corpus containing Hindi text is processed to remove stop words; TF-IDF is applied to the set of documents, and the weight of each term is calculated. Terms having a weight greater than or equal to the threshold are retained (these weights are termed modified TF-IDF), after which the cosine similarity matrix is computed. A dendrogram is constructed by applying single-link hierarchical clustering to obtain the clusters. Entropy and precision are calculated to validate the experiment, and a word cloud is created for each cluster to summarize it.

3.1 Creating a Corpus of Hindi Text

The first dataset is available online [10]. It is stored as XML (version=“1.0”, encoding=“UTF-8”) and has predefined categories of movie reviews. The
categories are positive, negative, conflict, and neutral. The second dataset is a live dataset of size 15 KB, created by collecting stories belonging to different domains. To avoid bias, the stories are collected from different websites [http://hindi.webdunia.com], [https://www.pinterest.com/pin/851180398296339274/], [https://www.hindisahityadarpan.in] and were uploaded between the years 2000 and 2019. The dataset is created in Excel as a mixture of stories.

3.2 Clustering of the Corpus Using an Unsupervised Learning Technique and Evaluation

The corpus is tokenized to get a bag of words and is preprocessed to remove stop words and so on. Lemmatization is purposely avoided in the preprocessing to preserve the semantics of words. TF-IDF is applied to calculate the weight of each term. Instead of applying stemming, the top N terms are selected for further processing. The N-value depends on the maximum TF-IDF weight and the TF-IDF weights of all other terms; the modified TF-IDF weights are used to select the terms. The cosine similarity between documents is calculated to generate a dendrogram. The algorithm is applied to the corpora of stories and reviews: the stories are clustered such that each cluster represents the topic of a story, and the reviews are clustered based on the sentiments present in them. Table 1 shows the applied steps along with package details.

Table 1. Experimental steps with package details

Step No. | Step                                                      | Library/Package/Function
1        | Documents are preprocessed and stop words are removed     | library(udpipe)
2        | Apply TF-IDF to calculate token weights                   | dtm_tfidf
3        | Select terms having token weights greater than threshold  | dtm
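The steps that follow term selection, that is, modified TF-IDF thresholding at 50% of the maximum weight, cosine similarity, and entropy/precision validation, can be sketched as follows. The actual implementation in the paper is in R; the weights, vectors, and class labels below are invented for illustration:

```python
import math

# Toy TF-IDF weights; in the paper these come from the Hindi corpus via R.
weights = {"कहानी": 0.90, "पात्र": 0.55, "का": 0.10, "है": 0.05}

# Modified TF-IDF: keep terms whose weight is >= 50% of the maximum weight.
threshold = 0.5 * max(weights.values())
selected = sorted(t for t, w in weights.items() if w >= threshold)

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_entropy(labels):
    """Entropy of class labels inside one cluster (0 means a pure cluster)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def cluster_precision(labels):
    """Fraction of the cluster belonging to its majority class."""
    return max(labels.count(c) for c in set(labels)) / len(labels)

print(selected)                                # the two high-weight terms survive
print(round(cosine([1, 2, 0], [2, 4, 0]), 3))  # 1.0: identical direction
print(cluster_precision(["pos", "pos", "neg"]))
```

Low entropy together with high precision, computed on the dataset with predefined classes, is what validates both the clusters and the chosen N-value.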