English Pages 938 [888] Year 2021
Advances in Intelligent Systems and Computing 1300
Aboul Ella Hassanien Siddhartha Bhattacharyya Satyajit Chakrabati Abhishek Bhattacharya Soumi Dutta Editors
Emerging Technologies in Data Mining and Information Security Proceedings of IEMIS 2020, Volume 2
Advances in Intelligent Systems and Computing Volume 1300
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by DBLP, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST). All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/11156
Editors
Aboul Ella Hassanien, Faculty of Computers And Information, Cairo University, Giza, Egypt
Siddhartha Bhattacharyya, CHRIST (Deemed to be University), Bengaluru, Karnataka, India
Satyajit Chakrabati, Institute of Engineering and Management, Kolkata, West Bengal, India
Abhishek Bhattacharya, Institute of Engineering and Management, Kolkata, West Bengal, India
Soumi Dutta, Institute of Engineering and Management, Kolkata, West Bengal, India
ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-981-33-4366-5 ISBN 978-981-33-4367-2 (eBook) https://doi.org/10.1007/978-981-33-4367-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
This volume presents the proceedings of the 2nd International Conference on Emerging Technologies in Data Mining and Information Security (IEMIS 2020), which took place at the Institute of Engineering & Management in Kolkata, India, from 2 to 4 July 2020. The volume appears in the series “Advances in Intelligent Systems and Computing” (AISC) published by Springer Nature, one of the largest and most prestigious scientific publishers; AISC is one of the fastest growing book series in its programme. AISC is meant to include high-quality and timely publications, primarily proceedings of relevant conferences, congresses and symposia but also monographs, on the theory, applications and implementations of modern intelligent systems and intelligent computing, broadly understood. This scope includes the tools and techniques of artificial intelligence (AI) and computational intelligence (CI), which covers data mining, information security, neural networks, fuzzy systems, evolutionary computing and hybrid approaches that synergistically combine these areas. It also includes topics such as network security, cyber intelligence, multiagent systems, social intelligence, ambient intelligence, Web intelligence, computational neuroscience, artificial life, virtual worlds and societies, cognitive science and systems, perception and vision, DNA and immune-based systems, self-organizing and adaptive systems, e-learning and teaching, human-centred and human-centric computing, autonomous robotics, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, big data, and security and trust management, to mention just a few.
These areas are at the forefront of science and technology and have been found useful and powerful in a wide variety of disciplines such as engineering, natural sciences, computer, computation and information sciences, ICT, economics, business, e-commerce, environment, health care, life science and social sciences. The AISC book series is submitted for indexing in ISI Conference Proceedings Citation Index (now run by Clarivate), Ei Compendex, DBLP, Scopus, Google Scholar and SpringerLink, and many other indexing services around the world. IEMIS is an annual conference series organized at the School of Information Technology, under the aegis of the Institute of Engineering & Management. Its idea came from the heritage
of the other two cycles of events: IEMCON and UEMCON, which were organized by the Institute of Engineering & Management under the leadership of Prof. (Dr.) Satyajit Chakraborty. In this volume of “Advances in Intelligent Systems and Computing”, we would like to present the results of studies on selected problems of data mining and information security. Security implementation is the contemporary answer to new challenges in threat evaluation of complex systems. The security approach in the theory and engineering of complex systems (not only computer systems and networks) is based on a multidisciplinary attitude to information theory, technology and maintenance of systems working in real (and very often unfriendly) environments. Such a transformation has shaped the natural evolution in the topical range of subsequent IEMIS conferences, which can be seen over recent years. Human factors likewise lie behind the greatest digital threats. This book will be of great value to a wide range of professionals, researchers and students focusing on the human aspect of the Internet, and for the effective evaluation of security measures, interfaces, user-centred design and design for special populations, particularly the elderly. We hope this book is informative and, even more, thought-provoking. We hope it inspires, leading the reader to examine other questions, applications and potential solutions in creating safe and secure designs for all.

The Programme Committee of IEMIS 2020, its organizers and the editors of these proceedings would like to gratefully acknowledge the participation of all reviewers who helped to refine the contents of this volume and evaluated conference submissions. Our thanks go to Prof Dr. Zdzisław Pólkowski, Dr. Sushmita Mitra, Dr. Pabitra Mitra, Dr. Indrajit Bhattacharya, Dr. Siddhartha Bhattacharyya, Dr. Celia Shahnaz, Mr. Abhijan Bhattacharyya, Dr. Vincenzo Piuri, Dr. Supavadee Aramvith, Dr. Thinagaran Perumal, Dr. Asit Kumar Das, Prof. Tanupriya Choudhury, Dr. Shaikh Fattah and to all our session chairs. We thank all the authors who have chosen IEMIS 2020 as the publication platform for their research, and we hope that their papers will help in further developments in the design and analysis of engineering aspects of complex systems, serving as valuable source material for scientists, researchers, practitioners and students who work in these areas.

Aboul Ella Hassanien, Giza, Egypt
Siddhartha Bhattacharyya, Bengaluru, India
Satyajit Chakrabati, Kolkata, India
Abhishek Bhattacharya, Kolkata, India
Soumi Dutta, Kolkata, India
Contents
Pattern Recognition

Local Binary Pattern-Based Texture Analysis to Predict IDH Genotypes of Glioma Cancer Using Supervised Machine Learning Classifiers
    Sonal Gore and Jayant Jagtap
An Efficient Classification Technique of Data Mining for Predicting Heart Disease
    Divya Singh Rathore and Anamika Choudhary
Sentiment Analysis for Electoral Prediction Using Twitter Data
    Wangkheirakpam Reema Devi and Chiranjiv Chingangbam
A Study on Bengali Stemming and Parts-of-Speech Tagging
    Atish Kumar Dipongkor, Md. Asif Nashiry, Kazi Akib Abdullah, and Rifat Shermin Ritu
Latent Fingerprinting: A Review
    Ritika Dhaneshwar and Mandeep Kaur
Skin Cancer Classification Through Transfer Learning Using ResNet-50
    Anubhav Mehra, Akash Bhati, Amit Kumar, and Ruchika Malhotra
Sentiment Analysis Using Machine Learning Approaches
    Ayushi Mitra and Sanjukta Mohanty
Nonlinear 2D Chaotic Map and DNA (NL2DCM-DNA) Sequences-Based Fast and Secure Block Image Encryption
    Shalini Stalin, Priti Maheshwary, and Piyush Kumar Shukla
A Language-Independent Speech Data Collection and Preprocessing Technique
    S. M. Saiful Islam Badhon, Farea Rehnuma Rupon, Md. Habibur Rahaman, and Sheikh Abujar
Ensemble Model to Predict Credit Card Fraud Detection Using Random Forest and Generative Adversarial Networks
    Sukhwant Kaur, Kiran Deep Singh, Prabhdeep Singh, and Rajbir Kaur
Performance Analysis of Machine Learning Classifiers on Different Healthcare Datasets
    Aritra Ray and Hena Ray
A Correlation-Based Classification of Power System Faults in a Long Transmission Line
    Alok Mukherjee, Palash Kumar Kundu, and Arabinda Das
A Wavelet Entropy-Based Power System Fault Classification for Long Transmission Lines
    Alok Mukherjee, Palash Kumar Kundu, and Arabinda Das
Bangla Handwritten Math Recognition and Simplification Using Convolutional Neural Network
    Fuad Hasan, Shifat Nayme Shuvo, Sheikh Abujar, and Syed Akhter Hossain
Stress Detection of Office Employees Using Sentiment Analysis
    Sunita Sahu, Ekta Kithani, Manav Motwani, Sahil Motwani, and Aadarsh Ahuja
Redefining Home Automation Through Voice Recognition System
    Syeda Fizza Nuzhat Zaidi, Vinod Kumar Shukla, Ved Prakash Mishra, and Bhopendra Singh
CNN-Based Iris Recognition System Under Different Pooling
    Keyur Shah and S. M. Shah
Semantics Exploration for Automatic Bangla Speech Recognition
    S. M. Zahidul Islam, Akteruzzaman, and Sheikh Abujar
A Global Data Pre-processing Technique for Automatic Speech Recognition
    Akteruzzaman, S. M Zahidul Islam, and Sheikh Abujar
Sentiment Analysis on Images Using Convolutional Neural Network
    Ramandeep Singh Kathuria, Siddharth Gautam, Anup Singh, Arjan Singh, and Nishant Yadav
Classification of Microstructural Image Using a Transfer Learning Approach
    Shib Sankar Sarkar, Md. Salman Ansari, Riktim Mondal, Kalyani Mali, and Ram Sarkar
Connecting Quaternion-Based Rotationally Invariant LCS to Pedestrian Path Projection
    Kazi Lutful Kabir, Prasanna Venkatesh Parthasarathy, and Yash Tare
Real-Time Traffic Sign Recognition Using Convolutional Neural Networks
    Aditya Rao, Rahul Motwani, Naveed Sarguroh, Parth Kingrani, and Sujata Khandaskar
A Survey on Efficient Management of Large RDF Graph for Semantic Web in Big Data
    Ashutosh A. Abhangi and Sailesh Iyer
URL Scanner Using Optical Character Recognition
    Shivam Siddharth, Shubham Chaudhary, and Jasraj Meena
A Survey on Sentiment Analysis
    Deb Prakash Chatterjee, Anirban Mukherjee, Sabyasachi Mukhopadhyay, Mrityunjoy Panday, Prasanta K. Panigrahi, and Saptarsi Goswami
Movie Review Sentimental Analysis Based on Human Frame of Reference
    Jagbir Singh, Hitesh Sharma, Rishabh Mishra, Sourab Hazra, and Namrata Sukhija
A Gaussian Naive Bayesian Classifier for Fake News Detection in Bengali
    Shafayat Bin Shabbir Mugdha, Marian Binte Mohammed Mainuddin Kuddus, Lubaba Salsabil, Afra Anika, Piya Prue Marma, Zahid Hossain, and Swakkhar Shatabda
Human Friendliness of Classifiers: A Review
    Prasanna Haddela, Laurence Hirsch, Teresa Brunsdon, and Jotham Gaudoin
Bengali Context–Question Similarity Using Universal Sentence Encoder
    Mumenunnessa Keya, Abu Kaisar Mohammad Masum, Sheikh Abujar, Sharmin Akter, and Syed Akhter Hossain
Abstraction Based Bengali Text Summarization Using Bi-directional Attentive Recurrent Neural Networks
    Md. Muhaiminul Islam, Mohiyminul Islam, Abu Kaisar Mohammad Masum, Sheikh Abujar, and Syed Akhter Hossain
Bengali News Classification Using Long Short-Term Memory
    Md. Ferdouse Ahmed Foysal, Syed Tangim Pasha, Sheikh Abujar, and Syed Akhter Hossain
Facial Emotion Recognition for Aid of Social Interactions for Autistic Children
    Sujan Mitra, Biswajyoti Roy, Sayan Chakrabarty, Bhaskar Mukherjee, Arijit Ghosal, and Ranjita Chowdhury
Exer-NN: CNN-Based Human Exercise Pose Classification
    Md. Warid Hasan, Jannatul Ferdosh Nima, Nishat Sultana, Md. Ferdouse Ahmed Foysal, and Enamul Karim
Convolutional Neural Network Hyper-Parameter Optimization Using Particle Swarm Optimization
    Md. Ferdouse Ahmed Foysal, Nishat Sultana, Tanzina Afroz Rimi, and Majedul Haque Rifat
Automatic Gender Identification Through Speech
    Debanjan Banerjee, Suchibrota Dutta, and Arijit Ghosal
Machine Learning-Based Social Media Analysis for Suicide Risk Assessment
    Sumit Gupta, Dipnarayan Das, Moumita Chatterjee, and Sayani Naskar
Gender Identification from Bangla Name Using Machine Learning and Deep Learning Algorithms
    Md. Kowsher, Md. Zahidul Islam Sanjid, Fahmida Afrin, Avishek Das, and Puspita Saha
A Survey of Sentiment Analysis and Opinion Mining
    Ankit Kumar, Tanisha Beri, and Tanvi Sobti
Precision Agriculture Using Cloud-Based Mobile Application for Sensing and Monitoring of Farms
    Ria Shrishti, Ashwani Kumar Dubey, and Divya Upadhyay
Indian Sign Language Recognition Using Python
    Sudaksh Puri, Meghna Sinha, Sanjana Golaya, and Ashwani Kumar Dubey
Implementation of Classification Techniques in Education Sector for Prediction and Profiling of Students
    Sheena Mushtaq and Shailendra Narayan Singh
Classification of Locations Using K-Means on Nearby Places
    Harsh Neel Mani, Mayank Parashar, Puneet Sharma, and Deepak Arora
Risk Factor Anatomization of Lung Diseases and Nutrition Value in Indian Perspective
    Divya Gaur and Sanjay Kumar Dubey
Analytical Study of Recommended Diet for Patients with Cardiovascular Disease Using Fuzzy Approach
    Garima Rai and Sanjay Kumar Dubey
Programmed Recognition of Medicinal Plants Utilizing Machine Learning Techniques
    P. Siva Krishna and M. K. Mariam Bee

Information Retrieval

GA-Based Iterative Optimization System to Supervise Adaptive Workflows in Cloud Environment
    Suneeta Satpathy, Monika Mangla, Sachi Nandan Mohanty, and Sirisha Potluri
Rainfall Prediction and Suitable Crop Suggestion Using Machine Learning Prediction Algorithms
    Nida Farheen
Implementation of a Reconfigurable Architecture to Compute Linear Transformations Used in Signal/Image Processing
    Atri Sanyal and Amitabha Sinha
PCA and Substring Extraction Technique to Generate Signature for Polymorphic Worm: An Automated Approach
    Avijit Mondal, Arnab Kumar Das, Subhadeep Satpathi, and Radha Tamal Goswami
A Bengali Text Summarization Using Encoder-Decoder Based on Social Media Dataset
    Fatema Akter Fouzia, Minhajul Abedin Rahat, Md. Tahmid Alie - Al - Mahdi, Abu Kaisar Mohammad Masum, Sheikh Abujar, and Syed Akhter Hossain
Statistical Genomic Analysis of the SARS-CoV, MERS-CoV and SARS-CoV-2
    Mayank Sharma
Analysing Hot Facebook Users Posts’ Sentiment Using Deep Learning
    Nguyen Ngoc Tram and Phan Duy Hung
The Connection of IoT to Big Data–Hadoop Ecosystem in a Digital Age
    Le Trung Kien, Phan Duy Hung, and Kieu Ha My
Learning Through MOOCs: Self-learning in the Era of Disruptive Technology
    Neha Srivastava, Jitendra Kumar Mandal, and Pallavi Asthana
Automatic Diabetes and Liver Disease Diagnosis and Prediction Through SVM and KNN Algorithms
    Md. Reshad Reza, Gahangir Hossain, Ayush Goyal, Sanju Tiwari, Anurag Tripathi, Anupama Bhan, and Pritam Dash
Stationarity and Self-similarity Determination of Time Series Data Using Hurst Exponent and R/S Ration Analysis
    Anirban Bal, Debayan Ganguly, and Kingshuk Chatterjee
World Cup Semi-finalists Prediction by Statistical Method
    Saptarshi Banerjee, Arnabi Mitra, and Debayan Ganguly
Smart Crop Protection System from Birds Using Deep Learning
    Devika Sunil, R. Arjun, Arjun Ashokan, Firdous Zakir, and Nithin Prince John
Collaborating Technologies for Autonomous and Smart Trade in the Era of Industry 4.0: A Detail Review on Digital Factory
    Monika Nijhawan, Ajay Rana, Shishir Kumar, and Sarvesh Tanwar
Code Clone Detection—A Systematic Review
    G. Shobha, Ajay Rana, Vineet Kansal, and Sarvesh Tanwar
Detecting Propaganda in Trending Twitter Topics in India—A Metric Driven Approach
    K. Sree Hari, Disha Aravind, Ashish Singh, and Bhaskarjyoti Das
COVID–19 Detection from Chest X-Ray Images Using Deep Learning Techniques
    Aritra Ray and Hena Ray
Wavelet-Based Medical Image Compression and Optimization Using Evolutionary Algorithm
    S. Saravanan and D. Sujitha Juliet
A Smart Education Solution for Adaptive Learning with Innovative Applications and Interactivities
    R. Shenbagaraj and Sailesh Iyer
Exhaustive Traversal of Deep Learning Search Space for Effective Retinal Vessel Enhancement
    Mahua Nandy Pal and Minakshi Banerjee
Empirical Study of Propositionalization Approaches
    Kartick Chandra Mondal, Tanika Chakraborty, and Somnath Mukhopadhyay
A Survey on Online Spam Review Detection
    Sanjib Halder, Shawni Dutta, Priyanka Banerjee, Utsab Mukherjee, Akash Mehta, and Runa Ganguli
Leveraging Machine Vision for Automated Tiles Defect Detection in Ceramic Industries
    Brijesh Jajal and Ashwin R. Dobariya
An Automated Methodology for Epileptic Seizure Detection Using Random Forest Classifier
    N. Samreen Fathima, M. K. Mariam Bee, Abhishek Bhattacharya, and Soumi Dutta
A Novel Bangla Font Recognition Approach Using Deep Learning
    Md. Majedul Islam, A. K. M. Shahariar Azad Rabby, Nazmul Hasan, Jebun Nahar, and Fuad Rahman
Heart Diseases Classification Using 1D CNN
    Jemia Anna Jacob, Jestin P. Cherian, Joseph George, Christo Daniel Reji, and Divya Mohan
Detection of Depression and Suicidal Tendency Using Twitter Posts
    Sunita Sahu, Anirudh Ramachandran, Akshara Gadwe, Dishank Poddar, and Saurabh Satavalekar
A Deep Learning Approach to Detect Depression from Bengali Text
    Md. Rafidul Hasan Khan, Umme Sunzida Afroz, Abu Kaisar Mohammad Masum, Sheikh Abujar, and Syed Akhter Hossain
Hair Identification Extension for the Application of Automatic Face Detection Systems
    Sugandh Agarwal, Shinjana Misra, Vaibhav Srivastava, Tanmay Shakya, and Tanupriya Choudhury
A Study on Secure Software Development Life Cycle (SSDLC)
    S. G. Gollagi, M. S. Narasimha Murthy, H. Aditya Pai, Piyush Kumar Pareek, and Sunanda Dixit
Toward a Novel Decentralized Multi-malware Detection Engine Based on Blockchain Technology
    Sumit Gupta, Parag Thakur, Kamalesh Biswas, Satyajeet Kumar, and Aman Pratap Singh
A Gamification Model for Online Learning Platform
    Rituparna Pal and Satyajit Chakrabati
Voice-Operated Scientific Calculator with Support for Equation Solving
    Harshit Bhardwaj, Mukul Chakarvarti, Nikhil Kumar, and Shubham Mittal
Deep Learning Framework for Cybersecurity: Framework, Applications, and Future Research Trends
    Rahul Veer Singh, Bharat Bhushan, and Ashi Tyagi
Explainable AI Approach Towards Toxic Comment Classification
    Aditya Mahajan, Divyank Shah, and Gibraan Jafar
Rank Prediction in PUBG: A Multiplayer Online Battle Royale Game
    Harsh Aggarwal, Siddharth Gautam, Sandeep Malik, Arushi Khosla, Abhishek Punj, and Bismaad Bhasin
Automatic Medicinal Plant Recognition Using Random Forest Classifier
    P. Siva Krishna and M. K. Mariam Bee
Power Spectrum Estimation-Based Narcolepsy Diagnosis with Sleep Time Slot Optimization
    Shivam Tiwari, Deepak Arora, Puneet Sharma, and Barkha Bhardwaj
Rhythmic Finger-Striking: A Memetic Computing-Inspired Musical Rhythm Improvisation Strategy
    Samarjit Roy, Sudipta Chakrabarty, Debashis De, Abhishek Bhattacharya, Soumi Dutta, and Sujata Ghatak
COVID-R: A Deep Feature Learning-Based COVID-19 Rumors Detection Framework
    Tulika Paul, Samarjit Roy, Satanu Maity, Abhishek Bhattacharya, Soumi Dutta, and Sujata Ghatak

Author Index
About the Editors
Aboul Ella Hassanien is Founder and Head of the Egyptian Scientific Research Group (SRGE) and Professor of Information Technology at Cairo University. He has more than 1000 scientific research papers published in prestigious international journals and over 45 books covering such diverse topics as data mining, medical images, intelligent systems, social networks and smart environment. He is Founder and Head of Africa Scholars Association in Information and Communication Technology. His other research areas include computational intelligence, medical image analysis, security, animal identification, space sciences and telemetry mining and multimedia data mining.

Siddhartha Bhattacharyya is Professor in Christ University, Bangalore, India. He served as Senior Research Scientist at the Faculty of Electrical Engineering and Computer Science of VSB Technical University of Ostrava, Czech Republic, from October 2018 to April 2019. Prior to this, he was Professor of Information Technology at RCC Institute of Information Technology, Kolkata, India. He is Co-Author of 5 books and Co-Editor of 30 books and has more than 220 research publications in international journals and conference proceedings to his credit. His research interests include soft computing, pattern recognition, multimedia data processing, hybrid intelligence and quantum computing.

Satyajit Chakrabati is Pro-Vice Chancellor, University of Engineering & Management, Kolkata & Jaipur Campus, India, and Director of the Institute of Engineering & Management. He was Project Manager in TELUS, Vancouver, Canada, from February 2006 to September 2009, where he was intensively involved in planning, execution, monitoring, communicating with stakeholders, negotiating with vendors and cross-functional teams and motivating members. He managed a team of 50 employees and projects with a combined budget of $3 million.
Abhishek Bhattacharya is Assistant Professor at the Institute of Engineering & Management, India. He completed his M.Tech (CSE) at BIT, Mesra. He has 5 edited and 3 authored books to his credit. He has published many technical papers in various peer-reviewed journals and conference proceedings, both international and national. He has 13 years of teaching and research experience. His areas of research are data mining, network security, mobile computing and distributed computing. He is a reviewer for a couple of journals of IGI Global and Inderscience and for the Journal of Information Science Theory and Practice. He is a member of IACSIT, UACEE, IAENG, ISOC, SDIWC and ICSES and an Advisory Board member of various international conferences such as CICBA, CSI, FTNCT, ICIoTCT, ICCIDS, ICICC, ISETIST, etc.

Dr. Soumi Dutta is Associate Professor at the Institute of Engineering & Management, India. She completed her Ph.D (CST) at IIEST and secured 1st position (Gold medalist) in M.Tech (CSE). Her research interests include social network analysis, data mining, information retrieval, online social media analysis, natural language processing and image processing. She was Editor for 3 volumes of the CIPR and IEMIS Springer conferences and for 2 special issue volumes in IJWLTT. She is a TPC member of various international conferences. She is a peer reviewer for international journals and publishers such as Journal of King Saud University, IGI Global, Springer and Elsevier. She is a member of several technical functional bodies such as IEEE, MACUL, SDIWC, ISOC and ICSES. She has published several papers in reputed journals and conferences.
Pattern Recognition
Local Binary Pattern-Based Texture Analysis to Predict IDH Genotypes of Glioma Cancer Using Supervised Machine Learning Classifiers Sonal Gore and Jayant Jagtap
Abstract Machine learning-based quantitative assessment of glioma has recently gained attention from researchers in the field of medical image analysis. Such analysis makes use of either hand-crafted radiographic features with radiomic-based methods or auto-extracted features using deep learning-based methods. Radiomic-based methods cover a wide spectrum of radiographic features including texture, shape, volume, intensity, histogram, etc. The objective of the paper is to demonstrate the discriminative role of textures for molecular categorization of glioma using supervised machine learning techniques. This work aims to make state-of-the-art machine learning solutions available for magnetic resonance imaging (MRI)-based genomic analysis of glioma as a simple and sufficient technique based on a single feature type, i.e., textures. The work demonstrates the importance of texture features, extracted with the simple and computationally efficient local binary pattern (LBP) method, for isocitrate dehydrogenase (IDH)-based discrimination of glioma into IDH mutant and IDH wild type. Further, such texture-based discriminative analysis alone can facilitate an early recommendation for further diagnostic decisions and personalized treatment plans for glioma patients.

Keywords Glioma · MRI · IDH genotypes · Supervised classifier · Local binary pattern
S. Gore · J. Jagtap (B)
Symbiosis Institute of Technology, Symbiosis International (Deemed University), Lavale, Pune, Maharashtra, India

S. Gore
Pimpri Chinchwad College of Engineering, Nigdi, Pune, Maharashtra, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_1
1 Introduction
The risk of glioma-type brain cancer is increasing, making it one of the most life-threatening and critical diseases. It is critical with respect to its detection, exact categorization, and treatment. As per the 2016 update of the WHO tumor classification, glioma categorization needs to be carried out based on genomic characteristics [1]. Isocitrate dehydrogenase (IDH) mutation is one of the important genotypes of glioma, used to understand its characteristics in terms of severity, nature, progression stage and survival period, and further to assess the patient's quality of life [2]. The IDH mutational status classifies glioma into IDH mutant and IDH wild type [3]. The conventional method for genomic classification relies on biopsy and histopathological tests, which are manual, invasive, and time consuming. Therefore, there is a strong need for a non-invasive computer-aided diagnostic (CAD) method for early prediction of glioma cancer, which can support faster treatment planning and thereby better patient survival. MRI is an advanced medical imaging technique that is commonly accepted as a superior modality for non-invasive cancer diagnosis [4]. Currently, various machine learning algorithms are extensively employed to analyze MRI-derived features in order to make useful predictions in precision genomic analysis of glioma. There are CAD implementations using a variety of hand-crafted radiographic features such as texture, intensity, histogram, shape and volume, along with other features like location, patient age at diagnosis, sex and histological grade, which are analyzed using machine learning techniques to provide detailed insight into gene-level characterization of glioma [5–8]. A few CAD systems use auto-extracted features developed with deep learning methods [9, 10].
Among the different appearance-related features derived from MRI images, the texture of the tumor region plays a vital role in IDH-based delineation of glioma into mutant and wild subtypes. Several studies have analyzed texture data integrated with other features using machine learning techniques, for genomic analysis of glioma as well as for WHO grading [11–14]. The majority of reports have used gray-level co-occurrence matrix (GLCM)-based texture features. Very few works have investigated the importance of LBP-based texture combined with other radiomic features for glioma subtyping. Machine learning algorithms such as support vector machine, random forest, naive Bayes, clustering, k-nearest neighbor and neural network are used in existing methods. The objective of this paper is to demonstrate the significance of LBP-based texture features for IDH characterization of low-grade and high-grade glioma using supervised machine learning classifiers.
2 Related Work
Radiomics is an emerging field of research in medical image analysis which is used to extract meaningful features from radiographic images. Such radiomic analysis helps to determine mutational status of different genotypes of glioma including IDH,
1p/19q co-deletion, methylguanine methyltransferase (MGMT), epidermal growth factor receptor (EGFR), etc., by making use of clinical data as well as a variety of radiographic features. Various studies have found correlations between genetic mutations occurring in glioma patients and different MRI-derived features including texture, shape, volume, intensity, histogram, first-order statistical features, diffusion measures, contrast enhancement, tumor margin, topological features, etc. Among these, texture-based analysis has proved a useful approach, contributing significantly to molecular characterization of glioma [11–14]. Tumors are heterogeneous in nature, growing randomly in any direction and taking on irregular shapes. Even so, shape features combined with clinical details such as patient age have shown good predictive value for IDH gene alteration [15]. Radiographic textural features are associated with 1p/19q loss-based IDH mutation. This association was supported by a systematic evaluation of feature sets, including local binary pattern histogram (LBPH) features, skewness, kurtosis, intensity grading, entropy and histogram, carried out using an artificial neural network to characterize the predictive value of MRI features for differentiating IDH mutant glioma with 1p/19q positive or 1p/19q negative status [16]. A variety of textural feature sets were calculated using the gray-level co-occurrence matrix (GLCM), gray-level run-length matrix, gray-level size zone matrix, neighborhood gray-tone difference matrix, etc., and random forest analysis performed well on this set of texture features for IDH characterization [17].
Global topological features such as holes and connected components, which are based on the theory of homology, carry greater predictive capacity than GLCM textural features for predicting 1p/19q co-deletion status, and such features proved more stable against different image perturbations [18]. A retrospective study based on linear regression found that the homogeneity texture attribute combined with tumor volume is highly associated with the molecular status of glioma, giving promising results; tumor homogeneity was found to be comparatively lower in IDH wild type, suggesting highly dispersed volumes in this type [11]. Random forest-based assessment of the predictive strength of integrated MRI features, combining clinical data (age, sex, location) with imaging features, produced better results than individual MRI feature assessment for accurate determination of IDH mutation in primary high-grade glioma [19]. Neural network-based texture analysis of diffusion tensor imaging (DTI), along with tumor volume, was reported to yield the most heavily weighted features, which carry more information and enable precise determination of the IDH mutational status of glioma; the study highlighted that IDH mutant gliomas are larger in size than IDH wild type [13]. Some studies have focused only on glioblastoma, which is the most aggressive of all glioma types and has a comparatively very low survival rate. The different regions of a glioblastoma tumor can be distinctly identified as rim-enhancing, solid non-enhancing and irregular edema; these regions help to assess the genomic heterogeneity of glioblastoma and support different predictions of survival rate, molecular composition, and molecular-level categorization in terms of gene-level alteration status of IDH-1, EGFR variant III, and methylation status of MGMT [20]. A total of 39,212 different features were extracted including GLCM
and LBP texture features, first-order statistics features, and shape- and size-related features for stratification of glioma into five molecular subcategories based on low/high grading, IDH mutations, and 1p/19q co-deletion [21]. Diffusional kurtosis imaging (DKI) biomarkers played a superior role in glioma genetic grading as compared to the conventional MRI technique [22]. Radiomic-based conventional machine learning approaches operate on hand-engineered features, whereas deep learning-based methods perform gene-level prediction with auto-extracted features. A few recent works have drawn attention to radiogenomic studies using deep learning methods. The contribution of abnormal regulation of IDH genotype, together with 1p/19q loss and MGMT promoter methylation, to tumorigenesis can be quantified well using a deep convolutional neural network (CNN) [23]. The transfer learning approach of reusing a pre-trained deep neural network proved to be a promising method for MRI image analysis in glioma grading assessment [24]. The predictive performance of a 34-layer residual network model was cross-validated with three different network designs: a single combined network using multi-plane, multimodal MR input; separate networks using plane-wise MRI; and separate networks using modality-wise MRI data. The deep learning model with the combined-modality network design yielded an accuracy of 85.7% during the testing phase, which improved to 89.1% with age as an additional contributing marker [9]. Data augmentation techniques used in the design of a CNN model helped to avoid overfitting [25]. A deep CNN sufficiently discriminated IDH-based glioma types by selecting high-response filters, which carry significant weights, from the last layer of the CNN to extract appropriate distinctive features [26].
The effectiveness of deep learning architectures was extended with 3D models to classify IDH genotype from 3D multimodal medical images [27]. In parallel with our study, the works in [13, 22] built machine learning-assisted IDH discriminative models by analyzing texture features. Reference [13] predicted IDH genotypes with 95% accuracy in a validation cohort of low-grade glioma using a neural network classifier. The study in [22] achieved an accuracy of 83.8% by analyzing diffusional kurtosis imaging (DKI) textures using a support vector machine. However, our s-LBP texture-based and random forest-based unimodal MRI analysis model differs from these studies in the following aspects: (1) Reference [13] used LBP textures of diffusion tensor imaging and reference [22] used textures of diffusional kurtosis imaging as IDH biomarkers; (2) both studies performed a feature selection step; (3) both studies conducted experiments on a small number of patients.

All these machine learning-based and deep learning-based studies suggest that IDH mutant glioma exhibits a less aggressive process of tumor formation and growth than IDH wild type. A better survival rate is also suggested for IDH mutant than for wild type [6]. Such discriminative behavior of glioma can be demonstrated using various attributes including clinical data, histology grade, genetic mutations, and MR-derived imaging features. Moreover, among the various MRI-derived features, this work investigates the independent and effective role of textures in IDH-based delineation of glioma into wild and mutant types.
3 Materials and Methods
This section describes the step-wise approach of the proposed work: MRI dataset preparation, the pre-processing steps, texture feature extraction, and classification using three supervised classifiers.
3.1 Dataset Preparation and Image Interpretation with Segmented Tumor Mask
The experimentation is conducted using pre-surgical T2-weighted MRI scans of 175 subjects with low- and high-grade glioma (WHO grade II, III and IV). Of the 175 subjects, 74 are of IDH mutant type and 101 are of IDH wild type. A total of 2308 2-D images of size 240 × 240 are used to train and validate the model: 978 images of IDH mutant and 1330 images of IDH wild type. The publicly available data of The Cancer Imaging Archive (TCIA) is used for experimentation [28]. The IDH status of all subjects under consideration is confirmed by referring to the genomic data available from The Cancer Genome Atlas (TCGA) [29, 30]. The ready-made segmented tumor mask taken from the TCIA portal [31, 32] is further processed in two steps: first, the binary mask is converted into a gray-scale mask by pixel-wise mapping onto the original image; second, the rectangular region of interest is cropped from the gray-scale tumor mask using the bounding box method.
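The two mask-processing steps can be illustrated with a minimal sketch in plain Python. It assumes the slice and its binary mask are available as equally sized nested lists; the helper names (`apply_mask`, `bounding_box`, `crop_tumor`) and the toy values are hypothetical and are not taken from the authors' implementation.

```python
def apply_mask(image, mask):
    """Keep original intensities inside the tumor mask, zero elsewhere."""
    return [[px if m else 0 for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def bounding_box(mask):
    """Return (top, bottom, left, right), inclusive, of the non-zero mask pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], rows[-1], cols[0], cols[-1]


def crop_tumor(image, mask):
    """Gray-scale tumor mask cropped to its rectangular bounding box."""
    gray = apply_mask(image, mask)
    top, bottom, left, right = bounding_box(mask)
    return [row[left:right + 1] for row in gray[top:bottom + 1]]


# Toy 4 x 5 slice with a hypothetical three-pixel tumor region.
image = [
    [10, 11, 12, 13, 14],
    [20, 21, 22, 23, 24],
    [30, 31, 32, 33, 34],
    [40, 41, 42, 43, 44],
]
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
roi = crop_tumor(image, mask)
print(roi)  # [[22, 23], [32, 0]]
```

Pixels inside the bounding box but outside the mask stay at zero, so the cropped rectangle still contains some background; whether the original pipeline handles those pixels differently is not stated.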
3.2 Texture Analysis Using LBP with Circular and Square Neighborhood
Texture is one of the key features used in biomedical applications. The local binary pattern (LBP) proposed by [33] is a robust texture analysis method that gives encouraging results in medical image analysis. The LBP method describes the local appearance of an image with a binary pattern. This local representation is computed as a binary code for each pixel by thresholding its local neighborhood against the center pixel. If the intensity of a neighboring pixel is greater than that of the center pixel, it is coded as 1; otherwise, it is coded as 0. The resultant binary code is the LBP feature descriptor for the center pixel. Likewise, the LBP feature is computed for the whole image pixel by pixel, which results in a feature vector
of image size. For an 8-pixel neighborhood, there are 256 possible binary patterns in total. With the extended LBP method proposed by [34], all LBP codes are grouped into uniform and non-uniform codes, and all uniform LBP codes are converted into rotation-invariant codes. The number of rotation-invariant uniform and non-uniform codes finally decides the size of the LBP feature vector for an image.

The extension of the original LBP operator is used in this study for variable-sized neighborhoods with the LBP(P, D) operator, where P is the number of sampling pixels in a symmetric neighborhood with sampling distance D. Sampling distances D=1, D=2 and D=3 correspond to 3×3, 5×5 and 7×7 local neighborhoods, respectively. Two different neighborhood patterns are tested: LBP with circular neighborhood (c-LBP) and LBP with square neighborhood (s-LBP). Pixel intensity values are interpolated in the case of the circular neighborhood, whereas the original values are used with the square neighborhood. The rotation and gray-scale invariant uniform LBP method is applied to extract binary codes with three different combinations of (P, D) in our study.

A histogram is computed for the binary patterns extracted from the cropped tumor images. With (P=8, D=1), the histogram has 10 bins, numbered from 0 to 9 (nine rotation-invariant uniform LBP codes and one non-uniform LBP code). Each bin holds the frequency count of one LBP code. The histogram of each image serves as its LBP feature vector. The nine uniform LBP codes represent nine fundamental image patterns (binary rotation-invariant uniform LBP codes): LBP code 0: white spot (00000000); LBP codes 1 and 2: line end (00000100, 00001100); LBP code 3: corner (00011100); LBP codes 4, 5, 6 and 7: edges (00111100, 01111100, 11111100, 11111110); LBP code 8: black spot (11111111). All remaining non-uniform patterns are represented by LBP code 9.
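As an illustration of how the ten bins arise for P=8, the sketch below maps a binary pattern to its rotation-invariant uniform code. It assumes the usual convention that a pattern is uniform when it has at most two 0/1 transitions around the circle, that a uniform pattern's code equals its number of ones, and that all other patterns share code P+1. The function names are illustrative and not the authors' code.

```python
def transitions(bits):
    """Number of 0/1 changes when the pattern is read as a closed circle."""
    return sum(bits[i] != bits[(i + 1) % len(bits)] for i in range(len(bits)))


def riu2_code(bits):
    """Count of ones for uniform patterns (at most 2 transitions), else P + 1."""
    return sum(bits) if transitions(bits) <= 2 else len(bits) + 1


def n_bins(p):
    """Histogram length: codes 0..P for uniform patterns plus one shared bin."""
    return p + 2


patterns = ["00000000", "00000100", "00001100", "00011100", "00111100",
            "01111100", "11111100", "11111110", "11111111", "01010101"]
codes = [riu2_code([int(ch) for ch in p]) for p in patterns]
print(codes)                              # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(n_bins(8), n_bins(16), n_bins(24))  # 10 18 26
```

The first nine patterns are the ones listed in the text; the last one, 01010101, is an added example of a non-uniform pattern that falls into code 9.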
Our experimentation extracted LBP texture features with three circular and three square neighborhoods, i.e., c-LBP (P=8, D=1), c-LBP (P=16, D=2), c-LBP (P=24, D=3), s-LBP (P=8, D=1), s-LBP (P=16, D=2) and s-LBP (P=24, D=3). Histograms of the LBP codes of two sample images, one of each class, with the square neighborhood (P=8, D=1) are shown in Fig. 1.
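A possible reading of the s-LBP feature extraction is sketched below. It rests on several assumptions that the text does not spell out: the P = 8D sampling points are taken to be the pixels on the perimeter of the (2D+1) × (2D+1) square, read in circular order without interpolation; a neighbor is coded 1 only when it is strictly greater than the center, as stated in Section 3.2; and pixels closer than D to the image border are skipped. All function names are hypothetical.

```python
def square_ring(d):
    """Offsets (dy, dx) on the perimeter of the (2d+1) x (2d+1) square, in circular order."""
    top = [(-d, x) for x in range(-d, d)]        # top side, left to right
    right = [(y, d) for y in range(-d, d)]       # right side, top to bottom
    bottom = [(d, x) for x in range(d, -d, -1)]  # bottom side, right to left
    left = [(y, -d) for y in range(d, -d, -1)]   # left side, bottom to top
    return top + right + bottom + left


def riu2_code(bits):
    """Count of ones for uniform patterns (at most 2 transitions), else P + 1."""
    p = len(bits)
    changes = sum(bits[i] != bits[(i + 1) % p] for i in range(p))
    return sum(bits) if changes <= 2 else p + 1


def slbp_histogram(image, d=1):
    """Histogram with P + 2 bins of s-LBP codes over all non-border pixels."""
    ring = square_ring(d)
    hist = [0] * (len(ring) + 2)
    for y in range(d, len(image) - d):
        for x in range(d, len(image[0]) - d):
            center = image[y][x]
            bits = [int(image[y + dy][x + dx] > center) for dy, dx in ring]
            hist[riu2_code(bits)] += 1
    return hist


# Toy 3 x 3 patch with a bright column on the right.
edge = [[0, 0, 9],
        [0, 0, 9],
        [0, 0, 9]]
print(slbp_histogram(edge))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

The histogram holds raw frequency counts, as in the text. Normalizing by the number of pixels would make crops of different sizes comparable, but the paper does not say whether this was done.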
3.3 IDH Classification Using Machine Learning
Supervised machine learning algorithms are widely used for classification problems [35, 36]. Our model is trained using three supervised machine learning methods: random forest (RF), support vector machine (SVM) with linear kernel, and naive Bayes (NB). The dataset of 2308 images is split into training and testing samples in an 80:20 ratio. The performance of the trained model is evaluated on the test data of 462 sample images.
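The hold-out protocol and the three measures reported in Section 4 (accuracy, sensitivity, specificity) can be sketched as follows. Two choices here are assumptions rather than facts from the paper: the test size is rounded up, which reproduces 462 test images from 2308, and the IDH mutant class is treated as the positive class. The paper also does not say whether the split was made per image or per subject; this sketch splits per image. The labels and predictions are hypothetical.

```python
import math
import random


def split_80_20(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, test); the test size is rounded up."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def scores(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall of the positive class) and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


train, test = split_80_20(range(2308))
print(len(train), len(test))  # 1846 462

# Hypothetical labels (1 = IDH mutant) and predictions, not the paper's results.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 40 + [0] * 10 + [0] * 45 + [1] * 5
result = scores(y_true, y_pred)
print(result)
```

With these toy counts, 40 of 50 positives and 45 of 50 negatives are correct, so sensitivity is 0.8, specificity is 0.9 and accuracy is 0.85.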
Fig. 1 Histogram of s-LBP (P = 8, D = 1) codes of sample images of IDH mutant and IDH wild type
4 Experimentation and Results
As mentioned above, we developed a binary classification model using RF, linear SVM, and NB classifiers. The performance of the predictive model is assessed using three quality measures: accuracy, sensitivity, and specificity. The best experimental results are achieved using random forest with the s-LBP method, which yielded 85.93% accuracy, 76.80% sensitivity, and 92.53% specificity. The experiments were run on an Intel Core i7 processor (3.2 GHz) with 16 GB RAM and a 500 GB hard drive. Python 3.6 is used for implementation of the proposed method. Table 1 shows the accuracy (%) of LBP-based texture analysis using the three supervised classifiers. For every tumor image, we extracted ten LBP features using (P=8, D=1), eighteen LBP features using (P=16, D=2) and twenty-six LBP features using (P=24, D=3). Our model achieved its highest accuracy of 85.93% using the random forest classifier and s-LBP with (P=8, D=1). The performance of the model does not improve as the LBP neighborhood grows in number of sampling points and distance. As the scope of the local neighborhood increases with distance, neighboring pixels at a farther distance are considered in computing the LBP feature
Table 1 Performance accuracy (%) of LBP-based texture analysis using supervised classifiers

| ML classifier | c-LBP (P=8, D=1) | s-LBP (P=8, D=1) | c-LBP (P=16, D=2) | s-LBP (P=16, D=2) | c-LBP (P=24, D=3) | s-LBP (P=24, D=3) |
| --- | --- | --- | --- | --- | --- | --- |
| Random forest (RF) | 84.84 | 85.93 | 82.68 | 85.06 | 81.38 | 84.41 |
| SVM-linear | 70.99 | 69.69 | 68.18 | 69.26 | 66.66 | 68.83 |
| Naive Bayes (NB) | 65.15 | 63.20 | 63.20 | 65.58 | 62.77 | 65.80 |
vector. Neighboring pixels at a farther distance do not appropriately discriminate the texture patterns of IDH mutant and wild type. The intensity values of neighboring pixels are interpolated in the case of c-LBP, whereas the original intensity values are used in the case of s-LBP. This is the likely reason our model obtained higher performance using the s-LBP method than the c-LBP method.

Statistical analysis is performed to assess the model's performance using Student's t-test. A p-value of less than 0.05 is considered statistically significant. The significant LBP codes and their p-values obtained for the s-LBP method (P=8, D=1) are as follows: LBP code 2 (p = 0.040), LBP code 3 (p = 0.021), LBP code 4 (p = 0.006), LBP code 5 (p = 0.012), LBP code 6 (p = 0.006) and LBP code 7 (p = 0.016). These are observed to be the texture patterns contributing most to IDH discrimination into mutant and wild classes. The box plots of the significant LBP codes obtained for the s-LBP method (P=8, D=1) are shown in Fig. 2. These box plots demonstrate the discriminative potential of the LBP feature vectors for classification into IDH mutant and wild-type glioma.
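The per-code significance test can be sketched as a two-sample Student's t statistic with pooled variance. This assumes that each LBP code's histogram count is compared between the IDH mutant and wild-type groups, which the text implies but does not state, and it stops at the statistic and its degrees of freedom: the p-value would then be read from the t distribution with that many degrees of freedom. The sample values are hypothetical.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic (pooled variance) and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical counts of one LBP code in five images per class.
mutant = [1, 2, 3, 4, 5]
wild = [2, 3, 4, 5, 6]
t, dof = student_t(mutant, wild)
print(t, dof)
```

Here both groups have sample variance 2.5 and the means differ by 1, so the statistic is about -1 with 8 degrees of freedom, which is far from significant at the 0.05 level.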
5 Conclusion
An accuracy of 85.93% is achieved with random forest-based LBP texture analysis for classification of glioma into IDH mutant and IDH wild type. LBP texture analysis using a square-shaped local neighborhood with the random forest classifier gives better discrimination of IDH classes than a circular neighborhood. Also, the local neighborhood with the smaller distance yielded comparatively better performance. These results indicate that LBP texture features carry good and sufficient discriminating capability for characterizing IDH mutant and wild-type glioma. There is further scope to integrate LBP textures with other texture features or other radiographic features such as shape and volume, which should help to improve the performance of such preliminary CAD models. The resultant CAD system can support early personalized recommendations for beginning treatment of glioma patients. Early assessment of the therapeutic schedule can be carried out using this CAD system as a preliminary support system without much delay. However, many challenges remain to be addressed before the system becomes a complete and sophisticated framework that works reliably, with less manual intervention, in the care of cancer patients.
Fig. 2 Box plots of LBP codes
References
1. Louis, D., Perry, A., Reifenberger, G., et al.: The 2016 World Health Organization classification of tumors of the central nervous system: a summary. Acta Neuropathol. 131, 803–820 (2016)
2. Huang, J., Yu, J., Tu, L., Luo, N.H.: Isocitrate dehydrogenase mutations in glioma: from basic discovery to therapeutics development. Front. Oncol. 9 (2019). https://doi.org/10.3389/fonc.2019.00506
3. Cohen, A., Holmen, S., Colman, H.: IDH1 and IDH2 mutations in gliomas. Curr. Neurol. Neurosci. 13(5), 345 (2013)
4. Bauer, S., Wiest, R.: A survey of MRI-based medical image analysis for brain tumor studies. Phys. Med. Biol. 58(13), R97–R129 (2013)
5. Wang, Q., Zhang, J., Li, F., Xu, X., Xu, B.: Diagnostic performance of clinical properties and conventional magnetic resonance imaging for determining the IDH1 mutation status in glioblastoma: a retrospective study. PeerJ 7, e7154 (2019)
6. Qi, S., Yu, L.: Isocitrate dehydrogenase mutation is associated with tumor location and magnetic resonance imaging characteristics in astrocytic neoplasms. Oncol. Lett. 7, 1895–1902 (2014)
7. Asodekar, B., Gore, S.: Brain tumor classification using shape analysis of MRI images. In: Proceedings of International Conference on Communication and Information Processing (ICCIP) (2019). Available via SSRN. https://ssrn.com/abstract=3425335 or http://dx.doi.org/10.2139/ssrn.3425335
8. Yu, J., Shi, Z., Lian, Y., et al.: Noninvasive IDH1 mutation estimation based on a quantitative radiomics approach for grade II glioma. Eur. Radiol. 27(8) (2016). https://doi.org/10.1007/s00330-016-4653-3
9. Chang, K., Bai, H., Zhou, H., et al.: Residual convolutional neural network for the determination of IDH status in low and high grade gliomas from MR imaging. Clin. Cancer Res. 24(5), 1073–1081 (2018)
10. Ahmad, A., Sarkar, S., Shah, A., et al.: Predictive and discriminative localization of IDH genotype in high grade gliomas using deep convolutional neural nets. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019). https://doi.org/10.1109/ISBI.2019.8759313
11. Jakola, A., Zhang, Y.H., Skjulsvik, A., et al.: Quantitative texture analysis in the prediction of IDH status in low-grade gliomas. Clin. Neurol. Neurosur. 164, 114–120 (2017)
12. Qiang, T., Yan, L., Xi, Z., et al.: Radiomics strategy for glioma grading using texture features from multiparametric MRI. J. Magn. Reson. Imaging 48(6), 1528–1538 (2018)
13. Eichinger, P., Alberts, E.: Diffusion tensor image features predict IDH genotype in newly diagnosed WHO grade II/III gliomas. Sci. Rep. 7, 13396 (2017). https://doi.org/10.1038/s41598-017-13679-4
14. Yang, D., Rao, G.: Evaluation of tumor-derived MRI-texture features for discrimination of molecular subtypes and prediction of 12-month survival status in glioblastoma. Med. Phys. 42(11), 6725–6735 (2015)
15. Zhou, H., Chang, K., Bai, H.X.: Machine learning reveals multimodal MRI patterns predictive of isocitrate dehydrogenase and 1p/19q status in diffuse low and high grade gliomas. J. Neuro-Oncol. 142(2), 299–307 (2019)
16. Jagtap, J., Saini, J., Vani, S., et al.: Predicting the molecular subtypes in gliomas using T2-weighted MRI. In: Proceedings of 2nd International Conference on Data Engineering and Communication Technology, Adv Intell Syst Comput Series Springer, Singapore, vol. 828, pp. 65–73 (2019)
17. Wu, S., Meng, J.: Radiomics-based machine learning methods for isocitrate dehydrogenase genotype prediction of diffuse gliomas. J. Cancer Res. Clin. 145, 543–550 (2019)
18. Kim, D., Wang, N., Ravikumar, V., et al.: Prediction of 1p/19q codeletion in diffuse glioma patients using pre-operative multiparametric magnetic resonance imaging. Front. Comput. Neurosc. 13, 52 (2019)
19. Zhang, B., Chang, K., Ramkissoon, S., et al.: Multimodal MRI features predict isocitrate dehydrogenase genotype in high-grade gliomas. Neuro-Oncol. 19(1), 109–117 (2016)
20. Rathore, S., Akbar, H., et al.: Radiomic MRI signature reveals three distinct subtypes of glioblastoma with different clinical and molecular characteristics, offering prognostic value beyond IDH1. Sci. Rep. 8, 5087 (2018). https://doi.org/10.1038/s41598-018-22739-2
21. Lu, C.F., Hsu, F.T., et al.: Machine learning-based radiomics for molecular subtyping of gliomas. Clin. Cancer Res. 24(18), 4429–4436 (2018)
22. Bisdas, S., Shenet, H., et al.: Texture analysis and support vector machine-assisted diffusional kurtosis imaging may allow in vivo gliomas grading and IDH-mutation status prediction: a preliminary study. Sci. Rep. 8, 6108 (2018)
23. Chang, P., Grinband, X.J., Weinberg, X.B.D.: Deep-learning convolutional neural networks accurately classify genetic mutations in gliomas. AJNR Am. J. Neuroradiol. 39(7), 1201–1207 (2018)
24. Yang, Y., Yan, L.F., Zhang, X.: Glioma grading on conventional MR images: a deep learning study with transfer learning. Front. Neurosci. 12, 804 (2018)
25. Akkus, Z., Ali, I., Sedlar, J.: Predicting 1p19q chromosomal deletion of low-grade gliomas from MR images using deep learning. J. Digit Imaging 30(4), 469–476 (2017)
26. Li, Z., Wang, Y., Yu, J., Guo, Y., Cao, W.: Deep learning based radiomics (DLR) and its usage in noninvasive IDH1 prediction for low grade glioma. Sci. Rep. 7, 5467 (2017)
27. Liang, S., Zhang, R., Liang, D., et al.: Multimodal 3D DenseNet for IDH genotype prediction in gliomas. Genes (Basel) 9(8), 382 (2018)
28. Clark, K., Vendt, B., Smith, K., et al.: The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J. Digital Imaging 26(6), 1045–1057 (2013)
29. Pedano, N., Flanders, A., Scarpace, L., et al.: Radiology data from The Cancer Genome Atlas Low Grade Glioma [TCGA-LGG]. The Cancer Imaging Archive (2016). http://doi.org/10.7937/K9/TCIA.2016.L4LTD3TK
30. Scarpace, L., Mikkelsen, T., Soonmee, C., et al.: Radiology data from The Cancer Genome Atlas glioblastoma multiforme [TCGA-GBM]. The Cancer Imaging Archive (2016). http://doi.org/10.7937/K9/TCIA.2016.RNYFUYE9
31. Bakas, S., Akbari, H., Sotiras, A., et al.: Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive (2017). https://doi.org/10.7937/K9/TCIA.2017.GJQ7R0EF
32. Bakas, S., Akbari, H., Sotiras, A., et al.: Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive (2017). https://doi.org/10.7937/K9/TCIA.2017.KLXWJJ1Q
33. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
34. Ojala, T., Pietikainen, M., Harwood, D.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognit. 19(3), 51–59 (1996)
35. Houman, S., Omid, S., Joshua, B., et al.: Artificial intelligence in the management of glioma: era of personalized medicine. Front. Oncol. 9, 768 (2019). https://doi.org/10.3389/fonc.2019.00768
36. Cho, H., Lee, S.H., Kim, J., Park, H.: Classification of the glioma grading using radiomics analysis. PeerJ 6, e5982 (2018). https://doi.org/10.7717/peerj.5982
An Efficient Classification Technique of Data Mining for Predicting Heart Disease Divya Singh Rathore and Anamika Choudhary
Abstract Predicting the occurrence of heart disease is an important task in the medical field. Various techniques have been applied to prediction analysis. In this work, the process of prediction analysis is divided into two phases: clustering and classification. K-means clustering is applied to cluster the data, and the output of clustering is given as input to an SVM classifier for classification. In order to increase the accuracy of prediction analysis, the back propagation algorithm is proposed to be applied together with the k-means clustering algorithm to cluster the data. The performance of the proposed algorithm is tested on the heart disease dataset taken from the UCI repository. The database contains 76 attributes; however, all published experiments use a subset of 14 of them. The proposed work aims to increase the accuracy of the prediction analysis.

Keywords SVM · Back propagation · Prediction · Classification
D. S. Rathore (B) · A. Choudhary
Computer Science and Engineering, JIET, Jodhpur, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_2

1 Introduction
Currently, every field holds a large amount of data, and analyzing all of it is very difficult and time consuming. This data is in raw form and of no use as such; hence, a proper data mining process is necessary to extract knowledge from it. The process of extracting raw material is characterized as mining [1]. For the analysis of simple data, various cheaper, simpler, and more effective solutions are available. The main objective of data mining is to discover important information that is available only in a distorted manner. Typical pattern discovery problems include detecting data entry key errors, problems in network systems, and fraudulent credit card transactions. Therefore, the result must be presented in a way that a human can
Fig. 1 Process of data mining
understand, with the help of a domain expert such as a network supervisor or marketing manager. Predictive information can be extracted for various applications with the help of efficient data mining tools. Data mining is the process of exploring large datasets in order to extract hidden, interesting facts and patterns. Different data mining tools are available to analyze various types of data. Education systems, customer retention, scientific discovery, production control, market basket analysis, and decision making are some of the applications in which data mining is used to analyze gathered information [2]. The term data mining is itself regarded as a misnomer. Figure 1 describes the complete procedure followed in data mining.

Data cleaning. Data cleaning is the process of ensuring the accuracy and correctness of data by altering it within the given storage resource.

Data transformation. Data transformation is the process through which data is converted from one form to another, typically from the format of a source system into the format required by a new destination system.

Pattern evaluation. Pattern evaluation identifies the truly interesting patterns that represent knowledge, on the basis of interestingness measures.

Data integrity. Data stored within a database or data warehouse has integrity if it is accurate and consistent. Data integrity can refer to a state, a process, or a function of data, and it is also referred to as data quality.

Data selection. The data selection process determines the appropriate source and type of data, as well as the appropriate means of gathering it. Data selection precedes the actual practice of data collection.
An Efficient Classification Technique of Data Mining for …
1.1 Techniques of Data Mining
Association: Association discovers relationships among the characteristics in a dataset. In heart disease prediction, it provides information about how the various characteristics relate to one another so that they can be analyzed.
Classification: Classification is a classic data mining technique based on machine learning. It assigns each item in the dataset to one of a predefined set of groups or classes. The mathematical approaches used for classification include linear programming, decision trees, and so on.
Clustering: Clustering automatically groups objects that have similar characteristics. The clustering technique defines the clusters itself and places the objects in them, in contrast to classification, where objects are assigned to predefined classes [3].
Prediction: Prediction searches for relationships between dependent and independent variables. It can be used in various fields; in sales, for example, future profit can be predicted by treating profit as the dependent variable and sales as the independent variable.
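As a minimal sketch of the prediction technique, assuming a simple linear relationship, the following fits profit (dependent variable) to sales (independent variable) by ordinary least squares. The function name `fit_line` and the sample figures are hypothetical and are not taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: sales (independent) and profit (dependent).
sales = [10.0, 20.0, 30.0, 40.0]
profit = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(sales, profit)

# Forecast the profit for a future sales level of 50.
predicted_profit = slope * 50.0 + intercept
```

A real dataset would rarely be this clean; the point is only to show which variable plays which role.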
2 Various Applications of Data Mining
Nowadays, satellite pictures, business transactions, text reports, military intelligence, and scientific data are the major sources of information that need to be handled. Information retrieval alone does not provide results suitable for decision making. New methods are required to handle large amounts of data in a way that supports good decisions: new patterns must be discovered in the raw data, and the important information must be extracted in order to summarize it. Data mining has been very successful in many applications, and communication, financial, retail, and marketing organizations use it to reduce their workload [4]. Retailers use data mining for product development and promotion, building a record of each customer's purchases and reviews. Data mining plays an essential role in so many areas that it is impossible to enumerate all applications. Cluster analysis, for example, is applied in image processing, market research, data analysis, and pattern recognition. With clustering, marketers categorize customers into groups and derive purchasing patterns in order to discover their customers' interests [5].
Data mining techniques are also used in biology to derive animal and plant taxonomies and to categorize genes with similar functionality. In geology, clustering is used to identify similar houses and land areas. Diabetes, in scientific terms diabetes mellitus, is the group of metabolic diseases in which a person has high blood sugar. High blood sugar has two causes: (1) the pancreas does not produce enough insulin, or (2) the cells do not respond to the insulin that is produced. It is therefore a condition that arises from the lack of effective insulin in the human body. Various types of diabetes exist; diabetes insipidus, despite sharing the name, is a separate condition [6]. Fuzzy methods solve a wide range of problems in different applications, and their functioning depends on their capability to learn and adapt. Several fuzzy frameworks have been used in the diagnosis of diabetes, the most popular computing framework being the fuzzy inference system (FIS).
3 Literature Review
The paper [7] describes diabetes, a chronic disease that causes major casualties worldwide. According to a report of the International Diabetes Federation (IDF), an estimated 285 million people worldwide suffer from diabetes. This number is expected to increase drastically in the near future, as no method to date minimizes its effects or prevents it completely. Type 2 diabetes (TTD) is the most common type of diabetes. The major issue was the detection of TTD, as it was not easy to predict all of its effects. Data mining was therefore used, as it provides optimal results and helps in knowledge discovery from data. A support vector machine (SVM) was used to extract information from the previous records of patients. Early detection of TTD supports effective decision making.
In paper [8], the authors analyzed various applications that show the significance of data mining and machine learning in different fields. They propose research on the management designs of different system components, since most existing work addresses system characteristics that vary over time. The proposed method optimizes the performance of systems with static or statically adaptive designs. The authors propose a method for designing operating systems with the support of data mining and machine learning, in which the data collected from the system is analyzed once a reply is obtained from a data miner. The experiments performed indicate that the proposed method provides effective results.
The paper [9] reviews existing data mining and analytics applications used in industry. Eight unsupervised and ten supervised learning algorithms were considered in the investigation.
The paper also reports the application status of semi-supervised learning algorithms. In the process industry, unsupervised and supervised machine learning account for about 90–95% of all methodologies in use, while semi-supervised machine learning has been introduced only in recent years. It is thus demonstrated that data mining and analytics play an essential role in the process industry, leading to the development of new machine learning techniques.
The paper [10] proposed a model that applies data mining techniques to overcome the problems of the existing system, such as clustering and classification. The method is used to diagnose the type of diabetes and, from the collected data, a severity level for every patient. The disease has various effects, which is why much research is done in this area. Data collected from 650 patients was used for the investigation, and the effects of the disease were identified. The clustered dataset was then used as input to the classification model, which classifies patients as mild, moderate, or severe according to their risk level of diabetes. The performance of each classification algorithm was measured on the basis of the results obtained.
The authors of paper [11] proposed a model for predicting type 2 diabetes mellitus (T2DM) based on data mining techniques. The main objective is to improve the accuracy of the prediction model and to make the model work well on more than one dataset. Based on a series of pre-processing methods, the proposed model has two components: the improved K-means algorithm and the logistic regression algorithm. The Pima Indians Diabetes Dataset and the Waikato Environment for Knowledge Analysis (WEKA) toolkit were used for comparison.
4 Research Methodology
Datasets used: The performance of the proposed algorithm is tested on the heart disease dataset taken from the UCI repository. The database contains 76 attributes; however, all published experiments use a subset of 14 of them.
4.1 Back Propagation with k-Means Clustering
k-means is a clustering technique that groups data points on the basis of their similarity, so that similar points fall into the same cluster and dissimilar points into different clusters. In k-means clustering, the arithmetic mean of the dataset is calculated and serves as its central point. The Euclidean distance of each point from the central point is then calculated, and the points are assigned to clusters according to this distance. In this work, the Euclidean distance is calculated dynamically using the back propagation algorithm, which clusters the previously unclustered points and increases the accuracy of clustering. Back propagation is an algorithm that learns from previous experience and derives new values. The values are derived from the input dataset by a formula in which x is each point in the dataset, w is the weight applied to that point in producing the actual output, and the bias is the value used to adjust the final output. The output of each iteration is compared with the output of the next iteration, and the iteration at which the error is minimum gives the final value of the Euclidean distance. As the error is reduced, the accuracy of clustering increases and the execution time decreases.
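The clustering step can be sketched as follows. This is a plain k-means loop (Lloyd's algorithm) with Euclidean distance, not the authors' back propagation variant, which the text does not specify in enough detail to reproduce. The function names, the choice of initial centroids, and the sample points are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Arithmetic mean of a non-empty list of points (the cluster centre)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, then recompute centroids, until stable."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, centroids=[points[0], points[3]])
```

The loop stops as soon as an iteration leaves the centroids unchanged, which loosely mirrors the iteration-to-iteration comparison described above.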
4.2 Support Vector Machine (SVM) Classifier
An SVM classifier partitions the data using hyperplanes. The following steps are used to implement SVM in the proposed work:
Input data values: The output of k-means clustering is given as input to the SVM classifier, which classifies the data into a set of classes such as heart disease and non-heart disease.
Division of input data: In the second step, the whole dataset is divided into a training set and a test set. The training set is 60% of the data and the test set is the remaining 40%.
Classify data: The training and test sets are given as input for classification. The function called SVC is applied, which takes the training and test sets as input. The SVM classifier draws a hyperplane that divides the data into heart disease and non-heart disease classes (Fig. 2).
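The split-and-classify steps above might look like the sketch below. The paper names a function SVC but does not give its configuration, so this example swaps in a small linear SVM trained by sub-gradient descent on the regularised hinge loss, using only the standard library. All function names, hyperparameters, and the synthetic two-feature records are assumptions for illustration.

```python
import random

def split_60_40(rows, seed=0):
    """Shuffle rows and return (training set, test set) in a 60/40 ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(0.6 * len(rows))
    return rows[:cut], rows[cut:]

def train_linear_svm(data, epochs=200, lr=0.01, reg=0.01):
    """Sub-gradient descent on the regularised hinge loss; labels must be +1 or -1."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: move the hyperplane.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: apply only the regularisation shrinkage.
                w = [wi - lr * reg * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical records: +1 = heart disease, -1 = no heart disease.
rng = random.Random(42)
rows = [((rng.uniform(2, 3), rng.uniform(2, 3)), 1) for _ in range(10)]
rows += [((rng.uniform(-3, -2), rng.uniform(-3, -2)), -1) for _ in range(10)]

train_set, test_set = split_60_40(rows)
w, b = train_linear_svm(train_set)
accuracy = sum(predict(w, b, x) == y for x, y in test_set) / len(test_set)
```

The synthetic classes are deliberately well separated, so the held-out accuracy says nothing about real heart disease data; the sketch only shows the order of operations.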
Fig. 2 Proposed methodology
5 Experimental Results
The proposed approach is implemented in Python, and the results are analyzed by comparing the proposed and existing approaches in terms of accuracy and execution time.
1. As shown in Fig. 3, the UCI repository data is taken as the input on which prediction is to be performed. The figure shows a scatter plot of age versus heart rate of the patients.
2. As shown in Fig. 4, the same UCI repository data is shown as a scatter plot of age versus electrocardiographic results.
3. As shown in Fig. 5, the SVM classifier is applied for the prediction analysis, and its performance is reported in terms of accuracy, precision, and recall.
4. CAP is a canonical analysis of the principal coordinates for any resemblance matrix, including a permutation test. CAP takes the structure of the data into consideration. The method allows a constrained ordination to be done on the basis of any distance or dissimilarity measure, so it is more likely to separate the different levels when there is no strong difference, and it is good at showing the interaction between factors. Figure 6 shows the CAP analysis. The training dataset is given as input on the x-axis of the curve and the test data on the y-axis. The blue line is the CAP curve, which represents the accuracy of the classifier.
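The accuracy, precision, and recall mentioned in step 3 can be computed from the four cells of a binary confusion matrix such as the one plotted in Fig. 5. The sketch below uses hypothetical labels, not the paper's results, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def scores(tp, tn, fp, fn):
    """Accuracy, precision, and recall from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy, precision, recall = scores(tp, tn, fp, fn)
```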
Fig. 3 Plots relationship between age and blood pressure
Fig. 4 Plots relationship between age and electrocardiographic
Fig. 5 Plot of confusion matrix, without normalization
Fig. 6 CAP analysis
6 Conclusion
Relevant information is fetched from a rough dataset using data mining techniques. Similar and dissimilar data are clustered after calculating the similarity between the input data points: a central point is calculated as the arithmetic mean of the dataset, and the Euclidean distance from this central point is used to measure the similarity between different data points. The clustered data is then classified using the SVM classifier scheme, according to the type of input dataset. In this research work, the back propagation algorithm is applied together with the SVM classifier to increase the accuracy of prediction. The proposed algorithm performs well in terms of accuracy and execution time. In the future, the proposed technique will be further improved to design a hybrid classifier for heart disease prediction.
References
1. Sun, Y., Fang, L., Wang, P.: Improved k-means clustering based on Efros distance for longitudinal data. In: 2016 Chinese Control and Decision Conference (CCDC), vol. 11, issue 3, pp. 12–23 (2016)
2. Dey, M., Rautaray, S.S.: Study and analysis of data mining algorithms for healthcare decision support system. Int. J. Comput. Sci. Inf. Technol. 6(3), 234–239 (2014)
3. Kaur, B., Singh, W.: Review on heart disease prediction system using data mining techniques. Int. J. Recent Innov. Trends Comput. Commun. 2(10), 3003–3008
4. Songthung, P., Sripanidkulchai, K.: Improving type 2 diabetes mellitus risk prediction using classification. In: 2016 13th International Joint Conference on Computer Science and Software Engineering (JCSSE), vol. 11, issue 3, pp. 12–23 (2016)
5. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, vol. 3, pp. 1–31 (2000)
6. Weiss, G.M., Davison, B.D.: Data mining. In: Bidgoli, H. (ed.) Handbook of Technology Management, vol. 3, pp. 121–140. Wiley, Hoboken (2010)
7. Tama, B.A., Firdaus, A., Rodiyatul, F.S.: Detection of type 2 diabetes mellitus with data mining approach using support vector machine, vol. 11, issue 3, pp. 12–23 (2008)
8. Wang, Y.-X., Sun, Q.H., Chien, T.-Y., Huang, P.-C.: Using data mining and machine learning techniques for system design space exploration and automatized optimization. In: Proceedings of the 2017 IEEE International Conference on Applied System Innovation, vol. 15, pp. 1079–1082 (2017)
9. Ge, Z., Song, Z., Ding, S.X., Huang, B.: Data mining and analytics in the process industry: the role of machine learning. IEEE Access 5, 20590–20616 (2017)
10. Chen, M., Hao, Y., Hwang, K., Wang, L., Wang, L.: Disease prediction by machine learning over big data from healthcare communities, vol. 15, issue 4, pp. 215–227. IEEE (2017)
11. Suresh Kumar, P., Umatejaswi, V.: Diagnosing diabetes using data mining techniques. Int. J. Sci. Res. Publ. 7(6) (2017)
Sentiment Analysis for Electoral Prediction Using Twitter Data Wangkheirakpam Reema Devi and Chiranjiv Chingangbam
Abstract Since the appearance of the World Wide Web, the way we share information has changed significantly. People can express themselves on the web through many mediums, such as blogs, wikis, forums, and social networks, where they post and share opinions; this has created the need for research in sentiment analysis. This paper contributes to sentiment analysis for electoral prediction using Twitter, extracting opinions from text with machine learning classifiers such as Naive Bayes, Maximum Entropy, and SVM. The problem addressed is the classification of textual information as bearing positive or negative sentiment. Our research focuses on a novel data pre-processing method that enhances the performance of different supervised algorithms. With our method, the accuracy of Naive Bayes was enhanced to 89.78% (hybrid), Maximum Entropy to 87% (hybrid), SVM to 91% (unigram), and the ensemble of Naive Bayes and SVM to 95.3% (bigram).
Keywords Naive Bayes · Max Entropy · SVM · Ensemble
W. R. Devi, Government Polytechnic, Takyelpat, Manipur, India
C. Chingangbam (B), Manipur Technical University, Takyelpat, Manipur, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_3
1 Introduction
The need for sentiment analysis arises from tasks such as evaluating people's opinions for better customer relations, reviews, etc. Before deciding on and casting a political vote, we seek information in various forums related to the political environment; we read product reviews before buying, and even search for recommended places before actually going out. The World Wide Web now serves as the base platform for finding opinions in multiple spheres. Most user opinions can be found on social media platforms. Manufacturing firms rely heavily on the views expressed by their customers, and because of this, research on Sentiment Analysis has gained important momentum within the fields of Information Retrieval and Natural Language Processing. Corpora used in Natural Language Processing (NLP) tended to be noise-free, as they were collected from syntactically correct sources such as daily newspapers and books. Nowadays, however, most of the textual data available on social media platforms tends to be syntactically wrong, cross-lingual, and highly unstructured. Hence the need for proper data pre-processing along with a good classification technique. Sentiment Analysis, or Opinion Mining, deals with the computational process of extracting and categorizing users' opinions towards a specific topic as positive, negative, or neutral. It is extremely useful for monitoring activity on social media platforms. We chose Twitter because it consists of short text messages of limited length and has a large pool of users, including important and influential personalities, covering various subjects. The paper is organized as follows: Section 2 highlights the Related Works, Section 3 the Proposed Methodology, Section 4 the Evaluations and Results, and Section 5 the Future Roadmap and Conclusions.
2 Related Works
Sentiment Analysis is a classification task for textual information: it classifies whether a sentence or a document expresses positive or negative sentiment [16]. The difference between subjective and objective sentences, used for classifying subjectivity, is explained in [11]. Many real-life applications require a more detailed analysis, because the user often wants to know the opinions that have been expressed [8, 19]. In [12], products were ranked on the basis of their reviews. The study in [7] focused on National Football League betting and its relationship with blog opinions. The focus in [15] was on linking public opinion polls with Twitter sentiment. Election prediction based on Twitter sentiment was discussed in [18], and the prediction of political issues in [20]. Revenues of the entertainment sector were studied using Twitter data in [2]. The flow of sentiment over social networks was investigated in [13]. A gender-based sentiment study of mail was carried out in [14]. In [4], Twitter moods were used to predict the stock market. In [3], a study on identifying investors was done on multiple platforms. An empirical analysis of trading strategies was done in [21]. Scalable network-based emotional influence was studied in [17].
In [10], three supervised learning paradigms, namely SVM, Naive Bayes, and kNN, were reviewed and compared. The study reported that SVM attained an accuracy of 80%, exceeding Naive Bayes and kNN. In [9], different ML paradigms were used for text classification. In [6], various generic strategies and solutions are explained for handling a very large amount of unstructured text, along with selecting an appropriate ML paradigm. In [5], a study on pre-defined data using different classes of supervised algorithms shows that it simplifies the whole process and that, among the paradigms, SVM performs better than Decision Trees.
3 Proposed Methodology
Figure 1 outlines the methodology proposed for our research.
3.1 Data Collection
The Twitter data is downloaded using the Twitter API v1.1.
Fig. 1 Proposed architecture
3.2 Data Pre-processing
The data preprocessing that has been done on Twitter data is as follows:
1. Convert the latest tweet of a particular person.
2. Convert the tweets to lower case.
3. Convert username to AT_USER.
4. Replacing #word with word.
5. Reducing duplicate words to two.
6. Replacing two or more repetitions of the characters with the character itself.
7. Check whether the word starts with an alphabet.
8. Consideration of positive and negative flags.
9. Consideration of the latest tweet of a particular person.
Our pre-processing algorithm overcomes the following limitations:
1. A person might have tweeted in favour of a particular party several months ago and then tweet again for a different party a few days before the election. This raises the question of which tweet should be considered for that person.
2. Generalization problem: Only those features (or words) that are present in the training dataset are used. This results in an inability to differentiate between a word and its composite form, such as “good” and “not good”.
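Several of the text-normalisation steps listed in this section can be sketched with regular expressions. The function below is one possible reading of those steps, not the authors' implementation: in particular, it interprets the repetition rule as collapsing runs of three or more identical characters to two, and the function name and sample tweet are hypothetical. Selecting the latest tweet per person and handling sentiment flags are not shown.

```python
import re

def preprocess_tweet(tweet):
    """Normalise one tweet and return its cleaned tokens."""
    text = tweet.lower()                          # lower-case everything
    text = re.sub(r"@\w+", "AT_USER", text)       # @username -> AT_USER
    text = re.sub(r"#(\w+)", r"\1", text)         # #word -> word
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)    # 3+ repeated characters -> 2
    tokens = []
    for token in text.split():
        token = token.strip("'\"?,.!")            # trim surrounding punctuation
        if re.match(r"^[a-zA-Z]", token):         # keep tokens that start with a letter
            tokens.append(token)
    return tokens

tokens = preprocess_tweet("@VoterOne I am sooooo happy with the #results today!!! 100%")
```

Lower-casing is done first so that the `AT_USER` placeholder keeps its upper-case form.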
3.3 Feature Extraction
N-grams are basic probabilistic models for computing the likelihood of sequences in language processing and speech modeling. An n-gram model estimates a given probability function by relative frequency counting. The example below illustrates a bigram model (N = 2) for the sentence “The cat wants to take a sip of the milk”:
1. The cat
2. cat wants
3. wants to
4. to take
5. take a
6. a sip
7. sip of
8. of the
9. the milk
We have used Unigram, Bigram and hybrid (unigram + bigram) for feature extraction in our work.
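A minimal sketch of the three feature sets, applied to the sentence "The cat wants to take a sip of the milk". The helper names `ngrams` and `hybrid_features` are illustrative and are not taken from the paper.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list, in order, as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def hybrid_features(tokens):
    """Unigrams followed by bigrams, i.e. the hybrid feature set."""
    return ngrams(tokens, 1) + ngrams(tokens, 2)

tokens = "The cat wants to take a sip of the milk".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
hybrid = hybrid_features(tokens)
```

A sentence of m tokens yields m unigrams and m - 1 bigrams, so the hybrid set roughly doubles the number of features per tweet.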
3.4 Classification Technique Used
For classification, we have used Naive Bayes, Maximum Entropy, Support Vector Machine, and an ensemble of Naive Bayes and Support Vector Machine using an if-then rule.
3.4.1 Naive Bayes
The Naive Bayes classifier is based on Bayes' theorem. It has applications in spam detection, document categorization, sentiment classification, etc. It is not a single classifier but a family of classifiers: Gaussian NB, Multinomial NB, and Bernoulli NB, each of which behaves differently. Gaussian NB can be used when the data is normally (Gaussian) distributed, Multinomial NB when the data is multinomially distributed, and Bernoulli NB when the data follows a multivariate Bernoulli distribution. For Sentiment Analysis, Multinomial NB plays a key role, since it works with the frequencies of the words in a document. The classifier assumes that the features are conditionally independent given the class. It uses the maximum a posteriori (MAP) decision rule:

$$c_{\text{map}} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(c) \prod_{1 \le k \le n_i} P(t_k \mid c) \qquad (1)$$

where $t_k$ are the tokens (terms/words) of the document, $n_i$ is the number of tokens in the document, $C$ is the set of classes used in the classification, $P(c \mid d)$ is the conditional probability of class $c$ given document $d$, $P(c)$ is the prior probability of class $c$, and $P(t_k \mid c)$ is the conditional probability of token $t_k$ given class $c$.
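The decision rule in Eq. (1) can be sketched as a small multinomial Naive Bayes classifier. The sketch works in log space to avoid underflow and uses add-one (Laplace) smoothing, which is an assumption added here and is not stated in the paper; the function names and the training tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class token counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    for tokens, label in docs:
        token_counts[label].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, token_counts, vocab

def classify_nb(tokens, priors, token_counts, vocab):
    """Maximum a posteriori class, computed in log space with add-one smoothing."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(token_counts[c].values())
        score = math.log(prior)                    # log P(c)
        for t in tokens:
            if t in vocab:                         # ignore tokens never seen in training
                score += math.log((token_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training tweets, already tokenised.
docs = [
    ("good governance great leader".split(), "positive"),
    ("great development good roads".split(), "positive"),
    ("bad roads poor governance".split(), "negative"),
    ("poor leader bad policy".split(), "negative"),
]
model = train_nb(docs)
label = classify_nb("good great leader".split(), *model)
```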
3.4.2 Support Vector Machine
Support Vector Machines were originally conceived for binary classification; multi-class SVMs were introduced later for classification into multiple classes. SVMs are used for both regression and classification tasks. The main objective of an SVM is to find a hyperplane in an $\mathbb{R}^n$ space that separates the classes. For a binary problem, many separating hyperplanes may exist; the one chosen maximizes the margin, i.e., the distance between the hyperplane and the nearest data points of the two classes. These nearest data points are referred to as the support vectors. According to the form of the error function, SVM models can be classified into four distinct groups:
1. Classification SVM Type 1 (also known as C-SVM classification)
2. Classification SVM Type 2 (also known as nu-SVM classification)
3. Regression SVM Type 1 (also known as epsilon-SVM regression)
4. Regression SVM Type 2 (also known as nu-SVM regression).
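For reference, the error function usually associated with C-SVM classification (the first of the four types) is the soft-margin objective below. It is the textbook form rather than a formula quoted from this paper; the symbols $w$, $b$, $\xi_i$, $C$, $N$, and the labels $y_i \in \{-1, +1\}$ are introduced here for illustration.

```latex
\min_{w,\, b,\, \xi} \; \frac{1}{2}\, w^{\top} w + C \sum_{i=1}^{N} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N
```

Minimizing $w^{\top} w$ widens the margin, while the constant $C$ controls how strongly margin violations $\xi_i$ are penalized.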
3.4.3 Maximum Entropy
The significance of Maximum Entropy is that, unlike Naive Bayes, it does not assume conditional independence of the features. The algorithm works on the principle of maximum entropy and belongs to the class of exponential models. Its training cost and time are higher than those of Naive Bayes, mainly because of the parameter estimation. The algorithm is mainly used when nothing is known about the prior distributions and conditional independence cannot be assumed.
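As an illustration of what "exponential model class" means, a Maximum Entropy classifier is commonly written in the form below. This is the standard textbook parameterization, not an equation from the paper; the feature functions $f_i$ and weights $\lambda_i$ are symbols assumed here.

```latex
P(c \mid d) = \frac{\exp\!\left( \sum_{i} \lambda_i f_i(c, d) \right)}
                   {\sum_{c' \in C} \exp\!\left( \sum_{i} \lambda_i f_i(c', d) \right)}
```

The weights $\lambda_i$ have no closed-form solution and must be estimated iteratively, which is the source of the higher training cost mentioned above.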
4 Experimental Results
In our research, we used Karnataka Election 2016 data from Twitter. The evaluation was performed on a system with the following configuration: Intel i5-8600K CPU @ 3.60 GHz (6 cores) with 16 GB of RAM. We used the SciKit library to implement the proposed model. First, the data is pre-processed using the algorithm described in Sect. 3.2. We then extracted features using three n-gram models, viz. Unigram, Bigram, and Hybrid (Unigram + Bigram). After feature extraction, we performed classification using the supervised learning algorithms Naive Bayes, Maximum Entropy, and Support Vector Machine. We have also compared our work with that of Anjaria et al. [1], as shown in Tables 1 and 2. Further illustrations are provided in Figs. 2 and 3. For evaluation, we performed cross-validation by splitting the data into 10 folds: in each round, training was performed on 9 folds and performance was computed on the remaining test fold.
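The 10-fold procedure can be sketched as an index generator plus an averaging loop. This is a generic illustration rather than the authors' code: the function names are hypothetical, the folds are contiguous (so the data is assumed to have been shuffled beforehand), and `evaluate` stands for any user-supplied function that trains on the training indices and returns a score on the test indices.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # The first n_samples % k folds get one extra sample so that all samples are used.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by evaluate(train, test) over all k folds."""
    fold_scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(fold_scores) / len(fold_scores)

folds = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every tweet is used for testing once and for training nine times.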
Table 1 Results for machine learning algorithms according to Anjaria et al. [1]
Classifier         Accuracy for unigram features (%)   Accuracy for bigram features (%)   Accuracy for hybrid features (%)
Naive Bayes        71                                  68                                 84
Maximum Entropy    72                                  73                                 83
SVM                78                                  75                                 88
Table 2 Results for our machine learning algorithms
Classifier                                                  Accuracy for unigram features (%)   Accuracy for bigram features (%)   Accuracy for hybrid features (%)
Naive Bayes                                                 87.26                               88.52                              89.78
Maximum Entropy                                             75                                  81                                 87
SVM                                                         91                                  89                                 90
Ensembling of Naive Bayes and SVM using if then rule        92.3                                95.3                               93.3
Fig. 2 a Results according to Anjaria et al. [1] and b Results according to our algorithms for three different classes of features
Fig. 3 Results according to the ensembling of Naive Bayes and SVM using if then rule algorithm
5 Conclusion and Future Works
Karnataka Election 2016 was considered in order to perform electoral prediction over Twitter data. We conclude that Twitter data displays a predictive quality, which is improved by incorporating influencing factors based on user behavior analysis, such as considering the latest tweet by a particular person. Our experimental work shows that an ensemble of classifiers outperforms a single classifier. Despite the challenges, the field has made significant progress over the past few years. More refined and in-depth investigations are still required, as are integrated systems that deal with all the problems arising in applications.
References
1. Anjaria, M., Guddeti, R.M.R.: A novel sentiment analysis of social networks using supervised learning. Soc. Netw. Anal. Mining 4(1), 181 (2014)
2. Asur, S., Huberman, B.A.: Predicting the future with social media. In: 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, vol. 1. IEEE (2010)
3. Bar-Haim, R., et al.: Identifying and following expert investors in stock microblogs. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2011)
4. Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. J. Comput. Sci. 2(1), 1–8 (2011)
5. Chavan, G.S., et al.: A survey of various machine learning techniques for text classification. Int. J. Eng. Trends Technol. (IJETT) 15(6), 288–292 (2014)
6. Dalal, M.K., Zaveri, M.A.: Automatic text classification: a technical review. Int. J. Comput. Appl. 28(2), 37–40 (2011)
7. Hong, Y., Skiena, S.: The wisdom of bookies? Sentiment analysis versus the NFL point spread. In: Fourth International AAAI Conference on Weblogs and Social Media (2010)
8. Hu, M., Liu, B.: Mining and Summarizing Customer Reviews (2004)
9. Ikonomakis, M., Kotsiantis, S., Tampakas, V.: Text classification using machine learning techniques. WSEAS Trans. Comput. 4(8), 966–974 (2005)
10. Kalaivani, P., Shunmuganathan, K.L.: Sentiment classification of movie reviews by supervised machine learning approaches. Indian J. Comput. Sci. Eng. 4(4), 285–292 (2013)
11. Liu, B.: Sentiment analysis and subjectivity. Handb. Natural Language Process. 2, 627–666 (2010)
12. McGlohon, M., Glance, N., Reiter, Z.: Star quality: aggregating reviews to rank products and merchants. In: Fourth International AAAI Conference on Weblogs and Social Media (2010)
13. Miller, M., et al.: Sentiment flow through hyperlink networks. In: Fifth International AAAI Conference on Weblogs and Social Media (2011)
14. Mohammad, S.M., Yang, T.W.: Tracking sentiment in mail: how genders differ on emotional axes. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis. Association for Computational Linguistics (2011)
15. O’Connor, B., et al.: From tweets to polls: linking text sentiment to public opinion time series. In: Fourth International AAAI Conference on Weblogs and Social Media (2010)
16. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations Trends Information Retrieval 2(1–2), 1–135 (2008)
17. Sakunkoo, P., Sakunkoo, N.: Analysis of social influence in online book reviews. In: Third International AAAI Conference on Weblogs and Social Media (2009)
18. Tumasjan, A., et al.: Predicting elections with twitter: what 140 characters reveal about political sentiment. In: Fourth International AAAI Conference on Weblogs and Social Media (2010)
19. Wiebe, J., et al.: Learning subjective language. Comput. Linguistics 30(3), 277–308 (2004)
20. Yano, T., Smith, N.A.: What’s worthy of comment? Content and comment volume in political blogs. In: Fourth International AAAI Conference on Weblogs and Social Media (2010)
21. Zhang, W., Skiena, S.: Trading strategies to exploit blog and news sentiment. In: Fourth International AAAI Conference on Weblogs and Social Media (2010)
A Study on Bengali Stemming and Parts-of-Speech Tagging Atish Kumar Dipongkor, Md. Asif Nashiry, Kazi Akib Abdullah, and Rifat Shermin Ritu
Abstract Stemming and parts-of-speech (POS) tagging are often considered initial steps for natural language processing (NLP) applications, and the performance of these applications depends greatly on both. This paper presents the research that has been done on Bengali stemming and POS tagging. In this work, we also categorize and compare Bengali stemming and POS tagging techniques. The stemming techniques are categorized as rule-based and statistical. The existing works suggest that rule-based techniques work better for Bengali stemming. Similarly, POS tagging techniques are categorized as linguistic, statistical, and machine learning-based taggers. Machine learning-based taggers perform better than other approaches to POS tagging for Bengali.
Keywords Bengali NLP · Stemming and parts-of-speech tagging
A. K. Dipongkor (B) · Md. A. Nashiry · K. A. Abdullah · R. S. Ritu
Department of Computer Science and Engineering, Jashore University of Science and Technology, 7408 Jashore, Bangladesh
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_4

1 Introduction
Text preprocessing (TP) is a task that transforms text into an analyzable and digestible form. It is often the first step in natural language processing (NLP) applications such as machine translation, sentiment analysis, text summarization, and document classification [1]. Among all different techniques of TP, stemming and parts-of-speech (POS) tagging are of paramount importance since the two techniques have a
direct impact on the performance of NLP applications [2, 3]. Stemming deals with the morphological changes of a word. Its purpose is to obtain the stem or root form of an inflected word so that NLP applications can identify all variants of that word. POS tagging assigns the appropriate part of speech to each word of a given sentence. It is often the first step of NLP and is followed by further processing, e.g., chunking and parsing [4, 5]. Although stemming and POS tagging play a vital role in most NLP applications, these two techniques have not received enough attention for the Bengali language. To understand the state-of-the-art approaches and their scope for improvement, a literature survey of Bengali stemming and POS tagging is required.

Although stemming and POS tagging serve different purposes in TP, they should be studied together for Bengali. Bengali is a highly inflectional language compared to other languages, and its words undergo a very large number of derivational changes [6]. In other languages such as English and Dutch, stemming is performed based on rules, and these rules do not depend on the part of speech of a word [4, 7]. However, stemming rules and root words in Bengali largely depend on the part of speech [8]. For instance, the word ‘…’ in Bengali can be used as both a verb and a noun. The root of this word differs based on its part of speech, i.e., ‘…’ for the verb and ‘…’ for the noun. The existing rule-based techniques of Bengali stemming do not work on a raw text corpus; they require POS-tagged words in order to stem a given text corpus [8]. Conversely, some notable Bengali POS tagging techniques cannot operate without stemming [9, 10]. For this reason, stemming and POS tagging techniques need to be studied together for text preprocessing in Bengali.

Gupta and Lehal [11] and Patil et al.
[12] discussed some common stemming techniques for Indian languages. Bijal and Sanket [13] and Jivani et al. [14] presented stemming algorithms for Indian and non-Indian languages. Swain and Nayak [15] discussed rule-based and hybrid stemming techniques. Singh and Gupta [10] and Moral et al. [16] presented surveys on stemming approaches, applications, and challenges. As far as POS tagging is concerned, we have found a few survey works on Indian languages [17–19], as well as comparative studies of existing POS tagging techniques such as n-gram, HMM, and Brill’s tagger [20, 21]. We have not encountered any literature that focuses exclusively on existing POS tagging techniques for the Bengali language.

In this work, we provide a literature review of stemming and POS tagging approaches specifically for the Bengali language, since no such review exists. We also categorize both stemming and POS tagging techniques based on their methodologies. The Bengali stemming techniques are grouped into two categories: rule-based and statistical. The Bengali POS tagging techniques are grouped into three categories: linguistic, statistical, and machine learning-based taggers. We also compare all the Bengali stemming and POS tagging techniques surveyed. We believe this work will pave the way for further contributions by researchers in Bengali stemming and POS tagging.
The remaining sections of this paper are organized as follows. Sections 2 and 3 present our survey of Bengali stemming and POS tagging, respectively. Section 4 presents our comparison of, and observations on, Bengali stemming and POS tagging for further improvements. Finally, Sect. 5 concludes this work.
2 Stemming Techniques
This section discusses the existing stemming techniques for Bengali. Current stemming techniques can be grouped into two categories: rule-based and statistical.
2.1 Rule-Based
Rule-based stemmers transform a word into its root or base form using predefined language-specific rules. Linguistic skills are required in order to create such rules. Stemmers that fall into this category are referred to as language-dependent stemmers.

Islam et al. [22] proposed an affix removal approach for stemming, which was initially developed for a spell checker. In this work, the authors proposed several rules which strip suffixes from words using a predefined suffix list. They identified 72 suffixes for verbs, 22 for nouns, and 8 for adjectives in the Bengali language. If multiple suffixes are found in a word, this approach eliminates the longest one. For example, ‘…’ contains two suffixes, ‘…’ and ‘…’. According to this approach, ‘…’ will be eliminated as it is longer than ‘…’. This technique works only for its predefined suffix list. It does not work for derivational suffixes.

Sarkar and Bandyopadhyay [8] presented a rule-based approach to find stems in Bengali text. Their main contribution is a new concept named the orthographic syllable for Bengali words. First, their system takes POS-tagged words as input. Next, the words are transformed into orthographic syllables of the form (C, V, D), where C is a string of consonant characters, V is a vowel character, and D is a diacritic mark. Then, a 5-tuple rule is used to eliminate inflections from the words by analyzing the orthographic syllables. The system also relies on a dictionary of root words. This approach was applied to three classic short stories by Rabindranath Tagore and achieved 99.2% accuracy.

Mahmud et al. [6] developed a rule-based suffix stripping technique for eliminating the inflections of verbs and nouns in Bengali. Initially, they identified the commonly used suffixes in the Bengali language and categorized those suffixes. Then, they found some patterns by analyzing the identified suffixes.
For instance, nouns contain independent inflections only whereas verbs contain both independent
and combined inflections. Finally, they proposed an algorithm for eliminating the inflections based on their identified patterns. They applied their technique on 3000 verbs and 1500 nouns. The experimental result showed that their approach achieved 83% and 88% accuracy for verbs and nouns, respectively.
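The longest-match suffix stripping described above for Islam et al. [22] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the short suffix list (a few common Bengali noun inflections), the function name `strip_longest_suffix`, and the `min_stem_len` guard are all hypothetical choices made for this example, whereas the original work relies on its own lists of 72 verb, 22 noun, and 8 adjective suffixes.

```python
# Illustrative suffix list chosen for this sketch; NOT the list from [22].
NOUN_SUFFIXES = ["দের", "ের", "কে", "টি", "রা"]


def strip_longest_suffix(word, suffixes, min_stem_len=2):
    """Remove the longest suffix from `suffixes` that ends `word`.

    `min_stem_len` (an assumption of this sketch) blocks stripping when
    the remaining stem would be too short to be a plausible root.
    """
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no listed suffix matched: return the word unchanged


if __name__ == "__main__":
    # "ছেলেদের" ends with both "ের" and "দের"; the longer one is stripped.
    print(strip_longest_suffix("ছেলেদের", NOUN_SUFFIXES))
    # A word without any listed suffix is left as it is.
    print(strip_longest_suffix("বই", NOUN_SUFFIXES))
```

As the survey notes, such a stemmer handles only the suffixes in its list, so derivational suffixes and irregular forms would need additional rules or a root dictionary.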
2.2 Statistical
Statistical stemmers do not require language-specific rules for stemming. These stemmers perform stemming using probabilistic models which are trained by unsupervised or semi-supervised machine learning techniques. The main advantage of these stemmers is that they can deal with complicated morphology and sparse data during stemming [23].

Majumder et al. [24] proposed a clustering technique to find equivalence classes of root words and their morphological variants. This approach works for several languages including Bengali. Initially, the authors defined a set of string distance measures. Then, they clustered a given text collection using the distance measures to identify equivalence classes. They applied their proposed technique to 50,000 Bengali news articles. The proposed algorithm provided consistent improvements for Bengali stemming; however, the improvement is trivial compared to rule-based stemmers.

Singh and Gupta [25] proposed an unsupervised corpus-based stemming technique for English, Bengali, Marathi, and Hungarian. This approach clusters morphologically related words based on lexical knowledge. It consists of two phases. In the first phase, the authors collected distinct words and sorted them into different categories. Then, they used the Jaro-Winkler distance to measure the distance between morphologically related words, where a high distance value indicates low similarity between a word pair. In the second phase, words with common prefixes of equal length are grouped by the average linkage method based on the distance information. Finally, a class-conscious agglomeration algorithm is employed to cluster similar morphological variants. For Bengali, this technique did not perform as well as it did for the other languages.

In [26], Urmi et al. proposed a contextual and spelling similarity-based N-gram language model for finding root forms of Bengali words.
This approach rests on a similarity assumption, i.e., a root word and its modified forms are very similar in terms of spelling and context. First, it calculates the contextual similarity of two distinct words by comparing the preceding words and the frequencies of the two words in a sentence. Then, it compares the contextual similarity score with a threshold. If the score is above the threshold, the words are considered to be derived from the same root. This technique was applied to 5000 Bengali words and achieved 40.18% accuracy.
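To make the distance used by Singh and Gupta [25] concrete, the following is a minimal sketch of the Jaro-Winkler distance, taken here as one minus the Jaro-Winkler similarity. It is not the authors' code: the function names, the prefix scale of 0.1, the prefix cap of 4 characters (the commonly used defaults), and the romanized example words are assumptions made for this illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters match only if they lie within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler_distance(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Distance in [0, 1]; a high value indicates low similarity."""
    sim = jaro(s1, s2)
    # Winkler's adjustment rewards a shared prefix of up to max_prefix chars.
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return 1.0 - (sim + prefix * prefix_scale * (1.0 - sim))


if __name__ == "__main__":
    # Romanized, purely illustrative word forms (not taken from [25]).
    print(jaro_winkler_distance("korchi", "korche"))  # shared prefix: small
    print(jaro_winkler_distance("korchi", "boi"))     # unrelated: larger
```

Because the prefix bonus favors words that start alike, the measure suits suffixing languages, where variants of one root usually share their beginning.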
3 POS Tagging Techniques
Parts-of-speech (POS) tagging is the process of assigning an appropriate part of speech or lexical category to each word of a natural language sentence. Identifying the parts of a sentence is an important operation for several natural language processing (NLP) problems. It is often the first step in the processing of natural languages and is followed by further processing such as chunking and parsing. Several techniques have been proposed in order to identify different parts of a sentence [27].
3.1 Linguistic Taggers
Linguistic taggers work with knowledge written by linguists and presented as a set of rules or constraints.

Roy et al. [28] developed a rule-based unsupervised system based on Bengali grammar and suffixes. They considered eight tag categories, similar to those of the English language. They used a separate dataset, a verb root dataset, to identify only the verbs. The training dataset consists of 14,500 Bengali words, and the testing dataset consists of 12,000 words. This work presented nine rules for identifying POS. For example, the authors matched the suffix of a word against all the suffixes from a list of verb suffixes. If a match is found, the suffix part is removed. After removing the verb suffix (‘…’ or ‘…’), if the remaining word is found in the verb root dataset, then the word is considered a verb. They also created two different groups for noun suffixes and adjective suffixes. A word is identified as an adjective or a noun based on two different suffixes such as ‘…’ and ‘…’. They achieved an accuracy of 94.2%, but they did not consider pronouns, conjunctions, and interjections. Moreover, the approach requires maintaining several predefined datasets, one for each category of parts of speech. This requires a lot of storage and access time to compare a given word against the predefined datasets.

In [9], Hoque and Seddiqui proposed an automated POS tagging system for the Bengali language based on word suffixes. The authors used their own stemming technique to retrieve root words and applied rules according to the different forms of suffixes. They incorporated a Bangla vocabulary that contains more than 45,000 words with their default tags and a pattern-based verb dataset. The dictionary and the pattern-based verb dataset improved the performance, which reached 93.7%. In this work, the authors focused on eight fundamental parts-of-speech tags.
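The suffix-and-dictionary rule described above for Roy et al. [28] can be sketched as follows. This is a simplified illustration, not the authors' system: the suffix lists, the tiny verb root set, the tag names, and the function `tag_by_suffix` are hypothetical, and only two rules are shown instead of the nine rules of the original work.

```python
# Illustrative resources invented for this sketch; NOT the datasets of [28].
VERB_SUFFIXES = ["ছি", "ছে", "লাম", "বে"]
VERB_ROOTS = {"কর", "খেল", "বল"}
NOUN_SUFFIXES = ["টি", "রা", "কে"]


def tag_by_suffix(word):
    """Return "VERB", "NOUN", or "UNKNOWN" for a single word."""
    # Rule 1: strip a verb suffix, then look the remainder up in the
    # verb root dataset; only a dictionary hit makes the word a verb.
    for suffix in sorted(VERB_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and word[: -len(suffix)] in VERB_ROOTS:
            return "VERB"
    # Rule 2 (simplified): a listed noun suffix marks the word as a noun.
    for suffix in NOUN_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return "NOUN"
    return "UNKNOWN"


if __name__ == "__main__":
    for w in ["করছি", "খেলছে", "বইটি", "বই"]:
        print(w, tag_by_suffix(w))
```

The sketch also shows the cost noted in the survey: every additional part of speech needs its own suffix list or root dataset, and words outside those resources fall through to "UNKNOWN".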
3.2 Statistical Taggers
Statistical taggers are very common POS taggers. This approach works by building a statistical model of the language to disambiguate a word sequence by assigning the most likely tag sequence. Such models are typically generated from previously annotated data, encoding the frequency of co-occurrence of various linguistic phenomena as simple n-gram probabilities. Among stochastic models, bi-gram and tri-gram hidden Markov models (HMM) are the most widely used. A large amount of annotated text is needed to create a stochastic tagger. Stochastic taggers with more than 95% word-level accuracy have been developed for English, German, and other European languages [29]. However, for Indo-Aryan languages such as Bengali and Marathi, a large labeled dataset is not available. The unavailability of such a dataset is one of the main challenges in developing a statistical tagger for the Bengali language.

Mukherjee and Das Mandal [30] presented an automated POS tagger for Bengali sentences using a Global Linear Model (GLM), which represents the entire sentence through a feature vector called the global feature. GLM tags the words of a sentence from left to right, decomposing the tagging into a sequence of decisions. They used UnDivide++, a language-independent morphological analysis software, for this purpose. They used a training dataset comprising 44,000 words and two test sets of 14,784 and 10,273 words, and achieved 93.12% accuracy. The authors compared the GLM POS tagger with other Bengali POS taggers and showed that GLM performs better.

Kumar [31] presented an approach that uses a 5-gram model. This work observes the tags of the previous two and next two words to predict the POS tag for the current word. The data were obtained from the SHRUTI database of IIT Kharagpur for Bangla, which includes a tagged wordlist of 98,521 words that was used as training and testing data.
They defined six tag categories, namely noun (nouns, proper nouns, compound nouns, etc.), preposition (preposition or postposition), content words (adjectives, adverbs), function words (conjunctions, pronouns, auxiliary verbs, and particles), verbs (verbs), and others (symbols, quantifiers, negations, question words, etc.). In this paper, they worked with three different languages: Marathi, Telugu, and Bengali. They achieved 88.29% accuracy for Bengali.

Ekbal and Bandyopadhyay introduced a Bengali POS tagger based on a maximum entropy model [32]. In this work, the authors trained the tagger on a corpus of 72,341 words. The POS tagger uses a tagset of 26 different POS tags defined for the Indian languages. A part of this corpus was selected as the development set to find the best set of features for Bengali POS tagging. The POS tagger achieved 88.2% accuracy on a dataset consisting of 20,000 words.
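Since bi-gram HMMs are named above as one of the most widely used stochastic models, a minimal sketch of how such a tagger picks the most likely tag sequence (Viterbi decoding) may be helpful. It does not reproduce any of the surveyed systems: the two-tag tagset, the tokens `w1` and `w2`, every probability, and the `floor` value standing in for proper smoothing are invented for this illustration.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely tag sequence under a bi-gram HMM (log-space Viterbi).

    `floor` is a crude stand-in for smoothing of unseen events.
    """
    if not words:
        return []

    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t]: best log-probability of a tag path ending in t at word i.
    best = [{t: lp(start_p, t) + lp(emit_p, (t, words[0])) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p, (p, t))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p, (t, words[i]))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model: "w2" is ambiguous and can be emitted by both tags.
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
           ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
EMIT_P = {("NOUN", "w1"): 0.6, ("NOUN", "w2"): 0.4,
          ("VERB", "w1"): 0.1, ("VERB", "w2"): 0.9}

if __name__ == "__main__":
    print(viterbi(["w1", "w2"], TAGS, START_P, TRANS_P, EMIT_P))
    print(viterbi(["w2", "w2"], TAGS, START_P, TRANS_P, EMIT_P))
```

In the second call the same ambiguous token receives different tags depending on its position, which is the kind of context-based disambiguation that requires annotated training data to estimate the transition and emission tables.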
3.3 Machine Learning (ML)-Based Tagger
A machine learning-based tagger learns to infer the appropriate POS from data by applying learning algorithms to observed data, and then makes predictions about POS based on what it has learned. Machine learning-based taggers are similar to statistical taggers; however, more advanced and automated tools are used to develop them.

In [33], Kabir et al. presented a POS tagger using deep learning. Deep learning-based taggers use morphological and contextual details of the words under consideration. The authors used the morphological and contextual properties of a word together with a prebuilt dictionary of the probable POS of words. In their experiment, they used a deep belief network (DBN), which is a generative graphical model, to train the model. They used restricted Boltzmann machines (RBMs) as unsupervised networks with a linear activation function. Four features were selected based on different possible combinations of the available word and tag context. They also provided an equation which is used as the activation function for the output. For each layer, the learning rate was fixed at 0.3, and 25 epochs were used. The authors used 10-fold cross-validation to evaluate their model. The performance of the model was evaluated based on precision (P), recall (R), and F1-score (FS). They experimented on the Linguistic Data Consortium (LDC) corpus with catalog number LDC2010T16 and ISBN 1-58563-561-8 and achieved 93.33% accuracy for Bengali POS tagging using the deep learning-based model.
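The precision, recall, and F1-score mentioned above can be computed per tag from a gold and a predicted tag sequence. The sketch below uses the standard definitions; the function name `per_tag_scores` and the tiny example sequences are assumptions of this illustration and are not taken from Kabir et al. [33].

```python
def per_tag_scores(gold, predicted):
    """Precision, recall, and F1 for every tag in two parallel tag lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    scores = {}
    for tag in sorted(set(gold) | set(predicted)):
        tp = sum(1 for g, p in pairs if g == tag and p == tag)
        fp = sum(1 for g, p in pairs if g != tag and p == tag)
        fn = sum(1 for g, p in pairs if g == tag and p != tag)
        # Guard against empty denominators for tags never predicted or seen.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[tag] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    gold = ["NOUN", "VERB", "NOUN", "ADJ"]
    pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
    for tag, (p, r, f) in per_tag_scores(gold, pred).items():
        print(f"{tag}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Per-tag scores expose weaknesses that a single accuracy figure can hide, for example a tagger that rarely predicts a minority tag.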
4 Comparison
In this section, we present a comparison of Bengali stemming and POS tagging techniques. Summaries of the Bengali stemming and POS tagging techniques are shown in Tables 1 and 2, respectively.

Table 1 Bengali stemming techniques
Author                         Year  Technique              Category     Accuracy
Islam et al. [22]              2007  Suffix stripping       Rule-based   Not measured
Majumder et al. [24]           2007  Clustering             Statistical  Not measured
Sarkar and Bandyopadhyay [8]   2008  Orthographic syllable  Rule-based   99.2%
Mahmud et al. [6]              2014  Suffix stripping       Rule-based   83% (verbs), 88% (nouns)
Urmi et al. [26]               2016  N-gram                 Statistical  40.18%
Singh and Gupta [25]           2017  Clustering             Statistical  Not measured

Since all the Bengali stemming techniques used their own datasets, we are unable to perform a technique-wise comparison. Moreover, three techniques (Table 1) did not measure or mention their accuracy. Thus, we can only compare these techniques category-wise, which shows that rule-based stemming performs better than statistical techniques for Bengali. However, we have the following major observations about these rule-based stemming techniques:
1. These techniques are semi-rule-based. Exceptions are handled using a dictionary.
2. These techniques are POS-aware. They do not work for sentences.
3. None of the techniques has been implemented as a tool for further use.
4. No author has made his/her dataset publicly available.

Table 2 Bengali POS tagging techniques
Author                         Year  Technique                         Category     Accuracy (%)
Ekbal and Bandyopadhyay [32]   2008  Statistical maximum entropy (ME)  Statistical  88.2
Mukherjee and Das Mandal [30]  2013  Global linear model               Statistical  93.12
Hoque and Seddiqui [9]         2016  Rule-based suffix removal         Linguistic   93.7
Kabir et al. [33]              2016  Deep learning                     ML           93.33
Kumar [31]                     2018  5-gram model                      Statistical  88.29
Roy et al. [28]                2019  Rule-based suffix removal         Linguistic   94.2
The performance of the POS taggers available for the Bengali language is not as good as that of taggers for English or Chinese [10]. In this work, we present six papers on Bengali POS tagging. A comparison of these papers is presented in Table 2. Although these papers claimed more than 80% accuracy, we have the following observations on them:
1. We have not found any approach that can handle ambiguous words efficiently. Ambiguous words have different tags based on their role in the sentence.
2. There is no POS tagging technique based on a supervised system. The resources available for training machine learning models are limited.
3. The accuracy of linguistic taggers is not trustworthy. Unlike statistical and ML taggers, linguistic taggers were not tested with unknown words.
5 Conclusion
This paper has focused on stemming and POS tagging for Bengali language processing. Although much effort has been devoted to the development of stemmers and POS taggers for other languages, e.g., English, these two preprocessing techniques have not received much attention for the Bengali language. This paper has identified the need to study and analyze stemming and POS tagging together, since Bengali is a highly inflectional language and its words undergo a very large number of derivational changes. We have studied the existing approaches to these two preprocessing techniques and compared their performance based on accuracy. We have studied the
existing stemming approaches in Bengali, and our observation suggests that rule-based stemmers perform better than the other categories of stemmers. As far as POS taggers are concerned, machine learning-based tagging approaches show a high level of accuracy; nevertheless, machine learning-based POS taggers are difficult to implement. Thus, improving rule-based POS taggers for Bengali will be an important area of future research. Our investigation also suggests that the accuracy of a stemmer largely depends on the accuracy of POS tagging. Thus, an effective POS tagger is a prerequisite for an accurate stemmer. Developing an effective POS tagger requires a large dataset. Since no such Bengali dataset is available at present, developing an unbiased dataset of words with their POS tags can be an area of future work.
References
1. Camacho-Collados, J., Pilehvar, M.T.: On the role of text preprocessing in neural network architectures: an evaluation study on text categorization and sentiment analysis, pp. 40–46 (2019). https://doi.org/10.18653/v1/w18-5406
2. Etaiwi, W., Naymat, G.: The impact of applying different preprocessing steps on review spam detection. Procedia Comput. Sci. 113, 273–279 (2017). https://doi.org/10.1016/j.procs.2017.08.368
3. Soeprapto Putri, N.K., Steven Puika, K., Ibrahim, S., Darmawan, L.: Defect classification using decision tree. In: Proceedings of 2018 International Conference on Information Management and Technology, ICIMTech 2018, pp. 281–285 (2018). https://doi.org/10.1109/ICIMTech.2018.8528095
4. Kraaij, W., Pohlmann, R.: Porter’s stemming algorithm for Dutch. Technical Report. https://www.cs.ru.nl/kraaijw/pubs/Biblio/papers/kraaij94porters.pdf
5. Willett, P.: The porter stemming algorithm: then and now. Program (2006)
6. Mahmud, M.R., Afrin, M., Razzaque, M.A., Miller, E., Iwashige, J.: A rule based bengali stemmer. In: Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2014, pp. 2750–2756, Sept 2014. https://doi.org/10.1109/ICACCI.2014.6968484
7. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
8. Sarkar, S., Bandyopadhyay, S.: Design of a rule-based stemmer for natural language text in Bengali. In: Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pp. 65–72, Jan 2008. https://www.aclweb.org/anthology/I/I08/I08-0112
9. Hoque, M.N., Seddiqui, M.H.: Bangla parts-of-speech tagging using Bangla stemmer and rule based analyzer. In: 2015 18th International Conference on Computer and Information Technology, ICCIT 2015, pp. 440–444 (2016). https://doi.org/10.1109/ICCITechn.2015.7488111
10. Singh, J., Gupta, V.: Text stemming. ACM Comput. Surv. 49(3), 1–46 (2016). https://doi.org/10.1145/2975608
11. Gupta, V., Lehal, G.S.: A survey of common stemming techniques and existing stemmers for Indian languages. J. Emerg. Technol. Web Intell. 5(2), 157–161 (2013). https://doi.org/10.4304/jetwi.5.2.157-161
12. Patil, H.B., Pawar, B.V., Patil, A.S.: A comprehensive analysis of stemmers available for Indic languages. Int. J. Nat. Lang. Comput. 5(1), 45–55 (2016). https://doi.org/10.5121/ijnlc.2016.5104
13. Bijal, D., Sanket, S.: Overview of stemming algorithms for Indian and non-Indian languages 5(2), 1144–1146 (2014)
14. Jivani, A.G., et al.: A comparative study of stemming algorithms. Int. J. Comput. Technol. Appl. 2(6), 1930–1938 (2011)
15. Swain, K., Nayak, A.K.: A review on rule-based and hybrid stemming techniques. In: Proceedings - 2nd International Conference on Data Science and Business Analytics, ICDSBA 2018, pp. 25–29 (2018). https://doi.org/10.1109/ICDSBA.2018.00012
16. Moral, C., de Antonio, A., Imbert, R., Ramírez, J.: A survey of stemming algorithms in information retrieval. Inf. Res. Int. Electron. J. 22 (2014). https://eric.ed.gov/?id=EJ1020841
17. Aggarwal, N., Kaur Randhawa, A.: A survey on parts of speech tagging for Indian languages. Int. J. Comput. Appl. 975, 8887 (2015)
18. Rathod, S., Govilkar, S.: Survey of various POS tagging techniques for Indian regional languages. Int. J. Comput. Sci. Inf. Technol. 6(3), 2525–2529 (2015). https://ijcsit.com/docs/Volume6/vol6issue03/ijcsit20150603118.pdf
19. Singh, J., Garcha, L.S., Singh, S.: A survey on parts of speech tagging for Indian languages. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 7(4) (2017)
20. Kanakaraddi, S.G., Nandyal, S.S.: Survey on parts of speech tagger techniques. In: Proceedings of the 2018 International Conference on Current Trends towards Converging Technologies, ICCTCT 2018 (2018). https://doi.org/10.1109/ICCTCT.2018.8550884
21. Kumawat, D., Jain, V.: POS tagging approaches: a comparison. Int. J. Comput. Appl. (2015). https://doi.org/10.5120/20752-3148
22. Islam, M., Uddin, M., Khan, M.: A light weight stemmer for Bengali and its use in spelling checker. In: Proceedings of International Conference on Digital Communication and Computer Applications (DCCA), pp. 19–23 (2007)
23. Singh, J., Gupta, V.: A systematic review of text stemming techniques 48 (2017). https://doi.org/10.1007/s10462-016-9498-2
24. Majumder, P., Mitra, M., Parui, S.K., Kole, G., Mitra, P., Datta, K.: YASS: yet another suffix stripper. ACM Trans. Inf. Syst. 25(4) (2007). https://doi.org/10.1145/1281485.1281489
25. Singh, J., Gupta, V.: An efficient corpus-based stemmer. Cogn. Comput. 9(5), 671–688 (2017)
26. Urmi, T.T., Jammy, J.J., Ismail, S.: A corpus based unsupervised Bangla word stemming using N-gram language model. In: 2016 5th International Conference on Informatics, Electronics and Vision, ICIEV 2016, pp. 824–828 (2016). https://doi.org/10.1109/ICIEV.2016.7760117
27. Dandapat, S.: Part-of-Speech Tagging for Bengali. Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur, Jan 2009
28. Roy, M.K., Paull, P.K., Noori, S.R.H., Mahmud, S.M.: Suffix based automated parts of speech tagging for Bangla language. In: 2nd International Conference on Electrical, Computer and Communication Engineering, ECCE 2019, pp. 7–9 (2019). https://doi.org/10.1109/ECACE.2019.8679161
29. Dandapat, S.: Part-of-Speech Tagging for Bengali, pp. 1–132 (2009)
30. Mukherjee, S., Das Mandal, S.K.: Bengali parts-of-speech tagging using global linear model. In: 2013 Annual IEEE India Conference, INDICON 2013 (2013). https://doi.org/10.1109/INDCON.2013.6726132
31. Kumar, V.: 5 Gram Model for Predicting POS Tag for Bangla, Marathi and Telugu Indian Languages, pp. 1043–1047 (2018)
32. Ekbal, A., Bandyopadhyay, S.: Web-based Bengali news corpus for lexicon development and POS tagging. Polibits 37, 21–30 (2008). https://doi.org/10.17562/pb-37-3
33. Kabir, M.F., Abdullah-Al-Mamun, K., Huda, M.N.: Deep learning based parts of speech tagger for Bengali. In: 2016 5th International Conference on Informatics, Electronics and Vision, ICIEV 2016, pp. 26–29 (2016). https://doi.org/10.1109/ICIEV.2016.7760098
Latent Fingerprinting: A Review
Ritika Dhaneshwar and Mandeep Kaur
Abstract Latent fingerprinting, which provides a mechanism to lift the unintentional impressions left at crime scenes, is highly significant in forensic analysis and authenticity verification. It is extremely crucial for law enforcement and forensic agencies. However, due to the accidental nature of these impressions, the quality of the lifted prints is generally very poor. There is a pressing need to design novel methods to improve the reliability and robustness of latent fingerprinting techniques. A systematic review is, therefore, presented to study the existing methods for latent fingerprint acquisition, enhancement, reconstruction, and matching, along with various benchmark datasets available for research purposes. The paper also highlights various challenges and research gaps to augment the research in this direction, which has become imperative in this digital era.
Keywords Latent fingerprint · Enhancement · Segmentation · Matching · Reconstruction
R. Dhaneshwar (B) · M. Kaur
Department of Information Technology, University Institute of Engineering and Technology, Panjab University, Chandigarh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_5

1 Introduction
With the rise in the number and diversity of crimes, convicting a criminal has become a challenging task for intelligence agencies. Various types of crime are encountered in day-to-day life, such as theft, kidnapping, murder, and forgery. With the advancement of technology, perpetrators have also changed their methods of committing crimes. With increased digitization, criminals now increasingly turn to hacking, phishing, malware attacks, etc. It has therefore become imperative to secure ourselves against these new-age threats. One such method of defending
ourselves is the use of biometrics, which rely on intrinsic physical or behavioral traits of human beings for authentication. Unique physical characteristics, such as fingerprints, palm prints, iris, and face, are widely used today for solving criminal cases. A single trait or multiple traits can be used for authentication. Among the various traits, fingerprints are the most widely accepted due to their uniqueness. Fingerprint recognition is therefore widely used in the banking industry, securing areas of national interest, passport control, securing e-commerce, identification of missing children, etc. In most of these applications, the fingerprints are captured in a controlled environment. In real-world scenarios, however, the fingerprints recovered by law enforcement agencies are unintentional and are left at crime scenes by chance. In such circumstances, latent fingerprinting is the mechanism available to legal authorities for recovering chance impressions from a crime scene. These prints require further processing for the identification of criminals. Due to the unintentional and uncontrolled nature of these prints, we encounter many challenges, such as inefficient capturing and upliftment of fingerprints, partial prints, complex background noise, manual marking of minutiae, one-time upliftment of prints in some techniques, enhancement of poor-quality ridges, and reconstruction of incomplete images. These challenges leave considerable scope for improving the performance of fingerprint recognition systems. Recently, India launched the world’s largest fingerprint database, i.e., Aadhaar, which signifies the importance of fingerprint-based recognition even today [1, 2]. The basic objective of this paper is to acquaint the reader with the basic concepts of latent fingerprinting and with some of the latest available approaches for its enhancement, reconstruction, and matching.
Matching of latent fingerprints is done on the basis of unique features, which are categorized into three levels: level-1, level-2, and level-3 [3]. Level-1 features include arch, left loop, right loop, whorl, etc.; level-2 features comprise ridge endings, bifurcations, hooks, etc.; and level-3 features are pores, line shapes, scars, etc. Usually, a combination of these features is used to obtain appropriate matching results (Fig. 1). The general approach followed while processing latent fingerprint images is as follows. The first step is the image acquisition phase, in which we uplift the latent fingerprint using various techniques, discussed in Sect. 2 of our paper. The captured image is then used in the enhancement phase, in which the quality of the image is improved by noise removal, sharpening, brightness adjustment, etc. Image enhancement makes it easier to identify key features in an image. The next step is image restoration, in which an image degraded by blur, noise, dirt, scratches, etc. is recovered so that accurate features can be extracted from it.
Fig. 1 Basic flow diagram of latent fingerprint processing
Latent Fingerprinting: A Review
Matching is the final step, in which the features recovered from an image are matched with the ground truth using various matching techniques and algorithms. This paper is organized as follows. Section 2 deals with various latent fingerprint development approaches. Section 3 covers latent fingerprint enhancement approaches along with their comparison. In Sect. 4, we discuss and compare the reconstruction approaches. Section 5 presents the comparison of available matching techniques. Section 6 lists the available latent fingerprint databases. In Sect. 7, we list some of the challenges in latent fingerprinting, and with Sect. 8, we conclude our study.
2 Latent Fingerprint Upliftment Approaches
Latent fingerprint development from different surfaces is the first step in the processing of latent fingerprints. This is the most vital step of all, because the quality of the latent prints uplifted at this stage determines what is available to the enhancement, reconstruction, and matching stages. If the uplifted prints are of good quality, the results after preprocessing are likely to be far better than if the prints are of poor quality. Further, the number of minutiae that we are able to extract from an image depends directly on the quality of the prints obtained, which in turn affects the matching performance. To get quality results, we must handle the evidence with the utmost care and uplift the prints as carefully as we can. In this section, we discuss some of the available techniques for fingerprint upliftment (Table 1).
Table 1 Various approaches for fingerprint upliftment

| Approach | Description | Surfaces |
|---|---|---|
| Powder method [4] | Powder of contrasting color with respect to the surface is used | Useful on any dry, smooth, non-adhesive surfaces |
| Ninhydrin [5] | "Ruhemann's purple", a purple-colored product, is obtained after reaction | Useful on porous surfaces, especially paper |
| DFO [6, 7] | Variant of ninhydrin. The print glows when exposed to blue-green light | DFO is also useful to develop weak blood stains |
| Iodine [8] | A yellow-brown product is obtained when sprayed on the print | Useful on non-metallic surfaces, fresh prints on porous and non-porous surfaces |
| Cyanoacrylate (glue fuming) [9] | Whitish deposit is obtained when cyanoacrylate reacts with the print | Useful on most non-porous and some porous surfaces. Produces good results on Styrofoam and plastic bags |
| Small particle reagent [10] | Gray deposits are obtained when it reacts with latent prints | Useful on relatively smooth, non-porous surfaces, including wet ones |

3 Latent Fingerprint Enhancement Approaches
After capturing the fingerprint evidence using the methods discussed above, the next step is to enhance the image. In a real-world crime scenario, it is commonly observed that the uplifted evidence is not of good quality. To get relevant information from the image, we therefore need to enhance it using the approaches discussed in this section (Table 2).

In 2019, a fingerprint enhancement approach was proposed by Jhansirani and Vasanth [11] in which a combination of the total variation (TV) model and sparse representation with multi-scale patching is used. In this method, the image is divided into two components, texture and cartoon, using the TV model. Attributes of ridge structures such as ridge frequency and orientation are obtained with the help of the Gabor function. For matching and identification purposes, the authors used the Levenberg-Marquardt algorithm to train the neural networks.

A generative adversarial network-based latent fingerprint enhancement algorithm is proposed by Joshi et al. [12]. The main objective of the proposed approach is to boost the quality of the ridge structure. Using this approach, the ridge structures are preserved while the quality of the fingerprint images is improved.

Further, an enhancement approach was proposed by Manickam and Devarasan [13] using an intuitionistic fuzzy set. The approach is divided into two stages. First, fingerprint contrast enhancement is done using an intuitionistic fuzzy set. Then, the level-2 features are extracted for matching purposes. The core of the technique is based on minutia points, which are examined over n images. The matching score is calculated by the authors using Euclidean distance.

A novel approach was proposed by Manickam et al. [14] which is based on the Scale Invariant Feature Transformation (SIFT). The model has two phases. In the first phase, contrast enhancement of latent prints is done using an intuitionistic type-2 fuzzy set. In the next phase, the SIFT features are extracted, and these are then used for matching. The matching scores are calculated with the help of Euclidean distance.

A hybrid model is presented by Liban and Hilles [15] which is a fusion of the Edge Directional Total Variation (EDTV) model and quality image enhancement with lost minutia reconstruction. The objective of the paper was the enhancement of the input image as well as the de-noising of latent fingerprints. The authors observe that the proposed technique performs better on good quality latent fingerprints than on bad and ugly quality images.
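Several of the approaches above score a match using the Euclidean distance between feature vectors. The following is a minimal sketch of that idea, assuming each print is summarized by a fixed-length descriptor; the function names and the toy three-dimensional vectors are hypothetical, and the cited papers do not specify how they convert distances into the reported matching scores.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query, gallery):
    """Return (label, distance) of the gallery vector closest to the query."""
    return min(
        ((label, euclidean_distance(query, vec)) for label, vec in gallery.items()),
        key=lambda pair: pair[1],
    )


# Toy descriptors (hypothetical values, not taken from any dataset).
gallery = {"finger_A": [0.0, 0.0, 0.0], "finger_B": [1.0, 2.0, 2.0]}
label, dist = best_match([1.0, 2.0, 1.0], gallery)
```

A smaller distance indicates a closer match, so a practical system would typically threshold the distance or rank all gallery candidates by it.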
Table 2 Available latent fingerprint enhancement approaches

| Approach | Description | Database | Limitation | Results |
|---|---|---|---|---|
| [11] | Image enhancement is done using Gabor function via multi-scale patch-based sparse representation | NIST SD27 | Dictionary creation and lookup are slow | The best training performance is 7.8717e, obtained at epoch 10 |
| [12] | Latent fingerprint enhancement algorithm based on generative adversarial networks is used | IIITD (MOLF) database and IIITD (MSLFD) database | Spurious features are generated when the ridge information is insufficient | Matching results: rank-50 accuracy of 35.66% (DB 1), 30.16% (DB 2) |
| [13] | An intuitionistic fuzzy set is used for contrast enhancement of fingerprints | Fingerprint verification competition-2004 and IIIT-latent fingerprint database | Imperfect matching in case of the presence of background noise and non-linear ridge distortion | Matching scores: IIIT-latent fingerprint = 0.2702, FVC-2004 database 1 = 0.1912, FVC-2004 database 2 = 0.2008 |
| [14] | Scale invariant feature transformation (SIFT) is used for the enhancement of an image | FVC-2004 and IIIT-latent fingerprint | Does not work well with very poor and partial prints | Linear index of fuzziness: IIIT-latent fingerprint = 0.2702, FVC-2004 database 2 = 0.2008 |
| [15] | A hybrid model that is a combination of edge directional total variation model (EDTV) and quality image enhancement with lost minutia reconstruction is used | NIST SD27 database for testing | Results not good with ugly images. Overlapping images not considered | RMSE, PSNR to measure performance: RMSE average = 0.018373 (good quality image), PSNR average = 82.99068 (good quality image) |
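RMSE and PSNR, the two measures reported for [15], compare an enhanced image with a reference image pixel by pixel. The sketch below uses the standard definitions and treats an image as a flat list of intensities. The peak value of 255 is an assumption for 8-bit images, since the source does not state the intensity range used in [15].

```python
import math


def rmse(reference, estimate):
    """Root-mean-square error between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return math.sqrt(mse)


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value is the assumed peak intensity."""
    err = rmse(reference, estimate)
    if err == 0:
        return math.inf  # identical images
    return 20.0 * math.log10(max_value / err)
```

A lower RMSE and a higher PSNR both indicate that the enhanced image is closer to the reference.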
4 Latent Fingerprint Reconstruction Approaches
Image reconstruction is a fundamental step in improving the quality of an image. Generally, the evidence recovered from crime scenes is of poor quality, blurred, incomplete, etc. To extract minutiae efficiently from the evidence, it therefore becomes essential to first reconstruct the image. Various reconstruction techniques are discussed in this section (Table 3). A CNN-based method for the reconstruction and enhancement of latent fingerprints was proposed by Wong and Lai [16] in 2020. Ridge structures are recovered by learning from the corruption and noise encountered at various stages of fingerprint processing. The CNN model consists of two streams which help in reconstruction. The work proposed by Svoboda et al. [17] is based on generative convolutional networks. This approach helps in predicting the gaps, holes, and missing parts of the ridge structures, and it also helps in filtering noise from the minutiae. A conditional generative adversarial network (cGAN) approach is given by Dabouei et al. [18], which helps in the direct reconstruction of latent fingerprints. The authors modified the cGAN approach so that it can be used for the task of reconstruction.

Table 3 Available latent fingerprinting reconstruction approaches

| Approach | Description | Database | Limitation | Results |
|---|---|---|---|---|
| [16] | CNN-based fingerprint reconstruction from the corrupted image | MOLF, FVC2002DB1, and FVC2004DB1 | Unsuccessful in extremely low contrast and noisy images | Accuracy = 84.10% |
| [17] | Reconstruction is done using generative convolutional networks | Gallery datasets like Lumidigm, Secugen, and Crossmatch are used | False minutiae generation is a challenge | Rank-25: Lumidigm = 16.14%, Secugen = 13.27%, Crossmatch = 12.66% |
| [18] | ID preserving generative adversarial network is used for partial latent fingerprint reconstruction | IIIT-Delhi latent fingerprint database and IIIT-Delhi MOLF database | Minutiae are not directly extracted from the latent input fingerprints | Rank-10 accuracy = 88.02% (IIIT-Delhi latent fingerprint database), Rank-50 accuracy = 70.89% (IIIT-Delhi MOLF matching) |
| [19] | Multi-scale dictionaries with texture components are used | NIST SD27 | Computation for false minutiae removal and repetitive minutiae removal is very high | Average orientation estimation error (in degrees) is 16.38 |
To protect ID information in the course of the reconstruction process, a perceptual ID preservation approach is used. An algorithm based on dictionary learning and sparse coding for latent fingerprints is proposed by Manickam et al. [19]. An algorithm has also been proposed for the estimation of orientation fields. In the first step, the texture image is acquired by decomposing the latent fingerprint image using the total variation model. Sparse coding is then applied repeatedly with varying patch sizes to correct the distorted and corrupted orientation fields. The advantage of this approach is that it helps to repair corrupted orientations as well as reduce noise. The algorithm also helps to preserve the details of singular regions.
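Orientation-field quality, as in the result reported for [19], is usually expressed as an angular error in degrees. The sketch below shows one common way to compute it, assuming ridge orientations are undirected, so that angles 180 degrees apart describe the same ridge direction. The function names are hypothetical, and the exact evaluation protocol of [19] (for example, mean absolute versus root-mean-square error) is not given in the text.

```python
def orientation_error(theta_a, theta_b):
    """Smallest angle in degrees between two undirected ridge orientations."""
    d = abs(theta_a - theta_b) % 180.0
    return min(d, 180.0 - d)


def mean_orientation_error(estimated, reference):
    """Average angular error over corresponding blocks of two orientation fields."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("orientation fields must be non-empty and of equal size")
    errors = [orientation_error(e, r) for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)
```

For example, orientations of 10 and 170 degrees differ by only 20 degrees under this convention, not 160.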
5 Latent Fingerprint Matching Approaches
Latent fingerprint matching is the final step in the processing of a fingerprint image. At this stage, the processed print is matched against the ground truth image using the approaches discussed in this section (Table 4). The matching technique proposed by Liu et al. [20] uses the Scale Invariant Feature Transformation (SIFT) for matching and enhancement purposes. The approach comprises two stages: in the first stage, contrast enhancement is performed using type-2 fuzzy sets, and in the next stage, the SIFT features are extracted for matching. A deep learning-based approach is put forward by Ezeobiejesi and Bhanu [21] for matching latent to rolled fingerprints. The matching score is calculated by fusing minutiae and patch similarity scores. A Minutia Spherical Coordinate Code (MSCC) based matching algorithm is proposed by Zheng et al. [22]. This algorithm is an improvement of the Minutia Cylinder Code (MCC). A greedy alignment approach is used to restore minutiae pairs that are lost at the original stage. A robust descriptor-based alignment algorithm based on the Hough transform is proposed by Paulino et al. [23]. The authors use minutiae along with orientation fields to measure the similarity between fingerprints. A novel fingerprint matching system is proposed by Jain and Feng [24]. In this approach, other features, such as the ridge wavelength map, skeleton, and singularity, are used along with minutiae to enhance the performance.
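Results for latent matching are commonly reported as rank-k identification rates: the fraction of latent queries whose true mate appears among the k highest-scoring gallery candidates. The sketch below illustrates that computation, assuming each query comes with a dictionary of similarity scores (higher is better) and a known true identity; the data layout and names are hypothetical.

```python
def rank_of_true_mate(scores, true_id):
    """1-based rank of true_id when gallery ids are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_id) + 1


def rank_k_rate(queries, k):
    """Fraction of (scores, true_id) queries whose true mate is within the top k."""
    if not queries:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for scores, true_id in queries if rank_of_true_mate(scores, true_id) <= k
    )
    return hits / len(queries)


# Toy similarity scores (hypothetical values).
queries = [
    ({"a": 0.9, "b": 0.1, "c": 0.5}, "a"),  # true mate ranked first
    ({"a": 0.2, "b": 0.3, "c": 0.8}, "b"),  # true mate ranked second
]
rank1 = rank_k_rate(queries, 1)
rank2 = rank_k_rate(queries, 2)
```

Because the rate can only grow with k, a rank-20 or rank-50 figure is not directly comparable to a rank-1 figure from another study.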
Table 4 Available latent fingerprint matching approaches

| Approach | Description | Database | Limitation | Results |
|---|---|---|---|---|
| [20] | Matching using SIFT feature | FVC-2004 and IIIT-latent fingerprint | Database size is small. Feature set used is small | Linear index of fuzziness: IIIT-latent fingerprint = 0.2702, FVC-2004 database 1 = 0.1912, FVC-2004 database 2 = 0.2008 |
| [21] | Matching is patch based, using a deep learning approach | NIST SD27 | The approach does not work well with mixed image resolutions | Rank-20 identification rate = 93.65% |
| [22] | Minutia spherical coordinate code is used for matching | AFIS data and NIST special data 27 | There are many redundancies in MCC and MSCC's feature | Rank-1 recognition rate = 49.2% |
| [23] | Descriptor-based Hough transform used for matching | NIST SD27 and WVU latent databases | Latent matching is slow | Rank-1 accuracy = 53.5% |
| [24] | The fusion of various extended features to improve performance | NIST SD4, SD14, and SD27 databases | The separation of feature extraction and matching in automatic systems leads to some information loss | Rank-1 identification rate of 74% was achieved |

6 Databases Available
Fingerprint databases are generally classified into three categories: rolled, plain, and latent fingerprint databases [25]. For forensic applications, mainly rolled and latent fingerprints are used, whereas for commercial applications, plain fingerprints are used. Latent fingerprints are captured with a range of methods, such as chemical treatment, powder, or simply photography. Plain fingerprints are prints of our fingers taken using sensors, and they are mostly used as ground truth. Rolled prints, on the other hand, are obtained by rolling a finger from one side to the other. The available databases related to latent fingerprints are as follows: NIST SD27 [26], WVU latent databases [27], FVC 2004 databases [28], IIIT-latent fingerprint database [29], IIITD Multi-surface Latent Fingerprint database (IIITD MSLFD) [29], IIIT Simultaneous Latent Fingerprint (SLF) database [30], multisensor optical and latent fingerprint database [31], Tsinghua latent overlapped fingerprint database [32], and ELFT-EFS public challenge database [33].
7 Research Gaps and Challenges
To improve the authentication results and reliability of fingerprint recognition, we need a lot of improvement at various stages like enhancement, reconstruction, and matching. Some of the major challenges encountered are as follows.
• Even today, the marking of fingerprint features is done by an expert, which opens a new sphere for improvement, i.e., automation of fingerprint marking [2].
• The fingerprints recovered from crime scenes are generally of very poor quality (background noise, partial prints, etc.), which requires a lot of preprocessing to get the desired results [16].
• Another major challenge concerns the surface from which the fingerprints are uplifted. Different surfaces require different methods on the basis of their texture, color, porosity, etc.
8 Conclusion
To enhance the robustness and efficiency of various security applications, there is a dire need for a novel approach to latent fingerprint recognition. Various image processing techniques can be applied at the enhancement and reconstruction phases to improve robustness and efficiency at the matching stage. This paper presents various aspects of latent fingerprinting which can be used to improve recognition and authentication results. Research in this domain may help us fortify ourselves against emerging digital-era threats, which is imperative for maintaining the security and integrity of our nation.
References
1. Kumar, M., Hanumanthappa, M., Kumar, T.S.: Use of AADHAAR biometrie database for crime investigation: opportunity and challenges. In: 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), pp. 1–6. IEEE, Mar 2017
2. Krishna, A.M., Sudha, S.I.: Automation of criminal fingerprints in India. 1 Interoperable Criminal Justice System, p. 19
3. Jain, A.K., Feng, J.: Latent fingerprint matching. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 88–100 (2010)
4. Sodhi, G.S., Kaur, J.: Powder method for detecting latent fingerprints: a review. Forensic Sci. Int. 120(3), 172–176 (2001)
5. Jasuja, O.P., Toofany, M.A., Singh, G., Sodhi, G.S.: Dynamics of latent fingerprints: the effect of physical factors on quality of ninhydrin developed prints, a preliminary study. Sci. Justice 49(1), 8–11 (2009)
6. Xu, L., Li, Y., Wu, S., Liu, X., Su, B.: Imaging latent fingerprints by electrochemiluminescence. Angew. Chemie Int. Ed. 51(32), 8068–8072 (2012)
7. Luo, Y.P., Bin Zhao, Y., Liu, S.: Evaluation of DFO/PVP and its application to latent fingermarks development on thermal paper. Forensic Sci. Int. 229(1–3), 75–79 (2013)
8. Kelly, P.F., King, R.S.P., Bleay, S.M., Daniel, T.O.: The recovery of latent text from thermal paper using a simple iodine treatment procedure. Forensic Sci. Int. 217(1–3), e26–e29 (2012)
9. Wargacki, S.P., Lewis, L.A., Dadmun, M.D.: Understanding the chemistry of the development of latent fingerprints by superglue fuming. J. Forensic Sci. 52(5), 1057–1062 (2007)
10. Jasuja, O.P., Singh, G.D., Sodhi, G.S.: Small particle reagents: development of fluorescent variants. Sci. Justice 48(3), 141–145 (2008)
11. Jhansirani, R., Vasanth, K.: Latent fingerprint image enhancement using Gabor functions via multi-scale patch based sparse representation and matching based on neural networks. In: Proceedings 2019 IEEE International Conference on Communications and Signal Processing, ICCSP 2019, no. c, pp. 365–369 (2019)
12. Joshi, I., Anand, A., Vatsa, M., Singh, R., Roy, S.D., Kalra, P.K.: Latent fingerprint enhancement using generative adversarial networks. In: Proceedings of 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019, pp. 895–903 (2019)
13. Manickam, A., Devarasan, E.: Level 2 feature extraction for latent fingerprint enhancement and matching using type-2 intuitionistic fuzzy set. Int. J. Bioinform. Res. Appl. 15(1), 33–50 (2019)
14. Manickam, A., et al.: Score level based latent fingerprint enhancement and matching using SIFT feature. Multimed. Tools Appl. 78(3), 3065–3085 (2019)
15. Liban, A., Hilles, S.M.S.: Latent fingerprint enhancement based on directional total variation model with lost minutiae reconstruction. In: 2018 International Conference on Smart Computing and Electronic Enterprise, ICSCEE 2018, pp. 1–5 (2018)
16. Wong, W.J., Lai, S.: Multi-task CNN for restoring corrupted fingerprint images. Pattern Recogn. 107203 (2020)
17. Svoboda, J., Monti, F., Bronstein, M.M.: Generative convolutional networks for latent fingerprint reconstruction. In: IEEE International Joint Conference on Biometrics, IJCB 2017, vol. 2018, pp. 429–436, Jan 2018
18. Dabouei, A., Soleymani, S., Kazemi, H., Iranmanesh, S.M., Dawson, J., Nasrabadi, N.M.: ID preserving generative adversarial network for partial latent fingerprint reconstruction. In: 2018 IEEE 9th International Conference on Biometrics: Theory, Applications, and Systems, BTAS 2018, pp. 1–10 (2018)
19. Manickam, A., Devarasan, E., Manogaran, G., Priyan, M.K., Varatharajan, R., Hsu, C.H., Krishnamoorthi, R.: Score level based latent fingerprint enhancement and matching using SIFT feature. Multimedia Tools Appl. 78(3), 3065–3085 (2019)
20. Liu, S., Liu, M., Yang, Z.: Sparse coding based orientation estimation for latent fingerprints. Pattern Recognit. 67, 164–176 (2017)
21. Ezeobiejesi, J., Bhanu, B.: Patch based latent fingerprint matching using deep learning. In: 2018 25th IEEE International Conference on Image Processing, pp. 2017–2021. Center for Research in Intelligent Systems, University of California, Riverside, CA 92521, USA (2018)
22. Zheng, F., Yang, C., Road, W., Road, R., District, F.: Latent fingerprint match using minutia spherical coordinate code, no. 186, pp. 357–362 (2015)
23. Paulino, A.A., Feng, J., Jain, A.K.: Latent fingerprint matching using descriptor-based Hough transform. IEEE Trans. Inf. Forensics Secur. 8(1), 31–45 (2013)
24. Jain, A.K., Feng, J.: Latent fingerprint matching. IEEE Trans. Pattern Anal. Mach. Intell. 33(1), 88–100 (2011)
25. Feng, J., Jain, A.K.: Filtering large fingerprint database for latent matching. In: Proceedings, International Conference on Pattern Recognition, pp. 25–28 (2008)
26. https://www.nist.gov/itl/iad/image-group/nist-special-database-2727a
27. https://databases.lib.wvu.edu/
28. https://bias.csr.unibo.it/fvc2004/download.asp
29. https://www.iab-rubric.org/resources/molf.html
30. https://www.iab-rubric.org/resources.html
31. Sankaran, A., Vatsa, M., Singh, R.: Multisensor optical and latent fingerprint database. IEEE Access 3, 653–665 (2015)
32. https://ivg.au.tsinghua.edu.cn/dataset/TLOFD.php
33. https://www.nist.gov/itl/iad/image-group/nist-evaluation-latent-fingerprint-technologies-extended-feature-sets-elft-efs
Skin Cancer Classification Through Transfer Learning Using ResNet-50
Anubhav Mehra, Akash Bhati, Amit Kumar, and Ruchika Malhotra
Abstract Skin cancer is one of the most serious public health issues, with between 2 and 3 million newly diagnosed cases worldwide every year. Like other cancers, skin cancer becomes very lethal if it is diagnosed at a late stage, and the probability of the patient surviving is then greatly reduced. It is therefore imperative to diagnose it at an early stage so that the patient can be cured. In this paper, we have classified 7 different categories of skin cancer using the deep residual network ResNet-50 and have compared the model with VGG16 and a CNN. ResNet was the winner of the ImageNet challenge (ILSVRC) in 2015 and also the winner of the MS COCO Challenge in 2015. Each layer of ResNet-50 consists of a convolution layer, batch normalization, max pooling, and a ReLU activation layer. The seven categories of skin cancer are: melanocytic nevi, melanoma, benign keratosis-like lesions, basal cell carcinoma, actinic keratoses, vascular lesions, and dermatofibroma. We have trained the model on the HAM10000 ('Human Against Machine with 10000 training images') dataset, which has 10,015 dermoscopic images and is available through the ISIC archive. The model performed well on the given dataset. We evaluated the model using multiple performance metrics. The results show that the developed model has high predictive capability. The accuracy achieved by the model was 84.87%. Precision was 0.86, recall was 0.85, and the F1-score achieved was 0.85.
Keywords Skin cancer · ResNet-50 · Convolution neural networks
A. Mehra (B) · A. Bhati · A. Kumar · R. Malhotra, Discipline of Software Engineering, Department of Computer Science and Engineering, Delhi Technological University, Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_6
A. Mehra et al.
1 Introduction
Skin cancer is a disease in which skin cells grow abnormally and cancer cells become established in the tissues of the skin. It is the most common type of cancer. Contrary to popular belief, skin cancer does not develop only in areas exposed to sunlight; it can also develop in areas that are not usually exposed to the sun. One in every three cancers diagnosed is a skin cancer [1]. There are two main types of skin cancer: melanoma and non-melanoma. At present, between 2 and 3 million non-melanoma skin cancer cases and around 132,000 melanoma skin cancer cases occur globally each year, and the number of cases is very likely to increase because of the damage done to the ozone layer by CFCs. As the atmosphere loses its protective filter, more harmful UV rays will reach the earth's surface. According to the World Health Organization (WHO), a 10 percent decrease in ozone levels will result in an additional 300,000 non-melanoma and 4,500 melanoma skin cancer cases. Malignant melanoma accounts for 75 percent of all deaths due to skin cancer. It is estimated that new melanoma cases will increase by 2 percent in 2020. Fortunately, if it is detected early, the 5-year survival rate for melanoma skin cancer is 99 percent [2]. Many convolutional neural networks have been used to classify skin cancer lesions. In fact, many CNN models have outperformed manual classification of skin cancer lesions by the human eye. A common perception is that simply stacking more layers to make a deeper network will improve accuracy. However, researchers have observed that this is not the case: when deeper networks are trained, accuracy degrades. Deep residual networks have solved this problem and have enabled researchers to build deeper networks without compromising accuracy.
ResNet is one of the most powerful deep neural networks. The pre-trained ResNet-50 has been trained on more than a million images from the ImageNet database. ResNet-50, as the name suggests, is a CNN that consists of 50 layers. The pre-trained network can classify images into a thousand categories, such as pencil, mouse, and many animals. The input image size for ResNet-50 is 224-by-224. In this paper, we have used a ResNet-50 model which analyzes the skin lesions and classifies them into their respective categories, and we have compared the results with those of a VGG16 model and a CNN model. On top of the ResNet-50 model, we have placed a dense layer with dropout layers before and after it, and a Softmax layer at the end for classification.
2 Related Work
Skin cancer is one of the most widespread and dangerous types of cancer. In earlier times, images were analyzed manually for nodule development in the body in order to diagnose skin cancer. The reliability of this type of manual diagnosis is often low because human observation is error-prone, and the process is time consuming and laborious. An algorithm for the classification of melanoma (a type of skin cancer) was developed by Almansour et al. [1] using a support vector machine (SVM) and k-means clustering. The ability and usefulness of CNNs were studied in the classification of 8 skin diseases. The objective of that study was to compare the capability of deep learning with the performance of experienced dermatologists. The results confirmed that deep learning models performed better than dermatologists (by at least 11%). The characteristics that lead an algorithm to a decision can be interpreted by inspecting the layers of a CNN. Selvaraju et al. [4] proposed the Grad-CAM approach, which uses the gradients of any chosen class flowing into the last convolutional layer to produce a coarse map. A similar approach was used by Esteva et al. [5] to attain dermatologist-level precision in the classification of skin cancer. An Inception V4 [6] model pre-trained on ImageNet was used in this method. Melanoma is the most dangerous form of skin cancer. For the classification of malignant melanoma, Yu et al. [7] proposed a CNN with more than 50 layers on the ISBI 2016 challenge dataset. Han et al. [8] proposed a deep CNN model to classify clinical images of 12 skin lesions. The authors did not consider subjects of different ages. The researchers developed a ResNet model that was trained with 19,398 images from atlas site images, the MED-NODE dataset, and the Asan datasets. A CNN approach was used by Dorj et al. [9] for the development of an ECOC SVM for the classification of 4 diagnostic groups of clinical cancer pictures. Brinker et al. [10] were the first to present a systematic study on the classification of skin lesion diseases. CNNs are widely applied to the classification of skin cancer, and the authors highlight the challenges that need to be addressed for the classification task. Tschandl et al. [11] developed a CNN-based classification model that was trained on 5829 close-up and 7895 dermoscopic images of lesions. The model was tested on a set of 2072 unknown cases, and the results were compared with those obtained from 95 medical personnel. A cross-sectional study was performed by Marchetti et al. [12] on randomly selected dermoscopic images (44 nevi, 6 lentigines, and 50 melanomas). The researchers combined 5 different methods for the classification of 3 motives. Most of the research done so far has focused on CNN models, and very little of it has used transfer learning for classifying skin cancer. This paper aims at using ResNet-50 for the classification of skin cancer and comparing the results of the proposed model with those of a VGG-16 model and a CNN model on the same dataset.
3 Proposed Work
Our model is a sequential model. A convolutional neural network and the deep residual network ResNet-50 are used for the classification of skin lesions. Although the ResNet-50 model is pre-trained on the ImageNet database, which consists of more than a million images, we trained our model on the HAM10000 dataset by freezing some of the initial layers and training the rest of the layers. We also added some additional layers: a dense layer, with dropout layers before and after it. The dropout layers are used to avoid overfitting. At the end of the model, we have a Softmax layer for classifying the type of skin lesion.
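To make the roles of the added layers concrete, the following library-free sketch shows what a dropout layer and a Softmax layer compute. It is an illustration only, not the authors' Keras model: the function names are hypothetical, and the dropout rate of 0.5 used in the example is an assumption, since the rate is not stated here. The dropout shown is the "inverted" variant, in which surviving activations are rescaled during training.

```python
import math
import random


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(values, rate, rng):
    """Inverted dropout (training mode): zero each value with probability
    `rate` and scale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


# Seven hypothetical class scores, one per lesion category.
probs = softmax([2.0, 1.0, 0.1, -1.0, 0.5, 0.0, -0.5])
dropped = dropout([1.0] * 10, 0.5, random.Random(0))
```

At inference time dropout is disabled, so the Softmax output depends only on the learned weights and the input image.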
3.1 Dataset
Training a neural network is very difficult if the dataset is small or if it lacks variety, i.e., images of different classes. If the dataset is highly skewed, the trained model does not serve its purpose and may give a false sense of accomplishment. The dataset we used is MNIST HAM10000 [13], which has 10,015 dermoscopic images. The dataset has 7 categories: actinic keratoses (327 images), melanocytic nevi (6705 images), melanoma (1113 images), dermatofibroma (115 images), benign keratosis-like lesions (1099 images), basal cell carcinoma (514 images), and vascular lesions (142 images). The dimension of the images is 450 × 600 × 3. The diagnoses of these images were established through 4 different methods: histopathology, confocal microscopy, follow-up, and consensus.
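The class counts listed above can be checked and summarized with a few lines of arithmetic. The sketch below uses only the figures given in the text; the variable names are illustrative.

```python
# Per-class image counts as listed for HAM10000 in the text.
class_counts = {
    "melanocytic nevi": 6705,
    "melanoma": 1113,
    "benign keratosis-like lesions": 1099,
    "basal cell carcinoma": 514,
    "actinic keratoses": 327,
    "vascular lesions": 142,
    "dermatofibroma": 115,
}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
```

The counts add up to the stated 10,015 images. Melanocytic nevi make up roughly two thirds of the data, and the largest class is about 58 times the size of the smallest, so a classifier that always predicted the majority class would already reach about 67% accuracy. Reported accuracy figures are best read against that baseline.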
3.2 Image Pre-processing
Image pre-processing becomes imperative when dealing with large datasets, and it has a direct impact on the accuracy of the model. Hence, we applied several pre-processing methods before feeding the images into the model. First, we cleaned the dataset. The 'age' field had 57 null values, so we took the mean of all the ages and replaced the null values with it. We also normalized the dataset by dividing each pixel by 255.0, thus rescaling the image values to between 0 and 1. Normalization is applied when features have contrasting ranges; otherwise, a feature with numerical values in a higher range will have more influence than a feature with values in a lower range. To avoid overfitting, we performed data augmentation. Augmenting the data increases the number of training examples, which in turn creates a more powerful model. Augmentation was very important for this dataset because it is very imbalanced, with 6705 images of 'melanocytic nevi'. We randomly rotated some images by 10 degrees, shifted some images horizontally by 10%, zoomed some images by 10%, and shifted some images vertically by 10%.
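The pre-processing steps described above can be illustrated without any imaging library. The sketch below shows mean imputation, pixel rescaling, and a horizontal shift on plain Python lists. The helper names are hypothetical, and padding the vacated columns with zeros is an assumption made for simplicity; the authors' own pipeline is not shown, and an augmentation utility may fill the gap differently.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("at least one known value is required")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def normalize_pixels(pixels):
    """Rescale 8-bit intensities from [0, 255] to [0, 1]."""
    return [p / 255.0 for p in pixels]


def shift_horizontal(image, fraction):
    """Shift each row to the right by `fraction` of the image width,
    padding the vacated columns with zeros (an assumption)."""
    width = len(image[0])
    shift = int(round(width * fraction))
    return [[0] * shift + row[: width - shift] for row in image]


ages = impute_mean([20, None, 40])
scaled = normalize_pixels([0, 255])
shifted = shift_horizontal([list(range(1, 11))], 0.10)
```

In the toy example, a 10% shift on a row of width 10 moves every pixel one column to the right and drops the last pixel.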
3.3 Method We took the deep residual network ResNet-50 as the base model and added additional layers on top of it. ResNet-50 is a pre-trained network with 5 stages. A typical parametric block in ResNet-50 consists of four layers: a convolution layer, batch normalization, a max pooling layer and ReLU. The architecture of the network is shown in Fig. 1. Convolution Layer: A convolution layer is like a set of filters, and each filter is applied over the entire image. The convolution layer of stage 1 has 64 filters of kernel size 7 × 7. The filter sizes for the layers of stage 2 to stage 5 alternate between 1 × 1 and 3 × 3. Batch Normalization: The batch normalization layer normalizes the input coming from each layer and also improves the performance of the network. Max Pooling: The max pooling layer does not have any parameters to learn. It is used to reduce the size of the output of the convolution layer and so reduce the computational cost. ReLU: The rectified linear unit adds non-linearity to the network. It is the activation function max(x, 0).
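To make the two parameter-free operations concrete, the following sketch applies ReLU and a max pooling step to a small feature map. It is only an illustration under stated assumptions: the 2 × 2 window with stride 2 and the function names are chosen for brevity and are not taken from the ResNet-50 definition.

```python
def relu(x):
    """Rectified linear unit: max(x, 0)."""
    return max(x, 0.0)


def max_pool_2x2(feature_map):
    """Downsample a 2D list by taking the maximum of each
    non-overlapping 2x2 window (window size is an assumption)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Toy 4x4 feature map, for illustration only.
feature_map = [
    [1, -2, 3, 0],
    [4, 5, -6, 7],
    [-1, 0, 2, 2],
    [3, -4, 1, 8],
]
activated = [[relu(v) for v in row] for row in feature_map]
pooled = max_pool_2x2(activated)
```

Neither function has anything to learn, which is why pooling and ReLU add no trainable parameters to the network.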
3.4 Implementation In this research, we used ResNet-50 as the base model and added some additional layers on top of it to train our model to classify the type of skin lesion. We used Keras as the primary deep learning library. The HAM10000 dataset has 10,015 images and 7 labels of skin lesions. The dataset was cleaned and normalized. Exploratory data analysis was done to check the distribution of the data and explore the different features. The labels were one-hot encoded. Data augmentation was done to increase the number of training images. The dataset was split into training data and testing data, and the training data was further split into training data and validation data. On top of the ResNet-50 base, there are two dropout layers on either side of a dense layer, and an output layer at the end to classify the images. The dropout layers help to avoid overfitting. The Adam optimizer was used for optimization. The proposed model was trained on the training images. The number of epochs was 50, and the batch_size was 64. The model was then tested on the testing data. The metrics accuracy, precision, recall and F1-score were used to evaluate the performance of the proposed model. The same was done for the VGG16 and CNN models, which were evaluated on the same performance metrics.
Fig. 1 Architecture of the proposed model
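The label encoding and the two-stage split mentioned above can be sketched as follows. This is a hedged illustration in plain Python: the split fractions (20% for testing, 10% of the remainder for validation), the seed and the helper names are assumptions, since the text does not state the ratios that were actually used.

```python
import random

# The seven lesion categories listed in Sect. 3.1.
CLASSES = [
    "actinic keratoses", "melanocytic nevi", "melanoma", "dermatofibroma",
    "benign keratosis-like lesions", "basal cell carcinoma", "vascular lesions",
]


def one_hot(label, classes=CLASSES):
    """Return a vector with a single 1 at the position of the label."""
    return [1 if c == label else 0 for c in classes]


def split(items, holdout_fraction, seed=0):
    """Shuffle a copy of the items and split off a holdout part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical fractions: first hold out test data, then validation data.
train_val, test = split(range(100), 0.2)
train, val = split(train_val, 0.1)
```

Splitting off the test data first keeps it untouched by any decision made with the validation data.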
4 Results We used TensorFlow as the backend and Keras as the frontend, and trained and tested our model in Kaggle kernels. Matplotlib was used for exploratory data analysis, and Pandas was used for pre-processing of the data. After training the model on the training images, we tested it and evaluated its performance. We calculated the confusion matrix, which is shown in Fig. 2. The performance metrics used were accuracy, precision, recall and F1-score. For the proposed model, the validation accuracy was 84.91% and the test accuracy was 84.87%. The average precision was 0.99 on the training data and 0.86 on the testing data. The average recall was 0.99 on the training data and 0.85 on the testing data. The average F1-score was 0.99 on the training data and 0.85 on the testing data. Table 1 shows the results for our proposed model, VGG16 and CNN. Table 2 shows the performance of our proposed model on the individual classes.
Fig. 2 Confusion matrix of the proposed model
Table 1 Result summary for the models
Sr. No.   Model            Accuracy   Precision   Recall   F1-score
1         Proposed Model   84.87%     0.86        0.85     0.85
2         VGG16            81.27%     0.82        0.81     0.81
3         CNN              76.33%     0.74        0.76     0.74
Table 2 Result summary for the proposed model
Sr. No.   Class                   Precision   Recall   F1-score
1         Melanocytic nevi        0.88        0.45     0.59
2         Melanoma                0.77        0.60     0.67
3         Benign keratosis        0.67        0.78     0.72
4         Basal cell carcinoma    0.72        0.57     0.63
5         Actinic keratoses       0.94        0.92     0.93
6         Vascular skin lesions   0.60        0.74     0.66
7         Dermatofibroma          1.00        0.96     0.98
Apart from accuracy, the proposed model is also evaluated on precision, recall and F1-score. These classification metrics are defined as follows:
$$\text{precision} = \frac{\text{true positive}}{\text{total predicted positive}} \tag{1}$$
$$\text{recall} = \frac{\text{true positive}}{\text{total actual positive}} \tag{2}$$
$$\text{F1-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{3}$$
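Equations 1 to 3 can be evaluated per class directly from a confusion matrix. The sketch below is illustrative only: the function name, the row/column convention (rows are actual classes, columns are predicted classes) and the 2 × 2 toy matrix are assumptions and do not reproduce the confusion matrix of Fig. 2.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) for every class.

    confusion[i][j] is the number of samples whose actual class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    results = []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_positive = sum(confusion[i][k] for i in range(n))  # column sum
        actual_positive = sum(confusion[k])                          # row sum
        precision = true_positive / predicted_positive if predicted_positive else 0.0
        recall = true_positive / actual_positive if actual_positive else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Toy two-class confusion matrix, for illustration only.
metrics = per_class_metrics([[5, 1], [2, 4]])
```

Because the F1-score is the harmonic mean of precision and recall, it stays low whenever either of the two is low, which matters for the minority classes of an imbalanced dataset.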
5 Conclusion In this work, a model was proposed that uses transfer learning and the deep residual network ResNet50 with additional layers. It was found that the ResNet50 model, which is pre-trained on the ImageNet database, can be used for successful classification of the skin lesions of the HAM10000 dataset. The observed accuracy of the proposed model is 84.87%. The precision was 0.86, the recall was 0.85 and the F1-score was 0.85. The HAM10000 dataset is highly imbalanced, with 6705 samples of melanocytic nevi and a meagre 115 samples of dermatofibroma. The accuracy can be further increased by using a more balanced dataset, and the system can then be used to assist dermatologists in the inspection of skin lesions.
References
1. WHO Skin Cancer. https://www.who.int/uv/faq/skincancer/en/index1.html
2. Skin Cancer Facts and Statistics. https://www.skincancer.org/skin-cancer-information/skincancer-facts/
3. Almansour, E., Arfan Jaffar, M.: Classification of dermoscopic skin cancer images using color and hybrid texture features. IJCSNS Int. J. Comput. Sci. Netw. Secur. 16(4), 135–139 (2016)
4. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
5. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115–118 (2017)
6. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
7. Yu, Y., Lin, H., Meng, J., Wei, X., Guo, H., Zhao, Z.: Deep transfer learning for modality classification of medical images. Information 8(3), 91 (2017)
8. Han, S.S., Kim, M.S., Lim, W., Park, G.H., Park, I., Chang, S.E.: Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J. Investigative Dermatol. 138(7), 1529–1538 (2018)
9. Dorj, U.-O., Lee, K.-K., Choi, J.-Y., Lee, M.: The skin cancer classification using deep convolutional neural network. Multimedia Tools Appl. 77(8), 9909–9924 (2018)
10. Brinker, T.J., Hekler, A., Utikal, J.S., Grabe, N., Schadendorf, D., Klode, J., Berking, C., Steeb, T., Enk, A.H., von Kalle, C.: Skin cancer classification using convolutional neural networks: systematic review. J. Med. Internet Res. 20(10), e11936 (2018)
11. Tschandl, P., Rosendahl, C., Akay, B.N., Argenziano, G., Blum, A., Braun, R.P., Cabo, H., et al.: Expert-level diagnosis of nonpigmented skin cancer by combined convolutional neural networks. JAMA Dermatol. 155(1), 58–65 (2019)
12. Gutman, D., Codella, N.C.F., Celebi, E., Helba, B., Marchetti, M., Mishra, N., Halpern, A.: Skin lesion analysis toward melanoma detection: a challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC) (2016). arXiv preprint arXiv:1605.01397
13. Tschandl, P., Rosendahl, C., Kittler, H.: The HAM10000 dataset, a large collection of multisource dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 (2018)
Sentiment Analysis Using Machine Learning Approaches Ayushi Mitra and Sanjukta Mohanty
Abstract Sentiment analysis (SA), also called opinion mining or emotion AI, is an active field that uses natural language processing (NLP) and text analysis to extract, quantify and study the emotional states expressed in a given piece of information or text data set. It remains an area of ongoing work within text mining. Many corporations use sentiment analysis to review products and comments from social media, and to check whether a text is positive, negative or neutral. In this research work, we adopt rule-based approaches, which define a set of rules over inputs produced by classic NLP techniques such as stemming, tokenization, part-of-speech tagging and parsing, together with machine learning for sentiment analysis, to be implemented in Python. Keywords Classic NLP techniques · Sentiment analysis · Text mining
1 Introduction Many researchers are now drawn to analysing data from online and public forum sites, applying slicing-and-dicing and mining algorithms to derive valuable information from raw data sources. Organizations these days assess their customers' opinions of their products from social-site text dumps and raw logs [12]. Sentiment analysis algorithms automatically classify a user-generated text as a positive, negative or neutral opinion concerning an entity such as a product, person, topic or event. Sentiment analysis is classified at three levels: document level, sentence level and feature level [13].
A. Mitra · S. Mohanty (✉), Centurion University of Technology and Management, Gajapati, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_7
At the document level, the entire document is treated as a basic information unit and is classified into a positive or negative class. At the sentence level, each sentence is first classified as subjective or objective and is then assigned to the positive, negative or neutral category. At the third level, aspect or feature level classification, opinions are distinguished by extracting product attribute details from the source data [13]. A second approach used in industry today is the lexicon-based approach, which relies on a dictionary of positive and negative words to determine sentiment polarity: a message is labelled according to whether its content has more words in the positive word repository or in the negative word repository. The hybrid approach then combines the machine learning and lexicon-based approaches for classification.
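The lexicon-based idea can be sketched as a simple count of dictionary hits. This is a minimal illustration under stated assumptions: the two word sets are tiny placeholders rather than a real sentiment lexicon, and the function name `lexicon_polarity` is hypothetical.

```python
import re

# Placeholder repositories; a real system would load a full lexicon.
POSITIVE_WORDS = {"good", "great", "excellent", "love", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "terrible", "hate", "sad"}


def lexicon_polarity(text):
    """Label a text by comparing its positive and negative word counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(token in POSITIVE_WORDS for token in tokens)
    negative_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


label = lexicon_polarity("I love this great phone")
```

No training data is needed here, but the quality of the result depends entirely on the coverage of the two word repositories.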
2 Theoretical Background Machine learning methods for sentiment analysis usually rely on supervised classification, in which tagged (labelled) data is used. Figure 1 gives an overview of the entire architecture. It depicts two methods: (a) the training method and (b) the prediction method. In the training method (a), the model learns to map a specific input (text) to the corresponding output (tag) from the sample data provided for training, following the 80:20 principle: 80% of the data is fed into the application to train it, and the remaining 20% is kept for the next phase, the prediction phase. The feature extractor transforms the text input from the previous step into a feature vector, from which the text-tag matrix is built. These feature vectors and tags (e.g. positive, negative or neutral) are then fed into the machine learning algorithm, which generates a model. In the prediction phase (b), the feature extractor transforms the unseen text inputs into feature vectors. These feature vectors are fed into the model, which generates the predicted tags (i.e. positive, negative or neutral) on the basis of what it learnt from the 80% data sample in the previous step.
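The two building blocks named above, the feature extractor and the 80:20 split, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: a bag-of-words count is assumed as the feature vector, and the helper names and sample texts are hypothetical.

```python
import random
import re
from collections import Counter


def extract_features(text):
    """Bag-of-words feature vector: token -> number of occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def split_80_20(samples, seed=0):
    """Shuffle the tagged samples and keep 80% for training, 20% for prediction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


# Hypothetical tagged samples, for illustration only.
samples = [(f"sample review number {i}", "positive" if i % 2 else "negative")
           for i in range(10)]
training_set, prediction_set = split_80_20(samples)
training_vectors = [(extract_features(text), tag) for text, tag in training_set]
```

The same `extract_features` function must be reused in the prediction phase so that unseen texts are mapped into the feature space the model was trained on.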
3 Related Work Joshi et al. [4] used natural language processing (NLP) techniques to determine sentiments with the help of a tool, Sentiment Analyzer, that automatically extracts sentiments and efficiently discovers all references to a given subject. Hur et al. [5] used three machine learning algorithms, namely artificial neural network, regression tree and support vector regression, to capture the non-linear relationship between box-office collections and the sentiments of movie reviews.
Fig. 1 Block diagram of sentiment analysis
Agarwal et al. [6] used numerous approaches and classifiers, such as the lexicon-based approach, the Naïve Bayes (NB) classifier, Support Vector Machines (SVM) and Maximum Entropy (MaxEnt). Pandarachalil et al. [7] presented a Twitter sentiment analysis method that uses an unsupervised learning approach; SenticNet, SentiWordNet and SentislangNet were the three sentiment lexicons used to determine the polarity of tweets. Sangani [8] associated with every topic a collection of reviews that reflect user opinions towards that topic, and established a many-to-many relation from reviewers to topics of interest. Mudinas and Zhang [9] used a hybrid technique that combines the reliability of the lexicon-based technique with the performance of the machine learning based technique; the overall accuracy of the system was observed to be 82.3%. Kouloumpis et al. [10] used various features such as unigrams, bigrams, n-grams, POS tagging and hashtags, and obtained mixed classification results. Zhang et al. [11] used association rule mining to extract product features and to differentiate between positive and negative reviews.
Parvathy et al. [12] presented a hybrid approach consisting of machine learning techniques like artificial neural network, support vector machine, regression tree and rule based technique.
4 Proposed Approach We have discussed some approaches to sentiment analysis and their accuracy results. The block diagram of sentiment analysis is shown in Fig. 1. It presents the test set, training set and prediction set and gives an overview of the entire architecture: the machine learning model is first trained to gather experience (usually with 80% of the data as input) and then makes predictions using the experience gained in the earlier step (usually with the remaining 20% of the data as input). It depicts two methods: (a) the training method and (b) the prediction method. In the training method (a), the model learns to associate a specific input (text) with the corresponding output (tag) on the basis of the samples provided for training. The feature extractor transforms the text input into a feature vector, and these feature vectors and tags (e.g. positive, negative or neutral) are then fed into the machine learning algorithm, which generates a model. In the prediction method (b), the feature extractor transforms the unseen text inputs into feature vectors. These feature vectors are then fed into the model, which generates the predicted tags (i.e. positive, negative or neutral), as shown by the CFG in Fig. 2.
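The training method (a) and prediction method (b) can be illustrated with a small classifier. The sketch below uses a multinomial Naive Bayes model with add-one smoothing, which is an assumption made for illustration: this section does not fix the classifier, and the function names and sample reviews are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    """Training method (a): count tags and per-tag word frequencies."""
    tag_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, tag in samples:
        tag_counts[tag] += 1
        for token in tokenize(text):
            word_counts[tag][token] += 1
            vocabulary.add(token)
    return {"tags": tag_counts, "words": word_counts, "vocabulary": vocabulary}


def predict(model, text):
    """Prediction method (b): return the tag with the highest log-probability."""
    total_samples = sum(model["tags"].values())
    vocabulary_size = len(model["vocabulary"])
    best_tag, best_score = None, -math.inf
    for tag, count in model["tags"].items():
        score = math.log(count / total_samples)  # log prior of the tag
        denominator = sum(model["words"][tag].values()) + vocabulary_size
        for token in tokenize(text):
            if token in model["vocabulary"]:  # ignore words never seen in training
                score += math.log((model["words"][tag][token] + 1) / denominator)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


# Hypothetical training samples, for illustration only.
model = train([
    ("great movie loved it", "positive"),
    ("wonderful acting great plot", "positive"),
    ("boring movie hated it", "negative"),
    ("terrible plot awful acting", "negative"),
])
```

Calling `predict(model, "a great plot")` then tags an unseen text using only the counts gathered during training.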
5 Conclusion and Future Work Sentiment analysis can be performed with the lexicon-based approach, the machine learning based approach or the hybrid approach. Research has already been carried out in this field to improve accuracy, yet the results still appear inadequate. The strength of lexicon-based sentiment classification depends on the scale of the lexicon (dictionary): as the size of the lexicon increases, this approach becomes more error-prone and time consuming. For further implementation, we will use the NLTK (Natural Language Toolkit) package in Python on sample movie review data. This work will focus on using the in-built classifier models of the NLTK package and comparing their accuracy on a given dataset.
Fig. 2 Workflow of the proposed framework (flowchart stages: collection of data; cleaning the data; algorithm selection; data split into training set, test set and validation set; preprocessing of data; training; data forecast)
References
1. El Alaoui, I., Gahi, Y., Messoussi, R., Chaabi, Y., Todoskoff, A., Kobi, A.: A novel adaptable approach for sentiment analysis on big social data. J. Big Data 5(1) (2018)
2. Singh, J., Singh, G., Singh, R.: Optimization of sentiment analysis using machine learning classifiers. Hum. Cent. Comput. Inf. Sci. (2017). https://doi.org/10.1186/s13673-017-0116-3
3. Bharti, O., Malhotra, M.: Sentiment analysis (2016)
4. Joshi, R., Tekchandani, R.: Comparative analysis of Twitter data using supervised classifiers. In: 2016 International Conference on Inventive Computation Technologies (ICICT), vol. 3, pp. 1–6. IEEE (2016)
5. Hur, M., Kang, P., Cho, S.: Box-office forecasting based on sentiments of movie reviews and Independent subspace method. Inf. Sci. 372, 608–624 (2016)
6. Agarwal, B., Poria, S., Mittal, N., Gelbukh, A., Hussain, A.: Concept-level sentiment analysis with dependency-based semantic parsing: a novel approach. Cogn. Comput. 7(4), 487–499 (2015)
7. Pandarachalil, R., Sendhilkumar, S., Mahalakshmi, G.S.: Twitter sentiment analysis for large-scale data: an unsupervised approach. Cogn. Comput. 7(2), 254–262 (2015)
8. Sangani, C., Ananthanarayanan, S.: Sentiment analysis of app store reviews. Methodology 4(1), 153–162 (2013)
9. Mudinas, A., Zhang, D., Levene, M.: Combining lexicon and learning based approaches for concept-level sentiment analysis. In: Proceedings of the First International Workshop on Issues of Sentiment Discovery and Opinion Mining, p. 5. ACM (2012)
10. Kouloumpis, E., Wilson, T., Moore, J.: Twitter sentiment analysis: the good the bad and the omg!. In: Fifth International AAAI Conference on Weblogs and Social Media (2011)
11. Zhang, L., Liu, B., Lim, S.H., O'Brien-Strain, E.: Proceedings of the 23rd International Conference Computational Linguistics (2010)
12. Parvathy, G., Bindhu, J.S.: A probabilistic generative model for mining cybercriminal network from online social media: a review. Int. J. Comput. Appl. 134(14), 1–4 (2016). https://doi.org/10.5120/ijca2016908121
13. Vohra, S.M., Teraiya, J.B.: A comparative study of sentiment analysis techniques. J. JIKRCE 2(2), 313–317 (2013)
Nonlinear 2D Chaotic Map and DNA (NL2DCM-DNA) Sequences-Based Fast and Secure Block Image Encryption Shalini Stalin, Priti Maheshwary, and Piyush Kumar Shukla
Abstract This work presents a novel technique for block image encryption based on a nonlinear 2D chaotic map and DNA sequences (NL2DCM-DNA). The 2D chaotic map generates several key sequences for encryption, and DNA rules are applied to perform fast block encryption of the image. The nonlinearity of the 2D chaotic map and the complexity of the DNA sequences are used to achieve highly secure image encryption. The experimental results illustrate the higher efficiency of NL2DCM-DNA in terms of security, attack resilience, entropy, histogram, running time and diffusion compared with some previous image encryption algorithms. Keywords Chaotic map · DNA rules · Image encryption · Pixel substitution · Diffusion
1 Introduction Security and confidentiality are major concerns in Internet communication. Several types of messages, such as audio, video, image and text, are continuously transmitted over the Internet. These messages are secured by several chaotic map-based encryption schemes and by traditional algorithms like Rivest Shamir Adleman (RSA), International Data Encryption Algorithm (IDEA) and Diffie Hellman. The traditional algorithms have drawbacks when applied to images, owing to the large data capacity, strong pixel correlation, security requirements and high sensitivity of image data. These limitations are reduced by block encryption, in which the image is divided into smaller blocks and encryption is applied to each block separately to create confusion and diffusion. Another solution for secure image encryption is logistic map-based cryptography, in which nonlinear logistic sequences are used to generate the cryptographic keys and so enhance the complexity of the cryptosystem [1, 2]. Chaotic maps have several useful features, such as high sensitivity to the initial state, pseudorandomness and unpredictability; therefore, they are widely used in image cryptosystems for pseudorandom number generation, permutation and diffusion [3]. One-dimensional (1D) chaotic maps are simple to develop for image cryptosystems because they have few parameters, but they use one variable and a simple chaotic orbit to generate the random values, which can be estimated by extracting only a little information [4]. Thus, two-dimensional (2D) chaotic maps are implemented to increase security; they use two variables for random number generation to encrypt real-time images [5]. Sine and cosine forms of 2D chaotic maps have also been applied to image encryption to enhance the complexity of cryptosystems [6]. The complexity of encryption can also be enhanced by using bioinformatics concepts such as deoxyribonucleic acid (DNA) to perform bit-level encryption [7]. DNA has several advantages, such as parallel computation, minimal energy expenditure and a huge storage capacity, and it has therefore become a powerful tool for secure and fast encryption [8, 9]. Hence, we developed a novel nonlinear 2D chaotic map and DNA sequences (NL2DCM-DNA)-based block image encryption technique, which combines the properties of the nonlinear 2D chaotic map and DNA sequences to provide secure and fast image encryption.
S. Stalin (✉), Research Scholar, Computer Science & Engineering Department, Rabindranath Tagore University, Bhopal 464993, India
P. Maheshwary, Professor, Computer Science & Engineering Department, Rabindranath Tagore University, Bhopal 464993, India
P. K. Shukla, Associate Professor, Computer Science & Engineering Department, University Institute of Technology, RGPV, Bhopal 462023, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_8
2 Nonlinear 2D Chaotic Map and DNA Sequences (NL2DCM-DNA)-Based Block Image Encryption 2.1 Nonlinear 2D Chaotic Map (NL2DCM) The 2D chaotic map is a highly complex, volatile and dynamic nonlinear system that has at least two positive Lyapunov exponents. Compared with a 1D chaotic map, the 2D chaotic map offers a larger key space and greater randomness and unpredictability, and hence higher security. We take the nonlinear equations of a 2D chaotic map (Eq. 1), and a key (K) is generated by using Eq. 2.
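$$\ddot{X}_1 = -\eta X_1 + \delta X_2, \qquad \ddot{X}_2 = -\mu X_1 + X_1 X_2 \tag{1}$$
$$K = \operatorname{mod}\left(\left\lceil \left|\ddot{X}_2 - \ddot{X}_1\right| \times 10^{12} / 10^{5} \right\rceil,\ 256\right) \tag{2}$$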
Here, η, δ and μ are the control parameters of the chaotic map. We take the parameter values η = 5, δ = 3 and μ = 0.3, and the initial values X_1 = 0.25 and X_2 = 0.35. mod(•) denotes the modulo operation, |•| denotes the absolute value and ⌈•⌉ denotes the ceiling operation.
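The key derivation of Eq. 2 can be sketched for a single pair of state values. This is a hedged illustration only: it assumes the reading of Eq. 2 in which the absolute difference of the two states is scaled by 10^12/10^5, rounded up and reduced modulo 256, and the function name `key_byte` is hypothetical. How the states themselves are produced is left to Eq. 1.

```python
import math


def key_byte(state_1, state_2):
    """One key value in 0..255 from two chaotic-map states (assumed reading of Eq. 2)."""
    scaled = abs(state_2 - state_1) * 10**12 / 10**5
    return math.ceil(scaled) % 256


# The initial values given in the text, used here only as sample inputs.
k = key_byte(0.25, 0.35)
```

The final modulo keeps every key value within the range of an 8-bit pixel, so a key sequence can be combined with a block pixel by pixel.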
2.2 Rules of Deoxyribo Nucleic Acid (DNA) Standard DNA contains four nucleic acid bases (A = Adenine, G = Guanine, C = Cytosine, T = Thymine), in which A is the complement of T and C is the complement of G, and vice versa. When the four bases are represented by the two-digit binary values 00, 01, 10 and 11 (for example, A, G, C and T as 00, 01, 10 and 11, respectively), a total of 24 coding rules are obtained, of which eight satisfy the Watson–Crick complement rules (Table 1). Each eight-bit pixel of the image is represented as a DNA sequence of length four; for example, "11011000" becomes "TCGA" under rule 4 and "GTAC" under rule 7. Thus, different DNA rules generate different outputs for the same input binary pixel. In DNA standards, the XOR, complement, addition and subtraction of DNA are defined on the basis of the standard binary operations (Tables 2 and 3).
Table 1 Rules of DNA standards
Rules   Rule1   Rule2   Rule3   Rule4   Rule5   Rule6   Rule7   Rule8
00      G       G       A       A       T       T       C       C
01      T       A       G       C       C       G       T       A
10      A       T       C       G       G       C       A       T
11      C       C       T       T       A       A       G       G
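The coding rules of Table 1 can be written down as lookup strings, which makes the "11011000" example from the text easy to check. The sketch below is illustrative: the dictionary layout and the function names `to_dna` and `from_dna` are assumptions, while the rule contents are copied from Table 1.

```python
# Table 1 as lookup strings: the base at index 0, 1, 2, 3 encodes 00, 01, 10, 11.
RULES = {
    1: "GTAC", 2: "GATC", 3: "AGCT", 4: "ACGT",
    5: "TCGA", 6: "TGCA", 7: "CTAG", 8: "CATG",
}


def to_dna(byte, rule):
    """Encode an 8-bit value as four bases, most significant bit pair first."""
    bases = RULES[rule]
    return "".join(bases[(byte >> shift) & 0b11] for shift in (6, 4, 2, 0))


def from_dna(sequence, rule):
    """Decode four bases back to the 8-bit value under the same rule."""
    bases = RULES[rule]
    value = 0
    for base in sequence:
        value = (value << 2) | bases.index(base)
    return value


pixel = 0b11011000
encoded_rule_4 = to_dna(pixel, 4)  # the text gives "TCGA" for this rule
encoded_rule_7 = to_dna(pixel, 7)  # the text gives "GTAC" for this rule
```

Decoding with a different rule than the one used for encoding yields a different byte, which is what the encryption steps exploit.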
Table 2 DNA XOR (⊗) and complement
⊗        T = 00   G = 01   C = 10   A = 11      DNA      Complement
T = 00   T        G        C        A           T = 00   A = 11
G = 01   G        T        A        C           G = 01   C = 10
C = 10   C        A        T        G           C = 10   G = 01
A = 11   A        C        G        T           A = 11   T = 00
Table 3 DNA addition (+) and subtraction (−)
+    T = 00   G = 01   C = 10   A = 11      −    T = 00   G = 01   C = 10   A = 11
00   T        G        C        A           00   T        A        C        G
01   G        C        A        T           01   G        T        A        C
10   C        A        T        G           10   C        G        T        A
11   A        T        G        C           11   A        C        G        T
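Tables 2 and 3 follow from ordinary two-bit arithmetic once the bases are numbered T = 00, G = 01, C = 10, A = 11. The sketch below reproduces them with integer operations; the function names are hypothetical, and the modulo-4 formulation is an interpretation that matches every entry of the two tables.

```python
BASES = "TGCA"  # index = 2-bit value used in Tables 2 and 3: T=00, G=01, C=10, A=11


def dna_xor(a, b):
    return BASES[BASES.index(a) ^ BASES.index(b)]


def dna_complement(a):
    return BASES[3 - BASES.index(a)]  # flips both bits: T<->A, G<->C


def dna_add(a, b):
    return BASES[(BASES.index(a) + BASES.index(b)) % 4]


def dna_sub(a, b):
    """Row base minus column base, as laid out in Table 3."""
    return BASES[(BASES.index(a) - BASES.index(b)) % 4]


addition_table = [[dna_add(row, col) for col in BASES] for row in BASES]
```

Subtraction undoes addition, `dna_sub(dna_add(x, y), y) == x`, which is what makes the DNA addition step of the cipher reversible.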
2.3 Block Image Encryption The proposed NL2DCM-DNA-based block image cryptosystem is described in the following steps.
Step 1: First, the image is divided into blocks of size p × q. The NL2DCM equation (Eq. 1) is iterated N times to increase the randomness and security and to eliminate adverse effects. Next, Eqs. 1 and 2 are evaluated another p × q (size of block) times for every block of the image to generate the key sequence (KS) (Eq. 3).
$$KS = \{K_1, K_2, K_3, \ldots, K_{p \times q}\} \tag{3}$$
Step 2: The pixel values of the image block are represented as a binary sequence $B_f^1$ (8 bits for each pixel), and the fifth DNA standard rule (see Table 1) is applied to convert the binary sequence $B_f^1$ into a DNA standard $DS_f^1$.
Step 3: The key sequence (KS) of the image block is converted from decimal to binary $B_k^1$ (8 bits for each key), and the fourth DNA standard rule (see Table 1) is applied to convert the binary sequence $B_k^1$ into a DNA standard $DS_k^1$. The DNA complement (see Table 2) is applied to every element of $DS_k^1$ to obtain $DS_k^2$. The DNA addition (Table 3) is performed between $DS_f^1$ and $DS_k^2$ for encryption to obtain a DNA standard $DS_f^3$.
Step 4: The DNA standard $DS_f^3$ is converted into a binary sequence $B_f^2$ by using the eighth DNA standard rule (Table 1).
Step 5: The next level of encryption is performed by applying a bitwise XOR between $B_f^2$ and $B_k^1$ to obtain the cipher binary sequence $B_c$.
Step 6: The binary sequence $B_c$ is converted into the cipher (encrypted) image (CI).
Decryption is performed in the reverse direction of the encryption process.
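Steps 2 to 6 can be traced for one pixel and one key value as follows. This is a sketch under stated assumptions, not the authors' MATLAB implementation: the function names are hypothetical, the key values are taken as given (Step 1 is not reproduced), and DNA addition and complement are computed with the numbering T = 00, G = 01, C = 10, A = 11 of Tables 2 and 3.

```python
RULES = {4: "ACGT", 5: "TCGA", 8: "CATG"}  # Table 1 columns used in Steps 2 to 4
ALGEBRA = "TGCA"                           # numbering of Tables 2 and 3


def to_dna(byte, rule):
    return "".join(RULES[rule][(byte >> s) & 0b11] for s in (6, 4, 2, 0))


def from_dna(sequence, rule):
    value = 0
    for base in sequence:
        value = (value << 2) | RULES[rule].index(base)
    return value


def complement(sequence):
    return "".join(ALGEBRA[3 - ALGEBRA.index(b)] for b in sequence)


def combine(seq_a, seq_b, sign):
    """Base-wise DNA addition (sign=+1) or subtraction (sign=-1)."""
    return "".join(
        ALGEBRA[(ALGEBRA.index(a) + sign * ALGEBRA.index(b)) % 4]
        for a, b in zip(seq_a, seq_b)
    )


def encrypt_pixel(pixel, key):
    ds_f1 = to_dna(pixel, 5)              # Step 2: pixel under rule 5
    ds_k2 = complement(to_dna(key, 4))    # Step 3: key under rule 4, then complement
    ds_f3 = combine(ds_f1, ds_k2, +1)     # Step 3: DNA addition
    b_f2 = from_dna(ds_f3, 8)             # Step 4: back to binary under rule 8
    return b_f2 ^ key                     # Step 5: bitwise XOR with the key


def decrypt_pixel(cipher, key):
    ds_f3 = to_dna(cipher ^ key, 8)       # undo Steps 5 and 4
    ds_k2 = complement(to_dna(key, 4))
    ds_f1 = combine(ds_f3, ds_k2, -1)     # DNA subtraction undoes the addition
    return from_dna(ds_f1, 5)             # undo Step 2


def encrypt_block(pixels, keys):
    """Step 6 for a flattened block: one cipher value per pixel."""
    return [encrypt_pixel(p, k) for p, k in zip(pixels, keys)]
```

Running `decrypt_pixel(encrypt_pixel(p, k), k)` returns `p` for every pixel and key value, which mirrors the statement that decryption reverses the encryption process.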
3 Result and Analysis The experiment is implemented in MATLAB on an Intel(R) Core(TM) i3 processor with 4 GB RAM and the Windows 8.1 operating system. The experimental results of NL2DCM-DNA are compared with several encryption techniques [5, 6, 8–10]. We take two widely used images, Lena (256 × 256) and human brain (248 × 200), for the experiment, and the chaotic map is initially iterated N = 500 times. The Lena image (256 × 256) is divided into 16 blocks of size 64 × 64, and the human brain image (248 × 200) is divided into 16 blocks of size 62 × 50. Each block is encrypted by using the NL2DCM equations, and the blocks are finally combined to form the cipher image (Fig. 1).
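The block partition described above can be sketched with nested lists. This is an illustrative helper, with a hypothetical name and a dummy all-zero image standing in for Lena; it only shows how a 256 × 256 image yields 16 blocks of 64 × 64.

```python
def split_into_blocks(image, block_rows, block_cols):
    """Split a 2D list into non-overlapping blocks, scanned row by row."""
    rows, cols = len(image), len(image[0])
    blocks = []
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            blocks.append([row[c:c + block_cols] for row in image[r:r + block_rows]])
    return blocks


# Dummy 256 x 256 image, for illustration only.
image = [[0] * 256 for _ in range(256)]
blocks = split_into_blocks(image, 64, 64)
```

Each block can then be encrypted independently with its own key sequence before the blocks are recombined in the same scan order.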
Fig. 1 Encrypted images of Lena and human brain. First column (plain images), second (block of images), third (encrypted blocks) and fourth (cipher images)
3.1 Analysis of Secure Key The encryption of an image depends directly on the sensitivity of the key sequence, so a small modification in the initial values of the parameters gives a completely different result. In Fig. 2, the first and third columns show the images decrypted with the correct initial key values {$X_1^0 = 0.25$, $X_2^0 = 0.35$}, and the second and fourth columns show the images decrypted with the incorrect key values {$X_1^0 = 0.25 + 10^{-12}$, $X_2^0 = 0.35$}. This shows that the proposed NL2DCM-DNA is highly resistant to exhaustive attack.
Fig. 2 Decrypted images of Lena and human brain
Fig. 3 Histogram of Lena, cipher Lena, human brain and cipher human brain
3.2 Security of Key Size The key sequence is generated from two initial values {$X_1^0 = 0.25$, $X_2^0 = 0.35$}. If the precision is $10^{-12}$, then the key space is $10^{12 \times 2} = 10^{24} \approx 2^{80}$, which is large enough to provide robustness against brute force attack.
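The conversion of the key space from a power of ten to a power of two can be checked with a short worked step, assuming two independent initial values that are each resolved to a precision of $10^{-12}$:

```latex
\log_2\!\left(10^{12 \times 2}\right) = 24 \log_2 10 \approx 24 \times 3.3219 \approx 79.7,
\qquad \text{so} \qquad 10^{24} \approx 2^{80}.
```

In other words, the two initial values together are comparable to a key of roughly 80 bits.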
3.3 Analysis of Histogram The cipher images are analyzed in the form of histograms, in which a fairly uniform distribution of pixel values reduces the relations between pixels. Therefore, NL2DCM-DNA achieves higher robustness against statistical attacks, because the cipher image does not provide any statistical information (Fig. 3).
3.4 Analysis of Information Entropy Information entropy measures the randomness of the pixels of the block images. A high entropy means that only little information can be obtained from the cipher image values to recover the original plain image, which enhances the security of the image encryption (Eq. 4).
$$H = -\sum_{m=1}^{256} PR_m \log_2 PR_m \tag{4}$$
where $PR_m$ is the probability of the m-th cipher image value. The results show that NL2DCM-DNA obtains an information entropy of the cipher image that is closer to the ideal entropy value of 8 than those of several previous algorithms [5, 8–10], which makes it difficult for an attacker to decrypt the cipher image (Table 4).
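Equation 4 can be evaluated directly from the pixel values of an image. The sketch below is a minimal illustration with a hypothetical function name; it assumes 8-bit pixel values and a base-2 logarithm, which is the base for which the ideal value is 8.

```python
import math
from collections import Counter


def information_entropy(pixels):
    """Shannon entropy (in bits) of a flat list of pixel values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Two extreme toy inputs: every value equally likely, and a constant image.
uniform = information_entropy(list(range(256)))
constant = information_entropy([7] * 100)
```

A perfectly uniform cipher image reaches the upper bound of 8 bits, while a constant image carries no uncertainty at all, so values close to 8 in Table 4 indicate a nearly uniform distribution of cipher pixel values.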
Table 4 Information entropy comparison
Images        Input images   Cipher image
                             NL2DCM-DNA   Guodong Ye et al. [10]   Ye Tian et al. [8]   Fayza Elamrawy et al. [5]   Shuliang Sun [9]
Lena          7.2426         7.9981       7.9973                   7.9975               7.9973                      7.9972
Human brain   7.6571         7.9987       7.9976                   7.9975               7.9962                      7.9880
Table 5 Time of key sequence generation (seconds)
Images        NL2DCM-DNA   Xiaojun Tong et al. [6]   Guodong Ye et al. [10]   Ye Tian et al. [8]
Lena          0.2783       0.3421                    0.7856                   1.2765
Human brain   0.3123       0.332                     0.8354                   1.154
Fig. 4 Time of key sequence generation of Lena and human brain images (chart of key sequence generation time per image for the compared methods)
3.5 Time of Key Sequence Generation Key sequences are generated by using the NL2DCM equations, and the running time is compared with several previous algorithms [6, 8, 10]. Table 5 and Fig. 4 illustrate that the proposed NL2DCM-DNA generates the key sequence in the least time.
4 Conclusion The combination of a nonlinear 2D chaotic map and DNA sequences (NL2DCM-DNA) is used for block image encryption. Several key sequences are obtained for encryption by using the nonlinear 2D chaotic map, and fast block image encryption is achieved by using DNA standards. The security of the image cryptosystem is increased by utilizing the complex DNA standards and the nonlinear nature of the chaotic map. The experimental results show the higher performance of NL2DCM-DNA in terms of security, entropy, histogram, attack resilience, running time and diffusion compared with some previous image encryption algorithms.
References
1. Khare, A., Shukla, P.K., Rizvi, M.A., Stalin, S.: An intelligent and fast chaotic encryption using digital logic circuits for ad-hoc and ubiquitous computing. Entropy MDPI 18(201), 1–27 (2016). https://doi.org/10.3390/e18050201
2. Shukla, P.K., Khare, A., Rizvi, M.A., Stalin, S., Kumar, S.: Applied cryptography using chaos function for fast digital logic-based systems in ubiquitous computing. Entropy MDPI 17, 1387–1410 (2015). https://doi.org/10.3390/e17031387
3. Keuninckx, L., Soriano, M.C., Fischer, I., Mirasso, C.R., Nguimdo, R.M., Sande, G.V.D.: Encryption key distribution via chaos synchronization. Sci. Rep. 1–14 (2017). https://doi.org/10.1038/srep43428
4. Usama, M., Zakaria, N.: Chaos-based simultaneous compression and encryption for hadoop. PLoS ONE 12(1), 1–29 (2017). https://doi.org/10.1371/journal.pone.0168207
5. Elamrawy, F., Sharkas, M., Nasser, A.M.: An image encryption based on DNA coding and 2D logistic chaotic map. Int. J. Signal Process. 3, 27–32 (2018)
6. Tong, X., Liu, Y., Zhang, M., Xu, H., Wang, Z.: An image encryption scheme based on hyperchaotic Rabinovich and exponential chaos maps. Entropy MDPI 17, 181–196 (2015). https://doi.org/10.3390/e17010181
7. Mondal, B., Mandal, T.: A light weight secure image encryption scheme based on chaos & DNA computing. J. King Saud Univ. Comput. Inf. Sci. 1–6 (2016). https://doi.org/10.1016/j.jksuci.2016.02.003
8. Tian, Y., Lu, Z.: Novel permutation-diffusion image encryption algorithm with chaotic dynamic S-box and DNA sequence operation. AIP Adv. 1–23 (2017)
9. Sun, S.: Chaotic image encryption scheme using two-by-two deoxyribonucleic acid complementary rules. Opt. Eng. 56(11), 1–10 (2017)
10. Ye, G., Jiao, K., Pan, C., Huang, X.: An effective framework for chaotic image encryption based on 3D logistic map. Hindawi Secur. Commun. Netw. 1–11 (2018). https://doi.org/10.1155/2018/8402578
A Language-Independent Speech Data Collection and Preprocessing Technique S. M. Saiful Islam Badhon, Farea Rehnuma Rupon, Md. Habibur Rahaman, and Sheikh Abujar
Abstract Virtual assistants and human-like robots are among the most attractive technologies today, and both depend on communication. Verbal communication is the most comfortable form of communication, which creates the need for voice recognition systems. Researchers have recently focused on developing voice recognition systems with the help of machine learning and deep learning algorithms, and for that they need a large amount of audio data. In this research, we discuss the whole process, from collecting audio files from different speakers to representing them in numeric form so that machine learning or deep learning algorithms can be applied to them. The process has some requirements and needs a specific format; instructing the speakers, organizing the speeches and preparing the scripts are the most important steps. Everything described in this work is based on our own experience of working with speech data. The dataset of this work contains exactly 10,992 speech samples of 1000 unique words from more than 50 speakers. Keywords Voice recognition systems · Machine learning · Deep learning algorithm
S. M. Saiful Islam Badhon (B) · F. R. Rupon · Md. Habibur Rahaman · S. Abujar Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh e-mail: [email protected] F. R. Rupon e-mail: [email protected] Md. Habibur Rahaman e-mail: [email protected] S. Abujar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_9
1 Introduction Speech is one of the core mediums of human–computer interaction. Speech recognition software can usually produce words as quickly as they are uttered, which is faster than typing the words or sentences. As a result, speech identification has become a serious research topic in recent years. Voice recognition has many applications that attract researchers, such as virtual assistants, automated call centers, surveillance systems and voice typing systems. Moreover, this type of system is easy to use, and it can be useful for people who are unable to type. Researchers are trying to construct efficient voice identification systems, although machines cannot yet match human performance in terms of response speed and accuracy. Basically, speech is a set of sound waves created by our vocal cords. A voice recognition system needs a huge amount of audio data. To collect the data, a microphone records the sound waves and converts them to an electrical signal. The data are then processed with signal processing techniques that isolate the words and syllables. Dealing with audio data is among the most problematic data preprocessing tasks. In this work, we tried to establish a procedure for preprocessing audio data for any voice-to-text conversion research. As English is the international language, there is a lot of work in this specific sector [1–3]. Besides that, other languages such as Bangla [4, 5], Hindi [6], Arabic [7] and Chinese [8] also have some research and implemented work on voice to text. Researchers working on other languages are also trying to develop their own datasets and systems. For that, everyone needs a rich dataset, and in this sector of machine learning, data are crucial and tough to collect. If speech data are collected or preprocessed incorrectly, all the hard work is wasted.
We hope this paper will help new researchers in the speech sector of natural language processing to make proper and maximum use of speech data. For this process, we used multiple popular technologies such as Python, Librosa, NumPy and Pydub. Our method worked with 1000 unique words. By understanding the methodology and technology, other researchers can modify it and build a preprocessing system of their own.
2 Literature Review There are many speech datasets and research works in different languages. One example is Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition by Pete Warden (2018) [9], which has 35 unique words. Another is the SHRUTI Bengali speech corpus, which was developed by researchers of IIT Kharagpur [10] and is one of the largest open-source Bengali datasets. There is another dataset that was developed by Chaudhuri S. et al. for detecting activities in movies [11]. They collected 46 h of movie videos, which they labeled and segmented. Besides these datasets, there are some papers whose datasets are not so rich but which present quality work; some of them are described here. In 2019, Benkerzaz, Yussef et al. provided a description of the core concepts of automatic speech recognition (ASR) [12], which is considered a significant sector of AI. In that paper, they interpreted the characteristics and general architecture of an ASR system and then examined several recent studies to point out the problems and the various proposed solutions. After comparing several works, they concluded that the neural network is the most common solution for this kind of system. For future work, they want to create a model that processes information with a neural network using the human voice and other senses. In 2016, Pala Mahesh Kumar proposed a speech identification system using the summation of decimated wavelet and Relative Spectra Algorithm with LPC [13]. This study discussed an efficient and comprehensive feature extraction method for voice processing. First, the proposed methods were applied to the training voice signals to construct a train feature vector from low-level extracted features, LPC and wavelet. The same procedure was then applied to the testing voice signals to build a test feature vector. The two feature vectors were compared by measuring the Euclidean distance between them in order to recognize the voice and the speaker. They concluded that if the difference between the vectors is close to zero, the tested voice and the trained voice match each other. They collected 50 preloaded speech signals from six individuals and achieved 90% accuracy. All the above datasets are rich and well formatted and give us information such as the number of unique words, utterances and classes, as well as the size and duration of the speech dataset. However, no paper gives a clear, descriptive instruction for the collection and preprocessing of speech data, which is the gap this paper addresses.
3 Sample Dataset This work has its own dataset of 1000 unique Bangla words. Almost every word has 11 speakers, the total number of samples is more than 10,000, and the age range of the speakers is 22–45. A small portion of the data is shown in Table 1. Table 1 gives an idea of the number of speakers and the ratio of male to female speakers. The ratio of male to female speakers is 43:57.
4 Proposed Methodology 4.1 Data Collection The very first task is collecting audio from different people, which is very sensitive work. For data collection, we need to prepare scripts. For research purposes, we also need some pause between continuous speeches. That is why the speakers should be instructed to read the script with a minimum pause of 0.5 s between words.
Table 1 Sample dataset
Word  No. of utterance  Male utterance  Female utterance
…     11                4               7
…     11                4               7
…     11                6               5
…     11                4               7
…     11                4               7
…     11                5               6
…     11                4               7
…     11                4               7
…     11                4               7
…     11                4               7
Fig. 1 Representation of audio scripts
The wave in Figure 1 is the audio of a Bangla script.
4.2 Background Noise Removal Any type of background noise can badly affect work with speech. It can cause problems at any step: wrong segmentation, labeling with the wrong word, or treating non-voice as a word. To remove background noise, we selected a threshold on the loudness of the sound. Whenever the measured loudness is less than the threshold, that sound is removed.
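The thresholding idea above can be sketched as a simple noise gate. This is only a minimal illustration under our own assumptions, not the exact procedure used for the dataset: the names `frame_rms` and `gate_noise`, the frame length of 400 samples and the threshold of 0.02 are hypothetical, loudness is measured as the RMS of each frame, and quiet frames are muted rather than cut out.

```python
import numpy as np

def frame_rms(samples, frame_len):
    """Return the RMS loudness of each complete, non-overlapping frame."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def gate_noise(samples, frame_len=400, threshold=0.02):
    """Mute every complete frame whose RMS loudness is below the threshold.

    frame_len and threshold are illustrative values (assumptions); in practice
    they would be tuned to the recording conditions.
    """
    cleaned = np.asarray(samples, dtype=np.float64).copy()
    for i, level in enumerate(frame_rms(cleaned, frame_len)):
        if level < threshold:
            cleaned[i * frame_len:(i + 1) * frame_len] = 0.0
    return cleaned

# Synthetic check: faint noise, a 440 Hz tone, faint noise again (16 kHz).
rng = np.random.default_rng(0)
noise = 0.005 * rng.standard_normal(1600)
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
signal = np.concatenate([noise, tone, noise])
gated = gate_noise(signal)
```

In this sketch the noisy segments are zeroed while the tone passes through unchanged; a trailing partial frame, if any, is left untouched.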
4.3 Silence Removal Before and After Long scripts may have non-voice parts before and after the speech, which need to be removed. The difference between Figs. 1 and 2 shows the importance of silence removal. For that, we used Algorithm 1.
Fig. 2 Audio file before and after silence removal
Algorithm 1 Silence removal before and after
Initialize the audio file path
Select a suitable threshold for silence
Initialize an empty list pickPoints
While within the duration of the audio file do
    If RMS < threshold then
        pickPoints.append(time of audio file)
    else
        continue
    end
End while
Reverse the audio file and get the end point in the same way
Finally, crop the audio file with the start and end pick points
End
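One possible reading of Algorithm 1 is sketched below with NumPy only. It is an illustration under stated assumptions rather than the authors' code: the helper names `first_loud_frame` and `trim_silence`, the frame length of 160 samples and the threshold of 0.02 are hypothetical, and the scan stops at the first frame whose RMS reaches the threshold.

```python
import numpy as np

def first_loud_frame(samples, frame_len, threshold):
    """Return the start index of the first frame whose RMS reaches the threshold."""
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) >= threshold:
            return start
    return len(samples)  # the whole signal is below the threshold

def trim_silence(samples, frame_len=160, threshold=0.02):
    """Crop leading and trailing silence (frame_len and threshold are assumed values)."""
    samples = np.asarray(samples, dtype=np.float64)
    start = first_loud_frame(samples, frame_len, threshold)
    # Reverse the signal and scan again to find the end point, as in Algorithm 1.
    tail = first_loud_frame(samples[::-1], frame_len, threshold)
    end = len(samples) - tail
    return samples[start:end] if start < end else samples[:0]

# Synthetic check: 800 silent samples, a tone of 1600 samples, 480 silent samples.
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
recording = np.concatenate([np.zeros(800), tone, np.zeros(480)])
trimmed = trim_silence(recording)
```

The crop is accurate to one frame, so a smaller `frame_len` gives tighter boundaries at the cost of a noisier loudness estimate.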
4.4 Split Audio File into Small Tokens Now, we need to tokenize the long script into small tokens. As our long scripts have a minimum pause of 0.5 s between words, we can split them at these pauses. Algorithm 2 helps us to get the tokens.
Algorithm 2 Tokenize the audio files
Read the audio file
Threshold silence length = 500 ms
Threshold silence loudness = −40 dB
While within the duration of the audio file do
    Detect loudness
    Count duration of silence
    If audio silence >= threshold silence length and silence loudness …

… ΔH           RB > ΔH        RC > ΔH        No-fault
RA < ΔL        RB > ΔL        RC > ΔL        AG
RA > ΔL        RB < ΔL        RC > ΔL        BG
RA > ΔL        RB > ΔL        RC < ΔL        CG
RA < ΔL        RB < ΔL        RC > ΔH        AB
RA > ΔH        RB < ΔL        RC < ΔL        BC
RA < ΔL        RB > ΔH        RC < ΔL        CA
RA < ΔL        RB < ΔL        ΔL ≤ RC ≤ ΔH   ABG
ΔL ≤ RA ≤ ΔH   RB < ΔL        RC < ΔL        BCG
RA < ΔL        ΔL ≤ RB ≤ ΔH   RC < ΔL        CAG
RA < ΔL        RB < ΔL        RC < ΔL        LLL
Table 4 Overall classification accuracy of the proposed classifier
SLG   DL        DLG       LLL   Overall accuracy
100   98.5714   99.5238   100   99.4286
4 Conclusion A simple yet effective fault classifier model is proposed in this work using correlation-based analysis of the phase voltages and phase currents of all three lines. It is observed that in case of a fault, the phase voltage gradually falls and the phase current increases abruptly. This interrelation is represented quantitatively to develop a correlation coefficient-based fault classifier algorithm. The method uses only Pearson's linear correlation coefficient to quantify the effects. The proposed method is simpler than the computationally heavier ANN, PNN or wavelet transform-based analyses. The accuracy of the proposed classifier is found to be 99.4286%. The proposed method is computationally very light and hence requires less computational time. The proposed classifier requires only (3/20) cycles of post-fault transient signal, which is better than most other works, which mostly require at least a quarter cycle [2], a half cycle of post-fault signal [4, 8] or more. Besides, the four wrong results out of the 700 test data belong to the DL or DLG fault classes. This error is not of very high technical significance, in the sense that in either case the two affected lines have to be isolated in the same way using the circuit breakers. The proposed method is highly accurate even in scenarios with natural power system noise. Hence, on an overall analysis, the proposed method is highly acceptable for its high accuracy and low detection time.
A Correlation-Based Classification of Power System Faults …
References 1. Chen, K., Huang, C., He, J.: Fault detection, classification and location for transmission lines and distribution systems: a review on the methods. High Voltage 1(1), 25–33 (2016) 2. Jain, A., Thoke, A.S., Patel, R.N.: Fault classification of double circuit transmission line using artificial neural network. Int. J. Electr. Syst. Sci. Eng. 1(4), 750–755 (2008) 3. Raval, P.D., Pandya, A.S.: Accurate fault classification in series compensated multi-terminal extra high voltage transmission line using probabilistic neural network. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp. 1550– 1554, Mar 2016. IEEE 4. Dasgupta, A., Nath, S., Das, A.: Transmission line fault classification and location using wavelet entropy and neural network. Electr. Power Compon. Syst. 40(15), 1676–1689 (2012) 5. Ekici, S., Yildirim, S., Poyraz, M.: Energy and entropy-based feature extraction for locating fault on transmission lines by using neural network and wavelet packet decomposition. Expert Syst. Appl. 34(4), 2937–2944 (2008) 6. Roy, N., Bhattacharya, K.: Detection, classification, and estimation of fault location on an overhead transmission line using S-transform and neural network. Electr. Power Compon. Syst. 43(4), 461–472 (2015) 7. Gopakumar, P., Reddy, M.J.B., Mohanta, D.K.: Fault detection and localization methodology for self-healing in smart power grids incorporating phasor measurement units. Electr. Power Compon. Syst. 43(6), 695–710 (2015) 8. Mukherjee, A., Kundu, P., Das, A.: Identification and classification of power system faults using ratio analysis of principal component distances. Indones. J. Electr. Eng. Comput. Sci. 12(11), 7603–7612 (2014) 9. Sinha, A.K., Chowdoju, K.K.: Power system fault detection classification based on PCA and PNN. In: 2011 International Conference on Emerging Trends in Electrical and Computer Technology, pp. 111–115, Mar 2011. IEEE 10. 
Yu-Wu, C., Yu-Hong, G.: Fault phase selection for transmission line based on correlation coefficient. In: 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), vol. 7, pp. V7–350–54 (2010). IEEE 11. Haomin, C., Peng, L., Xiaobin, G., Aidong, X., Bo, C., Wei, X., Liqiang, Z.: Fault prediction for power system based on multidimensional time series correlation analysis. In: 2014 China International Conference on Electricity Distribution (CICED), pp. 1294–1299, Sept 2014. IEEE 12. Zheng, Z., Liu, J., Yu, H.: Fault location on transmission line using maximum correlation coefficient method. In: 2012 Annual Report Conference on Electrical Insulation and Dielectric Phenomena, pp. 226–229, Oct 2012. IEEE 13. Dasgupta, A., Debnath, S., Das, A.: Transmission line fault detection and classification using cross-correlation and k-nearest neighbor. Int. J. Knowl.-Based Intell. Eng. Syst. 19(3), 183–189 (2015) 14. Rajamani, P., Dey, D., Chakravorti, S.: Cross-correlation aided wavelet network for classification of dynamic insulation failures in transformer winding during impulse test. IEEE Trans. Dielectr. Electr. Insul. 18(2), 521–532 (2011) 15. Lin, D., Jun, P., Wenxia, S., Jun, T., Jun, Z.: Fault location for transmission line based on traveling waves using correlation analysis method. In: 2008 International Conference on High Voltage Engineering and Application, pp. 681–684, Nov 2008. IEEE 16. Shu, H., An, N., Yang, B., Dai, Y., Guo, Y.: Single pole-to-ground fault analysis of MMCHVDC transmission lines based on capacitive fuzzy identification algorithm. Energies 13(2), 319 (2020)
A Wavelet Entropy-Based Power System Fault Classification for Long Transmission Lines Alok Mukherjee, Palash Kumar Kundu, and Arabinda Das
Abstract This paper describes a simple wavelet entropy-based method for classification of transmission line faults using wavelet entropy analysis of sending end fault current waveforms of one cycle post-fault duration. The fault transients are scaled with respect to the peak value under no-fault condition for the respective phases. These three phase scaled current signals are fed to the wavelet classifier model to extract fault features in terms of wavelet entropy values. The variation in the three phase entropy for the ten fault classes provides enough features for distinct differentiation among different fault conditions. Two threshold values are identified from a detailed analysis of the fault class entropies, which helps to develop the fault classifier rule base and, in turn, the fault signatures. The unknown class is identified by direct comparison of the three phase test entropies with the fault class signatures. The proposed classifier produces 99.2857% accuracy in classification with one cycle post-fault data. Keywords Wavelet entropy · Fault class entropies · Cycle post
A. Mukherjee (B) Electrical Engineering, Government College of Engineering and Ceramic Technology, Kolkata 700010, India e-mail: [email protected] P. K. Kundu · A. Das Department of Electrical Engineering, Jadavpur University, Kolkata 700032, India e-mail: [email protected] A. Das e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_13
1 Introduction The electrical power system is one of the largest interconnected systems, and it often suffers minor to severe faults, especially due to environmental conditions like storm, snow and rain. Sometimes these faults are temporary, and at other times they are permanent in nature, requiring manual intervention of the operating personnel. Removal of faults is imperative for recovery and stability of the system and restoration of normal operation. Identification of the different classes of faults is very important in this regard. This article presents a wavelet energy entropy (WEn)-based analysis for classification of long overhead transmission line faults. It is observed that high frequency transients appear immediately on occurrence of a fault, both in the phase currents and in the voltages. One cycle post-fault sending end current signals of the three phases are used in this work as the test data. These are analyzed to extract some key features of each class of faults in terms of the three phase WEn. These WEn are analyzed to find two threshold values which help in developing the classifier logic. The three phase entropies of the test case are compared with the classifier logic-based fault signature table to predict the class of fault directly. The proposed classifier is found to produce a high classifier accuracy of 99.2857%. Fault analysis has long been practiced by scientists. Several methods have been adopted in different researches for detection, classification and localization of power system faults [1]. Artificial neural network (ANN) and probabilistic neural network (PNN) are very useful in fault classification due to their inherent pattern recognition properties [2, 3]. Many researchers use wavelet transform (WT)-based analysis and, very often, a combination of ANN and WT [4, 5]. Wavelet analysis, which is used in this work, is one of the most powerful tools for classification and distance prediction of power system faults. Wavelet packet decomposition and WEn are widely and effectively used, especially for classification of faults [1–3]. WEn extracts vital information about the extent of disturbance associated with a multiple frequency signal, like that of the fault transient oscillations. In this work, this feature identifying property of the WEn has been utilized.
Apart from this method, researchers have developed several methodologies for fault transient analysis, classification and distance estimation [4]. Artificial neural network (ANN) and probabilistic neural network (PNN) are also used extensively for fault treatment [5, 6]. PNN has inherent characteristics of pattern recognition, which is effectively used for classification [6]. Wavelet transform (WT) is another tool widely used for fault analysis [7]. Hybrid methods are extremely popular nowadays since they combine the advantages of both schemes. Combination of ANN and WT has often been practiced with successful outputs [1, 2]. WT is also used in combination with fuzzy logic to produce accurate fault diagnosis [8, 9]. Principal component analysis (PCA)-based works have also produced good results [10, 11]. PCA has also been combined with WT in many researches [12]. However, the linearity analysis of PCA sometimes reduces the accuracy to some extent.
2 Methods 2.1 System Design and Data Collection A 132 kV, 50 Hz, 150 km long overhead transmission line is designed and simulated in EMTP/ATP software using fifteen LCC blocks of 10 km each. Ten major classes of faults have been conducted at fourteen intermediate locations, each 10 km apart. The sampling frequency is kept as 100 kHz, which produces 2000 samples per cycle. Only one cycle post-fault three phase line currents are collected from the sending end as the fault data for the proposed work. Four major classes of faults are considered in this work, three of which are further classified into three subclasses. Hence, the total number of fault classes becomes ten. These classes are: Single line to ground (SLG: AG, BG and CG), double line fault (DL: AB, BC and CA), double line to ground fault (DLG: ABG, BCG and CAG) and three line fault (LLL: ABC). One cycle post-fault sending end current signals are scaled only in order to provide uniformity. Filtering of the data is not required, thus saving vital computation time. The absolute values of the scaled noise contaminated current signals are directly fed to a wavelet-based classifier module. The idea of the proposed scheme and the algorithm design is described in the following sections with the help of some case studies.
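The sampling arithmetic and the scaling step described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `scale_phase_current` and the synthetic sinusoidal currents are hypothetical, and the peak of one healthy cycle is taken as the scaling reference for the respective phase.

```python
import numpy as np

FS_HZ = 100_000                      # sampling frequency stated in the text
F0_HZ = 50                           # power frequency stated in the text
SAMPLES_PER_CYCLE = FS_HZ // F0_HZ   # 2000 samples in one cycle

def scale_phase_current(fault_cycle, healthy_cycle):
    """Scale one post-fault cycle by the no-fault peak and take absolute values."""
    peak = np.max(np.abs(healthy_cycle))
    return np.abs(np.asarray(fault_cycle, dtype=np.float64) / peak)

# Synthetic one-cycle currents (amplitudes are assumptions for illustration only).
t = np.arange(SAMPLES_PER_CYCLE) / FS_HZ
healthy = 200.0 * np.sin(2 * np.pi * F0_HZ * t)
faulted = 900.0 * np.sin(2 * np.pi * F0_HZ * t)
scaled_healthy = scale_phase_current(healthy, healthy)
scaled_fault = scale_phase_current(faulted, healthy)
```

With this scaling a healthy phase peaks at 1, so a faulted phase is expressed as a multiple of the no-fault peak regardless of the absolute current level.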
2.2 Analysis of Fault Signals The three phase current signals for the different major classes of faults and under healthy condition are shown in Fig. 1. It is observed that for each class of fault, the directly affected line is disturbed to the highest amplitude. The other, indirectly affected phases are disturbed less. Further, it is observed that for DL faults, the third un-affected line has minimum deviation from the no-fault condition, whereas in case of a DLG fault, the third line is affected to a higher degree than the third line of the DL fault. This is primarily due to the circulation of zero sequence components of current through the ground-faulted line(s) and the grounded neutral in case of ground faults like SLG and DLG. These inferences are also observed from the few cases illustrated in Fig. 1. It is observed that for the AG fault, the directly affected A phase is disturbed far more than the un-faulted B and C phases. It is also observed that the un-faulted lines B and C of the AG fault are disturbed more than the un-faulted line C of the AB fault. The proposed work estimates the disturbance of the fault current of each phase in terms of the wavelet entropies (WEn). These are compared to those of the healthy condition to measure the extent of disturbance. These WEn are analyzed to bring out two major threshold values: the direct fault line threshold and the un-affected ground line threshold. These two limiting values are used to develop the fault classifier logic and hence the fault signature table. The test fault with unknown class is analyzed similarly to find the three phase WEn, which are compared to the fault signature table to predict the fault class. A quantitative approach in support of the above analysis is shown in the next section.
Fig. 1 Three phase current signals for different major classes of faults
3 Results and Discussion 3.1 Quantitative Representation of Wavelet Entropy Values: Case Study The fault transient signals are analyzed using the proposed scheme to obtain the extent of disturbance in terms of WEn. The WEn corresponding to ten different classes of faults, conducted at three intermediate locations, is given in Table 1, as a case study. This shows the variation in WEn for variation in fault location.
Table 1 Three phase wavelet entropy values (WEn) for ten fault classes corresponding to faults conducted at 30 km, 70 km and 110 km, Rf = 1
Fault     Fault location 30 km               Fault location 70 km               Fault location 110 km
class     WEnA       WEnB       WEnC         WEnA       WEnB       WEnC         WEnA       WEnB       WEnC
Healthy   188.853    180.188    185.532      192.024    182.196    185.761      189.456    183.949    186.216
AG        −699,198   154.504    42.3159      −118,476   162.105    19.7805      −45,449    164.674    −2.2565
BG        150.944    −424,082   147.562      116.532    −108,815   143.985      91.8894    −43,206    145.695
CG        148.253    131.895    −585,166     145.629    53.9595    −201,006     143.169    17.4038    −93,976
AB        −282,718   −275,975   185.065      −61,415    −53,790    185.036      −22,897    −16,851    184.92
BC        190.155    −633,566   −610,454     189.904    −248,321   −227,698     192.292    −118,633   −103,380
CA        −410,855   180.208    −450,472     −157,023   180.89     −190,908     −76,924    179.38     −102,272
ABG       −267,424   −553,855   91.4282      −69,893    −124,348   54.8024      −31,583    −43,061    34.4363
BCG       121.29     −565,271   −802,642     104.873    −203,890   −305,027     88.5973    −93,097    −143,900
CAG       −358,774   84.607     −675,636     −118,936   43.427     −280,251     −59,044    25.0434    −144,278
ABC       −367,769   −656,723   −850,520     −109,909   −197,520   −361,220     −53,829    −80,607    −182,613

3.2 Analysis of Wavelet Entropies of Different Fault Classes It is observed from Table 1 that WEn varies with the fault location as well as with the fault class. It is further observed that as the disturbance increases from the no-fault condition, the WEn values move away from the no-fault WEn and become negative. In case of directly affected lines, the WEn becomes highly negative. The closest match of three phase WEn is observed between DL and DLG faults, where only the un-affected line causes the major difference. The un-faulted lines of DL faults are found to produce entropy similar to no-fault, whereas the same for DLG faults is found to be less than the no-fault WEn, although it is much higher than the direct fault line WEn. Hence, the DL and DLG faults are treated and distinguished based on these third line WEn values. In order to develop the algorithm, two threshold values of WEn are constructed as mentioned. Faults are carried out at three locations along the 150 km designed line for the development of the algorithm: at 10 km, which is the first junction after the source end, at 140 km, which is the final junction of the line, and at 70 km, which lies almost at the middle. The three phase WEn for all major classes of faults conducted at these locations are collected and arranged. The limiting values of WEn for these locations and for the respective broad fault classes are given in Table 2.

Table 2 Highest and the lowest values of WEn for major classes of faults
Class of   Directly faulted line:              Indirectly faulted line:   Indirectly faulted line:
fault      closest WEn to no-fault condition   highest WEn                lowest WEn
SLG        −22,293.415                         165.845                    −23.041
DL         −6058.792                           192.517                    178.688
DLG        −18,986.227                         130.094                    11.902
LLL        −35,790.175                         –                          –
No-Fault   Highest WEn: 192.108; Lowest WEn: 181.956
3.3 Selection of Threshold Values It is observed from Table 1 that the WEn of the directly affected phase is many folds lower than the no-fault WEn. Hence, classification of SLG faults can be done directly by looking for the single very high negative value of WEn. It is further observed from Table 2 that the closest WEn to the no-fault condition for the directly affected lines is −6058.792, obtained for a DL fault. Even this WEn is far away from the no-fault WEn. The lowest WEn of an indirectly faulted line is found as −23.041. Hence, the direct fault line threshold (θF) is selected safely as −1000. Thus, if any one of the three phases produces WEn less than θF, it is assumed to be directly affected for any fault class. The DL and DLG classes are separated using the second threshold, i.e., the un-affected ground line threshold (θG). In case of both DL and DLG faults, both the faulted lines are found to produce very low values of entropy, less than θF. It is further observed that the range of WEn for the un-affected line for DL faults (178.688–192.517) and that of the no-fault class (181.956–192.108) are practically similar, and hence the un-faulted line of DL faults may be assumed to remain almost un-affected and to behave similar to no-fault lines. The same range in case of DLG faults is found much lower (11.902–130.094), i.e., moving towards the faulted line condition. Hence, the distinction can be made between DL and DLG by comparing the third line WEn. θG is therefore selected midway between the upper limit of WEn for the un-affected line of DLG (130.094) and the lowest WEn of the same for DL (178.688). Hence, θG is chosen as the near rounded average of 130.094 and 178.688, i.e., 155. The algorithm is developed such that if any two lines are found to produce WEn less than θF, indicating two directly affected lines, the class of fault is either DL or DLG. θG helps in distinguishing between these two classes. If the WEn of the third line is found higher than θG, i.e., closer to the no-fault condition, the fault is classified as DL; otherwise the line is assumed to contain higher disturbance than no-fault and the fault is classified as DLG. Thus, the two fault thresholds are finally selected as: direct fault line threshold (θF) = −1000 and un-affected ground line threshold (θG) = 155.
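The threshold arithmetic can be checked with a few lines of Python. The numbers are the limiting WEn values quoted in the text; the variable names are ours and purely illustrative.

```python
# Limiting WEn values quoted in the text (from Table 2).
dlg_unaffected_max = 130.094     # highest WEn of the un-affected line, DLG faults
dl_unaffected_min = 178.688      # lowest WEn of the un-affected line, DL faults
closest_direct_wen = -6058.792   # directly faulted line closest to no-fault
lowest_indirect_wen = -23.041    # lowest WEn of an indirectly faulted line

# Midpoint between the DLG and DL ranges; the paper rounds this to 155.
midpoint = (dlg_unaffected_max + dl_unaffected_min) / 2
theta_g = 155
theta_f = -1000

# Both thresholds sit inside the gaps they are meant to separate.
theta_f_separates = closest_direct_wen < theta_f < lowest_indirect_wen
theta_g_separates = dlg_unaffected_max < theta_g < dl_unaffected_min
```

The midpoint evaluates to about 154.39, so the chosen value of 155 leaves a margin of roughly 24 units on either side of θG.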
3.4 Development of Rule Base for Classification Considering all the above analysis, the following decision rules are formed for the broad classification of faults:
1. If all the lines have WEn > θG, it is indicated as no-fault condition.
2. If one and only one line has WEn < θF, and the other two lines have WEn > θF, the fault is categorized as SLG.
3. If any two of the lines have WEn < θF and the third line has WEn > θG, the fault is categorized as DL.
4. If any two of the lines have WEn < θF and the third line has θF < WEn < θG, the fault is categorized as DLG.
5. Finally, if all three lines have WEn < θF, the fault is categorized as LLL.
Depending on these decision rules, a fault classifier signature rule base is prepared in a constructive form and is shown in Table 3. The major fault class is determined for the test or unknown sample using best fit analysis against these rules, i.e., the test fault falls into one of these categories and the fault class is predicted accordingly.
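The five decision rules can be written as a small function. This is a sketch under stated assumptions, not the authors' implementation: the name `classify_fault` and the "Unclassified" fallback for inputs that match none of the rules are ours, and a third-line WEn exactly equal to θG is treated as DLG. The sample inputs are of the magnitude reported for the 110 km case.

```python
THETA_F = -1000.0  # direct fault line threshold from the text
THETA_G = 155.0    # un-affected ground line threshold from the text

def classify_fault(wen_a, wen_b, wen_c):
    """Apply the five decision rules to the three phase WEn values."""
    wen = {"A": wen_a, "B": wen_b, "C": wen_c}
    faulted = [phase for phase in "ABC" if wen[phase] < THETA_F]
    if len(faulted) == 3:
        return "ABC"                      # rule 5: LLL
    if len(faulted) == 2:
        third = next(p for p in "ABC" if p not in faulted)
        pair = "".join(faulted)           # "AB", "BC" or "AC"
        name = "CA" if pair == "AC" else pair
        # Rule 3 (DL) if the third line stays near no-fault, else rule 4 (DLG).
        return name if wen[third] > THETA_G else name + "G"
    if len(faulted) == 1:
        return faulted[0] + "G"           # rule 2: SLG
    if all(value > THETA_G for value in wen.values()):
        return "No-fault"                 # rule 1
    return "Unclassified"                 # assumption: not covered by the rules

results = [
    classify_fault(189.456, 183.949, 186.216),   # healthy line
    classify_fault(-45449, 164.674, -2.2565),    # A phase disturbed
    classify_fault(-22897, -16851, 184.92),      # A and B disturbed, C near no-fault
    classify_fault(-31583, -43061, 34.4363),     # A and B disturbed, C in between
    classify_fault(-76924, 179.38, -102272),     # C and A disturbed
    classify_fault(-53829, -80607, -182613),     # all three disturbed
]
```

Counting the phases below θF first keeps the logic short: the count alone separates SLG, two-line and LLL faults, and θG is consulted only for the two-line case.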
3.5 Classifier Output and Analysis The proposed wavelet entropy-based classifier is tested with all classes of faults, conducted at fourteen locations, each 10 km apart, in the 150 km line. The same class of fault has been carried out three times at the same point to reduce the difference in values caused by the noise component. Hence, a total test data set of (14 locations × 3 observations), i.e., 42, is obtained for one class of fault. Similarly, the data of the ten classes of fault are arranged to make a total set of 420 three phase test fault currents, which are tested with the proposed algorithm. The accuracy of the proposed classifier is given in Table 4 in terms of the broad classes of faults. The overall classifier accuracy is found as 99.2857%, with three wrong predictions among 420 test cases. The classifier accuracy is 100% for SLG and LLL. On three occasions among DL and DLG faults, the third line WEn has exceeded the corresponding range in the wrong direction, so that the fault class is misinterpreted.

Table 3 Decision rules for fault classification: fault classifier signature table
WEnA             WEnB             WEnC             Predicted class
WEn > θG         WEn > θG         WEn > θG         No-fault
WEn < θF         WEn > θF         WEn > θF         AG
WEn > θF         WEn < θF         WEn > θF         BG
WEn > θF         WEn > θF         WEn < θF         CG
WEn < θF         WEn < θF         WEn > θG         AB
WEn > θG         WEn < θF         WEn < θF         BC
WEn < θF         WEn > θG         WEn < θF         CA
WEn < θF         WEn < θF         θF < WEn < θG    ABG
θF < WEn < θG    WEn < θF         WEn < θF         BCG
WEn < θF         θF < WEn < θG    WEn < θF         CAG
WEn < θF         WEn < θF         WEn < θF         ABC

Table 4 Overall classification accuracy of the proposed classifier
Fault class           SLG   DL        DLG       LLL   Overall accuracy
No. of observations   126   126       126       42    420
Correct prediction    126   125       124       42    417
Wrong prediction      0     1         2         0     3
Classifier accuracy   100   99.2063   98.4127   100   99.2857
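The accuracy figures follow directly from the counts in Table 4, as the short check below shows; the dictionary layout is ours and only serves the illustration.

```python
# Counts taken from Table 4.
observations = {"SLG": 126, "DL": 126, "DLG": 126, "LLL": 42}
correct = {"SLG": 126, "DL": 125, "DLG": 124, "LLL": 42}

# Per-class accuracy in percent.
accuracy = {
    fault_class: 100 * correct[fault_class] / observations[fault_class]
    for fault_class in observations
}

# Overall accuracy: 417 correct predictions out of 420 test cases.
total_cases = sum(observations.values())
overall = 100 * sum(correct.values()) / total_cases
```

Each broad class with three subclasses contributes 3 × 42 = 126 cases and LLL contributes 42, which gives the 420 test cases used for the overall figure.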
4 Conclusion
A simple transmission line fault classifier scheme is developed in this work using the wavelet entropy (WEn) of the fault transient signals. One-cycle post-fault three-phase current signals are collected for ten different fault classes. These are analyzed using wavelets to obtain fault features in terms of the three-phase wavelet entropy values. The faults are also conducted at different locations along the line to obtain the variation of WEn with fault location. The three-phase WEn values are analyzed carefully for faults conducted at the two extreme ends and near the middle of the line. The
A Wavelet Entropy-Based Power System Fault Classification …
analysis gave rise to two threshold WEn values, on which the fault classifier rule base is designed. The test fault is analyzed similarly, and its three-phase WEn values are compared with the classifier rule base table to obtain the fault class directly. The proposed method is simple, as it uses WEn directly without going to higher levels of decomposition. The classifier accuracy obtained is 99.2857%, which is very high. This accuracy is achieved using only noise-affected, one-cycle post-fault transient current data. All three wrong predictions obtained in this work concern the DL and DLG faults. However, both fault types require the two corresponding circuit breakers to operate, so this error does not majorly influence the outcome. The method also does not require a long training time, unlike neural networks. Hence, the proposed work, on the whole, provides an effective method of fault classification in transmission lines.
Bangla Handwritten Math Recognition and Simplification Using Convolutional Neural Network Fuad Hasan, Shifat Nayme Shuvo, Sheikh Abujar, and Syed Akhter Hossain
Abstract Mathematical simplification is an innate capability in humans, but a machine does not have the cognitive skills to understand a problem from its visual context. In this work, we present a new system that takes an input image of a Bangla handwritten mathematical expression, automatically simplifies the problem, and generates the answer as output. The proposed system is suitable for embedded systems as well as mobile applications. For recognition, we use a CNN model called MathNET for segmented Bangla digits and mathematical symbols. The model dataset contains 54 classes: 10 numerals and 44 mathematical operators and symbols. The algorithm followed to build the system for automatic simplification of Bangla handwritten math performs well. Contributions to the state of the art for the Bangla language are still very few. Developing an automatic math equation solver has been a goal of researchers working in the field of NLP for many years.
Keywords Bangla handwritten equation · Simplification · Preprocessing · Segmentation · Implementation · CNN · Mathematical expressions · Recognition · Image processing
F. Hasan (corresponding author) · S. N. Shuvo · S. Abujar · S. A. Hossain
Department of Computer Science and Engineering, Daffodil International University, Dhaka 1212, Bangladesh
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_14
1 Introduction
The effectiveness of mathematics in human life and natural science is like a blessing of God. All areas of science, such as physics, electronics, banking, and engineering, rest on mathematics. Nowadays, recognizing and solving handwritten mathematical equations and expressions with the help of AI is a leading area of scientific research. More than 250 million scientific documents are written in Bangla. Recognition of handwritten mathematical operators and symbols remains a difficult job because of the segmentation of mathematical character symbols: math expressions contain symbols of different sizes in two-dimensional arrangements. The main part of this work is to segment the expression into a sequence of symbols and, through the recognition process, simplify the math and show the output to the user in both Bangla and English.
Basic neural machine translation (NMT) consists of an encoder–decoder, in which the encoder takes the vector form of words as input and the decoder uses that vector to predict the most likely output. To maximize translation performance, an NMT model learns its parameters jointly [1–3] with minimal domain knowledge. However, NMT is limited in dealing with long sentences [4]. Moreover, compared with resource-rich languages, the development of MT systems for the Bangla–English language pair is very sparse, and the literature reports that large-scale parallel corpora for Bangla–English are very scarce.
Humans use many cognitive abilities. Classification of the segmented images is done using a CNN model, which extracts features from the image through a sequence of operations. After the CNN model has successfully recognized each of the segmented digits and operators, we perform a string operation to calculate the expression.
The structure of this paper is as follows. Section 2 provides a brief overview of the existing work on mathematical equation recognition. Sections 3 and 4 present the approaches used in this system. In Sect. 5, we discuss how we solve the expressions. In Sect. 6, we discuss the results of this work, and the last section concludes the paper.
2 Literature Review
Various papers address handwritten character segmentation [1, 4] for Bangla as well as other languages. Some schemes also address mathematical expression recognition (MER), but they are few in number and have not been worked out properly for the Bangla language. Examples include identification of mathematical expressions using SVM and projection histograms [5]; offline recognition of printed mathematical expressions, as discussed by Zanibbi et al. (2002); recognition of printed mathematical symbols [6]; identification of mathematical symbols using SVM [7]; recognition of online mathematical symbols using a template matching distance [2]; and offline recognition of handwritten mathematical symbols using character geometry [3]. All of these methods address the recognition and segmentation of symbols using various techniques. Recognition of mathematical symbols and characters has also been done with CNN-based models [8, 9]. The difficulties of online mathematical expression recognition are discussed in [10]. The majority of these papers concentrate on the recognition scheme. In this approach, the main focus is on simplifying Bangla handwritten expressions, as no existing work handles this problem successfully. The main target is, after preprocessing, segmentation, and recognition of a Bangla handwritten input image, to generate a string from the expression and simplify it.
3 Methodology
In this work, identifying handwritten mathematical expressions and simplifying the problem involves several phases, from taking an input image to producing the final result, which are described below. Figure 1 shows the workflow diagram of our methodology.
Fig. 1 Workflow diagram
Fig. 2 Input image
Fig. 3 After preprocessing
3.1 Preprocessing
Preprocessing transforms and modifies the input image to make its quality suitable for recognition. First, we convert the original input image into a grayscale image, because identifying characters in a color image is more challenging, and then remove some noise from the grayscale image. Finally, we apply inverse binary thresholding (BINARY_INV), which maps every pixel value to either 0 or 255. This reduces the computational time for segmentation by eliminating as many unnecessary pixel values as possible. Figures 2 and 3 show the input image and the preprocessed image.
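The grayscale and inverse-threshold steps can be sketched in plain Python on a nested-list image. This is an illustration under assumptions, not the authors' code: the luma weights are the common 0.299/0.587/0.114 convention, the threshold of 127 is a hypothetical choice, and noise removal is omitted. In practice an image library such as OpenCV (whose `cv2.threshold` supports `THRESH_BINARY_INV`) would normally be used.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of gray levels in 0..255."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def binary_inv(gray_image, thresh=127, maxval=255):
    """Inverse binary threshold: bright background -> 0, dark ink -> maxval."""
    return [
        [0 if pixel > thresh else maxval for pixel in row]
        for row in gray_image
    ]


# A 2 x 2 toy image: white paper on the left, dark ink on the right.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(250, 250, 250), (10, 10, 10)],
]
binary = binary_inv(to_grayscale(image))
print(binary)
```

After this step, text pixels are white (255) on a black (0) background, which is the convention the segmentation steps rely on.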
3.2 Line Segmentation
An image may contain multiple lines of characters. Applying text line segmentation to the image identifies whether multiple handwritten expressions exist. Consecutive lines are separated by a minimum horizontal gap. We iterate over the image row by row: a pixel value of 255 (white pixel) is considered text, and if the sum of a horizontal row is 0, the row is black and is considered a gap between two lines. Line segmentation has been done for many languages [11] with great success. Figure 4 shows the line detection on the preprocessed image.
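The row-sum idea described above can be sketched as follows. The function name and the nested-list image representation are assumptions made for illustration; the image is expected to be the inverse-binary output of preprocessing (text 255, background 0).

```python
def segment_lines(binary_image):
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    lines = []
    start = None                      # first row of the line being scanned
    for y, row in enumerate(binary_image):
        has_text = sum(row) > 0       # a row sum of 0 means an all-black gap row
        if has_text and start is None:
            start = y                 # a new text line begins
        elif not has_text and start is not None:
            lines.append((start, y))  # the gap row closes the current line
            start = None
    if start is not None:             # text runs to the bottom edge
        lines.append((start, len(binary_image)))
    return lines


page = [
    [0, 0, 0],
    [0, 255, 0],
    [255, 255, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 255],
]
print(segment_lines(page))
```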
Fig. 4 Line detection
Fig. 5 Denoting character region
Fig. 6 Segmented character
3.3 Character Segmentation
Character segmentation is an established way to split an image of a sequence of characters into individual character images [12]. As the preprocessed image has only binary pixel values, we calculate the sum of all pixels in each column of the image. If the sum of a column is less than or equal to five (a black pixel in the binary image has value 0), the column is treated as a gap and marks a suspected character boundary. The idea is that white pixels connected along the y axis are considered a separate character, as shown in Fig. 5. After finding the gaps between the characters, it is easy to remove the unnecessary vertical and horizontal gaps from the image. Each character is then separated using the method of [4]. Finally, each separated image is resized to 28 × 28 pixels. Figure 6 shows the segmented characters.
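A minimal sketch of the column-sum rule is given below, assuming one already-cropped text line stored as a nested list. The function name and the `gap_threshold` parameter are illustrative; the default of 5 mirrors the value quoted in the text. Trimming of vertical gaps and the final resize to 28 × 28 are left out and would typically be done with an image library.

```python
def segment_characters(line_image, gap_threshold=5):
    """Split a binary text line into character crops at near-empty columns."""
    width = len(line_image[0])
    column_sums = [sum(row[x] for row in line_image) for x in range(width)]

    characters = []
    start = None                               # first column of the current character
    # A trailing 0 acts as a sentinel gap so the last character is closed.
    for x, total in enumerate(column_sums + [0]):
        is_gap = total <= gap_threshold
        if not is_gap and start is None:
            start = x
        elif is_gap and start is not None:
            characters.append([row[start:x] for row in line_image])
            start = None
    return characters


line = [
    [0, 255, 255, 0, 0, 255],
    [0, 255, 0, 0, 0, 255],
]
for crop in segment_characters(line):
    print(crop)
```

With pixel values of 0 and 255, any column containing ink sums to at least 255, so a threshold of 5 effectively selects all-black columns.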
4 Classification and Recognition
In this paper, we use a CNN model for recognition and classification. The MathNET [13] model contains a total of 54 classes, of which 10 are Bangla numerals and 44 are mathematical symbols and operators. The model dataset contains 32,400 images. The recognition accuracy of this model is 96.50%, which is also the highest accuracy obtained so far for handwritten Bangla math recognition. BINARY_INV images of 28 × 28 pixels are used as the input layer in the training and testing sessions of the model. Figure 7 shows the architecture of MathNET.
Fig. 7 Architecture of MathNET [13]
Table 1 Some samples of mathematical expressions covered (columns: Area, Expression input image, Output; areas: add, sub, mul, div, comparison, sqrt, decimal mixed, percentage mixed; image content not reproduced here)
5 Solution of Expressions
For each numeral and symbol segmented from the main image, we predict the class one by one and store the outputs sequentially in a list. We then convert the list into a string and evaluate the string with the help of the Python eval function [14]. Finally, we show the output in both Bangla and English. Table 1 shows some samples of the expressions covered.
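The list-to-string-to-eval step can be sketched as below. This is an illustration under assumptions rather than the authors' code: the symbol mapping (Bangla digits, × and ÷), the character whitelist, and the function name are hypothetical, and the whitelist is only a precaution, not a full sandbox for eval.

```python
# Bangla digits occupy the Unicode range U+09E6 (zero) to U+09EF (nine).
BANGLA_DIGITS = "".join(chr(0x09E6 + i) for i in range(10))
TO_ASCII = str.maketrans(BANGLA_DIGITS + "\u00d7\u00f7", "0123456789*/")
TO_BANGLA = str.maketrans("0123456789", BANGLA_DIGITS)
ALLOWED = set("0123456789+-*/().<>= ")   # assumed whitelist of expression characters


def simplify(predicted_symbols):
    """Join predicted symbols, evaluate them, and return (English, Bangla) output."""
    expression = "".join(predicted_symbols).translate(TO_ASCII)
    if not set(expression) <= ALLOWED:
        raise ValueError(f"unsupported symbol in expression: {expression!r}")
    # eval is what the text describes; builtins are removed to limit what it can reach.
    result = eval(expression, {"__builtins__": {}}, {})
    english = str(result)
    return english, english.translate(TO_BANGLA)


# Symbols as a recognizer might emit them: Bangla 2, plus, Bangla 3, times, Bangla 4.
symbols = [BANGLA_DIGITS[2], "+", BANGLA_DIGITS[3], "\u00d7", BANGLA_DIGITS[4]]
print(simplify(symbols))
```

Because the answer is computed from the recognized string, a single misrecognized symbol changes the evaluated expression, which is why the overall result depends so strongly on the recognition module.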
6 Experimental Result
Very few of the test images segmented from the input image are falsely recognized. The whole result depends on the recognition module: if one symbol or digit is falsely identified, the simplification result will be wrong. We tested thousands of expressions; some of the segmented characters were recognized falsely and gave a wrong answer to the math. Overall, almost 98% of the images are recognized correctly and give the correct answer, as visualized in Fig. 8. Table 2 shows some correctly recognized expressions that give the correct answer. Table 3 shows some falsely identified symbols that cause a wrong answer.
Fig. 8 Visualization of recognition rate
Table 2 Correctly recognized expressions (columns: No, Input image, Truly recognized expression in string, True answer; three sample rows, image content not reproduced here)
Table 3 Error recognition of segmented characters (columns: No, Input image, Wrongly recognized expression in string, True answer, False answer; three sample rows, image content not reproduced here)
Humans sometimes make mistakes too. In these examples, the input images are not clear: in images 1, 2, and 3, the first closing bracket looks like a Bangla 1, the decimal point looks like a 0, and the comparison symbol looks like a 1, respectively.
7 Conclusion and Future Work
In this paper, the simplification of Bangla handwritten mathematical expressions is described. For simplification of the math, the main part is feature extraction from the image and recognition with the help of a CNN model. If the CNN model classifies all of the segmented images correctly, the simplification part performs correspondingly well. Simplification is done by string operations. We achieved state-of-the-art results in the recognition and simplification stages. In future work, the focus will be on raising the precision level and on building a scheme that can handle compound mathematical equations at one time.
Acknowledgements We gratefully acknowledge support from DIU NLP and Machine Learning Research LAB for providing GPU support. We thank the Department of CSE, Daffodil International University, for providing the necessary support.
References
1. Mahmud, J.U., Raihan, M.F., Rahman, C.M.: A complete OCR system for continuous Bangla characters. In: IEEE TENCON-2003: Proceedings of the Conference on Convergent Technologies for the Asia Pacific (2003)
2. Simistira, F., Katsouros, V., Carayannis, G.: A template matching distance for recognition of online mathematical symbols. Institute for Language and Speech Processing of Athena Research and Innovation Center in ICKT, Athens, Greece
3. Bage, D.D., et al.: A new approach for recognizing offline handwritten mathematical symbols using character geometry. Int. J. Innov. Res. Sci. Eng. Technol. 2(7) (2013). ISSN: 2319-8753
4. Hasan, F., Shuvo, S.N., et al.: Bangla continuous handwriting character and digit recognition using CNN. In: 7th International Conference on Innovations in Computer Science & Engineering (ICICSE 2019), vol. 103, pp. 555–563. Springer, Singapore
5. Gharde, S.S., Baviskar, P.V., Adhiya, K.P.: Identification of handwritten simple mathematical equation based on SVM and projection histogram. Int. J. Soft Comput. Eng. (IJSCE) 3(2), 425–429 (2013)
6. Álvaro, F., Sánchez, J.A.: Comparing several techniques for offline recognition of printed mathematical symbols. IEEE, ICPR 2010
7. Malon, C., Suzuki, M., Uchida, S.: Support vector machines for mathematical symbol recognition. Technical Report of IEICE
8. Lu, C., Mohan, K.: Recognition of online handwritten mathematical expressions using convolutional neural networks. CS231n project report, Stanford (2015)
9. Gharde, S.S., Nemade, V.A., Adhiya, K.P.: Evaluation of feature extraction and classification techniques on special symbols. IJSER 3(4) (2012)
10. Awal, A.-M., Mouchère, H., Viard-Gaudin, C.: Towards handwritten mathematical expression recognition. In: 2009 10th International Conference on Document Analysis and Recognition
11. Louloudis, G., Gatos, B., Pratikakis, I., Halatsis, C.: Text line and word segmentation of handwritten documents (2009)
12. Casey, R.G., Lecolinet, E.: A survey of methods and strategies in character segmentation (1996)
13. Shuvo, S.N., Hasan, F., Ahmed, M.U., Hossain, S.A., Abujar, S.: MathNET: using CNN Bangla handwritten digit, mathematical symbols and trigonometric function recognition. In: International Conference on Computing & Communication (2020)
14. eval, https://docs.python.org/3/library/functions.html#eval. Last access: 22 Feb 2020
Stress Detection of Office Employees Using Sentiment Analysis Sunita Sahu, Ekta Kithani, Manav Motwani, Sahil Motwani, and Aadarsh Ahuja
Abstract Due to increasing competition in the industry, companies demand more work hours from employees, and employees experience a lot of stress in meeting their deadlines. In addition to deadline stress, they also face family problems, low motivation, discrimination, politics, etc., which bring extra negative stress that harms the productivity and mental peace of employees. To reduce workplace stress among employees and increase productivity, there is a need for a system that identifies the stress level so that remedial action can be taken beforehand. In this paper, we propose a method that calculates the stress level of employees from three sources: the seven emotions (angry, disgust, happy, sad, fear, surprise, neutral) detected at the workplace from facial expressions captured by the Web camera of their computers; sentiment analysis, using natural language processing, of the monthly reviews provided by the employees; and the answer provided by the employee to the question “How was your day?” at the end of each day. The system generates a report for the human resources (HR) department of the company, which analyzes the stress level of the employees. HR can talk to them, counsel them, and help them, which will ultimately motivate employees to do quality work.
S. Sahu Hashu Advani Memorial, Collectors Colony, Chembur 400074, India
E. Kithani Mahadev Co-operative Society, 17/A, 4th Floor, Bewas Chowk, Ulhasnagar 421001, India
M. Motwani Flat No G3, Poonam APT, Laxmi Nagar, Ulhasnagar 421003, India
S. Motwani (corresponding author) 605, Ajmera Heights-3, Yogidham, Kalyan (w) 421301, India
A. Ahuja 1503/B, Mohan Pride, Wayle Nagar, Kalyan 421301, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_15
Keywords Stress · Myers–Briggs test · Natural language processing · Face emotion recognition
1 Background
Globally, an estimated 264 million people suffer from depression, one of the leading causes of disability, and many of them also suffer from symptoms of anxiety [1]. A recent World Health Organization (WHO) study estimates that depression and anxiety disorders cost the world economy US$ 1 trillion every year in lost productivity. Unemployment is a well-recognized risk factor for mental health problems, while returning to or getting work is protective [1]. A negative working environment may cause physical and mental health problems, harmful use of substances or alcohol, absenteeism, and lost productivity [2]. Workplaces that promote mental health and support people with mental disorders are more likely to reduce absenteeism, increase productivity, and benefit from the associated economic gains. These studies motivated us to create software that tracks the mental health of the workforce and helps solve the issues they face, so that economic losses can be reduced and quality of life increased.
Today, stress has become a universal phenomenon [3]. Employees are working longer hours as they are given more duties and responsibilities. Stress is characterized as a state of mental and emotional pressure or strain induced by difficult or unfavorable circumstances in human resource management [4]. It is an outside force that governs the feelings and actions of a person. Stress harms the human body physically, emotionally, and mentally [5]. Nowadays, stress is the most common problem that results in significant health disease [6]. Stress can hit anybody at any level of a business, and recent research shows that work-related stress is widespread and is not confined to a specific sector or occupation.
Some of the reasons for stress in the workplace are as follows:
Poor working conditions: The physical environment of the workplace, which involves high or low visibility, smoke, heat, an inadequate ventilation system, and everything else that can influence an employee's mood and mental state [2].
Shift work: An employee's body is accustomed to being productive at a particular time, so when the shift changes, mind and body need a few days to adjust to the new shift timings. Employees are stressed by maintaining focus at work during these days.
Long working hours: Some studies have shown that long work hours add to mental pressure and work stress. Working at least 10 h per day, at least 40 overtime hours per month, or at least 60 h per week tends to create stressful feelings. It has also been reported that working more than 45 h per week diminished the risk of mental stress. The connection between long working hours and work stress requires more examination [2].
Work underload: When employees are given routine and uninteresting work instead of challenging work, they experience a lot of stress as they feel harassed [2]. An example is a data scientist who is given the work of replying to grievance mails.
Work overload: This happens when an employee is given a lot of work to complete in too little time.
Due to workplace stress, employees' productivity and efficiency decrease, and the quality and quantity of the work also decrease, which ultimately affects the outcome of the company. Hence, it is necessary to identify workplace stress and report it to the human resources department of the company so that remedial action can be taken [7].
In this paper, we propose a system for organizations to manage and reduce stress at work [2]. The system detects the personality of the employee by asking questions based on the Myers–Briggs test, a personality classification test. The four categories are introversion/extraversion, sensing/intuition, thinking/feeling, and judging/perception. Every individual is said to have one preferred quality from each category, giving 16 unique types. The system detects the mood of the employee by capturing his/her face at regular intervals and generates reports accordingly. At the end of each day, employees are asked to provide their personal opinion on the work they have done that day. In addition, employees are asked to provide a monthly review of their work. Natural language processing techniques are applied to this review to determine whether it is positive (happy) or negative (sad). Weekly reports of the mood of employees are shown to the admin so that the admin can conduct stress management sessions for the employees who are suffering and take action accordingly, which will ultimately motivate employees to do quality work and increase overall performance, benefiting the organization in many ways [8, 9].
2 Methods
2.1 Approach Architecture
Figure 1 shows the interaction of office employees with the system. The employee first registers at the Web portal, providing his/her details. The employee then takes the Myers–Briggs personality determination test, which takes the form of four questions, each with two alternative options of which only one can be selected. Based on these answers, a code is generated and compared with the codes stored in the database. The matching code describes the personality of the employee, which helps the human resources department of the company assign appropriate work to that employee. The employee then logs in to the system, after which pictures of the employee are taken at regular intervals. These pictures are provided as input to the face identification and emotion recognition model. This model first recognizes the employee from the picture and then detects the emotion of the employee at that particular instant. In this way, the mood/stress of the employee is detected at regular intervals and stored in the database. Daily answers to “How was your day?” are taken from the employee, which helps human resources analyze the mood of the employee. In addition, monthly feedback is taken from the employee; using natural language processing, the mood (happy/sad) of the employee is detected and stored in the database. In the end, reports in the form of charts are generated and shown to human resources to help them analyze the mood/stress level of the employee. Thus, the human resources department has a complete overview of the stress levels of each employee and can take remedial action accordingly.
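The code generation from four two-option answers can be sketched as follows. This is a hypothetical illustration: the letter pairs follow the usual Myers–Briggs abbreviations in the order the categories are listed in the text, the answer encoding (0 or 1 per question) is an assumption, and the `PROFILES` dictionary is a stand-in for the database of stored codes, whose contents the text does not give.

```python
from itertools import product

# One letter pair per question, in the order the categories are listed in the text.
DICHOTOMIES = (("I", "E"), ("S", "N"), ("T", "F"), ("J", "P"))

# All 2**4 = 16 possible personality codes.
ALL_CODES = {"".join(letters) for letters in product(*DICHOTOMIES)}

# Hypothetical stand-in for the stored codes and their descriptions.
PROFILES = {"INTJ": "placeholder description for INTJ"}


def personality_code(answers):
    """Build the four-letter code from four answers, each 0 (first) or 1 (second)."""
    if len(answers) != len(DICHOTOMIES) or any(a not in (0, 1) for a in answers):
        raise ValueError("expected four answers, each 0 or 1")
    return "".join(pair[a] for pair, a in zip(DICHOTOMIES, answers))


def describe(code):
    """Look up the generated code among the stored codes."""
    return PROFILES.get(code, "no stored description")


code = personality_code([0, 1, 0, 0])
print(code, "-", describe(code))
```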
2.2 Algorithms Used
The algorithms we used were support vector machine (SVM), LBPH, and CNN. LBPH gave the highest accuracy at 99%, followed by the support vector machine at 79.3% and face emotion recognition at 72%. All the accuracy results are illustrated in Fig. 1.
Local Binary Pattern Histogram Algorithm
The local binary pattern histogram (LBPH) algorithm is a simple solution to face recognition problems, and it can recognize both the front face and the side face [10]. However, the recognition rate of the LBPH algorithm under the conditions of illumination diversification, expression variation, and attitude deflection is decreased
Fig. 1 System architecture
[11]. To solve this problem, a modified LBPH algorithm based on the pixel neighborhood gray median (MLBPH) has been proposed. The gray value of each pixel is replaced by the median of its neighborhood sampling values; the feature value is then extracted by sub-blocks, and the statistical histogram is established to form the MLBPH feature dictionary, which is used to recognize the face identity by comparison with the test image [11]. Experiments carried out on the FERET standard face database and on a newly created face database show that the MLBPH algorithm is superior to the LBPH algorithm in recognition rate.
Natural Language Processing
Natural language processing (NLP) provides a powerful tool for the analysis of text written in human-understandable languages. NLP helps in understanding and interpreting human language. Here it is used to determine the emotion associated with the monthly review of each employee. The process begins with tokenization. This is followed by word stemming, which reduces the inflectional forms of each word to a common base or root. After word stemming, term frequency and inverse document frequency (TF-IDF) are used for word vectorization. Finally, a linear support vector machine model is used to predict the emotion associated with the review. The dataset used for NLP sentiment analysis consists of 40,000 tweets and was split in an 80:20 ratio for training and testing, respectively.
Face Emotion Recognition
Face emotion recognition (FER) bundles a Keras model. The model is a convolutional neural network with weights saved to an HDF5 file in the data folder relative to the module’s path. It can be overridden by injecting a model into the FER() constructor during instantiation with the emotion_model parameter. In this project, it is used to determine the mood of the employee at a particular instant of time based on the facial expression of the employee. It returns seven classes, namely happy, sad, surprise, anger, neutral, fear, and disgust [12].
Convolutional Neural Network (CNN)
CNN is a deep learning algorithm that is used to work with image data. The working of this algorithm is divided into three phases: the input phase, the feature learning phase, and the classification phase. The input is the image of the employee as a matrix. Features are extracted from this matrix by multiplying it with a filter matrix to produce a feature map. This is the convolution step and is primarily used for edge detection in images. Pooling is then applied to the convolved matrix, which reduces the number of features in the case of a large image matrix. The resulting matrix is flattened into a vector and fed into a fully connected layer. The output class is then obtained by applying an activation function to the output of the fully connected layer.
Support Vector Machine (SVM)
Support vector machine is a supervised machine learning algorithm that is primarily used for classification problems and is capable of working on regression as well
as classification problems. It can work with both linear and nonlinear cases. The algorithm works by approximating a hyperplane between two classes. It then finds the points closest to the hyperplane. These points are called support vectors, and the distance between them and the hyperplane is called the margin. The focus then shifts to maximizing the margin. Thus, the line or hyperplane corresponding to the maximum margin is considered the line of demarcation between the two classes. The confusion matrix obtained from the results is:

| Actual | Predicted 0 | Predicted 1 |
| --- | --- | --- |
| 0 | 1033 | 270 |
| 1 | 283 | 1008 |

Here, 0 indicates the happy mood of the employee and 1 indicates the sad mood of the employee.
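The accuracy implied by a confusion matrix is the sum of the diagonal divided by the total number of cases. The sketch below applies this to the four counts printed above; the function name is an illustrative choice. For these counts the result is about 78.7%, which is close to, though not identical to, the 79.3% quoted for the SVM in Sect. 2.2.

```python
def accuracy_from_confusion(matrix):
    """Fraction of cases on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Counts as printed in the confusion matrix (class 0 = happy, class 1 = sad).
confusion = [
    [1033, 270],
    [283, 1008],
]
accuracy = accuracy_from_confusion(confusion)
print(f"{100 * accuracy:.1f}% of {sum(map(sum, confusion))} test reviews")
```

Accuracy does not depend on whether rows or columns hold the actual classes, so this figure is unaffected by how the flattened matrix is oriented.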
2.3 Modules
This section discusses the modules of the proposed system.
Registration of employee: The employee first registers on our Web site. After registration, the employee answers a few questions that are used to determine their personality.
Face verification and emotion analysis: After completion of step one, a picture of the employee is taken for face verification. From then on, once employees start their computer, a picture is taken four times a day; the face of the employee is verified first, and the picture is then sent through an API call for emotion analysis. The detected emotion and the time stamp are stored in the database.
Daily feedback: At the end of the day, when employees shut down their computer, a pop-up appears with two buttons, green and red. Green indicates a good day, whereas red indicates a bad day. The employee selects one option, and that value is stored in the database record of that employee.
Monthly feedback: At the end of each month, written feedback is collected from the employee, classified as positive or negative using natural language processing, and stored in the database record of that employee.
Aggregate sentiment analysis of employees: Reports in the form of charts are generated and shown to human resources to help them analyze the mood/stress level of the employee. Thus, the human resources department of the company has a complete overview of the stress levels of each employee and can take remedial action accordingly.
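The aggregation step described in the last module could be built from the stored emotion records roughly as follows. This is a minimal sketch and not the authors' implementation: the MoodRecord fields, the mood_report function, and the choice of which emotions count as negative are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MoodRecord:
    """One stored result of the emotion analysis (hypothetical schema)."""
    employee_id: str
    timestamp: datetime
    emotion: str  # one of the seven classes returned by the emotion model


# Assumption: these four classes are treated as signs of a bad mood.
NEGATIVE = {"sad", "anger", "fear", "disgust"}


def mood_report(records):
    """Per employee: emotion counts and the share of negative emotions."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec.employee_id][rec.emotion] += 1

    report = {}
    for employee_id, emotion_counts in counts.items():
        total = sum(emotion_counts.values())
        negative = sum(n for emo, n in emotion_counts.items() if emo in NEGATIVE)
        report[employee_id] = {
            "counts": dict(emotion_counts),
            "negative_share": negative / total,
        }
    return report


sample = [
    MoodRecord("e1", datetime(2020, 1, 6, 9, 0), "happy"),
    MoodRecord("e1", datetime(2020, 1, 6, 13, 0), "sad"),
    MoodRecord("e2", datetime(2020, 1, 6, 9, 5), "neutral"),
]
print(mood_report(sample))
```

The per-employee counts map directly onto a pie chart, and the negative share gives a single number that a dashboard could sort by.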
Stress Detection of Office Employees Using Sentiment Analysis
3 Results
The main purpose of the proposed system is to analyze the stress level of people working in software industries, who face many problems (physical and health) during their hectic working hours or in their lives. This section covers the user interfaces of the proposed Web-based system. First, when an employee joins, he/she registers himself/herself using the registration page. Figure 2 shows the Myers–Briggs test questions, which are asked after registration to detect the personality of the employee. Figure 3 shows the mood of the employee stored in quantitative terms in the database using face recognition and face detection. Figure 4 shows the snippet of the application that asks “How was Your Day” at the end of every day; this records the employees’ own view of the day. Figure 5 shows the UI for taking monthly feedback/reviews from the employee. Here, the employee writes the review, and we run our natural language processing module to find the sentiments of the employee.
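The first stages of the natural language processing module (tokenization, stemming, and TF-IDF vectorization) can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the authors' code: crude_stem is a deliberately rough stand-in for a real stemmer, the TF-IDF variant used here (relative term frequency times the unsmoothed logarithm of the inverse document frequency) is only one of several in common use, and the final linear SVM step is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tfidf(documents):
    """Return one {term: weight} dictionary per input document."""
    docs = [[crude_stem(t) for t in tokenize(d)] for d in documents]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


reviews = ["The workload was stressful", "The team was supportive"]
for vector in tfidf(reviews):
    print(vector)
```

Terms that occur in every review, such as "the" in this toy example, receive a weight of zero, while terms specific to one review receive a positive weight. The resulting vectors are what a linear classifier would be trained on.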
Fig. 2 Myers–Briggs questions
Fig. 3 Mood stored in the database
Fig. 4 Daily feedback of employee
Figure 6 shows the dashboard for the administrator in which the status of all the employees is displayed. Human Resources can see the mood/stress of each employee. Figure 7 shows the reports in the form of pie charts which will be shown to the human resources of the company. Table 1 shows the accuracy obtained from different algorithms.
Fig. 5 Monthly feedback from the employees
Fig. 6 Dashboard to display an employee’s mood
4 Conclusion
The study examines the stress faced by employees in both the government and public sectors. The daily interaction with co-workers and the fragmented demands of this profession often lead to pressure and challenges, which may lead to stress. Stress in employees can be detected by observing patterns in emotional data and can be addressed by the human resource department of the company. The accuracy of emotion
Fig. 7 Mood of an employee
Table 1 Average classification accuracy
Model                                    Accuracy (%)
Face recognition using LBPH algorithm    99
SVM for monthly review analysis          79.3
Face emotion recognition using CNN       72
detection is 72% for face input, 79.08% for monthly review input in the form of text, and 99% for identification of employees from face input. The solutions we apply are simple, not medical, so where our algorithm fails, it does not have a negative impact. In conclusion, the results of the current study suggest that we can use face recognition to detect stress in employees.
5 Competing Interests The authors declare that they have no competing interests.
6 Availability of Data and Materials The dataset for emotion detection from the monthly review was collected from: https://data.world/crowdflower/sentiment-analysis-in-text. The above-mentioned dataset is used for sentiment analysis. The dataset is in the form of 40,000 tweets and was split into 80:20 ratio for training and testing, respectively.
References
1. Mental health in the workplace by WHO. https://www.who.int/mental_health/in_the_workplace/en/
2. OSH Answers Fact Sheets, Canadian Centre for Occupational Health and Safety (CCOHS). https://www.ccohs.ca/oshanswers/psychosocial/stress.html
3. Subhani, A.R., Mumtaz, W., Saad, M.N.B.M., Kamel, N., Malik, A.S.: Machine learning framework for the detection of mental stress at multiple levels. IEEE Access 5, 13545–13556 (2017)
4. Health and well-being, Employee Benefits, Jennifer Paterson on 28th October 2013. https://employeebenefits.co.uk/issues/health-and-wellbeing-2013/how-to-measure-workplace-stress/
5. The impact of employee level and work stress on mental health and GP service use: an analysis of a sample of Australian government employees by BMC Public Health. https://bmcpublichealth.biomedcentral.com/articles/10.1186/1471-2458-4-41
6. Jeon, S.W., Kim, Y.K.: Application of assessment tools to examine mental health in workplaces: job stress and depression. Psychiatry Investig. 15(6), 553 (2018)
7. How One Company Is Using AI To Address Mental Health by Lucy Sherriff. https://www.forbes.com/sites/lucysherriff/2019/03/25/how-one-company-is-using-ai-toaddress-mental-health/
8. How to measure workplace stress by Jennifer Paterson. https://employeebenefits.co.uk/issues/health-and-wellbeing-2013/how-to-measure-workplace-stress/
9. Abid, R., Saleem, N., Khalid, H.A., Ahmad, F., Rizwan, M., Manzoor, J., Junaid, K.: Stress detection of the employees working in software houses using fuzzy inference. Stress 10(1) (2019)
10. Face Recognition, Adam Geitgey, github link: https://github.com/ageitgey/face_recognition
11. Zhao, X.M., Wei, C.B.: A real-time face recognition system based on the improved LBPH algorithm
12. Face Emotion Recognition, Justin Shenk, github link: https://github.com/justinshenk/fer
Redefining Home Automation Through Voice Recognition System Syeda Fizza Nuzhat Zaidi, Vinod Kumar Shukla, Ved Prakash Mishra, and Bhopendra Singh
Abstract The concept of home automation is a growing topic in this century, and it plays a vital role in our daily lives as individuals, as innovators, and as a collective society. It reduces the human labor, time, and effort spent on essential but tiring daily tasks. The primary idea behind this research is to cover various challenges and possibilities regarding home automation. Today, the demand for better home security systems has increased widely. Homes not only need to be secure but also need a simple and refined control system. These days, home automation is practiced in every smart home, and it considers all factors, including the security of appliances, especially when a person is not at home, which also helps to minimize the usage of electricity. It is a wide-ranging concept, as it can cover anything from one automated device to a number of automated devices and can place the whole house under one system’s control at any point of time. This huge task can produce a variety of challenges, but once they are solved, it can lead to a range of possibilities and thus be one of the most useful and advanced innovations of our tech-based generation and our fast-moving society as a whole. In this paper, a voice recognition-based model is proposed, which enhances the user experience in smart home management. The topic was chosen for its vast number of possibilities in the near future and its continual advancement in this evolving society. The objective thus remains to list its challenges, to provide solutions for them, and to come up with a system that is effective and affordable.
S. F. N. Zaidi (B) · V. K. Shukla · V. P. Mishra · B. Singh
Amity University, Dubai, UAE
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_16
Keywords Smart home · Voice recognition system · IoT · Home automation
1 Introduction
In this tech-based generation, individuals and emerging companies alike aim to provide an environment in which everything is automated and labor input is reduced to the minimum. In this agenda, home automation is one of the most important emerging fields and offers huge scope for young people and IT departments to focus on. Home automation can be defined as a method of performing basic tasks without human intervention. The concept of automating each appliance in the home has been a field of study for many years; it started with the connection of two electric wires to a battery and closing the circuit with a light as the load [1]. Home automation is basically the method of using one or more computers to perform and control basic home functions automatically. It can include any operation, from security systems to reminder systems to entertainment appliances to lighting up the whole house or just one room, depending on the command or on the motion detectors and controllers [2, 3]. These home automation systems are gaining popularity every day because of their high security and ease of use. They are used by many people for different reasons: some want these systems to satisfy daily needs and to be more comfortable through reduced labor; others may be physically challenged and require them for assistance, so that they rely on technology rather than on other people and feel more independent; the same goes for elderly people who prefer this to depending on other individuals for basic daily tasks [4]. Home automation systems are becoming even more beneficial due to their safety and high security.
They are highly advanced and can monitor all types of appliances and gadgets. They are easily approachable and efficient to use [5]. Today, automation has become one of the attractive areas that play an important role in everyday life. Home automation has a number of applications, such as application control, lighting-based control systems, security management, leak detection, smoke and fire detection, and automation for the elderly, disabled, and sick [6]. The need for security is a rising issue, and any concept that provides high security and protects property and life is in wide demand. Many systems provide limited access in both indoor and outdoor environments. Although a number of developed systems support this, some of them have a very complicated connectivity system, and the cost of implementing these systems and their devices is extremely high, making them affordable only to wealthier households and not to the common man. In smart home automation, a large number of different devices and systems need to be integrated with the minimum number of resources [7].
Considering the rapid economic expansion, living standards keep rising day by day, and people have increasingly high requirements for ease of access to various daily tasks. The intellectual society thus has to bring about changes and innovations to satisfy the modern family, and the market has to come up with tech assistants that are more advanced than those of competing companies yet affordable to any member of society and not just the rich [8]. There are several methods to incorporate smart home automation systems in our homes; a few of them are as follows:
• GHMI: Gesture Human Machine Interface (GHMI) means controlling electronic devices using hand gestures [9].
• DTMF: One of the most common forms is the dual tone multifrequency (DTMF)-based automation system, in which appliances can be controlled wirelessly [10].
• Speech recognition: It is known to be one of the most complicated areas of computer science. As speech is naturally dynamic, a number of methods are used in this type of recognition, namely artificial neural networks (ANN), pattern-based recognition, statistical analysis, and language modeling. This type of recognition is commonly used in home automation systems [11]. It is a growing concept and has become a widespread trend around the world, and the biggest companies, including Google and Amazon, are attempting to excel in this field. All we have to do is talk to our appliances and give the required commands, and the rest is taken care of by the voice assistant [12].
• Wi-Fi (Wireless Fidelity): Wi-Fi has become widely available and is a suitable basis for home automation systems. It has a number of advantages, such as wireless installation, lower cost, and no holes or damage to the walls; a device just needs to connect to the network, which can be done by any user [13].
• Web applications: Any electronic device, such as a fan, AC, TV, or motor, can be connected to the Internet using software as an interface and accessed remotely through an Android or Web-based application. Hence, another form of automation can be achieved using Android and Web applications [14].
• Short message service (SMS): The safety of the house or of appliances in the office can be checked and controlled using mobile phones that send commands as SMS texts and receive the status of the device [15].
Smart homes: Smart home represents the concept of a home where the appliances are connected to each other via a central home network and, if required, can be controlled and operated from a remote location. Smart home appliances can communicate with each other and with other smart objects in the home [17]. According to the Smart Home Association, the definition of smart home innovation is: “the integration of technology and services through home networking for a better quality of living” [18]. The term smart home is used to define a house which uses a system-based algorithm to carry out basic human
Fig. 1 Block diagram of a basic home automation system [16]
Fig. 2 Aging report estimated for 2050
needs and reduces the labor hours spent on these tasks. The most popular home controllers are those that are connected to a Windows device during programming and then left to perform their duties on their own. Integrating these smart home systems allows them to exchange information with the interconnected gadgets in the home, enabling command interpretation using voice recognition and more (Fig. 1). This field is growing significantly as technologies become more abundant. These home networks cover communication, entertainment, safety, adaptability, and information systems [19]. According to the World Health Organization (WHO) Global Health and Aging report, around 524 million people, about 8% of the world’s population, were aged 65 or above in 2010. By 2050, this number is estimated to reach 1.5 billion (around 16% of the world’s population) (Fig. 2). In addition, the WHO estimates that 650 million people around the world live with disabilities. It is therefore important for inventors to come up with user-friendly smart environments that give elderly and differently abled people greater independence [17]. Research conducted with elderly and disabled people found that, besides wanting to stay at home for as long as possible, they also want assistance in their daily life and an emergency response when they fall or have a medical condition; this can be incorporated in a smart home and is one of the main objectives of upcoming innovations in this field [20]. In a smart home, a variety of appliances can be controlled by technology; for example, lights can be switched ON by motion detectors so that they are used only when needed.
Another example is window blinds that close on their own when the lights inside are switched ON, or a thermostat that detects the temperature of the room and automatically adjusts it as needed [21]. While smart homes are expected to play an important role in the future, they
are still in an early phase and not yet affordable for everybody. Demand is still limited, as they currently add complexity rather than user-friendliness [22].
2 Growth of Home Automation System
A number of countries are trying to introduce the concept of smart homes into their citizens’ lives. The concept of smart cities is becoming widely familiar and is being introduced in the downtown hubs of large countries such as the UAE and the USA. In the UAE, Dubai is experimenting with a wide range of technological developments and has already started to improve its smart cities. Cities such as San Francisco and Toronto also have a number of smart innovations and smart-city initiatives of their own [23]. According to statista.com, revenue in the smart home market amounts to US$90,968 m in 2020. Revenue is expected to show an annual growth rate (CAGR 2020–2024) of 15.0%, resulting in a market volume of US$158,876 m by 2024 (Fig. 3). A global comparison reveals that most revenue is generated in the USA (US$27,649 m in 2020) [24].
Impact of IoT: The Internet of things (IoT) is an emerging field in which every device is assigned a specific IP address; with this address, the device has an identity on the Internet. The Internet is a constantly changing entity. It began as the “Internet of Computers.” Research and analysis have shown that the number of devices on it is increasing continuously, and hence it can also be known as the “Internet of Things” (IoT) [25]. The term “Internet of things” is coined from two words: the first is “Internet” and the second is “things”. The Internet is an interconnected system of networks that uses standard protocols (TCP/IP) to serve billions of users around the world. Today, over a hundred countries are linked into exchanges of data using the Internet [26].
Fig. 3 Projected annual growth CAGR of 15% (2020–2024)
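The revenue figures quoted in this section can be cross-checked against the stated growth rate. The sketch below assumes four compounding periods between 2020 and 2024; the function names are illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` for `years` periods."""
    return start_value * (1 + rate) ** years


revenue_2020 = 90_968    # US$ million, as quoted in the text
revenue_2024 = 158_876   # US$ million, as quoted in the text
periods = 4              # assumption: 2020 -> 2024 is four annual steps

print(f"implied CAGR: {cagr(revenue_2020, revenue_2024, periods):.2%}")
print(f"2024 volume at exactly 15%: {project(revenue_2020, 0.15, periods):,.0f}")
```

The implied rate comes out just under 15%, which is consistent with the rounded 15.0% figure quoted from the source.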
One example of implementing IoT is a real-time Web-based home automation system in which, to ensure maximum security, vibration can be controlled online via an HD spy camera. Such a system can be installed throughout an institution or a building and monitored from any place [27]. Another example is home automation that works on the principle of the cloud, with monitoring handled by various sensors [28].
3 Home Automation Challenges
Despite the advancements in home automation systems, there are a number of challenges in this field, as it is still an emerging area of exploration. Some of the areas for improvement are as follows:
• Eco-friendly: The devices used and systems implemented should be eco-friendly and should not add to the huge amount of e-waste that already exists. The devices should be reusable and should not become waste or cause pollution when out of use.
• Cost-effective: The devices and systems used should be budget friendly enough for a common middle-class person to purchase and should not be accessible only to the rich citizens of a nation, because smart homes would then remain beyond the reach of middle-class citizens.
• Safe to use: The devices will be used on a daily basis, so they should be safe, with proper safety precautions in place in case they cause harm. A proper safety protocol should apply at all times.
• Easy to adapt: The system implemented to perform the basic needs in a smart home should be easy enough for new users to follow and use on a daily basis, so that they do not need assistance just to use the system meant to make them more independent. It should be adaptable to all ages, to people with special needs, and to those who require more help in their daily chores.
• Secure: Since these devices are used through the Internet, such systems are exposed to hacking, and cyber security should be one of the main concerns when designing and implementing them, because once the system is hacked, the whole house can be in the hands of the hacker, which is very dangerous.
The aim is to invent and implement budget-friendly automated systems [29].
When devices are operated over Wi-Fi, it is important to make sure that the connection is low cost and affordable. It is also important that the connection is strong and does not drop at any point, as this would cause a variety of problems and make the system shut down and stop working. When using IoT-based systems, it is important that the devices are long lasting and at the same time affordable and easy to use. It is equally important that the users know what to do and how to do it when there is a
problem with the connections or devices and when new devices are installed. When using voice clients, it is very important that they can pick up voices across different audible ranges, so that in an emergency they can hear even the smallest or least audible sound. Due to these challenges, home automation systems are still an emerging area, and more people need to become familiar with the concept and educated enough to install these systems in their homes.
4 Framework Implementing Voice Recognition System
To make the system easily accessible and adaptable, we propose to use a voice recognition system as the base of our smart home management model. Though voice recognition has many errors and challenges of its own, it is by far one of the easiest and most adaptable systems to use. A variety of smart innovations created by big companies are coming up, such as Amazon’s Alexa, Apple’s Siri, and Google’s Ok Google. These are voice assistants and are examples of smart innovations based on voice clients. The three major steps in this voice recognition-based implementation are:
1. Recognizing voice commands using the module and mic.
2. Identifying the command using the module and microcontroller simultaneously.
3. Sending the control signals to the respective devices based on the commands given and the groups they are in.
The proposed framework with an ASR system works in a simple way, similar to other ASR algorithms. Voice data is captured by the microphone and immediately checked for its pattern. Based on the match found, the associated task manager allocates the relevant task for execution, and finally the voice-based system carries it out for smart home management.
Fig. 4 Framework for voice recognition model
For this purpose, we have used an automatic speech recognition (ASR) system, which is an automated computerized process of decoding and analyzing oral speech. A typical ASR-based system (Fig. 4) works by receiving acoustic input from a speaker, analyzing it using a pattern, method, or algorithm, and producing an output, usually a task given by the user to carry out. ASR collects the voice input, and preprocessing of the collected data starts immediately, followed by cleaning to remove any noise associated with the voice data. Once the data is clean, further
Fig. 5 An ASR-based system
analysis starts, which can be pattern recognition or matching; if a pattern is found and matched, the voice command is executed (Fig. 5). Several studies are available on different types of acoustic models, including Gaussian mixture models (GMM), hidden Markov models (HMM), and deep neural networks (DNN), along with variations on these models. HMM and GMM models have been studied most for statistical distributions and have given effective results for speech feature sequences. Internal representations for speech recognition have also been explored with the HMM model. Generative models of speech have also helped in understanding speech features. A noisy version of clean speech can be generated at any time by using straightforward distortion models [30]. In the proposed model for a home automation system, the major hardware components are a voice recognition module for capturing voice data and an Arduino Uno microcontroller, which communicates the commands. Voice commands are classified into five groups, and each group can have up to eight commands. The command groups are safety, access, AC, light, and utilities. To have a command carried out, the user has to say the name of the group along with the message for it to be registered successfully. The voice recognition module should be trained using the voices of up to five samples (a mix of men and women), so that the system can recognize voice
Fig. 6 Block diagram of voice recognition system prototype [32]
commands clearly and have no issues with distinguishing commands from usual talk [31]. The five groups and their uses are as follows (Fig. 6):
• Accessing group is used for opening and closing the doors.
• Lighting group is used to control the lights.
• AC group controls the temperature of the AC and switches it ON or OFF as required.
• Safety group is used to lock the door and sound/silence an alarm.
• Utilities group controls sprinklers and blinds and switches the TV ON/OFF.
This system can be tested based on five different parameters:
• Flexibility: It is flexible as the voice recognition module is tested for different demographics; hence, it is suitable for various environments.
• Robustness: It performs perfectly even when the user is away from the mic, but performance may decrease at higher levels.
• Security: The channel will be secured before the transmission takes place; hence, it is highly secure.
• Cost: The cost will be minimal since only one microcontroller and one voice recognition module are used.
• Response Time: The response time will be average, as the command has to be assigned to its group and then sent to the respective device.
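The group-plus-command scheme described in this section can be sketched as a lookup table followed by a dispatch step. This is an illustration only: it assumes the spoken input has already been converted to text, whereas the module described in the paper matches trained voice patterns on dedicated hardware, and the command phrases, COMMAND_GROUPS, parse_command, and control_signal are all hypothetical names.

```python
MAX_COMMANDS_PER_GROUP = 8  # limit stated in the text

# Hypothetical command phrases for the five groups named in the text.
COMMAND_GROUPS = {
    "access": {"open door", "close door"},
    "light": {"on", "off"},
    "ac": {"on", "off", "warmer", "cooler"},
    "safety": {"lock door", "sound alarm", "silence alarm"},
    "utilities": {"sprinklers on", "sprinklers off", "blinds open",
                  "blinds close", "tv on", "tv off"},
}


def parse_command(transcript):
    """Split recognized text into (group, command); None if not recognized."""
    words = transcript.lower().split()
    if not words:
        return None
    group, command = words[0], " ".join(words[1:])
    if command in COMMAND_GROUPS.get(group, set()):
        return group, command
    return None


def control_signal(transcript):
    """Build the message that would be sent to the device controller."""
    parsed = parse_command(transcript)
    if parsed is None:
        return "IGNORED"  # ordinary talk or an unknown command
    group, command = parsed
    return f"{group.upper()}:{command.upper().replace(' ', '_')}"


print(control_signal("Light on"))
print(control_signal("Safety lock door"))
print(control_signal("what a nice day"))
```

Requiring the group name as the first word keeps short commands such as "on" unambiguous across groups and makes it less likely that ordinary conversation triggers a device.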
5 The Future of Home Automation
The housing accommodation of the future might fulfill various needs such as medical needs, communication, energy-based requirements, utility, entertainment, and security. In the coming generations, many more devices will be interconnected with each other. The objective is to attain a future where information can be communicated between gadgets and people without depending on labor input. The future points to devices that can search for information on their own and then use this to
change different features of the housing accommodation. For example, the AC can automatically detect the rise and fall of temperature and adjust accordingly, or lights fitted with sensors can switch OFF when not in use. As the gadgets become more advanced, the goal is to make housing accommodation as easy to use as possible [33].
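The self-adjusting AC mentioned above can be illustrated with a simple rule. The sketch below uses a dead band around a target temperature so that the unit does not switch rapidly when the reading hovers near the target; the function name, the 24 °C target, and the 1 °C band are assumptions chosen for illustration.

```python
def ac_should_run(temperature_c, currently_on, target_c=24.0, band_c=1.0):
    """Decide whether the AC should run, with a dead band around the target.

    Above target + band the AC turns on, below target - band it turns off,
    and in between it keeps its current state.
    """
    if temperature_c > target_c + band_c:
        return True
    if temperature_c < target_c - band_c:
        return False
    return currently_on


# Walk through a short series of hypothetical temperature readings.
state = False
for reading in [23.5, 24.8, 25.3, 24.6, 23.4, 22.9]:
    state = ac_should_run(reading, state)
    print(f"{reading:.1f} C -> AC {'ON' if state else 'OFF'}")
```

The same pattern, a sensor reading compared against thresholds plus the current state, also covers lights that switch off after a period without detected motion.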
6 Conclusion
In conclusion, home automation, which includes the concepts of smart homes and smart cities, is currently one of the most discussed emerging fields. More teenagers and young adults should be encouraged to enter and advance this field, as the next generation will see a huge rise in smart homes and smart home users. Despite its challenges, this field is being made possible worldwide, and the most advanced cities are being created today. From smoke detectors to voice clients and from automated washing machines to smart wheelchairs for the disabled and the aged, this field covers it all. This research paper covers the challenges of this field as well as its possibilities and implementations. More engineers, designers, programmers, architects, etc., should focus on overcoming the challenges in this field and work together to make a better and smarter tomorrow for the upcoming generation of advanced users.
References
1. Jogdand, R.R., Choudhari, B.N.: DTMF based Home Automation System, vol. 7(2), Department of Electrical Power System, People’s Education Society College, Aurangabad, Maharashtra, India, Feb 2017
2. Prathima, N., Sai Kumar, P., Lal Ahmed, S.K., Chakradhar, G.: Voice recognition based home automation system for paralyzed people. Int. J. Modern Trends Sci. Technol. 03(02) (2017)
3. Jermilla, S., Kalpana, S., Preethi, R., Darling, D.R.R., Devashena, T.: Home automation using LabVIEW. 3(09) (2017)
4. Khan, W., Sharma, S.: Smart home (home automation). Int. J. Latest Trans. Eng. Sci. 2(2) (2017)
5. Dhami, H.S., Chandra, N.: Raspberry Pi home automation using android application. 3(2) (2017)
6. Laberg, T., Aspelund, H., Thygesen, H.: Smart home technology, planning and management in municipal services
7. Hespe, M.: Smart_Home_Magazine.pdf (2018)
8. Pampattiwar, K., Lakhani, M., Marar, R., Menon, R.: Home automation using Raspberry Pi controlled via an android application. 7(3) (2017)
9. Aravindhan, R., Ramanathan, M., Sanjai Kumar, D., Kishore, R.: Home automation using Wi-Fi interconnection. 04(03) (2017)
10. Robles, R.J., Kim, T.: Applications, systems and methods in smart home technology: a review. Int. J. Adv. Sci. Technol. 15 (2010)
11. Baidya, N., Kumar P.S.: A review paper on home automation. Int. J. Eng. Tech. 4(1) (2018)
12. Mehar, D., Gupta, R., Pandey, A.: A review on IOT based home automation techniques. Int. J. Eng. Manage. Res. 7(3) (2017)
13. Kumar, N., Singh, P.: Economical home automation system using Arduino UNO. 10 (2017)
14. Smart Home: Technologies with a standard battle, Sept 2017
15. Singh, S., Ray, K.K.: Home automation system using internet of things. RVS College of Engineering and Technology, Jamshedpur
16. Isa, E., Sklavos, N.: Smart home automation: GSM security system design & implementation. Computer Engineering & Informatics Department, University of Patras, Greece, Received 30 June 2015. Accepted 15 Jan 2016
17. Soliman, M.S., Alahmadi, A.A., Maash, A.A., Elhabib, M.O.: Design and implementation of a real-time smart home automation system based on Arduino microcontroller kit and LabVIEW platform. Int. J. Appl. Eng. Res. 12(18) (2017). ISSN 0973-4562
18. Asadullah, M., Raza, A.: An overview of home automation systems. Department of Electrical Engineering, National University of Computer and Emerging Sciences Peshawar, Pakistan (2016)
19. Palaniappan, S., Hariharan, N., Kesh, N.T., Deborah, A.: Home automation systems—a study. Int. J. Comput. Appl. (0975–8887) 116(11) (2015)
20. Amudha, A.: Home automation using IoT. Int. J. Electron. Eng. Res. 9(6) (2017). ISSN 09756450
21. Kulkarni, B.P., Joshi, A.V., Jadhav, V.V., Dhamange, A.T.: IoT based home automation using Raspberry PI. Int. J. Innov. Stud. Sci. Eng. Technol. 3(4) (2017)
22. Behera, S., Saha, A.K., Kumar, D., Polai, J.: Home automation control system using SMS. 04(03) (2017)
23. https://www.statista.com/outlook/283/100/smart-home/worldwide
24. Krell, M.: Wireless technologies for home automation, June 2014
25. Shinde, H.B., Chaudhari, A., Chaure, P., Chandgude, M., Waghmare, P.: Smart home automation system using android application. 04(04) (2017)
26. Rathi, K., Patil, D., Bhavsar, S., Jadhav, K., Thakur, S.V.: Gesture human-machine interface (GHMI) in home automation. 04(06) (2017)
27. Georgiev, A., Schlögl, S.: Smart home technology: an exploration of end user perceptions, 21 Aug 2018
28. IP 2014 Group 3 “Mango”, Smart Home Project
29. Belli, M.: Sensors in smart homes: a new way of living. Digital Innovation, 20/04/2018
30. Hsu, Y.-L.: Design and implementation of a smart home system using multisensor data fusion technology, 15 July 2017
31. Li, J., Gong, Y.: Robust Automatic Speech Recognition (2016)
32. Technical Report M2M/IoT Enablement in Smart Homes, Mar 2017
33. Soni, N., Dubey, M.: A review of home automation system with speech recognition and machine learning. 5(4) (2017)
CNN-Based Iris Recognition System Under Different Pooling Keyur Shah and S. M. Shah
Abstract The local pooling operation, which is used for downsampling feature maps of image regions, plays a central role in CNN architecture. In this paper, we propose a convolution neural network (CNN)-based iris recognition model under different pooling operations. We compare popular pooling approaches that allow pooling to learn and to extract feature maps from image regions. The performance of the model is evaluated for various pooling strategies by conducting an experiment on well-known public datasets. The results of the experiment show that the average pooling operation outperforms the other pooling operations. Keywords Convolution neural network · Pooling · Accuracy
K. Shah (B) · S. M. Shah, Gujarat Technological University, Ahmedabad, Gujarat, India. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_17
1 Introduction
Deep learning algorithms have advanced the computer vision field remarkably by introducing non-traditional and efficient solutions to several image-related problems that had long remained unsolved or only partially addressed. In this regard, convolution neural networks (CNNs) are the most important type of deep learning model with respect to their applicability in visual recognition tasks [1]. Various kinds of CNN-based iris recognition systems have been proposed to improve the performance of the system, see [2, 3] and references therein. In recent years, many studies have been carried out to improve the CNN architecture as well as the building blocks of CNNs [4–7]. Local pooling is one of the key components of a CNN. It downsizes feature maps by summarizing the presence of features in patches, thereby increasing computational efficiency and robustness to input variations. The pooling operation is like a filter that is applied to each feature map independently. Two common functions are used in the pooling operation, namely average pooling and max pooling. Average pooling calculates the average value, whereas max pooling calculates the maximum value, over local descriptors extracted from different image regions. The interested reader can refer to [8] for deeper insight into pooling strategies. In this paper, we recast the CNN-based iris model with linear scaling as in [2] under average pooling, max pooling, and depth-wise pooling. An experiment under the same settings as in [2] is carried out, and the performance of the CNN model with different pooling strategies is discussed.
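The difference between max pooling and average pooling can be illustrated on a single small feature map. The following is a minimal NumPy sketch, not the TensorFlow code used in the experiment; the function name `pool2d`, the 2 x 2 window, and the example values are assumptions made for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping size x size pooling over a 2-D feature map (illustrative helper)."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a window
    # Axis 1 and axis 3 index the position inside each pooling window.
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")

fmap = np.array([[1, 3, 2, 0],
                 [5, 7, 4, 6],
                 [9, 1, 0, 2],
                 [3, 3, 8, 6]], dtype=float)

max_pooled = pool2d(fmap, mode="max")        # keeps the strongest response per window
avg_pooled = pool2d(fmap, mode="average")    # keeps the mean response per window
```

Max pooling retains only the largest activation in each patch, while average pooling blends all activations in the patch, which is one reason the two can generalize differently.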
2 Experiment and Results
For the sake of simplicity, we employed the built-in functions available in TensorFlow for the max pooling, average pooling, and depth-wise pooling methods. The other settings for the experiment are the same as in [2]. We set the kernel size to 3 in the case of depth-wise pooling. The training and validation accuracy and loss are summarized in Table 1 and shown in Figs. 1, 2 and 3.

Table 1 Loss and accuracy under different pooling techniques
Pooling               Loss (train/valid/test)    Accuracy (train/valid/test)
Max pooling           0.3912/0.9576/0.9929       0.8836/0.8200/0.7900
Average pooling       0.0593/0.6372/0.8806       0.9811/0.9000/0.8600
Depth-wise pooling    0.0595/2.3373/2.8971       0.9822/0.6000/0.5200

Fig. 1 Loss and accuracy for max pooling
Fig. 2 Loss and accuracy for average pooling
Fig. 3 Loss and accuracy for depth-wise pooling

From Table 1, it can be seen that the training accuracy is highest for depth-wise pooling, but its validation and testing accuracy are the lowest. This gap between training and validation performance indicates that the model with depth-wise pooling is overfitted. Although the accuracy of the model with max pooling is high, its validation accuracy oscillates more (see Fig. 1) than that of average pooling, which indicates that the model with max pooling is also overfitted. The model with average pooling outperforms the other pooling strategies.
3 Conclusion
The pooling operation has become an integral part of CNNs because it effectively downsizes feature maps, which enhances the computational efficiency and robustness of the model. In this paper, we compared pooling methods and analyzed the performance of a CNN for iris recognition. Our experiment shows that the average pooling operation outperforms the other pooling operations. This research can be extended by analyzing the CNN model on another iris dataset with various pooling methods.
References
1. Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E.: Deep learning for computer vision: a brief review. Comput. Intell. Neurosci. (2018). https://doi.org/10.1155/2018/7068349
2. Shah, K., Shah, S.M., Pathak, D.: Enhancing convolution network with non-linear scaling for iris recognition. MuktShabd J. 840–846 (2020)
3. Tang, X., Xie, J., Li, P.: Deep convolutional features for iris recognition. In: Zhou, J., et al. (eds.) Biometric Recognition. CCBR 2017. Lecture Notes in Computer Science, vol. 10568. Springer, Cham (2017)
4. Lee, Y., Gallagher, W., Tu, Z.: Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 464–472 (2016)
5. Christlein, V., Spranger, L., et al.: Deep generalized max pooling. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 1090–1096 (2019)
6. Guo, Y., Li, Y., Wang, L., Rosing, T.: Depthwise convolution is all you need for learning multiple visual domains. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8368–8375 (2019)
7. Yu, D., Wang, H., Chen, P.: Mixed pooling for convolution neural networks. In: 9th International Conference on Rough Sets and Knowledge Technology, pp. 364–375 (2014)
8. Sharma, S., Mehra, R.: Implications of pooling strategies in convolution neural networks: a deep insight. Found. Comput. Decis. Sci. 303–330 (2019)
Semantics Exploration for Automatic Bangla Speech Recognition S. M. Zahidul Islam, Akteruzzaman, and Sheikh Abujar
Abstract ASRBangla is a proposed model that recognizes Bangla speech automatically using Mel Frequency Cepstral Coefficients (MFCC) and a Recurrent Neural Network (RNN). Speech recognition is the ability of a machine or program to identify spoken words and phrases and convert them into a machine-readable format. ASR has already been implemented for many other languages, but not adequately for Bengali. About 15% of the world's population, almost one billion people, are physically disabled. This technology serves people with disabilities, such as those who are blind, deaf, or otherwise physically impaired, and reduces their dependency on others. In this research work, we aim to build a deep learning model for automatic speech recognition in Bangla. Our proposed methodology uses MFCC for feature extraction and an RNN for training on the dataset. We trained an LSTM model to identify the most probable phonemes. Our proposed model achieves a 20% error rate in word detection on a Bangla-Word audio dataset. Keywords Bangla speech recognition · Automatic speech recognition · Mel Frequency Cepstral Coefficient (MFCC) · Long short-term memory (LSTM) · ASR · Recurrent Neural Network (RNN)
S. M. Z. Islam (B) · Akteruzzaman · S. Abujar Department of CSE, Daffodil International University, Dhanmondi, Dhaka 1205, Bangladesh e-mail: [email protected] Akteruzzaman e-mail: [email protected] S. Abujar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_18
1 Introduction
With the technological revolution, automatic speech recognition has progressed to the point where a large number of people use it to produce documents by speech. Humans can naturally understand speech, differentiate voices, and identify who is speaking, but communicating with a machine is difficult. Speech recognition makes communication with electronic devices easier. Automatic speech recognition is also known as speech-to-text conversion. One of the principal advantages of speech recognition services is the reduction in misspelled words that some typists produce when typing. The service reduces the time spent editing and correcting spelling. It is also a major advantage for people whose disabilities affect their ability to write but who can use their speech to compose messages on PCs or other devices. An ASR system aims to infer the original words from the observable signal. More than 215 million people around the world speak Bengali. Despite this large number of speakers, there are only a few smart technologies for this language. We have therefore proposed a method for automated speech recognition. Speech recognition is simply the process of converting speech into text, and it is a rapidly growing technology. Figure 1 shows a block diagram of the proposed speech recognition system.
Fig. 1 Block diagram of proposed system
2 Literature Review
A literature review (LR) is a scholarly survey of studies on a particular subject. It gives a general overview of the subject and helps identify closely related theories, techniques, and gaps in current work. Graves et al. made deep Long Short-Term Memory (LSTM) RNNs workable for speech recognition [3] and examined the connection between speech recognition and neural networks, which are generally combined with Hidden Markov Models (HMMs) [2]. Their LSTM RNN model reached a test set error of 17.7%. This method exploits the richer and more sophisticated modeling capabilities of RNNs compared with HMMs. Nahid et al. divided each word into several frames, each containing 13 Mel-Frequency Cepstral Coefficients (MFCC), which provide a workable feature set for identifying the most probable phonemes, and then trained an LSTM model on these frames [1]. The final layer of this deep model is a softmax layer with as many units as there are phonemes, used to detect individual Bangla words. Recurrent neural networks are well suited to training on audio sequences [3, 6, 8]. Their model reached a test set error of 28.7% on the Bangla-Real-Number audio dataset. The same task has also been done with HMMs. LSTM is another recurrent network that uses gates to store and recall memory [12]. One automated speech recognition system uses clean read speech and noisy datasets for end-to-end deep learning, where CTC is used to align the audio signal with the output and CNN and GRU layers are leveraged to learn and unroll long features. Batch normalization is applied to make the system robust [13]. In other work, Kaldi, an open source toolkit for speech recognition, was used to analyze GMM-HMM and DNN-HMM models, with reported error rates of 3.96% and 5.30% for a vocabulary of 500 unique words, reportedly the best result evaluated so far for Bangla words. Compared with these procedures, we propose an MFCC-based model in which a multilayer RNN is used to recognize voice.
3 Proposed Methodology
The proposed methodology consists of several steps, described below.
3.1 Extracting Features (MFCC)
Speech is a continuous wave. To recognize speech, the audio must first be recorded. The recorded audio is then converted into a form suitable for feeding to a neural network; it is sampled at 16 kHz. Sampling the recordings is what allows the coefficient distance between two words to be examined. The recordings are then passed to feature extraction, for which MFCC is used. This algorithm takes the voice as input, processes it, and finally computes a unique set of coefficients for each sample. The procedure is effective because it groups frequencies according to the mel scale, which closely resembles the human auditory system. To obtain the MFCCs, a Discrete Cosine Transform (DCT) is applied at the end of a pipeline of several steps: pre-emphasis, windowing and sampling, FFT, mel filter bank, and DCT.
3.2 Pre-emphasis
In speech recognition, the original signal contains too much low-frequency energy. The pre-emphasis filter is applied to amplify the high frequencies of the signal; it boosts the high-frequency content relative to the low-frequency content, as in Eq. (1). In all cases the SNR was greater than 0 dB, which indicates that the signal is stronger than the noise. We pass the signal $p(x)$ through a high-pass filter:

$$P_2(x) = p(x) - q \, p(x-1), \qquad 0.9 \le q \le 1 \tag{1}$$

The z-transform of Eq. (1) gives the filter

$$M(y) = 1 - q \, y^{-1} \tag{2}$$
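The pre-emphasis filter of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `pre_emphasis`, the default q = 0.97 (a value inside the stated 0.9 to 1 range), and the choice to pass the first sample through unchanged are assumptions, not details given in the paper.

```python
import numpy as np

def pre_emphasis(p, q=0.97):
    """Apply P2(x) = p(x) - q * p(x - 1) to a 1-D signal (q = 0.97 is an assumed default)."""
    p = np.asarray(p, dtype=float)
    # The first sample has no predecessor, so it is passed through unchanged (assumption).
    return np.append(p[0], p[1:] - q * p[:-1])

# A constant (purely low-frequency) signal is strongly attenuated after the first sample.
flat = pre_emphasis([1.0, 1.0, 1.0], q=0.9)
```

A slowly varying signal is almost cancelled by the subtraction, while rapid sample-to-sample changes pass through, which is the high-pass behavior described above.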
3.3 Framing
After pre-emphasis, the audio is split into small pieces called frames. The audio signal changes constantly, but within a small frame it does not change much. Each frame should be about 20–25 ms of the original audio signal. Typically the frame size (in sample points) is a power of two in order to facilitate the use of the FFT. If the sample rate is 16 kHz and the frame size is 256 sample points, then the frame duration is 256/16000 = 0.016 s = 16 ms.
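The framing arithmetic above can be checked with a short sketch. The helper `frame_signal` and the non-overlapping hop of 256 samples are assumptions for illustration; the paper does not state whether its frames overlap.

```python
import numpy as np

def frame_signal(signal, frame_size=256, hop=256):
    """Split a 1-D signal into frames of frame_size samples, hop samples apart."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_size:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_size) // hop  # trailing partial frame is dropped
    return np.stack([signal[i * hop:i * hop + frame_size] for i in range(n_frames)])

sample_rate = 16000                              # 16 kHz, as stated in the text
frames = frame_signal(np.zeros(sample_rate))     # one second of (silent) audio
frame_duration_ms = 1000 * 256 / sample_rate     # 256 / 16000 s expressed in ms
```

With these settings one second of audio yields 62 full frames of 16 ms each.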
3.4 Windowing and Sampling
Windowing is generally performed to obtain a better trade-off between smearing and leakage. In this case, the Hamming window of Eq. (3) gives the best quality for signal reconstruction. The Hamming window reduces ripple in the frequency spectrum of the original signal. If the signal in a frame is $Y(r)$, where $r = 0, 1, 2, 3, \ldots, R-1$, then the windowed signal is

$$Y(r) \, Q(r) \tag{3}$$

where $Q(r) = 0.54 - 0.46 \cos\left(\frac{2\pi r}{R-1}\right)$ for $0 \le r \le R-1$.
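As a minimal sketch, the window defined above can be computed directly and applied to one frame. The function name `hamming_window` and the frame length of 256 are illustrative assumptions.

```python
import numpy as np

def hamming_window(R):
    """Q(r) = 0.54 - 0.46 * cos(2*pi*r / (R - 1)) for r = 0 .. R-1."""
    r = np.arange(R)
    return 0.54 - 0.46 * np.cos(2 * np.pi * r / (R - 1))

frame = np.ones(256)                      # placeholder frame Y(r)
windowed = frame * hamming_window(256)    # Y(r) * Q(r)
```

The window tapers both ends of the frame toward 0.08 instead of cutting them off abruptly, which is what limits spectral leakage. NumPy's built-in `np.hamming` implements the same formula.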
3.5 FFT Algorithm
The FFT is used to describe the magnitude and amplitude of the signal, as in Eq. (4). When the FFT is performed on a frame, it is assumed that the signal within the frame is periodic and continuous when wrapping around. The equation is given below:

$$P = \frac{|\mathrm{FFT}(x_i)|^2}{N} \tag{4}$$
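A minimal sketch of Eq. (4) using NumPy's real-input FFT follows. It assumes that N is the number of FFT points (256 here) and that x_i is a single frame; the function name `power_spectrum` and the 1 kHz test tone are illustrative choices.

```python
import numpy as np

def power_spectrum(frame, n_fft=256):
    """P = |FFT(x_i)|^2 / N for one frame, with N taken to be n_fft (assumption)."""
    spectrum = np.fft.rfft(frame, n=n_fft)   # one-sided spectrum of a real signal
    return np.abs(spectrum) ** 2 / n_fft

# A 1 kHz tone sampled at 16 kHz falls exactly on bin 1000 / 16000 * 256 = 16.
sample_rate = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(256) / sample_rate)
P = power_spectrum(tone)
peak_bin = int(np.argmax(P))
```

Because the tone completes a whole number of cycles inside the frame, the periodicity assumption holds exactly and all the energy lands in a single bin.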
3.6 Mel Filter Bank
Filter banks are used in many areas, such as signal and image compression and processing. The main use of filter banks is to divide a signal or system into several separate frequency domains. Different filter designs can be used depending on the purpose. The magnitude frequency response is multiplied by 39 triangular band-pass filters to obtain the log energy of each filter. These filters are evenly spaced along the mel frequency scale. There are 13 half-overlapping filters covering 133.3 Hz to 1 kHz and 27 logarithmically spaced filters covering 1–8 kHz.
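3.7 Discrete Fourier Transform
The DFT contrasts with the DTFT, which uses discrete time but transforms to continuous frequency. In this step, the outputs of the R band-pass filters are passed to the DCT to obtain L mel-scale cepstral coefficients. The DCT equation is

$$D(r) = \sum_{T=1}^{R} C_T \cos\left(r \, (T - 0.5) \, \frac{\pi}{39}\right), \qquad r = 0, 1, \ldots, R \tag{5}$$

where $C_T$ is taken to be the log energy of the $T$-th filter.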
Fig. 2 13 mel-scale cepstral coefficients
Fig. 3 39 mel-scale cepstral coefficients
In Eq. (5), the number of triangular band-pass filters is R and the number of mel-scale cepstral coefficients is L. In this project, R = 39 and L = 13, see also Figs. 2 and 3. Since the FFT was performed earlier, the DCT maps the frequency domain into what is known as the coefficient domain. The features obtained are similar to the cepstrum, so they are referred to as mel-scale cepstral coefficients, or MFCC. MFCC alone can be used as the feature set for speech recognition, which is why we adopt it.
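The DCT step can be sketched as a small matrix product. The sketch assumes that the inputs C_T are the log energies of the R = 39 filters and that the L = 13 coefficients correspond to r = 1, ..., 13; neither the index range nor the function name `mel_cepstrum` is confirmed by the paper.

```python
import numpy as np

def mel_cepstrum(log_energies, n_coeffs=13):
    """D(r) = sum_T C_T * cos(r * (T - 0.5) * pi / R), for r = 1 .. n_coeffs (assumed range)."""
    C = np.asarray(log_energies, dtype=float)
    R = len(C)                                  # number of band-pass filters (39 in the text)
    T = np.arange(1, R + 1)                     # filter index T = 1 .. R
    r = np.arange(1, n_coeffs + 1)[:, None]     # coefficient index r as a column
    basis = np.cos(r * (T - 0.5) * np.pi / R)   # (n_coeffs, R) cosine basis
    return basis @ C

# A perfectly flat log-energy profile carries no spectral shape, so every coefficient is ~0.
flat_coeffs = mel_cepstrum(np.ones(39))
```

Each coefficient measures how strongly the log-energy profile resembles one cosine shape, so a handful of low-order coefficients summarizes the spectral envelope.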
3.8 Recurrent Neural Network (RNN)
The RNN is essential in speech recognition. It is used to train on data for the acoustic and language models. The raw audio is split into sub-frames by the MFCC step, and each word is framed with 13 distinct features. These features are then trained by an LSTM within a deep neural network, which produces a statistical output for predicting the phoneme as well as the word. The LSTM is used to store and retrieve memory.
4 BRNN and Network Training
One shortcoming of ordinary RNNs is that they are only able to use previous context. In this process the whole utterance is transcribed at once, so there is no reason not to exploit future context as well. A bidirectional RNN (BRNN) processes the data in both directions with a pair of separate hidden layers, which are then fed forward to the same output layer. Combining LSTM with a BRNN gives access to long-range context in both input directions.
4.1 Network Training
Network training is end-to-end, so the RNN learns to map directly from acoustic to phonetic sequences. The network outputs are used to define a differentiable distribution over all phonetic output sequences y given an input x. The whole system can then be optimized with gradient descent.
4.2 CTC
A softmax layer is used for the output distribution. CTC sums over all possible alignments and determines the normalized probability of the target sequence. The bidirectional RNN is trained with CTC.
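The idea of summing over all alignments can be made concrete with a brute-force sketch. This is for intuition on tiny inputs only (real CTC implementations use a dynamic-programming forward pass); the blank index 0 and the names `collapse` and `ctc_probability` are assumptions, not the authors' code.

```python
import itertools
import numpy as np

BLANK = 0  # index of the CTC blank symbol (assumption)

def collapse(path):
    """Merge repeated symbols, then drop blanks: (1, 1, 0, 1) -> (1, 1)."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_probability(probs, target):
    """Sum the probability of every frame-level path that collapses to target.

    probs has shape (time_steps, n_symbols); each row is a softmax output.
    The enumeration is exponential in time_steps, so keep the input tiny.
    """
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path) == tuple(target):
            total += float(np.prod(probs[np.arange(T), list(path)]))
    return total

uniform = np.full((3, 3), 1.0 / 3.0)     # 3 frames, symbols {blank, 1, 2}
p_single = ctc_probability(uniform, [1])
```

With three uniform frames, six of the 27 possible paths collapse to the single label 1, so its probability is 6/27.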
5 Experimental Results and Comparison
The trained model was evaluated on entirely different words and obtained a prediction error of 20%. The dataset contains voices from 6 different regions speaking different words. Because noise was not properly removed from the dataset, some training error occurs. Table 1 shows a comparison between previous work and ASR Bangla.
Table 1 Comparison between previous work
Work                                      Accuracy (%)
TIMIT phoneme recognition [3]             82.3
Bangla-Real-Number audio dataset [1]      71.3
ASR Bangla (Proposed dataset)             93.90
Fig. 4 Training and validation loss
Figure 4 shows the training and validation loss. After 54 epochs, our proposed model achieved 93.90% accuracy on the training set and 80% accuracy on the validation set of our ASRBangla dataset.
6 Conclusion
The semantic exploration model for Bangla speech recognition depends essentially on a statistical model. For proper evaluation, the signal is chunked into small parts, and MFCC is used for feature extraction. Reshaping and batch normalization are among the techniques applied to the training data for the RNN. The RNN is very effective because it uses memory to capture context. The LSTM outputs the predicted phonemes as integer values, which are matched against our language model. The accuracy of this model is 80%, depending on the GPU and dataset. Our future plan is to visualize the sequence of characters in order to form proper words as well as sentences, which will help in building a Bangla chatbot.
References
1. Nahid, M.M.H., Purkaystha, B., Islam, M.S.: Bengali speech recognition: a double layered LSTM-RNN approach. In: Proc. 20th Int. Conf. on Computer and Information Technology (ICCIT), pp. 22–24 (2017)
2. Zhu, Q., Chen, B., Morgan, N., Stolcke, A.: Tandem connectionist feature extraction for conversational speech recognition. In: International Conference on Machine Learning for Multimodal Interaction, MLMI'04, pp. 223–231. Springer, Heidelberg (2005)
3. Graves, A., Mohamed, A.-R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 6645–6649 (2013)
4. Islam, M.R., Sohail, A.S.M., Sadid, M.W.H., Mottalib, M.A.: Bangla speech recognition using three layer back-propagation neural network. In: Proceedings of NCCPB, Dhaka (2005)
5. Rahman, K.J., Hossain, M.A., Das, D., Touhidul Islam, A.Z.M., Ali, M.G.: Continuous Bangla speech recognition system. In: Proc. 6th Int. Conf. on Computer and Information Technology (ICCIT), Dhaka (2003)
6. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4960–4964 (2016)
7. Saini, P., Kaur, P., Dua, M.: Hindi automatic speech recognition using HTK. Int. J. Eng. Trends Technol. (IJETT) 4(6), 2223–2229 (2013). ISSN: 2231-5381
8. Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M., Stolcke, A., Yu, D., Zweig, G.: The Microsoft 2016 conversational speech recognition system. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5255–5259 (2017)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
10. Graves, A., Fernandez, S., Liwicki, M., Bunke, H., Schmidhuber, J.: Unconstrained online handwriting recognition with recurrent neural networks. In: NIPS (2008)
11. Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In: NIPS (2009)
12. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5), 602–610 (2005)
13. Sumit, S.H., Al Muntasir, T., Arefin Zaman, M.M., Nandi, R.N., Sourov, T.: Noise robust end-to-end speech recognition for Bangla language. In: International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 21–22 (2018)
A Global Data Pre-processing Technique for Automatic Speech Recognition Akteruzzaman, S. M Zahidul Islam, and Sheikh Abujar
Abstract Automatic speech recognition (ASR) is one of the major research fields in computer science. ASR is a necessary tool for almost every language, but every language has its own techniques and characteristics. In past years, a significant amount of notable research has been done in this field. The accuracy of ASR depends largely on the noise reduction process, which is a fundamental requirement of speech data preprocessing. There are various ways of preprocessing data for speech recognition. In this paper, we discuss an effective preprocessing technique for Bangla speech data. We demonstrate the whole process: recording, digitizing the data, removing noise with respect to the signal-to-noise ratio (SNR), transforming the signal from the time domain to the frequency domain with the fast Fourier transform (FFT), chunking the resulting frames into small windows, and finally applying the Mel frequency cepstral coefficients (MFCC) algorithm to turn the data into vector representations that are ready for training in a neural network. Keywords Record data · Signal to noise ratio · Transform audio in smooth way · Chunk the audio · Automatic speech recognition · Prepare for training
Akteruzzaman (B) · S. M. Z. Islam · S. Abujar, Department of CSE, Daffodil International University, Dhanmondi, Dhaka 1205, Bangladesh. e-mail: [email protected] S. M. Z. Islam e-mail: [email protected] S. Abujar e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_19
1 Introduction
Speech is not only a means of communication among human beings. With advancing technology, we use it for many purposes. The main difficulty in ASR is identifying the sub-word boundaries of speech, called syllables. Speech recognition is particularly helpful for people who are blind or physically disabled. Today, a device that interacts with humans through voice commands can work effectively. Bangladesh is moving toward digitization, so it is important to build this kind of application for its large population. Many speech recognition applications have been built for English, French, and Mandarin. It is therefore time to build such an application for Bangla.
2 Literature Review
Machine learning has greatly changed speech recognition. Since the 1990s it has become a vital topic for researchers. The whole recognition process depends on how carefully the data are processed. Different researchers use different algorithms for processing: MFCC, Perceptual Linear Prediction (PLP) [9], Relative Spectral Processing (RASTA), and Linear Predictive Cepstral Coefficients (LPCC). This section describes previous work that also deals with data preprocessing techniques relevant to ASR. Saini et al. proposed a model for automatic speech recognition in Hindi. The model uses HTK and, on isolated words, produced an accuracy of 96.61% using an HMM topology with 10 states [4]. Ali et al. worked on an automatic speech recognition technique for Bangla words using a Gaussian Mixture Model (GMM) and Linear Predictive Coding (LPC) [1]. A three-layer back-propagation neural network has been used for classification in a Bangla ASR system [2]. A continuous Bangla speech recognition system based on an ANN has been proposed, in which the Fourier transform is used for spectral analysis [3]. In another approach, patterns are classified by an HMM technique that incorporates a stochastic language model, and MFCC is used to extract the feature vector [6]. The first and second derivatives of the MFCC are applied to determine dynamic feature vectors. These vectors were matched by DTW, followed by regression analysis, and classified by an SVM. In speech recognition, preprocessing is followed by pattern recognition. The preprocessing stage comprises start and end point detection, windowing, filtering, and the calculation of Linear Predictive Coding (LPC) and cepstral coefficients to construct a codebook using vector quantization. The codebook is then used by an Artificial Neural Network (ANN) for training and comparison in order to recognize speech.
3 Methodology
This global technique consists of several steps for preprocessing data, described below.
3.1 Record Voice
Recording is the first step of this methodology for speech recognition. The speech we use to communicate is an analog signal. An analog signal is not readable by any machine, so it needs to be digitized and saved properly for better optimization. It is better to record voice in a noise-free area. The voice is recorded with a microphone. Figure 1 shows the block diagram of data sampling.
3.2 Pulse Code Modulation
The digitization of sound is called pulse code modulation (PCM). Digital devices store audio in this standard form. An analog signal wave is a continuous representation, whereas a computer can only store discrete values. Figure 2 shows the digitized audio recorded from users. The digitization process starts at the microphone, whose output is then converted by an ADC. PCM has two standards, the American standard T1 and the European standard E1. The sampling rate is the number of samples taken per second. Samples of the amplitude are taken at regular time intervals. Each sample is then quantized to the nearest of a set of discrete levels. Dividing the analog signal range into non-overlapping discrete intervals and assigning each sample the value of its interval is called quantization. For example, an 8-bit and a 15-bit ADC produce different values even when they sample at the same frequency.
Fig. 1 Block diagram of data sampling
Fig. 2 Digitized audio recoded from users
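The effect of bit depth on quantization can be illustrated with a small sketch. It assumes a signal normalized to the range [-1, 1] and a uniform quantizer; the function name `quantize` and the 5 Hz test tone are illustrative and do not describe the recording setup used in the paper.

```python
import numpy as np

def quantize(signal, n_bits):
    """Uniformly quantize a signal in [-1, 1] to 2**n_bits levels.

    Returns the integer codes and the reconstructed (dequantized) signal.
    """
    levels = 2 ** n_bits
    clipped = np.clip(np.asarray(signal, dtype=float), -1.0, 1.0)
    codes = np.round((clipped + 1.0) / 2.0 * (levels - 1)).astype(int)  # 0 .. levels-1
    reconstructed = codes / (levels - 1) * 2.0 - 1.0
    return codes, reconstructed

t = np.linspace(0.0, 1.0, 1000)
x = np.sin(2 * np.pi * 5 * t)            # stand-in for an analog waveform

codes8, rec8 = quantize(x, 8)
codes16, rec16 = quantize(x, 16)
err8 = np.max(np.abs(rec8 - x))          # worst-case quantization error, 8 bits
err16 = np.max(np.abs(rec16 - x))        # worst-case quantization error, 16 bits
```

Both converters see the same samples at the same rate, but the one with more bits has finer steps and therefore a smaller reconstruction error.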
3.3 Signal to Noise Ratio
When audio is recorded, some extra noise is added to the samples, so the voice is altered. For signal processing we need to remove this extra noise in order to obtain good quality sound. The signal-to-noise ratio (SNR), Eq. (1), is usually expressed in decibels. The higher the ratio, the closer we are to the true sound. It is defined as

$$\mathrm{SNR} = \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \tag{1}$$

For signals with a wide dynamic range, this equation can be written in decibel form:

$$P_{\mathrm{signal,dB}} = 10 \log_{10}(P_{\mathrm{signal}}) \tag{2}$$

$$P_{\mathrm{noise,dB}} = 10 \log_{10}(P_{\mathrm{noise}}) \tag{3}$$

$$\mathrm{SNR_{dB}} = P_{\mathrm{signal,dB}} - P_{\mathrm{noise,dB}} \tag{4}$$

When the signal and noise levels are equal, the ratio is 1 and the SNR is 0 dB. In this situation the signal is practically unreadable, because the noise level competes directly with it. As an illustration, suppose that the signal level is 20.00 microvolts (µV) and the noise level is 1.00 µV. The signal-to-noise ratio is then 20.00, so the sound is clearly readable.
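The decibel form of the SNR can be sketched directly from the definitions above. The function name `snr_db` is an assumption, and the inputs are treated as powers; if the 20:1 example refers to amplitudes rather than powers, the decibel value would differ.

```python
import numpy as np

def snr_db(p_signal, p_noise):
    """SNR in decibels: 10*log10(P_signal) - 10*log10(P_noise)."""
    return 10 * np.log10(p_signal) - 10 * np.log10(p_noise)

equal_levels = snr_db(5.0, 5.0)     # signal and noise power equal -> 0 dB
strong_signal = snr_db(20.0, 1.0)   # a power ratio of 20
```

A power ratio of 20 corresponds to roughly 13 dB, while equal signal and noise power gives exactly 0 dB.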
3.4 Transform Audio in Smooth Way
This is a very significant step for the analysis and interpretation of the signal. Signal processing improves transmission fidelity, storage efficiency, and subjective quality, and it emphasizes or detects components of interest in a measured signal. The processed signal is a sequence of numbers that represent samples of a continuous variable in a domain such as time, frequency, or space. Various algorithms are used to transform the signal, for example the Fast Fourier Transform (FFT), Infinite Impulse Response (IIR) filters, and Finite Impulse Response (FIR) filters. Most researchers use the FFT algorithm.
3.5 Fast Fourier Transform (FFT)
The digitized signal in the time domain is converted into the frequency domain. The FFT is a very effective algorithm for this transform. It is an implementation of the Discrete Fourier Transform (DFT), but its time complexity is O(N log N), which is lower than the O(N^2) of the direct DFT. The FFT algorithm operates by decomposing an N-point time-domain signal into N time-domain signals, finding the spectrum of each, and lastly synthesizing these spectra into a single frequency spectrum. Figure 3 shows the flow diagram of converting the time domain to the frequency domain. The FFT is the most commonly used algorithm in signal processing for spectrum analysis. Several related transforms are also applied, such as the Winograd transform, the discrete Hartley transform, and the discrete cosine transform (DCT). The DCT is mainly used for real-time signal synthesis with a large compression ratio.
Fig. 3 Flow diagram of converting time domain to frequency domain
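The relationship between the DFT and the FFT can be checked numerically: both produce the same spectrum, and only the cost differs. The sketch below is illustrative; `naive_dft` is a hypothetical helper that evaluates the DFT definition directly with an N x N matrix.

```python
import numpy as np

def naive_dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * exp(-2j*pi*k*n/N)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]                                # column of output indices
    return np.exp(-2j * np.pi * k * n / N) @ x    # N x N matrix times the signal

rng = np.random.default_rng(0)
signal = rng.standard_normal(64)

slow = naive_dft(signal)       # quadratic cost
fast = np.fft.fft(signal)      # O(N log N) cost, same result
```

For short signals the difference is negligible, but the quadratic cost of the direct form grows quickly with the frame length.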
3.6 Discrete Cosine Transform (DCT)
The DCT expresses a finite data sequence in terms of cosine functions and is mainly used for lossy compression of audio signals (Eqs. 5, 6 and 7). Audio data are one-dimensional, so for a signal $x(n)$ of $N$ samples the DCT is defined by

$$V(k) = 2 \sum_{n=0}^{N-1} x(n) \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right], \qquad 0 \le k \le N-1 \tag{5}$$

$$\Rightarrow V(k) = W_{2N}^{k/2} \, S(k), \qquad 0 \le k \le N-1 \tag{6}$$

$$\Rightarrow V(k) = 2 \, \mathrm{Re}\left[ W_{2N}^{k/2} \sum_{n=0}^{N-1} x(n) \, W_{2N}^{nk} \right], \qquad 0 \le k \le N-1 \tag{7}$$

where $W_{2N} = e^{-j 2\pi / (2N)}$.
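The cosine-sum form of the DCT and the complex-exponential form can be compared numerically. The sketch below assumes the usual DCT-II convention, V(k) = 2 Σ x(n) cos[π(n + 1/2)k/N], and W_2N = exp(-j2π/(2N)); the function names are hypothetical.

```python
import numpy as np

def dct_direct(x):
    """V(k) = 2 * sum_n x(n) * cos(pi/N * (n + 1/2) * k)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    return 2 * (np.cos(np.pi / N * (n + 0.5) * k) @ x)

def dct_via_exponentials(x):
    """V(k) = 2 * Re[ W^(k/2) * sum_n x(n) * W^(n*k) ], with W = exp(-j*2*pi/(2N))."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    inner = np.exp(-1j * np.pi * n * k / N) @ x           # sum_n x(n) * W^(n*k)
    twiddle = np.exp(-1j * np.pi * np.arange(N) / (2 * N))  # W^(k/2)
    return 2 * np.real(twiddle * inner)

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
v_cos = dct_direct(x)
v_exp = dct_via_exponentials(x)
```

The two forms agree because the real part of exp(-jπk(n + 1/2)/N) is exactly the cosine term, which is what allows the DCT to be computed with FFT machinery.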
3.7 Chunk/Windowing the Audio
Windowing is the technique of splitting the audio into chunks of equal length. It is performed to avoid unnatural discontinuities in the speech signal. Frames must overlap by 60–70% to avoid distortion of the spectrum, and a chunk duration of 25 ms is appropriate for windowing. Depending on the needs of the application, a window function is selected according to the spectral leakage it allows. In this paper, the Hamming window is applied to chunk the audio.
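A minimal sketch of overlapping chunking is given below, assuming a 16 kHz sample rate (not stated in this section), 25 ms chunks, 60% overlap, and a Hamming window; the function name `chunk_audio` and its defaults are illustrative.

```python
import numpy as np

def chunk_audio(signal, sample_rate=16000, chunk_ms=25, overlap=0.6):
    """Split audio into overlapping, Hamming-windowed chunks."""
    signal = np.asarray(signal, dtype=float)
    size = int(sample_rate * chunk_ms / 1000)          # samples per chunk (400 here)
    hop = max(1, int(round(size * (1 - overlap))))     # step between chunk starts (160 here)
    if len(signal) < size:
        raise ValueError("signal is shorter than one chunk")
    n_chunks = 1 + (len(signal) - size) // hop
    window = np.hamming(size)
    return np.stack([signal[i * hop:i * hop + size] * window for i in range(n_chunks)])

chunks = chunk_audio(np.ones(16000))   # one second of placeholder audio
```

With 60% overlap each new chunk starts 10 ms after the previous one, so samples attenuated at the edge of one window are near the center of a neighboring window.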
4 Prepare for Training
To train on this signal in a neural network, the data need to be reshaped and batch normalized. Speech data are complex and heavy, so training is difficult and time consuming. Batch normalization, which is widely used in the computer vision community, accelerates training and improves generalization. For better performance, we need to extract features from the windows. MFCC has been widely used in recent years for feature extraction.
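The normalization step can be sketched on a batch of feature vectors. This is a simplified illustration of the batch-normalization transform without the learnable scale and shift parameters that a full layer would include; the function name, the batch shape (32 frames of 13 features), and the random data are assumptions.

```python
import numpy as np

def batch_normalize(batch, eps=1e-5):
    """Normalize each feature column of a (batch, features) array to zero mean, unit variance."""
    batch = np.asarray(batch, dtype=float)
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + eps)   # eps guards against division by zero

rng = np.random.default_rng(2)
features = rng.standard_normal((32, 13)) * 5.0 + 3.0   # placeholder feature batch
normalized = batch_normalize(features)
```

Putting every feature on a comparable scale keeps gradients well conditioned, which is the usual explanation for the faster training mentioned above.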
4.1 Mel Frequency Cepstral Coefficient (MFCC)
Sounds produced by humans are shaped by the vocal tract. If we can determine this shape, we know which phoneme is being produced. Each phoneme is uttered over a short time, and its shape shows up in the short-time spectrum that the MFCCs represent (see Eq. (8)). Because the speech signal changes constantly, we use frames of about 25 ms, which gives enough samples for a dependable spectral estimate. A periodogram estimate is used to calculate the power spectrum, and the Mel filter bank determines the energy of the signal in each band. The formula for converting a signal frequency to the Mel scale is:

$$M(f) = 1125\,\ln(1 + f/700) \tag{8}$$
Here f is the signal frequency. For MFCC we then apply the discrete Fourier transform (DFT):

$$S(k) = \sum_{l=1}^{L} S(l)\,h(l)\,e^{-j2\pi kl/L} \tag{9}$$

where, in Eq. (9), S(l) is the time-domain signal, h(l) is an L-sample analysis window applied to each frame (here the Hamming window), and k runs over the length of the DFT. The periodogram-based power spectral estimate for the speech frame S(l) is given by:

$$P(k) = \frac{1}{N}\,|S(k)|^{2} \tag{10}$$

with N the number of samples in the frame (L in Eq. (9)). After that we take the logarithm of the resulting energies. Finally, the DCT decorrelates the energies, which yields an approximately diagonal covariance matrix. The generated matrices are used as the features on which the neural network model is trained.

A Global Data Pre-processing Technique for Automatic …

Table 1 Comparison between previous work
Work                                          Accuracy (%)
CNN + MFCC based accuracy [5], numeral data   74
CNN + MFCC based model [7]                    93
GMM-HMM based model [8]                       64.4
MFCC + RNN based (proposed model)             93.90
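The first MFCC steps of Sect. 4.1, Eqs. (8) to (10), can be sketched as below. This is a minimal illustration, not the authors' implementation: the frame length of 64 samples, the synthetic cosine tone, and all function names are assumptions, and the Mel filter bank, logarithm and DCT stages are left out.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    """Eq. (8): M(f) = 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of Eq. (8), used when placing Mel filter bank edges."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)


def hamming(length):
    """Hamming analysis window h(l)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_dft(frame, window):
    """Eq. (9): S(k) = sum_{l=1}^{L} S(l) * h(l) * exp(-j*2*pi*k*l / L)."""
    length = len(frame)
    return [
        sum(frame[l - 1] * window[l - 1]
            * cmath.exp(-2j * math.pi * k * l / length)
            for l in range(1, length + 1))
        for k in range(length)
    ]


def periodogram(spectrum):
    """Eq. (10): P(k) = |S(k)|^2 / N, with N the frame length."""
    n = len(spectrum)
    return [abs(s) ** 2 / n for s in spectrum]


frame_len = 64  # short toy frame; a 25 ms frame would be longer
tone_bin = 8    # the toy tone completes 8 cycles per frame
frame = [math.cos(2.0 * math.pi * tone_bin * l / frame_len)
         for l in range(frame_len)]
power = periodogram(windowed_dft(frame, hamming(frame_len)))
# Search the non-negative frequencies only (bins 0 .. L/2).
peak_bin = max(range(frame_len // 2 + 1), key=lambda k: power[k])
print("peak at bin", peak_bin, "| 1000 Hz =", round(hz_to_mel(1000.0), 1), "mel")
```

The periodogram peaks at the bin of the input tone, and `mel_to_hz` undoes `hz_to_mel`, which is what a filter bank construction would rely on.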
5 Result Comparison
Recognition accuracy always depends on how accurately the data are processed. Many researchers have tried many techniques for automatic speech recognition. Table 1 shows a comparison between the results of previous work and our proposed technique.
6 Conclusion
Proper processing leads to proper results. The speech recognition system can be improved further by using more data. The process is universal and can be used for all languages, though we initiated it for the Bangla language. Bangla vowels have different points of utterance, which can be determined easily by our algorithm, so that speech is recognized effectively. In this process we use the most useful data preprocessing techniques and explain them. MFCC has recently become very popular for speech recognition; it basically represents the acoustic model for speech recognition. With this preprocessing we are able to separate the Mel frequencies from the original audio. Building on this data preprocessing, we would like to work on end-to-end Bangla speech recognition for native speakers.
References
1. Ahammad, K., Rahman, M.M.: Connected Bangla speech recognition using artificial neural network. Int. J. Comput. Appl. 149(9), 38–41 (2016). ISSN 0975-8887
2. Ali, M.A., Hossain, M., Bhuiyan, M.N.: Automatic speech recognition technique for Bangla words. Int. J. Adv. Sci. Technol. 50, 51–60 (2013)
3. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 4960–4964 (2016)
4. Kumar, K., Aggarwal, R.K.: A Hindi speech recognition system for connected words using HTK. Int. J. Comput. Syst. Eng. 1(1), 25–32 (2012). Online ISSN 2046-3405
5. Rahman, K.J., Hossain, M.A., Das, D., Touhidul Islam, A.Z.M., Ali, M.G.: Continuous Bangla speech recognition system. In: Proceedings of 6th International Conference on Computer and Information Technology (ICCIT), Dhaka (2003)
6. Saini, P., Kaur, P., Dua, M.: Hindi automatic speech recognition using HTK. Int. J. Eng. Trends Technol. (IJETT) 4(6), 2223–2229 (2013). ISSN 2231-5381
7. Sumit, S.H., Al Muntasir, T., Arefin Zaman, M.M., Nandi, R.N., Sourov, T.: Noise robust end-to-end speech recognition for Bangla language. In: International Conference on Bangla Speech and Language Processing (ICBSLP), 21–22 (2018)
8. Sumon, S.A., Chowdhury, J., Debnath, S., Mohammed, N., Momen, S.: Bangla short speech commands recognition using convolutional neural networks. In: 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), Dhaka, 21–22 (2018)
9. Tao, F., Busso, C.: Aligning audiovisual features for audiovisual speech recognition. In: IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 (2018)
Sentiment Analysis on Images Using Convolutional Neural Network Ramandeep Singh Kathuria, Siddharth Gautam, Anup Singh, Arjan Singh, and Nishant Yadav
Abstract Sentiment analysis has evolved to be of much greater use; it helps brands, political parties, governments, and others gauge the sentiment of the masses towards something and understand their needs, and it has proven effective in decision and policy making. Sentiment analysis is usually limited to gauging sentiment from text. To determine the sentiment depicted in images, however, we need convolutional neural networks (CNN), which use multilayer perceptrons to detect unique features in the images. Images are the simplest and easiest method of sharing emotions and thoughts. Since image sharing on social media is very popular, the use of images for sentiment analysis can enhance the results. In this paper, we present the use of sentiment analysis to filter out images with adult or violent content that is possibly dangerous to leave in the public domain; the applications of CNN can be many more. It can also be used on various social media platforms for automated tag prediction.
Keywords CNN · Keras · NumPy · Sklearn · Pickle
R. S. Kathuria (B) · S. Gautam · A. Singh · A. Singh · N. Yadav
Department of HMRITM, Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_20
1 Introduction
In the contemporary world, people share enormous amounts of content on various social media platforms. As per a report, there are over 95 billion uploads on various social media platforms. This content is mostly in the form of images, which can be personal in nature, daily updates, or opinions depicted in the form of comic strips and pictorial humour termed memes. Analysing this content from social media platforms and photo-sharing websites like Flickr, Twitter, Tumblr, etc., can give us a brief insight into the general sentiment of people and the public ideology about any global issue, say presidential elections, for instance [1]. Sentiment analysis through images on a large scale can easily predict a particular emotion towards any event or topic. It can also be an essential tool to understand and analyse the emotions behind an image and to automatically predict the emotional tags, such as happiness, fear, excitement, etc., which the image depicts [2]. An image is definitely worth much more in terms of conveying human emotions and sentiments. Examples to support this hypothesis are abundantly available [3]. Immensely alluring images, more often than not, contain rich emotional cues that equip viewers to easily connect with the sentiments of the capturer. With the advent of social media [5], an ever-growing number of people use photos and images to express their happiness, anger, and boredom on platforms like Instagram and Facebook [4]. Automated sentiment analysis and inference of emotions from such ever-growing, massive amounts of user-generated graphical data is of prime importance and will find applications in the health-care industry, anthropology, support, gauging public sentiment [6], communication, digital marketing, and many similar fields within the digital industry such as computer vision. Mental health and emotional wellness are fields of growing concern and study that impact several aspects of many lives [7].
This study would help introduce self-empathy in individuals, giving people greater awareness of their emotional endurance. It would also help improve and build one's self-esteem, confidence and toughness, which allows a person to recuperate from poor emotional health, physical stress and difficulties with relative ease [8]. The thriving digitization of people's personal lives, due to the increasing use of photos and images to record what they do in their everyday lives, can help us assess and determine a person's emotional well-being based on the emotions and sentiments depicted in the images they share and post on social media platforms like Facebook [9], Instagram [10], Twitter, LinkedIn, etc. The rest of the paper is organized as follows: Sect. 2 describes the convolutional neural network. In Sect. 3, the tools used in this work are explained. In Sects. 4 and 5, we discuss the dataset and the model and research method, and finally the conclusion of the paper and its future scope are presented in Sect. 6.
2 Convolutional Neural Networks
In the field of deep learning, a CNN or convolutional neural network is a class of deep neural networks. CNNs are ordinarily used to analyse visual imagery. They are in high demand in various IT applications such as image and video recognition, image classification, natural language processing, medical image analysis, etc. Convolutional neural networks are regularized versions of multilayer perceptrons [11]. In a multilayer perceptron, the whole network is fully connected. Numerous neurons together make up a layer, and each neuron in a layer is connected to all the neurons in the following layer [10]. This full connectivity makes such networks vulnerable to overfitting. Overfitting occurs when the model classifies the data presented to it during training well but fails to classify data that it has not seen before. This complication can be reduced by using dropout layers, applying regularization, reducing the number of units in the hidden layers, etc. [12]. Many other image recognition algorithms rely on features that are hand-engineered as prior knowledge. A CNN avoids this conventional step and needs relatively little preprocessing, because it learns its filters from the data [13]. This adds to the advantages of CNN. A convolutional neural network combines an input layer, an output layer, and several hidden layers. The hidden layers consist of a sequence of convolutional layers that compute dot products with their input. A CNN is arranged as a stack of distinct layers that transform the input volume into the output volume. It is made up of several convolutional layers, each of which has a set of learnable kernels as its parameters. These kernels have a small receptive field [14].
During a forward pass, each kernel is convolved across the width and height of the input volume, and the dot product between the entries of the kernel and the input is computed at every position. This results in a two-dimensional activation map of that filter. In this way the network learns filters that trigger when some specific feature is present in the input volume. When high-dimensional inputs, for instance images, come into the input volume, it becomes impractical to connect every neuron to all the neurons in the previous volume, since this type of network architecture neglects the spatial structure of the data [15]. This issue is resolved by imposing a sparse local connectivity pattern between neurons of adjacent layers. According to this pattern, each neuron is linked to only a small region of the input volume. In image processing, the discrete wavelet transform (DWT) decomposes the input image into sub-bands, from which the approximation coefficients of the image are extracted [16]. It is used in both numerical and functional analysis, with the wavelets sampled discretely. By this technique, the image pixels are transformed into wavelets, which are then utilized for coding and wavelet-based compression [17]. The definition of DWT is shown in (3) and (4).
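The sliding dot product that produces a two-dimensional activation map, as described in this section, can be sketched in plain Python. This is a toy illustration: the 5 × 5 image and the vertical-edge kernel are hand-picked assumptions, whereas in a real CNN the kernel weights are learned during training.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Returns the activation map of dot products between the kernel
    and each image patch it covers.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    activation = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Each output unit sees only a small local patch of the input.
            total = sum(image[i + a][j + b] * kernel[a][b]
                        for a in range(k_h) for b in range(k_w))
            row.append(total)
        activation.append(row)
    return activation


# Toy image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

activation = conv2d_valid(image, kernel)
for row in activation:
    print(row)
```

The map is largest where the patch straddles the edge and zero where the patch is uniform, which is the sense in which a filter "triggers" on a feature.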
3 Tools Used
3.1 Keras
Keras is an open-source artificial neural network library written in Python. It offers a user-friendly interface for running quick experiments with deep neural networks on top of TensorFlow. Keras helps minimize cognitive load, that is, the amount of working memory in use. Working memory is the limited-capacity part of a cognitive system that temporarily holds the information needed for decision making, reasoning and fast execution of a task [18]. Keras allows new modules to be added as new classes and functions. Keras can also be used for convolutional neural networks and supports various other kinds of neural networks [19].
3.2 NumPy
NumPy is another Python library. It is used for large, multi-dimensional arrays and matrices, and it provides mathematical functions that operate on those arrays and matrices. NumPy is used in mathematical computation and is a good alternative to the regular list type of the Python programming language. Python lists do not support element-wise arithmetic directly, whereas NumPy arrays do. In Python, multiplying two lists raises an error, so the elements of the respective lists have to be multiplied pair by pair. This problem is removed with NumPy, which multiplies arrays element by element [7].
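A short sketch of the difference between lists and NumPy arrays described above; the variable names and values are illustrative only.

```python
import numpy as np

prices = [2, 4, 6]
quantities = [1, 3, 5]

# Multiplying two plain lists is not defined and raises TypeError.
try:
    prices * quantities
    list_error = None
except TypeError as exc:
    list_error = type(exc).__name__

# With plain lists, each pair of elements is multiplied explicitly.
pairwise = [p * q for p, q in zip(prices, quantities)]

# NumPy arrays multiply element by element in a single expression.
product = np.array(prices) * np.array(quantities)

print(list_error, pairwise, product)
```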
3.3 Sklearn
Sklearn, or scikit-learn, is a free machine learning library written in Python. It uses NumPy for high-performance linear algebra and operations on large, multi-dimensional arrays [20]. Scikit-learn provides classification, which identifies the group or category an object belongs to; regression, which predicts a continuous-valued attribute associated with an object; clustering, which automatically groups objects with similar properties into sets; model selection; preprocessing, which includes feature extraction and normalization; and dimensionality reduction, which reduces the number of random variables [21]. Applications of Sklearn have been seen in many fields, such as the development of fast bar code scanners, cyber security, financial modelling, etc. [22].
3.4 Pickle
Pickle is a standard library module in Python. It is used for serialization and deserialization of Python objects, which means that it converts a Python object into a byte stream and vice versa [23]. Pickle serializes Python objects so that they can be saved on disk; serialization is always done before writing an object to the disk [24]. When Python objects, for instance lists, dictionaries, tuples, etc., are converted into a byte stream, the byte stream holds all the information required to reconstruct the object [25]. The pickle module also handles recursive objects, that is, objects that contain references to themselves, as well as user-defined classes and their instances, which adds to the advantages of using it.
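The round trip described above can be sketched with the standard `pickle` module. The `record` dictionary is a made-up example; note that unpickling should only be done on data from a trusted source, since loading a pickle can execute arbitrary code.

```python
import pickle

# An illustrative object; the field names are hypothetical.
record = {"label": "positive", "scores": [0.1, 0.9], "shape": (300, 300)}

payload = pickle.dumps(record)    # Python object -> byte stream
restored = pickle.loads(payload)  # byte stream -> Python object

# A recursive object: the list holds a reference to itself.
loop = ["self-referencing"]
loop.append(loop)
loop_copy = pickle.loads(pickle.dumps(loop))

print(type(payload).__name__, restored == record, loop_copy[1] is loop_copy)
```

To save to disk, `pickle.dump(obj, file)` and `pickle.load(file)` do the same with a file opened in binary mode.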
3.5 TensorFlow
TensorFlow is a free and open-source software library that is commonly used in machine learning. It is a library for numerical computation. It was initially created and developed by the Google Brain team for Google's research purposes, but the system gained popularity around the world and is now commonly used in machine intelligence, for example for neural networks. TensorFlow runs on many different platforms, which makes it easier for the user. TensorFlow, basically, helps you to train your model [26]. It takes its input data as tensors, which are multi-dimensional arrays. Multi-dimensional arrays are extremely helpful for handling huge volumes of data [27]. Calculations on large amounts of data require a large amount of storage and processing time, which makes deep learning quite challenging and complex if it is performed on a CPU, whereas GPUs are very effective for developing deep learning applications. TensorFlow can run on both CPUs and GPUs, which speeds up training [28].
3.6 Matplotlib
Matplotlib is a Python library used to create visualizations such as line plots, contour plots, histograms, scatter plots, 3D plots, polar plots, and image plots. The visualizations look professional and are of publication quality. It provides an object-oriented API for embedding plots using GUI toolkits [29] such as Tkinter, wxPython, etc. Matplotlib has a module called Pyplot, which provides an interface similar to that of MATLAB. It offers MATLAB-like plotting functionality and is free and open source. Over time, matplotlib has been used to create plots in various areas, be it airlines, the health sector, or public records. The different types of plots that can be created with matplotlib help users easily understand the data [5].
3.7 Pillow
PIL, the Python Imaging Library, whose maintained fork is called Pillow, is a free library for the Python programming language. It is used for working with and manipulating images in Python. It supports many different file formats, for instance JPEG, GIF, TIFF, BMP, PNG, and PPM. If another file format is needed, Pillow also allows new file decoders to be created to add support for it [30]. Image manipulation includes per-pixel manipulation and filtering, such as finding the edges of the image, smoothing the image or adding blur effects, increasing brightness, contrast, etc.
4 Dataset
We have a dataset of 2822 images, divided into two categories, namely positive and negative, with 1411 images for each sentiment. Categorizing the images was intensive work: an image was assigned to a category, as shown in Fig. 1, only when all the persons to whom we gave the task reached the same conclusion. The data is divided into a training set and a validation set, with 80% of the data in the training set and 20% in the validation set. We have set the parameters of the model as follows. The maximum number of epochs for network training is 20. This provides a reference point from which we measure time. The batch size for each iteration is 32 images. Two sets are used in the model: the first is used for training and the second for validation. The validation split for this model is 0.2. The validation set is crucial in understanding whether our model is performing well. The training curve should be similar and close to the validation curve to generate more efficient and accurate results. If the validation accuracy curve falls well below the training curve, this indicates overfitting.
Fig. 1 Examples of positive and negative images
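The split and training parameters given in Sect. 4 can be turned into concrete numbers with a small sketch. The dataset size, validation split, batch size and epoch count are from the text; the shuffling, the seed and the helper `split_indices` are assumptions, since the paper does not say how the split was drawn.

```python
import math
import random

NUM_IMAGES = 2822        # dataset size reported in Sect. 4
VALIDATION_SPLIT = 0.2   # 20% of the data for validation
BATCH_SIZE = 32
MAX_EPOCHS = 20


def split_indices(num_items, validation_split, seed=0):
    """Shuffle item indices and split them into (train, validation) lists."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # fixed seed is an assumption
    num_val = int(num_items * validation_split)
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices(NUM_IMAGES, VALIDATION_SPLIT)
steps_per_epoch = math.ceil(len(train_idx) / BATCH_SIZE)
total_updates = steps_per_epoch * MAX_EPOCHS

print(len(train_idx), "training images,", len(val_idx), "validation images")
print(steps_per_epoch, "batches per epoch,", total_updates, "updates in total")
```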
5 Model and Research Method
The model follows a five-step process for predicting the sentiment depicted in images, as shown in Fig. 2.
5.1 Data Collection
Data collection was the toughest part for us. We gathered data from various social media sites, Google Images, etc. The data was then divided based on the sentiment depicted.
5.2 Pre-processing Data
In preprocessing, the images are converted from RGB to black and white to reduce computation time; this does not reduce the accuracy of the program, but it helps to reduce the time taken for preprocessing. Also, since all the images need to have a uniform resolution, they are resized to 300 × 300 pixels, and the image data is then stored in matrix form.
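The preprocessing steps above can be sketched without any imaging library. This is an illustration only: it assumes that "black and white" means a single-channel grayscale image, uses the common BT.601 luminance weights, and resizes with nearest-neighbour sampling. In practice a library such as Pillow would do both steps, and the paper does not state which method was used.

```python
TARGET_SIZE = 300  # every image is brought to 300 x 300, as in the text


def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def resize_nearest(gray, height, width):
    """Nearest-neighbour resize of a 2D list to height x width."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[(i * src_h) // height][(j * src_w) // width]
         for j in range(width)]
        for i in range(height)
    ]


# Toy 2 x 2 RGB image standing in for a real photo.
rgb = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

# The result is a 300 x 300 matrix of grayscale values.
matrix = resize_nearest(to_grayscale(rgb), TARGET_SIZE, TARGET_SIZE)
print(len(matrix), "x", len(matrix[0]))
```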
5.3 Convolutional Neural Network Algorithm
A convolutional neural network (CNN) is used for sentiment analysis of the images. Since images are complex to understand, the sentiment cannot be predicted with simple machine learning algorithms and has to be predicted with deep learning or neural network methods.
Fig. 2 Sentiment analysis system architecture
5.4 Prediction of Sentiment
The final layer of the CNN gives the sentiment as one of three numerical outputs, 0, 1, and 2: 0 for an image depicting negative sentiment, 1 for an image depicting neutral sentiment, and 2 for positive sentiment.
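Mapping the final-layer scores to the labels of Sect. 5.4 can be sketched as follows. The label mapping is from the text; the use of a softmax followed by an arg-max, and the example scores, are assumptions made for illustration.

```python
import math

# Numerical output -> sentiment, as stated in Sect. 5.4.
LABELS = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_sentiment(scores):
    """Return (class index, label, probabilities) for one image."""
    probs = softmax(scores)
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, LABELS[index], probs


# Made-up final-layer scores for a single image.
index, label, probs = predict_sentiment([0.2, -1.0, 2.5])
print(index, label, [round(p, 3) for p in probs])
```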
5.5 Output
The generated output is shown to the user: the sentiment value predicted by the model is displayed for each image, with the actual sentiment value just below it, as shown in Figs. 3, 4 and 5. The three output figures contain both negative and positive images. Images of a fire in a jungle, destroyed houses, and a devil have been predicted as negative, whereas a dog playing with a ball and a picture of a smiling lady have been determined to have a positive sentiment. One image is wrongly predicted by the model: in Fig. 5, the first image, which shows a war, is predicted to be positive although in reality it is a negative image.
6 Conclusion and Future Scope
Sentiment analysis today has many possibilities, from gauging the sentiment depicted in a word to that in an image. It has many applications, be it tracking changing political trends, market sentiment, business sentiment, or customer and consumer sentiment. In this paper, we have worked to predict and determine the sentiment depicted in images. We further want to add the feature of predicting the sentiment of images fetched from social media sites in real time, so as to create applications that help understand ongoing situations that are evolving and dynamic in nature. Also, with the development of CNN for predicting the sentiment of images, the approach can in future be expanded beyond still images to sentiment analysis of videos and GIF images.
Fig. 3 Output image 1
Fig. 4 Output image 2
Fig. 5 Output image 3
References
1. Farias, D.H., Rosso, P.: Irony, sarcasm, and sentiment analysis. Sentiment Anal. Soc. Netw. 113–128 (2017). https://doi.org/10.1016/b978-0-12-804412-4.00007-3
2. Campos, V., Jou, B., Giró-I-Nieto, X.: From pixels to sentiment: fine-tuning CNNs for visual sentiment prediction. Image Vis. Comput. 65, 15–22 (2017). https://doi.org/10.1016/j.imavis.2017.01.011
3. Mittal, N., Sharma, D., Joshi, M.L.: Image sentiment analysis using deep learning. In: 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI) (2018). https://doi.org/10.1109/wi.2018.00-11
4. French, J.H.: Image-based memes as sentiment predictors. In: 2017 International Conference on Information Society (i-Society) (2017). https://doi.org/10.23919/i-society.2017.8354676
5. Grover, M., Verma, B., Sharma, N., Kaushik, I.: Traffic control using V-2-V based method using reinforcement learning. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974540
6. Nelli, F.: Data visualization with matplotlib. In: Python Data Analytics, pp. 231–312 (2018). https://doi.org/10.1007/978-1-4842-3913-1_7
7. Harjani, M., Grover, M., Sharma, N., Kaushik, I.: Analysis of various machine learning algorithm for cardiac pulse prediction. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974519
8. Chaudhuri, A.: Experimental setup: visual and text sentiment analysis through hierarchical deep learning networks. In: Visual and Text Sentiment Analysis through Hierarchical Deep Learning Networks, SpringerBriefs in Computer Science, pp. 25–49 (2019). https://doi.org/10.1007/978-981-13-7474-6_6
9. Nelli, F.: The NumPy library. In: Python Data Analytics, pp. 49–85 (2018). https://doi.org/10.1007/978-1-4842-3913-1_3
10. Kathuria, R.S., Gautam, S., Singh, A., Khatri, S., Yadav, N.: Real time sentiment analysis on Twitter data using deep learning (Keras). In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974557
11. Pattanayak, S.: Introduction to deep-learning concepts and TensorFlow. In: Pro Deep Learning with TensorFlow, pp. 89–152 (2017). https://doi.org/10.1007/978-1-4842-3096-1_2
12. Khamparia, A., Singh, K.M.: A systematic survey on deep learning architectures and applications. Expert Syst. (2019). https://doi.org/10.1111/exsy.12400
13. Kulshrestha, S.: What is a convolutional neural network? Dev. In: Developing an Image Classifier Using TensorFlow (2019). https://doi.org/10.1007/978-1-4842-5572-8_6
14. Sil, R., Roy, A., Bhushan, B., Mazumdar, A.: Artificial intelligence and machine learning based legal application: the state-of-the-art and future research trends. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974479
15. Sharma, A.K., Chaurasia, S., Srivastava, D.K.: Sentimental short sentences classification by using CNN deep learning model with fine tuned Word2Vec. Procedia Comput. Sci. 167, 1139–1147 (2020). https://doi.org/10.1016/j.procs.2020.03.416
16. Aghdam, H.H., Heravi, E.J.: Convolutional neural networks. In: Guide to Convolutional Neural Networks, pp. 85–130 (2017). https://doi.org/10.1007/978-3-319-57550-6_3
17. Hybrid CNN classification for sentiment analysis under deep learning. Int. J. Innov. Technol. Exploring Eng. Regular Issue 9(5), 1473–1480 (2020). https://doi.org/10.35940/ijitee.e2922.039520
18. Jindal, S., Singh, S.: Image sentiment analysis using deep convolutional neural networks with domain specific fine tuning. In: 2015 International Conference on Information Processing (ICIP) (2015). https://doi.org/10.1109/infop.2015.7489424
19. Khamparia, A., Pandey, B., Tiwari, S., Gupta, D., Khanna, A., Rodrigues, J.J.P.C.: An integrated hybrid CNN-RNN model for visual description and generation of captions. Circ. Syst. Signal Process. (2019). https://doi.org/10.1007/s00034-019-01306-8
20. Jindal, M., Gupta, J., Bhushan, B.: Machine learning methods for IoT and their future applications. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974551
21. Trappenberg, T.P.: Neural networks and Keras. In: Fundamentals of Machine Learning, pp. 66–90 (2019). https://doi.org/10.1093/oso/9780198828044.003.0004
22. Singh, P., Manure, A.: Images with TensorFlow. In: Learn TensorFlow 2.0, pp. 75–106 (2019). https://doi.org/10.1007/978-1-4842-5558-2_4
23. Gaur, J., Goel, A.K., Rose, A., Bhushan, B.: Emerging trends in machine learning. In: 2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT) (2019). https://doi.org/10.1109/icicict46008.2019.8993192
24. Khamparia, A., Gupta, D., de Albuquerque, V.H.C., Sangaiah, A.K., Jhaveri, R.H.: Internet of health things driven deep learning system for detection and classification of cervical cells using transfer learning. J. Supercomput. (2020). https://doi.org/10.1007/s11227-020-03159-4
25. Deep Learning in Python: Introduction to Deep Learning (2019). https://doi.org/10.4135/9781526493446
26. Paper, D.: Introduction to Scikit-Learn. In: Hands-on Scikit-Learn for Machine Learning Applications, pp. 1–35 (2019). https://doi.org/10.1007/978-1-4842-5373-1_1
27. Walt, S.V.D., Schönberger, J.L., Nunez-Iglesias, J., Boulogne, F., Warner, J.D., Yager, N., Gouillart, E., Yu, T.: Scikit-image: image processing in python (2014). https://doi.org/10.7287/peerj.preprints.336v1
28. Manaswi, N.K.: CNN in TensorFlow. In: Deep Learning with Applications Using Python, pp. 97–104 (2018). https://doi.org/10.1007/978-1-4842-3516-4_7
29. Sharma, N., Kaushik, I., Rathi, R., Kumar, S.: Evaluation of accidental death records using hybrid genetic algorithm. SSRN Electron. J. (2020). https://doi.org/10.2139/ssrn.3563084
30. Khamparia, A., Saini, G., Gupta, D., Khanna, A., Tiwari, S., de Albuquerque, V.H.C.: Seasonal crops disease prediction and classification using deep convolutional encoder network. Circ. Syst. Signal Process. (2019). https://doi.org/10.1007/s00034-019-01041-0
Classification of Microstructural Image Using a Transfer Learning Approach Shib Sankar Sarkar, Md. Salman Ansari, Riktim Mondal, Kalyani Mali, and Ram Sarkar
Abstract The mechanical properties and corrosion resistance of a metal largely depend on its solidification microstructure. The microstructure of a metallic alloy is described by the grain size, type of phases, structure, shape, etc. Material characterization is an important issue in material science. A large number of attempts have been made at automatic characterization of microstructure using computer vision and machine learning methods. In this paper, an attempt is made to distinguish dendrite morphology from others by using a transfer learning approach. A transfer learning approach takes advantage of the fully connected layer of a pre-trained Deep Convolutional Neural Network (DCNN) for weight initialization and feature extraction. In this work, three different pre-trained DCNN architectures, namely DenseNet121, MnasNet0_5, and MnasNet1_0, are used as feature extractors. A LogSoftmax layer is used as a binary classifier to classify the microstructure images. A maximum classification accuracy of 95% is achieved using a customized framework of the pre-trained DenseNet121 architecture.
Keywords Microstructure · Transfer learning · Convolutional neural network · Feature extraction
S. S. Sarkar (B) · Md. S. Ansari
Department of Computer Science and Engineering, Government College of Engineering & Textile Technology Berhampore, Berhampore 742101, West Bengal, India
R. Mondal · R. Sarkar
Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India
K. Mali
Department of Computer Science and Engineering, University of Kalyani, Kalyani 741235, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_21
1 Introduction
Engineering materials are classified into four categories, namely metals, ceramics, polymers, and semiconductors. The microstructure of a crystalline material is described by its grain size, phase types, and grain topology. The microstructure of a material strongly influences physical properties such as hardness, strength, ductility, and corrosion resistance. These properties play a key role in the selection of materials in industrial practice. For material scientists, understanding and identifying the microstructure is a key factor in material selection, quality control, and the development of new materials. Recently, there has been much research on microstructure recognition, interpretation, and characterization in which information technology and data science methods are used extensively.
Dendrites are a well-defined microstructure that can exist in a variety of material systems. In metallurgy, a dendrite is a tree-like crystal structure that grows as molten metal solidifies. The size, shape, and spacing of dendrites depend on the cooling rate during solidification and on the material properties. As a result, the features of the micrographs can vary widely. Dendritic growth has large consequences for the mechanical properties and corrosion resistance of metallic alloys [1]. A finer dendritic morphology produces higher tensile strength than a coarser one, whereas coarser dendritic structures produce higher corrosion resistance than finer ones.
In the last decade, machine learning and computer vision have been employed by material scientists to characterize microstructures and to study structure–property relationships. DeCost and Holm [2] used a histogram of visual features and a support vector machine (SVM) classifier to classify microstructures into seven groups with greater than 80% classification accuracy using a relatively small training dataset. Chowdhury et al. [3] employed visual features such as histogram of oriented gradients (HoG), local binary patterns (LBP), and visual bag of words (VBOW) in combination with a pre-trained convolutional neural network (CNN) to distinguish micrographs with dendritic morphologies. Azimi et al. [4] achieved a microstructural classification accuracy of 93.94% for four classes, namely martensite, tempered martensite, bainite, and pearlite, using a deep learning-based method. They showed that a fully convolutional neural network (FCNN) combined with pixel-wise segmentation classifies microstructures with high precision. Recently, transfer learning has been employed to transfer features at different levels, both low and high, for generic recognition tasks [5, 6]. To scale to large, complex, multi-class problems, various improvements in CNN learning methodology were introduced, including layer design, activation functions, loss functions, regularization, optimization, and fast computation [7]. CNN-based applications became widespread after the success of AlexNet in 2012 [8]. Zeiler and Fergus [9] introduced a layer-wise visualization technique for CNNs that helps to understand the function of intermediate feature layers and
Classification of Microstructural Image Using a Transfer …
the operation of the classifier. Simonyan and Zisserman [10] showed that accuracy improves significantly when the depth of a convolutional network is increased using an architecture with very small (3 × 3) convolution filters, as in the VGG network. The Google deep learning group introduced a 22-layer deep CNN (DCNN) architecture that increases the depth and width of the network while improving the utilization of computing resources within it, yielding an efficient network with reduced computational cost [11]. Later, Microsoft Research introduced skip connections in ResNet [12] to train DCNNs. These connections provide an alternate pathway for data and gradients to flow, which makes training very deep networks possible.
Following this research trend in image classification, this paper applies transfer learning to a wide collection of microstructure images for microstructure classification. We hypothesize that this approach is flexible enough to classify a diverse set of microstructure images accurately, and this hypothesis is confirmed through extensive experiments.
2 Methodology
A problem with any deep learning-based model is that it needs a large amount of data for training. To overcome this limitation, transfer learning is often used in the literature. The motivation behind transfer learning [13, 14] is that a model trained on a large dataset serves as a generic model, and one can take advantage of its learned feature maps without having to train a model from scratch for the dataset under consideration. Transfer learning is flexible: it allows a pre-trained model to be used directly, fine-tuned to adjust the weights that generate the feature maps, extended with a custom classification layer, or integrated into an entirely new model. The proposed framework is trained with three different DCNN architectures, namely DenseNet121, MnasNet0_5, and MnasNet1_0. Figure 1 shows the proposed architecture used to classify microstructure images. In the first step, images are cropped, and data augmentation is performed to address the limited number of samples in the dataset and to improve the performance of the model. In the second step, a network pre-trained on a large dataset is loaded, and all the weights of its convolution layers are frozen so that they remain unchanged during training. In the third step, the final output layer of the pre-trained network is dropped, and a new set of four fully connected (linear) layers, with Rectified Linear Unit (ReLU) activation and dropout in between, is added, ending in a final classification (output) layer for the two classes. All the weights of these four layers are trained on our own dataset. In the fourth step, high-level features are extracted using the proposed architecture; the lower layers of the pre-trained DCNN extract generic features that are used for the microstructure classification task. Finally, the performance of the proposed model is evaluated on the test dataset.
S. S. Sarkar et al.
[Fig. 1 diagram labels: Input image; Data pre-processing; Train data; Test data; Layer 1, Layer 2, Layer 3, …, Layer L-1 (freeze all layers of pre-trained network); additional trainable layer (Linear, ReLU, Dropout; repeat 3 times; Linear); classification layer (LogSoftmax); outputs: With dendrite, Without dendrite]
Fig. 1 Proposed transfer learning-based architecture used for classification of microstructure images
2.1 Network Architecture
The proposed model introduces four additional dense hidden layers with an activation function that introduces non-linearity into the output. An appropriate activation function can accelerate the learning process, so these layers make the network more powerful and help it learn complex patterns from the data. ReLU is preferred as the activation function since it helps to overcome the vanishing gradient problem that arises with the sigmoid and tanh functions. Dropout with a probability of 0.5 is used to avoid overfitting, which helps to prevent the training and validation losses from diverging too much. The last layer is a binary classifier with a LogSoftmax activation function, which is the logarithm of the Softmax function. Softmax converts the outputs into class probabilities whose sum is equal to unity, and the image is assigned to the most probable class.
To prepare the dataset, micrographs are collected from the DoITPoMS micrograph library (https://www.doitpoms.ac.uk/) and the ultrahigh carbon steel micrograph database (https://uhcsdb.materials.cmu.edu/). The dataset contains 100 micrographs with dendrite microstructures in one group and 100 micrographs showing a mixture of pearlite, cementite, bainite, and Widmanstätten structures in another group, termed without dendrite. Some sample images are shown in Fig. 2. Each image is cropped to generate 16 square images of size 128 × 128. Each original image contains a scale bar; to avoid interference with the feature extraction process, the scale bar is excluded from the cropped images. In this way, 1744 images are selected for the classification task. No filtering or noise cancelation is applied in preprocessing the images.
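For the classification layer described in this subsection, the relation between LogSoftmax outputs and class probabilities can be illustrated with a minimal sketch. The raw scores below are hypothetical values chosen for illustration, and the helper name `log_softmax` is ours; the sketch uses plain Python rather than the deep learning framework used in the experiments.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw class scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


# Hypothetical raw scores for the two classes (index 0: with dendrite,
# index 1: without dendrite); these are not values from the experiments.
scores = [2.0, 0.5]
log_probs = log_softmax(scores)

# Exponentiating the log-probabilities recovers probabilities that sum to one.
probs = [math.exp(lp) for lp in log_probs]

# The predicted class is the one with the largest (log-)probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because the logarithm is monotonic, taking the arg-max over the log-probabilities gives the same class as taking it over the probabilities themselves.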
The cropped images are scaled up to 256 × 256 and then center-cropped to 224 × 224, which is the standard input size of the pre-trained models. Finally, random horizontal flip
Fig. 2 Example of microstructure images used in the classification task: a–c with dendrite, d–f without dendrite dataset preparation
and random rotation are applied on the fly to the image dataset, together with normalization of the dataset as required by the pre-trained model.
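The geometry of the cropping steps described above can be sketched as follows. The text does not state how the 16 tiles are positioned within a micrograph, so the 4 × 4 grid taken from a 512 × 512 region free of the scale bar is an assumption of this sketch, and the function names `tile_boxes` and `center_crop_box` are hypothetical.

```python
def tile_boxes(tile=128, grid=4):
    """Return (left, top, right, bottom) boxes for a grid x grid tiling.

    Assumption: the tiles form a regular grid starting at the top-left
    corner of a region that does not contain the scale bar.
    """
    return [
        (col * tile, row * tile, (col + 1) * tile, (row + 1) * tile)
        for row in range(grid)
        for col in range(grid)
    ]


def center_crop_box(size=256, crop=224):
    """Return the box of a centered crop x crop window in a size x size image."""
    offset = (size - crop) // 2
    return (offset, offset, offset + crop, offset + crop)


boxes = tile_boxes()      # 16 tiles of 128 x 128 pixels
crop = center_crop_box()  # window kept after scaling a tile up to 256 x 256
```

With these defaults, the center crop discards a 16-pixel border on every side of the rescaled tile.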
2.2 Pre-trained DCNN Models Used
2.2.1 DenseNet
With improvements in computer architecture and network structure, CNNs have produced increasingly accurate results, and they are efficient to train when the number of layers between input and output is small. As convolutional networks become deeper, however, they become difficult to train because of the vanishing gradient problem. The dense convolutional network (DenseNet) [15] is one of the recent architectures used in the ImageNet classification challenge and has shown exceptional classification accuracy. DenseNet makes it possible to increase the depth of a CNN. Each layer of this network is connected to every other layer in a feed-forward manner. A traditional CNN with L layers has L connections, whereas a DenseNet has L(L + 1)/2 direct connections. This connection pattern is the main idea behind DenseNet: the input of each layer is the concatenation of the outputs of all preceding layers. The key observation behind this architecture is that creating short paths from the early layers (closer to the input) to the later layers (closer to the output) improves gradient propagation as well as classification. The advantages of this network are parameter efficiency, implicit deep supervision, and feature reuse.
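The number of direct connections in a dense block can be checked with a short sketch. It enumerates every link explicitly, under the assumption that index 0 denotes the block input and indices 1 to L denote the layers; the function names are ours.

```python
def dense_links(num_layers):
    """List (source, destination) pairs in a dense block.

    Index 0 is the block input; layer k receives the input and the
    outputs of all k - 1 earlier layers.
    """
    return [
        (src, dst)
        for dst in range(1, num_layers + 1)
        for src in range(dst)
    ]


def dense_connection_count(num_layers):
    """Closed form L(L + 1)/2 for the number of direct connections."""
    return num_layers * (num_layers + 1) // 2


layers = 5
links = dense_links(layers)  # 15 links, versus 5 in a plain chain of layers
```

The enumeration and the closed form agree because layer k contributes k incoming links, and 1 + 2 + … + L = L(L + 1)/2.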
2.2.2 MnasNet
MnasNet [16] is an automated neural architecture search approach for developing resource-efficient mobile CNN models using reinforcement learning. The architecture was developed based on two main ideas. First, the search is formulated as a multi-objective optimization problem that considers both the accuracy and the inference latency of CNN models. Second, a novel factorized hierarchical search space is introduced, which divides a CNN model into unique blocks. The operations and connections are searched separately for each block, which allows different blocks to have different architectures. As a result, the MnasNet model consists of a variety of layer architectures throughout the network. One noticeable difference is that MnasNet uses both 3 × 3 and 5 × 5 convolutions, whereas other mobile models use 3 × 3 convolutions only.
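The trade-off between accuracy and latency can be illustrated with a small sketch of a weighted-product objective of the kind described in [16], in which accuracy is multiplied by the ratio of measured to target latency raised to a negative exponent. The exponent value, the function name `mnas_reward`, and all numbers below are illustrative assumptions and are not taken from the present experiments.

```python
def mnas_reward(accuracy, latency_ms, target_ms, alpha=-0.07, beta=-0.07):
    """Weighted-product reward: accuracy * (latency / target) ** w.

    w = alpha when the latency target is met and beta otherwise; both
    exponents are assumed values for this sketch.
    """
    w = alpha if latency_ms <= target_ms else beta
    return accuracy * (latency_ms / target_ms) ** w


# Three hypothetical candidate models with equal accuracy.
on_target = mnas_reward(0.75, 80.0, 80.0)
too_slow = mnas_reward(0.75, 160.0, 80.0)
faster = mnas_reward(0.75, 40.0, 80.0)
```

A model that exactly meets the latency target keeps its accuracy as the reward, a slower model is penalized, and a faster one is mildly rewarded.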
3 Results and Discussion
This section presents the results and analysis of each of the transfer learning-based classification models. Instead of standard cross-validation, an alternative approach is taken: the share of data used for training is increased from 10 to 80% in steps of 10%, with the remaining data (90 to 20%) used for testing. This helps to understand how the quantity of training data affects a deep learning-based classification model. The accuracy of the classification models is shown in Fig. 3a–c. Figure 3a shows that DenseNet121 achieves its highest accuracy of 98% when 60% of the data is used for training and 40% for testing. Figure 3b, c shows that, for the same split, MnasNet0_5 and MnasNet1_0 achieve 86% and 91% accuracy, respectively. This suggests that, with a further increase in the training data, the model overfits, which leads to a decline in test accuracy and an increase in training accuracy. MnasNet0_5 is a special case: even with 80% training data, its test accuracy is higher than its training accuracy. We have considered the split of 80% training data and 20% test data as the base dataset for our experiments, and all further results refer to this split.

Fig. 3 Graphical plot of the classification accuracy of the respective models for various distributions of training and test datasets (x-axis: data input (%); y-axis: accuracy (%)). a DenseNet121 model; b MnasNet0_5 model; c MnasNet1_0 model

Initially, the proposed framework is created with no additional layers, with the weights of the pre-trained network frozen. Only the last classification (Softmax) layer is replaced and trained on our processed dataset to obtain a satisfactory accuracy; subsequently, layers are added to optimize the performance of the framework. The performance of the model at the various stages of the experiments is given in Table 1. After evaluating the model with no additional layers, four more layers, with ReLU activation and dropout in between, are added after the original pre-trained network. The weights of the original network, which was trained on ImageNet, are kept intact, so only the weights of the four added layers are trained on our dataset to achieve better accuracy; the results are shown in Table 1.

Table 1 Classification accuracy and loss of the proposed frameworks with different transfer learning models on the test dataset of two-class microstructure images
Model         Without any additional layer   With four additional layers   With six additional layers   With eight additional layers
              Accuracy (%)   Loss (%)        Accuracy (%)   Loss (%)       Accuracy (%)   Loss (%)      Accuracy (%)   Loss (%)
MnasNet0_5    83             0.34            93             0.19           91             0.28          90             0.27
MnasNet1_0    92             0.23            92             0.20           93             0.18          94             0.18
DenseNet121   97             0.07            95             0.14           97             0.13          94             0.18

Overfitting is a common problem of deep learning-based models when the models become too complex in comparison to the real complexity of the dataset. To examine this, models with six and eight additional layers are trained in the same way as in the second experiment described above. Their accuracies and losses on the test set are shown in Table 1. It can be observed that, with the addition of more layers, the accuracy stagnates (since the model has converged): the result is either comparable to that of the best model (with four additional layers), or the accuracy on the test set decreases due to overfitting. From Tables 1 and 2, it can be observed that DenseNet121 performs best, with an overall test accuracy of 95% with four additional layers and 97% with six additional layers, owing to its exploitation of the network by reusing pre-computed features. MnasNet0_5 and MnasNet1_0 are built upon the same backbone architecture, differing only in the number of layers and the size of the convolutional layers, and give comparable results. Comparing the results, DenseNet121 identifies 100% of the dendrite-class images and misclassifies 6% of the non-dendrite-class images. Table 3 shows that DenseNet121 performs best in terms of precision, recall, and F1-score, followed by MnasNet0_5 and MnasNet1_0. With respect to the per-class F1-score, DenseNet121 is also the best, whereas MnasNet0_5 and MnasNet1_0 give comparable results. Finally, we conclude that, among the three pre-trained architectures with four additional layers, DenseNet121 is the best for microstructural classification.

Table 2 Comparative analysis of the class-wise precision obtained by the proposed architectures
Model         Image size   Train accuracy (%)   Test accuracy (%)   Class-wise precision
DenseNet121   224 × 224    96                   95                  With dendrite: 100%; without dendrite: 94%
MnasNet0_5    224 × 224    92                   93                  With dendrite: 96%; without dendrite: 90%
MnasNet1_0    224 × 224    93                   92                  With dendrite: 99%; without dendrite: 90%

Table 3 Classification report of three models on test set
Model         Class (label)      Precision   Recall   F1-score   Support
DenseNet121   With dendrite      1.00        0.91     0.95       179
              Without dendrite   0.91        1.00     0.95       169
              Macro avg.         0.96        0.96     0.95       348
              Weighted avg.      0.96        0.95     0.95       348
MnasNet0_5    With dendrite      0.96        0.91     0.94       180
              Without dendrite   0.91        0.96     0.94       168
              Macro avg.         0.94        0.94     0.94       348
              Weighted avg.      0.94        0.94     0.94       348
MnasNet1_0    With dendrite      0.98        0.86     0.92       173
              Without dendrite   0.88        0.98     0.93       175
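The precision, recall, and F1-score values in Table 3 follow from the counts of correct and incorrect predictions per class. The sketch below shows the computation; the confusion counts are hypothetical values that we chose so that the rounded results match the DenseNet121 rows of Table 3, since the actual confusion matrix is not reported here.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for a 348-image test set (179 with dendrite, 169 without):
# 163 dendrite images recognized, 16 missed, and no false alarms.
dendrite_hits, dendrite_missed, false_alarms, non_dendrite_hits = 163, 16, 0, 169

with_dendrite = precision_recall_f1(dendrite_hits, false_alarms, dendrite_missed)
# For the other class the roles of the two error types are swapped.
without_dendrite = precision_recall_f1(non_dendrite_hits, dendrite_missed, false_alarms)

total = dendrite_hits + dendrite_missed + false_alarms + non_dendrite_hits
accuracy = (dendrite_hits + non_dendrite_hits) / total
```

Rounded to two decimals, these counts give precision 1.00 and recall 0.91 for the with-dendrite class, precision 0.91 and recall 1.00 for the other class, and an overall accuracy of 0.95.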
4 Conclusion
Transfer learning-based models offer a new way of recognizing microstructural images of interest. The results of the experiments performed in the present work demonstrate that a customized pre-trained DCNN can classify microstructure images with a high degree of accuracy, even with a small set of input images and without any separate segmentation or feature extraction. The architectures using the pre-trained DenseNet121, MnasNet0_5, and MnasNet1_0 models are able to recognize micrographs with and without dendrites with good accuracy. In future work, the model can be extended by incorporating attention networks and better batch normalization and dropout techniques.
References
1. Goulart, P.R., Osório, W.R., Spinelli, J.E., Garcia, A.: Dendritic microstructure affecting mechanical properties and corrosion resistance of an Al-9 wt% Si alloy. Mater. Manuf. Process. 22(3), 328–332 (2007)
2. DeCost, B.L., Holm, E.A.: A computer vision approach for automated analysis and classification of microstructural image data. Comput. Mater. Sci. 110, 126–133 (2015)
3. Chowdhury, A., Kautz, E., Yener, B., Lewis, D.: Image driven machine learning methods for microstructure recognition. Comput. Mater. Sci. 123, 176–187 (2016)
4. Azimi, S.M., Britz, D., Engstler, M., Fritz, M., Mücklich, F.: Advanced steel microstructural classification by deep learning methods. Sci. Rep. 8(1), 1–14 (2018)
5. Li, X., et al.: DELTA: DEep Learning Transfer Using Feature Map with Attention for Convolutional Networks (ICLR) (2019)
6. Kitahara, A.R., Holm, E.A.: Microstructure cluster analysis with transfer learning and unsupervised learning. Integr. Mater. Manuf. Innov. 7(3), 148–156 (2018)
7. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., et al.: Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)
8. Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
9. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer Vision. ECCV 2014 Lecture Notes in Computer Science, pp. 818–833 (2014)
10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR 75(6), 398–406 (2015)
11. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
13. Weiss, K., et al.: A survey of transfer learning. J. Big Data 3(1) (2016)
14. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009)
15. Huang, G., et al.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
16. Tan, M., et al.: MnasNet: platform-aware neural architecture search for mobile. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Connecting Quaternion-Based Rotationally Invariant LCS to Pedestrian Path Projection Kazi Lutful Kabir, Prasanna Venkatesh Parthasarathy, and Yash Tare
Abstract Pedestrian path projection is a crucial problem in computer vision, with applications ranging from surveillance to automotive systems. Among several ways to approach this problem, one can consider trajectory matching with the quaternion-based rotationally invariant LCS (QR-LCS) metric on motion features. This paper discusses a strategy that interprets the problem as trajectory matching and uses a rotationally invariant variant of LCS to solve it. Systems implementing such strategies need to classify the pedestrian action (walking/stopping) and to project the path precisely from a moving vehicle at momentary intervals. More sophisticated systems can therefore use a probabilistic framework to reduce the trajectory search space efficiently, alongside the insights from prominent trajectory matching approaches and the underlying features that capture motion cues. In a nutshell, this paper investigates how to link the QR-LCS metric to pedestrian path projection.
Keywords Dense stereo · Longest common subsequence · Optical flow · Pedestrian · Trajectory matching
K. L. Kabir (B) · P. Venkatesh Parthasarathy · Y. Tare
Department of Computer Science, George Mason University, Fairfax, VA, USA
e-mail: [email protected]
P. Venkatesh Parthasarathy e-mail: [email protected]
Y. Tare e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_22

1 Introduction
Object detection has long been a prominent research domain, and pedestrian path projection is a popular canonical sub-problem of it [1]. Pedestrian behavior has always been a subject of keen interest in the study of automotive driving systems, as pedestrian safety is one of the major concerns of self-driving vehicles. While vehicle motion can be predicted from the road and lane orientations, the environment, and the signals given by the vehicle, pedestrian path projection is more challenging, and to approach it we have to rely solely on the pedestrian's behavioral pattern. A self-driving vehicle needs a very high degree of confidence in the predicted move of the pedestrian before deciding to move ahead. Even a minor performance improvement can yield a substantial benefit. For instance, analysis of road accidents shows that if urgent braking could be applied just a small fraction of a second (4/25 of a second) earlier, the chance of major injury could be reduced by about 15%, for an initial vehicle speed of approximately 31 miles per hour [2]. A decent amount of work has been done on this problem. For example, [2, 3] study different state-of-the-art techniques for estimating the present and future positions of a pedestrian with respect to a moving vehicle, and Ridel et al. [4] provide a comprehensive study of several other prominent strategies. This paper demonstrates how pedestrian path prediction can be approached using a rotationally invariant variant of LCS. One simple approach is to apply sequence matching (using histograms of orientation motion features [3]) with the quaternion-based rotationally invariant LCS (QR-LCS) metric. For a significant improvement, a more sophisticated strategy known as Probabilistic Hierarchical Trajectory Matching (PHTM) can be considered, which has the same building blocks as the simpler approach [3]. PHTM uses trajectory pattern matching but compares the motion states in a probabilistic rather than an exhaustive manner. In this paper, we discuss in detail the strategy of interpreting pedestrian path prediction from the perspective of trajectory matching using this LCS variant. The remainder of the paper is organized as follows: Sect. 2 covers different aspects of the methodology, Sect. 3 describes the dataset along with the observations drawn from the experiments, and Sect. 4 concludes and provides future research directions.
2 Methodology
Figure 1 gives a systematic overview of the entire process. The feature extraction step involves dense stereo [5] (the resulting stereo images are included in the dataset repository) and dense optical flow (we use the OpenCV implementation of Gunnar Farnebäck's algorithm [6]) computed over the bounding-box image segments returned by a pedestrian detection system [7]. To compute the lateral and longitudinal position of the pedestrian, the centroid of the detector bounding box and the median disparity over the interior of the box are used. The motion features are a dimensionally reduced histogram representation of the optical flow. The pedestrian positions and motion features are then fed into a framework that combines trajectory matching and filtering. Finally, the filter state and the class labels of the matched trajectories complete the process by yielding the pedestrian action as well as the position.
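The position computation from the box centroid and the median disparity can be sketched as follows. The text does not give the camera model, so the standard pinhole stereo relations (depth equals focal length times baseline divided by disparity) are an assumption of this sketch, and the function name, focal length, baseline, principal point, and disparity values are all hypothetical.

```python
from statistics import median


def pedestrian_position(box, disparities, focal_px, baseline_m, cx_px):
    """Estimate (lateral, longitudinal) position in metres.

    box         -- (left, top, right, bottom) of the detection, in pixels
    disparities -- disparity values (pixels) sampled inside the box
    Assumes a rectified pinhole stereo rig (an assumption of this sketch).
    """
    left, _top, right, _bottom = box
    u_centre = (left + right) / 2.0
    d = median(disparities)  # the median is robust to background pixels
    if d <= 0:
        raise ValueError("median disparity must be positive")
    longitudinal = focal_px * baseline_m / d
    lateral = (u_centre - cx_px) * longitudinal / focal_px
    return lateral, longitudinal


lateral, longitudinal = pedestrian_position(
    box=(600, 200, 660, 380),
    disparities=[9.5, 10.0, 10.0, 10.5, 30.0],  # one outlier from the background
    focal_px=1000.0,
    baseline_m=0.25,
    cx_px=512.0,
)
```

Using the median rather than the mean keeps the single outlying disparity from distorting the depth estimate.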
Fig. 1 System diagram that depicts the functional mechanism for the overall procedure
2.1 Feature Extraction
The features capture the flow variance between the upper and lower parts of the pedestrian's body. The features are designed to tolerate errors inherent in the bounding box localization [7]. Figure 2 summarizes the feature extraction procedure. To distinguish between the motions of the upper and lower parts of the body, the bounding box is divided into two corresponding sub-boxes. The resulting orientation vectors have the form $R = [R_x, R_y]^T$, and each is assigned to a bin in the range [0, 7] using its 360° orientation, where the angle of orientation is $\theta = \operatorname{atan2}(R_y, R_x)$ (measured in $[0, 2\pi)$) and the bin index is $I = \lfloor \theta / (\pi/4) \rfloor$. Each bin is weighted by the corresponding magnitudes, and the computed histograms are normalized with respect to the contribution frequency. A feature vector is constructed by aggregating the histogram measures and the median flow for both the upper and lower sub-boxes. Finally, the dimensionality of the feature vector is reduced by applying principal component analysis (PCA), and the first two principal components, those with the largest eigenvalues, are taken as the final histograms of orientation motion (HoM) features.

Fig. 2 System diagram: feature extraction procedure
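The binning step of the feature extraction can be sketched for a single sub-box as follows. The sketch assumes eight bins of width π/4, angles mapped to [0, 2π), and normalization by the number of contributing flow vectors, which is our reading of the normalization described above; the function name and the flow vectors are hypothetical.

```python
import math


def orientation_histogram(flow_vectors, num_bins=8):
    """Magnitude-weighted orientation histogram of 2-D flow vectors."""
    hist = [0.0] * num_bins
    bin_width = 2 * math.pi / num_bins  # pi/4 for eight bins
    contributions = 0
    for rx, ry in flow_vectors:
        magnitude = math.hypot(rx, ry)
        if magnitude == 0.0:
            continue  # a zero vector has no orientation
        theta = math.atan2(ry, rx) % (2 * math.pi)  # map (-pi, pi] to [0, 2*pi)
        index = min(int(theta // bin_width), num_bins - 1)  # guard against rounding
        hist[index] += magnitude
        contributions += 1
    if contributions == 0:
        return hist
    return [h / contributions for h in hist]


# Two hypothetical flow vectors: one pointing up and to the right,
# one pointing down and to the left.
hist = orientation_histogram([(3.0, 4.0), (-1.0, -2.0)])
```

In the full procedure, one such histogram would be computed for each of the two sub-boxes and combined with the median flow before PCA.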
2.2 Pedestrian Trajectory Matching and Path Projection
A pedestrian trajectory $J$ is represented by the ordered tuples $J = ((j_1, t_1), \ldots, (j_N, t_N))$. For each timestamp $t_i$, the state $j_i$ consists of the lateral and longitudinal position of the pedestrian together with the supplementary features extracted from optical flow (Fig. 3). To retrieve a motion projection, each trajectory in a motion database can be compared with an observed trajectory using a similarity measure. To apply the longest common subsequence (LCS) to trajectories, it is necessary to define a proximity function between two states $a_i$ and $b_j$ of the trajectory points $(a_i, t_{a,i}) \in A$ and $(b_j, t_{b,j}) \in B$. For the states $a_i$ and $b_j$, a fixed decision boundary $\varepsilon$ with a value $\varepsilon^{(d)}$ in each dimension $d$ is defined, and a linear function on the range $[0, \varepsilon^{(d)}]$ is applied to determine the proximity between $a_i$ and $b_j$, where $L_1(\cdot)$ denotes the L1 norm, i.e., the Manhattan distance [8]:

$$\operatorname{dist}(a_i, b_j) = \begin{cases} 0, & \text{if } \exists\, d \in D : L_1\big(a_i^{(d)}, b_j^{(d)}\big) > \varepsilon^{(d)} \\ \dfrac{1}{D} \sum_{d=1}^{D} \left[ 1 - \dfrac{L_1\big(a_i^{(d)}, b_j^{(d)}\big)}{\varepsilon^{(d)}} \right], & \text{else} \end{cases} \qquad (1)$$

Fig. 3 Systematic overview of the trajectory matching procedure
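Having set the similarity criterion, LCS on trajectories [8] can be defined as

$$\operatorname{LCS}(A, B) = \begin{cases} 0, & N_A = 0 \lor N_B = 0 \\ \operatorname{LCS}(H(A), H(B)) + \operatorname{dist}(a_{N_A}, b_{N_B}), & \operatorname{dist}(a_{N_A}, b_{N_B}) \neq 0 \\ \max\{\operatorname{LCS}(H(A), B), \operatorname{LCS}(A, H(B))\}, & \text{else} \end{cases} \qquad (2)$$

Here, $N_A$ and $N_B$ are the numbers of states in $A$ and $B$, and $H(A)$ denotes all but the last element of the sub-trajectory. The distance between two trajectories $A$ and $B$ [8] can then be computed as

$$\operatorname{dist}_{\mathrm{LCS}}(A, B) = 1 - \frac{\operatorname{LCS}(A, B)}{\min\{N_A, N_B\}} \qquad (3)$$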
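Equations (1)–(3) can be sketched in a few lines of code. The sketch evaluates the recursion of Eq. (2) bottom-up with a dynamic programming table, treats states as tuples of numbers, and calls the per-dimension decision boundary `eps`; these names, the boundary values, and the example trajectories are assumptions made for illustration, and the rotation and translation handling of QR-LCS is not included.

```python
def proximity(a, b, eps):
    """Eq. (1): 0 if any dimension differs by more than eps[d],
    otherwise the mean linear similarity over all dimensions."""
    total = 0.0
    for a_d, b_d, e_d in zip(a, b, eps):
        diff = abs(a_d - b_d)
        if diff > e_d:
            return 0.0
        total += 1.0 - diff / e_d
    return total / len(eps)


def lcs(traj_a, traj_b, eps):
    """Eq. (2), computed over prefixes with a dynamic programming table."""
    n_a, n_b = len(traj_a), len(traj_b)
    table = [[0.0] * (n_b + 1) for _ in range(n_a + 1)]
    for i in range(1, n_a + 1):
        for j in range(1, n_b + 1):
            sim = proximity(traj_a[i - 1], traj_b[j - 1], eps)
            if sim != 0.0:
                table[i][j] = table[i - 1][j - 1] + sim
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n_a][n_b]


def lcs_distance(traj_a, traj_b, eps):
    """Eq. (3): 1 - LCS / min(N_A, N_B), a value in [0, 1]."""
    if not traj_a or not traj_b:
        raise ValueError("trajectories must be non-empty")
    return 1.0 - lcs(traj_a, traj_b, eps) / min(len(traj_a), len(traj_b))


eps = (0.5, 0.5)  # hypothetical decision boundary per dimension
walk = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
similar = [(0.25, 0.0), (1.0, 0.0)]
far_away = [(10.0, 10.0), (11.0, 10.0)]

d_same = lcs_distance(walk, walk, eps)
d_similar = lcs_distance(walk, similar, eps)
d_far = lcs_distance(walk, far_away, eps)
```

Identical trajectories give a distance of 0, trajectories with no states inside the decision boundary give a distance of 1, and partially matching trajectories fall in between.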
The distance between two trajectories satisfies $\operatorname{dist}_{\mathrm{LCS}}(A, B) \in [0, 1]$.
Quaternion-Based Rotationally Invariant LCS (QR-LCS): With the QR-LCS metric [8, 9], the parameters and hyper-parameters resulting in the optimal translation and rotation for superimposing two trajectories are derived. The distance between two trajectories, $\operatorname{dist}_{\mathrm{QRLCS}}(A, B) \in [0, 1]$, is given by the number of possible assignments within an area around each trajectory state, normalized by the number of trajectory states. Translation is achieved by using the mean values of the trajectories in one of the two point sets in the xy plane [8]:

$$\mu_{a,T} = \frac{1}{T} \sum_{t=1}^{T} a_t = \frac{T-1}{T}\, \mu_{a,T-1} + \frac{1}{T}\, a_T \qquad (4)$$
Here, $T$ is the number of assignments in the sub-trajectory. To find the optimal rotation, we follow the solution provided in [10]. The rotational parameters can be computed incrementally by storing the following sums in the dynamic programming table [11]:

$$\sum_{t=1}^{T} (a_t \times b_t), \qquad \sum_{t=1}^{T} \langle a_t, b_t \rangle \qquad (5)$$

Probabilistic Framework: The greedy search of the matching process (Fig. 3) is replaced by a probabilistic search framework [8, 12] in which the time complexity of the search depends on the number of samples. Given a motion history $M_{1:t}$ up to the current time step $t$, the probability that a future pedestrian state $Q_T$ [2] occurs is computed by

$$p(Q_T \mid M_{1:t}) = \eta\, p(M_{1:t} \mid Y_t) \int p(Y_t \mid Y_{t-1})\, p(Y_{t-1} \mid M_{1:t-1})\, dY_{t-1} \qquad (6)$$
Here, $\eta$ is a normalization constant, $Y_t$ denotes the current state, $d$ represents the number of time steps, and $Q_T$ is a future state [2]. The probabilistic model is based on particle filtering [13] and estimates the posterior distribution of the states from an incomplete trajectory. The particles denote states that are distributed in time as a sequence in Markov fashion. To perform the particle prediction and to reduce the search complexity, a binary tree (Fig. 4) is first constructed from snippets of the trajectories. Snippets [14] are subsets of states from each trajectory with fixed pedestrian states, long enough to be informative but short enough to be characterized probabilistically from the training data. The sequence length of a snippet is about one third of the length of the trajectory. The position of the snippet in its parent trajectory and its successor snippets are stored in a description vector. By applying PCA to these vectors, the dimensions can be ordered by decreasing eigenvalue. After the transformation, the resulting description vector $c$ is used to construct a binary search tree: at each level $l$, the snippet is assigned to the left or right sub-tree based on the sign of $c_l$, and with $N$ training snippets the depth of the search tree is $O(\log N)$. We obtain the binary search tree and analyze it to verify the features transformed into the search tree. The idea behind ordering the search tree queries by the principal components is to match the prominent features first and then proceed to the next levels, eventually yielding a better match for a test snippet.

Fig. 4 Technique to reduce the trajectory search space

Action Categorization and Path Projection: From the perspective of particle filters, the distribution of the predicted state, $p(Q_T \mid M_{1:t})$, corresponds to the particle weights, which are approximated by $w^{(s)} = 1 - \operatorname{dist}_{\mathrm{QRLCS}}$ [2] for each particle $Y_t^{(s)}$. An estimated future pedestrian state can be determined by looking ahead to the time $T = t + \delta t$. The trajectory database contains two categories of snippets: the stopping class $C_s$ and the walking/crossing class $C_w$. The stopping probability can be estimated [2] by
$$p(C_s \mid L) \approx \frac{\sum_{Q_t^{(l)} \in C_s} w^{(l)}}{\sum_{Q_t^{(l)} \in C_s} w^{(l)} + \sum_{Q_t^{(l)} \in C_w} w^{(l)}} \qquad (7)$$
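Equation (7) is a weighted vote over the particles. A minimal sketch is given below, in which each particle is a pair of a weight and a class label; the QR-LCS distances, the labels `"stop"` and `"walk"`, and the function name are hypothetical and serve only to illustrate the computation.

```python
def stopping_probability(particles):
    """Eq. (7): share of the total particle weight in the stopping class.

    particles -- iterable of (weight, label) pairs, label "stop" or "walk"
    """
    stop = sum(w for w, label in particles if label == "stop")
    walk = sum(w for w, label in particles if label == "walk")
    total = stop + walk
    if total == 0:
        raise ValueError("at least one particle must have a non-zero weight")
    return stop / total


# Hypothetical QR-LCS distances of three matched snippets and their classes.
matches = [(0.25, "stop"), (0.5, "stop"), (0.75, "walk")]

# Each particle weight is one minus the QR-LCS distance of its match.
particles = [(1.0 - dist, label) for dist, label in matches]
p_stop = stopping_probability(particles)
```

Closer matches carry larger weights, so two well-matching stopping snippets outweigh one poorly matching walking snippet in this example.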
3 Dataset and Experimental Observations
We consider the Daimler Pedestrian Path Prediction Benchmark Dataset. This dataset [15] contains 68 pedestrian sequences captured from a stationary and a moving vehicle, covering four pedestrian action types: crossing, stopping, starting to walk, and bending in, under the assumption that there is a single pedestrian in each sequence with no occlusion. Bounding boxes and disparity were computed by means of the procedure described in [16]. The sequences contain 19,612 stereo image pairs, of which 12,485 images have manually labeled pedestrian bounding boxes and 9366 images have additional detector measurements. We use a portion of the dataset focusing on the crossing and stopping cases (3176 image pairs). The dataset is available at https://bit.ly/2JgtZmv. In terms of path prediction accuracy, this approach is less accurate than the state-of-the-art Interacting Multiple Model Kalman filter (IMM-KF) [2, 17], and the accuracy is lower still if the QR-LCS metric is used alone (without the probabilistic framework). However, on the task of determining whether a pedestrian at the curbside will stop, this system performs better than IMM-KF [2, 17].
4 Discussion and Conclusion

Trajectory motion states constructed from the bounding box representation features (Fig. 5) allow the prediction problem to be modeled as sequence matching (Fig. 6), and these features are the most representative ones for this model. Performing pedestrian path prediction with sequence matching and the QR-LCS metric alone shows how the problem can be approached in a straightforward way. However, the less accurate results suggest that the straightforward approach is not adequate and that a more refined probabilistic approach (such as PHTM [3]) is required, which is ultimately an extension of the straightforward approach built on the rotationally invariant LCS metric (QR-LCS). Additionally, the complexity of the problem can grow quickly as we try to improve path prediction accuracy by incorporating larger training sets. The intermediate results indicate that search trees constructed from principal component features are an important means of reducing time complexity. Pedestrian path estimation involves predicting future states from an intermediate state, and we expect approaches that model the problem with Markov models and time-sequenced states to be effective. Hence, by extending our approach with particle filtering, we can exploit the Markov assumption and aim for more efficient prediction. We would also like to investigate other posterior distribution estimation models, or stochastic search algorithms such as Monte Carlo methods or evolutionary algorithms, as alternatives to particle-filter-based estimation.

Fig. 5 Bounding box representation and optical flow

Fig. 6 Matching of two trajectories with crossing category
References

1. Zhang, S., Benenson, R., Omran, M., Hosang, J., Schiele, B.: How far are we from solving pedestrian detection? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1259–1267 (2016)
2. Keller, C.G., Hermes, C., Gavrila, D.M.: Will the pedestrian cross? Probabilistic path prediction based on learned motion features. In: Joint Pattern Recognition Symposium, pp. 386–395. Springer (2011)
3. Keller, C.G., Gavrila, D.M.: Will the pedestrian cross? A study on pedestrian path prediction. IEEE Trans. Intell. Transp. Syst. 15(2), 494–506 (2014)
4. Ridel, D., Rehder, E., Lauer, M., Stiller, C., Wolf, D.: A literature review on the prediction of pedestrian behavior in urban scenarios. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3105–3112. IEEE (2018)
5. Hirschmuller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 328–341 (2008)
6. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Scandinavian Conference on Image Analysis, pp. 363–370. Springer (2003)
7. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International Conference on Computer Vision & Pattern Recognition (CVPR'05), vol. 1, pp. 886–893. IEEE Computer Society (2005)
8. Käfer, E., Hermes, C., Wöhler, C., Ritter, H., Kummert, F.: Recognition of situation classes at road intersections. In: 2010 IEEE International Conference on Robotics and Automation, pp. 3960–3965. IEEE (2010)
9. Baghiyan, A.: Quaternion-based algorithm of ground target tracking by aircraft. Gyroscopy Navig. 3(1), 28–34 (2012)
10. Horn, B.K.: Closed-form solution of absolute orientation using unit quaternions. JOSA A 4(4), 629–642 (1987)
11. Wang, L., Wang, X., Wu, Y., Zhu, D.: A dynamic programming solution to a generalized LCS problem. Inf. Proc. Lett. 113(19–21), 723–728 (2013)
12. Sidenbladh, H., Black, M.J., Sigal, L.: Implicit probabilistic models of human motion for synthesis and tracking. In: European Conference on Computer Vision, pp. 784–800. Springer (2002)
13. Black, M.J., Jepson, A.D.: A probabilistic framework for matching temporal trajectories: condensation-based recognition of gestures and expressions. In: European Conference on Computer Vision, pp. 909–924. Springer (1998)
14. Howe, N.R., Leventon, M.E., Freeman, W.T.: Bayesian reconstruction of 3D human motion from single-camera video. In: Advances in Neural Information Processing Systems, pp. 820–826 (2000)
15. Schneider, N., Gavrila, D.M.: Pedestrian path prediction with recursive Bayesian filters: a comparative study. In: German Conference on Pattern Recognition, pp. 174–183. Springer (2013)
16. Hirschmüller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information, pp. 807–814. IEEE (2005)
17. Bar-Shalom, Y., Li, X.R., Kirubarajan, T.: Estimation with Applications to Tracking and Navigation: Theory Algorithms and Software. Wiley (2004)
Real-Time Traffic Sign Recognition Using Convolutional Neural Networks Aditya Rao, Rahul Motwani, Naveed Sarguroh, Parth Kingrani, and Sujata Khandaskar
Abstract Road accidents cost India 3–5% of its gross domestic product every year and could be avoided if India improved its roads and city planning, trained its drivers better and enforced traffic laws properly, an IndiaSpend analysis shows. Traffic sign boards are often ignored by drivers, either intentionally or because they were not taught about them during their training. The objective is for our system to detect the signs on the road in real time and alert the driver to perform the action corresponding to the sign; for example, if the system detects a stop sign, the driver is prompted to lower their speed.

Keywords Road accidents · City planning · Traffic sign boards
1 Introduction

Real-time traffic sign detection is an important step towards making learner drivers aware of road signs that they might otherwise fail to recognise. It is also an important step for self-driving cars. Traffic sign recognition usually consists of two steps: detection and classification. A predefined dataset is necessary to compare the captured images with the images in the dataset and make the necessary decision. Both the accuracy and the speed of this process play an important role. The dataset that we have generated consists of roughly 81,000 images. These images were generated by augmenting the basic dataset collected from the Indian RTO website. Augmentation produces views of each sign at various angles as well as flipped images from a single picture, which makes the dataset more representative. For traffic sign detection, existing approaches already achieve high accuracy. Most previous works have used traditional machine learning approaches, such as probability models, HOG features and support vector machines, which are designed to detect only specific categories of traffic signs [1]. An RGB-based colour threshold technique combined with a circle detection algorithm can be used to detect traffic signs on the basis of their shape and colour [2–4]. On a larger scale, however, these shape- and colour-based detection methods turned out to be inaccurate. If the dataset contains abundant images, as in our case, convolutional neural networks can be used very effectively [5, 6]. For the detection of traffic signs, a CNN helps us obtain precise results, aided by the abundance of signs in our dataset. Classification and detection are two entirely different tasks, and our application's job is to integrate them on a single platform. Existing machine-learning-based traffic sign detectors do not always produce a tightly fitting bounding quadrilateral. Localization therefore requires refinement, which our approach takes into account.

A. Rao (B) · R. Motwani · N. Sarguroh · P. Kingrani · S. Khandaskar Department of Computer Engineering, Vivekanand Education Society’s Institute of Technology, Collector’s Colony, Chembur, Mumbai, Maharashtra 400074, India e-mail: [email protected] R. Motwani e-mail: [email protected] N. Sarguroh e-mail: [email protected] P. Kingrani e-mail: [email protected] S. Khandaskar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_23
2 Related Work

2.1 Image Preprocessing

Image preprocessing is the first step towards ensuring a high-accuracy model. It is not a single operation but a combination of several, ranging from simple tasks such as cropping, rotation or translation to more complex processes such as median filtering. In our model, we use two preprocessing methods: image resizing and resolution adjustment. The refined images these methods generate later prove vital during sign detection. The images are first resized to a resolution of 60 × 60 pixels and then converted to grayscale. Since all images share a similar colour scheme, colour does not help in extracting the features that distinguish the different classes [7].
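The resize-then-grayscale step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: images are nested lists of (r, g, b) tuples, resizing uses nearest-neighbour sampling, and the luminance weights are the common ITU-R BT.601 values; the paper does not say which interpolation or grayscale formula was used, and the function names are hypothetical.

```python
def resize_nearest(img, out_h=60, out_w=60):
    """Resize a 2D image (list of rows) with nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def to_grayscale(rgb):
    """Convert rows of (r, g, b) pixels to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def preprocess(rgb):
    """Resize to 60 x 60 first, then convert to grayscale, as in the text."""
    return to_grayscale(resize_nearest(rgb, 60, 60))
```

In practice, an image library would normally provide both operations; the sketch only makes the order of the two steps explicit.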
2.2 Traffic Sign Detection

Various sign detection algorithms and techniques have come into existence in the last few years. Colour-based and shape-based models have been tried and tested, but both have their share of drawbacks and inaccuracies. The RGB space (the red, green and blue levels of each point of the image) and the HSV model are the two main colour-based models [4, 8, 9]. The shape-based models were created taking into consideration the fixed size (dimensions) of traffic signs [2, 3, 10]. We have used the shape-based model.
2.3 Localization Refinement

After preprocessing, once we have the image in the desired resolution and size, we must narrow down the region in which our desired sign exists. To detect traffic signs, we must focus on the corner of the images. Once an object is conclusively found to be a traffic sign, it is bound by a rectangular box, and we can proceed to recognition and classification [1].
2.4 Recognition and Classification

After identifying the location and shape of all potential traffic signs, the next step is to classify them. This is done by comparing each region of interest individually with each image in a database. The database contains images of each traffic sign, classified according to their shapes. Since the comparison is iterative, the running time of the comparison grows as more traffic sign patterns are added to the dataset. The cross-correlation algorithm is used to compare each potential traffic sign with each pattern in the dataset. If the value obtained from the comparison is greater than a threshold, the candidate is considered very similar to that traffic sign pattern. Otherwise, it is compared with the next pattern in the dataset, and so on. If, at the end, the candidate does not match any image in the dataset, that is, the cross-correlation value never exceeds the threshold, the object is considered to have the shape of a traffic sign but not to be an actual traffic sign. However, this can also happen simply because the sign pattern is not present in the database.
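The comparison loop described above might look like the following sketch. It assumes equally sized grayscale images stored as nested lists and uses the normalized (zero-mean) form of cross-correlation; the threshold of 0.8 and the names `ncc` and `classify` are illustrative assumptions, not values from the paper.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equally sized grayscale images."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    # A flat image has zero variance; treat it as matching nothing.
    return num / den if den else 0.0


def classify(roi, templates, threshold=0.8):
    """Return the label of the first template whose score exceeds the threshold."""
    for label, template in templates:
        if ncc(roi, template) > threshold:
            return label
    # No pattern matched: sign-shaped object, or a sign missing from the database.
    return None
```

Because every candidate is compared with the templates one by one, the cost per candidate grows linearly with the number of stored patterns, which matches the remark on running time above.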
3 Approach

This section details every step of our approach, from data collection and generation to the classification process.
3.1 Dataset

We prepared the dataset by collecting images of traffic signs, which fall into three categories: cautionary, mandatory and informatory. A few of these signs, taken from various sources, are shown in Figs. 1, 2 and 3. Although we have not divided our dataset into these three categories, understanding the difference between them is important. No standard dataset is available for these traffic signs, so all the images in our initial dataset were collected from the RTO website (transport.maharshtra.gov.in). This initial dataset was then expanded to around 40,000 images by augmenting all the collected images. The augmentation process transforms the original images into new images with changed angles, changed sign positions, etc., which helps to cover the cases that may occur in real-life situations. We shift the signs by ten pixels in both height and width and rotate them within a range of 15°. In addition, the brightness is varied for better predictions in poor lighting (Fig. 4). Augmentation is done to increase the model's ability to generalise outside the dataset [7]. In this way, we created the entire dataset required for our project.

Fig. 1 Mandatory traffic signs
Fig. 2 Cautionary traffic signs
Fig. 3 Informatory traffic signs
Fig. 4 Example of augmentation
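The augmentation described in Sect. 3.1 can be sketched as below for grayscale images stored as nested lists. The shift limit of ten pixels and the rotation range of 15° follow the text; treating them as symmetric random ranges, the brightness range of 0.5 to 1.5 and all function names are assumptions made for illustration.

```python
import math
import random


def shift(img, dx, dy, fill=0):
    """Translate the image by (dx, dy) pixels, filling uncovered pixels."""
    h, w = len(img), len(img[0])
    return [[img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
             for x in range(w)] for y in range(h)]


def rotate(img, degrees, fill=0):
    """Rotate about the image centre using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Map each output pixel back to its source position.
            sx = round(c * (x - cx) + s * (y - cy) + cx)
            sy = round(-s * (x - cx) + c * (y - cy) + cy)
            row.append(img[sy][sx] if 0 <= sy < h and 0 <= sx < w else fill)
        out.append(row)
    return out


def brighten(img, factor):
    """Scale pixel values and clip them to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def augment(img, rng):
    """Produce one randomly shifted, rotated and brightness-adjusted copy."""
    dx, dy = rng.randint(-10, 10), rng.randint(-10, 10)
    angle = rng.uniform(-15, 15)
    factor = rng.uniform(0.5, 1.5)  # assumed brightness range
    return brighten(rotate(shift(img, dx, dy), angle), factor)
```

Calling `augment(img, random.Random(seed))` repeatedly with different seeds would yield several variants of one source image, which is how a small collection can be expanded into a much larger one.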
Fig. 5 Data preprocessing
3.2 Data Preprocessing

The entire project is based on convolutional neural networks (CNNs). CNNs operate directly on images and are more efficient and accurate on them than many other deep neural networks. ConvNet models are also comparatively easy and fast to train on images. A CNN requires all the images in the dataset to have the same dimensions; the fact that the model cannot be trained on images of different dimensions is one of its limitations. Hence, we preprocess the data. In our case, the data are images of different dimensions, and preprocessing converts them all to a standard dimension. We have to make sure that no image is compressed or stretched beyond an acceptable level, so the choice of the standard dimension is very important. The resize method performs the required dimension setting (Fig. 5).
3.3 Data Loading

After preprocessing, the next step is to load the dataset while converting the images to the chosen dimension. Our dataset consists of 42 classes in total. In other words, 42 different types of traffic signs are present in the dataset, and each sign has its own folder containing images of different sizes and clarity. Each class consists of 1200 images: 1100 for training and 100 for validation. Thus, a total of 50,400 images are present. We assign a unique id to each traffic sign. As the dataset is divided into multiple folders and the naming of images is not consistent, we load all the images, converted to 60 × 60 pixels, into one list and the traffic sign each image represents into another list. The images are read using ‘imread’. The dataset is now ready for use.
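A loader of this kind could be sketched as follows. The folder layout (one sub-folder per sign class) follows the text; `load_dataset` and the `read_image` callable are hypothetical, with `read_image` standing in for whatever combination of 'imread' and resizing is actually used.

```python
import os


def load_dataset(root, read_image, size=(60, 60)):
    """Load every image under root into one list and its class id into another.

    read_image(path, size) is a caller-supplied function (an assumption here)
    that reads one file and returns it resized to `size`.
    """
    images, labels = [], []
    # Each sub-folder is one traffic sign class; sorting keeps the ids stable.
    class_names = sorted(d for d in os.listdir(root)
                         if os.path.isdir(os.path.join(root, d)))
    for class_id, name in enumerate(class_names):
        folder = os.path.join(root, name)
        for filename in sorted(os.listdir(folder)):
            images.append(read_image(os.path.join(folder, filename), size))
            labels.append(class_id)
    return images, labels, class_names
```

Because the class id comes from the folder rather than the file name, inconsistent image naming does not affect the labels.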
3.4 Detection of Signs

A shape-based detection algorithm is used in our model. Across the globe, traffic signs are displayed on boards of specific shapes. Circles, triangles and rectangles are the shapes considered for these traffic signs. We use this feature to detect the signs.
3.5 Localization

After preprocessing, once we have the image in the desired resolution and size, we must narrow down the region in which our desired sign exists. For detecting traffic signs, we must focus on the corner of the images. Once an object is conclusively found to be a traffic sign, it is bound by a rectangular box, and we can proceed to recognition and classification.
3.6 Classification

The classification process is based entirely on convolutional neural networks and their layers. A convolutional neural network (ConvNet/CNN) is very similar to an ordinary neural network: it is made up of neurons with learnable weights and biases. The main difference is that a CNN assumes that its input is an image, which vastly reduces the number of parameters to be tuned in the network. A CNN is basically a sequence of layers. Each layer transforms an input image to an output image with some differentiable function that may or may not have parameters. The three layers of a CNN are (Fig. 6):

1. Convolutional layer
2. Pooling layer
3. Fully connected layer

1. Convolutional layer

The input 3D volume is passed to this layer. Its dimension is H * W * C, where H, W and C represent height, width and the number of channels, respectively. The following parameters are used in this layer:
Fig. 6 Layered structure of CNN
(a) Depth: the number of layers present. The depth of a 2D image is 1.
(b) Stride (S): the step size taken while traversing horizontally or vertically along the height and width.
(c) Padding (P): The size of an input image reduces after processing, which makes it difficult to build deep networks on small images. Padding is therefore used to retain the height and width of the input image: the required number of rows and columns is added around the input image, and the added pixels have the value 0 by default.

The actual computation begins after padding. The filter starts at the top left corner. The corresponding values of the filter and the input volume are multiplied, and the products are summed. The filter is then slid horizontally according to the stride value, and the same process is repeated vertically until the whole image is covered.

2. Pooling layer

The main function of this layer is to reduce the spatial dimensions of its input, which lowers the computation required to process the image, while extracting the most dominant information. Two types of pooling are most commonly used: max pooling and average pooling. In max pooling, the maximum value within the selected kernel window is retained and all other values are discarded, whereas in average pooling, the average of all the values within the window is stored (Fig. 7).
Fig. 7 Difference between max and average pooling
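The convolution and pooling steps described above can be made concrete with a small single-channel sketch. It assumes a square F × F filter and zero padding; with input height H, padding P and stride S, the output height is floor((H − F + 2P) / S) + 1, and likewise for the width. The function names are illustrative, and real CNN libraries implement the same arithmetic far more efficiently.

```python
def pad(img, p):
    """Surround the image with p rows and columns of zeros."""
    width = len(img[0]) + 2 * p
    border = [[0] * width for _ in range(p)]
    body = [[0] * p + list(row) + [0] * p for row in img]
    return border + body + [row[:] for row in border]


def conv2d(img, kernel, stride=1, padding=0):
    """Slide a square kernel over the image, multiplying and summing."""
    img = pad(img, padding)
    h, w, f = len(img), len(img[0]), len(kernel)
    return [[sum(img[y + i][x + j] * kernel[i][j]
                 for i in range(f) for j in range(f))
             for x in range(0, w - f + 1, stride)]
            for y in range(0, h - f + 1, stride)]


def pool(img, size=2, mode="max"):
    """Max or average pooling over non-overlapping size x size windows."""
    reduce = max if mode == "max" else (lambda v: sum(v) / len(v))
    return [[reduce([img[y + i][x + j] for i in range(size) for j in range(size)])
             for x in range(0, len(img[0]) - size + 1, size)]
            for y in range(0, len(img) - size + 1, size)]
```

For a 4 × 4 input and a 3 × 3 filter, padding of 1 with stride 1 keeps the output at 4 × 4, while no padding shrinks it to 2 × 2, which is the size reduction that padding is meant to prevent.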
3. Fully connected (FC) layer

This layer classifies the complex features extracted by the previous layers. It is the same as in ordinary neural networks, where each neuron is connected to all the neurons of the adjacent layer. To pass an input to the FC layer, we flatten it so that all the values are arranged in one column. The flattened feature vector is then passed to the FC layer. All the weights in the FC layer are initialised with small random numbers and are trained using the back-propagation algorithm.
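A forward pass through such a layer can be sketched as follows. The initialisation scale of 0.01 and the function names are assumptions for illustration; training by back-propagation is not shown.

```python
import random


def flatten(feature_maps):
    """Arrange all values of a stack of 2D feature maps in one vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def init_dense(n_in, n_out, rng, scale=0.01):
    """Create small random weights (n_out x n_in) and zero biases."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Each output neuron is a weighted sum of every input value plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
```

For a classifier, the layer would have one output neuron per traffic sign class, typically followed by a softmax to turn the outputs into class probabilities.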
4 Experiments and Evaluation

An epoch is one complete pass through the full training dataset. Given the complexity and variability of real-world data, it may take hundreds to thousands of epochs to reach reasonable accuracy on test data. The definition of an epoch can also vary with the problem at hand.

Accuracy: Model accuracy is a metric that measures the model's performance in an interpretable manner. It is computed from the predictions made with the model parameters and is given as a percentage. It measures how closely the model's predictions match the true data. For example, if there are 100 test samples and the model classifies 85 of them correctly, the accuracy of the model is 85%.
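The accuracy example above corresponds to the following short calculation; the function name is illustrative.

```python
def accuracy(predicted, actual):
    """Percentage of samples whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

With 100 test samples of which 85 are predicted correctly, the function returns 85.0.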
Fig. 8 Accuracy graph between test data and actual data
In Fig. 8, we show the accuracy for both the training and the testing processes.

Loss: Loss is the sum of the errors made for each example in the training or validation set; it is the penalty for a bad prediction. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The loss is calculated on the training and validation sets, and it indicates how well the model is doing on these two sets. Unlike accuracy, loss is not a percentage. Figure 9 shows our model's loss for both the training and the testing data. As we can observe, the loss for both the training and testing datasets decreases to a final value as the number of epochs increases, while the accuracy increases to a final value.
Fig. 9 Error graph between test data and actual data

5 Conclusion

At the beginning of our project, the main aim was to devise a fully working model for the detection and classification of traffic signs. At the end of the project, we have a working real-time model that detects traffic signs and classifies them effectively and accurately into their respective classes. The camera, fitted at the front of the car, captures images continuously at very short intervals and tries to detect traffic signs in these captured images. If a traffic sign is detected, the system tries to recognise it and gives an alert message about the presence of that traffic sign ahead. The project is best suited to motor training schools. At present, the teaching provided by driving schools focuses on practical driving lessons but fails to recognise the importance of understanding basic road etiquette and rules. Failing to spot traffic signs and failing to understand them are two important issues that compromise road safety to a very large extent. These issues need to be addressed sooner rather than later, and with the model we have developed, we aim to overcome them as far as possible. With further improvements in the accuracy and speed of recognition, the system will also be a major development for smart cars. Audio feedback integrated with the system will allow the smart car to give notifications about the traffic signs ahead.
References

1. Geisler, W., Clark, M., Bovik, A.: Multichannel texture analysis using localized spatial filters. IEEE Trans. Pattern Anal. Mach. Intell., 1 Jan 1990
2. Li, H., Sun, F., Liu, L., Wang, L.: A novel traffic sign detection method via color segmentation and robust shape matching. Neurocomputing 169, 77–88 (2015)
3. Malik, R., Khurshid, J., Ahmad, S.N.: Road sign detection and recognition using colour segmentation, shape analysis and template matching. In: Proceedings of the 2007 International Conference on Machine Learning and Cybernetics, Hong Kong, China, 19–22 Aug 2007
4. Bahlmann, C., Zhu, Y., Ramesh, V., Pellkofer, M., Koehler, T.: A system for traffic sign detection, tracking, and recognition using color, shape, and motion information. In: Proceedings of the 2005 IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 6–8 June 2005
5. Qian, R., Yue, Y., Coenen, F., Zhang, B.: Traffic sign recognition with convolutional neural network based on max pooling positions. In: Proceedings of the 12th International Conference on Natural Computation; Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), Changsha, China, 13–15 Aug 2016
6. Jin, J., Fu, K., Zhang, C.: Traffic sign recognition with hinge loss trained convolutional neural networks. IEEE Trans. Intell. Transp. Syst. 15, 1991–2000 (2014)
7. Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: The German traffic sign recognition benchmark: a multi-class classification competition. In: Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 Aug 2011
8. Shadeed, W.G., Abu-Al-Nadi, D.I., Mismar, M.J.: Road traffic sign detection in color images. In: Proceedings of the 10th IEEE International Conference on Electronics, Circuits and Systems, Sharjah, UAE, 14–17 Dec 2003
9. Soendoro, W.D., Supriana, I.: Traffic sign recognition with color-based method, shape-arc estimation and SVM. In: Proceedings of the 2011 International Conference on Electrical Engineering and Informatics, Bandung, Indonesia, 17–19 July 2011
10. Paclík, P., Novovicova, J.: Road sign classification without color information. In: Proceedings of the 6th Conference of Advanced School of Imaging and Computing, July 2000
A Survey on Efficient Management of Large RDF Graph for Semantic Web in Big Data Ashutosh A. Abhangi and Sailesh Iyer
Abstract The semantic web extends the principles of the web by allowing computers to understand and easily analyze data on the web. Presently, RDF is used as the triple-based data model for the semantic web. Real-world applications have a primary need for efficient storage of, and data retrieval from, the RDF graph of the semantic web. This paper compares work done on the semantic web and discusses the challenges involved, including scalability, real-time efficient storage and query processing in graph-oriented distributed databases. The approaches compared are the direct relational mapping approach, entity-based perspectives with different indexing techniques for querying linked data in a multilevel indexing framework, and the graph-based approach. This paper provides an overview of the features and techniques for storing the RDF graph and managing metadata for the semantic web.

Keywords Semantic web · RDF · OWL · NOSQL · SPARQL · Storage · Graph oriented distributed database
1 Introduction

The semantic web is an extension of the World Wide Web that makes web data machine readable and provides an architecture for integrating web data. Data may describe other data (metadata) and be combined (merged); data should be available to machines for further processing; and data should be exchangeable together with a valid description of, or reasoning about, that data [1, 2] (Fig. 1). To make data manageable and processable by machines, we need:

1. Resource names for binding the data into objects: URIs
2. A data model for interchanging the data: RDF
3. A dedicated query language for accessing the data: SPARQL
4. Common vocabularies: RDF Schemas (RDFS) and Web Ontology Language (OWL).

Fig. 1 Semantic web layer [3]

A. A. Abhangi (B) · S. Iyer Department of Computer Science and Engineering, Rai University, Ahmedabad, Gujarat 382260, India e-mail: [email protected] S. Iyer e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_24
From a technical point of view [2, 4], the semantic web consists of:

1. RDF: The Resource Description Framework is a modeling language in which information is stored and represented as a graph of triples (statements with a subject S, predicate P and object O) for the semantic web.
2. SPARQL: The SPARQL Protocol and RDF Query Language is a language designed to query data across different systems for managing data on the semantic web.
3. OWL: The Web Ontology Language is a schema language for representing knowledge about data on the semantic web.
2 Theoretical Background

2.1 Structure of RDF

RDF stands for Resource Description Framework. It is a graph data model for metadata, so RDF consists of a model that is a graph representation of data on the web. RDF is a framework for describing resources on the web. It is designed to be read and understood by computer systems only [1, 5]. An RDF statement combines a resource as the subject, a property as the predicate and the property's value as the object. For example, consider the statement: The author of https://www.java2all.com/jsp is Ashutosh Abhangi.

• The subject of the above statement is https://www.java2all.com/jsp
• The predicate is author
• The object is Ashutosh Abhangi.

Each triple represents a relationship between a subject and its object. In general, RDF data is a set of such triples, which also forms a labelled graph called an RDF graph.

Fig. 2 RDF format

Fig. 3 RDF example
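The example statement can be written down as a triple in a few lines of Python. This is a minimal illustrative sketch with hypothetical names (`Triple`, `objects`), not an RDF library API; a real application would normally use a dedicated RDF toolkit.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: a subject, a predicate and an object."""
    subject: str
    predicate: str
    object: str


# The example statement from the text as a single triple.
statement = Triple("https://www.java2all.com/jsp", "author", "Ashutosh Abhangi")

# An RDF graph is a set of such triples.
graph = {statement}


def objects(graph, subject, predicate):
    """Follow the edges labelled `predicate` that leave the node `subject`."""
    return [t.object for t in graph if t.subject == subject and t.predicate == predicate]
```

Looking up the author of the page then amounts to following the "author" edge from the subject node.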
2.2 RDF Terminology

RDF has its own names for the nodes and edges of a graph. Following [6, 7] and Fig. 2, an edge is called a triple; the source node is called the subject; the edge label (the arrow) is called the predicate (property); and the target node is called the object (value). In the graph, one node is in the subject position and the other is in the object position [6, 8]. Fig. 3 shows this for the example statement above.
2.3 RDF Graph The data of user table are display here as per the web link https://example.com/Blog# in a normal entity wise which contains the column like id, first name, last name and gender (Table 1). This information is also representing by a set of triples by using directed, labelled graph like Fig. 2. So in such way, a RDF graph is a triples display or represent by
238 Table 1 User table [3]
Table 2 RDF triples [3]
A. A. Abhangi and S. Iyer Id
First name
Last name
Gender
1
Joe
Doe
Male
2
Mary
Smith
Female
Subject
Predicate
Object
JD
firstName
Joe
JD
lastName
Doe
JD
gender
Male
MS
firstName
Mary
MS
lastName
Smith
MS
Gender
Female
an edge between two different nodes. The subject represents a node as source, for example, JD and MS; object is a destination node, for example, Joe, Male, Mary, Female, etc; and predicate is an edge, for example, first name, last name and gender. Table 2 contains the information in a tabular format as per RDF graph represents in a Fig. 4 in a labelled graph format [9, 10]. Thus, uniform resource identifier represents the all information available in RDF datasets in the form of subject, predicates and objects. Some resources which does not have a permanent identity is called blank node, and it represents by _ [11]. As per the standards of RDF, its provides a specific name space with the RDF prefix name like rdf:type. So this mechanism is quite useful to apply a value to gender like man (male) and women (female) concept to Joe and Mary as shown in Table 3. Here, if we want to add state like ‘Joe is following Mary’, then it can be representing like blog:isFollowing by using new predicate directly. This RDF declaration is shown in a Table 4. So the rdf:Property is work with rdf:type and also displays that the given resource is used as property (predicate) rather than an object or a subject [8, 12]. Now assume the state like ‘joe is following Mary since last March’. So we have to manage this extra information in a specific way with previous statement. For the solution, this statement is managing as resource, RDF use a vocabulary for identify a
Fig. 4 RDF graph representation [3]
A Survey on Efficient Management of Large RDF Graph for …
Table 3 RDF triples with prefixes such as ex: and blog: [3]
Subject | Predicate | Object
blog:JD | ex:firstName | ‘Joe’
blog:JD | ex:lastName | ‘Doe’
blog:JD | rdf:type | ex:Man
blog:MS | ex:firstName | ‘Mary’
blog:MS | ex:lastName | ‘Smith’
blog:MS | rdf:type | ex:Woman
Table 4 RDF triples with the new predicate and prefixes [3]
Subject | Predicate | Object
blog:JD | ex:firstName | ‘Joe’
blog:JD | ex:lastName | ‘Doe’
blog:JD | rdf:type | ex:Man
blog:JD | blog:isFollowing | blog:MS
blog:MS | ex:firstName | ‘Mary’
blog:MS | ex:lastName | ‘Smith’
blog:MS | rdf:type | ex:Woman
blog:isFollowing | rdf:type | rdf:Property
given statement and to refer to it in another statement, according to the relationship or dependency in the information. This vocabulary consists of rdf:subject, rdf:predicate, rdf:object and rdf:Statement, as shown in Table 5. Making statements about other statements in this way is called reification [11]. A graph visualization of Table 5 is given in Fig. 5. Here, we have represented only small sets of statements, either in a three-column table format or as small RDF graphs. In general, however, such textual representations and graph visualizations are not easy to read and understand.
Table 5 RDF dataset with vocabulary [3, 11]
Subject | Predicate | Object
blog:JD | ex:firstName | ‘Joe’
blog:JD | ex:lastName | ‘Doe’
blog:JD | rdf:type | ex:Man
blog:MS | ex:firstName | ‘Mary’
blog:MS | ex:lastName | ‘Smith’
blog:MS | rdf:type | ex:Woman
blog:isFollowing | rdf:type | rdf:Property
_:X | rdf:type | rdf:Statement
_:X | rdf:subject | blog:JD
_:X | rdf:predicate | blog:isFollowing
_:X | rdf:object | blog:MS
_:X | blog:started | ex:March
A. A. Abhangi and S. Iyer
Fig. 5 RDF graph of Table 5 [3, 11]
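The reification pattern of Table 5 can be sketched in a few lines of Python. This is only a minimal illustration, assuming that triples are stored as plain (subject, predicate, object) tuples; the helper name reify and the generated blank node labels (_:b0, _:b1, ...) are hypothetical and not part of any RDF library.

```python
from itertools import count

# Hypothetical blank node counter: labels look like _:b0, _:b1, ...
_blank_ids = count()

def reify(triple, graph):
    """Add the four reification triples for `triple` and return its blank node."""
    s, p, o = triple
    node = f"_:b{next(_blank_ids)}"
    graph.extend([
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ])
    return node

# A subset of the triples of Table 5, as plain tuples.
graph = [
    ("blog:JD", "ex:firstName", "Joe"),
    ("blog:JD", "rdf:type", "ex:Man"),
    ("blog:MS", "ex:firstName", "Mary"),
    ("blog:MS", "rdf:type", "ex:Woman"),
    ("blog:isFollowing", "rdf:type", "rdf:Property"),
]

# 'Joe is following Mary since March': describe the statement, then annotate it.
stmt = reify(("blog:JD", "blog:isFollowing", "blog:MS"), graph)
graph.append((stmt, "blog:started", "ex:March"))

for s, p, o in graph:
    print(s, p, o)
```

The blank node returned by reify plays the role of _:X in Table 5: it stands for the statement itself, so further triples such as blog:started can be attached to it.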
3 OWL
OWL stands for ‘Web Ontology Language’. It was designed to be interpreted by machines for processing information on the semantic web, as part of the W3C’s semantic web technology stack. OWL is built on top of RDF, so it is a stronger language than RDF, with more vocabulary and syntax for machine interpretability [2]. OWL information can easily be exchanged using XML between machines that run on different platforms and use different application languages. The vocabulary extensions enable us to define classes, properties, instances and meta-descriptions with greater precision [2, 13]. OWL has two versions, OWL 1 and OWL 2, and three sub-languages [2, 8]:
1. OWL-Lite
2. OWL-DL
3. OWL-Full
An arbitrary RDF document cannot be assumed to be compatible with OWL-Lite. OWL-DL is the web ontology language with description logics, which are subsets of first-order logic with other interesting properties; it does not allow metamodelling and restricts the mixing of RDF and OWL constructs. OWL-Full, in contrast, allows metamodelling and the free mixing of RDF and OWL constructs [13]. OWL defines two types of property: an object property relates instances of two classes, and a datatype property relates an instance of a class to a literal. The first mechanism for defining a class by its description is enumeration: the individuals of the described class must be exactly the set of enumerated individuals, no more and no less [8]. The following code illustrates the use of the enumeration constructor owl:oneOf.
PREFIX rdf:
PREFIX owl:
PREFIX blog:
_:X rdf:type blog:User ;
    owl:oneOf (blog:JD blog:MS blog:OC blog:GB) .
The second mechanism is a set of property restrictions of two types: value constraints (rdfs:range) and cardinality constraints (owl:cardinality, with minimum and maximum variants). The third mechanism is a set of operators on classes, namely intersection, union and complement, used in a class description [2, 8]. OWL is thus a rich representation language, but its vocabulary and reasoning have high computational complexity; hence, OWL is rarely used in RDF graph management systems.
4 SPARQL
SPARQL stands for ‘SPARQL Protocol and RDF Query Language’. It is the standard query language for linked open data on the web and for RDF graph databases, and it enables users to query and manage information from databases that are mapped to RDF [14]. SQL is used to retrieve, manage and modify information in a relational database using the standard SELECT, FROM, WHERE pattern. SPARQL provides the same functionality for NoSQL-type graph databases such as GraphDB, Neo4j, OrientDB, Cassandra, IBM Graph and Amazon Neptune [15, 16]. A variable is a name preceded by a question mark (?); it holds the result information retrieved from the RDF graph by the query [14, 17]. The first example below retrieves the first name of the user whose last name is Doe. Here, we declare two variables, ?name and ?user. The SELECT part is the result template of the query, and the WHERE part is the query pattern that is matched against the RDF triples.
PREFIX blog:
PREFIX ex:
SELECT ?name
WHERE {
  ?user ex:lastName "Doe" ;
        ex:firstName ?name .
}
Since we have already added the statement ‘Joe is following Mary’, we can also write a query that retrieves the first names of users whose last name is ‘Doe’ and who are following someone whose first name is ‘Mary’. This is similar to performing a SQL JOIN operation in an RDBMS. In SPARQL, the different triple patterns matched against the RDF graph are connected through shared variables, as the following example shows.
PREFIX blog:
PREFIX ex:
SELECT ?name
WHERE {
  ?user ex:lastName "Doe" ;
        ex:firstName ?name ;
        blog:isFollowing ?user2 .
  ?user2 ex:firstName "Mary" .
}
SPARQL also supports LIMIT (restricts the number of results), ORDER BY (sorts the results), GROUP BY (groups results for aggregates such as AVG(), MIN(), MAX() or COUNT()), OFFSET (skips the given number of results), DISTINCT (forbids duplicate results) and HAVING (specifies a condition on grouped values), just like SQL [14, 18]. The following code provides a full example.
PREFIX blog:
PREFIX ex:
SELECT ?name
WHERE {
  ?user ex:lastName "Doe" ;
        ex:firstName ?name ;
        blog:isFollowing ?user2 ;
        blog:owner ?blog .
  ?user2 ex:firstName "Mary" .
}
GROUP BY ?name
HAVING (COUNT(?blog) > 2)
ORDER BY ?name
LIMIT 2
OFFSET 3
SPARQL also provides additional functionality such as OPTIONAL (returns a result even when part of the pattern has no match), FILTER (restricts results to those satisfying certain conditions), UNION (combines the results of alternative graph patterns), ASK (returns a Boolean yes/no result), and the DESCRIBE and CONSTRUCT keywords (return the result as an RDF graph) [18, 19].
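To make the role of shared variables concrete, the following minimal Python sketch evaluates a list of triple patterns over an in-memory list of triples. It is only an illustration of the matching idea, not how a SPARQL engine is implemented; the function names is_var, match_pattern and evaluate are hypothetical, and any term starting with ? is assumed to be a variable.

```python
def is_var(term):
    """Terms that start with '?' are treated as variables (an assumption of this sketch)."""
    return term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`; return None on failure."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new

def evaluate(patterns, triples):
    """Evaluate triple patterns left to right; shared variables act as joins."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings

triples = [
    ("blog:JD", "ex:firstName", "Joe"),
    ("blog:JD", "ex:lastName", "Doe"),
    ("blog:JD", "blog:isFollowing", "blog:MS"),
    ("blog:MS", "ex:firstName", "Mary"),
    ("blog:MS", "ex:lastName", "Smith"),
]

# Users named Doe who follow someone whose first name is Mary.
query = [
    ("?user", "ex:lastName", "Doe"),
    ("?user", "ex:firstName", "?name"),
    ("?user", "blog:isFollowing", "?user2"),
    ("?user2", "ex:firstName", "Mary"),
]

names = [b["?name"] for b in evaluate(query, triples)]
print(names)
```

Because ?user and ?user2 occur in several patterns, a binding survives only if every pattern agrees on their values, which is the join behaviour described above.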
5 Centralized Approach: Direct Relational Mapping
This approach maps linked data onto a relational table and answers queries by applying self-join operations in SQL. RDF triples have a natural tabular structure with three columns, subject, predicate and object, that hold the triple information [20]. Table 6 shows an RDF dataset in triple format; it contains information from different resources, identified by the prefixes listed below with their full URIs.
Prefixes:
mdb=https://data.linkedmdb.org/resource/
geo=https://sws.geonames.org/
bm=https://wifo5-03.informatik.uni-mannheim.de/bookmashup/
lexvo=https://lexvo.org/id/
wp=https://en.wikipedia.org/wiki/
Table 6 RDF dataset [20]
Subject | Property | Object
mdb:film/2014 | rdfs:label | ‘The Shining’
mdb:film/2014 | movie:initial_release_date | ‘1980-05-23’
mdb:film/2014 | movie:director | mdb:director/8476
mdb:film/2014 | movie:actor | mdb:actor/29704
mdb:film/2014 | movie:actor | mdb:actor/30013
mdb:film/2014 | movie:music_contributor | mdb:music_contributor/4110
mdb:film/2014 | foaf:based_near | geo:2635167
mdb:film/2014 | movie:relatedBook | bm:0743424425
mdb:film/2014 | movie:language | lexvo:iso639-3/eng
mdb:director/8476 | movie:director_name | ‘Stanley Kubrick’
mdb:film/2685 | movie:actor | mdb:director/8476
mdb:film/424 | rdfs:label | ‘A Clockwork Orange’
mdb:film/424 | movie:actor | mdb:director/8476
mdb:actor/29704 | rdfs:label | ‘Spartacus’
mdb:film/1267 | movie:actor_name | ‘Jack Nicholson’
mdb:film/1267 | movie:actor | mdb:actor/29704
mdb:film/3418 | rdfs:label | ‘The Passenger’
geo:2635167 | gn:name | ‘United Kingdom’
geo:2635167 | gn:population | 62348447
geo:2635167 | gn:wikipediaArticle | wp:United_Kingdom
bm:books/0743424425 | rev:rating | 4.7
bm:books/0743424425 | scom:hasOffer | bm:offers/0743424425amazonoffer
bm:books/0743424425 | rdfs:label | bm:persons/Stephen+King
lexvo:iso639-3/eng | lvont:label | ‘English’
lexvo:iso639-3/eng | lvont:usedIn | lexvo:iso3166/CA
lexvo:iso639-3/eng | lvont:usesScript | lexvo:script/latn
Given the data in relational Table 6, suppose we want to find the names of the movies that were directed by ‘Stanley Kubrick’ and that have a related book with a rating greater than 4.0. The following SQL query expresses this, where s, p and o denote the subject, property and object columns of the table [21].
SELECT D1.o
FROM D AS D1, D AS D2, D AS D3, D AS D4, D AS D5
WHERE D1.p = 'rdfs:label'
  AND D2.p = 'movie:relatedBook'
  AND D3.p = 'movie:director'
  AND D4.p = 'rev:rating'
  AND D5.p = 'movie:director_name'
  AND D1.s = D2.s AND D1.s = D3.s
  AND D2.o = D4.s AND D3.o = D5.s
  AND D4.o > 4.0
  AND D5.o = 'Stanley Kubrick'
This example shows that the direct relational mapping approach requires a large number of self-joins, so such queries are not easy to optimize.
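The self-join query can be tried on a small triples table with Python's built-in sqlite3 module. The sketch below is illustrative only: it loads a hand-picked subset of triples rather than the full Table 6, it assumes that the related book is stored under the same identifier (bm:books/0743424425) in both triples, and it stores every object as text, so the rating is cast to a number before the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE D (s TEXT, p TEXT, o TEXT)")

# Simplified subset of the movie triples (all objects stored as text).
conn.executemany("INSERT INTO D VALUES (?, ?, ?)", [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("bm:books/0743424425", "rev:rating", "4.7"),
    ("mdb:film/3418", "rdfs:label", "The Passenger"),
])

# One alias of D per triple pattern: five patterns need five copies of the table.
query = """
SELECT D1.o
FROM D AS D1, D AS D2, D AS D3, D AS D4, D AS D5
WHERE D1.p = 'rdfs:label'
  AND D2.p = 'movie:relatedBook'
  AND D3.p = 'movie:director'
  AND D4.p = 'rev:rating'
  AND D5.p = 'movie:director_name'
  AND D1.s = D2.s AND D1.s = D3.s
  AND D2.o = D4.s AND D3.o = D5.s
  AND CAST(D4.o AS REAL) > 4.0
  AND D5.o = 'Stanley Kubrick'
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Every additional triple pattern in the query adds another alias of the same table, which is why the number of self-joins grows with the size of the query.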
6 RDF Primer
Here, we describe the basic functionality for managing RDF data using RDFS together with SPARQL. We also try to convey a basic understanding of RDF graphs through an example movie dataset drawn from different data sources [20, 22]. RDF data is a set of triples (s, p, o), where the subject is a URI, a class or a blank node; the property denotes an attribute associated with a class or entity; and the object is a literal value, a class or a blank node. RDFS (RDF Schema) is an extension of RDF that provides a framework for describing application-specific classes and properties, such as instances of classes and subclasses of classes (rdfs:Class, rdfs:subClassOf). RDFS thus allows the definition of classes and their hierarchies [22, 23]. For example, a class called Movies with two subclasses, DramaMovies and ScienceMovies, is represented in the following way:
Movies rdf:type rdfs:Class .
DramaMovies rdfs:subClassOf Movies .
ScienceMovies rdfs:subClassOf Movies .
We now apply the same example as in the direct relational mapping approach: we want to find the names of the movies that were directed by ‘Stanley Kubrick’ and that have a related book with a rating greater than 4.0, over the dataset given in Table 6. The SPARQL query is written as follows:
SELECT ?name
WHERE {
  ?m rdfs:label ?name .
  ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b .
  ?b rev:rating ?r .
  FILTER(?r > 4.0)
}
Here, the set of triple patterns in the WHERE clause is called a Basic Graph Pattern (BGP). It contains five triple patterns over the variables ?name, ?m, ?d, ?b and ?r, together with one filter on ?r [18, 23]. The query graph for this query is given in Fig. 6. Graph-oriented distributed databases such as Neo4j, Amazon Neptune and GraphDB can represent this kind of RDF graph in a richer, knowledge-graph form. Here, we represent the RDF graph using Amazon Neptune, as shown in Fig. 7. Amazon Neptune is a fast, reliable and fully managed graph database service that makes it easy to build and run applications that work with highly connected or linked datasets. The main purpose of Neptune is to provide a high-performance graph database engine that is optimized for storing large amounts of highly related data and for querying the graph in little time [15]. Neptune supports the W3C’s SPARQL, a popular graph query language, which allows queries to be built that work efficiently with linked or connected data. Amazon Neptune’s graph model is used in cases such as knowledge graphs, network security, recommendation engines, drug discovery and fraud detection [15]. Figure 8 shows how an RDF knowledge graph looks in Amazon Neptune, a graph-oriented distributed database service.
Fig. 6 SPARQL query graph [20]
Fig. 7 RDF graph representation of the data given in Table 6 [20]
Fig. 8 Knowledge RDF graph in Amazon Neptune [15]
7 Hexastore: Multiple Indexing Framework
This approach focuses on rearranging the data, in a database system or in memory, in such a way that queries can be processed more efficiently than with a triples table. To achieve this, the multiple-index approach maintains six indexes, covering all possible access schemes for an RDF query: PSO, SPO, OSP, OPS, SOP and POS. The approach pays equal attention to all elements of RDF, so it does not treat property attributes specially [20]. The Hexastore [20, 24] framework is a multiple-index technique based on main-memory indexing of RDF data. The RDF data is indexed in six ways, one for each possible ordering of the three RDF elements. The representation is based on the different orders of significance of RDF resources and properties and combines vertical partitioning with multiple indexing. Two vectors are associated with each RDF element instance; each vector collects the elements of one of the other two types (e.g. [s, o] and [s, p]). Moreover, lists of the third RDF element are appended to the elements in these vectors. In this way, a sextuple indexing scheme is created in which some lists are shared; for example, the lists of values for O in SPO and PSO are the same. Hexastore uses dictionary encoding of URIs and literals to reduce the storage they need, i.e. every literal and URI is assigned a unique number as an identifier. Hexastore supports fast merge-joins between any pair of triple patterns. However, Hexastore requires more space than column-oriented vertical partitioning for storing statements in a triples table [24]. Hexastore favours query performance over insertion time, passing over applications that require efficient statement insertion. Update operations affect all six indices and need more space, so Hexastore is weaker in terms of storage.
In the experimental evaluation, the Hexastore vector storage scheme provides lower data retrieval times (query response times) than other techniques such as B-tree indexing and column-oriented vertical partitioning.
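The sextuple indexing idea can be sketched with nested Python dictionaries. This is a toy illustration under simplifying assumptions, not the data structure of the Hexastore implementation: it keeps six fully separate indexes (no shared terminal lists), it omits dictionary encoding, and the class name HexastoreSketch and its methods are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

class HexastoreSketch:
    """Toy in-memory store with one nested index per ordering of (s, p, o)."""

    # The six orderings: spo, sop, pso, pos, osp, ops.
    ORDERS = ["".join(order) for order in permutations("spo")]

    def __init__(self):
        # index[order][first][second] -> set of third elements
        self.index = {
            order: defaultdict(lambda: defaultdict(set)) for order in self.ORDERS
        }

    def add(self, s, p, o):
        triple = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:  # every insertion touches all six indexes
            a, b, c = (triple[k] for k in order)
            self.index[order][a][b].add(c)

    def lookup(self, s=None, p=None, o=None):
        """Return matching triples using the index whose leading positions are bound."""
        given = {"s": s, "p": p, "o": o}
        bound = [k for k in "spo" if given[k] is not None]
        free = [k for k in "spo" if given[k] is None]
        order = "".join(bound + free)  # bound positions come first
        level1 = self.index[order]
        results = []
        firsts = [given[order[0]]] if order[0] in bound else list(level1)
        for a in firsts:
            level2 = level1.get(a, {})
            seconds = [given[order[1]]] if order[1] in bound else list(level2)
            for b in seconds:
                for c in level2.get(b, set()):
                    if order[2] in bound and c != given[order[2]]:
                        continue
                    values = dict(zip(order, (a, b, c)))
                    results.append((values["s"], values["p"], values["o"]))
        return results

store = HexastoreSketch()
store.add("blog:JD", "ex:firstName", "Joe")
store.add("blog:JD", "blog:isFollowing", "blog:MS")
store.add("blog:MS", "ex:firstName", "Mary")

by_predicate = sorted(store.lookup(p="ex:firstName"))  # served by the pso index
by_object = store.lookup(o="blog:MS")                  # served by the osp index
print(by_predicate)
print(by_object)
```

Whatever combination of subject, predicate and object is bound, some index starts with exactly those positions, so a lookup never has to scan unrelated triples. The price is visible in add, which writes every triple six times.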
8 Result Comparison
Column-oriented vertical partitioning (COVP) is processed through a pso index, with the vertical partitioning implemented on a column-oriented database. This is called single-index COVP and is denoted COVP1. If one more index, such as spo, is added and both indexes are used together, the enhanced approach is referred to as two-index COVP, or COVP2 (Fig. 9). In the experimental results, we compare the multiple-indexing approach Hexastore (which has six indexes) with COVP1 and COVP2 on the LUBM dataset in terms of query processing performance, measured by response time. The experimental study shows clearly that Hexastore gives fast and scalable query processing, with improvements of up to five orders of magnitude over COVP. Hexastore therefore takes less time for query processing than both COVP1 and COVP2.
Fig. 9 Result comparison
9 Conclusion
Here, we have presented the details of RDF, RDF Schema, RDF graphs, the Web Ontology Language and SPARQL. These technologies are used for managing RDF graphs through different approaches: direct relational mapping (the centralized approach), which needs a large number of self-joins; the RDF primer, which combines RDF with SPARQL; and multiple indexing approaches such as Hexastore, whose compound indexes give faster query processing than the column-oriented vertical partitioning technique. We have also illustrated these ideas with a blog example represented as an RDF graph and with SPARQL queries over different properties of the RDF data. Many other approaches to RDF management are not discussed here, such as streaming RDF processing, RDF data integration, RDF-3X and graph-based processing.
References
1. Alexaki, S., Karvounarakis, G., Christophides, V., Tolle, K., Plexousakis, D.: The ICS-FORTH RDF suite: managing voluminous RDF description bases. In: 2nd International Workshop on the Semantic Web, pp. 1–13 (2001)
2. Allemang, D., Hendler, J.A.: Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL, 2nd edn, p. 10. Morgan Kaufmann, San Francisco (2011)
3. Cure, O., Guillaume, B.: RDF Database Systems Triples Storage and SPARQL Query Processing, 1st edn, pp. 47–69. Elsevier, Waltham, MA, USA (2015)
4. Angles, R., Gutierrez, C.: Querying RDF data from a graph database perspective. In: Proceedings of the Second European Semantic Web Conference (ESWC), pp. 346–360 (2005)
5. Aluç, G., Özsu, T., Hartig, O., Daudjee, K.: Executing queries over schemaless RDF databases. In: Proceedings of the 31st International Conference on Data Engineering, pp. 807–818 (2015)
6. Beckett, D.: The design and implementation of the Redland RDF application framework. In: Proceedings of the 10th International Conference on World Wide Web, pp. 449–456. ACM, NY Press (2001)
7. Bönström, V., Schweppe, H., Hinze, A.: Storing RDF as a graph. In: Proceedings of the First Conference on Latin American Web Congress, p. 27. IEEE Computer Society, Washington, DC (2003)
8. Brickley, D., Guha, R.: RDF vocabulary description language 1.0: RDF schema. W3C Recommendation (2004)
9. Bornea, M.A., Dolby, J., Srinivas, K., Kementsietsidis, A., Bhattacharjee, B., Udrea, O., Dantressangle, P.: Building an efficient RDF store over a relational database. In: SIGMOD Conference, pp. 121–132, New York, NY, USA (2013)
10. Broekstra, J., Harmelen, V., Kampman, A.: Sesame: a generic architecture for storing and querying RDF and RDF schema. In: The International Semantic Web Conference (ISWC), pp. 54–68 (2002)
11. Fernandez, J., Gutierrez, C., Martinez-Prieto, M.: Compact representation of large RDF data sets for publishing and exchange. In: International Semantic Web Conference, vol. 1, pp. 193–208 (2010)
12. Ladwig, G., Tran, T.: Linked data query processing strategies. In: International Semantic Web Conference (ISWC), vol. 1, pp. 453–469 (2010)
13. Arenas, M., Perez, J., Gutierrez, C.: Foundations of RDF databases. In: Reasoning Web, pp. 158–204 (2009)
14. Huang, J., Ren, K., Abadi, D.: Scalable SPARQL querying of large RDF graphs. In: PVLDB 4, pp. 1123–1134 (2011)
15. Mallidi, S., Bebee, B., Choi, D., Gupta, A., Gutmans, A., Khandelwal, A., Kiran, Y., McGaughy, B., Personick, M., Rajan, K., Rondelli, S., Ryazanov, A., Schmidt, M., Sengupta, K., Thompson, B., Vaidya, D., Wang, S.: Amazon Neptune: graph data management in the cloud. In: International Semantic Web Conference, pp. 46–97, WA, USA (2018)
16. Cudré-Mauroux, P., Enchev, I., Groth, P.T., Fundatureanu, S., Haque, A., Harth, A., Keppmann, F.L., Sequeda, J., Miranker, P., Wylot, M.: NoSQL databases for RDF: an empirical evaluation. In: International Semantic Web Conference, vol. 2, pp. 310–325 (2013)
17. Hartig, O.: SPARQL for a web of linked data: semantics and computability. In: Proceedings of the 9th Extended Semantic Web Conference, pp. 8–23 (2012)
18. Peng, P., Özsu, T., Zou, L., Zhao, D., Chen, L.: Processing SPARQL queries over distributed RDF graphs. VLDB J. 243–268 (2016)
19. Lefrançois, M., Zimmermann, A., Bakerally, N.: A SPARQL extension for generating RDF from heterogeneous formats, pp. 82–98. Springer, Berlin (2017)
20. Ozsu, T.: A survey of RDF data management systems. Front. Comput. Sci. 418–432 (2016)
21. Atre, M., Chaoji, V., Hendler, J., Zaki, M.: Matrix “bit” loaded: a scalable lightweight join query processor for RDF data. In: Proceedings of the 19th International Conference on World Wide Web, pp. 41–50. ACM Press, New York, USA (2010)
22. Lei, Z., Özsu, T., Chen, L., Shen, X., Huang, R., Zhao, D.: gStore: a graph-based SPARQL query engine. VLDB J. 565–590 (2014)
23. Papailiou, N., Tsoumakos, D., Karras, P., Konstantinou, I., Koziris, N.: H2RDF+: an efficient data management system for big RDF graphs. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 909–912 (2014)
24. Weiss, C., Bernstein, A., Karras, P.: Hexastore: sextuple indexing for semantic web data management. In: Proceedings of the VLDB Endowment, pp. 1008–1019 (2008)
25. Castillo, R.: RDF mata view: indexing RDF data for SPARQL queries. In: 9th International Semantic Web Conference (2010)
26. Acharjya, D.P., Kauser Ahmed, P.: A survey on big data analytics: challenges, open research issues and tools. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 7(2) (2016)
27. Fletcher, H., Beck, W.: Scalable indexing of RDF graphs for efficient join processing. In: Proceeding of the 18th ACM Conference on Information and Knowledge Management, pp. 1513–1516. ACM Press, New York (2009)
28. Harris, S., Shadbolt, N., Lamb, N.: 4Store: the design and implementation of a clustered RDF store. In: Proceedings of the 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (IWSSWBS), pp. 16–25 (2009)
29. Ladwig, G., Harth, A.: Cumulus RDF: linked data management on nested key-value stores. In: Proceedings of the 7th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWBS) at the 10th International Semantic Web Conference (ISWC), pp. 30–42. Springer, Berlin (2011)
30. Zou, L., Ozsu, M.: Distancejoin: pattern match query in a large graph database. In: PVLDB, pp. 886–897 (2009)
31. Myung, J., Lee, G., Yeon, J.: SPARQL basic graph pattern processing with iterative MapReduce. In: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, pp. 6–16. ACM Press, New York (2010)
32. Cattell, R.: Scalable SQL and NoSQL data stores. In: SIGMOD Rec., pp. 134–150, New York, NY, USA (2011)
33. Tsatsanifos, G., Sellis, T., Sacharidis, D.: On enhancing scalability for distributed RDF/S stores. In: Proceedings of the 14th International Conference on Extending Database Technology, pp. 141–152. ACM Press, New York (2011)
34. AWS Product: Amazon Neptune: Graph Oriented Distributed Database. https://aws.amazon.com/neptune/
URL Scanner Using Optical Character Recognition Shivam Siddharth, Shubham Chaudhary, and Jasraj Meena
Abstract This paper aims to develop an efficient method for extracting URLs from images or PDFs by using optical character recognition (OCR). OCR is the process of recognizing text in images and converting it to machine-encoded text. The conversion is mostly applied to documents, photos or subtitle text superimposed on a picture, and the resulting editable text can be used for further modification. OCR is achieved mainly by using digital image processing and neural networks. We used a convolutional neural network, trained on the Infty dataset, for our OCR model. We then used regular expressions to find whether the recognized text contains any URLs. The resulting URL scanner can scrape all the URLs from the source so that they can be accessed. This could increase productivity, and the extracted URLs can be checked for encrypted links, spam links, image links or broken links.
Keywords URL scanner · Optical character recognition · Image processing · Convolutional neural network
S. Siddharth (B) · S. Chaudhary · J. Meena
Department of Information Technology, Delhi Technological University, Shahbad Daulatpur, Main Bawana Road, Delhi 110042, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_25
1 Introduction
Optical character recognition (OCR) was first developed by Emanuel Goldberg in the early twentieth century, around the time of the First World War, by converting the characters in an image to telegraph code. Many people dreamed of a machine that could read letters and numerals. OCR was first developed to help blind and illiterate people by making the computer read text out loud. Many changes have since been made to improve OCR, to reproduce characters accurately in different fonts and to use it in different forms. Optical character recognition is the process of extracting machine-encoded text from handwritten text, PDF files and other scanned images. OCR is one of the most promising research fields, with a variety of applications. Various algorithms and techniques, such as deep learning and neural networks, can help implement OCR with remarkable accuracy [1]. In this paper, we implemented our training model using a convolutional neural network (CNN) [2], because it handles images more efficiently than other neural networks. A CNN takes the image as a three-dimensional matrix instead of a 1D array and performs different types of operations such as padding, convolution, flattening and dense (fully connected) layers. A CNN is a deep learning algorithm that assigns weights and biases to different aspects of the objects in an image in order to map input images to their corresponding outputs. CNNs are well suited to image datasets because they reuse weights and biases and so reduce the number of parameters involved. CNNs are useful in many areas, such as natural language processing, optical character recognition, image analysis and recognition, and recommendation systems. Optical character recognition has many applications, such as automated number plate systems [3], text-to-speech conversion for visually impaired persons, processing of bank cheques and URL scanning. The purpose of our URL scanner is to scrape, or extract, all the URLs present in an image. We used regular expressions for pattern matching. A URL scanner can increase productivity and save time. It can also be used to identify URLs such as spam links, broken links and encrypted links. This paper is divided into the following sections. Section 2 contains the method that we used to implement OCR using a convolutional neural network. Section 3 contains the solution to our problem, that is, the URL scanner. Section 4 contains the results of our research, and in the final section we conclude with our results and the limitations we faced.
2 Our Methodology
2.1 Design Model
The proposed OCR system applies only to English text characters. Our OCR model is designed and implemented with the help of various modules, which are shown in Fig. 1. The steps involved in our OCR model are:
• The user loads or uploads the image to the model.
• The image is processed to remove noise and to make segmentation easier.
• The image is segmented, and the characters are recognized using our CNN model.
• The output is generated and displayed on the screen.
Fig. 1 Sequence diagram shows interaction between modules
2.2 Image Processing and Segmentation
Image processing [4] is the most important step performed before training our model, because it improves our results and accuracy. It also allows the dimensions of the images to be changed, using libraries such as NumPy and Pandas, so that images of uneven size can be processed together. The steps involved in image preprocessing are:
1. Image Acquisition The first step in image processing is acquiring the image in digitized form from a source such as a camera or the Internet. We can convert the image to the required format and dimensions by using libraries such as NumPy and SciPy. We acquired the image shown in Fig. 2.
Fig. 2 Sample input image
2. Image Enhancement Image enhancement is a simple and effective way to improve certain features of an image, for example by adjusting contrast and brightness, sharpening edges, deblurring and removing noise, to make further image processing easier. It helps in improving the overall accuracy and results of a model.
3. Image Thresholding Thresholding converts a grayscale image into a binary image. If a pixel value is greater than a threshold value, it is assigned one value (for example white); otherwise it is assigned another value (for example black). Thresholding is often categorized into global thresholding and local thresholding.
• Convert each image pixel to one of two values, either black or white.
• Use the threshold() function.
• Invert the image colors.
4. Image Segmentation Segmentation changes the representation of a digital image to make it simpler to analyze. It characterizes and labels every pixel so that pixels with the same characteristics form the same cluster.
• Tilt Detection Tilt detection is used when the image is slightly rotated. It detects the tilt and corrects it.
– Detect text pixels.
– Create the minimum-area region enclosing the pixels.
– Calculate the tilt angle.
– Perform the necessary rotation to straighten the image.
• Line Detection In this stage, we segment the image by finding the boundaries between lines of text, with the help of image thresholding and the gaps between consecutive lines. The steps for line detection are:
– Calculate the horizontal projections of the image.
– Based on a threshold value, separate each line from the image.
• Word Detection Each line is then segmented into words, again with the help of thresholding. The steps for word detection are:
– Calculate the vertical projections of the line image.
– Based on a threshold value, separate each word from the line image, as shown in Fig. 3.
• Character Detection Character detection [5] segments each character by scanning the letters of a word from left to right. The order of the characters must be preserved at all times so that they can be recognized as a word. First, we find contours; a contour is an outline bounding the shape of an object. We then create a separate image for each character by drawing a bounding box from its contour and cropping the character using the box coordinates. After that, we resize each character image to 32 × 32 pixels with added padding. Finally, we obtain the extracted and resized characters shown in Fig. 4.
5. Object Recognition Finally, we pass the extracted characters [6] through a neural network and map each character to the English alphabet to identify the letters and form words.
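The thresholding and projection steps above can be sketched in plain Python on a tiny image. This is a minimal illustration, not the code used in the paper: it assumes a grayscale image given as a list of rows with values 0 to 255, a fixed global threshold of 128, and hypothetical helper names (binarize, horizontal_projection, vertical_projection, runs).

```python
def binarize(gray, threshold=128):
    """Global thresholding with inversion: dark (ink) pixels become 1, background 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def horizontal_projection(binary):
    """Number of ink pixels in each row."""
    return [sum(row) for row in binary]

def vertical_projection(binary):
    """Number of ink pixels in each column."""
    return [sum(col) for col in zip(*binary)]

def runs(profile, min_ink=1):
    """Return (start, end) pairs of consecutive positions with at least min_ink pixels."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i
        elif value < min_ink and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

W = 255  # white background
gray = [
    [W, W, W, W, W, W, W, W],
    [W, 0, 0, W, W, 0, W, W],   # first text line, two blobs
    [W, 0, 0, W, W, 0, W, W],
    [W, W, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, W],
    [W, W, 0, 0, 0, W, W, W],   # second text line, one blob
    [W, W, W, W, W, W, W, W],
]

binary = binarize(gray)
lines = runs(horizontal_projection(binary))           # row ranges of text lines
first_line = binary[lines[0][0]:lines[0][1]]
blobs = runs(vertical_projection(first_line))         # column ranges inside line 1
print(lines, blobs)
```

Rows with no ink separate text lines, and columns with no ink separate words or characters within a line. In a real image, the width of the empty gap would be compared with a threshold to decide whether it separates two characters or two words.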
Fig. 3 One line detected from input image and partitions for each word
Fig. 4 a One word detected, b individual characters
2.3 Neural Network
We used a convolutional neural network (ConvNet) because it requires much less image preprocessing than other neural networks. ConvNets [7] are inspired by the organization of the visual cortex. A ConvNet uses filters that scan only part of the image at a time and learn its characteristics, instead of viewing the whole picture at once, which is similar to what humans do. The input shape for the first layer of the ConvNet is (number of samples × 1 × 32 × 32). We used only grayscale images, which is why each image has dimensions (1 × 32 × 32). The first layer is a convolutional layer with 30 filters of size 5 × 5 and a ReLU activation function. The second layer is a max-pooling layer with pool size 2 × 2, which halves the spatial dimensions. Next comes another convolutional layer with 15 kernels of size 3 × 3 and a ReLU activation function, followed by another max-pooling layer with pool size 2 × 2. For regularization, we add a dropout layer with probability 0.2. We then flatten the intermediate feature maps, which results in a 540 × 1 array. This is followed by two fully connected (1D) layers of 128 and 50 neurons with ReLU activation functions. Finally, we have a fully connected layer of 79 neurons (one for each of our 79 classes) with a softmax activation function. The softmax layer outputs a probability for each class, and the class whose neuron has the highest output is taken as the class of the input image. The summary of our model is shown in Fig. 5.
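The layer sizes described above can be checked with a short shape calculation. The sketch below is a worked arithmetic example rather than the trained model: it assumes unpadded ('valid') convolutions with stride 1, non-overlapping pooling and one bias per filter or neuron, none of which is stated explicitly in the text, and the function names trace and softmax are hypothetical.

```python
import math

# Layer list taken from the description; the tuple format is specific to this sketch.
SPEC = [
    ("conv", 5, 30), ("pool", 2),
    ("conv", 3, 15), ("pool", 2),
    ("dropout", 0.2), ("flatten",),
    ("dense", 128), ("dense", 50), ("dense", 79),
]

def trace(spec, shape=(1, 32, 32)):
    """Return (layer, output shape, parameter count) rows and the total parameter count."""
    channels, height, width = shape
    units, total, rows = None, 0, []
    for layer in spec:
        kind, params = layer[0], 0
        if kind == "conv":
            _, k, filters = layer
            params = (k * k * channels + 1) * filters      # weights + one bias per filter
            channels, height, width = filters, height - k + 1, width - k + 1
        elif kind == "pool":
            height, width = height // layer[1], width // layer[1]
        elif kind == "flatten":
            units = channels * height * width
        elif kind == "dense":
            params = (units + 1) * layer[1]                # weights + one bias per neuron
            units = layer[1]
        out = (units,) if units is not None else (channels, height, width)
        rows.append((kind, out, params))
        total += params
    return rows, total

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

rows, total = trace(SPEC)
for kind, out, params in rows:
    print(f"{kind:8s} {str(out):14s} {params}")
print("total parameters:", total)

probs = softmax([1.0, 3.0, 0.5])                           # toy scores for three classes
predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
```

Under these assumptions the feature maps shrink from 32 × 32 to 28 × 28, 14 × 14, 12 × 12 and finally 6 × 6 with 15 channels, which gives the 540 flattened values mentioned in the text. The softmax helper shows why the predicted class is the one with the largest output rather than a neuron that outputs exactly 1.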
2.4 Application Optical character recognition can be combined with other technologies to make otherwise tedious tasks easy. One application of OCR that we implemented is a URL scanner. While reading books, magazines, newspapers, and posters, readers often come across printed links to, for example, an online form, a payment gateway, a product page, or the registration page of a Web site. With the help of OCR, we can extract all of the URLs present in an image or any sort of document. By scanning a URL, we can open that
S. Siddharth et al.
Fig. 5 Summary of our neural network model
link without typing. We can also check for spam links [8] if present. We implemented the URL scanner with the help of regular expressions. A regular expression (regex) is a pattern that describes a set of strings and is used to find matching text. Our model, built with the Python library Keras [9], extracts the text from the image, and a regex then identifies the URLs in it, so that we obtain all the URLs from the input image. We stored the result for our image in an output.txt file. We used a regex to check whether each word is a URL; if it is, the word is written to output.txt. For pattern matching, we used Python's regular expression module, re, which finds the strings that match a given pattern. In Python, a regular expression search is typically written as match = re.search(pattern, string). This function takes a pattern and a string as input and scans the string for the pattern. If the pattern is found, the function returns a match object; otherwise it returns None.
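As an illustration of this step, the following sketch extracts URLs from recognized text with Python's `re` module. The pattern is a deliberately simple assumption (anything starting with `http://`, `https://`, or `www.`), not the expression used by the authors, and the sample sentence is made up.

```python
import re

# Assumed, simplified URL pattern: a scheme or "www." followed by non-space characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def extract_urls(text):
    """Return every URL-like token, with trailing punctuation removed."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        urls.append(match.group().rstrip(".,;:!?)"))
    return urls

recognised = "Register at https://example.com/form, or visit www.example.org today."
urls = extract_urls(recognised)

# search() returns a match object when the pattern occurs and None otherwise.
no_match = URL_PATTERN.search("no links here")
```

Stripping trailing punctuation matters for OCR output, because a comma or full stop printed after a link is otherwise captured as part of it.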
Fig. 6 Final output for sample input image
2.5 Experimental Results and Discussion After implementing the above-mentioned algorithm in a Jupyter notebook, we tested it on various images. Both phases were tested individually as well as in an integrated system. For image processing, we used the DDI-100 dataset [10]. From this dataset, we used all the images having a paragraph of English alphabetic words along with the special symbols that occur in URLs, and obtained 92.371% accuracy. That is, among all the characters present in all 100 images, our method correctly separated 92.371% of them from the remaining characters; in the other 7.629% of cases it either failed to isolate a character or split a single character into two. For the character recognition phase, we used the Infty dataset [11]. It is a well-known dataset that contains punctuation marks, special characters, and most mathematical characters. We did not use all of its classes: Greek letters and logical operators, for example, were removed because they are not relevant to our use case. We trained our neural network on this dataset and achieved 98.593% accuracy. The dataset has some drawbacks: all the characters are in the same font, and the classes are imbalanced, with some classes having very few data points and others very many. We took this into account during training and reduced the imbalance by adding data points to and removing data points from the affected classes. For integration testing, we used the same DDI-100 dataset and achieved 90.483% accuracy in predicting the written characters and 98.597% accuracy in predicting the URLs embedded in the articles. The final output is shown in Fig. 6.
3 Conclusion In this paper, we implemented a URL scanner with the help of optical character recognition. The text we chose was in English, and the approach could be extended to other languages. The results were satisfactory, and better results could be obtained in the future. These results were obtained using a convolutional neural network built with Keras. Used correctly, a URL scanner can make tedious tasks, such as typing printed links by hand, much easier.
4 Limitations and Future Work Some challenges that we faced during our research are:
• The text lines in the input image must be clearly separated so that they can be detected by the horizontal projection method; that is, the image must not be skewed.
• Some letters and symbols, such as i and j, and . and , caused segmentation problems.
• Computing time and accuracy could be improved further.
References
1. Sabu, A.M., Das, A.S.: A survey on various optical character recognition techniques. In: 2018 Conference on Emerging Devices and Smart Systems (ICEDSS), pp. 152–155. IEEE (2018)
2. Cha, J., Lee, J.H.: Extracting topic related keywords by backtracking CNN based text classifier. In: 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems (ISIS), pp. 93–96. IEEE (2018)
3. Qadri, M.T., Asif, M.: Automatic number plate recognition system for vehicle identification using optical character recognition. In: 2009 International Conference on Education Technology and Computer, pp. 335–338. IEEE (2009)
4. Sharma, P., Sharma, S.: Image processing based degraded camera captured document enhancement for improved OCR accuracy. In: 2016 6th International Conference - Cloud System and Big Data Engineering (Confluence), pp. 425–428. IEEE (2016)
5. Wei, T.C., Sheikh, U.U., Ab Rahman, A.A.H.: Improved optical character recognition with deep neural network. In: 2018 IEEE 14th International Colloquium on Signal Processing and Its Applications (CSPA), pp. 245–249. IEEE (2018)
6. Afroge, S., Ahmed, B., Mahmud, F.: Optical character recognition using back propagation neural network. In: 2016 2nd International Conference on Electrical, Computer and Telecommunication Engineering (ICECTE), pp. 1–4. IEEE (2016)
7. Almakky, I., Palade, V., Ruiz-Garcia, A.: Deep convolutional neural networks for text localisation in figures from biomedical literature. In: 2019 International Joint Conference on Neural Networks (IJCNN). IEEE (2019). https://doi.org/10.1109/IJCNN.2019.8852353
8. Olalere, M., Abdullah, M.T., Mahmod, R., Abdullah, A.: Identification and evaluation of discriminative lexical features of malware URL for real-time classification. In: 2016 International Conference on Computer and Communication Engineering (ICCCE), pp. 90–95. IEEE (2016)
9. Arora, S., Bhatia, M.P.S.: Handwriting recognition using Deep Learning in Keras. In: 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), pp. 142–145. IEEE (2018)
10. Zharikov, I., Nikitin, F., Vasiliev, I., Dokholyan, V.: DDI-100: Dataset for Text Detection and Recognition. arXiv preprint arXiv:1912.11658 (2019)
11. Suzuki, M., Uchida, S., Nomura, A.: A ground-truthed mathematical character and symbol image database. In: Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pp. 675–679. IEEE (2005)
A Survey on Sentiment Analysis Deb Prakash Chatterjee, Anirban Mukherjee, Sabyasachi Mukhopadhyay, Mrityunjoy Panday, Prasanta K. Panigrahi, and Saptarsi Goswami
Abstract Natural language processing (NLP) is a booming field in this era of data, in which almost all businesses and organizations have access to review sites, social media, and e-commerce websites. Recently, deep learning models have shown state-of-the-art results in NLP tasks. Models such as long short-term memory (LSTM) have mitigated problems like the vanishing gradient, and newer techniques such as attention and aspect embedding have increased accuracy. These developments have changed sentiment analysis drastically and made it more business-oriented: large companies such as Amazon and Flipkart use it to analyze their customer reviews. Some researchers have shown that comparable or even better results can be obtained without complex models like LSTM, by adding a gating mechanism to the well-known CNN. In light of this, we briefly review techniques developed by researchers across the world and focus on some of the state-of-the-art work in the domain of sentiment analysis.
D. P. Chatterjee Techno India University, Kolkata, West Bengal 700091, India A. Mukherjee University of Engineering & Management Kolkata, Kolkata, West Bengal 700160, India S. Mukhopadhyay School of Management Science, Maulana Abul Kalam Azad University of Technology, Haringhata, West Bengal 741249, India e-mail: [email protected] M. Panday (B) University of Calcutta, Technology Campus, Kolkata, West Bengal 700106, India e-mail: [email protected] P. K. Panigrahi Indian Institute of Science Education and Research Kolkata, Mohanpur, West Bengal 741246, India S. Goswami Bangabasi Morning College, Kolkata, West Bengal 700009, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_26
D. P. Chatterjee et al.
Keywords Sentiment analysis · Deep learning · Machine learning
1 Introduction Sentiment analysis is becoming a common topic in machine learning and deep learning, where we consider not just the author's sentiment but also opinions, suggestions, appraisals, and emotions toward particular products or services. Sentiment analysis (SA) is concerned with the sentiment of the author, e.g., good or bad. Opinion mining (OM) extracts the author's opinion about a product or service [1]. Text analysis focuses on structured or unstructured text and deciphers it to extract meaning. SA and OM are usually treated as almost the same field, which relies on natural language processing (NLP) techniques to solve its challenges [2]. Text classification (TC) and sentiment analysis are closely related: in SA we classify the emotion behind the text, and in TC we classify the category of the text. Owing to the recent growth of Internet services such as social media websites like Twitter, blogs, microblogs, review sites, and e-commerce sites, a huge amount of review data is available from these websites in digital form [1]. The NLP domain has spread from computer science to finance, marketing, politics, and even medical science [4]. The choices and decisions we make in everyday life are influenced to some extent by the decisions of others. However, manually extracting meaning or opinion from such huge datasets is not always feasible. For this reason, many academic institutions, multinational companies, and start-ups are working to fully automate the process of sentiment segmentation rather than depend on human effort. Since the early 2000s, scientists, researchers, and students have been finding new ways to achieve better results in sentiment analysis, mostly in two ways: using supervised models and unsupervised models.
In the supervised approach, we can use algorithms like random forest, support vector machine, and naïve Bayes; unsupervised approaches include lexicon-based methods, grammatical analysis, and clustering and labeling on TF-IDF vectors. Owing to the recent growth of deep learning techniques [5], we are getting more accurate results in almost all areas of machine learning, from computer vision to NLP. Because new papers report so many new developments in this field, meticulously designed review or survey papers [3, 4] are needed to keep track of them, so that newcomers and professionals alike can find the relevant information on what has been done and what remains to be done in a single place. Cambria and Schuller [2] mostly discuss the evaluation of SA, opinion mining, and recent trends. Following this trend, recent research on natural language processing problems mostly focuses on deep learning and tries to make the best use of it. According to global statistics from January 2017, out of a total population of 7.476 billion people, 4.917 billion (66%) were mobile phone users and 3.773 billion (50%) were active Internet users.
Also, among the active mobile users, 2.549 billion were active mobile social media users, and among the active Internet users, 2.789 billion were active social media users. In this paper, we look at the applications of SA, its levels, and microblogging; we survey various machine learning, and specifically deep learning, algorithms for classifying sentiment; and, last but not least, we list a few datasets that researchers can use. We hope this information will be sufficient for a researcher to start in, or switch from another domain to, the SA domain.
2 Applications of Sentiment Analysis Sentiment analysis is a well-established field whose uses range from business intelligence and politics to medicine and government decision support. If a business wants to evaluate a product from its reviews, SA can be used to assess the product's performance or durability, issues such as delivery, and any deficiencies the product may have. By applying SA to this data, the company can address the issue and reduce its financial risk. SA also helps the company to understand its position relative to its competitors in the market [6, 7]. SA can also tell an advertising agency how successful its advertisements are in terms of satisfaction and dissatisfaction [8]. In politics, parties need to keep track of voters' opinions. SA can be used to gather people's opinions from social media, gauge their reaction to an issue, and perhaps predict what a party needs to do to win votes. In addition, the results can be presented concisely using methods like sentence summarization [7–9]. The public and the government are equally important for maintaining peace and balance in a state or country. SA can help to determine whether people regard a certain policy or judgment as good or bad, so that a proper decision can be taken on whether it should be made permanent or revoked [7, 8, 10].
3 Sentiment Analysis and Its Levels There are two fundamental levels of sentiment analysis: (1) document level and (2) sentence level [11, 43]. Two approaches are possible for document-level SA: supervised and unsupervised [12]. In the supervised approach, we have a finite set of classes, either binary (positive/negative) or multiclass (ratings). We also need labeled training data to train the sentiment analysis model with general machine learning algorithms like SVM [25], KNN, logistic regression, or naïve Bayes, or with deep learning algorithms like RNN [29], LSTM, or GRU. Important text representation and feature extraction techniques include bag of words, n-grams, TF-IDF, parts of speech, and word embeddings. There are mainly two types of techniques to convert preprocessed
text into vectors to feed into our model, namely sparse and dense representations [16]. The simplest example of a sparse representation is one-hot encoding [16], one of the most widely used vector representations; n-gram counts and TF-IDF vectors are also sparse. The main disadvantages of the one-hot vector are its large size (one dimension per vocabulary word) and the fact that it carries no information about the meaning of the words. In a dense embedding, each word is mapped to a short real-valued vector in which words used in similar contexts lie close together, which greatly reduces the size of the representation. One of the most popular dense word vectors is word2vec. For the unsupervised approach, the main strategy is to capture the semantic orientation (SO) [15] of words [13] and classify the document accordingly [14]. Another unsupervised approach uses the internal representation of a character-level language model, built with a multiplicative LSTM, as a sentiment quantifier [69]. In Ref. [66], the efficacy of oversampling for multiclass sentiment analysis on the Amazon review dataset was highlighted. In sentence-level classification [11], a sentence is classified as either subjective, where it expresses an opinion, or objective, where it states a fact. There is another type of SA, discussed in Feldman [11]: aspect-level SA [17]. Aspect-level SA [65] considers the individual aspects mentioned in a sentence, since a single product or service may have several different aspects, each with its own sentiment.
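To make the sparse representations concrete, the sketch below builds a vocabulary from two toy reviews and encodes a word as a one-hot vector and a document as a bag-of-words count vector. The documents and function names are hypothetical; the point is only that both vectors have one dimension per vocabulary word and are mostly zeros.

```python
def build_vocab(docs):
    """Assign each distinct lower-cased token the next free index."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Vector of vocabulary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    vec = [0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1
    return vec

docs = ["good phone good battery", "bad battery"]   # toy reviews
vocab = build_vocab(docs)
battery_vec = one_hot("battery", vocab)
first_doc_vec = bag_of_words(docs[0], vocab)
```

With a realistic vocabulary of tens of thousands of words, each of these vectors would be equally long, which is the size problem that dense embeddings address.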
4 Sentiment Analysis with Microblogging Microblogging is a form of blogging based on short messages, through which online audiences express their feelings. Here social media and blogging meet in a way that makes sharing thoughts or relevant information [18] much easier. One of the most popular aspects of microblogging is its real-time, chat-like interaction [19, 20]. Twitter is one of the most popular microblogging sites and also one of the oldest social media platforms. Twitter imposes the restriction that a tweet must express its content within 140 UTF-8 characters [21]. On websites like Facebook, Instagram, and Tumblr, users can share text as well as media such as images, GIFs, videos, and hyperlinks [18, 36]. In [19], the researchers performed sentiment analysis on text and image data using a bag of text words and a bag of image words. They applied naïve Bayes, SVM, and logistic regression and found that logistic regression clearly outperforms the other two. In [20], microblog sentiment analysis was used as a tool to predict the future direction of stock prices, and [21] analyzes customer insights using sentiment analysis. The authors of [21] also divide microblog sentiment analysis into two phases, a preparation phase and an analysis phase: first the microblog texts are collected and prepared, and then the model outputs the sentiment polarity of each feature. Paper [22] describes the growth of microblogging, especially Twitter, and discusses the geographical distribution of Twitter users across major cities of the world as well as current trends.
5 Literature Review Sentiment analysis works with numeric vectors rather than raw text, so the main question is how to obtain those vectors from the text data.
5.1 Word Embedding A word embedding [24] is a vector representation of a word and a typical feature learning technique. We use embeddings to avoid representing many unique words with very large one-hot vectors, which greatly improves the efficiency of the network. An embedding layer behaves like a fully connected layer applied to a one-hot input, but it is implemented as a table lookup.
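The relation between an embedding layer and a fully connected layer can be illustrated with a tiny example: looking up row i of the embedding matrix gives the same vector as multiplying a one-hot vector for word i by that matrix. The matrix values below are arbitrary placeholders, not trained embeddings.

```python
# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
embedding = [
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.5, 0.9],
]

def lookup(index):
    """Embedding layer as a table lookup: return row `index`."""
    return embedding[index]

def one_hot_times_matrix(index):
    """Same result via a dense layer: one-hot vector times the weight matrix."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(embedding))]
    dim = len(embedding[0])
    return [
        sum(one_hot[i] * embedding[i][j] for i in range(len(embedding)))
        for j in range(dim)
    ]

by_lookup = lookup(2)
by_matmul = one_hot_times_matrix(2)
```

The lookup avoids ever materialising the one-hot vector, which is where the efficiency gain comes from.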
5.2 Support Vector Machine (SVM) The support vector machine [8] is a common supervised ML model that chooses the hyperplane with the maximum margin to separate the classes. It is often used together with a kernel function such as the radial basis function.
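As a small worked example of the two ingredients named above, the sketch below evaluates a radial basis function kernel and a linear decision rule. The weights, bias, and `gamma` are made-up values; training an SVM (finding the maximum-margin hyperplane) is not shown.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def linear_decision(x, weights, bias):
    """Classify by the side of the hyperplane w.x + b = 0 on which x lies."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0])      # identical points
far_apart = rbf_kernel([0.0, 0.0], [3.0, 4.0])       # squared distance 25
label_a = linear_decision([2.0, 1.0], [1.0, -1.0], -0.5)
label_b = linear_decision([0.0, 1.0], [1.0, -1.0], -0.5)
```

The kernel value is 1 for identical points and decays toward 0 as the points move apart, which is what lets a kernel SVM draw non-linear boundaries.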
5.3 Random Forest Random forest [68] is an ensemble method of supervised machine learning that relies on the decision of the majority rather than that of a single model. It is a more advanced version of the decision tree and addresses the tendency of a single decision tree to overfit. First, a bootstrap dataset is created by randomly sampling the actual data with replacement. A tree is then built on it: at each node, a random subset of the features is considered, and the best feature among them is used to split the node, like the branches of an actual tree. This randomness decorrelates the trees, which makes the combined result more robust. A forest is obtained by creating and combining many such trees, and the final prediction is the majority vote of the trees.
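The three sources of randomness and aggregation described above (bootstrap sampling, random feature subsets, and majority voting) can be sketched as follows. Tree construction itself is omitted; the three "trees" are hypothetical one-feature rules that stand in for trained trees.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the original data."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at a split."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Return the label predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                       # fixed seed for repeatability
rows = [(1.0, "pos"), (-1.0, "neg"), (0.5, "pos"), (-0.3, "neg")]
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=5, k=2, rng=rng)

# Stand-ins for trained trees: each looks at a single feature (hypothetical).
trees = [
    lambda x: "pos" if x[0] > 0 else "neg",
    lambda x: "pos" if x[1] > 0 else "neg",
    lambda x: "pos" if x[2] > 0 else "neg",
]
votes = [tree([1.0, -2.0, 0.5]) for tree in trees]
forest_prediction = majority_vote(votes)
```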
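5.4 XGBoost XGBoost [9] (extreme gradient boosting) is another ensemble method of machine learning based on decision trees. It uses boosting: trees are built sequentially, and each new tree is fitted to correct the errors of the trees built so far. It can also subsample rows and features for each tree, in the spirit of bagging.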
5.5 Neural Networks (NN) The neural network [10] is the building block of deep learning. Artificial neural networks (ANNs) are networks of neurons, inspired by the brain, that produce more accurate results in many machine learning applications. An ANN consists mainly of three types of layers: the input layer, hidden layers, and the output layer. The connections between the layers have weights and the neurons have biases; these change during training and thereby determine the output of the model.
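A forward pass through such a network, with one hidden layer, can be written out directly. The weights and biases below are arbitrary illustrative values chosen so that the arithmetic is easy to follow; in practice they would be learned during training.

```python
import math

def dense(x, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 2.0]                                   # input layer (2 features)
W1, b1 = [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0]  # hidden layer (2 neurons)
W2, b2 = [[1.0, 0.5]], [-1.0]                    # output layer (1 neuron)

hidden = relu(dense(x, W1, b1))
output = sigmoid(dense(hidden, W2, b2)[0])
```

Here the first hidden neuron receives -1.5 and is switched off by ReLU, the second outputs 2.0, and the output neuron sees 0.5 × 2.0 − 1.0 = 0, which the sigmoid maps to 0.5.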
5.6 Convolutional Neural Network (CNN) Convolutional neural networks [11] are a type of neural network inspired by the visual cortex, used in computer vision and for sequential data, as in NLP or speech recognition problems. A CNN includes convolution and pooling layers, which capture spatial information from the data to extract important features, and finally uses fully connected layers to make decisions based on the extracted features. Unlike in a fully connected network, each neuron is connected only to a local region of the previous layer.
5.7 Recurrent Neural Network (RNN) A recurrent neural network [12] is a type of neural network in which the output of the previous time step is fed into the current time step; this recurrence gives the network its name. Unlike a simple feedforward network, an RNN has a memory that retains information about previous time steps. This memory is the hidden state, i.e., the hidden layer of the network evaluated at the current time step. It helps to retain the information we need, but unfortunately it cannot retain long-term information, and training suffers from the vanishing gradient problem. LSTM was introduced as a solution to both problems.
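The recurrence and its short memory can be seen in a scalar toy RNN. The sketch assumes a single hidden unit with a tanh activation and made-up weights; feeding one non-zero input followed by zeros shows how quickly the hidden state forgets it.

```python
import math

def rnn_forward(inputs, w_x=0.8, w_h=0.5, bias=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_(t-1) + bias)."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + bias)  # previous state feeds the current one
        states.append(h)
    return states

# One informative input followed by silence: the trace of it fades step by step.
states = rnn_forward([1.0, 0.0, 0.0])
silent = rnn_forward([0.0, 0.0, 0.0])
```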
5.8 Long Short-Term Memory (LSTM) Because the vanishing and exploding gradient problems make training very slow, Hochreiter and Schmidhuber introduced LSTM [13] in 1997 to reduce these problems and to cope with long-term dependencies and very long sequences. An LSTM uses a memory cell to remember what it needs to remember, and it decides what is kept in memory with three gates: the input, forget, and output gates. First, the forget gate decides which information
needs to be forgotten and what should be carried over from the previous cell state to the current cell state. Next, the input gate decides what new information is to be stored. Lastly, the output gate decides the output, which is then passed on to the next time step.
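The interplay of the three gates can be sketched with a single scalar LSTM cell. This follows the standard LSTM update equations; the parameter dictionary and its values are hypothetical and chosen only to show the forget gate holding or erasing the cell state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell with parameters in dict p."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    c = f * c_prev + i * g        # keep part of the old memory, add new content
    h = o * math.tanh(c)          # expose part of the memory as output
    return h, c

names = ["wf", "uf", "bf", "wi", "ui", "bi", "wg", "ug", "bg", "wo", "uo", "bo"]
keep = dict.fromkeys(names, 0.0)
keep.update(bf=10.0, bi=-10.0)    # forget gate open, input gate closed
erase = dict(keep, bf=-10.0)      # forget gate closed

_, c_kept = lstm_step(x=0.0, h_prev=0.0, c_prev=1.0, p=keep)
_, c_erased = lstm_step(x=0.0, h_prev=0.0, c_prev=1.0, p=erase)
```

With the forget gate saturated near 1 the cell state passes through almost unchanged, which is how an LSTM carries information over many time steps.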
5.9 Target-Dependent LSTM (TD-LSTM) This is an extension of the LSTM. An LSTM captures long-term dependencies well, but in sentiment analysis we also want to capture the semantic relatedness between the target word and its context words in the same sentence. TD-LSTM uses two LSTM models, left and right, to capture the preceding and following contexts of the target string. The left LSTM runs from left to right over the preceding context, with the target string at the end. Similarly, the right LSTM runs from right to left over the following context, again with the target string at the end, so that the target string sits in the middle, where the two meet.
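How the two input sequences are formed can be shown without any LSTM at all. In the sketch below, the function name, the example sentence, and the target span indices are all hypothetical; the left sequence is read left to right and the right sequence right to left, so both end on the target words.

```python
def td_lstm_inputs(tokens, target_start, target_end):
    """Split a sentence into the two sequences read by the left and right LSTMs.

    target_start/target_end delimit the target span (end exclusive).
    """
    left = tokens[:target_end]             # preceding context + target
    right = tokens[target_start:][::-1]    # following context + target, reversed
    return left, right

tokens = "the battery life is great but the screen is dim".split()
left, right = td_lstm_inputs(tokens, target_start=1, target_end=3)  # "battery life"
```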
5.10 Attention-Based LSTM with Aspect Embedding (ATAE-LSTM) In [32], the authors add an attention model with aspect embedding to extract the sentiment polarity with respect to different aspects, and they show two ways of incorporating the aspect embedding. Attention [16] means concentrating on one particular area rather than on everything. For sentiment analysis, the model learns from the training data which parts of the input to emphasize, rather than encoding the entire text uniformly as a single vector representation. In [16], attention is mostly used to focus on different parts of a sentence when different aspects are considered. The other crucial component is the aspect embedding. Information about the aspect is important because the same sentence can carry different sentiments for two different aspects. Each aspect is represented by an embedding vector, and the attention mechanism uses it to capture the key part of the sentence for the given aspect. The aspect embedding is combined with the hidden vectors when computing the attention weights that produce the sentence representation. In addition, to make full use of the aspect information, the aspect embedding vector is appended to each word input vector, so that the output hidden vectors carry information from the input aspect. Therefore, in the following step that computes the attention weights, the interdependence between words and the input aspect can be modeled. Hence, their
plan is to use the attention weights to focus more accurately on the parts of the sentence relevant to each aspect and to model their interconnections.
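The effect of aspect-dependent attention weights can be illustrated with a simplified scoring rule. The sketch uses a plain dot product between each hidden vector and the aspect vector, which is an assumption made for brevity (the cited model learns its scoring function), and all vectors are made-up values.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(hidden_states, aspect):
    """Weight each hidden vector by its (dot-product) relevance to the aspect."""
    scores = [sum(h * a for h, a in zip(state, aspect)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    summary = [
        sum(w * state[j] for w, state in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return weights, summary

# Hypothetical hidden vectors for three words of one sentence.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
food_weights, food_summary = attend(hidden_states, aspect=[2.0, 0.0])
service_weights, service_summary = attend(hidden_states, aspect=[0.0, 2.0])
```

The same sentence yields different weights, and therefore a different sentence representation, for each aspect.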
5.11 Aspect-Based Sentiment Analysis with Gated CNN (GCAE) The last paper [17] that we review on deep learning algorithms for sentiment analysis addresses aspect-based sentiment analysis (ABSA) using a gated CNN, with two subtasks: aspect category sentiment analysis (ACSA) and aspect term sentiment analysis (ATSA). Most earlier work on ABSA used LSTM-based models, which process the input sequentially and are more complex [30–35]. When the review data is noisy, the model may need positional information about the target and the context, which calls for a more complex LSTM that requires more computing power and memory. That is why the authors use a CNN instead of an LSTM and add gating mechanisms to compensate for the CNN's weakness in modeling long-term dependencies. First, the input words are fed into a convolutional neural network consisting of an embedding layer, a one-dimensional convolution layer, and a max-pooling layer. As mentioned before, text data must be converted into vectors before being fed into a machine learning model; here the embedding layer does this. The word embeddings are usually initialized with pre-trained embeddings such as GloVe and then fine-tuned on the task data. The gating mechanism is applied directly after the convolutional layer and is combined with an aspect embedding. The max-pooling layer comes after the gate; it keeps the most important values of the vector produced by the gating mechanism and produces a new vector with the important information. Lastly, the fully connected layer predicts the sentiment polarity from this vector. Next, we briefly discuss the unsupervised approach to classifying text data, focusing on the lexicon-based approach.
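The convolution, gate, and max-pooling pipeline of the gated CNN described above can be sketched in one dimension. The sketch is a strong simplification under stated assumptions: each word is a single number rather than an embedding vector, one convolution passes through tanh as the sentiment feature, a second passes through ReLU together with a scalar aspect term and acts as the gate, and all kernels and inputs are made-up values.

```python
import math

def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def gated_features(sequence, sentiment_kernel, gate_kernel, aspect_term):
    """Sentiment features (tanh) multiplied by an aspect-controlled gate (ReLU)."""
    sentiment = [math.tanh(v) for v in conv1d(sequence, sentiment_kernel)]
    gate = [max(0.0, v + aspect_term) for v in conv1d(sequence, gate_kernel)]
    return [s * g for s, g in zip(sentiment, gate)]

def max_over_time(features):
    """Max pooling over all positions keeps the strongest feature."""
    return max(features)

sequence = [1.0, -1.0, 2.0]            # toy one-dimensional "word features"
open_gate = gated_features(sequence, [1.0, 1.0], [1.0, 0.0], aspect_term=1.5)
closed_gate = gated_features(sequence, [1.0, 1.0], [1.0, 0.0], aspect_term=-2.0)
pooled = max_over_time(open_gate)
```

When the aspect term drives the gate to zero, the sentiment features at those positions are blocked, which is how the gate selects aspect-relevant information without any recurrence.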
5.12 Lexicon-Based Sentiment Analysis (LBSA) In LBSA, we use opinion lexicons [41, 42] to classify text as expressing a positive or a negative opinion [23]. Positive opinion words are used to express favorable states, and negative opinion words to express unfavorable ones. Besides individual words, phrases and idioms also form part of the opinion lexicon. The sentiment of a text is calculated from the lexical orientations of the words it contains. This requires a dictionary of positive and negative words, so that the model knows which words in a sentence count as positive and which as negative.
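A lexicon-based classifier in its simplest form counts positive and negative words. The two word sets below are tiny hypothetical stand-ins for a real opinion lexicon, and the scoring ignores negation, intensifiers, phrases, and idioms.

```python
POSITIVE = {"good", "great", "excellent", "love"}   # toy positive lexicon
NEGATIVE = {"bad", "poor", "terrible", "hate"}      # toy negative lexicon

def lexicon_polarity(text):
    """Label text by the balance of positive and negative lexicon words."""
    score = 0
    for token in text.lower().split():
        token = token.strip(".,!?;:")
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

mixed = lexicon_polarity("Great screen, good battery, bad speaker.")
complaint = lexicon_polarity("Terrible service.")
factual = lexicon_polarity("It arrived on Monday.")
```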
Table 1 A comparative chart of techniques doing sentiment analysis (SA)
Technique used               Accuracy                               References
SVM                          91.60% (PL04 Dataset)                  [44]
Linear SVM & NB              84.75% & 67.50% (TF-IDF)               [45]
Dictionary-based approach    63%                                    [14]
Hybrid CNN + LSTM            91%                                    [46]
Regional CNN + LSTM          0.987 (Stanford Sentiment Treebank)    [61]
ConvLSTM                     88.3%                                  [28]
CNN                          45.4%                                  [35]
RNN                          87% (Task-3)                           [47]
LSTM                         82.0% (3-class)                        [32]
TD-LSTM                      70.8%                                  [31]
ATAE-LSTM                    84.0% (3-class)                        [32]
GCAE                         77.28% (Restaurant)                    [34]
These dictionaries can be created in two ways: manually [37] or automatically [38]. Manual construction is laborious and time-consuming, which is why automated methods are generally preferred; these are the dictionary-based approach [39, 40] and the corpus-based approach [41]. Table 1 compares the techniques used for SA, listing the reported accuracy and the reference for each.
6 Dataset We have noticed a lack of benchmark datasets in natural language processing that can be used for sentiment analysis, sentiment classification, or opinion mining. Only a few resources, such as the Amazon, Yelp, and Rotten Tomatoes [60] review datasets, are widely available. We therefore list datasets that can be used for these tasks.
Task to do                  Dataset name                               References
Sentiment analysis          automotiveforums.com                       [48]
Sentiment analysis          CNETD                                      [49]
Sentiment analysis          amazon.com, epinions.com, blogs, SNS       [50]
Sentiment analysis          ebay.com, wikipedia.com, epinions.com      [51]
Sentiment analysis          amazon.com                                 [52]
Sentiment analysis          Amazon.com                                 [53]
Sentiment analysis          Twitter                                    [54]
Review rating prediction    Yelp                                       [55]
Sentiment analysis          IMDB movie Reviews                         [56]
Sentiment classification    IMDB                                       [57]
Sentiment classification    Convinceme.net                             [58]
Sentiment classification    Amazon.com                                 [59]
7 Conclusion In this paper, we have briefly surveyed many techniques developed and used by researchers to perform sentiment analysis, focusing in particular on the attention model and aspect embedding. We have also proposed a model based on a gated CNN with an added attention layer, with the aim of improving accuracy. We hope that this paper will help and motivate students, professionals, and researchers of all kinds to enter this field and contribute to it.
References
1. Tsytsarau, M., Palpanas, T.: Survey on mining subjective data on the web. Data Min. Knowl. Disc. 24, 478–514 (2012). https://doi.org/10.1007/s10618-011-0238-6
2. Cambria, E., Schuller, B., Xia, Y., Havasi, C.: New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst. 28(2), 15–21 (2013). https://doi.org/10.1109/MIS.2013.30
3. Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers (2012)
4. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1–2), 1–135 (2008). https://doi.org/10.1561/1500000011
5. Medhat, W., et al.: Sentiment analysis algorithms and applications: A survey (2014). https://doi.org/10.1016/j.asej.2014.04.011
6. Funk, A., Li, Y., Saggion, H., Bontcheva, K., Leibold, C.: Opinion analysis for business intelligence applications, 3 (2008). https://doi.org/10.1145/1452567.1452570
7. Behdenna, S., Barigou, F., Belalem, G.: Document Level Sentiment Analysis: A survey, CASA, EAI (2018). https://doi.org/10.4108/eai.14-3-2018.154339
8. D'Andrea, A., Ferri, F., Grifoni, P., Guzzo, T.: Approaches, tools and applications for sentiment analysis implementation. Int. J. Comput. Appl. 125, 26–33 (2015). https://doi.org/10.5120/ijca2015905866
9. Rani, S.: Sentiment analysis: a survey. Int. J. Res. Appl. Sci. Eng. Technol. V, 1957–1963 (2017). https://doi.org/10.22214/ijraset.2017.8276
10. Asghar, M., Khan, A., Ahmad, S., Kundi, F.: A Review of feature extraction in sentiment analysis. J. Basic Appl. Res. Int. 4, 181–186 (2014)
11. Feldman, R.: Techniques and applications for sentiment analysis. Commun. ACM 56, 82–89 (2013). https://doi.org/10.1145/2436256.2436274
A Survey on Sentiment Analysis
269
12. Chatterjee, D.P., Mukhopadhyay, S., Goswami, S., Panigrahi, P.K.: Efficacy of oversampling over machine learning algorithms in case of sentiment analysis. In: Springer Proceedings, ICDMAI 2020, India (2020) 13. Turney, P.D.: Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. ArXiv cs.LG/0212032 (2002): n. pag. 14. Sharma, R., Nigam, S., Jain, R.: Opinion mining of movie reviews at document level. ArXiv abs/1408.3829 (2014): n. pag. 15. Jagtap, V., Pawar, K.: Analysis of different approaches to sentence-level sentiment classification. Int. J. Sci. Eng. Technol. 2, 164–170 (2013) 16. Mayo, M.: KDnuggets.com. Data Representation for Natural Language Processing Tasks. Data Representation for Natural Language Processing Tasks. https://www.kdnuggets.com/2018/11/ data-representation-natural-language-processing.html 17. Schouten, K., Frasincar, F.: Survey on aspect-level sentiment analysis.IEEE Trans. Knowl. Data Eng. 28(3), 813–830 (2016). https://doi.org/10.1109/TKDE.2015.2485209 18. D. Nations, What Is Microblogging? A definition of microblogging with examples. In: LifeWire. https://www.lifewire.com/what-is-microblogging-3486200, 19 Dec 2019 19. Wang, M., Cao, D., Li, L., Li, S., Ji, R.: Microblog sentiment analysis based on cross-media bag-of-words model. In: Proceedings of International Conference on Internet Multimedia Computing and Service (ICIMCS ’14). Association for Computing Machinery, New York, NY, USA, pp. 76–80 (2014). https://doi.org/10.1145/2632856.2632912 20. Oh, C., Sheng, O.: Investigating predictive power of stock micro blog sentiment in forecasting future stock price directional movement. ICIS (2011) 21. Chamlertwat, W., Bhatarakosol, P., Rungkasiri, T.: Discovering consumer insight from twitter via sentiment analysis. J. Universal Comput. Sci. 18, 973–992 (2012) 22. Java, A., Song, X., Finin, T., Tseng, B.: Why We Twitter: An Analysis of a Microblogging Community (1970). 
https://doi.org/10.1007/978-3-642-00528-2_7 23. Tang, D., Qin, B., Liu, T.: Deep learning for sentiment analysis: successful approaches and future challenges. WIREs Data Min. Knowl. Discov. 5, 292–303 (2015). https://doi.org/10. 1002/widm.1171 24. Li, Y., Yang, T.: Word embedding for understanding natural language: a survey. In: Srinivasan, S. (ed.) Guide to Big Data Applications. Studies in Big Data, vol. 26. Springer, Cham (2018) 25. Noble, W.: What is a support vector machine? Nat. Biotechnol. 24, 1565–1567 (2006). https:// doi.org/10.1038/nbt1206-1565 26. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘16). ACM, New York, NY, USA, pp. 785–794. https://doi.org/10.1145/2939672.2939785 27. Wan, E.A.: Neural network classification: a Bayesian interpretation. IEEE Trans. Neural Netw. 1(4), 303–305 (1990). https://doi.org/10.1109/72.80269 28. Kim, Y.: Convolutional Neural Networks for Sentence Classification (2014). arXiv e-prints arXiv:1408.5882 29. Arras, L., Montavon, G., Muller, K.-R., Samek, W.: Explaining Recurrent Neural Network Predictions in Sentiment Analysis (2017). arXiv e-prints arXiv:1706.07206 30. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 31. Tang, D., Qin, B., Feng, X., Liu, T.: Effective LSTMs for Target-Dependent Sentiment Classification (2015). arXiv e-prints arXiv:1512.01100 32. Wang, Y., et al.: Attention-based LSTM for Aspect-level Sentiment Classification. EMNLP (2016) 33. Zhang, L., Wang, S., Liu, B.: Deep learning for sentiment analysis: a survey. WIREs Data Min. Knowl. Discov. 8, e1253 (2018). https://doi.org/10.1002/widm.1253 34. Xue, W., Li, T.: Aspect Based Sentiment Analysis with Gated Convolutional Networks (2018). arXiv e-prints arXiv:1805.07043 35. Ouyang, X., Zhou, P., Li, C.H., Liu, L.: Sentiment analysis using convolutional neural network. 
In: 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous
270
36.
37. 38. 39. 40. 41. 42. 43. 44. 45. 46.
47.
48. 49. 50. 51. 52. 53. 54. 55. 56.
57.
D. P. Chatterjee et al. Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing, Liverpool, pp. 2359–2364 (2015). https://doi.org/10.1109/CIT/ IUCC/DASC/PICOM.2015.349 Dutta, S., Roy, M., Das, A.K., Ghosh, S.: Sentiment detection in online content: a WordNet based approach. In: Panigrahi, B., Suganthan, P., Das, S. (eds,) Swarm, Evolutionary, and Memetic Computing. SEMCCO 2014. Lecture Notes in Computer Science, vol. 8947. Springer, Cham (2015) Tong, R.M.: An operational system for detecting and tracking opinions in on-line discussions. In: Working Notes of the SIGIR Workshop on Operational Text Classification, pp. 1–6 (2001) Turney, P., Littman, M.: Measuring praise and criticism: inference of semantic orientation from association. ACM Trans. Inf. Syst. J. 21(4), 315–346 (2003) Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’04) (2004) Kim, S., Hovy, E.: Determining the sentiment of opinions. In: Proceedings of International Conference on Computational Linguistics (COLING’04) (2004) Riloff, E., Shepherd, J.: A Corpus-Based Approach for Building Semantic Lexicons. ArXiv cmp-lg/9706013 (1997): n. pag. Alsaeedi, A., Khan, M.Z.: A study on sentiment analysis techniques of Twitter data. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 10(2) (2019). https://doi.org/10.14569/IJACSA.2019.0100248 Jurek, A., Mulvenna, M.D., Bi, Y.: Improved lexicon-based sentiment analysis for social media analytics. Secur. Inf. 4, 9 (2015). https://doi.org/10.1186/s13388-015-0024-x Nguyen, D.Q., Nguyen, D.Q, Vu, T., Pham, S.B.: Sentiment Classification on Polarity Reviews: An Empirical Study Using Rating-based Features. WASSA@ACL (2014) Tripathi, G., Naganna, S.: Feature selection and classification approach for sentiment analysis. Mach. Learn. Appl. Int. J. 2, 01–16. 
https://doi.org/10.5121/mlaij.2015.2201 Rehman, A.U., Malik, A., Raza, B., Ali, W.: A hybrid CNN-LSTM model for improving accuracy of movie reviews sentiment analysis. Multimedia Tools Appl. (2019). https://doi.org/ 10.1007/s11042-019-07788-7 AL-Smadi, M., Qawasmeh, O., Al-Ayyoub, M., Jararweh, Y., Gupta, B.B.: Deep recurrent neural network vs. support vector machine for aspect-based sentiment analysis of Arabic hotels’ reviews. J. Comput. Sci. (2017). https://doi.org/10.1016/j.jocs.2017.11.006 Qiu, G., He, X., Zhang, F., Shi, Y., Jiajun, Bu., Chen, C.: DASA: dissatisfaction-oriented advertising based on sentiment analysis. Expert Syst. Appl. 37, 6182–6191 (2010) Cao, Q., Duan, W., Gan, Q.: Exploring determinants of voting for the “helpfulness” of online user reviews: a text mining approach. Decis. Support Syst. 50, 511–521 (2011) Xu, K., Liao, S.S., Li, J., Song, Y.: Mining comparative opinions from customer reviews for competitive intelligence. Decis. Support Syst. 50, 743–754 (2011) Fan, T.-K., Chang, C.-H.: Blogger-centric contextual advertising. Expert Syst. Appl. 38, 1777– 1788 (2011) Hu, N., Bose, I., Koh, N.S., Liu, L.: Manipulation of online reviews: an analysis of ratings, readability, and sentiments. Decis. Support Syst. 52, 674–684 (2012) Min, H.-J., Park, J.C.: Identifying helpful reviews based on customer’s mentions about experiences. Expert Syst. Appl. 39, 11830–11838 (2012) Kontopoulos, E., Berberidis, C., Dergiades, T., Bassiliades, N.: Ontology-based sentiment analysis of twitter posts. Expert Syst. Appl. (2013) Asghar, N.: Yelp Dataset Challenge: Review Rating Prediction. ArXiv abs/1605.05362 (2016): n. pag. Sahu, T.P., Ahuja, S.: Sentiment analysis of movie reviews: a study on feature selection & classification algorithms. In: 2016 International Conference on Microelectronics, Computing and Communications (MicroCom), Durgapur, pp. 1–6 (2016). https://doi.org/10.1109/MicroCom. 2016.7522583 Bai, X.: Predicting consumer sentiments from online text. 
Decis. Support Syst. 50, 732–742 (2011)
A Survey on Sentiment Analysis
271
58. Walker, M.A, Anand, P., Abbott, R., Fox Tree, J.E., Martell, C., King, J.: That is your evidence?: Classifying stance in online political debate. Decis. Support Syst. 53, 719–729 (2012) 59. Moraes, R., Valiati, J.F., GaviãoNeto, W.P: Document-level sentiment classification: an empirical comparison between SVM and ANN. Expert Syst. Appl. 40, 621–633 (2013) 60. Rotten Tomatoes Movie Reviews. Data: https://www.kaggle.com/c/movie-review-sentimentanalysis-kernels-only/data 61. Wang, J., Yu, L.-C., Lai, K., Zhang, X.: Dimensional Sentiment Analysis Using a Regional CNN-LSTM Model, pp. 225–230 (2016). https://doi.org/10.18653/v1/P16-2037. 62. Hassan, A., Mahmood, A.: Deep Learning approach for sentiment analysis of short texts. In: 2017 3rd International Conference on Control, Automation and Robotics (ICCAR), Nagoya, x, pp. 705–710 (2016) 63. Ain, Q.T., Ali, M., Riaz, A., Noureen, A., Kamran, M., Hayat, B., Rehman, A.: Sentiment analysis using deep learning techniques: a review. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 8(6) (2017). https://doi.org/10.14569/IJACSA.2017.080657 64. Sohangir, S., Wang, D., Pomeranets, A., et al.: Big data: deep learning for financial sentiment analysis. J. Big Data 5, 3 (2018). https://doi.org/10.1186/s40537-017-0111-6 65. Wang, B., Liu, M.: Deep learning for aspect-based sentiment analysis. Stanford University report (2015) 66. Mukherjee, A., Mukhopadhyay, S., Panigrahi, P.K., Goswami, S.: Utilization of Oversampling for multiclass sentiment analysis on Amazon Review Dataset. In: IEEE Conference Proceedings, 2019 IEEE 10th International Conference on Awareness Science and Technology (iCAST) (2019) 67. Shirani-Mehr, H.: Applications of deep learning to sentiment analysis of movie reviews. Technical report, pp. 1–8 (2004) 68. Pouransari, H., Ghili, S.: Deep learning for sentiment analysis of movie reviews.Technical report, Stanford University (2014) 69. 
Radford, A., Jozefowicz, R., Sutskever, I.: Learning to generate reviews and discovering sentiment (2017). arXiv preprint arXiv:1704.01444
Movie Review Sentimental Analysis Based on Human Frame of Reference
Jagbir Singh, Hitesh Sharma, Rishabh Mishra, Sourab Hazra, and Namrata Sukhija
Abstract This paper explores React js, a modern web development technology. React is a versatile, optimized JavaScript library developed by Facebook for building user interfaces, with distinctive features such as reusable components and hot reloading. It is efficient for building single-page or mobile applications, and companion libraries such as Redux and React Router provide additional features such as state management and routing. Several notable features of React are explored in our web application, including reusable components, hooks and the virtual DOM. These features, developed specifically for React by Facebook, make programming efficient and easy. React updates pages at the component level, a behaviour referred to here as hot reloading: when a user posts a comment on Facebook or YouTube, or likes a post on Instagram, the whole page is not reloaded; only the affected components are updated, which is efficient and makes the web application fast.
Keywords Webseries · Movie review · Netflix · Motion pictures · Facebook
J. Singh · H. Sharma (B) · R. Mishra · S. Hazra · N. Sukhija
HMR Institute of Technology and Management, New Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_27
1 Introduction
The idea behind "React movies" is to develop an entertainment platform similar to Netflix, Amazon Prime and Hulu Originals. Unlike platforms built on older technology such as plain JavaScript, HTML, CSS and PHP [1], it is built with React. Our web application offers several distinctive features, including downloads, movie and series trailers, budgets and ratings. It provides a large collection of recent and older movies and series, as well as trailers for upcoming releases [2]. When a movie is downloaded from applications such as Netflix and Amazon Prime [3], it is stored inside the application rather than in the user's local storage; it cannot be transferred to other devices, can only be played inside that application, and is automatically removed after a few days [4]. In our web application, by contrast, a download is stored on the user's local storage device; once a movie or series has been downloaded, it does not need to be downloaded again for thirty days, and it automatically becomes accessible on all the other devices the user has registered [5].
We used the concept of component reusability: once a component has been developed, its code does not have to be written again. We only need to reference the component in our code, and it is invoked automatically [6]. We therefore write every component the application needs once and route all of them inside the app.js file. When components are routed inside app.js, their elements are loaded there and rendered in the web pages [7]. If we create more than one page and need the same component on another page, we simply route that component into that page rather than writing the whole code again [8].
One interesting feature used by top web applications such as Facebook and YouTube is hot reloading [9], which is very useful for our application. If anything changes inside a component, the change appears automatically in the web page, and the page does not have to be reloaded for the component to be updated [10].
2 Working Model
The website is dynamic, with key features such as ratings, cast information and downloads. Users can select their favourite movies and web series [11] and watch them. They can also browse what is currently airing to discover movies and animated series [12]. Upcoming movies are listed as well, so users can plan their time around upcoming releases. On-demand movie streaming services, formerly known as "Watch Now" [13], allow subscribers to stream television and movies through a movie website on a personal computer, or through the React movies software on various supported platforms, including smartphones and tablets, digital media players and smart TVs [14]. The rise of such streaming services has changed the way viewers watch television content. The website holds information about movies, upcoming movies and theatres [15], and it also hosts the official pages of television shows.
We use the Movie DB API. An application programming interface (API) is a set of protocols and tools for building software applications [16]; it specifies how software components should work together. The API provides all the building blocks we needed to build our movie and web series services, and we arrange those blocks according to our needs. The Movie DB API is currently at version 3. Using an API of this kind requires a few steps, such as registering and obtaining an API key [17].
Figure 1 shows the list of popular TV shows and web series presented to the user. TV shows that were recently released, recently updated, or highly rated on IMDb appear in the popular section [18]. The data are fetched from the Movie DB API's popular TV link and displayed on the web page in a four-column grid.
Fig. 1 Popular series section
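The request behind the popular section and the four-column layout can be sketched as follows. This is a minimal illustration in Python, not the authors' React code. It assumes the version 3 endpoint `https://api.themoviedb.org/3/tv/popular` with an `api_key` query parameter, and the helper names `popular_tv_url` and `to_grid_rows` are hypothetical. No request is sent; the sketch only builds the URL and groups already-fetched items into rows of four.

```python
from urllib.parse import urlencode

# Assumed base URL of version 3 of the Movie DB API (not stated in the text).
BASE_URL = "https://api.themoviedb.org/3"


def popular_tv_url(api_key: str, page: int = 1) -> str:
    """Build the URL of the popular TV listing (hypothetical helper)."""
    query = urlencode({"api_key": api_key, "page": page})
    return f"{BASE_URL}/tv/popular?{query}"


def to_grid_rows(items: list, columns: int = 4) -> list:
    """Split a flat list of shows into rows of `columns` items each."""
    return [items[i:i + columns] for i in range(0, len(items), columns)]


if __name__ == "__main__":
    print(popular_tv_url("YOUR_API_KEY"))
    shows = [f"show {n}" for n in range(1, 7)]
    for row in to_grid_rows(shows):
        print(row)
```

Keeping the URL construction separate from the layout helper means the grid logic can be checked without an API key or network access.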
3 Proposed Methodology
Entertainment plays a vital role in human life. There are many sources of entertainment, such as listening to songs, performing drama, and watching movies or web series. People prefer to go to the theatre for good movies or series; if a movie or web series is not up to the mark, it feels like a waste of time and money [19]. Ratings for particular movies and series are therefore provided by IMDb ratings, voice and Flixster. However, we cannot decide whether a movie or series is good just by going through these reviews, because different people have different tastes and areas of interest. Reading a few reviews is not enough; some time and research are needed to decide whether a given rating is reliable [20]. For example, if someone who watches action movies gives a bad rating to a comedy movie, that rating should not be taken at face value, because it comes from a person with different interests; a person who is fond of comedy movies may well like that movie [21].
4 Implementation
Although the existing SentiWordNet provides its own numerical scores, its major drawback is that it may not provide a complete solution for the sentiment analysis task. To overcome this, we use a lexicon-based approach in which word scores are produced with SentiWordNet by dividing words into seven classes, ranging from strong positive to strong negative.
4.1 Information/Data Acquisition
Data acquisition is the first step in any mining and sentiment processing task, because the quality of the data is very important. A large amount of web content is collected from website updates [22]. Each HTML page is parsed, every tag is removed, and the written sentences are extracted. The collected reviews are stored as text documents [23].
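The tag-removal step described above can be sketched with Python's standard `html.parser` module. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `TextExtractor`, the helper `html_to_text` and the decision to skip `<script>` and `<style>` content are all assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and ignore tags, scripts and styles."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the text of an HTML page with every tag removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><h1>Review</h1><p>A <b>great</b> movie.</p></body></html>"
    print(html_to_text(page))
```

The extracted string can then be written to a text file, one document per review, before pre-processing.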
4.2 Pre-processing
Stop-word removal: stop words are predefined words that are filtered out before or after natural language data are processed. They can cause problems when searching for keywords [24], and they are not essential for our purpose. They are removed with a few simple lines of code. More than 100 such words are removed from our text in this work, as they do not serve our purpose [25].
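The few lines of code mentioned above might look like the following sketch. The stop-word list here is a small illustrative assumption; the paper removes more than 100 words but does not list them, and the function name `remove_stop_words` is hypothetical.

```python
import re

# Tiny illustrative stop-word list. The actual list used in the paper
# (more than 100 words) is not given, so these entries are assumptions.
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "and", "to", "in", "it", "this"}


def remove_stop_words(text: str) -> list:
    """Lower-case the text, split it into word tokens and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


if __name__ == "__main__":
    print(remove_stop_words("This movie is a waste of time and money"))
```

Using a set for the stop words keeps each membership check constant time, which matters once the list grows beyond a handful of entries.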
4.3 Classification
SentiWordNet: SentiWordNet is a lexical resource in which every WordNet synset s is associated with two numerical scores, Pos(s) and Neg(s), which describe how positive or negative the words contained in the synset are [26]. When the analyzer finds an opinion keyword in a sentence of a web page, it checks for modifiers connected with that keyword [27]. When it finds such a word, it retrieves the word's score from SentiWordNet for further processing.
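A lexicon lookup of this kind, followed by a seven-class labelling from strong positive to strong negative, can be sketched as below. The sketch rests on several assumptions: the lexicon is a toy dictionary whose scores are invented for illustration and are not taken from SentiWordNet, the sentence score is a plain average of positive minus negative scores, and the class cut-offs are hypothetical because the paper does not state its thresholds.

```python
# Toy lexicon mapping a word to (positive score, negative score).
# The values are invented for illustration and are NOT SentiWordNet entries.
TOY_LEXICON = {
    "excellent": (0.875, 0.0),
    "good": (0.625, 0.0),
    "fine": (0.25, 0.0),
    "dull": (0.0, 0.375),
    "bad": (0.0, 0.625),
    "terrible": (0.0, 0.875),
}


def sentence_score(tokens: list) -> float:
    """Average (positive - negative) over the tokens found in the lexicon."""
    polarities = [TOY_LEXICON[t][0] - TOY_LEXICON[t][1]
                  for t in tokens if t in TOY_LEXICON]
    if not polarities:
        return 0.0
    return sum(polarities) / len(polarities)


def seven_point_label(score: float) -> str:
    """Map a score in [-1, 1] to one of seven classes (hypothetical cut-offs)."""
    if score >= 0.6:
        return "strong positive"
    if score >= 0.3:
        return "positive"
    if score > 0.0:
        return "weak positive"
    if score == 0.0:
        return "neutral"
    if score > -0.3:
        return "weak negative"
    if score > -0.6:
        return "negative"
    return "strong negative"


if __name__ == "__main__":
    for review in (["excellent", "movie"], ["good", "dull"], ["terrible"]):
        print(review, seven_point_label(sentence_score(review)))
```

Averaging treats every opinion word equally; handling modifiers such as negation would require adjusting a word's polarity before it is averaged.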
5 Movie Reviews and Sentiment Analysis on the Basis of Human Perspective
Everyone follows recurring patterns, and a lot of research has been done to understand them, for example text segmentation and sentiment analysis to predict good customer reviews [28]. This is important because reviews can indicate customer satisfaction and help us predict what kind of product will satisfy the customer [29]. Being able to communicate with the customer [30] and collect that information can have a variety of benefits [31], such as improving product quality [32], business strategies [33] and services, and monitoring performance. The actual details of a product may not mean much on their own, but companies such as Facebook, Twitter, IMDb and various advertising platforms ask people to review their products in order to ensure customer satisfaction [34]. A movie review consists of sentences that describe the movie, and the opinion it expresses can be positive or negative. Critics' reviews capture the general impression of a movie and whether it is worth watching. However, it is very difficult to draw a conclusion from a person's written statement, because human language is complex and there are many different ways of expressing positive and negative feelings about something.
6 Result
The idea behind "React movies" is to develop an entertainment platform similar to Netflix, Amazon Prime and Hulu Originals, built with React js rather than older technology such as plain JavaScript, HTML, CSS and PHP. The web application offers distinctive features, including downloads, movie and series trailers, budgets and ratings, and it provides a large collection of recent and older movies and series as well as trailers for upcoming releases. When a movie is downloaded from applications such as Netflix and Amazon Prime, it is stored inside the application rather than in local storage; it cannot be transferred to other devices, can only be played inside that application, and is automatically removed after a few days. In our web application, a download is stored on the user's local storage device; it does not need to be downloaded again for thirty days, and it automatically becomes accessible on all the other devices the user has registered.
The application uses the concept of component reusability: once a component has been developed, its code does not have to be written again; the component only needs to be referenced in the code. All the components the application needs are therefore written once and routed inside the app.js file. When components are routed inside app.js, their elements are loaded there and rendered in the web pages. If more than one page is created and the same component is needed on another page, the component is simply routed into that page rather than rewritten.
Figure 2 shows how the search algorithm works. Whenever a user searches for something, the search key is passed to the API, which checks for similar content and returns the matches to be shown. This illustrates the interaction between the API and the search algorithm. When the user searches with a keyword, results are shown according to the user's query; the algorithm is designed for this kind of dynamic use. Figure 3 shows the details of a TV show, which are fetched from the API link and arranged on the web page using CSS. It also reflects the processes running behind the page, such as fetching data from the Movie DB API and applying CSS to props and components to arrange them attractively and present a pleasing user interface (UI) to the user.
Fig. 2 Search bar
Fig. 3 TV show details
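The search flow of Fig. 2, in which the user's key is passed to the API and the matching titles are displayed, can be sketched as follows. This is a minimal sketch rather than the authors' code. It assumes the version 3 endpoint `https://api.themoviedb.org/3/search/multi` with `api_key` and `query` parameters, and it assumes that movie results carry a `title` field while TV results carry a `name` field. The helper names `search_url` and `result_titles` are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

# Assumed base URL of version 3 of the Movie DB API (not stated in the text).
BASE_URL = "https://api.themoviedb.org/3"


def search_url(api_key: str, keyword: str):
    """Build the search URL, or return None when the keyword is blank."""
    keyword = keyword.strip()
    if not keyword:
        return None  # nothing to search for, so no request is needed
    query = urlencode({"api_key": api_key, "query": keyword})
    return f"{BASE_URL}/search/multi?{query}"


def result_titles(results: list) -> list:
    """Extract a display title from each result that has one."""
    titles = []
    for item in results:
        label = item.get("title") or item.get("name")
        if label:
            titles.append(label)
    return titles


if __name__ == "__main__":
    print(search_url("YOUR_API_KEY", "money heist"))
    sample = [{"title": "Inception"}, {"name": "Dark"}, {"id": 1}]
    print(result_titles(sample))
```

Returning `None` for a blank keyword lets the caller skip the network call while the search box is empty.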
7 Conclusion
The purpose of "React movies" is to provide users with an entertainment platform comparable to existing ones, but built with React js rather than older technology such as plain JavaScript, HTML, CSS and PHP. Our web application offers features that make it unique, such as downloads, movie and series trailers, budgets and ratings, and it contains a large variety of recent and older movies and series as well as trailers for upcoming releases. When a movie is downloaded from applications such as Netflix and Amazon Prime, it is stored inside the application rather than in local storage, cannot be transferred to other devices, can only be played inside that application, and is automatically removed after a few days. An analysis of web reviews on a seven-point scale, applicable to any website on the Internet and combined with several machine learning (ML) methods, yields better results. This method is a powerful tool that helps web users make decisions based on web mining. The application has a friendly UI that invites people to try it, and it is fully hosted on a server. Building it allowed us to learn a new technology that will be useful in the future. Had the web application been built with plain HTML, CSS and PHP, it would have required tedious hard-coding of each and every page. If the application goes to market, it is likely to gain exposure. One feature that distinguishes it from others is its attractive UI. The work also helped us understand how APIs work and how to connect small pieces into a complete product. For research purposes, there is still much to learn in React.
References
1. Suhariyanto, Firmanto, A., Sarno, R.: Prediction of movie sentiment based on reviews and score on Rotten Tomatoes using SentiWordnet. In: IEEE 2018 International Seminar on Application for Technology of Information and Communication, September 2018
2. Nanda, C., Dua, M., Nanda, G.: Sentiment analysis on movies reviews in Hindi Language using machine learning. In: IEEE 2018 (ICCSP), April 2018
3. Shaheen, M.: Sentiment analysis on mobile phone reviews using supervised learning techniques. Int. J. Modern Educ. Comput. Sci. 11, 32–43 (2019)
4. Li, L., Goh, T., Jin, D.: How textual quality of online reviews affect classification performance: a case of deep learning sentiment analysis. Neural Comput. Appl., 1–29 (2018)
5. Grover, M., Verma, B., Sharma, N., Kaushik, I.: Traffic control using V-2-V based method using reinforcement learning. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974540
6. Mamtesh, M., Mehla, S.: Sentiment analysis of movie reviews using machine learning classifiers. Int. J. Comput. Appl. 182, 25–28 (2019)
7. Hassan, A.A., Abdulwahhab, A.B.: Reviews Sentiment analysis for collaborative recommender system. Science 2, 87–91 (2017)
8. Harjani, M., Grover, M., Sharma, N., Kaushik, I.: Analysis of various machine learning algorithm for cardiac pulse prediction. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974519
9. Nafees, M., Dar, H.S., Lali, I.U., Tiwana, S.: Sentiment analysis of polarity in product reviews in social media. In: 2018 14th International Conference on Emerging Technologies (ICET), pp. 1–6 (2018)
10. Zhang, S., Tang, Y., Lv, X., Dong, Z.: Movie short-text reviews sentiment analysis based on multi-feature fusion. In: ACAI 2018 (2018)
11. Sharma, A., Singh, A., Sharma, N., Kaushik, I., Bhushan, B.: Security countermeasures in web based application. In: 2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT) (2019). https://doi.org/10.1109/icicict46008.2019.8993141
12. Pandey, S., Sagnika, S., Mishra, B.S.: A technique to handle negation in sentiment analysis on movie reviews. In: 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 0737–0743 (2018)
13. Salas-Zárate, M.D., Paredes-Valverde, M.A., Limon, J., Tlapa, D.A., Báez, Y.A.: Sentiment classification of Spanish reviews: an approach based on feature selection and machine learning methods. J. UCS 22, 691–708 (2016)
14. Iqbal, N., Chowdhury, A.M., Ahsan, T.: Enhancing the performance of sentiment analysis by using different feature combinations. In: IEEE 2018 (IC4ME2), Feb 2018
15. Untawale, T.M., Choudhari, G.: Implementation of sentiment classification of movie reviews by supervised machine learning approaches. In: 2019 3rd (ICCMC), March 2019
16. Goyal, S., Sharma, N., Kaushik, I., Bhushan, B., Kumar, A.: Precedence & issues of IoT based on edge computing. In: 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT) (2020). https://doi.org/10.1109/csnt48778.2020.9115789
17. Yasen, M., Tedmori, S.: Movies reviews sentiment analysis and classification. In: 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), pp. 860–865 (2019)
18. Rustagi, A., Manchanda, C., Sharma, N.: IoE: A boon & threat to the mankind. In: 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT) (2020). https://doi.org/10.1109/csnt48778.2020.9115748
19. Ali, N.M., Hamid, M.M., Youssif, A.A.: Sentiment analysis for movies reviews dataset using deep learning models. Int. J. Data Min. Knowl. Manage. Process 09, 19–27 (2019)
20. Wankhede, R., Thakare, A.N.: Design approach for accuracy in movies reviews using sentiment analysis. In: 2017 International conference of Electronics, Communication and Aerospace Technology (ICECA), vol. 1, pp. 6–11 (2017)
21. Sethi, R., Kaushik, I.: Hand written digit recognition using machine learning. In: 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT) (2020). https://doi.org/10.1109/csnt48778.2020.9115746
22. Amplayo, R.K., Song, M.: An adaptable fine-grained sentiment analysis for summarization of multiple short online reviews. Data Knowl. Eng. 110, 54–67 (2017)
23. Dubey, T., Jain, A.: Sentiment analysis of keenly intellective smart phone product review utilizing SVM classification technique. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–8
24. Kumari, U., Sharma, A.K., Soni, D.K.: Sentiment analysis of smart phone product review using SVM classification technique. In: 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), pp. 1469–1474 (2017)
25. Rathi, R., Sharma, N., Manchanda, C., Bhushan, B., Grover, M.: Security challenges & controls in cyber physical system. In: 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT) (2020). https://doi.org/10.1109/csnt48778.2020.9115778
26. Kaur, R.: Sentiment analysis of movie reviews: a study of machine learning algorithms with various feature selection methods (2017)
27. Manchanda, C., Rathi, R., Sharma, N.: Traffic density investigation & road accident analysis in India using deep learning. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/icccis48478.2019.8974528
28. Untawale, T.M., Choudhari, G.: Implementation of sentiment classification of movie reviews by supervised machine learning approaches. In: 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), pp. 1197–1200 (2019)
29. Thakur, R.K., Deshpande, M.V.: Kernel optimized-support vector machine and mapreduce framework for sentiment classification of train reviews. Sādhanā 44, 1–14 (2018)
30. Singh, A., Sharma, A., Sharma, N., Kaushik, I., Bhushan, B.: Taxonomy of attacks on web based applications. In: 2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT) (2019). https://doi.org/10.1109/icicict46008.2019.8993264
31. Jha, P., Khan, S.: Multi domain sentiment classification approach using supervised learning. Int. J. Adv. Res. Ideas Innov. Technol. 5, 44–47 (2019)
32. Ahmed, H.M., Jaber, H.R.: Sentiment Analysis for Movie Reviews Based on Four Machine Learning Techniques. Diyala Journal for Pure Science 16, 65–83 (2020)
33. Manchanda, C., Sharma, N., Rathi, R., Bhushan, B., Grover, M.: Neoteric security and privacy sanctuary technologies in smart cities. In: 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT) (2020). https://doi.org/10.1109/csnt48778.2020.9115780
34. Nanda, C.N., Dua, M., Nanda, G.: Sentiment analysis of movie reviews in hindi language using machine learning. In: 2018 International Conference on Communication and Signal Processing (ICCSP), pp. 1069–1072 (2018)
A Gaussian Naive Bayesian Classifier for Fake News Detection in Bengali
Shafayat Bin Shabbir Mugdha, Marian Binte Mohammed Mainuddin Kuddus, Lubaba Salsabil, Afra Anika, Piya Prue Marma, Zahid Hossain, and Swakkhar Shatabda
Abstract With the advent of modern digital technology and the reach of digitized content, the manipulation of facts into fake news has become more widespread and more damaging than ever. The intent is often to manipulate opinion on religious, political, financial and other serious matters within the social and state context, to create a nuisance, to spread violence and even to wage wars. However, ordinary people are often unable to distinguish between fake and real news, and the dubious nature of fake news can even make us suspect real news. With the progress made in natural language processing, it has become interesting to look for patterns in the generation of fake news and thereby find better ways of distinguishing it from real news. In this paper, we propose a machine learning based fake news detection method for Bengali. Our proposed method uses a novel dataset created for the purpose and a Gaussian Naive Bayes algorithm. The algorithm uses TF-IDF based text features and an Extra Tree Classifier for feature selection. In addition, we have performed a comprehensive analysis of different machine learning algorithms and of the features.
Keywords Fake news · Machine learning (ML) · Natural language processing (NLP) · Feature selection
S. B. S. Mugdha · M. B. M. M. Kuddus · L. Salsabil · A. Anika · P. P. Marma · Z. Hossain · S. Shatabda (B)
Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_28

1 Introduction
Fake news has become a major issue on a global scale in recent years. We can define fake news as news, stories or hoaxes generated intentionally to mislead readers, with the intent of exploiting their consent or views on political, social, religious and other issues. Often, these stories are created to influence people's views, push a political agenda or put people into a dilemma, and they can be commercially beneficial for online news publishers. People are sometimes deceived by websites that look trustworthy or that use names and web addresses nearly identical to those of well-known news organizations. Fake news websites and channels publish their content with the intention of misguiding readers and spreading misinformation through social networks and word of mouth. This has a bad impact on millions of people and on their surroundings: it spreads violence and wrong information. People write fake news pieces to intentionally mislead consumers [1]. If news can be detected as fake, ordinary readers benefit most, because they will be able to get the right information. We therefore aim to detect whether a news item is fake or not using machine learning approaches.
There have been a number of efforts to detect fake news using machine learning or knowledge based methods. There are two general approaches: natural language processing and fact checking methods [2]. A number of methods have been proposed for fake news detection on Facebook, Twitter and other sources. However, there has been little progress for the Bengali language community, which leaves a vulnerable gap. We address this gap in this research.
In our work, we first created a novel dataset and then applied natural language processing to develop a method for fake news detection. We compared six machine learning classifiers. Our proposed method uses TF-IDF based text features, on top of which we used the Extra Tree algorithm to select important features. Among the algorithms tested on the dataset using cross-validation, the Gaussian Naive Bayes classifier outperforms the rest. We have also performed a number of analyses of the patterns in the data and the nature of the features.
2 Related Work
In past years, a lot of work has been done on fake news detection and spam news detection. With the fast development of websites and social media platforms, social spam and fake news have drawn the attention of industry as well as academia.
2.1 Research over the Years
One of the earliest papers is on the detection of social spam [3]. Its authors identified six features of collaborative tagging systems and, by tuning various properties of social spam classifiers, achieved an accuracy above 98% while maintaining a false positive rate of 2%. A few other recent works [4, 5] address the issue of fake news detection. Ruchansky et al. proposed a hybrid model [6]. In [7], the authors discussed the challenges of working with fake news. In a very recent work, Keskar et al. conducted an analysis of fake news detection [8].
2.2 Different Methods
Different types of classifiers have been used over the years to categorize news. RNN and LSTM were used in [9]. Wang [10] used Logistic Regression, Support Vector Machine (SVM), Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to detect fake news. A Support Vector Machine (SVM) classifier was used in [11]. A simple approach using the Naive Bayes (NB) classifier was proposed in [12], achieving 74% accuracy on the test set. A hybrid approach combining logistic regression and harmonic boolean label crowdsourcing was proposed in [13]. A geometric deep learning approach was used in [14].
2.3 Different Datasets
Different papers have used different datasets according to their convenience. In [15], Zheng et al. used a Chinese dataset collected from Weibo (a Chinese social media platform) to classify users into spammers and non-spammers. Facebook was used as a source of data in [13], whose authors selected about 15,500 posts and the reactions of 909,236 users to create their dataset. The LIAR dataset was used in several works [9, 10]. Two other well-known datasets, the Twitter dataset [14] and the Buzzfeed dataset [16], are also widely used.
3 Proposed Methodology
Our methodology is depicted in Fig. 1. We first used a novel technique to collect a dataset. We then preprocessed the dataset to obtain stemmed sentences, extracted features from these sentences and finally passed the features to the classifiers. We used several different classifiers on our dataset for detecting fake news, and we applied feature selection to the generated features. The rest of this section describes the methodology in detail.
Fig. 1 Work flow of proposed methodology
3.1 Creation of the Dataset
According to the authors of ‘Ethnologue’, Bengali is the seventh most spoken language. Despite that, a standard corpus for the Bengali language is hard to find [17]. The dataset that we have created includes data from multiple websites that post fake news alongside real news. For each news item, we gathered the title, body, date, URL, and label. We labeled each item as either fake or real, mainly on the basis of its title: we searched for the title, and if the story also appeared in several other newspapers we collected it as real news; otherwise we labeled it as fake. Where possible, we collected each fake news item together with its corresponding real news item. The dataset contains 112 instances, of which 56 are real and 56 are fake.
3.2 Preprocessing
Preprocessing is applied mainly to the body field of the dataset, since it contains a large amount of text. Its purpose is to obtain a more refined dataset by removing stopwords and reducing words to their stems. The preprocessing stage is shown in Fig. 2. It consists of two main steps: tokenization and stemming.
3.2.1 Tokenization
In this step, we removed special characters and numeric values, so that only the text remained to work with. We then divided the remaining sentences into words, thereby tokenizing them. From this bag of words we then removed the common stopwords that are used in everyday language. After this filtering, we have a set of unique words.
Fig. 2 Preprocessing stage
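The tokenization step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the stopword list is a tiny hypothetical sample, and keeping only characters from the Bengali Unicode range U+0980 to U+09E3 (which also drops Latin text and Bengali digits) is an assumption about what "special characters and numeric values" covers.

```python
import re

# Hypothetical, tiny stopword sample for illustration only; a real Bengali
# stopword list would be far longer.
STOPWORDS = {"এবং", "ও", "যে", "এই"}

# Assumption: keep only Bengali letters and signs (U+0980 to U+09E3).
# Everything else, including punctuation, Latin text and Bengali digits
# (U+09E6 to U+09EF), is replaced by a space.
NOT_BENGALI_TEXT = re.compile(r"[^\u0980-\u09E3]+")

def tokenize(text, stopwords=STOPWORDS):
    """Return the unique non-stopword tokens of `text`, in first-seen order."""
    cleaned = NOT_BENGALI_TEXT.sub(" ", text)
    unique_tokens = []
    seen = set()
    for token in cleaned.split():
        if token in stopwords or token in seen:
            continue
        seen.add(token)
        unique_tokens.append(token)
    return unique_tokens

tokens = tokenize("এই খবর ২০২০ সালে, খবর এবং ছবি!")
```

Keeping first-seen order is a design choice made here so that the output is reproducible; the source only states that the result is a set of unique words.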
3.2.2 Stemming
The set of words obtained after tokenization is used for stemming. Stemming is done separately for nouns and verbs, in the same way as described in [18]. The authors of [18] used a 4-step verb stemmer and a 3-step noun stemmer, whereas both our noun and verb stemmers consist of the following 3 steps:
• Step 1: We eliminated the inflections from the words.
• Step 2: We then eliminated the diacritic marks from the words.
• Step 3: Special cases are handled, including a few transformations of the diacritic marks in the words.
After stemming, the remaining words were merged to form sentences, called Stemmed Sentences, which are later used for feature extraction.
3.3 Feature Extraction and Selection The features usually used for fake news are Title, Body, URL, Date, Author [10]. However, we have only used the title and the body of the news.
3.3.1 TF-IDF
Term Frequency–Inverse Document Frequency (TF-IDF) is a commonly used method that represents text numerically by assigning each word a weight indicating how important that word is to a document. It is a widely used feature extraction technique in Natural Language Processing (NLP).
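As a hedged illustration of the weighting, the sketch below computes the textbook variant, tf(t, d) · log(N / df(t)), in plain Python. The source does not state which TF-IDF variant was used; scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed IDF and L2 normalization by default, so its numbers would differ from these. The function name `tf_idf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Textbook variant: (count of term in doc / doc length) * log(N / df(term)).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term at most once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus: "a" occurs in every document, so its weight is zero.
weights = tf_idf([["a", "b", "a"], ["a", "c"]])
```

A term that appears in every document receives weight zero, which is why very common words contribute little to the feature vector even when stopword removal misses them.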
3.3.2 Extra Tree Classifier
The Extra Tree (extremely randomized trees) classifier is a tree-based classifier similar to Random Forest or Decision Tree classification. In this case, however, we did not use it as a classifier; instead we used it as a feature selection technique to select the most suitable features, which were then passed to the classifiers to obtain better results.
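One common way to use extremely randomized trees for feature selection in scikit-learn is to wrap an `ExtraTreesClassifier` in `SelectFromModel`, which keeps the features whose importance exceeds a threshold (the mean importance by default for tree ensembles). The sketch below runs on synthetic data; the number of trees, the threshold and the data itself are assumptions, since the source does not report these settings.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a dense TF-IDF matrix: 112 samples, 50 features.
rng = np.random.default_rng(0)
X = rng.random((112, 50))
y = np.array([0, 1] * 56)   # balanced labels, as in the dataset description
X[:, 0] += y                # make feature 0 informative on purpose

# Fit the ensemble and keep features with above-average importance.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=100, random_state=0)
).fit(X, y)

X_selected = selector.transform(X)
kept = selector.get_support()   # boolean mask over the original features
```

When selection is combined with cross-validation, fitting the selector inside each training fold (for example with a scikit-learn `Pipeline`) avoids leaking label information from the held-out fold; the source does not say how this was handled.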
3.4 Gaussian Naive Bayes Classifier
In general, Naive Bayes is a probabilistic classifier based on Bayes' theorem together with a strong assumption that the features are independent of each other. In our work we used Gaussian Naive Bayes, in which the continuous values of each feature $x_i$ associated with a class $y$ are assumed to follow a normal distribution with class-specific mean $\mu_y$ and variance $\sigma_y^2$:
$$\Pr(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
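To make the Gaussian likelihood above concrete, here is a small from-scratch sketch of training and prediction. It is an illustration rather than the scikit-learn implementation used in the experiments: the function names are hypothetical, and the `var_floor` term added to each variance is an assumption introduced only to avoid division by zero for constant features.

```python
import math

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit_gnb(X, y, var_floor=1e-9):
    """Estimate a prior and a per-feature mean and variance for every class."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(y), means, variances)
    return model

def predict_gnb(model, x):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def log_posterior(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances)
        )
    return max(model, key=log_posterior)

X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.2]]
y = ["real", "real", "fake", "fake"]
model = fit_gnb(X, y)
```

Summing log densities instead of multiplying densities is a standard numerical precaution: with thousands of TF-IDF features, the raw product would underflow to zero.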
4 Experimental Analysis
All the experiments in this work were performed in the Google Colab environment using Python 3.5 and the Scikit-learn library. We used tenfold cross-validation on the dataset to evaluate performance. Several performance metrics were used: accuracy, area under the Receiver Operating Characteristic curve (auROC), area under the Precision-Recall curve (auPR), F1 score and Matthews Correlation Coefficient (MCC).
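Three of the threshold-based metrics listed above can be computed directly from the confusion-matrix counts. The sketch below is illustrative: the helper names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the source.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier that labels all 112 balanced items as positive.
constant = dict(tp=56, tn=0, fp=56, fn=0)
# A classifier that gets 50 of 56 right in each class.
good = dict(tp=50, tn=50, fp=6, fn=6)
```

The constant predictor reaches 50% accuracy and an F1 score of about 0.667, yet its MCC is 0, which shows why MCC is a useful complement on small balanced datasets. Whether this effect explains any particular row in the result tables is not stated in the source.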
4.1 Results
We tested multiple classification algorithms: Support Vector Machine (SVM), Logistic Regression (LR), Multilayer Perceptron (MLP), Random Forest Classifier (RF), Voting Ensemble Classifier (VEC) and Gaussian Naive Bayes (GNB). Table 1 shows the performance of the classifiers without feature selection. Note that none of the algorithms performs well here. Next we applied three feature selection techniques: PCA, Kernel PCA and the Extra Tree Classifier. The results for PCA and Kernel PCA are presented in Table 2, and those for the Extra Tree Classifier in Table 3.
Table 1 Comparison of the performance of classifiers without feature selection
Classifiers    Accuracy (%)   F1-score   MCC      auROC   auPR
SVM (linear)   54.12          0.701      0.000    0.541   0.662
LR             50.91          0.576      −0.007   0.438   0.616
MLP            46.13          0.568      −0.143   0.426   0.571
RF             57.19          0.598      −0.132   0.641   0.514
VEC            54.85          0.514      −0.131   0.424   0.531
GNB            45.16          0.516      −0.116   0.402   0.531
Table 2 Comparison of the performance of classifiers with principal component analysis (PCA) and kernel PCA
Classifiers    Accuracy (%)   F1-score   MCC      auROC   auPR
PCA
SVM (linear)   54.04          0.702      0.000    0.518   0.631
LR             51.33          0.605      −0.009   0.473   0.603
MLP            44.76          0.515      −0.146   0.407   0.576
RF             54.42          0.631      0.077    0.463   0.589
VEC            46.82          0.554      −0.121   0.465   0.630
GNB            54.15          0.604      0.070    0.515   0.615
Kernel PCA
SVM (linear)   54.12          0.701      0.000    0.484   0.612
LR             55.06          0.645      0.082    0.547   0.674
MLP            53.93          0.566      0.065    0.523   0.667
RF             62.81          0.657      0.251    0.561   0.638
VEC            53.22          0.670      0.018    0.534   0.664
GNB            53.95          0.675      0.031    0.481   0.623
Table 3 Comparison of the performance of classifiers with extra tree classifier
Classifiers    Accuracy (%)   F1-score   MCC     auROC   auPR
SVM (linear)   54.12          0.701      0.000   0.537   0.571
LR             69.52          0.749      0.403   0.835   0.889
MLP            70.93          0.745      0.435   0.776   0.881
RF             63.14          0.688      0.269   0.712   0.824
VEC            76.29          0.772      0.545   0.857   0.891
GNB            85.52          0.821      0.634   0.759   0.799
4.2 Analysis
Note that the results obtained with Extra Tree Classifier based feature selection are superior to those of the other feature selection methods. With this selection, the Gaussian Naive Bayes classifier outperforms all other classifiers in terms of accuracy, F1-score and MCC, while the voting ensemble attains the highest auROC and auPR (Table 3). We also show the ROC analysis in Fig. 3a. After building our model, the next step was to visualize the most important word features in the dataset. Figure 4 shows the frequencies of the most frequent words in the real and fake categories. After obtaining the word frequencies for both categories, we built a colored word cloud by assigning colors by category, as shown in Fig. 3b.
Fig. 3 a ROC curve with extra tree classifier and b Word cloud showing most frequent Words in the dataset
Fig. 4 Histogram showing word frequencies in a Real news and b Fake news
5 Conclusion
In this paper, we propose a Gaussian Naive Bayes classifier based fake news detection algorithm that uses TF-IDF features extracted from news articles and selected using an Extra Tree Classifier. We have also created a novel dataset for Bengali fake news. For future work, we aim to build a richer corpus of Bengali news and a more powerful preprocessing model that extracts features more precisely.
References
1. Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H.: Fakenewsnet: a data repository with news content, social context and dynamic information for studying fake news on social media. arXiv preprint arXiv:1809.01286
2. Haque, M.M., Yousuf, M., Arman, Z., Uddin Rony, M.M., Alam, A.S., Hasan, K.M., Islam, M.K., Hassan, N.: Fact-checking initiatives in Bangladesh, India, and Nepal: a study of user engagement and challenges. arXiv preprint arXiv:1811.01806 (2018)
3. Markines, B., Cattuto, C., Menczer, F.: Social spam detection. In: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, pp. 41–48. ACM (2009)
4. Rubin, V.L., Conroy, N.J., Chen, Y.: Towards news verification: deception detection methods for news discourse. In: Hawaii International Conference on System Sciences (2015)
5. Rubin, V.L., Chen, Y., Conroy, N.J.: Deception detection for news: three types of fakes. In: Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, p. 83. American Society for Information Science (2015)
6. Ruchansky, N., Seo, S., Liu, Y.: CSI: a hybrid deep model for fake news detection. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 797–806. ACM (2017)
7. Zhou, X., Zafarani, R., Shu, K., Liu, H.: Fake news: fundamental theories, detection strategies and challenges. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 836–837. ACM (2019)
8. Keskar, D., Palwe, S., Gupta, A.: Fake news classification on Twitter using flume, n-gram analysis, and decision tree machine learning technique. In: Proceeding of International Conference on Computational Science and Applications, pp. 139–147. Springer (2020)
9. Girgis, S., Amer, E., Gadallah, M.: Deep learning algorithms for detecting fake news in online text. In: 2018 13th International Conference on Computer Engineering and Systems (ICCES), pp. 93–97. IEEE (2018)
10. Wang, W.Y.: Liar, liar pants on fire: a new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648 (2017)
11. Rubin, V., Conroy, N., Chen, Y., Cornwell, S.: Fake news or truth? Using satirical cues to detect potentially misleading news. In: Proceedings of the Second Workshop on Computational Approaches to Deception Detection, pp. 7–17 (2016)
12. Granik, M., Mesyura, V.: Fake news detection using Naive Bayes classifier. In: 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), pp. 900–903. IEEE (2017)
13. Tacchini, E., Ballarin, G., Della Vedova, M.L., Moret, S., de Alfaro, L.: Some like it hoax: automated fake news detection in social networks. arXiv preprint arXiv:1704.07506 (2017)
14. Monti, F., Frasca, F., Eynard, D., Mannion, D., Bronstein, M.M.: Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673 (2019)
15. Zheng, X., Zeng, Z., Chen, Z., Yuanlong, Y., Rong, C.: Detecting spammers on social networks. Neurocomputing 159, 27–34 (2015)
16. Potthast, M., Kiesel, J., Reinartz, K., Bevendorff, J., Stein, B.: A stylometric inquiry into hyperpartisan and fake news. arXiv preprint arXiv:1702.05638 (2017)
17. Anika, A., Rahman, M., Islam, A.S.M.M.J., Rahman, C.R., et al.: Comparison of machine learning based methods used in Bengali question classification. arXiv preprint arXiv:1911.03059 (2019)
18. Mahmud, M.R., Afrin, M., Razzaque, M.A., Miller, E., Iwashige, J.: A rule based Bengali stemmer. In: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 2750–2756. IEEE (2014)
Human Friendliness of Classifiers: A Review
Prasanna Haddela, Laurence Hirsch, Teresa Brunsdon, and Jotham Gaudoin
Abstract During the past few decades, classifiers have become a heavily studied and well-researched area. Classifiers are used in many modern applications as a core computing technique. However, it has been observed that many popular and highly accurate classifiers lack an important characteristic: human friendliness. This hinders the ability of end users to interpret and fine-tune the decision-making process, whereas human friendliness supports crucial decision making in applications. This paper presents, in terms of classification, (i) a taxonomy for human friendliness, (ii) a comparison of well-known classifiers with respect to human friendliness, and (iii) a discussion of recent developments and challenges in the field.
Keywords Evolved search queries · Human interpretability · Rule-based classifiers
P. Haddela (B) · L. Hirsch · J. Gaudoin
Sheffield Hallam University, Sheffield S1 1WB, UK
T. Brunsdon
University of Warwick, Coventry CV4 7AL, UK
P. Haddela
Sri Lanka Institute of Information Technology, Colombo, Sri Lanka
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_29
1 Introduction
Classification is an important machine learning technique that involves assigning unseen instances to pre-labeled groups. This is also known as supervised learning. In the process of building classifiers, a “training set” of historical data from a similar scenario is used. Dedicated research efforts have resulted in a few widely accepted classifiers that are used in many different application domains [6]. However, none of these classifiers can be termed a perfect solution, owing to the natural complexity of the problem. Among the various types of classification applications, a subset essentially requires human friendliness. In this research, human friendliness is defined as the end user's ability to interpret and modify (fine-tune) the decision-making criteria. For example, consider a medical diagnostic system and a rule produced for stroke risk prediction. Figure 1 illustrates a rule from a stroke prediction model [20]. Domain experts such as doctors may be highly motivated to use these types of systems because the decision-making process is human interpretable: to the domain experts, it is highly transparent. This is vital before trusting such a system as a decision support tool in certain industries. Since no perfect classification method has been developed, the ability to customize the classifier is always of merit because it allows knowledgeable experts in the field to fine-tune the system. Unfortunately, commonly used classifiers that are widely accepted as highly accurate appear to lack human interpretability and the ability for experts to customize them to suit requirements. For example, support vector machines (SVM) and artificial neural networks (ANN) are not human friendly; they obstruct monitoring and fine-tuning.
This paper reviews commonly used methods and techniques for classification, categorizes them, presents research challenges and ends with possible research directions. The rest of the paper is organized as follows. Section 2 details the taxonomy for text classification. Section 3 outlines some applications that need human-friendly classifiers. Sections 4 and 5 illustrate the types of classifiers and the variety of rule-based classifiers, respectively. Finally, Sect. 6 compares the classifiers in terms of human friendliness.
if hemiplegia then stroke risk 59.0%
else if cerebrovascular disorder then stroke risk 44.7%
else if hypovolaemia and chest pain then stroke risk 14.6%
else if transient ischaemic attack then stroke risk 29.9%
else if age ≤ 70 then stroke risk 4.5%
else stroke risk 9.0%
Fig. 1 Rule from stroke prediction system
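The rule in Fig. 1 is an ordered decision list: conditions are checked from top to bottom and the first one that holds determines the risk. A minimal sketch of such a list in Python follows. The dictionary keys and the function name are hypothetical, and the last condition is read as "age ≤ 70", which is an assumption because the comparison symbol is garbled in the extracted figure.

```python
# Each entry pairs a condition on a patient record with a stroke risk (%).
# Order matters: the first matching rule wins.
RULES = [
    (lambda p: p.get("hemiplegia", False), 59.0),
    (lambda p: p.get("cerebrovascular_disorder", False), 44.7),
    (lambda p: p.get("hypovolaemia", False) and p.get("chest_pain", False), 14.6),
    (lambda p: p.get("transient_ischaemic_attack", False), 29.9),
    # Assumption: the garbled condition in the figure means age <= 70.
    (lambda p: "age" in p and p["age"] <= 70, 4.5),
]
DEFAULT_RISK = 9.0

def stroke_risk(patient):
    """Return the risk attached to the first rule the patient satisfies."""
    for condition, risk in RULES:
        if condition(patient):
            return risk
    return DEFAULT_RISK

example = stroke_risk({"hemiplegia": True, "age": 80})
```

Because the whole model is a short, ordered list, a domain expert can read it, reorder it or edit a single threshold without retraining, which is the kind of modifiability the paper calls human friendliness.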
2 Taxonomy for Classification
In the past, accuracy and efficiency dominated the development of classifiers. However, for some applications an equally important characteristic for domain experts is the human friendliness of the classifier. Human friendliness is defined as the ability of end users to interpret and modify the decision-making process of classifiers. The taxonomy shown in Fig. 2 was developed based on the level of human friendliness of classifiers. It divides classifiers into black box and white box types, and further divides white box classifiers into Type I and Type II. This taxonomy is used to organize the classifiers and the content of this paper.
2.1 Black Box-Type Classifiers
Black box-type classification models are capable of classifying text, but the model can be neither interpreted nor modified by humans. Accordingly, end users of the classifier cannot see how the classification has been carried out and find it difficult to fine-tune the model. Support vector machine (SVM) and artificial neural network (ANN)-based models are good examples of black box-type classifiers.
2.2 White Box-Type Classifiers
The main feature of white box-type classification models is interpretability. They have a higher level of human friendliness owing to their transparency, explainability, and sometimes modifiability. Classifiers in this category can be further divided into two groups: Type I classifiers are human interpretable but not easy to fine-tune or modify, and Type II classifiers are human interpretable and can also be freely modified by domain experts as needed.
Fig. 2 Taxonomy for classification
For example, rules derived from a decision tree are interpretable by end users, but it is difficult to incorporate an end user's knowledge or feedback and reconstruct the tree to produce an optimal decision tree. Therefore, decision trees are categorized as white box-Type I classifiers. Classifiers belonging to Type II are search query-based classifiers or simple decision rules, each of which represents a particular category and is capable of assigning text to pre-labeled groups. These offer human interpretability as well as modifiability. Therefore, white box-Type II classifiers achieve the highest level of human friendliness.
3 Human Friendliness of Medical Applications
White box-type classifiers are heavily used in applications requiring a high level of human friendliness as well as accuracy. In many cases, end users want to know how the classifier works and what would happen if they made some changes before taking an action. This is especially so when users have sound knowledge of the specific classification problem. There are many medical applications where white box classifiers are heavily utilized. Most experts in the medical domain would like to know what the prediction model of a system is and to use their own knowledge to fine-tune it. Medical scoring systems are one example and are used to recommend personalized medicine. They are designed to be human interpretable and to be fine-tuned for maximum accuracy. Therefore, white box-Type II classifiers are more suitable for these applications. Examples used in the health sector include the CHADS2 score for stroke risk [9], the Thrombolysis in Myocardial Infarction (TIMI) score [1], the Apache II score for infant mortality in the ICU [17], the CURB-65 score for predicting mortality in community-acquired pneumonia [21], the stroke prediction system introduced in [20], and the application presented in [18], in which rules are induced by classifying thoracic surgery cases into different classes.
4 Type of Classifiers Versus Human Friendliness
Researchers have invented many classifiers employing different concepts and techniques for various types of datasets. The following are among the most popular.
SVM classifiers: SVM classifiers use linear or non-linear functions to partition the high-dimensional data space into different classes or categories. The main challenge is to identify optimal boundaries between categories. Joachims [15] states that human involvement in parameter tuning is not necessary, as there is a theoretically motivated parameter tuning setup in place. This automatic parameter tuning hides the internal behavior, which means that SVM offers a minimal level of human friendliness.
Neural network classifiers: Artificial neural networks (ANN) are popular classification techniques in many software solutions. These models are inspired by the biological neural networks of animal brains. In an ANN, the input layer is connected to the output layer through hidden layers, and the complex nature of this connectivity makes an ANN difficult to understand. The activation function determines the output of each node for a given set of inputs, which makes the ANN a non-linear classifier. Both ANN and deep NN methods are opaque, or black box type, owing to this complexity, and there have been many research attempts to make them more human-friendly [2, 5, 27].
Naïve Bayes classifiers: Naïve Bayes (NB) is a well-studied classifier in machine learning research. The NB classifier uses the joint probabilities of features and categories to compute the probabilities of the categories for a given instance. For this, it makes the assumption of feature independence: the conditional probability of a feature given a category is assumed to be independent of the conditional probabilities of the other features given that category. Because of its low human friendliness, some research has tried to make it more interpretable [23, 24, 28].
Nearest neighbor classifiers: The k-nearest neighbor (kNN) classification method finds the closest (neighboring) objects within the training set and assigns class labels accordingly. For a new object, the distances to its k nearest neighbors are measured and the label of the majority class is assigned. Euclidean distance and cosine similarity are the commonly used distance or similarity measures [16].
In [29], the authors develop a system that converts kNN decisions into a set of rules, thereby improving the human friendliness that such classifiers otherwise lack.
Rocchio method: The Rocchio method is used for inducing linear, profile-style classifiers. It relies on an adaptation to classification of the well-known Rocchio formula for relevance feedback in the vector space model, and it is perhaps the only classification method rooted in the information retrieval tradition [8] rather than in machine learning. This adaptation was first proposed by Hull [14]. Some linear classifiers consist of an explicit profile (or prototypical document) of the category. This has obvious advantages in terms of interpretability, as such a profile is more readily understandable by a human.
Decision tree-based classifiers: A decision tree (DT) is a hierarchical view of the training set. Class labels are mapped to the leaf nodes of the tree. Parent nodes split their instances among child nodes based on impurity measures; information gain and the Gini index are commonly used in decision tree induction algorithms. When all instances at a child node belong to the same class, splitting stops and the node becomes a leaf; otherwise nodes are split recursively. Decision trees are more transparent and make it easier to build rules for classification. Therefore, DT has a higher level of human friendliness.
Rule-based classifiers: Rule-based classifiers determine the features or term patterns that are most likely to be related to the different classes. Decision criteria consist of disjunctive normal form (DNF) rules, which denote the presence or absence of term patterns in the test object, while the clause head denotes the category. These rules are used for classification. Rule-based classifiers have the highest level of human friendliness.
Genetic algorithm-based classifiers: Genetic algorithm (GA)-based methods iteratively evolve a population toward a desired goal. They are inspired by biological mechanisms of evolution. The genetic operators of selection, crossover (recombination) and mutation are applied to individuals to breed the next generation. Fitness functions are used to measure the strength of an individual [19, 22]. The higher an individual's fitness, the higher the probability that it is selected to take part in creating the next generation. Thus, the genetic material of strong individuals survives throughout the evolutionary process until an optimal or near optimal solution is found. GAs are often used to support other algorithms, for example by parameter tuning. GAs have also been used to generate search query-based text classifiers with very high human friendliness [11].
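The genetic operators named above (selection, crossover and mutation, guided by a fitness function) can be sketched in a few lines. This is a generic illustration on bit strings, not the method of any cited system: the tournament size, mutation rate, elitism and the toy fitness function `sum` (count of ones) are all assumptions. In a text classification setting, the individual might instead encode a search query and the fitness might be its F1 score on training documents.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=40, p_mut=0.02, seed=0):
    """Evolve bit strings toward higher fitness; returns (best, history)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Selection: the fitter of three random individuals becomes a parent.
            parent_a = max(rng.sample(pop, 3), key=fitness)
            parent_b = max(rng.sample(pop, 3), key=fitness)
            # Crossover: splice the parents at a random cut point.
            cut = rng.randrange(1, n_bits)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness), history

best, history = evolve(sum)
```

Because of elitism, the best fitness recorded in `history` can never decrease from one generation to the next, which makes the search easy to monitor.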
5 Human Friendliness: Approaches and Challenges
In the past, there have been many research attempts to improve the accuracy of classifiers, but in most cases the human friendliness of classifiers was neglected. In the recent past, however, a new trend of building methods that add human friendliness to black box-type classifiers has become evident [5, 20, 23]. Yet white box-type classifiers, which are human friendly by nature, appear not to have drawn much attention. This section reviews white box-type classifiers and other related approaches.
5.1 White Box Classifiers: Type I Versus Type II Decision tree-based classifiers are very frequently used with projects. These classifiers often use decision rules based on information theory to branch the parent node and link with the child nodes. This method is transparent and interpretable. But decision trees are not flexible for end users to modify. Therefore, such classifiers are categorized under Type 1. The following two examples illustrate a popular rule-based system and highly compact search query-based text classifiers. In CONSTRUE [10] to classify documents in the “wheat” category of the Reuters dataset an example rule of the type used is illustrated below. if ((wheat & farm) or (wheat & commodity) or (bushels & export) or
Human Friendliness of Classifiers: A Review
(wheat & tonnes) or (wheat & winter & ¬ soft)) then WHEAT else ¬ WHEAT
The following search query was evolved by the GA-SFQ [11] system for the acquisitions category of the popular Reuters dataset:
(buy 10) (company 11) (bid 13) (offer 15)
In GA-SFQ, terms and their proximity to the start of a document are taken into consideration when constructing this type of search query-based classifier. In [35], various search query types were evaluated for classifier effectiveness, with interpretability as a clear objective. These search queries and rules are highly interpretable, and the terms of the classifiers can easily be amended by end users and applied again. These classifiers belong to white box-Type II and have the highest level of human friendliness compared with black box-type and white box-Type I classifiers.
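To show why such classifiers are easy for an end user to read and amend, the two examples above can be written out as a short sketch. This is only an illustration: the function names are hypothetical, tokenization is a plain whitespace split, and each pair such as (buy 10) is assumed to mean "the term occurs within the first 10 words of the document", which follows the description of proximity to the start of a document but is not confirmed in detail by the text.

```python
def is_wheat(text):
    """DNF rule for the "wheat" category, transcribed from the example above."""
    terms = set(text.lower().split())
    has = terms.__contains__
    return (
        (has("wheat") and has("farm"))
        or (has("wheat") and has("commodity"))
        or (has("bushels") and has("export"))
        or (has("wheat") and has("tonnes"))
        or (has("wheat") and has("winter") and not has("soft"))
    )


# (term, n) pairs: assumed to require the term within the first n words.
ACQ_QUERY = [("buy", 10), ("company", 11), ("bid", 13), ("offer", 15)]


def matches_span_first(text, query=ACQ_QUERY):
    """Disjunction of span-first conditions: any one match is enough."""
    words = text.lower().split()
    return any(term in words[:limit] for term, limit in query)
```

Amending the classifier amounts to editing one clause of the rule or one pair in the query list, which is the kind of end-user modification the section refers to.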
5.2 Rule-Based Classifiers: Direct Versus Indirect
Rule-based classifiers belong to white box-Type II, and there are broadly two types: direct methods and indirect methods. Direct methods extract rules from datasets directly, while indirect methods extract rules from other classification models. The latter is an indirect way of achieving a higher level of human friendliness for black box-type and white box-Type I classifiers. Popular direct rule-based classifiers are discussed in Sect. 5.3 and indirect methods in Sect. 5.4.
5.3 Rule-Based Direct Methods
The IREP rule learning algorithm [8] is one of the base algorithms, and it was later improved for better results. RIPPERk [4] is one of the successors of IREP, and its author compared its results with C4.5rules. According to those experiments, RIPPERk is very competitive with respect to error rates but much more efficient on large datasets. Rule sets are very human friendly. Certain types of prior knowledge can also be communicated to the rule learning system. In [32], the authors presented a system capable of automatically categorizing web documents in order to enable effective retrieval of web information. Based on the rule learning algorithm RIPPER, they proposed an efficient method for hierarchical document categorization. [34] describes TRIPPER, a rule induction algorithm,
P. Haddela et al.
and it is an extended version of RIPPER. TRIPPER uses background knowledge in the form of taxonomies over the values of the features used to describe the data. The experiments show that the rules generated by TRIPPER are generally more accurate and more concise than those of RIPPER. Olex [30] is a method for creating rule-based classifiers automatically. It was developed using an optimization algorithm, and both the positive and the negative terms generated by that algorithm are used in constructing classification rules. The paper [25] presents an extended version of Olex: a genetic algorithm, called Olex-GA, for the induction of rule-based text classifiers of the form “classify document d under category c if t_1 ∈ d or … or t_n ∈ d and not (t_{n+1} ∈ d or … or t_{n+m} ∈ d) holds,” where each t_i is a term. Olex-GA relies on an efficient representation with several rules per individual and uses the F-measure as the fitness function. Results of the improved version are presented in [31]. The Olex system also provides classifiers that are accurate, compact, and comprehensible. The classifier developed in [3] uses genetic programming to evolve classifying agents, where each agent evolves a parse-tree representation of a user’s particular information need. An agent undergoes a continual training process, so feedback from the user enables the system to adapt to the user’s long-term information requirements. The Boolean information retrieval system proposed in [33] was developed using genetic programming techniques. Boolean queries are created from randomly selected terms of relevant documents. These queries become elements of the next population, which is used for breeding to produce new elements. The Boolean queries developed for each category are used as classifiers, and they are human interpretable. In [12], a genetic algorithm (GA) is described that is capable of producing accurate, compact, and human interpretable text classifiers. 
Document collections are indexed using Apache Lucene, and a GA is used to construct Lucene search queries. The evolved search queries are binary classifiers. The fitness function helps produce effective classifiers for a particular category when evaluated against a set of training documents. This system was extended in [11], where it was found that a small set of disjunctive Lucene SpanFirst queries meets both accuracy and classifier readability requirements effectively. QuIET [26] also automatically generates a set of span queries from a set of annotated documents and uses the query set to categorize unlabeled texts, but it is not a GA-based algorithm. IREP, RIPPERk, TRIPPER, Olex, Olex-GA, and the other systems outlined above follow direct methods for extracting rule sets.
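The rule form quoted for Olex-GA, together with the F-measure used as fitness, can be sketched as follows. This is an illustrative reading rather than the Olex-GA implementation: the names `olex_rule` and `f_measure` are hypothetical, the rule is parsed as "(some positive term occurs) and not (some negative term occurs)", documents are represented as plain sets of terms, and the F-measure is taken to be the balanced F1 score.

```python
def olex_rule(positive_terms, negative_terms):
    """Build a classifier: a positive term occurs and no negative term occurs."""
    def classify(document_terms):
        return (any(t in document_terms for t in positive_terms)
                and not any(t in document_terms for t in negative_terms))
    return classify


def f_measure(classify, labelled_docs):
    """F1 score of a binary classifier over (terms, in_category) pairs."""
    tp = fp = fn = 0
    for terms, in_category in labelled_docs:
        predicted = classify(terms)
        if predicted and in_category:
            tp += 1
        elif predicted:
            fp += 1
        elif in_category:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical training documents for a "wheat" category.
DOCS = [
    ({"wheat", "export"}, True),
    ({"wheat", "soft"}, False),
    ({"corn"}, False),
    ({"grain"}, True),
]
```

A genetic algorithm would then search over the choice of positive and negative terms, keeping the rules whose F-measure on the training documents is highest.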
5.4 Rule-Based Indirect Methods
The paper [7] describes an algorithm that generates non-overlapping classification rules from linear support vector machines. The algorithm is formulated as a constraint-based optimization problem, and it extracts classification rules iteratively. For this computationally inexpensive algorithm, the authors have discussed
a number of properties of the algorithm and optimization criteria. Unlike the support vector machine itself, these rules are easily understandable to humans. The paper [13] presents a method of deriving an accurate rule set using association rule mining. Commonly used rule-based classifiers prefer small rule sets to large ones, but small rule sets are sensitive to missing values in unseen test data. The paper presents a classifier that is less sensitive to such missing values. The paper [20] describes a human interpretable medical scoring system. It produces decision lists consisting of a series of if…then… statements, which partition a high-dimensional feature space into a series of simple, interpretable decision statements. The authors introduce a generative model called Bayesian rule lists that yields a posterior distribution over possible decision lists. This is an alternative to the CHADS2 score, which is actively used in clinical practice for estimating the risk of stroke in patients. The research work in [29] outlines a hybrid approach to text classification that combines a kNN classifier with a rule-based system. The kNN classifier is used to build the base model for a given labeled dataset, and the rule-based expert system is used to improve accuracy by carefully handling false positives and false negatives. The system can easily be fine-tuned by humans by adding or removing classification rules. The authors of [5] try to break the barrier of human interpretability of deep neural networks (DNNs). Concentrating on the video captioning task, they first extract a set of semantically meaningful topics from the human descriptions that cover a wide range of visual concepts and integrate them into the model with an interpretive loss. They then propose a prediction difference maximization algorithm to interpret the learned features of each neuron.
6 Conclusion
Classification is a mature research field that is often used as a core computing technique in a range of applications. For certain types of applications, human friendliness is vital. Classifiers belonging to the white box-Type II category appear to have the highest level of human friendliness, while white box-Type I classifiers seem to have good human interpretability. A number of research attempts to make black box-type classifiers more human friendly were found. Further, it is noted that the human friendliness of a classifier is critical for certain types of applications.
References 1. Antman, E.M., et al.: The TIMI risk score for unstable angina/non–ST elevation MI: a method for prognostication and therapeutic decision making. JAMA 284(7), 835–842 (2000) 2. Arras, L. et al.: What is relevant in a text document?: An interpretable machine learning approach. PloS One. 12(8), e0181142 (2017) 3. Clack, C., et al.: Autonomous document classification for business. In: Proceedings of the First International Conference on Autonomous Agents, pp. 201–208. ACM (1997) 4. Cohen, W.W.: Fast effective rule induction. In: Proceedings of the Twelfth International Conference on Machine Learning, pp. 115–123 (1995) 5. Dong, Y. et al.: Improving Interpretability of Deep Neural Networks with Semantic Information. arXiv preprint arXiv:1703.04096 (2017) 6. Espejo, P.G., et al.: A survey on the application of genetic programming to classification. IEEE Trans Syst Man Cybern C Appl Rev 40(2), 121–144 (2010) 7. Fung, G., et al.: Rule extraction from linear support vector machines. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 32–40. ACM (2005) 8. Fürnkranz, J., Widmer, G.: Incremental reduced error pruning. In: Proceedings of the 11th International Conference on Machine Learning (ML-94). pp. 70–77. Morgan Kaufmann (1994) 9. Gage, B.F., et al.: Validation of clinical classification schemes for predicting stroke: results from the National Registry of Atrial Fibrillation. JAMA 285(22), 2864–2870 (2001) 10. Hayes, P.J., et al.: Tcs: a shell for content-based text categorization. In: 1990 Sixth Conference on Artificial Intelligence Applications, pp. 320–326. IEEE (1990) 11. Hirsch, L.: Evolved Apache Lucene SpanFirst queries are good text classifiers (2010) 12. Hirsch, L., et al.: Evolving Lucene search queries for text classification (2007) 13. Hu, H., Li, J.: Using association rules to make rule-based classifiers robust. In: Proceedings of the 16th Australasian database conference, vol. 39, pp. 
47–54. Australian Computer Society, Inc. (2005) 14. Hull, D.A., et al.: Method combination for document filtering. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 279–287. ACM (1996) 15. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Machine Learning: ECML-98, pp. 137–142 (1998) 16. Khan, A., et al.: A review of machine learning algorithms for text-documents classification. J. Adv. Inf. Technol. 1(1), 4 (2010) 17. Knaus, W.A., et al.: APACHE II: a severity of disease classification system. Crit. Care Med. 13(10), 818–829 (1985) 18. Koklu, M., et al.: Applications of rule based classification techniques for thoracic surgery. In: Managing Intellectual Capital and Innovation for Sustainable and Inclusive Society: Managing Intellectual Capital and Innovation; Proceedings of the MakeLearn and TIIM Joint International Conference 2015, pp. 1991–1998. ToKnowPress (2015) 19. Koza, J.R.: Genetic Programming : On the Programming of Computers by Means of Natural Selection. MIT Press (1992) 20. Letham, B., et al.: Interpretable classifiers using rules and Bayesian analysis: building a better stroke prediction model. Ann. Appl. Stat. 9(3), 1350–1371 (2015) 21. Lim, W.S., et al.: Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax 58(5), 377–382 (2003) 22. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press (1996) 23. Mori, T.: Superposed Naive Bayes for accurate and interpretable prediction. In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), pp. 1228–1233. IEEE (2015) 24. Ng, A.Y., Jordan, M.I.: On discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes. In: Advances in Neural Information Processing Systems, pp. 841– 848 (2002)
25. Pietramala, A., et al.: A genetic algorithm for text classification rule induction. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 188–203. Springer (2008) 26. Polychronopoulos, V., et al.: QuIET: a text classification technique using automatically generated span queries. In: 2014 IEEE International Conference on Semantic Computing (ICSC), pp. 52–59. IEEE (2014) 27. Ribeiro, M.T., et al.: Why should I trust you?: Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM (2016) 28. Ridgeway, G., et al.: Interpretable Boosted Naïve Bayes Classification. In: KDD, pp. 101–104 (1998) 29. Román, J.V., et al.: Hybrid approach combining machine learning and a rule-based expert system for text categorization. Presented at the (2011) 30. Rullo, P. et al.: Learning rules with negation for text categorization. In: Proceedings of the 2007 ACM Symposium on Applied Computing, pp. 409–416. ACM (2007) 31. Rullo, P., et al.: Olex: effective rule learning for text categorization. IEEE Trans. Knowl. Data Eng. 21(8), 1118–1132 (2009) 32. Sasaki, M., Kita, K.: Rule-based text categorization using hierarchical categories. In: 1998 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2827–2830. IEEE (1998) 33. Smith, M.P., Smith, M.: The use of genetic programming to build Boolean queries for text retrieval through relevance feedback. J. Inf. Sci. 23(6), 423–431 (1997) 34. Vasile, F., et al.: TRIPPER: Rule learning using taxonomies. In: Advances in Knowledge Discovery and Data Mining, pp. 55–59 (2006) 35. Hirsch, L., Brunsdon, T.: A comparison of Lucene search queries evolved as text classifiers. Appl. Artif. Intell. 32(7), 768–784 (2018)
Bengali Context–Question Similarity Using Universal Sentence Encoder
Mumenunnessa Keya, Abu Kaisar Mohammad Masum, Sheikh Abujar, Sharmin Akter, and Syed Akhter Hossain
Abstract In natural language processing, the similarity between two texts is judged by their similarity score. Many recent NLP applications, such as text summarization, question answering, text generation, and text mining, depend on machine-generated text. The accuracy of a response text is measured by its similarity to the corresponding text or to a human-written text. Comparing two texts and measuring their similarity shows whether the two texts are lexically or semantically similar. If two texts are related to each other at the word or character level, they are lexically similar. If the texts are related in meaning but not at the word or character level, they are semantically similar. In this research, we measure the similarity of context and question for our question answering system, and then find the most similar answer for the corresponding question. We used the universal sentence encoder for embedding and measured the similarity using the cosine distance between texts. We used a deep averaging network to find the most similar text. To evaluate the similarity model, we calculated the Pearson correlation for our dataset and obtained a coefficient of 0.41.
Keywords Text similarity · Embedding · Universal sentence encoder · Question-answering · Pearson correlation
M. Keya (B) · A. K. M. Masum · S. Abujar · S. Akter · S. A. Hossain
Department of Computer Science and Engineering, Daffodil International University, Dhaka 1212, Bangladesh
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_30
1 Introduction
Current applications of natural language processing illustrate the need for effective schemes for matching texts and sentences. Sentence similarity has many Internet-based applications. A significant part of the learning process is the assessment of the knowledge gained by the student [6], and question answering is an effective way of acquiring new knowledge. Determining the similarity between sentences is one of the most important tasks and has a broad effect on many text applications. In information retrieval, similarity measures are used to assign a ranking score between queries and the content of a corpus. Question answering applications require similarity to be identified between questions and content, or between one question and another [9]. When a user submits a question to a search engine or a system, the question is sent to the answering module, which compares it with the stored questions; by measuring the score, the system replies to the user. There are several types of question similarity measurement: semantic, syntactic, and statistical [2]. In contrast to conventional information retrieval, which returns whole documents or their relevant sections, a question answering system (QAS) is designed to provide more concise answers and, specifically, to allow users to submit their questions in natural language [3]. The initial response of a QAS is often not a coherent answer. Text-to-text generation can help turn the answer to a question into an appropriate response [10]. Text matching can determine the closeness between two sentences or words. Bengali text processing differs from that of other languages. Its pre-processing phase has some limitations, such as deleting unwanted characters and removing spaces and noise; this step is called text cleaning. Unicode conversion is therefore the best approach for analyzing Bangla text, and removing irrelevant content yields suitable Bengali text for analysis. 
Measuring sentence similarity therefore relies on word matching after each word has been reduced to a standard form [8]. In this paper, we present text-to-text semantic question–answer similarity measurement. A question is compared with each context to obtain a score, and the score is then matched with the answer, using the universal sentence encoder (USE) [11]. Sentence similarity scores are computed from the embeddings produced by the USE, which encodes text into a high-dimensional vector for semantic similarity.
2 Related Work
Calculating semantic similarity between sentences or texts is required by many natural language processing (NLP) tasks, such as search, query suggestion, and question answering (QA) [1]. It can increase the efficiency of question answering methods. Calculating the match between questions is the key step, and it largely determines the quality of the answer. Much research has been done on similarity. Song et al. [2] proposed a model for measuring the similarity between a user’s question and the questions in an FAQ database. They consider both statistical and semantic measures of question similarity. For the statistical measure, they apply a dynamically formed vector; for semantic similarity, they use word similarity based on the WordNet database. Before computing the similarity between queries, they first apply NLP pre-processing, which includes POS tagging to find the necessary information, stemming to reduce inflected words, and stop-word removal to discard useless noise. They report excellent performance. To match questions and answers, [3] applies a semantic knowledge base, namely a word co-occurrence corpus method. The author provides a mechanism for evaluating question similarity based on the relative amount of overlap and the length of the queries and answers. Word co-occurrence is a form of term association widely used in information retrieval models. The question similarity measure uses SWordSim and LenSim: SWordSim measures the number of terms shared by the query and a question in the FAQ corpus, and LenSim accounts for the length of the questions, that is, the number of terms in the sentences. Jeon et al. [5] propose an automatic method for finding questions that have the same meaning. They find semantically similar questions by computing question–question similarity from similar answers and from the word overlap between queries. They apply cosine similarity and two language modeling techniques; cosine similarity is a symmetric measure that is widely used in IR and NLP tasks. Using the language modeling technique, they convert each answer into a query, retrieve other answers, and use ranks instead of scores. Unsupervised techniques have been applied to the automatic grading of students’ short answers [6]. The authors combine knowledge-based and corpus-based measures of text similarity and examine the effect of domain and corpus size on the corpus-based measures. A biomedical question answering system [7] measures the similarity between question–answer pairs and documents in order to choose the answer to a multiple-choice question. It applies several similarity matching techniques (simple word overlap, dependency graph matching, and feature-based vector similarity) covering semantic, lexical, and syntactic features. Similarity between sentences or short texts is an effective way of improving information retrieval. Strategies for finding similarities between texts (documents) focus on analyzing shared terms. Sentence similarity plays a significant role in text-related research [4]. There is an algorithm that considers the semantic information contained in sentences and the order of their words. The use of a lexical database enables the method to model human common-sense knowledge, and the incorporation of corpus statistics enables adaptation to different domains. A semantic similarity method for text has been developed in this way. For text summarization, Masum et al. [8] proposed a sentence similarity measure, since finding similar phrases is a good route to a better summary. They measured similarity on Bengali text and, to increase effectiveness, compared human and machine summaries. By applying different algorithms, they obtained suitable results. In this research, we experimentally examine the similarity of questions and contexts in order to give the exact answer. A question is compared with each context to obtain a similarity score, and the answer associated with the best similarity score is predicted. The similarity score is measured by the cosine distance between sentences. Measuring the similarity of contexts and queries helps to find an accurate and better answer.
3 Methodology
In this section, we discuss the model we used. Little work has been done so far on similarity assessment for QA that provides an answer by comparing scores, which is why we attempt to measure similarity for question answering. The workflow of our model is shown in Fig. 1.
Fig. 1 Flow diagram for this work (stages: Data Collection, Clean Text, Deep Averaging Network, Cosine Similarity, PCA Transform, Model Evaluation, Normalization, Pearson Correlation, Output Prediction)
3.1 Data Properties
A consistent dataset is required to obtain good results on Bangla content. We use our own dataset, collected from the Web through Google and based on general knowledge. It covers general knowledge of sports, international and national affairs, and more. There are some barriers to collecting Bengali information, for instance in forming Bengali content. Nevertheless, we tried to reduce most of these barriers in our dataset while keeping the original Bengali content. Our dataset comprises context, question, and answer segments. In our work, we compare the similarity score of a question with each context and predict the answer from the context with the best score. After assembling the dataset, we need to prepare clean data. To clean the text, we remove whitespace, digits, and so on.
3.2 Problem Understanding
Our question answering dataset contains context, question, and answer segments. Given text A and text B (A is the context and B is the question), our model computes a score indicating how similar the two texts are. To test the similarity, we apply the commonly used cosine similarity. Cosine similarity matches documents based on the words they have in common. It is a metric used to determine how similar documents or texts are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors in a multidimensional space: the smaller the angle between the documents, the greater the cosine similarity. Suppose A = text a and B = text b. Then

\[
\mathrm{Similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}} \tag{1}
\]

The library has both methods and functions to measure the similarity of datasets. The measure is symmetric, which means that the score obtained for text A against text B is the same as the score for text B against text A.
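The cosine similarity in Eq. (1) can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and plain Python lists stand in for the embedding vectors of the context and the question.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_contexts(question_vec, context_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(question_vec, c))
              for i, c in enumerate(context_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because the dot product and the product of the norms are both unchanged when the arguments are swapped, the score for (A, B) equals the score for (B, A), which is the symmetry noted above.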
3.3 Applied Approach
The intuition behind deep feed-forward neural networks is that each layer learns a more abstract representation of the input than the previous one. The deep averaging network (DAN) of Iyyer et al. [14] is a deep unordered model that obtains near state-of-the-art accuracy on a variety of sentence- and document-level tasks with only a few minutes of training on an average laptop computer. A DAN works as follows: take the vector average of the embeddings associated with an input sequence of tokens, pass that average through one or more feed-forward layers, and so on. DANs have been applied to factoid question answering tasks. A DAN takes much less training time than other methods such as the transformer encoder. In principle, the DAN encodings of two sentences should be equal when the sentences contain the same words, because word order does not matter; in our checks, however, the encodings were not the same. The applied model architecture is shown in Fig. 2.
Fig. 2 Applied model (Iyyer et al. [14]) architecture
In the model, input 1 is the context and input 2 is the question. After the model is loaded from the hub, both texts are embedded. We then measure the cosine distance between the two texts to find the most similar text. We used PCA in finding the similarity. Finally, we used a query function that takes input from the user to find the most similar text: the user gives any question, and the score function provides the answer. The algorithm for finding the similarity between texts is given below.
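The two DAN steps named above (average the token embeddings, then pass the average through feed-forward layers) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pre-trained encoder: the embeddings and layer weights are made-up numbers, the activation is assumed to be tanh, and all names are hypothetical.

```python
import math


def average_embeddings(token_vectors):
    """Component-wise mean of the token embedding vectors."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def dense(vector, weights, biases):
    """One feed-forward layer with tanh activation; weights is a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
            for row, b in zip(weights, biases)]


def dan_encode(token_vectors, layers):
    """Average the embeddings, then apply each feed-forward layer in turn."""
    hidden = average_embeddings(token_vectors)
    for weights, biases in layers:
        hidden = dense(hidden, weights, biases)
    return hidden


# Toy 2-dimensional embeddings and a single 2x2 layer (illustrative values).
EMBED = {"who": [1.0, 0.0], "won": [0.0, 1.0], "cup": [1.0, 1.0]}
LAYERS = [([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1])]

encoding = dan_encode([EMBED[t] for t in ["who", "won", "cup"]], LAYERS)
```

In this simplified form the encoding depends only on which tokens are present, not on their order, since the average is unchanged by reordering.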
Algorithm 1 Similarity measure of context and question
Input 1: a is the context
Input 2: b is the question
function get_score(s, q)
    …
    return a, b, score
end
function pca_transform(r)
    …
end
16
3.4 Universal Sentence Encoding
For similarity measurement on our Bangla question answering dataset, we apply the universal sentence encoder (USE) [11, 12]. It encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The model is trained and optimized for greater-than-word-length text, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and a variety of tasks, with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The USE is available pre-trained in TensorFlow. It comes in two variants: one uses a transformer encoder and the other a deep averaging network. Our model uses the transformer encoder variant. The transformer-based sentence encoding model constructs sentence embeddings using the encoding sub-graph of the transformer architecture [11]. Sentence embeddings can be used to compute semantic similarity scores at the sentence level, which achieves excellent performance in semantic sentence matching.
3.5 Principal Component Analysis
PCA is a way to find the essential information carried by the components of an embedding. It tries to find what is different and unique about each embedding and discards what the embeddings have in common. PCA is a dimensionality reduction method that is often used to reduce the dimensionality of large datasets by transforming a large set of variables into a smaller one that still contains most of the information in the large set. In general, reducing dimensionality loses some information. PCA-based dimensionality reduction tends to minimize that information loss under certain signal and noise models. PCA is applied as follows: a dataset with zero mean is produced by subtracting the mean from each data dimension; the covariance matrix is calculated; its eigenvectors and eigenvalues are calculated; the eigenvectors with the highest eigenvalues represent the principal components of the dataset, those with the lowest eigenvalues (the least significance) are removed, and a feature vector is formed [13].
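The steps listed above can be sketched for the simplest case, keeping only the first principal component. This is a minimal illustration and not the `pca_transform` function of Algorithm 1: the names are hypothetical, power iteration is swapped in as an assumed stand-in for a full eigen-decomposition, and the data are assumed to have non-zero variance.

```python
import math


def center(rows):
    """Subtract the mean of each dimension so the data have zero mean."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of zero-mean rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(matrix, iterations=200):
    """Eigenvector with the largest eigenvalue, found by power iteration."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        vec = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


def pca_project(rows):
    """Project each row onto the first principal component."""
    centered = center(rows)
    component = top_component(covariance(centered))
    return [sum(x * c for x, c in zip(row, component)) for row in centered]
```

Keeping more than one component would require removing each found component from the covariance matrix before the next power iteration, or using a numerical library's eigen-decomposition instead.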
3.6 Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle θ between them. It is most commonly used in high-dimensional positive spaces. For example, in information retrieval and text mining, each term is assigned a different dimension, and a document is characterized by a vector in which the value of each dimension corresponds to the number of times the term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are. The cosine of two non-zero vectors can be derived from the Euclidean dot product formula

\[
a \cdot b = \lVert a \rVert \, \lVert b \rVert \cos\theta \tag{2}
\]

Given two attribute vectors a and b, the cosine similarity is cos θ. For sentence similarity, the attribute vectors a and b are usually the term frequency vectors of the documents.
4 Experiment and Outcome
We used TensorFlow Hub, a repository of reusable assets for machine learning with TensorFlow. In our research, several algorithms are used to measure similarity, and they provide effective output. This helps us judge the extent of the difference between a Bengali question and a context. We used the dataset to find the best answer by measuring the similarity score using cosine similarity. After training the model, we find the proper answer to the question.
Table 1 Model evaluation using Pearson correlation coefficient
Analysis                          Result
Pearson correlation coefficient   0.41
P-value                           7.69
4.1 Model Evaluation
For model evaluation, we applied the Pearson correlation coefficient (PCC). In statistics, it measures the linear relationship between two variables a and b. Its value lies between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. For our model, the PCC is 0.40506… and the P-value is 7.686…. The PCC is computed on the cosine similarity values (Table 1).
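For reference, the Pearson correlation coefficient described above can be computed as in the following sketch. The function name is hypothetical and the inputs are placeholder lists, not the scores from our dataset; both inputs are assumed to have non-zero variance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
```

Written this way, the coefficient is the cosine of the angle between the two mean-centered samples, which is why its value always lies between −1 and +1.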
4.2 Result
The user provides a question in order to find the most similar relevant contexts. After calculating the cosine distance, the model outputs the most similar contexts, together with their answers, for the given question. The result of this research is shown in Table 2.
Table 2 Output sample of the top 5 most similar texts for the question
Question: …
Similar context    Answer    Score
…                  …         0.8
…                  …         0.8
…                  …         0.8
…                  …         0.8
…                  …         0.8
5 Conclusion and Future Work
This experiment has presented question–answer similarity measurement using semantic similarity. By measuring the similarity of context and query, the model provides the best answer. Our dataset uses general knowledge-based data to find answers and is based on factoid question answering. Factoid question answering provides concise facts and works with short contexts and statements. The similarity score between questions and contexts provides the best answer to a given question, so the response to a user is better if the given question matches a question in the dataset and the score is measured against the context. To improve semantic matching for QA, we applied the universal sentence encoder, and to find unique and common information we used principal component analysis. Finally, cosine similarity was used to measure the correspondence between the two texts. There are various ways to improve our work in the future toward an automatic question answering system based on similarity scores. In this paper, we have worked with factoid questions. In the future, we will enlarge our dataset and work on similarity measurement for complex questions.
Acknowledgements We gratefully acknowledge support from the DIU NLP and Machine Learning Research Lab for providing GPU support. We thank the Dept. of CSE, Daffodil International University, for providing the necessary support. We also thank the anonymous reviewers for their valuable comments and feedback.
Abstraction Based Bengali Text Summarization Using Bi-directional Attentive Recurrent Neural Networks Md. Muhaiminul Islam, Mohiyminul Islam, Abu Kaisar Mohammad Masum, Sheikh Abujar, and Syed Akhter Hossain
Abstract Text summarization is recognized as a vital problem in deep learning and natural language processing (NLP). Automatic summarization is critical for summarizing massive texts correctly and quickly, a task that is both time consuming and expensive without the help of machines. We intended to build a model that can generate fluent, effective, human-like summaries of Bengali text. We built the model with bi-directional RNNs with LSTM cells in the encoding layer and in the decoding layer, and we used an attention mechanism to improve the result. We followed the structure of the sequence-to-sequence model, which is mostly used in machine translation. We also used a pre-trained word embedding file created specifically for Bengali NLP research. We tried to keep the training loss as low as possible and to build a useful, human-like text summarizer. After the experiment, our model can predict quite meaningful and fluent text summaries with a training loss of 0.007. Keywords Natural language processing · Deep learning · Text summarization · Bi-directional RNNs · Sequence to sequence · LSTM · Attention model
Md. M. Islam (B) · M. Islam · A. K. M. Masum · S. Abujar · S. A. Hossain Department of Computer Science and Engineering, Daffodil International University (DIU), Dhaka, Bangladesh © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_31
1 Introduction Propelled by modern innovations, data is to the current century what oil was to the previous one. Massive amounts of data are disseminated across the world today, and much of this valuable data is text. A gigantic amount of data is generated every day for working purposes, so it should be utilized properly. With such a huge amount of text present in the digital space, there is a need for deep learning techniques that can shrink longer texts into meaningful summaries. Text summarization also reduces reading time and speeds up the search for information. Summarizing a large text document is fairly easy for humans: they select the salient words and use them to generate informative summaries. For a machine, however, it is a challenging task. To summarize a text, a machine needs to understand the language, to reason, and to have common-sense knowledge like a human, and providing these abilities to a machine is very difficult. Fortunately, a deep sequential neural network known as the recurrent neural network (RNN) makes it possible for a machine to analyze text in a way that resembles human reasoning. In this study, we propose a model that produces an abstractive summary of a given document. To produce a short, fluent summary, we first studied the problem and analyzed the situation: what obstacles can arise during this work and how we can overcome them. All of this was studied before starting the work. We used the sequence-to-sequence model with bi-directional RNNs. The RNNs were arranged in two layers, each built from LSTM cells. We used Bahdanau attention [1] to improve the model. The encoder converts the input sentences into fixed-length vectors, and the decoder decodes them and generates an appropriate sequence as the summary. This seq2seq approach was originally applied in machine translation; we used the same concept to produce abstractive summaries from documents. In this study, we discuss not only our model but also the limitations and analysis of previous models. The factors needed to improve the model and make the summaries more human-like are also discussed.
2 Literature Review Building such a model for production needs a lot of language resources to give fruitful results. However, Bengali researchers have arrived at some solutions by following special techniques. Neural machine translation, proposed by Kalchbrenner and Blunsom in 2013 [2], came in handy for these related natural language processing tasks. Sutskever et al. proposed a model for seq2seq learning with deep neural networks, trained on the WMT-14 dataset, for translating English to French. They used a multilayered LSTM to map the input sequence to a vector of fixed dimensionality, and a second LSTM to decode the target sequence from that vector [3]. The attention mechanism also came into play in the field of text summarization. A neural attention-based model was proposed in 2015 by Alexander M. Rush and his team [4]. Attention-based models are widely used in machine translation. Another work published in 2015 addressed attention-based machine translation and described two classes of attention mechanism, global and local [5]. Local attention looks at a subset of the source words, whereas global attention attends to all the source words. Both approaches are efficient and useful in machine translation. There has been very little research on Bangla text processing. Text summarization with different neural network approaches has become one of the hot topics in Bangla research. In 2017, a model was proposed for extractive Bangla text summarization. It extracted the prime words from the passage, tokenized them, and analyzed them before producing a summary [6]. Another work addressed abstractive text summarization using seq2seq with two-layer recurrent nets, where an LSTM encoder-decoder model was used to produce the correct sequence [7]. The approach was a fruitful way to build an abstractive summarizer for Bengali and marginally reduced the limitations of previous related works, but the dataset was too small to establish an accurate summarizer. Word2Vec models are also very important when summarizing text data. For the Bangla language, a study published in 2019 performed Bangla text summarization with the help of Word2Vec. That study clarified how Word2Vec helps to analyze the data and prioritize it when making a summary [8].
3 Research Methodology Our work ranges from preparing our own large text dataset to analyzing and building a sequence-to-sequence model for a fluent and accurate text summarizer. We divided our working procedure into two steps: analysis of data and experiment. To implement our model, we used the CPU version of TensorFlow 1.14. Figure 1 describes our entire workflow.
3.1 Task Selection We had to analyze the entire situation before implementing the model, and the pros and cons were noted down carefully. Since the model is for text summarization, our input sequence is the text, denoted by T, and the output sequence is a summary (shortened text), denoted by S. The output sequence is shorter than the original text, that is, S < T. Figure 2 describes the scenario.
Fig. 1 Workflow of the proposed framework
The black box in the above image is simply the model itself, which holds two recurrent neural networks with an attention mechanism in the decoding layer.
Fig. 2 Brief description of the model
3.2 Data Collection Neural networks predict a sequence through a probability mechanism, so the data fed to the networks is a very sensitive issue, and the dataset should maintain a certain quality. No suitable dataset was available online. Since the model needs enough data for training, we collected documents from social media Web sites and online newspapers. After about one month of effort, we had collected a total of 3000 documents with their summaries.
3.3 Pre-analysis of Data After we had finished collecting the data, the pattern of the entire dataset needed to be analyzed carefully, because cleaning and preprocessing cannot be done without proper analysis. We measured the text and summary lengths for the entire dataset before cleaning it up and found it messy and unorganized. Figure 3 describes it in brief.
3.4 Data Preprocessing To obtain accurate results, the data needs to be well cleaned and processed. We went through several steps. First, we expanded contractions in the texts and the summaries. Some words can also be written in a short form, and having different words with the same meaning can lead a neural network to unrealistic predictions, so we replaced the short forms with their full forms. We removed unwanted characters and numbers from the texts. We also removed the stop-words from the main text but made sure that the summary still contains them, because a summary without stop-words would not be understandable.
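The cleaning steps above can be sketched roughly as follows. This is not the authors' pipeline: the short-form table and the stop-word set are tiny hypothetical samples, and the character filter (keep the Bengali block up to U+09E5, which excludes Bengali digits, and drop everything else) is an assumption about what "unwanted characters and numbers" means.

```python
import re

# Hypothetical samples; a real pipeline would load much larger lists.
SHORT_FORMS = {"ডা.": "ডাক্তার"}   # short form -> full form
STOP_WORDS = {"ও", "এবং"}

# Assumption: keep Bengali letters and signs (U+0980..U+09E5) and whitespace;
# Bengali digits (U+09E6..U+09EF), Latin characters and punctuation are dropped.
_UNWANTED = re.compile(r"[^\u0980-\u09E5\s]")

def clean(text, remove_stop_words):
    """Expand short forms, strip unwanted characters, optionally drop stop-words."""
    for short, full in SHORT_FORMS.items():
        text = text.replace(short, full)
    text = _UNWANTED.sub(" ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)

raw = "ডা. রহিম ও করিম ২০২০!"
text_out = clean(raw, remove_stop_words=True)      # source text: stop-words removed
summary_out = clean(raw, remove_stop_words=False)  # summary: stop-words kept
```

Keeping the flag explicit mirrors the asymmetry described above: the same cleaning function is applied to both sides, but only the source text loses its stop-words.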
Fig. 3 Data pattern before cleaning
3.5 Word Embedding Word embedding has become very popular since its introduction and is used widely in NLP. It is popular for measuring sentence similarity and is effective in certain ways [9]. In text summarization, the main idea of a text depends not only on word frequency in the sentences but also on word similarity. Initially, we chose two of the best pre-trained Bengali word embedding files, known as “bn_w2v_model” [10] and “bn”, for sentence analysis. “bn_w2v_model” performed well from our perspective, so we used it to improve our model. We counted the vocabulary, measured word occurrence, and analyzed sentence similarities with the help of this word2vec model.
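Counting the vocabulary and checking it against a pre-trained embedding can be sketched as below. The sketch assumes whitespace tokenization and represents the embedding only by its set of known words; loading the actual “bn_w2v_model” file is not shown, and the placeholder tokens are hypothetical.

```python
from collections import Counter

def vocabulary_stats(sentences, embedding_vocab):
    """Count word occurrences and report which words the embedding does not cover."""
    counts = Counter(token for s in sentences for token in s.split())
    missing = sorted(w for w in counts if w not in embedding_vocab)
    coverage = 1.0 - len(missing) / len(counts) if counts else 0.0
    return counts, missing, coverage

# Placeholder tokens stand in for Bengali words (hypothetical data).
sentences = ["a b a", "b c"]
embedding_vocab = {"a", "b"}
counts, missing, coverage = vocabulary_stats(sentences, embedding_vocab)
```

Words that end up in `missing` are the candidates for an unknown-word token later in the pipeline.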
3.6 Used Models To build a proper sequence-to-sequence text summarizer, we had to combine multiple models to obtain a fluent and appropriate summary, since different types of models serve different purposes. As we work with text data, LSTM is needed because the model has to memorize the sequence. A machine translation (MT) model is needed to teach the model about the sequence of text. A neural machine translator typically applies an encoder-decoder architecture to go from text to a vector representation and back to text.
3.6.1 Machine Translation Model
Machine translation is a heuristic method in which text in one language is translated into another. The input is given in a source language; the encoder takes the input as a sequence and converts it to vectors, and the decoder predicts the output as the desired sequence in the target language. We used this concept to generate a summary.
3.6.2 Bi-directional RNNs with Encoder and Decoder
Traditional sentence generation models built on RNNs such as GRU and LSTM may show promising performance in abstractive summary generation, but they still have some issues that need to be solved. They face a common problem when dealing with long sequences. Most models use a traditional unidirectional decoder, which reasons only about the past and cannot retain future context when predicting. Therefore, for a long sentence, it sometimes generates unbalanced outputs, and a summarizer based on unidirectional RNNs captures only part of the attentional regularities. To solve this issue, bi-directional RNNs are used. Given a text document D consisting of tokens $X = (x_1, x_2, \ldots, x_{T_x})$ drawn from a vocabulary V, the neural network models the probability $P(Y \mid X)$ of a sequence $Y = (y_1, \ldots, y_{T_y})$ that retains the essence of X [11]. The bi-directional RNN was first introduced by Schuster and Paliwal (1997). In a bi-directional RNN (B-RNN) structure, one RNN runs from back to front alongside the forward RNN, which runs from front to back. The entire structure is shown in Fig. 4.
Fig. 4 B-RNN structure
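The idea of running one RNN front to back and another back to front can be sketched with a deliberately tiny model. The sketch below uses a scalar hidden state and a plain tanh recurrence instead of LSTM cells, and the weights are arbitrary illustrative values; it only shows how each position ends up with both a past-aware and a future-aware state.

```python
import math

def rnn_pass(inputs, w_x, w_h):
    """Scalar-state RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}), starting from h = 0."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional_states(inputs, fwd=(0.5, 0.8), bwd=(0.5, 0.8)):
    """Pair, for every position, the forward state with the backward state."""
    forward = rnn_pass(inputs, *fwd)
    # The backward RNN reads the sequence reversed; its states are flipped back
    # so that index t in both lists refers to the same input position.
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))

states = bidirectional_states([1.0, 2.0, 3.0])
```

At the first position the forward state has seen only the first input, while the backward state already summarizes everything that follows, which is the future context a unidirectional model lacks.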
Table 1 Sample result: Original text | Original summary | Input words | Responded summary
Here, x is the input sequence and o is the output sequence. The two-layer encoder-decoder mechanism was first introduced by Cho et al. [12]. It is also used in very popular tasks such as generating Wikipedia content by summarizing long sequences [13]. To build the summarizer model, we used two-layer RNNs. The encoder takes the input as a sequence of fixed length, and the decoder works on the output sequence. The hidden unit is used to improve training and memory capacity. Suppose the input sequence $X = (x_1, \ldots, x_{T_x})$ is read by the encoder, and the input of the model is the input words given in Table 1. Then

$$h_t = f(x_t, h_{t-1}) \tag{1}$$

and $c = q(\{h_1, \ldots, h_{T_x}\})$, where $h_t$ is the hidden state and $c$ is the context vector generated from the sequence of hidden states. Suppose the decoder has predicted the sequence $\{y_1, \ldots, y_{T_y}\}$, which is the responded summary given in Table 1. Then the probability is

$$p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) \tag{2}$$

With this mechanism, attention added in the decoding layer, and a softmax function at the end, we produced the desired sequence.
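The factorized probability of the predicted sequence can be illustrated numerically. In the sketch below, each decoding step is assumed to produce a vector of unnormalized scores over the vocabulary; a softmax turns the scores into the conditional distribution for that step, and the sequence probability is the product of the chosen tokens' probabilities, accumulated in log space for numerical stability. The scores are toy values, not model outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(step_logits, token_ids):
    """log p(y) = sum over t of log p(y_t | y_1..y_{t-1}, c)."""
    log_p = 0.0
    for logits, token in zip(step_logits, token_ids):
        log_p += math.log(softmax(logits)[token])
    return log_p

# Two decoding steps over a 4-word vocabulary with uniform scores (toy values).
step_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
token_ids = [1, 3]
p = math.exp(sequence_log_prob(step_logits, token_ids))
```

With uniform scores every step contributes a factor of 1/4, so the two-token sequence has probability 1/16.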
3.6.3 Attention Model
We used the attention mechanism at the decoder layer to predict an accurate sequence. The attention model is a method that jointly learns to align and translate. Attention is used in the decoder layer to identify which parts of the input sequence are relevant and necessary for predicting the output sequence, so the important words are highlighted throughout the attention mechanism. We used Bahdanau attention in our model. To reduce computational cost, it uses a single-layer multilayer perceptron [1].
Fig. 5 Sequence-to-sequence model
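Additive (Bahdanau-style) attention, as used in the decoder, can be sketched in a few lines. The parameter names W_s, W_h and v, the dimensions, and the toy values are assumptions for illustration; the sketch scores each encoder state against the current decoder state with a single tanh layer, normalizes the scores with a softmax, and returns the weighted sum of encoder states as the context vector.

```python
import math

def matvec(W, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """score_j = v . tanh(W_s s + W_h h_j); weights = softmax(scores)."""
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    decoder_state=[0.0, 0.0],
    encoder_states=[[1.0, 0.0], [0.0, 0.0]],
    W_s=identity, W_h=identity, v=[1.0, 1.0],
)
```

In this toy setting the first encoder state receives the larger weight because it produces the larger score; in a trained model the weights indicate which input words matter for the word being generated.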
3.6.4 Seq2Seq Model
Seq2Seq models use LSTM cells together with an encoder and a decoder. To generate a good summary, the input sequence needs to be cleaned first. Using the Word2Vec file, we counted our entire vocabulary size and determined the unknown words, which helped us fix the input lengths. We use four special tokens: an unknown-word token, a padding token, an end-of-sequence token, and a start token. Unknown words are replaced by the unknown-word token in the vocabulary list. The padding token pads the sentences in a batch, which are similar in length, to a common length. The end-of-sequence token marks the endpoint of each sentence and signals it to the encoder. The start token starts the generation of the output sequence in the decoder. The whole architecture is shown in Fig. 5. In the diagram, the input is denoted by a1, a2, a3 and the output by b1, b2, b3.
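The role of the special tokens can be sketched as follows. The token spellings ([PAD], [UNK], [EOS], [GO]) and the helper names are hypothetical, chosen only for this illustration; the sketch shows unknown-word replacement, end-of-sequence marking, padding within a batch, and the start token that opens the decoder input.

```python
# Hypothetical token spellings, used only for this illustration.
PAD, UNK, EOS, GO = "[PAD]", "[UNK]", "[EOS]", "[GO]"

def build_vocab(sentences):
    """Map the special tokens and every seen word to an integer id."""
    vocab = {PAD: 0, UNK: 1, EOS: 2, GO: 3}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Convert words to ids, map unseen words to UNK, and append EOS."""
    ids = [vocab.get(word, vocab[UNK]) for word in sentence.split()]
    return ids + [vocab[EOS]]

def pad_batch(batch, vocab):
    """Right-pad every id list in the batch to the length of the longest one."""
    longest = max(len(ids) for ids in batch)
    return [ids + [vocab[PAD]] * (longest - len(ids)) for ids in batch]

vocab = build_vocab(["a b c", "a d"])
batch = pad_batch([encode("a b c", vocab), encode("a x", vocab)], vocab)
# Decoder input: start token followed by the target without its final EOS.
decoder_input = [vocab[GO]] + encode("a d", vocab)[:-1]
```

Grouping sentences of similar length into the same batch keeps the amount of padding small, which is why the text mentions batching by length.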
4 Experimental Results After processing the data properly, we made it ready as input for the model. Sequences of fixed length were given to the model as input. For training, we used 50 epochs, an RNN size of 256, a batch size of 2, a learning rate of 0.001, a keep probability of 0.70, and three layers. The Adam optimizer, which adapts the learning rate during training, was used. After training for a couple of hours, we were able to reduce the loss of the model to 0.007, and it predicted accurate summaries quite like the original ones. Table 1 describes this in brief. However, there has been very little research on Bengali text summarization. Our model performed better than the older models in some aspects. Table 2 compares other existing models with our model.
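The reported training settings can be collected in one place, as in the sketch below. The field names are assumptions; the values are those stated in the text. The update counts additionally assume that all 3000 collected pairs are used for training, which the paper does not state.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    epochs: int = 50
    rnn_size: int = 256
    batch_size: int = 2
    learning_rate: float = 0.001
    keep_probability: float = 0.70   # fraction of units kept by dropout
    num_layers: int = 3

def updates_per_epoch(num_examples, config):
    """Number of mini-batches needed to see every example once."""
    return math.ceil(num_examples / config.batch_size)

config = TrainingConfig()
steps = updates_per_epoch(3000, config)   # assumption: all 3000 pairs are training data
total_updates = steps * config.epochs
dropout_rate = 1.0 - config.keep_probability
```

Under that assumption, a batch size of 2 gives 1500 parameter updates per epoch and 75,000 updates over the 50 epochs.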
Table 2 Comparison with other models
Reference | Data type | Type of summary | Used neural networks | Used word embedding file | Loss
[7] | Different platforms (News, Facebook posts) | Abstractive | Bi-directional RNNs | “bn_w2v_model” | 0.0008
[6] | Not mentioned | Extractive | Different approaches | Not any | Not mentioned
Our model | News | Abstractive | Bi-directional RNNs | “bn_w2v_model” and “bn” | 0.0007
5 Conclusion and Future Work Our model predicted the summaries quite well. The sequences generated by the model were readable, appropriate, and human-like. No model is 100% accurate when generating such sequences, but our model produced a relevant summary nearly every time. Our dataset of three thousand texts and summaries is larger than the datasets used before for Bengali text summarization models. Still, neural networks perform much better when fed with a large amount of data, and we believe dataset size remains a limitation to be solved. More data would bring more accurate results. Shorter text sequences generated much better summaries than longer sequences, which we attribute to the imbalanced dataset: the more similar the lengths of the input texts, the better the results. As this dataset was collected from both online newspapers and social media platforms, we believe there was a high chance of dissimilar sentence lengths. We are working to improve this Bengali model in the future, and more data needs to be added. Bengali NLP research has some prime limitations, because resources that exist for English, such as a lemmatizer, WordNet, or good pre-trained word vectors for training the models effectively, are not sufficiently available. If we can solve these problems in the future, work on the Bengali language will be much more fruitful.
Acknowledgements We gratefully acknowledge the support of the DIU NLP and Machine Learning Research Lab for providing GPU support. We thank the Dept. of CSE, Daffodil International University for providing necessary support. We also thank the anonymous reviewers for their valuable comments and feedback.
References
1. Bahdanau, D., et al.: Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representation (ICLR), 19 May 2014
2. Kalchbrenner, N., Blunsom, P.: Recurrent continuous translation models. In: Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1700–1709. Association for Computational Linguistics (2013)
3. Sutskever, I., et al.: Sequence to sequence learning with neural networks. In: Conference on Neural Information Processing Systems (NIPS) (2014)
4. Rush, A.M., Chopra, S., Weston, J.: A neural attention model for sentence summarization. In: Conference on Empirical Methods in Natural Language Processing, pp. 379–389 (2015)
5. Luong, M., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP) (2015)
6. Abujar, S., Hasan, M., Shahin, M., Hossain, S.: 8th ICCCNT 2017 (2017). Available: https://doi.org/10.1109/ICCCNT.2017.8204166
7. Talukder, M., Abujar, S., Masum, A., Faisal, F., Hossain, S.: Bengali abstractive text summarization using sequence to sequence RNNs. In: 10th ICCCNT 2019 (2020)
8. Abujar, S., Masum, A., Mohibullah, M., Hossain, S.: An approach for Bengali text summarization using Word2Vector. In: 10th ICCCNT 2019 (2019)
9. Nguyen, H., Duong, P., Cambria, E.: Learning short-text semantic similarity with word embeddings and external knowledge sources. In: Knowledge-Based Systems, vol. 182, p. 104842 (2019). Available: https://doi.org/10.1016/j.knosys.2019.07.013
10. Alam, F., Chowdhury, S.A., Noori, S.R.H.: Bidirectional LSTMs-CRFs networks for Bangla POS tagging. In: ICCIT (2016)
11. Al-Sabahi, K., Zuping, Z., Kang, Y.: Bidirectional attentional encoder-decoder model and bidirectional beam search for abstractive summarization (2018). Accessed 30 Apr 2020
12. Cho, K., et al.: Learning phrase representations using RNN encoder decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
13. Liu, P.J., et al.: Generating Wikipedia by summarizing long sequences. In: International Conference on Learning Representation (ICLR) (2018)
Bengali News Classification Using Long Short-Term Memory Md. Ferdouse Ahmed Foysal, Syed Tangim Pasha, Sheikh Abujar, and Syed Akhter Hossain
Abstract News classification is the process of grouping similar news into predefined categories. With the increase in Bengali news content published every day by a good number of news portals, classifying news into appropriate groups has become necessary for potential readers. Traditional machine learning (ML) classification requires a good amount of labeled data and also the participation of human users. Nowadays, deep learning is being used in text classification, and it achieves good accuracy compared to traditional ML. This research aims to improve Bengali news text classification and to show the improved accuracy using long short-term memory (LSTM). We used a dataset containing 13,445 news items collected from various newspapers. The dataset has five types of news: entertainment, national, sports, city, and state news. This experiment achieved a good accuracy of 84%. Only a small number of studies on Bengali news classification have been done so far. This work is compared with the existing work and also states areas for future improvement. Keywords Bengali news · Deep learning · RNN · LSTM · Text classification
Md. F. Ahmed Foysal (B) · S. Tangim Pasha · S. Abujar · S. Akhter Hossain Daffodil International University, Dhaka, Bangladesh © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_32
1 Introduction As citizens of Bangladesh, we have Bengali as our mother tongue, and in 1952 many language martyrs died for this mother tongue, ‘Bangla Vasha’. For most people in Bangladesh, Bengali is the official language. Bengali is also popular in India, where it ranks second among the 22 scheduled languages. Today, 228 million native speakers speak Bengali daily, and a further 37 million speak it as a foreign language. By number of speakers, Bengali is the 5th most widely spoken native language and the 7th most spoken language in the world. News can be described as newly received or noteworthy information about current events. In today’s digital world, most people read the news over the Internet rather than in newspapers. News that is published over the Internet is called E-news. E-news readership is increasing rapidly because of developing network infrastructure and the growing number of Internet users. But there is another scenario in Bangladeshi E-news: news readers prefer to browse sites that give daily news as well as breaking news and updates from time to time. An analysis of Google Trends data for five dailies (Prothom Alo, Bangladesh Protidin, Nayadiganta, Jugantor, and Samakal) reports that the online readers of daily newspapers are leaving [1] (Fig. 1).
Fig. 1 Decreasing readership of online daily newspapers
In contrast, Google Trends shows that online readers are moving to online news portals that provide breaking news and updates as well as daily news, such as bdnews24.com and banglanews24.com [1] (Fig. 2).
Fig. 2 Increasing readership of online news portals
Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input [2]. In our research work, we use an LSTM model, which is based on the recurrent neural network (RNN). Online readership is growing rapidly day by day because the number of Internet users is increasing. But if we look at the online news portals of Bangladesh, we can easily see that their news is not sorted but published in random order. It is therefore hard for readers to search for or read their favorite, or sometimes important, news topics. If our model is used in the backend of E-news sites or portals, it will classify breaking news, updates, and daily news by their relevant topics so that readers can easily read their favorite topics.
2 Literature Review Since the emergence of deep learning, people all over the world have been using deep learning-based models in research papers and applications. Many researchers from Bangladesh and around the world have worked on problems related to Bengali news, using models based on machine learning and deep learning. In 2014, a paper used the Naive Bayes machine learning algorithm for Bangla news classification [3]. The researchers removed stop words from their documents, obtained a good classification result, and showed the result in a precision-recall graph [3]. In another paper from 2018, the researchers used a doc2vec approach in their model, and doc2vec gave them better accuracy than the LDA and LSA techniques [4]. They obtained an accuracy of 91% on the human-generated triplet dataset, whereas LDA and LSA obtained 85% and 84% accuracy, respectively [4]. Another work used a gated recurrent unit (GRU) model for named entity recognition (NER) in Bangla online newspapers; the researchers trained the model and obtained an F1-score of 69% [5]. In another research paper from 2015, the researchers used information extraction (IE) and semantic techniques in their model for a Bangla news stream [6]. Using classical natural language processing (NLP) techniques with semantics, those researchers worked on Bangla news content such as people and places [6].
3 Proposed Methodology In this section, we describe neural networks, RNN, LSTM, the model implementation, how we trained our model, the dataset collection and preparation process, and the compilation procedure.
3.1 Neural Networks Neural networks are a set of algorithms, loosely modeled on the human brain, that are designed to recognize patterns. Neural networks (NN) recognize numerical patterns contained in vectors. The underlying data can be images, text, time series, or sound. An artificial neural network (ANN) is an information processing model inspired by the way the biological nervous system of the brain processes information. An ANN is generally composed of a huge number of neurons working simultaneously and organized in many layers. The first layer of an ANN takes the raw information, comparable to the optic nerve in the human visual system. Every subsequent layer receives the output of the layer before it, so information flows from the first layer to the last layer of the network, much as neurons pass information in the human brain (Fig. 3).
Fig. 3 Structure of RNN
3.2 Recurrent Neural Networks
A recurrent neural network (RNN) is a type of neural network that processes sequences or time series data with the help of an internal memory. Unlike a feedforward network, in which all inputs are treated as independent, an RNN treats the inputs of a sequence as dependent on each other: the output at the current step relies on the previous step. RNNs can take variable-size inputs and produce variable-size outputs, and they work well for time series data. The RNN starts with the input $x_0$ and produces the output $h_0$. For the next step, the inputs are $h_0$ and $x_1$. Similarly, for the $n$th step the inputs are $h_{n-1}$ and $x_n$. In this way, the network retains a memory of earlier inputs while training. The equation for the present state is

$$h_t = f(h_{t-1}, x_t) \tag{1}$$

RNNs normally use tanh as the activation function:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t) \tag{2}$$

Here, $W_{hh}$ is the weight applied to the previous hidden state, $W_{xh}$ is the weight applied to the current input, $h_t$ denotes the hidden state vector, and $y_t$ is the output.
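As an illustration of Eq. (2), the recurrence can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the function names, the toy weights, and the zero initial hidden state are assumptions made only for this example.

```python
import math


def rnn_step(h_prev, x_t, W_hh, W_xh):
    """One recurrent step: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h_t = []
    for i in range(len(h_prev)):
        recurrent = sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        from_input = sum(W_xh[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(recurrent + from_input))
    return h_t


def run_rnn(inputs, W_hh, W_xh, hidden_size):
    """Feed a sequence through the cell, starting from an assumed zero state."""
    h = [0.0] * hidden_size
    states = []
    for x_t in inputs:
        h = rnn_step(h, x_t, W_hh, W_xh)
        states.append(h)
    return states


# Hypothetical toy weights: 2 hidden units, 1 input feature.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
states = run_rnn([[1.0], [0.0], [2.0]], W_hh, W_xh, hidden_size=2)
```

Each entry of `states` is the hidden vector after one time step, so the second state already depends on the first input even though the second input is zero.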
3.3 Long Short-Term Memory
Long short-term memory (LSTM) is a special kind of recurrent neural network (RNN) that consists of a cell, an input gate, an output gate, and a forget gate. LSTM is well suited to classifying time series data, and it is trained through backpropagation. Standard RNNs suffer from vanishing and exploding gradients; LSTMs mitigate these problems by introducing gates. There are three types of gate in an LSTM (Fig. 4):

1. Input gate: It decides which values from the input are used to update the memory. A sigmoid function decides which values to let through by scaling them between 0 and 1, and a tanh function weights the candidate values in the range −1 to 1.

$$i_t = \sigma(W_i * [h_{t-1}, x_t] + b_i) \tag{3}$$

$$\tilde{C}_t = \tanh(W_c * [h_{t-1}, x_t] + b_c) \tag{4}$$
Fig. 4 Structure of LSTM
2. Forget gate: It decides which details are discarded from the block. A sigmoid function makes the decision: it looks at the previous state $h_{t-1}$ and the current input $x_t$ and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$.

$$f_t = \sigma(W_f * [h_{t-1}, x_t] + b_f) \tag{5}$$

3. Output gate: The input and the memory of the block are used to decide the output. A sigmoid function decides which values to let through, scaled between 0 and 1. A tanh function weights the cell state values between −1 and 1, and the result is multiplied by the output of the sigmoid function.

$$o_t = \sigma(W_o * [h_{t-1}, x_t] + b_o) \tag{6}$$

$$h_t = o_t * \tanh(C_t) \tag{7}$$
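The gate equations (3) to (7) can be combined into a single time step, sketched below in plain Python. The text does not write out how the cell state is updated, so the sketch assumes the standard LSTM rule (new cell state = forget gate times old cell state plus input gate times the candidate from Eq. (4)). All names and the toy parameters are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(h_prev, c_prev, x_t, p):
    """One LSTM time step following Eqs. (3)-(7)."""
    z = h_prev + x_t  # list concatenation stands for [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]        # Eq. (3)
    c_cand = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)]   # Eq. (4)
    f_t = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]        # Eq. (5)
    o_t = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]        # Eq. (6)
    # Assumed standard cell update (not written out in the text).
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]               # Eq. (7)
    return h_t, c_t


# Hypothetical toy parameters: one hidden unit, one input feature, all zeros,
# so every gate evaluates to sigmoid(0) = 0.5 and the candidate to tanh(0) = 0.
params = {name: [[0.0, 0.0]] for name in ("W_i", "W_c", "W_f", "W_o")}
params.update({name: [0.0] for name in ("b_i", "b_c", "b_f", "b_o")})
h_t, c_t = lstm_step([0.0], [2.0], [1.0], params)
```

With these zero weights the forget gate keeps half of the previous cell state, which makes the role of each gate easy to check by hand.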
3.4 Dataset Collection and Preparation
We collected the dataset from Kaggle and cleaned it. The cleaned dataset contains five classes of Bengali news: entertainment, national, sports, city, and state news. There are 13,445 news items in total. We used 90% of the dataset to train the model and the remaining 10% to test it. We then tokenized all the data, which yielded 174,534 unique tokens.
3.5 Proposed Model
Our proposed LSTM model has four layers:
1. Embedding layer: The input layer, which takes all the tokenized words.
2. Spatial dropout layer: Here, we applied a 20% dropout.
3. LSTM layer: We used a hidden size of 100, with dropout = 20% and recurrent dropout = 20%.
4. Dense layer: The number of classes is 5, and we used the 'softmax' activation function to obtain the class probabilities.
3.6 Training the Model
We used the categorical cross-entropy function to calculate the loss and the Adam optimizer to compile our model. We used a batch size of 32 and trained the model for 10 epochs. 90% of the data was used to train the model, and we evaluated the model on the remaining 10% of the dataset.
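To make the loss concrete, the categorical cross-entropy for a batch of one-hot targets can be sketched as follows. This is only an illustration of the loss function named above; the function name, the clipping constant `eps`, and the two toy samples are assumptions and are not taken from the paper.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    total = 0.0
    for target, probs in zip(y_true, y_pred):
        # Clip probabilities away from zero so that log() is always defined.
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_true)


# Two hypothetical samples over the five news classes.
y_true = [[0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]
y_pred = [[0.05, 0.05, 0.8, 0.05, 0.05], [0.2, 0.2, 0.2, 0.2, 0.2]]
loss = categorical_cross_entropy(y_true, y_pred)
```

The confident, correct first prediction contributes a small term, while the uniform second prediction contributes a much larger one, which is the behavior the optimizer exploits during training.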
4 Performance Evaluation
When the trained model is evaluated on the training data, the result is called training accuracy. In the same way, when the model is evaluated on unseen data from the test set, the result is called test accuracy. Figure 5 plots the training accuracy against the test accuracy of our model.
Fig. 5 Training accuracy versus test accuracy
The error made when predicting on the training dataset is called training loss. The error made by the trained model when predicting on the test dataset is known as test loss. Figure 6 plots the training loss against the test loss of our model.
Fig. 6 Training loss versus test loss
5 Result Discussion
Precision, recall, and F1-score are standard measures for any classification problem, so we calculated them from the evaluation on the test dataset (Table 1). From the averages, we can state that the classifier performs well, with an average of 0.84 (84%) on each measure.

Table 1 Classification report
Classes        Precision  Recall  F1-score
Entertainment  0.83       0.80    0.81
National       0.68       0.69    0.69
Sports         0.91       0.89    0.90
City           0.94       0.94    0.94
State          0.71       0.73    0.72
Average        0.84       0.84    0.84
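The three measures in Table 1 can all be derived from a confusion matrix. The sketch below shows one way to do so; the function name and the 3-class counts are hypothetical and are not the paper's results.

```python
def classification_report(confusion):
    """Per-class (precision, recall, F1) from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    report = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append((precision, recall, f1))
    return report


# Hypothetical 3-class counts, used only to exercise the function.
toy = [[8, 2, 0],
       [1, 7, 2],
       [0, 1, 9]]
report = classification_report(toy)
```

Precision looks down a column (everything predicted as the class), recall looks along a row (everything that truly belongs to the class), and F1 is their harmonic mean.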
Classification problems are also commonly assessed with confusion matrices. We made two confusion matrices from the evaluation: a general confusion matrix (Fig. 7) and a normalized confusion matrix (Fig. 8).
Fig. 7 Confusion matrix
Here, class 1 = entertainment, class 2 = national, class 3 = sports, class 4 = city, class 5 = state
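A normalized confusion matrix is usually obtained by dividing each row of the raw matrix by its row total, so that each entry becomes the fraction of a true class assigned to each predicted class. The paper does not say which normalization it used, so row normalization is an assumption here, and the counts are hypothetical.

```python
def normalize_rows(confusion):
    """Divide each row by its sum so that entries become per-class fractions."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Hypothetical 3-class counts (rows are true classes, columns are predictions).
toy_counts = [[8, 2, 0],
              [1, 7, 2],
              [0, 1, 9]]
normalized = normalize_rows(toy_counts)
```

After row normalization the diagonal entries equal the per-class recall, which makes classes of different sizes directly comparable.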
Fig. 8 Normalized confusion matrix

6 Future Work
We classified five classes of news using an LSTM (a special version of RNN). Our future goal is to build a better neural network and to enrich the dataset with more types of news. We also plan to evaluate our dataset with different algorithms.

7 Conclusion
In this paper, we describe a way of classifying Bengali news with our LSTM model. We used an embedding layer, a spatial dropout layer, an LSTM layer, and a dense layer to build the model. The model classifies five types of news, which can help an online news portal. We plan to build an online system that can classify Bengali news.
References
1. Jubayer, T.M.: Digital Age Newspaper Model For Bangladesh. Academia
2. Deng, L., Yu, D.: Deep learning: methods and application. published by Microsoft (2014)
3. Chy, A.N., Seddiqui, M.H., Das, S.: Bangla news classification using Naive Bayes classifier. In: 16th International Conference on Computer and Information Technology, pp. 8–10, Khulna, Bangladesh (March 2014)
4. Nandi, R.N., Arefin Zaman, M.M., Al Muntasir, T., Hosain Sumit, S., Sourov, T., Rahman, M.J.-U.: Bangla news recommendation using doc2vec. In: International Conference on Bangla Speech and Language Processing (ICBSLP), pp. 21–22 (2018)
5. Banik, N., Rahman, M.H.H.: GRU based named entity recognition system for Bangla online newspapers. In: International Conference on Innovation in Engineering and Technology (ICIET), pp. 27–29 (2018)
6. Seddiqui, M.H., Hoque, M.N., Hafizur Rahman, M.H.: Semantic Annotation of Bangla news stream to record history. In: 18th International Conference on Computer and Information Technology (ICCIT), 21–23 December 2015
Facial Emotion Recognition for Aid of Social Interactions for Autistic Children Sujan Mitra, Biswajyoti Roy, Sayan Chakrabarty, Bhaskar Mukherjee, Arijit Ghosal, and Ranjita Chowdhury
Abstract In this work, we have developed a comprehensive CNN-based model for recognizing facial emotions, enabling novel applications in human–computer interaction, particularly for individuals with Autism Spectrum Disorder. We explore and analyze various datasets for training expressions. The model aims to make the process of translating emotions easier for children with Asperger's syndrome by identifying gestures, behavioral patterns, and emotions and describing them linearly. We further propose a video feed pipeline using OpenCV that draws a dynamic box around each detected face, extracts the face, and predicts its expression in real time. The model provides considerably higher training, validation, and testing accuracy than contemporary models. We further analyze the persisting issues and possibilities in this domain and propose comparative data analysis using dynamic dashboards, along with future directions for structuring a robust system.
S. Mitra · B. Roy (B) · S. Chakrabarty · A. Ghosal · R. Chowdhury Department of Information Technology, St. Thomas’ College of Engineering and Technology, Kolkata, West Bengal 700023, India e-mail: [email protected] S. Mitra e-mail: [email protected] S. Chakrabarty e-mail: [email protected] A. Ghosal e-mail: [email protected] R. Chowdhury e-mail: [email protected] B. Mukherjee Malda Medical College, Malda, West Bengal 732101, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_33
Keywords Convolutional neural network · Dynamic dashboards · Classification · Facial emotional recognition (FER)
1 Introduction
Humans routinely use nonverbal cues, including voice tone, body gestures, and facial expressions. Artificial emotion recognition, or emotion AI, is the methodology that enables a program to examine human facial sentiments using sophisticated image processing. Autistic children generally appear withdrawn, tend to be individualistic, and may behave indifferently to situations, which can make them difficult to comfort. In addition to the differences and oddities that account for a fair amount of variability in their actions, their mannerisms while pronouncing words, making eye contact, and producing facial emotions, which are generally termed "Social Communication Disorder," make their behavior substantially more complex than that of other people. Moreover, people with Autism Spectrum Disorder (ASD) are often challenged by the correct and precise use of pronouns and other grammatical knowledge. Pragmatics, the appropriate use of language in social situations, is a major deficiency among children with ASD. Prosody, the rhythm of speech spanning both verbal and nonverbal communication, is often irregular among autistic children: they are either monotonic or extremely exaggerated, which sounds unnatural to listeners. It is often baffling and exhausting for an autistic child to perceive others' emotions, as they tend to have their own way of understanding and comprehending beliefs, interests, and actions. In our approach to emotion AI, we envision a cognitive model that can detect emotions, translate visual perceptions, and identify gestures and behavioral patterns, particularly for individuals with ASD.
2 Related Works
Cowie et al. [1] first identified anger, disgust, fear, happiness, sadness, and surprise as the six primary emotions. The expression 'Neutral' was later introduced in facial expression datasets, resulting in seven basic emotions [2]. These have served as the standard benchmark for emotion recognition since then. Edwards et al. [2] developed a model in which the training was carried out in two separate parts: one for animated faces and another to map human facial images onto animated ones. Advanced research in psychology and neuroscience showed that these facial expressions are culture specific and not universal [3]. Liu et al. [4] proposed a neural network for the same task. This model had two convolution layers, one max pooling layer, and four sub-networks.
After the advent of deep learning, and especially of convolutional neural networks for computer vision and image classification problems [4–9], some researchers developed deep learning-based models for FER. Barsoum et al. [7] showed that neural networks can achieve high accuracy in this field. They used a zero-bias CNN on the extended Cohn-Kanade dataset (CK+) and the Toronto Face Dataset (TFD).
3 Proposed Methodology
To let the model run accurately, we keep a separate training phase in which the model is trained to identify seven base facial expressions. The model can then identify and report which of these seven expressions a person exhibits. It outputs a probability for each of the seven classes, i.e., the seven expressions, and the index of the class with the highest probability is taken as the final result. We have primarily used the JAFFE dataset for training. The dataset comprises seven directories, each representing a unique facial expression. These expressions are listed as {'Angry', 'Contempt', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise'}. Each directory contains the images for its expression. Training traverses every image in every directory and treats each expression as one class for classification. For every class, a particular index is selected manually based on the number of images that exist for that class (or expression).

Figure 1 portrays the steps of our approach. We designed a sequential convolutional neural network that uses 2D convolutional layers, with the ReLU activation function for all but the last layer. The final dense layer has an output size of 7 (one per expression) and a softmax activation function. The model is compiled with the Adam optimizer, and the loss is calculated using categorical cross-entropy. The model uses a number of distinct layers, each with its own number of features and kernels, chosen for the given problem.

Fig. 1 Steps of the advised approach

Once the model is trained, we use OpenCV to run a live video feed and detect human faces in it with a cascade classifier. We extract every face from every frame of the video and use our model to predict its class, i.e., the expression. We mark the detected face on the frame image and add a text label indicating the expression. The output is thus a live video feed in which human faces are marked and the expression is shown on top of each detected face dynamically.

For the comparative analysis section, we created an interactive, data-driven dashboard using a business intelligence tool; it is a graphical user interface that gives an overview of the reports and metrics we care about most. It presents a comparative progress report that lets us weigh our data against other facial detection models. Its layouts prioritize interactive elements without overusing real-time data. We have periodically analyzed our survey of patients with Autism Spectrum Disorder (ASD), segregated by the seven base facial expressions mentioned above. These information-rich dashboards are best suited to supporting long-term strategies by analyzing and benchmarking critical trend-based information over a wide range of operations with a shorter or more immediate time scale.
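The prediction rule described in this section (take the class with the highest predicted probability among the seven expressions) can be sketched as below. The mapping of list positions to expression names is an assumption; the paper states only that an index is chosen manually for each class, and the probability vector is made up for the example.

```python
# Assumed index order; the paper does not give the actual class-to-index mapping.
EXPRESSIONS = ["Angry", "Contempt", "Disgust", "Fear", "Happy", "Sad", "Surprise"]


def predict_expression(probabilities):
    """Return the (label, probability) pair with the highest probability."""
    if len(probabilities) != len(EXPRESSIONS):
        raise ValueError("expected one probability per expression")
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return EXPRESSIONS[best], probabilities[best]


# Hypothetical softmax output for one detected face.
label, confidence = predict_expression([0.05, 0.05, 0.1, 0.1, 0.6, 0.05, 0.05])
```

In the live feed, the returned label is the text that would be drawn above the marked face.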
4 Experimental Results
The model reaches a training accuracy of 98.89%, a validation accuracy of 94.29%, and a testing accuracy of 94.28%. The validation and testing accuracies were calculated using 15% of the original dataset; the rest went into training. The model is saved as an H5 (HDF5) file for future use. Figure 2 depicts an instance of the generated output.
4.1 Comparative Analysis
The performance of the proposed model has been compared with several existing works. We begin our comparison with a model in which face landmarks and HOG features are extracted using Dlib and a multi-class SVM classifier is trained to recognize facial expressions on the Fer2013 dataset. Different CNN models for facial expression recognition, applied to various datasets, are also considered in the comparison [10–12]. In comparison to these models, our prototype trains with a training accuracy of 98.89%, a validation accuracy of 94.29%, and a testing accuracy of 94.28%. The datasets considered are JAFFE and CK+ (Table 1).
Fig. 2 An example of generated output (this is shown on live feed)

Table 1 Performance analysis of various FER models
Model and dataset                                                               Classification accuracy (%)
Multi-class SVM classifier with Fer2013 dataset                                 50.5
CNN used using OpenCV and Fer2013 dataset                                       75.1
Model: VGG19 (a CNN-based pytorch implementation) using CK+ dataset             94.646
Model: Resnet18 (a CNN-based pytorch implementation) using CK+ dataset          94.040
Shallow CNN                                                                     56.31
Deep CNN                                                                        65.55
Multi-class support vector machines (radial kernel, kernlab) using CK+ dataset  97.3
Multi-class support vector machines (linear kernel, e1071) using CK+ dataset    96.5
Decision tree-based classifier                                                  89.63
This work (sequential model using JAFFE and CK+ datasets)                       98.89
5 Conclusion
This work has proposed a facial emotion recognition model that aims to interpret emotions from visual perception. The objective is to categorize each facial image in a video frame into one of the seven facial emotion classes considered. Our work prioritizes user experience and is designed to be as simple as possible, so that an FER prediction model can be up and running with minimal user requirements for basic use cases. The prototype is intended to work as a translator of emotions for autistic children who are affected by social communication disorder. Autistic patients are more inclined to analyze a face part by part rather than as a whole, so further gains in accuracy would be substantially beneficial. Moreover, there has been a paradigm shift in automatic emotion recognition, in which work on modelling discrete basic emotions is giving way to the bigger challenge of detecting naturalistic expressions. Our work could be extended to consider not only expressions in a fixed frame but also transitional frames in which the person's expression is changing. This would allow a stricter detection pattern in which training also encompasses gesture recognition. Furthermore, acquiring further insights and visualizing the components of the data dynamically for comparison, using dynamic dashboards, may be pursued in the future. With the use of our model, these children can differentiate among the varying expressions that a person portrays in a normal social interaction, which makes such situations less apprehensive and more predictable, with as few difficulties as possible.
References
1. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.G.: Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 18(1), 32–80 (2001). https://doi.org/10.1109/79.911197
2. Edwards, J., Jackson, H.J., Pattison, P.E.: Emotion recognition via facial expression and affective prosody in schizophrenia. Clin. Psychol. Rev. 22(6), 789–832 (2002). https://doi.org/10.1016/s0272-7358(02)00130-7
3. Bänziger, T., Mortillaro, M., Scherer, K.R.: Introducing the Geneva multimodal expression corpus for experimental research on emotion perception. Emotion 12(5), 1161–1179 (2012). https://doi.org/10.1037/a0025827
4. Liu, P., Han, S., Meng, Z., Tong, Y.: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1805–1812 (2014)
5. Khorrami, P., Paine, T., Huang, T.: Do deep neural networks learn facial action units when doing expression recognition? Proceedings of the IEEE International Conference on Computer Vision Workshops (2015)
6. Kahou, S.E., Bouthillier, X., Lamblin, P., Gulcehre, C., Michalski, V., Konda, K., Bengio, Y.: EmoNets: multimodal deep learning approaches for emotion recognition in video. J. Multimodal User Interfaces 10(2), 99–111 (2015). https://doi.org/10.1007/s12193-015-0195-2
7. Barsoum, E., Zhang, C., Ferrer, C.C., Zhang, Z.: Training deep networks for facial expression recognition with crowd-sourced label distribution. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016) (2016). https://doi.org/10.1145/2993148.2993165
8. Aneja, D., Colburn, A., Faigin, G., Shapiro, L., Mones, B.: Modeling stylized character expressions via deep learning. In: Lecture Notes in Computer Science, pp. 136–153 (2017). https://doi.org/10.1007/978-3-319-54184-6_9
9. Meng, Z., Liu, P., Cai, J., Han, S., Tong, Y.: Identity-aware convolutional neural network for facial expression recognition. In: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (2017). https://doi.org/10.1109/fg.2017.140
10. Pramerdorfer, C., Kampel, M.: Facial expression recognition using convolutional neural networks: state of the art (2016). arXiv preprint arXiv:1612.02903
11. Alizadeh, S., Fazel, A.: Convolutional neural networks for facial expression recognition, 1704.06756 (2017)
12. Li, S., Deng, W.: Deep facial expression recognition: a survey. IEEE Trans. Affect. Comput.
Exer-NN: CNN-Based Human Exercise Pose Classification Md. Warid Hasan, Jannatul Ferdosh Nima, Nishat Sultana, Md. Ferdouse Ahmed Foysal, and Enamul Karim
Abstract To enjoy the glow of good health, you must exercise [Gene Tunney], because it helps us feel happier, increases energy levels, reduces chronic disease, and keeps our brain and body refreshed. Today's computer vision technology is supported by deep learning algorithms that use a special type of neural network, the convolutional neural network (CNN), to sense objects. In this work, we propose a novel system that uses CNNs to classify different types of exercise pose, supporting automatic decision making and predictive models. Earlier research has addressed pose detection as an image classification problem. For a strong baseline, we retrained the final layer of the VGG16, MobileNet, and Inception V3 CNN architectures to predict among five different classes. We also create a new model, "Exer-NN," to classify human exercise poses. The proposed approach achieves an average accuracy of approximately 88% and can be used for purposes such as tool kit assistance and automated support for management systems.
Keywords Exercise pose · Pose classification · CNN · Deep learning · Image classification · InceptionV3 · VGG16 · MobileNet
Md. W. Hasan · J. Ferdosh Nima · Md. F. Ahmed Foysal (B) · E. Karim Daffodil International University, Dhaka, Bangladesh e-mail: [email protected] Md. W. Hasan e-mail: [email protected] J. Ferdosh Nima e-mail: [email protected] E. Karim e-mail: [email protected] N. Sultana Bangladesh University of Engineering and Technology, Dhaka, Bangladesh e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_34
1 Introduction
In recent years, computer vision and artificial intelligence have been widely used for human activity analysis, such as face detection, emotion understanding, and pose detection, and pose detection technology has been applied in different ways. Using computer vision, we can apply pose detection to exercise activities. Today, people in every country, developed or developing, agree that exercise is very important to our physical and mental health. In our country, the number of exercise zones is increasing significantly, so an automatic decision-making application can be deployed in an exercise zone to support the management system and other tool kits. There are many types of exercise, but we work with only five popular exercise poses: bench press, bicep curl, deadlift, squat, and treadmill. Our main goal is to build a transfer learning model that can recognize different types of exercise pose in a photograph for use in computer vision applications. To do that, we first have to train a model on different types of image data; we developed a new CNN model, "Exer-NN," and trained it on our images for better accuracy. We also trained the VGG16, InceptionV3, and MobileNet models so that we could compare them with our CNN model and make the results more trustworthy. The rest of the paper is organized as follows. Section 3 describes our proposed methodology, including the implementation of our model, data collection, pre-processing, and the definition of the test and training data. Section 4 covers performance evaluation, result discussion, and comparison. Section 5 presents the conclusions and future work.
2 Related Works
Image classification has been studied widely with different datasets and algorithms, but so far there has been little work on image-based human exercise pose detection. Sadeka Haque et al. proposed the ExNET model, which classifies image-based human exercise poses, and obtained a best accuracy of 82.68% [1]. Terry Taewoong Um et al. [2] used large-scale wearable sensor data to classify different types of human exercise with 92.14% classification accuracy. A still image of human exercise can be classified and represented in many different ways; in [3], the authors use the pictorial structures model for human pose estimation. The pictorial structures model represents the human body as a configuration of rigid parts in an efficient way and generates a pose prediction. Alexander Toshev and Christian Szegedy [4] proposed a method for human pose estimation based on deep neural networks (DNNs). The main challenge is to estimate the locations of human joints, which requires a strong architecture. DNN-based regression performs well for localization, but a CNN is better suited to a classification task. In 2015 [5], Liang-Chieh Chen et al. developed the DeepLab model, which performs classification and object detection at the same time with computational efficiency and significantly improved results. DeepLab gives 71.6% accuracy for image segmentation; the model can also be used for other image classification tasks and for depth maps or videos. To improve semantic segmentation [6], Siddhartha Chandra and Iasonas Kokkinos introduce a deep learning prediction technique based on the solution of a linear system. In [7], data on five daily physical activities (walking, running, cycling, driving, sports) were collected by the Philips NWS activity monitor and classified with a Bayesian classifier with 72.3% accuracy, which was compared with a decision tree algorithm.
3 Proposed Methodology
3.1 Convolutional Neural Network
A CNN is a specific type of artificial neural network architecture for deep learning that uses perceptrons, a machine learning unit algorithm, to analyze various types of data in supervised learning. CNNs extract patterns from their inputs and work well with data that has spatial relationships. Like other neural networks, a CNN has learnable parameters, i.e., weights and biases [5]. Some of its layers are convolutional: they apply a mathematical operation (convolution) and pass the results to successive layers. Figure 1 gives a clear picture of the architecture.
Fig. 1 Convolutional neural network architecture
3.2 Convolutional Layer
A convolutional neural network has a basic structure of an input layer, an output layer, and various hidden layers. During forward propagation, each input is convolved with various filters (or kernels). CNNs can be applied to image datasets and to regression, face recognition, object detection, segmentation, and classification prediction problems, among others. The results of the convolution are patterns that influence the classification; these are called features. To build a convolutional layer, some model hyper-parameters have to be configured: length of filters, number of filters, stride, and padding.
• Length of filters: A kernel acts as a filter that extracts specific features or patterns from the input data, which increases the efficiency of the classification. In CNNs, the filters are not specified in advance; the values of each filter are learned during the training process. Because a CNN learns the values of its filters, it can find more meaningful features in the input than filters designed by hand, which may fail to recognize particular features (Fig. 2).
Fig. 2 Kernel
• Stride: The stride defines the number of rows and columns by which the filter shifts over the input matrix. A larger stride reduces the output dimension. If the stride is 2, the filter moves 2 pixels at a time across the input matrix. The stride is always an integer, not a fraction; the default stride is 1.
• Padding: A convolution layer normally reduces the dimensions of the output matrix, but with padding we can keep the output the same dimension as the input matrix. There are two kinds of padding: same padding and valid padding. Valid padding means "no padding," so the output matrix is smaller than the input. Same padding means the output matrix has the same dimension as the input matrix; it is achieved by adding extra cells filled with zeros symmetrically around the input matrix.
After each convolutional operation, an activation function (denoted σ) is applied to identify the features that are relevant for classification.
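The effect of filter length, stride, and padding on the output size can be checked with a small calculation. This is a minimal sketch: the function name is hypothetical, the kernel size of 3 is chosen only for illustration, and the "same" case follows the common convention (used, for example, by TensorFlow/Keras) that the output size is the input size divided by the stride, rounded up.

```python
import math


def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one axis of a convolution over an input of length n."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (n - kernel) // stride + 1
    if padding == "same":
        # Zero padding chosen so that, with stride 1, output size equals n.
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")


# A 200-pixel axis (the input size used later in the paper), 3-wide kernel.
valid_size = conv_output_size(200, 3)                      # shrinks the axis
same_size = conv_output_size(200, 3, padding="same")       # keeps the axis
strided_size = conv_output_size(200, 3, stride=2)          # roughly halves it
```

The same formula is applied independently to the rows and the columns of the input matrix.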
3.3 Rectified Linear Units (ReLU)
ReLU is the most widely used activation function in network design today. The ReLU function is nonlinear and allows for backpropagation. The constant gradient of ReLU for positive inputs results in faster learning. Not all neurons are activated simultaneously: if a neuron's input is negative, its output is set to zero and the neuron is not activated. Because only some neurons are active at a time, ReLU is simpler, faster, and more efficient to train, and it is also more biologically inspired (Fig. 3).
Fig. 3 ReLU activation function
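The behavior described above follows directly from the definition of ReLU, max(0, x). The sketch below shows the function and its gradient; the function names and the sample inputs are chosen only for illustration.

```python
def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: constant 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


# Negative inputs are switched off; positive inputs pass through unchanged.
activations = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5]]
```

Only the last of the four sample neurons stays active, which is the sparsity that the section refers to.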
3.4 Pooling Layer
A pooling layer downsamples its input: it reduces the dimension of the input and the number of parameters, which helps control overfitting while preserving the most relevant information. There are three kinds of pooling layer: MaxPooling, AveragePooling, and MinPooling. In deep learning, two pooling functions are commonly used:
• MaxPooling: It picks only the maximum value contained in the pooling window.
• AveragePooling: It picks the average of the values contained in the pooling window (Fig. 4).
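Both pooling functions can be illustrated on a small feature map. The sketch below assumes a non-overlapping 2 × 2 window (stride equal to the window size), which is a common choice but is not stated in the text; the function name and the values are hypothetical.

```python
def pool2d(matrix, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [matrix[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:
                out_row.append(sum(window) / len(window))
        pooled.append(out_row)
    return pooled


# Hypothetical 4 x 4 feature map, reduced to 2 x 2 by each pooling function.
feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 4, 3, 8]]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
```

Each output cell summarizes one 2 × 2 block of the input, so the feature map shrinks by a factor of two along each axis.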
3.5 Flatten Layer
Between the convolutional layers and the fully connected layer there is a 'Flatten' layer. It converts the resulting two-dimensional arrays into a single 1D feature vector, an operation called flattening. The flattened output is one long continuous vector that the dense layer uses for the final classification.
Fig. 4 MaxPooling and AveragePooling
3.6 Fully Connected Layer
The fully connected (FC) layer is the last phase of a CNN, and it operates on the feature vector of the input. An FC layer involves weights, biases, and neurons, and it connects the neurons in one layer to the neurons in another layer. FC layers recombine the features to classify each input efficiently and accurately; they are trained to classify images among different categories. FC layers can be compared with a multilayer perceptron (MLP), in which each neuron is fully connected to all activations of the previous layer [8].
3.7 Dropout Layer
The dropout layer encourages a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons [9]. These layers reduce overfitting and lower the model's dependence on the training set.
3.8 Softmax Layer
The softmax layer is the last (output) layer of the network and is used to determine the probabilities of multiple classes. It calculates the probability of each target class and returns these values for the given input [10].
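As a minimal sketch of this computation, the snippet below turns five raw scores into class probabilities. The scores are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability step that the paper does not mention.

```python
import numpy as np

def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    shifted = logits - np.max(logits)  # stability trick; does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

classes = ["bench press", "bicep curl", "deadlift", "squat", "treadmill"]
logits = np.array([1.0, 0.5, 3.0, 0.2, -1.0])  # hypothetical network outputs
probs = softmax(logits)
predicted = classes[int(np.argmax(probs))]  # class with the highest probability
```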
Exer-NN: CNN-Based Human Exercise Pose Classification
3.9 Data Collection
In this study, a new public dataset is used to train the proposed networks. It contains 2873 images in five categories: bench press, bicep curl, deadlift, squat, and treadmills. The dataset was collected manually from the Internet and from publicly available information. Each class has more than 90 images. Of the 2873 images, 2154 (75%) are used for training and 719 (25%) for testing (Fig. 5).
Fig. 5 A small part of dataset
3.10 Data Augmentation
A small training set can result in overfitting [11]. To avoid overfitting, we increased the number of samples with basic image augmentation [12] before training our convolutional neural network (CNN) model. Data augmentation is important because it improves model performance and reduces classification loss. We prepared the data using three different methods, illustrated in Fig. 6:
• Flip horizontally about the Y-axis.
• Rotate left by −30°.
• Rotate right by +30°.
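The three augmentation methods listed above can be sketched with plain NumPy. The paper does not say which library or interpolation was used, so the nearest-neighbour rotation, the zero fill for out-of-frame pixels, the sign convention of the angle, and all function names here are assumptions.

```python
import numpy as np

def flip_horizontal(img):
    """Mirror the image about its vertical (Y) axis."""
    return img[:, ::-1]

def rotate(img, degrees):
    """Rotate about the image centre with nearest-neighbour sampling.
    Pixels that fall outside the source frame are set to 0."""
    h, w = img.shape[:2]
    theta = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, find the source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

def augment(img):
    """Return the original image plus the three augmented variants."""
    return [img, flip_horizontal(img), rotate(img, -30), rotate(img, 30)]

image = np.arange(48, dtype=np.uint8).reshape(4, 4, 3)  # tiny stand-in RGB image
samples = augment(image)
```

Each source image yields four training samples of the same shape, so the class label can be copied to all of them.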
Fig. 6 A small part of dataset
3.11 Data Preprocessing
After augmentation, to avoid excessive computational cost and the risk of over-fitting, we reduce the input dimension to a fixed 200 × 200 pixels. The images are passed to the CNN in RGB color, which helps the CNN detect features and obtain better accuracy. For normalization, we divide the RGB values by 255 and map them to the range [−0.5, 0.5] (division by 255 alone gives [0, 1], so a shift of 0.5 is implied).
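A minimal sketch of the normalization step follows. Dividing by 255 maps 8-bit values to [0, 1], so the subtraction of 0.5 is an assumption made here to reach the stated range of [−0.5, 0.5]. Resizing to 200 × 200 is not shown because it needs an image library, and the function name is illustrative.

```python
import numpy as np

def normalize(image_uint8):
    """Scale 8-bit RGB values to [0, 1], then shift to [-0.5, 0.5]."""
    return image_uint8.astype(np.float32) / 255.0 - 0.5

# Stand-in for an already resized 200 x 200 RGB image.
image = np.zeros((200, 200, 3), dtype=np.uint8)
image[0, 0] = 255  # one fully white pixel so both extremes occur
normalized = normalize(image)
```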
3.12 Test Set
The dataset includes five classes and contains an average of 95 core images per class. After augmentation, there are 2873 images in total. In this part, we create a test set to evaluate the classification performance of our CNN model. We split the data into a training set and a test set using random state = 42, so that the split is reproducible and accuracy is measured on unseen data. 75% of the 2873 images form the training set and the remaining 25% form the test set [12], giving 2154 training images and 719 test images.
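The split sizes can be reproduced with a small index-based sketch. The paper only states "random state = 42" and the 75/25 ratio. The helper name, the use of NumPy's `RandomState`, and rounding the test size up are assumptions chosen so that the sizes match the reported 2154 and 719 images.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.25, seed=42):
    """Shuffle sample indices reproducibly and split them into train and test."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(test_fraction * n_samples))  # 718.25 is rounded up to 719
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(2873)
```

Using a fixed seed means every model in a comparison can be given exactly the same training and test images.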
3.13 Training the Model
After preprocessing the data and defining the training and test sets, we train our model on the training set of 2154 images. To increase accuracy and decrease loss as far as possible, we updated the model many times, changing the optimizer, learning rate, and loss function. We train the model with the Adam optimizer to minimize the loss, using a 75% training set and a 25% validation set. We use a batch size of 128 for lower memory use and faster training. As training proceeds, training and validation accuracy no longer increase significantly after 70–80 epochs. We therefore train for only 80 epochs, because too many epochs cause overfitting.
At that stage, validation accuracy reached 95.27%, training accuracy reached 96.81%, and validation loss was 0.1422. The model is then ready to predict unseen data for the final test evaluation.
4 Performance Evaluation
We applied the proposed CNN architecture to the dataset described above and achieved an average per-class accuracy of 95%. The curves show that the loss decreases while the validation accuracy increases, which indicates that the model learns well from the training dataset (Figs. 7 and 8; Table 1).
Fig. 7 CNN training and validation accuracy
4.1 Training the Model
Our convolutional network model is built from several layers that differ in input size, number of filters, activation shape, and so on. Their weights and biases amount to a total of 7,240,325 parameters. The input image has shape 200 × 200 in RGB color. The input layer only reads the image and generates no parameters. The model uses four Conv2D layers, each followed by a MaxPool layer, one Flatten layer, and two dense layers including the output (softmax) layer. The pool size (2, 2) and strides (2, 2) are the same in all pooling layers, but the number of filters (32, 64, 96, 96) and the kernel size ((5, 5) or (3, 3)) differ between layers. The Conv2D and MaxPool layers produce an activation size determined by their activation shape. The MaxPool layers reduce the spatial dimensions and have no parameters. Of these two layer types, only the Conv2D layers generate parameters. For stable generalization, we do not add more CNN layers (Table 2).
Fig. 8 CNN training and validation loss
Table 1 Classification result
Model        Accuracy  Precision  Recall  F1    Number of epochs
InceptionV3  74        75.8       77      77    80
VGG16        87        87         86.6    86.6  80
MobileNet    90        89.4       90      90    80
CNN          95        95         95      95    80
4.2 Result Discussion
The classifier's performance is assessed on the test set and through the training, validation, and testing accuracy [13]. Our CNN model achieves high precision and recall, with every weighted average reaching up to 93. The test set contains 719 images. After classification, only 34 images are predicted incorrectly and 685 are predicted correctly. The final test of our model therefore gives 95% accuracy on unseen data. The confusion matrix gives a clearer picture (Fig. 9; Table 3).
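The reported test accuracy follows directly from the counts given above, as this short arithmetic check shows. Only the two counts come from the paper. The variable names are illustrative.

```python
correct = 685  # correctly classified test images
wrong = 34     # misclassified test images

total = correct + wrong     # size of the test set
accuracy = correct / total  # fraction of correct predictions
accuracy_percent = round(accuracy * 100, 2)
```

The result is 95.27%, which the text rounds to 95%.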
Table 2 Number of parameters
Number  Operation   Number of filters  Activation shape  Activation size  Parameters
1       Input size  –                  200,200,3         120,000          –
2       Conv2D 1    32                 200,200,32        1,280,000        2432
3       MaxPool 1   –                  100,100,32        320,000          –
4       Conv2D 2    64                 100,100,64        640,000          18,496
5       MaxPool 2   –                  50,50,64          160,000          –
6       Conv2D 3    96                 50,50,96          240,000          55,392
7       MaxPool 3   –                  25,25,96          60,000           –
8       Conv2D 4    96                 25,25,96          60,000           83,040
9       MaxPool 4   –                  12,12,96          13,824           –
10      Flatten     –                  13824,1           13,824           –
11      Dense 1     –                  512               –                7,078,400
12      Softmax     –                  5                 –                2565
Total parameters                                                          7,240,325
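The parameter counts in Table 2 can be checked with the usual formulas for convolutional and dense layers (weights plus one bias per output unit). The assignment of the (5, 5) kernel to the first Conv2D layer and (3, 3) kernels to the rest is an assumption, chosen because it reproduces the tabulated numbers. The helper names are illustrative.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(in_units, out_units):
    """One weight per input-output pair plus one bias per output unit."""
    return (in_units + 1) * out_units

layer_params = [
    conv2d_params(5, 3, 32),          # Conv2D 1 on the RGB input
    conv2d_params(3, 32, 64),         # Conv2D 2
    conv2d_params(3, 64, 96),         # Conv2D 3
    conv2d_params(3, 96, 96),         # Conv2D 4
    dense_params(12 * 12 * 96, 512),  # Dense 1 on the flattened 13,824 values
    dense_params(512, 5),             # Softmax output over five classes
]
total_params = sum(layer_params)
```

Under these assumptions the dense layer after flattening accounts for almost all of the 7,240,325 parameters, while the pooling and flatten layers contribute none.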
Fig. 9 Confusion matrix
4.3 Comparison
In this section, we compare our CNN model with InceptionV3, VGG16, and MobileNet in terms of test accuracy. For a fair comparison, the same number of epochs and the same batch size are used for every model. Every model is trained with the same 75% training set and 25% test set, but validation and testing accuracy differ from model to model, as summarized in Table 1 [12]. From the table, we observe that VGG16 gives 87% accuracy with slightly noisy curves, and its validation and training accuracy are often similar. MobileNet reaches 92% accuracy, but its validation and training accuracy are very noisy. InceptionV3 gives low validation accuracy on this dataset. Our CNN model gives the highest validation accuracy, together with good training accuracy and only a little noise. Under the same conditions, every model reaches a satisfactory accuracy, but our CNN performs best, with a validation accuracy of 95%. The proposed CNN architecture thus achieves a higher testing accuracy than the other models (Figs. 10, 11, 12 and 13).
Table 3 CNN classification report
Class        Precision  Recall  F1-score
Bench press  0.98       0.87    0.92
Bicep curl   0.94       0.97    0.95
Deadlift     0.88       0.99    0.93
Squat        0.99       0.94    0.97
Tread mills  0.98       1.00    0.99
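The F1-scores in the CNN classification report (Table 3) are the harmonic mean of precision and recall. The sketch below recomputes one of them from the rounded table values. Because the inputs are already rounded to two decimals, a recomputed score may differ from a tabulated one in the last digit. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Bench press row of the classification report.
bench_press_f1 = round(f1_score(0.98, 0.87), 2)
```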
Fig. 10 Inception V3 accuracy
Fig. 11 VGG16 accuracy
Fig. 12 MobileNet accuracy
5 Future Work
Our proposed CNN model shows better classification accuracy than InceptionV3, VGG16, and MobileNet for different exercise poses. In the future, there are several ways to improve our model through transfer learning. We will apply other models, such as AlexNet and ResNet, to increase accuracy and to make training and feature extraction more efficient. A GPU is also important for training. We further suggest applying an ensemble method to obtain the best accuracy, and other advanced classification concepts, such as transfer learning, can be tested [11].
Fig. 13 CNN accuracy
6 Conclusions
In this paper, we built a model using a CNN architecture that achieves a competitive classification accuracy of up to 95%. We described the number of filters, activation shapes, activation sizes, total parameters, and number of convolutional blocks of our model. Our model is simple, yet it trains much faster and reaches higher accuracy than the more complex models we compared. The experiments demonstrate that the proposed method achieves high accuracy. Finally, CNNs can be used to improve object classification in different areas in new and innovative ways [13].
References
1. Haque, S., Rabby, A.S.A., Laboni, M.A., Neehal, N., Hossain, S.A.: ExNET: deep neural network for exercise pose detection. In: International Conference on Recent Trends in Image Processing and Pattern Recognition, pp. 186–193. Springer, Singapore (2018)
2. Um, T.T., Babakeshizadeh, V., Kulić, D.: Exercise motion classification from large-scale wearable sensor data using convolutional neural networks. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2385–2390. IEEE (2017)
3. Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Poselet conditioned pictorial structures. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595 (2013)
4. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660 (2014)
5. Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.H.: Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1529–1537 (2015)
6. Chandra, S., Kokkinos, I.: Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In: European Conference on Computer Vision, pp. 402–418. Springer, Cham (2016)
7. Long, X., Yin, B., Aarts, R.M.: Single-accelerometer-based daily physical activity classification. In: 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 6107–6110. IEEE (2009)
8. Zaid, G., Bossuet, L., Habrard, A., Venelli, A.: Methodology for efficient CNN architectures in profiling attacks
9. Islam, M.S., Foysal, F.A., Neehal, N., Karim, E., Hossain, S.A.: InceptB: a CNN based classification approach for recognizing traditional Bengali games. Proced. Comput. Sci. 143, 595–602 (2018)
10. Zhang, Y.D., Dong, Z., Chen, X., Jia, W., Du, S., Muhammad, K., Wang, S.H.: Image based fruit category classification by 13-layer deep convolutional neural network and data augmentation. Multimedia Tools Appl. 78(3), 3613–3632 (2019)
11. Kesim, E., Dokur, Z., Olmez, T.: X-Ray chest image classification by a small-sized convolutional neural network. In: 2019 Scientific Meeting on Electrical-Electronics and Biomedical Engineering and Computer Science (EBBT), pp. 1–5. IEEE (2019)
12. Maron, R.C., Weichenthal, M., Utikal, J.S., Hekler, A., Berking, C., Hauschild, A., Enk, A.H., Haferkamp, S., Klode, J., Schadendorf, D., Jansen, P.: Systematic outperformance of 112 dermatologists in multiclass skin cancer image classification by convolutional neural networks. Eur. J. Cancer 119, 57–65 (2019)
13. Vaibhav, K., Prasad, J., Singh, B.: Convolutional neural network for classification for Indian Jewellery. Available at SSRN 3351805 (2019)
Convolutional Neural Network Hyper-Parameter Optimization Using Particle Swarm Optimization
Md. Ferdouse Ahmed Foysal, Nishat Sultana, Tanzina Afroz Rimi, and Majedul Haque Rifat
Abstract CNNs have recently gained popularity in the field of image processing and have proven their value in machine learning. Computational models that use biologically inspired computation have also been combined with CNNs. In this paper, we optimize one CNN hyper-parameter, the convolution size. Optimizing a hyper-parameter means selecting the value with which the model performs best. We use evolutionary algorithm techniques, one of which is particle swarm optimization (PSO). It is a population-based method that selects its population according to the fitness of its members, so that the least fit members tend to be eliminated from the population. We report the accuracy obtained using CNN alone and using CNN and PSO together.
Keywords CNN · PSO · Hyper-Parameter · Population
Md. F. A. Foysal (B) · T. A. Rimi · M. H. Rifat
Daffodil International University, Dhaka, Bangladesh
e-mail: [email protected]
T. A. Rimi e-mail: [email protected]
M. H. Rifat e-mail: [email protected]
N. Sultana
Bangladesh University of Engineering and Technology, ECE Building, Azimpur Road, Dhaka 1205, Bangladesh
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_35

1 Introduction
CNNs have achieved exceptional success in recent years, and backpropagation for training deep neural networks (DNNs) has also become popular. They have proven their value in practical applications, and these technologies outperform humans to some extent. An important concept related to them is that of parameters: neural networks depend on parameters that are termed hyper-parameters. Tuning these parameters is essential in expert practice, so automatic methods need to be designed to identify suitable hyper-parameters. This is especially necessary for complex DNN architectures, for which trial and error is speculative [1].
CNNs belong to the class of artificial neural networks. They are designed specifically for image processing and image recognition, that is, for processing pixel data. The neurons of a convolutional neural network are arranged in a special structure that resembles the brain area responsible for processing visual stimuli. Artificial intelligence is narrowing the gap between the capabilities of humans and machines. Computer vision is a promising domain for CNNs, and its main goal is to empower machines. The information obtained is utilized in many tasks, for example image analysis, image and video recognition, media entertainment, recommendation systems, and natural language processing. Deep learning for computer vision has been developed and refined over time, largely with the help of CNNs.
Hyper-parameters can be of different types, including the number of hidden units and the learning rate. They are set before the model is trained on the data.
Particle swarm optimization (PSO) is a meta-heuristic search algorithm inspired by the movement of flocks of birds. It is widely used and accepted, and it can solve a wide range of problems, mainly optimization problems. PSO is used as an optimization tool in diverse scientific fields such as signal processing, graphics, and robotics.
The search technique has also been applied successfully in civil engineering. It has proved its value in structural design and structural condition assessment, as well as in health monitoring and in the characterization and modeling of structural materials. Transportation network design is another major application area. Traffic flow forecasting, traffic control, and traffic accident forecasting are among its real-time applications. River stage forecasting and the design of water and wastewater distribution networks also play an important role.
2 Associated Literature
In this paper, we optimize a single hyper-parameter of a CNN using particle swarm optimization (PSO). PSO was introduced by observing the behavior of bird flocks and is a meta-heuristic algorithm. In the past, the hyper-parameters of deep neural networks have been optimized with PSO. We use this algorithm to tune only one parameter of the CNN. On our digit classification dataset, the CNN architecture alone gives an accuracy of 94%, while CNN and PSO together give an accuracy of about 95%. Previous works that used PSO to optimize DNN hyper-parameters relied on the MNIST and occasionally the CIFAR-10 dataset to evaluate the automated selection [2].
3 Methodologies
Applying PSO to select the hyper-parameters that give the best result is a niche topic. It could make a valuable contribution to the family of meta-heuristic optimization algorithms. PSO can be used to identify the hyper-parameters of DNNs as well as CNNs [2]. We take hyper-parameter selection for DNNs as the basis of this paper. The architectures used are introduced in the following subsections.
3.1 Deep Neural Network
Deep neural networks are a popular topic in current research. They have proved their value through widespread implementation, and they contribute to research fields ranging from object detection to speech recognition. A deep neural network is a neural network with a certain level of complexity, namely more than two layers [3]. It uses mathematical models to process data. For example, it can be used to detect pedestrians, which reduces the risk of road accidents. It needs a large amount of data to perform well.
3.2 Convolutional Neural Network
A convolutional neural network (CNN) is a kind of artificial neural network designed to process the pixel data of images. It is mainly used for image processing, including image recognition, classification, and segmentation, and more generally for autocorrelated data. A CNN has one or more convolutional layers. A convolution essentially slides a filter over the input. CNN layers are not fully connected.
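Sliding a filter over the input can be sketched as follows for a single channel. As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation. No padding and a stride of 1 are assumed, and the function name and sample data are illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products
    at every position where the kernel fits entirely inside the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))  # the convolution size is the hyper-parameter tuned in this paper
feature_map = conv2d_valid(image, kernel)  # [[45, 54], [81, 90]]
```

Each output value depends only on a small window of the input, which is why convolutional layers are not fully connected. A larger convolution size enlarges that window and reduces the size of the output.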
3.3 Hyper-Parameters
Hyper-parameters are variables that must be set before a learning algorithm is applied to a dataset. They fall into two classes: optimizer parameters and model-specific parameters. Optimizer parameters govern the optimization and training process, whereas model-specific parameters determine the structure of the model [3]. The number of epochs, the learning rate, and the minibatch size are optimizer parameters. The number of hidden nodes or units is a model hyper-parameter.
3.3.1 Hyper-Parameter Selection
The choice of hyper-parameter values has a strong impact on the convolutional layers. The network depth and the number and sizes of the filters are hyper-parameters, and they have a drastic effect on the performance of the classifier. Hyper-parameters can be selected in various ways. Automated hyper-parameter selection is either model-based or model-free [3]. Model-free selection comes in two variants, grid search and random search. Model-based selection offers various ways to select parameters. In this paper, we use the model-based selection technique of evolutionary algorithms [4]. PSO is an evolutionary algorithm.
3.3.2 Model-Free Hyper-Parameter Selection
Grid search and random search are the two types of model-free selection. Grid search is commonly applied to DNN hyper-parameters and performs well when the number of parameters is low. A domain of values is chosen first, and the search then explores the values within this domain. The alternative technique is random search, which is comparatively faster than grid search. It cannot be updated dynamically during the experiment.
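The difference between the two model-free strategies can be sketched with the standard library alone. The search space below (kernel sizes and learning rates) is a hypothetical example, and in practice each candidate would be scored by training and validating a network, which is omitted here.

```python
import itertools
import random

def grid_search(space):
    """Enumerate every combination of the listed values."""
    names = list(space)
    combos = itertools.product(*(space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

def random_search(space, n_trials, seed=0):
    """Draw a fixed number of random combinations from the same domain."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_trials)]

# Hypothetical domain of values chosen before the search starts.
space = {"kernel_size": [3, 5, 7], "learning_rate": [1e-2, 1e-3]}
grid_candidates = grid_search(space)          # 3 * 2 = 6 configurations
random_candidates = random_search(space, 4)   # only 4 configurations are tried
```

The grid grows multiplicatively with every added hyper-parameter, which is why it is practical only when the number of parameters is low, whereas the cost of random search is set directly by the number of trials.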
3.3.3 Model-Based Hyper-Parameter Selection
Model-based selection relies on various techniques. Probabilistic methods are used quite often, in particular Bayesian optimization, with two common variants: TPE and Spearmint. Spearmint has been used for hyper-parameter selection in DNNs. As a non-probabilistic method, an RBF surrogate model is used. In this paper, we use PSO, which is an evolutionary algorithm.
3.3.4 Evolutionary Algorithms for Hyper-Parameter Selection
Evolutionary algorithms (EAs) play an important role in artificial intelligence as a component of evolutionary computation. An EA works through selection. The population is sorted according to fitness, and the less fit members, either a single member or a set of members, are eliminated. The fit members are allowed to survive [3] until the next good solutions are selected.
3.4 Particle Swarm Optimization
PSO is an algorithm from the swarm intelligence family. It is a population-based meta-heuristic technique. It first selects feasible solutions, called particles, and then changes their positions iteratively [5]. It can operate on a large population in many dimensions. It is a simple yet efficient technique.
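The paper does not give its PSO update rule or settings, so the following is a sketch of the standard global-best PSO and not necessarily the authors' variant. The inertia weight `w`, the acceleration coefficients `c1` and `c2`, the swarm size, and the toy objective are all assumptions. In the paper's setting the objective would be the validation error of a CNN trained with the candidate convolution size.

```python
import numpy as np

def pso_minimize(objective, lower, upper, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard global-best PSO for minimizing `objective` within box bounds."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    pos = rng.uniform(lower, upper, size=(n_particles, dim))  # initial particles
    vel = np.zeros_like(pos)
    pbest_pos = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    g = int(np.argmin(pbest_val))
    gbest_pos, gbest_val = pbest_pos[g].copy(), float(pbest_val[g])
    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity: inertia + pull toward personal best + pull toward global best.
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, lower, upper)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest_pos[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmin(pbest_val))
        if pbest_val[g] < gbest_val:
            gbest_pos, gbest_val = pbest_pos[g].copy(), float(pbest_val[g])
    return gbest_pos, gbest_val

def toy_objective(p):
    """Stand-in for a validation error; its minimum is at (3, 3)."""
    return float(np.sum((p - 3.0) ** 2))

best_pos, best_val = pso_minimize(toy_objective, lower=[1, 1], upper=[11, 11])
```

For an integer hyper-parameter such as a convolution size, one option is to round each particle position before evaluating the objective.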
4 Contribution
In this paper, we introduce an evolutionary algorithm to optimize the selection of CNN hyper-parameters. A CNN has various hyper-parameters that play a vital role in the result of the classifier. Here, we optimize one parameter [6], the convolution size. The evolutionary algorithm is a biology-inspired technique for hyper-parameter selection. It is easy to apply, and it is independent of the target CNN.
5 Experimental Ground
5.1 Experimental Setup
Our hyper-parameter selection method was implemented in Python using the NumPy library, and our CNN was trained using Keras. Automatic hyper-parameter selection also exists in classical machine learning techniques: the regularization of a regression model is a good candidate for optimization, and stochastic gradient descent is another. In all experiments, the PSO configuration was kept fixed.
5.2 Datasets
In this paper, we focus on the optimized selection of the hyper-parameter. We used our own synthetic dataset rather than any of the benchmark datasets [7], such as the well-known MNIST and CIFAR-10/100.
5.3 Experimental CNN Architecture
In this experiment, we use a synthetic dataset that contains roughly 10,000 color images, with 1000 images per class. Designing a CNN architecture for this dataset was challenging. We implemented a typical CNN architecture, which consists of convolutional layers, max-pooling layers, and finally fully connected layers. A ConvNet is a class of DNN. The neurons in each layer are connected to the next layer. This concept is called a multilayer perceptron (Fig. 1).
Fig. 1 Experimental CNN architecture
Our CNN model has 13 layers in total. Three of them are convolutional layers, and there are three max-pooling layers [8]. The model also contains a flatten layer and, in the last phase, two dense layers. The dropout rate is 0.20. The class probabilities are produced by the softmax activation.
6 Analysis of Results
The experimental study presented here is divided into two sub-experiments. First, we experimented with the CNN alone: on the digit classification dataset it gives an accuracy of about 94%, whereas PSO+CNN gives an accuracy of about 95%. We first examined how the CNN behaves with its own convolution size. We used a typical CNN structure. None of the VGG, ResNet, or DenseNet architectures was used. We used a layer-based CNN in which three different types of layers are present. The training data determines the reliability of the classifier model. Training accuracy is the accuracy obtained when the classifier model is applied to the training data. Validation accuracy is the accuracy obtained when the classifier model is applied to a small set of selected unseen data. The accuracy of our model is shown in Figs. 2 and 3.
Fig. 2 Training and validation accuracy while applying CNN
Fig. 3 Training and validation accuracy while applying PSO-CNN
Training loss is the loss incurred while the network is trained on the training data. Validation loss is the loss of the trained network on the validation data (Figs. 4 and 5).
Fig. 4 Training and validation loss while applying CNN
Fig. 5 Training and validation loss while applying PSO-CNN
7 Discussion of Results
PSO-CNN achieves an accuracy of 95%, while CNN achieves 94%. We calculated precision, recall, and F1-score on the test dataset of 2000 images. The classification results for CNN are given in Table 1 and those for PSO-CNN in Table 2. The confusion matrices of CNN and PSO-CNN are shown in Figs. 6 and 7.
Table 1 CNN classification result
Class  Precision  Recall  F1-score
0      0.94       1.00    0.97
1      0.95       0.82    0.88
2      0.99       0.96    0.97
3      0.98       0.93    0.95
4      0.94       0.99    0.97
5      0.91       0.97    0.94
6      0.97       0.96    0.97
7      0.97       0.99    0.98
8      0.98       0.94    0.96
9      0.87       0.92    0.89
AVG    0.95       0.95    0.95

Table 2 PSO-CNN classification result
Class  Precision  Recall  F1-score
0      0.91       0.99    0.95
1      0.95       0.90    0.92
2      0.97       0.98    0.98
3      0.98       0.95    0.97
4      0.92       0.99    0.95
5      0.93       0.99    0.96
6      0.98       0.96    0.97
7      0.98       0.96    0.97
8      0.98       0.92    0.92
9      0.94       0.90    0.92
AVG    0.96       0.95    0.95
Fig. 6 CNN confusion matrix
Fig. 7 PSO-CNN confusion matrix
8 Future Work
In future work, we will focus on selecting the dense layer size, giving preference to the number of layers, the number of neurons in every layer, and the type of layer. We will also work on activation function selection, such as the sigmoid or tanh function, for better performance.
9 Conclusion
In this paper, we present a modified form of the convolutional neural network obtained by optimizing its hyper-parameters. Better performance is obtained when we optimize the convolution size in our model. The CNN's hyper-parameters are optimized using particle swarm optimization. The accuracy of our plain CNN architecture is 94%, and with PSO-CNN classification the accuracy is about 95% after optimization of the CNN's parameter.
References
1. Loussaief, S., Abdelkrim, A.: Convolutional neural network hyper-parameters optimization based on genetic algorithms. Int. J. Adv. Comput. Sci. Appl. 9(10), 252–266 (2018)
2. Soon, F.C., Khaw, H.Y., Chuah, J.H., Kanesan, J.: Hyper-parameters optimisation of deep CNN architecture for vehicle logo recognition. IET Intell. Transp. Syst. 12(8), 939–946 (2018)
3. Lorenzo, P.R., Nalepa, J., Kawulok, M., Ramos, L.S., Pastor, J.R.: Particle swarm optimization for hyper-parameter selection in deep neural networks. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 481–488. ACM, New York, USA (2017)
4. Foysal, M.F.A., Islam, M.S., Karim, A., Neehal, N.: Shot-net: a convolutional neural network for classifying different cricket shots. In: International Conference on Recent Trends in Image Processing and Pattern Recognition, pp. 111–120 (2018)
5. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inf. Proc. Syst. 91–99 (2015)
6. Lee, W.-Y., Ko, K.-E., Geem, Z.-W., Sim, K.-B.: Method that determining the Hyperparameter of CNN using HS algorithm. J. Korean Inst. Intell. Syst. 27(1), 22–28 (2017)
7. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
8. Ye, F.: Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data. PLoS ONE 12(12), e0188746 (2017)
Automatic Gender Identification Through Speech
Debanjan Banerjee, Suchibrota Dutta, and Arijit Ghosal
Abstract Gender detection from a speech signal is a significant task, as it is the preliminary step in building a system that identifies a person. Gender detection using speech is comparatively easier than gender detection from facial images. Moreover, gender detection is required for security purposes. There are many approaches to gender detection: facial image based, fingerprint based, etc. For simplicity, we propose a simple scheme based on low-level features for detecting the gender of a speaker. We work with pitch-based acoustic features to identify gender. Neural network (NN), Naïve Bayes, and random forest classifiers are used for the classification task. The experimental results and comparative analysis show the strength of the proposed feature set.
Keywords Pitch · PCA · Gender identification
D. Banerjee (B)
Department of Management Information Systems, Sarva Siksha Mission Kolkata, Kolkata, West Bengal 700042, India
e-mail: [email protected]
S. Dutta
Department of Information Technology and Mathematics, Royal Thimphu College, Thimphu, Bhutan
e-mail: [email protected]
A. Ghosal (B)
Department of Information Technology, St. Thomas' College of Engineering and Technology, Kolkata, West Bengal 700023, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_36
1 Introduction and Literature Survey
Personal identification is steadily gaining importance as verification-based systems grow in demand. Identifying individuals has wide applications in the security domain. In recent years, biometrics has become a very important security tool that plays a vital role in access control and transaction authorization. A biometric is a human attribute that authenticates a person's identity. Face recognition provides an efficient and robust way of identifying the gender of a human being. Among other things, faces allow people to determine the gender and age of a person [1] and, to a certain extent, emotions. The most challenging task in any facial classification procedure is the representation of the face as a vector. The vector should represent the facial characteristics as efficiently as possible so that it captures all available information about the face. This is not an easy task. By comparison, voice-based gender detection is much easier. Voice is a popular biometric because it is a natural signal and does not require a specialized input device: a telephone or a microphone-equipped personal computer can be used anywhere as the input device. Voice biometrics thus accomplishes the task of speaker gender recognition [2]. Humans have a natural ability to recognize speech and speakers, but emulating this ability with a computer is not easy. With the advancement of computer technology, speaker recognition has drawn significant interest from researchers. The speech signal not only expresses the words and messages being spoken but also carries information about the gender of the speaker. In principle, speaker gender detection decodes the speech signal in an ordered manner, with different acoustic features calculated from the input signal.
Researchers have taken many different approaches to detecting the gender of a person. Some of them have also worked with facial images. Brunelli and Poggio [3] used a geometrical feature-based design for gender detection, extracting a set of 16 geometrical features from frontal-view images of people. Yang and Huang [4] proposed a hierarchical knowledge-based method for detecting human faces in complex backgrounds. Shackleton and Welsh [5] used a template matching-based method to locate facial features. Wu et al. [6] used color information from hair and skin to locate and detect facial features in human faces. Kawaguchi et al. [7] used the circular Hough transform to locate the eyes in human faces. Lapedirza et al. [8] worked with a top-down reconstruction-based algorithm to extract facial features. Golomb et al. [9] used a 900 × 40 × 900 fully connected back-propagation network; they scaled images to 30 × 30 and rotated them so that the eyes and mouth were positioned similarly in each image. Tolba [10] proposed gender detection using two different neural network classifiers, namely a radial basis function (RBF) network and learning vector quantization (LVQ). Wiskott et al. [11] followed a graph matching approach for gender identification. Moghaddam and Yang [12] used an SVM classifier for gender identification, working with low-resolution images (21 × 12). Cottrell et al. [13] proposed a scheme that reduced the dimension of the whole face image using an autoencoder network and classified gender based on the reduced input features. Jain and Huang [14] computed features using independent component analysis (ICA) and classified them with linear discriminant analysis (LDA). Dey et al. [15] detected gender based on computer vision, following the method of Shakhnarovich et al. [16] for detecting the facial region in a given image. Fingerprints are also an important and popular basis for gender detection. Kaur et al. [17] studied various approaches to fingerprint-based gender detection. Sangam et al. [18], in 2011, carried out a study showing major sex and bimanual differences in fingerprint pattern distribution.

Many researchers have worked with statistical features for audio classification. Some of these are time-domain features and some are frequency-domain features. Downie [19] and Saunders [20] worked with short-time energy (STE) and zero crossing rate (ZCR), which are time-domain features. Harb and Chen [21] compared a set of acoustic features and pitch features for gender detection, using several classifiers. Weston et al. [22] studied how male and female voices activate distinct sites of the non-primary auditory cortex. Rakesh et al. [23] used LabVIEW for gender recognition. Doukhan et al. [24] worked on monitoring gender equality in French audiovisual streams based on the speaking time of men and women. Safavi et al. [25] identified age group and gender from children’s speech. Müller and Ewert [26] proposed the Chroma Toolbox for MATLAB to extract chroma-based audio features.
Malhi and Gao [27] proposed a principal component analysis (PCA)-based design for feature selection by reducing the dimension of a long feature set. Pahwa and Aggarwal [28] employed mel coefficient-based features for gender identification. Ali et al. [29] applied a fast Fourier transform (FFT)-based scheme for gender classification from speech. In our work, we focus on the effectiveness of the features. It is observed that the frequency levels of male and female speakers are not the same. The frequency level of the human voice is well captured by pitch-based features, as pitch is connected with the fundamental frequency. So, pitch-based acoustic features have been used for gender detection. To make the classifier’s job easy, our aim was to design a good feature set. The paper is organized as follows: the proposed method is discussed in Sect. 2, which covers the design of the feature set and the classification technique. Section 3 describes the experimental results along with a comparison with previous work, and Sect. 4 gives the concluding remarks.
2 Proposed Scheme

Feature calculation and classification are the two major parts of this work. Figure 1 illustrates the flow of the work. Feature calculation and classification are discussed in Sects. 2.1 and 2.2, respectively.
Fig. 1 Flow of this work

2.1 Feature Calculation

Zero crossing rate (ZCR) and short-time energy (STE) are the most widely used time-domain, low-level audio features. They play a major part in gender detection, and many research works on gender identification have already used them. This overuse of ZCR and STE motivated us to look for an alternative acoustic feature set that can also identify the gender of a speaker. While studying the characteristics of male and female voices, we observed that male and female speech differ mostly in the perceptual domain. Motivated by this observation, we consider pitch-based features for recognizing the gender of the speaker in this work. Pitch is regarded as the perceived form of frequency. Pitch measures the shrillness of speech, and shrillness differs between male and female voices. To extract pitch-based features, the male and female speech signals are decomposed into 108 frequency bands. Among these 108 bands, the values of the first 20 are always zero. Each of the remaining 88 bands is divided into short frames. The short-time mean square power (STMSP) is computed for each frequency band, and the mean STMSP of each band is taken. Hence, 88 STMSP values are obtained for every speech signal. The STMSP distributions over the 88 frequency bands for male and female speech are shown in Figs. 2 and 3, respectively. These two figures show that the STMSP distributions of male and female speech are not the same. We have used the Chroma Toolbox [26] to extract the pitch-based features.

Fig. 2 STMSP distribution for male speech

Fig. 3 STMSP distribution for female speech

As only pitch-based features are considered in this work, the feature vector has 88 dimensions after the STMSP distribution is extracted for all speech signals. An 88-dimensional feature set is too large to work with. Such a long feature vector increases not only the feature computation time but also the complexity of the whole system, which degrades performance. Hence, dimension reduction of the feature set is essential in this work. To reduce the dimension, we use principal component analysis (PCA) [27]. PCA is a widely used scheme for reducing the dimension of a long feature set. PCA assesses the significance of the features and produces a new set of features using eigenvalues and eigenvectors. In this work, PCA is applied to the 88-dimensional feature set, which is reduced to two feature sets. The first feature set is 30-dimensional (termed F1) and the second is 35-dimensional (termed F2). These two feature sets (F1 and F2) are fed to the classifiers for performance analysis.
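As a rough illustration of the STMSP feature described above, the sketch below frames each band signal, computes the mean square power of every frame, and averages over the frames to obtain one value per band. It assumes the decomposition into frequency bands has already been done (the paper uses the Chroma Toolbox for that step). The function names, the frame length, and the 50% frame overlap (taken from Sect. 3) are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split a signal into frames of frame_len samples, advancing by hop samples."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def mean_stmsp(band_signal: Sequence[float], frame_len: int) -> float:
    """Mean over frames of the short-time mean square power of one band."""
    hop = max(1, frame_len // 2)  # assumption: 50% overlap between consecutive frames
    frames = frame_signal(band_signal, frame_len, hop)
    if not frames:
        return 0.0
    powers = [sum(x * x for x in frame) / len(frame) for frame in frames]
    return sum(powers) / len(powers)


def stmsp_feature_vector(bands: Sequence[Sequence[float]], frame_len: int) -> List[float]:
    """One mean-STMSP value per frequency band (88 values for 88 bands)."""
    return [mean_stmsp(band, frame_len) for band in bands]


# Toy input: three hypothetical band signals of 8 samples each.
# A real speech file would contribute 88 band signals.
toy_bands = [
    [1.0] * 8,
    [0.0] * 8,
    [1.0, -1.0] * 4,
]
features = stmsp_feature_vector(toy_bands, frame_len=4)
```

With 88 band signals as input, `stmsp_feature_vector` would return the 88-dimensional vector that is later reduced by PCA.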
2.2 Classification

To emphasize the strength of the proposed feature set, a simple classification design using well-accepted classifiers has been adopted. We have used neural network (NN), Naïve Bayes, and random forest classifiers with the two feature sets (F1 and F2). The neural network is implemented as a multi-layer perceptron (MLP). An audio data set of 600 speech files, 300 each for male and female speech, has been prepared for this work. Half of the data set is used for training the classifiers and the rest for testing. So, of the 300 male speech files, 150 are used in the training set and the rest in the testing set; the same holds for the female speech files. We have configured the MLP twice, once for each feature set. For F1, the MLP has 30 neurons in the input layer, corresponding to the 30 features of F1; two neurons in the output layer, corresponding to the two classes (male and female); and 16 neurons in the single hidden layer. For F2, the MLP has 35 neurons in the input layer, corresponding to the 35 features of F2; two neurons in the output layer; and 19 neurons in the single hidden layer. The Naïve Bayes and random forest classifiers have been evaluated with ten-fold cross-validation.
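The split and the two network layouts described above can be sketched as follows. This is a minimal illustration, not the authors' code: the file names are hypothetical, the per-class shuffle and its seed are assumptions, and the parameter count assumes one bias term per hidden and output neuron, which the text does not state.

```python
import random
from typing import Dict, List, Tuple

LabeledFile = Tuple[str, str]  # (file name, class label)


def split_half_per_class(files_by_class: Dict[str, List[str]],
                         seed: int = 0) -> Tuple[List[LabeledFile], List[LabeledFile]]:
    """Shuffle each class separately; send half to training and half to testing."""
    rng = random.Random(seed)
    train: List[LabeledFile] = []
    test: List[LabeledFile] = []
    for label, files in files_by_class.items():
        shuffled = list(files)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train += [(name, label) for name in shuffled[:half]]
        test += [(name, label) for name in shuffled[half:]]
    return train, test


def mlp_parameter_count(n_inputs: int, n_hidden: int, n_outputs: int) -> int:
    """Weights plus biases of a fully connected network with one hidden layer."""
    return (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)


# Hypothetical file names; the paper does not publish its file list.
dataset = {
    "male": [f"male_{i:03d}.wav" for i in range(300)],
    "female": [f"female_{i:03d}.wav" for i in range(300)],
}
train_set, test_set = split_half_per_class(dataset)

# Layer sizes as reported in the text: (input, hidden, output).
layouts = {"F1": (30, 16, 2), "F2": (35, 19, 2)}
param_counts = {name: mlp_parameter_count(*sizes) for name, sizes in layouts.items()}
```

Under these assumptions the F1 network has 530 trainable parameters and the F2 network has 724, so the larger feature set adds little to the model size.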
3 Experimental Results

For the gender identification task, an audio data set of 600 speech files, 300 each for male and female speech, has been prepared. All files in the data set are mono and 90 s long. They were collected from various sources such as compact discs, audio recordings of live performances, and YouTube. The speech files cover diverse age groups for both female and male speakers, and different languages have been considered for both genders. Some of the speech files are noisy, to reflect real-world conditions. Each speech file is partitioned into frames of very short duration. Every frame overlaps its previous frame by 50%, to avoid losing any boundary characteristic of a frame. Half of the data set is used for training the neural network (NN), Naïve Bayes, and random forest classifiers, and the rest is used for testing them. Once male-female gender identification has been performed, the same classification task is performed again with the training set and the testing set swapped. We have done this for both F1 and F2. The mean of the two rounds is taken as the final classification result, which is reported in Table 1 for both F1 and F2. Table 1 shows that F2 performs better than F1, so F2 is the proposed feature set in this work.

Table 1 Male-female gender identification accuracy

| Classification method | Classification accuracy (in %) for F1 feature set | Classification accuracy (in %) for F2 feature set |
|---|---|---|
| Random forest | 94.50 | 96.00 |
| Naïve Bayes | 93.00 | 95.50 |
| Neural network (NN) | 91.50 | 94.50 |

Table 2 Comparison of performance with other work

| Previous approach | Classification accuracy (in %) |
|---|---|
| Pahwa and Aggarwal [28] (considering MFC1) | 94.50 |
| Pahwa and Aggarwal [28] (excluding MFC1) | 93.00 |
| Ali, Islam, and Hossain [29] | 92.50 |
| Proposed feature set (F2) | 96.00 |
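The two-round evaluation described in this section (train on one half, test on the other, then swap the halves and average) can be sketched as below. The nearest-centroid classifier is only a stand-in so that the example runs without external libraries; it is not one of the classifiers used in the paper, and the one-dimensional toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[List[float], str]  # (feature vector, label)
Predictor = Callable[[List[float]], str]


def train_nearest_centroid(samples: Sequence[Sample]) -> Predictor:
    """Stand-in classifier: predict the label whose mean feature vector is closest."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in total] for y, total in sums.items()}

    def predict(x: List[float]) -> str:
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return predict


def accuracy(predict: Predictor, samples: Sequence[Sample]) -> float:
    """Percentage of samples whose predicted label matches the true label."""
    correct = sum(1 for x, y in samples if predict(x) == y)
    return 100.0 * correct / len(samples)


def swapped_holdout_accuracy(half_a: Sequence[Sample], half_b: Sequence[Sample]) -> float:
    """Train on A and test on B, then swap the roles, and return the mean accuracy."""
    round_1 = accuracy(train_nearest_centroid(half_a), half_b)
    round_2 = accuracy(train_nearest_centroid(half_b), half_a)
    return (round_1 + round_2) / 2.0


# Hypothetical one-dimensional features, for illustration only.
half_a = [([1.0], "male"), ([2.0], "male"), ([8.0], "female"), ([9.0], "female")]
half_b = [([1.5], "male"), ([6.0], "male"), ([7.0], "female"), ([9.5], "female")]
mean_acc = swapped_holdout_accuracy(half_a, half_b)
```

On this toy data the first round scores 75% and the second 100%, so the reported figure would be their mean. Averaging the two rounds means every file is used once for training and once for testing.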
3.1 Performance Comparison with Other Work

The classification strength of the proposed feature set (F2) has been compared with existing work. The feature sets proposed by Pahwa and Aggarwal [28] and Ali et al. [29] have been considered for this comparative analysis. The result of the comparison is given in Table 2, which shows that the proposed pitch-based feature set (F2) performs better than the feature sets proposed by Pahwa and Aggarwal [28] and Ali et al. [29].
4 Conclusion

The main aim of this work is to identify the gender of a speaker by classifying male and female speech. Pitch-based features have been used for this purpose, and PCA has been used to reduce the dimension of the feature set. Two feature sets have been considered: one 30-dimensional (F1) and the other 35-dimensional (F2). Both were generated by applying PCA to the 88-dimensional pitch-based feature set. The experimental results show that the proposed feature set performs better than previous approaches. In future, each gender may be further subdivided by age group. Gender identification may also be extended to transgender speakers using a different set of features.
References
1. Kwon, Y.H., da Vitoria Lobo, N.: Age classification from facial images. Comput. Vis. Image Underst. 74(1), 1–21 (1999)
2. Reynolds, D.A.: Overview of automatic speaker recognition. Presented at JHU 2008 Workshop Summer School
3. Brunelli, R., Poggio, T.: HyperBF networks for real object recognition. In: Proceedings of ECCV, pp. 792–800 (1992)
4. Yang, G., Huang, T.: Human face detection in a scene. In: Computer Vision and Pattern Recognition, pp. 453–458 (1993)
5. Shackleton, M.A., Welsh, W.J.: Classification of facial features for recognition. In: Computer Vision and Pattern Recognition, pp. 573–579 (1991)
6. Wu, H., Yokoyama, T., Pramadihanto, D., Yachida, M.: Face and facial feature extraction from color image. In: Computer Vision and Pattern Recognition, pp. 573–579 (1991)
7. Kawaguchi, T., Hidaka, D., Pramadihanto, D., Rizon, M.: Detection of eyes from human faces by hough transform and separability filter. In: Proceedings of IEEE, ICRP, vol. 1, pp. 49–52 (2000)
8. Lapedirza, A., Masip, D., Vitria, J.: Are external face features useful for automatic face classification? In: IEEE Computer Society Conference on Computer Vision and Pattern Classification, vol. 3, pp. 151–157 (2005)
9. Golomb, B.A., Lawrence, D.T., Sejnowski, T.J.: Sexnet: a neural network identifies sex from human faces. Adv. Neural. Inf. Proces. Syst. 3, 572–577 (1991)
10. Tolba, A.S.: Invariant gender identification. Digit. Sign. Proces. 11, 222–240 (2001)
11. Wiskott, L., Fellous, J.-M., Krüger, N., von der Malsburg, C.: Face recognition and gender determination. In: Proceedings of the International Workshop on Automatic Face and Gesture Recognition, pp. 92–97 (1995)
12. Moghaddam, B., Yang, M.: Gender classification with support vector machines. In: 4th IEEE International Conference on Automatic Face and Gesture Recognition, pp. 306–311 (2000)
13. Cottrell, Metcalfe: Representing face images for emotion classification. Citeseer (1997)
14. Jain, A., Huang, J.: Integrating independent components and linear discriminant analysis for gender classification. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 159–163 (2004)
15. Dey, E.K., Khan, M., Ali, Md.H.: Computer vision-based gender detection from facial image. Int. J. Adv. Comput. Sci. 3(8), 428–433 (2013)
16. Shakhnarovich, G., Viola, P.A., Moghaddam, B.A.: Unified learning framework for real time face detection and classification. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp. 14–21 (2002)
17. Kaur, R., Ghosh Mazumdar, S., Bhonsle, D.: A study on various methods of gender identification based on fingerprints. Int. J. Emerg. Technol. Adv. Eng. 2(4), 532–537 (2012)
18. Sangam, M.R., Krupadanam, K., Anasuya, K.: A study of finger prints: bilateral asymmetry and sex difference in the region of Andhra Pradesh. J. Clin. Diagn. Res. 5(3), 597–600 (2011)
19. Downie, J.: The scientific evaluation of music information retrieval systems: foundations and future. Comput. Music J. 28(2), 12–33 (2004)
20. Saunders, J.: Real-time discrimination of broadcast speech/music. In: IEEE International Conference on Acoustics, Speech, Signal Processing, pp. 993–996 (1996)
21. Harb, H., Chen, L.: Voice-based gender identification in multimedia applications. J. Intell. Inf. Syst. 24(2), 179–198 (2005)
22. Weston, P.S.J., Hunter, M.D., Sokhi, D.S., Wilkinson, I.D., Woodruff, P.W.R.: Discrimination of voice gender in the human auditory cortex. NeuroImage 105, 208–214 (2015)
23. Rakesh, K., Dutta, S., Shama, K.: Gender recognition using speech processing techniques in labview. Int. J. Adv. Eng. Technol. 1(2), 51–63 (2011)
24. Doukhan, D., Carrive, J., Vallet, F., Larcher, A., Meignier, S., Le Mans, F.: An open-source speaker gender detection framework for monitoring gender equality. In: Acoustics, Speech and Signal Processing (ICASSP) (2018)
25. Safavi, S., Russell, M., Jančovič, P.: Automatic speaker, age-group and gender identification from children’s speech. Comput. Speech Lang. 50, 141–156 (2018)
26. Müller, M., Ewert, S.: Chroma toolbox: MATLAB implementations for extracting variants of chroma-based audio features. In: Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR) (2012)
27. Malhi, A., Gao, R.X.: PCA-based feature selection scheme for machine defect classification. IEEE Trans. Instrum. Meas. 53(6), 1517–1525 (2004)
28. Pahwa, A., Aggarwal, G.: Speech feature extraction for gender recognition. Int. J. Image Graphics Sign. Proces. 8(9), 17–25 (2016)
29. Ali, Md.S., Islam, Md.S., Hossain, Md.A.: Gender recognition system using speech signal. Int. J. Comput. Sci. Eng. Inf. Technol. (IJCSEIT) 2(1), 1–9 (2012)
Machine Learning-Based Social Media Analysis for Suicide Risk Assessment Sumit Gupta, Dipnarayan Das, Moumita Chatterjee, and Sayani Naskar
Abstract Social media is a relatively new phenomenon that has swept the world during the past decade. With the increase in the number of people joining the virtual bandwagon, huge amount of unstructured text is being generated. These texts can prove to be very useful in comprehending the mental state of the user and in predicting one’s level of depression and suicide ideation. This paper analyzes Reddit posts to identify users with poor mental health conditions who are on the verge of inflicting self-harm or committing suicide. In the process, Machine Learning models are built using six different classification techniques and Sentiment Analysis is performed to extract features that depict the emotional mindset of an online user. Naïve Bayes classifier emerged as a stable classifier with a precision value of 71.40%, thus showing an affirmative cue in solving the task of suicide risk assessment. Keywords Mental health · Depression · Suicide · Social media · Reddit · Machine Learning · Sentiment Analysis
S. Gupta (B) · D. Das · M. Chatterjee · S. Naskar Department of Computer Science and Engineering, University Institute of Technology, The University of Burdwan, Golapbag (North), Burdwan, West Bengal 713104, India e-mail: [email protected] D. Das e-mail: [email protected] M. Chatterjee e-mail: [email protected] S. Naskar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_37
1 Introduction

In this fast-paced world, there is no denying that hectic work schedules, peer pressure, family stress, financial instability, and the challenge to outperform others are among the pivotal and pressing factors that negatively affect individuals in particular and society in general. As a result, not just physical health but also mental health is adversely affected. Mental health refers to the cognitive, behavioral, and emotional well-being of an individual, which determines how a person thinks, feels, and behaves. A “sound mental health” would mean an absence of a mental disorder [1]. But owing to multiple obligations and ongoing tussles, a person becomes depressed and ends up putting his/her health on the back burner. Depression is a mood disorder that causes a persistent feeling of sadness and loss of interest and can lead to a variety of emotional and physical problems. A depressed person becomes powerless and unable to perform even normal day-to-day activities. Moreover, it makes one feel as if life is not worth living, which in some cases results in the growth of suicidal tendencies [2]. Suicide is when people direct violence at themselves with the intent to end their lives, and they die as a result of their actions. Suicide is a leading cause of death worldwide and also in India [3]. In the year 2000, the World Health Organization (WHO) launched the multisite intervention study on suicidal behaviors (SUPRE-MISS), which aimed at increasing knowledge about suicidal behaviors and about the effectiveness of interventions for suicide attempters in culturally diverse places around the world [4]. India’s suicide rate is 16.5 suicides per 100,000 people. Sri Lanka stands second in the region with a suicide rate of 14.6, and Thailand ranks third with a rate of 14.4. Each year about 10.5 per 100,000 people die by suicide in India alone [5].
This has opened a whole gamut of discussions on how to determine whether an individual suffers from a mental illness and could possibly attempt suicide, so that proper counseling and therapy can be provided to deal with the situation. It is well known that, owing to the availability of the Internet, an Internet user spends much of his/her time on social media sites. Chatting, sharing photographs, updating statuses, commenting on other posts, etc., have made users addicted to the virtual world. Social media platforms such as chat rooms, blogging Web sites, video sites, and social networking sites have become fundamental to the way many people and organizations communicate and share opinions, ideas, and information [6]. Due to the widespread use of social media, a lot of data are being generated and are available for research, primarily in the fields of Sentiment Analysis, Emotional Intelligence, Mood Detection, etc. This paper attempts to analyze texts collected from Reddit, a social sharing website built around users submitting links, pictures, and texts [7], and to predict from Reddit posts whether the user suffers from depression and whether he/she is planning to harm himself/herself or commit suicide. In the process, various Machine Learning classification tools and the SenticNet 5 lexicon have been employed to perform suicide risk assessment automatically, so that such individuals can be identified and counseled before the harm is actually done. The rest of the paper is organized as follows: Sect. 2 discusses previous research in the domain of suicide ideation. Section 3 explains the proposed system through its working principle and system workflow. The implementation is discussed in Sect. 4 along with the results. Section 5 concludes the paper and paves the way for future improvement.
2 Previous Related Works

A good amount of research has been done on suicide ideation in social media texts. A few of the prominent works are discussed in this section.

Burnap et al. [8] classified tweets related to suicidal communication into seven different suicide categories or classes using various classifiers. To do so, lexical, structural, emotive, and psychological features were extracted. Initially, for the baseline experiment, Support Vector Machine (SVM), J48 Decision Trees (DT), and Naive Bayes (NB) classifiers were used. Later, an ensemble classifier was built using the Rotation Forest algorithm, which produced an overall F-score of 72.80%.

In the work by Ji et al. [9], the goal was to detect suicidal ideation in online content from the Reddit platform and Twitter via supervised learning methods. Statistical, syntactic, linguistic, word embedding, and topic features were extracted, and experiments were performed using six different classifiers, viz. SVM, Random Forest (RF), Gradient Boosted Decision Tree (GBDT), eXtreme Gradient Boosting (XGBoost), Multi Layer Feedforward Neural Network (MLFFNN), and Long Short-Term Memory (LSTM). The best accuracy of 96.38% was obtained with the Random Forest classifier.

Carson et al. [10] proposed a Random Forest-based classification model using Natural Language Processing (NLP) for identifying suicidal behavior among psychiatrically hospitalized adolescents. Unstructured clinical notes of 73 patients were collected and analyzed to retrieve phrases associated with suicidal tendencies. The proposed model had a moderate sensitivity of 0.83, a specificity of 0.22, a PPV of 0.42, an NPV of 0.67, and an accuracy of 47%.

AlSagri and Ykhlef [11] attempted to predict whether a Twitter user is depressed based on the user’s network behavior and tweets.
Features related to user’s activity (such as categorical, as_is and norm) and tweet-specific features (such as self-center, information gain, word frequency, sentiment, and synonyms) were extracted. Four different classifiers were employed for the purpose of text classification, namely SVM-L (with Linear kernel), SVM-R (with Radial kernel), NB, and Decision Tree (DT). SVM-L classifier showed the best performance with an accuracy of 82.50%.
Tadesse et al. [12] presented an approach to detect posts related to suicide ideation using Deep Learning and Machine Learning-based techniques. A subreddit data corpus comprising Reddit posts that were either suicidal or non-suicidal in nature was used. Various classifiers such as RF, SVM, NB, XGBoost, LSTM, Convolutional Neural Network (CNN), and hybrid LSTM-CNN were built. The combined LSTM-CNN model built on top of word2vec features performed better than the other models, with an accuracy of 93.80%.

In the work by Vioulès et al. [13], tweets were classified based on suicide-related content. Various behavioral and textual features (both user-centric and post-centric) were extracted using two approaches, viz. an NLP-based approach and an ML-based text classification approach, and were passed through a martingale framework used for detecting change in data streams. The experiments were performed using eight different classification algorithms, namely multinomial Naïve Bayes, Sequential Minimal Optimization (SMO) with a polykernel, C4.5 Decision Tree (J48), Nearest Neighbor Classifier (IB1), Multinomial Logistic Regression, Rule Induction (Jrip), Random Forest (RF), and SMO with a Pearson VII universal kernel function (PUK). J48 and SMO (PUK) emerged as the best-performing classifiers and were used in the two-step classification approach of the martingale framework, producing recall of 72% and 79.10%, respectively.
3 Proposed Methodology

In this paper, posts from the Reddit social media platform have been collected and analyzed to predict whether a user posting such texts suffers from depression. The proposed methodology requires the creation of a set of training documents with known suicidal classes, automatic extraction of features to identify suicidal tendency, and training of the system using Machine Learning tools to facilitate the task of suicide ideation detection and/or risk assessment. After collecting Reddit posts, four features, namely sentence average length, mean sentiment of each text, negative word hit counter, and a special threshold value, are extracted in the feature extraction phase. The mean sentiment value is calculated using formula (1):

Mean Sentiment = (positive sentiment + negative sentiment) / total words    (1)

In a supervised learning environment, the collected data are divided into two subsets, viz. a training set and a testing set. As the Reddit dataset is unbalanced in nature, four balanced chunks of data (posts or documents) have been considered for experimentation and analysis. These are the (10:10), (20:20), (30:30), and (40:40) corpora, where every chunk is represented as (number of training documents: number of testing documents).
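Formula (1) and the two simpler features (sentence average length and negative word hit counter) might be computed along the following lines. This is a sketch under stated assumptions: the tiny lexicon and its scores are hypothetical stand-ins for a real lexicon such as SenticNet 5, the tokenizer is deliberately naive, and sentence length is assumed to be measured in words.

```python
import re
from typing import Dict, List

# Hypothetical polarity scores in [-1, 1]; a real system would look these up
# in a sentiment lexicon rather than hard-code them.
TOY_LEXICON: Dict[str, float] = {
    "happy": 0.95,
    "good": 0.5,
    "tired": -0.5,
    "hopeless": -0.75,
}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def mean_sentiment(text: str, lexicon: Dict[str, float] = TOY_LEXICON) -> float:
    """Formula (1): (positive sentiment + negative sentiment) / total words."""
    words = tokenize(text)
    if not words:
        return 0.0
    scores = [lexicon.get(word, 0.0) for word in words]
    positive = sum(s for s in scores if s > 0)
    negative = sum(s for s in scores if s < 0)
    return (positive + negative) / len(words)


def negative_word_hits(text: str, lexicon: Dict[str, float] = TOY_LEXICON) -> int:
    """Number of words in the text that carry a negative score in the lexicon."""
    return sum(1 for word in tokenize(text) if lexicon.get(word, 0.0) < 0)


def average_sentence_length(text: str) -> float:
    """Mean number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


post = "I am so tired. I feel hopeless today."
features = {
    "avg_sentence_length": average_sentence_length(post),
    "mean_sentiment": mean_sentiment(post),
    "negative_hits": negative_word_hits(post),
}
```

Because negative scores are stored as negative numbers, adding the positive and negative totals yields a net sentiment, which is then normalized by the word count so that long and short posts are comparable.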
The following six classification techniques are employed here: J48, LogitBoost, Naïve Bayes, Random Forest, Sequential Minimal Optimization (SMO), and Support Vector Machine (SVM). Models built using each classifier are tested and their performances compared, and the classifier that best predicts whether a user has suicidal tendencies is identified.

Further, Sentiment Analysis is performed on the dataset so that the emotional mindset of the user can be better understood. For instance, a happy person tends to begin a text with words of positive degree and greater positive sentiment value (or score) than the subsequent words, i.e., the sentiment value naturally varies, and in most cases decreases, over the rest of the text. In contrast, an unhappy person begins with a word of negative degree and greater negative sentiment value, which gradually decreases over the rest of the text. In the proposed system, a pivot positive sentiment value of 0.95 is taken for the word “happy”, and a new sentiment value is then assigned whenever a word with a lesser sentiment value than that of the previous word is encountered. Finally, all the sentiment values are summed and divided by the number of such assignments, thereby reflecting the transition. The final value indicates the type of threshold value (either non-suicidal or suicidal), which can be used for clustering texts based on the inherent motive or suicide risk. Figure 1 depicts the workflow of the proposed system.

Fig. 1 Workflow of the proposed system
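The description of the threshold feature leaves some details open, so the sketch below shows only one possible reading. It assumes that the running value starts at the pivot of 0.95, that a new value is assigned each time a word scores lower than the current running value, that the pivot itself is not counted as an assignment, and that the pivot is returned when no assignment happens. The function name and the example scores are hypothetical.

```python
from typing import List, Sequence

PIVOT = 0.95  # pivot sentiment value the paper associates with the word "happy"


def transition_threshold(word_scores: Sequence[float], pivot: float = PIVOT) -> float:
    """Average of the successively lower sentiment values met while reading a text."""
    current = pivot
    assigned: List[float] = []
    for score in word_scores:
        if score < current:  # sentiment dropped below the running value
            current = score
            assigned.append(score)
    if not assigned:  # assumption: fall back to the pivot when nothing was assigned
        return pivot
    return sum(assigned) / len(assigned)


# Hypothetical per-word sentiment scores for two short texts.
upbeat_scores = [0.5, 0.75, 0.25]      # assignments: 0.5, then 0.25
downbeat_scores = [-0.25, 0.5, -0.75]  # assignments: -0.25, then -0.75

upbeat_threshold = transition_threshold(upbeat_scores)
downbeat_threshold = transition_threshold(downbeat_scores)
```

Under this reading, a text whose sentiment keeps falling into negative territory ends up with a negative threshold, while a mostly positive text stays above zero, which is the kind of separation the clustering step would rely on.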
Table 1 Precision values of different classifiers based on corpus size (number of training documents:number of testing documents)

| Classifier | 10:10 | 20:20 | 30:30 | 40:40 |
|---|---|---|---|---|
| J48 | 0.460 | 0.692 | 0.530 | 0.600 |
| LogitBoost | 0.500 | 0.682 | 0.570 | 0.610 |
| Naïve Bayes | 0.600 | 0.714 | 0.658 | 0.610 |
| Random forest | 0.556 | 0.696 | 0.652 | 0.630 |
| SMO | 0.700 | 0.615 | 0.600 | 0.536 |
| SVM | 0.643 | 0.667 | 0.585 | 0.564 |

Note: For the Naïve Bayes classifier, there is a very small variation in the range of precision value when the corpus size increases, thereby making it a stable classifier for feature engineering.
4 Implementation and Results

Dataset: In this work, the Reddit C-SSRS Suicide Dataset, which comprises Reddit posts by 500 Redditors (Reddit users), has been used [14]. This dataset follows a 5-label classification scheme that distinguishes Reddit users by the severity of suicidal tendency. The five labels are suicidal attempt, suicidal behavior, suicidal ideation, suicidal indicator, and supportive (or non-suicidal). For Sentiment Analysis, SenticNet 5, a semantic network of commonsense knowledge, has been used [15].

Result and Analysis: Table 1 shows the precision values of the different classifiers for varying corpus sizes. Figure 2 depicts how the precision values vary with corpus size. Table 1 shows that for the Naïve Bayes classifier the precision value varies only slightly as the corpus size increases, which makes it a stable classifier for feature engineering. Figure 3 shows that when the corpus ratio is 20:20, the result obtained for each classifier is good, and the plot is therefore shifted toward the 20:20 corpus. This suggests that neither a small nor a large dataset always produces better results; the results vary with both corpus size and content. Table 2 shows the maximum true positive (Tp) rate and false positive (Fp) rate of the three best-performing classifiers, viz. SVM, J48, and Naïve Bayes, depending on the corpus size. Figure 4 compares the Tp rate and Fp rate using the ROC plot.
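The precision, Tp rate, and Fp rate reported in this section follow the usual confusion-matrix definitions, which the sketch below spells out. It assumes a binary view of the task (a "suicidal" class against a "supportive" class), whereas the dataset itself has five labels, and the label lists are hypothetical and do not reproduce any figure from the tables.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str = "suicidal") -> Dict[str, int]:
    """Count true/false positives and negatives for one positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Precision, true positive rate, and false positive rate from the counts."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "tp_rate": tp / (tp + fn) if tp + fn else 0.0,
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labels for ten test documents.
y_true = ["suicidal"] * 5 + ["supportive"] * 5
y_pred = ["suicidal"] * 4 + ["supportive"] + ["suicidal"] * 2 + ["supportive"] * 3

counts = confusion_counts(y_true, y_pred)
scores = rates(counts)
```

Each (Fp rate, Tp rate) pair gives one point on an ROC plot, so a classifier closer to the top-left corner makes fewer false alarms for the same detection rate.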
5 Conclusion and Future Work This paper attempts to build a suicide analyzer to predict from Reddit posts whether a user has any suicidal tendency or not. To do so, six Machine Learning classifiers were
Machine Learning-Based Social Media Analysis …
Fig. 2 Graph showing variation in precision values for different classifiers with change in corpus size
Fig. 3 Graph showing the corpus ratio for different classifiers
Table 2 Maximum true positive rate and false positive rate of different best-performing classifiers depending on the corpus size
Corpus size | Classifier | Tp rate | Fp rate
10:10 | SVM | 0.700 | 0.300
20:20 | J48 | 0.750 | 0.250
30:30 | Naïve Bayes | 0.700 | 0.300
40:40 | Naïve Bayes | 0.633 | 0.360
S. Gupta et al.
Fig. 4 ROC plot showing the true positive and false positive rates of different classifiers
employed along with the SenticNet 5 lexicon for performing Sentiment Analysis. Naïve Bayes proved to be a stable classifier as it showed minor variation in the precision values when the corpus size was altered. Thus, this proposed system can be instrumental in predicting the mental health conditions afflicting depressed users. The work discussed here can be improved by selecting more features showcasing the mental framework of an individual. Implementation on other social media texts such as tweets, messages, comments, and blogs can also be attempted. Moreover, Deep Learning techniques can be explored to unveil potential results.
References
1. Newman, T.: What is mental health? https://www.medicalnewstoday.com/articles/154543. Accessed Apr 2020
2. MayoClinic.org: Depression (major depressive disorder). https://www.mayoclinic.org/diseases-conditions/depression/symptoms-causes/syc-20356007. Accessed Apr 2020
3. National Institute of Mental Health: Suicide in America: Frequently Asked Questions. https://www.nimh.nih.gov/health/publications/suicide-faq/tr18-6389-suicideinamericafaq_149986.pdf. Accessed Apr 2020
4. Radhakrishnan, R., Andrade, C.: Suicide: an Indian perspective. Indian J. Psychiatry 54(4), 304–319 (2012)
5. Chestnov, O.: Public health action for the prevention of suicide: a framework. In: WHO Library Cataloguing-in-Publication Data, pp. 1–26. WHO Press, World Health Organization (2002)
6. Luxton, D.D., June, J.D., Fairall, J.M.: Social media and suicide: a public health perspective. Am. J. Public Health Suppl. 2 102(S2), 195–200 (2012)
7. Stegner, B.: What Is Reddit and How Does It Work? https://www.makeuseof.com/tag/what-isreddit/. Accessed Apr 2019
8. Burnap, P., Colombo, W., Scourfield, J.: Machine classification and analysis of suicide-related communication on twitter. In: Proceedings of the 26th ACM Conference on Hypertext & Social Media, pp. 75–84. ACM, Guzelyurt, Northern Cyprus (2015)
9. Ji, S., Yu, C.P., Fung, S.F., Pan, S., Long, G.: Supervised learning for suicidal ideation detection in online user content. Complexity 2018(6157249), 1–10 (2018)
10. Carson, N.J., Mullin, B., Sanchez, M.J., Lu, F., Yang, K., Menezes, M., Le Cook, B.: Identification of suicidal behaviour among psychiatrically hospitalized adolescents using natural language processing and machine learning of electronic health records. PLoS ONE 14(2), e0211116 (2019)
11. AlSagri, H.S., Ykhlef, M.: Machine learning-based approach for depression detection in Twitter using content and activity features. arXiv preprint arXiv:2003.04763, pp. 1–16 (2020)
12. Tadesse, M.M., Lin, H., Xu, B., Yang, L.: Detection of suicide ideation in social media forums using deep learning. Algorithms 13(1), 7 (2020)
13. Vioulès, M.J., Moulahi, B., Azé, J., Bringay, S.: Detection of suicide-related posts in Twitter data streams. IBM J. Res. Dev. 62(1), 7:1–7:12 (2018)
14. Gaur, M., Alambo, A., Sain, J.P., Kursuncu, U., Thirunarayan, K., Kavuluru, R., Sheth, A., Welton, R., Pathak, J.: Reddit C-SSRS suicide dataset. Zenodo (2019). https://doi.org/10.5281/zenodo.2667859
15. Senticnet. Sentic API. https://sentic.net/api/. Accessed Apr 2020
Gender Identification from Bangla Name Using Machine Learning and Deep Learning Algorithms Md. Kowsher, Md. Zahidul Islam Sanjid, Fahmida Afrin, Avishek Das, and Puspita Saha
Abstract People's names are significant in many types of computing applications. In general, names usually carry a distinction between genders. Detecting gender from names with high accuracy can be very challenging for Bangla names written in either Bangla or English characters. In this article, we present a machine learning and deep learning-based classification system that can recognize gender from Bangladeshi people's names. For names written in Bangla characters, the system achieved an accuracy of 91%, and for names written in English characters, 89%. We also combined different machine learning and deep learning classifiers, such as random forest, SVM, Naive Bayes, impact learning, CNN, and LSTM, to analyze which algorithms give better results. In addition, a pre-trained Python model for gender identification from Bangla names has been released.
Md. Kowsher (B), Department of Applied Mathematics, Noakhali Science and Technology University, Noakhali 3814, Bangladesh
Md. Zahidul Islam Sanjid, Department of Computer Science and Engineering, BRAC University, Dhaka 1212, Bangladesh
F. Afrin, Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh
A. Das, Department of Computer Science and Engineering, Chittagong University of Engineering and Technology (CUET), Chittagong, Bangladesh
P. Saha, Department of Computer Science and Telecommunication Engineering, Noakhali Science and Technology University, Noakhali 3814, Bangladesh
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_38
Keywords Gender detection · Impact learning · CNN · LSTM · Machine learning classifiers · Deep neural network
1 Introduction
Bangladeshi people's names reflect multicultural and multilingual qualities and carry both historical and religious influences. Because the country has a diversity of religious groups, names are prominently influenced by non-local languages such as Arabic, Persian, and Vedic. A large part of the population is Muslim, so names are especially influenced by Arabic vocabulary. Some names come from the Bengali and Sanskrit languages. Different cultures use different syllable patterns to denote male and female names. The name chosen for a person is most often related to gender. Consequently, it is hard to formulate a framework for inferring a common pattern that can differentiate genders across a great diversity of names. Artificial intelligence (AI) is a helpful technique in the present-day world for solving problems in linguistics [1]. Step by step, we are moving into an era that will be driven largely by highly trained intelligent systems. These days, machines provide us with rich solutions to everyday problems. AI is well suited to finding non-obvious patterns in a large amount of data. For separating male and female names by patterns, machine learning (ML) and deep learning (DL) can play a vital role. Natural language processing (NLP), machine learning, and deep learning are subfields of artificial intelligence. In this study, to identify gender from a personal name, our primary contribution is to incorporate the n-gram scheme of natural language processing when preprocessing the data [2, 3]. Specifically, we used 2-grams and 3-grams to enrich the data, and all the resulting data were then passed to feature extraction. In our work, we used a count vector, whose main task is to convert text into vectors, which makes the data compatible with the algorithms.
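The combination of character n-grams and count vectors described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the helper names (`char_ngrams`, `build_vocabulary`, `count_vector`), the lowercasing step, and the absence of padding characters are all assumptions of this example. Slicing works on any Unicode string, so the same idea applies to names written in Bangla characters.

```python
from collections import Counter


def char_ngrams(name, n):
    """Slice a name into overlapping character substrings of length n."""
    text = name.lower()  # lowercasing is an assumption of this sketch
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(names, sizes=(2, 3)):
    """Map every n-gram seen in the names to a column index."""
    grams = sorted({g for name in names for n in sizes for g in char_ngrams(name, n)})
    return {gram: index for index, gram in enumerate(grams)}


def count_vector(name, vocabulary, sizes=(2, 3)):
    """Turn one name into a vector of n-gram counts over the vocabulary."""
    counts = Counter(g for n in sizes for g in char_ngrams(name, n))
    # Dicts keep insertion order, so columns follow the vocabulary indices.
    return [counts.get(gram, 0) for gram in vocabulary]


names = ["Arif", "Arifa"]  # illustrative pair of a male and a female name
vocabulary = build_vocabulary(names)
vectors = {name: count_vector(name, vocabulary) for name in names}

print(char_ngrams("Arifa", 2))
print(vectors)
```

In this toy vocabulary the two names differ only in the columns for the n-grams that contain the final "a", which is the kind of signal a classifier can pick up from such vectors.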
For training, we used 13 machine learning and deep learning classifiers, such as support vector machine, random forest classifier, and linear discriminant analysis, and presented a performance comparison. We also discuss a few advantages and disadvantages of our technique and how it could be developed for better performance and use. The contributions are summarized as follows:
• Presenting Bangla gender identification systems for names written in English and Bangla characters.
• Using various machine learning and deep learning classifiers and comparing them through statistical performance analyses.
• Introducing a pre-trained Python module for Bangla gender identification from names, for various applications of Bangla NLP.
The remainder of this paper is organized as follows: Sect. 2 reviews related work on gender identification; Sect. 3 describes the methodology of this work; Sect. 4 discusses the machine learning and deep learning algorithms used in this work; Sect. 5 presents the experiments, the installation of the pre-trained module, and the evaluation of results; and Sect. 6 gives the conclusion of the research work and the future plan.
2 Related Works
Qianjun Shuai et al. [4] used machine learning to investigate a gender recognition method for Chinese names. Anshuman Tripathi et al. [5] used a support vector machine classifier to infer gender from Indian names; their method uses morphological analysis for feature extraction. In [6], Shane Bergsma et al. laid out a novel method for learning gender and number information by utilizing anaphora or pronoun resolution. Fariba Karimi et al. [7] showed that gender inference from names can be biased by country of origin and geography, and they showed how these biases can be reduced. Juan Soler-Company et al. [8] introduced a semi-supervised method for classifying gender from texts. Lucía Santamaría et al. [9] evaluated and benchmarked five name-to-gender inference systems on a classified set of data. In [10], Hee-Geun Yoon et al. put forward a machine learning-based model for differentiating genders from Korean names. Unlike these works, we present strategies for identifying gender from Bangla names written in English and Bangla characters using machine learning and deep learning, and we compare the performance of all the classifiers.
3 Methodology
To determine gender using machine learning, we adopted five phases: data collection, data preprocessing, training with ML algorithms, prediction, and evaluation. The large number of samples and their diversity help ensure the quality of the data. Furthermore, preprocessing made the data more reliable and free of outliers. We applied the nine most suitable machine learning classifiers and deep neural networks to the preprocessed data. Our trained prediction model identified the most fitting method of gender detection with a high level of precision and reliability. Fig. 1 describes the whole methodology of this work.
Fig. 1 Proposed methodology
3.1 Introduction to Dataset
People of various religions and cultures live in Bangladesh, including Bengalis, local tribes, and aboriginal minorities. Names from the same or different cultures share a common trait: female and male names are generally distinguished by certain phonetic patterns. For example, "Arif" is a male name, yet "Arifa" is a female name. An additional "a" at the end of the word changes the phonology and turns the name into a female one. Table 1 shows a sample from the two datasets. Our datasets include two attributes: the first column holds the name, and the second column holds the gender label of that name. We consider two genders, male and female. We included Bangladeshi names written in both English and Bangla characters. There were more than 24,000 entries in our dataset of Bangla names and 31,000 names in English format. The data were taken from primary schools and the sub-district office of Hathazari, Chittagong, Bangladesh. Names written in English text were collected from an online source.
Table 1 Sample of collected Bangla and English character-based Bangla names
Bengali names | Gender | Names | Gender
 | Male | Fardin Ahmed Sujon | Male
 | Male | Samia Tanjin Maya | Female
 | Female | Ridita Taslim | Female
 | Male | Nawar Tanisha | Female
3.2 Data Preprocessing
Data preprocessing is a critical step in the data mining process. It involves converting raw data from different sources into a recognizable format. Properly preprocessed data aids model training and supports the best possible predictions. To make our framework more reliable, we applied multi-stage preprocessing to the dataset.
3.2.1 Missing Value Check
There can be two kinds of missing values. The first is a missing name. In this case, we deleted the row, since we cannot infer the name. The other is a missing label. If the name is commonly female, we assign the label "f"; otherwise, we assign "m". If we cannot determine whether the name is female or male, we delete the row.
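The two missing-value rules above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function `clean_rows` and the lookup table `known_genders` (a hypothetical mapping from a lowercased name to its commonly used gender) are introduced here only for the example.

```python
def clean_rows(rows, known_genders):
    """Drop or repair rows with missing values.

    rows: list of (name, label) pairs; None or "" marks a missing value.
    known_genders: hypothetical lookup from a lowercased name to "f" or "m".
    """
    cleaned = []
    for name, label in rows:
        if not name or not name.strip():
            continue  # a missing name cannot be inferred, so drop the row
        if not label:
            # Missing label: fill it in only if the usual gender is known.
            label = known_genders.get(name.strip().lower())
            if label is None:
                continue  # gender cannot be determined, so drop the row
        cleaned.append((name.strip(), label))
    return cleaned


rows = [
    ("Arifa", "f"),   # complete row, kept as-is
    ("", "m"),        # missing name, dropped
    ("Arif", None),   # missing label, filled from the lookup
    ("Nawar", None),  # missing label and unknown gender, dropped
]
known_genders = {"arif": "m"}  # illustrative lookup, not from the paper
cleaned = clean_rows(rows, known_genders)
print(cleaned)
```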
3.2.2 N-gram
N-gram is a procedure of slicing a string into N length of substrings where N 90%)
Table 1 (continued)
Objectivity:
• Detection of lung carcinoma
• Universal approach to lung and occupational health
• Diagnostic device designed to diagnose lung cancer
• Designing of expert system for chronic respiratory diseases
Techniques used:
• Fuzzy and ACO techniques
• Fuzzy logic and Mamdani
• ANN, fuzzy, inference technique
Dataset:
• Patients data
• Patients data and symptoms
Analysis:
• Smoking, air pollution, and workplace conditions are main risk factors [19]
• With the co-occurrence on gray stage method, extraction of features has been done during training phase. The result shows optimum and significant growth [17]
D. Gaur and S. K. Dubey
Risk Factor Anatomization of Lung Diseases …
Fig. 1 Review methodology
Fig. 2 COPD and asthma mortality in the year 2006, 2015, and 2016
4.2 Importance of Nutrition Value for Prevention of Diseases
Nutrition is known to play an important role in preventing and controlling these chronic diseases and in modulating PCB toxicity [18]. Figure 3 shows the risk factors and the diagnosis techniques applied.
Fig. 3 Risk factors and prediction techniques
According to the analysis, the neurofuzzy technique gives better accuracy. The following segment presents the assessed overall answers to the research questions framed. Table 2 shows the nutrients that are important for curing some lung diseases.
Q1. What are the risk factors responsible for causing lung diseases?
A review of the literature shows that various factors are responsible for lung diseases, including smoking, air pollution, and improper diet.
Q2. Which soft computing technology is used most to solve medical diagnosis problems?
Fuzzy rules and neural networks were the most widely used technologies.
Table 2 Nutrients required for COPD, asthma, tuberculosis, and cancer
Recommended nutrition
COPD | Asthma | Tuberculosis | Cancer
Vitamin D | Vitamin D (milk, eggs, fortified orange juice, etc.) | Carbohydrates | Vitamin A
Vitamin C | Vitamin A (carrots, sweet potatoes, spinach, broccoli) | Fats | Vitamin D
Vitamin E | Apples | Proteins (pulses, nuts and some oil seeds, meat, fish) | Vitamin E
Vitamin A | Bananas | Vitamins | Calcium
Magnesium | Magnesium | Minerals | Green tea
Calcium | | |
5 Conclusion and Future Scope
This paper presents an analysis of risk factors and the nutrition required for lung diseases. An empirical analysis was undertaken for this purpose, and research questions were posed in this regard. The answers to these questions indicate that fuzzy logic, a soft computing technique, has been widely used for medical diagnosis, and that the neurofuzzy technique has given better results. Risk factors were analyzed, and nutrition was recommended on that basis for some pulmonary diseases, namely asthma, COPD, and tuberculosis. Future research shall cover the implementation of risk factor and nutrition analysis using a genetic algorithm. Further analysis of scalability will also be carried out.
References
1. https://timesofindia.indiatimes.com/city/delhi/survey-finds-40-residents-want-to-leave-delhincr/articleshow/71883746.cms
2. Anand, S.K., Sreelalitha, A.V., Sushmitha, B.S.: Designing an efficient fuzzy controller for coronary heart disease. ARPN J. Eng. Appl. Sci. 11(17), 10319–10326 (2016)
3. Walia, N., Singh, H., Sharma, A.: Effective Analysis of Lung Infection using Fuzzy Rules (2016)
4. Chowdhury, T.: Fuzzy logic based expert system for detecting colorectal cancer. IRJET 5(9), 389–393 (2018)
5. Angbera, A., Esiefarienrhe, M., Agaji, I.: Efficient fuzzy-based system for the diagnosis and treatment of tuberculosis (EFBSDTTB). Int. J. Comput. Appl. Technol. Res. 5(2), 34–48 (2016)
6. Obi, J.C., Imainvan, A.A.: Decision support system for the intelligient identification of Alzheimer using neuro fuzzy logic. Int. J. Soft Comput. (IJSC) 2(2), 25–38 (2011)
7. Arani, L.A., Sadoughi, F., Langarizadeh, M.: An expert system to diagnose pneumonia using fuzzy logic. Acta Informatica Medica 27(2), 103 (2019)
8. Farahani, F.V., Zarandi, M.F., Ahmadi, A.: Fuzzy rule based expert system for diagnosis of lung cancer. In: 2015 Annual Conference of the North American Fuzzy Information Processing Society (NAFIPS) held jointly with 2015 5th World Conference on Soft Computing (WConSC), pp. 1–6. IEEE (2015)
9. Das, S., Ghosh, P.K., Kar, S.: Hypertension diagnosis: a comparative study using fuzzy expert system and neuro fuzzy system. In: 2013 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–7. IEEE (2013)
10. Kadhim, M.A., Alam, M.A., Kaur, H.: Design and implementation of fuzzy expert system for back pain diagnosis. Int. J. Innov. Technol. Creative Eng. 1(9), 16–22 (2011)
11. Meivita, D.N., Rivai, Irfansyah, A.N.: Development of an electrostatic air filtration system using fuzzy logic control. Int. J. Adv. Sci. Eng. Information Technol. 8(4), 1284–1289 (2018)
12. Manikandan, T., Bharathi, N.: Hybrid Neuro-Fuzzy System for Prediction of Stages of Lung Cancer Based on the Observed Symptom Values (2017)
13. Mayilvaganan, M., Rajeswari, K.: Risk factor analysis to patient based on fuzzy logic control system. Blood Press. 60, 40 (2014)
14. Badnjevic, A., Cifrek, M., Koruga, D., Osmankovic, D.: Neuro-fuzzy classification of asthma and chronic obstructive pulmonary disease. BMC Med. Inform. Decis. Mak. 15(S3), S1 (2015)
15. Saini, S.K., Choudhary, A.: Detection of Lung Carcinoma using Fuzzy and ACO Techniques
16. Hamidzadeh, J., Javadzadeh, R., Najafzadeh, A.: Fuzzy rule based diagnostic system for detecting the lung cancer disease. J. Renew. Natural Resources Bhutan (2015). ISSN 1608-4330
17. https://pdfs.semanticscholar.org/5f34/163f0a3e90b6d1d51383782896b3aab92803.pdf
18. Glass, R.I., Rosenthal, J.P.: International approach to environmental and lung health. A Perspective from the Fogarty International Center. Ann. Am. Thoracic Soc. 15(Supplement 2), S109–S113 (2018)
19. ShubhaDeepti, P., Narayana Rao, S.V.N., Naveen Kumar, V., Padma Sai, Y.: Expert system using artificial neural network for chronic respiratory diseases. Int. J. Curr. Eng. Sci. Res. 4(9), 6–14 (2017)
20. Singh, V., Sharma, B.B.: Respiratory disease burden in India: Indian chest society SWORD survey. Lung India: Official Organ Indian Chest Soc. 35(6), 459 (2018)
Analytical Study of Recommended Diet for Patients with Cardiovascular Disease Using Fuzzy Approach Garima Rai and Sanjay Kumar Dubey
Abstract Cardiovascular disease (CVD) is a grave cause of concern for the nation nowadays. A global estimate of CVD concludes that the death rate of 272 per 100,000 population in our nation is higher than the global average. A common symptom is an increase in blood pressure at a very early age. CVD is one of the most prominent contributors to the country's death rate. Although technology is booming and various software applications are available in the market to suggest a healthy diet to people suffering from cardiovascular diseases, it is still very difficult to estimate how much impact the recommended diet will have on a CVD patient. The estimate of how much impact a physical factor has on the diet of a patient is fuzzy in nature: one cannot be sure whether a suggested component of the physical factor will have a major or a minor influence on the suggested diet. A fuzzy inference system (FIS) offers an easy approach to studying the impact of a component on the diet suggested to CVD patients. In an FIS, a rule base built from expert knowledge is the basis for producing outputs. This work describes how fuzzy logic can be used to design a fuzzy inference system. The system uses a knowledge rule base to estimate the outputs and thus assists in recommending a healthy diet to CVD patients according to the computed impact of the dietary factors on the patient's body.
Keywords Diet system · CVD · Fuzzy · Analysis · Model
1 Introduction
Cardiovascular diseases are generally caused by factors such as obesity, improper diet, poor blood pressure levels, chain smoking, a severe lack of physical activity, high alcohol intake, and improper speech. The manifestation may vary from person to person, and the indications differ between males and females.
G. Rai (B) · S. K. Dubey, Department of Computer Science & Engineering, Amity University Noida, Noida, UP, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_45
The disease accounts for one of the most prominent shares of the nation's death rate and is an alarming cause of concern for the country. It is prominent among males and females of average age. According to recent statistical records of global death rates, CVD does not occur at an alarming rate among the youth of the country. Recommendation systems are usually designed for people who want an appropriate suggestion when choosing among the large number of diets available on Web sites nowadays. Many Web sites offer a large number of choices to CVD patients. People do not pay attention to their eating habits because they are not aware of the role of good food in their lives. A good recommendation system is the need of the hour. Various techniques for recommending a diet to heart patients are available, such as the analytical hierarchy process and the fuzzy analytical hierarchy process. Software tools are available in the market as recommendation systems to assist heart patients in choosing a food option [1]. In general terminology, an item is what the software tool recommends to the user. These systems take one item at a time and design and customize their user interface accordingly [1]. Recommendation systems use data sources to find which of the existing items people select most. The data source gives an idea of the hidden patterns and eating habits of the users. Diet recommendation systems consider the dietary patterns and body statistics of all users [2]. The necessary nutrients should be present in an individual's diet to ensure that it is balanced, and the calories consumed should be burned on a regular basis [2]. A healthy diet is the best way to deal with CVD. A large number of diet recommendation systems are available in the market to suggest a healthy diet plan to CVD patients.
Various software tools are available online to recommend a diet that helps a CVD patient fight the disease, but most people still find it difficult to relate how the recommended diet will affect their bodies. Earlier works recommend diets to CVD patients but do not explain how those diets will affect an individual's body. Therefore, many people still cannot find a suitable diet recommendation system to fight CVD. The impact of a suggested diet plan is not concrete and crisp but fuzzy in nature; that is, one cannot predict the extent to which a diet plan recommended to a CVD patient will affect the body. Thus, fuzzy logic can be used to compute how a suggested diet plan will affect a patient's body. Fuzzy logic describes the extent to which a particular linguistic variable belongs to a particular domain. A fuzzy inference system uses fuzzy logic and a fuzzy rule base for computation. A fuzzy knowledge rule base is based on expert knowledge. In the present work, we propose a fuzzy-based system that computes the impact of a recommended diet on an individual in the form of risk as an output variable. A healthy diet plan, along with information about its influence on an individual, plays a role in coping with CVD.
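The idea that a linguistic variable belongs to a fuzzy set only to some extent, and that rules combine such degrees, can be illustrated with a short sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the input variables (daily fat and salt intake in grams), the triangular set boundaries, and the single rule are hypothetical, and taking the minimum for AND is one common convention.

```python
def triangular(x, left, peak, right):
    """Degree (0 to 1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets; the boundaries are illustrative numbers only.
FAT_SETS = {"low": (0, 20, 50), "medium": (30, 60, 90), "high": (70, 100, 130)}
SALT_SETS = {"low": (0, 2, 5), "high": (4, 8, 12)}


def fuzzify(value, fuzzy_sets):
    """Return the membership degree of a crisp value in every labeled set."""
    return {label: triangular(value, *params) for label, params in fuzzy_sets.items()}


fat = fuzzify(85, FAT_SETS)    # 85 g of fat per day (made-up input)
salt = fuzzify(6, SALT_SETS)   # 6 g of salt per day (made-up input)

# Hypothetical rule: IF fat is high AND salt is high THEN risk is high.
# The minimum of the two degrees gives the firing strength of the rule.
risk_high = min(fat["high"], salt["high"])

print(fat, salt, risk_high)
```

A full fuzzy inference system would evaluate many such rules from the expert rule base and then defuzzify the combined result into a single risk value; this sketch stops at the firing strength of one rule.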
2 Literature Review
Software and various methods that recommend options for the end user to choose are available in the market as recommendation systems [1]. The importance of a healthy plate and nutrients is being lost in this fast-moving world [2]. Healthcare is now emerging as one of the leading domains in the digital world [3]. In nutrition systems that use artificial intelligence, food analysis is the main component [4]. In day-to-day life, one of the major difficulties we face is choosing what to eat and what not to eat among the available options [5]. Analyzing the impact of food items in real life is more complicated than analyzing their biological impact [6]. MyPyramid is a food guidance system that reflects the dietary guidelines Americans use for food consumption [7]. Estimating people's daily calorie intake is a crucial task [8]. Recommender systems thus play a vital role in helping people find, among the items available online, the one that fits their requirements in all aspects [9]. In modern days, recommendation systems have had a huge impact on society [10]. Poor groups of society, including infants and toddlers, are among the sections of greatest concern and need information systems for diets [11]. Fuzzy logic is the machine learning tool that comes into play to remove vagueness in the available data [12]. To maintain the balance between calorie intake and the requirements of an individual's body, it is important to check nutritional status regularly [13]. A large number of algorithms are available, but text-based parsing algorithms are among the most used [14]. Fuzzy systems rely on expert knowledge and enable the expression of fuzzy "inputs" in linguistic terms [15]. Most people find it very difficult to estimate their daily calorie intake [16]. To curb this issue, fuzzy logic comes into play [17]. Fuzzy logic is used to describe unreliable (imprecise) data and knowledge using linguistic variables [18]. Fuzzy logic was introduced by Lotfi A. Zadeh in 1965 [19]. In today's hectic world, the importance of diet management has increased exponentially. Due to unhealthy and haphazard eating habits, the spread of diet-related diseases is at an all-time high [20].
3 Analysis
The study of various research works has shown the importance of diet recommendation for patients. Table 1 presents the analysis drawn from the study of 20 research papers. A lack of knowledge about the right diet can lead a person to choose options that do not support the ailment properly. A large number of food options are available in the market, and it is very difficult to choose among them the ones that will help an individual fight CVD most judiciously. In case of patients having multiple number of diseases
Refs
[1]
[2]
[3]
[4]
[5]
[6]
Author
F.M. Facca, P.L. Lanzi
V. Jaiswal
A.K. Sahoo, C. Prdhan
T. Theodoridis, V. Solachidis
T. Mokdara, P. Pusawiro, J. Harnsomburana
J.M. Krbez, A. Shaout
Table 1 Analysis of literature review
Deep neural netwok (DNN)
Soft computing Method of deep learning
Data mining tools
AHP, Fuzzy AHP
Tool/Technique/Model/Dataset
To review fuzzy methods in relation to diet journaling
To select among wide variety of ingredients available
24 h recalling for food intake and frequent questionnaire to gather necessary information
The fast-food consumption rate is alarmingly high and this consequently has led to the intake of unhealthy food. This leads to various health issues such as obesity, diabetes, an increase in blood pressure, etc
Inability of the analytic hierarchy process to deal with the impression and subjectiveness in the pair-wise comparison process
Challenges
(continued)
In this paper work, a review on fuzzy methods and the use of fuzzy logics in nutrition systems is solved
This work consider previous eating habits of an individual to suggest the diet
This work proposes a brief about division of food recommendation system using artificial intelligence tools
Soft computing deep learning methods such as CNN and RBM are being proposed in this research work to develop an intelligent HRS
In this paper, we observe that a data with non-repeated input values supports random tree and decision tree learning method
In this paper work, in order to overcome the inconsistency of AHP, FAHP is used to recommend an icecream to a diabetic patient
Analysis
466 G. Rai and S. K. Dubey
Fuzzy Logic
[11]
FIS
R. Year, L Mart´ınez
Fuzzy logic, Fuzzy Inference system
D. Permatasari, I.N. Azizah, H.L. Hadiat
[9]
C. Li, J.D. Fernstrom3, R.J. Sclabassi
Tool/Technique/Model/Dataset
Soft Computing
[8]
K. Marcoe, W. Juan, S. Yamini
S. Das, B.S.P. Mishra, M.K. Mishra, [10] S. Mishra, S.C. Moharana
Refs
[7]
Author
Table 1 (continued)
To accurately measure people’s food intake in real life
Identifying food selections in each MyPyramid food group or subgroup reflective of typical consumption patterns by Americans
Challenges
(continued)
This paper work analyzes the mamdani system of FIS to categorize toddler status of nutrition
In this work, an overview of how soft computing merge with filteration techniques is done
In this work, a brief on fuzzy tools is given to develop a recommendation system
In this paper work, fuzzy logic has been used as a powerful tool to use human knowledge in estimating food densities
Disaggregated foods from consumption surveys into component ingredients. Combined similar ingredients into “item clusters” and determined relative consumption of each. Calculated a consumption-weighted nutrient profile for each food group
Analysis
Analytical Study of Recommended Diet for Patients … 467
[13]
[14]
[15]
[16]
S.M. Sobhy, W.M. Khedr
D. Syahputra, Tulus, Sawaluddin
T. Osman, M. Mahjabeen, S.S. Psyche
D. Nakandala, H.C.W. Lau
R.A. Priyono, K. Surendro
H. Korkmaz, E. Canayaz, S.B. Akar, Z.A. Altikardes
[17]
Refs
[12]
Author
Table 1 (continued)
Fuzzy Logic
Fuzzy Logic
Fuzzy logic, fuzzy sugeno method
Fuzzy logic, fuzzy inference systems (FIS)
Tool/Technique/Model/Dataset
This paper presents a review of fuzzy methods and the use of fuzzy logic in nutrition systems
This work shows how fuzziness in data can be dealt with effectively using fuzzy logic
An adaptive food suggestion Web application that acts as a fuzzy recommender using fuzzy logic
This work uses the Sugeno fuzzy method to determine natural patient status
This paper observes how an FIS can be used to diagnose the degree of risk
Analysis
Identifying food selections in each MyPyramid food group by Americans
This work presents a clinical decision support system
Criticality in intake of food with higher calorific value
Most food searching problems are handled with traditional string matching algorithms
The fast-food consumption rate is alarmingly high and this consequently has led to the intake of unhealthy food. This leads to various health issues such as obesity, diabetes, an increase in blood pressure, etc
Challenges
Refs
[18]
[19]
[20]
Author
J.G. Kljusurić, I. Rumora, Z. Kurtanjek
G. Asghari, H. Ejtahed, M.M. Sarsharzadeh
M. Raut, K. Prabhu, R. Fatehpuria, S. Bangar
Table 1 (continued)
Fuzzy ontology, rule-based reasoning, artificial bee colony algorithm, and genetic algorithm
Fuzzy logic, fuzzy algorithms
Fuzzy logic
Tool/Technique/Model/Dataset
Challenges
This work suggests nutrients and recipes based on the suggested diet plan. It also takes into account the seasonal availability of food in India and the preexisting conditions of the users of the system
This work proposes software that uses iterative algorithms to suggest dietary patterns
In this paper, the theory of fuzzy logic is used in the planning and management of expenses in social nourishment, also taking into account the nutritive structure of meals
Analysis
to suffer from, selecting an appropriate diet becomes a much more laborious task. Our analysis of the reviewed research identified four factors that determine how a diet affects the body of a CVD patient. These factors govern whether a diet works completely and properly for a CVD patient.
3.1 Physical Activity
A healthy weight is the result of proper physical activity in a person's daily routine. Regular physical activity contributes to a large extent to a healthy and active body. A lack of physical activity invites various CVDs and other health problems.
3.2 Age
Age plays a vital role in the diet recommended to a CVD patient. No single diet can be suggested to people of all ages: as patients differ in age, the diet recommended to them differs as well.
3.3 BMI
BMI is calculated as the weight of a person (in kilograms) divided by the square of the person's height (in metres). An increase in BMI indicates a higher chance of obesity and thus gives an idea of a person's risk of heart disease.
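The BMI factor can be made concrete with a short sketch. It uses the conventional definition, weight in kilograms divided by the square of height in metres. The category cut-offs are the commonly cited WHO adult ranges and are an assumption added here for illustration; they are not taken from the paper, and the function names are hypothetical.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the commonly cited WHO adult ranges (assumed here)."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    return "obese"
```

For example, `bmi(70.0, 1.75)` evaluates to about 22.86, which `bmi_category` maps to "normal".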
3.4 Cholesterol
A high level of cholesterol indicates a high chance of suffering from heart disease. Therefore, it is very important to maintain a good level of cholesterol in the body (Fig. 1). The study showed that fuzzy expert systems have recently been much in demand, with 17% implemented as consultation systems, 19% as prediction systems, and 64% as diagnosis systems.
Fig. 1 Graphical representation of data analyzed (2016–17)
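To illustrate how the four factors of Sect. 3 could feed a fuzzy rule base, the following is a minimal sketch, not the system analyzed in the paper. It uses shoulder-shaped membership functions and a zero-order Sugeno-style weighted average instead of a full Mamdani defuzzifier. Every breakpoint, rule, and output score is a hypothetical value chosen for illustration only, and the output is a generic risk score rather than the diet-impact measure the paper has in mind.

```python
def shoulder_up(x, a, b):
    """Right-shoulder membership: 0 below a, 1 above b, linear in between."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)


def cvd_risk(activity_h_per_week, age, bmi, cholesterol_mg_dl):
    """Combine the four factors into a risk score in [0, 1] (illustrative only)."""
    # Hypothetical fuzzy sets; all breakpoints are assumptions, not clinical values.
    low_activity = 1.0 - shoulder_up(activity_h_per_week, 1.0, 5.0)
    older = shoulder_up(age, 40.0, 65.0)
    high_bmi = shoulder_up(bmi, 25.0, 32.0)
    high_chol = shoulder_up(cholesterol_mg_dl, 200.0, 260.0)

    any_factor = max(high_bmi, high_chol, low_activity)
    # Each rule is (firing strength, crisp risk score); AND is min, OR is max.
    rules = [
        (min(high_bmi, high_chol), 0.9),  # high BMI AND high cholesterol
        (min(low_activity, older), 0.7),  # inactive AND older
        (any_factor, 0.5),                # any single adverse factor
        (1.0 - any_factor, 0.1),          # no adverse factor
    ]
    total = sum(weight for weight, _ in rules)
    return sum(weight * score for weight, score in rules) / total
```

The last two rules are complementary, so the total firing strength is never zero and the weighted average is always defined.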
4 Conclusion
In this paper, we have reviewed various diet recommendation systems available in the market and their functionalities. We found that the existing systems do not predict the impact of a suggested diet on the health of a CVD patient. Patients therefore find it difficult to choose the best of the available diets recommended by experts. We have analyzed how a fuzzy inference system based on fuzzy logic can be used to design a system that computes the impact of a particular diet on an individual.
References
1. Facca, F.M., Lanzi, P.L.: Mining interesting knowledge from weblogs: a survey. Data Knowl. Eng. 53, 225–241 (2005)
2. Jaiswal, V.: A new approach for recommending healthy diet using predictive data mining algorithm. In: 2019 IJRAR March 2019, vol. 6, issue 1
3. Sahoo, A.K., Pradhan, C., Barik, R.K.: DeepReco: deep learning based health recommender system using collaborative filtering. In: Computation 2019, 7, 25. https://doi.org/10.3390/computation7020025
4. Theodoridis, T., Solachidis, V.: A survey on AI nutrition recommender systems. In: PETRA '19, June 5–7, 2019, Rhodes, Greece
5. Mokdara, T., Pusawiro, P., Harnsomburana, J.: Personalized food recommendation using deep neural network. In: 2018 Seventh ICT International Student Project Conference (ICT-ISPC)
6. Krbez, J.M., Shaout, A.: Fuzzy nutrition system. Int. J. Innov. Res. Comput. Commun. Eng. 1(7), September 2013
7. Marcoe, K., Juan, W., Yamini, S.: Development of food group composites and nutrient profiles for the MyPyramid food guidance system. J. Nutr. Educ. Behav. 38, S93–S107 (2006)
8. Li, C., Fernstrom, J.D., Sclabassi, R.J.: Food Density Estimation Using Fuzzy Logic Inference. 978-1-4244-6924-6/10/$26.00 ©2010 IEEE
9. Yera, R., Martínez, L.: Fuzzy tools in recommender systems: a survey. Int. J. Comput. Intell. Syst. 10, 776–803 (2017)
10. Das, S., Mishra, B.S.P., Mishra, M.K., Mishra, S., Moharana, S.C.: Soft-Computing Based Recommendation System: A Comparative Study, vol. 8, issue 8 June, 2019. ISSN: 2278-3075
11. Permatasari, D., Azizah, I.N., Hadiat, H.L.: Classification of toddler nutritional status using fuzzy inference system (FIS). In: The 4th International Conference on Research, Implementation, and Education of Mathematics and Science (4th ICRIEMS)
12. Sobhy, S.M., Khedr, W.M.: Developing of fuzzy logic decision support for management of breast cancer. Int. J. Comput. Appl. 147(1), August 2016. (0975-8887)
13. Syahputra, D., Sawaluddin, T.: The accuracy of Fuzzy Sugeno method with antropometry on determination natural patient status. In: International Conference on Information and Communication Technology (IconICT)
14. Osman, T., Mahjabeen, M., Psyche, S.S.: Adaptive Food Suggestion Engine by Fuzzy Logic, ICIS 2016, June 26–29, 2016, Okayama, Japan
15. Nakandala, D., Lau, H.C.W.: A novel approach to determining change of caloric intake requirement based on fuzzy logic methodology. Knowl.-Based Syst. 36, 51–58 (2012)
16. Priyono, R.A., Surendro, K.: Nutritional needs recommendation based on fuzzy logic. In: The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013)
17. Korkmaz, H., Canayaz, E., Akar, S.B., Altikardes, Z.A.: Fuzzy Logic Based Risk Assessment System Giving Individualized Advice for Metabolic Syndrome and Fatal Cardiovascular Diseases
18. Kljusurić, J.G., Rumora, I., Kurtanjek, Z.: Application of Fuzzy Logic in Diet Therapy - Advantages of Application. Faculty of Food Technology and Biotechnology, Croatia
19. Asghari, G., Ejtahed, H., Sarsharzadeh, M.M.: Designing fuzzy algorithms to develop healthy dietary pattern. Int. J. Endocrinol. Metabolism 11(3), 154–161 (2013)
20. Raut, M., Prabhu, K., Fatehpuria, R., Bangar, S.: A personalized diet recommendation system using fuzzy ontology. Int. J. Eng. Sci. Invention (IJESI) 7(3) Ver. 3
Programmed Recognition of Medicinal Plants Utilizing Machine Learning Techniques
P. Siva Krishna and M. K. Mariam Bee
Abstract Humans have an obligation to conserve nature. The correct identification of plant species has significant benefits for a wide range of stakeholders, including forestry services, botanists, taxonomists, medical professionals, pharmaceutical laboratories, organizations fighting for endangered species, government, and the general public. This has fueled interest in developing automated systems for the recognition of different plant species. Features such as length, width, perimeter, area, number of vertices, color, and the perimeter and area of the hull were extracted from each leaf. A few derived features were then computed from these attributes. The best results were obtained from a random forest classifier using a ten-fold cross-validation procedure, with an accuracy of 90.1%. It is envisaged that a web-based or mobile computer system for the automatic recognition of medicinal plants will help local people improve their knowledge of medicinal plants and help taxonomists develop more efficient species identification techniques.
Keywords Taxonomists · Plant species · Robotized framework · PC framework
P. S. Krishna (B) · M. K. M. Bee
Department of Electronics and Communication Engineering, Saveetha School of Engineering (SIMATS), Chennai, Tamil Nadu, India
e-mail: [email protected]
M. K. M. Bee
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_46
1 Introduction
The world bears a huge variety of plant species, many of which have medicinal value; others are close to extinction, and still others are harmful to humans. Plants are not only an essential resource for human beings, they also form the basis of all higher life. To use and
protect plant species, it is important to study and classify plants correctly. Identifying unknown plants relies largely on the specialized knowledge of an expert botanist. The most reliable way to identify plants correctly is a manual approach based on morphological characteristics. Building a plant database for fast and efficient classification and recognition is an important step toward their conservation and preservation. This is all the more important because many plant species are on the verge of extinction as a result of continual deforestation to make way for modernization. Recently, computer vision and pattern recognition techniques have been applied successfully to build digital plant cataloguing systems for plant species. To address these issues, a leaf morphological taxonomy approach is proposed, which considers only the leaves to perform the identification tasks. This approach combines several leaf features, such as shape, vein structure, texture, and some histological information. Identifying plants is a difficult and complex task because of the nature of leaves. Although leaves share some basic features, they also show wide pattern variation. This variation can occur among different leaves of the same plant, where characteristics such as maturity and even exposure to direct sunlight produce differences in the leaves' size, color, texture, and shape. Such variations also occur among leaves of the same species that come from different plants; in this case, they result from the soil in which the plant grows, the climate, and even the environment. Few approaches in the computer vision literature use contour-based features and geometric features for leaf classification.
This work presents a novel approach to plant identification based on the texture complexity of leaves. A novel method is proposed based on the volumetric fractal dimension, which extracts features from leaf images that provide information for identification.
2 Literature Survey
Gao and Lin [1]: Plant identification and classification play an important role in ecology; however, the manual procedure is cumbersome even for experienced taxonomists. Technological advances allow the development of methods that make these tasks easier and faster. In this context, the paper describes a method for plant identification and classification based on leaf shapes that explores the discriminative power of the contour-centroid distance in the Fourier frequency domain, in which some invariances (e.g., rotation and scale) are guaranteed. It also investigates the influence of feature selection techniques on classification accuracy. The results show that by
combining a set of feature vectors in the principal components space with a feed-forward neural network, an accuracy of 97.45% was achieved.
Kohei Arai, Indra Nugraha Abdullah et al. [2]: Human beings have an obligation to conserve nature. One example is the conservation of ornamental plants. High economic value in the plant trade, improvement of the aesthetic value of a room, and medicinal efficacy are some of the positive qualities of such plants. However, few people know about their medicinal efficacy. Considering how easy they are to obtain and their medicinal efficacy, these plants could serve as an initial treatment for a simple disease or as an alternative to chemical-based medicines. To help people become familiar with them, a system that can identify these plants is needed. The authors therefore propose a system based on the redundant discrete wavelet transformation (RDWT) that works on the leaf, because its translation-invariant character can provide strong features for recognizing ornamental plants. This system achieved a correct classification rate of approximately 95.83%.
Ranjan Parekh and Samar Bhattacharya et al. [3]: This paper proposes the use of a combination of texture and shape features in a novel methodology for characterizing and recognizing plant leaves. The texture of the leaf is modeled using Gabor filters and the gray-level co-occurrence matrix (GLCM), while the shape of the leaf is captured using a set of Curvelet transform coefficients together with invariant moments. Since these features are sensitive to the orientation and scaling of the leaf image, a pre-processing stage prior to feature extraction is applied to make the method robust to varying rotation, translation, and scaling factors.
The adequacy of the proposed methods is studied by using two neural classifiers, a neuro-fuzzy controller (NFC) and a feed-forward back-propagation multi-layered perceptron (MLP), to discriminate between 31 classes of leaves. The features were applied individually as well as in combination to investigate how recognition accuracy could be improved. Experimental results show that the proposed approach is effective in recognizing leaves with varying texture, shape, size, and orientation.
Andre Ricardo Backes et al. [4]: Texture is an important visual attribute used to describe the pixel organization in an image. Although it is easily identified by humans, its analysis demands a high level of sophistication and computational complexity. This paper presents a novel approach for texture analysis, based on analyzing the complexity of the surface generated from a texture, in order to describe and characterize it. The proposed method produces a texture signature that can efficiently characterize different texture classes. The paper also illustrates the novel method in an experiment using texture images of leaves. Because of the nature of plants, which present a wide range of patterns, leaf identification is a difficult and complex task. The high
classification rate obtained shows the potential of the method, which improves on traditional texture techniques such as Gabor filters and Fourier analysis.
Anang Hudaya Muhamad et al. [5]: This paper presents a scalable approach for classifying plant leaves using two-dimensional shape features. The proposed approach integrates a distributed recognition scheme called the distributed hierarchical graph neuron (DHGN) for pattern recognition and k-nearest neighbor (k-NN) for pattern classification. With the increasing amount of leaf data that can be captured using existing image acquisition and processing technology, the ability of a classification scheme to produce high recall accuracy while adapting to large-scale datasets and data features is important. The approach implements a one-shot learning mechanism within a distributed processing infrastructure, enabling large-scale data to be classified efficiently. The experimental results obtained through a series of classification tests show that the proposed approach can provide high recall accuracy and a large number of perfect recalls for a given plant leaf dataset. The results also show that the recognition procedure within the DHGN distributed scheme incurs low computational complexity and minimal processing time.
Yeni Herdiyeni, Ni Kadek Sri Wahyuni et al. [6]: This research proposed a new mobile application based on the Android operating system for identifying Indonesian medicinal plant images based on texture and color features of digital leaf images. The experiments used 51 species of Indonesian medicinal plants, each with 48 images, so the total number of images used in this research is 2,448.
This research investigates the effectiveness of the fusion of the fuzzy local binary pattern (FLBP) and the fuzzy color histogram (FCH) for identifying medicinal plants. The FLBP method is used for extracting leaf image texture, and the FCH method is used for extracting leaf image color. FLBP and FCH are fused using the Product Decision Rule (PDR) method. A Probabilistic Neural Network (PNN) classifier is used to classify medicinal plant species. The experimental results show that the combination of FLBP and FCH can improve the average accuracy of medicinal plant identification. The identification accuracy using the fusion of FLBP and FCH is 74.51%. This application is important for helping people identify and find information about Indonesian medicinal plants.
I. Gogul and V. Sathish Kumar et al. [7]: Automatic identification and recognition of medicinal plant species in environments such as forests, mountains, and dense regions is necessary to know about their existence. In recent years, plant species recognition has been carried out based on the shape, geometry, and texture of various plant parts such as leaves, stems, and flowers. Flower-based plant species identification systems are widely used. While modern search engines provide methods to visually search for a query image that
contains a flower, they lack robustness because of the intra-class variation among the multitude of flower species around the world. Hence, a deep learning approach using convolutional neural networks (CNN) is used in this work to recognize flower species with high accuracy. Images of the plant species are acquired using the built-in camera module of a mobile phone. Features are extracted from the flower images using a transfer learning approach (i.e., extraction of complex features from a pre-trained network). A machine learning classifier such as logistic regression or random forest is used on top of it to yield a higher accuracy rate. This approach helps minimize the hardware requirement needed to perform the computationally intensive task of training a CNN. It is observed that a CNN combined with transfer learning as a feature extractor outperforms all the handcrafted feature extraction methods such as local binary pattern (LBP), color channel statistics, color histograms, Haralick texture, Hu moments, and Zernike moments. The CNN-based transfer learning approach achieves Rank-1 accuracies of 73.05%, 93.41%, and 90.60% using OverFeat, Inception-v3, and Xception architectures, respectively, as feature extractors on the FLOWERS102 dataset.
Ajay Satti et al. [8]: Plants are the basis of all life on earth and an essential resource for human well-being. Plant recognition is important in agriculture for the management of plant species, whereas botanists can use this application for medicinal purposes. The leaves of different plants have different characteristics that can be used to classify them. This paper presents a simple and computationally efficient method for plant identification using digital image processing and machine vision technology. The proposed method consists of three stages: pre-processing, feature extraction, and classification.
Pre-processing is the technique of enhancing data images prior to computational processing. The feature extraction stage derives features based on the color and shape of the leaf image. These features are used as inputs to the classifier for efficient classification, and the results were tested and compared using an artificial neural network (ANN) and a Euclidean (KNN) classifier. The network was trained with 1907 sample leaves of 33 different plant species taken from the Flavia dataset. The proposed approach is 93.3% accurate using the ANN classifier, and the comparison of classifiers shows that ANN takes less average time for execution than the Euclidean distance method.
Wang-Su Jeon et al. [9]: There are thousands of kinds of trees in the natural world, so it can be very difficult to distinguish between them. Botanists and others who study trees, however, can identify the type of tree at a glance by using the characteristics of the leaf. Machine learning is used to automatically classify leaf types. Studied extensively in 2012, this is a rapidly growing field based on deep learning. Deep learning is itself a self-learning technique used on large amounts of data, and recent hardware developments and big data have
made this technique more practical. The authors propose a method to classify leaves using the CNN model, which is often used when applying deep learning to image processing.
Shivling et al. [10]: Used feature extraction based on the so-called region labeling technique. After the image has gone through the image preprocessing stage, the resulting binary image is subjected to region labeling to produce a labeled region. The algorithm labels a target matrix in which connected pixels with the integer value '1' receive a positive integer label, which distinguishes the foreground from the background. The scanning kernel used an eight-connected region algorithm, meaning that when a pixel with value '1' is found, the kernel looks further for its eight-connected region. When the region is found, a new positive integer labels the cluster to match its center value. If no eight-connected region is found to be connected to a pixel with value '1', a new search is started on a different region. In this way, many regions can be labeled separately. The process finishes when all pixels have been labeled. In MATLAB, this method used the regionprops function to count the pixels in the labeled image region. For feature extraction, the total number of pixels labeled in the image is counted, and the feature properties of the leaf image are then calibrated. Ten leaf images were used to evaluate the method, and the result was found to achieve high precision of up to 0.1 mm².
Gopal et al. [11]: Used color feature extraction to prepare the images of medicinal plants for classification. Before the classification stage, feature extraction starts with an input image from a digital scanner, followed by a series of image pre-processing stages.
Once the image is fully pre-processed, it is fed directly into the system to obtain the medicinal plant's color feature in terms of its Fourier descriptor. The Fourier descriptor ensures that the shape representation is invariant to rotation, translation, and scaling. The system was trained using 100 leaf images and tested using 50 leaf images, and achieved 92% efficiency.
Araujo et al. [12]: Based on texture and shape features of leaf images. SVM and neural network classifiers were trained on four different features, namely local binary pattern (LBP), histogram of oriented gradients (HOG), speeded-up robust features (SURF), and Zernike moments (ZM). A static classifier selection method was then used to search for the ensembles that maximize the average classification score. The datasets used for the experiments were ImageCLEF 2011 and ImageCLEF 2012. The results showed that the proposed approach improved the average classification score by 11.7% and 4.2% in the scan category and by 4.06% and 5.87% in the scan-like category relative to the best results reported for the ImageCLEF 2011 and ImageCLEF 2012 datasets, respectively. The multiple classifier systems therefore surpassed the performance of monolithic approaches.
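The rotation, translation, and scale invariance attributed to Fourier descriptors in the survey entries above can be illustrated with a small sketch. It assumes the centroid-distance signature as the contour representation, which is one common choice and not necessarily the descriptor used by any of the cited authors; the function names and the number of coefficients are hypothetical.

```python
import cmath
import math


def centroid_distance_signature(contour):
    """Distance of every contour point (x, y) from the contour centroid."""
    n = len(contour)
    cx = sum(x for x, _ in contour) / n
    cy = sum(y for _, y in contour) / n
    return [math.hypot(x - cx, y - cy) for x, y in contour]


def fourier_descriptor(contour, n_coeffs=8):
    """Magnitudes of the first DFT coefficients, normalised by the DC term."""
    sig = centroid_distance_signature(contour)
    n = len(sig)
    mags = []
    for k in range(n_coeffs + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(sig))
        mags.append(abs(coeff))
    # Dividing by the DC magnitude removes the dependence on scale.
    return [m / mags[0] for m in mags[1:]]
```

Measuring distances from the centroid removes translation and rotation, taking DFT magnitudes removes the dependence on the contour starting point, and dividing by the DC term removes scale.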
Hang et al. [13]: The traditional CNN was improved by introducing an inception structure and a global pooling layer to identify leaf diseases. The combined inception structure reduced the number of model parameters and improved leaf disease identification, reaching 91.7% accuracy. Neuroph has also been used to train a CNN for maize leaf disease classification. The approach was shown to be effective in recognizing three types of disease, namely northern corn leaf blight, common rust, and gray leaf spot. Deep learning with CNN has been shown to be reliable for plant disease classification by capturing the color and texture of the lesions specific to each disease; even removing 75% of the parameters did not affect the classification accuracy.
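The eight-connected region labeling attributed to Shivling et al. [10] in the survey above can be sketched in a few lines. This is a generic breadth-first labeling written for illustration, assuming the binary image is a list of lists of 0 and 1; it is not the MATLAB regionprops routine mentioned in the text, and the function name is hypothetical.

```python
from collections import deque


def label_regions(binary):
    """Label 8-connected regions of pixels equal to 1; return labels and areas."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    areas = {}
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or labels[r][c]:
                continue  # background pixel or pixel that is already labeled
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                # Visit the eight neighbours of the current pixel.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not labels[ny][nx]):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
            areas[next_label] = count  # area of the region in pixels
    return labels, areas
```

A pixel count can be turned into a physical leaf area by multiplying it by the area covered by one pixel, a calibration value that depends on the scanner resolution and is not given in the text.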
3 Proposed Methodology
A dataset of 55 medicinal plants from Vietnam was examined, and a high accuracy of 98.3% was obtained with a CNN classifier. The size of each image was 256 × 256 pixels. The proposed method is based on fractal dimension features derived from leaf shape and vein patterns for the recognition and classification of plant leaves. Using the volumetric fractal dimension means generating a texture signature for a leaf and also computing the gray-level co-occurrence matrix (GLCM).
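As a minimal sketch of the GLCM step named above, the following computes a normalised co-occurrence matrix for one pixel offset together with two standard Haralick-style statistics. It assumes the image is already quantised to integer gray levels in the range 0 to levels - 1; the offset, the symmetric counting, and the function names are illustrative choices, not details given by the authors, and the volumetric fractal dimension itself is not shown.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Normalised gray-level co-occurrence matrix for the offset (dx, dy)."""
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                i, j = image[y][x], image[ny][nx]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1  # count the pair in both directions
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("offset produces no pixel pairs for this image")
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Contrast: co-occurrence probabilities weighted by squared level difference."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def energy(p):
    """Energy (angular second moment): sum of squared probabilities."""
    return sum(v * v for row in p for v in row)
```

Statistics such as contrast and energy, computed for several offsets, are what would typically be passed to a classifier as texture features.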
4 Conclusion
In this paper, a new robust and computationally efficient system is presented that considers the color features and tooth features of a leaf in addition to shape features. We finally used a combination of color, shape, morphology, and tooth features; the system was tested using two classifiers on the Flavia dataset, and the results were acceptable, as can be seen in the experimental results. The proposed work can be further extended to recognize complex images with petioles and clustered leaves, as well as real-time leaf images.
References
1. Gao, W., Lin, W.: Frontoparietal control network regulates the anti-correlated default and dorsal attention networks. Hum. Brain Mapp. 33(1), 192–202 (2012)
2. Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y.X., Chang, Y.F., Xiang, Q.L.: A leaf recognition algorithm for plant classification using probabilistic neural network. In: 7th IEEE International Symposium on Signal Processing and Information Technology, Giza, Egypt, pp. 11–16 (2007)
3. Zhang, X., Liu, Y., Lin, H., Liu, Y.: Research on SVM plant leaf identification method based on CSA. In: Che W., et al. (eds.) Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol. 624. Springer, Singapore (2016)
4. Hossain, J., Amin, M.A.: Leaf shape identification based plant biometrics. In: 13th International Conference on Computer and Information Technology, Dhaka, Bangladesh, pp. 458–463 (2010)
5. Du, J.X., Wang, X.F., Zhang, G.J.: Leaf shape based plant species recognition. Appl. Math. Comput. 185, 883–893 (2007)
6. Du, M., Zhang, S., Wang, H.: Supervised isomap for plant leaf image classification. In: 5th International Conference on Emerging Intelligent Computing Technology and Applications, Ulsan, South Korea, pp. 627–634 (2009)
7. Herdiyeni, Y., Wahyuni, N.K.S.: Mobile application for Indonesian medicinal plants identification using fuzzy local binary pattern and fuzzy color histogram. In: International Conference on Advanced Computer Science and Information Systems (ICACSIS), West Java, Indonesia, pp. 301–306 (2012)
8. Prasvita, D.S., Herdiyeni, Y.: MedLeaf: mobile application for medicinal plant identification based on leaf image. Int. J. Adv. Sci. Eng. Inf. Technol. 3, 5–8 (2013)
9. Le, T.L., Tran, D.T., Hoang, V.N.: Fully automatic leaf-based plant identification, application for Vietnamese medicinal plant search. In: Fifth Symposium on Information and Communication Technology, Hanoi, Vietnam, pp. 146–154 (2014)
10. Arai, K., Abdullah, I.N., Okumura, H.: Identification of ornamental plant functioned as medicinal plant based on redundant discrete wavelet transformation. Int. J. Adv. Res. Artif. Intell. 2(3), 60–64 (2013)
11. Turkoglu, M., Hanbay, D.: Recognition of plant leaves: an approach with hybrid features produced by dividing leaf images into two and four parts. Appl. Math. Comput. 352, 1–14 (2019)
12. Ma, L., Fang, J., Chen, Y., Gong, S.: Color analysis of leaf images of deficiencies and excess nitrogen content in soybean leaves. In: Proceedings of the 2010 International Conference on E-Product E-Service and E-Entertainment, Henan, China, 7–9 Nov 2010, vol. 11541023, pp. 1–3 (2010)
13. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 3rd edn. Pearson Prentice Hall, Upper Saddle River, NJ, USA (2002)
Information Retrieval
GA-Based Iterative Optimization System to Supervise Adaptive Workflows in Cloud Environment
Suneeta Satpathy, Monika Mangla, Sachi Nandan Mohanty, and Sirisha Potluri
Abstract Because enormous amounts of data are generated at a rapid rate, processing large volumes of data has become a challenging task. This leads to many issues in the computing environment, since it requires complex data analysis. Performance and cost issues may arise from the heterogeneity of the systems connected in the distributed environment. Consumers submit requests to the dynamic cloud environment depending on their demand and requirements. To manage these volatile workloads, an efficient workflow management system is proposed in this paper. A novel design of an adaptive workflow management system is described to support on-demand service management in the cloud. The mechanism includes a workflow scheduler and an iteration controller to optimize data processing through iterative workflow task scheduling. Upgrade fit is an optimization technique that dynamically and continuously reallocates multiple types of cloud resources to fulfill performance and cost requirements. Using this algorithm, iterative workflow tasks can be executed repeatedly for data processing and workload execution. The performance of the algorithm is evaluated using a weather forecast workflow. Since the algorithm shows considerable efficiency, it can also be used for real-time processing of large data volumes. The evaluation results indicate that the system can effectively handle multiple types of cloud resources and optimize performance iteratively. The algorithm was implemented in a cloud computing environment using CloudSim, and the results show that completion time is minimized while resource allocation is maximized.
Keywords Cloud computing · CloudSim · Resource allocation
S. Satpathy
College of Engineering, Bhubaneswar, India
M. Mangla (B)
Lokmanya Tilak College of Engineering, Koparkhairane, Navi Mumbai, India
S. N. Mohanty · S. Potluri
ICFAI Foundation for Higher Education, Hyderabad, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_47
1 Introduction Cloud computing is a currently trending technology that provides services over the Internet, such as application resources and data, to users on demand. The term is fuzzy, and no single concrete definition of it exists; it can be viewed as the delivery of services over the Internet as and when the customer demands them. The transition of large organizations from the traditional CAPEX model to the OPEX model supports the view that cloud computing is one of the most promising technologies in the current IT scenario. As the number of cloud users grows, cloud service providers find it increasingly challenging to deliver the requested services with high availability and reliability. Virtualization helps providers meet these challenges by creating virtual machines on demand: virtual entities of the actual resources are created and deployed in the system. This technology enables the cloud service provider to serve a larger number of customers than the actual hardware resources available to the provider could otherwise support. Generally, virtualization is applied at the computer system level, which involves the creation and deployment of Virtual Machines (VMs). Customer requests are then processed by these VMs, which differ in their processing and memory requirements. One of the major factors to consider in systems that deploy VMs is the allocation of VMs to hosts [1, 2]. The paper is organized as follows. The topic is briefly introduced in Sect. 1. Section 2 discusses related work in this field. The basics of genetic algorithms are presented in Sect. 3. The proposed QoS-aware workflow scheduling scheme is elaborated in Sect. 4. Results of the proposed scheme are discussed in Sect. 5, while Sect. 6 concludes the paper.
2 Related Work In recent years, task scheduling in distributed environments has attracted the attention of researchers. The main issue is the minimization of execution time. In the cloud computing environment, however, task scheduling is an essential problem that must account for many factors, such as completion time, execution cost and energy, resource utilization, power consumption, throughput, availability, reliability, elasticity, privacy, security, and fault tolerance. GE Junwei presented a static genetic algorithm that considers total task completion time, average task completion time, and a cost constraint; the algorithm is reported to achieve these QoS requirements [3]. One scheduling problem is the allocation of the proper resource to arriving tasks. Using a dynamic programming method, S. Ravichandran introduced a system that addresses this problem by placing arriving tasks in a queue and computing their priority order based on task type. Scheduling then takes the first task from the queue and allots it to the best-matching resource, which is found using GA. The aim of this system is to maximize resource utilization in the cloud environment and to achieve various QoS parameters through objective functions [6]. Kaur and S. Kinger proposed a task scheduling algorithm based on an improved GA. They used a new fitness function based on mean and grand mean values, and they claim that the algorithm can be applied to both task and resource scheduling [4]. A comparative study of three task scheduling algorithms for cloud computing, namely round robin scheduling, pre-emptive priority scheduling, and shortest remaining time scheduling, has also been carried out, and its results indicate that the algorithm proposed there is efficient. V. V. Kumar and S. Palaniswami presented a study that focuses on increasing the efficiency of task scheduling algorithms for cloud computing services [5]. They also introduced a scheduling strategy that makes use of the turnaround time: tasks with an early completion time are assigned high priority, and tasks with a late completion time are assigned lower priority. Real-time and urgent tasks are selected and treated before other tasks. Z. Zheng proposed a GA-based scheduling strategy, known as the Parallel Genetic Algorithm (PGA), to deal with the scheduling problem in the cloud computing environment and to achieve optimal or sub-optimal solutions to cloud scheduling problems mathematically [7].
3 Genetic Algorithms Genetic Algorithms (GA) are computerized search procedures based on the mechanics of natural genetics and natural selection that can be used to obtain global and robust solutions to a given problem. GA is a computational optimization
scheme with a non-traditional approach. It was developed by John Holland and his colleagues at the University of Michigan. Genetic algorithms are considered among the most effective techniques for many problems and have been applied in engineering since at least 1983. For example, Goldberg applied GA to the control of a gas pipeline system (1983), Davis and Coombs to network design (1987), and GE and RPI to jet engine turbine design; Holland used it for modeling ecosystems (1995), and further applications include aircraft design, scheduling, and symbolic mathematics. The problem of premature convergence in genetic algorithm optimization was discussed by Mori et al. (1996). A comparison of the modified genetic algorithm with the simple genetic algorithm has been carried out on the knapsack problem. Gregg Rothermel (2010) studied the various factors that affect the performance of genetic algorithms. This literature shows that genetic algorithms have been applied successfully to many optimization problems using genetic operators such as selection, crossover, and mutation. Since genetic algorithms are applied to such a range of optimization problems, the present study aims to identify the factors that can affect their performance.
3.1 Outline of Genetic Algorithm A genetic algorithm can be outlined in terms of the following concepts:
Population: the set of all candidate (encoded) solutions to the given problem. The population of a GA is analogous to a population of human beings, except that candidate solutions take the place of individuals.
Chromosome: one solution from the population.
Gene: one element position of a chromosome.
Allele: the value that a gene takes in a particular chromosome.
Fitness function: a function that takes a solution as input and returns the quality of that solution as output. In some cases the fitness function and the objective function are the same, whereas in others they differ depending on the problem.
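To make these terms concrete, the following minimal Python sketch maps them onto a task-to-VM assignment. The encoding (gene position = task, allele = VM index), the task sizes, the VM speeds, and the use of makespan as the fitness measure are illustrative assumptions, not the representation used by the authors.

```python
# Illustrative encoding (an assumption, not the paper's exact representation):
# gene position = task index, allele = index of the VM the task runs on.
task_lengths = [40, 10, 30, 20]      # hypothetical task sizes
vm_speeds = [10, 5]                  # hypothetical VM speeds

# A chromosome is one candidate solution: one allele per gene (task).
chromosome = [0, 1, 0, 1]

# The population is a collection of chromosomes.
population = [
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]

def makespan(chrom):
    """Fitness measure: finish time of the busiest VM (lower is better)."""
    load = [0.0] * len(vm_speeds)
    for task, vm in enumerate(chrom):
        load[vm] += task_lengths[task] / vm_speeds[vm]
    return max(load)

gene = 2                      # position of the third task
allele = chromosome[gene]     # VM chosen for that task
best = min(population, key=makespan)
```

Here the fitness function and the objective (minimum makespan) coincide; a maximizing GA would typically use a transformed value such as the reciprocal instead.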
3.2 Basic Operators of Genetic Algorithm Reproduction: The first operator applied to the population is reproduction. Chromosomes are selected from the given population based on Darwin's principle of “survival of the fittest.” Because this step applies the selection operator to the available population, it is also called the selection phase.
Crossover: After the reproduction phase, the population is enriched with better individuals. Reproduction makes clones of good strings but does not create new ones. The crossover operation is therefore applied to the mating pool in the hope of creating better strings. Mutation: After crossover, the strings are subjected to mutation. Mutating a bit means flipping it, that is, changing 0 to 1 or 1 to 0.
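The three operators can be sketched on bit strings as follows. This is a minimal illustration under stated assumptions: roulette wheel selection stands in for reproduction (it is the selection scheme the paper lists among its parameters), crossover is single-point, and the function names and toy fitness are hypothetical.

```python
import random

def roulette_select(population, fitness, rng):
    """Reproduction: pick one chromosome with probability proportional to fitness."""
    total = sum(fitness)
    pick = rng.uniform(0, total)
    running = 0.0
    for chrom, fit in zip(population, fitness):
        running += fit
        if pick <= running:
            return chrom
    return population[-1]  # guard against floating-point round-off

def single_point_crossover(parent_a, parent_b, point):
    """Crossover: swap the tails of two parents at the given cut point."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def bit_flip_mutation(chrom, rate, rng):
    """Mutation: flip each bit (0 -> 1, 1 -> 0) independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chrom]

rng = random.Random(0)
population = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]
fitness = [sum(c) for c in population]       # toy fitness: number of 1-bits
parent = roulette_select(population, fitness, rng)
child_a, child_b = single_point_crossover([1, 1, 1, 1], [0, 0, 0, 0], 2)
mutant = bit_flip_mutation([0, 0, 0, 0], 1.0, rng)
```

Note that a chromosome with zero fitness is never chosen by the roulette wheel, which is why implementations often shift or scale fitness values before selection.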
4 QoS Aware Workflow Scheduling Scheme This section discusses the workflow scheduling scheme in a detailed manner.
4.1 Classification of Various Workflow Scheduling Schemes The QoS constraints relevant to workflow scheduling include makespan, cost, throughput, reliability, resource utilization, turnaround time, success rate, tardiness, resource availability, load balancing, report time, budget, deadline, waiting time, execution time, and security. QoS is an important factor in maintaining the quality of the services offered by the cloud service provider, and some QoS constraints are measured at the client side while others are measured at the server side. The problem considered in this paper concerns hard real-time applications with increasing data generation. Adaptive workflow scheduling is used to handle the exponential growth of data in this NP-complete problem [8–15]. An improved genetic-based workflow scheduling algorithm can handle dynamic workflows and give optimized results. As shown in Fig. 1, workflow scheduling schemes can be classified into heuristic schemes, metaheuristic schemes, and hybrid schemes. Heuristic schemes in the literature include partial critical path, bi-direction adjust heuristic, min–min, max–min, sufferage heuristic, Qsufferage algorithm, adaptive dual-objective scheduling, hyper heuristic scheduling, modified path clustering heuristic, iterative ordinal optimization, multi-objective list scheduling, and hybrid cloud optimized cost. Metaheuristic algorithms in the literature include improved GA, dynamic objective GA, bi-objective GA, multi-objective GA, cost efficient GA, artificial bee colony, ant colony optimization, bat algorithm, and cat swarm optimization. The last category, hybrid schemes, contains algorithms such as list scheduling and GA, bi-objective dynamic level scheduling and bi-objective GA, hybrid heuristic scheduling based on GA, GA with variable neighborhood search, rotary hybrid discrete PSO, and dynamic bi-objective scheduling based on GA [16–20]. The flow of operations performed in GA is shown in Fig. 2. Initially, a set of chromosomes is presented, followed by the construction of the genetic operators. Next, the fitness function is selected. The probabilities that control the genetic operators must then be determined. In the next
Fig. 1 Classification of workflow scheduling schemes
Fig. 2 Genetic algorithm
step, the problem is defined and, after initialization, the chromosomes are evaluated using the chosen fitness function. Genetic operators such as crossover and mutation are then applied to the selected population. The performance of the algorithm is evaluated after these operations, and the process is iterated to obtain optimized results.
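The flow described above (initialize, evaluate, select, cross over, mutate, iterate) can be sketched as a small loop on a toy scheduling problem. This is a hedged illustration, not the authors' CloudSim implementation: the task sizes, VM speeds, per-gene mutation rate, elitism step, and the function name `run_ga` are all assumptions; only the crossover probability of 0.75 and the roulette wheel selection come from the paper's parameter table.

```python
import random

TASK_LENGTHS = [40, 10, 30, 20, 50, 60]   # hypothetical task sizes
VM_SPEEDS = [10, 5, 20]                   # hypothetical VM speeds

def makespan(chrom):
    """Finish time of the busiest VM for a task-to-VM assignment."""
    load = [0.0] * len(VM_SPEEDS)
    for task, vm in enumerate(chrom):
        load[vm] += TASK_LENGTHS[task] / VM_SPEEDS[vm]
    return max(load)

def run_ga(pop_size=20, generations=30, p_cross=0.75, p_mut=0.1, seed=1):
    rng = random.Random(seed)
    n_tasks, n_vms = len(TASK_LENGTHS), len(VM_SPEEDS)
    # 1. Initial population: random task-to-VM assignments.
    pop = [[rng.randrange(n_vms) for _ in range(n_tasks)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # 2. Evaluation: a shorter makespan gives a larger selection weight.
        weights = [1.0 / makespan(c) for c in pop]
        best = min(pop, key=makespan)
        history.append(makespan(best))
        # 3. Selection, crossover and mutation build the next generation.
        nxt = [best[:]]                       # elitism keeps the best schedule
        while len(nxt) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)   # roulette wheel
            child = a[:]
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_tasks)
                child = a[:cut] + b[cut:]
            # "Replacing" mutation: reassign a task to a random VM.
            child = [rng.randrange(n_vms) if rng.random() < p_mut else g
                     for g in child]
            nxt.append(child)
        pop = nxt
    best = min(pop, key=makespan)
    history.append(makespan(best))
    return best, history

best, history = run_ga()
```

Because the best schedule of each generation is copied unchanged into the next, the recorded best makespan can never get worse from one generation to the next, which makes the loop easy to sanity-check.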
4.2 Proposed Algorithm The proposed algorithm for handling adaptive workflows is given as follows:
Step-1: Initialize and generate the initial population by considering population size, chromosomes, and boundaries.
Step-2: Estimate and calculate the fitness function of the algorithm.
Step-3: Initialize all crossover and mutation operators with fitness values.
Step-4: Estimate the probabilities of the crossover and mutation operators as
$$\text{Prob}_i = \frac{\text{Fitness}_i}{\sum_{j=1}^{num} \text{Fitness}_j}$$
In the above equation, $\text{Prob}_i$ is the probability of the selected operator $i$; $\text{Fitness}_i$ is the fitness value of operator $i$; and $num$ is the total number of operators in the given population.
Step-5: Let
    pf = number of all possible unique fitness values
    n_iter = 0
    crossover_type = select a crossover type
    mutation_type = select a mutation type
Step-6: Iterate the following steps until the termination condition is satisfied:
    i. Select the parents in the population.
    ii. On the selected population, perform the crossover_type operation.
    iii. Let $\text{nfitch}_{\text{unique}}$ = total number of children with unique fitness value, and $\text{fitness}_{\text{better}}$ = the better fitness value among all.
    iv. Compute
    $$\text{fitness}[i] = \frac{\text{nfitch}_{\text{unique}}}{\text{fitness}_{\text{better}}}$$
    v. Then perform the mutation_type operation.
    vi. For each child present in the list, do the following set of operations:
        for (i = 1; i <= TaskN; i++) {
            if (FTime_i > D_i) {
                Drop the child from the existing population
                Break
            }
        }
    where TaskN is the total number of tasks, FTime_i is the finish time of task i, and D_i is the deadline of task i.
    Then evaluate the fitness of the feasible children using the selected fitness function:
    $$\text{fitness}_{\text{feasible}} = \frac{\text{fitness}_{\text{better}}}{\text{fitness}_{\text{lower}}}$$
        if (num_iterations == min_iterations) {
            Determine the values of crossover and mutation using moving average
            Update the values of crossover type and mutation type
            crossover_type = select a crossover type
            mutation_type = select a mutation type
        }
    Replace and update the current population for the next generation.
Step-7: End.
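Two pieces of the procedure above lend themselves to a short sketch: the operator probabilities of Step-4 and the deadline check of Step-6 (vi). The Python below is a minimal illustration under assumptions; the operator names, the fitness scores, the finish times, and the helper names `operator_probabilities` and `is_feasible` are hypothetical and do not come from the paper.

```python
def operator_probabilities(operator_fitness):
    """Step-4: Prob_i = Fitness_i / sum_j Fitness_j."""
    total = sum(operator_fitness.values())
    return {name: fit / total for name, fit in operator_fitness.items()}

def is_feasible(finish_times, deadlines):
    """Step-6 (vi): a child is rejected as soon as one task misses its deadline."""
    for ftime, deadline in zip(finish_times, deadlines):
        if ftime > deadline:
            return False
    return True

# Hypothetical operator names and fitness scores.
probs = operator_probabilities({"one_point": 3.0, "two_point": 1.0})

# Hypothetical children, each given as a list of per-task finish times.
deadlines = [10, 20, 30]
children = {"child_a": [8, 19, 30], "child_b": [8, 21, 25]}
feasible = [name for name, ft in children.items() if is_feasible(ft, deadlines)]
```

With these inputs the first operator receives three quarters of the probability mass, and only the child whose tasks all finish by their deadlines is kept for fitness evaluation.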
5 Results and Discussion The proposed upgrade fit algorithm is implemented using the CloudSim simulator. The parameter values set in the algorithm are shown in Table 1. The results are compared with those of two existing algorithms, HEFT and DCP-G. The comparison in Figs. 3, 4, and 5 indicates that the proposed algorithm performs well with respect to execution time, which demonstrates the efficiency and effectiveness of the proposed approach.
Table 1 Parameters used in the genetic algorithm
| Parameter name | Value set in the algorithm |
| --- | --- |
| Size of the population | 100 |
| Probability of the crossover | 0.75 |
| Probability of the mutation (Swapping) | 0.62 |
| Probability of the mutation (Replacing) | 0.78 |
| Fitness function used | Execution time of the tasks |
| Selection scheme used/applied | Roulette wheel |
| Individual generation in initial stages | Randomly generated |
Fig. 3 Comparison of algorithms with task number (50, 100, 200, and 500)
Fig. 4 Comparison of algorithms with task number (700, 1000, 2000, and 5000)
Fig. 5 Comparison of algorithms HEFT, DCP-G, and upgrade fit
6 Conclusion Upgrade fit is an optimization technique that dynamically and continuously reallocates multiple types of cloud resources to fulfill performance and cost requirements. Iterative workflow tasks are given as input to the cloud computing environment in the form of workloads and are then executed repeatedly for data processing. The performance of the algorithm has been evaluated using a case study, the weather forecast workflow. Because the algorithm shows considerable efficiency compared with the existing algorithms HEFT and DCP-G, upgrade fit can be used for large-volume data processing. The evaluation results also indicate that the system can effectively handle multiple types of cloud resources and optimize performance iteratively. The algorithm is implemented in a cloud computing environment using CloudSim, and the observations show that completion time is minimized and resource allocation is maximized during task scheduling.
References
1. Youseff, L., Butrico, M., Da Silva, D.: Toward a Unified Ontology of Cloud Computing. In: 2008 Grid Computing Environments Workshop
2. Potluri, S., Varshith, K.: Software virtualization using containers in google cloud platform. IJITEE 8(7), 2430–2432 (2019)
3. Zhang, H., Xie, J., Ge, J., Zhang, Z., Zong, B.: A hybrid adaptively genetic algorithm for task scheduling problem in the phased array radar. Eur. J. Oper. Res. 272(3), 868–878
4. Kaur, R., Kinger, S.: Enhanced genetic algorithm based task scheduling in cloud computing. Int. J. Comput. Appl. (2014)
5. Venkatesa Kumar, V., Palaniswami, S.: A dynamic resource allocation method for parallel data processing in cloud computing. J. Comput. Sci. 8(5), 780–788 (2012)
6. Ravichandran, K.S., Chandra Sekhara Rao, K., Saravanan, R.: The role of fuzzy and genetic algorithms in part family formation and sequence optimisation for flexible manufacturing systems. Int. J. Adv. Manuf. Technol. 19(12), 879–888
7. Wang, L., Zheng, D.-Z.: A Modified genetic algorithm for job shop scheduling. Int. J. Adv. Manuf. Technol. 20(1), 72–76 (2002)
8. Potluri, S.: Efficient hybrid QoS driven task scheduling algorithm in cloud computing using a toolkit: Cloudsim. JARDCS 12(Special Issue), 1270–1283 (2017)
9. Kau, S., Bagga, P., Hans, R., Kaur, H.: Quality of Service (QoS) aware Workflow Scheduling (WFS) in cloud computing: a systematic review. Arab. J. Sci. Eng. 44, 2867–2897 (2019)
10. Potluri, S.: Optimization model for QoS based task scheduling in cloud computing environment. IJEECS 18(2), 1081–1088 (2020)
11. Yu, J., Buyya, R.: Taxonomy of workflow management systems for grid computing. J. Grid Comput. 3(3–4), 171–200 (2005)
12. Wu, Z., Liu, X., Ni, Z., Yuan, D., Yang, Y.: A market-oriented hierarchical scheduling strategy in cloud workflow systems. J. Supercomputing 63(1), 256–293 (2013)
13. Yu, J., Buyya, R., Ramamohanarao, K.: Workflow scheduling algorithms for grid computing. In: Xhafa, F., Abraham, A. (eds.), Metaheuristics for Scheduling in Distributed Computing Environments. Springer, Germany (2008)
14. Pandey, S., Wu, L., Guru, S., Buyya, R.: A particle swarm optimization (PSO)-based heuristic for scheduling work-flow applications in cloud computing environments. In: Proceedings of the 24th IEEE International Conference on Advanced Information Networking and Applications (AINA’10), Australia, April 2010
15. Mandal, A., et al.: Scheduling strategies for mapping application workflows onto the grid. In: Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC’05), USA, July 2005
16. Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I.: Cloud computing and emerging it platforms: vision, hype, and reality for delivering computing as the 5th utility. Future Gener. Comput. Syst. 25(6), 599–616 (2009)
17. Potluri, S.: Simulation of QoS-based task scheduling policy for dependent and independent tasks in a cloud environment. In: Smart Intelligent Computing and Applications, vol. 159, pp. 515–525 (2019)
18. Yu, J., Buyya, R.: A budget constrained scheduling of workflow applications on utility grids using genetic algorithms. In: Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing (HPDC’06), France, June 2006
19. Mahmood, A., Khan, S.A., Bahlool, R.A.: Hard real-time task scheduling in cloud computing using an adaptive genetic algorithm. Computers 6 (2017)
20. Venugopal, S., Buyya, R.: A set coverage-based mapping heuristic for scheduling distributed data-intensive applications on global grids. In: Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (GRID’06), Spain, September 2006
Rainfall Prediction and Suitable Crop Suggestion Using Machine Learning Prediction Algorithms Nida Farheen
Abstract Farmers in India are largely at the mercy of the monsoon, since rainfall is the primary source of water for agriculture in India. Agriculture is a major source of livelihood but contributes only about 18% of the total gross domestic product, one reason being the lack of adequate crop planning by farmers. Although India has surplus fertile land, inefficient agricultural practices caused by a lack of techniques for predicting rainfall and crops have contributed to a large number of farmer suicides. Machine Learning (ML) has recently helped in finding promising solutions to the problems of rainfall prediction, soil assessment, crop management, yield prediction, crop quality, and disease detection and classification. Despite this technology, there is as yet no platform or system in place to inform farmers of the predicted rainfall and advise them on what crops to grow. In this paper, rainfall is predicted using diverse ML and statistical algorithms, and the most suitable crops to grow are recommended accordingly, with soil as a parameter. The raw real-time rainfall data acquired pertains to three regions of Karnataka: North, South, and Coastal. The data was cleaned and structured, and its features were extracted. Statistical tests (ADF, KPSS, ACF, PACF) executed on the feature-extracted data revealed its trend and seasonality, which informed the modelling. Using ML and statistical algorithms such as ARIMA, ANN, random forest, TBATS, Holt-Winters, and simple, double, and triple exponential smoothing, rainfall for the next six years was predicted. All three regions were modelled separately. Time series forecasting using ARIMA proved to be the best performer. The performance of all models is validated using standard error measures to substantiate the reported accuracy. The generalized accuracy averaged over all three regions is 92.91% for ARIMA, 88.26% for ANN, 61% for TBATS, 71.1% for simple exponential smoothing, 68.63% for double exponential smoothing, 57.42% for triple exponential smoothing, and 42% for random forest. Keywords Rainfall prediction · Crop recommendation · Machine learning algorithms N. Farheen (B) MLIS, Department of CSE, Ramaiah University of Applied Sciences, Bangalore, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_48
1 Introduction Farmer suicides in India have been a national catastrophe since the 1990s. The reasons cited include environmental factors such as inadequate rains or floods, low produce prices, heavy debt, poor irrigation, and the rising cost of cultivation. Even today, farmers cultivate crops based on experience handed down from previous generations. Because traditional farming methods are practiced, crops are grown in proportions that do not match actual requirements, resulting in irrevocable losses; hence, the effect on agriculture's contribution to the Indian economy is conspicuous. This has now become a severe problem, leading to scarcity of stored food, low product prices, and increased cost of cultivation. Rainfall prediction is a major problem for the meteorological department because it is closely tied to the economy and to the technological means available. Uncontrollable natural disasters such as floods and droughts wreak havoc on farmers every year. Owing to the dynamic nature of the atmosphere, statistical techniques can fail to provide good accuracy for rainfall forecasting. If knowledge about rainfall can be provided beforehand, it will help farmers to plan their cultivation. Predicting rainfall appropriately thus makes it possible to take preventive and mitigating measures against these natural phenomena and to give farmers a better livelihood. However, there is as yet no aid or system in place in India to assist farmers in this way. If farmers knew the approximate amount of rainfall to expect, they could decide what crops to grow accordingly. Numerous techniques are available for making these predictions, such as Artificial Neural Networks (ANN), Deep Learning (DL), ARIMA, random forest, TBATS, exponential smoothing, Support Vector Machines (SVM), naive Bayes, and decision trees, and they are used in the publications analyzed herein. By applying ML techniques to historical rainfall and crop production data, several useful predictions can be made that help in predicting rainfall effectively and thereby in increasing crop productivity. This work presents the prediction of rainfall from historic real-time data of Karnataka State in India. Based on the amount of rainfall predicted in each region of Karnataka, crops other than those already grown that can survive in the predicted climatic conditions are also suggested, taking soil into consideration as a factor. The paper is organized as follows: Section 1 is the introduction; Sect. 2 details the literature survey; Sect. 3 briefly describes the methods and methodologies used; Sect. 4 gives details of the algorithms used and the design process flow of the work done on rainfall prediction and crop recommendation; Sect. 5 shows the results and validation of the analysis; Sect. 6 gives the conclusion and future scope, followed by the References.
2 Literature Survey ML approaches are currently being applied creatively to address the varied issues involved in accurate and efficient rainfall prediction. The researchers of [1] developed an architecture based on a DL approach, using autoencoders and neural networks, to predict the daily accumulated rainfall at a meteorological station located in Manizales city, Colombia. The architecture is composed of two networks, an autoencoder and a Multilayer Perceptron (MLP). A denoising autoencoder was used, as provided by Theano, a GPU-based Python library for mathematical optimization. The authors of [2] aimed to predict the effective rainfall and crop water needs through analysis of temperature and humidity. The system incorporates a J48 classifier that predicts the effective rainfall and can also be used to predict the crop water needs for any particular area. This method is being applied to determine the effective rainfall and crop water needs in the Bijapur district of Karnataka, in order to maximize crop yield and avoid over-irrigation. The authors of [3] presented a modular Support Vector Machine (SVM) to simulate rainfall prediction. A bagging sampling technique is used to generate different training sets, and SVMs with different kernel functions and parameters, i.e., the base models, are trained to formulate different regressions based on the different training sets. The Partial Least Squares (PLS) algorithm is used to choose the appropriate number of support vector regression combination members. The technique is implemented to forecast monthly rainfall in Guangxi, China. The researchers of [4] tackled the pressing problem of finding the right combination of crops to cultivate in a particular soil type. A Decision Support System (DSS) gave farmers great assistance by reducing the overhead of decisions about the soil and the crops to be cultivated. In the study of [5], an attempt is made to predict the crop yield and price that a farmer can obtain from his land by analyzing patterns in past data. A sliding window non-linear regression technique is used to predict based on different factors affecting agricultural production, such as rainfall, temperature, market prices, area of land, and past yield of a crop, for several districts of the state of Bangladesh. The authors of [6] developed a system intended to suggest the best crop choices for farmers, to adapt to the demands of the prevailing social crisis facing farmers today. The demand for crops is predicted by classifying the collected dataset based on changes in the market prices of the crops. These works reveal a variety of good research done to help farmers, but none that brings all of these capabilities together on one platform.
3 Methodology of Rainfall Prediction and Crop Recommendation This research aims to predict long-term rainfall in different zones of Karnataka State of India and also to formulate a crop recommender system considering
soil as a factor, so as to aid farmers in effective crop planning. The aim of this research is therefore not confined to a single goal but is an amalgamation of two important goals. Each of these goals requires different methods and methodologies, which is what makes this work novel. Table 1 gives a brief overview of the methods and methodologies used to achieve these aims. The two proposed aims are:
• To predict rainfall in different districts of Karnataka
• To suggest suitable crops for the predicted rainfall and known soil type.
Table 1 Methods and methodology
| Objective No. | Statement of the objective | Method/methodology | Resources utilised |
| --- | --- | --- | --- |
| 1 | Data Collection and Data Preparation | Combining two datasets of rainfall and crop/soil; removing outliers and NA's; structuring data (columnizing data from rows for modelling) | MS Excel; R tool, packages used: data.table, tidyverse |
| 2 | Data visualization | Exploring data to see pattern (trend, seasonality) | Scatter plot, histogram |
| 3 | Statistical Tests | ADF, KPSS, ACF, PACF | R tool, package: basic stat |
| 4 | Feature Extraction | Identify unique and important features required; feeding unique features (trend, seasonality, cyclic trend) to the model | R tool, package: Boruta |
| 5 | Predictive Models | ANN, ARIMA, Holt-Winters, Exponential Smoothing, TBATS, Random Forest | R tool, packages: forecast, auto.arima, caret, tbats |
| 6 | Validate Model Performance | Error measures: ME, RMSE, MAE, MAPE, MASE and ACF | R tool, package: MLmetrics |
| 7 | Develop a recommender system | Table mapping of rainfall prediction and crop database | R tool, package: dplyr |
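The error measures listed for objective 6 in Table 1 can be written out explicitly. The sketch below is a plain-Python illustration rather than the R workflow the paper used; the function name `forecast_errors`, the sign convention (error = actual minus predicted), the non-seasonal naive forecast used to scale MASE, and the sample rainfall values are all assumptions made for the example.

```python
import math

def forecast_errors(actual, predicted, train):
    """Return ME, RMSE, MAE, MAPE (%) and MASE for a forecast.

    MASE scales MAE by the in-sample MAE of the naive one-step forecast
    (previous value), computed on `train`.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    me = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when an actual value is zero.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    naive_mae = sum(abs(train[i] - train[i - 1])
                    for i in range(1, len(train))) / (len(train) - 1)
    return {"ME": me, "RMSE": rmse, "MAE": mae, "MAPE": mape,
            "MASE": mae / naive_mae}

# Hypothetical rainfall values (mm), not taken from the paper's dataset.
train = [100.0, 120.0, 110.0, 130.0]
actual = [100.0, 200.0]
predicted = [90.0, 220.0]
scores = forecast_errors(actual, predicted, train)
```

A MASE below 1 means the forecast beats the naive previous-value forecast on average; for strongly seasonal series such as monthly rainfall, a seasonal naive forecast is the more common scaling baseline.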
4 Design and Algorithmic Process Flow Developed The methodology of the research executed depicts a step by step detailed procedure incorporated and depicted in Fig. 1. The brief structure of methodology designed for this research is as follows: • Rainfall data of Karnataka collected on request from Karnataka government. • Rainfall raw data acquired was a non-stationary, time series data of 148 years dating from 1871. • Crop and soil data collected from Karnataka government website (open source). • Data was priorly divided in three prominent regions of the state, i.e. north, south and coastal Karnataka. • Data cleansing and preparation to structure the data and execute feature extraction. • Data visualized to see seasonality, trend and patterns for interpretation. • Exploratory statistical tests executed to determine stationarity of data. • Unique feature extracted to be fed to model the data. • Data modelling using predictive statistical and ML algorithms for a robust model. • Model validation and testing to check the model performance and accuracy. • Recommender system to establish a link and similarity index/range between the predicted rainfall and soil attribute by drawing links between two databases to get crop suggestion. To design and implement algorithms suitable for prediction of rainfall and suggest suitable crops, the following steps were implemented: • The structured data was modelled using prediction algorithms of both ML and statistical prediction. • A total of six ML and statistical models were chosen based on their ability to handle the data. The algorithms chosen for rainfall forecasting are: ARIMA,
Fig. 1 Detailed algorithmic process flow in block diagram
N. Farheen
Fig. 2 Design flow of the recommender system implemented
ANN, random forest, TBATS, Holt-Winters and simple, double and triple exponential smoothing (ARIMA, ANN and TBATS are shown herein).
• Algorithms implemented for suggesting crops are: random forest and similarity indexing.
• Given the evidence on seasonality, stationarity and cyclic trend exhibited by the data, the algorithms best suited for forecasting rain were chosen by trial and error, retaining the models that gave the best accuracy and efficiency.
• Each of these models was applied to the data of all three regions, which were split into train and test sets.
Figure 2 shows the working flow of the recommender system. Database 1 holds the predicted rainfall data, and Database 2 holds the crop data. Comparing data from both databases shows which crops are grown, and which could also be grown, in the regions specified in the database.
Overview of algorithms implemented:
• Auto Regressive Integrated Moving Average (ARIMA) model: The data acquired and studied was inferred to be well suited to an ARIMA model. To deploy an ARIMA model, the p, d, q values and the correlation coefficients must be determined first. Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF) tests were conducted to calculate the p, d, q parameters. Figure 6 (see Sect. 5) depicts the test and train data after applying the ADF test (which determines the stationarity status of the data by hypothesis testing), the KPSS test (which checks whether a time series is stationary around a mean or a linear trend, or is non-stationary due to a unit root) and the ACF test (which checks for correlation between data points at each lag). The data, plotted to check for trend, seasonality and stationarity, revealed the lag and coefficient values (p, d, q). The auto functions choose the best-fit values for the ARIMA model, since it is a tedious process to iterate and find
Table 2 Performance matrix of ARIMA
the most suitable values of p, d, q, which differ for the three regions of data. Table 2 (see Sect. 5) summarizes the best-fit p, d, q values chosen; the coefficients, the AIC and BIC values and the log-likelihood estimates are also specified. AIC and BIC are used to check how well the model fits the data; lower values indicate a better model. The log-likelihood measures how probable the observed data are under the fitted model; higher values are better. Together, these measures help to judge the quality of the model.
• Artificial Neural Network (ANN): The literature survey showed that ANN is prominently used for this task; hence, ANN was also adopted in this analysis. A deep learning model architecture is used, wherein the MLP acts as the neuron. The ANN model extracts features internally based on seasonality and trend. With some trial and error, the best suited model was found to be one with 22 inputs and 5 hidden layers. The MLP architecture chooses the model parameters accordingly. The MLP forecasts show the seasonality of the rainfall time series. Figure 3 depicts the output of the MLP (ANN) model for the Coastal Karnataka region data. The blue line at the end of the plot (after 2010 on the x-axis) is the forecast for the year 2011.
• Trigonometric seasonality, Box-Cox transformation, ARMA errors, Trend and Seasonal components (TBATS): This statistical model is well suited to non-stationary data and is known to handle complex seasonal data and a large parameter space. It is also used for time series exhibiting multiple complex seasonalities. It can handle a larger effective parameter space, with the possibility of better forecasts, as well as the non-linear features found in time series data. Its estimation procedure allows autocorrelated residuals to be included. Figure 4 depicts the TBATS model forecast along with the optimum coefficients. The blue line at the end of the plot (after 2010 on the x-axis) is the forecast for the year 2011.
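The role of AIC and BIC in choosing between candidate ARIMA orders can be illustrated with a minimal sketch. It uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L; the candidate labels, log-likelihoods, parameter counts and sample size below are hypothetical values chosen for illustration, not results from this study.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return num_params * math.log(num_obs) - 2 * log_likelihood

# Hypothetical candidates: (label, log-likelihood, number of estimated parameters)
candidates = [
    ("ARIMA(1,0,0)", -512.4, 3),
    ("ARIMA(1,1,0)", -509.9, 2),
]
n_obs = 120  # hypothetical number of monthly observations

for label, loglik, k in candidates:
    print(label, round(aic(loglik, k), 2), round(bic(loglik, k, n_obs), 2))

# The candidate with the smallest AIC is retained.
best = min(candidates, key=lambda c: aic(c[1], c[2]))
print("selected:", best[0])
```

An automatic order search can be thought of as repeating this comparison over many candidate (p, d, q) combinations, which is why it replaces the tedious manual iteration mentioned above.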
Fig. 3 Forecast of MLP model for 2011
Fig. 4 Forecast output of TBATS model for north region
5 Results and Validation
Figure 5 is a time series plot of the North Karnataka region data, showing the behaviour and pattern it exhibits. A cyclic pattern that repeats itself every year can be seen. The trend also fluctuates: it increases initially and then decreases. The plots for the Coastal and South Karnataka regions are similar, depending on the rainfall received in particular months. Figure 6 shows the final best-fit ARIMA model, with lag coefficients (1, 0, 0) and (1, 1, 0). The blue dot is the prediction for one data point (each data point is a month), i.e. the forecast for January 2011. Figure 7 is the plot of train and test data used to check for stationarity; the values are ACF = 1 and PACF = 1. Figure 8 is the ACF-PACF test plot; the p and q values can be identified from the ACF and PACF plots.
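How an ACF plot exposes a yearly cycle can be sketched with the usual sample autocorrelation estimate. The monthly values below are a hypothetical pattern repeated over four years, not the Karnataka data; the function name `sample_acf` is likewise only illustrative.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]

# Hypothetical monthly rainfall pattern (arbitrary units), repeated for 4 years
one_year = [0, 1, 3, 6, 10, 14, 15, 12, 8, 4, 2, 1]
series = one_year * 4

acf = sample_acf(series, 12)
# A strong positive value at lag 12 and a negative value at lag 6
# indicate a 12-month seasonal cycle.
print(round(acf[6], 3), round(acf[12], 3))
```

In such a plot, slowly decaying or periodic spikes point to non-stationarity or seasonality, and the lags at which the ACF and PACF cut off suggest candidate q and p values.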
Fig. 5 Seasonality trend the North Karnataka data follows
Fig. 6 Forecast of ARIMA with coefficients p, d, q
Table 2 summarizes the performance matrix of the best-fit ARIMA model. Figure 9 is a snippet of the difference (error) between the actual and predicted rainfall values for the years 2000–2001 (taken from the years 1964–2010). The error is defined from the differences between the dependent (Y) and independent (X) variable values, |Y − X|, i.e. the (actual − predicted) values. These values can be plotted to give a visual comparison of the two series. Table 3 is a snippet of the ARIMA forecast model output. The forecast is done from Jan 2010 to Dec 2016. The values from Jan 2011 onwards in the "Point forecast" column (column 1) are the values forecast for each month. Each point forecast lies within intervals at two confidence levels: "Lo 80" and "Hi 80" are the lower and upper limits of the 80% interval, and "Lo 95" and "Hi 95" are those of the 95% interval. These confidence intervals specify the range
Fig. 7 Trend seasonal pattern for testing and training data
Fig. 8 ACF-PACF test and lag coefficients
forecasted values lie in. For example, the June 2011 point forecast is 11.523. Each confidence interval has an upper and a lower limit: at that confidence level, the forecast is not expected to go above the upper limit or below the lower limit. For the 80% interval, the lower limit is 6.3607 and the upper limit is 16.685; for the 95% interval, the lower limit is 3.627 and the upper limit is 19.418. A comparison of every model's performance accuracy is tabulated in Table 4. All model performances for prediction are compared based on standard error
Fig. 9 Plot of actual and predicted values
measures. It can be seen that ARIMA gives the lowest error; hence, it is the best performing model here. The performances are tabulated in order from least to highest error for all three data regions.
For crop recommendation, to realize the main objective of suggesting suitable alternate crops to farmers, the following steps were taken:
• The forecasted rainfall value is chosen for the area the farmer belongs to (coastal, north or south).
• The forecasted value is then located in the rainfall range and geographical region it belongs to.
• Once the region for the forecasted value is found, the soil type found there can also be read off.
• Once the region, rainfall range and soil type are known, the database lists the crops typically grown there.
• The farmer/user then has the option of growing various alternate crops instead of cultivating crops that yield a profit only once a year.
Figure 10 is the recommender output, showing the recommendation based on the rainfall range specified. When the best model predicts 3516 mm of rainfall for Jan 2011, this predicted value is matched with the crop/soil database to see which rainfall range it falls in, which crops are grown and what soil is available.
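The lookup steps above can be sketched as a range match between a forecast and a crop/soil table. Everything in the table below (rainfall ranges, soil types, crop names) is a hypothetical placeholder, since the actual database contents are not listed in the text; only the 3516 mm forecast value is taken from the example above.

```python
from typing import NamedTuple

class CropRow(NamedTuple):
    region: str
    rain_min_mm: float
    rain_max_mm: float
    soil: str
    crops: tuple

# Hypothetical crop/soil database (Database 2); values are illustrative only.
CROP_DB = [
    CropRow("coastal", 3000, 4000, "laterite", ("paddy", "coconut", "arecanut")),
    CropRow("coastal", 2000, 3000, "red loam", ("paddy", "cashew")),
    CropRow("north", 500, 900, "black", ("jowar", "cotton")),
    CropRow("south", 600, 1000, "red", ("ragi", "groundnut")),
]

def recommend(region, forecast_mm, db=CROP_DB):
    """Return the rows whose region matches and whose range contains the forecast."""
    return [
        row for row in db
        if row.region == region and row.rain_min_mm <= forecast_mm < row.rain_max_mm
    ]

# Forecast value quoted in the text for the recommender example
for row in recommend("coastal", 3516):
    print(row.soil, row.crops)
```

In R, the same match would typically be a filter or join between the forecast table and the crop table; the Python version here only illustrates the logic.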
Table 3 Forecast model confidence interval level
Month      Point forecast    Lo 80        Hi 80         Lo 95        Hi 95
Jan 2011   −11526.855406     −22238.69    −815.0173     −27909.20    4855.487
Feb 2011   −11528.709485     −22423.89    −633.5292     −28191.4     55134.031
Mar 2011   2.272117          −10899.18    10903.7272    −16670.06    16674.609
Apr 2011   4.679697          −10896.99    10906.3513    −16667.99    16677.348
May 2011   5.505985          −10896.17    10907.1851    −16667.17    16678.185
Jun 2011   7.118293          −10894.56    10908.7977    −16665.56    16679.798
Jul 2011   7.511448          −10894.17    10909.1908    −16665.17    16680.191
Aug 2011   7.510598          −10894.17    10909.1900    −16665.17    16680.190
Sep 2011   7.540288          −10894.14    10909.2197    −16665.14    16680.220
Oct 2011   7.169694          −10894.51    10908.8491    −16665.51    16679.850
Nov 2011   6.213213          −10895.47    10907.8926    −16666.47    16678.893
Dec 2011   3.702921          −10897.98    10905.3823    −16668.98    16676.383
Jan 2012   −6072.768180      −18349.54    6204.0024     −24848.47    12702.932
Feb 2012   −6075.576504      −18397.08    6245.9283     −24919.69    12768.539
Mar 2012   1.197498          −12321.85    12324.2439    −18845.28    18847.671
Apr 2012   5.123201          −12317.98    12328.2228    −18841.43    18851.678
May 2012   5.719091          −12317.38    12328.8205    −18840.84    18852.276
Jun 2012   7.134858          −12315.97    12330.2363    −18839.42    18853.692
Jan 2013   −8652.330877      −23367.22    6062.5577     −31156.81    13852.149
Feb 2013   −8654.687882      −23445.25    6135.8707     −31274.90    13965.519
Mar 2013   1.705749          −14791.46    14794.8699    −22622.49    22625.898
Apr 2013   4.913442          −14788.34    14798.1676    −22619.42    22629.243
May 2013   5.618300          −14787.64    14798.8755    −22618.72    22629.953
Jun 2013   7.127023          −14786.13    14800.3843    −22617.21    22631.462
Jan 2014   −7432.302038      −23759.18    8894.5753     −32402.11    17537.502
Feb 2014   −7434.872499      −23812.13    8942.3880     −32481.73    17611.986
Mar 2014   1.465367          −16377.53    16380.4626    −25048.05    25050.980
Apr 2014   5.012649          −16374.04    16384.0698    −25044.59    25054.619
May 2014   5.665970          −16373.39    16384.7252    −25043.94    25055.275
Jun 2014   7.130729          −16371.93    16386.1901    −25042.48    25056.740
Jan 2015   −8009.326341      −26000.80    9982.1465     −35524.91    19506.257
Feb 2015   −8011.795846      −26056.36    10032.7722    −35608.58    19584.989
Mar 2015   1.579058          −18044.82    18047.9774    −27598.01    27601.164
Apr 2015   4.965728          −18041.50    18051.4272    −27594.72    27604.647
May 2015   5.643424          −18040.82    18052.1071    −27594.04    27605.328
Jun 2015   7.128976          −18039.33    18053.5927    −27592.56    27606.814
Jan 2016   −7736.417180      −27162.80    11689.9631    −37446.50    21973.668
Feb 2016   −7738.934433      −27211.21    11733.3407    −37519.21    22041.341
Mar 2016   1.525287          −19472.33    19475.3830    −29781.17    29784.221
Apr 2016   4.987920          −19468.92    19478.9003    −29777.79    29787.767
May 2016   5.654088          −19468.26    19479.5684    −29777.13    29788.436
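The relation between the 80% and 95% limits in the June 2011 example quoted in Sect. 5 can be checked with a small sketch. It assumes that the intervals are symmetric normal intervals of the form point ± z·σ, which is an assumption about how the forecasting routine builds them rather than something stated in the text; the function names are illustrative.

```python
from statistics import NormalDist

def z_value(level):
    """Two-sided standard normal quantile for a given coverage level."""
    return NormalDist().inv_cdf(0.5 + level / 2)

def interval(point, sigma, level):
    """Symmetric interval point +/- z * sigma."""
    z = z_value(level)
    return point - z * sigma, point + z * sigma

def sigma_from_interval(lo, hi, level):
    """Recover the forecast standard error from an interval's width."""
    return (hi - lo) / (2 * z_value(level))

# 80% limits quoted in the text for the June 2011 example
lo80, hi80 = 6.3607, 16.685
point = (lo80 + hi80) / 2
sigma = sigma_from_interval(lo80, hi80, 0.80)

# Under the normal assumption, the 95% limits follow from the same sigma.
lo95, hi95 = interval(point, sigma, 0.95)
print(round(point, 3), round(lo95, 3), round(hi95, 3))
```

The recovered 95% limits agree with the quoted values 3.627 and 19.418 to within rounding, which suggests that the two intervals in that example describe the same forecast distribution at different coverage levels.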
6 Conclusions and Future Scope
The aims and objectives set for the proposed design were fulfilled at each stage. The conclusions drawn from the research are as follows. The work carried out is novel in terms of data acquisition, data preparation, statistical summarization of the data, feature extraction, data modelling and validation. The research was undertaken for a social cause, with the sincere intention of helping the less fortunate and ignored but most crucial citizens of India, the farmers, also known as 'Anna Daata'. The aim is to address the many kinds of challenges faced by the country's farmers and to help put a complete system in place to overcome these hurdles. Hence, the crop prediction and recommender system employed is a technical advancement that helps farmers plan their cultivation practices better. Although this research is novel, since hitherto no such complete solution was found in a single study, it can be improved further using other suitable and proven ML algorithms. Parameters such as temperature, precipitation and humidity can be taken into account. With the right data, this work can be extended to the pan-Indian geography as a usable public inventory solution.
Table 4 All models accuracy comparison
Coastal Karnataka
Model                ME                    RMSE               MAE                MASE                ACF1
Arima                0.0520872506581444    4.9136176907896    3.01892214546961   0.897704291520378   0.00681761855706348
Simple Exponential   0.570519775941112     3604.77038783001   2108.2990188884    1.81399767111623    0.0115282790370759
Double Exponential   0.571174377224199     3607.9733117291    2112.0231316726    1.811018789993      0.01146913843134T4
Triple Exponential   83.8693598825209      1633.77880324822   1026.86118933954   0.883519718489772   −0.0144787989354376
Neural Network       −3.59814377844055     1045.93545603487   631.593255689346   0.540342427866353   −0.035729118995126
TBATS                90.3741313655634      1528.78008545184   1003.60748289817   0.863512040351567   −0.0494571358736932
Random Forest        89.60                 1208.8             1309.89            0.78                −0.068
North Karnataka
Model                ME                    RMSE               MAE                MASE                ACF1
Arima                0.288055597773401     8269.09351869619   3984.36802920509   1.06111660340885    0.00769378915307767
Simple Exponential   21.1861019149843      685.741847766068   480.055546882389   1.33626479084994    −0.00805580403641876
Double Exponential   21.3612562133771      688.671084291383   484.020520075248   1.34730154296892    −0.0113716264523291
Triple Exponential   −37.1694096851214     459.788517537451   322.514336485416   0.89773892873146    0.027385787860262
Neural Network       −0.298673399233137    250.207069499509   166.443108853512   0.463915130012323   −0.0285094197587259
TBATS                −16.8121370718984     409.382325236759   279.698191018952   0.7785574964195     −0.0144536860599505
Random Forest        70.30                 208.8              309.08             0.59                −0.0178
South Karnataka
Model                ME                    RMSE               MAE                MASE                ACF1
Arima                0.0531327589946711    3.95536529213652   2.45541039178315   0.921437343187643   −0.00718920479333653
Simple Exponential   0.902170539173317     616.434224513717   453.175134804392   1.25912395512213    −0.0116446669814037
Double Exponential   0.903871588750624     616.982409666758   453.983438981329   1.26136978697398    −0.01162462360636
Triple Exponential   22.4954352376499      419.297703097358   305.496602964719   0.848806700675107   0.0101721800997947
Neural Network       0.528150611282831     279.250428947029   191.441966403891   0.536340706655099   −0.0259996650868096
TBATS                −10.1707301190299     376.210790210747   260.874690313051   0.72482699652092    0.00620430802362612
Random Forest        −20.90                389.70             359.78             0.86                −0.0028
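The error measures compared in Table 4 can be computed as in the following sketch. It uses the common definitions with error = actual − predicted, and scales MASE by the in-sample mean absolute error of a naive forecast; the `season` parameter and all data values are hypothetical illustrations, not values from this study.

```python
import math

def error_measures(actual, predicted, train, season=1):
    """ME, RMSE, MAE, MAPE (%) and MASE for a set of forecasts.

    MASE divides MAE by the in-sample mean absolute error of a naive
    forecast that repeats the value observed `season` steps earlier.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    me = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return {"ME": me, "RMSE": rmse, "MAE": mae, "MAPE": mape, "MASE": mae / scale}

# Hypothetical values for illustration only
train = [8, 10, 14, 12]
actual = [10, 20, 30]
predicted = [12, 18, 33]

for name, value in error_measures(actual, predicted, train).items():
    print(name, round(value, 4))
```

ACF1, the remaining column of Table 4, is the lag-1 autocorrelation of the forecast errors; values near zero indicate that little structure is left in the residuals. Note also that RMSE can never be smaller than MAE for the same set of errors, which is a quick consistency check for any row of such a table.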
Fig. 10 Crop recommendation output
References
1. Hernandez, E., Anguix, V.S., Julian, V., Palanca, J., Duque, N.: Rainfall Prediction: A Deep Learning Approach. HAIS 2016, LNAI 9648, pp. 151–162 (2016)
2. Abishek, B., Priyatharshini, R., Eswar M.A., Deepika, P.: Prediction of effective rainfall and crop water needs using data mining techniques. In: IEEE Technological Innovations in ICT for Agriculture and Rural Development (TIAR), Chennai, 2017, pp. 231–235 (2017)
3. Lu, K., Wang, L.: A novel nonlinear combination model based on support vector machine for rainfall prediction. In: Fourth International Joint Conference on Computational Sciences and Optimization, Yunnan, pp. 1343–1346 (2011)
4. Mishra, S., Paygude, P., Chaudhary S., Idate, S.: Use of data mining in crop yield prediction. In: 2nd International Conference on Inventive Systems and Control (ICISC), Coimbatore, 2018, pp. 796–802 (2018)
5. Islam, T., Chisty, T.A., Chakrabarty, A.: A deep neural network approach for crop selection and yield prediction in Bangladesh. In: IEEE Region 10 Humanitarian Technology Conference, R10-HTC, pp. 1–6. IEEE (2019)
6. Raja, S.K.S., Rishi, R., Sundaresan, E., Srijit, V.: Demand based crop recommender system for farmers. In: IEEE Technological Innovations in ICT for Agriculture and Rural Development (TIAR), Chennai, pp. 194–199 (2017)
Implementation of a Reconfigurable Architecture to Compute Linear Transformations Used in Signal/Image Processing
Atri Sanyal and Amitabha Sinha
Abstract Computationally intensive linear transformations used in signal or image processing applications are considered. The flow graph is a common way of describing these types of algorithms. Using a graph theoretic approach, it is deduced that a completely connected bipartite graph can replicate all possible paths of a flow graph of arbitrary size representing any such transformation. Next, a reconfigurable architecture imitating a completely connected bipartite graph is proposed. The overall architecture and the processing element architecture are discussed in detail. The instruction set for routing, data in/out, data move, arithmetic operations, etc. is implemented by generating appropriate control signals. One representative FDCT algorithm is selected, and its longest sequence of operations is implemented using instructions from the proposed instruction set. The entire architecture is implemented in VHDL, and the synthesis report is presented.
Keywords Reconfigurable architecture · FDCT algorithm · VHDL
A. Sanyal
Amity Institute of Information Technology, Amity University, Kolkata, West Bengal, India
A. Sinha
Maulana Abul Kalam Azad University of Technology, Nadia, West Bengal, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_49
1 Introduction
The architectural implementation of computationally intensive linear transformations has been of research interest for quite some time. Reconfigurable devices such as FPGAs are used extensively in these implementations. This research has followed three major directions. First, various implementations are proposed for a single type of linear transformation such as FFT/IFFT or FDCT; a range of techniques such as high-speed pipelining, data forwarding, step-lifting and multiplier-less variations is suggested in [1–14]. Second, some implementations propose architectures suitable for a number of similar types of linear transformations [15–17]. Third, other implementations [18–20] describe more general types of transformations used in image/signal applications. In this paper, we describe an implementation that belongs to the second category, using a reconfigurable architecture. A number of linear transformations can be described by the flow graph technique. Using graph theory and mathematical induction, it is proved in [21] that an architecture representing a completely connected equi-vertex bipartite graph can replicate any action represented by a flow graph of arbitrary size. Hence, the architecture is capable of implementing the transformations represented by a flow graph. Next, we describe in detail the Processing Element (PE) architecture and the overall architecture, with their control signals, mentioned in [21]. The instruction set for the overall architecture as well as for the PE is described. We take some representative instructions of different categories and describe the corresponding control signals that implement them. Next, the PE-wise operations required to implement Loeffler's FDCT algorithm are considered, and the longest sequence of operations of a stage inside a PE is implemented using instructions from the proposed instruction set. The entire implementation is coded in VHDL, and simulated and synthesized in Xilinx Vivado. The simulation result and the results for timing, power and size in terms of LUTs are reported. The rest of the paper is organized as follows. In Sect. 2, it is proved by mathematical induction that an architecture representing a completely connected bipartite graph is sufficient to implement any transform algorithm represented by a flow graph. Section 3, which is divided into four sub-parts, describes (a) the design of the overall and PE architectures along with their control signals, (b) the instruction set, (c) the control signals implementing the instruction set and (d) the implementation of one representative operation of the FDCT algorithm using the proposed instructions. Section 4 contains the simulation and synthesis report of the VHDL implementation, and Sect. 5 presents the conclusions and the future direction of this study.
2 Graph Theoretical Proof of the Suitability of the Proposed Architecture Model
Dataflow diagrams are used extensively to describe transform algorithms. A dataflow diagram describing Loeffler's FDCT algorithm is presented in Fig. 1. From this diagram, we can see that the flow graph represents an equi-vertex K-partite graph [23] with k = 5 and n = 8. Any path from a specific vertex of partition 1 to a specific vertex of partition 5 represents one parallel sequence of the algorithm. All paths present in this K-partite graph together represent the overall parallel algorithm described by the flow graph. The structure of such a graph, and that of an equi-vertex completely connected bipartite graph, are shown in Fig. 2.
Fig. 1 A flow graph describing Loeffler's FDCT algorithm [22]
Fig. 2 a An equi-vertex K-partite graph k = 4 and n = 4. b An equi-vertex completely connected bipartite graph with n = 3
The lemma proposed in [21] states that any path existing in an equi-vertex K-partite graph is equivalent to a path existing in an equi-vertex completely connected bipartite graph, provided that the vertex set is the same in both graphs.
Proof: The proof is by mathematical induction. We assume that the K-partite graph G1 and the completely connected bipartite graph G2 are equi-vertex, and that the vertices v_i ∈ V of G1, for i = 1, 2, …, n in all partitions K = 1, 2, …, m, are the same as the vertices v_i ∈ V of G2, for i = 1, 2, …, n in its two partitions. That is, v_1, v_2, …, v_n represent the same vertices in all K partitions of G1 and in the two partitions (S1 and S2) of G2. We consider graphs with more than one partition, so the trivial case K = 1 does not apply. Suppose K = 2. Then any path from a vertex V_i of K1 to a vertex V_j of K2 in G1 can be replicated in G2, since G2 is completely connected. Next, suppose K = 3. Then a path V_i of K1 → V_j of K2 → V_k of K3 in G1 can be replicated in G2 by travelling V_i of S1 → V_j of S2 → V_k of S1. Since G2 is completely connected, this is possible for any V_i, V_j and V_k. As the induction hypothesis, assume that the lemma is true for K = R − 1. If R − 1 is odd, the last vertex of the replicated path is in S1; otherwise, it is in S2. To prove the lemma for K = R, suppose R − 1 is odd, so that the last vertex of the path for K = R − 1 is in S1; we can then go to any vertex V_j of S2, since S1 and S2 are completely connected. Since the connections are bidirectional, the same holds when R − 1 is even, with the roles of S1 and S2 exchanged. This proves the lemma for every value of R. So an equi-vertex completely connected bipartite graph can replicate any path present in an equi-vertex K-partite graph of any size. The lemma gives the theoretical justification for using an architecture that replicates an equi-vertex completely connected bipartite graph as an efficient architecture for the dataflow algorithms of different image/signal processing linear transformations.
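The construction used in the proof can be sketched in a few lines: a path that visits one vertex per partition K1, K2, …, Km is mapped onto a walk that alternates between S1 and S2. The function names and the small exhaustive check below are illustrative assumptions, not part of [21].

```python
from itertools import product

def replicate_path(path):
    """Map a K-partite path (one vertex index per partition K1..Km)
    onto a walk alternating S1, S2, S1, ... in the bipartite graph."""
    return [("S1" if step % 2 == 0 else "S2", v) for step, v in enumerate(path)]

def is_walk_in_complete_bipartite(walk, n):
    """In a complete bipartite graph with n vertices per side, a walk is
    valid when vertex indices are in range and consecutive steps switch sides."""
    in_range = all(1 <= v <= n for _, v in walk)
    alternates = all(a[0] != b[0] for a, b in zip(walk, walk[1:]))
    return in_range and alternates

# Exhaustive check for a small case: n = 3 vertices, K = 4 partitions
n, k = 3, 4
all_ok = all(
    is_walk_in_complete_bipartite(replicate_path(p), n)
    for p in product(range(1, n + 1), repeat=k)
)
print(replicate_path([2, 3, 1]), all_ok)
```

The check only confirms the alternation argument for one small size; the induction in the proof is what covers arbitrary K.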
3 Implementation of the Proposed Model into a Reconfigurable Architecture
3.1 Implementation of Overall Architecture
The architecture deduced using the graph theoretic approach requires two sets of n processing elements and completely connected two-way communication wires between them. A single clock pulse can direct data from any vertex of one partition to any vertex of the other. This architecture requires considerable hardware overhead. To reduce it, instead of two sets of n processing elements, one in each partition, we use one set of n processing elements (P1 to Pn) in one partition and one set of n registers (R1 to Rn) in the other. Instead of two-way fully connected communication lines, we use one-way fully connected communication lines Ri → Pj (i, j = 1, 2, …, n) and one feedback line Pi → Ri (i = 1, 2, …, n). A data communication that takes one clock pulse in the theoretical model takes two clock pulses in this modified model, but at significantly lower hardware cost. The overall architecture is shown in Fig. 3. The fully connected feedforward communication is implemented using eight 3 × 8 multiplexers. The feedback connection between each processing element and its corresponding register is implemented by a 1 × 2 demultiplexer-multiplexer pair connected to the processing element and the register, respectively. The architecture contains 8-bit register sets to latch values entering or leaving the processing elements.
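The two-clock data movement of the modified model can be illustrated with a small behavioural sketch. It assumes that each processing element simply passes its input through, and it uses 0-based indices and illustrative function names; it is not the VHDL design described in the paper.

```python
def route(registers, select):
    """Clock 1: PE j reads the register chosen by its multiplexer select[j]."""
    return [registers[src] for src in select]

def feedback(pe_outputs):
    """Clock 2: PE i writes its value back to its own register i."""
    return list(pe_outputs)

def move(registers, src, dst):
    """Move the value of register src to register dst in two clock pulses.
    All other PEs select their own register, so their values are unchanged."""
    select = list(range(len(registers)))
    select[dst] = src
    pe_outputs = route(registers, select)   # clock pulse 1
    return feedback(pe_outputs)             # clock pulse 2

regs = [10, 20, 30, 40, 50, 60, 70, 80]     # R1..R8 (hypothetical contents)
print(move(regs, src=2, dst=4))
```

Because every multiplexer can select any register, any register-to-register transfer fits this route-then-feedback pattern, which is the two-pulse cost mentioned above.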
Fig. 3 Design of the first stage of the overall architecture with control lines
3.2 Implementation of Processing Element (PE) Architecture
The Processing Element (PE) used to implement these types of transformations requires floating point addition/subtraction and multiplication/division operations. A generic floating point adder and multiplier are used inside the PE. Two registers, D0 and D1, store the operands of the addition/subtraction, and D2 and D3 store the operands of the multiplication/division. The result of the adder is stored in register D4 and that of the multiplier in D5. To keep the design modular and in accordance with IEEE 754, the registers of the PE are 32 bits long, and data from the outside 8-bit registers is padded with 0s on entering the PE and stripped of them on leaving. Multiplexers and demultiplexers are used inside the PE to route data within and outside the PE (Fig. 4).
Fig. 4 Design of the Processing Element (PE) along with control lines
3.3 Instruction Set of Overall Architecture and Processing Element Architecture
The instruction set of the overall architecture is primarily for routing operations, and the instruction set of the Processing Element (PE) is primarily for arithmetic calculation and data movement operations. The categories of instructions available are given below.
A. Data loading/routing operations outside PE:
1. Load direct MIDREG [1–8]: loads data from outside memory into MIDREG [1–8] from TP [1–8].
2. Load feedback data MIDREG [1–8]: loads data into MIDREG [1–8] from feedback line FB [1–8].
3. Rout PE [1–8], MIDREG [1–8]: routes data from any MIDREG to any PE.
4. OUT OUTREG [1–8]: stores the value to outside memory.
B. Data loading/movement and mathematical operations inside PE:
1. Load [D0-D3]: loads data into D0-D3 from outside the PE.
2. Load [D0-D3,C0-C7]: loads data into D0-D3 from the constant register bank C0-C7.
3. Add: adds the data present in D0 and D1 and keeps the result in D4.
4. Mul: multiplies the data present in D2 and D3 and keeps the result in D5.
5. Move [D0-D3,D4-D5]: moves data from the output register D4/D5 to any input register [D0-D3].
6. Out [D4-D5]: writes data back from output register D4/D5 to outside memory.
The control lines of the PE and of the overall architecture were specified in the previous sections; the instruction set is implemented as sequences of signals on these control lines. The sequences of control signals for some representative instructions of each category (A/B) are given in Table 1.
Table 1 Sequence of control signals of some representative instructions of all categories
Category A                     Sequence of control signals
Load direct MIDREG 1           1. TP1 → Data 2. INMUX1 → 0 3. EN-MIDREG1 → 1
Load feedback data MIDREG1     1. OUTDEMUX1 → 0 2. INMUX1 → 1 3. EN-MIDREG1 → 1
Rout MIDREG3,PE5               1. EN-MIDREG3 → 0 2. ROUTMUX5 → 011
Out OUTREG6                    1. EN-OUTREG6 → 0 2. OUTDEMUX6 → 1
Category B                     Sequence of control signals
Load D0                        1. input → data 2. PECL1 → 0 3. PECL2 → 00 4. PECL4 → 1
Load D0,C5                     1. EN-C5 → 1 2. CMUX → 101 3. PECL2 → 11 4. PECL4 → 1
Add                            1. PEEN1 → 1
Mul                            1. PEEN2 → 1
Out D4                         1. PECL6 → 1 2. PECL8 → 0 3. data → output
Move D1,D5                     1. PECL7 → 0 2. PECL2 → 10 3. PECL4 → 0
3.4 Implementation of a Sequence of Operations Using the Instruction Set of the Architecture
The flow graph describing the FDCT algorithm is shown in Fig. 1. From it, it is evident that the longest sequence of operations a PE has to perform is C1*X + C2*Y. If a PE can perform this sequence of operations with the instruction set proposed in the previous section, then we can reasonably assume that it can calculate the entire algorithm expressed in the flow graph. Table 2 shows the instruction sequence performing these operations.
4 Synthesis Report of the Architecture
The overall architecture and the Processing Element (PE) level architecture are coded in VHDL; the sequence of operations described in Table 2 is simulated and synthesized using Xilinx Vivado. The simulation of the sequence C4 * peint1 + C5 * peint2 (C4 = constant value stored in constant register no. 4 of the constant register bank in the PE, C5 = the same from constant register no. 5, peint1 and peint2 = values loaded from outside at times t1 and t2; peint1 = 2.0, peint2 = 4.0, C4 = 0.5, C5 = 8.0, peout = 33.0) confirms the correctness of the result and of the design. The synthesis results in terms of size, power and timing behaviour are given in the following figure and table. Most of the available architectures are ASIC implementations focusing on one or two linear transformations, except for one reconfigurable Xilinx IP core, available only for FFT calculation, discussed in [24]. Since the idea of the proposed architecture differs from the existing ones, ASIC and FPGA comparisons are difficult, and our architecture is not yet complete, a comparison with existing designs is not attempted (Fig. 5 and Table 3).
Table 2 Instruction sequence inside PE to perform the sequence of operations C1 * X + C2 * Y
Time unit   Instruction                      Description
1           Rout PE [2], MIDREG [2]          PE2 ← input data from MIDREG2 of the previous stage
2           Load [D2]                        D2 ← input data
3           Load [D3,C5]                     D3 ← constant from constant register 5, containing √2 C3π/8
4           Mul                              D5 ← D2*D3
5           Move [D0,D5]                     D0 ← D5
6           Rout PE [2], MIDREG [3]          PE2 ← data from MIDREG3 of the previous stage
7           Load [D2]                        D2 ← data
8           Load [D3,C6]                     D3 ← constant from constant register 6, containing S3π/8
9           Mul                              D5 ← D2*D3
10          Move [D1,D5]                     D1 ← D5
11          Add                              D4 ← D0 + D1
12          OUT [D4]                         Output data from D4
13          Load Feedback Data MIDREG [2]    MIDREG2 ← output data
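The instruction sequence of Table 2 can be traced with a minimal behavioural model of the PE registers. This is only an illustrative Python sketch with hypothetical class and method names, not the VHDL implementation; it uses the simulation values quoted in Sect. 4 (C4 = 0.5, C5 = 8.0, peint1 = 2.0, peint2 = 4.0) and ordinary Python floats in place of the 32-bit floating point units.

```python
class PE:
    """Behavioural model of one Processing Element: data registers D0-D5
    and a constant register bank C0-C7."""

    def __init__(self, constants):
        self.d = [0.0] * 6
        self.c = list(constants)

    def load(self, reg, value):
        """Load [D0-D3]: load an outside value into an input register."""
        if reg not in range(4):
            raise ValueError("only D0-D3 can be loaded")
        self.d[reg] = value

    def load_const(self, reg, const_index):
        """Load [D0-D3,C0-C7]: load a constant into an input register."""
        self.load(reg, self.c[const_index])

    def add(self):
        """Add: D4 <- D0 + D1."""
        self.d[4] = self.d[0] + self.d[1]

    def mul(self):
        """Mul: D5 <- D2 * D3."""
        self.d[5] = self.d[2] * self.d[3]

    def move(self, dst, src):
        """Move [D0-D3,D4-D5]: copy an output register to an input register."""
        if dst not in range(4) or src not in (4, 5):
            raise ValueError("move goes from D4/D5 to D0-D3")
        self.d[dst] = self.d[src]

    def out(self, reg):
        """Out [D4-D5]: return the content of an output register."""
        return self.d[reg]

constants = [0.0] * 8
constants[4], constants[5] = 0.5, 8.0      # C4 and C5 from the simulation
pe = PE(constants)

pe.load(2, 2.0); pe.load_const(3, 4); pe.mul(); pe.move(0, 5)   # C4 * peint1
pe.load(2, 4.0); pe.load_const(3, 5); pe.mul(); pe.move(1, 5)   # C5 * peint2
pe.add()
peout = pe.out(4)
print(peout)
```

The trace follows time units 2 to 12 of Table 2; the routing and feedback steps (time units 1, 6 and 13) happen outside the PE and are therefore not modelled here.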
Fig. 5 Simulation result of the Processing Element (PE)
5 Conclusion and Future Direction of the Study
This study indicates that the proposed architecture is capable of calculating the modified FDCT algorithm using its present instruction set, and that every instruction of this instruction set can be implemented. So, in general, the architecture in the form described in this paper is capable of computing FDCT algorithms. Since linear transformation algorithms of this type (FFT, DCT, DWT, etc.) are generally described through flow graphs, and the architecture is shown to implement such flow graphs efficiently in its current form, we conclude that it can also process flow graph-based algorithms of other linear transformations efficiently. The future direction of the study is to transform the architecture into a complete processor. Four major improvements are required for this. First, a control unit must be implemented to sequence the control signals according to the various linear transformations (FFT, DCT, DWT, etc.). Second, a memory architecture and the necessary handshaking signals must be implemented for transferring data to and from the main memory, taking care of speed mismatch issues. Third, a testbench programme must be written to simulate and test the behaviour of the architecture executing a number of major linear transformations, and the outcome compared with existing architectural designs. Fourth, the reconfiguration effort required to expand the design, in terms of larger bandwidth and larger design implementations, should be accurately calculated to establish the reconfigurable nature of the design.
Table 3 Utilization report (summary and detailed), power and timing report of the architecture after synthesis

Utilization report (summary)
No. of LUT    10,897
No. of FF     6928
No. of IOB    562

Utilization report (primitive blocks)
Primitive name    Number    Functional category
LUT6              6096      LUT
LUT5              920       LUT
LUT4              2984      LUT
LUT3              576       LUT
LUT2              536       LUT
LUT1              1257      LUT
FDCE              3616      Flop and Latch
FDRE              3312      Flop and Latch
MUXF7             320       MuxFx
CARRY4            168       Carry Logic
IBUF              489       IO
OBUF              73        IO
BUFG              1         Clock

Power report (summary)
Total on-chip power     0.417 W
Device dynamic power    0.335 W
Device static power     0.082 W

Timing report (summary)
Max setup time                                 3.419 ns
Worst pulse width slack                        4.650 ns
Avg CP required for FP operations inside PE    4
Max clock frequency
References 1. Tseng, P.-C., et al.: Reconfigurable discrete cosine transform processor for object-based video signal processing. In: ISCAS ‘04. Proceedings of the 2004 International Symposium on Circuits and System (2004) 2. Tseng, P.-C., Huang, C.-T., Chen, L.-G.: Reconfigurable discrete wavelet transform processor for heterogeneous reconfigurable multimedia systems.J. VLSI Signal Process. Syst. Signal Image Video Technol. (2005) 3. Donohoe, G.W.: The fast Fourier Transform on a reconfigurable processor. In: Proceedings of the NASA Earth Sciences Technology Conference, Pasadena, CA, 11–13 June 2002
Implementation of a Reconfigurable Architecture …
525
4. Srivatsava, P.S.V., Sarada, V.: Reconfigurable MDC architecture based FFT processor. Int. J. Eng. Res. Technol. (2014) 5. KHass, K.J. and Cox, D.F.: Transform processing on a reconfigurable data path processor. In: 7th NASA Symposium on VLSI Design (1998) 6. Sarada, V., Vigneswaran, V.: Reconfigurable FFT processor—a broader perspective survey. Int. J. Eng. Technol. (IJET) (2013) 7. Shahbahrami, A., Ahmadi, M., Wong, S., Bertels, B.: A new approach to implement discrete wavelet transform using collaboration of reconfigurable elements. In: Proceedings of the 2009 International Conference on Reconfigurable Computing and FPGAs 8. Manolopoulos, K.E., Nakos, K.G., Reisis, D.I., Vlassopoulos, N.G.: Reconfigurable Fast Fourier Transform Architecture for Orthogonal Frequency Division Multiplexing Systems (2003). Available: https://pdfs.semanticscholar.org/dd5c/263725af00e5dd4d42d573c269f57d 917c8d.pdf?_ga=2.84059166.640751657.1573804365-914446569.1569299704 9. Sinha, A., Sarkar, M., Acharyya, S., Chakraborty, S.: A novel reconfigurable architecture of a DSP processor for efficient mapping of DSP functions using field programmable DSP arrays. In: ACM SIGARCH Computer Architecture News, vol. 41, No. 2, May 2013 10. Wadekar, S., Thakare, L.P., Deshmukh, A.Y.: Reconfigurable N-point FFT processor design For OFDM system. Int. J. Eng. Res. Gener. Sci. 3(2) (2015) 11. Petrovsky, A., Rodionov, M., Petrovsky, A.: Dynamic reconfigurable on the lifting steps wavelet packet processor with frame-based psychoacoustic optimized time-frequency tiling for real-time audio applications. Design and Architectures for Digital Signal Processing (2013). Available: https://www.intechopen.com/books/design-and-architectures-fordigital-sig nal-processing 12. Thomas, S., Sarada, V.: Design of reconfigurable FFT processor with reduced area and power. ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE), 2013. 13. 
Rajaram, U.: Design of fir filter for adaptive noise cancellation using context switching reconfigurable EHW architecture. Ph.D. dissertation, Anna University, Chennai (2009). Available: https://shodhganga.inflibnet.ac.in/handle/10603/27245 14. Reddy, P.S., Mopuri, S., Acharyya, A.: A reconfigurable high speed architecture design for discrete hilbert transform. IEEE Signal Process. Lett. 21(11), 1413–1417 (2014). https://doi. org/10.1109/LSP.2014.2333745 15. Sanyal, A., Samaddar, S.K., Sinha, A.: A generalized architecture for linear transform. In: Proceedings of the IEEE International Conference on CNC 2010, 4–5 Oct 2010, Calicut, Kerala, India, pp. 55–60. IEEE Computer Society. ISBN: 97-0-7695-4209-6. 16. Sanyal, A., Samaddar, S.K.: A combined architecture for FDCT algorithm. In: Proceedings of the 2012 Third International Conference on Computer and Communication Technology, Allahabad, pp. 33–37 (2012). https://doi.org/10.1109/ICCCT.2012.16 17. Sanyal, A., Kumari, S., Sinha, A.: An improved combined architecture of the four FDCT algorithms. Int. J. Res. Electron. Comput. Eng. (IJRECE) 6(4) (2018). 2348-2281 18. Rossi, D.,Campi, F., Spolzino, S., Pucillo, S., Guerrieri, R.: A heterogeneous digital signal processor for dynamically reconfigurable computing. IEEE J. Solid-State Circ. 45(8) (2010) 19. Purohit, S., Chalamalasetti, S.R., Margala, M., Vanderbauwhede, W.: Throughput/resourceefficient reconfigurable processor for multimedia applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21(7) (2013) 20. Vikram, K.N., Vasudevan, V.: Mapping data-parallel tasks onto partially reconfigurable hybrid processor architectures. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 14(9) (2006) 21. Sanyal, A., Sinha, A.: A reconfigurable architecture to implement linear transforms of image processing applications. In: International Conference on Frontiers in Computing and System (COMSYS 2020), Jalpaiguri, West Bengal, India, 13–15 January 2020
526
A. Sanyal and A. Sinha
22. Heyne, B., Sun, C.C., Goetze, J., Ruan, S.J.: A computationally efficient high-quality cordic based DCT. In: 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, 4–8 Sept 2006 23. Deo, N.: Graph theory with applications to engineering and computer science. PHI (2007) 24. Wang, M., Wang, F., Wei, S., Li, Z.: A pipelined area-efficient and high-speed reconfigurable processor for floating-point FFT/IFFT and DCT/IDCT computations. Microelectronics J. 47 (2016). Available: www.elsevier.com/locate/mej
PCA and Substring Extraction Technique to Generate Signature for Polymorphic Worm: An Automated Approach Avijit Mondal, Arnab Kumar Das, Subhadeep Satpathi, and Radha Tamal Goswami
Abstract Worms that can infect hundreds of thousands of Internet hosts pose a significant threat to the security of Internet infrastructure. Current intrusion detection systems (IDSes) monitor the DMZs of edge networks to identify and filter malicious traffic. While an IDS can help protect the hosts on its local edge network from compromise and denial of service, it cannot by itself intervene effectively to halt and reverse the spread of novel Internet worms. Generating the worm signatures that an IDS needs (the byte patterns sought in monitored traffic to identify worms) currently involves non-trivial human labor and hence substantial delay: when network operators detect anomalous behavior, they communicate with one another and manually study packet traces to produce a worm signature. However, intervention must occur early in an epidemic to halt a worm's spread. In this paper, an automated signature generation process is proposed for polymorphic worms. We apply principal component analysis (PCA) to determine the most significant data shared among polymorphic worm instances and use them as signatures. The experimental results indicate that PCA successfully identified polymorphic worms with zero false positives and low false negatives.

Keywords Principal component analysis (PCA) · Polymorphic worms · Internet intrusion detection systems (IDSes)

A. Mondal (corresponding author), MAKAUT, Kolkata, India
A. K. Das · S. Satpathi, Jis University, Kolkata, India
R. T. Goswami, Techno International New Town, Kolkata, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_50
1 Introduction

Today, many enterprises and individuals depend on valuable online information for their day-to-day work. This information is exposed to threats such as malicious attacks by viruses, worms, phishing, or Trojans, which can cause major damage to businesses and consumers. Such attacks must therefore be detected before they cause damage. One way to detect a cyber-attack is a signature-based intrusion detector, because each attack has a signature. In this approach, however, the signature must already be known, so newly emerged attacks cannot be identified. The alternative is the anomaly-based approach, which detects attacks automatically by measuring the current resource usage profile and comparing it with a base profile, that is, with threshold values for the resources. This approach can detect any attack, whether new or old, but it suffers from a large number of false positives.

A polymorphic worm changes its appearance in every instance. Honeypots have often served as an excellent source of data for the analysis of attacks and intrusions [1]. Honeycomb generates signatures that consist of a single contiguous substring of a worm's payload intended to match all worm instances; such signatures fail to match all instances of a polymorphic worm with low false positives and low false negatives. Unlike Honeycomb, Autograph takes its input from packet flows monitored on a DMZ, which include benign traffic [2]. EarlyBird is also designed to detect worms by generating signatures. It measures the prevalence of packet content at a single monitoring point, such as a network DMZ, and infers an outbreak from the number of distinct sources and destinations associated with strings that recur often in packet payloads. Like Honeycomb and Autograph, EarlyBird generates signatures that consist of a single contiguous substring of a worm's payload to match all worm instances. Automatic signature generation for polymorphic worms, by contrast, relies on the observation that several invariant substrings must be present in every variant of the worm's payload, even though the payload changes with each infection.

Each of these systems captures packet payloads at a router. In the worst case, such a system may observe several polymorphic worms at once, each exploiting a different vulnerability, which makes it difficult for the above methods to find shared content. To capture the instances well, polymorphic worms must be given a limited opportunity to interact with hosts without degrading the hosts' performance. This reasoning led to the use of a double honeypot, which collects all the instances so that signatures can be generated automatically using substring extraction and principal component analysis.

In contrast, [3] proposes hierarchical quadrature amplitude modulation (HQAM) for better protection of high-priority data during image transmission. However, that work considers no channel model such as AWGN; only salt-and-pepper noise is taken into account. A raised cosine filter in an AWGN channel is introduced in [4], whose authors evaluate the performance of the communication channel through the bit error rate (BER). In [5], M-ary QAM is used in an AWGN channel to evaluate the polar coding technique; that method does not consider band-limited channels or intersymbol interference.
2 Network Malicious Attacks

Network worms, which are self-propagating network programs, represent a substantial threat to our network infrastructure. Because worms propagate so quickly, reactive defenses must be automated. It is important to understand how and where these defenses must fit into the network so that they cannot easily be evaded. Because malcode authors can use many mechanisms to bypass existing perimeter-centric defenses, this paper contends that substantial defenses must be embedded in the local area network, creating "hard LANs" designed to detect and respond to worm infections. Compared with conventional network intrusion detection systems (NIDS), we believe that hard-LAN devices will need two orders of magnitude better cost/performance and at least two orders of magnitude better accuracy, which poses substantial design challenges.
2.1 Internet Worm Detection and Containment

Depending on the parameters used for detection, detection algorithms are roughly divided into signature-based and anomaly-based schemes, and many algorithms have been proposed for both. This section first presents signature-based detection and then covers anomaly-based detection. Self-propagating, self-duplicating malicious codes known as worms spread without any human interaction and launch the most destructive attacks against computer networks [6]. At the same time, being fully automated makes their behavior repetitious and predictable. We first identify worm characteristics through their behavior and then classify worm detection algorithms according to the parameters they use. We also analyze and compare different detection algorithms with reference to the worm characteristics, identifying the types of worm that each scheme can and cannot detect. Once the presence of worms has been detected, the next step is to contain them, so the current methods for slowing and stopping the spread of worms are explored. The locations at which detection and containment can be implemented, and the scope of each of these schemes, are also examined in depth. Finally, the remaining challenges of worm detection and directions for future research are highlighted.

The objective of an intrusion detection system (IDS) is to "identify, preferably in real-time, unauthorized use, misuse and abuse of computer systems by both system insiders and external penetrators." An IDS is used as an alternative (or a complement) to building a shield around the system.
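The difference between the two detection families can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names, the example byte pattern, and the numeric values are hypothetical and are not taken from any particular IDS.

```python
def signature_match(payload, signatures):
    """Signature-based check: alert if any known byte pattern occurs in the payload."""
    return any(sig in payload for sig in signatures)


def anomaly_alert(observed, baseline, threshold):
    """Anomaly-based check: alert if a measured quantity (for example,
    connections per second) exceeds its baseline by more than a threshold."""
    return (observed - baseline) > threshold


# Hypothetical, harmless example values.
known_signatures = [b"EXAMPLE-WORM-PATTERN"]
packet = b"GET /index.html EXAMPLE-WORM-PATTERN HTTP/1.1"
clean_packet = b"GET /index.html HTTP/1.1"

print(signature_match(packet, known_signatures))        # known pattern present
print(signature_match(clean_packet, known_signatures))  # no known pattern
print(anomaly_alert(observed=120.0, baseline=20.0, threshold=50.0))
```

The signature check can only fire for patterns that are already known, while the threshold check can fire for previously unseen behavior, at the price of alerts on unusual but benign traffic.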
3 Polymorphism of Internet Worms

Attackers will try every possible means to extend the lifetime of their worms. To evade signature-based systems, a polymorphic worm changes its appearance every time it replicates itself. This section covers worm polymorphism and several typical polymorphism techniques.

There are many ways to make a worm polymorphic. One technique relies on self-encryption with a variable key. It encrypts the body of the worm, which removes both the signatures and the statistical characteristics of the byte string. A copy of the worm, the decryption routine, and the key are sent to a victim machine, where the decryption routine turns the encrypted text back into a regular worm program. The program is then executed to infect further victims. Although different copies of a worm look entirely different when different keys are used, the encrypted text tends to follow a uniform byte-frequency distribution, which is itself a statistical feature. Moreover, if the same decryption routine is always used, the byte sequence of the decryption routine can serve as a worm signature.

A more sophisticated polymorphism technique changes the decryption routine each time a copy of the worm is sent to another victim host. This can be done by keeping several decryption routines in the worm. When the worm makes a copy, it uses one of these routines; because their number is limited, it is possible to identify all of them as attack signatures once enough samples of the worm have been obtained.

Another polymorphism technique is garbage-code insertion, which inserts garbage instructions into the copies of a worm. For example, a number of nop (no operation) instructions can be inserted at different places in the worm body, which makes it harder to compare the byte sequences of two instances of the same worm.
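The effect of garbage insertion on byte-level comparison can be seen in a toy sketch on a harmless byte string. The helper names `insert_garbage` and `longest_common_substring`, the sample bytes, and the insertion positions are all hypothetical and chosen only for illustration; the point is that two copies no longer compare equal, yet an invariant substring survives in both.

```python
def insert_garbage(body, positions, filler=b"\x90"):
    """Return a copy of `body` with `filler` inserted before each index in `positions`."""
    out = bytearray()
    for i, byte in enumerate(body):
        if i in positions:
            out += filler
        out.append(byte)
    return bytes(out)


def longest_common_substring(a, b):
    """Longest contiguous run of bytes shared by `a` and `b` (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


body = b"HEADERINVARIANTTAIL"           # harmless stand-in for a payload
copy1 = insert_garbage(body, {6})        # filler before "INVARIANT"
copy2 = insert_garbage(body, {15})       # filler before "TAIL"

print(copy1 == copy2)                            # False: the copies differ byte for byte
print(longest_common_substring(copy1, copy2))    # b'INVARIANT'
```

This is why signature generators for polymorphic worms look for content shared across instances instead of comparing whole payloads.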
3.1 Terminology

• Activation: the point at which a worm begins carrying out its malicious activities.
• Worm: a self-propagating piece of malware that typically spreads over network connections by exploiting vulnerabilities in the individual systems (laptops, desktops) that make up the network. Worms generally require no human intervention to spread, although a class of worms known as passive worms requires specific host behavior or human intervention to propagate.
• Virus: a malicious piece of code that attaches itself to other programs to spread. It cannot spread on its own and usually relies on a particular user action, such as opening an email attachment or running an executable file, to be triggered.
• Threshold: a predetermined condition that, if met, indicates the presence of a worm attack or suspicious traffic.
• Infection: the effect on a host of a worm carrying out its malicious activities.
• False alarm: an incorrect alert raised by a worm detection system.
• False negative: an attack that the detection system missed, that is, no alert is raised while the system is under attack.
• False positive: a false alarm in which an alert is raised although there is no real threat or attack.

From a statistical point of view, however, the frequencies of the garbage instructions in a worm can differ significantly from those in normal traffic. If so, anomaly detection methods [7] can be used to identify the worm. In addition, some garbage instructions, such as nop, can easily be identified and removed, and executable-analysis techniques can be used to identify other, better obfuscated garbage [8]. The new signatures can tolerate extensive "local" changes as long as certain "global" characteristics remain. Good examples are polymorphism caused by instruction substitution and register reassignment. However, we do not claim that such signatures are suitable for all attacks, especially for very sophisticated worms that retain no position-dependent statistical characteristics after a transformation such as obfuscation that cannot be deobfuscated.
4 Signature Generation Algorithm

Signature generation uses two procedures: substring extraction and principal component analysis. The substring extraction procedure extracts substrings from the polymorphic worm, while the principal component analysis procedure determines the most significant of them, which are used as the signature [5].

Assume a polymorphic worm $A$ that has $n$ instances $(A_1, \ldots, A_n)$, and let $A_i$ have length $M_i$ for $i = 1, \ldots, n$. The instance $A_1$ is written $A_1 = (a_1 a_2 \ldots a_{M_1})$. Let $X$ be the minimum length of the substrings to be extracted from $A_1$. The first substring of length $X$ from $A_1$ is $(a_1 a_2 \ldots a_X)$ [4]. Shifting one position to the right gives the next substring, $(a_2 a_3 \ldots a_{X+1})$. Continuing in this way, the last substring from $A_1$ is $(a_{M_1 - X + 1} \ldots a_{M_1})$. In general, if an instance $A_i$ has length $M$ and the substring length is $X$, the total number of substrings extracted from $A_i$, $TSE(A_i)$, is

$$TSE(A_i) = M - X + 1. \tag{1}$$

The next step is to increase $X$ by one and restart the extraction from the beginning of $A_1$; the first substring is then $(a_1 a_2 \ldots a_{X+1})$. Extraction continues as long as the condition $X < M$ is satisfied. The same procedure is applied to the remaining instances of the polymorphic worm $A$ (i.e., $A_2, A_3, \ldots, A_n$). Figure 1 and Table 1 show all the substrings extracted from the string ZYXCBA, assuming the minimum length $X$ is 3.

For example, assume a polymorphic worm $A$ has $n + 1$ instances, $A = \{A_0, A_1, \ldots, A_n\}$, where $A_0$ is the instance from which substrings are extracted. If nine substrings are extracted from $A_0$, each is denoted $S_i$, $i = 1, \ldots, 9$.

Fig. 1 Substring extraction
Table 1 Substring extraction

Substring no.    Length X    Substring
S1.1             3           ZYX
S1.2             3           YXC
S1.3             3           XCB
S1.4             3           CBA
S1.5             4           ZYXC
S1.6             4           YXCB
S1.7             4           XCBA
S1.8             5           ZYXCB
S1.9             5           YXCBA
$$A_0 = (S_1, S_2, \ldots, S_9) \tag{2}$$
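The extraction procedure and the count in Eq. (1) can be sketched in a few lines of Python. This is a minimal sketch, assuming the instance is available as a string (or byte string); the function name `extract_substrings` is hypothetical.

```python
def extract_substrings(instance, min_len):
    """Return every contiguous substring of `instance` whose length X
    satisfies min_len <= X < len(instance), shortest lengths first."""
    m = len(instance)
    substrings = []
    for x in range(min_len, m):             # extraction continues while X < M
        for start in range(m - x + 1):       # M - X + 1 windows of length X, Eq. (1)
            substrings.append(instance[start:start + x])
    return substrings


subs = extract_substrings("ZYXCBA", 3)
print(len(subs))   # 9
print(subs)
# ['ZYX', 'YXC', 'XCB', 'CBA', 'ZYXC', 'YXCB', 'XCBA', 'ZYXCB', 'YXCBA']
```

For the string ZYXCBA with a minimum length of 3, the sketch yields 4 + 3 + 2 = 9 substrings in the same order as Table 1.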
Next, we count the number of occurrences of each substring $S_i$ (the substrings of $A_0$) in each of the remaining instances $(A_1, \ldots, A_n)$. We then apply principal component analysis (PCA) to these frequency counts to reduce their dimension and obtain the most significant data.
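4.1 Determining the Most Significant Data Using PCA

Let $F_i = (f_{i1}, \ldots, f_{iN})$ denote the vector of frequency counts of the substring $S_i$ in the instances $(A_1, \ldots, A_N)$, for $i = 1, \ldots, L$, where $N$ is the number of remaining instances (written $n$ above) and $L$ is the number of substrings. We form the frequency matrix $F$ by letting $F_i$ be the $i$th row of $F$ [9], as in Eq. (3):

$$F = \begin{pmatrix} f_{11} & \cdots & f_{1N} \\ \vdots & \ddots & \vdots \\ f_{L1} & \cdots & f_{LN} \end{pmatrix} \tag{3}$$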
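Building the frequency matrix of Eq. (3) amounts to counting each substring in each remaining instance. The following is a minimal sketch; the function names and the sample strings are hypothetical, and counting overlapping occurrences is an assumption, since the text does not say how overlaps are treated.

```python
def count_occurrences(substring, text):
    """Count occurrences of `substring` in `text`, allowing overlaps (assumption)."""
    count = 0
    start = text.find(substring)
    while start != -1:
        count += 1
        start = text.find(substring, start + 1)
    return count


def frequency_matrix(substrings, instances):
    """Row i holds the counts of substring S_i in every instance (one column each)."""
    return [[count_occurrences(s, inst) for inst in instances] for s in substrings]


# Hypothetical toy data: three substrings, three remaining instances.
substrings = ["ZYX", "YXC", "CBA"]
instances = ["ZYXZYXQQ", "QQYXCBA", "CBAZYXCBA"]

F = frequency_matrix(substrings, instances)
for row in F:
    print(row)
# [2, 0, 1]
# [0, 1, 1]
# [0, 1, 2]
```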
4.2 Normalization of Data

Each row of the matrix $F$ is normalized, yielding a matrix $D$ of size $L \times N$, as given in Eqs. (4) and (5):

$$D = \begin{pmatrix} d_{11} & \cdots & d_{1N} \\ \vdots & \ddots & \vdots \\ d_{L1} & \cdots & d_{LN} \end{pmatrix} \tag{4}$$

$$d_{ik} \leftarrow \frac{f_{ik}}{\sum_{j=1}^{N} f_{ij}} \tag{5}$$
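The row normalization of Eq. (5) can be sketched as follows. The function name is hypothetical, and leaving an all-zero row unchanged is an assumption made here to avoid division by zero; the text does not cover that case.

```python
def normalize_rows(F):
    """Divide every entry of a row by the row sum, as in Eq. (5)."""
    D = []
    for row in F:
        total = sum(row)
        # Assumption: a substring that never occurs keeps an all-zero row.
        D.append([f / total if total else 0.0 for f in row])
    return D


F = [[2, 0, 2],
     [1, 1, 2],
     [0, 0, 0]]

D = normalize_rows(F)
for row in D:
    print(row)
# [0.5, 0.0, 0.5]
# [0.25, 0.25, 0.5]
# [0.0, 0.0, 0.0]
```

After normalization every non-zero row sums to 1, so substrings with very different absolute counts become comparable.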
4.3 Mean Adjusted Data

To obtain data adjusted around a zero mean, the formula in Eq. (6) is applied, which gives the matrix in Eq. (7):

$$g_{ik} \leftarrow d_{ik} - \bar{d}_i \quad \forall i, k, \qquad \text{where } \bar{d}_i = \text{mean of the } i\text{th vector} = \frac{1}{N} \sum_{j=1}^{N} d_{ij} \tag{6}$$

The data-adjusted matrix $G$ is given by:

$$G = \begin{pmatrix} g_{11} & \cdots & g_{1N} \\ \vdots & \ddots & \vdots \\ g_{L1} & \cdots & g_{LN} \end{pmatrix} \tag{7}$$
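As a small worked illustration of Eq. (6), the sketch below subtracts each row mean from the entries of that row. The function name and the sample values are hypothetical.

```python
def mean_adjust(D):
    """Subtract the mean of each row from its entries, as in Eq. (6)."""
    G = []
    for row in D:
        mean = sum(row) / len(row)
        G.append([d - mean for d in row])
    return G


# Hypothetical normalized rows (each sums to 1).
D = [[0.5, 0.25, 0.25, 0.0],
     [0.25, 0.25, 0.25, 0.25]]

G = mean_adjust(D)
for row in G:
    print(row)
# [0.25, 0.0, 0.0, -0.25]
# [0.0, 0.0, 0.0, 0.0]
```

Each row of the result sums to zero. A substring that occurs equally often in every instance, like the second row, becomes all zeros and contributes no variance.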
Covariance matrix evaluation: let $g_i$ denote the $i$th row of $G$. The covariance between any two rows $g_i$ and $g_j$ is given by Eq. (8):

$$\operatorname{Cov}(g_i, g_j) = C_{ij} = \frac{\sum_{k=1}^{N} (d_{ik} - \bar{d}_i)(d_{jk} - \bar{d}_j)}{N - 1} \tag{8}$$

Thus, the covariance matrix $C$, which has one row and one column for each of the $L$ rows of $G$ (size $L \times L$), is given by Eq. (9):

$$C = \begin{pmatrix} C_{11} & \cdots & C_{1L} \\ \vdots & \ddots & \vdots \\ C_{L1} & \cdots & C_{LL} \end{pmatrix} \tag{9}$$

Eigenvalue evaluation: the eigenvalues of the matrix $C$ are obtained by solving

$$|C - \lambda I| = 0,$$

and the corresponding eigenvectors are then calculated [10].
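The covariance and eigenvalue steps can be sketched in plain Python. This is a sketch under stated assumptions: the covariance is taken between the rows of $G$, so the result has one row and column per substring, and the largest eigenvalue is found by power iteration, a standard numerical method that is swapped in here in place of solving the characteristic equation directly. All function names and the sample matrix are hypothetical.

```python
import math


def covariance_matrix(G):
    """C[i][j] = sum_k G[i][k] * G[j][k] / (N - 1) for zero-mean rows of G."""
    n = len(G[0])  # number of columns (instances), assumed to be > 1
    return [[sum(a * b for a, b in zip(gi, gj)) / (n - 1) for gj in G]
            for gi in G]


def mat_vec(C, v):
    """Matrix-vector product C v."""
    return [sum(c * x for c, x in zip(row, v)) for row in C]


def dominant_eigenpair(C, iterations=200):
    """Power iteration for the largest eigenvalue of a symmetric matrix C.
    The start vector is arbitrary; the method fails only if that vector is
    orthogonal to the dominant eigenvector."""
    v = [1.0 / (i + 1) for i in range(len(C))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:          # C v = 0, nothing left to iterate on
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(C, v)))  # Rayleigh quotient
    return eigenvalue, v


# Hypothetical mean-adjusted data: two substrings, three instances.
G = [[1.0, 0.0, -1.0],
     [2.0, 0.0, -2.0]]

C = covariance_matrix(G)
lam, V = dominant_eigenpair(C)
print(C)              # [[1.0, 2.0], [2.0, 4.0]]
print(round(lam, 6))  # 5.0
```

For this toy matrix the eigenvalues are 5 and 0, and the dominant eigenvector is proportional to (1, 2).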
4.4 Evaluation of the Principal Component

Let $\lambda$ be the largest eigenvalue of the covariance matrix $C$ and let $V$ be the corresponding eigenvector, that is, the principal component. The data are then projected onto the principal component:

$$\text{Feature Descriptor} = V^{T} \times G.$$
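The projection onto the principal component is a single vector-matrix product. The sketch below assumes that $V$ has one entry per row of $G$; the function name and the numeric values are hypothetical.

```python
def project(V, G):
    """Return V^T G: one value per column (instance) of G."""
    n_cols = len(G[0])
    return [sum(V[i] * G[i][k] for i in range(len(G))) for k in range(n_cols)]


# Hypothetical unit eigenvector and mean-adjusted data.
V = [0.6, 0.8]
G = [[1.0, 0.0, -1.0],
     [2.0, 0.0, -2.0]]

descriptor = project(V, G)
print(descriptor)
```

The result has one entry per instance, so the $L$ substring rows of $G$ are reduced to a single row, which is the dimension reduction that PCA is used for here.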
5 Experimental Outcome

We conducted several tests to demonstrate the effectiveness of the proposed signature generation algorithms, namely substring extraction and principal component analysis (PCA), for polymorphic worms. The tests use 300 instances of the Blaster worm. The implementation is written in MATLAB and runs on a computer with an Intel Pentium 4 3.20 GHz CPU and 8.00 GB of RAM. For the tests, worm variants were generated synthetically on the basis of the polymorphism techniques discussed in Sect. 3.

Figure 2 shows the worm payload before the reduction procedure. The X-axis shows the number of worm instances used to generate the signature, and the Y-axis shows the frequency counts of the extracted substrings. Before reduction, the dimension of the worm payload is at its maximum, so the worm signature is difficult to identify at that dimension. Figure 3 shows the most significant data (the worm signature) obtained by reducing the worm payload with PCA; the dimension of the worm payload is reduced considerably. Figure 4 shows the worm detection rate for different traffic (normal traffic and worm instances). The X-axis corresponds to the different traffic and the Y-axis to the worm detection rate. The detection rate increases with the number of instances.
Fig. 2 Blaster_Worm_Substrings before reduction

Fig. 3 Blaster_Worm_Substrings after reduction
Figure 5 shows the percentages of false positives and false negatives for the worm. We obtain zero false positives, while the false negatives decrease as the number of worm instances increases.
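For readers who want to reproduce such percentages, the two error rates can be computed from labeled traffic as sketched below. The function name and the sample labels are hypothetical and are not the experimental data of this paper.

```python
def error_rates(is_worm, flagged):
    """Return (false positive rate, false negative rate).

    is_worm[i] -- True if sample i really is a worm instance
    flagged[i] -- True if the signature matched sample i
    """
    false_pos = sum(1 for truth, hit in zip(is_worm, flagged) if hit and not truth)
    false_neg = sum(1 for truth, hit in zip(is_worm, flagged) if truth and not hit)
    benign = sum(1 for truth in is_worm if not truth)
    worms = sum(1 for truth in is_worm if truth)
    fp_rate = false_pos / benign if benign else 0.0
    fn_rate = false_neg / worms if worms else 0.0
    return fp_rate, fn_rate


# Hypothetical labels: four worm instances, two benign samples.
is_worm = [True, True, True, True, False, False]
flagged = [True, True, True, False, False, False]

fp_rate, fn_rate = error_rates(is_worm, flagged)
print(fp_rate, fn_rate)   # 0.0 0.25
```

In this toy case no benign sample is flagged, so the false positive rate is zero, while one of the four worm instances is missed.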
Fig. 4 Blaster_Worm_Detection rate

Fig. 5 Blaster Worm false positives and false negatives

6 Conclusion

In this paper, we make the following contributions. First, we analyze worm polymorphism. Second, we propose an automated signature generation process for polymorphic worms. The system is based on principal component analysis, which determines the most significant data shared among all instances of a polymorphic worm and uses them as signatures. The results show that PCA successfully identified the polymorphic worms with zero false positives and low false negatives. The primary goals of this work are to reduce false alarm rates and to produce high-quality signatures for polymorphic worms.
References

1. Spitzner, L.: Honeypots: Tracking Hackers. Addison Wesley Pearson Education, Boston (2002)
2. Bidgoli, H.: Handbook of Information Security. Wiley, Hoboken, New Jersey
3. Kreibich, C., Crowcroft, J.: Honeycomb–creating intrusion detection signatures using honeypots. In: Workshop on Hot Topics in Networks (Hotnets-II), Cambridge, Massachusetts, Nov 2003
4. Kim, H.-A., Karp, B.: Autograph: toward automated, distributed worm signature detection. In: Proceedings of the 13th USENIX Security Symposium, San Diego, CA, Aug 2004
5. Singh, S., Estan, C., Varghese, G., Savage, S.: Automated worm fingerprinting. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), Dec 2004
6. Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge (1997)
7. Levine, J., La Bella, R., Owen, H., Contis, D., Culver, B.: The use of honeynets to detect exploited systems across large enterprise networks. In: Proceedings of 2003 IEEE Workshops on Information Assurance, New York, pp. 92–99, June 2003
8. Newsome, J., Karp, B., Song, D.: Polygraph: automatically generating signatures for polymorphic worms. In: Proceedings of the 2005 IEEE Symposium on Security and Privacy, pp. 226–241, May 2005
9. Tang, Y., Chen, S.: An automated signature-based approach against polymorphic internet worms. IEEE Trans. Parallel Distrib. Syst., 879–892 (2007)
10. Mohssen, M.Z., Mohammed, E., Anthony Chan, H., Ventura, N.: Honeycyber: automated signature generation for zero-day polymorphic worms. In: Proceedings of the IEEE Military Communications Conference, MILCOM (2008)
11. Yegneswaran, V., Giffin, J., Barford, P., Jha, S.: An architecture for generating semantics-aware signatures. In: Proceedings of the 14th Conference on USENIX Security Symposium (2005)
12. Li, Z., Sanghi, M., Chen, Y., Kao, M.-Y., Chavez, B.: Hamsa: fast signature generation for zero-day polymorphic worms with provable attack resilience. In: Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, May 2006
13. Cavallaro, L., Lanzi, A., Mayer, L., Monga, M.: LISABETH: automated content-based signature generator for zeroday polymorphic worms. In: Proceedings of the Fourth International Workshop on Software Engineering for Secure Systems, Leipzig, Germany, May 2008
14. Nazario, J.: Defense and Detection Strategies against Internet Worms. Artech House Publishers (2003)
15. Mohssen, M.Z., Mohammed, E., Anthony Chan, H., Ventura, N., Hashim, M., Amin, I.: A modified Knuth-Morris-Pratt algorithm for zeroday polymorphic worms detection. In: Proceedings of the 2009 International Conference on Security and Management (SAM'09), Las Vegas, USA, 13–16 July 2009
A Bengali Text Summarization Using Encoder-Decoder Based on Social Media Dataset Fatema Akter Fouzia, Minhajul Abedin Rahat, Md. Tahmid Alie - Al - Mahdi, Abu Kaisar Mohammad Masum, Sheikh Abujar, and Syed Akhter Hossain
Abstract Text summarization is a technique for compressing a long document into a shorter version that keeps the main points of the original text. The value of summarization has grown with the large number of long posts published today: reading an entire document to obtain a good summary costs time and effort. An automated text summarization system built with machine learning and natural language processing can solve this problem. Our proposed system automatically produces an abstractive summary of a long text in a short time. The whole analysis was carried out on Bengali text. In our model, we use a sequence-to-sequence RNN with LSTM cells in the encoding layer. The architecture consists of an RNN encoder and decoder: the encoder takes the text document as input, and the decoder generates a short summary as output. The system contributes in two respects, namely summarization and establishing benchmark performance with low training loss. To train the model, we use our own dataset, collected from various online media, articles, Facebook, and some people's personal posts. The main challenges we faced were Bengali text processing, limited text length, and a lack of sufficient resources for collecting text.

F. A. Fouzia (corresponding author) · M. A. Rahat · Md. T. Alie - Al - Mahdi · A. K. M. Masum · S. Abujar · S. A. Hossain
Department of Computer Science and Engineering, Daffodil International University, Dhaka 1212, Bangladesh

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_51
Keywords Abstractive · Summarization · Bidirectional RNN · Word2vec · Decoder-encoder
1 Introduction Large amounts of text are now published digitally, which creates a need for systems that shorten lengthy texts quickly while preserving the main idea [1]. Generally, there are two types of text summarization techniques, distinguished by their output: extractive and abstractive. Extractive summaries are made by selecting sentences from the original document and copying them directly. An abstractive summary describes the contents of the document in different words, which may not appear in the original text, so the method generates an overview of its own. Text mining has become an attractive research field because of the enormous amount of text available on web media [2]. In recent years, many techniques have been developed for summarizing text and have been applied in various domains. NLP offers several text reduction techniques that help condense a text to its key content. In this paper, we introduce a technique for automatic text summarization that follows an abstractive method and produces short sentences. We used part of the dataset to train the model. To create our dataset, we collected data from various online media, articles, Facebook, and personal posts of some people. Collecting a large amount of real data that would be useful to people was difficult. Producing a proper abstractive summary is hard, so we follow several additional steps to obtain a better summary. Preprocessing the whole dataset leads to better results; it covers cleaning the data, counting missing words, adding special tokens for the encoder-decoder, word2vec embedding, and LSTM for the input and output text. We describe all these techniques in the preprocessing part of our methodology section.
We have implemented sequence-to-sequence models with a bidirectional RNN and LSTM to produce a summary. In the encoder-decoder, the encoder encodes the original text into a vector of fixed length, and the decoder generates a short output from it. Recently, RNNs have shown encouraging results in machine translation. Summaries extract the important information from the original text to produce a condensed version for the individual user [3]. To summarize text, most people choose extractive summarization because sentences are easy to select. Although it yields grammatically correct text and sometimes a satisfactory summary, this method can fail to deliver a good summary for long and complex texts. For these reasons, we propose an abstractive method for text summarization. Our central objective is to minimize the total loss and improve efficiency in order to build an expressive summary.
2 Literature Review Most people want to make decisions from reviews in the least possible time, which is not very effective. In one line of NLP work, a POS tagging method based on the bee colony algorithm [4] was applied to Arabic text. Several techniques that address text summarization have been proposed, each with its own approach. Many papers cover summarization of English, Arabic, and other languages, but very few address Bengali summarization, which is a major motivation for our research. Bengali abstractive text summarization [5] used sequence-to-sequence RNNs with LSTM and reduced the total loss of the method, but its main drawback is a small dataset. We try to address this problem in our paper. The neural responding machine (NRM) for short-text conversation [6] is based on encoding and decoding with RNNs and is trained on large amounts of data. Neural machine translation builds a neural model using an encoder-decoder with fixed-length vectors [7]. Bengali text summarization using Word2Vector [8] has been studied for summarizing Bengali text; limited vocabulary, an unavailable dataset, and the structure of Bengali text are its shortcomings. A bidirectional RNN [9] was proposed for English text summarization. The authors faced challenges such as processing the text, sizing the vocabulary, counting words, and reducing the loss value. The main target of that paper was to increase workability and reduce the total loss to compose a better summary. Other methods have also been used for text summarization. Sentence reduction for automatic text summarization [10] offers a sentence reduction mechanism that eliminates irrelevant phrases from original documents using resources such as a corpus, a lexicon, the WordNet lexical database, and a syntactic parser.
RASG [11] performs reader-aware abstractive summarization: it uses reader comments to construct a better summary of the main document. However, comments are informal and noisy, and documents and reader comments are difficult to model jointly. To address these challenges, the authors design RASG around four components, and they plan to release their large-scale dataset for further research. SUMMARIST [12] aims to be a robust automated text summarization system that provides both extractive and abstractive summaries for arbitrary English and other-language text. It combines robust NLP processing based on the equation “summarization = topic identification + interpretation + generation.” Each of these phases contains several independent modules, but the system is still under development. Text summarization is gaining recognition as a research topic. Our motivation is to build an abstractive text summarizer using our proposed method and to maintain a noticeable efficiency. In this paper, we discuss the implementation choices we made to improve quality and minimize the total loss when training the model.
3 Methodology This section describes how we performed Bangla-to-Bangla abstractive text summarization. We generate summaries with a bidirectional RNN and LSTM. Figure 1 shows the categories of text summarization, which are divided mainly by input type and output type. The input text is processed, embedded, and then tokenized. We used the CPU version of TensorFlow for model building and training. Figure 2 gives a graphical view of the workflow we followed for abstractive text summarization.
3.1 Data Collection We collected 1000 Bangla texts from articles, Facebook posts, and blogs, and we added a summary for each one. A better dataset produces better summaries, so we aimed for a large Bangla dataset. The dataset contains three columns: post type, text, and summary.
3.2 Data Preprocessing Data preprocessing is one of the major tasks before building a model, and it takes several steps. Preprocessing Bangla data is quite challenging. First, we removed unwanted words and extra spaces. Second, we handled contractions: a contraction is a shortened form of a word, and the embedding needs its full form (Table 1). We also removed lexical words and Bengali digits, which are unnecessary for our project. Finally, we lemmatized the words, which merges words that carry the same meaning. After all these steps, the text is clean (Fig. 3).
Fig. 1 Category of text summarization
Fig. 2 Workflow for summarization
Table 1 Contraction list (columns: Short form, Full form)
Fig. 3 Sequence of the preprocessing steps
3.3 Problem Contention A large dataset contains a large number of words, so we built a large vocabulary in which every word is mapped. A summary has far fewer words than the input text. The summary sequence is generated from the words available in the vocabulary.
3.4 Vocabulary Counting and Word Embedding The model has its own vocabulary set, and counting the vocabulary is essential for finding similarities between the text description and the output summary. We therefore count the vocabulary and then the number of occurrences of each word; for example, one word in our dataset occurs 143 times. We used the pre-trained “bn_w2v_model” as the vector file.
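The counting step described above can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokenization of already cleaned text; the function name `count_vocabulary` and the toy sentences are hypothetical and are not taken from our dataset or code.

```python
from collections import Counter

def count_vocabulary(texts):
    """Count how often each whitespace-separated token occurs in the texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

# Hypothetical toy corpus; the real dataset holds Bangla posts and summaries.
texts = ["আমি ভাত খাই", "আমি বই পড়ি", "তুমি ভাত খাও"]
counts = count_vocabulary(texts)
vocab_size = len(counts)   # number of distinct words in the corpus
```

The resulting counts give both the vocabulary size and the per-word occurrence numbers, which can then be matched against the entries of the pre-trained embedding file.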
3.5 Model Our model combines ideas from LSTM and machine translation. Long Short-Term Memory (LSTM) is a deep learning architecture, which we applied to text modeling. Machine translation renders text from one language into another; it pairs sequences across languages and produces a text sequence that preserves the original meaning. A translator contains an encoder and a decoder: the encoder takes fixed-length text as input, and the decoder produces the output sequence.
1. Neural machine translation: This is a method of translating one language into another with an encoder and a decoder. The encoder receives the input text, and the decoder predicts the summary, i.e., the output sequence.
2. Chain to chain (sequence-to-sequence) model: Long short-term memory is a recurrent neural network architecture used in deep learning. The long-term memory part refers to the learned weights, and the short-term memory part refers to the gated cell state. The chain to chain model has an encoder and a decoder, each built from LSTM cells. Word embedding is one of the major parts of our text summarization. The word embedding file maps words to numeric vectors; we loaded this file and then computed the total vocabulary size to use as the model input (Fig. 4). There are two types of tokenization: sentence tokens and word tokens. We used sentence tokens. Special tokens such as <UNK>, <PAD>, <EOS>, and <GO> are used, each with its own job. UNK (unknown) replaces words that are missing from the limited vocabulary. PAD pads the token sequences so that every sequence in a batch has the same length. EOS (end of sentence) marks the end of a sentence and signals it to the encoder. GO is fed to the decoder to start generating the output tokens. These tokens have a large impact on our text summarization. UNK is applied in the preprocessing stage to replace out-of-vocabulary words. EOS and GO are added before training so that the word ids are ready for chain to chain translation. Here x is the input to the encoder, and y is the output of the decoder.
Fig. 4 Chain to chain model
3. Bidirectional RNN encoder-decoder: In an encoder-decoder, the encoder maps the source sequence to a fixed-length vector, and the decoder maps that vector representation to the target sequence [13]. We applied a bidirectional RNN encoder-decoder, which we describe briefly here. The RNN encoder-decoder performs the task of transforming Bengali text into a Bengali summary. Each word in the input sequence is denoted $x_t$, where $t$ is its position in the sequence, and the hidden state is obtained by applying the appropriate weights to the previous hidden state $h_{t-1}$ and the input vector $x_t$.
The hidden states $h_t$ are estimated using this formula:

$$h_t = f\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right) \tag{1}$$

In the model architecture, the encoder encodes the input $x = (x_1, \ldots, x_{T_x})$ into a constant vector $c$. At each time step $t$, the RNN is updated by

$$h_t = f(x_t, h_{t-1}) \tag{2}$$

and

$$c = q(\{h_1, \ldots, h_{T_x}\}) \tag{3}$$

where $c$ is the hidden part and $f$ and $q$ are nonlinear functions. In particular, we can estimate the translation probability for the decoder, given the input sequence, as

$$p(y) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) \tag{4}$$

where $y = (y_1, \ldots, y_{T_y})$. We can find a new sequence by iteratively sampling a symbol at each time step. From a probabilistic point of view, this is a conditional distribution, e.g., $P(y_1, \ldots, y_{T_y} \mid x_1, \ldots, x_{T_x})$. Hence, the decoder hidden state is calculated by

$$h_t = f(h_{t-1}, y_{t-1}, c) \tag{5}$$

and every conditional probability is expressed as (Fig. 5)

$$p(y_t \mid \{y_1, \ldots, y_{t-1}\}, C) = g(y_{t-1}, s_t, C) \tag{6}$$

Here, the conditional probability depends on a distinct context vector $C_i$. $C_i$ is then calculated as a weighted sum

$$C_i = \sum_{j=1}^{T_x} a_{ij} h_j \tag{7}$$

The weight $a_{ij}$ of each annotation $h_j$ is calculated by

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \tag{8}$$

where

$$e_{ij} = a(s_{i-1}, h_j) \tag{9}$$

Fig. 5 RNN encoder-decoder
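The attention computation in Eqs. (7)-(9) can be illustrated with a short pure-Python sketch. It assumes a plain dot product as the alignment model $a(\cdot)$, which is a stand-in chosen for illustration and not necessarily the scoring function used in our model; all function names and the toy vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_weights(scores):
    """Eq. (8): softmax over the alignment scores e_i1 .. e_iTx."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(prev_state, annotations):
    """Eqs. (7) and (9): weighted sum of the encoder annotations h_j."""
    # Assumed alignment model: e_ij = s_{i-1} . h_j (illustrative stand-in for a(.)).
    scores = [dot(prev_state, h) for h in annotations]
    weights = attention_weights(scores)
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]
    return weights, context

annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder states h_1..h_3
weights, context = context_vector([0.0, 0.0], annotations)
```

With an all-zero decoder state every score is equal, so the weights are uniform and the context vector is simply the mean of the annotations; a non-zero state shifts the weights toward the annotations it aligns with.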
As we recommended, we use a bidirectional RNN for Bengali text summarization. BRNNs are composed of forward and backward RNNs. The forward RNN reads the input sequence from $x_1$ to $x_{T_x}$ and computes the forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_{T_x})$, and the backward RNN reads the sequence in reverse, giving $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_{T_x})$. So, the final equation is

$$h_j = \left[\overrightarrow{h}_j^{\top};\ \overleftarrow{h}_j^{\top}\right]^{\top} \tag{10}$$
where $h_j$ is the annotation used to predict the summary of the input sequence. Experimental output: Our target was to establish a dedicated Bangla-to-Bangla text summarization model, and we consider the TensorFlow sequence-to-sequence model the most suitable for this work. We used 1000 samples. After preprocessing, we split the dataset into two parts: 80 percent (800 samples) for training and 20 percent (200 samples) for testing. Once training stops, the machine is able to create its own summaries. To make a summary, we first take an input text from the dataset; the summary length is chosen randomly. We used an attention-based encoder, and the Adam optimizer to adapt the learning rate of each parameter. The training parameters are listed in Table 2. We trained the model on our dataset for a few hours. Most of the time the machine gives a good output; occasionally it gives a poor output, which is negligible. The good outputs, however, look quite different (Table 3).
Table 2 Value of the parameter
Parameter          Value
Epoch              150
Keep probability   0.70
Run size           256
Learning rate      0.001
Batch size         32
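The 80/20 split of the 1000 samples can be sketched as follows. Shuffling before the split and the fixed seed are assumptions made for this illustration, since the text only states the proportions; the function name and the placeholder data are hypothetical.

```python
import random

def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the samples and split it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # assumed: shuffle before splitting
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the 1000 (text, summary) pairs.
dataset = [(f"text {i}", f"summary {i}") for i in range(1000)]
train, test = train_test_split(dataset)
```

Keeping the seed fixed makes the split reproducible, so that the same 200 test samples are held out across training runs.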
Table 3 Sample response
4 Conclusion and Future Work Text summarization extracts the significant facts from a source to produce a short version for users. Summarization is valuable because the amount of online information is growing exponentially. Summaries shrink the time needed for reading and make the selection process easier. Much work has been done on text summarization, but we concentrated specifically on shortening Bengali text into Bengali summaries. Although there have been a few works on Bangla text summarization, we tried to overcome the obstacles they faced. We used bidirectional RNNs with LSTM to build our model. Some errors remain, as no machine works with 100 percent efficiency, but our model produced accurate summaries with a reduced training loss. We encountered limitations such as limited text length and insufficient resources for summaries, but in the end the model gave understandable, fluent, and efficient summaries. We will try to make the machine more efficient so that it adapts to our mother tongue and provides good-quality summaries. Acknowledgements We are grateful to the NLP laboratory of Daffodil International University, where we received all kinds of facilities for our work. We are also grateful to our honorable department head and our respected supervisor, who helped us overcome the obstacles we faced in our work.
References
1. Abualigah, L., Bashabsheh, M.Q., Alabool, H., Shehab, M.: Text summarization: a brief review. In: Abd Elaziz, M., Al-qaness, M., Ewees, A., Dahou, A. (eds.) Recent Advances in NLP: The Case of Arabic Language. Studies in Computational Intelligence, vol. 874. Springer, Cham (2020)
2. Qaroush, A., et al.: An efficient single document Arabic text summarization using a combination of statistical and semantic features. J. King Saud Univ. Comput. Inf. Sci. (2019)
3. Padmakumar, A., Saran, A.: Unsupervised text summarization using sentence embeddings. Technical Report, University of Texas at Austin (2016)
4. Alhasan, A., Al-Taani, A.T.: POS tagging for Arabic text using bee colony algorithm. Procedia Comput. Sci. 142, 158–165 (2018)
5. Talukder, Md.A.I., et al.: Bengali abstractive text summarization using sequence to sequence RNNs. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE (2019)
6. Shang, L., Lu, Z., Li, H.: Neural responding machine for short-text conversation. Association for Computational Linguistics (ACL) (2015)
7. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representation (ICLR), 19 May 2014
8. Abujar, S., et al.: An approach for Bengali text summarization using Word2Vector. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE (2019)
9. Masum, A.K.M., et al.: Abstractive method of text summarization with sequence to sequence RNNs. In: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE (2019)
10. Jing, H.: Sentence reduction for automatic text summarization. In: Sixth Applied Natural Language Processing Conference (2000)
11. Gao, S., et al.: Abstractive text summarization by incorporating reader comments. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33 (2019)
12. Hovy, E., Lin, C.-Y.: Automated text summarization in SUMMARIST. In: Advances in Automatic Text Summarization, 14 p (1999)
13. Cho, K., et al.: Learning phrase representations using RNN encoder decoder. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
Statistical Genomic Analysis of the SARS-CoV, MERS-CoV and SARS-CoV-2 Mayank Sharma
Abstract This paper presents a statistical genomic analysis of the three major coronaviruses that have caused outbreaks in history, the SARS-CoV, MERS-CoV and SARS-CoV-2. Fasta files of the isolated genomes of these viruses were used to perform a global pairwise alignment in Biopython in order to establish sequence identity and the possibility of similar origins. These were then translated into functional proteins and tested for statistically significant differences in aromaticities, instability indices and isoelectric points using ANOVAs and Tukey-HSD post-hoc tests. This research will help future researchers in understanding the characteristics of the SARS-CoV-2 (the new virus strain that has caused the COVID-19 pandemic) and the properties that distinguish it from the MERS-CoV and SARS-CoV. Keywords Coronavirus · SARS-CoV · SARS-CoV-2 · MERS-CoV · Genomic analysis · Statistics · Inferential statistics
1 Introduction The world is currently dealing with one of the biggest disease outbreaks in recent history. The novel coronavirus outbreak originated in Wuhan, China around the end of 2019 and has spread to almost every country in the world at the time of writing this paper (mid-May 2020). It has been declared a pandemic by the World Health Organisation (WHO) and has caused more than 290,000 fatalities among the 4.23 million (and growing) infected people across the world. The worst-hit countries include the USA, Italy, Spain, UK and Russia. The disease, called COVID-19, is caused by a virus strain called the SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus-2), which is the seventh coronavirus known to infect humans. The symptoms of the disease include fever, fatigue, cough, runny nose and, in severe cases, breathing difficulties. It spreads through human contact as well as through airborne droplets from coughs [1]. M. Sharma (B) Department of Electrical Engineering, Delhi Technological University, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_52
Table 1 Nucleotide composition of the three genomes
Genome        Number of nucleotides
SARS-CoV      29,751
MERS-CoV      30,119
SARS-CoV-2    29,903
Two similar coronavirus outbreaks have occurred in the past: the Middle East Respiratory Syndrome Coronavirus (MERS-CoV) in 2012 and the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) in 2002. The MERS-CoV originated in Saudi Arabia and later spread to 21+ countries, affecting more than 2000 people by 2017. Research suggests that the virus originated in bats and spread to camels and then to humans, although there is also evidence that several zoonotic events occurred before human-to-human transmission [2]. Similarly, the SARS-CoV started in Guangdong, China and affected more than 8000 people in 26 countries. Its symptoms were similar to those of COVID-19. Phylogenetic analysis has indicated that it possibly originated in bats too [3]. Recent research articles from scientists across the world have already started demystifying the new virus strain, its origin, genomic characteristics and stability in different environments [4, 5]. In this paper, the author wishes to contribute further to the study of its characteristics.
2 Collection of Data Genome sequence files (in FASTA format) of the SARS-CoV, MERS-CoV and SARS-CoV-2 were obtained from [6]. Using a Python library called Biopython, these FASTA files can be read and manipulated as strings over a four-letter alphabet (A, T, G and C), which represents the four nucleotide bases (adenine, thymine, guanine and cytosine) that make up DNA. The number of nucleotides present in each of the three genomes is listed in Table 1.
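Reading a FASTA record and counting its nucleotides can be sketched without any third-party dependency. The parser below is a simplified stand-in for what a library such as Biopython provides (for example through `SeqIO.read(path, "fasta")`) and handles a single record only; the function name and the toy record are hypothetical.

```python
from collections import Counter

def read_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0][1:]                  # drop the leading '>'
    sequence = "".join(lines[1:]).upper()  # sequence lines may be wrapped
    return header, sequence

# Hypothetical toy record; the real genomes are about 30,000 bases long.
record = ">toy_genome example\nAAATGCTT\nGGCA\n"
header, sequence = read_fasta(record)
length = len(sequence)             # corresponds to the counts in Table 1
composition = Counter(sequence)    # per-base counts of A, T, G and C
```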
3 Sequence Alignment Pairwise global sequence alignment was used to calculate the Percent Sequence Identity (PID) between the genomes. Global alignment is usually carried out through dynamic programming with the Needleman-Wunsch algorithm. The total alignment score used in this paper awards one point for each pair of identical characters, and no points are deducted for mismatches or gaps. The total alignment score is then divided by the length of the shorter of the two aligned sequences to obtain the PID. Other measures, such as the mean length of the sequences or the number of non-gap positions, may also be used in the denominator. The results of sequence alignment are presented in Table 2.
Table 2 Results of pairwise global sequence alignment
Pair aligned             Percent sequence identity (PID) (%)
SARS-CoV-2/SARS-CoV      83.338
SARS-CoV-2/MERS-CoV      69.391
It can be seen that the SARS-CoV-2 is somewhat similar in genetic sequence identity to the SARS-CoV and, to a lesser extent, to the MERS-CoV. This indicates the possibility of a similar origin of the coronaviruses and is supported by biological research that used next generation sequencing of bronchoalveolar lavage fluid samples extracted from affected patients [7]. In [7], researchers reported a PID of approximately 79% between the SARS-CoV-2 and SARS-CoV and a PID of about 50% between the SARS-CoV-2 and MERS-CoV. These values are broadly in line with the results obtained here, although the alignment in this paper gives higher identities, as can be expected from a scoring scheme without mismatch or gap penalties.
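The scoring scheme described in this section can be sketched as a small Needleman-Wunsch implementation. This is a pure-Python illustration under the stated scoring (match = 1, mismatch = 0, gap = 0) and is not the code used for the results in Table 2; the function names are hypothetical. Its running time grows with the product of the two sequence lengths, so an optimised library routine is preferable for genomes of about 30,000 bases.

```python
def alignment_score(a, b, match=1, mismatch=0, gap=0):
    """Needleman-Wunsch global alignment score, keeping only two rows in memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def percent_identity(a, b):
    """PID = alignment score / length of the shorter sequence, in percent."""
    return 100.0 * alignment_score(a, b) / min(len(a), len(b))

pid_close = percent_identity("AAATGCTT", "AAAGCTT")   # one base deleted
pid_far = percent_identity("ACGT", "TTTT")            # only one base can match
```

With zero penalties the score equals the length of the longest common subsequence, which is why a single deletion still yields a PID of 100% in the first toy pair.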
4 Extraction of Functional Proteins To further analyse the genomic sequences, functional proteins were extracted from them. This was done through the following steps:
4.1 Transcription of DNA into mRNA The DNA sequences present in the three genome files were transcribed into mRNA using Biopython’s transcribe() function. This means that the Thymine (T) bases present in DNA are replaced with Uracil (U). For example, a DNA sequence ‘AAATGCTT’ will be transcribed into ‘AAAUGCUU’.
4.2 Translation of mRNA into Proteins This is done with the help of Biopython’s translate() function. Since the SARS-CoV, MERS-CoV and SARS-CoV-2 are positive-sense single-stranded RNA viruses [(+)ssRNA], their genomes can be directly translated into proteins. During protein synthesis, a codon is a sequence of three nucleotides that corresponds to a specific amino acid or a stop signal. Using Biopython’s standard codon table (table 1), the mRNA sequence is translated into protein sequences separated by an asterisk (*), which indicates a stop codon and hence the start of a new protein. These protein sequences are composed of the 20 standard amino acids, denoted by their one-letter codes: Alanine (A), Arginine (R), Asparagine (N), Aspartic Acid (D), Cysteine (C), Glutamine (Q), Glutamic Acid (E), Glycine (G), Histidine (H), Isoleucine (I), Leucine (L), Lysine (K), Methionine (M), Phenylalanine (F), Proline (P), Serine (S), Threonine (T), Tryptophan (W), Tyrosine (Y) and Valine (V).
4.3 Selection of Functional Proteins Protein synthesis results in a number of proteins of varying lengths. Usually, shorter proteins have little biological significance. All proteins having lengths less than 20 amino acids are removed, as this is the length of the shortest functional protein to exist in nature, and thus, a total of 313 sequences are considered for analysis. Out of these, 81 belong to the SARS-CoV, 152 to the MERS-CoV and 80 to the SARS-CoV-2.
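The three steps of this section (transcription, translation and length filtering) can be sketched in pure Python. The codon table below is built from the standard genetic code; the function names and the toy sequence are hypothetical, and the sketch assumes an unambiguous A/C/G/T input read in a single frame, which is a simplification of what a library such as Biopython handles.

```python
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in UCAG order map onto AMINO_ACIDS.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna):
    """Replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """Translate codon by codon; '*' marks a stop codon, trailing bases are ignored."""
    usable = len(mrna) - len(mrna) % 3
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, usable, 3))

def functional_proteins(protein_string, min_length=20):
    """Split at stop codons and keep chains with at least min_length residues."""
    return [p for p in protein_string.split("*") if len(p) >= min_length]

# Toy sequence: one 21-residue chain followed by a 1-residue chain.
dna = "ATG" + "GCC" * 20 + "TAA" + "ATGTAA"
proteins = functional_proteins(translate(transcribe(dna)))
```

Only the first chain survives the filter, mirroring how sequences shorter than 20 amino acids are discarded before the analysis.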
5 Properties Extracted from Functional Proteins 5.1 Aromaticity Using the ProtParam module in Biopython, the values of aromaticity for each protein sequence were calculated according to the method cited in [8]. In [8], aromaticity is defined by (1) as the sum of the relative frequencies of occurrence of the three aromatic amino acids, namely Phenylalanine (F), Tyrosine (Y) and Tryptophan (W).

$$\text{Aromaticity} = \sum_{i=1}^{20} \delta_i f_i \tag{1}$$

where $f_i$ is the relative frequency of occurrence of an amino acid of kind $i$, and $\delta_i = 1$ when the amino acid is aromatic (F, Y or W) and 0 otherwise.
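Equation (1) reduces to the fraction of residues that are F, Y or W, which a short function can compute directly. This is a minimal sketch of the definition and not the ProtParam implementation itself; the function name is hypothetical.

```python
def aromaticity(protein):
    """Fraction of residues that are F, Y or W, as in Eq. (1)."""
    if not protein:
        return 0.0
    return sum(protein.count(aa) for aa in "FYW") / len(protein)

example = aromaticity("FYWA")   # three aromatic residues out of four
```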
5.2 Isoelectric Point The values of isoelectric points for the protein sequences were also calculated using the ProtParam module, which uses the Bjellqvist method [9]. The isoelectric point is the pH at which a protein carries no net electric charge. Charges in a protein arise because its amino acid constituents are zwitterionic in nature.
5.3 Instability Index Using the method explained in [10], the values of instability index for all protein sequences were calculated using the ProtParam module. It is useful in determining whether the protein will be stable in a test tube or not (a value less than 40 indicates stability).
6 Descriptive Analyses and Hypotheses Building Descriptive statistics such as the mean, standard deviation, interquartile range and minimum/maximum values of the aromaticities, isoelectric points and instability indices for the three coronaviruses are given in Table 3. In addition, the values of skewness and kurtosis for these properties are given in Table 4. Since almost all of these values lie within the range (−1, 1), the distributions can be treated as approximately normal. The three main properties are also visualised with distribution plots and bar graphs, shown in Fig. 1. The distribution plots represent the overall distribution of each property as a probability density, and the bar graphs represent the mean of each property, with the standard deviation drawn as error lines. A set of seven hypotheses was created from the plots. These were then tested for statistical significance using the principles of inferential statistics.
H1: Aromaticities of the SARS-CoV-2 sequences are higher than those of MERS-CoV.
H2: Aromaticities of the SARS-CoV-2 sequences are higher than those of SARS-CoV.
H3: Aromaticities of the MERS-CoV and SARS-CoV sequences are comparable.
H4: Isoelectric points of the MERS-CoV sequences are higher than those of SARS-CoV.
H5: Isoelectric points of the MERS-CoV sequences are higher than those of SARS-CoV-2.
H6: Isoelectric points of the SARS-CoV and SARS-CoV-2 sequences are comparable.
H7: Instability indices of the SARS-CoV, SARS-CoV-2 and MERS-CoV sequences are comparable.
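The skewness and kurtosis check described in this section can be sketched as follows. The sketch assumes population (biased) moment estimators and excess kurtosis; the paper does not state which estimator was used, so values computed this way may differ slightly from Table 4. The function names are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def roughly_normal(values, limit=1.0):
    """Rule of thumb from the text: both statistics lie within (-limit, limit)."""
    skew, kurt = skewness_and_kurtosis(values)
    return abs(skew) < limit and abs(kurt) < limit

skew, kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])
```

A symmetric sample has zero skewness, while its excess kurtosis depends on how heavy the tails are relative to a normal distribution.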
7 Inferential Analyses and Results Since the data is normally distributed, an Analysis of Variance (ANOVA) was performed on it. An ANOVA is a technique used to compare means among various classes/groups in a dataset for statistically significant differences. The values of
IQR
0.06
0.07
0.13
M ± SD
0.08 ± 0.04
0.1 ± 0.06
0.16 ± 0.08
MERS-CoV
SARS-CoV-2
Aromaticity
SARS-CoV
Proteins of
Min
0.0
0.0
0.0
Max
0.47
0.33
0.21
Min
9.0 ± 1.6 8.2 ± 1.9
1.85 2.88
3.7
3.9
3.9
Max
12.0
12.3
11.6
46.3 ± 24.7
43.2 ± 21.6
40.7 ± 20.7
M ± SD
IQR 3.27
M ± SD 8.4 ± 1.9
Instability index
Isoelectric point
Table 3 Descriptive statistics on the protein properties of the three coronaviruses IQR
33.44
27.81
23.35
Min
−6.02
−6.03
−6.61
124.3
102.3
94.59
Max
Table 4 Skewness and kurtosis of the protein properties of the three coronaviruses
Proteins of    Aromaticity            Isoelectric point      Instability index
               Skewness   Kurtosis    Skewness   Kurtosis    Skewness   Kurtosis
SARS-CoV       0.413      −0.259      −0.480     −0.650      0.389      0.145
MERS-CoV       1.072      1.419       −0.610     0.540       0.267      −0.151
SARS-CoV-2     0.580      0.617       −0.185     −0.701      0.490      0.280
Fig. 1 Bar graphs and distribution plots for the three properties of the SARS-CoV, MERS-CoV and SARS-CoV-2 protein sequences
aromaticities, isoelectric points and instability indices were tested for statistically significant differences among the three coronavirus types. The results of these ANOVAs are shown in Tables 5, 7 and 9, respectively. Table 5 indicates that the values of aromaticities differ very significantly across the three coronavirus types (F-statistic = 27.92, p-value = 7.07e−12). A Tukey-HSD post-hoc test was then performed to find out which of the three pairs of viruses differ significantly. The results of this test are shown in Table 6 and indicate that the sequences of SARS-CoV-2 have significantly higher aromaticities than those of SARS-CoV (mean difference = −0.0736, adjusted p-value = 0.001) and MERS-CoV (mean difference = −0.0604, adjusted p-value = 0.001). There is no significant difference between the MERS-CoV and SARS-CoV sequences (adjusted p-value = 0.341). Thus, we accept the hypotheses H1, H2 and H3. Table 7 also indicates a statistically significant difference in isoelectric points across the three types (F-statistic = 5.80, p-value = 0.003). Post-hoc testing results in Table 8 indicate that the MERS-CoV sequences have higher isoelectric points than those of SARS-CoV-2 (mean difference = 0.7958, adjusted p-value = 0.005), while there are no significant differences between SARS-CoV and SARS-CoV-2 or between SARS-CoV and MERS-CoV. Thus, we accept hypotheses H5 and H6 and reject hypothesis H4.
Table 5 ANOVA results across coronavirus types (aromaticity)
Property       Sum of squares   Degrees of freedom   F-statistic   p-value
Aromaticity    0.2606           2                    27.9233       7.074e−12
Table 6 Results of the Tukey-HSD post-hoc test for aromaticity

Group 1      Group 2    Mean difference   Adjusted p-value   Lower     Upper
SARS-CoV-2   MERS-CoV   −0.0604           0.001              −0.0826   −0.0382
SARS-CoV-2   SARS-CoV   −0.0736           0.001              −0.0989   −0.0482
MERS-CoV     SARS-CoV   −0.0132           0.341              −0.0353   0.009
Table 7 ANOVA results across coronavirus types (isoelectric point)

Property            Sum of squares   Degrees of freedom   F-statistic   p-value
Isoelectric point   39.085           2                    5.79805       0.003372
Table 8 Results of the Tukey-HSD post-hoc test for isoelectric point

Group 1      Group 2    Mean difference   Adjusted p-value   Lower     Upper
SARS-CoV-2   MERS-CoV   0.7958            0.0053             0.1986    1.3931
SARS-CoV-2   SARS-CoV   0.2082            0.7326             −0.4733   0.8897
MERS-CoV     SARS-CoV   −0.5876           0.0537             −1.1824   0.0072
Table 9 ANOVA results across coronavirus types (instability index)

Property            Sum of squares   Degrees of freedom   F-statistic   p-value
Instability index   1271.1214        2                    1.269122      0.282535
Table 9 indicates that there are no significant differences in instability indices across the three coronaviruses (p-value = 0.283, which is greater than 0.05, the conventional significance level). We accept the hypothesis H7.
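The F-statistics reported in the ANOVA tables follow from the between-group and within-group sums of squares. The sketch below is a minimal illustration of that computation in plain Python; the three groups are made-up toy numbers, not the protein property values of this study, and the function name `one_way_anova` is an assumption introduced here.

```python
def one_way_anova(groups):
    """Return (ss_between, df_between, ss_within, df_within, f_stat)
    for a one-way ANOVA over a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, df_between, ss_within, df_within, f_stat


# Toy data only (hypothetical): three groups of three observations each
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ss_b, df_b, ss_w, df_w, f_stat = one_way_anova(toy_groups)
print(ss_b, df_b, ss_w, df_w, f_stat)
```

With three virus types the between-group degrees of freedom are 2, as in Tables 5, 7 and 9. In practice a library routine such as `scipy.stats.f_oneway` also returns the p-value, and `statsmodels` offers `pairwise_tukeyhsd` for the post-hoc comparison; whether the author used these exact routines is not stated in this section.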
8 Conclusion

In this paper, a statistical analysis of the genome sequences of SARS-CoV, MERS-CoV and SARS-CoV-2 was performed. Using a global pairwise alignment technique, a Percent Sequence Identity (PID) of 83.3% between SARS-CoV-2 and SARS-CoV and 69.4% between SARS-CoV and MERS-CoV was established. These results are approximately similar to those of biological research conducted through next generation sample sequencing and indicate that the origin of the three coronaviruses is similar, suspected to be that of a bat. Furthermore, functional proteins were extracted from the three genome sequences through transcription and translation. The values of various properties of these proteins, namely aromaticity, instability index and isoelectric point, were computed, visualised and tested using inferential statistics. Results indicate that the proteins of SARS-CoV-2 have higher aromaticities than those of both SARS-CoV and MERS-CoV (they contain a higher proportion of the three aromatic amino acids F, Y and W). The sequences of MERS-CoV have higher isoelectric points than those of SARS-CoV-2, with no significant differences for the other two pairs. There are also no significant differences in instability indices across the three coronaviruses. The author hopes that this research will help researchers and scientists across the world to understand the structure and characteristics of the SARS-CoV-2 virus strain, especially in these times, when there is an urgent need for the development of a vaccine.
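As a reminder of what the aromaticity property measures, the following minimal sketch computes it as the relative frequency of the aromatic residues F, W and Y in a protein sequence. The function name and the example sequence are hypothetical and only illustrate the definition; they are not taken from the data used in this paper.

```python
AROMATIC_RESIDUES = frozenset("FWY")  # phenylalanine, tryptophan, tyrosine


def aromaticity(protein_sequence):
    """Fraction of residues in the sequence that are F, W or Y."""
    sequence = protein_sequence.upper()
    if not sequence:
        raise ValueError("protein sequence must not be empty")
    aromatic_count = sum(1 for residue in sequence
                         if residue in AROMATIC_RESIDUES)
    return aromatic_count / len(sequence)


# Hypothetical 8-residue peptide: F, Y and W are aromatic, so 3 of 8 residues
print(aromaticity("MFKYLWAA"))
```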
Analysing Hot Facebook Users Posts’ Sentiment Using Deep Learning
Nguyen Ngoc Tram and Phan Duy Hung
Abstract The explosion of social networks has created many new careers and new forms of entertainment. Money can now be earned by sitting in one place and streaming games; reviewing food, movies or music; or simply showing a pretty face. Social networks make all of those things, and more, possible. As long as a person is famous on social networks, they can earn money by doing almost anything. So how does one become famous or, in the language of social networks, what gets someone “followers” on social media? Is being positive, telling funny stories and showing pictures of sunshine and flowers enough? Or can they be a pessimistic person, ranting about everything, and still be worshipped like a god? This research collects and analyses the sentiment of hot Facebook users’ posts to see whether what someone posts on Facebook could make them famous, and also determines the accuracy of deep learning in analysing the sentiment of Vietnamese social media content. There have been several studies on sentiment analysis of social media content, but none on the Vietnamese social media content of famous Vietnamese people. In this study, a data set of Vietnamese hot Facebook users’ posts is labelled and organized to be shared with the language research community in general, and the Vietnamese language research community in particular.

Keywords Sentiment · Deep learning
N. Ngoc Tram · P. Duy Hung (B)
FPT University, Hanoi, Vietnam
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_53

1 Introduction

There are several popular social network platforms such as Facebook, Twitter, Instagram and YouTube; however, Facebook is the most common and widely used social network in Vietnam: nearly 60% of social network users in Vietnam are Facebook users [1]. Therefore, this paper chooses Facebook as the social network to collect data from. A list of hot Facebook users with different careers, ages, genders and fan pages is created so that the data set is not biased. The posts of these Facebook users over a 1-year period, from September 2018 to September 2019, are then collected.

Hot Facebook users’ posts, even if they are public, cannot be used without the owner’s consent. Facebook does support getting these posts, but the steps are: (i) create an application and (ii) make the Facebook user use it and agree to the term that it can read and use their posts (which is nearly impossible). So, in order to get these hot Facebook users’ posts, this study used a more traditional method: scraping. Using Ultimate Facebook Scraper by Haris Muneer, Hassaan Elahi and other contributors [2], plus some additions of our own, the needed data is collected and stored in our GitHub project [3].

The next step after collecting the data is determining its sentiment. There are only two labels for each Facebook post, Negative or Positive:
• Negative posts: contain negative emotions, shocking news or provocative statements, or raise controversial opinions.
• Positive posts: contain positive emotions, daily life routines, jokes/funny things or happy news. Advertisement posts are also considered positive posts (Table 1).

Table 1 Examples of positive and negative posts

Time: 12:11, 29/08/2019
Post: Càng đi và gặp gỡ, mới thấy những người hơn mình đều luôn ở phía trước và luôn đặt sự quan tâm vào tương lai và cộng đồng (English: The more I travel and meet people, the more I see that those who are better than me are always ahead and always put their attention on the future and the community)
Sentiment: Positive

Time: 21:12, 08/07/2019
Post: BỊP BỢM. Tôi chưa bao giờ được đơn vị này (ko biết là ai) mời và chắc chắn sẽ ko có mặt. Đề nghị cơ quan chức năng tại Bắc Ninh làm rõ để bảo vệ quyền lợi cho người dân (English: FRAUD. I have never been invited by this organisation (no idea who they are) and I will definitely not be there. I ask the authorities in Bắc Ninh to clarify this in order to protect people's rights)
Sentiment: Negative

Sentiment analysis has been researched at three levels: document level, sentence level and aspect level [4]. In this study, a Facebook post is considered a document. The question now is which technique to use for sentiment analysis. According to Liu et al. [5], sentiment analysis methods can be divided into three categories:
• Traditional sentiment analysis: lexicon-based approaches and non-neural network classifier approaches.
• Sentiment analysis based on deep learning: neural network classifier approaches.
• Sentiment analysis based on transfer learning: uses the similarity of data, data distribution, model, task, etc., to apply the knowledge already learned in one domain to a new domain.
In “Learning word vectors for sentiment analysis” [6], Andrew L. Maas et al. created the IMDb review sentiment data set and proposed a hybrid supervised–unsupervised model that reached an accuracy of 88.89%, using their full model with additional unlabeled reviews and bag of words vectors.
L. Rahman, N. Mohammed and A. Kalam Al Azad used a biologically inspired variation incorporated in LSTM to do sentiment analysis on the IMDb dataset [7]. The accuracy of both the traditional model (using LSTM only) and the proposed model was around 80%. On the other hand, A. Tripathy, A. Agrawal and S. Kumar Rath experimented with multiple methods and achieved their highest accuracy of 88.94% using a combined Unigram–Bigram–Trigram model [8]. Using a “deep CNN-LSTM with combined kernels from multiple branches for IMDb Review Sentiment Analysis” [9], Alec Yenter and Abhishek Verma reached an accuracy of 89.50% on the IMDb Review dataset. In “Universal Language Model Fine-tuning for Text Classification” [10], Jeremy Howard and Sebastian Ruder used AWD-LSTM and reached an impressive accuracy of 95.40%.

The above studies have shown that, at present, deep learning with LSTM is the most effective solution for sentiment analysis. However, these studies are all for the English language. This study evaluates and customizes a network architecture for Vietnamese language data. The main contribution of this paper is to provide a data set of Vietnamese social media content and to develop a method for analysing the sentiment of this content.

The remainder of this paper is organized as follows. The data collecting process and data preprocessing are presented in Sect. 2. Section 3 provides the selection and evaluation of deep learning. Finally, conclusions and perspectives are summarized in Sect. 4.
2 Data Preparation

2.1 Data Collecting Process

As mentioned above, collecting Facebook data without user consent is not easy. This research considers Facebook users with more than 100,000 followers as “hot” Facebook users. The main idea of Ultimate Facebook Scraper is to scroll a defined number of times so that all the needed content is shown in the web browser, and then to scrape this content. This approach is simple but raises some problems:
• If the web browser does not respond within the timeout period, the scrolling process ends and the scraper moves on to the scraping process.
• Scrolling more means the web browser occupies more RAM. This makes the computer run slower, lag or stop working completely.
These two problems both lead to the needed data not being collected, while still consuming time. To solve them, a trick is applied: instead of scraping all Facebook posts of the 1-year period at once, the scraper scrapes the posts of 1 month only, collects the data, refreshes the web browser and moves on to the next month. Scraping 1 month at a time limits the chance that the web browser does not respond in time or that too much RAM is occupied.
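The month-by-month trick can be sketched as a loop over monthly date windows. The code below is only an illustration under stated assumptions: `month_windows` and `scrape_in_batches` are hypothetical helper names, and `scrape_window` stands in for whatever scraping routine is actually used (it is a caller-supplied function here, not part of Ultimate Facebook Scraper).

```python
from datetime import date


def month_windows(start, end):
    """Yield (window_start, window_end) pairs, one per calendar month,
    covering the half-open interval [start, end)."""
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        yield current, min(next_month, end)
        current = next_month


def scrape_in_batches(scrape_window, start, end):
    """Call the caller-supplied scrape_window(first, last) once per month
    and merge the results; the browser refresh would happen between calls."""
    posts = []
    for first, last in month_windows(start, end):
        posts.extend(scrape_window(first, last))
    return posts


# The 1-year collection period of this study, split into monthly windows
windows = list(month_windows(date(2018, 9, 1), date(2019, 9, 1)))
print(len(windows), windows[0], windows[-1])
```

Keeping each window small bounds both the number of scrolls per browser session and the memory the page accumulates, which is the point of the trick described above.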
The data set is stored in an Excel file with multiple sheets. The first sheet contains a list of hot Facebook users including their name, Facebook link, number of followers, career, gender and year of birth. Each of the other sheets contains one Facebook user’s posts, including the posted time, the post’s content and the label of the post’s sentiment. These posts are collected from the Facebook user’s “timeline”, so some posts may not be written by the user but by friends tagging them. This paper still treats these posts as belonging to the Facebook user. From now on, each Facebook post will be referred to as a document.
2.2 Data Preprocessing

Word Filtering
The first step in preprocessing the data is word filtering: removing words that will not affect the sentiment analysis. Using regular expressions, the following “words” in the document are replaced with a blank space, in this order: URL links, non-letter characters (digits, punctuation marks, emojis, etc.) and single characters. This can leave multiple blank spaces in the document, so these runs of spaces are replaced with single spaces. Then, to avoid the computer treating uppercase and lowercase letters differently, the final step of word filtering converts all words to lowercase.

Word Tokenization
Word tokenization divides a large amount of text into words. English words are usually separated by blank spaces, but this is not the case in Vietnamese, where a single word may contain blank spaces. The ViTokenizer tokenizer of the pyvi library by Tran Viet Trung [11] replaces the blank spaces inside Vietnamese words with the underscore symbol (“_”). For example, the document “Càng đi và gặp gỡ, mới thấy những người hơn mình đều luôn ở phía trước và luôn đặt sự quan tâm vào tương lai và cộng đồng.” is converted into “Càng đi và gặp_gỡ, mới thấy những người hơn mình đều luôn ở phía trước và luôn đặt sự quan_tâm vào tương_lai và cộng_đồng.” With this converted document, the word tokenization process can be done by using the blank space as the separator.

Attribute Selection
The data is clean now, but is still not ready to be used. Deep learning methods only work with numerical data, so the data is converted into numeric form. There are several ways to select text attributes, and one of the most commonly used is the bag of words. The steps are:
• Create a vocabulary of unique words.
• Convert each document into a feature vector using the vocabulary: the length of each vector is the length of the vocabulary, and each entry is the frequency of the corresponding vocabulary word in the document.
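The word filtering order and the two bag-of-words steps above can be sketched with the standard library alone. This is a minimal sketch, not the authors' code: the regular expressions, the NFC normalization step and the function names `filter_words`, `build_vocabulary` and `bow_vector` are assumptions made for illustration, and the Vietnamese word segmentation done by pyvi is left out.

```python
import re
import unicodedata
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[\W\d_]+")   # anything that is not a Unicode letter
SINGLE_CHAR_RE = re.compile(r"\b\w\b")    # words made of a single character
SPACES_RE = re.compile(r"\s+")


def filter_words(text):
    """Apply the filtering steps in order: URLs, non-letters, single
    characters, repeated spaces, then lowercase."""
    # Assumption: compose accents first so Vietnamese letters stay intact
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    text = SINGLE_CHAR_RE.sub(" ", text)
    text = SPACES_RE.sub(" ", text).strip()
    return text.lower()


def build_vocabulary(documents):
    """Step 1: sorted list of the unique words in all documents."""
    return sorted({word for doc in documents for word in doc.split()})


def bow_vector(document, vocabulary):
    """Step 2: one count per vocabulary word, in vocabulary order."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


cleaned = filter_words("Xem https://example.com NGAY 123 a!! Tốt")
docs = ["vui quá vui", "buồn quá"]          # hypothetical toy documents
vocab = build_vocabulary(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]
print(cleaned, vocab, vectors)
```

Note that the non-letter pattern also removes underscores, so in a real pipeline the filtering would have to run before the tokenizer joins multi-syllable words with “_”, or the pattern would need to keep that character.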
In the bag of words (BoW) approach, every word has the same weight. This study uses the term frequency–inverse document frequency (TF-IDF) approach: words that occur less in all the documents and more in an individual document contribute more towards classification [12].

\[ \mathrm{TF} = \frac{\text{frequency of a word in the document}}{\text{total words in the document}} \]

\[ \mathrm{IDF} = \log \frac{\text{total number of docs}}{\text{number of docs containing the word}} \]
TF stands for term frequency, while IDF stands for inverse document frequency. The TfidfVectorizer class from the Python Scikit-Learn library is used to convert text into TF-IDF feature vectors [13] with max_features = 2500 (only the 2500 most frequently occurring words are kept), max_df = 0.8 (words that occur in more than 80% of the documents are ignored) and min_df = 7 (words that occur in fewer than 7 documents are ignored).
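To make the two formulas concrete, the sketch below evaluates them literally on a tiny tokenized corpus. The corpus and the function names are hypothetical. The values will not match the output of Scikit-Learn's TfidfVectorizer exactly, because that class by default uses raw counts for TF, a smoothed IDF and L2-normalized vectors.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: count of each word divided by the number of words in the document."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


def inverse_document_frequency(corpus):
    """IDF: log(total number of docs / number of docs containing the word)."""
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    return {word: math.log(len(corpus) / df) for word, df in doc_freq.items()}


def tf_idf(corpus):
    """Return one {word: TF * IDF} dictionary per document."""
    idf = inverse_document_frequency(corpus)
    return [{word: tf * idf[word] for word, tf in term_frequency(tokens).items()}
            for tokens in corpus]


# Hypothetical toy corpus, already filtered and tokenized
corpus = [["vui", "vui", "quá"], ["buồn", "quá"]]
scores = tf_idf(corpus)
print(scores)
```

A word such as “quá” that appears in every document gets an IDF of log(1) = 0 and therefore carries no weight, whereas “vui”, which is frequent in one document only, gets the largest score there.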
3 Neural Network Architecture

3.1 Background

Neural networks (NN) imitate how the brain works with multiple layers, each with a specific number of nodes [9]. Data passes through and is processed by each layer, and the output of the previous layer is the input of the next layer. A recurrent neural network (RNN) has a connection to the previous neuron state in addition to the layer inputs and is particularly beneficial for data that is sequential or contextual: it analyses a text word by word and stores the semantics of all the previous text in a fixed-size hidden layer, so it captures contextual information better [14]. Long short-term memory (LSTM) is a type of RNN in which new information in the neurons is more critical than older information [9].
3.2 Choose Baseline

Before using deep learning with LSTM, the problem is first approached with a simple method to create a baseline score. The random forest algorithm is chosen to train the machine learning model, using the RandomForestClassifier class of the sklearn.ensemble module. The result is quite good (Tables 2, 3, 4 and 5).
Table 2 Sentiment analysis result of all documents

Sentiment   Precision   Recall   F1-score
Negative    0.86        0.25     0.39
Positive    0.89        0.99     0.94
Accuracy                         0.89
Table 3 Sentiment analysis result of a Facebook user with positiveness 96.92%

Sentiment   Precision   Recall   F1-score
Negative    1.00        0.50     0.67
Positive    0.98        1.00     0.99
Accuracy                         0.98
Table 4 Sentiment analysis result of a Facebook user with positiveness 14.10%

Sentiment   Precision   Recall   F1-score
Negative    0.98        0.85     0.91
Positive    0.50        0.91     0.65
Accuracy                         0.86
Table 5 Sentiment analysis result of a Facebook user with positiveness 57.69%

Sentiment   Precision   Recall   F1-score
Negative    0.95        0.82     0.88
Positive    0.88        0.97     0.92
Accuracy                         0.90
Since the numbers of negative and positive labels are not equal, the result is measured by the F1-score. Deep learning with LSTM should reach an F1-score higher than the results listed above.
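The F1-score is the harmonic mean of precision and recall, which is why it is less flattering than accuracy when one class dominates. The short sketch below (function names are assumptions) shows the computation; applied to the precision and recall values of Table 2, it reproduces the reported F1-scores of 0.39 and 0.94.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from confusion-matrix counts for one class."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the Negative and Positive rows of Table 2
negative_f1 = f1_from_pr(0.86, 0.25)
positive_f1 = f1_from_pr(0.89, 0.99)
print(round(negative_f1, 2), round(positive_f1, 2))
```

The Negative row illustrates the effect: a precision of 0.86 cannot compensate for a recall of 0.25, so the F1-score stays low even though overall accuracy is 0.89.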
3.3 Using LSTM

The classification model uses two bidirectional LSTM layers: a forward LSTM and a backward LSTM. The model is trained in two steps:
• Run the neural network forward to set the cell states.
• Go backward to compute the derivatives. This step uses the cell states to work out how to change the network’s weights with the Adam optimizer. The optimizer minimizes the loss function (mean square error between expected output and actual output).
Feature extraction is done with the TF-IDF approach, as mentioned above. In the LSTM approach, the feature set is padded into sequences with maxlen = 1000 (the maximum length of all sequences). The sentiment analysis is run with different numbers of epochs and dropout rates to determine which set of values is best. From Figs. 1, 2, 3 and 4, it seems that no matter how many epochs are run, the F1-score on the test data stays the same: 0.92. Running too many epochs is therefore redundant (Fig. 1). Training the LSTM model with 5 epochs and dropout rate 0.2 seems to be the most appropriate. Table 6 shows the F1-scores for the three Facebook users in Tables 3, 4 and 5.
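Two ingredients mentioned above, padding to a fixed length and the mean square error loss, are simple enough to sketch in plain Python. This is an illustration only: the function names are hypothetical, and padding and truncating at the front of the sequence is an assumption (the paper states only maxlen = 1000, not the padding side).

```python
def pad_sequence(values, maxlen, pad_value=0.0):
    """Return a list of exactly maxlen items: keep the last maxlen values
    and fill the front with pad_value when the input is shorter."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    kept = list(values)[-maxlen:]
    return [pad_value] * (maxlen - len(kept)) + kept


def mean_squared_error(expected, actual):
    """Average of the squared differences between two equal-length lists."""
    if len(expected) != len(actual) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((e - a) ** 2 for e, a in zip(expected, actual)) / len(expected)


short = pad_sequence([0.1, 0.2, 0.3], 5)      # shorter input is padded
long_ = pad_sequence(range(2500), 1000)       # longer input is truncated
loss = mean_squared_error([1, 0], [0.5, 0.5])
print(short, len(long_), loss)
```

With max_features = 2500 and maxlen = 1000, any feature vector longer than 1000 entries would be cut under this scheme, which is one design choice worth checking when reproducing the experiment.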
Fig. 1 Training loss and F1-score with 10 epochs and dropout rate 0.2
Fig. 2 Training loss and F1-score with 5 epochs and dropout rate 0.1
Fig. 3 Training loss and F1-score with 3 epochs and dropout rate 0.1
Fig. 4 Training loss and F1-score with 10 epochs and dropout rate 0.2
Table 6 F1-score of Facebook users with different positiveness

Positiveness   F1-score
96.92%         0.98
14.10%         0.27
57.69%         0.73
The result in Table 6 shows that the LSTM approach may give good predictions on documents with positive labels, but does not work well on negative data. Random forest therefore performs better than LSTM here, since the data set is highly biased.
4 Conclusions and Perspectives

Being a hot Facebook user with more than 100,000 followers does not mean that the person has a positive image. This research does not say that having a positive influence is a good thing, or that being negative is a bad thing. The paper provides a solution for sentiment analysis of Vietnamese social media content, serves as a reference for other fields in deep learning, and provides a Vietnamese social media content data set. The results of the paper can be used for communication management, social management, and early detection or prevention of negative social impacts. The paper is also a valuable reference for problems in areas such as deep learning [15, 16] and data analytics [17, 18].
References
1. Social Media Stats Viet Nam (2020). https://gs.statcounter.com/social-media-stats/all/viet-nam
2. Haris, M., Hassaan, E.: Ultimate Facebook scraper (version v1.0.0). Zenodo (2018). https://doi.org/10.5281/zenodo.2537107
3. Data set (2019). https://github.com/lazzycat/vietnamese_celebs_facebook_sentiment
4. Patil, P., Yalagi, P.: Sentiment analysis levels and techniques: a survey. IJIET 6 (2016)
5. Liu, R., Shi, Y., Ji, C., Jia, M.: A survey of sentiment analysis based on transfer learning. IEEE Access 7, 85401–85412 (2019)
6. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 142–150 (2011)
7. Rahman, L., Mohammed, N., Kalam Al Azad, A.: A new LSTM model by introducing biological cell state. In: Proceedings of the Electrical Engineering and Information Communication Technology (ICEEICT), 3rd International Conference, pp. 1–6 (2016)
8. Tripathy, A., Agrawal, A., Kumar Rath, S.: Classification of sentiment reviews using n-gram machine learning approach. Expert Syst. Appl. 57, 117–126 (2016)
9. Yenter, A., Verma, A.: Deep CNN-LSTM with combined kernels from multiple branches for IMDb review sentiment analysis. In: Proceedings of the IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), New York City, NY, pp. 540–546 (2017)
10. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. arXiv:1801.06146 (2018). https://arxiv.org/abs/1801.06146
11. Trung, T.V.: Python Vietnamese Toolkit (2020). https://github.com/trungtv/pyvi
12. Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2008)
13. Pedregosa, F., et al.: Scikit-learn: machine learning in python. JMLR 12(85), 2825–2830 (2011)
14. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. AAAI 333, 2267–2273 (2015)
15. Hung, P.D., Giang, T.M., Nam, L.H., Duong, P.M.: Vietnamese speech command recognition using recurrent neural networks. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 10(7) (2019). https://doi.org/10.14569/IJACSA.2019.0100728
16. Nam, N.T., Hung, P.D.: Padding methods in convolutional sequence model: an application in Japanese handwriting recognition. In: Proceedings of the 3rd International Conference on Machine Learning and Soft Computing (ICMLSC 2019), pp. 138–142. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3310986.3310998
17. Tae, C.M., Hung, P.D.: Comparing ML algorithms on financial fraud detection. In: Proceedings of the 2019 2nd International Conference on Data Science and Information Technology (DSIT 2019), pp. 25–29. Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3352411.3352416
18. Hung, P.D., Hanh, T.D., Diep, V.T.: Breast cancer prediction using Spark MLLIB and ML packages. In: Proceedings of the 5th International Conference on Bioinformatics Research and Applications (ICBRA ’18), pp. 52–59. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3309129.3309133
The Connection of IoT to Big Data–Hadoop Ecosystem in a Digital Age
Le Trung Kien, Phan Duy Hung, and Kieu Ha My
Abstract The massive amount of data gathered by physical devices in the modern IoT world carries hidden valuable information. Extracting these gems has become a focused task for data scientists and researchers around the globe. Meanwhile, since its appearance in the early 2000s, the Hadoop ecosystem has set up a solid foundation for big data processing, directing tech experts to a prevalent stream. While one could take advantage of the various tools Hadoop provides to conjure up quick fixes for specific real-life matters, there is a lack of common methods presented in the literature to deal with sets of similar problems. This creates a confusing scenario for researchers and impedes the advancement of solution designers in the IoT-big data field. In this paper, we have conducted a thorough review of the aspects related to handling IoT’s big data in the Hadoop ecosystem, including data storage, data processing, and data analytics. We also remark on some significant works introduced in previous studies and suggest directions for future research.

Keywords IoT-big data field · Hadoop · Data processing · Data analytics
L. T. Kien · P. Duy Hung (B) · K. Ha My
FPT University, Hanoi, Vietnam
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_54

1 Introduction

The human world is surrounded by IoT devices. From home to work to public places, one can easily find IoT equipment. Experts predicted that the number of IoT devices would reach 50 billion by 2020 [1–3]. It is also anticipated that most of these devices will be able to communicate via the Internet [4]. Some suggest the size of the data these instruments generate will soon go up to 35 ZB [4]. This data provides useful services to end-users and is of high value to businesses and governments. This brings researchers to the question of how to deal with IoT’s big data, since it is heterogeneous in nature and is often unstructured [5].

Big data has four key characteristics: volume, velocity, variety, and veracity. Although big data has attracted attention for over a decade, each of its characteristics remains a vast research area with many unanswered questions [6]. Various methods have been proposed, but the Hadoop ecosystem has gathered tech experts into a common trend. Hadoop provides useful tools to process big data, varied in types of input and output requirements, such as MapReduce, Pig and Hive (for SQL data), HBase (for NoSQL data), Splunk and Kafka (Table 1).

Table 1 Hadoop ecosystem

The Hadoop ecosystem
Data storage      HDFS, HBase
Data processing   MapReduce
Data access       Hive, Pig, Mahout, Sqoop
Data management   Oozie, Chukwa, Flume, Zookeeper, YARN

Over the last ten years, businesses and tech experts have dedicated their efforts to effective methods to store, process, and visualize IoT’s heterogeneous data within the Hadoop ecosystem. While achievements have been made, many problems remain to be solved. One of the major obstacles is the lack of common methods and guidance for handling IoT data. Proposed frameworks have solved specific problems, yet the answer to one particular issue does not often help to produce solutions for similar matters. While this research field welcomes both academics and tech experts, one could easily get lost in the dense forest of big data and IoT. In this paper, we have conducted a thorough review of the published studies and present the most remarkable state-of-the-art big data technologies to store, process, and analyze IoT-generated information within the Hadoop ecosystem. A major contribution of this paper is to provide an overview of the picture and to bring insights and ideas to researchers who delve into the field. We also point readers to problems left to be solved and provide directions for future research.

The paper is presented in the following order: Sect. 2 identifies characteristics of IoT data, Sect. 3 gives information regarding data storage, and Sect. 4 presents data processing and analysis methods. The final section remarks on this paper’s contribution and gives insight into what might be carried out in future work.
2 IoT Data Characteristics and Challenges

When it comes to the massive data constantly being generated by IoT, the first obstacle to overcome is data storage. Although advanced technology has lightened the financial burden, to this day, the cost of increasing the performance of storage media while expanding their capacity is still very high [7]. Traditionally, data is stored on hard disk drives, but their random access is slower than their sequential access [8]. Although technology has been developing rapidly to cope with the vastness of IoT data [9], there has not been a sustainable solution to this problem to date [7].

IoT-gathered data is heterogeneous as it comes from multiple sources. While most of it is unstructured, some comes in semi-structured or structured form. Thus, methods of data integration are the next obstacle to overcome [10].

The scale of dynamic data is another barrier. The boundless number of interconnected sensors and actuators generates a huge volume of real-time data, which demands scalable storage technology and suitable analysis methods [10].

The majority of applicable IoT instruments these days inject constant flows of heterogeneous data, which requires stream processing techniques in order to provide meaningful information [11]. For example, a traffic monitoring or surveillance system demands that generated data be dealt with instantaneously to instruct traffic lights or give warnings in a timely manner. This is another challenge in the field.

IoT-generated data from basic physical devices also carries weak semantic meaning. Data processing technologies must somehow cope with revealing the hidden content behind this massive source of information [10].

The last issue is the questionable reliability of IoT data. To this day, the IoT industry is facing serious concerns regarding its security. IoT devices are vulnerable to cyber-attacks and physical compromise, leading to tampered data that can have major consequences for decision making [3]. In data processing, identifying and removing anomalies as well as tampered data becomes a significant goal.

It is noteworthy that the traditional Hadoop and MapReduce frameworks, despite being capable of handling big data, are no longer suitable in the IoT environment, since these tools cannot cope with the massive streaming data generated by physical devices [10], [111]. More dedicated applications have been built to work with IoT’s streaming data including, but not limited to, Apache Spark, Kafka, Flink, and Storm. In the section below, we will discuss the first point of interest: how data could be stored within the Hadoop framework.
3 IoT Data Storage As mentioned above, traditional HDD storage media is no longer a practical option for the massive amount of IoT data while the latter concepts of solid state drive (SSD) or phrase change memory (PCM) have not reached the required level of performance for IoT’s big data yet [8]. Several other storage technologies have been proposed by
574
L. T. Kien et al.
Table 2 Comparing features between NoSQL and RDBMS [12]

No. | Feature | HDFS-based NoSQL | RDBMS
1 | Large datasets | Efficient and fast | Not efficient
2 | Small datasets | Not efficient | Efficient and fast
3 | Searches | Not efficient | Efficient and fast
4 | Large read operations | Efficient and fast | Not efficient
5 | Updates | Not efficient | Efficient and fast
6 | Data relations | Not supported | Supported
7 | Authentication/authorization | Kerberos | Built-in
8 | Data storage | Distributed over data nodes | Central database server
9 | ACID compliant | No | Yes
10 | Concurrent updates to dataset | Not supported | Supported
11 | Fault tolerance | Built-in | No built-in
12 | Scalability | Easily scalable | Not easily scalable
researchers, but before exploring the available options for migrating and storing data, it is necessary to define the differences between the two commonly recognized types of database: relational (SQL) and non-relational (NoSQL) (Table 2).

The main difference between the two is the “relation” characteristic of the information. In SQL databases, key identifiers relate information across the dataset. Data is stored in tables that follow a predefined schema, i.e., a structured database. Data is manipulated using Structured Query Language (SQL) and must comply with the ACID properties (Atomicity, Consistency, Isolation, and Durability). Such a database is vertically scalable and is therefore more suitable for relatively small datasets [12]. It has a longer history than its peer, and platforms built to handle it include Microsoft SQL Server, Oracle, and IBM DB2.

However, modern datasets generated in the IoT environment hardly follow a strict schema. Data gathered from social media (Facebook, Twitter, etc.), for example, is huge in volume and unrelated (with a dynamic schema), which makes it unfit for traditional SQL processing. NoSQL technology compensates for what SQL platforms cannot achieve. In the Hadoop environment, the Hadoop Distributed File System (HDFS) supports reading large datasets efficiently and quickly so that information can be extracted with tools such as Hive and HBase. NoSQL data storage scales easily, making it a better candidate for handling massive datasets. A considerable weakness is that this technology is built to read data rather than to manipulate it [12].

On the other hand, the conflicting attributes of these two types of database bring us back to the powerful features Hadoop offers for processing unstructured data. Raw IoT-generated data does not contain much useful knowledge since it is pure
numbers and/or text, but Hadoop’s components support data analysts with built-in analytic tools. Hadoop also helps map various data structures into a single set to gain more insightful information [12].

In addition, data these days is often stored on cloud platforms because of their scalability and lower cost. Cai et al. [10] give an insight into the various IoT data types and their storage on cloud platforms. Their paper explains that the variety of IoT data, which serves different needs, requires data isolation to match those needs. Available database management systems (DBMS) are classified into RDBMS (relational DBMS), NoSQL (Not only SQL) DBMS, DBMS based on HDFS, main-memory DBMS, and graph DBMS.

For RDBMS, which hold structured data, the authors list the Ultrawrap method, in which each RDBMS is represented as an RDF graph, which can then be put on relational storage via SPARQL queries [10]. For unstructured data, NoSQL databases offer dynamic schema structures. This relatively newer kind of system takes advantage of the schematic information available in IoT data, yet its weaknesses make it less than ideal for rapidly changing data types. Unstructured data generated in the form of XML can be stored effectively in HDFS. The HBase data store helps to improve indexing for service discovery [10].

Another storage module mentioned in the same research is the main-memory DBMS. This type of system adjusts processing capability to streaming IoT data. The authors refer to Lu and Ye [13], in which an RFID application is implemented in the H2 main-memory DBMS. This method has the advantage of rapid data migration and maintains a recovery mechanism to ensure data protection [10]. On the other hand, graph-structured data used to manage the relationships among sensor data can be stored with the DEX graph DBMS [13], as it empowers large-scale data nodes to support graphs of equivalent size, as cited in [10].
The last system mentioned in [10], the RDF-based one, addresses semi-structured data storage using the cloud-based RDF data management proposed by Kaoudi and Manolescu [14]. In this system, RDF Schema is used to categorize and store semi-structured data [10].

For unstructured data storage, Mishra, Lin, and Chang [9] used Apache HBase in their proposed cognitive-oriented IoT big-data (COIB) framework. The authors argue that unstructured IoT data streams can be fused using certain standard semantics to create clean relational data before classifiers distribute the data into clusters based on its characteristics. HBase stores the classified data in tables, each with a key element that distinguishes it from the others. The master node contains metadata, and the storage nodes hold the data themselves. According to the authors, this system is effective in boosting scalability [9].

For heterogeneous IoT data in smart environments, a significant proposal comes from Fazio et al. [5]. The researchers present a hybrid cloud storage system in which data generated by monitoring infrastructures is collected via a plug-in called the data gathering interface and then pushed to the data manager. The data manager abstracts the collected data to provide a uniform semantic description. Data abstraction follows the Sensor Web Enablement (SWE) standards. While SWE provides data for a document-oriented storage system, the authors also use its semantics to deal
with an object-oriented storage system. This is achieved by enriching the data with geolocation information, a task performed by the data manager itself [5]. Storage of varied sensor data is also addressed by RubyDinakar and Vagdevi [6]. The two authors propose converting streamed sensor data to an HDFS format and storing it in the cloud. However, their work does not describe a concrete mechanism.

An additional approach to the issue of data volume in the IoT context is to reduce the size of data streams. Alieksieiev [15] offers a potential solution with an approximation method that shrinks data streams. The algorithm divides each dataset into two subsets, the “maximums” and the “minimums”. In each subset, local extrema are found, and either all of them or a selection of them are used to approximate the dataset. However, because this approach compresses the data, it may result in information loss [15].

To sum up, data gathered from IoT devices is often heterogeneous in nature and comes in colossal datasets. An HDFS-based NoSQL data system is therefore a suitable choice, with analysis supported by the Hadoop platform (e.g., HBase). At present, cloud technology seems to be the preferred choice for storing IoT-generated data, and a number of systems have been built to accommodate the various data types.
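The stream-reduction idea attributed to [15] above can be illustrated with a small sketch. It is only one loose reading of the description given here (keep the local maximums and minimums of a series and discard the rest), not the algorithm of [15] itself; the function names and the sample series are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (position in the stream, value)


def local_extrema(values: List[float]) -> Tuple[List[Sample], List[Sample]]:
    """Return the interior local maximums and local minimums of a series."""
    maximums: List[Sample] = []
    minimums: List[Sample] = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if cur > prev and cur >= nxt:
            maximums.append((i, cur))
        elif cur < prev and cur <= nxt:
            minimums.append((i, cur))
    return maximums, minimums


def approximate(values: List[float]) -> List[Sample]:
    """Approximate a series by its endpoints plus all local extrema.

    Points between two extrema are dropped, so the result is lossy.
    """
    if len(values) <= 2:
        return list(enumerate(values))
    maximums, minimums = local_extrema(values)
    kept = [(0, values[0]), *maximums, *minimums, (len(values) - 1, values[-1])]
    return sorted(kept)


if __name__ == "__main__":
    series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0]
    print(approximate(series))
```

On a slowly varying series most samples lie between two extrema and are dropped, which is where the size reduction, and the information loss noted above, come from.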
4 Data Processing and Analytics

In this section, we discuss how IoT streaming data is currently processed and analyzed in the Hadoop ecosystem. The main tools used by researchers and practitioners include Apache Spark, Apache Kafka, Storm, and Flink, with some exceptions. Apache Storm appears to be the dominant tool, as it provides a convenient platform for handling streaming data [16–18].

There are two common models for working with streaming data: native streaming and micro-batch processing (Fig. 1). While the former captures and processes data in real time as it arrives in tuples, the latter collects data in short batches created at a fixed interval (e.g., every 5 min). Native streaming is necessary in contexts that require instant decisions (such as a danger-warning system), while micro-batch processing is used when a certain degree of latency is acceptable (e.g., a weather forecast system).

D’silva and associates [18] introduced a framework known as Dashing to process streaming data. The researchers use Kafka to capture cloud messages sent and received among IoT devices, since it has a larger and more scalable storage layer than traditional Hadoop storage can provide. Kafka then pushes the messages to Apache Spark, where the data is streamed in real time. The Spark block breaks down the information with its GraphX module before forwarding the processed data to the Dashing dashboard. Such a framework helps present data in a more readable and appealing way (Fig. 2).

However, Apache Spark is not ideal for dealing with multiple stream sources. This is a practical problem in the IoT environment, where various applications running simultaneously are the norm. To cope with this challenge, Sirisakdiwan and Nupairoj built a framework to analyze multiple data streams in real time [11].
Fig. 1 Native streaming versus micro-batch processing model
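The difference between the two models in Fig. 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how Spark, Storm, or Flink implement either model: the names `native_streaming` and `micro_batches`, the sample events, and the interval are hypothetical, and events are assumed to arrive in time order.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Event = Tuple[float, float]  # (arrival time, sensor reading); units are arbitrary


def native_streaming(events: Iterable[Event], handle: Callable[[float], None]) -> None:
    """Native streaming: every reading is handled as soon as it arrives."""
    for _arrival, reading in events:
        handle(reading)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[float]]:
    """Micro-batching: readings are grouped into fixed-length time windows.

    A batch is emitted only when an event from a later window arrives,
    which is the source of the extra latency. Empty windows are skipped.
    """
    batch: List[float] = []
    window = None
    for arrival, reading in events:
        index = int(arrival // interval)
        if window is not None and index != window:
            yield batch
            batch = []
        window = index
        batch.append(reading)
    if batch:
        yield batch


EVENTS: List[Event] = [(0.5, 10.0), (1.2, 11.0), (4.9, 12.0), (5.1, 30.0), (11.0, 31.0)]

if __name__ == "__main__":
    native_streaming(EVENTS, lambda reading: print("handled", reading))
    for batch in micro_batches(EVENTS, interval=5.0):
        print("batch", batch)
```

In the native model the handler sees each reading individually, whereas in the micro-batch model a reading waits until its window closes, so the interval bounds the added delay.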
Fig. 2 Dashing real-time processing framework [18]
In the proposal, different applications are submitted to Spark at the same time via spark-submit. To avoid the overload caused by submitting a large number of applications, the authors merge various heterogeneous data streams into one Spark application using a three-function framework (Initialize, Register, and Execute). By integrating stream processes, the framework can use Spark’s FAIR scheduling scheme to solve the job-queue problem [11]. The proposed method requires users to set up Spark for FAIR scheduling by calling the Initialize function; Register then appends objects to a function list before stream processing is handled in Execute [11]. Although FAIR scheduling increases the execution time, the authors argue that it is still faster than FIFO. According
to the paper, the proposed framework resolves the problems of multiple data streams and job queuing while reducing the burden of coding and monitoring [11].

Nasiri et al. [16] provide a more comprehensive view by comparing Apache Spark, Storm, and Flink in the context of building a system architecture for smart cities. The paper highlights that Storm relies on Apache ZooKeeper as an intermediary between the Nimbus daemon and the Supervisors, which start or halt Workers within the Worker Nodes. These nodes carry out the actual streaming job. Flink, on the other hand, offers a fault-tolerant native streaming architecture, in which Task Managers execute the distributed program under the coordination of the Job Manager. This model requires developers to program against an application programming interface (API) [16]. The flexible Spark architecture, comprising a Driver, a Cluster Manager, and Executors, is covered next in the same study. Spark collects data from various sources, transforms it, and produces meaningful outputs. Within the architecture, Executors receive appropriate tasks from the Cluster Manager, in accordance with their servers, and execute them. Information is passed to the Cluster Manager by the Spark Driver, which plays the role that the master node plays in other similar frameworks [16].

The evaluation reveals that Storm and Flink scale better than Spark, while Spark performs better at machine learning tasks. The authors conclude that Flink is the most suitable for complex event processing, that Storm should be used for image and video processing, and that both Storm and Spark are superior platforms for extracting, transforming, and loading IoT big data [16].

It is claimed in [17] that the ever-growing volume of data from IoT devices in smart city models may make Storm and Spark systems run the same preprocessing and analysis tasks multiple times.
To make better use of these systems, Chaturvedi and associates optimized Apache Storm by reusing overlapping tasks with dataflow reuse algorithms. Their experiments show a possible reduction in CPU usage of up to 51%, making the model a promising addition to the current platforms [17].

Other frameworks introduced in the literature include Apache Heron, an advanced version of Storm in terms of scalability and debuggability; Samza, a Kafka-YARN fusion; Akka [12]; and Apache Drill and Dremel for interactive data analysis [7]. Outside the Hadoop ecosystem, notable works include the Edgewise system, which eases throughput and latency issues in cloud resources [19]; the IoTstream lightweight semantic model, claimed to resolve IoT problems of heterogeneity and interoperability [19]; and the COIB framework for organizing and discovering data patterns [9]. However, systems that are not closely integrated with the Hadoop platform are outside the scope of this paper.
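The task-reuse idea attributed to [17] in this section can be made concrete with a toy sketch. It does not reproduce the dataflow reuse algorithms of [17]; it assumes, purely for illustration, that each dataflow is a linear chain of named tasks and that two dataflows can share work only along a common prefix. All task and function names are hypothetical.

```python
from typing import Dict, Sequence

# A nested dictionary used as a prefix tree: task name -> subtree of later tasks.
Trie = Dict[str, "Trie"]


def count_tasks_without_reuse(dataflows: Sequence[Sequence[str]]) -> int:
    """Every dataflow runs all of its own tasks, duplicates included."""
    return sum(len(flow) for flow in dataflows)


def merge_dataflows(dataflows: Sequence[Sequence[str]]) -> Trie:
    """Merge dataflows so that a shared prefix of tasks appears only once."""
    root: Trie = {}
    for flow in dataflows:
        node = root
        for task in flow:
            node = node.setdefault(task, {})
    return root


def count_tasks_with_reuse(trie: Trie) -> int:
    """Count the distinct task instances left after merging."""
    return sum(1 + count_tasks_with_reuse(child) for child in trie.values())


DATAFLOWS = [
    ["parse", "filter", "average"],
    ["parse", "filter", "alert"],
    ["parse", "archive"],
]

if __name__ == "__main__":
    merged = merge_dataflows(DATAFLOWS)
    print(count_tasks_without_reuse(DATAFLOWS), "tasks without reuse")
    print(count_tasks_with_reuse(merged), "tasks with reuse")
```

In this toy setting the three dataflows share their parsing and filtering steps, so fewer task instances need to run after merging; how much CPU time this saves on a real Storm deployment is an empirical question of the kind addressed in [17].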
5 Conclusions and Perspectives

The presence of IoT devices is rapidly changing every aspect of human life. On a daily basis, huge amounts of data are gathered while these devices provide
assistance to people around the globe. This massively collected information also contains hidden, meaningful messages waiting to be uncovered by big data techniques. However, dealing with IoT streaming data, which is heterogeneous in nature and mostly unstructured, is no simple mission. How to store and how to process this kind of data are the two crucial questions that demand proper answers. The birth of Hadoop, along with later applications developed within the same ecosystem, has provided practitioners with powerful tools for dealing with big data, especially in the IoT environment. While researchers have devoted much effort to using these tools and building various platforms, universal methods for handling similar types of data have not been agreed upon. The complicated IoT big-data universe seems unwelcoming to newcomers to the field.

Recognizing these matters, we have conducted a thorough review of the publicly available literature to provide stakeholders and academics with a bird’s-eye view of the picture. State-of-the-art frameworks have been introduced to reveal IoT data storage mechanisms as well as data processing and analysis techniques. Although each technique, platform, or framework proves strong in certain respects while seemingly inferior in others, the paper gives an insight into the methods currently used to find specific solutions in the IoT big-data environment. For future work, we suggest categorizing IoT data types and matching them with feasible frameworks within the Hadoop ecosystem. It is also of critical importance to explore solutions beyond the Hadoop platform, with reference to several significant studies introduced in this research.
References

1. Khan, M.A., Salah, K.: IoT security: review, blockchain solutions, and open challenges. Future Gener. Comput. Syst. 82, 395–411 (2018)
2. Qian, Y., et al.: Towards decentralized IoT security enhancement: a blockchain approach. Comput. Electr. Eng. 72, 266–273 (2018)
3. Kien, L.T., Hung, P.D., My, K.H.: Evaluating blockchain-IoT frameworks. In: Solanki, V.K., Hoang, M.K., Lu, Z.J., Pattnaik, P.K. (eds.) Research in Intelligent and Computing in Engineering. RICE 2019. Advances in Intelligent Systems and Computing, vol. 1125, pp. 899–912 (2020)
4. Berat, S.O., Dogdu, E., Ozbayoglu, M., et al.: An extended IoT framework with semantics, big data, and analytics. In: Proceedings of the IEEE International Conference on Big Data (Big Data), Washington, DC, USA, pp. 1849–1856 (2016)
5. Fazio, M., Celesti, A., Puliafito, A., et al.: Big data storage in the cloud for smart environment monitoring. Procedia Comput. Sci. 52, 500–506 (2015)
6. RubyDinakar, J., Vagdevi, S.: A study on storage mechanism for heterogeneous sensor data on big data paradigm. In: Proceedings of the International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), Mysuru, pp. 342–345 (2017)
7. Constante, N.F., Silva, F., Herrera, B., et al.: Big data analytics in IOT: challenges, open research issues and tools. In: Rocha, A., Adeli, H., Reis, L.P., Costanzo, S. (eds.) Trends and Advances in Information Systems and Technologies, vol. 746, pp. 775–788. Springer International Publishing, Cham (2018)
8. Acharjya, D.P., Ahmed, K.: A survey on big data analytics: challenges, open research issues and tools. IJACSA 7(2) (2016)
9. Mishra, N., Lin, C.C., Chang, H.T.: A cognitive oriented framework for IoT big-data management prospective. In: Proceedings of the IEEE International Conference on Communication Problem-solving, Beijing, China, pp. 124–127 (2014)
10. Cai, H., Xu, B., Jiang, L., et al.: IoT-based big data storage systems in cloud computing: perspectives and challenges. IEEE Internet Things J. 75–87 (2016)
11. Sirisakdiwan, T., Nupairoj, N.: Spark framework for real-time analytic of multiple heterogeneous data streams. In: Proceedings of the 2nd International Conference on Communication Engineering and Technology (ICCET), Nagoya, Japan, pp. 1–5 (2019)
12. Lakhe, B.: Practical Hadoop Migration. Apress, Berkeley (2016)
13. Lu, Y.F., Ye, S.S.: A multi-dimension Hash index design for main-memory RFID database applications. In: Proceedings of the International Conference on Information Security and Intelligent Control, Yunlin, Taiwan, pp. 61–64 (2012)
14. Kaoudi, Z., Manolescu, I.: Cloud-based RDF data management. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD’14), Snowbird, Utah, USA, pp. 725–729 (2014)
15. Alieksieiev, V.: One approach of approximation for incoming data stream in IoT based monitoring system. In: Proceedings of the IEEE Second International Conference on Data Stream Mining & Processing (DSMP), Lviv, pp. 94–97 (2018)
16. Nasiri, H., Nasehi, S., Goudarzi, M.: Evaluation of distributed stream processing frameworks for IoT applications in Smart Cities. J. Big Data 6, 52 (2019)
17. Chaturvedi, S., Tyagi, S., Simmhan, Y.: Collaborative reuse of streaming dataflows in IoT applications. In: Proceedings of the IEEE 13th International Conference on e-Science (e-Science), Auckland, pp. 403–412 (2017)
18. D’silva, G.M., Khan, A., Gaurav, Bari, S.: Real-time processing of IoT events with historic data using Apache Kafka and Apache Spark with dashing framework. In: Proceedings of the 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, pp. 1804–1809 (2017)
19. Fu, X., Ghaffar, T., Davis, J.C., Lee, D.: Edgewise: a better stream processing engine for the edge. In: USENIX Annual Technical Conference, WA, USA (2019)
Learning Through MOOCs: Self-learning in the Era of Disruptive Technology

Neha Srivastava, Jitendra Kumar Mandal, and Pallavi Asthana
Abstract MOOCs are becoming a dependable and reliable platform for self-learning. Self-learning is very significant in science and technology because of the rising number of applications developed through inter-disciplinary approaches based on automation and computation. MOOCs provide an excellent opportunity for students to learn advanced technologies and increase their knowledge and skills. Students often attend these online courses but do not take the exams or complete the assignments. This paper presents an analysis, based on the students’ perspective, supporting the view that assessments and assignments are important for learning and that earning a certificate in these courses enhances students’ knowledge. By completing assignments and passing the assessment process to earn a certificate in online courses, students enhance their learning. That 27 of the 48 secondary-education participants were able to earn a course certificate also shows that students are inclined towards self-learning. This paper supports the view that MOOCs provide an excellent platform for self-learning.

Keywords MOOCs · Self-regulated learning · Inter-disciplinary study · Secondary
N. Srivastava · J. Kumar Mandal
Sri Mahavir Prasad Mahila Mahavidyalya, Lucknow, India

P. Asthana (B)
Amity University, Lucknow Campus, Lucknow, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_55

1 Introduction

The rate of growth in computing power has brought an immense technical upsurge in almost all areas. It has been impactful enough to create new areas such as cloud computing, machine learning, artificial intelligence, micro-electromechanical sensors,
Fig. 1 Depiction of advance subjects and their dependency on core subjects
nanotechnology, bio-informatics, data analytics, and robotics. These technologies are changing industries by creating advanced applications through the fast delivery of products and the real-time analysis of different databases with varied patterns [1, 2]. Most of the development in these times should be credited to the advancement of engineering and technology. Over the last 50 years, engineering has grown drastically by exploring new areas and delving into almost all fields, and so the past few years have seen exponential growth in technology. At present, engineering is applied in several areas such as medical sciences, pharmaceuticals, law, media and arts, fashion technology, bio-medical, stock exchanges, food technology, and archaeology [3–5]. Every application depends on technology in one way or another, and this has opened new ventures in inter-disciplinary studies [6, 7]. Figure 1 depicts how advanced subjects such as robotics, bio-mechanics, machine learning, data analytics, and web development depend on core subjects such as maths, physics, and programming languages, which are taught in secondary education and as basic courses in higher education [8–14].
2 Background

Nowadays, the Internet is overloaded with a huge amount of information that is easily available to students, so it is important to identify the relevant information and extract knowledge from it [15]. The availability of smartphones, personal computers, and Internet connections has brought a great deal of information within reach of anyone who wants to make an effort to learn. In this scenario, massive open online courses provide a reliable source, as they are conducted by experts in the field and provide authentic information. A Massive Open Online Course (MOOC) is a web-based platform that provides an unlimited number of students worldwide
with a chance of distance education with the best institutes in the world [15]. It was established back in 2008 and gained momentum in 2012 as a popular learning tool. Many MOOCs have communities with interactive sessions and forums for students, professors, and Teaching Assistants (TAs), along with study/course material and video lectures. Platforms such as Coursera, edX, Khan Academy, and Unacademy are hugely popular in India.

The Government of India has launched an online portal named SWAYAM, which is the world’s largest online learning platform for distance learning and online education [16]. It offers curriculum-based courses in disciplines such as medicine, law, humanities and social sciences, agriculture, commerce, and management, as well as inter-disciplinary areas, covering courses from the secondary school level to post-graduation. The portal thus serves students of both higher and secondary education. It also gives students of secondary education an opportunity to go through the content of courses taught in higher education, which helps them in choosing a career. Students can go through the content of many courses and map it to their areas of interest and strengths to make a suitable choice of professional or vocational course at the graduate level [17]. Through these courses, students take charge of their own learning, making it an autonomous process.
3 Statement of the Problem

1. Formulation of research questions

MOOCs are becoming a popular tool for acquiring knowledge and enhancing skill sets. With access to a tremendous amount of knowledge, students can explore their creative and innovative zeal through learning practices [18]. Impressed by the huge popularity of this learning platform, the authors formulated a few research questions, mainly to understand the contribution of these courses to students’ learning [19]. The self-regulated online learning questionnaire is based on the exploratory factor analysis by Jansen et al. [20]. That work included many aspects such as help seeking, persistence, metacognitive skills, and metacognitive performance; based on the Akaike information criterion, metacognitive skills are one factor highly associated with the performance of students enrolled in MOOCs. This factor is also associated with self-regulated learning, which was considered an important factor while framing the research questions [20, 21].

2. Formulation of hypothesis statement

The hypothesis statement is set as follows.
Null hypothesis (H0): MOOCs provide a platform for self-learning along with regular schooling in secondary education.
Alternate hypothesis (H1): MOOCs do not contribute significantly to self-learning along with regular schooling in secondary education.

In the next section, the authors analyze a survey conducted to verify
either of the hypotheses mentioned above. This survey provided useful insight into the present state of students’ learning gains through MOOCs. Students benefit from online courses, as these enhance their learning. They have to work hard to complete the assignments, and they also go through assessments to earn certification in these courses. Hence, earning a certificate through online courses helps them strengthen their skills and improve their job profile.

Research questions raised by the authors:
RQ1. Are Massive Open Online Courses (MOOCs) important along with regular courses for self-learning in secondary education?
RQ2. Are students persistent in completing these courses?
RQ3. Does the completion of a course affect the knowledge of students?
4 Research Methodology

1. Demographic analysis of students

This survey was conducted among 48 secondary-education students of a private school in North India. The average age of the participants was 18 years. Out of 48 participants, 14 were girls, and 31 were boys. They all had a background in maths, physics, and chemistry. Details of enrolled and completed courses were obtained from the students to verify their responses. All responses were found to be correct. The survey was conducted to understand the students’ perspective on self-learning.

2. Summary of the responses

The questionnaire contained ten questions. Students were asked to respond simply Yes or No; Neutral was not an option. However, two students did not believe in learning through MOOCs, as evident from the responses. Table 1 shows the students’ responses.
5 Results and Discussion

This section presents a discussion based on the students’ responses. After a thorough analysis of the responses, the authors found that online courses have a positive impact on education.
Table 1 Summary of the responses of students

Q. no. | Questions | Yes | No
Q1 | Do you make efforts for self-learning? | 48 | 0
Q2 | Do you consider MOOC as a platform for self-learning? | 42 | 6
Q3 | Have you attended any MOOC courses? | 38 | 10
Q4 | Did you get the certificate on completion? | 34 | 14
Q5 | Were you able to finish all courses for which you got enrolled in MOOC? | 27 | 21
Q6 | Did your knowledge enhanced in the courses where you were awarded certificate? | 37 | 11
Q7 | Were you able to learn the expected content through MOOC certification? | 38 | 10
Q8 | Did your knowledge enhanced in unfinished courses? | 34 | 14
Q9 | Do you consider that attaining a certificate is important for enhanced self-learning through MOOC? | 32 | 16
Q10 | Are you required to study extra material (prescribed by MOOC Mentor) to qualify for MOOC certificate? | 35 | 13
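The counts in Table 1 can be turned into shares of Yes answers with a few lines of Python. This is only a convenience sketch for reading the table: the counts are copied from Table 1, while the names `RESPONSES`, `yes_share`, and `totals_consistent` are hypothetical.

```python
from typing import Dict, Tuple

TOTAL_PARTICIPANTS = 48

# (Yes, No) counts per question, copied from Table 1.
RESPONSES: Dict[str, Tuple[int, int]] = {
    "Q1": (48, 0),
    "Q2": (42, 6),
    "Q3": (38, 10),
    "Q4": (34, 14),
    "Q5": (27, 21),
    "Q6": (37, 11),
    "Q7": (38, 10),
    "Q8": (34, 14),
    "Q9": (32, 16),
    "Q10": (35, 13),
}


def yes_share(question: str) -> float:
    """Percentage of respondents who answered Yes to the given question."""
    yes, no = RESPONSES[question]
    return 100.0 * yes / (yes + no)


def totals_consistent() -> bool:
    """Check that every row of the table adds up to the number of participants."""
    return all(yes + no == TOTAL_PARTICIPANTS for yes, no in RESPONSES.values())


if __name__ == "__main__":
    for question in RESPONSES:
        print(f"{question}: {yes_share(question):.1f}% yes")
```

These shares use all 48 respondents as the base, so they are not directly comparable with statements that take only the 38 students who attended a MOOC as the base.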
1. Discussion based on Table 1

(a) Do you make efforts for self-learning?
All students responded positively, indicating their awareness of improving their knowledge and skills through self-learning alongside regular studies.

(b) Do you consider MOOC as a platform for self-learning?
Students may face several issues, such as the non-availability of smart devices and low Internet connectivity, due to which six students responded in the negative.

(c) Have you attended any MOOC courses?
A few students have still not attended any MOOC courses. This may be due to the reasons mentioned under Q2, i.e., the non-availability of smart devices or low Internet connectivity, or to the lack of an interesting course in the subject they want to study.

(d) Did you get the certificate on completion?
Out of the 38 students who enrolled in MOOCs, 34 were able to earn a course certificate. This is the most important and crucial finding of this research work. One accepted criterion for the success of MOOCs is the completion rate. Earning a certificate in any MOOC improves knowledge, as it requires attempting assignments and passing the assessment process.

(e) Were you able to finish all courses for which you got enrolled in MOOC?
Out of the 38 students who attended MOOCs, 21 were not able to complete all the courses in which they enrolled. It is true that completing any course
requires a lot of effort, attention, and time alongside regular studies. Timely completion of assignments and appearing for the assessment or completing a project are crucial to earning a certificate. The responses show a positive trend of self-learning and indicate the enthusiasm of students.

(f) Did your knowledge enhanced in the courses where you were awarded certificate?
Out of the 38 students who attended at least one MOOC, 37 responded affirmatively, implying that the course content available on online platforms provided additional knowledge and that students benefit from attending such courses.

(g) Were you able to learn the expected content through MOOC certification?
Some students are very clear about the content they want to learn, whereas others start with courses that then create interest as their understanding of the subject matter grows. In both cases, the responses establish that students rely on online platforms for self-learning.

(h) Did your knowledge enhanced in unfinished courses?
Students were not able to finish all the courses in which they enrolled. However, attending a course gave them insight and made them aware of the value of completing it, which would help them with their regular studies.

(i) Do you consider that attaining a certificate is important for enhanced self-learning through MOOC?
Assignments, assessments, and project implementation are important for learning, as they pose challenges that reinforce learning. When learners believe they have control over their learning environment, they are more likely to take on challenges and persist with difficult tasks. MOOCs provide an appropriate environment for such learning, where efforts are not visible to others and failures can be corrected through repeated attempts.

(j) Are you required to study extra material (prescribed by MOOC Mentor) to qualify for MOOC certificate?
In most MOOCs, course instructors provide links to study material such as e-books, websites or software as supporting material to complete the assignments correctly and qualify for certification. Students develop a habit of studying beyond their regular courses. This helps them to develop the traits of a life-long learner.
2. Inference based on the discussions of Table 1
Inference from the result and discussion:
Research question 1: Are Massive Online Open Courses (MOOCS) important along with regular courses for self-learning in higher studies?
Yes. On the basis of the responses to questions 6, 7, 8 and 9, it can be inferred that MOOCs provide a significant platform for self-learning to students who are ready to make the effort.
Research question 2: Are students persistent in completing these courses? Based on the responses to questions 4, 5 and 7, it can be inferred that most of the participants were able to complete at least one course in which they enrolled. The responses show a positive trend in the course completion rate.
Research question 3: Does the completion of a course affect the knowledge of students? Responses to questions 6, 7 and 10 imply that course completion enhances the knowledge of the students.
3. Verification of the students' responses
It is estimated that only 5–10% of active students complete a course. However, owing to the constant motivation of teachers and peer learning, students not only participated but also completed the course, thus acquiring a course certificate. As a general practice, students are required to submit a copy of the online course certificate in school. Course toppers are also felicitated, which further motivates the remaining students to participate in these courses. To confirm the responses of the students, all the students were asked to submit the course certificate to the course co-ordinator, and the responses were found to be true.
6 Conclusion Based on the analysis of the survey of 48 student participants, the authors have accepted the null hypothesis that MOOC is an important platform for self-learning along with regular studies. This study considers online courses as complementary to regular courses. It does not underestimate the importance of regular schooling. It aims to bring the understanding that, in the present scenario of fast-changing computation and technologies, students must adapt to new learning environments such as online courses; with the availability of fast Internet connections and cheap smart devices, students can access the tremendous amount of knowledge available on online platforms.
Automatic Diabetes and Liver Disease Diagnosis and Prediction Through SVM and KNN Algorithms Md. Reshad Reza, Gahangir Hossain, Ayush Goyal, Sanju Tiwari, Anurag Tripathi, Anupama Bhan, and Pritam Dash
Md. R. Reza · G. Hossain · A. Goyal (B): Texas A&M University–Kingsville, Kingsville, TX, USA
S. Tiwari: Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
A. Tripathi · A. Bhan · P. Dash: Amity University, Noida, UP, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_56
Abstract Advances in data mining and machine learning methods for classification and regression open the door to identifying complex patterns in domain-sensitive data. In biomedical applications, massive amounts of clinical data are generated and collected to predict diseases. Diagnosis of diseases such as diabetes and liver disease now requires more tests, which increases the size of patient medical data. Manual exploration of this patient test data is challenging and difficult. Robust prediction of a patient's liver disease from a massive data set is an important research agenda in health science. The challenge in applying machine learning is to select the best algorithm within the disease prediction framework. The goal of this research is to select a robust machine learning algorithm that is equally applicable to diabetes prediction and to liver disease prediction. This study analyzes two machine learning approaches, the support vector machine (SVM) and K-nearest neighbors (KNN) algorithms, over two different datasets, a diabetes dataset and a liver disease dataset. It was observed that a tuned radial SVM method performed
with the highest accuracy on both tasks: 0.989 for diabetes detection and 0.910 for liver disease detection.
Keywords Disease prediction · SVM · KNN · Diabetes · Liver disease
1 Introduction Biomedical data is an important resource for medical research. The volume of patient medical data collected each day is astonishing, and it grows every minute. The prime tasks for big data analysts and researchers today are to create proper systems for convenient data manipulation and to analyze the data to draw out significant patterns for decision making. Data mining has become the modern approach to medical big data analysis, extracting significant information that yields meaningful results for disease prediction, selection of treatments, etc. Several studies have sought to improve the results of disease prediction analysis, and some have proposed new methods and frameworks for the healthcare system. Classification, clustering, and association are among the most widely used data mining techniques in medical data analysis [1]. Support vector machines (SVM) [2] and K-nearest neighbor (KNN) are two of the most important classifiers used in the medical field. Diseases such as heart disease, thyroid disease, breast cancer, and several others have been predicted using these methods [3]. This research work is based on screening medical patient data using big data methods, namely support vector machine (SVM) and parallel K-nearest neighbor (KNN) machine learning techniques, to create a disease prediction framework [4]. Predictive models are trained on the medical patient data with the above-mentioned techniques, and the trained models are then tested on patient data to determine the existence and the state of the disease. Most classification strategies adopt modern machine learning tools that are multivariate; SVM, for example, considers the combined effects of features. In many experiments they have proved more powerful in terms of classification accuracy than filter methods.
Support vector machines have proved powerful for a large number of classification problems. For binary classification, SVM constructs an optimal separating hyperplane between the positive and negative classes with the maximal margin, which is formulated as a quadratic programming problem involving inequality constraints [5]. SVMs are among the most promising machine learning algorithms, and there are many cases where they have been used successfully, for example text classification, face recognition, and bioinformatics. On these datasets, SVMs perform very well, often outperforming other traditional methods. SVMs have gained enormous popularity in statistics, learning theory, and engineering. With a few exceptions, most support vector learning algorithms have been designed for binary problems, and a few attempts have been made to generalize SVM to multiclass problems. Here, support vector machine and
K-nearest neighbor are used as classification algorithms in order to examine their performance on two datasets. The first is a diabetes diagnosis dataset; the binary-valued diagnostic variable in this dataset indicates whether a patient shows signs of diabetes according to World Health Organization criteria. The second is the BUPA Liver Disorders dataset, collected from the University of California at Irvine (UCI) Machine Learning Repository. This paper analyzes medical patient data using the K-nearest neighbor (KNN) and support vector machine (SVM) algorithms to determine the existence and state of the disease using several parameters. Section 2 discusses the background. The proposed methodology is discussed in Sect. 3, Sect. 4 presents the results, and finally Sect. 5 concludes the paper.
2 Background The trend in the healthcare sector is shifting from cure to disease prevention because of the rapid growth in data. Researchers have concentrated on improving the reliability and efficiency of this sector to limit the cost of healthcare and to deliver better solutions to patients. Clinics and hospitals are hubs for medical data such as patient histories, report analyses, and medical images. Coping with huge volumes of unstructured data and many missing values has been a massive challenge. Enhancing existing data mining and machine learning methodologies can create opportunities for disease prevention and the prescription of medicines. Present machine learning methods require a great deal of time to train on and analyze vast amounts of data. Likewise, it is not possible to store and process massive data on a single machine. Consequently, existing methodologies must be parallelized and adapted to handle vast collections of data with speedy, reliable performance.
2.1 Big Data Essential information can be generated from raw data by applying machine learning methods [6]. Analysis of big data can bring massive transformation to the healthcare sector. Big data analytics helps in processing data and extracting hidden meaning from it. This analytical approach can be used to limit the training time on a large amount of data. Big data analysis applies various data mining and machine learning approaches to analyze and process raw data and to predict conclusions from it. Big data processing can be divided into four fundamental steps. Processing large amounts of data gathered from different sources, possibly in various formats, can be difficult. Because the data are unstructured, a database management system cannot be used to generate meaningful information. In the first step, we accumulate data that is
produced from various sources and store it in one common place. The dataset is then divided into training and testing sets. Machine learning techniques are then used to carry out intelligent data analysis and to generate reports from the processed data.
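The division into training and testing sets can be sketched as follows. This is a minimal illustration in plain Python, not the procedure used in the study: the function name `train_test_split`, the 30% test fraction, the fixed seed, and the toy records are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (feature vector, class label)
records = [((float(i), float(i % 3)), i % 2) for i in range(10)]
train, test = train_test_split(records, test_fraction=0.3, seed=42)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that different classifiers can be compared on exactly the same held-out records.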
2.2 Data Mining Data mining and Knowledge Discovery in Databases (KDD) are often treated as the same thing. In reality, data mining is one of the vital parts of KDD [7]. Fayyad et al. illustrated that KDD comprises several stages: data accumulation, followed by data preprocessing and data transformation into the desired format, followed by data mining, where applicable methods are applied to generate meaningful outcomes. Different data mining techniques have been compared to find the best classifier for predicting heart disease patients [8], proposing an approach to heart disease diagnosis using data mining techniques. Classifiers, namely ID3, CART, and DT, were used for the analysis of patients with heart disease.
2.3 Classification Classification groups a set of data into classes. The classification procedure predicts the target class for every data item. For instance, a patient can be labeled as “having the disease” or “safe” based on the pattern of their disease and the method of classification. Two well-known forms of classification are binary and multiclass. In binary classification, there are two possible classes, for example “high”- and “low”-risk patients, while multiclass classification involves more than two classes, for instance “high”-, “medium”-, and “low”-risk patients. Data sets are divided into training and testing sets. The classifier is trained on the training data set, and its effectiveness is measured on the testing data set. Classification is a well-known data mining process in the healthcare sector.
2.4 Support Vector Machine The idea of SVM originates from statistics. SVMs were at first created for binary classification; however, they can be efficiently extended to multiclass problems. The support vector machine classifier constructs a hyperplane in the original input space that separates the data points. SVM has numerous attractive features, and because of this it is gaining popularity and shows promising empirical performance. SVM is among the most well-known methods used for classification in the healthcare sector. Fei analyzed a particle swarm optimization SVM (PSO-SVM)
method for diagnosing arrhythmia cordis, and Huang et al. built a model for breast cancer diagnosis using a hybrid SVM-based system [7]. Avci proposed a system using a genetic SVM classifier for analyzing heart valve disease. This system extracts the important features and classifies signals from ultrasound of the heart valve [9]. An SVM based on PSO was developed by Abdi et al. for diagnosing erythemato-squamous diseases; it comprises two stages. In the first stage, the optimal features are extracted by means of association rules, and in the second stage, PSO is used to find the best parameters.
2.5 Parallel SVM Running the support vector machine algorithm in parallel over samples of the dataset yields faster and more accurate predictions than regular support vector machine predictions. This approach is implemented by the parallel SVM package, which consists of two main functions: parallel SVM, which returns a list of support vector machine models, and predict, which returns the average of all predictions when the probability option is true. This package is often referred to as the parallel SVM extension. Parallel incomplete Cholesky factorization (PICF) is the vital step of PSVM. Unlike traditional ICF, which is column based, PICF is row based, allowing simultaneous factorization across machines and making a significant impact on big data sets [10]. Thus, PSVM loads the training data set in parallel onto the machines and decreases memory usage by factorizing the kernel matrix. SVM parallelization can be done both implicitly and explicitly. Explicit parallelization has often been regarded as the more effective, but experiments suggest that implicit parallelization can be more effective [11]. One drawback of implicit parallelization for SVM is its high memory usage, which reduces the size of the basis vector set.
2.6 K-Nearest Neighbor The K-nearest neighbor (KNN) classifier is one of the simplest classifiers; it classifies an unknown data point using an already labeled data set. KNN classifies a data point using more than one nearest neighbor. Cluster analysis, image processing, healthcare, and pattern recognition all make use of KNN. Overall, KNN is considered one of the easiest methods to implement, and its training phase completes quickly. However, its drawbacks include large storage requirements, slow testing speed, and sensitivity to noise. Shouman et al. used the KNN classifier to analyze patients suffering from heart disease [12]. The dataset was gathered from UCI, and tests were performed using KNN with and without voting; it was found that KNN achieves better accuracy without voting than with voting in the diagnosis of heart disease.
2.7 Parallel KNN The K-nearest neighbor (KNN) algorithm is widely used in data mining and classification, but it often runs into problems with large data sets. This issue is addressed by a three-layer parallel architecture [13], which also helps to reduce the computational time for epidemiological analysis. It consists of a parallel KNN module that performs KNN classification and cross-validation by means of MPI and POSIX threads. Several experiments were performed with this three-layered parallel KNN tool. The results were produced faster than when the same data were processed with WEKA, which is promising for the future of data mining. The tool can be used in the day-to-day healthcare sector for biomedical prediction, and automatic classification of recorded daily care data can be handled with this technique.
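The idea of distributing KNN queries over workers can be sketched with Python's standard thread pool. This is only a structural illustration under stated assumptions: the cited tool uses MPI and POSIX threads, whereas the sketch below uses `concurrent.futures.ThreadPoolExecutor`, and the function names and toy data are hypothetical. In CPython, threads will not speed up this CPU-bound loop; the point is that each query can be classified independently.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def parallel_knn_predict(train, queries, k, workers=4):
    """Classify each query independently in a pool of worker threads.

    Executor.map yields results in the same order as `queries`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda q: knn_predict(train, q, k), queries))


# Hypothetical toy data: (feature vector, label)
train = [((0.0, 0.0), "healthy"), ((0.1, 0.2), "healthy"), ((0.2, 0.1), "healthy"),
         ((1.0, 1.0), "disease"), ((0.9, 1.1), "disease"), ((1.1, 0.9), "disease")]
queries = [(0.05, 0.1), (1.0, 0.95), (0.15, 0.15)]
print(parallel_knn_predict(train, queries, k=3))
```

Because the training set is only read, never modified, the workers need no locking; the same decomposition carries over to processes or MPI ranks.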
3 Methodology A research method is a systematic plan for doing research. It can be defined as a scientific and systematic search for relevant information on a specific topic, and an attempt to solve a specific problem in that domain using that information. A methodology is presented for applying machine learning techniques, namely support vector machine (SVM) and K-nearest neighbor (KNN), to reliable data sets in order to predict whether or not a person is affected by liver disease. To reach our goal, we went through several phases: collecting data, analyzing data, cleaning and pruning data, developing the algorithms, implementing the algorithms, checking the accuracy of the algorithms, and determining the best algorithm for predicting the disease. The overall process is shown as a block diagram in Fig. 1.
3.1 Data Collection The datasets used in this research were all reliable data, collected from the University of California, Irvine, Machine Learning Repository (WWW.UCI.Com). The data are real-world data collected from actual patients. They have been used previously in various experiments because of their originality and reliability. The diabetes dataset has eight attributes (Pregnancies, PG Concentration, Diastolic BP, Tri-Fold Thick, Serum Ins, BMI, DP Function, Age), consisting of numeric and binary data for 768 patients from Phoenix, Arizona, USA.
Fig. 1 Disease prediction framework
The BUPA liver disorder dataset has 345 observations and several numeric attributes: Mcv (mean corpuscular volume), Alkphos (alkaline phosphatase), Sgpt (alanine aminotransferase), Sgot (aspartate aminotransferase), Gammagt (gamma-glutamyl transpeptidase), and Drinks (number of half-pint equivalents of alcoholic beverages drunk per day). Note that the first five variables are blood test results related to high alcohol intake, which is significant for liver disorders.
3.2 Implementation of Algorithms Machine learning algorithms were implemented for improved efficiency and accuracy. Machine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. In this process, an algorithm is developed in which, based on the given input and previously stored data, the machine tries to predict the output as correctly as possible. Data classification is a two-stage method: the first is the training stage, in which the classifier is trained on the dataset, and the second is the stage in which the model is used for classification and its performance is evaluated on the test data. In SVM, a predictor variable is called an attribute, and a transformed attribute used to define the hyperplane is called a feature. The set of features describing one case is called a vector. The ultimate goal of this modeling is to find a hyperplane that separates the vectors such that cases with the two different values of the target variable lie on either side of the plane. The vectors closest to the hyperplane are called support vectors. Figure 2 illustrates the implementation of the algorithm. Here, we have used both linear and nonlinear (RBF kernel) SVM. The two SVM kernels use different functions for classification, given in the following equations. The linear kernel is described in Eq. (1):

$$K(X_i, X_j) = X_i^{T} X_j \qquad (1)$$

The radial basis function (RBF) kernel is described in Eq. (2):

$$K(X_i, X_j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^{2}\right), \quad \gamma > 0 \qquad (2)$$
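The two kernels of Eqs. (1) and (2) can be evaluated directly. The following is a minimal sketch in plain Python; the function names and the sample vectors are illustrative assumptions, and the γ values are chosen only for the example.

```python
import math


def linear_kernel(x_i, x_j):
    """Eq. (1): inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(x_i, x_j))


def rbf_kernel(x_i, x_j, gamma):
    """Eq. (2): exp(-gamma * squared Euclidean distance), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


x1 = (1.0, 2.0)   # hypothetical feature vectors
x2 = (3.0, 4.0)
print(linear_kernel(x1, x2))           # 1*3 + 2*4 = 11.0
print(rbf_kernel(x1, x2, gamma=0.5))   # exp(-0.5 * 8) = exp(-4)
print(rbf_kernel(x1, x1, gamma=0.5))   # identical points give 1.0
```

A larger γ makes the RBF kernel decay faster with distance, so each training point influences a smaller neighborhood; this is why γ is one of the parameters that is tuned together with the cost parameter C.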
K-nearest neighbors is easy to understand analytically. Its high adaptability to the data points yields reliable, nonlinear, and flexible decision boundaries. The training samples are described by n numerical attributes, so each sample is a point in an n-dimensional space, and all training samples are stored in this space. When a test sample is given, the K-nearest neighbor classifier searches for the K training samples that are nearest to the newly introduced sample. Figure 3 shows a diagram of the KNN algorithm.
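The search for the K nearest training samples followed by a majority vote can be sketched as below. This is an illustrative plain-Python version, assuming Euclidean distance and simple majority voting (the text does not state the distance measure or the tie-breaking rule); the function name and the toy samples are hypothetical.

```python
import math
from collections import Counter


def knn_classify(samples, query, k):
    """Return the majority label among the k training samples nearest to `query`.

    `samples` is a list of (feature_vector, label) pairs. Distance is Euclidean.
    On a tied vote, the label whose neighbor is met first (closest) wins.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    by_distance = sorted(samples, key=lambda s: math.dist(s[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-attribute training samples
samples = [((1.0, 1.0), "neg"), ((1.2, 0.8), "neg"), ((0.9, 1.1), "neg"),
           ((3.0, 3.2), "pos"), ((3.1, 2.9), "pos"), ((2.8, 3.0), "pos")]
print(knn_classify(samples, (1.1, 1.0), k=3))
print(knn_classify(samples, (3.0, 3.0), k=3))
```

All work happens at query time, which matches the trade-off described for KNN: training is trivial, but every prediction scans the stored training set.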
4 Results Table 1 shows the comparison of the performance metrics for all the machine learning algorithms for the diabetes and liver disease detection. It can be observed that tuned radial SVM method has the highest accuracy for both the diabetes and liver disease detection.
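The four metrics reported in Table 1 can be derived from a confusion matrix. The sketch below assumes the usual definitions (the paper does not spell them out, nor which class is treated as positive), and the counts are hypothetical, not taken from the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and prevalence from counts.

    tp/fn: positive cases predicted positive/negative;
    tn/fp: negative cases predicted negative/positive.
    Assumes at least one positive and one negative case.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),      # share of positive cases detected
        "specificity": tn / (tn + fp),      # share of negative cases recognized
        "prevalence": (tp + fn) / total,    # share of positive cases in the data
    }


# Hypothetical counts for illustration only
metrics = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reading accuracy together with sensitivity and specificity matters when the classes are imbalanced, since a classifier can reach high accuracy while missing most cases of the rarer class.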
Fig. 2 Implementation of the SVM algorithm
5 Conclusion In this study, machine learning algorithms, namely linear SVM, nonlinear SVM with RBF kernel, and K-nearest neighbor, were applied to the diabetes and liver disorder datasets to predict disease. The tuned radial SVM algorithm had the highest accuracy for both diabetes and liver disease detection. Medical factors such as hypertension, blood pressure, high cholesterol, and polycystic ovary syndrome (for females) can also be considered in the predictive analysis of diabetes in future research work.
Fig. 3 Implementation of the K-nearest neighbor algorithm
Table 1 Comparison of the algorithms for disease detection

Algorithm        Parameters         Accuracy   Specificity   Sensitivity   Prevalence
Diabetes disease prediction results
Linear SVM       -                  0.785      0.876         0.62          0.35
Radial SVM       C = 1, G = 0.125   0.8242     0.8213        0.8253        0.7305
Tuned rad SVM    C = 10, G = 0.5    0.9896     0.9924        0.9881        0.6562
KNN              K = 9              0.7879     0.7794        0.7914        0.7056
Liver disease prediction results
Linear SVM       -                  0.70       0.61          0.75          0.58
Radial SVM       C = 1, G = 0.167   0.8435     0.9250        0.8328        0.7449
Tuned rad SVM    C = 10, G = 0.5    0.9101     1.0000        0.8924        0.9462
KNN              K = 10             0.7961     0.3846        0.9351        0.8544
Acknowledgements This research herein was performed at the Department of Electrical Engineering and Computer Science at Texas A&M University–Kingsville [Reza 2018].
References
1. Mishra, S., Sagban, R., Yakoob, A., Gandhi, N.: Swarm intelligence in anomaly detection systems: an overview. Int. J. Comput. Appl. 1–10 (2018)
2. Rahul, M., Kohli, N., Agarwal, R., Mishra, S.: Facial expression recognition using geometric features and modified hidden Markov model. Int. J. Grid Util. Comput. 10(5), 488–496 (2019)
3. Gaurav, D., Tiwari, S.M., Goyal, A., Gandhi, N., Abraham, A.: Machine intelligence-based algorithms for spam filtering on document labeling. Soft Comput. 1–14 (2019)
4. Seema, K., Bomare, D.S., Vaishnavi, N.: Heart disease prediction using KNN based handwritten text. In: AISC 2016, pp. 49–56
5. Suykens, J.A., De Brabanter, J., Lukas, L., Vandewalle, J.: Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing 48(1–4), 85–105 (2002)
6. Divya, K.S., Bhargavi, P., Jyothi, S.: Machine learning algorithms in big data analytics. Int. J. Comput. Sci. Eng. (2018). https://doi.org/10.26438/ijcse/v6i1.6370
7. Tomar, D., Agarwal, S.: A survey on data mining approaches for healthcare. Int. J. Bio-Sci. Bio-Technol. 5(5), 241–266 (2013)
8. Chaurasia, V., Pal, S.: Early prediction of heart diseases using data mining techniques. Carib. J. Sci. Technol. 1, 208–217 (2013)
9. Avci, E.: A new intelligent diagnosis system for the heart valve diseases by using genetic-SVM classifier. Expert Syst. Appl. 36(7), 10618–10626 (2009)
10. Chang, E.Y.: PSVM: parallelizing support vector machines on distributed computers. In: Foundations of Large-Scale Multimedia Information Management and Retrieval, pp. 213–230. Springer, Berlin (2011)
11. Tyree, S., Gardner, J.R., Weinberger, K.Q., Agrawal, K., Tran, J.: Parallel support vector machines in practice. arXiv preprint arXiv:1404.1066 (2014)
12. Shouman, M., Turner, T., Stocker, R.: Applying k-nearest neighbour in diagnosing heart disease patients. Int. J. Inf. Educ. Technol. 2(3), 220–223 (2012)
13. Aparicio, G., Blanquer, I., Hernández, V.: A parallel implementation of the K Nearest Neighbours classifier in three levels: threads, MPI processes and the grid (2007). https://doi.org/10.1007/978-3-540-71351-7_18
Stationarity and Self-similarity Determination of Time Series Data Using Hurst Exponent and R/S Ration Analysis Anirban Bal, Debayan Ganguly, and Kingshuk Chatterjee
Abstract Time series data is highly varying in nature. Determining how predictable the data is is necessary to describe it. Self-similarity and stationarity are the key tools for determining this property. In this paper, visual and quantitative results measuring the predictability of time series data are obtained by rescaled ratio (R/S) analysis and the Hurst exponent. We use several transformations and scalings to cope with the noise and vastness of stock data. Case-based studies are done on various kinds of stocks from the Bombay Stock Exchange to establish the necessity of the R/S ratio together with the Hurst exponent. From the results of this study, an inference is drawn about the nature of the stocks. The predictability is quantified depending on the values of the Hurst exponent and Hurst co-efficient. Another factor, named the roughness factor, is included for analyzing the result of the R/S ratio.
Keywords Hurst exponent · Rescaled ratio (R/S) analysis · Roughness factor
A. Bal · D. Ganguly (B): Government College of Engineering and Leather Technology, Block-LB, Sector-III, Kolkata 700106, India
K. Chatterjee: Government College of Engineering and Ceramic Technology, Kolkata 700010, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_57
1 Introduction The stationarity of a time series is traditionally determined using the mean, standard deviation and covariance of the data. These traditional methods have some major drawbacks. For large datasets, the standard deviation and covariance become large even when the deviation is low [1]. A monotonic graph has good self-similarity, but traditional methods cannot establish that [2]. There are three types of test for the stationarity of time series data. They are parametric testing,
semi-parametric testing and nonparametric testing. Most of the proposed models are stochastic in nature, whereas the Hurst exponent, Hurst co-efficient and rescaled ratio analysis give quantitative results that express stationarity and self-similarity. The Dickey–Fuller test and the augmented Dickey–Fuller test determine stationarity by means of visual effects [3]. In [4], plotting the autocorrelation function for increasing lag shows the stationarity of time series data efficiently. With the autocorrelation function, however, when the disturbance term exhibits serial correlation, the values and the standard errors of the parameters are erroneous. The Fourier transform method also determines stationarity and periodicity, which in turn describe the self-similarity in a graph [5], but it requires a large amount of data in comparison with methods employing the Hurst exponent. Nonparametric tests can prove the stationarity of only a small sub-class of time series. Delft et al. [6] suggested a nonparametric stationarity test limited to functional time series, whereas the Hurst exponent and R/S ratio can deal with any kind of dataset, including those where readings are taken at unequal time intervals. Delft and Eichler [7] proposed a test for local stationarity of functional time series, but the Hurst exponent and R/S ratio work on long-term memory, where local roughness cannot change the stationarity of a stock price. The Hurst method is more time-efficient than parametric and semi-parametric methods. The roughness of a graph sometimes determines its stationarity [8]. The roughness of the graph can be reduced by a Fourier transform [9]. For periodic curves, the Holt-Winters seasonal method [10] or an auto-regressive method works very well. R/S ratio analysis is done to check the self-similarity of a graph [8]. The measurement of self-similarity can be shown quantitatively by fractal dimension analysis [10].
There is a relationship between the R/S ratio and the roughness exponent or, more specifically, the Hurst exponent [11]. The Hurst exponent of a series can show the predictability of the time series data [12], but this value alone cannot give the best fit with self-similarity. Wavelet transform analysis and the visibility graph algorithm are used to show the authenticity of the value of the Hurst exponent [13]. In [11], rescaled ratio (R/S) analysis is done, and it is shown that the Hurst exponent is more consistent in fractal markets. The results show that the Hurst exponent gives better correlation and persistence in the capital market. In this paper, the stationarity and the self-similarity of stocks are checked on the basis of the Hurst exponent and R/S ratio analysis, which are based on long-term memory. Here, we consider the vastness of the stocks, which cannot be handled by the stochastic processes for checking stationarity. We use active and suspended stocks in the A and Z categories to establish the efficacy of the Hurst exponent and R/S ratio analysis in checking the stationarity and self-similarity of a stock. The results show that the R/S ratio stays the same with respect to the time interval. If the R/S ratio meets the time-line at any point, the graph is running similar to the time at that point.
Stationarity and Self-similarity Determination of Time Series Data Using Hurst …
Table 1 Nature of stock versus Hurst exponent relation table
Nature of stock     Hurst exponent range
Anti-persistent     0 < H < 0.5
Brownian motion     H = 0.5
Persistent          0.5 < H < 1
Self-similar        H = 1
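The ranges in Table 1 can be read as a simple lookup. The following is a minimal sketch, not part of the original study; the function name `classify_hurst` and the tolerance `tol` used to compare floating-point values with 0.5 and 1 are assumptions made for illustration.

```python
def classify_hurst(h, tol=1e-9):
    """Map a Hurst exponent H to the nature-of-stock label of Table 1.

    `tol` is a hypothetical tolerance for the two point cases
    (H = 0.5 and H = 1), since measured values are floats.
    """
    if h <= 0.0 or h > 1.0 + tol:
        raise ValueError("Table 1 only covers 0 < H <= 1")
    if abs(h - 0.5) <= tol:
        return "Brownian motion"
    if abs(h - 1.0) <= tol:
        return "self-similar"
    return "anti-persistent" if h < 0.5 else "persistent"


# Two of the Hurst exponents reported in Table 2.
print(classify_hurst(0.7408))  # Infosys LTD
print(classify_hurst(0.3439))  # Mahaveer Infotech
```

In practice an estimated H is never exactly 0.5 or 1, so a wider tolerance or a confidence interval would probably be more useful than the tight default chosen here.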
Table 2 Hurst exponent, Hurst co-efficient and the stationarity determined by Bombay Stock Exchange
Stock name               Hurst exponent   Hurst co-efficient   Stationarity (BSE)
Infosys LTD              0.7408           0.6163               Active A
Wipro LTD                0.9014           0.3924               Active A
Ajel                     0.9494           0.4878               Active Z
Mahaveer Infotech        0.3439           11.0088              Active Z
Sterling International   0.9936           1.4773               Suspended Z
Healthfore Pvt. Ltd.     0.8219           0.6727               Suspended Z
2 Preliminaries
The Hurst exponent is used as a measure of the long-term memory of a time series. It relates to the autocorrelations of the time series and the rate at which these decrease as the lag between pairs of values increases (Tables 1 and 2). The Hurst exponent (H) works on long-term data and its self-similarity. The range of H is given in Table 1.
RS-Ratio is an indicator that measures the trend for relative performance. Similar to the price relative, RS-Ratio uses ratio analysis to compare one security against another (usually the benchmark). It is designed to define the trend in relative performance and measure the strength of that trend. E(R/S) is the expectation of the R/S ratio over the whole time series, which represents the stationarity measure of the data. The R/S ratio can be found as follows.
1. Calculate the mean,
$$m = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{1}$$
2. Create a mean-adjusted series,
$$Y_t = X_t - m \quad [\text{for } t = 1, 2, \ldots, n] \tag{2}$$
3. Calculate the cumulative deviate series Z,
$$Z_t = \sum_{i=1}^{t} Y_i \quad [\text{for } t = 1, 2, \ldots, n] \tag{3}$$
4. Compute the range R,
$$R(n) = \max(Z_1, Z_2, \ldots, Z_n) - \min(Z_1, Z_2, \ldots, Z_n) \tag{4}$$
5. Compute the standard deviation S,
$$S(n) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - m)^2} \tag{5}$$
There is a relationship between the R/S ratio and the roughness exponent or, more specifically, the Hurst exponent,
$$E\left[\frac{R(n)}{S(n)}\right] = C n^{H} \quad \text{as } n \to \infty \tag{6}$$
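Steps 1 to 5 translate almost line by line into code. The following is a minimal sketch using only the Python standard library; the function name `rescaled_range` is an assumption, and the handling of a constant series (where S is zero) is a choice made here, not something the text specifies.

```python
import math


def rescaled_range(x):
    """Return R(n)/S(n) for one series, following Eqs. (1)-(5)."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    m = sum(x) / n                              # Eq. (1): mean
    y = [v - m for v in x]                      # Eq. (2): mean-adjusted series
    z, total = [], 0.0
    for v in y:                                 # Eq. (3): cumulative deviate series
        total += v
        z.append(total)
    r = max(z) - min(z)                         # Eq. (4): range
    s = math.sqrt(sum(v * v for v in y) / n)    # Eq. (5): standard deviation
    if s == 0.0:
        raise ValueError("constant series: R/S is undefined")
    return r / s


# For [1, 2, 3, 4, 5]: m = 3, Z = [-2, -3, -3, -2, 0], R = 3, S = sqrt(2).
print(rescaled_range([1, 2, 3, 4, 5]))
```

Equation (6) concerns the expectation of this quantity over windows of length n, so in a full analysis the function would be applied to many windows of several lengths rather than to the whole series once.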
It can be shown that for a self-similar graph $E[R/S] = Cn^{H}$; a larger difference between the two quantities reduces the chance of predictability. Active stocks are stocks that stay in the set of top 30 stocks for a long period. Suspended stocks are stocks that have recently been removed because of random or strictly downward movement. 'A' category stocks are those that perform well consistently. 'Z' category stocks are suspect stocks that still reside in the top 30 but can be suspended at any time.
Roughness is the deviation, in the direction of the normal vector, of a real surface from its ideal form. Volatility of the market causes the roughness. The roughness of a graph sometimes determines its stationarity [8]. Roughness can be represented as
$$|h(x + r) - h(x)| \sim (mr)^{\alpha} \tag{7}$$
where
• h(x) is the surface profile
• r is the roughness factor
• m is the slope of the graph at the point x
• α is the roughness exponent.
However, from the definition of a self-affine surface,
$$\frac{\partial h}{\partial x} \sim \frac{\partial \left[\varepsilon^{-\alpha} h(\varepsilon x)\right]}{\partial x} = \varepsilon^{-\alpha} \frac{\partial h(\varepsilon x)}{\partial x} = \varepsilon^{1-\alpha} \frac{\partial h}{\partial x'} \tag{8}$$
A larger value of α represents a smoother local surface profile. Surfaces with different values of α are depicted in Fig. 1. Note that α lies in the range 0 ≤ α ≤ 1. To derive this condition, consider two length scales, x and x' = εx. The surface slope on each of these length scales is approximately given by ∂h/∂x and ∂h/∂x'. If ε ≥ 1, the x' length scale is more stretched out than the x length scale, which implies that the surface slope on the x' length scale is smaller than the surface slope on the x length scale. To satisfy this requirement, (8) gives $\varepsilon^{1-\alpha} \geq 1$ for ε ≥ 1, so that 1 − α ≥ 0, or α ≤ 1. In addition, for (7) to be physical in the limit r → 0, we need $r^{\alpha} \to 0$, which gives α ≥ 0. In the specific case where α = 1, the surface is said to exhibit self-similar scaling because the scale factors in the horizontal and vertical directions are equal. This scaling behavior is reminiscent of the definition of a fractal.
Fig. 1 Surface of curve for different types of α
3 Models and Methodologies
In this study, we use some predefined states of stocks, such as active or suspended and the A or Z category. These details can be found on the site of BSE. 'A' category stocks are the most promising, and 'Z' category stocks are suspected to go downward. We observe that a stock that BSE has marked as 'A' category with active status has a Hurst exponent >0.5 and a smaller difference between E[R/S] and $Cn^{H}$. A suspended stock with a decreasing trend, however, also shows a Hurst exponent >0.5. In that case, we look at the R/S ratio graph. If the R/S ratio is self-similar in nature, the trend is strictly decreasing; if the R/S ratio analysis gives dissimilar results, the close price movement graph shows local roughness, e.g., the Sterling International day-wise data. In the R/S-HE graph for those data, the difference between the Hurst exponent and the R/S ratio can be visualized. If a graph shows a lower Hurst value and a dissimilar trend in the R/S ratio graph, it is fully unpredictable; stocks of this kind, such as Mahaveer Infotech, are also removed from the Sensex list. The case-based analysis is given in the following graphs and table.
The following algorithm is implemented to find the stationarity of the graph. First, the outliers of the data are detected using the boxplot and piano-plot of the data. Then, the data is cleaned, and the Fourier transformation is applied to convert it to the frequency domain. The Hurst exponent algorithm is applied to the transformed dataset. The algorithm for finding the R/S ratio is implemented, and the R/S series is plotted against the time interval. To compare the stationarity of two stocks, the Hurst exponent plot and the R/S ratio plot are drawn in a single graph.
In the following graphs, all data is plotted as time interval versus R/S ratio, and the analysis is given with the corresponding graph. The blue lines indicate the values of E[R(n)/S(n)], and the purple dots indicate $Cn^{H}$. The time interval is plotted on the X-axis to show the change of those features over time. The algorithm incorporates the R/S ratio, which is the ratio of the rescaled range to the standard deviation. The difference between the time graph and the R/S ratio indicates the stationarity of the series: as the difference becomes larger, the series becomes less stationary.
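The first step of the algorithm, outlier detection with a boxplot, can be sketched numerically. A boxplot conventionally flags points beyond 1.5 times the interquartile range from the quartiles; that convention, the function name `boxplot_outliers` and the factor `k` are assumptions here, since the paper does not state which rule was applied.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values lying outside the usual boxplot whiskers.

    Assumes the common rule: a point is an outlier if it is below
    Q1 - k*IQR or above Q3 + k*IQR, with k = 1.5 by default.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical close prices with one obviously wrong reading.
prices = [10, 11, 12, 11, 10, 12, 11, 100]
print(boxplot_outliers(prices))
```

How the flagged points are then treated (dropped, clipped or interpolated) is a separate cleaning decision that the text leaves open.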
4 Experimental Results
The day-wise datasets were collected from the site of the Bombay Stock Exchange for the period from January 1, 1998, to February 25, 2019. The datasets have many features, such as 'Open Price', 'Close Price', 'High Price', 'Low Price', 'Deliverable/Traded Quantity' and many more. Here, the date and close price are used for calculating the stationarity and self-similarity. One or two stocks are collected from each category defined by BSE. The proposed methodology and the outcomes are shown below.
The proposed model is implemented in the Python programming language, using the Python library named 'hurst'. Its 'random_walk' function is used to create a random walk, and its 'compute_Hc' function is used to compute the 'H' and 'C' values of the Hurst exponent model. A 'Hurst' library is also available for MATLAB and the R programming language. E[R(n)/S(n)] and $Cn^{H}$ are then plotted independently within one graph, which indicates the predictability or self-similarity of a graph more efficiently (Graphs 3 and 4).
Active A Infosys
In the R/S ratio graph of Infosys, it is clear that the data has a very low difference between E[R(n)/S(n)] and $Cn^{H}$. It has a good Hurst exponent value. So, the graph is predictable in nature (Graphs 6 and 8).
Active A Wipro
It has a higher R/S ratio difference than Infosys, but it has a higher Hurst value. From this and the plots of Infosys and Wipro, we can see that the Infosys graph has more roughness than the Wipro graph. So, although Wipro has a greater difference between the R/S value and the Hurst value, it is more predictable in nature than Infosys. Since the Wipro graph has less local roughness than the Infosys graph, Wipro gives a better Hurst exponent value (Graphs 10 and 12).
In the above graphs, blue lines indicate the value of the R/S ratio and green lines indicate the value of $Cn^{H}$. These are the Active-A category graphs, which have been included in the Sensex30 for a long time.
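The H and C values discussed in this section come from fitting E[R(n)/S(n)] = Cn^H. As a rough illustration of that fit, the sketch below regresses log(R/S) on log(n) over non-overlapping windows using only the standard library. It is not the implementation of the 'hurst' package: the helper names, the choice of window sizes and the use of synthetic Gaussian increments instead of BSE close prices are all assumptions.

```python
import math
import random


def rescaled_range(x):
    """R/S of one window; returns 0.0 for a constant window."""
    n = len(x)
    m = sum(x) / n
    z, total = [], 0.0
    for v in x:
        total += v - m
        z.append(total)
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return (max(z) - min(z)) / s if s > 0 else 0.0


def estimate_hurst(increments, window_sizes):
    """Least-squares fit of log E[R/S] = log C + H log n; returns (H, C)."""
    log_n, log_rs = [], []
    for n in window_sizes:
        # Split the series into non-overlapping windows of length n.
        windows = [increments[i:i + n]
                   for i in range(0, len(increments) - n + 1, n)]
        values = [v for v in map(rescaled_range, windows) if v > 0]
        if values:
            log_n.append(math.log(n))
            log_rs.append(math.log(sum(values) / len(values)))
    if len(log_n) < 2:
        raise ValueError("need at least two usable window sizes")
    mean_x = sum(log_n) / len(log_n)
    mean_y = sum(log_rs) / len(log_rs)
    sxx = sum((a - mean_x) ** 2 for a in log_n)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_n, log_rs))
    h = sxy / sxx                          # slope of the log-log line
    c = math.exp(mean_y - h * mean_x)      # intercept gives log C
    return h, c


# Synthetic stand-in for a series of daily price changes.
rng = random.Random(42)
noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
h, c = estimate_hurst(noise, [16, 32, 64, 128, 256, 512])
print(f"H = {h:.3f}, C = {c:.3f}")
```

For uncorrelated increments the estimate should fall near 0.5, although R/S estimates on short windows are known to be biased somewhat upward, so the exact figure depends on the window sizes chosen.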
Graph 1 Time versus R/S ratio of Infosys. This is indicating less difference from the time interval. So, it is stationary
Graph 2 Time versus R/S ratio of Wipro
Graph 3 Comparison of close price of Infosys and Wipro
Graph 4 Comparison of Hurst exponent (left) and R/S ratio between Infosys and Wipro (right)
Graph 5 Time versus R/S ratio of Sterling International
Graph 6 Close price and Hurst exponent (left) versus R/S ratio of Sterling International (right)
Graph 7 Time versus R/S ratio of Ajel. It is showing a little more difference between R/S ratio and time interval
Graph 8 Close price and Hurst exponent (left) versus R/S ratio of Ajel (right)
Graph 9 Time versus R/S ratio of Mahaveer Infotech. It is a non-stationary stock
Graph 10 Close price (left) and Hurst exponent versus R/S ratio of Mahaveer Infotech (right)
Graph 11 Time versus R/S ratio of Healthfore Pvt. Ltd. It is stationary but moves strictly downward
Graph 12 Close price (left) and Hurst exponent versus R/S ratio of Healthfore Pvt. Ltd. (right)
Sterling Z Suspended
This is one of the stocks suspended from the Sensex list, and it is also in the Z category. The R/S ratio shows that it has very little similarity with its own lagged data, and the close price curve shows little self-similarity. It has very little roughness in recent days. Its Hurst exponent also indicates that the movement of the close price in this graph is non-stationary.
Ajel Z Active
It is an active stock, but it carries a warning through its Z category. It moves in strictly decreasing order and, because of that, gives a good Hurst exponent.
Mahaveer Z Active
This is a mostly distorted graph, so it is anti-persistent in nature. The graph reflects the R/S ratio analysis.
Healthfore Z Suspended
Healthfore is one of the riskiest stocks in the BSE dataset and is already suspended because of its strictly downward movement. Both its R/S ratio and its Hurst exponent therefore show high predictability and self-similarity. The predictability can also be seen from its close price movement. The graph of close price is given below.
In the above table, we can see that a greater Hurst exponent and a lesser Hurst co-efficient indicate an active, 'A' category stock, as in Graphs 1 and 2, and a lesser Hurst exponent value (
f(m, n) then the object is considered to be a dark shade; otherwise, if T2 < f(m, n), then the object belongs to class 1.
$$T = T[m, n, p(m, n), f(m, n)] \tag{1}$$
where p(m, n) is a local property of the image at the point (m, n), and f(m, n) is the gray level at (m, n). The threshold is represented by
$$c(m, n) = \begin{cases} c(m, n) & \text{when } c(m, n) > T \\ T & \text{when } c(m, n) \leq T \end{cases} \tag{2}$$
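Read literally, Eq. (2) keeps a coefficient that exceeds the threshold and replaces every other coefficient by the threshold itself. A minimal sketch of that rule follows; the function name `apply_threshold` and the flat list of coefficients are assumptions, and the code follows Eq. (2) as printed rather than any particular wavelet library.

```python
def apply_threshold(coeffs, t):
    """Apply Eq. (2): keep c when c > T, otherwise replace it with T."""
    return [c if c > t else t for c in coeffs]


# Hypothetical coefficients from one sub-band.
print(apply_threshold([0.2, 1.5, -3.0, 0.9], 1.0))
```

With level-dependent thresholds, as described later in the paper, the same function would simply be called once per decomposition level with that level's own T.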
In the proposed method, an optimization technique is applied after the transform coefficients have been obtained, to select threshold values that allow the image to be reconstructed with high quality.
3.2 Evolutionary Algorithms
Optimization with an evolutionary algorithm is used to obtain the best-quality image. The main aim of the optimization technique is to find a better threshold, one that reduces the distortion between the original image and the reconstructed image; a technique whose thresholds yield less distortion is regarded as superior. Such optimization is achieved through ABC, PSO, etc. In this process, the objective function, also known as the fitness function, is used for the selection of optimal thresholds. It combines the entropy and the PSNR value to achieve a high CR and good image quality.
$$\text{Fitness} = k \times \text{entropy} + l / \text{PSNR} \tag{3}$$
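Reading Eq. (3) as Fitness = k × entropy + l/PSNR, the objective can be sketched as below. The paper does not say how the entropy is computed; the Shannon entropy of the thresholded coefficient values is assumed here, and the names `shannon_entropy` and `fitness` and the default weights are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits of a sequence of discrete symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fitness(symbols, psnr_db, k=1.0, l=1.0):
    """Eq. (3): weighted sum of the entropy and the reciprocal of PSNR."""
    return k * shannon_entropy(symbols) + l / psnr_db


# Two equally likely symbols give 1 bit of entropy.
print(fitness([0, 0, 1, 1], psnr_db=40.0, k=2.0, l=4.0))
```

Under this reading a smaller value is better, since low entropy favours a high compression ratio and a high PSNR shrinks the second term; whether the authors minimise or maximise the expression is not stated.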
where k and l are adjustable arbitrary parameters. The artificial bee colony (ABC) algorithm [18], depicted in Fig. 3, is an optimization algorithm that works by simulating the foraging behavior of honey bees. Applied to compression, it is treated as a multi-objective problem, since it deals with both a high compression ratio and a quality parameter. The method defines three roles based on the status of the food source: employed bees, which exploit a food source, carry nectar to the hive and dance to recruit other bees; onlooker bees, which watch the dances; and scout bees, which search the environment for new sources through internal motivation. The steps involved in the algorithm are (1) initialization, (2) evaluation of the population, (3) the employed bee phase, (4) the onlooker bee phase, (5) the scout bee phase, (6) memorizing the best food source and (7) continuation of the cycle until termination.
Fig. 3 Algorithm of artificial bee colony optimization
The food sources are initialized at random according to Eq. (4),
S. Saravanan and D. S. Juliet
$$x_{ij} = x_j^{\min} + \mathrm{rand}(0, 1)\,\left(x_j^{\max} - x_j^{\min}\right) \tag{4}$$
The probability value associated with each fitness value is defined by
$$p_i = \frac{\text{fitness}_i}{\sum_{n=1}^{SN} \text{fitness}_n} \tag{5}$$
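Equations (4) and (5) cover the initialization and the onlooker selection probabilities of ABC. The sketch below shows both in isolation, without the full employed, onlooker and scout loop; the function names, the bounds and the use of a seeded `random.Random` instance are assumptions for illustration.

```python
import random


def init_food_sources(sn, lower, upper, rng):
    """Eq. (4): draw SN food sources uniformly inside per-dimension bounds."""
    return [[lo + rng.random() * (hi - lo) for lo, hi in zip(lower, upper)]
            for _ in range(sn)]


def selection_probabilities(fitness_values):
    """Eq. (5): each source's share of the total fitness."""
    total = sum(fitness_values)
    if total <= 0:
        raise ValueError("fitness values must sum to a positive number")
    return [f / total for f in fitness_values]


rng = random.Random(0)
# Five candidate solutions, each holding two thresholds (hypothetical bounds).
sources = init_food_sources(5, lower=[0.0, 10.0], upper=[1.0, 20.0], rng=rng)
print(sources[0])
print(selection_probabilities([1.0, 3.0]))
```

Each food source here would correspond to one candidate set of level-dependent thresholds, with its fitness given by the objective function of the method.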
In the proposed method, the artificial bee colony algorithm is used to choose optimally the level-dependent thresholds applied to the decomposed coefficients, in order to achieve two important goals: a high compression ratio and the best visual image quality. The performance metrics are evaluated in comparison with the particle swarm optimization technique.
4 Performance Evaluation
Medical images of size 512 × 512 with 8 bits were collected from the online medical imaging databases Medpix and "https://www.kaggle.com/kmader/siim-medical-images". The quality of the output is measured in terms of the peak signal-to-noise ratio (PSNR), which is defined as the ratio between the maximum possible value (power) of a signal and the power of the distorting noise that affects the quality of its representation.
$$\text{PSNR} = 10 \times \log_{10} \left( \frac{255}{\sqrt{\text{MSE}}} \right)^{2} \tag{6}$$
The compression ratio is one of the important metrics when analyzing performance. It is defined as the size of the original image divided by the size of the compressed image.
$$\text{Compression ratio} = \frac{\text{Size of the original image}}{\text{Size of the compressed image}} \tag{7}$$
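Equations (6) and (7) are straightforward to compute. The sketch below uses flat lists of 8-bit pixel values and byte counts; the function names and the tiny example image are assumptions, and a real evaluation would read the pixels from the image files.

```python
import math


def mse(original, reconstructed):
    """Mean squared error between two equally sized pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def psnr(original, reconstructed, peak=255.0):
    """Eq. (6): 10 * log10(peak^2 / MSE); infinite for identical images."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)


def compression_ratio(original_bytes, compressed_bytes):
    """Eq. (7): original size divided by compressed size."""
    return original_bytes / compressed_bytes


# A 4-pixel toy image whose last pixel is off by 2, so MSE = 1.
print(psnr([0, 0, 0, 0], [0, 0, 0, 2]))
# A 512 x 512 image at 8 bits per pixel occupies 262144 bytes uncompressed.
print(compression_ratio(512 * 512, 65536))
```

Note that 10·log10(255²/MSE) is algebraically the same as the form in Eq. (6), which squares the ratio 255/√MSE inside the logarithm.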
Figure 4 shows the input medical images, obtained from different medical modalities, that were used with the different compression algorithms. Table 1 gives the performance analysis of the proposed method and the other algorithm in terms of PSNR, MSE and CR. Bold values in Table 1 indicate the highest value achieved in comparison with the other algorithm. The table shows that the artificial bee colony with the wavelet coefficients performs better in terms of PSNR and CR, which is also illustrated graphically in Fig. 5.
Wavelet-Based Medical Image Compression …
Fig. 4 Considered input medical images from different modalities
Table 1 Comparative analysis of different compression algorithm at 1.00 BPP with five levels
Sample image                    Algorithm               PSNR (dB)   MSE    CR
Image 1 (retina-OCT)            (Proposed) DWT + ABC    43.40       2.14   5.40
                                DWT + PSO               41.69       2.7    5.21
Image 2 (abdominal-US)          (Proposed) DWT + ABC    42.66       2.56   4.91
                                DWT + PSO               40.91       2.82   4.2
Image 3 (abdominal-MRI)         (Proposed) DWT + ABC    39.81       3.10   9.2
                                DWT + PSO               36.90       3.45   8.9
Image 4 (Chest–X-RAY)           (Proposed) DWT + ABC    44.19       2.08   6.9
                                DWT + PSO               40.42       2.9    5.1
Image 5 (abdomen hepatic-MRI)   (Proposed) DWT + ABC    42.81       2.41   9.86
                                DWT + PSO               42.00       2.49   10.2
Image 6 (adenocarcinoma-MRI)    (Proposed) DWT + ABC    37.91       3.66   5.93
                                DWT + PSO               34.24       3.93   4.06
Fig. 5 Performance of different optimization technique with medical images in terms of PSNR (chart "PSNR performance": PSNR (dB) for Image 1 to Image 6; series: Proposed method (DWT + ABC) and DWT + PSO)

5 Conclusion
The proposed method achieves an efficient decomposition using the discrete wavelet transform. The image coefficients are thresholded with five-level band threshold values that are optimized using the artificial bee colony algorithm. By optimizing over various numbers of iterations, an efficient image compression is achieved. Performance metrics such as PSNR, MSE and CR were compared with those of a state-of-the-art algorithm, the particle swarm optimization technique. The comparison shows that the artificial bee colony outperforms the other algorithm and that the quality of the medical images remains high with respect to the compression ratio. As medical images need to be compressed without any distortion, the proposed method can serve physicians by providing compressed images of the best quality with less storage space.
References
1. Gonzalez, R.C., Woods, R.E., Masters, B.R.: Digital image processing, third edition. J. Biomed. Opt. 14(2), 029901 (2009). https://doi.org/10.1117/1.3115362
2. Smith-Bindman, R., Miglioretti, D.L., Larson, E.B.: Rising use of diagnostic medical imaging in a large integrated health system. Health Aff. 27(6), 1491–1502 (2008). https://doi.org/10.1377/hlthaff.27.6.1491
3. Sapkal, A.M., Bairagi, V.K.: Telemedicine in India: a review challenges and role of image compression. J. Med. Imaging Heal. Inform. 1(4), 300–306 (2011). https://doi.org/10.1166/jmihi.2011.1046
4. Mofreh, A.: A new lossless medical image compression technique using hybrid prediction model. Sig. Process. An Int. J. 10(3), 20–30 (2016)
5. Savitri, P.A.I., Adiwijaya, Murdiansyah, D.T., Astuti, W.: Digital medical image compression algorithm using adaptive Huffman coding and graph based quantization based on IWT-SVD. In: 2016 4th International Conference on Information and Communication Technology (ICoICT) 2016, vol. 4, no. c (2016). https://doi.org/10.1109/icoict.2016.7571902
6. Zuo, Z., Lan, X., Deng, L., Yao, S., Wang, X.: An improved medical image compression technique with lossless region of interest. Opt. (Stuttg) 126(21), 2825–2831 (2015). https://doi.org/10.1016/j.ijleo.2015.07.005
7. Karaboga, D., Gorkemli, B., Ozturk, C., Karaboga, N.: A comprehensive survey: artificial bee colony (ABC) algorithm and applications. Artif. Intell. Rev. 42(1), 21–57 (2014). https://doi.org/10.1007/s10462-012-9328-0
8. Akay, B., Karaboga, D.: A survey on the applications of artificial bee colony in signal, image, and video processing. Sig. Image Video Process. 9(4), 967–990 (2015). https://doi.org/10.1007/s11760-015-0758-4
9. Alkhalaf, S., Alfarraj, O., Hemeida, A.M.: Fuzzy-VQ image compression based hybrid PSOGSA optimization algorithm. In: IEEE International Conference on Fuzzy Systems, vol. 2015-Novem (2015). https://doi.org/10.1109/fuzz-ieee.2015.7337998
10. Xiao, B., Lu, G., Zhang, Y., Li, W., Wang, G.: Lossless image compression based on integer Discrete Tchebichef Transform. Neurocomputing 214, 587–593 (2016). https://doi.org/10.1016/j.neucom.2016.06.050
11. Sreekumar, N.: Image Compression using wavelet and modified extreme learning machine. Comput. Eng. Intell. Syst. 1719 (2011)
12. Bruylants, T., Munteanu, A., Schelkens, P.: Wavelet based volumetric medical image compression. Sig. Proc. Image Commun. 31, 112–133 (2015). https://doi.org/10.1016/j.image.2014.12.007
13. Mahmoudi, S., Jelvehfard, E., Moin, M.S.: Evolutionary fractal image compression using asexual reproduction optimization with guided mutation. Iranian Conference on Machine Vision and Image Processing, MVIP, pp. 419–424 (2013). https://doi.org/10.1109/iranianmvip.2013.6780022
14. Li, J., Yuan, D., Xie, Q., Zhang, C.: Fractal image compression by ant colony algorithm. In: Proceeding 9th International Young Scientists Conference in Computational Science 2008, pp. 1890–1894 (2008). https://doi.org/10.1109/icycs.2008.222
15. Horng, M.H.: Multilevel thresholding selection based on the artificial bee colony algorithm for image segmentation. Expert Syst. Appl. 38(11), 13785–13791 (2011). https://doi.org/10.1016/j.eswa.2011.04.180
16. Martinez, C.: An ACO algorithm for image compression. CLEI Electron. J. 9(2) (2006). https://doi.org/10.19153/cleiej.9.2.1
17. Khairuzzaman, A.K.M., Chaudhury, S.: Multilevel thresholding using grey wolf optimizer for image segmentation. Expert Syst. Appl. 86, 64–76 (2017). https://doi.org/10.1016/j.eswa.2017.04.029
18. Akay, B., Karaboga, D.: Wavelet packets optimization using artificial bee colony algorithm. In: 2011 IEEE Congress on Evolutionary Computation CEC 2011, pp. 89–94 (2011). http://doi.org/10.1109/CEC.2011.5949603
A Smart Education Solution for Adaptive Learning with Innovative Applications and Interactivities
R. Shenbagaraj and Sailesh Iyer

Abstract Education has been carried out in traditional mode as well as in online mode. This paper finds the gaps regarding the learnability of the students, which has adverse effects on the employability and sustainability over the long run. We recommend online courses as a supplementary tool for addressing these gaps. The problems of the online courses are understood, and an adaptive smarter learning approach with innovative applications and interactivities is being recommended. The background knowledge needed to design a better learning model, and an adaptive solution is explored. The different adaptive approaches are being examined for devising a novel method for adaptive learning. The recent literature review in adaptive learning is captured and analyzed. This paper also provides the future direction of adaptive learning, the challenges involved, and ways to offer effective e-learning.
Keywords Smart tutoring system · Adaptive learning · E-learning · Online tutorials · Traditional education
R. Shenbagaraj (B) · S. Iyer
Rai University, Saroda, Dholka Taluka, Ahmedabad, Gujarat 382260, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_65

1 Introduction
With technological advances in electronics and software, the activities of any industry can be automated. Education is an industry that can also be automated with multimedia software and electronic devices. In education, many gaps in the learnability of students lead to employability and sustainability issues at a later stage, according to reports from employability assessment companies like Aspiring Minds and industry stalwarts like Narayana Murthy. Aspiring Minds CEO Himanshu Aggarwal has confirmed these gaps in his interview [1]. We were also able to reaffirm the skill gaps while conducting computer-based adaptive testing and online surveys across colleges in Tamil Nadu. According to inputs from industry, the quality of graduates, even from premier institutes, is worrisome, and we have witnessed this ourselves while recruiting freshers. This was also confirmed by subject matter experts during our interactions with the academic community. The NASSCOM-McKinsey report also says that only 26% of India's graduates are employable [2]. These gaps can be analyzed appropriately and patched gradually with the help of technological advances, which act as an additional option alongside the traditional education setup for providing a learning solution [3]. Providing a better learning solution is the need of the hour, as it could help to resolve the employability and sustainability issues of students. E-learning can help modern students to learn better.
2 Related Work and Survey
Many learning solutions have been proposed and successfully implemented in school education and the competitive exam market. Still, very few attempts have been made in the higher education segment, as it is a vast area and very difficult to automate. Higher education can be changed dramatically using the latest techniques and tools. A significant number of learners are willing to learn through online courses at their own pace when they do not have access to traditional learning. We did a month-long analysis of the online courses offered by Khan Academy [4], edX [5], Udacity [6] and Coursera [7] and found that none of them serves the needs of average students from India. Either they are at the high end, useful and understandable only to above-average and brilliant students, or they are not visually effective enough to attract average students and enhance their learnability. Moreover, these courses are not related to the students' syllabus, so average students, who already find it demanding to complete their academic courses, cannot afford the time for the extra courses offered by these providers. That is why we decided to create customized, feasible and visually engaging learning video courses suitable for average students. Moreover, a recent study on online courses by the academic team at the Massachusetts Institute of Technology (MIT) found that these online courses had an immense dropout rate of about 95 percent on average over five years. The research, which studied the learners on edX, also found that the high dropout rate has not improved over the years [8]. Although many people enroll for Massive Open Online Courses (MOOCs) like edX, the completion rate for most courses is below 13%.
Even though people argue that the dropout rate is the wrong measure of success for online classes, it is better to decrease the dropout rate of e-learning courses by trying innovative methods so that the engagement level also increases [9]. After surveying the students, the reasons for dropout were identified as
1. Not able to sustain the initial interest level
2. No real intention to complete
3. No user engagement
4. No time
5. Lack of necessary background and difficulty of the course
6. Lack of support
7. Lack of digital skills or learning skills
8. Late starters of the course are not able to cope
9. Language barriers
10. More work on the student's part
11. Unrealistic expectations
12. Content is not strong.
In this research, we have tried to address most of these reasons by designing an innovative application-oriented personalized and interactive learning solution. Future research will discuss some of these goals. As this is a natural and effective method of training, a lot of researchers have explored this area of adaptive research. So, a lot of adaptive learning solutions have been tried as per [10]. In adaptive learning, the researchers have considered the learner's
1. Identity
2. Language
3. Nationality
4. History
5. Learning Style
6. Characteristics
7. Objectives
8. Preferences
9. Knowledge and Skills
10. Device
11. Browser Version
12. Connectivity
13. Bandwidth
14. Screen Resolution.
Based on these, they adapt the content, format of presentation, learning sequence, and navigation. They also provide an assessment test and assess the learner for devising an adapting strategy. Learning resources are also shared for discussion. Most recently, in [11], the authors find the relationship between personality types and learning styles of introverts and extraverts. Using the chatbots, they classify them using modified Visual Auditory Read & Write and Kinaesthetic Questionnaires. Visual contents are being given to the learners, and their beta brain waves are recorded. Based on the brain waves, a dataset is created. The dataset is validated using machine learning algorithms to improve the accuracy of the classification.
3 Proposed System
Considering the need, the background, the latest research literature and feasibility, a micro-adaptability approach was chosen for the proposed system. A macro-adaptability approach will be considered once the system has been implemented successfully on a large scale. The proposed system has the following features to address the adaptability requirements and to improve the engagement level.
1. Computer adaptive test to understand and analyze the learners.
2. Adaptation of the learning sequence based on each user's profile learnability.
3. Capable of getting the knowledge level and preferred language.
4. Suggesting the learning solution based on the learner's actions and interactions with the system.
5. Using the learner's information and knowledge levels to adapt the flow of the presentation as well as the navigation.
6. Providing learning support like applications, practical exercises, and precise explanation.
7. Providing quizzes, chat facility, and avenues for sharing information about the learning resources.
8. Examining and interpreting the language, browser version, connectivity, bandwidth, and screen resolution.
9. Providing students with a beginner's course, making learners take a test that helps determine their basis before proceeding further, and adaptively playing videos appropriate for them.
10. Assessment tests to assess and adapt accordingly.
One of the key objectives of the solution is to make people complete at least part of their courses by using applications that kindle their interest and keep them motivated. The components of the architecture of the adaptive learning solution are summarized in the following diagram (Fig. 1). The learner model provides structured documentation of the details of the learner. This model is an essential part of the smart learning system. The domain model is a set of varied learning materials and subjects. It is the structured content, with text, images, videos and interactive exercises, for a specific topic. The adaption model links the learner and domain models by adapting to the learner's needs. It includes predetermined adaption rules and functions that select videos and pedagogical materials from the domain model based on the learner's needs and decide when and where to deliver them.
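The interplay of the learner, domain and adaption models can be illustrated with a very small rule. The sketch below is purely hypothetical: the class names, the fields, the pass mark and the fallback behaviour are assumptions chosen for illustration and do not describe the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Learner:
    """Learner model: a few of the attributes the system records."""
    language: str
    branch: str
    score: float  # latest formative assessment score in [0, 1]


@dataclass
class Resource:
    """Domain model entry: one piece of learning material."""
    title: str
    language: str
    level: str  # "beginner" or "advanced"


def select_resources(learner, catalogue, pass_mark=0.6):
    """Adaption rule: pick the level from the score, then match the language.

    Falls back to any language at the right level if nothing matches.
    """
    level = "advanced" if learner.score >= pass_mark else "beginner"
    matches = [r for r in catalogue
               if r.language == learner.language and r.level == level]
    return matches or [r for r in catalogue if r.level == level]


catalogue = [
    Resource("Pointers basics", "Tamil", "beginner"),
    Resource("Pointers basics", "English", "beginner"),
    Resource("Pointers in depth", "English", "advanced"),
]
learner = Learner(language="Tamil", branch="CSE", score=0.4)
print([r.title for r in select_resources(learner, catalogue)])
```

A production adaption model would combine many more learner attributes and rules, but the separation into three small pieces mirrors the architecture described above.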
Fig. 1 Components of the proposed model

4 Results and Discussion
The vital adaptive features were implemented: language-based learning, applications based on the background of the student, and a learning path based on the formative assessments, as is evident from the following screenshots (Figs. 2, 3 and 4).
Fig. 2 Language preference-based adaptive learning implementation
Fig. 3 Adaptive application based on the branch of the student
Fig. 4 Learning path adaptability based on formative assessments
We were able to improve the completion rate of most of the courses up to 30%. We found that the improvement was due to:
1. Conducting computer-adaptive tests and analyzing the learners
2. Engaging adaptive multimedia content which sustains interest
3. Checking the necessary prerequisites and adapting the learning path
4. Visual adaptive content for understanding difficult subjects
5. Support through a chatbot and an ask-a-doubt interface with the help of expert faculty
6. A visual guidance model
7. Audio based on the language preference of the user.
5 Conclusions and Further Work
This paper reaffirms that adaptive learning has an impact on the learning ability of the student, and that MOOCs can reap maximum benefits from it and improve their effectiveness. The solution provided addresses the Visual, Auditory, Read & Write, and Kinesthetic needs of the learners on an average basis. Future work will build on more individualized and dynamically changing learning styles of the learners and devise more adaptive micro- and macro-learning solutions, which could further increase the completion rate of the courses.
Acknowledgements The author acknowledges the founders of groomMyCareer.com Pvt. Ltd. and Mobile Tutor Private Limited for sponsoring this research and Rai University for guiding this research work.
References
1. Biijeesh, N.A.: How can Indian graduates improve their employability? [Online]. Available: https://www.indiaeducation.net/interviews/himanshu-aggarwal-ceo-aspiring-minds.html
2. NASSCOM: Nasscom-McKinsey Report 2005: Extending India's Leadership of the Global IT and BPO Industries. NASSCOM-McKinsey, New Delhi (2005)
3. LeahBelsky: Where online Learning Goes Next. Harvard Business Review. Accessed 2 Oct 2019 (2019). [Online]. Available: https://hbr.org/2019/10/where-online-learning-goes-next
4. [Online]. Available: https://www.khanacademy.org/
5. [Online]. Available: https://www.edx.org/
6. [Online]. Available: https://www.udacity.com/
7. [Online]. Available: https://www.coursera.org/
8. Muuray, S., Rodriguez, M., Jugo, M.: Struggle to Lift Rock-Bottom Completion Rates. Financial Times Limited. Accessed 12 July 2019 (2019). [Online]. Available: https://www.ft.com/content/60e90be2-1a77-11e9-b191-175523b59d1d
9. Onah, D., Sinclair, J., Boyatt, R.: Dropout Rates of Massive Open Online Courses: Behavioural Patterns (2014). https://doi.org/10.13140/rg.2.1.2402.0009
10. Ennouamani, S., Mahani, Z.: An overview of adaptive e-learning systems. In: Eighth International Conference on Intelligent Computing and Information Systems (ICICIS) (2017)
11. Rajkumar, R., Ganapathy, V.: Bio-inspiring learning style Chatbot inventory using brain computing interface. IEEE Open Access J. (2020)
Exhaustive Traversal of Deep Learning Search Space for Effective Retinal Vessel Enhancement
Mahua Nandy Pal and Minakshi Banerjee
Abstract In computerized analysis of retinal images, image enhancement policies play a major role. This work evaluates the effectiveness of the enhancement policies prevalently used for retinal vessel segmentation. Contrast limited adaptive histogram equalization (CLAHE), adaptive gamma correction (AGC) and morphological enhancement operations have been considered. These individual policies, as well as their possible combinations, have been evaluated separately from the viewpoint of deep retinal vessel segmentation performance. Evaluation has been done on differently enhanced retinal image datasets, as better enhanced inputs are expected to provide better results through a better trained model. We explored the whole search space to assess our method and confirmed our observations with cross-dataset verification.
Keywords CNN · Morphology · CLAHE · AGC · Tophat
1 Introduction
The retina is the highly light-receptive rear surface of the eye. Both ocular and circulatory diseases, such as hypertension, diabetes and cardiovascular diseases, manifest symptoms at an early stage in retinal fundus images. Retinal vessel segmentation plays an important role in retinal image analysis. In developing an automatic vessel segmentation system, the prerequisite is to enhance the retinal vessels.
M. Nandy Pal (B) Computer Science and Engineering Department, MCKV Institute of Engineering, MAKAUT, Kolkata, India e-mail: [email protected] M. Banerjee Computer Science and Engineering Department, RCC Institute of Information Technology, MAKAUT, Kolkata, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_66
2 Literature Survey
Adaptive histogram equalization (AHE) is an image enhancement policy. Its disadvantages are slow speed and over-enhancement of noise; these are addressed by Pizer et al. in Ref. [1]. K. Zuiderveld proposed contrast limited adaptive histogram equalization (CLAHE) in Ref. [2] to cope with these disadvantages. Reference [3] utilized a retinal image pre-processing policy using CLAHE. Reference [4] made a significant contribution to AGC by implementing a transformation technique that enhances images through gamma correction and the probability distribution of luminance pixels. Reference [5] discussed the application of AGC to different types of image enhancement. An adaptive gamma correction based on image contrast is presented in [6]. Morphological transformation has been used in [7] for medical images. Reference [8] used different morphological operators and clustering for vessel segmentation. An entropy-based enhancement quality metric is presented in [9]. Reference [10] proposed a model that is efficient in segmenting biomedical images. The contribution of this work is to choose quantitatively the most effective enhancement for retinal vessel segmentation. We tried out both single enhancements and combined enhancement schemes. The better the classification metric values, the more informative the input images are.
3 Evaluation Metrics
Sensitivity (Sn) is the ratio of true positives to the sum of true positives and false negatives. This metric specifies the percentage of vessel samples correctly identified.
$$\mathrm{Sn} = \frac{TP}{TP + FN} \tag{1}$$
Specificity (Sp) is the ratio of true negatives to the sum of true negatives and false positives. This metric specifies the percentage of non-vessel samples identified correctly.
$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{2}$$
Accuracy (Acc) is the ratio of the sum of true positives and true negatives to the sum of true positives, false positives, true negatives and false negatives.
$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
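As a small illustration of Eqs. (1) to (3), the three metrics can be computed from per-pixel labels as sketched below. The function names and the ten-pixel example are assumptions made for illustration; label 1 stands for a vessel pixel and 0 for a non-vessel pixel.

```python
def confusion_counts(truth, pred):
    """Count TP, TN, FP, FN for binary labels (1 = vessel, 0 = non-vessel)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)                     # Eq. (1)


def specificity(tn, fp):
    return tn / (tn + fp)                     # Eq. (2)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)    # Eq. (3)


# Illustrative ground truth and prediction for ten pixels
truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(truth, pred)
sn = sensitivity(tp, fn)        # 2 / 3
sp = specificity(tn, fp)        # 6 / 7
acc = accuracy(tp, tn, fp, fn)  # 8 / 10
```

Because vessel pixels are a minority in a fundus image, accuracy is dominated by the non-vessel class, which is why sensitivity is reported separately.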
4 Dataset Description
For DRIVE [11] images, a Canon CR5 non-mydriatic 3CCD camera is used with a 45° FOV. Each image has 8 bits per color channel and a resolution of 565 by 584 pixels. Ground truths are available. For STARE [12] images, a TopCon TRV-50 fundus camera is used with a 35° FOV. Each image has 8 bits per color channel and a resolution of 700 by 605 pixels, and is available in Portable Pixmap (.ppm) format. Ground truths are available.
5 Deep Neural Network
A deep convolutional neural network (CNN) model has been implemented. The CNN architecture has been formed using U-net type connections [10]. The U-net model is trained on the DRIVE training set for vessel identification. Applying the U-net model to an appropriately enhanced image dataset leads to pixel-level classification into vessel and non-vessel. Measuring the classification efficiency of the model on differently enhanced image datasets indicates the better image enhancement method.
5.1 Architecture, Training and Implementation Requirements
The U-net architecture [10] is used here for CNN-based input image quality assessment. A cross-entropy loss function and the stochastic gradient descent optimization algorithm are used in the implementation. ReLU is the activation function. DRIVE images undergo image enhancement techniques to obtain different datasets. Training is performed on image patches of the differently pre-processed image datasets. Python 3.6.8 is used as the implementation language. The API of TensorFlow, Google's deep learning library, has been used to implement and evaluate the work. Implementations have been executed in the Google Colab online cloud environment, where a Tesla K80 GPU, 12 GB VRAM, 4 × Intel(R) Xeon(R) CPU @ 2.20 GHz, 13.51 GB RAM and 358.27 GB HDD are available as deep network execution resources.
6 Experimental Results
6.1 Applications of Enhancement Techniques
Each retinal image is converted to its green component, which then undergoes various single, double and triple combinations of the most frequently used enhancement techniques of the field. We have preserved the differently enhanced DRIVE and STARE image sets for evaluation purposes.
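Since combinations such as Tophat + CLAHE and CLAHE + Tophat are treated as different schemes, the order of application matters. The sketch below illustrates this with two deliberately simple stand-in operations (a fixed gamma correction and a min-max contrast stretch) on a three-pixel green channel. These stand-ins are not CLAHE, AGC or the Tophat transform, and enumerating every ordering with `permutations` is an assumption about how a search space of ordered combinations can be generated.

```python
from functools import reduce
from itertools import permutations


def gamma_correct(pixels, gamma=0.5):
    """Power-law stand-in for a gamma-type enhancement (pixels in [0, 1])."""
    return [p ** gamma for p in pixels]


def stretch(pixels):
    """Min-max contrast stretch, a stand-in for a contrast enhancement."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return list(pixels)
    return [(p - lo) / (hi - lo) for p in pixels]


def apply_chain(pixels, chain):
    """Apply the enhancement steps one after another, left to right."""
    return reduce(lambda img, step: step(img), chain, pixels)


steps = {"gamma": gamma_correct, "stretch": stretch}
green = [0.2, 0.5, 0.8]  # illustrative green-channel intensities

# Single steps and every ordered pair of distinct steps
results = {}
for r in (1, 2):
    for names in permutations(steps, r):
        results[" + ".join(names)] = apply_chain(green, [steps[n] for n in names])
```

Here `results["gamma + stretch"]` and `results["stretch + gamma"]` differ in the middle pixel, which is the reason each ordering has to be evaluated as its own enhancement scheme.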
6.2 Deep Classification Metric-Based Evaluation
6.2.1 Model Generation
Random, overlapping image patches of 48 × 48 pixels are generated from the original image dataset. The U-net type deep CNN architecture has been fed with these patches for training. A total of 171,000 training patches and 19,000 validation patches were generated from the DRIVE training set images.
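A minimal sketch of random, overlapping patch generation is given below. It assumes that patch origins are drawn uniformly so that every 48 × 48 patch lies fully inside a 565 × 584 image, and that one tenth of the patches is held out for validation (matching the 19,000 of 190,000 reported above). The function names, the seed and the synthetic image are hypothetical.

```python
import random

PATCH = 48
WIDTH, HEIGHT = 565, 584  # DRIVE image resolution


def sample_patch_origins(n, width=WIDTH, height=HEIGHT, patch=PATCH, seed=0):
    """Draw n top-left corners uniformly; patches may overlap each other."""
    rng = random.Random(seed)
    return [
        (rng.randint(0, width - patch), rng.randint(0, height - patch))
        for _ in range(n)
    ]


def crop(image, x, y, patch=PATCH):
    """Cut a patch x patch window out of a row-major list-of-lists image."""
    return [row[x:x + patch] for row in image[y:y + patch]]


origins = sample_patch_origins(1000)

# Hold out 10% of the patches for validation (assumed split)
n_val = len(origins) // 10
val_origins, train_origins = origins[:n_val], origins[n_val:]

# Synthetic single-channel image used only to exercise the cropping
image = [[(r + c) % 256 for c in range(WIDTH)] for r in range(HEIGHT)]
first_patch = crop(image, *origins[0])
```

With 20 training images, 9,500 origins per image would reproduce the total of 190,000 patches; the per-image count is an inference from the totals, not a figure stated in the text.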
6.2.2 Cross-Dataset Evaluation: STARE
During cross-dataset vessel enhancement evaluation, the models previously generated with the differently enhanced DRIVE training datasets are fed with correspondingly enhanced DRIVE test set images and STARE dataset images. Thus, both same-dataset and cross-dataset evaluations have been conducted. During DRIVE evaluation, we compared training accuracy and validation accuracy as well. In Table 1, we provide only the cross-dataset evaluation data.
7 Discussion
Vessel segmentation is a very important preliminary step of retinal image analysis, and vessel enhancement is in turn the prerequisite of vessel segmentation. Better vessel-enhanced images therefore tend to produce better models, which predict with superior classification efficiency metrics. Thus, we have evaluated vessel enhancement schemes for retinal images from the viewpoint of deep CNN classification efficiency metric values. Different classifier models were created using differently enhanced image datasets, as a better learnt model is obtained from a better enhanced training set. Enhanced datasets were prepared following the most commonly used single enhancement policies as well as different combinations of them. These models were evaluated on test images of the DRIVE dataset itself and further evaluated on cross-dataset STARE images.
In both DRIVE training set and test set evaluation, the maximum values of average train/validation/test accuracies, test AUC (area under the ROC curve) and test sensitivity indicate the superior quality of the Tophat + CLAHE + AGC combination for enhancing vessels in retinal image data. In STARE cross-dataset evaluation, test accuracy and test AUC support the DRIVE evaluation. Only the cross-dataset sensitivity evaluation differs from the previous conclusion. This may be because pathology symptom pixels are inappropriately identified as vessel pixels. Table 1 provides cross-dataset enhancement evaluation values for STARE dataset images. Same-dataset test set evaluation and exhaustive graphical analyses of same-dataset and cross-dataset evaluation have been performed but are not provided due to page constraints (Figs. 1 and 2).

Table 1 Average CNN classification metrics on STARE images
| Enhancement | Test Acc | AUC | Sn | Sp |
|---|---|---|---|---|
| Single enhancement | | | | |
| Tophat | 0.946087 | 0.958578 | 0.694051 | 0.974962 |
| Combination Morphology | 0.948104 | 0.958685 | 0.741428 | 0.971783 |
| CLAHE | 0.942191 | 0.961717 | 0.768289 | 0.962115 |
| AGC | 0.93711 | 0.962519 | 0.818366 | 0.950715 |
| Double enhancement combinations | | | | |
| Tophat + CLAHE | 0.939824 | 0.956353 | 0.963456 | 0.733562 |
| CLAHE + Tophat | 0.9446 | 0.960612 | 0.729864 | 0.969202 |
| AGC + Tophat | 0.927153 | 0.950845 | 0.797945 | 0.941956 |
| Tophat + AGC | 0.947201 | 0.959972 | 0.661522 | 0.979931 |
| CLAHE + AGC | 0.929213 | 0.941067 | 0.439867 | 0.985277 |
| AGC + CLAHE | 0.93947 | 0.938432 | 0.683426 | 0.968805 |
| Combination Morphology + CLAHE | 0.925566 | 0.938639 | 0.750944 | 0.945573 |
| CLAHE + Combination Morphology | 0.922938 | 0.939647 | 0.771199 | 0.940323 |
| Triple enhancement combinations | | | | |
| Tophat + CLAHE + AGC | 0.952422 | 0.968993 | 0.719666 | 0.979089 |
| CLAHE + Tophat + AGC | 0.951981 | 0.969091 | 0.743768 | 0.975836 |
| Combination Morphology + CLAHE + AGC | 0.893976 | 0.949202 | 0.868142 | 0.896936 |
| CLAHE + Combination Morphology + AGC | 0.927593 | 0.939548 | 0.38791 | 0.989424 |
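The comparison across schemes amounts to ranking the rows of Table 1 by a chosen metric. The sketch below does this for test accuracy and AUC; the values are copied from Table 1, while the dictionary layout and the helper name `top` are assumptions made for illustration.

```python
# (Test Acc, AUC) per enhancement scheme, copied from Table 1
stare = {
    "Tophat": (0.946087, 0.958578),
    "Combination Morphology": (0.948104, 0.958685),
    "CLAHE": (0.942191, 0.961717),
    "AGC": (0.93711, 0.962519),
    "Tophat + CLAHE": (0.939824, 0.956353),
    "CLAHE + Tophat": (0.9446, 0.960612),
    "AGC + Tophat": (0.927153, 0.950845),
    "Tophat + AGC": (0.947201, 0.959972),
    "CLAHE + AGC": (0.929213, 0.941067),
    "AGC + CLAHE": (0.93947, 0.938432),
    "Combination Morphology + CLAHE": (0.925566, 0.938639),
    "CLAHE + Combination Morphology": (0.922938, 0.939647),
    "Tophat + CLAHE + AGC": (0.952422, 0.968993),
    "CLAHE + Tophat + AGC": (0.951981, 0.969091),
    "Combination Morphology + CLAHE + AGC": (0.893976, 0.949202),
    "CLAHE + Combination Morphology + AGC": (0.927593, 0.939548),
}

ACC, AUC = 0, 1  # column positions inside each tuple


def top(metric, n=2):
    """Return the n schemes with the highest value of the given metric."""
    return sorted(stare, key=lambda name: stare[name][metric], reverse=True)[:n]


top_acc = top(ACC)
top_auc = top(AUC)
```

On these values, the two orderings of Tophat, CLAHE and AGC occupy the first two places for both metrics. Tophat + CLAHE + AGC has the highest test accuracy, whereas the AUC of CLAHE + Tophat + AGC is higher by roughly 0.0001.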
8 Conclusion
In this paper, we have evaluated single image enhancement techniques. We also explored the whole search space by exhaustive evaluation of all possible combinations of these single enhancement techniques. Among all the possibilities, the successive combination of Tophat transformation, CLAHE and AGC is the most efficient with respect to vessel enhancement when applied to retinal fundus images. We may also explore evaluation on individual pathology samples to strengthen our findings. We intend to enrich this work by comparing all the aforementioned enhancement schemes with a deep learning-based enhancement scheme, which is expected to yield even better results.
Fig. 1 Graphical representation of cross-dataset AUC, specificity and test accuracy
Fig. 2 Graphical representation of cross-dataset sensitivity
References
1. Pizer, S., Amburn, E., Austin, J., Cromartie, R., Geselowitz, A., Greer, T., ter Haar Romeny, B., Zimmerman, J., Zuiderveld, K.: Adaptive histogram equalization and its variations. Comput. Vis., Gr. Image Process. 39, 355–368 (1987)
2. Zuiderveld, K.: Contrast Limited Adaptive Histogram Equalization. Chapter VIII.5, Graphics Gems, vol. IV, pp. 474–485 (1994)
3. Nandy Pal, M., Banerjee, M.: A comparative analysis of application of Niblack and Sauvola binarization to retinal vessel segmentation. In: Proceedings CINE 2017, IEEE Xplore
4. Huang, S., Cheng, F., Chiu, Y.: Efficient contrast enhancement using adaptive gamma correction with weighting distribution. IEEE Trans. Image Process. 22 (2013)
5. Rahman, S., Rahman, M., Abdullah-Al-Wadud, M., Al-Quaderi, G., Shoyaib, M.: An adaptive gamma correction for image enhancement. Eurasip J. Image Video Process. 35 (2016)
6. Nandy, M., Banerjee, M.: Automatic diagnosis of micro aneurysm manifested retinal images by deep learning method. In: Proceedings, ICETST 2019
7. Hassanpour, H., Samadiani, N., Mahdi Salehi, S.M.: Using morphological transforms to enhance the contrast of medical images. Egypt. J. Radiol. Nuclear Med. 46, 481–489 (2015)
8. Hassan, G., El-Bendary, N., Hassanien, A., Fahmy, A., Shoeb, A., Snasel, V.: Retinal blood vessel segmentation approach based on mathematical morphology. Procedia Comput. Sci. 65, 612–622 (2015)
9. Nandy, M., Banerjee, M.: Retinal vessel segmentation using gabor filter and artificial neural network. In: Proceedings, EAIT 2012, pp. 157–160
10. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. arXiv:1505.04597v1 [cs.CV] 18 May 2015
11. Digital Retinal Image for Vessel Extraction (DRIVE): Image Sciences Institute. Online, accessed 1-July-2018
12. Structured Analysis of the Retina (STARE): Clemson University. Online, accessed 1-July-2018
Empirical Study of Propositionalization Approaches
Kartick Chandra Mondal, Tanika Chakraborty, and Somnath Mukhopadhyay
Abstract Propositionalization is a promising approach for effectively handling datasets in the field of knowledge discovery. It is the most important part of data mining. Knowledge discovery always motivates data summarization when structured and unstructured datasets are stored in a large database system. Propositionalization leads from relational data and background knowledge to a single-table representation, which is the input to the widespread systems for knowledge discovery. This paper presents different approaches, namely logic-based and database-oriented propositionalization, in a formal way. Logic-based propositionalization depends on background knowledge of the dataset, which is moreover a single relational database. This paper explains relational subgroup discovery (RSD) and relational features (RelF) as logic-based approaches. For both approaches, datasets are required in the Prolog language as background knowledge. Experiments are done with PROPER and relational aggregation (RELAGGS) for large amounts of input data that are mainly stored in a multirelational database.
Keywords Propositionalization · Relational database · Knowledge discovery · Data mining
K. Chandra Mondal (B) Jadavpur University, Kolkata, India e-mail: [email protected] T. Chakraborty Tata Consultancy Services, Kolkata, India e-mail: [email protected] S. Mukhopadhyay Assam University, Silchar, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_67
1 Introduction
Data from a variety of applications are used for many different purposes. Knowledge discovery in databases (KDD) is an efficient way to extract essential and valuable patterns from those masses of data, which can be used for development and research work in technical and business fields. Most information management systems rely on a relational database management system (RDBMS) for storing and manipulating the data [12]. In an RDBMS, data are stored in multiple tables, and first-order or predicate logic is the root of the RDBMS. Most conventional systems for KDD demand a single table as input. Here, propositionalization comes into the picture [2, 13]. It is the process of explicitly transforming a relational dataset into a propositional dataset [14], and it is a promising approach for handling relational datasets robustly and effectively. The input to the propositionalization method is data in structured form, and the output is an attribute-value representation in a single table. The aim of propositionalization is to pre-process the relational data for subsequent analysis by attribute-value learners. There are two main approaches to propositionalization [1, 3, 15]: a logic-based approach and a database-oriented approach. The aim of this paper is to demonstrate the application of different algorithms of both approaches, along with the tools used to apply them. This paper investigates two popular logic-based approaches, relational subgroup discovery (RSD) and relational features (RelF). It also demonstrates two database-oriented algorithms, PROPER and relational aggregation (RELAGGS).
1.1 Goals and Contribution of the Paper
This paper aims at answering a few questions about the propositionalization approach in the context of research in this field:
1. How can an aggregate function serve efficiently to make propositionalization more effective [8]?
2. How can different logic-based and database-oriented approaches be applied to different datasets?
The rest of the paper is organized as follows. Section 2 explains the details of the approaches used in the experiments. Sections 3 and 4 present the experimental setup, the results obtained and their analysis. Lastly, Sect. 5 gives the conclusion and the future scope of the work presented in the article.
2 Approaches Used
2.1 Relational Subgroup Discovery (RSD)
Relational rule learning algorithms are mainly designed for the construction of classification and prediction rules. Rule learning can also be adapted, via a propositionalization approach, to subgroup discovery. Subgroup discovery is achieved by adapting rule learning and first-order feature construction [19]. The input of the RSD algorithm contains a relational database, which in turn contains a relational definition given by ground facts and background knowledge written in Prolog, together with syntactic and semantic constraints for first-order feature construction and constraint-based subgroup discovery. The output of RSD is a set of individual rules, each describing a subgroup of individuals whose class distribution differs substantially from the class distribution in the input dataset [22]. RSD works in the following four steps:
1. First-order literal conjunctions that form legal feature definitions are identified.
2. The feature definitions comply with user-defined constraints, called the mode language, so that this task completes without any dependency on the input data.
3. The feature set is extended by various kinds of instantiations. Some features are copied multiple times, with some variables replaced by constants detected by inspecting the input data. Some irrelevant features are also identified during this process.
4. A propositionalized representation is generated from the resulting feature set; it corresponds to a relational table that contains attributes of a binary nature [9].
RSD consists of three Prolog programs, i.e., featurize.pl, process.pl and rules.pl. Mode declarations, background knowledge and all settings related to parameters and constraints are specified in a .b file as Prolog facts/clauses. All three program components load the .b file.
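The outcome of step 4 can be pictured with a small stand-in written in Python rather than Prolog. In the sketch below, each feature is a boolean test on an individual, and the propositionalized table holds one binary attribute per feature. The toy trains, the four features and the rule for dropping irrelevant features (remove attributes that are constant over all individuals) are hypothetical and do not reproduce RSD's actual feature construction.

```python
# Toy relational data: each train owns a list of cars (hypothetical values)
trains = {
    "t1": {"cars": [{"length": "short", "roof": "closed"},
                    {"length": "long", "roof": "open"}]},
    "t2": {"cars": [{"length": "long", "roof": "open"}]},
    "t3": {"cars": [{"length": "short", "roof": "closed"}]},
}

# Each feature plays the role of a first-order conjunction about one train
features = {
    "has_short_car": lambda t: any(c["length"] == "short" for c in t["cars"]),
    "has_closed_car": lambda t: any(c["roof"] == "closed" for c in t["cars"]),
    "has_long_open_car": lambda t: any(
        c["length"] == "long" and c["roof"] == "open" for c in t["cars"]
    ),
    "has_car": lambda t: len(t["cars"]) > 0,
}


def propositionalize(individuals, features):
    """Build one row of binary attributes per individual."""
    return {
        name: {fname: int(test(ind)) for fname, test in features.items()}
        for name, ind in individuals.items()
    }


def drop_constant_features(table):
    """Assumed relevance filter: keep attributes that vary across rows."""
    names = next(iter(table.values())).keys()
    keep = [f for f in names if len({row[f] for row in table.values()}) > 1]
    return {name: {f: row[f] for f in keep} for name, row in table.items()}


table = propositionalize(trains, features)
reduced = drop_constant_features(table)
```

The resulting single table can be handed to any attribute-value learner, which is the purpose of the propositionalization step.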
2.2 Relational Features (RelF)
This method exploits a form of monotonicity of irreducibility, which is important for propositionalization. A tool is available that follows this approach. The reason for using this tool is to reduce the drawbacks of level-wise approaches by following a block-wise approach, in which a feature set is developed by combining conjunctions, named building blocks, that can be composed. The tool merges the two phases of propositionalization, i.e., construction of features and extension computation, into one process [10].
The RelF algorithm works as follows:
1. RelF merges the two phases of propositionalization (i.e., feature construction and extension computation) into a single process.
2. The input to the algorithm is a set of feature templates and sets of learning examples; the output is all features that comply with the templates and cover some examples.
3. First, RelF discriminates POS features that have an equal domain of other input examples, and then it removes irrelevant features, which is the main advantage of this algorithm.
2.3 PROPER/RELAGGS
This tool first imports the data into the database, then applies different types of propositionalization approaches to them, and finally exports the produced data. The PROPER toolbox itself ships with some sample datasets for a variety of experiments, but users can also create examples of their own. To explore an existing database, Weka PROPER has a relations feature under util. Weka is a package of machine learning algorithms for performing data mining tasks. Its algorithms can be applied directly to a dataset or called from Java code; for this article, the algorithms are applied directly to the datasets. Weka consists of tools for data preprocessing, classification, regression, clustering, association rules and visualization. The RELAGGS package is integrated with the Weka tool; it is a propositionalization filter inspired by the RELAGGS algorithm [5, 9]. The working principle of RELAGGS is as follows:
1. It is a transformation-based approach to propositionalization that applies aggregation, which is commonplace in the database area.
2. Aggregation substitutes a set of values with an appropriate single value that summarizes properties of the set.
3. For numeric values, simple statistical descriptors such as average, maximum and minimum are used; for nominal values, the mode (the most frequent value) or counts of the occurrences of the different possible values are mostly used.
4. The algorithm takes a large set of clauses as input, containing sets of numeric and nominal values.
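The aggregation principle listed above can be sketched as follows for a one-to-many relationship between a target table and a child table. The molecule and atom records, the column names and the function `aggregate` are hypothetical; the sketch is not the RELAGGS implementation and only illustrates how average, minimum, maximum, mode and value counts collapse several child rows into one row per target object.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical child table: several atoms belong to each molecule
atoms = [
    {"mol": "m1", "element": "c", "charge": -0.1},
    {"mol": "m1", "element": "c", "charge": 0.3},
    {"mol": "m1", "element": "o", "charge": -0.4},
    {"mol": "m2", "element": "n", "charge": 0.2},
    {"mol": "m2", "element": "c", "charge": 0.0},
]


def aggregate(rows, key, numeric, nominal):
    """Summarize all rows sharing the same key into one attribute-value row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Fix the set of possible nominal values so every output row has the same columns
    domains = {col: sorted({r[col] for r in rows}) for col in nominal}

    table = {}
    for k, grp in groups.items():
        feats = {"count": len(grp)}
        for col in numeric:
            vals = [r[col] for r in grp]
            feats[f"{col}_avg"] = mean(vals)
            feats[f"{col}_min"] = min(vals)
            feats[f"{col}_max"] = max(vals)
        for col in nominal:
            counts = Counter(r[col] for r in grp)
            feats[f"{col}_mode"] = counts.most_common(1)[0][0]
            for value in domains[col]:
                feats[f"{col}_count_{value}"] = counts[value]  # 0 if absent
        table[k] = feats
    return table


single_table = aggregate(atoms, "mol", numeric=["charge"], nominal=["element"])
```

Each molecule ends up as one row with the same columns, which is the single-table form that attribute-value learners such as those in Weka expect.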
3 Experimental Setup
To run the propositionalization tools of both the logic-based and database-oriented approaches, we need to set up the environment of the machine on which these tools run, so that interruptions during their execution are avoided.
RSD and RelF can be run on Windows or Linux systems. For this task, we used 64-bit Windows 7 with 4 GB RAM and ran both implementations from the command prompt. The Linux environment is used to run PROPER; all the commands for PROPER are Unix commands. To set up a Linux environment on the Windows machine, Oracle VirtualBox version 4.1.18 is installed, and CentOS 7 is run through it. Weka is a data mining tool with which multiple data mining algorithms can be applied to different kinds of datasets and their performance determined. We used version 3.6 of the tool for this task on 64-bit Windows 7.
3.1 Tool Setup
The RSD package, which is implemented in YAP Prolog, is downloaded and installed from [20]. YAP can be downloaded from [16]; version 6.3.3 is used for this work. The extracted RSD package contains three directories, i.e., code, doc and samples. The doc directory contains the RSD manual, code contains the featurize, process and rules files written in Prolog, and samples contains datasets such as krk, mutagenesis and trains. RelF can be found at [11] as a zip file that contains several directories such as dist, lib, relf_experiments, src and relf_jars. These directories contain the executable jar files and Java files required to run the tool. The relf_experiments directory contains the datasets to be used during the execution of the tool. For this task, PROPER is used in a Linux environment. PROPER is freely available from [7] or [6]. PROPER 0.2.1 contains several files, including a shell script named install.sh. We opened the folder in a command terminal in the Linux environment and ran the script with the command ./install.sh. This script installs the PROPER tool in the system under the root directory, which contains several directories including datasets. A database setup is required for the execution of the PROPER tool; here, we used PostgreSQL and pgAdmin III. The data mining tool Weka is also used for this project. MySQL Workbench version 6.3 CE is used to run RELAGGS with Weka. We created a database named wekatest in MySQL so that we can supply the URL of the database during execution.
4 Result and Analysis
In this section, we analyze the results of the experiments performed with each of the tools, one tool per subsection.
4.1 RSD
In the first step of RSD execution, the query [featurize]. is used to consult the featurize.pl file. Then, the file containing the background knowledge of the dataset must be read. For this task, we used the mutagenesis and krk datasets, whose background knowledge files are muta.b and krk.b, respectively. The mutagenesis dataset is collected from the article [18], and the krk dataset is taken from the article [4]. For the execution of RSD, the command used, the time taken to read and the size of the background knowledge are shown in Table 1. Feature construction is then carried out based on the declarations in the background knowledge. Once feature construction is done (the time taken is shown in Table 1), the features are written to an output file by the command w. Writing the features takes 2.714 s and creates the output files muta_fra.smb and muta_frs_noist.pl for the mutagenesis dataset; for the KRK dataset, it takes 0.281 s and creates the output files krk_fra.smb and krk_frs_noist.pl. After feature construction and generation of the output files, we exit the process and start again by consulting the process.pl file. After consulting the file, the output files of the previous steps, including the declaration file, must be read. This takes a total of 468 msec for the mutagenesis dataset and 16 msec for the KRK dataset. Once all declaration and output files have been read, the program displays a feature set. These features are non-redundant and also contain the constants extracted from the experimental data. They also comply with the constraints related to the coverage of features on the data. After that, we write an expanded feature file. Finally, the propositional representations muta.cov and krk.cov, for the mutagenesis and krk datasets respectively, are accepted by the RSD rule inducer. To proceed with rule induction, the query [rules]. is executed, followed by reading the background knowledge file along with the newly generated cov files. Rules are induced in 1.794 s for the mutagenesis data and in 52.619 s for the KRK data. Once the rules have been generated, the w. command saves the rules and the rr. command creates muta.rules and krk.rules. Finally, the s. command shows that the features are saved in memory.
Table 1 Output of different datasets after applying RSD
| Dataset | Command used | Time taken | Size of knowledge | Percentage of correctness | Feature construction time (s) |
|---|---|---|---|---|---|
| Mutagenesis | r(muta) | 187 msec | 3,151,952 bytes | 74.62 | 3.51 |
| KRK | r(krk) | 1 msec | 19,328 bytes | 68.78 | 0.265 |
Hence, we can see that it takes more time to execute and construct feature files for the mutagenesis dataset than for the KRK dataset, because there are several Prolog datasets for mutagenesis such as atom_bond, three-dimension, Lumo, Hansch, etc. The percentage of correct output is shown for the different datasets in Table 1. So, we can conclude that a large number of relational tables costs more time than a large number of rows in them.
4.2 RelF
In this experiment, we run RelF on the previously used mutagenesis dataset and on the predictive toxicology challenge dataset found in [17]. For each dataset, there are one batch file and one text file containing all the data. The batch file contains the actual RelF algorithm. With the mutagenesis batch file, four output files are created in the ARFF format after executing the algorithm. As the mutagenesis dataset contains a lot of data regarding atom-bond relationships, it takes 120.553 s to create these output files, named muta1, muta2, muta3 and muta4. For the predictive toxicology challenge, two datasets are used, for female mice and male mice. As for the mutagenesis dataset, the algorithm is written in batch files, named ptc_fm.relf and ptc_mm.relf. It took 75.907 s to create three output files in the ARFF format for female mice and 74.462 s to create the output files for male mice.
4.3 PROPER With this tool, we can import data from Prolog database to relational database. In this task, we create two relational databases named mutagenesis1 and east-west with PROPER tool to import data of mutagenesis used previously and east-west train challenge dataset collected from [21]. Once the PROPER tool has been successfully run, data from mutagenesis are imported to database mutagenesis1 as provided in ANT file during creation of database. Dataset of east-west trains challenge dataset containing positive/negative and unclassified examples are imported to database eastwest. We analyze the resulting output of the RELAGGS algorithm through Weka Experimenter itself. There is a tab called Analyse in Experimenter interface. To analyze the result one can choose the output ARFF file generated by selecting File option. We select the Database option to fetch the output directly from the database which is created for respective datasets. Once the URL, user id and password are provided correctly, the database will be connected and automatically shows the available result set corresponding to the database provided in the URL. For this work, J48 decision tree algorithm is used which is an open-source Java implementation of C4.5 algorithm and an extension of ID3 algorithm. When we
K. Chandra Mondal et al.
Fig. 1 RELAGGS: test analysis configuration of trains dataset
Fig. 2 RELAGGS: test analysis output trains dataset
connect to the database wekatest, where the output results for the east-west trains dataset are stored, the available result set with the performed experiments is shown automatically. The experimental output is analyzed with the Paired T-Tester (corrected). The standard t-test assumes that the samples of the experiments are independent. The corrected t-test uses a fudge factor to account for the dependence between the samples used in the experiment, so that the type I error stays at an acceptable level. The corrected t-test should therefore always be used when setting up an experiment in the Weka Experimenter. In our experiments, shown in Fig. 1, the percentage correct is 90 using meta.FilteredClassifier with RELAGGS. The test analysis is performed by selecting the Perform test button, and its result is shown in Fig. 2. In the same way, we select the database wekatest1, where the output for the east-west dataset is stored, and perform the test analysis. We also select the database wekatest02, where the output for the mutagenesis dataset is stored, and perform the test analysis.
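The effect of the correction can be illustrated with a short sketch of the corrected resampled t statistic in the form proposed by Nadeau and Bengio, which adds the ratio of test to training set size to the usual 1/k term. This is a stand-alone illustration, not the Weka code; the accuracy differences and the set sizes below are hypothetical.

```python
import math
from statistics import mean, variance


def standard_paired_t(diffs):
    """Ordinary paired t statistic, which assumes independent samples."""
    k = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / k)


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic.

    diffs: per-run differences in accuracy between two schemes.
    n_train, n_test: training and test set sizes used in each run.
    """
    k = len(diffs)
    var = variance(diffs)  # sample variance (k - 1 in the denominator)
    # n_test / n_train is the fudge factor for overlapping samples;
    # the uncorrected test uses only 1 / k.
    return mean(diffs) / math.sqrt((1.0 / k + n_test / n_train) * var)


# Hypothetical accuracy differences (percentage points) over 10 runs.
diffs = [2.0, 1.0, 3.0, 2.5, 0.5, 1.5, 2.0, 3.5, 1.0, 2.0]
t_std = standard_paired_t(diffs)
t_corr = corrected_resampled_t(diffs, n_train=90, n_test=10)
print(t_std, t_corr)
```

The corrected statistic is always smaller in magnitude than the uncorrected one, so it is less likely to report a difference as significant when the samples share training data.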
Empirical Study of Propositionalization Approaches
5 Conclusion

In this paper, we have demonstrated two logic-based approaches, viz., RSD and RelF, and one database-oriented approach, viz., RELAGGS. We used Weka to apply the RELAGGS algorithm to the datasets. Weka incorporates many more classifiers and filters that can be applied to different relational datasets, and the effectiveness of different algorithms can be compared on the basis of the output accuracy percentage. Initially, we applied the RELAGGS algorithm to a dataset with the help of the PROPER tool. We plan to use more datasets and to examine the effectiveness and running time of the approach on them. We also plan to experiment with database-oriented approaches other than RELAGGS and to explore how these approaches can be used in various types of real-life applications. Deeper integration with databases using RELAGGS can also be investigated. A more advanced investigation of approaches such as HIFI, POLKA, ROLL UP and DIFFER should also be taken into consideration, together with a comparative analysis of their performance.
References

1. Alphonse, É., Rouveirol, C.: Selective propositionalization for relational learning. In: PKDD, vol. 99, pp. 271–276. Springer (1999)
2. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., et al.: Knowledge discovery and data mining: towards a unifying framework. KDD 96, 82–88 (1996)
3. Flach, P.A.: Knowledge representation for inductive learning. In: European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty, pp. 160–167. Springer (1999)
4. Iba, W.: Searching for better performance on the king-rook-king chess endgame problem. In: FLAIRS Conference (2012)
5. Knobbe, A.J., Siebes, A., Marseille, B.: Involving aggregate functions in multi-relational search. In: European Conference on Principles of Data Mining and Knowledge Discovery, pp. 287–298. Springer (2002)
6. Krogel, M.-A.: Sources. https://www.cs.waikato.ac.nz/ml/proper/downloads.html
7. Krogel, M.-A.: Welcome to the homepage of weka proper! http://www.cs.waikato.ac.nz/ml/proper/
8. Krogel, M.-A.: On Propositionalization for Knowledge Discovery in Relational Databases. PhD thesis, Citeseer (2005)
9. Krogel, M.-A., Rawles, S., Železný, F., Flach, P.A., Lavrač, N., Wrobel, S.: Comparative evaluation of approaches to propositionalization. In: ILP, vol. 3, pp. 197–214. Springer (2003)
10. Kuželka, O., et al.: Block-wise construction of acyclic relational features with monotone irreducibility and relevancy properties. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 569–576. ACM (2009)
11. Kuželka, O., Železný, F.: RelF and HiFi. http://ida.felk.cvut.cz/RELF/
12. Miller, H.J., Han, J.: Geographic data mining and knowledge discovery: an overview. In: Geographic Data Mining and Knowledge Discovery, pp. 10–35. CRC Press (2009)
13. Morik, K.: Knowledge discovery in databases–an inductive logic programming approach. In: Foundations of Computer Science, pp. 429–436. Springer (1997)
14. Muggleton, S., De Raedt, L.: Inductive logic programming: theory and methods. J. Log. Program. 19, 629–679 (1994)
15. Pei, J., Jiawei, H., Kamber, M.: Data mining: concepts and techniques
16. SourceForge: Yet another Prolog. https://sourceforge.net/projects/yap/
17. Srinivasan, A., King, R.D.: Feature construction with inductive logic programming: a study of quantitative predictions of biological activity aided by structural attributes. Data Mining Knowl. Discov. 3(1), 37–57 (1999)
18. Srinivasan, A., Muggleton, S., King, R.D., Sternberg, M.J.E.: Mutagenesis: ILP experiments in a non-determinate biological domain. In: Proceedings of the 4th International Workshop on Inductive Logic Programming, vol. 237, pp. 217–232. Citeseer (1994)
19. Wrobel, S.: An algorithm for multi-relational discovery of subgroups. Princ. Data Mining Knowl. Discov. pp. 78–87 (1997)
20. Železný, F.: Relational subgroup discovery through first-order feature construction. http://ida.felk.cvut.cz/zelezny/rsd/index.htm
21. Železný, F.: Efficiency-conscious propositionalization for relational learning. Kybernetika 40(3), 275–292 (2004)
22. Železný, F., Lavrač, N.: Propositionalization-based relational subgroup discovery with RSD. Mach. Learn. 62(1), 33–63 (2006)
A Survey on Online Spam Review Detection Sanjib Halder, Shawni Dutta, Priyanka Banerjee, Utsab Mukherjee, Akash Mehta, and Runa Ganguli
Abstract In recent years, online reviews reflecting customers’ opinions have played a significant role in the unprecedented success of online marketing. Consumers can make well-founded judgements about the quality of products or services from the large volume of user-generated reviews, even when the provider is unknown. As a result, online review platforms face frequent abuse by fraudsters, whose activities often mislead potential customers and organizations and thereby reshape their businesses. Moreover, with immense technological advancement and a growing diversity of products, organizations compete with each other, and there is a growing tendency among merchants to hire professionals to write deceptive reviews that promote their own products and defame others. Hence, the trustworthiness of reviewers and the authenticity of their reviews are crucial from the perspective of e-commerce. This paper reviews several methodologies for identifying spam or false reviews. We also discuss the feature extraction techniques and parameters used in those algorithms.

Keywords Spam · Fake review · Machine learning · Feature extraction · Feature vector · Opinion mining · Group spammers

S. Halder (B) · S. Dutta · P. Banerjee · U. Mukherjee · A. Mehta · R. Ganguli
Department of Computer Science, The Bhawanipur Education Society College, Kolkata, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_68
S. Halder et al.
1 Introduction

With the tremendous technological growth of recent times, consumers often feel uncomfortable buying products in crowded markets. Instead, they prefer online shopping, as it saves time and gives them the freedom to buy products 24 × 7. The volume of this parallel market is increasing gradually compared to the traditional marketing system, which online marketing has replaced to a great extent. In an online store, a customer can choose a product by viewing product images and comparing a large number of items. The difficulty is that customers must rely on the online description and image of a product to judge its quality. Customers therefore often depend on reviews of the product written by other customers. Business houses also generally rely on customers’ opinions to reshape their business by improving or altering product features. As online marketing grows and becomes more popular day by day, the number of customer reviews of any product or service is also growing significantly. This leads many reviewers to act dishonestly and post misleading or deceptive reviews. There is even a growing tendency among merchants to hire professionals to write deceptive reviews. The motive of such users may be to raise or lower the rating of a product or service by unfair means. In the last few years, researchers have developed various methods to detect spam or fake reviews [1] in order to provide customers and companies with genuine reviews. In this paper, we review different methodologies and algorithms used in the field of spam review detection, based on various features of reviews as well as reviewers’ characteristics [2]. The main contributions of this paper are as follows:

(i) To provide researchers with insight and future research directions on spam review detection.
(ii) To present the most efficient method by investigating the efficiency, accuracy and applicability of different methods in the field of spam review detection.
(iii) To find and analyse the influence of different features of reviews and reviewers on fake review detection.

The rest of the paper is structured as follows: the next section introduces the research method. Section 3 discusses the feature extraction techniques used in spam detection. Section 4 presents the analysis and discussion of different techniques used in review spam detection. Finally, Sect. 5 concludes the paper by stating some observations on the literature surveyed.
2 Research Method

To collect the relevant previous literature related to our research area, we followed a systematic approach. First, we identified the keywords related to our research topic, and their synonyms, to formulate search terms. Some of the search terms include review spam, machine learning, feature extraction, feature vector, opinion mining and group spammers. Next, we searched some of the renowned literature resources, such as Google Scholar, IEEE Xplore, ACM Digital Library, Scopus and SCI journals, using the selected search terms and collected all relevant papers. We then examined the gathered literature to shortlist the journal and conference papers relevant to our research work. We also considered the references of the shortlisted papers to bring in more papers. The search continued until we had gathered a sufficient number of relevant papers, after which we started our detailed review. A workflow of the review process is shown in Fig. 1.

Fig. 1 Workflow of review process
3 Feature Extraction Technique

Based on our understanding, several features play an important role in detecting spam reviews in online forums. This section discusses some of the feature extraction techniques that can help improve the overall efficiency of a spam detection model.
3.1 N-Gram-Based Model

The use of n-grams, such as unigrams, bigrams and skip-grams [3, 4], is one of the more important techniques for filtering spam out of online reviews. These n-gram approaches are applied to the review content and help capture the linguistic context of words, as in word embeddings. Applying them yields the feature vectors [5], which can later be fed as input to a supervised learning algorithm, such as a classifier. In this context, Barushka et al. [3] applied a feed-forward neural network. Similarly, several classifiers, namely the Naïve Bayes classifier [5], SVM [5], random forest and stacking ensemble methods, were used in [4] to perform fake review filtering. Xue et al. [6] used the bag-of-words model for feature extraction and then multiple SVM classifiers for aspect-based opinion mining. Kale et al. [7] used unigrams and bigrams to extract topic-related words.
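As a minimal sketch of how such features can be obtained, the code below extracts unigrams, bigrams and skip-bigrams from a review and builds a count vector over a fixed vocabulary. It does not reproduce the pipeline of any surveyed paper: the sample review and the vocabulary are hypothetical, and skip-grams are read here in their simplest sense, as ordered word pairs with a bounded gap, without training any embedding.

```python
from collections import Counter


def ngrams(tokens, n):
    """Contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, max_skip):
    """Ordered word pairs with at most max_skip words between them."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def feature_vector(text, vocabulary):
    """Count vector over a fixed vocabulary of unigrams and bigrams."""
    tokens = text.lower().split()
    counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    return [counts[term] for term in vocabulary]


review = "great hotel great staff"  # hypothetical review text
tokens = review.split()
vocab = [("great",), ("hotel",), ("great", "hotel"), ("great", "staff")]

print(ngrams(tokens, 2))
print(skip_bigrams(tokens, 1))
print(feature_vector(review, vocab))
```

A vector of this kind is what a supervised classifier such as Naïve Bayes or an SVM would receive as input for each review.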
3.2 Reviewer’s Characteristics-Based Model

Beyond the content of reviews, reviewer characteristics also play an important role in identifying spammers. For this purpose, the trustiness of reviewers is computed in [6, 8]. Xue et al. [6] extracted aspect-based opinions, which were later used to compute the trustiness of reviews. The trustiness of reviewers can also be modelled from the honesty scores of their reviews [8]. If a reviewer has more honest reviews than fake reviews, he/she is labelled as honest, otherwise as a spammer.
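The labelling rule in the last sentence can be sketched as follows. This is a simplification under stated assumptions: the honesty scores are taken as given, and the convention that a score above a threshold of 0 marks an honest review is hypothetical, not the scoring model of [8].

```python
def label_reviewer(honesty_scores, threshold=0.0):
    """Label a reviewer from the honesty scores of their reviews.

    A score above `threshold` counts as an honest review, anything else
    as a fake one (assumed convention for this sketch).
    """
    honest = sum(1 for score in honesty_scores if score > threshold)
    fake = len(honesty_scores) - honest
    # More honest than fake reviews -> honest; ties fall to "spammer".
    return "honest" if honest > fake else "spammer"


print(label_reviewer([0.8, 0.4, -0.6]))
print(label_reviewer([0.2, -0.5, -0.9]))
```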
3.3 Spam Dictionary-Based Model

A spam dictionary contains a list of spam words. If a review contains many words that belong to the spam dictionary, the review is labelled as spam. The content of the reviews is analysed, spam words are identified in it [9], and on a successful search the reviews are labelled as spam. This method in turn helps to detect groups of spammers as well. However, it may not give the desired result if the spam dictionary does not include all the relevant spam words.
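A minimal sketch of the dictionary lookup is given below. The word list and the 20% cut-off are hypothetical choices made for illustration; the source does not specify how many dictionary words make a review spam.

```python
import re

# Hypothetical spam dictionary; a real one would be far larger.
SPAM_WORDS = {"free", "guaranteed", "cheapest", "winner"}


def spam_ratio(review, spam_words=SPAM_WORDS):
    """Fraction of the words in a review that appear in the dictionary."""
    words = re.findall(r"[a-z']+", review.lower())
    if not words:
        return 0.0
    return sum(word in spam_words for word in words) / len(words)


def is_spam(review, min_ratio=0.2):
    """Label a review as spam when enough of its words are spam words."""
    return spam_ratio(review) >= min_ratio


print(is_spam("Guaranteed cheapest price, free gift for every winner"))
print(is_spam("The room was clean and the staff were friendly"))
```

The limitation noted above is visible here: any spam word missing from `SPAM_WORDS` is silently ignored.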
4 Analysis of Different Techniques

Most machine learning algorithms use review content to detect spam. However, it has been noticed that the linguistic context of words plays an important role in text categorization. Barushka et al. [3] proposed a novel content-based approach that unifies bag-of-words [2] with word context. Their work uses n-grams and the skip-gram word embedding method to build a vector model that generates a high-dimensional feature representation, and a feed-forward neural network performs the classification. Using a hotel review dataset, the authors of [3] show that word embedding works comparatively better than other classification methods.

Bajaj et al. [4] use n-gram (unigram + bigram) features along with supervised learning for review filtering. The work combines multiple learning algorithms to improve the accuracy of the model. The authors of [4] use three algorithms, SVM, Naïve Bayes and random forest, to classify reviews as spam or non-spam; Naïve Bayes works best, with 87.12% accuracy. In the second phase, two ensemble techniques are used, of which the stacking ensemble performs better, with 87.68% accuracy. These findings suggest that ensemble techniques are worth considering.

Kale et al. [7] aim to detect reviews that are likely to be spam by referring to indicators such as the use of abusive language, irrelevant context and discontinuous flow of text. The researchers observed that existing methodologies focus mostly on the extraction, classification and summarization of opinions by checking for spam or non-spam comments. Their method improves on the existing ones by additionally using NLP techniques to consider parts of speech and excessive use of punctuation marks.

Xue et al. [6] study the problem of inferring trustworthiness from reviews. First, opinion mining techniques using both supervised and unsupervised learning algorithms were applied to extract the aspect-category-specific opinions expressed in the reviews. The opinions were then integrated to generate an opinion vector for each individual review. Finally, an iterative content-based computational model was developed to compute honesty scores for users and reviews. Here, the trustworthiness was computed for a specific domain. The approach can be extended to similar domains by comparing the similarities of the aspects extracted from the respective domains. A temporal dimension can also be added so that the dataset becomes dynamic in nature.

The relationship among reviewers, reviews and the stores that the reviewers have reviewed has also been exploited through the concept of a heterogeneous graph model [6]. Instead of analysing the text of reviews, factors such as the trustiness of reviewers, the honesty of reviews and the reliability of stores are quantified from the graph. In this method, the reasons for spamming and important clues to the characteristics of suspicious spammers are taken into consideration. An IR-based methodology [8] is implemented that applies the algorithm to identify highly suspicious spammer candidates. Human evaluators were then recruited to judge which candidates were real spammers. Experimental results indicate that spamming activities can be discovered using the graph-based model, which performs well, with good precision and human judge agreement.

Sinha et al. [9] proposed a framework that detects fake or spam reviews using sentiment analysis or opinion mining. The authors tried to determine, by checking the text of a review with a decision tree [9], whether the review is relevant to the specific item. They used a spam dictionary to detect untruthful or spam words in the reviews and also proposed a general prototype for spam detection.

Patel et al. [10] proposed a method based on opinion mining that classifies users into suspicious, clear and hazy categories through phase-wise processing. The authors discuss the factors that contribute to opinion spamming. Untruthful reviews are marked according to a threshold score given by the system. The authors used sentiment analysis and VADER in their project, which resulted in 74% accuracy.

Adike et al. [11] proposed a tool to identify fake reviews given by users with different semantic content, based on untruthful movie reviews. The authors used the J48 classifier [12] to detect fake reviews by generating ARFF files from distinct features. They classified the reviews as positive or negative and also as spam or genuine. NB and SVM techniques were used for the task and compared with respect to true positive rate, true negative rate, accuracy, rule and condition per rule. The accuracy of SVM is 98%, higher than that of NB at 96%. The evaluation was done on an Amazon dataset, and the focus was on review-centric detection using data mining techniques.

Dematis et al. [13], in contrast, used both review content and the reviewer’s behavioural pattern to identify fake reviews on an Amazon dataset. They implemented fine-grained burst pattern identification in order to examine the reviews generated over suspicious time intervals. To improve effectiveness, a person’s overall trustworthiness, authorship and review history were also considered. A spam scoring function [13] was implemented, and 2.5% of the total reviews in their dataset were marked as spam.

Most review spam detection techniques rely on review characteristics without considering reviewer characteristics. The success of retail products, hotels, restaurants, etc. is highly dependent on user-generated online reviews, and spammers aim to manipulate those reviews to distort that success. A novel framework, FraudEagle, is introduced in [14] for automatically detecting spam reviews as well as spammers. Instead of considering only review text or behavioural characteristics, this framework exploits the network effect among reviewers and products. It proceeds in an unsupervised fashion and is applied to a large-scale dataset. A scoring algorithm is also implemented that assigns scores to reviews, reviewers and products. Experimental results indicate that the method is highly advantageous for detecting fraudulent users and reviews in online forums.
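The idea of examining reviews generated over suspicious time intervals, mentioned above for [13], can be illustrated with a deliberately simplified sketch. It is not the fine-grained method of [13]: it only counts the reviews of one product in fixed windows and flags crowded windows, and the window length, the threshold and the sample data are hypothetical.

```python
from collections import Counter


def find_bursts(review_days, window=7, min_reviews=5):
    """Return the start days of fixed windows holding at least min_reviews.

    review_days: integer day index of each review of one product.
    """
    per_window = Counter((day // window) * window for day in review_days)
    return sorted(start for start, n in per_window.items() if n >= min_reviews)


# Hypothetical posting days: six reviews arrive between day 30 and day 34.
days = [1, 9, 30, 31, 31, 32, 33, 34, 60]
print(find_bursts(days))
```

Reviews that fall inside a flagged window would then be passed on for closer inspection, for example together with the reviewer's history.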
5 Conclusion

In this paper, we have critically reviewed the literature on online spam review detection published between 2013 and 2019. In this study, we examined the methods and techniques used in review spam detection research to provide researchers with insight and future research directions. The survey shows that the linguistic context of words, abusive language, irrelevant context, content similarity, discontinuous flow of text, etc. have been used as some of the most important spam indicators. Incorporating all the clues that give insight into spam indicators in a single system should help achieve higher accuracy. Moreover, identifying fraudulent users along with their spam reviews will assist in detecting spammer groups and also help spam-related research. Implementing NLP-based techniques along with several machine learning approaches supports the detection of fake reviews in online forums. Combining review content, the reviewer’s history and characteristics, and the relationship among review content, reviewer and product store into a single framework is a direction for future research that may achieve better accuracy.

Acknowledgements The research has been conducted under a research project titled “Review Spam Detection and Product Recommendation System using Machine Learning Techniques” sponsored by The Bhawanipur Education Society College. The authors are thankful to the Research and Publication Cell of the College for the research grant, which provided the computational and other facilities.
References

1. Heydari, A., ali Tavakoli, M., Salim, N., Heydari, Z.: Detection of review spam: a survey. Expert Syst. Appl. 42(7), 3634–3642 (2015)
2. Crawford, M., Khoshgoftaar, T.M., Prusa, J.D., Richter, A.N., Al Najada, H.: Survey of review spam detection using machine learning techniques. J. Big Data 2(1), 23 (2015)
3. Barushka, A., Hajek, P.: Review spam detection using word embeddings and deep neural networks. In: IFIP International Conference on Artificial Intelligence Applications and Innovations, pp. 340–350. Springer, Cham (2019)
4. Bajaj, S., Garg, N., Singh, S.K.: A novel user-based spam review detection. Procedia Comput. Sci. 122, 1009–1015 (2017)
5. Hamamoto, M., Kitagawa, H., Pan, J.Y., Faloutsos, C.: A comparative study of feature vector-based topic detection schemes. In: International Workshop on Challenges in Web Information Retrieval and Integration, pp. 122–127. IEEE (2005)
6. Xue, H., Wang, Q., Luo, B., Seo, H., Li, F.: Content-aware trust propagation toward online review spam detection. J. Data Inf. Qual. (JDIQ) 11(3), 1–31 (2019)
7. Chaitanya, K., Dadasaheb, J., Tushar, P.: Spam review detection using natural language processing techniques. Int. J. Innov. Eng. Res. Technol. 3(1) (2016). ISSN: 2394–3696
8. Wang, G., Xie, S., Liu, B., Yu, P.S.: Review graph based online store review spammer detection. In: 2011 IEEE 11th International Conference on Data Mining, pp. 1242–1247 (2011)
9. Sinha, A., Arora, N., Singh, S., Cheema, M., Nazir, A.: Fake product review monitoring using opinion mining. Int. J. Pure Appl. Math. 119(12), 13203–13209 (2018)
10. Patel, D., Kapoor, A., Sonawane, S.: Fake review detection using opinion mining. Int. Res. J. Eng. Technol. (IRJET) 5(12), 818–822 (2018)
11. Adike, M.R., Reddy, V.: Detection of fake review and brand spam using data mining. Int. J. Recent Trends Eng. Res. 2(7), 251–256 (2016)
12. Kokate, S., Tidke, B.: Fake review and brand spam detection using J48 classifier. IJCSIT Int. J. Comput. Sci. Inf. Technol. 6(4), 3523–3526 (2015)
13. Dematis, I., Karapistoli, E., Vakali, A.: Fake review detection via exploitation of spam indicators and reviewer behavior characteristics. In: International Conference on Current Trends in Theory and Practice of Informatics, pp. 581–595. Edizioni della Normale, Cham (2018)
14. Akoglu, L., Chandy, R., Faloutsos, C.: Opinion fraud detection in online reviews by network effects. In: Seventh International AAAI Conference on Weblogs and Social Media (2013)
Leveraging Machine Vision for Automated Tiles Defect Detection in Ceramic Industries Brijesh Jajal and Ashwin R. Dobariya
Abstract Machine vision (MV) is the technology and set of techniques used to provide imaging-based automatic inspection and analysis of manufactured products and to categorize product quality during the production process in various industries. This paper discusses the implementation of machine vision technology for identifying various defects in ceramic tiles at the final stage of the production process. The defect detection process inspects tiles in terms of their measurements, defects on their border areas and skew in their size. All these defect criteria are analyzed so that tiles can be categorized by quality on the production line. The attributes of the ceramic tiles are scanned using a Basler series machine vision camera and processed using a program module designed in the LabView automation tool. The proposed Machine Vision-Based Tiles Defect Detection (MVBTDD) model is able to detect three major categories of defects in tiles and reveals the live status on the production line. The implemented model serves as an aid to efficient automated defect detection for major ceramic tile industries at a low cost.

Keywords Defect detection in tiles · Automated quality check · Machine vision-based tiles inspection · Automatic tiles defect detection
1 Introduction

Achieving a machine vision-based process with an accuracy comparable to human expertise has been a very challenging research area ever since digital computers were introduced in the market. In the production line of the ceramic industry, most tasks are performed through automation. However, there exist a few crucial stages before and after the production line where human interaction is mandatory as far as the quality of the product is concerned. Based on these facts, a current research trend is to reduce human interaction in the quality verification of tiles at the last stage of the production unit. Ceramic tiles of varied sizes are often used in building homes, offices and shopping malls. While the ceramic industries manufacture tiles on a large scale globally, various types of defects may arise in tiles at different stages of production. This demands human intervention for the quality check of every individual tile in the production line, which is the biggest challenge because of the speed at which the product is produced. At the same time, customers need a high-quality product for their satisfaction and usage requirements. The primary goal of a machine vision-based quality check of a tile is to identify whether the tile is defective or not [1]. If a defect is present, the type of defect then needs to be determined. The current system of quality checking by deputed persons has several disadvantages: the inability to inspect every tile on the running assembly line; numbness caused by continuous monotonous work for more than 6–8 h; and the skipping of very minute defects because of lighting conditions and a harsh environment with high temperature and dust particles. Therefore, machine vision technology helps this industry attain better efficiency in the analysis and identification process, with satisfactory and cost-effective results. To implement the proposed automated design on live data, a set of defective tiles was used as samples (see Fig. 1).

B. Jajal (B) · A. R. Dobariya
Faculty of Computer Applications, Marwadi University, Rajkot, Gujarat 360003, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_69
2 Proposed Machine Vision-Based Tiles Defect Detection (MVBTDD) Model

The proposed MVBTDD model (Fig. 2) comprises an assembly line in which the tiles are shifted continuously through the idlers onto a conveyor belt that moves them onwards to final packaging. Just before the dispatch line, the manual process of checking random tiles for defects can now be replaced with a defect check of every individual tile. This mechanism requires an industry-supported Basler series camera and a sensor that pushes a tile, either to break it or to shift it into another category, classifying it as a low-quality tile [2]. A multiple-camera setup may also be used to view the tiles from different angles and send the data to the LabView automation module for detection [3]. While a single camera on top enables the side and surface measurements, the other camera(s) can identify the thickness of a tile and its variation. The proposed model analyzes the acquired tile images within a specific time and gives control information to the classification machine. Based on this control information, the classification machine reports whether the incoming tile is defect-free or has a specific defect.
Fig. 1 Sample tiles used for the defect detection
Fig. 2 Proposed MVBTDD model [Image source: 1stvision.com]
Because lighting conditions vary in this industry, it is also mandatory to set up uniform lighting on the tiles in order to obtain clear tile images from the conveyor belt. This may be accomplished with a single LED illumination or with multiple light sources.
3 Defect Identification Process

The MVBTDD model performs the following common steps for defect identification in tiles.

Image capture. The initial step uses the Basler series camera to capture the image of a tile. The tiles being captured move on the conveyor belt at an approximate speed of 40 tiles per minute.
Noise removal. This step removes the unnecessary environmental parts of the scanned tile image. Noise removal improves efficiency, because processing concentrates only on the tile, which further reduces the overall processing time.
Skew detection. This refers to the angular correction required for the image, since a tile may be skewed (or tilted) by conveyor vibration.
Size measurement. Using a LabView algorithm, the size of the scanned image is calculated in the required unit of measurement (usually mm) and compared with the allowed threshold value.
Angle measurement and surface analysis. All four sides of a tile are verified for straightness, ensuring 90° at all the corners of the tile.
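The geometric part of these steps can be sketched in a few lines. The sketch assumes that the four corner points of the tile have already been located in the image, which is the job of the image processing stage and is not shown; the function names, the corner coordinates and the scale of 0.5 mm per pixel are hypothetical and do not describe the LabView module.

```python
import math


def side_lengths_mm(corners_px, mm_per_px):
    """Lengths of the four sides, given corners in order around the tile."""
    n = len(corners_px)
    return [
        math.dist(corners_px[i], corners_px[(i + 1) % n]) * mm_per_px
        for i in range(n)
    ]


def corner_angles_deg(corners_px):
    """Interior angle at each corner, in degrees."""
    n = len(corners_px)
    angles = []
    for i in range(n):
        px, py = corners_px[i - 1]        # previous corner
        cx, cy = corners_px[i]            # corner being measured
        nx, ny = corners_px[(i + 1) % n]  # next corner
        a = math.atan2(py - cy, px - cx)
        b = math.atan2(ny - cy, nx - cx)
        angle = abs(math.degrees(a - b)) % 360
        angles.append(min(angle, 360 - angle))
    return angles


def skew_deg(corners_px):
    """Tilt of the first side relative to the image x axis."""
    (x0, y0), (x1, y1) = corners_px[0], corners_px[1]
    return math.degrees(math.atan2(y1 - y0, x1 - x0))


# Hypothetical corners of a well-formed 200 x 200 mm tile at 0.5 mm per pixel.
square = [(0, 0), (400, 0), (400, 400), (0, 400)]
print(side_lengths_mm(square, 0.5))
print(corner_angles_deg(square))
print(skew_deg(square))
```

A tile rotated by conveyor vibration shows up as a non-zero value from `skew_deg`, while a deformed tile shows up as corner angles that deviate from 90°.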
4 Model Implementation

The model is implemented with two major components, viz. the hardware setup and the LabView interface, which shows the current status on screen (an HMI may be used for the same purpose).
4.1 Hardware Setup

The assembly, comprising three major hardware parts, is shown in Fig. 3.

Basler Ace acA1300–200um USB camera. It is capable of capturing every tile running on a production line, with a maximum rate of 200 fps, whereas a typical tile manufacturing plant requires a speed of at most four tiles per second. The camera needs to be positioned at a distance from the tile assembly line that varies with the size of the tile, from 200 × 200 mm to 1200 × 1200 mm.
Leveraging Machine Vision for Automated Tiles Defect Detection …
Fig. 3 MVBTTD hardware setup
LED illumination. The size of the LED panel depends on the size of the tile being inspected. The ideal light output of the LED panel for such systems is 2000 lm.
Metallic support stand. It is a set of two metallic rods, one connected to the LED base and the other perpendicular to it, holding the acquisition camera.
4.2 LabView Interface
Using LabView NXG 2.0 and its vision acquisition and image processing tools, the model is designed (Fig. 4) to perform the following steps:
Phase-1: Acquisition of the image of the current batch tile. The camera is used with a trigger (capture time) set so as to capture each tile on the production line.
Phase-2: Identification of dimensions. The four sides of the tile are captured, and the size in image pixels is converted into its equivalent physical measure to be displayed on screen.
Phase-3: Template comparison. The sample dimensions are compared with those of the original zero-defect template tile.
Phase-4: Final output. The measurement results for sides and surfaces [2] are displayed, with an indicator to pass or fail the tile based on its quality.
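Phase-2 and Phase-3 amount to a unit conversion followed by a tolerance check. The sketch below shows one way this could look, assuming a single calibration factor obtained from a reference object of known size; the function names, the 0.5 mm tolerance, and the sample numbers are illustrative assumptions, not values from the paper.

```python
def mm_per_pixel(reference_mm, reference_px):
    """Calibration factor from a reference object of known physical size."""
    return reference_mm / reference_px

def inspect_tile(side_px, template_mm, scale, tol_mm=0.5):
    """Convert measured side lengths to mm and compare them with the template.

    side_px     -- measured side lengths in pixels
    template_mm -- side lengths of the zero-defect template tile in mm
    scale       -- calibration factor in mm per pixel
    tol_mm      -- allowed deviation per side (assumed value)
    """
    measured_mm = [p * scale for p in side_px]
    deviations = [m - t for m, t in zip(measured_mm, template_mm)]
    passed = all(abs(d) <= tol_mm for d in deviations)
    return measured_mm, deviations, passed

if __name__ == "__main__":
    # Hypothetical calibration: a 300 mm edge spans 1200 pixels.
    scale = mm_per_pixel(300.0, 1200)
    template = [300.0, 300.0, 300.0, 300.0]
    print(inspect_tile([1200, 1201, 1199, 1200], template, scale))
    print(inspect_tile([1200, 1210, 1200, 1200], template, scale))
```

Because the camera distance changes with tile size, the calibration factor would have to be re-derived whenever the setup is adjusted.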
B. Jajal and A. R. Dobariya
Fig. 4 MVBTTD model design
5 Result and Conclusion
5.1 Result on Defect Detection
Every tile captured on the production line is compared with the template, and the MVBTTD shows the status on screen. Once calibrated for the equivalent value in mm, the results can be achieved with 100% accuracy over thousands of execution cycles. Figure 5 shows the current tile under scan and its status in terms of measurement quality [4]. It comprises three factors of the current tile under study: the dimensions of each side of the tile, the angle formed on each of the four sides, and the total surface area. As shown in Fig. 5, the following measurements are displayed live for the current tile on the assembly line:
1. Length and width of all four borders of the tile, in pixels.
2. Automatic detection of the skew of each of the four borders, represented by a red line.
Fig. 5 Output of measurement status
3. Comparison of both above factors with actual template tile values, and overall result on the display.
5.2 Conclusion
In an experimental setup, the proposed design is able to produce the desired result for the three basic parameters of tile dimensions [5], as interpreted in the results. The research design can be set up within a few minutes in an industrial environment. The tile manufacturing industries of Gujarat, which serve a domestic consumption demand of Rs. 210 billion [6], can now use this low-cost solution for defect detection rather than importing automated machines [7] from foreign countries.
6 Future Work
6.1 Challenges
Despite the successful beta testing of the model on factory premises, many challenges remain to be addressed.
Environment. The model needs to be housed in an enclosure that allows the computing equipment to work efficiently in a high-temperature, dust-filled area.
Lighting conditions. The illumination must not be so strong that it washes out small defects in the tiles, nor so weak that visible defects go undetected. In addition, the natural light falling on the tiles varies with the daily weather.
6.2 Additional Modules
The following new modules are expected to be devised:
Customization of the lighting condition. Since the same lighting conditions may not suffice on cloudy days as on sunny days, a user-operated brightness control on the design panel could be useful for calibration.
Database usage. In order to collect all current data into a data warehouse, a MySQL database can be used, so that deep learning [8] or an end-to-end learning method [9] can be applied to predict different types of defects and assess the prediction accuracy.
Status update. Periodic SMS messages sent through low-cost IoT devices [10] and automated email reports to the inspection team would present the live data status. These update features would help the stakeholders with quicker decisions and troubleshooting.
Acknowledgements We are thankful to Dr. R. Sridaran, Dean, Faculty of Computer Applications, Marwadi University, for inspiring us and providing us with all essential resources to perform the tasks related to the research work.
References
1. Yap, M., Safwan, M., Ratnam, M., Yen, K.S.: Ceramic Tiles Inspection using Machine Vision. School of Mechanical Engineering, Engineering Campus, Universiti Sains Malaysia (2018)
2. Foram, S., Darshana, M.: Surface defect detection in a tile using digital image processing: analysis and evaluation. Int. J. Comput. Appl. 116(10), (2015)
3. Costa, C., Petrou, M.: Automatic registration of ceramic tiles for the purpose of fault detection. Mach. Vis. Appl. 11, 225–230 (2000)
4. Emam, S.M., Sayyedbarzani, S.A.: Dimensional deviation measurement of ceramic tiles according to ISO 10545-2 using the machine vision. Int. J. Adv. Manuf. Technol. 100, 1405–1418 (2019)
5. Acebrón, F., López, F., Valiente, J.M.: Surface Defect Detection on Fixed Pattern Ceramic Tiles. J. R. Navarro Computer Vision Group, Polytechnic University of Valencia, Camino de Vera 14, 46022 Valencia, Spain
6. ICCTAS Magazine Homepage: http://www.icctas.com. Last accessed 2020/04/25
7. Semantic scholar Homepage: https://www.semanticscholar.org. Last accessed 2020/04/25
8. Shih P.H., Chi KH.: A deep learning application for detecting facade tile degradation. In: Ahram T., Karwowski W., Pickl S., Taiar, R. (eds.) Human Systems Engineering and Design II, Advances in Intelligent Systems and Computing 1026, Springer Cham (2019) (2020)
9. Wu, Y., Guo, D., Liu, H., Huang, Y.: An end-to-end learning method for industrial defect detection. Assem. Autom. 40(1), 31–39 (2019)
10. Joshi, R., Thakar, H., Dobariya, A.: Internet of Things: application and challenges. In: 6th International Conference on Computing for Sustainable Global Development, New Delhi, India, vol. 1, pp. 673–675 (2019)
An Automated Methodology for Epileptic Seizure Detection Using Random Forest Classifier
N. Samreen Fathima, M. K. Mariam Bee, Abhishek Bhattacharya, and Soumi Dutta
Abstract Electroencephalography (EEG) is an electrophysiological monitoring method used to record brain activity. It is a non-invasive method in which electrodes are placed along the scalp of the patient. The EEG measures fluctuations in the electrical activity of the brain. The recorded EEG is then analyzed, and epilepsy is detected on the basis of this analysis. The EEG recorded from the region affected by epilepsy is known as focal EEG (FEEG), and the EEG recorded from the non-affected region is known as non-focal EEG (NFEEG). The following features are considered: mean (μ), standard deviation (σ), and the maximum–minimum amplitude. A random forest classifier is used in this work and achieves an accuracy of 97%.
Keywords EEG · FEEG · NFEEG · Mean · Standard deviation
N. Samreen Fathima (B) · M. K. Mariam Bee
Saveetha School of Engineering, SIMATS, Chennai, India
A. Bhattacharya · S. Dutta
Institute of Engineering & Management, Saltlake, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_70

1 Introduction
Epilepsy is a chronic disorder that can affect anyone at any age [1]. It is usually characterized by the onset of seizures in the brain. Seizures are caused by abnormal electrical discharges in the brain. Epilepsy is a neurological disorder that can lead to involuntary convulsions of the patient's muscles and to loss of alertness. Worldwide, about 30 million people are affected by epilepsy. Electroencephalography (EEG) is widely used to monitor brain activity, and epilepsy is clinically determined and closely monitored using the EEG. Some people with epilepsy become resistant to drugs; for such patients, removal of the affected part of the brain is suggested in order to treat the disease. The portion of the brain affected by epileptic seizures is called the epileptogenic focus. Nowadays, such surgery has become common. Hence, an automatic system that identifies FEEG and NFEEG signals will assist doctors in locating the affected regions and will help in the surgical evaluation of the brain portion. The Bern Barcelona database is considered in this work and used to build an algorithm for signal classification. Further analysis can enable doctors to classify EEG characteristics and to diagnose the disease condition. MRI is another modality used to diagnose epilepsy, but EEG is commonly chosen as the diagnostic source because of its low cost. A seizure can be defined as a clinical manifestation associated with excessive discharge of neurons in the brain. Consequently, it leads to the generation of irregular patterns of activity within the brain. It has been reported that 20% of people are affected by generalized (pervasive) epilepsy, which involves the entire brain, and about 55% by focal (partial) epilepsy, which involves a much smaller portion of the brain [2] and cannot be treated with medication alone. In some cases, it requires isolation of the epileptic zones, and some treatments clinically require removal of the affected portion of the brain. According to the international classification, seizures are classified as partial seizures and generalized seizures, which are the more dominant types. A damaged neuron can give rise to seizures in the brain network.
About 75% of the generalized seizures spread randomly across the brain cells. One must understand the implication and generalization of the focal seizure. The signals to be classified are EEG signals recorded from the scalp, and features are extracted from these EEG signals. The process is carried out in two stages: the features are extracted in stage 1 and are then fed to a machine learning algorithm in stage 2. In stage 2, features based on higher-order moments are used [3]. The EEG signal is decomposed in both the time and frequency domains using the wavelet transform. The aim of this paper is to classify EEG signals as focal or non-focal. This classification can help doctors to predict the occurrence of seizures. The identification of focal and non-focal regions in the brain can clinically help to locate the affected regions and can also lead to removal of the affected portion, depending on the clinical situation of the person's health (Fig. 1).
2 Literature Survey
U Rajendra Acharya et al. (2019): A set of 23 nonlinear features was extracted in this study. The classifier used was the LS-SVM classifier, which resulted in an accuracy of 87.93%.
Fig. 1 Sample signal of focal and non-focal EEG signal
Shivarudhrappa Raghu et al. (2018): An SVM classifier was used in this study. The experiments yielded an accuracy of 96.1%.
Md Mosheyur Rahman et al. (2019): A standalone classifier was used in this study, and the accuracy obtained was 91.3%. The authors also used a stacked classifier, which resulted in an accuracy of 95.2% on the same dataset.
Gupta et al. (2017): The flexible analytic wavelet transform (FAWT) was used in this study. With the FAWT, the signal is subjected to 15 levels of subband decomposition. The resulting features gave a specificity of 96% and a sensitivity of 93% [4].
Bhattacharyya et al. (2017): [5] The tunable-Q wavelet transform was used in this study, which achieved an accuracy of 85%.
Bhattacharyya et al. (2017): [6] The empirical wavelet transform was used in this study, which yielded a sensitivity of 88% and a specificity of 92%.
Sriraam et al. (2017): [7] Nonlinear features were extracted based on frequency decomposition in this study, which resulted in an accuracy of about 92%.
N. Kannathal et al. (2005): An ANFIS classifier was used in this study. Different entropy estimators were used, yielding a promising accuracy of 90%.
Rajeev Sharma et al. (2014): In this study, the EMD technique is used to decompose the EEG signals into narrow-band intrinsic mode functions (IMFs). An LS-SVM classifier was used, which resulted in an accuracy of 85%.
Guohun Zhu et al. (2013): In this work, the reported accuracy is based on a DPE index = 21, which is 18% higher than that of the PE index. This resulted in a classification accuracy of 84%.
3 Methods and Materials Used
This section describes the method designed to automatically classify EEG signals as FEEG or NFEEG.
Dataset
The EEG dataset used in this work is the Bern Barcelona dataset; its details can be found in [8]. The EEG recordings were taken from five patients with temporal lobe epilepsy and were acquired at a sampling rate of 512 Hz. The dataset provides 3750 sets of FEEG and NFEEG signals, each with 10,240 samples. Initially, we used 50 sets of data to assess our algorithm. We took the 50 signal sets and formed a group of 100 signals covering the focal and non-focal EEG groups. In this work, the mean, the standard deviation, and the difference between the maximum and minimum amplitude are the features extracted from each EEG signal.
3.1 Preprocessing
EEG signals are preprocessed in order to remove noise and artifacts prior to signal processing. The wavelet transform is used to preprocess the EEG signal. The signals are then subjected to a differencing operation to obtain EEG signals in x–y series before the features are extracted [9].
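The text does not spell out the differencing operation, so the sketch below shows two plausible readings, both labeled as assumptions: first-order differencing of a single series in the sense of [9], and the sample-wise difference of the paired x and y channels of a recording. The function names are hypothetical.

```python
def first_difference(signal):
    """First-order differencing: d[n] = x[n + 1] - x[n] (assumed reading of [9])."""
    return [b - a for a, b in zip(signal, signal[1:])]

def channel_difference(x, y):
    """Sample-wise difference x[n] - y[n] of two paired channels (alternative reading)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    return [a - b for a, b in zip(x, y)]

if __name__ == "__main__":
    x = [1.0, 4.0, 9.0, 16.0]
    y = [0.5, 3.0, 9.5, 15.0]
    print(first_difference(x))
    print(channel_difference(x, y))
```

Either variant yields a single series from which the statistical features can then be computed.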
3.2 Features
3.2.1 Mean
The mean, denoted by μ, is the average value of the sampled signal, computed over all of its samples. It is a basic statistical measure used to assess the value of the signal and is given by

$$\mu = \frac{1}{N}\sum_{i=0}^{N-1} x_i$$

Here the signal is contained in $x_0$ through $x_{N-1}$, i is an index that runs through these values, N is the number of samples, and μ is the mean.
3.2.2 Standard Deviation
The standard deviation, denoted by σ, measures how far the samples deviate from the mean. It is similar to the average deviation, except that the deviations are squared before averaging. The standard deviation is given by

$$\sigma = \sqrt{\frac{1}{N-1}\sum_{i=0}^{N-1}\left(x_i - \mu\right)^2}$$

Here the signal is stored in $x_i$, μ is the mean defined above, N is the number of samples, and σ is the standard deviation.
3.2.3 Maximum–Minimum Amplitude
This feature is the difference between the maximum amplitude and the minimum amplitude, taken over all values of the signal. It is given by

$$\max_i(x_i) - \min_i(x_i)$$

where i = 0, 1, …, N − 1. All these statistical values are calculated prior to the classification stage.
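The three features can be computed in a few lines. The following is a minimal sketch using only the Python standard library, not the authors' code; the function name `extract_features` is hypothetical, and the standard deviation uses the N − 1 denominator given in Sect. 3.2.2.

```python
import statistics

def extract_features(signal):
    """Return (mean, standard deviation, max-min amplitude) of one signal."""
    mu = statistics.fmean(signal)
    # statistics.stdev is the sample standard deviation (divides by N - 1).
    sigma = statistics.stdev(signal)
    amplitude_range = max(signal) - min(signal)
    return mu, sigma, amplitude_range

if __name__ == "__main__":
    toy_signal = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(extract_features(toy_signal))
```

Each EEG signal is thus reduced to a three-element feature vector, which is what the classifier receives as input.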
3.3 Classifier
In the proposed system, a random forest classifier is used in order to obtain the highest accuracy. As the name implies, a random forest consists of a large number of individual decision trees that operate as an ensemble. Each individual tree outputs a class prediction, and the class with the most votes becomes the prediction of the model. The fundamental concept behind the random forest is the wisdom of crowds: a large number of relatively uncorrelated models (trees) working together as a committee gives a better result than any of the individual models. The low correlation between the models is key, because the individual trees help to cancel out each other's errors. For this to work, there must be some actual signal in the features, so that the trees perform better than random guessing. Decision trees are popular models for various machine learning tasks, and the random forest is a widely used algorithm for classification problems. However, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e., they have low bias but very high variance. A random forest averages multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in bias, but it generally boosts the performance of the final model. The training algorithm for the random forest classifier applies the bootstrapping technique, known as bagging, to tree learners.
Given a training set X = x_1, …, x_N with responses Y = y_1, …, y_N, bagging repeatedly (B times) selects a random sample of the training set with replacement and fits a tree to each sample. For b = 1, …, B:
1. Sample, with replacement, N training examples from X, Y; call these X_b, Y_b.
2. Train a classification tree f_b on X_b, Y_b.
After training, the prediction for an unseen sample x′ is made by averaging the predictions of all the individual trees on x′ (or, for classification trees, by taking the majority vote):

$$\hat{f} = \frac{1}{B}\sum_{b=1}^{B} f_b(x')$$

The bootstrapping procedure improves model performance because it decreases the variance of the model without increasing the bias. While the prediction of a single tree is highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on the same training set would give strongly correlated trees; bootstrap sampling is a way of de-correlating the trees by showing them different training sets. An estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions of all the individual trees on x′. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. The optimal number of trees B is often found using cross-validation; the error tends to level off after some number of trees have been fit (Fig. 2).
Fig. 2 Random forest classifier
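The bagging procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the classifier used in the paper: each "tree" is a one-level decision stump, no random feature subsampling is performed (so it is bagging rather than a full random forest), and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Fit a one-level tree: the single-feature threshold with fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    if best is None:
        # No valid split (all rows identical): always predict the majority class.
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label

def fit_bagged(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of size N."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample N rows with replacement
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def predict_bagged(trees, row):
    """Majority vote over the individual trees."""
    votes = Counter(predict_stump(tree, row) for tree in trees)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Toy one-feature data; in the paper each row would hold the three
    # statistical features (mean, standard deviation, max-min amplitude).
    X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]
    y = [0, 0, 0, 1, 1, 1]
    forest = fit_bagged(X, y)
    print(predict_bagged(forest, [0.05]), predict_bagged(forest, [1.05]))
```

A production system would more likely rely on an established library implementation with deeper trees and per-split feature subsampling; the sketch only makes the sample, train, and vote steps explicit.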
3.4 Proposed System Flow Diagram
Table 1 Mean and standard deviation of the various entropy features considered in this work

| Class/feature | Focal EEG (FEEG) | Non-focal EEG (NFEEG) |
| ApEn | 0.35 ± 0.02 | 0.50 ± 0.02 |
| SampEn | 0.4 ± 0.02 | 0.46 ± 0.02 |
| RE | −17.95 ± 0.14 | −13.14 ± 0.09 |
| FuzzyEn | 0.07 ± 0.0003 | 0.11 ± 0.0007 |

Table 2 Comparison with the same database with different sets of features and classifiers

| Classifier | Accuracy (%) |
| SVM | 87.93 |
| Stacked | 92.5 |
| ANFIS | 90 |
| Our work | 97 |
4 Results
The features considered in this work are the mean (μ), the standard deviation (σ), and the maximum–minimum amplitude. These features are applied to the EEG database using the full signal length of 10,240 samples. Table 1 shows the mean and the standard deviation of the EEG signals. The features are given as input to the random forest classifier, which classifies the EEG signals. Table 2 shows the performance of this classifier. The proposed methodology achieved an accuracy of about 97% using the random forest classifier.
5 Discussion
The statistical features considered in this work have been applied to the same database used by other researchers [10–12], who achieved highest accuracies of about 98 and 99.7% [10–13]. In this work, a different combination of features is applied to the Bern Barcelona database for the classification of EEG signals, achieving a highest accuracy of 97%. This shows that the features considered are well suited to signal classification. The works considered in the literature applied intrinsic mode functions (IMFs) to the EEG database [14, 15], and their statistical features were computed on the IMF values of the EEG signals. We used about 10,240 samples in our work, whereas only about 4096 samples were used in [14]. In those works, the combination of features is applied to the IMFs of the database. Our method resulted in a highest accuracy of 97%, which is better than the 87% achieved in [16]; that work used delay permutation entropy features with an SVM classifier, which resulted in an accuracy of 84%. The work in [15] achieved the highest accuracy among the other works compared. Table 2 shows the comparison with other methods and features considered in the literature. Our method achieved an accuracy of about 97% in signal classification, which shows that the features considered give a good improvement in signal classification. The random forest classifier shows the highest classification accuracy when compared with the other classifiers.
6 Conclusion
Epilepsy has become a common health problem. Many individuals undergo surgery to remove the portion of the brain affected by epilepsy. Locating the epileptogenic focus is one of the difficult tasks; doing so can enable specialists to diagnose the disorder and eliminate it completely. In this work, we use basic statistical features to classify the signals as FEEG or NFEEG, achieving a highest accuracy of about 97%. The features considered in this work also require considerably less computation time: the computation time of our features is only 1.14 s, which is important for real-time processing of EEG data. In addition, the unprocessed EEG data, as available in the database, are used. In this way, no preprocessing is done, whereas prior works decomposed the data into IMFs and then computed the feature values. In our work, there is no need to compute IMFs before determining the values. The statistical features are calculated directly, and the EEG signal is classified as FEEG or NFEEG.
References
1. Fisher, R.S. Boas, W. v. E., Blume, W., Elger, C., Genton, P., Lee P., Engel, J.J.: Epileptic seizures and epilepsy: definitions proposed by the international league against epilepsy (ILAE) and the international bureau for epilepsy (IBE). Epilepsia 46(4), 470–472 (2005)
2. Pati, S., Alexopoulos, A.V.: Pharmacoresistant epilepsy: from pathogenesis to current and emerging therapies. Clevel. Clin. J. Med. 77(7), 457–467 (2010)
3. Alam, S.M.S., Bhuiyan, M.I.H.: Detection of seizure and epilepsy using higher-order statistics in the EMD domain. IEEE J. Biomed. Health Inform. 17(2), 312–318 (2013)
4. Gupta, V., Priya, T., Yadav, A.K., Pachori, R.B., Acharya, U.R.: Automated detection of focal EEG signals using features extracted from flexible analytic wavelet transform. Pattern Recogn. Lett. 94, 180–188 (2017)
5. Bhattacharyya, A., Pachori, R.B., Acharya, U.R.: Tunable-Q wavelet transform based multivariate sub-band fuzzy entropy with application to focal EEG signal analysis. Entropy 19(3), 99 (2017)
6. Bhattacharyya, A., Sharma, M., Pachori, R.B., Sircar, P., Acharya, U.R.: A novel approach for automated detection of focal EEG signals using empirical wavelet transform. Neural Comput. Appl. 29(8), 47–57 (2018)
7. Sriraam, N., Raghu, S.: Classification of focal and non focal epileptic seizures using multifeatures and SVM classifier. J. Med. Syst. 41, 160 (14 pages) (2017)
8. Characterization of focal EEG signals: a review
9. Hyndman R.J., Athanasopoulos, G.: 8.1 stationarity and differencing. In: Forecasting: Principles and practices, Melbourne, Australia, OTexts, 2013
10. Rajendra Acharya, U., Molinarib, F., VinithaSreec, S., Chattopadhyayd, S., Nge, K.-H., Suri, J.S.: Automated diagnosis of epileptic EEG using entropies. Biomed. Signal Process. Control, pp. 401–408 (2012)
11. Nicolaou, N., Georgiou, J.: Detection of epileptic electroencephalogram based on permutation entropy and support vector machines. Expert Syst. Appl., pp. 202–209 (2012)
12. Acharya, U.R., Bhat, S., Adeli, H., Adeli, A.: Computer-aided diagnosis of alcoholism-related EEG signals. Epilepsy Behav. 41, 257–263 (2014)
13. Melia, U., Guaita, M., Vallverdú, M., Embid, C., Vilaseca, I., Salamero, M., Santamaria, J.: Mutual information measures applied to EEG signals for sleepiness characterization. Med. Eng. Phys. (2015)
14. Kannathal, N., Sadasivan P.K.: Entropies for detection of epilepsy in EEG 80(3), 187–194 (2005)
15. Sharma, R., Pachori, R.B.: Empirical mode decomposition based classification of focal and non-focal EEG Signals
16. Zhu, G., Li, Y., Wen, P.P., Wang, S., Xi, M.: Epileptogenic focus detection in intracranial EEG based on delay permutation entropy
A Novel Bangla Font Recognition Approach Using Deep Learning
Md. Majedul Islam, A. K. M. Shahariar Azad Rabby, Nazmul Hasan, Jebun Nahar, and Fuad Rahman
Abstract Font detection is an essential pre-processing step for printed character recognition. In this era of computerization and automation, computer-composed documents such as official documents, bank checks, loan applications, visiting cards, invitation cards, and educational materials are used everywhere. Beyond editing and processing documents, converting documents from one format to another, for example for an invitation card or a billboard, is another major application area in which a designer has to recognize font details from images. A lot of research on automatic font detection has been published for high-resource languages such as English, but not much has been reported for a low-resource language such as Bangla. Bangla has a complex structure because of its use of diacritics, compound characters, and graphemes. Furthermore, because of the popularity of digital, online publications, there has been a recent surge of fonts in Bangla. Font detection can also help analysts to detect changes in font choices based on sociopolitical divides: for example, fonts common in Bangladesh may not be as popular among Bangla publications in India. In this paper, we present a convolutional neural network (CNN) approach for detecting Bangla fonts, using a space adjustment method based on a stacked convolutional auto-encoder (SCAE). As part of the work, we built a large corpus of printed documents consisting of 12,187 images in 7 different Bangla fonts, forming a total of 77,728 samples after augmentation, to train and validate our model. Our proposed model achieves 98.73% average font recognition accuracy on the validation set.
Keywords Bangla typeface · Deep learning · Stacked convolutional auto-encoder (SCAE) · Pattern recognition

Md. M. Islam (B) · A. K. M. S. A. Rabby · N. Hasan · J. Nahar
Apurba Technologies, Dhaka, Bangladesh
F. Rahman
Apurba Technologies, Sunnyvale, CA, USA
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_71
1 Introduction
Bangla is the only language on the planet for which blood was spilt and human lives were lost, only to protect it and get it recognized as a state language of erstwhile East Pakistan, now independent Bangladesh. Bangla is currently ranked the 5th most-spoken language in the world [1]. But compared with high-resource languages such as English, Chinese, and Hindi, the state of Bangla language technology is rather poor. Common resources for language automation include annotated corpora, parts-of-speech identification, anaphora resolution, entity extraction, frame representation, language modelling, parallel speech corpora for machine translation, OCR, font detection, and so on, all of which are yet to be developed for Bangla. Font detection is often a pre-processing step in OCR pipelines. It is often important to know which fonts are being used before OCR is run, because that information may help in customizing the recognition tasks. Another group of users with a very practical need is graphic designers, who often want to know which specific fonts are used in an image so that they can use them in other creative projects. In that sense, the ability to correctly detect fonts in images will have a significant impact on Bangla language computerization. It can help in automatically processing computer-composed Bangla documents such as official documents, bank checks, loan applications, educational materials, and so on. It can also be useful for graphic artists designing visiting cards, invitation cards, advertising banners, and so on.
2 Literature Review Hasan et al. [2] proposed a deep convolutional neural network solution for Bangla font detection. In their system, they used a total of 6000 images of 10 different Bangla fonts and achieved 96% accuracy. Rakshit et al. [3] proposed a 1-D discrete wavelet transform solution for Bangla font on basic Bangla vowel characters. In their dataset for training, they used 200 sets of Bangla character images with 3 different fonts and 2 styles: italic and non-italic. Their model detection accuracy is above 96% for an inclination angle of 10°. Chanda et al. [4] proposed the support vector machine (SVM) and multiple support vector machine (MKL SVM) approach for 10 different fonts for Indic scripts, including Bangla. They used 400 text documents, 40 for each
A Novel Bangla Font Recognition Approach Using Deep Learning
747
font, to train their model and get 94.00% accuracy for SVM and 98.5% for MKL SVM. A complete review of font detection in other languages is outside the scope of this paper. However, the following is a short snapshot of what is being done in other resource-rich languages. Yadav et al. [5] proposed a programmatic approach to identify the bold and italic format in Hindi printed documents. In their approach, they give characters headline and vertical line pixel value condition to identify bold or italic. Lehal et al. [6] introduce an n-gram-based similarity method to identify fonts and use 50,000 Hindi words to train their system. This method achieved an average accuracy of 99.6% for both Gurmukhi and Devanagari fonts. Ghosh et al. [7] proposed a convolutional neural network (CNN) approach, where they use 10,000 English fonts images database. They achieved a 63.45% top-1 accuracy and 70.76% top-3 accuracy at the character level, and 57.18% top-1 accuracy and 62.11% top-3 accuracy at the word level. Fouad et al. [8] proposed a system for Arabic font detection. In their system, they tried to solve font recognition and font size recognition using Gaussian mixture models (GMMs) and evaluated the APTI database. They reported a reduction of 70% errors to the character and word recognition for OCR. Tensmeyer et al. [9] presented a simple CNN-based solution for Arabic letters of 40 Arabic fonts. They produced little picture patches into textual style classes using CNN. In their model, they used two CNN architectures of AlexNet and ResNet-50. They achieved 98.8% text line accuracy on the King Fahd University Arabic Font Database (KAFD), which has 4 styles and 10 different sizes. Zhangyang et al. [10] developed a largescale visual font recognition (VFR) dataset for English, named AdobeVFR, which consists of both labelled synthetic data and partially labelled real-world data. 
Next, they introduced a convolutional neural network (CNN) decomposition approach, using a domain adaptation technique based on a stacked convolutional auto-encoder (SCAE). The SCAE-based structure helped their model achieve 80% accuracy on their collected dataset. Their system also gives a good font similarity measure for font selection and suggestion. Wang et al. [11] proposed a technique based on deep learning and transfer learning for recognizing the fonts of typeset text in natural images of Chinese. Using images containing text in 48 fonts and styles, and combinations of those texts, they trained a deep neural network to detect fonts. They modified two popular CNN models, AlexNet and VGG16, and achieved 77.68% top-1 accuracy and 93.97% top-5 accuracy. Ramanathan et al. [12] proposed a support vector machine (SVM) solution to identify different fonts of the English language. To extract the feature vectors, they used Gabor filters. They used the six most-used English fonts in four styles (regular, bold, italic and bold italic) with 216 block images, and achieved an average accuracy of 93.54%. Khoubyari and Hull [13] used a 1000-word image dataset in English, with 33 font styles, in their proposed model and reported 85% accuracy. Ding et al. [14] proposed a model for Chinese font recognition. They used 2,741,150 character sample images of 7 different font styles to train their model and achieved an accuracy of 91.3%.
Md. M. Islam et al.
3 Datasets
We built a 12,187-image corpus of seven different Bangla fonts: AdorshoLipi, Kalpurush, Ekushey Lohit, Nikosh, Rupali, SolaimanLipi and Sonar Bangla. Figure 1 shows some examples from this dataset. From Fig. 1, it is evident that many of the fonts are very similar to each other, making the task of font detection quite tricky.
3.1 Image Preparation and Preprocessing
We built the dataset in three steps. First, we created a set of documents using the 7 fonts with different font sizes and different formats such as bold and italic, as shown in Fig. 2. We built a segmentation tool designed explicitly for Bangla. Our segmentation tool can separate the lines of a given document; from a segmented line, it can segment the words, and from a given word, it can segment the characters. We built our database of word images from these documents using this segmentation tool. We stored these word samples in 7 different folders, labelled with the font names, each folder containing 1700 images of the corresponding font.

Fig. 1 Example of dataset: (a) AdorshoLipi, (b) Kalpurush, (c) Ekushey Lohit, (d) Nikosh, (e) Rupali, (f) SolaimanLipi, (g) Shonar Bangla

Fig. 2 Some sample documents for font size 16: (a) Kalpurush, (b) Ekushey Lohit, (c) AdorshoLipi, (d) Shonar Bangla

A well-known technique for addressing model overfitting is to artificially augment the training data using label-preserving transformations before fitting the data to the model. To achieve the required variability in these font images, we used pre-processing and augmentation methods such as adding noise, blurring, changing perspective, adding rotation, and adding shading or "gradient illumination", along with:
• Rotation: Images were rotated 30° randomly.
• Zooming: Images were 20% zoomed in and zoomed out.
• Width shift: Images were randomly shifted 10% horizontally.
• Height shift: Images were randomly shifted 10% vertically.
After the preprocessing, we compiled a dataset of 77,728 annotated images to train our model.
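The augmentation ranges listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: the symmetric sampling (for example, rotation anywhere between -30° and +30°), the function names, and the toy image are assumptions made for illustration, and only the horizontal shift is actually applied to pixels.

```python
import random

# Ranges taken from the text; sampling symmetrically around zero is an assumption.
ROTATION_DEG = 30.0   # rotate by up to 30 degrees
ZOOM_FRAC = 0.20      # zoom in or out by up to 20%
SHIFT_FRAC = 0.10     # shift by up to 10% of the width or height


def sample_augmentation(width, height, rng):
    """Draw one random set of label-preserving transform parameters."""
    return {
        "rotation_deg": rng.uniform(-ROTATION_DEG, ROTATION_DEG),
        "zoom": rng.uniform(1.0 - ZOOM_FRAC, 1.0 + ZOOM_FRAC),
        "dx_px": round(rng.uniform(-SHIFT_FRAC, SHIFT_FRAC) * width),
        "dy_px": round(rng.uniform(-SHIFT_FRAC, SHIFT_FRAC) * height),
    }


def shift_horizontally(image, dx, fill=0):
    """Shift a row-major image (list of rows) by dx pixels, padding with fill."""
    width = len(image[0])
    dx = max(-width, min(width, dx))  # clamp so slicing stays well defined
    shifted = []
    for row in image:
        if dx >= 0:
            shifted.append([fill] * dx + row[: width - dx])
        else:
            shifted.append(row[-dx:] + [fill] * (-dx))
    return shifted


rng = random.Random(0)
params = sample_augmentation(width=100, height=40, rng=rng)
tiny = [[1, 2, 3, 4], [5, 6, 7, 8]]  # hypothetical 2x4 "image"
print(params)
print(shift_horizontally(tiny, 1))
```

In practice an image library would apply all four transforms at once; the sketch only shows how the stated percentages translate into parameter ranges.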
4 Proposed Method
4.1 Model Preparation and Tuning
Our proposed model has two parts: stacked convolutional auto-encoder (SCAE) layers, followed by a CNN, as shown in Fig. 3. Each SCAE is trained using ordinary online gradient descent without additional regularization terms. In our SCAE layers, the first layer is a convolutional layer with 64 filters and kernel size 48 with ReLU activation. The output of this layer goes through batch normalization (1) in "layer 2" and is connected to max-pooling "layer 3". Layers 4, 5 and 6 repeat the pattern of layers 1, 2 and 3; the convolutional "layer 4" has 128 filters and kernel size 24. Layers 7 and 9 are transposed convolutional layers with 128 and 64 filters and kernel sizes of 24 and 12, respectively, using the same padding and ReLU activation, each followed by an up-sampling layer (layers 8 and 10), which simply doubles the dimensions of its input.
A convolutional neural network (CNN) is a deep learning algorithm that takes an input image, assigns importance to various aspects of the image and learns to differentiate one from another. In our proposed model, layers 11, 12 and 13 are convolutional layers with 256 filters and kernel size 12 with ReLU (2) activation.
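To make the SCAE description concrete, the following sketch traces how the feature-map shape would change through layers 1 to 10. Several points are assumptions, since the text does not state them: the 96 x 96 single-channel input, "same" padding with stride 1 for every convolution, and a pooling factor of 2 (chosen because the up-sampling layers double the dimensions).

```python
# Layer list transcribed from the description above. The input size and the
# pooling factor of 2 are assumptions; the paper states neither.
SCAE_LAYERS = [
    ("conv", 64, 48),             # layer 1: 64 filters, kernel size 48, ReLU
    ("batch_norm",),              # layer 2
    ("max_pool", 2),              # layer 3 (factor 2 assumed)
    ("conv", 128, 24),            # layer 4: 128 filters, kernel size 24
    ("batch_norm",),              # layer 5
    ("max_pool", 2),              # layer 6 (factor 2 assumed)
    ("conv_transpose", 128, 24),  # layer 7
    ("upsample", 2),              # layer 8: doubles height and width
    ("conv_transpose", 64, 12),   # layer 9
    ("upsample", 2),              # layer 10: doubles height and width
]


def trace_shapes(height, width, channels, layers):
    """Return (kind, height, width, channels) after each layer."""
    shapes = []
    for layer in layers:
        kind = layer[0]
        if kind in ("conv", "conv_transpose"):
            # With "same" padding and stride 1 only the channel count changes,
            # so the kernel size does not affect the traced shape.
            channels = layer[1]
        elif kind == "max_pool":
            height, width = height // layer[1], width // layer[1]
        elif kind == "upsample":
            height, width = height * layer[1], width * layer[1]
        shapes.append((kind, height, width, channels))
    return shapes


shapes = trace_shapes(96, 96, 1, SCAE_LAYERS)  # hypothetical input size
for number, shape in enumerate(shapes, start=1):
    print(number, shape)
```

Under these assumptions the two pooling steps and the two up-sampling steps cancel, so the block returns feature maps at the input resolution.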
Fig. 3 Model architectures: (a) SCAE block, (b) full model
Afterwards, the output is flattened into an array and passed through fully connected dense layers (layers 15–18) with 4096 hidden units, ReLU activation and 50% dropout regularization. The output of layer 18 is connected to the fully connected dense layer 19, again with ReLU activation, followed by dense layer 20 with seven nodes and SoftMax activation (3), which is also the output layer of our model.

$$\mathrm{ReLU}(x) = \max(0, x) \tag{2}$$

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \quad j = 1, \ldots, K \tag{3}$$
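The ReLU and softmax activations used by the model can be sketched in a few lines. Subtracting the maximum before exponentiating is a standard numerical-stability step and is not taken from the paper; the seven example scores are hypothetical.

```python
import math


def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def softmax(z):
    """Softmax over a list of K scores; the outputs sum to 1."""
    shift = max(z)  # stability trick: does not change the result
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


# Seven hypothetical output scores, one per font class.
scores = [1.2, -0.3, 0.0, 4.1, 0.7, -2.0, 0.5]
probabilities = softmax([relu(s) for s in scores])
predicted_class = probabilities.index(max(probabilities))
print(predicted_class, round(sum(probabilities), 6))
```

The predicted font is simply the index of the largest probability in the seven-node output layer.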
Our proposed model used a stochastic gradient descent (SGD) optimizer (4) with a learning rate of 0.01. SGD computes the cost on one example at each step. Rather than relying only on the current gradient to update the weights, gradient descent with momentum uses a running accumulation of past gradients in place of the current gradient alone. In our proposed model, we use the default momentum value of 0.9. To measure the error during optimization, we used the mean squared error (5). We follow a common heuristic of manually dividing the learning rate by 10 when the validation error rate stops decreasing at the current rate.

$$\theta = \theta - \eta \cdot \nabla_\theta J\left(\theta; x^{(i)}; y^{(i)}\right) \tag{4}$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{5}$$
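The optimizer settings above can be illustrated on a toy problem. This is a sketch under stated assumptions, not the training code: the velocity form of the momentum update is one common formulation, the single-weight regression task is invented for illustration, and `maybe_reduce_lr` is a hypothetical helper that automates the learning-rate heuristic the authors describe applying manually.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


def sgd_momentum_step(theta, velocity, grad, lr=0.01, momentum=0.9):
    """One SGD step with momentum (velocity formulation)."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity


def maybe_reduce_lr(lr, val_errors, factor=10.0):
    """Divide lr by factor when the latest validation error did not improve."""
    if len(val_errors) >= 2 and val_errors[-1] >= val_errors[-2]:
        return lr / factor
    return lr


# Toy task (assumption): fit one weight w so that w * x matches y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, v = 0.0, 0.0
for step in range(600):
    i = step % len(xs)                        # one example per step
    grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]  # d/dw of the squared error
    w, v = sgd_momentum_step(w, v, grad)

print(round(w, 4), mse(ys, [w * x for x in xs]))
```

With a learning rate of 0.01 and momentum of 0.9, the weight settles close to 2 on this toy task.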
5 Performance
The proposed model gives satisfactory results on our training and validation sets.
5.1 Train–Test Split
After augmenting the data, we have a total of 77,728 images in our dataset. We split the dataset into two parts: one for training and another one for validation. We took 25% of the data for validation and reserved 75% for training. After the split, we had 58,296 samples in the training set and 19,432 samples in the validation set.
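The split sizes follow directly from the 25% validation fraction, as the short sketch below shows. The shuffling and the seed are assumptions; the paper does not say how samples were assigned to each part.

```python
import random

TOTAL_IMAGES = 77_728
VALIDATION_FRACTION = 0.25


def split_indices(n, val_fraction, seed=0):
    """Shuffle indices 0..n-1 and split them into (train, validation)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_val = int(n * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(TOTAL_IMAGES, VALIDATION_FRACTION)
print(len(train_idx), len(val_idx))
```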
5.2 Model Performance
Our proposed model achieves 98.73% average font recognition accuracy on the validation set. We originally planned to fit it for 25 epochs; however, as there was no meaningful change in validation loss over 10 consecutive epochs, we stopped the training early and obtained our results after 20 epochs instead. Figure 4 shows the graph of model loss and validation accuracy fluctuations.
Fig. 4 Training loss and validation accuracy graph
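The early-stopping rule described above (a budget of 25 epochs, stopping after 10 consecutive epochs without improvement) can be sketched as follows. The function name, the `min_delta` threshold and the loss curve are hypothetical; the curve is constructed only so that the rule triggers at epoch 20, and it is not the paper's training log.

```python
def early_stopping_epoch(val_losses, max_epochs=25, patience=10, min_delta=0.0):
    """Return (epoch at which training stops, best validation loss seen)."""
    best = float("inf")
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best - min_delta:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # no meaningful change for `patience` consecutive epochs
    return last_epoch, best


# Hypothetical curve: improves for 10 epochs, then stays flat.
hypothetical_losses = [1.0 - 0.05 * i for i in range(10)] + [0.60] * 15
stopped_epoch, best_loss = early_stopping_epoch(hypothetical_losses)
print(stopped_epoch, round(best_loss, 2))
```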
5.3 Discussion
One clear observation from Sect. 5.2 is that our accuracy is not very stable. Usually, when that happens, it is a sign of model overfitting. This can sometimes happen if certain portions of the samples are classified randomly. Since random guesses are, by their very nature, random, the model can sometimes exhibit severe fluctuations, as seen in our model. But that argument may also not be entirely correct. Usually, overfitting should be accompanied by increased loss, something we do not observe from our model. It is often said that deep learning is as much an art as it is a science. There are many other possible explanations for our model performance. One possibility is that we may have overcapacity and not enough training data. Since we are operating at the level of Bangla words, and we had only 58,296 samples in the training set, that may well be true. Another, more conventional idea is that our model needs fine-tuning in terms of ordering the layers, and there may be not-so-effective combinations of these layers that are responsible for this instability. Another, more iterative idea is that we may need to work on adjusting our learning rates, something that needs further evaluation and testing on our part. This short discussion indicates that although we have achieved relatively good accuracy rates on our validation set, there is room for further improvement and additional work.
6 Future Work
We plan to continue work on this project, and we outline here several specific areas for future growth. Apart from the issues raised in the discussion section above, we would also like to look at some other aspects of this work in the future. After analysing the error and confusion matrices, we found that most fonts were so similar that even human eyes could not reliably distinguish between them. We plan to create additional models that are explicitly trained to resolve such confounding clusters of classes, which are frequent areas of confusion. If our current model identifies a sample as belonging to one of these suspect classes, a specialist classifier specially trained on those classes will reclassify it. In addition, there are many fonts in Bangla that we have not worked with yet. In future efforts, we will add more Unicode and ASCII fonts to create a broader-reaching dataset for Bangla font images.
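The planned two-stage scheme can be sketched as a simple routing function. Everything here is hypothetical: the paper does not say which fonts form a confusable group, so the labels "FontA", "FontB" and "FontC" and the stand-in models are placeholders chosen for illustration.

```python
def classify_with_specialists(sample, general_model, specialists):
    """Return the general label unless it falls in a confusable group."""
    label = general_model(sample)
    for group, specialist in specialists.items():
        if label in group:
            return specialist(sample)  # second opinion within the group
    return label


# Toy stand-ins: each "sample" already carries the answers a model would give.
def general_model(sample):
    return sample["general_guess"]


def specialist_ab(sample):
    return sample["specialist_guess"]


# One hypothetical confusable group with its specialist classifier.
specialists = {frozenset({"FontA", "FontB"}): specialist_ab}

easy = {"general_guess": "FontC", "specialist_guess": "FontA"}
hard = {"general_guess": "FontA", "specialist_guess": "FontB"}
print(classify_with_specialists(easy, general_model, specialists))
print(classify_with_specialists(hard, general_model, specialists))
```

Only samples that land in a suspect group pay the cost of the second model, which keeps the common case as fast as the current classifier.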
7 Conclusion
In this paper, we provided a detailed outline of our proposed model and have shown a great deal of success towards our objective. As stated in the previous section, we have already established many potential avenues for improving the efficiency and reliability of our solution, which are ready for further examination. This is merely the beginning of a long journey for our research endeavour.
Acknowledgements The authors would like to acknowledge the encouragement and funding from the “Enhancement of Bangla Language in ICT through Research & Development (EBLICT)” project, under the Ministry of ICT, the Government of Bangladesh.
References
1. The World Factbook: www.cia.gov. Central Intelligence Agency. Archived from the original on 13 February 2008. Retrieved 21 Feb 2018
2. Hasan, M.Z., Rahman, K.T., Riya, R.I., Hasan, K.Z., Zahan, N.: A CNN-based classification model for recognizing visual Bengali font. In: Proceedings of International Joint Conference on Computational Intelligence, pp. 471–482. Springer (2020)
3. Rakshit, A., Barshan, R.A., Islam, M.I.: Bangla font detection using 1-D discrete wavelet transform
4. Chanda, S., Pal, U., Franke, K.: Font identification - in context of an Indic script. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp. 1655–1658. IEEE (2012)
5. Yadav, R.K., Mazumdar, B.D.: Detection of bold and italic character in Devanagari script. Int. J. Comput. Appl. 39(2), 19–22 (2012)
6. Lehal, G.S., Saini, T.S., Buttar, S.P.K.: Automatic bilingual legacy-fonts identification and conversion system. Res. Comput. Sci. 86, 9–23 (2014)
7. Ghosh, S., Roy, P., Bhattacharya, S., Pal, U.: Large-scale font identification from document images. In: Asian Conference on Pattern Recognition, pp. 594–600. Springer, Cham (2019)
8. Slimane, F., Kanoun, S., Hennebert, J., Alimi, A., Ingold, R.: A study on font-family and font-size recognition applied to Arabic word images at ultra-low resolution. Pattern Recogn. Lett. 34(2), 209–218 (2013)
9. Tensmeyer, C., Saunders, D., Martinez, T.: Convolutional neural networks for font classification. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 985–990 (2017)
10. Wang, Z., Yang, J., Jin, H., Shechtman, E.: DeepFont: identify your font from an image. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 451–459. ACM, Brisbane, Australia (2015)
11. Wang, Y., Lian, Z., Tang, Y., Xiao, J.: Font recognition in natural images via transfer learning. In: Tang, Y., Xiao, J. (eds.) International Conference on Multimedia Modeling 2018, LNCS, vol. 10704, pp. 229–240. Springer (2018)
12. Ramanathan, R., Thaneshwaran, L., Viknesh, V., Arunkumar, T., Yuvaraj, P., Soman, D.K.P.: A novel technique for English font recognition using support vector machines. In: 2009 International Conference on Advances in Recent Technologies in Communication and Computing, pp. 766–769 (2009)
13. Khoubyari, S., Hull, J.: Font and function word identification in document recognition. Comput. Vis. Image Underst. 63, 66–74 (1996)
14. Ding, X., Chen, L., Wu, T.: Character independent font recognition on a single Chinese character. IEEE Trans. Pattern Anal. Mach. Intell. 29, 195–204 (2007)
Heart Diseases Classification Using 1D CNN
Jemia Anna Jacob, Jestin P. Cherian, Joseph George, Christo Daniel Reji, and Divya Mohan
Abstract An electrocardiogram (ECG) is used to check the rhythm and electrical activity of the heart. Variations in the rhythm cause changes in the ECG, so analyzing the ECG makes it possible to detect abnormalities of the heart. In the proposed model, the input is processed ECG data in the form of numerical values, and the output is a set of predictions. Training and testing are done using the MIT-BIH arrhythmia database. For each record, the model predicts whether the heartbeat is normal or abnormal, and abnormal heartbeats are classified into different categories. The preprocessed data is input to a four-layer one-dimensional convolutional neural network (1D CNN), which classifies the heartbeat and predicts the disease. The model classifies the disease with 98% accuracy. It is developed as a portable system so that users can check their ECG at any time. This will be helpful for early detection of disease and useful in areas with no medical facilities.
Keywords ECG classification · 1D convolutional neural network (1D CNN) · Heart disease detection
J. A. Jacob · J. P. Cherian · J. George · C. D. Reji (B) · D. Mohan
Department of Computer Science and Engineering, Saintgits College of Engineering Kottayam, Kottayam, Kerala, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_72
1 Introduction
Electrocardiogram (ECG) diagnosis helps people easily check for heart disease and can predict abnormalities of the heart. This paper uses deep learning for that purpose. The essential aim of the system is to predict abnormalities of the heart with the help of the ECG, which measures the rhythm of the heart. Any abnormality in the ECG indicates that the heart may be at risk. Currently, a person who wants to check their ECG needs to go to a hospital and consult a doctor to interpret it. There are cases where unnoticed pain or trouble in the heart leads to cardiac arrest. In such cases, the main problem is that people are reluctant to go to hospital, thinking that it may
J. A. Jacob et al.
not cause any critical problem. This is where the relevance of this project becomes visible: it helps to check the ECG and predict abnormalities of the heart. In this paper, the model classifies abnormal ECG beats into four categories associated with heart diseases: ventricular ectopic beats, fusion beats, supraventricular ectopic beats, and unknown beats. Whenever there is a problem in the heart, its rhythm changes from the normal beating pattern, and based on the change in the heartbeat pattern, the ECG can be classified into the above-mentioned four categories. A one-dimensional convolutional neural network is used to classify arrhythmic beats. CNN models were developed for image classification problems, where the model learns an internal representation of a two-dimensional input, in a process referred to as feature learning. The same process can be applied to one-dimensional sequences of data, such as acceleration and gyroscopic data for human activity recognition. The model learns to extract features from sequences of observations and to map the internal features to different activity types. A one-dimensional CNN is very useful when interesting features are expected to come from shorter (fixed-length) segments of the overall dataset and where the position of the feature within the segment is not of high importance. The benefit of using a 1D CNN for classification is that it can learn from the raw time series data directly and does not need domain expertise to manually engineer input features. The model can learn an internal representation of the time series data and ideally achieve performance comparable to models fit on a version of the dataset with engineered features. The computational model is developed as a portable system so that users can check their ECG without going to a hospital.
The wearable device measures the ECG and converts it into numerical values. These preprocessed data can be given to the model for prediction. This helps users check their ECG instantly and is useful in remote areas where hospitals or medical facilities are not established. The objectives of the proposed model are to detect heart diseases quickly so that the necessary treatment can be provided as soon as possible; to help people in rural areas without adequate medical facilities to easily determine whether they are suffering from any heart-related issue; and to help less experienced doctors correctly identify the type of heart disease a patient is suffering from. Many small-scale hospitals have only inexperienced doctors, for whom it is quite difficult to tell which types of ECG signal indicate which diseases, so this project can help them considerably in predicting heart diseases.
2 Literature Survey
Many machine learning algorithms have been used to detect and categorize arrhythmia and to reduce labor costs. These techniques generally rely on three principal steps: preprocessing, feature
extraction, and classification. Classification and regression tree (CART) analysis has been proposed for accurate prediction of different types of arrhythmia in patients [1]. A decision tree is a multi-level decision-making method in which a decision is made at each node. To evaluate the accuracy of the tree, the data were split in three ways: 50% training and 50% test data, 70% training and 30% test data, and a configuration using 100% of the data for training and testing. The tree was trained on the training data and tested on the test data. Tenfold cross-validation was also used. The UC Irvine (UCI) machine learning repository is used to carry out a multinomial classification for various kinds of heart diseases. The authors also propose that the best choice for this application is an algorithm that learns a decision tree. They obtained 80% classification accuracy. Paper [2] proposes a one-dimensional CNN with twelve layers to categorize individual single-lead heartbeat signals into five different categories of heart disease. During feature extraction, they used principal component analysis with a wavelet transform to reduce the dimensionality of the features. They used a one-dimensional CNN rather than a standard 2D CNN, so the input is the processed ECG. The network takes an ECG signal time series as input and produces a series of label predictions as output. The network has twelve layers, including four one-dimensional convolution layers. The MIT-BIH arrhythmia database is used in this research. The database consists of 48 records that were studied by the BIH arrhythmia laboratory. Each record is a 30-min, two-channel ECG signal selected from a patient's 24-h recording. The ECG signals are sampled at 360 Hz.
Another model is based on recurrent neural networks (RNN) and a density-based clustering technique [3]. The RNN is used to learn features from the morphology of different beats and to learn the temporal relationships among ECG signal points. A density-based clustering procedure is then used to obtain the dataset needed for training; the clustering ensures that these beats are drawn from a large pool of data. The method is evaluated on the MIT-BIH database, where the experimental results show that the suggested algorithm achieves state-of-the-art classification performance. In paper [4], two neural network architectures are used to classify arbitrary-length electrocardiogram (ECG) recordings and are evaluated on an atrial fibrillation (AF) classification dataset; one of them is a convolutional recurrent neural network (CRNN) that fuses a 24-layer CNN with a three-layer long short-term memory (LSTM) network for temporal aggregation of features. To preprocess the data, the one-sided spectrogram of the time-domain input ECG signal is computed and a logarithmic transform is applied. All convolutional layers first apply a set of 5 × 5 convolutional filters, followed by batch normalization and ReLU activation. For the CNN and CRNN architectures, the convolutional layers are divided into blocks of 4 and 6 layers, referred to as ConvBlock4 and ConvBlock6. Data augmentation can act as a regularizer to prevent overfitting in neural networks and also improves classification performance when class frequencies are unbalanced, so a simple data augmentation scheme tailored to the ECG data at hand is created. Specifically,
there are two techniques used for data augmentation, namely dropout bursts and random resampling. For both network architectures, cross-entropy loss is used as the training objective, with the Adam optimizer and a batch size of 20. They achieved 70% accuracy during training with this method. A deep learning model has also been formed by combining long short-term memory with a CNN [5]. The model is aided by an oversampling technique and classifies the dataset into four categories: normal sinus rhythm, atrial fibrillation, noisy recordings, and others. This model uses the PhysioNet Challenge 2017 dataset, which is made up of 8528 short single-lead ECG samples, sampled at a rate of 300 Hz and lasting from 30 to 60 s. These ECG samples were recorded with a mobile AliveCor device. The dataset contains four types of ECG signal. A 12-layer 1D convolutional neural network extracts features from the normalized single-lead ECG; the features are then passed to the long short-term memory, whose output is passed to dense layers that classify the ECG signal into 4 classes. ECG signals are trimmed to 1–25 s so that no information is lost and the deep learning model can train without problems and provide the required outputs. Oversampling helps this method extract more data from the samples and improves the results considerably, since deep learning works best with balanced data. In another approach, a deep feature method based on a CNN and an extreme learning machine is applied to ECG records [6]. An open-access ECG database from PhysioNet was used to separate normal and abnormal ECGs. The PTB diagnostic dataset has 549 ECG signals from the PhysioNet database. The ECG readings in the dataset were taken from 294 volunteers. A normal ECG signal was received from about 20% of the volunteers.
DL techniques were used in the feature extraction step before the classification stage, with the aim of categorizing the signals as normal or abnormal. A CNN was used to obtain deep features; specifically, AlexNet, a member of the CNN architecture family, was used. A machine learning method called the extreme learning machine (ELM) is used for classification, in which only the weights of the final layer are determined during training. Incidentally such weights can be found along with other weights which have exponents. The ELM is a generalized single-layer feed-forward neural network (SLFN) in which the hidden layer does not need to be tuned. This classification model achieved an accuracy of 88.33%.
3 Proposed Model
An electrocardiogram helps identify heart disease. For the prediction system, the continuous ECG signal must be converted to a trainable format. There will be many variations in the ECG. The dataset for training the system has to be collected, and the system must predict abnormalities of the heart.
Fig. 1 Block diagram of proposed model
3.1 Block Diagram See Fig. 1.
3.2 Dataset
The MIT-BIH arrhythmia database and the PTB diagnostic ECG database are used for training and testing [7]. The MIT-BIH database consists of forty-eight half-hour excerpts of two-channel ambulatory ECG recordings. An ambulatory ECG device is a portable device that measures the ECG and saves the recording in its memory. The recordings were digitized with a resolution of 11 bits over a range of 10 mV at 360 samples per second. There are 109,446 samples, which are categorized into five classes. The ECGs in the PTB diagnostic database were obtained with a non-commercial PTB prototype recorder with the following specifications:
• 16 input channels: 14 for ECG, 1 for respiration and 1 for line voltage
• Input resistance: 100 Ω (DC)
• Resolution: 16 bit with 0.5 µV
• Bandwidth: 0–1 kHz
• Noise voltage: maximum 10 µV (pp), respectively 3 µV (RMS), with input short circuit
• Online recording of skin resistance
• Noise level recorded during signal collection
The database is composed of 549 records of 290 subjects. Each signal is digitized with 16-bit resolution at 1000 samples per second over a range of 16.384 mV.
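The digitization figures quoted for the two databases can be checked with a short calculation of the voltage step per ADC code. Two assumptions are made here: that the PTB range is symmetric (±16.384 mV, a 32.768 mV span), which is what reproduces the 0.5 µV resolution quoted in the recorder specification, and that the mid-scale code corresponds to 0 mV.

```python
def lsb_microvolts(span_mv, bits):
    """Voltage step per ADC code, in microvolts, for a given span and bit depth."""
    return span_mv * 1000.0 / (2 ** bits)


def adc_to_mv(raw, bits, span_mv):
    """Convert a raw code to millivolts, assuming mid-scale is 0 mV."""
    return (raw - 2 ** (bits - 1)) * span_mv / (2 ** bits)


# MIT-BIH: 11 bits over a 10 mV range.
mit_bih_lsb = lsb_microvolts(10.0, 11)

# PTB: 16 bits; the span is assumed to be +/-16.384 mV, i.e. 32.768 mV in total.
ptb_lsb = lsb_microvolts(2 * 16.384, 16)

print(mit_bih_lsb, ptb_lsb)
print(adc_to_mv(1024, 11, 10.0))
```

Under these assumptions each MIT-BIH code step is a little under 5 µV, roughly ten times coarser than the PTB recordings.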
3.3 Hardware Design
• The controller is an Arduino Uno microcontroller board.
• The AD8232 ECG sensor module is used for receiving the ECG.
• The ECG sensor module is connected to the Arduino board.
• Three leads (red, yellow, green) are attached to the ECG sensor module.
• The red lead is placed on the right arm, the yellow lead on the left arm, and the green lead on the right leg.
• The signal can be obtained using the Arduino software.
3.4 Arduino Uno
• The Arduino Uno is a board based on the 8-bit ATmega328P microcontroller. It also includes components such as a crystal oscillator, serial communication, and a voltage regulator.
• It has 6 analog input pins, 14 digital pins that can be used as either input or output, a USB connection, a power barrel jack, a reset button, and an ICSP header.
• It has a 5 V operating voltage, and the input voltage can vary from 7 to 12 V.
• It comes with a 16 MHz crystal oscillator, which generates a clock of precise frequency from a constant voltage.
3.5 ECG Sensor Module AD8232
The AD8232 is a low-cost sensor used to measure the heart's rhythm or electrical activity. The monitored rhythm can be plotted as an ECG. The plotted ECG can contain noise, but the single-lead heart rate monitor helps obtain a clear signal: it is designed to extract, amplify, and filter biopotential signals under noisy conditions. The AD8232 module exposes nine connections from the integrated circuit, including LO+, OUTPUT, 3.3 V, GND, SDN, and LO−. These pins are used for operating the module with an Arduino or another development board. The module also provides Right Arm (RA), Left Arm (LA), and Right Leg (RL) pins, which are used to obtain the ECG. In addition, an LED indicator blinks to the rhythm of the heartbeat.
4 Model Architecture
4.1 Circuit Diagram
In some cases, ECGs can be very noisy, and the AD8232 single-lead heart rate monitor acts like an op amp to help obtain a clear signal from the PR and QT intervals. The AD8232 is an integrated signal conditioning block for ECG and other biopotential measurement applications. It is designed to extract, amplify, and filter small biopotential signals in the presence of noisy conditions, such as those created by motion or remote electrode placement. The AD8232 module breaks out nine connections from the integrated circuit, to which pins, wires, or other connectors can be soldered. SDN, LO+, LO−, OUTPUT, 3.3 V, and GND provide the essential pins for operating the monitor with a development board. Right Arm (RA), Left Arm (LA), and Right Leg (RL) pins are also provided on the board for attaching custom sensors. An LED is also present, which pulses to the rhythm of the heart. The AD8232 can implement a two-pole high-pass filter to remove motion artefacts and the electrode half-cell potential. This filter is tightly coupled with the instrumentation architecture of the amplifier to allow both large gain and high-pass filtering in a single stage, thereby saving space and cost. An uncommitted operational amplifier enables the AD8232 to create a three-pole low-pass filter to remove additional noise. The user can select the cutoff frequency of each filter to suit different types of applications. The ECG module has 5 pins. The VCC (3.3 V) pin powers the module with 3.3 V from the Arduino board (the Arduino Uno has an onboard 3.3 V regulator). The GND pin is connected to the GND pin of the Arduino. LO+ and LO− are output pins of the AD8232 module; they are used to sense the state of the leads connected to the patient's body and are connected to pins 10 and 11 of the Arduino.
The OUTPUT pin provides the measured signal as an analog voltage, which is fed to analog pin A0. The Arduino has a built-in analog-to-digital converter, which converts this analog value to digital through sampling and sends it through the serial interface. The AD8232 includes an amplifier for driven lead applications, such as right leg drive (RLD), to improve common-mode rejection of the line frequencies and other undesired interference. The AD8232 also includes a fast restore function that reduces the duration of the otherwise long settling tails of the high-pass filters. After an abrupt signal change that rails the amplifier, the AD8232 automatically switches to a higher filter cutoff. This feature allows the AD8232 to recover quickly and therefore to take valid measurements soon after the electrodes are connected to the subject. The Arduino board is connected to the Arduino IDE, which is used to obtain the signal from the ECG module. The signal is normalized and converted to a CSV file. Using the user interface, the saved CSV file is selected and given as input to the 1D CNN model.
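The normalization and CSV conversion step can be sketched as follows. The paper does not say which normalization is used, so min-max scaling to the range 0 to 1 is an assumption, as are the function names, the four-decimal formatting and the sample readings; the raw values are hypothetical codes from the Arduino Uno's 10-bit ADC, which reports 0 to 1023.

```python
import csv
import io


def normalize(samples):
    """Min-max scale samples to the range 0..1 (assumed normalization)."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0 for _ in samples]  # flat signal: avoid division by zero
    return [(s - lo) / (hi - lo) for s in samples]


def to_csv_row(samples):
    """Format samples as one CSV row with four decimal places."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow([f"{value:.4f}" for value in samples])
    return buffer.getvalue().strip()


# Hypothetical raw readings from analog pin A0 (10-bit codes, 0..1023).
raw = [512, 530, 700, 480, 512]
row = to_csv_row(normalize(raw))
print(row)
```

Writing one beat per row keeps the file easy to load as a fixed-length input vector for the 1D CNN.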
4.2 1D CNN Architecture
The proposed model is a multilayer one-dimensional CNN which consists of four pooling layers, fully connected layers, and dropout layers. The network takes a single ECG time series as input and produces a series of label predictions as output. The proposed network is made up of 4 layers. The convolution layers in the network use filters of different sizes; smaller filters reduce the computation required, so the model can be trained quickly. Rectified linear units (ReLU) are used as the activation function for each convolution layer instead of the sigmoid, because they mitigate the vanishing gradient problem and reduce the computation required. Two convolution layers are stacked at the beginning of the network. The stacked structure has a greater capacity for learning new features. For instance, it contains two nonlinear ReLU activation functions instead of the one in a single layer, which makes the decision function more discriminative. A pooling layer follows the convolution layers, as pooling makes the representation less sensitive to translations of the input. Moreover, pooling layers reduce overfitting by reducing the size of the data, and they reduce the time and effort of computation. Different filter sizes are used in the different layers. The dropout layer randomly sets some elements of its input vector to zero. It has no trainable parameters, which means that nothing in the layer is updated during training. For multi-class classification, the output layer uses the softmax activation function. The Adam optimizer was used to fit the parameters during training. The selected model is the one with the best performance during the optimization process.
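The building blocks named above (1D convolution, ReLU and pooling) can be illustrated in plain Python. This is a teaching sketch, not the trained network: the eight-sample "beat", the two-tap kernel and the pooling size of 2 are all hypothetical.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1D cross-correlation, as used in CNN layers."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Apply max(0, v) element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


# Hypothetical beat with one sharp upstroke, and a kernel that responds to rises.
beat = [0.0, 0.1, 0.0, 1.0, 0.2, 0.0, 0.1, 0.0]
rise_kernel = [-1.0, 1.0]
features = max_pool1d(relu(conv1d(beat, rise_kernel)))
print(features)
```

The large middle value marks the upstroke, and pooling keeps it even if the spike moves by one sample, which is the translation tolerance described in the text.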
5 Experimental Results
The preprocessed MIT-BIH arrhythmia data is the input to the proposed model, and the 1D CNN is trained on the training portion of this data. The model achieves a validation accuracy of 97.2% and an accuracy of 98%. The 1D CNN classifies each heartbeat as normal or abnormal. Normal beats are termed non-ectopic beats. Abnormal beats, which are associated with heart diseases, are then classified into four categories: supraventricular ectopic beats, ventricular ectopic beats, fusion beats, and unknown beats. From Fig. 2, ‘N’ refers to normal beats, for which the true positive rate is 0.97. All other classes are abnormal beats: ‘S’ refers to supraventricular ectopic beats, ‘V’ refers to ventricular ectopic beats, ‘F’ refers to fusion beats, and ‘Q’ refers to unknown beats (Figs. 3 and 4; Table 1).
Heart Diseases Classification Using 1D CNN
Fig. 2 Structure of proposed 1D CNN
6 Conclusion
In this paper, heart disease is predicted using a 1D CNN. The model classifies arrhythmic beats and predicts the heart disease. It is trained and tested on the MIT-BIH arrhythmia dataset. The model can be extended into a wearable device that acquires ECG signals and predicts the disease immediately, so that disease is detected early and patients can take an ECG and get the result at home.
Fig. 3 Chosen heartbeat is recognized
Fig. 4 Confusion matrix of classification
Table 1 Performance analysis of proposed model against existing approaches
Article                                      Classification model   Accuracy (%)
Zhou et al. [1]                              CART                   80
Verma and Agarwal [5]                        CNN and LSTM           93.92
Chenshuang Zhang et al. [3]                  RNN                    98
Heart disease classification using 1D CNN    1D CNN                 98
The proposed model attains an accuracy of 98%. The AD8232 ECG sensor module, which is connected to an Arduino Uno microcontroller, measures the heart’s rhythm. This sensor is a cost-effective board. The signal obtained from the Arduino is input directly to the model. The 1D CNN classifies each heartbeat as normal or abnormal, and abnormal beats are then classified into four categories. Immediate prediction and automatic classification of heartbeats will help in detecting heart disease, which can help doctors provide the correct medication as soon as possible. Because it is very cost-effective, this project can be implemented in many rural hospitals and community health centers, where it will be very useful.
References
1. Zhou, B., Rajbhandary, P., Garcia, G.: Detecting heart abnormality using ECG with CART
2. Zhang, W., Yu, L., Ye, L., Zhuang, W., Ma, F.: ECG signal classification with deep learning for heart disease identification. In: International Conference on Big Data and Artificial Intelligence (2018)
3. Zhang, C., Wang, G., Zhao, J., Gao, P., Lin, J., Yang, H.: Patient-specific ECG classification based on recurrent neural networks and clustering technique. In: IASTED International Conference, Biomedical Engineering (BioMed 2017)
4. Zihlmann, M.: Convolutional recurrent neural networks for electrocardiogram classification
5. Verma, D., Agarwal, S.: Cardiac arrhythmia detection from single-lead ECG using CNN and LSTM assisted by oversampling
6. Diker, A., Avci, E.: Feature extraction of ECG signal by using deep feature
7. MIT-BIH Arrhythmia Database. https://physionet.org/content/mitdb/1.0.0/
Detection of Depression and Suicidal Tendency Using Twitter Posts
Sunita Sahu, Anirudh Ramachandran, Akshara Gadwe, Dishank Poddar, and Saurabh Satavalekar
Abstract It was established that between 1987 and 2007, the suicide rate rose from 7.9 to 10.3 per 100,000, with higher suicide rates in the southern and eastern states of India. India faces not only the threat of suicide but also that of depression. A study reported in the World Health Organization (WHO), conducted for the National Care Of Medical Health (NCMH), states that at least 6.5% of the Indian population suffers from some form of serious mental disorder, with no discernible rural–urban differences. The key challenge of suicide and depression prevention is understanding and detecting the complex risk factors and warning signs that may precipitate the event. In this project, we present an approach that uses a social media platform to quantify suicide warning signs for individuals, to evaluate a person’s mental health and to detect posts containing suicide-related content. The pivot point of this approach is the automatic identification of sudden changes in a user’s online behaviour. To detect such changes, we use natural language processing techniques to aggregate behavioural and textual features and pass these features through a martingale framework, which is widely used for change detection in data streams.
Keywords Suicide · Depression · Machine learning · Natural language processing
S. Sahu · A. Ramachandran · A. Gadwe · D. Poddar (B) · S. Satavalekar
Computer Department, V.E.S Institute of Technology, Mumbai, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_73
1 Introduction
Suicidal activity refers to speaking or taking action to end one’s own life. Suicidal thoughts and actions should be treated as a medical emergency. The most common indications [1] of a person showing suicidal/depressive signs are:
• Talking about feeling hopeless, trapped or alone
• Saying they have no reason to go on living
• Making a will or giving away personal possessions
• Searching for a means of doing personal harm, such as buying a gun
• Engaging in reckless behaviours, including excessive alcohol or drug consumption
• Avoiding social interactions with others
• Expressing rage or intentions to seek revenge
• Having dramatic mood swings
• Talking about suicide as a way out.
Creating predictive models can save the lives of individuals and aid in the diagnosis of conditions such as depression. An individual may suffer from depression or bipolar disorder, for example, without showing symptoms of it. When people exhibit a certain form of behaviour, they will be flagged. Doctors can use these forecasts to help diagnose suicidal thoughts and thereby reduce the number of incorrect diagnoses. The main reasons why natural language processing is becoming an important factor for such diagnosis are:
• With the introduction of NLP methods, far more structured and unstructured data can now be processed and analysed.
• Positive prediction accuracy has increased substantially as a result of using NLP tools.
The paper is organized as follows. Section 2 discusses the current scenario. Section 3 reviews the literature. Section 4 describes our proposed system. Section 5 explains our methodology. Section 6 presents the results that we obtained. Section 7 reviews the limitations and concludes.
2 Current Scenario
It is very difficult to pinpoint depressive/suicidal tendencies accurately. Most medical practitioners use the American Psychological Association’s Patient Health Questionnaire (PHQ) to diagnose depression [2]. The short questionnaire asks patients about their lack of involvement in daily activities, appetite and eating patterns, ability to focus and other depression-detection questions. On the basis of the form, doctors gauge the overall personality of a person. Academic interviews and questionnaires are the main tools for diagnosing depression/suicidal tendencies [3]. Although there are many other problems in determining a patient’s emotional condition from their responses to the questionnaire, the biggest stumbling block is that the entire process is voluntary. In simple terms, the patient must be willing to open up to the doctors, and this willingness directly influences the doctors’ diagnosis. A depressed or suicidal person might hesitate to do this because of the stigma associated with mental health issues [4]. This means that even if there are doctors who can help patients with such problems, not much can be done when the patient is reticent, because reaching out to each person individually is not feasible.
3 Literature Review
Many researchers have implemented different techniques to detect suicidal, depressive and similar tendencies, which are summarized in this section. Data is generally gathered from disparate sources. The Twitter API was the main source of data in [3, 5, 6] and [7]. Reddit was also used as a source in [3]. Organizations like OurDataHelps.org were also used to gather data [6]. Different feature extraction techniques were used, such as linguistic inquiry and word count (LIWC) [3, 7], term frequency–inverse document frequency (TF-IDF) [3, 8] and part-of-speech (POS) tagging [8]. To detect sudden or drastic changes in a user’s behaviour, martingale frameworks were used [9]. A multitude of techniques, such as classification algorithms and neural networks, were applied to the gathered datasets to detect these tendencies. Random forest was used in [3, 10] and [8]. Support vector machine was used in [5]. Long short-term memory was implemented in [3, 6]. Logistic regression was used as a classification method in [8]. Some of the analysis was performed on a specific group of people, as in [6], which concentrated mainly on females aged 18–24. Cosine similarity and graph generation were implemented in [11] to form clusters of similar tweets.
4 Proposed System
We propose a web-based system to flag behaviour that indicates suicidal tendencies. The users of this system could be an authority such as a government agency, NGOs or even teams from Twitter. The user is provided with a dashboard where they can see tweets that have been flagged as dangerous, or see statistics about the flags either for a particular person or across all tweets. They can read the tweets and judge the validity of our predictions, after which they may take appropriate action.
5 Methodology
The methodology for our proposed system is depicted in Fig. 1. It is divided into the following steps.
Fig. 1 Methodology diagram
5.1 Data Collection
We consulted a psychologist and created a dictionary of words related to depression and suicidal ideation. We then used Twint to gather 8123 tweets containing these keywords. Twint is a Twitter scraping tool that allows us to gather tweets from Twitter users without using the Twitter API [12]. About 400 tweets were gathered per word in the dictionary. At this point, no specific user is targeted. The collection is repeated at regular intervals. The tweets are then divided into two sections, ‘Tweets with images’ and ‘Tweets without images’, because tweets with images are treated slightly differently from text-only tweets. Tweets that are not accompanied by an image are classified using the text classifier (Sect. 5.3.2). To classify tweets that contain images, we use the Vision API.
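The split into the two sections can be sketched as below. This is an illustration under assumptions: tweets are represented as plain dictionaries, and the `photos` field name and the sample records are hypothetical rather than taken from the authors' scraper output.

```python
def split_by_image(tweets):
    """Partition tweets into (with_images, without_images)."""
    with_images, without_images = [], []
    for tweet in tweets:
        # A missing or empty "photos" list means the tweet is text-only.
        if tweet.get("photos"):
            with_images.append(tweet)
        else:
            without_images.append(tweet)
    return with_images, without_images

# Hypothetical scraped records.
tweets = [
    {"id": 1, "text": "first tweet", "photos": ["a.jpg"]},
    {"id": 2, "text": "second tweet", "photos": []},
    {"id": 3, "text": "third tweet"},
]
with_images, without_images = split_by_image(tweets)
```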
5.2 Preprocessing
Preprocessing for all tweets included removing punctuation, tags, special characters, hyperlinks and stop words, after which we manually labelled the tweets. Tweets in languages other than English are not yet considered. Because emoticons and Internet slang are used heavily, they need to be dealt with so that the content is understandable for a machine. Slang terms are replaced with their standard English equivalents. Emoticons, however, convey the sentiment of the message, so they cannot simply be removed; we replaced them with keywords that reflect similar sentiments. The tweets were labelled with four categories: suicidal, i.e. clearly indicating an intent to commit suicide; depressed, which can be briefly described as being unhappy and stressed at the current stage of life; sad, a state that can be caused by a temporary event and is usually transient in nature; and not depressed. Of these, suicidal and depressed are the categories actually targeted (Table 1). This data was then used to train and test the algorithms.
Table 1 Data categories
Class  Name           Description
0      Suicidal       A direct intention to kill self
1      Depressed      A post that connotes that a person is depressive, since depression is the biggest marker for suicidal ideation
2      Sad            Loosely denotes that a person might be sad
3      Not depressed  Contains words related to suicide but is not actually suicidal
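The preprocessing steps described above can be sketched as follows. The stop-word, slang and emoticon tables are small hypothetical stand-ins for the authors' resources, and treating "tags" as @mentions and #hashtags is an assumption.

```python
import re
import string

# Hypothetical lookup tables, for illustration only.
STOP_WORDS = {"i", "a", "the", "to", "so", "am", "is", "of", "and"}
SLANG = {"idk": "i do not know", "rn": "right now"}
EMOTICONS = {":(": "sad", ":)": "happy"}

PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(tweet):
    """Return the cleaned tweet as a list of lowercase tokens."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)      # hyperlinks
    for emoticon, keyword in EMOTICONS.items():    # before punctuation is stripped
        text = text.replace(emoticon, f" {keyword} ")
    text = re.sub(r"[@#]\w+", " ", text)           # tags (mentions, hashtags)
    text = text.translate(PUNCT_TO_SPACE)          # punctuation, special characters
    words = []
    for word in text.split():
        words.extend(SLANG.get(word, word).split())  # expand slang
    return [w for w in words if w not in STOP_WORDS]

tokens = preprocess("IDK what to do rn :( @friend https://t.co/abc #alone!")
```

Emoticons are replaced before punctuation is removed, since stripping punctuation first would destroy them.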
5.3 Classification of Tweets
The tweets are classified into four categories (see Table 1). Textual tweets are fed to the classifier directly, and the labels returned are stored in the database. Tweets flagged as ‘not depressed’ are dropped, while the others are considered for further analysis. Tweets with images are classified in two stages: the textual part is classified as discussed below, and the images are annotated using the Vision API.
5.3.1 Image Classification
Images contribute significantly to the actual meaning of a tweet, so they need to be analysed too. We use the Vision API for this purpose. The Vision API is a powerful pre-trained machine learning model offered by Google [13]. It assigns labels to images and classifies them into about a million categories. It also returns a confidence value with every matched label. We keep a distinct dictionary for each of the classes into which we intend to classify the images. Once we get the set of labels for a particular image, we look for the class dictionary with which the labels overlap the most. In addition, the corresponding confidence values must be higher than the threshold value. The class that satisfies both conditions is taken as the final class.
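The selection rule just described (largest dictionary overlap among labels whose confidence clears a threshold) can be sketched as below. The function takes already-annotated labels as `(description, confidence)` pairs and makes no call to any annotation service; the class dictionaries, the threshold of 0.7 and the sample labels are all hypothetical.

```python
def classify_image(labels, class_dictionaries, threshold=0.7):
    """Pick the class whose dictionary overlaps most with confident labels.

    labels: list of (description, confidence) pairs from an image annotator.
    Returns None when no confident label matches any dictionary.
    """
    confident = {desc.lower() for desc, score in labels if score >= threshold}
    best_class, best_overlap = None, 0
    for name, vocabulary in class_dictionaries.items():
        overlap = len(confident & vocabulary)
        if overlap > best_overlap:      # ties keep the earlier class
            best_class, best_overlap = name, overlap
    return best_class

# Hypothetical dictionaries and annotator output.
class_dictionaries = {
    "suicidal": {"rope", "blade", "pills"},
    "depressed": {"darkness", "tears"},
    "sad": {"rain", "smile"},
}
labels = [("Darkness", 0.92), ("Rope", 0.81), ("Blade", 0.75), ("Smile", 0.40)]
image_class = classify_image(labels, class_dictionaries)
```

How ties and images with no confident match are handled is not stated in the text; returning the earlier class and `None` respectively is a choice made for this sketch.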
5.3.2 Text Classification
We experimented with methods such as logistic regression, SVM, multinomial naive Bayes and random forest. However, none of these methods gave a score high enough to be useful; therefore, we employed a convolutional neural network (CNN). Like any other neural network, a CNN has neurons with learnable weights. We chose a CNN because of the peculiarities present in the dataset. A CNN reduces the cost by learning low-level, sometimes meaningless, features in its initial layers and then combining the information from previous layers into features that often have meaning in its later layers. Moreover, a CNN uses pooling layers, which further reduce the dimensionality of the representation and make the network insensitive to the exact position of a feature. In our model, we used the softmax activation function, the Adam optimizer and 100 hidden layers. Softmax was used because it outputs a probability for each class: each value lies between 0 and 1, and the probabilities sum to 1. The softmax function is very widely used for multi-class classification models like ours, because it returns the probability of each class, and the predicted class is the one with the highest probability.
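The softmax properties mentioned above can be checked with a few lines of Python. The class names follow Table 1; the output-layer scores are hypothetical values chosen only for illustration.

```python
import math

CLASSES = ["suicidal", "depressed", "sad", "not depressed"]   # classes 0-3

def softmax(scores):
    """Map raw scores to probabilities that lie in (0, 1) and sum to 1."""
    peak = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.2, 3.1, 0.4, -0.8]                # hypothetical output-layer scores
probabilities = softmax(scores)
predicted = CLASSES[probabilities.index(max(probabilities))]
```

Subtracting the largest score before exponentiating does not change the result but avoids overflow for large scores.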
Adaptive moment estimation or Adam was used as it realizes the benefits of both AdaGrad and RMSProp [14]. It performs better when the data is sparse. In Adam, hyperparameters have intuitive interpretation and typically require little tuning.
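For reference, a single Adam update for one scalar parameter can be sketched as follows, using the commonly cited default hyperparameters (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8). The quadratic objective and the learning rate of 0.01 are assumptions for this toy demonstration and are unrelated to the authors' training setup.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
```

Because the step is scaled by the running gradient magnitude, the first update moves the parameter by roughly the learning rate regardless of how large the gradient is.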
5.4 Dashboard
We feed our results to the database, which can then be accessed by the users via the dashboard. The dashboard was built using vanilla HTML5, CSS3 and JavaScript, with PHP as the backend language and a MySQL database. It is hosted locally for the purpose of testing.
6 Results and User Interface
Our approach to detecting depression and suicidality processes not only text but also pictures, and hence produces more comprehensive results. The threshold set for each class label in the Vision API directly affected the accuracy of the results. Hence, after careful observation, we finalized thresholds that work with the CNN results rather than against them. In text analysis, systematic preprocessing of the data, which involved handling slang words, misspelled words and words not included in the dictionary, increased the accuracy of the CNN by 11%, to 97.61%. This is significantly better than the results given by the other algorithms, i.e. 76.3% for logistic regression, 72% for SVM, 72.1% for multinomial NB and 67.6% for random forest. Since the data was imbalanced, random oversampling was used to enable learning in the minority class. Because the samples generated by random oversampling add little new context, the accuracy for the minority class is affected. The model achieved 89% accuracy for the majority class, signalling hope of further improvement. The hyperparameters were kept at their defaults, as we were unaware of the idiosyncrasies that may lie in the data. We have included screenshots of the dashboard that will be visible to the user of the system (Figs. 2 and 3).
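Random oversampling, as mentioned above, duplicates randomly chosen samples of the smaller classes until every class is as large as the biggest one. A minimal sketch, with hypothetical sample identifiers and a fixed seed for reproducibility:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate random samples of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical imbalanced data: four tweets of one class, one of another.
samples = ["t1", "t2", "t3", "t4", "t5"]
labels = ["not depressed"] * 4 + ["suicidal"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

The duplicated items are exact copies, which is consistent with the observation that oversampled data adds little new context for the smaller class.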
Fig. 2 Dashboard Home Page
Fig. 3 Dashboard indicating flagged users
7 Limitations and Conclusions
The data which we extracted to train our models was very specific; i.e. all languages except English were discarded. The data was also extracted based on a few keywords like suicide and kill, so some relevant tweets may never reach our models for training. The use of slang also makes it very difficult for the model to predict properly. The data was collected from Twitter only. Taking a holistic view, the system cannot identify people who have depression/suicidal tendencies using offline measures. In many second world nations, where a large percentage of the population lives in rural areas, the lack of any offline detection system is a major hindrance. Collecting data from other social media sites and integrating them will go a long way in improving the system. Since we have developed methods for classifying text as well as images, sites like Instagram would be a very important addition to our project. The model could be modified to improve contextual analysis. Analysing the data geographically will also help to identify potential hotspots of depression/suicide and to uncover the reasons behind them. Last but not least, collaborating with NGOs and hospitals will enhance the system.
References
1. What You Should Know About Suicide, April Kahn, Healthline, www.healthline.com/health/suicide-and-suicidal-behavior, Medically reviewed by Timothy J. Legg, Ph.D., PsyD, CRNP, ACRN, CPH on April 30, 2019
2. Patient Health Questionnaire (PHQ-9 & PHQ-2): American Psychological Association. www.apa.org/pi/about/publications/caregivers/practice-settings/assessment/tools/patient-health
3. Sher, L., Kahn, R.S.: Suicide in schizophrenia: an educational overview. Medicina 55(7), 361 (2019)
4. Ji, S., Yu, C.P., Fung, S.F., Pan, S., Long, G.: Supervised learning for suicidal ideation detection in online user content. Complexity (2018)
5. Ryu, S., Lee, H., Lee, D.K., Park, K.: Use of a machine learning algorithm to predict individuals with suicide ideation in the general population. Psychiatry Investig. 15(11), 1030 (2018)
6. Jashinsky, J., Burton, S.H., Hanson, C.L., West, J., Giraud-Carrier, C., Barnes, M.D., Argyle, T.: Tracking suicide risk factors through Twitter in the US. Crisis (2014)
7. Coppersmith, G., Leary, R., Crutchley, P., Fine, A.: Natural language processing of social media as screening for suicide risk. Biomed. Inf. Insights 10, 1178222618792860 (2018)
8. Sueki, H.: The association of suicide-related Twitter use with suicidal behaviour: a cross-sectional study of young internet users in Japan. J. Affect. Disord. 170, 155–160 (2015)
9. Vioulès, M.J., Moulahi, B., Azé, J., Bringay, S.: Detection of suicide-related posts in Twitter data streams. IBM J. Res. Dev. 62(1), 7-1 (2018)
10. Vijaykumar, L.: Suicide and its prevention: the urgent need in India. Indian J. Psychiatry 49(2), 81 (2007)
11. Abraham, A., Dutta, P., Mandal, J., Bhattacharya, A., Dutta, S.: Emerging technologies in data mining and information security. In: Advances in Intelligent Systems and Computing, vol. 813. Springer, Singapore (2019)
12. Twint - Twitter Intelligence Tool, Francesco Poldi, github.com/twintproject/twint/wiki, 11 Aug 2019
13. Cloud Vision Documentation: Google Cloud, cloud.google.com/vision/docs
14. Kowsari, K., et al.: Text classification algorithms: a survey. Information 10(4), 150 (2019)
A Deep Learning Approach to Detect Depression from Bengali Text
Md. Rafidul Hasan Khan, Umme Sunzida Afroz, Abu Kaisar Mohammad Masum, Sheikh Abujar, and Syed Akhter Hossain
Abstract Most dimensional sentiment analysis methods are built on deep learning algorithms in natural language processing, which can categorize the sentiment of a Bengali text or paragraph by assigning it a definite pole. Our purpose is to identify depression-related ‘sad’ posts in a Bengali dataset using such a method. To implement this work, we collected Bengali text from different platforms such as social media, Bengali blogs, and quotes of noble persons, and we use our model to classify the sentiment of those texts as happy or sad. Preprocessing Bengali text is one of the toughest parts of this model. To tokenize the data for training the model, we used the Keras tokenizer. In this experiment, we applied a recurrent neural network with the long short-term memory algorithm, achieved 98% accuracy, and were able to detect the sentiment in the given dataset.
Keywords Natural language processing · Deep learning · Depression detection · Bengali text preprocessing · Social media · Sentiment analysis · RNN · LSTM
Md. Rafidul Hasan Khan (B) · U. S. Afroz · A. K. M. Masum · S. Abujar · S. A. Hossain
Department of Computer Science and Engineering, Daffodil International University, Dhaka 1212, Bangladesh
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_74
1 Introduction
Social platforms such as Facebook, Twitter, and blogs are full of sentiment, because people share their own opinions there. Nowadays, many kinds of work are done using public sentiment from social platforms. People’s sentiment has a strong positive influence on Word of Mouth (WOM) activities. Various organizations and social communities depend on the sentiment of social media posts in their decision-making process [1]. Sentiment analysis is the process of breaking a complex object into smaller parts and determining the opinion it expresses. There are three main poles of sentiment: positive, negative, and neutral. Based on these three poles, we derive further categories such as “Happy” and “Sad.” Data from social platforms helps us to search for and discover the feelings or thoughts of the public [2]. Data mining procedures uncover hidden patterns in the data, and such exploration helps us to express various viewpoints [3–5]. There are three major approaches to sentiment analysis in natural language processing (NLP). The rule-based approach measures sentiment using techniques such as part-of-speech (POS) tagging, stemming, and tokenizing. The automatic approach learns a mapping from input to output, and the sentiment it reports depends on how the model is trained. The hybrid approach combines the two approaches described above. In our model, we apply a deep learning-based hybrid sentiment analysis approach to find the proper sentiment for our dataset [5, 6]. There are various motives for analyzing public sentiment, and it has several applications. Business organizations analyze sentiment to gauge the acceptability of their products. Different types of public attention, like violence marginality, election results surveys, etc., can be measured by sentiment analysis [7, 8]. Psychological measures such as depression rate and suicidal tendency can be assessed with sentiment analysis as well [9]. We analyze people’s sentiment using a deep learning algorithm, a recurrent neural network (RNN) with long short-term memory (LSTM). We classify the sentiment into two categories: one is happiness, and the other is sadness. We collect Bengali data from different Bengali blogs, Facebook, etc. Nowadays, people share sentiments such as sadness and happiness on different types of social media. Although we detect the happy or sad sentiment of the public, our main target is to find depression-related posts or text. Depression affects society in different ways; students and people’s careers are affected by it [10]. Depressed persons are very likely to attempt suicide [11]. So, we want to find depressed people through sentiment analysis on social platforms.
2 Literature Review
Sentiment analysis can be split into three categories [12]: document-level, sentence-level, and phrase-level sentiment analysis. When a model predicts results from document data, it is called document-level sentiment analysis. When a model produces results from sentence data, it is called sentence-level sentiment analysis. Phrase-level sentiment analysis produces its outcome from phrase data. Phrase-level sentiment analysis has a disadvantage, however, because phrases sometimes carry a conceptual meaning. To detect depression-related text, we follow the document-level analysis. Singh et al. [13] state that social platform text data can be used for opinion mining, sentiment analysis, news analysis, etc. Sentiment analysis is related to things like attitude, feelings, and emotion, which can be measured and used for specific purposes. They also discuss various platforms and social platform analysis tools, such as feature analysis tools, business toolkits, monitoring tools, and scientific programming tools. Social platforms are full of unprocessed data, which is unanalyzed and contains many errors [13]. Masum et al. [14] show a Bengali data preprocessing procedure. Preprocessing requires splitting the text, adding contractions, removing regular expressions, and removing stop words. Stemming is also an important part of data preprocessing [15]. There are five steps in stemming Bengali data. Rout et al. [16] apply machine learning algorithms to analyze the sentiment of tweets. They use both supervised and unsupervised algorithms. Using unsupervised algorithms first, they achieve 86% accuracy. With the lexicon-based approach, the accuracy is 75.20%. Multinomial Naive Bayes (MNB) gives them 67% accuracy. The recurrent neural network (RNN) is very effective for modeling sequential data. It maintains a long-term history in hidden states. An RNN with long short-term memory (LSTM) has one input layer and one output layer, with some hidden layers between them. For sentiment analysis in natural language processing, LSTM provides better efficiency [16]. Cheng and Tsai [17] used long short-term memory (LSTM), bidirectional LSTM (BiLSTM), and gated recurrent unit (GRU) models to analyze the sentiment of social platform data. They achieve 80.83% accuracy using LSTM, 87.17% using BiLSTM, and 64.92% using GRU. We choose the recurrent neural network (RNN) with long short-term memory (LSTM) to analyze sentiment in Bengali data in search of better performance. We collect our data from different types of social platforms.
3 Methodology
Our main focus is to find the sentiment of a Bengali paragraph using a deep learning-based algorithm. The dataset has two classes: one is “Happy,” and the other is “Sad.” Our main target is to find the depression-related paragraphs in this dataset. To obtain a proper result, there is a procedure to follow. Figure 1 shows the full flowchart of our deep learning model.
3.1 Data Collection
Deep learning-based approaches need a large amount of data to train the model. The accuracy of the model depends on the volume and quality of the dataset. It is hard to collect data for two fixed domains from various blogs and social platforms. Moreover, there are very few resources in Bengali. For this reason, we collect the necessary data from various Bengali novels, poems, and the quotations of various Bengali noble persons or literature. We arrange our dataset in two columns that specify the input and the output. The input is our text or paragraph. The output is the category. There are two types of text: text that expresses sadness is categorized as “Sad,” and text that expresses happiness is categorized as “Happy.”
Fig. 1 Work flow
3.2 Data Preprocessing
Data preprocessing is a necessary part of working with Bengali data in natural language processing (NLP). Data cleaning is mandatory to fit Bengali data to any deep learning-based algorithm. To process Bengali data, we follow these steps.
Step 1: Bengali data contains contractions. To get the proper meaning of each word, we need to expand the contractions in our text or paragraph to their full forms.
Step 2: Every type of text data is full of various kinds of unnecessary characters and components, such as “|”, “,” and “[”. After expanding the contractions, we remove all such unnecessary characters and components to clean our data.
Step 3: Bangla stop words create a problem when we train the model using different algorithms. We have a Bangla stop-word corpus, and using this corpus, we remove all stop words from the dataset.
Step 4: Stemming is a significant process in natural language processing (NLP) for Bengali text exploration. Stemming means finding the root of a word. There are some rules for identifying the root of a Bengali word, constructed from the Bengali grammar book [15, 18]. Those rules are (i) article inflection, (ii) number inflection, (iii) suffix inflection, (iv) verbal root inflection and (v) bibhakti inflection. Table 1 shows an example of each rule.
Table 1 Example of stemming rules
Inflection name           Example
Article inflection
Number inflection
Suffix inflection
Verbal root inflection
Bibhakti inflection
Table 2 Before and after stemming result
Before stemming    After stemming
Table 2 shows a Bengali text before and after stemming.
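Steps 2 and 3 of the preprocessing can be sketched as below. The character set and the three stop words are hypothetical stand-ins for the authors' resources (the paper uses its own Bangla stop-word corpus), and contraction expansion and stemming are left out because their rules and examples are not reproduced in the text.

```python
import re

# Hypothetical stand-in for the Bangla stop-word corpus.
STOP_WORDS = {"এবং", "ও", "কিন্তু"}

# Characters treated as unnecessary, including the Bengali danda "।".
UNWANTED = re.compile(r"[|,\[\]()।!?.;:\"'-]")

def clean_bengali(text):
    """Remove unwanted characters (Step 2) and stop words (Step 3)."""
    text = UNWANTED.sub(" ", text)
    return [word for word in text.split() if word not in STOP_WORDS]

# Hypothetical sample sentence.
sample = "আমি ও তুমি, [সুখী]।"
tokens = clean_bengali(sample)
```

Replacing unwanted characters with a space rather than deleting them keeps words that were separated only by punctuation from being merged.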
3.3 Data Tokenization
We use the Keras tokenizer to tokenize the dataset. After the tokenization process is complete, our data is ready for training with the deep learning algorithm.
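What a word-level tokenizer of this kind does can be illustrated in plain Python: build a word index ordered by frequency, map each text to a sequence of indices, and pad the sequences to a common length. This is a sketch of the idea, not the Keras implementation; the function names, the tie-breaking rule and the sample texts are assumptions.

```python
from collections import Counter

def fit_word_index(texts):
    """Most frequent word gets index 1; index 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties: by code point
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad(sequence, length):
    """Left-pad with zeros, or keep only the last `length` items."""
    sequence = sequence[-length:]
    return [0] * (length - len(sequence)) + sequence

# Hypothetical cleaned texts.
texts = ["মন ভালো", "মন খারাপ"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = [pad(seq, 4) for seq in sequences]
```

The padded integer sequences are the form in which text is usually handed to an embedding layer followed by an LSTM.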
3.4 Model There are several types of deep learning-based algorithms in natural language processing (NLP). We use the recurrent neural network (RNN) with long short-term memory (LSTM) to gain the appropriate output.
4 Experiment Result
We trained the recurrent neural network (RNN) with long short-term memory (LSTM) for 50 epochs with a batch size of 128. We use the Adam optimizer, which adapts the learning rate of each parameter. With this configuration, the LSTM gives approximately 98% accuracy. Figures 2 and 3 show the accuracy graph and the loss graph of our model. A confusion matrix summarizes the assessment of a machine learning or deep learning classifier and is the basis for measuring accuracy, recall, and precision. Precision is the number of true positive predictions divided by the sum of true positive and false positive predictions. Recall is the number of correctly predicted positive observations divided by the total number of observations that actually belong to the class. The F1-score is the harmonic mean of precision and recall. Table 3 gives these metrics for the two classes of our dataset, “Happy” and “Sad”. Tables 4 and 5 present example predictions of our model for a “Sad” post and a “Happy” post. Each table shows the original label first and then the prediction provided by the model.
A Deep Learning Approach to Detect Depression from Bengali Text
Fig. 2 Accuracy graph
Fig. 3 Loss graph
Table 3 Confusion matrix
Class name | Precision | Recall | F1-score | Support
Sad | 0.43 | 0.86 | 0.57 | 7
Happy | 0.83 | 0.38 | 0.53 | 13
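The definitions of precision, recall, and F1-score can be checked against Table 3 with a short calculation. The true positive, false positive, and false negative counts below are not given in the paper; they are inferred here as the only integer counts consistent with the reported supports (7 and 13) and rounded scores, so they should be read as an assumption. The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, F1) rounded to two decimals for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Assumed counts inferred from Table 3: 6 of 7 "Sad" posts and 5 of 13 "Happy" posts
# are predicted correctly, so each class's misses are the other class's false positives.
sad = per_class_metrics(tp=6, fp=8, fn=1)
happy = per_class_metrics(tp=5, fp=1, fn=8)
```

Under these assumed counts, `sad` and `happy` reproduce the two rows of Table 3.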
Table 4 Result prediction for sad post
Original prediction: Sad post
Prediction by model: Sad post
Table 5 Result prediction for happy post
Original prediction: Happy post
Prediction by model: Happy post
5 Conclusion
We presented a model based on a recurrent neural network (RNN) with long short-term memory (LSTM). Using this algorithm, our model predicts sentiment from Bengali text. There are two classes in our analysis: the model produces “Happy” or “Sad” based on the meaning of the data. No model reaches one hundred percent accuracy, but our model reaches about ninety-eight percent, which is good enough for this task. The main obstacle in working with a Bengali dataset is the spelling structure of the Bengali language. A single vowel of the Bengali alphabet can change the whole meaning of a word, which is hard for a model to learn and can bias the model. Words of other languages written in the Bengali alphabet are another major problem, because such words do not carry a proper meaning. Every deep learning model needs a large amount of data, and training a model on a small dataset is problematic. This was another obstacle in our work. We hope to overcome these problems in the future and produce a better system.
Acknowledgements We are obliged to the DIU NLP and Machine Learning Research Lab for providing GPU support. We thank the Dept. of Computer Science and Engineering, Daffodil International University, for supporting us. We are also grateful to the anonymous reviewers for their valuable comments and feedback.
References
1. Liu, B.: Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5(1), 1–167 (2012)
2. Kanakaraj, M., Guddeti, R.M.R.: NLP based sentiment analysis on Twitter data using ensemble classifiers. In: 2015 3rd International Conference on Signal Processing, Communication and Networking (ICSCN), Chennai, pp. 1–5 (2015)
3. Ortigosa, A., Mart, J.M., Carro, M.: Sentiment analysis in Facebook and its application to e-learning. Comput. Hum. Behav. 31, 527–541 (2014)
4. Mostafa, M.M.: More than words: social networks text mining for consumer brand sentiments. Expert Syst. Appl. 40(10), 4241–4251 (2013)
5. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5(4), 1093–1113 (2014). https://doi.org/10.1016/j.asej.2014.04.011
6. Sentiment Analysis: The Basics, How Does It Work, Use Cases & Applications, Resources. [Online] Available: https://monkeylearn.com/sentiment-analysis/
7. Ko, J., Kwon, H., Kim, H., Lee, K., Choi, M.: Model for twitter dynamics: public attention and time series of tweeting. Physica A 404, 142–149 (2014)
8. Hodson, H.: Twitter hashtags predict rising tension in Egypt. New Sci. 219(2931), 22 (2013)
9. De Choudhury, M., Gamon, M., Counts, S., Horvitz, E.: Predicting depression via social media. In: ICWSM (2013)
10. Kessler, R.: The effects of stressful life events on depression. Annu. Rev. Psychol. 48(1), 191–214 (1997). https://doi.org/10.1146/annurev.psych.48.1.191
11. Paykel, E.S., Dienelt, M.N.: Suicide attempts following acute depression. J. Nerv. Ment. Dis. 153(4), 234–243 (1971)
12. Varghese, R., Jayasree, M.: A survey on sentiment analysis and opinion mining. Int. J. Res. Eng. Technol. 2 (2013). eISSN 2319-1163, pISSN 2321-7308
13. Singh, S., et al.: Social media analysis through big data analytics: a survey. Available at SSRN 3349561 (2019)
14. Masum, A.K.M., et al.: Abstractive method of text summarization with sequence to sequence RNNs. In: 2019 10th International Conference on Computing, Communication, and Networking Technologies (ICCCNT). IEEE (2019)
15. Emon, E.A., et al.: A deep learning approach to detect abusive Bengali Text. In: 2019 7th International Conference on Smart Computing & Communications (ICSCC). IEEE (2019)
16. Rout, J.K., Choo, K.R., Dash, A.K., et al.: A model for sentiment and emotion analysis of unstructured social media text. Electron. Commer. Res. 18, 181–199 (2018)
17. Gupta, Y., Kumar, P.: CASAS: Customized Automated Sentiment Analysis System 5(1), 275–279 (2017)
18. Chowdhury, M., Chowdhury, M.H.: NCTB Bangla Grammer for Class 9-10
19. Cheng, L.C., Tsai, S.L.: Deep learning for automated sentiment analysis of social media. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 1001–1004 (2019)
Hair Identification Extension for the Application of Automatic Face Detection Systems Sugandh Agarwal, Shinjana Misra, Vaibhav Srivastava, Tanmay Shakya, and Tanupriya Choudhury
Abstract Hair identification can act as an additional functionality of facial identification systems to increase the security factor as well as provide hair-based solutions to various organizations for ease of operations and versatility like salons, medical research, and criminal identification. The implementation would help in identification of hair in an image so that it could be classified and modified as per the required purpose. The accuracy achieved by the model is 84.998% with a loss of 0.47838 based on the given dataset of 442 images with batch size 4, epoch as 5, and steps per epoch as 110. Keywords Accuracy · Analytics · CNN · Facial recognition · Hair identification · Optimization · Segmentation
S. Agarwal (B) · S. Misra · V. Srivastava · T. Shakya · T. Choudhury
University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_75
S. Agarwal et al.
1 Introduction
Hair is one of the most important components of a person’s face, and hair identification helps to narrow the search in face recognition. Hair is a dominant feature of the face and offers a variety of advantages, but present-day face identification systems do not consider it a critical factor. Hair has to be extracted or distinguished under varying color, background, and lighting conditions, and existing evaluations rely on a homogeneous, bright, predefined background. Hence, the task as a whole is complicated and laborious. There is very little research on image-based hair detection techniques. Yacoob and Davis are the pioneers who investigated hair identification systematically. They developed a hair color model for detection and studied hair length, volume, color, symmetry, and the influence of the hair region. However, the hair identification algorithm they proposed is very simple and only applicable to pictures with homogeneous backgrounds. Hence, this algorithm can be extended toward real-life applications by bringing in modern concepts.
2 Objectives The objectives identified for the paper are: 1. To develop a machine learning project on hair identification; 2. To understand the steps involved in hair detection and conduct elaborate research for optimization; 3. Analyze and evaluate the results obtained.
3 Literature Survey
The literature survey shows how this work relates to prior research in the field and highlights the literature relevant to it. It also establishes the originality and relevance of the research and justifies the proposed methodology. Before committing to an implementation, it is important to study the tools required, the working of the algorithms involved, the key parameters of focus, and the hidden aspects.
In Yacoob and Davis [1], computational models were developed for comparing different people by measuring hair appearance, with applications to person recognition and image indexing. Shen et al. [2] proposed an efficient algorithm that combines graph cuts optimization and K-means clustering for automatic facial caricature synthesis. Lee and Yang [3] give an insight into the parameters that are important for hair and scalp-based hair identification systems: hairstyle modeling and synthesis, hair counting, and direction of baldness status all add up to a good analysis of the image. Zhang [4] presents an improved version of Adam that eliminates the generalization gap. Smith [5] introduces cyclical learning rates, a method that lets the learning rate vary cyclically between reasonable boundary values. Han et al. [6] provide a method to reduce the memory and computation required by neural networks, which addresses the difficulty of deploying neural networks in embedded systems. Kim et al. [7] show how the learning rate affects the learning speed of a backpropagation neural network and propose a method to select sub-optimal learning rates. Chai and Draxler [8] show that root mean square error (RMSE) is a good measure of model performance. Pande and Goel [9] discuss the radial basis function, which can be used as an alternative to a sigmoid layer in neural networks. Bock et al. [10] improve the convergence proof of the Adam optimizer. Romeo [11] explains epochs, how they work, and their relevance. [12] is the official documentation of the Keras library in Python, covering its modules, functions, and the inputs they require. [13] provides the CNN model for hair identification, built with multiple layers, ReLU, and padding. The model was created as a Python project on GitHub for use in Blender for hair-based projects, and it provides the CNN for our project. [14] and [15] describe Adam as an optimizer for the project and compare it with various other optimizers. [16] and [17] give detailed information on the parameters used for training the model with specific functions and on why they are suitable; the parameters covered are accuracy, loss, binary cross-entropy, and learning rate. [18] and [19] explain how to test neural networks using a trained model. [20] is the Pixel Annotation tool used to create the dataset of masked images for the model. [21] discusses the importance of cross-entropy, its advantages, and how to apply it in different scenarios. [22] explains p-values, their significance, and their use in machine learning. The survey is intended to give both the researcher and the reader enough information to understand the process and the future scope of work related to the paper.
4 Design Methodology
4.1 SEMMA
SEMMA stands for sample, explore, modify, model, and assess. It is a list of sequential steps that guides the implementation of data mining applications and provides a functional set of operations for carrying out the data mining process effectively. The phases of SEMMA are:
• Sample: A dataset is selected for modeling. It should be large enough to contain sufficient information for retrieval, yet small enough to be processed efficiently.
• Explore: The selected data is visualized and inspected to find abnormalities and identify trends.
• Modify: The selected data is modified using editing and format commands.
• Model: Modeling techniques are applied to create models and to process the data further.
• Assess: The reliability and usability of the model are evaluated.
4.2 Algorithms
Algorithm 1 Algorithm to load dataset
function trainGenerator(images, masks, dataargs)
  Input: Images, masks, and image arguments
  Output: Sized images and sized masks
  Initialization:
  image_generator(
    set trainpath,
    classes = [imagefolder],
    classmode = None,
    colormode = image_color_mode,
    targetsize = targetsize,
    batchsize = batchsize,
    savetodir = savetodir,
    saveprefix = (imagesaveprefix))
  end
  mask_generator(
    set trainpath,
    classes = [maskfolder],
    classmode = None,
    colormode = maskcolormode,
    targetsize = targetsize,
    batchsize = batchsize,
    savetodir = savetodir,
    saveprefix = (masksaveprefix))
  end
  for each image, mask
    adjustdata(image, mask)
  end
  return zip(image, mask)
end function

Algorithm 2 Algorithm for training the model
Input: Images and masks
Output: Trained model
Initialization:
  Parse args
  return args
  mygene = trainGenerator(batchsize, images, masks, data, genargs, savetodir)
  sizedata = len(imagesdirectory)
  stepsperepoch = sizedata
  print args
  if use pretrain model then
    loadpretrainmodel
  else
    loadhairnetmodel
  model.compile(optimizer, loss, metrics)
  model_checkpoint = ModelCheckpoint()
  model.fit_generator()
  model.save(args.path_model)
  return trained model
end

Algorithm 3 Algorithm for testing the model
function predict(images, height, width)
  Input: Image
  Output: Mask
  Initialization:
  Convert image to img
  im = reshape()
  pred = model.predict(im)
  mask = pred.reshape()
  return mask
end function
function transfer(image, mask)
  Input: Image
  Output: Mask
  Initialization:
  set mask
  if mask > 0.5 then
    mask = 255
  else
    mask = 0
  end if
  mask.resize()
  mask_n = np.zeros_like()
  dst = cv2.addWeighted()
  return dst
  if (name == '__main__') then
    model = keras.models.load_model()
  end if
  return model
end function
LOOP Process
for name in imagedirectory do
  img = cv2.imread(name)
  time.time()
  plt.imshow()
  plt.show()
  dst = transfer(img, mask)
  cv2.imwrite(name.replace(), dst)
end for
return dst
end
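The thresholding and blending steps of transfer() in Algorithm 3 can be illustrated without any image library. The sketch below is a pure-Python stand-in, not the authors' code: nested lists replace NumPy arrays, `overlay` mimics a weighted sum of image and mask on a single grayscale channel, and the function names and the `alpha` parameter are assumptions made for illustration.

```python
def binarize_mask(pred, threshold=0.5):
    """Turn predicted hair probabilities into a 0/255 mask (255 = hair)."""
    return [[255 if p > threshold else 0 for p in row] for row in pred]

def overlay(gray_image, mask, alpha=0.5):
    """Blend image and mask pixel by pixel: alpha * image + (1 - alpha) * mask."""
    return [
        [int(round(alpha * px + (1 - alpha) * m)) for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(gray_image, mask)
    ]

pred = [[0.2, 0.7], [0.5, 0.51]]   # hypothetical 2x2 network output
mask = binarize_mask(pred)
blended = overlay([[100, 101], [100, 101]], mask)
```

Note that a probability of exactly 0.5 is treated as "no hair" because the comparison in Algorithm 3 is strict.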
5 Implementation
The project is divided into three modules, namely load_data.py, train.py, and test.py. Hairnet is the main CNN model; it is imported from the nets and used for compilation, checkpoints, and fit generator, i.e., for the training of the model. The dataset comprises 442 images collected from various sources and 442 corresponding masks generated with the Pixel Annotation tool. load_data.py loads all the variables, creates the user-defined function trainGenerator(), and defines all the image and mask parameters that train.py needs. train.py trains the model. All the libraries related to training are imported in this module. It defines functions to import the CNN model to be trained, optimize the model for minimum loss and maximum accuracy, create checkpoints, and index them. test.py tests the model after training: it reshapes and reformats the test inputs and predicts the final result based on the training the model has undergone.
5.1 Datasets and Tools
The dataset consists of 442 images of human beings with scalp hair, a beard, or both, with varied characteristics. The images were collected via search engines and from datasets available on the Internet. In addition to the 442 images, a mask image was created for each of them using the Pixel Annotation tool, a software that helps to annotate the pictures in a directory manually and quickly. The procedure is semi-manual because it uses the marker-based watershed algorithm of OpenCV. The user paints the markers with brushes before the algorithm is launched. If the first pass of the segmentation needs to be corrected, the user can refine the markers by drawing new ones over the incorrect areas.
5.2 Code Flow and Module Division
5.2.1 Module 1: load_data.py
The required libraries (keras, numpy, skimage) are imported for their functions and modules. These functions help to define the primary user-defined function as well as the variables for each image and mask to be processed. trainGenerator() is the user-defined function that feeds our dataset to the training of the model. It has the following parameters:
• batch_size: The number of images per batch to be processed at each step of an epoch;
• train_path: The default data path for training;
• image_folder: The path of the folder containing all the dataset images;
• mask_folder: The path of the folder containing all the masks of the images;
• aug_dict: The augmented data dictionary for each run of training, used to handle multiple scenarios;
• image_color_mode: The color mode to be accepted for each image. Here, it is set to RGB.
The image generator, mask generator, and train generator take values from the augmented data dictionary per image per run. The value varies for each image depending on its characteristics.
5.2.2 Module 2: train.py
The required libraries (argparse, keras, os) are imported for their functions and modules. Hairnet is the main CNN model; it is imported from the nets and used for compilation, checkpoints, and fit generator, i.e., for the training of the model. Argument parsing converts the values given by the user in a readable form into the values used by the program. Each argument is an essential parameter for running the program and influences accuracy and loss. batch_size = 4 means that the entire dataset is divided into batches of 4 images each for training. An epoch is one pass over the entire dataset; here, epoch = 5 means 5 passes over the entire dataset. The learning rate, lr, controls how large a step the optimizer takes when it updates the weights, and therefore affects both loss and accuracy. Its value lies between 0 and 1, and for a small dataset lower values can be used to reduce errors. Here, learning rate = 0.001. Steps per epoch = total number of images / batch size = 442 / 4, rounded down to 110.
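As a rough illustration of the argument parsing and the steps-per-epoch arithmetic described above, consider the sketch below. The flag names `--batch_size`, `--epochs`, and `--lr` are assumptions and may differ from those in the authors' train.py; only the default values (4, 5, 0.001) and the dataset size of 442 come from the text.

```python
import argparse

def parse_args(argv=None):
    """Parse training settings; flag names are hypothetical, defaults follow the paper."""
    parser = argparse.ArgumentParser(description="Hair segmentation training settings")
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--lr", type=float, default=0.001)
    return parser.parse_args(argv)

def steps_per_epoch(num_images, batch_size):
    """Number of full batches in one pass over the dataset (integer division)."""
    return num_images // batch_size

args = parse_args([])                        # use the defaults
steps = steps_per_epoch(442, args.batch_size)
```

With 442 images and a batch size of 4, the integer division gives the 110 steps per epoch reported in the paper.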
Optimization and Loss Function
As the model, we take the Hairnet CNN and train it on the given dataset. The compile function takes two primary components:
• optimizer = Adam, which is used for stochastic optimization;
• loss = binary cross-entropy, which evaluates the loss for binary classification from the predicted probability P(y|x), where x is the input and y is the label.
ModelCheckpoint stores the values at each step of an epoch as well as the epoch itself. A checkpoint is created and stored in hair_matting.hdf5, i.e., in Hierarchical Data Format version 5, for each step of an epoch. When one epoch ends, a threshold value is stored by calculating the average loss and accuracy for the next epoch iteration. fit_generator runs the training, given the training generator, the callbacks, and the number of epochs.
794
S. Agarwal et al.
Fig. 1 Test inputs and outputs (panels: input, output)
5.2.3 Module 3: test.py
The predict() function converts the images to RGB format, resizes them, and reshapes the array of images. It then runs the prediction and reshapes the output. The transfer() function creates arrays of zeroes with the shape of the final image to hold the mask for comparison, and then displays the result graphically. Finally, the trained model is loaded and fed with the prepared image inputs, and testing concludes by showing the final results with parameters such as accuracy, time, and graph (Fig. 1).
6 Research and Analysis
6.1 Loss Function
Loss functions calculate the model error that the training process of a neural network minimizes. Cross-entropy and mean squared error are the two main types of loss function used for training neural models. Cross-entropy loss functions are best suited to classification models, whereas mean squared error functions are mainly used for regression models. Binary cross-entropy is used in this project because we train a binary classification model rather than a regression model. It gives a low value for a good prediction and a high value for a bad prediction. The equation is:
$$H_p(q) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i)) \,\right]$$
where i = samples/observations and j = classes, and y = sample label and:
• y = (1 = hair, 0 = no hair)
• p(y) = predicted probability that a point is hair, over N points.
Table 1 Comparison between binary cross-entropy and mean squared error
Binary cross-entropy | Mean squared error
It quantifies the performance of a classification model whose output is a probability estimate between 0 and 1 | It measures the mean squared difference between actual and estimated values
It is used for classification problems | It is generally used for regression problems
It is theoretically faster than MSE, but practical implementation may state otherwise. It depends on the problem being handled | It is theoretically slower than binary cross-entropy, but practical implementation may state otherwise. It depends on the problem being handled
It uses binary distribution | It follows normal distribution
Formula: $H_p(q) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(p(y_i)) + (1 - y_i)\log(1 - p(y_i)) \,\right]$ | Formula: $L = \frac{1}{N}\sum (\hat{Y} - Y)^2$
The formula says that each hair point (y = 1) contributes −log(p(y)) to the loss. Conversely, each no hair point (y = 0) contributes −log(1 − p(y)) (Table 1).
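The binary cross-entropy described above can be written out directly as a minimal sketch. The clipping constant `eps` is an assumption added here to avoid log(0); it is not part of the formula in the paper, and the function name is hypothetical.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all points (y = 1 hair, y = 0 no hair)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)      # keep p away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

good = binary_cross_entropy([1, 0], [0.9, 0.1])   # confident and correct
bad = binary_cross_entropy([1, 0], [0.1, 0.9])    # confident and wrong
```

Here `good` is much smaller than `bad`, matching the statement that the loss is low for good predictions and high for bad ones.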
6.2 Optimizer
Optimizers update the weights of a model during training. Adaptive optimizers also adjust the effective learning rate automatically, which speeds up training and reduces the risk of the model getting stuck in a local minimum because the learning rate was set too high or too low. SGD, AdaGrad, RMSProp, and Adam are some of the most popular optimizers. Adam, an adaptive learning rate optimization algorithm, is used in this project. Compared with its counterparts, Adam keeps running averages of both past gradients and past squared gradients, which adapts the step size of each parameter and speeds up learning, thereby providing better accuracy per iteration. Adam uses a learning rate of 0.001 by default, but this can be changed manually. Adam’s beta parameters are set to 0.9 and 0.999, respectively. On an average of 10 runs, Adam has the highest training and validation accuracy. Other advantages of the Adam optimizer are that it is simple to implement, computationally efficient, and light on memory, works well with non-stationary objectives, and suits large datasets and large numbers of parameters (Table 2).
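For readers who want to see how the learning rate and the two beta parameters enter the update, the following is a minimal single-parameter sketch of the standard Adam update rule. It is not the Keras implementation; the function name `adam_step` and the value `eps=1e-8` are assumptions, while lr = 0.001, beta1 = 0.9, and beta2 = 0.999 are the values quoted above.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # running average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from param = 1.0 with gradient 2.0 and zero-initialised moments.
p, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.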
Table 2 Comparison between optimizers
SGD | AdaGrad | RMSProp | Adam
Uses one static learning rate for all parameters during the training process | It has different learning rates for every single parameter | It is volatile since it decays the learning rate exponentially | It uses past learning rate with adaptive moment estimation
It does not imply an equal update after every batch | Learning rate is updated for each parameter based on frequency of its update | Learning rate is updated for each parameter based on frequency of its update | It also uses past gradients to speed up learning
Gradients start to decrease post sub-optimal value | Higher learning rates are risky | It moves over to another local minima and does not get stuck | It will continue to do the process and would not falter suddenly
Default learning rate is 0.01 with no momentum | Default learning rate is 0.001 and accumulator value 0.1 | Default learning rate is 0.01 with no momentum but 0.9 rho | Default learning rate is 0.001 and beta parameter between 0.9 and 0.999
6.3 Learning Rate Versus Accuracy
Multiple runs with different learning rates indicate that the Adam optimizer is best suited with a learning rate of 0.001. Across all the learning rates tested, the median accuracy is 0.8505 and the median loss is 0.375. The accuracy at a learning rate of 0.001 is 0.849, so it is in line with the median of the results obtained, which supports the choice of this learning rate. Secondly, the logarithmic trendline reflects the steady increase in accuracy at low learning rates, and the accompanying loss is offset by the accuracy the model obtains. In general, a logarithmic trendline is used when the data changes quickly (increasing or decreasing) and then levels out (Figs. 2 and 3).
Fig. 2 Accuracy versus loss for different learning rates (plot title: Variation of Accuracy and Loss with LR; x-axis: Accuracy; y-axis: Loss; points labelled with learning rates 0.001, 0.002, 0.003, and 0.004; logarithmic trendline y = 0.7204 ln(x) + 0.5193)
Fig. 3 Outputs for different learning rates (panels: LR = 0.001, LR = 0.002, LR = 0.003, LR = 0.004)
6.4 Accuracy Versus Loss for Defined Epoch
The program is configured with 5 epochs, 110 steps per epoch, a learning rate of 0.001, and a batch size of 4. Over the cycle of 5 epochs, the following graphical trends were observed. Mean, mode, and median are the measures of central tendency. The median is less sensitive to outliers than the mean. For a model whose results contain relevant outliers, the median should therefore be used as the measure of central tendency, and it has been taken for the calculations (Figs. 4, 5, 6 and 7). From the graphs, we can state the following points:
1. After each epoch, the loss reduces and the accuracy improves. From the third epoch onwards, the slope starts to flatten, showing that the loss and accuracy reach stable values of median accuracy: 84.998% and median loss: 0.47838.
2. Clusters of points signify that the values obtained vary within a fixed range, accuracy: 0.84–0.87 and loss: 0.31–0.38.
3. The median epoch graph shows that the median values of accuracy and loss become stable from the third iteration onwards. The logarithmic trendline is a straight line signifying an increasing trend for accuracy and a decreasing trend for loss.
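The difference between mean and median can be checked on the five per-epoch accuracy and loss values quoted in the captions of Figs. 4, 5 and 6. The sketch below uses only those five pairs; treating them as the full input of the summary statistics is an assumption, since the per-step values behind the plots are not available here.

```python
from statistics import mean, median

# Per-epoch values as quoted in the captions of Figs. 4-6 (epochs 1/5 to 5/5).
accuracy = [0.8151, 0.8633, 0.8593, 0.8577, 0.8545]
loss = [0.9834, 0.357, 0.3644, 0.3433, 0.3438]

mean_acc, mean_loss = round(mean(accuracy), 5), round(mean(loss), 5)
median_acc, median_loss = median(accuracy), median(loss)
```

On these five values, the arithmetic mean gives 0.84998 and 0.47838, which coincides with the accuracy and loss figures stated in point 1, while the median of the same five values is 0.8577 and 0.357. The first epoch acts as the outlier that separates the two measures.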
Fig. 4 Accuracy (0.8151) versus loss (0.9834) for epoch 1/5 (x-axis: Accuracy; y-axis: Loss; logarithmic trendline y = -9.419 ln(x) - 0.797)
Fig. 5 Accuracy (0.8633) versus loss (0.357) for epoch 2/5 and accuracy (0.8593) versus loss (0.3644) for epoch 3/5 (x-axis: Accuracy; y-axis: Loss; logarithmic trendlines y = -0.589 ln(x) + 0.2625 and y = -0.937 ln(x) + 0.2323)
Fig. 6 Accuracy (0.8577) versus loss (0.3433) for epoch 4/5 and accuracy (0.8545) versus loss (0.3438) for epoch 5/5 (x-axis: Accuracy; y-axis: Loss; logarithmic trendlines y = -1.152 ln(x) + 0.1672 and y = -1.335 ln(x) + 0.1328)
Fig. 7 Plot of 5 epoch runs (plot title: Median Accuracy vs Median Loss; x-axis: Median Accuracy; y-axis: Median Loss; points labelled Epoch 1/5 to Epoch 5/5; logarithmic trendline y = -11.78 ln(x) - 1.4392)
To test a stated hypothesis, we calculate the probability value, or p-value. The p-value determines the significance of the results of a statistical test with respect to the defined null hypothesis. Since the p-value is a probability, it lies between 0 and 1. The closer the value is to 0, the stronger the evidence against the null hypothesis and the more statistically significant the result.
[…] (0.90 accuracy) and very close to each other. Hence, selecting and rejecting models on the basis of standard evaluation metrics such as accuracy and F1-score is difficult. To get better insight into the reasoning behind the predictions, LIME was applied and tested with some comments. Even though models such as logistic regression, naive Bayes, random forest, XGBoost, and CNN report high accuracy, testing with LIME showed that these models considered the wrong features important for their predictions (refer to Fig. 1). Hence, we cannot deploy such models in real-world scenarios. On the other hand, GRU with pre-trained word embeddings provides the most intuitive explanations along with high accuracy scores, which indicates that the model considers the correct features for prediction. Refer to Figs. 2 and 3.
Lime Explanations for GRU + Glove Embeddings
Lime Explanations for GRU + Fasttext Embeddings
Explainable AI Approach Towards Toxic Comment Classification
Fig. 1 Misclassifications and incorrect consideration of features of this type were observed in the case of logistic regression, naive Bayes, random forest, XGBoost and CNN
5 Issues and Limitations
From the above study, we were able to conclude that the combination of GRU and pre-trained word embeddings works best in our current scenario, with the best accuracy score and intuitive LIME explanations. However, there are certain issues and limitations with regard to toxic comment classification, which we would like to highlight here:
1. In certain cases, comments need to be of a substantial length (excluding stopwords) to be classified correctly. Individual words passed to the model without any context may be misclassified. We observed that a few words such as “mouth”, “mother”, and “black”, which appear frequently in toxic comments, still tend to shift the prediction towards the toxic side. When we provide enough context (excluding stopwords), the comments containing those words seem to be classified correctly. Refer to Fig. 4.
2. Some threats and inappropriate comments written using “non-toxic” words and phrases may be misclassified. Refer to Figs. 5 and 6.
3. The dataset and model work best for detecting and predicting the toxic, abusive, and inappropriate comments and hate speech found most commonly in online comment sections. Even our best model may not deal properly with severe issues such as caste/gender/religion/race/nationality/sexual orientation discrimination, identity hate, and personal attacks. Refer to Fig. 7.
A. Mahajan et al.
Fig. 2 Explanations of GRU + GloVe. The gated recurrent unit (GRU) with GloVe embeddings offers a significant improvement over the previous models and preprocessing techniques, as shown by the figures above. All sentences that we tested manually were classified correctly, and the generated explanations were also appropriate
6 Conclusion
In this study, every model that was trained gave high scores on the test data. However, after examining the predictions and their explanations, we observed that GRU with pre-trained word embeddings gave the most intuitive LIME explanations. The model works best on the toxicity and hate speech commonly found in online comment sections and threads. More generally, an ML/DL model that gives high scores may lead one to believe that a good enough model for a given task has been found. Model interpretability techniques like LIME can help explain why a model makes certain predictions and can help in selecting the best model and preprocessing techniques. They allow the problem, model and techniques to be analysed in much more depth. Moreover, model interpretability techniques can give end users the reasons for a certain prediction or classification made by the model, thus helping to build trust. This highlights the importance of a model interpretability step in machine learning and deep learning projects and solutions.
Fig. 3 Explanations of GRU + Fasttext. GRU with Fasttext embeddings also provides the best performance, as shown by the scores in Table 1. It also correctly classifies all the sentences that we tried. Moreover, when the results are interpreted using LIME, the explanations are found to be the most intuitive.
Fig. 4 Explanations of “Mother, I love you and respect you.” and “Your mouth looks good in the picture.” on GRU + Fasttext
Fig. 5 “I arrange to have your life terminated” on GRU + Fasttext getting classified correctly
Fig. 6 “I will end your life” on GRU + Fasttext getting misclassified
Fig. 7 Personal attacks like “You are so poor, it’s laughable” on GRU + Fasttext getting misclassified
Acknowledgements We would like to take this opportunity to thank our college guide, Ms. Vaishali Mishra, and industrial mentor, Dr. Bhushan Garware, for their invaluable support and guidance during the course of this study.
References
1. Cyberbullying: https://en.wikipedia.org/wiki/Cyberbullying
2. Dataset abusive YouTube comments (2018, Nov 7). https://zenodo.org/record/1479531#.XjxGNmgzbIU
3. fastText (n.d.). https://fasttext.cc/
4. Hate-speech-and-offensive-language dataset (n.d.). https://github.com/t-davidson/hate-speech-and-offensive-language/tree/master/data
5. Internet troll (2001, Oct 1). https://en.wikipedia.org/wiki/Internet_troll
6. Molnar, C.: Interpretable machine learning. A guide for making black box models interpretable. Available via Github (2019). https://christophm.github.io/interpretable-ml-book/
7. Pennington, J.: GloVe: global vectors for word representation (n.d.). https://nlp.stanford.edu/projects/glove/
8. Ribeiro, M., Singh, S., Guestrin, C.: Why should i trust you? Explaining the predictions of any classifier. Presented at the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations (2016)
9. Toxic comment classification challenge | Kaggle (n.d.). https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
10. What is TF-IDF? (2019, Dec 26). https://monkeylearn.com/blog/what-is-tf-idf/
11. Word embedding (2014, Aug 14). https://en.wikipedia.org/wiki/Word_embedding
Rank Prediction in PUBG: A Multiplayer Online Battle Royale Game
Harsh Aggarwal, Siddharth Gautam, Sandeep Malik, Arushi Khosla, Abhishek Punj, and Bismaad Bhasin
Abstract Electronic sports (esports) have become a popular choice for players as well as spectators, supporting a worldwide media industry. Esports analytics has advanced to address the need for data-driven feedback and is centred on the assessment of digital competitors, strategy and prediction. Previous research has used game data from a range of player ranks, from casual (non-professional) to professional. However, professional players behave differently from hobbyists and less skilled players. Given the comparatively limited supply of professional data, a key question is whether the given match dataset can be used to build data-driven models that predict winners in professional matches and provide in-game statistics for spectators and broadcasters. Here we show that, although the accuracy is somewhat lower, the acquired data can be used to predict the outcome of professional matches with suitably optimized configurations.
Keywords Machine learning · PUBG · Prediction · Boosting tree models
H. Aggarwal (B) · S. Gautam · S. Malik · A. Khosla · A. Punj · B. Bhasin
Department of Information Technology, HMR Institute of Technology and Management, GGSIPU, New Delhi, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_82
H. Aggarwal et al.
1 Introduction
In recent years, PUBG has risen as a popular game for gamers and spectators and as a developing area for research and analysis [1]. The combination of PUBG’s growth and the availability of detailed data from virtually every match played [2] has given rise to the research and analysis of this branch of esports [3]. PUBG is a complex and fast-paced game in which strategies and balance change much faster than in traditional games. Put simply, general statistics can increase interest in the game and make it more accessible to a large number of viewers [4, 5]. Win prediction in particular has become the focal point of such analytics. It is a basic metric easily understood by audience and players alike [6]. Consequently, a range of win prediction methods has been developed. The focus of this paper is using the current state of the game to predict the probable winner of professional games [7]. These professional games attract the greatest industry and audience interest, but they are limited in number. We investigate the multiplayer online game PUBG, distributed by Tencent Games, which has around 3 million unique players per month. Our contributions are as follows:
• We put forward techniques and results for rank prediction in professional games, using very high-skill public match data to augment the professional-level training data and to ensure that the training data fed to the models [8] is sufficient to produce robust and reliable prediction models [9].
• We thoroughly evaluate two common prediction algorithms and their configurations to identify the best-performing algorithm and configuration on different aspects of PUBG data [10].
This indicates which algorithm and configuration should be used and when, as well as the amount of optimization required for the highest prediction accuracy.
2 Dataset
We were given anonymized player data from more than 65,000 games, split into training and testing sets, and asked to predict the final placement from final in-game statistics and initial player ratings [11]. The target is a percentile placement, where 1 corresponds to first place and 0 corresponds to last place in the match [12]. The dataset has 4.45 million rows of training data and 1.93 million rows of test data. Because the data was so large, memory usage became an issue. We used memory reduction techniques that downcast data types, resulting in a 75% memory reduction. These were taken from a public kernel: https://www.kaggle.com/chocozzz/how-to-save-time-and-memory-with-big-datasets.
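The idea behind downcasting can be sketched without any data-frame library: for each integer column, pick the narrowest signed integer type that holds the observed range, and compare the storage against a 64-bit default. This is only an illustration under assumed column values; it is not the kernel's code and it does not reproduce the 75% figure reported above.

```python
# Signed integer types and their widths in bytes, narrowest first.
INT_TYPES = [("int8", 1), ("int16", 2), ("int32", 4), ("int64", 8)]


def smallest_int_type(values):
    """Return (type name, bytes per value) of the narrowest type holding all values."""
    lo, hi = min(values), max(values)
    for name, nbytes in INT_TYPES:
        bits = 8 * nbytes
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return name, nbytes
    raise OverflowError("values do not fit in a 64-bit signed integer")


def memory_saving(columns):
    """Fraction of memory saved relative to storing every value as int64."""
    before = sum(8 * len(values) for values in columns.values())
    after = sum(smallest_int_type(values)[1] * len(values) for values in columns.values())
    return 1 - after / before


if __name__ == "__main__":
    # Hypothetical sample values for two columns.
    sample = {"kills": [0, 3, 12], "matchDuration": [1300, 1900, 2200]}
    for name, values in sample.items():
        print(name, smallest_int_type(values)[0])
    print(f"saving: {memory_saving(sample):.2%}")
```

Floating-point columns can be handled in the same spirit by moving from 64-bit to 32-bit floats when the loss of precision is acceptable.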
2.1 Outlier Detection and Anomalies
No dataset is ever 100% clean, and this one is no exception. Part of the data does not reflect the core gameplay, because it is drawn from custom game modes or because it involves fraudsters and cheaters. Some characteristics of this anomalous data include the following:
• Many kills without moving
• Many road kills while driving a very short distance
• Suspiciously high number of kills
• Suspiciously long-range kills.
One could argue either way on whether to drop this data; however, hackers and custom games also exist in the test set, so we concluded that this is important data when training our models.
Corrupt Data: Due to issues with the API, there is a bug in the groupID column. When a player leaves a match and rejoins, their groupID is reassigned, which creates larger groups than the game mode would permit. Because of this bug, we found examples that violate this rule. This causes problems when creating team-related features, as it makes identifying teams difficult.
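The anomaly characteristics listed above can be expressed as simple per-player flags. The following is a minimal sketch: the field names follow the public Kaggle PUBG dataset, while every threshold is a hypothetical value chosen for illustration and is not taken from the study.

```python
def anomaly_flags(player, kills_limit=30, long_kill_limit=1000.0,
                  road_kill_limit=5, short_ride=100.0):
    """Return a dict of boolean flags for one player record.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    total_distance = (
        player["walkDistance"] + player["rideDistance"] + player["swimDistance"]
    )
    return {
        "kills_without_moving": player["kills"] > 0 and total_distance == 0,
        "road_kills_short_ride": (
            player["roadKills"] >= road_kill_limit
            and player["rideDistance"] < short_ride
        ),
        "too_many_kills": player["kills"] > kills_limit,
        "long_range_kill": player["longestKill"] > long_kill_limit,
    }


if __name__ == "__main__":
    camper = {"kills": 12, "walkDistance": 0.0, "rideDistance": 0.0,
              "swimDistance": 0.0, "roadKills": 0, "longestKill": 45.0}
    print(anomaly_flags(camper))
```

Since the study keeps such rows in the training data, flags of this kind would serve for inspection rather than for filtering.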
3 Evaluation
The first thing we evaluated was how common features affected the win placement percentage (our target variable) [13]. Correlated features that we found include the number of boosts and the number of heals, among others. As shown below by the increasing mean of the win placement percentage, kill count is correlated with our target variable, which makes sense, since more kills generally means that a player is more skilled and will rank closer to the top [14] (Fig. 1). Boosts and heals were found to be highly correlated with our target variable, which makes sense because the higher the health points, the greater the chance that a player survives until the end and ranks closer to the top. Both boosts and heals have a high correlation with winning placement; however, boosts have higher significance (Fig. 2). The battle royale can be played in a total of 16 modes, which are variants of three base modes: solo, duo and squad, with 1, 2 and 4 players per team, respectively. The graph below shows that in solo and duo modes, killing more enemies leads to a rank closer to the top, but the number of kills does not matter as much in squad mode (Fig. 3). When trying to identify cheaters and fraudsters, we looked at the distributions of some features to get a better understanding of what anomalies in the data looked like. For instance, most players picked up somewhere in the range
Fig. 1 Distribution of kill categories versus win percentage
Fig. 2 Distribution of number of heal/boost items versus win percentage
of zero to eight weapons in a match, as can be seen in the distribution below. When we found players who picked up 40+ weapons, we attributed that to either cheating or playing a custom game. To better confirm whether a player was cheating, we would need to look at other features in addition to weapons picked up.
Battleground Aimbots: Automated aiming software is among the most widely used and most powerful cheats in PUBG. It lets the player assign a key or mouse click to an auto-aim and lock-on function that automatically targets any enemy in the line of sight without the player having to do much (Fig. 4). Players who had a 100% headshot rate were identified in the data but were not removed from the dataset, as it was unclear whether these players were cheating or were really talented.
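The two checks discussed above, weapons picked up and headshot rate, can be sketched as follows. The field names follow the public Kaggle PUBG dataset; the 40-weapon limit comes from the text, whereas the minimum kill count used before a perfect headshot rate is treated as notable is a hypothetical parameter.

```python
def headshot_rate(player):
    """Share of a player's kills that were headshots (0.0 when there are no kills)."""
    kills = player["kills"]
    return player["headshotKills"] / kills if kills > 0 else 0.0


def suspicious_reasons(player, weapons_limit=40, min_kills=10):
    """List the reasons a record looks unusual; min_kills is an assumed cut-off."""
    reasons = []
    if player["weaponsAcquired"] >= weapons_limit:
        reasons.append("weapons")
    if player["kills"] >= min_kills and headshot_rate(player) == 1.0:
        reasons.append("headshots")
    return reasons


if __name__ == "__main__":
    record = {"kills": 12, "headshotKills": 12, "weaponsAcquired": 6}
    print(headshot_rate(record), suspicious_reasons(record))
```

As the text notes, a flag of this kind is not proof of cheating, so flagged players are kept in the dataset.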
Fig. 3 Distribution of number of kills versus win percentage for solo, duo and squad modes
Fig. 4 Graph of head shot rate among the players
4 Feature Engineering
Engineering new features was a primary focus of this project. Experience playing the game guided our intuition about which new features could affect winning. A game in PUBG can have up to 100 players fighting each other, but most of the time a game is not “full”. No variable gives the number of players who joined, so we created one. With the feature “playersJoined”, we can normalize other features based on the number of players. Features that can be valuable to normalize are:
• kills
• damageDealt
• maxPlace
• matchDuration.
Finally, here is the correlation matrix of the top 6 features that correlate with the target after all of the feature engineering (Fig. 5).
Fig. 5 Correlation matrix of features with target variable
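The "playersJoined" feature described above can be derived by counting the rows that share a match identifier. The sketch below assumes a `matchId` field, as in the public Kaggle PUBG dataset. The text does not give the normalization formula, so the scaling used here (multiplying by `(100 - playersJoined) / 100 + 1`) is an assumption chosen for illustration.

```python
from collections import Counter


def add_players_joined(rows):
    """Add a playersJoined field: the number of rows sharing each matchId."""
    counts = Counter(row["matchId"] for row in rows)
    for row in rows:
        row["playersJoined"] = counts[row["matchId"]]
    return rows


def normalize(value, players_joined, full_match=100):
    """Scale a per-match statistic up when fewer players joined.

    Assumed formula: a full match leaves the value unchanged, and a nearly
    empty match almost doubles it.
    """
    return value * ((full_match - players_joined) / full_match + 1)


if __name__ == "__main__":
    rows = [
        {"matchId": "a", "kills": 3},
        {"matchId": "a", "kills": 0},
        {"matchId": "b", "kills": 5},
    ]
    for row in add_players_joined(rows):
        row["killsNorm"] = normalize(row["kills"], row["playersJoined"])
        print(row)
```

The same `normalize` call could be applied to damageDealt, maxPlace and matchDuration.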
5 Prediction and Analysis
5.1 Base Machine Learning Models
5.1.1 Loss Function
Mean absolute error (MAE) measures the average magnitude of the errors in the test dataset, without considering their direction:
\[ \mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} = \frac{\sum_{i=1}^{n} |e_i|}{n}. \]
For example, an MAE of 0.03 means that the average absolute difference between the prediction and the true value is 0.03. In the context of our dataset, this typically means that the prediction is off by about 3 placements on average, assuming a match with about 100 players. This loss function provides an appropriate measure of our model’s prediction accuracy.
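The MAE definition and its reading in placements can be checked with a few lines of code. This is a minimal sketch; the helper `mae_in_placements` and its default of 100 players per match are assumptions used only to illustrate the conversion from a percentile error to a number of ranks.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - x_i| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - x) for y, x in zip(y_true, y_pred)) / len(y_true)


def mae_in_placements(mae, players_in_match=100):
    """Convert a percentile MAE to ranks, assuming ranks 1..players_in_match."""
    return mae * (players_in_match - 1)


if __name__ == "__main__":
    truth = [1.0, 0.5, 0.0]
    predicted = [0.9, 0.6, 0.0]
    print(mean_absolute_error(truth, predicted))
    print(mae_in_placements(0.03))
```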
Fig. 6 Displaying importance of top features
5.1.2 Model
Random Forest Model
Random forest is one of the most popular machine learning algorithms and belongs to the family of supervised learning methods [15]. It can be used for both kinds of problems in machine learning, classification and regression. It is based on ensemble learning, the practice of combining multiple models to solve a complex problem and improve overall performance. As its name suggests, a random forest contains a number of decision trees built on different subsets of the given data and combines their outputs to improve prediction accuracy [16]. Rather than depending on a single decision tree, the random forest collects the prediction from each tree and produces the final outcome from the majority vote of those predictions (for classification) or from their average (for regression). A larger number of trees generally makes the predictions more stable and reduces the risk of overfitting. A random forest model is created, and the importance of the top features is then determined (Fig. 6). This feature importance graph displays the top 15 features of our boosting model. The top features (walkDistance, killPlace and boosts) make sense. Players who walk a longer distance tend to survive for a larger part of the game, which is important for a higher win placement percentage. The player’s kill place (placement on the kill-count leaderboard) clearly correlates with final placement; having more kills means ranking closer to the top. Boosts help maintain health and stay alive until the very end, ensuring a higher rank. The other top features were engineered, demonstrating the reliability of these features as measures of performance. With these top features, a dendrogram is generated to view highly correlated features (Fig. 7).
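The averaging step described above can be illustrated with a deliberately small ensemble. The sketch below is not the model used in the study: each "tree" is a one-split regression stump on a single input, trained on a bootstrap sample, so it shows bagging and averaging only and omits the random feature selection of a full random forest. The sample data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        l_mean, r_mean = mean(left), mean(right)
        error = sum((y - l_mean) ** 2 for y in left) + sum((y - r_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, l_mean, r_mean)
    if best is None:  # every x identical: predict the overall mean
        overall = mean(ys)
        return (xs[0], overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Regression output: the average of the individual tree predictions."""
    return mean(predict_stump(stump, x) for stump in forest)


if __name__ == "__main__":
    walk_distance = [0, 100, 200, 1500, 2000, 3000]      # hypothetical inputs
    win_place_perc = [0.05, 0.1, 0.15, 0.7, 0.8, 0.95]   # hypothetical targets
    forest = fit_forest(walk_distance, win_place_perc)
    print(predict_forest(forest, 50), predict_forest(forest, 2500))
```

For classification, the `mean` in `predict_forest` would be replaced by a majority vote over the predicted classes.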
Fig. 7 Dendrogram of highly correlated features
6 Conclusion and Future Scope
Boosting tree models appear to work well with numerical and continuous values, which make up our entire training set. Random forest is able to find more complex dependencies than ordinary linear regression, at the cost of a longer fitting time. We also found that creating new features that better represent player metrics improved the overall prediction scores. Our model was able to predict the rank of players with a mean absolute error of 3 per cent, which means that the predicted rank of a player deviates from the true rank by about 3 ranks up or down on average. Our model could have been tuned further for better scores; however, given this limited but significant dataset, we chose to focus our efforts on EDA, feature engineering and finding interesting characteristics of the data. Since Kaggle provided a significant dataset, we chose to use it. Given more time, we would like to scrape our own data using the PUBG developer API key. This data contains more valuable features, for example, starting position and the various kinds of weapons used. This extra data could improve our predictive models and would allow us to discover more interesting strategies. The group approached this PUBG dataset with a focus on exploratory data analysis and feature engineering. Since we had domain knowledge of the topic, we could discover interesting patterns and create better features to represent the data. Given more time, we would ensemble high-performing models and carry out more extensive hyperparameter tuning for better results. We believe that, given sufficient time and resources, we would achieve considerably better results.
References
1. Chakarverti, M., Sharma, N., Divivedi, R.R.: Prediction analysis techniques of data mining: a review. SSRN Electron. J. (2019). https://doi.org/10.2139/ssrn.3350303
2. Grover, M., Verma, B., Sharma, N., Kaushik, I.: Traffic control using V-2-V based method using reinforcement learning. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/ICCCIS48478.2019.8974540
3. Hamari, J., Sjoblom, M.: What is esports and why do people watch it? Int. Res. 27(2), 211–232 (2017)
4. Mora-Cantallops, M., Sicilia, M.: Moba games: a literature review. Entertainment Comput. 26, 128–138 (2018)
5. Manchanda, C., Rathi, R., Sharma, N.: Traffic density investigation & road accident analysis in India using deep learning. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/ICCCIS48478.2019.8974528
6. Jaidka, H., Sharma, N., Singh, R.: Evolution of IoT to IIoT: applications & challenges. SSRN Electron. J. (2020). https://doi.org/10.2139/ssrn.3603739
7. Jadon, S., Choudhary, A., Saini, H., Dua, U., Sharma, N., Kaushik, I.: Comfy smart home using IoT. SSRN Electron. J. (2020). https://doi.org/10.2139/ssrn.3565908
8. Hodge, V., et al.: Win prediction in esports: mixed-rank match prediction in multi-player online battle arena games. ArXiv e-prints (CS:AI), Nov 2017 [Online] (2017). Available: https://arxiv.org/abs/1711.06498
9. Hodge, V., Devlin, S., Sephton, N., Block, F., Drachen, A., Cowling, P.: Win Prediction in EGames. University of York, York (2017)
10. Lucchese, C., et al.: Rankeval: an evaluation and analysis framework for learning-to-rank solutions. In: Proceedings of 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1281–1284 (2017)
11. Tiwari, R., Sharma, N., Kaushik, I., Tiwari, A., Bhushan, B.: Evolution of IoT & data analytics using deep learning. In: 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (2019). https://doi.org/10.1109/ICCCIS48478.2019.8974481
12. Sharma, A., Singh, A., Sharma, N., Kaushik, I., Bhushan, B.: Security countermeasures in web based application. In: 2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT) (2019). https://doi.org/10.1109/ICICICT46008.2019.8993141
13. Sun, Y., et al.: Internet of things and big data analytics for smart and connected communities. IEEE Access 4, 766–773 (2016)
14. Singh, A., Sharma, A., Sharma, N., Kaushik, I., Bhushan, B.: Taxonomy of attacks on web based applications. In: 2019 2nd International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT) (2019). https://doi.org/10.1109/ICICICT46008.2019.8993264
15. Sharma, N., Kaushik, I., Rathi, R., Kumar, S.: Evaluation of accidental death records using hybrid genetic algorithm. SSRN Electron. J. (2020). https://doi.org/10.2139/ssrn.3563084
16. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Automatic Medicinal Plant Recognition Using Random Forest Classifier
P. Siva Krishna and M. K. Mariam Bee
Abstract A fully automated technique for the recognition of medicinal plants using computer vision and machine learning techniques is presented. Leaves from 24 different medicinal plant species were collected and photographed using a mobile phone in a laboratory setting. A large number of features were extracted from each leaf, such as its length, size, perimeter and area of hull. Several derived features were then computed from these attributes. The best results were obtained from an SVM classifier using a ten-fold cross-validation method. With an accuracy of 90.1%, the SVM classifier performed better than other machine learning techniques such as k-nearest neighbor (KNN), naive Bayes and neural networks. However, exploiting the accuracy and speed of computer technology can be beneficial in building a high-performance plant classification system based on leaf recognition. Based on this idea, this work proposes a plant identification method using a convolutional neural network that can recognize leaf images of plants. It proposes the use of the artificial neural network backpropagation algorithm for leaf recognition. Images of various plants are acquired and processed as input to the artificial neural network. The trained network recognizes leaf images based on the knowledge gained during the training process. Noise is added to some image sets in order to test the ability of the artificial neural network to recognize noisy images.
Keywords Medical plant · Feature selection · Random forest classifier
P. Siva Krishna (B) · M. K. Mariam Bee
Department of Electronics and Communication Engineering, Saveetha School of Engineering (SIMATS), Chennai, Tamil Nadu, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_83
P. Siva Krishna and M. K. Mariam Bee
1 Introduction
Plant species recognition based on flower identification remains a challenge in the image processing and computer vision community, mainly because of the vast number of flowers, their complex structure and the unpredictable variety of classes in nature. Because of these natural complexities, it is highly undesirable to perform ordinary segmentation or feature extraction, or to combine shape, texture and color features, which results in moderate accuracy on benchmark datasets. Although some feature extraction techniques that combine global and local feature descriptors reach state-of-the-art accuracy in classifying flowers, an efficient and robust system is still needed to identify and recognize flower species at a larger scale in complex environments. Saitoh and Kaneko proposed a method for recognizing flowers in which two images are needed, one of the flower and the other of the leaf. This method requires the user to place a black cloth behind the flower in order to recognize it, which is not feasible and is inconvenient for the user in real-time use. Some of the state-of-the-art plant recognition systems, notably Leafsnap, Pl@ntNet, ReVes and CLOVER, are based entirely on leaf identification, which requires domain knowledge of flowers. A method that combines morphological features such as aspect ratio, eccentricity and rectangularity with a moving median center (MMC) hypersphere classifier was proposed by J.-X. Du et al. Kulkarni et al. proposed a novel approach to recognizing and identifying plants using shape, color and texture features in conjunction with Zernike moments and radial basis probabilistic neural networks (RBPNN). A flower classification approach based on a vocabulary of texture, color and shape features was proposed by Zisserman and tested on 103 classes. To recognize flowers in images accurately, Salahuddin et al. proposed a segmentation approach that uses color clustering and domain knowledge of flowers. Although numerous algorithms and methods have been proposed and implemented to recognize flowers and plants, they still seem very difficult to analyze because of their complex 3D structure and high interclass variation. When it comes to quantifying flower images, the three important attributes to be considered are color, texture and shape.
1.1 Color
The first important feature to consider in recognizing flower species is color. The color histogram, which quantifies the frequency of pixel intensities occurring in an image, is one of the most reliable and basic global descriptors. It allows the descriptor to capture how each color is distributed in an image. The feature vector is formed by concatenating the counts for each color. For instance, if a histogram with 8 bins per channel is used, the resulting feature vector is an 8 × 8 × 8 = 512-dimensional feature vector. In addition, simple color channel statistics such as the mean and standard deviation can be computed to describe the color distribution in an image. The color features of an image alone are not sufficient to quantify flowers, because in a multi-species problem two or more species may have the same color. Sunflower and daffodil, for example, have similar color content.
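The 8 × 8 × 8 = 512 figure above follows directly from binning each of the three color channels into 8 ranges. The sketch below builds such a joint histogram in plain Python from a list of RGB tuples; the pixel values are hypothetical, and a real pipeline would typically use an image library instead.

```python
def color_histogram(pixels, bins=8):
    """Normalized joint RGB histogram with bins**3 entries.

    pixels: iterable of (r, g, b) tuples with 8-bit values (0..255).
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("at least one pixel is required")
    step = 256 // bins  # width of each bin, 32 for 8 bins
    hist = [0] * (bins ** 3)
    for r, g, b in pixels:
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1
    return [count / len(pixels) for count in hist]


if __name__ == "__main__":
    # Two pure red pixels and one pure blue pixel (hypothetical image).
    hist = color_histogram([(255, 0, 0), (255, 0, 0), (0, 0, 255)])
    print(len(hist), hist[448], hist[7])
```

Normalizing by the pixel count keeps histograms of differently sized images comparable.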
1.2 Texture
Another important feature to consider in recognizing flower species is the texture, or the consistency of patterns and colors in an image. Haralick textures, which use the concept of the gray-level co-occurrence matrix (GLCM), are commonly used as texture descriptors. The authors of [1] described 14 statistical features that can be computed from the texture to quantify the image. Because of its high computational cost, the fourteenth feature is omitted, so the resulting feature vector is 13-dimensional.
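To make the GLCM concrete, the sketch below counts how often a gray level i is followed by a gray level j at a fixed pixel offset and then computes contrast, one of the Haralick features. The tiny two-level image is hypothetical, and only a single offset (one pixel to the right) is used.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the offset (dy, dx)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                matrix[image[y][x]][image[ny][nx]] += 1
    total = sum(sum(row) for row in matrix)
    return [[count / total for count in row] for row in matrix]


def contrast(p):
    """Haralick contrast: sum of p(i, j) * (i - j)**2."""
    size = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(size) for j in range(size))


if __name__ == "__main__":
    image = [[0, 0, 1],
             [1, 1, 0]]
    p = glcm(image, levels=2)
    print(p, contrast(p))
```

Texture descriptors in practice usually average such features over several offsets and directions.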
1.3 Shape
For natural objects such as flowers, plants and trees, another important feature for quantifying these objects in an image is the shape. Hu moments and Zernike moments are the two shape descriptors commonly used as global descriptors in computer vision. Moments depend on the statistical expectations of a random variable. There are seven such moments, which are collectively called Hu moments. The first moment is the mean, followed by variance, standard deviation, skew, kurtosis and other statistical parameters. These seven moments are combined to form a 7-dimensional feature vector. Zernike moments are based on orthogonal functions and were introduced by Teague as a shape descriptor. Like Hu moments, Zernike moments are also used to quantify the shape of an object in an image.
1.4 Local Binary Patterns (LBPs)
Like Haralick textures, LBPs are used to quantify the texture of an image. LBPs capture fine-grained details in the image. The main difference from Haralick textures is that LBPs process pixels locally using the concept of a neighborhood. The input grayscale image is divided into cells. For each pixel in a cell, an LBP value (decimal) is calculated by simple thresholding along its neighborhood (assumed to be of size “r”). After the LBP values for all the pixels inside a cell have been calculated, a histogram with 256 bins is computed. This histogram is normalized and then concatenated with those of all other cells in the image. If the number of points along the neighborhood is chosen as 9 and the radius is chosen as 3, the resulting LBP feature vector is 11-dimensional. The extended concept of uniform patterns in LBP was proposed to reduce the size of the resulting feature vectors for computational purposes. An LBP is considered uniform if the binary pattern calculated for a pixel in a cell has at most two 0-1 or 1-0 transitions. For example, 1000000 has two transitions and is therefore uniform, whereas 01101001 has six transitions when the pattern is read circularly and is therefore not uniform.
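The thresholding and the uniformity test described above can be sketched for the basic 8-neighbour case. The 3 × 3 image is hypothetical. Transitions are counted circularly, as the uniform-pattern definition requires, so the count is always even: 01101001 gives six transitions under this convention. With P sampling points, uniform-pattern histograms commonly use P + 2 bins, which is consistent with the 11-dimensional vector mentioned for 9 points.

```python
# Offsets of the 8 neighbours, clockwise starting at the top-left pixel.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_bits(image, y, x):
    """Threshold the 8 neighbours of pixel (y, x) against the centre value."""
    centre = image[y][x]
    return [1 if image[y + dy][x + dx] >= centre else 0 for dy, dx in NEIGHBOURS]


def lbp_value(bits):
    """Decimal LBP code, reading the bits as a binary number."""
    return int("".join(str(bit) for bit in bits), 2)


def circular_transitions(bits):
    """Number of 0-1 and 1-0 changes, including the wrap from last bit to first."""
    return sum(bits[i] != bits[(i + 1) % len(bits)] for i in range(len(bits)))


def is_uniform(bits):
    return circular_transitions(bits) <= 2


if __name__ == "__main__":
    image = [[5, 9, 1],
             [4, 6, 7],
             [2, 3, 8]]
    bits = lbp_bits(image, 1, 1)
    print(bits, lbp_value(bits), is_uniform(bits))
```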
1.5 Histogram of Oriented Gradients (HOG)
Another global descriptor commonly used in the object detection community is the HOG descriptor. It counts the occurrences of gradient orientations in specific localized regions of an image. The HOG descriptor computes these occurrences on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization.
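The counting of gradient orientations can be sketched for a single cell. The example below uses central differences, unsigned orientations (0 to 180 degrees) and 9 bins, which are common choices but are assumptions here; it omits the overlapping block normalization of the full HOG descriptor, and the cell values are hypothetical.

```python
import math


def orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell.

    Border pixels are skipped because central differences need both neighbours.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


if __name__ == "__main__":
    vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
    print(orientation_histogram(vertical_edge))
```

A vertical edge produces horizontal gradients, so all of its weight falls in the first bin, while a horizontal edge fills the bin around 90 degrees.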
1.6 Segmentation
Training images for global feature extraction are first segmented using the GrabCut segmentation algorithm, and the resulting masks are combined with the original image through a bitwise AND operation. This ensures that only the foreground flowers, rather than the cluttered background, are considered for feature extraction. Figure 1 shows one such image segmented using the GrabCut algorithm.
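The masking step can be sketched independently of the segmentation algorithm. In the example below the mask is written by hand and is purely hypothetical; in the pipeline described above it would come from GrabCut. Mask pixels are 255 for foreground and 0 for background, so a bitwise AND keeps foreground intensities and zeroes the rest.

```python
def apply_mask(image, mask):
    """Bitwise AND of an 8-bit grayscale image with a 0/255 mask of the same shape."""
    return [
        [pixel & m for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    image = [[12, 200],
             [34, 255]]
    mask = [[255, 0],
            [0, 255]]
    print(apply_mask(image, mask))
```

For a color image, the same mask would be applied to each channel.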
1.7 Concatenating Global Features
Using just a single global feature descriptor to quantify the entire dataset yields poor accuracy. Instead, different global feature vectors are concatenated for each image and then used to train a machine learning classifier. This kind of concatenation has some caveats, because the feature vectors have different sizes, and thus one feature vector may dominate another, reducing the overall accuracy. A comparative study of different types of global feature descriptors with the proposed system is carried out to analyze this effect. Local feature descriptors such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF) and Oriented FAST and Rotated BRIEF (ORB), along with a bag of visual words, are a popular choice among experts in image classification challenges, as these descriptors quantify the image locally and the resulting feature vector has multiple features that represent the entire image. The authors of [2] discussed a combination-of-features approach involving SIFT, hue-saturation-value (HSV) and HOG features that yields 72.8% accuracy on the FLOWER102 dataset. Although this yields a higher accuracy than other methods, an even higher accuracy can be obtained using deep convolutional neural networks, without the need to hand-engineer features for the entire dataset. The aim of this research work is thus to analyze the effect of transfer learning on the OverFeat network for accurately recognizing flower species from a photograph taken with the user’s mobile phone, without using handcrafted features.
Fig. 1
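The concatenation caveat discussed above, where one descriptor dominates another, is often mitigated by rescaling each descriptor before joining them. The text does not say that the authors do this, so the min-max scaling below is an assumed, commonly used remedy, and the descriptor values are hypothetical.

```python
def min_max_scale(vector):
    """Rescale a feature vector to the range [0, 1]; constant vectors map to zeros."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        return [0.0] * len(vector)
    return [(value - lo) / (hi - lo) for value in vector]


def concatenate_features(*vectors):
    """Scale each descriptor separately, then join them into one feature vector."""
    combined = []
    for vector in vectors:
        combined.extend(min_max_scale(vector))
    return combined


if __name__ == "__main__":
    histogram_counts = [0, 50, 100]   # large raw values
    moments = [0.1, 0.2]              # small raw values
    print(concatenate_features(histogram_counts, moments))
```

Scaling addresses differences in value range only; differences in vector length may still let a long descriptor outweigh a short one.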
2 Deep Learning Using CNN Deep learning research using CNNs is prominent in computer vision applications that include image classification and object recognition. Flower species recognition is a combination of both object recognition and image classification, as the system must detect a flower in the image as well as recognize which species it belongs to. To recognize the flower species, an intelligent system must be trained with a large set of images so that it can predict the flower species from its learned patterns. This approach is termed "supervised learning", which requires an external dataset of labeled images to predict the label of an unseen image. This research work uses convolutional neural networks (CNN) along with transfer learning as the intelligent algorithm to recognize flower species. The important difference between a typical artificial neural network (ANN) and a CNN is that only the last layer of a CNN is fully connected,
P. Siva Krishna and M. K. Mariam Bee
Fig. 2
whereas in an ANN each neuron is connected to every other neuron, as shown in Fig. 2. ANNs are not practical for images, because these networks easily overfit owing to the size of the images. Consider an image of size [32 × 32 × 3]. To feed this image to an ANN, it must be flattened into a vector of 32 × 32 × 3 = 3072 values, so the ANN must have 3072 weights in its first layer to receive this input vector. For larger images, say [300 × 300 × 3], this results in a very large vector (270,000 weights), which requires a more powerful processor. A CNN consists of a stack of layers that takes in an input image, performs a mathematical operation (a non-linear activation function such as ReLU or tanh), and predicts the class or label probabilities at the output. Instead of using standard handcrafted feature extraction methods, a CNN takes in the raw pixel intensities of the input image as a flattened vector. For example, a color image of size [30 × 30] is passed to the input layer of the CNN as a three-dimensional array. The CNN then detects complex features in the image using its different layers, which have "learnable" filters, and combines the results of these filters to predict the class or label probabilities of the input image. Unlike in an ANN, the neurons in a CNN layer are not connected to every other neuron but only to a local region of neurons in the previous layer. The first layers detect low-level features such as corners and edges in the image. The following layers detect mid-level features such as shapes and textures, and the higher layers detect high-level features such as the structure of the plant or flower.
This process of building up from low-level features to higher-level features in an image is what makes CNNs useful in so many applications.
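The input sizes quoted above follow directly from the image dimensions. The short sketch below reproduces that arithmetic; the helper names and the hidden-layer size of 100 neurons are illustrative assumptions.

```python
def flattened_size(width, height, channels):
    """Number of values in an image once it is flattened into a vector."""
    return width * height * channels


def dense_weight_count(input_size, neurons):
    """Weights in a fully connected layer (biases not counted)."""
    return input_size * neurons


small_input = flattened_size(32, 32, 3)
large_input = flattened_size(300, 300, 3)

# Hypothetical first hidden layer with 100 fully connected neurons
small_weights = dense_weight_count(small_input, 100)
large_weights = dense_weight_count(large_input, 100)
```

The weight count grows linearly with the number of pixels for every fully connected neuron, which is why flattening large images into an ANN becomes impractical.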
A convolutional neural network (CNN) has three types of layers: 1. Convolutional layer (CONV). 2. Pooling layer (POOL). 3. Fully connected layer (FC).
2.1 Convolutional Layer (CONV) This is the most important layer in any CNN architecture, because this is where the CNN uses filters to learn features from the input image. This layer consists of filters and feature maps. The CONV layer contains M filters of small size (for example [3 × 3] or [5 × 5]). These filters are convolved over the input volume, so that features are learned at a given spatial position rather than from the whole image at once. A 2D feature map is formed for each learnable filter as the filter slides across the width and height of the input volume, computing dot products with its inputs. When all M filters have been applied to the input volume, the resulting 2D feature maps are stacked to produce the final output volume. This output volume holds the activations of filters that looked at only some spatial location in the input image. For example, applying 12 filters to an input image of size 64 × 64 × 3 produces an output volume of 64 × 64 × 12. These filters would have learned the edges or corners present in the input image and get activated whenever they see those edges or corners again. Looking more closely at the CONV layer, every neuron in this layer connects only to a small region of the input volume. This small region, or the extent of its connectivity, is called the receptive field or the filter size of the neuron. For example, if the input volume is of size [64 × 64 × 3] and a receptive field of size [5 × 5] is used, each neuron in the CONV layer has 5 × 5 × 3 = 75 connections to specific spatial regions of the input volume. Besides this connectivity of neurons, three hyperparameters must be tuned for the CONV layer: depth, stride, and zero-padding. 1.
Depth refers to the number of filters in the layer, each of which learns to detect low-level features such as edges and corners. 2. Stride refers to the size of the jump the filter makes when selecting the next local region in the volume of the previous layer. 3. Zero-padding, if required, pads the border of the input volume with zeros. Thus, in a CONV layer, the input volume is represented as [W1 × H1 × D1], corresponding to the spatial dimensions of the input image; four hyperparameters are defined as [K, F, S, P], representing the number of filters, the receptive field or filter size, the stride, and the amount of zero-padding; and the output volume is represented as [W2 × H2 × D2], where
$$W_2 = (W_1 - F + 2P)/S + 1$$
$$H_2 = (H_1 - F + 2P)/S + 1$$
$$D_2 = K$$
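The output-size relations for the CONV layer can be checked with a short sketch. The function names are illustrative, and the stride S = 1 and padding P = 2 used for the 64 × 64 × 3 example are assumptions chosen so that the spatial size is preserved; the text does not state them.

```python
def conv_output_shape(w1, h1, d1, k, f, s, p):
    """Output volume [W2, H2, D2] of a CONV layer with hyperparameters [K, F, S, P]."""
    w_span = w1 - f + 2 * p
    h_span = h1 - f + 2 * p
    if w_span % s or h_span % s:
        raise ValueError("filter does not tile the padded input with this stride")
    return (w_span // s + 1, h_span // s + 1, k)


def connections_per_neuron(f, d1):
    """Connections of one CONV neuron to its receptive field in the input volume."""
    return f * f * d1


# 12 filters of size 5 x 5 on a 64 x 64 x 3 input (S = 1 and P = 2 are assumed)
output_shape = conv_output_shape(64, 64, 3, k=12, f=5, s=1, p=2)
links = connections_per_neuron(5, 3)
```

The depth D1 of the input does not appear in the output shape, but it does determine how many connections each neuron has.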
2.2 Pooling Layer (POOL) This layer is used as an intermediate layer in the network, where it downsamples or compresses the incoming volume along the spatial dimensions. For example, if the input volume is [64 × 64 × 12], its downsampled volume would be [32 × 32 × 12]. It downsamples the feature maps of the previous layer, obtained from the different filters, to reduce overfitting as well as the computation in the network. In a POOL layer, the input volume is represented as [W1 × H1 × D1], corresponding to the spatial dimensions of the input volume; two hyperparameters are represented as [F, S], corresponding to the receptive field or filter size and the stride; and the output volume is represented as [W2 × H2 × D2], where
$$W_2 = (W_1 - F)/S + 1$$
$$H_2 = (H_1 - F)/S + 1$$
$$D_2 = D_1$$
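The pooling relations can be sketched in the same way. The example assumes max pooling with F = 2 and S = 2, which is consistent with the 64 to 32 downsampling in the text but is not stated there explicitly; the function names are illustrative.

```python
def pool_output_shape(w1, h1, d1, f, s):
    """Output volume [W2, H2, D2] of a POOL layer with hyperparameters [F, S]."""
    return ((w1 - f) // s + 1, (h1 - f) // s + 1, d1)


def max_pool_2d(feature_map, f, s):
    """Max pooling over a single 2D feature map given as a list of rows."""
    rows = (len(feature_map) - f) // s + 1
    cols = (len(feature_map[0]) - f) // s + 1
    return [
        [max(feature_map[r * s + i][c * s + j] for i in range(f) for j in range(f))
         for c in range(cols)]
        for r in range(rows)
    ]


pooled_shape = pool_output_shape(64, 64, 12, f=2, s=2)
pooled_map = max_pool_2d([[1, 2, 3, 4],
                          [5, 6, 7, 8],
                          [9, 10, 11, 12],
                          [13, 14, 15, 16]], f=2, s=2)
```

Pooling acts on each feature map independently, so the depth of the volume is unchanged.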
2.3 Fully Connected Layer As in ANNs, the neurons of the FC layer in a CNN are fully connected to the neurons of the previous layer. The FC layer is usually kept as the last layer of a CNN, with "SOFTMAX" as its activation function for multi-class classification problems. The FC layer is responsible for predicting the final class or label of the input image. It therefore has an output dimension of [1 × 1 × N], where N denotes the number of classes or labels considered for classification. Popular CNN architectures: LeNet, AlexNet, GoogLeNet, and VGGNet are among the most popular CNN architectures used in state-of-the-art deep learning research to solve various computer vision problems such as image classification, object recognition, self-driving cars, speech recognition, and so on.
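The SOFTMAX activation of the final FC layer turns the N raw class scores into probabilities that sum to one. A minimal sketch follows; the scores are hypothetical, and subtracting the maximum score is a common numerical-stability step rather than something taken from the text.

```python
import math


def softmax(scores):
    """Convert N raw class scores into N class probabilities."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores from an FC layer with N = 4 classes
probabilities = softmax([2.0, 1.0, 0.1, -1.5])
predicted_class = probabilities.index(max(probabilities))
```

The predicted label is simply the class with the highest probability.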
2.4 Dataset Among the flower datasets available on the Internet, the FLOWERS17 and FLOWERS102 datasets from the Visual Geometry Group at the University of Oxford are challenging datasets available for benchmarking. This is mainly because the images in these datasets vary widely in scale, pose,
and lighting conditions. The datasets have high intra-class as well as inter-class variation. Consequently, a computer vision system that relies only on handcrafted feature extraction techniques, such as HOG, LBP, and bag of visual words with color channel statistics, color histograms, scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and so forth, struggles on these datasets. In this research work, several more flower species classes are added to the FLOWERS17 dataset, and the result is called the FLOWERS28 dataset (28 flower species), as shown in Fig. 2. The FLOWERS28 dataset consists of 2240 images of flowers belonging to 28 classes, with 80 images per class. The FLOWERS102 dataset has 8189 images of flowers belonging to 102 classes, and the number of images varies from class to class.
3 Existing Method This research proposed a mobile application based on the Android operating system for identifying Indonesian medicinal plant images based on texture and color features of digital leaf images. In the experiments, 51 species of Indonesian medicinal plants were used, each with 48 images, so the total number of images used in this research is 2448. This research investigates the effectiveness of fusing the fuzzy local binary pattern (FLBP) and the fuzzy color histogram (FCH) in order to identify medicinal plants. The FLBP method is used for extracting leaf image texture. The FCH method is used for extracting leaf image color. The fusion of FLBP and FCH is carried out using the product decision rule (PDR) method. This research used a probabilistic neural network (PNN) classifier for classifying medicinal plant species. The experimental results show that the fusion of FLBP and FCH can improve the average accuracy of medicinal plant identification. The identification accuracy using the fusion of FLBP and FCH is 74.51%. This application is important in helping people identify and find information about Indonesian medicinal plants. Disadvantages: 1. The most recent approaches have overlooked images of low quality, such as images with noise or low luminosity. 2. Lower accuracy.
4 Proposed Method The proposed system was tested on a dataset of 55 medicinal plants from Vietnam, and a high accuracy of 98.3% was obtained with a CNN classifier. The size of each image was 256 × 256 pixels. The approach is based on fractal dimension features derived from leaf shape and vein patterns for the recognition and
classification of plant leaves. A volumetric fractal dimension approach is used to generate a texture signature for a leaf, together with the gray-level co-occurrence matrix (GLCM) algorithm. Advantages: 1. High accuracy is obtained, along with reduced time consumption for detecting the shape. More datasets are included. 2. Medical applications can also be found based on leaf shape and vein patterns for the recognition and classification of plant leaves. 3. A volumetric fractal dimension approach is used to generate a texture signature for a leaf, together with the gray-level co-occurrence matrix (GLCM) algorithm.
5 Transfer Learning 5.1 CNN as a Feature Extractor Deep learning typically requires GPUs and a large amount of data for training. This is mainly because of the large number of iterations or epochs needed to train a neural network, and because images are computationally expensive 3D data (width, height, and channels). Therefore, instead of training a CNN from scratch with a large number of images per class, a technique called "transfer learning" is used: a network pre-trained on a very large dataset (the ImageNet challenge), such as OverFeat, Inception-v3, or Xception, is used as a feature extractor by keeping all the pre-trained layers except the last fully connected (FC) layer. One such pre-trained model, the OverFeat network, is considered here. To compare the results of the OverFeat network with other deep architectures, the GoogLeNet Inception-v3 and Xception architectures [3, 4] are also considered for analysis. 5.2 OverFeat Network The OverFeat network was introduced and trained by Sermanet et al. [5]. It was trained on the ImageNet 2012 training set, containing 1.2 million images across 1000 classes. The network architecture shown in Table 1 consists of 8 layers, with a ReLU non-linear activation applied after each convolutional and fully connected layer. In this architecture, the filter size (receptive field size) is larger at the start of the network and gradually decreases along the subsequent layers. In addition, the number of filters starts small and increases in the deeper layers of the network. The input images from the FLOWERS28 dataset are resized to a fixed size of [231 × 231 × 3] and sent to the OverFeat network. The first layer of neurons in OverFeat consists of CONV ⇒ RELU ⇒ POOL with M = 96 filters and a receptive field of [11 × 11]. The second layer consists of CONV ⇒ RELU ⇒ POOL with M = 256 filters and a receptive field of [5 × 5]. The third
Table 1 OverFeat network architecture
Layer | Stage | # Filters | Filter size | Conv. stride | Pooling size | Pooling stride | Spatial input size
1 | Conv + max | 96 | 11 × 11 | 4 × 4 | 2 × 2 | 2 × 2 | 231 × 231
2 | Conv + max | 256 | 5 × 5 | 1 × 1 | 2 × 2 | 2 × 2 | 24 × 24
3 | Conv | 512 | 3 × 3 | 1 × 1 | – | – | 12 × 12
4 | Conv | 1024 | 3 × 3 | 1 × 1 | – | – | 12 × 12
5 | Conv + max | 1024 | 3 × 3 | 1 × 1 | 2 × 2 | 2 × 2 | 12 × 12
6 | Full | 3072 | – | – | – | – | 6 × 6
7 | Full | 4096 | – | – | – | – | 1 × 1
8 | Full | 1000 | – | – | – | – | 1 × 1
and fourth layers in the network consist of CONV ⇒ RELU ⇒ CONV ⇒ RELU with M = 512 and 1024 filters, respectively, and a receptive field of [3 × 3]. The fifth layer consists of CONV ⇒ RELU ⇒ POOL with M = 1024 filters and a receptive field of [3 × 3]. The sixth and seventh layers are fully connected layers, followed by a SOFTMAX classifier that outputs the predicted class probabilities. Figure 3 shows an overview of the proposed framework. Using a smartphone, the user captures a flower image (assuming the flower is the only object in the foreground, with any random background). The captured image is then converted into a base64 string. It is then sent to a cloud storage platform called "Firebase", where it is stored in JSON format (username, time, date,
Fig. 3
image id, and image). On the server side, the CNN system trained on the FLOWERS28 dataset receives the latest flower image in base64 format and converts it into a standard matrix form for processing. The converted image is sent to the CNN, which predicts its class label. After prediction, the label name is sent back to the same username with the same image ID, and the smartphone receives the flower name automatically from the cloud storage platform. It takes about 1 s, tested on a Moto G3 Android smartphone with a Snapdragon quad-core CPU and 2 GB of RAM, to capture an image and receive its predicted label. 5.3 Software Requirements This section lists the languages and libraries used to build the proposed system. Java is required to develop the Android app that captures the flower image. The CNN subsystem is developed using Python 2.7. Transfer learning from the OverFeat network is implemented using the sklearn-theano library, which is freely available under the open-source BSD license. A simple, modular, and efficient deep learning Python library called "Keras" is used for the implementation of the Inception-v3 and Xception models, whose architectures and ImageNet pre-trained weights are freely available. NumPy, SciPy, matplotlib, cPickle, and h5py are the other Python libraries required. Since there are 2240 images in the dataset and each image is resized to [231 × 231] with 3 channels, the resulting feature matrix is of size [2240 × 4096] and is stored locally in HDF5 file format using the h5py library. The cloud storage platform used to store the base64-encoded input image is the Google-owned "Firebase".
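The base64 and JSON handling described for the image upload can be sketched with the Python standard library. The record field names, the sample values, and the function names below are illustrative assumptions; the actual Firebase calls used by the authors are not shown in the text and are not reproduced here.

```python
import base64
import json


def encode_image_record(username, image_id, image_bytes, date, time):
    """Build the JSON record the phone would upload (field names are assumed)."""
    record = {
        "username": username,
        "time": time,
        "date": date,
        "image_id": image_id,
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    return json.dumps(record)


def decode_image_record(record_json):
    """Server side: recover the user, the image ID and the raw image bytes."""
    record = json.loads(record_json)
    return record["username"], record["image_id"], base64.b64decode(record["image"])


# Hypothetical payload; a real image file would be read in binary mode instead
payload = encode_image_record("user01", "img-0001", b"\x89PNG placeholder bytes",
                              date="2020-01-01", time="10:30:00")
username, image_id, image_bytes = decode_image_record(payload)
```

Encoding the image as base64 text is what allows binary image data to be carried inside a JSON document.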
6 Results and Discussion 6.1 Flower Species Recognition The overall flower species recognition is divided into three parts. First, features are extracted from the images in the training dataset using the OverFeat CNN (i.e., the seventh layer, FC-4096) and stored in HDF5 file format. Second, the network is trained with different machine learning classifiers, such as bagging trees, linear discriminant analysis, Gaussian naïve Bayes, k-nearest neighbors, logistic regression, decision trees, stochastic gradient boosting, and random forests. Finally, random test images are given to the network for label prediction to evaluate the accuracy of the system. It is observed that the system correctly recognizes flower species with a Rank-1 accuracy of 82.32% and a Rank-5 accuracy of 97.5% using logistic regression as the machine learning classifier on the FLOWERS28 dataset. Figure 4 shows the accuracy obtained by training different machine learning classifiers on the CNN-extracted features from the training images. It can be seen that bagging trees, logistic regression, and random forests achieve Rank-5 accuracies of 92.14%, 97.5%, and 94.82%, respectively.
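Rank-1 and Rank-5 accuracy can be computed from the per-class probabilities returned by a classifier. The sketch below shows the general rank-k calculation on a tiny hypothetical example with three classes; it does not reproduce the reported results.

```python
def rank_k_accuracy(probabilities, true_labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for probs, label in zip(probabilities, true_labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical class probabilities for three test images and their true labels
probabilities = [[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.2, 0.7]]
true_labels = [0, 2, 2]

rank1 = rank_k_accuracy(probabilities, true_labels, k=1)
rank2 = rank_k_accuracy(probabilities, true_labels, k=2)
```

Rank-k accuracy can never decrease as k grows, which is why Rank-5 figures are always at least as high as Rank-1 figures.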
Fig. 4
References
1. Haralick, R.M., Shanmugam, K., Dinstein, I.H.: Textural features for image classification. IEEE Trans. Syst. Man Cybern. SMC-3(6), 610–621 (1973)
2. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing, pp. 722–729 (2008)
3. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
4. Chollet, F.: Xception: deep learning with depthwise separable convolutions. arXiv:1610.02357 [cs.CV] (2016)
5. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: integrated recognition, localization and detection using convolutional networks. In: ICLR (2014)
Power Spectrum Estimation-Based Narcolepsy Diagnosis with Sleep Time Slot Optimization Shivam Tiwari, Deepak Arora, Puneet Sharma, and Barkha Bhardwaj
Abstract Many researchers believe that healthy thinking depends on the quality of human sleep. Measuring sleep quality with currently available systems is not only complicated but also expensive. Accurate assessment of sleep quality depends on many physical and environmental factors, such as the dimensions of the bed, lighting, room temperature, any disease and oxygen level. Accurately measuring these factors and establishing their relationship with sleep quality is a complex process. In this work, to condense the systems currently available for measuring the sleep quality of narcolepsy patients, an algorithm is proposed based on EEG energy-level patterns and time slots. EEG data has been shown to reflect brain activity across all regions in relation to human activities, and it is useful both in the awake state and in the sleep stages. Most brain diseases are due to the deterioration of brain cells. Another part of the algorithm identifies an optimized time slot for narcolepsy patients, in order to shorten the test and to spare the patient from the radiation of sleep quality measuring equipment. Keywords EEG · Narcolepsy · Energy-level pattern · Frequency · Sleep quality · Time slot
S. Tiwari (B) · D. Arora · P. Sharma · B. Bhardwaj Department of Computer Science and Engineering, Amity University, Lucknow, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_84
1 Introduction In the past twenty years, an abundance of research has been published proposing new algorithms for biomedical signals such as EEG and ECG that support the use of power estimation in the frequency domain for detecting and diagnosing several disorders. One such article, by Penzel et al., demonstrated the use of single-lead ECG for the diagnosis of sleep disorders [1]. Since then, much literature has been published on this kind of ECG data, aimed at simplifying the instruments used for data recording and reducing computational complexity. All such research has focused on extracting patterns, also called features. Selection of the optimal calculation parameters is universally accepted in many kinds of advanced machine learning techniques. Heart rate variability (HRV) [1], the time interval between consecutive heartbeats (RR) and respiratory data records are widely used to derive feature sets for pattern recognition and machine learning [2]. Such biomedical data features are used in modern pattern recognition applications for the diagnosis of several human disorders and diseases, in both the time and frequency domains [3, 4]. Appropriate features must be carefully selected for simple and fast statistical equations [4] or evaluation methods [2]. One approach is principal component-based feature extraction, which helps to reduce dimensionality and improve the detection accuracy for a specific disease. In this research work, the focus is on diagnosing sleep disorders in an optimized time. Sleep is a crucial activity because it covers about one third of a lifetime. Sleep is as important as other activities such as eating, drinking and breathing. During sleep, the body and brain repair themselves through the interaction of hormones, muscles, neurons and memory.
Sleep disorder occurs when someone cannot sleep properly; it results in loss of function of the body's organs and muscles. The benefits of sleep are physical, emotional and psychological, and improper sleep harms us in all three respects. Nearly 84 types of sleep disorders have been observed, such as narcolepsy, insomnia, sleep apnoea and restless leg syndrome [5]. Sleep apnoea (SA) refers to pauses in breathing during sleep. Such a pause in breathing is called an apnoea, and it varies in duration and repetition. Difficulty in breathing during sleep is a symptom of hypopnea [6]. Sleep apnoea is currently divided into two categories. Obstructive sleep apnoea (OSA) is very common (2–4% of the adult population and 1–3% of children) [7] and occurs due to obstruction of respiration in the throat airway. The other category is central sleep apnoea (CSA), which occurs due to inhibited respiratory drive [8]. The effects of sleep disorders are not limited to the night. In the daytime, excessive sleepiness, poor concentration, severe headache, etc., occur [7]. At night, choking, noise, sweating, etc., can be observed as effects of the sleep disorder. OSA is found abundantly in middle-aged and elderly people and is linked with obesity [7] in most cases. It is recorded that 70 billion dollars are lost, with 11.1 billion in damages and 980 deaths each year [9]. Most cases go undiagnosed because of the inconvenience, expense and unavailability of testing. Testing is inconvenient to the patient because it requires them to spend the
night away from their bed, causing discomfort. It is expensive because of the overnight hospital stay and the need for machines and expert technicians. Testing is also widely unavailable because sleep centres operate at full capacity, and those on the waiting list can remain untreated for an additional 6 months. Testing includes polysomnography, airflow records of the subject, respiration, oxygen saturation, body position, EEG and electrocardiogram (ECG) [10]. To optimize the time slot, this work divides the entire sleep period into 4 parts and compares the accuracy of sleep-related disease predictions for each interval by applying machine learning algorithms to the data of the different time periods.
2 Literature Work In 2018, a study aimed at gold-standard classification accuracy for human sleep stages using heart rate variability (HRV) features based on the electrocardiogram (ECG) signal. The proposed technique combines the extreme learning machine (ELM) and particle swarm optimization (PSO) for feature selection and determination of the number of hidden nodes. The combination of ELM and PSO produces mean testing accuracies of 82.1%, 76.77%, 71.52%, and 62.66% for 2, 3, 4, and 6 classes, respectively. The study also offers a comparison with the ELM and support vector machine (SVM) methods, whose testing accuracy is lower than that of the combination of ELM and PSO. Based on the results, it is concluded that adding the PSO method gives better overall performance [11]. Another study proposed a novel method for automatically detecting sleep-disordered breathing (SDB) events using a recurrent neural network (RNN) to analyze nocturnal electrocardiogram (ECG) recordings. A deep RNN model comprising six stacked recurrent layers was designed for the automatic detection of SDB events. The proposed deep RNN model uses long short-term memory (LSTM) and a gated recurrent unit (GRU). To evaluate the performance of the proposed RNN method, 92 SDB patients were enrolled. Single-lead ECG recordings were measured for an average duration of 7.2 h and segmented into 10-s events. The data set comprised a training data set (545 events) from 74 patients and a test data set (17,157 events) from 18 patients. The proposed method achieved high performance, with an F1 score of 98.0% for LSTM and 99.0% for GRU. The results show an improvement in performance over conventional methods. The proposed method can be used as a precise screening and diagnostic tool for patients with SDB problems [12].
Different sleep-related diseases prominently affect different parts of the body. Breathing-related sleep disorders are identified through polysomnography reports. In this type of disorder, breathing stops for 10 s. These disorders are called sleep apnoea. According to scientific research, sleep apnoea can be a cause of heart-related disorders. Such diseases are confirmed by periodic observation of the readings of several polysomnography channels.
All the methods used to confirm sleep disorders are very expensive and time consuming. Periodic polysomnography check-ups for diseases like sleep apnoea put a financial burden on patients. To make such time-consuming and expensive tests simpler and cheaper, ECG signals were studied. Research confirms that ECG readings are capable of confirming diseases like sleep apnoea. Tunable-Q wavelet transform (TQWT) techniques were used for a better understanding of ECG readings [13]. The TQWT method decomposes the ECG readings into frequency bands in numeric form, making them easier to interpret. The centred correntropy (CCES) is calculated from the results obtained by the TQWT method. The CCES readings are tested using the line acting classifier on the obtained visualized waveform results. In this experiment, sleep apnoea was detected with 92.7%, 90.9% and 93.9% accuracy in 3 different cases using the random forest machine learning algorithm on ECG records [1].
3 Methodology Sleep is important for both physical and mental health. In 1913, the first analysis based on close examination of sleep activity was reported. In the 1920s, research was carried out on the sleep and wake stages along with circadian signals. Characteristics of sleep data were observed to identify sleep deprivation. Cyclic behaviour in the sleep process was found in 1955, establishing a significant relationship between narcolepsy sleep and REM activity [3]. It was shown that most dreams occur in the REM sleep stage and the remainder in NREM sleep. It is easier to wake from REM dreams than from dreams during the NREM phase. Recalling a dream is easier if a person is awakened just at the start of a REM dream than on waking the next morning. REM dreams are generally observed to be very unrealistic and bizarre. Dreams are recalled only partially when a person is awakened during NREM dreaming, and such dreams are somewhat realistic in nature. Dreams mostly involve visual sensations along with some auditory sensations [4]. The senses of smell and taste occur in the smallest proportion. Dream behaviour also involves movements in sleep, called REM parasomnia. In narcolepsy, however, dreams are not static as in REM; they are very dynamic and of very short duration. The behaviour of REM sleep and of narcolepsy patients is almost the same, and it is treated as REM sleep. In a comparison of sleep parameters, REM sleep latency in the narcolepsy group was significantly shorter than in controls. Sleep follows a cyclic pattern between the NREM and REM sleep stages [5, 10]. Non-REM sleep is further classified into stages, and human sleep behaviour is observed to change from birth to old age.
In sleep stage 0 (awake), the eyes are open, the EEG varies considerably with low voltage (beta waves), eyeball movement is slow and the EEG signal frequency is between 6 and 8 Hz. Stage 1 (S1) of sleep consists of alpha waves and occurs during drowsiness. In sleep cycle
stage 2, light sleep occurs, with very slow eye movements that are about to stop and brain waves gradually becoming slower. In this stage, sleep spindles also begin to appear, the EEG amplitude is moderate and the frequency lies between 4 and 7 Hz. In stage 3, also known as deep sleep, the brain wave rhythms are extremely slow and delta waves begin to appear, interspersed with smaller but faster waves. EEG data in this stage has a frequency of 1–3 Hz and high amplitude. In stage 4, also known as deep slow-wave sleep, the brain begins to generate delta waves. In this stage, the EEG amplitude is high and the frequency is below 2 Hz. In this stage, eye movements are found to be rapid, accompanied by momentary muscular movements. Theta waves are common in this sleep stage. For better analysis of the waveform of the narcolepsy sleep pattern, it is converted into a power spectrum using the Welch method. With the Welch method, a long wave pattern is divided into overlapping segments. The three steps are as follows: 1. The wave pattern is divided into K segments, taking the original wave pattern length as L. 2. A window is applied to each segment created in step 1. 3. The K periodograms of the wave pattern are averaged:
K 1 P x (k) e jω . Pw e jω = k k=1
(1)
where X Px(k) =
L−1 1 w(n) x (k) (n)e− jωn . N n=0
(2)
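The three Welch steps and Eqs. (1) and (2) can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the Hann window, the 50% overlap, the choice of N equal to the segment length and the function names `hann_window`, `modified_periodogram` and `welch_psd` are all assumptions made here. A direct DFT is used so that the sketch needs only the standard library.

```python
import cmath
import math


def hann_window(length):
    """Symmetric Hann window w(n), n = 0..length-1 (the window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def modified_periodogram(segment, window):
    """Eq. (2) on the grid omega_m = 2*pi*m/N, taking N as the segment length."""
    n_pts = len(segment)
    spectrum = []
    for m in range(n_pts // 2 + 1):  # one-sided: bins from 0 up to Nyquist
        acc = 0j
        for n in range(n_pts):
            acc += window[n] * segment[n] * cmath.exp(-2j * math.pi * m * n / n_pts)
        spectrum.append(abs(acc) ** 2 / n_pts)
    return spectrum


def welch_psd(signal, seg_len, overlap=0.5):
    """Eq. (1): average the K modified periodograms of overlapping segments."""
    if len(signal) < seg_len:
        raise ValueError("signal is shorter than one segment")
    step = max(1, int(seg_len * (1.0 - overlap)))
    window = hann_window(seg_len)
    starts = range(0, len(signal) - seg_len + 1, step)
    periodograms = [modified_periodogram(signal[s:s + seg_len], window) for s in starts]
    k = len(periodograms)
    return [sum(p[m] for p in periodograms) / k for m in range(seg_len // 2 + 1)]


# Usage: a 10 Hz sine sampled at 100 Hz; with 50-sample segments each bin is 2 Hz wide.
fs = 100.0
x = [math.sin(2 * math.pi * 10.0 * n / fs) for n in range(200)]
psd = welch_psd(x, seg_len=50)
peak_bin = max(range(len(psd)), key=psd.__getitem__)
peak_hz = peak_bin * fs / 50
```

Averaging over several short segments trades frequency resolution for a lower-variance estimate, which is the usual reason for preferring the Welch method to a single long periodogram.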
Before calculating the power spectrum, it is important to find the optimized time slot. The aim of the research is to make sleep-related experiments concise and easy. With this approach, the narcolepsy data sets are divided into four time periods, 12:00–1:20, 1:00–2:20, 2:00–3:20 and 3:00–4:20, named slot1, slot2, slot3 and slot4, respectively. Each slot is treated as a separate input data set. The sleep patterns were tested by applying machine learning algorithms to the data sets of the different time intervals. The flow chart for identifying the optimized time slot is given in Fig. 1. After the optimized time slot has been identified, the power spectrum estimation steps are carried out to determine the energy level of the narcolepsy patient.
S. Tiwari et al.
Fig. 1 Flow chart to identify optimized time slot for narcolepsy patient
4 Algorithm to Diagnose Narcolepsy Sleep Disorder in an Optimized Time Slot
Phase 1: Identify the optimized time slot.
Phase 2: Import the EEG data files of the different channels in the desired format.
Phase 3: Extract the EEG data after downloading 30–60 s records of sleep activity in the different sleep stages from channels such as ROC-LOC and C4-P4.
Phase 4: Convert the voltage signal to the frequency domain.
Phase 5: Remove the mean value. The zero-frequency component obtained by the FFT algorithm is the mean value of the data; it is subtracted from the data so that all data have a similar mean and lie in the same range.
Phase 6: After the mean has been removed, the unwanted frequencies and noise are removed, as shown in Fig. 2. These generally lie above the EEG frequency range, that is, above about 40 Hz. For this purpose, a filter is used that passes low-frequency data and stops high-frequency values, which are generally anomalies such as noise, fluctuations or errors. The filtered data have the following benefits:
1. No high-frequency distortion.
2. Only data within the EEG frequency range are retained.
Phase 7: Power calculation. Power estimation is crucial for identifying the features of data abnormality in defective cases. The Welch algorithm is used to calculate the power of each frequency component by using the periodogram after applying the fast Fourier transform algorithm. The periodogram uses the formulae for the autocorrelation of a fixed-length piece of data after clipping or
Fig. 2 Filter data plot after removing distortions
segmenting in equal parts. It is a simple approach that provides results with very high accuracy. The periodogram estimate of the frequency component power of a data segment $x_L[n]$ of fixed length L is

$$P_{xx}(f) = \frac{1}{L F_s}\left|\sum_{n=0}^{L-1} x_L(n)\,e^{-j2\pi f n/F_s}\right|^{2} \qquad (3)$$

where $F_s$ stands for the sampling frequency. The frequencies are evaluated at

$$f_k = \frac{k F_s}{N}, \quad k = 0, 1, \ldots, N-1. \qquad (4)$$
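Phases 5 to 7 can be illustrated with a short sketch that removes the mean, suppresses everything above about 40 Hz and then evaluates Eq. (3) on the grid of Eq. (4). The source only states that a low-pass filter is used; the brick-wall filter in the frequency domain, the choice N = L and the names `remove_mean_and_lowpass` and `periodogram` are assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(N^2), adequate for short records)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def idft(spectrum):
    """Inverse DFT; only the real part is returned because the input signal is real."""
    n_pts = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_pts)
                for k in range(n_pts)).real / n_pts
            for n in range(n_pts)]


def remove_mean_and_lowpass(x, fs, cutoff_hz=40.0):
    """Phase 5: zero the zero-frequency bin. Phase 6: zero the bins above cutoff_hz."""
    n_pts = len(x)
    spectrum = dft(x)
    for k in range(n_pts):
        freq = min(k, n_pts - k) * fs / n_pts  # bins k and N-k share one frequency
        if k == 0 or freq > cutoff_hz:
            spectrum[k] = 0j
    return idft(spectrum)


def periodogram(x, fs):
    """Eq. (3) evaluated at f_k = k*fs/N (Eq. (4)); only k = 0..N/2 is returned."""
    n_pts = len(x)
    spectrum = dft(x)
    freqs = [k * fs / n_pts for k in range(n_pts // 2 + 1)]
    power = [abs(spectrum[k]) ** 2 / (n_pts * fs) for k in range(n_pts // 2 + 1)]
    return freqs, power


# Usage: an offset of 3, a 10 Hz component and a 60 Hz disturbance, sampled at 200 Hz.
fs = 200.0
raw = [3.0 + math.sin(2 * math.pi * 10 * n / fs) + 0.5 * math.sin(2 * math.pi * 60 * n / fs)
       for n in range(200)]
clean = remove_mean_and_lowpass(raw, fs)
freqs, power = periodogram(clean, fs)
peak_hz = freqs[max(range(len(power)), key=power.__getitem__)]
```

A practical pipeline would more likely use a designed FIR or IIR low-pass filter; the brick-wall version is used here only because it makes the effect of each phase easy to see.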
Phase 8: The area under the power curve in the delta, theta, alpha and beta frequency ranges is found using the trapezoidal rule. Theta (θ) frequency
ranges from 4 to 8 Hz, delta (δ) frequency is taken as 0.5 to 4 Hz, alpha (α) frequency ranges from 8 to 13 Hz, and finally the significant beta (β) wave frequency is taken as 13 to 30 Hz.
Phase 9: After the power under each brain wave band has been obtained, the ratio of these values is computed by dividing the average power of each sleep wave frequency range by the corresponding average power over all bands.
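Phases 8 and 9 can be sketched as follows. The band limits are those stated above; the reading of Phase 9 as "band area divided by the summed area of all four bands" is one plausible interpretation, and the names `BANDS`, `trapezoid_area` and `relative_band_power` are hypothetical.

```python
# Band limits in Hz as given in the text.
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
}


def trapezoid_area(freqs, power, low, high):
    """Phase 8: trapezoidal-rule area under the power curve for low <= f <= high."""
    points = [(f, p) for f, p in zip(freqs, power) if low <= f <= high]
    return sum(0.5 * (p0 + p1) * (f1 - f0)
               for (f0, p0), (f1, p1) in zip(points, points[1:]))


def relative_band_power(freqs, power):
    """Phase 9 (assumed reading): each band area divided by the total over all bands."""
    areas = {name: trapezoid_area(freqs, power, low, high)
             for name, (low, high) in BANDS.items()}
    total = sum(areas.values())
    return {name: (area / total if total > 0 else 0.0) for name, area in areas.items()}


# Usage: a flat spectrum on a 0.5 Hz grid, so each ratio is simply proportional to bandwidth.
freqs = [0.5 * k for k in range(61)]  # 0, 0.5, ..., 30 Hz
power = [1.0] * len(freqs)
ratios = relative_band_power(freqs, power)
```

For the flat test spectrum the band areas are 3.5, 4, 5 and 17, so the beta ratio is 17/29.5; with real EEG data the ratios are the features that the algorithm compares between subjects.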
5 Conclusion
In the modern busy world, the impact of stress and a restless lifestyle is prominent. This has established a culture of short sleep with high depression and anxiety. Owing to heavy workloads and challenges at the workplace, people are developing sleep disorders and suffering from sleeplessness or inadequate sleep. This article has developed an algorithm to diagnose the narcolepsy type of sleep disorder using the energy level, based on an optimized time slot technique. The approach has worked out a frequency-based classifier strategy to establish the circumstances under which a person has the narcolepsy sleep disorder; it also proves effective in reducing the effects of radiation and in making experiments on sleep quality brief and simple. In future, this approach may be implemented, and the effort to make sleep-related experiments simpler and more concise will be based on feature extraction.
References
1. Pinho, A., Pombo, N., Silva, B.M.C., Bousson, K., Garcia, N.: Towards an accurate sleep apnea detection based on ECG signal: the quintessential of a wise feature selection. Appl. Soft Comput. 83, 105568 (2019)
2. Pan, J., Tompkins, W.J.: A real-time QRS detection algorithm. IEEE Trans. Biomed. Eng. 32(3), 230–236 (1985)
3. Xie, B., Minn, H.: Real-time sleep apnea detection by classifier combination. IEEE Trans. Inf. Technol. Biomed. 16(3), 469–477 (2012)
4. Lin, C.-T., Juang, C.-F.: An adaptive neural fuzzy filter and its applications. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 27(4), 635–656 (1997)
5. da Silva Pinho, A.M., Pombo, N., Garcia, N.M.: Sleep apnea detection using a feed-forward neural network on ECG signal. In: IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom), Munich, pp. 1–6 (2016)
6. César Cavalcanti Roza, V., de Almeida, A.M., Adrian Postolache, O.: Design of an artificial neural network and feature extraction to identify arrhythmias from ECG. In: IEEE International Symposium on Medical Measurements and Applications (MeMeA), Rochester, MN, pp. 391–396 (2017)
7. Naseer, N., Nazeer, H.: Classification of normal and abnormal ECG signals based on their PQRST intervals. In: International Conference on Mechanical, System and Control Engineering (ICMSC), St. Petersburg, pp. 388–391 (2017)
8. Raj, A.A.S., Dheetsith, N., Nair, S.S., Ghosh, D.: Auto analysis of ECG signals using artificial neural network. In: International Conference on Science Engineering and Management Research (ICSEMR), Chennai, pp. 1–4 (2014)
9. Lesmana, T.F., Isa, S.M., Surantha, N.: Sleep stage identification using the combination of ELM and PSO based on ECG signal and HRV. In: 3rd International Conference on Computer and Communication Systems (ICCCS), Nagoya, pp. 258–262 (2018)
10. Urtnasan, E., Park, J., Lee, K.: Automatic detection of sleep-disordered breathing events using recurrent neural networks from an electrocardiogram signal. Neural Comput. Appl. 32, 4733–4742 (2020)
11. Karthik, R., Tyagi, D., Raut, A., Saxena, S., Bharath, K.P., Rajesh Kumar, M.: Implementation of neural network and feature extraction to classify ECG signals. In: Microelectronics, Electromagnetics and Telecommunications, pp. 317–326. Springer, Singapore (2019)
12. Navneet, W., Harsukhpreet, S., Anurag, S.: ANFIS: adaptive neuro-fuzzy inference system—a survey. Int. J. Comput. Appl. 123(13) (2015)
13. Nishad, A., Pachori, R.B., Rajendra Acharya, U.: Application of TQWT based filter-bank for sleep apnea screening using ECG signals. J. Ambient Intell. Humanized Comput. 1–12 (2018)
Rhythmic Finger-Striking: A Memetic Computing-Inspired Musical Rhythm Improvisation Strategy
Samarjit Roy, Sudipta Chakrabarty, Debashis De, Abhishek Bhattacharya, Soumi Dutta, and Sujata Ghatak
Abstract Musical rhythmic structures are combinations of beats, the most inseparable components of music, that carry the musical notes. Sharing knowledge of music rhythm patterns and applying them to generate improvised musical performances is a challenging task for novices in computational musicology. Memetic computing can be very effective in finding near-optimum solutions in the context of rhythmic pattern modeling. We have illustrated the tournament selection strategy to construct fruitful rhythmic patterns. We have incorporated mutation operators on the rhythmic structures after multi-point crossover to obtain optimum, prolific musical compositions. In this contribution, we have proposed a learning framework that recognizes the elementary music rhythm structures and improvises the rhythm patterns, using the memetic algorithm to enhance the quality of the source rhythm patterns. The tournament selection mechanism has been used to select the parent rhythms efficiently for generating rhythmic offspring.

S. Roy (B) · D. De
Department of CSE, Maulana Abul Kalam Azad University of Technology, Kolkata, West Bengal, India

S. Chakrabarty
Department of MCA, Techno India, Salt Lake, Kolkata, West Bengal, India

A. Bhattacharya · S. Dutta · S. Ghatak
Department of Computer Application, Institute of Engineering and Management, Kolkata, West Bengal, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_85
Keywords Computational musicology · Rhythm patterns · Tournament selection · Memetic algorithm · Crossover · Mutation
1 Introduction
Automatic music composition and music information retrieval are challenging and fascinating, and many tasks remain to be explored, principally because the quality of rhythmic structures is hard to judge, which significantly impedes automatic music composition. Despite these impediments, automatic music arrangement would benefit music artists who habitually compose AI-based music pieces using computational intelligence, without others' intervention. Complete and fruitful music is typically composed by human composers, who apply both human emotion and compositional knowledge to generate fruitful music pieces. Contemporary music listeners sometimes wish to compose on their own without consulting music experts, and so to realize computer researchers' vision of musical intelligence. However, there is no specific approach by which many listeners can judge a composed song in the same manner; everything depends on how an individual feels. That is why these are challenging tasks for music researchers working in this context. Variations of rhythmic structures over a particular vocal performance have been applied to mechanize a composition. As a composed melody should not consist purely of a set of specific musical notes, the rhythmic structures of instrumental performances should be kept intact, but variations should also be introduced for musical fruitfulness, up to a certain extent of originality, based on music theory. Hence, it is essential to exploit mathematical models that facilitate decision making from a list of possible alternative outcomes subject to the constraints. When mathematical models are applied to find the best alternatives among a large number of possible outcomes, evolutionary algorithms are called for.
We have structured the rhythms as 16-beat patterns, based on Indian music theory and listener evaluation. We have presented the memetic algorithm (MA) and selected the rhythmic strings using the tournament selection mechanism. The main advantage of the MA over the genetic algorithm (GA) is the local search process, which can be effective for modeling problems and improvisation processes that are already familiar to experts. We have principally analyzed the strategy for improvising rhythmic structures. Then, we have illustrated a set of rules extracted from music theory that serves as our MA local search function. The mutation operation after the crossover operation has also been illustrated to measure the quality of music rhythm for prolific music composition.
2 Related Works
Researchers have shown considerable interest in musical automation. The computer-assisted rhythm improvisation strategy is one of the most challenging research topics in computational musicology. Researchers have proposed a framework in which an optimization algorithm is applied to percussion-based rhythm structures of Indian music using the GA-inspired roulette wheel selection mechanism [1]. A context-sensitive pervasive music education schema has been illustrated using the linear rank-based individual selection schema of the conventional GA [2]. Additionally, several contributions have modeled musical patterns and interactive rhythmic features using object-oriented paradigms such as polymorphism and inheritance [3, 4]. Indian classical music has been modeled with high-level nets such as Petri nets [5, 6]. Some approaches are based on the extraction of musical features from classical raga or instrumental performances [7]. A session-aware classical music recommendation system inspired by contextual representation learning has been proposed in [8]. Researchers have illustrated the energy-efficient, crowdsensing-based Internet of Music Things in edge and fog computing paradigms [9, 10]. Music composition strategies have been formulated using an adaptive multiagent-based memetic computing approach embracing numerous metaheuristics [11]. An automatic music composition framework has been proposed that uses genetic algorithm-inspired, randomly produced note structures without human involvement [12]. A state-of-the-art review of genetic programming and evolutionary computing applications for music composition is presented in [13]. The authors of [14] have adapted an effective harmony search algorithm to the asymmetric traveling salesman problem.
An innovative memetic genetic algorithm has been illustrated for solving the traveling salesman problem, in which a multi-parent crossover strategy is combined with random mutation [15]. Researchers have demonstrated an algorithmic music composition prototype based on a genetic algorithm, with experimental studies [16]. A computational musicology framework using neural networks has been presented, with outcomes reported in terms of confusion matrices and ROC [17]. The proficiency of genetic algorithms has been evaluated in locating the critical slip ring of a consistent earthen slope [18].
3 Methods of Analysis
This section outlines the proposed memetic algorithm-based strategy for improvising musical rhythmic structures. The sequential steps are as follows:
Table 1 Representation of 16-bit rhythm population and their corresponding fitness

| Population no. | Population | Fitness |
|---|---|---|
| 1 | 1111111100000111 | 65,287 |
| 2 | 1001100110000011 | 39,299 |
| 3 | 1111100000000000 | 63,488 |
| 4 | 1101100100000110 | 55,558 |
| 5 | 1001001111001101 | 37,837 |
| 6 | 1100110000000001 | 52,225 |
| 7 | 1000001110010000 | 33,680 |
| 8 | 1110111000010010 | 60,946 |
3.1 Population Initialization
A population of chromosomes is generated randomly. Eight chromosomes are generated in this context.
3.2 Fitness Evaluation
We have derived the fitness function from the objective function. The fitness function quantifies the feasibility of a solution. The fitness values in Table 1 have been calculated as the decimal equivalents of the binary strings. The fitness function for the prescribed population is given by Eq. (1).

$$\text{Fitness Function} = \text{Decimal equivalent of each binary string} \qquad (1)$$
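Equation (1) can be checked directly against Table 1. The sketch below is illustrative only; the names `POPULATION` and `fitness` are hypothetical, and the strings are copied from Table 1.

```python
# The eight 16-bit rhythm chromosomes listed in Table 1.
POPULATION = [
    "1111111100000111",
    "1001100110000011",
    "1111100000000000",
    "1101100100000110",
    "1001001111001101",
    "1100110000000001",
    "1000001110010000",
    "1110111000010010",
]


def fitness(chromosome):
    """Eq. (1): the fitness is the decimal equivalent of the binary string."""
    return int(chromosome, 2)


fitness_values = [fitness(c) for c in POPULATION]
# fitness_values == [65287, 39299, 63488, 55558, 37837, 52225, 33680, 60946]
```

Because the leftmost bit carries the largest weight, this fitness rewards strings whose early beats are set, which is worth keeping in mind when interpreting the selection results.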
3.3 New Population Generation
The population generation strategy comprises four operations: selection, local search, crossover and mutation. The four operations are discussed in the following.
3.3.1 Selection
The selection process depends on the fitness values. In each tournament, the chromosome with the higher fitness value is chosen as a parent and is given preference to reproduce, while the chromosome with the lower value is discarded from reproduction. Table 2 describes the tournament selection strategy
Table 2 Selection using a tournament mechanism among randomly selected chromosomes

| Tournament number | Chromosome 1 (randomly selected) | Chromosome 2 (randomly selected) | Winner chromosome number | Winner chromosome |
|---|---|---|---|---|
| 01 | 2 | 4 | 4 | 1101100100000110 |
| 02 | 3 | 8 | 3 | 1111100000000000 |
| 03 | 1 | 3 | 1 | 1111111100000111 |
| 04 | 4 | 5 | 4 | 1101100100000110 |
| 05 | 1 | 6 | 1 | 1111111100000111 |
| 06 | 2 | 5 | 2 | 1001100110000011 |
| 07 | 4 | 8 | 8 | 1110111000010010 |
| 08 | 2 | 6 | 6 | 1100110000000001 |
for performing in the offspring reproduction. The tournament selection mechanism is summarized as follows:
Tournament Selection Mechanism
• It provides selective pressure by holding a tournament among N_u individuals.
• The chromosome with the highest fitness value, F, is the champion of the N_u individuals.
• The tournament participants and winners are placed in the mating pool.
• The tournament algorithm iterates until the mating pool is completely filled with new offspring chromosomes.
3.3.2 Local Search
Local search is used to find the best chromosomes. The vocal performance requires the exact duration for playing the 16-bit rhythmic structures. We have applied a local search schema to each rhythm in the current population using the scalar fitness function multiplied by its rhythm score. For each solution, the score weight vector used in the selection of its parents is also used in the local search. The local search terminates when no better solution is found among all the neighbors.
3.3.3 Crossover
Crossover is the procedure of combining the bits of one chromosome with those of another participating chromosome. Crossover creates offspring for the next generation, which inherit the features of the parent chromosomes. The cross-sites are chosen randomly and the subsequent bits are exchanged. After crossover, the
parent chromosomes create two new offspring. We have performed the multi-point crossover operation for the projected paradigm.
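A minimal sketch of multi-point crossover follows. It assumes that the segments between consecutive cross-sites are taken alternately from the two parents, and it uses cross-sites at positions 4, 8 and 12, in line with the four 4-beat sections described in Sect. 4; the function name `multipoint_crossover` is hypothetical. With String 1 and String 4 as parents, this choice yields the two strings listed in the first row of Table 3.

```python
def multipoint_crossover(parent1, parent2, cross_sites):
    """Exchange alternate segments between two equal-length binary strings."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have the same length")
    bounds = [0] + sorted(cross_sites) + [len(parent1)]
    child1, child2 = [], []
    for index, (start, end) in enumerate(zip(bounds, bounds[1:])):
        # Even-numbered segments stay with their own parent, odd-numbered ones are swapped.
        first, second = (parent1, parent2) if index % 2 == 0 else (parent2, parent1)
        child1.append(first[start:end])
        child2.append(second[start:end])
    return "".join(child1), "".join(child2)


# Usage: String 1 and String 4 with cross-sites after beats 4, 8 and 12.
offspring1, offspring2 = multipoint_crossover(
    "1111111100000111", "1101100100000110", [4, 8, 12]
)
# offspring1 == "1111100100000110", offspring2 == "1101111100000111"
```

Other segment-exchange conventions are possible; the alternating scheme is shown because it is the most common reading of multi-point crossover.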
3.3.4 Mutation
We have performed the mutation operation after crossover to prevent the population from settling into a locally best solution to the problem. Mutation modifies the child offspring by flipping bits from 0 to 1 or vice versa. We have performed mutation on the resulting chromosomes with a small probability.
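Bit-flip mutation can be sketched in two forms: flipping named positions, and flipping each bit independently with a small probability. The function names, the 1-based position convention and the default per-bit probability of 1/m (m being the string length, as stated later in Sect. 4) are assumptions of this sketch.

```python
import random


def flip_positions(chromosome, positions):
    """Flip the bits at the given 1-based positions of a binary string."""
    bits = list(chromosome)
    for pos in positions:
        bits[pos - 1] = "1" if bits[pos - 1] == "0" else "0"
    return "".join(bits)


def mutation_rate_percent(flipped_bits, length):
    """Share of flipped bits in the string, as a percentage."""
    return 100.0 * flipped_bits / length


def random_mutation(chromosome, probability=None, rng=None):
    """Flip each bit independently; the default probability is 1/m for a string of length m."""
    rng = rng or random.Random()
    if probability is None:
        probability = 1.0 / len(chromosome)
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < probability else bit
        for bit in chromosome
    )


# Usage: flip positions 6, 7 and 16 of a 16-bit offspring (three bits, i.e. 18.75%).
mutated = flip_positions("1111100100000110", [6, 7, 16])
rate = mutation_rate_percent(3, 16)
sample = random_mutation("1111100100000110", rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps runs reproducible, which is convenient when comparing improvisation results across experiments.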
3.4 Population Replacement
We have replaced the previous population with the newly generated population.
3.5 Testing of Evaluations
We have tested whether the termination condition is satisfied. If the condition is satisfied, that is, the optimum evaluation threshold is reached, we stop and return the best solution in the current population. Otherwise, we go to Step 2.
4 Performance Metrics
We initially provided eight rhythm structures, each of meter length sixteen (16), to establish the proposed algorithm. We generated eight individual chromosomes using the simple binary encoding mechanism [1] and applied the tournament-based individual selection approach to these preliminary chromosomes for parent selection. The eight chromosomes are as follows:
(A) String 1 = 1111111100000111
(B) String 2 = 1001100110000011
(C) String 3 = 1111100000000000
(D) String 4 = 1101100100000110
(E) String 5 = 1001001111001101
(F) String 6 = 1100110000000001
(G) String 7 = 1000001110010000
(H) String 8 = 1110111000010010
The fitness values of the eight initial chromosomes are listed in Table 1. The fitness values are the decimal equivalents of the chromosomes that have been
Fig. 1 Diagrammatic demonstration of the winner chromosome selection
generated. From Table 2, the two fittest chromosomes for the crossover operation have been identified as the winners of the tournament selection. The two parent rhythms are chromosome 1 and chromosome 4. Consequently, String 1 and String 4 are elected as parent chromosome 1 and parent chromosome 2, respectively, for generating improved offspring rhythms, as each of these strings has won twice in the tournament, which is more than any other. In Table 2, the winner chromosomes after the 8th iteration are 4, 3, 1, 4, 1, 2, 8 and 6, respectively. Among them, chromosomes 2, 3, 6 and 8 have been selected as winners once each, chromosomes 1 and 4 have been chosen twice, and chromosomes 5 and 7 have not been chosen at all. Hence, Fig. 1 depicts that the selected strings are String 1 (1111111100000111) and String 4 (1101100100000110). In Table 3, we have presented the population selection procedure using the tournament selection strategy. The eight encoded rhythm chromosomes have been initialized based on the structures of rhythmic variations. The two chromosomes that occur most often as winners are chosen as parent chromosomes for the crossover operations. From Table 3, the mean chromosome fitness has been calculated to find the optimum parent chromosomes. M_i(a, b) denotes the chromosome with the maximum fitness value in the pair (a, b) during the ith tournament, where i = {1, 2, …, 8}, for the subsequent crossover and mutation operations. The mean fitness after the 10th iteration is calculated in Eq. (2).

$$\text{Mean after 10th iteration} = \frac{\sum \text{offspring 1}/10 + \sum \text{offspring 2}/10}{2} = \frac{539{,}176/10 + 555{,}822/10}{2} = 54{,}749.9 \qquad (2)$$

After the first selection of the parent chromosomes (after the 1st iteration), the mean fitness is given by Eq. (3).

$$\text{Mean fitness after 1st iteration} = \frac{\text{offspring 1} + \text{offspring 2}}{2} = \frac{63{,}750 + 57{,}095}{2} = 60{,}422.5 \qquad (3)$$
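The tournament outcomes of Table 2 and the mean values of Eqs. (2) and (3) can be cross-checked with a short sketch. The pairs are those listed in Table 2 and the offspring fitness values are the decimal column of Table 3; the names `FITNESS`, `TABLE2_PAIRS`, `tournament` and `mean_fitness` are hypothetical.

```python
from collections import Counter

# Fitness of chromosomes 1..8 from Table 1.
FITNESS = {1: 65287, 2: 39299, 3: 63488, 4: 55558,
           5: 37837, 6: 52225, 7: 33680, 8: 60946}

# Randomly selected pairs as listed in Table 2.
TABLE2_PAIRS = [(2, 4), (3, 8), (1, 3), (4, 5), (1, 6), (2, 5), (4, 8), (2, 6)]

# Decimal fitness of offspring 1 and offspring 2 for the ten iterations of Table 3.
OFFSPRING1 = [63750, 64513, 64513, 40065, 40839, 65287, 39302, 57095, 63747, 40065]
OFFSPRING2 = [57095, 52999, 51462, 63747, 56321, 52225, 52999, 52225, 65287, 51462]


def tournament(pairs, fitness):
    """Return the fitter chromosome number of each pair."""
    return [max(pair, key=fitness.__getitem__) for pair in pairs]


def mean_fitness(first, second):
    """Average the two offspring means, as in Eqs. (2) and (3)."""
    count = len(first)
    return (sum(first) / count + sum(second) / count) / 2


winners = tournament(TABLE2_PAIRS, FITNESS)           # [4, 3, 1, 4, 1, 2, 8, 6]
win_counts = Counter(winners)                         # chromosomes 1 and 4 win twice
mean_after_10 = mean_fitness(OFFSPRING1, OFFSPRING2)  # approximately 54,749.9
mean_after_1 = mean_fitness(OFFSPRING1[:1], OFFSPRING2[:1])  # 60,422.5
```

The lower mean after ten iterations than after the first is what the subsequent discussion refers to when it compares Eqs. (2) and (3).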
Table 3 Selection of population with offspring creation

| No | Selection of chromosomes | Winners | Strings | Decimal |
|---|---|---|---|---|
| 01 | M1(2,4) = 4; M2(3,8) = 3; M3(1,3) = 1; M4(4,5) = 4; M5(1,6) = 1; M6(2,5) = 2; M7(4,8) = 8; M8(2,6) = 6 | 1 and 4 | 1111100100000110 and 1101111100000111 | 63,750 and 57,095 |
| 02 | M1(1,4) = 1; M2(5,8) = 8; M3(3,8) = 3; M4(5,7) = 5; M5(2,7) = 2; M6(4,6) = 2; M7(6,8) = 8; M8(2,3) = 3 | 3 and 8 | 1111110000000001 and 1100111100000111 | 64,513 and 52,999 |
| 03 | M1(1,5) = 1; M2(2,6) = 6; M3(1,4) = 1; M4(1,7) = 1; M5(5,6) = 6; M6(2,8) = 8; M7(3,7) = 3; M8(4,5) = 4 | 1 and 6 | 1111110000000001 and 1100100100000110 | 64,513 and 51,462 |
| 04 | M1(2,7) = 2; M2(3,4) = 3; M3(2,4) = 4; M4(3,8) = 3; M5(2,5) = 2; M6(1,8) = 1; M7(3,6) = 3; M8(3,4) = 3 | 2 and 3 | 1001110010000001 and 1111100100000011 | 40,065 and 63,747 |
| 05 | M1(5,7) = 5; M2(4,5) = 4; M3(4,8) = 4; M4(2,5) = 2; M5(2,7) = 2; M6(1,5) = 1; M7(6,7) = 6; M8(7,8) = 8 | 2 and 4 | 1001111110000111 and 1101110000000001 | 40,839 and 56,321 |
| 06 | M1(1,3) = 1; M2(2,8) = 8; M3(1,8) = 1; M4(5,8) = 8; M5(3,4) = 3; M6(2,7) = 2; M7(5,7) = 5; M8(6,8) = 8 | 1 and 8 | 1111111100000111 and 1100110000000001 | 65,287 and 52,225 |
| 07 | M1(2,5) = 2; M2(2,7) = 2; M3(5,6) = 6; M4(6,7) = 6; M5(2,8) = 8; M6(5,7) = 5; M7(4,5) = 4; M8(1,4) = 1 | 2 and 6 | 1001100110000110 and 1100111100000111 | 39,302 and 52,999 |
| 08 | M1(4,2) = 4; M2(4,8) = 4; M3(2,6) = 6; M4(6,7) = 6; M5(1,3) = 1; M6(4,5) = 4; M7(3,7) = 3; M8(2,7) = 2 | 4 and 6 | 1101111100000111 and 1100110000000001 | 57,095 and 52,225 |
| 09 | M1(1,6) = 1; M2(3,2) = 3; M3(3,4) = 3; M4(2,8) = 8; M5(2,7) = 2; M6(3,5) = 3; M7(1,6) = 1; M8(1,3) = 1 | 1 and 3 | 1111100100000011 and 1111111100000111 | 63,747 and 65,287 |
| 10 | M1(2,7) = 2; M2(5,8) = 8; M3(3,4) = 3; M4(2,3) = 3; M5(1,8) = 1; M6(2,8) = 8; M7(2,5) = 2; M8(7,8) = 8 | 2 and 8 | 1001110010000001 and 1100100100000110 | 40,065 and 51,462 |
Fig. 2 Cross-sites (CSs) of two-parent chromosomes
Fig. 3 Generated offspring chromosomes after multi-point crossover
The values of Eqs. (2) and (3) indicate that the initial parent chromosomes give better outcomes for creating offspring and that the quality of the rhythmic variations gradually decreases. Chromosomes 1 and 4 are the optimum parents for creating rhythmic offspring. We have performed the multi-point crossover operation demonstrated in Figs. 2 and 3. The offspring chromosomes replace the parents that participated previously. We continued this process until the 10th iteration was completed and produced the optimum outcome. The chromosomes considered contain 16 beats divided into four sections, so each section contains 4 beats and there are 3 cross-sites. The multi-point crossover technique has been applied to these chromosomes; the two parent rhythms and the crossover operations are shown in Figs. 2 and 3. The local mutations can transpose by a random number of alleles, reverse in time, change a few bits while keeping the rhythm intact, and concatenate contiguous identical beats. There is also a simple single-beat (beat-flipping) mutation, which changes just one beat of the rhythm up or down; this can be helpful when the melody needs only small changes for large increments in fitness. The bit-flipping operation over rhythm offspring 1 (positions 6, 7 and 16) is demonstrated in Fig. 4.
Fig. 4 Representation of local mutation over rhythm offspring 1
Fig. 5 Demonstration of local mutation over rhythm offspring 2
The mutation probability is 1/m per bit of a binary string, where m denotes the length of the individual binary string vector. The mutation rate is 1 for a string if the operation extends over the full length of the selected individual. In Fig. 4, three bits are flipped; hence, the mutation rate is 3 × 100/16 = 18.75%. The bit-flipping operation over rhythm offspring 2 (position 3) is presented in Fig. 5, where the mutation rate is 1 × 100/16 = 6.25%. In the schematic view of the optimization approach, a specified number of individuals is first formed. The initial individuals can be chosen arbitrarily or generated by a fixed population generation approach; metaheuristics can also be used to prepare the population. The local search mechanism is performed on each individual and continues until a predetermined threshold is reached. When the individuals have reached a certain level of improvement, they interact with the other individuals, either cooperatively or competitively. The cooperative behaviour can be realized through the crossover mechanisms of the conventional GA, which produce new individuals. This cooperation is repeated until the stopping criterion is satisfied. Usually, the criterion is determined by a measure of diversity within the population.
5 Analysis and Discussion
In this section, we illustrate the outcomes of the proposed memetic algorithm-based improvisation strategy for percussion-based rhythmic structures. Figure 6 represents the average distances among the individuals taken for the analysis. We have observed that the mean distances among the individuals in the rhythm population indicate low diversity. Figure 7 shows the score histogram of the fitness values of the rhythm population, which are uniformly distributed. Figure 8 depicts the selection function that has been used over the initial population of strings. The selection function specifies the strategy for choosing the parent chromosomes for the subsequent generation of offspring strings. Figure 9 compares the proposed rhythmic finger-striking schema with the previously discussed conventional GA-based rhythm improvisation frameworks. We have selected the parent chromosomes using the tournament selection mechanism. In [1], a Roulette Wheel-based chromosome selection
Fig. 6 Representation of mean distance among individuals
Fig. 7 Demonstration of the score of the initial chromosomes
mechanism had been illustrated. We can claim that the proposed MA-based approach for the selection of individuals yields a better outcome than the previously discussed approaches.
6 Conclusion and Scope for Future Research
The foremost intention of the proposed rhythmic finger-striking schema is to generate variations of specific rhythm structures for real-time rhythm improvisation. Experimental results show that the eight preliminary rhythmic patterns and the quality
Fig. 8 Representation of selection option over the proposed individuals
Fig. 9 Comparison of the MA-based approach with a conventional GA-based paradigm
of rhythm structures over a vocal performance have been improved with the tournament selection mechanism. The capability of all the evolutionary operators, such as reproduction, local search, crossover and mutation, to generate better outcomes is the prime reason that the approach has developed into one of the most improved mechanisms for generating new musical rhythmic structures. In this small endeavor, a step has been taken toward developing tools to assist in estimating the quality of each fundamental rhythm structure. In future work, the rhythm improvisation strategies can be evolved using user-defined fitness functions and various evolutionary computing strategies to obtain optimum outcomes in this context.
References
1. Chakrabarty, S., De, D.: Quality measure model of music rhythm using genetic algorithm. In: International Conference on Radar, Communication and Computing (ICRCC), pp. 125–130. IEEE, Chennai, India (2012)
2. Chakrabarty, S., Roy, S., De, D.: Pervasive diary in music rhythm education: a context aware learning tool using genetic algorithm. In: Advanced Computing, Networking and Informatics, pp. 669–677. Springer, Kolkata, India (2014)
3. De, D., Roy, S.: Polymorphism in Indian classical music: a pattern recognition approach. In: International Conference on Communications, Devices and Intelligent Systems (CODIS), pp. 612–615. IEEE, Kolkata, India (2012)
4. De, D., Roy, S.: Inheritance in Indian classical music: an object-oriented analysis and pattern recognition approach. In: International Conference on Radar, Communication and Computing (ICRCC), pp. 193–198. IEEE, Chennai, India (2012)
5. Roy, S., Chakrabarty, S., Bhakta, P., De, D.: Modeling high performing music computing using petri nets. In: International Conference on Control, Instrumentation, Energy and Communication (CIEC), pp. 757–761. IEEE, Kolkata, India (2014)
6. Roy, S., Chakrabarty, S., De, D.: A framework of musical pattern recognition using petri nets. In: Emerging Trends in Computing and Communication (ETCC), pp. 245–252. Springer-Link Digital Library, Kolkata, India (2014)
7. Chakrabarty, S., Roy, S., De, D.: Automatic raga recognition using fundamental frequency range of extracted musical notes. In: International Conference on Image and Signal Processing (ICISP), pp. 337–345. Elsevier, Bengaluru, India (2014)
8. Roy, S., Biswas, M., De, D.: iMusic: a session-sensitive clustered classical music recommender system using contextual representation learning. Multimedia Tools Appl. (2020). https://doi.org/10.1007/s11042-020-09126-8
9. Roy, S., Sarkar, D., Hati, S., De, D.: Internet of music things: an edge computing paradigm for opportunistic crowdsensing. J. Supercomput. 74(11), 6069–6101 (2018)
10. Roy, S., Sarkar, D., De, D.: Entropy-aware ambient IoT analytics on humanized music information fusion. J. Ambient Intell. Human Comput. 11(1), 151–171 (2020)
11. Muñoz, E., Cadenas, J.M., Ong, Y.S., Acampora, G.: Memetic music composition. IEEE Trans. Evol. Comput. 20(1), 1–15 (2014)
12. Doush, I.A., Sawalha, A.: Automatic music composition using genetic algorithm and artificial neural networks. Malays. J. Comput. Sci. 33(1), 35–51 (2020)
13. Loughran, R., O’Neill, M.: Evolutionary music: applying evolutionary computation to the art of creating music. Genet. Program. Evolvable Mach. 1–31 (2020)
14. Boryczka, U., Szwarc, K.: An effective hybrid harmony search for the asymmetric travelling salesman problem. Eng. Optim. 52(2), 218–234 (2020)
15. Roy, A., Manna, A., Maity, S.: A novel memetic genetic algorithm for solving traveling salesman problem based on multi-parent crossover technique. Decis. Making Appl. Manage. Eng. 2(2), 100–111 (2019)
16. Stoltz, B., Aravind, A.: MU_PSYC: music psychology enriched genetic algorithm. In: IEEE Congress on Evolutionary Computation, pp. 2121–2128. IEEE, New Zealand (2019)
17. Roy, S., Chakrabarty, S., De, D.: Time-based raga recommendation and information retrieval of musical patterns in Indian classical music using neural network. IAES Int. J. Artif. Intell. (IJ-AI) 6(1), 33–48 (2017)
18. Bhattacharjya, R.K.: Efficiency of binary coded genetic algorithm in stability analysis of an earthen slope. In: Nature-Inspired Methods for Metaheuristics Optimization, pp. 323–334. Springer, Cham (2020)
COVID-R: A Deep Feature Learning-Based COVID-19 Rumors Detection Framework
Tulika Paul, Samarjit Roy, Satanu Maity, Abhishek Bhattacharya, Soumi Dutta, and Sujata Ghatak
Abstract Rumors are, in general, tainted information that spreads widely with the growth of social media. Because the public needs to judge the reliability of news and information, determining news accuracy is indispensable. In this contribution, we present a feature learning-inspired rumor detection framework, titled COVID-R, for a global pandemic such as the COVID-19 outbreak. We apply several machine learning approaches to automatically identify COVID-19-associated rumors from the key attributes denoted in the news headlines and evaluate the accuracy of the resulting system. We have compiled a real-time dataset (https://github.com/SamarjitRoy89/COVID-R.git) of the essential COVID-19-related news characteristics and use it to seek near-optimum news classification accuracy and a low root mean square error. The highest F1-score is achieved by the deep learning-based news classification schema, which reaches ~90% information classification efficiency.

T. Paul · S. Maity
Department of Computer Applications, Bengal School of Technology and Management, Hooghly, West Bengal, India

S. Roy (B)
Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, Kolkata, West Bengal, India

A. Bhattacharya · S. Dutta · S. Ghatak
Department of Computer Application, Institute of Engineering and Management, Kolkata, West Bengal, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 A. E. Hassanien et al. (eds.), Emerging Technologies in Data Mining and Information Security, Advances in Intelligent Systems and Computing 1300, https://doi.org/10.1007/978-981-33-4367-2_86
Keywords COVID-19 · Rumor detection system · Machine learning · Deep learning · TF-IDF · Data classification accuracy · RMSE
1 Introduction
Rumors are intentionally fabricated disinformation that spreads unchecked through social media platforms. In the public interest, all social media newscasts require a standard of information verifiability. With the rapid growth of electronic media, however, controlling the dissemination of ambiguous fake news and rumors has become one of the most crucial challenges. The massive spread of rumors is no exception during the severe global outbreak of the COVID-19 pandemic [1]. Citizens frequently report that they regularly encounter misleading rumors on social media, and anyone who follows worldwide events will recognize misleading and unreliable information of various kinds. Less-aware readers are the most likely to fall victim to topic-associated rumors and fake news. Such disinformation can adversely influence their decision making and lead to panic, racism, and violence.

Because rumors are spread to deliberately deceive viewers and listeners, isolating the content of a real fact becomes a challenging task. Nowadays, specifically during the worldwide COVID-19 outbreak [2, 3], a reliable and efficient schema for detecting rumors and fake news has become indispensable, because actual news content varies both in its reporting pattern and in the topic it covers. Automated rumor detection is a promising research domain, and several capable approaches have already been introduced for fake news and rumor recognition. Most of the promising strategies depend on accurate classification, association, and correlation methodologies for recognizing social disinformation parameters, such as whether an item is real information or disinformation, using certain distinctive linguistic terms.

For topic-specific rumor detection, recognition, and advanced analytics, machine learning [4] and deep learning [5] strategies have been used to achieve favorable outcomes. Researchers have illustrated ensemble learning-aware detection of tweet rumors in terms of text and image characteristics. This paradigm achieves a promising result, although it does not analyze COVID-19-related issues. In earlier research on automated fake news and rumor detection schemas, researchers have demonstrated several innovative archetypes based on machine learning [6], deep learning [7], ensemble learning [4], reinforcement learning [8], and so on. In this work, we present a compact yet promising framework to identify the underlying rumors during the severe global outbreak of the COVID-19 pandemic.
1.1 Motivation and Contributions
In [4], researchers have depicted machine learning-inspired systems for detecting the fake news and rumors that spread widely through social electronic media; these systems incorporate machine learning algorithms, deep learning-based schemas, ensemble learning paradigms, and reinforcement learning-based methodologies. In the current COVID-19 pandemic, researchers have been working eagerly since its early stages on analytics and elaborate visualizations of topic-inspired models [3]. Motivated by their outcomes and contributions, we have designed a framework to analyze, detect, and visualize COVID-19-related social media rumors using feature learning-dependent methodologies. The key contributions of this paper are as follows:
1. We have compiled a real-time dataset of news published during the COVID-19 pandemic, extracted from the electronic media.
2. We have illustrated a framework for extracting news features in terms of sparse matrices.
3. We have designed a representation learning-based framework for analyzing the classification accuracy of rumors or disinformation.
4. We have deployed a Web-based rumor detection tool, titled COVID-R, for analyzing and visualizing the rumors available in the social media.
1.2 Paper Organization
The rest of the manuscript is structured as follows. Section 2 summarizes the previously published works related to our contributions. Section 3 describes the complete working methodology, covering data preprocessing, feature extraction, and the learning strategies of the proposed COVID-R system. Section 4 presents the overall system outcomes, the performance metrics, and the deployed COVID-R tool. Finally, Sect. 5 concludes the article and outlines the scope for future research.
2 Related Works
Rumor detection, which is closely related to the fake news detection problem, is arguably a promising topic in recent research trends. Several articles have applied machine learning strategies to detect fake news over various types of information. Those that informed the work in this article are reviewed and summarized in this section.
The authors of [4] adopted different ensemble learning strategies to check the credibility of online tweets using text and image features and achieved beneficial results. The article [6] builds a supervised machine learning model to detect fake US presidential election news published on social media platforms during the 2015 and 2016 US election cycle. In [8], researchers collected data from the WeChat platform and used a reinforcement learning technique to detect fake news. A literature review [9] examined the impact of several machine learning algorithms and the use of digital tools on fake data in social media.

We have also studied articles related to the COVID-19 pandemic. Article [1] illustrates dangerously inaccurate beliefs, emotional contagion, and conspiracy ideation in the societal fake news of the COVID-19 outbreak. Article [2] investigated the full timeline of the pandemic and recorded the overall effects of social media rumors. Lastly, the survey in [3] covers COVID-19 data such as CT scans, X-ray case reports, and mobility data, together with future challenges. The authors of [10] illustrated a representation learning-inspired classical music recommendation framework that incorporates a session-sensitive musical data classification strategy.

Deep learning strategies now have a great impact in the machine learning domain, so we have also studied works that apply them to fake news detection. The authors of [5] proposed a model comprising a convolutional neural network and a recurrent neural network that works on embedded, preprocessed data and achieves a benchmark result. A deep convolutional neural network named FNDNet was proposed in [7] to detect fake news related to the US election. The authors of [11] introduced an automatic fake news credibility model for US political news, named FAKEDETECTOR, which is based on several explicit and latent characteristics extracted from the textual content. An ensemble learning-based classification mechanism, XGBoost, and a deep learning schema, DeepFakE, were combined and tested on BuzzFeed data to detect fake news in social media [12]. In [13], an LSTM network was trained on the George McIntire dataset to identify fake news. A memetic computing-dependent approach to musical rhythm improvisation, an instance of evolutionary intelligence, has been elucidated in the context of percussion-based musical instruments [14]. Finally, we studied an application-oriented review of how recent technologies (such as IoT and blockchain) are used in the COVID-19 pandemic situation [15].
3 Methods of Analysis
This section describes, step by step, the work performed toward the development of a well-working COVID-19 rumor detection system: first the preparation of the dataset, that is, how the data were collected and assembled into a dataset [12]; then the feature extraction process; and lastly the implementation of the training environment. The overall process is depicted as a block diagram in Fig. 1.
Fig. 1 Inclusive system architecture for learning-inspired COVID-19 rumor detection system
3.1 Dataset Creation and Preprocessing
For the proposed problem statement, we need two datasets related to the COVID-19 pandemic: a fact dataset and a rumor dataset [7], consisting of genuine news and rumor news, respectively. Because the COVID-19 pandemic is a new event, no relevant pre-existing datasets were available on the Internet, so we created them manually. For the fact data, we manually gathered a large number of news articles from trusted online newspaper sites and news blogs and stored them in a CSV file with the following column fields: a_text, a_source, and label. The a_text field contains the body of the collected news article, the a_source field records the source from which the article was collected, and the label field stores the status/tag of the article (i.e., rumor or fact). The rumor information was collected from several social media platforms, such as Facebook, WhatsApp, and boom.live [11], and stored manually in the same way as the fact data. We then applied two preprocessing techniques to these datasets. First, the stop words were removed because they carry little weight in the linguistic context. Second, every letter was converted to lower case for uniform feature representation. A snapshot of the aggregated dataset (available at https://github.com/SamarjitRoy89/COVID-R.git) is displayed in Fig. 2.
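The two preprocessing steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the stop-word list is a tiny hypothetical one (the paper does not say which list was used), the sample rows are invented rather than taken from the COVID-R dataset, and only the column names a_text, a_source, and label come from the text.

```python
import csv
import io

# Hypothetical, deliberately tiny stop-word list; the paper does not name the list it used.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "that", "this"}


def preprocess(text):
    """Lower-case the text and drop stop words (the two steps of Sect. 3.1)."""
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)


def load_rows(csv_text):
    """Read rows with the a_text, a_source and label columns and clean a_text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "a_text": preprocess(row["a_text"]),
            "a_source": row["a_source"],
            "label": row["label"],
        }
        for row in reader
    ]


# Invented sample rows for illustration only (not from the COVID-R dataset).
SAMPLE_CSV = (
    "a_text,a_source,label\n"
    "The Vaccine Trial is in Progress,example-news-site,fact\n"
    "Drinking Hot Water is a Cure,example-chat-forward,rumor\n"
)

rows = load_rows(SAMPLE_CSV)
```

The sketch splits on whitespace only, so punctuation handling and a fuller stop-word list would have to be added for real headlines.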
Fig. 2 Demonstration of the COVID-R dataset
3.2 Feature Extraction
After the dataset creation, we used two strategies for extracting features from the datasets. The first is the TF-IDF method [16], applied to the a_text field, and the second is the label encoder method, applied to the a_source field. TF-IDF, which combines the term frequency (TF) and the inverse document frequency (IDF), is a linguistic feature extraction technique in natural language processing. Here, each article body is treated as a document, and TF-IDF measures the significance of a term, i.e., a word, across the whole collection of documents. TF measures the occurrence of the term $t$ in the document $d$, and IDF measures how rare $t$ is across all documents $D$. Combining the two, the TF-IDF weight of the term $t$ is defined in Eq. (1):

\[ \mathrm{TF\text{-}IDF}(t, d, D) = \frac{f_t}{N} \times \ln\frac{K}{DF_t}, \quad \text{where } d \in D \tag{1} \]

where $f_t / N$ is the TF and $\ln(K / DF_t)$ is the IDF. The term $t$ appears $f_t$ times in the document $d$, and all terms together appear $N$ times in $d$. $K$ is the total number of documents, and $DF_t$ is the document frequency of $t$, i.e., the number of documents in the proposed dataset that contain $t$. In this way, the whole a_text field is converted into sparse matrices of TF-IDF weights. The label encoder technique is then applied to the a_source field; it converts each unique text entity to an integer, because the learning algorithms accept only numeric values rather than text. After feature extraction and preprocessing, we run the training environment on the resultant data frame.
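As a worked illustration of Eq. (1) and of label encoding, the following Python sketch computes the weights directly from the definitions of TF and IDF. It is not the authors' implementation: the function names tf_idf and encode_labels and the toy documents are assumptions made here, and each document is stored as a dictionary holding only its non-zero terms, which is one simple way to keep the representation sparse.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, following Eq. (1).

    TF  = f_t / N       (occurrences of t in d over the number of terms in d)
    IDF = ln(K / DF_t)  (K documents in total, DF_t of them contain t)
    """
    tokenized = [doc.split() for doc in documents]
    k = len(tokenized)
    # DF_t: number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        n = len(tokens)
        weights.append(
            {t: (f / n) * math.log(k / doc_freq[t]) for t, f in counts.items()}
        )
    return weights


def encode_labels(values):
    """Map each distinct string to an integer code (codes follow sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Toy documents and source names, invented for illustration.
docs = ["virus spreads fast", "virus cure rumor", "cure found"]
w = tf_idf(docs)
codes, mapping = encode_labels(["site-b", "site-a", "site-b"])
```

With Eq. (1) as written, a term that occurs in every document receives weight zero. Library implementations often use smoothed or normalized variants, so their values can differ from this sketch.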
3.3 System Learning and Representation
In this sub-section, we briefly discuss the four supervised machine learning strategies and one deep learning mechanism that are combined in a single training environment to deploy an efficient model for the COVID-19 rumor detection system.

Logistic Regression (LR). This binary classification algorithm builds on the working principle of linear regression and predicts the target between only two classes from the independent variables, that is, the input features [10]. The modification consists of passing the linear output through the sigmoid function, which yields the class probability of the target.

K-Nearest Neighbors (KNN). This technique separates the classes by measuring a distance, such as the Euclidean or Minkowski distance, to neighboring points. 'K' is the key term: a new sample is assigned the class held by the majority of its K nearest training samples in the feature space [4]. In our work, we use the Minkowski distance and set the value of K to 10.

Random Forest (RF). This ensemble learning strategy [8] works on multiple decision trees, where a decision is made at each branch of a tree to predict the class. The classes sit at the end of the tree as leaf nodes. We set the n-estimator parameter, the total number of decision trees in the forest, to 15 for the best result.

Support Vector Machine (SVM). In this technique, the feature sets are distributed in a multi-dimensional space as vector points [4]. The candidate hyper-planes that separate the classes in that space are considered, and the hyper-plane with the maximum margin is selected. We use the radial basis function to transform the features into vector points.

Deep Learning (DL). Lastly, we extended our experiment to deep learning [5, 7]. We integrate a multilayer neural network [10] with three distinct layers into the training environment with the following parameters: (a) total hidden layers: 100 nos.; (b) 'ReLU' activation function; (c) 'Adam' optimizer solver; (d) learning rate: 0.001; and (e) auto batch size.
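To make the KNN description concrete, the sketch below classifies a query point by majority vote among its nearest training points under the Minkowski distance. It is a minimal illustration rather than the classifier used in the paper: the function names and the toy two-dimensional feature vectors are assumptions, and k is set to 3 here only because the toy set has six points (the paper reports K = 10).

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_x, train_y, query, k=10, p=2):
    """Return the majority label among the k training points nearest to query."""
    order = sorted(
        range(len(train_x)), key=lambda i: minkowski(train_x[i], query, p)
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional feature vectors, invented for illustration.
train_x = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.9, 1.1), (1.2, 1.0)]
train_y = ["fact", "fact", "fact", "rumor", "rumor", "rumor"]

pred_near_rumor = knn_predict(train_x, train_y, (1.0, 1.0), k=3)
pred_near_fact = knn_predict(train_x, train_y, (0.1, 0.1), k=3)
```

In practice the feature vectors would be the TF-IDF weights and the encoded source from Sect. 3.2, which have far more than two dimensions.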
4 System Performance and Tool Metrics
We executed the training environment on ~3500 news items and performed the testing on ~1000 news items. The training environment is built upon the multiple machine learning strategies discussed in the previous section, and the performance of each strategy is measured using the F1-score metric. The comparison of all the utilized learning strategies is shown in Fig. 3. As Fig. 3 shows, the deep learning strategy performs best in this training environment, with an F1-score of ~0.90. Random forest, however, comes close, with an F1-score of 0.8763 and a precision of 0.9201 against 0.9217 for deep learning. Figure 4 shows the data classification accuracy over the dataset and suggests that the deep learning strategy gives the best overall classification accuracy among the utilized approaches. Similarly, Fig. 5 shows the root mean square error (RMSE) evaluation, where we achieved the least error rate over the data packets provided in the dataset. The RMSE of the deep learning strategy is the lowest of all the strategies (0.2274 in Table 1), which suggests that the anticipated classification mechanism can be a beneficial framework for designing a rumor and fake news detection system such as the 'COVID-R' tool demonstrated in this article.

Fig. 3 Representation of the efficiency comparison using diverse machine learning strategies (comparison chart; vertical axis: score, 0.0–1.0; plotted values listed below)

            LR      KNN     RF      SVM     DL
Precision   0.9243  0.8814  0.9201  0.9392  0.9217
Recall      0.6941  0.7652  0.8425  0.7511  0.8735
F1-score    0.7571  0.8100  0.8763  0.8136  0.8957

Fig. 4 Representation of the learning algorithm-based data classification accuracy (horizontal axis: data packets (nos.); vertical axis: accuracy score)

Fig. 5 Depiction of the learning-based root mean square error evaluation over the dataset (horizontal axis: data packets (nos.); vertical axis: RMSE)

Table 1 Overall learning-based system efficiency evaluation over the COVID-19 dataset

Performance attributes    Learning strategies
                          LR      KNN     RF      SVM     DL
Classification accuracy   0.8913  0.9272  0.9278  0.9142  0.9462
F1-score                  0.7571  0.8100  0.8763  0.8136  0.8957
RMSE                      0.3262  0.2666  0.2649  0.2891  0.2274

Table 1 reports the overall feature learning-inspired system efficiency over the adopted COVID-R dataset in terms of information classification accuracy, F1-score, and RMSE [10]. From Table 1, the deep learning strategy achieves the best value for every reported attribute on the proposed dataset. Consequently, we built a DL-based classification model trained on the complete dataset and deployed it in a Web-based application named 'COVID-R: A COVID-19 rumors detection system'. In this system, the user enters a piece of text related to the COVID-19 pandemic and retrieves whether the information is a rumor or a fact. Screenshots of the developed tool are shown in Fig. 6.
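The measures reported in Table 1 (classification accuracy, F1-score, and RMSE) can be computed from true and predicted labels as in the following sketch. The labels below are invented, the function name evaluate is an assumption, and the paper does not state how its scores were averaged across classes, so this sketch is not expected to reproduce the reported numbers.

```python
import math


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1-score and RMSE for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # RMSE over the numeric labels: square root of the mean squared difference.
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "rmse": rmse,
    }


# Invented labels for illustration: 1 = rumor, 0 = fact.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
scores = evaluate(y_true, y_pred)
```

For hard 0/1 predictions, each squared error is either 0 or 1, so the RMSE in this sketch equals the square root of the misclassification rate.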
Fig. 6 Screenshot of COVID-19 rumor detection system tool

5 Conclusion and Scope for Future Research
This article concludes that, on our dataset, the deep learning schema achieves a marginally better outcome in terms of F1-score than the other conventional machine learning paradigms used. Since the results of this rumor detection effort in the context of COVID-19-related facts are relatively encouraging, the framework may be beneficial for further, analogous technological research. In several explorations of COVID-19-related rumor and fake news detection, researchers have discussed some significant mechanisms over real-time datasets. Furthermore, we intend to work with an extended dataset, several of the latest data classification strategies, and hybrid classifiers.
References
1. Bratu, S.: The fake news sociology of COVID-19 pandemic fear: dangerously inaccurate beliefs, emotional contagion, and conspiracy ideation. Linguist. Philos. Invest. 19, 128–134 (2020)
2. Hua, J., Shaw, R.: Corona virus (Covid-19) “infodemic” and emerging issues through a data lens: the case of China. Int. J. Environ. Res. Public Health 17(7), 2309 (2020)
3. Shuja, J., Alanazi, E., Alasmary, W., Alashaikh, A.: COVID-19 datasets: a survey and future challenges. medRxiv (2020)
4. Meel, P., Agrawal, H., Agrawal, M., Goyal, A.: Analysing tweets for text and image features to detect fake news using ensemble learning. In: International Conference on Intelligent Computing and Smart Communication 2019, pp. 479–488. Springer, Singapore (2020)
5. Agarwal, A., Mittal, M., Pathak, A., Goyal, L.M.: Fake news detection using a blend of neural networks: an application of deep learning. SN Comput. Sci. 1, 1–9 (2020)
6. Anushaya Prabha, T., Aisuwariya, T., Vamsee Krishna Kiran, M., Vasudevan, S.K.: An innovative and implementable approach for online fake news detection through machine learning. J. Comput. Theor. Nanosci. 17(1), 130–135 (2020)
7. Kaliyar, R.K., Goswami, A., Narang, P., Sinha, S.: FNDNet – a deep convolutional neural network for fake news detection. Cogn. Syst. Res. 61, 32–44 (2020)
8. Wang, Y., Yang, W., Ma, F., Xu, J., Zhong, B., Deng, Q., Gao, J.: Weak supervision for fake news detection via reinforcement learning. Proc. AAAI Conf. Artif. Intell. 34(1), 516–523 (2020)
9. de Beer, D., Matthee, M.: Approaches to identify fake news: a systematic literature review. In: International Conference on Integrated Science, pp. 13–22. Springer, Cham (2020)
10. Roy, S., Biswas, M., De, D.: iMusic: a session-sensitive clustered classical music recommender system using contextual representation learning. Multimedia Tools Appl. (2020). https://doi.org/10.1007/s11042-020-09126-8
11. Zhang, J., Dong, B., Philip, S.Y.: Fakedetector: effective fake news detection with deep diffusive neural network. In: 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1826–1829. IEEE (2020)
12. Kaliyar, R.K., Goswami, A., Narang, P.: DeepFakE: improving fake news detection using tensor decomposition-based deep neural network. J. Supercomput. (2020)
13. Deepak, S., Chitturi, B.: Deep neural approach to fake-news identification. Procedia Comput. Sci. 167, 2236–2243 (2020)
14. Roy, S., Chakrabarty, S., De, D.: Rhythmic finger-striking: a memetic computing inspired musical rhythm improvisation strategy. In: International Conference on Emerging Technologies in Data Mining and Information Security (IEMIS 2020), July 2020
15. Chamola, V., Hassija, V., Gupta, V., Guizani, M.: A comprehensive review of the COVID-19 pandemic and the role of IoT, drones, AI, blockchain, and 5G in managing its impact. IEEE Access 8, 90225–90265 (2020)
16. Kaur, S., Kumar, P., Kumaraguru, P.: Automating fake news detection system using multi-level voting model. Soft. Comput. 24(12), 9049–9069 (2020)