Transactions on Computer Systems and Networks
Gyanendra K. Verma Badal Soni Salah Bourennane Alexandre C. B. Ramos Editors
Data Science Theory, Algorithms, and Applications
Transactions on Computer Systems and Networks Series Editor Amlan Chakrabarti, Director and Professor, A. K. Choudhury School of Information Technology, Kolkata, West Bengal, India
Transactions on Computer Systems and Networks is a unique series that aims to capture advances in the evolution of computer hardware and software systems and progress in computer networks. Computing systems today span from miniature IoT nodes and embedded computing systems to large-scale cloud infrastructures, which necessitates developing systems architecture, storage infrastructure, and process management that work at various scales. Present-day networking technologies provide pervasive global coverage on a massive scale and enable a multitude of transformative technologies. The new landscape of computing comprises self-aware autonomous systems, which are built upon a software-hardware collaborative framework. These systems are designed to execute critical and non-critical tasks involving a variety of processing resources like multi-core CPUs, reconfigurable hardware, GPUs, and TPUs, which are managed through virtualisation, real-time process management, and fault-tolerance. While AI, machine learning, and deep learning tasks are predominantly increasing in the application space, computing systems research aims toward efficient means of data processing, memory management, real-time task scheduling, and scalable, secured, and energy-aware computing. The paradigm of computer networks also extends its support to this evolving application scenario through various advanced protocols, architectures, and services. This series aims to present leading works on advances in theory, design, behaviour and applications in computing systems and networks. The Series accepts research monographs, introductory and advanced textbooks, professional books, reference works, and select conference proceedings.
More information about this series at http://www.springer.com/series/16657
Gyanendra K. Verma · Badal Soni · Salah Bourennane · Alexandre C. B. Ramos Editors
Data Science Theory, Algorithms, and Applications
Editors Gyanendra K. Verma Department of Computer Engineering National Institute of Technology Kurukshetra Kurukshetra, India
Badal Soni Department of Computer Science and Engineering National Institute of Technology Silchar Silchar, India
Salah Bourennane Multidimensional Signal Processing Group Ecole Centrale Marseille Marseille, France
Alexandre C. B. Ramos Mathematics and Computing Institute Universidade Federal de Itajuba Itajuba, Brazil
ISSN 2730-7484 ISSN 2730-7492 (electronic) Transactions on Computer Systems and Networks ISBN 978-981-16-1680-8 ISBN 978-981-16-1681-5 (eBook) https://doi.org/10.1007/978-981-16-1681-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
We dedicate this book to all those who directly or indirectly contributed to the accomplishment of this work.
Preface
Digital information influences our everyday lives in various ways. Data sciences provides us with tools and techniques to comprehend and analyze data. Data sciences is one of the fastest-growing multidisciplinary fields; it deals with data acquisition, analysis, integration, modeling, visualization, and interaction for large amounts of data. Currently, each sector of the economy produces a huge amount of data in an unstructured format. A huge amount of data is becoming available from various sources like web services, databases, and online repositories; however, the major challenge is to preprocess these data and extract meaningful intelligence from them. Artificial intelligence plays a pivotal role in this analysis: with the evolution of artificial intelligence, it has become possible to analyze and interpret information in real time. Deep learning models are widely used in the analysis of big data for various applications, particularly in the area of image processing. This book aims to develop an understanding of data sciences theory and concepts, and of data modeling using various machine learning algorithms, for a wide range of real-world applications. In addition to providing basic principles of data processing, the book teaches standard models and algorithms for data analysis. Kurukshetra, India Silchar, India Marseille, France Itajuba, Brazil October 2020
Dr. Gyanendra K. Verma Dr. Badal Soni Prof. Dr. h. c. Salah Bourennane Prof. Dr. h. c. Alexandre C. B. Ramos
Acknowledgements
We are thankful to all the contributors who have generously given time and material to this book. We would also like to extend our appreciation to those who have continuously inspired us. We are extremely thankful to the reviewers, who have carried out the most important and critical part of any technical book: the evaluation of each of the submitted chapters assigned to them. We also express our sincere gratitude toward our publication partner, Springer, especially Ms. Kamiya Khatter and the Springer book production team, for continuous support and guidance in completing this book project. Thank you.
Introduction
Objective of the Book This book aims to provide readers with an understanding of data sciences, their architectures, and their applications in various domains. Data sciences are helpful in the extraction of meaningful information from unstructured data. The major aspects of data sciences are data modeling, analysis, and visualization. This book covers major models, algorithms, and prominent applications of data sciences to solve real-world problems. By the end of the book, we hope that our readers will have an understanding of concepts, different approaches, and models, and familiarity with the implementation of data sciences tools and libraries. Artificial intelligence has had a major impact on research and has raised the performance bar substantially in many of the standard evaluations. Moreover, new challenges can be tackled using artificial intelligence in the decision-making process. However, it is very difficult to comprehend, let alone guide, the process of learning in deep learning. There is an air of uncertainty about exactly what and how these models learn, and this book is an effort to fill those gaps.
Target Audience The book is divided into three parts comprising a total of 26 chapters. Parts, distinct groups of chapters, as well as single chapters are meant to be fairly independent and self-contained, and the reader is encouraged to study only relevant parts or chapters. This book is intended for a broad readership. The first part provides the theory and concepts of learning; it therefore addresses readers wishing to gain an overview of learning frameworks. Subsequent parts delve deeper into research topics and are aimed at the more advanced reader, in particular graduate and PhD students as well as junior researchers. The target audience of this book will be academicians, professionals, researchers, and students at engineering and medical institutions working in the areas of data sciences and artificial intelligence.
Book Organization This book is organized into three parts. Part I includes eight chapters that deal with the theory and concepts of data sciences, Part II deals with data design and analysis, and finally, Part III is based on the major applications of data sciences. This book contains invited as well as contributed chapters.
Part I Theory and Concepts The first part of the book exclusively focuses on the fundamentals of data sciences. The chapters under this part cover active learning and ensemble learning concepts along with language processing concepts. Chapter 1 describes a general active learning framework that has been proposed for network intrusion detection. The authors have experimented with different learning and sampling strategies on the KDD Cup 1999 dataset. The results show that complex learning models outperform relatively simple learning models, and that uncertainty and entropy sampling outperform random sampling. Chapter 2 describes a bagging classifier, an ensemble learning approach for student outcome prediction that employs base and meta-classifiers. Additionally, performance analysis of various classifiers has been carried out with an oversampling approach using SMOTE and an undersampling approach using spread subsampling. Chapter 3 presents patient medical data security via bi-chaos bi-order Fourier transform. In this work, the authors have used three techniques for medical or clinical image encryption, i.e., FRFT, logistic map, and Arnold map. The results suggest that the complex hybrid combination makes the system more robust and secure against different cryptographic attacks than these methods alone. In Chap. 4, word-sense disambiguation (WSD) for the Nepali language is performed using variants of the Lesk algorithm such as direct overlap, frequency-based scoring, and frequency-based scoring after dropping of the target word. Performance analysis based on the elimination of stop words, the number of senses, and context window size has been carried out. Chapter 5 presents a performance analysis of different branch prediction schemes incorporated in the ARM big.LITTLE architecture. The comparison of these branch predictors has been carried out based on conditional branch mispredictions, IPC, execution time, power consumption, etc. The results show that TAGE-LSC and perceptron achieve the highest accuracy among the simulated predictors. Chapter 6 presents global feature representation using a new architecture, SEANet, built over SENet. An aggregate block implemented after the SE block aids in global feature representation and reduces redundancies. SEANet has been found to outperform ResNet and SENet on two benchmark datasets, CIFAR-10 and CIFAR-100.
The subsequent chapters in this part are devoted to analyzing images. Chapter 7 presents improved super-resolution of a single image through external dictionary formation for training and a neighbor embedding technique for reconstruction. The dictionary formation is carried out so as to contain maximum structural variation with a minimal number of images. The reconstruction stage is carried out by the selection of overlapping pixels of a particular location. In Chap. 8, single-step image super-resolution and denoising of SAR images are proposed using a generative adversarial network (GAN) model. The model shows improvement in VGG16 loss as it preserves relevant features and reduces noise from the image. The quality of results produced by the proposed approach is compared with the two-step upscaling and denoising model and the baseline method.
Part II Models and Algorithms The second part of the book focuses on the models and algorithms for data sciences. Deep learning models, discrete wavelet transforms, principal component analysis, SenLDA, a color-based classification model, and the gray-level co-occurrence matrix (GLCM) are used to model real-world problems. Chapter 9 explores a deep learning technique based on OCR-SSD for car detection and tracking in images. It also presents a solution for real-time license plate recognition on a quadcopter in autonomous flight. Chapter 10 describes an algorithm for gender identification based on biometric palm print using binarized statistical image features. The filter size is varied with a fixed length of 8 bits to capture information from the ROI palm prints. The proposed method outperforms baseline approaches with an accuracy of 98%. Chapter 11 describes a Sudoku puzzle recognition and solution study. Puzzle recognition is carried out using a deep belief network for feature extraction. The puzzle solution is given by serialization of two approaches, namely parallel rule-based methods and ant colony optimization. Chapter 12 describes a novel profile generation approach for human action recognition. A DWT- and PC-based method is proposed to detect energy variation for feature extraction in video frames. The proposed method is applied to various existing classifiers and tested on the Weizmann dataset. The results outperform baselines like the MACH filter. The subsequent chapters in this part are devoted to more research-oriented models and algorithms. Chapter 13 presents a novel filter and color-based classification model to assess the ripeness of tobacco leaves for harvesting. The ripeness detection is performed by a spot detection approach using a first-order edge extractor and second-order high-pass filtering. A simple thresholding classifier is then proposed for the classification task. Chapter 14 proposes an automatic deep learning framework for breast cancer detection and classification from hematoxylin and eosin (H&E)-stained breast histopathology images with 80.4% accuracy, supplementing the analysis of medical professionals to prevent false negatives. Experimental results show that the proposed architecture provides better classification results than benchmark methods. Chapter 15 specifies a technique for indoor flying
of autonomous drones using image processing and neural networks. The route for the drone is determined through the location of the detected object in the captured image. The first detection technique relies on image-based filters, while the second technique focuses on the use of CNN to replicate a real environment. Chapter 16 describes the use of a gray-level co-occurrence matrix (GLCM) for feature detection in SAR images. The features detected in SAR images by GLCM find much application as it identifies various orientations such as water, urban areas, and forests and any changes in these areas.
Part III Applications and Issues The third part of the book covers the major applications of data sciences in various fields like biometrics, robotics, medical imaging, affective computing, security, etc. Chapter 17 deals with signature verification using the Galois field operator. The features are obtained by building a normalized cumulative histogram. Offline signature verification is also implemented using the K-NN classifier. Chapter 18 details a face recognition approach in videos using 3D residual networks, comparing the accuracy for different depths of residual networks. A CVBL video dataset has been developed for the purpose of experimentation. The proposed approach achieves the highest accuracy of 97% with DenseNets on the CVBL dataset. Chapter 19 presents fog computing-based seed sowing robots for agriculture. Microcontroller units (MCU) with auto firmware communicate with the fog layer through a smart edge node. The robot employs approaches such as simultaneous localization and mapping (SLAM), other path-finding algorithms, and IR sensors for obstacle detection. ML techniques and FastAi aid in the classification of the dataset. Chapter 20 describes an automatic tumor identification approach to classify MRI images of the brain. An advanced CNN model consisting of convolution and dense layers is employed to correctly classify the brain tumors. The results exhibit the proposed model's effectiveness in brain tumor image classification. Chapter 21 presents a vision-based sensor mechanism for lane detection in intelligent vehicle systems (IVS). The lane markings on a structured road are detected using image processing techniques such as edge detection and Hough space transformation on KITTI data. Qualitative and quantitative analysis shows satisfactory results. Chapter 22 proposes an implementation of a deep convolutional neural network (DCNN) for micro-expression recognition, as DCNN has established its presence in different image processing applications. CASME-II, a benchmark database for micro-expression recognition, has been used for experimentation. The results reveal that the CNN-based models give correct results of 90% and 88% for four and six classes, respectively, which is beyond the regular methods. Chapter 23 proposes a semantic classification model that employs modern embedding and aggregating methods, which considerably enhance feature discriminability and boost the performance of CNNs. The performance of this framework is exhaustively tested across a wide dataset. The intuitive and robust systems that use these techniques play a vital role in various sectors like security, military,
automation, industries, medical, and robotics. In Chap. 24, a countermeasure for voice conversion spoofing attacks has been proposed using source separation based on non-negative matrix factorization and a CNN-based binary classifier. The voice conversion spoofed speech is modeled as a combination of the target estimate and the artifact used in the voice conversion. The proposed method shows a decrease in the false alarm rate of automatic speaker verification. Chapter 25 proposes facial emotion recognition and prediction that can serve as a useful monitoring mechanism in various fields. The first stage utilizes CNN for facial emotion detection from real-time video frames and assigns a probability to the various emotional states. The second stage uses a time-series analysis that predicts future facial emotions from the output of the first stage. The final Chap. 26 describes a methodology for the identification and analysis of cohorts of heart failure patients using NLP tasks. The proposed approach uses various NLP processes implemented in the cTAKES tool to identify patients of a particular cohort group. The proposed system has been found to outperform the manual extraction process in terms of accuracy, precision, recall, and F-measure scores. Kurukshetra, India Silchar, India Marseille, France Itajuba, Brazil October 2020
Dr. Gyanendra K. Verma Dr. Badal Soni Prof. Dr. h. c. Salah Bourennane Prof. Dr. h. c. Alexandre C. B. Ramos
Contents

Part I Theory and Concepts

1 Active Learning for Network Intrusion Detection . . . . . 3
Amir Ziai
2 Educational Data Mining Using Base (Individual) and Ensemble Learning Approaches to Predict the Performance of Students . . . . . 15
Mudasir Ashraf, Yass Khudheir Salal, and S. M. Abdullaev
3 Patient's Medical Data Security via Bi Chaos Bi Order Fourier Transform . . . . . 25
Bharti Ahuja and Rajesh Doriya
4 Nepali Word-Sense Disambiguation Using Variants of Simplified Lesk Measure . . . . . 41
Satyendr Singh, Renish Rauniyar, and Murali Manohar
5 Performance Analysis of Big.LITTLE System with Various Branch Prediction Schemes . . . . . 59
Froila V. Rodrigues and Nitesh B. Guinde
6 Global Feature Representation Using Squeeze, Excite, and Aggregation Networks (SEANet) . . . . . 73
Akhilesh Pandey, Darshan Gera, D. Gunasekar, Karam Rai, and S. Balasubramanian
7 Improved Single Image Super-resolution Based on Compact Dictionary Formation and Neighbor Embedding Reconstruction . . . . . 89
Garima Pandey and Umesh Ghanekar
8 An End-to-End Framework for Image Super Resolution and Denoising of SAR Images . . . . . 99
Ashutosh Pandey, Jatav Ashutosh Kumar, and Chiranjoy Chattopadhyay

Part II Models and Algorithms

9 Analysis and Deployment of an OCR—SSD Deep Learning Technique for Real-Time Active Car Tracking and Positioning on a Quadrotor . . . . . 121
Luiz G. M. Pinto, Wander M. Martins, Alexandre C. B. Ramos, and Tales C. Pimenta
10 Palmprint Biometric Data Analysis for Gender Classification Using Binarized Statistical Image Feature Set . . . . . 157
Shivanand Gornale, Abhijit Patil, and Mallikarjun Hangarge
11 Recognition of Sudoku with Deep Belief Network and Solving with Serialisation of Parallel Rule-Based Methods and Ant Colony Optimisation . . . . . 169
Satyasangram Sahoo, B. Prem Kumar, and R. Lakshmi
12 Novel DWT and PC-Based Profile Generation Method for Human Action Recognition . . . . . 185
Tanish Zaveri, Payal Prajapati, and Rishabh Shah
13 Ripeness Evaluation of Tobacco Leaves for Automatic Harvesting: An Approach Based on Combination of Filters and Color Models . . . . . 197
P. B. Mallikarjuna, D. S. Guru, and C. Shadaksharaiah
14 Automatic Deep Learning Framework for Breast Cancer Detection and Classification from H&E Stained Breast Histopathology Images . . . . . 215
Anmol Verma, Asish Panda, Amit Kumar Chanchal, Shyam Lal, and B. S. Raghavendra
15 An Analysis of Use of Image Processing and Neural Networks for Window Crossing in an Autonomous Drone . . . . . 229
L. Pedro de Brito, Wander M. Martins, Alexandre C. B. Ramos, and Tales C. Pimenta
16 Analysis of Features in SAR Imagery Using GLCM Segmentation Algorithm . . . . . 253
Jasperine James, Arunkumar Heddallikar, Pranali Choudhari, and Smita Chopde

Part III Applications and Issues

17 Offline Signature Verification Using Galois Field-Based Texture Representation . . . . . 269
S. Shivashankar, Medha Kudari, and S. Prakash Hiremath
18 Face Recognition Using 3D CNNs . . . . . 279
Nayaneesh Kumar Mishra and Satish Kumar Singh
19 Fog Computing-Based Seed Sowing Robots for Agriculture . . . . . 295
Jaykumar Lachure and Rajesh Doriya
20 An Automatic Tumor Identification Process to Classify MRI Brain Images . . . . . 315
Arpita Ghosh and Badal Soni
21 Lane Detection for Intelligent Vehicle System Using Image Processing Techniques . . . . . 329
Deepak Kumar Dewangan and Satya Prakash Sahu
22 An Improved DCNN Based Facial Micro-expression Recognition System . . . . . 349
Divya Garg and Gyanendra K. Verma
23 Selective Deep Convolutional Framework for Vehicle Detection in Aerial Imagery . . . . . 365
Kaustubh V. Sakhare and Vibha Vyas
24 Exploring Source Separation as a Countermeasure for Voice Conversion Spoofing Attack . . . . . 383
R. Hemavathi, S. Thoshith, and R. Kumaraswamy
25 Statistical Prediction of Facial Emotions Using Mini Xception CNN and Time Series Analysis . . . . . 397
Basudeba Behera, Amit Prakash, Ujjwal Gupta, Vijay Bhaksar Semwal, and Arun Chauhan
26 Identification of Congestive Heart Failure Patients Through Natural Language Processing . . . . . 411
Niyati Baliyan, Aakriti Johar, and Priti Bhardwaj

Glossary . . . . . 435
Editors and Contributors
About the Editors

Gyanendra K. Verma is currently working as Assistant Professor at the Department of Computer Engineering, National Institute of Technology Kurukshetra, India. He completed his B.Tech. at Harcourt Butler Technical University (formerly HBTI) Kanpur, India, and his M.Tech. and Ph.D. at the Indian Institute of Information Technology Allahabad (IIITA), India. All his degrees are in Information Technology. He has teaching and research experience of over six years in the area of Computer Science and Information Technology, with a special interest in image processing, speech and language processing, and human-computer interaction. His research work on affective computing and the application of wavelet transform to medical imaging and computer vision problems has been cited extensively. He is a member of various professional bodies like IEEE, ACM, IAENG & IACSIT.

Badal Soni is currently working as Assistant Professor at the Department of Computer Engineering, National Institute of Technology Silchar, India. He completed his B.Tech. at Rajiv Gandhi Technical University (formerly RGPV) Bhopal, India, and his M.Tech. at the Indian Institute of Information Technology, Design, and Manufacturing (IIITDM), Jabalpur, India. He received his Ph.D. from the National Institute of Technology Silchar, India. All his degrees are in Computer Science and Engineering. He has teaching and research experience of over seven years in the area of computer science and information technology, with a special interest in computer graphics, image processing, and speech and language processing. He has published more than 35 papers in refereed journals, contributed books, and international conference proceedings. He is a Senior Member of IEEE and a professional member of various bodies like ACM, IAENG & IACSIT.

Salah Bourennane received his Ph.D. degree from the Institut National Polytechnique de Grenoble, France. Currently, he is a Full Professor at the Ecole Centrale Marseille, France. He is the head of the Multidimensional Signal Processing Group of the Fresnel Institute. His research interests are in statistical signal processing, remote sensing, telecommunications, array processing, image processing, multidimensional signal processing, and performance analysis. He has published several papers in reputed international journals.

Alexandre C. B. Ramos is Associate Professor at the Mathematics and Computing Institute—IMC of the Federal University of Itajubá—UNIFEI (MG). His interest areas are multimedia, artificial intelligence, human-computer interfaces, computer-based training, and e-learning. Dr. Ramos has over 18 years of research and teaching experience. He did his post-doctorate at the Ecole Nationale de l'Aviation Civile—ENAC (France, 2013–2014), and his Ph.D. and Master in Electronic and Computer Engineering at the Instituto Tecnológico de Aeronáutica—ITA (1996 and 1992). He completed his graduation in Electronic Engineering at the University of Vale do Paraíba—UNIVAP (1985) and a sandwich doctorate at the Laboratoire d'Analyse et d'Architecture des Systèmes—LAAS (France, 1995–1996). He has professional experience in the areas of process automation with an emphasis on chemical and petrochemical processes (Petrobras 1983–1995) and computer science with an emphasis on information systems (ITA/Motorola 1997–2001), acting mainly on the following themes: development of training simulators with the support of intelligent tutoring systems, hybrid intelligent systems and computer-based training, neural networks for trajectory control of unmanned vehicles, pattern matching, and digital image processing.
Contributors S. M. Abdullaev Department of System Programming, South Ural State University, Chelyabinsk, Russia Bharti Ahuja Department of Information Technology, National Institute of Technology, Raipur, India Mudasir Ashraf School of CS and IT, Jain University, Bangalore, India Jatav Ashutosh Kumar Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan, India S. Balasubramanian Department of Mathematics and Computer Science (DMACS), Sri Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur District, India Niyati Baliyan Department of Information Technology, IGDTUW, Delhi, India Basudeba Behera Department of Electronics and Communication Engineering, NIT Jamshedpur, Jamshedpur, Jharkhand, India Priti Bhardwaj Department of Information Technology, IGDTUW, Delhi, India
Chiranjoy Chattopadhyay Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan, India Arun Chauhan Department of Computer Science Engineering, IIIT Dharwad, Dharwad, Karnataka, India Smita Chopde FCRIT, Mumbai, India Pranali Choudhari FCRIT, Mumbai, India L. Pedro de Brito Federal University of Itajuba, Institute of Mathematics and Computing, Itajubá, Brazil Deepak Kumar Dewangan Department of Information Technology, National Institute of Technology, Raipur, Chhattisgarh, India Rajesh Doriya Department of Information Technology, National Institute of Technology, Raipur, Chhattisgarh, India Divya Garg Department of Computer Engineering, National Institute of Technology Kurukshetra, Kurukshetra, India Darshan Gera DMACS, SSSIHL, Bengaluru, India Umesh Ghanekar National Institute of Technology Kurukshetra, Kurukshetra, India Arpita Ghosh National Institute of Technology Silchar, Silchar, Assam, India Shivanand Gornale Department of Computer Science, Rani Channamma University, Belagavi, Karnataka, India Nitesh B. Guinde Goa College of Engineering, Ponda-Goa, India D. Gunasekar Department of Mathematics and Computer Science (DMACS), Sri Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur District, India Ujjwal Gupta Department of Electronics and Communication Engineering, NIT Jamshedpur, Jamshedpur, Jharkhand, India D. S. Guru University of Mysore, Mysore, Karnataka, India Mallikarjun Hangarge Department of Computer Science, Karnatak College, Bidar, Karnataka, India Arunkumar Heddallikar RADAR Division, Sameer, IIT Bombay, Mumbai, India R. Hemavathi Department of Electronics and Communication Engineering, Siddaganga Institute of Technology (Affiliated to Visveswaraya Technological University, Belagavi), Tumakuru, India S. Prakash Hiremath Department of Computer Science, KLE Technological University, BVBCET, Hubballi, Karnataka, India
Jasperine James FCRIT, Mumbai, India Aakriti Johar Department of Information Technology, IGDTUW, Delhi, India Medha Kudari Department of Computer Science, Karnatak University, Dharwad, India B. Prem Kumar Pondicherry Central University, Pondicherry, India Amit Kumar Chanchal National Institute of Technology Karnataka, Mangalore, Karnataka, India R. Kumaraswamy Department of Electronics and Communication Engineering, Siddaganga Institute of Technology (Affiliated to Visveswaraya Technological University, Belagavi), Tumakuru, India Jaykumar Lachure National Institute of Technology Raipur, Raipur, Chhattisgarh, India R. Lakshmi Pondicherry Central University, Pondicherry, India Shyam Lal National Institute of Technology Karnataka, Mangalore, Karnataka, India P. B. Mallikarjuna JSS Academy of Technical Education, Bengaluru, Karnataka, India Murali Manohar Gramener, Bangalore, India Wander M. Martins Institute of Systems Engineering and Information Technology, Itajuba, MG, Brazil
Nayaneesh Kumar Mishra Computer Vision and Biometric Lab, IIIT Allahabad, Allahabad, India Asish Panda National Institute of Technology Karnataka, Mangalore, Karnataka, India Akhilesh Pandey Department of Mathematics and Computer Science (DMACS), Sri Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur District, India Ashutosh Pandey Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan, India Garima Pandey National Institute of Technology Kurukshetra, Kurukshetra, India Abhijit Patil Department of Computer Science, Rani Channamma University, Belagavi, Karnataka, India Tales C. Pimenta Institute of Systems Engineering and Information Technology, Itajuba, MG, Brazil
Luiz G. M. Pinto Institute of Mathematics and Computing, Federal University of Itajuba, Itajuba, MG, Brazil Payal Prajapati Government Engineering College, Patna, India Amit Prakash Department of Electronics and Communication Engineering, NIT Jamshedpur, Jamshedpur, Jharkhand, India B. S. Raghavendra National Institute of Technology Karnataka, Mangalore, Karnataka, India Karam Rai Department of Mathematics and Computer Science (DMACS), Sri Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur District, India Alexandre C. B. Ramos Institute of Mathematics and Computing, Federal University of Itajuba, Itajuba, MG, Brazil Renish Rauniyar Tredence Analytics, Bangalore, India Froila V. Rodrigues Dnyanprassarak Mandal’s College and Research Centre, Assagao-Goa, India Satyasangram Sahoo Pondicherry Central University, Pondicherry, India Satya Prakash Sahu Department of Information Technology, National Institute of Technology, Raipur, Chhattisgarh, India Kaustubh V. Sakhare Department of Electronics and Telecommunication, College of Engineering, Pune, India Yass Khudheir Salal Department of System Programming, South Ural State University, Chelyabinsk, Russia Vijay Bhaksar Semwal Department of Computer Science Engineering, MANIT, Bhopal, Madhya Pradesh, India C. Shadaksharaiah Bapuji Institute of Engineering and Technology, Davangere, Karnataka, India Rishabh Shah Nirma University, Ahmedabad, India; Government Engineering College, Patna, India S. Shivashankar Department of Computer Science, Karnatak University, Dharwad, India Satish Kumar Singh Computer Vision and Biometric Lab, IIIT Allahabad, Allahabad, India Satyendr Singh BML Munjal University, Gurugram, Haryana, India Badal Soni National Institute of Technology Silchar, Silchar, Assam, India
S. Thoshith Department of Electronics and Communication Engineering, Siddaganga Institute of Technology (Affiliated to Visveswaraya Technological University, Belagavi), Tumakuru, India Anmol Verma National Institute of Technology Karnataka, Mangalore, Karnataka, India Gyanendra K. Verma Department of Computer Engineering, National Institute of Technology Kurukshetra, Kurukshetra, India Vibha Vyas Department of Electronics and Telecommunication, College of Engineering, Pune, India Tanish Zaveri Nirma University, Ahmedabad, India Amir Ziai Stanford University, Stanford, CA, USA
Acronyms
BHC     Bayesian Hierarchical Clustering
CNN     Convolution Neural Network
DCIS    Ductal Carcinoma In Situ
HE      Hematoxylin and Eosin
IDC     Invasive Ductal Carcinoma
IRRCNN  Inception Recurrent Residual Convolutional Neural Network
SVM     Support Vector Machine
VGG16   Visual Geometry Group—16
WSI     Whole Slide Image
Part I
Theory and Concepts
Chapter 1
Active Learning for Network Intrusion Detection Amir Ziai
Abstract Network operators are generally aware of common attack vectors that they defend against. For most networks, the vast majority of traffic is legitimate. However, new attack vectors are continually designed and attempted by bad actors, which bypass detection and go unnoticed due to low volume. One strategy for finding such activity is to look for anomalous behavior. Investigating anomalous behavior requires significant time and resources. Collecting a large number of labeled examples for training supervised models is both prohibitively expensive and subject to obsolescence as new attacks surface. A purely unsupervised methodology is ideal; however, research has shown that even a very small number of labeled examples can significantly improve the quality of anomaly detection. A methodology that minimizes the number of required labels while maximizing the quality of detection is desirable. False positives in this context result in wasted effort or blockage of legitimate traffic, and false negatives translate to undetected attacks. We propose a general active learning framework and experiment with different choices of learners and sampling strategies.
1.1 Introduction Detecting anomalous activity is an active area of research in the security space. Tuor et al. use an online anomaly detection method based on deep learning to detect anomalies. This methodology is compared to traditional anomaly detection algorithms such as isolation forest (IF) and a principal component analysis (PCA)-based approach and found to be superior. However, no comparison is provided with semi-supervised or active learning approaches which leverage a small amount of labeled data (Tuor et al. 2017). The authors later propose another unsupervised methodology leveraging recurrent neural network (RNN) to ingest the log-level event data as opposed to aggregated data (Tuor et al. 2018). Pimentel et al. propose a generalized framework for unsupervised anomaly detection. They argue that purely unsupervised anomaly A. Ziai (B) Stanford University, 450 Serra Mall, Stanford, CA 94305, USA e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_1
Table 1.1 Prevalence and number of attacks for each of the 10 attack types

Label | Attacks | Prevalence | Prevalence (overall) | Records
smurf. | 280,790 | 0.742697 | 0.568377 | 378,068
neptune. | 107,201 | 0.524264 | 0.216997 | 204,479
back. | 2203 | 0.022145 | 0.004459 | 99,481
satan. | 1589 | 0.016072 | 0.003216 | 98,867
ipsweep. | 1247 | 0.012657 | 0.002524 | 98,525
portsweep. | 1040 | 0.010578 | 0.002105 | 98,318
warezclient. | 1020 | 0.010377 | 0.002065 | 98,298
teardrop. | 979 | 0.009964 | 0.001982 | 98,257
pod. | 264 | 0.002707 | 0.000534 | 97,542
nmap. | 231 | 0.002369 | 0.000468 | 97,509
detection is undecidable without a prior on the distribution of anomalies, and learned representations have simpler statistical structure which translate to better generalization. They propose an active learning approach with a logistic regression classifier as the learner (Pimentel et al. 2018). Veeramachaneni et al. propose a human-in-the-loop machine learning system that provides both insights to the analyst and addressing large data processing concerns. This system uses unsupervised methods to surface anomalous data points for the analyst to label and a combination of supervised and unsupervised methods to predict the attacks (Veeramachaneni et al. 2016). In this work, we also propose an analyst-in-the-loop active learning approach. However, our approach is not opinionated about the sampling strategy or the learner used in active learning. We will explore trade-offs in that design space.
1.2 Dataset We have used the KDD Cup 1999 dataset which consists of about 500K records representing network connections in a military environment. Each record is either “normal” or one of 22 different types of intrusion such as smurf, IP sweep, and teardrop. Out of these 22 categories, only 10 have at least 100 occurrences, and the rest were removed. Each record has 41 features including duration, protocol, and bytes exchanged. Prevalence of attack types varies substantially with smurf being the most pervasive at about 50% of total records and Nmap at less than 0.01% of total records (Table 1.1).
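For readers who want to reproduce this setup, scikit-learn ships a fetcher for this dataset. The snippet below is a minimal illustration, assuming a recent scikit-learn where the 10% subset corresponds to the roughly 494K-record version used here:

```python
from sklearn.datasets import fetch_kddcup99

# percent10=True loads the widely used ~494K-record subset of KDD Cup 1999
data = fetch_kddcup99(percent10=True, as_frame=True)
df = data.frame                      # 41 features plus the attack label
print(df["labels"].value_counts())   # labels are bytes, e.g. b'smurf.'
```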
Table 1.2 Snippet of input data

Duration | Protocol_type | Service | Flag | src_bytes | dst_bytes | Land | Wrong_fragment | Urgent | Hot | ... | dst_host_srv_count
0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9
0 | tcp | http | SF | 239 | 4860 | 0 | 0 | 0 | 0 | ... | 19
0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29
1.2.1 Input and Output Example Table 1.2 depicts three rows of data (excluding the label): The objective of the detection system is to label each row as either “normal” or “anomalous.”
1.2.2 Processing Pipeline We generated 10 separate datasets consisting of normal traffic and each of the attack vectors. This way we can study the proposed approach over 10 different attack vectors with varying prevalence and ease of detection. Each dataset is then split into train, development, and test partitions with 80%, 10%, and 10% proportions. All algorithms are trained on the train set and evaluated on the development set. The winning strategy is tested on the test set to generate an unbiased estimate of generalization. Categorical features are one-hot encoded, and missing values are filled with zero.
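A hedged sketch of this pipeline follows; the variable and function names are ours, not the chapter's, and the label column matches scikit-learn's fetcher from the earlier snippet:

```python
import numpy as np
import pandas as pd

def make_attack_dataset(df, attack):
    """One dataset per attack: normal traffic plus a single attack type."""
    subset = df[df["labels"].isin([b"normal.", attack])]
    y = (subset["labels"] == attack).astype(int)
    X = pd.get_dummies(subset.drop(columns=["labels"]))  # one-hot categoricals
    return X.fillna(0), y                                # missing values -> 0

def train_dev_test_split(X, y, seed=0):
    """80/10/10 split into train, development, and test partitions."""
    idx = np.random.RandomState(seed).permutation(len(X))
    cuts = [int(0.8 * len(X)), int(0.9 * len(X))]
    return [(X.iloc[p], y.iloc[p]) for p in np.split(idx, cuts)]
```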
1.3 Approach

1.3.1 Evaluation Metric

Since labeled data is very hard to come by in this space, we have decided to treat this problem as an active learning one. Therefore, the machine learning model receives a subset of the labeled data. We will use the F1 score to capture the trade-off between precision and recall:

F1 = 2PR / (P + R)    (1.1)

where P = TP/(TP + FP), R = TP/(TP + FN), TP is the number of true positives, FP the number of false positives, and FN the number of false negatives. A model that is highly precise (does not produce false positives) is desirable, as it will not waste the analyst's time.
However, this usually comes at the cost of being overly conservative and not catching anomalous activity that is indeed an intrusion.
1.3.2 Oracle and Baseline Labeling effort is a major factor in this analysis and a dimension along which we will define the upper and lower bounds of the quality of our detection systems. A purely unsupervised approach would be ideal as there is no labeling involved. We will use an isolation forest (Zhou et al. 2004) to establish our baseline. Isolation forests (IFs) are widely, and very successfully, used for anomaly detection. An IF consists of a number of isolation trees, each of which are constructed by selecting random features to split and then selecting a random value to split on (random value in the range of continuous variables or random value for categorical variables). Only a small random subset of the data is used for growing the trees, and usually a maximum allowable depth is enforced to curb computational cost. We have used 10 trees for each IF. Intuitively, anomalous data points are easier to isolate with a smaller average number of splits and therefore tend to be closer to the root. The average closeness to the root is proportional to the anomaly score (i.e., the lower this score, the more anomalous the data point). A completely supervised approach would incur maximum cost as we will have to label every data point. We have used a random forest classifier with 10 estimators trained on the entire training dataset to establish the upper bound (i.e., Oracle). In Table 1.3, the F1 scores are reported for evaluation on the development set:
Table 1.3 Oracle and baseline for different attack types

Label | Baseline F1 | Oracle F1
smurf | 0.38 | 1.00
neptune | 0.49 | 1.00
back | 0.09 | 1.00
satan | 0.91 | 1.00
ipsweep | 0.07 | 1.00
portsweep | 0.53 | 1.00
warezclient | 0.01 | 1.00
teardrop | 0.30 | 1.00
pod | 0.00 | 1.00
nmap | 0.51 | 1.00
Mean ± standard deviation | 0.33±0.29 | 1.00±0.01
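A minimal sketch of both reference points, assuming scikit-learn's implementations (the chapter does not give its exact training code):

```python
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import f1_score

def baseline_f1(X_train, X_dev, y_dev):
    # Unsupervised baseline: 10-tree isolation forest
    iso = IsolationForest(n_estimators=10, random_state=0).fit(X_train)
    pred = (iso.predict(X_dev) == -1).astype(int)  # -1 marks anomalies
    return f1_score(y_dev, pred)

def oracle_f1(X_train, y_train, X_dev, y_dev):
    # Fully supervised Oracle: 10-tree random forest on all training labels
    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    return f1_score(y_dev, rf.fit(X_train, y_train).predict(X_dev))
```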
Fig. 1.1 Active learning scheme
1.3.3 Active Learning The proposed approach starts with training a classifier on a small random subset of the data (i.e., 1000 samples) and then continually queries a security analyst for the next record to label. There is a maximum budget of 100 queries (Fig. 1.1). This approach is highly flexible. The choice of classifier can range from logistic regression all the way up to deep networks, as well as any ensemble of those models. Moreover, the hyper-parameters for the classifier can be tuned on every round of training to improve the quality of predictions. The sampling strategy can range from simply picking random records to using classifier uncertainty or other elaborate schemes. Once a record is labeled, it is removed from the pool of unlabeled data and placed into the labeled record database. We are assuming that labels are trustworthy, which may not necessarily be true. In other words, the analyst might make a mistake in labeling, or there may be low consensus among analysts around labeling. In the presence of those issues, we would need to extend this approach to query multiple analysts and to build the consensus of labels into the framework.
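The loop in Fig. 1.1 is straightforward to prototype. The sketch below is our own minimal illustration with a random forest learner and uncertainty sampling; the function and parameter names (oracle_label, n_seed, budget) are assumptions, and it assumes the initial random sample contains both classes so that predict_proba yields two columns:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_learning(X, oracle_label, n_seed=1000, budget=100, seed=0):
    """X: NumPy feature matrix; oracle_label(i) plays the analyst."""
    rng = np.random.RandomState(seed)
    labeled = {int(i): oracle_label(int(i))
               for i in rng.choice(len(X), n_seed, replace=False)}
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    for _ in range(budget):
        idx = list(labeled)
        model.fit(X[idx], [labeled[i] for i in idx])       # retrain each round
        pool = [i for i in range(len(X)) if i not in labeled]
        probs = model.predict_proba(X[pool])[:, 1]
        query = pool[int(np.argmin(np.abs(probs - 0.5)))]  # most uncertain point
        labeled[query] = oracle_label(query)               # ask the analyst
    return model
```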
1.4 Experiments 1.4.1 Learners and Sampling Strategies We used a logistic regression (LR) classifier with L2 penalty as well as a random forest (RF) classifier with 10 estimators, Gini impurity for splitting criteria, and unlimited depth for our choice of learners. We also chose three sampling strategies. First is a random strategy that randomly selects a data point from the unlabeled pool. The second option is uncertainty sampling, which scores the entire database of unlabeled data and then selects the data point with the highest uncertainty. The third option is entropy sampling, which calculates the entropy over the positive and negative
Table 1.4 Effects of learner and sampling strategy on detection quality and latency

Learner | Sampling strategy | F1 initial | F1 after 10 | F1 after 50 | F1 after 100 | Train time (s) | Query time (s)
LR | Random | 0.76±0.32 | 0.76±0.32 | 0.79±0.31 | 0.86±0.17 | 0.05±0.01 | 0.09±0.08
LR | Uncertainty | 0.76±0.32 | 0.83±0.26 | 0.85±0.31 | 0.88±0.20 | 0.05±0.01 | 0.10±0.08
LR | Entropy | 0.76±0.32 | 0.83±0.26 | 0.85±0.31 | 0.88±0.20 | 0.05±0.01 | 0.08±0.08
RF | Random | 0.90±0.14 | 0.91±0.12 | 0.84±0.31 | 0.95±0.07 | 0.11±0.00 | 0.09±0.07
RF | Uncertainty | 0.90±0.14 | 0.98±0.03 | 0.99±0.03 | 0.99±0.03 | 0.11±0.00 | 0.16±0.06
RF | Entropy | 0.90±0.14 | 0.98±0.04 | 0.98±0.03 | 0.99±0.03 | 0.11±0.00 | 0.12±0.08
classes and selects the highest entropy data point. Ties are broken randomly for both uncertainty and entropy sampling. Table 1.4 shows the F1 score immediately after the initial training (F1 initial) followed by the F1 score after 10, 50, and 100 queries to the analyst across different learners and sampling strategies aggregated over the 10 attack types: Random forests are strictly superior to logistic regression from a detection perspective regardless of the sampling strategy. It is also clear that uncertainty and entropy sampling are superior to random sampling which suggests that judiciously sampling the unlabeled dataset can have a significant impact on the detection quality, especially in the earlier queries (F1 goes from 0.90 to 0.98 with just 10 queries). It is important to notice that the query time might become a bottleneck. In our examples, the unlabeled pool of data is not very large but as this set grows these sampling strategies have to scale accordingly. The good news is that scoring is embarrassingly parallelizable. Figure 1.2 depicts the evolution of detection quality as the system makes queries to the analyst for an attack with high prevalence (i.e., the majority of traffic is an attack): The random forest learner combined with an entropy sampler can get to perfect detection within 5 queries which suggests high data efficiency (Mussmann and Liang 2018). We will compare this to the Nmap attack with significantly lower prevalence (i.e., less than 0.01% of the dataset is an attack) (Fig. 1.3): We know from our Oracle evaluations that a random forest model can achieve perfect detection for this attack type; however, we see that an entropy sampler is not guaranteed to query the optimal sequence of data points. The fact that the prevalence of attacks is very low means that the initial training dataset probably does not have a representative set of positive labels that can be exploited by the model to generalize. The failure of uncertainty sampling has been documented (Zhu et al. 2008), and more elaborate schemes can be designed to exploit other information about the unlabeled dataset that the sampling strategy is ignoring. To gain some intuition into these deficiencies, we will unpack a step of entropy sampling for the Nmap attack. Figure 1.4 compares (a) the relative feature importance after the initial training to (b) the Oracle (Fig. 1.5):
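For concreteness, here is a hedged sketch of the entropy scorer just described (our own rendering, not the authors' code); it scores the whole unlabeled pool and returns the index to query:

```python
import numpy as np

def entropy_query(model, X_unlabeled):
    p = model.predict_proba(X_unlabeled)                      # (n, classes)
    h = -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=1)  # per-row entropy
    # argmax keeps the first maximum; the chapter breaks ties randomly instead
    return int(np.argmax(h))
```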
Fig. 1.2 Detection quality for a high prevalence attack
The Oracle graph suggests the “src_bytes” is a feature that the model is highly reliant upon for prediction. However, our initial training is not reflecting this; we will compute the z-score for each of the positive labels in our development set: z fi =
|μ R fi − μW fi | σ R fi
(1.2)
where μ R fi is the average value of the true positives for feature i (i.e., f i ), μW fi is the average value of the false positives or false negatives, and σ R fi is the standard deviation of the values in the case of true positives.
Fig. 1.3 Detection quality for a low prevalence attack
The higher this value is for a feature, the more our learner needs to know about it to correct the discrepancy. However, we see that the next query made by the strategy does not involve a decision around this fact. The score for “src_bytes” is an order of magnitude larger than other features. The model continues to make uncertainty queries staying oblivious to information about specific features that it needs to correct for.
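A small sketch of how Eq. (1.2) can be computed over a development set (our illustration; R denotes the true positives, W the mispredicted points, and a tiny epsilon guards against zero variance):

```python
import numpy as np

def z_scores(X_dev, y_dev, y_pred):
    right = X_dev[(y_pred == 1) & (y_dev == 1)]  # true positives (R)
    wrong = X_dev[y_pred != y_dev]               # false positives/negatives (W)
    return np.abs(right.mean(axis=0) - wrong.mean(axis=0)) / (right.std(axis=0) + 1e-12)
```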
1.4.2 Ensemble Learning Creating an ensemble of classifiers is usually a very effective way to combine the power of multiple learners (Zainal et al. 2009). This strategy is highly effective when the errors made by classifiers in the ensemble tend to cancel out and are not compounded. To explore this idea, we designed a weighted ensemble: The prediction in the above diagram is calculated as follows:
Fig. 1.4 Random forest feature importance for a initial training and b Oracle
Fig. 1.5 Ensemble learner
Fig. 1.6 Ensemble active learning results for warezclient and satan attacks
Prediction_Ensemble = I( Σ_{e∈E} w_e · Prediction_e > ½ Σ_{e∈E} w_e )    (1.3)

where I is the indicator function, Prediction_e ∈ {0, 1} is the binary prediction associated with the classifier e ∈ E = {RF, GB, LR, IF}, and w_e is the weight of the classifier in the ensemble. The weights are proportional to the level of confidence we have in each of the learners. We have added a gradient boosting classifier with 10 estimators. Unfortunately, the results of this experiment suggest that this particular ensemble is not adding any additional value. Figure 1.6 shows that at best the results match that of random forest (a) and in the worst case they can be significantly worse (b): The majority of the error associated with this ensemble approach relative to only using random forests can be attributed to a high false negative rate. The other four algorithms are in most cases conspiring to generate a negative class prediction which overrides the positive prediction of the random forest.
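Equation (1.3) amounts to a weighted majority vote, as the short sketch below shows; the example weights are illustrative only, since the chapter does not report the exact values used:

```python
import numpy as np

def ensemble_predict(predictions, weights):
    """predictions: dict name -> 0/1 array; weights: dict name -> w_e."""
    total = sum(weights.values())
    score = sum(weights[e] * predictions[e] for e in predictions)
    return (score > total / 2).astype(int)  # I(sum w_e * pred_e > (sum w_e)/2)

# e.g., trusting the random forest most:
# ensemble_predict({"RF": rf, "GB": gb, "LR": lr, "IF": iso},
#                  {"RF": 2.0, "GB": 1.0, "LR": 1.0, "IF": 1.0})
```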
Table 1.5 Active learning with an unsupervised sampling strategy

Sampling strategy | Initial F1 | F1 after 10 | F1 after 50 | F1 after 100
Isolation forest | 0.94±0.07 | 0.94±0.05 | 0.95±0.09 | 0.93±0.09
Entropy | 0.94±0.07 | 0.98±0.03 | 0.99±0.03 | 0.99±0.03
1.4.3 Sampling the Outliers Generated Using Unsupervised Learning Finally, we explore whether we can use an unsupervised method for finding the most anomalous data points to query. If this methodology is successful, the sampling strategy is decoupled from active learning and we can simply precompute and cache the most anomalous data points for the analyst to label. We compared a sampling strategy based on isolation forest with entropy sampling (Table 1.5): In both cases, we are using a random forest learner. The results suggest that entropy sampling is superior since it is sampling the most uncertain data points in the context of the current learner and not a global notion of anomaly which isolation forest provides.
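The decoupling described here can be sketched in a few lines: rank the pool once by isolation forest anomaly score and cache the most anomalous indices for the analyst. This is a hedged illustration assuming scikit-learn's IsolationForest:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def precompute_queries(X_pool, n_queries=100):
    iso = IsolationForest(n_estimators=10, random_state=0).fit(X_pool)
    scores = iso.score_samples(X_pool)     # lower score = more anomalous
    return np.argsort(scores)[:n_queries]  # cacheable list of points to label
```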
1.5 Conclusion We have proposed a general active learning framework for network intrusion detection. We experimented with different learners and observed that more complex learners can achieve higher detection quality with significantly less labeling effort for most attack types. We did not explore other complex models such as deep neural networks and did not attempt to tune the hyper-parameters of our model. Since the bottleneck associated with this task is the labeling effort, we can add model tuning while staying within the acceptable latency requirements. We then explored a few sampling strategies and discovered that uncertainty and entropy sampling can have a significant benefit over unsupervised or random sampling. However, we also realized that these strategies are not optimal, and we can extend them to incorporate available information about the distribution of the features for mispredicted data points. We attempted a semi-supervised approach called label spreading that builds the affinity matrix over the normalized graph Laplacian which can be used to create pseudo-labels for unlabeled data points (Zhou et al. 2004). However, this methodology is very memory-intensive, and we could not successfully train and evaluate it on all of the attack types.
References

Mussmann S, Liang P (2018) On the relationship between data efficiency and error for uncertainty sampling. arXiv preprint arXiv:1806.06123
Pimentel T, Monteiro M, Viana J, Veloso A, Ziviani N (2018) A generalized active learning approach for unsupervised anomaly detection. arXiv preprint arXiv:1805.09411
Tuor A, Kaplan S, Hutchinson B, Nichols N, Robinson S (2017) Deep learning for unsupervised insider threat detection in structured cybersecurity data streams. arXiv preprint arXiv:1710.00811
Tuor A, Baerwolf R, Knowles N, Hutchinson B, Nichols N, Jasper R (2018) Recurrent neural network language models for open vocabulary event-level cyber anomaly detection. In: Workshops at the thirty-second AAAI conference on artificial intelligence
Veeramachaneni K, Arnaldo I, Korrapati V, Bassias C, Li K (2016) AI2: training a big data machine to defend. In: Big Data Security on Cloud (BigDataSecurity), IEEE international conference on high performance and smart computing (HPSC), and IEEE international conference on intelligent data and security (IDS), IEEE 2nd international conference, pp 49–54
Zainal A, Maarof MA, Shamsuddin SM (2009) Ensemble classifiers for network intrusion detection system. J Inf Assur Secur 4(3):217–225
Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. In: Advances in neural information processing systems, pp 321–328
Zhu J, Wang H, Yao T, Tsou BK (2008) Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In: Proceedings of the 22nd international conference on computational linguistics, vol 1, pp 1137–1144
Chapter 2
Educational Data Mining Using Base (Individual) and Ensemble Learning Approaches to Predict the Performance of Students Mudasir Ashraf, Yass Khudheir Salal, and S. M. Abdullaev Abstract The ensemble approaches involving amalgamation of various learning classifiers are grounded on heuristic machine learning methods to device prediction paradigms, as these learning ensemble methods are commonly more precise than individual classifiers. Therefore, among diverse ensemble techniques, investigators have experienced a widespread learning classifier viz. bagging to forecast the performance of students. As exploitation of ensemble approaches is considered to be a remarkable phenomenon in prediction and classification mechanisms, therefore considering the striking character and originality of analogous method, namely bagging in educational data mining, researchers have applied this specific approach across the existing pedagogical dataset obtained from the University of Kashmir. The entire results were estimated with 10-fold cross validation, once pedagogical dataset was subjected to base classifiers comprising of j48, random tree, naïve bayes, and knn. Consequently, based on the learning phenomenon of miscellaneous types of classifiers, prediction models have been proposed for each classifier including base and meta learning algorithm. In addition, techniques specifically SMOTE (oversampling method) and spread subsampling (undersampling method) were employed to further draw a relationship among ensemble classifier and base learning classifiers. These methods were exploited with the key objective to observe further enhancement in prediction exactness of students.
2.1 Introduction

The fundamental concept behind an ensemble method is to synthesize contrasting base classifiers into a single classifier which is more precise and consistent in terms

M. Ashraf (B) School of CS and IT, Jain University, Bangalore 190006, India
Y. K. Salal · S. M. Abdullaev Department of System Programming, South Ural State University, Chelyabinsk, Russia
e-mail: [email protected]
S. M. Abdullaev e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_2
of prediction accuracy produced by the composite model and in decision making. This theory of hybridizing multiple models to develop a single predictive model has been under study for decades. According to Bühlmann and Yu (2003), the history of ensemble techniques began in 1977 with Tukey's twicing, which initiated ensemble research by combining a pair of linear regression models (Bühlmann and Yu 2003). The application of ensemble approaches can be prolific in enhancing the quality and robustness of various clustering models (Dimitriadou et al. 2003, 2018a; Ashraf et al. 2019). Past empirical research undertaken by different machine learning researchers acknowledges that there is considerable advancement in mitigating the generalization error once the outputs of multiple classifiers are synthesized (Ashraf et al. 2020; Opitz and Maclin 1999; Salzberg 1994, 2018b). Due to the inductive bias of the individual classifiers involved in an ensemble, it has been found that ensemble methods are more effective than the deployment of individual base learning algorithms (Geman et al. 1992). In fact, distinct ensemble mechanisms can be efficacious in squeezing the variance error to some level (Ali and Pazzani 1996) without augmenting the bias error associated with the classifier (Ali and Pazzani 1996). In certain cases, the bias error can even be curtailed using ensemble techniques, and an identical approach has been highlighted by the theory of large margin classifiers (Geman et al. 1992). Moreover, ensemble learning methods have been applied in diverse areas including bioinformatics (Bartlett and Shawe-Taylor 1999), economics (Tan et al. 2003), health care (Leigh et al. 2002), topography (Mangiameli et al. 2004), production (Ahmed and Elaraby 2014), and so on. Several ensemble approaches have been deployed by various research communities to foretell the performance of different classes pertaining to miscellaneous datasets. The preeminent and most straightforward ensemble-based concepts are bagging (Bruzzone et al. 2004) and boosting (Maimon and Rokach 2004), wherein predictions are based on combined output, generally through subsamples of the training set fed to different learning algorithms; the predictions are then aggregated through a voting phenomenon (Ashraf et al. 2018). Another method of learning, viz. meta learning, targets selecting the precise algorithm for making predictions while solving specific problems, based on the inherent idiosyncrasy of the dataset (Breiman 1996). The performance in meta learning can in addition be grounded on other simple learning algorithms (Brazdil et al. 1994). Another widespread practice, employed by Pfahringer et al. (2000) for making decisions via an ensemble technique, is to generate subsamples of the comprehensive dataset and exploit them on each algorithm. Researchers have made valid attempts to apply various machine learning algorithms to improve prediction accuracy in the field of academics (Ashraf and Zaman 2017; Ashraf et al. 2017, 2018c; Sidiq et al. 2017; Salal et al. 2021; Salal and Abdullaev 2020; Mukesh and Salal 2019). Contemporarily, realizing the potential of ensemble methods, several techniques are at the disposal of the research community for ameliorating prediction accuracy as well as exploring possible insights that are obscure within large datasets. Therefore, in this study, primarily efforts
Table 2.1 Exhibits results of diverse classifiers

Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
J48 | 92.20 | 7.79 | 0.922 | 0.053 | 0.923 | 0.919 | 0.922 | 0.947 | 13.51
Random tree | 90.30 | 9.69 | 0.903 | 0.066 | 0.903 | 0.904 | 0.903 | 0.922 | 15.46
Naïve Bayes | 95.50 | 4.45 | 0.955 | 0.030 | 0.957 | 0.956 | 0.955 | 0.994 | 7.94
KNN | 91.80 | 8.18 | 0.918 | 0.056 | 0.919 | 0.918 | 0.917 | 0.934 | 13.19
would be propounded to categorize all significant methods employed in the realm of ensemble approaches and procedures. Moreover, to make further advancements in this direction, the researchers employ classification algorithms including naïve bayes, KNN, J48, and random tree, together with an ensemble method, viz. bagging, on the pedagogical dataset obtained from the University of Kashmir, in order to improve the prediction accuracy for students. Furthermore, in the past literature related to educational data mining, researchers have hitherto made only modest efforts to exploit ensemble methods, so there is still a deficit of research within this realm. Moreover, innovative techniques need to be employed across pedagogical datasets so as to extract prolific and decisive knowledge from educational settings.
2.2 Performance of Diverse Individual Learning Classifiers

In this study, we primarily applied four learning classifiers, viz. j48, random tree, naïve bayes, and knn, across the academic dataset. Thereafter, the academic dataset was subjected to oversampling and undersampling methods to corroborate whether there is any improvement in the prediction of students' outcomes. Correspondingly, the analogous procedure was practiced with ensemble methodologies, including bagging, to substantiate which learning classifier, base or meta, demonstrated the more compelling results. Table 2.1 portrays the outcomes of the diverse classifiers accomplished after running these machine learning classifiers across the educational dataset. It is unequivocal that naïve bayes achieved a notable prediction precision of 95.50% in classifying the actual occurrences, an incorrect classification error of 4.45%, and a minimum relative absolute error of 7.94% in contrast to the remaining classifiers. The supplementary measures related to the learning algorithm, such as TP rate, FP rate, precision, recall, f-measure, and ROC area, were also found to be significant. Conversely, although random tree produced a substantial classification accuracy of 90.30%, with 9.69% incorrectly classified instances and a relative absolute error (RAE) of 15.46%, and the supplementary parameters connected with the algorithm were found
Table 2.2 Shows results with SMOTE process

Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
J48 | 92.98 | 7.01 | 0.930 | 0.038 | 0.927 | 0.925 | 0.930 | 0.959 | 11.43
Random tree | 90.84 | 9.15 | 0.908 | 0.049 | 0.909 | 0.908 | 0.908 | 0.932 | 13.92
Naïve Bayes | 97.15 | 2.84 | 0.972 | 0.019 | 0.973 | 0.972 | 0.974 | 0.978 | 4.60
KNN | 92.79 | 7.20 | 0.928 | 0.039 | 0.929 | 0.928 | 0.929 | 0.947 | 10.98
noteworthy as well, its acquired outcomes were nevertheless the least considerable among the remaining algorithms.
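For readers who wish to reproduce this style of comparison, the protocol (four base classifiers evaluated with 10-fold cross-validation) can be approximated in scikit-learn. The sketch below is illustrative only: make_classification stands in for the non-public University of Kashmir dataset, and scikit-learn's CART tree is only an approximation of WEKA's j48 (C4.5) and random tree learners.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the pedagogical dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=7)

models = {
    "j48 (CART analogue)": DecisionTreeClassifier(random_state=7),
    "random tree": DecisionTreeClassifier(splitter="random", random_state=7),
    "naive bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

for name, model in models.items():
    # 10-fold cross-validation, as used for every result in this chapter.
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")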
2.2.1 Empirical Results of Base Classifiers with Oversampling Method

Table 2.2 exemplifies the results of the diverse classifiers subsequent to the application of the oversampling technique, namely SMOTE, across the pedagogical dataset. As per the results furnished in the table, all classifiers showed exemplary improvement in prediction accuracy, along with the additional performance metrics, after using the oversampling method. Tables 2.1 and 2.2 together disclose the gains of the miscellaneous algorithms, viz. j48 (from 92.20 to 92.98%), random tree (90.30–90.84%), naïve bayes (95.50–97.15%), and knn (91.80–92.79%). Additionally, the relative absolute errors of the individual algorithms after SMOTE demonstrated further improvement, from 13.51 to 11.43% (j48), 15.46–13.92% (random tree), 7.94–4.60% (naïve bayes), and 13.19–10.98% (knn). On the contrary, the ROC (area under curve) showed a minute discrepancy in the case of the naïve bayes algorithm, with a definite variation in its value from 0.994 to 0.978.
2.2.2 Empirical Outcomes of Base Classifiers with Undersampling Method

After successfully deploying spread subsampling (an undersampling technique) over the real pedagogical dataset, Table 2.3 puts forth the results. The undersampling method depicted excellent forecast correctness in the case of the knn classifier, from 91.80% to 93.94%, which exemplifies supremacy over knn with the oversampling technique (91.80–92.79%).
Table 2.3 Demonstrates results with undersampling method

Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
J48 | 92.67 | 7.32 | 0.927 | 0.037 | 0.925 | 0.926 | 0.924 | 0.955 | 11.68
Random tree | 88.95 | 11.04 | 0.890 | 0.055 | 0.888 | 0.889 | 0.896 | 0.918 | 16.65
Naïve Bayes | 95.85 | 4.14 | 0.959 | 0.021 | 0.960 | 0.959 | 0.959 | 0.996 | 7.01
KNN | 93.94 | 6.05 | 0.939 | 0.030 | 0.939 | 0.937 | 0.939 | 0.956 | 9.39
All performance estimates connected with the knn learning algorithm, such as TP rate (0.918–0.939), FP rate (0.056–0.030), precision (0.919–0.939), recall (0.918–0.937), f-measure (0.917–0.939), ROC area (0.934–0.956), and relative absolute error (13.19–9.39%), showed exceptional results, as demonstrated in Table 2.1 (prior to the application of undersampling) and Table 2.3 (post undersampling). Nevertheless, the undersampling procedure demonstrated unpredictability in the results for the random tree classifier, whose performance declined from 90.30 to 88.95%. Although the forecast correctness of j48 and naïve bayes exemplified significant gains (92.20–92.67% and 95.50–95.85%, correspondingly), the outcomes are not as noteworthy as those the oversampling procedure produced (Table 2.2).
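Both filtering approaches can be reproduced with the imbalanced-learn library. The sketch below is an assumption-laden illustration, not the chapter's actual pipeline: the dataset is synthetic, and RandomUnderSampler is only a rough analogue of WEKA's spread subsampling.

from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic imbalanced stand-in for the pedagogical dataset.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=7)
print("original:", Counter(y))

# Oversampling: SMOTE synthesizes new minority-class points by
# interpolating between a sample and its minority-class neighbors.
X_over, y_over = SMOTE(random_state=7).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Undersampling: discard majority-class points at random, roughly
# what a 1:1 spread subsample does.
X_under, y_under = RandomUnderSampler(random_state=7).fit_resample(X, y)
print("after undersampling:", Counter(y_under))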
2.3 Bagging Approach

Under this subsection, bagging has been utilized with the various classifiers highlighted in Table 2.4. After employing bagging, the prediction accuracy demonstrated paramount success over the base learning mechanism. The correctly classified rates in Table 2.4, when contrasted with the initial prediction rates of the different classifiers in Table 2.1, show substantial improvement for three learning algorithms, viz. j48 (92.20–94.87%), random tree (90.30–94.76%), and knn (91.80–93.81%). In addition, the incorrectly classified instances came down to a considerable level for these classifiers, and as a consequence the supplementary parameters, viz. TP rate, FP rate, precision, recall, ROC area, and f-measure, also rendered admirable results. However, naïve bayes did not reveal any significant achievement in prediction accuracy with the bagging approach; moreover, the relative absolute error associated with each meta classifier increased when synthesizing the different classifiers.
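The bagging meta classifier itself is a one-liner in most toolkits. A hedged scikit-learn sketch, wrapping naïve bayes as the base learner and scoring with the same 10-fold cross-validation used throughout the chapter:

from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=7)

# Bagging trains each base learner on a bootstrap sample of the
# training set and combines the predictions by majority vote.
bagged_nb = BaggingClassifier(GaussianNB(), n_estimators=10, random_state=7)
print(cross_val_score(bagged_nb, X, y, cv=10).mean())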
Table 2.4 Shows results using bagging approach

Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
Bag. with J48 | 94.87 | 5.12 | 0.949 | 0.035 | 0.949 | 0.947 | 0.948 | 0.992 | 14.55
Bag. with Random tree | 94.76 | 5.23 | 0.948 | 0.036 | 0.948 | 0.946 | 0.947 | 0.993 | 16.30
Bag. with Naïve Bayes | 95.32 | 4.67 | 0.953 | 0.031 | 0.954 | 0.953 | 0.952 | 0.993 | 8.89
Bag. with KNN | 93.81 | 6.18 | 0.938 | 0.042 | 0.939 | 0.937 | 0.938 | 0.983 | 11.63
Table 2.5 Displays results of bagging method with SMOTE

Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
Bag. with J48 | 95.21 | 4.78 | 0.952 | 0.026 | 0.953 | 0.951 | 0.952 | 0.994 | 11.91
Bag. with Random tree | 95.21 | 4.79 | 0.952 | 0.026 | 0.954 | 0.951 | 0.951 | 0.996 | 13.11
Bag. with Naïve Bayes | 95.15 | 3.84 | 0.962 | 0.020 | 0.963 | 0.962 | 0.961 | 0.996 | 7.01
Bag. with KNN | 94.68 | 5.31 | 0.947 | 0.028 | 0.948 | 0.947 | 0.946 | 0.988 | 9.49
2.3.1 Bagging After SMOTE

When the oversampling method (SMOTE) was applied to the ensemble of each algorithm, viz. j48, random tree, naïve bayes, and knn with the bagging system, the results attained afterwards showed considerable accuracy in prediction, and the statistical figures are represented in Table 2.5. The results showed improvement not only in the correctly and wrongly classified instances, TP rate, FP rate, precision, and so on, but most noticeably in the relative absolute error, which had shown inconsistency earlier. However, naïve bayes with the bagging method again did not show any improvement in its prediction accuracy; nevertheless, its misclassified instances, relative absolute error, and other performance estimates demonstrated substantial gains. Furthermore, bagging with the j48 classifier delivered the best forecasting results among the classifiers when comparing the entire set of parameters for the ensemble technique (bagging).
Table 2.6 Explains outcomes of bagging method with undersampling method

Classifier name | Correctly classified (%) | Incorrectly classified (%) | TP rate | FP rate | Precision | Recall | F-measure | ROC area | Rel. Abs. Err. (%)
Bag. with J48 | 95.43 | 4.56 | 0.954 | 0.023 | 0.955 | 0.953 | 0.954 | 0.995 | 13.24
Bag. with Random tree | 94.79 | 5.20 | 0.948 | 0.026 | 0.949 | 0.947 | 0.948 | 0.995 | 13.98
Bag. with Naïve Bayes | 96.07 | 3.92 | 0.961 | 0.020 | 0.963 | 0.961 | 0.961 | 0.997 | 6.90
Bag. with KNN | 92.99 | 7.00 | 0.930 | 0.035 | 0.932 | 0.929 | 0.930 | 0.985 | 10.76
2.3.2 Bagging After Spread Subsampling

The bagging procedure, when deployed with the undersampling technique (spread subsampling), showed an advancement in prediction accuracy for two classifiers, namely j48 (95.43%) and naïve bayes (96.07%), as referenced in Table 2.6. Using the undersampling method, naïve bayes generated paramount growth, from 95.15 to 96.07%, in distinction to the earlier results (Table 2.5) acquired with the SMOTE technique, and the relative absolute error reduced to a statistical value of 6.90%. On the contrary, bagging with random tree and knn produced relatively significant results but with less precision in comparison to bagging with the oversampling approach. Figure 2.1 summarizes the precision and relative absolute error of the miscellaneous learning algorithms under the different approaches, viz. bagging without any filtering process, bagging with SMOTE, and bagging with spread subsampling. Among all classifiers with the inclusion of bagging and without the employment of filtering procedures, naïve bayes performed outstandingly with 95.32%. Furthermore, with the oversampling technique (SMOTE), the ensembles of the identical classifiers produced results of relatively similar significance. However, by means of the undersampling technique, naïve bayes once again achieved an exceptional prediction accuracy of 96.07%. Moreover, the figure depicts the relative absolute error of the entire set of bagging procedures, and consequently, among the classifiers, naïve bayes generated admirable results with minimum relative absolute errors of 8.89% (without filtering), 7.01% (SMOTE), and 6.90% (spread subsampling).
2.4 Conclusion

In this research study, the central focus has been the early prediction of students' outcomes using various individual (base) and meta classifiers in order to provide timely guidance to weak students. The individual learning algorithms employed across the pedagogical data, including j48, random tree, naïve bayes, and knn, evidenced phenomenal prediction accuracy of students' final outcomes. Among the base learning algorithms, naïve bayes attained a paramount accuracy of 95.50%. As the dataset in this investigation was imbalanced, which could otherwise have culminated in inaccurate and biased outcomes, the academic dataset was subjected to filtering approaches, namely the synthetic minority oversampling technique (SMOTE) and spread subsampling. In this contemporary study, a comparative revision was conducted with base and meta learning algorithms, followed by oversampling (SMOTE) and undersampling (spread subsampling) techniques, to gain comprehensive knowledge of which classifiers can be more precise and decisive in generating predictions. The above-mentioned base learning algorithms were subjected to the oversampling and undersampling methods. Naïve bayes yet again demonstrated a noteworthy improvement, to 97.15%, after being processed with the oversampling technique. With the undersampling technique, knn showed an exceptional improvement, to 93.94% prediction accuracy, over the other base learning algorithms. However, in the case of ensemble learning such as bagging, among all classifiers, bagging with naïve bayes accomplished a convincing correctness of 95.32% in predicting the exact instances. When the bagging algorithm was put into effect with the oversampling and undersampling techniques, the ensembles generated from the classifiers viz. j48 and naïve bayes demonstrated significant accuracy and the least classification error (95.21% for bagging with j48 and 96.07% for bagging with naïve bayes, respectively).

Fig. 2.1 Visualizes the results after deployment of different methods
References

Ahmed ABED, Elaraby IS (2014) Data mining: a prediction for student's performance using classification method. World J Comput Appl Technol 2(2):43–47
Ali KM, Pazzani MJ (1996) Error reduction through learning multiple descriptions. Mach Learn 24(3):173–202
Ashraf M et al (2017) Knowledge discovery in academia: a survey on related literature. Int J Adv Res Comput Sci 8(1)
Ashraf M, Zaman M (2017) Tools and techniques in knowledge discovery in academia: a theoretical discourse. Int J Data Min Emerg Technol 7(1):1–9
Ashraf M, Zaman M, Ahmed M (2018a) Using ensemble StackingC method and base classifiers to ameliorate prediction accuracy of pedagogical data. Proc Comput Sci 132:1021–1040
Ashraf M, Zaman M, Ahmed M (2018b) Using predictive modeling system and ensemble method to ameliorate classification accuracy in EDM. Asian J Comput Sci Technol 7(2):44–47
Ashraf M, Zaman M, Ahmed M (2018c) Performance analysis and different subject combinations: an empirical and analytical discourse of educational data mining. In: 8th international conference on cloud computing, data science & engineering (confluence). IEEE, 2018
Ashraf M, Zaman M, Ahmed M (2019) To ameliorate classification accuracy using ensemble vote approach and base classifiers. In: Emerging technologies in data mining and information security. Springer, Singapore, pp 321–334
Ashraf M, Zaman M, Ahmed M (2020) An intelligent prediction system for educational data mining based on ensemble and filtering approaches. Proc Comput Sci 167:1471–1483
Bartlett P, Shawe-Taylor J (1999) Generalization performance of support vector machines and other pattern classifiers. In: Advances in Kernel methods—support vector learning, pp 43–54
Brazdil P, Gama J, Henery B (1994) Characterizing the applicability of classification algorithms using meta-level learning. In: European conference on machine learning. Springer, Berlin, Heidelberg, pp 83–102
Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. ICML 96:148–156
Bruzzone L, Cossu R, Vernazza G (2004) Detection of land-cover transitions by combining multidate classifiers. Pattern Recogn Lett 25(13):1491–1500
Bühlmann P, Yu B (2003) Boosting with the L2 loss: regression and classification. J Am Stat Assoc 98(462):324–339
Dimitriadou E, Weingessel A, Hornik K (2003) A cluster ensembles framework. In: Design and application of hybrid intelligent systems
Geman S, Bienenstock E, Doursat R (1992) Neural networks and the bias/variance dilemma. Neural Comput 4(1):1–58
Leigh W, Purvis R, Ragusa JM (2002) Forecasting the NYSE composite index with technical analysis, pattern recognizer, neural network, and genetic algorithm: a case study in romantic decision support. Decision Support Syst 32(4):361–377
Maimon O, Rokach L (2004) Ensemble of decision trees for mining manufacturing data sets. Mach Eng 4(1–2):32–57
Mangiameli P, West D, Rampal R (2004) Model selection for medical diagnosis decision support systems. Decision Support Syst 36(3):247–259
Mukesh K, Salal YK (2019) Systematic review of predicting student's performance in academics. Int J Eng Adv Technol 8(3):54–61
Opitz D, Maclin R (1999) Popular ensemble methods: an empirical study. J Artif Intell Res 11:169–198
Pfahringer B, Bensusan H, Giraud-Carrier CG (2000) Meta-learning by landmarking various learning algorithms. In: ICML, pp 743–750
Salal YK, Abdullaev SM (2020) Deep learning based ensemble approach to predict student academic performance: case study. In: 2020 3rd international conference on intelligent sustainable systems (ICISS). IEEE, pp 191–198
Salal YK, Hussain M, Paraskevi T (2021) Student next assignment submission prediction using a machine learning approach. Adv Autom II 729:383
Salzberg SL (1994) C4.5: Programs for machine learning by J. Ross Quinlan. Mach Learn 16(3):235–240
Sidiq SJ, Zaman M, Ashraf M, Ahmed M (2017) An empirical comparison of supervised classifiers for diabetic diagnosis. Int J Adv Res Comput Sci 8(1)
Tan AC, Gilbert D, Deville Y (2003) Multi-class protein fold classification using a new ensemble machine learning approach. Genome Inform 14:206–217
Chapter 3
Patient's Medical Data Security via Bi Chaos Bi Order Fourier Transform

Bharti Ahuja and Rajesh Doriya
Abstract Telemedicine is used in wireless communication networks to detect and treat illnesses of patients in isolated areas. Electronic patient data is considered among the most critical and confidential data in information systems, but because of the lack of protection standards in online communication, medical information is often vulnerable to hackers and other adversaries. Therefore, sending medical information over the network needs a powerful encryption algorithm that is immune to various online cryptographic attacks. Among the three security goals for information systems, namely confidentiality, integrity, and availability, confidentiality is the most significant aspect and must be handled with great care. It is also very important to ensure the protection of patients' privacy. In this paper, we combine two chaotic maps to increase the complexity level and blend them with the fractional Fourier transform of order m. The simulation is done on the MATLAB platform, and the analysis is done through the PSNR, MSE, SSIM, and correlation coefficient.
3.1 Introduction

Encryption is a commonly used technique to encode a medium's digital content. With the aid of a key, the information is encoded before transmission and decrypted at the receiver's end. The content cannot be recovered by anyone except the person who has the key. The original message is known as plain text, and the scrambled content is known as cipher text. The data is thus protected at the time of transmission. However, after decryption the data becomes unprotected and can be copied and distributed. The schematic portrayal of cryptography is given in Fig. 3.1. With reference to cryptography and data security, medical data is always in need of the utmost care and security, because the medical data of a patient is conventionally prone to external interference, and a little alteration inside the data may

B. Ahuja (B) · R. Doriya Department of Information Technology, National Institute of Technology, Raipur, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_3
Fig. 3.1 Schematic representation of cryptography
cause the final outcome to change drastically. Diagnostic computing is therefore among the most productive fields and has an enormous impact on the healthcare trade (Roy et al. 2019). Biomedical data security is thus one of the main challenges and is mandatory for remote health care. The transfer of electronic images through telecommunications, enabled by technical developments in communication networks and the Internet, is useful here, but an image can easily be obtained by cybercriminals due to the lack of security levels on the web. Correspondingly, image coding innovation is a primary concern at this stage. The increasing dissemination of clinical images across networks has become a critical way of life in healthcare systems, benefiting from the exponential advances of network technology and the excellent advantages visual medical images bring to health care (Zhang et al. 2015). As medical pictures carry the secret information of patients, the way to ensure their safe preservation and delivery across public networks has become a critical problem for practical clinical applications. Electronic medical picture delivery is also vulnerable to hackers and other data breaches (Akkasaligar and Biradarand 2016). Scanned medical photographs and the safe transfer of electronic health information are the main necessities of telemedicine. During transmission there is an opportunity for a hacker to target medical records for data manipulation; as a consequence, the designated procedure may produce an incorrect diagnosis (Priya and Santhi 2019), and the sensitivity of medical information can hold up its transmission for a long time. Many ways and methods have been reported in this regard in the past few years. The authors of (Zhang et al. 2015) give a method which encrypts and compresses the medical image at the same time with a compressive sensing and permutation approach. In (Akkasaligar and Biradarand 2016), another procedure to safeguard electronic clinical pictures is designed using chaos theory and a DNA cryptographic scheme. Here, the input picture is transformed into two DNA-coded matrices; then Chen's hyperchaotic map, based on the DNA-encoded matrices, and the Lorenz map are used to generate the chaotic sequences independently for the pixels. An encoding method utilizing edge maps generated from a source image is shown in (Cao et al. 2017). There are three components of the algorithm:
a random value generator, bit-plane decomposition, and permutation. In (Ali and Ali 2020), a new medical image signcryption scheme is introduced that fulfills the necessary safety criteria for confidential medical information during its communication. The design of this new scheme for medical data signcryption is derived from a hybrid cryptographic combination: it uses a type of elliptic curve cryptography configured with public key cryptography for private key encoding. A chaos coding for clinical image data is introduced in (Belazi et al. 2019), where a blend of chaotic and DNA computation is proposed, followed by a secret key generator along with a permutation combination. Among all classes of encryption methods, the chaotic map-based techniques are the more efficient, especially for digital image encryption. There are various chaotic maps developed by researchers and mathematicians, such as the logistic map, Arnold map, Henon map, Tinkerbell map, etc. The chaotic maps give prominent results in security because of their sensitivity to the initial condition. In this paper, we combine two chaotic maps to increase the complexity level and blend them with the fractional Fourier transform of order m. The rest of the paper is organized as follows: in part two, the proposed method is described with definitions of the terms used. The results and analysis are given in Sect. 3.3. Finally, the conclusion is made in the fourth part.
3.2 Proposed Methodology

3.2.1 Fractional Fourier Transform (FRFT)

The FRFT is derived from the classical Fourier transform and carries an order 'm.' The extra parameter 'm' is significant in that it makes the FRFT more flexible than the classical transform and broadens its range of applications (Ozaktas et al. 2001; Tao et al. 2006). It is represented as:

x(u) = \int_{-\infty}^{\infty} x(t)\, K_{\gamma}(t, u)\, dt    (3.1)
The inverse form of the FRFT is further represented as:

x(t) = \int_{-\infty}^{\infty} x(u)\, K_{-\gamma}(u, t)\, du    (3.2)
Fig. 3.2 Time-frequency plane
where

\gamma = \frac{m \pi}{2}    (3.3)
Let F^{\gamma} denote the operator corresponding to the FRFT of angle \gamma. Under this notation, some of the important properties of the FRFT operator are listed below, with the time-frequency plane shown in Fig. 3.2.

1. For \gamma = 0, i.e., m = 0, we get the identity operator: F^0 = F^4 = I
2. For \gamma = \pi/2, i.e., m = 1, we get the Fourier operator: F^1 = F
3. For \gamma = \pi, i.e., m = 2, we get the reflection operator: F^2 = FF
4. For \gamma = 3\pi/2, i.e., m = 3, we get the inverse Fourier operator: F^3 = F F^2
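These special cases can be verified numerically for the discrete transform. The sketch below uses NumPy's unitary DFT as the operator F and checks that two applications give the reflection operator and four give the identity:

import numpy as np

x = np.random.default_rng(0).normal(size=8)

# Apply the unitary Fourier operator twice: F^2 should act as the
# reflection operator, i.e., (F^2 x)[n] = x[(-n) mod N].
Fx = np.fft.fft(x, norm="ortho")
F2x = np.fft.fft(Fx, norm="ortho")
reflected = x[(-np.arange(len(x))) % len(x)]
print(np.allclose(F2x, reflected))  # True

# Four applications recover the identity: F^4 = I.
F4x = np.fft.fft(np.fft.fft(F2x, norm="ortho"), norm="ortho")
print(np.allclose(F4x, x))  # True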
FRFT computation involves the following steps:

1. A product by a chirp.
2. A Fourier transform.
3. Another product by a chirp.
4. A product by a complex amplitude factor.
Properties of the fractional Fourier transform are explained in Table 3.1. Different parameters have been used for the performance evaluation of the various classes of discrete fractional Fourier transform (DFRFT):

• Direct form of DFRFT
• Improved sampling type DFRFT
• Linear combination type DFRFT
• Eigenvector decomposition type DFRFT
Table 3.1 Properties of fractional Fourier transform

Property | Relation
Integer order | F^j = (F)^j
Inverse | (F^α)^{-1} = F^{-α}
Unitary | (F^α)^{-1} = (F^α)^H
Index additivity | F^{α1} F^{α2} = F^{α1+α2}
Commutativity | F^{α2} F^{α1} = F^{α1} F^{α2}
Associativity | F^{α3}(F^{α2} F^{α1}) = (F^{α3} F^{α2}) F^{α1}
Table 3.2 Comparison for different types of DFRFT

Properties | Direct | Improved | Linear | Eigenfunction | Group | Impulse
Reversible | No | No | Yes | Yes | Yes | Yes
Closed form | Yes | Yes | Yes | No | Yes | Yes
Similarity with CFRFT | Yes | Yes | No | Yes | NA | Yes
FFT | NA | 2 FFT | 1 FFT | NA | 2 FFT | 2 FFT
Constraints | Less | Middle | Unable | Less | Much | Much
All orders | Yes | Yes | Yes | Yes | No | No
Properties | Less | Middle | Middle | Less | Many | Many
Additive | No | Convertible | Yes | Yes | Yes | Yes
DSP | No | Yes | Yes | Yes | Yes | Yes
• Group theory type DFRFT
• Impulse train type DFRFT

The main disadvantage of the direct form and the linear combination type DFRFT is that they are not reversible and not additive. They are not similar to the continuous FRFT, and both of these types lose various important characteristics of the continuous FRFT. This leads to the conclusion that neither of these DFRFTs is useful for DSP applications. The analysis of the remaining four classes of DFRFT, i.e., improved sampling type, group theory type, impulse train type, and eigenvector decomposition type, is discussed here; a comparison of the different types of DFRFT is given in Table 3.2. The FRFT is a member of a more general class of transforms that are sometimes called linear canonical transformations. Members of this class can be decomposed into a series of simpler operations, for example, chirp convolution, chirp multiplication, scaling, and standard Fourier transforms. The FRFT of a signal x(t) can be computed by the following steps: we choose to break down the fractional transformation into a multiplication by a chirp, followed by a convolution with a chirp, followed by yet another multiplication by a chirp. First, multiply the function x(t) by a chirp function u(t) as below:
g(t) = u(t) \cdot x(t)    (3.4)
The two-dimensional FRFT is calculated in a simple manner for M × N matrices: the one-dimensional FRFT is applied to every matrix row and then to every corresponding column, so the generalization of the FRFT for a 2D image is defined as (Ahuja and Lodhi 2014):

Y_{\alpha\beta}(p, q) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} k_{\alpha\beta}(p, q; r, s)\, y(r, s)\, dr\, ds    (3.5)

where

k_{\alpha\beta}(p, q; r, s) = k_{\alpha}(p, r)\, k_{\beta}(q, s)    (3.6)
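The separable kernel in Eq. (3.6) means the 2D transform reduces to 1D transforms along rows and then columns. For the classical case m = 1 this is just the ordinary 2D FFT, which can be checked directly:

import numpy as np

img = np.random.default_rng(0).normal(size=(4, 6))

# Row-by-row then column-by-column 1-D transforms equal the 2-D transform,
# a direct consequence of the separable kernel.
rows = np.fft.fft(img, axis=1)
full = np.fft.fft(rows, axis=0)
print(np.allclose(full, np.fft.fft2(img)))  # True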
3.2.2 Chaotic Functions

A secure encryption system must normally have the following basic features: (1) it must be able to convert the message into a random encrypted text or cipher; (2) it must be extremely sensitive to the secret key. A chaos system has these common features, including pseudorandom sensitivity to the initial state as well as parametric sensitivity (Avasare and Kelkar 2015). Many studies in recent years have therefore applied discrete chaotic maps to cryptography; still, the numerous chaotic systems have their specific fields of study and suit different circumstances. Moreover, because of certain intrinsic image characteristics, such as the sheer mass of data and the high correlation among pixels, existing cryptographic algorithms alone are not sufficient for the realistic encryption of images. Chaos in data security is an unpredictable and seemingly irregular mechanism that occurs within dynamic nonlinear systems. The chaotic mechanism is sensitive to the initial condition, unstable and yet deterministic. Numerous chaotic functions are used in encryption. We use two chaotic maps here, i.e., the logistic map and the Arnold cat map. The logistic map is a general chaotic function which is used to obtain a long key space for enhanced security, as it increases the randomness (Ahuja and Lodhi 2014). It is stated as:

y_{n+1} = u \cdot y_n \cdot (1 - y_n)    (3.7)
where y_n ∈ (0, 1) is the iterative value. When the operating parameter u lies in the interval (3.5499, 4), the system is in a chaotic state, and a small variance in the initial value leads to a spontaneous shift in the iterated values.
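As an illustration of how such a sequence can be turned into key material, the sketch below iterates the logistic map with the paper's parameters (u = 3.9, y0 = 0.1) and quantizes the orbit to bytes. The XOR masking shown is one common way of applying such a keystream and is an assumption here, not necessarily the exact mechanism of the proposed scheme.

import numpy as np

def logistic_keystream(length, u=3.9, y0=0.1, burn_in=100):
    # Iterate y_{n+1} = u * y_n * (1 - y_n) and quantize to bytes.
    y = y0
    for _ in range(burn_in):        # discard the transient iterations
        y = u * y * (1.0 - y)
    stream = np.empty(length, dtype=np.uint8)
    for i in range(length):
        y = u * y * (1.0 - y)
        stream[i] = int(y * 256) % 256   # map (0, 1) to a byte value
    return stream

# XOR-masking a flattened image with the keystream is reversible:
ks = logistic_keystream(16)
pixels = np.arange(16, dtype=np.uint8)
cipher = pixels ^ ks
print(np.array_equal(cipher ^ ks, pixels))  # True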
Fig. 3.3 Diagram of Arnold cat map
The Arnold cat map is a picture encryption technique in which the pixel positions of the picture are permuted without changing its size or histogram (Rachmawanto et al. 2019). This strategy encodes a matrix with equal width and height. The result of Arnold map encryption is governed by the number of iterations and two positive integers given as inputs. Arnold's cat map is a 2D chaotic map and is characterized as (Wang and Ding 2018):

\begin{bmatrix} x(n+1) \\ y(n+1) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x(n) \\ y(n) \end{bmatrix} \pmod{1}    (3.8)
Graphical representation of Arnold map is displayed in Fig. 3.3.
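A minimal NumPy sketch of the discrete Arnold cat map of Eq. (3.8), with coordinates taken modulo the image size N (an assumption that matches the usual discrete formulation), is given below; because the transformation matrix has determinant 1, the permutation is invertible and periodic.

import numpy as np

def arnold_cat(img, iterations=1):
    # Permute the pixels of a square N x N image with the Arnold cat map.
    n = img.shape[0]
    assert img.shape[0] == img.shape[1], "map is defined on square images"
    out = img.copy()
    for _ in range(iterations):
        x, y = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        # Discrete version of Eq. (3.8): coordinates taken modulo N.
        new_x = (x + y) % n
        new_y = (x + 2 * y) % n
        scrambled = np.empty_like(out)
        scrambled[new_x, new_y] = out[x, y]   # a bijection, so no pixel is lost
        out = scrambled
    return out

img = np.arange(16).reshape(4, 4)
print(arnold_cat(img, iterations=3))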
3.2.3 Algorithm

The proposed algorithm contains two processes: the sender's side process (encryption algorithm) and the receiver's side process (decryption algorithm). The algorithms are shown in Figs. 3.4 and 3.5.

Encryption Algorithm:
Step 1 At the sender's side, take the medical image and apply the Arnold and logistic chaotic maps to the image.
Step 2 Apply the discrete fractional Fourier transform (simulation is done with orders a = 0.77 and b = 0.77) as a secret key.
Step 3 This transformed image is the encrypted image.
Fig. 3.4 Encryption algorithm
Fig. 3.5 Decryption algorithm
Fig. 3.6 Input medical image 1 and its histogram
Decryption Algorithm:
Step 1 At the receiver's side, apply the inverse discrete fractional Fourier transform (simulation is done with orders a = 0.77 and b = 0.77) to the encrypted image.
Step 2 Remove the logistic map and apply the inverse Arnold cat map to get the decrypted image.
3.3 Results and Facts

This segment uses two images (medical image 1 and medical image 2) for testing purposes, each with a resolution of 512 × 512. Software version: MATLAB 2016a. The parameters used in the proposed system for the simulation are as follows: a = 0.77 and b = 0.77 for the FRFT, and u = 3.9 and y0 = 0.1 for the logistic map. Figures 3.6, 3.7, and 3.8 describe the computational outcomes after the MATLAB simulation for medical image 1: the input image, encrypted image, decrypted image, and their histograms are shown. Figures 3.9, 3.10, and 3.11 show the corresponding outcomes for medical image 2. For testing the efficacy of the system, some of the popular metrics, namely PSNR, MSE, SSIM, and correlation coefficient (CC), are evaluated.
Fig. 3.7 Encrypted medical image 1 and its histogram
Fig. 3.8 Decrypted medical image 1 and its histogram
PSNR: A general understanding of the quality of the encryption scheme is provided by the peak signal-to-noise ratio. For a reasonable encryption system, the PSNR between the original and recovered images should be as high as possible (Zhang 2011). The PSNR in mathematical form is expressed as:

\mathrm{PSNR} = 10 \log_{10} \frac{256 \times 256}{\mathrm{MSE}}    (3.9)
MSE: The difference between the corresponding pixel values in the real image and the reconstructed image defines the mean square error (Salunke and Salunke 2016). For reliable encryption, the mean square error of the recovered image should be as small as possible.
Fig. 3.9 Input medical image 2 and its histogram
Fig. 3.10 Encrypted medical image 2 and its histogram
\mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ f(i, j) - f'(i, j) \right]^2    (3.10)
SSIM: The structural similarity index is used to calculate the relation between an original image and a reconstructed one. The SSIM is defined as (Horé and Ziou 2010):

\mathrm{SSIM}(f, g) = l(f, g) \cdot c(f, g) \cdot s(f, g)    (3.11)
Fig. 3.11 Decrypted medical image 2 and its histogram
CC: The correlation coefficient of two neighboring pixels is another significant characteristic of an image; it evaluates the degree of linear correlation between two random variables (Zhang and Zhang 2014). For a real image, there is a strong correlation between neighboring pixels in the horizontal, vertical, and diagonal directions: a strong association between adjacent pixels is expected for a plain image, whereas a weak association between adjacent pixels is expected for a cipher image. The simulation results for PSNR, MSE, SSIM, and CC are shown in Tables 3.3 and 3.4.
Table 3.3 Metrics values of medical images

Images | PSNR (dB) | MSE | SSIM
Medical image 1 | Inf | 0 | 1
Medical image 2 | Inf | 0 | 1
Table 3.4 Correlation coefficient values of medical images

Images | Horizontal | Vertical | Diagonal
Medical image 1 (original) | 0.8978 | 0.9049 | 0.8485
Medical image 1 (encrypted) | 0.0906 | 0.0929 | 0.0072
Medical image 2 (original) | 0.9980 | 0.9958 | 0.9937
Medical image 2 (encrypted) | 0.0904 | 0.0929 | 0.0072
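These four metrics can be checked independently of MATLAB. A hedged Python sketch with scikit-image (the random array merely stands in for the medical images) reproduces the pattern of Tables 3.3 and 3.4: a perfect round trip gives MSE 0, infinite PSNR, and SSIM 1, while adjacent-pixel correlation collapses for noise-like (encrypted) data.

import numpy as np
from skimage.metrics import mean_squared_error, peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
decrypted = original.copy()   # a perfect round trip, as in Table 3.3

print(mean_squared_error(original, decrypted))        # 0.0
print(peak_signal_noise_ratio(original, decrypted))   # inf, since MSE = 0
print(structural_similarity(original, decrypted))     # 1.0

# Adjacent-pixel correlation (horizontal direction), as in Table 3.4:
x = original[:, :-1].ravel().astype(float)
y = original[:, 1:].ravel().astype(float)
print(np.corrcoef(x, y)[0, 1])   # near 0 for a noise-like image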
Figures 3.12 and 3.13 depict the correlation coefficient diagram for medical image 1 and 2, respectively.
Fig. 3.12 Correlation coefficient diagram of medical image 1
Fig. 3.13 Correlation coefficient diagram of medical image 2
38
B. Ahuja and R. Doriya
3.4 Conclusion

In this paper, we have used three techniques for medical or clinical image encryption, i.e., the FRFT, the logistic map, and the Arnold map. The results suggest that the complex hybrid combination makes the system more robust and secure against different cryptographic attacks than these methods are alone. The use of a Fourier transform-based approach together with the logistic chaotic map and the Arnold map makes this algorithm complex and nonlinear, and hence difficult to breach. In future work, the method may be extended to medical data security with the advanced tools of IoT and machine learning.
References

Ahuja B, Lodhi R (2014) Image encryption with discrete fractional Fourier transform and chaos. Adv Commun Netw Comput (CNC)
Akkasaligar PT, Biradarand S (2016) Secure medical image encryption based on intensity level using chaos theory and DNA cryptography. In: 2016 IEEE international conference on computational intelligence and computing research (ICCIC). IEEE, Chennai, pp 1–6
Ali T, Ali RA (2020) Novel medical image signcryption scheme using TLTS and Henon chaotic map. IEEE Access 8:71974–71992
Avasare MG, Kelkar VV (2015) Image encryption using chaos theory. In: 2015 international conference on communication, information and computing technology (ICCICT). IEEE, Mumbai, pp 1–6
Belazi A, Talha M, Kharbech S, Xiang W (2019) Novel medical image encryption scheme based on chaos and DNA encoding. IEEE Access 7:36667–36681
Cao W, Zhou Y, Chen P, Xia L (2017) Medical image encryption using edge maps. Signal Process 132:96–109
Horé A, Ziou D (2010) Image quality metrics: PSNR versus SSIM. In: 20th international conference on pattern recognition. IEEE, Istanbul, pp 2366–2369
Ozaktas M, Zalevsky Z, Kutay MA (2001) The fractional Fourier transform. Wiley, West Sussex, U.K.
Priya S, Santhi B (2019) A novel visual medical image encryption for secure transmission of authenticated watermarked medical images. Mobile networks and applications. Springer, Berlin
Roy M, Mali K, Chatterjee S, Chakraborty S, Debnath R, Sen S (2019) A study on the applications of the biomedical image encryption methods for secured computer aided diagnostics. In: Amity international conference on artificial intelligence (AICAI), Dubai, United Arab Emirates. IEEE, pp 881–886
Rachmawanto E, De Rosal I, Sari C, Santoso H, Rafrastara F, Sugiarto E (2019) Block-based Arnold chaotic map for image encryption. In: International conference on information and communications technology (ICOIACT). IEEE, Yogyakarta, Indonesia, pp 174–178
Salunke BA, Salunke S (2016) Analysis of encrypted images using discrete fractional transforms viz. DFrFT, DFrST and DFrCT. In: International conference on communication and signal processing (ICCSP). IEEE, Melmaruvathur, pp 1425–1429
Tao R, Deng B, Wang Y (2006) Research progress of the fractional Fourier transform in signal processing. Sci China (Ser. F Inf Sci) 49:1–25
Wang C, Ding Q (2018) A new two-dimensional map with hidden attractors. Entropy 20:322
Zhang X (2011) Lossy compression and iterative reconstruction for encrypted image. IEEE Trans Inf Forensics Secur 6:53–58
Zhang J, Zhang Y (2014) An image encryption algorithm based on balanced pixel and chaotic map. Math Probl Eng
Zhang L, Zhu Z, Yang B, Liu W, Zhu H, Zou M (2015) Medical image encryption and compression scheme using compressive sensing and pixel swapping based permutation approach. Math Probl Eng 2015
Chapter 4
Nepali Word-Sense Disambiguation Using Variants of Simplified Lesk Measure

Satyendr Singh, Renish Rauniyar, and Murali Manohar
Abstract This paper evaluates simplified Lesk algorithm for Nepali word-sense disambiguation (WSD). Disambiguation is performed by computing similarity between sense definitions and context of ambiguous word. We compute the similarity using three variants of simplified Lesk algorithm: direct overlap, frequency-based scoring, and frequency-based scoring after dropping target word. We further evaluate the effect of stop word elimination, number of senses and context window size on Nepali WSD. The evaluation was carried out on a sense annotated corpus comprising of 20 polysemous Nepali nouns. We observed overall average precision and recall of 38.87% and 26.23% using frequency-based scoring for baseline. We observed overall average precision and recall of 32.23% and 21.78% using frequency-based scoring after dropping target word for baseline. We observed overall average precision and recall of 30.04% and 20.30% using direct overlap for baseline.
4.1 Introduction
Polysemy exists in natural languages, as natural languages comprise words bearing different senses in different contexts. The English noun cricket can mean a game or an insect, based on the context in which it is used. For human beings, interpreting the appropriate sense of a word in a given context is easy, using the nearby words in the context of the target ambiguous word. For machines, this is a challenging task. Given a context, the task of computationally identifying the correct meaning of an ambiguous word is called word-sense disambiguation (WSD). It is considered
S. Singh (&) BML Munjal University, Gurugram, Haryana, India R. Rauniyar Tredence Analytics, Bangalore, India M. Manohar Gramener, Bangalore, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_4
an intermediate task in many natural language processing (NLP) applications (Ide and Veronis 1998) and is one of the major components of the machine translation task. Nepali is an Indo-Aryan language and the official language of Nepal, a south Asian country. It is also listed as an official language in India in the state of Sikkim, and it is spoken in Nepal and parts of Bhutan and India. WSD research for Nepali, as well as for other Indian languages, is constrained by the lack of resources, including training and testing corpora. Nepali is similar to the Hindi language; both languages share a lot of features in their vocabulary and grammar, while having subtle differences as well. As Nepali and Hindi are different languages, WSD results obtained on Hindi or other Indian languages cannot be generalized to Nepali without evaluation on a Nepali dataset. The Nepali language has words comprising multiple senses in different contexts. For example, the Nepali noun "उत्तर" (uttar) has two senses as nouns listed in IndoWordNet (http://tdil-dc.in/indowordnet/), as given below.

1. उत्तर, जवाब, कुनै प्रश्न या कुरा सुनेर त्यसको समाधानका लागि भनिएको कुरा, "तपाईले मेरो प्रश्नको उत्तर दिनु भएन"
Uttar, javab, kunai prashna ya kura sunera tyasko samadhanka lagi bhaniyeko kura, "Tapaile mero prashnako uttar dinu bhayena"
Answer: what is said on hearing a question or remark in order to resolve it, "You did not answer my question."

2. उत्तर, उत्तर दिशा, दक्षिण दिशाको अघिल्तिरको दिशा, भारतको उत्तरमा हिमालय पर्वत विराजमान छ
Uttar, uttar disha, dakshin dishako aghiltirako disha, bharatko uttarma Himalaya parvat virajaman cha
North, north direction: the direction opposite to the south direction, "North of India is surrounded by the Himalayas"

Given below are two contexts of 'उत्तर' (uttar).

Context 1: बालकका विचित्र प्रश्नहरूको उत्तर दिँदा-दिँदै मातापिताहरू त दिक्क पनि हुन थाल्छन् ।
Balakka vichitra prashnaharuko uttar dida-didai matapitaharu ta dikka pani huna thalchan
Parents are also troubled by answering the bizarre questions of their child.

Context 2: यस परिषद्अन्तर्गत उत्तर पूर्वी क्षेत्रका सातवटा राज्यहरू आसाम, मणिपुर, नागाल्यान्ड, मिजोराम, त्रिपुरा, मेघालय र अरुणाचल प्रदेश समावेश गरिए ।
Yash parishadantargat uttar poorvi kshetraka satavata rajyaharu Assam, Manipur, Nagaland, Mizoram, Tripura, Meghalaya ra Arunanchal Pradesh samavesh garie
Under this council, seven states of the northeastern region, namely Assam, Manipur, Nagaland, Mizoram, Tripura, Meghalaya, and Arunachal Pradesh, were included.

Sense 1 of "उत्तर" (uttar) pertains to answer, and sense 2 pertains to the north direction. In context 1 the meaning of "उत्तर" (uttar) is answer, and in context 2 the meaning is north direction.
In this work, we evaluate a WSD algorithm for the Nepali language. The algorithm is based on Lesk (1986) and, following Vasilescu et al. (2004), is called simplified Lesk. It uses the similarity between the context vector and the sense definitions for disambiguation. We further investigate the effects of context window size, stop word elimination, and number of senses on Nepali WSD, and we compare the results with those of a similar Lesk-like algorithm for Hindi WSD (Singh et al. 2017). The article is organized as follows: Sect. 4.2 covers related work in WSD for the English, Hindi, and Nepali languages. The WSD algorithm is discussed in Sect. 4.3. Section 4.4 details the construction of the sense-annotated Nepali corpus used in this work. Section 4.5 presents the experiments conducted and their results, and Sect. 4.6 discusses the results. In Sect. 4.7, we present our conclusion.
4.2 Related Work
There are two main categories into which WSD techniques are broadly grouped: dictionary-based (or knowledge-based) and corpus-based. Dictionary-based techniques (Baldwin et al. 2010; Banerjee and Pederson 2002, 2003; Lesk 1986; Vasilescu et al. 2004) utilize information from lexical resources and machine-readable dictionaries for disambiguation. Corpus-based techniques utilize a corpus, either sense-tagged (supervised) (Gale et al. 1992; Lee et al. 2004; Ng and Lee 1996) or raw (unsupervised) (Resnik 1997; Yarowsky 1995), for disambiguation. Lesk (1986) was one of the early and pioneering works on dictionary-based WSD for the English language. He represented a dictionary definition obtained from a lexicon as a bag of words and extracted the words in the sense definitions of the words in the context of the target ambiguous word. Disambiguation was performed via the contextual overlap between the sense and context bags of words. The work in (Agirre and Rigau 1996; Miller et al. 1994) is other early work utilizing dictionary definitions for WSD. Since Lesk, several extensions of his work have been proposed (Baldwin et al. 2010; Banerjee and Pederson 2002, 2003; Gaona et al. 2009; Vasilescu et al. 2004). Baldwin et al. (2010) reinvestigated and extended the task of machine-readable dictionary-based WSD. They extended the Lesk-based WSD approach with methods of definition extension and by applying different tokenization schemes. Evaluation was carried out on Hinoki Sensebank example sentences and the Senseval-2 Japanese dictionary task. The WSD accuracy using their approach surpassed both unsupervised and supervised baselines. Banerjee and Pedersen (2002) utilized glosses associated with synsets, semantic relations, and each word attribute in a pair for disambiguation using the English WordNet. In (Banerjee and Pederson 2003), they explored a novel measure of semantic relatedness based on the count of overlaps in glosses. A comparative evaluation of the original Lesk algorithm was performed by Vasilescu et al. (2004). They observed the performance of adapted Lesk
algorithm to be better in comparison of original Lesk algorithm. Gaona et al. (2009) utilized word occurrences in gloss and context information for disambiguation. For Hindi language work on WSD includes (Bhingardive and Bhattacharyya 2017; Jain and Lobiyal 2016; Mishra et al. 2009; Singh and Siddiqui 2016; Singh and Siddiqui 2012, 2014, 2015; Singh et al. 2013, 2017; Sinha et al. 2004). Sinha et al. (2004) utilized an algorithm for Hindi WSD based on Lesk. They created context bag by utilizing neighboring words of target polysemous word and sense bag by utilizing synonyms, glosses, example sentences, and semantic relations including hypernyms, hyponyms, meronyms, their glosses, and example sentences. The winner sense was one that maximized contextual overlap between the sense and context bag. They evaluated on Hindi Corpora and reported accuracy ranging from 40 to 70% in their experiment. Jain and Lobiyal (2016) utilized fuzzy graph connectivity measures for Hindi WSD and proposed fuzzy Hindi WordNet, an extension of Hindi WordNet. Singh et al. (2013) studied semantic relatedness measure for Hindi WSD task. The semantic relatedness measure explored in their work was Leacock Chodorow measure. They reported precision of 60.65%. Singh and Siddiqui (2014) investigated role of semantic relations for Hindi WSD. The semantic relations explored in their work were holonym, hypernym, meronym, and hyponym, and they obtained maximum precision using hyponym as single semantic relation. Singh and Siddiqui (2012) evaluated a Lesk-based algorithm for Hindi language WSD. They studied the effect of stemming and stop word removal on Hindi WSD task. They also investigated the effect of context window size on Hindi WSD task. Evaluation was performed on a manually sense tagged dataset comprising of 10 polysemous Hindi nouns. They reported maximum precision of 54.81% after applying stemming and stop word removal. They further observed an improvement of 9.24% in precision in comparison to baseline performance. In another work, Singh et al. (2017) explored three variants of simplified Lesk measure for Hindi word-sense disambiguation. They evaluated the effect of stop word elimination, context window size, and stemming on Hindi word-sense disambiguation. They observed 54.54% as maximum overall precision after dropping stop words and applying stemming using frequency-based scoring excluding the target word. Mishra et al. (2009) explored an unsupervised approach for Hindi word-sense disambiguation. Their approach utilized learning based on decision list created from untagged instances, while providing some seed instances manually. They applied stop word elimination and stemming in their work. Evaluation was carried on 20 ambiguous Hindi nouns; sense inventory being derived from Hindi WordNet. Singh and Siddiqui (2015) investigated the role of karaka relations on Hindi WSD task. They utilized two supervised algorithms in their experiment. Evaluation was obtained on sense annotated Hindi corpus (Singh and Siddiqui 2016). They observed that vibhaktis can be helpful for disambiguation of Hindi nouns. Bhingardive and Bhattacharyya (2017) explored IndoWordNet for bilingual word-sense disambiguation for obtaining sense distribution using expectation maximization algorithm. They also explored IndoWordNet for obtaining most frequent sense utilizing embeddings drawn from sense and word.
For Nepali language work on WSD includes (Dhungana and Shakya 2014; Shrestha et al. 2008). Dhungana and Shakya (2014) investigated adapted Lesk-like algorithm for Nepali WSD. They included synset, gloss, example sentences, and hypernym of every sense of target polysemous word for creating sense bag. Context bag was created by extracting all words from whole sentence after dropping prepositions, articles, and pronouns. Score was computed by contextual overlap of sense bag and context bag. Evaluation was done on 348 words, 59 being polysemous and they achieved an accuracy of 88.05%. Shrestha et al. (2008) studied the role of morphological analyzer and machine-readable dictionary for Nepali word-sense disambiguation using Lesk algorithm. Evaluation was performed on a small dataset comprising of Nepali nouns and they achieved accuracy values ranging from 50 to 70%. For Nepali language work on sentiment analysis includes (Gupta and Bal 2015; Piryani et al. 2020). Gupta and Bal (2015) studied sentiment analysis of Nepali text. They developed Nepali SentiWordNet named as Bhavanakos and employed it for detecting sentiment words in Nepali text. They also trained machine learning classifier using annotated Nepali text for document classification. Piryani et al. (2020) performed sentiment analysis of tweets in Nepali text. They employed machine and deep learning models for sentiment analysis of Nepali text.
4.3 WSD Algorithm
The Simplified Lesk algorithm for WSD used in this work is adapted from (Singh et al. 2017) and given in Fig. 4.1. In this algorithm, score is computed by contextual overlap of two bags: context bag and sense definition bag. Sense definition bag comprises of synsets, gloss, and example sentence of target word. Context bag is formed by extracting neighboring words in a window size of ±n in context of target word. The winner sense is one which maximizes the overlap of two bags. For studying the effects of context window size, test runs were computed on window size of 5, 10, 15, 20, and 25. For studying the effects of stop word elimination, we dropped stop words from the context vector and then created the context window. We utilized three variants to compute the score: direct overlap, frequency-based scoring, and frequency-based scoring after dropping target word. For direct overlap, we computed the number of matching words for disambiguation. For frequency-based scoring, we computed the frequency of matching words between context and sense bag. For frequency-based scoring after dropping target word, we computed the frequency of matching words between context and sense bag after dropping target word.
1.(a) Keeping ambiguous word in middle, create a context vector (CV) comprising of words in a fixed window size of ± n (b) Perform stop word removal on sense definitions and instances and create context vector as in 1 (a) 2. for i = 1 to n do // n = number of senses Create sense definition vector (SV) for sense i of the target word Scorei = Similarity-Overlap (CV, SVi) 3. return SVi for which score is maximum. Computing score (Direct Overlap): Similarity-Overlap (CV, SV) sense_score = 0 for each word x in CV if x is in SV sense_score = sense_score +1 return sense_score Computing score (Frequency based scoring & Frequency based scoring after dropping target word): Similarity (CV, SV) sense_score = 0 for each word x in CV word_count = frequency of x in SV sense_score = sense_score + word_count return sense_score Fig. 4.1 WSD algorithm
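A minimal Python rendering of the algorithm in Fig. 4.1, assuming pre-tokenized text, is given below. The two scoring modes correspond to direct overlap and frequency-based scoring, and passing the target word drops it from the context (the third variant). The toy English senses are purely illustrative and are not from the Nepali corpus.

from collections import Counter

def simplified_lesk(context_tokens, sense_definitions, target=None,
                    scoring="frequency"):
    # Pick the sense whose definition bag best overlaps the context bag.
    # scoring: 'overlap'   - count distinct context words found in the sense bag
    #          'frequency' - sum frequencies of matching words in the sense bag
    context = [w for w in context_tokens if w != target]
    best_sense, best_score = None, -1
    for sense, definition in sense_definitions.items():
        bag = Counter(definition)
        if scoring == "overlap":
            score = sum(1 for w in set(context) if w in bag)
        else:
            score = sum(bag[w] for w in context)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy example with hypothetical English tokens standing in for Nepali:
senses = {"answer": "reply said to resolve a question".split(),
          "north": "direction opposite to south direction".split()}
ctx = "he did not reply to my question".split()
print(simplified_lesk(ctx, senses))  # 'answer'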
4.4 Dataset
For evaluating the WSD algorithm, a sense annotated Nepali corpus was created comprising 20 polysemous Nepali nouns. The sense annotated Nepali corpus is given in Table 4.1. The sense definitions were obtained from IndoWordNet (http://tdil-dc.in/indowordnet/), an important lexical resource for Nepali and other Indian languages. IndoWordNet is available at the Centre for Indian Language Technology (CFILT), Indian Institute of Technology (IIT) Bombay, India. Test instances were obtained from the Nepali General Text Corpus (http://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=453&lang=en), a raw Nepali corpus available at the Technology Development for Indian Languages (TDIL) portal. Test instances were also collected by firing search queries to various sites on the Web containing Nepali text. The sense annotated Nepali corpus was built using guidelines similar to those for the sense annotated Hindi corpus (Singh and Siddiqui 2016). The sense listings in IndoWordNet are fine-grained. Hence, a few fine-grained senses have been merged in our dataset using subject-based evaluation. For example, the Nepali noun "तिल" (til) has three senses as a noun in IndoWordNet, as given below.
Table 4.1 Sense annotated Nepali corpus

No. of senses   Nepali nouns
2               उत्तर (uttar), क्रिया (kriya), गोली (goli), ग्रहण (grahan), ताल (taal), तिल (til), दर (dar), फल (phal), बोली (boli), शाखा (saakhaa), साँचो (sancho), साल (saal), सीमा (seema), हार (haar)
3               तुलसी (tulsi), धारा (dhaaraa), पुतली (putali), वचन (vachan)
4               टीका (tikaa), बल (bal)
1. तिल, एउटा रूखको बीउ जसबाट तेल निस्कन्छ, "ऊ सधैँ नुहाएपछि तिलको तेल लगाउँछ" Til, euta rukhko biu jasbata tel niskancha, "u sadhai nuhayepachi tilko tel lagaucha." Sesame, a tree seed from which oil is extracted, "He always puts on sesame oil after bathing"
2. कोठी, थोप्लो, तिल, छालामा हुने कालो वा रातो रङ्गको धेरै सानो प्राकृतिक चिनो अथवा दाग, उसका गालामा कालो कोठी छ Kothi, thoplo, til, chaalama hune kalo wa rato rangko dherai sano prakritik chino athawa daag, uska gaalama kalo kothi cha. Mole, a very small black or red colored natural mark or spot on the skin. "He has a black mole on his cheek."
3. कोठी, तिल, कालो वा रातो रङ्गको अलिक उठेको मासुको त्यो दानो जुन शरीरमा कतैकतै निक्लिने गर्छ, उनको डड्याल्नामा एउटा कालो कोठी छ Kothi, til, kalo wa rato rangko alik utheko masuko tyo dano jun sarirma kataikatai niklane garcha, unko ḍaḍyalnma euta kalo kothi cha. Mole, a black or red colored slightly elevated spot which can appear anywhere on the body. "He has a black mole on his back."
For "तिल" (til), sense 1 pertains to the small oval seeds of the sesame plant. Sense 2 pertains to a mole, a small congenital pigment spot on the skin. Sense 3 pertains to a mole, a firm abnormal elevated blemish on the skin. The instances of senses 2 and 3 were marked as similar by two subjects; hence, we merged senses 2 and 3. The two subjects were native speakers of Nepali and undergraduate students at BML Munjal University, Gurugram, India. For some senses in our dataset, we could not find sufficient instances, hence we dropped them. For example, the Nepali noun "क्रिया" (kriya) has five senses as a noun in IndoWordNet, as given below. 1. क्रिया, क्रियापद, व्याकरणमा त्यो शब्द जसद्वारा कुनै व्यापार हुनु या गरिनु सूचित हुन्छ "यस अध्यायमा क्रियामाथि छलफल गरिन्छ" Kriya, kriyapad, byakaranma tyo sabdha jasdwara kunai byapar hunu ya garinu suchit huncha, "yas adhyayama kriyamathi chalfal garincha."
Verb, verb form; in grammar, the word that indicates an action being done or performed, "This chapter discusses verbs" 2. प्रक्रिया, क्रिया, प्रणाली, पद्धति, त्यो क्रिया या प्रणाली जसबाट कुनै वस्तु हुन्छ, बन्छ या निक्लिन्छ "युरियाको निर्माण रासायनिक प्रक्रियाबाट हुन्छ" Prakriya, kriya, pranali, paddhati, tyo kriya ya pranali jasbata kunai vastu huncha, bancha ya niklincha "Ureako nirman Rasayanik prakriyabata huncha" Process, action, system, method; the action or system from which an object is made, formed, or derived. "Urea is formed by a chemical process" 3. श्राद्ध, सराद्ध, क्रिया, किरिया, कुनै मुसलमान साधु वा पीरको मृत्यु दिवसको कृत्य "सुफी फकिरको श्राद्धमा लाखौँ मान्छे भेला भए" Shradh, Saradh, kriya, kiriya, kunai musalman sadhu wa pirko mrityu diwasko krtiya "Sufi fakirko shradhma lakhau manche bhela bhaye" Last rites/rituals, death anniversary, rites; the death anniversary act of a Muslim sage. "Millions of people gathered to pay their respects to the Sufi fakir" 4. श्राद्ध, सराद्ध, क्रिया, किरिया, मुसलमान पीरको निर्वाण तिथि "पीर बाबाको श्राद्ध बडो धुमदामले मनाइयो" Shradh, Saradh, kriya, kiriya, kunai musalman pirko nirwan tithi "pir babako shradh badho dhumdhamle manaiyo" Last rites/rituals, death anniversary, rites; the emancipation date of a Muslim monk. "The death anniversary of Pir Baba was celebrated with great pomp" 5. क्रिया, कुनै कार्य भएको वा गरिएको भाव "दूधबाट दही बनिनु एउटा रासायनिक क्रिया हो" Kriya, kunai karya bhyeko wa gariyeko bhab "Dhudhbata dahi baninu euta rasayanik kriya ho" Action; the sense of something being done or happening. "Curd forming from milk is a chemical action" For "क्रिया" (kriya), sense 1 pertains to a content word that denotes an action or a state, a verb in Nepali grammar. Sense 2 pertains to a particular course of action intended to achieve a result. Sense 3 pertains to the death anniversary act of a Muslim monk. Sense 4 pertains to a Muslim monk's emancipation date. Sense 5 pertains to something that people do or cause to happen. Senses 3 and 4 of "क्रिया" (kriya) thus both pertain to the death or death anniversary of a Muslim monk. We could not get instances pertaining to these senses, hence we dropped them from the dataset. Senses 2 and 5 both pertain to a course of action, hence they were merged. We further added a few senses that were not available in IndoWordNet. For example, the Nepali noun "दर" (dar) has three senses as a noun in IndoWordNet, as given below.
1. अनुपात, दर, मान, माप, उपयोगिता आदिको तुलनाको विचारले एउटा वस्तु अर्को वस्तुसित रहने सम्बन्ध या अपेक्षा "पुस्तकका लागि लेखकले दुई प्रतिशतको अनुपातले रोयल्टी भेटिरहेको छ" Anupat, dar, maan, maap, upayogita adhiko tulanako bicharle euta vastu arko vastustith rahane sambhandha ya apekchya "Pusktakko lagi lekheko dui pratisatko anupatle royalty bhetiraheko cha." The relation or expectation of one object to another by comparing proportions, rates, values, measurements, utility, etc. "The author is receiving a two per cent royalty for the book."
2. मूल्य, दर, दाम, मोल, कुनै वस्तु किन्दा वा बेच्दा त्यसको बदलामा दिइने धन "यस कारको मूल्य कति हो" Mulya, dar, daam, moal, kunai vastu kinda wa bechda tyasko badlama diyine dhan. yash kaarko mulya kati ho Price, rate, value; the money to be paid in return for buying or selling an item. "What is the rate of this car?"
3. मूल्य, मोल, दाम, भाउ, दर, कुनै वस्तुको गुण, योग्यता या उपयोगिता जसको आधारमा उसको आर्थिक मूल्य जाँचिन्छ "हीराको मूल्य जौहारीले मात्रै जान्दछ" Mulya, moal, daam, bhau, dar, kunai vastuko gun, yogyata ya upayogita jasko aadharma usko arthik mulya jachincha "Hirako mulya jauharile matrai jandacha" Price, value, rate; the quality, merit, or usefulness of a commodity on the basis of which its economic value is judged. "Only a jeweler knows the value of a diamond"
For "दर" (dar), sense 1 pertains to rate or value, sense 2 pertains to rate or price, and sense 3 pertains to rate, value, or price. For the Nepali noun "दर" (dar), we added a sense: दर, तीजमा खाइने विशेष खाना, दर खाने दिनबाट तीज सुरु भएको मानिन्छ । Dar, teejma khaine vishesh khana, dar khane dinbata teej suru bhayeko manincha. Dar, a special food eaten on the occasion of Teej; the Teej festival is assumed to start from the day after having Dar. This sense pertains to a special dish made on the occasion of the Teej festival, a festival celebrated in India and Nepal. Senses 1, 2, and 3 all pertain to rate; hence they were merged. Precision and recall were computed for the performance evaluation of the WSD algorithm (Singh et al. 2017). Precision is computed as the ratio of instances disambiguated correctly to the total test instances answered for a word. Recall is computed as the ratio of instances disambiguated correctly to the total test instances to be answered for a word. The sense annotated Nepali corpus comprises 20 polysemous Nepali nouns. The total number of words in the corpus is 231,830, of which 40,696 are unique. The total number of instances in the corpus is 3525, and the total number of senses is 48. The average number of instances per word is 176.25, the average number of instances per sense is 73.44, and the average number of senses per word is 2.4. The transliteration, translation, and number of instances for every sense of each word of this corpus are provided in Table 4.10 in the Appendix. The Nepali stop words list used in this work is given in Fig. 4.2.
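As a small worked illustration of these two measures (the numbers below are made up, not taken from the corpus):

```python
def precision_recall(correct, answered, total):
    """Precision: correct / instances answered for a word;
    recall: correct / total test instances to be answered for that word."""
    precision = correct / answered if answered else 0.0
    recall = correct / total if total else 0.0
    return precision, recall

# A word with 100 test instances, of which 80 were answered and 30 were
# disambiguated correctly, gives precision 0.375 and recall 0.3:
print(precision_recall(30, 80, 100))
```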
4.5 Experiments and Results
Two test runs were performed to evaluate the algorithm and to study the effect of stop word elimination. These test runs pertain to the following two cases: without stop word removal (Case 1), which is also our baseline case, and with stop word removal (Case 2). For each test run, results were computed on window sizes 5, 10, 15, 20, and 25. Test run 1 (Case 1) corresponds to the baseline, i.e., the overlap between the context and the sense definitions. For test run 2 (Case 2), we performed stop word removal on the sense definitions and the context vector, and then computed the similarity. Overall average precision and recall over the 20 words for direct overlap, frequency-based scoring after dropping the target word, and frequency-based scoring for both cases, averaged over context window sizes of 5–25, are given in Table 4.2. Average precision and recall for the 20 words with regard to context window size for direct overlap are given in Tables 4.3 and 4.4. Tables 4.5 and 4.6 provide average precision and recall with regard to context window size for frequency-based scoring after dropping the target word. Average precision and recall with regard to context window size for frequency-based scoring are given in Tables 4.7 and 4.8. Table 4.9 provides average precision and recall with regard to the number of senses for both cases and all three variants.
Fig. 4.2 Nepali stop words list
Table 4.2 Overall average precision and recall

                                                      Precision   Recall
Direct overlap                             Case 1     0.3004      0.2030
                                           Case 2     0.2508      0.1655
Frequency-based scoring after dropping     Case 1     0.3223      0.2178
target word                                Case 2     0.2759      0.1824
Frequency-based scoring                    Case 1     0.3887      0.2623
                                           Case 2     0.3503      0.2347
Table 4.3 Average precision with respect to context window size (direct overlap)

          Context window size
          5        10       15       20       25
Case 1    0.2111   0.2740   0.3148   0.3446   0.3574
Case 2    0.1724   0.2211   0.2663   0.2871   0.3072

Table 4.4 Average recall with respect to context window size (direct overlap)

          Context window size
          5        10       15       20       25
Case 1    0.1466   0.1872   0.2142   0.2308   0.2362
Case 2    0.1186   0.1484   0.1743   0.1868   0.1995

Table 4.5 Average precision with respect to context window size (frequency-based scoring after dropping target word)

          Context window size
          5        10       15       20       25
Case 1    0.2357   0.2977   0.3310   0.3621   0.3851
Case 2    0.1990   0.2447   0.2904   0.3143   0.3311

Table 4.6 Average recall with respect to context window size (frequency-based scoring after dropping target word)

          Context window size
          5        10       15       20       25
Case 1    0.1627   0.2039   0.2256   0.2427   0.2539
Case 2    0.1359   0.1638   0.1905   0.2055   0.2160

Table 4.7 Average precision with respect to context window size (frequency-based scoring)

          Context window size
          5        10       15       20       25
Case 1    0.3212   0.3776   0.3959   0.4151   0.4337
Case 2    0.2958   0.3319   0.3606   0.3777   0.3853

Table 4.8 Average recall with respect to context window size (frequency-based scoring)

          Context window size
          5        10       15       20       25
Case 1    0.2173   0.2543   0.2690   0.2812   0.2895
Case 2    0.2003   0.2228   0.2413   0.2523   0.2567
4.6 Discussion
The maximum overall precision and recall of 38.87% and 26.23% were observed for frequency-based scoring in the baseline case, as seen in Table 4.2. We observed overall precision and recall of 35.03% and 23.47% for the case with stop word elimination using frequency-based scoring. For direct overlap in the baseline case, we observed overall average precision and recall of 30.04% and 20.30%. For the case with stop word removal, the overall average precision and recall observed were 25.08% and 16.55%. We observed overall average precision and recall of 32.23% and 21.78% for frequency-based scoring after dropping the target word in the baseline case; for the case with stop word removal, we observed 27.59% and 18.24%. A decrease in precision is observed using stop word elimination (Case 2) over the baseline (Case 1). For direct overlap, we observed a 16.51% decrease in precision after stop word elimination (Case 2) over the baseline (Case 1). For frequency-based scoring, we observed a 9.88% decrease, and for frequency-based scoring after dropping the target word, a 14.40% decrease, in precision using stop word elimination (Case 2) over the baseline (Case 1). The results in Tables 4.3, 4.5, and 4.7 suggest that increasing the context window size enhances the possibility of disambiguating the correct sense. On increasing the window size, more content words are induced into the context vector, wherein some word may be a strong indicator of a particular sense. As the number of senses (classes) increases, the possibility of correct disambiguation decreases in general, as seen in the results of Table 4.9. There were 14 words having 2 senses, 4 words with 3 senses, and 2 words with 4 senses. We observed maximum precision for words comprising 2 senses, followed by 3 and 4 senses. Comparing the results of Nepali WSD with similar work on Hindi WSD (Singh et al. 2017), we obtained an overall decrease in precision for the Nepali language.
Table 4.9 Average precision and recall with respect to number of senses

                                                Precision (no. of senses)    Recall (no. of senses)
                                                2        3        4          2        3        4
Direct overlap                       Case 1     0.3321   0.2598   0.1591     0.2140   0.2004   0.1310
                                     Case 2     0.2666   0.2419   0.1582     0.1645   0.1879   0.1277
Frequency-based scoring after        Case 1     0.3575   0.2795   0.1619     0.2317   0.2147   0.1261
dropping target word                 Case 2     0.2973   0.2466   0.1848     0.1851   0.1915   0.1445
Frequency-based scoring              Case 1     0.4534   0.2759   0.1614     0.2942   0.2130   0.1371
                                     Case 2     0.4134   0.2333   0.1421     0.2659   0.1832   0.1192
In the work reported on Hindi WSD (Singh et al. 2017), the maximum overall average precision obtained with the simplified Lesk algorithm was 54.54%, using frequency-based scoring excluding the target word and after applying stemming and stop word removal. The overall average precision obtained for the baseline and for stop word removal was 49.40% and 50.64%, respectively, using frequency-based scoring excluding the target word. Moreover, an increase in precision was observed after stop word removal over the baseline for Hindi WSD. The Nepali language has a complex grammatical structure. Root words in Nepali grammar are often suffixed with words such as "को" (ko), "का" (ka), "मा" (maa), "द्वारा" (dwara), "ले" (le), "लाई" (lai), "बाट" (bata), etc. This set of words is known as vibhaktis. Apart from such words, other suffixes such as "हरू" (haru) denote the plural form of a word. For example, "केटाहरू" (ketaharu), meaning boys, is the plural form of "केटा" (keta), meaning boy. Moreover, different vibhaktis can be suffixed to the same root word in different sentences, depending upon the context. Separating these suffixes and vibhaktis from the root word results in a grammatically incorrect sentence. Given below is a context in Nepali. सविधानसभा निर्वाचनमा मनाङबाट निर्वाचित टेकबहादुर गुरुङ श्रम तथा रोजगार राज्यमन्त्री हुन्। Sambhidhansabha nirvachanma Manangbata nirvachit tekbahadur gurung shram tatha rojgar rajyamantri hun. Tek Bahadur Gurung, elected from Manang in the Constituent Assembly election, is the Minister of Labor and Employment. The Nepali context has the word "निर्वाचनमा" (nirwachanma), i.e., the vibhakti "मा" (maa) appended to "निर्वाचन" (nirwachan). "निर्वाचन" (nirwachan) can also be appended with the vibhakti "को" (ko), forming the word "निर्वाचनको" (nirwachanko). In the computational overlap, "निर्वाचनमा" (nirwachanma) and "निर्वाचनको" (nirwachanko) would be treated as different words. This accounts for the decrease in precision in Nepali WSD relative to Hindi WSD. The context vector and sense vector overlap for Nepali may involve the same content word suffixed with different vibhaktis. Such content words are treated differently for every suffixed vibhakti and are not counted in the computational overlap. After stop word removal in Nepali, stop words are dropped from the context vector, even though they may have contributed to the contextual overlap in the baseline case; the context vector thus formed comprises content words with different vibhaktis appended, so there will be no match for the same content word in the sense and context vectors when the appended vibhaktis differ. Nepali is a morphologically very rich language. Given below are two contexts in Nepali.
Context 1: वास्तवमा यो बलद्वारा निर्मूल गर्न सकिने विषय होइन। Actually, this topic is not something you can eliminate with force. Context 2: कति कष्ट र वेदना सहेर बलैले धर्मवीर बन्न सफल भएको, आज अकस्मात् पुन: अधर्मी बन्नुपर्यो? How has he become atheist again despite facing many difficulties and sorrows and managing to become theist with strong will power? In the first context, the word "बल" is appended with the vibhakti "द्वारा", forming "बलद्वारा" (baldwara). In the second context, "बल" is appended with the vibhakti "ले" (le), forming "बलैले" (balele). This is how nouns are transformed into different morphological forms in Nepali. As explained above, the same content word then appears with different vibhaktis in the sense and context vectors, so the exact-match overlap finds no match. This is responsible for the decrease in precision after stop word elimination (Case 2) over the baseline (Case 1) in Nepali WSD.
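The exact-match failure described above can be seen directly in the illustrative Python fragment below. The suffix list and the naive strip_vibhakti routine are our own simplifications; a real system would need a proper Nepali morphological analyzer.

```python
# The same content word surfaces with different vibhaktis attached.
sense_tokens = ["निर्वाचनको", "मिति"]      # "निर्वाचन" + vibhakti "को"
context_tokens = ["निर्वाचनमा", "विजयी"]   # "निर्वाचन" + vibhakti "मा"

print(set(sense_tokens) & set(context_tokens))   # set(): no exact-match overlap

# Naive normalization stripping a few common vibhaktis / plural suffixes,
# checked longest-first to avoid clipping inside a longer suffix.
SUFFIXES = ("द्वारा", "लाई", "बाट", "हरू", "को", "का", "मा", "ले")

def strip_vibhakti(word):
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s):
            return word[:-len(s)]
    return word

norm = lambda words: {strip_vibhakti(w) for w in words}
print(norm(sense_tokens) & norm(context_tokens))  # {'निर्वाचन'}: match recovered
```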
4.7 Conclusion
In this paper, we investigated Nepali WSD using variants of a simplified Lesk-like algorithm. We also evaluated the effects of stop word elimination, the number of senses, and the context window size on the Nepali WSD task. The maximum precision observed was for the baseline using frequency-based scoring. Stop word elimination in Nepali results in a decrease in precision over the baseline. Increasing the context window size results in an increase in precision, as more content words are added to the context vector. Increasing the number of senses results in a decrease in precision in general.
Appendix

See Table 4.10.
Table 4.10 Translation, transliteration and details of sense annotated Nepali corpus

उत्तर (uttar): Sense 1: Answer (224); Sense 2: North direction (131)
क्रिया (kriya): Sense 1: Verb in Nepali grammar (107); Sense 2: A course of action (102)
गोली (goli): Sense 1: A dose of medicine (22); Sense 2: Bullet, a projectile fired from a gun (88)
ग्रहण (grahan): Sense 1: The act of receiving (145); Sense 2: One celestial body obscures the other (86)
टीका (tikaa): Sense 1: Jewellery worn by women in South Asian countries (12); Sense 2: A sign on the forehead using sandalwood (27); Sense 3: Writing about something in detail (20); Sense 4: Name of a person (26)
ताल (taal): Sense 1: A small lake (105); Sense 2: Rhythm as given by divisions (32)
तिल (til): Sense 1: Small oval seeds of the sesame plant (38); Sense 2: A small congenital pigment spot on the skin (20)
तुलसी (tulsi): Sense 1: Basil, holy and medicinal plant (167); Sense 2: A saint who wrote the Ramayana and was a follower of God Ram (46); Sense 3: A common name used for a man (42)
दर (dar): Sense 1: Rate (87); Sense 2: Special dish made on the occasion of the Teej festival (66)
धारा (dhaaraa): Sense 1: River's flow (57); Sense 2: Law charges for crimes/section (126); Sense 3: Flow of speech and thought (35)
पुतली (putali): Sense 1: Toy (24); Sense 2: Contractile aperture in the iris of the eye (34); Sense 3: Butterfly (21)
फल (phal): Sense 1: Fruit (155); Sense 2: Result (112)
बल (bal): Sense 1: Strength, power (93); Sense 2: Emphasis on a statement or something said (31); Sense 3: Ball (41); Sense 4: Force relating to police, army, etc. (83)
बोली (boli): Sense 1: Communication by word of mouth (164); Sense 2: Bid (34)
वचन (vachan): Sense 1: What one speaks or says, saying (62); Sense 2: Promise, commitment (59); Sense 3: Used in grammar as an agent to denote singular or plural (24)
शाखा (saakhaa): Sense 1: Divisions of an organization (61); Sense 3: Community (21)
साँचो (sancho): Sense 1: Truth (136); Sense 2: Keys (38)
साल (saal): Sense 1: Year (150); Sense 2: Type of tree (49)
सीमा (seema): Sense 1: Boundary between two things or places, border (72); Sense 2: Extent, limit (97)
हार (haar): Sense 1: Defeat (100); Sense 2: Necklace, garland (53)
References

Agirre E, Rigau G (1996) Word sense disambiguation using conceptual density. In: Proceedings of the international conference on computational linguistics (COLING'96), pp 16–22
Baldwin T, Kim S, Bond F, Fujita S, Martinez D, Tanaka T (2010) A re-examination of MRD-based word sense disambiguation. J ACM Trans Asian Lang Process 9(1):1–21
Banerjee S, Pederson T (2002) An adapted Lesk algorithm for word sense disambiguation using WordNet. In: Proceedings of the third international conference on computational linguistics and intelligent text processing, pp 136–145
Banerjee S, Pederson T (2003) Extended gloss overlaps as a measure of semantic relatedness. In: Proceedings of the eighteenth international joint conference on artificial intelligence, Acapulco, Mexico, pp 805–810
Bhingardive S, Bhattacharyya P (2017) Word sense disambiguation using IndoWordNet. In: Dash N, Bhattacharyya P, Pawar J (eds) The WordNet in Indian languages. Springer, pp 243–260
Dhungana UR, Shakya S (2014) Word sense disambiguation in Nepali language. In: Fourth international conference on digital information and communication technology and its applications (DICTAP), Bangkok, Thailand, pp 46–50
Gale WA, Church K, Yarowsky D (1992) A method for disambiguating word senses in a large corpus. J Comput Hum 26:415–439
Gaona MAR, Gelbukh A, Bandyopadhyay S (2009) Web-based variant of the Lesk approach to word sense disambiguation. In: Mexican international conference on artificial intelligence, pp 103–107
Gupta CP, Bal BK (2015) Detecting sentiments in Nepali text. In: Proceedings of international conference on cognitive computing and information processing, Noida, India, pp 1–4
Ide N, Veronis J (1998) Word sense disambiguation: the state of the art. Comput Linguist 24(1):1–40
IndoWordNet. http://tdil-dc.in/indowordnet/
Jain A, Lobiyal DK (2016) Fuzzy Hindi WordNet and word sense disambiguation using fuzzy graph connectivity measures. ACM Trans Asian Low-Resource Lang Inf Process 15(2)
Lee YK, Ng HT, Chia TK (2004) Supervised word sense disambiguation with support vector machines and multiple knowledge sources. In: SENSEVAL-3: third international workshop on the evaluation of systems for the semantic analysis of text, Barcelona, Spain, pp 137–140
Lesk M (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In: Proceedings of the 5th annual international conference on systems documentation (SIGDOC), Toronto, Ontario, pp 24–26
Miller G, Chodorow M, Landes S, Leacock C, Robert T (1994) Using a semantic concordance for sense identification. In: Proceedings of the 4th ARPA human language technology workshop, pp 303–308
Mishra N, Yadav S, Siddiqui TJ (2009) An unsupervised approach to Hindi word sense disambiguation. In: Proceedings of the first international conference on intelligent human computer interaction, pp 327–335
Nepali General Text Corpus. http://tdil-dc.in/index.php?option=com_download&task=showresourceDetails&toolid=453&lang=en
Ng HT, Lee HB (1996) Integrating multiple knowledge sources to disambiguate word sense: an exemplar-based approach. In: Proceedings of the 34th annual meeting of the association for computational linguistics, pp 40–47
Piryani R, Priyani B, Singh VK, David P (2020) Sentiment analysis in Nepali: exploring machine learning and lexicon-based approaches. J Intell Fuzzy Syst 1–12
Resnik P (1997) Selectional preference and sense disambiguation. In: Proceedings of the ACL SIGLEX workshop on tagging text with lexical semantics: why, what and how? pp 52–57
Shrestha N, Hall PAV, Bista SK (2008) Resources for Nepali word sense disambiguation. In: International conference on natural language processing and knowledge engineering, Beijing, China
Singh S, Siddiqui TJ (2016) Sense annotated Hindi corpus. In: The 20th international conference on Asian language processing, Tainan, Taiwan, pp 22–25
Singh S, Siddiqui TJ (2012) Evaluating effect of context window size, stemming and stop word removal on Hindi word sense disambiguation. In: Proceedings of the international conference on information retrieval and knowledge management, Malaysia, pp 1–5
Singh S, Siddiqui TJ (2014) Role of semantic relations in Hindi word sense disambiguation. In: Proceedings of international conference on information and communication technologies
Singh S, Siddiqui TJ (2015) Role of karaka relations in Hindi word sense disambiguation. J Inf Technol Res 8(3):21–42
Singh S, Gabrani G, Siddiqui TJ (2017) Hindi word sense disambiguation using variants of simplified Lesk measure. J Intell Inform Smart Technol 2:1–6
Singh S, Singh VK, Siddiqui TJ (2013) Hindi word sense disambiguation using semantic relatedness measure. In: Proceedings of 7th multi-disciplinary workshop on artificial intelligence, Krabi, Thailand, pp 247–256
Sinha M, Kumar M, Pande P, Kashyap L, Bhattacharyya P (2004) Hindi word sense disambiguation. In: International symposium on machine translation, natural language processing and translation support systems, Delhi, India
Vasilescu F, Langlasi P, Lapalme G (2004) Evaluating variants of the Lesk approach for disambiguating words. In: Proceedings of the language resources and evaluation, pp 633–636
Yarowsky D (1995) Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting of the association for computational linguistics, pp 189–196
Chapter 5
Performance Analysis of Big.LITTLE System with Various Branch Prediction Schemes

Froila V. Rodrigues and Nitesh B. Guinde
Abstract With the sprinting innovation in mobile technology, cell-phone processors are nowadays designed and deployed to meet the demands of high performance and low-power operation. The ARM big.LITTLE architecture for smartphones meets the above requirements, with "big" cores delivering high performance and "little" cores being power efficient. High performance is achieved by making pipelines deeper, which results in more processing time being spent on a branch misprediction. Hence, an accurate branch predictor is required to mitigate branch delay latency in processors and exploit parallelism. In this paper, we evaluate and compare various branch prediction schemes by incorporating them into an ARM big.LITTLE architecture with Linux running on it. The comparison is carried out for performance and power utilization with the Rodinia benchmarks for heterogeneous cores. Performance of the simulated system is reported in terms of execution time, percentage of conditional branch mispredictions, overall percentage of branch mispredictions (covering conditional and unconditional branches), and instructions per cycle. It is observed that the TAGE-LSC and perceptron predictors perform best among all the simulated predictors, achieving average accuracies of 99.03% and 99.00%, respectively, using the gem5 framework. The local branch predictor has the least power dissipation when tested on the integrated platform of multicore power, area, and timing (McPAT).
5.1 Introduction

Heterogeneous multicore systems have become an alternative for the smartphone industry, whose primary objectives are power efficiency and high performance. High performance needs fast processors, which in turn makes it difficult to fit within the required thermal budget or mobile power. Battery power fails to cope with the fast-evolving CPU technology. Today, smartphones with high performance and
long-lasting battery life are preferred. ARM big.LITTLE architectures (ARM Limited 2014) are designed to satisfy the above requirements. This technology uses heterogeneous cores: the "big" processors provide maximum computational performance, while the "LITTLE" cores provide maximum power efficiency. As per the research in Butko et al. (2015) and Sandberg et al. (2017), both processors use the same instruction set architecture (ISA) and are coherent in operation. Branch misprediction latency is one of the major causes of performance degradation in processors. This becomes even more critical as micro-architectures become more deeply pipelined (Sprangle and Carmean 2002). To mitigate this, an accurate prediction scheme is essential to boost parallelism. Prefetching and executing the branch along the predicted direction avoids stalling the pipeline and reduces the performance losses caused by branches. Predicting the branch outcome correctly also frees the functional units, which can be utilized for other tasks. Earlier work on branch predictors compares them on the basis of performance alone, that is, without taking into consideration the effects of the operating system (OS) while running the workload. The novelty of this paper is the evaluation and comparative analysis of various branch predictors by incorporating them into an ARM big.LITTLE architecture with Linux running on it. The comparison is carried out for performance and power dissipation. Our contributions also include comparing the branch predictors on the ARM big.LITTLE system in terms of the percentage of conditional branch mispredictions, the overall percentage of branch mispredictions covering conditional and indirect branches, IPC, execution time, and power consumption. Based on the detailed analysis, we report some useful insights about the designed architecture with various branch prediction schemes and their associated impact on performance and power. The rest of the paper is organized as follows: Sect. 5.2 presents related research on commonly used branch prediction schemes. Section 5.3 discusses the simulated branch predictors. In Sect. 5.4, the experimental setup is described along with the architectural modeling of the processor; the performance and power models are also discussed. Section 5.5 presents the experimental results. Section 5.6 gives concluding remarks and perspectives regarding the branch prediction schemes.
5.2 Related Research

K. Aparna presents a comparative study of various BPs including the bimodal, gshare, YAGS, and meta-predictor (Kotha 2007). The BPs are evaluated for their performance using JPEG encoder, G721 speech encoder, MPEG decoder, and MPEG encoder applications. A new BP, namely YAGS Neo, is modeled, which outperforms the others for some of the applications. The paper also studies a meta-predictor with various combinations of the predictors, which shows improved performance over the individual ones.
Sparsh M. presents a survey of various dynamic BP schemes including gshare, two-level BPs, the Smith BP, and the perceptron (Sparsh 2018). It is seen that the perceptron BP has the highest precision for most of the applications used. A. Butko et al. explore the design of single-ISA heterogeneous multicore processors based on the ARM big.LITTLE technology. They model the heterogeneous processor in the gem5 and McPAT frameworks for evaluation of its performance and power (Butko et al. 2016). The big.LITTLE model is implemented in gem5 with reference to the Exynos 5 Octa (5422) SoC specifications. Fine-tuning of micro-architectural parameters such as the execution stage configuration, functional units, branch predictor, physical registers, etc., of the Cortex-A7 (in-order) and Cortex-A15 (out-of-order) cores is performed. They validate the simulated model for accuracy, performance, and power with respect to the Samsung Exynos Octa (5422). The "LITTLE" cores use the Minor CPU model, whose pipeline comprises four stages: fetch1, fetch2, decode, and execute. Branch data is latched in the fetch2-to-fetch1 latch. The outcome of a conditional branch is known at execution time and is carried to the fetch1 stage by the execute-to-fetch1 latch for updating the branch predictor. Also, if the fetched instructions do not match the branch predictor's decision, they are discarded from the pipeline. This is easily identified, as the sequence number of the predicted instruction in fetch2 will not match that of the fetched instruction in the fetch1 pipeline stage (Butko et al. 2015; Sandberg et al. 2017). The "big" cores use the OoO CPU model in gem5: the fetch, decode, rename, and retirement stages of the OoO pipeline are performed in-order, while the instruction issue, dispatch, and execute stages are performed out-of-order (Butko et al. 2015; Sandberg et al. 2017). The fetch stage of the OoO pipeline handles the branch prediction unit. Unconditional branches whose branch target is not available are handled in the decode stage of the pipeline, while conditional branch mispredictions are determined in the execute stage. Whenever a branch misprediction is identified, the entire pipeline is squashed, the entry is deleted from the branch target buffer (BTB), and correspondingly the program counter (PC) is updated. Flushing the entire pipeline on a branch misprediction reduces the performance of a multi-stage pipelined processor.
5.3 Simulated Branch Predictors

Branch prediction in the processor is dynamic. These adaptive predictors observe the pattern of the history of previous branches during execution. This behavior is then utilized to predict the outcome of a branch, whether taken or not taken, when the same branch occurs the next time. If multiple unrelated branches index the same entry in the predictor table, it leads to the aliasing effect shown in Fig. 5.1, where interference between branches P and Q leads to a misprediction. Hence, it is necessary to have an accurate branch prediction scheme.
Fig. 5.1 Example of aliasing effect
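The aliasing effect is easy to reproduce. In the small Python sketch below, the table size and the addresses are arbitrary; it simply shows two distinct branch PCs mapping to the same pattern-history-table counter.

```python
PHT_BITS = 10                                   # a 1024-entry table of counters
index = lambda pc: (pc >> 2) & ((1 << PHT_BITS) - 1)

pc_P = 0x00401234
pc_Q = pc_P + (1 << (PHT_BITS + 2))             # differs only above the index bits

# Both branches share one counter, so P's outcomes corrupt Q's prediction.
print(hex(index(pc_P)), hex(index(pc_Q)))       # same index for P and Q
```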
Some of the commonly used branch prediction schemes for computer architectures are bimodal, local, gshare, YAGS, tournament, L-TAGE, perceptron, ISL-TAGE, and TAGE-SC-L. Bimodal BP: The bimodal predictor is the earliest form of branch prediction (Lee et al. 1997). The prediction is based upon the branch history of a given branch instruction. A table of two-bit saturating counters is indexed using the lower bits of the corresponding branch address. When a branch is identified, if the corresponding counter is in the strongly taken (ST) or weakly taken (WT) state, future occurrences are predicted as taken; when it is in the weakly not-taken (WNT) or strongly not-taken (SNT) state, they are predicted as not taken. Local BP: Yeh and Patt propose a branch prediction scheme that uses the local history of the branch being predicted; this history helps in predicting the next branch. Here, the branch address bits are XORed with the local history to index a table of saturating counters, whose bias provides the prediction (Yeh and Patt 1991). Tournament BP: Z. Xie et al. present a tournament branch predictor that uses local and global predictors based on saturating counters per branch (Xie et al. 2013). The local predictor is a two-level table that keeps track of the local history of individual branches. The global predictor is a single-level table. Both provide a prediction, and a meta-predictor, a table of saturating counters indexed with the branch address, selects between the two. Gshare BP: S. McFarling proposes a shared-index scheme aiming at higher accuracy (McFarling 1993). The gshare scheme is otherwise the same as the bimodal scheme: the global history register bits are XORed with the bits of the program counter (PC) to point to the pattern history table (PHT) entry, whose value gives the prediction. However, aliasing is the major factor reducing its prediction accuracy. YAGS BP: Eden and Mudge present yet another global scheme (YAGS), which is a hybrid of bimodal and direction PHTs. The bimodal scheme stores the bias, and the direction PHTs store the traces of a branch only when it does not agree with the bias (Eden and Mudge 1998). This reduces the information that would otherwise be stored in the direction PHT tables.
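A compact Python sketch of the two building blocks just described: a table of two-bit saturating counters (the bimodal core) and gshare's XOR of global history with PC bits. It assumes 4-byte-aligned instruction addresses.

```python
class TwoBitCounter:
    """States: 0 = SNT, 1 = WNT, 2 = WT, 3 = ST."""
    def __init__(self):
        self.state = 1                          # start weakly not taken
    def predict(self):
        return self.state >= 2                  # taken if WT or ST
    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

class Gshare:
    def __init__(self, bits=12):
        self.bits, self.hist = bits, 0
        self.pht = [TwoBitCounter() for _ in range(1 << bits)]
    def _index(self, pc):                       # global history XOR PC bits
        return ((pc >> 2) ^ self.hist) & ((1 << self.bits) - 1)
    def predict(self, pc):
        return self.pht[self._index(pc)].predict()
    def update(self, pc, taken):                # train counter, shift in outcome
        self.pht[self._index(pc)].update(taken)
        self.hist = ((self.hist << 1) | int(taken)) & ((1 << self.bits) - 1)
```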
TAGE, L-TAGE, ISL-TAGE, TAGE-LSC: Seznec and Michaud implement the TAgged GEometric history length (TAGE) predictor in Seznec and Michaud (2006). It improves upon Michaud's PPM-like tag-based branch predictor. A prediction is made with the longest hitting entry in the partially tagged tables, whose history lengths increase according to the geometric series

$\mathrm{Length}(i) = \mathrm{int}(\alpha^{i-1} \times \mathrm{Length}(1) + 0.5)$   (5.1)

so the history length grows geometrically with i. The table entries are allocated in an optimized way, making the predictor very space efficient. The update policy increments/decrements the useful ("U") counter when the final prediction is correct/incorrect, respectively. The "U" counters are reset periodically to avoid any entry being marked as useful forever. When the prediction is made by a newly allocated entry, it is not considered, since new entries need some training time before they predict correctly; as a result, the alternate prediction is taken as the final outcome. This branch predictor is better in terms of accuracy. Also, partial tagging is cost efficient and hence can be used by predictors employing global history lengths. The L-TAGE predictor was presented at CBP-2 (Seznec 2007). It is a hybrid of TAGE and a loop predictor. Here, the loop predictor identifies branches that are regularly occurring loops with a fixed number of iterations. When a loop has been executed successively three times with a constant number of iterations, the loop predictor provides the prediction. A. Seznec also presents the ISL-TAGE and TAGE-LSC predictors, which incorporate a statistical corrector (SC) and a loop predictor (Seznec 2016; Seznec 2011). Perceptron: Jimenez et al. implement perceptron predictors based on neural networks (Jiménez and Lin 2001). The perceptron model is a vector of weights (w), which are signed integers and give the amount of correlation between the branches and the inputs (x). The Boolean function of previous outcomes (1 = taken and −1 = not taken) from the global history shift register is the input to the perceptron. The outcome is calculated as the dot product of the weight vector w_0, w_1, ..., w_n and the input vector x_0, x_1, ..., x_n, where x_0 is the bias input, always set to 1. The outcome P is given by

$P = w_0 + \sum_{i=1}^{n} w_i x_i$   (5.2)
A positive or negative value of P indicates that the branch is predicted as taken or not taken, respectively.
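A minimal perceptron-predictor sketch following Eq. (5.2) is given below. The training threshold θ = int(1.93n + 14) follows Jiménez and Lin (2001), while the table size and PC hashing are simplifications of ours.

```python
import numpy as np

class PerceptronPredictor:
    def __init__(self, n_hist=16, n_rows=1024):
        self.w = np.zeros((n_rows, n_hist + 1), dtype=np.int64)  # w[0] = bias w0
        self.x = np.ones(n_hist, dtype=np.int64)                 # history, +/-1
        self.theta = int(1.93 * n_hist + 14)                     # training threshold

    def _row(self, pc):
        return (pc >> 2) % self.w.shape[0]

    def output(self, pc):                       # Eq. (5.2): P = w0 + sum(wi * xi)
        w = self.w[self._row(pc)]
        return int(w[0] + w[1:] @ self.x)

    def predict(self, pc):
        return self.output(pc) >= 0             # non-negative P -> predict taken

    def update(self, pc, taken):
        y = self.output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.w[self._row(pc)]
            w[0] += t                           # train bias and correlating weights
            w[1:] += t * self.x
        self.x = np.roll(self.x, 1)
        self.x[0] = t                           # shift the outcome into history
```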
5.4 Architectural Modeling

This section covers the performance and power modeling using gem5 and McPAT, respectively, along with the system configuration, and further describes the Rodinia benchmark suite.
5.4.1 Performance and Power Modeling

The gem5 simulator (2018) is a cycle-approximate simulation framework that supports multiple ISAs, CPU models, branch prediction schemes, and memory models including cache-coherence protocols, interconnects, and memory controllers. The architectural specifications are given in Table 5.1. The statistics and configuration files produced by gem5 are utilized by the integrated platform of multicore power, area, and timing (McPAT) for estimating the power consumption. The power model version used is McPAT v1.0.
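For orientation, a typical experiment flow along these lines might be scripted as below. The gem5 big.LITTLE example script and its flags match recent gem5 trees, but the stats-to-McPAT converter name and all paths are assumptions to be adapted to a local setup; this is a sketch, not the authors' exact tooling.

```python
import subprocess

# Full-system big.LITTLE simulation (paths/flags assumed; check your gem5 tree).
subprocess.run(["build/ARM/gem5.opt", "configs/example/arm/fs_bigLITTLE.py",
                "--big-cpus", "2", "--little-cpus", "2",
                "--kernel", "vmlinux", "--disk", "rodinia.img",
                "--bootscript", "run_heartwall.rcS"], check=True)

# Convert gem5 output to a McPAT input XML (the converter name is hypothetical),
# then estimate power with McPAT's -infile/-print_level options.
subprocess.run(["python", "gem5_to_mcpat.py", "m5out/stats.txt",
                "m5out/config.json", "-o", "power.xml"], check=True)
subprocess.run(["./mcpat", "-infile", "power.xml", "-print_level", "2"],
               check=True)
```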
Table 5.1 Architectural specifications

General specifications
  Total cores         4
  ISA                 Linux ARM system
  CPU type            Exynos
  Cache_line_size     64 bytes

Parameters                  big           LITTLE
Type                        DerivO3       Minor
Total CPUs                  2             2
Fetchwidth                  3             0
NumPhysFloatRegs            256           0
Pipeline                    19 stages     4 stages
I cache size                32 kB         32 kB
I cache associativity       2-way         2-way
D cache size                32 kB         32 kB
D cache associativity       2-way         2-way
L2 latency                  45 cycles     27 cycles
Branch mispred penalty      13 cycles     3 cycles
BTBEntries                  4096          4096
BTBTagSize                  18 bits       16 bits
5.4.2 System Configuration

The system is configured with Ubuntu 16.04 OS on a vmlinux kernel for the ARM ISA.
5.4.3 Applications of Rodinia

The Rodinia benchmark suite (Che et al. 2009) for heterogeneous platforms is used to study the effect of the branch prediction schemes. We have used the OpenMP workloads of the Rodinia suite; their problem sizes are given in Table 5.2. The workloads comprise the following. Heart Wall removes speckles from an image without impairing its features; it uses structured grids. k-Nearest Neighbors comprises dense linear algebra. Number of boxes 1D (lavaMD) estimates the potential of particles and relocates them within a large 3D space due to the mutual forces among them.
Table 5.2 Benchmark parameters

Application             Acronym          Problem size
Heart Wall              heartwall        test.avi, 2 frames
k-Nearest Neighbors     nn               filelist.txt, 13,500 data-points
Number of boxes 1D      lavaMD           2 1D boxes
Particle Filter         particlefilter   10,000 data-points (x = 128, y = 128, z = 10)
HotSpot                 hotspot          512 x 512 data-points
Needleman-Wunsch        nw               2048 data-points
Pathfinder              pathfinder       100,000 x 100
HotSpot 3D              hotspot3D        512 x 512 x 512 data-points
SRAD                    srad             100 x 502 x 458 data-points
Breadth-First Search    bfs              graph1M.txt (1,000,000 data-points)
Myocyte                 myocyte          256 data-points
Back-propagation        backprop         512 input nodes
Stream Cluster          SC               512 data-points, 256 dimensions
K-means                 kmeans           kdd_cup
Btree                   b+tree           256 data-points
Particle Filter is a probabilistic estimator of a target position given noisy measurements of the position. HotSpot and HotSpot 3D evaluate the processor temperature. Needleman-Wunsch performs DNA sequence alignment. Pathfinder estimates the time to discover a path. SRAD is a regular, grid-structured application used in the ultrasonic and radar image-processing domains. Breadth-First Search is a graph algorithm used to traverse graphs with millions of vertex points. Myocyte is used in medical research to model cardiac muscle cells and simulate their behavior. Back-propagation is a neural learning algorithm in which the actual output is compared with the requested value; the difference is then propagated back, and the weights are updated accordingly. Stream Cluster, given a group of data points, finds a predetermined number of medians so that every point is mapped to its closest center. K-means identifies groups of points by relating each data point with its nearest cluster; new cluster centers are then estimated, and the procedure iterates until convergence. Btree supports deleting and inserting nodes in a graph by traversing it. The disk image is mounted and loaded with the applications. A bootscript is written to execute the applications; it also commands m5 to record the statistics of the simulated system once the application has executed.
5.5 Experimental Results

This section discusses the various parameters used for comparing the branch predictors, along with the performance and power analysis.
5.5.1 Parameters Used for Comparison

Percentage of conditional branch mispredictions: the percentage of conditional branches predicted incorrectly out of the total conditional branches.
Overall misprediction percentage: the percentage of mispredicted branches out of the total branches encountered.
Instructions per cycle (IPC): a major aspect of a processor's performance; the average number of instructions executed per clock cycle.
Simulation time in seconds: the total simulated time for an application.
History length: the size of the history tables used to store the global or local history of the branches.
Percentage of power dissipated: the percentage of processor power dissipated by the branch predictor.
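These quantities can be read off a gem5 statistics dump. In the sketch below, the stat-key names are assumptions, since they vary across gem5 versions and CPU models; the percentage formulas are the ones just defined.

```python
def load_stats(path="m5out/stats.txt"):
    """Read scalar statistics from a gem5 stats dump into a dict."""
    stats = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                try:
                    stats[parts[0]] = float(parts[1])
                except ValueError:
                    pass                          # skip non-scalar lines
    return stats

s = load_stats()
bp = "system.bigCluster.cpus0.branchPred"         # assumed stat prefix
cond_mispred = 100 * s[bp + ".condIncorrect"] / s[bp + ".condPredicted"]
# Unconditional/indirect mispredictions would be added to the numerator for
# the overall percentage, using the corresponding (version-specific) stats.
overall_mispred = 100 * s[bp + ".condIncorrect"] / s[bp + ".lookups"]
ipc = s["system.bigCluster.cpus0.ipc"]            # assumed IPC stat name
print(cond_mispred, overall_mispred, ipc)
```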
5.5.2 Performance Analysis

The performance analysis of the branch predictors is carried out for a minimum of 1 million branch instructions. The overall conditional branch mispredictions are shown in Fig. 5.2. From Table 5.3, the local BP has the minimum accuracy of 94.7%, with a misprediction rate of 5.29%, while perceptron and TAGE-LSC have misprediction rates of 3.43% and 3.40%, respectively. This gives TAGE-LSC and perceptron the highest accuracy of 96.6% for conditional branch predictions at a fixed history length of 16 kb. Mispredictions in the popular PC-indexed predictors with 2-bit counters are mainly due to destructive interference, or aliasing. Another reason is that a branch may require local history, global history, or both in order to be predicted correctly. TAGE predictors, on the other hand, handle branches with long histories; they employ multiple tagged components with various folded histories. TAGE-LSC outperforms the L-TAGE and ISL-TAGE predictors, as those predictors cannot accurately predict branches statistically biased towards a given direction. For certain branches, their performance is worse than a simple PC-indexed table with two-bit saturating counters.
Fig. 5.2 Percentage of conditional branch mispredictions per application at a fixed history length of 16 kb
Table 5.3 Performance analysis

Branch predictor   % conditional mispredictions   % overall mispredictions   IPC
L-TAGE             3.56                           1.16                       0.99
Tournament         4.88                           2.63                       0.79
Bimodal            5.05                           3.12                       0.71
Local              5.29                           3.27                       0.68
YAGS               4.92                           2.66                       0.82
gshare             4.82                           2.68                       0.79
Perceptron         3.43                           0.98                       1.13
TAGE-LSC           3.40                           0.97                       1.14
ISL-TAGE           3.50                           1.12                       1.03
Fig. 5.3 Percentage of overall branch mispredictions comprising the conditional and unconditional branches per application at a fixed history length of 16 kb
TAGE-LSC incorporates a multi-GEHL statistical corrector (SC), which handles this class of statistically correlated branches. The multi-GEHL SC can handle various path and branch histories, including local and global history, very accurately. The perceptron predictor also achieves very high accuracy at a history length of 16 kb for the various applications executed. Figure 5.3 shows that TAGE-LSC, perceptron, ISL-TAGE, and L-TAGE have on average 0.97%, 0.98%, 1.12%, and 1.16% overall branch mispredictions, respectively, when tested on applications having a minimum of 1 M branch instructions. The tournament BP, YAGS, and gshare have misprediction percentages of 2.63%, 2.66%, and 2.68%, and the local BP has the highest overall misprediction percentage of 3.27%. Mispredictions have an impact on the IPC: the higher the misprediction rate, the more stalls in the pipeline and the larger the drop in IPC, i.e., more cycles are wasted per instruction. As seen in Fig. 5.4, the TAGE-LSC and perceptron predictors have the least overall misprediction percentage, and their overall IPC of 1.14 and 1.13, respectively, is higher than that of the other predictors.
Fig. 5.4 Instructions Per Cycle (IPC) per application at a fixed history length of 16 kb
Fig. 5.5 Simulation time per application at a fixed history length of 16 kb
Figure 5.5 shows the execution time in seconds per application of the Rodinia suite for the simulated ARM big.LITTLE architecture under each branch prediction scheme. It is observed that perceptron and TAGE-LSC have the least execution time for almost all applications in the suite, while the local BP has the maximum execution time. Figures 5.6 and 5.7 show the performance of the branch prediction schemes as the history length is varied. The lavaMD benchmark is used to study the effect of variations in history length for the popular branch predictors, as it has the highest utilization of the branch predictor among the applications of the Rodinia suite. It is seen that as the history length of the branch predictors is increased, the misprediction percentage drops. In the case of the L-TAGE predictor, however, the drop is not significant, which shows that L-TAGE is robust to changes in the geometric history lengths. The overall branch misprediction rate comprising the conditional and unconditional branches for varying history lengths is found to be 2.7701%. Seznec, who implemented L-TAGE, also states that the overall misprediction rate is within 2% for any minimum value of history length in the interval [2–5] and any maximum value between [300 and 1000] (Seznec 2007).
Fig. 5.6 Percentage conditional branch mispredictions with lavaMD application for varying history lengths
Fig. 5.7 Percentage of overall branch mispredictions comprising the conditional and unconditional branches with lavaMD application for varying history lengths
In other words, L-TAGE, just like the TAGE and O-GEHL predictors, is not sensitive to the history length parameters (Seznec 2005; Seznec and Michaud 2006). The ISL-TAGE, perceptron, and TAGE-LSC predictors have overall branch misprediction rates of 2.51%, 2.37%, and 1.9%, respectively, for varying history lengths. The perceptron predictor provides the best performance for history lengths below 16 kb and eventually attains a constant misprediction rate as the history length is increased beyond 16 kb. Beyond this history length, TAGE-LSC provides higher accuracy than the perceptron and the other predictors. The reason for the lower performance of the L-TAGE, ISL-TAGE, and TAGE-LSC predictors at reduced history lengths is insufficient correlation from remote branches, resulting in negative interference. Also, reducing the local history bits fails to detect branches which exhibit loop behavior. However, the computational complexity involved in the perceptron and TAGE predictors is high in comparison to the popular PC-indexed 2-bit predictors. Table 5.3 summarizes the performance analysis results for the percentage of conditional mispredictions, percentage of overall mispredictions, and IPC for better readability.
Fig. 5.8 Percentage of power dissipated by the predictors per application
5.5.3 Power Analysis

Figure 5.8 shows the power results for the branch predictors incorporated in the ARM big.LITTLE architecture. The average power consumed by the L-TAGE, ISL-TAGE, and TAGE-LSC branch predictors is 5.89%, 5.91%, and 5.98%, respectively, across the various workloads. This is high in comparison to the other branch predictors; the reason is that design complexity grows with the number of components, which increases the silicon area and power dissipation in the processor (Seznec 2007). The perceptron predictor consumes 4.7% of the processor power, while the minimum average power of 3.77% is dissipated by the local BP unit. Note that the power estimation is based on a simulation, which can incur abstraction errors and reflects approximate levels of power utilization by the predictor.
5.6 Conclusion

An exhaustive analysis of various branch prediction schemes was carried out for power and performance using the McPAT and gem5 frameworks, respectively. It is observed that TAGE-LSC and perceptron have the highest prediction accuracy among the simulated predictors. The perceptron predictor performs efficiently at a reduced resource budget and history length, while TAGE-LSC outperforms it at higher history lengths and an increased resource budget. In an ARM big.LITTLE architecture, the big cores, where high performance is desired, can be incorporated with the TAGE-LSC predictor, and the LITTLE cores can be built with the perceptron predictor, which achieves high accuracy and power efficiency at reduced budget and power requirements. The local branch predictor dissipates the minimum power, but its accuracy is much lower.
Acknowledgements We would like to thank all those who have dedicated their time in research related to the branch predictors and have contributed to the gem5 and McPAT simulation frameworks.
References

ARM Limited (2014) big.LITTLE technology: the future of mobile. Low power-high performance white paper
Butko A, Bruguier F, Gamatié A (2016) Full-system simulation of big.LITTLE multicore architecture for performance and energy exploration. In: 2016 IEEE 10th international symposium on embedded multicore/many-core systems-on-chip (MCSoC). IEEE, Lyon, pp 201–208
Butko A, Gamatié A, Sassatelli G (2015) Design exploration for next generation high-performance manycore on-chip systems: application to big.LITTLE architectures. In: 2015 IEEE computer society annual symposium on VLSI. IEEE, Montpellier, pp 551–556
Che S, Boyer M, Meng J (2009) Rodinia: a benchmark suite for heterogeneous computing. In: 2009 IEEE international symposium on workload characterization (IISWC). IEEE, Austin, pp 44–54
Eden AN, Mudge T (1998) The YAGS branch prediction scheme. In: Proceedings of the 31st annual ACM/IEEE international symposium on microarchitecture. ACM/IEEE, Dallas, pp 169–177
Jiménez DA, Lin C (2001) Dynamic branch prediction with perceptrons. In: Proceedings of the 7th international symposium on high-performance computer architecture (HPCA '01). ACM/IEEE, Mexico, p 197
Kotha A (2007) Electrical & computer engineering research works. Digital repository at the University of Maryland (DRUM). https://drum.lib.umd.edu/bitstream/handle/1903/16376/branch_predictors_tech_report.pdf?sequence=3&isAllowed=y. Cited 10 Dec 2007
Lee CC, Chen ICK, Mudge TN (1997) The bi-mode branch predictor. In: Proceedings of 30th annual international symposium on microarchitecture. ACM/IEEE, USA, pp 4–13
McFarling S (1993) Combining branch predictors. TR, Digital Western Research Laboratory, California, USA
Sandberg A, Diestelhorst S, Wang W (2017) Architectural exploration with gem5. In: ARM Research
Seznec A (2005) Analysis of the O-GEometric history length branch predictor. ACM SIGARCH Comput Archit News 33(2):394–405
Seznec A (2007) A 256 kbits L-TAGE branch predictor. The second championship branch prediction competition (CBP-2). J Instruction-Level Parallelism (JILP) 9(1):1–6
Seznec A (2016) TAGE-SC-L branch predictors again. In: 5th JILP workshop on computer architecture competitions (JWAC-5) 5(1)
Seznec A, Michaud P (2006) A case for (partially) TAgged GEometric history length branch prediction. J Instruction Level Parallelism 8(1):1–23
Seznec A (2011) A 64 Kbytes ISL-TAGE branch predictor. In: Workshop on computer architecture competitions (JWAC-2): championship branch prediction
Sparsh M (2018) A survey of techniques for dynamic branch prediction. CoRR. arXiv:abs/1804.00261
Sprangle E, Carmean D (2002) Increasing processor performance by implementing deeper pipelines. In: Proceedings of the 29th annual international symposium on computer architecture (ISCA). IEEE, USA, pp 25–34
The gem5 simulator. Homepage. http://gem5.org/Main_Page. Last accessed 12 Jan 2018
Xie Z, Tong D, Cheng X (2013) An energy-efficient branch prediction technique via global-history noise reduction. In: International symposium on low power electronics and design (ISLPED). ACM, Beijing, pp 211–216
Yeh T, Patt Y (1991) Two-level adaptive training branch prediction. In: Proceedings of the 24th annual international symposium on microarchitecture. ACM, New York, pp 51–61
Chapter 6
Global Feature Representation Using Squeeze, Excite, and Aggregation Networks (SEANet)
Akhilesh Pandey, Darshan Gera, D. Gunasekar, Karam Rai, and S. Balasubramanian
Abstract Convolutional neural networks (CNNs) are workhorses of deep learning. A popular CNN architecture is the Residual Net (ResNet), which emphasizes learning a residual mapping rather than directly fitting input to output. Subsequent to ResNet, the Squeeze and Excitation Network (SENet) introduced a squeeze and excitation block (SE block) on every residual mapping of ResNet to improve its performance. The SE block quantifies the importance of each feature map and weights them accordingly. In this work, we propose a new architecture, SEANet, built over SENet by introducing an aggregate block after the SE block. We choose sum as the aggregate operation. The aggregation helps in minimizing redundancies in feature representation and provides a global feature representation across feature maps by downsampling their number. We demonstrate the superior performance of our SEANet over ResNet and SENet on the benchmark CIFAR-10 and CIFAR-100 datasets. Specifically, SEANet reduces the classification error rate on CIFAR-10 by around 2% and 3% over ResNet and SENet, respectively. On CIFAR-100, SEANet reduces the error by around 5% and 9% when compared against ResNet and SENet. Further, SEANet outperforms the recent EncapNet and both its variants, EncapNet+ and EncapNet++, on the CIFAR-100 dataset by around 2%.
6.1 Introduction
With advancements in deep learning models, many real-world problems that had remained open for the past few decades are being solved. Deep learning is used in a wide range of applications, from image detection and recognition to security and surveillance. One of the major advantages of deep learning models is that
A. Pandey (B) · D. Gunasekar · K. Rai · S. Balasubramanian Department of Mathematics and Computer Science (DMACS), Sri Sathya Sai Institute of Higher Learning (SSSIHL), Prasanthi Nilayam, Anantapur District 515134, India e-mail: [email protected]
D. Gera DMACS, SSSIHL, Brindavan Campus, Bengaluru 560067, India e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_6
they extract features on their own. Convolutional neural networks (CNNs) are used extensively to solve image recognition and image classification tasks. A convolutional layer basically learns a set of filters that help in extracting useful features. It learns powerful image descriptive features by combining the spatial and the channelwise relationships in the input. To enhance the performance of CNNs, recent research has explored three different aspects of networks, namely width, depth, and cardinality. It was found that deeper models could model complex input distributions much better than shallow models. With the availability of specialized hardware accelerators such as GPUs, it has become easy to train larger networks. Taking advantage of GPUs, continuous improvement in accuracy has been shown by recent models like VGGNet (Sercu et al. 2016), GoogLeNet (Szegedy et al. 2015), and Inception net (Szegedy et al. 2017). VGGNet showed that stacking blocks of the same shape gives better results. GoogLeNet showed that width plays an important role in improving the performance of a model. Xception (Chollet 2016) and ResNeXt (Xie et al. 2017) came up with the idea of increasing the cardinality of a neural network. They empirically showed that, apart from saving parameters, cardinality also increases the representation power more than width and depth do. But it was observed that deep models built by stacking up layers suffer from the degradation problem (He et al. 2016). The degradation problem arises when, after some iterations, the training error refuses to decrease, thereby giving high training and test errors. The reason behind the degradation problem is the vanishing gradient: as the model becomes larger, the propagated gradient becomes very small by the time it reaches the earlier layers, thereby making learning more difficult. The degradation problem was solved with the introduction of the ResNet (He et al. 2016) models, which stacked residual blocks along with skip connections to build very deep architectures. These gave better accuracy than their predecessors. ResNet performed very well and won the ILSVRC (Berg et al. 2010) challenge in 2015. Subsequently, the architecture that won the ILSVRC 2017 challenge was SENet (Hu et al. 2017). Unlike other CNN architectures that considered all feature maps to be equally important, SENet (Hu et al. 2017) quantifies the importance of each feature map adaptively and weighs them accordingly. The main architecture of SENet discussed in Hu et al. (2017) is built over a base ResNet by incorporating SE blocks. SE blocks can also be incorporated in other CNN architectures. Though SENet quantifies the importance of feature maps, it does not focus on redundancies across feature maps, nor does it provide a global representation across feature maps. In this work, we propose a novel architecture, namely SEANet, that is built over SENet. Following the SE block, we introduce an aggregate block that helps in providing a global feature representation and also minimizes redundancies in feature representation.
6.2 Related Work
Network engineering is an important vision research area, since well-designed networks improve performance for different applications. From the time LeNet (LeCun et al. 1998) was introduced, and since the renaissance of deep neural networks through AlexNet (Krizhevsky et al. 2012), a plethora of CNN architectures (He et al. 2016; Sercu et al. 2016; Szegedy et al. 2017) have come about to solve computer vision and image processing problems. Each architecture either focused on a fundamental problem associated with learning or improved existing architectures in certain aspects. For example, VGGNet (Sercu et al. 2016) eliminated the need to fine-tune certain hyperparameters like filter parameters and the activation function by fixing them. ResNet (He et al. 2016) emphasized that learning a residual mapping, rather than directly fitting input to output, eases training in deep networks. Inception Net (Szegedy et al. 2017) highlighted sparsity of connections, adding to the existing advantage of convolutions by proposing to use a set of filters of different sizes at different layers, with lower layers having more small-size filters and higher layers having more large-size filters. On top of the ResNet architecture, various models such as WideResNet (Zagoruyko and Komodakis 2016) and Inception-ResNet (Szegedy et al. 2017) have been proposed recently. WideResNet (Zagoruyko and Komodakis 2016) proposes a residual network having a larger number of convolutional filters with reduced depth. PyramidNet (Han et al. 2017) builds on top of WideResNet by gradually increasing the width of the network. ResNeXt (Xie et al. 2017) uses grouped convolutions and showed that cardinality leads to improved classification accuracy. Huang et al. (2017) proposed a new architecture, DenseNet. It concatenates the input features with the output features iteratively, thus enabling each convolution block to receive raw information from all previous blocks. A gating mechanism was introduced in highway networks (Srivastava et al. 2015) to regulate the flow of information along shortcut connections. Since deep and wide architectures incur high computational cost and memory requirements, lightweight models like MobileNet (Howard et al. 2017) and ShuffleNet (Zhang et al. 2018) use depthwise separable convolutions. Most of these network engineering methods focus primarily on depth, width, cardinality, or making computationally efficient models. Another important aspect in designing networks to improve CNN performance is the attention mechanism, inspired by the human visual system. Humans focus on salient features in order to capture visual structure. However, in all the above-mentioned architectures, all the feature maps in a layer are considered equally important and passed on to the next layer; none of these architectures emphasize the importance of individual feature maps. The Residual Attention Network (Wang et al. 2017) proposed by Wang et al. uses an encoder-decoder style attention module. Instead of directly computing the 3D attention map, they divided the process to learn channel attention and spatial attention separately. This separate channel and spatial attention process is less computationally expensive and has less parameter overhead, due to which it can be used as a plug-and-play module with any CNN architecture. The network not only performs well but is also robust to noisy inputs by refining the feature
maps. In 2018, Hu et al. introduced the Squeeze and Excitation (SE) block in their work SENet (Hu et al. 2017) to compute channelwise attention, wherein (i) the squeeze operation compresses each feature map to a scalar representation using global average pooling that is subsequently mapped to weights, and (ii) the excite operation excites the feature maps using the obtained weights. This architecture won the ILSVRC 2017 challenge. In this work, we propose an architecture, namely SEANet, which is built over SENet (Hu et al. 2017). Following the SE block, we introduce an aggregate block that helps in global feature representation by downsampling the number of feature maps through a simple sum operation. In Sect. 6.3, an overview of SENet is provided. The proposed architecture SEANet is elucidated in detail in Sect. 6.4. Results and analysis are discussed in Sect. 6.5.
6.3 Preliminaries
Squeeze-and-Excitation Networks (SENet) There are two issues with the existing models in the way they apply the convolution operation on inputs. Firstly, the receptive field has information only about the local region, because of which the global information is lost. Secondly, all the feature maps are given equal weight, but some feature maps may be more useful for the next layers than others. SENet (Hu et al. 2017) proposes a novel technique to retain global information and also dynamically re-calibrate the feature map inter-dependencies. The following two subsections explain these two problems in detail and how SENet (Hu et al. 2017) addresses them using the squeeze and excitation operations. Squeeze Operation: Since each of the learned filters operates within a local receptive field, each unit of the output is deprived of the contextual information outside the local region. The smaller the receptive field, the less contextual information is retained. The issue becomes more severe when the network under consideration is very large and the receptive field used is small. SENet (Hu et al. 2017) solves this problem by finding a means to extract the global information and then embed it into the feature maps. Let U = [u_1, u_2, ..., u_c] be the output obtained from the previous convolutional layer. The global information is obtained by applying global average pooling to each channel u_p of U to obtain the channelwise statistics Z = [z_1, z_2, ..., z_c], where z_k, the kth element of Z, is computed as:

$z_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_k(i, j)$    (6.1)
1 http://image-net.org/challenges/LSVRC/.
Fig. 6.1 A SE-ResNet module
Equation (6.1) is the squeeze operation, denoted by F_sq. We see that Z obtained in this way captures the global information for each feature map. Hence, the first problem is addressed by the squeeze operation. Excitation Operation: Once the global information is obtained from the squeeze step, the next step is to embed the global information into the output. The excitation step basically multiplies the output feature maps by Z. But simply multiplying the feature maps with the statistics Z would not address the second problem. So, in order to re-calibrate the feature map inter-dependencies, the excitation step uses a gating mechanism consisting of a network (shown in Fig. 6.1) with two fully connected layers having sigmoid as the output layer. The excitation operation can be expressed mathematically as follows:
1. Let the network be represented as a function F_ex.
2. Let U be the input to the network.
3. Let X be the output of F_ex.
4. Let FC_1 and FC_2 be fully connected layers with weights W_1 and W_2, respectively (biases are set to 0 for simplicity).
5. Let δ be the ReLU (Qiu et al. 2017) activation applied to the output of FC_1, and σ be the sigmoid activation applied after the FC_2 layer.
Then the output of the excitation operation can be expressed by the following equation:

$S = F_{ex}(Z, W) = FC_2(FC_1(Z)) = \sigma(W_2(\delta(W_1 Z)))$    (6.2)
where Z is the statistics obtained from the squeeze operation. The output S of the network F_ex is a vector of probabilities having the same shape as the input to the network. The final output of the SE block is obtained by scaling each element of U by the corresponding elements of S, i.e., X̃ = [x̃_1, x̃_2, ..., x̃_c], where x̃_p = s_p * u_p. Thus, the feature maps are now dynamically re-calibrated.
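To make the squeeze and excitation operations concrete, the following is a minimal PyTorch sketch of an SE block implementing Eqs. (6.1) and (6.2); the reduction ratio between the two fully connected layers is an illustrative assumption, as the chapter does not specify it.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (Eq. 6.1) and excitation (Eq. 6.2) over channel feature maps."""
    def __init__(self, channels, reduction=16):  # reduction ratio assumed
        super().__init__()
        # biases are set to 0 for simplicity, as in the text
        self.fc1 = nn.Linear(channels, channels // reduction, bias=False)
        self.fc2 = nn.Linear(channels // reduction, channels, bias=False)

    def forward(self, u):                        # u: (B, C, H, W)
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                   # squeeze: global average pool
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))  # excitation gates
        return u * s.view(b, c, 1, 1)            # re-calibration: x~_p = s_p * u_p
```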
6.4 Proposed Architecture
Our proposed architecture is called SEANet, as it is built on top of SENet using an aggregate operation. A PyTorch implementation of SENet2 was taken and modified, using SENet (ResNet-20) as the base, with the proposed aggregate block. The depth of the proposed architecture is similar to that of SENet, but it gives better accuracy than SENet. The SEANet model differs from the SENet model in two major ways:
1. The number of feature maps in each block is increased in SEANet. In SENet, this number varies as 16, 32, and 64, while in SEANet we set it as 128, 256, and 512. We increased the number of feature maps because the more feature maps there are, the stronger the effect of aggregation (downsampling) in the global representation of features. Since we rely on deep features, as hand-engineering the features is infeasible, it is better to have a large number of feature maps prior to aggregation. Note that for a fair comparison with SENet, we also increased the number of feature maps in SENet to 128, 256, and 512 (see Sect. 6.5.2).
2. The SE block is followed by an aggregate block.
The complete architecture of the SEANet model is depicted in Fig. 6.2. The aggregate block operates in two steps. The block is fed with a user-defined parameter called the aggregate factor, denoted by k. The two steps are:
Step 1: Incoming feature maps from the SE block are divided into multiple groups using k. If C = [c_1, c_2, c_3, ..., c_n] are the n incoming feature maps from the SE block, then
2 https://github.com/moskomule/senet.pytorch.
Fig. 6.2 Architecture of SEANet based on SE-ResNet-20
G_1 = [c_1, c_2, ..., c_k]
G_2 = [c_{k+1}, c_{k+2}, ..., c_{2k}]
...
G_i = [c_{(i-1)k+1}, c_{(i-1)k+2}, ..., c_{ik}]
...
G_{n/k} = [c_{((n/k)-1)k+1}, c_{((n/k)-1)k+2}, ..., c_n]    (6.3)
Fig. 6.3 Aggregate block
are n/k mutually exclusive groups of feature maps. For example, in Fig. 6.2, after the first three residual blocks and SE block, the number of incoming feature maps is 128. With aggregate factor k = 4, these feature maps are partitioned into 32 groups with each group having 4 feature maps. Step 2: Each group is downsampled to a single feature map using the aggregate operation sum. That is, Si = aggr egate(G i ) =
ik
cj.
(6.4)
j=((i−1)k)+1
Figure 6.3 depicts the aggregate operation. The effect of downsampling by aggregation is to remove redundant representations and obtain a global feature representation across the feature maps in each group. To understand this, let us assume that we have an RGB image. We combine/aggregate information from each of the R, G, and B maps to output a single grayscale feature map. This way, we move away from the local color information of each individual map to a global grayscale map. Further, the aggregation downsampled three feature maps into one grayscale map, thereby implicitly eliminating redundancies in representing a pixel. Our idea is to extend this principle through the layers of a deep network. Just as batch normalization extends the idea of normalizing the input to normalizing all activations, so does our SEANet extend the aforementioned principle through the layers of a deep network. The advantages of such extension by aggregation in our SEANet are manifold:
1. Redundancy in the representation of features is minimized.
2. A global representation across feature maps is obtained.
3. With sum as the aggregate operation, significant gradient flows back during backpropagation, as sum distributes its incoming gradient to all its operands.
4. Significant improvement in performance.
It is to be noted that many other aggregation operations, including min and max, are available, but sum performed the best. Further, one may argue that the number
of feature maps can be downsampled by keeping only the important ones, where importance is provided by the SE block. But we observed that this idea drastically pulls down the performance.
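As a sketch of the aggregate block of Eqs. (6.3) and (6.4), the grouping and summation can be written in a few lines of PyTorch; this assumes the channel count n is divisible by the aggregate factor k, as in the 128/256/512 configurations described above.

```python
import torch

def aggregate_block(x, k=4):
    """Aggregate block: partition the n feature maps into n/k consecutive
    groups (Eq. 6.3) and sum each group to a single map (Eq. 6.4)."""
    b, n, h, w = x.shape
    groups = x.view(b, n // k, k, h, w)   # G_i = [c_{(i-1)k+1}, ..., c_{ik}]
    return groups.sum(dim=2)              # S_i = sum over each group

# e.g., 128 incoming maps with k = 4 give 32 output maps, as in Fig. 6.2
y = aggregate_block(torch.randn(1, 128, 32, 32), k=4)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```

Because the sum shares its incoming gradient equally among its operands, every feature map in a group receives a gradient during backpropagation, which is the third advantage listed above.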
6.5 Results and Analysis
6.5.1 Datasets
We chose two benchmark datasets, namely CIFAR-10 (Krizhevsky et al. 2014a) and CIFAR-100 (Krizhevsky et al. 2014b), for our experiments. The CIFAR-10 dataset3: The CIFAR-10 dataset consists of 60,000 color images, each of size 32 × 32. Each image belongs to one of ten mutually exclusive classes, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The dataset is divided into a training set and a test set. The training set consists of 50,000 images equally distributed across classes, i.e., 5000 images are randomly selected from each of the classes. The test set consists of 10,000 images, with 1000 randomly selected images from each class. The CIFAR-100 dataset4: The CIFAR-100 dataset is similar to CIFAR-10 except that it contains 100 classes instead of 10. The 100 classes are grouped into 20 superclasses. Each class consists of 600 images. The training set consists of 500 images from each of the 100 classes, and the test set consists of 100 images from each of the 100 classes. Before we demonstrate the superior performance of our architecture SEANet against state-of-the-art architectures, we provide the implementation details.
6.5.2 Implementation Details Data is preprocessed using per-pixel mean subtraction and padding by four pixels on all sides. Subsequent to preprocessing, data augmentation is performed by random cropping and random horizontal flipping. Model weights are initialized by Kaiming initialization (He et al. 2015). The initial learning rate is set to 0.1 and is divided by 10 after every 80 epochs. We trained for 200 epochs. The other hyper-parameters such as weight decay, momentum, and optimizer are set to 0.0001, 0.9 and stochastic gradient descent (SGD), respectively. We fixed the batch size to 64. No dropout (Srivastava et al. 2014) is used in the implementation. This setting remains the same across the datasets. The hyper-parameters used in our implementation are summarized in
3 https://www.cs.toronto.edu/kriz/cifar.html.
4 https://www.cs.toronto.edu/kriz/cifar.html.
Table 6.1 Hyper-parameters
Hyper-parameter | Value
Optimizer | SGD
Initial learning rate | 0.1
Batch size | 64
Weight decay | 1e-4
Momentum | 0.9
Number of epochs | 200
Table 6.2 Classification error (%) compared to ResNet
Architecture | CIFAR-10 | CIFAR-100
SEANet | 4.3 | 21.33
Res-20 | 8.6 | 32.63
Res-32 | 7.68 | 30.2
Res-44 | 6.43 | 26.85
Res-56 | 6.84 | 26.2
Res-110 | 6.34 | 26.67
Table 6.1. Our implementation is done in PyTorch (Paszke et al. 2019), and the code5 is made publicly available. We trained our model on a Tesla K-40 GPU, and training took around 22 h.
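A minimal sketch of this training setup in PyTorch, assuming the model and data loader are defined elsewhere, might look as follows; the loop body is schematic and only mirrors the hyper-parameters of Table 6.1.

```python
import torch
import torch.optim as optim

# model and train_loader (batch size 64) are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4)
# divide the learning rate by 10 after every 80 epochs (Table 6.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```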
6.5.3 Comparison with State of the Art
First, we compare the performance of our SEANet with ResNet and SENet. Table 6.2 enumerates the classification error rates on the CIFAR-10 and CIFAR-100 datasets for SEANet and the variants of ResNet. It is clearly evident that SEANet outperforms all variants of ResNet on both datasets. On CIFAR-10, we achieve the smallest error rate of 4.3%, which is better by around 2% than the best performing ResNet-110. Similarly, on CIFAR-100, we achieve the smallest error rate of 21.33%, which is better by around 5% than the best performing ResNet-56. It is to be noted that 1% on the CIFAR-10 and CIFAR-100 test sets corresponds to 100 images. Therefore, we perform better than ResNet on an additional 200 and around 500 images on CIFAR-10 and CIFAR-100, respectively. Table 6.3 compares the performance of SEANet against SENet. Again, SEANet outperforms SENet by around 3% and 9% on CIFAR-10 and CIFAR-100, respectively. The remarkable improvement in performance can be attributed to the presence of the additional aggregate block in SEANet. Figures 6.4 and 6.5 display the validation accuracy over epochs for ResNet, SENet, and SEANet on the CIFAR-10 and CIFAR-100 datasets, respectively.
5 https://github.com/akhilesh-pandey/SEANet-pytorch.
Table 6.3 Classification error (%) compared to SENet
Architecture | CIFAR-10 | CIFAR-100
SEANet | 4.3 | 21.33
SENet | 7.17 | 30.45

Table 6.4 SEANet versus modified ResNet and SENet
Architecture | Validation accuracy (CIFAR-10) | Validation accuracy (CIFAR-100) | # Parameters (CIFAR-10) | # Parameters (CIFAR-100)
SEANet | 95.7 | 78.67 | 16188330 | 16199940
ResNet* | 95.28 | 79.27 | 17159898 | 17206068
SENet* | 95.91 | 79.76 | 17420970 | 17467140
*Original models were modified by increasing the number of feature maps in each block
Fig. 6.4 Validation accuracy over epochs for ResNet, SENet, and SEANet on CIFAR-10 dataset
As mentioned earlier, SEANet uses 128, 256, and 512 feature maps in its blocks, unlike the standard ResNet-20 and SENet (with standard ResNet-20 as its base). For a fair comparison, we increased the feature maps in the blocks of standard ResNet-20 and SENet to 128, 256, and 512, respectively. Table 6.4 reports the performance of SEANet versus modified ResNet-20 and modified SENet. SEANet performs better than or on par with modified ResNet-20 and modified SENet on the CIFAR-10 dataset, while on CIFAR-100 it performs marginally lower. But it is to be noted that, due to downsampling by sum aggregation, the number of parameters in SEANet is smaller than the corresponding numbers in modified ResNet-20 and modified SENet. Specifically, SEANet has around 8% fewer parameters than the modified
Fig. 6.5 Validation accuracy over epochs for ResNet, SENet, and SEANet on CIFAR-100 dataset
ResNet-20 or modified SENet. This is a significant advantage for our architecture SEANet over modified ResNet-20 and modified SENet.
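The parameter counts reported in Table 6.4 can be reproduced with a standard PyTorch idiom; a sketch, assuming the three models have been instantiated:

```python
def count_parameters(model):
    # total number of trainable parameters, as reported in Table 6.4
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```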
Table 6.5 Classification error (%) compared to the state-of-the-art architectures
Architecture | CIFAR-10 | CIFAR-100
SEANet | 4.3 | 21.33
EncapNet (Li et al. 2018) | 4.55 | 26.77
EncapNet+ (Li et al. 2018) | 3.13 | 24.01
EncapNet++ (Li et al. 2018) | 3.10 | 24.18
GoodInit (Mishkin and Matas 2016) | 5.84 | 27.66
BayesNet (Snoek et al. 2015) | 6.37 | 27.40
ELU (Clevert et al. 2015) | 6.55 | 24.28
Batch NIN (Changa and Chen 2015) | 6.75 | 28.86
Rec-CNN (Liang and Hu 2015) | 7.09 | 31.75
Piecewise (Agostinelli et al. 2015) | 7.51 | 30.83
DSN (Lee et al. 2014) | 8.22 | 34.57
NIN (Lin et al. 2014) | 8.80 | 35.68
dasNet (Stollenga et al. 2014) | 9.22 | 33.78
Maxout (Goodfellow et al. 2013) | 9.35 | 38.57
AlexNet (Krizhevsky et al. 2012) | 11.00 | -
+ stands for mild augmentation and ++ stands for strong augmentation
Table 6.6 Effect of the reduction factor used for downsampling in the aggregate operation on CIFAR-10 and CIFAR-100 using SEANet (validation accuracy)
Reduction factor | CIFAR-10 | CIFAR-100
2 | 95.50 | 77.12
4 | 95.53 | 77.16
6 | 95.57 | 77.50
8 | 95.70 | 78.67
10 | 95.53 | 77.19
12 | 95.56 | 77.38
Table 6.5 compares SEANet against other state-of-the-art architectures. Note that EncapNet (Li et al. 2018) is a very recent network architecture, published in 2018. It has two improvised variants, viz. EncapNet+ and EncapNet++. Our SEANet outperforms EncapNet and both its variants on the complex CIFAR-100 dataset by around 2%. On CIFAR-10, SEANet is about 1% behind the variants of EncapNet, though it outperforms the base EncapNet by 0.25%. Further, SEANet performs better than all the other enumerated state-of-the-art architectures.
6.5.4 Ablation Study
The aggregate operation used in the proposed SEANet downsamples the number of feature maps by a reduction factor of 8 on both CIFAR-10 and CIFAR-100. We performed an ablation study to determine the effect of this reduction factor. The results are reported in Table 6.6 for reduction factors of 2, 4, 6, 8, 10, and 12. Clearly, downsampling by a factor of 8 gives the best performance on both datasets. If the reduction factor is too small, there is no elimination of redundant features, and if it is too large, there may be a loss of useful features.
6.5.5 Discussion
The proposed SEANet is able to eliminate redundancies in feature maps, and thereby reduce the number of feature maps, by using the simple aggregate operation of sum. Other aggregate operations like max and min were also tried but did not give significant improvement compared to sum. One possible future work could be to explore why some aggregate functions perform better than others.
6.6 Conclusion
A novel architecture named "SEANet" is proposed that emphasizes global representation of features and redundancy elimination in representing features. This is achieved by introducing an additional aggregate block over squeeze and excite. The aggregation operation deployed is sum. SEANet outperforms the recent state-of-the-art architectures on the CIFAR-10 and CIFAR-100 datasets by a significant margin. Also, the proposed architecture has fewer parameters in comparison with the corresponding ResNet-20 and SENet.
Acknowledgements We dedicate this work to our Guru and founder chancellor of Sri Sathya Sai Institute of Higher Learning, Bhagawan Sri Sathya Sai Baba. We also thank DMACS for providing us with all the necessary resources.
References
Agostinelli F, Hoffman M, Sadowski P, Baldi P (2015) Learning activation functions to improve deep neural networks. In: ICLR workshop
Berg A, Deng J, Fei-Fei L (2010) Large scale visual recognition challenge (ILSVRC), 2010. URL http://www.image-net.org/challenges/LSVRC3
Changa J, Chen Y (2015) Batch-normalized maxout network in network. arXiv preprint arXiv:1511.02583
Chollet F (2016) Xception: deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357
Clevert D, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units. arXiv preprint arXiv:1511.07289
Gao H, Zhuang L, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. CVPR 1(2):3
Goodfellow IJ, Warde-Farley D, Mirza M, Courville A, Bengio Y (2013) Maxout networks. arXiv preprint arXiv:1302.4389
Han D, Kim J, Kim J (2017) Deep pyramidal residual networks. In: Proceedings of computer vision and pattern recognition (CVPR)
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision
He K et al (2016a) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition
He K, Zhang X, Ren S, Sun J (2016b) Identity mappings in deep residual networks. In: European conference on computer vision. Springer, Cham, pp 630–645
Howard AG, Zhu M, Chen B et al (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861
Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ (2016) Deep networks with stochastic depth. In: European conference on computer vision. Springer, Cham, pp 646–661
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 7132–7141
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: NIPS, pp 1106–1114
Krizhevsky A, Nair V, Hinton G (2014a) The CIFAR-10 dataset. Online: https://www.cs.toronto.edu/~kriz/cifar.html
Krizhevsky A, Nair V, Hinton G (2014b) The CIFAR-100 dataset. Online: https://www.cs.toronto.edu/kriz/cifar.html
LeCun Y et al (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Lee C, Xie S, Gallagher P, Zhang Z, Tu Z (2014) Deeply-supervised nets. arXiv preprint arXiv:1409.5185
Liang M, Hu X (2015) Recurrent convolutional neural network for object recognition. In: The IEEE conference on computer vision and pattern recognition (CVPR)
Li H, Gou X, Dai B, Ouyang W, Wang X (2018) Neural network encapsulation. arXiv preprint arXiv:1808.03749
Lin M, Chen Q, Yan S (2014) Network in network. In: ICLR
Mishkin D, Matas J (2016) All you need is a good init. In: ICLR
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A (2019) PyTorch: an imperative style, high-performance deep learning library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, Inc., pp 8024–8035. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Qiu S, Xu X, Cai B (2017) FReLU: flexible rectified linear units for improving convolutional neural networks. arXiv preprint arXiv:1706.08098
Sercu T et al (2016) Very deep multilingual convolutional neural networks for LVCSR. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE
Snoek J, Rippel O, Swersky K, Kiros R, Satish N, Sundaram N, Patwary M, Prabhat M, Adams R (2015) Scalable Bayesian optimization using deep neural networks. In: ICML
Srivastava RK, Greff K, Schmidhuber J (2015) Training very deep networks. In: Conference on neural information processing systems
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res
Stollenga MF et al (2014) Deep networks with internal selective attention through feedback connections. In: Advances in neural information processing systems
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of computer vision and pattern recognition (CVPR)
Szegedy C et al (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. AAAI, vol 4
Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X (2017) Residual attention network for image classification. arXiv preprint arXiv:1704.06904
Xie S, Girshick R, Dollár P, Tu Z, He K (2017) Aggregated residual transformations for deep neural networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 5987–5995
Zagoruyko S, Komodakis N (2016) Wide residual networks. arXiv preprint arXiv:1605.07146
Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition, June 2018
Chapter 7
Improved Single Image Super-resolution Based on Compact Dictionary Formation and Neighbor Embedding Reconstruction
Garima Pandey and Umesh Ghanekar
Abstract Single image super-resolution is one of the evolving areas in the field of image restoration. It involves reconstruction of a high-resolution image from an available low-resolution image. Although a lot of research is available in this field, many issues related to the existing problems are still unresolved. This research work focuses on two aspects of image super-resolution. The first aspect is that the process of dictionary formation is improved by using a smaller number of images while preserving maximum structural variations. The second aspect is that pixel value estimation of the high-resolution image is improved by considering only those overlapping patches which are more relevant from the point of view of image characteristics. For this, all overlapping pixels corresponding to a particular location are classified according to whether they are part of a smooth region or an edge. Simulation results clearly prove the efficacy of the algorithm proposed in this paper.
7.1 Introduction
Single image super-resolution (SISR) is a part of image restoration in digital image processing. It is used to upgrade the quality of an image, which is generally degraded due to different constraints such as hardware limitations and environmental disturbances during its acquisition process. SISR is used as an image enhancement tool and can be used along with existing imaging systems. It is used in different fields like medicine, forensics, the military, the TV industry, satellite and remote sensing, telescopic and microscopic imaging, biometrics and pattern recognition, the Internet, etc. SISR is defined as a software method of producing a high-resolution (HR) image from a low-resolution (LR) image obtained by an imaging system. It is an inverse problem (Park et al. 2003), as given in Eq. (7.1). It is notoriously difficult to
G. Pandey (B) · U. Ghanekar National Institute of Technology Kurukshetra, Kurukshetra, India e-mail: [email protected]
U. Ghanekar e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_7
solve due to its ill-posedness and ill-conditioned behavior.

L = AUH    (7.1)
In the equation, 'A' is a blurring factor, 'U' is a scaling factor, 'H' is an HR image, and 'L' is an LR image, respectively. Being an ill-posed inverse problem, SISR involves solving this mathematical equation through different optimization techniques. Over the years, different techniques based on neighbor embedding, sparse coding (Yang et al. 2010), random forests (Liu et al. 2017), deep learning (Yang et al. 2019; Dong et al. 2016), etc., have been proposed. Among these, neighbor embedding is one of the older and simpler machine learning approaches for obtaining the HR image. Due to its considerable performance in the field of SISR, neighbor embedding is still an area that attracts researchers. In this paper as well, an attempt has been made to address the existing issues in neighbor embedding-based SISR techniques. An effort is made to reduce the size of the dictionary without affecting the performance of the algorithm. The problem of removing irrelevant patches during the HR image reconstruction process has also been considered in this paper, and an attempt is made to alleviate it. The rest of the paper is divided into the following sections. In section 7.2, a brief literature review related to single image super-resolution is presented and discussed analytically. In section 7.3, the algorithm proposed in this paper is explained in detail. In section 7.4, all the experiments are performed, and related results are discussed to prove the efficacy of the proposed algorithm. In the end, the conclusion is drawn in section 7.5.
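As an illustration of the degradation model in Eq. (7.1), the following sketch generates an LR image from an HR one using the 5 × 5 average blur and scale factor U = 3 adopted later in Sect. 7.4; OpenCV is assumed as the image library, and the decimation step is a simplified stand-in for the size reduction.

```python
import cv2

def degrade(hr, scale=3):
    """Simulate Eq. (7.1): blur the HR image (operator A) and reduce its
    size by the factor U (the 5x5 average filter and U = 3 of Sect. 7.4)."""
    blurred = cv2.blur(hr, (5, 5))                       # A: 5x5 average filter
    h, w = blurred.shape[:2]
    return cv2.resize(blurred, (w // scale, h // scale),
                      interpolation=cv2.INTER_NEAREST)   # U: decimation
```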
Fig. 7.1 Block diagram for proposed algorithm: similarity scores are computed between the input LR image and a database of LR-HR pairs; the ten most similar images are selected, from which the first image and the image least similar to it are chosen; edge finding and LR-HR patch formation follow; neighbor embedding reconstructs the HR patches, which are combined to form the HR image in the output
7.2 Related Literature in Field of Single Image Super-resolution
In the past, a lot of work has been done and proposed in the field of SISR. A classification of the existing techniques is discussed in detail in (Pandey and Ghanekar 2018, 2020). In the spatial domain, techniques are organized into interpolation-based, regularization-based, and learning-based. Of these three groups, researchers place the most emphasis on learning-based SISR techniques, since these techniques are able to create new spectral information in the reconstructed HR image. In learning-based methods, in the first stage, a dictionary is created for training the computer, and once the machine is trained, then in the second stage, i.e., the reconstruction stage, the HR image is estimated for the given input LR image. In the training stage, the target is to train the machine in such a way that the same types of structural similarities as in the input LR image are available in the computer database for further processing. This is achieved either by forming internal (Egiazarian and Katkovnik 2015; Singh and Ahuja 2014; Pandey and Ghanekar 2020, 2021; Liu et al. 2017) or external dictionary/dictionaries (Freeman et al. 2002; Chang et al. 2004; Zhang et al. 2016; Aharon et al. 2006). In the case of an external dictionary, the size of the dictionary is governed by the number of images used in its formation and is of major concern, since a greater number of images means a larger memory requirement and more computations during the reconstruction phase. This can be reduced by forming the dictionary from images that are similar to the input image but differ from each other, so that fewer images are required to cover the whole range of structural variations present in the input image. In the reconstruction stage, the HR image is recreated through local processing (Chang et al. 2004; Bevilacqua et al. 2014; Choi and Kim 2017) or sparse processing (Yang et al. 2010). Either neighbor embedding (Chang et al. 2004) or direct mapping (Bevilacqua et al. 2014) is used in the case of local processing. In neighbor embedding, a HR patch is constructed with the help of a set of similar patches searched in the dictionary formed in the training phase. After constructing the corresponding HR patches for all input LR patches, all the HR patches are combined to form the HR image in the output. In existing techniques, simple averaging is used in overlapping areas for combining all the HR patches, which may result in poor HR image estimation when one or more erroneous patches fall in the overlapping areas. This also results in blurring of edges due to simple averaging. To overcome this problem, in this paper, only similar types of patches are considered for combining the HR patches in the overlapping areas. This is based on
classifying the patches, as well as the pixels under consideration, into smooth or edge regions. This results in better visualization of the HR image in the output. Also, an external dictionary is formed from images which are similar to the input image but differ from one another. This helps in capturing the maximum structural properties present in the input image with a minimum number of images.
7.3 Proposed Algorithm for Super-resolution
Learning-based SISR has two important parts for HR image reconstruction. First, a dictionary is built, and then, with the help of this dictionary and the input LR image, a HR image is reconstructed. The method proposed in this paper consists of forming an external dictionary for the training part and a neighbor embedding approach for the reconstruction part. Its generalized block diagram is provided in Fig. 7.1. All the LR images of the dictionary, as well as the input image, are upscaled by a factor U through bi-cubic interpolation. On the basis of the structural similarity score (Wang et al. 2004) given in Eq. (7.2), a set of ten images is selected from the database for dictionary formation. Once the images having higher scores are selected from the database, two of them are chosen so as to cover the maximum structural variations present in the input LR image. For this, the image having the highest structural similarity score with the input image is first selected for the dictionary formation; then, the structural similarity score is calculated between the selected image and the rest of the images chosen in the first step. The image having the least structural similarity is taken as the second image for the dictionary formation. The complete dictionary is formed with the help of these two selected LR images and their corresponding HR pairs by forming overlapping patches of 5 × 5.

$SSIM(i, j) = \frac{2\mu_i \mu_j + v_1}{\mu_i^2 + \mu_j^2 + v_1}$    (7.2)
where µ is the mean, v_1 is a constant, and i and j represent the ith and jth images, respectively. In the second stage, i.e., the reconstruction part, the constructed dictionary is used for HR image generation. At first, overlapping blocks of 5 × 5 are obtained from the input LR image; then, for every individual block, 'k' nearest neighbors are searched in the dictionary. LLE (Chang et al. 2004) is used as the neighbor embedding technique to obtain optimal weights for the set of nearest neighbors, and these weights, along with the corresponding HR pairs, are used to construct HR patches to estimate the HR image. In overlapping areas, simple averaging results in blurry edges. Thus, a new technique is given here to combine the patches, in which every pixel of the output HR image is individually estimated with the help of the constructed HR patches. The steps are as follows:
i. The edge matrix for the input LR image is calculated by the Canny edge detector, and then overlapping patches are formed from the matrix, just like the patches formed for the input LR image.
ii. At every pixel location of the HR image, check whether 0 or 1 is present in its corresponding edge matrix; 0 represents a smooth pixel, and 1 represents an edge pixel. To confirm the belongingness of each pixel to its respective group, consider a 9 × 9 block of the edge matrix having the pixel under consideration as the center pixel, and calculate the number of zeros or ones in each of the four directions given in Fig. 7.2.
iii. If, in any of the four directions, the number of zeros is 5 in the case of a smooth pixel, or the number of ones is 5 in the case of an edge pixel, then assign that pixel as a true smooth or true edge pixel, respectively.
iv. For the pixels that cannot be categorized into true smooth or true edge pixels, count the number of zeros or ones in the 9 × 9 block of the edge matrix with the pixel under consideration at the center. If the count of zeros is 13, consider the pixel a smooth pixel, and if the count of ones is 13, then the pixel under consideration is an edge pixel.
v. After categorizing the pixel into smooth or edge type, its value is estimated with the help of the HR patches that contain the overlapping pixel position and their corresponding patches of the edge matrix.
vi. Instead of considering all the overlapping pixels from their respective overlapping HR patches, only pixels of patches which are of the same type as the pixel to be estimated are considered. The two cases are explained separately.
a. At a particular pixel location of the bi-cubic interpolated LR image, if by following the above procedure the pixel is considered to be of smooth type, then to estimate the pixel value at the same location in the HR image, patches of the same type will be considered (out of the selected 25 patches; the number will be less for boundary patches). For this, firstly, an overlapping HR patch having all zeros is chosen. If not available, then the patches having the maximum number of zeros will be considered.
b. Similarly, for the edge type, at first an overlapping HR patch having all ones is chosen. If not available, then the patches having the maximum number of ones will be considered to estimate the value, instead of all the patches.
vii. The process of assigning values to all the pixels of the HR image is performed individually to obtain the HR image in the output.
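A sketch of the classification rules in steps ii-iv is given below. Since Fig. 7.2 is not reproduced here, the four directions are assumed to be horizontal, vertical, and the two diagonals, the per-direction span is assumed to be the five pixels centered on the pixel under test, and the final majority-vote fallback is an added assumption for pixels the rules leave undecided.

```python
import numpy as np

def classify_pixel(edge, r, c):
    """Classify pixel (r, c) of the Canny edge matrix as 'smooth' or 'edge'
    (steps ii-iv); assumes (r, c) is at least 4 pixels from the border."""
    target = edge[r, c]                               # 0: smooth, 1: edge candidate
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]    # assumed from Fig. 7.2
    for dr, dc in directions:                         # step iii: true smooth/edge
        run = sum(edge[r + i * dr, c + i * dc] == target for i in range(-2, 3))
        if run == 5:
            return 'smooth' if target == 0 else 'edge'
    block = edge[r - 4:r + 5, c - 4:c + 5]            # step iv: 9x9 block counts
    zeros, ones = np.sum(block == 0), np.sum(block == 1)
    if target == 0 and zeros >= 13:
        return 'smooth'
    if target == 1 and ones >= 13:
        return 'edge'
    return 'smooth' if zeros > ones else 'edge'       # fallback: majority (assumed)
```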
7.4 Experiment and Discussion
In this section, experiments are performed to prove the usefulness of the proposed algorithm for generating a HR image from a LR image. The images used for validation of the algorithm are given in Fig. 7.3. In all the experiments, LR images are formed by smoothing the HR image with a 5 × 5 average filter, followed by reducing its size by a factor of U. In this paper, U is taken as 3. The experiment is performed only on the luminance part of the image, which is obtained by converting the image into the 'YCbCr' color space. To obtain the final HR image in the output, the chrominance part is upscaled by bi-cubic interpolation. Algorithms are compared on the basis of the structural similarity index measure (SSIM) and the peak signal-to-noise ratio (PSNR), given by Eqs. (7.2) and (7.3), respectively.

$PSNR = 10 \log_{10} \frac{(\text{Maximum Pixel Value})^2}{\text{Mean Square Error}}$    (7.3)

Fig. 7.2 Four different directions
Fig. 7.3 Images for experimental simulations: Starfish, Bear, Flower, Tree, Building
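The PSNR of Eq. (7.3) can be computed directly; a sketch with NumPy, assuming 8-bit images:

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """PSNR of Eq. (7.3): 10 * log10(max^2 / mean squared error)."""
    mse = np.mean((reference.astype(np.float64)
                   - estimate.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)
```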
A set of 100 standard images is selected from (Berkeley dataset 1868) to form the database of LR-HR images. All the LR images are upscaled by a factor of three using bi-cubic interpolation. For dictionary formation, two images are selected one by one.
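This two-step selection, detailed in the next paragraph, can be sketched with scikit-image's SSIM implementation; a minimal sketch assuming 8-bit grayscale images, with illustrative function and variable names.

```python
from skimage.metrics import structural_similarity as ssim

def select_dictionary_pair(input_lr, database_lr):
    """Pick two dictionary images: the database image most similar to the
    input per Eq. (7.2), then, among the next nine best matches, the image
    least similar to that first pick (maximizing structural variation)."""
    scored = sorted(database_lr, key=lambda img: ssim(input_lr, img),
                    reverse=True)
    top_ten = scored[:10]                 # ten most similar images
    first = top_ten[0]                    # highest similarity to the input
    second = min(top_ten[1:], key=lambda img: ssim(first, img))
    return first, second
```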
For the selection of the first image, the structural similarity score is calculated between the input image and the database LR images, and the image having the highest similarity score with the input is selected. Nine more images having higher scores are selected for deciding the second image that will be used for dictionary formation. For the selection of the second image, the structural similarity score between the first selected image and the rest of the nine images is taken. Then, the image having the lowest score with the first image is selected. This process of image selection helps in forming a dictionary with the maximum possible variations using only two images. Overlapping LR-HR patches are formed with the help of these two images to form the dictionary for training purposes. Once the dictionary formation is completed, the procedure for conversion of the input LR image into the HR image starts. For this, the input LR image is first upscaled by a factor of three using bi-cubic interpolation, and then overlapping patches of size 5 × 5 are formed from it. For every patch, six nearest neighbors among the LR patches are selected from the dictionary using Euclidean distance. Now, with the help of these selected LR patches (from the dictionary), optimal weights are calculated for the corresponding HR patches using LLE to obtain the final corresponding HR patch. All such constructed HR patches are combined to obtain the HR image by estimating each pixel individually using the technique proposed in this paper. Experimental results of the proposed algorithm are given in Tables 7.1 and 7.2 to compare it with other existing techniques like bi-cubic interpolation, the neighbor embedding given in (Chang et al. 2004), and the sparse coding given in (Yang et al. 2010).
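A sketch of the LLE weight computation in the style of Chang et al. (2004) is given below, for one LR patch against a dictionary of vectorized LR-HR patch pairs; the small regularization constant is an assumption added for numerical stability.

```python
import numpy as np

def lle_hr_patch(lr_patch, dict_lr, dict_hr, k=6):
    """Neighbor-embedding reconstruction: solve constrained least-squares
    weights over the k = 6 nearest LR dictionary patches and transfer the
    weights to the paired HR patches (dict_lr: (N, d_lr), dict_hr: (N, d_hr))."""
    d = np.linalg.norm(dict_lr - lr_patch, axis=1)    # Euclidean distances
    idx = np.argsort(d)[:k]                           # k nearest neighbors
    Z = dict_lr[idx] - lr_patch                       # neighbors shifted to origin
    G = Z @ Z.T                                       # local Gram matrix
    G += 1e-6 * np.trace(G) * np.eye(k)               # regularization (assumed)
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                                      # weights sum to one
    return w @ dict_hr[idx]                           # weighted HR patch
```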
Table 7.1 Experimental results for the proposed method and other methods, in terms of PSNR
Sr. no. | Name of image | Bi-cubic | NE (Chang et al. 2004) | ScSR (Yang et al. 2010) | Proposed algorithm
1. | Starfish | 25.38 | 27.71 | 28.43 | 29.01
2. | Bear | 26.43 | 28.92 | 30.05 | 29.91
3. | Flower | 27.21 | 29.67 | 29.11 | 30.54
4. | Tree | 25.15 | 28.57 | 28.03 | 28.87
5. | Building | 27.12 | 29.97 | 30.34 | 30.21
Table 7.2 Experimental results for the proposed method and other methods, in terms of SSIM
Sr. no. | Name of image | Bi-cubic | NE (Chang et al. 2004) | ScSR (Yang et al. 2010) | Proposed algorithm
1. | Starfish | 0.8034 | 0.8753 | 0.8764 | 0.9091
2. | Bear | 0.7423 | 0.8632 | 0.8725 | 0.8853
3. | Flower | 0.7932 | 0.8321 | 0.8453 | 0.8578
4. | Tree | 0.7223 | 0.8023 | 0.8223 | 0.8827
5. | Building | 0.8059 | 0.8986 | 0.9025 | 0.9001
Fig. 7.4 Comparison of different algorithms for SISR for S = 3: LR input image, actual HR image, bi-cubic interpolated image, NE algorithm (Chang et al. 2004), ScSR algorithm (Yang et al. 2010), and proposed algorithm
The tables and figures showing the comparison of the proposed algorithm with a few other algorithms prove that the results of our algorithm are better than those of the other algorithms used in the present study. The HR image constructed through our algorithm is better in visualization when juxtaposed with the images obtained from the other algorithms (Fig. 7.4).
7.5 Conclusion
The research work presented in this paper is focused on generating a HR image from a single LR image with the help of an external dictionary. A novel way of building an external dictionary has been proposed, which helps to contain the maximum types of structural variations with the help of a smaller number of images in the dictionary. To achieve this, images that are similar to the input LR image but differ from each other are selected for dictionary formation. This helps to reduce the size of the dictionary, and hence the number of computations during the process of finding nearest neighbors. To form the complete HR image, a new technique based on classifying the pixels as part of a smooth or edge region is used for combining the HR patches in overlapping areas that are generated using LLE. The results obtained through the experiments verify the effectiveness of the algorithm.
References
Aharon M, Elad M, Bruckstein A (2006) The K-SVD: an algorithm for designing over-complete dictionaries for sparse representation. IEEE Trans Signal Process 54(11):4311–4322
Berkeley dataset (1868) https://www2.eecs.berkeley.edu
Bevilacqua M, Roumy A, Guillemot C, Morel MLA (2014) Single-image super-resolution via linear mapping of interpolated self-examples. IEEE Trans Image Process 23(12):5334–5347
Chang H, Yeung DY, Xiong Y (2004) Super-resolution through neighbor embedding. In: IEEE conference on computer vision and pattern recognition, vol 1, pp 275–282
Choi JS, Kim M (2017) Single image super-resolution using global regression based on multiple local linear mappings. IEEE Trans Image Process 26(3)
Dong C, Loy CC, He K, Tang X (2016) Image super-resolution using deep convolutional networks. IEEE Trans Pattern Anal Mach Intell 38(2):295–307
Egiazarian K, Katkovnik V (2015) Single image super-resolution via BM3D sparse coding. In: 23rd European signal processing conference, pp 2899–2903
Freeman W, Jones T, Pasztor E (2002) Example-based super-resolution. IEEE Comput Graph Appl 22(2):56–65
Liu ZS, Siu WC, Huang JJ (2017) Image super-resolution via weighted random forest. In: 2017 IEEE international conference on industrial technology (ICIT). IEEE
Liu C, Chen Q, Li H (2017) Single image super-resolution reconstruction technique based on a single hybrid dictionary. Multimedia Tools Appl 76(13):14759–14779
Pandey G, Ghanekar U (2018) A compendious study of super-resolution techniques by single image. Optik 166:147–160
Pandey G, Ghanekar U (2020) Variance based external dictionary for improved single image super-resolution. Pattern Recognit Image Anal 30:70–75
Pandey G, Ghanekar U (2020) Classification of priors and regularization techniques appurtenant to single image super-resolution. Visual Comput 36:1291–1304. https://doi.org/10.1007/s00371-019-01729-z
Pandey G, Ghanekar U (2021) Input image-based dictionary formation in super-resolution for online image streaming. In: Hura G, Singh A, Siong Hoe L (eds) Advances in communication and computational technology. Lecture notes in electrical engineering, vol 668. Springer, Singapore
Park SC, Park MK, Kang MG (2003) Super-resolution image reconstruction: a technical overview. IEEE Signal Process Mag 20(3):21–36
Singh A, Ahuja N (2014) Super-resolution using sub-band self-similarity. In: Asian conference on computer vision, pp 552–568
Wang Z, Bovik A, Sheikh H, Simoncelli E (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
Yang J, Wright J, Huang TS, Ma Y (2010) Image super-resolution via sparse representation. IEEE Trans Image Process 19(11):2861–2873
Yang W, Zhang X, Tian Y, Wang W, Xue J-H, Liao Q (2019) Deep learning for single image super-resolution: a brief review. IEEE Trans Multimedia 21(12):3106–3121
Zhang Z, Qi C, Hao Y (2016) Locality preserving partial least squares for neighbor embedding-based face hallucination. In: IEEE conference on image processing, pp 409–413
Chapter 8
An End-to-End Framework for Image Super Resolution and Denoising of SAR Images Ashutosh Pandey, Jatav Ashutosh Kumar, and Chiranjoy Chattopadhyay
Abstract Single image super resolution (or upscaling) has become very efficient because of the powerful application of generative adversarial networks (GANs). However, the presence of noise in the input image often produces undesired artifacts in the resultant output image. Denoising an image and then upscaling it introduces more chances of these artifacts due to the accumulation of errors in the prediction. In this work, we propose single-shot upscaling and denoising of SAR images using GANs. We have compared the quality of the output image with that of a two-step denoising and upscaling network. To evaluate our standing with respect to the state of the art, we compare our results with other denoising methods without super resolution. We also present a detailed analysis of experimental findings on the publicly available COWC dataset, which comes with context information for object classification.
8.1 Introduction
Synthetic aperture radar (SAR) is an imaging method capable of capturing high-resolution images of terrains in all weather conditions and at any time of day. SAR is a coherent imaging technology that overcomes the drawbacks of optical and infrared imaging. SAR has proven to be very beneficial in ground observation and military applications. However, being a coherent imaging technique, it suffers from a multiplicative speckle noise because of the constructive and destructive interference of the returned signals. The presence of speckle noise adversely affects the performance of computer vision techniques and makes it difficult to derive useful information from the data. The research community has made several efforts to remove noise in the past, including filtering methods, wavelet-based methods, and deep learning-based meth-
A. Pandey (B) · J. Ashutosh Kumar · C. Chattopadhyay Indian Institute of Technology Jodhpur, Jodhpur, Rajasthan 342037, India e-mail: [email protected]
J. Ashutosh Kumar e-mail: [email protected]
C. Chattopadhyay e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_8
ods, including convolutional neural networks (CNNs) and generative adversarial networks (GANs). Section 8.3 gives a detailed description of such techniques. In applications involving images or videos, high-resolution data are usually desired for more advanced computer vision-related work. The rationale behind the thirst for high image resolution can be either improving the pixel information for human perception or easing computer vision tasks. Image resolution describes the details in an image; the higher the resolution, the more image details. Among the various denotations of the word resolution, we focus on spatial resolution, which refers to the pixel density in an image and is measured in pixels per unit area. In situations like satellite imagery, it is challenging to use high-resolution sensors due to physical constraints. To address this problem, the input image is captured at a low resolution and post-processed to obtain a high-resolution image. These techniques are commonly known as super resolution (SR) reconstruction. SR techniques construct high-resolution (HR) images from several observed low-resolution (LR) images. In this process, the high-frequency components are increased, and the degradations caused by the imaging process of the low-resolution camera are removed. Super resolution of images has proven to be very useful for better visual quality and for easing the detection processes of other computer vision techniques. However, the presence of noise in the input image may be problematic, as the network amplifies the noise when upscaling is done. We try to merge the two-step procedure of denoising an image and upscaling it, to compare the quality and the performance benefit that we get by running one network instead of two. The motivation is to create a single artificial neural network with both denoising and upscaling capabilities.
Contributions The primary contributions of our work are:
1. We analyze the two possible approaches to denoising and super resolution of SAR images, single-step and two-step. We demonstrate the comparison of their performance on the compiled dataset.
2. Through empirical analysis, we demonstrate that the single-step approach better preserves the details in the noisy image. Even with higher PSNR values for the ID-CNN network, the RRDN images are better able to maintain the high-level features present in the image, which will prove to be of more use in object recognition compared to PSNR-driven networks, which are not able to preserve these details.
Organization of the chapter The remainder of this chapter is organized in the following way. To give the readers a clear understanding of the problem, we define speckle noise in Sect. 8.2. In Sect. 8.3, we present a detailed description of the works proposed in the literature that are related to our framework. In Sect. 8.4, we describe the proposed framework in detail. Section 8.5 describes the experimental findings of our proposed framework. In Sect. 8.6, we present a detailed analysis of the results obtained using our framework. Finally, the paper concludes with Sect. 8.7, along with some indications of the future scope of work.
Fig. 8.1 Artificial noise added in the dataset
8.2 Speckle Noise

Speckle noise arises due to the effect of environmental conditions on the imaging sensor during image acquisition. Speckle noise is primarily prevalent in application areas like medical imaging, active radar imaging, and synthetic aperture radar (SAR) imaging. The model commonly used for representing the multiplicative SAR speckle noise is defined as:

$$Y = N X \tag{8.1}$$

where $Y \in \mathbb{R}^{W \times H}$ is the observed intensity SAR image, $X \in \mathbb{R}^{W \times H}$ is the noise-free SAR image, and $N \in \mathbb{R}^{W \times H}$ is the speckle noise random variable. Here, $W$ and $H$ denote the width and height of the image, respectively. Let the SAR image be an average of $L$ looks; then $N$ follows a gamma distribution with unit mean and variance $1/L$, with the probability density function:

$$p(N) = \frac{L^L N^{L-1} e^{-LN}}{\Gamma(L)}, \qquad N \ge 0,\; L \ge 1 \tag{8.2}$$
where (·) is the Gamma function. L, the equivalent number of looks (ENL), is usually regarded as the quantitative evaluation index for real SAR images de-speckling experiments in the homogeneous areas and defined as: ENL =
X¯ σ X2
(8.3)
where $\bar{X}$ and $\sigma_X^2$ are the image mean and variance, respectively. Figure 8.1a shows the probability density function of Eq. (8.2) for L = 1, 4 and 10, along with the histogram of random samples drawn from the NumPy gamma distribution. It can be observed from Fig. 8.1 how the image quality changes with increasing values of the hyperparameter used to define the noise. As a result, proposing a unified, end-to-end model for such a task is challenging.
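As an illustration, a minimal Python sketch of this noise model is given below: it evaluates the density of Eq. (8.2) and draws unit-mean gamma samples with NumPy, as used to produce Fig. 8.1a. The sample count and grid are illustrative choices, not values from the chapter.

```python
# A minimal sketch of Eq. (8.2): evaluate the gamma PDF of the speckle
# variable N and compare it against random samples (as in Fig. 8.1a).
import numpy as np
from scipy.special import gamma as gamma_fn

def speckle_pdf(n, L):
    # p(N) = L^L N^(L-1) e^(-L N) / Gamma(L), for N >= 0, L >= 1
    return (L ** L) * n ** (L - 1) * np.exp(-L * n) / gamma_fn(L)

for L in (1, 4, 10):
    samples = np.random.gamma(L, 1.0 / L, size=100_000)  # unit mean, variance 1/L
    n = np.linspace(0.01, 3.0, 300)
    pdf = speckle_pdf(n, L)
    print(L, samples.mean(), samples.var())  # should approach 1 and 1/L
```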
8.3 Literature Survey

Given the importance of SAR denoising explained above, several techniques have been proposed in the literature over the past years. In 1981, the Lee filter (Wang and Bovik 2002) was proposed, which uses statistical techniques to define a noise model; the probabilistic patch-based filter (PPB) (Deledalle et al. 2009) uses a similarity criterion based on the noise distribution; non-local means (NL-means) (Buades et al. 2005) uses all possible self-predictions to preserve texture and details; and block matching 3D (BM3D) (Kostadin et al. 2006) uses inter-patch correlation (NLM) and intra-patch correlation (wavelet shrinkage).

The deep learning approach has received much attention in the area of image denoising. However, there are tangible differences among the various types of deep learning techniques dealing with image denoising. For example, discriminative learning based on deep learning tackles the issue of Gaussian noise, while optimization models based on deep learning are useful in estimating the actual noise. In Chunwei et al. (2020), a comparative study of deep techniques in image denoising is explained in detail.

There have been several approaches to speckle reduction in important application domains. Speckle reduction is an important step before the processing and analysis of medical ultrasound images. In Da et al. (2020), a new algorithm based on deep learning is proposed to reduce speckle noise for coherent imaging without clean data. In Shan et al. (2018), a new speckle noise reduction algorithm for medical ultrasound images is proposed, employing the monogenic wavelet transform (MWT) and a Bayesian framework. Considering the search for an optimal threshold as exhaustive and the requirements as contradictory, in Sivaranjani et al. (2019) the problem is conceived as a multi-objective particle swarm optimization (MOPSO) task, and a MOPSO framework for de-speckling an SAR image using a dual-tree complex wavelet transform (DTCWT) in the frequency domain is proposed. Huang et al. (2009) proposed a novel method that includes the coherence reduction speckle noise (CRSN) algorithm and the coherence constant false-alarm ratio (CCFAR) detection algorithm to reduce speckle noise in SAR images and to improve the detection ratio for SAR ship targets from the SAR imaging mechanism. Techniques such as Vikrant et al. (2015), Yu et al. (2020) and Vikas Kumar and Suryanarayana (2019) proposed speckle denoising filters specifically designed for SAR images that have shown encouraging performance. For target recognition from SAR images, Wang et al. (2016) proposed a complementary spatial pyramid coding (CSPC) approach in the framework of spatial pyramid matching (SPM) (Fig. 8.2).
Fig. 8.2 ID-CNN network architecture for image de-speckling
In Wang et al. (2017), a novel technique (ID-CNN) was proposed. The network has eight convolution layers with rectified linear units (ReLU), where each convolution layer has 64 filters of 3 × 3 kernel size with a stride of one and no pooling, and it applies a combination of batch normalization and a residual learning strategy with a skip connection, in which the input image is divided by the estimated speckle noise pattern. This method uses the L2 norm (Euclidean distance) as the loss function, which reduces the distance between the output and the target image; however, this can introduce artefacts in the image and does not consider neighborhood pixels, so a total variation (TV) loss, balanced by the regularization factor $\lambda_{TV}$, was added to the overall loss function to overcome this shortcoming. The TV loss encourages smoother results. Assuming $\hat{X} = \phi(Y^{w,h})$, where $\phi$ is the learned network (parameters) for generating the despeckled output, the losses are defined as:

$$L = L_E + \lambda_{TV} L_{TV} \tag{8.4}$$

$$L_E = \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \left\| \phi(Y^{w,h}) - X^{w,h} \right\|_2^2 \tag{8.5}$$

$$L_{TV} = \sum_{w=1}^{W} \sum_{h=1}^{H} \sqrt{ \left( \hat{X}_{w+1,h} - \hat{X}_{w,h} \right)^2 + \left( \hat{X}_{w,h+1} - \hat{X}_{w,h} \right)^2 } \tag{8.6}$$
The method proposed in Wang et al. (2017) performs well compared to the traditional image processing methods mentioned above; hence, we compare our work with Wang et al. (2017).
8.4 Proposed Model

8.4.1 Architecture

We propose a single-step artificial neural network model for the image super resolution and SAR denoising tasks, inspired by the RRDN GAN model (Wang et al. 2019). Figure 8.3 depicts an illustration of the proposed modification of the original RRDN network. The salient features of the proposed model are:
• To compare it to the other noise removal techniques, we removed the upscaling part of the super-resolution GAN and trained the network for 1x.
Fig. 8.3 The RRDN architecture (we adapted the network without the upscaling layer for comparison)
Fig. 8.4 A schematic representation of the residual in residual dense block (RRDB) architecture
Fig. 8.5 A schematic diagram of the discriminator architecture
• We also train the network with upscaling layers for simultaneous super resolution and noise removal.

The model was trained in various configurations; the best results were found with 10 RRDB blocks. Figure 8.4 depicts an illustration of this architecture. There are 3 RDB blocks in each RRDB block, 4 convolution layers in each RDB block, 64 feature maps in the RDB convolution layers, and 64 feature maps outside the RDB convolutions. Figure 8.5 shows the discriminator used in the adversarial network. As shown in Fig. 8.5, the discriminator has a series of convolution and ReLU layers, followed by a dense layer of dimension 1024 and, finally, a dense layer with a Sigmoid function to classify between the low- and high-resolution images.
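The block structure described above can be sketched in Keras as below. This is a minimal illustration of a residual-in-residual dense block under the stated configuration (3 RDBs per RRDB, 64 feature maps), not the authors' exact implementation; the residual scaling factor of 0.2 is an assumption borrowed from common ESRGAN practice.

```python
# A minimal sketch (not the authors' exact code) of an RRDB block.
from tensorflow.keras import layers

def rdb(x, convs=4, filters=64):
    """Residual dense block: densely connected convs plus a local residual."""
    inputs = x
    concat = x
    for _ in range(convs):
        out = layers.Conv2D(filters, 3, padding="same", activation="relu")(concat)
        concat = layers.Concatenate()([concat, out])
    # 1x1 conv fuses the dense features back to `filters` channels
    fused = layers.Conv2D(filters, 1, padding="same")(concat)
    return layers.Add()([inputs, fused])

def rrdb(x, rdbs=3, beta=0.2):
    """Residual-in-residual: stack RDBs and add a scaled skip connection."""
    out = x
    for _ in range(rdbs):
        out = rdb(out)
    out = layers.Lambda(lambda t: t * beta)(out)  # residual scaling
    return layers.Add()([x, out])
```

Next, we discuss the various loss functions used in the network.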
8.4.2 Loss Function

An artificially intelligent system learns through a loss function, a means of assessing how well a specific algorithm models the given data. If predictions differ too much from actual results, the loss function produces a large value; progressively, with the help of an optimization function, the network learns to reduce the prediction error. In this work, we use the perceptual loss $L_{percep}$ proposed in ESRGAN (Wang et al. 2019) for training, along with the adversarial discriminator loss.
8.4.2.1 Perceptual Loss

In the perceptual loss (Wang et al. 2019), we measure the mean square error in the feature maps of a pre-trained network. For our training, we have taken layers 5 and 9 of the pre-trained VGG19 model. The perceptual loss function is an improved version of MSE that evaluates a solution based on perceptually relevant characteristics and is defined as follows:

$$l^{SR} = \underbrace{l_X^{SR}}_{\text{content loss}} + \underbrace{10^{-3}\, l_{Gen}^{SR}}_{\text{adversarial loss}} \tag{8.7}$$

where $l_X^{SR}$ is the content loss (for VGG-based content losses) and $l_{Gen}^{SR}$ is the adversarial loss, which are defined in the following sections.
8.4.2.2 Content Loss

Content is defined as the information available in an image. Upon visualizing the learned components of a neural network, it has been shown in the literature that different feature maps in higher layers are activated in the presence of various objects. So, if two images have the same content, they should have similar activations in the higher layers. The content loss function ensures that the activations of the higher layers are similar between the content image and the generated image, i.e., that the content present in the content image is reproduced in the generated image. In the literature, researchers have shown that CNNs capture knowledge about content in their higher layers, while the lower layers concentrate on individual pixel values. The VGG loss is defined as follows:

$$l_{VGG/i,j}^{SR} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y} \right)^2 \tag{8.8}$$
where $\phi_{i,j}$ represents the feature map obtained by the $j$th convolution before the $i$th max-pooling layer, $W_{i,j}$ and $H_{i,j}$ describe the dimensions of the feature maps within the VGG network, $G_{\theta_G}(I^{LR})$ is the reconstructed image, and $I^{HR}$ is the ground truth image.
8.4.2.3 Adversarial Loss

One of the most important uses of adversarial networks is the ability to create natural-looking images after training the generator for a sufficient amount of time. This component of the loss encourages the generator to favor outputs that are closer to realistic images. The adversarial loss is defined as follows:

$$l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big) \tag{8.9}$$

where $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ is the probability that the reconstructed image $G_{\theta_G}(I^{LR})$ is a natural high-resolution image.
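As an illustration, the combined loss of Eqs. (8.7)–(8.9) can be sketched in Keras as below; the choice of the block5_conv4 feature layer is an illustrative assumption (the chapter uses layers 5 and 9 of VGG19), and the small epsilon guards the logarithm.

```python
# A minimal sketch of the perceptual loss, assuming a Keras VGG19
# pretrained on ImageNet; layer choice is illustrative.
import tensorflow as tf

vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feat = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
feat.trainable = False

def content_loss(hr, sr):
    # MSE between VGG feature maps of the target and generated images;
    # in practice, inputs should first be preprocessed as VGG19 expects
    return tf.reduce_mean(tf.square(feat(hr) - feat(sr)))

def perceptual_loss(hr, sr, disc_out):
    # Eq. (8.7): content loss plus 1e-3 times the adversarial loss of Eq. (8.9)
    adversarial = -tf.reduce_mean(tf.math.log(disc_out + 1e-8))
    return content_loss(hr, sr) + 1e-3 * adversarial
```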
8.5 Experiments and Results

8.5.1 Dataset

We recompile the dataset as outlined by the authors of ID-CNN (Wang et al. 2017) and ID-GAN (Wang Puyang et al. 2017) for analysis on synthetic SAR images. The dataset is a combination of images from UCID (Schaefer and Stich 2004), BSDS-500 (Arbeláez et al. 2011), and scraped Google Maps images (Isola et al. 2017). All these images are converted to grayscale using OpenCV to simulate intensity SAR images. The grayscale images are then downscaled to 256 × 256 to serve as the noiseless high-resolution target. Another set of grayscale images is downscaled from the original size to 256 × 256 and 64 × 64. For each input image, we have three different noise levels. We randomly allot the images to the training, validation, and testing sets with a split ratio of 95 : 2.5 : 2.5, chosen to obtain a test set of similar size to ID-CNN (Wang et al. 2017). We also use the cars overhead with context (COWC) dataset (Mundhenk et al. 2016), provided by the Lawrence Livermore National Laboratory, for further investigation of the performance. We use this dataset because it contains target boxes for classification and localization of cars, so the data can be further used for object detection performance comparisons. We then add speckle noise to the images in our dataset using np.random.gamma(L, 1/L) from NumPy to randomly sample
from the gamma distribution, which is equivalent to the probability density function above, as shown in Fig. 8.1a. Each image is normalized before adding noise and then scaled back to the original range after adding noise, to avoid clipping of values and loss of information.
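A minimal sketch of this normalization and noise-injection pipeline, under the assumptions above, is:

```python
# A minimal sketch of the speckle pipeline: normalize, multiply by
# gamma-distributed noise (Y = NX), rescale to avoid clipping.
import numpy as np

def add_speckle(img, L):
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    x = (img - lo) / (hi - lo + 1e-8)                 # normalize to [0, 1]
    noise = np.random.gamma(L, 1.0 / L, x.shape)      # unit mean, variance 1/L
    y = x * noise                                     # multiplicative model
    y = (y - y.min()) / (y.max() - y.min() + 1e-8)    # re-normalize
    return (y * (hi - lo) + lo).astype(img.dtype)     # back to original range
```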
8.5.2 Results

In this section, we describe the quantitative and qualitative results obtained from the experiments conducted with the proposed architecture.
8.5.2.1 No Super Resolution (No Upscaling Layer)

In this subsection, we report the comparison with no super resolution, i.e., no upscaling layer. Table 8.1 shows the comparison of the denoising performance of the proposed network with ID-CNN (Wang et al. 2017). We use the same dataset in both cases. The results are reported for three different noise levels, as shown by three different values of L. Also, three different metrics are used, to maintain the same experimental setup used by the other state-of-the-art method. Here, VGG16 loss refers to the MSE between the deeper-layer VGG16 feature maps of the output denoised image and of the target image. From the obtained results, it is clear from Table 8.1 that the proposed framework outperforms ID-CNN when compared only on denoising performance for all noise levels. The PSNR for L = 4, 10 comes out better for ID-CNN because of the PSNR-driven loss function of that network, whereas the RRDN performs better with respect to the VGG16 metric, implying that it better preserves higher-level details in the denoised image, since the network is driven by content loss instead of pixel-wise MSE.
8.5.2.2 With Super Resolution (With Upscaling Layers)

Here, we report the quantitative comparison with super resolution. Table 8.2 shows the comparison for both approaches. For the two-step network, we train ID-CNN (Wang et al. 2017) on 256 × 256 noisy images with 256 × 256 clean target images, and then train the SR network on clean 256 × 256 input images with 1024 × 1024 high-resolution target images. For the single shot network, we train the network on 256 × 256 noisy images with 1024 × 1024 clean high-resolution targets. We compare the performance of the networks based on the above-mentioned strategy. The PSNR calculations are done for the same output sizes, as the target and output image sizes match. The VGG16 loss calculation, however, is done after downscaling the images from both cases to 224 × 224. It can be observed from Table 8.2 that the two-step approach produces better results for most of the metrics.
Table 8.1 Comparison of RRDN without upscaling layer with ID-CNN (Wang et al. 2017)

Noise level   Metric   ID-CNN (Wang et al. 2017)   RRDN 1x
L = 1         PSNR     19.34                       19.55
              SSIM     0.57                        0.61
              VGG16    1.00                        0.81
L = 4         PSNR     22.59                       22.47
              SSIM     0.77                        0.79
              VGG16    0.60                        0.30
L = 10        PSNR     24.74                       24.58
              SSIM     0.85                        0.86
              VGG16    0.33                        0.16
Table 8.2 Comparison of RRDN with upscaling layer with ID-CNN (Wang et al. 2017)

Noise level   Metric   ID-CNN → ISR   Single Shot
L = 1         PSNR     19.35          18.77
              SSIM     0.61           0.60
              VGG16    0.91           1.06
L = 4         PSNR     21.95          21.38
              SSIM     0.72           0.71
              VGG16    0.53           0.48
L = 10        PSNR     23.32          23.00
              SSIM     0.77           0.77
              VGG16    0.29           0.25
However, the single shot network is still able to slightly outperform the two-step network on the VGG16 metric, which again shows that the network preserves high-level details better while doing both tasks at once, instead of denoising the image and then increasing its resolution using two independently trained networks. These higher-level details lead to better perceived image quality and better performance in object detection tasks, even though the pixel-wise MSE or PSNR values come out lower.
8.5.2.3 Additional Results

In this subsection, we present the results of super resolution and denoising in a single network on SAR images. We present the calculated results in Table 8.3 without comparison since, to the best of our knowledge, we were not able to find any other work addressing both super resolution and denoising in the context of SAR images. The input images are 64 × 64 noisy images with almost no pixel information available. Figure 8.6 depicts an illustration of the proposed single-step denoising and super-resolution
Table 8.3 Performance of the network while performing both operations

Noise level   Metric   RRDN-4x
L = 1         PSNR     16.10
              SSIM     0.36
              VGG16    1.20
L = 4         PSNR     17.05
              SSIM     0.43
              VGG16    0.99
L = 10        PSNR     17.60
              SSIM     0.48
              VGG16    0.87
Fig. 8.6 An illustration of the single-step denoising and super resolution task on an input image of size 64 × 64
task. The RRDN is able to generate a pattern of the primary objects, such as buildings, from the input noisy image based on its high-level features. It is also evident from the quantitative analysis that our proposed single-step method is able to produce HR images with considerably less noise. These claims are further clarified in the following section, where we analyze the performance in detail.
8.6 Analysis

So far, we have discussed the quantitative results obtained using the proposed method. In this section, we present the qualitative results and analyze the reasons behind them in detail.
8.6.1 No Super Resolution (No Upscaling Layer)

Figure 8.7 shows the denoising performance comparison of ID-CNN with RRDN on a 1L noise speckle image with no upscaling. Similarly, Figs. 8.8 and 8.9 show the comparison for 4L and 10L noise speckle images. In all three figures, the results are presented such that parts (a) and (d), the two diagonally opposite images, depict the input speckled image and the target image, respectively, while parts (b) and (c) represent the despeckled images generated by our proposed model and by the method proposed in Wang et al. (2017), respectively. The denoised output of the proposed network shows better preserved edges, less smudging, and a sharper image when compared to ID-CNN, even though the PSNR difference between the images is not very high. The proposed method gives consistently better quality images for all noise levels.
8.6.2 With Super Resolution (With Upscaling Layer)

Figures 8.10, 8.11 and 8.12 show the comparison for 1L, 4L and 10L noise, respectively. The images are downscaled after super resolution for comparison. The original input image size is 256 × 256, and the output image size is 1024 × 1024. Starting with the 1L noisy image, we can see a stark difference in the output images produced by the networks. With the two-step network, denoising and then upscaling produces blurred and smeared images. The high-level details are better preserved in the single-step approach compared to the two-step one. It can be seen that the two-step approach induces distortion when higher noise is present, whereas the single-step approach preserves more higher-level details, since the content loss has made it possible for the
Fig. 8.7 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 1L)
network to learn to extract details from the noisy image, which helps produce the building patterns even in the presence of very high noise.
8.6.3 Comparison

In this section, we present a qualitative comparison between the two-step method and the single-step method. Figure 8.13 shows the comparison between the outputs of the approaches. Figure 8.13a shows the output of the two-step approach of denoising then upscaling, while Fig. 8.13b shows the output of the single-step approach. For both cases, we highlight one section of the image and zoom in to that region to show the difference in results more closely. It can be seen in the cropped region of the high-resolution 1024 × 1024 output image that the single-step approach better preserves the edges between closely spaced buildings, whereas in the two-step approach the output images have smeared edges, as the input noisy image is not available to the super resolution
Fig. 8.8 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 4L)
network, hence inducing distortions in the additional step. Also, the distortions left by the ID-CNN network are magnified in the upscaling network; these are reduced in the single-step network, since the features of the input noisy image are available to the network through the dense skip connections.
Fig. 8.9 Qualitative comparison between the proposed denoising method and Wang et al. (2017) without upscaling (noise level = 10L)
8.7 Conclusion and Future Scope

In this work, we presented the results of the proposed network with the upscaling layer. The results show significant improvement in VGG16 loss because the system can remove noise from the images while preserving the images' relevant features. The single-step performance is comparable to the two-step one, while reducing the need for a double pass and saving the cost of training an additional network. Since the single-step system better preserves image features, it might perform better when used further in tasks that rely on such features, like object recognition.
Fig. 8.10 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 1L)
Fig. 8.11 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 4L)
Fig. 8.12 Qualitative comparison between the proposed denoising method and Wang et al. (2017) with upscaling (noise level = 10L)
Fig. 8.13 A qualitative comparison between the two-step and single-step approaches for denoising and super-resolving an input image. a Two-step denoising and super resolution. b Single-step denoising and super resolution
References

Bai YC, Zhang S, Chen M, Pu YF, Zhou JL (2018) A fractional total variational CNN approach for SAR image despeckling. ISBN: 978-3-319-95956-6
Bhateja V, Tripathi A, Gupta A, Lay-Ekuakille A (2015) Speckle suppression in SAR images employing modified anisotropic diffusion filtering in wavelet domain for environment monitoring. Measurement 74:246–254
Buades A, Coll B, Morel J (2005) A non-local algorithm for image denoising. In: IEEE conference on computer vision and pattern recognition (CVPR)
Deledalle C, Denis L, Tupin F (2009) Iterative weighted maximum likelihood denoising with probabilistic patch-based weights. IEEE Trans Image Process 18(12):2661–2672
Francesco C et al (2018) ISR. https://github.com/idealo/image-super-resolution
Gai S, Zhang B, Yang C, Lei Y (2018) Speckle noise reduction in medical ultrasound image using monogenic wavelet and Laplace mixture distribution. Digital Signal Process 72:192–207
Huang S, Liu D, Gao G, Guo X (2009) A novel method for speckle noise reduction and ship target detection in SAR images. Patt Recogn 42(7):1533–1542
Isola P, Zhu J-Y, Zhou T, Efros A (2017) Image-to-image translation with conditional adversarial networks. In: IEEE conference on computer vision and pattern recognition (CVPR)
Kostadin D, Alessandro F, Vladimir K, Karen E (2006) Image denoising with block-matching and 3D filtering. In: Image processing: algorithms and systems, neural networks, and machine learning
Ledig C, Theis L, Huszár F, Caballero J, Cunningham A, Acosta A, Aitken A, Tejani A, Totz J, Wang Z, Shi W (2017) Photo-realistic single image super-resolution using a generative adversarial network. In: IEEE conference on computer vision and pattern recognition (CVPR)
Li Y, Wang S, Zhao Q, Wang G (2020) A new SAR image filter for preserving speckle statistical distribution. Signal Process 196:125–132
Mundhenk TN, Konjevod G, Sakla WA, Boakye K (2016) A large contextual dataset for classification, detection and counting of cars with deep learning. In: European conference on computer vision
Puyang W, He Z, Patel Vishal M (2017) SAR image despeckling using a convolutional neural network. IEEE Signal Process Lett 24(12):1763–1767
Rana VK, Suryanarayana TMV (2019) Evaluation of SAR speckle filter technique for inundation mapping. Remote Sensing Appl Soc Environ 16:125–132
Schaefer G, Stich M (2004) UCID: an uncompressed color image database. In: Storage and retrieval methods and applications for multimedia
Sivaranjani R, Roomi SMM, Senthilarasi M (2019) Speckle noise removal in SAR images using multi-objective PSO (MOPSO) algorithm. Appl Soft Comput 76:671–681
Tian C, Fei L, Zheng W, Yong X, Zuo W, Lin C-W (2020) Deep learning on image denoising: an overview. Neural Netw 131:251–275
Wang Z, Bovik AC (2002) A universal image quality index. IEEE Signal Process Lett 9(3):81–84
Wang S, Jiao L, Yang S, Liu H (2016) SAR image target recognition via complementary spatial pyramid coding. Neurocomput 196:125–132
Wang X, Yu K, Wu S, Gu J, Liu Y, Dong C, Qiao Y, Change Loy C (2018) ESRGAN: enhanced super-resolution generative adversarial networks. ISBN: 978-3-030-11020-8
Wang P, Zhang H, Patel VM (2017) Generative adversarial network-based restoration of speckled SAR images. In: IEEE 7th international workshop on computational advances in multi-sensor adaptive processing
Yin D, Zhongzheng G, Zhang Y, Fengyan G, Nie S, Feng S, Ma J, Yuan C (2020) Speckle noise reduction in coherent imaging based on deep learning without clean data. Optics Lasers Eng 72:192–207
Part II
Models and Algorithms
Chapter 9
Analysis and Deployment of an OCR—SSD Deep Learning Technique for Real-Time Active Car Tracking and Positioning on a Quadrotor

Luiz G. M. Pinto, Wander M. Martins, Alexandre C. B. Ramos, and Tales C. Pimenta

Abstract This work presents a deep learning solution for object tracking and object detection in images and real-time license plate recognition, implemented on an F450 quadcopter in autonomous flight. The solution uses the Python programming language, the OpenCV library, remote PX4 control with MAVSDK, OpenALPR, and neural networks built with Caffe and TensorFlow.
9.1 Introduction

A drone can follow an object that updates its route all the time (Pinto et al. 2019). This is called active tracking and positioning, where an autonomous vehicle needs to follow a goal without human intervention. There are some approaches to this mission with drones (Amit and Felzenszwalb 2014; Mao et al. 2017; Patel and Patel 2012; Sawant and Chougule 2015), but it is rarely combined with object detection and OCR due to resource consumption (Lecun et al. 2015). State-of-the-art algorithms (Dai et al. 2016; Liu et al. 2016; Redmon and Farhadi 2017; Ren et al. 2017) can identify the class of a target object being followed. This work presents and analyzes a technique that grants control to a drone during an autonomous flight, using real-time tracking and positioning through an OCR system and deep learning models for plate detection and object detection (Bartak and Vykovsky 2015; Barton
and Azhar 2017; Bendea 2008; Braga et al. 2017; Breedlove 2019; Brito et al. 2019; Cabebe 2012; Jesus et al. 2019; Lecun et al. 2015; Lee et al. 2010; Martins et al. 2018; Mitsch et al. 2013; Qadri and Asif 2009; Tavares et al. 2010).
9.2 Materials and Methods

The following sections present the concepts, techniques, models, materials, and methods used in the proposed system, in addition to the structures and platforms used for its creation.
9.2.1 Rotary Wing UAV

There are all kinds of autonomous vehicles (Bhadani et al. 2018; Chapman et al. 2016). This project used an F450 quadcopter drone for outdoor testing and a Typhoon H480 hexarotor for the simulation. A quadcopter is an aircraft made up of four rotors, carrying the controller board in the middle and the rotors at the ends (Sabatino et al. 2015). It is controlled by changing the angular speeds of the rotors, which are spun by electromagnetic motors, giving six degrees of freedom, as seen in Fig. 9.1, with x, y and z as the translational movement and roll, pitch and yaw as the rotational movement (Luukkonen 2011; Martins et al. 2018; Strimel et al. 2017). The altitude and position of the drone can be modified by adjusting the speeds of the motors (Sabatino et al. 2015); pitch control, for example, is achieved by adjusting the rear or front motors (Braga et al. 2017).
Fig. 9.1 Degrees of freedom (Strimel et al. 2017)
9.2.1.1 Pixhawk Flight Control Hardware
This project was implemented using the Pixhawk flight controller (Fig. 9.2), an independent open-hardware flight controller (Pixhawk 2019). Pixhawk supports manual and automatic operations and is suitable for research, because it is inexpensive and compatible with most remote control transmitters and receivers (Ferdaus et al. 2017). Pixhawk is built around a 32-bit STM32F427 Cortex-M4 processor running at 168 MHz, with 256 KB of RAM and 2 MB of flash (Feng and Angchao 2016; Meier et al. 2012). The current version is the Pixhawk 4 (Kantue and Pedro 2019).
9.2.1.2 PX4 Flight Control Software

In this work, the PX4 flight control software (PX4 2019a) was used, because it is Pixhawk's native software, avoiding compatibility problems. PX4 is open-source flight control software for drones and other unmanned vehicles (PX4 2019a) that provides a set of tools to create customized solutions. There are several built-in frame types with their own flight control parameters, including motor assignment and numbering (PX4 2020), among them the F450 used in this project. These parameters can be modified to obtain refinement during a flight. PX4 uses proportional-integral-derivative (PID) controllers, which are one of the most widespread control techniques (PX4 2019b). In PID controllers, the P (proportional) gain is used to minimize the tracking error; it is responsible for a quick response and therefore should be as high as possible. The D (derivative) gain is used for damping; it is necessary, but should be set only as high as needed to avoid overshoot. The I (integral) gain maintains a memory of the error; the
Fig. 9.2 PixHawk 2.4.8 (Kantue and Pedro 2019)
"I" term increases when the desired rate has not been reached for some time (PX4 2019b). The idea is to tune the board model in flight using ground control station (GCS) software, where it is possible to change these parameters and check their effects on each of the drone's degrees of freedom (QGROUNDCONTROL 2019).
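The following minimal, generic sketch illustrates how the P, I, and D terms described above interact; the structure is a textbook PID, not the PX4 implementation, and all gains are illustrative.

```python
# A minimal, generic PID sketch (not the PX4 code).
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt                   # I term: memory of the error
        derivative = (error - self.prev_error) / dt   # D term: damping
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```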
9.2.2 Ground Control Station

A ground control station (GCS), described in the previous chapter, is needed to check the behavior of the drone during flight and was used to update the drone's firmware, adjust its parameters, and calibrate its sensors. Running on a base computer, it communicates with the UAV wirelessly, e.g., via telemetry radio or Wi-Fi, displays real-time data on the performance and position of the UAV, and shows data from many of the instruments present on a conventional plane or helicopter (ARDUPILOT 2019; Hong et al. 2008). This work used the QGroundControl GCS due to its affinity with PX4 and its PID tools, which allowed changes to the PID gains while the drone was still in the air. QGroundControl has the ability to read telemetry data simultaneously from multiple aircraft if they use the same version of MAVLink, while still providing more common features such as telemetry logging, a visual GUI display, a mission planning tool, and the ability to adjust the PID gains during flight (QGROUNDCONTROL 2019), as shown in Fig. 9.3, as well as the ability to display vital flight data (Huang et al. 2017; Ramirez-Atencia and Camacho 2018; Shuck 2013; Songer 2013).
9.2.3 Drone Software Development Kit (SDK)

During the development of this project, some SDKs were evaluated to obtain autonomous software control of drones. They were considered for different parts of the project because of their differing versatility; the idea was to choose the SDK with the best balance between robustness and simplicity. The selection included MAVROS (ROS 2019), MAVSDK (2019) and DroneKit (2015) because of their popularity. All SDKs are built on the MAVLink protocol, responsible for controlling the drone. DroneKit was the simplest, but it did not have the best support for PX4: its external control, which accepts remote commands via programming, was developed for ArduPilot applications (PX4 2019c). The MAVROS package running on ROS presented the best system in terms of robustness, but it was complex to manage. MAVSDK presented the best result: it has full support for PX4 applications and is not complex to manage, being the choice for this project.
Fig. 9.3 QGroundControl PID tunning (QGROUNDCONTROL 2019)
9.2.3.1 MAVLink and MAVSDK

MAVLink is a messaging protocol for exchanging information with a drone and between the components of its controller board (MAVLINK 2019). Data streams, such as telemetry status, are published as topics, while sub-protocols, like those used for mission or parameter control, are point-to-point with retransmission (MAVLINK 2019). MAVSDK is a MAVLink library with APIs implemented in different languages, such as C++ and Python. MAVSDK can be used on a computer embedded in a vehicle or on a ground or mobile station device, which has significantly more processing power than an ordinary flight controller and is able to perform tasks such as computer vision, obstacle avoidance and route planning (MAVSDK 2019). MAVSDK was the chosen framework, due to its affinity with PX4 and its simplicity.
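A minimal MAVSDK-Python sketch of the connect–arm–takeoff sequence used as a starting point for such control is shown below; the UDP address is the PX4 SITL default and may differ on real hardware.

```python
# A minimal MAVSDK-Python sketch: connect, arm, take off, land.
import asyncio
from mavsdk import System

async def main():
    drone = System()
    await drone.connect(system_address="udp://:14540")  # PX4 SITL default
    async for state in drone.core.connection_state():
        if state.is_connected:
            break
    await drone.action.arm()
    await drone.action.takeoff()
    await asyncio.sleep(10)  # hover briefly before landing
    await drone.action.land()

asyncio.run(main())
```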
9.2.3.2 ROS and MAVROS

The Robot Operating System (ROS), also presented in the previous chapter, is a framework widely used in robotics (Cousins 2010; Martinez 2013). ROS offers features such as distributed computing, message passing and code reuse (Fairchild 2016; Joseph 2015).
Fig. 9.4 ROS master (Hasan 2019)
ROS allows robot control to be divided into several tasks called nodes, which are processes that perform computation, allowing modularity (Quigley et al. 2009). The nodes exchange messages with each other in a publisher–subscriber system, where a topic acts as an intermediate store into which some nodes publish content while others subscribe to receive it (Fairchild 2016; Kouba 2019; Quigley et al. 2009). ROS has a general manager called the "master" (Hasan 2019). The ROS master, as seen in Fig. 9.4, is responsible for providing names and registrations to services and nodes in the ROS system. It tracks and directs publishers and subscribers to topics and services; the role of the master is to help individual ROS nodes locate each other. Once the nodes have located each other, they communicate peer-to-peer (Fairchild 2016). To support collaborative development, the ROS system is organized in packages, which are simply directories containing XML files that describe the package and present its dependencies. One of these packages is MAVROS, an extensible MAVLink communication node with a proxy for a GCS. The package provides a driver for communication with a variety of autopilots using the MAVLink communication protocol and a UDP MAVLink bridge for the GCS. The MAVROS package allows MAVLink communication between different computers running ROS and is currently the officially supported bridge between ROS and MAVLink (PX4 2019d).
9.2.3.3 DroneKit

DroneKit is an SDK built with development tools for UAVs (DRONEKIT 2015). It allows creating applications that run on a host computer and communicate with ArduPilot flight controllers. Applications can insert a level of intelligence into the vehicle's behavior and can perform computationally expensive or real-time-dependent
tasks, such as computer vision or path planning (3D Robotics 2015, 2019). Currently, PX4 is not yet fully compatible with DroneKit, which is more suitable for ArduPilot applications (PX4 2019c).
9.2.4 Simulation

This work used the Gazebo platform (Nogueira 2014) with its PX4 simulator implementation, which provides various vehicle models with Pixhawk-specific hardware and firmware simulation through PX4 SITL. It was not the only option, since PX4 SITL is available for other platforms, such as jMAVSim (Hentati et al. 2018), AirSim (Shah et al. 2019) and X-Plane (Meyer 2020). jMAVSim does not make it easy to integrate obstacles or extra sensors, such as cameras (Hentati et al. 2018), and was discarded mainly for this reason. AirSim was also discarded because, while realistic, it requires a powerful graphics processing unit (GPU) (Shah et al. 2019), which could compete for resources during the object detection phase. X-Plane, also discarded, is realistic and has a variety of UAV models and environments already implemented (Hentati et al. 2018); however, it requires a license for its use. Thus, Gazebo was the option chosen, due to its simulation environment with a variety of online resource models, its ability to import meshes from other modeling software, such as SolidWorks or Inkscape (Koenig and Howard 2004), and its free license.
9.2.4.1 Gazebo

Gazebo, already presented in the previous chapter, is an open-source simulation platform that allows integration with ROS. Gazebo is currently only supported on Linux, but there is some speculation about a Windows version.
9.2.4.2 PX4 SITL

The PX4 firmware offers a complete hardware simulation (Hentati et al. 2018; Yan et al. 2002) through its own SITL (software in the loop), which feeds the flight stack with simulated environment inputs. The simulation reacts to the given data exactly as it would in reality and evaluates the total power required at each rotor (Cardamone 2017; Nguyen et al. 2018). The PX4 SITL makes it possible to simulate the same software as on a real platform, rigorously replicating the behavior of an autopilot; it can simulate the same autopilot used on a real drone and its MAVLink protocol, which generalizes to direct use on a
real drone (Fathian et al. 2018). The greatest advantage of the PX4 SITL is that a flight controller cannot distinguish whether it is running in simulation or inside a real vehicle, allowing the simulation code to be imported directly to commercially available UAV platforms (Allouch et al. 2019).
9.2.5 Deep Learning Frameworks

Deep learning is a type of machine learning, generally used for classification, regression, and feature extraction tasks, with multiple layers of representation and abstraction (Deng 2014). For object detection, feature extraction is required and can be achieved using convolutional neural networks (CNNs), a class of deep neural networks that apply filters at various levels to extract and classify visual information from a source, such as an image or video (O'Shea and Nash 2015). This project used a CNN (Holden et al. 2016; Jain 2020; Opala 2020; Sagar 2020; Tokui et al. 2019) to detect visual targets using a camera.
9.2.5.1 Caffe Framework

This project used the Caffe deep learning framework (Jia and Shelhamer 2020; Jia et al. 2019, 2014), although there are other options such as Keras (2020), scikit-learn (2020), PyTorch (2020) and TensorFlow (Abadi et al. 2020; Rampasek and Goldenberg 2016; Tensorflow 2016; TENSORFLOW 2019, 2020b, a). Caffe provides a complete toolkit for training, testing, fine-tuning, and deploying models, with well-written documentation and examples for these tasks. It is developed under a free BSD license, is built with the C++ language, and maintains Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and many other deep models efficiently (Bahrampour et al. 2015; Bhatia 2020).
9.2.6 Object Detection

The most common way to interpret the location of an object is to create a bounding box around the detected object, as seen in Fig. 9.5. Object detection, detailed in the previous chapter, was the first stage in this tracking system, as it locates the object to be tracked (Hampapur et al. 2005; Papageorgiou and Poggio 2000; Redmon et al. 2016).
Fig. 9.5 Bounding boxes from object detection (Ganesh 2019)
9.2.6.1 Convolutional Neural Networks (CNN)

A convolutional neural network (CNN) (O'Shea and Nash 2015), as described in the previous chapter, is a variation of the so-called multilayer perceptron networks and was inspired by the biological process of data processing (Google Developers 2020; Hui 2019; Karpathy 2020; Rezatofighi et al. 2019; Stutz 2015; Vargas and Vasconcelos 2019).
9.2.6.2 Single-Shot MultiBox Detector (SSD)

SSD is an object detection algorithm that uses a deep learning neural network model (Liu et al. 2011; Liu 2020; Liu et al. 2016). It was designed for real-time applications like this one. It is lighter than other models, as it speeds up the inference of new bounding boxes by reusing pixels or feature maps, which are the result of convolutional blocks and represent the dominant characteristics of the image at different scales (Dalmia and Dalmia 2019; Forson 2019; Soviany and Ionescu 2019). Its core was built around a technique called MultiBox, a method for fast class-agnostic bounding box coordinate proposals (Szegedy et al. 2014, 2015). Regarding performance and accuracy in object detection, it scores above 74% mAP at 59 frames per second on datasets like COCO and PascalVOC (Forson 2019).
MultiBox

MultiBox is a method for bounding box regression that achieves dimensionality reduction, as it consists of branches of convolutional layers (Forson 2019), as seen
Fig. 9.6 MultiBox architecture (Forson 2019)
in Fig. 9.6, which resize feature maps across the network while maintaining the original width and height. The magic behind the MultiBox technique is the interaction between two critical evaluation factors: confidence loss (CL) and location loss (LL). CL measures how confident the class selection is, i.e., whether the correct class of the object was chosen, using categorical cross-entropy (Forson 2019). We can think of cross-entropy as measuring a received response that is not optimal, while entropy represents the ideal answer; therefore, knowing the entropy, any received response can be measured in terms of how far it is from optimal (Dipietro 2019).
SSD Architecture

The SSD is composed of feature map extraction, through an intermediate neural network called the feature extractor, and the application of convolution filters to detect objects. The SSD architecture (Fig. 9.7) consists of three main components: the base network, extra layers for feature extraction, and prediction layers. In the base network, extraction is performed using a convolutional neural network called VGG-16, the feature extractor, as seen in Fig. 9.8; it is made up of combinations of convolution layers with ReLU followed by fully connected layers. In addition, it has max-pooling layers and a final layer with a softmax activation function (Frossard 2019). The output of this network is a feature map with dimensions 19 × 19 × 1024 (Dalmia and Dalmia 2019). Right after the base network, four additional convolutional layers are added to continue reducing the size of the feature map until its final dimensions are 1 × 1 × 256. Finally, the prediction layers, a crucial element of the SSD, use a variety of feature maps representing various scales
Fig. 9.7 SSD architecture (Dalmia and Dalmia 2019)
Fig. 9.8 VGG-16 architecture (Frossard 2019)
to predict class scores and bounding box coordinates (Dalmia and Dalmia 2019). The final composition of the SSD increases the chances of an object being eventually detected, localized, and classified (Howard et al. 2017; Sambasivarao 2019; Simonyan and Zisserman 2019; Szegedy et al. 2015; Tompkin 2019).
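A trained SSD model in Caffe format can be run with OpenCV's DNN module, which is how a detection loop like the one in this project can be sketched; the file names and the 0.5 confidence threshold below are illustrative assumptions.

```python
# A minimal sketch of running a Caffe SSD model with OpenCV's DNN module.
import cv2

net = cv2.dnn.readNetFromCaffe("ssd_deploy.prototxt", "ssd_cars.caffemodel")
frame = cv2.imread("frame.jpg")
h, w = frame.shape[:2]
# SSD300 preprocessing: resize to 300x300 and subtract the channel means
blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 1.0,
                             (300, 300), (104.0, 117.0, 123.0))
net.setInput(blob)
detections = net.forward()  # shape: [1, 1, N, 7]
for i in range(detections.shape[2]):
    conf = detections[0, 0, i, 2]
    if conf > 0.5:
        box = detections[0, 0, i, 3:7] * [w, h, w, h]  # scale to pixels
        print("class", int(detections[0, 0, i, 1]), "box", box.astype(int))
```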
9.2.7 Optical Character Recognition—OCR

In this work, identifying the vehicle's license plate was important so the drone could follow the car with the correct plate. This task was handled with an optical character recognition (OCR) system. OCR is a technique responsible for recognizing optically drawn characters (Eikvil 1993). OCR is a complex problem to deal with due to the variety of languages, fonts, and styles in which char-
acters and information can be written, including the complex rules for each of these languages (Islam et al. 2016; Mithe et al. 2013; Singh et al. 2010).
9.2.7.1 Inference Process

An example of the steps in the OCR technique is shown in Fig. 9.9 (Adams 1854). The steps are as follows: acquisition, preprocessing, segmentation, feature extraction, and recognition (Eikvil 1993; Kumar and Bhatia 2013; Mithe et al. 2013; Qadri and Asif 2009; Sobottka et al. 2000; Weerasinghe et al. 2020; Zhang et al. 2011).
a. Acquisition: a recorded image is fed into the system.
b. Preprocessing: eliminates color variations by smoothing and normalizing pixels. Smoothing applies convolution filters to the image to remove noise and smooth the edges; normalization finds a uniform size, slope and rotation for all characters in the image (Mithe et al. 2013).
c. Segmentation: finds the words written inside the image (Kumar and Bhatia 2013).
d. Feature extraction: extracts the characteristic features of the symbols (Eikvil 1993).
e. Recognition: identifies the characters and classifies them, scanning the lines word by word and converting the images into character streams representing the letters of the recognized words (Weerasinghe et al. 2020).
Fig. 9.9 OCR inference process steps (Adams 1854)
9.2.7.2 Tesseract OCR
This work used a plate detection system called OpenALPR, which uses Google Tesseract, an open-source OCR framework, to train networks for different languages and scripts. Tesseract converts the image into a binary image and identifies and extracts character outlines. It transforms the outlines into blobs, which are small isolated regions of a digital image, and divides the text into words using techniques such as fuzzy spaces and definite spaces. Finally, it recognizes the text by classifying and storing each recognized word (Mishra et al. 2012; Patel and Patel 2012; Sarfraz et al. 2003; Shafait et al. 2008; Smith 2007; Smith et al. 2009).
9.2.7.3 Automatic License Plate Recognition—ALPR

ALPR is a way to detect the characters that make up a vehicle's license plate, using OCR for most of the process. It combines object detection, image processing, and pattern recognition (Silva and Jung 2018). It is used in real-life applications such as automatic toll collection, traffic law enforcement, access control to parking lots and road traffic monitoring (Anagnostopoulos et al. 2008; Du et al. 2013; Kranthi et al. 2011; Liu et al. 2011). The four steps of ALPR are shown in Fig. 9.10 (Sarfraz et al. 2003).
OpenALPR

This project used OpenALPR, an open-source ALPR library built with the C++ language, with bindings in C#, Java, Node.js, and Python. The library receives images and video streams for analysis in order to identify license plates and outputs text representing the plate characters (OPENALPR 2017). It is based on OpenCV, an open-source computer vision library for image analysis (Bradski and Kaehler 2008), and Tesseract OCR (Buhus et al. 2016; Rogowski 2018).
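A minimal sketch of the OpenALPR Python binding used this way is shown below; the "br" country code, configuration paths, and file name are illustrative assumptions for a Linux install with Brazilian plate support.

```python
# A minimal sketch of the OpenALPR Python binding; paths are illustrative.
from openalpr import Alpr

alpr = Alpr("br", "/etc/openalpr/openalpr.conf",
            "/usr/share/openalpr/runtime_data")
if alpr.is_loaded():
    results = alpr.recognize_file("rear_view.jpg")
    for plate in results["results"]:
        best = plate["candidates"][0]  # highest-confidence candidate
        print(best["plate"], best["confidence"])
alpr.unload()
```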
9.3 Implemented Model

This project used datasets, SSD training in the TensorFlow and Caffe frameworks, image preprocessing for OCR, and tracking and motion functions.
9.3.1 Supervised Learning

Supervised learning (SL) is a kind of machine learning (ML) training in which the model is provided with labeled data during its training phase (Google Devel-
Fig. 9.10 ALPR steps (Sarfraz et al. 2003)
opers 2019; Shobha and Rangaswamy 2018; Talabis et al. 2015). The LabelImg tool (Talin 2015) was used, with the PascalVOC format as the standard XML annotation (Fig. 9.11), which includes class labels, bounding box coordinates, image path, image size and name, and other tags (Everingham et al. 2010).
9.3.1.1 Datasets

Three datasets were used: INRIA Person, Stanford Cars and GRAZ-02. The Stanford Cars dataset (Krause et al. 2013) allowed the SSD to identify cars along the quadcopter trajectory. This dataset contains 16,185 images from 196 classes of cars, divided into 8,144 training images and 8,041 test images, already annotated in terms of make, model and year.
Fig. 9.11 PascalVOC’s XML example (Everingham et al. 2010)
The INRIA person dataset (Dalal and Triggs 2005) is a collection of digital images highlighting people, taken over a long period of time, plus some Web images taken from Google Images (Dalal and Triggs 2005). About 2500 images were collected from that dataset. The person class was added because it was the most common false positive found in single-class training, along with a background class, to help the model discern between different environments. To help improve multi-class detection, the GRAZ-02 dataset (Opelt et al. 2006) was used, since it contains images with high-complexity objects and high background variability, including 311 images with persons and 420 images with cars (Oza and Patel 2019).
9.3.1.2 Caffe and TensorFlow Training

Unlike TensorFlow (TENSORFLOW 2019, 2020b), Caffe (Bair 2019) does not have a complete object detection API (Huang et al. 2017), which makes starting a new training more complicated. Caffe also does not include a direct visualization tool like TensorBoard. However, it includes, in its tools subdirectory, the parse_log.py script, which extracts all relevant training information from the log file and makes it suitable for plotting. Using a Linux plotting tool called gnuplot (Woo and Broker 2020), in addition to an analysis script, it was possible to build a real-time plotting algorithm (Fig. 9.12). The latest log messages indicated that the model reached an overall mAP of 95.28%, an accuracy gain of 3.55% compared to the first trained model (Aly 2005; Jia et al. 2014; Lin et al. 2020).
Fig. 9.12 Results from gnuplot custom script (Author)
9.3.2 Image Preprocessing for ALPR

To help ALPR identify license plates, the quality of the images was improved through two image preprocessing techniques: brightness variation and a sharpening mask. Brightness variation was applied in the RGB color system, where the color varies according to the levels of red, green and blue provided; there are 256 possible values for each level, ranging from 0 to 255. To change the brightness, a constant value is simply added to or subtracted from each level: for brighter images the value is added, while for darker images it is subtracted, as seen in Fig. 9.13, giving a total range of −255 to 255. The only care needed is to check whether the resulting value would exceed 255 or fall below 0, and clip it. After adjusting the brightness, the sharpening mask, also called the sharpening filter, is applied to highlight the edges, thus improving the characters of the plate (Dogra and Bhalla 2014; Fisher et al. 2020). Figure 9.14 shows an example of the result of this process.
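A minimal OpenCV/NumPy sketch of these two steps, with an illustrative brightness offset and a common 3 × 3 sharpening kernel, is:

```python
# A minimal sketch of the two preprocessing steps described above.
import cv2
import numpy as np

def adjust_brightness(img, value):
    # add/subtract a constant per channel, clipping to the valid 0-255 range
    out = img.astype(np.int16) + value
    return np.clip(out, 0, 255).astype(np.uint8)

def sharpen(img):
    # a common 3x3 sharpening kernel that boosts edges
    kernel = np.array([[0, -1, 0],
                       [-1, 5, -1],
                       [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(img, -1, kernel)

plate_crop = sharpen(adjust_brightness(cv2.imread("crop.jpg"), 40))
```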
Fig. 9.13 Effect of brightness on a collected image (author)
Fig. 9.14 Effect of sharpness filter application (author)
9.3.3 Drone Control Algorithm

The algorithm for controlling the drone was built with the assistance of Python libraries, including OpenALPR, Caffe, TensorFlow, MAVSDK, OpenCV (Bradski and Kaehler 2008) and others. Algorithm 1 is the main algorithm of the solution; at each iteration, it converts updated data from object detection and OCR into speeds along the x, y and z axes of the three-dimensional real-world environment.
9.3.3.1 Height Centralization Function
Height centering is the only control function shared between views. It positions the drone at a standard altitude (Fig. 9.15).
9.3.3.2 2D Centralization Function

When the drone has its camera pointed at the ground, the captured image is analogous to a 2D Cartesian system, and the centralization is oriented using the x- and y-coordinates, as shown in Fig. 9.16. The idea is to reduce the two values Δx and Δy, which represent the x and y distances, respectively, from the detection's central point (pink) to the central point of the frame (blue).
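A minimal sketch of this mapping from pixel offsets to velocity commands is given below; the proportional gain is an illustrative tuning constant, not a value from the chapter.

```python
# A minimal sketch of 2D centralization: convert the pixel offset between
# the detection center and the frame center into x/y velocity commands.
def centralization_velocities(box, frame_w, frame_h, gain=0.002):
    x1, y1, x2, y2 = box
    det_cx, det_cy = (x1 + x2) / 2, (y1 + y2) / 2
    dx = det_cx - frame_w / 2   # horizontal pixel distance
    dy = det_cy - frame_h / 2   # vertical pixel distance
    # proportional mapping: larger offsets produce faster corrections
    return gain * dx, gain * dy
```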
9.3.3.3 Yaw Alignment Function

In the rear-view approach, the alignment function is used to center the drone horizontally, according to the position of the car. The idea is to keep the drone facing the car, as shown in Fig. 9.17. These values are calculated using the yaw distance between the frame and the central detection point on the y-axis.
Fig. 9.15 Height centralization (author)
Fig. 9.16 2D Centralization (author)
Fig. 9.17 Yaw Alignment (author)
9.3.3.4 Approximation Function

In the rear-view approach, the approximation function was the most difficult to devise; it uses the distance of the object from the camera, the speed of the car, and a balance between a safe distance from the car and the minimum distance at which the OCR still works. The distance of the object was calculated using the relationship between the camera's field of view (FOV) and its sensor dimensions (Fulton 2020), as seen in Fig. 9.18.
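Assuming a pinhole camera model, the FOV-based distance estimate described above can be sketched as follows; the car width, pixel counts, and FOV in the example are illustrative.

```python
# A minimal sketch of the FOV-based distance estimate: at distance d, the
# camera sees a strip of width 2*d*tan(FOV/2), and the target occupies a
# known fraction of the frame.
import math

def distance_from_camera(real_width_m, box_px, frame_px, fov_deg):
    fraction = box_px / frame_px               # fraction of frame occupied
    half_fov = math.radians(fov_deg) / 2.0
    return real_width_m / (2.0 * fraction * math.tan(half_fov))

# e.g., a ~1.8 m wide car filling 400 of 640 pixels with a 60-degree FOV
d = distance_from_camera(1.8, 400, 640, 60.0)  # roughly 2.5 m
```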
9.4 Experiments and Results

To evaluate the algorithm, a notebook was used as the ground station, coordinating the frame acquisition, the processing of the frame by the SSD, the calculation of values, and the drone positioning commands. Its specifications were Ubuntu 18.04 64-bit, an Intel Core i7 at 2.2 GHz, 16 GB of DDR3 SDRAM, and an Nvidia GeForce GTX 1060 graphics card with 6 GB of GDDR5 memory.
Fig. 9.18 Object distance from camera (author)
Fig. 9.19 Typhoon H-480 in Gazebo (author)
9.4.1 The Simulated Drone

For the simulated experiment, a Typhoon H-480 model was used in Gazebo, as shown in Fig. 9.19. It is available in the PX4 Firmware Tools directory on GitHub, at https://github.com/PX4/Firmware/tree/master/Tools, ready to use in the SITL environment. It was handy, as it has a built-in gimbal and camera. The gimbal allowed the images to be captured in a very stable way, avoiding compromising the detections.
Fig. 9.20 Custom world in Gazebo (author)
9.4.2 Simulated World

In the simulated experiments, a customized city (Fig. 9.20) was created with pre-built models from the Gazebo model database, available at https://bitbucket.org/osrf/gazebo_models.
9.4.3 CAT Vehicle

The CAT Vehicle is a simulated autonomous vehicle distributed as a ROS package to support research on autonomous driving technology. It has fully configurable simulated sensors and actuators imitating a real-world vehicle capable of autonomous driving, including steering and speed control implemented in real-time simulation. In the experiments, the drone was positioned some distance from the rear end of the car, as seen in Fig. 9.21, and followed it, capturing its complete image and allowing the algorithm to process the recognized plate. In addition, the vehicle allowed the plate to be customized with a Brazilian model, making it very convenient for this project.
9.4.4 Simulated Experiments

The experiments showed that, in an urban scene, most cars could be detected within a range of 0.5–4.5 m from the camera, as shown by the green area in Fig. 9.22.
Fig. 9.21 CAT Vehicle and Typhoon H-480 in Gazebo (author)
Fig. 9.22 Car detection simulation (author)
The plate detection range was 1.5–3 m. The balance between the number of detections and the total of correctly extracted information is represented in the green area of Fig. 9.23. Figure 9.24 shows a collision hazard zone represented by a red area; the height giving the best trade-off between safety and image quality varies between 4 and 6 m. Figure 9.25 shows, as a red area, the heights at which other objects, such as people and other vehicles, may appear, making it difficult to identify the moving vehicle as the object to be tracked.
Fig. 9.23 Plate detection and processing results (author)
Fig. 9.24 Car rear following results (author)
Fig. 9.25 Car top following results (author)
Fig. 9.26 Customized F450 built (author)
9.4.5 The Real Quadrotor

A custom F450 (Fig. 9.26) was used for the outdoor experiments. The challenge was to use cheap materials and still achieve reasonable results. The numbers shown in Fig. 9.26 are reference indexes for the components listed in Table 9.1. For frame acquisition, an EasyCap capture device, connected to the video receiver, was plugged into the computer. The Pixhawk 4, the most expensive component, was chosen as the flight controller board, since the idea was to use the PX4 flight stack as the control board configuration.
9.4.6 Outdoor Experiment

For the outdoor experiment on campus (Fig. 9.27), the model was changed to detect people instead of cars; to avoid colliding with the person who served as the target, everything was coordinated very slowly. Another experiment used a video as the image source (Fig. 9.28) to check how many detections, plates, and correct pieces of information the technique could acquire. The video was a recording from Marginal Pinheiros, one of the busiest highways in the city of São Paulo (Oliveira 2020). The experiment produced 379 vehicle detections out of the 500 vehicles present in the clip; 227 plates were found, 164 of them with the correct information extracted (Fig. 9.29).
Table 9.1 Customized F450 specifications

Index | Part name | Specifications | Quantity | Weight (g) | Price (R$)¹
1 | Controller | Pixhawk PX4/2.4.8 | 1 | 38 | 183.40
2 | GPS Module | NEO-8N | 1 | 30 | 58.16
3 | Telem. Transc. | Readytosky 3DR 915 MHz | 1 | 15.4 | 63.23
4 | Video Transm. | JMT / TS832 | 1 | 22 | 59.15
5 | FPV Camera | Cyclops DVR 3 4.2V 700TVL | 1 | 4.5 | 55.71
6 | RC Receiver | Flysky / RX FS-R9B | 1 | 18 | 19.69
7 | PPM Encoder | JMT PWM to PPM | 1 | 3 | 22.40
8 | ESC | Hobbypower 39 A | 4 | 100 | 69.01
9 | Motor / Prop | A2212 1000 KV / 6035 2-Blade | 4 | 200.8 | 50.62
10 | Battery (Li-Po) | Tattu 11.1 V 35 C 5200 mAh 3S | 1 | 375 | 160.00
11 | Video Receiver² | JMT / RS832 | 1 | 852 | 73.94
12 | Frame | YoungRC F450 450 mm | 1 | 280 | 21.31
13 | Power Module | XT60 6S / 12S | 1 | 28 | 16.08
Total | | | | 1141.7 | 882.73

Note 1 The price is equivalent to USD 190.77 on March 08th, 2020
Note 2 Not added to the weight sum, since it was used in the ground station

Fig. 9.27 Outdoor experiment in Campus (author)
Fig. 9.28 Experiment on record (Oliveira 2020)
Fig. 9.29 Experiment on record results (author)
9.5 Conclusions and Future Works

The distance and the brightness level determined the limits in the tests performed and are aspects to work on in future improvements. A high-definition camera should be used to reduce noise and vibration in the captured images. The mathematical functions used to calculate the drone's speed were useful in understanding the drone's behavior. A different approach, such as position estimation or a PID controller, could be used to determine the object's route.
References

3D Robotics (2015) DroneKit. Available in: https://3drobotics.github.io/solodevguide/concept-dronekit.html. Cited December 21st, 2019
3D Robotics (2019) About DroneKit. Available in: https://dronekit-python.readthedocs.io/en/latest/about/overview.html. Cited December 21st, 2019
Abadi M et al (2020) TensorFlow: large-scale machine learning on heterogeneous distributed systems. ArXiv, arXiv:1603.04467
Adams HG (1854) Nests and eggs of familiar birds, vol IV, 1st edn. Groombridge and Sons, 5 Paternoster Row, London, England
Allouch A et al (2019) MAVSec: securing the MAVLink protocol for Ardupilot/PX4 unmanned aerial systems. In: 2019 15th international wireless communications & mobile computing conference (IWCMC). IEEE. https://doi.org/10.1109/IWCMC.2019.8766667
Aly M (2005) Survey on multiclass classification methods. 1200 East California Boulevard, Pasadena, California, USA
Amit Y, Felzenszwalb P (2014) Object detection. In: Computer vision. Springer US, 537–542. https://doi.org/10.1007/978-0-387-31439-6
Anagnostopoulos C-N et al (2008) License plate recognition from still images and video sequences: a survey. In: IEEE transactions on intelligent transportation systems. IEEE. https://doi.org/10.1109/tits.2008.922938
ArduPilot (2019) Choosing a ground station: overview. Available in: https://ardupilot.org/plane/docs/common-choosing-a-ground-station.html. Cited December 10th, 2019
Bahrampour S et al (2015) Comparative study of Caffe, Neon, Theano, and Torch for deep learning. ArXiv, arXiv:1511.06435
BAIR (2019) Caffe. Available in: https://caffe.berkeleyvision.org/. Cited January 13th, 2019
Bartak R, Vykovsky A (2015) Any object tracking and following by a flying drone. In: 2015 fourteenth Mexican international conference on artificial intelligence (MICAI). IEEE. https://doi.org/10.1109/micai.2015.12
Barton TEA, Azhar MAHB (2017) Forensic analysis of popular UAV systems. In: Seventh international conference on emerging security technologies (EST). IEEE. https://doi.org/10.1109/EST.2017.8090405
Bendea H et al (2008) Low cost UAV for post-disaster assessment. In: The international archives of the photogrammetry, remote sensing and spatial information sciences, vol 37
Bhadani RK, Sprinkle J, Bunting M (2018) The CAT vehicle testbed: a simulator with hardware in the loop for autonomous vehicle applications. In: Electronic proceedings in theoretical computer science, vol 269. Open Publishing Association, 32–47
Bhatia R (2018) Tensorflow vs Caffe: which machine learning framework should you opt for? In: Analytics India Magazine, Analytics India Magazine Pvt Ltd. Available in: https://analyticsindiamag.com/tensorflow-vs-caffe-which-machine-learning-framework-should-you-opt-for/. Cited January 14th, 2020
Bradski G, Kaehler A (2008) Learning OpenCV: computer vision with the OpenCV library. O'Reilly Media, Inc
Braga RG et al (2017) Collision avoidance based on Reynolds rules: a case study using quadrotors. In: Advances in intelligent systems and computing. Springer International Publishing, 773–780. https://doi.org/10.1007/978-3-319-54978-1
Breedlove L (2019) An insider's look at the rise of drones: industry veteran Lon Breedlove gives his perspective on the evolution and future of the drone industry. Available in: https://medium.com/hangartech/an-insiders-look-at-the-rise-of-drones-41280563f0dd. Cited November 15th, 2019
Brito PL de et al (2019) A technique about neural network for passageway detection. In: 16th international conference on information technology-new generations (ITNG 2019). Springer International Publishing, 465–470. https://doi.org/10.1007/978-3-030-14070-0_64
Buhus ER, Timis D, Apatean A (2016) Automatic parking access using OpenALPR on Raspberry Pi3. In: Journal of ACTA TECHNICA NAPOCENSIS electronics and telecommunications. Technical University of Cluj-Napoca, Cluj, Romania
Cabebe J (2012) Google Translate for Android adds OCR. Available in ?? Cited November 15th, 2019
Cardamone A (2017) Implementation of a pilot in the loop simulation environment for UAV development and testing. Doctoral Thesis (Graduation Project) | Scuola di Ingegneria Industriale e dell'Informazione, Politecnico di Milano, Milano, Lombardia, Italia
Chapman A (2016) Types of drones: multi-rotor vs fixed-wing vs single rotor vs hybrid VTOL. DRONE Magz I(3):10
Cousins S (2010) ROS on the PR2 [ROS topics]. IEEE Robotics & Automation Magazine, IEEE, vol 17(3), 23–25. https://doi.org/10.1109/mra.2010.938502
Dai J et al (2016) R-FCN: object detection via region-based fully convolutional networks. In: Proceedings of the 30th international conference on neural information processing systems (NIPS'16). Curran Associates Inc., Red Hook, NY, USA, 379–387. ISBN 9781510838819
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05). IEEE. https://doi.org/10.1109/cvpr.2005.177
Dalmia A (2019) Real-time object detection: understanding SSD. Available in: https://medium.com/inveterate-learner/real-time-object-detection-part-1-understanding-ssd-65797a5e675b. Cited November 28th, 2019
Deng L (2014) Deep learning: methods and applications. In: Foundations and Trends in Signal Processing, Now Publishers, vol 7(3–4), 197–387. https://doi.org/10.1561/2000000039
Dipietro R (2019) A friendly introduction to cross-entropy loss. Available in: https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/. Cited December 06th, 2019
Dogra A, Bhalla P (2014) Image sharpening by Gaussian and Butterworth high pass filter. Biomed Pharmacol J Oriental Sci Publishing Company 7(2):707–713. https://doi.org/10.13005/bpj/545
DroneKit (2015) DroneKit: your aerial platform. Available in: https://dronekit.io/. Cited December 21st, 2019
Du S et al (2013) Automatic license plate recognition (ALPR): a state-of-the-art review. In: IEEE transactions on circuits and systems for video technology, IEEE, vol 23(2), 311–325. https://doi.org/10.1109/tcsvt.2012.2203741
Eikvil L (1993) OCR—optical character recognition. Gaustadalleen 23, P.O. Box 114 Blindern, N-0314 Oslo, Norway
Everingham M et al (2010) The PASCAL visual object classes (VOC) challenge. Int J Comput Vision 88(2):303–338
Fairchild C (2016) Getting started with ROS. In: ROS robotics by example: bring life to your robot using ROS robotic applications. Packt Publishing, Birmingham, England. ISBN 978-1-78217-519-3
Fathian K et al (2018) Vision-based distributed formation control of unmanned aerial vehicles
Feng L, Fangchao Q (2016) Research on the hardware structure characteristics and EKF filtering algorithm of the autopilot PIXHAWK. In: 2016 sixth international conference on instrumentation & measurement, computer, communication and control (IMCCC). IEEE. https://doi.org/10.1109/imccc.2016.128
Ferdaus MM (2017) Ninth international conference on advanced computational intelligence (ICACI). IEEE. https://doi.org/10.1109/icaci.2017.7974513
Fisher R et al (2003) Unsharp filter. Hypermedia Image Processing Reference (HIPR), School of Informatics, The University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK. Available in: https://homepages.inf.ed.ac.uk/rbf/HIPR2/unsharp.htm. Cited January 14th, 2020
Forson E (2017) Understanding SSD MultiBox | Real-time object detection in deep learning. Available in: https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab. Cited November 26th, 2019
Frossard D (2016) VGG in TensorFlow: model and pre-trained parameters for VGG16 in TensorFlow. Available in: https://www.cs.toronto.edu/~frossard/post/vgg16/. Cited November 28th, 2019
Fulton W (2020) Math of field of view (FOV) for a camera and lens. Available in: https://www.scantips.com/lights/fieldofviewmath.html. Cited January 16th, 2020
Ganesh P (2019) Object detection: simplified. Available in: https://towardsdatascience.com/object-detection-simplified-e07aa3830954. Cited January 02nd, 2020
Google Developers (2019) Classification: true vs. false and positive vs. negative. Available in: https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative. Cited December 08th, 2019
Google Developers (2019) What is supervised learning? Available in: https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative. Cited January 13th, 2020
Hampapur A et al (2005) Smart video surveillance: exploring the concept of multiscale spatiotemporal tracking. IEEE Signal Processing Magazine, vol 22(2). IEEE, 38–51. https://doi.org/10.1109/msp.2005.1406476
Hasan KSB (2019) What, why and how of ROS. Available in: https://towardsdatascience.com/what-why-and-how-of-ros-b2f5ea8be0f3. Cited December 19th, 2019
Hentati AI (2018) 14th international wireless communications & mobile computing conference (IWCMC). IEEE. https://doi.org/10.1109/iwcmc.2018.8450505
Holden D, Saito J, Komura T (2016) A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics, ACM, vol 35(4), 1–11. https://doi.org/10.1145/2897824.2925975
Hong Y, Fang J, Tao Y (2008) Ground control station development for autonomous UAV. In: Intelligent robotics and applications. Springer Berlin Heidelberg, 36–44. https://doi.org/10.1007/978-3-540-88518-4
Howard AG et al (2017) MobileNets: efficient convolutional neural networks for mobile vision applications
Huang J et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/cvpr.2017.351
Huang J et al (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 7310–7311
Hui J (2019) SSD object detection: single shot multibox detector for real-time processing. Available in: https://medium.com/@jonathanhui/ssd-object-detection-single-shot-multibox-detector-for-real-time-processing-9bd8deac0e06. Cited November 27th, 2019
Islam N, Islam Z, Noor N (2016) A survey on optical character recognition system. In: Journal of information & communication technology-JICT, vol 10, issue 2. Universiti Utara Malaysia Press, 06010 UUM Sintok, Kedah Darul Aman, Malaysia
Jain Y (2020) Tensorflow or PyTorch: the force is strong with which one? Available in: https://medium.com/@UdacityINDIA/tensorflow-or-pytorch-the-force-is-strong-with-which-one-68226bb7dab4. Cited March 04th, 2020
Jesus LD de et al (2019) Greater autonomy for RPAS using solar panels and taking advantage of rising winds through the algorithm. In: 16th international conference on information technology-new generations (ITNG 2019). Springer, 615–616. https://doi.org/10.1007/978-3-030-14070-0
Jia Y, Shelhamer E (2000) Caffe Model Zoo. Available at: http://caffe.berkeleyvision.org/model_zoo.html. Cited January 14th, 2020
Jia Y, Shelhamer E (2019) Caffe. Available in: https://caffe.berkeleyvision.org/. Cited January 02nd, 2020
Jia Y et al (2014) Caffe. In: Proceedings of the ACM international conference on multimedia—MM '14. ACM Press. https://doi.org/10.1145/2647868.2654889
Joseph L (2015) Why should we learn ROS?: introduction to ROS and its package management. In: Mastering ROS for robotics programming: design, build, and simulate complex robots using Robot Operating System and master its out-of-the-box functionalities. Packt Publishing, Birmingham, England. ISBN 978-1-78355-179-8
Kantue P, Pedro JO (2019) Real-time identification of faulty systems: development of an aerial platform with emulated rotor faults. In: 4th conference on control and fault tolerant systems (SysTol). IEEE. https://doi.org/10.1109/systol.2019.8864732
Karpathy A (2019) Layers used to build ConvNets. Available in: http://cs231n.github.io/convolutional-networks/. Cited January 02nd, 2020
Keras (2020) Keras: the Python deep learning library. Available in: https://keras.io/. Cited March 03rd, 2020
Koenig N, Howard A (2004) Design and use paradigms for Gazebo, an open-source multi-robot simulator. In: 2004 IEEE/RSJ international conference on intelligent robots and systems (IROS) (IEEE Cat. No.04CH37566). IEEE. https://doi.org/10.1109/iros.2004.1389727
Kouba A (2019) Services. Available in: http://wiki.ros.org/Services. Cited December 19th, 2019
Kranthi S, Pranathi K, Srisaila A (2011) Automatic number plate recognition. In: International journal of advancements in technology (IJoAT). Nupur Publishing House, Ambala, India
Krause J et al (2013) 3D object representations for fine-grained categorization. In: 4th international IEEE workshop on 3D representation and recognition (3dRR-13). Sydney, Australia
Kumar G, Bhatia PK (2013) Neural network based approach for recognition of text images. Int J Comput Appl Foundation Comput Sci 62(14):8–13. https://doi.org/10.5120/10146-4963
Lecun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
Lee T, Leok M, McClamroch NH (2010) Geometric tracking control of a quadrotor UAV on SE(3). In: 49th IEEE conference on decision and control (CDC). IEEE. https://doi.org/10.1109/cdc.2010.5717652
Lin T-Y et al (2014) Microsoft COCO: common objects in context. In: European conference on computer vision (ECCV). Zurich: Oral. Available in: /se3/wp-content/uploads/2014/09/coco-eccv.pdf; http://mscoco.org. Cited January 14th, 2020
Liu G et al (2011) The calculation method of road travel time based on license plate recognition technology. In: Communications in computer and information science. Springer, Berlin, Heidelberg, 385–389. https://doi.org/10.1007/978-3-642-22418-8_54
Liu W (2020) SSD: single shot multibox detector. Available in: https://github.com/weiliu89/caffe/tree/ssd. Cited January 14th, 2020
Liu W et al (2016) SSD: single shot multibox detector. Lecture notes in computer science. Springer International Publishing, 21–37. https://doi.org/10.1007/978-3-319-46448-0. ISSN 1611-3349
Luukkonen T (2011) Modelling and control of quadcopter. Master Thesis (Independent research project in applied mathematics) | Department of Mathematics and Systems Analysis, Aalto University School of Science, Espoo, Finland
Mao W et al (2017) Indoor follow me drone. In: Proceedings of the 15th annual international conference on mobile systems, applications, and services—MobiSys '17. ACM Press. https://doi.org/10.1145/3081333.3081362
Martinez A (2013) Getting started with ROS. In: Learning ROS for robotics programming: a practical, instructive, and comprehensive guide to introduce yourself to ROS, the top-notch, leading robotics framework. Packt Publishing, Birmingham, England. ISBN 978-1-78216-144-8
Martins WM et al (2018) A computer vision based algorithm for obstacle avoidance. Information technology-new generations. Springer, 569–575. https://doi.org/10.1007/978-3-319-77028-4
MAVLink (2019) MAVLink developer guide. Available in: https://mavlink.io/en/. Cited December 01st, 2019
MAVSDK (2019) MAVSDK (develop). Available in: https://mavsdk.mavlink.io/develop/en/. Cited December 01st, 2019
Meier L et al (2012) PIXHAWK: a micro aerial vehicle design for autonomous flight using onboard computer vision. In: Autonomous robots, vol 33(1–2). Springer Science and Business Media LLC, 21–39. https://doi.org/10.1007/s10514-012-9281-4
Meyer A (2020) X-Plane. Available in: https://www.x-plane.com/. Cited March 01st, 2020
Mishra N et al (2012) Shirorekha chopping integrated Tesseract OCR engine for enhanced Hindi language recognition. Int J Comput Appl Foundation Comput Sci 39(6):19–23
Mithe R, Indalkar S, Divekar N (2013) Optical character recognition. In: International journal of recent technology and engineering (IJRTE), vol 2(1). Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP), Bhopal (Madhya Pradesh), India, 72–75
Mitsch S, Ghorbal K, Platzer A (2013) On provably safe obstacle avoidance for autonomous robotic ground vehicles. In: Robotics: science and systems IX. Robotics: Science and Systems Foundation. https://doi.org/10.15607/rss.2013.ix.014
Nguyen KD, Ha C, Jang JT (2018) Development of a new hybrid drone and software-in-the-loop simulation using PX4 code. In: Intelligent computing theories and application. Springer International Publishing, 84–93. https://doi.org/10.1007/978-3-319-95930-6
Nogueira L (2014) Comparative analysis between Gazebo and V-REP robotic simulators. Master Thesis | School of Electrical and Computer Engineering, Campinas University, Campinas, São Paulo, Brazil
Oliveira na Estrada (2020) Marginal Pinheiros alterações no caminho para Castelo Branco. Available in: https://www.youtube.com/watch?v=VEpMwK0Zw1g. Cited January 05th, 2020
Opala M (2020) Top machine learning frameworks compared: Scikit-learn, DLIB, MLIB, TensorFlow, and more. Available in: https://www.netguru.com/blog/top-machine-learning-frameworks-compared. Cited March 04th, 2020
Opelt A et al (2006) Generic object recognition with boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE, vol 28(3), 416–431. https://doi.org/10.1109/tpami.2006.54
OpenALPR (2020) OpenALPR documentation. Available in: http://doc.openalpr.com/. Cited January 07th, 2020
O'Shea K, Nash R (2015) An introduction to convolutional neural networks
Oza P, Patel VM (2019) One-class convolutional neural network. IEEE Signal Processing Letters, IEEE, vol 26(2), 277–281. https://doi.org/10.1109/lsp.2018.2889273
Papageorgiou C, Poggio T (2000) A trainable system for object detection. International Journal of Computer Vision, Springer Science and Business Media LLC 38(1):15–33. https://doi.org/10.1023/a:1008162616689
Patel C, Patel A, Patel D (2012) Optical character recognition by open source OCR tool Tesseract: a case study. In: International journal of computer applications, vol 55(10). Foundation of Computer Science, 50–56. https://doi.org/10.5120/8794-2784
Pinto LGM et al (2019) A SSD–OCR approach for real-time active car tracking on quadrotors. In: 16th international conference on information technology-new generations (ITNG 2019). Springer, 471–476
Pixhawk (2019) What is Pixhawk? Available in: https://pixhawk.org/. Cited December 16th, 2019
PX4 (2019) PX4 DEV, MAVROS. Available in: https://dev.px4.io/v1.9.0/en/ros/mavros_installation.html. Cited December 19th, 2019
PX4 (2019) Simple multirotor simulator with MAVLink protocol support. Available in: https://github.com/PX4/jMAVSim. Cited March 01st, 2020
PX4 (2019) What is PX4? Available in: https://px4.io. Cited December 03rd, 2019
PX4 DEV (2019) MAVROS. Available in: https://dev.px4.io/v1.9.0/en/ros/mavros_installation.html. Cited December 19th, 2019
PX4 DEV (2019) Using DroneKit to communicate with PX4. Available in: https://dev.px4.io/v1.9.0/en/robotics/dronekit.html. Cited December 21st, 2019
PyTorch (2020) Tensors and dynamic neural networks in Python with strong GPU acceleration. Available in: https://github.com/pytorch/pytorch. Cited March 03rd, 2020
Qadri MT, Asif M (2009) Automatic number plate recognition system for vehicle identification using optical character recognition. In: 2009 international conference on education technology and computer. IEEE. https://doi.org/10.1109/icetc.2009.54
QGroundControl (2019) QGroundControl user guide. Available in: https://docs.qgroundcontrol.com/en/. Cited December 06th, 2019
QGroundControl (2019) QGroundControl: intuitive and powerful ground control station for the MAVLink protocol. Available in: http://qgroundcontrol.com/. Cited December 19th, 2019
Quigley M et al (2009) ROS: an open-source robot operating system, vol 3
Ramirez-Atencia C, Camacho D (2018) Extending QGroundControl for automated mission planning of UAVs. Sensors, MDPI AG 18(7):2339. https://doi.org/10.3390/s18072339
Rampasek L, Goldenberg A (2016) TensorFlow: biology's gateway to deep learning? Cell Systems, Elsevier BV 2(1):12–14. https://doi.org/10.1016/j.cels.2016.01.009
Redmon J et al (2016) You only look once: unified, real-time object detection. In: IEEE conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/cvpr.2016.91
Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 7263–7271
Ren S et al (2017) Faster R-CNN: towards real-time object detection with region proposal networks. In: IEEE transactions on pattern analysis and machine intelligence, vol 39(6). IEEE, 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031. ISSN 2160-9292
Rezatofighi H et al (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/CVPR.2019.00075
Rogowski MV da S (2018) LiPRIF: Aplicativo para identificação de permissão de acesso de veículos e condutores ao estacionamento do IFRS (in Portuguese). Monography (Graduation Final Project) | Instituto Federal de Educação, Ciência e Tecnologia do Rio Grande do Sul (IFRS), Campus Porto Alegre, Av. Cel. Vicente, 281, Porto Alegre, RS, Brasil
ROS Wiki (2019) MAVROS. Available in: http://wiki.ros.org/mavros. Cited December 19th, 2019
Sabatino F (2015) Quadrotor control: modeling, nonlinear control design, and simulation. Master Thesis (MSc) | School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
Sagar A (2020) 5 techniques to prevent overfitting in neural networks. Available in: https://towardsdatascience.com/5-techniques-to-prevent-overfitting-in-neural-networks-e05e64f9f07. Cited March 03rd, 2020
Sambasivarao K (2019) Non-maximum suppression (NMS). Available in: https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c. Cited December 10th, 2019
Sarfraz M, Ahmed M, Ghazi S (2003) Saudi Arabian license plate recognition system. In: 2003 international conference on geometric modeling and graphics, proceedings. IEEE Computer Society. https://doi.org/10.1109/gmag.2003.1219663
Sawant AS, Chougule D (2015) Script independent text pre-processing and segmentation for OCR. In: International conference on electrical, electronics, signals, communication and optimization (EESCO). IEEE
Scikit-learn (2020) Scikit-learn: machine learning in Python. Available in: https://github.com/scikit-learn/scikit-learn. Cited March 03rd, 2020
Shafait F, Keysers D, Breuel TM (2008) Efficient implementation of local adaptive thresholding techniques using integral images. In: Yanikoglu BA, Berkner K (ed) Document recognition and retrieval XV. SPIE. https://doi.org/10.1117/12.767755
Shah S et al (2017) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In: Field and service robotics. Available at: https://arxiv.org/abs/1705.05065. Cited December 21st, 2019
Shobha G, Rangaswamy S (2018) Machine learning. In: Handbook of statistics. Elsevier, 197–228. https://doi.org/10.1016/bs.host.2018.07.004
Shuck TJ (2013) Development of autonomous optimal cooperative control in relay rover configured small unmanned aerial systems. Master Thesis (MSc) | Graduate School of Engineering and Management, Air Force Institute of Technology, Air University, Wright-Patterson Air Force Base (WPAFB), Ohio, US
Silva SM, Jung CR (2018) License plate detection and recognition in unconstrained scenarios. In: Computer vision—ECCV 2018. Springer International Publishing, 593–609. https://doi.org/10.1007/978-3-030-01258-8_36
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Bengio Y, Lecun Y (ed) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, conference track proceedings. Available in: http://arxiv.org/abs/1409.1556. Cited December 10th, 2019
Singh R et al (2010) Optical character recognition (OCR) for printed Devnagari script using artificial neural network. In: International journal of computer science & communication (IJCSC), vol 1(1), 91–95
Smith R (2007) An overview of the Tesseract OCR engine. In: Ninth international conference on document analysis and recognition (ICDAR 2007), vol 2. IEEE. https://doi.org/10.1109/icdar.2007.4376991
Smith R, Antonova D, Lee D-S (2009) Adapting the Tesseract open source OCR engine for multilingual OCR. In: Proceedings of the international workshop on multilingual OCR—MOCR '09. ACM Press. https://doi.org/10.1145/1577802.1577804
Sobottka K et al (2000) Text extraction from colored book and journal covers. In: International journal on document analysis and recognition (IJDAR), vol 2(4). Springer-Verlag GmbH Germany, Heidelberg, 163–176
Songer SA (2013) Aerial networking for the implementation of cooperative control on small unmanned aerial systems. Master Thesis (MSc) | Graduate School of Engineering and Management, Air Force Institute of Technology, Air University, Wright-Patterson Air Force Base (WPAFB), Ohio, US
Soviany P, Ionescu RT (2019) Frustratingly easy trade-off optimization between single-stage and two-stage deep object detectors. In: Lecture notes in computer science. Springer International Publishing, 366–378. https://doi.org/10.1007/978-3-030-11018-5
Strimel G, Bartholomew S, Kim E (2017) Engaging children in engineering design through the world of quadcopters. Children's Technol Eng J 21:7–11
Stutz D (2015) Understanding convolutional neural networks. In: Current topics in computer vision and machine learning. Visual Computing Institute, RWTH Aachen University
Szegedy C et al (2014) Scalable, high-quality object detection
Szegedy C et al (2015) Going deeper with convolutions. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR). IEEE. https://doi.org/10.1109/CVPR.2015.7298594
Talabis MRM et al (2015) Analytics defined. In: Information security analytics. Elsevier, 1–12. https://doi.org/10.1016/b978-0-12-800207-0.00001-0
Talin T (2015) LabelImg. Available in: https://github.com/tzutalin/labelImg. Cited January 13th, 2020
Tavares DM, Caurin GAP, Gonzaga A (2010) Tesseract OCR: a case study for license plate recognition in Brazil
Tensorflow (2016) A system for large-scale machine learning. In: Proceedings of the 12th USENIX conference on operating systems design and implementation. USENIX Association USA, 265–283. ISBN 9781931971331
TensorFlow (2019) Why TensorFlow? Available in: https://www.tensorflow.org/. Cited December 22nd, 2019
TensorFlow (2020) Get started with TensorBoard. Available in: https://www.tensorflow.org/tensorboard/get_started. Cited January 13th, 2020
TensorFlow (2020) Tensorflow detection model zoo. Available in: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md. Cited January 13th, 2020
Tokui S et al (2019) Chainer: a next-generation open source framework for deep learning. In: Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS). Available in: http://learningsys.org/papers/LearningSys2015paper33.pdf. Cited December 22nd, 2019
Tompkin J Deep learning with TensorFlow: introduction to computer vision. Available in: http://cs.brown.edu/courses/cs143/2017Fall/proj4a/. Cited December 10th, 2019
Vargas AMPCACG, Vasconcelos CN (2019) Um estudo sobre redes neurais convolucionais e sua aplicação em detecção de pedestres (in Portuguese). In: Cappabianco FAM et al (eds) Electronic proceedings of the 29th conference on graphics, patterns and images (SIBGRAPI'16). São José dos Campos, SP, Brazil. Available in: http://gibis.unifesp.br/sibgrapi16. Cited November 23rd, 2019
Weerasinghe R et al (2020) NLP applications of Sinhala: TTS & OCR. In: Proceedings of the third international joint conference on natural language processing: volume II. Available in: https://www.aclweb.org/anthology/I08-2142. Cited January 03rd, 2020
Woo A, Broker H-B (2004) Gnuplot quick reference. Available in: http://www.gnuplot.info/docs4.0/gpcard.pdf. Cited January 14th, 2020
Yan Q-Z, Williams JM, Li J (2002) Chassis control system development using simulation: software in the loop, rapid prototyping, and hardware in the loop. SAE International, SAE Technical Paper Series. https://doi.org/10.4271/2002-01-1565
Zhang H et al (2011) An improved scene text extraction method using conditional random field and optical character recognition. In: 2011 international conference on document analysis and recognition. IEEE. https://doi.org/10.1109/icdar.2011.148
Chapter 10
Palmprint Biometric Data Analysis for Gender Classification Using Binarized Statistical Image Feature Set Shivanand Gornale, Abhijit Patil, and Mallikarjun Hangarge
Abstract Biometrics may be defined as a technological system that measures individuals based upon their physiological and behavioral traits. The performance of behaviometric systems is poor, and very few operational systems are deployed. In contrast, physiometric systems seem significant and are used more due to their individuality and permanence; traits such as iris, face, fingerprint, and palmprint are well-used physiometric modalities. In this paper, the authors implement an algorithm which identifies human gender from the palmprint by using binarized statistical image features. Filters ranging from 3 × 3 to 13 × 13, with a fixed length of 8 bits, allow capturing detailed information from ROI palmprints. The proposed method achieved an accuracy of 98.2% on the CASIA palmprint database, an outperforming and competitive result.
10.1 Introduction

Credential-based security methods are no longer prevailing or suitable for usage, so biometrics-based measures are adopted and mapped onto rapidly growing technologies. The era of biometrics has evolved, and nowadays the usage of biometrics has become inevitable for gender classification and user identification (Gornale et al. 2015; Sanchez and Barea 2018; Shivanand et al. 2015). Likewise, for many years humans have been interested in the palm and its lines for telling fortunes, and scientists have determined the association of palm lines with certain genetic disorders (Kumar and Zhang 2006) like Down syndrome, Cohen syndrome, and Aarskog syndrome. Palmprint is an important biometric trait, which gains a lot of
attention because of its high potential authentication capability (Charif et al. 2017). A few studies have been carried out related to gender identification using palmprints. In this context, palmprint-based gender identification will be among the next popular tasks for improving the accuracy of other biometric devices, and it may even double the speed of a biometric system, since it reduces the comparison problem to half of the database relative to other methods. Gender classification has several applications in the civil and commercial domains, in surveillance, and especially in forensic science for criminal detection and nabbing suspects. Gender identification using palmprint is a binary-class problem of deciding whether a given palm image corresponds to a male or to a female. Palmprints are permanent (Kumar and Zhang 2006) and unalterable (Krishan et al. 2014) by nature; whereas the shape and size of an individual's palm may vary with age, the basic patterns remain unchanged (Kanchan et al. 2013). This makes the palmprint noteworthy and individualistic. Earlier studies observed that palmprint patterns are genotypically determined (Makinen and Raisamo 2008; Zhang et al. 2011) and that greater differences exist between female and male palmprints. These are absolute means that can be considered to identify the gender of an individual. Palmprint contains both high- and low-resolution features like geometrical, delta-point, principal-line, wrinkle, and minutiae (ridge) features (Gornale et al. 2018). In the proposed method, the binarized statistical image feature (BSIF) technique is used, and its performance is evaluated on the public CASIA palmprint database; the results outperform those noticed in the literature. The remainder of the paper is organized as follows: Sect. 10.2 contains the work related to palmprint-based gender classification, and Sect. 10.3 focuses on the proposed methodology. In Sect. 10.4, experimental results are discussed; Sect. 10.5 contains the comparison between the proposed method and existing results; and finally, conclusions are presented in Sect. 10.6.
10.2 Related Work

The research done earlier reveals that it is possible to authenticate an individual from the palmprint, but the work carried out in this domain is very scanty. In this section, we review the studies on gender identification. Amayeh et al. (2008) investigated the possibility of obtaining gender information from the palm; they used palm geometry and the fingers, which they encoded using Fourier descriptors. Data was collected from 40 subjects, and a result of 98% was obtained on this limited dataset. Later, Wo et al. (2014) classified palmprint geometrical properties using a polynomial support vector machine classifier; 85% accuracy was attained with a separate set of 180 palmar images collected from 30 subjects. These datasets, however, are not publicly available for further comparison. Gornale et al. (2018) fused Gabor wavelets with local binary patterns on the public CASIA palmprint database using a simple nearest-neighbor classifier; an accuracy of 96.1% was observed. Xie et al. (2018) explored the hyper-spectral CASIA palmprint dataset with a convolution
Fig. 10.1 Diagram representing the proposed methodology (author)
neural network; fine-tuning of the visual geometry group net (VGG-Net) managed to achieve a considerable accuracy of 89.2% with the blue spectrum.
10.3 Proposed Methodology

The proposed method comprises the following steps. As the first step, the palmprint image is preprocessed, which normalizes the input image and crops the region of interest (ROI) from the image of the palm. In the second step, the features are computed using BSIF. In the last step, the computed features are classified. Figure 10.1 gives a representation of the proposed methodology.
10.3.1 Database

In the proposed work, the authors have utilized the CASIA palmprint database, which is publicly available (http://biometrics.idealtest.org/). From the CASIA palmprint database, we have considered a subset of 4207 palmprints, of which 3078 palm images belong to male and 1129 to female subjects. Sample images from the database are shown in Fig. 10.2.
10.3.2 Preprocessing

Preprocessing enhances some important features by restraining undesirable distortions. In this experiment, preprocessing is performed to extract the region of interest (Zhenan et al. 2005). The preprocessing steps are as follows: Step 1 First, the input image is smoothened with the help of a Gaussian filter, after which it is binarized (Otsu 1979). Step 2 The image is normalized to a size of 640 × 480.
Fig. 10.2 Samples of the database
Fig. 10.3 Region of interest (ROI) extraction
Step 3 Two key points are located: key point 1 is the gap between the forefinger and the middle finger, and key point 2 is the gap between the ring finger and the little finger (Shivanand et al. 2019). Step 4 To determine the palmprint's co-ordinate system, the tangents of the two previously located key points are computed. Step 5 The line joining these two key points is taken as the y-axis; the centroid is detected along it, and the line passing perpendicular to it is treated as the x-axis. Step 6 With the co-ordinates obtained in step 5, the sub-image around these co-ordinates is taken as the region of interest. The process of region-of-interest extraction can be understood from Fig. 10.3.
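A minimal OpenCV sketch of steps 1 and 2 is given below; the key-point search and co-ordinate-system construction of steps 3–6 depend on the palm geometry and are omitted. The Gaussian kernel size is an assumption, as the chapter does not state it.

    import cv2

    def preprocess_palm(path):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        # Step 1: Gaussian smoothing followed by Otsu binarization
        blur = cv2.GaussianBlur(img, (5, 5), 0)
        _, binary = cv2.threshold(blur, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Step 2: normalize to 640 x 480
        return cv2.resize(binary, (640, 480), interpolation=cv2.INTER_NEAREST)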
10.3.3 Feature Extraction

Feature computation is performed on the extracted palmprint images using binarized statistical image features, which are similar in spirit to local binary patterns and local phase quantization (Gornale et al. 2018; Juho and Rahtu 2012). The BSIF technique conventionally encodes textural information from the sub-regions of the image. The BSIF method (Patil et al. 2019) produces basis vectors by linearly projecting the patch onto sub-spaces obtained by independent component analysis (Abdenour et al. 2014; Juho and Rahtu 2012). Each pixel co-ordinate value is thresholded, and an equivalent binary code is generated; the value of the local descriptor of the image intensity pattern is represented by the values in the neighborhood of the selected pixel. For a palmprint $P(b, c)$ and a BSIF filter $W_i^{K \times K}$, the filter response is obtained as follows:

$$ r_i = \sum_{b,c} P(b, c) \times W_i^{K \times K}(b, c) \quad (10.1) $$

where $\times$ denotes the convolution operation, $b$ and $c$ index the palmprint image patch, $i = 1, \ldots, L$ with $L$ the filter length, and $K \times K$ is the filter size. Each response is binarized:

$$ d(i) = \begin{cases} 1, & \text{if } r_i > 0 \\ 0, & \text{otherwise} \end{cases} \quad (10.2) $$

Likewise, for each pixel $(b, c)$, the BSIF features are obtained by plotting the histogram of the binary codes obtained from each sub-region of the ROI:

$$ BSIF^{K \times K}(b, c) = \sum_{i=1}^{L} d_i(b, c) \times 2^{i-1} \quad (10.3) $$

In this experiment, the filter size is varied from 3 × 3 to 13 × 13, so that six different sizes of filters are utilized, and the length is fixed to the standard 8-bit coding. Consequently, a feature vector of 256 elements is extracted from each male and female palmprint ROI. Figure 10.4 represents the visualization of the application of these filters.
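Equations (10.1)–(10.3) translate almost directly into code. The sketch below assumes a pre-learnt ICA filter bank of shape (L, K, K), as distributed with the original BSIF method; the variable names are illustrative, not the authors' code.

    import numpy as np
    from scipy.signal import convolve2d

    def bsif_features(roi, filters):
        """roi: 2-D grayscale ROI; filters: array of shape (L, K, K)."""
        L = len(filters)
        code = np.zeros(roi.shape, dtype=np.int32)
        for i, w in enumerate(filters):
            r = convolve2d(roi, w, mode='same')    # Eq. (10.1): filter response
            code += (r > 0).astype(np.int32) << i  # Eqs. (10.2)-(10.3): binarize, pack bits
        hist, _ = np.histogram(code, bins=2 ** L, range=(0, 2 ** L))
        return hist  # 256-element feature vector for L = 8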
10.3.4 Classifier

Linear discriminant analysis (LDA) is a primary classification technique with small computational complexity, commonly utilized for dimensionality reduction (Zhang et al. 2012). It works by separating the variance both between
Fig. 10.4 Feature extraction
and within the classes (Jing et al. 2005). LDA is a binary classifier, which assigns the class label '0' or '1' to palmprint images based upon the class variances. The nearest-neighbor classifier assigns class labels based upon different kinds of distances: it classifies according to the k value, which in turn explores the immediate neighbors and provides a label for the unlabeled sample.

$$ d_{Euclidean}(M, N) = \sqrt{(M - N)^T (M - N)} \quad (10.4) $$

$$ d_{Cityblock}(M, N) = \sum_{j=1}^{n} |M_j - N_j| \quad (10.5) $$
Support vector machines (SVMs) embody a statistical learning technique which classifies labels using different kinds of learning functions (Shivanand et al. 2015). The SVM is basically a binary classifier that endeavors to seek an optimal hyper-plane separating the labels, from a set of n data vectors with labels $Y_i$:

$$ F(X) = (W^T Y_i - b \ge 1) \quad (10.6) $$

Here, $Y_i$ predicts whether the sample belongs to the male or the female class through the discriminative function $F(X)$. Geometrically, the support vectors are the training patterns that are nearest to the decision boundary.
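Looking ahead to the experiments in Sect. 10.4 (10-fold cross-validation of a k-NN classifier with k = 3 and city-block distance), the evaluation can be reproduced with scikit-learn as sketched below. The random placeholder data merely keeps the snippet runnable; in the chapter, X would hold the 256-dimensional BSIF histograms of the 4207 ROIs and y the gender labels.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.random((420, 256))       # placeholder BSIF feature vectors
    y = rng.integers(0, 2, 420)      # placeholder male/female labels

    knn = KNeighborsClassifier(n_neighbors=3, metric='cityblock')
    print('10-fold accuracy: %.3f' % cross_val_score(knn, X, y, cv=10).mean())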
10.4 Experimental Results and Discussion

In this work, gender classification using palmprint biometrics is explored by employing BSIF filters of varying sizes. The filter size is varied from 3 × 3 to 13 × 13, giving six different filter sizes, while the length is fixed to the standard 8-bit coding. The experimentation is carried out with 10-fold cross-validation over different binary classifiers, namely LDA, K-NN, and SVM, on the publicly available CASIA palmprint database. Precision (P), recall (R), F-measure (F), and accuracy (A) are calculated. The results obtained during the exhaustive experiments are demonstrated in Tables 10.1 and 10.2. From Table 10.1, it is observed that using 3 × 3 8-bit filters, the K-NN classifier with Euclidean distance for K = 3 (a value fixed empirically throughout the experiment) obtained an accuracy of 85.5%, while the lowest accuracy of 76.1% was obtained by LDA; the support vector machine performed less well than K-NN, at 80.6%. Similarly, for 5 × 5 8-bit filters, the K-NN classifier with Euclidean distance obtained an accuracy of 93%, the lowest accuracy of 76.3% was obtained by the LDA classifier, and the support vector machine again trailed K-NN at 85.9%. Further, using 7 × 7 8-bit filters, the K-NN classifier yielded the highest accuracy of 96.7% with Euclidean distance, the lowest accuracy of 79.7% was obtained by LDA, and the support vector machine reached 91%. From Table 10.2, it is observed that with 9 × 9 8-bit filters, the highest accuracy of 98.1% was noted for K-NN with city-block distance, the lowest accuracy of 80.1% was obtained with LDA, and the support vector machine yielded 93.8%. Similarly, with 11 × 11 8-bit filters, we noticed the highest result of 98.2% with the K-NN city-block classifier; the support vector machine followed a similar trend with a lower result of 95.2%, and the lowest accuracy of 80.8% was attained by the LDA classifier. For the 13 × 13 8-bit filter, we noted results similar to the 11 × 11 filters for the K-NN classifier, with an accuracy of 98.2% with Euclidean distance, while the lowest accuracy of 79.7%
Table 10.1 Results for 3 × 3, 5 × 5, and 7 × 7 filter sizes

Filter size   | 3 × 3                | 5 × 5                | 7 × 7
              | P    R    F    A     | P    R    F    A     | P    R    F    A
LDA           | 0.89 0.80 0.42 76.1  | 0.89 0.80 0.42 76.3  | 0.90 0.83 0.43 79.7
SVM Quad      | 0.90 0.82 0.43 79.5  | 0.92 0.88 0.45 85.9  | 0.94 0.88 0.45 86.8
SVM Cubic     | 0.89 0.85 0.43 80.6  | 0.92 0.85 0.44 82.8  | 0.95 0.92 0.46 91.0
KNN CityBlock | 0.92 0.87 0.45 85.2  | 0.95 0.94 0.47 93.0  | 0.98 0.97 0.48 96.7
KNN Euclidean | 0.92 0.87 0.45 85.4  | 0.95 0.93 0.47 92.3  | 0.98 0.97 0.48 97.0
Table 10.2 Results for 9 × 9, 11 × 11, and 13 × 13 filter sizes

Filter size   | 9 × 9                | 11 × 11              | 13 × 13
              | P    R    F    A     | P    R    F    A     | P    R    F    A
LDA           | 0.90 0.83 0.43 80.1  | 0.90 0.84 0.43 80.8  | 0.90 0.83 0.43 79.7
SVM Quad      | 0.96 0.90 0.46 89.5  | 0.96 0.91 0.46 90.9  | 0.97 0.91 0.47 91.6
SVM Cubic     | 0.97 0.94 0.47 93.8  | 0.97 0.95 0.48 95.2  | 0.98 0.96 0.48 95.9
KNN CityBlock | 0.99 0.98 0.49 98.0  | 0.99 0.98 0.49 98.2  | 0.98 0.98 0.49 98.1
KNN Euclidean | 0.99 0.98 0.49 97.9  | 0.99 0.98 0.49 98.1  | 0.98 0.98 0.49 98.2
Table 10.3 Detailed confusion matrices for all filter sizes (for each classifier, the first row gives the predictions for the 3078 male samples and the second row for the 1129 female samples; columns are predicted Male / Female)

              | 3 × 3      | 5 × 5      | 7 × 7      | 9 × 9      | 11 × 11    | 13 × 13
              | Male  Fem  | Male  Fem  | Male  Fem  | Male  Fem  | Male  Fem  | Male  Fem
LDA           | 2753  325  | 2756  322  | 2792  286  | 2791  287  | 2784  294  | 2786  292
              |  682  447  |  673  456  |  567  562  |  550  579  |  515  614  |  561  568
SVM Quad      | 2790  288  | 2850  228  | 2909  169  | 2963  115  | 2983   95  | 2987   91
              |  576  553  |  366  763  |  387  742  |  326  803  |  288  841  |  263  866
SVM Cubic     | 2745  333  | 2857  221  | 2935  143  | 2994   84  | 3014   64  | 3025   53
              |  482  647  |  501  628  |  235  894  |  175  954  |  137  992  |  118 1011
KNN City      | 2856  222  | 2945  133  | 3022   56  | 3049   29  | 3055   23  | 3047   31
              |  402  727  |  183  946  |   82 1047  |   57 1072  |   53 1076  |   48 1081
KNN Eucli     | 2760  318  | 2826  252  | 2964  114  | 3017   61  | 3040   38  | 3039   39
              |  369  760  |  220  909  |  107 1022  |   64 1065  |   53 1076  |   40 1089
has been obtained by LDA, and the support vector machine performed less well than K-NN, yielding an accuracy of 95.9%. The detailed confusion matrices for these experiments are given in Table 10.3. By varying the filter size while keeping the code length fixed at 8 bits, it has been observed that higher accuracy is attained as the filter size increases. Thus, varying the size of the filters allows capturing varied information from the ROI palmprint images.
10.5 Comparative Analysis

To assess the effectiveness of the proposed method, the authors compared it with similar works present in the literature, presented in Table 10.4. In Amayeh et al. (2008), the authors made use of palm geometry, Zernike moments, and Fourier descriptors and obtained 98% accuracy on a relatively small dataset of just 40 images.
Table 10.4 Comparative analysis

Authors | Features | Database | Classifier | Results (%)
Amayeh et al. (2008) | Palm geometry, Fourier descriptor and Zernike moments | 20 males and 20 females | Score-level fusion with linear discriminant analysis | 98
Wo et al. (2014) | Palm geometry feature | 90 males and 90 females | Polynomial support vector machine | 85
Gornale et al. (2018) | Fusion of Gabor wavelet & local binary patterns | CASIA database | K-nearest neighbor | 96.1
Xie et al. (2018) | Convolution neural network | Multi-spectral CASIA database | Fine-tuning visual geometry group net | 89.2
Proposed method | Binary statistical image features | CASIA database | K-nearest neighbor | 98.2
However, Wo et al. (2014) utilized very basic geometric properties like length, height, and aspect ratio with a polynomial SVM (PSSVM) and obtained 85% accuracy. Gornale et al. (2018) fused Gabor wavelets with local binary patterns on the public CASIA palmprint database. Xie et al. (2018) explored gender classification on the hyper-spectral CASIA palmprint dataset with a convolution neural network and fine-tuning of the visual geometry group net (VGG-Net). The drawback of the works reported on self-created databases is that they are inapt for low-resolution and far-distant images captured through non-contact methods, as they require touch-based palm acquisition; the proposed method, in contrast, works with a public database and suits both approaches. Using BSIF filters with a basic K-NN classifier on a relatively larger dataset consisting of 4207 palmprint ROIs, the proposed method outperformed the others, yielding the progressive result of 98.2% accuracy. A brief summary of the comparison is presented in Table 10.4.
10.6 Conclusion

In this paper, the authors explore the performance of binarized statistical image features (BSIF) on CASIA palmprint images by varying the filter size with a fixed code length of 8 bits. As the filter size is increased, a progressive result of 98.2% is noticed for filter sizes of 11 × 11 and above; thus, varying the size of the filters allows capturing information from the ROI palmprints. The proposed method is implemented on a contact-free palmprint acquisition process and is applicable to both contact and contact-less methods. Our basic objective in this work is to develop a standardized system that can efficiently distinguish between males and females on
the basis of palmprints. Likewise, with a basic K-NN classifier and BSIF features, the authors have managed to obtain a relatively better result on a larger database of 4207 palmprint images. In the near future, the plan is to devise a generic algorithm which identifies gender based on multimodal biometrics. Acknowledgements The authors would like to thank the Chinese Academy of Sciences Institute of Automation for providing access to the CASIA palmprint database for conducting this experiment.
References

Abdenour H, Juha Y, Boradallo M (2014) Face and texture analysis using local descriptors: a comparative analysis. In: IEEE international conference on image processing theory, tools and applications (IPTA). https://doi.org/10.1109/IPTA.2014.7001944
Adam K, Zhang D, Kamel M (2009) A survey of palmprint recognition. Patt Recognit Lett 42(8):1408–1411
Amayeh G, Bebis G, Nicolescu M (2008) Gender classification from hand shapes. In: 2008 IEEE computer society conference on computer vision and pattern recognition workshops, Anchorage, AK, 1–7. https://doi.org/10.1109/CVPRW
Charif H, Trichili Adel M, Solaiman B (2017) Bimodal biometric system for hand shape and palmprint recognition based on SIFT sparse representation. Multimedia Tools Appl 76(20):20457–20482. https://doi.org/10.1007/s11042-016-3987-9
Gornale SS, Malikarjun H, Rajmohan P, Kruthi R (2015) Haralick feature descriptors for gender classification using fingerprints: a machine learning approach. Int J Adv Res Comput Sci Softw Eng 5:72–78. ISSN 2277-128X
Gornale SS, Patil A, Mallikarjun H, Rajmohan P (2019) Automatic human gender identification using palmprint. In: Smart computational strategies: theoretical and practical aspects. Springer, Singapore. Online ISBN 978-981-13-6295-8, Print ISBN 978-981-13-6294-1
Gornale SS (2015) Fingerprint based gender classification for biometrics security: a state-of-the-art technique. American Int J Res Sci Technol Eng Math (AIJRSTEM). ISSN 2328-3491
Gornale SS, Kruti R (2014) Fusion of fingerprint and age biometrics for gender classification using frequency domain and texture analysis. Signal Image Process Int J (SIPIJ) 5(6):10
Gornale SS, Patil A, Veersheety C (2016) Fingerprint based gender identification using discrete wavelet transformation and Gabor filters. Int J Comput Appl 152(4):34–37
Gornale S, Basavanna M, Kruti R (2017) Fingerprint based gender classification using local binary pattern. Int J Comput Intell Res 13(2):261–272
Gornale SS, Patil A, Kruti R (2018) Fusion of Gabor wavelet and local binary pattern feature sets for gender identification using palmprints. Int J Imaging Sci Eng 10(2):10
Jing X-Y, Tang Y-Y, Zhan D (2005) A Fourier-LDA approach for image recognition. Patt Recognit 38(3):453–457
Juho K, Rahtu E (2012) BSIF: binarized statistical image features. In: IEEE international conference on pattern recognition (ICPR), pp 1363–1366
Kanchan T, Kewal K, Aparna KR, Shredhar S (2013) Is there a sex difference in palmprint ridge density? Med Sci Law 15:10. https://doi.org/10.1258/msl.2012.011092
Krishan K, Kanchan T, Ruchika S, Annu P (2014) Viability of palmprint ridge density in North Indian population and its use in inference of sex in forensic examination. HOMO-J Comparat Hum Biol 65(6):476–488
Kumar A, Zhang D (2006) Personal recognition using hand shape. IEEE Trans Image Process 15:2454–2461
Makinen E, Raisamo R (2008) An experimental comparison of gender classification methods. Patt Recognit Lett 29(6):1554–1556
Ming W, Yuan Y (2014) Gender classification based on geometrical features of palmprint images. Sci World J, Article ID 734564:7
Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cyber 9(1):62–66
Patil A, Kruthi R, Gornale SS (2019) Analysis of multi-modal biometrics system for gender classification using face, iris and fingerprint images. Int J Image Graphics Signal Process (IJIGSP) 11(5). https://doi.org/10.5815/ijigsp.2019.05.04. ISSN 2074-9082
Raghavendra R, Busch C (2015) Texture based features for robust palmprint recognition: a comparative study. EURASIP J Inf Sec 5:10–15. https://doi.org/10.1186/s13635-015-0022
Sanchez A, Barea JA (2018) Impact of aging on fingerprint ridge density: anthropometry and forensic implications in sex inference. Sci Justice 58(5):10
Shivanand G, Basavanna M, Kruthi R (2015) Gender classification using fingerprints based on support vector machine with 10-cross validation technique. Int J Sci Eng Res 6(7):10
Zhang D, Guo Z, Lu G, Zhang L, Zuo W, Liu Y (2011) Online joint palmprint and palm vein verification. Expert Syst Appl 38(3):2621–2631
Zhang D, Zuo W, Yue F (2012) A comparative analysis of palmprint recognition algorithms. ACM Comput Surv 44(1):10
Zhenan ST, Wang Y, Li SZ (2005) Ordinal palmprint representation for personal identification. In: Proceedings of the international conference on computer vision and pattern recognition, vol 1. Orlando, USA, pp 279–284
Zhihuai X, Zhenhua G, Chengshan Q (2018) Palmprint gender classification by convolutional neural network. IET Comput Vision 12(4):476–483
Chapter 11
Recognition of Sudoku with Deep Belief Network and Solving with Serialisation of Parallel Rule-Based Methods and Ant Colony Optimisation Satyasangram Sahoo, B. Prem Kumar, and R. Lakshmi Abstract The motivation behind this paper is to give a single-shot solution of a sudoku puzzle by using computer vision. The study's purpose is twofold: first, to recognise the puzzle by using a deep belief network, which is very useful for extracting high-level features; and second, to solve the puzzle by using a parallel rule-based technique and an efficient ant colony optimisation method. Each of the two methods can solve this NP-complete puzzle on its own, but singularly they lack efficiency, so we serialised the two techniques to resolve any puzzle efficiently in less time and fewer iterations.
11.1 Introduction
Sudoku, or "single number", is a well-posed, logic-based combinatorial number-placement puzzle with a single solution. Many variants of sudoku exist, differing in size, but the standard puzzle is a 9 × 9 grid subdivided into nine 3 × 3 sub-grids. The primary objective is to fill the grid so that every column, row, and sub-grid contains each digit from 1 to 9 exactly once. Nowadays, sudoku is a popular daily puzzle in much of the printed news media. Generalised sudoku falls into the NP-complete complexity category: it belongs to NP and is as hard as any problem in NP, since an NP-complete problem belongs to both the class NP and the class NP-hard. Puzzles like sudoku exercise human intelligence, and artificial intelligence is still working towards solving hard instances efficiently with computer algorithms. Sudoku becomes harder as its size increases or when fewer hint digits are given in proper positions. There are many rule-based methods to solve the sudoku puzzle efficiently; in this paper, we implement such rule-based methods. The main objective of the paper is to provide hints and a solution to any puzzle of this kind.
S. Sahoo (B) · B. P. Kumar · R. Lakshmi Pondicherry Central University, Pondicherry, India e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_11
As the proposed method is an epoch-based, two-stage solution algorithm, it can be stopped at any iteration to obtain possible hints towards the solution. In this paper, we use a convolutional deep belief network for image processing to locate each digit's position in the puzzle, as the convolutional deep belief network recognises digits more effectively than other OCR-based methods. New and well-known rule-based methods are implemented to solve easy-level problems and to reduce the difficulty of harder puzzles by filling many cells with the appropriate digits over a fixed number of iterations. The partially solved puzzle is then processed with an ant colony optimisation method to obtain the final solution.
11.2 Literature Review
Many papers have been published over the years to solve the puzzle efficiently, and several authors have proposed different computer algorithms to address standard and larger-sized puzzles. Among these, the backtracking algorithm and the genetic algorithm are the most famous. Abu Sayed Chowdhury and Suraiya Akhter even solved sudoku with the help of Boolean algebra (Chowdhury and Akhter 2012). The work in this paper is divided into two main categories: first, sudoku image processing for printed digit and grid recognition, and then proceeding to an appropriate solution for that image. In 2015, Kamal et al. presented a comparative analysis of sudoku image processing and solved the puzzle using backtracking, genetic algorithms, etc., with a camera-based OCR technique (Kamal et al. 2015). Also in 2015, Baptiste Wicht and Jean Hennebert proposed a work on handwritten and printed digit recognition using a convolutional deep belief network (Wicht and Henneberty 2015), an extension of the same authors' earlier work on deep belief networks, which is handy for detecting the grid with cell numbers (Wicht and Hennebert 2014). Computer vision plays an active role in detecting and solving the puzzle (Nguyen et al. 2018). Several methods over the years, such as the heuristic hybrid approach by Nysret Musliu et al. (Musliu and Winter 2017), the genetic algorithm by Gerges et al. (2018), and parallel processing by Saxena et al. (2018), have been proposed to solve the puzzle efficiently. Saxena et al. composed five rule-based methods with serial and parallel processing algorithms (Saxena et al. 2018).
11.3 Image Recognition and Feature Extraction
See Fig. 11.1.
Fig. 11.1 An image of sudoku from our dataset
11.3.1 Image Preprocessing
The preprocessing of an image involves digit detection and edge detection. An image acquired by camera, or a pre-stored image file, needs care in eliminating unnecessary background noise, correcting the orientation of the image, and handling a non-uniformly distributed illumination gradient. The image preprocessing steps are:
Fig. 11.2 Processed image after canny edge detection
1. The captured image is converted to a greyscale image, as the greyscale image is convenient for image processing and for the detection of digits and edges.
2. The greyscale image is then processed through a local thresholding function T of the form
$$T = T[x, y, P(x, y), F(x, y)]$$
where $P(x, y)$ is a local property and $F(x, y)$ is the grey level.
3. The Canny edge detection multi-step algorithm can be used to detect edges while suppressing noise at the same time. In Fig. 11.2, it is used to control the amount of detail that appears on the edges of the image. The Hough transform is an incredible computer vision-based method designed to identify each line segment. A connected-component analysis among the segments is performed to merge segments together and form the complete segments of the sudoku image (Ronse and Devijver 1984). A convex hull detection algorithm is used to detect the corners of the grid. Then, each side of the grid is divided into nine equal parts, and the bounding rectangle of each resulting quadrilateral is counted as a final cell.
4. Convolutional Restricted Boltzmann Machines (CRBM). The restricted Boltzmann machine consists of a set of binary hidden layer units (h) and a set of visible input layer units (v). A weight matrix (W) represents the symmetric connections between hidden units and visible units. The probabilistic semantics for a restricted Boltzmann machine with its energy function (where the visible units are binary valued) are defined as
$$P(v, h) = \frac{1}{Z}\,\exp\Big(\sum_{i,j} v_i W_{ij} h_j + \sum_j b_j h_j + \sum_i c_i v_i\Big)$$
5. Deep Belief Networks. A deep belief network consists of multi-layer RBMs, where each layer comprises a set of binary values. A multi-layer convolutional RBM consists of an input (visible) layer of array $I_v \times I_v$ and N groups of hidden layers of array $I_h \times I_h$, where each of the N hidden groups is associated with an $I_w \times I_w$ filter whose weights are shared within the group. The probabilistic semantics $P(v, h)$ for the CRBM are defined as
$$P(v, h) = \frac{1}{Z}\,\exp\Big(\sum_{n=1}^{N} \sum_{i,j=1}^{I_H} \sum_{r,s=1}^{I_W} v_{i+r-1,\,j+s-1}\, W^n_{rs}\, h^n_{ij} + \sum_{n=1}^{N} b_n \sum_{i,j=1}^{I_H} h^n_{ij} + c \sum_{i,j=1}^{I_v} v_{ij}\Big)$$
where $b_n$ is the bias of hidden group n and c is the single shared bias of the visible input group. N groups of pooling-layer units ($P^n$) shrink the same number of hidden-layer units ($H^n$) by a constant small integer factor C, so that the pooling layer has side $I_P = I_H / C$. Each block $\alpha$ of detection units $B_\alpha$, where $B_\alpha = \{(i, j) : h_{i,j} \text{ belongs to block } \alpha\}$, is connected to exactly one binary unit $p^n_\alpha$ of the pooling layer. Sampling of each visible unit can be done as
$$P(v_{i,j} = 1 \mid h) = \sigma\Big(c + \sum_n (W^n *_f h^n)_{i,j}\Big)$$
Each unit of the detection layer receives a bottom-up signal from the visible layer,
$$L(h^n_{ij}) = b_n + (\tilde{W}^n * v)_{ij}$$
and each pooling unit receives a signal
$$L(p^n_\alpha) = \sum_l (\Gamma^n_l * h'^{\,l})_\alpha$$
from the layer above. So, the conditional probabilities from the above derivation are
$$P(h^n_{ij} = 1 \mid v) = \frac{\exp(L(h^n_{ij}))}{1 + \sum_{(i',j') \in B_\alpha} \exp(L(h^n_{i'j'}))}$$
$$P(p^n_\alpha = 0 \mid v) = \frac{1}{1 + \sum_{(i',j') \in B_\alpha} \exp(L(h^n_{i'j'}))}$$
Finally, the energy function for the convolutional RBM is defined as
$$p(v, h) = \frac{1}{Z}\exp(-E(v, h)), \qquad E(v, h) = -\sum_{n=1}^{N} \sum_{i,j} \big(h^n_{i,j} (\tilde{W}^n * v)_{i,j} + b_n h^n_{i,j}\big) - c \sum_{i,j} v_{i,j}$$
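To make the probabilistic max-pooling step concrete, the following is a minimal NumPy sketch of the two conditional probabilities just derived. It assumes the bottom-up signals L(h^n_ij) for one hidden group have already been computed; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def pmax_pool_conditionals(L, C=2):
    """Conditionals for probabilistic max-pooling (Lee et al. 2009).

    L : (IH, IH) array of bottom-up signals L(h_ij); IH must be divisible by C.
    C : pooling factor; each C x C block B_alpha feeds one pooling unit.
    Returns P(h_ij = 1 | v) and P(p_alpha = 0 | v).
    """
    IH = L.shape[0]
    expL = np.exp(L)
    # Sum exp(L) over each C x C block, then broadcast back to detection units.
    block_sum = expL.reshape(IH // C, C, IH // C, C).sum(axis=(1, 3))
    denom = 1.0 + np.repeat(np.repeat(block_sum, C, axis=0), C, axis=1)
    p_h = expL / denom                    # at most one unit per block fires
    p_pool_off = 1.0 / (1.0 + block_sum)  # pooling unit off iff no unit fires
    return p_h, p_pool_off
```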
11.3.2 Feature Extraction
A convolutional deep belief network (CDBN) is made up of a stack of many probabilistic max-pooling CRBMs (Lee et al. 2009), so the total energy of a CDBN is the sum of the energies of its CRBM layers. The CDBN is used not only to recognise the digit inside each grid cell but also to act as feature extractor and classifier at the same time. Our model is made up of three layers of RBM, where the first layer (500 hidden units) uses the rectified linear unit (ReLU) activation function, defined as $f(x) = \max(0, x)$.
Fig. 11.3 Recognition of digit by convolutional deep belief network
It is followed by a second layer of the same 500 units, and the final visible layer is labelled with the digits from 1 to 9 (9 units); a simple base-e exponential (softmax) is used in this final layer. Each CRBM except the last layer of the network is trained in an unsupervised manner using contrastive divergence (CD), and stochastic gradient descent (SGD) is used for "fine-tuning" of the network. The classifier is trained on a training set of 150 images, in batches of 15 images for ten epochs, and tested on 50 images, giving an accuracy of 98.52% for printed digit recognition. Figure 11.3 shows successful digit recognition by the DBN.
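The classifier stack can be approximated with off-the-shelf components. The sketch below uses scikit-learn's BernoulliRBM, which is fully connected (not convolutional) and trained by persistent contrastive divergence, so it is only a simplified stand-in for the CDBN described above; the softmax head stands in for the labelled visible layer and fine-tuning. Layer sizes mirror the 500-500-9 architecture, while hyperparameters and variable names are illustrative assumptions.

```python
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Two 500-unit RBM layers followed by a 9-class (digits 1-9) softmax head.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=500, learning_rate=0.05, n_iter=10)),
    ("rbm2", BernoulliRBM(n_components=500, learning_rate=0.05, n_iter=10)),
    ("softmax", LogisticRegression(max_iter=1000)),
])
# Hypothetical usage: X_* are flattened cell images scaled to [0, 1],
# y_* are digit labels 1-9.
# model.fit(X_train, y_train); print(model.score(X_test, y_test))
```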
11.4 Sudoku Solver
After successful recognition of the digits and their column and row numbers from 1 to 9, our algorithm operates on the outcome of digit recognition on the basis of row and column. The algorithm is divided into two parts: in the first part, the handwritten general rule-based algorithm is applied, followed by an ant colony optimisation algorithm. Our handwritten algorithm is able to solve many puzzles. Newspaper sudoku is partitioned into three basic categories (easy, medium, hard) or a 5-star rating according to difficulty level and the number of empty cells. The handwritten general rule-based algorithm can solve easy and most medium-level puzzles; hard puzzles are partially addressed by it, which lowers their difficulty level. If the problem remains unsolved after some iterations of the general rule-based algorithm, it is passed to an ACO algorithm, as ACO is very efficient at solving NP-complete problems.
11.4.1 General Rule-Based Algorithm
The general rule-based algorithm is subdivided into six different stages which run in parallel to solve the problem. The CDBN is applied to classify the digits and place them according to row and column. Each cell is assigned an array to store either its probable digits or its recognised digit, and each row, column, and grid (3 × 3) keeps its own avail and unavail lists of digits.

Algorithm 1 Algorithm for Digit Assignment
1. Start
2. Read: row no, col no, y ← 1
3. avail list[row], avail list[col], avail list[block] ← empty array
4. unavail list[row no], unavail list[col no], unavail list[block no] ← [1 … 9]
5. Loop (col no ≤ 9) up to step 13
6. Read the row element by convolution [1 × 1]
7. If (y > 3) then reset y = 1
8. block no ← int((col no + 2) / 3) + (row no − y)
9. If (digit found and the digit is in unavail list[row no][col no][block no]) then
10. Assign the value to cell[row no][col no][block no]
11. Add the value to avail list[row no], avail list[col no], avail list[block no]
12. Eliminate the value from unavail list[row no], unavail list[col no], unavail list[block no]
13. row no ← row no + 1; y ← y + 1
Each empty cell is then assigned a probability array:
14. row no, col no, y ← 1
15. Loop (col no ≤ 9) up to step 19
16. Read the row element by convolution [1 × 1]
17. Select the empty cell[row no][col no][block no]
18. probability array[cell no] ← common elements of unavail list[row no], unavail list[col no], and unavail list[block no]
19. End of loop
20. End
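A minimal Python rendering of Algorithm 1's bookkeeping is given below. It builds the avail/unavail sets and the per-cell candidate (probability) arrays from a recognised grid; it is a sketch of the data structures, not the authors' code.

```python
def init_candidates(grid):
    """grid: 9x9 list of lists, 0 for an empty cell.

    For each empty cell, the candidate set is the intersection of the
    'unavail' (still missing) digits of its row, column and 3x3 block,
    mirroring Algorithm 1.
    """
    digits = set(range(1, 10))
    rows = [digits - {grid[r][c] for c in range(9)} for r in range(9)]
    cols = [digits - {grid[r][c] for r in range(9)} for c in range(9)]
    blocks = [digits - {grid[r][c]
                        for r in range(3 * (b // 3), 3 * (b // 3) + 3)
                        for c in range(3 * (b % 3), 3 * (b % 3) + 3)}
              for b in range(9)]
    cand = {}
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                b = 3 * (r // 3) + c // 3
                cand[(r, c)] = rows[r] & cols[c] & blocks[b]
    return cand
```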
11.4.2 Methods
11.4.2.1 Step 1: Common Eliminator
After initialisation, the primary objective of each probability array is to shrink in every iteration until it holds a single element. Our handwritten rule-based algorithm is designed explicitly to mimic the way a human solves a puzzle, so the model reasons in a human-like fashion, and a user can pause the execution at any iteration to see hints for solving the problem. The handwritten algorithm is subdivided into two basic categories: one is the assigner,
and the other is the eliminator. The algorithm is divided into six steps which are executed in a parallel manner.

Algorithm 2 Common Eliminator Algorithm
1. Start
2. row no ← 3
3. Loop (((row no / 3) = 1) ≤ 12) up to step 11
4. If avail list[row no] = avail list[row no − 1] then
5. Extract the block numbers of the candidate as x, y
6. Row = assign candidate[row no, row no − 1]
7. Block = assign candidate[x, y]
8. If there is a single vacancy in that row of the block then
9. Eliminate all the probabilities of that cell except that element
11. End of loop
12. Repeat the process for the combinations (row no, row no − 2) and (row no − 1, row no − 2)
13. End

The assign candidate function is represented as:
1. Start
2. assign candidate(x, y)
3. If (x + y) is not divisible by 2 then
4. Return 3y − (x + y)
5. Else
6. Return (x + y) / 2
7. End
11.4.2.2 Step 2: Hidden Single
As in Fig. 11.4, a hidden single is a candidate that occurs only once in the probability lists of an entire row or column. The algorithm is expressed as:

Algorithm 3 Hidden Single Algorithm
1. Start
2. candidate ← 1
3. Loop: candidate ≤ 9
4. found ← 0
5. Linearly search for the candidate in avail list[]
6. Increment the value of found
7. Break if (found > 1)
8. Else
9. Assign that candidate to that cell
10. Eliminate that candidate from the corresponding row, column, and block
11. candidate ← candidate + 1
12. End of loop
13. End
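A compact Python sketch of the hidden-single rule follows. Here `cand` is assumed to be the per-cell candidate dictionary built during digit assignment, and `units` an iterable of cell groups (rows, columns, blocks); the function only reports assignments and leaves the follow-up eliminations to the caller.

```python
def hidden_singles(cand, units):
    """Find candidates that occur exactly once within a unit.

    cand  : dict mapping (row, col) -> set of candidate digits
    units : iterable of cell groups, e.g. all rows, columns and blocks
    Returns a list of (cell, digit) assignments found in this pass.
    """
    found = []
    for unit in units:
        for digit in range(1, 10):
            holders = [cell for cell in unit if digit in cand.get(cell, set())]
            if len(holders) == 1:        # digit fits in only one cell here
                found.append((holders[0], digit))
    return found
```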
Fig. 11.4 Hidden single: Digit 7 is the hidden single for the 8th column
11.4.2.3 Step 3: Naked Pair and Triple Eliminator
The naked pair eliminator algorithm is useful when a pair of candidates occurs in exactly two cells of a single row, column, or block. The likelihood of those candidates belonging to those two cells is then high, so the two candidates must be removed from the probability arrays of every other cell of that unit, as with digits 4 and 8 in Fig. 11.5 (https://www.youtube.com/watch?v=b123EURtu3It=97s). The naked triple eliminator follows the same procedure, but in this case three elements are subdivided into either one set of three candidates or a three-set combination of two candidates each, searched for across three different cells in a single row, column, or block. The algorithm is represented as follows (Algorithm 4):
11.4.2.4 Step 4: Pointing Pair Eliminator
As with digit 9 of the 8th block and 5th column in Fig. 11.6, when a certain candidate appears in only two or three cells of a block and those cells are aligned in a column or a row, they are called pointing pairs. All other appearances of that candidate outside that block in the same column or row can be eliminated. (In the notation below, cell [R][ ][B] means the same row number and same block number but a different column number.)
Fig. 11.5 Naked pair: Digits 4 and 8 are the naked pair for the 4th block (https://www.youtube.com/watch?v=b123EURtu3It=97s)
Fig. 11.6 Pointing pair: Digit 9 is the pointing pair; all its appearances outside block 8 can be eliminated
Algorithm 4 Naked Pair and Triple Eliminator Algorithm
1. Start
2. Create combinations of 2 elements into 2-candidate sets from the unavail list of the row/column/block
3. Create combinations of 3 elements into 2- and 3-candidate sets from the unavail list of the row/column/block
4. found ← 0
5. Search for the identical candidate combination set row-, column-, and block-wise
6. If the search is successful
7. found ← found + 1
8. For a 2-element set:
9. If (found = 2)
10. Eliminate these two candidates from the probability arrays of the other cells of that row, column, and block
11. For a 3-element set:
12. If (found = 3)
13. Eliminate these three candidates from the probability arrays of the other cells of that row, column, and block
14. End
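For illustration, a minimal naked-pair eliminator over a single unit might look as follows; the triple case extends it with 3-element combinations. This is a sketch assuming the `cand` dictionary from the digit-assignment stage, not the authors' implementation.

```python
from itertools import combinations

def naked_pairs(cand, unit):
    """Eliminate naked pairs inside one unit (a row, column or block).

    If two cells of the unit share an identical 2-candidate set, those
    two digits can be removed from every other cell of the unit.
    """
    pair_cells = [c for c in unit if len(cand.get(c, set())) == 2]
    for a, b in combinations(pair_cells, 2):
        if cand[a] == cand[b]:
            for other in unit:
                if other not in (a, b) and other in cand:
                    cand[other] -= cand[a]
```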
Algorithm 5 Pointing Pair Eliminator Algorithm
1. Start
2. block, candidate ← 1
3. Loop: block ≤ 9 up to step 12
4. Loop: candidate ≤ 9 up to step 10
5. If all the appearances of the candidate in the probability sets of cell [ ][C][B] > 2 then
6. Eliminate that candidate from that column in the other blocks
7. If all the appearances of the candidate in the probability sets of cell [R][ ][B] > 2 then
8. Eliminate that candidate from that row in the other blocks
9. candidate ← candidate + 1
10. End of loop
11. block ← block + 1
12. End of loop
13. End
11.4.2.5 Step 5: Claiming Pair Eliminator
When a certain candidate appears in only two or three cells of a row or column and those cells lie in a single block, they are called a claiming pair. The algorithm for rows (similar for columns) is given in Algorithm 6.

11.4.2.6 Step 6: X-Wings
X-wings is most used by enigmatologists to solve highly rated, difficult-level puzzles by reducing the candidate distribution. The X-wings technique applies when a candidate appears in exactly two cells in each of two rows and these four cells form the corners of a rectangle or square. Then, other appearances of that candidate in those two columns can be eliminated, as in Fig. 11.7. The same technique can also be applied starting from columns.
Algorithm 6 Claiming Pair Eliminator Algorithm
1. Start
2. block ← 1
3. Loop: block ≤ 9 up to step 15
4. row, candidate ← 1
5. Loop: row ≤ 9 up to step 11
6. Loop: candidate ≤ 9 up to step 9
7. Search for occurrences of the candidate in the row
8. candidate ← candidate + 1
9. End of loop
10. row ← row + 1
11. End of loop
12. If the appearances of the candidate in candidate set [R][ ][B] ≥ 3 then
13. Eliminate that candidate from any other appearances in that block
14. block ← block + 1
15. End of loop
16. End
Algorithm 7 X-Wings Technique for Rows
1. Start
2. row, candidate ← 1
3. Loop: row ≤ 8 up to step 17
4. Loop: candidate ≤ 9 up to step 15
5. Search for the candidate
6. If the number of appearances of the candidate in a row = 2 then
7. Assign the first candidate's column to column 1
8. Assign the second candidate's column to column 2
9. Search for that candidate in column 1
10. If found
11. Search in that row
12. If (the number of appearances of that candidate = 2 and the found candidate's columns = column 1 and column 2) then
13. Erase that candidate's other appearances in column 1 and column 2 except those cells
14. candidate ← candidate + 1
15. End of loop
16. row ← row + 1
17. End of loop
18. End
11.4.3 Parallel Algorithm Execution
All the above algorithms are independent of each other. The central theme is to find a possible-candidate array for each empty cell with the help of the CDBN; after that, the parallel algorithms help to minimise these arrays by eliminating elements from the probable array of each cell. If all six methods are implemented serially, one after another, it will be more
Fig. 11.7 X-Wings: Digit 9 is in X-Wings
time consuming and inefficient. The main aim of parallel execution is to minimise the time cost and increase the efficiency of the implementation. Some steps are formulated for rows and columns separately, and these are also executed in a parallel manner. In a single epoch, all six methods are applied exactly once, and the results are fed as updated input to the next iteration, as sketched below.
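One hedged way to realise this epoch-wise parallelism in Python is with a thread pool; the six rule functions and the candidate dictionary are assumed from the earlier algorithms, and eliminations are merged only at the end of the epoch so the rules stay independent within it.

```python
from concurrent.futures import ThreadPoolExecutor

def run_epoch(cand, units, eliminators):
    """One epoch of the parallel solver (a sketch, not the authors' code).

    Each of the six rules inspects the current candidate sets and returns
    a list of proposed eliminations (cell, digit). Proposals are applied
    only after all rules finish, and the updated board then feeds the
    next iteration, matching the scheme described above.
    """
    with ThreadPoolExecutor(max_workers=len(eliminators)) as pool:
        futures = [pool.submit(rule, cand, units) for rule in eliminators]
        proposals = [f.result() for f in futures]
    for batch in proposals:
        for cell, digit in batch:
            cand.get(cell, set()).discard(digit)
```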
11.4.4 Ant Colony Optimisation
The parallel algorithm is capable of solving most easy- to medium-level problems within 100–150 epochs, and many challenging puzzles are answered within 250–300 epochs. However, the rule-based parallel methods fail to handle the most difficult problems efficiently: for some puzzles the parallel algorithm stalls with more than one possible digit candidate in the array of a single cell, as the methods become unable to eliminate candidates after a certain epoch. Nevertheless, the rule-based parallel method efficiently shrinks the candidate arrays, so that a greedy algorithmic approach can then finish in fewer epochs and less time than it would need on its own. For this reason, the ant colony optimisation method is serialised with the parallel rule-based method.
The ant colony optimisation method as a sudoku solver (Lloyd and Amos 2019) is used together with the constraint propagation method (Musliu and Winter 2017). In our ACO, each ant covers only those cells whose probability array still has multiple candidates, working on its local copy of the puzzle. An ant adds a fixed amount of pheromone when it picks a single element from the array of possible candidates, deleting that element's other occurrences in the same row, column, and block. One pheromone matrix (9 × 81) is created to keep track of the updates of each component of the possible arrays. The best ant is the one that covers all the multi-candidate cells of the puzzle.
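A minimal sketch of one ACO iteration in this spirit is shown below. The pheromone constants and the `evaluate` scoring helper are illustrative assumptions, not values or routines taken from the chapter or from Lloyd and Amos (2019).

```python
import random

def aco_step(cand, ants, pheromone, evaluate, rho=0.9, q=1.0):
    """One ACO iteration over cells that still have multiple candidates.

    pheromone maps (cell, digit) -> trail strength (the 9 x 81 matrix).
    Each ant picks a digit per unresolved cell with probability
    proportional to its trail; the best ant reinforces its choices after
    evaporation. evaluate(choice) is an assumed scorer, e.g. the number
    of satisfied rows/columns/blocks.
    """
    unresolved = [c for c, s in cand.items() if len(s) > 1]
    best_choice, best_score = None, float("-inf")
    for _ in range(ants):
        choice = {c: random.choices(list(cand[c]),
                                    [pheromone[(c, d)] for d in cand[c]])[0]
                  for c in unresolved}
        score = evaluate(choice)
        if score > best_score:
            best_choice, best_score = choice, score
    for key in pheromone:                   # evaporation
        pheromone[key] *= rho
    for c, d in best_choice.items():        # best-ant reinforcement
        pheromone[(c, d)] += q
    return best_choice
```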
11.5 Result and Discussion
For this paper, we experimented on various datasets available on the Internet, such as https://github.com/wichtounet/sudokudataset, used by the Wicht paper (Wicht and Henneberty 2015) (for both recognition and solution), and https://www.kaggle.com/bryanpark/sudoku (for testing the solver only). In the first half, digit and grid recognition using the deep belief network achieved an accuracy of 98.58% with an error rate of 1.42%, so it works nearly perfectly in recognising digits according to the grid, and we processed only the puzzles that were recognised fully (Figs. 11.8 and 11.9). The rule-based algorithm alone succeeded with a success rate of 96.63% of puzzles within 304 epochs, while ant colony optimisation alone was capable of solving 98.87% of puzzles within 263 epochs; the serialisation of the parallel and ant colony optimisation methods gave the highest success rate, 99.34%, within 218 epochs. Maximum epochs for the three algorithms are calculated on puzzles with 62 blank cells.
11.6 Conclusion
We designed and implemented a new rule-based algorithm combined with the ant colony optimisation technique to solve puzzles detected by image processing using convolutional deep belief network methods. The CDBN is efficient at recognising printed digits properly, even for highly difficult puzzles. The system is designed so that a player can stop at any iteration to see hints towards the answer. In the future, we will try to implement digit recognition and the solution solver using a deep convolutional neural network alone.
Fig. 11.8 The number of epochs used by the three different algorithms
Fig. 11.9 The percentage of puzzles solved
References
Chowdhury AS, Akhter S (2012) Solving Sudoku with Boolean algebra. Int J Comput Appl 52(21):0975–8887
Gerges F, Zouein G, Azar D (2018) Genetic algorithms with local optima handling to solve sudoku puzzles. In: Proceedings of the 2018 international conference on computing and artificial intelligence, pp 19–22
Kamal S, Chawla SS, Goel N (2015) Detection of Sudoku puzzle using image processing and solving by backtracking, simulated annealing and genetic algorithms: a comparative analysis. In: 2015 third international conference on image information processing (ICIIP). IEEE, pp 179–184
Lee H, Grosse R, Ranganath R, Ng AY (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proceedings of the 26th annual international conference on machine learning, pp 609–616
Lloyd H, Amos M (2019) Solving Sudoku with ant colony optimization. IEEE Trans Games
Musliu N, Winter F (2017) A hybrid approach for the sudoku problem: using constraint programming in iterated local search. IEEE Intell Syst 32(2):52–62
Nguyen TT, Nguyen ST, Nguyen LC (2018) Learning to solve Sudoku problems with computer vision aided approaches. Information and decision sciences. Springer, Singapore, pp 539–548
Ronse C, Devijver PA (1984) Connected components in binary images: the detection problem
Saxena R, Jain M, Yaqub SM (2018) Sudoku game solving approach through parallel processing. In: Proceedings of the second international conference on computational intelligence and informatics. Springer, Singapore, pp 447–455
Wicht B, Hennebert J (2014) Camera-based sudoku recognition with deep belief network. In: 2014 6th international conference of soft computing and pattern recognition (SoCPaR). IEEE, pp 83–88
Wicht B, Henneberty J (2015) Mixed handwritten and printed digit recognition in Sudoku with convolutional deep belief network. In: 2015 13th international conference on document analysis and recognition (ICDAR). IEEE, pp 861–865
Chapter 12
Novel DWT and PC-Based Profile Generation Method for Human Action Recognition Tanish Zaveri, Payal Prajapati, and Rishabh Shah
Abstract Human action recognition in recordings acquired from surveillance cameras finds application in fields like security, health care and medicine, sports, automatic sign language recognition, and so on. The task is challenging due to variations in motion, recording settings, and inter-personal differences. In this paper, a novel DWT and PC-based profile generation algorithm is proposed which incorporates the notion of energy in extracting features from video frames. Seven energy-based features are calculated using the unique energy profiles of each action. The proposed algorithm is applied with three widely used classifiers (SVM, Naive Bayes, and J48) to classify video actions. The algorithm is tested on the Weizmann dataset, and performance is measured with evaluation metrics such as precision, sensitivity, specificity, and accuracy. Finally, it is compared with the existing method of template matching using the MACH filter. Simulation results give better accuracy than the existing method.
12.1 Introduction Human action recognition (HAR) is the process of recognizing various actions that people perform, either individually or in a group. These actions may be walking, running, jumping, swimming, shaking hands, dancing, and many more. There are many challenges in HAR such as differences in physiques of humans performing actions like shape, size, color, etc., differences in background scene like occlusion, lighting or any other visual impairments, differences in recording settings like recording speed, types of recording (2D or 3D/gray-scale or colored video recording), differences in motion performance by different people like difference in speed of walking, difference in height of jumping, etc. For an algorithm to succeed, the methods used for action representation and classification are of utmost importance. This motivated research work in this field and the development of a plethora of different techniques T. Zaveri · R. Shah Nirma University, Ahmedabad 382481, India P. Prajapati (B) · R. Shah Government Engineering College, Patna 384265, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 G. K. Verma et al. (eds.), Data Science, Transactions on Computer Systems and Networks, https://doi.org/10.1007/978-981-16-1681-5_12
which fall under local and global representation approaches, as classified by Weinland et al. (2011) based on how actions are represented. In global representation approaches, the human body is to be detected in the image, usually with background subtraction techniques. While this step is a disadvantage, it results in a reduction of image size and complexity. Silhouettes and contours are usually used for representing the person. Contour-based features like the Cartesian coordinate feature (CCF), the Fourier descriptor feature (FDF) (H et al. 1995; RD and L 2000; R et al. 2009), the centroid-distance feature (CDF) (D and G 2004, 2003), and the chord-length feature (CLF) (D and G 2004, 2003; S et al. 2012) are extracted from the contour boundary of the person in the aligned silhouettes image (ASI). The region inside the contour of the human object is the silhouette. Silhouette-based features are extracted from the silhouette in the ASI image. Some common silhouette-based features are the histogram of gradient (HOG) (D and G 2003; N et al. 2006), the histogram of optical flow (HOOF) (R et al. 2012), and the structural similarity index measure (SSIM) (Z et al. 2004). In local representation approaches, videos are treated as a collection of small unrelated patches that cover the regions of high variation in the spatial and temporal domains. Centers of these patches are called spatio-temporal interest points (STIPs). STIPs are represented by the information related to the motion in their patches and then clustered to form a dictionary of visual words. Each action is represented by a bag-of-words model (BOW) (Laptev et al. 2008). Several STIP detectors have been proposed recently. For example, Laptev (2005) applied the Harris corner detector to the spatio-temporal case and proposed the Harris3D detector, Dollar et al. (2005) applied 1D Gabor filters temporally and proposed the Cuboid detector, Willems et al. (2008) proposed the Hessian detector that measures saliency with the determinant of the 3D Hessian matrix, and Wang et al. (2009) introduced the dense sampling detector that finds STIPs at regular points and scales, both spatially and temporally. Various descriptors used for STIPs include the histogram of oriented gradients (HOG) descriptor and histogram of optical flow (HOF) descriptor (H et al. 1995), the gradient descriptor (Dollár et al. 2005), the 3D scale-invariant feature transform (3D SIFT) (Scovanner et al. 2007), the 3D gradients descriptor (HOG3D) (A et al. 2008), and the extended speeded up robust features descriptor (ESURF) (Willems et al. 2008). Some limitations of global representation, such as sensitivity to noise and partial occlusion and the complexity of accurate localization by object tracking and background subtraction, can be overcome by local representation approaches. Local representation also has some drawbacks, like ignoring the spatial and temporal connections between local features and action parts that are necessary to preserve the intrinsic characteristics of human actions.
In this paper, a novel energy-based approach is presented which is based on the fact that every action has a unique energy profile associated with it. We model some physics theorems related to the kinetic energy generated while performing an action. As is known, if you apply a force over a given distance, you have done work, using the equation $W = F \times D$. Through the work-energy theorem, this work done can be related to changes in the kinetic energy or gravitational potential energy of an object using the equation $W = \Delta K$, where $\Delta K$ is the change in kinetic energy.
As per normal observation, the force required for performing various actions differs (e.g., running requires more force than walking), which leads to differences in the amount of work done and hence differences in the energy profiles of different actions. So, energy difference can be a feature for distinguishing among various actions. We model the different energy profiles associated with various actions using the fundamentals of phase congruency and the discrete wavelet transform. The goal is to automate the human activity recognition task by integrating seven energy-based features, calculated in the frequency domain, with a few machine learning algorithms. All seven features are based on exploring the differences in the energy profiles of different actions. Section 12.2 of the paper describes the background theory, the proposed methodology, and the analysis of energy profiles. Section 12.3 presents details of the dataset and the results obtained, and compares the results with the existing method. Section 12.4 concludes the paper, followed by the references.
12.2 Proposed Methodology
Our method is based on the following observation: as with inter-personal differences and motion performance differences, there is also a noticeable difference in the energy profile generated over an entire video for different action classes. For example, the energy variation in the case of walking would be more in the lower parts of the frame, while in the case of waving there would be more variation in the upper parts. We employ phase congruency (PC) and discrete wavelet transform (DWT) methods in the computation of energy profiles, as the transform domain makes analysis easier than the spatial domain. Figure 12.1 shows the block diagram of the proposed methodology. The first block involves extracting frames from the training video. The difference of every alternate frame is taken and given as input to the next two blocks, namely DWT decomposition and phase congruency. The DWT decomposition block applies a level-2 DWT to the frame difference, returning the approximation coefficients (c_A) and the detail coefficients (c_H, c_V, c_D). The percentage contribution of the detail coefficients to the total energy is calculated, and these become the first three features, denoted E_H, E_V, and E_D, respectively. The phase congruency block detects the edges in the frame difference image; the result of PC is divided into four parts, and the percentage contributions of these four parts to the total energy become the next four features, denoted E_UL, E_LL, E_UR, and E_LR, respectively. So, for the difference of every alternate frame, there is a feature vector f = [E_H, E_V, E_D, E_UL, E_LL, E_UR, E_LR], and for every video there is a vector V that contains the sequence of feature vectors for every alternate frame. The training matrix is created by taking these seven features for the videos of each action, and a feature vector for the test video is also calculated. The training matrix and the test matrix are passed as inputs to the classifier. The classifier, be it SVM, Naive Bayes, or J48, recognizes the action being performed in the test video.
Fig. 12.1 Block diagram of proposed methodology
12.2.1 Background Theory
12.2.1.1 Discrete Wavelet Transform
In the frequency domain, we can easily identify the high-frequency components of the original signal. This property of the frequency domain can be exploited to aid action recognition: when an action is performed, the high-frequency component in the part of the frame where the action occurs changes, and it is observed that the change is different for different actions. Other transforms like the Fourier transform cannot be used here, because in the Fourier transform domain there is no relation to the spatial coordinates. The DWT resolves this issue: in the DWT, the frequency information is obtained at its original location in the spatial domain, and as a direct consequence, the frequency information obtained can be visualized as an image in the spatial domain. Figure 12.2 shows the result of the DWT for a wave action, which gives the approximation (low-pass) and detail (high-pass) information; the detail (horizontal, vertical, and diagonal) information is used in the energy calculation. The mathematics of the 2D wavelet transform is given by Eqs. (12.1) (scaling function), (12.2) (wavelet functions), and (12.3)-(12.4) (analysis equations):
$$\phi_{j,m,n}(x, y) = 2^{j/2}\,\phi(2^j x - m,\ 2^j y - n) \tag{12.1}$$
$$\psi^k_{j,m,n}(x, y) = 2^{j/2}\,\psi^k(2^j x - m,\ 2^j y - n), \quad k = H, V, D \tag{12.2}$$
where j, m, n are integers, j is a scaling parameter, and m and n are shifting parameters along the x and y directions. The analysis equations are:
$$W_\phi(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \phi_{j_0,m,n}(x, y) \tag{12.3}$$
$$W^k_\psi(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, \psi^k_{j_0,m,n}(x, y), \quad k = H, V, D \tag{12.4}$$
where M, N are the dimensions of the video frame, f(x, y) is the intensity value at coordinate (x, y), and j_0 is a scaling parameter. As the scaling and wavelet functions are separable, the 2D DWT can be decomposed into two 1D DWTs, one along the x-axis and the second along the y-axis. As a result, it returns four bands: LL (top-left), HL (top-right), LH (bottom-left), and HH (bottom-right). The HL and LH bands give the variation along the x-axis and y-axis, respectively. LL is an approximation of the original image that retains only low-pass information; the LH, HL, and HH bands are used to detect changes in the motion directions. The LL band gives the approximation coefficients c_A, and the other three bands give the detail coefficients c_H, c_V, c_D, respectively.
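In practice this decomposition can be computed with the PyWavelets library. The sketch below performs a level-2 decomposition of a frame-difference image and derives the detail-band energy shares; the Haar mother wavelet and the variable names are illustrative choices, as the chapter does not specify them.

```python
import numpy as np
import pywt

# frame_diff: 2D array holding |F_p - F_(p+2)| (an assumed input).
coeffs = pywt.wavedec2(frame_diff.astype(np.float64), "haar", level=2)
cA, (cH, cV, cD) = coeffs[0], coeffs[1]   # level-2 approximation + details
# Total energy over the four level-2 bands, as in the chapter's definition.
total_energy = sum(np.sum(c ** 2) for c in (cA, cH, cV, cD))
E_H, E_V, E_D = (np.sum(c ** 2) / total_energy for c in (cH, cV, cD))
```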
Phase congruency
Phase congruency reflects signal behavior in frequency domain. Phase congruency is a method for edge detection which is robust against variations in illumination and contrast. It finds places where phases are in same order and edges have similar phase in frequency domain.This property is used to detect edge-like features using PC. In action recognition from videos, the illumination may not always be even on the entire scene. Phase congruency performs significantly better than other edge detection methods in such cases. Mathematical equation of PC for 2D signals like image is given in (Battiato et al. 2014). Figure 12.2 shows result of PC applied on frame difference of two consecutive frames of walk action. It is easy to show that PC is proportional to local energy in a signal using some simple trigonometric manipulations (Koves 2016) which is given by following relation: (12.5) E(x) = aω dω.PC(x) This relation is exploited to compute energy-based features for HAR.
Fig. 12.2 2D wavelet transform and phase congruency: a result of the 2D wavelet transform for a wave action, b result of phase congruency for a walk action
12.2.2 Feature Extraction
The steps for extracting features using DWT and PC are given below:
1. The training video is divided into frames, which are stored in F. The difference of every alternate frame stored in F, called the frame difference image, is obtained by
$$F_{dx} = |F_p - F_{p+2}| \tag{12.6}$$
where $F_{dx}$ is the absolute difference between two alternate frames and p varies from 1 to 46.
2. Apply the analysis equations of the DWT (Sect. 12.2.1.1) to each frame difference image, which gives c_A, c_H, c_V, c_D:
$$W_\phi(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} F_{dx}(x, y)\, \phi_{j_0,m,n}(x, y) \tag{12.7}$$
$$W^k_\psi(j_0, m, n) = \frac{1}{\sqrt{MN}} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} F_{dx}(x, y)\, \psi^k_{j_0,m,n}(x, y), \quad k = H, V, D \tag{12.8}$$
Let $TE_{DWT}$ be the total energy of the decomposition vector C, which consists of c_A, c_H, c_V, c_D. The energy contribution of each individual detail component is calculated by
$$E_k = \frac{\sum_{i=1}^{n} (c_k(i))^2}{TE_{DWT}}, \quad k = H, V, D \tag{12.9}$$
where E_H, E_V, and E_D are the energy contributions of c_H, c_V, c_D, respectively. E_H, E_V, and E_D become the first three extracted features.
3. Apply PC (Sect. 12.2.1.2) to each frame difference, which gives an image matrix of detected edges:
$$F_{PC} = PC(F_{dx}(m, n)) \tag{12.10}$$
Let UL, LL, UR, LR denote the upper-left, lower-left, upper-right, and lower-right parts, respectively, obtained by dividing $F_{PC}$ into four parts, and let $TE_{PC}$ be the total energy of the matrix $F_{PC}$. The energy contribution of each individual part is calculated by
$$E_G = \frac{\sum_{i=1}^{n} (F_G(i))^2}{TE_{PC}}, \quad G = UL, LL, UR, LR \tag{12.11}$$
where E_UL, E_LL, E_UR, and E_LR are the energies of the UL, LL, UR, and LR parts, respectively. Thus, E_UL, E_LL, E_UR, and E_LR become the next four features. For every frame difference image, we get a feature vector containing the above seven features; the same is repeated for every video instance available.
T. Zaveri et al.
12.2.3 Analysis of Energy Profiles The proposed idea is based on the fact that energy profile over entire video is different for various action classes. We employ the above feature extraction algorithms to generate feature vectors for videos of different classes. We also plot the results obtained by both methods, DWT and PC, individually to show that the energy profiles obtained matches with underlying observation.
12.2.3.1
Energy Profiles Obtained from Discrete Wavelet Transform
The values for the horizontal, vertical, and diagonal energy obtained from the frame differences were calculated and plotted for analysis. These profiles for four bend action videos and their average profile is shown in Fig. 12.3. Figure 12.3 represents the energy profiles of H , V , and D components for multiple videos of bend action, and their average is also shown. As expected, the vertical energy is the highest initially. This is because the video starts with a person standing upright. Also, a dip is observed in the diagonal energies as the person bends more and more. After the completion of about half the video, the horizontal energy increases, since now the person’s gait is almost horizontal. After that, diagonal energies increase again as the person begins to stand up. Similarly, for other actions too, the profiles were created, and they behave as expected.
Fig. 12.3 Pattern analysis of horizontal, vertical, and diagonal energy profiles obtained using DWT: a horizontal, b vertical, and c diagonal energy profiles; d, e, f the corresponding average profiles
Fig. 12.4 Pattern analysis using PC: a, b, c, g energy profiles for the upper left, upper right, lower left, and lower right parts; d, e, f, h the corresponding average profiles
12.2.3.2 Energy Profiles Obtained from Phase Congruency
The result obtained after applying phase congruency is divided into four parts, namely upper left, lower left, upper right, and lower right. The energy profiles of these four parts are plotted and analyzed in Fig. 12.4 for the wave action. Figure 12.4 shows that the energy distribution across all four parts for the wave action is almost equal, which is consistent with the expected result for waving. Profiles obtained for other actions also have definite, unique patterns, making these energy profiles suitable for use in classifiers to recognize actions.
12.3 Simulation Result and Assessment
We used the Weizmann dataset in our experiment. This dataset contains ten day-to-day actions, namely walk, run, bend, wave (one hand), wave2 (two hands), jack, jump, skip, gallop sideways, and jump in place (pjump), performed by ten actors. Figure 12.5 shows these actions being performed by different actors.
Fig. 12.5 Dataset
In our experiment, fourteen action classes were used: run left, run right, pjump, jump left, jump right, wave2, walk left, walk right, skip left, skip right, side left, side right, jack, and bend. A total of 85 videos were used for the experiment. The energy values obtained from the DWT and phase congruency methods were used to construct the training matrix for the classifier. Since we had 14 action classes with a total of 85 videos of 92 frames each and 7 features per frame difference, we get an 85 × 322 feature matrix. Thus, each row in the feature matrix contains the extracted features for a particular action instance, and each row is labeled with the type of action. The dataset was divided randomly into training and test sets in a 60–40 proportion, and ten such divisions were employed in a cross-validation approach. We used three classifiers, SVM, Naive Bayes, and J48, to test our algorithm. To evaluate the proposed algorithm, four parameters, sensitivity, specificity, precision, and accuracy, are calculated from the confusion matrix. Sensitivity is the percentage of positively labeled instances that were predicted as positive, specificity is the percentage of negatively labeled instances that were predicted as negative, and precision is the percentage of positive predictions that are correct; accuracy tells what percentage of all predictions is correct. The equations for each parameter are given in (Kacprzyk 2015). We have analyzed the results of DWT and PC separately and together for the above three classifiers, as shown in Tables 12.1, 12.2 and 12.3, respectively. They show that DWT and PC together give better sensitivity, specificity, precision, and accuracy for all three classifiers, and among the three classifiers, SVM gives the best result in terms of all four parameters.
Table 12.1 Classifier result for DWT

              Sensitivity   Specificity   Precision   Accuracy (%)
SVM           0.838         0.977         0.849       83.75
J48           0.813         0.969         0.826       81.25
Naive Bayes   0.750         0.976         0.796       75.00

Table 12.2 Classifier result for PC

              Sensitivity   Specificity   Precision   Accuracy (%)
SVM           0.725         0.987         0.617       72.50
J48           0.558         0.986         0.5866      58.75
Naive Bayes   0.725         0.975         0.725       72.50

Table 12.3 Classifier result for DWT + PC

              Sensitivity   Specificity   Precision   Accuracy (%)
SVM           0.888         0.992         0.888       88.75
J48           0.813         0.985         0.822       81.25
Naive Bayes   0.763         0.981         0.801       77.25
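For reference, the four evaluation metrics can be recovered from a multi-class confusion matrix as in the sketch below; macro-averaging across the fourteen classes is an assumption, since the chapter does not state its averaging convention.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Sensitivity, specificity, precision (macro-averaged) and accuracy
    from a confusion matrix (rows = true class, cols = predicted class).
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - (tp + fn + fp)
    sensitivity = np.mean(tp / (tp + fn))
    specificity = np.mean(tn / (tn + fp))
    precision = np.mean(tp / (tp + fp))
    accuracy = 100.0 * tp.sum() / cm.sum()   # reported as a percentage
    return sensitivity, specificity, precision, accuracy
```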
Results obtained from the proposed methodology are compared with the existing Action MACH filter (Rodriguez et al. 2008). Action MACH is a template-based method which finds the correlation of a test video with a synthesized filter for every action; if the correlation is greater than some pre-decided threshold, we infer that the corresponding action is occurring. This requires the tedious job of tracking the start and end frames of every action video to generate a filter, whereas our method requires no such preprocessing. Also, we extended our method to 14 action classes, whereas the MACH filter is reported for six actions only. The proposed methodology gives 88.75% accuracy over the 80.9% accuracy of the MACH filter.
12.4 Conclusion
We have presented an energy-based approach that incorporates the notions of phase congruency and the discrete wavelet transform in the calculation of energy-based features. The method is based on the idea that energy profiles over entire videos of the same action class show good correlation, while they vary greatly across different action classes. Good accuracy is obtained using this methodology for fourteen action classes. For now, we have tried only simple actions, but the method can be extended to complex actions. Other energy-based features can be combined with the proposed features to achieve more robust results, and features which exploit other aspects like shape can also be combined in order to distinguish actions with similar energy profiles by their shape differences.
References
A K, C S, I G (2008) A spatio-temporal descriptor based on 3d-gradients. In: British machine vision conference
Battiato S, Coquillart S, Laramee RS, Kerren A, Braz J (2014) Computer vision, imaging and computer graphics: theory and applications. In: International joint conference, VISIGRAPP 2013. Springer Publishing Company, Barcelona, Spain
D Z, G L (2004) Review of shape representation and description techniques. Pattern Recogn 1:1–19
D Z, G L (2003) A comparative study on shape retrieval using fourier descriptors with different shape signatures. J Vis Commun Image Represent 1:41–60
D Z, G L (2003) A comparative study on shape retrieval using fourier descriptors with different shape signatures. In: IEEE conference on computer vision and pattern recognition (CVPR 2005), vol 1, pp 886–893
Dollár P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatiotemporal features. In: VS-PETS, pp 65–72
H K, T S, P M (1995) An experimental comparison of autoregressive and fourier-based descriptors in 2d shape classification. IEEE Trans Pattern Anal Mach Intell 2:201–207
Kacprzyk J (2015) Advances in intelligent and soft computing. Springer, Berlin
Koves P (2016) Feature detection via phase congruency. http://homepages.inf.ed.ac.uk/rbf/CVonline/. [Online; accessed 11 Nov 2016]
Laptev I (2005) On space-time interest points. Int J Comput Vision 64(2–3):107–123. https://doi.org/10.1007/s11263-005-1838-7
Laptev I, Marszałek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. CVPR (2008)
N D, B T, C S (2006) Human detection using oriented histograms of flow and appearance. In: European conference on computer vision (ECCV 2006), pp 428–441
R C, A R, G H, R V (2012) Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: IEEE conference on computer vision and pattern recognition (CVPR 2009), vol 4, pp 1932–1939
RD L, L S (2000) Human silhouette recognition with fourier descriptors. In: 15th international conference on pattern recognition (ICPR 2000), vol 3, pp 709–712
Rodriguez M, Ahmed J, Shah M (2008) Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE conference on computer vision and pattern recognition (CVPR 2008), pp 1–8
R G, R W, S E (2009) Digital image processing using Matlab, 2nd edn
S S, A AH, B M, U S (2012) Chord length shape features for human activity recognition. In: ISRN machine vision
Scovanner P, Ali S, Shah M (2007) A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th ACM international conference on multimedia, MM '07. ACM, New York, NY, USA, pp 357–360. https://doi.org/10.1145/1291233.1291311
Wang H, Ullah MM, Kläser A, Laptev I, Schmid C (2009) Evaluation of local spatio-temporal features for action recognition. University of Central Florida, USA
Weinland D, Ronfard R, Boyer E (2011) A survey of vision-based methods for action representation, segmentation and recognition, vol 115
Willems G, Tuytelaars T, Gool LV (2008) An efficient dense and scale-invariant spatio-temporal interest point detector. Technical report
Z W, AC B, HR S, EP S (2004) Image quality assessment: from error visibility to structure similarity. IEEE Trans Image Process 4:600–612
Chapter 13
Ripeness Evaluation of Tobacco Leaves for Automatic Harvesting: An Approach Based on Combination of Filters and Color Models P. B. Mallikarjuna, D. S. Guru, and C. Shadaksharaiah Abstract In this paper, a novel filter-based model for the classification of tobacco leaves for the purpose of harvesting is proposed. The filter-based model relies on estimating the degree of ripeness of a leaf using a combination of filters and color models. The degree of ripeness of a leaf is computed using the density of maturity spots on the leaf surface and the yellowness of the leaf. A new maturity spot detection algorithm, based on the combination of a first order edge extractor (Sobel or Canny edge detector) and second order high-pass filtering (Laplacian filter), is proposed to compute the density of maturity spots per unit area of a leaf. Further, a simple thresholding classifier is designed for the purpose of classification. The superiority of the proposed model in terms of effectiveness and robustness is established empirically through extensive experiments.
13.1 Introduction
The agriculture sector plays a vital role in the economy of any developing country, such as India. Employment, wealth, and the security of a nation depend directly on the qualitative and quantitative production of agriculture, which is the outcome of a complex interaction of soil, seed, water, and agrochemicals. Enhancement of productivity needs the proper type, quantity, and timely application of soil, seed, water, and agrochemicals at specific sites. This demands precision agriculture practices such as soil mapping, disease mapping at both seedling and plant level, weed mapping,
Fig. 13.1 Block diagram of a general precision agriculture system
selective harvesting, and quality analysis (grading) of agricultural products. Continuous assessment of these precision agriculture practices requires skilled labour, which is in short supply in most agro-based developing countries; moreover, humans certainly cannot fulfil the above requirements precisely and accurately in a continuous process. The variations occurring in crop or soil properties within a field (Jabro et al. 2010) should be mapped and timely action taken, but humans are not consistent and precise in recording and mapping spatial variability, so assessment may vary from expert to expert. This can lead to wrong assessment and results in poor quality of agricultural products, which demands automation of the assessment of precision agriculture practices. Since precise and accurate assessment is required, researchers have proposed intelligent models based on computer vision (CV) techniques to automate these practices for various commercial crops. The advantages of a computer vision-based approach are accuracy comparable to that of human experts along with a reduction in manpower and time (Patricio and Riederb 2018). Therefore, devising effective and efficient computer vision models to practise precision agriculture in real time is the current requirement. The stages involved in a general precision agriculture system are shown in Fig. 13.1.
properties of crop such as color, texture, and shape could be exploited to evaluate the ripeness of crop for harvesting purpose using computer vision algorithmic models. To show the importance of the computer vision techniques in precision agricultural practices especially for selective harvesting stage, we have taken the tobacco crop as a case study. After 60 d of plantation of tobacco crop, we can find three types of leaves. They are unripe, ripe, and over-ripe leaves. One should harvest ripe leaves to get quality cured tobacco leaves in a curing process. As an indication of ripeness, small areas called maturity spots appear non-uniformly on the top surface of a leaf. As ripeness increases, yellowness of a leaf also increases. The maturity spots are more in over-ripe leaves when compared to ripe leaves. Though tobacco crop has commercial crop, no attempt has been made on harvesting of tobacco leaves using CV techniques. However, few attempts could be traced on ripeness evaluation of other commercial crops for automatic harvesting. Direct color mapping approach was developed to an evaluate maturity levels of tomato and date fruits (Lee et al. 2011). This color mapping method maps the RGB values of colors of interest into 1D color space using polynomial equations. It uses a single index value to represent each color in the specified range for the purpose of maturity evaluation of tomato and date fruits. A robotic system for harvesting ripe tomatoes in greenhouse (Yin et al. 2009) was designed based on the color feature of tomatoes, and morphological operations are used to denoise and handle the situations of tomato overlapping and shelter. Medjool date fruits were taken as a case study to demonstrate the performance of a novel color quantization and color analysis technique for fruit maturity evaluation and surface defect detection (Lee et al. 2008). A novel and robust color space conversion and color index distribution analysis technique for automated date maturity evaluation (Lee et al. 2008) were proposed. Applications of mechanical fruit grading and automatic fruit grading (Gao et al., 2009) were discussed and also compare the performance of CV-based automatic fruit grading with mechanical fruit grading. A neural network system using genetic algorithm was implemented to evaluate the maturity levels of strawberry fruits (Xu 2008). In this work, H frequency of HIS color model was used to distinguish maturity levels of strawberry fruits in a variable illumination conditions. An intelligent algorithm based on neural network was developed to classify coffee cherries into under-ripe, ripe, and over-ripe (Furfaro et al. 2007). A coffee ripeness monitoring system was proposed (Johnson et al. 2004). In this work, reflectance spectrum was recorded from four major components of coffee field viz., green leaf, under-ripe fruit, ripe fruit, and over-ripe fruit. Based on reflectance spectrum, ripeness evaluation of coffee field was performed. A Bayesian classifier was exploited for the purpose of classification of intact tomatoes based on their ripening stages (Baltazar et al. 2008). We made an initial attempt on ripeness evaluation of tobacco leaves for automatic harvesting in our previous work (Guru and Mallikarjuna 2010), where we exploited only the combination of sobel edge detector and laplacian filter with CIELAB color model to estimate degree of ripeness of a leaf and conducted experiments on our own small dataset of 244 sample images. 
In our current work, we exploited two combinations, (i) the laplacian filter with the sobel edge detector and (ii) the laplacian filter with the canny edge detector, with different color models
viz., RGB, HSV, MUNSELL, CIELAB, and CIELUV, and conducted experiments on our own large dataset of 1300 sample images. Indeed, the success of our previous attempt motivated us to take up the current work, wherein the previous model has been extended significantly. Thus, the overall contributions of this work are:

• Creation of a relatively large dataset of harvesting tobacco leaves, due to the non-availability of a benchmarking dataset.
• Introduction of the concept of fusing image filters of different orders for maturity spot detection on tobacco leaves.
• Development of a model which combines density of maturity spots and color information for estimating the degree of ripeness of a leaf.
• Design of a simple threshold-based classifier for classification of leaves.
• Conduction of experiments on the large tobacco harvesting dataset created.

This paper is organized as follows. Section 13.2 presents the proposed filter-based model to evaluate the ripeness of a leaf for classification. Section 13.3 provides details of the tobacco harvesting dataset and presents the experimental results obtained from exhaustive evaluation of the proposed model. The paper concludes in Sect. 13.4.
13.2 Proposed Model

The proposed model has four stages: leaf segmentation, detection of maturity spots, estimation of degree of ripeness, and classification.
13.2.1 Leaf Segmentation

The CIELAB (Viscarra et al. 2006) color model was used to segment the leaf area from its background, which includes soil, stones, and noise. According to domain experts, the color of a tobacco leaf varies from green to yellow. Therefore, the chromaticity coordinate is used to segment the leaf from its background. For illustration, we show three different samples of tobacco leaves together with the corresponding segmentation results (Figs. 13.2, 13.3, and 13.4).
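As a concrete illustration of this stage, the following is a minimal sketch of chromaticity-based leaf segmentation using OpenCV; the a*-channel threshold and the morphological kernel size are illustrative assumptions, not the values used in our experiments.

```python
import cv2
import numpy as np

def segment_leaf(bgr_image, a_threshold=135):
    """Segment a leaf from its soil/stone background via the CIELAB a* channel.

    The a* axis runs from green (low) to red (high); leaf pixels, which
    range from green to yellow, fall below a background-dependent cutoff.
    The threshold here is a placeholder, not the authors' setting.
    """
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    _, a, _ = cv2.split(lab)
    mask = (a < a_threshold).astype(np.uint8) * 255
    # Remove small speckles (stones, noise) with a morphological opening
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.bitwise_and(bgr_image, bgr_image, mask=mask)
```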
Fig. 13.2 a A sample tobacco leaf with rare maturity spots, b segmented image

Fig. 13.3 a A sample tobacco leaf with moderate maturity spots, b segmented image

Fig. 13.4 a A sample tobacco leaf with rich maturity spots, b segmented image

13.2.2 Detection of Maturity Spots

The proposed maturity spots detection algorithm mainly consists of two stages. The first stage involves the application of a second order high-pass filter and a first order edge extraction algorithm separately on a leaf; the results are later subjected to subtraction in the second stage.
Fig. 13.5 Block diagram of the proposed maturity spots detection algorithm.
The block diagram of the proposed maturity spots detection algorithm is given in Fig. 13.5. The maturity spots are highly visible in the R-channel gray scale image compared to the G-channel and B-channel gray scale images. Therefore, the RGB image of a tobacco leaf is transformed into its R-channel gray scale image. A second order high-pass filter is exploited to enhance maturity spots (fine details) present on the red channel gray scale image of a tobacco leaf, as it highlights transitions in intensities in an image. Any high-pass filter in the frequency domain attenuates low-frequency components without disturbing high-frequency information. Therefore, to extract the finer details of small maturity spots, we recommend applying a second order derivative high-pass filter (in our case the laplacian filter), which enhances them much better than first order derivative high-pass filters (Sobel and Roberts). Then, we transform the second order filtered image into a binary image using a suitable threshold. The resultant binary image contains veins and the leaf boundary in addition to maturity spots. Image subtraction is used to eliminate the vein and leaf boundary edge pixels of the resultant binary image. Therefore, we recommend subtracting the edge image containing only vein and leaf boundary edge pixels from the resultant binary image.
Since a first order edge extraction operator is less susceptible to noise, it is used to extract the edge image from the red channel gray scale image of the segmented original RGB color image. The obtained edge image is then subtracted from the binary image obtained from second order high-pass filtering. The image subtraction results in an image containing only maturity spots; the number of connected components present in that image gives the number of maturity spots on the leaf. Let us consider a tobacco leaf (Fig. 13.6) to illustrate the proposed maturity spots detection algorithm. As discussed above, when we apply the transformation (RGB to R-channel gray scale) to the original RGB image of the segmented tobacco leaf (Fig. 13.6a), the maturity spots become highly noticeable in the R-channel gray scale image, as shown in Fig. 13.6b. The second order high-pass filter (laplacian filter) is used to enhance the maturity spots. The laplacian filtered image (Fig. 13.6c) is converted into a binary image (Fig. 13.6d) using a suitable predefined threshold. As stated above, the resultant binary image (Fig. 13.6d) contains maturity spot, vein, and boundary edge pixels. So, to remove vein and boundary pixels, we subtract the edge image (Fig. 13.6e) obtained after first order edge extraction (canny edge detector) from the binary image (Fig. 13.6d). Finally, the image subtraction results in an image (Fig. 13.6f) containing only maturity spots.
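A compact sketch of this two-stage pipeline, assuming OpenCV, is given below. The binarization threshold and the canny thresholds are placeholders (the text fixes them during training), and scaling the normalized canny thresholds by 255 is our assumption about how the 0–1 values map to OpenCV's absolute gradient thresholds.

```python
import cv2
import numpy as np

def detect_maturity_spots(segmented_bgr, bin_thresh=20, canny_lo=0.3, canny_hi=0.4):
    """Sketch of the spot-detection pipeline described above; all threshold
    values are placeholders to be fixed during training, as in the text."""
    r = segmented_bgr[:, :, 2]                      # R-channel gray scale image
    lap = cv2.Laplacian(r, cv2.CV_16S, ksize=3)     # second order high-pass filter
    lap = cv2.convertScaleAbs(lap)
    _, binary = cv2.threshold(lap, bin_thresh, 255, cv2.THRESH_BINARY)
    # First order edge extraction: veins and leaf boundary
    edges = cv2.Canny(r, int(canny_lo * 255), int(canny_hi * 255))
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))  # cover edge neighbourhoods
    spots = cv2.subtract(binary, edges)             # image subtraction
    n_labels, _ = cv2.connectedComponents(spots)
    return spots, n_labels - 1                      # exclude the background label
```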
13.2.3 Estimation of Degree of Ripeness

The degree of ripeness is estimated to evaluate the ripeness of a leaf. It is based on the density of maturity spots present on the leaf and the yellowness of the leaf. The degree of ripeness (D) of a leaf is defined as a weighted combination of these two quantities, as given in Eq. 13.1 below, where the weights W1 and W2 are assigned to the density of spots (DES) and the mean value of yellowness of a leaf (MYL), respectively. The yellowness of a leaf increases as its ripeness increases; therefore, we include the parameter MYL when estimating the ripeness of a leaf. To compute it, we used different color models such as RGB, HSV, MUNSELL, CIELAB, and CIELUV.
13.2.4 Classification

During harvesting, we can find three types of leaves on a plant: unripe, ripe, and over-ripe. Unripe leaves have a low degree of ripeness, ripe leaves a moderate degree, and over-ripe leaves a high degree. Therefore, we use a simple thresholding classifier based on two predefined thresholds T1 and T2 to classify tobacco leaves into three classes: unripe, ripe, and over-ripe. The threshold T1 is selected as the midpoint of the distributions of degree of ripeness of samples of the unripe class and the ripe class; the threshold T2 is selected as the midpoint of the distributions of the ripe class and the over-ripe class.
Fig. 13.6 Maturity spots detection: a segmented RGB image of a tobacco leaf, b red channel gray scale image, c laplacian filtered image, d image after thresholding and binarization, e image after leaf vein and boundary extraction using canny edge detector, f image consisting of only maturity spots after subtraction of the image (e)
Then, the class label for a given leaf is decided based on the two predefined thresholds T1 and T2, as given in Eq. 13.3.

D = (W1 × DES) + (W2 × MYL)    (13.1)

DES = Number of maturity spots / Leaf area    (13.2)

Class label = C1 if D < T1, C2 if T1 < D < T2, C3 if D > T2    (13.3)
where D is the degree of ripeness, and C1, C2, and C3 are the class labels of unripe, ripe, and over-ripe leaves, respectively. The thresholds T1 and T2 are fixed empirically; we follow supervised learning to fix the values of T1 and T2.
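Equations 13.1–13.3 translate directly into a few lines of Python; the sketch below only restates the formulas, with W1 = 0.7 and W2 = 0.3 taken from the best-performing weights reported in Sect. 13.3, and T1 and T2 assumed to be learned beforehand.

```python
def degree_of_ripeness(n_spots, leaf_area, myl, w1=0.7, w2=0.3):
    """Degree of ripeness D (Eq. 13.1) from spot density DES (Eq. 13.2)."""
    des = n_spots / leaf_area
    return (w1 * des) + (w2 * myl)

def classify_leaf(d, t1, t2):
    """Simple thresholding classifier (Eq. 13.3); T1 and T2 are learned as
    midpoints of the overlapping class distributions."""
    if d < t1:
        return "C1 (unripe)"
    elif d < t2:
        return "C2 (ripe)"
    return "C3 (over-ripe)"
```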
13.3 Experimentation and Result Analysis

13.3.1 Dataset

In this section, we present details of the tobacco harvesting dataset created to build an automatic harvesting system. One should harvest ripe leaves to get quality leaves after curing; harvesting unripe or over-ripe leaves leads to poor quality leaves after curing. Image samples of tobacco leaves (unripe, ripe, and over-ripe) were collected randomly from the tobacco crop field at CTRI, Hunsur. The number of collected image samples of the individual classes of tobacco harvesting leaves is tabulated in Table 13.1, and image samples of each class are shown in Fig. 13.7.
Table 13.1 Number of samples of individual classes of tobacco harvesting leaves

Tobacco leaves      Number of samples
Unripe leaves       323
Ripe leaves         667
Over-ripe leaves    310
Total samples       1300
Fig. 13.7 A sample of a unripe leaf, b ripe leaf, c over-ripe tobacco leaf
13.3.2 Results

The proposed model estimates the degree of ripeness of a leaf using the proposed method of maturity spots detection and color models. The proposed maturity spots detection algorithm is a combination of first order edge extraction and second order filtering. We exploited first order edge extractors such as the sobel edge detector and the canny edge detector, and used the laplacian filter for second order filtering. Hence, we have two combinations: (i) the laplacian filter with the sobel edge detector and (ii) the laplacian filter with the canny edge detector. Henceforth, in this paper, we refer to these combinations as Method 1 and Method 2, respectively. The sobel edge detector works with one threshold (Tr); therefore, in Method 1, we fixed the threshold (Tr) value of the sobel detector in the training phase.
Table 13.2 Average classification accuracy using the Method 2 (combination of laplacian filter and canny edge detector) for varying the thresholds Tr1 and Tr2 of the canny edge detector (entries with Tr1 ≥ Tr2 are undefined and omitted)

Tr1 ↓ / Tr2 →   0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1
0.0             71.65   72.96   72.9    72.97   73.73   73.93   73.9    74.23   74.43   74.65
0.1             –       73.21   73.17   73.4    73.08   73.45   73.54   74.02   74.21   75.76
0.2             –       –       74.66   74.96   75.3    76.31   77.31   78.3    78.84   79.49
0.3             –       –       –       86.59   85.6    85.2    85.51   85.48   85.73   84.51
0.4             –       –       –       –       81.59   81.49   82.3    81.71   82.03   82.04
0.5             –       –       –       –       –       81.97   81.77   82.31   82.07   82.28
0.6             –       –       –       –       –       –       81      81.32   79.13   79.64
0.7             –       –       –       –       –       –       –       77.08   76.98   76.17
0.8             –       –       –       –       –       –       –       –       75.19   76.47
0.9             –       –       –       –       –       –       –       –       –       75.62
That is, we varied the threshold (Tr) value from 0 to 1 in steps of 0.1; experimentally, the best average classification accuracy was achieved for Tr = 0.2. On the other hand, the canny edge detector works with two thresholds (Tr1 and Tr2). Therefore, in Method 2, we fixed the threshold values Tr1 and Tr2 of the canny edge detector in the training phase; experimentally, the best values of Tr1 and Tr2 were found to be 0.3 and 0.4, respectively. The thresholds Tr1 and Tr2 of the canny edge detector are tuned in such a way that the leaf boundary edge pixels and leaf vein edge pixels are extracted clearly. Therefore, there is a very low probability of leaf vein edge pixels and leaf boundary edge pixels being counted as maturity spots while estimating maturity spot density. However, selecting suitable values of Tr1 and Tr2 is a challenging task. Pixels with values between Tr1 and Tr2 are weak edge pixels that are 8-connected to the strong edge pixels (pixel values greater than Tr2), which performs edge linking. Therefore, the values of Tr1 and Tr2 are set such that the probability of leaf boundary and vein weak edge pixels being missed is minimum. By varying the thresholds Tr1 and Tr2, it was found that the best average classification accuracy is achieved for Tr1 = 0.3 and Tr2 = 0.4, as given in Table 13.2. For estimation of the degree of ripeness of a leaf, we varied the weights W1 and W2 (Eq. 13.1) such that the best average classification accuracy is achieved (W1 = 0.7 and W2 = 0.3) for all sets of training and testing samples; this is shown in Fig. 13.8 using Method 2 for 60% training. For the purpose of fixing T1 and T2, during classification, we considered 150 samples from each class and plotted the distribution of samples over degree of ripeness (Fig. 13.9). Since there is a large overlap between the classes, as shown in Fig. 13.9, we recommend selecting the thresholds by studying the overlap of the unripe class and the ripe class for T1, and of the ripe class and the over-ripe class for T2, as follows.
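The threshold tuning described above amounts to a small grid search; the sketch below mirrors the Tr1 < Tr2 sweep of Table 13.2, with `evaluate_accuracy` standing in as a hypothetical callback that runs the full pipeline on the training split and returns the average classification accuracy.

```python
import numpy as np

def tune_canny_thresholds(evaluate_accuracy, step=0.1):
    """Grid search over (Tr1, Tr2) with Tr1 < Tr2, as in Table 13.2.
    `evaluate_accuracy(tr1, tr2)` is a placeholder, not part of the chapter."""
    best = (None, None, -1.0)
    for tr1 in np.arange(0.0, 1.0 + 1e-9, step):
        for tr2 in np.arange(tr1 + step, 1.0 + 1e-9, step):
            acc = evaluate_accuracy(round(tr1, 1), round(tr2, 1))
            if acc > best[2]:
                best = (round(tr1, 1), round(tr2, 1), acc)
    return best   # reported optimum in the text: Tr1 = 0.3, Tr2 = 0.4
```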
Fig. 13.8 Average classification accuracy obtained by the Method 2 (combination of laplacian filter and canny edge detector) under varying weights W1 and W2
Fig. 13.9 Distribution of tobacco samples over degree of ripeness
During experimentation, we conducted four different sets of experiments for both Method 1 and Method 2. In the first set of experiments, we used 30% of the samples of each class of the harvesting tobacco dataset to create class representative vectors (training), and the remaining 70% of the samples were used for testing. In the second, third, and fourth sets of experiments, the numbers of training and testing samples were in the ratios 40:60, 50:50, and 60:40, respectively. In each set, experiments were repeated 20 times by choosing the training samples randomly. As measures of goodness of the proposed model, we computed classification accuracy,
Table 13.3 Classification accuracy using the Method 1 (combination of laplacian filter and sobel edge detector) with different color models

Training examples  Color model  Minimum accuracy  Maximum accuracy  Average accuracy  Std. deviation
30%                RGB          51.3055           57.1953           53.6121           1.6049
                   HSV          50.2696           54.8161           51.8439           1.1337
                   MUNSELL      60.5778           69.4947           64.6835           2.3205
                   CIELAB       78.3534           82.6594           80.43             1.0805
                   CIELUV       80.3467           83.1168           82.018            1.0023
40%                RGB          50.713            55.1152           52.7922           1.1783
                   HSV          50.2438           54.9609           52.0307           1.2323
                   MUNSELL      60.5041           68.7531           63.8465           2.417
                   CIELAB       78.2631           82.7723           80.8273           1.2004
                   CIELUV       80.561            84.129            82.4455           1.268
50%                RGB          50.2906           54.2789           52.1122           0.9674
                   HSV          49.6817           53.4766           51.0107           1.0167
                   MUNSELL      59.7798           68.2022           63.685            1.9417
                   CIELAB       79.0244           83.5759           80.9809           1.1023
                   CIELUV       80.8077           85.4418           83.3289           1.2991
60%                RGB          50.4844           54.6317           52.4391           1.1576
                   HSV          49.5419           52.3252           50.7931           0.8702
                   MUNSELL      57.9345           68.3146           62.2161           2.1327
                   CIELAB       78.3068           85.4018           81.9757           1.4671
                   CIELUV       80.332            85.2349           83.5698           1.3244
precision, recall, and F-measure. The minimum, maximum, average, and standard deviation of classification accuracy over all 20 trials using the proposed simple thresholding classifier for both methods are given in Tables 13.3 and 13.4, respectively. Classification accuracy using Method 1 with different color models, viz., RGB, HSV, MUNSELL, CIELAB, and CIELUV, is given in Table 13.3; similarly, classification accuracy using Method 2 with the same color models is given in Table 13.4. The confusion matrix across leaf types using Method 1 for the best average classification accuracy is given in Table 13.5; similarly, the confusion matrix across leaf types using Method 2 for the best average classification accuracy is given in Table 13.6. The corresponding precision, recall, and F-measure for the individual classes are presented for both Method 1 and Method 2 in Fig. 13.10. From Tables 13.3 and 13.4, it is observed that the best average classification accuracy is achieved for Method 2 with the CIELUV color model.
Table 13.4 Classification accuracy using the Method 2 (combination of laplacian filter and canny edge detector) with different color models

Training examples  Color model  Minimum accuracy  Maximum accuracy  Average accuracy  Std. deviation
30%                RGB          66.6959           73.3427           70.5315           1.5919
                   HSV          54.9965           61.1442           58.7328           1.4712
                   MUNSELL      67.3114           74.1979           70.1425           1.6726
                   CIELAB       78.9152           82.9264           81.1872           0.9323
                   CIELUV       82.7643           88.7130           85.2472           1.8509
40%                RGB          67.0316           73.2570           70.3530           1.7062
                   HSV          54.9682           62.1305           57.9434           1.8306
                   MUNSELL      67.0676           72.0768           69.7736           1.2273
                   CIELAB       78.9224           81.8946           80.2090           0.8748
                   CIELUV       81.0633           88.9688           86.3933           2.0416
50%                RGB          65.8649           73.5737           70.3453           1.8429
                   HSV          52.9818           61.8088           58.6286           1.8589
                   MUNSELL      67.9882           72.8952           70.0468           1.1668
                   CIELAB       83.5848           79.4480           81.3107           1.1918
                   CIELUV       81.8075           89.7177           86.1646           2.3228
60%                RGB          64.8270           73.8114           70.6765           1.9759
                   HSV          54.3457           62.1077           58.5386           2.1944
                   MUNSELL      66.3917           72.5029           70.8465           1.3572
                   CIELAB       78.2949           84.9594           81.8277           1.6287
                   CIELUV       80.7693           89.5043           86.5945           2.2025
Table 13.5 Confusion matrix across leaf types using the Method 1 (combination of laplacian filter and sobel edge detector) for best average classification accuracy

           Unripe  Ripe  Over-ripe
Unripe     106     23    00
Ripe       20      222   25
Over-ripe  00      18    106
Table 13.6 Confusion matrix across leaf types using the Method 2 (combination of laplacian filter and canny edge detector) for best average classification accuracy

           Unripe  Ripe  Over-ripe
Unripe     111     18    00
Ripe       10      228   29
Over-ripe  00      14    110
Fig. 13.10 Classwise evaluation of the Method 1 (combination of laplacian filter and sobel edge detector) and the Method 2 (combination of laplacian filter and canny edge detector): a Precision, b Recall, c F-measure
13.3.3 Discussion

When we applied our previous method, the combination of the laplacian filter and the sobel edge detector (Method 1) with CIELAB (Guru and Mallikarjuna 2010), to our large dataset, we achieved a classification accuracy of about 81% (see Table 13.3). To improve classification accuracy, our previous method (Guru and Mallikarjuna 2010) was extended with different color models, viz., RGB, HSV, MUNSELL, and CIELUV, and achieved a good classification accuracy of about 83% with the CIELUV color model (see Table 13.3) on our large dataset. To improve classification accuracy further, the current work was extended to the combination of the laplacian filter and the canny edge detector (Method 2) with different color models, viz., RGB, HSV, MUNSELL, CIELAB, and CIELUV. We achieved an improvement in classification accuracy of 3% using Method 2 with the CIELUV color model (see Table 13.4) on our large dataset.
13.4 Conclusion

In this work, we present a novel model based on filtering strategies for classification of tobacco leaves for the purpose of harvesting. A method for detecting maturity spots is proposed, and a method for finding the degree of ripeness of a leaf is presented. Further, we propose a simple thresholding classifier for effective classification of tobacco leaves. To investigate the effectiveness and robustness of the proposed model, we conducted experiments for both methods, (i) the combination of the laplacian filter and the sobel edge detector and (ii) the combination of the laplacian filter and the canny edge detector, on our own large dataset. Experimental results reveal that the combination of the laplacian filter and the canny edge detector is superior to the combination of the laplacian filter and the sobel edge detector.
References

Baltazar A, Aranda JI, Aguilar GG (2008) Bayesian classification of ripening stages of tomato fruit using acoustic impact and colorimeter sensor data. Comput Electron Agric 60(2):113–121
Furfaro R, Ganapol BD, Johnson LF, Herwitz SR (2007) Neural network algorithm for coffee ripeness evaluation using airborne images. Appl Eng Agric 23(3):379–387
Gao H, Cai J, Liu X (2009) Automatic grading of the post-harvest fruit: a review. In: Third IFIP international conference on computer and computing technologies in agriculture. Springer, Beijing, pp 141–146
Guru DS, Mallikarjuna PB (2010) Spots and color based ripeness evaluation of tobacco leaves for automatic harvesting. In: First international conference on intelligent interactive technologies and multimedia. ACM, IIIT Allahabad, India, pp 198–202
Jabro JD, Stevens WB, Evans RG, Iversen WM (2010) Spatial variability and correlation of selected soil in the AP horizon of a CRP grassland. Appl Eng Agric 26(3):419–428
Johnson LF, Herwitz SR, Lobitz BM, Dunagan SE (2004) Feasibility of monitoring coffee field ripeness with airborne multispectral imagery. Appl Eng Agric 20(6):845–849
Lee DJ, Chang Y, Archibald JK, Greco CJ (2008a) Color quantization and image analysis for automated fruit quality evaluation. In: IEEE international conference on automation science and engineering. IEEE, Trieste, Italy, pp 194–199
Lee DJ, Chang Y, Archibald JK, Greco CJ (2008b) Robust color space conversion and color distribution analysis techniques for date maturity evaluation. J Food Eng 88:364–372
Lee D, Archibald JK, Xiong G (2011) Rapid color grading for fruit quality evaluation using direct color mapping. IEEE Trans Autom Sci Eng 8:292–302
Manickavasagam A, Gunasekaran JJ, Doraisamy P (2007) Trends in Indian flue cured Virginia tobacco (Nicotiana tabacum) processing: harvesting, curing and grading. Res J Agric Biol Sci 3(6):676–681
Patricio DI, Rieder R (2018) Computer vision and artificial intelligence in precision agriculture for grain crops: a systematic review. Comput Electron Agric 153:69–81
Viscarra RA, Minasny B, Roudier P, McBratney AB (2006) Colour space models for soil science. Geoderma 133:320–337
Xu L (2008) Strawberry maturity neural network detecting system based on genetic algorithm. In: Second IFIP international conference on computer and computing technologies in agriculture, Beijing, China, pp 1201–1208
Yin H, Chai Y, Yang SX, Mitta GS (2009) Ripe tomato extraction for a harvesting robotic system. In: IEEE international conference on systems, man and cybernetics. IEEE, San Antonio, USA, pp 2984–2989
Chapter 14
Automatic Deep Learning Framework for Breast Cancer Detection and Classification from H&E Stained Breast Histopathology Images

Anmol Verma, Asish Panda, Amit Kumar Chanchal, Shyam Lal, and B. S. Raghavendra

Abstract About half a million breast cancer patients succumb to the disease, and nearly 1.7 million new cases arise every year. These numbers are expected to rise significantly due to advances in social and medical engineering. Furthermore, histopathological images are the gold standard for identifying and classifying breast cancer compared with other medical imaging. Evidently, the decision on an optimal therapeutic schedule for breast cancer rests upon early detection. The primary motive for a better breast cancer detection algorithm is to help doctors identify the molecular sub-types of breast cancer in order to control the metastasis of tumor cells early in disease prognosis and treatment planning. This paper proposes an automatic deep learning framework for breast cancer detection and classification from hematoxylin and eosin (H&E) stained breast histopathology images with 80.4% accuracy, supplementing the analysis of medical professionals to prevent false negatives. Experimental results show that the proposed architecture provides better classification results than benchmark methods.
14.1 Introduction

Proper diagnosis of breast cancer is the demand of today's time, because in women it has become a major cancer-related issue worldwide. Manual analysis of microscopic slides leads to differences of opinion among pathologists and is a time-consuming process due to the complexity associated with such images. Breast cancer is a disease with distinctive histological attributes: it has benign tumor sub-classes, namely Adenosis, Fibroadenoma, Phyllode Tumor, and Tubular Adenoma, and malignant tumor sub-classes, namely Ductal Carcinoma, Lobular Carcinoma, Mucinous Carcinoma, and Papillary Carcinoma. Classical classification algorithms have their own merits and demerits. Logistic regression-based classification is easy to implement, but its accuracy depends on the nature of the dataset: if the data is linearly separable it works well, but real-world datasets are rarely linearly separable. The decision tree-based classification model
is able to deal with complex datasets, but there is always a chance of overfitting with this method. The overfitting problem can be reduced by a random forest algorithm, which is a more sophisticated version of the decision tree-based classification model. The working method of the support vector machine is based on a hyperplane which acts as a decision boundary; appropriate selection of the kernel is the key to better performance in support vector machine classification. To improve the process of diagnosis, automatic detection and treatment are among the leading research areas dealing with cancer-related issues. Over the last decade, the development of fast digital whole slide scanners (DWSS) that provide whole slide images (WSI) has led to a revival of interest in medical image processing, analysis, and their applications in digital pathology solutions. Segmentation of cells and nuclei proves to be an important first step towards automatic image analysis of digitized histopathology images. We therefore propose to develop an automated cell identification method that works with H&E stained breast cancer histopathology images. A deep learning framework is very effective for detecting and classifying breast cancer histopathology slides. A typical deep learning classification system consists of (a) a properly annotated dataset whose classes and sub-classes are verified by experienced pathologists; (b) a robust architecture that is able to differentiate the classes and sub-classes of the tissue under diagnosis; (c) a good optimization algorithm and a proper loss function that are able to train the model effectively; and (d) in the case of supervised learning, ground truth prepared under the supervision of experienced pathologists, on which the performance of the model depends. The organization of this chapter is as follows: Sect. 14.2 discusses related research work, Sect. 14.3 presents the proposed model architecture, Sect. 14.4 presents experimental results and discussion, and Sect. 14.5 presents the conclusion of the manuscript.
14.2 Literature Survey

Breast cancer detection had SVM as a benchmark model, as presented by Akay (2009); the benefits of SVM for cancer detection were clearly presented, but the work lacked classification of the type of breast cancer, which was one of our motivations for this research. The motivation was further enhanced by the findings presented by Karabatak and Ince (2009). Veta and Diest presented automatic nuclei segmentation in H&E stained breast cancer histopathology images, explaining the different nuances of breast cancer detection that have been achieved by automated cell segmentation. Their method of cell segmentation, based on patched slide analysis for higher accuracy of cancer detection, is explained in depth (Veta and Diest 2013). The particular advantage of this paper is the detection accuracy it achieves with the cell segmentation method, the best in its class with over 90.4% accuracy
in positive detection. Its disadvantage is that it fails to touch upon the different ways of achieving such detection accuracy with multiple deep learning algorithms. Cruz-Roa et al. presented automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks (CNNs), explaining the detection and visual analysis of IDC tissue in whole slide images (WSI). The framework explained in (Cruz-Roa et al. 2014) extends to a number of CNNs: a CNN is trained over a large number of image patches representing tissue regions from the WSI to learn a hierarchical part-based representation and classification. The resulting accuracy is stated as 71.80% and 84.23% for F-measure and balanced accuracy, respectively. The disadvantage of this method stems from the inherent limitations in obtaining a very highly granular annotation of the diseased area of interest by an expert pathologist. In Spanhol et al. (2016) and Janowczyk and Madabhushi (2016), the work presented by the authors brings significance to the datasets used to elucidate deep learning techniques that produce results comparable, and in many cases superior, to benchmark hand-crafted feature-based classification algorithms. Recently, advanced CNN models have achieved paramount success in classification of natural images as well as in biomedical image processing. In Han et al. (2017), Han et al. designed a novel convolutional neural network, which includes a convolutional layer, a small SE-ResNet module, and a fully connected layer, and achieved impeccable cancer detection outcomes. Most of the state-of-the-art algorithms in the literature are based on learned features that extract high-level abstractions directly from the histopathological H&E stained images using deep learning techniques. In Redmon (2016), Janowczyk and Madabhushi (2016), and Han et al. (2017), the discussion covers the various algorithms applied to nuclear pleomorphism scoring of breast cancer, the challenges to be dealt with, and the importance of benchmark datasets in multi-level classification architectures. The multiple layer analysis of cancer detection and classification draws its roots from papers (Feng et al. 2018; Guo et al. 2019; Jiang et al. 2019; Liao et al. 2018; Liu et al. 2019) explaining feature extraction for different types of breast cancer, with a prominent inclination towards invasive ductal carcinoma (IDC). M Z Alom, T Aspiras et al. presented advanced deep convolutional neural network approaches for digital pathology image analysis (Alom et al. 2019), explaining the process of cancer detection through a CNN approach, specifically IRRCNN. The process of detection using neural networks clarifies the multiple layers that go into making the model. The advantage of this paper is its detection approach, an optimal way of utilizing CNNs for image recognition, in this case cancer detection. Its disadvantage is that it addresses only the detection of cancer among the cells and does not allow more abstract classification of cancer types.
In Ragab et al. (2019), the authors explain the significance of SVM as a benchmark model for identification of breast cancer; although the analysis proves promising, it is done on mammogram images instead of H&E stained images. A deep learning model by Li et al. (2019) classifies images into malignant and non-malignant and uses a classifier in such a way that it detects local patches. The idea of multi-classifier development is shared by Kassani et al. (2019), where the problem is tackled through ResNet and other prominent neural networks. The disadvantage encountered in these papers is the lack of specificity of the disease. The paper by Lichtblau and Stoean (2019) suggests the different models that need to be studied to identify the most optimal approach for classification of different cancer types. Since this paper focuses primarily on the classification of breast cancer, our detection algorithm builds on transfer learning of benchmark algorithms as presented in Mehra (2018), Ting et al. (2019), Vo et al. (2019), Khan et al. (2019), Ni et al. (2019), and Das et al. (2020). For the BreaKHis dataset, Toğaçar et al. (2020) proposed a general framework for diagnosis of breast cancer; their architecture consists of attention modules, a convolution block, a dense block, a residual block, and a hypercolumn block to capture spatial information precisely, with categorical cross-entropy as the loss function and Adam optimization used to train the model. In Sheikh et al. (2020), a densely connected CNN-based network for binary and multiclass classification captures meaningful structure and texture by fusing multi-resolution features for the ICIAR2018 and BreaKHis datasets. For classification of breast cancer into carcinoma and non-carcinoma, Hameed et al. (2020) utilized deep CNN-based pre-trained VGG-16 and VGG-19 models, which help in better initialization and convergence; their final architecture is an ensemble of fine-tuned VGG-16 and fine-tuned VGG-19 models. By utilizing a sliding window mechanism and class-wise clustering with image-wise feature pooling, Li et al. (2019) extract multi-layered features to train two parallel CNNs; their final classification uses both larger-patch and smaller-patch features. For multiclass classification of breast cancer histopathology images, Xie et al. (2019) adopted transfer learning: the pre-trained Inception_ResNet_V2 and Inception_V3 models are utilized for classification, and their deep learning framework used four different magnification factors for training and testing to ensure the universality of the model. Both CNN and SVM classifiers were used by Araújo et al. (2017) to achieve comparable results: the histology image is divided into patches, patch-based features are extracted using a CNN, and finally these features are fed to an SVM to classify the images. Classification of breast carcinomas in whole slide breast histology images by Bejnordi et al. (2017) stacks high resolution patches on top of a network that accepts large inputs to obtain fine-grained details as well as global tissue structures.
Spanhol et al. (2017) applied a CNN trained on natural images to the BreaKHis dataset to extract deep features, finding these features better than hand-crafted ones. The features are fed to different classifiers trained on the specific dataset; their patch-based classification with four different magnification factors achieves very good prediction accuracy. Zhu et al. (2019) work on the BreaKHis dataset by merging local and global information in a multiple CNN, or hybrid CNN, that is able to classify effectively; to remove redundant information, they incorporate an SEP block in the hybrid model, and combining these two effects, their model obtains promising results. For the BACH and BreaKHis datasets, Patil et al. (2019) used attention-based multiple instance learning, aggregating features into bag-level features; their multiple instance learning is able to localize and classify images into benign, malignant, and invasive.
14.3 Proposed Model Architecture

The proposed architecture consists of two parts, namely detection and classification. The detection network takes influence from IRRCNN (Alom et al. 2019), while the classification network takes influence from WSI-Net (Ni et al. 2019). The flow diagram of our proposed architecture is shown in Fig. 14.1. The architecture consists of two convolutional blocks and three residual networks. The H&E image of the breast tissue is pre-processed and sent into the first convolutional network followed by a residual network, which is then repeated once more. The processed data is sent into the classification branch and the malignancy detection branch. The malignancy detection branch decides whether the fed data is malignant or non-malignant; the classification branch further processes the data and classifies it as invasive ductal carcinoma (IDC) positive or negative.
Fig. 14.1 Proposed model architecture
The data from both branches are combined and passed through the final residual network. We then give the prediction through the confusion matrix segmentation map.
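A hedged Keras sketch of this topology is shown below. The filter counts, input size, and branch depths are our assumptions for illustration, since the text specifies only the block ordering (two convolution-plus-residual stages, two parallel branches, and a final residual network).

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # Identity-style shortcut (via 1x1 conv) around two 3x3 convolutions
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))

inputs = tf.keras.Input(shape=(224, 224, 3))          # input size is an assumption
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = residual_block(x, 32)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = residual_block(x, 64)
# Two parallel branches: malignancy detection and IDC classification
malignancy = layers.Conv2D(64, 3, padding="same", name="malignancy")(x)
idc = layers.Conv2D(64, 3, padding="same", name="idc")(x)
merged = layers.Concatenate()([malignancy, idc])
merged = residual_block(merged, 64)                   # final residual network
out = layers.Dense(1, activation="sigmoid")(layers.GlobalAveragePooling2D()(merged))
model = tf.keras.Model(inputs, out)
```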
14.3.1 Loss Function

The proposed architecture is trained with the Adam optimization algorithm (Kingma and Ba 2014). First and foremost, Adam means adaptive moment estimation: the exponential moving average (EMA) of the first moment of the gradient, scaled by the inverse square root of the EMA of the second moment, is subtracted from the parameter vector, as presented in Eq. 14.1:

θ_{t+1} = θ_t − (η / (√(v̂_t) + ε)) · m̂_t    (14.1)
where θ_t is the parameter vector, m̂_t and v̂_t are the bias-corrected EMAs of the first and second moments of the gradient, η is the learning rate, and ε is a very small hyper-parameter that prevents the algorithm from dividing by zero.
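For concreteness, a minimal NumPy sketch of one Adam update following Eq. 14.1 is given below; the defaults for β1, β2, and ε are the standard values from Kingma and Ba (2014), and the learning rate matches the 0.0001 used later in training.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Eq. 14.1). m and v are the EMAs of the first and
    second moments of the gradient; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```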
14.3.2 Training Setup

A Jupyter notebook in Google Colaboratory was used to implement all the models used and proposed in this paper, through a virtual machine on the cloud as well as a PC with an Intel(R) Core(TM) i7-8750H CPU @ 3.98 GHz, 16 GB RAM, and an NVIDIA GTX 1070 Max-Q as its core specifications.
14.3.3 Segmentation Overview

The segmentation is based on zero-phase component analysis (ZCA) whitening. We use this algorithm to identify certain key features in the breast dataset, which are then compared to our pre-existing training dataset as a part of unsupervised learning. The mathematical expression of ZCA is presented in Eq. 14.2.
W_ZCA = E D^{−1/2} E^T = C^{−1/2}    (14.2)
where W_ZCA is the transformation matrix, the covariance matrix C = X^T X / n has its eigenvectors in the columns of E and its eigenvalues on the diagonal of D, and X is the data matrix of size n × d, where n is the number of data points and d the number of features.
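A short NumPy sketch of Eq. 14.2 follows; the small eps term, which regularises near-zero eigenvalues, and the assumption that X is already mean-centred are ours.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA whitening (Eq. 14.2): W_ZCA = E D^(-1/2) E^T = C^(-1/2).
    X is an n x d data matrix, assumed mean-centred."""
    C = X.T @ X / X.shape[0]                 # covariance matrix
    D, E = np.linalg.eigh(C)                 # eigenvalues D, eigenvectors E
    W = E @ np.diag(1.0 / np.sqrt(D + eps)) @ E.T
    return X @ W
```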
14.4 Experimental Results and Discussion

14.4.1 Dataset and Pre-processing

This research work uses the BHC dataset for detection and classification of hematoxylin and eosin (H&E) stained breast cancer histopathology images. The Keras data generator is used to load the data from the respective folders into Keras automatically; Keras provides convenient Python library functions for this purpose. The learning rate of the proposed model is set to 0.0001. On top of the network, a global average pooling layer followed by 50% dropout is used to reduce over-fitting bias. Adam is used as the optimizer and binary cross-entropy as the loss function. A sequential model along with a confusion matrix is used for the implementation of the classification branch of the proposed algorithm: it adds a convolutional layer of bin size 32 and kernel size 3 × 3, four units are pooled together from both axes, and then operations such as increasing density, flattening, and redundancy reduction are applied.
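The training configuration described above can be summarised in a short Keras sketch; the directory layout and image size are assumptions, and the layer stack is an illustrative reading of the description rather than the exact model.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Directory layout ("data/train/<class>/") and image size are assumptions
train_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "data/train", target_size=(224, 224), class_mode="binary", batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(224, 224, 3)),   # 32 filters, 3x3 kernel
    tf.keras.layers.MaxPooling2D(pool_size=(4, 4)),      # four units pooled per axis
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.5),                        # 50% dropout vs over-fitting
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_gen, epochs=10)
```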
14.4.2 Results Evaluation and Discussion

The results and discussion of the proposed model for the cancer detection and classification method are presented in this section. For validity, the results of the proposed architecture are compared with existing methodologies from the referenced literature, such as IRRCNN, DCNN, SVM, VGG-16, decision tree, etc. Table 14.1 lists the different models that were tested for the cancer classification branch of the proposed architecture and the results observed. The decision was made to implement DenseNet201 for the malignancy detection branch of the proposed architecture by weighing the size of the model against its top-5 accuracy, which weighed in best for the DenseNet201 model. The accuracy and loss plots of the malignancy detection branch of the proposed architecture are shown in Figs. 14.2 and 14.3, respectively. The receiver operating characteristic (ROC) plot of the proposed architecture is shown in Fig. 14.4. Invasive ductal carcinoma (IDC) is the most common form of breast cancer. Through this project, we implement a two-base classification for the preferred algorithm in order to broaden our automated analysis, i.e., IDC versus DCIS (IDC−). This particular method involves a confusion matrix in order to
Table 14.1 Models performance comparison

Algorithm    Parameter  Recall  F1     Precision  Accuracy
KNN          IDC(-)     0.65    0.70   0.74
             IDC(+)     0.78    0.74   0.70
             Avg.       0.72    0.72   0.72       71.71%
GaussianNB   IDC(+)     0.82    0.76   0.71
             Avg.       0.74    0.735  0.745      73.78%
KerasANN     IDC(-)     0.97    0.70   0.55
             IDC(+)     0.22    0.36   0.87
             Avg.       0.46    0.455  0.745      59.09%
DecTree      IDC(-)     0.63    0.72   0.82
             IDC(+)     0.86    0.78   0.71
             Avg.       0.745   0.75   0.765      72.17%
SVM          IDC(-)     0.71    0.76   0.85
             IDC(+)     0.88    0.80   0.74
             Avg.       0.795   0.78   0.795      77.65%
WSI-Net      IDC(-)     0.52    0.36   0.53
             IDC(+)     0.37    0.14   0.47
             Avg.       0.445   0.25   0.50       63.21%
Proposed     IDC(-)     0.71    0.76   0.85
             IDC(+)     0.88    0.80   0.74
             Avg.       0.795   0.78   0.795      80.43%

Fig. 14.2 Accuracy plot for the detection branch
Fig. 14.3 Loss plot for the detection branch
Fig. 14.4 ROC curve
understand the implications of error in the prediction of the type of cancer. The predicted IDC(+) and IDC(−) of the proposed model are shown in Fig. 14.5, and the confusion matrix of the proposed model is presented in Fig. 14.6. The comparison of the predicted results of the proposed model versus the actual results is shown in Fig. 14.7. The machine learning algorithms in Table 14.1 were brought in contrast with the proposed algorithm. The idea is to design an optimal algorithm in which bias towards either base is limited while achieving efficacy similar to the support vector machine (SVM). The proposed algorithm provides the best approach in these terms and can be used as an alternative to the existing SVM method for classification of cancer.
Fig. 14.5 Predicted IDC(+) and IDC(−)
Fig. 14.6 Proposed model confusion matrix
Fig. 14.7 Predicted versus actual
A CNN such as WSI-Net was brought in contrast with the proposed algorithm on the weighted parameters, and the results are listed in the conclusion.
14.5 Conclusion

The proposed model is broadly the combination of cancer detection and classification into IDC and non-IDC. Detecting breast cancer is based on the IRRCNN algorithm, with significant improvements in the number of epochs and layers of the convolution network in order to approach the desired results. It is then coalesced with the classification algorithm, which gives us a significant improvement over WSI-Net and other machine learning classifiers for classification. The accuracy observed for detection of breast cancer stands at 95.25%, and that for classification of IDC versus DCIS stands at 80.43%, which is better than WSI-Net.

Acknowledgements This research work was supported in part by the Science Engineering and Research Board, Department of Science and Technology, Govt. of India under Grant No. EEG/2018/000323, 2019.
References

Akay MF (2009) Support vector machines combined with feature selection for breast cancer diagnosis. Exp Syst Appl 36:3240–3247. https://doi.org/10.1016/j.eswa.2008.01.009
Alom M, Aspiras T, Taha MT, Asari K, Bowen V, Billiter D, Arkell S (2019) Advanced deep convolutional neural network approaches for digital pathology image analysis: a comprehensive evaluation with different use cases
Araújo T, Aresta G, Castro E, Rouco J, Aguiar P, Eloy C, Polónia A, Campilho A (2017) Classification of breast cancer histology images using convolutional neural networks. PLoS One 12(6). https://doi.org/10.1371/journal.pone.0177544
Bejnordi BE, Zuidhof G, Balkenhol M, Hermsen M, Bult P, van Ginneken B, Karssemeijer N, Litjens G, van der Laak J (2017) Context-aware stacked convolutional neural networks for classification of breast carcinomas in whole-slide histopathology images. J Med Imag (Bellingham, Wash) 4(4):044504. https://doi.org/10.1117/1.JMI.4.4.044504
Cruz-Roa A, et al (2014) In: Gurcan MN, Madabhushi A (eds) Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks, p 904103. https://doi.org/10.1117/12.2043872
Das A, Nair MS, Peter D (2020) Computer-aided histopathological image analysis techniques for automated nuclear atypia scoring of breast cancer
Feng Y, Zhang L, Mo J (2018) Deep manifold preserving autoencoder for classifying breast cancer histopathological images. IEEE/ACM Trans Comput Biol Bioinform 1. https://doi.org/10.1109/TCBB.2018.2858763
Guo Y, Shang X, Li Z (2019) Identification of cancer subtypes by integrating multiple types of transcriptomics data with deep learning in breast cancer. Neurocomputing 324:20–30. https://doi.org/10.1016/j.neucom.2018.03.072
Hameed Z, Zahia S, Garcia-Zapirain B, Javier Aguirre J, María Vanegas A (2020) Breast cancer histopathology image classification using an ensemble of deep learning models. Sensors 20:4373
Han Z, Wei B, Zheng Y, Yin Y, Li K, Li S (2017) Breast cancer multi-classification from histopathological images with structured deep learning model. Sci Rep 7:1–10. https://doi.org/10.1038/s41598-017-04075-z
Janowczyk A, Madabhushi A (2016) Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J Pathol Inform 7:29. PubMed https://doi.org/10.4103/2153-3539.186902
Jiang Y et al (2019) Breast cancer histopathological image classification using convolutional neural networks with small SE-ResNet module. PLoS One 14(3):e0214587. https://doi.org/10.1371/journal.pone.0214587
Karabatak M, Ince MC (2009) An expert system for detection of breast cancer based on association rules and neural network. Exp Syst Appl 36:346–3469. https://doi.org/10.1016/j.eswa.2008.02.064
Kassani SH, Kassani PH, Wesolowski M (2019) Classification of histopathological biopsy images using ensemble of deep learning networks. SIGGRAPH 4(32). https://doi.org/10.1145/3306307.3328180
Khan S, Islam N, Jan Z, Din IU, Rodrigues JJPC (2019) A novel deep learning based framework for the detection and classification of breast cancer using transfer learning. Pattern Recognit Lett 125:1–6. https://doi.org/10.1016/j.patrec.2019.03.022
Kingma D, Ba J (2014) Adam: a method for stochastic optimization. In: International conference on learning representations
Li Y, Wu J, Wu Q (2019) Classification of breast cancer histology images using multi-size and discriminative patches based on deep learning. IEEE Access 7:21400–21408. https://doi.org/10.1109/ACCESS.2019.2898044
Li S, Margolies LR, Rothstein JH, Eugene F, Russell MB, Weiva S (2019) Deep learning to improve breast cancer detection on screening mammography. Sci Rep 9:12495. https://doi.org/10.1038/s41598-019-48995-4
Liao Q, Ding Y, Jiang ZL, Wang X, Zhang C, Zhang Q (2018) Multi-task deep convolutional neural network for cancer diagnosis. Neurocomputing. https://doi.org/10.1016/j.neucom.2018.06.084
Lichtblau D, Stoean C (2019) Cancer diagnosis through a tandem of classifiers for digitized histopathological slides. PLoS One 14:1–20. https://doi.org/10.1371/journal.pone.0209274
Liu N, Qi E-S, Xu M, Gao B, Liu G-Q (2019) A novel intelligent classification model for breast cancer diagnosis. Inf Process Manag 56:609–623. https://doi.org/10.1016/j.ipm.2018.10.014
Mehra SR (2018) Breast cancer histology images classification: training from scratch or transfer learning? ICT Exp 4:247–254. https://doi.org/10.1016/j.icte.2018.10.007
Ni H, Liu H, Wang K, Wang X, Zhou X, Qian Y (2019) WSI-Net: branch-based and hierarchy-aware network for segmentation and classification of breast histopathological whole-slide images. In: International workshop on machine learning in medical imaging, pp 36–44
Patil A, Tamboli D, Meena S, Anand D, Sethi A (2019) Breast cancer histopathology image classification and localization using multiple instance learning. In: 2019 IEEE international WIE conference on electrical and computer engineering (WIECON-ECE), Bangalore, India, pp 1–4. https://doi.org/10.1109/WIECON-ECE48653.2019.9019916
Ragab DA, Sharkas M, Marshall S, Ren J (2019) Breast cancer detection using deep convolutional neural networks and support vector machines. Peer J 7:e6201
Redmon J (2016) You only look once: unified, real-time object detection. Retrieved from http://pjreddie.com/yolo/
Spanhol FA, Oliveira LS, Cavalin PR, Petitjean C, Heutte L (2017) Deep features for breast cancer histopathological image classification. In: 2017 IEEE international conference on systems, man, and cybernetics (SMC), Banff, AB, pp 1868–1873. https://doi.org/10.1109/SMC.2017.8122889
Sheikh TS, Lee Y, Cho M (2020) Histopathological classification of breast cancer images using a multi-scale input and multi-feature network. Cancers 12(8):2031. https://doi.org/10.3390/cancers12082031
Spanhol F, Oliveira LS, Petitjean C, Heutte L (2016) A dataset for breast cancer histopathological image classification. IEEE Trans Biomed Eng (TBME) 63(7):1455–1462
Ting F, Tan YJ, Sim KS (2019) Convolutional neural network improvement for breast cancer classification. Exp Syst Appl 120:103–115. https://doi.org/10.1016/j.eswa.2018.11.008
Toğaçar M, Ergen B, Cömert Z (2020) Application of breast cancer diagnosis based on a combination of convolutional neural networks, ridge regression and linear discriminant analysis using invasive breast cancer images processed with autoencoders. Med Hypotheses
Vo DM, Nguyen N-Q, Lee S-W (2019) Classification of breast cancer histology images using incremental boosting convolution networks. Inf Sci (Ny) 482:123–138. https://doi.org/10.1016/j.ins.2018.12.089
Veta MJ, Diest PJ (2013) Automatic nuclei segmentation in H&E stained breast cancer histopathology images. PLoS One 8(7)
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2018) The marginal value of adaptive gradient methods in machine learning, 2017. arXiv:1705.08292v2 [stat.ML] (22 May 2018)
Xie J, Liu R, Luttrell J, Zhang C (2019) Deep learning based analysis of histopathological images of breast cancer. Front Genet 10. https://doi.org/10.3389/fgene.2019.00080
Zhu C, Song F, Wang Y et al (2019) Breast cancer histopathology image classification through assembling multiple compact CNNs. BMC Med Inform Decis Mak 19:198. https://doi.org/10.1186/s12911-019-0913-x
Chapter 15
An Analysis of Use of Image Processing and Neural Networks for Window Crossing in an Autonomous Drone

L. Pedro de Brito, Wander M. Martins, Alexandre C. B. Ramos, and Tales C. Pimenta

Abstract Applications with autonomous robots are becoming more popular (Kyrkou et al. 2019), and neural networks and image processing are increasingly linked to control and decision making (Jarrell et al. 2012; Prescott et al. 2013). This study seeks a technique that allows drones or robots to fly more autonomously indoors. The work investigates the implementation of an autonomous control system for drones, capable of crossing windows during flights through enclosed places, through image processing (de Brito et al. 2019; de Jesus et al. 2019; Martins et al. 2018; Pinto et al. 2019) using a convolutional neural network. An object detection strategy was used; through the object's location in the captured image, it is possible to compute a programmable route for the drone. In this study, the location of the object was established by bounding boxes, which define the quadrilateral around the found object. The system is based on the use of an open-source autopilot, Pixhawk, which has a control and simulation environment capable of doing the job. Two detection techniques were studied. The first is based on image processing filters, which capture polygons that represent a passage inside a window. The other approach was studied for a more realistic environment and implemented with convolutional neural networks for object detection; with this type of network, it is possible to detect a large number of windows.
L. P. de Brito (B) · A. C. B. Ramos
Federal University of Itajuba, Institute of Mathematics and Computing, IMC. Av. BPS, 1303, Bairro Pinheirinho, MG, Caixa Postal 50, CEP: 37500-903, Itajubá, Brazil
e-mail: [email protected]
W. M. Martins · T. C. Pimenta
Institute of Systems Engineering and Information Technology, IESTI. Av. BPS, 1303, Bairro Pinheirinho, MG, Caixa Postal 50, CEP: 37500-903, Itajubá, Brazil
e-mail: [email protected]
T. C. Pimenta
e-mail: [email protected]
15.1 Introduction

The main objective of this study was to investigate the creation of a system capable of identifying a passage, such as a door or window, using a monocular camera and guiding a small drone through it. The system involves the construction of a small aircraft capable of capturing images that are processed by an external machine which, in turn, responds with the specific movement the aircraft must follow. Figure 15.1 shows an outline of the system structure, where arrows indicate the flow of information and the processes carried out. The detection algorithm works with the classification of pixels and the location of the object within an image. Through this obtained position, it is possible to perform an analysis with reference to the position of the camera, thus calculating a route for the aircraft to follow. Two detection approaches were studied in this work, the first based only on simple image processing techniques and the other based on convolutional neural networks, more specifically the SSD, single shot multibox detector (Falanga et al. 2018; Ilie and Gheorghe 2016; Liu et al. 2016).
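To make the route-calculation idea concrete, the sketch below turns a detected bounding box into normalised steering errors relative to the image centre; the sign convention and the proportional-control remark are our illustrative choices, not the chapter's exact controller.

```python
def window_offset(bbox, frame_w, frame_h):
    """Normalised offset of a detected window's centre from the image centre.

    bbox = (xmin, ymin, xmax, ymax) in pixels, e.g. from an SSD detector.
    Returns (dx, dy) in [-1, 1]; right/up positive is an assumed convention."""
    cx = (bbox[0] + bbox[2]) / 2.0
    cy = (bbox[1] + bbox[3]) / 2.0
    dx = (cx - frame_w / 2.0) / (frame_w / 2.0)   # lateral error
    dy = (frame_h / 2.0 - cy) / (frame_h / 2.0)   # vertical error
    return dx, dy

# A proportional controller can translate these errors into body-frame
# velocity set-points before commanding the drone forward through the gap.
```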
15.2 Materials and Methods

The work combined hardware and software to enable control of the aircraft. The hardware was chosen for this research because of its low cost compared to commercially available models, not to mention the fact that PixHawk has a large open-source development kit.
15.2.1 The Aircraft Hardware

A small Q250 racing quadcopter (25 cm) was built with enough hardware to complete the project, including a PixHawk 4 flight controller board (Meier et al. 2011; Pixhawk 2019), 2300 kv motors, and 12 A electronic speed controllers (ESCs). This controller board has a main flight management unit (FMU) processor, an input/output (I/O) processor, and accelerometer, magnetometer, and barometer sensors.
Fig. 15.1 The implemented system (The author)
Fig. 15.2 Drone Q250 with PixHawk (The author)
Figure 15.2 shows the drone used in this research. In the implemented system, the data processing is not embedded, so a wireless connection was necessary for the transmission of commands between the ground base machine and the drone, along with a data connection for transmitting the images captured by the aircraft. A separate link was used for each: a 915 MHz telemetry link connects the drone to the ground station that processes data and sends commands to the flight controller, while images captured by the first person view (FPV) high definition (HD) camera are transmitted by a 5.8 GHz transmitter/receiver pair.
15.2.2 The Ground Control Station

A ground control station (GCS) is a kind of software executed on a ground platform that performs the monitoring and configuration of the drone, such as sensor calibration settings and the configuration of general purpose boards. It supports different types of vehicle models, like the PixHawk, which needs a configuration of its firmware before use. In this work, the QGroundControl ground station software was used. This software allows you to check the status of the drone and to program, in a simple way, missions with global positioning system (GPS) and map support. It is suitable for the PixHawk configuration.
Fig. 15.3 QGroundControl (GCS)
Figure 15.3 shows the interface of the GCS used (Damilano et al. 2013; Planner 2019a, b; QGROUNDCONTROL 2019; Ramirez-Atencia and Camacho 2018).
15.2.3 MAVSDK Drone Control Framework

The MAVSDK drone control framework was used in this implementation. It is a library that can perform simple stable motion functions, such as takeoff and landing, and control the speed of the airframe on its axes. It uses a coordinate system to command the aircraft, communicating with vehicles that support MAVLink, a communication protocol for drones and their internal components. MAVSDK is a software development kit (SDK) made for PixHawk-based vehicles of various types. The framework was originally implemented in C++, but this work used its Python bindings. The written code runs on a ground machine and sends commands through the MAVLink protocol (DRONEKIT 2019; French and Ranganathan 2017; MAVROS 2019; MAVSDK 2019).
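A minimal MAVSDK-Python sketch of the kind of command sequence used here is shown below; the UDP address matches the PX4 SITL default and is an assumption, as are the velocity and timing values.

```python
import asyncio
from mavsdk import System
from mavsdk.offboard import VelocityBodyYawspeed

async def run():
    drone = System()
    await drone.connect(system_address="udp://:14540")  # PX4 SITL default port
    await drone.action.arm()
    await drone.action.takeoff()
    await asyncio.sleep(5)
    # Offboard mode requires an initial set-point before starting
    await drone.offboard.set_velocity_body(VelocityBodyYawspeed(0.0, 0.0, 0.0, 0.0))
    await drone.offboard.start()
    # Fly forward at 1 m/s for 3 s, e.g. towards a detected window
    await drone.offboard.set_velocity_body(VelocityBodyYawspeed(1.0, 0.0, 0.0, 0.0))
    await asyncio.sleep(3)
    await drone.offboard.stop()
    await drone.action.land()

asyncio.run(run())
```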
15.2.4 Simulation This work used the Gazebo simulator with its PX4 simulator implementation, which brings various vehicle models with PixHawk-specific hardware and firmware simulation.
15.2.4.1 Gazebo
Gazebo provides realistic simulation with complex scenarios and robust environment physics, including several sensors, to approximate a true real-world implementation. Gazebo also supports multiple robots, which makes it possible to test and train AI and image processing code with remarkable ease and agility. Gazebo can create scenarios with various elements, such as houses, hospitals, cars, and people. With such a scenario, it is possible to evaluate the quality of the code and tune its parameters before a test in the real environment (de Waard et al. 2013; GAZEBOSIM 2019; Koenig and Howard 2004).
15.2.4.2 PX4 Simulator
The PX4 simulator is a simulated model of the PixHawk within the Gazebo environment, reproducing all the main autopilot features on several aircraft models, land vehicles, and other platforms. This implementation makes it possible to create and test code within the simulated environment and transfer it faithfully to the real one. An important aspect of this simulation environment is the compatibility among PixHawk, Gazebo, the PX4 simulator, and MAVSDK.
15.2.4.3 The Simulated Drone
The Iris model is the PX4 simulated drone with the greatest fidelity to the real Q250 model implemented (already presented previously). Both are based on the PixHawk, which means that the autopilot and the simulated Iris firmware are compatible with the real Q250. The code can connect to and command either aircraft, and it is the same for both. Figure 15.4 shows this simulated model (Garcia and Molina 2020; PX4SIM 2019).
15.2.5 Object Detection Methods This work used the TensorFlow framework, which has a large number of existing implementations available for adaptation (Bahrampour et al. 2015; Kovalev et al. 2016).
15.2.5.1 TensorFlow Framework
TensorFlow was created by Google and provides the Keras API to facilitate the implementation of high-performance algorithms, especially for large servers. It supports graphics processing units (GPUs) in addition to the central processing unit (CPU)
Fig. 15.4 The simulated drone Iris
only. This tool is considered heavy compared to others on the market; however, it is very powerful because it provides a large number of features, tools, and implementations. TensorFlow has a GitHub repository where its main code and other useful tools, such as deployed models, TensorBoard, and Project Magenta, are available. As a portable library, it is available in several languages, such as Python, C++, Java, and Go, as well as through other community-developed extensions (BAIR 2019; GOOGLE 2019; Sergeev and Balso 2018; Unruh 2019).
15.2.5.2 Neural Network Technique
This work used convolutional neural networks (CNNs) and machine learning (ML) for object detection (Cios et al. 2012; Kurt et al. 2008). The classification method was combined with the calculation of the location of the object; this approach is called object detection. A CNN is a variation of the multilayer perceptron network (Vargas et al. 2016). A perceptron is simply a neuron model capable of storing and organizing information as in the brain (Rosenblatt 1958). The idea is to divide complex tasks into several smaller and simpler tasks that, in turn, act on different characteristics of the same problem and eventually return the desired answer. Figure 15.5 illustrates this structure (He et al. 2015; Szarvas et al. 2005; Vora et al. 2015). The CNN applies filters to visual data to extract or highlight important features while maintaining the neighborhood relationship, just as a convolution matrix does in graphical processing, hence the origin of the name for this type of network (Krizhevsky et al. 2012). When a convolution layer is applied over an image, it multiplies and adds
Fig. 15.5 A neuron and its activation function
Fig. 15.6 CNN Kernel
the values of each pixel with the values of a convolution filter, or mask. After calculating one area following a defined pattern, the filter moves to another region of the image until the whole image has been covered (Jeong 2019). Figure 15.6 illustrates the structure of a CNN (Vargas et al. 2016). The single shot multibox detector (SSD) neural network (Bodapati and Veeranjaneyulu 2019; Huang et al. 2017; Yadav and Binay 2017), a convolutional neural network for real-time object detection (Cai et al. 2016; Dalmia 2019; Hui 2019; Liu et al. 2016; Moray 2019; Ning et al. 2017; Tindall et al. 2015; Xia et al. 2017), was used because it is considered the state-of-the-art in accuracy (Liu et al. 2016).
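To make the sliding multiply-and-add operation concrete, the following minimal NumPy sketch implements it directly; the toy image and kernel values are illustrative only. (CNN layers compute exactly this cross-correlation form, without flipping the kernel.)

import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; at each position, multiply the
    overlapping values element-wise and sum them ("valid" padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Toy 5x5 "image" and a 3x3 edge-highlighting kernel
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)
print(convolve2d(image, kernel))  # 3x3 feature map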
15.2.5.3 Image Processing Technique
A detection algorithm was implemented based on pure image processing (Bottou 2012; Ding et al. 2016; Huang et al. 2017; Hussain et al. 2017; Liu et al. 2016; Pandey 2019; Redmon and Farhadi 2017; Ren et al. 2015), using the OpenCV graphics library (Bradski and Kaehler 2008; Countours 2019; Marengoni and Stringhini 2009), which offers several image processing functions. First, a Gaussian filter, a convolution mask-based algorithm, is applied to smooth the edges before polygons are detected (Deng and Cahill 1993). A threshold filter is then adopted (Ito and Xiong 2000; Kumar 2019; Simple Thresholding 2019), which divides a set of data into two groups around a limit. In the case of colors, this occurs by separating at a color
tone, where all pixels darker than the limit go to one group and the lighter ones go to the other. To find the edges, the Canny filter (Accame and Natale 1997; OPENCV 2019) is applied, which walks over the image pixels with a gradient vector that calculates the direction and intensity of the color changes (Boze 1995; Hoover et al. 2000; Simple Thresholding 2019). OpenCV's findContours() function was used to detect polygons after proper treatment of the image with the mentioned filters.
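A hedged sketch of this pipeline with OpenCV follows; the Canny thresholds and minimum contour area are illustrative assumptions, not the chapter's parameters, and the color-threshold step is omitted for brevity.

import cv2

def find_quadrilaterals(bgr_image, min_area=1000):
    """Return 4-sided contours that could correspond to a window/passage."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)      # smooth the edges
    edges = cv2.Canny(blurred, 50, 150)              # gradient-based edge map
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    quads = []
    for contour in contours:
        # Approximate the contour by a polygon and keep four-sided ones
        perimeter = cv2.arcLength(contour, True)
        approx = cv2.approxPolyDP(contour, 0.02 * perimeter, True)
        if len(approx) == 4 and cv2.contourArea(approx) > min_area:
            quads.append(approx.reshape(4, 2))
    return quads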
15.3 Implemented Model Figure 15.7 shows a flowchart of the overall system architecture, indicating its processes. The system consists of three interconnected machines, with the ground control station (GCS) computing most of the implemented code. The drone receives speed commands to move and captures images through a camera, with a receiver and transmitter pair for data transfer. This is the general architecture of both the simulated and the real system. To start the system, the three machines must be started and begin communicating. In the flowchart, the red arrows represent internal executions within the same machine, the blue arrows indicate the transfer of data between one machine and another through a communication protocol, and the orange arrows indicate the creation of a multiprocessing chain required by the implementation. The machine running the main code is the GCS, which runs three main threads: one for image capture, another to shut down the software, and the main one, which manages all the processing and calculations. The captured images are transferred to the detection algorithm, which calculates the bounding box that best represents the desired passage. A speed is calculated according to the detection and transferred to the drone. When the drone loses a detection or finds none, the code slows the aircraft down until, if no further detection occurs, the drone stops. When the drone receives a speed to be set on one of its axes, it holds that speed until another speed is received or some other command, such as landing for safety, is executed.
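As a minimal sketch of this multithreaded GCS structure, the following shows one way to decouple the capture thread from the main loop; the one-slot queue, the shutdown event, and the camera source are assumptions, not the chapter's code.

import queue
import threading

import cv2  # used here only to read frames from the FPV receiver

frames = queue.Queue(maxsize=1)   # keep only the freshest frame
stop_event = threading.Event()    # set by the shutdown thread

def capture_loop(device_index=0):
    """Capture thread: continuously read frames, dropping stale ones."""
    cam = cv2.VideoCapture(device_index)
    while not stop_event.is_set():
        ok, frame = cam.read()
        if not ok:
            continue
        if frames.full():
            try:
                frames.get_nowait()  # discard the old frame
            except queue.Empty:
                pass
        frames.put(frame)
    cam.release()

threading.Thread(target=capture_loop, daemon=True).start()
# Main thread: pull frames, run detection, compute and send speeds
while not stop_event.is_set():
    frame = frames.get()
    # detection and speed calculation would run here (see Sect. 15.3.1)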
15.3.1 Object Detection by Neural Network The CNN used applies filters to highlight the desired object, together with a classifier and a bounding box estimator to indicate the location of the object in the image. The feature extractor used was the graph known as MobileNet version 2, and the classifier comes from the CNN features. The CNN was trained with a set of images of windows and other objects with their respective bounding box coordinates. This set of images was obtained from Google Open Images version 4 and contains about 60,000 window images (GOOGLE 2019).
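The chapter does not list its inference code; as a hedged sketch, a detector trained with the TensorFlow Object Detection API and exported as a SavedModel could be queried as follows (the model path and output keys follow that API's conventions and are assumptions here).

import numpy as np
import tensorflow as tf

# Assumed path to an exported SSD MobileNet v2 SavedModel
detect_fn = tf.saved_model.load("exported_model/saved_model")

def detect_passage(frame_rgb, score_threshold=0.5):
    # The Object Detection API expects a uint8 batch of shape [1, H, W, 3]
    input_tensor = tf.convert_to_tensor(frame_rgb[np.newaxis, ...],
                                        dtype=tf.uint8)
    detections = detect_fn(input_tensor)
    # Boxes come back normalized as [ymin, xmin, ymax, xmax]
    boxes = detections["detection_boxes"][0].numpy()
    scores = detections["detection_scores"][0].numpy()
    keep = scores >= score_threshold
    return boxes[keep], scores[keep]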
Fig. 15.7 General system flowchart (author)
RGB images are converted to grayscale and smoothed with the Gaussian filter. The Canny filter is then applied to isolate the edges of the objects. The code looks for lines that form a four-sided polygon to identify the passage within a window. The center of this polygon is identified, and its side and diagonal measurements are calculated. An important detail is that the distances between points are calculated geometrically as quadrilateral measurements. These calculated distances are values in pixel units, that is, the number of pixels between one point and another, and that number varies with the resolution of the camera used. To correct this, the pixel quantities were converted into percentage values, so every measure is defined as a percentage of the maximum it could assume, the maximum usually being the height, width, or diagonal of the image. For example, when measuring the height of the bounding box, it is divided by the height of the image to find its occupation percentage.
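The percentage conversion can be written, for instance, as below; the function and variable names are illustrative, not taken from the chapter.

import math

def normalized_measures(box, image_w, image_h):
    """Convert pixel measures of a bounding box (xmin, ymin, xmax, ymax)
    into resolution-independent fractions of the image dimensions."""
    xmin, ymin, xmax, ymax = box
    width_frac = (xmax - xmin) / image_w
    height_frac = (ymax - ymin) / image_h
    diag_frac = math.hypot(xmax - xmin, ymax - ymin) / math.hypot(image_w, image_h)
    return width_frac, height_frac, diag_frac

# Example: a 480x360 box in a 1280x720 frame occupies 50% of the height
print(normalized_measures((100, 100, 580, 460), 1280, 720))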
Fig. 15.8 Application of object segmentation in the real world (author)
The image's Cartesian plane is best described using the y and z axes, because the image then follows the same pattern as the drone's movements. Detecting a passage was relatively easy in the simulated experiment. However, in a real experiment with many polygons, many unwanted detections were generated. To solve this, a segmentation network could be used, which has the capacity to capture the total area of the sought object; Fig. 15.8 shows an example of this type of network (He et al. 2017). Another challenge is that the current algorithm does not capture the slope of the object, which would be needed to align the aircraft with the found window; here too, a segmentation network capable of capturing the total area of the searched object could be used.
15.3.2 Drone Control The drone control algorithm uses three functions to position the drone in front of the window and perform an approximately linear crossing: Approximate, Centralize, and Align. The algorithm defines the speed of the drone on the x, y, and z axes, as shown in Fig. 15.9.
15.3.2.1 Approximate Function
In the Approximate function, the inputs are the diagonals of the image and of the bounding box, while the output is a speed on the x axis of the drone. This function
Fig. 15.9 The speed axes of the drone (author)
captures the detected bounding box and checks its size in relation to the image to estimate the relative distance of the object. The algorithm estimates, as a percentage, the portion of the image that the object occupies. The mathematical function in Eq. 15.1 was used to model the characteristic of this movement.
f(p) = k · (1/p²)    (15.1)
It is the inverse of the square of the calculated diagonal size: the smaller the diagonal, the farther away the object is and the faster the movement speed. A quadratic function was used for greater gain. In this function, p represents the input measure, and k is a constant that controls the output value according to factors such as distance and execution state, giving a greater or lesser gain depending on the case. The behavior of this function is shown in Fig. 15.10; only the positive domain was used for the problem. As the size of the object tends to zero, the speed tends to infinity, and as the size tends to infinity, the speed tends to zero. Due to the high values this function can reach, only part of it is used, over an interval defined in code that respects the system conditions. This interval is p in [0.1, 0.7], that is, detections whose diagonals occupy 10–70% of the image.
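A minimal sketch of this function follows; the gain k = 0.05 is an illustrative value, since the chapter does not give its constant.

def approximate_speed(diag_frac, k=0.05):
    """Forward (x-axis) speed from the inverse-square law of Eq. 15.1.
    Only diagonals occupying 10-70% of the image are accepted."""
    if not 0.1 <= diag_frac <= 0.7:
        return 0.0  # outside the valid interval: do not advance
    return k / (diag_frac ** 2)

print(approximate_speed(0.2))  # far away -> faster (1.25 with k=0.05)
print(approximate_speed(0.6))  # close -> slower (~0.14)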
15.3.2.2 Centralize Function
The Centralize function positions the drone at the center of the opening of the identified window. It uses the distance on the y and z axes between the center of the image and the center of the bounding box to set the speeds on the y and z axes of the drone, thereby centering the aircraft (Fig. 15.11).
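A sketch of the centering computation, assuming normalized offsets and an illustrative proportional gain (the chapter gives neither):

def centralize_speeds(box_center, image_w, image_h, gain=1.0):
    """y/z speeds proportional to the normalized offset between the
    image center and the bounding-box center."""
    cx, cy = box_center
    offset_y = (cx - image_w / 2) / image_w  # lateral offset, fraction of width
    offset_z = (cy - image_h / 2) / image_h  # vertical offset, fraction of height
    # Window right of center -> move right; window low in the image -> descend
    return gain * offset_y, gain * offset_z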
Fig. 15.10 Inverse square function behavior (author)
Fig. 15.11 Measures of bounding box and image to perform centralization (author)
15.3.2.3 Align Function
Figure 15.12 shows a picture of a side view with a distorted bounding box, where the right side is smaller than the left, because the drone's view is misaligned with respect to the window. The Align function sets the speed of the drone's angular axis to produce a yaw that aligns the aircraft with the window.
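One way the yaw rate could be derived from the side-height difference is sketched below; the sign convention and gain are assumptions, not the chapter's values.

def align_yawspeed(left_side_h, right_side_h, gain=30.0):
    """Yaw rate (deg/s) from the height difference between the left and
    right sides of the detected quadrilateral; a smaller right side means
    the window is angled away on that side."""
    diff = (left_side_h - right_side_h) / max(left_side_h, right_side_h)
    return gain * diff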
Fig. 15.12 Measures of bounding box and image to perform alignment (author)
15.3.2.4 Execution State
When the algorithm obtains the detection of a passage, the movement functions themselves indicate the current execution state of the drone. Five states are performed, the last being the crossing of the passage. Algorithm 1 is a pseudocode of the system's state control. Algorithm 1: State Control
if currentState == state then begin ret1
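Since the pseudocode excerpt above breaks off, the following Python sketch illustrates only one plausible shape for a five-state controller built around the three movement functions; the state names, the transition conditions, and the detection attributes are all assumptions, as the chapter does not name its five states.

from enum import Enum, auto

class State(Enum):
    # Assumed labels: search, then the three movement phases, then crossing
    SEARCH = auto()
    APPROXIMATE = auto()
    CENTRALIZE = auto()
    ALIGN = auto()
    CROSS = auto()

def next_state(state, detection):
    # No detection: the code slows the drone down and resumes searching
    if detection is None:
        return State.SEARCH
    if state == State.SEARCH:
        return State.APPROXIMATE
    if state == State.APPROXIMATE and detection.diag_frac >= 0.7:
        return State.CENTRALIZE   # close enough to start centering
    if state == State.CENTRALIZE and detection.is_centered:
        return State.ALIGN
    if state == State.ALIGN and detection.is_aligned:
        return State.CROSS        # final state: cross the passage
    return state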