Satish Kumar Singh Partha Roy Balasubramanian Raman P. Nagabhushan (Eds.)
Communications in Computer and Information Science
1377
Computer Vision and Image Processing 5th International Conference, CVIP 2020 Prayagraj, India, December 4–6, 2020 Revised Selected Papers, Part II
Communications in Computer and Information Science Editorial Board Members Joaquim Filipe Polytechnic Institute of Setúbal, Setúbal, Portugal Ashish Ghosh Indian Statistical Institute, Kolkata, India Raquel Oliveira Prates Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil Lizhu Zhou Tsinghua University, Beijing, China
More information about this series at http://www.springer.com/series/7899
Editors
Satish Kumar Singh, Indian Institute of Information Technology Allahabad, Prayagraj, India
Partha Roy, Indian Institute of Technology Roorkee, Roorkee, India
Balasubramanian Raman, Indian Institute of Technology Roorkee, Roorkee, India
P. Nagabhushan, Indian Institute of Information Technology Allahabad, Prayagraj, India
ISSN 1865-0929 (print), ISSN 1865-0937 (electronic)
Communications in Computer and Information Science
ISBN 978-981-16-1091-2 (print), ISBN 978-981-16-1092-9 (eBook)
https://doi.org/10.1007/978-981-16-1092-9
© Springer Nature Singapore Pte Ltd. 2021
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
The 5th IAPR International Conference on Computer Vision and Image Processing (CVIP 2020) focused on image and video processing and computer vision. This year CVIP 2020 was held at the Indian Institute of Information Technology Allahabad, Prayagraj, India. We received submissions on topics such as biometrics, forensics, content protection, image enhancement/super-resolution/restoration, motion and tracking, image and video retrieval, image/video processing for autonomous vehicles, video scene understanding, human-computer interaction, document image analysis, face, iris, emotion, sign language and gesture recognition, 3D image/video processing, action and event detection/recognition, medical image and video analysis, vision-based human gait analysis, remote sensing, multispectral/hyperspectral image processing, segmentation and shape representation, image/video security, visual sensor hardware, compressed image/video analytics, document and synthetic visual processing, and datasets and evaluation.
CVIP is now one of the flagship conferences in the field of Computer Science and Information Technology. CVIP 2020 received 352 submissions from all over the world, from countries including Poland, the United Kingdom, the United States, Norway, Sweden, Russia, Germany, China, and many others. All submissions were rigorously peer reviewed, and the Program Committee finally selected 134 high-quality papers for presentation at CVIP 2020 and inclusion in this volume of the Computer Vision and Image Processing (CVIP) proceedings published by Springer Nature.
The conference advisory committee, technical program committee, and faculty members of the Indian Institute of Information Technology Allahabad, Prayagraj, India made a significant effort to guarantee the success of the conference. We would like to thank all members of the program committee and the referees for their commitment to the review process and for spreading our call for papers. We would like to thank Ms. Kamya Khatter from Springer Nature for her helpful advice, guidance, and continuous support in publishing the proceedings. Moreover, we would like to thank all the authors for supporting CVIP 2020; without their high-quality submissions the conference would not have been possible.
December 2020
Satish Kumar Singh
Organization
Patron: Bidyut Baran Chaudhuri (ISI Kolkata, India)
General Chair: P. Nagabhushan (IIIT Allahabad, India)
General Co-chairs: Balasubramanian Raman (IIT Roorkee, India), Shekhar Verma (IIIT Allahabad, India)
Conference Chairs: Partha Pratim Roy (IIT Roorkee, India), Sanjeev Kumar (IIT Roorkee, India), Satish K. Singh (IIIT Allahabad, India), Vrijendra Singh (IIIT Allahabad, India)
Local Organizing Committee: Shirshu Varma (IIIT Allahabad, India)
Conference Conveners: K. P. Singh (IIIT Allahabad, India), Mohammed Javed (IIIT Allahabad, India), Pritee Khanna (IIITDMJ, India), Shiv Ram Dubey (IIIT Sri City, India)
Publicity Chairs: Subrahmanyam Murala (IIT Ropar, India), Shiv Ram Dubey (IIIT Sri City, India), Ashwini K. (GAT Bangalore, India)
International Advisory and Programme Committee:
Ajita Rattani, Wichita State University, USA
Alireza Alaei, Southern Cross University, Australia
Ankit Chaudhary, The University of Missouri – St. Louis, USA
Ashish Khare, University of Allahabad, India
B. H. Shekhar, Mangalore University, India
Bunil Kumar Balabantaray, NIT Meghalaya, India
Debashis Sen, IIT Kharagpur, India
Emanuela Marasco, George Mason University, USA
Gaurav Gupta, Wenzhou-Kean University, China
Guoqiang Zhong, Ocean University of China, China
J. V. Thomas (Associate Director), STA, ISRO Bangalore, India
Juan Tapia Farias, Universidad de Chile, Chile
Kiran Raja, NTNU, Norway
M. Tanveer, IIT Indore, India
Munesh C. Trivedi, NIT Agartala, India
P. V. Venkitakrishnan (Director CBPO), ISRO Bangalore, India
Prabhu Natarajan, DigiPen Institute of Technology Singapore, Singapore
Pradeep Kumar, Amphisoft, India
Puneet Gupta, IIT Indore, India
Rajeev Jaiswal, EDPO, ISRO HQ (Bangalore), India
Sahana Gowda, BNMIT, Bengaluru, India
Sebastiano Battiato, Università di Catania, Italy
Sharad Sinha, IIT Goa, India
Somnath Dey, IIT Indore, India
Sule Yildirim Yayilgan, Norwegian University of Science and Technology (NTNU), Norway
Surya Prakash, IIT Indore, India
Thinagaran Perumal, Universiti Putra Malaysia, Malaysia
Watanabe Osamu, Takushoku University, Japan
Mohan S. Kankanhalli, National University of Singapore, Singapore
Ananda Shankar Chowdhury, Jadavpur University, India
Anupam Agrawal, IIIT Allahabad, India
Aparajita Ojha, IIITDM Jabalpur, India
B. M. Mehtre, IDRBT Hyderabad, India
B. N. Chatterji, IIT Kharagpur (Past Affiliation), India
Bir Bhanu, University of California, Riverside, USA
Chirag N. Paunwala, SCET, Surat, India
D. S. Guru, University of Mysore, India
Daniel P. Lopresti, Lehigh University, USA
G. C. Nandi, IIIT Allahabad, India
Gaurav Sharma, University of Rochester, USA
Gian Luca Foresti, University of Udine, Italy
Jharna Majumdar, Nitte Meenakshi Institute of Technology, India
Jonathan Wu, University of Windsor, Canada
Josep Lladós, Universitat Autònoma de Barcelona, Spain
K. C. Gowda (Former VC), Kuvempu University, India
K. R. Ramakrishnan, IISc Bangalore, India
Manoj K. Arora, BML Munjal University, India
Massimo Tistarelli, University of Sassari, Italy
Michal Haindl, Czech Academy of Sciences, Czech Republic
N. V. Subba Reddy, MIT Manipal, India
O. P. Vyas, IIIT Allahabad, India
Paula Brito, University of Porto, Portugal
Rajeev Srivastava, IIT BHU, India
Ramakrishnan Ganesan Angarai, IISc Bangalore, India
S. N. Singh, IIT Kanpur, India
Sanjay Kumar Singh, IIT BHU, India
Sudeep Sarkar, University of South Florida, USA
Suman Mitra, DA-IICT Gandhinagar, India
Suneeta Agarwal, MNNIT Allahabad, India
Susmita Ghosh, Jadavpur University, India
U. S. Tiwari, IIIT Allahabad, India
Umapada Pal, ISI Kolkata, India
Wei-Ta Chu, National Chung Cheng University, Taiwan
Xiaoyi Jiang, University of Münster, Germany
Sushmita Mitra, ISI Kolkata, India
Contents – Part II
A Comparative Analysis on AI Techniques for Grape Leaf Disease Recognition (Swetha Pai and Manoj V. Thomas) . . . 1
Sign Language Recognition Using Cluster and Chunk-Based Feature Extraction and Symbolic Representation (H. S. Nagendraswamy and Syroos Zaboli) . . . 13
Action Recognition in Haze Using an Efficient Fusion of Spatial and Temporal Features (Sri Girinadh Tanneru and Snehasis Mukherjee) . . . 29
Human Action Recognition from 3D Landmark Points of the Performer (Snehasis Mukherjee and Chirumamilla Nagalakshmi) . . . 39
A Combined Wavelet and Variational Mode Decomposition Approach for Denoising Texture Images (R. Gokul, A. Nirmal, G. Dinesh Kumar, S. Karthic, and T. Palanisamy) . . . 50
Two-Image Approach to Reflection Removal with Deep Learning (Rashmi Chaurasiya and Dinesh Ganotra) . . . 63
Visual Question Answering Using Deep Learning: A Survey and Performance Analysis (Yash Srivastava, Vaishnav Murali, Shiv Ram Dubey, and Snehasis Mukherjee) . . . 75
Image Aesthetic Assessment: A Deep Learning Approach Using Class Activation Map (Shyam Sherashiya, Gitam Shikkenawis, and Suman K. Mitra) . . . 87
RingFIR: A Large Volume Earring Dataset for Fashion Image Retrieval (Sk Maidul Islam, Subhankar Joardar, and Arif Ahmed Sekh) . . . 100
Feature Selection and Feature Manifold for Age Estimation (Shivani Kshatriya, Manisha Sawant, and K. M. Bhurchandi) . . . 112
Degraded Document Image Binarization Using Active Contour Model (Deepika Gupta and Soumen Bag) . . . 124
Accelerated Stereo Vision Using Nvidia Jetson and Intel AVX (Imran A. Syed, Mandar Datar, and Sachin Patkar) . . . 137
A Novel Machine Annotated Balanced Bangla OCR Corpus (Md Jamiur Rahman Rifat, Mridul Banik, Nazmul Hasan, Jebun Nahar, and Fuad Rahman) . . . 149
Generative Adversarial Network for Heritage Image Super Resolution (Rajashree Nayak and Bunil Ku. Balabantaray) . . . 161
Deep Learning Based Image Enhancement and Black Box Filter Parameter Estimation (Mrigakshi Sharma, Rishabh Mittar, and Prasenjit Chakraborty) . . . 174
Sign Gesture Recognition from Raw Skeleton Information in 3D Using Deep Learning (Sumit Rakesh, Saleha Javed, Rajkumar Saini, and Marcus Liwicki) . . . 184
Dual Gradient Feature Pair Based Face Recognition for Aging and Pose Changes (V. Betcy Thanga Shoba and I. Shatheesh Sam) . . . 196
Dynamic User Interface Composition (Rahul Kumar, Shankar Natarajan, Mohamed Akram Ulla Shariff, and Parameswaranath Vaduckupurath Mani) . . . 208
Lightweight Photo-Realistic Style Transfer for Mobile Devices (Mrinmoy Sen, Mineni Niswanth Babu, Rishabh Mittar, Divyanshu Gupta, and Prasenjit Chakraborty) . . . 221
Cricket Stroke Recognition Using Hard and Soft Assignment Based Bag of Visual Words (Arpan Gupta, Ashish Karel, and Sakthi Balan Muthiah) . . . 231
Multi-lingual Indian Text Detector for Mobile Devices (Veronica Naosekpam, Naukesh Kumar, and Nilkanta Sahu) . . . 243
Facial Occlusion Detection and Reconstruction Using GAN (Diksha Khas, Sumit Kumar, and Satish Kumar Singh) . . . 255
Ayurvedic Medicinal Plants Identification: A Comparative Study on Feature Extraction Methods (R. Ahila Priyadharshini, S. Arivazhagan, and M. Arun) . . . 268
Domain Knowledge Embedding Based Multimodal Intent Analysis in Artificial Intelligence Camera (Dinesh Viswanadhuni, Mervin L. Dalmet, M. Raghavendra Kalose, Siddhartha Mukherjee, and K. N. Ravi Kiran) . . . 281
Age and Gender Prediction Using Deep CNNs and Transfer Learning (Vikas Sheoran, Shreyansh Joshi, and Tanisha R. Bhayani) . . . 293
Text Line Segmentation: A FCN Based Approach (Annie Minj, Arpan Garai, and Sekhar Mandal) . . . 305
Precise Recognition of Vision Based Multi-hand Signs Using Deep Single Stage Convolutional Neural Network (S. Rubin Bose and V. Sathiesh Kumar) . . . 317
Human Gait Abnormality Detection Using Low Cost Sensor Technology (Shaili Jain and Anup Nandy) . . . 330
Bengali Place Name Recognition - Comparative Analysis Using Different CNN Architectures (Prashant Kumar Prasad, Pamela Banerjee, Sukalpa Chanda, and Umapada Pal) . . . 341
Face Verification Using Single Sample in Adolescence (R. Sumithra, D. S. Guru, V. N. Manjunath Aradhya, and Anitha Raghavendra) . . . 354
Evaluation of Deep Learning Networks for Keratoconus Detection Using Corneal Topographic Images (Savita R. Gandhi, Jigna Satani, Karan Bhuva, and Parth Patadiya) . . . 367
Deep Facial Emotion Recognition System Under Facial Mask Occlusion (Suchitra Saxena, Shikha Tripathi, and T. S. B. Sudarshan) . . . 381
Domain Adaptation Based Technique for Image Emotion Recognition Using Image Captions (Puneet Kumar and Balasubramanian Raman) . . . 394
Gesture Recognition in Sign Language Videos by Tracking the Position and Medial Representation of the Hand Shapes (Syroos Zaboli, Sergey Serov, Leonid Mestetskiy, and H. S. Nagendraswamy) . . . 407
DeepDoT: Deep Framework for Detection of Tables in Document Images (Mandhatya Singh and Puneet Goyal) . . . 421
Correcting Low Illumination Images Using PSO-Based Gamma Correction and Image Classifying Method (Swadhin Das, Manali Roy, and Susanta Mukhopadhyay) . . . 433
DeblurRL: Image Deblurring with Deep Reinforcement Learning (Jai Singhal and Pratik Narang) . . . 445
FGrade: A Large Volume Dataset for Grading Tomato Freshness Quality (Sikha Das, Samarjit Kar, and Arif Ahmed Sekh) . . . 455
Enhancement of Region of Interest from a Single Backlit Image with Multiple Features (Gaurav Yadav, Dilip Kumar Yadav, and P. V. S. S. R. Chandra Mouli) . . . 467
Real-Time Sign Language Interpreter on Embedded Platform (Himansh Mulchandani and Chirag Paunwala) . . . 477
Complex Gradient Function Based Descriptor for Iris Biometrics and Action Recognition (B. H. Shekar, P. Rathnakara Shetty, and Sharada S. Bhat) . . . 489
On-Device Language Identification of Text in Images Using Diacritic Characters (Shubham Vatsal, Nikhil Arora, Gopi Ramena, Sukumar Moharana, Dhruval Jain, Naresh Purre, and Rachit S. Munjal) . . . 502
A Pre-processing Assisted Neural Network for Dynamic Bad Pixel Detection in Bayer Images (Girish Kalyanasundaram, Puneet Pandey, and Manjit Hota) . . . 513
Face Recognition Using Sf3CNN with Higher Feature Discrimination (Nayaneesh Kumar Mishra and Satish Kumar Singh) . . . 524
Recognition of Online Handwritten Bangla and Devanagari Basic Characters: A Transfer Learning Approach (Rajatsubhra Chakraborty, Soumyajit Saha, Ankan Bhattacharyya, Shibaprasad Sen, Ram Sarkar, and Kaushik Roy) . . . 530
Image Solution of Stochastic Differential Equation of Diffusion Type Driven by Brownian Motion (Vikas Kumar Pandey, Himanshu Agarwal, and Amrish Kumar Aggarwal) . . . 542
Author Index . . . 555
A Comparative Analysis on AI Techniques for Grape Leaf Disease Recognition
Swetha Pai and Manoj V. Thomas
Vimal Jyothi Engineering College, Kannur, Kerala, India
[email protected]
Abstract. Grape, or grapevine (Vitis vinifera), a member of the Vitaceae family, is one of India's most commercially important fruit crops. It is a widely grown temperate crop that has adapted to the subtropical climate of peninsular India. In a vineyard, grape fruits and leaves are highly likely to be confronted with diseases, and manual observation is neither feasible nor timely for experts and agronomists. In order to predict disease at an early stage, we carry out a literature survey of methods for disease identification and classification, and through this survey we identify the most suitable approaches for disease tracking. The paper first presents a detailed terminology of the different kinds of grape leaf diseases, and then surveys automated grape leaf disease identification and classification methods based on machine learning and deep learning.
Keywords: Machine learning · Classification · Grape leaf · Deep learning · Convolutional neural network · Image processing
1 Introduction
Grape is one of the fruit crops intensively cultivated in India, where its productivity is among the highest in the world. Grapes are consumed fresh, and a portion of the harvest is used for raisin and wine production; worldwide, about 82% of grape production goes to wine making, 10% to raisin making, and the remainder to table use. The availability of farm-fresh grapes is always limited. Even though the grape is a versatile crop, it cannot withstand attacks by micro-organisms, bacteria, pests, weeds, and similar threats: even a small infestation can inhibit the growth of healthy fruit. If grape plants are affected by disease during large-scale production, the overall yield suffers, leading to significant economic loss. Only fresh grapes are suitable for grape extracts and wine production, so regular monitoring of grape fields is necessary.
It is through Artificial Intelligence (AI) and its wide range of applications that we can cope with these grape leaf diseases: AI helps in early detection and classification of disease. Artificial Intelligence, a research direction that is still evolving, aims to provide machines with intelligence in the way humans have natural intelligence; an artificially intelligent agent perceives its environment and takes actions to achieve its goals. Machine Learning (ML), a subfield of AI, learns from past observations and failures. It provides algorithms that build a mathematical model from sample data (training data) to make fast predictions or decisions without being explicitly programmed, and it is commonly divided into reinforcement learning, unsupervised learning, and supervised learning, dealing with both labeled and unlabeled data. Deep learning, image processing, computer vision, and pattern recognition are sub-areas related to machine learning within AI. Deep learning can perform unsupervised learning on unlabeled data and is also known as deep neural learning or deep neural networks. Image processing and computer vision, on the other hand, take images as input, typically as image datasets split into training and testing data. The procedure follows steps such as collection of images or datasets, image preprocessing, image segmentation, feature extraction, and finally classification; these steps support improved detection of grape leaf diseases, as explained later. Pattern recognition is a subpart of machine learning that focuses on recognizing regularities or patterns in data and involves classification and clustering of patterns. The following section explains the different kinds of grape leaf diseases.
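As a rough illustration of the pipeline just outlined (collection, preprocessing, segmentation, feature extraction, classification), the following sketch strings the stages together with OpenCV and scikit-learn. The colour-histogram feature, the HSV thresholds and the SVM classifier are illustrative assumptions rather than choices prescribed by any of the surveyed papers, and labelled_samples is a hypothetical list of (image path, label) pairs.

```python
# Minimal sketch of the classical leaf-disease recognition pipeline.
# Paths, thresholds and parameter values are placeholders for illustration.
import cv2
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def preprocess(img, size=(300, 300)):
    """Resize and denoise a leaf image."""
    img = cv2.resize(img, size)
    return cv2.GaussianBlur(img, (5, 5), 0)

def segment_leaf(img):
    """Crude HSV threshold keeping green/brown leaf regions (assumed bounds)."""
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (10, 40, 40), (90, 255, 255))
    return cv2.bitwise_and(img, img, mask=mask)

def extract_features(img):
    """Colour histogram of the segmented region as a simple feature vector."""
    hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def build_dataset(samples):
    """samples: list of (image_path, label) pairs (hypothetical)."""
    X, y = [], []
    for path, label in samples:
        img = cv2.imread(path)
        X.append(extract_features(segment_leaf(preprocess(img))))
        y.append(label)
    return np.array(X), np.array(y)

# X, y = build_dataset(labelled_samples)
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
# clf = SVC(kernel="rbf").fit(X_tr, y_tr)
# print("accuracy:", clf.score(X_te, y_te))
```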
2 Terminologies in Grape Leaf Disease
In an agronomic scenario, grape leaf diseases are responsible for the depletion of grapes and cause considerable economic loss at large scale. Grape fruit is an important source of nutrients such as vitamin C and vitamin K; however, diseases of the grape leaves adversely affect the quality and production of the fruit. Most affected grape leaves show one of several diseases, namely black rot, leaf blight, powdery mildew, downy mildew, grapevine measles, anthracnose, and grey mold. Below is a brief description of some of the grape leaf diseases found in India.
2.1 Black Rot
Black rot is a disease of cultivated grape plants and their leaves caused by fungi or bacteria. It produces a dark brown discoloration, and the affected tissues decay and settle in the leaves of the grape plant. Black rot affects the above-ground parts of the grape plant and is favored by warm, humid weather. The disease is also called grape rot.
2.2 Leaf Blight
Leaf blight is a generic symptom affecting grape leaves due to infection by pathogenic organisms. It appears as a rapid and complete browning of leaves, after which the damaged tissues, including branches, other plant parts, floral organs, and the leaves themselves, die. Its symptoms on leaf tissue are the emergence and development of lesions that rapidly engulf the surrounding tissue, and leaf spots can extend to kill all areas of tissue on the leaf.
2.3 Mildew
Mildew is a disease generated by fungi that adversely affects grape leaves; different sorts of fungi can cause it. It is one of the elementary plant diseases that can be noticed quickly, as its symptoms are quite distinct from other leaf diseases: the contaminated plants show white powdery patches on the stems and leaves. The lower leaves are the most affected, but the disease can appear on any part of the plant above ground. As the disease progresses, the spots get bigger and denser, and the mildew spreads over the entire length of the grape plant. The white spots can spread to neighboring leaves and should therefore be handled accordingly.
2.4 Downy Mildew
Downy mildew is a foliage disease of plants caused by fungus-like organisms that spread among plants through airborne spores. It occurs during wet weather, and infection of the leaves is favored by wetness. In precision agriculture, downy mildew is a problem for farmers who cultivate fruits, vegetables, and grapes.
2.5 Grapevine Measles
Grapevine measles is also called esca, black measles, or Spanish measles. The name measles refers to the superficial spots found on the grape leaves and fruits. Cross sections through arms, cordons, or trunks of grapevines that display measles symptoms ooze a dark sap when cut.
2.6 Anthracnose
Anthracnose is a common fungal disease of plants. It tends to attack plant parts during the spring season when the weather is wet and cool, and it primarily attacks the leaves and twigs of the grape plant. Anthracnose appears as thin lesions along the leaves and veins; these dark and sunken lesions can also be found on stems, flowers, and fruit.
Fig. 1. Terminology of grape leaf diseases
2.7 Grey Mold
Grey mold is a fungal disease caused by Botrytis cinerea and is also called Botrytis blight. It can affect any part of a plant and is one of the most common diseases found among budding plants; it easily infects plants that are already damaged or beginning to die. The basic symptom is spots on the leaves which turn brown and start rotting, after which the leaves die, dry, and fall off the vine. All these diseases are illustrated in Fig. 1.
3 Survey on Grape Leaf Disease Recognition and Systematization
Recently, many proposals have been made in the field of machine learning for the recognition and classification of leaf disease, each with its own advantages and disadvantages. In this section, a comprehensive review of machine learning algorithms applied to image processing is carried out, focused mainly on detecting disease in grape leaves. The procedure typically follows the steps of collection of images or datasets, image preprocessing, image segmentation, feature extraction, and finally classification. We also include a discussion of deep learning based procedures.
3.1 Techniques Based on Machine Learning
Pantazi et al. [1] proposed an automated leaf disease detection mechanism for different crop species based on image feature analysis and one-class classifiers. The framework uses local binary patterns (LBPs) for feature extraction and one-class classifiers for classification, with a dedicated one-class classifier for each plant health condition, including healthy, downy mildew, powdery mildew, and black rot. Images are
collected using a smartphone or tablet in the field. Segmentation is applied to obtain the region of interest (ROI) and to remove the background; the ROI focuses on the important regions of the image required for further processing. A hue-saturation-value (HSV) transform is applied to the segmented image and the GrabCut algorithm is performed, which helps in 2D segmentation. Local binary patterns are efficient for pixel labeling and provide a binary result, converting the image into a matrix of numeric values; they are robust to monotonic gray-scale changes, so illumination variations can be handled, and the LBP histogram is used as the feature. A one-class SVM is used to classify leaf sample images: a support vector machine (SVM) accepts labeled data as input and classifies according to the sample data. The system achieves a 95% success rate over 46 leaf conditions.
Kumar et al. [2] described a system based on exponential spider monkey optimization. It uses the spatial domain subtractive pixel adjacency model (SPAM) for feature extraction from images, and the extracted features determine the learning rate of the classifier. The feature set is optimized with exponential spider monkey optimization (ESMO), which is highly efficient for this kind of image analysis; 686 features were considered from the image dataset, and ESMO discards unwanted and irrelevant features during selection. The optimized features are fed into an SVM classifier, which decides whether the leaf is healthy or diseased; mean fitness values are compared and different feature selection methods are evaluated.
Padol et al. [3] presented a fusion classification technique to detect downy and powdery mildew spots. The system relies on image processing techniques and combines support vector machines (SVM) and artificial neural networks (ANN). The images for the dataset were collected from Pune and Nashik, with a few taken from the web, all containing leaves affected by powdery or downy mildew. During preprocessing, images are resized to 300 x 300, thresholding is applied, and Gaussian filtering removes noise. K-means clustering is used for image segmentation, shape, texture, and color features are extracted, and classification is done by an ensemble of SVM and ANN. Results showed that the fusion classification technique provides higher accuracy than the individual classifiers.
Krithika et al. [4] suggested a method based on leaf skeletons and K-nearest neighbor (KNN) classification. A tangential direction based segmentation algorithm is used to retrieve the skeletons, guided by their luminance characteristics, since skeletons have higher intensity than other portions of the leaf; this helps to combine the leaf type and its disease type. Resizing, filtering, and thresholding are performed, and a color and smoothness based segmentation algorithm using the luminance value as a threshold is applied. Texture features are extracted using the gray-level co-occurrence matrix (GLCM), and after feature extraction KNN is used to classify the diseases with improved accuracy.
Es-Saady et al. [5] introduced a system based on a serial combination of two SVM classifiers: one SVM classifies color features and the second classifies texture and shape features.
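The LBP-plus-one-class-SVM scheme attributed to Pantazi et al. [1] at the start of this subsection can be sketched roughly as follows; the LBP radius, neighbourhood size, histogram binning and SVM parameters are illustrative assumptions rather than the authors' exact settings.

```python
# Rough sketch of LBP histogram features with a dedicated one-class SVM per
# leaf condition, in the spirit of Pantazi et al. [1]. Parameter values are
# assumptions for illustration.
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import OneClassSVM

P, R = 8, 1                      # neighbours and radius of the LBP operator

def lbp_histogram(gray_roi):
    """Uniform LBP histogram of a grayscale region of interest."""
    lbp = local_binary_pattern(gray_roi, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)
    return hist

def train_condition_models(rois_per_condition):
    """One one-class SVM per health condition (healthy, downy mildew, ...)."""
    models = {}
    for condition, rois in rois_per_condition.items():
        X = np.array([lbp_histogram(r) for r in rois])
        models[condition] = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)
    return models

def classify(models, gray_roi):
    """Pick the condition whose dedicated model scores the sample highest."""
    x = lbp_histogram(gray_roi).reshape(1, -1)
    return max(models, key=lambda c: models[c].decision_function(x)[0])
```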
Table 1. Comparison of different ML systems.
1. Pantazi et al. [1] | Preprocessing/segmentation: Region of Interest (ROI), Hue Saturation Value (HSV), GrabCut algorithm | Features: LBPs for color | Classifier: One-class Support Vector Machine (OCSVM) | Accuracy: 95%
2. Kumar et al. [2] | Preprocessing/segmentation: NA | Features: Spatial Domain Subtractive Pixel Adjacency Model (SPAM) | Classifier: Exponential Spider Monkey Optimization (ESMO), SVM | Accuracy: 92.12%
3. Padol et al. [3] | Preprocessing/segmentation: Resizing, thresholding, Gaussian filtering, k-means clustering | Features: Shape, texture, color | Classifier: Fusion of SVM and ANN | Accuracy: 91%–93%
4. Krithika et al. [4] | Preprocessing/segmentation: Tangential Direction (TD), resizing, thresholding, filtering | Features: Color, smoothness, texture using GLCM | Classifier: KNN | Accuracy: NA
5. Es-Saady et al. [5] | Preprocessing/segmentation: Resizing, filtering | Features: Color, texture, shape (color moment method, GLCM) | Classifier: Serial combination of two SVMs | Accuracy: 87.80%
6. Adeel et al. [6] | Preprocessing/segmentation: Low Contrast Haze Reduction (LCHR), LAB conversion | Features: Local Binary Patterns (LBPs), color, Canonical Correlation Analysis (CCA), Neighborhood Component Analysis (NCA) | Classifier: M-class SVM | Accuracy: 90%–92%
7. Sannakki et al. [10] | Preprocessing/segmentation: Anisotropic diffusion, k-means clustering | Features: Grey Level Co-occurrence Matrix (GLCM) | Classifier: Feed Forward Back Propagation Neural Network (BPNN) | Accuracy: NA
8. Kharde et al. [11] | Preprocessing/segmentation: Color conversion, histogram equalization, watershed algorithm | Features: Spatial gray-level dependence matrix (SGDM) | Classifier: Kohonen Neural Network (KNN) | Accuracy: 94%
9. Sudha et al. [12] | Preprocessing/segmentation: NA | Features: NA | Classifier: SVM, Naive Bayes classifier, Decision Tree | Accuracy: 93.75%
Preprocessing includes resizing and filtering, and segmentation is done using k-means clustering. The color moment method is used to extract color features, where the moments are the mean, standard deviation, and skewness of the image; texture features are extracted using GLCM; and twelve shape features are considered, namely area, perimeter, diameter, and others. Classification is done by a serial combination of two SVMs, and the results provide better accuracy than the individual classifiers. Table 1 summarizes the major works by different authors.
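A minimal sketch of k-means based segmentation followed by colour-moment features (mean, standard deviation, skewness), in the spirit of Padol et al. [3] and Es-Saady et al. [5], is given below; the colour space, the number of clusters and the choice of lesion cluster are assumptions for illustration.

```python
# Sketch of k-means lesion segmentation plus colour-moment features.
# Cluster count, colour space and cluster selection are illustrative choices.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def kmeans_segment(img_bgr, k=3):
    """Cluster pixels in Lab space and return per-pixel cluster labels."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB).reshape(-1, 3).astype(np.float32)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(lab)
    return labels.reshape(img_bgr.shape[:2])

def color_moments(img_bgr, mask):
    """Mean, standard deviation and skewness of each colour channel."""
    feats = []
    for ch in cv2.split(img_bgr):
        vals = ch[mask].astype(np.float64)
        mean, std = vals.mean(), vals.std()
        skew = ((vals - mean) ** 3).mean() / (std ** 3 + 1e-8)
        feats.extend([mean, std, skew])
    return np.array(feats)

# labels = kmeans_segment(leaf_img)        # leaf_img is a hypothetical BGR image
# lesion_mask = labels == 1                # cluster index chosen by inspection
# features = color_moments(leaf_img, lesion_mask)
```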
A Comparative Analysis on AI Techniques
7
for thresholding. Images are further refined using morphological operations. The geometric, Local Binary Pattern (LBP) and color features are extracted during feature extraction phase. A fusion technique based on Canonical Correlation Analysis (CCA) was performed. The Neighborhood Component Analysis (NCA) technique is investigated and it selects the important features. The NCA technique helps to remove noise to deal with unwanted features. The best characters are given as input to classifier for further recognition. M-class SVM is used for classification. The system used the plant village dataset which comprises of three types of disease mainly black measles, rot and blight. This method provided an image division accuracy of 90% and disease categorization above 92%. Nababan et al. [7] founded out a system for oil palm where features are extracted using probability function and classification using Naive Bayes is done. It provided 80% accuracy. Sena et al. [8] considered 720 images during training and testing. Maize leaf images was used and color index as feature. Iterative method was used during classification. Citrus leaves were classified using Discriminant Classifier by Pydipati et al. [9] by considering texture features. It provided 96% accuracy. 3.2
Techniques Based on Deep Learning
Ferentinos [13] developed a deep learning system on convolutional neural networks (CNN) architecture for disease finding. The dataset comprised of 87848 leaf images taken from lab and cultivation lands. Five basic CNN model were tested. Results suggested that VGG CNN achieved an accuracy worth 99%. Ji et al. [14] proposed the automatic grape leaf Disease identification via United Model dealing with multiple convolutional neural networks. Data preprocessing was done by taking all the pictures and are resized to the anticipated input measure of the particular systems. The classification model considered was VGGNet, DenseNet, ResNet and GoogLeNet. The United Model achieved an accuracy of 99.17%. Fuentes et al. [15] presented an approach to detect disease and pests found in red colored tomato and its leaves. The three main families of detectors like faster region-based convolutional neural network (Faster R-CNN), region-based fully convolutional network (R-FCN) and single shot multibox detector (SSD) are considered. Feature extraction was done using VGGNet and Residual Network (ResNet). The system can deal with status of the infection occurred, location of the area in plant, sides of leaves and different background conditions. The highest accuracy provided was 90%. Cruz et al. [16] identified an approach for grapevine yellows symptoms found in grape fruit. The novelty of the method was introduced by utilizing convolutional neural networks for end recognition of grapevine yellow using color images of leaves. Six neural network architectures namely AlexNet, GoogLeNet, Inception V3, ResNet 50, ResNet 101 and SqueezeNet are evaluated. RGB to gray scale conversion, otsu thresholding, gaussian filtering and median filtering, morphological closing was applied during the preprocessing stage. DNA verification
8
S. Pai and M. V. Thomas Table 2. Comparison of different DL system Sl. no. References
Feature extraction and classification model
Accuracy
1
Ferentinos et al. [13]
AlexNet, AlexNetOWTBn, GoogLeNet, Overfeat, VGG
99%
2
Ji et al. [14]
CNN architectures namely VGGNet, DenseNet, ResNet, GoogLeNet
98.57%
3
Fuentes et al. [15]
Faster R-CNN, R-FCN, SSD, VGGNet and ResNet
70%–90%
4
Cruz et al. [16]
CNN architectures like AlexNet, GoogLeNet, Inception V3, ResNet-50, ResNet-101, SqueezeNet
98%
5
Geetharamani et al. [17]
Nine layer Deep CNN, PCA, augmentation
96.46 %
6
Ozguven et al. [18]
Updated Faster R-CNN
95.48%
7
Baranwal et al. [19]
GoogLeNet CNN architecture, Image filtering, Image compression, Image generation techniques
98.54%
8
Liu et al. [20]
Convolutional Recurrent Neural 79%–100% Networks (C-RNN).CNN architectures namely ResNet, InceptionV3, Xception, MobileNet. Simple RNN, LSTM, GRU for feature extraction
9
Mehdipour Ghazi et al. [21] CNN architecture namely GoogLeNet, 80% AlexNet, VGGNet.Transfer learning. Image transforms like rotation, translation, reflection and scaling
10
Hu et al. [22]
SVM, C-DCGAN, VGG16
90%
11
Ramcharan et al. [23]
Transfer Learning with Inception v3 and Classification using SVM and KNN
91%
The models showed accuracies ranging from 70% to 98%. Geetharamani et al. [17] designed a system based on a nine-layer deep convolutional neural network. The architecture was trained on 39 different classes from an open dataset of plant leaf images, using image flipping, gamma correction, noise injection, principal component analysis, rotation, and scaling for data augmentation. The model was trained with several batch sizes and numbers of epochs, and the results showed a classification accuracy of 96.46%. Ozguven et al. [18] developed an updated Faster R-CNN architecture for detecting disease in sugar beet, using the Townsend and Heuberger formula to derive the percentage of disease from the scale scores; it showed 95.48% accuracy on a dataset of 155 sugar beet leaf images. The model was developed by combining a region proposal network with the classifier in Faster
R-CNN models. Baranwal et al. [19] used a CNN to detect disease in apple leaves; the GoogLeNet architecture comprising 22 layers was used to train the model, and image filtering, image compression, and image generation techniques were used to obtain the set of training images. Liu et al. [20] designed convolutional recurrent neural networks (C-RNN) for observation-centered plant identification. The C-RNN model comprises CNN backbones and RNN units: the CNN backbone uses residual networks, Inception V3, Xception, and MobileNet for feature extraction, while a simple RNN, long short-term memory (LSTM), or gated recurrent unit (GRU) synthesizes the features through the softmax layer. It demonstrated 79% to 100% accuracy. Table 2 summarizes the major works by different authors. A deep convolutional neural network to recognize plant species and to assess the different factors influencing network performance was put forward by Mehdipour Ghazi et al. [21]. GoogLeNet, AlexNet, and VGGNet were used as the models to train, transfer learning was used to tune the pretrained models, and rotation, translation, reflection, and scaling, the basic image transformations carried out during data augmentation, were also applied. The overall accuracy achieved by the system was 80%. The above section reviewed deep learning systems; the next section gives the results and analysis of this survey.
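A rough transfer-learning sketch in the spirit of the CNN-based studies above follows; the VGG16 backbone, the frozen layers, the classifier head and the hyperparameters are illustrative assumptions rather than the configuration of any one surveyed paper, and train_ds/val_ds are hypothetical datasets.

```python
# Sketch of fine-tuning a pretrained backbone for leaf-disease classification.
# Architecture choice, frozen layers and hyperparameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 4          # e.g. healthy, black rot, esca, leaf blight (assumed)

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False   # freeze convolutional features, train only the new head

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```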
4 Results and Discussions
In this section, the outcomes of this survey and their analysis are depicted as separate graphs. We analyzed mainly the classification strategies and the CNN architectures. First, a comparison is made between the different classification methods used in the machine learning systems; for deep learning, the various CNN architectures are contrasted by their accuracy. A summary is shown in Fig. 2(a), where all the classification methods used for leaf disease detection, especially in grapes, are compared with their respective accuracy; except for the Raspberry Pi based system, all are standard, well-known classification techniques. Accuracy is the factor considered in this analysis. We find that the support vector machine (SVM) has the highest accuracy at 95% and the probabilistic neural network (PNN) the lowest at 75.04%. A detailed analysis is given in Fig. 2(a), where 13 different systems are considered and a bar graph is plotted using classification accuracy as the single parameter; training and testing accuracies are not considered separately here. From this we can observe that most of the surveyed methods suggested SVM for classification. We also find that the most common features extracted are shape, color, and texture: active contours are suitable for shape features, GLCM for texture, and the color moment method for color features, and much of the work also uses LBP histograms. As the images are represented in 2D, only these features are considered.
Fig. 2. Analysis of machine learning and deep learning models: (a) analysis of classification methods, (b) analysis of CNN architectures.
As a second step, the different deep learning classification models, i.e., the CNN architectures, are compared by accuracy, which measures how well these models performed during the training and testing phases. We considered eighteen CNN models or architectures for which accuracy was reported. It is difficult to conclude from this comparison which system performs best: with eighteen architectures considered, the accuracies range from a lowest of 80% to a highest of 99%. A bar graph of this analysis is shown in Fig. 2(b). The accuracies of the systems are very close to each other, with small differences between them; no system provides a very low accuracy, and none reaches 100%. As a rough summary, VGGNet has the maximum accuracy and can be considered the best here, but other models also exhibit high accuracies in different contexts; DenseNet, GoogLeNet, and Faster R-CNN are the other models with better accuracy. It is by virtue of deep learning and its related algorithms that this domain provides such high accuracy, and from this viewpoint the CNN architectures perform better in many cases.
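The kind of comparison shown in Fig. 2(a) can be reproduced approximately with a simple bar plot; the snippet below uses only the single-valued accuracies already listed in Table 1 and omits entries reported as ranges or NA.

```python
# Bar plot of the single-valued accuracies reported in Table 1.
import matplotlib.pyplot as plt

methods = ["OCSVM [1]", "ESMO+SVM [2]", "Two-SVM serial [5]",
           "Kohonen NN [11]", "SVM/NB/DT [12]"]
accuracy = [95.0, 92.12, 87.80, 94.0, 93.75]

plt.figure(figsize=(7, 3))
plt.bar(methods, accuracy)
plt.ylabel("Reported accuracy (%)")
plt.ylim(80, 100)
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()
```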
5 Conclusion and Future Works
This paper provides a review of, and research directions in, the domain of artificial intelligence, where machine learning, deep learning, and image processing techniques for the recognition and classification of plant leaf disease, especially in grape or vine leaves, are discussed. The mechanisms used for other kinds of leaves are also reviewed, and the survey compares the existing approaches for disease classification. In a machine learning scenario, the major steps are collection of images or datasets, image preprocessing, image segmentation, feature extraction, and classification; several authors also apply feature selection, depending on its applicability and the improvement in prediction accuracy. A deep learning system performs feature extraction and classification with a predefined CNN model or architecture. Feature extraction
is the common factor in both families of techniques, but deep learning models perform it on their own. Throughout the review, we conclude that SVM gives better classification accuracy when segmentation is done with k-means clustering and texture features are considered; when texture features are not suitable in a given scenario, shape and color features perform better. CNN models such as AlexNet and VGG give better prediction accuracy depending on the number of input, hidden, and output layers involved. For future work, research can be done with deep learning techniques and better CNN architectures, by combining existing ones or developing a new model that takes the input, hidden, and output layers into account. As a result, all the above-mentioned techniques, one way or the other, help in improved recognition and classification of grape leaf disease.
References
1. Pantazi, X.E., Moshou, D., Tamouridou, A.A.: Automated leaf disease detection in different crop species through image features analysis and One Class Classifiers. Comput. Electron. Agric. 156, 96–104 (2019). https://doi.org/10.1016/j.compag.2018.11.005
2. Kumar, S., Sharma, B., Sharma, V.K., Sharma, H., Bansal, J.C.: Plant leaf disease identification using exponential spider monkey optimization. Sustain. Comput. Inf. Syst. (2018). https://doi.org/10.1016/j.suscom.2018.10.004
3. Padol, P.B., Sawant, S.D.: Fusion classification technique used to detect downy and Powdery Mildew grape leaf diseases. In: Proceedings of the International Conference on Global Trends in Signal Processing, Information Computing and Communication, ICGTSPICC 2016, pp. 298–301 (2017). https://doi.org/10.1109/ICGTSPICC.2016.7955315
4. Krithika, N., Grace Selvarani, A.: An individual grape leaf disease identification using leaf skeletons and KNN classification. In: Proceedings of 2017 International Conference on Innovations in Information, Embedded and Communication Systems, ICIIECS 2017, pp. 1–5 (2018). https://doi.org/10.1109/ICIIECS.2017.8275951
5. Es-Saady, Y., El Massi, I., El Yassa, M., Mammass, D., Benazoun, A.: Automatic recognition of plant leaves diseases based on serial combination of two SVM classifiers. In: Proceedings of 2016 International Conference on Electrical and Information Technologies, ICEIT 2016, pp. 561–566 (2016). https://doi.org/10.1109/EITech.2016.7519661
6. Adeel, A., et al.: Diagnosis and recognition of grape leaf diseases: an automated system based on a novel saliency approach and canonical correlation analysis based multiple features fusion. Sustain. Comput. Inf. Syst. 24, 100349 (2019). https://doi.org/10.1016/j.suscom.2019.08.002
7. Nababan, M., et al.: The diagnose of oil palm disease using naive Bayes method based on expert system technology. J. Phys. Conf. Ser. 1007, 012015 (2018). https://doi.org/10.1088/1742-6596/1007/1/012015
8. Sena, D.G., Pinto, F.A.C., Queiroz, D.M., Viana, P.A.: Fall armyworm damaged maize plant identification using digital images. Biosyst. Eng. 85, 449–454 (2003). https://doi.org/10.1016/S1537-5110(03)00098-9
9. Pydipati, R., Burks, T.F., Lee, W.S.: Identification of citrus disease using color texture features and discriminant analysis. Comput. Electron. Agric. 52, 49–59 (2006). https://doi.org/10.1016/j.compag.2006.01.004
10. Sannakki, S.S., Rajpurohit, V.S., Nargund, V.B., Kulkarni, P.: Diagnosis and classification of grape leaf diseases using neural networks, pp. 3–7 (2013)
11. Kharde, P.K., Kulkarni, H.H.: An unique technique for grape leaf disease detection. Int. J. Sci. Res. Sci. Eng. Technol. 2, 343–348 (2016)
12. Sudha, V.P.: Feature selection techniques for the classification of leaf diseases in turmeric. Int. J. Comput. Trends Technol. 43, 138–142 (2017). https://doi.org/10.14445/22312803/ijctt-v43p121
13. Ferentinos, K.P.: Deep learning models for plant disease detection and diagnosis. Comput. Electron. Agric. 145, 311–318 (2018). https://doi.org/10.1016/j.compag.2018.01.009
14. Ji, M., Zhang, L., Wu, Q.: Automatic grape leaf diseases identification via UnitedModel based on multiple convolutional neural networks. Inf. Process. Agric. 7, 418–426 (2019). https://doi.org/10.1016/j.inpa.2019.10.003
15. Fuentes, A., Yoon, S., Kim, S.C., Park, D.S.: A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors (Switz.) 17, 2022 (2017). https://doi.org/10.3390/s17092022
16. Cruz, A., et al.: Detection of grapevine yellows symptoms in Vitis vinifera L. with artificial intelligence. Comput. Electron. Agric. 157, 63–76 (2019). https://doi.org/10.1016/j.compag.2018.12.028
17. Geetharamani, G., Arun Pandian, J.: Identification of plant leaf diseases using a nine-layer deep convolutional neural network. Comput. Electr. Eng. 76, 323–338 (2019). https://doi.org/10.1016/j.compeleceng.2019.04.011
18. Ozguven, M.M., Adem, K.: Automatic detection and classification of leaf spot disease in sugar beet using deep learning algorithms. Phys. A Stat. Mech. Appl. 535, 122537 (2019). https://doi.org/10.1016/j.physa.2019.122537
19. Baranwal, S., Khandelwal, S., Arora, A.: Deep learning Convolutional Neural Network for apple leaves disease detection. SSRN Electron. J., 260–267 (2019). https://doi.org/10.2139/ssrn.3351641
20. Liu, X., Xu, F., Sun, Y., Zhang, H., Chen, Z.: Convolutional recurrent neural networks for observation-centered plant identification. J. Electr. Comput. Eng. 2018, 7 (2018). https://doi.org/10.1155/2018/9373210
21. Mehdipour Ghazi, M., Yanikoglu, B., Aptoula, E.: Plant identification using deep neural networks via optimization of transfer learning parameters. Neurocomputing 235, 228–235 (2017). https://doi.org/10.1016/j.neucom.2017.01.018
22. Hu, G., Wu, H., Zhang, Y., Wan, M.: A low shot learning method for tea leaf's disease identification. Comput. Electron. Agric. 163, 104852 (2019). https://doi.org/10.1016/j.compag.2019.104852
23. Ramcharan, A., Baranowski, K., McCloskey, P., Ahmed, B., Legg, J., Hughes, D.P.: Deep learning for image-based cassava disease detection. Front. Plant Sci. 8, 1–7 (2017). https://doi.org/10.3389/fpls.2017.01852
Sign Language Recognition Using Cluster and Chunk-Based Feature Extraction and Symbolic Representation
H. S. Nagendraswamy and Syroos Zaboli
Department of Studies in Computer Science, University of Mysore, Mysore 570006, India
{hsnswamy,syroos}@compsci.uni-mysore.ac.in
Abstract. This paper focuses on sign language recognition with respect to the hand movement trajectories at a sentence level. This is achieved by applying two proposed methods, namely Chunk-based and Cluster-based feature representation techniques, in order to extract the desired keyframes. The features are extracted based on hands and head local centroid characteristics such as velocity, magnitude and orientation. A set of experiments are conducted on a large self-curated sign language sentence data set (UOM-SL2020) in order to evaluate the performance of the proposed methods. The results clearly show the high recognition rate of 75.51% in terms of F-measure, which is achieved by combining the proposed method with symbolic interval-based representation and validation of feature sets.
Keywords: Sign language recognition · Keyframe extraction · Symbolic representation
1 Introduction
According to the latest survey by the World Health Organization [1] in 2019, about 6.1% of the world's population, an estimated 466 million people, have disabling hearing loss, and it is likely that this number will grow to 630 million by 2030 and 900 million by 2050. Although sign language is the perfect means of communication within the hearing-impaired community, it has not been able to fill the communication gap between the hearing impaired and the rest of society. Aside from sign language interpreters, developments in technology and computer vision systems provide great possibilities for automating sign language interpretation, and there have been a number of attempts to apply image processing and pattern recognition techniques to sign language videos in order to recognize and interpret signs. The rest of this paper is organised as follows: some of the various attempts in this area are presented in Sect. 2, followed by the proposed methodology in Sect. 3 and sign language sentence representation in Sect. 3.3. A description of the experiments used to evaluate the performance of the proposed method is given in Sect. 4, followed by the conclusion and future works in Sect. 6.
2 Related Work
Although the ultimate goal has been to create an efficient SLR (Sign Language Recognition) system, the paths and approaches toward achieving it have been many. While some works incorporated sensory and glove-based methods [2–6], others opted for image processing and pattern recognition techniques such as spatial relationship based features [7], convolutional neural networks [8], Active Appearance Models [9], transition-moment models [10], the Grassman Covariance Matrix [11] and Sparse Observation (SO) descriptions [12]. Another approach was to perform SLR at the finger-spelling level [13–16], where recognition revolves solely around the hand shapes describing alphabets and numbers. SLR at the word level [17–20] and at the sentence level [20–23] further increased the applicability of recognition to real-world SLR problems, which involve sequences of continuous hand gestures constituting sign language messages. On the other hand, some interesting attempts [24–28] considered facial expression as part of the recognition task. In spite of recent developments in 3D video SLR systems [29, 30], several of the classic 2D video SLR problems noted in the state of the art [31], such as hand and face occlusion, the type of video input in terms of frame rate and resolution, the speed at which a message is signed, and the variations between different signers, or within the same signer signing the same sentence, are yet to be fully explored. Taking the above into account, our attempt focuses on efficient representation of sign language messages at the sentence level by means of the proposed key frame extraction techniques. It is important to note that, since the proposed method focuses on key frame extraction with respect to a self-curated large dataset of signs, a direct comparison with the state of the art is not possible, owing to the variance among sign language data sets, which differ in language, structure and other variables.
3 The Proposed Methodology
The proposed methodology consists of four major steps: (i) frame-by-frame segmentation of the regions of interest, namely the face and the hands, from a given input video with a variable number of frames per second (fps); (ii) effective feature extraction taking the trajectories of hand and head movements into account; (iii) Cluster- and Chunk-based representation of the signs in the database; and (iv) identification and recognition of a given test sign video based on the knowledgebase. These steps are explained in detail as follows.
3.1 Face and Hand Segmentation
For every input video, the regions of interest, namely the face and hands, are extracted from every frame. This is achieved by first converting the three-dimensional RGB color space to Hue Saturation Value (HSV) and thresholding the histogram between minimum and maximum levels closest to the skin color range. This step enables the creation of a mask for the regions of interest, in which every pixel outside the mask is set to zero. Further morphological operations and adjustments are applied to achieve a clean segmentation (Fig. 1).
Fig. 1. Given an input video every frame is extracted and passed over to Image segmentation and mask creation using HSV and morphological operations
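A minimal sketch of this segmentation step is given below, assuming fixed HSV bounds; the paper derives its thresholds from histogram analysis of the skin tones, so the constants here are placeholders rather than values taken from the method.

```python
# Sketch of the face/hand segmentation step: an HSV skin-colour mask followed
# by morphological clean-up. Threshold values are placeholder assumptions.
import cv2
import numpy as np

LOWER_SKIN = np.array([0, 40, 60], dtype=np.uint8)     # assumed bounds
UPPER_SKIN = np.array([25, 180, 255], dtype=np.uint8)

def segment_skin(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_SKIN, UPPER_SKIN)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)    # remove speckle
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask), mask

def frame_iterator(video_path):
    """Yield the frames of the input video one by one."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield frame
    cap.release()
```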
3.2 Feature Extraction
Every sign in SL consists of a specific hand shape, known as a gesture, and a set of non-manual markers, namely facial expression, head and hand orientation, neck and shoulder movements, and mouth morphemes. In order to convey a message, the hand gesture moves in 3D space with variable speed at different instances of time. This creates a trajectory through which a hand moves in a specific manner for different messages. The trajectory has a different magnitude and direction at every instance of time, which makes it ideal for strong feature extraction, as it provides the location of the head and hands on the XY axes as well as the magnitude and direction toward which the trajectory moves at every instance of time. Let B1, B2 and B3 be the three detected segmented blobs representing the two hands and the head, and let CB1, CB2 and CB3 be their mass centroids, respectively. Let Pxy be any pixel moving along the two-dimensional XY plane by a minimal deterministic displacement. Using motion estimation based on polynomial expansion [32], the pixel-wise displacement of every Pxy along the X and Y axes between every two consecutive frames is evaluated (Figs. 3 and 4), where Vx stands for the velocity along the X-axis and Vy for the velocity along the Y-axis. The global orientation and magnitude of every segmented blob B, over all of its corresponding pixel displacements Pxy, are evaluated using Eqs. (1), (2) and (3). The magnitude and orientation of the displacement vector for the centroid of blob CB_b are (Figs. 2, 5 and 6):

M_b = \sum_{\forall (V_x, V_y) \in CB_b} \sqrt{V_x^2 + V_y^2 + 2 V_x V_y \cos\theta}    (1)

O_b = \sum_{\forall (V_x, V_y) \in CB_b} \tan^{-1}\!\left( \frac{V_y \sin\theta}{V_x + V_y \cos\theta} \right)    (2)

where \theta is the angle between V_x and V_y:

\theta = \cos^{-1}\!\left( \frac{\vec{V_x} \cdot \vec{V_y}}{\lVert V_x \rVert \, \lVert V_y \rVert} \right)    (3)

Fig. 2. Segmented and labeled regions of interest with respect to the CB1, CB2 and CB3 centroids
Fig. 3. Plot of velocity along the X-axis (Vx)

Fig. 4. Plot of velocity along the Y-axis (Vy)

Fig. 5. Plot of magnitude (M) at every pixel

Fig. 6. Plot of orientation (O) at every pixel
For every frame, the feature vector Fn = {XCB1, YCB1, O1, M1, XCB2, YCB2, O2, M2, XCB3, YCB3, O3, M3} defines the hands and head movements in that frame (Fig. 7).
Fig. 7. Plot of the hand and head trajectory movements along with direction and magnitude with respect to the three local centroids.
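The following sketch illustrates how such a 12-dimensional per-frame feature vector could be assembled with OpenCV's Farneback optical flow (ref. [32]); it uses the standard Cartesian-to-polar conversion as a simplification of Eqs. (1)–(3), and the blob-label convention (1 = head, 2 and 3 = hands) is an assumption.

```python
# Sketch of the per-frame feature vector Fn = {X_CB, Y_CB, O, M} for the three blobs.
import cv2
import numpy as np

def frame_features(prev_gray, curr_gray, blob_labels):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0], flow[..., 1]
    mag, ang = cv2.cartToPolar(vx, vy)      # per-pixel magnitude and orientation
    feature = []
    for b in (1, 2, 3):                     # head and the two hands
        ys, xs = np.nonzero(blob_labels == b)
        if len(xs) == 0:                    # blob not detected in this frame
            feature += [0.0, 0.0, 0.0, 0.0]
            continue
        cx, cy = xs.mean(), ys.mean()       # blob centroid (X_CB, Y_CB)
        o = ang[ys, xs].sum()               # accumulated orientation O_b
        m = mag[ys, xs].sum()               # accumulated magnitude M_b
        feature += [cx, cy, o, m]
    return np.array(feature)                # 12-dimensional vector per frame
```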
3.3 Sign Representation
An efficient and condensed representation of every sign significantly improves the SLR system performance. Our proposed sign representation technique consists of two different methods, namely Chunk-based and Cluster-based key frame extraction, which address the following challenges. (i) A sign sentence signed by a signer at different instances of time varies in the pace at which every combination of hand gesture and movement is performed in order to complete the sentence. This results in different durations and total numbers of frames for every video sample. (ii) It is also observed that a signer might take his time to perform a certain word in a sentence and then perform the rest at a faster pace. Such intra-class variations tend to increase with different individuals performing the same sign sentence. (iii) Another challenge affecting the total video duration of every sign sentence is the type of video input itself: videos captured at different frame rates (24 fps, 30 fps, 60 fps) result in different total numbers of frames for the same sign sentence. (iv) Such problems raise further challenges at the classification and testing level, where a common structure to evaluate and compare samples does not exist.

Chunk Based Key Frame Extraction
This method determines the key frames by extracting the most significant feature vector from each of a set of equally divided chunks of frames. Given an input video consisting of l frames, the complete set of feature vectors describing every frame is:

F_1 = {f_{1,1}, f_{2,1}, f_{3,1}, ..., f_{12,1}}
F_2 = {f_{1,2}, f_{2,2}, f_{3,2}, ..., f_{12,2}}
F_3 = {f_{1,3}, f_{2,3}, f_{3,3}, ..., f_{12,3}}
...
F_l = {f_{1,l}, f_{2,l}, f_{3,l}, ..., f_{12,l}}

where the total number of features is 12 times the total number of frames, which is an inefficient representation in the knowledgebase. Let L be the total number of key frames to be extracted. The total number of frames n can be divided into L chunks such that:

Chunk Size = n/L, for chunks 1 : (L − 1);  n/L + r, for chunk L    (4)

where r stands for the excess number of frames, which is included in the last chunk. Let μ_L be the mean feature vector corresponding to each chunk L. The key frames KF = {kf_1, kf_2, kf_3, ..., kf_L} can be extracted by the following condition:
KF^s_Ch = {
  kf^s_1 = argmin_{ch_1} ||F_m − μ_1||²,  for m = 1 : n/L
  kf^s_2 = argmin_{ch_2} ||F_m − μ_2||²,  for m = n/L + 1 : 2n/L
  kf^s_3 = argmin_{ch_3} ||F_m − μ_3||²,  for m = 2n/L + 1 : 3n/L
  ...
  kf^s_L = argmin_{ch_L} ||F_m − μ_L||²,  for m = (Ch_L − 1)·n/L + 1 : Ch_L·n/L + r
}    (5)
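A minimal sketch of this chunk-wise selection (Eqs. 4 and 5) is given below, assuming the per-frame feature vectors are stacked in a NumPy array; the last chunk absorbs the remainder r.

```python
# Chunk-based key frame extraction: pick, from each chunk, the frame whose
# feature vector is closest (L2) to the chunk mean.
import numpy as np

def chunk_keyframes(features, L):
    n = len(features)
    size = n // L                                   # chunk size n/L
    keyframes = []
    for c in range(L):
        start = c * size
        end = n if c == L - 1 else (c + 1) * size   # last chunk gets the extra r frames
        chunk = features[start:end]
        mu = chunk.mean(axis=0)                     # chunk mean mu_c
        idx = start + int(np.argmin(np.linalg.norm(chunk - mu, axis=1)))
        keyframes.append(idx)
    return keyframes
```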
where Ch_L is the corresponding chunk number and KF^s_Ch is the chunk-based sign representation of sign sentence s by L key frames, in terms of the feature vector F_m of the mth frame. Figure 8 shows the chunk-based key frame extraction of the sign "IS YOUR HOUSE SMALL?" where L = 5.

Cluster Based Key Frame Extraction
Here the key frames are determined by extracting the most significant feature vectors, those closest to the centroids of clusters of frames. In order to cluster the frames, we apply the concept of k-means clustering, where the value of K is chosen empirically based on experimentation. By applying K-means clustering to a given input video s consisting of l frames, where each frame is described by a feature vector F_l, the centroids of the K clusters can be described as:

ClusterCentroid = {C_1, C_2, C_3, C_4, ..., C_k}

The key frames KF^s_Cl = {kf_1, kf_2, kf_3, ..., kf_k} are evaluated by the following condition:
KF^s_Cl = {
  kf^s_1 = argmin_{Cl_1} ||F_i − C_{Cl_1}||²,  ∀i ∈ Cl_1
  kf^s_2 = argmin_{Cl_2} ||F_i − C_{Cl_2}||²,  ∀i ∈ Cl_2
  kf^s_3 = argmin_{Cl_3} ||F_i − C_{Cl_3}||²,  ∀i ∈ Cl_3
  ...
  kf^s_k = argmin_{Cl_k} ||F_i − C_{Cl_k}||²,  ∀i ∈ Cl_k
}    (6)
where Cl_k is the corresponding kth cluster and the key frame kf^s_k is the ith frame in cluster Cl_k with the highest similarity to the centroid C_{Cl_k}. Figure 9 shows the cluster-based key frame extraction of the sign "IS YOUR HOUSE SMALL?" where K = 5.
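A possible implementation of this cluster-based selection (Eq. 6), assuming scikit-learn's KMeans is used for the clustering, is sketched below.

```python
# Cluster-based key frame extraction: for each cluster, pick the frame closest
# to the cluster centroid, then return the key frames in temporal order.
import numpy as np
from sklearn.cluster import KMeans

def cluster_keyframes(features, K):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(features)
    keyframes = []
    for k in range(K):
        members = np.flatnonzero(km.labels_ == k)            # frames in cluster k
        dists = np.linalg.norm(features[members] - km.cluster_centers_[k], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))     # closest frame to centroid
    return sorted(keyframes)                                 # sequentially ordered key frames
```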
Fig. 8. Chunk-based key frame extraction for L = 5. The 20th, 48th, 59th, 84th and 116th frames are the chosen key frames from a total of 138 frames in the sample-32 video (“IS YOUR HOUSE SMALL?”).
Fig. 9. Cluster-based key frame extraction for K = 5. The 20th, 58th, 88th, 103rd and 133rd frames are the chosen key frames from a total of 138 frames in the sample-32 video ("IS YOUR HOUSE SMALL?").
3.4 Symbolic Representation of the Signs
In this section, the representation and validation for both the cluster-based and chunk-based feature extraction methods are discussed, by applying symbolic representation and validation by a symbolic similarity measure [33]. Given an input video, the key frames F1, F2, F3, ..., Fl are extracted by applying the proposed Cluster-based and Chunk-based methods as explained in Sect. 3.3. For key frames F1 to Fl, let FV^s_c be the feature vector representing the S training samples of class C in terms of k key frames:

FV^s_c = { f^s_{1,k}, f^s_{2,k}, f^s_{3,k}, ..., f^s_{n,k} },  for k = 1 : number of key frames    (7)
Scheme 1: Symbolic Representation Using Min and Max Values for Intervals
The symbolic representation of the class C, given the training samples S, in terms of minimum and maximum values is evaluated as follows:

f^(min)_{n,k} = min{ f^1_{n,k}, f^2_{n,k}, f^3_{n,k}, f^4_{n,k}, f^5_{n,k}, ..., f^s_{n,k} }
f^(max)_{n,k} = max{ f^1_{n,k}, f^2_{n,k}, f^3_{n,k}, f^4_{n,k}, f^5_{n,k}, ..., f^s_{n,k} }

where the lower limit f^(min)_{n,k} is the minimum of all the nth features of the kth key frames of training samples S, and the upper limit f^(max)_{n,k} is the maximum of the same.

Hence, the interval representation of class C with respect to minimum and maximum values can be defined as:

S1FV^s_c = { [f^(min)_{1,k}, f^(max)_{1,k}], [f^(min)_{2,k}, f^(max)_{2,k}], [f^(min)_{3,k}, f^(max)_{3,k}], ..., [f^(min)_{n,k}, f^(max)_{n,k}] }    (8)
Scheme 2: Symbolic Representation Using Mean and Standard Deviation
The symbolic representation of the class C, given the training samples S, in terms of mean and standard deviation values is evaluated as follows:

μ = mean{ f^1_{n,k}, f^2_{n,k}, f^3_{n,k}, f^4_{n,k}, f^5_{n,k}, ..., f^s_{n,k} }    (9)

σ = std{ f^1_{n,k}, f^2_{n,k}, f^3_{n,k}, f^4_{n,k}, f^5_{n,k}, ..., f^s_{n,k} }    (10)

The lower and upper limits of the intervals can be defined as:

f^−_{n,k} = μ − σ,   f^+_{n,k} = μ + σ

where f^−_{n,k} is the lower limit for all the nth features of the kth key frames of training samples S and f^+_{n,k} is the upper limit of the same. Hence, the interval representation of class C with respect to mean and standard deviation values can be defined as:

S2FV^s_c = { [f^−_{1,k}, f^+_{1,k}], [f^−_{2,k}, f^+_{2,k}], [f^−_{3,k}, f^+_{3,k}], ..., [f^−_{n,k}, f^+_{n,k}] }    (11)
Recognition Using Symbolic Similarity Measure
Given a test sample TF_t, the similarity is evaluated based on the presence in, or the closeness of, every test sample feature to the corresponding target class's symbolic interval representation. The similarity measure using Min-Max or Mean-SD intervals is defined as:

Similarity(S2FV^s_c, TF_t) = (1/n) Σ_{i=1}^{n} { 1, if f^−_{n,k} ≤ f_{n,k} ≤ f^+_{n,k};
                                                 max( 1 / (1 + abs(f^+_{n,k} − f_{n,k})), 1 / (1 + abs(f^−_{n,k} − f_{n,k})) ), otherwise }    (12)
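The interval construction of Schemes 1 and 2 and the similarity measure of Eq. (12) could be implemented along the following lines; the array shapes and the class-wise decision rule noted in the comments are assumptions about how the knowledgebase is organised.

```python
# Symbolic interval representation (Eqs. 8 and 11) and similarity (Eq. 12).
# `train` is assumed to be an (S x N) array of flattened key-frame features
# for one class; `test` is an N-vector for a test sample.
import numpy as np

def class_intervals(train, scheme="minmax"):
    if scheme == "minmax":                      # Scheme 1: [min, max] intervals
        return train.min(axis=0), train.max(axis=0)
    mu, sd = train.mean(axis=0), train.std(axis=0)
    return mu - sd, mu + sd                     # Scheme 2: [mu - sigma, mu + sigma]

def symbolic_similarity(lower, upper, test):
    inside = (test >= lower) & (test <= upper)  # full credit inside the interval
    closeness = np.maximum(1.0 / (1.0 + np.abs(upper - test)),
                           1.0 / (1.0 + np.abs(lower - test)))
    return np.where(inside, 1.0, closeness).mean()   # average over all features

# The test sample is assigned to the class whose interval representation
# yields the highest similarity score.
```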
4 Experimentation and Validation

In order to evaluate the performance of the proposed method, various experiments have been conducted on our large data set. This data set consists of 80 classes, where each class is a sign language sentence video of two to six seconds duration, captured at different frame rates and performed by different individuals at different instances of time. The data set contains 800 sample video files across the 80 classes. The number of key frames is set to 1 to 10, and 10 to 50 in increments of 5. The same settings are used to train and test both the proposed Chunk-based and Cluster-based methods in two different approaches. These testing stages are applied over the 80 classes of sign language, with 10 samples in each, performed by 11 different signers throughout the dataset. The first set of experiments is based on the symbolic representation and validation [33] for both the Min-Max and Mean-SD schemes. The training to testing ratios are set to 30:70, 50:50, 70:30 and 90:10. The F-measure is used to evaluate the performance of the system for every experiment. The next set of experiments uses SVM, KNN and LDA classifiers on the number of key frames for which the highest F-measure is achieved. Figures 10, 11, 12 and 13 give the overall average recognition rates of the proposed chunk-based and cluster-based feature extraction in terms of F-measure for different key frame numbers using the symbolic similarity measure (Tables 1, 2, 3, 4 and Figs. 14 and 15).
Fig. 10. Symbolic similarity measure, chunk-based key frame extraction using Mean-SD intervals: recognition rate (%) versus number of key frames for training-to-testing ratios 30:70, 50:50, 70:30 and 90:10.
Fig. 11. Symbolic similarity measure, cluster-based key frame extraction using Min-Max intervals: recognition rate (%) versus number of key frames for training-to-testing ratios 30:70, 50:50, 70:30 and 90:10.
Table 1. Recognition rate for Chunk-based feature extraction and symbolic similarity measure in terms of Min-Max intervals for 70:30 training to testing ratio.
Dataset: 80 classes, 800 videos, 50-400 frames per video. Training to testing ratio: 70:30.

No. of (chunk) key frames   No. of features   F1 Score (%)
1                           12                56.89
2                           24                56.02
3                           36                60.82
4                           48                62.44
5                           60                71.30
6                           72                65.58
7                           84                65.04
8                           96                65.82
9                           108               64.35
10                          120               65.77
15                          180               65.75
20                          240               68.78
25                          300               64.88
30                          360               63.89
35                          360               63.90
40                          420               66.61
45                          540               64.70
50                          600               60.05
Fig. 12. Confusion Matrix for the 5 chunks (Key Frames), 70:30 training to testing ratio.
Table 2. Testing results and comparison between symbolic similarity measure and other classifiers for 5 chunks (key frames) feature extraction.

Dataset: 80 classes, 800 videos, 50-400 frames per video. Number of chunks (key frames): 5.

Validation using                F1 Score (%)
Symbolic similarity measure     71.30
LDA Classifier                  74.18
SVM Classifier                  69.08
KNN Classifier                  77.49
Fig. 13. Symbolic similarity measure, cluster-based key frame extraction using Mean-SD intervals: recognition rate (%) versus number of key frames for training-to-testing ratios 30:70, 50:50, 70:30 and 90:10.
Fig. 14. Symbolic similarity measure, chunk-based key frame extraction using Min-Max intervals: recognition rate (%) versus number of key frames for training-to-testing ratios 30:70, 50:50, 70:30 and 90:10.
Table 3. Recognition rate for Cluster-based feature extraction and symbolic similarity measure in terms of Min-Max intervals for 70:30 training to testing ratio.

Dataset: 80 classes, 800 videos, 50-400 frames per video. Training to testing ratio: 70:30.

No. of (cluster) key frames   No. of features   F1 Score (%)
1                             12                33.16
2                             24                48.55
3                             36                43.37
4                             48                45.94
5                             60                36.69
6                             72                44.55
7                             84                50.20
8                             96                47.29
9                             108               56.14
10                            120               50.70
15                            180               62.04
20                            240               66.29
25                            300               67.20
30                            360               65.13
35                            360               75.51
40                            420               66.81
45                            540               73.35
50                            600               70.32
Fig. 15. Confusion matrix for the 35 clusters (Key Frames), 70:30 training to testing ratio.
Table 4. Testing results and comparison between symbolic similarity measure and other classifiers for 35 clusters (key frames) feature extraction.

Dataset: 80 classes, 800 videos, 50-400 frames per video. Number of clusters (key frames): 35.

Validation using                F1 Score (%)
Symbolic similarity measure     75.51
LDA Classifier                  59.03
SVM Classifier                  59.34
KNN Classifier                  68.66
5 Conclusion

Based on the extensive set of experiments performed on the data set, the following observations are to be noted:

Recognition with Respect to Cluster and Chunk Based Feature Extraction
Lower numbers of key frames, ranging from 4 to 15, result in a higher recognition rate in the case of the proposed chunk-based feature extraction. This is due to the fact that in the chunk-based method the key frames are extracted from equally divided groups of frames in a sequential manner and the feature set is created accordingly. As the number of key frames increases, the probability of features overlapping tends to reduce, resulting in a lower recognition rate. The cluster-based feature extraction technique performs much better at higher numbers of key frames, ranging from 30 to 50. In this method, key frames are extracted based on their closeness to the centroids of clusters of similar frames. Although the clusters consist of unordered groups of frames, the final output of this method is a set of dissimilar key frames arranged in a sequentially ordered manner. Hence, the larger the number of key frames, the higher the possibility of the nth features of samples S overlapping, resulting in a higher recognition rate.

Recognition with Respect to the Classes in the Data Set
The experiments performed on the whole data set of 80 classes show that about 30-40% of the classes end up with recognition rates below 50% in terms of F1-score. This is due to the fact that the dataset consists of classes that can only be distinguished from one another based on non-manual markers such as facial expressions and head movements, as well as hand shapes, which have not been taken into consideration in terms of corresponding features. For example, the classes "Your house is big." and "Is your house big?" can only be distinguished from one another based on the facial expression.

Recognition with Respect to Symbolic Similarity Measure in Terms of Min-Max and Mean-SD Intervals
Looking into the results and graphs, the symbolic representations in terms of intervals assigned based on minimum and maximum values result in recognition rates
up to 15% higher than intervals created in terms of mean and standard deviation. This shows that, although the interval size of Min-Max is larger than that of the Mean-SD intervals, the reason it performs better is that the group of same features, though somewhat scattered, stays close enough within the boundaries of the interval in feature space; whereas, in the case of Mean-SD intervals, the recognition would have been high only if the group of same features were concentrated close to each other with none or very few outliers in feature space. It is also observed that the proposed cluster-based feature extraction using symbolic representation at 35 key frames results in a high F-measure of 75.51%, which outperforms the other classifiers. Aside from the challenge above, variations in video frame rates and frame dimensions, as well as inter- and intra-class variations such as the speed at which a sign is performed by a signer, were further introduced into the data set. Hence, the results above have been satisfactory and provide a strong base for a multimodal recognition system. Thus, in future work, the implementation of additional features such as facial expressions and hand shapes on top of the current trajectory analysis would result in a highly efficient and accurate sign language recognition system.

Acknowledgment. We would like to extend our deepest gratitude to Mrs. Sam and Mr. Tukuram and every sign language signer who helped us in creating this huge sign language data set, as well as the respected teachers and colleagues in the Department of Studies in Computer Science, University of Mysore. Their continuous help and support has been the ongoing motivation for this research.
References 1. W.H.O. World Health Organization. Prevention of blindness and deafness (2018). https:// www.who.int/pbd/deafness/estimates/en/ 2. Ramakant.,e-Karishma, N., Veerapalli, L.: Sign language recognition through fusion of 5DT data glove and camera-based information. In: Advance Computing Conference (IACC) (2015) 3. Gourley, C.: Neural Network Utilizing Posture Input for Sign Language Recognition. Technical report computer vision and robotics research laboratory, University of Tenessee (1994) 4. Holden, E.J., Lee, G., Owens, R.: Australian sign language recognition. Mach. Vis. Appl. 16 (5), 312–320 (2005) 5. Handouyahia, M., Ziou, D., Wang, S.: Sign language recognition using moment-based size functions. In: visioninterface 1999, Trois-Rivieres, Canada (1999) 6. Ranganath, S., Kong, W.W.: Towards subject independent continuous sign language recognition: a segment and merge approach. Pattern Recogn. 47, 1294–1308 (2014) 7. Kumara, C.B., Nagendraswamy, H.S.: Spatial relationship based features for indian sign language recognition. In: IJCCIE (2016) 8. Garcia, B.: Real-time American Sign Language Recognition with Convolutional Neural. Stanford University, Stanford (2018) 9. Piater, J., Hoyoux, T.: Video analysis for continuous sign language recognition. In: 4th Workshop on the Representation and Processing of Sign Languages:Corpora and Sign Language Technologies (2010)
10. Fang, G., Gao, W., Zhao, D.: Large-vocabulary continuous sign language recognition based on transition-movement models. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 37(1), 1–9 (2006) 11. Chai, X., Wang, H., Yin, F., Chen, X.: Communication tool for the hard of hearings. a large vocabulary sign language recognition system. In: International Conference on Affective Computing and Intelligent Interaction (ACII) (2015) 12. Wang, C., Gao, W., Shan, S.: An approach based on phonemes to large vocabulary Chinese sign language recognition. In: 5th IEEE ICAFFGR (2002) 13. Suraj, M.G., Guru, D.S.: Appearance based recognition methodology for recognizing fingerspelling alphabets. In: IJCAI (2007) 14. Suraj, M.G., Guru, D.S.: Secondary diagonal FLD for fingerspelling recognition. In: ICCTA International Conference on Computing: Theory and Applications (2007) 15. Ghotkar, A.S., Kharate, G.K.: Study of Vision Based Hand Gesture Recognition using Indian sign language. IJSS Intelligent Systems 7(1), 96–115 (2014) 16. Starner, T., Weaver, J., Pentland, A.: Real-time american sign language recognition using desk and wearable computer based video. IEEE Trans. Pattern Anal. Mach. Intell. 20(12), 1371–1375 (1998) 17. Wang, H., Chai, X., Chen, X.: Sparse observation (SO) alignment for sign language recognition. Neurocomputing 175, 674–685 (2016) 18. Youssif, A.A.: Arabic sign language recognition system using HMM. In: IJACSA (2011) 19. Liwicki, S., Everingham, M.: Automatic recognition of finger spelled words in British sign language. In: IEEE Computer Society Conference CVPR Workshops (2009) 20. Nagendraswamy, H.S., Guru, D.S., Naresh, Y.G.: Symbolic representation of sign language at sentence level. Int. J. Image Graph. Signal Process. 7(9), 49 (2015) 21. Nagendraswamy, H.S., Chethana Kumara, B.M., Lekha Chinmayi, R.: GIST descriptors for sign language recognition: an approach based on symbolic representation. In: Prasath, R., Vuppala, A.K., Kathirvalavakumar, T. (eds.) MIKE 2015. LNCS (LNAI), vol. 9468, pp. 103–114. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26832-3_11 22. Kumara, C., Nagendraswamy, H.S.: Sign energy images for recognition of sign language at sentence level. Int. J. Comput. Appl. 139(2), 44–51 (2016) 23. Kumara, B.C., Nagendraswamy, H.S.: Sign energy images for recognition of sign language at sentence level. Int. J. Comput. Appl. 975, 8887 (2016) 24. Dasa,S.P.: Sign language recognition using facial expression. In: Second International Symposium on Computer Vision and the Internet, vol. 57, pp. 210–216 (2015) 25. Neidle, C.: Computer-based tracking, analysis, and visualization of linguistically significant nonmanual events in American sign language. In: LREC-wkshop (2014) 26. Cooper, H.M., Ong, E.J., Pugeault, N., Bowden, R.: Sign language recognition using subunits. J. Mach. Learn. Res. 13, 2205–2231 (2012) 27. Nguyen, T.D., Ranganath, S.: Facial expressions in American sign language: tracking and recognition. Pattern Recogn. 45(5), 1877–1891 (2012) 28. Liu, B., Metaxas, D.: Recognition of nonmanual markers in American Sign Language (ASL) using non-parametric adaptive 2D-3D face tracking. In: ELRA (2012) 29. Lang, S., Block, M., Rojas, R.: Sign language recognition using kinect. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2012. LNCS (LNAI), vol. 7267, pp. 394–402. Springer, Heidelberg (2012). https://doi.org/ 10.1007/978-3-642-29347-4_46
30. Sun, C., Zhang, T., Bao, B.K., Xu, C., Mei, T.: Discriminative exemplar coding for sign language recognition with kinect. IEEE Trans. Cybern. 43(5), 1418–1428 (2013) 31. Sahoo, A.K.: sign language recognition: state of the art. J. Eng. Appl. Sci. 9(2), 116 (2014) 32. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Bigun, J., Gustavsson, T. (eds.) Image Analysis, pp. 363–370. Springer Berlin Heidelberg, Berlin, Heidelberg (2003). https://doi.org/10.1007/3-540-45103-X_50 33. Guru, D.S., Nagendraswamy, H.S.: Symbolic representation of two-dimensional shapes. Pattern Recogn. Lett. 28(1), 144–155 (2007)
Action Recognition in Haze Using an Efficient Fusion of Spatial and Temporal Features

Sri Girinadh Tanneru1 and Snehasis Mukherjee2(B)
1 IIIT SriCity, Chittoor, India
2 Shiv Nadar University, Greater Noida, India
Abstract. Action recognition in video sequences is an active research problem in Computer Vision. However, no significant efforts have been made for recognizing actions in hazy videos. This paper proposes a novel unified model for action recognition in hazy video using an efficient combination of a Convolutional Neural Network (CNN) for obtaining the dehazed video first, followed by extracting spatial features from each frame, and a deep bidirectional LSTM (DB-LSTM) network for extracting the temporal features during action. First, each frame of the hazy video is fed into the AOD-Net (All-in-One Dehazing Network) model to obtain a clear representation of the frames. Next, spatial features are extracted from every sampled dehazed frame (produced by the AOD-Net model) by using a pre-trained VGG-16 architecture, which helps reduce the redundancy and complexity. Finally, the temporal information across the frames is learnt using a DB-LSTM network, where multiple LSTM layers are stacked together in both the forward and backward passes of the network. The proposed unified model is the first attempt to recognize human action in hazy videos. Experimental results on a synthetic hazy video dataset show state-of-the-art performance in recognizing actions.

Keywords: CNN · Bidirectional LSTM · Haze removal · Human action recognition · AOD-Net

1 Introduction
Recognizing human action in video is a challenging problem due to the various kinds of complexities in the video, such as similarity of visual context during different actions, changes in the viewpoint for the same actions, camera jerk and motion, varying scale and pose of the actors, and different illumination conditions. Human action recognition in outdoor videos becomes much more challenging when the video suffers from lack of visibility due to fog or smoke. Outdoor photography often suffers from lack of visibility due to bad weather conditions. The observed objects lose visibility and contrast due to the presence of atmospheric haze, fog, and smoke. Moreover, a hazy video will put the effectiveness of many subsequent high-level computer vision tasks in jeopardy, such
as object detection and action recognition. Generally, the well-known computer vision tasks and computational photography algorithms assume that the input videos are taken in clear weather. Unfortunately, this is not always true; therefore, dehazing of videos is highly desired in order to analyze their content. To tackle this problem, removing haze from the input videos (frame-by-frame) can be a useful pre-processing step before analyzing the content. However, removal of haze is a challenging and complex problem, as the haze depends on the depth map, which is difficult to obtain due to the visual degradation of frames. In the context of hazy videos, the amount of haze may change across frames, making the task of analyzing the content of the video even more challenging. Several attempts have been made to dehaze a foggy video [1–5]. With the recent advances in deep learning based techniques for feature extraction from images, several sophisticated deep neural networks have been proposed for dehazing the individual frames of a video, showing exciting results [8–11]. However, no efforts have been made to recognize human actions in hazy videos. Based on deep architectures, several approaches have been proposed for human action recognition [12,13]; however, recognizing human actions in hazy videos remains a challenge. As our goal is to recognize human actions in hazy videos, we want a dehazing network which can extract the features of the corresponding dehazed videos before the actual recognition task. We apply a state-of-the-art dehazing network (AOD-Net) followed by the action recognition network (VGG-16). The initial layers of the proposed network learn how to remove haze and the latter layers predict the action. The features extracted from the proposed CNN are fed into a bidirectional LSTM to extract the temporal features related to human action. The AOD-Net is used as a CNN based dehazing network to extract features of the dehazed video. For action related feature extraction, we use a VGG-16 network (for extracting spatial features) and a bidirectional LSTM for sequence learning, as the bidirectional LSTM has memory of the previous and future frames and is hence useful in extracting the temporal features. Lastly, we have a softmax layer to predict the class. In this paper we propose a unified deep neural network for extracting spatial and temporal features from hazy videos, for recognizing action. The proposed method first applies a deep CNN similar to [8] on each frame, for extracting features of the haze-free image corresponding to the input hazy frame. Next, the extracted features are fed into a VGGNet architecture [6] for extracting spatial features from each frame. Finally, the spatial features extracted from each frame are fed into a bi-directional LSTM network following [14], for extracting temporal features for classification of actions. This paper has two contributions. First, this is the first attempt to recognize human actions in hazy video. Second, we generate a dataset containing hazy videos for validating human action recognition methods for hazy videos. Next we illustrate the mathematical background of the work and a brief literature survey.
2 Background and Related Works
Several efforts have been made for recognizing human actions in video. Dehazing a foggy video is another active area of research in computer vision. We first provide an overview of the recent advances in the area of video dehazing, followed by a brief account of the recent advances in recognizing human actions. Finally, we provide a discussion on the mathematical background for dehazing videos.

2.1 Dehazing a Video
Several methods have been proposed for image dehazing [1–5] during the last few decades. Most of the earlier methods used to rely on the depth map obtained from the hazy image, applying some filter on the depth map to smoothen it [1,3]. A few earlier methods relied on various priors for correcting the color of the hazy image [2,4]. A major advancement in the area of image dehazing happened with the introduction of the concept of the Dark Channel Prior (DCP) [5]. The DCP is based on a key observation: most local patches in haze-free outdoor images contain some pixels which have very low intensities in at least one color channel. A high quality depth map can also be obtained as a by-product of haze removal. However, this approach cannot handle heavy haze well and fails in the case of sky regions in the image. After the introduction of deep learning techniques, significant efforts have been made to dehaze the frames in the video [8–11]. Cai et al. proposed a CNN called DehazeNet to generate the clear transmission map of the input image and then generate the dehazed image by applying the atmospheric scattering model [10]. Ren et al. proposed a coarse-scale CNN called MSCNN to generate the transmission map from the hazy image and then apply a fine-scale network to obtain the clear transmission map, from which the dehazed image is generated [9]. Santra et al. proposed a quality indicator for image patches, to compare several candidate dehazed patches obtained from a CNN corresponding to a given patch from the hazy image, where the best patch is chosen by a binary search algorithm [11]. Most of the deep networks for image dehazing are computationally expensive and prone to overfitting problems. These two problems make it difficult to apply such networks to video analysis tasks. Borkar et al. proposed a video dehazing framework where the spatial regularization technique is based on Large Margin Nearest Neighbor (LMNN) and a temporal regularization across the frames is introduced based on a Markov Random Field (MRF) model [16]. However, [16] needs several statistical computations which make the process complex. Li et al. proposed a light-weight CNN called the All-in-One Dehazing Network (AOD-Net) based on the idea of the atmospheric scattering model, where the proposed network directly predicts the clear image from the input hazy image without obtaining any transmission map [8]. AOD-Net is less exposed to computational errors due to its light-weight structure with fewer parameters compared to other relatively complex models.
2.2 Human Action Recognition
Recognizing human actions in video is an active area of research in computer vision [17]. The earlier methods on human action recognition used to rely on motion cues extracted from the video, in the form of various representations of motion, e.g., several variations of optical flow [18], dense trajectory features [19], spatio-temporal interest points [20] or some combinations of them [21]. Following the success of deep learning architectures in various fields of computer vision, efforts have been made to recognize actions using deep learning based techniques [12–14]. Hong et al. proposed a 3D ConvNet based architecture with contextual information from the image for action recognition [12]. Crasto et al. proposed a 3D CNN based architecture leveraging the motion and appearance information from the video [13]. Temporal information from the video plays a major role in action recognition. In order to give more weight to the temporal information, efforts have been made to extract spatial information from individual frames using a CNN and apply a bi-directional LSTM to extract the temporal information from the video [14]. Motivated by the success of [14], the proposed method first extracts the haze-free features from individual frames by applying AOD-Net [8]. The haze-free features are then passed through a pre-trained VGG-Net [6] to extract the spatial information. The spatial information is then fed into a bi-directional LSTM similar to [14] to extract the temporal information for action recognition. Next we discuss the mathematical background for the haze removal algorithm.

2.3 Background
Most of the research works on image dehazing [2,5,16] are based on the classical atmospheric scattering model, expressed by the following equation: I(x) = J(x)t(x) + A(1 − t(x)),
(1)
where I(x) is the observed hazy image and J(x) is the scene radiance (“clean image”) to be recovered. Here A denotes the global atmospheric light, t(x) is the transmission matrix and β is the scattering coefficient of the medium, given by t(x) = e−βd(x) .
(2)
Hence, J(x) can be recovered from Eqs. (1) and (2) as follows:

J(x) = (1/t(x)) I(x) − (1/t(x)) A + A.    (3)
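As a small numerical illustration of this recovery, the following sketch assumes the transmission map t(x) and the atmospheric light A are already known, which is exactly what a dehazing network would otherwise have to estimate.

```python
# Recovering the scene radiance J from a hazy image I via Eq. (3),
# assuming I is an H x W x 3 array in [0, 1], t is an H x W transmission map
# and A is the (scalar or per-channel) atmospheric light.
import numpy as np

def recover_scene_radiance(I, t, A, t_min=0.1):
    t = np.clip(t, t_min, 1.0)              # avoid division by near-zero transmission
    J = (I - A) / t[..., None] + A          # (1/t) I - (1/t) A + A
    return np.clip(J, 0.0, 1.0)
```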
Recently many CNN based methods like [7] employ CNN as a tool to regress t(x) from I(x). With A estimated using some other empirical methods, they are then able to estimate J(x) by Eq. (3). The AOD-Net [8] model has a complete end-to-end CNN dehazing model based on re-formulating Eq. (1), which directly generates J(x) from I(x) without any other intermediate step using the following equation: J(x) = K(x)I(x) − K(x),
(4)
where K(x) is given by

K(x) = [ (1/t(x)) (I(x) − A) + A ] / ( I(x) − 1 ).    (5)
Next we discuss the proposed method.
Fig. 1. The overall diagram of the proposed approach for Recognizing Human Action in Hazy Videos.
3 Proposed Method
The proposed method for human action recognition in hazy video consists of three parts. We apply the pre-trained AOD-Net architecture [8] on individual hazy frames to obtain the haze-free features from the frames. The haze-free features are then sent to a pre-trained VGG-Net architecture [6] to obtain the spatial features. Finally, the extracted features are fed into a bi-directional LSTM following [14] for obtaining the temporal features for recognizing action. Figure 1 shows the overall architecture of the proposed model.

3.1 Dehazing the Frames
We apply the AOD-Net architecture to dehaze the video [8]. The AOD-Net consists of an estimation module which has 5 convolution layers to estimate the
K(x) in Eq. (5). The estimation module is followed by a clean image generation module consisting of one element-wise multiplication layer and a few element-wise addition layers to recover the clean image. The estimation module consists of five convolution layers, along with a "concat1" layer concatenating the features extracted from the layers "conv1" and "conv2". Similarly, the "concat2" layer concatenates the features extracted from the "conv2" and "conv3" layers, and the "concat3" layer concatenates the features extracted from the "conv1", "conv2", "conv3" and "conv4" layers. One of the major reasons to use AOD-Net in the proposed model is that AOD-Net can be easily embedded with other deep models (like VGG-Net, as done here) to constitute one pipeline that performs high-level tasks on hazy images, with an implicit dehazing process. We give 5 sampled frames of the video as input to the AOD-Net model, and these frames are then dehazed by the network.
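A PyTorch sketch of such a K-estimation module is given below; the kernel sizes and channel widths are assumptions in the spirit of AOD-Net [8], while the concatenation pattern and the clean-image generation of Eq. (4) follow the description above.

```python
# Sketch of an AOD-Net-style K-estimation module plus clean image generation.
import torch
import torch.nn as nn

class KEstimationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 3, kernel_size=1)
        self.conv2 = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(6, 3, kernel_size=5, padding=2)   # input: concat(conv1, conv2)
        self.conv4 = nn.Conv2d(6, 3, kernel_size=7, padding=3)   # input: concat(conv2, conv3)
        self.conv5 = nn.Conv2d(12, 3, kernel_size=3, padding=1)  # input: concat(conv1..conv4)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, hazy):
        x1 = self.relu(self.conv1(hazy))
        x2 = self.relu(self.conv2(x1))
        x3 = self.relu(self.conv3(torch.cat([x1, x2], dim=1)))        # concat1
        x4 = self.relu(self.conv4(torch.cat([x2, x3], dim=1)))        # concat2
        k = self.relu(self.conv5(torch.cat([x1, x2, x3, x4], dim=1)))  # concat3
        return k * hazy - k        # clean image generation, J = K*I - K, Eq. (4)
```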
3.2 Extracting Spatial Features
We use the VGG-16 pre-trained model [6] for spatial feature extraction from the haze-free images. In the traditional VGG-16 architecture, each successive layer extracts spatial features at some higher semantic level compared to the layer below. The VGG-16 architecture consists of 13 convolution layers with 5 pooling layers in between, followed by 3 fully connected layers [6]. VGG-16 is used in the proposed approach to take advantage of its small kernel size, which is useful for extracting minute spatial information from the frames during action. Moreover, the VGG-16 architecture is a shallow network with only 16 layers, which minimizes the chances of overfitting. The spatial features extracted by VGG-16 from the sampled frames are fed into a DB-LSTM architecture for obtaining the temporal features.
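A sketch of this spatial feature extraction step, assuming torchvision's pre-trained VGG-16 is used as a fixed feature extractor, is shown below; the 4096-dimensional output follows from dropping the final classification layer.

```python
# Per-frame spatial descriptors from a pre-trained VGG-16 backbone.
import torch
import torchvision.models as models

vgg = models.vgg16(pretrained=True).eval()
feature_extractor = torch.nn.Sequential(
    vgg.features,                             # 13 conv layers with 5 max-pool layers
    torch.nn.AdaptiveAvgPool2d((7, 7)),
    torch.nn.Flatten(),
    *list(vgg.classifier.children())[:-1],    # drop the final 1000-way layer
)

@torch.no_grad()
def frame_descriptor(frame_batch):            # (B, 3, 224, 224), ImageNet-normalized
    return feature_extractor(frame_batch)     # (B, 4096) spatial features per frame
```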
3.3 Bidirectional LSTM (DB-LSTM) for Temporal Features
Long Short-Term Memory (LSTM) networks have the ability to extract temporal information from sequential data (e.g., frames), along with the contextual information from the frames. LSTM networks are recurrent networks (RNN) capable of learning long-term dependencies across frames. The LSTM blocks consist of various gates such as input, output, and forget gates which control the long-term sequence patterns. Bidirectional LSTMs consist of two RNNs stacked on top of one another, of which one RNN goes in the forward direction and the other in the backward direction. The combined output is then computed based on the hidden states of both RNNs. In our proposed model, we use two LSTM layers for both the forward and backward passes, as shown in Fig. 1. The extracted feature vectors are fed into the bidirectional LSTM network, which then outputs the temporal and spatial interpretations of the input vectors. The obtained interpretations of the visual information are then sent as inputs to a softmax layer. The softmax layer outputs the probabilities for each class, and the class with the highest probability is the predicted action depicted in the video. For training, we first take each frame of the video and extract the features from it. Similarly, features are extracted from every frame in the video and then the
whole stack of features is saved into an npy file. Hence, in the end, we have all the features of the videos in npy files. Next we discuss the experimental validation of the proposed approach.
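The DB-LSTM classifier described above could be sketched as follows; the hidden size and the 4096-dimensional feature input are illustrative assumptions.

```python
# Two-layer bidirectional LSTM over per-frame feature vectors (e.g., the
# VGG-16 descriptors stored in the .npy files mentioned above).
import torch
import torch.nn as nn

class DBLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=4096, hidden=256, num_classes=101):
        super().__init__()
        # num_layers=2 stacks two LSTM layers in both the forward and
        # backward directions, as in the DB-LSTM described above.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, frame_feats):              # (B, T, feat_dim)
        out, _ = self.lstm(frame_feats)          # (B, T, 2*hidden)
        logits = self.fc(out[:, -1, :])          # last time step from both directions
        return logits                            # softmax is applied in the loss
```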
4 Experiments and Results
There is no dataset available in the literature containing labeled human actions in hazy videos. Hence, we have applied synthetic haze on a well-known dataset for human action recognition called the UCF101 dataset [15]. The UCF101 dataset with synthetic haze applied on each frame can be made available for further experiments on similar topics. We first discuss the method of applying haze on the videos and then discuss the results obtained by the proposed method.

4.1 Hazy Dataset
The UCF101 dataset is one of the largest and most popular datasets for human action recognition. The dataset consists of 101 action classes, over 13,000 video clips (each representing an action) and around 27 h of video data. The database consists of realistic user-uploaded videos containing various challenges like camera motion, cluttered background, change of view point, change of illumination, etc. For every second frame in each of the clear videos taken from the UCF101 dataset, we apply synthetic haze. We assume that the synthetic haze follows a Gaussian distribution on the pixel intensity values. A hazy video is obtained by repeating the process of applying haze to each frame of all the videos in the UCF101 dataset. We finally generate a synthetic hazy video dataset consisting of 98 videos for each of the 101 classes in the UCF101 dataset. The amount of haze P(x) is obtained from the standard density function of the Gaussian distribution as follows:

P(x) = (1 / (σ sqrt(2π))) e^(−(x − μ)² / (2σ²)),    (6)

where x is the pixel intensity value, μ is the mean intensity value of the pixels and σ is the standard deviation of the pixel intensity values. Next we discuss the results of applying the proposed method on the hazy dataset.
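A sketch of this synthetic haze generation is given below; the Gaussian mean, standard deviation and blending weight are illustrative assumptions, since the exact values are not specified here.

```python
# Applying Gaussian-distributed synthetic haze (Eq. 6) to every second frame
# of a clear video, using OpenCV for video I/O.
import cv2
import numpy as np

def add_gaussian_haze(frame, mean=180.0, sigma=25.0, alpha=0.6):
    haze = np.random.normal(mean, sigma, frame.shape).astype(np.float32)
    hazy = (1 - alpha) * frame.astype(np.float32) + alpha * haze   # alpha is an assumption
    return np.clip(hazy, 0, 255).astype(np.uint8)

def hazify_video(in_path, out_path):
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(add_gaussian_haze(frame) if idx % 2 == 0 else frame)
        idx += 1
    cap.release()
    out.release()
```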
4.2 Results and Discussions
The proposed model is tested on both the clear UCF101 dataset and the hazy synthetic UCF101 dataset. Table 1 shows the results of applying the proposed method on the synthetic hazy dataset obtained from the UCF101 dataset, compared to the state of the art. Although the proposed model cannot outperform the current state of the art on clear videos, due to the AOD-Net architecture included in the proposed model, it can outperform the state-of-the-art methods for human action recognition when applied on hazy videos. We can also observe from Table 1 that the inclusion of the AOD-Net architecture into the proposed model has significantly improved the performance of the model.
Figure 2 shows the accuracy curve (over the number of epochs) of the proposed model when trained on the hazy UCF101 dataset. Figure 3 shows the loss curve with respect to epochs. From Figs. 2 and 3 we can conclude that the amount of overfitting is low. Based on the experiments carried out in this study, we can argue that the inclusion of a shallow architecture for obtaining a clear representation of individual frames can be an effective way to analyze the content of a hazy video. The analysis of the results presented here leads to two important conclusions. First, even a shallow architecture is sufficient to recognize human actions effectively in hazy videos; the task does not require deep architectures that lead to overfitting. Second, the combination of CNN and LSTM architectures can be a key for extracting spatio-temporal features efficiently, with less computational overhead.

Table 1. Results of applying the proposed method on the synthetic hazy dataset obtained from UCF101 dataset, compared to the state-of-the-art methods for human action recognition.

Methods                                      Accuracy (%)
Hanson et al. [14]                           91.24
Crasto et al. [13]                           93.96
Proposed Model Without Dehazing Network      78.07
Proposed Method with Dehazing Network        94.91
Fig. 2. Accuracy (y-axis) of the proposed method with respect to the number of epochs (x-axis) on UCF 101.
Fig. 3. Loss (y-axis) of the proposed method with respect to the number of epochs (x-axis) on UCF 101.
5 Summary and Future Work
In this paper, a system for human action recognition in haze is proposed. The proposed model first dehazes the sampled frames, learns the spatial features and then feeds the spatial features into a DB-LSTM network for classification of actions. The proposed model is the first attempt to provide a unified end-to-end model that dehazes the hazy video in order to recognize the action. Sampled frames are fed into a pre-trained and fine-tuned AOD-Net for dehazing. The spatial features are extracted from the dehazed video frames by VGG-16 and are fed into the DB-LSTM, where two layers are stacked on both the forward and backward passes of the LSTM. The stacked layers for the forward and backward passes help in recognizing the complex hidden sequential patterns of features across frames. The experimental results indicate that the recognition score of the proposed method provides a significant improvement in performance when applied on a synthetic hazy dataset, compared to the performances of methods without a dehazing framework. In future, more sophisticated action recognition algorithms can be tried on the output of the AOD-Net architecture, to get even better accuracy on hazy videos.

Acknowledgement. The authors wish to thank NVIDIA for providing a TITAN X GPU which was used for conducting experiments related to this study.
References 1. Kopf, J., et al.: Deep photo: model-based photograph enhancement and viewing. In: SIGGRAPH Asia (2008) 2. Fattal, R.: Single image dehazing. ACM Trans. Graph 27(3), 72:1–72:9 (2008) 3. Narasimhan, S.G., Nayar, S.K.: Interactive deweathering of an image using physical models. In: Workshop on Color and Photometric Methods in Computer Vision (2003) 4. Tan, R.: Visibility in bad weather from a single image. In: CVPR (2008) 5. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior. In: CVPR. IEEE (2009) 6. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 7. Ren, Z., et al.: Bidirectional homeostatic regulation of a depression-related brain state by gamma-aminobutyric acidergic deficits and ketamine treatment. Biol. Psychiatry 80, 457–468 (2016) 8. Li, B., Peng, X., Wang, Z., Xu, J., Feng, D.: AOD-net: all-in-one dehazing network (2017) 9. Ren, W., et al.: Single image dehazing via multi-scale convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 154–169. Springer, Cham (2016). https://doi.org/10.1007/978-3-31946475-6 10 10. Cai, B., Xu, X., Jia, K., Qing, C., Tao, D.: DehazeNet: an end-to-end system for single image haze removal. IEEE Trans. Image Process. 25(11), 5187–5198 (2016) 11. Santra, S., Mondal, R., Chanda, B.: Learning a patch quality comparator for single image dehazing. IEEE Trans. Image Process. 27(9), 4598–4607 (2018) 12. Hong, J., Cho, B., Hong, Y.W., Byun, H.: Contextual action cues from camera sensor for multi-stream action recognition. Sensors 19(6), 1382 (2019) 13. Crasto, N., Weinzaepfel, P., Alahari, K., Schmid, C.: MARS: motion-augmented RGB stream for action recognition. In: CVPR, pp. 7882–7891 (2019) 14. Hanson, A., PNVR, K., Krishnagopal, S., Davis, L.: Bidirectional convolutional LSTM for the detection of violence in videos. In: Leal-Taix´e, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11130, pp. 280–295. Springer, Cham (2019). https://doi. org/10.1007/978-3-030-11012-3 24 15. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human action classes from videos in the wild. Report no. CRCV-TR-12-01 (November 2012) 16. Borkar, K., Mukherjee, S.: Video dehazing using LMNN with respect to augmented MRF. In: ICVGIP 2018, pp. 42:1–42:9. ACM (2018) 17. Kong, Y., Fu, Y.: Human action recognition and prediction: a survey. arXiv:1806.11230 (2018) 18. Mukherjee, S., Biswas, S.K., Mukherjee, D.P.: Recognizing interactions between human performers by dominating pose doublet. Mach. Vis. Appl. 25(4), 1033– 1052 (2014) 19. Mukherjee, S., Singh, K.K.: Human action and event recognition using a novel descriptor based on improved dense trajectories. Multimedia Tools Appl. 77(11), 13661–13678 (2018) 20. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003) 21. Vinodh, B., Sunitha Gowd, T., Mukherjee, S.: Event recognition in egocentric videos using a novel trajectory based feature. In: ICVGIP 2016, pp. 76:1–76:8 (2016)
Human Action Recognition from 3D Landmark Points of the Performer

Snehasis Mukherjee1(B) and Chirumamilla Nagalakshmi2
1 Shiv Nadar University, Greater Noida, India
2 IIIT SriCity, Chittoor, India
[email protected]
Abstract. Recognizing human actions is an active research area, where pose of the performer is an important cue for recognition. However, applying the 3D landmark points of the performer in recognizing action, is relatively less explored area of research due to the challenge involved in the process of extracting 3D landmark points from single view of the performers. With the recent advancements in the area of 3D landmark point detection, exploiting the landmark points in recognizing human action, is a good idea. We propose a technique for Human Action Recognition by learning the 3D landmark points of human pose, obtained from single image. We apply an autoencoder architecture followed by a regression layer to estimate the pose parameters like shape, gesture and camera position, which are later mapped to the 3D landmark points by Skinned Multi Person Linear Model (SMPL model). The proposed method is a novel attempt to apply a CNN based 3D pose reconstruction model (autoencoder) for recognizing action. Further, instead of using the autoencoder as a classifier to classify to 3D poses, we replace the decoder part by a regressor to obtain the landmark points, which are then fed into a classifier. The 3D landmark points of the human performer(s) at each frame, are fed into a neural network classifier as features for recognizing action.
Keywords: Human mesh recovery · SMPL · 3D landmark · Action recognition

1 Introduction and Related Works
Recognizing human actions in video is an active area of research in computer vision, due to the potential applications in surveillance, robotics, elderly and child monitoring and several others [1]. Human poses in consecutive frames are considered the most important cue for action recognition in video [2]. Semantic information captured from successive frames in the form of human pose, and the dynamics of different body parts in subsequent frames of a video during action, are popularly followed for recognizing action [2].
Optical flow can be an appropriate measure of the dynamics of body parts of human performer during action [3–6]. The direction of the optical flow is divided into several sections and a histogram is obtained from the optical flow information from each frame for action recognition in [3]. However, video captured by a hand-held device often suffer from camera jitter, which produce motion measure at the background pixels, affecting the recognition process. Mukherjee et al. attempted to reduce the background motion information by pointwise multiplying the optical flow matrix with the image gradient, to produce Gradient Weighted Optical Flow (GWOF) vector [4]. Wang et al. reduced the background motion measure by computing gradient over the optical matrix, to produce Warped Optical Flow (WOF) [5]. The gradient is calculated on the GWOF matrix for each frame to further reduce the background motion measure in [6], for action recognition. Optical flow based methods cannot describe the complete semantics of the motion information inherent in a video. In order to emphasize on the semantics, motion information should be captured only from those pixels which are both spatially and temporally important for describing the motion of the performer. Laptev et al. proposed a method for human action recognition using spatiotemporal interest points (STIP) [7], which can be considered as the 3 dimensional form of the SIFT descriptor. The ability of the STIP based features to extract the semantics from the frames motivated several computer vision scientists to use the STIP based features for human action recognition [8]. However, STIP based features cannot preserve the global motion information of a video during action. An attempt was made in [9] to combine the benefit of global motion information provided by optical flow and semantic information captured at the STIP points for action recognition. The Gradient Flow-Space Time Interest Point (GF-STIP) feature proposed in [9] relies on the GWOF vectors calculated at the STIP points. Nazir et al. fed the SIFT features into a Bag-of-Words model for classification of actions [10]. After the introduction of deep learning based techniques, automatic extraction of semantic features directly from images and videos motivated the computer vision researchers to apply deep learned features to solve several vision related problems including tracking, surveillance, object classification and many more. Several successful attempts have been made to apply deep learned features for human action recognition [11]. Earlier deep learning methods used to apply spatial CNN over a set of sampled frames for action recognition [12]. However, spatial CNN features are often unable to preserve the temporal information during action. The 3D CNN features are applied for action recognition due to the capability of preserving temporal information during action [13]. Spatio-temporal Residual Network (3D ResNet-101) was applied for action recognition in [14]. Li et al. proposed a parallel 2D convolution along three orthogonal views to capture the spatial appearance and motion cues of the performer simultaneously during action [15]. However, 3D CNN based approaches often lead to miss-classification of actions due to temporal redundancy in cases where a significant portion of the
video does not have sufficient change in content. In order to reduce the temporal redundancy in videos, Wu et al. trained the popular 3D ResNet-152 network on the compressed video [16]. Although the 3D CNN based features generally works well to capture the spatio-temporal features from video for action classification, the kinetics of the different body parts remain unexplored to some extent, in 3D CNN features, as shown in [14]. Rather, 2D CNN features along with kinetics information reduce the overfitting problem and enhance the performance of the action recognition process [14]. Observing the importance of minute motion cues in video for action recognition, Shou et al. captured the motion cues from the video by proposing a generator network and training the network with optical flow [17]. In order to reduce the overfitting problem with the 3D CNN based methods on the existing benchmark datasets, a few efforts have been made by extracting spatial features using a 2D CNN regressor and then feeding the output vector provided by the 2D CNN architecture into a Long Short Term Memory (LSTM) model [18,19]. In [18], the dense trajectories are estimated from the video using CNN and the trajectories are fed into an LSTM for recognizing actions. Li et al. applied later score fusion of CNN and LSTM features for action recognition from 3D skeleton structure of the performer [19]. Materzynska et al. [28] exploited the compositionality in human actions with objects, reducing the computational overhead. However, blindly using deep learned features often lead to ignore the semantic information of the video during action. In order to capture the benefit of semantic information from the video for action recognition, a few efforts have been made to combine hand-crafted and deep learned features [20,21]. In [20] deep manifold learning is applied for action recognition. In [21], Local Gradient Ternary Pattern feature is fused with the spatio-temporal feature captured by Inception-Resnet for recognizing action. However, [19] clearly shown that, 3D information of human pose works effectively for recognizing action. Zhang et al. [27] proposed a semantics-guided 3D CNN for action recognition, where semantics of the joint points are provided to the network to find the inter-relationships among the joints. But in RGB videos the skeleton information is absent and hence, 3D skeleton structure is difficult to obtain. An accurate estimate of the 3D structure of the performer can be helpful in describing the 3D poses of the performer for recognizing action, due to two major reasons: the view-point invariant property of the 3D representation of pose and the ability to handle self-occlusion. However, obtaining 3D poses from 2D videos has been a challenging research problem during the past few decades and was far from being a solved problem. Hence, researchers have not been motivated to obtain 3D structure of 2D human pose of consecutive frames and use them for action recognition. However, due to the capability of the recent deep learning architecture to produce an accurate 3D mesh of the human performer [22] has motivated us to use such 3D mesh for action recognition.
The proposed method first applies HOG features to localize the human performer in each frame [23], as a preprocessing step. The preprocessed frames are passed to an autoencoder similar to [24] followed by a regression layer which estimates the parameters of the autoencoder. The estimated parameters are mapped to 3D landmark points by the Skinned Multi Person Linear (SMPL) model [22]. The Estimated 3D landmark points are fed into an Artificial Neural Network (ANN) model as features for classification of actions similar to [25]. The reason behind choosing ANN as the classifier is that, the study in [25] shows the efficacy of the ANN model when applied on the 3D points, over some other well-known classifiers. In this work, we experiment with varying number of dense layers of ANN to find the optimal number of dense layers of the proposed ANN for classification of actions. In this paper our contributions are two folds. – The proposed method is the first attempt to apply a CNN based 3D pose reconstruction model for action recognition. The 3D pose representation of the performer in the consecutive frames provides the dynamics of the different body parts along the three dimensions. – Unlike [24], we replace the decoder part of the autoencoder by a regressor to obtain the 3D landmark points of the performer, instead of the whole 3D structure. The 3D landmark points in the consecutive frames preserve the dynamics of the performer during action. Next we discuss the proposed method for human action recognition.
Fig. 1. Overall diagram of the proposed method for action recognition.
2 Proposed Method
The proposed method for human action recognition from 3D features consists of two major steps: estimation of the 3D landmark points of the human performer and classification of the action based on the 3D landmark points.

2.1 Extraction of 3D Landmark Points
The proposed framework for 3D pose estimation can perform well if the performer is located at the center of the image. Hence, proper preprocessing of the image is needed to make sure that the human performer is at the center. We apply HOG features to detect and localize the human performer. This preprocessed image is then passed to the autoencoder, which reduces the resolution of the image without much loss of information. The resulting low dimensional feature vector is passed to a regression layer which predicts the pose and camera properties. Using the SMPL model, 3D landmark points are estimated from the pose and shape parameters extracted from the regression layer. The extracted 3D landmark points are then used to classify the pose performed by the person, with the predicted 3D joints as features. We train an autoencoder on the bounding boxes containing the human performers extracted from each frame of the video, for obtaining the 3D landmark points as the spatial features for action classification. We follow [24] for obtaining the 3D landmark points from the bounding boxes. We first apply HOG features to obtain the bounding boxes around the performers in each frame. We train the autoencoder architecture shown in Fig. 1 with the bounding boxes. This autoencoder reconstructs a full 3D mesh of the human pose from a single view of an image frame, where the performer is centered in the frame-window. We consider a pool of 3D meshes of human bodies with various shapes and poses as unpaired data. As shown in Fig. 1, the human-centered (preprocessed) frames are passed sequentially to the feature extractor, which is trained during 3D pose estimation. The autoencoder feature extractor reduces the resolution of the input image frame, so that only useful low-dimensional features of the frame are extracted. These extracted features are passed to an iterative 3D regression module whose main objective is to infer the parameters of the 3D human shape and the camera; the 3D joint points are then projected onto the 2D space. The interesting part of this training process is to identify whether the inferred parameters belong to a human or not. This is done using a discriminator network, which is shown in Fig. 1 with an arrow. The discriminator's main objective is to identify the 3D parameters from the unpaired dataset. The prior of geometry is used by models that only predict 3D joint locations. When ground truth 3D information is available, we can use it as an intermediate loss. To summarize, if 3D information is available, the overall objective is a weighted sum of the 3D feature loss, the reprojection loss and the discriminator loss:

L = λ(L_reprojection + L_3D) + L_des,   (1)
otherwise, when 3D information is not available, the loss is the weighted sum excluding the 3D feature loss from (1):

L = λ(L_reprojection) + L_des,   (2)
where λ controls the relative importance of each objective and L_des is the discriminator loss, i.e., the L2 distance between the features of the real pose and the reconstructed pose. L_reprojection is the reprojection loss and L_3D represents the 3D feature loss. The 2D projection of the 3D joints is represented as X ∈ ℝ^{3P}, generated by the linear regression module, where P is the feature dimension. The rotation R ∈ ℝ^{3P}, the translation T ∈ ℝ^2 and the scalar S ∈ ℝ^1, and the 3D reconstruction of the human body is represented as an 85-dimensional vector. Each iteration through the regression module infers the parameters R, T and S. The 2D joints are calculated from the 3D joints using orthographic projection. The loss at the regression module is calculated as follows: if the 2D joint is visible, then

L_2D = Σ_i ||(X_i − X̂_i)||,   (3)

otherwise its projection loss is 0. Here X_i ∈ ℝ^{2K}, X̂_i is the predicted and X_i the original i-th joint point. Based on the calculated loss the weights are updated. Finally, using these 3D joints and the SMPL (Skinned Multi-Person Linear) model, the 3D pose of the human is reconstructed. Next we use an Artificial Neural Network (ANN) to classify the extracted features of the video.
Table 1. Training and validation (Val) accuracies (in %) obtained by the proposed ANN classifier using the 3D landmark points as features.

Epochs  No. of frames  Training accuracy  Val accuracy  Test accuracy
100     10             96.98              96.89         96.85
300     10             97.93              97.54         97.53
300     20             97.20              96.70         96.70
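To make the training objective in (1)–(3) concrete, the following is a minimal sketch of how the weighted loss could be assembled; the tensor names and the visibility mask are assumptions for illustration, not the exact implementation used here (λ = 1 follows Sect. 3.2).

```python
import torch

def reprojection_loss(joints_2d_pred, joints_2d_gt, visible):
    # Eq. (3): sum the distances over visible 2D joints only.
    diff = torch.norm(joints_2d_pred - joints_2d_gt, dim=-1)  # (batch, K)
    return (diff * visible).sum()

def total_loss(l_reproj, l_3d, l_des, lam=1.0, has_3d=True):
    # Eq. (1) when 3D ground truth exists, Eq. (2) otherwise.
    if has_3d:
        return lam * (l_reproj + l_3d) + l_des
    return lam * l_reproj + l_des
```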
2.2 Classification of Actions Using 3D Landmark Points
The 85-dimensional feature vector representing the 3D landmarks of the human performer, obtained from the regression layer of the autoencoder followed by SMPL, is fed into an ANN for classification. The proposed ANN consists of a series of three dense layers, each followed by a ReLU activation layer. This is again followed by 2 dense layers, as shown in Fig. 1. The last dense layer is the classification layer for classifying actions. We replace the classification layer of the autoencoder proposed in [24] by the regression layer followed by the proposed ANN. The classification layer of [24] classifies the human poses into a dictionary of poses, where the misclassification percentage for classifying the poses accumulates with the misclassification percentage for classifying actions, if we follow [24]. Instead, we use the whole regression vector for classification of actions, which reduces the rate of misclassification. Next we illustrate the experiments performed with the proposed method for action recognition.

Table 2. Accuracies (in %) obtained by the proposed method compared to the state-of-the-art.

Methods   Li et al. [20]  Shou et al. [17]  Proposed method
Accuracy  94.50           96.50             97.53
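A minimal PyTorch sketch of the classifier described above is given below; the hidden-layer width and the use of plain ReLU are assumptions for illustration, since the paper only fixes the 85-dimensional input, the number of dense layers and the 101 output classes (Sect. 3.2 mentions Leaky ReLU for the experiments).

```python
import torch.nn as nn

class ActionANN(nn.Module):
    """Five dense layers: three followed by ReLU, then two plain dense layers,
    the last one being the 101-way classification layer (UCF101)."""
    def __init__(self, in_dim=85, hidden=256, num_classes=101):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, num_classes),  # classification layer
        )

    def forward(self, x):
        return self.net(x)
```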
3 Dataset and Experiments
We first provide a brief description of the dataset used in the paper, followed by an illustration of the experimental setup used to evaluate the proposed method.

3.1 Dataset
We test the performance of the proposed human action recognition method on the popular UCF101 dataset [26]. The UCF101 dataset contains 101 action categories with a total of 13320 video clips of human actions. It is an unbalanced dataset with a varying number of videos across action classes, making the dataset challenging. The human action classes provided in the UCF101 dataset can be grouped into five subcategories: Sports actions, Human-object interactions, Human-human interactions, Solo human actions and Playing musical instruments. The dataset exhibits a large variation in viewpoint, camera jerk, scale, illumination and background texture, both within-video and across-video. The huge volume and the variety of classes of the UCF101 dataset make it suitable for experimenting with a deep model. We use all the 101 classes of the UCF101 dataset both for training and testing. Next we summarize the experimental setup used for validation of the proposed method.

3.2 Experimental Setup
According to the study conducted in [14], the existing benchmark datasets for the action recognition task are prone to overfitting, especially when a 3D deep network architecture is applied. Although the proposed method is not a 3D deep network, we still guard against overfitting by dividing the dataset into training, validation and testing sets. We randomly select 60% of the videos from each category for training, 20% for validation and the remaining 20% for testing. We first train the proposed network with the training data. After around 24 epochs the loss gets saturated; then, after 24 epochs, we train the proposed ANN model with the validation set and fine-tune the parameters accordingly. In order to check for any overfitting problem, we compare the accuracies obtained by the proposed method on both training and testing data. We take the value of λ as 1 in all our experiments. We experiment with different numbers of dense layers of the proposed ANN classifier and obtain the optimal number of dense layers as 5. We use Leaky ReLU as the activation function for the proposed ANN. The proposed method relies on only a few landmark points and hence a simple ANN is enough to classify the actions, reducing the chances of overfitting. Next we discuss the results of applying the proposed method on the benchmark dataset, compared to the state-of-the-art.

Fig. 2. Training and testing accuracy over 100 epochs.
Fig. 3. Training and testing loss graph over 100 epochs.
4 Results
The training, validation and testing accuracies of the proposed method are shown in Table 1. We can observe from Table 1 that the differences between the training, validation and testing accuracies are very small, which may be an indication of little overfitting in the classification task. We note that the number of frames per video used to classify actions can be kept around 10 to get the highest accuracy. Increasing the number of frames per video may confuse the classifier, as seen in the third row of Table 1. Figure 2 shows the training and validation accuracies of the proposed method versus the number of epochs. Figure 3 shows the training and validation losses of the proposed method versus the number of epochs. Figures 2 and 3 indicate little overfitting by the proposed method, with a clear sign of saturation after only 30 epochs, even on the UCF101 dataset, which is known for causing overfitting in most deep learning approaches. Table 2 shows the accuracy obtained by the proposed method compared to the state-of-the-art. Clearly, the proposed method outperforms the state-of-the-art due to the 3D pose information. From this study we can conclude that 3D pose information of the human performer plays an important role in recognizing human actions. With the help of an efficient 3D pose recognition system, even a simple ANN model can show good performance in recognizing actions.
5 Conclusions
We have proposed a method for human action recognition using a 3D representation of 2D human poses. We can observe from this study that even a simple classifier can classify human actions if we make use of the 3D representation of poses obtained from an efficient system for 3D mesh generation. Extracting such 3D features has been a challenging problem in its own right, which is the main reason why researchers have not explored 3D pose based action recognition. However, this study shows that, with the recent advancements in methods for estimating 3D poses, action recognition can be made more accurate. In future, more sophisticated classifiers may be applied on the 3D poses to enhance the performance of the proposed method. Further, analyzing the results of the proposed method through different metrics often provides valuable insights about the approach; in future we will explore the behaviour of the proposed method with respect to different evaluation metrics. We also plan to perform an ablation study to determine a suitable value of λ.
References
1. Fan, Z., Ling, S., Jin, X., Yi, F.: From handcrafted to learned representations for human action recognition: a survey. Image Vis. Comput. 55, 42–52 (2016)
2. Maryam, Z., Robert, B.: Semantic human activity recognition: a literature review. Pattern Recogn. 48(8), 2329–2345 (2015)
3. Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R.: Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: CVPR, pp. 1–8. IEEE (2009)
4. Mukherjee, S., Biswas, S.K., Mukherjee, D.P.: Recognizing human action at a distance in video by key poses. IEEE Trans. CSVT 21(9), 1228–1241 (2011)
5. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV, pp. 3551–3558. IEEE (2013)
6. Mukherjee, S.: Human action recognition using dominant pose duplet. In: Nalpantidis, L., Krüger, V., Eklundh, J.-O., Gasteratos, A. (eds.) ICVS 2015. LNCS, vol. 9163, pp. 488–497. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20904-3_44
7. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR, pp. 1–8. IEEE (2008)
8. Das Dawn, D., Shaikh, S.H.: A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis. Comput. 32(3), 289–306 (2015). https://doi.org/10.1007/s00371-015-1066-2
9. Vinodh, B., Sunitha, G.T., Mukherjee, S.: Event recognition in egocentric videos using a novel trajectory based feature. In: ICVGIP, pp. 76:1–76:8. ACM (2016)
10. Nazir, S., Yousaf, M.H., Nebel, J.-C., Velastin, S.A.: A bag of expression framework for improved human action recognition. Pattern Recogn. Lett. 103, 39–45 (2018)
11. Herath, S., Harandi, M.T., Porikli, F.M.: Going deeper into action recognition: a survey. Image Vis. Comput. (2017). https://doi.org/10.1016/j.imavis.2017.01.010
12. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR, pp. 1–9. IEEE (2016)
13. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. In: ICML, pp. 1–8 (2010)
14. Hara, K., Kataoka, H., Satoh, Y.: Can spatio-temporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: CVPR, pp. 6546–6555. IEEE (2018)
15. Li, C., Zhong, Q., Xie, D., Pu, S.: Collaborative spatiotemporal feature learning for video action recognition. In: CVPR, pp. 7872–7881. IEEE (2019)
16. Wu, C.-Y., Zaheer, M., Hu, H., Manmatha, R., Smola, A.J., Krahenbuhl, P.: Compressed video action recognition. In: CVPR, pp. 6026–6035. IEEE (2018)
17. Shou, Z., et al.: DMC-Net: generating discriminative motion cues for fast compressed video action recognition. In: CVPR, pp. 1–10. IEEE (2019)
18. Singh, K.K., Mukherjee, S.: Recognizing human activities in videos using improved dense trajectories over LSTM. In: Rameshan, R., Arora, C., Dutta Roy, S. (eds.) NCVPRIPG 2017. CCIS, vol. 841, pp. 78–88. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-0020-2_8
19. Li, C., Wang, P., Wang, S., Hou, Y., Li, W.: Skeleton-based action recognition using LSTM and CNN. In: ICME Workshops, pp. 585–590. IEEE (2017)
20. Li, C., et al.: Deep manifold structure transfer for action recognition. IEEE Trans. Image Process. 28, 4646–4658 (2019)
21. Uddin, M.A., Lee, Y.-K.: Feature fusion of deep spatial features and handcrafted spatiotemporal features for human action recognition. Sensors 19(7), 1599 (2019). https://doi.org/10.3390/s19071599
22. Loper, M., Mahmood, N., Romero, J., Gerard, P.-M., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34, 248:1–248:16 (2015)
23. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 1–8. IEEE (2005)
24. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR, pp. 1–10. IEEE (2018)
25. Nagalakshmi, C., Mukherjee, S.: Classification of yoga asana from single image by learning 3D view of human pose. In: ICVGIP Workshops. Springer (2018). https://doi.org/10.1007/978-3-030-57907-4_1
26. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human action classes from videos in the wild. Report no. CRCV-TR-12-01 (November 2012)
27. Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., Zheng, N.: Semantics-guided neural networks for efficient skeleton-based human action recognition. In: CVPR, pp. 1112–1121 (2020)
28. Materzynska, J., Xiao, T., Herzig, R., Xu, H., Wang, X., Darrell, T.: Something-else: compositional action recognition with spatial-temporal interaction networks. In: CVPR, pp. 1049–1059 (2020)
A Combined Wavelet and Variational Mode Decomposition Approach for Denoising Texture Images

R. Gokul1, A. Nirmal1, G. Dinesh Kumar1, S. Karthic1, and T. Palanisamy2

1 Department of Electronics and Communication Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
2 Department of Mathematics, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
t [email protected]

Abstract. Edges and textures are important features in texture analysis that help to characterize an image. Thus, the edges and textures must be retained during the process of denoising. In this paper, we present a combined wavelet decomposition and Variational Mode Decomposition (VMD) approach to effectively denoise texture images while preserving the edges and fine-scale textures. The performance of the proposed method is compared with that of wavelet decomposition, VMD and a combined VMD-WT technique. Although VMD-WT outperforms VMD and wavelet decomposition, it is highly dependent on the choice of parameters. The proposed method overcomes the above limitation and also performs better than wavelet decomposition and VMD.
Keywords: Denoising · Wavelet · VMD · Texture

1 Introduction
Texture analysis refers to the characterization, segmentation and classification based on the texture content of an image. It helps to quantify the intuitive qualities of an image as a function of spatial variation in pixel intensities and is critical in image analysis. Typically, edge detection is the first step in texture analysis since it helps to characterize the texture complexity of a region based on the number of edge pixels [1]. Further, edges and textures have a certain overlap with noise in the frequency domain of an image. Therefore, while denoising it is important to retain the edges and textures as they are crucial in texture analysis. Even though a plethora of techniques exist for natural image denoising, many denoising algorithms still fail to preserve the fine-scale textures [2] and not many approaches have been proposed to reduce noise in texture images except for a few [3,4]. However, the commonly used decomposition method based on Wavelet Transform (WT) and the recently proposed Variational Mode Decomposition (VMD) [5–8] are two techniques that can balance both denoising and
preservation of edges and textures to an extent [9]. Nevertheless, there is a better scope for further research in retaining edges and textures while we denoise texture images. Conventionally, while denoising a signal using WT, only the approximate coefficients are used to reconstruct the signal and similarly while using VMD the first few modes are alone used. But in the case of denoising texture images, the detail coefficients of WT or the higher modes of VMD may also contain information about the edges and textures along with noise. Therefore, it is essential to capture that information about the edges and textures by prudently removing the noise. This necessitates further analysis of the detail coefficients of WT and the higher modes of VMD. Thus, it is understood that denoising of texture images with minimum loss of information about edges and textures may be possible only by a two-stage analysis. The aforementioned analysis can be carried out using WT or VMD. Thus, the complete analysis of texture images can be executed by either one of the methods illustrated in Fig. 1.
Fig. 1. Two-stage analysis using WT and VMD (a) VMD-VMD (b) VMD-WT (c) WT-WT (d) WT-VMD
A handful of recent works [10–12] have used WT over the signal decomposed by VMD, and the same has been extended to images [13]. This hybrid VMD-WT technique yields better results than its counterparts. Nevertheless, the first-stage decomposition of the texture image with VMD is not found to be adaptive across images, in the sense that the parameters have to be manually tuned. Further, it is observed that a two-stage analysis with WT at the second stage fails unless the choice of mother wavelet and the level of decomposition are appropriate. Therefore, we propose a combined WT-VMD approach (see Fig. 1(d)) for denoising texture images, anticipating that it overcomes the aforementioned failures while making use of the advantages of VMD and wavelet transforms. The rest of the paper is organized as follows: a brief review of VMD and WT is presented in Sect. 2. The WT-VMD technique is proposed in Sect. 3. The experimental results and discussions are presented in Sect. 4. The paper concludes with Sect. 5.
2 Review

2.1 Variational Mode Decomposition
In this subsection, we give a brief review of VMD. VMD decomposes the image, a 2D signal f, into a number of modes called Intrinsic Mode Functions (IMF) u_k, for k = 1 to n, that have limited bandwidth around a centre frequency ω_k, which is to be determined along with the decomposition. The bandwidth of each mode u_k can be evaluated by first constructing a unilateral frequency spectrum by computing the analytic signal u_{AS,k}(x) = u_k(x) + u_k^H(x), where u_k^H(x) is the 2D Hilbert transform of u_k. For the sake of computational convenience, the frequency spectrum of each mode is shifted to baseband. This can be achieved by multiplying each mode with a complex exponential tuned to the respective centre frequency. The bandwidth can then be estimated by the H¹ Gaussian smoothness of the demodulated signal, which is nothing but the L2 norm of its gradient. The resulting constrained variational problem is given by [14],

min_{u_k, ω_k} { Σ_k α_k || ∇[ u_{AS,k}(x) e^{−j⟨ω_k, x⟩} ] ||_2^2 }   s.t.   Σ_k u_k(x) = f(x) : ∀x   (1)

The above minimization can be accomplished by converting it into an unconstrained problem using the augmented Lagrangian method [14],

L({u_k}, {ω_k}, λ) := Σ_k α_k || ∇[ u_{AS,k}(x) e^{−j⟨ω_k, x⟩} ] ||_2^2 + || f(x) − Σ_k u_k(x) ||_2^2 + ⟨ λ(x), f(x) − Σ_k u_k(x) ⟩   (2)

where α is the balancing parameter for the data-fidelity constraint and λ is the Lagrangian multiplier.

2.2 Wavelet Transform
Discrete Wavelet Transform (DWT) decomposes the given image f(x, y) into low and high frequency subbands at various levels. The approximation coefficients of the DWT correspond to the low-frequency content of the image, while the detail coefficients correspond to the high-frequency content. The approximation and detail coefficients at an arbitrary level j0 are given by

W_φ(j0, m, n) = (1/√(MN)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y) φ_{j0,m,n}(x, y)   (3)

W_ψ^i(j, m, n) = (1/√(MN)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y) ψ^i_{j,m,n}(x, y);   i = {H, V, D}   (4)

The W_φ coefficients define an approximation of f(x, y) at scale j0. The W_ψ coefficients add horizontal, vertical and diagonal details at scales j ≥ j0. φ(x, y) is the 2D scaling function and ψ(x, y) is the 2D wavelet function [15]. In the following sections, the proposed method for denoising texture images is discussed and its performance is compared with denoising using wavelet decomposition, VMD and a combined VMD-WT technique.
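As a concrete illustration of (3)–(4), a single-level 2D DWT of an image can be computed with the PyWavelets package; this short sketch is ours and is not part of the original (MATLAB) implementation.

```python
import numpy as np
import pywt

img = np.random.rand(640, 640)               # stand-in for a Brodatz texture image
cA, (cH, cV, cD) = pywt.dwt2(img, 'db2')     # approximation + H, V, D details (Eq. 3-4)
rec = pywt.idwt2((cA, (cH, cV, cD)), 'db2')  # perfect-reconstruction check
```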
3 Proposed Methods
The proposed method of denoising applies the two-dimensional wavelet transform to the texture image as its first stage. The approximation and the horizontal, vertical and diagonal detail coefficients of the texture image are extracted. The images corresponding to the horizontal, vertical and diagonal detail coefficients are then decomposed into two IMFs using VMD as the second stage. In each of the details, the IMF corresponding to high frequencies, represented by the last mode, is ignored. The remaining mode of each detail is combined and coalesced with the approximation coefficients of the wavelet transform obtained in the first stage, as shown in Fig. 2.
Fig. 2. Proposed method
Thus, it is explicit that the detail coefficients corresponding to high frequencies obtained from the two-dimensional wavelet transform undergo another low-pass filtering when the last mode is neglected after applying VMD. In this way, a part of the high-frequency information which would otherwise have been discarded is retained, and so the method is expected to preserve the edges and the fine-scale textures better. The efficiency of the proposed method is compared with a few of the existing denoising techniques.
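The pipeline in Fig. 2 can be summarized in the following hedged sketch. It uses PyWavelets for the first stage; vmd2d is a placeholder for a two-dimensional VMD routine returning K modes (e.g., an implementation of [14]), since no standard Python implementation is assumed here, and the experiments in the paper were actually carried out in MATLAB.

```python
import numpy as np
import pywt

def wt_vmd_denoise(noisy, wavelet='db2', K=2, alpha=500, tau=4):
    # Stage 1: single-level 2D DWT -> approximation + detail subbands.
    cA, (cH, cV, cD) = pywt.dwt2(noisy, wavelet)

    def keep_low_mode(subband):
        # Stage 2: decompose the subband into K modes and drop the last
        # (highest-frequency) mode, which is assumed to carry most of the noise.
        modes = vmd2d(subband, K=K, alpha=alpha, tau=tau)  # placeholder routine
        return sum(modes[:-1])

    cH, cV, cD = keep_low_mode(cH), keep_low_mode(cV), keep_low_mode(cD)
    # Coalesce the filtered details with the untouched approximation coefficients.
    return pywt.idwt2((cA, (cH, cV, cD)), wavelet)
```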
4 Experimental Results
In this section, we present and discuss the results obtained using the proposed method, WT, VMD and VMD-WT, and a comparison is made between them. Though there have been numerous attempts at denoising various images using WT and VMD, denoising of texture images has not been extensively studied. We have used the de facto standard of grayscale texture images, the dataset captured by Phil Brodatz [16], of size 640 × 640 pixels, for our study. Further, to examine the performance of our proposed method on sharp transitions, four different texture images from the Brodatz dataset are resized and merged into a single image. The noise is modelled as additive white Gaussian noise with variances 0.001 and 0.01. While working with WT, the image is decomposed into ten levels using Daubechies, Coiflet, and Symlet wavelets of various orders. In the case of VMD, the image is decomposed into two modes (K = 2) considering the parameters bandwidth constraint α = 200, 400, 500, 750, 1000 and noise tolerance τ = 2, 3, 4. The comparison is made using the standard metrics PSNR (Peak Signal-to-Noise Ratio), to measure the efficiency of the technique, and SSIM (Structural Similarity Index), to assess the perceived quality of the denoised image [17]. The wavelet decomposition and VMD of the images were performed with the help of MATLAB R2018b. For the sake of conciseness of the paper, we present the results of our work for two images, D12 (see Fig. 3(a)) and an image merged using D9, D13, D18 and D23 (see Fig. 3(b)).
Fig. 3. (a) D12 (b) Merged image.
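For reference, the noise model and the two evaluation metrics can be reproduced with scikit-image as in the short sketch below; the variable names are ours, and the exact implementation in the paper used MATLAB.

```python
from skimage import data, img_as_float
from skimage.util import random_noise
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

clean = img_as_float(data.camera())                      # stand-in texture image
noisy = random_noise(clean, mode='gaussian', var=0.001)  # additive Gaussian noise

psnr = peak_signal_noise_ratio(clean, noisy, data_range=1.0)
ssim = structural_similarity(clean, noisy, data_range=1.0)
print(f'PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}')
```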
4.1 Wavelet
In this subsection, the results obtained using wavelets are presented and discussed. We have used the wavelets Daubechies, Symlet, and Coiflet for denoising the images. In each case, the image has been decomposed into ten levels and each level has been analysed by considering the approximation of that particular level. The PSNR and SSIM of the denoised images using the Daubechies wavelet of order 2 (db2) for noise variances 0.001 and 0.01 are presented in Table 1 and Table 2.

Table 1. Results of Fig. 3(a) under different noise variance and different levels of db2 wavelet

Level (db2)  PSNR (0.001)  SSIM (0.001)  PSNR (0.01)  SSIM (0.01)
Level 1      19.78522      0.7624        19.75047     0.76155
Level 2      16.36223      0.50976       16.34633     0.50901
Level 3      13.87745      0.22753       13.86852     0.22703
Level 4      12.48221      0.10914       12.47594     0.10869
Level 5      11.74447      0.07868       11.7392      0.07824
Table 2. Results of Fig. 3(b) under different noise variance and different levels of db2 wavelet

Level (db2)  PSNR (0.001)  SSIM (0.001)  PSNR (0.01)  SSIM (0.01)
Level 1      19.0909       0.6866        19.05433     0.6863
Level 2      16.73719      0.34549       16.71651     0.3454
Level 3      15.47772      0.12651       15.462       0.1264
Level 4      14.79304      0.06624       14.77936     0.0661
Level 5      14.36107      0.05273       14.34869     0.0526
From the above tables, it can be inferred that the denoised image constructed using the first level approximation coefficients always results in the highest PSNR and SSIM indicating that denoising using level one is better when compared to the other levels of decomposition. Denoising using WT was performed using the wavelets Daubechies, Symlet, and Coiflet of various orders. The results of the Daubechies wavelet of orders 1–10 are presented in Table 3. From Table 3 and Table 4 it can be observed that as the order of the wavelet increases, better PSNR and SSIM values are acquired. The same is the case irrespective of the wavelet used.
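A hedged sketch of the level-wise, approximation-only denoising evaluated in Tables 1–4 is given below, using PyWavelets; reconstructing from the level-j approximation alone is realized here by zeroing the detail subbands, which is one straightforward way of implementing the procedure described above.

```python
import numpy as np
import pywt

def approx_only_denoise(noisy, wavelet='db2', level=1):
    # Multi-level 2D DWT, then discard all detail subbands before reconstruction.
    coeffs = pywt.wavedec2(noisy, wavelet, level=level)
    coeffs[1:] = [tuple(np.zeros_like(d) for d in details) for details in coeffs[1:]]
    return pywt.waverec2(coeffs, wavelet)
```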
Table 3. Results of Fig. 3(a) under different noise variance and using different Daubechies wavelets

Wavelet  PSNR (0.001)  SSIM (0.001)  PSNR (0.01)  SSIM (0.01)
db1      18.6978       0.71684       18.66833     0.71593
db2      19.78522      0.7624        19.75047     0.76155
db3      20.13112      0.77516       20.09446     0.77437
db4      20.29992      0.78061       20.26075     0.77987
db5      20.41262      0.78414       20.37401     0.78341
db6      20.47653      0.78596       20.43635     0.78518
db7      20.51202      0.78684       20.47343     0.78612
db8      20.54656      0.78772       20.50614     0.78701
db9      20.57826      0.78859       20.53639     0.7878
db10     20.60428      0.78917       20.564       0.78856
Table 4. Results of Fig. 3(b) under different noise variance and using different Daubechies wavelets

Wavelet  PSNR (0.001)  SSIM (0.001)  PSNR (0.01)  SSIM (0.01)
db1      18.54632      0.64478       18.51181     0.64444
db2      19.0909       0.6866        19.05433     0.6863
db3      19.28376      0.70102       19.24439     0.7008
db4      19.38005      0.70789       19.34086     0.70758
db5      19.44287      0.71194       19.40174     0.71157
db6      19.48146      0.71476       19.44253     0.71449
db7      19.51113      0.717         19.47145     0.71674
db8      19.539        0.71931       19.49668     0.7189
db9      19.56275      0.72124       19.52201     0.72086
db10     19.58101      0.72256       19.54078     0.72227
Table 5. Results of Fig. 3(a) using different wavelets of order 5

Wavelet  PSNR (0.001)  SSIM (0.001)  PSNR (0.01)  SSIM (0.01)
db5      20.41262      0.784141      20.37401     0.783409
sym5     20.39102      0.783643      20.35128     0.782877
coif5    20.61708      0.790133      20.57699     0.789475
Table 6. Results of Fig. 3(b) using different wavelets of order 5

Wavelet  PSNR (0.001)  SSIM (0.001)  PSNR (0.01)  SSIM (0.01)
db5      19.44287      0.711939      19.40174     0.711572
sym5     19.43674      0.711934      19.39831     0.711697
coif5    19.47854      0.714716      19.4392      0.714375
Table 7. VMD results for Fig. 3(a) with constant noise variance 0.001

τ/α     α = 200         α = 400         α = 500         α = 750         α = 1000
        PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM
τ = 2   19.572  0.728   20.25   0.757   19.862  0.741   19.701  0.737   19.369  0.723
τ = 3   19.93   0.748   19.764  0.738   21.138  0.802   20.003  0.75    19.692  0.733
τ = 4   19.628  0.735   21.151  0.801   20.597  0.772   20.205  0.75    20.664  0.784
Table 8. VMD results for Fig. 3(a) with constant noise variance 0.01

τ/α     α = 200         α = 400         α = 500         α = 750         α = 1000
        PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM
τ = 2   19.361  0.711   19.971  0.749   20.208  0.761   19.137  0.708   20.068  0.761
τ = 3   19.482  0.731   20.184  0.753   20.115  0.758   19.7    0.732   20.152  0.751
τ = 4   19.381  0.72    20.717  0.784   21.192  0.804   19.426  0.717   19.34   0.717
Table 9. VMD results for Fig. 3(b) with constant noise variance 0.001

τ/α     α = 200         α = 400         α = 500         α = 750         α = 1000
        PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM    PSNR    SSIM
τ = 2   20.644  0.745   20.36   0.726   20.627  0.75    20.476  0.735   19.747  0.676
τ = 3   20.416  0.734   20.232  0.716   20.491  0.737   19.994  0.7     20.186  0.713
τ = 4   20.615  0.749   20.409  0.732   20.633  0.748   19.784  0.685   20.37   0.728
Table 10. VMD-WT results for Fig. 3(a) with τ = 4

α      M+A            M+D            M+H+V          M+A+H+V        M+A+D
       PSNR    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR    SSIM
100    21.487  0.814  18.459  0.685  17.344  0.597  20.01   0.761  22.247  0.831
200    21.68   0.819  19.869  0.749  17.994  0.647  19.455  0.739  22.272  0.829
400    22.265  0.839  15.938  0.623  21.66   0.813  21.775  0.821  16.208  0.653
500    21.973  0.829  22.651  0.846  14.781  0.542  14.928  0.567  22.372  0.837
750    19.321  0.776  16.923  0.602  16.468  0.548  19.134  0.765  19.867  0.799
Table 11. VMD-WT results for Fig. 3(b) with τ = 2

α      M+A            M+D            M+H+V          M+A+H+V        M+A+D
       PSNR    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR    SSIM   PSNR    SSIM
100    21.314  0.808  21.309  0.791  18.241  0.671  18.525  0.717  21.851  0.832
200    20.835  0.8    16.552  0.637  20.753  0.766  20.68   0.798  16.672  0.679
400    21.004  0.778  15.295  0.535  17.858  0.616  18.218  0.667  15.538  0.577
500    21.396  0.807  14.961  0.54   18.862  0.694  19.166  0.73   15.142  0.574
750    20.945  0.772  20.087  0.732  15.735  0.526  15.869  0.55   20.424  0.758
Table 12. Proposed WT-VMD results for Fig. 3(a)

α      A+H              A+V              A+D              A+H+V+D
       PSNR     SSIM    PSNR     SSIM    PSNR     SSIM    PSNR     SSIM
200    20.013   0.773   19.8692  0.7654  20.4008  0.7877  20.753   0.8
400    20       0.7726  19.8704  0.7658  20.4542  0.8496  20.8012  0.802
500    19.995   0.7722  19.8744  0.7657  20.4844  0.7905  20.8302  0.803
750    20       0.7726  19.8706  0.7655  20.3818  0.7869  20.7262  0.799
1000   19.994   0.7724  19.873   0.766   20.3638  0.7863  20.6956  0.799
Table 13. Proposed WT-VMD results for Fig. 3(b)

α      A+H                A+V                A+D                 A+H+V+D
       PSNR      SSIM     PSNR      SSIM     PSNR      SSIM      PSNR      SSIM
200    19.42221  0.714418 19.24847  0.699428 19.71536  0.7730777 20.29551  0.768084
400    19.38963  0.711268 19.23844  0.697445 19.69004  0.728781  20.21015  0.762512
500    19.40119  0.711614 19.23381  0.696573 19.74502  0.731632  20.28021  0.765671
750    19.42693  0.71447  19.22206  0.696421 19.67672  0.727722  20.23225  0.764112
1000   19.37972  0.711024 19.22855  0.69725  19.61729  0.72474   20.11502  0.757719
Table 14. Proposed WT-VMD results of the combination (A and H and V and D) for Fig. 3(a) using db2

α      τ = 2              τ = 3              τ = 4
       PSNR      SSIM     PSNR      SSIM     PSNR      SSIM
200    20.753    0.8002   20.71908  0.798967 20.68875  0.796952
400    20.8012   0.802    20.73749  0.799428 20.81365  0.802014
500    20.8302   0.8028   20.83926  0.802748 20.84332  0.803013
750    20.7262   0.79936  20.83644  0.803143 21.00171  0.808262
1000   20.6956   0.79864  20.84298  0.803255 20.73596  0.799561

Table 15. Proposed WT-VMD results of the combination (A and H and V and D) for Fig. 3(b) using db2

α      τ = 2              τ = 3              τ = 4
       PSNR      SSIM     PSNR      SSIM     PSNR      SSIM
200    20.29551  0.768084 20.37085  0.772467 20.38041  0.772471
400    20.21015  0.762512 20.28299  0.766441 20.3855   0.771236
500    20.28021  0.765671 20.33078  0.768942 20.22669  0.763558
750    20.23225  0.764112 20.2038   0.762697 20.42719  0.774117
1000   20.11502  0.757719 20.17271  0.761241 20.29423  0.767602

Table 16. Proposed WT-VMD results of the combination (A and H and V and D) for Fig. 3(a) using db10

α      τ = 3              τ = 4
       PSNR      SSIM     PSNR      SSIM
200    21.32047  0.811821 21.37923  0.813753
400    21.42626  0.816144 21.40806  0.81472
500    21.51477  0.81891  21.43868  0.816535
750    21.35866  0.813752 21.41906  0.815052
1000   21.38543  0.815074 21.32555  0.812704

Table 17. Proposed WT-VMD results of the combination (A and H and V and D) for Fig. 3(b) using db10

α      τ = 3              τ = 4
       PSNR      SSIM     PSNR      SSIM
200    20.54868  0.778436 20.64715  0.783449
400    20.54514  0.778003 20.49723  0.775475
500    20.5468   0.777869 20.5502   0.778215
750    20.4714   0.773716 20.42463  0.771793
1000   20.46947  0.774298 20.45631  0.773218
The performance of the first level approximation using Daubechies, Coiflet, and Symlet wavelets of similar orders is presented in Table 5 and Table 6. It can be seen that Coiflet tends to yield better results than Daubechies, while Symlet is the worst performing amongst them.

4.2 VMD
In this subsection, the results obtained using VMD are presented and discussed. The study has been performed by altering the variance of the Gaussian noise, the noise tolerance, and the bandwidth constraint. We first vary the noise tolerance (τ) and then successively vary the bandwidth constraint (α) for each τ. The image is denoised by leaving out the mode containing the high-frequency components, i.e., the last mode. Table 7 and Table 8 present the results of VMD for Fig. 3(a) for the noise variances 0.001 and 0.01. Correspondingly, Table 9 presents the results of VMD for Fig. 3(b) while the noise variance is kept constant at 0.001. It can be seen from the above tables that VMD performs better than wavelet for some values of α and τ. But it is noted that the values of α and τ for which VMD performs better than wavelet are not the same either when the images are different or when the variance of the noise is different for the same image. Even a slight change in the values of α and τ can make the denoised images extremely different. Thus, VMD cannot be used for denoising texture images with a standard set of parameter values. The above observations of denoising texture images using WT and VMD reveal that VMD performs better than WT, whereas the parameters of VMD cannot be standardized as in WT. The results inferred from WT and VMD individually lead us to experiment with fusing both of these techniques, foreseeing that their desirable qualities are retained.

4.3 VMD-WT
In this subsection, the results of a combined VMD-WT technique are presented and discussed. The image is first divided into two modes using VMD. The last mode is then separated into approximation (A) and horizontal (H), vertical (V) and diagonal (D) details using first level wavelet decomposition. The PSNR and SSIM of the denoised images obtained by adding various combinations of approximation and details to the first mode (M) of VMD, with a noise variance of 0.001, are presented in Table 10 and Table 11. It can be noted from Table 10 that, for Fig. 3(a), the best denoised image is obtained when just the diagonal details are added to the first mode of VMD with parameters τ = 4 and α = 500. It can also be observed from Table 11 that, for Fig. 3(b), the best denoised image is obtained when both approximation and diagonal detail are added to the first mode of VMD with the parameters set to τ = 2 and α = 100. It can be seen from Table 10 and Table 11 that the choice of which combination of approximation (A) and details (H, V, D) to add to the first mode (M) of VMD to obtain the best denoised image is image specific. Furthermore, the cost of choosing wrong VMD parameters is also huge.

4.4 Proposed WT-VMD
In this subsection, the results obtained using the proposed WT-VMD approach are presented. A Daubechies wavelet is first used to decompose the noisy image into approximation (A) and horizontal (H), vertical (V) and diagonal (D) details. Each of the details is then separated into two modes using VMD with different τ and α. The denoising performance after adding the approximation coefficients to the first mode of the different detail coefficients, individually and jointly, is presented in Table 12 and Table 13. The values in these tables are obtained using τ = 2. It can be observed that adding all the first modes of the horizontal, vertical and diagonal details to the approximation coefficients produces the best results for any image. While using db2, the best denoised image is generated when τ = 4, α = 750 for Fig. 3(a) and τ = 4, α = 750 for Fig. 3(b), as presented in Table 14 and Table 15 respectively. While using db10, the best denoised image is generated when τ = 3, α = 500 for Fig. 3(a) and τ = 4, α = 200 for Fig. 3(b), as presented in Table 16 and Table 17. The obtained results are better than wavelet approximation and VMD. It can be noted from Tables 14, 15, 16 and 17 that the proposed method performs best when τ = 3, 4 and α is in the range 200 to 800. It can also be observed from the above tables that, by using higher-order wavelets, the choice of the VMD parameters within the range specified above is made insignificant. From the SSIM values it can be seen that the denoised images obtained using various α, τ values are not far off from each other, suggesting that the denoised images are similar.
5 Conclusion
In this paper, a combined WT-VMD approach for denoising texture images is proposed and its performance is compared with that of wavelet decomposition, VMD and a combined VMD-WT technique. In the case of wavelet decomposition, the level one approximation produced the best results for all wavelets. With the level one approximation kept constant, it was observed that the performance of wavelet denoising improved as the order of the wavelet was increased. VMD outperforms wavelet decomposition, but only for a specific set of parameters which depends on the noise variance and the image. In fact, slight variations in the parameter values can lead to undesirable effects while denoising. While the VMD-WT technique results in the best denoised image in some cases, the VMD parameters and the choice of which combination of approximation and details of the wavelet-decomposed last mode to add are image specific and difficult to generalize. The proposed WT-VMD method outperforms wavelet decomposition, VMD and, in some cases, VMD-WT for texture images. With the VMD-WT technique the cost of a wrong choice of parameters is huge, while with the proposed WT-VMD method the error incurred by choosing different parameter values of VMD is not alarming, as the order of the wavelets is increased. Thus, the proposed WT-VMD method localizes the parameters of VMD to a particular range and eliminates the need for selection of parameters.
References
1. Shapiro, L.G., Stockman, G.C.: Computer Vision. Prentice-Hall, Upper Saddle River (2001)
2. Chatterjee, P., Milanfar, P.: Is denoising dead? IEEE Trans. Image Process. 19(4), 895–911 (2009)
3. Fekri-Ershad, S., Fakhrahmad, S., Tajeripour, F.: Impulse noise reduction for texture images using real word spelling correction algorithm and local binary patterns. Int. Arab J. Inf. Technol. 15(6), 1024–1030 (2018)
4. Zuo, W., Zhang, L., Song, C., Zhang, D.: Texture enhanced image denoising via gradient histogram preservation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1203–1210 (2013)
5. Lahmiri, S., Boukadoum, M.: Biomedical image denoising using variational mode decomposition. In: 2014 IEEE Biomedical Circuits and Systems Conference (BioCAS) Proceedings, pp. 340–343. IEEE (2014)
6. Lahmiri, S., Boukadoum, M.: Physiological signal denoising with variational mode decomposition and weighted reconstruction after DWT thresholding. In: 2015 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 806–809. IEEE (2015)
7. Abraham, G., Mohan, N., Sreekala, S., Prasannan, N., Soman, K.P.: Two stage wavelet based image denoising. Int. J. Comput. Appl. 975, 8887 (2012)
8. Anusha, S., Sriram, A., Palanisamy, T.: A comparative study on decomposition of test signals using variational mode decomposition and wavelets. Int. J. Electr. Eng. Inf. 8(4), 886 (2016)
9. Zhu, X.: The application of wavelet transform in digital image processing. In: 2008 International Conference on MultiMedia and Information Technology, pp. 326–329. IEEE (2008)
10. Ai, J., Wang, Z., Zhou, X., Ou, C.: Variational mode decomposition based denoising in side channel attacks. In: 2016 2nd IEEE International Conference on Computer and Communications (ICCC), pp. 1683–1687. IEEE (2016)
11. Banjade, T.P., Yu, S., Ma, J.: Earthquake accelerogram denoising by wavelet-based variational mode decomposition. J. Seismolog. 23(4), 649–663 (2019). https://doi.org/10.1007/s10950-019-09827-0
12. Wang, X., Pang, X., Wang, Y.: Optimized VMD-wavelet packet threshold denoising based on cross-correlation analysis. Int. J. Perform. Eng. 14(9), 2239–2247 (2018)
13. Lahmiri, S.: Denoising techniques in adaptive multi-resolution domains with applications to biomedical images. Healthc. Technol. Lett. 4(1), 25–29 (2017)
14. Zosso, D., Dragomiretskiy, K., Bertozzi, A.L., Weiss, P.S.: Two-dimensional compact variational mode decomposition. J. Math. Imaging Vis. 58(2), 294–320 (2017). https://doi.org/10.1007/s10851-017-0710-z
15. Gonzalez, R.C., Woods, R.E.: Digital Image Processing (2002)
16. Hersey, I.: Textures: a photographic album for artists and designers by Phil Brodatz. Leonardo 1(1), 91–92 (1968)
17. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Two-Image Approach to Reflection Removal with Deep Learning

Rashmi Chaurasiya and Dinesh Ganotra

IGDTUW, Kashmere Gate, New Delhi 110006, India
Abstract. Reflection in an image is not always desirable, due to the loss of information. Conventional methods to remove reflection are based on priors that require certain conditions to be fulfilled. Recent advancements of deep learning in many fields have revolutionized these traditional approaches. Using more than one input image reduces the ill-posedness of the problem statement. The standard assumption in numerous methods is that the background is stationary and only the reflection layer varies. However, images taken at different angles have slightly different backgrounds. Considering this, a new dataset is created where both the reflection and the background layer vary. In this paper, a two-image based method with an end-to-end mapping between the observed images and the background is presented. The key feature is the practicability of the method, wherein a sequence of images at slightly different angles can easily be captured using modern dual camera mobile devices. A combination of feature loss with MSE maintains the content and quality of the resultant image.

Keywords: Reflection removal · Deep learning · Multiple image based method
1 Introduction

Reflection is a very common source of image degradation. This kind of degradation arises due to a glass pane between the camera and the target while capturing images. There is no doubt that reflection adds artistic value to an image in the world of photography, but it is undesirable in cases where a crucial part of the scene is overlapped. The two main approaches to remove reflection are single image based methods and multiple image based methods. As the names suggest, in single image reflection removal (SIRR) only a single image is used as input, while in multiple image based methods two (or more than two) images or a video sequence are used as input. Unlike SIRR, it is much easier to remove reflection with multiple images as input, since there is a better understanding of the background due to more information; the background and reflection layers can be more easily separated using relative motion. With the increased use of CNNs in learning based methods, the overall approaches to reflection removal can also be categorized as traditional optimization based methods and learning based methods. Reflection removal is usually termed a layer separation problem [1, 2], where it is assumed that the observed image (I) is a linear combination of a background layer (B) and a reflection layer (R):
I = B + R   (1)
In a real scenario we have only the observed image; SIRR processes aim to reconstruct the background from the observed image corrupted by reflection. According to the assumption in Eq. (1) (with no other parameters), there exist infinitely many solutions, which makes this a non-trivial problem. To ease the problem, some of the methods require additional priors to separate reflection and background. These priors are based on assumptions made by the methods; the most common priors are ghosting cues [3], depth of field maps [4, 5], sparsity priors [6], etc. These priors are used by single image based as well as multiple image based methods. Gradient distribution [7] is a widely applied cue to separate reflection; further, polarizer based techniques [8, 9] were also introduced. In this paper our approach is multiple image based, where two images are used as input. These two images are synthesized by shifting the reflection layer and the background by a certain range of pixels. There is always a question about the practicability of multiple image based methods due to the constraints of time and space. Unlike [10], where 5 images captured at different viewpoints are required, images in the proposed method need not be captured from entirely different viewpoints; a user can capture a pair of images without changing their position. The rest of the paper is organized as follows: Sect. 2 lists the related methods, categorized as learning and non-learning methods. Section 3 describes the proposed methodology, covering dataset creation, network structure and loss function. Section 4 shows the performance of the proposed method on various synthetic as well as real images. Sections 5, 6 and 7 present the discussion, limitations and future work, and conclusion.
2 Related Work

As discussed in the introduction, the two main approaches are non-learning and learning based methods. Non-learning based methods rely on a set of priors or sometimes user assistance. There exist assumptions such as the reflection being blurrier or out of focus [7], reflection and background both being in the same focal plane [10], or the reflection being closer to the camera than the background [4, 5].

2.1 Non-learning Based
When reflection and background are both in the same focal plane, gradient sparsity priors are used. Li and Brown [10] exploited relative smoothness, which is based on the different levels of blurriness in B and R. With the analogy of the reflection layer being closer to the camera, their depth of field map separates the R and B layers. The depth of the glass between the camera and the scene gives rise to a phenomenon of double reflection, also referred to as ghosting cues [3]; ghosting cues along with a patch based GMM (Gaussian Mixture Model) are employed as a crucial cue for reflection removal. Gai et al. [11] required a set of images to model the relative motion between the B and R layers in the set, which is used to separate B and R. Reflected light is completely polarized in the direction perpendicular to the plane of incidence at the Brewster angle; [8] and [12] used this property to remove reflection.

2.2 Learning Based
The most popular techniques for reflection removal among learning based methods exploit edge features. Deep learning in reflection removal was pioneered by Fan et al. [7], with a two-stage framework (CEILNet) based on the assumption that the reflected layer is relatively blurrier than the background. The first stage reconstructs edges and the second estimates an RGB image of the background. These approaches are mainly the result of experimentation with various network architectures, different loss functions and image synthesis methodologies. Network architectures include encoder-decoders [13], Generative Adversarial Networks (GAN) [14], dilated convolution networks [15], bidirectional networks [16], ResNets [17] and combinations of the above mentioned networks [13, 18]. In the loss function, images are compared not only pixel wise but also feature wise; that is, the MSE loss is used in combination with several other losses, including perceptual loss (feature loss) [19], adversarial loss, and exclusion loss.
3 Proposed Methodology

3.1 Dataset Creation
The dataset utilized in this paper is self-synthesized, for which the COCO dataset [20] of 5000 images is used. Out of the 5000 images, nearly half are assigned to the reflection set and the same number to the background set. Once the images are divided into separate sets, some preprocessing is performed, such as center cropping, resizing, random affine transformation and discarding faulty images. Variation of the reflection layer is assumed by numerous methods; however, the background remains static in most of them. Keeping the background stationary is not what happens in real-life captured images. Hence, in this paper, transitions in the reflection as well as the background are considered while creating the dataset. That is, images are shifted through some pixels in both the R layer and the B layer, and the amount of shifting in R is greater than in B, as the glass pane is closer to the camera. The images in the reflection set are then mixed with the background images with the help of the method given by Zhang et al. [21]. The resulting image set is displayed in Fig. 1.
Fig. 1. Created set
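The pair-generation step described above can be sketched as follows; note that the actual mixing follows Zhang et al. [21], so the simple blurred additive blend, the image size, the shift amounts and the blend weight used here are assumptions for illustration only.

```python
import numpy as np
from PIL import Image, ImageFilter

def make_pair(bg_img, ref_img, bg_shift=2, ref_shift=8, beta=0.4):
    """Synthesize two observations in which both layers move between views.
    A blurred additive blend stands in for the synthesis method of [21]."""
    B = np.asarray(bg_img.resize((256, 256)), dtype=np.float32) / 255.0
    R = np.asarray(ref_img.resize((256, 256)).filter(ImageFilter.GaussianBlur(3)),
                   dtype=np.float32) / 255.0

    def mix(b_shift, r_shift):
        # The reflection shifts more than the background (glass is closer to the camera).
        Bs = np.roll(B, b_shift, axis=1)
        Rs = np.roll(R, r_shift, axis=1)
        return np.clip(Bs + beta * Rs, 0.0, 1.0)

    I1 = mix(0, 0)
    I2 = mix(bg_shift, ref_shift)
    return I1, I2, B   # two observed images and the reference background
```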
3.2 Network Parameters
3.2.1 Structure
The inputs to the network structure (Fig. 2) are two images captured at slightly different viewpoints. Images in the training phase are synthetically generated (Sect. 3.1) to create the above mentioned condition. Both of these images are fed to the network, where the first three blocks estimate low level features. These blocks are a combination of a convolutional layer [22], a batch normalization layer [23] and ReLU [24] activation. The output feature maps of the two images after these three blocks are concatenated and passed through another five blocks. On the output side, the predicted background image is produced relative to the first image (since the two input images are shifted). The learning rate starts at 0.001 and is multiplied by 0.1 after every hundred epochs.

3.2.2 Loss Function
The loss function used here is given by Eq. (2):

Loss Function = a(Feature Loss) + b(Pixel Loss)   (2)

where a and b are the hyper-parameters used to assign weights to these losses.
Fig. 2. Network structure used, where one block represents a set of layers consisting of Conv2D, BatchNorm and ReLU layers
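A minimal PyTorch sketch of the two-branch structure in Fig. 2 is given below; the channel widths, kernel sizes and the choice to share weights between the two three-block feature extractors are assumptions for illustration, since the paper does not specify them.

```python
import torch
import torch.nn as nn

def block(cin, cout):
    # One "block" from Fig. 2: Conv2D -> BatchNorm -> ReLU.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TwoImageNet(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        # Three blocks extracting low-level features from each input image.
        self.feat = nn.Sequential(block(3, width), block(width, width), block(width, width))
        # Five blocks operating on the concatenated feature maps.
        self.fuse = nn.Sequential(*[block(2 * width if i == 0 else width, width)
                                    for i in range(5)])
        self.out = nn.Conv2d(width, 3, 3, padding=1)  # predicted background (w.r.t. image 1)

    def forward(self, img1, img2):
        f = torch.cat([self.feat(img1), self.feat(img2)], dim=1)
        return self.out(self.fuse(f))
```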
3.2.2.1 Pixel Loss
Pixel loss is the most commonly used metric to compare the quality of an image, where two images are compared pixel by pixel. The pixel-wise MSE loss sums the squared differences between corresponding pixels and ensures the quality of the output image. For the pixel-wise comparison, the MSE loss between the GT background (B) and the predicted background (B_o) is used, where n is the number of pixels:

MSE = (1/n) Σ_{i=1}^{n} (B_i − B_{o,i})²   (3)
3.2.2.2 Feature Loss
Like the pixel-wise MSE loss, the feature loss also sums the errors between pixels, but it takes the mean of that error [19]. Feature loss is primarily used to compare the content and style of two images, but later its range of applications widened to other image processing tasks too [25, 26]. It can be used when images are shifted through some pixels, which is ideal for this problem statement. Also, since it compares high level features extracted from a pre-trained network, it is faster. To compute the feature loss, the predicted image and the ground truth B are passed through a pre-trained VGG19 model, and the outputs of the 2nd, 3rd and 4th layers are extracted for the comparison of the hidden units' activations. The activations are compared through a metric known as the Gram matrix [27, 28], which is designed to measure the correlation between channels. The Gram matrix is the multiplication of an activation matrix and the transpose of that activation matrix. The Gram matrices of the predicted background image and the ground truth B are then compared with an L2 norm, as in Eq. (4):

GM_feat = (1 / (4 N_l² M_l²)) Σ_{ij} (GM^l(B) − GM^l(B_o))²_{ij}   (4)
where N_l and M_l represent the number of channels and the dimension (height × width) of the feature map of layer l, respectively. B and B_o are the GT background and the predicted background; their Gram matrices GM^l(B) and GM^l(B_o) in Eq. (4) are then compared.
Since the activations of different layers are compared, the weights assigned to these layers are also different. These weights are multiplied with their corresponding Gram-matrix terms to produce the feature loss L_feat:

L_feat = Σ_{l=0}^{L} w_l GM_feat   (5)
Once both loss values are estimated, the final loss is calculated by Eq. (2) as the weighted sum of L_feat and MSE, where the values of a and b are set to 0.3 and 0.7, respectively. This final loss is used in the back-propagation algorithm to minimize the difference between B_o and B. All the code implementations are done in Python using the PyTorch library.
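The combined objective in Eqs. (2)–(5) can be sketched in PyTorch as below; the use of torchvision's VGG19, the particular layer indices and the equal layer weights are assumptions for illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

vgg = vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)   # channel-correlation matrix

def feature_loss(pred, gt, layers=(7, 12, 21), weights=(1.0, 1.0, 1.0)):
    loss, x, y = 0.0, pred, gt
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in layers:                           # Eqs. (4)-(5): weighted Gram difference
            loss = loss + weights[layers.index(i)] * F.mse_loss(gram(x), gram(y))
    return loss

def total_loss(pred, gt, a=0.3, b=0.7):           # Eq. (2)
    return a * feature_loss(pred, gt) + b * F.mse_loss(pred, gt)
```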
4 Results

Figures 3, 4 and 5 show the resultant reflection-removed images, where Fig. 3 is the synthetic set created with the same methodology as used for the training set. The second test set consists of real images captured by a cellphone with a glass pane placed between the background and the camera. For the synthetic set we have 2 inputs, 1 background and 1 predicted background. The quality of the images can easily be compared here as an aligned background is available. Images are compared quantitatively with their SSIM (structural similarity index measure) and PSNR scores. The SSIM and PSNR between the background and the predicted background are shown in Table 1, where the two values correspond to the input and the predicted output. In Fig. 3 the input images are placed in the first two rows, the third row is the background and the last row displays our results (for labeling see vertically on the right in Fig. 3). The recent methods Zhang [21], ERRNet [31] and IBCLN [32] are chosen for comparison with the state of the art. Since all the existing reflection test sets accommodate only single image methods, we had to evaluate our model on our own test set. The test set utilized here has a total of 38 images; these synthetic images are generated using [21]. All the other methods require only a single image as input, so the background during image generation is stored relative to the first image to avoid image alignment issues when measuring quality. The average PSNR and SSIM of the 38 images are compared in Table 2. Some of the compared images from the test set are displayed in Fig. 4 (along with their SSIM and PSNR scores), where a reduction in pixel intensity can be seen in the output images of [21] and ERRNet [31]. IBCLN [32] does not reduce the intensity of the whole image but it fails to reduce the intensity of the reflection.

Table 1. PSNR and SSIM values of our results.

S. No      PSNR (input/predicted)  SSIM (input/predicted)
Flowers    21.18/25.77             0.92/0.95
Shoes      20.26/24.33             0.78/0.89
Doors      19.84/20.12             0.86/0.93
Signboard  23.24/24.19             0.89/0.94
Fig. 3. Synthetic test set results (columns: Flowers, Shoes, Doors, Signboard; rows: Image 1, Image 2, GT, Ours).

Table 2. Average PSNR and SSIM values between the background and predicted background.

Metric  Input (observed)  Zhang et al. [21]  ERRNet [31]  IBCLN [32]  Ours
PSNR    22.49             17.13              20.24        23.06       23.39
SSIM    0.88              0.85               0.91         0.89        0.93
Fig. 4. Synthetic test set results (PSNR/SSIM) of Zhang et al. [21], ERRNet [31], IBCLN [32] and ours (columns: input reflection, Zhang [21], ERRNet [31], IBCLN [32], ours; per-image PSNR/SSIM scores appear under each panel).
A problem arises with the real test set, where no aligned background is available. These images can only be compared visually, so no quantitative measures (SSIM, PSNR) can be computed for this set of images. A color shift is quite visible in the images predicted by [21]. Zhang et al. [21] produce a dull effect in images in the real as well as the synthetic set. The reflection is also more dominant in [21, 31] and [32]; for instance, in the first test image, a reflected human face is clearly visible in all the other methods except ours. The second image has a reflection of curtain folds, which is quite prevalent in the others compared to ours. There is a ghosting-cue-like effect in some of our images due to the varying background, which could be improved by experimenting with the CNN hyperparameters.
Fig. 5. Real test set results (columns: input reflection, Zhang [21], ERRNet [31], IBCLN [32], ours).
5 Discussion

Among multiple-image based methods, Li et al. [10] and Chang et al. [29] required 5 and 4 images respectively, captured at slightly different viewpoints, with all images corresponding to a single background. The assumption of a stationary background is not always true; in addition, taking multiple images at different viewpoints is not practical due to limitations of space and time. Here, in contrast, both R and B vary across the created image set. Comparison with recent methods ([21, 31] and [32]) on synthetic as well as real datasets validates the efficacy of the proposed method. Removing reflections from a single image [13–16, 30] is highly ill-posed and has had limited success so far. Because the problem is under-constrained, SIRR methods use a number of priors to narrow down the problem statement; assumptions like these prevent these methods from generalizing.
6 Limitation and Future Work

The approach formulated here requires two images. These images need to be shifted slightly; this shift can be produced manually or automatically. Many cellphones these days have dual or quad cameras, situated at slightly different positions. This positioning of the cameras could be exploited to capture the two images automatically. For our experiments we captured the two images by shifting the camera manually, so there is a user dependency in this case. On the other hand, this is a convenient approach because it is difficult to carry a heavy camera setup everywhere, so the direction of this work revolves around the practicability of the method. The user dependency further needs to be reduced by extending this method to work on single images. Another limitation is that existing single images cannot be processed using this technique. Predicted images are also distorted for some of the real-life scenes (see Fig. 5), so more fine-tuning is required. These two aforementioned problems constitute our future work: achieving higher accuracy on predicted images and applying this method to single images as well. Future work also includes employing an attention mechanism, which allows a neural network to focus on a subset of features/hidden layers; that is, it amplifies the features that the attention mechanism considers more important for predicting the desired output.
7 Conclusion

In this paper a real-life solution to reflection removal is presented. The focus is on the practicability of the method as well as on reducing the ill-posedness of the problem to give a generalized solution for reflection removal. Variation of the background while capturing multiple images is taken into account, which makes this method suitable for real-life cases. The model is just 2 MB in size and is not computationally expensive, so it can easily be incorporated into mobile devices. In comparison with [21, 31] and [32], our method achieved better quantitative performance on the synthetic set and visually better results on the real-life set.
References 1. Levin, A., Weiss, Y.: User assisted separation of reflections from a single image using a sparsity prior. IEEE Trans. Pattern Anal. Mach. Intell. 29(9), 1647–1654 (2007). https://doi. org/10.1109/TPAMI.2007.1106 2. Li, Y., Brown, M-S.: Single image layer separation using relative smoothness. In: IEEE Conference on Computer Vision and Pattern Recognition, Columbus, pp. 2752–2759 (2014). https://doi.org/10.1109/CVPR.2014.346 3. Shih, Y., Krishnan, D., Durand, F., Freeman, W.: Reflection removal using ghosting cues. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, pp. 3193–3201 (2015). https://doi.org/10.1109/CVPR.2015.7298939 4. Tao, M.W., Hadap, S., Malik, J., Ramamoorthi, R.: Depth from combining defocus and correspondence using light-field cameras. In: 2013 IEEE International Conference on Computer Vision, Sydney, pp. 673–680 (2013). https://doi.org/10.1109/ICCV.2013.89
5. Wan, R., Shi, B., Tan, A.H., Kot, A.C.: Depth of field guided reflection removal. In: Paper presented at the meeting of the ICIP), Phoenix, AZ, pp. 21–25 (2016). https://doi.org/10. 1109/ICIP.2016.7532311 6. Fergus, R., Singh, B., Hertzmann, A., Roweis, S., Freeman, W.: Removing camera shake from a single photograph. ACM Trans. Graph. 25, 787–794 (2006). https://doi.org/10.1145/ 1179352.1141956 7. Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: A generic deep architecture for single image reflection removal and image smoothing. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice. Italy (2017). https://doi.org/10.1109/ ICCV.2017.351 8. Kong, N., Tai, Y.-W., Shin, J.S.: A physically-based approach to reflection separation: from physical modeling to constrained optimization. IEEE Trans. Pattern Anal. Mach. Intell. 36 (2), 209–221 (2014). https://doi.org/10.1109/TPAMI.2013.45 9. Wolff, L-B.: Using polarization to separate reflection components. In: Proceedings CVPR 1989: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA¸pp. 363–369 (1989). https://doi.org/10.1109/CVPR.1989.37873 10. Li, Y., Brown, S-M.: Exploiting reflection change for automatic reflection removal. In: IEEE International Conference on Computer Vision, Sydney, NSW, pp. 2432–2439 (2013). https://doi.org/10.1109/ICCV.2013.302 11. Gai, K., Shi, Z., Zhang, C.: Blind Separation of superimposed moving images using image statistics. IEEE Trans. Pattern Anal. Mach. Intell. 34(1), 19–32 (2012). https://doi.org/10. 1109/TPAMI.2011.87 12. Schechner, Y-Y., Shamir, J., Kiryati, N.: Polarization-based decorrelation of transparent layers: the inclination angle of an invisible surface. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, Greece, vol. 2, pp. 814–819 (1999). https:// doi.org/10.1109/ICCV.1999.790305 13. Chi, Z., Wu, X., Shu, X., Gu, J.: Single Image Reflection Removal Using Deep EncoderDecoder Network. CoRR (2018). arXiv preprint arXiv:1802.00094v1 14. Lee, D., Yang, M-H., Oh, S.: Generative Single Image Reflection Separation (2018). arXiv preprint arXiv:1801.04102 15. Kuanar, S., Rao, K., Mahapatra, D., Bilas, M.: Night Time Haze and Glow Removal using Deep Dilated Convolutional Network (2019). arXiv preprint arXiv:1902.00855v1 16. Yang, J., Gong, D., Liu, L., Shi, Q.: Seeing deeply and bidirectionally: a deep learning approach for single image reflection removal. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 675–691. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_40 17. Jin, M., Süsstrunk, S., Favaro, P.: Learning to see through reflections. In: IEEE International Conference on Computational Photography (ICCP), Pittsburgh, PA, p. 12 (2018). https://doi. org/10.1109/ICCPHOT.2018.8368464 18. Wan, R., Shi, B., Duan, L.Y., Tan, A.-H., Kot, A.C.: CRRN: multi-scale guided concurrent reflection removal network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, pp. 4777–4785 (2018). https://doi.org/10. 1109/CVPR.2018.00502 19. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and superresolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43 20. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. 
(eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
21. Zhang, X., Ng, R., Chen, Q.: Single image reflection separation with perceptual losses. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, pp. 4786–4794 (2018). https://doi.org/10.1109/CVPR.2018.00503 22. Dumoulin, V., Visin, F.: A guide to convolution arithmetic for deep learning (2016). arXiv preprint arXiv:1603.07285v2 23. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, vol. 37, pp. 448–456 (2015). arXiv preprint arXiv:1502.03167v3 24. Maas, A.L., Hannun, A.-Y., Ng, A.-Y.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the International Conference on Machine Learning (ICML), vol. 30, no. 1, p. 3 (2013). arXiv preprint arXiv:1804.02763v1 25. Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: IEEE International Conference on Computer Vision (ICCV), Venice, pp. 1520–1529 (2017). https://doi.org/10.1109/ICCV.2017.168 26. Ledig, C., et al.: In photo-realistic single image super-resolution using a generative adversarial network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 105–114 (2017). https://doi.org/10.1109/CVPR.2017.19 27. Gatys, L.-A., Ecker, A.-S., Bethge, M.: Texture synthesis using convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 1, pp. 262–270 (2015). https://doi.org/10.5555/2969239.2969269 28. Gatys, L.-A., Ecker, A.-S., Bethge, M.: A neural algorithm of artistic style (2015). arXiv preprint arXiv:1508.06576 29. Chang, Y., Jung, C.: Single image reflection removal using convolutional neural networks. IEEE Trans. Image Process. 28(4), 1954–1966 (2019). https://doi.org/10.1109/TIP.2018. 2880088 30. Arvanitopoulos, N., Achanta, R., Süsstrunk, S.: Single image reflection suppression. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 1752–1760 (2017). https://doi.org/10.1109/CVPR.2017.190 31. Wei, K., Yang, J., Fu, Y., Wipf, D., Huang, H.: Single image reflection removal exploiting misaligned training data and network enhancements. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 8170– 8179 (2019). https://doi.org/10.1109/CVPR.2019.00837 32. Li, C., Yang, Y., He, K., Lin, S., Hopcroft, J.: Single Image Reflection Removal through Cascaded Refinement (2019). arXiv preprint arXiv:1911.06634v2
Visual Question Answering Using Deep Learning: A Survey and Performance Analysis Yash Srivastava, Vaishnav Murali, Shiv Ram Dubey(B) , and Snehasis Mukherjee Computer Vision Group, Indian Institute of Information Technology, Sri City, Chittoor, Andhra Pradesh, India {srivastava.y15,murali.v15,srdubey,snehasis.mukherjee}@iiits.in
Abstract. The Visual Question Answering (VQA) task combines the challenges of visual and linguistic processing to answer basic 'common sense' questions about given images. Given an image and a question in natural language, the VQA system tries to find the correct answer using visual elements of the image and inference gathered from the textual question. In this survey, we cover and discuss the recent datasets released in the VQA domain, dealing with various types of question formats and the robustness of the machine-learning models. Next, we discuss new deep learning models that have shown promising results over the VQA datasets. At the end, we present and discuss some of the results computed by us over the vanilla VQA model, the Stacked Attention Network and the VQA Challenge 2017 winner model. We also provide a detailed analysis along with the challenges and future research directions. Keywords: Visual Question Answering · Artificial intelligence · Human computer interaction · Deep learning · CNN · LSTM
1 Introduction
Visual Question Answering (VQA) refers to a challenging task which lies at the intersection of image understanding and language processing. The VQA task has witnessed significant progress in recent years from the machine intelligence community. The aim of VQA is to develop a system to answer specific questions about an input image. The answer could be in any of the following forms: a word, a phrase, a binary answer, a multiple choice answer, or a fill-in-the-blank answer. Agarwal et al. [2] presented a novel way of combining computer vision and natural language processing concepts to achieve Visual Grounded Dialogue, a system mimicking the human understanding of the environment through visual observation and language understanding. The advancements in the field of deep learning have certainly helped to develop systems for the task of image question answering. Krizhevsky et al. [13]
Fig. 1. Major breakthrough timeline in visual question answering.
proposed the AlexNet model, which created a revolution in the computer vision domain. The paper introduced the concept of Convolutional Neural Networks (CNN) to mainstream computer vision applications. Later, many authors worked on CNNs, which has resulted in robust deep learning models like VGGNet [27], Inception [28], ResNet [6], etc. Similarly, recent advancements in natural language processing based on deep learning have improved text understanding performance as well. The first major algorithm in the context of text processing is considered to be the Recurrent Neural Network (RNN) [20], which introduced the concept of prior context for time-series data. This architecture helped the growth of machine text understanding, which pushed the boundaries of machine translation, text classification and contextual understanding. Another major breakthrough in the domain was the introduction of the Long Short-Term Memory (LSTM) architecture [7], which improved over the RNN by introducing a context cell that stores the prior relevant information. The vanilla VQA model [2] used a combination of VGGNet [27] and LSTM [7]. This model has been revised over the years, employing newer architectures and mathematical formulations, as seen in Fig. 1. Along with this, many authors have worked on producing datasets for eliminating bias and strengthening model performance through robust question-answer pairs that cover the various types of questions, testing the visual and language understanding of the system. Among the recent developments in the topic of VQA, Li et al. have used context-aware knowledge aggregation to improve the VQA performance [14]. Yu et al. have performed cross-modal knowledge reasoning in the network for obtaining a knowledge-driven VQA [34]. Chen et al. have improved the robustness of the VQA approach by synthesizing counterfactual samples for training [3]. Li et al. have employed an attention-based mechanism through transfer learning along with a cross-modal gating approach to improve the VQA performance [15]. Huang et al. [8] have utilized a graph-based convolutional network to better encode relational information for VQA. VQA has also been explored in other domains, such as VQA for remote sensing data [18] and medical VQA [36]. In this survey, first we cover major datasets published for validating the Visual Question Answering task, such as the VQA dataset [2], DAQUAR [19],
Visual7W [38], and the most recent datasets up to 2019 including Tally-QA [1] and KVQA [25]. Next, we discuss the state-of-the-art architectures designed for the task of Visual Question Answering, such as Vanilla VQA [2], Stacked Attention Networks [32] and Pythia v1.0 [10]. We then present some of our computed results over three architectures: the vanilla VQA model [2], the Stacked Attention Network (SAN) [32] and the Teney et al. model [30]. Finally, we discuss the observations and future directions.

Table 1. Overview of VQA datasets described in this paper.

Dataset              # Images  # Questions  Question type(s)                                   Venue      Model(s)                        Accuracy
DAQUAR [19]          1449      12468        Object identification                              NIPS 2014  AutoSeg [5]                     13.75%
VQA [2]              204721    614163       Combining vision, language and common-sense        ICCV 2015  CNN + LSTM                      54.06%
Visual Madlibs [35]  10738     360001       Fill in the blanks                                 ICCV 2015  nCCA (bbox)                     47.9%
Visual7W [38]        47300     2201154      7Ws, locating objects                              CVPR 2016  LSTM + Attention                55.6%
CLEVR [11]           100000    853554       Synthetic question generation using relationships  CVPR 2017  CNN + LSTM + Spatial relations  93%
Tally-QA [1]         165000    306907       Counting objects of varying complexities           AAAI 2019  RCN network                     71.8%
KVQA [25]            24602     183007       Questions based on knowledge graphs                AAAI 2019  MemNet                          59.2%

2 Datasets
The major VQA datasets are summarized in Table 1. We present the datasets below. DAQUAR: DAQUAR stands for Dataset for Question Answering on Real World Images, released by Malinowski et al. [19]. It was the first dataset released for the IQA task. The images are taken from the NYU-Depth V2 dataset [26]. The dataset is small, with a total of 1449 images. The question bank includes 12468 question-answer pairs with 2483 unique questions. The questions have been generated by human annotators and confined to 9 question templates using annotations of the NYU-Depth dataset. VQA Dataset: The Visual Question Answering (VQA) dataset [2] is one of the largest datasets, collected from the MS-COCO [17] dataset. The VQA dataset contains at least 3 questions per image with 10 answers per question. The dataset contains 614,163 questions in open-ended and multiple choice forms. In multiple choice questions, the answers can be classified as: 1) Correct Answer, 2) Plausible Answer, 3) Popular Answers and 4) Random Answers. Recently, the VQA V2 dataset [2] was released with additional confusing images. VQA sample images and questions are shown in Fig. 2.
Visual Madlibs: The Visual Madlibs dataset [35] presents a different form of template for the image question answering task. One of the forms is the fill-in-the-blanks type, where the system needs to supply the words to complete the sentence, mostly targeting people, objects, appearances, activities and interactions. Visual Madlibs samples are shown in Fig. 3. Visual7W: The Visual7W dataset [38] is also based on the MS-COCO dataset. It contains 47,300 COCO images with 327,939 question-answer pairs. The dataset also consists of 1,311,756 multiple choice questions and answers with 561,459 groundings. The dataset mainly deals with seven forms of questions (from which it derives its name): What, Where, When, Who, Why, How, and Which. It mainly comprises two types of questions. The 'telling' questions are text-based, giving a sort of description. The 'pointing' questions are the ones that begin with 'Which' and have to be correctly identified by bounding boxes among the group of plausible answers.
Fig. 2. Samples from VQA dataset [2].
Fig. 3. Samples from Madlibs dataset [35].
CLEVR: CLEVR [11] is a synthetic dataset to test the visual understanding of VQA systems. The dataset is generated using three object shapes, namely cylinder, sphere and cube. These objects come in two different sizes and two different materials, and are placed in eight different colors. The questions are also synthetically generated based on the objects placed in the image. The dataset also provides the ground-truth bounding boxes for each object in the image. Tally-QA: Very recently, in 2019, the Tally-QA [1] dataset was proposed, which is the largest open-ended object counting dataset. It includes both simple and complex question types, as can be seen in Fig. 4. The dataset is also quite large, at 2.5 times the size of the VQA dataset: it contains 287,907 questions, 165,000 images and 19,000 complex questions. Tally-QA samples are shown in Fig. 4.
Fig. 4. Samples from Tally-QA dataset [1].
Fig. 5. Samples from KVQA dataset [25].
KVQA: The recent interest in common-sense questions has led to the development of the Knowledge-based VQA (KVQA) dataset [25]. The dataset contains questions targeting various categories of nouns that require world knowledge to arrive at a solution. Questions in this dataset require multi-entity, multi-relation, and multi-hop reasoning over large Knowledge Graphs (KG) to arrive at an answer. The dataset contains 24,000 images with 183,100 question-answer pairs employing around 18K proper nouns. KVQA samples are shown in Fig. 5.

Table 2. Overview of models described in this paper. Pythia v1.0 is the best performing model over the VQA dataset.

Model                            Dataset(s)                          Method                                  Accuracy                                        Venue
Vanilla VQA [2]                  VQA [2]                             CNN + LSTM                              54.06 (VQA)                                     ICCV 2015
Stacked attention networks [32]  VQA [2], DAQUAR [19], COCO-QA [23]  Multiple attention layers               58.9 (VQA), 46.2 (DAQUAR), 61.6 (COCO-QA)       CVPR 2016
Teney et al. [30]                VQA [2]                             Faster-RCNN + Glove vectors             63.15 (VQA-v2)                                  CVPR 2018
Neural-symbolic VQA [33]         CLEVR [11]                          Symbolic structure as prior knowledge   99.8 (CLEVR)                                    NIPS 2018
FVTA [16]                        MemexQA [9], MovieQA [29]           Attention over sequential data          66.9 (MemexQA), 37.3 (MovieQA)                  CVPR 2018
Pythia v1.0 [10]                 VQA [2]                             Teney et al. [30] + Deep layers         72.27 (VQA-v2)                                  VQA challenge 2018
Differential networks [31]       VQA [2], TDIUC [12], COCO-QA [23]   Faster-RCNN, Differential modules, GRU  68.59 (VQA-v2), 86.73 (TDIUC), 69.36 (COCO-QA)  AAAI 2019
GNN [37]                         VisDial and VisDial-Q               Graph neural network                    Recall: 48.95 (VisDial), 27.15 (VisDial-Q)      CVPR 2019

3 Deep Learning Based VQA Methods
The emergence of deep learning architectures has led to the development of VQA systems. We discuss the state-of-the-art methods, with an overview in Table 2. Vanilla VQA [2]: Considered a benchmark for deep learning methods, the vanilla VQA model uses a CNN for feature extraction and an LSTM or recurrent network for language processing. These features are combined using element-wise operations into a common feature, which is used to classify into one of the answers, as shown in Fig. 6. Stacked Attention Networks [32]: This model introduced attention using the softmax output of the intermediate question feature. The attention between
the features is stacked, which helps the model focus on the important portions of the image. Teney et al. Model [30]: Teney et al. introduced the use of object detection in VQA models and won the VQA Challenge 2017. The model helps in narrowing down the features and applies better attention to images. It employs the R-CNN architecture and showed a significant gain in accuracy over other architectures. This model is depicted in Fig. 7. Neural-Symbolic VQA [33]: Designed specifically for the CLEVR dataset, this model leverages the question formation and image generation strategy of CLEVR. The images are converted to structured features, and the question features are converted back to their original root question strategy. These features are used to filter out the required answer. Focal Visual Text Attention (FVTA) [16]: This model combines the sequence of image features generated by the network, the text features of the image (or probable answers), and the question. It applies attention based on both text components, and finally classifies the features to answer the question. This model is better suited for VQA in videos, which has more use cases than images. This model is shown in Fig. 8.
Fig. 6. Vanilla VQA network model [2].
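As an illustrative sketch of the vanilla VQA baseline shown in Fig. 6 — VGG16 image features and LSTM question features fused element-wise and classified over a fixed answer vocabulary — the snippet below assumes torchvision models; the hidden dimensions and tanh non-linearities are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class VanillaVQA(nn.Module):
    """CNN + LSTM baseline: VGG16 image features and LSTM question features
    are fused element-wise and classified over a fixed answer set."""
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=1024):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.conv = vgg.features                      # convolutional trunk
        self.avgpool = vgg.avgpool
        self.fc7 = nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d features
        self.img_proj = nn.Linear(4096, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        feats = self.avgpool(self.conv(image)).flatten(1)   # (B, 25088)
        img = torch.tanh(self.img_proj(self.fc7(feats)))    # (B, hidden_dim)
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = torch.tanh(h[-1])                               # (B, hidden_dim)
        return self.classifier(img * q)                     # element-wise fusion

# Usage with dummy tensors:
# logits = VanillaVQA(vocab_size=10000, num_answers=1000)(
#     torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 14)))
```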
Fig. 7. Teney et al. VQA model [30]
Fig. 8. Focal visual text attention model [16]
Pythia v1.0 [10]: Pythia v1.0 is the award-winning architecture for the VQA Challenge 2018 (https://github.com/facebookresearch/pythia). The architecture is similar to Teney et al. [30], with reduced computation via element-wise multiplication, the use of GloVe vectors [22], and an ensemble of 30 models. Differential Networks [31]: This model uses the differences between forward propagation steps to reduce noise and to learn the interdependency between features. Image features are extracted using Faster-RCNN [24]. The differential modules [21] are used to refine the features in both text and images. A GRU [4] is used for question feature extraction. Finally, it is combined with an attention module to classify the answers. The Differential Networks architecture is illustrated in Fig. 9. Differentiable Graph Neural Network (GNN) [37]: Recently, Zheng et al. have discussed a new way to model visual dialogs as a structural graph and a Markov Random Field. They consider the dialog entities as observed nodes, with the answer as a node with a missing value. This model is illustrated in Fig. 10.
Fig. 9. Differential networks model [31].
Fig. 10. Differentiable graph neural network [37].
4 Experimental Results and Analysis
The reported results for the different methods over different datasets are summarized in Table 1 and Table 2. It can be observed that the VQA dataset is very commonly used by different methods to test performance. Other datasets like Visual7W, Tally-QA and KVQA are also very challenging and recent datasets. It can also be seen that Pythia v1.0 is one of the recent methods performing very well over the VQA dataset. The Differential Network is a very recent method proposed for the VQA task and shows very promising performance over different datasets. As part of this survey, we also implemented different methods over different datasets and performed experiments. We considered the following three models: 1) the baseline Vanilla VQA model [2], which uses the VGG16 CNN architecture [27] and LSTMs [7], 2) the Stacked Attention Networks [32] architecture, and 3) the 2017 VQA challenge winner model of Teney et al. [30]. We considered the widely adopted standard VQA dataset [2] and Visual7W dataset [38] for the experiments. We used the Adam optimizer for all models with the cross-entropy loss function. Each model is trained for 100 epochs on each dataset.

Table 3. The accuracies obtained using the Vanilla VQA [2], Stacked Attention Networks [32] and Teney et al. [30] models when trained on the VQA [2] and Visual7W [38] datasets.

Model name                  Accuracy (VQA dataset)  Accuracy (Visual7W dataset)
CNN + LSTM                  58.11                   56.93
Stacked attention networks  60.49                   61.67
Teney et al.                65.82                   67.23
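A generic sketch of the training setup described above (Adam optimizer, cross-entropy loss, 100 epochs) is shown below; the learning rate, device handling, and `train_loader` data pipeline are illustrative assumptions and not the authors' exact configuration.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100, lr=1e-4, device="cuda"):
    # Adam + cross-entropy, trained for 100 epochs as described above;
    # lr and device handling are assumed for illustration.
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, questions, answers in train_loader:
            images, questions, answers = (t.to(device) for t in (images, questions, answers))
            optimizer.zero_grad()
            loss = criterion(model(images, questions), answers)
            loss.backward()
            optimizer.step()
```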
The experimental results are presented in Table 3 in terms of accuracy for the three models over the two datasets. In the experiments, we found that the Teney et al. [30] model is the best performing model on both the VQA and Visual7W datasets. The accuracies obtained with the Teney et al. model are 65.82% and 67.23% over the VQA and Visual7W datasets for the open-ended question-answering task, respectively.
The above results re-affirm that the Teney et al. model was the best performing model until 2018; it has recently been surpassed by Pythia v1.0 [10], which utilizes the same model with more layers to boost performance. The accuracy for VQA is still quite low due to the nature of the problem: VQA is one of the hard problems of computer vision, where the network has to understand the semantics of images, questions, and their relation in feature space.
5 Conclusion
Visual Question Answering has recently witnessed great interest and development from researchers and scientists all around the world. Recent trends are observed in the development of more and more realistic datasets, incorporating real-world types of questions and answers. Recent trends are also seen in the development of sophisticated deep learning models that better utilize visual as well as textual cues by different means. The performance of the best models is still lagging, at only around 60–70%. Thus, it is still an open problem to develop better deep learning models as well as more challenging datasets for VQA. Different strategies, such as object-level details, segmentation masks, deeper models, and the sentiment of the question, can be considered to develop the next generation of VQA models.
References 1. Acharya, M., Kafle, K., Kanan, C.: Tallyqa: Answering complex counting questions. arXiv preprint arXiv:1810.12440 (2018) 2. Antol, S., et al.: VQA: visual question answering. In: IEEE ICCV, pp. 2425–2433 (2015) 3. Chen, L., Yan, X., Xiao, J., Zhang, H., Pu, S., Zhuang, Y.: Counterfactual samples synthesizing for robust visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10800– 10809 (2020) 4. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014) 5. Gupta, S., Arbelaez, P., Malik, J.: Perceptual organization and recognition of indoor scenes from RGB-D images. In: IEEE CVPR, pp. 564–571 (2013) 6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE CVPR, pp. 770–778 (2016) 7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 8. Huang, Q., et al.: Aligned dual channel graph convolutional network for visual question answering. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7166–7176 (2020) 9. Jiang, L., Liang, J., Cao, L., Kalantidis, Y., Farfade, S., Hauptmann, A.: MemexQA: Visual memex question answering. arXiv preprint arXiv:1708.01336 (2017) 10. Jiang, Y., Natarajan, V., Chen, X., Rohrbach, M., Batra, D., Parikh, D.: Pythia v0. 1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956 (2018)
11. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: IEEE CVPR, pp. 2901–2910 (2017) 12. Kafle, K., Kanan, C.: An analysis of visual question answering algorithms. In: ICCV (2017) 13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1097–1105 (2012) 14. Li, G., Wang, X., Zhu, W.: Boosting visual question answering with context-aware knowledge aggregation. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1227–1235 (2020) 15. Li, W., Sun, J., Liu, G., Zhao, L., Fang, X.: Visual question answering with attention transfer and a cross-modal gating mechanism. Pattern Recogn. Lett. 133, 334–340 (2020) 16. Liang, J., Jiang, L., Cao, L., Li, L.J., Hauptmann, A.G.: Focal visual-text attention for visual question answering. In: IEEE CVPR, pp. 6135–6143 (2018) 17. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part V. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-31910602-1 48 18. Lobry, S., Marcos, D., Murray, J., Tuia, D.: RSVQA: visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 58(12), 8555–8566 (2020) 19. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: NIPS, pp. 1682–1690 (2014) 20. Medsker, L.R., Jain, L.: Recurrent neural networks. Design and Applications 5, (2001) 21. Patro, B., Namboodiri, V.P.: Differential attention for visual question answering. In: IEEE CVPR, pp. 7680–7688 (2018) 22. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014) 23. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Advances in Neural Information Processing Systems, pp. 2953–2961 (2015) 24. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NIPS, pp. 91–99 (2015) 25. Shah, S., Mishra, A., Yadati, N., Talukdar, P.P.: KVQA: knowledge-aware visual question answering. In: AAAI (2019) 26. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33715-4 54 27. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 28. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: IEEE CVPR, pp. 2818–2826 (2016) 29. Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: understanding stories in movies through question-answering. In: IEEE CVPR, pp. 4631–4640 (2016) 30. Teney, D., Anderson, P., He, X., van den Hengel, A.: Tips and tricks for visual question answering: learnings from the 2017 challenge. In: IEEE CVPR, pp. 4223– 4232 (2018)
31. Wu, C., Liu, J., Wang, X., Li, R.: Differential networks for visual question answering. In: AAAI 2019 (2019) 32. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: IEEE CVPR, pp. 21–29 (2016) 33. Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.: Neural-symbolic VQA: disentangling reasoning from vision and language understanding. In: NIPS, pp. 1031–1042 (2018) 34. Yu, J., Zhu, Z., Wang, Y., Zhang, W., Hu, Y., Tan, J.: Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recogn. 108, 107563 (2020) 35. Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual madlibs: fill in the blank description generation and question answering. In: IEEE ICCV, pp. 2461–2469 (2015) 36. Zhan, L.M., Liu, B., Fan, L., Chen, J., Wu, X.M.: Medical visual question answering via conditional reasoning. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 2345–2354 (2020) 37. Zheng, Z., Wang, W., Qi, S., Zhu, S.C.: Reasoning visual dialogs with structural and partial observations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6669–6678 (2019) 38. Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7W: grounded question answering in images. In: IEEE CVPR, pp. 4995–5004 (2016)
Image Aesthetic Assessment: A Deep Learning Approach Using Class Activation Map
Shyam Sherashiya1, Gitam Shikkenawis2(B), and Suman K. Mitra1
1 Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India {201811008,suman mitra}@daiict.ac.in
2 C.R. Rao Advanced Institute of Mathematics, Statistics and Computer Science, Hyderabad, India
Abstract. Aesthetics is concerned with the beauty and art of things in the world. Judging the aesthetics of images is a highly subjective task. Recently, deep learning-based approaches have achieved great success on the image aesthetic assessment problem. In this paper, we have implemented various multi-channel Convolutional Neural Network (CNN) architectures to classify images into high and low aesthetic quality. Class activation maps of images are used as input to one channel, along with variations of raw images, in the proposed two-channel deep network architecture. Various pre-trained deep learning models such as VGG19, InceptionV3 and ResNet50 have been implemented in the proposed multi-channel CNN architecture. Experiments are reported on the AVA dataset, which show improvement in the image aesthetic assessment task over existing approaches. Keywords: Image aesthetic assessment · Multi-channel CNN · Class Activation Maps (CAM) · Deep learning

1 Introduction
Aesthetics is concerned with the beauty and art of things in the world. Judging the aesthetics of photographs/images is a highly subjective task. Image aesthetics is concerned with how a human rates an image based on some standard rating scale. The wide use of photographic devices in the current era has increased the photographic content in the world, and hence automatically picking out aesthetically good images is very useful. Search engines should show aesthetically good images among the first few results when users search on the internet. Figure 1 shows a few examples of aesthetically good and bad quality images from the Aesthetic Visual Analysis (AVA) [1] dataset. Most of the existing research on this topic treats the problem of image aesthetic assessment (IAA) as a classification problem, where the image is classified as either aesthetic or non-aesthetic. The classification problem is formulated by
(a) Aesthetically high-quality images
(b) Aesthetically low-quality images
Fig. 1. Aesthetically high-quality and low-quality images from AVA dataset
mapping an image to some rating value. Various approaches [2,3] have been proposed which either use photographic rules such as the golden ratio, the rule of thirds, and color harmonies [2], or hand-crafted features, to classify the aesthetic value of images. Recent work on deep Convolutional Neural Network (CNN) based image aesthetic assessment [4,5] claims to learn feature representations of images for the classification problem. After the successful and promising results in [4], which uses a deep learning approach for the aesthetic assessment task, research is now trending towards extracting features using deep learning approaches, known as deep features. Results from deep-feature models have shown improved aesthetic assessment performance over hand-crafted feature extraction approaches. Brain-Inspired Deep Networks (BDN) [6] has a similar architecture to that in [4]. It learns various features with parallel supervised channels introduced in the CNN. In 2016, [7] proposed three category-specific CNN architectures based on object, scene, and texture. In [8], the 5th convolution layer of the AlexNet architecture was replaced by seven parallel convolution layers. These parallel layers represent different scene categories and are fed to fully connected layers for binary classification. Doshi et al. [9] proposed a three-column CNN-based approach using the pre-trained VGG19 architecture. They used two variants of saliency map images as one of the inputs to the multi-channel CNN and reported improved IAA performance. In this paper, variants of multi-channel convolutional neural networks (CNN) are used for the classification problem. Various pre-trained models such as VGG19, ResNet50 and InceptionNet are used with fine-tuning for image aesthetic assessment.
The various channels used in the CNN networks for feature extraction are the novelty of this paper. A two-channel deep network architecture is proposed. Along with variants of the raw images as input to one channel, we propose to use Class Activation Map (CAM) features [10] as input to the second channel. The Aesthetic Visual Analysis (AVA) [1] dataset is used to perform all the experiments. The organization of the paper is as follows: the architecture of the various deep CNN models is illustrated in Sect. 2. Section 3 discusses different image pre-processing techniques and the proposed two-channel deep CNN model. Experiments and analysis of the proposed approach against several existing IAA approaches are reported on the AVA dataset in the following section, followed by the conclusion.
2 Deep CNN Models for IAA
In this paper, we have used three pre-trained deep convolutional neural network models for image aesthetic assessment, namely VGG19 [11], InceptionV3 [12] and ResNet50 [13], and compared their results. VGG19 [11] is a convolutional neural network 19 layers deep. It is trained on more than a million images from the ImageNet database [14]. The input image size for this network is 224 × 224. The original pre-trained network is used to classify images into 1000 different object categories, and hence it has learned rich features from a wide range of images. These features can be useful when training on a different dataset for a similar task. VGG19 has kernels of size 3 × 3 and stride 1, with a total of 5 blocks having varying numbers of convolutional layers in each block, and a max-pooling layer separating each block. In this work, the VGG19 architecture is modified for the two-class classification problem of image aesthetic assessment: the initial network architecture of VGG19 is kept the same, and all fully connected layers are replaced by 11 additional dense layers after the max-pooling layer of block 5. InceptionV3 [12] is a convolutional neural network 48 layers deep. InceptionV3 is an upgraded version of InceptionV1, or GoogleNet [15]. The number of layers is increased to learn more image features for classification, but an increased number of layers makes the network prone to overfitting. To address this, the Inception module is introduced in InceptionNet. It performs convolution operations on the previous layer's input with 1 × 1, 3 × 3 and 5 × 5 filters, followed by max-pooling; finally, all the results are concatenated and sent to the next inception module. To reduce the size of the final model, a 1 × 1 convolution filter is also introduced before the 3 × 3 and 5 × 5 filters. Two auxiliary classifiers are added to the model, which apply softmax to two inception modules and mitigate the vanishing gradient problem. The loss of each auxiliary classifier is added to the final loss function with a weight of 0.3. The model also applies global average pooling before classification. In this paper, we have used the InceptionV3 architecture for image aesthetic classification, changing the fully connected layers to classify only two classes. The third deep network architecture used in the paper is ResNet50 [13], a convolutional neural network 50 layers deep. ResNet
solves the vanishing gradient problem of deep neural network architectures using residual blocks. A residual block is created such that the output of earlier layers is added to the current layer at some interval. This is called an identity shortcut connection, which skips one or more layers and connects to a later layer. This architecture improves the performance of the model and reduces the overfitting problem. The proposed network architecture and the pre-processing of images used as input to the deep network are discussed in the next section.
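As a hedged sketch of this fine-tuning strategy — a pre-trained backbone whose fully connected head is replaced so that the final model outputs two classes — the snippet below assumes torchvision models; the depth and widths of the replacement head are illustrative and do not reproduce the paper's exact 11-layer configuration.

```python
import torch.nn as nn
from torchvision import models

def two_class_head(in_features, hidden=512, depth=3, p_drop=0.5):
    # Replacement classifier: a stack of dense layers ending in 2 outputs.
    # Depth and width are illustrative, not the paper's exact head.
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(in_features, hidden), nn.ReLU(), nn.Dropout(p_drop)]
        in_features = hidden
    layers.append(nn.Linear(in_features, 2))
    return nn.Sequential(*layers)

def build_finetune_model(name="resnet50"):
    if name == "resnet50":
        m = models.resnet50(pretrained=True)
        m.fc = two_class_head(m.fc.in_features)
    elif name == "vgg19":
        m = models.vgg19(pretrained=True)
        m.classifier = two_class_head(25088)      # 512*7*7 flattened conv features
    else:                                          # InceptionV3 (299x299 inputs)
        m = models.inception_v3(pretrained=True, aux_logits=True)
        m.fc = two_class_head(m.fc.in_features)    # aux output handling omitted
    return m
```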
3 Image Pre-processing and Two Channel CNN
The deep neural network models discussed in the previous section are used to classify images into high/low aesthetic classes. These models are modified and fine-tuned from pre-trained networks. However, as stated in [4,9], using only raw images to train these architectures does not suffice for accurate aesthetic class assessment. Hence, various kinds of pre-processed images are fed to these models for feature extraction.

3.1 Image Pre-processing
For the image aesthetic assessment problem, we are using various pre-trained models, which require different input image sizes: VGG19 and ResNet50 have a default input size of 224 × 224, while InceptionV3 has a default input size of 299 × 299. The AVA dataset has images of different sizes; hence, all images are resized to 224 × 224. We use the following pre-processing techniques on the images fed into the fine-tuned models (a minimal sketch is given below). Original Images are the resized images of size 224 × 224. As image resizing is prone to blurring pixels, the aspect ratio of the images is considered while resizing. Padded Images are created by padding the images with zeros to make them square. Images are padded such that all four sides, i.e., top, bottom, left, and right, receive an equal number of zeros. After padding, the images are resized to 224 × 224. Center Cropped Images are also used as input: images are cropped from the center with a fixed size of 224 × 224. This helps the model learn the features of the central part of the image without losing its pixel information. Class Activation Map Images are also created for the AVA dataset. As mentioned in [4], the above pre-processing of images is expected to help learn global and local aesthetic features like the rule of thirds, the golden ratio, sharpness, and resolution. In this paper, we introduce another pre-processing method which enables the model to learn more aesthetic features of images. As illustrated in Figs. 2 and 3, we apply a new concept for image aesthetic assessment: class activation mapping (CAM) using global average pooling, introduced in [10], which has shown remarkable object localization ability.
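A minimal sketch of the three raw-image variants described above is given here, assuming PIL and torchvision; how the paper preserves aspect ratio during resizing is not fully specified, so the zero-padding interpretation (pad the shorter side to make the image square) is an assumption.

```python
from PIL import Image
from torchvision import transforms

SIZE = 224  # 299 would be used for InceptionV3

def resized(img):
    # Plain resize to the network input size (the paper's aspect-ratio
    # handling is approximated here).
    return img.resize((SIZE, SIZE))

def padded(img):
    # Zero-pad so the image becomes square, then resize to SIZE x SIZE.
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side))              # black (zero) padding
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    return canvas.resize((SIZE, SIZE))

center_cropped = transforms.CenterCrop(SIZE)              # fixed 224x224 center patch

# Usage: variants = [resized(im), padded(im), center_cropped(im)]
```

The CAM-based variant, used as the second-channel input, is described next.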
Fig. 2. Aesthetically high quality images from AVA dataset and their CAM representation of images
Fig. 3. Aesthetically low quality images from AVA dataset and their CAM representation of images
Global average pooling was introduced in [16] to replace fully connected layers, which avoids overfitting in large convolutional neural network models. Global average pooling averages out the spatial information at the last layer of the network, and thus retains the most localized features, which directly help the network in an image classification problem. Class activation maps are obtained by computing a weighted sum of the feature maps of the last convolutional layer: the weights learned on top of the globally average-pooled features are multiplied with the last convolutional layer's feature maps to create class activation maps (CAM). CAM highlights the object localization regions specific to a class. In this paper, we apply the class activation map technique to the AVA dataset to generate heat map images. We generate two types of images using CAM: 1) CAM images of the original image, and 2) CAM images of the
center crop of the original images. These two types of images are combined together and fed as one channel of data to the convolutional neural network for training. For a given image, let $f_k(x, y)$ represent the activation of unit $k$ of the last convolutional layer at spatial position $(x, y)$. For this unit $k$, global average pooling is computed as

$$F_k = \sum_{(x,y)} f_k(x, y)$$

For a particular class $c$, the input $S_c$ to the softmax classifier is written as

$$S_c = \sum_{k} w_k^c F_k$$

where $w_k^c$ is the weight of class $c$ for unit $k$. The predicted softmax output for class $c$ is $\exp(S_c) / \sum_{c} \exp(S_c)$. By substituting the value of $F_k$ into $S_c$, it can be written as

$$S_c = \sum_{k} w_k^c \sum_{(x,y)} f_k(x, y) = \sum_{(x,y)} \sum_{k} w_k^c f_k(x, y)$$

The CAM for class $c$ can then be written as $M_c$, whose spatial elements are described as

$$M_c(x, y) = \sum_{k} w_k^c f_k(x, y)$$
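A hedged sketch of the CAM computation in the equations above is shown below, with illustrative names: `features` stands for the last convolutional activations $f_k(x, y)$ of one image and `fc_weights` for the classifier weights $w_k^c$ of the layer after global average pooling. The ReLU and min-max normalization are common practice for visualizing CAMs and are assumptions beyond the bare equation.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weights, class_idx, out_size=224):
    """features:   (N, H, W) last-conv activations f_k(x, y) for one image
       fc_weights: (num_classes, N) weights w_k^c of the layer after GAP
       Returns an out_size x out_size heat map M_c upsampled to image size."""
    n, h, w = features.shape
    w_c = fc_weights[class_idx]                        # (N,)
    cam = torch.einsum("k,khw->hw", w_c, features)     # M_c(x,y) = sum_k w_k^c f_k(x,y)
    cam = F.relu(cam)                                  # keep positive evidence (assumption)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    cam = F.interpolate(cam[None, None], size=(out_size, out_size),
                        mode="bilinear", align_corners=False)
    return cam[0, 0]

# Hypothetical usage with a ResNet50 fine-tuned for the two aesthetic classes:
#   feats = resnet_conv_body(image)[0]          # (2048, 7, 7), assumed helper
#   cam   = class_activation_map(feats, model.fc.weight.detach(), class_idx=1)
```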
3.2 Two Channel CNN for IAA
In this paper, we propose a two-channel CNN architecture whose channel outputs are concatenated before the output classifier. The network takes as one input a combination of the original resized image, the center-cropped image, and the padded image. The other input is a combination of the original image and the center-cropped image after processing with CAM. Figure 4 shows the two-channel CNN architecture design. As shown in Figs. 2 and 3, the CAM features highlight the object regions of the image which are important for aesthetic assessment. Hence, in addition to using the original and cropped versions of the image as input, the CAM representation is expected to enhance the quality of the aesthetic assessment. We performed experiments with pre-trained networks such as VGG19, InceptionV3 and ResNet50. For VGG19, two pre-trained networks are used in the proposed two-channel network and their features are concatenated before the fully connected layers; the fully connected layers are replaced with 11 dense layers that perform the two-class classification at the end of the network, and dropout is introduced between the dense layers. In the case of the InceptionV3 and ResNet50 networks, the models are loaded with pre-trained weights; global average pooling is applied to each column's base model before concatenation, and dropout and dense layers are then added so that the final model classifies the two classes.
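As a sketch of this two-channel design — two pre-trained backbones, global average pooling on each, feature concatenation, dropout, and dense layers for the two-class output — the snippet below assumes torchvision ResNet50 backbones; layer widths and the dropout rate are illustrative, and how the multiple image variants within each channel are combined into a single input is not reproduced here (each channel receives one image tensor).

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoChannelIAA(nn.Module):
    """Channel 1: raw-image variants; channel 2: CAM images.
    Each channel has its own pre-trained backbone; pooled features are
    concatenated and classified into low/high aesthetic quality."""
    def __init__(self, feat_dim=2048, hidden=512, p_drop=0.5):
        super().__init__()
        def backbone():
            m = models.resnet50(pretrained=True)
            return nn.Sequential(*list(m.children())[:-2])   # drop avgpool + fc
        self.branch_raw = backbone()
        self.branch_cam = backbone()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # global average pooling
        self.head = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 2),
        )

    def forward(self, x_raw, x_cam):
        f1 = self.pool(self.branch_raw(x_raw)).flatten(1)    # (B, 2048)
        f2 = self.pool(self.branch_cam(x_cam)).flatten(1)    # (B, 2048)
        return self.head(torch.cat([f1, f2], dim=1))

# logits = TwoChannelIAA()(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```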
Fig. 4. Proposed two channel CNN architecture
4 Experiments
There are many datasets available for evaluating the image aesthetic assessment task. In this paper, the experiments are performed on the widely used Aesthetic Visual Analysis (AVA) [1] dataset. It contains approximately 255,000 images of a wide variety. The number of votes per image varies between 78 and 549, with an average of 210 votes, so along with its large size it also has sufficient consensus on the aesthetic value of each image. However, the dataset does not provide binary labels for the aesthetic value of each image; instead, it provides all the user ratings, varying from 1–10, for each image. In this work, we address the problem of classifying images into low and high aesthetic quality, thus turning the problem into a two-class classification. Images with ratings lower than 4 are considered low aesthetic images, and those with ratings higher than 6 are considered high aesthetic images.
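A small sketch of this two-class labeling rule is shown below. It assumes each AVA entry provides the counts of votes for scores 1–10 and that the thresholds apply to the mean score; images falling between 4 and 6 are treated as left out of the two classes. These are assumptions, and the exact AVA annotation file layout is not reproduced.

```python
def aesthetic_label(vote_counts):
    """vote_counts: list of 10 ints, counts of ratings 1..10 for one image.
    Returns 0 (low), 1 (high), or None for images outside the two classes."""
    total = sum(vote_counts)
    mean_score = sum((i + 1) * c for i, c in enumerate(vote_counts)) / total
    if mean_score < 4:
        return 0            # low aesthetic quality
    if mean_score > 6:
        return 1            # high aesthetic quality
    return None             # mid-range image (assumed discarded)

# Example: aesthetic_label([0, 1, 3, 10, 40, 80, 50, 20, 5, 1])
```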
4.1 Comparison of Different Architectures
As discussed in Sect. 2, three different deep CNN models, VGG19, InceptionV3 and ResNet50, have been used in the proposed two-channel architecture. It was reported in [9] that using a three-channel network, with one channel taking randomly cropped images as input, helps to improve the IAA results. In order to have a complete set of comparisons, the proposed two-channel network is therefore also compared with a three-channel network whose third channel takes random crops of the images. Random crop images are cropped at random locations from the original images; three different random cropped patches of size 224 × 224 were created from the images of the dataset. The results of the various deep neural network architectures trained on the AVA dataset are reported in Table 2. ResNet50 with the proposed two-channel network surpasses all the competing methods. In the case of the two-channel network, ResNet50 — with the combination of the original resized, center-cropped, and padded images as one input, and the combination of the original and center-cropped images processed with the CAM technique as the other — achieved an accuracy of 83.02%, which is approximately 1% higher than the three-channel ResNet50 network with random crop images as the third channel. As shown in Table 2, the accuracy of the corresponding three-channel network is less than that of the two-channel network for InceptionV3 and ResNet50. The number of parameters in the three-channel network is significantly higher than in the two-channel network architecture, which might lead to overfitting of the model.

Table 1. Values for network complexity in terms of GFLOPs for the original networks as well as the various multi-channel networks used in this paper.

Network                ResNet50  InceptionV3  VGG19
Original network       3.8       5.7          19.6
One channel network    3.875     5.7          19.6
Two channel network    7.75      11.4         39.65
Three channel network  11.65     17.1         58.715
Table 1 shows the network complexity of the various multi-channel CNNs used in this paper in terms of Giga Floating Point Operations (GFLOPs), a unit used to measure the performance of a computer's floating point unit. GFLOPs values for the original networks are also shown for better comparison between the various channel configurations used in the paper. We used an NVIDIA Tesla P100 16 GB GPU to perform the experiments.

Table 2. Accuracy of various architectures and networks. One channel network accuracy is shown for the input of a combination of original images and center cropped images after processing the data with CAM. Two channel network accuracy is shown for the above-mentioned input as the first channel and a combination of the original resized image, center crop image, and padded image data as input for the second channel. Three channel network accuracy is shown with an additional channel having random crops of images as input.

Architecture  Network                Train accuracy  Test accuracy
VGG19         One channel network    94.01           72.5
InceptionV3   One channel network    87.5            78.92
Resnet50      One channel network    79.32           79.59
VGG19         Two channel network    96.84           77.70
InceptionV3   Two channel network    86.41           81.20
Resnet50      Two channel network    83.26           83.02
VGG19         Three channel network  88.75           79.5
InceptionV3   Three channel network  83.69           81.00
Resnet50      Three channel network  80.51           82.42

It can be observed that ResNet50 is
computationally efficient compared to the other two CNN architectures. For a batch size of 16 images, the ResNet50 two-channel architecture took 308 ms and the InceptionV3 two-channel architecture took around 453 ms. Various scenarios where the proposed two-channel ResNet50 CNN model is used for image aesthetic classification are reported in Figs. 5, 6, 7 and 8; the original images and their CAM versions used in the ResNet50 model are shown in these figures. Figure 5 shows images having low scores in the AVA dataset that are correctly classified by the model. Some of the images in the dataset are shot very cleanly by photographers, but the objective or purpose of the image is not clear; most reviewers gave them low aesthetic ratings, and hence those images fell into the low aesthetic class. Such images are also successfully classified by the network. Figure 6 shows images that actually have a low aesthetic score but which the ResNet50 model classified as high aesthetic. These kinds of images are very tough for the network to classify: the aesthetic score given to them is low, but the objective of the image is captured very clearly, so the model classified them as higher aesthetic images. Figure 7 shows images with higher aesthetic scores,
Fig. 5. Low aesthetic images correctly classified with two channel ResNet50 CNN model. Figure shows original images and CAM images used as input.
Fig. 6. Low aesthetic images of AVA dataset misclassified as high aesthetic images by two channel ResNet50 CNN model. Figure shows original images and CAM images used as input.
Fig. 7. High aesthetic images correctly classified with two channel ResNet50 CNN model. Figure shows original images and CAM images used as input.
Fig. 8. High aesthetic images of AVA dataset misclassified as low aesthetic images by two channel ResNet50 CNN model. Figure shows original images and CAM images used as input.
Figure 7 shows images with higher aesthetic scores, and ResNet50 has correctly classified these images as high aesthetic images, while Fig. 8 shows images that were actually given a higher aesthetic score by viewers but that the two-channel ResNet50 model has misclassified as low aesthetic images.

4.2 Comparison with Various Approaches
Table 3 shows an accuracy comparison of existing approaches in the literature on the AVA dataset. The table shows that the proposed single-channel network with VGG19 architecture improves the results compared to the SCNN introduced in [4] and the single-column network architecture with a VGG19 base model used in [9]. [9] used two variants of saliency map images as one of the inputs in a multi-channel CNN, whereas in this paper CAM data is used for the image aesthetic task. The single-channel network with VGG19 as the base model shows that CAM features turn out to be better for the aesthetics assessment task than the saliency map images used in [9].
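For reference, the sketch below computes the standard class activation map of Zhou et al. [10] for a torchvision ResNet50, as a minimal illustration of the CAM features discussed here; the paper's exact preprocessing and crop handling may differ, and the random input tensor is only a stand-in for a real image.

```python
# Standard CAM sketch: weight the last conv feature maps by the FC weights of
# the predicted class, then upsample and normalize to obtain a heat map.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1").eval()  # older torchvision: pretrained=True
feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(maps=o))

img = torch.randn(1, 3, 224, 224)            # stand-in for a preprocessed image
with torch.no_grad():
    logits = model(img)
cls = logits.argmax(dim=1).item()

w = model.fc.weight[cls]                     # (2048,) weights of the predicted class
cam = (w.view(1, -1, 1, 1) * feats["maps"]).sum(dim=1, keepdim=True)  # (1, 1, 7, 7)
cam = F.relu(cam)
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
print(cam.shape)    # torch.Size([1, 1, 224, 224])
```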
Table 3. Accuracy comparison of existing IAA approaches on the AVA dataset

Network                                                    Accuracy
Single column network (SCNN) [4]                           71.20
Single column network (VGG19) [9]                          71.37
Single channel network (VGG19) (Proposed)                  72.5
Double column network (DCNN) [4]                           73.25
Hierarchical aesthetic quality assessment using DNN [7]    74.51
A multi-scene deep learning model (AlexNet) [8]            76.94
Brain-inspired deep network (BDN) [6]                      78.08
Single channel network (ResNet50) (Proposed)               79.59
Triple column network (VGG19) [9]                          82.3
Two channel network (ResNet50) (Proposed)                  83.02
Lu et al. [4] also introduced DCNN in the same paper, which improved the accuracy to 73.25%. In [7], an object CNN takes saliency detection and a global view as input, and all three category channels are concatenated to output the two-class classification; network accuracy is improved to 74.51%. The model in [8] reports an accuracy of 76.94% for the image aesthetic classification task. The brain-inspired deep network (BDN) [6] proposed a multi-channel CNN with a structure similar to RAPID. The model is trained with a 14-style CNN channel, and this CNN column is added to the global-patch and local-patch image columns for the final two-class classification of the image aesthetic task. This model achieved an accuracy of 78.08% for image aesthetic classification on the AVA dataset. In this paper, the one-channel ResNet50 network with CAM features of original and center-cropped images as input achieved 79.59% accuracy, which outperforms all IAA approaches compared so far. The triple column network proposed with VGG19 as the base network in [9] reported 82.3% accuracy. The proposed two-channel network with ResNet50 architecture attained an accuracy of 83.02% on the AVA dataset, which shows the strength of the CAM features for the IAA task.
5 Conclusion
A two-channel deep network architecture is proposed for image aesthetic assessment in this paper, which uses variants of the original images as input in one channel and the class activation map (CAM) representation in the second channel. The CAM features capture the most localized features and highlight objects in the image. The CAM representation of images from the AVA dataset shows object localization on both low- and high-quality images. Classification accuracy for the IAA problem improves after adding the second channel with CAM features. The experiments also suggest that increasing the number of channels in the convolutional neural network does not always result in improved IAA performance. IAA experiments with multi-channel deep CNN models such as VGG19, InceptionV3 and ResNet50 have been reported. The two-channel deep CNN with ResNet50 as the base model achieves the highest accuracy among all the competing architectures.
References 1. Murray, N., Marchesotti, L., Perronnin, F.: AVA: a large-scale database for aesthetic visual analysis. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2408–2415. IEEE (2012) 2. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a computational approach. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part III. LNCS, vol. 3953, pp. 288–301. Springer, Heidelberg (2006). https://doi.org/10.1007/11744078 23 3. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 1, pp. 419–426. IEEE (2006) 4. Lu, X., Lin, Z., Jin, H., Yang, J., Wang, J.Z.: Rapid: rating pictorial aesthetics using deep learning. In: Proceedings of the 22nd ACM International Conference on Multimedia, pp. 457–466 (2014) 5. Lu, X., Lin, Z., Shen, X., Mech, R., Wang, J.Z.: Deep multi-patch aggregation network for image style, aesthetics, and quality estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 990–998 (2015) 6. Wang, Z., Chang, S., Dolcos, F., Beck, D., Liu, D., Huang, T.S.: Brain-inspired deep networks for image aesthetics assessment. arXiv preprint arXiv:1601.04155 (2016) 7. Kao, Y., Huang, K., Maybank, S.: Hierarchical aesthetic quality assessment using deep convolutional neural networks. Signal Process. Image Commun. 47, 500–510 (2016) 8. Wang, W., Zhao, M., Wang, L., Huang, J., Cai, C., Xu, X.: A multi-scene deep learning model for image aesthetic evaluation. Signal Process. Image Commun. 47, 511–518 (2016) 9. Doshi, N., Shikkenawis, G., Mitra, S.K.: Image aesthetics assessment using multi channel convolutional neural networks. In: Nain, N., Vipparthi, S.K., Raman, B. (eds.) CVIP 2019, Part II. CCIS, vol. 1148, pp. 15–24. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-4018-9 2 10. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016) 11. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 12. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016) 13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 14. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
15. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) 16. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)
RingFIR: A Large Volume Earring Dataset for Fashion Image Retrieval

Sk Maidul Islam1(B), Subhankar Joardar2, and Arif Ahmed Sekh3

1 Global Institute of Science and Technology, Purba Medinipur, India
2 Haldia Institute of Technology, Purba Medinipur, India
3 UiT The Arctic University of Norway, Tromsø, Norway
Abstract. Fashion image retrieval (FIR) is a challenging task, which involves searching for similar items in a massive collection of fashion products based on a query image. FIR for garments and shoes is popular in the literature, while more complex fashion products such as ornaments get less attention. Here, we introduce a new earring dataset, namely RingFIR. The dataset is a collection of (∼2.6K) high-quality images collected from major India-based jewellery chains. The dataset is labelled in 46 classes in a structured manner. We have benchmarked the dataset using state-of-the-art image retrieval methods. We believe that the dataset is challenging and will attract computer vision researchers in the future. The dataset is available publicly (https://github.com/skarifahmed/RingFIR).

Keywords: Fashion image retrieval · Ornament dataset · Image retrieval dataset

1 Introduction
In recent years, there has been significant growth in the size of the domestic and overseas fashion market, and the trend is steadily increasing. Accordingly, the application of computer vision (CV) in the fashion industry is also increasing, which includes retrieval, attribute discovery, recognition and recommendation of fashion products. An image retrieval system is commonly used to retrieve images of interest to a user from an image database, and fashion product retrieval systems use different modes of similarity. Text-Based Image Retrieval (TBIR) and Content-Based Image Retrieval (CBIR) are two well-known methods for image retrieval. In TBIR, images are annotated with text and then retrieved according to the user's interest using a text-based approach [1]. The main drawbacks of TBIR are that a user has to describe an image using nearly the same text that was used to annotate it, and that manually annotating a large volume of images is time-consuming. CBIR methods, in contrast, use the visual contents of images to search an image dataset based on the user's interest. In CBIR, the visual contents of an image may refer to color, shape, texture, or any other information [2] that can be automatically extracted from the image. In a CBIR system, multi-dimensional feature vectors are used to store the extracted visual contents of the images in a feature database. To retrieve images based on a user's query, the system represents the query image as a feature vector; the distances between feature vectors are then calculated for the retrieval of images with the help of an indexing scheme. There are several applications of CBIR systems, such as medical diagnosis [3], fingerprint identification, face detection [4], biodiversity information systems [5], nudity-detection filters [6], the textile industry, fashion domains, etc.

Retrieving similar fashion images from huge collections of fashion products based on a user's query is the key aim of fashion image retrieval [7–9]. In recent years, fashion image retrieval (FIR) has been playing an important role because of the growing demands of online shopping, fashion image recognition [10–12] and fashion recommendation [13–15]. FIR uses a similar concept with focused similarity measurement policies such as shape-based similarity, color similarity, texture and design similarity, etc. (see Fig. 1). Despite rapid progress in FIR in recent years, it still has limitations for application to real-world fashion image search. The limitations are as follows. Firstly, multiple fashion objects may be present in a fashion image, which creates variations in style and viewpoint. Secondly, a fashion image retrieval system cannot produce the correct outcome if the user's query image is captured from a varying viewpoint or under low lighting conditions [16]. Thirdly, retrieval of similar fashion images by considering complex properties such as shape and design has received less attention.

Fig. 1. Fashion image retrieval (FIR) process.

Over the last few years, there has been notable growth in fashion related research [17–19], especially on garments and shoes. However, research on ornament image retrieval has not achieved such momentum because of the complexity of representing its features and the unavailability of appropriate datasets. It has been observed that ornaments contain larger variations in design than fashion products such as garments or shoes. The contributions of the paper are: (1) We propose a new fashion image retrieval dataset, namely RingFIR. The dataset consists of high quality images of golden earrings (∼2.6K images of 46 different classes such as Jhumkas, Danglers, Chand Balis, Hoops, Studs, Drop, Chandelier, Hoop Ballis, etc.). (2) We have benchmarked the dataset using state-of-the-art image retrieval methods.

The rest of the paper is organized as follows: In Sect. 2 the related works in FIR, including different image retrieval datasets, are discussed. Section 3 includes the details of the dataset. Section 4 summarizes the benchmarking methods and the results. Finally, Sect. 5 concludes the article.
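To make the CBIR pipeline described above concrete, the following is a minimal sketch: every gallery image is represented by a feature vector (here a simple RGB color histogram) and a query is answered by ranking the gallery by Euclidean distance in that feature space. The function names are illustrative and are not from the paper.

```python
# Minimal CBIR sketch: feature vectors per image, Euclidean ranking of gallery.
import numpy as np

def color_histogram(image, bins=8):
    """image: HxWx3 uint8 array -> normalized joint RGB histogram (bins**3,)."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-8)

def retrieve(query_img, gallery_imgs, top_k=5):
    q = color_histogram(query_img)
    feats = np.stack([color_histogram(g) for g in gallery_imgs])
    dists = np.linalg.norm(feats - q, axis=1)        # distance in feature space
    return np.argsort(dists)[:top_k]                 # indices of most similar items

gallery = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(20)]
print(retrieve(gallery[0], gallery, top_k=3))
```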
2 Related Works
The image retrieval problem is applied in various computer vision problems and is primarily solved using classical image processing methods and deep neural networks. Here, we discuss different image retrieval datasets, including fashion image datasets, and state-of-the-art fashion image retrieval methods. Fashion Image Retrieval Datasets: In the last few years, because of the increasing demands of online shopping, image recognition, and recommendation, image retrieval has been playing a leading role, and fashion products such as garments are the main choice for this purpose. Xiao et al. [20] have presented a new gray-scale fashion image dataset, namely Fashion-MNIST, with a large number of fashion images from different gender groups. Loni et al. [21] presented a social image dataset of fashion and clothing with their context and social metadata. Liu et al. [17] introduce a large-scale clothes dataset, namely DeepFashion, with a variety of attributes, landmark information, and cross-domain image similarity. Zheng et al. [22] present a street fashion dataset called ModaNet. Recently, Ge et al. [18] have proposed DeepFashion2, a large-volume garment dataset for detection, segmentation, and re-identification. Bossard et al. [10] introduced a benchmark dataset for the clothing classification task. Hadi et al. [8] have created a clothing dataset containing retailer and street photos. Chen et al. [23] introduce a large-scale clothing dataset built with attribute subcategories, such as various shades of color, clothing types, and patterns. Huang et al. [9] have presented a database of online shopping images and exact offline counterparts of those online ones. Vasileva et al. [24] have proposed a dataset that contains various fashion products such as bags, top wear, bottom wear, shoes, etc. Other Image Retrieval Datasets: Other than fashion image datasets, a variety of image datasets are available in the domains of medical diagnosis, fingerprint identification, crime prevention, face detection, nudity detection filters, the textile industry, and many more. Krizhevsky et al. [25] have described the training method of a multilayer generative model using tiny colour images; for this purpose, they created the datasets CIFAR-10 and CIFAR-100, containing natural tiny colour images. Deng et al. [26] introduced a database called "ImageNet", which contains a large-scale ontology of images organized according to the WordNet structure. Wang et al. [27] have presented a novel chest X-ray database, namely "ChestX-ray8", which comprises a large number of frontal-view X-ray images with image labels of eight diseases. Dong et al. [28] have proposed an automatic method for brain tumor segmentation using U-Net and evaluated their method on the Multimodal Brain Tumor Image Segmentation (BRATS 2015) datasets, which contain high-grade and low-grade brain tumor cases. Karu et al. [29] have presented a fingerprint classification algorithm that extracts singular points in a fingerprint image and carries out classification based on the number and locations of the detected singular points. For this purpose,
they have used the NIST-4 and NIST-9 databases. Sharma et al. [4] have proposed a novel face image dataset, namely the UCD Colour Face Image Database, for face detection. Garcia et al. [30] have proposed a dataset by collecting images and videos from websites and developed an application to filter pornographic content. From a real-world traffic surveillance environment, Liu et al. [31] have constructed the VeRi dataset, which contains over 50k images of 776 vehicles, and also proposed a vehicle re-identification framework. In Table 1 we show some popular image datasets used in various image retrieval problems. From the existing datasets, it has been observed that none of them deals with ornaments exclusively. This observation motivates us to create an ornament dataset. State-of-the-Art Fashion Image Retrieval Methods: Fashion image retrieval has become more important after the recent progress in artificial intelligence and the emergence of online shopping. Hadi et al. [8] introduce a challenging retrieval task, where the target is to match the user's captured image exactly with the online shopping images. Triplet Capsule Networks are introduced in [39] to explore in-shop clothing retrieval performance. Lin et al. [40] have presented a deep convolutional neural network framework for rapid clothing retrieval by adding a latent layer to the network. Zhao et al. [41] have presented a memory-augmented attribute manipulation network (AMNet) that can manipulate some redundant attributes of the images and change them to the desired ones to retrieve the required images from the image gallery. Liu et al. [17] proposed a novel deep model, namely FashionNet, that learns clothing features by jointly predicting clothing attributes and landmarks. Ge et al. [18] have proposed a Match R-CNN framework that builds upon Mask R-CNN to solve clothes detection, pose estimation, segmentation, and retrieval. Jetchev et al. [42] present a novel method, called Conditional Analogy Generative Adversarial Network (CAGAN), to solve image similarity problems by learning the relation between paired images present in the training data and then generalizing to produce images that correspond to the relation. Wang et al. [43] have proposed a knowledge-guided fashion network to solve the fashion landmark localization and clothing category classification problems of visual fashion analysis.
Table 1. Popular image datasets used in various image retrieval problems.

Dataset                Description                                                  Application                            Annotation
Polyvore Outfits [32]  Dataset for building outfits in online fashion data          Fashion recommendation                 68,306 outfits
Deep Fashion [17]      A large-scale clothes database                               Image retrieval and recognition        800k various fashion images
Exact Street2Shop [8]  Matching clothing items between street and shop photos       Image retrieval                        ∼404k shop and 39k street images
Fashion MNIST [20]     Gray-scale images of fashion products                        Classification                         70k fashion products from 10 categories
CIFAR-10 [25]          Dataset of low resolution images from a wide variety of classes  Classification                     10 classes
NIST-9 [29]            Gray scale fingerprint images                                Classification                         5400 images
ImageNet [26]          A large-scale labelled object dataset                        Object recognition                     ∼100k objects
ChestX-ray8 [27]       Chest X-ray database                                         Classification                         Chest X-rays of more than 30k patients
Caltech-256 [33]       Object images                                                Object recognition and classification  ∼30k images
MNIST [34]             Dataset of handwritten digits                                Classification                         Over 60k images
SCFace [35]            Human faces                                                  Face recognition and classification    4,160 images of 130 subjects
VeRi-776 [36]          Images of vehicles                                           Vehicle re-identification              Over 50k images of 776 vehicles
VIPeR [37]             Intra-personal image pairs                                   Person re-identification               632 image pairs of two different camera views
Market-1501 [38]       Dataset for person re-identification                         Person re-identification               ∼32k annotated boxes and ∼500k distractor set of images
3 Proposed Dataset and Benchmark
It is noted that there is no dataset available that deals with ornament images. Compared to other fashion image retrieval tasks, ornament image retrieval needs more attention due to the large variety of structures, designs and patterns. We have collected an ornament dataset of golden earrings. It contains 2,651 high-resolution golden earring images from different jewellery chain catalogues. These images are also used in online shopping apps. Some example images of our dataset are shown in Fig. 2(A). The distribution of the images over the classes is shown in Fig. 2(B).
Fig. 2. (A) Examples of randomly chosen samples of different design in our dataset. (B) The distribution of images in different classes. (Color figure online)
Data Collection: For collecting the dataset, we visited different Indian jewellery chains such as Anjali Jewellers, Kalyan Jewellers, Malabar Gold and Diamond, Tanishq Jewellers, and PC Chandra Jewellers. They have professionally captured high-quality images of ornaments that are used in online shopping apps, blogs, promotional activities, etc. Data collected from the different jewellery chains are merged to form a large volume dataset.

Fig. 3. Examples of some random retrievals using VGG16, where successfully retrieved images are shown with a green boundary and failed ones with a red boundary. (Color figure online)

Image Annotation: The most critical part is annotating the dataset. Here, the annotation refers to grouping earrings by similarity. Although some specific designs such as "Jhumkas", "Rings", "With stones", etc. are mentioned in the catalogues, interclass design variation also exists. For this task, more than 30 female volunteers are involved; among them 5 are experts in golden ornaments and the rest are end users. We designed a small application where a query image is shown on the screen and the volunteer needs to select similar products from the gallery. Repeating these experiments ∼1000 times with random query images, we obtain ∼30K random annotations. The distance between two images is assumed to be 0 if they are selected for the same query image by the volunteers, and 1 otherwise. Finally, the earrings are grouped into different classes by taking the maximum vote. In this manner, we found 46 different classes.
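The grouping step above is described only at a high level; the following is one plausible reading sketched in code, not the authors' exact procedure: co-selection votes from the annotations are accumulated into a similarity matrix, converted to the 0/1 distance described above, and the earrings are grouped into the target number of classes by hierarchical clustering. All sizes and the clustering choice are assumptions for illustration.

```python
# Hedged sketch of grouping earrings from volunteer co-selection votes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

n_images, n_classes = 100, 10                 # toy sizes; the paper uses 2651 and 46
rng = np.random.default_rng(0)
# each annotation: a query image plus the set of images a volunteer marked similar
annotations = [(int(rng.integers(n_images)),
                set(rng.integers(0, n_images, size=5).tolist())) for _ in range(1000)]

votes = np.zeros((n_images, n_images))
for query, selected in annotations:
    for s in selected:
        votes[query, s] += 1
        votes[s, query] += 1

dist = (votes == 0).astype(float)             # 0 if ever co-selected, else 1
np.fill_diagonal(dist, 0.0)
labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                  t=n_classes, criterion="maxclust")
print(np.bincount(labels)[1:])                # resulting class sizes
```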
4 Benchmarking Methods and Discussion
Here, we benchmarked the RingFIR dataset by applying various existing classical methods and deep learning models. We used the following classical methods: histogram-based similarity using the Bhattacharyya distance, the Pearson correlation coefficient, Chi-square and intersection. We also used state-of-the-art deep models for feature extraction and retrieval. The flow diagram of such a retrieval method is depicted in Fig. 4. First, a pretrained (ImageNet) baseline deep image classifier is used for extracting the features of query-gallery pairs. Next, a feature distance (Euclidean) is used to find the similarity score and the gallery images are ranked based on the similarity. We used different baseline deep networks such as ResNet50, ResNet101, ResNet152, VGG16, NASNetMobile, MobileNet, DenseNet121, DenseNet169 and DenseNet201 for benchmarking. The dataset is divided into train and test sets as 70% and 30%. The train set is used to train/fine-tune the deep models and the test set is used for validation.
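The four classical similarities named above are all available through OpenCV's `compareHist`; the sketch below shows them on H-S color histograms. The histogram configuration (color space, bin counts) is an assumption here, as the benchmark's exact settings are not given in this excerpt.

```python
# Sketch of the four classical histogram similarities using OpenCV.
import cv2
import numpy as np

def hsv_hist(bgr_image, bins=(30, 32)):
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, list(bins), [0, 180, 0, 256])  # H-S histogram
    return cv2.normalize(hist, hist).flatten()

query = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
gallery_img = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
h1, h2 = hsv_hist(query), hsv_hist(gallery_img)

metrics = {
    "Correlation":   cv2.HISTCMP_CORREL,         # higher = more similar
    "Chi-Square":    cv2.HISTCMP_CHISQR,         # lower  = more similar
    "Intersection":  cv2.HISTCMP_INTERSECT,      # higher = more similar
    "Bhattacharyya": cv2.HISTCMP_BHATTACHARYYA,  # lower  = more similar
}
for name, flag in metrics.items():
    print(name, cv2.compareHist(h1, h2, flag))
```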
Fig. 4. Flow of the deep feature based image retrieval method used for benchmarking. The numbered circles represent different steps.
We have separated 10 query images of each class for validation. For a given query image, we sort the gallery images in decreasing order of similarity. The retrieval performance is then recorded using top-k retrieval accuracy: for a given test query image, the result is true if at least one item that matches the query image is retrieved within the first k results. For the deep models, the model is first trained on ImageNet and fine-tuned using RingFIR. Next, the trained model is used to extract features of the query and gallery images. Finally, the feature difference is used as the similarity: the lower the difference, the higher the similarity. In Table 2, we summarize the top-1, top-5 and top-10 accuracy of the different methods on RingFIR. Some random retrieval examples using VGG16 are shown in Fig. 3. Figure 5 shows random examples of query and retrieved images using different methods. It is noted that classical distance based image retrieval methods such as Bhattacharyya, correlation, etc. perform poorly and are not suitable for the dataset. Chi-square based distance performs better and is comparable to deep feature based retrieval. Popular deep architectures such as VGG, ResNet, etc. also do not perform well. It is also noted that the rank-1 accuracy is very low in all cases. The state-of-the-art deep learning based feature extraction fails to infer the similarity criterion because of the complex shapes and the finer differences in texture/design. It can be concluded that a custom-designed neural network suitable for the dataset is required. Classical similarity-based retrieval methods also fail because of the low variation in color and the fine shape/texture details.
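The deep-feature retrieval and top-k evaluation protocol just described can be sketched as follows, with a ResNet50 backbone as feature extractor; fine-tuning on RingFIR is omitted for brevity, and the tensors are random stand-ins for real images and labels.

```python
# Sketch: deep features -> Euclidean ranking of the gallery -> top-k accuracy.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=None)         # ImageNet weights + fine-tuning in practice
extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()  # 2048-d pooled features

def extract(images):                             # images: (N, 3, 224, 224) tensor
    with torch.no_grad():
        return extractor(images).flatten(1)

def top_k_accuracy(q_feats, q_labels, g_feats, g_labels, k=5):
    hits = 0
    for f, y in zip(q_feats, q_labels):
        d = torch.norm(g_feats - f, dim=1)           # Euclidean distances to gallery
        ranked = g_labels[torch.argsort(d)[:k]]      # labels of the k nearest items
        hits += int((ranked == y).any())
    return hits / len(q_labels)

gallery = torch.randn(50, 3, 224, 224); queries = torch.randn(10, 3, 224, 224)
g_lab = torch.randint(0, 46, (50,));    q_lab = torch.randint(0, 46, (10,))
print(top_k_accuracy(extract(queries), q_lab, extract(gallery), g_lab, k=5))
```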
Table 2. Accuracy using different state-of-the-art classical and deep learning methods

Method                   Accuracy (%)
                         Top-1   Top-5   Top-10
Bhattacharyya [44]       13.04   39.13   56.52
Correlation [44]         2.17    8.69    17.39
Chi-Square [44]          15.21   39.13   58.69
Hist. Intersection [45]  2.17    10.86   15.21
VGG16 [46]               15.22   39.13   56.52
ResNet50 [47]            15.21   28.26   43.47
ResNet101 [47]           4.35    30.43   39.13
ResNet152 [47]           4.35    23.91   43.48
DenseNet121 [48]         13.04   28.26   45.65
DenseNet169 [48]         6.52    23.91   43.48
DenseNet201 [48]         8.70    26.09   36.96
NASNetMobile [49]        10.87   32.61   45.65
MobileNet [50]           10.87   32.61   43.48

Fig. 5. Top-k accuracy on RingFIR dataset using different methods.

5 Conclusion
In this paper, we have introduced a novel fashion image retrieval dataset consisting of golden earring images (∼2.6K images, 46 classes). The dataset, which we call RingFIR, is intended for fashion image retrieval (FIR). We have also benchmarked the dataset using state-of-the-art classical and deep learning methods, which may open up new challenges in fashion image retrieval. We note that the dataset is challenging and state-of-the-art methods fail to achieve good accuracy on it. We hope that the dataset will attract researchers and will be a valuable contribution to the CV community. Future work includes expanding the dataset by adding other ornaments and adding textual tags applicable to ornaments for improving retrieval accuracy.
References 1. Long, F., Zhang, H., Feng, D.D.: Fundamentals of content-based image retrieval. In: Feng, D.D., Siu, W.C., Zhang, H.J. (eds.) Multimedia Information Retrieval and Management. Signals and Communication Technology, pp. 1–26. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-662-05300-3 1 2. Yasmin, M., Mohsin, S., Sharif, M.: Intelligent image retrieval techniques: a survey. J. Appl. Res. Technol. 12(1), 87–103 (2014) 3. Lehmann, T.M., et al.: Content-based image retrieval in medical applications. Methods Inf. Med. 43(04), 354–361 (2004)
4. Sharma, P., Reilly, R.B.: A colour face image database for benchmarking of automatic face detection algorithms. In: Proceedings EC-VIP-MC 2003. 4th EURASIP Conference Focused on Video/Image Processing and Multimedia Communications (IEEE Cat. No. 03EX667), vol. 1, pp. 423–428. IEEE (2003) 5. da Silva Torres, R., Falcao, A.X.: Content-based image retrieval: theory and applications. RITA 13(2), 161–185 (2006) 6. Chora´s, R.S.: Cbir system for detecting and blocking adult images. In: Proceedings of the 9th WSEAS International Conference on Signal Processing, pp. 52–57 (2010) 7. Corbiere, C., Ben-Younes, H., Ram´e, A., Ollion, C.: Leveraging weakly annotated data for fashion image retrieval and label prediction. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2268–2274 (2017) 8. Hadi Kiapour, M., Han, X., Lazebnik, S., Berg, A.C., Berg, T.L.: Where to buy it: matching street clothing photos in online shops. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3343–3351 (2015) 9. Huang, J., Feris, R.S., Chen, Q., Yan, S.: Cross-domain image retrieval with a dual attribute-aware ranking network. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1062–1070 (2015) 10. Bossard, L., Dantone, M., Leistner, C., Wengert, C., Quack, T., Van Gool, L.: Apparel classification with style. In: Lee, K.M., Matsushita, Y., Rehg, J.M., Hu, Z. (eds.) ACCV 2012, Part IV. LNCS, vol. 7727, pp. 321–335. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37447-0 25 11. Dong, Q., Gong, S., Zhu, X.: Multi-task curriculum transfer deep learning of clothing attributes. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 520–529. IEEE (2017) 12. Kalantidis, Y., Kennedy, L., Li, L.J.: Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos. In: Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, pp. 105–112 (2013) 13. Hu, Y., Yi, X., Davis, L.S.: Collaborative fashion recommendation: a functional tensor factorization approach. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 129–138 (2015) 14. Li, Y., Cao, L., Zhu, J., Luo, J.: Mining fashion outfit composition using an end-toend deep learning approach on set data. IEEE Trans. Multimedia 19(8), 1946–1955 (2017) 15. Liu, S., et al.: Hi, magic closet, tell me what to wear! In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 619–628 (2012) 16. Park, S., Shin, M., Ham, S., Choe, S., Kang, Y.: Study on fashion image retrieval methods for efficient fashion visual search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2019) 17. Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1096–1104 (2016) 18. Ge, Y., Zhang, R., Wang, X., Tang, X., Luo, P.: DeepFashion2: a versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. arXiv preprint arXiv:1901.07973 (2019) 19. Zhou, W., et al.: Fashion recommendations through cross-media information retrieval. J. Vis. Commun. Image Represent. 61, 112–120 (2019) 20. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
21. Loni, B., Cheung, L.Y., Riegler, M., Bozzon, A., Gottlieb, L., Larson, M.: Fashion 10000: an enriched social image dataset for fashion and clothing. In: Proceedings of the 5th ACM Multimedia Systems Conference, pp. 41–46 (2014) 22. Zheng, S., Yang, F., Kiapour, M.H., Piramuthu, R.: ModaNet: a large-scale street fashion dataset with polygon annotations. In: ACM Multimedia Conference on Multimedia Conference, pp. 1670–1678. ACM (2018) 23. Chen, Q., Huang, J., Feris, R., Brown, L.M., Dong, J., Yan, S.: Deep domain adaptation for describing people based on fine-grained clothing attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5315–5324 (2015) 24. Vasileva, M.I., Plummer, B.A., Dusad, K., Rajpal, S., Kumar, R., Forsyth, D.: Learning type-aware embeddings for fashion compatibility. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018, Part XVI. LNCS, vol. 11220, pp. 405–421. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-012700 24 25. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009) 26. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009) 27. Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2097–2106 (2017) 28. Dong, H., Yang, G., Liu, F., Mo, Y., Guo, Y.: Automatic brain tumor detection and segmentation using u-net based fully convolutional networks. In: Vald´es Hern´ andez, M., Gonz´ alez-Castro, V. (eds.) MIUA 2017. CCIS, vol. 723, pp. 506–517. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60964-5 44 29. Karu, K., Jain, A.K.: Fingerprint classification. Pattern Recogn. 29(3), 389–404 (1996) 30. Garcia, M.B., Revano, T.F., Habal, B.G.M., Contreras, J.O., Enriquez, J.B.R.: A pornographic image and video filtering application using optimized nudity recognition and detection algorithm. In: 2018 IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), pp. 1–5. IEEE (2018) 31. Xinchen Liu, W., Liu, T.M., Ma, H.: PROVID: progressive and multimodal vehicle reidentification for large-scale urban surveillance. IEEE Trans. Multimedia 20(3), 645–658 (2017) 32. Han, X., Wu, Z., Jiang, Y.G., Davis, L.S.: Learning fashion compatibility with bidirectional LSTMs. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1078–1086 (2017) 33. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset (2007) 34. Cohen, G., Afshar, S., Tapson, J., Van Schaik, A.: MNIST: extending MNIST to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE (2017) 35. Grgic, M., Delac, K., Grgic, S.: SCface-surveillance cameras face database. Multimedia Tools Appl. 51(3), 863–879 (2011) 36. Lou, Y., Bai, Y., Liu, J., Wang, S., Duan, L.: VERI-wild: a large dataset and a new method for vehicle re-identification in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3235–3243 (2019)
37. Monroe, M.E., Toli´c, N., Jaitly, N., Shaw, J.L., Adkins, J.N., Smith, R.D.: Viper: an advanced software package to support high-throughput LC-MS peptide identification. Bioinformatics 23(15), 2021–2023 (2007) 38. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person reidentification: a benchmark. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124 (2015) 39. Kinli, F., Ozcan, B., Kira¸c, F.: Fashion image retrieval with capsule networks. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019) 40. Lin, K., Yang, H.F., Liu, K.H., Hsiao, J.H., Chen, C.S.: Rapid clothing retrieval via deep learning of binary codes and hierarchical search. In: ACM International Conference on Multimedia Retrieval, pp. 499–502. ACM (2015) 41. Zhao, B., Feng, J., Wu, X., Yan, S.: Memory-augmented attribute manipulation networks for interactive fashion search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1520–1528 (2017) 42. Jetchev, N., Bergmann, U.: The conditional analogy GAN: swapping fashion articles on people images. In: IEEE International Conference on Computer Vision, pp. 2287–2292 (2017) 43. Wang, W., Xu, Y., Shen, J., Zhu, S.C.: Attentive fashion grammar network for fashion landmark detection and clothing category classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4271–4280 (2018) ¨ 44. Erkut, U., Bostancıo˘ glu, F., Erten, M., Ozbayo˘ glu, A.M., Solak, E.: HSV color histogram based image retrieval with background elimination. In: 2019 1st International Informatics and Software Engineering Conference (UBMYK), pp. 1–5. IEEE (2019) 45. Liao, Q.: Comparison of several color histogram based retrieval algorithms. In: 2016 IEEE Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), pp. 1670–1673. IEEE (2016) 46. Ha, I., Kim, H., Park, S., Kim, H.: Image retrieval using BIM and features from pretrained VGG network for indoor localization. Build. Environ. 140, 23–31 (2018) 47. Pelka, O., Nensa, F., Friedrich, C.M.: Annotation of enhanced radiographs for medical image retrieval with deep convolutional neural networks. PLoS One 13(11), e0206229 (2018) 48. Zhang, J., Chaoquan, L., Li, X., Kim, H.-J., Wang, J.: A full convolutional network based on DenseNet for remote sensing scene classification. Math. Biosci. Eng 16(5), 3345–3367 (2019) 49. Saxen, F., Werner, P., Handrich, S., Othman, E., Dinges, L., Al-Hamadi, A.: Face attribute detection with mobilenetv2 and NasNet-mobile. In: 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 176– 180. IEEE (2019) 50. Ilhan, H.O., Sigirci, I.O., Serbes, G., Aydin, N.: A fully automated hybrid human sperm detection and classification system based on mobile-net and the performance comparison with conventional methods. Med. Biol. Eng. Comput. 58(5), 1047–1068 (2020). https://doi.org/10.1007/s11517-019-02101-y
Feature Selection and Feature Manifold for Age Estimation

Shivani Kshatriya, Manisha Sawant(B), and K. M. Bhurchandi

Department of Electronics and Communication Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
[email protected]
http://www.vnit.ac.in
Abstract. In recent years, a number of manifold learning techniques have been proposed in the literature to address the age estimation problem. In manifold methods, appearance features are projected onto a discriminant aging subspace and age estimation is performed in that subspace. In these methods the manifold is learned from gray intensity images. We propose a feature based discriminant manifold learning and feature selection scheme for robust age estimation. This paper also presents an experimental analysis of manifold learning and feature selection schemes for age estimation. The exact age value is estimated by applying regression on the resultant feature vector. Experimental analysis on a large scale aging database, MORPH-II, demonstrates the effectiveness of the proposed scheme.
Keywords: Feature manifold · Feature selection · Age estimation

1 Introduction
The human face conveys significant information for human-to-human as well as human-machine interaction. Estimating various facial attributes such as age, gender and expression plays a vital role in forensic, multimedia and law enforcement applications. Facial aging related research is broadly classified into three categories: age estimation, age synthesis and age invariant face recognition. Age estimation and synthesis mainly consider the aging information, i.e., the facial information that changes due to aging. Although significant research has been carried out on age invariant face recognition, relatively few publications have been reported on age estimation [4,11,15,31,35]. This is due to various factors such as complex biological changes, lifestyle, ethnicity and skincare, which change the shape and texture of the face. Different aging patterns are observed due to diversity in climatic conditions, races and lifestyle. Due to such large variations, it is difficult even for humans to precisely predict a person's age from facial appearance.

Supported by Visvesvaraya National Institute of Technology, Nagpur, India.
In recent years, many efforts have been devoted to identifying discriminant aging subspaces for age estimation. Some representative subspace learning methods used for age estimation include principal component analysis (PCA) [9], locality preserving projections (LPP) [21], orthogonal locality preserving projections (OLPP) [1], and conformal embedding analysis (CEA) [11,16]. The basic idea of the subspace learning methods is to find a low-dimensional representation in an embedded subspace and then perform regression in the embedded subspace to predict the exact age. Geng et al. [14] proposed a subspace called AGing pattErn Subspace (AGES) to learn a personalized aging process from multiple faces of an individual. For age group classification, Guo et al. [16] used manifold based features and a Locally Adjusted Robust Regressor (LARR). Manifold feature descriptors are characterised by low dimensionality; apart from age-discriminative information, they also contain other related information such as identity, expression and pose. Therefore, for achieving large improvements in age estimation, it is important to figure out which feature is more appropriate and important for describing the age characteristic. Existing manifold methods extract manifold features from the gray intensity or image space. However, the image space is not able to model the large age variations. Texture features such as HOG [7], SIFT [27] and GOP [26] are used to capture the textural variations due to aging, but the manifold of such feature spaces has not been explored for age estimation. Also, the selection of relevant features from these texture features is an important direction in this area. In this paper, we present an analysis of manifold and feature selection methods and their quantitative impact on age estimation. We extract age-discriminative features from the manifold subspace as well as through feature selection for age estimation. The main advantages of these features are low dimensionality and robustness under illumination variation and intensity noise, resulting in improved performance.
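The subspace-then-regression idea described above can be sketched with PCA standing in for the listed subspace methods (PCA/LPP/OLPP/CEA) and SVR for the exact-age regression; the arrays below are synthetic stand-ins for appearance features and ages, so only the pipeline structure is meaningful.

```python
# Sketch: learn a low-dimensional subspace, then regress age in that subspace.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))            # high-dimensional appearance features
ages = rng.integers(16, 70, size=500)       # ground-truth ages

model = make_pipeline(PCA(n_components=50), SVR(kernel="rbf", C=10.0))
model.fit(X[:400], ages[:400])              # learn subspace + regressor on the train split
pred = model.predict(X[400:])
print("MAE:", mean_absolute_error(ages[400:], pred))
```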
2 Related Work
Early research on age estimation mainly focused on anthropometric measurements to group facial images into different age groups. Following the development of local features, instead of age classification, much attention was focused on exact age estimation. Recent research in age estimation is classified based on the feature extraction and the feature learning methods. In this section, we briefly review them based on the facial features and learning methods for age estimation.

2.1 Aging Feature
After preprocessing, facial feature extraction is the first step in a typical age estimation approach, as shown in Fig. 1. Early age estimation approaches used Active Appearance Models (AAM) [6] for shape and texture representation. These systems utilize the shape and texture variations observed in the facial images. Lanitis et al. [25] proposed a person-specific age estimation method, wherein they have used AAM to extract craniofacial growth and aging patterns at different age groups. Further, various age estimation approaches [4,13,14,24] proposed variations of AAM to capture aging patterns for age group classification. In the case of AAM based methods, accurate localization of facial landmarks is a deciding factor for performance improvement. For appearance feature extraction, apart from the earlier listed global features, histogram based local features like HOG, LBP, SIFT, BIF and Gabor are also used. Bio-Inspired Features (BIF), proposed in [19] for age estimation, are based on a bank of multi-orientation and multi-scale Gabor filters. Recently, variants of BIF have been used for age estimation [17,18,20]; BIF is specially designed for age estimation. Various existing local features are also used for aging feature representation [10,22,33,36]. In [33], a combination of PCA, LBP and BIF is used as the aging feature. In [28] a combination of global and local features is proposed: AAM is used for global feature representation, whereas LPQ, LBP and Gabor are used for local feature extraction. Feature fusion is followed by dimensionality reduction for a compact representation of the feature vector. HOG is used as the aging feature in [10,22], whereas MLBP and SIFT are used as feature vectors in the age estimation approach of [36]. Besides the local and global facial features, manifold based features are used to learn low dimensional manifolds. Various methods such as PCA, LPP, OLPP, and CEA are used in age estimation approaches. In these methods, a low dimensional representation in an embedded subspace is learned and age estimation is performed in the embedded subspace. A personalized aging process is learned from multiple faces of an individual using an AGing pattErn Subspace (AGES) [14]. Although the performance of manifold based features is better than that of image based features, these methods require large training data to learn the manifold.

Fig. 1. Age estimation framework.

2.2 Age Regression
After feature extraction, classification or regression methods are applied on the local features for age group classification or exact age estimation, respectively. Information obtained from the facial features has been effectively used by various learning methods for regression or classification. Age estimation from facial images falls under two categories of machine learning: classification and regression. For age group classification, an age range is treated as a class label, whereas for regression it is treated as an ordered continuous value. Initial work on
age estimation in [24] compared the performance of Artificial Neural Networks (ANN), a quadratic function and a nearest neighbor classifier for age classification; the performance of the quadratic function and ANN was found to be better than nearest neighbor. Moreover, Support Vector Regression (SVR) and the Support Vector Machine (SVM) [8] are the most popular choices for age estimation. Aging patterns are learned in [17] using KPLS regression. Age values represent ordered information; this relative order information is encoded in the Ordinal Hyperplane Ranking algorithm (OHRank) in [4]. Apart from the above-mentioned regression methods, a multi-task warped Gaussian process (WGP) regression was developed in [35] for person-specific age estimation. To reduce the computational burden during training, an efficient version of WGP called Orthogonal Gaussian process (OGP) regression was proposed in [36]. Discriminant manifold subspaces are explored to encode the face for age estimation. The OLPP technique is used in [12] and [11] to extract a discriminant aging subspace. These methods learn the aging subspace from the raw image space, which is not able to represent the large facial variations due to aging. Various local feature descriptors such as LBP, HOG and SIFT are available in the literature which encode facial features such as fine lines, smoothness, and wrinkles. We propose a method to extract age relevant features from the feature space instead of the raw image space. Also, it is not known in advance which manifold is suitable for the age discriminative feature; we provide an experimental analysis of feature manifolds for age estimation. Local feature descriptors such as HOG, SIFT, etc. extract important gradient and edge information and are used for facial analysis; hence it is important to select the age discriminative features from them. Among various machine learning approaches, feature selection is a technique which selects and ranks relevant features according to their degrees of relevance and preference. In the age estimation literature, the use of feature selection methods has not been explored. In this paper, we extract the age discriminative features using feature selection methods. Section 1 presents the introduction, followed by the literature survey in Sect. 2. The proposed method is presented in Sect. 3, while Sect. 4 presents experimental results. Section 5 presents the final conclusion of this paper.
3 Proposed Work
The proposed age estimation framework mainly incorporates four modules: face preprocessing, feature extraction, feature transformation/selection, and regression. In the first stage, face images undergo normalization such as pose correction and histogram equalization. Then, the histogram-of-oriented-gradients (HOG) feature is computed for each image. Being a histogram-based local feature, the dimension of the extracted feature vector can be very high depending on the number of scales and orientations. The high dimensionality of the extracted local features is, in general, handled by a dimensionality reduction technique. However, in dimensionality reduction it is not analysed whether the transformed space truly represents the aging subspace. It is possible that the transformation of the local features by a dimensionality reduction technique may lead to a subspace which is not discriminative for age estimation. For the analysis of facial images, various local feature descriptors such as HOG, SIFT and LBP are used. These local descriptors are found suitable for both the face recognition and the age estimation task, which implies that these features carry information about both identity and age. But dimensionality reduction techniques are not able to discriminate between the aging feature and other facial features while reducing the dimension of the local features. Hence, it is highly essential to select only those features which carry the relevant aging information. Therefore, along with the analysis of manifold features, we also provide an analysis of feature selection methods for age estimation. After extracting the relevant features, we apply orthogonal Gaussian process regression to estimate the exact age.
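A minimal sketch of this pipeline is shown below: HOG features per face image, a feature selection step, then regression to the exact age. `SelectKBest(f_regression)` is one example selector, and SVR stands in for the orthogonal Gaussian process regressor used in the paper; both substitutions, and all array sizes, are assumptions for illustration only.

```python
# Sketch of the proposed pipeline: HOG -> feature selection -> age regression.
import numpy as np
from skimage.feature import hog
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
faces = rng.random((200, 64, 64))                  # stand-ins for aligned gray face images
ages = rng.integers(16, 70, size=200).astype(float)

def hog_features(img):
    return hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

X = np.stack([hog_features(f) for f in faces])     # (200, D) HOG descriptors

model = make_pipeline(SelectKBest(f_regression, k=200),   # keep the k most age-relevant dims
                      SVR(kernel="rbf"))                   # stand-in for OGP regression
model.fit(X[:150], ages[:150])
print("predicted ages:", model.predict(X[150:155]).round(1))
```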
3.1 Aging Manifold Features
Suppose the facial feature space F is represented as $F = \{f_i : f_i \in \mathbb{R}^{D}\}_{i=1}^{N}$, where D is the dimension of the data and N is the number of face images. The true age labels $a_i$ are represented as $y = \{a_i : a_i \in \mathbb{N}\}_{i=1}^{N}$. We want to learn a low-dimensional manifold G that is embedded in F and subsequently a manifold aging feature $\{x_i : x_i \in \mathbb{R}^{d}\}_{i=1}^{N}$ with $d \ll D$.

$$\mathrm{IDF}_i = \log \frac{N}{\sum_{j=1}^{N} \mathbb{1}(X_{ji} > 0)} \qquad (7)$$
A classifier can be trained on the training set feature matrix $X_i \odot \mathrm{IDF}$ ($\odot$ is the element-wise product), which is sent to a classifier such as a linear SVM.
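The bag-of-visual-words step described here can be sketched as follows: cluster the training features into a codebook, build per-video histograms, apply the IDF weighting of Eq. (7) element-wise, and train a linear SVM. All sizes and the 2-D toy features are illustrative only.

```python
# Sketch: codebook -> hard-assignment histograms -> IDF weighting -> linear SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
K = 20                                               # codebook size
videos = [rng.normal(size=(int(rng.integers(50, 100)), 2)) for _ in range(60)]
labels = rng.integers(0, 3, size=60)                 # e.g. Left / Right / Ambiguous

codebook = KMeans(n_clusters=K, n_init=10, random_state=0).fit(np.vstack(videos))

def bov_histogram(feats):                            # hard assignment to nearest codeword
    words = codebook.predict(feats)
    return np.bincount(words, minlength=K).astype(float)

X = np.stack([bov_histogram(v) for v in videos])     # (N, K) word counts
df = (X > 0).sum(axis=0)                             # document frequency of each word
idf = np.log(len(videos) / np.maximum(df, 1))        # IDF weights as in Eq. (7)
X_weighted = X * idf                                 # element-wise product

clf = LinearSVC(max_iter=5000).fit(X_weighted, labels)
print(clf.score(X_weighted, labels))
```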
4 Experimentation
We use the KTH benchmark dataset [21] to show the effectiveness of the grid-based features. The dataset has six types of actions, with a single person performing an action in a video. There are 191, 192 and 216 videos in the training, validation and test partitions, respectively. The frame size is 120 × 160 and the videos mostly use a stationary camera. During training, the temporal action labels were used, and at validation/test time the evaluation was carried out similarly to [4]. The optical flow based features and the ResNet-extracted feature variants were tested on the Cricket strokes dataset [5], which has 562 trimmed strokes, with 351 in the training set, 105 in the validation set and 106 in the test set, labeled with 3 classes Left, Right and Ambiguous based on the direction of camera motion. We independently labeled all the strokes at a finer level, into 5 categories, based on the direction of the stroke hit, as shown in Fig. 1b. It must be noted that the strokes labeled C1, C2 (in 5 classes) need not necessarily belong to the Right category (in 3 classes); the same is the case with C3, C4 corresponding to Left strokes and C5 to Ambiguous strokes. Generally, a stroke type is relative and depends on the direction of stroke play with respect to the batsman: off-side refers to the left side of the camera view for a right-handed batsman and leg-side refers to the right side of the view. The directions are opposite when a batsman is left-handed. Here, we do not consider the posture of the batsman for annotation. The video clips are 360 × 640 (H, W) at 25 FPS.
5 Results and Discussion
Figure 3 shows the accuracy values of the grid-based features on the KTH validation set using hard assignment. Two values of grid size were used, and with an increase in codebook size the accuracy gradually increases. Magnitude-based visual words performed better than the orientation-based words, because the actions are not direction dependent (e.g., a person may run either from left to right or from right to left). We also computed the accuracy by concatenating the magnitude and angle features before clustering. Moreover, using orientation histograms on KTH would be conceptually incorrect, since it would learn the direction of motion in an action when the action categories are not direction dependent. The accuracy obtained using grid features on the test set of the KTH dataset was 79.17%, with a grid size of 20 and K = 150, when used with hard assignment.
Fig. 3. Accuracy on KTH validation set
Figures 4a to 4h show the accuracy heat-maps over a range of values for grid-size (g)/bins (b) and codebook size (K), for the Cricket validation set using the hard-assignment TF-IDF variants. Here, the accuracy does not necessarily increase on increasing K, g, or b. Moreover, the IDF values with SC give better results, with less variation, due to the smaller frequency values.
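The grid-sampled dense optical flow features evaluated here can be sketched as follows: Farnebäck flow between consecutive frames, with magnitude and angle read off at every g-th pixel of the grid. The Farnebäck parameter values in the sketch are illustrative, not the paper's exact settings.

```python
# Sketch: dense Farneback optical flow sampled at equally spaced grid points.
import cv2
import numpy as np

def grid_flow_features(prev_gray, next_gray, g=20):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])   # per-pixel magnitude/angle
    mag_grid = mag[g::g, g::g].flatten()                     # sample at grid intersections
    ang_grid = ang[g::g, g::g].flatten()
    return mag_grid, ang_grid

prev_f = np.random.randint(0, 256, (360, 640), dtype=np.uint8)
next_f = np.random.randint(0, 256, (360, 640), dtype=np.uint8)
mag_feat, ang_feat = grid_flow_features(prev_f, next_f, g=20)
print(mag_feat.shape, ang_feat.shape)    # per-frame-pair magnitude/angle feature vectors
```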
Fig. 4. Accuracy heat-maps for OF Grid and HOOF features for WC and SC evaluated for validation set of Cricket strokes with 3 categories and 5 categories. Panels: (a) OFGrid, WC, 3; (b) OFGrid, SC, 3; (c) OFGrid, WC, 5; (d) OFGrid, SC, 5; (e) HOOF, WC, 3; (f) HOOF, SC, 3; (g) HOOF, WC, 5; (h) HOOF, SC, 5.
Contrary to the KTH actions, HOOF features perform well when the task is direction dependent. The value of mth was chosen such that it considers pixels that show significant motion; it should not be too low, which would include pixels from the nearly stationary background. A detailed experimental analysis of the mth parameter is given in [5]. Though both motion features are extracted using the same underlying dense optical flow method, the HOOF features give better results for 3 classes, while the grid-based features give better results for 5 classes on the strokes dataset. This is due to the coarse and fine labeling criteria for the strokes. HOOF features are able to better capture the camera motion at a coarse level and therefore give better results for the 3 motion classes, but they underperform when the direction of stroke play is considered for labeling; for example, the camera motion may be similar for C5 and C1 strokes. The OF grid features, unlike the HOOF features, retain the spatial association of pixel motion and are more useful for recognizing fine-grained categories. Figures 5a to 5f show comparative HA and SA evaluations of different features on the strokes validation set for 5 categories. The soft assignment is much smoother and generally performs better than hard assignment, though it might require tuning of the β parameter. The 2D/3D ResNet extracted features (from the last AvgPool layer) performed worse than the motion features. This can be attributed to the fact that the models have pretrained weights trained on ImageNet and Kinetics classes, which focus on generic object/action recognition, while the cricket strokes dataset has domain-specific classes. The results on the test set of the Cricket strokes dataset were calculated by choosing magnitude threshold parameter mth = 2, number of orientation histogram bins b = 10 and codebook size K = 20. The accuracy obtained was 85.85% for 3 categories, with 91 out of 106 correct predictions. The best accuracy of 82.08% for 5 categories was obtained by OF Grid features with SA. Detailed results are given in Tables 1, 2 and 3.
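The thresholded orientation histogram and the soft assignment to a codebook can be sketched as below. The Gaussian kernel exp(-β d²) is one common choice for soft assignment and is an assumption here, not necessarily the paper's exact form; sizes and inputs are toy values.

```python
# Sketch: HOOF-style orientation histogram and kernel-based soft assignment.
import numpy as np

def hoof(mag, ang, mth=2.0, bins=10):
    """Histogram of flow orientations over pixels with magnitude > mth."""
    keep = mag > mth
    hist, _ = np.histogram(ang[keep], bins=bins, range=(0.0, 2 * np.pi))
    return hist.astype(float)

def soft_assign_histogram(features, codebook, beta=1.0):
    """Accumulate soft (kernel-weighted) assignments of each feature to all codewords."""
    hist = np.zeros(len(codebook))
    for x in features:
        d2 = np.sum((codebook - x) ** 2, axis=1)
        w = np.exp(-beta * d2)
        hist += w / (w.sum() + 1e-12)        # each feature contributes weights summing to 1
    return hist

rng = np.random.default_rng(0)
mag = rng.random((360, 640)) * 5
ang = rng.random((360, 640)) * 2 * np.pi
feats = np.stack([hoof(mag, ang) for _ in range(8)])   # per-frame HOOF features
codebook = rng.random((20, feats.shape[1])) * feats.max()
print(soft_assign_histogram(feats, codebook, beta=0.1))
```

With beta large, the weights concentrate on the nearest codeword and the histogram approaches the hard-assignment case.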
Fig. 5. Accuracy values for varying cluster sizes for optical flow grid based, orientation histogram, and ResNet extracted features on validation set strokes. Panels: (a) OF Grid (HA); (b) OF Grid (SA); (c) HOOF (HA); (d) HOOF (SA); (e) 2D/3D ResNet (HA); (f) 2D/3D ResNet (SA).
Table 1. Evaluation on test set using OF Grid 20. 87/106 strokes were correctly classified (i.e., 82.08%). Rows are predictions, columns are ground truth.

      Ground truth
      C1   C2   C3   C4   C5
C1    4    0    0    0    0
C2    3    27   0    1    4
C3    0    0    9    1    3
C4    0    0    0    4    2
C5    0    2    3    0    43

Table 2. Evaluation on test set using HOOF for 3 classes. 91/106 strokes were correctly classified (i.e., 85.85%). Rows are predictions, columns are ground truth labels.

            Ground truth
            Left   Right   Ambiguous
Left        20     0       2
Right       2      41      0
Ambiguous   7      4       30
Table 3. Accuracy values (#Correct predictions/Total samples) on Cricket Highlights Test set for 5 categories. Total number of strokes in partition is 106.

Feature        Params         #Words   HA       Params         #Words   SA
OFGrid         Grid = 20      70       0.6604   Grid = 20      100      0.8208
HOOF mth = 2   Bins = 30      40       0.6792   Bins = 40      50       0.7170
2D ResNet50    ClipSize = 1   130      0.5849   ClipSize = 1   130      0.6132
3D ResNet18    ClipSize = 16  40       0.6038   ClipSize = 16  50       0.6415

6 Conclusion
In this work, we model the Cricket stroke recognition problem in a Bag of Visual Words setting. The recognition is done at two granularity levels by considering 3 categories and 5 categories of strokes. The extracted features of the training set are clustered to create a codebook of visual motion words, and each video is represented as a histogram over these codebook words. After IDF weighting, we train a linear SVM and evaluate on the validation and test sets. The pipeline was tested on two variants of optical flow motion features: grid-based sampling and orientation histograms. The grid-based features consider flow values located at equally spaced grid intersection points, while the orientation histograms take pixels with significant motion and create histograms over uniform bin divisions of [0, 2π]. To conclude,
– The HOOF features perform better than the grid features when the categories are at a coarse level, but are outperformed by the grid features when the number of categories increases.
– Soft assignment is, generally, better than hard assignment, with less variation over a range of clustering parameters.
– IDF SC gives better results than IDF WC, with less variation, and is potentially better than WC for small datasets.
– The best test set accuracy values for 3-category and 5-category strokes were 85.85% and 82.08%, respectively.
– Evaluation of grid features on the KTH dataset proved the effectiveness of motion-based visual words in a BoV framework.
For future work, fine-grained stroke detection can be done by annotating a large-scale dataset and detecting the batsman poses.
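The sketch below illustrates the codebook/soft-assignment/IDF pipeline summarised above. It is a hypothetical re-implementation with illustrative parameter values (β, K), not the authors' code, and the document-frequency computation shown is only one plausible variant of the SC/WC weightings discussed.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=20):
    # train_descriptors: (N, D) stack of motion features from all training strokes.
    return KMeans(n_clusters=k, random_state=0).fit(train_descriptors)

def soft_bovw_histogram(descriptors, kmeans, beta=0.5):
    # Soft assignment: each descriptor votes for every word with weight exp(-beta * d^2).
    d = kmeans.transform(descriptors)            # (n, k) distances to cluster centres
    w = np.exp(-beta * d ** 2)
    w /= w.sum(axis=1, keepdims=True)
    h = w.sum(axis=0)
    return h / h.sum()

def idf_weights(train_histograms):
    # Down-weight visual words that occur in most training videos.
    hs = np.asarray(train_histograms)
    df = (hs > 0).sum(axis=0)
    return np.log(len(hs) / np.maximum(df, 1))

# Classifier: sklearn.svm.LinearSVC().fit(train_histograms * idf_weights(train_histograms), labels)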
References
1. Hawk-Eye Innovations: Hawk-eye in cricket. https://www.hawkeyeinnovations.com/sports/cricket. Accessed 30 July 2020
2. Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R.: Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 1932–1939 (2009). https://doi.org/10.1109/CVPRW.2009.5206821
3. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 363–370. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-45103-X_50
4. Gupta, A., Balan, M.S.: Action recognition from optical flow visualizations. In: Chaudhuri, B.B., Kankanhalli, M.S., Raman, B. (eds.) Proceedings of 2nd International Conference on Computer Vision & Image Processing. AISC, vol. 703, pp. 397–408. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7895-8_31
5. Gupta, A., Karel, A., Sakthi Balan, M.: Discovering cricket stroke classes in trimmed telecast videos. In: Nain, N., Vipparthi, S.K., Raman, B. (eds.) CVIP 2019. CCIS, vol. 1148, pp. 509–520. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-4018-9_45
6. Gupta, A., Balan, M.S.: Temporal cricket stroke localization from untrimmed highlight videos. In: Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing, ICVGIP 2018. Association for Computing Machinery, New York (2018). https://doi.org/10.1145/3293353.3293415
7. Gupta, A., Muthiah, S.B.: Viewpoint constrained and unconstrained Cricket stroke localization from untrimmed videos. Image Vis. Comput. 100, 103944 (2020). https://doi.org/10.1016/j.imavis.2020.103944
8. Harikrishna, N., Satheesh, S., Sriram, S.D., Easwarakumar, K.S.: Temporal classification of events in cricket videos. In: 2011 National Conference on Communications (NCC), pp. 1–5, January 2011. https://doi.org/10.1109/NCC.2011.5734784
9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
10. Hsu, Y., Yang, S., Chang, H., Lai, H.: Human daily and sport activity recognition using a wearable inertial sensor network. IEEE Access 6, 31715–31728 (2018). https://doi.org/10.1109/ACCESS.2018.2839766
11. Kay, W., et al.: The Kinetics human action video dataset. CoRR abs/1705.06950 (2017)
12. Kolekar, M.H., Palaniappan, K., Sengupta, S.: Semantic event detection and classification in cricket video sequence. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 382–389, December 2008. https://doi.org/10.1109/ICVGIP.2008.102
13. Kolekar, M.H., Sengupta, S.: Semantic concept mining in cricket videos for automated highlight generation. Multimedia Tools Appl. 47(3), 545–579 (2010). https://doi.org/10.1007/s11042-009-0337-1
14. Kumar, A., Garg, J., Mukerjee, A.: Cricket activity detection. In: International Image Processing, Applications and Systems Conference, IPAS 2014, pp. 1–6 (2014). https://doi.org/10.1109/IPAS.2014.7043264
15. Li, W.X., Vasconcelos, N.: Complex activity recognition via attribute dynamics. Int. J. Comput. Vision 122(2), 334–370 (2017). https://doi.org/10.1007/s11263-016-0918-1
16. Najafzadeh, N., Fotouhi, M., Kasaei, S.: Multiple soccer players tracking. In: 2015 International Symposium on Artificial Intelligence and Signal Processing (AISP), pp. 310–315, March 2015. https://doi.org/10.1109/AISP.2015.7123503
17. Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. arXiv preprint arXiv:1405.4506 (2014)
18. Piergiovanni, A., Ryoo, M.S.: Fine-grained activity recognition in baseball videos. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018
19. Pramod Sankar, K., Pandey, S., Jawahar, C.V.: Text driven temporal segmentation of cricket videos. In: Kalra, P.K., Peleg, S. (eds.) ICVGIP 2006. LNCS, vol. 4338, pp. 433–444. Springer, Heidelberg (2006). https://doi.org/10.1007/11949619_39
20. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
21. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32–36. IEEE Computer Society, Washington, DC (2004). https://doi.org/10.1109/ICPR.2004.747
22. Semwal, A., Mishra, D., Raj, V., Sharma, J., Mittal, A.: Cricket shot detection from videos. In: 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–6, July 2018. https://doi.org/10.1109/ICCCNT.2018.8494081
23. Sharma, R.A., Sankar, K.P., Jawahar, C.V.: Fine-grain annotation of cricket videos. CoRR abs/1511.07607 (2015)
24. Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: Proceedings Ninth IEEE International Conference on Computer Vision, vol. 2, pp. 1470–1477, October 2003. https://doi.org/10.1109/ICCV.2003.1238663
25. Soomro, K., Zamir, A.R.: Action recognition in realistic sports videos. In: Moeslund, T.B., Thomas, G., Hilton, A. (eds.) Computer Vision in Sports. ACVPR, pp. 181–208. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-09396-3_9
26. Thomas, G., Gade, R., Moeslund, T.B., Carr, P., Hilton, A.: Computer vision for sports: current applications and research topics. Comput. Vis. Image Underst. 159, 3–18 (2017). https://doi.org/10.1016/j.cviu.2017.04.011
27. van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., Geusebroek, J.: Visual word ambiguity. IEEE Trans. Pattern Anal. Mach. Intell. 32(7), 1271–1283 (2010)
Multi-lingual Indian Text Detector for Mobile Devices Veronica Naosekpam(B) , Naukesh Kumar, and Nilkanta Sahu Indian Institute of Information Technology Guwahati, Guwahati, Assam, India {veronica.naosekpam,nilkanta}@iiitg.ac.in http://iiitg.ac.in/faculty/nilkanta/
Abstract. Detection of text in natural scene images is a challenging problem owing to significant variations in the appearance of the text as well as the background. The task becomes even more difficult for Indian texts, due to the presence of multiple languages and scripts. Most of the recent efficient schemes use very deep CNNs, which require large memory and computation. In this paper, we propose a multi-lingual text detection system based on the compressed versions of an object detection framework called YOLO (You Only Look Once) [13], namely YOLO v3-Tiny and YOLO v4-Tiny. The aspect ratios of the anchor boxes are calculated using K-means clustering on the ground truth values, so that words of varying lengths can be detected accurately. The text detector has been evaluated on the IndicSceneText2017 [11] dataset, which consists of three Indian languages along with English. Experimental results prove the efficiency of the proposed scheme for use in embedded systems and mobile devices due to its fast inference speed and small model size.
Keywords: Scene text detection · YOLO · Indian script

1 Introduction
Reading text present in natural scenes is an active research area in the computer vision community due to its numerous potential applications such as autonomous navigation for self-driving cars, OCR (Optical Character Recognition), multilingual translation, image understanding, etc. It includes two sub-tasks: text detection and text recognition. The goal of text detection is to find and localize the potential regions that contain the desired text data in the input image accurately and efficiently. In spite of enormous efforts made in the last decade, it still remains an open problem. Challenges of text detection from scene images are 1. text instances often exhibit vast diversities in scale, font and orientation (Fig. 1b), 2. various illumination effects, such as shadow (Fig. 1c), 3. highly complicated and cluttered background (Fig. 1c), etc. The task becomes even more challenging for a diverse country like India where multiple languages/scripts can be present in a single image (Fig. 1d). In this paper we propose a multi-lingual scene text detection scheme using the compact versions of YOLO v3 [14] and
Fig. 1. Challenges (Image Source: IndicSceneText2017 [11]): (a) occlusion, (b) different font size and orientation, (c) uneven illumination, (d) multi-lingual text.
YOLO v4 [1], namely YOLO v3-tiny and YOLO v4-tiny. The idea is to consider text as an object and use an object detection scheme to detect scene text. From the experiments, it is observed that YOLO v4-tiny is faster, gives promising results and requires less memory than its predecessor, YOLO v3-tiny. Since very few datasets are presently available for Indian languages, the benchmark dataset on which we have experimented is the IndicSceneText2017 [11]. It includes three Indian languages (Devanagari, Malayalam and Telugu) along with English. The results obtained imply that our proposed implementation gives a mAP score much higher than the pre-trained models based on the COCO dataset. The rest of the paper is organized as follows: Sect. 2 deals with a detailed survey of the existing works in the scene text detection domain, analysing their schemes and shortcomings; Sect. 3 describes the proposed text detection scheme based on YOLO v3-tiny and YOLO v4-tiny for Indian scene text images; Sect. 4 presents the experimental results of the proposed system; and lastly, a conclusion is drawn in Sect. 5.
2 Related Works
Text detection in natural images can be broadly divided into two categories: Exhaustive Search [2,12,20] and Selective Search [3,15–17,21]. The former uses a multiple-scale sliding window for feature extraction. The visual search space is huge, which makes it computationally expensive. Whereas, in the case of Selective search, candidate text regions are first extracted based on certain text properties, such as stroke width or approximately constant color and heuristic filtering rules are adopted to identify text regions. As these methods can
simultaneously detect texts at any scale and process faster on account of the low number of candidate regions, they are more popular than the exhaustive search. Initially, the task of text detection was attempted using statistical features like SWT, MSER, etc. Stroke Width Transform (SWT) [3] is a simple and efficient algorithm, which has been conventionally used for text detection. Though it is efficient for text detection, it is sensitive to noisy and blurry images. Xu et al. [21] modified the SWT [3] and incorporated a Deep Belief Network to detect text objects in scene images by combining smoothness-based edge information with gradient-based edge information to generate high quality edge images based on color consistency regions (CCR). The seed-based stroke width transform method for detecting text [15] within natural scene images discovers the self-similarity of character strokes so as to overcome some defects of the original SWT [3] on character localization under unfavourable image conditions such as complicated and low contrast text object compositions. Tang et al. [16] proposed a color-based stroke width extraction and presented a similar background components connection algorithm that combines SWT [3] with the proposed color-based stroke width component extraction method for detecting colorful texts. The Maximally Stable Extremal Region (MSER) [10] algorithm has also been explored by many for detecting character candidates, but MSER generates false positives if the image quality is poor. To overcome this, the algorithm by Yin et al. [23] performs parent-children elimination for the MSER tree by removing any extremal region that violates a given aspect ratio. With the huge success of deep learning based approaches for object classification and detection in recent times, researchers have tried similar ideas for text detection as well, with great success. TextFlow [17] used a cascade boosting method for the detection of candidate characters. It combined false candidate removal, text line extraction and verification of text lines into a single step. The model solved the problem of error accumulation, which was a major shortcoming of previous models like [3,15]. Though this scheme reduced the number of steps, the performance was not enhanced to a very great extent. TextBoxes [6] uses a single neural network to detect bounding boxes and their offsets. The model solved the problem of slow detection speed but failed to predict bounding boxes on some challenging images. CTPN (Connectionist Text Proposal Network) [18] uses the existing RPN (Region Proposal Network) architecture with a deep CNN to fill the gap of text localization. It uses a Bi-LSTM [4] to deal with the vanishing gradient problem, and the Bi-LSTM allows the model to cover the image width by encoding the recurrent context in both directions. In EAST [24], Zhou et al. proposed a scheme that follows the DenseBox [5] approach. In the scheme, they used an FCN to find pixel-level text score and geometry. Features from different levels are merged to predict a quadrangle of the text. Ma et al. [9] proposed an arbitrarily oriented text detection scheme that extended the region proposal network approach for object detection to rotational region proposals. The orientation or angle from the rotation proposal is used to regress a tighter bounding box around the text. In recent works, researchers are trying to detect curved text. In [8], Liu et al. proposed a new dataset for
curved text. They also proposed a polygonal text detection scheme using recurrent transverse and longitudinal offset connection. Their approach also utilized a Faster-RCNN-like approach to regress the bounding box. Instead of a 4-point (rectangular) bounding box, they detect a 14-point polygonal box to locate the text. In [7], Liu et al. proposed another approach for curved text detection where the curvature of the text is represented using cubic Bezier parameters. ABCNet (Adaptive Bezier-Curve Network) was proposed to align the Bezier parameters, appended by a recognition network. A detailed survey of the related work can be found in [22]. Most of the related works described in this section focus on detection of a single language, and their behaviour on multilingual text detection is unknown. Moreover, although recent deep learning based approaches are very efficient, their large parameter sets (causing huge memory requirements) make them infeasible for mobile based applications. In this paper, a multi-lingual single-stage text detector is proposed based on the YOLO [13] framework. We implemented the lighter versions of YOLO (YOLO v3-Tiny and YOLO v4-Tiny) for the text detection domain and analyzed how well they can detect the text present in natural scene images.
3 Proposed Scheme
To make our scheme feasible for mobile-based applications, shallow versions of YOLO are chosen. Any general object detector is made up of the following major parts: backbone, neck and head. An image is taken as input and its features are compressed down through a convolutional neural network backbone. Multiple bounding boxes need to be drawn around images, along with the classification (here it is text), so the feature layers of the backbone have to be mixed and stacked one after another. The function of the neck part of the detector is to combine the backbone layers. Detection of text takes place in the head part. As YOLO is a single-stage detector, it performs text localization and classification at the same time. The input images are divided into S × S grids and the presence of text is detected in every grid cell. It predicts multiple bounding boxes of predefined sizes (anchor boxes) with different aspect ratios and sizes per cell. Then, non-maximum suppression is applied to select the box/boxes with higher IoU with the ground truth. Anchor box dimensions are decided using K-means clustering over the ground truth values (actual text box dimensions). The output text detection is of the shape (batch size × number of anchor boxes × grid size × grid size × 6). The 6 dimensions come from the representation of the bounding box as (pc, bx, by, bh, bw, c), where pc denotes the probability of detection, (bx, by) refers to the bounding box's x-centre and y-centre coordinates, (bh, bw) refers to the bounding box's height and width, and c denotes the class. In our case, c = text. YOLO uses the sum-squared error between the predictions and the ground truth to calculate the loss. The loss function for YOLO text detection is:

Loss = L_localization + L_confidence + L_classification    (1)
where L_localization measures the errors in the predicted boundary box locations and sizes, and L_confidence measures the objectness, i.e., whether a text is detected in the bounding box. If a text is detected, the loss for classification is the squared error of the class probabilities for each class (here, a single class). The losses at the two different scales are summed up for back-propagation. Different backbone networks are used for our proposed YOLO-based text detection models.
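To make the composite objective of Eq. (1) concrete, the following is a simplified, illustrative PyTorch sketch of a sum-squared-error loss for the single-class (text) case. It assumes ground-truth boxes have already been assigned to grid cells and anchors, and the lambda_coord/lambda_noobj weightings are the standard YOLO choices rather than values reported in this paper; Darknet organises the actual computation differently.

import torch
import torch.nn.functional as F

def yolo_text_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    # pred, target: (B, A, S, S, 6) tensors laid out as (pc, bx, by, bh, bw, c).
    # obj_mask: (B, A, S, S) boolean, True where a ground-truth text box is assigned.
    # Localization: squared error on box geometry, only where an object exists.
    loc = F.mse_loss(pred[..., 1:5][obj_mask], target[..., 1:5][obj_mask], reduction='sum')
    # Confidence: objectness pushed towards 1 in object cells, towards 0 elsewhere.
    conf_obj = F.mse_loss(pred[..., 0][obj_mask], target[..., 0][obj_mask], reduction='sum')
    conf_noobj = F.mse_loss(pred[..., 0][~obj_mask], target[..., 0][~obj_mask], reduction='sum')
    # Classification: single class "text", squared error on the class score.
    cls = F.mse_loss(pred[..., 5][obj_mask], target[..., 5][obj_mask], reduction='sum')
    return lambda_coord * loc + conf_obj + lambda_noobj * conf_noobj + cls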
3.1 YOLO V3-Tiny
YOLO v3 [14] builds upon previous models by adding an objectness score to the bounding box prediction using logistic regression. The salient features of YOLO v3 are the introduction of skip connections and detection at three different scales. YOLO v3-Tiny is a small deep neural network in which the depth of the convolutional layers is decreased compared to the previous YOLO models. It is 442% faster than YOLO v3 [14]. The network architecture of YOLO v3-tiny is given in Fig. 2. These models are more suitable for embedded devices, which compromise accuracy to increase speed and memory efficiency. Whereas YOLO v3 uses a large number of 1 × 1 and 3 × 3 convolution layers in its Darknet-53 architecture, the tiny version utilises only 6 pooling layers and 9 convolution layers. It predicts a three-dimensional tensor that contains the score of the detected text, the bounding box, and the class predictions (class = text) at two different scales. The feature map scales at which bounding boxes are predicted are 13 × 13 and 26 × 26, the latter merged with an up-sampled 13 × 13 feature map. The text detection process for this model takes place at layers 16 and 23 of the neural network.
3.2 YOLO V4-Tiny
YOLO v4 was released by Bochkovskiy et al. [1] and includes a large number of new features for the backbone and detector networks in the YOLO architecture that have improved the overall CNN accuracy. The primary goal of this architecture is to optimize the detector network for parallel computation. The features are categorized as Bag of Freebies (BoF), methods that only increase the training cost or change the training strategy, and Bag of Specials (BoS), plugin modules and post-processing methods that significantly improve the detection accuracy at only a small increase in inference cost. BoF for the backbone are:
– CutMix: It refers to cutting and pasting a portion of an image over another image and re-adjusting the ground truth labels.
– Mosaic data augmentation: Combining 4 training images into one for training to enhance the detection of text outside the normal context.
– DropBlock regularization: Dropping a block of block_size × block_size pixels instead of individual pixels.
– Class label smoothing: It adjusts the target upper bound of the prediction to a lower value.
Fig. 2. Architecture of YOLO v3-Tiny
BoS for the backbone are:
– Mish activation: It is a self-regularized, non-monotonic neural activation function with values ranging between ≈ −0.31 and ∞.
– Multi-input weighted residual connections (MiWRC): The contribution of different input features at different resolutions to the output feature is weighted unequally.
BoF for the detector are:
– DropBlock regularization and the data augmentation techniques Mosaic and Self-Adversarial Training (SAT).
– Complete IoU loss: It increases the overlapping area of the ground truth box and the predicted box and maintains the consistency of the boxes' aspect ratio.
– Use of multiple anchors for a single ground truth if IoU(ground truth, anchor) > IoU threshold.
– Cosine annealing scheduler: It adjusts the learning rate according to a cosine function.
BoS for the detector are: Mish activation and the PAN path-aggregation block. A combination of the above mentioned features can be used while training YOLO v4. For our text detection in natural scene experiments, we have implemented YOLO v4-tiny, which is roughly 8× faster than YOLO v4. YOLO v4 uses CSPDarknet53 [19] as the backbone network. For our experiment, the number of convolutional layers in the CSP backbone is compressed. The neck portion is made of spatial pyramid pooling and PANet (Path Aggregation Network) path aggregation. Spatial pyramid pooling improves the receptive field and the ability to distinguish highly important context features. The PANet is deployed for aggregation of parameters at distinct detector levels. It also fuses information
of features from all the previous layers using element-wise max operations. The neck portion is responsible for extracting feature maps from different stages of the backbone. The head portion is the same as in YOLO v3, except that the number of YOLO layers is two instead of three.
4 Experimental Results
To find the efficiency of our approach we have used a set of metrics. Intersection over Union (IoU) is the ratio of the intersection of the predicted bounding box (Bp) and the actual ground truth bounding box (Bgt) to their union; the higher the IoU value, the closer the predicted bounding box is to the ground truth:

IoU = (Bgt ∩ Bp) / (Bgt ∪ Bp)    (2)

Precision is given as the ratio of true positives (TP) to the total number of predicted positives:

Precision = TP / (TP + FP)    (3)

Recall, also called sensitivity, is defined as the ratio of true positives to the total number of ground truth positives:

Recall = TP / (TP + FN)    (4)
where TP: True Positive, TN: True negative, FN: False Negative and FP: False positive. Mean average Precision for the model is calculated using the Average Precision (AP) over the class (here “Text”) and/or over all thresholds. In our case, a single threshold of 0.5 is taken. To get the AP of a particular class, the area under the precision-recall curve has to be calculated. The value for mAP lies between 0 and 1. mAP is computed to estimate accuracy. F1-score is a metric that combines recall and precision by taking their harmonic mean. The dataset used for our work is the IndicSceneText2017 [11]. It consists of multiple regional Indian languages namely, Devanagiri, Malayalam, Telugu and a small amount of English language. The dataset is divided as : 80% train set, 10% validation set and 10% test set. The images are manually labelled using a labeling tool. Experimentation is done on two resolutions (320 × 320 and 416 × 416) of the input images. The bounding box priors are calculated separately for both the resolutions. On the IndicScene2017, for both the architectures, 6 anchor boxes and 2 scales are chosen using K-means clustering. The 6 clusters (anchor boxes) for 320 × 320 image resolution were: (0.78 × 0.55), (1.34 × 1.31), (2.08 × 0.65),
(2.21 × 2.77), (3.68 × 1.24), (5.08 × 3.26) and 416 × 416 resolution were (0.98 × 0.75), (1.83 × 1.76), (2.61 × 0.83), (3.10 × 3.62), (4.82 × 1.55), (6.75 × 4.39). The hyper-parameters chosen for training the YOLO v3-tiny are: learning rate = 0.001, max batches = 3000, momentum = 0.9. The hyper-parameters chosen for training the YOLO v4-tiny are: learning rate = 0.00261, max batches = 3000, momentum = 0.9. The decay is set to 0.0005 for both the models. To start training on our dataset, the YOLO v3-tiny and YOLO v4-tiny text detector both use the weights pre-trained on COCO dataset. The system used is the Tesla K80 with 12 GB RAM. The graph for Training loss (in blue) and Validation mAP (in red) is shown in Fig. 3a and 3b. It is evident that YOLO v4-tiny gives better mAP values than YOLO v3-tiny. The size of each model is 33 MB for YOLO v3-tiny and 22 MB for YOLO v4-tiny which implies that the models are suitable for installation in low memory devices.
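A minimal sketch of how anchor priors such as those listed above can be derived from the ground-truth boxes is given below. It uses plain Euclidean K-means on normalised widths and heights (Darknet's anchor generation uses an IoU-based distance instead), and the grid/image sizes shown are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def anchor_priors(boxes_wh, num_anchors=6, img_size=416, grid=13):
    # boxes_wh: (N, 2) array of ground-truth text-box widths/heights in pixels.
    wh = np.asarray(boxes_wh, dtype=float) * grid / img_size   # express in grid-cell units
    km = KMeans(n_clusters=num_anchors, random_state=0).fit(wh)
    # Sort anchors by area so small and large priors are grouped consistently.
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]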
Fig. 3. Training loss and mAP plot for (a) YOLO v3-tiny and (b) YOLO v4-tiny. Zoom in for better visualization. (Color figure online)
The performance of the system on various parameters is listed in Table 1 and the accuracy is shown in Table 2. From the tables, it is observed that mAP scores are comparatively higher than the one originally obtained from COCO dataset which is 33.1% on YOLO v3-tiny and 40.2% on YOLO v4-tiny at 416 × 416 resolutions whereas our models achieved 44.3% on YOLO v3-tiny and 69.2% on YOLO v4-tiny at 416 × 416 resolutions. The reason may be the use of single class training as compared to the 80-class training in COCO dataset. Some of the results of text detection in natural scene images using YOLO v3-tiny and YOLO v4-tiny on IndicSceneText2017 [11] are shown in Fig. 4 and Fig. 5 respectively. Both the models perform well when the text size is large with a clear, uniform and contrasting background. It is noticed that YOLO v3-tiny
Table 1. Performance of the proposed one stage text detector

Network        Resolution   Iterations   Avg IoU   mAP
YOLO v3-tiny   320 × 320    2300         0.2832    0.275
YOLO v3-tiny   416 × 416    2600         0.32      0.443
YOLO v4-tiny   320 × 320    1800         0.418     0.5422
YOLO v4-tiny   416 × 416    2900         0.52      0.692

Table 2. Text detection accuracy

Network        Resolution   Precision   Recall   F1-score
YOLO v3-tiny   320 × 320    0.43        0.41     0.42
YOLO v3-tiny   416 × 416    0.54        0.53     0.54
YOLO v4-tiny   320 × 320    0.58        0.66     0.62
YOLO v4-tiny   416 × 416    0.7         0.78     0.74
Fig. 4. Multi-lingual text detection results using YOLO v3-Tiny
Fig. 5. Multi-lingual text detection results using YOLO v4-Tiny
fails to detect text in images which contain comparatively smaller text, as well as text with an uneven and slightly unclear background. Additionally, the bounding box generated for the detected text is in better proportion for YOLO v4-tiny as compared to the ones generated using YOLO v3-tiny. It is seen that YOLO v4-tiny works well even with slightly blurry text (Fig. 5d).
5 Conclusion
In this paper, we treated the text detection problem as an object detection problem and used an efficient object detection scheme for the task. To make our scheme suitable for devices with low memory and low computational power, comparatively shallow networks are used. Experimentation is done on a multilingual dataset. From our results, we conclude that YOLO v4-tiny performs better than YOLO v3-tiny while dealing with small-size text and under various uneven lighting backgrounds. Even though our mAP scores lie between 27% and
70% (considering both resolutions), we can say that object detection models can be tuned to detect multi-lingual text, and they can be used in domains where low memory and fast execution are a bigger requirement than higher accuracy, such as embedded systems. Regarding future work, labeling the data using 4 classes instead of a single class, as well as the introduction of a multi-oriented curved text detection scheme, may improve the accuracy of the system and will be a good research direction.
References
1. Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
2. Chen, X., Yuille, A.L.: Detecting and reading text in natural scenes. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004, vol. 2, p. II. IEEE (2004)
3. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2963–2970. IEEE (2010)
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
5. Huang, L., Yang, Y., Deng, Y., Yu, Y.: DenseBox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874 (2015)
6. Liao, M., Shi, B., Bai, X., Wang, X., Liu, W.: TextBoxes: a fast text detector with a single deep neural network. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
7. Liu, Y., Chen, H., Shen, C., He, T., Jin, L., Wang, L.: ABCNet: real-time scene text spotting with adaptive Bezier-curve network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9809–9818 (2020)
8. Liu, Y., Jin, L., Zhang, S., Luo, C., Zhang, S.: Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recogn. 90, 337–345 (2019)
9. Ma, J., et al.: Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 20(11), 3111–3122 (2018)
10. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. Image Vision Comput. 22(10), 761–767 (2004)
11. Mathew, M., Jain, M., Jawahar, C.V.: Benchmarking scene text recognition in Devanagari, Telugu and Malayalam. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 7, pp. 42–46. IEEE (2017)
12. Mishra, A., Alahari, K., Jawahar, C.V.: Top-down and bottom-up cues for scene text recognition. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2687–2694. IEEE (2012)
13. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
14. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
15. Su, F., Xu, H.: Robust seed-based stroke width transform for text detection in natural images. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 916–920. IEEE (2015)
16. Tang, P., Yuan, Y., Fang, J., Zhao, Y.: A novel similar background components connection algorithm for colorful text detection in natural images. In: 2015 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pp. 1–5. IEEE (2015)
17. Tian, S., Pan, Y., Huang, C., Lu, S., Yu, K., Tan, C.L.: Text flow: a unified text detection system in natural scene images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4651–4659 (2015)
18. Tian, Z., Huang, W., He, T., He, P., Qiao, Y.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_4
19. Wang, C.-Y., Liao, H.-Y.M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., Yeh, I.-H.: CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 390–391 (2020)
20. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: 2011 International Conference on Computer Vision, pp. 1457–1464. IEEE (2011)
21. Xu, H., Xue, L., Su, F.: Scene text detection based on robust stroke width transform and deep belief network. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9004, pp. 195–209. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16808-1_14
22. Ye, Q., Doermann, D.: Text detection and recognition in imagery: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 37(7), 1480–1500 (2015)
23. Yin, X.-C., Yin, X., Huang, K., Hao, H.-W.: Robust text detection in natural scene images. IEEE Trans. Pattern Anal. Mach. Intell. 36(5), 970–983 (2013)
24. Zhou, X., et al.: EAST: an efficient and accurate scene text detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560 (2017)
Facial Occlusion Detection and Reconstruction Using GAN Diksha Khas(B) , Sumit Kumar(B) , and Satish Kumar Singh Indian Institute of Information Technology, Allahabad, Allahabad, India {phc2014001,sk.singh}@iiita.ac.in
Abstract. We developed an architecture that addresses the problem of the presence of various objects like glasses, scarves, masks, etc. on the face. These types of occluders greatly impact recognition accuracy. Occluders are simply treated as a source of noise or unwanted objects in an image, and occlusion affects the appearance of the image significantly. In this paper, a novel architecture for the removal of specific types of occlusion, like glasses, is presented. A synthetic mask for this type of occlusion is created and facial landmarks are generated, which are then used by an image completion method to complete the image after removing the occluder. We have trained the model on the CelebA dataset, which is available publicly. The experimental results show that the proposed architecture for occlusion removal works effectively for faces covered with glasses or scarves.
Keywords: Image completion · Reconstructed image · Occluded image · Landmarks

1 Introduction
In the past few years, recognition methods have shown promising performance in several practical applications like security photos, criminal identification, healthcare, passwords in electronic devices and many more. However, several kinds of occlusions or unwanted objects like glasses and masks are present in real situations. This additive or spatially contiguous gross noise is treated as "occlusion in the image". According to [26], occlusion is defined as "A type of spatially contiguous and additive gross noise, that would severely contaminate discriminative features of human faces and harm the performance of face recognition approaches that are not robust to such noise." To mitigate the problem of facial occlusion present in real-life scenarios, an architecture is composed to remove the occlusion in the wild so that it does not impact the recognition performance of other techniques. Many methods like [7,13,25] exist to remove unwanted noise from the image, but they do not give promising results for random images from actual scenes. These methods fail considerably in surveillance-like situations.
In the following work, we aim to solve this problem of de-occlusion by a series of steps, as one-shot techniques do not give good results. We propose an architecture consisting of masking the occluded area, landmark generation, and finally the image completion module. These act as successive steps to recover the occluded face parts gradually. The model works as follows. Given a new face image with occlusion, an artificial binary mask is generated to cover the occluded region. The mask is embedded into the original image. Then this image passes through the two sub-modules, named the landmark-generation module and the image completion module.
2 Literature Review

2.1 Face De-occlusion
There are many existing techniques based on systematic, hand-crafted strategies for the de-occlusion of facial images. [19] applied sparse representation to encoding faces and exhibited a certain robustness of the extracted features to occlusion. It considered the issue of automatically recognizing human faces from frontal views with varying pose and illumination, and worked on conventional features like Eigen-faces and Laplacian-faces. [16] defined a 'Boltzmann machine' based model to deal with occlusion and noise. This model is robust to corruptions in the image: it handles occlusion and noise quite precisely by using multiplicative gating to induce a mixture of Gaussians over pixels. It is trained in an unsupervised manner with noisy data and learns the spatial structure of the occluders. [19] defined a model over encoders and decoders to de-occlude the image. This works well on data available in the wild and performs significantly better in real-world scenarios. It uses LSTM-Autoencoders to adequately restore occluded faces even in natural scenarios, working with a latent-space representation of the image. Also, methods like [8] based on "Graph-Cut" techniques exist. This model deals with the problem by segmenting the task into two sub-tasks: one to detect the occlusion automatically and another to recover the occluded parts with high quality. It derived a "Bayesian MAP" (Maximum a Posteriori) formulation to combine both objectives. The quality testing framework is developed by learning prior knowledge from a defined set of images, which is then used to drive the recovery procedure. It uses the global structure and local patterns to characterise the texture in detail, and the two sub-tasks are joined in a probabilistic formulation. The drawback of these methods is that the results obtained are blurry and inconsistent with respect to global features.
2.2 Image Restoration
Traditional Methods These methods can be classified as diffusion-based or patch-based techniques.
Methods like [3] use a patch-based strategy which tries to copy similar patches or blocks, from the same image or from a set of similar images, into the target areas. These methods try to fit the most similar block either from the same image or from images other than the input image. The assumption that the missing part will be present somewhere in the image fails drastically for facial images. Also, the cost of computing the similarity between regions is generally high. Some diffusion-based strategies like [22] try to complete the image from the low-level features that are collected around the missing block. They take advantage of a diffusion equation to recursively propagate low-level features from the known regions to the masked or unknown regions. These methods work fine for smaller regions or non-structured areas, but they fail considerably when tested on structured cases like face completion.
Deep-Learning Based Methods. As of now, deep learning based methods have become the standard for image completion. Methods like the "Context-Encoder" [22], which uses an encoder-decoder network trained with an adversarial loss, give reliable results for image completion. Moreover, [9] modified the convolutional layers to make the network aware of the masked input. The drawback of these techniques is that they are not able to maintain the structure of the generated image; the output tends to be blurry if the missing area is large. Some methods based on edge prediction for the missing holes have also been developed, but the output degrades when the missing holes become bigger in area.
3 Proposed Methodology
An ideal face inpainter ought to produce natural-looking outcomes with coherent structures and attributes. This objective is achieved using two sub-networks. The first network is for generating the landmarks of the occluded or corrupted image, while the second network is for completing the image after adding the binary mask over the occluded region (eyes or mouth) in the image. We elaborate both sub-networks in the following subsections.
3.1 Landmark Generation Network
Accurate localization of facial landmarks is a significant building block for several applications like identification or the analysis of facial expressions. Remarkable progress has been made on the landmark-detection task. However, feature point localisation tends to fail when it has to be performed on occluded faces. A qualified image completion network must maintain the consistency and attributes of the image to guarantee realistic output.
Why Landmarks Are Preferred? Landmarks are useful as structural guidance due to their tightness, sufficiency, competence,
and robustness. The question also arises whether edge guidance or parsing information can provide more robust information than the landmarks [12]. The answer is "Yes", but only if the information provided is precise. It is not simple to generate genuine edges in tough situations like large corrupted areas, large variations in pose, etc. Under these scenarios, the inaccurate information will affect the performance badly. On the other hand, a set of landmarks will always exist, no matter what the pose or the illumination is. Landmarks are a set of discrete points which are self-sufficient to recover the key edges. Also, landmarks are easier to accommodate and edit. These features prove that landmarks are the better option to be used for the face completion task.
Procedure: Here, the aim is to retrieve a set of 68 landmarks from a corrupted or occluded image of a face, i.e. I_M. Let the landmark generation module be G_L. Though G_L could have been realised using any technique such as [6,20,21], here we follow the method developed in [23]. According to [23], this module is defined as "Given the image of a face I with corrupted or demolished regions masked by M. Let M̄ be the complement of M, and ◦ the Hadamard product. The objective is to pack the target part with semantically meaningful and visually continuous information to the observed part. In other words, the completed result Î = M̄ ◦ I + M ◦ Î should conserve the topological structure among the components of face such as eyes, nose and mouth, and the attribute consistency like pose, gender, ethnicity and expression." The landmark module is defined over G_L as L̂ = G_L(I_M; θ_L), where I_M = I ◦ M and θ_L are the trainable parameters. Here, what we focus on while generating the landmarks is the fundamental topological structure and attributes like pose and expression rather than the explicit location of each discrete landmark. The reasons for this are: 1. Generating the landmarks exactly on the face contour within the fixed regions will not have much impact on the final image completion result. 2. Preference for a simpler model. G_L is built on top of the "MobileNet-V2" model proposed in [24], which concentrates on feature extraction. The final module is realised by fully connecting the fused feature maps at various rear stages. The training loss for this is defined as:

L_lmk = ‖ L̂ − L_gt ‖²_2    (1)
where L_gt denotes the ground-truth landmarks and ‖·‖_2 stands for the l2 norm.
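The following is a minimal, hypothetical sketch of a MobileNet-V2-based 68-point landmark regressor of the kind described for G_L. The global-pooling head is an assumption for illustration; the text states that the actual module fuses feature maps from several late stages.

import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

class LandmarkNet(nn.Module):
    def __init__(self, num_points=68):
        super().__init__()
        self.num_points = num_points
        self.backbone = mobilenet_v2(pretrained=True).features   # feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(1280, num_points * 2)               # (x, y) per landmark

    def forward(self, masked_image):                # masked_image: (B, 3, 256, 256)
        f = self.pool(self.backbone(masked_image)).flatten(1)
        return self.head(f).view(-1, self.num_points, 2)

# Training objective corresponding to Eq. (1): F.mse_loss(pred_landmarks, gt_landmarks)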
3.2 Image Completion Network
Natural-looking results with proper structure and attributes are expected from the image completion module. Here, a deep-network-based model is built to generate realistic output. This sub-network is composed of a DCGAN-style generator and discriminator. The detailed architecture is discussed in the next subsection.
Fig. 1. A complete architecture for the model
Proposed Architecture: The image completion network, along with the landmark generation network, is shown in Fig. 1. G_p aims to complete the faces from the corrupted image I_M with the help of their predicted or ground-truth landmarks (L̂ or L_gt) as an input. This module is composed of a generator and a discriminator.
Generator: The generator is based on the U-Net [15] architecture. The advantage of using U-Net is that it integrates the location information fetched from the down-sampling path with the contextual information obtained while parsing the up-sampling path, combining context and localisation. Also, as there is no dense layer, images of different sizes can be used as input. The network is composed of three moderately down-sampled encoding blocks, followed by seven residual blocks. The residual blocks have dilated (expanded) convolutions and a long-short term attention block. After that, in the decoding process, the feature maps are up-sampled to the size of the input. The long-short term attention layer is utilized to combine the temporal feature maps. The function of the dilated blocks is to expand the receptive field so that features lying in a wider range can also be considered. Also, shortcuts are added between the related down-sampling and up-sampling layers. To adjust the weights of the features passed to the last layer from the shortcut, a 1 × 1 convolution operation is performed before each up-sampling layer as channel attention. In this way, better use of distant features is made spatially as well as temporally.
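As an illustration of the dilated residual blocks mentioned above, a minimal PyTorch sketch is given below; the channel count, normalisation choice and dilation rate are assumptions for illustration, not the authors' configuration.

import torch.nn as nn
import torch.nn.functional as F

class DilatedResBlock(nn.Module):
    # Residual block with dilated 3x3 convolutions to enlarge the receptive field.
    def __init__(self, channels=256, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # The shortcut keeps earlier features flowing to later layers.
        return F.relu(x + self.body(x))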
260
D. Khas et al.
versions of the standard GANs. DCGAN functions almost similar to traditional GAN but it specifically focus on using “Deep Convolutional Network” instead of fully-connected networks. Convolution Networks generally deals in the areas of correlation with the images, means, they work with the spatial correlations. For the training process, spectral normalisation [4] is introduced between the blocks of discriminator. Also, an attention layer is used to treat the features flexibly. It can be easily observed that the models like [5] used a two discriminators. One global discriminator concentrates on the entire image and treats it coherent as a whole while the other local discriminator focus on the completed region to ensure the local consistency. But, here the image completion network uses only single discriminator to complete the job which has the corrupted image and its landmarks as the input, i.e. D(I, L; θD ) where θD are the parameters for the discriminator. The reasons for using only one discriminator are: 1. The global structure is already ensured as the generated results will be governed by the landmarks. 2. The attention layer focuses more on the consistency of the attributes. 3.3
Loss Function
3.3 Loss Function

For the training of the image completion module, the loss function is composed of a per-pixel loss, a style loss, a perceptual loss and the adversarial loss.
1. Per-pixel Loss: The per-pixel loss is defined as:

L_pixel = (1 / N_M) · ‖ Î − I ‖_1    (2)

Here, ‖·‖_1 refers to the l1 norm and N_M is the mask size, which is used to adjust the penalty. This means that if the face is restricted by a small amount of gross noise or occlusion, the final result must be closely related to the ground-truth image; if the occlusion is large, the interference can be relaxed as long as the structure and consistency are maintained.
2. Style Loss: It is responsible for the style transfer from the actual image to the output image, because a new style (non-occluded region) has to be applied after removing the occlusion. It calculates the style difference between the input and output images as:

L_style = Σ_p (1 / (N_p · N_p)) · ‖ G_p(Î ◦ M) − G_p(I ◦ M) ‖_1 / (N_p · H_p · W_p)    (3)

Here, G_p stands for the Gram matrix and N_p is the number of feature maps of size H_p × W_p at the p-th layer of the pre-trained network. VGG-19 pre-trained on ImageNet is used to calculate the style loss.
3. Perceptual Loss: It measures image similarity based on high-level representations. It is used because a change in a single pixel (as in the per-pixel loss) can create a huge difference mathematically. The perceptual loss is defined as:

L_perc(I) = log(1 − D_p(G_p(I_M, ·)))    (4)

where I_M is the corrupted image with the binary mask.
4. Adversarial Loss: The motivation behind the adversarial loss is the LSGAN proposed in [11], which enhances the visual quality of the output. The adversarial loss is defined as:

L_adv(G) = E[(D(G_p(I_M, L), L_gt) − 1)²]    (5)

L_adv(D) = E[D(Î, L_gt)²] + E[(D(I, L_gt) − 1)²]    (6)

The total loss function for the generator of the image completion module is represented as:

L_total = L_pixel + λ_perc · L_perc + λ_style · L_style + λ_adv · L_adv(G)    (7)

For the experiments, λ_perc = 0.1, λ_adv = 0.01 and λ_style ranging between 200 and 250 are used in the training phase. In the training process, the generator G_p minimises L_total (Eq. 7) while the discriminator D minimises L_adv(D) (Eq. 6) until the convergence point is reached.
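To make Eqs. (3) and (7) concrete, the following is a hedged PyTorch sketch of a Gram-matrix style loss and the weighted generator objective. The choice of VGG-19 layers and the exact normalisation are assumptions, and λ_style = 250 is simply one value within the reported 200–250 range.

import torch
import torch.nn.functional as F

def gram(feat):
    # feat: (B, C, H, W) feature map from a pre-trained VGG-19 layer.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)   # (B, C, C) Gram matrix

def style_loss(feats_fake, feats_real):
    # Sum of L1 distances between Gram matrices over the chosen VGG layers (cf. Eq. 3).
    return sum(F.l1_loss(gram(a), gram(b)) for a, b in zip(feats_fake, feats_real))

def total_generator_loss(l_pixel, l_perc, l_style, l_adv_g,
                         lam_perc=0.1, lam_style=250.0, lam_adv=0.01):
    # Weighted sum of Eq. (7) with the weights reported in the text.
    return l_pixel + lam_perc * l_perc + lam_style * l_style + lam_adv * l_adv_g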
4 Experiment and Results

4.1 Training Strategy
The generator is expected to complete the image as Î = G(I_M). For face images, the strong regularity, i.e. the regularity defined by the landmarks, is exploited by the generator as Î = G_P(G_L(I_M), I_M). This benefits model reduction and the training process, as the solution space is confined by the regularity. Intuitively, the training of the landmark generation module and the image completion module could also be performed jointly, but practically this is not a good alternative. The reasons for this are:
1. The loss for the landmark generation module G_L is computed over a small number of locations (in this case, 68 landmarks are considered). This is not compatible with the loss function for the image completion module, and the parameter tuning becomes very demanding.
2. Even if we tune the parameters well, the performance of both modules may be very inaccurate, especially during the beginning phase of training, which leads to poor-quality landmark generation and low-quality recovered images.
Fig. 2. Occlusion removal and reconstruction of images on the CelebA dataset (columns: GT, Masked, Landmarks, Result)
These two factors push the training into problems like poor convergence points or a high computational cost of training. Therefore, it is better to train both modules separately to avoid the above mentioned problems. For training and testing purposes, the data is split 80:20. In the experiments performed for this work, 256 × 256 images are used for both the landmark generation and the image completion module. The optimiser used is the Adam optimiser [5] with β1 = 0.01 and β2 = 0.9 and a learning rate of 10^-4. The batch size for the landmark generation module is set to 16, whereas for the image completion module it is 4.
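A simplified sketch of the alternating optimisation implied above is given below: the discriminator minimises Eq. (6), then the generator minimises Eq. (7). The loss functions are passed in as callables and are placeholders for the terms defined in Sect. 3.3; only the Adam settings follow the values reported here.

import torch

def make_optimizers(G, D, lr=1e-4, betas=(0.01, 0.9)):
    # Adam with the reported learning rate and beta values.
    return (torch.optim.Adam(G.parameters(), lr=lr, betas=betas),
            torch.optim.Adam(D.parameters(), lr=lr, betas=betas))

def gan_step(G, D, g_opt, d_opt, masked, landmarks, real, g_loss_fn, d_loss_fn):
    # One alternating update of discriminator and generator.
    fake = G(masked, landmarks)

    d_opt.zero_grad()
    d_loss = d_loss_fn(D, fake.detach(), real, landmarks)   # Eq. (6)
    d_loss.backward()
    d_opt.step()

    g_opt.zero_grad()
    g_loss = g_loss_fn(D, fake, real, landmarks)             # Eq. (7)
    g_loss.backward()
    g_opt.step()
    return g_loss.item(), d_loss.item()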
4.2 Experiments
The evaluation of the image completion module and the complete model for occlusion removal is performed on the publicly available CelebA dataset [10] and the IIITD visual disguise dataset. For the training purpose the random masks are taken from the random mask dataset [9]. For the quantitative comparison of the results, PSNR [18] , SSIM [17] and FID are used as metrics.
Fig. 3. Occlusion removal and reconstruction of images on the IIITD visible dataset (columns: GT, Masked, Landmarks, Result)
4.3 Results
Results obtained with different datasets and different masking techniques is elaborated below. Image Completion for the Occluded Image: Qualitative results on the CelebA dataset with the specific mask used for the particular object occlusion is shown in Fig. 2. Also, the predicted landmarks with the specific mask infused over the occlusion are shown in column 3 of the figure. Image Completion for the IIITD Visible Images Dataset: Qualitative results on the IIITD visible images dataset [1,2] is shown in Fig. 3 with the particular mask used for the object occlusion. Qualitative Comparison with Other Methods: Qualitative comparison of different models for image completion and occlusion removal on the celebA dataset is shown in Fig. 4.
Fig. 4. Comparison with other models on the CelebA dataset (columns: GT image, Masked Image, CE, GFC, PIC, Ours)

Table 1. Comparison with other state-of-the-art models

Model                        PSNR    SSIM   FID
CE [13]                      25.46   0.81   1.71
GFC [7]                      21.04   0.76   14.96
CA [25]                      20.29   0.76   38.83
PIC [27]                     22.58   0.83   9.24
LaFIn [23]                   26.25   0.91   3.51
Ours (CelebA masked)         32.43   0.91   6.12
Ours (IIITD visible image)   25.37   0.85   10.65

Fig. 5. Metrics comparison with other models
Quantitative Comparison with Other Methods: The metrics used to compare the quantitative results with other existing models are the PSNR value, the SSIM value and the FID score. PSNR is used as the quantitative metric for the quality check between the reconstructed and the actual image; a higher PSNR value represents a better quality output. SSIM lies between −1 and 1, where 1 indicates that the structures of the ground-truth image and the reconstructed image are identical. The FID score compares the distance between the feature vectors of real images and images regenerated by the GAN; lower values indicate more similarity between the images. Table 1 shows the PSNR, SSIM and FID scores of our model and compares the results with the other state-of-the-art models on the CelebA dataset. Also, the Mean Opinion Score achieved for this model is 7.18. These comparisons can also be visualized in Fig. 5.
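For reference, the sketch below shows how PSNR and SSIM can be computed with scikit-image for a ground-truth/reconstructed pair; the channel_axis argument assumes a recent scikit-image version, and FID would require a separate Inception-feature pipeline that is not shown here.

from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(gt, recon):
    # gt, recon: HxWx3 uint8 images (ground-truth and reconstructed face).
    psnr = peak_signal_noise_ratio(gt, recon, data_range=255)
    ssim = structural_similarity(gt, recon, channel_axis=-1, data_range=255)
    return psnr, ssim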
5 Conclusion
In the above section, qualitative and quantitative results were described. Table 1 compares the performance of CE, GFC, CA, PIC, LaFIn, and the model described here with random masks on the CelebA dataset. It can be clearly concluded that the model competes well with the existing state-of-the-art works in terms of PSNR and SSIM. Also, for the IIITD visible images dataset, the PSNR value achieved is satisfactory for the particular kind of occlusion. The structure is well maintained because the image is regenerated conditioned on the landmarks. The motive for selecting landmarks instead of edge parsing proves true, as can be seen from the SSIM values depicted in Table 1. Also, the use of the "Deep-Convolution" GAN proved favourable, as lower FID scores are achieved.
References
1. Dhamecha, T., Nigam, A., Singh, R., Vatsa, M.: Disguise detection and face recognition in visible and thermal spectrums. In: 2013 International Conference on Biometrics (ICB), pp. 1–8 (2013). https://doi.org/10.1109/ICB.2013.6613019
2. Dhamecha, T.I., Singh, R., Vatsa, M., Kumar, A.: Recognizing disguised faces: human and machine evaluation. PLOS ONE 9(7), 1–16 (2014). https://doi.org/10.1371/journal.pone.0099212
3. Huang, J.B., Kang, S.B., Ahuja, N., Kopf, J.: Image completion using planar structure guidance. ACM Trans. Graph. 33(4), 1–10 (2014). https://doi.org/10.1145/2601097.2601205
4. Jo, Y., Park, J.: SC-FEGAN: face editing generative adversarial network with user's sketch and color. CoRR abs/1902.06838 (2019). http://arxiv.org/abs/1902.06838
5. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). http://arxiv.org/abs/1412.6980
6. Kumar, A., Chellappa, R.: Disentangling 3D pose in a dendritic CNN for unconstrained 2D face alignment. CoRR abs/1802.06713 (2018). http://arxiv.org/abs/1802.06713
7. Li, Y., Liu, S., Yang, J., Yang, M.: Generative face completion. CoRR abs/1704.05838 (2017). http://arxiv.org/abs/1704.05838
8. Lin, D., Tang, X.: Quality-driven face occlusion detection and recovery, pp. 1–7 (2007). https://doi.org/10.1109/CVPR.2007.383052
9. Liu, G., Reda, F.A., Shih, K.J., Wang, T., Tao, A., Catanzaro, B.: Image inpainting for irregular holes using partial convolutions. CoRR abs/1804.07723 (2018). http://arxiv.org/abs/1804.07723
10. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV) (2015)
11. Mao, X., Li, Q., Xie, H., Lau, R.Y.K., Wang, Z.: Multi-class generative adversarial networks with the L2 loss function. CoRR abs/1611.04076 (2016). http://arxiv.org/abs/1611.04076
12. Nazeri, K., Ng, E., Joseph, T., Qureshi, F.Z., Ebrahimi, M.: EdgeConnect: generative image inpainting with adversarial edge learning. CoRR abs/1901.00212 (2019). http://arxiv.org/abs/1901.00212
13. Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: feature learning by inpainting. CoRR abs/1604.07379 (2016)
14. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks (2016)
15. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015). http://arxiv.org/abs/1505.04597
16. Tang, Y.: Robust Boltzmann machines for recognition and denoising. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2264–2271. IEEE Computer Society, USA (2012)
17. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
18. Winkler, S., Mohandas, P.: The evolution of video quality measurement: from PSNR to hybrid metrics. IEEE Trans. Broadcast. 54(3), 660–668 (2008). https://doi.org/10.1109/TBC.2008.2000733
19. Wright, J., Yang, A., Ganesh, A., Sastry, S., Yu, L.: Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31, 210–227 (2009). https://doi.org/10.1109/TPAMI.2008.79
20. Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., Zhou, Q.: Look at boundary: a boundary-aware face alignment algorithm. CoRR abs/1805.10483 (2018). http://arxiv.org/abs/1805.10483
21. Xiao, S., et al.: Recurrent 3D-2D dual learning for large-pose facial landmark detection. In: The IEEE International Conference on Computer Vision (ICCV) (2017)
22. Yamauchi, H., Haber, J., Seidel, H.P.: Image restoration using multiresolution texture synthesis and image inpainting, pp. 120–125 (2003). https://doi.org/10.1109/CGI.2003.1214456
23. Yang, Y., Guo, X., Ma, J., Ma, L., Ling, H.: LaFIn: generative landmark guided face inpainting (2019)
24. Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2878–2890 (2013). https://doi.org/10.1109/TPAMI.2012.261
25. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Generative image inpainting with contextual attention. CoRR abs/1801.07892 (2018). http://arxiv.org/abs/1801.07892
Facial Occlusion Detection and Reconstruction Using GAN
267
26. Zhao, F., Feng, J., Zhao, J., Yang, W., Yan, S.: Robust lstm-autoencoders for face de-occlusion in the wild. CoRR abs/1612.08534 (2016). http://arxiv.org/abs/1612. 08534 27. Zheng, C., Cham, T., Cai, J.: Pluralistic image completion. CoRR abs/1903.04227 (2019). http://arxiv.org/abs/1903.04227
Ayurvedic Medicinal Plants Identification: A Comparative Study on Feature Extraction Methods

R. Ahila Priyadharshini(&), S. Arivazhagan, and M. Arun

Centre for Image Processing and Pattern Recognition, Mepco Schlenk Engineering College, Sivakasi, Tamil Nadu, India
[email protected]
Abstract. Proper identification of medicinal plants is essential for agronomists, ayurvedic medicinal practitioners and for ayurvedic medicines industry. Even though many plant leaf databases are available publicly, no specific standardized database is available for Indian Ayurvedic Plant species. In this paper, we introduce a publicly available annotated database of Indian medicinal plant leaf images named as MepcoTropicLeaf. The research work also presents the preliminary results on recognizing the plant species based on the spatial, spectral and machine learnt features on the selected set of 50 species from the database. To attain the machine learnt features, we propose a six level convolutional neural network (CNN) and report an accuracy of 87.25% using machine learnt features. Keywords: Medicinal plant
· CNN · HOG · DWT · Moments
1 Introduction

Ayurveda is one of the world's oldest healing systems, which considers the mental and social factors of the patient and not just the symptoms of disease. Ayurveda is widely practiced on the Indian subcontinent, and the University of Minnesota's Center for Spirituality & Healing claims that more than 90% of Indians use some form of Ayurvedic medicine [1]. Ayurvedic medicines are derived mostly from different parts of plants such as the root, leaf, flower, fruit extrude or the entire plant itself. Incorrect use of medicinal plants makes the Ayurvedic medicine ineffective, so proper identification of the medicinal plant is necessary. In ancient days, the Ayurvedic practitioners picked the medicinal plants by themselves and prepared the medicines for their patients. In the modern day scenario, only very few practitioners follow this practice, while others make use of the medicines manufactured by the industry. Most of the plants used in ayurvedic formulations are collected from forests and wastelands. Commonly, medicinal plants are identified manually by expert taxonomists by inspecting characteristics such as shape, color, taste, and texture of the whole plant or an individual part (leaf, flower, fruit, or bark) [2]. However, the effectiveness of manual identification depends on people's level of knowledge and subjective experience. Also, the availability of expert taxonomists is very limited. So, an
automated medicinal plant identification system will be helpful for the Ayurvedic industry. The proper classification of medicinal plants is important for ayurvedic medicinal practitioners, agronomist, forest department officials and those who are involved in the preparation of ayurvedic medicines [3]. Medicinal plant recognition can be done accurately by inspecting the flowers of the plants. However, the flowers are not perennial throughout the year. Hence, in this study, leaves are used to identify the medicinal plants. Also, leaves are easy to collect, distinguish, and capture and therefore they are used as the primary basis for the identification of medicinal plants. So far many researchers have worked on plant leaf identification, but no public leaf database is available for Indian scenario. Our main focus in this work is on medicinal plant identification based on different feature extraction methods and to introduce a publicly available database (MepcoTropicLeaf Database)1 containing medicinal plant leaf images. The sample images of the database are shown in Fig. 1.
Fig. 1. Sample images of the MepcoTropicLeaf database.
Feature extraction and classifier selection will directly influence the results of leaf image recognition. Researchers have used Shape features, texture features, vein patterns for leaf recognition. The shape features used for plant leaf recognition are simple morphological shape descriptors [4–6], Hu Moments [7, 8], SIFT descriptors [9–11], Histogram of Gradients (HOG) [12, 13]. Texture features are used for plant leaf classification [14] and ayurvedic medicinal plant leaf recognition [15]. Recent research works make use of Convolutional Neural networks for plant disease classification [16] and for classification of Indian leaf species using smartphones [17]. In order to develop accurate image classifiers for the purpose of ayurvedic medicinal plant identification, we need a large verified database of leaf images of the plants. Until very recently, such a database for Indian medicinal plant leaves is not available publicly. To address this problem, we have collected tens of thousands of leaf images of ayurvedic plant species around the foot hills of Western Ghats as part of the Kaggle Open Data Research and it will be made openly and freely available for research purposes. Here, we have used a subset of the collected plant species and report
1 https://www.kaggle.com/ahilaprem/mepco-tropic-leaf.
on the identification of 50 plant species with 3777 images using spatial features, spectral features and a convolutional neural network approach.
2 Materials and Methods

2.1 MepcoTropicLeaf Database
We have collected the leaves of around 400 different plant species which includes herbs, shrubs, creepers, trees, grass and water plants. The number of leaf images in each species is more than 50. All the images were captured using mobile phone cameras. The database consists of single as well as compound leaves. The frontal view and the back view of the leaf images are captured to make the database robust. Also the database has some broken, diseased leaves and leaves with flowers. In this work, we have selected 50 different plant species which are commonly grown in the tropical areas. The details of the selected species are shown in Table 1 and the corresponding sample images are depicted in Fig. 2.
Fig. 2. Sample images of the selected 50 plant species
Table 1. Details of the 50 selected plant species (S. no., English name, Botanical name, Ayurveda name, Siddha name, and the number of leaf images per species).
2.2 Feature Extraction Methods
Feature extraction reduces the number of representations required to describe an image/video. Features extracted from the image should be informative and non-redundant, facilitating generalization. In this work, we use spatial, spectral and machine learnt features for the identification of plants.

Spatial Features
Spatial features of an object exploit location/spatial information, which can be characterized by its gray levels, their joint probability distributions, spatial distribution and so on. Here we have extracted the Moment features and the HOG features.

Moment Features
Moments are invariant under scale, translation and rotation. The formulas for calculating the basic moments, centralized moments (translation invariant) and normalized moments (scale and translation invariant) are given in Eqs. 1, 2 and 3 respectively.

M_{pq} = \sum_{x} \sum_{y} I(x, y)\, x^{p} y^{q}    (1)

where p + q represents the order of the moments. Here, order up to 3 is considered.
\mu_{pq} = \sum_{x} \sum_{y} I(x, y)\, (x - \bar{x})^{p} (y - \bar{y})^{q}    (2)

where \bar{x} = M_{10}/M_{00} and \bar{y} = M_{01}/M_{00} are the components of the centroid.

\vartheta_{pq} = \frac{\mu_{pq}}{M_{00}^{\left(\frac{p+q}{2} + 1\right)}}    (3)
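To make Eqs. 1-3 concrete, the following minimal NumPy sketch (our own illustration, not code from the paper; the function and variable names are ours) computes the raw, central and normalized moments of a grayscale image up to order 3.

import numpy as np

def moments_up_to_order_3(I):
    """Raw, central and normalized moments of a 2-D grayscale image I (Eqs. 1-3)."""
    I = I.astype(np.float64)
    ys, xs = np.indices(I.shape)            # pixel coordinate grids (rows, columns)
    M, mu, nu = {}, {}, {}
    for p in range(4):
        for q in range(4):
            if p + q <= 3:
                M[(p, q)] = np.sum(I * xs**p * ys**q)        # Eq. 1
    x_bar = M[(1, 0)] / M[(0, 0)]            # centroid components
    y_bar = M[(0, 1)] / M[(0, 0)]
    for p in range(4):
        for q in range(4):
            if p + q <= 3:
                mu[(p, q)] = np.sum(I * (xs - x_bar)**p * (ys - y_bar)**q)   # Eq. 2
                nu[(p, q)] = mu[(p, q)] / (M[(0, 0)] ** ((p + q) / 2 + 1))   # Eq. 3
    return M, mu, nu

print(moments_up_to_order_3(np.random.rand(128, 128))[0][(0, 0)])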
Histogram of Gradient (HOG) Features
The HOG feature descriptor counts the occurrences of gradient orientation in localized portions of an image [18]. It focuses on the structure of the image along with its edge direction. The steps for extracting the HOG feature descriptor are as follows:
• Compute the gradients for every pixel along the x and y direction within a window of size K × K.
• Calculate the magnitude and orientation of the gradients.
• Create a histogram of bins (size = H) using the magnitude and orientation of the gradients.
• Normalize the magnitude of the gradients over the block size of B × C.
• Repeat the above steps for the entire image of size M × N.
The formula for calculating the number of feature descriptors for the image is given in Eq. 4.

n_{HOG} = \left(\frac{M}{K} - (C - 1)\right) \left(\frac{N}{K} - (B - 1)\right) \cdot B \cdot C \cdot H    (4)
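As a rough cross-check of Eq. 4, the sketch below (our own illustration) extracts HOG descriptors with scikit-image; the parameter names only approximately map onto the paper's K, B, C and H. For a 128 × 128 image with 16 × 16-pixel cells, 2 × 2-cell blocks and 9 bins, Eq. 4 gives (8 - 1)(8 - 1) · 2 · 2 · 9 = 1764 descriptors, matching the value reported in Sect. 3.

import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 128)            # stand-in for a resized leaf image

features = hog(
    image,
    orientations=9,                         # H = 9 histogram bins
    pixels_per_cell=(16, 16),               # K = 16
    cells_per_block=(2, 2),                 # B = C = 2
    block_norm='L2-Hys',
    feature_vector=True,
)
print(features.shape)                       # (1764,); (324,) when pixels_per_cell=(32, 32)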
Spectral Features The spectral feature considered in this work is Discrete Wavelet Transform (DWT). DWT captures both frequency and location information. It provides point singularity. The image is actually decomposed i.e., divided into four sub-bands (one approximation and 3 detailed sub bands) and critically sub-sampled by applying DWT as shown in Fig. 3. The approximation band is further decomposed until some fine scale is reached. For every detailed sub bands in each level, the features such as mean and standard deviation are computed.
Fig. 3. DWT Decomposition of a sample Image (a) Original Image (b) 1st Level decomposition (c) 2nd Level decomposition
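A minimal sketch of these sub-band statistics using PyWavelets (our own illustration; the paper does not name a library). A 3-level Haar decomposition has 3 detail sub-bands per level, so the mean and standard deviation give 3 × 3 × 2 = 18 features, consistent with the counts quoted in Sect. 3.

import numpy as np
import pywt

def dwt_features(image, levels=3, wavelet='haar'):
    """Mean and standard deviation of every detail sub-band of a multi-level 2-D DWT."""
    coeffs = pywt.wavedec2(image, wavelet, level=levels)
    feats = []
    for detail in coeffs[1:]:               # skip the final approximation band
        for band in detail:                 # horizontal, vertical, diagonal sub-bands
            feats.extend([band.mean(), band.std()])
    return np.array(feats)

print(dwt_features(np.random.rand(128, 128), levels=3).shape)   # (18,)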
Machine Crafted Features
Convolutional Neural Networks (CNNs) are well known for their ability to machine craft the features for image recognition. So, CNNs are used here for identifying the plants using leaf images. The CNN architecture used in the proposed work is shown in Fig. 4. The proposed architecture consists of six levels of convolutional layers. After every convolutional layer, a sub-sampling layer is introduced to reduce the dimension of the feature maps. After every two convolutional layers, a batch normalization layer is introduced to improve the stability of the network. The output layer is a fully connected layer containing 50 neurons, which describe the class labels. The designed CNN has 407,746 learnable parameters.
Fig. 4. Architecture of the proposed CNN
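The sketch below is one possible Keras reading of this description, not the authors' code: six convolution-pooling levels with ELU activations, batch normalization after every two convolutional layers and a 50-way softmax output. The filter counts are assumptions, so the parameter count will not exactly reproduce the reported 407,746.

from tensorflow.keras import layers, models

def build_leaf_cnn(num_classes=50, input_shape=(128, 128, 3)):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    filters = [16, 32, 64, 64, 128, 128]    # assumed; the paper does not list filter counts
    for i, f in enumerate(filters):
        x = layers.Conv2D(f, (3, 3), padding='same', activation='elu')(x)
        x = layers.MaxPooling2D((2, 2))(x)
        if (i + 1) % 2 == 0:                # batch normalization after every two conv layers
            x = layers.BatchNormalization()(x)
    x = layers.Flatten()(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

build_leaf_cnn().summary()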
Generally, CNNs consist of many convolutional layers and fully connected layers. CNNs use filters (kernels) to detect the features present in an image. The filter carries out the convolution operation with the entire image in parts. If the feature is present in part of an image, the convolution operation between the filter and that part of the image will result in a high-valued real number; otherwise, the resulting value will be low. The filter can be slid over the input image at varying intervals, using a stride value. The stride value indicates by how much the filter should move at each step.
For the CNN to learn the values for a filter, the filter must be passed through a non-linear mapping. The output obtained after the convolution operation is summed with a bias term and passed through a non-linear activation function. The use of the activation function is to introduce non-linearity into the network. The ReLU is the most used activation function in almost all convolutional neural networks. ReLU activation is half rectified (from the bottom). ReLU is not bounded and it ranges between (0, ∞). On the other hand, ELU has a small slope for negative values. Instead of a straight line, ELU uses a log curve. The formula for the ELU activation function is given in Eq. 5.

f(y) = \begin{cases} y & y > 0 \\ \alpha (e^{y} - 1) & y \le 0 \end{cases}, \quad \alpha > 0    (5)
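For illustration, Eq. 5 can be written directly in NumPy as follows (our own snippet):

import numpy as np

def elu(y, alpha=1.0):
    """ELU activation: y for y > 0, alpha * (exp(y) - 1) otherwise (Eq. 5)."""
    return np.where(y > 0, y, alpha * (np.exp(y) - 1.0))

print(elu(np.array([-2.0, -0.5, 0.0, 1.5])))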
In CNNs, after one or two convolutional layers, downsampling is used to reduce the size of the representation produced by the convolutional layer. This speeds up the training process and reduces the amount of memory consumed by the network. The most commonly used downsampling is max pooling. Here, a window passes over an image according to a stride value. At each step, the maximum value within the window is taken as the output, hence the name max pooling. Max pooling significantly reduces the representation size. A batch normalization layer normalizes each input channel across a mini-batch. It is used to speed up the training of CNNs and to reduce the sensitivity of network initialization. The image representation is converted into a feature vector after several convolutional layers and downsampling operations. Figure 5 shows the feature maps obtained after convolutional layers 1, 2 and 4 respectively at the 25th epoch. This feature vector is then passed into the fully connected layer. The number of neurons in the last fully connected layer is the same as the number of classes to be predicted. The output layer has a loss function to compute the error in prediction. The loss function used in this work is categorical cross entropy and is depicted in Eq. 6.

J(\theta) = -\frac{1}{M} \sum_{\forall X} \sum_{j} \left[ y_{j} \log o_{j}^{k} + (1 - y_{j}) \log\left(1 - o_{j}^{k}\right) \right]    (6)

where M represents the number of classes, y_j and o_j represent the actual and predicted labels respectively, and k represents the corresponding layer. After the prediction, the back propagation algorithm updates the weights and biases for error and loss reduction.
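For illustration, a direct NumPy transcription of Eq. 6 as written (our own sketch; in practice the framework's built-in categorical cross-entropy would normally be used):

import numpy as np

def eq6_loss(y_true, y_pred, eps=1e-12):
    """Loss of Eq. 6: y_true is one-hot, y_pred is softmax output, both (samples, M)."""
    M = y_true.shape[1]                         # number of classes
    y_pred = np.clip(y_pred, eps, 1.0 - eps)    # avoid log(0)
    terms = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return -terms.sum() / M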
Fig. 5. Feature maps obtained (a) Conv Layer 1 (b) Conv Layer 2 (c) Conv Layer 4
3 Experiment Results and Discussion

Initially, 50 plant species from the leaf database are selected for experimentation. As the images in the database are of different sizes, they are resized to 128 × 128 pixels. Out of the 3777 samples, 2812 samples are used for training and the remaining 965 images are used for testing, maintaining a train-test ratio of 75:25. First, the experimentation is carried out by extracting 10 different basic moment (BM) features, 7 centralized moment (CM) features and 7 normalized moment (NM) features individually on the grayscale images, classified using an SVM classifier. The kernel used is the radial basis function. As the individual features do not perform well, further experimentation is carried out using the combined moment features. HOG features are then extracted from the images with kernel sizes of 16 × 16 and 32 × 32, resulting in cell grids of 8 × 8 and 4 × 4 and yielding 1764 and 324 features respectively. HOG features extracted using the kernel size of 32 × 32 outperform those of 16 × 16, so further experimentation is carried out using the kernel size of 32 × 32. HOG features are then combined with all the moment features and classified, and the results are tabulated in Table 2.
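One possible way to assemble the hand-crafted-feature pipelines evaluated in this section with scikit-learn (an illustrative sketch only; the feature arrays and their dimensions, 10 + 7 + 7 moment values, 324 HOG_32 values and 18 DWT_3 statistics, are placeholders built from the counts given in the text):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature blocks; in practice these come from the extractors sketched above.
X_moments = np.random.rand(3777, 24)        # 10 basic + 7 centralized + 7 normalized moments
X_hog = np.random.rand(3777, 324)           # HOG_32 descriptors
X_dwt = np.random.rand(3777, 18)            # 3-level DWT sub-band statistics
y = np.random.randint(0, 50, size=3777)     # species labels

X = np.hstack([X_moments, X_hog, X_dwt])    # composite feature vector
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
clf.fit(X, y)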
In order to study the performance of spectral features, the image is decomposed using the DWT with the Haar wavelet. The levels of decomposition considered here are 2 and 3. For every sub-band, features such as the mean and standard deviation are computed and classified using the SVM classifier.

Table 2. Performance measure of spatial features.
Spatial features: Accuracy (%)
Basic Moments (BM): 24.77
Centralized Moments (CM): 25.59
Normalized Moments (NM): 23.83
BM + CM: 32.95
BM + NM: 31.70
BM + CM + NM: 34.19
HOG_16: 55.33
HOG_32: 58.34
HOG_32 + BM + CM + NM: 63.31
Three-level decomposition with 18 features outperforms two-level decomposition with 12 features, and the results are depicted in Table 3. The spectral features are combined with the spatial features and the classification accuracy obtained is 67.05%. Even though the moment features do not perform well individually, when added to the HOG_32 and DWT_3 features there is an increase in accuracy of 3.5%. The spatial and spectral features did not perform well because of the presence of single and compound leaves in many plant species.

Table 3. Performance measure of spectral and composite features.
Features: Accuracy (%)
DWT_2: 38.96
DWT_3: 41.04
DWT_3 + BM + CM + NM: 50.46
DWT_3 + HOG_32: 63.42
DWT_3 + HOG_32 + BM + CM + NM: 67.05
In order to improve the performance, we carried out the experimentation using a deep learning approach. The architecture depicted in Fig. 4 is used for the experimentation. The input image is resized to 128 × 128 pixels. We used the ELU activation function in all the layers except the output layer, where the activation function used is softmax. In all the convolutional layers, the kernel size is fixed as 3 × 3. The loss function used is categorical cross entropy. We carried out the experimentation with data augmentation
using rotation, vertical flip and horizontal flip. The performance of the CNN is tabulated in Table 4 for the batch size of 76. From Table 4, it is very evident that, for this database, experimentation with uniform augmentation does not perform well because of the class imbalance across various classes. So, further experimentations are carried out without data augmentation. The performance of the CNN is compared by varying the batch sizes 37, 19 and 4 respectively. The comparison is shown in Table 5. The accuracy is better for the batch of size 19 and the corresponding confusion matrix is depicted in Fig. 6.
Table 4. Performance measure of the proposed CNN with batch size = 76.
Epochs: Accuracy (%) without augmentation / with augmentation
25: 71.29 / 64.45
100: 81.86 / 41.24
200: 80.26 / 63.00
300: 82.90 / 57.72
Table 5. Performance measure of the proposed CNN for different batch sizes.
Epochs: Accuracy (%) for batch size = 37 / 19 / 4
25: 84.45 / 82.59 / 73.05
100: 84.66 / 80.31 / 85.18
200: 86.21 / 82.38 / 85.38
300: 78.65 / 87.25 / 85.91
Fig. 6. Confusion matrix for the proposed CNN with batch size = 19
From the confusion matrix, we can identify that the species Benghal dayflower, Butterfly Pea and Indian jujube have very low accuracy. Benghal dayflower and Indian jujube have only 51 and 66 samples respectively, so the deep CNN is unable to learn enough features for these species. It is observed that, if these specific species are trained with augmentation, the classification accuracy may improve. The Butterfly Pea species is mostly confused with Bristly Wild Grape because of its shape, and also because the Butterfly Pea species contains both single and compound leaves in roughly equal measure.
4 Conclusion In this work, we have introduced a publicly available specific standardized Indian leaf database for ayurvedic medicinal plant recognition. Also, we studied the performance of spatial features, spectral features and machine learnt features for plant identification. Machine learnt (deep) features perform better than spatial and spectral features because of its ability to learn various shape features across the species. The maximum accuracy obtained by the proposed CNN is 87.25%. The potential efficiency of the CNN is studied here only by varying the batch size. The performance of the CNN can be improved by varying different parameters such as kernel size, activation function etc.… We have presented preliminary results on recognizing plant species, but the problem appears to be extremely challenging and could perhaps benefit the research community for further experimentations. Acknowledgement. This research is done as a part of Kaggle’s Open Data Research Grant. We would also like to thank Dr. Jeyakumar for his help in verifying the plant species.
References 1. Ayurvedic Medicine, University of Minnesota’s Center for Spirituality & Healing. https:// www.takingcharge.csh.umn.edu/explore-healing-practices/ayurvedic-medicine 2. Chen, Y.F., Liu, H., Chen, Q.: Status and prospect for the study of plants identification methods. World Forest. Res. 27(4), 18–22 (2014) 3. Dileep, M.R., Pournami, P.N.: AyurLeaf: a deep learning approach for classification of medicinal plants. In: TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON), Kochi, India, 2019, pp. 321–325. https://doi.org/10.1109/TENCON.2019.8929394 4. Chaki, J., Parekh, R., Bhattacharya, S.: Recognition of whole and deformed plant leaves using statistical shape features and neuro-fuzzy classifier. In: 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), pp. 189–194 (2015b). https:// doi.org/10.1109/ReTIS.2015.7232876 5. Hossain, J., Amin, M.: Leaf shape identification based plant biometrics. In: 2010 13th International Conference on Computer and Information Technology (ICCIT), pp. 458–463 (2010). https://doi.org/10.1109/ICCITECHN.2010.5723901 6. Watcharabutsarakham, S., Sinthupinyo, W., Kiratiratanapruk, K.: Leaf classification using structure features and support vector machines. In: 2012 6th International Conference on New Trends in Information Science and Service Science and Data Mining (ISSDM), pp. 697–700 (2012)
7. Nesaratnam, R.J., Bala Murugan, C.: Identifying leaf in a natural image using morphological characters. In: 2015 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), pp. 1–5 (2015). https://doi.org/10.1109/ICIIECS.2015. 7193115 8. Du, J.X., Wang, X.F., Zhang, G.J.: Leaf shape based plant species recognition. Appl. Math. Comput. 185, 883–893 (2007) 9. Chathura Priyankara, H., Withanage, D.: Computer assisted plant identification system for Android. In: 2015 Moratuwa Engineering Research Conference (MERCon), pp. 148–153 (2015). https://doi.org/10.1109/MERCon.2015.7112336 10. Hsiao, J.K., Kang, L.W., Chang, C.L., Lin, C.Y.: Comparative study of leaf image recognition with a novel learning-based approach. In: 2014 Science and Information Conference (SAI), pp. 389–393 (2014). https://doi.org/10.1109/SAI.2014.6918216 11. Lavania, S., Matey, P.S.: Leaf recognition using contour based edge detection and sift algorithm. In: 2014 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), pp. 1–4 (2014). https://doi.org/10.1109/ICCIC.2014.7238345 12. Islam, M.A., Yousuf, M.S.I., Billah, M.M.: Automatic plant detection using HOG and LBP features with SVM. Int. J. Comput. (IJC) 33(1), 26–38 (2019) 13. Pham, N.H., Le, T.L., Grard, P., Nguyen, V.N.: Computer aided plant identification system. In: 2013 International Conference on Computing, Management and Telecommunications (ComManTel), pp. 134–139 (2013). https://doi.org/10.1109/ComManTel.2013.6482379 14. Kherkhah, F.M., Asghari, H.: Plant leaf classification using GIST texture features. IET Comput. Vis. 13, 36 (2019) 15. Kumar, P.M., Surya, C.M., Gopi, V.P.: Identification of ayurvedic medicinal plants by image processing of leaf samples. In: 2017 Third International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), Kolkata, 2017, pp. 231–238. https://doi.org/10.1109/ICRCICN.2017.8234512 16. Ahila Priyadharshini, R., Arivazhagan, S., Arun, M., Mirnalini, A.: Maize leaf disease classification using deep convolutional neural networks. Neural Comput. Appl. 31(12), 8887–8895 (2019). https://doi.org/10.1007/s00521-019-04228-3 17. Vilasini, M., Ramamoorthy, P.: CNN approaches for classification of Indian leaf species using smartphones. CMC-Comput. Mater. Continua. 62(3), 1445–1472 (2020) 18. Dalal,N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 2005, vol. 1, pp. 886–893. https://doi.org/10.1109/CVPR.2005.177
Domain Knowledge Embedding Based Multimodal Intent Analysis in Artificial Intelligence Camera

Dinesh Viswanadhuni, Mervin L. Dalmet(&), M. Raghavendra Kalose, Siddhartha Mukherjee, and K. N. Ravi Kiran

Samsung R&D Institute India-Bangalore (SRI-B), Bengaluru 560037, India
{d.viswanadhu,m.dalmet,raghava.km,siddhartha.m, ravi.kiran}@samsung.com
Abstract. Artificial Intelligence (AI) camera solutions like Google Lens and Bixby Vision are becoming popular. To retrieve more information about the detected objects in an AI camera, services of web based content providers are used. Choosing the right content provider is needed to ensure the information provided is as per the user's intent. Smartphone user interfaces have limited real estate, hence information from only specific and relevant content providers can be displayed. Sometimes a specific content provider service is used for monetary reasons. To solve the above problems, we propose a domain knowledge embedding based multimodal deep network model. The model consists of a CNN and multi-level LSTM combination for the text channel, a VGG16 CNN model for the image channel and a domain knowledge embedding channel. The output of this helps in getting appropriate information from the content provider. Our model for intent classification achieves 91% classification accuracy for predicting 3 types of intents: beauty products purchase interest, generic information seeking and movie information seeking.

Keywords: Intent analysis · Multimodal deep networks · Natural language processing · Knowledge embedding
1 Introduction

Artificial Intelligence (AI) based camera solutions in smartphones, such as Google Lens and Samsung's Bixby Vision, are very popular in today's world. The AI camera detects salient objects in the frame and analyzes them. Users may intend to seek more information about objects in an image for various reasons such as gaining additional information, shopping, getting restaurant recommendations, finding calorie information of a meal, finding similar items in an image and more. In a typical AI camera solution, to seek more useful information about the detected salient objects, services from web based content providers are used. Typically, a web based content provider enables a specific use case. For example, one content provider may specialize in providing beauty product information, while another may specialize in providing calorie information of a meal in the image. As there are many such web content providers available, choosing the right content provider is needed to make sure relevant information is provided to the
end user. Also due to limited real estate in the user interface of the smartphone applications, information retrieved from a few selected and relevant content providers can be displayed. Hence, there is a need for a solution using which the intent of the user can be analyzed. Based on the intent detected we can prioritize the results obtained from web based content providers and display it in a user friendly manner. Normal Image based classification analyzes text within the image as an image due to which we may lose meaningful context and information that could help for better classification during ambiguous situations. The existing deep learning based methods for intent analysis is based on text alone, However in the context of AI camera a multimodal approach using visual and text attributes is to be used for computing the intent accurately. Images such as a screenshot of SNS application, movie ticket typically has images and textual content. Some images will have textual content in the form of a poster, name boards, bill boards etc. Our ultimate goal is to provide a support system to the image classification with help of text that’s found within and around the image as described in Fig. 1.
Fig. 1. Multimodal approach using visual and text
There have been efforts to solve the above mentioned problems. Suryawanshi et al. (2020) [13] and Smitha et al. (2018) [7] use a multimodal approach to detect intent or categorize memes; however, there is a need to improve on the same for more robust detection of intent. In this work, we propose a domain knowledge embedding based multimodal neural network for intent analysis in the visual domain. The summary of our major contributions is:
• A novel domain knowledge embedding based multimodal deep learning network for robust intent detection
• We propose 2 variants of domain knowledge embedding in the multimodal deep learning network
The proposed Baseline Multimodal (BMM) architecture comprises a concatenation of a CNN + LSTM based text model and a VGG16 based image model. We implemented three models:
• BMM model without Domain Knowledge embedding (Accuracy - 87%)
• BMM model + Count based Knowledge Vector embedding (Accuracy - 89%)
• BMM model + Probabilistic Knowledge Vector component (Accuracy - 91%)
• The Probabilistic Knowledge Vector approach provided the best accuracy, with an improvement of ~4% over the method without a Knowledge Dictionary and ~2% over the Count Knowledge Vector approach.
We have come up with a novel multimodal dataset using an in-house dataset and a few publicly available datasets [14, 15] to train the proposed domain knowledge based deep network.
2 Prior Work This section covers the work done in area of intent analysis on text only, image only and in the area of multimodality based intent analysis where both text and image are used. Yoon Kim et al. (2014) [1] proposed a convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. Fang et al. (2015) [3] uses Random forest and decision tree algorithm for sentiment polarity categorization on Amazon product review dataset using features extracted from text. Duong et al. (2017) [4] propose deep learning models that combine information from different modalities to analyze social media for an emotion classification application. The models combine information from different modalities using a pooling layer. An auxiliary learning task is used to learn a common feature space. Watanabe et al. (2017) [8] proposes an approach to detect hate expressions on Twitter, it is based on unigrams and patterns collected from the training set. Haque et al. (2018) [5] uses Amazon product review dataset to build a Linear SVM model using text features such as bag of words and TF-IDF. The model classifies the user review text into a positive or a negative review hence influencing future customer purchase decision. Hu et al. (2018) [6] propose multimodal sentiment analysis using deep neural networks combining visual analysis and natural language processing. The model is used for predicting the emotion word tags attached by users to their Tumblr posts, which contain both images and text. Wiegand et al. (2018) [9] proposes a scheme for classification of German tweets from Twitter. It comprises of a coarse-grained binary classification task and a fine-grained multi-class classification task. Smitha et al. (2018) [7] proposes a framework that can categorize internet memes by using visual features and textual features. It uses non-neural (SVM, Logistic regression, Decision Trees, Naive Bayes classifier) and neural network-type classifiers (CNN, LSTM). Kruk et al. (2019) [10] models a complex relationship between text and images of Instagram posts using a deep multimodal classifier. Zampieri et al. (2019) [12] describes the results and findings of tasks to discriminate between offensive and non-offensive tweets in social media, identify type of offensive content in the post, and identify the target of offensive posts. De La Vega et al. (2019) [11] proposed a trolling categorization with four independent classifiers using logistic regression.
A dataset containing excerpts of Reddit conversations involving suspected trolls and their interactions with other users is used to train the model. Suryawanshi et al. (2020) [13] proposes a multimodal approach for categorizing memes as either Offensive or Non-Offensive by taking into account both the image of the meme and the text embedded in it. A dataset consisting of memes and the embedded text is used to train their model. The Prior Art discussed above uses different approaches of multi-modal analyses for categorization such as memes on the internet, detect hate speech on social media platforms. In comparison to models in [7] and [13] we enhance the baseline multimodal model. Further to it, we make it more robust by concatenating it with the proposed Domain Knowledge Embedding channel. Our results show a significant improvement over the other multimodal deep network models.
3 Taxonomies We propose 3 taxonomies to capture different aspects of the relationship between the image and the text content associated with the image. The 3 intent classifications are: Product Purchase Interest. User intends to seek purchase information for the products occurring in the images. Generic Information Seeking. User intends to seek more information for the objects occurring in the images. Movie Information Seeking. User intends to seek more information for the movie posters occurring in the images.
4 Dataset
Fig. 2. Sample data from the dataset with the format used for training the model.
The dataset consists of 3 classes- Beauty (product purchase interest), Movie (movie information seeking) and Information (Generic Information seeking). Each class consists of text and image dataset. The dataset is input to the model as a csv file as in Fig. 2 with image_name being the name of the image saved, sentence being the text corresponding to the image and label being the class labels for each data sample.
Beauty. The Beauty dataset consists of images of beauty products and user reviews of the products. The images are downloaded from the internet and the corresponding reviews are taken from the Amazon Review dataset [15].
Information. The Information dataset has images that are considered to be information seeking, like monuments, famous personalities, animals, plants etc. The text part of each image in the dataset describes the object present in the image in small sentences. The images are downloaded from the internet.
Movie. The movie dataset has posters of movies and their corresponding reviews. The reviews are downloaded from the IMDb dataset [14] and the posters are found online.
Sample data from the dataset for each class are shown in Table 1. The total dataset consists of 2834 image-sentence pairs with a distribution of 924 pairs for the Beauty class, 810 pairs for the Information class and 1100 for the Movie class. From the dataset, 362 pairs are set aside for validation and 300 pairs for the test dataset.

Table 1. Sample data from the dataset for each class (each entry pairs a sample image with its text).
Beauty (text): "Nice lipstick. Fits so well. I had tried it from a friend and I just had to purchase. Arrived on time."
Information (text): "Neil A. Armstrong was an American astronaut and aeronautical engineer and the first person to walk on the Moon."
Movie (text): "The Curious Case of Benjamin Button is a film unlike any I've ever seen and probably ever will."
5 Proposed Method

5.1 Baseline Model
Suryawanshi et al. (2020) [13] have defined a network in their research to classify multimodal intent. This approach consisted of two channels, one for text and another for image input. The output of both the channels were processed to give the final output. The drawback of this approach was that there was a lot of information lost as
the classifiers for each channel would give an output equal to the number of classes, which is very less for the classification layer to work on. The task of Classification layer would just be to predict if image channel is correct or text channel, which would fail in several cases, like when text or image alone has better understanding of the intent or in cases where both alone would give wrong classification. Keeping the above drawbacks in mind we propose a model, which could propagate more information to Classification layer to process. Our baseline architecture consisted of 2 channels for text and image input respectively. The network for text consisted of standard CNN + LSTM implementation. The pre-processed data given to input layer is embedded using GloVe (Pennington et. al (2014) [2]) pre-trained embedding to obtain vector representation of words, these vectors are fed to a Convolution layer with Relu activation function. It is further pooled using MaxPooling. There are 3 such Convolution and Maxpooling in series. The output of third Maxpooling layer is fed into a LSTM. The output states of this layer is fed to another LSTM along with the initial vectors of embedding layer. We have used categorical cross-entropy for loss calculation and adam optimizer. For image part of the network VGG16 model was used, which is pre-trained using ImageNet dataset. Flattening the output of second LSTM layer from text channel and combining it with Global Average Pooling2d layer of VGG16 enriched the feature vectors for Classification layer to learn better. The Architecture of the proposed baseline multimodal model is illustrated in Fig. 3.
Fig. 3. Overview of the base network architecture that has been used.
Compared to the model architecture in Suryawanshi et al. (2020) [13], the proposed multimodal model extracts more than 9000 features irrespective of number of classes and feeds them into the classification layer.
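Our own Keras rendering of this two-channel design, shown below, is a simplified sketch rather than the authors' code: the vocabulary size, sequence length, filter counts and LSTM widths are assumptions, and the state-sharing between the two LSTMs described above is reduced to simple stacking.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_bmm(num_classes=3, vocab_size=20000, seq_len=100, embed_dim=100):
    # Text channel: embedding (GloVe weights could be loaded here) -> 3 x (Conv1D + MaxPool) -> stacked LSTMs
    text_in = layers.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, embed_dim)(text_in)
    for _ in range(3):
        x = layers.Conv1D(64, 3, activation='relu', padding='same')(x)
        x = layers.MaxPooling1D(2)(x)
    x = layers.LSTM(64, return_sequences=True)(x)
    x = layers.LSTM(64)(x)

    # Image channel: ImageNet-pretrained VGG16 backbone with global average pooling
    img_in = layers.Input(shape=(224, 224, 3))
    vgg = VGG16(weights='imagenet', include_top=False, pooling='avg')
    v = vgg(img_in)

    # Concatenate both channels and classify
    merged = layers.concatenate([layers.Flatten()(x), v])
    out = layers.Dense(num_classes, activation='softmax')(merged)

    model = models.Model([text_in, img_in], out)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

build_bmm().summary()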
5.2 Domain Knowledge Embedding Based Model
With the above architecture in mind we added a new component to help in giving more information to the Classification layer. This component enhances the understanding of domain specific features for the network. This component is known as Domain Specific Knowledge Base. It consists of a knowledge vector generator which produces a vector that contains Domain Specific score. This vector presents the likelihood of an intent, which improves the learning of network and makes it robust. This Knowledge Vector (KV) is multiplied with a Domain Specific Embedding Matrix, which is trainable and then given to Classification layer which was defined previously. The network architecture is shown in Fig. 4.
Fig. 4. Overview of the proposed network architecture after the addition of Domain Specific Knowledge embedding component.
We have defined two variants of Domain Knowledge embedding based Multimodal models. Count Based Knowledge Vector Generator (CKV). In this approach we have a knowledge dictionary which consists of a particular intent’s domain specific corpus. A sample Knowledge dictionary for each intent looks as shown in Table 2. The input text is split into tokens which are lemmatized and passed through the knowledge generator. The knowledge generator finds the intent that is supported by the input word and the knowledge contribution vector (WV) of respective word is generated. This vector, which is generated for each word, is added to the final Knowledge Vector. The WV vector elements take the value 0 or 1. The Knowledge Vector Generation in this approach is illustrated in Fig. 5.
Table 2. Sample knowledge dictionary for each Intent.
Intent: Knowledge dictionary (sample)
Beauty: 'mascara', 'eyeliner', 'makeup', 'shadow', 'cheeks', 'lipstick', 'look', 'color'
Information: 'plants', 'interest', 'know', 'more', 'wikipedia', 'google', 'information'
Movie: 'director', 'actor', 'producer', 'star', 'hit', 'actress', 'show', 'enjoyed', 'cast'
Fig. 5. Count based approach for Knowledge Vector generator as part of Domain Specific Knowledge embedding component.
Equations for generating the count vector for CKV for a sentence with n words are given in (1) below, where KV is the Knowledge Vector for the sentence, WV_i is the Knowledge Vector of the word W_i, KV_c denotes the Knowledge Vector element value w.r.t. intent 'c' and S_{ic} is the support value of word W_i for the given intent 'c'. In CKV, S_{ic} is 1 if the word belongs to the Knowledge Dictionary of intent 'c', else it is 0.

WV_i = [S_{i1} \; S_{i2} \; S_{i3}], \quad KV = [KV_1 \; KV_2 \; KV_3] = \sum_{i=0}^{n} WV_i    (1)
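A minimal sketch of this count-based generator (our own illustration; the dictionaries below echo only a few entries of Table 2, and tokenization/lemmatization is assumed to have happened beforehand):

KNOWLEDGE_DICT = {
    'beauty':      {'mascara', 'eyeliner', 'makeup', 'lipstick', 'color'},
    'information': {'know', 'wikipedia', 'google', 'information'},
    'movie':       {'director', 'actor', 'producer', 'actress', 'cast'},
}
INTENTS = ['beauty', 'information', 'movie']

def count_knowledge_vector(tokens):
    """KV_c = number of tokens found in intent c's knowledge dictionary (Eq. 1)."""
    kv = [0, 0, 0]
    for tok in tokens:
        for c, intent in enumerate(INTENTS):
            if tok in KNOWLEDGE_DICT[intent]:
                kv[c] += 1
    return kv

print(count_knowledge_vector(['nice', 'lipstick', 'color']))   # [2, 0, 0]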
Probability Based Knowledge Vector Generator (PKV). In this approach we have a Knowledge Dictionary which consists of a particular intent’s domain specific corpus. This dictionary is generated with help of training dataset by finding the probability of a word for particular user intent. The input text is split into tokens which are lemmatized and passed through the knowledge generator. The knowledge generator finds the probability of support to an intent by input word and the knowledge contribution vector of respective word is generated. We then do a dot product of this vector, which is generated for each word, to get the final Knowledge Vector. A bias is added to avoid the knowledge vector to collapse to 0 in case of missing word. The approach for Knowledge vector generation is illustrated in Fig. 6.
Fig. 6. Probabilistic approach for Knowledge Vector generator as part of Domain Specific Knowledge base component.
The equation for generating the count vector for PKV is shown in (2). In PKV, S_{ic} is the probability of support of word W_i for the intent 'c'.

WV_i = [S_{i1} \; S_{i2} \; S_{i3}], \quad KV = [KV_1 \; KV_2 \; KV_3] = \left[ \prod_{i=0}^{n} S_{i1} \;\; \prod_{i=0}^{n} S_{i2} \;\; \prod_{i=0}^{n} S_{i3} \right]    (2)
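The probabilistic variant can be sketched the same way (our own illustration; the per-word support probabilities and the bias value are made-up placeholders, and the handling of unseen words is one possible interpretation of the bias described above):

# P(intent c | word), estimated from the training data; the values here are made up.
WORD_SUPPORT = {
    'lipstick': [0.90, 0.05, 0.05],
    'actor':    [0.05, 0.10, 0.85],
}
BIAS = 1e-3                 # keeps the product from collapsing to 0 on missing words
INTENT_COUNT = 3

def prob_knowledge_vector(tokens):
    """KV_c = product over words of S_ic, with a small bias added (Eq. 2)."""
    kv = [1.0] * INTENT_COUNT
    for tok in tokens:
        support = WORD_SUPPORT.get(tok, [BIAS] * INTENT_COUNT)
        kv = [kv[c] * (support[c] + BIAS) for c in range(INTENT_COUNT)]
    return kv

print(prob_knowledge_vector(['lipstick', 'actor']))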
6 Results

Table 3. Accuracies of the model with different approaches.
Approach: Accuracy
Suryawanshi et al. (2020) [13]: 69%
Hu et al. (2018) [6]: 64%
Baseline multimodal model for knowledge embedding (BMM): 87%
BMM + Count based knowledge vector approach (CKV): 89%
BMM + Probability based knowledge vector approach (PKV): 91%
From Table 3, using our dataset, Suryawanshi et al. (2020) [13]'s architecture shows an accuracy of 69% whereas Hu et al. (2018) [6] shows an accuracy of 64%. With the feature based BMM model approach we get an accuracy of 87%, because we have preserved the information extracted from the pooling and flattening layers of both channels and used it to classify. With our novel domain knowledge embedding approaches we get 89% and 91% respectively. On a smaller dataset, our knowledge embedding based architecture shows a significant improvement in accuracy when compared to [6]. This is due to the fact that our knowledge embedding based component acts as a support system for the network to decide the intent during ambiguity. The improvement due to our Knowledge embedding based component is significant especially on a smaller dataset.
BMM+PKV shows a small improvement of 2% over BMM+CKV. This is attributed to the fact that BMM+CKV's Knowledge Dictionary is static whereas BMM+PKV has a dynamically constructed Knowledge Dictionary. When the dataset used is small, both approaches show similar accuracies, or the count based BMM+CKV could even perform better, as there will be little difference in the number of unique intent-specific domain words known. But when the dataset size is increased, BMM+CKV may face words outside its Knowledge Dictionary, so the Knowledge Dictionary needs to be constantly enhanced with any changes in the dataset. BMM+PKV overcomes this issue by dynamically creating the dictionary from the dataset, which will ultimately provide better results on a larger dataset.
From Table 4, it is evident that each Knowledge Base approach has helped in enhancing the precision of the model. The BMM model approach suffers from low recall, which means that it has many cases of False Negatives (FN), but with the knowledge base approach we are able to improve it by a considerable margin.

Table 4. Precision, Recall and F1-scores for different approaches.
Approach: Precision / Recall / F1-score
Suryawanshi et al. (2020) [13]: 0.78 / 0.68 / 0.69
Hu et al. (2018) [6]: 0.74 / 0.69 / 0.56
BMM model: 0.88 / 0.86 / 0.87
BMM + CKV model: 0.91 / 0.89 / 0.89
BMM + PKV model: 0.92 / 0.90 / 0.90
A comparison between the true labels and the labels predicted by the mentioned approaches is shown in Table 5 below. Our feature based BMM model without a Knowledge Base component incorrectly labels the movie intent examples as belonging to the 'information' intent, whereas both of our novel approaches, the Count based Knowledge Vector approach (BMM + CKV) and the Probabilistic Knowledge Vector approach (BMM + PKV), accurately label them. This is mainly attributed to the fact that the BMM model may face ambiguity while classifying between the information and movie intents. This is avoided in the other two approaches due to the Knowledge Dictionary we maintain for each intent, which nudges the model into making correct predictions based on relevant words present in the text part of the data that might belong to a specific intent's Knowledge Dictionary. And, as mentioned in Sect. 5.1, the drawbacks of the prior art have led it to fail in classifying the beauty and information intent examples as belonging to the 'movie' intent.
Table 5. Comparison between true labels and predicted labels for the three methods used.

Example 1 (text): "This sunscreen definitely protects skin from tanning. It is definitely not for my skin, still I used it because I didn't want to waste my money and I loved it's results."
True Label: Beauty; Suryawanshi et al. [13]: Information; Hu et al. [6]: Movie; BMM: Beauty; BMM+CKV: Beauty; BMM+PKV: Beauty

Example 2 (text): "Mercedes is such a great German brand for luxy automobiles. It is viewed as a status symbol all over India."
True Label: Information; Suryawanshi et al. [13]: Movie; Hu et al. [6]: Information; BMM: Information; BMM+CKV: Information; BMM+PKV: Information

Example 3 (text): "Alfonso Cuaron's masterful adaptation does the source material immeasurable justice by exploring its underlying concepts in an intelligent manner. Of course, it certainly helps that the aesthetics of the film are incredible."
True Label: Movie; Suryawanshi et al. [13]: Movie; Hu et al. [6]: Information; BMM: Information; BMM+CKV: Movie; BMM+PKV: Movie
7 Conclusion In this work, we proposed a Domain Knowledge Embedding based multimodal approach for robust intent classification to prioritize results from content providers. All of our three approaches show a significant improvement over Suryawanshi et al. (2020) [13] and Hu et al. (2018) [6]. Quantitative results of our model show that the Multimodal Domain Knowledge Embedding based model with PKV (BMM + PKV) approach showed an accuracy improvement of 4% over the feature based BMM model approach with the Knowledge base component acting as a support system. (BMM + PKV) showed a slight improvement of 2% over (BMM + CKV) approach because CKV uses static Knowledge Dictionary where-as PKV constructs the dictionary dynamically. The difference in accuracies will be more evident in cases of a larger dataset and wider dataset with more intents. This shows that our novel Domain Knowledge based approach provides a significant improvement in accuracy when compared with the multimodal approach without Knowledge Embedding. Maintaining a Domain Specific Knowledge Dictionary for each intent class helps the network in learning by providing the network a hint about the intent of the data example taking advantage of the Knowledge Dictionary.
References 1. Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP (2014) 2. Pennington, J., Socher, R., and Manning, C. Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar (2014) 3. Fang, X., Zhan, J.: Sentiment analysis using product review data. J. Big Data 2(1), 1–14 (2015). https://doi.org/10.1186/s40537-015-0015-2 4. Duong, C.T., Lebret, R., Aberer, K.: Multimodal Classification for Analysing Social Media. arXiv, abs/1708.02099 (2017) 5. Haque, T.U., Saber, N.N., Shah, F.M.: Sentiment analysis on large scale Amazon product reviews. In: 2018 IEEE International Conference on Innovative Research and Development (ICIRD), pp. 1–6 (2018) 6. Hu, A., Flaxman, S.: Multimodal sentiment analysis to explore the structure of emotions. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018) 7. Smitha, E.S., Sendhilkumar, S., Mahalaksmi, G.S.: Meme Classification Using Textual and Visual Features. In: Hemanth, D.J., Smys, S. (eds.) Computational vision and bio inspired computing. LNCVB, vol. 28, pp. 1015–1031. Springer, Cham (2018). https://doi.org/10. 1007/978-3-319-71767-8_87 8. Watanabe, H., Bouazizi, M., Ohtsuki, T.: Hate speech on Twitter: a pragmatic approach to collect hateful and offensive expressions and perform hate speech detection. IEEE Access 6, 13825–13835 (2018) 9. Wiegand, M., Siegel, M., Ruppenhofer, J.: Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language (2018) 10. Kruk, J., Lubin, J., Sikka, K., Lin, X., Jurafsky, D., Divakaran, A.: Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts (2019) 11. De La Vega, L.G.M.: Determining trolling in textual comments. In: 11th International Conference on Language Resources and Evaluation. Phoenix Seagaia Conference Center Miyazaki, LREC 2018, pp. 3701–3706. Japan (2019) 12. Zampieri Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval) (2019) 13. Suryawanshi, S., Chakravarthi, B.R., Arcan, M., and Buitelaar, P.: Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text. In: TRAC@LREC (2020) 14. IMDb movie reviews. https://www.imdb.com/feature/genre/. Accessed 18 July 2020 15. Amazon Review Data (2018). https://nijianmo.github.io/amazon/index.html. Accessed 02 May 2020
Age and Gender Prediction Using Deep CNNs and Transfer Learning

Vikas Sheoran1, Shreyansh Joshi2(&), and Tanisha R. Bhayani3

1 Birla Institute of Technology and Science, Pilani - Hyderabad Campus, 500078 Hyderabad, India
[email protected]
2 Birla Institute of Technology and Science, Pilani - Goa Campus, Goa 403726, India
[email protected]
3 Silver Touch Technologies Limited, Ahmedabad 380006, India
Abstract. The last decade or two has witnessed a boom of images. With the increasing ubiquity of cameras and with the advent of selfies, the number of facial images available in the world has skyrocketed. Consequently, there has been a growing interest in automatic age and gender prediction of a person using facial images. We in this paper focus on this challenging problem. Specifically, this paper focuses on age estimation, age classification and gender classification from still facial images of an individual. We train different models for each problem and we also draw comparisons between building a custom CNN (Convolutional Neural Network) architecture and using various CNN architectures as feature extractors, namely VGG16 pre-trained on VGGFace, ResNet50 and SE-ResNet50 pre-trained on VGGFace2 dataset and training over those extracted features. We also provide baseline performance of various machine learning algorithms on the feature extraction which gave us the best results. It was observed that even simple linear regression trained on such extracted features outperformed training CNN, ResNet50 and ResNeXt50 from scratch for age estimation. Keywords: Age estimation
· CNN · Transfer learning
1 Introduction Age and gender prediction has become one of the more recognized fields in deep learning, due to the increased rate of image uploads on the internet in today’s data driven world. Humans are inherently good at determining one’s gender, recognizing each other and making judgements about ethnicity but age estimation still remains a formidable problem. To emphasize more on the difficulty of the problem, consider this - the most common metric used for evaluating age prediction of a person is mean absolute error (MAE). A study reported that humans can predict the age of a person above 15 years of age with a MAE of 7.2–7.4 depending on the database conditions [1]. This means that on average, humans make predictions off by 7.2–7.4 years.
The question is, can we do better? Can we automate this problem in a bid to reduce human dependency and to simultaneously obtain better results? One must acknowledge that aging of face is not only determined by genetic factors but it is also influenced by lifestyle, expression, and environment [1]. Different people of similar age can look very different due to these reasons. That is why predicting age is such a challenging task inherently. The non-linear relationship between facial images and age/gender coupled with the huge paucity of large and balanced datasets with correct labels further contribute to this problem. Very few such datasets exist, majority datasets available for the task are highly imbalanced with a huge chunk of people lying in the age group of 20 to 75 [3–5] or are biased towards one of the genders. Use of such biased datasets is not prudent as it would create a distribution mismatch when deployed for testing on real-time images, thereby giving poor results. This field of study has a huge amount of underlying potential. There has been an ever-growing interest in automatic age and gender prediction because of the huge potential it has in various fields of computer science such as HCI (Human Computer Interaction). Some of the potential applications include forensics, law enforcement [1], and security control [1]. Another very practical application involves incorporating these models into IoT. For example, a restaurant can change its theme by estimating the average age or gender of people that have entered so far. The remaining part of the paper is organized as follows. Section 2 talks about the background and work done before in this field and how it inspired us to work. Section 3 contains the exact technical details of the project and is further divided into three subsections. Section 4 talks about the evaluation metric used. Section 5 presents the various experiments we performed along with the results we obtained, and finally Sect. 6 wraps up the paper with conclusion and future work.
2 Related Work

Initial works on age and gender prediction involved techniques based on ratios of different measurements of facial features such as size of the eye, nose, distance of chin from forehead, distance between the ears, angle of inclination and angle between locations [8]. Such methods were known as anthropometric methods. Early methods were based on manual extraction of features such as PCA, LBP, Gabor, LDA and SFP. These extracted features were then fed to classical ML models such as SVMs, decision trees and logistic regression. Hu et al. [9] used the method of ULBP, PCA & SVM for age estimation. Guo et al. [10] proposed a locally adjusted robust regression (LARR) algorithm, which combines SVM and SVR when estimating age by first using SVR to estimate a global age range, and then using SVM to perform exact age estimation. The obvious downside of such methods was that not only was getting anthropometric measurements difficult, but the models were also not able to generalize, because people of different age and gender could have the same anthropometric measurements. Recently the use of CNNs for age and gender prediction has been widely adopted, as CNNs are quite robust and give outstanding results when tested on face images with occlusion, tilt and altered brightness. Such results have been attributed to their good ability to
extract features. This happens by convolving over the given image to generate invariant features which are passed on to the next layer in a sequential fashion. It is this continual passing of information from one layer to the next that makes CNNs so robust to occlusions, brightness changes, etc. The first application of CNNs was LeNet-5 [11]. However, the actual boom in using CNNs for age and gender prediction started after the deep CNN of [12] was introduced for image classification tasks. Rothe et al. [13] proposed DEX: Deep EXpectation of Apparent Age for age classification using an ensemble of 20 networks on the cropped faces of the IMDB-WIKI dataset. Another popular work, by Wang et al. [14], combines features from a deep CNN with features obtained from PCA.
3 Methodology

3.1 Dataset
In this paper, we use the UTKFace dataset [2] (aligned and cropped), which consists of over 20,000 face images with annotations of age, gender, and ethnicity. It has a total of 23708 images, of which 6 were missing age labels. The images cover large variations in facial expression, illumination, pose, resolution and occlusion. We chose this dataset because of its relatively more uniform distributions, the diversity it has in image characteristics such as brightness, occlusion and position, and also because it involves images of the general public. Some sample images from the UTKFace dataset can be seen in Fig. 1. Each image is labeled with a 3-element tuple, giving age (in years), gender (Male-0, Female-1) and race (White-0, Black-1, Asian-2, Indian-3 and Others-4) respectively.
Fig. 1. Sample images from the UTKFace dataset.
For both our approaches (custom CNN and transfer learning based models), we used the same set of images for training, testing and validation, to have standardized results. This was done by dividing the data into train, test and validation sets in an 80:10:10 ratio. The division was done while ensuring that the data distribution in each split remains roughly the same, so that there is no distribution mismatch while training and testing the models. Tables 1 and 2 show the composition of training, validation and test data with respect to gender and age, respectively.
Table 1. Composition of sets by gender

Gender   Training  Validation  Test   Total
Male     9900      1255        1234   12389
Female   9061      1115        1137   11313
Total    18961     2370        2371   23702
Table 2. Composition of sets by age

Age group  Training  Validation  Test  Total
0–10       2481      303         278   3062
11–20      1222      150         158   1530
21–30      5826      765         753   7344
31–40      3618      462         456   4536
41–50      1767      223         254   2244
51–60      1858      214         226   2298
61–70      1057      137         122   1316
71–80      577       57          65    699
81–90      413       45          46    504
91–100     114       11          12    137
101–116    28        3           1     32
Total      18961     2370        2371  23702
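A distribution-preserving 80:10:10 split of this kind can be produced, for example, by stratifying on a combined age-bucket/gender key with scikit-learn, as sketched below. The labels.csv file and its column names are placeholders, not part of the UTKFace release.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# labels.csv, "age" and "gender" are illustrative placeholders; age_bin is a
# derived stratification key (very sparse buckets may need to be merged).
df = pd.read_csv("labels.csv")
df["age_bin"] = df["age"] // 10

# Split off 20% first and then halve it, giving an 80:10:10 partition whose
# age/gender composition roughly matches the full dataset.
key = df["age_bin"].astype(str) + "_" + df["gender"].astype(str)
train_df, rest_df = train_test_split(df, test_size=0.20, stratify=key, random_state=42)
rest_key = rest_df["age_bin"].astype(str) + "_" + rest_df["gender"].astype(str)
val_df, test_df = train_test_split(rest_df, test_size=0.50, stratify=rest_key, random_state=42)
```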
3.2 Deep CNNs
Network Architecture. The tasks tackled using the deep CNN approach include age and gender classification and age estimation. The basic structure of each of the 3 models includes a series of convolutional blocks, followed by a set of FC (fully connected) layers for classification and regression. An RGB image is fed to the model and is resized to 180 × 180 × 3. Every architecture comprises convolutional blocks that are a stack of convolutional layers (filter size 3 × 3) followed by the non-linear activation ReLU, max pooling (2 × 2) and batch normalization to mitigate the problem of covariate shift. The deeper layers here also have spatial dropout (drop value of 0.15–0.2), which drops entire feature maps to promote independence between them. Following the convolutional blocks, the output is flattened before being fed into the FC layers. These FC layers have the ReLU activation function, dropout (value between 0.2 and 0.4) and batch normalization. Table 3 shows the architecture used for age estimation. The architectures for age classification and gender classification differ in that they have 3 and 2 blocks with 256 filters respectively (in the convolutional layers), and the output layer has 5 and 2 neurons respectively with the softmax activation function (being classification tasks).
Table 3. Network architecture for age estimation

Layer            Filters  Output size      Kernel size  Activation
Image            –        180 × 180 × 3    –            –
Separable Conv1  64       180 × 180 × 64   3 × 3        ReLU
Max Pooling      –        90 × 90 × 64     2 × 2        –
Separable Conv2  128      90 × 90 × 128    3 × 3        ReLU
Max Pooling      –        45 × 45 × 128    2 × 2        –
Separable Conv3  128      45 × 45 × 128    3 × 3        ReLU
Max Pooling      –        22 × 22 × 128    2 × 2        –
Separable Conv4  256      22 × 22 × 256    3 × 3        ReLU
Max Pooling      –        11 × 11 × 256    2 × 2        –
Separable Conv5  256      11 × 11 × 256    3 × 3        ReLU
Max Pooling      –        5 × 5 × 256      2 × 2        –
FC1              –        128              –            ReLU
FC2              –        64               –            ReLU
FC3              –        32               –            ReLU
Output           –        1                –            ReLU
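The architecture in Table 3 translates almost directly into Keras. The sketch below is a minimal illustration, assuming TensorFlow/Keras; the exact placement of batch normalization and the individual dropout rates (within the 0.15–0.4 ranges quoted above) are our assumptions, not values stated in the paper.

```python
from tensorflow.keras import layers, models

def conv_block(x, filters, spatial_drop=0.0):
    # Separable convolution -> ReLU -> 2x2 max pooling -> batch norm, as in Table 3.
    x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.BatchNormalization()(x)
    if spatial_drop > 0:
        x = layers.SpatialDropout2D(spatial_drop)(x)   # drops whole feature maps
    return x

inputs = layers.Input(shape=(180, 180, 3))
x = conv_block(inputs, 64)
x = conv_block(x, 128)
x = conv_block(x, 128)
x = conv_block(x, 256, spatial_drop=0.15)   # deeper blocks use spatial dropout
x = conv_block(x, 256, spatial_drop=0.20)
x = layers.Flatten()(x)
for units in (128, 64, 32):
    x = layers.Dense(units, activation="relu", kernel_initializer="he_uniform")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.3)(x)              # assumed value within the 0.2-0.4 range
outputs = layers.Dense(1, activation="relu")(x)   # single ReLU unit for age regression

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```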
Training and Testing. For age classification, our model classifies ages into 5 groups (0–24, 25–49, 50–74, 75–99, and 100–124). For this, we had to perform integer division (by 25) on the age values and later one-hot encode them before feeding them into the model. Similarly, gender also had to be one-hot encoded for gender classification into male and female. The loss function chosen for age estimation was mean-squared error (MSE), as it is a regression task, whereas for age and gender classification it was categorical cross-entropy. Each model was trained using a custom data generator that allows training in mini-batches. Learning rate decay was used during training, as it allowed the learning rate to decrease after a fixed number of epochs. This is essential as the learning rate becomes very precarious during the latter stages of training, when approaching convergence. Various experiments with different optimizers were conducted, the results of which are summarized in Sect. 5. Each model was trained for 30 to 50 epochs on average. The initial learning rate was set to the order of 1e–3 and a batch size of 32 was used. The learning rate was changed to 0.6 times the current learning rate after about 9 epochs (on average) to ensure that by the end of training the learning rate is small enough for the model to converge to the local minimum. Figure 2 showcases the training plots of our models. In all graphs, the blue line denotes the training and the red line denotes the validation result. It is very evident that the training for gender classification was the noisiest, whereas the training for age estimation was the smoothest.
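The decay schedule described above (multiplying the learning rate by 0.6 roughly every 9 epochs, starting from the order of 1e-3) can be expressed, for instance, with a Keras LearningRateScheduler callback. Here model, train_generator and val_generator stand for the model and the custom generators described in the text and are otherwise placeholders.

```python
import tensorflow as tf

def step_decay(epoch, lr):
    # Multiply the current learning rate by 0.6 every 9 epochs.
    return lr * 0.6 if epoch > 0 and epoch % 9 == 0 else lr

callbacks = [tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)]
model.fit(train_generator, validation_data=val_generator,
          epochs=40, callbacks=callbacks)
```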
Fig. 2. Training plots depicting loss by the epoch of a) Age Estimation b) Age Classification c) Gender Classification
Table 4 shows the lowest loss value to which our model could converge while training.

Table 4. Minimum loss value

       Age estimation        Age classification    Gender classification
       Train     Validation  Train    Validation   Train    Validation
Loss   56.8346   56.7397     0.3886   0.4009       0.1037   0.1864
The next subsection explores our work using transfer learning.

3.3 Transfer Learning
Transfer learning is one of the most powerful ideas in deep learning. It allows knowledge learned on one task to be applied to another. A lot of low-level features that determine the basic structure of the object can be very well learned from a bigger available dataset and knowledge of those transferred low-level features can help learn faster and improve performance on limited data by reducing generalization error. The UTKFace dataset is a very small dataset to capture the complexity involved in age and gender estimation, so we focused our attention further on leveraging transfer learning. One study [6] has already compared performance of fine-tuning and pretraining state-of-the-art models for ILSVRC for age estimation on UTKFace. We take it a step further by using convolutional blocks of VGG16 pretrained on VGGFace [4] and ResNet50 and SE-ResNet-50 (SENet50 in short) pre-trained on VGGFace2 [5], as feature extractors. These models are originally proposed for facial recognition, thus can be used for higher level of feature extraction. To avoid any confusion, in this paper we denote these models as VGG_f, ResNet50_f and SENet50_f respectively where f denotes pre-trained using facial images of respective datasets. Network Architecture. The tasks tackled using this transfer learning approach include age estimation and gender classification. Following is the network architecture we used in our models to train on top of features extracted.
For the gender classification, for convenience, we chose the custom model names VGG_f_gender, ResNet50_f_gender and SENet50_f_gender, whose design is as follows. VGG_f_gender comprises 2 blocks, each containing, in order, batch normalization, spatial dropout with a drop probability of 0.5, and separable convolution layers with 512 filters of size 3 × 3 with 'same' padding (to reduce loss of information during the convolution operations), followed by max pooling with kernel size 2 × 2. The fully connected system consisted of batch norm layers, followed by alpha dropout, and 128 neurons with ReLU activation and He uniform initialization, followed by another batch norm layer and finally the output layer with 1 neuron with sigmoid activation. The batch size chosen was 64. ResNet50_f_gender comprises just the fully connected system with batch norm, dropout with probability of 0.5, followed by 128 units with exponential linear unit (ELU) activation, He uniform initialization and a max-norm weight constraint of magnitude 3. The output layer had a single neuron with sigmoid activation. The batch size we chose for this was 128. For SENet50_f_gender we kept the same model as for ResNet50_f_gender. For the age estimation the models have been named VGG_f_age, ResNet50_f_age and SENet50_f_age. VGG_f_age consists of 2 convolution blocks, each containing, in order, a batch norm layer, spatial dropout with keep probability of 0.8 and 0.6 respectively, and a separable convolution layer with 512 filters of size 3 × 3, 'same' padding so that the dimension doesn't change (and information loss is curtailed), ReLU activation and He initialization. Each convolution block was followed by max pooling with kernel size 2 × 2. The fully connected system consisted of 3 layers with 1024, 512 and 128 neurons respectively, with dropout keep probabilities of 0.2, 0.2, and 1. Each layer had the ELU activation function with He uniform initialization. The output layer had one unit, ReLU activation with He uniform initialization and batch normalization. A batch size of 128 was chosen. ResNet50_f_age consists of a fully connected system of 5 layers with 512, 512, 512, 256 and 128 units, with dropout keep probabilities of 0.5, 0.3, 0.3, 0.3 and 0.5 respectively. Each of the layers contains batch normalization and has the Scaled Exponential Linear Unit (SELU) as the activation function. As previously, for SENet50_f_age we kept the same model as for ResNet50_f_age. Training and Testing. In order to save training time, each set was separately forward passed through each model to get the corresponding 9 NumPy ndarrays as extracted input feature vectors, which were saved. Since the faces were already aligned and cropped, no further preprocessing was carried out and the input dimensions were kept the same as the original RGB photos, i.e., 200 × 200 × 3. For gender classification, the loss is the binary cross-entropy function. Class weights were also taken into account while training to make up for the slight class imbalance, as there are roughly 48% female and 52% male images in both the training and validation sets. For age estimation, being a regression task, the loss function was mean squared error. The optimizer used in both cases is the AMSGrad variant of Adam [15] with an initial learning rate of 0.001, which is halved in the ending phase of training for better convergence. The choice of optimizer was based on the experiments carried out while training our custom CNN architecture and on theory [15].
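A hedged sketch of this feature-extraction pipeline is shown below for the ResNet50_f gender model, assuming the third-party keras_vggface package (which provides VGGFace/VGGFace2 weights and may require a compatible Keras version). The images array and the cache file name are placeholders, and the head is a simplified rendering of the ResNet50_f_gender design described above, not the authors' exact code.

```python
import numpy as np
from keras_vggface.vggface import VGGFace
from keras_vggface.utils import preprocess_input
from tensorflow.keras import layers, models
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.optimizers import Adam

# Frozen ResNet50 trained on VGGFace2, used purely as a feature extractor.
extractor = VGGFace(model="resnet50", include_top=False,
                    input_shape=(200, 200, 3), pooling="avg")

# images: float32 array of shape (n, 200, 200, 3); version=2 selects the
# VGGFace2 preprocessing. Forward-pass once and cache the features on disk.
features = extractor.predict(preprocess_input(images.astype("float32"), version=2))
np.save("resnet50_f_features.npy", features)

# Small fully connected head trained on the cached features.
head = models.Sequential([
    layers.BatchNormalization(input_shape=(features.shape[1],)),
    layers.Dropout(0.5),
    layers.Dense(128, activation="elu", kernel_initializer="he_uniform",
                 kernel_constraint=max_norm(3)),
    layers.Dense(1, activation="sigmoid"),
])
head.compile(optimizer=Adam(learning_rate=1e-3, amsgrad=True),
             loss="binary_crossentropy", metrics=["accuracy"])
```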
4 Evaluation

The performance of the age estimation algorithms is evaluated based on the closeness of the predicted value to the actual value. The metric widely used for age estimation as a regression task is the mean absolute error (MAE), which captures the average magnitude of error in a set of predictions. MAE calculates the absolute error between the actual age and the predicted age as defined by Eq. (1):

MAE = (1/n) Σ_{j=1}^{n} |ŷ_j − y_j|    (1)

where n is the number of testing samples, y_j denotes the ground truth age and ŷ_j is the predicted age of the j-th sample. For classification tasks (age and gender), the evaluation metric used was accuracy, which denotes the fraction of correctly classified samples over the total number of samples.
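For concreteness, Eq. (1) reduces to a one-line computation; the ages below are illustrative only.

```python
import numpy as np

y_true = np.array([25, 40, 63])           # ground-truth ages (made-up values)
y_pred = np.array([22, 45, 60])           # predicted ages
mae = np.mean(np.abs(y_pred - y_true))    # (3 + 5 + 3) / 3 = 3.67 years
```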
5 Experimentation and Results

In this section we summarize the results obtained via the extensive experiments performed in the study and compare different methods from the work of other researchers.

5.1 Deep CNNs
We experiment with our models in 3 distinct steps, where each successive step uses the model performing best in the previous step. First, we tried two of the most popular layer types for the convolutional layers. We trained and tested the performance of all three tasks (age estimation, age classification and gender classification) on 2 types of fundamental convolutional layers: the simple convolutional layer (Conv2D) and the separable convolutional layer (SeparableConv2D), with spatial dropout present in both cases for an increased regularization effect. All other hyperparameters were kept the same (Table 5).

Table 5. Comparison of layer type

Layers            Age estimation (MAE)  Age classification (accuracy)  Gender classification (accuracy)
Conv2D            6.098                 76.718                         90.426
Separable Conv2D  6.080                 78.279                         91.269
It is apparent that separable convolution coupled with spatial dropout (in the convolutional layers) helped the model converge faster and generalize better. This is because separable convolutions consist of first performing a depthwise spatial
convolution (which acts on each input channel separately) followed by a pointwise convolution which mixes the resulting output channels. Basically, separable convolutions factorize a kernel into 2 smaller kernels, leading to fewer computations, thereby helping the model to converge faster. Then we experimented with other arguments associated with the layer, namely the type of weight initialization and weight constraints, which determine the final weights of our model and hence its performance. Table 6 summarizes the results of this experiment.

Table 6. Comparison of models based on the arguments of the best performing layer

Layers configuration                                                                         Age estimation (MAE)  Age classification (accuracy)  Gender classification (accuracy)
Separable Conv2D + Spatial dropout + Xavier uniform initialization                          6.08                  78.279                         91.269
Separable Conv2D + Spatial dropout + He uniform initialization                               5.91                  79.122                         89.287
Separable Conv2D + Spatial dropout + He uniform initialization + max norm weight constraint  6.19                  72.163                         94.517
'He' initialization resulted in better performance than Xavier initialization when the ReLU activation function was used. Again, for each task we chose the configuration that gave the best result on the validation set and tried a number of different optimizers in order to maximize performance. The optimizer plays a very crucial role in deciding the model performance, as it decides the converging ability of a model. Ideally, we want an optimizer that not only converges to the minimum fast, but also helps the model generalize well (Table 7).
Table 7. Effect of various optimizers on results

Optimizer             Age estimation (MAE)  Age classification (accuracy)  Gender classification (accuracy)
Adam                  5.916                 79.122                         94.517
Adamax                5.673                 78.279                         91.269
SGD                   6.976                 70.012                         89.877
SGD + Momentum (0.9)  8.577                 77.098                         89.624
Nadam                 5.858                 78.405                         89.709
These results show that Adam and its variant (Adamax) provide the best results. Adam and its variants were observed to converge faster. On the other hand, it was observed that models trained using SGD were learning very slowly and saturated much earlier, especially when dealing with age.

5.2 Transfer Learning
Table 8 compares the performance based on the different extracted features on which our models were trained.

Table 8. Comparison based on feature extractors

Feature extractor  Age estimation (MAE)  Gender classification (accuracy)
VGG_f              4.86                  93.42
ResNet50_f         4.65                  94.64
SENet50_f          4.58                  94.94
It is clear that the features extracted using SENet50_f performed best for both tasks compared to ResNet50_f and VGG_f, even though we trained more layers for VGG_f. In the study [7], a linear regression model and a ResNeXt-50 (32×4d) architecture were trained from scratch on the same dataset for age estimation using Adam. In another study [6], various state-of-the-art models pre-trained on ImageNet were used, where the authors trained two new layers while freezing the deep CNN layers (which acted as a feature extractor), followed by fine-tuning the whole network with a smaller learning rate using SGD with momentum. Both studies evaluated their models on 10% of the dataset, utilizing the remainder for training or validation (Table 9).

Table 9. Comparison with others' work

Method                   Age estimation (MAE)
Linear regression [7]    11.73
ResNet50 [6]*            9.66
Inceptionv3 [6]*         9.50
DenseNet [6]*            9.19
ResNeXt-50 (32×4d) [7]   7.21
Best Custom CNN (ours)   5.67
VGG_f_age (ours)         4.86
ResNet50_f_age (ours)    4.65
SENet50_f_age (ours)     4.58
* Cropped according to the detected face image using the Haar Cascade face detector [16].
Since we obtained the best performance from the features extracted via SENet50_f for both tasks, in Table 10 and Table 11 we further provide baseline performance on these features for various machine learning algorithms on the same splits of the dataset. The validation set is not used since we haven't tuned these models; the default hyperparameters of the Scikit-learn and XGBoost libraries have been used.

Table 10. Untuned baseline for gender classification

Method                        Train (accuracy)  Test (accuracy)
Decision tree                 99.86             59.26
Linear SVC                    96.13             91.44
Logistic regression           97.38             92.11
Gradient boosted trees        95.15             93.38
XGBoost                       95.03             93.80
Linear discriminant analysis  95.30             94.39
SVC (kernel = rbf)            97.13             94.64
Table 11. Untuned baseline for age estimation

Method                  Train (MAE)  Test (MAE)
Decision tree           0.05         9.86
Gradient boosted trees  4.97         6.17
XGBoost                 5.00         5.89
Random forest           1.91         5.75
Linear regression       4.93         5.61
Linear SVR              4.85         5.58
SVR (kernel = rbf)      4.85         5.49
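Baselines of this kind can be reproduced with a few lines of Scikit-learn/XGBoost on the cached feature vectors, as sketched below for age estimation; the .npy file names are placeholders, and the default hyperparameters are used as stated above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Cached SENet50_f feature vectors and age labels; file names are illustrative.
X_train, y_train = np.load("senet_train_X.npy"), np.load("senet_train_age.npy")
X_test, y_test = np.load("senet_test_X.npy"), np.load("senet_test_age.npy")

for name, reg in [("Linear regression", LinearRegression()),
                  ("SVR (rbf)", SVR()),
                  ("XGBoost", XGBRegressor())]:
    reg.fit(X_train, y_train)
    print(name,
          "train MAE:", mean_absolute_error(y_train, reg.predict(X_train)),
          "test MAE:", mean_absolute_error(y_test, reg.predict(X_test)))
```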
Clearly, even simple linear regression outperformed training our custom CNN model for age estimation, and logistic regression came remarkably close to the custom CNN architecture for gender classification on the features extracted using SENet50_f. As expected, our model performs relatively poorly while predicting ages for people above 70 years of age. This is quite evident from Table 2, where it can be seen that only 5.78% of the images in the dataset belong to people above 70 (albeit the dataset is quite evenly balanced when it comes to gender). We believe much better results can be attained using a more balanced and larger dataset.
6 Conclusion

Inspired by the recent developments in this field, in this paper we proposed two ways to deal with the problem of age estimation, age and gender classification - a custom CNN architecture and transfer learning based pre-trained models. These pre-trained models helped us combat overfitting to a large extent. It was found that our models generalized very well with minimal overfitting, when tested on real-life images.
We plan to extend our work to a larger and more balanced dataset with which we can study biases and experiment with more ideas in order to improve the generalizability of our models. In future research, we hope to use this work as a platform to improve and innovate further and contribute to the deep learning community.
References 1. Han, H., Otto, C., Jain, A.K.: Age estimation from face images: Human vs. machine performance. In: Proceedings International Conference BTAS, pp. 1–8 (2013) 2. UTKFace. (n.d.). http://aicip.eecs.utk.edu/wiki/UTKFace. Accessed 14 July 2020 3. IMDB-WIKI – 500 k + face images with age and gender labels (n.d.). https://data.vision.ee. ethz.ch/cvl/rrothe/imdb-wiki/. Accessed 14 July 2020 4. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: Procedings of the British Machine Vision Conference (2015). https://doi.org/10.5244/c.29.41 5. Cao, Q., Shen, L., Xie, W., Parkhi, O.M., Zisserman, A.: VGGFace2: a dataset for recognising faces across pose and age. In: 2018 13th IEEE International Conference on Automatic Face & GestureRecognition (FG 2018). https://doi.org/10.1109/fg.2018.00020 6. Akhand, M.A., Sayim, M.I., Roy, S., Siddique, N.: Human age prediction from facial image using transfer learning in deep convolutional neural networks. In: Proceedings of International Joint Conference on Computational Intelligence Algorithms for Intelligent Systems, pp. 217–229 (2020). https://doi.org/10.1007/978-981-15-3607-6_17 7. Fariza, M.A., Arifin, A.Z.: Age estimation system using deep residual network classification method. In: 2019 International Electronics Symposium (IES), Surabaya, Indonesia, pp. 607– 611 (2019). https://doi.org/10.1109/elecsym.2019.8901521 8. Angulu, R., Tapamo, J.R., Adewumi, A.O.: Age estimation via face images: a survey. EURASIP J. Image Video Process. 2018(1), 1–35 (2018). https://doi.org/10.1186/s13640018-0278-6 9. Hu, L., Li, Z., Liu, H.: Age group estimation on single face image using blocking ULBP and SVM. In: Proceedings of the 2015 Chinese Intelligent Automation Conference Lecture Notes in Electrical Engineering, pp. 431–438 (2015). https://doi.org/10.1007/978-3-66246469-4_46 10. Guo, G.Y., Fu, T.S., Huang, C.R.: Dyer, locally adjusted robust regression for human age estimation. In: 2008 IEEE Workshop on Applications of Computer Vision, Copper Mountain, CO (2008). pp. 1–6. https://doi.org/10.1109/wacv.2008.4544009 11. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791 12. Krizhevsky, A., Ilya, S., Geoffrey, E.: Hinton. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012) 13. Rasmus, R., Timofte, R., Van Gool, L.: Dex: Deep expectation of apparent age from a single image. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2015) 14. Wang, X., Guo, R., Kambhamettu, C.: Deeply-learned feature for age estimation. In: Proceedings IEEE Winter Conference. Applications Computer Vision, pp. 534–541 (2015) 15. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. ICLR (2018) 16. Viola, P., Jones, M. (n.d.). Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision andPattern Recognition. CVPR 2001. https://doi.org/10.1109/cvpr.2001.990517
Text Line Segmentation: A FCN Based Approach

Annie Minj, Arpan Garai, and Sekhar Mandal

Indian Institute of Engineering Sciences and Technology, Shibpur, Howrah, India
{ag.rs2016,sekhar}@cs.iiests.ac.in
Abstract. Text line segmentation is a prerequisite for most document processing systems. However, for handwritten or warped documents, it is not straightforward to segment the text lines. This work proposes a learning-based text line segmentation method for document images. It can tackle the complex layouts present in camera-captured or handwritten document images along with printed, flat-bed scanned English documents. The method also works for alphasyllabary scripts like Bangla. Segmentation of Bangla handwritten text is quite challenging because of its unique characteristics. The proposed approach to line segmentation relies on fully convolutional networks (FCNs). To improve the performance of the method, we introduce a post-processing step. The model is trained and tested on our dataset along with the cBAD dataset. We develop the model in such a way that it can be trained and tested on a machine that has limited access to high-end computational accessories like GPUs. The results of our experiments are encouraging.

Keywords: Text line segmentation · Fully Convolution Network (FCN) · Bangla handwritten text · Warped documents
1 Introduction
It is often necessary to read line-by-line from a document image. Every day, huge numbers of documents are captured using cameras attached to smart mobile phones. These images suffer from different types of distortions that need to be rectified before reading from them. Text line segmentation is an essential step for these distortion rectification techniques [10–13]. Moreover, text line segmentation in a handwritten document image is more challenging than in a printed document image having a flat-bed surface. The text line segmentation techniques developed till now are either language-specific, for example for Kannada [1], English [30], Hindi [14], Gujarati [6], Bangla [20] and Arabic [2], or script independent [3,29]. These methods are tested on line segmentation on a flat-bed surface. Hence, they may fail to produce accurate results on warped document images. This is due to the interleaved manner of the text lines in warped images. Also, the texts present in these documents are often skewed and perspectively distorted. Two different lines may touch one another in these documents. Most of the line segmentation methods
are learning free. Recently, it has been reported that learning-based approaches, like [22], work with greater accuracy in a similar domain. Although some methods, as presented in [2,23], use a learning-based approach to segment text lines, the use of deep artificial neural networks is yet to be fully explored in the domain of text line segmentation. The proposed technique is based on an FCN (Fully Convolution Network), which can deal with both alphabetic scripts like English and alphasyllabary scripts like Bangla/Devanagari. The method also works for printed, handwritten, and warped documents. The method is designed to be implemented and tested on systems that do not contain any specialized hardware for a high amount of computation, like a GPU. Our work's main objective is to increase the accuracy level for the segmentation of the text lines. The remainder of this manuscript is organized as follows. In Sect. 2, the recent related works are discussed. The proposed work is described in Sect. 3. Next, the experimental results of the proposed work are analyzed in Sect. 4. Finally, we conclude in Sect. 5.
2 Related Work
Several methods for text line segmentation have already been proposed. These methods can be classified roughly into two classes: learning free and learning based. We present in this section a brief description of their principles. Mullick et al. [20] proposed an approach to segment the text lines in which the input document image is blurred in such a way that the white spaces between the words are removed while the gap between consecutive text lines is preserved. In [16], the authors have used a method to identify the boundary of the text lines by using morphological operations, and then each text line is segmented by representing it in a different color. Chavan et al. proposed an improved piece-wise projection method in [7], which applies signal approximation and a statistical approach for better line segmentation. Miguel et al. [5] used text line localization to search the dividing path of the neighboring text lines of a handwritten text. In the CNN based method [18] given by Shelhamer et al., the input size of the network is arbitrary and the size of the final output is the same as the size of the input. Here, contemporary classification networks are adapted into FCNs, and a skip architecture is defined to produce accurate and detailed segmentations. Another learning-based technique is proposed by Renton et al. [22] for text line segmentation in handwritten document images. They labeled each text line using the respective x-height of the text lines. This was performed using a deep FCN based on dilated convolutions. In [2], Barakat et al. present a line segmentation method for historical document images. They estimated a line mask using an FCN; these line masks are applied to connect the components of a text line. Vo et al. [28] find the line-like structures present in a document image by using an FCN; this is basically a rough estimation of the text lines. Later, a line adjacency graph (LAG) is used to deal with the touching characters, where each overlapping pixel is joined to the nearest line. In [17], Li et al. proposed a label pyramid network (LPN) for the task of line segmentation. The LPN is based on
an FCN. The outputs of the LPN are fused into a single image, which is passed through a deep watershed transform (DWT). The DWT is responsible for producing the segmented text lines. All these existing methods are tested on line segmentation on a flat-bed surface and will not produce accurate results for warped images. Hence, we propose a method that can segment text lines in flat-bed scanned or warped images having an alphabetic or alphasyllabary script.
3 Proposed Work
The objective of the proposed approach is to find the text lines in document images containing alphabetic and/or alphasyllabary scripts. A few datasets, along with ground truths, for alphabetic scripts like English are publicly available [15]. Only one dataset containing images having a handwritten alphasyllabary script is available: the ICDAR 2013 Handwritten Segmentation Contest dataset [27]. That dataset contains only 50 such images, which is not enough for training and testing a CNN. Hence, we create a dataset, called the Bangla handwritten document image dataset, that contains a handwritten alphasyllabary script along with suitable ground truths for line segmentation. The details about the datasets used are provided in Sect. 4. Some preprocessing steps, like binarization and scaling of the input images, are done first. Next, the resized image is divided into a number of sub-images in an overlapping manner, and these sub-images are fed to the FCN model during the training phase. The same process is followed in the testing phase. The outputs of the network are merged to obtain a complete document, and some morphological operations are then applied as a post-processing step to get better results. The different phases of the proposed technique are visualized in Fig. 1.
Fig. 1. Major steps of the proposed work.
3.1 Pre-processing
Binarization and denoising are the two preprocessing steps used in our proposed method. A brief description of each of them is given next.
Several binarization techniques are available in the literature [21,24,31]. For the proposed work, we use an existing binarization method proposed in [31]. This method uses a local thresholding technique to binarize the document images. The local threshold is selected in an adaptive way based on the neighbourhood information of each pixel of the image at different scales. This method is specially developed to binarize camera-captured document images, and hence we use this method for our proposed work. An experimental result of this binarization method is given in Fig. 2.
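The snippet below is not the method of [31]; it is only a simple OpenCV stand-in illustrating the idea of a locally adaptive threshold. The file name and window size are illustrative.

```python
import cv2

gray = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)   # file name is illustrative

# Threshold each pixel against a Gaussian-weighted mean of its 35x35
# neighbourhood; text becomes black (0) on a white (255) background.
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 35, 15)
```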
Fig. 2. Example of preprocessed image: (a) An image from cBAD dataset; (b) Binary image; (c) Denoised image. (Color figure online)
Fig. 3. The pictorial representation of the proposed FCN model.
Denoising is performed to estimate the text part of the image by suppressing noise from a noise-contaminated version of the image. Generally, six types of noises can be found in a document image as stated in [9]. They are (i) ruled
line noise, (ii) marginal noise, (iii) clutter noise, (iv) stroke-like noise, (v) salt-and-pepper noise and (vi) background noise. In this work, we use the handwritten Bangla dataset, the cBAD dataset and WDID. Denoised images are available in the handwritten Bangla dataset and WDID, but the cBAD dataset contains only RGB images, which are flat-bed scanned. Images of this dataset contain the aforesaid noises except salt-and-pepper noise. There are several existing techniques [4,8,25,26] for removal of marginal noise. Examples of marginal noise and stroke-like pattern noise are shown using a green rectangle and a blue circle in Fig. 2(b), respectively. For this work, we use an existing method proposed in [8] to recognize and remove the border noise. This method is based on connected component analysis. The images also contain non-text parts like figures, stroke-like pattern noise and clutter noise, and the method proposed in [32] is used to eliminate these noises. The method generates a graphical model using the connected components present in the document image. Next, from this model a size-based histogram is obtained. Finally, from this histogram the non-text parts are separated. We also remove very small connected components, like 'dots' and pepper noise, from the document image based on their size. The denoised images are shown in Fig. 2(c) for the input images of Fig. 2(b).
3.2 Preparation of Input to the Network
Highly computational accessories like GPUs may not be available in most institutes of countries like India. In this paper, we develop a method which can be implemented on a PC without a GPU. The average image size of our dataset is 1800 × 2500. Hence, we divide the preprocessed image (I) into sub-images in an overlapping manner. The size of each sub-image is N × N. These sub-images are used in the training and testing phases.
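A minimal sketch of this overlapping tiling is given below; the 50% overlap (stride of N/2) is our assumption, since the paper only states that the sub-images overlap.

```python
import numpy as np

def extract_patches(image, n=248, stride=124):
    # Slide an n x n window with the given stride. Border remainders and
    # images smaller than n would additionally need padding, omitted here.
    patches, positions = [], []
    h, w = image.shape[:2]
    for y in range(0, max(h - n, 0) + 1, stride):
        for x in range(0, max(w - n, 0) + 1, stride):
            patches.append(image[y:y + n, x:x + n])
            positions.append((y, x))
    return np.array(patches), positions
```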
3.3 Network Architecture
The proposed network contains 17 weighted layers. The input size of the network is 248 × 248. The first 8 layers are standard convolution layers producing 8, 8, 16, 16, 32, 32, 64, and 64 feature maps respectively, with a max-pooling (downsampling) layer after every two convolution layers. The next 8 layers are also standard convolution layers, which produce 64, 64, 32, 32, 16, 16, 8, and 8 feature maps respectively. Here, after every two convolution layers, we use an upsampling layer. The size of the filter is 3 × 3. The window size of both the max-pooling and the upsampling layer is 2 × 2. Finally, the output is produced by using a sigmoidal activation function. All the hidden layers use the ReLU activation function. Figure 3 depicts the visual representation of the proposed FCN model. Examples of sub-images and their respective outputs from the network are shown in Fig. 4. Here, the network extracts semantic information, and based on that semantic information the text lines are segmented.
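A minimal Keras sketch of this encoder-decoder is given below. Note that 248 is not divisible by 2^4, so after four pooling/upsampling stages the output does not return exactly to 248 × 248; the trailing Resizing layer (TensorFlow 2.6 or later) is our workaround, not something stated in the paper.

```python
from tensorflow.keras import layers, models

def build_fcn(input_shape=(248, 248, 1)):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Encoder: pairs of 3x3 convolutions, each pair followed by 2x2 max pooling.
    for filters in (8, 16, 32, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    # Decoder: pairs of 3x3 convolutions, each pair followed by 2x2 upsampling.
    for filters in (64, 32, 16, 8):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(2)(x)
    # Restore the exact input resolution (our workaround for 248 not being
    # divisible by 16), then predict a per-pixel text/background probability.
    x = layers.Resizing(input_shape[0], input_shape[1])(x)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)   # 17th weighted layer
    return models.Model(inputs, outputs)

model = build_fcn()
model.compile(optimizer="adam", loss="binary_crossentropy")
```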
Fig. 4. Example of the network output: (a) Input image; (b) Sub-images prepared for network; (c) Corresponding output images of the FCN; (d) Merged output image.
3.4 Training
The images are trained on the architecture discussed in the above section, using Keras. A total of more than 400 images were used for training. From these images, we get more than 16000 sub-images; 70% of these sub-images have been used for training, and 30% for validation. The number of epochs and the batch size used during training of the proposed model are 15 and 8, respectively. The training is done in two steps: a coarse adjustment of weights, followed by fine-tuning. The coarse adjustment is done using the dataset of handwritten Bangla document images. Later, the fine-tuning is performed using the images of the cBAD [Track-B] dataset. Here, we use the Adam optimizer to update the weights iteratively. The learning rate of the proposed network is 0.001. We use binary cross-entropy as the loss function.
3.5 Merging the Network Outputs
The sub-images obtained from the network are merged to generate the final output image. The sub-images are merged in such a way that they are placed in the same position from where they were cropped previously, i.e. in an overlapping manner. So, in some places, we get two different values for the same pixel position from different sub-images. To tackle this situation, we classify the respective pixel as text only if both the values are text pixels; otherwise, the candidate pixel is considered background. In Fig. 4(a), (b), (c), and (d), a sample input image, the corresponding sub-images that are fed to the FCN, the corresponding sub-images generated by the FCN and the merged output are shown, respectively.
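The merging rule (a pixel is kept as text only when every overlapping prediction marks it as text) can be sketched as follows, reusing the patch positions from the tiling sketch in Sect. 3.2; the array shapes and the 0.5 threshold are our assumptions.

```python
import numpy as np

def merge_predictions(pred_patches, positions, out_shape, n=248, thresh=0.5):
    # AND-combine overlapping predictions: a pixel stays text only if every
    # sub-image covering it predicts text.
    text = np.ones(out_shape, dtype=bool)
    covered = np.zeros(out_shape, dtype=bool)
    for patch, (y, x) in zip(pred_patches, positions):
        is_text = patch[..., 0] > thresh
        text[y:y + n, x:x + n] &= is_text
        covered[y:y + n, x:x + n] = True
    return (text & covered).astype(np.uint8) * 255
```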
3.6 Post-processing
It is observed that a few text lines are yet to be segmented properly. Consider the output of the FCN shown in Fig. 5(b) corresponding to the input image in Fig. 5(a). Here, some of the text lines are connected to each other. Hence, post-processing of the FCN generated output is needed. So, we apply a morphological opening operation using a rectangular structuring element of size b × l. The value of b is (1/4) × h and l is 2 × h. Here, h denotes the median height of all the components
present in the FCN generated output. The output of the morphological opening operation for the image in Fig. 5(b) is shown in Fig. 5(c). Now, it is seen that each of the text lines is separated. However, some disconnections within a single text line are still present. Hence, we apply a morphological dilation operation with a horizontal line-like structuring element of size 1 × l to remove these gaps. Thus, we get the final desired output, which is shown in Fig. 5(d).
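A hedged OpenCV rendering of this post-processing step is given below; note that OpenCV structuring-element sizes are specified as (width, height), and the binary mask is assumed to be a uint8 image.

```python
import cv2

def postprocess(fcn_mask, h):
    # h is the median component height; kernel sizes follow b = h/4, l = 2h.
    b, l = max(int(h / 4), 1), max(int(2 * h), 1)
    open_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (l, b))      # l wide, b tall
    dilate_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (l, 1))    # horizontal line
    opened = cv2.morphologyEx(fcn_mask, cv2.MORPH_OPEN, open_kernel)     # split touching lines
    return cv2.dilate(opened, dilate_kernel)                             # bridge gaps within a line
```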
Fig. 5. Output images of post-processing operations: (a) An preprocessed document image; (b) FCN output; (c) Output after morphological opening operation; (d) Final output.
4 Experimental Results
The proposed method is tested and evaluated using three publicly available datasets: (i) the Warped document image dataset (WDID) [13], (ii) the cBAD [Track-B] dataset [19], and (iii) the ICDAR 2013 Handwritten Segmentation Contest dataset [27]. The current version of WDID contains 258 different warped document images. The images are captured using different mobile phone cameras and mainly contain alphasyllabary scripts like Bengali/Devanagari. Some images have a fair amount of perspective distortion along with warping. Also, there are single- or multiple-folded images present in the dataset. It is challenging to segment text lines in these types of documents. The cBAD dataset contains two sub-datasets, namely Track A and Track B, which contain simple and complex documents, respectively. The images are collected from different sources and were written/printed at different times. We use only Track B for our experiment, which contains 1380 pages; among these, 1010 images are used for testing. The ICDAR 2013 handwritten segmentation contest dataset (say ICDAR'13 dataset) contains 200 document images for training and 150 document images for testing. Our network is tested on the 150 testing images. Among these images, 50 are written in Bangla, 50 in Greek and the remaining in English. As mentioned earlier, 50 images are not enough for
Fig. 6. Results on sample images from ICDAR 2013 handwritten segmentation contest dataset for Bengali, English and Greek scripts: (a, d, g) Input image; (b, e, h) Corresponding model output; (c, f, i) Output of the proposed method.
training the deep neural network, so we create the Bangla handwritten document image dataset. Our dataset contains 156 handwritten Bangla document images written by 39 writers, and the ground truth of this dataset is created manually. A sample image from the ICDAR'13 dataset and the corresponding output of the proposed approach is provided in Fig. 6. The output of the proposed technique on an image from WDID is shown in Fig. 7(a). Figure 7(b) shows a set of outputs of the proposed method for input images taken from the cBAD dataset. The proposed method is also tested on our warped handwritten document image dataset. A set of warped input images, their ground truth images and the output images of the proposed work are illustrated in Fig. 7(c). It is evident from the outputs mentioned above that the proposed method works satisfactorily for different types of document images. It should be noted that the warped images were not used during the training phase; in spite of that, the trained model provides reasonably accurate results for the warped images. To evaluate the performance of the proposed method we use the performance metric FM. To calculate the FM, we first calculate the number of correctly detected lines, denoted by o2o. Next, the recognition accuracy (RA) and the detection rate (DR) are calculated as RA = o2o/M and DR = o2o/N,
Fig. 7. Input images (1st column), ground truth images (2nd column) and output of the proposed method (3rd column) for sample images from (a) WDID, (b) cBAD dataset, (c) handwritten warped image.
respectively. Here, M and N are the number of text lines in the resultant image and the ground truth image, respectively. Finally, the performance metric FM is estimated using the equation FM = (2 × DR × RA)/(DR + RA). The performance metric FM of the output generated after passing the images of the ICDAR'13 dataset through the FCN model and of the output after applying post-processing is 90.60 and 98.90, respectively. The comparison of the proposed approach with the methods that participated in the ICDAR 2013 Handwritten Segmentation Contest [27] is tabulated in Table 1. From Table 1 it is clear that the proposed method produces quite good results as compared to existing approaches. However, some overlapping of lines is still present in the case of some complex characters.
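As a tiny numeric illustration of these metrics (the counts are made up, not results from the paper):

```python
o2o, M, N = 95, 98, 100          # correctly detected, detected, and ground-truth lines
RA, DR = o2o / M, o2o / N        # recognition accuracy and detection rate
FM = 2 * DR * RA / (DR + RA)     # harmonic mean of DR and RA
```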
Table 1. Evaluation results

Algorithm        DR (%)  RA (%)  FM (%)
CUBS             97.96   96.64   97.45
GOLESTAN-a       98.23   98.34   98.28
INMC             98.68   98.64   98.66
LRDE             96.94   97.57   97.25
MSHK             91.66   90.06   90.85
NUS              98.34   98.49   98.41
QATAR-a          90.75   91.55   91.15
QATAR-b          91.73   93.14   92.43
NCSR (SoA)       92.37   92.48   92.43
ILSP (SoA)       96.11   94.82   95.46
TEI (SoA)        97.77   96.82   97.30
QUANG method     98.68   98.53   98.60
Proposed method  98.90   98.90   98.90

5 Conclusion
The proposed line segmentation method based on a fully convolutional network successfully extracts text lines from handwritten document images. Unlike existing methods, the proposed method is applicable to warped document images for line segmentation. The proposed technique is script independent. The efficiency of the proposed method is also evaluated using a variety of document images available in our dataset along with publicly available benchmark datasets.
References 1. Banumathi, K.L., Chandra, A.P.J.: Line and word segmentation of Kannada handwritten text documents using projection profile technique. In: 2016 International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), pp. 196–201 (2016) 2. Barakat, B., Droby, A., Kassis, M., El-Sana, J.: Text line segmentation for challenging handwritten document images using fully convolutional network. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 374–379 (2018) 3. Bukhari, S.S., Shafait, F., Breuel, T.M.: Script-independent handwritten textlines segmentation using active contours. In: 2009 10th International Conference on Document Analysis and Recognition, pp. 446–450 (2009) 4. Bukhari, S.S., Shafait, F., Breuel, T.M.: Border noise removal of camera-captured document images using page frame detection. In: Iwamura, M., Shafait, F. (eds.) CBDAR 2011. LNCS, vol. 7139, pp. 126–137. Springer, Heidelberg (2012). https:// doi.org/10.1007/978-3-642-29364-1 10
Text Line Segmentation: A FCN Based Approach
315
5. Calderon, G., Angel, M., Hernandez, G., Arnulfo, R., Ledeneva, Y.: Unsupervised multi-language handwritten text line segmentation. J. Intell. Fuzzy Syst 34, 2901– 2911 (2018) 6. Chaudhari, S., Gulati, R.: Segmentation problems in handwritten Gujarati text. Int. J. Eng. Res. Technol. (IJERT) 3, 1937–1942 (2014) 7. Chavan, V., Mehrotra, K.: Text line segmentation of multilingual handwritten documents using fourier approximation. In: 2017 Fourth International Conference on Image Information Processing (ICIIP), pp. 1–6 (2017) 8. Dutta, A., Garai, A., Biswas, S.: Segmentation of meaningful text-regions from camera captured document images. In: 2018 Fifth International Conference on Emerging Applications of Information Technology (EAIT), pp. 1–4 (2018). https:// doi.org/10.1109/EAIT.2018.8470403 9. Farahmand, A., Sarrafzadeh, A., Shanbehzadeh, J.: Document image noises and removal methods, vol. 1, pp. 436–440 (2013) 10. Garai, A., Biswas, S., Mandal, S., Chaudhuri, B.B.: Automatic dewarping of camera captured born-digital bangla document images. In: 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), pp. 1–6 (2017). https:// doi.org/10.1109/ICAPR.2017.8593157 11. Garai, A., Biswas, S.: Dewarping of single-folded camera captured bangla document images. In: Das, A.K., Nayak, J., Naik, B., Pati, S.K., Pelusi, D. (eds.) Computational Intelligence in Pattern Recognition. AISC, vol. 999, pp. 647–656. Springer, Singapore (2020). https://doi.org/10.1007/978-981-13-9042-5 55 12. Garai, A., Biswas, S., Mandal, S.: A theoretical justification of warping generation for dewarping using CNN. Pattern Recogn. 109, 107621 (2021). https://doi.org/ 10.1016/j.patcog.2020.107621 13. Garai, A., Biswas, S., Mandal, S., Chaudhuri, B.B.: Automatic rectification of warped bangla document images. IET Image Process. 14(9), 74–83 (2020) 14. Garg, N.K., Kaur, L., Jindal, M.K.: A new method for line segmentation of handwritten Hindi text. In: 2010 Seventh International Conference on Information Technology: New Generations, pp. 392–397 (2010) 15. Gatos, B., Stamatopoulos, N., Louloudis, G.: ICDAR 2009 handwriting segmentation contest. In: 2009 10th International Conference on Document Analysis and Recognition, pp. 1393–1397 (2009) 16. Kumar, M.R., Pradeep, R., Kumar, B.S.P., Babu, P.: Article: a simple text-line segmentation method for handwritten documents. In: IJCA Proceedings on National Conference on Advanced Computing and Communications 2012 NCACC(1), pp. 46–61 (2012). Full text available 17. Li, X., Yin, F., Xue, T., Liu, L., Ogier, J., Liu, C.: Instance aware document image segmentation using label pyramid networks and deep watershed transformation. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 514–519 (2019). https://doi.org/10.1109/ICDAR.2019.00088 18. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440 (2015). https://doi.org/10.1109/CVPR.2015.7298965 19. Diem, M, Kleber, F., Fiel, S., Gruning, T., Gatos, B.: CBAD: ICDAR 2017 competition on baseline detection. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 1355–1360 (2017) 20. Mullick, K., Banerjee, S., Bhattacharya, U.: An efficient line segmentation approach for handwritten bangla document image. In: 2015 Eighth International Conference on Advances in Pattern Recognition (ICAPR), pp. 1–6 (2015)
316
A. Minj et al.
21. Pratikakis, I., Zagoris, K., Karagiannis, X., Tsochatzidis, L., Mondal, T., MarthotSantaniello, I.: ICDAR 2019 competition on document image binarization (dibco 2019). In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1547–1556 (2019) 22. Renton, G., Chatelain, C., Adam, S., Kermorvant, C., Paquet, T.: Handwritten text line segmentation using fully convolutional network. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 05, pp. 5–9 (2017) 23. Renton, G., Soullard, Y., Chatelain, C., Adam, S., Kermorvant, C., Paquet, T.: Fully convolutional network with dilated convolutions for handwritten text line segmentation. Int. J. Doc. Anal. Recogn. (IJDAR) 21(3), 177–186 (2018) 24. Roy, P., Dutta, S., Dey, N., Dey, G., Chakraborty, S., Ray, R.: Adaptive thresholding: a comparative study. In: 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), pp. 1182– 1186 (2014) 25. Shafait, F., Breuel, T.M.: The effect of border noise on the performance of projection-based page segmentation methods. IEEE Trans. Pattern Anal. Mach. Intell. 33(4), 846–851 (2011) 26. Shobha Rani, N., Vasudev, T.: An efficient technique for detection and removal of lines with text stroke crossings in document images. In: Guru, D.S., Vasudev, T., Chethan, H.K., Sharath Kumar, Y.H. (eds.) Proceedings of International Conference on Cognition and Recognition. LNNS, vol. 14, pp. 83–97. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-5146-3 9 27. Stamatopoulos, N., Gatos, B., Louloudis, G., Pal, U., Alaei, A.: ICDAR 2013 handwriting segmentation contest. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1402–1406 (2013). https://doi.org/10.1109/ICDAR. 2013.283 28. Vo, Q.N., Lee, G.: Dense prediction for text line segmentation in handwritten document images. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3264–3268 (2016). https://doi.org/10.1109/ICIP.2016.7532963 29. Xiaojun, D., Wumo, P., Tien, D.B.: Text line segmentation in handwritten documents using mumford shah model. Pattern Recogn. 42(12), 3136–3145 (2009). https://doi.org/10.1016/j.patcog.2008.12.021, http://www.sciencedirect. com/science/article/pii/S0031320308005360, new Frontiers in Handwriting Recognition 30. Zhang, X., Tan, C.L.: Text line segmentation for handwritten documents using constrained seam carving. In: 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 98–103 (2014) 31. Zhao, J., Shi, C., Jia, F., Wang, Y., Xiao, B.: An effective binarization method for disturbed camera-captured document images. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 339–344 (2018) 32. Zirari, F., Ennaji, A., Nicolas, S., Mammass, D.: A document image segmentation system using analysis of connected components. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 753–757 (2013)
Precise Recognition of Vision Based Multi-hand Signs Using Deep Single Stage Convolutional Neural Network

S. Rubin Bose and V. Sathiesh Kumar

Department of Electronics Engineering, MIT Campus, Anna University, Chennai-44, India
Abstract. The precise recognition of multi-hand signs in real time under dynamic backgrounds and illumination conditions is a time-consuming process. In this paper, a time-efficient single stage convolutional neural network (CNN) model, You Only Look Once (YOLO-V2), is proposed for real-time multi-hand sign recognition. The model utilizes the DarkNet-19 CNN architecture as a feature extractor. The model is trained and tested on three distinct datasets (NUSHP-II, SENZ-3D and MITI-HD) and is validated on the test dataset for IoU values ranging from 0.5 to 0.95. For MITI-HD, the YOLO-V2 CNN model obtained an average precision of 99.10% for AP0.5, 93.00% for AP0.75 and 78.30% for AP0.5:0.95. The Adam optimizer on the YOLO-V2 CNN model supersedes the other optimization methods. The prediction time of YOLO-V2 CNN is 20 ms, much lower than that of other single-stage hand sign recognition systems.

Keywords: Single stage detector · Deep learning · CNN · Hand gesture recognition · Hand detection
1 Introduction

In recent years, the methods of Human-Machine Interaction (HMI) have diversified due to the extensive growth and innovations in the field of artificial intelligence [1]. HMI facilitates a customer-friendly graphical user interface (GUI) design by actively incorporating the connectivity and interaction capabilities of humans [2]. Vision based non-verbal types of communication, like hand signs or hand actions, are being integrated into HMI systems. These visible activities of the human hands convey or communicate critical information to the machine. Multi-hand sign recognition is a hotspot of research in vision based applications of human-machine communication frameworks. Real-time applications require precise recognition of hand signs for efficient and smooth control over the GUI [3]. Hand sign recognition offers a wide variety of applications such as human-computer communication, sign language understanding, robotic arm control, self-driving vehicles and telemetry frameworks for surgical applications [4]. A vision based sign recognition system must be able to identify the hand precisely under changing lighting conditions and complex backgrounds. This is an extremely challenging and time-consuming task. The various components which
significantly influence the performance of real-world recognition are computational expense, rapid actions, changes in lighting conditions, self-occlusion, unpredictable environments and a large number of degrees of freedom (DOF) [5]. The recent breakthrough in the field of Neural Networks (NN) has ensured a wide spectrum of advancements in deep learning based algorithms through a specific network called the Convolutional Neural Network (CNN). The location of the object in visual information is represented and classified by means of CNN algorithms known as object detectors. Hand sign recognition can be achieved by using two different methodologies, namely two stage detection systems (Faster R-CNN) [6] and single stage detection systems (SSD) [7]. This paper focuses on a single stage recognition algorithm; the You Only Look Once (YOLO-V2) architecture is proposed for the real-time multi-hand sign recognition system. YOLO is a fast, accurate object detector, making it ideal for computer vision applications.
2 Literature Review

Zhengjie et al. [8] used the smartphone as an active sonar sensing tool to detect hand motions. The authors examined the results obtained by different hand gesture implementations and reported a detailed survey. They highlighted multiple challenges, perspectives and open problems which have to be resolved in a hand gesture recognition framework based on ultrasonic signals. Shilpa et al. [9] demonstrated a real-time model to identify Hindi language based gestures. The authors also developed an application to convert gesture to text (G2T) and text to gesture (T2G). The 32 different movements are described with a 5-bit binary string feature extraction technique. The authors reported that the algorithm for gesture recognition achieved an accuracy of 98%. Danilo et al. [10] proposed a hand gesture recognition algorithm using a Recurrent Neural Network (RNN). The collection of the selected hand features is carried out by using the Leap Motion controller. The RNN is trained using features of the hand which encompass the feature points (joint angle, fingertip location). Using the American Sign Language dataset, the authors achieved an accuracy of 96%. Wu et al. [11] proposed a Deep Neural Network to recognize hand movements from RGB-D data and skeletal features. It integrates a Deep Neural Network (to extract the essential information) and a 3-D CNN (for RGB-D data). The authors utilized the ChaLearn LAP dataset to evaluate the model. Peng et al. [12] suggested a collection of deep hybrid classifiers (incorporating a CNN-SVM classifier) to recognize egocentric hand postures. The authors reported a recognition accuracy of 97.72% for the NUS hand posture dataset II with a prediction time of 121 s. Nguyen et al. [13] proposed numerous techniques for the identification of static gestures: the Gabor filter, Fisher's discriminant evaluation and a cosine metric distance approach. The suggested frameworks are evaluated on the Senz-3D dataset, and the authors reported an accuracy of 93.89%.
Liau et al. [14] proposed a Fire SSD model for object detection. Fire SSD consists of a residual SqueezeNet based backend network (feature extractor) and six branches of Multibox features. The authors achieved an average precision of 70.5% on the PASCAL VOC 2007 dataset at 33.1 FPS. Ning et al. [15] proposed an Inception Single Shot Detection (I-SSD) model by adopting the Inception block to replace the extra feature layers in SSD. The authors used batch normalisation and a residual structure in the architecture. The I-SSD algorithm is evaluated on the VOC2007 test dataset and results in an average precision of 78.6%. Yuxi et al. [16] presented a tiny deeply supervised object detection (Tiny-DSOD) architecture designed specifically for resource-restricted environments. The Tiny-DSOD architecture utilizes two creative and ultra-efficient architectural blocks: the depthwise dense block (DDB) as backend and the depthwise feature-pyramid-network (D-FPN) as front end. Tiny-DSOD achieves an average precision of 72.1%. Cruz et al. [17] proposed a combination of YOLO with a ZOOM detection architecture for real-time hand detection. They achieved an average precision of 90.40% and an average recall of 93.90% on an egocentric dataset; the computation time is not listed in the paper. Redmon et al. [18] proposed a joint training algorithm for the object detection model (YOLO) by using both detection and classification data. Qiang et al. [1] proposed a dynamic hand gesture recognition algorithm. It combines Channel State Information (CSI) with the CNN paradigm of the You Only Look Once (YOLO) architecture. The authors utilized this model for dynamic hand gesture recognition and achieved a recognition accuracy of 94%. Focusing on a systematic review of the literature, several other methodologies for hand sign recognition systems have been reported. Each technique seems to possess strengths and weaknesses for detecting and classifying multiple hands. In this paper, a YOLO-V2 based CNN architecture is proposed to increase the performance and reduce the computational time of the real-time hand action recognition system. The proposed YOLO-V2 CNN model is trained and evaluated on three datasets, namely NUSHP-II [19], SENZ-3D [20] and MITI-HD [7]. Table 1 shows the pseudo code of the system.
Table 1. Pseudo code of single stage hand action recognition system.
Input: input frames f from video stream V; classes C1, C2, ..., Cn; bounding box coordinates bx, by, bw, bh; class probability Pc
Output: predicted hand signs with bounding box.
for each frame f in video V do
    split f into m x m grids, each grid cell holding 5 bounding boxes b
    for each prediction of C with b do
        if Pc = 1, then there is a hand sign in b with coordinates (bx, by, bw, bh)
            y = (Pc, bx, by, bw, bh, C)
        else (Pc = 0), no object in the grid cell
        end if
    end for
end for
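As a concrete illustration of the pseudo code above, the following Python sketch shows how such grid predictions could be decoded into box tuples. The tensor layout, the 0.5 confidence threshold and the helper name are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def decode_grid_predictions(pred, conf_thresh=0.5):
    """Decode a (S, S, B, 5 + n_classes) prediction tensor into (Pc, bx, by, bw, bh, C) tuples."""
    S, _, B, _ = pred.shape
    detections = []
    for i in range(S):
        for j in range(S):
            for b in range(B):
                pc = pred[i, j, b, 0]                      # objectness / confidence Pc
                if pc < conf_thresh:                       # treated as "no object in this grid cell"
                    continue
                bx, by, bw, bh = pred[i, j, b, 1:5]        # bounding box coordinates
                cls = int(np.argmax(pred[i, j, b, 5:]))    # most probable class C
                detections.append((pc, bx, by, bw, bh, cls))
    return detections
```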
3 Methodology
Fig. 1. Methodology for single stage multi-hand sign recognition system.
The proposed YOLO-V2 based single-stage CNN framework for recognizing real-time multi-hand signs in an unrestrained environment is shown in Fig. 1. YOLO-V2 overcomes the challenges faced by other recognition systems like the Single Shot Detector (SSD) [7] and the Faster Region-based Convolutional Neural Network (Faster R-CNN) [6]. YOLO-V2 is a modified version of the conventional YOLO architecture that improves both the prediction accuracy and the computational speed. The YOLO-V2 model utilizes the DarkNet-19 CNN architecture as a backbone for extracting feature vectors from the image.
Fig. 2. Flowchart of dataset collection and pre-processing.
Fig. 3. Samples of MITI-HD dataset.
The YOLO-V2 CNN model has been trained and evaluated with the assistance of two benchmark datasets, the NUS Hand Posture-II (NUSHP-II) dataset [19] and the Senz 3D hand dataset (SENZ-3D) [20], and a custom-designed dataset (MITI-HD) [7]. Figure 2 illustrates the flow diagram that describes the collection of data samples and the technique used for pre-processing the hand gesture data samples. The MITI Hand Dataset (MITI-HD) [7] is a custom hand movement dataset gathered from a group of people. Specifically, the data collection covers parameters such as different skin tones, complex backgrounds, various dimensions, illumination variation and geometry. The dataset consists of 10 classes and 750 data samples per class (total = 7500 samples). Sample frames of MITI-HD are displayed in Fig. 3. All the samples in the datasets are re-dimensioned to 300 × 300 pixels using an adaptive interpolation technique. White pixels are padded outside the image region in order to retain the aspect ratio. A selection of a Region of Interest (ROI), called annotation, is followed by the resizing process; annotation is a method of nominating the relevant area on the frames. A data split ratio of 80:20 is maintained for training and testing samples. Feature extraction and the training process are carried out after the segregation of data samples. DarkNet-19 consists of 19 convolutional layers and 5 max-pooling layers. The YOLO-V2 model utilizes the convolutional and pooling layers to re-dimension the input resolution on the fly. The YOLO-V2 network dynamically opts for new image dimensions after a few iterations, rather than locking the input resolution into a fixed size. This scheme empowers the model to learn to predict precisely from a wide range of input dimensions. Thus, detections can be predicted at multiple resolutions by the same network. YOLO-V2 delivers an easy trade-off between precision and speed, so the network works much better at smaller dimensions. YOLO is a single-stage convolutional neural network algorithm for the hand detection and recognition process. Other object detection algorithms process the image frame piece by piece, whereas the YOLO-V2 CNN algorithm captures the entire
image in one shot. It reframes the process of hand detection as a single regression problem, directly from the image pixels to class probabilities and bounding box coordinates. To predict each and every bounding box, the YOLO-V2 CNN network uses features from the whole image frame. It also simultaneously predicts all bounding boxes for an image across all classes [18]. The input image is partitioned into an S × S grid of cells. Each grid cell determines B bounding boxes and confidence scores. These confidence scores indicate the assurance of the model that the box incorporates an object (hand) and how precise the model believes the predicted box is. YOLO-V2 CNN is faster than the conventional YOLO architecture. YOLO-V2 CNN has a loss function which corresponds directly to the detection performance, and it is jointly trained with the entire framework [21]. The loss function is expressed in Eq. 1:

Loss_{YOLO} = Loss_{Box} + Loss_{Conf} + Loss_{Prob}   (1)

where Loss_{Box} is the anchor box loss, Loss_{Conf} is the confidence loss and Loss_{Prob} is the class probability loss. They are expressed in Eqs. 2, 3 and 4, respectively:

Loss_{Box} = \beta_{COR} \sum_{i=0}^{S^2} \sum_{j=0}^{B} L_{ij}^{OB} \big[ (x_{ij} - x'_{ij})^2 + (y_{ij} - y'_{ij})^2 + (\sqrt{w_{ij}} - \sqrt{w'_{ij}})^2 + (\sqrt{h_{ij}} - \sqrt{h'_{ij}})^2 \big]   (2)

Loss_{Conf} = \beta_{OB} \sum_{i=0}^{S^2} \sum_{j=0}^{B} L_{ij}^{OB} (IoU_{ij}^{GT} - C'_{ij})^2 + \beta_{NOB} \sum_{i=0}^{S^2} \sum_{j=0}^{B} L_{ij}^{NOB} (0 - C'_{ij})^2   (3)

Loss_{Prob} = \beta_{CLS} \sum_{i=0}^{S^2} \sum_{j=0}^{B} L_{ij}^{OB} \sum_{C \in CLS} P_{ij} \log P'_{ij}   (4)

where x_{ij}, y_{ij}, w_{ij}, h_{ij} are the prediction coordinates; x'_{ij}, y'_{ij}, w'_{ij}, h'_{ij} are the ground truth coordinates; \beta_{COR}, \beta_{OB}, \beta_{NOB}, \beta_{CLS} are the scalars used to weight each loss term; C_{ij} is the objectness; and L_{ij}^{OB} and L_{ij}^{NOB} are indicator functions: L_{ij}^{OB} is 1 if C_{ij} = 1 (else 0), and L_{ij}^{NOB} is 1 if C_{ij} = 0 (else 0).
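A minimal NumPy sketch of how Eqs. 1-4 could be evaluated is given below. The tensor shapes, the masking convention, the weight values and the negative-log form of the class term are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def yolo_v2_loss(pred_box, gt_box, pred_conf, iou_gt, pred_prob, gt_prob, obj_mask,
                 b_cor=5.0, b_ob=1.0, b_nob=0.5, b_cls=1.0):
    # pred_box, gt_box: (S*S, B, 4) as (x, y, w, h); pred_conf, iou_gt, obj_mask: (S*S, B)
    # pred_prob, gt_prob: (S*S, B, n_classes); obj_mask plays the role of L^OB, (1 - obj_mask) of L^NOB
    noobj_mask = 1.0 - obj_mask
    loss_box = b_cor * np.sum(obj_mask * (
        (pred_box[..., 0] - gt_box[..., 0]) ** 2 +
        (pred_box[..., 1] - gt_box[..., 1]) ** 2 +
        (np.sqrt(pred_box[..., 2]) - np.sqrt(gt_box[..., 2])) ** 2 +
        (np.sqrt(pred_box[..., 3]) - np.sqrt(gt_box[..., 3])) ** 2))            # Eq. (2)
    loss_conf = (b_ob * np.sum(obj_mask * (iou_gt - pred_conf) ** 2) +
                 b_nob * np.sum(noobj_mask * (0.0 - pred_conf) ** 2))           # Eq. (3)
    loss_prob = -b_cls * np.sum(obj_mask[..., None] *
                                gt_prob * np.log(pred_prob + 1e-9))             # Eq. (4), cross-entropy sign convention
    return loss_box + loss_conf + loss_prob                                     # Eq. (1)
```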
4 Experimental Overview
4.1 Model Implementation
The framework for multi-hand sign recognition is accomplished in real-time by using a Deep Learning toolkit along with python library and TensorFlow module as a backend. The YOLO-V2 model is trained using a computer (Intel ® Core TM i7-4790 CPU @ 3.60 GHz, the 64-bit processor, 20 GB RAM, Windows 10 PRO OS) and GPU (NVIDIA GeForce GTX TITAN X (PASCAL)). CUDA/CUDNN has been used to
carry out synchronous computations on the GPU. Other Python packages like NumPy, Cython, OpenCV, Pandas and Matplotlib are used.
4.2 Model Training
The YOLO-V2 CNN model for real-time hand-sign recognition has been trained for 35,000 steps with fine-tuned weights from the COCO dataset. The models are trained and evaluated using three datasets: NUSHP-II, Senz-3D and MITI-HD. Gradient descent optimization techniques, namely the Adam optimization algorithm [22, 25], the momentum optimization algorithm [22, 23] and the RMSprop optimization algorithm [22, 24], are used to train the proposed YOLO-V2 CNN model. The YOLO-V2 CNN model achieved precise and reliable hand sign recognition under diverse environments and varying illumination conditions.
4.3 Model Testing and Prediction
The YOLO-V2 CNN model is evaluated using the test data. The evaluation metrics (Average Precision (AP), Average Recall (AR), F1-Score (F1), and Prediction Time) of the proposed models are analyzed for various ranges of Intersection over Union (IoU). The efficiency of the hand sign recognition process is determined using the IoU. The IoU values are computed on the basis of the error induced by the predicted bounding box associated with the ground truth box. The value of IoU is unity for zero error. The prediction is true for IOU = 0.5 and it is accurate for IoU > 0.5.
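For reference, the IoU between a predicted and a ground-truth box can be computed as below (boxes are given as (x1, y1, x2, y2) corner coordinates; this is a generic sketch, not code from the paper).

```python
def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2); returns intersection-over-union in [0, 1], 1 when the boxes coincide
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```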
5 Results and Discussion
The efficiency of the hand sign recognition algorithm for YOLO-V2 CNN is calculated for different IoU values, namely 0.5, 0.75 and 0.5:0.95. The average precision at IoU 0.5, 0.75 and 0.5:0.95 is denoted AP0.5, AP0.75 and AP0.5:0.95. A hand sign occupying less than 32 × 32 pixels is considered a small-size hand detection, a hand region of more than 32 × 32 pixels and less than 96 × 96 pixels is a medium-range detection, and regions exceeding 96 × 96 pixels are large detections. APsmall, APmedium and APlarge are defined as the average precision of the small, medium and large segments of the detected hand regions. The average recall is similarly represented as ARsmall, ARmedium and ARlarge. For 1, 10 and 100 detections, the average recall is represented as AR1, AR10 and AR100, respectively. The YOLO-V2 CNN model is trained with an appropriate learning rate and a batch size of 8. The YOLO-V2 CNN model utilizes a learning rate of 0.0002 for the Adam and momentum optimization techniques, and 0.004 for the RMSprop optimization process.
Table 2. Average precision of YOLO-V2 CNN model

Dataset    Optimizer  AP0.5  AP0.75  AP0.5:0.95  APsmall  APmed  APlarge
NUSHP-II   Adam       0.997  0.938   0.758       0.321    0.762  0.874
NUSHP-II   RMSprop    0.954  0.816   0.666       0.100    0.670  0.734
NUSHP-II   Momentum   0.994  0.921   0.738       0.318    0.741  0.865
SENZ-3D    Adam       0.997  0.977   0.801       –        0.759  0.826
SENZ-3D    RMSprop    0.993  0.923   0.723       –        0.616  0.747
SENZ-3D    Momentum   0.995  0.974   0.801       –        0.719  0.817
MITI-HD    Adam       0.991  0.930   0.783       0.653    0.767  0.876
MITI-HD    RMSprop    0.939  0.780   0.633       0.521    0.734  0.774
MITI-HD    Momentum   0.983  0.893   0.647       0.502    0.717  0.763
Table 2 demonstrates the YOLO-V2 CNN's performance assessment (average precision) for IoU values of 0.5, 0.75 and 0.5:0.95. For all the datasets, the YOLO-V2 CNN that utilizes the Adam optimizer achieves significantly higher precision values than the RMSprop and momentum optimizers. The ROI (hand) is entirely absent for the SENZ-3D dataset at scales of less than 32 × 32 pixels (small). The MITI-HD dataset attains an average precision value of 0.991 for IoU = 0.5, 0.930 for IoU = 0.75 and 0.783 for IoU = 0.5:0.95. The YOLO-V2 CNN model's performance metrics on the MITI-HD dataset are comparable to the performance measures on the benchmark datasets (NUSHP-II, SENZ-3D). Both the NUSHP-II and SENZ-3D hand datasets attained an average precision (AP0.5) of 0.997 for IoU = 0.5.

Table 3. Average recall of YOLO-V2 CNN model

Dataset    Optimizer  AR1    AR10   AR100  ARsmall  ARmed  ARlarge
NUSHP-II   Adam       0.792  0.793  0.793  0.392    0.797  0.840
NUSHP-II   RMSprop    0.711  0.725  0.725  0.100    0.729  0.735
NUSHP-II   Momentum   0.770  0.778  0.778  0.383    0.779  0.830
SENZ-3D    Adam       0.835  0.835  0.835  –        0.758  0.851
SENZ-3D    RMSprop    0.760  0.762  0.762  –        0.645  0.780
SENZ-3D    Momentum   0.826  0.826  0.826  –        0.757  0.837
MITI-HD    Adam       0.789  0.790  0.791  0.662    0.702  0.831
MITI-HD    RMSprop    0.780  0.782  0.783  0.667    0.710  0.816
MITI-HD    Momentum   0.767  0.773  0.773  0.653    0.680  0.814
Table 3 summarizes the average recall (AR) of the YOLO-V2 CNN model for IoU = 0.5:0.95. The use of the Adam optimizer led to considerable performance improvements compared to the other optimization algorithms considered in the
experiments. This observation is consistent across all datasets. MITI-HD contributed better average recall values for all detection ranges. Compared to the NUSHP-II and Senz-3D hand datasets, hand regions (actions) are efficiently recognized even from input frames with dimensions lower than 32 × 32 pixels. The average recall of the YOLO-V2 CNN model for different numbers of hand detections is 0.789 for AR1, 0.790 for AR10 and 0.791 for AR100. Even though the average precision and average recall of the SENZ-3D and NUSHP-II datasets using the YOLO-V2 CNN model are slightly higher than for MITI-HD, the small detections (APsmall, ARsmall) are completely eliminated in SENZ-3D and produce very low values for NUSHP-II. MITI-HD produces significant average precision and average recall values for small, medium and large detections. As reported in Table 2 and Table 3, the YOLO-V2 CNN model offers state-of-the-art results on MITI-HD. The same trend is also visible for the model trained and tested with the benchmark datasets.

Table 4. Accuracy/Speed trade-off of YOLO-V2 CNN model on MITI-HD_300

Model: YOLO-V2 CNN
Optimizer             Adam   RMSprop  Momentum
AP0.50                0.991  0.939    0.983
AR0.50                0.967  0.921    0.959
F1-score0.50          0.978  0.929    0.970
Prediction Time (ms)  20     31       27
The accuracy/speed trade-off of the YOLO-V2 CNN model for an image dimension of 300 × 300 pixels (MITI-HD 300) is shown in Table 4. For the true prediction range (IoU = 0.5), the performance parameters (AP, AR, F1-score and prediction time) are obtained. The precision value and the F1-score reported for real-time hand sign recognition using the Adam optimizer on the YOLO-V2 CNN model are substantially higher. The prediction time of the proposed model is considerably lower than that of the conventional single-stage architectures. The total loss curves of the YOLO-V2 CNN model using the NUSHP-II, Senz-3D and MITI-HD datasets are presented in Fig. 4(a), (b) and (c), respectively. The inference made from the loss curves of the YOLO-V2 CNN architecture is that the model trained with the Adam optimizer produced a comparably lower loss than the other optimization algorithms. The proposed YOLO-V2 CNN platform provides state-of-the-art performance. The evaluated metrics of the proposed model are compared with the existing hand action recognition models as shown in Table 5.
Fig. 4. Loss Curve of YOLO-V2 CNN (a) NUSHP-II (b) SENZ-3D (c) MITI-HD
Table 5. Comparison of performance metrics with the existing models.

Author            Models                     AP0.50 (%)  AR0.50 (%)  F1-Score0.50 (%)  Prediction time (ms)
Proposed          YOLO-V2 CNN                99.10       96.70       97.88             20
Rubin et al. [6]  Faster R-CNN Inception V2  99.10       96.78       97.98             140
Rubin et al. [7]  SSD Inception V2           99.00       95.40       97.17             46
Cruz et al. [17]  YOLO + ZOOM                90.40       93.90       92.12             –
Qiu et al. [26]   YOLO                       63.40       –           –                 22
The average precision of the YOLO-V2 CNN model is similar to the Faster R-CNN Inception-V2 model [6] and better than the SSD Inception-V2 model [7], YOLO architecture with Zoom detection model [17] and YOLO detection algorithm (suggested by Qiu et al. [26]). The YOLO-V2 CNN model has a computational time of 20 ms which is significantly lower than the SSD Inception-V2 and Faster R CNN
Inception-V2 architectures. The proposed model's (YOLO-V2 CNN) average precision is much higher than those of the models suggested by Rubin et al. [7], Cruz et al. [17] and Qiu et al. [26]. The proposed YOLO-V2 CNN model is more precise and time efficient. As future work, hand sign recognition will be carried out using other recent models such as the YOLO-V3 and RetinaNet CNN architectures. The real-time multi-hand sign recognition device using the YOLO-V2 CNN model is tested using the hardware modules specified in Sect. 4.1. In order to capture the real-time hand movements, a Quantum QHM495LM 25 MP web camera is used. The recognition and classification of the multi-hand signs are recorded for the YOLO-V2 CNN model. Sample frames of the precisely recognized multi-hand signs are shown in Fig. 5.
Fig. 5. Real-time predicted output (Multi-Hand) samples of YOLO-V2 CNN model.
6 Conclusion
The precise recognition of multi-hand signs in real time is crucial for human-machine interactions. A YOLO-V2 based deep single-stage CNN architecture is used for the recognition of real-time multi-hand signs. Experiments are performed using a customized dataset (MITI-HD) and two other standard datasets (NUSHP-II, Senz-3D) to determine the robustness and efficiency of the proposed system. The proposed model is tested for both static and dynamic hand signs. The efficiency of the YOLO-V2 CNN model is measured using the model checkpoint at 35,000 training steps for both the Adam optimizer and the momentum optimizer with a learning rate of 0.0002, and for RMSprop with 0.004. The Adam optimizer performed better than the other optimizers for the YOLO-V2 CNN platform. The performance metrics (AP, AR, F1-score) evaluated for IoU = 0.5 are 99.10%, 96.70% and 97.88%, respectively, for the YOLO-V2 CNN model. For IoU = 0.5:0.95, the average precision achieved is 78.30%, and an IoU of 0.75 produced an average precision of 93.00%. The YOLO-V2 CNN model's prediction time (20 ms) is considerably lower than that of the other state-of-the-art architectures. By using the proposed model, the detection speed for real-time hand signs is significantly improved. Reducing the prediction time further to increase the throughput of the system is our future scope.
Acknowledgement. The profound gratitude of the authors goes to NVIDIA for delivering the GPU (NVIDIA TitanX) in the context of the University Research Grant Initiative.
References 1. Qiang, Z., Yong, Z., Zhiguo, L.: A dynamic hand gesture recognition algorithm based on CSI and YOLOv3. In: Journal of Physics: Conference Series, vol. 1267, p. 012055 (2019). https://doi.org/10.1088/1742-6596/1267/1/012055 2. Huang, H., Chong, Y., Nie, C., Pan, S.: Hand gesture recognition with skin detection and deep learning method. In: IOP: Journal of Physics: Conferences Series, vol. 1213, p. 022001 (2019). https://doi.org/10.1088/1742-6596/1213/2/022001 3. Raza, M., Ketsoi, V., Chen, H.: An integrative approach to robust hand detection using CPM-YOLOv3 and RGBD camera in real time. In: 2019 IEEE International Conferences on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking, pp. 1131– 1138. Xiamen, China (2019) 4. Zengeler, N., Kopinski, T., Handmann, U.: Hand gesture recognition in automotive human– machine interaction using depth cameras. Sensors 19(1), 59 (2019) 5. Tripathi, P., Keshari, R., Ghosh, S., Vatsa, M., Singh, R.: AUTO-G: gesture recognition in the crowd for autonomous vehicle. In: IEEE Explorer, International Conference on Image Processing (ICIP), pp. 3482–3486. Taipei, Taiwan (2019) 6. Rubin Bose, S., Sathiesh Kumar, V.: Hand gesture recognition using faster R-CNN inception V2 model. In: AIR 2019: Proceedings of the Advances in Robotics 2019, ACM digital library, no. 19, pp. 1–6, July 2019 7. Rubin Bose, S., Sathiesh Kumar, V.: Efficient inception V2 based deep convolutional neural network for real-time hand action recognition. IET Image Process. 14(4), 688–696 (2020) 8. Zhengjie, W., et al.: Hand gesture recognition based on active ultrasonic sensing of smartphone: a survey. IEEE Access 7, 111897–111922 (2019). https://doi.org/10.1109/ access.2019.2933987 9. Chaman, S., D’souza, D., D’mello, B., Bhavsar, K., D’souza, J.: Real-Time hand gesture communication system in hindi for speech and hearing impaired. In: Second International Conference on Intelligent Computing and Control Systems (ICICCS), vol. 2018, pp. 1954– 1958. Madurai, India (2018). https://doi.org/10.1109/iccons.2018.8663015 10. Avola, D., Bernardi, M., Cinque, L., Foresti, G.L., Massaroni, C.: Exploiting recurrent neural networks and leap motion controller for the recognition of sign language and semaphoric hand gestures. IEEE Trans. Multimedia 21(1), 234–245 (2019) 11. Wu, D., Pigou, L., Kindermanz, P.J., et al.: Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38(8), 1583– 1597 (2016) 12. Ji, P., Song, A., Xiong, P., Yi, P., Xu, X., Li, H.: Egocentric-vision based hand posture control system for reconnaissance robots. J. Intell. Rob. Syst. 87(3), 583–599 (2016). https:// doi.org/10.1007/s10846-016-0440-2 13. Nguyen, V.D., Chew, M.T., Demidenko, S.: Vietnamese sign language reader using intel creative senz 3D. In: IEEE Proceedings of the 6th International Conference on Automation, Robotics and Applications, no. 2, pp. 17–19 (2015) 14. Liau, H., Nimmagadda, Y., Wong, Y.L.: Fire SSD: wide fire modules based single shot detector on edge device. arXiv: 1806.05363, [cs. CV], 11 Dec 2018
15. Ning, C., Zhou, H., Song, Y., Tang, J.: Inception single shot multibox detector for object detection. In: ICME (2017) 16. Yuxi, L., Li, J., Lin, W., Li, J.: Tiny-DSOD: lightweight object detection for resourcerestricted usages. In: Proceedings of British Machine Vision Conference (2018) 17. Cruz, S.R., Chan, A.B.: Hand detection using zoomed neural networks. In: Ricci, E., Rota Bulò, S., Snoek, C., Lanz, O., Messelodi, S., Sebe, N. (eds.) ICIAP 2019. LNCS, vol. 11752, pp. 114–124. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30645-8_11 18. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. arXiv preprint, vol. 1612 (2016) 19. Pramod Kumar, P., Vadakkepat, P., Poh, L.A.: The NUS hand posture datasets II. ScholarBank@NUS Repository. [Dataset] (2017). https://doi.org/10.25540/AWJS-GMJB 20. Memo, A., Minto, L., Zanuttigh, P.: Exploiting silhouette descriptors and synthetic data for hand gesture recognition. STAG: Smart Tools and Apps for Graphics (2015) 21. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016) 22. Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv: 1609.04747, https://arxiv.org/abs/1609.04747 (2017) 23. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Netw. 12 (1), 145–151 (1999) 24. Tieleman, T., Geoffrey, H.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. Coursera: Neural Netw. Mach. Learn. 4(2), 26–31 (2012) 25. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412. 6980, https://arxiv.org/abs/1412.6980 (2014) 26. Qiu, X., Zhang, S.: Hand detection for grab-and-go groceries. In: Stanford University Course Project Reports—CS231n Convolutional Neural Network for Visual Recognition. http:// cs231n.stanford.edu/reports.html. Accessed 28 Nov 2017
Human Gait Abnormality Detection Using Low Cost Sensor Technology Shaili Jain and Anup Nandy(B) Machine Intelligence and Bio-Motion Laboratory, Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Odisha 769008, India [email protected]
Abstract. Detection of gait abnormality is becoming a growing concern in different neurological and musculoskeletal patients group including geriatric population. This paper addresses a method of detecting abnormal gait pattern using deep learning algorithms on depth Images. A low cost Microsoft Kinect v2 sensor is used for capturing the depth images of different subject’s gait sequences. A histogram-based technique is applied on depth images to identify the range of depth values for the subject. This method generates segmented depth images and subsequently median filter is used on them to reduce unwanted information. Multiple 2D convolutional neural network (CNN) models are trained on segmented images for pathological gait detection. But these CNN models are only restricted to spatial features. Therefore, we consider 3D-CNN model to include both spatial and temporal features by stacking all the images from a single gait cycle. A statistical technique based on autocorrelation is applied on entire gait sequences for finding the gait period. We achieve a significant detection accuracy of 95% using 3D-CNN model. Performance evaluation of the proposed model is evaluated through standard statistical metrics. Keywords: Gait abnormality · Microsoft kinect sensor · Depth image · Convolutional neural network · Pathological gait
1 Introduction
In today's world, biometrics has been integrated into every possible aspect of life, from clinical applications to security systems. Human gait has appeared to be a significant indicator of health condition. Gait investigation is thus useful to acquire important information regarding the progression of different neurological diseases such as Parkinson's [1] or diabetes [2]. By tracking and analyzing this gait information, an early indication of sickness can be detected, which can assist patients with finding the best possible solution. There are mainly two approaches for analyzing gait patterns: wearable sensors [6,7] and non-wearable sensors [8]. Non-wearable sensor-based techniques can be further divided into two categories: marker-based and marker-less
approaches. The advantage of the marker-less approach is that there is no necessity of direct body contact with the subject; thus data can be acquired at a farther distance [3]. Among the marker-less approaches, the Microsoft Kinect sensor has attracted many researchers in gait analysis because of its cost effectiveness and minimal setup requirement [4]. Utilizing the Kinect depth image sequence is better for creating an appearance-based gait model, since it can incorporate more information than basic grayscale-based methods [5]. The aim of this research is to create a model for pathological gait detection through understanding depth images of a subject's gait sequences. These images are captured using a single Kinect sensor placed perpendicular to the subject's left side. Subjects are asked to simulate the equinus gait pattern for collecting abnormal gait patterns. One of the potential utilities of the method is automatic feature learning, and this system can be used for finding patients with equinus deformity. It is a very common feature in cerebral palsy patients, so it is useful for deciding abnormality. The importance of our method is that the existing method used by clinicians is highly subjective and prone to error, whereas our method quantitatively assesses gait dynamics and produces more reliable results. It is also a cost-effective setup for gait abnormality detection. We propose an algorithm for image segmentation that can be divided into three steps. In the first step we calculate all the second-order maxima. These maxima are the peaks of the mounds in the histogram, where every mound represents an object. A slice is created around the mound to find the depth range of the object. This slice is converted into a mask for the image, and after this median filtering is done. At the end we calculate the range of x and y values for the person, and a rectangular box is created with this range to extract the person. This process generates a unique segmentation of an object. A robust 3D-CNN architecture is applied for detection of abnormal gait patterns. The analysis of related works is described in Sect. 2. The data collection process is explained in Sect. 3. Section 4 presents the image segmentation algorithm and classification model. The experimental results and discussions are described in Sect. 5. This work is concluded by providing a possible future work direction in Sect. 6.
2 Related Work
Many researchers applied machine learning algorithms such as logistic regression [11], support vector machine (SVM) [12], hidden markov model (HMM) [10] and clustering [9] to detect abnormal gait patterns with kinect skeletal data. These methods were related to model-based approach for human gait analysis. Whereas model-free approaches focused on silhouettes shapes or the complete motion of human bodies [13]. The advantage of model-free approaches was lower computational costs as compared to model-based approaches which motivates us to carry out this research in this direction.
A method of behavioural abnormality detection was proposed by [14] through extracting the activity silhouette of the subjects. It was further compared with a base model which was constructed by analyzing gait patterns of multiple persons. RGB images were used for detecting abnormal patterns [15]. Those depth images were applied on many gait related applications but none of the researchers used them directly for abnormality detection yet. The depth video-based gait recognition was done using Deep learning methods such as CNN method after extraction of local directional pattern features from the depth silhouette images [16]. A novel approach was proposed by [17] for human identification in depth images using histogram analysis. Kinect (v2) offers five different data streams [19] out of which skeletal data stream are used extensively in clinical purposes [20,21] for its simplicity to track joint positions directly. Vipani et al. [22] also used logistic regression for classification of healthy and pathological subjects. Kozlowska et al. [23] used MARS model to investigate the trends in spatial and temporal gait parameters during treadmill walking. This concept has been used by S. Chakraborty et al. [24] in analysing non-linear data, to detect gait pathology.
3 Data Collection Procedure
A single kinect sensor is placed perpendicular to subject’s left side. The position of kinect is 180 cm horizontally from the treadmill and 94 cm vertically from ground. Ten physically fit young subjects (age (years): 24.3 ± 2.45 and height (cm): 163.29 ± 8.72 , sex: 4 male & 6 female) are chosen for data acquisition. For abnormal gait detection subjects are asked to simulate the Equinus gait patterns. It is a very common foot deformity in cerebral palsy patients. Equinus gait pattern can be explained by ankle plantar flexion throughout the complete gait cycle. The videos are captured at 3 km/h treadmill speed for 50 s time period. This dataset has a total of 4k depth images of size 311×161×3 which is an image with 3 channels and it’s 2D projection with 60:40 normal and abnormal gait ratio. We are using gray-scale depth images so the pixel intensity varies from 0–255. By using histogram analysis, we are only analysing the frequency of this 256 intensity levels. The segmented images of 60% subjects are taken as training set, 20% for validation set and rest 20% for testing set. We use a system with a NVIDIA’s GeForce Titan XP GPU with 12 GB RAM for implementation of deep learning algorithm on gait data.
4 Proposed Method
This section explains the method for pathological gait detection. We collect depth video of different normal and abnormal subjects using single Kinect sensor. The gait sequences are extracted from the videos for further processing. The extracted gait frame, depicted in Fig. 2a contains a lot of unnecessary information. We apply histogram analysis method [17] for detecting region of interest of the image. This technique provides a benefit of analyzing only array of 256
numbers instead of the complete depth image, which reduces the computational cost. We use gray-scale depth images, so the pixel intensity values range from 0 to 255. Using histogram analysis, we analyze the frequency of these 256 intensity levels. Algorithm 1 describes the procedure for image segmentation.
Algorithm 1. Algorithm for Image Segmentation (IS)
INPUT: inpimg = Depth image of size L × M × N
OUTPUT: oimg = Segmented depth image
1: procedure IS(inpimg)
2:   hist = histogram(inpimg)
3:   hist_max = second_order_maximas(hist)
4:   for i = 1:length(hist_max) do
5:     slice[i] = hist_max[i], t = 0
6:     while (hist[hist_max[i] - t] > hist[hist_max[i] - t - 1]) and (hist[hist_max[i] + t] > hist[hist_max[i] + t + 1]) do
7:       slice[i].append([hist_max[i] - t - 1, hist_max[i] + t + 1])
8:       t = t + 1
9:   mask = slice[1]
10:  for each point p in inpimg do
11:    if p is not in mask then
12:      p = 0
13:  img_filt = median_filter(inpimg)
14:  Initialize sumlist as an empty array of length L
15:  for x = 1:L do
16:    sumlist[x] = sum(img_filt[x, :, 1])
17:  ly_range = first_nonzero(sumlist)
18:  uy_range = first_valley(sumlist)
19:  Initialize sumlist as an empty array of length M
20:  for y = 1:M do
21:    sumlist[y] = sum(img_filt[:, y, 1])
22:  lx_range = first_nonzero(sumlist)
23:  ux_range = first_valley(sumlist)
24:  oimg = img_filt[lx_range:ux_range, ly_range:uy_range, :]
25:  Return oimg
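A compact Python sketch of the same idea is given below. It simplifies Algorithm 1 by taking the dominant non-zero histogram peak as the subject's depth mound and by using projection profiles for the bounding box; the peak-selection rule and filter size are assumptions, not the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import median_filter

def segment_depth_frame(img):
    """Histogram-based foreground extraction for an 8-bit grey-scale depth frame."""
    hist = np.bincount(img.ravel(), minlength=256)
    peak = int(np.argmax(hist[1:]) + 1)          # dominant non-zero depth mound (assumed to be the subject)
    lo = hi = peak
    while lo > 0 and hist[lo - 1] <= hist[lo]:   # walk down both sides of the mound to get the depth slice
        lo -= 1
    while hi < 255 and hist[hi + 1] <= hist[hi]:
        hi += 1
    mask = (img >= lo) & (img <= hi)             # keep only pixels inside the depth slice
    fg = np.where(mask, img, 0)
    fg = median_filter(fg, size=3)               # suppress isolated noisy pixels
    rows = np.where(fg.sum(axis=1) > 0)[0]       # projection profiles give the bounding box
    cols = np.where(fg.sum(axis=0) > 0)[0]
    return fg[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
```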
Algorithm 1 is divided into 3 steps. In the first step, we calculate all the second order maximums. These maximums are the peak of the mounds in the histogram where every mound represents an object. A slice is created around the mound to find the depth range of the object. This slice is converted into mask for the image and after this median filtering is done. At the end we calculate the range of x and y values for the person and a rectangular box is created with this range to extract the person. The histogram of the image (Fig. 2a) is shown in Fig. 1a where all the local maximums are marked by outlined circles and all the second order maximums are marked by filled circles. It demonstrates the
Fig. 1. a) Depth image’s histogram. b) Depth image’s histogram with slices marked
Fig. 2. a) Original depth image. b) Filtered image. c) Segmented image
Fig. 3. CNN architecture
histogram plot of the image where X axis represents the pixel intensities (0– 255) and Y axis represents the frequency of this pixel intensities in the image. Foreground subtraction is done by analyzing the pixel intensities and finding the range of depth values where the person lies. Pixel intensities along Y axis is summed up and plotted to find in which pixel range (in X axis) person lies
and same is repeated along X axis so that we obtain all the 4 coordinates of a rectangular box which perfectly fits the person. This range is calculated for all the images in the cycle and their union is taken to cover the complete human motion in the segmented image. The depth image’s histogram with marked slices is illustrated in Fig. 1b. We apply median filtering technique to produce better noise-free results (Fig. 2b). Since the subject and treadmill are at same depth so histogram analysis is not able to remove the handrail from the image. Therefore, we analyze the human shape while walking on treadmill through creating an appropriate rectangular box around the subject. The final segmented output image is presented in Fig. 2c. We apply CNN models for automatic extraction of gait signature and detection for abnormal gait patterns. It is an efficient deep learning technique to process high dimensional data such as images and videos. It has the ability to capture important features without human intervention. The proposed CNN architecture for detecting abnormal gait patterns is illustrated in Fig. 3. This architecture has four 2D convolution and four max-pooling layers with ReLU activation function. This activation function is used for adding the non-linearity in the model. A sigmoid activation function is also used at the last layer for classification. The drawback of this 2D-CNN model includes inability to establish relationship between consecutive frames. Therefore, this model only deals with the spatial features. Since the gait signal carries spatio-temporal information so it is required to consider both the features. To capture the temporal information of gait signal we create stacked cycles by stacking all the images from a single gait cycle and train them with 3D-CNN model. In order to find a single gait cycle, we apply autocorrelation technique on entire gait sequences [18]. It computes correlation coefficients between the first frame and all the subsequent frames. The number of frames in between two successive peaks is taken as gait period. For our dataset the resulting gait period is 20 frames/cycle. After creating stacked cycles, we use a 3D-CNN architecture for classification of abnormal gait patterns. The architecture of this model has the same set of layers and number of filters as shown in Fig. 3. The only few differences are instead of 2D convolution, 3D convolution is used and the convolution and pooling layer filers are changed to (3 × 3 × 1) and (2 × 2 × 1) respectively. In CNN architecture Fig. 3 the size of the depth image is (311×161×3) in input layer.
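The gait-period estimation described above can be sketched as follows; flattening each frame into a vector and the simple peak-picking rule are assumptions made for illustration.

```python
import numpy as np

def gait_period(frames):
    """frames: sequence of same-sized segmented depth frames from one walking sequence."""
    ref = frames[0].astype(np.float32).ravel()
    corr = np.array([np.corrcoef(ref, f.astype(np.float32).ravel())[0, 1] for f in frames])
    # local maxima of the correlation signal mark frames whose pose is similar to frame 0
    peaks = [t for t in range(1, len(corr) - 1) if corr[t] > corr[t - 1] and corr[t] > corr[t + 1]]
    # the number of frames between two successive peaks is taken as the gait period
    return peaks[1] - peaks[0] if len(peaks) >= 2 else len(frames)
```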
5 Result Analysis and Discussion
The objective of this work is to detect abnormal gait patterns using a low-cost Kinect device. Depth videos are captured using a single Kinect sensor. The image segmentation algorithm is applied to extract the foreground subject. These foreground images are given as input to the CNN model. Multiple CNN models are implemented with varying numbers of layers and filters to obtain the optimal CNN model. The detection accuracies of these models are given in Table 1. We obtain the best result (94.3%) for the CNN model having 4 convolution layers and (32,
32, 64, 64) filters. The detailed architecture of this model is presented in Fig. 3. We use 60% of the dataset as the training set, and the remaining 40% is equally divided into validation and testing sets. The model accuracy and loss for the training and validation datasets are shown in Fig. 4 and Fig. 5, respectively. The loss function for this work is binary cross-entropy.
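Assuming a Keras implementation, the 2D-CNN described above (input 311 × 161 × 3, four convolution plus max-pooling blocks with the stated filter counts, ReLU activations and a sigmoid output trained with binary cross-entropy) could be sketched as below; the 3 × 3 kernel size and the absence of intermediate dense layers are assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(311, 161, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"), tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"), tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"), tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"), tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # normal vs. abnormal gait
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=120)
```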
Fig. 4. Training and validation accuracy per epoch for 2D-CNN model
It is observed from this experiment that the 2D-CNN is not capable of extracting temporal features. Therefore, we combine all the frames collected from a single gait cycle and train a 3D-CNN model on them to extract temporal features. After training, this model is used for detecting abnormal gait, and it produces 95% detection accuracy, which is slightly higher than the accuracy achieved by the 2D-CNN. The model accuracy and loss per epoch for the training and validation datasets are shown in Fig. 6 and Fig. 7, respectively. It is clearly visible from these graphs that the 2D-CNN takes around 120 epochs for convergence, whereas the 3D-CNN takes 60 epochs. It can be inferred from this analysis that the 3D-CNN requires less training time to produce the result. The validation loss for the 2D-CNN model (Fig. 5) does not converge properly, which indicates that the 3D-CNN model is computationally more efficient than the 2D-CNN model. The performance of these models is measured using standard statistical metrics, namely precision, recall, F1 score and detection accuracy, which are depicted in Table 2. It is observed that the 3D-CNN model outperforms the 2D-CNN model. We also examine model performance through the receiver operating characteristic (ROC) curve, which is illustrated in Fig. 8. The area under the ROC curve for the 3D-CNN is 95%, which clearly demonstrates its efficiency for detection of abnormal gait patterns.
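The reported metrics can be reproduced from the test-set predictions with scikit-learn; the 0.5 threshold for binarising the sigmoid output and the label encoding are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def report_metrics(y_true, y_prob, threshold=0.5):
    # y_prob: sigmoid outputs of the network; y_true: 1 = abnormal gait, 0 = normal gait (assumed encoding)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_prob),   # area under the ROC curve (Fig. 8)
    }
```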
Fig. 5. Training and validation loss per epoch for 2D-CNN model
Table 1. Detection accuracy by varying no. of convolution layers and filters

No. of Conv layers  No. of filters        Testing accuracy (%)
2                   (32, 64)              89
3                   (32, 32, 64)          92.6
3                   (32, 64, 64)          93.6
4                   (32, 32, 64, 64)      94.3
5                   (16, 32, 32, 64, 64)  94.1
Fig. 6. Training and validation accuracy per epoch for 3D-CNN model
Fig. 7. Training and validation loss per epoch for 3D-CNN model

Table 2. Performance evaluation metric for CNN models

Model    Testing accuracy (%)  Precision  Recall  F1 score
2D-CNN   94.3                  0.986      0.907   0.944
3D-CNN   95                    1.00       0.909   0.952
Fig. 8. ROC curve for CNN models
6 Conclusion and Future Work
Microsoft kinect v2 sensor along with deep learning techniques are used for detection of pathological human gait patterns. The equinus foot deformity has been simulated by the subjects. In this work the 3D-CNN model has been found to be more suitable in comparison to 2D-CNN model. The detection accuracy
achieved by 3D-CNN model is 95% which is better than the 2D-CNN model. The future work can be extended to include actual abnormal gait data and different types of pathological patients for detection of clinical gait abnormality. We also plan to apply k-fold cross validation method to demonstrate the robustness of our proposed model and compare with state-of-the-art methods. Currently, the aim of the research is identifying abnormal gait patterns using low cost sensing technology to measure the efficiency of Microsoft Kinect device for clinical gait analysis. The comparison with other models is planned in future research work. Acknowledgments. We would like to acknowledge NVIDIA Corporation for providing GeForce Titan Xp GPU card to carry out our research. We would also like to be thankful to all the participants for contributing their gait pattern in this research work.
References 1. Keijsers, N.L.W., Horstink, M.W., Gielen, S.C.: Ambulatory motor assessment in Parkinson’s disease. Mov. Disord. Off. J. Mov. Disord. Soc. 21(1), 34–44 (2006) 2. Hodgins, D.: The importance of measuring human gait. Med. Device Technol. 19(5), 42–44 (2008) 3. Ismail, A.P.: Gait analysis and classification using front view Markerless model (2018) 4. Eltoukhy, M., Jeonghoon, O., Kuenze, C., Signorile, J.: Improved kinect-based spatiotemporal and kinematic treadmill gait assessment. Gait Posture 51, 77–83 (2017) 5. Iguernaissi, R., Merad, D., Drap, P.: People counting based on kinect depth data. In: ICPRAM, pp. 364–370 (2018) 6. Mannini, A., Trojaniello, D., Cereatti, A., Sabatini, A.: A machine learning framework for gait classification using inertial sensors: application to elderly, post-stroke and huntington’s disease patients. Sensors 16(1), 134 (2016) 7. Cola, G., Avvenuti, M., Vecchio, A., Yang, G.-Z., Lo, B.: An on-node processing approach for anomaly detection in gait. IEEE Sens. J. 15(11), 6640–6649 (2015) 8. Tucker, C.S., Behoora, I., Nembhard, H.B., Lewis, M., Sterling, N.W., Huang, X.: Machine learning classification of medication adherence in patients with movement disorders using non-wearable sensors. Comput. Biol. Med. 66, 120–134 (2015) 9. Manca, M., Ferraresi, G., Cosma, M., Cavazzuti, L., Morelli, M., Benedetti, M.G.: Gait patterns in hemiplegic patients with equinus foot deformity. BioMed Res. Int. (2014) 10. Nguyen, T.-N., Huynh, H.-H., Meunier, J.: Skeleton-based abnormal gait detection. Sensors 16(11), 1792 (2016) 11. Vipani, R., Hore, S., Basak, S., Dutta, S.: Gait signal classification tool utilizing Hilbert transform based feature extraction and logistic regression based classification. In: 2017 Third International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pp. 57–61 (2017) 12. Chen, M., Huang, B., Xu, Y.: Intelligent shoes for abnormal gait detection. In: 2008 IEEE International Conference on Robotics and Automation, pp. 2019–2024 (2008)
13. Arai, K., Asmara, R.A.: 3D skeleton model derived from kinect depth sensor camera and its application to walking style quality evaluations. Int. J. Adv. Res. Artif. Intell. 2(7), 24–28 (2013) 14. Wang, C., Wu, X., Li, N., Chen, Y.L: Abnormal detection based on gait analysis. In: 2012 10th World Congress on Intelligent Control and Automation (WCICA), pp. 4859–4864 (2012) 15. Charisis, V., Hadjileontiadis, L.J., Liatsos, C., Mavrogiannis, C.C., Sergiadis, G.D.: Abnormal pattern detection in wireless capsule endoscopy images using nonlinear analysis in RGB color space. In: 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology, pp. 3674–3677 (2010) 16. Uddin, M.Z., Khaksar, W., Torresen, J.: A robust gait recognition system using spatiotemporal features and deep learning. 2017 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), pp. 156–161 (2017) 17. Ferreira, L., Neves, A., Pereira, A., Pedrosa, E., Cunha, J: Human detection and tracking using a Kinect camera for an autonomous service robot. In: Advances in Aritifical Intelligence-Local Proceedings, EPIA, pp. 276–288 (2013) 18. Arai, K., Asmara, R.A.: A speed invariant human identification system using gait biometrics. Int. J. Comput. Vis. Rob. 4(1–2), 3–22 (2014) 19. M¨ uller, B., Ilg, W., Giese, M.A., Ludolph, N.: Validation of enhanced kinect sensor based motion capturing for gait assessment. PloS One 12(4), e0175813 (2017) 20. Bei, S., Zhen, Z., Xing, Z., Taocheng, L., Qin, L.: Movement disorder detection via adaptively fused gait analysis based on kinect sensors. IEEE Sens. J. 18(17), 7305–7314 (2018) 21. Dolatabadi, E., Taati, B., Mihailidis, A.: An automated classification of pathological gait using unobtrusive sensing technology. IEEE Trans. Neural Syst. Rehabil. Eng 25(12), 2336–2346 (2017) 22. Vipani, R., Hore, S., Basak, S., Dutta, S.: Gait signal classification tool utilizing hilbert transform based feature extraction and logistic regression based classification. In: 2017Third International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), pp. 57–61. IEEE (2017) 23. Kozlowska, K., Latka, M., West, B.J.: Significance of trends in gait dynamics. bioRxiv (2019) 24. Chakraborty, S., Jain, S., Nandy, A., Venture, G.: Pathological gait detection based on multiple regression models using unobtrusive sensing technology. J. Signal Process. Syst. 93(1), 1–10 (2021). https://doi.org/10.1007/s11265-020-01534-1
Bengali Place Name Recognition: Comparative Analysis Using Different CNN Architectures
Prashant Kumar Prasad1(B), Pamela Banerjee1, Sukalpa Chanda2, and Umapada Pal3
1 RCC Institute of Information Technology, Kolkata, India
2 Department of Information Technology, Østfold University College, Halden, Norway
[email protected]
3 Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
[email protected]
Abstract. Optical Character Recognition (OCR) has been deployed in the past in different application areas such as automatic transcription and indexing of document images, reading aid for the visually impaired persons, postal automation etc. However, the performance in many cases has not been impressive due to the fact that character segmentation is itself an error-prone and difficult operation, which leads to the poor performance of the system due to erroneous segmentation of characters. Hence, for many applications (like document indexing, Postal automation) where full character-wise transcription is not required, word recognition is the preferred method these days. This article investigates recognition of Bengali place names as word images using 5 different traditional architectures. Experiments on word images (of Bengali place names) from 608 classes were conducted. Encouraging results were obtained in all instances. Keywords: Bengali word image recognition · Bengali postal address recognition · Bengali script · Bengali word image dataset
1 Introduction
Deep learning-based techniques have been extremely successful in different spheres of computer vision problems. If trained with adequate data, the performance of a deep learning-based method is far ahead of any hand-crafted feature-based approach. One must not ignore the fact that such fascinating performance does not come for free; rather, it depends on extensive training of the network, which demands time and resources. (The dataset will be available on request; please contact the first author via email.) There are plenty of CNN architectures available now
which show impressive results on different types of classification problems. Handwriting recognition is not an exception. However, it is worth mentioning that the current trend in handwritten text recognition (HTR) advocates recognition of words rather than recognition of individual characters. There are mainly two reasons for this: (a) in many applications a character-wise transcription is not required, for example, document indexing and retrieval; (b) in many scripts handwritten text comes in a cursive form which demands segmentation of the characters before they are classified/recognized, but character segmentation itself is a difficult problem, and hence an error-prone character segmentation will lead to a bad character recognition system. To evade this character segmentation issue, current HTR techniques mostly focus on word recognition or word spotting. Deep learned techniques have been widely utilized in the recent past for word recognition [1]. Apart from document indexing and retrieval, word recognition could aptly be used for postal address recognition as well; the idea would be to recognize place names in the address. This study aims to investigate the performance of different CNN architectures for Bengali place name recognition, posed as a Bengali word image recognition problem. This article also proposes a new dataset of Bengali word images that comprises 6384 samples from 608 classes (prior to data augmentation). The rest of the article is organized as follows: (a) in Sect. 2, related work is discussed; (b) Sect. 3 is on the dataset details and data procuring strategy; (c) Sect. 4 discusses the motivation and methodology; (d) Sect. 5 is on experimental results and discussions; (e) the conclusion and future work follow in Sect. 6.
2 Related Work
Deep learning-based networks like VGG [2], ResNet [3], Inception [4], Xception [5] and MobileNet [6] are state-of-the-art architectures that have shown impressive performance in different image classification problems like object recognition. Recognizing word images is difficult compared to object recognition, since objects have many additional attributes like colour and texture. Nevertheless, deep learning techniques have been used in word recognition and word spotting as well. Sharma et al. [7] proposed a method where a pre-trained CNN is used to perform word spotting. Sudholt et al. [8] proposed a novel CNN architecture for word spotting where the network is trained with the help of a Pyramidal Histogram of Characters (PHOC) representation; this work used contemporary as well as historical document images in its experiments. The proposed system can be customized as a "Query By Example" (QBE) or "Query By String" (QBS) based system. Another deep learning-based approach for Arabic word recognition is due to [9]. The author suggested an architecture that has a multidimensional recurrent neural network (MDRNN) and multidimensional LSTM (MDLSTM) in the hidden layer and "Connectionist Temporal Classification" (CTC) in the output layer. The architecture stacked MDLSTM and feed-forward networks with a
stride of 4. An experiment is carried out on ICDAR 2007 Arabic handwriting recognition contest data-set with 91.4% accuracy and claims better result than the winner of Arabic handwriting recognition contest. Another work is proposed by [10], the proposed method uses a deep recurrent neural network (RNN) and a statistical character language model to attain high accuracy in terms of word spotting and word indexing. The proposed method performed well under adverse conditions. A study by Chanda et al. [11] is on recognition of medieval word images written in Roman scripts. This paper extract deep learned features from the fully connected layer of ALexNet and classify those features using regular classifiers like SVM and K-NN. The proposed method can counter the issue of limited annotated data faced by a traditional Deep neural network by using off-line data augmentation techniques. Word recognition in a one-shot learning framework and Zero-Shot learning framework has been investigated in [12,13] respectively. In [12], authors proposed a modified Siamese network to classify the text document in word level in a oneshot learning framework. In [13] the authors proposed a novel method that can classify completely unseen class images with good accuracy. Another work is carried out on Handwritten Text Recognition in Medieval Manuscripts by [14]. The author made a search engine for handwriting text on 83000 pages. The document was written in English/Roman script. Also, the work consists of an “optical model”, capable of dealing with the variability and abbreviations in medieval, multilingual handwritings.
3 Motivation and Data Set Details
Bengali script is popular mainly in eastern India, Bangladesh, and some parts of Myanmar. This script is widely used to write the Bengali language. Bengali script has originated from the Brahmi script. There are 50 alphabets in the Bengali script, constituting 11 vowels and 39 consonants, and 12 matras (a matra is a sign used with a consonant alphabet of the script) for each consonant. Thus, the total number of symbols may reach ≈300. A new dataset for Bengali word images is introduced here. Keeping in mind the possible usage of a word recognition system for postal address recognition, the word images are different place names of West Bengal (India). A data collection form as shown in Fig. 1 (left) was distributed amongst different volunteers. We distributed 1 to 5 sheets per volunteer to get the handwriting samples. Overall we collected word images of 608 classes containing 6384 samples. 68 people actively participated in this sample collection. A python script cropped and labelled the collected data.
3.1 Data Collection
A form is generated as shown in Fig. 1 (left) to collect handwriting samples from the volunteers; each volunteer can give a maximum of 3 samples for each word
class, with a maximum of 40 classes. Each generated form contains 8 classes. We made five copies of the same form so that we can collect 15 samples per class. Thus, we generated 5 copies of 76 forms to print 608 classes. There are 3-15 samples in each class. The collected samples may contain external noise like spot marks, haze, and other environmental noise that may decrease the quality of the image. The colour of the ink also makes a difference. We converted all images into grey-scale during scanning. Most of the data preprocessing is done using an automated script; however, in every part a human verification was also done to reduce errors. A mathematical aspect of the preprocessing was already discussed in Chap. 3. As shown in Fig. 1, cropping text from an OMR (optical mark reader) sheet involves three steps: we scan the OMR sheet, then crop the white-space, and then resize the cropped sample image.
3.2 Getting Text from OMR
A python script was used to detect the position in the scanned OMR sheet, the script searches for the circular black mark to identify the left column and rightmost column, those are denoted by h1 and h4 , to determine the two intermediate columns the script looks for 2 consecutive black vertical marks and those are denoted as h2 and h3 . The scripts identifies the points k1 , k2 , k3 , k4 , k5 , k6 , k7 , k8 marked at y-axis by searching for two consecutive horizontal black marks. Now from the image in Fig. 1 it can be noted that from the raw image, a rectangle box (x1 , y1 ), (x2 , y2 ), (x2 , y1 ) and (x1 , y2 ) can be identified by detecting horizontal and vertical marking. h2 and h3 is located in black vertical marking and h1 and h4 is taken from OMR starting and ending position, which is identified by searching curricular mark. To identify vertical positions k1 , k2 , k3 , k4 , k5 , k6 , k7 , k8 the script searched for horizontal mark in the OMR sheet. Thus each intersection point between horizontal and vertical marks is generalised as (hi , kj ) ,where i = 1, 2, 3 and j = 1, 2...8. The script will start from top right position of the OMR sheet i.e. (h0 k0 ) and end at bottom left position i.e. (h3 , k8 ) for each position hi , kj the script will crop the region bounded by (hi , kj ), (hi+1 , kj + Ch ), (hi , kj + Ch ) and , (hi+1 , kj ), where Ch is the height of each grid cell for the sake of simplicity we renamed the boundary as (x1 , y1 ), (x2 , y2 ), (x1 , y2 ) and (x2 , y1 ) respectively. For example, in Fig. 1, we can notice i and j both are 3 so the script will extract the third handwritten sample of the third class in that sheet. Thus the cropped rectangular cell boundary will be (h3 , k3 ), (h4 , k3 + Ch ), (h3 , k3 + Ch ) and, (h4 , k3 ). A grid cell may have a big portion of white space around 4 sides of the text. To trim that unnecessary white space histogram projection profile is used to determine the bounding box of text. From the vertical and horizontal histogram profile, we can identify the bounding box of text present inside every grid cell. The green rectangular mark at the top in Fig. 1 shows the boundary made by the histogram profile.
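The white-space trimming step based on horizontal and vertical projection profiles can be sketched as follows; a grey-scale cell image with dark ink on a light background is assumed, and the binarisation threshold is illustrative.

```python
import numpy as np

def trim_whitespace(cell, ink_thresh=200):
    """Crop a grid cell to the bounding box of the handwritten text using projection profiles."""
    ink = cell < ink_thresh                     # True where the pen stroke is (dark pixels)
    rows = np.where(ink.sum(axis=1) > 0)[0]     # horizontal projection profile
    cols = np.where(ink.sum(axis=0) > 0)[0]     # vertical projection profile
    if rows.size == 0 or cols.size == 0:
        return cell                             # empty cell: nothing to crop
    return cell[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
```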
Fig. 1. Cropping data from the scanned document
3.3 Data Augmentation
Since the number of samples per class is too small for a learning algorithm, we augmented our dataset to obtain the desired number of samples. We use the elastic morphing technique: every pixel (i, j) of the original image receives a random displacement (Δx, Δy) in a random direction, and the displacement field is smoothed using a Gaussian convolution kernel with standard deviation σ. Using the displaced coordinates (î = i + Δx, ĵ = j + Δy), the new morphed image is generated. For every class we generated 500 samples, of which 300 were used for the training set, 100 for the validation set and 100 for testing, and we made 5 folds of data with different images. An augmented image looks more realistic if channel corrections such as brightness and contrast adjustments and rotation of the training image are applied; hence we applied a random rotation in the range ±10° to the morphed image and scaled the brightness by a factor ranging from 0.7 to 1.2.
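A compact sketch of the elastic morphing step is given below; the displacement amplitude, the value of σ and the interpolation order are assumed values, and the rotation and brightness changes described above would be applied to the result afterwards.

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_morph(gray_image, alpha=8.0, sigma=4.0, rng=None):
    """Give every pixel (i, j) a random displacement, smooth the
    displacement field with a Gaussian kernel of std 'sigma', and
    resample the grey-scale image at (i + dy, j + dx)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = gray_image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(gray_image, [yy + dy, xx + dx], order=1, mode="reflect")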
4 Methodology
Deep learning-based algorithms are extremely successful in classification tasks such as object classification, text recognition, and speech-to-text conversion. Any generic CNN architecture comprises (a) a first few convolutional and pooling layers for generating features and (b) a few final fully connected layers ending with a softmax layer that performs the classification. For our classification work we used 5 architectures: (i) ResNet, (ii) MobileNet, (iii) Inception Net, (iv) Xception Net and (v) VGG16 Net. These networks use different types of deep CNN architecture that show exceptional performance on image data. For a big dataset that contains a large number of classes and a huge number of samples in each class, these CNN architectures can be trained effectively. We used the five above-mentioned architectures for recognizing Bengali words and compared their performance on different parameters.

Table 1. Number of samples per class in each fold

             Fold 0   Fold 1   Fold 2   Fold 3   Fold 4
Training     182400   182400   182400   182400   182400
Validation    60800    60800    60800    60800    60800
Testing       60800    60800    60800    60800    60800
4.1 Classification
For the classification of handwritten text, CNN architectures such as ResNet152 (V2), VGG16, Inception V3, Xception and MobileNet were used. However, classifying handwritten text is a challenging task, as the same text might appear quite different due to the variation observed in human handwriting. For example, as can be noted in Fig. 2, some people write the "Anusvara" (the character within the red bounding box) as a dot while others make a circle; the same text written in two different styles thus makes classification challenging. On the other hand, some characters in the Bengali script look similar, which might lead to incorrect predictions by the model.
Fig. 2. Different types of handwriting style
Figure 3 depicts the basic schematic diagram of our CNN architectures. Broadly, each consists of an input layer, an output layer, and some convolutional and fully connected layers in the middle. At the input layer, the input shape is 244 × 150, and the size of the output layer is equal to the number of word classes, i.e. 608 in our case.
Fig. 3. Input, output and hidden layer
ResNet: Extremely deep neural networks involve training more parameters and are hence time-consuming. The proponents of ResNet [3] show how adding more layers creates the problem of vanishing gradients. This problem is countered in ResNet: in this architecture, the input of every third convolutional block is connected to its output through a "shortcut connection", which allows extremely deep neural networks (with 150+ layers) to be trained. A detailed view of the architecture is given in Table 2.

Table 2. Standard ResNet 152 V2 design architecture [3]

Layer name   Output size   Bottleneck design
conv 2x      56 × 56       [1 × 1, 64;  3 × 3, 64;  1 × 1, 256]  × 3
conv 3x      28 × 28       [1 × 1, 128; 3 × 3, 128; 1 × 1, 512]  × 3
conv 4x      14 × 14       [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 3
conv 5x      7 × 7         [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3
             1 × 1         Average pool, 608-d, softmax
FLOPs                      11.3 × 10^9
For our architecture we used a 244 × 150 (W × H) crop of each image; we used the SGD optimizer with a learning rate of 0.01 and an effective batch size of 24. For the ResNet architecture, we trained for 7 epochs on every fold. NVIDIA TITAN RTX hardware was used to train our models.

VGG: The VGG network architecture was proposed in ILSVRC [15] for classification of the ImageNet dataset. This network is characterized by using only 3 × 3 convolutional layers stacked on top of each other in increasing depth; reducing the volume size is handled by max pooling. Two fully connected layers, each with 4,096 nodes, are then followed by a softmax classifier. Here also we cropped each handwritten image into a 244 × 150 pixel image. The convolutional layers in VGG use a very small receptive field. There are also 1 × 1 convolution filters which act as a linear transformation of the input and are followed by a ReLU unit. The convolution stride is fixed to 1 pixel so that the spatial resolution is preserved after convolution [2].

MobileNet: MobileNet was designed to execute deep learning on mobile devices, embedded systems and computers without a GPU or with low computational capacity, without compromising significantly in terms of accuracy; it was proposed to reduce the number of parameters to be trained [6, 16]. MobileNet applies a depth-wise convolution with a single filter to each input channel, and a point-wise convolution then applies a 1 × 1 convolution to combine the outputs of the depth-wise convolution. A standard convolution both filters and combines inputs into a new set of outputs in one step. This approach of depth-wise separable convolution has been addressed by many researchers [17–19] in the past. The depth-wise separable convolution splits the operation into two layers, a separate layer for filtering and a separate layer for combining. This factorization reduces computation and model size; the use of depth-wise separable convolutions reduces the computational cost by 8 to 9 times [16] (Table 3).
Table 3. Standard MobileNet V2 design [6]

Layer name                                            Output size   MFLOPS
Input                                                 244 × 150
3 × 3, Conv 32, /2                                    112 × 112     10.8
3 × 3, DWConv 32, /2; 1 × 1, Conv 64                  56 × 56       10.8
3 × 3, DWConv 64, /2; 1 × 1, Conv 128;
3 × 3, DWConv 128; 1 × 1, Conv 128                    28 × 28       20.6
3 × 3, DWConv 128, /2; 1 × 1, Conv 256;
3 × 3, DWConv 256; 1 × 1, Conv 256                    14 × 14       19.9
3 × 3, DWConv 256, /2; 1 × 1, Conv 512;
4 × [3 × 3, DWConv 512; 1 × 1, Conv 512];
3 × 3, DWConv 512; 1 × 1, Conv 1024                   7 × 7         84.7
Average pool                                          1 × 1         1
MobileNet has two hyper-parameters for shrinking and factorizing the network: the width multiplier α and the resolution multiplier ρ. By adjusting the values of α and ρ, we can obtain a smaller network and reduce computation [6].

Inception: The goal of the Inception module is to act as a "multi-level feature extractor" by computing 1 × 1, 3 × 3 and 5 × 5 convolutions within the same layer of the network; the outputs of these filters are then stacked along the channel dimension before being fed into the next layer of the network [4].

Xception: Xception is a convolutional neural network that consists of 71 layers. It is an extension of the Inception architecture which replaces the standard Inception modules with depth-wise separable convolutions. In each block of this architecture, 1 × 1 and 3 × 3 convolutions and a pooling layer are applied and concatenated with the next block; the architecture uses shortcut connections so that it is light and fast [5].
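The depth-wise separable factorization used by MobileNet and Xception can be sketched in Keras as below; the filter count, stride and use of batch normalization here are illustrative choices, not the exact blocks of the cited networks.

from tensorflow.keras import layers

def depthwise_separable_block(x, filters, stride=1):
    """3x3 depth-wise convolution (one filter per input channel)
    followed by a 1x1 point-wise convolution that combines channels."""
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)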
4.2 Training
Using the Keras library, ResNet (for 7 epochs), VGG (for 10 epochs), Inception (for 6 epochs), Xception (for 4 epochs) and MobileNet (for 8 epochs) were trained with stochastic gradient descent (SGD) as the optimizer and sparse categorical cross-entropy as the loss function. After each epoch (i.e. 182400 training + 60800 validation samples), the status of the loss function is plotted in Fig. 4.
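A minimal Keras sketch of this training set-up is given below, using one of the five backbones with a 608-way softmax head, SGD and sparse categorical cross-entropy. The 0.01 learning rate and the 7-epoch schedule follow the values reported in Sect. 4.1; the pooling head, weight initialization and the commented-out dataset names are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 608
INPUT_SHAPE = (150, 244, 3)          # (H, W, C); the paper reports 244 x 150 (W x H) crops

backbone = tf.keras.applications.ResNet152V2(
    include_top=False, weights=None, input_shape=INPUT_SHAPE)
model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=7)   # train_ds/val_ds are placeholders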
5 Experimental Results and Analysis
Experiments were conducted on 5 different CNN architectures, ResNet, MobileNet, Inception V3, Xception Net and VGG16, using a five-fold cross-validation technique. ResNet achieved 97.14%, MobileNet 97.53%, VGG16 96.64%, Inception 98.75% and Xception 98.60% average accuracy on our procured dataset. Detailed prediction results are given in Table 4. The different CNN architectures were also compared with respect to the size of the trained model, average accuracy, number of training epochs and time taken to predict the test images; the observations are depicted in Table 5. From the table we can observe that MobileNet has the smallest size, 23.55 MB, as promised in [6], and VGG16 has the minimum prediction time due to its 16-layer architecture.
5.1 Comparison with Similar Other Work
We compared our results with [20], which also deals with city name recognition for 3 different scripts: Bengali, Hindi and English. The corpus in [20] comprises 4257 samples of Hindi, 8625 samples of Bengali and 3250 samples of English word images from 117 Hindi, 84 Bengali and 89 English classes. The average top accuracy obtained by the method in [20] on Bengali script word images is 94.08%, whereas, dealing with 608 classes, we have obtained more than 97% on average.
Fig. 4. Loss function after each epoch: (a) ResNet152 V2; (b) MobileNet V2; (c) Xception Net; (d) Inception V3; (e) VGG16.
Table 4. Place name prediction result

Network        Result      Fold 0   Fold 1   Fold 2   Fold 3   Fold 4
ResNet152 V2   Correct     58909    58105    59200    59579    59517
               Incorrect    1891     2695     1600     1221     1283
MobileNet V2   Correct     59048    59316    59644    59219    59260
               Incorrect    1752     1484     1156     1581     1540
Xception       Correct     59926    59693    59862    60180    60101
               Incorrect     874     1107      938      620      699
Inception V3   Correct     60251    59918    60180    60192    59656
               Incorrect     585      882      620      608     1144
VGG16          Correct     58501    58064    59316    59042    58854
               Incorrect    2299     2736     1484     1758     1946

Table 5. Comparison between accuracy, training epochs, model size and prediction time

Network        Epochs   Size (MB)   Average accuracy (%)   Prediction time*
ResNet         7        455.62      97.14                  1597.86
Inception      6        176.42      98.75                  1219.29
Xception Net   4        168.89      98.60                  783.57
Mobile Net     8        23.55       97.53                  785.77
VGG16          10       707.42      96.64                  642.48
* For 60800 images in 608 classes.

5.2 Result on Transfer Learning

Deep learning techniques, though very successful in different types of image classification problems, cannot hide the fact that training a deep learning model from scratch on every single dataset is not a viable solution. Training a deep network from scratch on a new dataset requires annotated data samples from the new dataset and, moreover, is time and resource consuming. One feasible solution to the problem is to do transfer learning by freezing the weights of the initial layers and learning the weights of only the top few layers while training the network on a new dataset. Even that set-up needs some annotated data samples from the new dataset. Another approach is to use already learned model weights and extract features from images of a new dataset using that learned model, which has been followed in the current study. To briefly measure the prowess of the learned model in a transfer learning setting, we experimented with a completely new dataset which comprises 270 handwritten city-name word images from 3 classes. Note that the classes of those 270 handwritten images do not belong to the dataset comprising 608 classes; there is no intersection of classes between the two datasets. All the text in the new dataset was written in Bengali script. Using our trained model we extracted features of those new images. After extracting the features, we applied the "leave one out nearest neighbour" algorithm to obtain the nearest match of each image by computing the Euclidean distance. In the above experimental setup, we achieved the accuracy depicted in Table 6.
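The leave-one-out nearest-neighbour evaluation described above can be sketched as follows; here 'features' would hold one deep feature vector per word image (for example, penultimate-layer activations of the trained CNN) and 'labels' the corresponding class ids, both placeholders.

import numpy as np

def leave_one_out_nn_accuracy(features, labels):
    """1-NN leave-one-out matching with Euclidean distance: every
    sample is matched to its nearest neighbour among the remaining
    samples and the neighbour's label is taken as the prediction."""
    X = np.asarray(features, dtype=np.float64)
    y = np.asarray(labels)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                     # a sample cannot match itself
    nearest = d2.argmin(axis=1)
    return float((y[nearest] == y).mean())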
Table 6. Result of transfer learning

Architecture   ResNet152 V2   MobileNet V2   InceptionV3   Xception   VGG16
Correct        239            264            205           259        238
Incorrect      39             6              65            11         32
Accuracy (%)   88.51          97.77          75.92         95.92      88.14
6 Conclusion
This article proposes a new dataset of Bengali word images that comprises 6384 samples from 608 classes. The word images are place names of various locations in the state of West Bengal; thus, such a corpus can be used for word recognition in a postal automation system. Five standard CNN architectures were deployed to compare their performance on this dataset in terms of accuracy, training epochs and prediction speed. This corpus will be made publicly available to other researchers for benchmarking of Bengali word recognition systems in the future. To the best of our knowledge, there exists no other corpus of Bengali word images that consists of 608 word classes.
References 1. Poznanski, A., Wolf, L.: CNN-N-gram for handwriting word recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2305–2314 (2016) 2. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556 3. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385 4. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567 (2015). http://arxiv. org/abs/1512.00567 5. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357 (2016). http://arxiv.org/abs/1610.02357 6. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: Mobilenet v2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018). https://arxiv. org/pdf/1801.04381.pdf 7. Sharma, A., Pramod Sankar, K.: Adapting off-the-shelf CNNs for word spotting & recognition. In: 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Nancy, France, August 23–26, 2015, pp. 986–990 (2015) 8. Sebastian Sudholt and Gernot A. Fink. Phocnet: A deep convolutional neural network for word spotting in handwritten documents. In 15th International Conference on Frontiers in Handwriting Recognition, ICFHR 2016, Shenzhen, China, October 23–26, 2016, pages 277–282, 2016
9. Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21, pp. 545–552 (2009) 10. Bluche, T., et al.: Preparatory KWS experiments for large-scale indexing of a vast medieval manuscript collection in the HIMANIS project. In: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, 9–15 November, 2017, pp. 311–316 (2017) 11. Chanda, S., Okafor, E., Hamel, S., Stutzmann, D., Schomaker, L.: Deep learning for classification and as tapped-feature generator in medieval word-image recognition. In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 217–222 (2018) 12. Chakrapani Gv, A., Chanda, S., Pal, U., Doermann, D.: One-shot learning-based handwritten word recognition. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W.Q. (eds.) ACPR 2019. LNCS, vol. 12047, pp. 210–223. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41299-9 17 13. Chanda, S., Baas, J., Haitink, D., Hamel, S., Stutzmann, D., Schomaker, L.: Zeroshot learning based approach for medieval word recognition using deep-learned features. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 345–350. IEEE (2018) 14. Stutzmann, D., et al.: Handwritten text recognition, keyword indexing, and plain text search in medieval manuscripts (2018) 15. Russakovsky, O., Deng, J., Hao, S., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015) 16. Howard, A.G., et al.: Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017) 17. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017) 18. Liu, S., Long, Yu., Zhang, D.: An efficient method for high-speed railway dropper fault detection based on depthwise separable convolution. IEEE Access 7, 135678– 135688 (2019) 19. Yoo, B., Choi, Y., Choi, H.: Fast depthwise separable convolution for embedded systems. In: Cheng, L., Leung, A.C.S., Ozawa, S. (eds.) ICONIP 2018. LNCS, vol. 11307, pp. 656–665. Springer, Cham (2018). https://doi.org/10.1007/978-3-03004239-4 59 20. Pal, U., Roy, R.K., Kimura, F.: Multi-lingual city name recognition for Indian postal automation. In: 2012 International Conference on Frontiers in Handwriting Recognition, ICFHR 2012, Bari, Italy, 18–20 September, 2012, pp. 169–173. IEEE Computer Society (2012). https://doi.org/10.1109/ICFHR.2012.238
Face Verification Using Single Sample in Adolescence

R. Sumithra1, D. S. Guru1, V. N. Manjunath Aradhya2, and Anitha Raghavendra3

1 Department of Studies in Computer Science, University of Mysore, Manasagangotri, Mysuru 570 006, Karnataka, India [email protected]
2 Department of Computer Application, JSS Science and Technology University, Mysuru, Karnataka, India [email protected]
3 Department of E&CE, Maharaja Institute of Technology, Mysuru, Karnataka, India
Abstract. Recognition of facial aging using face images has enormous applications in forensics and security control. In this study, an attempt has been made to verify the face images of adolescents and adults by using a single reference sample. To study this problem, we have designed our model considering with-patch based and without-patch based face images. Both local and pre-trained deep features have been extracted during feature extraction. Further, the well-known subspace techniques Principal Component Analysis (PCA) and Fisher Linear Discriminant (FLD) have been adopted for dimensionality reduction. To measure the model's goodness, we have created our own dataset consisting of two images of 64 persons, each with an age gap of 10 years between adolescence (15 years old) and adulthood (25 years old). The comparative analysis between with-patch and without-patch based images, and between PCA and FLD, has been studied effectively. The pre-trained networks give the highest matching characteristic of 96% using the top projection vectors. The experimental results reveal that the deep learning model outperforms the state-of-the-art method for face verification. Therefore, the obtained results are promising and encouraging for face biometric applications.

Keywords: Face verification · Adolescence · FLD · Deep features
1 Introduction

Face recognition is an interesting area, and much work has attempted to handle face images in the wild with pose variation, facial reconstruction identification, illumination variation, etc. (Jain et al. 2016). The human face is the fundamental characteristic of identity recognition. Face photos are useful not only for the identification of individuals but also for exploring other characteristics such as gender, age, race, and a person's emotional state. The individuality, universality, acceptance, and ease of collectability of the face make it a suitable biometric modality. Face recognition is a well-studied biometric problem that is still unresolved because of the inherent
difficulties presented by human faces images (Jain et al. 2011). Some of the common problems with face recognition are changes in posture, lighting, and aging. Amongst several issues, aging variation has received greater attention from the research community as the appearance of the human face changes remarkably (Ricanek et al. 2006). The aging consideration method for facial recognition involves variance in face shape, size, and texture. These temporal variations will cause performance degradation of the system. As the appearance of the face changes based on its maturation and nonavailability of the facial aging dataset, facial aging recognition through young face images is a challenging task. The problem becomes even more complicated when a large age gap has been considered. However, in recent days, only a few works have been done on a cross-age face recognition system. Rowden et al. (2017) have experimented with a longitudinal study on automatic face recognition and stated that proper statistical analysis is essential for studying face recognition performance over a long period. They have created a longitudinal face image of new-born, infants, and toddlers in one year with a different interval of time. Their experimentation reveals that the face recognition system cannot recognize children at the age of 3 years reliably and concluded that the rate of change of face in children, particularly between 1 to 5 years of age is maximum. Ricanek et al. (2015) have described the changes of the cranium from childhood to young adulthood every year, as facial aging in children majorly involves craniofacial growth. Most of the highly sophisticated and powerful face recognition methods fail due to the regular changes in facial parameters (Jain and Stan 2011). Consequently, the reliability of a face recognition system depends mainly on the stability of a facial parameter. From this literature, we can understand, the facial feature is growing drastically in non-adults; therefore, face verification is very challenging. The main objective of our study is to find the changes in the face images from non-adult to adult where it can be deployed to any face recognition system for biometric applications. To address such problems, a suitable dataset is highly essential. Hence, we have created our face image dataset. In this study, we have attempted to verify the faces for a large age gap with a minimum of 10 years through a single reference sample, age difference from teenager to adult. The contribution of this study has followed: • Face identification for a large age gap of 10 years using a single reference sample. • Local and deep features have been extracted on patch-based and without patchbased face images. • A well-known statistical pattern recognition technique such as PCA and FLD for dimensionality reduction have been proposed. The paper is structured as follows: the framework of the model has been explained in Sect. 2. Experiments and observations have been presented in Sect. 3. Conclusion and further development have been discussed in Sect. 4.
2 Proposed Model

The proposed model is designed with four stages, viz. pre-processing, feature representation, dimensionality reduction and Cumulative Match Characteristic (CMC) computation. Initially, the input images are pre-processed by face alignment. Further, features are extracted from the facial components of the face images to represent the facial parameters, and then their dimensionality is reduced. Later, the distance measure is computed between two face images of the same person with a time-lapse of 10 years, followed by the CMC as the performance measure. The outline of the proposed model is depicted in Fig. 1.
Fig. 1. The pictorial representation of our proposed model.
2.1 Face Alignment and Preprocessing
We have considered face images from scanned photographs; hence, some manual alignment has been required. The Viola-Jones face detector (Viola and Jones 2004) has been used to identify the area of the face in the image, as indicated in Fig. 2.
Fig. 2. (a) Scanned passport-sized photograph images; (b) The Pre-processed images.
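A minimal sketch of this detection step with the Haar cascade bundled with OpenCV is given below; the detector parameters and the choice of keeping the largest detection are assumptions.

import cv2

def detect_face(scanned_photo_bgr):
    """Viola-Jones (Haar cascade) face detection; returns the crop of
    the largest detected face, or None if no face is found."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(scanned_photo_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detection
    return scanned_photo_bgr[y:y + h, x:x + w]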
Row-wise and column-wise patches of the aligned face images have been considered in order to analyze each facial component. The idea behind the column and row face patches is to analyze which facial component has more influence on the efficient verification of faces. As illustrated in Fig. 3, we have considered six different face patches from each face image.
Fig. 3. Examples of face patches of the same person with different ages.
2.2 Feature Representation
Both local and pre-trained deep learning-based features have been extracted from the full-face and patch-based images. Our face image dataset consists of scanned photographs; therefore, we have adopted local features such as the Multiscale Local Binary Pattern (Multi-LBP) (Guo et al. 2010) to capture changes in the texture of the face. To represent the state of the art, deep features from pre-trained networks have also been extracted for effective face representation. The feature extraction procedures are described as follows:

Multiscale Local Binary Pattern (Multi-LBP). The Local Binary Pattern (LBP) (Ahonen et al. 2006) technique is one of the most utilized texture descriptors and has been broadly used for the efficient representation of face image features. The operator works by thresholding each pixel's 3 × 3 neighborhood against its center value, thus creating a local binary pattern, which is formulated as a binary number. The occurrences of the various local patterns are compiled into a histogram that is used as a descriptor of the texture. Features calculated in a local 3 × 3 neighborhood cannot capture large-scale structures because the operator is not very resilient to local texture changes; therefore, an operator with a larger spatial support area is needed. The operator has been extended to allow rotation-invariant analysis of facial textures at multiple scales such as 3 × 3, 5 × 5, 7 × 7 and 9 × 9 (Ojala et al. 2002).

Deep Features: AlexNet (Krizhevsky et al. 2012). AlexNet comprises eight weight layers: the first five are convolutional and the remainder are fully connected. The kernels of the second, fourth and fifth convolution layers are connected only to certain kernel maps of the previous layer. Max-pooling layers follow the response-normalization layers and the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully connected layer.

GoogLeNet (Szegedy et al. 2015). The Inception module plays an important role in the GoogLeNet architecture. The Inception architecture is restricted to the filter sizes 1 × 1, 3 × 3 and 5 × 5, and a 3 × 3 max pooling is also added to the module. The network is 22 layers deep; the initial layers are simple convolutional layers. This network has 57 layers within the inception modules, among which 56 are convolutional layers and one is a fully connected layer.

ResNet-101 layers (He et al. 2015). The convolution layers have 3 × 3 filters and follow two simple design rules: (i) for the same output feature-map size, the layers have the same number of filters, and (ii) the number of filters is doubled if the feature-map size is halved, to preserve the time complexity per layer. The network ends with a global average pooling, a 10-way fully connected layer, and softmax. This architecture has 105 layers with weights, among which 104 are convolutional and one is fully connected.

VGG-19 Layers (Simonyan and Zisserman 2014). In this architecture the image is passed through a stack of convolutional layers in which filters with a very small receptive field (3 × 3) are used. The stack of convolutional layers is followed by three fully connected (FC) layers: the first two have 4096 channels each and the third performs the 1000-way classification; the configuration of the fully connected layers is the same across all networks. All hidden layers are equipped with the rectification non-linearity. This network has 19 layers with weights, among which 16 are convolutional and the remaining 3 are fully connected.
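A sketch of the Multi-LBP descriptor is given below using scikit-image; mapping the 3 × 3 to 9 × 9 windows to radii 1–4 with 8 sampling points, and the use of 59-bin uniform histograms, are assumptions about the exact operator configuration used in this study.

import numpy as np
from skimage.feature import local_binary_pattern

def multiscale_lbp(gray_image, radii=(1, 2, 3, 4), n_points=8):
    """Concatenate 59-bin uniform-LBP histograms computed at several
    scales; radii 1..4 roughly correspond to 3x3 .. 9x9 neighbourhoods."""
    histograms = []
    for r in radii:
        codes = local_binary_pattern(gray_image, n_points, r, method="nri_uniform")
        hist, _ = np.histogram(codes, bins=59, range=(0, 59), density=True)
        histograms.append(hist)
    return np.concatenate(histograms)   # 4 scales x 59 bins = 236-d descriptor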
2.3 Dimensionality Reduction
To reduce the computational burden and feature dimension, the subspace methods Principal Component Analysis (PCA) (Turk et al. 1991) and Fisher Linear Discriminant (FLD) (Mika et al. 1999) have been used to preserve the most dominating projection vectors. The contribution of eigenvalues and eigenvectors to the face recognition literature has been a milestone; hence, in this study, PCA and FLD have been adopted for dimensionality reduction.

PCA. Let W represent a linear transformation matrix mapping the feature points from m dimensions to p dimensions, where p « m, as follows:

Y_p = W^T x_p    (1)

which is the linear transformation of the extracted features, where {w_i | i = 1, 2, ..., m} is the set of n-dimensional projection vectors corresponding to the m largest eigenvalues.

FLD. An example of a class-specific approach is the Fisher Linear Discriminant (FLD). This method selects W to maximize the ratio of the between-class scatter to the within-class scatter:

W = arg max_W ( |W^T S_B W| / |W^T S_W W| )    (2)

where S_B is the between-class scatter matrix, S_B = Σ_{i=1}^{C} N_i (x̄_i − μ)(x̄_i − μ)^T, and S_W is the within-class scatter matrix, defined as S_W = Σ_{i=1}^{C} Σ_{x_k ∈ X_i} (x_k − μ_i)(x_k − μ_i)^T. Here μ_i is the mean image of class X_i and N_i is the number of samples in class X_i, and {W_i | i = 1, 2, ..., m} is the set of generalized eigenvectors of S_B and S_W corresponding to the m largest generalized eigenvalues {λ_i | i = 1, 2, ..., m}. The number of images in the learning set is, in general, much smaller than the number of features in an image. This means that the matrix W can be chosen so that the projected samples' within-class scatter is rendered exactly null. This is achieved by using PCA to reduce the size of the feature space to N − c, and then applying the standard FLD defined by Eq. 3 to reduce the size to c − 1 (Belhumeur et al. 1997).
More formally, W_FLD is given by

W_FLD^T = W_fld^T W_pca^T    (3)

where

W_pca = arg max_W |W^T S_T W|   and   W_fld = arg max_W ( |W^T W_pca^T S_B W_pca W| / |W^T W_pca^T S_W W_pca W| ).

In computing W_pca, we have selected only the largest c − 1 projection vectors.
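The PCA-then-FLD projection of Eq. (3) can be approximated with scikit-learn as sketched below; this assumes more than one training sample per class is available to estimate the within-class scatter, and the component count is illustrative.

from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def pca_fld_fit(X, y, n_pca_components):
    """Fisherfaces-style pipeline: PCA to an intermediate subspace,
    then FLD (LDA) to at most c - 1 discriminant directions."""
    pca = PCA(n_components=n_pca_components).fit(X)
    fld = LinearDiscriminantAnalysis().fit(pca.transform(X), y)
    # Returned callable projects new feature vectors with W_FLD = W_fld W_pca.
    return lambda Z: fld.transform(pca.transform(Z))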
2.4 Distance Measure
The cosine similarity metric is adopted to find the distance between the two different-age face images of a person across the large age gap, computed using Eq. 4:

Cosine(FM_young, FM_adult) = ( Σ_{k=1}^{n} FM_young(k) · FM_adult(k) ) / ( sqrt( Σ_{k=1}^{n} FM_young(k)^2 ) · sqrt( Σ_{k=1}^{n} FM_adult(k)^2 ) )    (4)
where n is the number of projection vectors. From the similarity measure, the obtained square similarity matrix is sorted and the index values are preserved, followed by the Cumulative Match Characteristic (CMC) as the performance measure of the proposed model.
2.5 Cumulative Match Characteristics
In general, face recognition can be broken down into two problems: verification and identification. A verification method is a one-to-one matching problem, while the identification process is a one-to-many matching problem; an identification system has to compare the features against an entire established identity database. When assessing how useful an algorithm is, the question is not always "is the top match correct?" but rather "is the correct match within the top N subjects?". The number N is called the match rank, which indicates how many images have to be examined to obtain a desired level of performance. The resulting statistics are reported as cumulative match scores, popularly known as the Cumulative Match Characteristic (CMC) (Moon et al. 2001). In this work, the CMC is employed to assess the efficiency of the verification technique (Bouchafa et al. 2006). The CMC displays the probability of observing the correct individual within the top N ranks. The percentage of the Cumulative Match Characteristic is computed using:
features j¼1
Number of Subjects P Subjects i¼1 & IndexMatði;jÞ¼¼j IndexMat ði; jÞ rank
ð5Þ
where i and j index the subjects taken for the experimentation and PV denotes the projection vectors. In our study, the rank of all subjects has been computed; hence the performance of the model depends on the total number of subjects taken for the experimentation, as depicted in Eq. (5).
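A sketch combining the cosine matching of Eq. (4) with the CMC computation of Eq. (5) is given below; rows of the two feature arrays are assumed to be ordered so that probe i and gallery entry i belong to the same subject, and the array names are placeholders.

import numpy as np

def cmc_curve(young_feats, adult_feats, max_rank=10):
    """Cosine-similarity matching between the young and adult projections
    and the resulting Cumulative Match Characteristic over the top ranks."""
    y = young_feats / np.linalg.norm(young_feats, axis=1, keepdims=True)
    a = adult_feats / np.linalg.norm(adult_feats, axis=1, keepdims=True)
    sim = y @ a.T                                   # cosine similarity matrix
    order = np.argsort(-sim, axis=1)                # best match first
    # Rank of the correct subject for every probe (1 = best match).
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(y))])
    return np.array([(ranks <= r).mean() for r in range(1, max_rank + 1)])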
3 Experiment Results

3.1 Dataset Collection
The dataset plays a significant role in face recognition; building an adequate dataset is a fundamental task for any application, as a dataset is the primary necessity for system design and testing. This study has been designed to compare two face images of a person with an age gap of 10 years, and for this purpose the current benchmark datasets fail to reach our desired target. However, we have chosen photos from a subset of the FG-NET dataset with a minimum age difference of 10 years between the adolescent (15 years old) and adult (25 years old) face photos. We found only 29 such subjects in the FG-NET dataset, and the photos of the remaining 35 subjects have been collected from all possible sources. The collected photos have been scanned using a professional HP LaserJet M1136 MFP scanner at 600 DPI resolution. Our dataset has 128 profile photos of 64 individuals, each with one photo at approximately 15 years old, called the young (adolescent) face photo, and another at approximately 25 years old, called the mature face photo. The ten-year age difference between two face images is shown in Fig. 4. There is no control over the image quality, as the obtained face images are scanned pictures.
Fig. 4. Example of samples of face images from our dataset.
3.2 Experimental Analysis
During experimentation, features based on conventional and convolutional methods have been extracted from with-patch based and without-patch based images. As mentioned in Sect. 2.2, the MLBP descriptor has 59 bins per operator, so we obtain a 236-d feature vector from the full-face images for the four discrete operators. For dimensionality reduction, the largest eigenvalues from the top 50% (118 projection vectors) of the principal components have been considered empirically, followed by FLD. Then, the cumulative match characteristic is calculated for each projection vector and the results are shown here.

Results Without-Patch Based Images. The results based on whole face images using M-LBP and pre-trained deep features are shown in Figs. 5, 6 and 7, and their analysis is explained in the following section. In each resulting graph, the x-axis represents an increasing number of projection vectors and the y-axis the cumulative match characteristic. From the obtained results, we can notice that AlexNet and GoogLeNet based features give the highest rate of matching at 96%. Under the FLD transformation each feature behaves similarly, and the contribution of FLD is significantly higher than that of PCA.
Fig. 5. Full face images with MLBP feature for different feature transformation techniques.
Fig. 6. Results of deep features for different feature transformation techniques: (a) PCA; (b) FLD.

Fig. 7. Results of different feature transformation techniques for full face images using pretrained deep architectures: (a) AlexNet; (b) VGG19; (c) ResNet101; (d) GoogLeNet.
Results With-Patch Based Images. The concept behind patch-based feature extraction is to obtain more discriminative features from each facial component and also to analyse which facial component is more suitable for efficient identification. The results of MLBP features for the two feature transformations on different patches of a face are shown in Fig. 8. We have found that FLD gives the best matching characteristic for patch 1, patch 5 and patch 6; however, patch 2, patch 3 and patch 4 are difficult to recognize in both the PCA and FLD results.
Fig. 8. Comparison of MLBP features with different feature transformations for different face patches: (a) Patch 1; (b) Patch 2; (c) Patch 3; (d) Patch 4; (e) Patch 5; (f) Patch 6.
Fig. 9. Comparison of different patches for the M-LBP feature with different transformations: (a) PCA; (b) FLD.
Fig. 10. Results of deep features with FLD transformation for different patches of face images: (a) Patch 1; (b) Patch 2; (c) Patch 3; (d) Patch 4; (e) Patch 5; (f) Patch 6.
In Fig. 9, a comparison of the different patches for the different feature transformations using M-LBP is presented. The best performance for all patch images is obtained using FLD rather than PCA; patches 1, 5 and 6 give the best matching with only four projection vectors. The pre-trained deep features extracted and transformed using FLD for the different patches of the face images are shown in Fig. 10. FLD plays a major role in the face verification system, as the within-class scatter is negligibly small (turns to zero). From the graph, we can notice a matching characteristic of nearly 95% with three projection vectors, and the VGG19 pre-trained network gives the best performance for patches 3 and 6, where patch 3 refers to the central rows of the face and patch 6 to the lower part of the face (jaw and chin). The proposed model is compared with the state-of-the-art model proposed by Anil K. Jain and colleagues (Li et al. 2011). They developed an age-invariant face recognition system for adults with an age gap of approximately six years using a discriminative method: by stratified random sampling of the training set as well as of the feature space, multiple LDA-based classifiers were developed to make robust decisions, and the same authors addressed a multi-feature discriminant analysis. The comparison results are depicted in Fig. 11.
Fig. 11. Comparison with the state-of-the-art method.
In this study, cross-age face verification is presented as a single-image matching problem with a 10-year age difference. Using deep features, we found a high recognition rate, with 96% matching by using just four projection vectors.
4 Conclusion

In this study, we have attempted to identify faces across a large age gap of at least 10 years, from adolescence to adulthood, by using a single reference sample. In order to verify the effectiveness of this study, we have created our own dataset consisting of two images of each of 64 persons, with an age gap of 10 years between the adolescent (15 years old) and adult (25 years old) photos. We have used both local and deep feature extraction techniques and reduced the feature dimension using PCA and FLD. The comparative analysis between with-patch and without-patch based images, and between PCA and FLD, has been studied effectively. From this study, we find that there is an evident rate of face matching of 96%. Therefore, we conclude that the adolescent age may be suitable for reliable face recognition applications. The number of subjects and images used in the experiment is small; therefore, in the future, we are planning to extend our dataset.
References Ahonen, T., Hadid, A., Pietikainen, M.: Face description with local binary patterns: application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12, 2037–2041 (2006) Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 711–720 (1997) Bouchafa, S., Zavidovique, B.: Efficient cumulative matching for image registration. Image Vis. Comput. 24(1), 70–79 (2006) Canziani, A., Paszke, A., Ulurciello, E.: An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678 (2016) Guo, Z., Lei, Z., David, Z., Mou, X.: Hierarchical multi scale LBP for face and palm print recognition. In: 2010 IEEE International Conference on Image Processing, pp. 4521–4524. IEEE (2010) Jain, A.K., Ross, A., Nandakumar, K.: Introduction to Biometrics. Springer (2011). https://doi. org/10.1007/978-0-387-77326-1 Jain, A.K., Nandakumar, K., Ross, A.: 50 years of biometric research: accomplishments, challenges, and opportunities. Patt. Recogn. Lett. 6(3), 1028–1037 (2011) Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012) Li, Z., Park, U., Jain, A.K.: A discriminative model for age invariant face recognition. IEEE Trans. Inf. Forensics Secur. 6(3), 1028–1037 (2011) Mika, S., Ratsch, G., Weston, J., Scholkopf, B., Mullers, K.R.: Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (cat. no. 98th8468), pp. 41–48. IEEE, August 1999 Moon, H., Phillips, P.J.: Computational and performance aspects of PCA-based face-recognition algorithms. Perception 30.3, 303–321 (2001) Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 7, 971–987 (2002) Ricanek, K., Tesafaye, T.: Morph: a longitudinal image database of normal adult ageprogression. In: 2006 FGR 2006 7th International Conference on Automatic Face and Gesture Recognition, pp. 341–345. IEEE (2006) Ricanek, K., Bhardwaj, S., Sodomsky, M.: A review of face recognition against longitudinal child faces. BIOSIG 2015 (2015) Rowden, L., Jain, A.K.: Longitudinal study of automatic face recognition. IEEE Trans. Patt. Anal. Mach. Intell. 40, 148–162 (2017) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) Turk, M.A., Pentland, A.P.: Face recognition using Eigen faces. In: Proceedings CVPR’91, IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE (1991) Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision 57(2), 137–154 (2004)
Evaluation of Deep Learning Networks for Keratoconus Detection Using Corneal Topographic Images

Savita R. Gandhi, Jigna Satani, Karan Bhuva, and Parth Patadiya

Department of Computer Science, Gujarat University, Navrangpura, Ahmedabad 380009, Gujarat, India

Abstract. Keratoconus is an eye disease in which the corneal curvature is deformed due to non-inflammatory progressive thinning, resulting in a loss of elasticity in the cornea and the protrusion of a cone-shaped formation that ultimately reduces visual acuity. For many years, researchers have worked towards accurate detection of keratoconus (KCN), as it is an essential check-up before any refractive surgery, demanding quick as well as precise clinical diagnosis and treatment of keratoconus prior to LASIK. In our study, we first derive two variants of the original corneal topographies, namely 'images with edges' and 'images with edges-and-mask', as data sets. The deep neural network techniques Artificial Neural Network (ANN), Convolutional Neural Network (CNN) and the pretrained VGG16 model are applied to the original corneal topographies as well as to the two variants, and the results obtained are presented.

Keywords: Keratoconus · Corneal topography · ATLAS 9000 · ANN · CNN · VGG16 · Canny edge detection · Edges with mask
1 Introduction

Fig. 1. Comparison between normal eye and keratoconic eye

The cornea is the transparent outer layer of the eye, responsible for maintaining its safety and shape apart from producing clear images by refracting light properly, so any irregularity in the corneal curvature reduces the quality of vision. As shown in Fig. 1, due to progressive thinning the cornea loses elasticity and turns into a cone-shaped formation that protrudes, referred to as keratoconus. In the keratoconic condition, the cornea distorts light refraction, resulting in blurred vision. Keratoconus typically impacts both eyes but starts impairing one eye first, and in its advanced stages the progressive thinning may lead to blindness. Keratoconus is asymptomatic in its early stages, but the irregularities in corneal curvature manifest gradually [1, 2]. Refractive surgery is suggested for correcting the eye's ability to focus by reshaping the corneal curvature, but it can only be performed on healthy eyes: LASIK surgery performed on eyes with a keratoconic condition can cause corneal ectasia, which may lead to irreversible damage [3]. The prevalence of keratoconus is seen in the Arabic region as well as in other Asian countries. Keratoconus can impact old as well as young people, so early detection can help in preventing irreversible damage [4–8]. Hence, not only early detection but also better methods of diagnosis of keratoconus are needed. As a result, over the past many years various algorithms and
models of neural network have been applied for detection and discrimination of keratoconus and the contribution of neural network models in classifying keratoconus is remarkable. However, the need for the enhanced methods with high accuracy has encouraged this research and application of deep neural networks on corneal topographies. The accuracy of the deep neural network model depends on the quality and nature of images. Belin et al. [9, 10] explain elevation-based topography and its advantages over Placido – based devices. A topography map reads features from corneal surface which plays significant role in detection of keratoconus. Additionally, such topographic maps provide information about location and shape of elevation on corneal curvature which is useful to detect the presence of the disease and its severity. The past researches suggest the need for not only detection of keratoconus but detecting this disease in its early stage to halt the progression of protrusion in cornea. As keratoconus is irreversible condition in its advance stages, effective methods for early detection of KCN may prove significant to restore the patient’s vision. In this study, we have extracted images with edges and images with edges and mask from the keratoconic corneal maps. The images with ‘edges and mask’ show the shape of elevated region or thinning of actual cornea and pattern developed due to irregularities in corneal curvature. In order to do so, it is required to separate the keratoconic maps from the normal corneal maps. Here, we have applied ANN, CNN and VGG16, a pretrained ImageNet model on our three derived variants of corneal topographies shown in Fig. 2. These variants are original color images, images with ‘edges’ and images with ‘edges and mask’, which are used in this study to compare the effectiveness of three deep learning models. These variety of images possess pattern of elevated region and shapes. The shapes and dispersal of elevated area, are useful for detecting the severity of the disease and may participate in early detection of keratoconus.
Fig. 2. Pre-processed topographic maps of normal eye and progression of keratoconus
2 Related Work The Neural Network (NN), subset of AI, tries to imitate human brain and mostly chosen as prime technique to analyze and classify medical data. Armstrong et al. [11] explained the role of Artificial Intelligence and Machine Learning in various ocular diseases. Further, Smolek et al. [12] suggested the preference of NN approach over videokeratographic methods. P. Agostino et al. [13] used NN to identify keratoconus from 396 corneal topographic maps selected from videokeratoscope (EyeSys), here parameters of both eyes were used to classify classes namely - ‘Normal’, ‘Keratoconus’ and ‘Alterations’. They initially used 6 neurons where the result was not satisfactory and went ahead to use 9 and 10 or 19 neurons for unilateral and bilateral respectively which than appeared to improve the accuracy to 96.4%, primarily as the parameters used were from both the eyes. So, the identification of keratoconus in its early stage still remains an issue. Advancement in AI and NN led to revolutionary Machine Learning technique which produced proficient outcomes with keratoconus [14, 15]. Valdes-Mas et al. [16] predicted vision accuracy of patient’s with keratoconus, by measuring corneal-curvature and astigmatism, after intra-corneal ring implantation. They could achieve their best accuracy only after using multi-layer perceptron ANN method, notably an offshoot of Machine Learning. There had many precedencies of Machine Learning being used for keratoconus classification. Souza et al. [17] used Machine Learning model to device SVM, multilayer perceptron and radial basis function, to evaluate performance for keratoconus identification, whereas, all three neural network classifiers were trained on Orbscan II maps. The outcome was based on ROC that shows similar performance by all three classifiers. Arbelaez et al. [18] processed Pentacam images of eyes using SVM classifier for differentiating normal eyes from other three groups of eyes – Abnormal, Keratoconus and Subclinical Keratoconus. This classifier shows higher accuracy in detection of keratoconus compared to normal eyes, considering data with the posterior corneal surface and corneal thickness indices that made classifier effective for subclinical keratoconic eyes, important for detection early signs of disease. Ali et al. [19] implemented SVM with topographic maps for classifying normal maps and maps with abnormal indications. Kovács et al. [20] choose Machine Learning classification, by applying multilayer perceptron model on bilateral data from Pentacam Scheimpflug images. It can differentiate eyes with unilateral keratoconus from the eyes with refractive surgery, further suggests the need and importance of an automatic screening software to identify eyes with unilateral keratoconus, in early stages itself. Toutounchian et al. [21] also used Image Processing on 82 corneal topographical images, obtained from Pentacam, for extracting the features. Multilayer-Perceptron, RBFNN, SVM and Decision Tree were used to classify images under Keratoconus, Suspect to Keratoconus and, Normal eye, with 91% accuracy. The image processing technique reads images as matrix of pixel where pixel values of images for Suspect to Keratoconus, could be helpful in early detection.
Hidalgo et al. [22] has applied SVM to process 22 corneal topography parameters for evaluation of accuracy of discrimination of normal eyes from keratoconus suspected eyes. Here also the weighted average of 95.2%, could be attained in separating keratoconus from the forme fruste keratoconus. They emphasised on increasing accuracy in diagnosis of asymptomatic subclinical patients. It can be noted that the use of subclinical keratoconus detection can be used to diagnose the early sign of keratoconus and hence helping in preventing the development and progression of disease. Recently new substantial Machine Learning techniques have been devised to process the large amount of data. Based on the fusion of the concepts of Neural Networks and Machine Learning, the Deep Learning algorithms are much more capable to process large amount of data for analysis, prediction and classification. Jmour et al. [23] suggested the significance of Convolutional Neural Network (CNN). CNN is well known Deep Learning technique widely used for image classification. CNN uses kernel, a small matrix of weights to perform convolution operation on the input images to classify the images by extracting its features. Lavric et al. [24] ‘KeratoDetect’ model based upon CNN, detects keratoconus eyes with higher accuracy. Out of synthetic data of 1500 topographies of normal eye and 1500 topographies of eyes with keratoconus, 1350 topographies were used as training; 150 images were used for validation and 200 images to measure the accuracy of the proposed algorithm. As obtaining large data set was difficult, here, synthetic data were generated from 145 Scheimpflug topographies using SyntEyes KTC model. The size of epochs used was ten for 200 test corneal topographies, where each epoch used 21 iterations led to 97.01% accuracy. Similarly, 400 test topographies with 30 epochs was 97.67% and the accuracy improved only after increasing epochs to 38. Neural Network (NN) has contributed enormously in the past, however, the recent techniques of Machine Learning and its offshoot Deep Neural Network has been proven to much more effective than NN in detection of keratoconus. Lin et al. [25], observed that the detection of disease in early stage is difficult but essential for refractive surgery even with the imaging devices such as Topographic Modelling System, EyeSys, OPD-Scan, Orbscan-II, Pentacam and Galilei are being used to capture anatomic data from cornea. Imaging modalities as RT-Vue-100, Artemis-1 and Zernike coefficients could improvise detection accuracy. They further opined that apart from detection techniques, lack of standardized methods was leading to inconsistent dataset that resulted in improper definition of early manifestation of keratoconus. Also, public datasets of corneal topographies for keratoconus detection are insufficient which limits advance studies and research. They concluded that the Machine Learning techniques though performed better in distinguishing between keratoconus and normal, eyes, yet the machine learning alone could not differentiate a subclinical keratoconus and normal, eyes, efficiently. Aforesaid studies illustrate the effectiveness of Neural Network techniques when applied upon topographic and tomographic data for keratoconus. Authors in this paper tried to apply the Deep Learning Techniques on corneal topographic data obtained from ATLAS 9000 topographic keratometer device, to detect keratoconus.
3 Study Data and Methods
In past studies, numerical measures and topographic maps were used frequently. However, with the evolution of imaging and image processing technology, recent studies use corneal curvature elevation maps, referred to as topography and tomography images. Topographic maps consider the anterior features of the cornea, whereas tomography uses both anterior and posterior features. These maps are prepared from digital images captured by a keratometry device. Both types of maps comprise indices for corneal attributes, especially elevation and steepening of the corneal surface; the anterior segment indices represent measurements of the corneal surface and its morphology [10]. In this study, a total of 1104 colored topographic maps of the cornea were extracted from an ATLAS 9000 corneal topography device. This keratometry device has a 22-ring Placido disk and uses a patented cone-of-focus alignment system and corneal wavefront technology for analysis. The subject data consist of 804 maps of bilateral keratoconic eyes from 402 clinically diagnosed patients and 300 maps of healthy eyes from 150 patients. The axial curvature elevation maps are rendered using the standard color scheme and normalized to a size of 534 × 534 × 3 pixels. The axial curvature map represents underlying local curvature irregularities and the position of such irregularities. The topographic image data used here are clinically diagnosed with moderate to severe keratoconus; forme fruste keratoconus maps are omitted from this study group. This study consists of two parts: (i) using OpenCV to derive edges and a mask of the elevated area from the corneal images, to determine the effectiveness of the dataset of corneal maps and its variants for future use in early detection of keratoconus, and (ii) customizing deep learning networks to achieve significant accuracy for keratoconus detection. As shown in Fig. 3, a total of three data sets are used: the originally derived 'color topographic' images, and two more derived using the Canny edge detector to obtain images with 'edges' and with 'edges and mask'.
Fig. 3. Original color topography, image with ‘edges’ only and image with ‘edges and mask’
OpenCV is used to access the Canny edge detector of computer vision. Computer vision is used to understand and extract information from digital images in order to analyze and predict visual data, similar to the way the human brain functions. Using convolution, the Canny edge algorithm reduces noise and softens the edges by applying Gaussian smoothing, and determines edges at the overlap of gradients. The overlapping of shades may affect the overall accuracy in detecting the
region of interest; to resolve this, a warm-color area is identified as a mask signifying the higher intensity of steepening present in the cornea, along with its edges. These newly processed images are referred to as 'images with edges and mask'. Topographic maps are rendered in the standard color scheme offered by the ATLAS 9000 keratometer, which ranges from cool blue to warm red; cool colors represent flatness and warm colors represent elevation, highlighting protrusion of the cornea. The shape of the steepening, or cone, plays a vital role in identifying the severity of keratoconus [26]. Previous research suggests that certain corneal topographic patterns are well associated with the prevalence of keratoconus [27]; here, the images with only edges and with edges-and-mask are further used to determine the accuracy of keratoconus detection. We present the use of three deep learning techniques, namely an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN) and the pretrained ImageNet VGG16. The ANN and CNN models are tailored with their respective numbers of hidden layers and used with the three varieties of data sets. Each model uses the 'ReLU' activation function in its hidden layers, and the final layer uses Softmax as the activation function; Softmax normalizes the output of the earlier layer and computes a probability between 0 and 1 for each label. K-fold cross validation has been used with all three algorithms to avoid over-fitting; every model uses K = 10. K-fold ensures that every observation from the original dataset has a chance of appearing in both the training and test sets. In each model, the stochastic optimizer 'Adam' is used to adapt the learning rate of the deep neural network individually per parameter. Being a combination of 'AdaGrad' and 'RMSProp', two derivatives of Stochastic Gradient Descent, Adam works well with large amounts of data as well as with deep learning models and computer vision problems, and it handles the vanishing gradient issue commonly seen when dealing with large amounts of data. In this study, Adam determines an optimized learning rate for the parameters of our network models, which accept corneal topographical maps as input data.
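The following is a minimal OpenCV sketch (not the authors' exact pipeline) of how the 'edges' and 'edges and mask' variants can be derived from a colored topography map; the Canny thresholds and the HSV range used to isolate warm colors are assumed values for illustration.

```python
# Illustrative sketch: deriving 'edges' and 'edges and mask' variants with OpenCV.
# Thresholds and HSV warm-colour ranges below are assumptions, not the authors' values.
import cv2

def derive_variants(path):
    bgr = cv2.imread(path)                             # colour topography map
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)        # noise reduction before Canny
    edges = cv2.Canny(blurred, 50, 150)                # 'edges' image (binary)

    # Warm colours (yellows/oranges/reds) mark steepening; isolate them in HSV.
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    warm_low = cv2.inRange(hsv, (0, 80, 80), (35, 255, 255))      # yellow to red hues
    warm_high = cv2.inRange(hsv, (160, 80, 80), (179, 255, 255))  # wrap-around reds
    mask = cv2.bitwise_or(warm_low, warm_high)

    # 'edges and mask': edge map combined with the warm-colour (elevated) region.
    edges_and_mask = cv2.bitwise_or(edges, mask)
    return edges, edges_and_mask
```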
Fig. 4. The common workflow for all three models
Figure 4 represents the common workflow for all three models. As part of data preprocessing, the images are read and reshaped, extra information is removed, and they are further cropped and rescaled to 534 × 534 × 3 pixels. RGB images are converted into HSV format, which is more convenient for detecting specific colors when preparing both derived data sets.
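A short pre-processing sketch matching this description is given below; the crop box used to strip extra device annotations is a placeholder assumption.

```python
# Minimal pre-processing sketch: crop, rescale to 534 x 534 x 3 and convert to HSV.
# The crop_box default is an assumed placeholder, not the authors' exact region.
import cv2

TARGET = (534, 534)

def preprocess(path, crop_box=(0, 0, 534, 534)):
    img = cv2.imread(path)                        # read colour map (BGR)
    x, y, w, h = crop_box                         # remove borders / annotations
    img = img[y:y + h, x:x + w]
    img = cv2.resize(img, TARGET)                 # rescale to 534 x 534 x 3
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)    # HSV is used for colour masking
    return hsv
```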
4 Discussion and Results
Here, as depicted in Fig. 5, the ANN's tailored architecture uses (a) an input layer, (b) two hidden layers with 64 and 16 neurons respectively, each with the 'ReLU' activation function, and (c) an output layer with the sparse categorical cross-entropy function to calculate the loss between the predicted and actual labels.
Fig. 5. ANN model: structure
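A minimal Keras sketch of the tailored ANN described above is shown here; the two-class output (keratoconus vs. healthy) and the flattening of the 534 × 534 × 3 input are assumptions based on the description, not details taken from the paper.

```python
# Hedged Keras sketch of the tailored ANN: dense layers of 64 and 16 ReLU units,
# softmax output, Adam optimizer and sparse categorical cross-entropy loss.
import tensorflow as tf

def build_ann(input_shape=(534, 534, 3), num_classes=2):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Flatten(),                          # assumed flattening of the map
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example usage: model = build_ann(); model.fit(x_train, y_train, epochs=5)
```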
Each hidden layer uses 'ReLU' activation, which retains only the positive activations of its neurons. The ANN is trained for 5 epochs with K-fold cross validation (K = 10) on all three sets of images. The overall average accuracy achieved by the ANN across the data sets used in this study is 94.31%. The average performance metrics of the ANN on all three sets of images are shown in Table 1.

Table 1. ANN K-fold: average performance metrics

Model type            Layers  Accuracy  Loss   Precision  Recall  F1-score
Original color        4       94.649    0.570  0.967      0.960   0.963
Edges & mask (color)  4       95.74     0.250  0.973      0.969   0.971
Canny edges (binary)  4       92.568    0.395  0.960      0.937   0.948
From the average performance over all 10 folds, it can be seen that the best performance of the ANN is achieved with the edges-and-mask data set, where the average F1 score is as high as 97.1%. In Fig. 6, the ANN's peak accuracy of 98.2% can be seen at the third fold with the edges-and-mask image type, where the highest F1 score is 98.7%. Further, to compare the effectiveness of the ANN on our dataset, a deep learning CNN model is used; known for its effectiveness in image classification, the CNN also requires much less pre-processing compared to other algorithms.
Fig. 6. Accuracy and F1 score obtained by ANN
The architecture of our tailored CNN model is shown in Fig. 7. The input to the CNN (ConvNet) is a color image of 534 × 534 × 3 pixels. This CNN model uses a 3 × 3 kernel for the convolution operations and consists of 10 hidden layers, of which four convolution layers use 32 filters and a fifth uses 16 filters. In the convolutional layers, padding and stride are set to 1 for the training process, which preserves useful information while reducing the dimension of the data. The feature maps are then handled by a flatten layer, and the most relevant features are passed to the fully connected output layer, a one-dimensional vector used to classify the data into keratoconic and healthy corneas.
Fig. 7. CNN model: architecture
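A hedged Keras sketch of such a CNN is given below; the placement of the pooling layers and the two-class softmax output are assumptions, since the paper lists 10 hidden layers without giving every detail.

```python
# Hedged sketch of the tailored CNN: 3x3 kernels, four conv layers with 32 filters,
# a fifth with 16 filters, stride/padding of 1, flatten and a softmax output.
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(input_shape=(534, 534, 3), num_classes=2):
    model = tf.keras.Sequential([layers.Input(shape=input_shape)])
    for _ in range(4):
        model.add(layers.Conv2D(32, (3, 3), strides=1, padding="same",
                                activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))           # assumed pooling placement
    model.add(layers.Conv2D(16, (3, 3), strides=1, padding="same",
                            activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```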
From Table 2, the CNN model's average accuracy of 95.82% and F1 score of 97.1% are its best results, obtained with 10-fold cross validation on images with edges and mask. Figure 8 further illustrates the high accuracy attained by this model for all three types of images: it reaches its best single-fold accuracy of 99.09% and F1 score of 99.3% for images with edges and mask at the second fold, and its performance with color images and images with edges is equally good.

Table 2. CNN K-fold: average performance metrics of 10 folds

Model type            Layers  Accuracy  Loss   Precision  Recall  F1-score
Original color        10      95.742    0.149  0.984      0.957   0.970
Edges & mask (color)  10      95.825    0.146  0.980      0.963   0.971
Canny edges (binary)  10      94.931    0.163  0.966      0.963   0.965
There are numerous powerful ImageNet CNN architectures readily available to apply to a variety of data. To corroborate the performance exhibited by the ANN and CNN, a pretrained VGG16 model is used with the same sets of images for comparison.
Fig. 8. Accuracy and F1 score obtained by CNN
VGG16, one of the top-performing models of the 2014 ImageNet challenge, is considered an excellent vision model for image data and is appreciated for its simple and consistent arrangement of layers. VGG16 uses 3 × 3 filters with stride 1 and 'same' padding, and max pooling layers with 2 × 2 filters and stride 2. It has 16 weight layers, with fully connected layers and a Softmax output, resulting in a very large network with approximately 138 million parameters. VGG16 uses 64 filters in its first convolutional block and 512 in its last hidden block, as seen in Fig. 9. Here, VGG16 is applied to all three varieties of data prepared for classification.
Fig. 9. VGG16 model: architecture
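The following sketch shows one way a pretrained ImageNet VGG16 can be adapted to this two-class task; freezing the convolutional base and the size of the new classification head are assumptions, since the paper only states that a pretrained VGG16 is applied to the same image sets.

```python
# Hedged transfer-learning sketch: pretrained VGG16 base with a small new head
# for keratoconus vs. healthy classification.
import tensorflow as tf
from tensorflow.keras import layers

def build_vgg16(input_shape=(534, 534, 3), num_classes=2):
    base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                       input_shape=input_shape)
    base.trainable = False                       # keep ImageNet features fixed (assumption)
    model = tf.keras.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```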
As shown in Table 3, the VGG16 model performs the best, with a 97.94% average accuracy over the three data sets used here. The F1 score is highest for the original color images at 99.0%, and the average F1 score over all three types of images is 98.5%.

Table 3. VGG16 K-fold: average performance metrics of 10 folds

Model type            Layers  Accuracy  Loss   Precision  Recall  F1-score
Original color        16      98.550    0.858  0.991      0.989   0.990
Edges & mask (color)  16      98.280    1.918  0.991      0.985   0.988
Canny edges (binary)  16      97.013    1.426  0.979      0.980   0.979
The VGG16 model performs very well both with the original color images and with the images with edges and mask, as shown in Fig. 10. With the 'edges and mask' data set it scores 100% accuracy and 100% F1 score in the 5th, 8th and 9th folds of 10-fold cross validation, and it also scores 100% accuracy and 100% F1 score in the 5th and 6th folds on the original corneal topography images. Thus, VGG16 exhibits its best results with images having edges and mask as well as with the original color images packed with all features. This research shows that keratoconus can be detected with high accuracy using the 'edges and mask' images derived from the original dataset, which suggests their usability for determining the shapes of steepening and the curvature irregularities present in keratoconic corneal topographies. The analysis of the patterns and shape of the elevated area manifested on the corneal surface of keratoconic eyes can help in the early detection of keratoconus as well as in identifying other corneal irregularities.
Fig. 10. Accuracy and F1 score obtained by VGG16
5 Analysis of Results
Table 4 below shows the comparative performance of all three algorithms, ANN, CNN and VGG16, applied to the three groups of images, comparing average accuracy, loss and F1 score.

Table 4. Comparative chart of average performance metrics of ANN, CNN and VGG16

                        ANN K-folds              CNN K-folds              VGG-16
Model type              Accu   Loss  F1-score    Accu   Loss  F1-score    Accu   Loss  F1-score
Colored original        93.66  0.49  0.956       95.74  0.15  0.970       98.55  0.86  0.990
Edge and mask (color)   95.74  0.25  0.971       95.83  0.15  0.971       98.28  1.92  0.988
Edge only (binary)      94.65  0.57  0.963       94.93  0.16  0.965       97.01  1.43  0.979
ANN performs best on images with 'edges and mask', with 95.74% accuracy and a 97.1% F1 score. CNN also performs best on images with 'edges and mask', with 95.83% accuracy and a 97.1% F1 score, slightly better than its performance on color topographies (95.74% accuracy, 97.0% F1 score). VGG16 performs the best on all types of images; it achieves its highest accuracy on color images, with 98.55% accuracy and a 99.0% F1 score, and shows a comparable 98.28% accuracy with a 98.8% F1 score on images with 'edges and mask'. All three models show good overall accuracy and F1 scores when applied to images with 'edges and mask'. This demonstrates the effectiveness of the 'edges and mask' images for keratoconus detection, which further suggests the use of this variant of corneal images for early detection of keratoconus.
Fig. 11. Comparison of overall average accuracy and F1 score of ANN, CNN and VGG16
Figure 11 depicts the comparative chart for all three networks, ANN, CNN and VGG16. The performance of VGG16 is the best with all three types of images with respect to all measures. ANN and CNN perform best with the 'edges and mask' type of images, on which they show similar F1 scores. The performance of all three models on images with 'edges and mask' is consistently good, which suggests the potential of this image variant.
Fig. 12. Comparison of overall average loss of ANN, CNN and VGG16
It is seen in Fig. 12 that VGG16 suffered slightly higher log loss in the 7th and 9th folds for images with 'edges and mask', and in the 8th fold for images with only edges. An overall comparison of the three models across all three image types indicates that the stability of the CNN model is the best: it is slightly better than that of the ANN and much better than that of VGG16.
6 Conclusion and Future Work
It is evident from our work that all three models, ANN, CNN and VGG16, perform well with respect to accuracy, precision, recall and F1 score and can be utilized for the detection of keratoconus. The CNN model is the most stable, whereas VGG16 gives the highest accuracy and F1 score. All three models exhibit consistently good accuracy on our newly derived images combining 'edges and color mask', which suggests that these images can be used to identify the shape of the elevated region of the corneal topography and may further enhance the diagnostic process by helping determine the severity of keratoconus. In future work, this manifestation from the corneal elevation maps may prove useful for the early detection of keratoconus.
References 1. Romero-Jiménez, M., Santodomingo-Rubido, J., Wolffsohn, J.S.: Keratoconus: a review. Cont. Lens Anterior Eye 33, 157–166 (2010) 2. Salomão, M., et al.: Recent developments in keratoconus diagnosis. Expert Rev. Ophthalmol. 13, 329–341 (2018) 3. Al-Amri, A.M.: Prevalence of keratoconus in a refractive surgery population. J. Ophthalmol. 2018, 1–5 (2018) 4. Netto, E.A.T., et al. Prevalence of keratoconus in paediatric patients in Riyadh, Saudi Arabia. Br. J. Ophthalmol. 102, 1436–1441 (2018) 5. Hwang, S., Lim, D.H., Chung, T.-Y.: Prevalence and incidence of keratoconus in South Korea: a nationwide population-based study. Am. J. Ophthalmol. 192, 56–64 (2018) 6. Nielsen, K., Hjortdal, J., Nohr, E.A., Ehlers, N.: Incidence and prevalence of keratoconus in Denmark. Acta Ophthalmologica Scandinavica 85, 890–892 (2007) 7. Hashemi, H., et al.: The prevalence of keratoconus in a young population in Mashhad. Iran. Ophthalmic Physiol. Opt. 34, 519–527 (2014) 8. Papaliʼi-Curtin, A.T., et al.: Keratoconus prevalence among high school students in New Zealand. Cornea 38, 1382–1389 (2019) 9. Belin, M.W., Khachikian, S.S.: An introduction to understanding elevation-based topography: how elevation data are displayed - a review. Clin. Experiment. Ophthalmol. 37, 14– 29 (2009) 10. Martínez-Abad, A., Piñero, D.P.: New perspectives on the detection and progression of keratoconus. J. Cataract Refract. Surg. 43, 1213–1227 (2017) 11. Armstrong, G.W., Lorch, A.C.: A(eye): a review of current applications of artificial intelligence and machine learning in ophthalmology. Int. Ophthalmol. Clin. 60, 57–71 (2020) 12. Smolek, M.K.: Current keratoconus detection methods compared with a neural network approach. Invest. Ophthalmol. 38, 10 (1997) 13. Accardo, P.A., Pensiero, S.: Neural network-based system for early keratoconus detection from corneal topography. J. Biomed. Inform. 35, 151–159 (2002) 14. Klyce, S.D.: The future of keratoconus screening with artificial intelligence. Ophthalmology 125, 1872–1873 (2018) 15. Consejo, A., Melcer, T., Rozema, J.J.: Introduction to machine learning for ophthalmologists. Semin. Ophthalmol. 34, 19–41 (2019)
16. Valdés-Mas, M.A., et al.: A new approach based on Machine Learning for predicting corneal curvature (K1) and astigmatism in patients with keratoconus after intracorneal ring implantation. Comput. Methods Programs Biomed. 116, 39–47 (2014) 17. Souza, M.B., Medeiros, F.W., Souza, D.B., Garcia, R., Alves, M.R.: Evaluation of machine learning classifiers in keratoconus detection from orbscan II examinations. Clinics 65, 1223– 1228 (2010) 18. Arbelaez, M.C., Versaci, F., Vestri, G., Barboni, P., Savini, G.: Use of a support vector machine for keratoconus and subclinical keratoconus detection by topographic and tomographic data. Ophthalmology 119, 2231–2238 (2012) 19. Ali, A.H., Ghaeb, N.H., Musa, Z.M.: Support vector machine for keratoconus detection by using topographic maps with the help of image processing techniques. IOSR J. Pharm. Biol. Sci. (IOSR-JPBS) 12(6), 50–58 (2017). Ver. VI. e-ISSN:2278-3008. p-ISSN:2319-7676. J1206065058.pdf (iosrjournals.org). www.iosrjournals.org 20. Kovács, I., et al.: Accuracy of machine learning classifiers using bilateral data from a Scheimpflug camera for identifying eyes with preclinical signs of keratoconus. J. Cataract Refract. Surg. 42, 275–283 (2016) 21. Toutounchian, F., Shanbehzadeh, J., Khanlari, M.: Detection of Keratoconus and Suspect Keratoconus by Machine Vision, Hong Kong 3 (2012) 22. Ruiz Hidalgo, I., et al.: Evaluation of a machine-learning classifier for keratoconus detection based on scheimpflug tomography. Cornea 35, 827–832 (2016) 23. Jmour, N., Zayen, S., Abdelkrim, A.: Convolutional neural networks for image classification. In: 2018 International Conference on Advanced Systems and Electric Technologies (IC_ASET), pp. 397–402. IEEE (2018). https://doi.org/10.1109/ASET.2018.8379889 24. Lavric, A., Valentin, P.: KeratoDetect: keratoconus detection algorithm using convolutional neural networks. Comput. Intell. Neurosci. 2019, 1–9 (2019) 25. Lin, S.R., Ladas, J.G., Bahadur, G.G., Al-Hashimi, S., Pineda, R.: A review of machine learning techniques for keratoconus detection and refractive surgery screening. Semin. Ophthalmol. 34, 317–326 (2019) 26. Perry, H.D., Buxton, J.N., Fine, B.S.: Round and oval cones in keratoconus. Ophthalmology 87, 905–909 (1980) 27. Ishii, R., et al.: Correlation of corneal elevation with severity of keratoconus by means of anterior and posterior topographic analysis. Cornea 31, 253–258 (2012)
Deep Facial Emotion Recognition System Under Facial Mask Occlusion
Suchitra Saxena1, Shikha Tripathi1, and T. S. B. Sudarshan2
1 Department of Electronics and Communication Engineering, PES University, Bangalore, India
2 Department of Computer Science and Engineering, PES University, Bangalore, India
Abstract. Over the past few decades, automated machine-based Facial Emotion Recognition (FER) has made significant progress, guided by its relevance for applications in various areas of neuroscience, health, defense, entertainment and Human Robot Interaction (HRI). Most of the work on FER is focused on controlled conditions for non-occluded faces. In the present Covid-19 scenario, when more individuals cover their face partially to prevent the spread of corona virus, developing a system that can recognize facial emotion under a facial mask constraint is desirable. In this research work, a FER system built using the Convolution Neural Network (CNN) technique is proposed, in which the eyes and forehead segment is used to train the model. The system is implemented using Raspberry Pi 3 B+ and Robotic Process Automation (RPA) for HRI. The proposed system achieves 79.87% average accuracy with 5-fold cross validation and is also tested in a real-time scenario.

Keywords: Facial emotion recognition · CNN · Facial mask occlusion · Raspberry Pi III · Robotic process automation platform
1 Introduction
Emotions are an effective part of any Human to Human Interaction (HHI). In day-to-day communication, facial expressions are a major medium to convey emotions between humans [1]. Automatic machine-based analysis of facial emotions is an essential aspect of artificial intelligence, with significant applications in various areas such as emotionally responsive robots, customized service delivery, driver-fatigue tracking systems, emotion-based dataset collection and immersive game design [2–8]. However, facial emotion recognition is still challenging in uncontrolled real-time situations. The challenges faced in real time are driven by factors such as occlusion, varying illumination, head pose variations or movements, age and gender differences, and differences in skin color and community between the subjects used for the training and testing phases. All these challenges should be addressed by an optimal FER system. Although most of these variables have been addressed methodically by facial emotion recognition systems, occlusion remains largely unaddressed in uncontrolled conditions. Developing an algorithm which can identify emotions from facial images
© Springer Nature Singapore Pte Ltd. 2021 S. K. Singh et al. (Eds.): CVIP 2020, CCIS 1377, pp. 381–393, 2021. https://doi.org/10.1007/978-981-16-1092-9_32
under an occlusion constraint such as a facial mask is therefore desirable to improve HRI in the current Covid-19 situation. Hence, a robust and efficient Deep Facial Emotion Recognition System under Facial Mask constraint (DFERSFM) is proposed for emotion recognition in this research paper. The proposed method achieves 79.87% average accuracy with 5-fold cross validation. The CMU Multi-PIE [9] database is used for training, and DFERSFM is also evaluated in real time. The proposed system architecture is based on a CNN, and it is observed that DFERSFM is efficient and reliable in recognizing emotions under the facial mask constraint, including head pose and illumination variations and age differences. The proposed method is also implemented on the UiPath Robotic Process Automation tool [21] and on a Raspberry Pi [22], so that it can be used in HRI applications as shown in Fig. 1.
Fig. 1. Block diagram Facial emotion recognition interface for HRI application.
The paper is structured as follows: Section 2 describes related research work and contributions, Section 3 describes the proposed architecture of DFERSFM, Section 4 explains results and analysis, and the conclusion and future directions are discussed in Section 5.
2 Related Work
In the past few decades, extensive research has been pursued on facial emotion recognition. One of the main challenges for a precise FER system is occlusion in real-time scenarios: there is a high probability in real-life environments that part of the face is hidden by a mask, sunglasses, a hand, a moustache or hair. In 2001, Bourel et al. introduced the occlusion constraint in emotion recognition for the first time. They used geometric facial features to address occlusion of the mouth, upper face and left/right face, with the CK database used for training and testing the algorithm [10]. In 2002, Bourel et al. examined a facial expression representation based on a geometric facial model coupled with facial movements; their experimental results show a decrease in recognition rate when the face is partly occluded and under different noise levels added at the feature-tracker level [11]. Towner et al. proposed a Principal Component Analysis (PCA) method to reconstruct missing lower and upper facial feature points to address the occlusion problem, with a Support Vector Machine (SVM) used for classification [12]. In 2008, Kotsia et al. used a method based on Gabor wavelets to extract texture information, together with a shape-based method, to recognize facial expressions under the occlusion constraint [13]. In 2009, Xia et al. proposed facial occlusion detection based on PCA, using reconstruction of the occluded region and reweighted Ada-Boost classification for facial emotion recognition [14]. Cotter et al. used a fusion of local sparse representation classifiers to combine local regions and identify facial expressions [15]. Liu et al. proposed a Weber local descriptor histogram feature and decision fusion for facial expression recognition [16]. With the remarkable success of deep learning, some recent studies concentrated on using deep learning techniques to conduct facial expression recognition directly on occluded images, without occlusion detection measures, hand-crafted feature extraction or separate classification. In 2014, Cheng et al. used Gabor filters for feature extraction and a multi-layered deep network to train the model [17]. Tősér et al. used a CNN to deal with head pose variation and experimented with occluded facial images by using a small 21 × 21 pixel patch near the lip and cheek regions [18]. In 2020, Georgescu et al. used a transfer-learning-based CNN method for teacher-student training under occlusion [19]. While more occlusion types and datasets have been included in research, most of the work focuses on a limited number of artificially created occlusion types, and progress has been relatively slow. In this research work, a facial emotion recognition system based on a CNN is proposed for facial images with a face mask; we use the upper facial region (eyes and forehead) to train the network. The design of the proposed architecture for the framework is described in the next section.
3 Proposed Architecture of DFERSFM
In the proposed work, a CNN-based deep learning technique for emotion recognition under the facial mask occlusion constraint is developed for offline images and also tested in real time. The CNN framework consists of three convolution blocks for feature extraction followed by two dense blocks. Each convolution block is formed by one convolution layer followed by batch normalization, an activation function and a pooling layer. The first dense block comprises one fully connected (FC) layer followed by batch normalization and an activation function, while the second dense block is one FC layer followed by a Softmax function for classification, as shown in Fig. 2. Refer to Table 1 for a description of the layers. There is no suitable dataset available that contains facial images with facial mask occlusion; hence, to train the CNN model, image pre-processing is used to crop the upper facial region, which includes the eyes and forehead. Colored images are also converted to gray images during pre-processing to accelerate training. We use the CMU MultiPIE database for training; the cropped gray images are used to train the layered network. The CMU MultiPIE database is a set of 337 subjects with 5 expressed emotions (Neutral, Happy, Surprise, Anger and Disgust). A few dataset samples and cropped images are shown in Fig. 3 and Fig. 4 respectively.

Table 1. Details of DFERSFM layers (CNN model)

Block   Layers
Input   Image: 201 × 131 × 1
Conv1   Conv2D_1: {128 filters, 3 × 3; Padding [2 2]; Stride [1 1]}; Batch Normalization; ReLU activation; Maxpooling: {2 × 2; Stride [2 2]}
Conv2   Conv2D_2: {128 filters, 3 × 3; Padding [2 2]; Stride [1 1]}; Batch Normalization; ReLU activation; Maxpooling: {2 × 2; Stride [2 2]}
Conv3   Conv2D_3: {256 filters, 3 × 3; Padding [2 2]; Stride [1 1]}; Batch Normalization; ReLU activation; Maxpooling: {2 × 2; Stride [2 2]}
Dense1  Flatten; FC_1: 64; Batch Normalization; ReLU activation
Dense2  FC_2: 5; Softmax function classifier

*Conv: convolution block. **Conv2D: 2D convolution layer.
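A minimal Keras sketch of the layer stack in Table 1 is given below; the explicit [2 2] padding of the original is approximated with 'same' padding, which is an assumption of this illustration.

```python
# Hedged sketch of the Table 1 stack: three conv blocks (128, 128, 256 filters),
# a 64-unit dense block and a 5-way softmax output for the five emotion classes.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(model, filters):
    model.add(layers.Conv2D(filters, (3, 3), strides=1, padding="same"))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.MaxPooling2D((2, 2), strides=2))

def build_dfersfm(input_shape=(201, 131, 1), num_classes=5):
    model = tf.keras.Sequential([layers.Input(shape=input_shape)])
    for filters in (128, 128, 256):
        conv_block(model, filters)
    model.add(layers.Flatten())
    model.add(layers.Dense(64))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```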
The algorithm of DFERSFM is shown in Table 2 and described in the following subsections:
3.1 Feature Extraction Process
The convolution layer is used to extract specific features. Convolution includes shift, multiply and add operations, and the primary component of convolution-layer processing is filtering (masking) using weight matrices. Mathematically, it is expressed as:

Cv(x_{i,j}) = \sum_{l=0}^{u-1} \sum_{k=0}^{v-1} f_n(l,k)\, x_{i-l,\, j-k}    (1)

Fig. 2. DFERSFM architecture

Fig. 3. Few samples of dataset
Fig. 4. Few samples of cropped facial images
where f_n is a filter with kernel size a × b (3 × 3 in DFERSFM) convolved with the input image x. In the convolution layer, the kernel size, number of filters, stride and padding play a major role in training and in mapping the output features to the input image. The following equations are used to obtain optimum values for these hyper-parameters:

I_{out} = (I_{in} - f + 2P)/S + 1    (2)

and

P = (f - 1)/2    (3)
where I_{out} and I_{in} are the output and input feature-map dimensions respectively, f is the kernel (filter) size, and P and S are the zero padding and the stride value respectively. Batch normalization is used to limit internal covariate shift and to reduce the dependency of the gradients on the parameters or their initial values. The Rectified Linear Unit (ReLU) activation function is used to introduce non-linearity after batch normalization, followed by a pooling layer to decrease the dimensionality. In the proposed work, the number of filters used for the first two convolution layers is 128 and for the third convolution layer is 256. Bayesian optimization and manual optimization are used to tune the hyper-parameters.
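A small worked example of Eqs. (2) and (3) is shown below, using the 3 × 3 kernel and stride of 1 described above.

```python
# Worked example of Eqs. (2)-(3): output feature-map size along one spatial
# dimension with a 3x3 kernel, stride 1 and the zero padding from Eq. (3).
def conv_output_size(i_in, f=3, stride=1):
    p = (f - 1) // 2                              # Eq. (3): zero padding
    return (i_in - f + 2 * p) // stride + 1       # Eq. (2)

print(conv_output_size(201))   # 201: spatial size is preserved before pooling
print(conv_output_size(131))   # 131
```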
3.2 Classification
After extraction, the output of the third block is fed to the first dense block for classification. The FC layer transforms the three-dimensional output of the last convolution layer into an output vector of dimension N = 64, followed by the ReLU activation function. The output of the first dense block is fed to the second block, which comprises an FC layer and a Softmax classifier. The Softmax function provides a probability distribution over the 5 emotion classes as output, expressed as:

P = \frac{e^{t}}{\sum_{n=1}^{N} e^{t_n}}    (4)

where t is the input vector, the number of elements N is 5 since 5 emotion classes are considered for classification, and n indexes the output units, n = 1, 2, ..., 5.
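A short numerical illustration of Eq. (4) is given below; the example logits are arbitrary values chosen only to show the normalization.

```python
# Numerical illustration of Eq. (4): softmax over the five emotion logits.
import numpy as np

def softmax(t):
    e = np.exp(t - np.max(t))        # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])    # example output of FC_2
print(softmax(logits), softmax(logits).sum())     # probabilities summing to 1
```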
4 Results and Analysis
For the proposed DFERSFM model, a total of 8250 images from the CMU MultiPIE dataset are used, of which 7500 are for training and 750 for testing: 1500 images per emotion for training and 150 images per emotion for validation. The training and validation accuracies achieved are 81.64% and 79.87% respectively with 5-fold cross validation. Training and validation images are chosen randomly, with no images common to both. To test DFERSFM in real time, we first use the Viola-Jones algorithm [20] for face detection and then use the trained model to classify the detected faces.

Table 2. DFERSFM algorithm

Step 1: Image pre-processing of the dataset - cropping and gray image conversion.
Step 2: Train model - load training labeled data as X = X(1), X(2), X(3), ..., X(N); N is the total number of classes.
Step 3: Create layers for extracting and learning features:
  Input image (dimensions)
  Convolution_Layer: {Number_of_filters; Kernel.Size (with random weights); padding; stride}
  Batch_Normalization; ActivationFunction_layer
  Pooling: {Kernel size; stride}
  ... (repeated for each convolution block)
  Dense_Layer: {Dimensional vector}; Batch_Normalization; ActivationFunction_Layer
  Dense_Layer: {Dimensional vector = N}; Softmax_Layer
Step 4: Training process - set training options for the network:
  Initial weights W = W0; initial bias θ = θ0
  options = Training_Options_SGD with properties: {Momentum = a, Initial_Learn_Rate, Max_Epochs, Mini_BatchSize}
  [model, info] = Train_Network(labelled_data, layers, options)
  Obtain the optimum values of the model parameters by tracking the difference between validation and training accuracies.
Step 5: Test the process in real time - capture the input image (webcam):
  while (true)
    Faces_I = Detect_faces(Haar cascaded classifier)
    Crop Faces_I
    for face in Faces_I:
      Predicted_Class = classify(model, face)
      Display classified emotion
    end for
  end while
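A hedged sketch of the real-time test loop of Table 2 is given below: Haar-cascade (Viola-Jones) face detection, cropping of the upper facial region and classification with the trained model. The checkpoint name, the 0.55 upper-face crop ratio and the label order are assumptions for illustration only.

```python
# Illustrative real-time loop: detect faces, crop eyes+forehead, classify emotion.
import cv2
import numpy as np
import tensorflow as tf

LABELS = ["Neutral", "Happy", "Surprise", "Anger", "Disgust"]     # assumed order
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = tf.keras.models.load_model("dfersfm.h5")                   # assumed checkpoint

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):
        upper = gray[y:y + int(0.55 * h), x:x + w]    # eyes + forehead region
        upper = cv2.resize(upper, (131, 201))         # (width, height) -> 201 x 131 x 1
        inp = upper[np.newaxis, ..., np.newaxis] / 255.0
        pred = LABELS[int(np.argmax(model.predict(inp, verbose=0)))]
        cv2.putText(frame, pred, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (0, 255, 0), 2)
    cv2.imshow("DFERSFM", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```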
Figure 5 shows facial emotion recognition with mask (cropped upper-face images) and without mask (ground-truth full images). The algorithm is tested in real time as shown in Fig. 6; in real time, the proposed system can recognize facial emotions from an average distance of 0.30 m to 3 m. The proposed system is also tested for multiple faces, and the results are shown in Fig. 7. The algorithm is implemented using Keras with TensorFlow as backend. DFERSFM is also implemented on the RPA platform and on a Raspberry Pi to make it compatible with future HRI applications. The network is trained with the Stochastic Gradient Descent (SGD) optimization algorithm with the following specifications: learning rate of 0.00008, decay of 10^-6, momentum of 0.9, batch size of 16, 600 epochs, 464 steps per epoch and 46 validation steps. DFERSFM is robust to various head poses and illumination variations and achieves good results in real time. In a few cases, happy is recognized as disgust and vice versa, as shown in the confusion matrix in Table 3. Table 3 shows that the anger recognition rate is 94%, higher than that of the other emotions for the CMU MultiPIE database, whereas the recognition rate of the happy emotion is lower because it is confused with disgust: the upper facial region shows similar features when expressing happy and disgust, which creates confusion between the two classes. The same observation holds for the anger and surprise emotions. It is also observed that DFERSFM could not detect the correct emotion in a few cases, as shown in Fig. 8. Snapshots of the RPA implementation are shown in Fig. 9, and the DFERSFM training process is shown in Fig. 10. A comparative summary of DFERSFM results with related literature is given in Table 4. Most of the reported work focuses on a limited number of artificially created occlusion types and none considers a partially masked face; the reported methods do not operate in real-time uncontrolled conditions and do not consider different pose and illumination variations. In contrast, DFERSFM shows promising results in real-time uncontrolled conditions with various head pose and illumination variations under face mask occlusion. The 82.9% accuracy achieved by Cheng et al. [17] is for frontal pose only, whereas the proposed method achieves 79.87% average accuracy across different head pose and illumination variations. The proposed method is also tested in real time with promising results, as shown in Figs. 6 and 7, and it is tested on the Ck and JAFFE datasets for comparison with existing literature, demonstrating better accuracy on the same datasets, as shown in Table 5.
Table 3. Confusion matrix for DFERSFM

Emotions   Anger    Disgust   Happy    Neutral   Surprise
Anger      94%      2%        0        0         4%
Disgust    4%       82.67%    10.67%   2.67%     0
Happy      10%      13.67%    71%      4%        1.33%
Neutral    0        0.67%     12%      71.7%     15.63%
Surprise   13.33%   2.67%     9.33%    6.67%     80%
Average accuracy: 79.87%
Table 4. DFERSFM results comparison with existing literature

[Author, year]        Technique used                                      Data set used   Acc.*    PI/II/MF*
[Towner, 2007] [12]   PCA, SVM                                            Ck              70%      No/No/No
[Cotter, 2011] [15]   Fusion of local sparse representation classifiers   JAFFE           77%      No/No/No
[Cheng, 2014] [17]    Gabor filters, Deep network                         JAFFE           82.9%    No/No/No
[Tősér, 2016] [18]    CNN                                                 BP4D            55%      PI/No/No
DFERSFM               CNN                                                 CMU MultiPIE    79.87%   PI/II/MF

*Acc.: Average accuracy, PI: Pose Invariant, II: Illumination Invariant, MF: Multiple faces. **NR: Not reported.
Testing dataset CMU Multi-pie JAFFE Ck JAFFE Ck
Accuracy 79.87% 50% 53.01% 98% 81%
Fig. 5. Results of DFERSFM for dataset images with ground truth images (last row)
390
S. Saxena et al.
Fig. 6. Results of DFERSFM in Real-Time
Fig. 7. Results of DFERSFM for Multiple faces in Real-Time
Fig. 8. Failure results of DFERSFM with ground truth images (last row)
Fig. 9. Snapshots of DFERSFM on UiPath RPA tool
Deep Facial Emotion Recognition System
391
Fig. 10. Training Process DFERSFM
5 Conclusion and Future Directions In this research work, a methodology for facial emotion recognition under facial mask occlusion constraint using deep learning technique is proposed. The proposed DFERSFM works successfully for multiple facial emotion recognition with facial mask occlusion under the uncontrolled conditions such as pose and illumination. Probability of recognized emotion is higher for anger compared to other emotions, whereas happy shows less recognition rate compared to other emotions as lower part of the facial region is occluded. In real-time input images, the processing time is between 0.032 and 0.037 s for CPU and 0.200–0.350 s for Raspberry Pi implementations respectively. The DFERSFM is also implemented for UiPath Robotic Process Automation making it suitable for HRI applications. The proposed method demonstrates comparable performance with existing literature, but, still there is scope to improve the accuracy. Also the proposed method has limitation to recognize emotion for ± 75º head pose from frontal which can be increased up to ± 90º. Future work involves developing algorithms to solve the constraints of upper face, left/ right face occlusion and to test these techniques on social robots as a real time application. Acknowledgment. The authors would like to thank all the volunteers for the experimentation and also would like to thank the host organization for providing CMU Multi PIE database. We thank all other researchers for making other relevant databases available for such research experiments.
392
S. Saxena et al.
References 1. Mehrabian, A.: Communication without words. Psychology Today 2, 53–56 (1968) 2. Dautenhahn, K.: Methodology & themes of human-robot interaction: a growing research field. Int. J. Adv. Robotic Syst. 4(1), 15 (2007) 3. Happy, S.L., Dasgupta, A., Patnaik, P., Routray, A.: Automated alertness and emotion detection for empathic feedback during e-learning. In: IEEE 5th Int. Conference on Technology for Education (T4E), Kharagpur, India, pp. 47–50 (2013) 4. Coco, M.D., Leo, M., Distante, C., Palestra, G.: Automatic emotion recognition in robotchildren interaction for ASD treatment. In: IEEE International Conference on Computer Vision Workshop, Santiago, pp.537–545 (2015) 5. Good fellow, I.J., et al.: Challenges in representation learning: a report on three machine learning contests. In: Workshop Challenges in Representation Learning (ICM12013), pp. 1– 8 (2013) 6. Rosalind, W.P.: Affective computing. MIT press, Cambridge (2000) 7. Suja, P., Tripathi, S.: Real-time emotion recognition from facial images using Raspberry Pi II. In: 3rd International Conference on Signal Processing and Integrated Networks, (SPIN), IEEE, Noida, India, pp. 666–670 (2016) 8. Gamborino, E., Yueh, H., Lin, W., Yeh, S., Fu, L.: Mood estimation as a social profile predictor in an autonomous, multi-session, emotional support robot for children. In: 28th IEEE International Conference on Robot and Human Interactive Communication (ROMAN), New Delhi, India, pp. 1–6 (2019) 9. Gross, R., et al.: Guide to the CMU Multi-PIE Database. Carnegie Mellon University, Technical report, The Robotics Institute (2007) 10. Bourel, F., Chibelushi, C.C., Low, A.A.: Recognition of facial expressions in the presence of occlusion. In: 12th British Machine Vision Conference, pp. 213–222 (2001) 11. Bourel, F., et al.: Robust facial expression recognition using a state-based model of spatiallylocalised facial dynamics. In: Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pp. 106–111 (2002) 12. Towner, H., Slater, M.: Reconstruction and recognition of occluded facial expressions using PCA. In: Paiva, A.C.R., Prada, R., Picard, R.W. (eds.) ACII 2007. LNCS, vol. 4738, pp. 36– 47. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74889-2_4 13. Kotsia, I., et al.: An analysis of facial expression recognition under partial facial image occlusion. Image Vis. Comput. 26(7), 1052–1067 (2008) 14. Xia, M., et al.: Robust facial expression recognition based on RPCA and AdaBoost. In: 10th Workshop on Image Analysis for Multimedia Interactive Services, pp. 113–116 (2009) 15. Cotter, S.F.: Recognition of occluded facial expressions using a fusion of localized sparse representation Classifiers. In: IEEE Digital Signal Processing Workshop and IEEE Signal Processing Education Workshop, pp. 437–442 (2011) 16. Liu, S., et al.: Facial expression recognition under partial occlusion based on weber local descriptor histogram and decision fusion. In: 33rd Chinese Control Conference, pp. 4664– 4668 (2014) 17. Cheng, Y., et al.: A deep structure for facial expression recognition under partial occlusion. In: Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, pp. 211–214 (2014) 18. Tősér, Z., Jeni, L.A., Lőrincz, A., Cohn, J.F.: Deep learning for facial action unit detection under large head poses. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 359–371. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_29
Deep Facial Emotion Recognition System
393
19. Georgescu, M., Ionescu, R.T.: Teacher-student training and triplet loss for facial expression recognition under occlusion. ArXiv, abs/2008.01003(2020) 20. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision 57, 137–154 (2004) 21. https://www.uipath.com/ 22. https://www.raspberrypi.org/products/raspberry-pi-3-model-b-plus/
Domain Adaptation Based Technique for Image Emotion Recognition Using Image Captions Puneet Kumar(B)
and Balasubramanian Raman
Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, India {pkumar99,bala}@cs.iitr.ac.in
Abstract. Images are powerful tools for affective content analysis. Image emotion recognition is useful for graphics, gaming, animation, entertainment, and cinematography. In this paper, a technique for recognizing the emotions in images containing facial, non-facial, and nonhuman components has been proposed. The emotion-labeled images are mapped to their corresponding textual captions. Then the captions are used to re-train a text emotion recognition model as the domainadaptation approach. The adapted text emotion recognition model has been used to classify the captions into discrete emotion classes. As image captions have a one-to-one mapping with the images, the emotion labels predicted for the captions have been considered the emotion labels of the images. The suitability of using the image captions for emotion classification has been evaluated using caption-evaluation metrics. The proposed approach serves as an example to address the unavailability of sufficient emotion-labeled image datasets and pre-trained models. It has demonstrated an accuracy of 59.17% for image emotion recognition. Keywords: Image emotion recognition · Scene understanding · Domain adaptation · Image captioning · Text emotion recognition
1
Introduction
Images are important tools to express various emotions. Humans are capable of understanding emotion-related information from images containing human faces, activities, non-human objects, and backgrounds. The need to develop computational systems that can recognize emotions in generic images is rapidly rising. Such systems are useful for a wide range of applications such as entertainment, cinematography, gaming, animation, graphics, marketing, and lie detection [1,2]. Figure 1 shows some sample images portraying various emotions. Some of these images contain facial information; some contain non-facial human components, while some contain non-human objects. Low-level features such as color, shape, texture, and edges, and high-level features such as facial structure, object formation, and background both contribute to emotional expression in the images. c Springer Nature Singapore Pte Ltd. 2021 S. K. Singh et al. (Eds.): CVIP 2020, CCIS 1377, pp. 394–406, 2021. https://doi.org/10.1007/978-981-16-1092-9_33
Image Emotion Recognition Using Domain Adaptation
395
Fig. 1. Sample images reflecting various emotions
Image Emotion Recognition (IER) is a sub-field of Affective Image Analysis which involves studying the emotional responses of people towards visual contents [1]. IER has been explored at semantic-level analysis [3]. However, image analysis at affective-level is more difficult than semantic analysis, and researchers have recently started to explore it in more detail [2,4,5]. Object classification, image recognition, and other computer vision areas have observed significant performance boost with the rise of deep learning [3,6]. Deep networks have been very successful in facial emotion analysis also. However, predicting emotions from generic images, including facial, non-facial, and non-human components, is complex. The deep networks have not been able to perform well in identifying the visual emotion features constituted by a mix of low, mid, and high-level features. IER also faces the challenges of human subjectivity in expressing and labeling various emotions. The unavailability of well-labeled large scale datasets and pre-trained models also poses a challenge for IER. The proposed technique involves two phases. The first phase maps emotionlabeled images to corresponding textual descriptions using an attention-based pre-trained neural image captioning model. The suitability of using the image captions with a pre-trained Text Emotion Recognition (TER) model for emotion classification has been evaluated using appropriate metrics. In the second phase, a Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) based network has been tuned to recognize the emotions expressed by the image captions. It is pre-trained with emotion-labeled text data extracted from Twitter. We have adopted it by re-training it on image captions. Re-training the TER model on image captions is our domain-adaptation approach. The emotional information has been transformed from image-domain to text-domain by using image captions. Then, the TER model has been adapted to classify the captions
396
P. Kumar and B. Raman
into discrete emotion classes. As image captions have a one-to-one mapping with the images, the emotion labels predicted for the image captions have been considered as the emotion labels of the images. Emotion classification accuracy of 59.17% has been observed. The contributions of the paper are as follows. A technique to recognize emotions in generic images containing facial, non-facial and non-human components has been proposed. An approach to address the unavailability of sufficient emotion-labeled image datasets and pre-trained IER models has been introduced. A pre-trained TER model has been re-trained with the captions of emotion-labeled images to classify the emotions portrayed by unseen images. The remaining content of the paper is constructed as follows. A survey on related research is presented in Sect. 2. Section 3.2 describes the proposed method. Experiments and results are presented in Sects. 4 and 5. The paper concludes in Sect. 6, along with highlighting the directions for future research.
2
Related Work
Several visual emotion recognition approaches developed in recent years have been briefly reviewed in this section. 2.1
Feature Based Semantic Image Analysis
IER was initially done using low-level features such as shape, edge, and color [7]. Joshi et al. [1] found out that mid-level features such as optical balance and composition also contribute towards image aesthetics. S. Zhao et al. [8] applied mid-level features for image emotion classification. In another work, J. Machajdik et al. [4] used the semantic content of the images for emotion analysis. However, the methods mentioned above use hand-crafted features which are not likely to accommodate all low-level features, mid-level features, and image semantics. 2.2
Dimension and Category Based Visual Emotion Classification
One of the ways to classify IER methods is based on emotion dimensions and categories. Dimensional emotion space (DES) methods [2,9] use valence-arousalcontrol emotion space to describe and represent various emotional states. Categorical emotion states (CES) based methods [5,8,10] map the computational results to discrete emotion classes. CES methods are used more commonly as they are easier to understand. The proposed method is based on CES, which classifies a given image into happy, sad, hateful, and angry emotion classes. 2.3
Deep Learning Based Image Emotion Recognition Methods
Object classification, image recognition, and other computer vision tasks have successfully used CNN [3,6]. It extracts the visual features in an end-to-end manner without human intervention. However, the currently used CNN methods
Image Emotion Recognition Using Domain Adaptation
397
face difficulty in extracting mid and low-level image features, which are required to identify emotion-related information in the images [5] precisely. They also need large-scale well-labeled image datasets for training. Such datasets are not available in abundance, and emotion labeling is subjected to human variations. That is why existing methods trained for computer vision tasks do not perform well for IER. There is a need to modify and adapt them to apply them for IER. 2.4
Emotion Recognition in Various Modalities
In real-life scenarios, emotions are portrayed in various modalities, such as vision, speech, and text. In visual emotion recognition, most of the known approaches utilize the emotion-related information extracted through facial features. In this context, D. Pedro et al. [11] proposed an end-to-end network for facial emotion recognition using the attention model. However, facial emotion recognition methods are not suitable for IER in non-facial images. For speech emotion recognition, Stuhlsatz et al. [12] implemented neural network-based Discriminant Analysis with SVM. In another work, S. Sahoo et al. [13] utilized a pre-trained deep convolutional neural network to predict the emotion classes of the audio segments. The fusion of the textual description of the emotional information by S. Poria et al. [14] has been found to aid the emotion analysis for visual modality significantly. With that inspiration, the image emotion recognition technique by adapting TER models on image captions has been proposed. It also addresses the unavailability of sufficient emotion-labeled image datasets for training, the challenge in recognizing emotions from images in the absence of facial contents, and the inability of CNNs deployed for generic computer vision tasks to process low, mid and high-level visual features all at once.
3
Proposed Technique
Image captions have been generated from the images, and TER has been carried out on the captions by re-training a TER model as the domain-adaptation approach. The captions and images have a one-to-one mapping because captions attempt to verbalize the same emotional information that is visually presented by the images. The predicted emotion label of the captions has been considered the emotion label of the images. The image captions’ maturity has been evaluated using BLEU, SPICE, ROUGE, and Meteor scores. 3.1
Problem Formulation
Given the feature space X and source domain Dt where Dt ⊂ X, we define a corresponding task Tt as TER. For target task IER, given the feature space X as well as the target domain Di where Di ⊂ X and the corresponding task Ti as IER. The objective of the proposed work is to perform the target task Ti in Di by using the information of the source task Tt and Dt . Since Dt and Di are in different modalities, we map the target domain Di to the source domain’s
398
P. Kumar and B. Raman
feature space X by using image captioning. As shown in Eq. 1, mathematically, our translations are finding an image caption c that maximizes the conditional probability of cbest given the target image i. It is the most accurate description of the emotion-related information portrayed by i. cbest = argmax(p(c|i)) c ⎧ ⎪ where : ⎪ ⎪ ⎪ ⎪ ⎪ ⎨i : image c : caption ⎪ ⎪ ⎪p(c|i) : probability of caption f or given image ⎪ ⎪ ⎪ ⎩c best : caption with max. conditional probability
(1)
We use the most accurate description of the image as our mapping for each element in Di and call this domain formed from image captions as Di . As per [15], the problem of IER using TER models on image captions maps to DomainAdaptation as Dt and Di have the same feature-space and perform the same task but have different marginal distributions of their data points. The primary focus of the proposed technique is to classify emotions in a given image. It is challenging because large well-annotated training datasets and pre-trained IER models are not available. To address this challenge, we have attempted to adapt the target task in-line with the source task. A pre-trained TER model has been adapted and re-trained to classify the captions into discrete emotion classes. As image captions have a one-to-one mapping with the images, the emotion labels predicted for the image captions have been considered as the emotion labels of the images. The detailed methodology is described in Sect. 3.2. 3.2
Methodology
Various components of the proposed method are shown in Fig. 2 and the corresponding phases have been discussed in the following sections. Phase I: Caption Generation - This phase takes an image and generates a n words long caption c = {c1 , c2 , . . . , cn }, which is a sentence describing the content of the image. Attention based neural image caption generation system proposed by K. Xu et al. [16] has been implemented in this phase. A CNN is used as the encoder to extract visual features from the image. It extracts a set of feature vectors which is a d dimensional representation of the image; a = {a1 , a2 , . . . , am } where ai ∈ Rd and m is the number of the vectors in the decoded feature representation. A LSTM recurrent neural network is used to decode these features into a sentence. The matrices presented in decoder LSTM block denote pixel values of sample input image. They are flattened and used to compute the feature vectors. Multiple captions are generated for a given image and then the most probable one is selected. The conditional probability of caption c given the source image i is mathematically represented in Eq. 2.
Image Emotion Recognition Using Domain Adaptation
399
Caption Generation Phase attention over feature map A man smiling with closed hands and eyes closed
feature map
Encoder CNN Decoder LSTM
Caption
Emotion Recognition Phase Text Embedding
CNN-LSTM based Text Emotion Recognition model
Re-train on Captions
Output Emotion Class
'Happiness'
Input Image
Fig. 2. Schematic diagram of the proposed methodology
p(c|i) =
n
p(cj |c1 , c2 , ..., cj−1 , h)
j=1
⎧ ⎪ where : ⎪ ⎪ ⎪ ⎪ ⎪ ⎨j : the iterator variable n : number of words in the caption ⎪ ⎪ ⎪ cj : j th word of the caption sentence ⎪ ⎪ ⎪ ⎩h : hidden state computed f rom image f eature vector
(2)
The hidden state information, h, is calculated using the image feature vector and used for calculating the probability to decode each word. The probability values are computed after applying attention on each cell of encoder’s LSTM. Phase II: Emotion Recognition - The emotion recognition module is based on the Twitter sentiment analysis model proposed by M. Sosa [17], which is pre-trained to classify the Twitter sentences into positive or negative emotion labels. We have extended it to classify text sequences into discrete emotional classes and then re-trained the TER model on image captions. Re-training the TER model on image captions is our domain-adaptation approach. The emotion recognition phase implements a CNN and LSTM based dual-channel network. Its architecture has been shown in Fig. 3. The CNN-LSTM channel takes the text sentences’ embeddings as input, extracts the features, and feeds them to Bidirectional LSTM (Bi-LSTM). The
400
P. Kumar and B. Raman
LSTM-CNN channel first produces the sequential output from the text sentences and then feeds it to CNN for feature extraction. The output form both the channels is concatenated and passed through the max-pooling layer, followed by a fully connected layer and then softmax activation function to predict the output emotion class. We have used categorical cross-entropy as the loss function.
4
Implementation
This section discusses the implementation settings and evaluation strategies. The experimental results have been presented in Sect. 5. LSTM-CNN Channel
Softmax
Fully Connected
Flatten
Dropout
Max-pool
Embedding
Input Text
Concatenate
Dropout
Conv1D
Bi-LSTM
Happiness Sadness Hate Anger
Bi-LSTM
Dropout
Conv1D
CNN-LSTM Channel
Fig. 3. Architecture of the emotion recognition phase
4.1
Experimental Setup
Model training has been performed on Nvidia RTX 2070 GPU with 2304 CUDA cores, 288 Tensor cores, and 8 GB Virtual RAM. Model testing has been carried out on Intel(R) Core(TM) i7-7700, 3.70 GHz, 16 GB RAM CPU machine with 64-bit Ubuntu Linux OS machine. The datasets used in the implementation have been described in Section 4.2. Tensorflow1 library has been used to implement caption generation and emotion recognition phases. 4.2
Datasets and Training Strategy
We have used the IER dataset compiled by Y. Quanzeng [18]. It contains 23,308 weakly-labeled images. We have re-prepared the dataset with ‘happy,’ ‘sad,’ 1
https://www.tensorflow.org/.
Image Emotion Recognition Using Domain Adaptation
401
‘hate,’ and ‘anger’ emotion classes. The images labeled as ‘amusement’ and ‘contentment’ have been re-labeled as ‘happiness,’ The images labeled as ‘disgust’ have been merged into the ‘hate’ category’s data. The TER model was first trained with the Tweet data, and model check-point was saved. Then the model was re-trained for randomly selected 2,000 captions. 4.3
Ablation Study
The ablation study has been performed to select the appropriate architecture for the emotion recognition phase. The summary of the ablation study has been presented in Table 1. The emotion recognition performance was checked for the pre-trained model and after re-training it with captions of emotion-labeled images. The best performance was obtained with the network containing the concatenation of the CNN-LSTM channel and the LSTM-CNN channel. Table 1. Summary of the ablation study Network architecture
Accuracy Pre-trained model Adapted model
Feed-forward network
28.20%
47.18%
LSTM
31.13%
52.25%
CNN
29.67%
49.67%
LSTM-CNN
36.71%
58.00%
CNN-LSTM
36.92%
56.76%
CNN-LSTM + LSTM-CNN 37.12%
4.4
59.17%
State-of-the-art (SOTA) Methods for Performance Comparison
Following state-of-the-art models have been considered for performance comparison. The first method is for the image captioning phase, and the rest three methods are for the emotion classification phase. It is to be noted that appropriate CES methods with known classification accuracy results have been considered here. • L. Zhou et al. [19], Vision Language Pre-training (VLP): State-of-the-art attention-based model for vision-language generation tasks such as image captioning and visual question-answering. • S. Zaho et al. [8], Feature-based IER: Image Emotion Recognition approach using low and mid-level visual features. • T. Rao et al. [20], Instance Learning based IER: Image Emotion Recognition appraoch using multiple instance learning. • Q. You et al. [18], Fine-tuned-CNN: A method to build large IER dataset from weakly-labeled images, and to provide benchmark results using CNNs.
402
4.5
P. Kumar and B. Raman
Evaluation Metrics
The following metrics have been considered to evaluate the suitability of the image captions to get correct image description. Emotion classification results have been evaluated based on accuracy and confusion matrix. • BLEU Score [21]: Bilingual evaluation understudy (BLEU) uses precision measure and compares the predicted caption against the original caption. • ROUGE [22]: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) uses recall measure to compare machine-generated sequences against humanproduced references. • METEOR [23]: Metric for Evaluation of Translation with Explicit ORdering (METEOR) uses harmonic mean of unigrams’ precision and recall to evaluate machine-generated sequences. • SPICE [24]: Semantic Propositional Image Caption Evaluation (SPICE) is an automated caption and sentence evaluation metric that considers the sensitivity of the n-grams during the evaluation.
5
Results and Evaluation
This section discusses the results of the caption generation and emotion recognition phases of the proposed methodology. 5.1
Phase I: Image Captioning Results
Normalized scores of image caption evaluation metrics discussed in Sect. 4.5 are observed for the generated image captions, as compared to the scores for the SOTA method mentioned in Sect. 4.4. As shown in Table 2, the scores are comparable, which affirms the suitability of image captions for their use in the emotion classification phase. Table 2. Evaluation of the image captions Method
Evaluation metric BLEU ROUGE METEOR SPICE
VLP, L. Zhou [19] 0.3007 0.4628
0.2310
0.1743
Proposed
0.3373
0.1344
0.4367 0.4759
Image Emotion Recognition Using Domain Adaptation
5.2
403
Phase II: Emotion Classification Results
Accuracy & Confusion Matrix - The proposed technique has demonstrated an accuracy of 59.17% for IER. Figure 4 shows the confusion matrix.
Fig. 4. Confusion matrix
Results Comparison - Table 3 shows IER performance of the proposed method in comparison to the SOTA methods described in Sect. 4.4. Table 3. Emotion classification performance comparison with state-of-the-art methods Method
Author
Accuracy
Feature-based IER
S. Zaho et al. [8] 46.52%
Instance Learning based IER T. Rao et al. [20] 51.67% Fine-tuned-CNN Proposed Method
Q. You et al. [18] 58.30% 59.17%
Sample Results - The quantitative results have been shown in the above sections. This section presents the qualitative results by showing some of the sample images in Fig. 5, which got classified into various emotion classes.
404
P. Kumar and B. Raman
Fig. 5. Sample results
Discussion: The proposed technique has been able to provide good IER results. The adapted text emotion recognition model had shown TER accuracy of 71.68% with the Tweet dataset and that of 28.96% while tested with image captions without re-training. On re-training and adapting the TER model with image captions, it showed an overall accuracy of 59.17%. As per Fig. 4, the emotion class ‘Happiness’ is the most correctly classified while the results for ‘Sadness’ and ‘Hate’ classes are observed to be misclassified in some cases. The emotion classification performance depends upon the quality of the captions. It can be observed from Fig. 5 that some of the captions have correctly captured the correct emotional information from the images. At the same time, some other captions have failed to perform accurate detection of various objects and emotional context in the images.
6
Conclusion and Future Work
The purpose of this paper was to develop a technique to recognize the emotions portrayed in generic images containing human faces, activities, non-human objects, and backgrounds. A domain adaptation based technique has been used to achieve this. Image descriptions have been obtained using image captioning, and text emotion recognition model has been adapted by re-training it to classify the captions into appropriate emotion classes. For future research, we will work to improve upon the image captioning module in order to enable it to detect the emotional context from the images more accurately. There is a possibility of a loss of emotional context information
Image Emotion Recognition Using Domain Adaptation
405
while transforming the problem from visual to textual modality. It is also planned to explore domain adaptation approaches in such a way to analyze the images for IER in visual modality only, as opposed to transforming them into textual modality and analyzing there. We also plan to expand IER for more emotion classes such as fear, disgust, amusement, excitement, and contentment. Acknowledgements. This research was supported by the Ministry of Human Resource Development (MHRD) INDIA with reference grant number: 1-3146198040.
References 1. Joshi, D., et al.: Aesthetics and emotions in images. IEEE Signal Process. Mag. 28(5), 94–115 (2011) 2. Kim, H.-R., Kim, Y.-S., Kim, S.J., Lee, I.-K.: Building emotional machines: recognizing image emotions through deep neural networks. IEEE Trans. Multimed. 20(11), 2980–2992 (2018) 3. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440 (2015) 4. Machajdik, J., Hanbury., A.: Affective image classification using features inspired by psychology and art theory. In: Proceedings of the 18th ACM International Conference on Multimedia (MM), pp. 83–92 (2010) 5. Rao, T., Li, X., Xu, M.: Learning multi-level deep representations for image emotion classification. Neural Process. Lett., 1–19 (2019) 6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105 (2012) 7. Hanjalic, A.: Extracting moods from pictures and sounds: towards personalized TV. IEEE Signal Process. Mag. 23(2), 90–100 (2006) 8. Zhao, S., Gao, Y., et al.: Exploring principles-of-art features for image emotion recognition. In: Proceedings of the 22nd ACM International Conference on Multimedia (MM), pp. 47–56 (2014) 9. Zhao, S., Yao, H., Gao, Y., Ji, R., Ding, G.: Continuous probability distribution of image emotions via multitask shared sparse regression. IEEE Trans. Multimedia 19(3), 632–645 (2016) 10. Zhao, S., Ding, G., et al.: Discrete probability distribution prediction of image emotions with shared sparse learning. IEEE Trans. Affective Comput. (2018) 11. Fernandez, P.D.M., Pe˜ na, F.A.G., Ren, T.I., Cunha, A.: FERAtt: facial expression recognition with attention net. arXiv preprint arXiv:1902.03284 (2019) 12. Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., Schuller., B.: Deep neural networks for acoustic emotion recognition: raising the benchmarks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5688–5691 (2011) 13. Sahoo, S., Kumar, P., Raman, B., Roy, P.P.: A segment level approach to speech emotion recognition using transfer learning. In: Palaiahnakote, S., Sanniti di Baja, G., Wang, L., Yan, W.Q. (eds.) ACPR 2019. LNCS, vol. 12047, pp. 435–448. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-41299-9 34 14. Poria, S., Cambria, E., Howard, N., Huang, G.-B., Hussain, A.: Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Elsevier Neurocomputing 174, 50–59 (2016)
406
P. Kumar and B. Raman
15. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009) 16. Xu, K.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning (ICML), pp. 2048–2057 (2015) 17. Sosa, P.M.: Twitter sentiment analysis using combined LSTM-CNN models. ACADEMIA, CS291, University of California, Santa Barbara (2017) 18. You, Q., Luo, J., Jin, H., Yang, J.: Building a large-scale dataset for image emotion recognition: the fine print and benchmark. In: Conference on Association for the Advancement of AI (AAAI) (2016) 19. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J.J., Gao, J.: Unified visionlanguage pre-training for image captioning and VQA. In: Conference on Association for the Advancement of AI (AAAI) (2020) 20. Rao, T., Xu, M., Liu, H., Wang, J., Burnett, I.: Multi-scale blocks based image emotion classification using multiple instance learning. In: IEEE International Conference on Image Processing (ICIP), pp. 634–638 (2016) 21. Papineni, K., Roukos, S., Ward, T., Zhu., W.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the Association for Computational Linguistics (ACL), pp. 311–318 (2002) 22. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004) 23. Lavie, A., Denkowski, M.J.: The METEOR metric for automatic evaluation of machine translation. Springer Machine Translation, vol. 23, no. 2, pp. 105–115 (2009) 24. Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi. org/10.1007/978-3-319-46454-1 24
Gesture Recognition in Sign Language Videos by Tracking the Position and Medial Representation of the Hand Shapes Syroos Zaboli1(B) , Sergey Serov2 , Leonid Mestetskiy2 , and H. S. Nagendraswamy1 1
2
University of Mysore, Mysore, India {syroos,hsnswamy}@compsci.uni-mysore.ac.in Lomonosov Moscow State University, Moscow, Russia
Abstract. This paper proposes a gesture recognition approach in which morphological and trajectories are analysed. A video input containing the gesture demonstration is given to the algorithm as an input and the analysis are carried out frame by frame in 3 stages. During the Primary processing stage, the location and shape of the face and hands are identified. Secondary processing is concerned with extracting information on the change in these parameters throughout the video sample. At the decision-making stage, a two-step algorithm is used based on comparison of the motion trajectories and key object shapes between the test gesture and the reference gestures. The experiments were performed on a set of selected classes from UOM-SL2020 sign language data set and a high recognition F1 -score of 0.8 was achieved. Keywords: Gesture recognition · Video sequence analysis Trajectories comparison · Medial representation
1
·
Introduction
Sign Language recognition systems have been a topic of interest over the past two decades as it touches upon a number of domains such as gesture control systems, human-computer interfaces, 3D object analysis, and many other pattern recognition and computer vision techniques. However, despite such a wide range of studies in this area, a universal efficient recognition method for both static and dynamic sign language is yet to be achieved. The reported study was funded by RFBR and DST of Government of India according to the research project 16-57-45054 «Exploration of Continuous Morphological Models for Analysis and Recognition of Dynamic Depth Images/Videos»and by RFBR according to the research project 20-01-00664 «Morphological analysis of images and videos based on continuous medial representation and machine learning». c Springer Nature Singapore Pte Ltd. 2021 S. K. Singh et al. (Eds.): CVIP 2020, CCIS 1377, pp. 407–420, 2021. https://doi.org/10.1007/978-981-16-1092-9_34
408
S. Zaboli et al.
In this paper, we focus on dynamic sign sentence videos and formulate a suitable solution to the recognition problem such that it can be used to address wide range of classes. This technique is applicable to both static and dynamic signs, as basics of the algorithms is not limited to the semantics of the sign gesture alone. The proposed approach consists of 3 stages: -determining the position and the shape of the objects in the frame using segmentation and building of medial representation of the corresponding blobs, -tracing the changes in the position and the shape of the objects between frames and -classification based on comparison with reference gestures.
2
Statement of the Problem
2.1
Definitions
Definition 1. Here, the gesture G is defined as the movement of a body part, especially a hand or the head, in order to express an idea or meaning as a sequence of images (I1 , . . . , I|G| ), where Ij ∈ Rm×n , j = 1, |G|, are the frames of the video containing the gesture G. Definition 2. A blob S on a binary image I is a polygonal figure approximating the connected components of this image [6]. A polygonal figure has separating polygons of the minimum perimeter as its borders. Definition 3. The medial representation M (S) of the blob S on a binary image is defined as the pair (Sk, R) of its morphological skeleton [6] Sk = (V, E), where V and E are the sets of its vertices and edges respectively. The radial function R : V → R, associates with each vertex in terms of the radius of the maximum inscribed circle centered at that vertex. Definition 4. The key objects are defined as regions of interest namely; the face F , the left hand L and the right hand R. Definition 5. The trajectory T ω (G) of the object ω ∈ {F, L, R} in the video sequence G = (I1 , . . . , I|G| ) is defined in terms of a sequence of sixes (x, y, r, vx , vy , t)j , j = 1, |G|, where: – – – – – –
x is an estimate of its current position along the x axis; y is an estimate of its current position along the y axis; r is an estimate of its current size; vx is an estimate of its current speed along the x axis; vy is an estimate of its current speed along the y axis; t ∈ {0, 1, 2, 3, 4}, where each non-zero value describes the type of fusion the key objects have in the current frame. t = 0 corresponds to the absence of fusion.
Gesture Recognition by Tracking the Position and Shape of the Hand
2.2
409
The Problem Statement for Gesture Recognition
Let each gesture G be presented by a sequence of RGB or RGB-d images (I1 , . . . , I|G| ), where |G| is the number of frames of the gesture G (example— Fig. 1a). Let C = {c1 , . . . , c|C| } be the set of classes given, with each gesture belonging to exactly one class. Let Gref = {G1 , . . . , GN } be a set of reference gesk tures where, for each Gk ∈ Gref , k = 1, N , the sequence of frames (I1k , . . . , I|G k|) k and the index of the class c from the set C are known. The control video sequence containing exactly one gesture G , consisting of images (I1 , . . . , In ) is given to the input of the algorithm. Based on the comparison with reference base Gref , it is necessary to determine the index c ∈ {1, . . . , |C|} of the gesture class present on it.
3
Related Work
There have been many attempts and approaches towards achieving an efficient SLR (Sign Language Recognition) system. While some incorporated sensory and glove-based methods [8], others opted for image processing and pattern recognition techniques such as Spatial Relationship Based Features [4], convolutional neural networks [3], Active Appearance Models [7]. Another approach was to perform the SLR at a finger spelling level [10] where the recognition solely revolved around various hand shapes describing alphabets and numbers. SLR at a word level [12] and at a sentence level [1] further more increased the applicability of the recognition to the real world SLR problems which involves sequence of continuous hand gestures constituting sign language messages. On the other hand, some interesting attempts [2] considered facial expression as a part of recognition task. From a training perspective, using Hidden Markov Models (HMM) in recognition systems have been common [13], where Markov models are as many as the gesture classes. Implementation of deep learning in SLR systems have also been attempted [14]. Thus, based on the study done on related work, challenges such as necessity of huge data set for deep learning, errors due to video frame rate and resolution, speed at which message is signed by a signer and various variations between different signers or the same signer signing the same sentence, are few of the many challenges that are yet to be explored.
4
Trajectorial-Morphological Method
The trajectorial-morphological approach proposed in this paper to address the sign language gesture recognition is an improved version of the method proposed in [5]. The two main factors considered to distinguish one gesture from another are: – Firstly the key objects trajectories (Definition 4), which describes the change of their position in the frame during the video.
410
S. Zaboli et al.
– And secondly, the dynamics of the shape of key objects which provides information on changes in their shapes throughout the video. The proposed method is comprised of three sub-tasks: 1. Identifying the position and shape of key objects in the frame; 2. Tracking changes in the position and shape of key objects between frames; 3. Classification based on the comparison of the constructed gesture description with the reference examples. The general structure of the proposed trajectorial-morphological approach is described in (Fig. 1b).
Fig. 1. Illustration of the input data and general structure of the approach
Note that the terminologies used in this work in solving the problem of gesture recognition are similar to the terminologies used in the task of processing radar information. So, primary processing includes segmentation of blobs of key objects and finding their medial representations. At the second stage, secondary processing of information is carried out, including capture of objects, attachment of the trajectory and further tracking of objects. If necessary, path resolution is performed. The processed gesture is denoted by G = (I1 , . . . , I|G| ) the processing steps are illustrated with an example. 4.1
Identifying the Position and Shape of Key Objects
In order to identify the position and shape of key objects in the frame (primary processing), the two main sub tasks are as follows: – Segment the face and hands to detect blobs in the frame; – Build a medial representation of detected blobs.
Gesture Recognition by Tracking the Position and Shape of the Hand
411
Face and Hands Segmentation. In case of RGB video, segmentation is performed on the basis of color information of the pixels (in the simplest case, using binarization); whereas in the case of RGB-d video, segmentation of hands is achieved by considering the pixel components at the closer distance from the camera. By applying the most suitable segmentation technique, key object blobs are extracted from each frame Ij , j = 1, |G|, of the gesture G as far as possible. Let Sj = {Sj,1 , . . . , Sj,|Sj | } be the set of blobs on this frame. The number of selected blobs is not always equal to 3 (that is, |Sj | = 3: two for the hands and one for the face). When analyzing three-channel video, in many cases less than three blobs are only extracted from each frame. This is due to the occlusion of hand over hand or hand over face on the camera’s line of sight. In the future, we shall address such frames as frames with fused blobs. The four types of fusion are: the left hand with the face (type 1), the right hand with the face (type 2), both hands together (type 3), and the overlap of all three key objects (type 4). If the number of blobs in the frame are more than 3 (the case |Sj | > 3), all the blobs below a certain size with respect to a predefined threshold level are ignored. Examples of segmented frames are presented in Fig. 2a. Building Medial Representations of Detected Blobs. Once the blobs are segmented from the set Sj , a medial representation of the key objects are constructed (Definition 3) as described in [6], filling the «holes»in them to remove segmentation noise. As a result, for each blob Sj,k , a graph Skj,k is achieved, where each vertex v is associated with the radius R(v) of the maximum inscribed circle with center at this point. In the future, we will use the maximum of the radii of such circles to estimate the size of the blob, and the internal structural graph to construct models of every shape present in the frame. For the convenience of notation, we order the set of Sj blobs in the current frame in decreasing radius of their maximum circle. If the number of blobs in the frame exceeds 3, only the blobs from Sj with the largest radius of maximum inscribed circle beyond a predefined threshold are selected. Examples of frames with constructed morphological skeletons are presented in Fig. 2b. 4.2
Tracking of Changes in the Position and Shape of Key Objects Between Frames
The previous stage provides the information on position and shape of blobs of interest in every frame Ij , j ∈ 1, |G|, in terms of medial presentation and corresponding positions. The process of tracking the changes in position and shape of the objects between consecutive frames (secondary processing) in a video comprises of the following three sub-tasks:
412
S. Zaboli et al.
Fig. 2. Illustration of primary processing of video frames
– Determining the correspondence of allocated blobs to key objects (tracking of objects); – Tracing the trajectories of key objects; – Registration of dynamics of the key object shapes. Determining the Correspondence of Allocated Blobs to Key Objects. Formally, if we designate an object «face»for F , an object «left hand»—for L, and an object «right hand»—for R, then the task is as follows: associate the index k of one of the blobs Sj,k , k = 1, |Sj | in the current frame to each of the objects F, L, R. Let Blob(ω, I) denote the function that maps the key object ω to the blob index on the binary image I. As a function of the distance between the blobs, we will use the Euclidean distance between the centers of their maximum inscribed circles—D(S1 , S2 ). For Cmax (S) we denote the maximum circle of the blob S, for center(C)—the geometric center of the circle C. To solve this problem, we propose the following steps in the algorithm. 1. On the 1st frame, accept the standard blob configuration: Blob(F, I1 ) = 1; 2, Blob(L, I1 ) = 3,
center(Cmax (Sj,2 )) is left than center(Cmax (Sj,3 )); otherwise; (1)
3, center(Cmax (Sj,2 )) is left than center(Cmax (Sj,3 )); Blob(R, I1 ) = 2, otherwise.
Gesture Recognition by Tracking the Position and Shape of the Hand
413
2. For each frame Ij , j = 2, |G|, calculate the speed and current estimated position of key objects: ω ω (1 − d)vx,j−1 + d(xω j − xj−1 ), j = 2, |G|; ω vx,j = ; 0, j=1 (2) ω ω x ˆω j = xj−1 + vx,j−1 ;
where d ∈ [0, 1]—arbitrary predetermined coefficient (similarly—for y). 3. If the number of blobs on the frame is equal to 3, set correspondence: Blob(F, Ij ) = arg min D(Fˆ , Sj,k ); k∈{1,2,3}
ˆ Sj,k ); Blob(L, Ij ) = arg min D(L,
(3)
k∈{1,2,3}
Blob(R, Ij ) = k ∈ {1, 2, 3} : k = Blob(F, Ij ), k = Blob(L, Ij ). 4. If the number of blobs on the frame is equal to 1: Blob(F, Ij ) = 1; Blob(L, Ij ) = 1;
(4)
Blob(R, Ij ) = 1; 5. If the number of blobs on the frame is equal to 2: Blob(F, Ij ) = arg min D(Fˆ , Sj,k ); k∈{1,2}
ˆ Sj,k ); Blob(L, Ij ) = arg min D(L, k∈{1,2}
(5)
ˆ Sj,k ). Blob(R, Ij ) = arg min D(R, k∈{1,2}
Calculate the value ρ equal to the maximum of the coordinate-wise aspect ratio of the bounding rectangle of the blob of the hand which is not fused with anything in the current frame compared to the previous one. If ρ > ρmax , then adjust the correspondence: Blob(L, Ij ) = k ∈ {1, 2} : k = Blob(F, Ij ); Blob(R, Ij ) = k ∈ {1, 2} : k = Blob(F, Ij ).
(6)
6. Determine and save the type of fusion tj,k on the current frame. Below there is an example of determining the correspondence of the blobs detected on the frame to the key objects (Fig. 3a). Here the circles of different colors refer to different objects: red—to the head, blue—to the left hand and green—to the right one.
414
S. Zaboli et al.
Tracing the Trajectories of Key Objects. After the correspondence of the allocated blobs to key objects is determined sequentially on each frame, it is necessary to conduct the process of tracing the trajectories of the key objects. An approach similar to the [9] approach is applied and the information on the trajectory T ω (G) for the movement of each of the objects ω ∈ {F, L, R} is updated. When processing every new frame Ij , the set of features fed to the existing ω ω ω ω ω trajectory T ω (G)is described with a set six values: (xω j , yj , rj , vx,j , vy,j , tj ) for every key object. In a special case when the hands are fused (tω j = 3), a pair of additional values (xj,stuck , yj,stuck ) are used, meaning the coordinates of the center of the bounding box of the fused blob, as well as the values (vx,j,stuck , vy,j,stuck ), indicating the speed of this blob. The rules for updating trajectories are as follows (for brevity, we present them only for the case of the hand and the coordinate x). Here c and d are arbitrary predetermined coefficients. – After the 1st frame, the hand speed is assumed to be 0, and its position is equal to the position of the corresponding blob. rjω : the current size is equal to the previous one if this hand is fused on the current frame, otherwise the radius of the maximum circle of the corresponding blob: ω rj−1 , if fused; ω rj = ; (7) R(center(Cmax (Sj,Blob(ω,Ij ) ))), otherwise ω : the current hand speed is 0 if it is fused, otherwise it is updated according vx,j to the rules: 0, if fused; ω vx,j = ; (8) ω ω ω (1 − d)vx,j−1 + d(xj − xj−1 ), otherwise
xω j : if there is no fusion of this hand on the current frame: xω j = center(Cmax (Sj,Blob(o,Ij ) ))x ; if fused with the face:
ω ˆω xω j =x j , if tω j
= L; or =1
ω tω j
= R; ; =2
(9)
(10)
if fused with the other hand: vx,j,stuck = xj,stuck − xj−1,stuck ;
(11)
if at the same time in the previous frame there was a fusion of this hand with the face: ω xω j = (1 − c)xj,stuck + cxj−1 ;
(12)
if there wasn’t a fusion with the face in the previous frame: ω ω xω j = cxj,stuck + (1 − c)(xj−1 + vx,j−1 + vx,j,stuck ).
(13)
Gesture Recognition by Tracking the Position and Shape of the Hand
415
We give an example of trajectory tracing (Fig. 3b). In the figure below, the trajectories of various key objects are shown in different colors.
Fig. 3. Illustration of the secondary processing of video fragment (Color figure online)
Registration of the Dynamics of the Shape of Key Objects. In the sequential analysis of the frames Ij , j = 1, |G|, we will also take into account changes in the shape of key objects based on medial representations M (Sj,k ) of the blobs Sj,k ∈ Sj which correspond to them. From the sequence of frames (I1 , . . . , I|G| ) for each gesture G we will select key chains of consecutive frames such that: – The angle between the direction from the center of the maximum circle to the farthest point of the skeleton and the downward direction (opposite the y axis) is greater than the predetermined threshold; – Change of this angle between frames is less than the specified threshold; – Chain length is greater than the specified threshold. From each key chain, we will select its middle frame as a key frame. So for every gesture G we compose its morphological profile M P (G), which is a pair of sets of medial representations of the blobs of the hands on the corresponding key frames, ordered by frame number. In the simplest case, we will choose only one key chain and the corresponding key frame for each of the hands. 4.3
Classification Based on the Comparison with the Reference Examples
The next stage after the primary and secondary stages of processing the video sequence is to propose a classification algorithm based on closeness to the reference gestures. Since the video recordings of gestures from the set Gref are marked as reference, and their classes are known, the task is a classical machine learning task with a teacher, in which objects are described in terms of trajectories and morphological profiles (Sect. 4.2).
416
S. Zaboli et al.
Calculation of Closeness of the Trajectories. At the first stage of decisionmaking, we calculate the closeness of the trajectory T (G) of the classified gesture to the trajectories of all the reference gestures Gk ∈ Gref . To do this, we align [11] them based on the dynamic programming method. For every reference gesture Gk from the base of references Gref the following steps are applied: 1. In pairs normalize the trajectories of the classified gesture G and the reference gesture Gk along the x axis in such a way that the intervals of coordinate changes along this axis become the same. The normalized trajectory of the reference gesture is noted as x ˜. k 2. Compute the matrix W ∈ R|G |×|G| of pairwise distances between the points of the trajectories taking into account the normalization (p = 1, |Gk |, q = 1, |G|): k,R L )2 + (y k,L − y L )2 + 2 Wp,q = (xk,L − x ˜ (xk,R −x ˜R − yqR )2 . p p p q q q ) + (yp (14) k 3. Gradually fill in the matrix U ∈ R|G |×|G| as follows in the order of increasing the sum p + q: U1,1 = W1,1 ; Up,1 = Up−1,1 + Wp,1 , p = 2, |Gk |; U1,q = U1,q−1 + W1,q , q = 2, |G|; Up,q = Wp,q + min{Up−1,q ; Up−1,q−1 ; Up,q−1 }, for other p i q.
(15)
Definition 6. Denote by Dtraj (G1 , G2 ) a function of the distance between the trajectories, the value of which for the two gestures G1 and G2 is equal to the value of the element U|G1 |,|G2 | of the matrix U after the above alignment algorithm is applied. Decision-Making. Let there be a classified video fragment G and a base of references Gref = {G1 , . . . , GN }. Then we will make the decision to classify the gesture according to the following rule. Calculate the distance Dtraj (G, Gk ). between the trajectories of the classified gesture G and each of the gestures Gk , k = 1, N , from the base of references. Match the gesture G with the class corresponding to the closest gesture Ga ∈ Gref .
5
Experiments
Testing of the proposed algorithm is performed on UOM-SL2020 sign language data set. A total of 220 videos from 11 selected classes from the data set were analysed for the purpose of experimentation. The selected classes are performed by 4 different individuals.
Gesture Recognition by Tracking the Position and Shape of the Hand
417
All the videos (55 videos) corresponding to one signer are assigned as base reference, and the remaining videos (165 videos) performed by the other three signers are set as test samples. Thus the ratio of testing to training is 3 : 1. For segmentation of the face and hands, we select only the red channel in the three-channel image and apply binarization according to a predetermined threshold. 5.1
Technical Details of the Experiments
All sign sentence videos considered were shot in a three-channel RGB format with a resolution of 1920 × 1080 pixels and a frame rate of 24 frames/sec. In order to speed up the processing time without loss of information, the video size and frame rate are reduced to 480 × 272 of resolution and 15 fps accordingly. The implementation of the algorithm and visualization of the results was carried out using the programming languages C++ and Python 3.7. 5.2
Visualization
Here are the examples of the key frames (Fig. 4b) used to describe the morphological profiles of the corresponding gestures. The red and blue rectangles stand for the key chains of the left and right hands, respectively. Yellow circles highlight the key frames of the chains. At the final stage, we will demonstrate (Fig. 4a) what the aligned trajectories of the hands and face look like. Here, the trajectories of the face are shown in red, the left hand in blue, and the right hand in green. The dashed lines refer to the test gesture, and the solid lines to the reference (Fig. 5).
Fig. 4. Illustration of decision-making (Color figure online)
418
S. Zaboli et al.
Fig. 5. Example of a key frame with the inscribed circle (white) and directions to the furthest point (red) and other fingers (blue) (Color figure online)
5.3
Results
It was observed that by training on 55 gestures signed with only one signer and testing on 165 gestures of the remaining 3 signers, a recognition F1 -score of 0.63 was obtained. And by considering only those 8 classes of gestures, which trajectories and medial representation differ more, a recognition F1 -score is 0.8 is achieved. This is also due to the fact that the other three sign sentences differ based on facial expressions as well, which is to be considered in our feature work as an additional distinguishable feature (Fig. 6).
Fig. 6. Illustration of confusion matrices
5.4
Discussion and Future Work
The calculation and estimation of recognition errors concludes to whether finalize the result based on trajectories alone or not.
Gesture Recognition by Tracking the Position and Shape of the Hand
419
If the trajectories of the gestures are very close, then a comparative analysis of the morphological profiles of gestures should be applied, the development of which is the main direction of further work. The morphological profile of the gesture G contains a pair of morphological profiles of the hands, each of which is an ordered set of medial representations corresponding to them. One of the priorities of future research in this direction is to introduce a metric to represent the gesture morphological profiles, which would allow us to determine the proximity of gestures in terms of changing in key objects shapes. This will help to significantly improve the quality of gesture classification, the trajectories of which are very similar, but the differences in the shape of the hands are significant. Further work also involves calculation of the difference in distances to the nearest gestures from the two nearest classes: ΔDtraj (G, Gk ) = |Dtraj (G, Ga ) − Dtraj (G, Gb )|, where a and b ∈ {1, . . . , N } mean the indices of the closest to the classified and the closest classified as another class gestures from the base of references. In this case, if ΔDtraj is greater than some predetermined threshold s, then assign the class Ga to the gesture G. Otherwise—use a comparison of morphological profiles of gestures and other additional features.
6
Conclusion
The paper proposes a trajectorial-morphological approach for solving the problem of gesture recognition in video sequences, the main idea of which is to decompose the problem into 3 sub tasks of: -determining the position and the shape of the objects in the frame, -tracing the changes in the position and the shape of the objects between frames and -classification based on comparison with reference gestures. At the first stage, the blobs of the key objects (hands and face) are segmented on each and every frame of the video fragment, and their medial representations are built. At the next stage, the changes in the position and the shape of the objects during sequential processing of video frames are traced and the correspondence between blobs in the frame and key objects is first established, and then information about their trajectories and changes in their shape is extracted. Special attention is paid to handling situations in which fusions of key object blobs accrue. At the classification stage, a logical decision-making algorithm is applied, taking into account the closeness of the gesture trajectories. The proposed trajectory-morphological approach for solving the problem of gesture recognition of video sequences, resulted in a high F1 -score recognition rate and can be applied to multi modal automated gesture recognition systems in order to improve their performance.
420
S. Zaboli et al.
References 1. Chethankumar, B.M., Nagendraswamy, H., Guru, D.: Symbolic representation of sign language at sentence level. I.J. Image Graph. Signal Process. 9, 49–60 (2015) 2. Das, S.P., Talukdar, A.K., Sarma, K.K.: Sign language recognition using facial expression. Procedia Comput. Sci. 58, 210–216 (2015). second International Symposium on Computer Vision and the Internet (VisionNet 2015) 3. Garcia, B., Viesca, S.A.: Real-time American sign language recognition with convolutional neural networks. In: Reports, Stanford University, USA (2016) 4. Kumara, B.M.C., Nagendraswamy, H.S., Chinmayi, R.L.: Spatial relationship based features for Indian sign language recognition. IJCCIE 3(2), 7 (2016) 5. Kurakin, A.V.: Real-time hand gesture recognition by planar and spatial skeletal models. Inform. Primen. 6(1), 114–121 (2012). (in Russian) 6. Mestetskii, L.M.: Continuous Morphology of Binary Images: Figures, Skeletons, and Circulars. Fizmatlit, Moscow (2009). (in Russian) 7. Piater, J., Hoyoux, T., Du, W.: Video analysis for continuous sign language recognition. In: 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, Valletta, Malta, January 2010 8. Ramakant, Shaik, N.e.K., Veerapalli, L.: Sign language recognition through fusion of 5dt data glove and camera based information. In: Souvenir of the 2015 IEEE IACC, pp. 639–643, July 2015 9. Sethi, I., Jain, R.: Finding trajectories of feature points in a monocular image sequence. IEEE Trans. PAMI 9, 56–73 (1987) 10. Suraj, M.G., Guru, D.S.: Appearance based recognition methodology for recognising fingerspelling alphabets. In: Proceedings of the 20th IJCAI, IJCAI 2007 pp. 605–610 (2007) 11. Theodoridis, S., Koutroumbas, K.: Template Matching. In: Pattern Recognition, chap. 8, pp. 321–329. Academic Press, 2 edn. (2003) 12. Wang, H., Chai, X., Chen, X.: Sparse observation (so) alignment for sign language recognition. Neurocomputing 175, 674–685 (2015) 13. Yamato, J., Ohya, J., Ishii, K.: Recognizing human action in time-sequential images using hidden Markov model. In: Proceedings 1992 IEEE CSCCVPR, pp. 379–385 (1992) 14. Zhang, Y., Cao, C., Cheng, J., Lu, H.: EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Trans. Multimedia 20(5), 1038–1050 (2018)
DeepDoT: Deep Framework for Detection of Tables in Document Images Mandhatya Singh(B)
and Puneet Goyal
Indian Institute of Technology Ropar, Ropar 140001, India {2017csz0003,puneet}@iitrpr.ac.in
Abstract. An efficient table detection process offers a solution for enterprises dealing with automated analysis of digital documents. Table detection is a challenging task due to low inter-class and high intra-class dissimilarities in document images. Further, the foreground-background class imbalance problem limits the performance of table detectors (especially single stage table detectors). The existing table detectors rely on a bottom-up scheme that efficiently captures the semantic features but fails in accounting for the resolution enriched features, thus, affecting the overall detection performance. We propose an end to end trainable framework (DeepDoT), which effectively detect the tables (of different sizes) over arbitrary scales in document images. The DeepDoT utilizes a topdown as well as a bottom-up approach, and additionally, it uses focal loss for handling the pervasive class imbalance problem for accurate predictions. We consider multiple benchmark datasets: ICDAR-2013, UNLV, ICDAR-2017 POD, and MARMOT for a thorough evaluation. The proposed approach yields comparatively better performance in terms of F1score as compared to state-of-the-art table detection approaches. Keywords: Table detection
1
· Table localization · Table analysis
Introduction
Tabular representation facilitates faster and better analysis and understanding of the underlying data or information. Effective and automated analysis of the tabular components (embedded in digital documents) is of prime importance, especially for enterprises dealing with vast digital documents/reports. Table analysis/understanding involves decoding the table’s logical and sequential structure by inferring the underlying functional relationship between cell elements. Table detection/localization is the primary and crucial step for an efficient table understanding process. Table detection (as shown in Fig. 1) provides the layout structure through table segments’ coordinates. The approaches [1,2] used heuristics and machine learning-based approaches for table understanding. These heuristics-based schemes pose issues like generalization constraints due to arbitrary tabular structure and layouts. The deep learning-based approaches evaded the need to create custom heuristics and handcrafted features and paved the path for real time object detection in natural c Springer Nature Singapore Pte Ltd. 2021 S. K. Singh et al. (Eds.): CVIP 2020, CCIS 1377, pp. 421–432, 2021. https://doi.org/10.1007/978-981-16-1092-9_35
422
M. Singh and P. Goyal
images. Despite the drastic improvement in the performance of deep learningbased approaches for natural images, it requires adaptive capabilities for capturing the discriminative features in the case of document images. This is because of the extremely varying layout characteristics of document objects (tables, charts, figures). Existing deep learning-based table detectors (for document images) fails to account for the following problems: (1) class imbalance problem and (2) high intra-class dissimilarity and low inter-class dissimilarity between the table and non-table classes. The high intra-class dissimilarity problem arises from the Table class’s structural diversity as heterogeneous tables have different structural placement of header, trailers, cell, etc. The low inter-class dissimilarity prevails because of the structural similarities between the table and non-table class (including rectangular figures, flow diagrams, and textual paragraphs) within the document pages. Both the stated problems weaken the detection ability of underlying models for small and complex tables. Even being a widely preferred technique for object detection in natural images, there is minimal use of single stage detectors for table detection in document images—the primary reason being the performance in terms of detection accuracy. With the proposed approach (single stage detector), we have explored table detection in the context of document images and observed an improved performance over the other table detectors. The proposed framework (DeepDoT) utilizes an end-to-end trainable singlestage - RetinaNet detector [3] for table detection in document images. It localizes tabular layouts and produces bounding boxes around them. The presented approach is pertinent to all types of document images, including raw scanned images. RetinaNet [3] involves apprehending both the semantic and resolution enriched features using Feature Pyramid Network (FPN) [3] (for high intra-class dissimilarity & low inter-class dissimilarity problem) and Focal Loss [3] (for class imbalance problem). Following are the key contributions of this research work: 1. We present an efficient end-to-end trainable table detector inspired by the recent single stage object detector [3], leveraging the potential of FPN and the focal loss. 2. We have used a transfer learning paradigm to fine-tune a pre-trained model for table detection tasks in document images. 3. A thorough evaluation (multi-training and testing schemes) on various benchmark datasets have been performed to evaluate the performance and the generalization capabilities of the proposed approach compared to other existing table detection methods. This work is organized as follows - Sect. 2 extends the brief background of table analysis approaches available in the literature. Section 3 describes the proposed methodology, followed by the experimental details and the result analysis in Sect. 4. Finally, the conclusion is presented in Sect. 5.
DeepDoT: Deep Framework for Detection of Tables in Document Images
2
423
Related Work
The Non-Textual components of a digital document, specifically tables, represent the underlying information in a more concise and meaningful manner. Tables can be rendered using varied formats such as ASCII, HTML, or PDF. The majority of the existing literature approaches are limited to PDF tables only because of portability, accessibility, and compactness. However, the proposed work focuses on tables embedded in document images. Some of the earlier works have performed table analysis using the line segment information in the tables. In [4], different ruling lines and rectangular bounding shape characteristics are used for tabular region detection. The method [5] uses rule-based information and meta-data heuristics for tabular structure analysis in PDF documents. A recent heuristics-based system [6] is introduced for table recognition in PDF documents. These machine learning-based approaches lack the generality and the performance aspect because of the constraints introduced by the involved intricate heuristics and hand-crafted features and are not data-driven. With the advent of modern & efficient deep learning mechanisms, the dependency on creating handcrafted features got reduced, and significant improvements have been achieved in table localization and recognition. Most of the heuristics and machine learning-based methods were designed primarily for PDF’s by utilizing the meta-data or the characters placement related attributes. One of the early works using a deep learning model, [7] is based on the combination of heuristics and features learned from deep models for class label prediction and works for PDF only. DeepDeSRT [8] uses Faster R-CNN [9] for table detection. For structure recognition, FCN architecture with skip connections is used. In the ICDAR-2017 POD competition, [10], most of the participants have used Faster R-CNN, conventional CNN’s, and conditional random field (CRF) for the page object detection task. Gilani et al. [11] also uses the Faster R-CNN, but rather than using the raw images. The method works on the set of transformed images. In this work [12], a saliency-based fully-convolutional neural network is used for detecting various objects in the document images. For further improvement in the segmentation task, CRF is used. A multi-modal based fully-convolutional neural network for page object segmentation is introduced in [13]. It utilizes both the visual as well as linguistics information from the document pages for segmentation. DeCNT [14] shows state-of-the-art performance in table detection/localization. It utilizes a deformable architecture with Faster R-CNN, which enables it to extend the network’s receptive field. The majority of the mentioned deep learning based methods uses two-stage detectors (most of the work involved Faster-RCNN with different settings) based object detection approaches for table detection, which shows better accuracy than single-stage detectors, but lacks in terms of real-time performance [3]. To contribute to developing an effective and robust table detector, DeepDoT explores the task of table detection using (faster) single-stage detection method and is evaluated on four benchmark datasets using cross-dataset and direct testing schemes.
424
M. Singh and P. Goyal
Fig. 1. DeepDoT Framework. The proposed approach takes a document image as input and pre-processes it using image processing techniques. The combination of FPN (with Sub-Networks for prediction) and focal loss is applied to the pre-processed image to detect the tables (rectangular bounding boxes).
3
Proposed Framework
This sections describes the pre-processing pipeline, proposed model architecture and implementation details. 3.1
Pre-Processing
Prior to passing the images directly to the proposed model for table detection, we have performed pre-processing steps. The pre-processing steps reinforce the proposed model, focusing on discriminative attributes like smaller textual elements and white spaces. Thus, increasing the overall efficiency of the tabular detection process. Following pre-processing steps have been applied: 1. Skew Correction: Generally, the document images, mostly scanned documents, suffer from skewness. Therefore a projection profile method has been used for the purpose. It determines the skew in a given image within the preset angle range. Otsu’s binarization is performed, followed by computing the image histogram, rotated at various angles. The best angle will be chosen where the difference between peak is maximum. 2. Image Sharpening: For image sharpening, “unsharp mask ” has been used. It works by exaggerating the intensities (f(x,y)) difference across edges (g(x,y)) within an image. fsharp (x, y) = f (x, y) + k ∗ g(x, y))
(1)
Here, k controls the amount of sharpening (with the increment in values of k, the image becomes sharper) on the input image. In our work, we have found 0.4 as the optimum value for k. 3.2
Feature Pyramid Network with Predictor Sub-Network
The two-stage object detectors account for the foreground-background imbalance inherently through their two-stages. The region proposal network generates
DeepDoT: Deep Framework for Detection of Tables in Document Images
425
around 2K probable object regions; thus, performing the filtration of most background (irrelevant) samples in this stage. Further, filtration is carried out in the second stage using a pre-defined foreground-background ratio and hard example mining techniques. However, these filtration schemes, when performed with single-stage object detectors, prove to be ineffective. This is due to the broader set of candidate regions (∼100k ) being sampled across an image. The filtration process becomes computationally expensive for these more extensive candidate regions covering different locations, scales, and aspect ratios. In this work, we have explored these problems in the context of document images where the class imbalance and scaling (smaller and complex tables) problems persist heavily. The presented single-stage table detector uses an FPN (which ensures the detection of small and complex tables) and focal loss for tackling the class/data imbalance problem in the document images.
Fig. 2. DeepDoT architecture consisting Bottom-Up and Top-Down approaches. ResNet-101 is used as the back-bone network. M1, M2, M3, M4 are the feature maps generated using the Top-Down approach (Feature Pyramid Network). P1, P2, P3, P4 are the final feature maps. w* h* kA represents the width, height and number of anchor boxes, k = 2 (for classification) and k = 4 (for regression).
The CNNs output feature map captures an inherent hierarchical pattern (bottom-up pathway) at each image spatial resolution level. However, the bottom-up pathway (in conventional CNN’s feature map computation) effectively captures the semantically enriched information. Still, it lacks in capturing the higher resolution features, affecting the performance, especially in small and multi scale table detection. For tackling this issue in document images where tables of different scales and sizes are frequent, the presented approach uses feature pyramid network (FPN) [3]. FPN adds a scheme of the top-down pathway (as shown in Fig. 2) to the existing bottom-up pathway, both connected via lateral connections. In the top-down pathway, the last feature map from the
426
M. Singh and P. Goyal
bottom-up pathway is up-sampled (2x) using the nearest neighbor method. This up-sampled feature map is then merged with the second-last layer (from bottomup pathway) via element-wise addition operation, forming a new feature map. These lateral connections between these two pathways (between top-down and bottom-up pathway) ensure the tables’ accurate localization. A sub-network is attached to the top of the FPN for prediction. For feature map computation, CNN architecture (ResNet-101) has been used, as shown in Fig. 2. In bottom-up pathway the last residual block from each network stage (of ResNet-101) is represented as the pyramid levels (Conv0 , Conv1 , Conv2 , Conv3 , Conv4 ). The Conv0 level has not been included in the pyramid due to a large spatial resolution. The feature maps of similar spatial resolutions are merged from the bottom-up pathway and top-down pathway via a lateral connection. The feature maps (Convi , where i = 0 to 4) get convolved with 1 × 1 kernel for reducing the channel size. These feature maps (from the bottom-up pathway) are more stable due to fewer sampling processes. Then Mk (layers from top-down pathway) get up-sampled (by 2x) and element-wise addition is performed. The final feature map is computed from these top-down layers by convolving each Mk with 3 × 3 filters to obtain Pl . The 3 × 3 convolution operation reduces the aliasing effect caused by the up-sampling process. The computed final feature maps go into the predictor sub-network for prediction. The sub-network is a fully connected network comprising of two components, which are attached parallel. The first component is for class prediction and consists of four 3 × 3 convolutional layers. In each layer, 256 filters are used with the activation (ReLu). A final 3 × 3 convolutional layers with 2 × 9 filters (here 2 is the number of classes) are used, followed by the sigmoid activation. The second component (for bounding box prediction) consists 4 × 9 filters in the final convolutional layer (rest of the settings are the same). Anchors with scale of (32, 64, 128, 256, 512) are used with the aspect ratios of - [1:1, 2:2, 2:1]. The architecture is shown in Fig. 2. The single stage object detectors often fail to account for the foreground and background class imbalance issue during training. The inclusion of focal loss [3] enables the model to account for the foreground-background imbalance problem during training by down-weighing the easy instances (background). In document images, the presence of easy negatives (background) is frequent compared to the foreground examples. These background instances account for the significant contribution towards the total loss during training. Thus, affecting the detectors (single-stage detector) performance. Focal loss has been used as the classification loss in the presented work. Focal loss is applied to all the anchors, and total focal loss is the summation of individual anchor loss.
Table 1. Dataset description.

Dataset         Total images  Used images  Train-test split*
ICDAR-13 [15]   238           238          -
ICDAR-17 [10]   2417          2417         1600-817
MARMOT [16]     2000          2000         -
UNLV [17]       2889          424          -

*Cross-dataset testing technique used, except for ICDAR-17 dataset that consists of pre-defined Train-Test split.
4 Experiment and Results
Here, we describe the training schemes and the experimental results obtained using the proposed framework DeepDoT on different benchmark datasets, and discuss its performance compared to existing approaches. The unavailability of large amounts of data for training the network from scratch is an issue, especially for document analysis. So, a pre-trained (ImageNet) backbone network is utilized to leverage the transfer learning paradigm. The existing deep learning-based table localization approaches, including the proposed framework, are data-driven. To enhance and validate the generalization capability of the proposed work, we have used four different publicly available datasets, as shown in Table 1. These datasets consist of images from various sources and distributions, which motivated us to incorporate the pre-processing pipeline in our framework. With the inclusion of the proposed pre-processing pipeline, we noted a significant improvement in the overall efficiency of DeepDoT. We have followed two evaluation schemes - the Leave One Out (LOO) [14] scheme (to perform cross-dataset testing) and the Direct Training (DT) scheme. In the LOO scheme, all datasets except one are used during training and the remaining one is used for testing, i.e., one specific dataset is left out at a time while training and is later used for testing. In this scheme, the ICDAR-2017 dataset is only utilized for training. In the DT scheme (only for ICDAR-2017), we have used the ICDAR-2017 [10] train set for training and its test set for testing. For the DT scheme, an IoU threshold of 0.6 has been considered, as it has been used in various previous works. For the other experiments/datasets used in our evaluation, an IoU threshold of 0.5 has been considered. Data augmentation with rotation, shear, and horizontal and vertical flips is applied to the training set, and the model is trained with a batch size of 2 for 30 epochs. For the regression head, the smooth L1 loss has been used (as it is robust to outliers compared to the L2 loss). The total training loss is the summation of both losses. We use the Adam optimizer with a learning rate of 0.001 and a weight decay of 0.0004, with Nesterov momentum without dampening. The backbone network is pre-trained on ImageNet. All experiments have been performed on an NVIDIA GTX1080 GPU (CUDA 10.0) and an 8th gen i7 CPU. The other relevant computations and scripts of the experiments run on a 64-bit Ubuntu OS.
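As a rough illustration of how the two loss terms described above combine, the sketch below assembles the total training loss, reusing the focal_loss sketch given earlier. The function names and the beta transition point of the smooth L1 term are our own choices, not values stated in the paper.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    # Quadratic near zero, linear for large residuals -> robust to outliers
    # compared with a plain L2 loss.
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def total_loss(cls_probs, cls_labels, box_pred, box_target):
    # Classification: focal loss summed over all anchors.
    l_cls = focal_loss(cls_probs, cls_labels).sum()
    # Regression: smooth L1 over anchors matched to ground-truth tables.
    l_reg = smooth_l1(box_pred, box_target).sum()
    # Paper: the total training loss is the summation of both losses.
    return l_cls + l_reg
```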
Table 2. Table detection performance of DeepDoT on ICDAR-17 dataset.

Input   Method      P      R      F1
Images  HustVision  0.071  0.959  0.132
        IU-vision   0.230  0.221  0.225
        UITVN       0.670  0.940  0.782
        icstpku     0.857  0.773  0.813
        mutai-ee    0.842  0.890  0.865
        VisInt      0.924  0.918  0.921
        SoS         0.934  0.940  0.937
        DeepDoT     0.972  0.913  0.941
Table 3. Table detection performance of DeepDoT on other datasets.

Dataset                 Input   Method          P      R      F1
ICDAR 13 (238 Images)   Images  Tran [18]       0.952  0.967  0.958
                                DeCNT [14]      1.000  0.945  0.972
                                Kavisidis [12]  0.975  0.981  0.978
                                DeepDoT         0.994  0.976  0.984
                        PDF     Nurminen [15]   0.921  0.907  0.914
                                Hao [7]         0.972  0.922  0.946
                                Silva [19]      0.929  0.983  0.955
UNLV (424 Images)       Images  DeCNT [14]      0.786  0.749  0.767
                                DeepDoT         0.771  0.812  0.799
Marmot (2000 Images)    Images  DeCNT [14]      0.849  0.946  0.895
                                DeepDoT         0.914  0.898  0.904
Fig. 3. An illustration of the DeepDoT results on images (ICDAR-17 POD dataset) with multi-objects (figure, formula, tables).
Comparison with the State-of-the-art on ICDAR-POD2017: The quantitative performance of DeepDoT on the ICDAR-2017 dataset [10] is shown in Table 2. DeepDoT outperforms the ICDAR-2017 POD challenge methods with an F1-score of 0.941. A high F1-score demonstrates a balance between precision
and recall. The inherent FPN module is able to compute the feature maps more effectively over the scales. Examples of true positives from the ICDAR-17 POD dataset are presented in Fig. 3. DeepDoT can efficiently detect tables even in the presence of rectangular figures or structures, depicting its detection capability in multi-object (figures and formulas in addition to tables) document images. The analysis of erroneous results on the ICDAR-17 dataset shows that most incorrect results occurred for tables with no or very few ruling lines. Comparison with the State-of-the-art on UNLV: The UNLV dataset [17] is comparatively more complicated as it contains scanned raw document images. Therefore, we have used it for further analysis of the proposed method, DeepDoT. With the LOO-based evaluation scheme, DeepDoT achieves a high recall and F1-score compared to other approaches, as shown in Table 3. Different methods in the literature have used different training and testing procedures for the evaluation of this dataset. We have also compared DeepDoT with Gilani et al. [11], where the UNLV dataset is used for both training and testing. That train-test split differs from the leave-one-out (cross-dataset) and DT scheme splits, so, for a fair comparison, we have followed the same experimental pattern and achieved an F1-score of 0.87 (an improvement over 0.863 [11]). This shows the efficacy of DeepDoT in different training and testing scenarios. Most of the earlier approaches have used two-stage object detectors, mainly RCNN [20] and Faster-RCNN [9], with or without additional modules for the task of table detection. Therefore, we have performed a direct comparison with the existing two-stage detectors RCNN and Faster-RCNN along with a single-stage detector (SSD [21]), as shown in Fig. 5. The proposed method, DeepDoT, offers an improved mAP of 0.76, surpassing the other detectors. The result signifies the robustness of DeepDoT. The models have been trained and tested on the UNLV dataset only for this analysis (Fig. 4).
Fig. 4. An illustration of the DeepDoT results on images (UNLV dataset) with single (comparatively bigger/varying scales) tables.
Comparison with the State-of-the-art on ICDAR-13 and Marmot: It can be noted that, on the ICDAR-13 [15] dataset, DeepDoT shows a significant improvement over existing approaches, with an F1-score of 0.984. Even though
Fig. 5. Performance comparison of DeepDoT with other detectors (Both two stage (R-CNN [20], Faster R-CNN [9]) and one stage [21] detector) on UNLV dataset.
DeCNT [14] obtains a perfect precision score, DeepDoT performs very closely (0.994). For a fair comparison, we have reported the DeCNT Model A accuracy throughout Table 3. The PDF-based approaches are not directly comparable because they use PDF document metadata and heuristics. Still, it is noted that the proposed network performs better than these approaches as well. For completeness, results from these approaches are also reported in Table 3. The model is also robust to rotation due to the skew correction process in the pre-processing and the multi-scale feature computation. On the Marmot dataset, there is a significant improvement over DeCNT in terms of precision and F1-score.
Fig. 6. An illustration of the DeepDoT results on document images (from different datasets) with multi-tables (a,b,c,d). The green rectangular bounding boxes show the tables detected in these document images. (Color figure online)
The multi-table detection capability is shown in Fig. 6. In some cases, DeepDoT misses tables or shows incorrect predictions, as shown in Fig. 7. There were multiple tables in some of these images. In one image, Fig. 7 (c), one of the detected regions contained only additional white space beyond the actual table region; this absence of any text or other object might have prevented DeepDoT from ruling out this false positive. The performance across different datasets and schemes shows that DeepDoT is capable of capturing small, multi-size, irregular and complex tables in document images. Overall, it is observed that DeepDoT is a very promising approach to table detection.
Fig. 7. Some incorrect results from DeepDoT: (a,b) Document images with multiple tables out of which one is missed i.e. not detected, False Negative. (c) detection of almost same table twice (one including additional lower white space and one w/o it)
5 Conclusion
The proposed framework highlights the class imbalance problem in the table detection task as the primary obstacle limiting the performance of one-stage table detectors. The proposed model has been evaluated on publicly available benchmark datasets - ICDAR-2013, ICDAR-2017 POD, UNLV, and Marmot. It is evident from the experimental results that the proposed framework is very promising and can surpass its counterparts in table detection capability. The proposed scheme can pave the path for an efficient tabular content extraction process that can be further utilized for applications including table reconstruction, table content extraction, and table summarization. In future work, we would like to extend the evaluation over different datasets (at multiple IoU thresholds) along with a comparison with a more diverse set of state-of-the-art methods. We would also like to further extend the pipeline towards table recognition.
Acknowledgement. This research is supported by the IIT Ropar under ISIRD grant 9-231/2016/IIT-RPR/1395 and by the DST under CSRI grant DST/CSRI/2018/234.
References
1. Cesarini, F., Marinai, S., Sarti, L., Soda, G.: Trainable table location in document images. In: Object Recognition Supported by User Interaction for Service Robots, vol. 3, pp. 236–240. IEEE (2002)
2. e Silva, A.C.: Learning rich hidden Markov models in document analysis: table location. In: 2009 10th International Conference on Document Analysis and Recognition, pp. 843–847. IEEE (2009)
3. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
4. Hassan, T., Baumgartner, R.: Table recognition and understanding from PDF files. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 1143–1147. IEEE (2007)
5. Shigarov, A., Mikhailov, A., Altaev, A.: Configurable table structure recognition in untagged PDF documents. In: Proceedings of the 2016 ACM Symposium on Document Engineering, pp. 119–122 (2016)
6. Rastan, R., Paik, H.-Y., Shepherd, J.: Texus: a unified framework for extracting and understanding tables in pdf documents. Inf. Proc. Manage. 56(3), 895–918 (2019) 7. Hao, L., Gao, L., Yi, X., Tang, Z.: A table detection method for pdf documents based on convolutional neural networks. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 287–292. IEEE (2016) 8. Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: Deepdesrt: deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 1162–1167. IEEE (2017) 9. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015) 10. Gao, L., Yi, X., Jiang, Z., Hao, L., Tang, Z.: ICDAR 2017 competition on page object detection. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 1417–1422. IEEE (2017) 11. Gilani, A., Qasim, S.R., Malik, M.I., Shafait, F.: Table detection using deep learning. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 771–776 (2017) 12. Kavasidis, I.: A saliency-based convolutional neural network for table and chart detection in digitized documents. arXiv preprint arXiv:1804.06236 (2018) 13. Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Lee Giles, C.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5315–5324 (2017) 14. Siddiqui, S.A., Malik, M.I., Agne, S., Dengel, A., Ahmed, S.: Decnt: deep deformable CNN for table detection. IEEE Access 6, 74151–74161 (2018) 15. Gobel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: 2013 12th International Conference on Document Analysis and Recognition, pp. 1449– 1453. IEEE (2013) 16. https://www.icst.pku.edu.cn/cpdp/ (2015) 17. https://www.iapr-tc11.org/mediawiki/ (2010) 18. Tran, D.N., Tran, T.A., Oh, A., Kim, S.H., Na, I.S.: Table detection from document image using vertical arrangement of text blocks. Int. J. Contents 11(4), 77–85 (2015) 19. Silva, A.C.: Parts that add up to a whole: a framework for the analysis of tables. Edinburgh University, UK (2010) 20. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014) 21. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0 2
Correcting Low Illumination Images Using PSO-Based Gamma Correction and Image Classifying Method Swadhin Das(B) , Manali Roy, and Susanta Mukhopadhyay Indian Institute of Technology (ISM), Dhanbad, India
Abstract. In this work, the authors have proposed a method for improving the visual quality of 2D color images suffering from low illumination. The input image is converted to HSV (Hue, Saturation, Value) color space, and the V component is subjected to high pass Laplace filter. The filtered output is then made to undergo a two-stage classifier and a brightness correction process. Finally, the resultant image obtained is gamma-corrected using an optimum gamma value computed using a well-known meta-heuristic based optimization technique namely, particle swarm optimization (PSO). The corrected V component is combined back with the H and S components to reconstruct the final result. The authors have tested this method on a number of 2D color images of natural scenes and the result is found to be satisfactory. Also, the experimental results are compared with similar methods in terms of subjective and objective metrics. Keywords: Laplace filter
· FCIC · SCIC · Gamma correction · PSO

1 Introduction
Images captured through conventional acquisition systems often suffer from a significant imaging drawback, i.e., poor illumination. There are many reasons behind the formation of such weakly illuminated images, including inherent noise and other limitations of imaging devices, unstable lighting conditions, night-time imaging, adverse weather conditions, atmospheric effects and ambience. As a result, images tend to be under-exposed, thereby concealing most of the perceptual information, which causes problems for further machine interpretation. The applicability of such images can be enhanced using robust illumination correction algorithms that correct the intensity distribution while removing existing noise [5]. Correction of inhomogeneous illumination is well addressed in the literature through several parametric and non-parametric algorithms. Amongst the non-parametric approaches, Yu et al. [20] have proposed an adaptive inverse hyperbolic tangent (AIHT) algorithm for dynamic contrast adjustment. A similar adaptive algorithm is presented in [17] employing a multi-scale Gaussian function and image correction. Soft computing based parameter optimization techniques are also utilized to improve the visual quality of low illumination images.
Hasnat et al. [8] have proposed a colorization system for grayscale face images using particle swarm optimization. Kanmani et al. have combined swarm intelligence based optimization with a gamma correction in Lαβ color space to develop a contrast enhancement algorithm [11]. Another algorithm proposed by the same authors combines swarm intelligence in the transform domain using the dual-tree wavelet transform [13]. Correction algorithms using particle swarm optimization are also applied in the medical domain to enhance the contrast of MRI scans [15]. The authors in [4], [10] and [14] have used gamma correction to improve the visual quality of images; however, choosing a proper value of gamma greatly influences the result. Deep learning-based approaches such as an improved GAN (Generative Adversarial Network) are applied in remote sensing to enhance UAV (Unmanned Aerial Vehicle) images [19]. A multi-view learning approach developed in [18] incorporates local and global enhancement to enhance the contrast in glioblastoma images for accurate medical diagnosis. Afifi et al. [3] have proposed a deep learning based overexposed and underexposed photo correction method. Afifi et al. [2] have proposed a deep learning approach to realistically edit an sRGB image's white balance. Afifi et al. [1] have proposed a deep learning framework, CIE XYZ Net, to unprocess a nonlinear image back to the canonical CIE XYZ image. To extract more details from a low illumination image, this paper proposes an illumination correction method using two successive steps of image classification, i.e., First order classification and image correction (FCIC) and Second order classification and image correction (SCIC). The final step involves gamma correction, where the optimum value for gamma is chosen using particle swarm optimization instead of a manual setting. The aforementioned classifying techniques are performed on the intensity (V) component of the colour images. The rest of the paper is organized as follows: background concepts are given in Sect. 2, the proposed method is presented in Sect. 3, followed by details of experimental results presented in Sect. 4. Finally, Sect. 5 presents the conclusion of this work.
2 Background Concepts

2.1 Gamma Correction Technique
In this work, gamma correction is used as the final step to correct the output obtained from the previous step. The reason behind selecting the same is its ability to extract major details from a low illuminated image, eliminate noise and to solve the problem of color distortion, and mathematically it is expressed as:

İ(x, y) = max(Iin) × (Iin(x, y) / max(Iin))^γ    (1)

where Iin(x, y) and İ(x, y) are the intensities of the input and output image at pixel location (x, y) respectively and γ is the correction parameter (0 ≤ γ < ∞) which is used to adjust the brightness of the image. γ = 1 implies the proper reconstruction of the input image, γ > 1 darkens the input image and γ < 1 brightens the input image.
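A direct NumPy transcription of Eq. (1) is shown below as a sketch; the function name and the guard against an all-black input are our own additions.

```python
import numpy as np

def gamma_correct(v, gamma):
    """Eq. (1): gamma-correct the V channel (non-negative float array)."""
    v = v.astype(np.float64)
    v_max = v.max()
    if v_max == 0:            # guard for an all-black input (our addition)
        return v
    return v_max * (v / v_max) ** gamma

# gamma < 1 brightens the image, gamma > 1 darkens it, gamma = 1 returns the input.
```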
2.2 Particle Swarm Optimization
Particle swarm optimization (PSO) technique [6] is a well-known and efficient population-based optimization technique where random particles are first initialized with an initial velocity and position. This technique looks for an optimum solution based on some objective functions (single or multi-objective) through several iterations. In each iteration, the fitness value of each particle is computed and if any particle is found to achieve its own best value, it stores the location of the value as pbest (particle best). The location of the best value obtained by any particle in any iteration is stored as gbest (global best). Now using pbest and gbest, each particle updates its velocity and position using the following equations,

Vj(i + 1) = W × Vj(i) + C1 × rand() × (pbestj(i) − Pj(i)) + C2 × rand() × (gbest − Pj(i))    (2)

Pj(i + 1) = Pj(i) + Vj(i + 1)    (3)

where Pj(i) and Vj(i) are the position and velocity of the jth particle at the ith iteration, W is the inertia weight which controls the convergence behavior of PSO, C1 and C2 control the influence of pbest and gbest respectively, and rand() generates a random number between 0 and 1.
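A minimal sketch of the velocity/position update of Eqs. (2)–(3) follows; the default values of W, C1 and C2 here are placeholders for illustration, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(P, V, pbest, gbest, W=0.5, C1=2.0, C2=2.0):
    """One PSO iteration for all particles (positions P, velocities V)."""
    r1 = rng.random(P.shape)          # rand() term for the pbest pull
    r2 = rng.random(P.shape)          # rand() term for the gbest pull
    V = W * V + C1 * r1 * (pbest - P) + C2 * r2 * (gbest - P)   # Eq. (2)
    P = P + V                                                   # Eq. (3)
    return P, V
```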
3 Proposed Work
The steps of the proposed algorithm are performed on the intensity (V) component in HSV color space, converted using standard equations from [7]. The HSV space consists of three channels (Hue, Saturation and Value) which are independent of each other and decouple luminance from the color information.

3.1 Image Sharpening
In this paper, the authors have used the Laplace filter, which is based on the second-order derivative and produces better results compared to filters based on the first-order derivative because of its ability to enhance finer details and produce clear edges. The second-order partial derivatives along the 0°, 45°, 90°, and 135° directions are given by,

δ²I(x, y)/δx²  = I(x + 1, y) + I(x − 1, y) − 2I(x, y)
δ²I(x, y)/δy²  = I(x, y + 1) + I(x, y − 1) − 2I(x, y)
δ²I(x, y)/δxδy = I(x + 1, y + 1) + I(x − 1, y − 1) − 2I(x, y)
δ²I(x, y)/δyδx = I(x + 1, y − 1) + I(x − 1, y + 1) − 2I(x, y)    (4)
Fig. 1. Workflow of the algorithm (a) Weakly illuminated image; (b) H component; (c) S component; (d) V component; (e) After first order classification; (f) After second order classification; (g) After PSO-based gamma correction; (h) Illuminated Result
where I(x, y) is the intensity value of image I at pixel location (x, y). The second-order Laplace filter (∇²I) is obtained by adding all the components in Eq. (4), which gives

∇²I(x, y) = I(x + 1, y) + I(x − 1, y) + I(x, y + 1) + I(x, y − 1) + I(x + 1, y + 1) + I(x − 1, y − 1) + I(x + 1, y − 1) + I(x − 1, y + 1) − 8I(x, y)    (5)

or, equivalently,

∇²I(x, y) = Σ_{p=−1}^{1} Σ_{q=−1}^{1} I(x + p, y + q) − 9I(x, y)    (6)
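The double sum in Eq. (6) is equivalent to a single 3 × 3 convolution with the classical 8-neighbour Laplacian kernel. The sketch below uses OpenCV's filter2D; the choice of library and of a replicate border is ours, not stated in the paper.

```python
import cv2
import numpy as np

# Kernel form of Eq. (6): every one of the nine taps is 1 and the centre
# receives an extra -9, i.e. the 8-neighbour Laplacian with centre -8.
LAP_KERNEL = np.array([[1,  1, 1],
                       [1, -8, 1],
                       [1,  1, 1]], dtype=np.float32)

def laplace_response(v):
    """Second-order response of the V channel (float32, any range)."""
    return cv2.filter2D(v.astype(np.float32), cv2.CV_32F, LAP_KERNEL,
                        borderType=cv2.BORDER_REPLICATE)
```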
The modified intensity values ∇²I for some pixels may not lie within [0, 255] and need to be adjusted using a suitable technique. In this paper, the linear stretch algorithm is used to bring all the pixel intensities back into [0, 255] (Fig. 1).

3.2 First Order Classification and Image Correction (FCIC)
In the first-order classifying and image correction (FCIC) technique, the input image is classified into one of the three categories based on the average value of I˜ i.e., AV G1 . Two positive threshold values, M IN1 and M AX1 are defined such that M IN1 < M AX1 .
Algorithm 1. LinearStretch(Iold)
1: (m, n) = size(Iold)
2: Imax = max(Iold) and Imin = min(Iold)
3: ΔI = Imax − Imin
4: k = 255 / ΔI
5: for i = 1 to m do
6:   for j = 1 to n do
7:     Inew(i, j) = k(Iold(i, j) − Imin)
8:   end for
9: end for
10: return Inew
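The per-pixel loop of Algorithm 1 can be written in a few vectorized NumPy lines; this is only a sketch, and the guard against a constant image is our addition.

```python
import numpy as np

def linear_stretch(img):
    """Rescale an array linearly so that its values span [0, 255]."""
    i_min, i_max = float(img.min()), float(img.max())
    if i_max == i_min:                 # constant image: nothing to stretch
        return np.zeros_like(img, dtype=np.float32)
    k = 255.0 / (i_max - i_min)        # Algorithm 1, step 4
    return k * (img.astype(np.float32) - i_min)
```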
Case 1: If AV G1 < M IN1 , the image is considered as a low threshold image where the fractional part, α (0 < α ≤ 1) of AV G1 is added with the input image for edge sharpening. The output of the FCIC technique, I1 is given by, I1 (x, y) = I(x, y) + α × AV G1
(7)
Case 2: If M IN1 ≤ AV G1 ≤ M AX1 , it is a medium threshold image. Here, the output of FCIC is the same as the input i.e. there is no change in the image. Mathematically, in this case I1 is given by, I1 (x, y) = I(x, y)
(8)
Case 3: If AVG1 > MAX1, we call the image a high threshold image. A low pass Gaussian filter is applied to remove unwanted noise as follows,

Gm = (1/16) × [ 1 2 1
                2 4 2
                1 2 1 ]    (9)

The output of the FCIC technique, I1, is given by,

I1 = I ∗ Gm    (10)
where ∗ denotes the convolution operation.

3.3 Second Order Classification and Image Correction (SCIC)
The output of FCIC, I1 has a sharp and clear edge with reduced noise effectively. However, the problem of over brightness (darkness) may exist which needs to be adjusted properly. For this, a non-average brightness adjustment technique is adopted to adjust the brightness at the same rate. In second order classifying and image correction (SCIC) technique the input image is classified into one of the three categories based on the average value of I1 , AV G1 . Two positive threshold values M IN2 and M AX2 are defined such that M IN2 < M AX2 .
Algorithm 2. FCIC(I, Ĩ)
1: AVG1 = mean(Ĩ)
2: Set α, MIN1 and MAX1  \\ MIN1 < MAX1 & 0 < α ≤ 1
3: if AVG1 < MIN1 then
4:   I1(x, y) = I(x, y) + α × AVG1
5: else if MIN1 ≤ AVG1 ≤ MAX1 then
6:   I1(x, y) = I(x, y)
7: else if AVG1 > MAX1 then
8:   A 3 × 3 Gaussian mask Gm is created (Eq. (9))
9:   I1 = I ∗ Gm
10: end if
11: return I1

Algorithm 3. SCIC(I1)
1: AVG2 = mean(I1)
2: Set β, MIN2 and MAX2  \\ MIN2 < MAX2 & 0 < β < 1
3: if AVG2 < MIN2 then
4:   I2(x, y) = (1 + β)I1(x, y)
5: else if MIN2 ≤ AVG2 ≤ MAX2 then
6:   I2(x, y) = I1(x, y)
7: else if AVG2 > MAX2 then
8:   I2(x, y) = (1 − β)I1(x, y)
9: end if
10: return I2
Case 1: If AVG2 < MIN2, it means that I1 is suffering from a darkness problem, so its brightness is increased at a rate of β. I2 is mathematically expressed as, I2(x, y) = (1 + β)I1(x, y)
(11)
Case 2: If M IN2 < AV G2 < M AX2 , which means I1 is not suffering from over-darkness (brightness) problem. So no changes are applied on I1 . In this case the output of SCIC, I2 is given by, I2 (x, y) = I1 (x, y)
(12)
Case 3: If AVG2 > MAX2, it is considered that I1 is suffering from an over-brightness problem, so the brightness of I1 is decreased at a rate of β. I2 is given by,

I2(x, y) = (1 − β)I1(x, y), 0 < β < 1    (13)

3.4 Estimation of Best Gamma Factor Using PSO
However, in this work, PSO technique is used to obtain the best γ value where single objective function is chosen to reduce the computational complexity. It is formulated using two image quality parameters, i.e., entropy and edge count.
Algorithm 4. ObjFun(Entold, ECold, Entnew, ECnew)
\\ Entold, ECold, Entnew and ECnew denote the entropy and edge count of the old and new image respectively.
1: if Entold == 0 or ECold == 0 then
2:   return 1  \\ Fitness value of new image is better than old image.
3: else
4:   ΔEnt = (Entnew − Entold) / Entold
5:   ΔEC = (ECnew − ECold) / ECold
6:   S = ΔEnt + ΔEC
7:   if S > 0 then
8:     return 1  \\ Fitness value of new image is better than old image.
9:   else
10:    return 0  \\ Fitness value of old image is better than new image.
11:  end if
12: end if
Algorithm 5. PSOBasedGammaCorrection(I)
1: Set W, C1, C2, no. of iterations (itr), population size (pop)
2: Randomly initialize the position vector (P[1:pop]) and velocity vector (V[1:pop]) of the particles.
3: Set particle best fitness matrices for entropy (pbestent[1:pop]) = 0 and edge count (pbestec[1:pop]) = 0, and global best fitness values for entropy (gbestent) = 0 and edge count (gbestec) = 0
4: change[1:pop] = 0
5: for i = 1 to itr do
6:   for j = 1 to pop do
7:     Apply gamma correction technique on I with respect to P[j] using Eq. (1) and let the output be I'
8:     Calculate Ent(I') and EC(I') using Eq. (14) and Eq. (15)
9:     if ObjFun(pbestent[j], pbestec[j], Ent(I'), EC(I')) == 1 then
10:      pbestent[j] = Ent(I')
11:      pbestec[j] = EC(I')
12:      pbestp[j] = P[j]  \\ pbestp[j] is the local best position of the jth particle
13:      change[j] = 1
14:    end if
15:    if change[j] == 1 and ObjFun(gbestent, gbestec, pbestent[j], pbestec[j]) == 1 then
16:      gbestent = pbestent[j]
17:      gbestec = pbestec[j]
18:      gbestp = pbestp[j]  \\ gbestp is the global best position
19:    end if
20:    Update V and P using Eq. (2) and Eq. (3)
21:  end for
22: end for
23: return gbestp
Entropy: Entropy of an image is defined by the following equation,

Ent(I) = − Σ_{i=0}^{255} p(i) × log2(p(i))    (14)

where Ent(I) is the entropy of the image I and p(i) is the probability of occurrence of the ith intensity in image I.

Edge Count: Edge count of an image is defined by the following equation,

EC(I) = (Total number of detected edge pixels in I) / (Total number of pixels in I)    (15)

where EC(I) is the edge count of the image I.
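Both quality measures are straightforward to compute. The sketch below uses a Canny detector and fixed thresholds for the edge map; the paper does not name its edge detector, so that choice is an assumption on our part.

```python
import cv2
import numpy as np

def entropy(img_u8):
    """Eq. (14): Shannon entropy of an 8-bit single-channel image."""
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                         # skip empty bins (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())

def edge_count(img_u8, lo=100, hi=200):
    """Eq. (15): fraction of pixels marked as edges (Canny thresholds assumed)."""
    edges = cv2.Canny(img_u8, lo, hi)
    return float((edges > 0).sum()) / edges.size
```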
Algorithm 6. ProposedMethod(Iin)
1: Convert Iin into HSV color space and let Iv be the intensity component.
2: Apply Laplace filter on Iv to obtain ∇²Iv using Eq. (5) or Eq. (6)
3: Ĩv = LinearStretch(∇²Iv)  \\ Algorithm 1
4: Îv = FCIC(Iv, Ĩv)  \\ Algorithm 2
5: Īv = SCIC(Îv)  \\ Algorithm 3
6: gbestp = PSOBasedGammaCorrection(Īv)  \\ Algorithm 5
7: Apply gamma correction technique on Īv with respect to gbestp using Eq. (1) to obtain the final constructed value component İv
8: Iout = HSVtoRGB(H, S, İv)
9: return Iout
4 Experimental Results
Experiments have been performed on 7000 (approx.) low illumination real 2D color images collected from the ExDARK database [12]. In this experiment, the size of the input images are considered as 256×256. The values of the parameters used in this experiment are given in Table 1. The proposed method has been implemented using Matlab 2018a and executed on a system with processor Intel (R) Core (TM) i7-6500U CPU having 2.60 GHz, 16.0 GB RAM, and Windows 10 operating system. Figure 2 shows the experimental results of the proposed algorithm on a set of low illumination test images. The proposed method is compared with [9,17], and [16] based on some quantitative measurements, like entropy and edge count which is summarized in Table 2.
Table 1. List of parameters and corresponding values

Parameter     Value
MIN1, MIN2    16, 96
MAX1, MAX2    32, 160
α, β          0.9, 0.4
itr, pop      30, 50
Wi, Wf        0.25, 0.125
C1, C2        2
Table 2. Objective evaluation on weakly illuminated image sets

Metrics     Image    Wang [17]  Huang [9]  Srinivas [16]  Proposed
Entropy     a[i]     5.8536     4.7656     5.4853         6.1722
            a[ii]    6.6148     5.6952     6.4608         6.9167
            a[iii]   7.4751     7.1693     7.2243         7.6038
            a[iv]    5.2499     4.6832     5.1980         5.6722
            a[v]     6.7535     5.7933     6.4655         7.0122
            a[vi]    6.7042     5.9468     6.3469         6.9505
            a[vii]   7.0865     6.4108     6.9937         7.6043
            a[viii]  4.2638     4.2480     5.7966         6.7981
Edge count  a[i]     0.1330     0.0903     0.1290         0.1368
            a[ii]    0.1350     0.0922     0.1232         0.1496
            a[iii]   0.0626     0.0542     0.0548         0.0655
            a[iv]    0.1158     0.0922     0.1062         0.1220
            a[v]     0.1468     0.1074     0.1327         0.1536
            a[vi]    0.0894     0.0677     0.0787         0.0959
            a[vii]   0.0863     0.0607     0.0897         0.1175
            a[viii]  0.0921     0.1149     0.1638         0.1666

4.1 Quality of the Results
The quality of the results has been visually compared with three similar methods over eight weakly illuminated source images, as presented in Fig. 2. The figure shows that the proposed method gives better visual quality and extracts more information compared to other methods. Results produced by [17] have increased contrast (Fig. 2 b[v]) but it is not effective in shadow removal (Fig. 2 b[viii]). Removal of shadow effects happens to be a challenging task in improving visual quality which is best handled by the proposed method for both grayscale and color images (Fig. 2 e[iii] and e[viii]). Results from [9] (Fig. 2 c[i-viii]) and [16] (Fig. 2 d[i-viii]) have shown poor performance in illumination correction in comparison to the proposed approach. Also, from Table 2, the proposed method has obtained the highest values for evaluation metrics which proves the superiority of the method.
Fig. 2. Results from algorithms; a[i-viii]: Weakly illuminated source image; b[i-viii]: Results obtained in [17]; c[i-viii]: Results obtained in [9]; d[i-viii]: Results obtained in [16]; e[i-viii]: Results obtained from the proposed method
5 Conclusion
This paper proposes an approach for correcting real 2D colour images suffering from weak illumination due to poor lighting conditions. Initially, the RGB image is converted into HSV color space. Two successive iterations of image classification and correction algorithm (FCIC and SCIC) are applied to the intensity component (V) followed by final PSO-based gamma correction. FCIC is used to sharpen the edges of the image, whereas SCIC is used to solve the problem of extreme brightness or darkness. Finally, the gamma correction technique is used to construct the enhanced output image, where the best value of gamma is obtained, employing PSO with respect to specific image quality parameters. This method has been experimented on several low-light images and compared with similar methods in terms of visual quality and quantitative metrics. The proposed approach successfully enhances the illumination, extracts fine details, and uniformly balances out color distortion throughout the image.
References 1. Afifi, M., Abdelhamed, A., Abuolaim, A., Punnappurath, A., Brown, M.S.: Cie xyz net: Unprocessing images for low-level computer vision tasks. arXiv preprint arXiv:2006.12709 (2020) 2. Afifi, M., Brown, M.S.: Deep white-balance editing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1397– 1406 (2020) 3. Afifi, M., Derpanis, K.G., Ommer, B., Brown, M.S.: Learning to correct overexposed and underexposed photos. arXiv preprint arXiv:2003.11596 (2020) 4. Aggarwal, A., Chauhan, R., Kaur, K.: An adaptive image enhancement technique preserving brightness level using gamma correction. Adv. Electron. Electr. Eng. 3(9), 1097–1108 (2013) 5. Dey, N.: Uneven illumination correction of digital images: a survey of the state-ofthe-art. Optik 183, 483–495 (2019) 6. Eberhart, R., Kennedy, J.: Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948. Citeseer (1995) 7. Gonzalez, R.C., Woods, R.E., Eddins, S.L.: Digital Image Processing using MATLAB. Pearson Education India, India (2004) 8. Hasnat, A., Halder, S., Bhattacharjee, D., Nasipuri, M.: A proposed grayscale face image colorization system using particle swarm optimization. Int. J. Virtual Augmented Reality (IJVAR) 1(1), 72–89 (2017) 9. Huang, S.C., Cheng, F.C., Chiu, Y.S.: Efficient contrast enhancement using adaptive gamma correction with weighting distribution. IEEE Trans. Image Process. 22(3), 1032–1041 (2012) 10. Huang, Z., et al.: Optical remote sensing image enhancement with weak structure preservation via spatially adaptive gamma correction. Infrared Phys. Technol. 94, 38–47 (2018) 11. Kanmani, M., Narasimhan, V.: Swarm intelligent based contrast enhancement algorithm with improved visual perception for color images. Multimedia Tools Appl. 77(10), 12701–12724 (2017). https://doi.org/10.1007/s11042-017-4911-7
12. Loh, Y.P., Chan, C.S.: Getting to know low-light images with the exclusively dark dataset. Comput. Vis. Image Underst. 178, 30–42 (2019) 13. Madheswari, K., Venkateswaran, N.: Swarm intelligence based optimisation in thermal image fusion using dual tree discrete wavelet transform. Quant. InfraRed Thermography J. 14(1), 24–43 (2017) 14. Rahman, S., Rahman, M.M., Abdullah-Al-Wadud, M., Al-Quaderi, G.D., Shoyaib, M.: An adaptive gamma correction for image enhancement. EURASIP J. Image Video Process. 2016(1), 1–13 (2016). https://doi.org/10.1186/s13640-016-0138-1 15. Sakthivel, S., Prabhu, V., Punidha, R.: MRI-based medical image enhancement technique using particle swarm optimization. In: Saini, H.S., Srinivas, T., Vinod Kumar, D.M., Chandragupta Mauryan, K.S. (eds.) Innovations in Electrical and Electronics Engineering. LNEE, vol. 626, pp. 729–738. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-2256-7 67 16. Srinivas, K., Bhandari, A.K.: Low light image enhancement with adaptive sigmoid transfer function. IET Image Process. 14(4), 668–678 (2019) 17. Wang, W., Chen, Z., Yuan, X., Wu, X.: Adaptive image enhancement method for correcting low-illumination images. Inf. Sci. 496, 25–41 (2019) 18. Wang, X., An, Z., Zhou, J., Chang, Y.: A multi-view learning approach for glioblastoma image contrast enhancement. In: Kountchev, R., Patnaik, S., Shi, J., Favorskaya, M.N. (eds.) Advances in 3D Image and Graphics Representation, Analysis, Computing and Information Technology. SIST, vol. 180, pp. 151–158. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-3867-4 18 19. Wu, G., Ma, X., Huang, K., Guo, H.: Remote sensing image enhancement technology of UAV based on improved GAN. In: Wang, Y., Fu, M., Xu, L., Zou, J. (eds.) Signal and Information Processing, Networking and Computers, pp. 703–709. Springer, Singapore (2020) https://doi.org/10.1007/978-981-15-4163-6 84 20. Yu, C.Y., Ouyang, Y.C., Wang, C.M., Chang, C.I.: Adaptive inverse hyperbolic tangent algorithm for dynamic contrast adjustment in displaying scenes. EURASIP J. Adv. Sign. Process. 2010, 1–20 (2010)
DeblurRL: Image Deblurring with Deep Reinforcement Learning Jai Singhal(B) and Pratik Narang(B) Department of CSIS, BITS Pilani, Pilani, India {h20190021,pratik.narang}@pilani.bits-pilani.ac.in Abstract. Removing non-uniform blur from an image is a challenging computer vision problem. Blur can be introduced in an image by various possible ways like camera shake, no proper focus, scene depth variation, etc. Each pixel can have a different level of blurriness, which needs to be removed at a pixel level. Deep Q-network was one of the first breakthroughs in the success of Deep Reinforcement Learning (DRL). However, the applications of DRL for image processing are still emerging. DRL allows the model to go straight from raw pixel input to action, so it can be extended to several image processing tasks such as removing blurriness from an image. In this paper, we have introduced the application of deep reinforcement learning with pixel-wise rewards in which each pixel belongs to a particular agent. The agents try to manipulate each pixel value by taking a sequence of appropriate action, so as to maximize the total rewards. The proposed method achieves competitive results in terms of state-of-the-art.
1 Introduction
It is prevalent to adopt image deblurring techniques to recover the images from the blurry images. Blur in an image can be obtained in several ways, it may be introduced due to movement or shake of the camera (called motion blur), or by camera focus (called focused blur). It has been observed that the common type of blur found in the image is mainly due to motion and focus blur which therefore degrades the quality of the image [7]. Learning to control the agents from high-resolution images, or any signal like audio and video is one of the challenges of reinforcement learning. But, recent advances in deep learning made it possible to extract the features from high-resolution images, which is a breakthrough in fields like image processing, computer vision, speech recognition, etc. These methods utilize the power of neurons that makes a giant multi-level neural network architecture. These network techniques can be integrated with reinforcement learning (RL). As deep Q-network (DQN) [2,9] has been introduced, many algorithms pertaining to RL were proposed that could play the Atari console games the same way a human would and beating professional poker players in the game of heads
up no-limit Texas hold’em, etc. This has attracted researchers to focus on deep reinforcement learning. However, these methods cannot easily be applied in applications such as image-deblurring [7] where pixel-wise manipulations are required. To deal with this problem, we have proposed a multi-agent Reinforcement Learning approach where an agent is assigned to each pixel to learn the optimal behavior i.e. to maximize the average expected total rewards of all pixels and update the pixel value iteratively. It is computationally not feasible to apply the existing techniques in a naive manner since the number of agents evaluated is huge (e.g., 1 million agents for 1000 × 1000 pixel images). To tackle this challenge we use a fully convolution network (FCN) so that all the parameters are shared and learning can be performed efficiently. In this paper, we propose a deep reinforcement learning-based approach for image deblurring. We propose a reward map convolution which proves to be an effective learning method, wherein each agent considers not only the future states of its pixel but also those of their neighboring pixels. For setting up the deep reinforcement learning, the set of possible actions for a particular application should be pre-defined; this makes the proposed method interpretable, by which actions applied by the agents can be observable. The agent picks the best sequence of actions determined by the rewards provided by the environment. Our experimental result shows that the trained agents achieve better performance when compared to other state-of-art blinded/non-blinded deconvolutional and kernel estimation based fully CNN based approaches.
2 Related Work
Previously, [1,4,10–12,14] have employed CNN and other deep learning techniques for deblurring. Xu et al. [14] proposed a non-blind setting whereas Schuler et al. [11] proposed a blind setting for deconvolutional neural network schemes. In [14], a generative forward model has been used, where the kernel is estimated by combining the locally extracted features from the image; this information is then used to reduce the difficulty of the problem. Sun et al. [12] have used an effective CNN based non-uniform motion deblurring approach to estimate the probabilistic distribution of motion kernels at the patch level. Kim et al. [4] proposed an approach that approximates the blur kernel such that the locally linear motion and the latent image are jointly estimated. Nah et al. [10] proposed a multi-scale convolutional neural network with a multi-scale loss function, which avoids problems such as kernel-based estimation. Kupyn et al. [6] presented an end-to-end learning model using conditional adversarial networks for blind motion deblurring, which is also a kernel-free deblurring approach and shows good results on the GoPro and Kohler datasets. On the other hand, Furuta et al. [3] proposed a deep reinforcement learning technique in which they worked at the pixel level and experimented with various image processing tasks such as image denoising, image restoration, color enhancement, and image editing, showing better performance than other state-of-the-art methods. They have used the deep reinforcement and pixel-based
rewards technique with a discrete set of actions, which makes it different from other deep learning techniques.

3 Reinforcement Learning Background
In this paper, we have considered the settings of standard reinforcement learning, in which the agent interacts with an environment E over a discrete number of time steps. The agent receives a state sτ at time step τ, then chooses an action from a set of possible actions A according to its policy π, where π is a mapping from states sτ to actions aτ. In return, the agent receives the next state sτ+1 and a scalar reward rτ. This process continues until the agent reaches a terminal state, after which the process restarts. The return Rτ is equal to

Rτ = rτ + γrτ+1 + γ²rτ+2 + γ³rτ+3 + ... + γ^(n−1)rτ+n−1 + γ^n V(sτ+n)    (1)

where Rτ is the total accumulated reward at time step τ with discount factor γ ∈ (0, 1]. The main objective of the agent is to maximize the expected return from each state sτ. As an extension of standard reinforcement learning, we introduce pixel-level agents, where the policy of the ith agent is denoted as πi(aτi|sτi) for each pixel i ∈ [1, N].

A3C [8] is an actor-critic method which uses both a policy and a value network. We denote the parameters of the policy and value networks as θp and θv respectively. Both networks use the current state sτ as input. The value network outputs the expected total reward from state sτ, i.e., the value V(sτ), which shows the goodness of the current state. The gradient for the value network is calculated as follows:

dθv = ∇θv (Rτ − V(sτ))²    (2)

The policy network outputs the policy π(aτ|sτ) and uses a soft-max layer at the end to output the action to be applied to the pixel.

A(aτ, sτ) = Rτ − V(sτ)    (3)

A(aτ, sτ) is called the advantage, and V(sτ) is subtracted to reduce the variance of the gradient [8]. The gradient for the policy network is calculated as:

dθp = −∇θp log π(aτ|sτ) A(aτ, sτ)    (4)
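Concretely, the value and policy objectives whose gradients give Eqs. (2) and (4) reduce to a few array operations once the n-step return of Eq. (1) has been formed. The sketch below is only an illustration; in practice the gradients would be obtained through an autodiff framework rather than written by hand.

```python
import numpy as np

def a3c_losses(R, V, log_pi_a):
    """Losses whose gradients correspond to Eqs. (2) and (4).

    R        : n-step return of Eq. (1), one value per pixel
    V        : value-network output V(s_t) per pixel
    log_pi_a : log-probability of the action actually taken at each pixel
    """
    advantage = R - V                              # Eq. (3)
    value_loss = np.mean((R - V) ** 2)             # minimised w.r.t. theta_v
    # The advantage is treated as a constant when differentiating the policy term.
    policy_loss = -np.mean(log_pi_a * advantage)   # minimised w.r.t. theta_p
    return value_loss, policy_loss
```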
Table 1. Layers with their filter size, dilation factor, and number of output channels (Dil. denotes Dilated, Conv denotes Convolution).

Common network:  Conv + ReLU        3 × 3, 1, 64
                 Dil. Conv + ReLU   3 × 3, 2, 64
                 Dil. Conv + ReLU   3 × 3, 3, 64
                 Dil. Conv + ReLU   3 × 3, 4, 64
Policy network:  Dil. Conv + ReLU   3 × 3, 3, 64
                 Dil. Conv + ReLU   3 × 3, 2, 64
                 ConvGRU            3 × 3, 1, 64
                 Conv + Softmax     3 × 3, 1, A
Value network:   Dil. Conv + ReLU   3 × 3, 3, 64
                 Dil. Conv + ReLU   3 × 3, 2, 64
                 Conv               3 × 3, 1, 1
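A compact PyTorch sketch of the layer layout in Table 1 is given below. Dilation rates and channel counts follow the table, but the ConvGRU block of the policy branch is replaced by a plain 3 × 3 convolution to keep the example short, so this is an approximation rather than the exact pixelRL architecture.

```python
import torch
import torch.nn as nn

def dil_conv(cin, cout, d):
    # 3x3 dilated convolution + ReLU; padding=d keeps the spatial size unchanged.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=d, dilation=d), nn.ReLU())

class PixelA3C(nn.Module):
    def __init__(self, in_ch=3, n_actions=10):
        super().__init__()
        self.shared = nn.Sequential(dil_conv(in_ch, 64, 1), dil_conv(64, 64, 2),
                                    dil_conv(64, 64, 3), dil_conv(64, 64, 4))
        self.policy = nn.Sequential(dil_conv(64, 64, 3), dil_conv(64, 64, 2),
                                    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),  # stand-in for ConvGRU
                                    nn.Conv2d(64, n_actions, 3, padding=1))
        self.value = nn.Sequential(dil_conv(64, 64, 3), dil_conv(64, 64, 2),
                                   nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, x):
        h = self.shared(x)
        # per-pixel action distribution and per-pixel state value
        return torch.softmax(self.policy(h), dim=1), self.value(h)
```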
3.1 Reinforcement Learning with Pixel-Wise Rewards
The network discussed above is obtained by combining the policy and value networks. The network is a fully convolutional A3C, and the specifications of its shared, policy, and value parts are shown in Table 1. This network architecture is inspired by [15]. The objective of the problem is to learn the optimal policies π = (π1, ..., πN) that maximize the overall mean of the expected rewards at each and every pixel,

π* = argmax_π Eπ( Σ_{τ=0}^{∞} γ^τ r̄τ )    (5)

r̄τ = (1/N) Σ_{i=1}^{N} rτi    (6)

Here, r̄τ is the mean of the rewards rτi at the ith pixel. Training N separate networks is computationally impractical when the image size is large, i.e., the number of pixels is huge. To solve this issue, this paper proposes the use of an FCN instead of N networks; this helps the GPU parallelize the computation, which makes the training efficient. This technique also ensures that the N agents share their parameters. To boost the overall performance, we have proposed a powerful learning method known as reward-map convolution. The gradients can be denoted in matrix form as follows [3]:

dθv = ∇θv (1/N) Jᵀ{(Rτ − V(sτ)) ⊙ (Rτ − V(sτ))}J    (7)

A(aτ, sτ) = Rτ − V(sτ)    (8)

dθp = −∇θp (1/N) Jᵀ{log π(aτ|sτ) ⊙ A(aτ, sτ)}J    (9)

where Rτ, V(sτ), A(aτ, sτ) and π(aτ|sτ) are matrices whose (ix, iy)-th elements are the corresponding values at pixel (ix, iy), J is a ones-vector in which every element is one, and ⊙ denotes element-wise multiplication [3].

dω = −∇ω (1/N) Σ_{i=1}^{N} log πi(aτi|sτi)(Rτi − V(sτi)) + ∇ω (1/N) Σ_{i=1}^{N} (Rτi − V(sτi))²    (10)

   = −∇ω (1/N) Jᵀ{log π(aτ|sτ) ⊙ A(aτ, sτ)}J + ∇ω (1/N) Jᵀ{(Rτ − V(sτ)) ⊙ (Rτ − V(sτ))}J    (11)

The first term in Eq. (10) yields a higher total expected reward, and the second term operates as a regularizer such that Ri is not deviated from the prediction V(sτi) by the convolution [3].
Fig. 1. Different actions (based on probability) applied on the pixels on the current image
3.2 Actions
The actions specified in Table 2 are applied depending on the pixel requirement. Sharpening helps to sharpen the edges in the image. It increases the contrast between bright and dark regions which brings out the features in the given image. Blurriness causes the loss of the sharpness of most of the pixels.
Bilateral Filter is a non-linear, edge-preserving, smoothing and noise-removal filter. It works by replacing the intensity of each pixel with a weighted average of its neighboring pixels. It is applied as an action to smoothen the surroundings while preserving the edges of the image, and it is used as two actions with different sigmaColor (σc) and sigmaSpace (σs) settings. Sigma Color controls the mixing of neighborhood colors, whereas Sigma Space influences farther pixels.

Table 2. Different actions applied to the pixel with its respective configurations and kernels.

Sno  Action            Kernel size  Filter/Conf
1    Sharpness         3 × 3        [0 −1 0; −1 5 −1; 0 −1 0]
2    High pass filter  3 × 3        [0 −5 0; −5 3 −5; 0 −5 0]
3    Low pass filter   3 × 3        [1/9 1/9 1/9; 1/9 1/9 1/9; 1/9 1/9 1/9]
4    Bilateral filter  –            σc = 0.1; σs = 5.0
5    Bilateral filter  –            σc = 1.0; σs = 5.0
6    Unsharp masking   –            Radius = 5, amount = 1
7    Unsharp masking   –            Radius = 5, amount = 2
8    Pix up            –            *= 1.05
9    Pix down          –            *= 0.95
10   No action         –            –
High Pass and Low Pass Filter are frequency-domain filters used to smoothen or sharpen the image by attenuating a particular (high/low) frequency component of the image. A low pass filter attenuates the high-frequency components, smoothing the image and also removing noise. A high pass filter attenuates the low-frequency components, giving sharpness to the image. Unsharp masking is a linear image processing technique used to increase the sharpness of the image. The sharpness details are obtained by the difference
between the original and blurred images. The difference is calculated and added back to the original image:

enhanced img = img + amt ∗ (img − blurred img)    (12)
Pix Up and Down help to adjust the pixel level by increasing/decreasing the pixel value. The above actions are applied to each pixel at each state so as to maximize the total reward. Figure 1 shows how different actions are applied to the current state image. The actions are predicted by the FCN [8] for every pixel and then applied to the current state image. After applying these actions to the pixels, the total reward of the current state is calculated and compared with that of the previous state, which indicates how much better the current state is than the previous one.
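The action set of Table 2 maps naturally onto standard OpenCV/NumPy filtering calls. The sketch below is our own illustration (not the authors' code), shows only four of the ten actions, assumes a single-channel float32 image in [0, 1], and picks a bilateral-filter neighborhood diameter of 5 for illustration; the "radius" of unsharp masking is treated as the Gaussian sigma.

```python
import cv2
import numpy as np

SHARPEN = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], np.float32)

def unsharp(img, radius=5, amount=1.0):
    blurred = cv2.GaussianBlur(img, (0, 0), radius)
    return img + amount * (img - blurred)            # Eq. (12)

def apply_actions(img, action_map):
    """img: (H, W) float32 in [0, 1]; action_map: (H, W) action index per pixel."""
    candidates = np.stack([
        img,                                         # no action
        cv2.filter2D(img, -1, SHARPEN),              # sharpening kernel from Table 2
        cv2.bilateralFilter(img, 5, 1.0, 5.0),       # bilateral, sigmaColor=1.0, sigmaSpace=5.0
        unsharp(img, 5, 1.0),                        # unsharp masking, radius 5, amount 1
    ])
    # For every pixel, keep the candidate chosen by the agent at that pixel.
    return np.take_along_axis(candidates, action_map[None], axis=0)[0]
```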
Fig. 2. Qualitative comparison on GoPro dataset (columns: Input, DeblurRL, [4], [12]).
4 Experiments
In this paper, we have implemented the proposed method using Python 3, ChainerRL and Chainer [13], applied to the deblurring application. We have experimented with two different sets of blurs: custom blurs using imgaug Motion Blur at a higher severity level, and the blurriness of the GoPro dataset.

4.1 Input and State Actions
The input image is a blurred RGB image; the agents try to remove the blur from the photo by applying several types of filters depending on the action required.
Fig. 3. Qualitative results after applying custom blur (columns: Input, DeblurRL, Ground truth).
In Table 2, we have shown the various types of filters/actions applied to the input pixels, which were decided empirically. One thing to note here is that we have only applied classical image filtering techniques in our proposed method.

4.2 Implementation Details
We have used the GoPro dataset, which has over 2103 different RGB images for training and over 1111 RGB images for testing. The GoPro dataset for dynamic scene deblurring is publicly available¹ [10]. We set the mini-batch to 50 random images from the pool of training images with 70 × 70 random cropping. For the different experiments, we have added custom blur using imgaug and used Motion and Defocused blur at a severity level of 4. For training, we have used the Adam optimizer [5] with a starting learning rate of 0.001. We have set a maximum of 25,000 episodes, where the length of each episode is 5 (t_max).

¹ https://seungjunnah.github.io/Datasets/gopro

4.3 Results
Table 3. Quantitative deblurring performance comparison on the GoPro dataset.

Measures  [12]     [4]      DeblurRL
SSIM      0.764    0.743    0.763
PSNR      31.573   31.965   31.87
Runtime   20 min   1 h      5.5 s
and compared the results with the state-of-the-art methods of [4] and [12] in both qualitative and quantitative ways. In contrast, our results are free from kernel-estimation problems. Table 3 shows the PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index measure) scores. The SSIM score is a perceptual metric that quantifies the degradation of the image after applying any image processing task. These are the average SSIM and PSNR scores over all 1111 GoPro test images. The qualitative results obtained in the experiment are shown in Fig. 2 and compared with the results of [4] and [12]. Moreover, in Fig. 3, the qualitative results are shown for custom blur (using the imgaug python library) and compared with the ground truth. The proposed approach is able to restore the blurred image close to the ground truth.
5 Conclusion
In this paper, we have proposed a new application of deep reinforcement learning which addresses the problem at a much more granular level, i.e., pixel-wise, and applies the chosen action at the pixel level. We have experimented with this technique to remove blur from RGB images of the dynamic scene blur dataset (GoPro). Our experimental results show higher quantitative as well as qualitative performance when compared to other state-of-the-art methods. This paper also discusses how we can maximize the reward of each pixel, which makes our method different from conventional convolutional neural network based image processing methods. We believe that our method can be used for other image processing tasks where supervised learning is difficult to apply.
References
1. Chakrabarti, A.: A neural approach to blind motion deblurring. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 221–235. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_14
2. François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. arXiv preprint arXiv:1811.12560 (2018)
3. Furuta, R., Inoue, N., Yamasaki, T.: PixelRL: fully convolutional network with reinforcement learning for image processing. IEEE Trans. Multimedia 22(7), 1702–1719 (2019)
4. Hyun Kim, T., Mu Lee, K.: Segmentation-free dynamic scene deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2766–2773 (2014) 5. Kingma, D.P., Ba., J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 6. Kupyn, O., Budzan, V., Mykhailych, M., Mishkin, D., Matas, J.: Deblurgan: blind motion deblurring using conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8183–8192 (2018) 7. Li, D., Wu, H., Zhang, J., Huang., K.: A2-rl: aesthetics aware reinforcement learning for image cropping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8193–8201 (2018) 8. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016) 9. Mnih, V., et al.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013) 10. Nah, S., Hyun Kim, T., Mu Lee, K.: Deep multi-scale convolutional neural network for dynamic scene deblurring. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3883–3891 (2017) 11. Schuler, C.J., Hirsch, M., Harmeling, S., Sch¨ olkopf, B.: Learning to deblur. IEEE Trans. Pattern Anal. Mach. Intell. 38(7), 1439–1451 (2015) 12. Sun, J., Cao, W., Xu, Z., Ponce, J.: Learning a convolutional neural network for non-uniform motion blur removal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 769–777 (2015) 13. Tokui, S., et al.: Chainer: a deep learning framework for accelerating the research cycle. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2002–2011 (2019) 14. Xu, L., Jimmy, S.J., Ren, C.L., Jia, J.: Deep convolutional neural network for image deconvolution. Adv. Neural Inf. Process. Syst. 27, 190–1798 (2014) 15. Zhang, K., Zuo, W., Gu, S., Zhang, L.: Learning deep CNN denoiser prior for image restoration. In: Proceedings of the IEEE Conference on Computer Vision Pattern Recognition, pp. 3929–3938 (2017)
FGrade: A Large Volume Dataset for Grading Tomato Freshness Quality Sikha Das1(B) , Samarjit Kar2 , and Arif Ahmed Sekh3 1
Kalyani Government Engineering College, Kalyani, Nadia, West Bengal, India 2 National Institute of Technology, Durgapur, India [email protected] 3 UiT The Arctic University of Norway, Tromsø, Norway
Abstract. Quality-based grouping or grading of fruits and vegetables plays an important role in many stages of agricultural production. The method is used for sorting items based on size, color, presence of damage, etc. Manual grading is laborious and time consuming, and hence not suitable for processing massive volumes. Grading different fruits and vegetables based on size or damage is popular in the literature, whereas more complex problems such as freshness-based grading have received less attention. Here, we introduce a new freshness grading dataset, namely FGrade. The dataset is a collection of (∼6K) high-quality tomato images collected in the form of day-by-day degradation. The dataset is labelled into 2, 4, 6 and 10 classes in a structured manner. We have benchmarked the dataset using state-of-the-art image classification methods. We believe that the dataset is challenging and will attract computer vision researchers in the future. The dataset is available publicly (https://github.com/skarifahmed/FGrade).
Keywords: Vegetable grading dataset · Tomato dataset · Image classification

1 Introduction
Agriculture plays a major role in economic development [1], like other economic sectors. For the storing and selling cycle, grading vegetables and fruits based on freshness is important: fresh items can be stored for a long time, whereas degrading items need to be sold earlier. Grading methods are also useful to decide the value of the items based on their freshness. Buying vegetables and fruits from a supermarket by visually checking and picking up each item by hand is a herculean task when purchasing items at a large scale. Modern technological developments in computer hardware and Artificial Intelligence (AI) [2], along with computer vision (CV) and deep learning, have opened up new possibilities in the agricultural sector [3] for categorizing cultivated produce based on shape, size and color. Figure 1 demonstrates the state-of-the-art methods used for different kinds of grading of vegetables and fruits, using the tomato as an example. Although
grading based on shape, texture, damage etc. is popular and used for different fruits and vegetables [4], none of these applications has been applied to identifying the freshness of the vegetables or fruits [5]. The main challenges of such a system are:
– Unavailability of a public dataset containing day-by-day degrading images of vegetables over a long duration.
– Degradation is a slow process for some vegetables, and finer classification between two consecutive days is challenging.
The objective of this research is to prepare a dataset for classifying the freshness of tomatoes by overcoming the challenges mentioned above. In this paper, we have prepared a dataset of tomatoes that overcomes the above-mentioned challenges and fed the data to state-of-the-art image classification methods for comparative studies. The proposed dataset contains 6,470 images collected from a set of tomatoes over 90 days. The dataset is manually labeled into 2, 4, 6, and 10 classes based on the freshness quality. We have benchmarked the dataset using state-of-the-art image classification methods. In this paper, we describe the details of the dataset, including the state-of-the-art studies.

Fig. 1. The state-of-the-art approach used to grade tomatoes.

The paper is organized as follows: in Sect. 2 we discuss the related works on different datasets applied to fruit and vegetable grading; Section 3 gives the details of the dataset collection method; Section 4 summarizes the benchmarking methods and the results; finally, in Sect. 5 we conclude the article.
2 Related Works
Automatic grading of fruits and vegetables became emerging and got attention for many potential applications [6]. The grading problem is applied in various computer vision problems and primarily solved using classical image processing methods [7] and deep neural networks [8]. Here, we have discussed different fruit and vegetable image datasets and state-of-the-art automatic grading methods. Fruit and Vegetable Grading Datasets: Muressan et al. [9] have presented different types of fruit images dataset, namely “Fruit-360” with a large number of fruit images from different objects. Hani et al. [10] have presented a new dataset for fruit detection, segmentation, and counting in orchard environments
called “MinneApple”. Hou et al. [11] have presented a larger dataset consisting of vegetables and fruits which are associated with daily life. Sa et al. [12] have proposed dataset called “DeepFruits” for detection fruits using a state-of-theart object detector termed Faster Region-based CNN (Faster R-CNN) which is retrained to perform the detection of seven fruits. Turayki et al. [13] have presented both sliced and unsliced high-res images dataset which were sourced from the “Fruits360”, “FIDS30”, and “ImageNet” datasets, The dataset named as “SUFID”. Marko et al. [14] presented common fruits image dataset which are classified into 30 different fruit classes. Zheng et al. [15] proposed a large dataset for deep-learning-based classification and detection models. This application is used in agricultural detection tasks. Other Agriculture Datasets: There also exists different agriculture based datasets designed for the CV community. Ranjita et al. [16] proposed a large volume database which has real-life symptom images of multiple apple foliar diseases. This database is used to identify the category of foliar diseases in multiple apple leaves. Lee et al. [17] described a large-scale food images dataset namely AIFood. This dataset is used for ingredient recognition in food images. The dataset has 24 categories and 372,095 food images around the world. Marsh et al. [18] have proposed a database of images of 960 unique plants belonging to 12 species at several growth stages. This dataset has 5539 images of different plants. In Table 1 we have shown some popular image datasets used in various agricultural based computer vision problems.
Table 1. Popular image datasets used in various image classification problems.

Dataset              | Description                                                                   | Application                      | Samples/Annotation
Fruit-360 [9]        | Online fruit and vegetable dataset for fruit recognition                      | Fruit recognition                | ∼90,483 fruit and vegetable images
MinneApple [10]      | A large variety of high-resolution images for apple detection and segmentation | Image detection and segmentation | ∼1,000 images with ∼40,000 annotated objects
CropDeep [15]        | A large variety of images collected with different cameras and annotated in 31 classes | Classification and detection | ∼31,147 images with ∼49,000 annotations
VegFru [11]          | A large-scale dataset consisting of vegetables and fruits                     | Image detection and recognition  | ∼160,000 vegetable and fruit images
DeepFruit [12]       | Visual fruit detection using strawberry images                                | Image detection                  | ∼11k images with 27k annotated objects
FIDS30 [14]          | Common fruit images in 30 different fruit classes                             | Classification                   | ∼971 images
SUFID [13]           | Dataset of sliced and unsliced fruit images                                   | Classification                   | ∼7,500 high-resolution images
CIFAR-10 [19]        | Low-resolution images from a wide variety of classes                          | Classification                   | ∼60,000 colour images in 10 classes
Leafsnap [16]        | A high-quality dataset for identifying multiple apple leaf diseases           | Classification                   | ∼3,642 images
Plant Seedlings [18] | A large-scale labelled plant seedlings dataset                                | Object classification            | ∼5,539 images
AIFood [17]          | A large-scale food images dataset                                             | Ingredient recognition           | ∼372,095 food images
We have observed that none of the existing datasets can be used to detect the quality of fruits and vegetables in agricultural tasks. This observation motivates us to create a tomato dataset. State-of-the-Art Fruit and Vegetable Image Classification Methods: Hossain et al. [20] proposed an efficient framework for fruit classification using deep learning models for industrial applications. They used two deep learning architectures, i.e. a proposed light model of six CNN layers and the Visual Geometry Group (VGG)-16 network. Bhargava et al. [6] reviewed fruit
and vegetables quality evaluation using computer vision. The author uses preprocessing, segmentation, feature extraction, and classification. Liu et al. [21] proposed the computer vision-based tomato grading algorithm based on color features, size and shape of images. The features were extracted using the histograms of color HSV model and size of tomatoes using first-order first-difference (FD) shape description method. Three classifiers were used to classify images. Opena et al. [22] proposed an automated tomato classification system that used an artificial neural network (ANN) classifier and the artificial bee colony (ABC) algorithm used for training the model. Luna et al. [23] proposed a classifier that is used to classify fruits in different classes based on size of images by thresholding, machine learning and deep learning models. Semary et al. [24] proposed a method to classify infected fruits based on its external surface. Gray level co-occurrence matrix (GLCM) along with Color moments, Wavelets energy and entropy have been used in preprocessing and feature extraction. Support vector machine (SVM) was used to classify tomato images into 2 classes using MinMax and Z-Score normalization methods. Wan et al. [25] presented a method that uses color features recognition of concentric circles with equal area on the tomato surface, and created a maturity grading model based on these features and the backpropagation neural network (BPNN). Kaur et al. [26] proposed a technique for detecting the quality of fruits. The method has been successfully applied in a large number of ANN based quality evaluation of lemon.
3 Proposed Dataset and Benchmark
From the above discussion we note that there is no suitable dataset for grading tomatoes. Having compared the other fruit/vegetable datasets, we realize that this kind of analysis demands a different experimental setup. We have collected 12 varieties of tomato from the local market for the experiments. The resulting dataset contains 6,470 images which are used to detect freshness. Some example images of our dataset are shown in Fig. 2.
Fig. 2. Examples of randomly chosen samples in our dataset.
Data Collection: We have proposed a new domain of research where we measure the day-by-day degradation of a vegetable or fruit. Hence, data collection is the most crucial part of the work, as it sets the tone for the whole study. In this paper, we first considered a set of tomatoes for our experiment. We collected these from the nearby market and stored the items at room temperature. Every day we captured 8 different images from 8 different angles for each and every sample using a Sony DSC-W190 camera, and we kept taking pictures until the tomatoes were spoiled. Finally, the dataset consists of ∼6.5K high-resolution images (10 MP). Preprocessing: The raw images may not be suitable for the classification task, so we used preprocessing steps to suppress the background for benchmarking. Before processing the object, we segmented [27] the image using the grab cut algorithm [28,29] and removed the background. Image Annotation: Here, annotation refers to grouping the tomatoes by their freshness. The 6,470 images of tomatoes are grouped into 10 classes: class 1 contains the freshest items and the last class contains the images of fully rotten tomatoes. From the collected data it was noticed that not all the samples rotted on the same day; they have different spans of freshness, as they were not harvested on the same day by the farmer. 10 volunteers were involved in the annotation process. Volunteers were shown the samples one by one, labeled with the number of days past from the beginning of storage. Next, they had to grade the tomatoes from 1 to 10, where 1 is the freshest and 10 is rotten. The volunteers grouped them by visual inspection while considering the days spent in storage. Finally, each sample is assigned to a class by maximum voting among the volunteers. Once the initial assignment of classes 1 to 10 is achieved, the dataset is used to create 2, 4, and 6 class problems by merging nearby classes. Figure 3 depicts the class merging used in our dataset for the 2, 4, and 6 class problems, and the distribution of the images over the classes is shown in Fig. 4.
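As a rough illustration of the background-suppression step described above, the following OpenCV sketch (our own approximation; the exact parameters used for the dataset are not reported here, and the filename is a placeholder) applies GrabCut with an initial rectangle that assumes the tomato is roughly centred in the frame.

```python
import cv2
import numpy as np

def remove_background(image_bgr, margin=0.05):
    """Suppress the background with GrabCut, keeping only the (assumed centred) object."""
    h, w = image_bgr.shape[:2]
    rect = (int(w * margin), int(h * margin),
            int(w * (1 - 2 * margin)), int(h * (1 - 2 * margin)))
    mask = np.zeros((h, w), np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal GrabCut background model
    fgd_model = np.zeros((1, 65), np.float64)  # internal GrabCut foreground model
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
    return image_bgr * fg[:, :, None]

# example usage: segmented = remove_background(cv2.imread("tomato_day01_view3.jpg"))
```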
Fig. 3. Examples of distribution of classes in our dataset.
Fig. 4. The distribution of images in different classes.
4 Benchmarking Methods and Discussion
In this paper, we have benchmarked the FGrade dataset by applying various existing classical methods and deep learning models. We have used state-of-the-art deep models such as ResNet50 [30], ResNet101 [31], ResNet152 [32], VGG16 [33], VGG19 [34], NASNetMobile [35], NasNetLarge [35], InceptionV3 [36], MobileNet [37], DenseNet121 [38], DenseNet169 [38], and DenseNet201 [38]. The dataset is divided into train and test sets of 80% and 20% respectively. The deep learning models are trained/fine-tuned using the train set, and the test set is used for validation. First, the deep models are pre-trained on ImageNet [39] and then fine-tuned using the proposed FGrade dataset, which is divided into 2, 4, 6, and 10 classes as described earlier. We have calculated the average accuracy over the 2, 4 and 6 class problems on the FGrade dataset using the different deep learning models. Table 2 summarizes the average accuracy and the 10-class accuracy of the state-of-the-art methods. Figure 5 shows the accuracy on the differently labeled class problems of the FGrade dataset using different methods, and Table 3 shows the results for 10 randomly taken test images and their corresponding classes.
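A minimal sketch of this benchmarking setup, assuming a Keras/TensorFlow implementation (the directory names, input resolution, and preprocessing are placeholders chosen for illustration, not values taken from the paper): an ImageNet-pretrained backbone is given a new classification head and fine-tuned on the 80/20 split.

```python
import tensorflow as tf

NUM_CLASSES = 10        # 2, 4, 6 or 10 depending on the labelling used
IMG_SIZE = (224, 224)   # assumed input resolution

train_ds = tf.keras.utils.image_dataset_from_directory(
    "fgrade/train", image_size=IMG_SIZE, batch_size=16)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "fgrade/test", image_size=IMG_SIZE, batch_size=16)

backbone = tf.keras.applications.MobileNet(
    include_top=False, weights="imagenet", pooling="avg", input_shape=IMG_SIZE + (3,))
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNet expects inputs in [-1, 1]
    backbone,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=100)
```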
Table 2. Accuracy using different state-of-the-art classical and deep learning methods.

Method            | Average accuracy (%) | 10-class accuracy (%)
VGG16 [33]        | 80.73                | 63.52
VGG19 [34]        | 50.36                | 65.20
ResNet50 [30]     | 79                   | 55.59
ResNet101 [31]    | 70.61                | 56.13
ResNet152 [32]    | 72                   | 54.43
DenseNet121 [38]  | 81.87                | 63.59
DenseNet169 [38]  | 80.53                | 64.35
DenseNet201 [38]  | 81.93                | 63.47
NASNetMobile [35] | 80.46                | 61.43
NasNetLarge [35]  | 78.63                | 60.29
InceptionV3 [36]  | 72.61                | 55.43
MobileNet [37]    | 82.78                | 64.47

Fig. 5. Different class accuracy on FGrade dataset using different methods.
All the hyperparameters [40] of the state-of-the-art deep models are used exactly as described in the original articles. We have used 100 epochs with a fixed batch size of 16. A learning rate reduction and early stopping strategy is adopted based on the validation accuracy in all cases. Figure 6 shows example confusion matrices [41,42] for the 10-class classification obtained with a few of the top-performing models. We achieved a maximum of ∼92% accuracy on the 2-class problem and ∼65% on the 10-class problem. It is noted that the state-of-the-art deep learning models perform well on the 2-class problem but fail to infer correctly when the number of classes increases, because of the complex shapes and the finer differences in texture. It can be concluded that a custom-designed neural network suited to the dataset is required.
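The learning-rate reduction and early stopping mentioned above could be wired in with standard Keras callbacks, roughly as follows (a sketch only; the exact patience values and reduction factor are our assumptions, not values reported in the paper).

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # halve the learning rate when validation accuracy plateaus (factor/patience assumed)
    ReduceLROnPlateau(monitor="val_accuracy", factor=0.5, patience=3, min_lr=1e-6),
    # stop early if validation accuracy stops improving, keeping the best weights
    EarlyStopping(monitor="val_accuracy", patience=10, restore_best_weights=True),
]

# with the datasets built at batch size 16 as above:
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```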
Fig. 6. Confusion matrix of the 10-class problem on test data.
Table 3. Examples of 10 random sample images applied to 10-class classification using state-of-the-art deep learning models, where C represents a successfully classified sample with its label number and C̄ represents a falsely classified sample with its label number.
5 Conclusion
In this paper, we have introduced a novel freshness grading dataset consisting of large-scale tomato images (∼6K images; 2, 4, 6, and 10 classes). The dataset, which we call FGrade, can be used for freshness grading. We have also benchmarked the dataset using state-of-the-art deep learning methods, which may open up new challenges in CV-guided vegetable and fruit grading systems. We note that the dataset is challenging and state-of-the-art methods failed to achieve good
accuracy on the dataset, particularly when the number of classes increases. We hope that the dataset will attract researchers and will be a valuable contribution to the CV community. Future works include expanding the dataset by adding other vegetables and fruits.
References 1. Antle, J.M., Ray, S.: Economic development, sustainable development, and agriculture. Sustainable Agricultural Development. PSAEFP, pp. 9–42. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-34599-0 2 2. Russell, S.J., Norvig, P.: Artificial Intelligence: a Modern Approach. Pearson, Malaysia (2016) 3. Longsheng, F., Gao, F., Jingzhu, W., Li, R., Karkee, M., Zhang, Q.: Application of consumer RGB-d cameras for fruit detection and localization in field: a critical review. Comput. Electron. Agric. 177, 105687 (2020) 4. Ucat, R.C., Cruz, J.C.D.: Postharvest grading classification of cavendish banana using deep learning and tensorflow. In: 2019 International Symposium on Multimedia and Communication Technology (ISMAC), pp. 1–6. IEEE (2019) 5. Zilong, H., Tang, J., Zhang, P., Jiang, J.: Deep learning for the identification of bruised apples by fusing 3d deep features for apple grading systems. Mech. Syst. Sig. Process. 145, 106922 (2020) 6. Bhargava, A., Bansal, A.: Fruits and vegetables quality evaluation using computer vision: a review. J. King Saud Univ. Comput. Inf. Sci. (2018). https://doi.org/10. 1016/j.jksuci.2018.06.002 7. Ku, J., Harakeh, A., Waslander, S.L.: In defense of classical image processing: fast depth completion on the cpu. In: 2018 15th Conference on Computer and Robot Vision (CRV), pp. 16–22. IEEE (2018) 8. Scott, G.J., Hagan, K.C., Marcum, R.A., Hurt, J.A., Anderson, D.T., Davis, C.H.: Enhanced fusion of deep neural networks for classification of benchmark highresolution image data sets. IEEE Geosci. Remote Sens. Lett. 15(9), 1451–1455 (2018) 9. Chung, D.T.P., Van Tai, D.: A fruits recognition system based on a modern deep learning technique. In: Journal of Physics: Conference Series, vol. 1327, p. 012050. IOP Publishing (2019) 10. H¨ ani, N., Roy, P., Isler, V.: Minneapple: a benchmark dataset for apple detection and segmentation. IEEE Robot. Autom. Lett. 5(2), 852–858 (2020) 11. Hou, S., Feng, Y., Wang, Z.: A domain-specific dataset for fine-grained visual categorization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 541–549 (2017) 12. Sa, I., Ge, Z., Dayoub, F., Upcroft, B., Perez, T., McCool, C.: Deepfruits: a fruit detection system using deep neural networks. Sensors 16(8), 1222 (2016) 13. Turayki, L.A.B., Abubacker, N.F.: SUFID: sliced and unsliced fruits images dataset. In: Badioze Zaman, H. (ed.) IVIC 2019. LNCS, vol. 11870, pp. 237–244. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34032-2 22 ˇ Automatic fruit recognition using computer vision. Matej Kristan), 14. Marko, S.: Fakulteta za racunalniˇstvo in informatiko, Univerza v Ljubljani, Mentor (2013) 15. Zheng, Y.-Y., Kong, J.-L., Jin, X.-B., Wang, X.-Y., Ting-Li, S., Zuo, M.: Cropdeep: the crop vision dataset for deep-learning-based classification and detection in precision agriculture. Sensors 19(5), 1058 (2019)
16. Thapa, R., Snavely, N., Belongie, S., Khan, A.: The plant pathology 2020 challenge dataset to classify foliar disease of apples. arXiv preprint arXiv:2004.11958 (2020) 17. Lee, G.G., Huang, C.W., Chen, J.H., Chen, S.Y., Chen, H.L.: Aifood: a large scale food images dataset for ingredient recognition. In: TENCON 2019–2019 IEEE Region 10 Conference (TENCON), pp. 802–805. IEEE (2019) 18. Giselsson, T.M., Jørgensen, R.N., Jensen, P.K., Dyrmann, M., Midtiby, H.S.: A public image database for benchmark of plant seedling classification algorithms. arXiv preprint arXiv:1711.05458 (2017) 19. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) 20. Hossain, M.S., Al-Hammadi, M., Muhammad, G.: Automatic fruit classification using deep learning for industrial applications. IEEE Trans. Industr. Inf. 15(2), 1027–1034 (2018) 21. Liu, L., Li, Z., Lan, Y., Shi, Y., Cui, Y.: Design of a tomato classifier based on machine vision. PloS one 14(7), e0219803 (2019) 22. Ope˜ na, H.J.G., Yusiong, J.P.T.: Automated tomato maturity grading using abctrained artificial neural networks. Malays. J. Comput. Sci. 30(1), 12–26 (2017) 23. de Luna, R.G., Dadios, E.P., Bandala, A.A., Vicerra, R.R.P.: Size classification of tomato fruit using thresholding, machine learning, and deep learning techniques. AGRIVITA J. Agric. Sci. 41(3), 586–596 (2019) 24. Semary, N.A., Tharwat, A., Elhariri, E., Hassanien, A.E.: Fruit-Based tomato grading system using features fusion and support vector machine. In: Filev, D., (eds.) Intelligent Systems’2014. AISC, vol. 323, pp. 401–410. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-11310-4 35 25. Wan, P., Toudeshki, A., Tan, H., Ehsani, R.: A methodology for fresh tomato maturity detection using computer vision. Comput. Electron. Agric. 146, 43–50 (2018) 26. Kaur, M., Sharma, R.: Quality detection of fruits by using ANN technique. IOSR J. Electron. Commun. Eng. Ver. II, 10(4), 2278–2834 (2015) 27. Zaitoun, N.M., Aqel, M.J.: Survey on image segmentation techniques. Procedia Comput. Sci. 65, 797–806 (2015) 28. Tang, M., Gorelick, L., Veksler, O., Boykov, Y.: Grabcut in one cut. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1769–1776 (2013) 29. Rother, C., Kolmogorov, V., Blake, A.: “Grabcut” interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. (TOG) 23(3), 309–314 (2004) 30. Pelka, O., Nensa, F., Friedrich, C.M.: Annotation of enhanced radiographs for medical image retrieval with deep convolutional neural networks. PloS One 13(11), e0206229 (2018) 31. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 32. Wu, S., Zhong, S., Liu, Y.: Deep residual learning for image steganalysis. Multimedia Tools Appl. 77(9), 10437–10453 (2017). https://doi.org/10.1007/s11042-0174440-4 33. Ha, I., Kim, H., Park, S., Kim, H.: Image retrieval using BIM and features from pretrained VGG network for indoor localization. Build. Environ. 140, 23–31 (2018) 34. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
35. Saxen, F., Werner, P., Handrich, S., Othman, E., Dinges, L., Al-Hamadi, A.: Face attribute detection with mobilenetv2 and nasnet-mobile. In: 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 176– 180. IEEE (2019) 36. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016) 37. Ilhan, H.O., Sigirci, I.O., Serbes, G., Aydin, N.: A fully automated hybrid human sperm detection and classification system based on mobile-net and the performance comparison with conventional methods. Med. Biol. Eng. Comput. 58(5), 1047–1068 (2020). https://doi.org/10.1007/s11517-019-02101-y 38. Zhang, J., Chaoquan, L., Li, X., Kim, H.-J., Wang, J.: A full convolutional network based on densenet for remote sensing scene classification. Math. Biosci. Eng 16(5), 3345–3367 (2019) 39. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009) 40. Shankar, K., Zhang, Y., Liu, Y., Wu, L., Chen, C.H.: Hyperparameter tuning deep learning for diabetic retinopathy fundus image classification. IEEE Access 8, 118164–118173 (2020) 41. Luque, A., Carrasco, A., Mart´ın, A., de las Heras, A.: The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn. 91, 216–231 (2019) 42. Jianfeng, X., Zhang, Y., Miao, D.: Three-way confusion matrix for classification: a measure driven view. Inf. Sci. 507, 772–794 (2020)
Enhancement of Region of Interest from a Single Backlit Image with Multiple Features

Gaurav Yadav1, Dilip Kumar Yadav1, and P. V. S. S. R. Chandra Mouli2(B)

1 National Institute of Technology, Jamshedpur 831014, Jharkhand, India
{2018rsca005,dkyadav.ca}@nitjsr.ac.in
2 Central University of Tamil Nadu, Thiruvarur 631005, Tamil Nadu, India
[email protected]
https://www.nitjsr.ac.in, https://www.cutn.ac.in
Abstract. Backlit images are a combination of dark and bright regions and the objects in the image generally appear to be dark for human perception. The region of interest (ROI) in general confines to the object(s) present in the image or some regions of the image. Such ROI in backlit images have low contrast and it is difficult for visualization. Enhancement of ROI in backlit images is necessary in order to view the contents properly. In this paper, a novel and simple approach for the enhancement of ROI of backlit images is proposed. This approach considers several features including tone mappings, exposedness, gradient, median filtering, etc. and finally, the fusion of the results has been done. The novel contribution in the proposed method, though seems to be trivial, attained best results without applying pyramid based operations namely Laplacian pyramid and Gaussian pyramid. Efficacy of the proposed method is evident from the experimental results which confirm that the proposed approach gives better results both qualitatively (visualization) and quantitatively (objective evaluation) compared to the existing methods.
Keywords: Backlit image enhancement · Image enhancement · Tone mapping · ROI enhancement

1 Introduction
The human visual system can discriminate between the contents in images, but that becomes a challenge when the images appear very dark with non-uniform illumination and backlighting. Apart from low-contrast images, the presence of simultaneously very low and very high-intensity regions in an image results in a backlit image. The edge and texture information about the objects in backlit images is poor compared to that in well-lit images. Hence, the enhancement of backlit images is required, and in many applications it becomes essential to enhance the image as a pre-processing step. The problem is
challenging because the objects of interest are sometimes completely not visible at all or dominated by the backlighting conditions during the generation of the image. Therefore, understanding the objects in a better way, to improve the object detection and recognition tasks, enhancement of backlit images becomes a mandatory step for an overall gain to such processes. A variety of proposed approaches are available in the literature for the enhancement of uniform or non-uniform illumination. Many techniques that adapt enhancement in the spatial domain are based on the neighborhood of each pixel and obtain satisfactory results. However, they produce halos or produces a very high amount of enhanced noise and image artifacts. Therefore, finding better visibility for the objects present in dark regions of the backlit image becomes both challenging and interesting. There exists very extensive treatment in the literature regarding image enhancement over the years. For backlit images, the most promising and popular techniques are a combination of different features and fusion based methods. Camera response models are studied for low-light image enhancement [25] and combines the retinex model for the naturalness preserving of low-light image enhancement [18]. Wang et al. [24] proposed a novel absorption of light scattering based model (ALSM) for the enhancement of low light images using the response characteristics of cameras. A detailed experiment-based review summarizing the widely used classes of low-light image enhancement algorithms is presented [23]. Multiple approaches to address the issue of undesired illuminations are studied based on the histogram equalization [17]. Contrast enhancement of images using spatial based information of pixels is also studied. Based on the entropy measure gray levels distribution can also be performed to achieve contrast enhancement for low contrast images [2]. Niu et al. [16] proposed a different approach based on tone preserving entropy maximization and finds its possibility. They combined the preprocessing of image restoration with the algorithm to make it more powerful. There exist content-aware algorithms that perform enhancement of dark images, sharpening of edges by simultaneously revealing the details present in textured regions also by preserving the smoothness of flat regions [19]. A global and adaptive contrast enhancement algorithm for low illumination gray images reduces the uneven illumination and low overall contrast of gray image [10]. Also, triple partitioning of the histogram and histogram clipping is studied to control the enhancement ratio [26]. Mertens et al. [14] proposed the fusion of images using a bracketed exposure sequence to obtain a high-quality image. This study performs a blending of multiple exposures which is guided by contrast and saturation. Although the expo-sure fusion has certain limitations associated [6]. Multi-exposure based image enhancement method is performed for detail preserving based on the tone mapping curves and exposed regions [12]. Singh et al. [20] proposed detail enhancement using exposure fusion based on nonlinear translation-variant filters. Multi-layer lightness statistics of high-quality outdoor images with a combination of contrast enhancement and naturalness preservation may address this issue to some extent [22]. Jha et al. [9] proposed a novel technique using dynamic
stochastic resonance in the discrete cosine transform (DCT) domain. Study of noise enhanced iterative processing based on Fourier coefficients for enhancement of low contrast images using Iterative scaling can be performed on DCT coefficients [3]. Martorell et al. [13] proposed a novel algorithm for multi-exposure fusion by decomposing the patches of the image with DCT transform. Curved Gabor filters based on the ridge frequency estimation can be used for curved regions and enhancement of images [5]. Morel et al. [15] proposed a method of simple gradientdomain to eliminate the effect of nonuniform illumination. Also, by preserving the details in the image. Huang et al. [7] proposed a CNN based Unet model with mix loss functions to enhance the low-illumination images. Backlit image enhancement uses multiple tone mappings to perform the contrast enhancement and adjustment of different regions in the image [1]. The results obtained are a combination of all these processing and image fusion algorithm. Multi-step methods for enhancement of backlit images are studied based on transmission coefficients computation, multiple exposures generation based on transmission coefficients, and image fusion [8]. In Backlighting, a fusion of derived inputs with corresponding weights in multi-scale mode enhances the given input [4]. As many results suggest, a trade-off between the detail enhancement, local contrast, and naturalness of the image always exists. Wang et al. [21], discussed the multi-scale based fusion strategy for single backlit images. In this paper, the enhancement of backlit images by primarily focusing on the ROI is proposed. A novel approach for the enhancement of region of interest in single backlit images is discussed.
2 Proposed Methodology
The two important aspects of backlit images are lack of luminance and degraded contrast in the underexposed regions [21]. Therefore, the proposed approach to target the problems associated with backlit images is divided into three main parts: a luminance and contrast enhancement technique, preservation of boundary information, and fusion of all feature information. The fundamental steps begin by separating the input into its Red (R), Green (G), and Blue (B) channel images for individual treatment. The results of the specific feature operations and exposure maps are then merged into a single image. All three channel images work as inputs and are operated on in layers. At the final stage, all the feature results are combined to generate the enhanced version of the image.

2.1 Tone Mappings
There are very low and very high-intensity regions in the backlit images, which requires treatment of low luminance regions and dynamic range compression. First, the input backlit image is divided into three-channel images – R, G, and B respectively. For simplicity, two global tone mapping operations viz. gamma correction and logarithmic transformation are selected. Both of them have wide
acceptance in the literature. The gamma correction and logarithmic transformation are applied over all the channel images. Gamma correction and logarithmic transformation are defined in Eqs. (1) and (2) respectively.

G(γ)(I) = 255 × (I/255)^γ,  γ ∈ {0.4, 0.8, 2}   (1)

L(α)(I) = 255 × log(αI + 1) / log(255α + 1),  α ∈ {0.1, 0.3, 0.5}   (2)

Here I represents the channel image, and γ and α are coefficients representing the adaptation factor and the brightness factor respectively. The exposure measure defines the closeness to the mid-intensity value using a Gaussian curve [14], as given in Eq. (3), where σ_i is a parameter that the authors fix to 0.2.

E(I) = exp( −(I/255 − 0.5)^2 / (2σ_i^2) ),  σ_i = 0.2   (3)
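The three maps in Eqs. (1)–(3) are straightforward to compute per channel; the NumPy sketch below is our own rendering of those formulas for an 8-bit channel image, not the authors' code.

```python
import numpy as np

def gamma_map(channel, gamma):            # Eq. (1): gamma correction
    return 255.0 * (channel / 255.0) ** gamma

def log_map(channel, alpha):              # Eq. (2): logarithmic transformation
    return 255.0 * np.log(alpha * channel + 1.0) / np.log(255.0 * alpha + 1.0)

def exposedness(channel, sigma=0.2):      # Eq. (3): closeness to mid intensity
    return np.exp(-((channel / 255.0 - 0.5) ** 2) / (2.0 * sigma ** 2))

# example: build all tone-mapped versions and the exposure weights of one channel
channel = np.random.randint(0, 256, (480, 640)).astype(np.float64)
tone_maps = [gamma_map(channel, g) for g in (0.4, 0.8, 2.0)] \
          + [log_map(channel, a) for a in (0.1, 0.3, 0.5)]
weights = exposedness(channel)
```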
2.2 Gradient Mapping and Filtering
To observe directional changes and edge information, a gradient map is used, as defined in Eq. (4). This preserves the edge information in the channel images. At every pixel it measures the change in intensity of the corresponding point in the original channel image along a particular direction.

∇f = [g_x, g_y]^T = [∂f/∂x, ∂f/∂y]^T   (4)

Here ∂f/∂x and ∂f/∂y denote the derivatives with respect to x and y respectively. In a similar context, the image is smoothed using order-statistics filtering; median filtering has been employed for this purpose. It is defined in Eq. (5) and replaces each entry with the median of its neighboring values, which helps to obtain a smoothed version of the image.

y[m, n] = median{ x[i, j] : (i, j) ∈ w }   (5)

where w represents the local neighborhood centered around location [m, n] in the image. The neighborhood size used in this paper is 5 × 5.
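A small sketch of Eqs. (4) and (5), using NumPy gradients and a 5 × 5 median filter; the SciPy call is our choice of implementation for illustration, not necessarily the one used by the authors.

```python
import numpy as np
from scipy.ndimage import median_filter

def gradient_map(channel):
    gy, gx = np.gradient(channel.astype(np.float64))  # ∂f/∂y and ∂f/∂x per pixel (Eq. 4)
    return np.sqrt(gx ** 2 + gy ** 2)                 # gradient magnitude map

def median_map(channel, size=5):
    return median_filter(channel, size=size)          # 5x5 order-statistics (median) filter (Eq. 5)
```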
2.3 Fusion
All the feature images obtained from the color channels by applying (1), (2), (3), (4) and (5), namely the tone mappings, exposedness, gradient and median-filtered outputs, are combined using the image fusion algorithm based on Mertens et al. [14]. The algorithm is designed to merge the several feature inputs of the processed channel images as given in Eq. (6).

I(x) = Σ_{j=1}^{N} w_j(x) I_j(x)   (6)

where w_j(x) represents the weight map based on well-exposedness generated from Eq. (3) and I_j(x) represents the different feature inputs of the image.
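Equation (6) amounts to a per-pixel weighted average of the feature images. A minimal single-scale sketch is given below; it ignores the multiresolution blending of the original exposure-fusion algorithm [14] and is only an illustration of the weighting itself.

```python
import numpy as np

def fuse(feature_images, weight_maps, eps=1e-12):
    """Per-pixel weighted average: I(x) = sum_j w_j(x) I_j(x), with weights normalised."""
    W = np.stack(weight_maps, axis=0)
    W = W / (W.sum(axis=0, keepdims=True) + eps)  # weights sum to 1 at every pixel
    F = np.stack(feature_images, axis=0)
    return (W * F).sum(axis=0)
```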
3 Experiments and Discussion
A large number of backlit images were tested for the validation of the proposed method. A set of seven observations based on distinct backlit images is presented; they demonstrate the resulting outcomes shown in Fig. 1. All the images are processed in MATLAB R2019b on a PC with an Intel i5-3230M processor with a base clock of 2.6 GHz and 4 GB RAM. All the resultant ROI outputs are validated based on two distinct evaluation parameters. The contrast measure [1] is used to observe how much the contrast has been improved in the image. It is defined in Eq. (7).

C = (1/N) Σ_{j=1}^{N} Var_j(processed) / Var_j(original)   (7)
where Var_j is the variance of the intensity values in a patch of size 16 × 16, Var_j(processed) and Var_j(original) denote the variance computed on the processed and original images respectively, and N represents the number of patches. Entropy is the second measure, used to estimate the amount of texturedness present in the image. This measure is appropriate as it reveals the textural content improvement in the enhanced version. It is defined in Eq. (8) [2].

Entropy = − Σ_{i=0}^{255} p_i log(p_i)   (8)
where p_i represents the fraction of pixels having intensity i among the number of different intensity values (256 for 8-bit images). A detailed analysis-based evaluation of the results is presented here. The results of the proposed method are compared with the results of Wang et al. [21]. A comparison of the contrast measure is given in Table 1, and Table 2 shows the comparison of entropy values. From Table 1, it is evident that the contrast measure of the enhanced ROI in the proposed method is very high, indicating the increase in brightness. From Table 2, it is clear that the amount of texturedness present is high. The visual results are shown in Fig. 1, which has four columns: the first column shows the original image; the second column shows the region of interest of the original image; the third column shows the results obtained by Wang et al.'s method [21]; and the fourth column shows the results of the proposed method.
Fig. 1. (a) Original input backlit image (b) Region of interest of original image (c) the enhanced image by Wang et al. [21] (d) the enhanced result by the proposed method.
In Fig. 1(d), the results tend to be brighter in terms of color representation than the reference [21] results. It is clearly visible that the dark regions of the image, i.e. the ROI, are enhanced more strongly by the proposed method. The contrast measure (C) values given in Table 1 for the proposed method show an overall contrast enhancement. Results where the C values are high have a good trade-off with the detail and naturalness of the images, for example Giraffe (line 1), Board (line 2), Building (line 4), Hills (line 6), and Tower (line 7) in Fig. 1. For higher C values, an increase in the trade-off is observed in the case of Sea (line 3), although the wall region in the lower part of the same image is enhanced quite well. In contrast, the details are well preserved at the higher value of 17.29 for Cycle (line 5) in Fig. 1. This may be interpreted as follows: for high C values of up to 2–3 times, the trade-off is small, whereas for very high C values of 4 times or more, the trade-off between exposedness and detail becomes larger. In line 7 of Fig. 1, the details of the cloud information in the background are lost due to the enhancement.

Table 1. Comparison of contrast measure of the proposed method with Wang et al.'s method [21].

Image    | Wang et al. [21] | Proposed
Giraffe  | 2.06             | 5.42
Board    | 2.53             | 6.70
Sea      | 6.66             | 26.30
Building | 2.74             | 4.29
Cycle    | 7.62             | 17.29
Hills    | 2.04             | 6.18
Tower    | 2.68             | 13.26
In Table 2, the entropy values obtained for the Giraffe and Tower ROI outputs are 6.47 and 7.00 respectively, which are low for the proposed method. This is due to the presence of artifacts at the edges in the resultant images. In the case of Board, Sea, Building, Cycle, and Hills, the proposed method shows good results in terms of entropy, as shown in Table 2. The results of the proposed method have also been compared with Li et al.'s method [11]. The contrast and entropy based evaluation is given for some sample images in Table 3. From Table 3 it can be noticed that for images 1, 4, and 15 the entropy of the proposed method is better, and for images 4 and 15 the contrast of the proposed method is better than that of Li et al.'s method [11].
Table 2. Comparison of entropy values of the proposed method with Wang et al.'s method [21].

Image    | Original | Wang et al. [21] | Proposed
Giraffe  | 6.75     | 7.57             | 6.47
Board    | 5.24     | 6.67             | 6.78
Sea      | 4.52     | 6.05             | 6.18
Building | 5.11     | 6.31             | 6.60
Cycle    | 6.33     | 6.72             | 4.88
Hills    | 6.04     | 7.01             | 7.35
Tower    | 5.81     | 7.05             | 7.00
Table 3. Comparison of contrast measure and entropy of the proposed method with Li et al.'s method [11].

Image | Contrast measure [11] | Contrast measure (proposed) | Entropy [11] | Entropy (proposed)
1     | 5.97                  | 3.59                        | 6.88         | 7.46
2     | 3.34                  | 2.49                        | 7.36         | 7.27
4     | 5.74                  | 18.11                       | 4.65         | 4.92
15    | 6.47                  | 17.80                       | 6.92         | 7.26

4 Conclusion and Future Scope
This paper presented a novel and simple approach for the enhancement of objects in the low-intensity regions of a single backlit image. Global tone mappings enhance the contrast stretch, whereas the other features, namely the gradient, median filtering, and entropy, assist in enhancing the edges, removing noise, and preserving the textural information. However, a trade-off between precise boundaries and color enhancement is observed. The contrast measure and entropy results of all seven images were discussed, which provides satisfactory validation of the proposed approach. Further, the issues associated with the trade-off in color enhancement in over-exposed images can be investigated as part of future study.
References 1. Buades, A., Lisani, J.L., Petro, A.B., Sbert, C.: Backlit images enhancement using global tone mappings and image fusion. IET Image Process. 14(2), 211–219 (2019) 2. Celik, T.: Spatial entropy-based global and local image contrast enhancement. IEEE Trans. Image Process. 23(12), 5298–5308 (2014) 3. Chouhan, R., Biswas, P.K., Jha, R.K.: Enhancement of low-contrast images by internal noise-induced fourier coefficient rooting. Sign. Image Video Process. 9(1), 255–263 (2015)
4. Fu, X., Zeng, D., Huang, Y., Liao, Y., Ding, X., Paisley, J.: A fusion-based enhancing method for weakly illuminated images. Sign. Process. 129, 82–96 (2016) 5. Gottschlich, C.: Curved-region-based ridge frequency estimation and curved gabor filters for fingerprint image enhancement. IEEE Trans. Image Process. 21(4), 2220– 2227 (2011) 6. Hessel, C.: An implementation of the exposure fusion algorithm. Image Process. OnLine 8, 369–387 (2018) 7. Huang, H., Tao, H., Wang, H.: A convolutional neural network based method for low-illumination image enhancement. In: Proceedings of the 2nd International Conference on Artificial Intelligence and Pattern Recognition, pp. 72–77 (2019) 8. Im, J., Yoon, I., Hayes, M.H., Paik, J.: Dark channel prior-based spatially adaptive contrast enhancement for back lighting compensation. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2464–2468. IEEE (2013) 9. Jha, R.K., Chouhan, R., Aizawa, K., Biswas, P.K.: Dark and low-contrast image enhancement using dynamic stochastic resonance in discrete cosine transform domain. APSIPA Transactions on Signal and Information Processing, vol. 2 (2013) 10. Li, C., Liu, J., Liu, A., Wu, Q., Bi, L.: Global and adaptive contrast enhancement for low illumination gray images. IEEE Access 7, 163395–163411 (2019) 11. Li, Z., Wu, X.: Learning-based restoration of backlit images. IEEE Trans. Image Process. 27(2), 976–986 (2018) 12. Liu, S., Zhang, Y.: Detail-preserving underexposed image enhancement via optimal weighted multi-exposure fusion. IEEE Trans. Consum. Electron. 65(3), 303–311 (2019) 13. Martorell, O., Sbert, C., Buades, A.: Ghosting-free dct based multi-exposure image fusion. Sign. Process. Image Commun. 78, 409–425 (2019) 14. Mertens, T., Kautz, J., Van Reeth, F.: Exposure fusion: a simple and practical alternative to high dynamic range photography. In: Computer Graphics Forum, vol. 28, pp. 161–171. Wiley Online Library (2009) 15. Morel, J.M., Petro, A.B., Sbert, C.: Screened poisson equation for image contrast enhancement. Image Process. OnLine 4, 16–29 (2014) 16. Niu, Y., Wu, X., Shi, G.: Image enhancement by entropy maximization and quantization resolution upconversion. IEEE Trans. Image Process. 25(10), 4815–4828 (2016) 17. Pizer, S.M., et al.: Adaptive histogram equalization and its variations. Comput. Vis. Graph. Image Process. 39(3), 355–368 (1987) 18. Ren, Y., Ying, Z., Li, T.H., Li, G.: Lecarm: low-light image enhancement using the camera response model. IEEE Trans. Circ. Syst. Video Technol. 29(4), 968–981 (2018) 19. Rivera, A.R., Ryu, B., Chae, O.: Content-aware dark image enhancement through channel division. IEEE Trans. Image Process. 21(9), 3967–3980 (2012) 20. Singh, H., Kumar, V., Bhooshan, S.: A novel approach for detail-enhanced exposure fusion using guided filter. The Scientific World Journal, vol. 2014 (2014) 21. Wang, Q., Fu, X., Zhang, X.P., Ding, X.: A fusion-based method for single backlit image enhancement. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 4077–4081. IEEE (2016) 22. Wang, S., Luo, G.: Naturalness preserved image enhancement using a priori multilayer lightness statistics. IEEE Trans. Image Process. 27(2), 938–948 (2017) 23. Wang, W., Wu, X., Yuan, X., Gao, Z.: An experiment-based review of low-light image enhancement methods. IEEE Access 8, 87884–87917 (2020)
24. Wang, Y.F., Liu, H.M., Fu, Z.W.: Low-light image enhancement via the absorption light scattering model. IEEE Trans. Image Process. 28(11), 5679–5690 (2019) 25. Ying, Z., Li, G., Ren, Y., Wang, R., Wang, W.: A new low-light image enhancement algorithm using camera response model. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 3015–3022 (2017) 26. Zarie, M., Pourmohammad, A., Hajghassem, H.: Image contrast enhancement using triple clipped dynamic histogram equalisation based on standard deviation. IET Image Process. 13(7), 1081–1089 (2019)
Real-Time Sign Language Interpreter on Embedded Platform Himansh Mulchandani(&) and Chirag Paunwala Electronics and Communication Engineering Department, Sarvajanik College of Engineering and Technology, Surat, India [email protected] [email protected]
Abstract. Sign language is the primary way of communication for speechimpaired people. American Sign Language is one of the most popular sign languages and is used by over 250,000 people around the world. However, this language is only decipherable by the people who have the same disabilities, and it is not understood by masses. Hence, a need arises for an interpreter that can convert this language to text and speech. The existing research focuses on effectively converting sign language to text and speech; however, less emphasis is laid on building a system that is not only accurate but efficient as well which could be run on a compact embedded platform. This is necessary so that a portable interpreter can be made which can be easily carried and used by such people for their convenience making their day to day life easier. In this research, we propose an implementation based on MobileNet architecture, that is not only accurate but efficient as well. Moreover; a demonstration of the entire framework has also been shown on Jetson Nano. Keywords: MobileNet Speech impaired Jetson Nano Convolutional neural networks Deep learning American sign language
1 Introduction

Speech and hearing impairment significantly reduces an individual's capability to communicate and makes day to day life difficult. Many efforts have been made to provide such people with aids, and even a sign language has been dedicated to such people. American sign language is one of the most popular sign languages and is used by nearly 250,000 people [1]. It consists of distinct signs which facilitate the communication of different alphabets and numbers, and a combination of these letters is used to form words and sentences. While sign language is familiar to people who have the same impairment, it isn't decipherable by the masses, and hence it gives rise to a communication gap for such a class of people. Hence, an interpreter is required to facilitate the communication between speech and hearing-impaired people and the common people who don't understand this sign language. There are primarily two different methods for recognising hand gestures: the first one relies on sensor data from gloves such as flex sensors [2, 3], and the other relies on vision-based methods that rely on the data captured from the camera and processes it to
extract valuable information [4–7]. Vision-based methods are more popular and they mostly use Convolutional Neural Networks as they are known to have excellent classification abilities and are often used for this purpose. However, their performance is mainly impacted by different backgrounds, which typically acts as noise and significantly reduces the classification accuracy. Later, methods were developed that perform segmentation as a preprocessing step which significantly improves the performance [7]. However, the focus of all the research done until now has been mainly based on getting the best accuracy and their implementation on some embedded or edge platform has not been considered. It is indeed important to develop systems that can run on such platforms because such a system can serve an altogether different utility. As mentioned earlier, the wider demographic of the population cannot understand the sign language and hence, there needs to be an interpreter which can facilitate the conversation between both the ends. Hence, a need arises for platforms which are portable as well as reliable, so that it can easily be carried by such people and serve as an aid to these people, making their lives easier. Moreover, many public places need to have such platforms which can facilitate speech and hearing-impaired people to effectively communicate their ideas without relying on a third person to interpret the same. This paper focuses on developing such a system which can not only detect gestures with accuracy but also focuses on deploying these systems on an embedded platform thereby making a practically usable system. Figure 1 shows different gestures for different alphabets.
Fig. 1. American sign language [8]
2 Related Work

There are broadly two ways to build sign language interpreter systems: one is sensor-based, and the other is vision-based. Ahmed et al. [2] and Zahida et al. [3] proposed sensor-based systems, where a combination of flex sensors was used to determine the position of all the fingers. Different orientations and positions of the fingers lead to flex in
the flex sensors, which results in different current values; these are conditioned by a signal conditioning circuit and then provided to a microcontroller. Based on the signals acquired, output devices such as displays and speakers are commanded appropriately. While the performance of this system is good in real time, as there is little to no signal and image processing, the issue arises from the fact that it uses flex sensors, which have disadvantages such as fragility and drifting output values after prolonged usage. Alternatively, vision-based systems don't rely on such sensors but rather on the inputs given by the camera. Such systems mostly use convolutional neural networks for classifying hand gestures [4–7]. Pardasani et al. used a convolutional neural network similar to LeNet5 [9] for classifying hand gestures [6]. However, the performance of this classifier is not up to the mark, attributable to the fact that it has very few layers. Masood et al. used VGG16 [10] as the classifier for sign language [5]. While VGG16 is a very accurate architecture and its performance is appreciable, its main drawback is that it is gigantic, with over 100 million parameters, which is not amenable to running on an embedded platform and fails to deliver real-time performance. Moreover, the primary focus of that research is to train a classifier, and it does not address other problems such as region of interest extraction and segmentation. Hence, the classification performance of this system isn't appreciable in real time and under different illumination conditions, because different illumination situations along with different backgrounds act as noise. The shortcomings in Masood et al.'s research have been overcome by Shahriar et al.'s research, which significantly focuses on preprocessing steps like segmentation and region of interest extraction. In this research, skin colour segmentation is performed using the YCbCr colour space, which is effective for skin colour segmentation [11]. Once the segmentation is performed, a few morphological operations are applied which close the inconsistencies. Finally, the bounding box of the hand mask is obtained, and then the image containing the hand is cropped and fed to the convolutional neural network for classification. While this system works in varying illumination and background conditions, one of its drawbacks is its neural network: it uses AlexNet [12], which contains over 60 million parameters and isn't ideal for real-time performance. From the above discussion, it is clear that less emphasis has been laid on developing systems for embedded platforms. While the accuracy and performance reported in other papers are appreciable, it is equally important to look at this challenge not only from an objective perspective but from a practical perspective as well. As a portable interpreter may be desired by people facing such disabilities, it is necessary to think of systems that are memory efficient and can successfully be deployed on embedded platforms.
3 Proposed Method

3.1 Segmentation and Region of Interest Extraction
Segmentation is a way of identifying objects in images based on their colours. In the context of this research, it is especially important as in real-time and realistic
environments, there is often a lot of background scene which acts as noise and impedes the performance of the neural network classifier. An image can be represented by different colour models, and each model has a different purpose: for example, the RGB colour model is used in display devices, the YCbCr model is used for printing, and the HSV colour model is primarily used for image processing applications like segmentation. The HSV colour model is effective for skin colour segmentation [13], as it is successful in isolating the chrominance component from illumination or brightness. This is effective because the same range of HSV values can give appreciable results in both light and dark environments. Hence, the HSV colour model has been used to perform segmentation. However, the image obtained may have a lot of disjoint points and inconsistencies, so morphological operations like closing and dilation are used to remove such inconsistencies. Other background objects such as walls, doors, furniture etc. may still accidentally fall in the HSV colour range. To avoid this, contours are extracted, and from a large number of experiments it has been noted that the largest contour contains the region of interest. Finally, a bounding box is obtained from this contour and its coordinates are used to crop the image, which is then input to the classifier. The process has been shown in Fig. 2, and the step-by-step output has been shown in Figs. 3 and 4.
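A compact OpenCV sketch of this ROI-extraction pipeline follows; it is an illustration only, and the HSV thresholds are placeholders, since the exact range is tuned empirically and not given here.

```python
import cv2
import numpy as np

SKIN_LOW = np.array([0, 30, 60], np.uint8)      # assumed HSV skin range; tune per setup
SKIN_HIGH = np.array([25, 255, 255], np.uint8)

def extract_hand_roi(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, SKIN_LOW, SKIN_HIGH)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # close small gaps in the mask
    mask = cv2.dilate(mask, kernel, iterations=1)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)               # largest contour = hand region
    x, y, w, h = cv2.boundingRect(hand)
    return frame_bgr[y:y + h, x:x + w]                      # cropped ROI fed to the classifier
```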
Fig. 2. Algorithm for region of interest extraction.
Fig. 3. Input image and HSV colour segmented image respectively in cluttered environment.
Fig. 4. Segmented image with contours, biggest contour drawn in red and bounding box from contour respectively. In the final step image is cropped using bounding box and classification is done as shown. (Color figure online)
3.2 MobileNet
Classifying the image obtained after extracting the region of interest is perhaps the most important part. CNNs have excellent capabilities to classify image data. In previous approaches, CNNs like VGG16 and AlexNet have been used, which are not very parameter efficient, making them not ideal for real-time performance on an embedded platform. MobileNet is considered to be one of the most efficient CNNs, as it uses very few parameters but still has appreciable performance [14]. Depthwise separable convolution makes MobileNet efficient, and it is explained in the following two sections. Depth Wise Separable Convolution. In conventional CNNs, a 3-dimensional mask or filter of size N × N is convolved with the input image, and the resultant image is the output. The number of such masks used is equal to the depth required in the output image. However, in depthwise separable convolution, this process is done in two stages, i.e. depthwise convolution and pointwise convolution. In depthwise convolution, 2-dimensional masks of size N × N are used, and the number of such masks is equal to the depth of the input image. The output from this stage is used as input to the pointwise convolution, where the image is convolved with 1 × 1 filters having the same depth as desired in the output image. Both steps have been shown in the following figure.
Fig. 5. Convolution operation
The parameters in any layer of a CNN can be calculated as

No. of parameters = (H × W × D + 1) × N    (1)
where H, W, D and N represent the height and width of the filter, the number of channels in the input and the number of filters used, respectively. Hence, the number of parameters in the above example would be 112. The convolution operation is shown in Fig. 5. However, depthwise separable convolution is a two-stage process, and Fig. 6 shows the two stages; the parameters for both can be calculated separately. As shown in Fig. 6, for the first stage, i.e. depthwise convolution, from 'Input image with 3 channels' to 'Output of Depthwise Convolution with three channels', the total number of parameters would be (3 × 3 × 1 + 1) × 3 = 30, as per Eq. 1. Similarly, for the second stage, i.e. pointwise convolution, from 'Output of Depthwise Convolution with three channels' to 'Output of pointwise separable convolution with 4 channels', the number of parameters would be (1 × 1 × 3 + 1) × 4 = 16, as per Eq. 1. Adding the parameters from both stages, we get a total of 46 parameters, which is significantly less than the 112 of the conventional convolution. From the above discussion, it can be concluded that depthwise separable convolution is very parameter efficient while at the same time preserving the spatial dimension of the output. The concept of depthwise separable convolution is used throughout MobileNet, and hence a significant reduction in parameters is seen. It also reduces the FLOPs significantly.
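This parameter arithmetic can be checked directly in Keras by building the two variants of the example layer (3 × 3 kernels, 3 input channels, 4 output filters); the 8 × 8 input size below is arbitrary and only there to make the layers buildable.

```python
import tensorflow as tf

# Conventional convolution: (3*3*3 + 1) * 4 = 112 parameters, per Eq. 1.
conv = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, 3, padding='same', input_shape=(8, 8, 3)),
])

# Depthwise separable convolution written as its two explicit stages so that
# both bias terms are counted as in the text: 30 (depthwise) + 16 (pointwise) = 46.
dw_sep = tf.keras.Sequential([
    tf.keras.layers.DepthwiseConv2D(3, padding='same', input_shape=(8, 8, 3)),  # (3*3*1 + 1)*3 = 30
    tf.keras.layers.Conv2D(4, 1),                                               # (1*1*3 + 1)*4 = 16
])

print(conv.count_params())    # 112
print(dw_sep.count_params())  # 46
```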
Fig. 6. Depth wise separable convolution
In the context of this research, a modified version of MobileNet has been used, popularly known as MobileNetV2 [15]. It has significant improvements over MobileNet in the form of inverted residual connections, the use of the ReLU6 activation function and batch normalization after every layer. The architecture can be studied in detail in [15].
3.3 NVIDIA Jetson Nano and TensorRT
The primary focus of this research is to develop an American Sign Language interpreter which is mobile and portable enough to be conveniently carried by people having speech impairment. For this purpose, we have deployed our efficient model on the NVIDIA Jetson Nano [16]. It is a platform developed by NVIDIA for deep learning inferencing which is not only power efficient but also very portable. It contains all the necessary ports for interfacing, such as USB, HDMI, ethernet and micro-USB. For deep learning inferencing, it contains a 128-core Maxwell GPU which supports the CUDA operations necessary for deep learning inferencing. 4 GB of system-wide memory is shared by the GPU as well as the CPU, and the CPU is clocked at 1.43 GHz. Overall, it is a very compact and capable platform for deep learning applications. TensorRT is an SDK developed by NVIDIA, integrated with TensorFlow, to optimise deep learning models for real-time inferencing [17]. Using TensorRT, existing trained models can be optimised so that their memory consumption reduces significantly. A few of the optimisations done by TensorRT include layer fusion and quantisation of 32-bit weights to 16 bits. These operations significantly improve the performance of deep learning models. In our experiments, we performed all the optimisations and the model weights were quantized to 16 bits. For demonstration purposes, the Jetson Nano was used with a monitor. Using it in headless mode may improve the performance even more.
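The FP16 optimisation described above can be sketched with the TF-TRT converter that ships with TensorFlow 2.x. The paths are placeholders and the exact converter API differs slightly between TensorFlow releases (the snippet follows the TF 2.1-era interface), so treat this as an illustration rather than the authors' exact conversion script.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

SAVED_MODEL_DIR = 'saved_model_mobilenetv2'          # hypothetical export of the trained model
TRT_OUTPUT_DIR = 'saved_model_mobilenetv2_trt_fp16'  # hypothetical output directory

# Request FP16 precision so 32-bit weights are quantised to 16 bits;
# layer fusion is applied automatically during conversion.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    conversion_params=params)
converter.convert()
converter.save(TRT_OUTPUT_DIR)
```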
4 Experiments and Results

4.1 Dataset Details
The dataset consists of over 80000 images across 28 different classes [18]. 26 of the 28 classes represent the letters A–Z, and the remaining 2 represent the special cases 'SPACE' and 'NOTHING'. The images were captured in varying illumination and background conditions and a wide range of skin colours were taken into consideration. All the images were resized to 224 × 224 resolution, which is the input resolution expected by MobileNetV2. A split of 80:10:10 was created for training, validation and testing. A few sample images are shown in Fig. 7 below.
Fig. 7. Images representing sign language character ‘A’.
4.2 Training Details
Necessary preprocessing steps such as resizing and sample-wise standard normalisation were done to make the input compatible with MobileNetV2's architecture. The training and testing loops were entirely programmed using TensorFlow-GPU version 2.1. The model was trained using the parameters shown in Table 1.

Table 1. Training parameters

Parameter             Value
Optimiser             Adam
Learning rate         0.0001
Batch size            16
Momentum parameters   β1 = 0.9 and β2 = 0.999
Epochs                10
GPU                   NVIDIA GTX 1050Ti
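The setup in Table 1 can be reproduced with a short Keras script. The directory names are placeholders and the use of ImageDataGenerator for loading is an assumption; only the preprocessing, split and hyperparameters come from the paper.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

NUM_CLASSES = 28  # 26 letters + 'SPACE' + 'NOTHING'

# Sample-wise standardisation as described in Sect. 4.2; paths are hypothetical.
datagen = ImageDataGenerator(samplewise_center=True, samplewise_std_normalization=True)
train_gen = datagen.flow_from_directory('asl_alphabet/train', target_size=(224, 224),
                                        batch_size=16, class_mode='categorical')
val_gen = datagen.flow_from_directory('asl_alphabet/val', target_size=(224, 224),
                                      batch_size=16, class_mode='categorical')

# MobileNetV2 backbone with a 28-way softmax head.
base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                         include_top=False, pooling='avg')
model = tf.keras.Sequential([base,
                             tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_gen, validation_data=val_gen, epochs=10)
```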
Moreover, the model was trained with different optimizers and the results were compared; the best of the three was used for further comparisons. Figures 8, 9 and 10 show the training graphs for the different optimizers. Hyperparameters like learning rate and batch size were tuned using these curves.
Fig. 8. Training and testing statistics on Adam optimiser [19].
Fig. 9. Training and testing statistics on RMS Prop optimiser [21]
Fig. 10. Training and testing statistics on SGD optimiser.
4.3 Evaluation
Evaluating a deep learning model is a crucial part of determining its efficacy, and the right evaluation metrics give better insight into the performance of models from different perspectives. The evaluation metrics used are described as follows. Confusion Matrix. It is an N × N matrix where N represents the number of classes. It tabulates all the possible outcomes of the predictions of the deep learning model and helps in deriving other metrics such as accuracy, precision, recall and F1 score. The confusion matrix for the 8018 images of the test set is shown in Fig. 11.
Fig. 11. Confusion matrix
Accuracy. It gives the ratio of the number of samples classified correctly to the total number of samples in the dataset. It is denoted as shown in Eq. 2.

Accuracy = Number of Correct Predictions / Total Number of Samples    (2)

Precision. It is denoted as the ratio of True Positives (TP) to the sum of True Positives (TP) and False Positives (FP), as shown in Eq. 3.

Precision = True Positive / (True Positive + False Positive)    (3)

Recall. It is denoted as the ratio of True Positives (TP) to the sum of True Positives (TP) and False Negatives (FN), as shown in Eq. 4.

Recall = True Positive / (True Positive + False Negative)    (4)

F1 Score. It seeks a balance between precision and recall and often gives a better indication of the overall performance of the model. It is denoted as shown in Eq. 5.

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)    (5)
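The metrics of Eqs. 2–5 can be computed directly with scikit-learn over the test-set predictions; macro averaging over the 28 classes is an assumption here, since the paper does not state its averaging scheme.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

def evaluate(y_true, y_pred):
    """Compute the metrics of Eqs. 2-5 from integer class labels and predictions."""
    return {
        'confusion_matrix': confusion_matrix(y_true, y_pred),
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, average='macro'),
        'recall': recall_score(y_true, y_pred, average='macro'),
        'f1': f1_score(y_true, y_pred, average='macro'),
    }
```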
Also, other parameters such as the total number of model parameters, the size of the model on disk and the inference time were taken into consideration to evaluate the model. Table 2 compares all the metrics of the proposed model with the implementations by Pardasani et al. and Shahriar et al. It must be noted that the comparisons have been done in the same environment and only the classifier has been swapped; the rest of the framework remains the same.
Table 2. Model evaluation chart on test set

Metric                           Proposed            Shahriar et al. [7]   Pardasani et al. [6]
Accuracy                         0.99                0.98                  0.96
Precision                        0.99                0.98                  0.96
Recall                           0.99                0.96                  0.96
F1 score                         0.99                0.97                  0.96
Parameters                       3.05 Million        28 Million            0.068 Million
FPS on NVIDIA GTX 1050Ti [20]    23 FPS (43.47 ms)   11 FPS (90.90 ms)     24 FPS (41.66 ms)
FPS on NVIDIA Jetson Nano [16]   13 FPS (76.92 ms)   6 FPS (166.67 ms)     17 FPS (58.82 ms)
4.4 Comparison
A detailed comparison of the proposed method with two other methods is given in Table 2. For quantitative comparison, the classifiers of all the networks have been evaluated on the test set of [18]. The rest of the framework was kept the same for all the models to keep the comparison fair. From the above discussion and observations, it can be concluded that the proposed method is ideal for an embedded platform, as not only does it provide real-time performance, but it also performs significantly better on quantitative metrics such as accuracy, precision and recall.
5 Conclusion

We propose a method for real-time interpretation of American Sign Language on embedded platforms. From the discussion above, it is clear that preprocessing operations such as segmentation and region of interest extraction significantly improve the performance of the CNN classifier. Moreover, MobileNet is significantly more efficient than classifiers such as VGG and AlexNet, which significantly improves its real-time performance. The entire framework was tested on the Jetson Nano and, from the observations noted, it can be concluded that the proposed method is able to run in real time with appreciable performance. However, segmentation using the HSV colour space is still not ideal, and hence a future direction of this work could be the use of segmentation networks that can efficiently segment the hand from the frame.
References 1. American Sign Language. www.wikipedia.org/wiki/American_Sign_Language. Accessed 30 Aug 2020 2. Ahmed, S., Islam, R., Zishan, M.S., Hasan, M.R., Islam, M.N.: Electronic speaking system for speech impaired people: speak up. In: 2015 International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), 21 May 2015, pp. 1– 4. IEEE (2015) 3. Zahida, A., Jehan, F.: Electronic Speaking System for Speech Impaired People 4. Nguyen, H.B., Do, H.N.: Deep learning for American sign language fingerspelling recognition system. In: 2019 26th International Conference on Telecommunications (ICT), 8 Apr 2019, pp. 314–318. IEEE (2019) 5. Masood, S., Thuwal, H.C., Srivastava, A.: American sign language character recognition using convolution neural network. In: Satapathy, S.C., Bhateja, V., Das, S. (eds.) Smart Computing and Informatics. SIST, vol. 78, pp. 403–412. Springer, Singapore (2018). https:// doi.org/10.1007/978-981-10-5547-8_42 6. Pardasani, A., Sharma, A.K., Banerjee, S., Garg, V., Roy, D.S.: Enhancing the ability to communicate by synthesizing American sign language using image recognition in a chatbot for differently abled. In: 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 2018, pp. 529–532 (2018). https://doi.org/10.1109/ICRITO.2018.8748590
7. Shahriar, S., et al.: Fattah SA. Real-time American sign language recognition using skin segmentation and image category classification with convolutional neural network and deep learning. In: 2018 IEEE Region 10 Conference, TENCON 2018, 28 October 2018, pp. 1168–1171. IEEE (2018) 8. ASL Day 2019: Everything You Need To Know About American Sign Language, https:// www.newsweek.com/asl-day-2019-american-sign-language-1394695. Accessed 30 Aug 2020 9. LeCun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989) 10. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (4 September 2014) 11. Dardas, N.H., Petriu, E.M.: Hand gesture detection and recognition using principal component analysis. In: 2011 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications, CIMSA 2011, pp. 1–6, 19 (2011) 12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012, pp. 1097– 1105 (2012) 13. Lai, W.H., Li, C.T.: Skin colour-based face detection in colour images. In: 2006 IEEE International Conference on Video and Signal Based Surveillance, 22 Nov 2006, p. 56. IEEE (2006) 14. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (17 April 2017) 15. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, pp. 4510–4520 (2018) 16. Jetson Nano Developer Kit. https://developer.nvidia.com/embedded/jetson-nano-developerkit. Accessed 30 Aug 2020 17. NVIDIA TensorRT. https://developer.nvidia.com/tensorrt. Accessed 30 Aug 2020 18. ASL Alphabet. https://www.kaggle.com/grassknoted/asl-alphabet? Accessed 30 Aug 2020 19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv: 1412.6980 (22 December 2014) 20. Geforce GTX 1050Ti. https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-1050ti/specifications. Accessed 30 August 2020 21. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Complex Gradient Function Based Descriptor for Iris Biometrics and Action Recognition B. H. Shekar1 , P. Rathnakara Shetty1(B) , and Sharada S. Bhat2 1
Department of Computer Science, Mangalore University, Mangalore, India 2 Government First Grade College Ankola, Karnataka, India
Abstract. Image gradient has always been a robust characteristic of digital image which possesses localised spatial information of each pixel in all the directions. Exploiting the gradient information at the pixel level is a good old technique applied in various fields of digital image processing. In this paper, the magnitude and direction of image gradient is explored to design a local descriptor. We propose a novel local feature descriptor based on Complex Gradient Function (CGF), which maps each pixel from the spatial plane into its complex extension involving the magnitude and direction of image gradient at that pixel. We exploit the proposed descriptor for human action recognition from depth sequences and human authentication using iris biometrics. The efficiency of descriptor is demonstrated with experimental results on benchmark datasets IITDelhi, MMU-v2, CASIA-Iris, UBIRIS, and MICHE-I for iris authentication and MSR Action 3D dataset for human action recognition (HAR).

Keywords: Image gradient · Descriptor · Iris recognition · Human action recognition

1 Introduction
In case of real world applications like biometrics, object and action recognition, image and video retrieval and image enhancement, local descriptors are employed to extract the predominant local features from the input image [10]. The basic idea behind the design of these descriptors is to transform the image regions using a class of filters and then compute invariant distinctive features, emphasizing image properties like pixel intensities, color, texture, and edges [15]. Several feature extraction filters utilise the information obtained from the magnitude and direction of an image gradient. However, the combined impact of gradient magnitude and direction on the extraction of key information from the image data has not been explored. Texture based descriptors like Local Binary Patterns (LBP) and its variants mainly concentrate on pixel values around the central pixel in a local neighbourhood and compute their differences from the central pixel.
Descriptors like the histogram of oriented gradients (HOG) and the scale invariant feature transform (SIFT) use gradient information around different directions of a pixel. In this work, we propose a local descriptor which maps the magnitude and direction of the image gradient onto the complex plane, and the combined effect of both is then utilised to extract predominant features of an input image. For each input pixel we identify five components including the pixel value, its spatial coordinates, and gradient information. A complex gradient function (CGF) is defined on these components to extract the local features.
2 Related Work
Since local descriptors are at the core of many computer vision tasks, ample descriptor tools have been proposed in the literature to improve the efficiency and accuracy of image processing algorithms [8]. Among the widely used local descriptors available in the literature, LBP, SIFT and SURF are the most popular tools. Ojala et al. [16] introduced LBP as a gray scale invariant local binary descriptor defined on a 3 × 3 neighbourhood. The code for the central pixel of the neighbourhood is obtained by decimalizing the local binary pattern formed by the differences of the sample pixel values from the central value. The SIFT descriptor, as introduced by Lowe [14], is designed by first computing the gradient based magnitude and orientation at every sample point in a region surrounding a keypoint. These keypoint values are weighted through a Gaussian window, and the values thus derived are accumulated into orientation-binned histograms summarizing the contents across 4 × 4 subregions. Soon after this, Dalal and Triggs [6] introduced another descriptor called the histogram of oriented gradients (HOG). Bay et al. [3] designed the SURF descriptor based on a Hessian matrix comprised of Gaussian second order partial derivatives (Laplacian of Gaussian, LoG). To construct the descriptor, an oriented quadratic grid with 4 × 4 square sub-regions is placed over the interest points in the image, and for each square, wavelet responses are calculated from 5 × 5 samples. A large number of publications are found in the literature on modifications and extensions of these descriptors. Usage of image gradient is well explored in machine vision applications in general and in local descriptors in particular. Apart from PCA-SIFT [10], SIFT and SURF, as an extension of the SIFT descriptor, Mikolajczyk and Schmid [15] proposed the gradient location and orientation histogram (GLOH), which is also based on image gradient. Chen et al. [5] designed the Weber local descriptor (WLD), which utilises gradient orientations in eight directions of a pixel in a neighbourhood of size 3 × 3. WLD has been extended by Xia et al. [26], which again exploits the amplitude and orientation of the gradient image. Roy et al. [19] extract local directional edge maps at six different directions and identify their zigzag pattern for texture classification. Deng et al. [8] presented compressive binary patterns (CBP) by replacing the locally computed derivative filters of the LBP descriptor with random-field eigenfilters.
Fig. 1. Pictorial representation of proposed approach.
From the review of the literature it is evident that the designs of well-known local descriptors explore the image gradient, as it is an essential characteristic in the spatial domain. The proposed complex gradient function (CGF) based descriptor also computes image gradient features in the first step; however, it further uses the gradient information to obtain a complex gradient function. We give experimental evidence to demonstrate the applicability of the newly proposed descriptor for iris recognition and depth sequence based HAR. The rest of the paper is organised into the following sections. In Sect. 3, the design of our proposed descriptor is detailed. In Sect. 4, we provide the experimental analysis and present discussions on the application of CGF for iris recognition and action recognition. Section 5 concludes this paper.
3 Proposed Descriptor
Image gradient is a robust property of gray scale images which exhibits important features of the image such as illumination changes, the amount of change and the direction along which the pixel value varies. Each input pixel is associated with five components: the spatial coordinates (x, y), its pixel value, and the gradient magnitude and direction. Utilizing these components, we map each pixel onto the complex domain of gradient magnitude and direction. The magnitude is mapped onto the horizontal axis and the corresponding direction onto the vertical axis. The resultant vector in the complex gradient domain is further exploited to describe the predominant features of the image texture. A pictorial representation of the architecture and design of the proposed descriptor at the pixel level is presented in Fig. 1.

3.1 Complex Gradient Function
This section narrates the theoretical concept behind the proposed descriptor. Let F(x, y) be an input gray scale image which can be treated as a real valued 2D function of real variables. Computing the partial derivatives of F with respect to
x and y along the horizontal and vertical directions, we get the image gradient ∇F(x, y) at (x, y), which is a vector quantity given by the following equation.

∇F(x, y) = [Fx, Fy]ᵀ(x, y) = [∂F/∂x, ∂F/∂y]ᵀ(x, y).    (1)

We get the magnitude and direction of the vector ∇F at (x, y) as follows.

mag(∇F(x, y)) = √(Fx²(x, y) + Fy²(x, y)),    (2)

dir(∇F(x, y)) = tan⁻¹(Fy / Fx).    (3)
Thus every pixel value is mapped onto the gradient domain denoted by C(ω, α) such that

F(x, y) ⟹ (mag(∇F)(x, y), dir(∇F)(x, y)) ⟹ (ω, α) ∈ C.    (4)

Existing gradient based descriptors have well utilised this information to design feature extraction tools. However, we capture the localised information at a pixel (x, y) by considering each (ω, α) as a complex representation of F(x, y) and formulate our descriptor using the magnitude and amplitude of (ω, α). Each input pixel is associated with its spatial coordinates x and y, its pixel value F(x, y) and the gradient representations ω and α. We map the plane C to a complex gradient domain CGF(ξ(x, y), ϑ(x, y)), which is obtained by the complex gradient functions defined below.

ξ(x, y) = √(ω²(x, y) + α²(x, y)),    (5)

ϑ(x, y) = tan⁻¹(α(x, y) / ω(x, y)).    (6)

Thus, the input image is described with CGF_Mag(ξ, ϑ), the field of magnitudes, and CGF_Amp(ξ, ϑ), the field of directions, as demonstrated in Fig. 2. Mapping C onto CGF has mainly two advantages. Firstly, the combined effect of magnitude and orientation can be well described in the CGF domain. Secondly, the amalgamated structural information of both gradient magnitude and direction is easily extracted from the two fields CGF_Mag and CGF_Amp. In Fig. 3, we present histograms of magnitude and orientation, which explain the combined effect of both.
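To make Eqs. (1)–(6) concrete, the following is a minimal NumPy sketch of the two CGF response fields. The use of np.gradient (central differences) and arctan2 is an assumption, since the paper does not fix a particular derivative operator or angle convention.

```python
import numpy as np

def cgf(image):
    """Complex Gradient Function responses (Eqs. 1-6) for a grayscale image."""
    F = image.astype(np.float64)
    Fy, Fx = np.gradient(F)                      # partial derivatives along y and x

    omega = np.hypot(Fx, Fy)                     # gradient magnitude, Eq. 2
    alpha = np.arctan2(Fy, Fx)                   # gradient direction, Eq. 3

    cgf_mag = np.hypot(omega, alpha)             # xi(x, y), Eq. 5
    cgf_amp = np.arctan2(alpha, omega)           # vartheta(x, y), Eq. 6
    return cgf_mag, cgf_amp
```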
4 Experiments and Discussions
Proposed CGF descriptor is designed to capture the minuscule local variations in image texture. Section 3.1 theoretically supported this novel descriptor for its generic usage. However, we justify the diversified applicability of our descriptor by adopting two different research disciplines, iris biometrics and human action recognition for experimentation. Both, iris and action recognition datasets
Fig. 2. Demonstration of the proposed approach on standard Leena image.
Fig. 3. Histograms of images displayed in Fig. 2.
comprise images with intricate texture variations, which gives grounds for a comprehensive application of CGF. A sample unwrapped iris image from the IITDelhi dataset and its texture variations extracted using CGF are displayed in Fig. 4. Details of the experimentation and result analysis for the iris recognition and human action recognition systems are narrated in the subsequent Sects. 4.1 and 4.2.

4.1 Iris Recognition
We conduct the experiments on the openly available benchmark iris databases IITDelhi [11], MMUv.2 [2], CASIA-Iris v-4 Distance dataset [1] and UBIRIS v.2 [18] which contain Near Infra Red (NIR) and also Visible Wavelength (VW) images with constrained and unconstrained imaging. MMU v.2, IITD and CASIA databses are comprised of NIR images having various types of off angled, occluded and non cooperative eyes and UBIRIS v.2 database has highly
Fig. 4. CGF demonstrated on an unwrapped iris from IITDelhi dataset.
corrupted, disoriented (off-angled) images taken from a distance of 4 to 8 m which are acquired through visual wavelength (VW) imaging. Experiments are extended using the most challenging recent mobile iris datasets MICHE I [7] and CASIA Iris M1 [1], both of which consist of iris images captured by mobile devices in a visible-light environment. Iris Representation Using CGF: The most common representation of an iris feature vector is Daugman's phase based binary iris code method. We have adopted this technique based on the output responses of the CGF descriptor. Each pixel in the input iris image is binarised based on the responses of CGF_Mag(ξ, ϑ) and CGF_Amp(ξ, ϑ). The responses of CGF_Mag are binarised based on the mean value of the magnitudes, whereas the responses of CGF_Amp are binarised based on the zero crossings. Suppose I(x, y) is the unwrapped input iris image, (I1, I2) is the binary bit representation and M is the mean of the values of the field CGF_Mag; then,

I1 = 1 if CGF_Mag(ξ, ϑ) ≥ M, and 0 if CGF_Mag(ξ, ϑ) < M.    (7)

I2 = 1 if CGF_Amp(ξ, ϑ) ≥ 0, and 0 if CGF_Amp(ξ, ϑ) < 0.    (8)

During experimentation, we have adopted the iris preprocessing steps and recognition framework explained in [21]. The first set of experiments is conducted with respect to the NIR and VW iris datasets IITDelhi, MMU v2, CASIA v-4 Distance and UBIRIS v-2. We have made a comparative study with other descriptors such as LBP and HOG, and also with the Riesz signal based binary pattern (RSBP) descriptor proposed in [21]. The experimental results containing the equal error rates (EER) and d-prime values are shown in Table 1. We have achieved an average increase of 1.4% in d-prime values and an average decrease of 3.64% in equal error rates with respect to RSBP. With respect to LBP and HOG, the proposed descriptor outperforms with an average increase of 4.78% in d-prime and an average decrease of 13.9% in EER values. Iris Authentication on Mobile Datasets. Further, we have conducted our second set of experiments on the mobile iris datasets MICHE I and CASIA M1.
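Equations (7) and (8) translate directly into a short binarisation routine over the CGF responses; the gradient operator and angle convention are the same assumptions as in the sketch of Sect. 3.1, and the unwrapping, preprocessing and code matching of [21] are not shown.

```python
import numpy as np

def iris_code(unwrapped_iris):
    """Binary iris representation from CGF responses, following Eqs. 7 and 8."""
    F = unwrapped_iris.astype(np.float64)
    Fy, Fx = np.gradient(F)
    omega, alpha = np.hypot(Fx, Fy), np.arctan2(Fy, Fx)   # Eqs. 2-3
    cgf_mag = np.hypot(omega, alpha)                      # Eq. 5
    cgf_amp = np.arctan2(alpha, omega)                    # Eq. 6

    I1 = (cgf_mag >= cgf_mag.mean()).astype(np.uint8)     # Eq. 7: threshold at the mean
    I2 = (cgf_amp >= 0).astype(np.uint8)                  # Eq. 8: threshold at zero crossings
    return np.stack([I1, I2])                             # two bits per pixel
```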
Table 1. d-prime and EER values due to various methods on benchmark datasets.

Method      IITD              MMU v-2           CASIA v-4 dist    UBIRIS v.2
            EER     d-prime   EER     d-prime   EER     d-prime   EER      d-prime
LBP         0.0207  5.6321    0.0575  3.3891    0.0918  2.5651    0.1009   2.5592
HOG         0.0211  5.7811    0.0513  3.3880    0.0802  2.5970    0.1000   2.5611
RSBP        0.0106  5.9106    0.0440  3.4166    0.0710  2.8160    0.09910  2.2680
Proposed    0.0109  5.9866    0.0438  3.4431    0.0700  2.8215    0.0865   2.6438
We have used the subsets of MICHE I that are comprised of eye images captured from three different mobile phones, iPhone5 (IP5), Samsung Galaxy S4 (GS4), and Samsung Galaxy Tab2 (GT2). From the CASIA Iris database, we have used a subset of CASIA-Iris-M1-S1. Promising results obtained from the second set of experiments (displayed in Table 2) depict that CGF can be fine-tuned to extend its application to mobile-phone-based iris biometrics.

Table 2. d-prime and EER values achieved due to different methodologies on mobile iris datasets, MICHE I and CASIA Iris.

Method      IP5               GS4               GT2               CASIA-M1-S1
            EER     d-prime   EER     d-prime   EER     d-prime   EER     d-prime
LBP         0.3124  1.9817    0.4400  1.9890    0.5112  1.9926    0.3216  2.1330
HOG         0.3005  1.9956    0.4214  2.0152    0.5396  2.0187    0.3350  2.1204
RSBP        0.2998  2.0110    0.4210  2.0117    0.5251  2.1776    0.3013  2.2254
Proposed    0.2913  2.1528    0.3991  2.1853    0.5002  2.2001    0.2996  2.3110

4.2 Human Action Recognition
To prove the applicability of the proposed descriptor on human action recognition, we present evaluations on a publicly available depth based action sequence benchmark database namely, MSR Action 3D dataset. Motion Map Representation from Depth Sequences: As a preprocessing stage, we construct a compact and dense representation namely Stridden Depth Motion Map (SDMM) [20,22] by piling up the action cues from the depth frames of the depth video sequence. This is achieved by traversing consecutive depth frames in the video in steps of four frames at a time and totaling up the absolute value of differences from one frame to its second predecessor frame. The dense structure thus achieved is an accumulation of traces of action cues from the selected depth frames of the depth video. Here, the striding of 4 frames per step, during the frame selection results in achieving an improvement over the
Fig. 5. Generating SDMM from depth sequences.
computation time by 1/4th of the actual time required to process all the frames during the generation of the SDMM. The mathematical model for the computation of SDMM is shown in the following Eq. (9) and schematically presented in Fig. 5.

SDMM_{f,s,t} = Σ_{k=3, k+=4}^{K} | D^k_{f,s,t} − D^{k−2}_{f,s,t} |    (9)
where the front, side and top projection views are denoted by f, s and t respectively, D^k_{f,s,t} indicates the k-th frame of the projected view (f, s or t) under consideration, K denotes the total number of frames present in the entire depth video, and SDMM is the resultant Stridden Depth Motion Map. Apart from the front (f) view demonstrated above, two additional 2D projected views, side (s) and top (t), are generated using Eqs. (10) and (11).

D^s_{ik} = j, if D^f_{ij} ≠ 0; 0, otherwise    (10)

D^t_{kj} = i, if D^f_{ij} ≠ 0; 0, otherwise    (11)
here i = 1, 2, 3, ...M, j = 1, 2, 3...N and k = 1, 2, 3...L, and L is the highest depth value across all the front view frames in the video, Df is the front frame under consideration with size MxN, Ds and Dt are the mathematically computed side and top view frames. CGF Features from DMM and Classification: We use the CGF filter described in Sect. 3.1 to obtain CGF responses for the SDMM, as formulated in Eq. 5. We compute simple block wise histogram over the magnitude factor of the CGF response and concatenate the histograms of each block to produce the final feature vector. We adopt a kernel based extreme learning machine for classification.
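The accumulation of Eq. (9) for one projection view can be sketched in a few lines of NumPy; the projection of the front view onto the side and top views (Eqs. 10–11) and the subsequent CGF feature extraction and ELM classification are not shown.

```python
import numpy as np

def sdmm(frames):
    """Stridden Depth Motion Map (Eq. 9) for one projected view.

    `frames` is a sequence of 2-D depth frames of a single view (front, side or top).
    Frames are visited in strides of four, accumulating |D^k - D^{k-2}|.
    """
    acc = np.zeros_like(frames[0], dtype=np.float64)
    for k in range(2, len(frames), 4):   # 0-based index 2 is the 3rd frame (k = 3 in Eq. 9)
        acc += np.abs(frames[k].astype(np.float64) - frames[k - 2].astype(np.float64))
    return acc
```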
Dataset Description: The MSR Action 3D [12] dataset is widely used for investigating depth video based HAR. It comprises 20 different classes of action sequences, where each class of action is sampled two or three times by ten different people facing the camera. The total number of depth videos in the dataset is 557. Apart from the variations in speed of actions across the samples, the high percentage of similarity among actions (such as draw x, draw tick, draw circle), i.e. inter-class similarity, is a key challenge in this dataset. Experimental Setup: We demonstrate a detailed comparative study of our results with the results available in the published literature, under the same train/test protocols used in [20,22]. The MSR Action 3D dataset is treated under two different settings for a justifiable experimentation, namely Setting1 and Setting2. Under Setting1, 8 action classes with the highest level of inter-class similarity and variation in speed are chosen at a time and treated as a fixed subset. Three such Action Sets, namely AS1, AS2 and AS3, are prepared from the available 20 action classes of the whole dataset, as listed in Table 3. The subsets are separately used for experimentation, wherein a cross subject test strategy is adopted, considering the odd numbered subjects (1,3,5,7,9) during training and the even numbered subjects during testing.

Table 3. Splitting of MSR Action 3D dataset under Setting1 into three action subsets.

Action Set1 (AS1)   Action Set2 (AS2)   Action Set3 (AS3)
Forward punch       Two hand wave       Forward kick
Horizontal wave     Forward kick        Side kick
Hand clap           High wave           High throw
Hammer              Draw tick           Tennis swing
Bend                Draw x              Jogging
Tennis serve        Side boxing         Pickup throw
High throw          Draw circle         Golf swing
Pickup throw        Hand catch          Tennis serve
During the experimentation with setting 2, the entire dataset with all 20 action classes is considered. Adopting the cross subject strategy, the odd numbered subjects (1,3,5,7,9) are considered during training and the even numbered subjects (2,4,6,8,10) are considered for testing. Due to the variety of action classes and speed variation, Setting2 is of higher complexity than Setting1. Experimental Results: The CGF descriptor based methodology achieved significantly better results than many of the existing methods available in the literature. Our experiments on MSR Action3D show an average accuracy of 96.39%
Table 4. Average recognition rates (%) attained under setting 1.

Method               AS1     AS2     AS3     Avg.(%)
DMM-HOG [27]         96.2    84.1    94.6    91.6
Chen et al. [4]      96.2    83.2    92.0    90.5
HOJ3D [25]           88.0    85.5    63.6    79.0
STOP [23]            91.7    72.2    98.6    87.5
DMM-LBP [4]          98.1    92.0    94.6    94.9
SDMM-UDTCWT [20]     96.52   93.82   96.92   95.75
Proposed             97.22   93.75   98.20   96.39
Table 5. Recognition accuracies achieved with setting 2.

Method               Accuracy (%)
DMM-HOG [27]         85.5
ROP [24]             86.5
HON4D [17]           88.9
DMM-LBP [4]          91.9
SDMM-UDTCWT [20]     93.41
Proposed             95.24
under setting1, as presented in Table 4, whereas Table 5 presents the cross subject test results on the same dataset under setting2, demonstrating significant improvements compared to the available results. Referring back to works such as [20,27], it is observed that the inter-class similarity among the samples belonging to Draw tick, Draw x and Draw circle is very high, and the samples belonging to these classes are mostly misclassified in the above cited existing works. In our experimentation, six samples belonging to Draw x are misclassified into the Draw tick class; however, owing to the robustness of the proposed method, only one sample belonging to Draw x is wrongly classified into the Draw circle class. The proposed methodology investigates the applicability of the newly devised hand-crafted CGF descriptor for human action recognition from depth videos. Considering the wide exploration of neural network based models in the HAR domain, we present a comparison of our results with some existing NN model based results on the MSR Action3D dataset in Table 6.
Table 6. Comparative result analysis achieved with some existing NN models, based on the performance on MSR Action3D dataset.

Method               AS1     AS2     AS3     Avg.(%)
3D CNN [13]          86.79   76.11   89.29   84.07
Zhao et al. [28]     95.89   87.68   97.32   94.15
H-RNN [9]            93.33   94.64   95.50   94.49
Proposed             97.22   93.75   98.20   96.39

5 Conclusion
Designing a computationally efficient local descriptor is a challenging task. We have designed a novel, simple yet, powerful descriptor based on complex gradient function (CGF). Proposed descriptor is easy to implement and is computationally inexpensive as no image convolutions are involved in the calculations of complex gradient function. Our CGF descriptor can well capture the local variations around the input pixel. We have proved its generic applicability with the case studies of iris biometrics and depth sequence based human action recognition. Proposed CGF descriptor enhances the utilization of image gradient with the combined impact of magnitude and direction. It can be further explored to other applications of image processing such as, face recognition, multi-modal biometrics, object recognition and human action analysis. Acknowledgement. This work is supported jointly by the Department of Science & Technology, Govt. of India and Russian Foundation for Basic Research, Russian Federation under the grant No. INT/RUS/RFBR/P-248.
References 1. Institute of Automation: Chinese Academy of Sciences. CASIA Iris Database. http://biometrics.idealtest.org/ 2. Malaysia Multimedia University Iris Database. http://pesona.mmu.edu 3. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008) 4. Chen, C., Jafari, R., Kehtarnavaz, N.: Action recognition from depth sequences using depth motion maps-based local binary patterns. In: 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1092–1099. IEEE (2015) 5. Chen, J., et al.: Wld: a robust local image descriptor. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1705–1720 (2010) 6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International Conference on Computer Vision & Pattern Recognition (CVPR 2005), vol. 1, pp. 886–893. IEEE Computer Society (2005) 7. De Marsico, M., Nappi, M., Narducci, F., Proen¸ca, H.: Insights into the results of miche i-mobile iris challenge evaluation. Pattern Recogn. 74, 286–304 (2018)
8. Deng, W., Hu, J., Guo, J.: Compressive binary patterns: designing a robust binary face descriptor with random-field eigenfilters. IEEE Trans. Pattern Anal. Mach. Intell. 41(3), 758–767 (2019) 9. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015) 10. Ke, Y., Sukthankar, R., et al.: Pca-sift: a more distinctive representation for local image descriptors. CVPR 2(4), 506–513 (2004) 11. Kumar, A., Passi, A.: Comparison and combination of iris matchers for reliable personal authentication. Pattern Recogn. 43(3), 1016–1026 (2010) 12. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern RecognitionWorkshops, pp. 9–14. IEEE (2010) 13. Liu, Z., Zhang, C., Tian, Y.: 3D-based deep convolutional neural network for action recognition with depth sequences. Image Vis. Comput. 55, 93–100 (2016) 14. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004) 15. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005). https://doi.org/10. 1109/TPAMI.2005.188 16. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002) 17. Oreifej, O., Liu, Z.: Hon4d: histogram of oriented 4D normals for activity recognition from depth sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 716–723 (2013) 18. Proenca, H., Filipe, S., Santos, R., Oliveira, J., Alexandre, L.A.: The ubiris.v2: a database of visible wavelength iris images captured on-the-move and at-a-distance. IEEE Trans. Pattern Anal. Mach. Intell. 32(8), 1529–1535 (2010) 19. Roy, S.K., Chanda, B., Chaudhuri, B.B., Banerjee, S., Ghosh, D.K., Dubey, S.R.: Local directional zigzag pattern: a rotation invariant descriptor for texture classification. Pattern Recogn. Lett. 108, 23–30 (2018) 20. Shekar, B.H., Rathnakara Shetty, P., Sharmila Kumari, M., Mestetsky, L.: Action recognition using undecimated dual tree complex wavelet transform from depth motion maps / depth sequences. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLII-2/W12, pp. 203–209 (2019). https://doi.org/10.5194/isprs-archives-XLII-2-W12-203-2019 21. Shekar, B.H., Bhat, S.S., Mestetsky, L.: Iris recognition by learning fragile bits on multi-patches using monogenic riesz signals. In: Deka, B., Maji, P., Mitra, S., Bhattacharyya, D.K., Bora, P.K., Pal, S.K. (eds.) PReMI 2019. LNCS, vol. 11942, pp. 462–471. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34872-4 51 22. Shetty, P.R., Shekar, B., Mestetsky, L., Prasad, M.M.: Stacked filter bank based descriptor for human action recognition from depth sequences. In: 2019 IEEE Conference on Information and Communication Technology, pp. 1–6. IEEE (2019) 23. Vieira, A.W., Nascimento, E.R., Oliveira, G.L., Liu, Z., Campos, M.F.: On the improvement of human action recognition from depth map sequences using spacetime occupancy patterns. Pattern Recogn. Lett. 36, 221–227 (2014) 24. Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y.: Robust 3D action recognition with random occupancy patterns. 
In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, pp. 872–885. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3 62
25. Xia, L., Chen, C.C., Aggarwal, J.K.: View invariant human action recognition using histograms of 3D joints. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–27. IEEE (2012) 26. Xia, Z., Yuan, C., Lv, R., Sun, X., Xiong, N.N., Shi, Y.Q.: A novel weber local binary descriptor for fingerprint liveness detection. IEEE Trans. Syst. Man Cybern. Syst. 50(4), 1526–1536 (2018) 27. Yang, X., Zhang, C., Tian, Y.: Recognizing actions using depth motion maps-based histograms of oriented gradients. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1057–1060. ACM (2012) 28. Zhao, C., Chen, M., Zhao, J., Wang, Q., Shen, Y.: 3D behavior recognition based on multi-modal deep space-time learning. Appl. Sci. 9(4), 716 (2019)
On-Device Language Identification of Text in Images Using Diacritic Characters Shubham Vatsal(B) , Nikhil Arora, Gopi Ramena, Sukumar Moharana, Dhruval Jain, Naresh Purre, and Rachit S. Munjal On-Device AI, Samsung R&D Institute, Bangalore, India {shubham.v30,n.arora,gopi.ramena,msukumar,dhruval.jain, naresh.purre,rachit.m}@samsung.com
Abstract. Diacritic characters can be considered as a unique set of characters providing us with adequate and significant clue in identifying a given language with considerably high accuracy. Diacritics, though associated with phonetics often serve as a distinguishing feature for many languages especially the ones with a Latin script. In this proposed work, we aim to identify language of text in images using the presence of diacritic characters in order to improve Optical Character Recognition (OCR) performance in any given automated environment. We showcase our work across 13 Latin languages encompassing 85 diacritic characters. We use an architecture similar to Squeezedet for object detection of diacritic characters followed by a shallow network to finally identify the language. OCR systems when accompanied with identified language parameter tends to produce better results than sole deployment of OCR systems. The discussed work apart from guaranteeing an improvement in OCR results also takes on-device (mobile phone) constraints into consideration in terms of model size and inference time. Keywords: Diacritic detection · OCR · Language identification On-device · Text localization · Shallow network
1
·
Introduction
A diacritic or diacritical mark is basically a glyph added to a letter or a character. Diacritics are used to provide extra phonetic details and hence altering the normal pronunciation of a given character. In orthography1 , a character modified by a diacritical mark is either treated as a new character or as a character-diacritic combination. These rules vary across inter-language and intra-language peripherals. In this proposed work, we have restricted ourselves to diacritic characters pertaining to Latin languages. Other than English there are many popular Latin languages which make use of diacritic characters like Italian, French, Spanish, German and many more. 1
https://en.wikipedia.org/wiki/Orthography.
c Springer Nature Singapore Pte Ltd. 2021 S. K. Singh et al. (Eds.): CVIP 2020, CCIS 1377, pp. 502–512, 2021. https://doi.org/10.1007/978-981-16-1092-9_42
On-Device Language Identification of Text in Images
503
OCR is one of the most renowned and foremost discussed Computer Vision (CV) tasks which is used to convert text in images to electronic form in order to analyze digitized data. There have been many prominent previous works done in OCR. [22] uses a novel mechanism of attention to achieve state of the art results on street view image datasets. [2] makes use of spatial transformer network to give unparalleled results in scene text recognition. [19] applies conventional Convolutional Neural Network (CNN) with Long Short Term Memory (LSTM) for its text interpretation task. We can define two broad ways with respect to OCR enhancements. One can be an implicit way of OCR enhancement whereas other can be an explicit way. In the explicit way of OCR enhancement our aim is to improve OCR’s inherent accuracy which can depend on multiple factors like OCR’s internal architecture, pre-processing images to improve their quality and hence increasing OCR’s relative confidence with regards to text recognition and so on. The quality of image depends on multiple aspects with respect to OCR performance ranging from font size of text in images to source of images. There are many image pre-processing techniques like [3,8,17] which help in enhancing image quality and in return provide us with better OCR confidence. The other type of OCR enhancements are the implicit ones. In this way of OCR enhancement, we concentrate on external factors in order to improve OCR’s results in a mechanized environment. For example, post processing hacks to improve OCR results, determining factors like language of text in image and using them as OCR parameters to choose the correct OCR language based dependencies are some of such factors. An important point to emphasize here is that an OCR’s original accuracy stays the same in case of implicit enhancements but the final OCR results in a given environment is improved. In this work we concentrate on one of the implicit ways to improve OCR results. Language input to OCR helps in differentiating between similar looking characters across various languages which comprise mostly of diacritic characters. For example, diacritic characters ` a and ´ a are minutely different and hence if correct language is not specified, it is often missed or wrongly recognized. The rest of the paper is organised in the following way. Section 2 talks about related works. We elucidate the working of our pipeline in Sect. 3. Section 4 concentrates on the experiments we conducted and the corresponding results we achieved. The final section takes into consideration the future improvements which can be further incorporated.
2
Related Works
There have been many works done to identify languages in Natural Language Processing (NLP) domain but things are not that straightforward when it comes to identifying languages of text in images, especially when it needs to be done without any involvement of character segmentation or OCR techniques. Most of the existing works on OCR implicitly assume that the language of the text in images is known beforehand. But, OCR approaches work well individually for specific languages for which they were designed in the first place. For example,
504
S. Vatsal et al.
an English OCR will work very well with images containing English text but they struggle when given a French text image. An automated ecosystem would clearly need human intervention in order to select the correct OCR language parameters. A pre-OCR language identification work would allow the correct language based OCR paradigms to be selected thus guaranteeing better image processing. Along the similar lines, when dealing with Latin languages, current OCR implementations face problems in correct classification of languages particularly due to common script. In this paper, we propose an architecture which uses detection of diacritic characters in all such languages using object detection approach to enhance the OCR text recognition performance. Key takeaway from our approach is that we design this pipeline to meet the on-device constraints, making it computationally inexpensive. Several work has been done with respect to script detection but identification of language from images is still not a thoroughly researched area. Script detection although could help us in differentiating two languages of different scripts but this technique fails to differentiate between languages of same script like Spanish and German which belong to Latin script. Among some of the previous works done in the domain of language identification, [4] uses three techniques associated with horizontal projection profiles as well as runlength histograms to address the language identification problem on the word level and on text level. But then this paper just targets two languages which are English and Arabic who also happen to have different scripts. [15] although with the similar intention of improving OCR showcases its work only on languages of different scripts. Again, [24] presents a new approach using a shape codebook to identify language in document images but it doesn’t explicitly targets languages of similar script. [14] demonstrates promising results but then the authors attribute these results towards biased image properties as all texts were of the same size and acquired under exactly the same conditions. [12] advocates that the use of shape features for script detection is efficient, but using the same for segregating into languages is of little importance as many of these languages have same set of characters. Also this work uses an OCR for identification of language contrary to our work where we aim to identify language first and then use it to improve OCR. Some noteworthy works revolving around diacritic character in images include robust character segmentation algorithm for printed Arabic text with diacritics based on the contour extraction technique in [13]. Furthermore, diacritic characters have been used for detecting image similarity in Quranic verses in [1]. Another work [5] discusses about diacritical language OCR and studies its behaviours with respect to conventional OCR. [11] talks about their segmentation-free approach where the characters and associated diacritics are detected separately with different networks. Finally, [10] illustrates experiments on Arabic font recognition based on diacritic features. None of these works try to associate diacritic characters with language as we have explored in our case. Object Detection is a widely popular concept which has seen many breakthrough works in the form of Fast R-CNN [6], YOLO [16], SqueezeNet [7] and many more. There have been quite a few works in the direction of using object
On-Device Language Identification of Text in Images
505
detection approach for character recognition. [21] uses a generic object recognition technique for end to end text identification and shows how it performs better than conventional OCR. [9] makes use of deep convolutional generative adversarial network and improved GoogLeNet to recognise handwritten Chinese characters. In our work also, we make use of object detection mechanism with Squeezedet to process diacritic characters. Other previous approaches on OCR for Latin language identification fail to perform well after script detection phase. To the best of our knowledge diacritic characters have not been used for the same to enhance the system performance. In this paper, we present a novel architecture for boosting OCR results when it comes to working with different languages with common scripts, with an efficient performance when deployed on-device.
3
Proposed Pipeline
This section delineates the purpose of each component and eventually concludes how these components blend together to get us the desired result. Figure 1 shows the pipeline of the proposed system. As we can see, an image is sent as input to a Text Localization component from which text bounding boxes are extracted. These text bounding boxes are sent one by one to Diacritic Detection model. Once the diacritics if present have been detected, then we use our shallow neural network to identify the language. This language input is finally fed to the OCR to improve its performance.
Fig. 1. Proposed pipeline
3.1
Corpus Generation
We created RGB format word image dataset of fixed height of 16 dimension and variable width depending on the aspect ratio to train our model for diacritic characters. We used European Parliament Proceedings Parallel Corpus2 for purposefully choosing words with diacritic characters across all 13 languages 2
http://www.statmt.org/europarl/.
506
S. Vatsal et al.
for constructing this dataset. The distribution of data across all languages and the diacritic characters found in each language is listed in Table 1. We uniquely labelled each diacritic character. In order to achieve an adequate level of generalization, various randomization factors were put into place like font size, font type and word length. Sample snippets of this synthetic dataset have been showcased in Fig. 2. As it can be seen in the figure, bounding boxes have been constructed around the diacritic characters. Table 1. Corpus distribution Language
Word image corpus size
Diacritic characters
Spanish
9218
German
8673
French
9127
´ ´ ˜ n A, a, N, ˜ ¨ ¨ ¨ ¨ u A, ¨ a, O, o, U, ¨, ß ` ` ˆ ˆ ´ ´e, E, ` `e, E, ˆ ˆe, E, ¨ ¨e, ˆI, ˆı, ¨I, ¨ı, O, ˆ ˆ A, a, A, a, E, o, ˆ u Œ, œ, U, ˆ, ¸c ` ` ` ` ` u A, a, `I, `ı, O, o, U, `
Italian
8903
Romanian
9583
Finnish
9477
Hungarian 9674 Estonian
9243
Danish
9251
Dutch
9439
Swedish
9054
Portuguese 8891 Czech
9133
ˆ ˆ A, a, ¨ ¨ A, a, ´ A, ´ a,
˘ ˘ A, a, S ¸ , ¸s, T ¸ , ¸t ¨ ¨ O, o ´ ´e, ´I, ´ı, O, ´ ´ ¨ ¨ ˝ ˝ ¨ u ˝ u E, o, O, o, O, o, U, ¨, U, ˝, ¨ ¨ ˜ ˜ ¨ ¨ ˇ ˇs A, a, O, o, O, o, S, ˚ A, ˚ a, Æ, æ, Ø, ø ¨ ¨e, ¨I, ¨ı E, ¨ ¨ ¨ ¨ A, a, ˚ A, ˚ a, O, o ´ ´ ˜ ˜ ˆ ˆe, O, ˆ ˆ ˜ ˜ A, a, A, a, E, o, O, o, ¸c ´ ´ ˇ ´ ´ ´ u ´ y ˇ ˇc, A, ´ a, E, ´e, E, ˇe, I, ´ı, O, ´ o, U, ´, ˚ u, Y, ´, C, ˇ N, ˇ d, ˇ n ˇ ˇr, S, ˇ ˇs, T, ˇ ˇt, Z, ˇ ˇz D, ˇ, R,
Fig. 2. Sample images
Apart from the above discussed word based image dataset we also created RGB format 150 × 150 Test dataset. This dataset was again created using European Parliament Proceedings Parallel Corpus in order to test the final performance of our proposed pipeline. This dataset consisted of random text containing
On-Device Language Identification of Text in Images
507
some diacritic characters which was fed as an input to our pipeline. We again took care of same set of randomization factors in order to achieve a better level of generalization. Sample image of this dataset can be seen in Fig. 1.
Fig. 3. Text localization (modified CTPN architecture)
3.2
Text Localization
Text localization detects bounding boxes of text regions. This is performed using Connectionist Text Proposal Network (CTPN) [20]. We modified the network to use a 4 layered CNN instead of VGG 16 [18], to achieve a better on-device performance and also since we needed only low level features in order to identify the regions of text. The 4 layers of CNN used similar parameters as that of initial layers of VGG 16 and the filter size of convolutional layers can be seen in Fig. 3. Apart from the 4 layered CNN introduced for on-device compatibility, rest of the architecture has been kept same with same parameters as discussed in [20]. The extracted feature vectors are recurrently connected by a Bidirectional LSTM, where the convolutional features are used as input of the 256 dimension Bi-LSTM. This layer is connected to a 512 dimension fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-coordinates and side-refinement offsets of k anchors. The detected text proposals are generated from the anchors having a text/non-text score of >0.7 (with non-maximum suppression). The modified network architecture of CTPN has been represented in Fig. 3. In our experiments, we notice that this is able to handle text lines in a wide range of scales and aspect ratios by using a singlescale image, as mentioned in the original paper. 3.3
Diacritic Detection
We use an object detection approach to detect diacritic characters. Inspired from Squeezedet [23], we designed a model which is more suitable for our problem
508
S. Vatsal et al.
statement and also more lightweight in terms of on-device metrics. Since, there are a lot of similarities between normal characters and diacritic characters and also within various diacritic characters, we used our own downsizing network in the initial layers so that sharp difference between various characters could be identified. We didn’t use pooling layers in the starting of the network to allow more low level image features to be retained till that point. Further, we decreased the strides of first CNN layer in order to capture more image features. Apart from these changes, we also reduced the number of fire [7] layers. There were couple of reasons for that change. First, our input image is very small and it is not required to have so many squeeze and expand operations and hence make the network very deep as it is the low level image features which mostly contribute towards identifying a difference between a normal character and a diacritic character or even differentiating within the set of diacritic characters. Second, we also have to adhere to on-device computational constraints. The architecture of our network can be seen in Fig. 4. For conv1, we used 64 filters with kernel size being 3 and stride 1. Following conv1 we have a set of two fire layers, fire2 and fire3. Both of them have same set of parameters which are s1x1 = 16, e1x1 = 64 and e3x3 = 64 where s represents squeeze convolutions and e represents expand convolutions. Then comes a max pool layer with kernel size 3, stride 2 and same padding. We again have another set of fire layers, fire4 and fire5, having same set of parameters s1x1 = 32, e1x1 = 128 and e3x3 = 128. Max pool follows this set of fire layers with kernel size 3, stride 2 and same padding. We then concatenate the output of these two sets of fire layers and the concatenated output is fed into a new fire layer, fire6. Fire6 and fire7 have s1x1 = 48, e1x1 = 192, e3x3 = 192. Then we have fire8 and with s1x1 = 96, e1x1 = 384, e3x3 = 384. Finally, we have fire9 and fire10 with s1x1 = 96, e1x1 = 384, e3x3 = 384. As it can be seen, we have gradually increased the filters in fire layers from beginning to end of the network. In the end we have convdet layer with kernel size 3 and stride 1. In addition to the above discussed model parameters, there were other important hyper-parameters selected to tune the model. While training, we used 9 anchors per grid with batch size of 16. Learning rate was set to 0.01 with decay factor of 0.0001. The non-maximum suppression threshold was set to 0.2 and dropout value was set to 0.5.
Fig. 4. Diacritic detection network
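As a rough illustration of the fire-layer stack just described (and depicted in Fig. 4), the PyTorch sketch below wires up the stated layer parameters. It is not the authors' code: the input channel count, the 'same'-padding approximation, the extra pooling applied to the first fire set before concatenation (needed to match spatial sizes), and the ConvDet output layout are assumptions.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style fire module: 1x1 squeeze, then 1x1 and 3x3 expand branches."""
    def __init__(self, cin, s1x1, e1x1, e3x3):
        super().__init__()
        self.squeeze = nn.Conv2d(cin, s1x1, 1)
        self.exp1 = nn.Conv2d(s1x1, e1x1, 1)
        self.exp3 = nn.Conv2d(s1x1, e3x3, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.act(self.squeeze(x))
        return torch.cat([self.act(self.exp1(s)), self.act(self.exp3(s))], dim=1)

class DiacriticDetector(nn.Module):
    """Sketch of the detection backbone described above (ConvDet head simplified)."""
    def __init__(self, anchors_per_grid=9, num_classes=2):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=1, padding=1),
                                   nn.ReLU(inplace=True))
        self.set1 = nn.Sequential(Fire(64, 16, 64, 64), Fire(128, 16, 64, 64))    # fire2, fire3
        self.set2 = nn.Sequential(Fire(128, 32, 128, 128), Fire(256, 32, 128, 128))  # fire4, fire5
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)
        self.tail = nn.Sequential(
            Fire(384, 48, 192, 192), Fire(384, 48, 192, 192),     # fire6, fire7
            Fire(384, 96, 384, 384),                              # fire8
            Fire(768, 96, 384, 384), Fire(768, 96, 384, 384))     # fire9, fire10
        out_ch = anchors_per_grid * (num_classes + 1 + 4)         # scores + confidence + box
        self.convdet = nn.Conv2d(768, out_ch, 3, stride=1, padding=1)

    def forward(self, x):
        x = self.conv1(x)
        a = self.pool(self.set1(x))               # output of the first fire set (pooled)
        b = self.pool(self.set2(a))               # output of the second fire set (pooled)
        x = torch.cat([self.pool(a), b], dim=1)   # extra pool on set 1 (assumption) to match sizes
        return self.convdet(self.tail(x))
```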
3.4 Language Identification
We use a shallow network to finally infer the language once diacritic characters have been identified in the given image. We design the input as one-hot vectors corresponding to the total number of diacritic characters with which our diacritic detection model was trained. We took variable-sized chunks of input text and extracted diacritic characters from them to prepare the one-hot input vector. Since we were using the European Parliament Proceedings Parallel Corpus for detection of diacritics, we already had a text dataset labelled by language, and we used the same dataset to train our shallow network. The shallow network consists of two hidden dense layers with 50 units and 30 units respectively and the ReLU activation function. The output layer uses the Softmax activation function, with the number of units equal to the total number of languages, which is 13 in our case. The architecture of our network is shown in Fig. 5. We created 1000 samples for each language, used 90% as training data and the remainder as validation data, and trained for 20 epochs with other parameters at their default values.
Fig. 5. Shallow network for language identification
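A minimal sketch of this shallow classifier is given below, assuming PyTorch; the size of the one-hot diacritic vocabulary is an assumption, since its exact count is not stated in this excerpt.

```python
import torch.nn as nn

NUM_DIACRITICS = 24   # assumption: size of the diacritic vocabulary used for the one-hot input
NUM_LANGUAGES = 13

language_classifier = nn.Sequential(
    nn.Linear(NUM_DIACRITICS, 50), nn.ReLU(),   # hidden dense layer, 50 units
    nn.Linear(50, 30), nn.ReLU(),               # hidden dense layer, 30 units
    nn.Linear(30, NUM_LANGUAGES),               # softmax applied via the loss / at inference
)
```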
4 Experiments and Results
As we can see in Table 2, the object detection approach with our diacritic detection network works reasonably well. We calculate various losses to measure the performance of our model; their definitions can be found in [23]. Apart from the losses, we are able to achieve a Recall as high as 0.9 with the Mean Intersection over Union (IoU) being around 0.7. The comparison results in Table 2 show how our diacritic detection approach is able to outperform Squeezedet.

Table 2. Diacritic detection results

Metrics                        Diacritic detection network   Squeezedet
Class loss                     0.31                          3.83
Bounding box loss              0.09                          0.99
Confidence loss                0.22                          0.41
Mean intersection over union   0.71                          0.39
Recall                         0.90                          0.21

The next experiment we conduct concerns the overall performance of the entire pipeline. We calculated multiple metrics, namely Recall, Precision and F1 Score, to get a holistic view of the performance of our pipeline. We chose 500 samples for each language from the test dataset created as discussed in Sect. 3.1. The results in Table 3 showcase that diacritic characters serve as an important factor, even within the same script, when it comes to determination of language.

Table 3. Language identification results

Language     Precision   Recall   F1 score
Spanish      0.92        0.91     0.92
German       0.88        0.93     0.91
French       0.91        0.85     0.88
Italian      0.97        0.88     0.92
Romanian     0.95        0.90     0.93
Finnish      0.87        0.99     0.93
Hungarian    0.82        0.99     0.90
Estonian     0.98        0.96     0.97
Danish       0.95        0.75     0.84
Dutch        0.92        0.99     0.96
Swedish      0.95        0.71     0.81
Portuguese   0.75        0.89     0.82
Czech        0.95        0.92     0.90

Apart from these results, our proposed system demonstrates efficiency with respect to device-based computational restrictions. Our entire pipeline's size is restricted to just around 5 MB, with inference time as low as 213 ms. The on-device metrics have been tabulated in Table 4 and have been calculated on a Samsung Galaxy A51 with 4 GB RAM and a 2.7 GHz octa-core processor.

Table 4. On-device metrics

Component                     Size     Inference time
Diacritic detection network   5 MB     210 ms
Shallow network               0.3 MB   3 ms
Total                         5.3 MB   213 ms
5 Conclusion and Future Work
In this work, we showcase how we can identify the language of text in images by making use of diacritic characters, using an on-device efficient architecture with low model size and inference time. We primarily concentrate on 13 Latin-script languages and observe promising results. The existing architecture can be further scaled to other Latin-script languages as well. One area of future work is to extend this approach to scripts other than Latin. In order to achieve that, we first need to identify idiosyncratic characters in the corresponding script, just as we identified diacritic characters in the Latin script, which can be used to differentiate between languages belonging to that script. For example, the Devanagari script3 has compound letters, which are vowels combined with consonants, and these compound letters carry diacritics. Once we have diacritics or a similarly identified unique set of characters, we can apply the discussed architecture and observe the results.
References 1. Alotaibi, F., Abdullah, M.T., Abdullah, R.B.H., Rahmat, R.W.B.O., Hashem, I.A.T., Sangaiah, A.K.: Optical character recognition for quranic image similarity matching. IEEE Access 6, 554–562 (2017) 2. Bartz, C., Yang, H., Meinel, C.: See: towards semi-supervised end-to-end scene text recognition. arXiv preprint arXiv:1712.05404 (2017) 3. Bieniecki, W., Grabowski, S., Rozenberg, W.: Image preprocessing for improving OCR accuracy. In: 2007 International Conference on Perspective Technologies and Methods in MEMS Design, pp. 75–80. IEEE (2007) 4. Elgammal, A.M., Ismail, M.A.: Techniques for language identification for hybrid arabic-english document images. In: Proceedings of Sixth International Conference on Document Analysis and Recognition, pp. 1100–1104. IEEE (2001) 5. Gajoui, K.E., Allah, F.A., Oumsis, M.: Diacritical language OCR based on neural network: case of Amazigh language. Procedia Comput. Sci. 73, 298–305 (2015) 6. Girshick, R.: Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015) 7. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and¡ 0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016) 8. Lat, A., Jawahar, C.: Enhancing OCR accuracy with super resolution. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 3162–3167. IEEE (2018) 9. Li, J., Song, G., Zhang, M.: Occluded offline handwritten Chinese character recognition using deep convolutional generative adversarial network and improved googlenet. Neural Comput. Appl. 32(9), 4805–4819 (2020) 10. Lutf, M., You, X., Cheung, Y.M., Chen, C.P.: Arabic font recognition based on diacritics features. Patt. Recogn. 47(2), 672–684 (2014) 3
https://en.wikipedia.org/wiki/Devanagari.
11. Majid, N., Smith, E.H.B.: Segmentation-free bangla offline handwriting recognition using sequential detection of characters and diacritics with a faster R-CNN. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 228–233. IEEE (2019) 12. Mioulet, L., Garain, U., Chatelain, C., Barlas, P., Paquet, T.: Language identification from handwritten documents. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 676–680. IEEE (2015) 13. Mohammad, K., Qaroush, A., Ayesh, M., Washha, M., Alsadeh, A., Agaian, S.: Contour-based character segmentation for printed Arabic text with diacritics. J. Electron. Imaging 28(4), 043030 (2019) 14. Nicolaou, A., Bagdanov, A.D., G´ omez, L., Karatzas, D.: Visual script and language identification. In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS), pp. 393–398. IEEE (2016) 15. Peake, G., Tan, T.: Script and language identification from document images. In: Proceedings Workshop on Document Image Analysis (DIA 1997), pp. 10–17. IEEE (1997) 16. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016) 17. Seeger, M., Dance, C.: Binarising camera images for OCR. In: Proceedings of Sixth International Conference on Document Analysis and Recognition, pp. 54–58. IEEE (2001) 18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 19. Smith, R., et al.: End-to-end interpretation of the French street name signs dataset. In: Hua, G., J´egou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 411–426. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46604-0 30 20. Tian, Z., Huang, W., He, T., He, P., Qiao, Yu.: Detecting text in natural image with connectionist text proposal network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 56–72. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46484-8 4 21. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: 2011 International Conference on Computer Vision, pp. 1457–1464. IEEE (2011) 22. Wojna, Z., et al.: Attention-based extraction of structured information from street view imagery. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 844–850. IEEE (2017) 23. Wu, B., Iandola, F., Jin, P.H., Keutzer, K.: SqueezeDet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 129–137 (2017) 24. Zhu, G., Yu, X., Li, Y., Doermann, D.: Unconstrained language identification using a shape codebook. In: The 11th International Conference on Frontiers in Handwritting Recognition (ICFHR 2008), pp. 13–18 (2008)
A Pre-processing Assisted Neural Network for Dynamic Bad Pixel Detection in Bayer Images Girish Kalyanasundaram(&), Puneet Pandey, and Manjit Hota Samsung Electronics, Bengaluru, India {g.kalyanasun,pu.pandey,manjit.hota}@samsung.com
Abstract. CMOS image sensor cameras are integral part of modern hand held devices. Traditionally, CMOS image sensors are affected by many types of noises which reduce the quality of image generated. These spatially and temporally varying noises alter the pixel intensities, leading to corrupted pixels, also known as “bad” pixels. The proposed method uses a simple neural network approach to detect such bad pixels on a Bayer sensor image so that it can be corrected and overall image quality can be improved. The results show that we are able to achieve a defect miss rate of less than 0.045% with the proposed method. Keywords: Image sensor
· Bad pixels · Bayer
1 Introduction

Bayer CMOS image sensors [1] play an integral part in hand-held mobile photography. With the advent of smaller pixel sizes and higher resolutions, Bayer sensors are becoming more susceptible to various types of noise such as dark noise, photon shot noise, RTS noise, etc. [2]. These noises distort the pixel intensities, which leads to deterioration of image quality. Such pixels are called 'bad pixels', and they can be of two types, viz. static and dynamic [3]. Static bad pixels are those with permanent defects, which are introduced during the manufacturing stage and are always fixed in terms of location and intensity. These pixels are tested and their locations are stored in advance in the sensor's memory so that they can be corrected by the image sensor pipeline (ISP). Dynamic bad pixels are not consistent; they change spatially and temporally, which makes them harder to detect and correct [4–12]. Figure 1 shows the impact of bad pixels on image quality: if bad pixels are not properly detected, they introduce severe artifacts in the image. The number of bad pixels generated in a sensor is inversely proportional to the pixel size. With advancing technology, pixel sizes are getting smaller and smaller, resulting in an increase in the frequency of bad pixels [4, 5]. A few solutions related to bad pixel detection and correction have been discussed in the literature [6–12]. In [6], hot pixel detection is done by accumulating Bayesian statistics collected from a sequence of images. In [7], dark frame data is used for performing linear interpolation for bad pixel correction. Other methods used are min-max filtering [8], median filtering [9], multi-step interpolation [10], and a
Fig. 1. (a) Region of an image. (b) Region of the same image but with simulated bad pixels in the Bayer image before demosaicing.
sparsity-based iterative algorithm [11]. In [12], the author proposes to use local statistics in a 5 × 5 Bayer image patch to effectively detect and correct bad pixels. The major issue with all these traditional algorithms is that they fail to use hidden image features, which results in a very high miss rate. Although the use of deep neural networks to detect such hidden features is an attractive idea, our goal is not to allow the network complexity to grow immensely, in order to maintain the feasibility of implementing the network in low-power software or hardware. To that end, the approach explored herein performs some basic pre-processing on a Bayer image patch to enhance and extract key features that, we believe, assist a shallow neural network in easily converging to a defective pixel detection solution. We have explored different neural network architectures as part of this effort. We have also proposed various local and global features computed during pre-processing, which can be used by the network for better detection of dynamic bad pixels. With the current approach we are able to achieve a defect miss rate of less than 0.045%, with a false detection rate roughly under 0.63%. The rest of the paper is organized as follows: Sect. 2 describes the data generation process, network formation, training and testing in detail; this is followed by results in Sect. 3 and concluding remarks in Sect. 4.
2 Data and Methods

2.1 Data Set
The basis of the training and testing data set for the detection algorithm is a set of five Bayer images acquired from the output interface of a Samsung Isocell 3P9 16 MP CMOS image sensor. Three of the five images were chosen for the training phase. As our methods for BP detection require a 5 × 5 pixel region around the pixel to be tested, we extracted 5 × 5 patches from the images for training and testing. The scope of the experiments detailed in this paper is restricted to 1 defective pixel within a 5 × 5 pixel patch, given that this is a first attempt at bad pixel detection; higher defect densities will be explored in future iterations of this work. Each image was divided into non-overlapping grids of 5 × 5 pixels (the grid positions were randomly shifted in multiple batches of data generation for more robust training and testing). The center of each grid was simulated as a bad pixel of random intensity. The bad pixel value was simulated by varying the intensity of the pixel by at least 30% of the minimum or maximum pixel value in the surrounding 5 × 5 patch (this is in line with Weber's law on the ability to perceive differences in sensory stimuli). The choice of simulating a hot or cold pixel was made randomly with odds of 0.5. Table 1 shows the algorithm for simulating a bad pixel at any position in the image.

Table 1. Algorithm used for simulating examples of bad pixels in a 5 × 5 Bayer image patch.
Algorithm:
  BP_Type ← Random(HOT_PIXEL, COLD_PIXEL)
  Select the 5×5 area around the center pixel.
  Find MinValue and MaxValue among pixels of the same color channel as the center pixel in the 5×5 patch.
  If BP_Type == HOT_PIXEL:
      PixThresh ← max(MaxValue * 1.3, MaxValue + TH)
      Simulated BP value ← RandInt(PixThresh, MAX_BAYER_VAL)
  Else:
      PixThresh ← min(MinValue * 0.7, MinValue - TH)
      Simulated BP value ← RandInt(0, PixThresh + 1)
  TH: minimum threshold by which the BP must exceed (or fall behind) the Max/Min value.
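A runnable Python version of this procedure might look as follows; MAX_BAYER_VAL and TH are assumed values, since they are not fixed in this excerpt.

```python
import random

MAX_BAYER_VAL = 1023   # assumption: 10-bit Bayer data
TH = 30                # assumption: minimum absolute offset from the local min/max

def simulate_bad_pixel(same_channel_values):
    """Return a simulated hot or cold pixel value for the patch centre (Table 1 sketch)."""
    lo, hi = min(same_channel_values), max(same_channel_values)
    if random.random() < 0.5:                                  # hot pixel
        thresh = max(hi * 1.3, hi + TH)
        return random.randint(int(min(thresh, MAX_BAYER_VAL)), MAX_BAYER_VAL)
    else:                                                      # cold pixel
        thresh = min(lo * 0.7, lo - TH)
        return random.randint(0, max(int(thresh), 0))
```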
Extra examples of cold pixels on the dark side of strong edges needed to be provided to address the limitations of test case generation. The number of such cases in our training set amounted to about 10–11% of the bad pixels simulated by the previous method; this quantity was chosen after iterations of testing to find optimal results.

2.2 BP Detection: Network Architectures
Two variations of a 2-layer fully connected neural network were explored in this paper to work on inputs of pre-computed features from a 5 × 5 Bayer image patch around a test pixel.
NN I. Figure 2 shows the proposed architecture NN I. The image was split into 4 color channels, viz. Gr, R, B and Gb, in which the two green channels Gr and Gb are treated separately because their spatial relationships with the R and B pixels differ: Gr denotes the green pixels adjacent to red pixels, and Gb the green pixels neighbouring blue pixels. Each color channel was treated separately and the features extracted from each color channel were separately processed, before fusing their results to decide whether the center pixel is a bad pixel (BP). The idea behind this splitting was to detect deviations of the central (or test) pixel against the contours in individual color planes, before fusing them. For discussion, let us consider the case of a 5 × 5 image patch with an R pixel at the center. To estimate deviations of the non-red pixels from the central red pixel, the red value at those locations must be estimated. Local white balancing (LWB) was employed for this purpose, as shown in Fig. 2.

Mean Computation. The means of each channel (\mu_{Gr}, \mu_{Gb}, \mu_R, \mu_B) were computed to start with. Estimating the means of the pixels reliably is essential for local white balancing. As a first step, the mean of each channel is computed as:

\mu_R = \frac{\sum_i p_i\,(Color(i)==R)}{\sum_i (Color(i)==R)}, \quad
\mu_B = \frac{\sum_i p_i\,(Color(i)==B)}{\sum_i (Color(i)==B)}, \quad
\mu_{Gr} = \mu_{Gb} = \frac{\sum_i p_i\,(Color(i)==Gr\ \mathrm{or}\ Gb)}{\sum_i (Color(i)==Gr\ \mathrm{or}\ Gb)}    (1)

The presence of excessively bad pixels in the 5 × 5 patch can significantly distort the mean values. Hence, in a crude approach, we also test whether pixels could be 'potentially' bad using the initial values of the means, and the means are then re-evaluated. The following mask was generated per pixel, depending on whether it was found to be 'suspicious' of being bad or not:

s_i = \big(p_i > T_i\,\mu_{Color(i)}\big) \ \&\&\ \big(p_i < (2 - T_i)\,\mu_{Color(i)}\big)    (2)

The value T_i is a threshold dependent on the pixel position in the 5 × 5 patch, which is chosen as 0.5 at the center and reduces linearly towards the periphery. Then, the means were re-evaluated as:

\mu_R := \frac{\sum_i p_i s_i\,(Color(i)==R)}{\sum_i s_i\,(Color(i)==R)}, \quad
\mu_B := \frac{\sum_i p_i s_i\,(Color(i)==B)}{\sum_i s_i\,(Color(i)==B)}, \quad
\mu_{Gr} = \mu_{Gb} = \frac{\sum_i p_i s_i\,(Color(i)==Gr\ \mathrm{or}\ Gb)}{\sum_i s_i\,(Color(i)==Gr\ \mathrm{or}\ Gb)}    (3)

These means were used in performing LWB. LWB from any general color channel A to color B is done as per the equation:

\hat{p}_B = p_A\,\mu_B / \mu_A    (4)
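A NumPy sketch of this pre-processing chain (Eqs. 1–4 plus the absolute differences fed to the first layer) is given below. It is illustrative only: the Gr/Gb channels are collapsed into a single green label, the linear schedule for T_i is an assumed one, and degenerate cases (e.g., all pixels of a channel flagged suspicious) are ignored for brevity.

```python
import numpy as np

def nn1_features(patch, bayer_cfa):
    """Pre-processing sketch for NN I on a 5x5 Bayer patch.

    patch: 5x5 array of pixel values; bayer_cfa: 5x5 array of channel labels ('R', 'G', 'B').
    """
    # Eq. (1): per-channel means.
    means = {c: patch[bayer_cfa == c].mean() for c in np.unique(bayer_cfa)}

    # Eq. (2): 'suspicious' mask; T_i is 0.5 at the centre and decreases towards the
    # periphery (assumed linear schedule), so the test is loosest at the patch border.
    yy, xx = np.mgrid[0:5, 0:5]
    dist = np.maximum(np.abs(yy - 2), np.abs(xx - 2))
    T = 0.5 - 0.15 * dist
    mu = np.vectorize(means.get)(bayer_cfa)
    s = (patch > T * mu) & (patch < (2 - T) * mu)

    # Eq. (3): means re-evaluated using only non-suspicious pixels.
    means = {c: patch[(bayer_cfa == c) & s].mean() for c in means}

    # Eq. (4): local white balance of every pixel towards the centre pixel's channel.
    centre_c = bayer_cfa[2, 2]
    mu = np.vectorize(means.get)(bayer_cfa)
    balanced = patch * means[centre_c] / mu

    # Deviations |p_i - p_c| supplied to the first network layer.
    return np.abs(balanced - balanced[2, 2])
```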
Fig. 2. Two layers of fully connected neural network (hereby referred to as NN I) applied to pre-computed features on a 5 × 5 patch of pixels in a Bayer image to detect if the central pixel is a bad pixel (BP) or not. The first layer has disjoint networks processing pixels from individual color planes before data fusion in the second layer. NOTE: The parameter 'a' in the first dense sub-net layers is variable. (Color figure online)
After LWB, the magnitude of the difference of each pixel from the center pixel, |p_i − p_c|, was computed to quantify the center pixel's deviation across the region. These quantities were supplied to the first layer of the neural network. Each channel's features were separately processed by individual fully connected sub-nets in the first layer. The features obtained from these nets were fed to a single neuron in the second layer that takes in additional statistics derived from the 5 × 5 patch, which are shown in Fig. 2. The output of this neuron has a sigmoid response, whose value was rounded to get a binary response of 0 for a clean pixel or 1 for a bad pixel. NOTE: We have trained 4 such models – one for each type of patch (Gr-centered, R-centered, B-centered and Gb-centered).

NN II. During experiments with the previous network, issues were found with the white balancing accuracy at edges. It was found that cross-channel correlation was not sufficient around edges and high-frequency regions to give a reliable estimate of the pixels in cross-channel positions through LWB. The second network explored was an alternative, lower-cost version of the first approach, aimed at avoiding the use of cross-channel information (Fig. 3).
Fig. 3. Two layers of fully connected neural network (hereby referred to as NN II) applied to pre-computed features on pixels of the same color as the central test pixel in a 5 × 5 patch in a Bayer image to detect if the central pixel is a bad pixel (BP) or not. The first layer is meant to process the center pixel's deviations across various directions before fusing this information with global statistics in the second layer. NOTE: The parameter 'a' in the first dense layer is variable.
Only the color channel of the center pixel was taken for BP detection, hence eliminating the need for LWB. The magnitude of the difference of each pixel from the center pixel, |p_i − p_c|, was computed to quantify the center pixel's deviation across the region. These quantities were supplied to the first layer of the neural network. The output of the first layer was then fused with the global statistics to derive a prediction on whether the center pixel was a BP or not. NOTE: We have trained 4 such models – one for each type of patch (Gr-centered, R-centered, B-centered and Gb-centered).

Table 2. Computations per 5 × 5 patch for the proposed networks for a = 16.

Steps                        Operations in NN I   Operations in NN II
Mean                         29                   89
Suspicious pixel detection   83                   37
Updated mean                 79                   37
LWB                          16                   0 (NA)
Absolute differences         48                   22
NN Opns. (a = 16)            417                  225
Total                        672                  336
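A minimal PyTorch sketch of NN II's two-layer structure is shown below; the number of same-channel neighbours, the number of global statistics, and a = 16 are assumptions used only for illustration.

```python
import torch
import torch.nn as nn

class NN2(nn.Module):
    """Sketch of NN II: same-channel deviations fused with global statistics."""
    def __init__(self, n_neighbours=8, n_stats=4, a=16):
        # n_neighbours is 8 for an R/B-centred 5x5 patch (assumed; larger for green-centred).
        super().__init__()
        self.layer1 = nn.Sequential(nn.Linear(n_neighbours, a), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Linear(a + n_stats, 1), nn.Sigmoid())

    def forward(self, deviations, global_stats):
        h = self.layer1(deviations)                               # per-direction deviations
        return self.layer2(torch.cat([h, global_stats], dim=1))   # sigmoid score; round at inference
```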
Fig. 4. Thumbnails of images in database (a) Test chart 1 (Standard 15 Star Chart). (b) Test chart 2 (Standard TE42 Chart). (c) Natural image 1. (d) Natural image 2. (e) Natural image 3.
The computations required per pixel (with a 5 × 5 patch around it) for the proposed networks are summarized in Table 2. The complexity of NN II is significantly lower compared to NN I because it does not use the inter-channel information, and hence needs only the pixels that are of the same color channel as the center pixel. The parameter 'a', denoting the number of nodes in the first layer, is the major bottleneck, as reducing it would have the most impact on the computations.
3 Results and Discussions

The performance of the two proposed networks was compared on a large data set with random bad pixel additions. The data set consists of 15.9 million examples of features extracted from 5 × 5 patches with BPs and an equal number of examples of patches without BPs at their centers. This consists of 4 equal parts of ~3.97 million samples of BPs and an equal number of samples of non-BPs for each model trained for each type of patch (i.e., the Gr/R/B/Gb-centered models). The training and validation sets were split into 10 batches of ~0.4 million test 5 × 5 patches of BPs and non-BPs, each batch with 2 epochs of training, followed by validation. The data for each batch was split into a 90% training set and a 10% validation set. For testing, a separate data set of 44 million cases (test results in Table 3) was created (Fig. 4). The network was trained in TensorFlow using the 'rmsprop' optimizer with the mean-squared error as the loss function. During training, penalties were also introduced for misclassification. Missing a BP was penalized proportionally to the magnitude by which the bad pixel differed from its ground-truth value. Misclassifications at edges, corners and high-frequency regions were given a higher penalty to force the network to learn features to classify BPs around edges. In a similar manner, false positives were also penalized heavily when they occurred around edges, corners or in high-frequency regions.
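The penalty scheme can be sketched as a weighted mean-squared-error loss, for example as below; the specific weight values and the edge mask are assumptions, not the authors' settings.

```python
import torch

def weighted_bp_loss(pred, target, deviation, edge_mask,
                     miss_weight=1.0, fp_weight=1.0, edge_boost=4.0):
    """Weighted MSE sketch of the training penalties described above.

    deviation: normalised magnitude by which a simulated BP differs from its ground truth.
    edge_mask: boolean tensor flagging patches centred on edges/corners/high-frequency regions.
    """
    w = torch.ones_like(target)
    w = torch.where(target > 0.5, 1.0 + miss_weight * deviation, w)        # penalise missed BPs
    w = torch.where(target < 0.5, torch.full_like(w, 1.0 + fp_weight), w)  # penalise false positives
    w = torch.where(edge_mask, w * edge_boost, w)                          # harder near edges
    return (w * (pred - target) ** 2).mean()
```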
Table 3. Performance of the reference approach [12] and the proposed networks NN I and NN II (a = 16 nodes). Misses and false positives are measured in percentage.

                        Misses (%)                  False positives (%)
Test image categories   NN I    NN II    [12]       NN I    NN II    [12]
Test Charts (16 M test pixels per image, of which 0.64 million are BPs)
Chart 1 (16 M pixels)   0.017   0.082    0.239      0.63    0.28     3.60
Chart 2 (16 M pixels)   0.033   0.018    0.261      0.36    0.11     3.71
Natural Images (4 M test pixels per image, of which 0.16 million are BPs)
Image 1 (4 M pixels)    0.029   0.040    0.318      0.46    0.06     3.79
Image 2 (4 M pixels)    0.016   0.035    0.272      0.50    0.12     3.71
Image 3 (4 M pixels)    0.056   0.012    0.367      0.43    0.02     3.82
Table 3 shows the performance of the two proposed networks, along with the test results of the reference approach used in [12]. It can be seen how well the networks learn to identify the true positives, with NN II missing at worst ~0.08%. On the other hand, at worst, the false positives were 0.63% for NN I, certainly better than the reference method [12] that we are comparing with.
Fig. 5. Overall misses and false positives of NN I & NN II as a function of number of nodes in the Dense sub-nets in the 1st layers of the two proposed networks.
Fig. 6. Illustration of bad pixels (in 300% zoom), a heat map of Misses and False Positives for NN II with a = 16. (a) The demosaiced version of a portion of test resolution chart with the simulated bad pixels. The red circles show which pixels are being missed by the network. (b) Heat map of the false positives. The color code of the pixels shows the pixel channels (R,G or B) that are falsely detected. (c) Heat map of the misses. The red circles show which pixels are being missed by the network. (Color figure online)
Upon varying the number of nodes in the 1st-layer sub-nets, Fig. 5 shows that NN I seems to have consistently fewer misses than NN II, by about 0.03%. More importantly, NN II seems to do better than NN I in being able to reduce the number of false positives as the number of nodes increases, whereas NN I seems to be biased towards a higher rate of false positives. From initial analyses, this seems to be due to the cross-color-channel information in NN I creating undesirable excitations in the network. Hence, even the lower miss rate of NN I is likely because of this undesirable cross-channel interaction. From our studies so far, it seems wiser to avoid the use of cross-color-channel data wherever possible. A heat map of the misses and false positives of a region in test chart 2 is shown in Fig. 6 for NN II with a = 16. It can be seen that the false positives and the misses occur only at edges. Points to note:
• As expected, the networks are easily able to learn to classify pixels in the smooth regions, and make errors only in the edge/corner cases. The false positive rate is higher for NN II with a = 16 (~1 in 1000 healthy pixels) than the miss rate (about 4 in 10,000 BPs). Even though this is better than having a higher miss rate, this imbalance could be improved in future work.
About misses:
• The examples of misses circled in red show that the neural networks still have problems identifying some significantly bad cases, which indicates scope for improvement, upon which work is currently ongoing.
• The misses are always cold pixels on the dark side of an edge as per our findings, which we could not completely mitigate despite carefully curating the dataset and network architectures.
About false positives:
• Since the false positives occur around edge regions, the use of a good-quality correction method can ensure that the misdetection of these pixels as bad pixels will not have any deteriorating effect after correction, especially around edge regions.
• Attempts have been made to further reduce the false positives by changing the training penalties for false positives, and also by using a larger number of nodes with varying dropout strategies and even more layers. Satisfactory results are yet to be attained for this.
4 Conclusions

For detecting bad pixels, an approach of using a lightweight neural network on mildly pre-processed pixel data from a small region of a Bayer CMOS image sensor was explored. Two different neural network architectures were tried for this: one architecture uses pixels of all colors (R, G, and B) around the pixel under test (referred to as NN I in the paper), while the other uses only the pixels of the same color as the test pixel in the region around it. Both methods performed much better in detection accuracy than the reference approach, with a best-case miss rate of less than 0.045%. The usage of cross-color-channel pixels in NN I resulted in a higher number of false positives, although it registered fewer misses compared to NN II.
References 1. Bayer, B.: Bayer Filter Mosaic. Patent US Patent 3,971,065 (1976) 2. Bigas, M., et al.: Review of CMOS image sensors. Microelectron. J. 37(5), 433–451 (2006) 3. Goma, S., Aleksic, M.: Method and apparatus for processing bad pixels. U.S. Patent No. 8,063,957 (2011) 4. Chapman, G.H., Thomas, R., Koren, I., Koren, Z.: Relating digital imager defect rates to pixel size, sensor area and ISO. In: IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, pp. 164–169 (2012) 5. Chapman, G.H., Leung, J., Thomas, R., Koren, Z., Koren, I.: Tradeoffs in imager design with respect to pixel defect rates. In: IEEE International Symposium on Defect and Fault Tolerance in VLSI at Nanotechnology Systems, pp. 231–239 (2010) 6. Leung, J., Chapman, G.H., Koren, I., Koren, Z.: Automatic detection of in-field defect growth in image sensors. In: IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems, pp. 305–313 (2008) 7. Chapman, G.H., Thomas, R., Koren, I., Koren, Z.: Improved image accuracy in hot pixel degraded digital cameras. In: IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, pp. 172–177 (2013) 8. Chang, E.: Kernel-size selection for defect pixel identification and correction. In: Proceedings of the SPIE 6502, Digital Photography III, 65020J (2007)
9. Wang, S., Yao, S., Faurie, O., Shi, Z.: Adaptive defect correction and noise suppression module in the CIS image processing system. In: Proceedings of SPIE International Symposium on Photoelectronic Detection and Imaging, vol. 7384, p. 73842V (2009) 10. Tanbakuchi, A., van der Sijde, A., Dillen, B., Theuwissen, A., deHaan, W.: Adaptive pixel defect correction. In: Proceedings of SPIE Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications IV, vol. 5017, pp. 360–370 (2003) 11. Schoberl, M., Seiler, J., Kasper, B., Foessel, S., Kaup, A.: Sparsity based defect pixel compensation for arbitrary camera raw images. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1257–1260 (2011) 12. El-Yamany, N.: Robust defect pixel detection and correction for bayer imaging systems. In: IS&T Symposium on Electronic Imaging, Digital Photography and Mobile Imaging XIII, no. 6, pp. 46–51 (2017)
Face Recognition Using Sf 3 CNN with Higher Feature Discrimination Nayaneesh Kumar Mishra(B)
and Satish Kumar Singh
Indian Institute of Information Technology, Allahabad, India [email protected]
Abstract. With the advent of 2-dimensional Convolution Neural Networks (2D CNNs), the face recognition accuracy has reached above 99%. However, face recognition is still a challenge in real world conditions. A video, instead of an image, as an input can be more useful to solve the challenges of face recognition in real world conditions. This is because a video provides more features than an image. However, 2D CNNs cannot take advantage of the temporal features present in the video. We therefore, propose a framework called Sf3 CN N for face recognition in videos. The Sf3 CN N framework uses 3-dimensional Residual Network (3D Resnet) and A-Softmax loss for face recognition in videos. The use of 3D ResNet helps to capture both spatial and temporal features into one compact feature map. However, the 3D CNN features must be highly discriminative for efficient face recognition. The use of A-Softmax loss helps to extract highly discriminative features from the video for face recognition. Sf3 CN N framework gives an increased accuracy of 99.10% on CVBL video database in comparison to the previous 97% on the same database using 3D ResNets.
Keywords: Face recognition in videos · 3D CNN · Biometric

1 Introduction
With the advent of deep learning, 2-dimensional Convolution Neural Networks (2D CNN) came to be used for recognition of faces [2,3,6,7,9,11]. The accuracy of 2D CNN architectures for face recognition has reached 99.99% [5,8] as reported on the LFW database. In spite of these near-hundred-percent accuracies, face recognition algorithms fail when applied in real-world conditions because of challenges such as varying pose, illumination, occlusion and resolution. In this paper, we propose to overcome these limitations of the face recognition algorithms by the use of 3-dimensional Convolution Neural Networks (3D CNN). This is because a 3D CNN processes a video as input and extracts both temporal and spatial information from the video into a compact feature. The compact feature generated by a 3D CNN contains more information than that generated using a 2D CNN, which allows for efficient and robust face recognition [10]. The features generated using 3D CNNs, however, are
not highly discriminative and hence affect the accuracy [10]. The discriminative ability of the loss functions has been increased by using the concept of angular margin. A-softmax [4] is the most basic loss in the series of the loss functions that implements the concept of angular margin in the most naive form. Hence we propose to use A-softmax for face recognition with 3D CNN. This paper, therefore, has the following contribution: We develop a deep learning framework called Sf3 CN N that uses 3D CNN and A-softmax loss for efficient and robust face recognition in videos.
2 Proposed Architecture
We propose a 3D CNN framework for face recognition in videos. The framework is called Sf3 CN N as shown in Fig. 1. The Sf3 CN N framework uses 3D ResNets for feature extraction from the input video followed by A-softmax loss. The use of 3D CNN helps to extract compact features from the video which contains both spatial and temporal information. A-softmax loss achieves high feature discrimination. The Sf3 CN N framework is named so because the ‘Sf ’ in the name represents the A-Softmax loss which is a variant of the Softmax loss and the term ‘3 CN N ’ represents the 3D CNN in the framework.
3 A-Softmax Loss
A-softmax loss [4] is developed from the Softmax loss to increase its discriminative ability. A-softmax introduces an angular margin in the softmax loss to maximize the inter-class distance and minimize the intra-class distance among the class features. For an input feature x_i with label y_i, the angle \theta_{j,i} is the angle between the input vector x_i and the weight vector W_j for any class j. When we normalize ||W_j|| = 1 for all j, zero the biases, and increase the angular margin between the features of the actual class and the rest of the features by multiplying the angle \theta_{y_i,i} with m, we get the equation for A-softmax as:

L_{ang} = \frac{1}{N}\sum_i -\log\left(\frac{e^{\|x_i\|\cos(m\theta_{y_i,i})}}{e^{\|x_i\|\cos(m\theta_{y_i,i})} + \sum_{j\neq y_i} e^{\|x_i\|\cos(\theta_{j,i})}}\right)    (1)

In Eq. 1, L_{ang} is called the A-softmax loss, and N is the number of training samples over which the mean of the loss is calculated; \theta_{y_i,i} has to be in the range [0, \pi/m]. From Eq. 1, it is clear that A-softmax loss [4] increases the angular margin between the feature vector of the actual class and the input vector by introducing the hyperparameter m. It thus tries to maximize the inter-class cosine distance and minimize the intra-class cosine distance among the features. The angular margin in A-softmax loss thus increases the discriminative ability of the loss function.
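A minimal PyTorch sketch of Eq. (1) is given below (not the authors' implementation). It assumes \theta_{y_i,i} stays within [0, \pi/m]; the full SphereFace formulation replaces cos(m\theta) with a monotonic surrogate outside that range. In the Sf3CNN framework, this loss would be applied to the feature vector produced by the 3D ResNet backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASoftmaxLoss(nn.Module):
    """Sketch of the angular-margin loss of Eq. (1), assuming theta_{y_i,i} in [0, pi/m]."""
    def __init__(self, in_features, num_classes, m=4):
        super().__init__()
        self.m = m
        # Class weight vectors W_j; biases are zeroed as in the paper.
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))

    def forward(self, x, labels):
        w = F.normalize(self.weight, dim=1)                 # ||W_j|| = 1
        x_norm = x.norm(dim=1, keepdim=True)                # ||x_i||
        cos_theta = F.linear(F.normalize(x, dim=1), w)      # cos(theta_{j,i})
        theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
        cos_m_theta = torch.cos(self.m * theta)             # cos(m * theta_{j,i})

        # Replace only the target-class logit with its margin version, then apply Eq. (1).
        one_hot = F.one_hot(labels, cos_theta.size(1)).bool()
        logits = torch.where(one_hot, cos_m_theta, cos_theta) * x_norm
        return F.cross_entropy(logits, labels)
```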
Fig. 1. Sf3 CN N is a very simple framework that contains 3D Residual Network followed by Angular Softmax Loss for better discrimination among the class features. Sf3 CN N framework is experimented by substituting the 3D Resnet Architecture block in the figure by variants of 3D-Resnet architectures. Sf3 CN N framework gave the best accuracy of 99.10% on CVBL video database by using Resnet-152 and Wideresnet-50. Finally, Resnet-152 is chosen as the 3D-Resnet architecture for the framework because of comparatively less number of parameters in comparison to Wideresnet-50.
4 Implementation
We performed the experiments on the CVBL (Computer Vision and Biometric Lab) database [10] using the Sf3CNN framework. The optimizer used is Adamax, with an initial learning rate of 0.002, betas equal to 0.9 and 0.999, eps equal to 1e–08, and zero weight decay. The activation function used is the Parametric Rectified Linear Unit (PReLU), because PReLU has been found to work better with the A-softmax loss [4]. We have compared our results with the work on face recognition by Mishra et al. [10], in which the experiments were performed on the CVBL database using 3D residual networks with the cross-entropy loss. Both in our experiment and in the experiment performed by Mishra et al. [10], a clip of temporal length 16 is input to 3D residual networks of different depths and genres. If the number of frames is less than 16, frames are repeated by looping over the existing frames. Horizontal flipping is applied to frames with a probability of 50%. A crop is taken from one of five locations (the 4 corners and the center) of the original frame, scaled by a factor selected from {1, 1/2^(1/4), 1/√2, 1/2^(3/4), 1/2}; the scaling keeps the aspect ratio equal to one and is applied with respect to the shorter side of the image. All scaled frames are resized to 112 × 112 for input to the architecture. Mean subtraction is performed by subtracting a per-channel mean value from each channel, keeping the pixel values zero-centered. The CVBL video database is divided in a 60:40 ratio for training and validation: out of the total 675 videos, 415 videos have been taken for training and the remaining 260 videos have been considered for validation.
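For illustration, the clip preparation described above could be sketched as follows; the exact scale set and corner-selection details follow the multi-scale cropping protocol of [1] and are assumptions here, not the authors' code.

```python
import random
from PIL import Image

SCALES = [1.0, 2 ** (-1 / 4), 2 ** (-1 / 2), 2 ** (-3 / 4), 0.5]  # assumed scale set

def sample_clip(frames, clip_len=16, size=112):
    """Prepare one 16-frame clip: loop, crop one of 5 positions, scale, resize, maybe flip."""
    while len(frames) < clip_len:          # loop short videos
        frames = frames + frames
    frames = frames[:clip_len]

    flip = random.random() < 0.5
    scale = random.choice(SCALES)
    corner = random.choice(["tl", "tr", "bl", "br", "c"])

    out = []
    for img in frames:
        w, h = img.size
        crop = int(min(w, h) * scale)      # square crop relative to the shorter side
        x0 = {"tl": 0, "bl": 0, "c": (w - crop) // 2}.get(corner, w - crop)
        y0 = {"tl": 0, "tr": 0, "c": (h - crop) // 2}.get(corner, h - crop)
        img = img.crop((x0, y0, x0 + crop, y0 + crop)).resize((size, size))
        if flip:
            img = img.transpose(Image.FLIP_LEFT_RIGHT)
        out.append(img)
    return out
```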
5 Results and Discussion
The results of the Sf3CNN framework of various depths and genres on the CVBL database are summarized in Table 1. The Sf3CNN framework uses the PReLU activation function and the Adamax optimizer. PReLU is used with A-softmax in our experiment because PReLU is recommended to be used with A-softmax [4]. For the purpose of comparison, Table 1 also shows the results obtained by Mishra et al. [10], in which face recognition is done using a 3D CNN with cross-entropy loss, the SGD (Stochastic Gradient Descent) optimizer and the ReLU (Rectified Linear Unit) activation function. The graph in Fig. 2 shows the comparison between the performance of the Sf3CNN framework and the architecture used by Mishra et al. [10] in terms of how the training loss and validation loss vary with the number of epochs.

Table 1. Comparison of validation accuracy of the Sf3CNN framework with state-of-the-art methods on the CVBL database for face recognition

Residual networks            Accuracy for 3D CNN +       Accuracy for Sf3CNN
                             cross-entropy loss (%)      framework (%)
ResNet-18                    96                          98.97
ResNet-34                    93.7                        98.72
ResNet-50                    96.2                        98.59
ResNet-101                   93.4                        98.72
ResNet-152                   49.1                        99.10
ResNeXt-101                  78.5                        98.59
Pre-activation ResNet-200    96.2                        98.46
Densenet-121                 55                          98.72
Densenet-201                 97                          98.33
WideResnet-50                90.2                        99.10
In Fig. 2, convergence of loss and accuracy in case of training and validation for the Sf3 CN N framework is shown for different variants of 3D-Resnets in Sf3 CN N framework. We have taken running average of the validation loss and validation accuracy to highlight the general path of convergence. We can easily observe that convergence of loss in case of training and validation for the Sf3 CN N framework is almost same as obtained by Mishra et al. [10]. The training and the corresponding validation graph almost follow the same path and converge to nearly the same loss value, indicating that low accuracy was never because of underfitting and high accuracy was never because of overfitting. Thus we can conclude that both Sf3 CN N and the architectures used by Mishra et al. [10] were stable throughout their training. From the Table 1 it can be observed that, Sf3 CN N framework achieves improvement in accuracy in face recognition for all depths and genres of 3D ResNets in comparison to the results obtained by Mishra et al. [10]. For Sf3 CN N framework, the difference between the highest and lowest accuracy is just 0.77% in comparison to 47.9% in case of Mishra et al. [10]. For Sf3 CN N framework, the accuracy varies between 98% and 99.10% and therefore does not significantly
vary with depth. In case of Mishra et al. [10] however, the accuracy varies from 49.1% to 97%. Thus we can easily infer that unlike in case of Mishra et al. [10], the Sf3 CN N framework is successful in increasing the discrimination between classes, so much so, that it mitigates the role of depth of the architecture on the accuracy of face recognition. Sf3 CN N framework successfully achieves the highest accuracy of 99.10% which is well above the highest accuracy of 97% as obtained by Mishra et al. [10]. Sf3 CN N framework achieved the highest accuracy with ResNet-152 and Wideresnet-50. It is interesting to note that ResNet-152 had achieved the lowest accuracy of just 49.1% and Wideresnet-50 had achieved an accuracy of just 90.2% in the work by Mishra et al. [10]. It is just because of the highly discriminative nature of the A-softmax loss function in Sf3 CN N framework that ResNet-152 could rise up to give the highest accuracy. Because of similar reasons, the accuracy of Densenet-201, which was 97% in the work by Mishra et al. [10], rose to 98.33% in case of Sf3 CN N framework.
Fig. 2. Comparison of Sf3 CN N Framework with 3D CNN + Cross Entropy Loss. (a) Comparison based on Training Loss (b) Comparison based on Validation Loss (c) Comparison based on Validation Accuracy
As can be seen in Table 1, both ResNet-152 and Wideresnet-50 achieved the highest accuracy of 99.10%. Since Wideresnet-50 contains far more parameters than Resnet-152 [1], it would be preferable to use Resnet-152 instead of Wideresnet-50 in the Sf3CNN framework. At the same time, it is also noticeable that Wideresnet-50 has comparatively more feature maps per convolutional layer, which makes it efficient for parallel computing using GPUs (Graphics Processing Units) [1]. Hence, if more than one GPU is available, Wideresnet-50 can be used in the Sf3CNN framework to take advantage of parallel computing on GPUs.
6 Conclusion
In this paper, we proposed a framework called Sf3CNN. Based on the experimental results on the CVBL video database, it can be concluded that the Sf3CNN framework is capable of robust face recognition even in real-world conditions. The high discriminative ability of the Sf3CNN framework increases the face recognition accuracy to 99.10%, which is better than the highest accuracy of 97% in the work by Mishra et al. [10].
References 1. Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and imagenet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555 (2018) 2. Kim, T., Cha, M., Kim, H., Lee, J.K., Kim, J.: Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192 (2017) 3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 4. Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: deep hypersphere embedding for face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220 (2017) 5. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015) 6. Sun, Y., Liang, D., Wang, X., Tang, X.: Deepid3: face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873 (2015) 7. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: closing the gap to humanlevel performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708 (2014) 8. Tewari, A., et al.: Mofa: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1274–1283 (2017) 9. Tran, L., Yin, X., Liu, X.: Disentangled representation learning gan for poseinvariant face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1415–1424 (2017) 10. Vidyarthi, S.K., Tiwari, R., Singh, S.K.: Size and mass prediction of almond kernels using machine learning image processing. BioRxiv, p. 736348 (2020) 11. Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: Normface: L2 hypersphere embedding for face verification. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 1041–1049 (2017)
Recognition of Online Handwritten Bangla and Devanagari Basic Characters: A Transfer Learning Approach Rajatsubhra Chakraborty1 , Soumyajit Saha1 , Ankan Bhattacharyya1 , Shibaprasad Sen4(B) , Ram Sarkar2 , and Kaushik Roy3 1
Future Institute of Engineering and Management, Kolkata, India 2 Jadavpur University, Kolkata, India 3 West Bengal State University, Barasat, India 4 University of Engineering & Management, Kolkata, India
Abstract. The transfer learning approach has eradicated the need for running the Convolutional Neural Network (CNN) models from scratch by using a pre-trained model with pre-set weights and biases for recognition of different complex patterns. Going by the recent trend, in this work, we have explored the transfer learning approach to recognize online handwritten Bangla and Devanagari basic characters. The transfer learning models considered here are VGG-16, ResNet50, and Inception-V3. To impose some external challenges to the models, we have augmented the training datasets by adding different complexities to the input data. We have also trained these three transfer learning models from scratch (i.e., not using pre-set weights of the pre-trained models) for the same recognition tasks. Besides, we have compared the outcomes of both the procedures (i.e., running from scratch and by using pre-trained models). Results obtained by the models are promising, thereby establishing its effectiveness in developing a comprehensive online handwriting recognition system. Keywords: Transfer learning · Character recognition learning · Online handwriting · Bangla · Devanagari
1 Introduction
Online handwriting recognition (OHR) deals with automatic conversion of the input written on a special digital surface, where a sensor records the pen-tip movements along with pen up/down status. Nowadays, the increasing popularity of smartphones and digital tablets enhances the need to work in this domain, as in such devices input can be supplied with fingers/stylus as freely as people write with pen and paper. In OHR, the information is obtained in terms of ordered pixel coordinates with respect to time stamps. Researchers have focused their
attention more on OHR and there have been myriad research works towards the recognition of handwritten characters in various scripts. Among those, works on English [1–3], Gurumukhi [4–7], Devanagari [8–12], Bangla [13–16] etc. reflect the recent trends in this domain. As a large section of people of the Indian sub-continent use Devanagari and Bangla scripts in their daily life, hence it demands more attention from the research fraternity so that a comprehensive OHR system can be developed which may be used for many official and other purposes. Despite higher recognition outcomes obtained so far by the researchers for the mentioned scripts using typical machine learning approaches, still there is enough room to improve the recognition accuracy. Recently, it has been observed that the research trend is now shifting from typical machine learning based approaches to deep learning based approaches. It has been observed from the literature that in most of the machine learning based recognition procedures, domain experts play an important role to analyze the applied handcrafted features for the reduction of the complexity, and also making patterns more visible to the different learning algorithms. The key advantage of deep learning based approaches over machine learning based approaches is the automatic extraction of useful features from the input data in an incremental fashion. Hence, the need for designing handcrafted features and also the intervention of domain experts have been removed [17]. As a result, many researchers are now exploring the domain of deep learning. This trend is also seen in the domain of handwritten character recognition. In the following sections, we have discussed a few convolutional neural network (CNN) based deep learning approaches used by different researchers to recognize online handwritten characters written in different scripts. We have also discussed the limitations of using CNN based techniques and then highlighted the importance of using transfer learning approach, a new research trend of deep learning, for the said recognition task. Turing Award Winners Yan LeCun, Yoshua Bengio, and Geoffery Hinton, have developed DNN (Deep Neural Network) model which is denser than ANN (Artificial Neural Network) [18] and is well capable to extract complex features at an abstract level. With the advent of DNN, many models have been created over time and the most popular model is the CNN. As a result, many researchers have started experimentation with CNN for the solution of various complex pattern recognition problems. Ciresan et al. have experimented to improve the overall classification accuracy of online handwritten characters by using CNN [19]. Sen et al. have shown the usefulness of the CNN model towards online handwritten Bangla character recognition [20]. Authors in [21] have shown the procedure to recognize English characters and digits by using multi-CNN. Baldominos et al. [22] have used evolutional CNN as an application of handwriting recognition.
Mehrotra et al. have introduced an offline strategy to recognize online handwritten Devanagari characters using CNN [27]. Therefore, it can be said that till now so many research attempts have been made using deep learning based approaches in handwriting recognition. In most of the cases, deep learning models perform better than machine learning based approaches in terms of recognition accuracy and thus gaining popularity. However, overheads are incurred in terms of input database size, computational time, power, and costs. It also imposes complexities related to adjusting weights and biases in layers of DNN [17]. Keeping the above facts in mind, many researchers are now trying to apply transfer learning approach where a pre-trained model, with predefined weights and biases, can be used to train models on different datasets. This in turn saves training time, minimizing implementation complexity associated with initializing weights and biases for layers in DNN models [27]. Chatterjee et al. [24] have obtained better recognition accuracy for isolated Bangla-Lekho characters in fewer epochs using ResNet50 through transfer learning approach in comparison to the experiment [25] performed on same dataset. In light of the said facts, in the present work, we have explored the transfer learning approach for the recognition of online handwritten Bangla and Devanagari basic characters. In this paper, we have also established the benefits of using pre-trained models as a starting point rather than training from scratch. The rest of the paper is organized as follows: Sect. 2 describes the detail of online handwritten Bangla and Devanagari character datasets used in the present experiment. The underlying view of the models used in the current experiment with a brief description of their architecture and flow of data within the concerned models have been presented in Sect. 3. The experimental outcomes along with comparison among multiple cases have been mentioned under Sect. 4. Conclusion of our work is reported in Sect. 5.
2 Datasets
The proposed work deals with the recognition of online handwritten Bangla and Devanagari basic characters using a transfer learning approach. The considered Bangla database is of size 10000 (50 character classes with 200 samples in each class) [16], and the Devanagari database consists of 1800 samples (36 classes with 50 samples in each class) [12]. As the mentioned models work on image information only, in the current work we have generated character images from the corresponding online handwritten information of Bangla and Devanagari characters. Figure 1 shows sample images of the online handwritten Bangla and Devanagari characters on which the current experimentation has been performed.
Fig. 1. Sample online handwritten (a) Bangla and (b) Devanagari basic characters
3 Proposed Methodology
The transfer learning model, an extension of the deep learning model, considers a pre-trained model for a certain problem as a starting point, which can later be used for another, related problem. In the current work, we have used three models, namely Inception-v3, ResNet50, and VGG-16, previously trained on the ImageNet dataset, which are then used to learn features from the Devanagari and Bangla online handwritten character datasets during training. In conjunction with the initial training, the transfer learning approach permits starting from the previously learned features and adjusting them in sync with the considered Devanagari and Bangla character datasets, instead of learning from scratch. The following subsections describe the basic architecture of Inception-V3, ResNet50, and VGG-16. Table 1 provides the finely tuned values of the hyper-parameters with which the training datasets were fed to all three models. The values of these hyper-parameters were set to achieve the best performance after exhaustive experimentation.
Inception-V3
Inception-V3 is a CNN model which marks its inception from GoogleNet that has been trained on datasets segregated into 1000 classes. In Inception-V1, the convolution layers of dimensions 1 × 1, 3 × 3, and 5 × 5 are amalgamated in Inception layers and the model performs as a multi-level feature extractor. The Inception-V3 has come up with some modifications over its predecessors, thereby improving the efficiency in image classification. It incorporates factorizing convolutions to circumvent overfitting and for reducing the number of parameters without compromising the network efficiency by the increment in depth. The block diagram of the model has been shown in Fig. 2. To achieve significant stability by avoiding overestimation and underestimation of the results, batch
534
R. Chakraborty et al.
normalization has been incorporated throughout the network in addition to using the binary cross-entropy for loss calculation and label smoothing is added to the loss formula. The output layer has 50 nodes for Bangla dataset, and 36 for Devanagari dataset.
Fig. 2. Schematic diagram of data flow in Inception-v3 pre-trained model
3.2 ResNet50
Residual Network, or ResNet, is a CNN based model used in many complex computer vision tasks. ResNet50 is a 50-layer deep CNN. Like Inception-V3, it has been trained on a dataset segregated into 1000 classes. When training a very deep CNN in the normal way, it has been observed that, for gradient-based learning, a small gradient does not allow the subsequent layers to update their weights to a notable precision, and hence training would cease at a certain point in most cases, leading to the vanishing gradient problem. To address this issue, the concept of a skip connection has been introduced in ResNet50, where the input is added to the output of a convolution stack, thus providing a shunt for the gradient to flow through instead of shrinking to a negligible value that would vanish and halt the training process. With an initial convolution and pooling of 7 × 7 and 3 × 3 kernel sizes and a stack of 5 convolution stages with skip connections, ResNet50 can take an input image with 3 channels whose length and width are multiples of 32. The block diagram of the model has been shown in Fig. 3. The numbers of output neurons for the Bangla and Devanagari character databases have been set to 50 and 36 respectively.
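The skip-connection idea can be sketched as follows. This is an illustrative Keras block under assumed filter sizes, not the exact ResNet50 bottleneck design.

# Illustrative sketch of a ResNet-style skip connection: the input is added
# back to the output of a small convolution stack, giving the gradient a
# direct path around the stack.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])        # the skip connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)   # initial 7x7 conv
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)      # 3x3 pooling
x = residual_block(x, 64)
model = tf.keras.Model(inputs, x)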
Fig. 3. Schematic diagram of data flow in ResNet50 pre-trained model
3.3 VGG-16
OxfordNet, or VGG-16, is a CNN based model proposed by K. Simonyan et al. in [23]. Trained on about 14 million images segregated into 1000 classes like the models discussed previously, VGG-16 was submitted to ILSVRC-2014. It improves over AlexNet by replacing the large kernels of the first and second convolution layers (11 × 11 and 5 × 5) with sequences of 3 × 3 filters, as shown in Fig. 4. VGG-16 uses 5 stacks to compute features from an image. Among these 5 stacks, the first and second stacks contain two convolution layers each, followed by a max-pooling layer with a stride of 2. The convolution layers in the first and second stacks use 64 and 128 feature maps respectively. The third, fourth, and fifth stacks contain three convolution layers each, followed by a max-pooling layer with stride 2. Each convolution layer in the third, fourth, and fifth stacks uses 256, 512, and 512 feature maps respectively. Hence, a total of 13 layers are present across the five stacks. The 14th and 15th layers are fully connected hidden layers of 4096 units, followed by a softmax output layer (16th layer) of 36 units for Devanagari and 50 units for Bangla.

Table 1. Details of the hyper-parameter values, image size, and image type applied to the transfer learning models in the present work
Attribute           Value
Batch size          10
Epochs              100
Steps per epoch     126 (Devanagari), 1000 (Bangla)
Verbose             1
Loss                Categorical cross-entropy, binary cross-entropy
Image size          28 × 28
Image type          Bangla: TIF, Devanagari: JPG
Fig. 4. Schematic diagram of data flow in VGG-16 pre-trained model
In our proposed work, we have resized the input images to 224 × 224 and passed them to the different transfer learning models. We have connected a final dense layer of 36 classes for the Devanagari and 50 classes for the Bangla character databases respectively, with softmax and sigmoid activation functions.
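A minimal sketch of this setup is shown below, assuming TensorFlow/Keras, a frozen ImageNet-pre-trained VGG-16 base, and the Bangla class count. The optimizer choice (Adam) and the conversion of grayscale character images to 3 channels are assumptions, while the learning rate 0.0001 and batch size 10 follow the values reported in this paper.

# Hedged sketch of the transfer-learning head described above.
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 50                       # 36 for the Devanagari dataset
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                 # keep the pre-trained lower layers frozen

model = tf.keras.Sequential([
    base,                              # grayscale images assumed stacked to 3 channels
    layers.Flatten(),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=10, epochs=40, ...)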
4 Results and Discussion
Random initialization of the weights of different layers is a time-consuming process that requires considerable experimentation, and even then it may not generate the desired results. To overcome these complexities of CNN based models, in the present work we have used networks with pre-trained weights. As stated in the previous section, we have considered three pre-trained transfer learning models, namely ResNet50, Inception-v3, and VGG-16, for the recognition of online handwritten Bangla and Devanagari characters. The considered datasets are split into training and test sets in a 7:3 ratio. The achieved recognition accuracies for both the Bangla and Devanagari databases are shown in Table 2. For the Devanagari database, VGG-16 and ResNet50 exhibit their best accuracies in 40 epochs, achieving 99.94% and 99.93% respectively, while Inception-v3 reaches its best performance of 99.35% in 35 epochs. For the Bangla database, VGG-16 also yields the best recognition accuracy, 99.99%, in just 30 epochs, whereas ResNet50 and Inception-v3 show their best performances of 99.96% and 99.81% respectively in 40 epochs. The best recognition accuracies have been observed for both databases by
VGG-16 when the learning rate and batch size are set to 0.0001 and 10 respectively and the softmax activation function is applied.

Table 2. Recognition accuracies (in %) observed on Bangla and Devanagari character test datasets without data augmentation

Script       ResNet50 (# epochs)                 VGG-16 (# epochs)                   Inception-V3 (# epochs)
             20     25     30     35     40      20     25     30     35     40      20     25     30     35     40
Devanagari   99.67  99.80  99.86  99.88  99.93   99.81  99.83  99.87  99.91  99.94   99.23  99.25  99.28  99.35  99.35
Bangla       99.94  99.94  99.95  99.95  99.96   99.97  99.98  99.99  99.99  99.99   99.51  99.63  99.74  99.78  99.81
Table 3. Recognition accuracies (in %) on Bangla and Devanagari test sets when models are trained on augmented datasets

Script       ResNet50 (# epochs)                 VGG-16 (# epochs)                   Inception-V3 (# epochs)
             20     25     30     35     40      20     25     30     35     40      20     25     30     35     40
Devanagari   99.41  99.43  99.45  99.47  99.47   99.63  99.63  99.65  99.67  99.68   98.93  98.97  99.01  99.03  99.05
Bangla       99.82  99.85  99.85  99.89  99.92   99.94  99.94  99.95  99.96  99.97   99.13  99.15  99.17  99.20  99.21
For exhaustive testing, we have also trained the models on augmented versions of the Devanagari and Bangla training sets. The augmentation is performed by rotating the character images within a range of [−60°, 60°]. The trained models are then used to measure performance on the original Devanagari and Bangla test sets. Table 3 reports the detailed recognition accuracies observed at different epochs for the mentioned datasets with the VGG-16, ResNet50, and Inception-v3 models. The table shows that the VGG-16, ResNet50, and Inception-v3 models produce their highest recognition accuracies of 99.68%, 99.47%, and 99.05% in 40, 35, and 40 epochs respectively for the Devanagari character database. Similarly, VGG-16, ResNet50, and Inception-v3 produce their best recognition performances of 99.97%, 99.92%, and 99.21% respectively in 40 epochs for the Bangla database. From Tables 2 and 3 it can be said that VGG-16 outperforms the other two models for both character databases even when the training set is augmented. Though the overall recognition performance is slightly lower when the dataset is augmented, the models still perform well even when differently rotated images are included in the training set. To justify the use of the transfer learning approach, we have also trained all three models from scratch. The observed outcomes with respect to the number of epochs are highlighted in Table 4. After analyzing both tables (Tables 2 and 4), it can be said that ResNet50, VGG-16, and Inception-v3 produce 99.96% (96%), 99.99% (90%), and 99.81% (99.18%) recognition accuracy respectively for Bangla in just 40 (500) epochs when using the pre-trained model (trained
from scratch). Hence, it can be clearly said that training from scratch not only yields lower accuracy but also requires almost ten times more model-building time than using pre-trained models for the Bangla script. For the Devanagari script, ResNet50, VGG-16, and Inception-v3 produce 99.93% (96.39%), 99.94% (80%), and 99.35% (96.94%) recognition accuracy respectively in just 40 (500) epochs when using the pre-trained model (trained from scratch). However, when running from scratch, the VGG-16 and Inception-v3 models stopped early, at 96 and 274 epochs respectively, for the Devanagari script. Looking into Table 4, we can conclude that pre-trained models produce higher recognition accuracies in fewer epochs for both the Bangla and Devanagari scripts. Thereby, it can be said that training the model using pre-set weights coming from pre-trained networks not only reduces the time to build the models but also helps to produce better outcomes than running the model from scratch. This fact is especially important in developing countries, where computing resources are genuinely limited. Transfer learning can be a very good option there: running on an ordinary computer with 8 GB RAM, we achieved these results by using the transfer learning models to train the higher layers only while leaving the lower layers frozen. Learning from scratch not only requires a more powerful machine, but the models also crashed many times, which increases the time required to obtain the expected outcome. Table 4 also supports the fact that when the database size is small, like our considered Devanagari database, training from scratch is not a good option as there are not enough data for learning [26]. Table 5 highlights the outcomes observed in the present work and in some past works targeted at the recognition of online handwritten Devanagari and Bangla characters. The proposed work has been performed on the same Devanagari and Bangla character databases as those used by Santosh et al. [12] and Sen et al. [16,20]. Looking at those entries of the table and the corresponding references [12,16,20], it can be said that the proposed approach clearly outperforms the previously used machine learning and CNN based approaches for both the Devanagari and Bangla databases. The other entries of Table 5 can be used by new researchers working in this domain to get a quick review of a few of the past approaches.

Table 4. Number of epochs and achieved accuracy for both cases (training the models using pre-set weights and training from scratch) by the three models used for Devanagari and Bangla character recognition
Script       Model          Pre-set weights           Trained from scratch
                            Epochs   Accuracy (%)     Epochs   Accuracy (%)
Devanagari   ResNet50       40       99.93            500      96.39
             VGG-16         40       99.94            96       80
             Inception-v3   40       99.35            274      96.94
Bangla       ResNet50       40       99.96            500      96
             VGG-16         40       99.99            500      90
             Inception-v3   40       99.81            500      99.18
Table 5. Comparative assessment of the present work with some past works for the recognition of online handwritten Bangla and Devanagari basic characters

Character database  Reference              Database size  Methodology used                                      Achieved accuracy (in %)
Devanagari          Santosh et al. [12]    1800           Stroke based approach                                 95
                    Connell et al. [9]     1600           Combination of online and offline features            86.5
                    Mehrotra et al. [27]   41359          CNN based method                                      98.19
                    Kubatur et al. [10]    2760           DCT features                                          97.2
                    Proposed method        1800           Transfer learning on non-augmented training dataset   99.94
                                                          Transfer learning on augmented training dataset       99.68
Bangla              Roy et al. [14]        12500          Structural and point based features                   91.13
                    Sen et al. [16]        10000          Combination of structural and topological features    99.48
                    Sen et al. [20]        10000          CNN based                                             99.4
                    Proposed method        10000          Transfer learning on non-augmented training dataset   99.99
                                                          Transfer learning on augmented training dataset       99.97
5 Conclusion
In the present work, a transfer learning based technique has been applied for the recognition of online handwritten Devanagari and Bangla basic characters. We have considered three well-known models for the current recognition problem, namely ResNet50, VGG-16, and Inception-v3, which are initially trained on the ImageNet database. We have utilized the pre-trained weights of these networks to set up the models, and have also trained the models from scratch, both to solve our problem and to establish that using a pre-trained model is better than training from scratch in terms of model-building time and recognition accuracy. To test the efficiency of the proposed approach, we have also augmented the training datasets to add complexity to the input. The achieved outcomes are very promising and demonstrate that the performance of the proposed models is not affected by the distortion caused by the data augmentation techniques. In future, we plan to use the proposed approach for the recognition of online handwritten characters written in other regional scripts.

Acknowledgment. One of the authors would like to thank SERB, DST for financial support in the form of a project.
References

1. Bahlmann, C., Burkhardt, H.: The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping. IEEE Trans. Pattern Anal. Mach. Intell. 26(3), 299–310 (2004)
2. Farha, M., Srinivasa, G., Ashwini, A.J., Hemant, K.: Online handwritten character recognition. Int. J. Comput. Sci. 11(5), 30–36 (2013)
3. Tappert, C.C., Suen, C.Y., Wakahara, T.: The state of online handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(8), 787–807 (1990)
4. Bawa, R.K., Rani, R.: A preprocessing technique for recognition of online handwritten Gurmukhi numerals. In: Mantri, A., Nandi, S., Kumar, G., Kumar, S. (eds.) HPAGC 2011. CCIS, vol. 169, pp. 275–281. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22577-2_37
5. Gupta, M., Gupta, N., Agrawal, R.: Recognition of online handwritten Gurmukhi strokes using support vector machine. In: Bansal, J., Singh, P., Deep, K., Pant, M., Nagar, A. (eds.) Proceedings of Seventh International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012). Advances in Intelligent Systems and Computing, vol. 201. Springer, India (2012). https://doi.org/10.1007/978-81-322-1038-2_42
6. Sachan, M.K., Lehal, G.S., Jain, V.K.: A novel method to segment online Gurmukhi script. In: Singh, C., Singh Lehal, G., Sengupta, J., Sharma, D.V., Goyal, V. (eds.) ICISIL 2011. CCIS, vol. 139, pp. 1–8. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19403-0_1
7. Sharma, A., Kumar, R., Sharma, R.K.: HMM-based online handwritten Gurmukhi character recognition. Int. J. Mach. Graph. Vis. 19(4), 439–449 (2010)
8. Swethalakshmi, H., Jayaraman, A., Chakravarthy, V.S., Shekhar, C.C.: Online handwritten character recognition of Devanagari and Telugu characters using support vector machines. In: Proceedings of the 10th International Workshop on Frontiers in Handwriting Recognition, pp. 367–372 (2006)
9. Connell, S.D., Sinha, R.M.K., Jain, A.K.: Recognition of unconstrained online Devanagari characters. In: Proceedings of the 15th International Conference on Pattern Recognition, pp. 368–371. IEEE (2000)
10. Kubatur, S., Sid-Ahmed, M., Ahmadi, M.: A neural network approach to online Devanagari handwritten character recognition. In: Proceedings of the International Conference on High Performance Computing and Simulation, pp. 209–214 (2012)
11. Kumar, A., Bhattacharya, S.: Online Devanagari isolated character recognition for the iPhone using hidden Markov models. In: Proceedings of the International Conference on Students' Technology Symposium, pp. 300–304 (2010)
12. Santosh, K.C., Nattee, C., Lamiroy, B.: Relative positioning of stroke-based clustering: a new approach to online handwritten Devanagari character recognition. Int. J. Image Graph. 12(2), 25 (2012)
13. Parui, S.K., Guin, K., Bhattacharya, U., Chaudhuri, B.B.: Online handwritten Bangla character recognition using HMM. In: International Conference on Pattern Recognition, pp. 1–4. IEEE (2008)
14. Roy, K.: Stroke-database design for online handwriting recognition in Bangla. Int. J. Mod. Eng. Res. 2(4), 2534–2540 (2012)
15. Bhattacharya, U., Gupta, B.K., Parui, S.K.: Direction code based features for recognition of online handwritten characters of Bangla. In: International Conference on Document Analysis and Recognition, pp. 58–62 (2007)
16. Sen, S., Bhattacharyya, A., Singh, P.K., Sarkar, R., Roy, K., Doermann, D.: Application of structural and topological features to recognize online handwritten Bangla characters. ACM Trans. Asian Low-Resource Lang. Inf. Process. 17(3), 1–19 (2018)
17. https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063. Accessed 3 Nov 2020
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2012)
19. Cireşan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Convolutional neural network committees for handwritten character classification. In: Proceedings of the International Conference on Document Analysis and Recognition, pp. 1135–1139 (2011)
20. Sen, S., Shaoo, D., Paul, S., Sarkar, R., Roy, K.: Online handwritten Bangla character recognition using CNN: a deep learning approach. In: Bhateja, V., Coello Coello, C.A., Satapathy, S.C., Pattnaik, P.K. (eds.) Intelligent Engineering Informatics. AISC, vol. 695, pp. 413–420. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-7566-7_40
21. Pham, D.V.: Online handwriting recognition using multi convolution neural networks. In: Bui, L.T., Ong, Y.S., Hoai, N.X., Ishibuchi, H., Suganthan, P.N. (eds.) SEAL 2012. LNCS, vol. 7673, pp. 310–319. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34859-4_31
22. Baldominos, A., Saez, Y., Isasi, P.: Evolutionary convolutional neural networks: an application to handwriting recognition. Int. J. Neurocomput. 283, 38–52 (2018)
23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations, pp. 1–14 (2015)
24. Chatterjee, S., Dutta, R., Ganguly, D., Chatterjee, K., Roy, S.: Bengali handwritten character classification using transfer learning on deep convolutional neural network. arXiv preprint arXiv:1902.11133 (2019)
25. Alif, M.A.R., Ahmed, S., Hasan, M.A.: Isolated Bangla handwritten character recognition with convolutional neural network. In: 20th International Conference of Computer and Information Technology, pp. 1–6 (2017)
26. Marcelino, P.: Transfer learning from pre-trained models. Towards Data Sci. (2018)
27. Mehrotra, K., Jetley, S., Deshmukh, A., Belhe, S.: Unconstrained handwritten Devanagari character recognition using convolutional neural networks. In: Proceedings of the 4th International Workshop on Multilingual OCR, pp. 1–5 (2013)
Image Solution of Stochastic Differential Equation of Diffusion Type Driven by Brownian Motion

Vikas Kumar Pandey, Himanshu Agarwal(B), and Amrish Kumar Aggarwal

Department of Mathematics, Jaypee Institute of Information Technology, Noida-62, Uttar Pradesh, India
{himanshu.agarwal,amrish.aggarwal}@jiit.ac.in
Abstract. Consider the stochastic differential equation of diffusion type driven by Brownian motion, dX(t, ω) = μX(t, ω)dt + σX(t, ω)dB(t, ω), where B(t, ω) = lim_{n→∞} B_n(t, ω) is a Brownian motion, n is a positive integer, t is the time variable, ω is the state variable, and μ and σ are constants. The solution X(t, ω) is represented by images. The solution contains a term of Brownian motion; therefore, the image of the solution needs the image of Brownian motion. We have obtained the images of Brownian motion and of the solution X(t, ω) for different combinations of the parameters (μ, σ, n and p; note that p controls the degree of randomness in Brownian motion, which is maximum for p = 0.5). The key observations from the image analysis are: 1. less randomness is visualized for p values away from 0.5; 2. the color variation in images for n = 10,000 is more than in images for n = 1,000,000; and 3. the randomness in the solution also depends on μ and σ, with more randomness visualized as μ − (1/2)σ² moves away from 0. The observations are consistent with the mathematical analysis of the solution X(t, ω).

Keywords: Stochastic differential equation · Itô integral · Brownian motion · Diffusion differential equation

1 Introduction
Visualization is a prominent process for communicating information using images, diagrams, or animations. Visualization provides quick insights to find causality, form hypotheses, and find visual patterns [9]. One recent finding enabled by visualization is the observation of an error in the two-dimensional heat equation solution provided by Henner et al. [7]; this observation was made by Pandey et al. [1]. Good visualization techniques allow users to extract clear and concise messages from complicated data sets [3].
The stochastic integral introduced by Itô [17] in the mid-1940s is the key component in the development of the subject of stochastic differential equations. Up to the early 1960s, most of the work on stochastic differential equations was confined to stochastic ordinary differential equations (SODEs). In the early 1960s, Baklan [16] studied stochastic partial differential equations (SPDEs) as stochastic evolution equations in Hilbert space. This study gave significant insights for developing the general framework of SPDEs.

The difference between stochastic differential equations (SDEs) and their deterministic counterparts is that SDEs contain noise term(s). White noise (the derivative of Brownian motion) and Lévy-type noise are the two main noises used in SDEs. The path of the noise is irregular and nowhere differentiable; therefore, SDEs are treated using both deterministic calculus and stochastic calculus (infinitesimal calculus on non-differentiable functions). Li et al. [2] proposed a method based on adjoint sensitivity to compute the gradients of the solution of stochastic differential equations, and used various visualization techniques to explain the merits of the adjoint sensitivity method of gradient computation. Kafash et al. [4] introduced some fundamental concepts of stochastic processes and simulated them with the R software, showing the simulation results using graphs.

Stochastic differential equations have diverse applications in physics, biology, engineering, and finance. Some important examples are nonlinear filtering, turbulent transport, the random Schrödinger equation, the stochastic Sine–Gordon equation, the stochastic Burgers equation, stochastic population growth models, stochastic PDEs in finance, diffusion processes, birth–death processes, age-dependent (Bellman–Harris) branching processes, the stochastic version of the Lotka–Volterra model for competition of species, population dynamics, protein kinetics, genetics, experimental psychology, neuronal activity, radio-astronomy, helicopter rotors, satellite orbit stability, biological waste treatment, hydrology, indoor air quality, seismology and structural mechanics, fatigue cracking, optical bistability, nematic liquid crystals, blood clotting dynamics, cellular energetics, Josephson junctions, communications, and stochastic annealing [5,8,14]. Despite the diverse applications of SDEs, this subject has not received much attention from researchers. The main reason can be the lack of visualization of the solutions of SDEs, without which it is very difficult for non-mathematicians to understand them. To fill this gap, we have made efforts to analyze the image solution of an SDE of Itô type.

An SDE [5,6,11,12] of Itô type in R^d is defined as follows:

\[
dx(t,\omega) = b(x(t,\omega), t)\,dt + c(x(t,\omega), t)\,dB(t,\omega), \qquad x(0,\omega) = \xi,
\]

where b : R^d × [0, T] → R^d and c : R^d × [0, T] → R^{d×m} are vector- and matrix-valued functions, B(t, ω) is a Brownian motion in R^m, and ξ is an F_0-measurable R^d-valued random variable. By convention, this differential equation is interpreted as the following stochastic integral equation
\[
x(t,\omega) = \xi + \int_0^t b(x(s,\omega), s)\,ds + \int_0^t c(x(s,\omega), s)\,dB(s,\omega), \qquad 0 \le t \le T. \tag{1}
\]
In this paper, a special case is investigated with d = 1, m = 1, b = μX(t, ω), c = σX(t, ω), x(t, ω) = X(t, ω), x(0, ω) = X(0, ω) = 1, where μ and σ are constants. For the investigated case, (1) can be represented as an Itô process. The integrals involved in (1) are treated by using Itô's formula for Brownian motion and Itô's formula for the Itô process. The rest of the paper is organized as follows. Image analysis of Brownian motion is discussed in Sect. 2. Other key components, such as the Itô integral, Itô's formula for Brownian motion, and Itô's formula for the Itô process, are discussed in Sect. 3. The mathematical and image solutions of the investigated case are discussed in Sect. 4, followed by conclusions in Sect. 5.
2 Brownian Motion
In this section we have discussed the formula of Brownian motion and its images. The key component in the formula of Brownian motion is the random walk.

2.1 Random Walk
A random walk M(K, ω) [10] is a stochastic process defined as follows:

\[
M(K,\omega) = \sum_{j=1}^{K} X_j(\omega)
\]

where ω is a state variable, P is a probability, and X_j(ω) ∈ {1, −1} are random variables such that

\[
P(X_j(\omega) = 1) = p, \qquad P(X_j(\omega) = -1) = 1 - p,
\]

with X_i(ω), X_j(ω) independent for i ≠ j. If p = 1/2 then M(K, ω) is a symmetric random walk. The number of possible states for M(K, ω) is 2^K.
2.2 Brownian Motion
Brownian motion B(t, ω) [13,15] is defined as follows:

\[
B(t,\omega) = \lim_{n\to\infty} B_n(t,\omega) = \lim_{n\to\infty} \frac{1}{\sqrt{n}} M(K,\omega) \tag{2}
\]

where t = K/n and n is a positive integer. The number of possible states for B(t, ω) is 2^K. If t is sufficiently greater than zero then K = tn → ∞ and therefore 2^K → ∞.
Brownian motion satisfies the following important properties:
1. B(0, ω) = 0, for all ω,
2. B(t, ω) is a continuous function of t, for all ω,
3. B(t, ω) has independent, normally distributed increments.
Fig. 1. Image of Brownian motion with different n and p: (a) n = 10000 and p = 0.5; (b) n = 1000000 and p = 0.5; (c) n = 10000 and p = 0.6; (d) n = 1000000 and p = 0.6; (e) n = 10000 and p = 0.999; (f) n = 1000000 and p = 0.999.
4. The quadratic variation [10] of B(t, ω) on an interval [0, T] is

\[
\lim_{\pi \to 0} \sum_{K=0}^{m-1} |B(t_{K+1},\omega) - B(t_K,\omega)|^2 = T
\]

where π = max_{K=0,...,m−1}(t_{K+1} − t_K) and 0 < t_1 < t_2 < ... < t_m = T. B(t, ω) is nowhere differentiable as its quadratic variation is non-zero.

2.3 Visualization of Brownian Motion
We have formed several images for the visualization of Brownian motion with respect to the time variable and the state variable. Here t varies as t = 0.0001 : 0.005 : 1 and 100 different states are taken (note that t = K/n). The parameters considered in the visualization are n and p, and images are formed for different values of both. Sample images for (n, p) = {(10000, 0.5), (10000, 0.6), (10000, 0.999), (1000000, 0.5), (1000000, 0.6), (1000000, 0.999)} are shown in Fig. 1. In Fig. 1 the horizontal axis represents the time t and the vertical axis represents the state ω. The size of the images is 100 × 200 pixels. Important observations are as follows:
1. The color in the images near t = 0 is light blue for all ω and p. Therefore, Brownian motion tends to zero as t → 0.
2. When p = 0.5, the colors in the images span from blue to yellow with increasing t. Therefore, the randomness in Brownian motion increases with time. For large values of t, most of the color along each vertical line is still light blue.
3. When p = 0.6, the color in the images converges from blue to yellow. The color of each vertical line is almost constant for significantly large values of t; how large depends on n.
4. When p = 0.999, the color of each vertical line is visually constant. Therefore, the randomness in Brownian motion is not significant.
Based on the above observations, we conclude that:
1. Near t = 0, randomness in Brownian motion is not visible.
2. Randomness in Brownian motion decreases as p moves away from 0.5.
3. For p = 0.5, randomness in Brownian motion is clearly visible for significantly large values of t.
These observations and conclusions are consistent with the classical theory of Brownian motion.
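A minimal sketch of how such an image can be generated is given below; it is an illustration under assumed details (NumPy/Matplotlib, default color map), not the authors' code. For each of the 100 states it draws +/-1 steps with P(+1) = p, forms B_n(t) = M(K)/sqrt(n) at K = t*n, and displays the resulting 100 x len(t) array.

import numpy as np
import matplotlib.pyplot as plt

def brownian_image(n=10_000, p=0.5, n_states=100, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(0.0001, 1, 0.005)          # time grid used in the paper
    K = (t * n).astype(int)                  # number of random-walk steps at each t
    steps = rng.choice([1, -1], size=(n_states, K[-1]), p=[p, 1 - p])
    M = np.cumsum(steps, axis=1)             # random walk M(K, omega)
    B = M[:, K - 1] / np.sqrt(n)             # scaled walk B_n(t, omega)
    return t, B

t, B = brownian_image(n=10_000, p=0.5)
plt.imshow(B, aspect="auto", extent=[t[0], t[-1], 100, 1])
plt.xlabel("t"); plt.ylabel("state"); plt.colorbar()
plt.savefig("brownian_p05.png")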
3 Itô Calculus

In this section we have discussed the Itô integral, Itô's formula for Brownian motion, the Itô process, and Itô's formula for the Itô process.
3.1 Itô Integral
The Itô integral [8] \(\int_0^T X(t,\omega)\,dB(t,\omega)\) is defined as follows:

\[
\int_0^T X(t,\omega)\,dB(t,\omega) = \lim_{m\to\infty} \sum_{K=0}^{m-1} X\!\left(\frac{KT}{m},\omega\right) \left[ B\!\left(\frac{(K+1)T}{m},\omega\right) - B\!\left(\frac{KT}{m},\omega\right) \right] \tag{3}
\]

where X(t, ω) is any adapted stochastic process. We have discussed two examples using the Itô integral, given as follows.

Example 1: \(\int_0^T B(u)\,dB(u) = \frac{1}{2}B^2(T) - \frac{1}{2}T\).

The extra term \(-\frac{1}{2}T\) shows that the Itô stochastic integral does not behave like ordinary integrals.

Proof: The Itô integral is as follows

\[
\int_0^T B(u,\omega)\,dB(u,\omega) = \lim_{m\to\infty} \sum_{K=0}^{m-1} B\!\left(\frac{KT}{m},\omega\right)\left[ B\!\left(\frac{(K+1)T}{m},\omega\right) - B\!\left(\frac{KT}{m},\omega\right) \right]
\]

(cf. the Riemann–Stieltjes integral [10]) and we can easily conclude that

\[
\sum_{K=0}^{m-1} B\!\left(\frac{KT}{m},\omega\right)\left[ B\!\left(\frac{(K+1)T}{m},\omega\right) - B\!\left(\frac{KT}{m},\omega\right) \right]
= \frac{1}{2}B^2(T) - \frac{1}{2}\sum_{K=0}^{m-1}\left[ B\!\left(\frac{(K+1)T}{m},\omega\right) - B\!\left(\frac{KT}{m},\omega\right) \right]^2
\]

Let m → ∞ and use the definition of quadratic variation [8] to get

\[
\int_0^T B(u,\omega)\,dB(u,\omega) = \frac{1}{2}B^2(T) - \frac{1}{2}T.
\]

Example 2: \(\int_0^T dB(u,\omega) = B(T,\omega)\).

Proof: The Itô integral is as follows

\[
\int_0^T dB(u,\omega) = \lim_{m\to\infty} \sum_{K=0}^{m-1} \left[ B\!\left(\frac{(K+1)T}{m},\omega\right) - B\!\left(\frac{KT}{m},\omega\right) \right]
\]

so that \(\int_0^T dB(u,\omega) = B(T,\omega) - B(0,\omega)\). In view of the 1st property of Brownian motion,

\[
\int_0^T dB(u,\omega) = B(T,\omega). \tag{4}
\]
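A quick numerical check of Example 1 can be instructive. The following sketch (an illustration, not part of the paper) approximates the Itô sum with left-endpoint evaluation and compares it with (1/2)B(T)^2 - (1/2)T; the step count and number of paths are assumptions.

import numpy as np

rng = np.random.default_rng(1)
T, m, n_paths = 1.0, 10_000, 2_000
dt = T / m
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, m))     # Brownian increments
B = np.cumsum(dB, axis=1)
B_left = np.hstack([np.zeros((n_paths, 1)), B[:, :-1]])  # B(t_K), left endpoint
ito_sum = np.sum(B_left * dB, axis=1)                    # sum B(t_K)[B(t_{K+1}) - B(t_K)]
target = 0.5 * B[:, -1] ** 2 - 0.5 * T
print(np.mean(np.abs(ito_sum - target)))                 # small, and shrinks as m grows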
3.2 Itô Process
An Itô process Y(T, ω) [8,13] is defined to be an adapted stochastic process that can be expressed as the sum of an integral with respect to Brownian motion and an integral with respect to time, such as

\[
Y(T,\omega) = Y(0,\omega) + \int_0^T \alpha(t,\omega)Y(t,\omega)\,dt + \int_0^T \beta(t,\omega)Y(t,\omega)\,dB(t,\omega) \tag{5}
\]

Its differential notation is as follows

\[
dY(t,\omega) = \alpha(t,\omega)Y(t,\omega)\,dt + \beta(t,\omega)Y(t,\omega)\,dB(t,\omega) \tag{6}
\]

Note that:
1. All the terms in (5) are well defined.
2. (5) and (6) are equivalent.
3. (6) is defined in view of (5).
4. The operator d represents differential change with respect to t.
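The differential notation (6) also suggests a direct simulation scheme. The sketch below is an illustration, not part of the paper: it discretizes (6) with the standard Euler–Maruyama method, assuming constant α and β for simplicity (the paper allows them to depend on (t, ω)).

import numpy as np

def euler_maruyama(y0=1.0, alpha=0.05, beta=0.2, T=1.0, m=1_000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / m
    y = np.empty(m + 1)
    y[0] = y0
    for k in range(m):
        dB = rng.normal(0.0, np.sqrt(dt))            # Brownian increment
        y[k + 1] = y[k] + alpha * y[k] * dt + beta * y[k] * dB
    return y

path = euler_maruyama()
print(path[-1])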
3.3 Itô's Formula for Brownian Motion

Itô's formula for Brownian motion [8] is given as follows:

\[
f(B(t,\omega)) = f(0) + \int_0^t f'(B(s,\omega))\,dB(s,\omega) + \frac{1}{2}\int_0^t f''(B(s,\omega))\,ds \tag{7}
\]

where f is a function of Brownian motion, the first integral is an Itô integral with respect to the Brownian motion, and the second integral is with respect to time. The differential form is

\[
df(B(t,\omega)) = f'(B(t,\omega))\,dB(t,\omega) + \frac{1}{2} f''(B(t,\omega))\,dt \tag{8}
\]

3.4 Itô's Formula for Itô Process
Itô's formula [10] for the Itô process Y(t, ω) is given as follows:

\[
f(Y(t,\omega)) = f(Y(0,\omega)) + \int_0^t f'(Y(s,\omega))\,dY(s,\omega) + \frac{1}{2}\int_0^t f''(Y(s,\omega))\,\beta^2(s,\omega)Y^2(s,\omega)\,ds \tag{9}
\]

where f is a function of the Itô process, the first integral is an Itô integral with respect to the stochastic differential [10], and the second integral is with respect to time. The differential notation is as follows:

\[
df(Y(t,\omega)) = f'(Y(t,\omega))\,dY(t,\omega) + \frac{1}{2} f''(Y(t,\omega))\,\beta^2(t,\omega)Y^2(t,\omega)\,dt \tag{10}
\]

Note that:
1. (9) and (10) are equivalent.
2. All the terms in (9) are well defined.
Fig. 2. Image of SDE with n = 10000, p = 0.5 for various μ and σ: (a) μ = 0.00005, σ = 0.01, μ − (1/2)σ² = 0; (b) μ = 0.005, σ = 0.1, μ − (1/2)σ² = 0.

Fig. 3. Image of SDE with n = 10000, p = 0.5: (a) μ = 0.05, σ = 0.002, μ − (1/2)σ² = 0.049; (b) μ = 1, σ = 1, μ − (1/2)σ² = 0.5.
4 Image Solution of Stochastic Differential Equation of Diffusion Type
In this section we have discussed the mathematical solution and visual analysis of the stochastic differential equation of diffusion type.

4.1 Mathematical Solution of Stochastic Differential Equation of Diffusion Type
Consider the stochastic differential equation of diffusion type

\[
dX(t,\omega) = \mu X(t,\omega)\,dt + \sigma X(t,\omega)\,dB(t,\omega) \tag{11}
\]
Fig. 4. Image of SDE with n = 10000, p = 0.6 for various μ and σ: (a) μ = 0.00005, σ = 0.01, μ − (1/2)σ² = 0; (b) μ = 0.005, σ = 0.1, μ − (1/2)σ² = 0.

Fig. 5. Image of SDE with n = 10000, μ = 0.05, σ = 0.002, μ − (1/2)σ² = 0.049: (a) p = 0.1; (b) p = 0.9.
Solution: Let

\[
f(X(t,\omega)) = \ln X(t,\omega) \tag{12}
\]

Then the first derivative of f with respect to X(t, ω) is

\[
f'(X(t,\omega)) = \frac{1}{X(t,\omega)}, \tag{13}
\]

and the second derivative of f with respect to X(t, ω) is

\[
f''(X(t,\omega)) = \frac{-1}{(X(t,\omega))^2} \tag{14}
\]

(11) and the differential form of the Itô process (6) are comparable. Therefore, we can apply the differential form of Itô's formula for the Itô process (10):

\[
d(\ln X(t,\omega)) = \frac{1}{X(t,\omega)}\,dX(t,\omega) - \frac{1}{2}\sigma^2\,dt \tag{15}
\]
Fig. 6. Image of SDE with p = 0.6, μ = σ = 0.001 and μ − (1/2)σ² = 0.0009: (a) n = 10000; (b) n = 1000000.
Using the value of dX(t, ω) from (11),

\[
d(\ln X(t,\omega)) = \left(\mu - \frac{1}{2}\sigma^2\right)dt + \sigma\,dB(t,\omega) \tag{16}
\]

(16) is comparable with the differential form of Itô's formula for Brownian motion (8); therefore, in view of the integral form of Itô's formula for Brownian motion (7),

\[
\ln X(t,\omega) = \ln X(0,\omega) + \int_0^t \left(\mu - \frac{1}{2}\sigma^2\right)ds + \int_0^t \sigma\,dB(s,\omega) \tag{17}
\]

Since σ is constant, in view of the example of the Itô integral (4),

\[
\ln X(t,\omega) = \ln X(0,\omega) + \left(\mu - \frac{1}{2}\sigma^2\right)t + \sigma B(t,\omega) \tag{18}
\]

The terms in (18) are rearranged as follows:

\[
X(t,\omega) = X(0)\exp\left[\left(\mu - \frac{1}{2}\sigma^2\right)t + \sigma B(t,\omega)\right] \tag{19}
\]
We have formed several images for visualization of solution (19) with respect to time variable and state variable. Solutions contains a term of Brownian motion. Therefore, image of solution (19) needs the image of Brownian motion. The Brownian motion images discussed in section (2) have been used to form the images of solution (19). The images of solution (19) are formed for various values of μ and σ, and various images of Brownian motion. Sample images of solution (19) for (μ, σ)= (0.00005, 0.01), (0.001, 0.001), (0.005, 0.1), (0.05, 0.002), (1, 0.002), (1, 1) are shown in Figs. 2 to 6. Important observations are as follows:
552
V. K. Pandey et al.
1. Randomness in solution (19) can be controlled by μ and σ. At p = 0.5, randomness in Brownian motion is maximum. If we set μ − 12 σ 2 = 0, then randomness in solution is clearly visible Fig. 2. However, randomness in solution decreases, as |μ − 12 σ 2 | moves away from 0 and σ. This observation is clearly visible by Fig. 2 and Fig. 3. These observation are consistent with mathematical and image analysis of solution (19). 2. Randomness in Brownian motion is less, when value of p is away from 0.5. Less randomness is visualized in the solution (19) for p values away from 0.5. Some sample comparisons are provided in Fig. 2 and Fig. 4, Fig. 3a and Fig. 5. 3. Visibility of color distribution in images depends on the value of n, for fixed μ, σ and p. Sample comparison is provided in Fig. 6. In future, we want to develop a method such that “visibility of color distributions can be controlled by n”.
5
Conclusion
In this paper we have analyzed the image of Brownian motion and image solution of the stochastic differential equation of diffusion type driven by Brownian motion. Image of Brownian motion and image solution of the stochastic differential equation are consistent with mathematical analysis. Key remarks are as follows: 1. Randomness in Brownian motion is less when value of p is away from 0.5. 2. Randomness in solution of SDE of diffusion type decreases as |μ − 12 σ 2 | moves away from 0 and σ. 3. When p = 0.5, the image solution of SDE of diffusion type, color in images for n = 10, 000 is more than the color in images for n = 10, 000, 00. 4. Visibility of color distribution in images depends on the value of n, for fixed value of μ, σ and p. Future Scope: When p = 0.5, color in images for n = 10, 000 is more than the color in images for n = 10, 000, 00. We will try to improve visualization technique such that variation of color in the images become independent of choice of n. Acknowledgment. The authors acknowledges research support by Jaypee Institute of Information Technology Noida, India.
References 1. Pandey, V.K., Agarwal, H., Aggarwal, A.K.: Visualization of solution of heat equation by different interpolation methods. In: AIP Conference Proceedings, vol. 2214(1), pp. 020023-1–020023-8 (2020) 2. Li, X., Wong, T.K.L., Chen, R.T.Q., Duvenaud, D.: Scalable gradients for stochastic differential equations. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, vol. 108, pp. 3870–3882 (2020)
Image Solution of Stochastic Differential Equation
553
3. Agarwal, H.: Visual analysis of the exact solution of a heat equation. Int. J. Innovative Technol. Explor. Eng. (IJITEE) 8(9), 1968–1973 (2019) 4. Kafash, B., Lalehzari, R., Delavarkhalafi, A., Mahmoudi, E.: Application of stochastic differential system in chemical reactions via simulation. MATCH Commun. Math. Comput. Chem. 71, 265–277 (2014) 5. Chow, P.L.: Stochastic Partial Differential Equations. CRC Press (2014) 6. Evans, L.C.: An Introduction to Stochastic Differential Equations. American Mathematical Society (2013) 7. Henner, V., Belozerova, T., Forinash, K.: Mathematical Methods in Physics: Partial Differential Equations, Fourier Series, and Special Functions. AK Peters/CRC Press (2009) 8. Klebarner, F.C.: Introduction to Stochastic Calculus with Applications. Imperial College Press (2005) 9. Chen, C.: Top 10 unsolved information visualization problems. IEEE Comput. Graph. Appl. 25(4), 12–16 (2005) 10. Shrene, S.E.: Stochastic Calculus for Finance II: Continious Time Model. Springer, New York (2004) 11. Desmond, H.J.: An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev. 43(3), 525–546 (2001) 12. Tanaka, S., Shibata, A., Yamamoto, H., Kotsuru, H.: Generalized stochastic sampling method for visualization and investigation of implicit surfaces. Comput. Graph. Forum 20(3), 359–367 (2001) 13. Oksendal, B.K.: Stochastic Differential Equation: An Introduction with Applications. Springer, Heidelberg (1995). https://doi.org/10.1007/978-3-662-13050-6 14. Kloeden, P.E., Platen, E.: Numerical Solution of Stochastic Differential Equations, vol. 23. Springer, Heidelberg (1992). https://doi.org/10.1007/978-3-662-12616-5 15. Nelson, E.: Dynamical Theories of Brownian Motion. Princeton University Press (1967) 16. Baklan, V.V.: On existence of solutions of stochastic equations in a hilbert space. Dopovodi AN URSR 10, 1299–1303 (1963) 17. Ito, K.: Stochastic integral. Proc. Imp. Acad. 20(8), 519–524 (1944)
Author Index
Abhishek III-443 Aetesam, Hazique I-179 Agarwal, Himanshu II-542 Agarwal, Ruchi I-99 Agarwal, Sumeet III-318 Agarwal, Suneeta I-12 Aggarwal, Amrish Kumar II-542 Aggarwal, Shreshtha III-93 Aggarwal, Shubham III-93 Agrawal, Rashi III-162 Agrawal, Sanjay I-270 Ahila Priyadharshini, R. II-268 Alaspure, Prasen I-293 Albanese, Massimiliano I-500 Alok, Amit I-398 Anand, Sukhad III-93 Annapurna, H. I-407 Ansari, Farah Jamal III-318 Anubha Pearline, S. III-526 Aradhya, V. N. Manjunath II-354 Arivazhagan, S. II-268 Arora, Nikhil II-502 Arulprakash, A. III-305 Arun, M. II-268 Ashok, Vani I-315 Ayyagari, Sai Prem Kumar III-369 Babu, Mineni Niswanth II-221 Bag, Soumen II-124 Baid, Ujjwal I-451 Balabantaray, Bunil Ku. I-20, II-161 Balaji, Ravichandiran III-58 Bandari, Nishanth III-465 Bandyopadhyay, Oishila III-514 Banerjee, Pamela II-341 Banik, Mridul II-149 Bartakke, Prashant I-224 Basak, Hritam I-32 Bedmutha, Manas Satish III-151 Behera, Adarsh Prasad III-452 Bhagi, Shuchi I-398 Bharti, Kusum III-225 Bhat, Sharada S. II-489 Bhatnagar, Gaurav III-490
Bhattacharrya, Romit III-1 Bhattacharya, Debanjali I-44 Bhattacharyya, Ankan II-530 Bhattacharyya, Rupam III-278 Bhave, Adwait I-88 Bhavsar, Arnav I-398 Bhayani, Tanisha R. II-293 Bhunia, Himadri Sekhar III-406 Bhurchandi, K. M. II-112 Bhushan, Shashank III-104 Bhuva, Karan II-367 Bhuyan, Himadri III-174 Bhuyan, M. K. III-291, III-419 Bhuyan, Zubin III-278 Bini, A. A. I-363 Borah, Sameeran I-374 Boulanger, Jérôme I-179 Burman, Abhishek I-463 Chakraborty, Prasenjit II-174, II-221 Chakraborty, Rajatsubhra II-530 Challa, Muralidhar Reddy III-434 Chand, Dhananjai III-237 Chanda, Sukalpa II-341 Chandra, Ganduri III-434 Chatterjee, Subarna III-128 Chaudhury, N. K. I-398 Chauhan, Arun III-344 Chaurasiya, Rashmi II-63 Chavda, Parita III-332 Chinni, Bhargava I-420 Choudhary, Ankit III-81 Dalmet, Mervin L. II-281 Das, Bishshoy III-419 Das, Bubai I-78 Das, Partha Pratim III-174 Das, Sahana I-78 Das, Sikha II-455 Das, Swadhin II-433 Dash, Ratnakar I-191 Datar, Mandar II-137 Deb, Alok Kanti III-406 Deep, Aakash III-369
Deshpande, Ameya III-46 Dhar, Ankita I-78 Dhar, Debajyoti III-25 Dhar, Joydip I-247 Dhengre, Nikhil I-420 Dhiraj III-501 Dinesh Kumar, G. II-50 Dogra, Vikram I-420 Dubey, Shiv Ram II-75, III-70, III-214 Dudhane, Akshay I-293 Dutta, H. Pallab Jyoti III-291, III-419 Fasfous, Nael III-249 Frickenstein, Alexander III-249 Frickenstein, Lukas III-249 Gaffar, Abdul I-113 Gandhi, Savita R. II-367 Gandhi, Tapan Kumar I-149 Ganotra, Dinesh II-63 Garai, Arpan II-305 Gawali, Pratibha I-224 Godage, Sayli III-452 Godfrey, W. Wilfred I-236 Gokul, R. II-50 Gottimukkala, Venkata Keerthi Goyal, Puneet II-421 Gupta, Arpan II-231 Gupta, Deepika II-124 Gupta, Divij I-451 Gupta, Divyanshu II-221 Gupta, Gaurav III-262 Gupta, Savyasachi III-237 Guru, D. S. I-407, II-354
I-236
Hailu, Habtu I-161 Hambarde, Praful I-293 Hansda, Raimoni I-20 Haribhakta, Y. V. I-88 Haridas, Deepthi III-394 Hasan, Nazmul II-149 Hazarika, Shyamanta M. III-278 He, Mengling I-338 Hegde, Vinayaka I-463 Hota, Manjit I-475, II-513 Islam, Sk Maidul II-100 Jagadish, D. N. III-344 Jain, Darshita III-34
Jain, Dhruval II-502 Jain, Ishu III-141 Jain, Shaili II-330 Jain, Shreya III-225 Jamal, Aquib III-249 Javed, Mohd. III-201, III-305 Javed, Saleha II-184 Jayaraman, Umarani I-327 Jeevaraj, S. I-236 Jena, U. R. I-126 Jenkin Suji, R. I-247 Jidesh, P. I-351, I-363 Joardar, Subhankar II-100 Joshi, Amit D. III-116 Joshi, Anand B. I-113 Joshi, Shreyansh II-293 Junaid, Iman I-126
Kadethankar, Atharva I-463 Kalidas, Yeturu III-369 Kalyanasundaram, Girish II-513 Kar, Samarjit II-455 Kar, Subrat I-66 Karel, Ashish II-231 Karmakar, Arnab III-81 Karthic, S. II-50 Kaur, Taranjit I-149 Kavati, Ilaiah III-237 Khanna, Pritee I-303, III-225 Khas, Diksha II-255 Khatri, Deepak III-12 Kher, Yatharth III-116 Kiran, D. Sree III-394 Krishna, Akhouri P. I-1 Kshatriya, Shivani II-112 Kulkarni, Subhash S. I-487 Kumar, Chandan III-46 Kumar, Dhanesh I-113 Kumar, Jay Rathod Bharat III-369 Kumar, Kethavath Raj III-478 Kumar, Manish III-452 Kumar, Manoj I-99 Kumar, Naukesh II-243 Kumar, Praveen I-1 Kumar, Puneet II-394 Kumar, Rahul II-208 Kumar, Sumit II-255
Author Index Kumar, Uttam I-475 Kumrawat, Deepak I-236
Muthiah, Sakthi Balan II-231 Muthyam, Satwik III-465
Lasek, Julia I-56 Latke, Ritika I-224 Lee, Banghyon III-58 Limbu, Dilip Kumar III-58 Liwicki, Marcus II-184 Lodaya, Romil I-451
Nagabhushan, P. I-260 Nagalakshmi, Chirumamilla II-39 Nagaraja, Naveen-Shankar III-249 Nagendraswamy, H. S. II-13, II-407 Nahar, Jebun II-149 Naik, Dinesh III-465 Nampalle, Kishore Babu I-430 Nandy, Anup II-330 Naosekpam, Veronica II-243 Narang, Pratik II-445 Natarajan, Shankar II-208 Nayak, Rajashree I-20, II-161 Nikhil, G. N. III-201 Nirmal, A. II-50
Mahto, Lakshman III-344 Majhi, Banshidhar I-191 Maji, Suman Kumar I-179 Mall, Jyotsana III-12 Mandal, Sekhar II-305 Mandal, Sourav Chandra III-514 Mandal, Srimanta I-200, III-332 Mani, Parameswaranath Vaduckupurath II-208 Manjunatha, K. S. I-407 Mansharamani, Mohak Raja III-104 Manthira Moorthi, S. III-25 Marasco, Emanuela I-338, I-500 Matkovic, Franjo III-382 Menaka, K. III-356 Meraz, Md. III-201, III-305 Mestetskiy, Leonid II-407 Mhasakar, Purva I-200 Minj, Annie II-305 Mishra, Deepak III-81 Mishra, Nayaneesh Kumar II-524 Mishro, Pranaba K. I-126, I-270 Misra, Indranil III-25 Mitra, Shankhanil III-291 Mitra, Suman K. I-200, II-87, III-332 Mittar, Rishabh II-174, II-221 Moghili, Manikanta III-443 Moharana, Sukumar II-502 Moraly, Mhd Ali III-249 Mouli, P. V. S. S. R. Chandra II-467 Muduli, Debendra I-191 Mukherjee, Himadri I-78, I-440 Mukherjee, Siddhartha II-281 Mukherjee, Snehasis II-29, II-39, II-75 Mukhopadhyay, Jayanta III-406 Mukhopadhyay, Susanta I-212, II-433 Mulchandani, Himansh II-477 Munjal, Rachit S. II-502 Murala, Subrahmanyam I-293 Murali, Vaishnav II-75, III-70
Obaidullah, Sk Md I-78, I-440 Ojha, Aparajita I-303, III-225 Pai, Swetha II-1 Pal, Mahendra K. I-1 Pal, Umapada I-440, II-341 Palanisamy, T. II-50 Panda, Rutuparna I-270 Pandey, Puneet II-513 Pandey, Vikas Kumar II-542 Panigrahi, Narayan III-478 Patadiya, Parth II-367 Patel, Shivi III-394 Patil, Megha I-420 Patil, Pooja R. I-487 Patil, Prateek I-12 Patkar, Sachin II-137 Paunwala, Chirag II-477 Perumaal, S. Saravana III-356 Phadikar, Santanu I-78 Piórkowski, Adam I-56 Pradeep Kumar, S. K. III-141 Prakash, Chandra I-387 Prasad, Keerthana I-137 Prasad, Prashant Kumar II-341 Prashanth, Komuravelli III-369 Pratihar, Sanjoy III-514 Purre, Naresh II-502 Radarapu, Rakesh III-465 Raghavendra Kalose, M. II-281 Raghavendra, Anitha II-354
Raghavendre, K. III-394 Rahman, Fuad II-149 Raj, Nishant I-282 Rakesh, Sumit II-184 Raman, Balasubramanian I-430, II-394 Raman, Shanmuganathan III-34, III-46, III-151 Ramena, Gopi II-502 Rana, Ajay I-32 Rangarajan, Krishnan III-186 Rani, Komal III-12 Rao, Navalgund I-420 Rashmi, R. I-137 Rashmi, S. I-315 Rasmussen, Thorkild M. I-1 Rastogi, Vrinda I-387 Ravi Kiran, K. N. II-281 Ravoor, Prashanth C. III-186 Ribaric, Slobodan III-382 Rifat, Md Jamiur Rahman II-149 Rohil, Mukesh Kumar III-25 Roy, Kaushik I-78, I-440, II-530 Roy, Manali I-212, II-433 Rubin Bose, S. II-317 Sadhya, Debanjan I-282 Saha, Soumyajit II-530 Sahayam, Subin I-327 Sahu, Nilkanta I-374, II-243 Saini, Jitender I-44 Saini, Rajkumar II-184 Saldhi, Ankita I-66 Sam, I. Shatheesh II-196 Sangwan, K. S. III-501 Santosh, K. C. I-78 Sanyal, Ratna III-1 Sanyal, Sudip III-1 Sao, Anil I-398 Sarkar, Ram II-530 Satani, Jigna II-367 Sathiesh Kumar, V. II-317, III-526 Savitha, D. K. III-478 Sawant, Manisha II-112 Saxena, Aditya III-116 Saxena, Suchitra II-381 Sekh, Arif Ahmed II-100, II-455 Sen, Mrinmoy II-221 Sen, Shibaprasad II-530 Serov, Sergey II-407 Setty, Jhansi V. III-128
Shah, Nisarg A. I-451 Shariff, Mohamed Akram Ulla II-208 Sharma, Bishwajit III-104 Sharma, Mrigakshi II-174 Sharma, Ravish Kumar III-1 Shekar, B. H. I-161, II-489 Sheoran, Vikas II-293 Sherashiya, Shyam II-87 Shetty, P. Rathnakara II-489 Shikkenawis, Gitam II-87 Shinde, Akshada I-88 Shinde, Anant I-224 Shoba, V. Betcy Thanga II-196 Shukla, Prashant III-443 Shukla, Rakshit III-141 Siddhad, Gourav I-303 Singh, Kishore K. I-398 Singh, Krishna Pratap III-162 Singh, Mandhatya II-421 Singh, Rishav I-387 Singh, Satendra Pal III-490 Singh, Satish Kumar II-255, II-524 Singh, Upendra Pratap III-162 Singhal, Jai II-445 Singla, Rajat III-93 Sinha, Neelam I-44, I-463 Sinha, Saugata I-420 Smitha, A. I-351 Sriram, Sumanth I-338 Srivastava, Ayush III-501 Srivastava, Sahima I-387 Srivastava, Yash II-75, III-70 Stechele, Walter III-249 Sudarshan Rao, B. I-475 Sudarshan, T. S. B. II-381, III-186 Suddamalla, Upendra III-58 Sumithra, R. II-354 Susan, Seba III-93 Syed, Imran A. II-137 Talbar, Sanjay I-451 Tamizharasan, P. S. III-116 Tang, Larry I-338 Tanneru, Sri Girinadh II-29 Teja, Bhaskara I-327 Thakur, Poornima S. III-225 Thomas, Manoj V. II-1 Tripathi, Nidhi I-236 Tripathi, Shikha II-381
Wang, Liwei III-262 Wilfred Godfrey, W. I-247 Wong, Anthony III-58
Vatsal, Shubham II-502 Veerawal, Sumit III-104 Vemparala, Manoj-Rohit III-249 Venkatraman, Raghu III-394 Venkatraman, Sarma III-394 Verma, Shekhar III-443, III-452 Vinay Kumar, V. I-260 Viswanadhuni, Dinesh II-281
Yadav, Dilip Kumar II-467 Yadav, Gaurav II-467 Yadav, Rakesh Kumar III-443 Yedla, Roshan Reddy III-214 Yogameena, B. III-356 Zaboli, Syroos II-13, II-407 Zheng, Shen III-262